Wednesday, January 18, 2012

When can we expect the last damn microarray paper?

With bonus R code

It came as a shock to learn from PubMed that almost 900 papers were published with the word "microarray" in their titles last year alone, just 12 shy of the 2010 count. More alarming, many of these papers were not of the innocuous "Microarray study of gene expression in dog scrotal tissue" variety, but dry rehashings along the lines of "Statistical approaches to normalizing microarrays to the reference brightness of Ursa Minor".

It's an ugly truth we must face: people aren't just using microarrays, they're still writing about them.

See for yourself:

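library(XML)      #xmlTreeParse, xmlValue
library(reshape)  #melt(..., variable_name=)
library(ggplot2)

#concat is not base R; a minimal definition is assumed here (see the full gist)
concat<-function(...){paste(...,sep="")}

#getCount(term) returns a function of year that gives the PubMed hit count
getCount<-function(term){function(year){
  nihUrl<-concat("http://eutils.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi?db=pubmed&term=",term,"+",year,"[pdat]")
  #cleanurl<-gsub('\\]','%5D',gsub('\\[','%5B',x=url))
  #http://eutils.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi?db=pubmed&term=microarray%5btitle%5d+2003%5bpdat%5d
  xml<-xmlTreeParse(URLencode(nihUrl),isURL=TRUE)
  #Data Mashups in R, pg17
  as.numeric(xmlValue(xml$doc$children$eSearchResult$children$Count$children$text))
}}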

years<-1995:2011
df<-data.frame(type="obs",year=years,
    mic=sapply(years,getCount('microarray[title]')),
    ngs=sapply(years,getCount('"next generation sequencing"[title] OR "high-throughput sequencing"[title]'))
)
#papers with "microarray" in title
> df[,c("year","mic")]
   year  mic
1  1995    2
2  1996    4
3  1997    0
4  1998    7
5  1999   28
6  2000  108
7  2001  273
8  2002  553
9  2003  770
10 2004 1032
11 2005 1135
12 2006 1216
13 2007 1107
14 2008 1055
15 2009  981
16 2010  909
17 2011  897
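The "12 shy" figure and the decline since the peak can be checked from the counts above without querying PubMed again. A minimal sketch, with the counts hard-coded from the table; the names `mic_counts` and `yoy` are mine for illustration, not from the gist:

```r
# Counts copied from the table above (papers with "microarray" in the title)
years <- 1995:2011
mic_counts <- c(2, 4, 0, 7, 28, 108, 273, 553, 770, 1032,
                1135, 1216, 1107, 1055, 981, 909, 897)

# Year-over-year change; each element is named by the later year of the pair
yoy <- setNames(diff(mic_counts), years[-1])

# Year with the highest count
peak_year <- years[which.max(mic_counts)]

print(peak_year)       # 2006
print(yoy[["2011"]])   # -12
```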
Reading another treatise on microarray normalization in 2012 would be just tragic. Who still reads these? Who still writes these papers? Can we stop them? If not, when can we expect NGS to wipe them off the map?
#97 is a fair start
df<-subset(df,year>=1997)
mdf<-melt(df,id.vars=c("type","year"),variable_name="citation")

c<-ggplot(mdf,aes(x=year))
p<-c+geom_point(aes(y=value,color=citation)) +
  ylab("papers") +
  stat_smooth(aes(y=value,color=citation),data=subset(mdf,citation=="mic"),method="loess") +
  scale_x_continuous(breaks=seq(from=1997,to=2011,by=2))
print(p)
Here I plot both microarray and next-generation sequencing papers (in title). We see kurtosis is working in our favor, and LOESS seems to agree!
But when will the pain end? Let us extrapolate, wildly.
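One detail matters for the extrapolation: by default `loess` fits with `surface="interpolate"`, and `predict` then returns `NA` for years outside the observed range. Fitting with `surface="direct"`, as the code below does, is what allows predictions past 2011. A small offline sketch, again with the counts hard-coded from the table (the data frame name `obs` is an assumption for illustration):

```r
# Observed counts for 1997-2011, copied from the table in this post
obs <- data.frame(
  year = 1997:2011,
  mic  = c(0, 7, 28, 108, 273, 553, 770, 1032,
           1135, 1216, 1107, 1055, 981, 909, 897)
)
future <- data.frame(year = 2012)  # one year beyond the data

# Default surface ("interpolate") versus the "direct" surface
fit_default <- loess(mic ~ year, obs)
fit_direct  <- loess(mic ~ year, obs,
                     control = loess.control(surface = "direct"))

pred_default <- predict(fit_default, future)
pred_direct  <- predict(fit_direct, future)

print(is.na(pred_default))  # the default fit does not extrapolate
print(is.na(pred_direct))   # the direct fit returns a number
```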
#Return 0 for negative elements
# noNeg(c(3,2,1,0,-1,-2,2))
# [1] 3 2 1 0 0 0 2
noNeg<-function(v){pmax(v,0)}

#Return up to the first negative/zero element inclusive
# toZeroNoNeg(c(3,2,1,0,-1,-2,2))
# [1] 3 2 1 0
toZeroNoNeg<-function(v){noNeg(v)[1:firstZero(noNeg(v))]}

#return index of first zero
firstZero<-function(v){which(noNeg(v)==0)[1]}

#let's peer into the future
df.lo.mic<-loess(mic ~ year,df,control=loess.control(surface="direct"))

#when will it stop?
mic_predict<-as.integer(predict(df.lo.mic,data.frame(year=2012:2020),se=FALSE))
zero_year<-2011+firstZero(mic_predict)
cat(concat("LOESS projects ",sum(toZeroNoNeg(mic_predict))," more microarray papers.\n"))
cat(concat("The last damn microarray paper is projected to be published in ",(zero_year-1),".\n"))

#predict ngs growth
df.lo.ngs<-loess(ngs ~ year,df,control=loess.control(surface="direct"))
ngs_predict<-as.integer(predict(df.lo.ngs,data.frame(year=2012:zero_year),se=FALSE))

pred_df<-data.frame(type="pred",year=c(2012:zero_year),mic=toZeroNoNeg(mic_predict),ngs=ngs_predict)
df2<-rbind(df,pred_df)

mdf2<-melt(df2,id.vars=c("type","year"),variable_name="citation")

c2<-ggplot(mdf2,aes(x=year))
p2<-c2+geom_point(aes(y=value,color=citation,shape=type),size=3) +
    ylab("papers") +
    scale_y_continuous(breaks=seq(from=0,to=1600,by=200))+
    scale_x_continuous(breaks=seq(from=1997,to=zero_year,by=2))
print(p2)

LOESS projects 2038 more microarray papers.
The last damn microarray paper is projected to be published in 2016.

Yeah, right...
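A caveat on the truncation helpers: `firstZero` returns `NA` when the projection never reaches zero inside the prediction window, and `toZeroNoNeg` then fails on `1:NA`. A hedged sketch of a guarded variant; `truncateAtZero` is a hypothetical name, not part of the gist, and it falls back to returning the whole clamped vector when no zero is found:

```r
# Clamp negatives to zero, then keep everything up to and including
# the first zero. If no element is <= 0, return the clamped vector unchanged.
truncateAtZero <- function(v) {
  clamped <- pmax(v, 0)
  idx <- which(clamped == 0)
  if (length(idx) == 0) return(clamped)
  clamped[seq_len(idx[1])]
}

print(truncateAtZero(c(3, 2, 1, 0, -1, -2, 2)))  # 3 2 1 0
print(truncateAtZero(c(3, 2, 1)))                # 3 2 1
```

In the post's pipeline, `zero_year` would need the same check before it is used in `2012:zero_year`.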

Full R code here: https://gist.github.com/1637248

13 comments:

  1. Great post and code, very useful - thank you!

  2. Ha! We do!

    Yes, we really still write about microarrays now and then. And I think we have good reasons to. Technology is just a means to an end, right? People already did lots of microarray studies and for most of these the raw data are stored in online repositories. With increased standardisation in study descriptions (e.g. the isa-tab initiative), the possibilities to use these old data in comparative studies increase. In that case you want to be sure these old data are good, so you need the right normalisation and especially quality control approaches.

    Also, some microarray technologies like ChIP-on-chip and DNA methylation arrays are still developing, and approaches for these are sometimes different.

    And finally... For some things, like getting the right SNP calls, microarrays just perform very well.

    So 2016? No I don't think so, the last microarray paper will probably be a lot later.

  3. Great post! Thanks for sharing your thoughts and your code.

    I agree with Chris; indeed, the content accumulated on microarrays, especially gene expression, is huge and probably holds answers to open problems in biomedicine. One of the most exciting applications is drug repositioning.

    Since you can identify subjects based on their NGS data, it is not clear whether there will soon, if ever, be such a large corpus of genome-wide NGS information from such varied sources, due to privacy issues.

    The isa-tab initiative is great as it will ultimately make it easy to use this vast resource. In the meantime we are developing InSilico DB, following a pragmatic approach where users can edit the meta-data online (>118,000 profiles re-annotated to date). The numerical data is re-normalized and ready to be analyzed in open-source GUI (GenePattern) and command-line tools (R/Bioconductor). We also include RNA-Seq expression estimates from SRA, but I expect the microarray data to stay relevant for a while. If you want to check it out: http://insilico.ulb.ac.be.

  4. In science, just like in technology, innovation happens at a fast pace, but all new technologies then have a long tail before they eventually die out.

    The same is true of microarray technology. I would not be surprised if there is a resurgence in microarray studies, with microarrays used as a complement to NGS analysis.

    I believe there is still innovation in this area, and 2017 is way too optimistic (or pessimistic) a view... I bet papers featuring microarrays will still be coming out in 15 years' time.

  5. I think the difference is that these are only papers with 'microarray' in the titles, and thus probably have some flavor of statistical methodology or normalization/preprocessing. I would guess that most biology papers that use microarrays as a tool do not put the word microarray in the title, and since microarrays are a cheap, high-throughput tool with an established analysis pipeline, they, as a tool, are going to be around for a long time.

  6. The master's programme I am doing at one of the universities in Sweden is almost entirely based on microarrays. Everything we have learnt in courses such as programming in R, Genomics, Bioinformatics etc. is full of microarray data analysis, and NGS is totally lacking! Why haven't universities like this one revised their curricula to integrate NGS, given its increasing importance in the identification of disease-causal variants and clinical applications (personalised medicine)?

    I am going to use NGS data in my thesis, but it will be a personal effort since even the would-be supervisors have no expertise in analysing NGS data! Well, I have written this in the course evaluations and I hope they will catch up in the near future.

  7. Thanks! Very useful code. Although... I'm an ecologist and never worked with microarrays/don't know much about them. Why the beef with them vs NGS?

    Replies
    1. Oh, this was intended to be a tongue-in-cheek post. However, I think a statistician who writes a paper in 2012 about microarray normalization is grave digging. I have no problem with people using microarrays in GWAS or clinical apps, but putting "microarray" in the title is as superfluous as "pipette" or "Perl".

    2. Hey, that was just about my next paper: "MicroPiPerl: A new Perl script to assist in pipette calibration for microarray experiments"... :( Thanks, Jeremy, now my subject is burnt.

    3. Eric you might want to move on to "A high-throughput sequencing study of differential expression in dog scrotal tissue"

    4. Sounds interesting, but needs a more punchy title. How about "Lessons from the Dog Dick Transcriptome"?

  8. Thanks Jeremy, very interesting! We have recently released InSilico DB (http://insilico.ulb.ac.be) to manage and store genomics data. There are 100,000s of public microarray samples pre-installed and available for download or export in a ready-to-use format (R and GenePattern).
