tag:blogger.com,1999:blog-85320659607564825902024-03-20T04:24:02.887-04:00JermdemoMostly bioinformatics, NGS, and cat litter box reviewsJermdemohttp://www.blogger.com/profile/01662705354227625640noreply@blogger.comBlogger41125tag:blogger.com,1999:blog-8532065960756482590.post-50180565558101534772012-12-27T16:08:00.000-05:002014-06-05T14:26:00.008-04:00The Reproducible Research Guilt Trip May Finally Be Paying Off<h4>
We might be closer to killing off the "Just take my word for it - I'm pretty sure I did this right" methods section</h4>
<div>
<br /></div>
There is no shortage of well-reasoned articles filled with persuasive arguments about the need for higher reproducibility standards in the scientific literature. For all the good posts about the virtues of reproducible research, they all boil down to one overarching concept:<br />
<br />
<br />
<br />
<div class="separator" style="clear: both; text-align: center;">
<a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEjQv_qhALQlWpW11aPwjAgsJ07B5tc-cvb8aaezv_HZrakJk3IG2PchViuPDbjpfMpH51XFlxWlaTbvpG7DhoDOkeROapfr83hMakW_hl9erlkZvdiALhsDIv0r0hTGxBSIdWzf7hhmq7k/s1600/writeshitdown.jpg" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img alt="write shit down" border="0" height="320" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEjQv_qhALQlWpW11aPwjAgsJ07B5tc-cvb8aaezv_HZrakJk3IG2PchViuPDbjpfMpH51XFlxWlaTbvpG7DhoDOkeROapfr83hMakW_hl9erlkZvdiALhsDIv0r0hTGxBSIdWzf7hhmq7k/s320/writeshitdown.jpg" title="" width="242" /></a></div>
<br />
<br />
Why is this even an issue? Biologists in particular seem to be collectively and subconsciously reacting to those awful General Chemistry labs where they had you copy down pages of instructions verbatim into your lab notebook. It should come as no surprise that bioinformatics is ground zero for reproducibility activism.<br />
<br />
It is unfortunate that reproducible research is tied up with all sorts of other holier-than-thou practices: open access, open source, open data, literate programming, blogging, functional programming. This all-encompassing evangelism tends to polarize people. While wonky über-programmers like C. Titus Brown lay out <a href="http://ivory.idyll.org/blog/replication-i.html">fundamental practices for reproducibility</a>, most PIs have been publicly giving lip service to the idea of reproducible research, betraying an "I don't wanna eat my vegetables"-type disdain. There are now "<a href="http://biotest.cgrb.oregonstate.edu/">consortia</a>" and an "<a href="https://www.scienceexchange.com/reproducibility">initiative</a>" to compel scientists to actually write their shit down, preferably with door prizes. If you think this has a "<a href="http://www.youtube.com/watch?v=dDjBsOkidek">posture pals</a>" (video) feel to it, you're not alone. And while the number of pro-RR articles has steadily increased, few researchers have taken them to heart.<br />
<br />
This head-against-wall bashing has been the pattern for many years: better tools are now available (RStudio, knitr, Galaxy, cloud computing, figshare, GitHub, Bitbucket) and there is more rah-rah from the blogosphere, but little enforcement from major journals. Now a recent development has raised my hopes, because it indicates editors have been tightening the screws enough to cause discomfort:<br />
<br />
<i>People have actually started to argue against reproducible research!</i><br />
<i><br /></i>
<br />
<div class="separator" style="clear: both; text-align: center;">
<a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEgNUKgQ6_XiJM0aJFrg71wOCqQJFFjgi_4TuZvwh_Nk5DhI3_imwbwQvLDS2S6X-pRgRlGGCDJAyDny3kkBx5Y9c5kceVRpMFkIT_7fJC_yokJmzqet7R7y_mn8a_loZJhRp1V3ZTbrkcs/s1600/thinker2.jpg" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" height="255" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEgNUKgQ6_XiJM0aJFrg71wOCqQJFFjgi_4TuZvwh_Nk5DhI3_imwbwQvLDS2S6X-pRgRlGGCDJAyDny3kkBx5Y9c5kceVRpMFkIT_7fJC_yokJmzqet7R7y_mn8a_loZJhRp1V3ZTbrkcs/s400/thinker2.jpg" width="400" /></a></div>
<h4>
Hearts and Minds</h4>
<br />
The founder of the irreplicability movement is Christopher Drummond, author of “<a href="http://cogprints.org/8675/">Reproducible Research: a Dissenting Opinion</a>”. I will attempt to paraphrase his arguments here:<br />
<ol>
<li>Richard Feynman never had a Github account.</li>
<li>No one is really going to read your damn code anyway.</li>
<li>Writing shit down == A big drag, man.</li>
<li>The Anil Potti incident proves liars always lie about their Rhodes Scholarships first. We should crack down on curricula vitae, not veritas curat.</li>
</ol>
Drummond's points are challenged <a href="http://simplystatistics.org/2012/11/15/reproducible-research-with-us-or-against-us-3/">here</a> by statistician and <a href="https://www.coursera.org/course/compdata">Coursera</a> favorite Roger Peng.<br />
<br />
A precursor to the dissenting opinion article is Drummond's "<a href="http://cogprints.org/7691/7/icmlws09.pdf">Replicability is not Reproducibility: </a><a href="http://cogprints.org/7691/7/icmlws09.pdf">Nor is it Good Science</a>". A distinction is drawn between replicability and reproducibility: replicability, re-running an author's submitted data and code, is what the movement actually demands; reproducibility, independently arriving at the same conclusions, is what is generalizable and scientifically meaningful. In Drummond's view, requiring researchers to submit their data and code, i.e. replicable research, is a narrow concept really only useful for ferreting out scientific misconduct.<br />
<br />
<div style="text-align: left;">
<a href="http://www.flickr.com/photos/usfwsmtnprairie/8282763616/" style="clear: left; float: left; margin-bottom: 1em; margin-right: 1em;" title="Black-footed Ferret by USFWS Mountain Prairie, on Flickr"><img alt="Black-footed Ferret" height="132" src="http://farm9.staticflickr.com/8063/8282763616_3e4f7cb057.jpg" title="credit: Kimberly Tamkun / USFWS" width="200" /></a></div>
I would argue that ignorance of biological sequence analysis, and even more so statistics, is a bigger threat than the outright fraud seen in the Duke case. Most bioinformatics manuscripts feature analyses that are not replicable, which is frightening to consider when GWAS and exome NGS variant papers implicate so many genes in disease, many of them resting on a razor-thin p-value threshold tweaked by several incomprehensible cherry-picked program parameters.<br />
<br />
It is not clear science can efficiently <a href="http://blogs.nature.com/news/2012/12/is-the-scientific-literature-self-correcting.html">self-correct</a>. So while replicability is not reproducibility, reproducibility is too slow to substitute for replicability. A manuscript that describes real reproducible biological phenomena is essentially conjecture until it can be repeated. The greatest <a href="http://en.wikipedia.org/wiki/Ferret_legging">ferret-legger</a> the world has ever known will live in obscurity until they buy a ferret. We have a culture of scientists who refuse to buy a ferret.<br />
<br />
<h4>
Accounting for Tastes</h4>
<div>
<br /></div>
The other dissenting opinion (<a href="http://gasstationwithoutpumps.wordpress.com/2012/08/27/accountable-research-software/">here</a>) is from UCSC's Kevin Karplus, who replies to Iddo Friedberg's <a href="http://bytesizebio.net/index.php/2012/08/24/can-we-make-research-software-accountable/">post</a> recommending a panel of white coat mechanics to help biologists get their code ready for publication. Karplus raises two points:<br />
<ul>
<li>It is difficult to make polished software for others to use and that is not the point of research.</li>
<li>Replicability is not reproducibility.</li>
</ul>
Regardless of Friedberg's proposal, railing against "polished software" is simply a straw man argument. Reproducible research in <strike>2012 </strike>2013 does not mean robust, extensible, or even well-documented code. Most sequence analysis papers feature very little compiled code, relying instead on a series of executable programs glued together with scripting languages, producing intermediate data that is then digested into a report, often written in R.<br />
<br />
Getting these sequence analysis workflows to be reproducible will not require a highly skilled platoon of developers. Any willing researcher can submit a shell script or a build script of commands provided they avoid these common pitfalls:<br />
<ul>
<li dir="ltr">Using bioinformatics web applications with no web service capability</li>
<li dir="ltr">Using desktop bioinformatics software with no logging capability</li>
<li dir="ltr">Relying on proprietary institutional databases, perhaps with stored procedures that prove too unwiedly to dump</li>
<li dir="ltr">Using command line programs without a <a href="https://gist.github.com/1651133">directory-based bash history</a></li>
<li dir="ltr">Using Excel to manually manipulate data</li>
</ul>
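Any such script can also capture its own provenance. As a minimal sketch (the run_logged wrapper, the log file name, and sample1.fastq are hypothetical, not from any published pipeline), an R driver can log each external command it runs and record the session details used to build the report:

```r
# Hypothetical reproducible driver script: run each pipeline step through
# one wrapper that records the exact command line before executing it.
run_logged <- function(cmd, args, log = "commands.log") {
  line <- paste(c(cmd, args), collapse = " ")
  cat(line, "\n", file = log, append = TRUE, sep = "")
  system2(cmd, args, stdout = TRUE)
}

# Example step: echo stands in for an aligner or variant caller.
out <- run_logged("echo", c("aligning", "sample1.fastq"))

# Record the exact package versions used to build the report.
writeLines(capture.output(sessionInfo()), "sessionInfo.txt")
```

Anyone repeating the analysis then gets both the command history and the software versions for free.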
As our toolset and research community matures, these <strike>excuses</strike> obstacles will eventually disappear. But there is one scenario which will always be true in some of the more competitive arenas of bioinformatics programming (e.g. structure prediction, de novo assembly):<br />
<ul>
<li>The researcher was perfectly capable of submitting code but decided to retain a competitive advantage.</li>
</ul>
"Over-CASPed" researchers who are unwilling to divulge their secret sauce should be relegated to appropriate sandboxes.<br />
<br />
Replication does not prove a biological truth, but without it we often lack even fleeting proof that a scientist did what they said they did.<br />
<br />
Which brings us back to those damn chemistry labs. While many public access talk shows find <a href="http://www.philly.com/philly/blogs/evolution/Why-are-So-Many-Chemists-Creationists-.html">chemists willing to argue against evolution</a>, you would be hard pressed to find one who would argue against writing shit down.<br />
<br />
In other words: <i>Not writing shit down is an even worse idea than creationism.</i><br />
<br />
--<br />
<br />
<i>There, I blogged in 2012.</i>Jermdemohttp://www.blogger.com/profile/01662705354227625640noreply@blogger.com6tag:blogger.com,1999:blog-8532065960756482590.post-72067078856638169842012-02-20T21:44:00.000-05:002017-05-22T12:13:39.050-04:00AGBT: digesting disposable MinIONs in diasporaDespite my current <a href="http://www.biostars.org/user/list/">ranking of 15th in Biostar</a>, myriad page views of my <a href="http://jermdemo.blogspot.com/2011/06/big-ass-servers-and-myths-of-clusters.html">BAS</a>™ post (albeit mostly misdirected perverts), and positive response for my celebrated campaign against more microarray papers, for some reason I was not "comped" an all-expenses-paid trip as honorary blog journalist to this year's <a href="http://agbt.org/">Advances in Genome Biology and Technology</a>, which is kind of like CES for sequencing people, except AGBT is still worth attending. Normally the oversight would not bother me, as bioinformatics itself is not the focus of this meeting, but the flood of <a href="https://twitter.com/#%21/search/realtime/%23AGBT">#AGBT</a> tweets would not let me forget this fact and I was forced to stew and blog in envy.
<br />
<br />
<h4>
The first game changing disruptive revolutionary thing from England since 1964</h4>
Even from my distant perch it was obvious that all the scientific presentations at AGBT were overshadowed by a 17-minute showstopping demo from Clive Brown of Oxford Nanopore, a company that by all appearances would either die, focus on some minor stuff, or bring it. They chose the third option, and in so doing boosted the "<a href="http://www.google.com/trends/?q=clive">Clive index</a>" to unprecedented levels. OxN's recent decision to enlist famed geneticist and serial startup advisor George Church struck me as a huge gamble, as the string of Route 128 flameouts touting his name led me to assume long ago that Church had stowed away some cursed Tiki idol in his luggage like Bobby in that episode of the Brady Bunch. However, after reading up on OxN, I had to admit I was just bitter about Dr. Church's refusal to invest in my chain of <a href="http://www.polonator.org/">Polonator</a>-based paternity testing clinics, Yo Po'lonatizz!™<br />
<br />
Two new sequencer platforms were announced:<br />
<table cellpadding="0" cellspacing="0" class="tr-caption-container" style="float: right; margin-left: 1em; text-align: right;"><tbody>
<tr><td style="text-align: center;"><a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEiOwwzg6YpVyRUH4OEGQO-XN2QkVmxPsvuf3TiFF0Zjvr82PaBswkNGvopdxpiWpXEdKICbafrQ6n09k1C84zJ1d2YnKu_nMx5E83dI_F0p_JYGllIRTmnDumLjb4GcD2vIilijl4XP4QE/s1600/MinION_in_laptop.jpg" imageanchor="1" style="clear: left; margin-bottom: 1em; margin-left: auto; margin-right: auto;"><img border="0" height="213" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEiOwwzg6YpVyRUH4OEGQO-XN2QkVmxPsvuf3TiFF0Zjvr82PaBswkNGvopdxpiWpXEdKICbafrQ6n09k1C84zJ1d2YnKu_nMx5E83dI_F0p_JYGllIRTmnDumLjb4GcD2vIilijl4XP4QE/s320/MinION_in_laptop.jpg" width="320" /></a></td></tr>
<tr><td class="tr-caption" style="text-align: center;">A MinION. Forget to hit eject before removing this<br />
and you will instantly lose $900.</td></tr>
</tbody></table>
<ul>
<li>The MinION, a $900 "disposable" USB drive which detects minute changes in voltage incurred by the passage of DNA through a
robust and delicious lipid bilayer. Finally a device capable of sequencing filthy rabbit blood right on the spot!</li>
<li>The GridION system, a scalable rackmounted sequencer, which despite some lack of pricing clarity, should produce an actual $1000 15-minute human genome by 2013.</li>
</ul>
These exotic machines must be truly game-changing because they made properly expanding Albert Vilella's <a href="https://docs.google.com/spreadsheet/ccc?key=0AvaxS3m5rl-9dHdtUGRtaGlsZWNFNWJleDRXaUhQTHc">NGS sequencer spreadsheet</a> quite difficult. The MinION, in particular, could be viewed as a free device with $900 of consumables. This effectively lowers the bar to getting high-throughput sequence in the doctor's office to a 100% unamortized billable transaction. These things also claim <i>fucking unlimited read lengths.</i><br />
<br />
Expression microarrays, SAGE, 454, ABI SOLiD, and now Pacific Biosciences have all left bad tastes of uncertainty and dissatisfaction in the mouths of scientists. It is easy to disappoint people on a grand scale with a $700,000 machine, but $900 worth of chemicals in a USB drive is a different animal, and it seems likely this invention will find a following if it even delivers on a fraction of what it promises.<br />
<br />
<table cellpadding="0" cellspacing="0" class="tr-caption-container" style="float: left; text-align: left;"><tbody>
<tr><td style="text-align: center;"><a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEi33MMhm5V8Yh1u26z04rIZPlzOfF7Nwdyy6KcR_yqc09e0Zl78ahGn4LwupbYxKHNbp10Lkgi4Nj2voVwwTAmWmMFlZrAlLyTl3MXLy7l0NUK8uYF9oZSa5PG_FIDlHJHo37V_VD0C2vY/s1600/GridION_scaling.jpg" imageanchor="1" style="margin-left: auto; margin-right: auto;"><img border="0" height="170" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEi33MMhm5V8Yh1u26z04rIZPlzOfF7Nwdyy6KcR_yqc09e0Zl78ahGn4LwupbYxKHNbp10Lkgi4Nj2voVwwTAmWmMFlZrAlLyTl3MXLy7l0NUK8uYF9oZSa5PG_FIDlHJHo37V_VD0C2vY/s320/GridION_scaling.jpg" width="320" /></a></td></tr>
<tr><td class="tr-caption" style="text-align: center;">The GridION - put it in a rack or right on the floor.</td></tr>
</tbody></table>
Good information on this sequencer-on-a-stick is to be found at <a href="http://pathogenomics.bham.ac.uk/blog/2012/02/oxford-nanopore-megaton-announcement-why-do-you-need-a-machine-exclusive-interview-for-this-blog/">Nick Loman's blog</a>, <a href="http://www.genomesunzipped.org/2012/02/making-sequencing-simpler-with-nanopores.php">Genomes Unzipped</a>, <a href="http://www.nanoporetech.com/news/press-releases/view/39">and official press releases</a>. An excellent discussion of the nanopores themselves can be found at <a href="http://www.omespeak.com/blog/?p=507">Omically Speaking</a>.
<br />
<br />
<h4>
More cringeworthy marketing from the West coast </h4>
The Oxford Nanopore machines are so jaw-dropping, in fact, that <a href="http://www.forbes.com/sites/matthewherper/2012/02/18/who-doubts-the-usb-thumb-drive-sequencer-a-rival/">Jonathan Rothberg is already crying vaporware</a>. His complaints do seem warranted, given disappointments from past years' announcements and the lack of publicly available sequence from these devices.
<br />
<br />
Unfortunately Ion Torrent has spent all of its goodwill on an inane and hamfisted advertising war against Illumina's MiSeq, an intentionally crippled opponent. Seemingly orchestrated by castoffs from the Celebrity Apprentice, this assault began with <a href="http://www.youtube.com/watch?feature=player_embedded&v=GUr17pHezUo">cringe-inducing derivations</a> of Apple commercials, and has expanded to include a sort of "feature combover." Through some <a href="http://www.youtube.com/watch?v=jv-JHafk4UA&feature=youtu.be">convoluted logic</a> involving consensus, a professional whiteboard artist attempts to convince the public that the homopolymer error rate is actually lower on the Ion Torrent PGM than on the MiSeq. This is the sequencing equivalent of having your mom try to convince you two apples is better than one <a href="http://en.wikipedia.org/wiki/Drake%27s_Devil_Dogs">devil dog</a>, or some such utter nonsense.
<br />
<br />
My response was predictably measured and cerebral.<br />
<br />
<div class="separator" style="clear: both; text-align: center;">
</div>
This is not the <a href="https://picasaweb.google.com/lh/photo/oyux1nepVh9nxAqgbMnA1Px8lHi6ePCYZLKzcyR0WMQ?feat=directlink">first time</a> I have tweet-confronted Ion Torrent over its odious approach. All this is rather unnecessary because overall, and despite the homopolymer issues, the utility of the PGM has been more or less within expectations. The MiSeq is also exactly within expectations, since it is basically a transparent, measly 1/50th slice of a HiSeq. The same cannot really be said for the RS, whose error rate is clearly far above what was expected at the outset. So if anyone requires an aggressive smokescreen-type marketing campaign (or a new machine) it is Pacific Biosciences.Jermdemohttp://www.blogger.com/profile/01662705354227625640noreply@blogger.com3tag:blogger.com,1999:blog-8532065960756482590.post-81850761916201006242012-01-18T21:15:00.000-05:002013-09-27T09:44:26.912-04:00When can we expect the last damn microarray paper?<h3>With bonus R code</h3>
It came as a shock to learn from PubMed that almost <a href="http://www.ncbi.nlm.nih.gov/pubmed?term=microarray%5Btitle%5D%20and%202011%5Bpdat%5D">900 papers</a> were published with the word "microarray" in their titles last year alone, just 12 shy of the 2010 count. More alarming, many of these papers were not of the innocuous "Microarray study of gene expression in dog scrotal tissue" variety, but dry rehashings along the lines of "Statistical approaches to normalizing microarrays to the reference brightness of Ursa Minor".
<p>
It's an ugly truth we must face: people aren't just using microarrays, <i>they're still writing about them</i>.
</p>
<p>
See for yourself:
</p>
<pre class="brush: shell">getCount<-function(term){function(year){
nihUrl<-concat("http://eutils.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi?db=pubmed&term=",term,"+",year,"[pdat]")
#cleanurl<-gsub('\\]','%5D',gsub('\\[','%5B',x=url))
#http://eutils.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi?db=pubmed&term=microarray%5btitle%5d+2003%5bpdat%5d
xml<-xmlTreeParse(URLencode(nihUrl),isURL=TRUE)
#Data Mashups in R, pg17
as.numeric(xmlValue(xml$doc$children$eSearchResult$children$Count$children$text))
}}
years<-1995:2011
df<-data.frame(type="obs",year=years,
mic=sapply(years,function(x){do.call(getCount('microarray[title]'),list(x))}),
ngs=sapply(years,function(x){do.call(getCount('"next generation sequencing"[title] OR "high-throughput sequencing"[title]'),list(x))})
)
#papers with "microarray" in title
> df[,c("year","mic")]
year mic
1 1995 2
2 1996 4
3 1997 0
4 1998 7
5 1999 28
6 2000 108
7 2001 273
8 2002 553
9 2003 770
10 2004 1032
11 2005 1135
12 2006 1216
13 2007 1107
14 2008 1055
15 2009 981
16 2010 909
17 2011 897
</pre>
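A side note on the shape of getCount above: it is curried, returning a function of its year argument, which is why the sapply calls wrap it in do.call. A stub with no network dependency (getCountStub is hypothetical, standing in for the PubMed lookup) shows the two call forms are equivalent:

```r
# getCount(term) returns a function of year; calling it directly is
# equivalent to the do.call(..., list(year)) form used with sapply.
getCountStub <- function(term) {
  function(year) nchar(term) + year  # stand-in for the PubMed lookup
}
f <- getCountStub("microarray[title]")
direct <- f(2011)                                          # direct call
via_do <- do.call(getCountStub("microarray[title]"), list(2011))  # same result
```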
Reading another treatise on microarray normalization in 2012 would be just tragic. Who still reads these? Who still writes these papers? Can we stop them? If not, when can we expect NGS to wipe them off the map?
<br />
<pre class="brush: shell">
#97 is a fair start
df<-subset(df,year>=1997)
mdf<-melt(df,id.vars=c("type","year"),variable_name="citation")
c<-ggplot(mdf,aes(x=year))
p<-c+geom_point(aes(y=value,color=citation)) +
ylab("papers") +
stat_smooth(aes(y=value,color=citation),data=subset(mdf,citation=="mic"),method="loess") +
scale_x_continuous(breaks=seq(from=1997,to=2011,by=2))
print(p)
</pre>
Here I plot both microarray and next-generation sequencing papers (in title). We see kurtosis is working in our favor, and LOESS seems to agree!
<div class="separator" style="clear: both; text-align: center;">
<a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEg92YKe9G8CaS6-w7yX0s_kQMkqxGJPXMFfRC9F_7yNtJ-1u-GYfWljbvhM4sWYyaHkk9fiFprGgQP8xJ_0UZRp3Zfnim4CcFhOgMD8ORcynBszIJOdaTIgU_zsnoyVTOqe-p2DtsCZlIc/s1600/obs.png" imageanchor="1" style=""><img border="0" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEg92YKe9G8CaS6-w7yX0s_kQMkqxGJPXMFfRC9F_7yNtJ-1u-GYfWljbvhM4sWYyaHkk9fiFprGgQP8xJ_0UZRp3Zfnim4CcFhOgMD8ORcynBszIJOdaTIgU_zsnoyVTOqe-p2DtsCZlIc/s800/obs.png" /></a></div>
But when will the pain end? Let us extrapolate, wildly.
<pre class="brush: shell">
#Return 0 for negative elements
# noNeg(c(3,2,1,0,-1,-2,2))
# [1] 3 2 1 0 0 0 2
noNeg<-function(v){sapply(v,function(x){max(x,0)})}
#Return up to the first negative/zero element inclusive
# toZeroNoNeg(c(3,2,1,0,-1,-2,2))
# [1] 3 2 1 0
toZeroNoNeg<-function(v){noNeg(v)[1:firstZero(noNeg(v))]}
#return index of first zero
firstZero<-function(v){which(noNeg(v)==0)[1]}
#let's peer into the future
df.lo.mic<-loess(mic ~ year,df,control=loess.control(surface="direct"))
#when will it stop?
mic_predict<-as.integer(predict(df.lo.mic,data.frame(year=2012:2020),se=FALSE))
zero_year<-2011+firstZero(mic_predict)
cat(concat("LOESS projects ",sum(toZeroNoNeg(mic_predict))," more microarray papers."))
cat(concat("The last damn microarray paper is projected to be in ",(zero_year-1),"."))
#predict ngs growth
df.lo.ngs<-loess(ngs ~ year,df,control=loess.control(surface="direct"))
ngs_predict<-as.integer(predict(df.lo.ngs,data.frame(year=2012:zero_year),se=FALSE))
pred_df<-data.frame(type="pred",year=c(2012:zero_year),mic=toZeroNoNeg(mic_predict),ngs=ngs_predict)
df2<-rbind(df,pred_df)
mdf2<-melt(df2,id.vars=c("type","year"),variable_name="citation")
c2<-ggplot(mdf2,aes(x=year))
p2<-c2+geom_point(aes(y=value,color=citation,shape=type),size=3) +
ylab("papers") +
scale_y_continuous(breaks=seq(from=0,to=1600,by=200))+
scale_x_continuous(breaks=seq(from=1997,to=zero_year,by=2))
print(p2)
</pre>
<div class="separator" style="clear: both; text-align: center;">
<a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEj7-7kD6X4ZSpc0QDS2HR50W9iRh1Ey_WIThuLA-9Rk5RBM9AXHnpMsuequdzbSU7iY3eKkUBXkAOX0Br7b3_gZy8gko1Fpj484oUjp1tR1VZhEnDVue7TK5LBfxSpur2nMGsMUnyUv2cA/s1600/pred.png" imageanchor="1" style=""><img border="0" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEj7-7kD6X4ZSpc0QDS2HR50W9iRh1Ey_WIThuLA-9Rk5RBM9AXHnpMsuequdzbSU7iY3eKkUBXkAOX0Br7b3_gZy8gko1Fpj484oUjp1tR1VZhEnDVue7TK5LBfxSpur2nMGsMUnyUv2cA/s800/pred.png" /></a></div>
<p>LOESS projects 2038 more microarray papers.<br/>
The last damn microarray paper is projected to be published in 2016.
</p>
<p>
Yeah, right...
</p>
Full R code here: <a href="https://gist.github.com/1637248">https://gist.github.com/1637248</a>
<script src="https://gist.github.com/leipzig/1637248.js"></script>Jermdemohttp://www.blogger.com/profile/01662705354227625640noreply@blogger.com13tag:blogger.com,1999:blog-8532065960756482590.post-16824112917942503292011-10-31T13:17:00.003-04:002011-10-31T16:39:07.175-04:00Making R's paste act more like CONCAT<p>While vector-friendly, R's <a href="http://stat.ethz.ch/R-manual/R-patched/library/base/html/paste.html">paste</a> function has a few behaviors I don't particularly like.</p>
<p>One is using a space as the default separator:<br/>
<pre class="brush: shell">
> adjectives<-c("lean","fast","strong")
> paste(adjectives,"er")
> paste(adjectives,"er")
[1] "lean er" "fast er" "strong er" #d'oh
> paste(adjectives,"er",sep="")
[1] "leaner" "faster" "stronger"
</pre></p>
<p>Empty vectors get an undeserved first class treatment:
<pre class="brush: shell">
> indelPositions<-c(5,6,7)
> paste(indelPositions,"i",sep="")
[1] "5i" "6i" "7i" #good
> indelPositions<-c()
> paste(indelPositions,"i",sep="")
[1] "i" #not so good
</pre></p>
<p>And perhaps worst of all, NA values get replaced with a string called "NA":
<pre class="brush: shell">
> placing<-"1"
> paste(placing,"st",sep="")
[1] "1st" #awesome
> placing<-NA_integer_
> paste(placing,"st",sep="")
[1] "NAst" #ugh
</pre></p>
<p>This is inconvenient in situations where I don't know a priori if I will get a value, a vector of length 0, or an NA.
</p>
<p>Working from Hadley Wickham's str_c function in the <a href="https://github.com/hadley/stringr">stringr</a> package, I decided to write a paste function that behaves more like CONCAT in SQL:
<br/>
<pre class="brush: shell">
library(stringr)
library(plyr) # for llply
concat<-CONCAT<-function(...,sep="",collapse=NULL){
strings<-list(...)
#catch NULLs, NAs
if(
all(unlist(llply(strings,length))>0)
&&
all(!is.na(unlist(strings)))
){
do.call("paste", c(strings, list(sep = sep, collapse = collapse)))
}else{
NULL
}
}
</pre></p>
<p>This function has the behaviors I expect:
<pre class="brush: shell">
> concat(adjectives,"er")
[1] "leaner" "faster" "stronger"
> concat(indelPositions,"i")
NULL
> concat(placing,"st")
NULL
</pre>
</p>
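One more behavior worth noting: because the wrapper forwards sep and collapse straight through to paste, collapsing a vector into a single string still works. A self-contained sketch (concat2 is my minimal restatement without the stringr/plyr dependencies, not the author's exact function):

```r
# Minimal stand-in for the post's concat: NULL on empty or NA input,
# otherwise defer to paste with sep="" by default.
concat2 <- function(..., sep = "", collapse = NULL) {
  strings <- list(...)
  if (all(sapply(strings, length) > 0) && all(!is.na(unlist(strings)))) {
    do.call("paste", c(strings, list(sep = sep, collapse = collapse)))
  } else {
    NULL
  }
}
adjectives <- c("lean", "fast", "strong")
concat2(adjectives, "er", collapse = ", ")  # "leaner, faster, stronger"
concat2(c(), "er")                          # NULL
concat2(NA, "st")                           # NULL
```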
<p>
That's more like it!
</p>Jermdemohttp://www.blogger.com/profile/01662705354227625640noreply@blogger.com1tag:blogger.com,1999:blog-8532065960756482590.post-68599660952449284442011-10-06T13:10:00.000-04:002011-10-06T13:32:41.024-04:00SELinux for enhanced headaches<br />
<a href="http://en.wikipedia.org/wiki/Security-Enhanced_Linux">Security Enhanced Linux (SELinux)</a> is a new extra hidden layer of permissions that makes configuring things more difficult, without ever identifying itself as the culprit - kind of like <a href="http://en.wikipedia.org/wiki/Access_control_list">ACLs</a> but more cryptic. Though it may be more secure, it is not an enhancing experience to deal with, and probably not worth it for the average user.<br />
<br />
For example to have Apache serve personal websites (i.e. http://server/~leipzig) it is no longer enough to alter httpd.conf, because you will get mysterious 403 errors until you do this (as <a href="http://serverfault.com/questions/307167/apache-403-forbidden/307209#307209">others</a> have experienced):
<br />
<pre>chcon -R -t httpd_sys_content_t /home/leipzig</pre>
<br />
You forget about this change until xauth starts complaining about stuff for no apparent reason:
<br />
<pre>/usr/bin/xauth: timeout in locking authority file /home/leipzig/.Xauthority</pre>
<br />
so of course you need to do this (thanks <a href="http://www.linkedin.com/in/madhavdiwan">Madhav Diwan</a> for <a href="http://forums.fedoraforum.org/showpost.php?p=1336336&postcount=9">this</a> post):<br />
<pre>chcon unconfined_u:object_r:user_home_dir_t:s0 /home/leipzig</pre>
<br />
I have no idea what these things actually mean, nor any real interest in learning. I'm sure this stuff is great for sysadmin cocktail chat but at least for private servers it is just another brake on the wheel of getting things done. For the time being I have set the level to "permissive", which means it displays warnings but does not interfere, but am leaning toward "disabled" or maybe something else:<br />
<br />
<pre>
# This file controls the state of SELinux on the system.
# SELINUX= can take one of these three values:
# enforcing - SELinux security policy is enforced.
# permissive - SELinux prints warnings instead of enforcing.
# disabled - No SELinux policy is loaded.
SELINUX=excoriated
# SELINUXTYPE= can take one of these two values:
# targeted - Targeted processes are protected,
# mls - Multi Level Security protection.
SELINUXTYPE=targeted
</pre>
<br />
More on the pros and cons:<br />
<a href="http://unix.stackexchange.com/questions/9163/does-selinux-provide-enough-extra-security-to-be-worth-the-hassle-of-learning-set">http://unix.stackexchange.com/questions/9163/does-selinux-provide-enough-extra-security-to-be-worth-the-hassle-of-learning-set</a>
Jermdemohttp://www.blogger.com/profile/01662705354227625640noreply@blogger.com0tag:blogger.com,1999:blog-8532065960756482590.post-25107451519900763482011-08-18T10:03:00.008-04:002011-08-18T11:16:37.925-04:00Installing RStudio Server on Scientific Linux 6: My bash notebook<a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEicgtePN-Zo0_rB8lwKIywTAhl17ilz_JNT4d9K1AM5sfdBpFBWskDlk92qZHeLLy-V_tuLR1lv46MRYzUCG_rtpW-XNqx35qfBGYJDysOP-RyZAyINpIy4q9kLnZU4WLYHnVGlWaTBw70/s1600/Screen+shot+2011-08-18+at+9.58.10+AM.png" imageanchor="1" style="clear: right; float: right; margin-bottom: 1em; margin-left: 1em;"><img border="0" height="220" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEicgtePN-Zo0_rB8lwKIywTAhl17ilz_JNT4d9K1AM5sfdBpFBWskDlk92qZHeLLy-V_tuLR1lv46MRYzUCG_rtpW-XNqx35qfBGYJDysOP-RyZAyINpIy4q9kLnZU4WLYHnVGlWaTBw70/s320/Screen+shot+2011-08-18+at+9.58.10+AM.png" width="320" /></a>
Granted, not a brilliant sysadmin mind at work here, but this might help someone someday.<br />
Scientific Linux (SL) is rebuilt from Red Hat Enterprise Linux sources.<br />
<br />
See installation instructions here:<br />
<a href="http://rstudio.org/download/server">http://rstudio.org/download/server</a><br />
<div class="separator" style="clear: both; text-align: left;">
</div>
<pre class="brush: bash">
[leipzig@localhost ~]$ sudo rpm -Uvh
http://download.fedoraproject.org/pub/epel/6/x86_64/epel-release-6-5.noarch.rpm
[sudo] password for leipzig:
Retrieving
http://download.fedoraproject.org/pub/epel/6/x86_64/epel-release-6-5.noarch.rpm
warning: /var/tmp/rpm-tmp.S2RQAH: Header V3 RSA/SHA256 Signature, key ID
0608b895: NOKEY
Preparing... ########################################### [100%]
1:epel-release ########################################### [100%]
[leipzig@localhost ~]$ rpm -qa | grep epel
epel-release-6-5.noarch
[leipzig@localhost ~]$ which R
/usr/local/bin/R
[leipzig@localhost ~]$ R
R version 2.13.0 (2011-04-13)
Copyright (C) 2011 The R Foundation for Statistical Computing
ISBN 3-900051-07-0
Platform: x86_64-unknown-linux-gnu (64-bit)
R is free software and comes with ABSOLUTELY NO WARRANTY.
You are welcome to redistribute it under certain conditions.
Type 'license()' or 'licence()' for distribution details.
Natural language support but running in an English locale
R is a collaborative project with many contributors.
Type 'contributors()' for more information and
'citation()' on how to cite R or R packages in publications.
Type 'demo()' for some demos, 'help()' for on-line help, or
'help.start()' for an HTML browser interface to help.
Type 'q()' to quit R.
&gt; q()
Save workspace image? [y/n/c]: n
[leipzig@localhost ~]$ wget
https://s3.amazonaws.com/rstudio-server/rstudio-server-0.94.92-x86_64.rpm
--2011-08-17 13:06:36--
https://s3.amazonaws.com/rstudio-server/rstudio-server-0.94.92-x86_64.rpm
Resolving s3.amazonaws.com... 72.21.211.170
Connecting to s3.amazonaws.com|72.21.211.170|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 11373769 (11M) [application/x-redhat-package-manager]
Saving to: “rstudio-server-0.94.92-x86_64.rpm”
100%[===========================================================================
=============================================&gt;] 11,373,769 7.89M/s in 1.4s
2011-08-17 13:06:37 (7.89 MB/s) - “rstudio-server-0.94.92-x86_64.rpm” saved
[11373769/11373769]
[leipzig@localhost ~]$ sudo rpm -Uvh rstudio-server-0.94.92-x86_64.rpm
error: Failed dependencies:
libR.so()(64bit) is needed by rstudio-server-0.94.92-1.x86_64
libRblas.so()(64bit) is needed by rstudio-server-0.94.92-1.x86_64
libRlapack.so()(64bit) is needed by rstudio-server-0.94.92-1.x86_64
libcrypto.so.6()(64bit) is needed by rstudio-server-0.94.92-1.x86_64
libgfortran.so.1()(64bit) is needed by rstudio-server-0.94.92-1.x86_64
libssl.so.6()(64bit) is needed by rstudio-server-0.94.92-1.x86_64
[leipzig@localhost ~]$ sudo yum install R
epel/metalink
| 14 kB 00:00
epel
| 4.3 kB 00:00
epel/primary_db
| 4.0 MB 00:00
sl
| 3.2 kB 00:00
sl-security
| 1.9 kB 00:00
Setting up Install Process
Resolving Dependencies
--&gt; Running transaction check
---&gt; Package R.x86_64 0:2.13.1-1.el6 set to be updated
--&gt; Processing Dependency: libRmath-devel = 2.13.1-1.el6 for package:
R-2.13.1-1.el6.x86_64
--&gt; Processing Dependency: R-devel = 2.13.1-1.el6 for package:
R-2.13.1-1.el6.x86_64
--&gt; Running transaction check
---&gt; Package R-devel.x86_64 0:2.13.1-1.el6 set to be updated
--&gt; Processing Dependency: R-core = 2.13.1-1.el6 for package:
R-devel-2.13.1-1.el6.x86_64
--&gt; Processing Dependency: bzip2-devel for package: R-devel-2.13.1-1.el6.x86_64
--&gt; Processing Dependency: gcc-gfortran for package: R-devel-2.13.1-1.el6.x86_64
--&gt; Processing Dependency: tk-devel for package: R-devel-2.13.1-1.el6.x86_64
--&gt; Processing Dependency: pcre-devel for package: R-devel-2.13.1-1.el6.x86_64
--&gt; Processing Dependency: tcl-devel for package: R-devel-2.13.1-1.el6.x86_64
---&gt; Package libRmath-devel.x86_64 0:2.13.1-1.el6 set to be updated
--&gt; Processing Dependency: libRmath = 2.13.1-1.el6 for package:
libRmath-devel-2.13.1-1.el6.x86_64
--&gt; Running transaction check
---&gt; Package R-core.x86_64 0:2.13.1-1.el6 set to be updated
--&gt; Processing Dependency: cups for package: R-core-2.13.1-1.el6.x86_64
--&gt; Processing Dependency: libtk8.5.so()(64bit) for package:
R-core-2.13.1-1.el6.x86_64
---&gt; Package bzip2-devel.x86_64 0:1.0.5-7.el6_0 set to be updated
---&gt; Package gcc-gfortran.x86_64 0:4.4.4-13.el6 set to be updated
---&gt; Package libRmath.x86_64 0:2.13.1-1.el6 set to be updated
---&gt; Package pcre-devel.x86_64 0:7.8-3.1.el6 set to be updated
---&gt; Package tcl-devel.x86_64 1:8.5.7-6.el6 set to be updated
---&gt; Package tk-devel.x86_64 1:8.5.7-5.el6 set to be updated
--&gt; Running transaction check
---&gt; Package cups.x86_64 1:1.4.2-35.el6_0.1 set to be updated
--&gt; Processing Dependency: portreserve for package:
1:cups-1.4.2-35.el6_0.1.x86_64
--&gt; Processing Dependency: poppler-utils for package:
1:cups-1.4.2-35.el6_0.1.x86_64
---&gt; Package tk.x86_64 1:8.5.7-5.el6 set to be updated
--&gt; Running transaction check
---&gt; Package poppler-utils.x86_64 0:0.12.4-3.el6_0.1 set to be updated
---&gt; Package portreserve.x86_64 0:0.0.4-4.el6 set to be updated
--&gt; Finished Dependency Resolution
Dependencies Resolved
================================================================================
================================================================================
==
Package Arch Version
Repository Size
================================================================================
================================================================================
==
Installing:
R x86_64
2.13.1-1.el6 epel
17 k
Installing for dependencies:
R-core x86_64
2.13.1-1.el6 epel
33 M
R-devel x86_64
2.13.1-1.el6 epel
88 k
bzip2-devel x86_64
1.0.5-7.el6_0 sl-security
249 k
cups x86_64
1:1.4.2-35.el6_0.1 sl-security
2.3 M
gcc-gfortran x86_64
4.4.4-13.el6 sl
4.7 M
libRmath x86_64
2.13.1-1.el6 epel
111 k
libRmath-devel x86_64
2.13.1-1.el6 epel
21 k
pcre-devel x86_64
7.8-3.1.el6 sl
317 k
poppler-utils x86_64
0.12.4-3.el6_0.1 sl-security
72 k
portreserve x86_64
0.0.4-4.el6 sl
21 k
tcl-devel x86_64
1:8.5.7-6.el6 sl
161 k
tk x86_64
1:8.5.7-5.el6 sl
1.4 M
tk-devel x86_64
1:8.5.7-5.el6 sl
495 k
Transaction Summary
================================================================================
================================================================================
==
Install 14 Package(s)
Upgrade 0 Package(s)
Total download size: 43 M
Installed size: 89 M
Is this ok [y/N]: y
Downloading Packages:
(1/14): R-2.13.1-1.el6.x86_64.rpm
| 17 kB 00:00
(2/14): R-core-2.13.1-1.el6.x86_64.rpm
| 33 MB 00:05
(3/14): R-devel-2.13.1-1.el6.x86_64.rpm
| 88 kB 00:00
(4/14): bzip2-devel-1.0.5-7.el6_0.x86_64.rpm
| 249 kB 00:00
(5/14): cups-1.4.2-35.el6_0.1.x86_64.rpm
| 2.3 MB 00:01
(6/14): gcc-gfortran-4.4.4-13.el6.x86_64.rpm
| 4.7 MB 00:02
(7/14): libRmath-2.13.1-1.el6.x86_64.rpm
| 111 kB 00:00
(8/14): libRmath-devel-2.13.1-1.el6.x86_64.rpm
| 21 kB 00:00
(9/14): pcre-devel-7.8-3.1.el6.x86_64.rpm
| 317 kB 00:00
(10/14): poppler-utils-0.12.4-3.el6_0.1.x86_64.rpm
| 72 kB 00:00
(11/14): portreserve-0.0.4-4.el6.x86_64.rpm
| 21 kB 00:00
(12/14): tcl-devel-8.5.7-6.el6.x86_64.rpm
| 161 kB 00:00
(13/14): tk-8.5.7-5.el6.x86_64.rpm
| 1.4 MB 00:00
(14/14): tk-devel-8.5.7-5.el6.x86_64.rpm
| 495 kB 00:00
--------------------------------------------------------------------------------
--------------------------------------------------------------------------------
--
Total
3.1 MB/s | 43 MB 00:13
warning: rpmts_HdrFromFdno: Header V3 RSA/SHA256 Signature, key ID 0608b895:
NOKEY
epel/gpgkey
| 3.2 kB 00:00 ...
Importing GPG key 0x0608B895 "EPEL (6) epel@fedoraproject.org" from
/etc/pki/rpm-gpg/RPM-GPG-KEY-EPEL-6
Is this ok [y/N]: y
Running rpm_check_debug
Running Transaction Test
Transaction Test Succeeded
Running Transaction
Warning: RPMDB altered outside of yum.
Installing : 1:tk-8.5.7-5.el6.x86_64
1/14
Installing : portreserve-0.0.4-4.el6.x86_64
2/14
Installing : poppler-utils-0.12.4-3.el6_0.1.x86_64
3/14
Installing : 1:cups-1.4.2-35.el6_0.1.x86_64
4/14
Installing : R-core-2.13.1-1.el6.x86_64
5/14
Installing : gcc-gfortran-4.4.4-13.el6.x86_64
6/14
Installing : libRmath-2.13.1-1.el6.x86_64
7/14
Installing : 1:tcl-devel-8.5.7-6.el6.x86_64
8/14
Installing : 1:tk-devel-8.5.7-5.el6.x86_64
9/14
Installing : libRmath-devel-2.13.1-1.el6.x86_64
10/14
Installing : bzip2-devel-1.0.5-7.el6_0.x86_64
11/14
Installing : pcre-devel-7.8-3.1.el6.x86_64
12/14
Installing : R-devel-2.13.1-1.el6.x86_64
13/14
Installing : R-2.13.1-1.el6.x86_64
14/14
Installed:
R.x86_64 0:2.13.1-1.el6
Dependency Installed:
R-core.x86_64 0:2.13.1-1.el6 R-devel.x86_64 0:2.13.1-1.el6
bzip2-devel.x86_64 0:1.0.5-7.el6_0 cups.x86_64 1:1.4.2-35.el6_0.1
gcc-gfortran.x86_64 0:4.4.4-13.el6 libRmath.x86_64 0:2.13.1-1.el6
libRmath-devel.x86_64 0:2.13.1-1.el6 pcre-devel.x86_64 0:7.8-3.1.el6
poppler-utils.x86_64 0:0.12.4-3.el6_0.1 portreserve.x86_64 0:0.0.4-4.el6
tcl-devel.x86_64 1:8.5.7-6.el6 tk.x86_64 1:8.5.7-5.el6
tk-devel.x86_64 1:8.5.7-5.el6
Complete!
[leipzig@localhost ~]$ sudo rpm -Uvh rstudio-server-0.94.92-x86_64.rpm
error: Failed dependencies:
libcrypto.so.6()(64bit) is needed by rstudio-server-0.94.92-1.x86_64
libgfortran.so.1()(64bit) is needed by rstudio-server-0.94.92-1.x86_64
libssl.so.6()(64bit) is needed by rstudio-server-0.94.92-1.x86_64
[leipzig@localhost ~]$ sudo yum install libcrypto.so.6
Setting up Install Process
Resolving Dependencies
--&gt; Running transaction check
---&gt; Package openssl098e.i686 0:0.9.8e-17.el6 set to be updated
--&gt; Processing Dependency: libc.so.6(GLIBC_2.3.4) for package:
openssl098e-0.9.8e-17.el6.i686
--&gt; Processing Dependency: libkrb5.so.3(krb5_3_MIT) for package:
openssl098e-0.9.8e-17.el6.i686
--&gt; Processing Dependency: libc.so.6(GLIBC_2.1) for package:
openssl098e-0.9.8e-17.el6.i686
--&gt; Processing Dependency: libcom_err.so.2 for package:
openssl098e-0.9.8e-17.el6.i686
--&gt; Processing Dependency: libc.so.6(GLIBC_2.0) for package:
openssl098e-0.9.8e-17.el6.i686
--&gt; Processing Dependency: libk5crypto.so.3 for package:
openssl098e-0.9.8e-17.el6.i686
--&gt; Processing Dependency: libk5crypto.so.3(k5crypto_3_MIT) for package:
openssl098e-0.9.8e-17.el6.i686
--&gt; Processing Dependency: libdl.so.2(GLIBC_2.0) for package:
openssl098e-0.9.8e-17.el6.i686
--&gt; Processing Dependency: libc.so.6(GLIBC_2.7) for package:
openssl098e-0.9.8e-17.el6.i686
--&gt; Processing Dependency: libkrb5.so.3 for package:
openssl098e-0.9.8e-17.el6.i686
--&gt; Processing Dependency: libc.so.6(GLIBC_2.4) for package:
openssl098e-0.9.8e-17.el6.i686
--&gt; Processing Dependency: libgssapi_krb5.so.2 for package:
openssl098e-0.9.8e-17.el6.i686
--&gt; Processing Dependency: libdl.so.2(GLIBC_2.1) for package:
openssl098e-0.9.8e-17.el6.i686
--&gt; Processing Dependency: libc.so.6(GLIBC_2.1.3) for package:
openssl098e-0.9.8e-17.el6.i686
--&gt; Processing Dependency: libresolv.so.2 for package:
openssl098e-0.9.8e-17.el6.i686
--&gt; Processing Dependency: libz.so.1 for package: openssl098e-0.9.8e-17.el6.i686
--&gt; Processing Dependency: libc.so.6 for package: openssl098e-0.9.8e-17.el6.i686
--&gt; Processing Dependency: libdl.so.2 for package:
openssl098e-0.9.8e-17.el6.i686
--&gt; Processing Dependency: libc.so.6(GLIBC_2.3) for package:
openssl098e-0.9.8e-17.el6.i686
--&gt; Running transaction check
---&gt; Package glibc.i686 0:2.12-1.7.el6_0.5 set to be updated
--&gt; Processing Dependency: libfreebl3.so for package:
glibc-2.12-1.7.el6_0.5.i686
--&gt; Processing Dependency: libfreebl3.so(NSSRAWHASH_3.12.3) for package:
glibc-2.12-1.7.el6_0.5.i686
---&gt; Package krb5-libs.i686 0:1.9-9.el6_1.1 set to be updated
--&gt; Processing Dependency: libkeyutils.so.1(KEYUTILS_0.3) for package:
krb5-libs-1.9-9.el6_1.1.i686
--&gt; Processing Dependency: libkeyutils.so.1 for package:
krb5-libs-1.9-9.el6_1.1.i686
--&gt; Processing Dependency: libselinux.so.1 for package:
krb5-libs-1.9-9.el6_1.1.i686
---&gt; Package libcom_err.i686 0:1.41.12-3.el6 set to be updated
---&gt; Package zlib.i686 0:1.2.3-25.el6 set to be updated
--&gt; Running transaction check
---&gt; Package keyutils-libs.i686 0:1.4-1.el6 set to be updated
---&gt; Package libselinux.i686 0:2.0.94-2.el6 set to be updated
---&gt; Package nss-softokn-freebl.i686 0:3.12.8-1.el6_0 set to be updated
--&gt; Finished Dependency Resolution
Dependencies Resolved
================================================================================
================================================================================
==
Package Arch
Version Repository
Size
================================================================================
================================================================================
==
Installing:
openssl098e i686
0.9.8e-17.el6 sl
772 k
Installing for dependencies:
glibc i686
2.12-1.7.el6_0.5 sl-security
4.3 M
keyutils-libs i686
1.4-1.el6 sl
19 k
krb5-libs i686
1.9-9.el6_1.1 sl-security
711 k
libcom_err i686
1.41.12-3.el6 sl
33 k
libselinux i686
2.0.94-2.el6 sl
106 k
nss-softokn-freebl i686
3.12.8-1.el6_0 sl-security
108 k
zlib i686
1.2.3-25.el6 sl
71 k
Transaction Summary
================================================================================
================================================================================
==
Install 8 Package(s)
Upgrade 0 Package(s)
Total download size: 6.0 M
Installed size: 18 M
Is this ok [y/N]: y
Downloading Packages:
(1/8): glibc-2.12-1.7.el6_0.5.i686.rpm
| 4.3 MB 00:02
(2/8): keyutils-libs-1.4-1.el6.i686.rpm
| 19 kB 00:00
(3/8): krb5-libs-1.9-9.el6_1.1.i686.rpm
| 711 kB 00:00
(4/8): libcom_err-1.41.12-3.el6.i686.rpm
| 33 kB 00:00
(5/8): libselinux-2.0.94-2.el6.i686.rpm
| 106 kB 00:00
(6/8): nss-softokn-freebl-3.12.8-1.el6_0.i686.rpm
| 108 kB 00:00
(7/8): openssl098e-0.9.8e-17.el6.i686.rpm
| 772 kB 00:00
(8/8): zlib-1.2.3-25.el6.i686.rpm
| 71 kB 00:00
--------------------------------------------------------------------------------
--------------------------------------------------------------------------------
--
Total
1.2 MB/s | 6.0 MB 00:04
Running rpm_check_debug
Running Transaction Test
Transaction Test Succeeded
Running Transaction
Installing : nss-softokn-freebl-3.12.8-1.el6_0.i686
1/8
Installing : glibc-2.12-1.7.el6_0.5.i686
2/8
Installing : libcom_err-1.41.12-3.el6.i686
3/8
Installing : zlib-1.2.3-25.el6.i686
4/8
Installing : libselinux-2.0.94-2.el6.i686
5/8
Installing : keyutils-libs-1.4-1.el6.i686
6/8
Installing : krb5-libs-1.9-9.el6_1.1.i686
7/8
Installing : openssl098e-0.9.8e-17.el6.i686
8/8
Installed:
openssl098e.i686 0:0.9.8e-17.el6
Dependency Installed:
glibc.i686 0:2.12-1.7.el6_0.5 keyutils-libs.i686 0:1.4-1.el6
krb5-libs.i686 0:1.9-9.el6_1.1 libcom_err.i686 0:1.41.12-3.el6
libselinux.i686 0:2.0.94-2.el6 nss-softokn-freebl.i686 0:3.12.8-1.el6_0
zlib.i686 0:1.2.3-25.el6
Complete!
[leipzig@localhost ~]$ sudo rpm -Uvh rstudio-server-0.94.92-x86_64.rpm
error: Failed dependencies:
libcrypto.so.6()(64bit) is needed by rstudio-server-0.94.92-1.x86_64
libgfortran.so.1()(64bit) is needed by rstudio-server-0.94.92-1.x86_64
libssl.so.6()(64bit) is needed by rstudio-server-0.94.92-1.x86_64
[leipzig@localhost ~]$ sudo yum install libcrypto.so.6
Setting up Install Process
Package openssl098e-0.9.8e-17.el6.i686 already installed and latest version
Nothing to do
[leipzig@localhost ~]$ sudo yum install libgfortran.so.1
Setting up Install Process
Resolving Dependencies
--&gt; Running transaction check
---&gt; Package compat-libgfortran-41.i686 0:4.1.2-39.el6 set to be updated
--&gt; Finished Dependency Resolution
Dependencies Resolved
================================================================================
================================================================================
==
Package Arch
Version Repository Size
================================================================================
================================================================================
==
Installing:
compat-libgfortran-41 i686
4.1.2-39.el6 sl 99 k
Transaction Summary
================================================================================
================================================================================
==
Install 1 Package(s)
Upgrade 0 Package(s)
Total download size: 99 k
Installed size: 488 k
Is this ok [y/N]: y
Downloading Packages:
compat-libgfortran-41-4.1.2-39.el6.i686.rpm
| 99 kB 00:00
Running rpm_check_debug
Running Transaction Test
Transaction Test Succeeded
Running Transaction
Installing : compat-libgfortran-41-4.1.2-39.el6.i686
1/1
Installed:
compat-libgfortran-41.i686 0:4.1.2-39.el6
Complete!
[leipzig@localhost ~]$ sudo rpm -Uvh rstudio-server-0.94.92-x86_64.rpm
error: Failed dependencies:
libcrypto.so.6()(64bit) is needed by rstudio-server-0.94.92-1.x86_64
libgfortran.so.1()(64bit) is needed by rstudio-server-0.94.92-1.x86_64
libssl.so.6()(64bit) is needed by rstudio-server-0.94.92-1.x86_64
[leipzig@localhost ~]$ sudo yum install libssl.so.6
Setting up Install Process
Package openssl098e-0.9.8e-17.el6.i686 already installed and latest version
Nothing to do
[leipzig@localhost ~]$ sudo rpm -Uvh --nodeps rstudio-server-0.94.92-x86_64.rpm
Preparing... ########################################### [100%]
1:rstudio-server ########################################### [100%]
rsession: no process killed
Starting rstudio-server: /usr/lib/rstudio-server/bin/rserver: error while
loading shared libraries: libssl.so.6: cannot open shared object file: No such
file or directory
[FAILED]
#trying some stuff recommended here:
#http://support.rstudio.org/help/discussions/problems/839-installing-rstudio-from-source-after-installing-r-from-source
[leipzig@localhost ~]$ sudo yum install openssl098e-0.9.8e
Setting up Install Process
Resolving Dependencies
--&gt; Running transaction check
---&gt; Package openssl098e.x86_64 0:0.9.8e-17.el6 set to be updated
--&gt; Finished Dependency Resolution
Dependencies Resolved
================================================================================
================================================================================
==
Package Arch
Version Repository
Size
================================================================================
================================================================================
==
Installing:
openssl098e x86_64
0.9.8e-17.el6 sl
762 k
Transaction Summary
================================================================================
================================================================================
==
Install 1 Package(s)
Upgrade 0 Package(s)
Total download size: 762 k
Installed size: 2.2 M
Is this ok [y/N]: y
Downloading Packages:
openssl098e-0.9.8e-17.el6.x86_64.rpm
| 762 kB 00:00
Running rpm_check_debug
Running Transaction Test
Transaction Test Succeeded
Running Transaction
Warning: RPMDB altered outside of yum.
rstudio-server-0.94.92-1.x86_64 has missing requires of libcrypto.so.6()(64bit)
rstudio-server-0.94.92-1.x86_64 has missing requires of
libgfortran.so.1()(64bit)
rstudio-server-0.94.92-1.x86_64 has missing requires of libssl.so.6()(64bit)
Installing : openssl098e-0.9.8e-17.el6.x86_64
1/1
Installed:
openssl098e.x86_64 0:0.9.8e-17.el6
Complete!
[leipzig@localhost ~]$ sudo yum install gcc41-libgfortran-4.1.2
Setting up Install Process
No package gcc41-libgfortran-4.1.2 available.
Error: Nothing to do
[leipzig@localhost ~]$ sudo yum install pango-1.28.1
Setting up Install Process
Package pango-1.28.1-3.el6_0.5.x86_64 already installed and latest version
Nothing to do
[leipzig@localhost ~]$ sudo rpm -Uvh --nodeps rstudio-server-0.94.92-x86_64.rpm
Preparing... ########################################### [100%]
package rstudio-server-0.94.92-1.x86_64 is already installed
[leipzig@localhost ~]$ sudo rstudio-server start
[leipzig@localhost ~]$ sudo rstudio-server verify-installation
Stopping rstudio-server: [ OK ]
/usr/lib/rstudio-server/bin/rsession: error while loading shared libraries:
libgfortran.so.1: wrong ELF class: ELFCLASS32
Starting rstudio-server: [ OK ]
[leipzig@localhost ~]$ sudo yum install libgfortran.so.1
Setting up Install Process
Package compat-libgfortran-41-4.1.2-39.el6.i686 already installed and latest
version
Nothing to do
[leipzig@localhost ~]$ sudo rpm -Uvh ftp.scientificlinux.org/linux/scientific/6.0/x86_64/os/Packages/compat-libgfortran-41-4.1.2-39.el6.x86_64.rpm
error: open of
ftp.scientificlinux.org/linux/scientific/6.0/x86_64/os/Packages/compat-
libgfortran-41-4.1.2-39.el6.x86_64.rpm failed: No such file or directory
[leipzig@localhost ~]$ wget ftp.scientificlinux.org/linux/scientific/6.0/x86_64/os/Packages/compat-libgfortran-41-4.1.2-39.el6.x86_64.rpm
--2011-08-18 04:39:39--
http://ftp.scientificlinux.org/linux/scientific/6.0/x86_64/os/Packages/compat-
libgfortran-41-4.1.2-39.el6.x86_64.rpm
Resolving ftp.scientificlinux.org... 131.225.110.147
Connecting to ftp.scientificlinux.org|131.225.110.147|:80... connected.
HTTP request sent, awaiting response... 200 OK
Length: 128080 (125K) [application/x-rpm]
Saving to: “compat-libgfortran-41-4.1.2-39.el6.x86_64.rpm”
100%[===========================================================================
=============================================&gt;] 128,080 488K/s in 0.3s
2011-08-18 04:39:39 (488 KB/s) - “compat-libgfortran-41-4.1.2-39.el6.x86_64.rpm”
saved [128080/128080]
[leipzig@localhost ~]$ sudo rpm -Uvh compat-libgfortran-41-4.1.2-39.el6.x86_64.rpm
Preparing... ########################################### [100%]
1:compat-libgfortran-41 ########################################### [100%]
[leipzig@localhost ~]$ sudo rstudio-server verify-installation
Stopping rstudio-server: [ OK ]
Starting rstudio-server: [ OK ]
</pre>
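With hindsight, the working sequence condenses to the commands below. This is a reconstruction from the session above, not a re-tested script, so treat it as a sketch; package versions were current as of August 2011.<br />
<pre class="brush: bash">
# Add EPEL, then install R from it (pulls in R-core, R-devel, and friends)
sudo rpm -Uvh http://download.fedoraproject.org/pub/epel/6/x86_64/epel-release-6-5.noarch.rpm
sudo yum install R
# libssl.so.6/libcrypto.so.6 live in the x86_64 openssl098e compat package;
# asking yum for the bare library name resolves to the useless i686 build
sudo yum install openssl098e-0.9.8e
# same story for libgfortran.so.1 - fetch the x86_64 compat rpm directly
wget ftp.scientificlinux.org/linux/scientific/6.0/x86_64/os/Packages/compat-libgfortran-41-4.1.2-39.el6.x86_64.rpm
sudo rpm -Uvh compat-libgfortran-41-4.1.2-39.el6.x86_64.rpm
# now RStudio Server itself (--nodeps because its spec still names the sonames)
wget https://s3.amazonaws.com/rstudio-server/rstudio-server-0.94.92-x86_64.rpm
sudo rpm -Uvh --nodeps rstudio-server-0.94.92-x86_64.rpm
sudo rstudio-server verify-installation
</pre>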
Jermdemohttp://www.blogger.com/profile/01662705354227625640noreply@blogger.com8tag:blogger.com,1999:blog-8532065960756482590.post-89510083045237840022011-06-23T17:14:00.000-04:002011-06-24T10:07:55.055-04:00Big-Ass Servers™ and the myths of clusters in bioinformatics<div class="separator" style="clear: both; text-align: center;"></div>Spending $55k for a 512GB machine (Big-Ass Server™ or BAS™) can be a tough sell when a bioinformatics researcher pitches it to a department head.<br />
<br />
<table cellpadding="0" cellspacing="0" class="tr-caption-container" style="float: right; margin-left: 1em; text-align: right;"><tbody>
<tr><td style="text-align: center;"><a href="http://i.dell.com/images/global/products/pedge/pedge_highlights/server-poweredge-r900-overview2.jpg" imageanchor="1" style="clear: right; margin-bottom: 1em; margin-left: auto; margin-right: auto;"><img border="0" src="http://i.dell.com/images/global/products/pedge/pedge_highlights/server-poweredge-r900-overview2.jpg" /></a></td></tr>
<tr><td class="tr-caption" style="text-align: center;">Dell PowerEdge r900, available in orange and lemon-lime</td></tr>
</tbody></table>Speaking as someone who keeps his copy of <a href="http://en.wikipedia.org/wiki/Introduction_to_Algorithms">CLR</a> safely stored in the basement, ready to help rebuild society after a nuclear holocaust, I am painfully aware of the importance of algorithm development in the history of computing, and the possibilities for parallel computing to make problems tractable.<br />
<br />
Having recently spent 3 years in industry, however, I am now more inclined to just throw money at problems. In the case of hardware, I think this approach is more effective than clever programming for many of the current problems posed by NGS.<br />
<br />
From an economic and productivity perspective, I believe most bioinformatics shops doing basic research would benefit more from having access to a BAS™ than a cluster. Here's why:<br />
<ul><li>The growth of multicore/multiprocessor machines and memory capacity has outpaced the speed of networks. NGS analyses tend to be <a href="http://en.wikipedia.org/wiki/Memory_bound_function">memory-bound</a> and <a href="http://en.wikipedia.org/wiki/I/O_bound">IO-bound</a> rather than <a href="http://en.wikipedia.org/wiki/CPU_bound">CPU-bound</a>, so relying on a cluster of smaller machines can quickly overwhelm a network.</li>
<li>NGS has expanded high-performance computing from a few staples like BLAST and protein structure prediction to dozens of different little analyses, with tools that change on a monthly basis or are homegrown to deal with special circumstances. There isn't the time or the expertise to rewrite each of these for parallel architectures.</li>
</ul>If those don't sound very convincing, here is my layman's guide to dealing with the myths you might encounter concerning NGS and clusters:<br />
<br />
<h3>Myth: Google uses server farms. We should too.</h3><br />
Google has to focus on doing one thing very well: search.<br />
<br />
Bioinformatics programmers have to explore a number of different questions for any given experiment. There is no time to develop parallel solutions for most of them, since many will lead to dead ends.<br />
<br />
Many bioinformatic problems, de novo assembly being a prime example, are notoriously difficult to divide among several machines without being overwhelmed with messaging. Imagine trying to divide a jigsaw puzzle among friends sitting at several tables; you would spend more time talking about the pieces than fitting them together.<br />
<br />
<h3>Myth: Our development setup should mimic our production setup</h3><br />
An experimental computing setup built around a BAS™ allows researchers to explore big data freely without having to think about how to divide it efficiently. If an experiment is successful and there is a need to scale up to a clinical or industrial platform, that can happen later.<br />
<br />
<h3>Myth: Clusters have been around a long time so there is a lot of shell-based infrastructure to distribute workflows</h3><br />
There are tools for queueing jobs, but they offer little help in managing workflows that mix parallel and serial steps - for example, waiting for all steps to finish before merging their results.<br />
<br />
Various programming languages have features to take advantage of clusters. For example, R has <a href="http://cran.r-project.org/web/packages/snow/index.html">SNOW</a>. But Rsamtools requires you to load BAM files into memory, so a BAS™ is not just preferable for NGS analysis with R, it's required.<br />
<br />
<h3>Myth: The rise of cloud computing and Hadoop means that homegrown clusters are irrelevant, but that also means we don't need a BAS™</h3><br />
The popularity of cloud computing in bioinformatics is also driven by the newfound ability to rent time on a BAS™. The main problem with cloud computing is the bottleneck posed by transferring GBs of data to the cloud.<br />
<br />
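How bad is that bottleneck? A back-of-envelope sketch (the dataset size and uplink speed are assumed for illustration; plug in your own):<br />
<pre class="brush: bash">
# Hours to push one run's worth of data to the cloud.
gb=100      # assumed dataset size in GB
mbps=100    # assumed sustained uplink in megabits per second
seconds=$(( gb * 8 * 1000 / mbps ))
echo "$(( seconds / 3600 ))h $(( seconds % 3600 / 60 ))m"   # prints "2h 13m"
</pre>
At realistic NGS volumes the transfer time quickly rivals the analysis time.<br />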
<h3>Myth: Crossbow and Myrna are based on Hadoop, we can develop similar tools</h3><br />
Ben Langmead, Cole Trapnell, and Michael Schatz, alums of Steven Salzberg's group at UMD, have developed NGS solutions using the Hadoop MapReduce framework.<br />
<ul><li>Crossbow is a Hadoop-based implementation of Bowtie. </li>
<li>Myrna is an RNA-Seq pipeline. </li>
<li>Contrail is a de novo short read assembler. </li>
</ul>These are difficult programs to develop, and these examples are either somewhat limited experimental proofs of concept or are married to components that may be undesirable for certain analyses. The Bowtie stack (Bowtie, TopHat, Cufflinks), while revolutionary in its implementation of the Burrows-Wheeler transform, is itself built around the limitations of computers in the year 2008. For many it lacks the sensitivity to deal with, for example, 1000 Genomes data.<br />
<br />
The dynamic scripting languages used by most bioinformatics programmers are not as well suited to Hadoop as Java. To imply we can all develop tools of similar sophistication is unrealistic. Many bioinformatics programs are not even <i>threaded</i>, much less designed to work across several machines.<br />
<br />
<h3>Myth: <a href="http://en.wikipedia.org/wiki/Embarrassingly_parallel">embarrassingly parallel</a> problems imply a cluster is needed</h3><br />A server with 4 quad-core processors is often adequate for handling these embarrassing problems. Dividing the work just tends to lead to further embarrassments.<br /><br />Here is a particularly telling <a href="http://biostar.stackexchange.com/questions/2657/scalemp-vsmp-or-physical-ram">quote</a> from Biohaskell developer Ketil Malde on Biostar:<br />
<blockquote>In general, I think HPC are doing the wrong thing for bioinformatics. It's okay to spend six weeks to rewrite your meteorology program to take advantage of the latest supercomputer (all of which tend to be just a huge stack of small PCs these days) if the program is going to run continuously for the next three years. It is not okay to spend six weeks on a script that's going to run for a couple of days.<br />
<br />
In short, I keep asking for a big PC with a bunch of the latest Intel or AMD core, and as much RAM as we can afford. </blockquote><br />
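Malde's "big PC" is also where embarrassingly parallel work is simplest to run: per-chromosome jobs fan out across local cores with nothing fancier than xargs, no scheduler or message passing involved. A minimal sketch (echo is a stand-in for your actual aligner or caller invocation):<br />
<pre class="brush: bash">
# Run up to 4 per-chromosome jobs at once on local cores;
# sort just makes the interleaved output readable.
printf 'chr%s\n' 1 2 3 4 | xargs -P 4 -I{} echo "done {}" | sort
</pre>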
<h3>Myth: We don't have money for a BAS™ because we need a new cluster to handle things like BLAST</h3><br />
<table cellpadding="0" cellspacing="0" class="tr-caption-container" style="float: left; margin-right: 1em; text-align: left;"><tbody>
<tr><td style="text-align: center;"><a href="http://farm3.static.flickr.com/2758/4400143436_50bcd3843a.jpg" imageanchor="1" style="clear: left; margin-bottom: 1em; margin-left: auto; margin-right: auto;"><img border="0" height="215" src="http://farm3.static.flickr.com/2758/4400143436_50bcd3843a.jpg" width="320" /></a></td></tr>
<tr><td class="tr-caption" style="text-align: center;">IBM System x3850 X5 expandable to 1536GB, mouse not included</td></tr>
</tbody></table>Even the BLAST setup we think of as the essence of parallelism (a segmented genome index - every node gets a part of the genome) is often not the one that many institutions have settled on. Many rely on farming out queries to a cluster in which every node has the full genome index in memory.<br />
<br />
Secondly, mpiBLAST appears to be more suited to dividing an index among older machines than today's, which typically have &gt;32GB of RAM. Here is a telling FAQ entry:<br />
<br />
<blockquote>I benchmarked mpiBLAST but I don't see super-linear speedup! Why?!<br />
<br />
mpiBLAST only yields super-linear speedup when the database being searched is significantly larger than the core memory on an individual node. The super-linear speedup results published in the ClusterWorld 2003 paper describing mpiBLAST are measurements of mpiBLAST v0.9 searching a 1.2GB (compressed) database on a cluster where each node has 640MB of RAM. A single node search results in heavy disk I/O and a long search time.<br />
<a href="http://www.mpiblast.org/Docs/FAQ#super-linear">http://www.mpiblast.org/Docs/FAQ#super-linear</a></blockquote><br />
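The FAQ's criterion reduces to a single comparison: segment the database only when it is bigger than a node's RAM. With the FAQ's own 2003-era numbers (1.2GB database, 640MB of RAM per node):<br />
<pre class="brush: bash">
# Decide between segmenting and replicating a BLAST database.
# Sizes in MB, taken from the mpiBLAST FAQ's 2003 example.
db_mb=1200; ram_mb=640
if [ "$db_mb" -gt "$ram_mb" ]; then
    echo "segment the database across nodes"
else
    echo "replicate the whole index on every node"
fi
</pre>
On a modern node with &gt;32GB of RAM, the same check points the other way for almost any database you would care to search.<br />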
Your comments on this topic are welcome!Jermdemohttp://www.blogger.com/profile/01662705354227625640noreply@blogger.com17tag:blogger.com,1999:blog-8532065960756482590.post-51242414757833817762011-03-15T13:17:00.001-04:002011-08-18T10:07:35.743-04:00RStudio: My thoughts<div class="separator" style="clear: both; text-align: center;">
<a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEgvqneSN2oafZB0FiTWQGwn5RTMHksdBqGk-c0_eNjZbO_gKb8Parp2dZwWWIw3XRp7vZBLty-Jcomns0I0O346nx4_lnnrE3ULDWB4HTTGcQ83u5E8h_pJjWzWUUq_wnxqPXEVXyhyXXM/s1600/Screen+shot+2011-03-15+at+12.52.22+PM.png" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" height="416" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEgvqneSN2oafZB0FiTWQGwn5RTMHksdBqGk-c0_eNjZbO_gKb8Parp2dZwWWIw3XRp7vZBLty-Jcomns0I0O346nx4_lnnrE3ULDWB4HTTGcQ83u5E8h_pJjWzWUUq_wnxqPXEVXyhyXXM/s640/Screen+shot+2011-03-15+at+12.52.22+PM.png" width="640" /></a></div>
<br />
Let me get this out of the way: I just love <a href="http://www.rstudio.org/">RStudio</a>.<br />
<br />
Created by a team led by <a href="http://en.wikipedia.org/wiki/Joseph_J._Allaire">JJ Allaire</a>, a name that should ring a bell if you were involved in web development during the Clinton administration, RStudio is an IDE actually designed for R from the ground up. RStudio works on Linux, Mac, and Windows platforms, and can even run over the web.<br />
<br />
While borrowing many of the best features from ESS, the Mac R-GUI, and maybe Anup Parikh's <a href="http://www.red-r.org/">Red-R</a>, RStudio provides solutions to several long-standing barriers that have hampered R code development. For instance, to do Sweave-&gt;tex-&gt;pdf (then view the pdf) in ESS was a frustrating, arthritic <span style="font-family: 'Courier New',Courier,monospace;">(M-n s M-n P)</span> process that flummoxed even the <a href="http://comments.gmane.org/gmane.emacs.ess.general/4873">greatest minds of our generation</a>. RStudio has a handy button (Compile PDF) that brings you all the way from .Rnw to Acrobat. Although this command appears to run in its own session, leading to some <a href="http://support.rstudio.org/help/discussions/suggestions/9-sweave-and-the-r-session-state">unexpected behavior</a> compared to running Sweave from the command line, the fact that this IDE is already geared for Sweave bodes well for future development.<br />
<table align="center" cellpadding="0" cellspacing="0" class="tr-caption-container" style="float: right; margin-left: 1em; text-align: right;"><tbody>
<tr><td style="text-align: center;"><a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEirwZ_hyphenhyphenBEQqwjeKWdXkFUFxnnVXXftfte34ZGy60dSc6D79gwHH_UPOVVst8us9eZL4YrCZiI24m1pOa0ENH6XFmHsidrsRo2N2-D7W_KDKltZU8ztOD5NjROFaD_GlAWhoU_ugjYJQvU/s1600/Screen+shot+2011-03-15+at+12.20.03+PM.png" style="margin-left: auto; margin-right: auto;"><img border="0" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEirwZ_hyphenhyphenBEQqwjeKWdXkFUFxnnVXXftfte34ZGy60dSc6D79gwHH_UPOVVst8us9eZL4YrCZiI24m1pOa0ENH6XFmHsidrsRo2N2-D7W_KDKltZU8ztOD5NjROFaD_GlAWhoU_ugjYJQvU/s1600/Screen+shot+2011-03-15+at+12.20.03+PM.png" /></a></td></tr>
<tr><td class="tr-caption" style="text-align: center;">This is fucking genius</td></tr>
</tbody></table>
<br />
Moving commands back and forth between console and editor is another task that other editors made unnecessarily difficult - the old Mac R GUI console would not let you copy-and-paste a subset of the history, and ESS was always geared toward writing code in the editor and executing lines, never writing code in the console and then committing it to the script. RStudio makes it easy to go in either direction. Control over multiple plots (solving both the overwritten X-Window problem and the annoying type=Cairo PNG problem on OS X) is a welcome relief.<br />
<br />
RStudio offers very good autocompletion for such a relatively weird language - in addition to package methods it is aware of data frame columns and user-defined functions, for instance. <br />
<br />
RStudio has already garnered a good number of <a href="http://support.rstudio.org/help/discussions/suggestions">suggestions</a>. Here's personally what I would like to see:<br />
<ol>
<li>More support for LaTeX markup, including menu driven formatting options so users don't have to memorize stuff like \textbf{}</li>
<li> More built-in aesthetic support for <a href="http://had.co.nz/ggplot2/">ggplot2</a>, something where users are given a WYSIWYG manipulating an existing plot similar to Jeroen Ooms' <a href="http://yeroon.net/ggplot2/">ggplot2 web application</a></li>
<li>A non-sudo Linux binary, and a way of pointing RStudio at the various R and TeX installations kicking around a server without re-installing from source.</li>
<li>Better control over the working directory (<a href="http://flowingdata.com/2011/03/02/rstudio-a-new-ide-for-r-that-makes-coding-easier/">already reported and a likely future feature</a>)</li>
<li>A means of quickly seeing where source files are actually located without mouseover</li>
<li>Integration with version control.</li>
<li>Code cleanup and indenting </li>
</ol>
Jermdemohttp://www.blogger.com/profile/01662705354227625640noreply@blogger.com3tag:blogger.com,1999:blog-8532065960756482590.post-47159692793522962892010-12-23T15:42:00.000-05:002010-12-29T16:24:12.858-05:00Chromosome bias in R, my notebookMy goal is to develop a means of detecting chromosome bias from a human BAM file.<br />
<br />
Because I've been working with proprietary and novel plant genomes for the last three years, I haven't had the chance to use any of the awesome UCSC-based annotation features that have been introduced and refined in Bioconductor until now. I've returned to biomedical research and I have some catching up to do.<br />
<br />
BSgenome might sound like horsecrap, but each <b>B</b>io<b>s</b>trings-based <b>genome</b> data package is actually a huge digested version of a UCSC/NCBI genome freeze and basic sequence annotation compiled into R objects.<br />
<br />
<a href="http://www.bioconductor.org/help/bioc-views/release/bioc/html/BSgenome.html">BSgenome at Bioconductor</a><br />
<blockquote>Be careful with googling bioconductor help - often the results point to older versions. Make sure your link has "release" in the url.</blockquote><br />
Here are the BSgenomes available today:<br />
<pre class="brush: shell">> available.genomes(type=getOption("pkgType"))
BioC_mirror = http://www.bioconductor.org
Change using chooseBioCmirror().
[1] "BSgenome.Amellifera.BeeBase.assembly4" "BSgenome.Amellifera.UCSC.apiMel2"
[3] "BSgenome.Athaliana.TAIR.01222004" "BSgenome.Athaliana.TAIR.04232008"
[5] "BSgenome.Btaurus.UCSC.bosTau3" "BSgenome.Btaurus.UCSC.bosTau4"
[7] "BSgenome.Celegans.UCSC.ce2" "BSgenome.Celegans.UCSC.ce6"
[9] "BSgenome.Cfamiliaris.UCSC.canFam2" "BSgenome.Dmelanogaster.UCSC.dm2"
[11] "BSgenome.Dmelanogaster.UCSC.dm3" "BSgenome.Drerio.UCSC.danRer5"
[13] "BSgenome.Drerio.UCSC.danRer6" "BSgenome.Ecoli.NCBI.20080805"
[15] "BSgenome.Ggallus.UCSC.galGal3" "BSgenome.Hsapiens.UCSC.hg17"
[17] "BSgenome.Hsapiens.UCSC.hg18" "BSgenome.Hsapiens.UCSC.hg19"
[19] "BSgenome.Mmusculus.UCSC.mm8" "BSgenome.Mmusculus.UCSC.mm9"
[21] "BSgenome.Ptroglodytes.UCSC.panTro2" "BSgenome.Rnorvegicus.UCSC.rn4"
[23] "BSgenome.Scerevisiae.UCSC.sacCer1" "BSgenome.Scerevisiae.UCSC.sacCer2"
</pre>Select and load hg19<br />
<pre class="brush: shell">biocLite("BSgenome.Hsapiens.UCSC.hg19")
library("BSgenome.Hsapiens.UCSC.hg19")
</pre><br />
When we get an alignment file, one of the first things we want to do is look for red flags that might indicate something went awry in the lab or downstream. An example is chromosome bias - are we seeing more reads aligned to certain chromosomes than would be expected from size alone? A sticky question, since any experiment will introduce confounds based on the inherently uneven distribution of interesting genomic features, not to mention mappability. And yet I think this is still a worthwhile exercise and should be part of any NGS pipeline.<br />
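A quick version of this red-flag check can also be roughed out outside R: <span style="font-family: "Courier New",Courier,monospace;">samtools idxstats</span> on an indexed BAM prints one line per reference sequence (name, length, mapped reads, unmapped reads). A sketch with made-up counts standing in for the real idxstats output - note it uses raw lengths, whereas below I go to the trouble of using gap-masked lengths:<br />

```shell
# Rough reads-per-megabase check from idxstats-style output.
# The printf stands in for real output from:
#   samtools idxstats myIndexedSortedBamFile.bam
# Columns: reference name, reference length, mapped reads, unmapped reads.
ratios=$(printf '%s\n' \
    'chr1 249250621 3616909 0' \
    'chr2 243199373 3642052 0' \
    'chrY 59373566 120000 0' |
  awk '{ printf "%s %.1f\n", $1, $3 / ($2 / 1e6) }')
echo "$ratios"   # one "chrN reads-per-Mb" line per chromosome
```

Chromosomes whose reads-per-Mb ratio sits far from the rest are the ones worth a closer look.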
<br />
What we don't want to do is ignore that 7.6% of the GRCh37 freeze is sequence that looks like "NNNNNNN" - gaps representing unsequenceable regions such as centromeres, scaffold gap delineations, and the like. We especially don't want to ignore gaps because they are not evenly distributed across the chromosomes (chrY is 56% gaps).<br />
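That gap fraction is easy to sanity-check straight from a FASTA file by counting Ns. A minimal sketch on a toy two-record FASTA (in practice, redirect your genome FASTA into awk in place of the here-doc):<br />

```shell
# Fraction of N bases in a FASTA.  The here-doc is a toy example;
# feed awk your genome FASTA instead for the real number.
nfrac=$(awk '!/^>/ {
        total += length($0)       # bases on this sequence line
        ns += gsub(/[Nn]/, "")    # gsub returns the number of Ns it replaced
    }
    END { printf "%d/%d bases are N (%.1f%%)", ns, total, 100 * ns / total }' <<'EOF'
>toy1
ACGTNNNNACGT
>toy2
NNNNACGTACGT
EOF
)
echo "$nfrac"   # prints: 8/24 bases are N (33.3%)
```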
<br />
Raw chromosome length can be obtained from the BAM file header, but for this chromosome bias analysis I need the "non-gappy" length, the portion eligible for alignment. This is one of the "masks" turned on by default for BSgenomes in order to allow various functions to work properly (see MaskCollection in the <a href="http://www.bioconductor.org/help/bioc-views/release/bioc/html/IRanges.html">IRanges</a> package for more information).<br />
<br />
<br />
<pre class="brush: shell">> masks(Hsapiens)
Error in function (classes, fdef, mtable) :
unable to find an inherited method for function masks, for signature "BSgenome"
#oops I see masks are a member of MaskedDNAString objects (i.e. chromosomes) not BSgenome objects
> masks(Hsapiens$chrY)
MaskCollection of length 4 and width 59373566
masks:
maskedwidth maskedratio active names desc
1 33720000 0.567929506 TRUE AGAPS assembly gaps
2 0 0.000000000 TRUE AMB intra-contig ambiguities (empty)
3 16024357 0.269890426 FALSE RM RepeatMasker
4 587815 0.009900281 FALSE TRF Tandem Repeats Finder [period<=12]
all masks together:
maskedwidth maskedratio
49783032 0.8384713
all active masks together:
maskedwidth maskedratio
33720000 0.5679295
#I think the maskedwidth should reveal sum of actively masked nucleotides
> maskedwidth(Hsapiens$chrY)
[1] 33720000
#can we mess with the masks?
> active(masks(Hsapiens$chrY))["RM"]<-TRUE
Error in `$<-`(`*tmp*`, "chrY", value = < S4 object of class "MaskedDNAString">) :
no method for assigning subsets of this S4 class
#oops I can't manipulate a BSgenome this way - it is behaving like a class instead of an instance of a class
> chrY<-Hsapiens$chrY
> active(masks(chrY))["RM"]<-TRUE
> maskedwidth(chrY)
[1] 49744357
# ok maskedwidth is working as I figured, but i need unmasked width
> unmaskedWidth<-function(chr){length(chr)-maskedwidth(chr)}
> unmaskedWidth(Hsapiens$chrY)
[1] 25653566
#how can I iterate over something with a $ operator? let's try [[]]
> unmaskedWidth(Hsapiens[["chrY"]])
[1] 25653566
</pre>Now I want to create a data frame with sequence names and unmaskedWidths to go with some read counts from a BAM file. Whenever I want to go from a list, through a function, to a data frame I think <a href="http://had.co.nz/plyr/">plyr</a>, specifically ldply (<b>l</b>ist to <b>d</b>ata frame).<br />
<pre class="brush: shell"># let's take chr 1-22,X,Y, skipping the unscaffolded sequences and mitochondrial chr
> maskedSizes<-ldply(.data=seqnames(Hsapiens)[1:24],
.fun=function(x){
data.frame(chr=x,seqlength=length(Hsapiens[[x]]),
unmaskedWidth=unmaskedWidth(Hsapiens[[x]]))},
.progress="text",
.parallel=TRUE)
> maskedSizes
chr seqlength unmaskedWidth
1 chr1 249250621 225280621
2 chr2 243199373 238204518
3 chr3 198022430 194797135
4 chr4 191154276 187661676
5 chr5 180915260 177695260
6 chr6 171115067 167395066
7 chr7 159138663 155353663
8 chr8 146364022 142888922
9 chr9 141213431 120143431
10 chr10 135534747 131314738
11 chr11 135006516 131129516
12 chr12 133851895 130481393
13 chr13 115169878 95589878
14 chr14 107349540 88289540
15 chr15 102531392 81694766
16 chr16 90354753 78884753
17 chr17 81195210 77795210
18 chr18 78077248 74657229
19 chr19 59128983 55808983
20 chr20 63025520 59505520
21 chr21 48129895 35106642
22 chr22 51304566 34894545
23 chrX 155270560 151100560
24 chrY 59373566 25653566
</pre><br />
Load the BAM file and get read counts in a data frame.<br />
<pre class="brush: shell">#other methods include scanBam and readAligned
bamFile<-readBamGappedAlignments("myIndexedSortedBamFile.bam")
> levels(rname(bamFile))
[1] "1" "2" "3" "4" "5"
[6] "6" "7" "8" "9" "10"
[11] "11" "12" "13" "14" "15"
[16] "16" "17" "18" "19" "20"
[21] "21" "22" "X" "Y" "MT"
[26] "GL000207.1" "GL000226.1" "GL000229.1" "GL000231.1" "GL000210.1"
[31] "GL000239.1" "GL000235.1" "GL000201.1" "GL000247.1" "GL000245.1"
[36] "GL000197.1" "GL000203.1" "GL000246.1" "GL000249.1" "GL000196.1"
[41] "GL000248.1" "GL000244.1" "GL000238.1" "GL000202.1" "GL000234.1"
[46] "GL000232.1" "GL000206.1" "GL000240.1" "GL000236.1" "GL000241.1"
[51] "GL000243.1" "GL000242.1" "GL000230.1" "GL000237.1" "GL000233.1"
[56] "GL000204.1" "GL000198.1" "GL000208.1" "GL000191.1" "GL000227.1"
[61] "GL000228.1" "GL000214.1" "GL000221.1" "GL000209.1" "GL000218.1"
[66] "GL000220.1" "GL000213.1" "GL000211.1" "GL000199.1" "GL000217.1"
[71] "GL000216.1" "GL000215.1" "GL000205.1" "GL000219.1" "GL000224.1"
[76] "GL000223.1" "GL000195.1" "GL000212.1" "GL000222.1" "GL000200.1"
[81] "GL000193.1" "GL000194.1" "GL000225.1" "GL000192.1"
#the deflines in my reference do not match the BSgenome names, must fix at least the chromosomes of interest
levels(rname(bamFile))[1:25]<-paste('chr',c(1:22,'X','Y','M'),sep='')
#run length encoded read counts per chromosome
readRle<-rname(bamFile)
#get a data frame with chromosome and read counts
allReadsDf<-ldply(runValue(readRle),function(x){data.frame(chr=levels(runValue(readRle))[x],reads=runLength(readRle)[x])})
> head(allReadsDf)
chr reads
1 chr1 3616909
2 chr2 3642052
3 chr3 2843019
4 chr4 2636141
5 chr5 2590352
6 chr6 2497123
</pre><br />
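Relabeling the factor levels only fixes the names inside this R session; to rename the chromosomes in the BAM itself, one option (a sketch - check it against your own header before trusting it) is to rewrite the @SQ lines and hand the edited header to samtools reheader. Here the sed step is demonstrated on a literal header fragment:<br />

```shell
# Rename bare GRCh37 chromosome names to UCSC-style in @SQ header lines.
# Against a real file you would run something like:
#   samtools view -H in.bam | sed -e 's/SN:\([0-9XY]\)/SN:chr\1/' -e 's/SN:MT/SN:chrM/' > header.sam
#   samtools reheader header.sam in.bam > renamed.bam
# The printf fakes three header lines for demonstration; unplaced GL* contigs
# (which have no BSgenome counterpart anyway) are left untouched.
fixed=$(printf '@SQ\tSN:%s\tLN:%s\n' 1 249250621 MT 16569 GL000207.1 4262 |
  sed -e 's/SN:\([0-9XY]\)/SN:chr\1/' -e 's/SN:MT/SN:chrM/')
echo "$fixed"
```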
Merge the read counts with unmasked chromosome lengths and plot their relationship.<br />
<pre class="brush: shell">chrSizesReads<-merge(maskedSizes,allReadsDf,sort=FALSE)
library(ggplot2)
p<-ggplot(data=chrSizesReads, aes(x=unmaskedWidth, y=reads, label=chr)) +
geom_point() +
geom_text(vjust=2,size=3) +
stat_smooth(method="lm", se=TRUE,level=0.95) +
ylab("Reads aligned") +
xlab("Unmasked chromosome size") +
opts(title = "Reads vs Chromosome Size")
print(p)
</pre><div class="separator" style="clear: both; text-align: center;"><a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEikQqdIJXvfhCF-jd6Vzvk1PU-sbv0z0I_7k48As_jx01w9btTvEFh2JzULCIaNZhnz41pyFr3ZsnCqowzKQiI-5RF_LJusPunbursRPN5ivugxHdZ-c7tTieQkl1Sw_ftXVKwicyg1Es0/s1600/readChr.png" imageanchor="1"><img border="0" height="305" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEikQqdIJXvfhCF-jd6Vzvk1PU-sbv0z0I_7k48As_jx01w9btTvEFh2JzULCIaNZhnz41pyFr3ZsnCqowzKQiI-5RF_LJusPunbursRPN5ivugxHdZ-c7tTieQkl1Sw_ftXVKwicyg1Es0/s320/readChr.png" width="320" /></a></div>There should be a strong linear relationship between read count and chromosome size. We can test this using a linear regression model, the null hypothesis being the number of reads aligned to a chromosome is independent of its size. <br />
<pre class="brush: shell">> mylm<-lm(reads~unmaskedWidth,data=chrSizesReads)
> mysummary<-summary(mylm)
> mysummary
Call:
lm(formula = reads ~ unmaskedWidth, data = chrSizesReads)
Residuals:
Min 1Q Median 3Q Max
-271816 -108122 -43984 42826 676284
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 1.774e+05 9.505e+04 1.866 0.0754 .
unmaskedWidth 1.455e-02 7.145e-04 20.365 9.12e-16 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Residual standard error: 206600 on 22 degrees of freedom
Multiple R-squared: 0.9496, Adjusted R-squared: 0.9473
F-statistic: 414.8 on 1 and 22 DF, p-value: 9.123e-16
</pre>The low p-value (against the null hypothesis that chromosome size has no influence) and the high R-squared (a measure of the model's predictive value) suggest this model is sound.<br />
<br />
The following plot is obtained from the standardized residuals (the standardized differences between observed and expected values) of the linear model described earlier.<br />
<br />
Chromosome bias refers to an uneven distribution of read alignments across chromosomes. We can expect some chromosome bias in treatment sets because of the inherent nature of any experimental condition - recovered fragments will not be evenly distributed among chromosomes because regions of effect are not evenly distributed. Other possible sources of chromosome bias include heterochromatin, uneven repeat content, and the potential for aligning against an incorrect set of sex chromosomes. Aligners will typically assign reads that map ambiguously to multiple locations to discrete positions at random, spreading them evenly among the candidates. <br />
<pre class="brush: shell">> p<-qplot(chrSizesReads$chr,rstandard(mylm))+
aes(label=chrSizesReads$chr)+
geom_text(vjust=2,size=3)+
xlab("Chromosome")+
ylab("Std Residual from lm (reads)")+
geom_abline(slope=0,intercept=0)+
opts(axis.text.x = theme_text(angle=45,hjust=1))+
opts(title = "Linear Regression Residuals")
> print(p)
</pre><div class="separator" style="clear: both; text-align: center;"><a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEgY5Ut4RdeF93ArZHwavk_IBp0O1REonnKgMX9KimOWTJPVtv1YIi1YjCcxaG3_BIhfhBWrA5QaV6MHqJLQ-LDiUf3CsphBnvXJJTFGs6Pc2qY3GeLdq8aJyY3BRyUQnKYQXhXgVbQymZY/s1600/resid.png" imageanchor="1" style=""><img border="0" height="320" width="320" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEgY5Ut4RdeF93ArZHwavk_IBp0O1REonnKgMX9KimOWTJPVtv1YIi1YjCcxaG3_BIhfhBWrA5QaV6MHqJLQ-LDiUf3CsphBnvXJJTFGs6Pc2qY3GeLdq8aJyY3BRyUQnKYQXhXgVbQymZY/s320/resid.png" /></a></div><br />
Fortunately, there is no clear pattern to these residual values (a pattern would indicate problems with the model), but with a Z-score of 3.36, chrX appears to be an outlier. With 46M total alignments this is certainly not due to sampling error, but we can still test our observation with a Lund statistic. <br />
<pre class="brush: shell">#http://stackoverflow.com/questions/1444306/how-to-use-outlier-tests-in-r-code
>lundcrit<-function(a, n, q) {
F<-qf(c(1-(a/n)),df1=1,df2=n-q-1,lower.tail=TRUE)
crit<-((n-q)*F/(n-q-1+F))^0.5
crit
}
> n<-nrow(chrSizesReads)
> q<-length(mylm$coefficients)
> crit<-lundcrit(0.05,n,q)
> chrSizesReads[which(rstandard(mylm)>crit),"chr"]
[1] chrX
</pre><br />
Happy holidays!Jermdemohttp://www.blogger.com/profile/01662705354227625640noreply@blogger.com3tag:blogger.com,1999:blog-8532065960756482590.post-32104635919389698662010-12-09T17:41:00.000-05:002012-01-20T22:56:37.301-05:00Directory-based bash historiesUsing a directory-based bash history allows for a record of shell actions on a directory basis, so a group of developers has some record of what was done while in a directory, when, and by whom. This can be helpful when trying to reconstruct history with limited documentation.<br />
<br />
I know this setup will be of some benefit to my successor at my previous job because he has access to everything I ever did in any project directory.<br />
<br />
Place this code in your ~/.bash_profile or ~/.bashrc<br />
<br />
(type source ~/.bash_profile (or .bashrc) to load this for your current session)<br />
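The gist below is the version I actually use; the core idea boils down to a couple of lines (a minimal sketch - the function and file names here are my own):<br />

```shell
# Directory-based bash history, minimal sketch: before each prompt is drawn,
# re-point HISTFILE at a history file inside the current directory and append
# the session's new commands to it.
_dir_history() {
    export HISTFILE="$PWD/.dir_bash_history"
    history -a   # append commands entered since the last write
}
PROMPT_COMMAND="_dir_history"
```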
<br />
<script src="https://gist.github.com/1651133.js"> </script>Jermdemohttp://www.blogger.com/profile/01662705354227625640noreply@blogger.com0tag:blogger.com,1999:blog-8532065960756482590.post-9384236237543900092010-08-30T14:12:00.000-04:002010-08-31T13:50:03.872-04:00NGS viewers reviewedI gathered up some of the recent <i>free</i> next generation sequence viewers that were capable of viewing BAM files, and put each through the motions with a few BAM files and reference sequences of various sizes. While there are some great ideas and several choices to be found along the feature spectrum, I think we are still in the dark ages with this stuff. No viewer has really been able to entirely combine usability with performance and analysis capabilities, let alone extensibility and web connectivity.<br />
<br />
<div class="separator" style="clear: both; text-align: center;"><a href="http://picasaweb.google.com/lh/photo/oeGoMtCA47G7fFzw89a96Q?feat=embedwebsite" style="clear: right; float: right; margin-bottom: 1em; margin-left: 1em;"><img src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEh1xWQ4IANBnIx5d7Y5d7IU_a30v87JOEE7z3bY4goSBzEbAzsAMMsiph7ZL-HSRNH4eV7moQk72GefMHtpJ3RuSae_kIL9oLCxqkGa4E-AFzWrHoRVQCq0HyG-sDsPAaBd9k6Vp3VUIq8/s288/tview.png" /></a></div><span style="font-size: large;"><a href="http://samtools.sourceforge.net/">tview</a></span><br />
<b>My take:</b> tview is the barebones, text-rendered viewer that is included with Samtools. People who favor this as their BAM viewer probably think Vim is too polished. Even the very limited navigation is remarkably unintuitive (going to a coordinate requires chr:position even if you have just one chromosome, and no error is displayed if you forget the chr).<br />
<b>Most resembles which video game: </b>Oregon Trail<br />
<b>Good standout feature:</b> command line access<br />
<b>Bad feature:</b> text display<br />
<br />
<div class="separator" style="clear: both; text-align: center;"><a href="http://picasaweb.google.com/lh/photo/WexFGlirQLnEUmeBpTZ4CQ?feat=embedwebsite" style="clear: left; float: left; margin-bottom: 1em; margin-right: 1em;"><img src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEgJoWzwdcQ3BSEBd4s4NahXR6sph23Jjh-MfRNMoHukEHh_NuiNPKGIXWbWYcKMa1L0eTtfdG61SB99Vi0Diw4plOSDD3PgoKJN7benL8b31gPX1PIRcLZmGfscxCWZ9r0HN9LKlE6IfI4/s288/bamview.png" /></a></div><span style="font-size: large;"><a href="http://bamview.sourceforge.net/">BamView</a></span><br />
<b>My take:</b> Bamview is a wicked fast simple BAM file viewer. It doesn't have much in the way of features, but for cursory examination of BAM files it is more palatable than tview.<br />
<b>Most resembles which video game: </b>Burgertime<br />
<b>Good standout feature:</b> strand split screen<br />
<b>Bad feature:</b> drag selecting a region turns it red, and umm... that's all it does<br />
<br />
<div class="separator" style="clear: both; text-align: center;"><a href="http://picasaweb.google.com/lh/photo/gey5Qp6K7mT-ZErxkbN3IA?feat=embedwebsite" style="clear: right; float: right; margin-bottom: 1em; margin-left: 1em;"><img src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEh0IASkgTwD5wLovhSd2AdQdakYAn_z2AQOq36V4AsBwS4ZNeZmZkvxNzbtPGcRffB_GZtg0FLdChv7nM-yV83vlAG4ngiVupRdnHVzVArDwyeULz46GsSuicmmUsRh4KlEwWDI-7e2xr0/s288/genoviewer-2.png" /></a></div><span style="font-size: large;"><a href="http://www.genoviewer.com/">GenoViewer</a></span><br />
<b>My take:</b> The only NGS viewer endorsed by <a href="http://www.youtube.com/watch?v=--Vaz9jW054">Speak</a> the Hungarian rapper, unfortunately this recent entrant leaves a lot to be desired in terms of performance with large files. GenoViewer is very hard on the eyes - the indiscriminate use of primary colors looks like a kid somehow vomited up the ball pit at Chuck E. Cheese.<br />
<b>Most resembles which video game: </b>Centipede<br />
<b>Good standout feature:</b> promises that "You will not get lost in the details, and can easily figure out the true meaning behind the data. Guaranteed."<br />
<b>Bad feature:</b> graphics<br />
<br />
<div class="separator" style="clear: both; text-align: center;"><a href="http://picasaweb.google.com/lh/photo/TqnwOx1tMnM8vcYS7_lHSg?feat=embedwebsite" style="clear: left; float: left; margin-bottom: 1em; margin-right: 1em;"><img src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEgzBgPZVOPgeJE7ghvuQBZ4cURNNtKpYwGzQmbswsBr5McmRkXmXVWpIMpq6Uu6AK-7A0uu-_FtON60KAULhPAk3HhlN6SND50xyhAvCLO9rYPRf6AfNauv53uPYSYZruEDTtQ9gi84huE/s288/magicviewer.png" /></a></div><span style="font-size: large;"><a href="http://bioinformatics.zj.cn/magicviewer/">MagicViewer</a></span><br />
<b>My take:</b> MagicViewer has come a long way since its initial release last December. With its interesting pie-chart icon renderings of SNP purity, decent treatment of annotation tracks, and improved performance, MagicViewer might soon be a contender to Tablet in the midweight category. The navigation is workable but takes some getting used to - it's unclear when to use scroll bars vs. arrows.<br />
<b>Most resembles which video game: </b>Moon Patrol<br />
<b>Good standout feature:</b> primer design tool<br />
<b>Bad feature:</b> some regions are simply not visible - no reference, no reads<br />
<br />
<div class="separator" style="clear: both; text-align: center;"><a href="http://picasaweb.google.com/lh/photo/yPwsdAPqe8I5vM4gtHq_bA?feat=embedwebsite" style="clear: right; float: right; margin-bottom: 1em; margin-left: 1em;"><img src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEh2wIKIyQgJTV_RH8bQ5ow4CDQP-dBCHBhv5M4zSkCkbkO3ovWWE9tkQuA0ydzdCaEHCEYssBvs0ulGLRv8AVVSh164cJosEcsIFuPLUom36siXDtKTPlTNv-_75SdPVsglJqsEvxZdNlk/s288/tablet.png" /></a></div><span style="font-size: large;"><a href="http://bioinf.scri.ac.uk/tablet/">Tablet</a></span><br />
<b>My take:</b> Hands down the most attractive of the viewers, Tablet is aimed at fostering a delicate balance between performance, features, and aesthetics. Tablet comes with a suite of read views - Packed, Stacked, and Classic - to suit both young children and elderly scientists alike. GFF feature files can be loaded but they appear to merely serve as position indices.<br />
<b>Most resembles which video game: </b>SimCity<br />
<b>Good standout feature:</b> interface<br />
<b>Bad feature:</b> read insertions not displayed correctly<br />
<br />
<br />
<br />
<div class="separator" style="clear: both; text-align: center;"><a href="http://picasaweb.google.com/lh/photo/Scp_X0E2C49rvYzqX8_CFA?feat=embedwebsite" style="clear: left; float: left; margin-bottom: 1em; margin-right: 1em;"><img src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEjQ9cuDDp2x9Lqs2Fze8rMi89C2iSkxuKLCyoqjcHLW8hD83cA77ujvk2TXFsJ3TrmVehUW_tEt1oU-PsX6Yai_hUijUveQLHezVqz64W2VfpBfzId94m8sjtCY9dx9HUJVCYSpzkLu-mU/s280/igv.png" /></a></div><span style="font-size: large;"><a href="http://www.broadinstitute.org/igv/">IGV</a></span><br />
<b>My take:</b> The Integrative Genomics Viewer is a serious tool for exploring and analyzing large datasets. In addition to viewing, IGV is designed to allow users to extract the kind of hard publishable data that has typically been the domain of Bio* scripts. Like the true product of an Ivy League education, IGV can appear aloof and arrogant to newcomers. The viewer will let you load BAM files and other annotation tracks that have nothing to do with the reference without comment or guidance, then require an extra unintuitive click to actually generate a view. While not the easiest to use or the best performer, there is nothing comparable when it comes to generating real queries on your data from within the viewer.<br />
<b>Most resembles which video game: </b>Dig Dug<br />
<b>Good standout feature:</b> analysis tools<br />
<b>Bad feature:</b> throws a hissy fit if it cannot connect to home server<br />
<br />
<br />
<div class="separator" style="clear: both; text-align: center;"><a href="http://picasaweb.google.com/lh/photo/8pD4iTemnoCaiTx1EpeAug?feat=embedwebsite" style="clear: right; float: right; margin-bottom: 1em; margin-left: 1em;"><img src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEiPRWUeOe1OUmlWR1FYjAwiKn4Cfp-7yDS_HGI38dWq811FFn6vezgIG9XgNl3yE-I4wQlSSG9cjM9eJDn3WNDXc3nkXUuB2P2oSThIASzK-xwIlAbGeZxZJsP9yiEQ1AekdlAZ6syyQkg/s288/gb2.png" /></a></div><span style="font-size: large;"><a href="http://gmod.org/wiki/GBrowse">gbrowse2</a></span><br />
<b>My take:</b> gbrowse2 is the AJAXified protege of the venerable generic genome browser. Samtools integration is a recent addition to this highly extensible platform, which has been used for years to display everything from large genomes to small sequencing projects. Of all the viewers here, GB2 provides the best set of visualization tools, such that virtually any biological information that can be rendered linearly has been done so as gbrowse tracks. As the lone true web application here, gbrowse2 allows genomic positions to be hyperlinked or even snapshot-embedded on the web. The web provides the best platform to share visual genomic data among several users. However, a gbrowse2 instance with BAM tracks can be a massive pain to install, configure, and debug ("landmark chrI not found" is the most popular google search in all of bioinformatics). Novices can expect a minefield of historical gotchas and arcane conventions ("Name=" not "name=" field in gff3, bp_seqfeature_load instead of bp_load_gff for gff3, no validator for conf files), and even experienced users are often baffled by cryptic errors that pop up in server logs.<br />
<b>Most resembles which video game: </b>Sissyfight 2000<br />
<b>Good standout feature:</b> hyperlinks<br />
<b>Bad feature:</b> setupJermdemohttp://www.blogger.com/profile/01662705354227625640noreply@blogger.com16tag:blogger.com,1999:blog-8532065960756482590.post-9180291394140780572010-08-18T12:42:00.000-04:002010-08-19T08:15:29.212-04:00My thoughts on the acquisition of Ion Torrent by Life Technologies<a onblur="try {parent.deselectBloggerImageGracefully();} catch(e) {}" href="http://www.iontorrent.com/lib/images/products/pgm_images.jpg"><img style="float:right; margin:0 0 10px 10px;cursor:pointer; cursor:hand;width: 333px; height: 198px;" src="http://www.iontorrent.com/lib/images/products/pgm_images.jpg" border="0" alt="" /></a>Yesterday it was announced that Ion Torrent, makers of the Ion Personal Genome Machine (PGM) sequencer, would be acquired by <strike>Invitrogen</strike> <strike>ABI</strike> Life Technologies for $375M in cash and stock, with the possibility of another $350M if various milestones are met.<br />
<br />
Gregory Lucier, chairman and CEO of Life Technologies, said, "We believe Ion Torrent's technology will represent a profound change for the life sciences industry, as fundamental as the one we saw with the introduction of qPCR." That analogy might not sound earth-shattering, but I suspect he is using qPCR as an example because it is one of the few molecular biology (as opposed to biochemical) techniques regularly used in clinical settings.<br />
<br />
The PGM is a unique machine because it is the first second-generation sequencer (tentative <a href="https://spreadsheets.google.com/ccc?key=0AvaxS3m5rl-9dHdtUGRtaGlsZWNFNWJleDRXaUhQTHc&hl=en#gid=0">specs</a>: ~3 million 200 bp reads from a $500, 1 hr run on a $50k machine) that could conceivably be leased by a small clinical testing facility, like the ones fed by those LabCorp boxes you see scattered all over strip malls. Bear in mind even a capillary-based Sanger sequencer like the ABI3730xl costs a whopping <a href="http://webcache.googleusercontent.com/search?q=cache:Zr-H90V_YUcJ:www.genomeweb.com/sequencing/used-abi-3730xls-are-%E2%80%98flooding-market%E2%80%99-centers-shift-next-gen-sequencers+3730xl+price&cd=7&hl=en&ct=clnk&gl=us&client=firefox-a">$375,000</a>, which comes as a shock to those of us who mostly work with next-generation sequencing data. You can see not only why the PGM is really a game changer, but how it might fit into Life Technologies' offerings.<br />
<br />
Exactly which clinical diagnostics are, or will be, suited for this machine is unclear. With the right foolproof software this could replace a lot of PCR and microarray-based tests. Read lengths and throughput are bound to increase just as they did with Solexa. Future applications like tumor sequencing and 16S rRNA microbiome sequencing have not even entered into medical practice yet.<br />
<br />
The key to Life Technologies succeeding in these areas will likely come down to non-technical challenges:<br />
<ul><li>Getting FDA approval for the PGM as a medical device and for various protocols based on the PGM as approved clinical tests. Life Technologies has experience with this process. Ion Torrent clearly does not.</li>
<li>Convincing insurance companies and HMOs that these tests are cost-effective diagnostics. When you consider the exorbitant cost of other tests - e.g. several MRIs over a course of chemotherapy - this might not be such a hard sell.</li>
<li>Developing a business model that will allow small clinics to lease the machine for little or even no cost provided they agree to purchase a minimum amount of consumables. Life Technologies will likely market the PGM to smaller labs who themselves will inevitably face competition from larger dedicated sequencing centers trying to centralize this type of work using bigger machines from Illumina or PacBio (or even LT's SOLiD).</li>
</ul><br />
Although these obstacles seem daunting, I have sat through academic departmental meetings where very knowledgeable sequencer salespeople were invited back multiple times only to be jerked around as the faculty hemmed and hawed over whether this was the right $400k machine to buy at the time. That was undoubtedly an expensive and frustrating sales process for Solexa and 454. With the PGM, Life Technologies can pitch this much cheaper machine to an albeit different group of skeptics, clinical lab managers, but one with a clear bottom line in mind.Jermdemohttp://www.blogger.com/profile/01662705354227625640noreply@blogger.com3tag:blogger.com,1999:blog-8532065960756482590.post-63888255660687774032010-03-10T13:21:00.000-05:002010-03-10T13:40:56.724-05:00Use local instead of my in Perl when using $$ (double dollar sign) inside an enclosed blockI thought I would be all clever and initialize several hash indices at once using the $$ notation in Perl - evaluating strings as variables. Unfortunately using "my" screws this up bigtime.<br /><br />This must be another circumstance spelled out in the incomprehensible "my" treatise:<br /><a href="http://perldoc.perl.org/perlsub.html#Private-Variables-via-my%28%29">http://perldoc.perl.org/perlsub.html#Private-Variables-via-my()</a><br /><br /><pre class="brush: perl"><br />my %foo;<br />$foo{'bar'}=1;<br />foreach $fooDex(qw(foo)){<br /> $$fooDex{'bar'}=2;<br />}<br />print $foo{'bar'}."\n"; #prints 1<br /></pre><br /><br /><br /><pre class="brush: perl"><br />local %foo;<br />$foo{'bar'}=1;<br />foreach $fooDex(qw(foo)){<br /> $$fooDex{'bar'}=2;<br />}<br />print $foo{'bar'}."\n"; #prints 2<br /></pre><br /><br />This is one of those solutions that gives you a feeling of dread instead of accomplishment.Jermdemohttp://www.blogger.com/profile/01662705354227625640noreply@blogger.com1tag:blogger.com,1999:blog-8532065960756482590.post-67233663485179879372010-03-09T16:57:00.000-05:002010-04-15T10:39:09.475-04:00Getting the basics from 
readAlignedThe <a href="http://manuals.bioinformatics.ucr.edu/home/ht-seq">UCR</a> guide is a little sparse with regard to getting basic information from readAligned.<br /><br />I'd like to add to the general cookbook. If some bioc people out there can contribute some alignment recipes and fill me in on some more basics, please comment:<br /><br /><pre class="brush: shell"><br />alignedReads <- readAligned("./", pattern="output.bowtie", type="Bowtie")<br /><br />#how many reads did I attempt to align<br />#i don't think you can get this from alignedReads<br /><br />#how many reads aligned (one or more times)<br />length(unique(id(alignedReads)))<br /><br />#how many hits were there?<br />length(alignedReads)<br /><br />#how many reads produced multiple hits<br />length(unique(id(alignedReads[srduplicated(id(alignedReads))])))<br /><br />#how many reads produced multiple hits at the best strata?<br />#please fill me in on this one<br /><br />#how many reads aligned uniquely (with exactly one hit)<br />length(unique(id(alignedReads)))-length(unique(id(alignedReads[srduplicated(id(alignedReads))])))<br /><br />#how many reads aligned uniquely at the best strata (the other hits were not as good)<br />#please fill me in on this one<br /><br />#how many unique positions were hit? what if I ignore strand?<br />#please fill me in on this one<br /><br />#how many converging hits were there (two query sequences aligned to the same genomic position)<br />#please fill me in on this one<br /></pre>Jermdemohttp://www.blogger.com/profile/01662705354227625640noreply@blogger.com1tag:blogger.com,1999:blog-8532065960756482590.post-6746365734350456942010-03-03T16:48:00.000-05:002010-08-04T12:22:36.400-04:00Quality trimming in R using ShortRead and BiostringsI wrote an R function to do soft-trimming, right clipping FastQ reads based on quality.<br />
<br />
This function has the option of leaving out sequences trimmed to extinction and will do left-side fixed trimming as well.<br />
<pre class="brush: shell">#softTrim
#trim first position lower than minQuality and all subsequent positions
#omit sequences that after trimming are shorter than minLength
#left trim to firstBase (1 implies no left trim)
#input: ShortReadQ reads
# integer minQuality
# integer firstBase
# integer minLength
#output: ShortReadQ trimmed reads
library("ShortRead")
softTrim<-function(reads,minQuality,firstBase=1,minLength=5){
  qualMat<-as(FastqQuality(quality(quality(reads))),'matrix')
  qualList<-split(qualMat,row(qualMat))
  ends<-as.integer(lapply(qualList,
                          function(x){which(x < minQuality)[1]-1}))
  #length=end-start+1, so keep start no greater than end+1 to avoid a negative-length read
  starts<-as.integer(lapply(ends,function(x){min(x+1,firstBase)}))
  #use whatever QualityScore subclass is sent
  newQ<-ShortReadQ(sread=subseq(sread(reads),start=starts,end=ends),
                   quality=new(Class=class(quality(reads)),
                               quality=subseq(quality(quality(reads)),
                                              start=starts,end=ends)),
                   id=id(reads))
  #apply minLength using srFilter
  lengthCutoff <- srFilter(function(x) {
    width(x)>=minLength
  },name="length cutoff")
  newQ[lengthCutoff(newQ)]
}
</pre><br />
<br />
To use:<br />
<pre class="brush: shell">library("ShortRead")
source("softTrimFunction.R") #or whatever you want to name this
reads<-readFastq("myreads.fq")
trimmedReads<-softTrim(reads=reads,minQuality=5,firstBase=4,minLength=3)
writeFastq(trimmedReads,file="trimmed.fq")
</pre>
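For readers who think more easily in imperative code, here is the same trimming rule as a minimal Python sketch (my own illustration with hypothetical names, using 0-based indexing rather than the 1-based coordinates of the R version):

```python
def soft_trim(seq, quals, min_quality, first_base=1, min_length=5):
    # clip at the first position whose quality drops below min_quality;
    # that position and everything after it are removed
    end = len(seq)
    for i, q in enumerate(quals):
        if q < min_quality:
            end = i
            break
    # fixed left trim; first_base is 1-based, so 1 means no left trim.
    # never let start pass end, to avoid a negative-length read
    start = min(first_base - 1, end)
    trimmed = seq[start:end]
    # drop reads trimmed to extinction (shorter than min_length)
    return trimmed if len(trimmed) >= min_length else None

# quality dips below 5 at the seventh base, so the read is clipped to 6 bp
print(soft_trim("ACGTACGTAC", [30, 30, 28, 27, 20, 18, 2, 2, 2, 2], 5))  # ACGTAC
```

The R function above does the same thing, vectorized over every read in the ShortReadQ object.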
I strongly recommend reading the excellent UC Riverside HT-Sequencing Wiki cookbook and tutorial if you wish to venture into using R for NGS handling. Among other things, it will explain how to perform casting if you have Solexa-scaled (base 64) fastq files. The softTrim function should handle those encodings too, since it reuses whatever QualityScore subclass it is given.
<a href="http://manuals.bioinformatics.ucr.edu/home/ht-seq">http://manuals.bioinformatics.ucr.edu/home/ht-seq</a>Jermdemohttp://www.blogger.com/profile/01662705354227625640noreply@blogger.com3tag:blogger.com,1999:blog-8532065960756482590.post-68490141217162431352009-11-19T13:56:00.000-05:002010-12-07T14:38:56.703-05:00Using Vmatch to combine assemblies<h3>In praise of Vmatch</h3><br />
If I could take only one bioinformatics tool with me to a desert island, it would be <a href="http://www.vmatch.de/">Vmatch</a>.<br />
<br />
In addition to being a versatile alignment tool, Vmatch has many lesser-known features which also leverage its enhanced suffix array index. The dbcluster option allows Vmatch to cluster similar sequences using Vmatch index files. This method is about 1000x faster than attempting to align all sequences against each other, no matter how clever the algorithm.<br />
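Vmatch's index is far more sophisticated than this, but the basic bargain - pay once to build a sorted index over the sequences, then answer each substring query with a binary search rather than a scan - can be sketched in a few lines of Python (a toy illustration, not Vmatch's actual algorithm):

```python
def suffix_array(text):
    # sort the starting positions of every suffix of text (built once)
    return sorted(range(len(text)), key=lambda i: text[i:])

def contains(text, sa, pattern):
    # binary search for the leftmost suffix >= pattern,
    # then check whether pattern is a prefix of that suffix
    lo, hi = 0, len(sa)
    while lo < hi:
        mid = (lo + hi) // 2
        if text[sa[mid]:sa[mid] + len(pattern)] < pattern:
            lo = mid + 1
        else:
            hi = mid
    return lo < len(sa) and text[sa[lo]:sa[lo] + len(pattern)] == pattern

contigs = "AATTCAGTTGAAGTAATAGAGG"   # imagine all contigs concatenated
sa = suffix_array(contigs)           # the expensive step, done once
print(contains(contigs, sa, "GTTGAAG"))  # True
```

Building the index dominates the cost; after that, every query is a logarithmic-time lookup, which is where the dramatic speedup over all-vs-all alignment comes from.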
<br />
Originally presented as a means to cluster EST sequences, this option is useful for processing the output from de novo short read assemblers like Velvet and ABYSS. Vmatch -dbcluster allows us to easily create a non-redundant set of contig sequences originating from several assemblies.<br />
<br />
<h3>Why would we want to combine assemblies? </h3><br />
<h4>Every kmer is sacred</h4><br />
<img style="margin: 0pt 10px 10px 0pt; float: left; cursor: pointer; width: 320px; height: 313px;" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEgbxtRzh08vkPF3A1qZo0ZXMEXoOw5jnBe8qwNBPwbFOg9mUPSy3qcVIFcOg1WYa7N5uDO7Imv4loThdXnXouLZxxpLx_pHhQE2KmI0eulf4j7RYo-ciMXudo1Hw6DDTm3rapI-YDv_mpc/s640/svarP1.png" alt="" id="BLOGGER_PHOTO_ID_5405584666748028370" border="0" /><p>Why not just take the one with the highest N50? At the recent Genome Informatics meeting at CSHL, Inanc Birol (ABYSS) said "every kmer is sacred". Our experience with the de novo transcriptome assembly of plants has been that <span style="font-weight: bold;">the best set of contigs is spread out all over the parameter space</span>. Longer kmer and cvCut settings can produce a longer set of elite contigs at the expense of omitting lowly expressed (or spurious?) contigs that appear at less stringent settings.</p><br />
<br />
<div class="clear"></div><br />
<br />
<h4>Reads held hostage</h4><br />
<img style="margin: 0pt 10px 10px 0pt; float: right; cursor: pointer; width: 320px; height: 316px;" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEinO0uJ8IbeQ6G4Ap9w0kXsLx98THJMu1CTljHdez3QZcY6QOjc9Py-WZjD29BNvazVyAqZCrUEB6xUfRt8JiJDxKhvUSYTIyLjijFIjZ2u1Tb1Str8SUL5cUsPHzGvZ6o4kKFb7GCMVKs/s640/svarP2.png" alt="" id="BLOGGER_PHOTO_ID_5405585865730142370" border="0" /><p>To the point above, I would add "every cvCut is also sacred". That doesn't exactly roll off the tongue, but we have seen some instances in which an assembly will have a higher read usage at a cvCut of 10 than at 5. This paradox suggests there is a <span style="font-weight: bold;">"contig read hostage"</span> situation, in which reads critical to the formation of longer contigs can be held captive in low coverage short contigs. Raising the cvCut threshold frees up these critical reads to extend more plausible contigs and allowing additional reads from the pool to be recruited into the assembly.<br />
</p><br />
<div class="clear"></div><br />
<h3>What is a non-redundant set?</h3><br />
The following sequences have some redundancies. NODE2 and NODE3 do not add any information.<br />
The non-redundant set will have only NODE1 and NODE4.<br />
<pre>>NODE1
AATTCAGTTGAAGTAATAGAGGCAGCTGCTGTTAGAACTTCGCTACGACTAGCAACGCTATGTCGAGTTGTACCTTCCCACCTCGTATACAAGGAGCATGAAGTCATCAGCC
CTTTCTACAGAATCTAGGTCCGAAATGATGAAATTAAGAGAAAGACAACTGAAGGTTTTAGGAGGCAAGGTTTACAGGGTAATGAACACGGAAAAACCCACAAGCTAGGAAC
AGTGTGTCTTGGAGTTTAAACTGATTTGGTAGTAGTTCGAAAACAAATTGGAAGGGACATTTAAAGTCCGAGTTGACGTTATCTGAGACAACTTTGTCTTTAACCGACAGGG
AGTTGAGGTAGGAGAGAGTGTCCACATATTTAACGTTGTTCAGATATGGGATGTAGCAGTTGTAACCGAAGCATGTAGGAAGGTTAAAGGGGTCCATACCTCTATTCTAGTC
CCGAAGGTTGGTAGCTAGACTCGGTGACCTAAAATGAGAACGGAAGAAACGGAGGTGACATCCATGGGGCTTGTCGTATATCCAATCATACCTTTGGAGAAGGAAGTTAAAG
GTCAAAACTTTAAAAACCATGAGGACCATTTTATCCTCGTCACTGTCGACTATGGTGAAGTGACCTAGCAGGTTTGTGTAGTTACTGTTTTGTAAGTGTAAAGTGCGTTGCT
GGCTATAGGAGTTTCGCATGAAACATGCTCGCTCTTGTGACCCATCGGTTACCAAGTTTACAGACTAGGTGAGGACTAGGTCACCTCTGTTTGGTAACCAAAAAGTAGAGAA
AGAATATCAACAAGGTATATACAACAACTTAGAGTAGCAAACCTTAAAAGAGTTGTCGTCAGTCACAAAGGGAGAGTTGTACTAAAGGGAACTTTTGTTCACAGAGCTAAAA
CTCATTAAGACCATTTTAATGTGGTCTACCAAGTCTGTCGAAAAGAAGTGGATGCTACGGCAGA
>NODE2 a substring of NODE1
AATTCAGTTGAAGTAATAGAGGCAGCTGCTGTTAGAACTTCGCTACGACTAGCAACGCTATGTCGAGTTGTACCTTCCCACCTCGTATACAAGGAGCATGAAGTCATCAGCC
CTTTCTACAGAATCTAGGTCCGAAATGATGAAATTAAGAGAAAGACAACTGAAGGTTTTAGGAGGCAAGGTTTACAGGGTAATGAACACGGAAAAACCCACAAGCTAGGAAC
AGTGTGTCTTGGAGTTTAAACTGATTTGGTAGTAGTTCGAAAACAAATTGGAAGGGACATTTAAAGTCCGAGTTGACGTTATCTGAGACAACTTTGTCTTTAACCGACAGGG
AGTTGAGGTAGGAGAGAGTGTCCACATATTTAACGTTGTTCAGATATGGGATGTAGCAGTTGTAACCGAAGCATGTAGGAAGGT
>NODE3 a reverse complement substring of NODE1
CGACAGACTTGGTAGACCACATTAAAATGGTCTTAATGAGTTTTAGCTCTGTGAACAAAAGTTCCCTTTAGTACAACTCTCCCTTTGTGACTGACGACAACTCTTTTAAGGT
TTGCTACTCTAAGTTGTTGTATATACCTTGTTGATATTCTTTCTCTACTTTTTGGTTACCAAACAGAGGTGACCTAGTCCTCACCTAGTCTGTAAACTTGGTAACCGATGGG
TCACAAGAGCGAGCATGTTTCATGCGAAACTCCTATAGCCAGCAACGCACTT
>NODE4 a new sequence
TTACGAACGATAGCATCGATCGAAAACGCTACGCGCATCCGCTAAGCACTAGCATAATGCATCGATCGATCGACTACGCCTACGATCGACTAGCTAGCATCGAGCATCGATC
AGCATGCATCGATCGATCGAT
</pre><br />
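Put another way, a contig is redundant when it, or its reverse complement, is contained in another contig. A quick Python sketch of that test on toy sequences (illustration only; this is not how dbcluster decides):

```python
def revcomp(seq):
    # reverse complement of a DNA sequence
    return seq.translate(str.maketrans("ACGT", "TGCA"))[::-1]

def is_redundant(candidate, reference):
    # true if candidate, on either strand, is contained in reference
    return candidate in reference or revcomp(candidate) in reference

node1 = "AATTCAGTTGAAGTAATAGAGGCAGCTGCTGTTAGAACTTCG"
node2 = "CAGTTGAAGTAATAGAGG"               # a substring of node1
node3 = revcomp("GCAGCTGCTGTTAGAACT")      # a reverse complement substring of node1
node4 = "TTACGAACGATAGCATCGATCGAAAACGCTA"  # a new sequence

print(is_redundant(node2, node1), is_redundant(node3, node1), is_redundant(node4, node1))
```

Vmatch handles the reverse strand for you via the -p (palindromic) option discussed below.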
<strike><h3>How do we create a non-redundant set?</h3><br />
<h4>Do not use -nonredundant</h4><br />
No, really. There is an option called -nonredundant which should presumably do what we want, but unfortunately that writes a "representative" member of each cluster to a file, which may or may not be the longest contig. I'm not sure what makes a sequence representative of a cluster, but for this application we want the <span style="font-style: italic;">longest</span> member of each cluster.</strike><br />
<em><br />
On April 29th 2010, Vmatch 2.1.3 was released. The most important change is that the option -nonredundant now delivers the longest sequence from the corresponding cluster (instead<br />
of an unspecified representative). This should make the longestSeq.pl approach unnecessary.</em><br />
<br />
To create a non-redundant set we will produce cluster files and then extract the longest sequence from each file. Use the following commands to produce your cluster files from an index:<br />
<pre class="brush: shell">mkvtree -allout -pl -db contigs1.fa contigs2.fa -dna -indexname myIndex
#mkvtree will accept multiple fasta or gzipped fasta files
#if using Vmatch 2.1.3 or later:
vmatch -d -p -l 25 -dbcluster 100 0 -v -nonredundant nonredundantset.fa myIndex > mySeqs.rpt
# thx to Hamid Ashrafi for debugging this syntax
#if using older Vmatch
mkdir sequenceMatches
#this is where vmatch will put each cluster
vmatch -d -p -l 25 -dbcluster 100 0 mySeqs "(1,0)" -v -s myIndex > mySeqs.rpt
mv mySeqs*.match mySeqs*.fna sequenceMatches</pre><br />
<h3>What do these options do?</h3><br />
<ul><li>-d direct matches (forward strand)</li>
<li>-p palindromic matches (reverse strand)</li>
<li>-l search length (set this below your shortest sequence)</li>
<li>-dbcluster queryPerc targetPerc - runs an internal alignment of the index created by mkvtree.<br />
<ul><li>The two numeric arguments specify what percentage of your query sequence (the smaller) is involved in an alignment to the cluster sentinel/target superstring. For our purposes we require 100% of our query sequence substring to match the target. We don't care what percentage of the target is aligned, so set the second parameter to 0.<br />
</li>
<li>The third argument to dbcluster is the index prefix name that vmatch will give to the THOUSANDS of cluster .fna and .match files it will create.</li>
<li>The fourth argument "(1,0)" specifies that we want to keep singletons (in a file called mySeqs.single.fna) and that there is no limit to the number of sequences in an acceptable cluster.</li>
</ul><br />
</li>
<li>-v verbose report (redirected to the .rpt file)</li>
<li>-s create the individual cluster fasta files and match reports</li>
</ul><br />
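As I read the manual, the two -dbcluster percentages amount to a simple acceptance rule; a hypothetical Python paraphrase (my interpretation, not Vmatch code):

```python
def accept_into_cluster(match_len, query_len, target_len,
                        min_query_pct=100, min_target_pct=0):
    # -dbcluster Q T: at least Q% of the query and T% of the target
    # must be involved in the alignment
    return (100.0 * match_len / query_len >= min_query_pct and
            100.0 * match_len / target_len >= min_target_pct)

# a 150 bp contig fully contained in a 400 bp contig joins the cluster
print(accept_into_cluster(150, 150, 400))  # True
print(accept_into_cluster(140, 150, 400))  # False: only ~93% of the query matched
```

With 100 and 0, any contig fully contained in a longer one joins that cluster, which is exactly the redundancy we want to collapse.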
<br />
Consult the vmatch manual for fuzzy matches and more examples:<br />
<a href="http://www.zbh.uni-hamburg.de/vmatch/virtman.pdf">http://www.zbh.uni-hamburg.de/vmatch/virtman.pdf</a><br />
<br />
The standard output details the clusters and singlets.<br />
<pre>0:761437: NODE_83_length_31_cov_22.451612857642:
NODE_106_length_31_cov_22.451612621409: NODE_152_length_29_cov_27.758621749981:
NODE_185_length_29_cov_27.7586211:761861: NODE_531_length_424_cov_26.851416805590:
NODE_1413_length_320_cov_28.187500837480: NODE_1236_length_320_cov_28.187500858407:
NODE_937_length_320_cov_28.187500765510: NODE_1542_length_108_cov_34.870369621979:
NODE_786_length_425_cov_32.602352750915: NODE_1290_length_321_cov_34.392525
</pre><br />
<h3>Extract the longest sequence from your cluster files</h3><br />
<em>If using Vmatch 2.1.3 or later, this is unnecessary.</em><br />
Here is a Perl script, which we will call longestSeq.pl, to do that:<br />
<pre class="brush: perl">#!/usr/bin/perl
#print the longest sequence in a fasta file
use strict;my %seqs;my $defline;
while (<>) {
chomp;
if ( /^#/ || /^\n/ ) {
#comment line or empty line do nothing
}elsif (/>/) {
s/>//g;$defline = $_;
$seqs{$defline} = "";
}elsif ($defline) {
$seqs{$defline} .= $_;
}else{
die('invalid FASTA file');
}
}
my $max = $defline;
foreach my $def ( keys %seqs ) {
$max = ( length( $seqs{$def} ) > length( $seqs{$max} ) ) ? $def : $max;
}
print ">" . $max . "\n" . $seqs{$max} . "\n";</pre><br />
We want the longest member of each cluster and all sequences in the singletons file:<br />
<pre class="brush: shell">for f in sequenceMatches/*fna;
do
if [ "$f" = "sequenceMatches/mySeqs.single.fna" ];
then
cat $f >> mySeqs_longest_seqs.fa;
else perl longestSeq.pl $f >> mySeqs_longest_seqs.fa;
fi;
done
</pre><br />
That's it - now you have a comprehensive and non-redundant set of the longest contigs from a number of assemblies.Jermdemohttp://www.blogger.com/profile/01662705354227625640noreply@blogger.com6tag:blogger.com,1999:blog-8532065960756482590.post-82237558430476719632009-11-04T12:54:00.000-05:002010-04-14T15:51:28.071-04:00R's xtabs for total weighted read coverageSamtools and its BioPerl wrapper Bio::DB::Sam prefer to give read coverage on a depth per base pair basis. This is typically an array of depths, one for every position that has at least one read aligned. OK, works for me. But how can we quickly see which targets (in my case transcripts) have the greatest total weighted read coverage (i.e. the sum over every base pair of every read that aligned)?<br /><br />My solution is to take that target-pos-depth information and import a table into R with at least the following columns:<br />targetName<br />depth<br /><br />I added the pos column here to emphasize the base-pair granularity:<br /><pre> tx pos depth<br />1 tx500090 227 1<br />2 tx500090 228 1<br />3 tx500090 229 1<br />4 tx500090 230 1<br />5 tx500090 231 1<br />...<br />66 tx500123 184 1<br />67 tx500123 185 1<br />68 tx500123 186 1<br />69 tx500123 187 2<br />70 tx500123 188 2<br />71 tx500123 189 2</pre><br />In R:<br /><pre>myCoverage<-read.table("myCoverage.txt",header=TRUE)<br />myxTab<-xtabs(depth ~ tx,data=myCoverage)</pre><br /><a href="http://stat.ethz.ch/R-manual/R-patched/library/stats/html/xtabs.html">xtabs</a> will sum up depth-weighted positions by default (I suppose this is what tabulated contingency really means) and return an unsorted list of transcripts and their weighted coverage (total base pair read coverage):<br /><pre>> myxTab[1:100]<br />tx<br />tx500090 tx500123 tx500134 tx500155 tx500170 tx500178 tx500203 tx500207<br /> 38 92 610 46 176 46 92 130<br />tx500273 tx500441 tx500481 tx500482 tx500501 tx500507 tx500667 tx500684<br /> 76 2390 114 71228 762 222 542 442<br />tx500945 tx500955 tx501016 
tx501120 tx501127 tx501169 tx501190 tx501192<br /> 1378 3604 46 46 420 854 130 352<br />tx501206 tx501226 tx501229 tx501245 tx501270 tx501297 tx501390 tx501405<br /> 244 1204 206 15926 214 46 168 46<br />tx501406 tx501438 tx501504 tx501694 tx501702 tx501877 tx501902 tx502238<br /> 38 2572 7768 3274 314 298 84 198<br />tx502320 tx502364 tx502403 tx502414 tx502462 tx502515 tx502517 tx502519<br /> 122 38 588 46 46 38 38 466<br />tx502610 tx502624 tx502680 tx502841 tx502882 tx503090 tx503192 tx503204<br /> 206 38 168 3750 38 122 76 92<br />tx503416 tx503468 tx503523 tx503536 tx503571 tx503578 tx503623 tx503700<br /> 260 38 168 38 46 46 84 38<br />tx503720 tx503721 tx503722 tx503788 tx503872 tx503892 tx503930 tx503970<br /> 97112 38 38 4708 38 38 1290 84<br />tx503995 tx504107 tx504115 tx504346 tx504353 tx504355 tx504357 tx504398<br /> 46 152 206 46 3416 1402 122 290<br />tx504434 tx504483 tx504523 tx504589 tx504612 tx504711 tx504751 tx504827<br /> 290 8728 176 46 46 76 5644 1308<br />tx504828 tx504834 tx504882 tx504931 tx504952 tx505017 tx505029 tx505078<br /> 2336 328 46 34138 1000 1838 46 474<br />tx505123 tx505146 tx505159 tx505184<br /> 38 123344 160 588</pre><br />This is approximately 10000x faster than using a formula like:<br /><pre>by(myCoverage,myCoverage$tx,function(x){sum(x$depth)})</pre>Jermdemohttp://www.blogger.com/profile/01662705354227625640noreply@blogger.com2tag:blogger.com,1999:blog-8532065960756482590.post-69850021857573974442009-10-23T15:22:00.000-04:002010-04-17T15:01:10.413-04:00Installing Bio::DB::Sam from CPANBio::DB::Sam is Lincoln Stein's BioPerl API to the SamTools package.<br /><br /><span style="font-weight: bold;">Installing via CPAN might skip a necessary question that will cause it to fail.</span><br /><pre><br />cpan> install Bio::DB::Sam<br />...<br />...<br />DIED. 
FAILED tests 1-93<br /> Failed 93/93 tests, 0.00% okay<br />Failed Test Stat Wstat Total Fail Failed List of Failed<br />-------------------------------------------------------------------------------<br />t/01sam.t 2 512 93 186 200.00% 1-93<br />Failed 1/1 test scripts, 0.00% okay. 93/93 subtests failed, 0.00% okay.<br />make: *** [test_dynamic] Error 2<br />/usr/bin/make test -- NOT OK<br />Running make install<br />make test had returned bad status, won't install without force<br /></pre><br /><br />Navigate to where CPAN has downloaded the .gz file.<br />A closer examination reveals that the Build.PL file wants to know where the SamTools header files are located.<br /><pre><br />Please enter the location of the bam.h and compiled libbam.a files:<br /></pre><br /><span style="font-weight: bold;">I have no idea how to pass these arguments using CPAN. I would just avoid this method of installation.</span> <span style="font-weight: bold;">Do the local build instead.</span>Jermdemohttp://www.blogger.com/profile/01662705354227625640noreply@blogger.com2tag:blogger.com,1999:blog-8532065960756482590.post-40307524422441951622009-10-12T16:46:00.000-04:002010-04-17T15:00:57.494-04:00Disable file locking in Eclipse for OS XEclipse will refuse to use a workspace on an automounted OS X Server home directory.<div><pre>Workspace in use or cannot be created</pre></div><div><br /></div><div>To remedy this problem do the following:</div><div><ul><li>Right click the Eclipse application and select "Show Package Contents"</li><li>Contents->MacOS</li><li>Edit the eclipse.ini file in a text editor</li><li>Add -Dosgi.locking=none to the line below -vmargs</li></ul><br /><img src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEj5RFGbBySNhHA2V-rTCO_QcCMvEviXzvtoi6qQH0_mQbM3wrq8mSj2KfyCo2qJEkEKz4UJA5rYH-npb-sWwABmWtb5G4dkCQgKxboX_nyg7kUF_Yuyvm_WkkGco7seY20kyoHTBUHQXsE/s320/Picture+17.png" border="0" alt="" id="BLOGGER_PHOTO_ID_5391819444924538290" /><img style="cursor:pointer; 
cursor:hand;width: 320px; height: 170px;" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEhleEXcAmf3nBsKk8W7GEoJPHF8WFqQrx4UwSiBGsfDhBIK-EvSyYqbQ_U_sYh6Uaf0MY2YoEZLCy_NZH3U2DPhFT8TbAgIb7c4CaO8SiQLKMfEXgUsP3KTLsT1Ir9zsrs26q7b8UOXm34/s320/Picture+18.png" border="0" alt="" id="BLOGGER_PHOTO_ID_5391820040896419890" /><div><br /></div></div>Jermdemohttp://www.blogger.com/profile/01662705354227625640noreply@blogger.com2tag:blogger.com,1999:blog-8532065960756482590.post-63371732915021386782009-09-23T20:18:00.000-04:002010-04-17T15:00:37.048-04:00My 2009 Bridge-to-Bridge Experience<a onblur="try {parent.deselectBloggerImageGracefully();} catch(e) {}" href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEioxtfMheEqWQyUBnaHPiMQ-BTgL4vNCGt71nTbWUWNIvnm4GW8Df61eo0t30mjWjMq3VfVvTJcM-nl5Uu1yViLoY4Uf0rqIp88MRqVBdcoN2feeet-sLax8mK9k5NWfV2mKUlcM0F3dq0/s1600-h/DSC_0060.JPG"><img style="margin: 0pt 10px 10px 0pt; float: left; cursor: pointer; width: 400px; height: 266px;" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEioxtfMheEqWQyUBnaHPiMQ-BTgL4vNCGt71nTbWUWNIvnm4GW8Df61eo0t30mjWjMq3VfVvTJcM-nl5Uu1yViLoY4Uf0rqIp88MRqVBdcoN2feeet-sLax8mK9k5NWfV2mKUlcM0F3dq0/s400/DSC_0060.JPG" alt="" id="BLOGGER_PHOTO_ID_5384827527818904402" border="0" /></a>Bridge-to-Bridge is a 105-mile ride up to the top of <a href="http://en.wikipedia.org/wiki/Grandfather_Mountain">Grandfather Mountain</a>, one of the highest peaks in the Blue Ridge mountains.<br /><br />Although B2B is not an officially sanctioned race, the organizers conduct it just as professionally (with the exception of neutral support). Cops manage the major crossings, volunteers provide hand-offs at the dozen feed stations, and the event is officially timed with the aid of some magnetic shoe things.<br /><br />I last did this ride in 1999. 
Now 10 years older but about 10 pounds lighter I had somehow forgotten how much suffering was involved and figured this was a good time to tackle the challenge, despite falling ill a couple of weeks beforehand.<br /><br />Due to constant rain and very heavy fog, this year was utter torture for the 299 finishers and 371 <span style="font-style: italic;">non-finishers</span> who braved the elements. I believe there may have been another 130 <span style="font-style: italic;">non-starters</span> who stayed in their hotel rooms enjoying the Golden Girls marathon on tv. While I'm sure I would have felt some sense of accomplishment doing that, I was obligated to finish this ride as we had already driven down from Philadelphia (en-route to a wedding in Nashville the following weekend).<br /><a onblur="try {parent.deselectBloggerImageGracefully();} catch(e) {}" href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEiELWFdQ7mxbZvJXxWLogTLSwoXReZtlWCBFiXU-tqvLxNYGyBT3zA2FNw9MfhdttHdmDCiBccl7C_Rm8bDOEevzL7Bv_zrHtiQFlH_kBfptYLJPVIFoPn4cx8MTrOwaP00OqIxaFdv6Dw/s1600-h/DSC_0011.JPG"><img style="margin: 0pt 0pt 10px 10px; float: right; cursor: pointer; width: 320px; height: 213px;" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEiELWFdQ7mxbZvJXxWLogTLSwoXReZtlWCBFiXU-tqvLxNYGyBT3zA2FNw9MfhdttHdmDCiBccl7C_Rm8bDOEevzL7Bv_zrHtiQFlH_kBfptYLJPVIFoPn4cx8MTrOwaP00OqIxaFdv6Dw/s320/DSC_0011.JPG" alt="" id="BLOGGER_PHOTO_ID_5384834869777901266" border="0" /></a><br />The Grandfather Mountain staff said riders could not enter the park before 3pm. To accommodate both the riders and this odd rule two start times were offered - 10 and 11 am, with slower riders encouraged to start first. To me this was a welcome change from the ungodly pre-dawn start times of most big rides. Riders were advised to start at the later time if they estimated they would be pushing the 3pm threshold.<br /><br />I was on the fence about which group to join. 
I saw several very fit looking riders and expensive bikes in the 10am pack. I was still riding the same Litespeed I used in 1999. The word going around was that more rain was on the way (this turned out to be true for everyone). In the end I felt the risk of having an inexperienced rider fall in front of me to be the deciding factor to go with the second group. I knew I would have to pass several of that first group anyway but it would be on the later climbs instead of the early rollers. Ironically I almost got clipped by some idiot drifting carelessly in our pack. At the end there was considerable overlap in finishing times between the two groups.<br /><br />The two times I did this ride ('98 and '99) I got dropped by the leaders on the 13 mile climb up NC181, then spent the rest of the ride either riding alone or with a couple other guys. Despite my best efforts this year turned out roughly the same except I stuck with that front group for about half the climb instead of just the first couple miles. Remind me to buy a compact crank.<br /><br />This climb is very difficult psychologically - a relentless slog up a roughly-paved 4 lane highway. I was not very familiar with the profile and prematurely thought I had crested three times - each time putting in a kick over the "top". The fog would clear and I would see yet another rise.<br /><br />Feeling dispirited and exhausted, I nearly froze to death on the descents of 181 and the Blue Ridge Parkway and was eventually caught by a small group that stuck together until the ascent of Grandfather. I was amazed by how few words were exchanged in that group during the hour or so we traded pulls through the fog and rain, which only got worse as we neared the finish. 
One guy did say something that stuck with me, "It's like we're riding through a horror movie."<br /><br />After crawling to the finish in a 39x27, I was very fortunate that Mary Ellen had the foresight to drive up to the summit well in advance to meet me with a warm car. I thanked her by singing the Golden Girls theme song all the way to the hotel.<br /><br /><ul><br /><li>B2B '09 Results:<a href="http://www.caldwellcochamber.org/support/pagepics/09Bridge.txt"><br />http://www.caldwellcochamber.org/support/pagepics/09Bridge.txt</a></li><li>An account by Bruce Humphries (1st place)<a href="http://dieseldiaries.com/hom/?p=232"><br />http://dieseldiaries.com/hom/?p=232</a></li><li>Some more blog posts on the '09 ride:</li><ul><li><a href="http://fraught.wordpress.com/2009/09/22/theres-blue-sky-ahead/">http://fraught.wordpress.com/2009/09/22/theres-blue-sky-ahead/</a></li><li><a href="http://twentystone.blogspot.com/2009/09/2009-bridge-to-bridge-ride-report.html">http://twentystone.blogspot.com/2009/09/2009-bridge-to-bridge-ride-report.html</a></li><li><a href="http://khobama.blogspot.com/2009/09/and-saga-continues.html">http://khobama.blogspot.com/2009/09/and-saga-continues.html</a></li><li><a href="http://grimesjoseph.blogspot.com/2009/09/shenandoah-mountain-100-2009-bridge-to.html">http://grimesjoseph.blogspot.com/2009/09/shenandoah-mountain-100-2009-bridge-to.html</a><br /></li></ul></ul>Jermdemohttp://www.blogger.com/profile/01662705354227625640noreply@blogger.com2tag:blogger.com,1999:blog-8532065960756482590.post-43750081007775160942009-09-16T12:20:00.000-04:002009-09-18T17:11:14.340-04:00Printing a specific line from bash_historyOften I want to archive specific commands out of my immediate bash history to a batch script that I can run later.<br />Unfortunately I could find no way of redirecting !# (where # is the bash history line I wish to save, e.g. !58 executes line 58) to a file. 
There is the "colon p" option - where !#:p will print the command instead executing it, but I could not redirect or pipe that output either.<br /><br />So I added this one-liner to a bin directory in my path:<br /><br /><pre class="brush: shell"><br />#!/bin/bash<br />cat $HISTFILE | sed -n "$1p"<br /></pre><br /><br />I call it <span style="font-family: courier new;">getHistLine</span>. So now to save line 58 to a batch script I can just type:<br />getHistLine 58 >> myBatchScript.sh<br /><br />To save a range of lines I can type<br />getHistLine 50,58 >> myBatchScript.shJermdemohttp://www.blogger.com/profile/01662705354227625640noreply@blogger.com0tag:blogger.com,1999:blog-8532065960756482590.post-78834581603315543002009-08-26T23:27:00.000-04:002010-04-17T15:06:43.998-04:00Charles Dickens de Bruijn Graph<a onblur="try {parent.deselectBloggerImageGracefully();} catch(e) {}" href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEgbnGDaUbUf3_Y5bWT9YgiH_rqVN9zn-SiaSSThCSfoS9OcHVWSYJfsPSZrmhueAELkWjSyxHPsd0mDLup4zryan-00ubsZMRvh8sgYpSJWpYQEoYVjHpZddifT-GjYIjhmmarrG4Nh5dg/s800/dicken-debruijn.png"><img style="cursor: pointer; width: 800px; height: 314px;" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEgbnGDaUbUf3_Y5bWT9YgiH_rqVN9zn-SiaSSThCSfoS9OcHVWSYJfsPSZrmhueAELkWjSyxHPsd0mDLup4zryan-00ubsZMRvh8sgYpSJWpYQEoYVjHpZddifT-GjYIjhmmarrG4Nh5dg/s800/dicken-debruijn.png" alt="" border="0" /></a>Jermdemohttp://www.blogger.com/profile/01662705354227625640noreply@blogger.com0tag:blogger.com,1999:blog-8532065960756482590.post-89710604363587302062009-08-25T10:17:00.001-04:002010-04-17T15:02:29.138-04:00Standardized Velvet Assembly Report<a onblur="try {parent.deselectBloggerImageGracefully();} catch(e) {}" href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEgHrQd98u7IhVHhGsVD29ivy3zUScH11LP2b1VosTYa1ePKwlxoJvySGX30jswswPV6l6CXkJdixM3l9P9nCJ_QbejPu5eNHaCy5aLlf3GcKV0Wm_LXeZRMhOhqW71HoNgqPwSzbwk4nNU/s400/example.png"><img 
style="cursor:pointer; cursor:hand;width: 400px; height: 391px;" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEgHrQd98u7IhVHhGsVD29ivy3zUScH11LP2b1VosTYa1ePKwlxoJvySGX30jswswPV6l6CXkJdixM3l9P9nCJ_QbejPu5eNHaCy5aLlf3GcKV0Wm_LXeZRMhOhqW71HoNgqPwSzbwk4nNU/s400/example.png" border="0" alt="" /></a><br /><br /><a href="http://code.google.com/p/standardized-velvet-assembly-report/"><br />http://code.google.com/p/standardized-velvet-assembly-report/</a><br /><br />I finally got my Velvet Assembler report script up on google code. This "program" consists of some short scripts and a Sweave report designed to help Velvet users identify the optimal kmer and cvCut parameters.Jermdemohttp://www.blogger.com/profile/01662705354227625640noreply@blogger.com0tag:blogger.com,1999:blog-8532065960756482590.post-42998357415786211282009-08-13T15:33:00.000-04:002010-04-17T15:04:34.160-04:00GregorianCalendar foolishnessI wanted to add queries to my Grails application to find tasks that needed to be completed today, or were delinquent (due date < last midnight).<br />My application kept thinking things were delinquent the afternoon of the due date.<br /><br />The problem was that I neglected to read how HOUR_OF_DAY differed from HOUR and absentmindedly mixed the two.<br /><a href="http://java.sun.com/j2se/1.4.2/docs/api/java/util/Calendar.html">http://java.sun.com/j2se/1.4.2/docs/api/java/util/Calendar.html</a><br /><br />You can try to set HOUR to 0 but it will default to 12. 
If it is the afternoon when you request the Calendar object then it will assume you mean 12 noon.<br /><br /><pre class="brush: groovy"><br /><br />Calendar lastMidnight = Calendar.getInstance();<br />//DO NOT DO THIS!!<br /> lastMidnight.set( Calendar.HOUR, lastMidnight.getMinimum(Calendar.HOUR_OF_DAY ));<br /> ...snip...<br /></pre><br /><pre class="brush: groovy"><br /> //this is ok<br /> lastMidnight.set( Calendar.HOUR_OF_DAY, lastMidnight.getMinimum(Calendar.HOUR_OF_DAY ));<br /> ...snip...<br /></pre><br /><pre class="brush: groovy"><br /> //this is also ok<br /> lastMidnight.set( Calendar.AM_PM, Calendar.AM );<br /> lastMidnight.set( Calendar.HOUR, lastMidnight.getMinimum(Calendar.HOUR ));<br /> ...snip...<br /></pre><br /><br />The other settings are pretty self-explanatory<br /><pre class="brush: groovy"><br /> lastMidnight.set( Calendar.MINUTE, lastMidnight.getMinimum(Calendar.MINUTE));<br /> lastMidnight.set( Calendar.SECOND, lastMidnight.getMinimum(Calendar.SECOND));<br /> lastMidnight.set( Calendar.MILLISECOND,lastMidnight.getMinimum(Calendar.MILLISECOND));<br /></pre>Jermdemohttp://www.blogger.com/profile/01662705354227625640noreply@blogger.com0tag:blogger.com,1999:blog-8532065960756482590.post-32574110546007107872009-07-28T14:47:00.000-04:002010-04-17T15:00:12.421-04:00Beware! Groovy split and tokenize don't treat empty elements the sameGroovy's <span style="font-style:italic;">tokenize</span>, which returns a List, will ignore empty elements (when a delimiter appears twice in succession). <span style="font-style:italic;">Split</span> keeps such elements and returns an Array. 
If you want to use List functions but you don't want to lose your empty elements, then just use <span style="font-style:italic;">split</span> and convert your Array into a List in a separate step.<br /><br />This might be important if you are parsing CSV files with empty cells.<br /><br /><pre class="brush: groovy"><br />import groovy.util.GroovyTestCase<br /><br /><br />class StringTests extends GroovyTestCase {<br /><br /> protected void setUp() {<br /> super.setUp()<br /> }<br /><br /> protected void tearDown() {<br /> super.tearDown()<br /> }<br /><br /> void testSplitAndTokenize() {<br /> assertEquals("This,,should,have,five,items".tokenize(',').size(),5)<br /> assertEquals("This, ,should,have,six,items".tokenize(',').size(),6)<br /><br /> assertEquals("This, ,should,have,six,items".split(',').size(),6)<br /> assertEquals("This,,should,have,six,items".split(',').size(),6)<br /><br /> //convert array to List and re-evaluate<br /> def fieldArray = "This,,should,have,six,items".split(',')<br /> def fields=fieldArray.collect{it}<br /> assert fields instanceof java.util.List<br /> assertEquals(fields.size(),6)<br /> }<br />}<br /></pre>Jermdemohttp://www.blogger.com/profile/01662705354227625640noreply@blogger.com5