tag:blogger.com,1999:blog-85320659607564825902024-03-20T04:24:02.887-04:00JermdemoMostly bioinformatics, NGS, and cat litter box reviewsJermdemohttp://www.blogger.com/profile/01662705354227625640noreply@blogger.comBlogger41125tag:blogger.com,1999:blog-8532065960756482590.post-50180565558101534772012-12-27T16:08:00.000-05:002014-06-05T14:26:00.008-04:00The Reproducible Research Guilt Trip May Finally Be Paying Off<h4>
We might be closer to killing off the "Just take my word for it - I'm pretty sure I did this right" methods section</h4>
<div>
<br /></div>
There is no shortage of well-reasoned articles filled with persuasive arguments about the need for higher reproducibility standards in the scientific literature. For all the good posts about the virtues of reproducible research, they all boil down to one overarching concept:<br />
<br />
<br />
<br />
<div class="separator" style="clear: both; text-align: center;">
<a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEjQv_qhALQlWpW11aPwjAgsJ07B5tc-cvb8aaezv_HZrakJk3IG2PchViuPDbjpfMpH51XFlxWlaTbvpG7DhoDOkeROapfr83hMakW_hl9erlkZvdiALhsDIv0r0hTGxBSIdWzf7hhmq7k/s1600/writeshitdown.jpg" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img alt="write shit down" border="0" height="320" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEjQv_qhALQlWpW11aPwjAgsJ07B5tc-cvb8aaezv_HZrakJk3IG2PchViuPDbjpfMpH51XFlxWlaTbvpG7DhoDOkeROapfr83hMakW_hl9erlkZvdiALhsDIv0r0hTGxBSIdWzf7hhmq7k/s320/writeshitdown.jpg" title="" width="242" /></a></div>
<br />
<br />
Why is this even an issue? Biologists in particular seem to be collectively and subconsciously reacting to those awful General Chemistry labs where they had you copy down pages of instructions verbatim into your lab notebook. It should come as no surprise that bioinformatics is ground zero for reproducibility activism.<br />
<br />
It is unfortunate that reproducible research is tied up with all sorts of other holier-than-thou practices: open access, open source, open data, literate programming, blogging, functional programming. This all-encompassing evangelism tends to polarize people. While wonky über-programmers like C. Titus Brown lay out <a href="http://ivory.idyll.org/blog/replication-i.html">fundamental practices for reproducibility</a>, most PIs have been publicly giving lip service to the idea of reproducible research, betraying an "I don't wanna eat my vegetables"-type disdain. There are now "<a href="http://biotest.cgrb.oregonstate.edu/">consortia</a>" and an "<a href="https://www.scienceexchange.com/reproducibility">initiative</a>" to compel scientists to actually write their shit down, preferably with door prizes. If you think this has a "<a href="http://www.youtube.com/watch?v=dDjBsOkidek">posture pals</a>" (video) feel to it, you're not alone. And while the number of pro-RR articles has steadily increased, few researchers have taken them to heart.<br />
<br />
This head-against-wall bashing has been the pattern for many years: better tools are now available (RStudio, knitr, Galaxy, cloud computing, figshare, GitHub, Bitbucket) and there is more rah-rah from the blogosphere, but little enforcement from major journals. Now a recent development has raised my hopes, because it indicates editors have been tightening the screws enough to cause discomfort:<br />
<br />
<i>People have actually started to argue against reproducible research!</i><br />
<i><br /></i>
<br />
<div class="separator" style="clear: both; text-align: center;">
<a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEgNUKgQ6_XiJM0aJFrg71wOCqQJFFjgi_4TuZvwh_Nk5DhI3_imwbwQvLDS2S6X-pRgRlGGCDJAyDny3kkBx5Y9c5kceVRpMFkIT_7fJC_yokJmzqet7R7y_mn8a_loZJhRp1V3ZTbrkcs/s1600/thinker2.jpg" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" height="255" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEgNUKgQ6_XiJM0aJFrg71wOCqQJFFjgi_4TuZvwh_Nk5DhI3_imwbwQvLDS2S6X-pRgRlGGCDJAyDny3kkBx5Y9c5kceVRpMFkIT_7fJC_yokJmzqet7R7y_mn8a_loZJhRp1V3ZTbrkcs/s400/thinker2.jpg" width="400" /></a></div>
<h4>
Hearts and Minds</h4>
<br />
The founder of the irreplicability movement is Christopher Drummond, author of “<a href="http://cogprints.org/8675/">Reproducible Research: a Dissenting Opinion</a>”. I will attempt to paraphrase his arguments here:<br />
<ol>
<li>Richard Feynman never had a Github account.</li>
<li>No one is really going to read your damn code anyway.</li>
<li>Writing shit down == A big drag, man.</li>
<li>The Anil Potti incident proves liars always lie about their Rhodes Scholarships first. We should crack down on curricula vitae, not veritas curat.</li>
</ol>
Drummond's points are challenged <a href="http://simplystatistics.org/2012/11/15/reproducible-research-with-us-or-against-us-3/">here</a> by statistician and <a href="https://www.coursera.org/course/compdata">Coursera</a> favorite Roger Peng.<br />
<br />
A precursor to the dissenting opinion article is Drummond's "<a href="http://cogprints.org/7691/7/icmlws09.pdf">Replicability is not Reproducibility: </a><a href="http://cogprints.org/7691/7/icmlws09.pdf">Nor is it Good Science</a>". A distinction is drawn between replicability and reproducibility: replicability, re-running an author's submitted data and code, is what the movement actually demands; reproducibility, independently arriving at the same conclusions, is what is generalizable and scientifically meaningful. In Drummond's view, requiring researchers to submit their data and code, i.e. replicable research, is a narrow concept really only useful for ferreting out scientific misconduct.<br />
<br />
<div style="text-align: left;">
<a href="http://www.flickr.com/photos/usfwsmtnprairie/8282763616/" style="clear: left; float: left; margin-bottom: 1em; margin-right: 1em;" title="Black-footed Ferret by USFWS Mountain Prairie, on Flickr"><img alt="Black-footed Ferret" height="132" src="http://farm9.staticflickr.com/8063/8282763616_3e4f7cb057.jpg" title="credit: Kimberly Tamkun / USFWS" width="200" /></a></div>
I would argue that ignorance of biological sequence analysis, and even more so statistics, is a bigger threat than the outright fraud seen in the Duke case. Most bioinformatics manuscripts feature analyses that are not replicable, which is frightening to consider when GWAS and exome NGS variant papers implicate so many genes in disease, many of them resting on a razor-thin p-value threshold tweaked by several incomprehensible cherry-picked program parameters.<br />
<br />
It is not clear science can efficiently <a href="http://blogs.nature.com/news/2012/12/is-the-scientific-literature-self-correcting.html">self-correct</a>. So while replicability is not reproducibility, reproducibility is too slow to substitute for replicability. A manuscript that describes real reproducible biological phenomena is essentially conjecture until it can be repeated. The greatest <a href="http://en.wikipedia.org/wiki/Ferret_legging">ferret-legger</a> the world has ever known will live in obscurity until they buy a ferret. We have a culture of scientists who refuse to buy a ferret.<br />
<br />
<h4>
Accounting for Tastes</h4>
<div>
<br /></div>
The other dissenting opinion (<a href="http://gasstationwithoutpumps.wordpress.com/2012/08/27/accountable-research-software/">here</a>) is from UCSC's Kevin Karplus, who replies to Iddo Friedberg's <a href="http://bytesizebio.net/index.php/2012/08/24/can-we-make-research-software-accountable/">post</a> recommending a panel of white coat mechanics to help biologists get their code ready for publication. Karplus raises two points:<br />
<ul>
<li>It is difficult to make polished software for others to use and that is not the point of research.</li>
<li>Replicability is not reproducibility.</li>
</ul>
Regardless of Friedberg's proposal, railing against "polished software" is simply a straw man argument. Reproducible research in <strike>2012 </strike>2013 does not mean robust, extensible, or even well-documented code. Most sequence analysis papers feature very little compiled code, relying instead on a series of executable programs glued together with scripting languages, producing intermediate data that is then digested into a report, often written in R.<br />
<br />
Getting these sequence analysis workflows to be reproducible will not require a highly skilled platoon of developers. Any willing researcher can submit a shell script or a build script of commands provided they avoid these common pitfalls:<br />
<ul>
<li dir="ltr">Using bioinformatics web applications with no web service capability</li>
<li dir="ltr">Using desktop bioinformatics software with no logging capability</li>
<li dir="ltr">Relying on proprietary institutional databases, perhaps with stored procedures that prove too unwiedly to dump</li>
<li dir="ltr">Using command line programs without a <a href="https://gist.github.com/1651133">directory-based bash history</a></li>
<li dir="ltr">Using Excel to manually manipulate data</li>
</ul>
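Any such script can also capture its own provenance. As a minimal sketch (the run_logged wrapper, the log file name, and sample1.fastq are hypothetical, not from any published pipeline), an R driver can log each external command it runs and record the session details used to build the report:

```r
# Hypothetical reproducible driver script: run each pipeline step through
# one wrapper that records the exact command line before executing it.
run_logged <- function(cmd, args, log = "commands.log") {
  line <- paste(c(cmd, args), collapse = " ")
  cat(line, "\n", file = log, append = TRUE, sep = "")
  system2(cmd, args, stdout = TRUE)
}

# Example step: echo stands in for an aligner or variant caller.
out <- run_logged("echo", c("aligning", "sample1.fastq"))

# Record the exact package versions used to build the report.
writeLines(capture.output(sessionInfo()), "sessionInfo.txt")
```

Anyone repeating the analysis then gets both the command history and the software versions for free.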
As our toolset and research community matures, these <strike>excuses</strike> obstacles will eventually disappear. But there is one scenario which will always be true in some of the more competitive arenas of bioinformatics programming (e.g. structure prediction, de novo assembly):<br />
<ul>
<li>The researcher was perfectly capable of submitting code but decided to retain a competitive advantage.</li>
</ul>
"Over-CASPed" researchers who are unwilling to divulge their secret sauce should be relegated to appropriate sandboxes.<br />
<br />
Replication does not prove a biological truth, but without it we often lack even fleeting proof that a scientist did what they said they did.<br />
<br />
Which brings us back to those damn chemistry labs. While many public access talk shows find <a href="http://www.philly.com/philly/blogs/evolution/Why-are-So-Many-Chemists-Creationists-.html">chemists willing to argue against evolution</a>, you would be hard pressed to find one who would argue against writing shit down.<br />
<br />
In other words: <i>Not writing shit down is an even worse idea than creationism.</i><br />
<br />
--<br />
<br />
<i>There, I blogged in 2012.</i>Jermdemohttp://www.blogger.com/profile/01662705354227625640noreply@blogger.com6tag:blogger.com,1999:blog-8532065960756482590.post-72067078856638169842012-02-20T21:44:00.000-05:002017-05-22T12:13:39.050-04:00AGBT: digesting disposable MinIONs in diasporaDespite my current <a href="http://www.biostars.org/user/list/">ranking of 15th in Biostar</a>, myriad page views of my <a href="http://jermdemo.blogspot.com/2011/06/big-ass-servers-and-myths-of-clusters.html">BAS</a>™ post (albeit mostly misdirected perverts), and positive response for my celebrated campaign against more microarray papers, for some reason I was not "comped" an all-expenses-paid trip as honorary blog journalist to this year's <a href="http://agbt.org/">Advances in Genome Biology and Technology</a>, which is kind of like CES for sequencing people, except AGBT is still worth attending. Normally the oversight would not bother me, as bioinformatics itself is not the focus of this meeting, but the flood of <a href="https://twitter.com/#%21/search/realtime/%23AGBT">#AGBT</a> tweets would not let me forget this fact and I was forced to stew and blog in envy.
<br />
<br />
<h4>
The first game changing disruptive revolutionary thing from England since 1964</h4>
Even from my distant perch it was obvious that all the scientific presentations at AGBT were overshadowed by a 17-minute showstopping demo from Clive Brown of Oxford Nanopore, a company that by all appearances would either die, focus on some minor stuff, or bring it. They chose the third option, and in so doing boosted the "<a href="http://www.google.com/trends/?q=clive">Clive index</a>" to unprecedented levels. OxN's recent decision to enlist famed geneticist and serial startup advisor George Church struck me as a huge gamble, as the string of Route 128 flameouts touting his name led me to assume long ago that Church had stowed away some cursed Tiki idol in his luggage like Bobby in that episode of the Brady Bunch. However, after reading up on OxN, I had to admit I was just bitter about Dr. Church's refusal to invest in my chain of <a href="http://www.polonator.org/">Polonator</a>-based paternity testing clinics, Yo Po'lonatizz!™<br />
<br />
Two new sequencer platforms were announced:<br />
<table cellpadding="0" cellspacing="0" class="tr-caption-container" style="float: right; margin-left: 1em; text-align: right;"><tbody>
<tr><td style="text-align: center;"><a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEiOwwzg6YpVyRUH4OEGQO-XN2QkVmxPsvuf3TiFF0Zjvr82PaBswkNGvopdxpiWpXEdKICbafrQ6n09k1C84zJ1d2YnKu_nMx5E83dI_F0p_JYGllIRTmnDumLjb4GcD2vIilijl4XP4QE/s1600/MinION_in_laptop.jpg" imageanchor="1" style="clear: left; margin-bottom: 1em; margin-left: auto; margin-right: auto;"><img border="0" height="213" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEiOwwzg6YpVyRUH4OEGQO-XN2QkVmxPsvuf3TiFF0Zjvr82PaBswkNGvopdxpiWpXEdKICbafrQ6n09k1C84zJ1d2YnKu_nMx5E83dI_F0p_JYGllIRTmnDumLjb4GcD2vIilijl4XP4QE/s320/MinION_in_laptop.jpg" width="320" /></a></td></tr>
<tr><td class="tr-caption" style="text-align: center;">A MinION. Forget to hit eject before removing this<br />
and you will instantly lose $900.</td></tr>
</tbody></table>
<ul>
<li>The MinION, a $900 "disposable" USB drive which detects minute changes in voltage incurred by the passage of DNA through a
robust and delicious lipid bilayer. Finally a device capable of sequencing filthy rabbit blood right on the spot!</li>
<li>The GridION system, a scalable rackmounted sequencer, which despite some lack of pricing clarity, should produce an actual $1000 15-minute human genome by 2013.</li>
</ul>
These exotic machines must be truly game-changing because they made properly expanding Albert Vilella's <a href="https://docs.google.com/spreadsheet/ccc?key=0AvaxS3m5rl-9dHdtUGRtaGlsZWNFNWJleDRXaUhQTHc">NGS sequencer spreadsheet</a> quite difficult. The MinION, in particular, could be viewed as a free device with $900 of consumables. This effectively lowers the bar to getting high-throughput sequence in the doctor's office to a 100% unamortized billable transaction. These things also claim <i>fucking unlimited read lengths.</i><br />
<br />
Expression microarrays, SAGE, 454, ABI SOLiD, and now Pacific Biosciences have all left bad tastes of uncertainty and dissatisfaction in the mouths of scientists. It is easy to disappoint people on a grand scale with a $700,000 machine, but $900 worth of chemicals in a USB drive is a different animal, and it seems likely this invention will find a following if it even delivers on a fraction of what it promises.<br />
<br />
<table cellpadding="0" cellspacing="0" class="tr-caption-container" style="float: left; text-align: left;"><tbody>
<tr><td style="text-align: center;"><a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEi33MMhm5V8Yh1u26z04rIZPlzOfF7Nwdyy6KcR_yqc09e0Zl78ahGn4LwupbYxKHNbp10Lkgi4Nj2voVwwTAmWmMFlZrAlLyTl3MXLy7l0NUK8uYF9oZSa5PG_FIDlHJHo37V_VD0C2vY/s1600/GridION_scaling.jpg" imageanchor="1" style="margin-left: auto; margin-right: auto;"><img border="0" height="170" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEi33MMhm5V8Yh1u26z04rIZPlzOfF7Nwdyy6KcR_yqc09e0Zl78ahGn4LwupbYxKHNbp10Lkgi4Nj2voVwwTAmWmMFlZrAlLyTl3MXLy7l0NUK8uYF9oZSa5PG_FIDlHJHo37V_VD0C2vY/s320/GridION_scaling.jpg" width="320" /></a></td></tr>
<tr><td class="tr-caption" style="text-align: center;">The GridION - put it in a rack or right on the floor.</td></tr>
</tbody></table>
Good information on this sequencer-on-a-stick is to be found at <a href="http://pathogenomics.bham.ac.uk/blog/2012/02/oxford-nanopore-megaton-announcement-why-do-you-need-a-machine-exclusive-interview-for-this-blog/">Nick Loman's blog</a>, <a href="http://www.genomesunzipped.org/2012/02/making-sequencing-simpler-with-nanopores.php">Genomes Unzipped</a>, <a href="http://www.nanoporetech.com/news/press-releases/view/39">and official press releases</a>. An excellent discussion of the nanopores themselves can be found at <a href="http://www.omespeak.com/blog/?p=507">Omically Speaking</a>.
<br />
<br />
<h4>
More cringeworthy marketing from the West coast </h4>
The Oxford Nanopore machines are so jaw-dropping, in fact, that <a href="http://www.forbes.com/sites/matthewherper/2012/02/18/who-doubts-the-usb-thumb-drive-sequencer-a-rival/">Jonathan Rothberg is already crying vaporware</a>. His complaints do seem warranted, given disappointments from past years' announcements and the lack of publicly available sequence from these devices.
<br />
<br />
Unfortunately Ion Torrent has spent all of its goodwill on an inane and hamfisted advertising war against Illumina's MiSeq, an intentionally crippled opponent. Seemingly orchestrated by castoffs from the Celebrity Apprentice, this assault began with <a href="http://www.youtube.com/watch?feature=player_embedded&v=GUr17pHezUo">cringe-inducing derivations</a> of Apple commercials, and has expanded to include a sort of "feature combover." Through some <a href="http://www.youtube.com/watch?v=jv-JHafk4UA&feature=youtu.be">convoluted logic</a> involving consensus, a professional whiteboard artist attempts to convince the public that the homopolymer error rate is actually lower on the Ion Torrent PGM than on the MiSeq. This is the sequencing equivalent of having your mom try to convince you two apples is better than one <a href="http://en.wikipedia.org/wiki/Drake%27s_Devil_Dogs">devil dog</a>, or some such utter nonsense.
<br />
<br />
My response was predictably measured and cerebral.<br />
<br />
<div class="separator" style="clear: both; text-align: center;">
</div>
This is not the <a href="https://picasaweb.google.com/lh/photo/oyux1nepVh9nxAqgbMnA1Px8lHi6ePCYZLKzcyR0WMQ?feat=directlink">first time</a> I have tweet-confronted Ion Torrent over its odious approach. All this is rather unnecessary because overall, and despite the homopolymer issues, the utility of the PGM has been more or less within expectations. The MiSeq is also exactly within expectations, since it is basically a transparent, measly 1/50th slice of a HiSeq. The same cannot really be said for the RS, whose error rate is clearly far above what was expected at the outset. So if anyone requires an aggressive smokescreen-type marketing campaign (or a new machine) it is Pacific Biosciences.Jermdemohttp://www.blogger.com/profile/01662705354227625640noreply@blogger.com3tag:blogger.com,1999:blog-8532065960756482590.post-81850761916201006242012-01-18T21:15:00.000-05:002013-09-27T09:44:26.912-04:00When can we expect the last damn microarray paper?<h3>With bonus R code</h3>
It came as a shock to learn from PubMed that almost <a href="http://www.ncbi.nlm.nih.gov/pubmed?term=microarray%5Btitle%5D%20and%202011%5Bpdat%5D">900 papers</a> were published with the word "microarray" in their titles last year alone, just 12 shy of the 2010 count. More alarming, many of these papers were not of the innocuous "Microarray study of gene expression in dog scrotal tissue" variety, but dry rehashings along the lines of "Statistical approaches to normalizing microarrays to the reference brightness of Ursa Minor".
<p>
It's an ugly truth we must face: people aren't just using microarrays, <i>they're still writing about them</i>.
</p>
<p>
See for yourself:
</p>
<pre class="brush: shell">getCount<-function(term){function(year){
nihUrl<-concat("http://eutils.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi?db=pubmed&term=",term,"+",year,"[pdat]")
#cleanurl<-gsub('\\]','%5D',gsub('\\[','%5B',x=url))
#http://eutils.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi?db=pubmed&term=microarray%5btitle%5d+2003%5bpdat%5d
xml<-xmlTreeParse(URLencode(nihUrl),isURL=TRUE)
#Data Mashups in R, pg17
as.numeric(xmlValue(xml$doc$children$eSearchResult$children$Count$children$text))
}}
years<-1995:2011
df<-data.frame(type="obs",year=years,
mic=sapply(years,function(x){do.call(getCount('microarray[title]'),list(x))}),
ngs=sapply(years,function(x){do.call(getCount('"next generation sequencing"[title] OR "high-throughput sequencing"[title]'),list(x))})
)
#papers with "microarray" in title
> df[,c("year","mic")]
year mic
1 1995 2
2 1996 4
3 1997 0
4 1998 7
5 1999 28
6 2000 108
7 2001 273
8 2002 553
9 2003 770
10 2004 1032
11 2005 1135
12 2006 1216
13 2007 1107
14 2008 1055
15 2009 981
16 2010 909
17 2011 897
</pre>
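A side note on the shape of getCount above: it is curried, returning a function of its year argument, which is why the sapply calls wrap it in do.call. A stub with no network dependency (getCountStub is hypothetical, standing in for the PubMed lookup) shows the two call forms are equivalent:

```r
# getCount(term) returns a function of year; calling it directly is
# equivalent to the do.call(..., list(year)) form used with sapply.
getCountStub <- function(term) {
  function(year) nchar(term) + year  # stand-in for the PubMed lookup
}
f <- getCountStub("microarray[title]")
direct <- f(2011)                                          # direct call
via_do <- do.call(getCountStub("microarray[title]"), list(2011))  # same result
```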
Reading another treatise on microarray normalization in 2012 would be just tragic. Who still reads these? Who still writes these papers? Can we stop them? If not, when can we expect NGS to wipe them off the map?
<br />
<pre class="brush: shell">
#97 is a fair start
df<-subset(df,year>=1997)
mdf<-melt(df,id.vars=c("type","year"),variable_name="citation")
c<-ggplot(mdf,aes(x=year))
p<-c+geom_point(aes(y=value,color=citation)) +
ylab("papers") +
stat_smooth(aes(y=value,color=citation),data=subset(mdf,citation=="mic"),method="loess") +
scale_x_continuous(breaks=seq(from=1997,to=2011,by=2))
print(p)
</pre>
Here I plot both microarray and next-generation sequencing papers (in title). We see kurtosis is working in our favor, and LOESS seems to agree!
<div class="separator" style="clear: both; text-align: center;">
<a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEg92YKe9G8CaS6-w7yX0s_kQMkqxGJPXMFfRC9F_7yNtJ-1u-GYfWljbvhM4sWYyaHkk9fiFprGgQP8xJ_0UZRp3Zfnim4CcFhOgMD8ORcynBszIJOdaTIgU_zsnoyVTOqe-p2DtsCZlIc/s1600/obs.png" imageanchor="1" style=""><img border="0" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEg92YKe9G8CaS6-w7yX0s_kQMkqxGJPXMFfRC9F_7yNtJ-1u-GYfWljbvhM4sWYyaHkk9fiFprGgQP8xJ_0UZRp3Zfnim4CcFhOgMD8ORcynBszIJOdaTIgU_zsnoyVTOqe-p2DtsCZlIc/s800/obs.png" /></a></div>
But when will the pain end? Let us extrapolate, wildly.
<pre class="brush: shell">
#Return 0 for negative elements
# noNeg(c(3,2,1,0,-1,-2,2))
# [1] 3 2 1 0 0 0 2
noNeg<-function(v){sapply(v,function(x){max(x,0)})}
#Return up to the first negative/zero element inclusive
# toZeroNoNeg(c(3,2,1,0,-1,-2,2))
# [1] 3 2 1 0
toZeroNoNeg<-function(v){noNeg(v)[1:firstZero(noNeg(v))]}
#return index of first zero
firstZero<-function(v){which(noNeg(v)==0)[1]}
#let's peer into the future
df.lo.mic<-loess(mic ~ year,df,control=loess.control(surface="direct"))
#when will it stop?
mic_predict<-as.integer(predict(df.lo.mic,data.frame(year=2012:2020),se=FALSE))
zero_year<-2011+firstZero(mic_predict)
cat(concat("LOESS projects ",sum(toZeroNoNeg(mic_predict))," more microarray papers."))
cat(concat("The last damn microarray paper is projected to be in ",(zero_year-1),"."))
#predict ngs growth
df.lo.ngs<-loess(ngs ~ year,df,control=loess.control(surface="direct"))
ngs_predict<-as.integer(predict(df.lo.ngs,data.frame(year=2012:zero_year),se=FALSE))
pred_df<-data.frame(type="pred",year=c(2012:zero_year),mic=toZeroNoNeg(mic_predict),ngs=ngs_predict)
df2<-rbind(df,pred_df)
mdf2<-melt(df2,id.vars=c("type","year"),variable_name="citation")
c2<-ggplot(mdf2,aes(x=year))
p2<-c2+geom_point(aes(y=value,color=citation,shape=type),size=3) +
ylab("papers") +
scale_y_continuous(breaks=seq(from=0,to=1600,by=200))+
scale_x_continuous(breaks=seq(from=1997,to=zero_year,by=2))
print(p2)
</pre>
<div class="separator" style="clear: both; text-align: center;">
<a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEj7-7kD6X4ZSpc0QDS2HR50W9iRh1Ey_WIThuLA-9Rk5RBM9AXHnpMsuequdzbSU7iY3eKkUBXkAOX0Br7b3_gZy8gko1Fpj484oUjp1tR1VZhEnDVue7TK5LBfxSpur2nMGsMUnyUv2cA/s1600/pred.png" imageanchor="1" style=""><img border="0" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEj7-7kD6X4ZSpc0QDS2HR50W9iRh1Ey_WIThuLA-9Rk5RBM9AXHnpMsuequdzbSU7iY3eKkUBXkAOX0Br7b3_gZy8gko1Fpj484oUjp1tR1VZhEnDVue7TK5LBfxSpur2nMGsMUnyUv2cA/s800/pred.png" /></a></div>
<p>LOESS projects 2038 more microarray papers.<br/>
The last damn microarray paper is projected to be published in 2016.
</p>
<p>
Yeah, right...
</p>
Full R code here: <a href="https://gist.github.com/1637248">https://gist.github.com/1637248</a>
<script src="https://gist.github.com/leipzig/1637248.js"></script>Jermdemohttp://www.blogger.com/profile/01662705354227625640noreply@blogger.com13tag:blogger.com,1999:blog-8532065960756482590.post-16824112917942503292011-10-31T13:17:00.003-04:002011-10-31T16:39:07.175-04:00Making R's paste act more like CONCAT<p>While vector-friendly, R's <a href="http://stat.ethz.ch/R-manual/R-patched/library/base/html/paste.html">paste</a> function has a few behaviors I don't particularly like.</p>
<p>One is using a space as the default separator:<br/>
<pre class="brush: shell">
> adjectives<-c("lean","fast","strong")
> paste(adjectives,"er")
> paste(adjectives,"er")
[1] "lean er" "fast er" "strong er" #d'oh
> paste(adjectives,"er",sep="")
[1] "leaner" "faster" "stronger"
</pre></p>
<p>Empty vectors get an undeserved first class treatment:
<pre class="brush: shell">
> indelPositions<-c(5,6,7)
> paste(indelPositions,"i",sep="")
[1] "5i" "6i" "7i" #good
> indelPositions<-c()
> paste(indelPositions,"i",sep="")
[1] "i" #not so good
</pre></p>
<p>And perhaps worst of all, NA values get replaced with a string called "NA":
<pre class="brush: shell">
> placing<-"1"
> paste(placing,"st",sep="")
[1] "1st" #awesome
> placing<-NA_integer_
> paste(placing,"st",sep="")
[1] "NAst" #ugh
</pre></p>
<p>This is inconvenient in situations where I don't know a priori if I will get a value, a vector of length 0, or an NA.
</p>
<p>Working from Hadley Wickham's str_c function in the <a href="https://github.com/hadley/stringr">stringr</a> package, I decided to write a paste function that behaves more like CONCAT in SQL:
<br/>
<pre class="brush: shell">
library(stringr)
library(plyr) # for llply
concat<-CONCAT<-function(...,sep="",collapse=NULL){
strings<-list(...)
#catch NULLs, NAs
if(
all(unlist(llply(strings,length))>0)
&&
all(!is.na(unlist(strings)))
){
do.call("paste", c(strings, list(sep = sep, collapse = collapse)))
}else{
NULL
}
}
</pre></p>
<p>This function has the behaviors I expect:
<pre class="brush: shell">
> concat(adjectives,"er")
[1] "leaner" "faster" "stronger"
> concat(indelPositions,"i")
NULL
> concat(placing,"st")
NULL
</pre>
</p>
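One more behavior worth noting: because the wrapper forwards sep and collapse straight through to paste, collapsing a vector into a single string still works. A self-contained sketch (concat2 is my minimal restatement without the stringr/plyr dependencies, not the author's exact function):

```r
# Minimal stand-in for the post's concat: NULL on empty or NA input,
# otherwise defer to paste with sep="" by default.
concat2 <- function(..., sep = "", collapse = NULL) {
  strings <- list(...)
  if (all(sapply(strings, length) > 0) && all(!is.na(unlist(strings)))) {
    do.call("paste", c(strings, list(sep = sep, collapse = collapse)))
  } else {
    NULL
  }
}
adjectives <- c("lean", "fast", "strong")
concat2(adjectives, "er", collapse = ", ")  # "leaner, faster, stronger"
concat2(c(), "er")                          # NULL
concat2(NA, "st")                           # NULL
```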
<p>
That's more like it!
</p>Jermdemohttp://www.blogger.com/profile/01662705354227625640noreply@blogger.com1tag:blogger.com,1999:blog-8532065960756482590.post-68599660952449284442011-10-06T13:10:00.000-04:002011-10-06T13:32:41.024-04:00SELinux for enhanced headaches<br />
<a href="http://en.wikipedia.org/wiki/Security-Enhanced_Linux">Security Enhanced Linux (SELinux)</a> is a new extra hidden layer of permissions that makes configuring things more difficult, without ever identifying itself as the culprit - kind of like <a href="http://en.wikipedia.org/wiki/Access_control_list">ACLs</a> but more cryptic. Though it may be more secure, it is not an enhancing experience to deal with, and probably not worth it for the average user.<br />
<br />
For example to have Apache serve personal websites (i.e. http://server/~leipzig) it is no longer enough to alter httpd.conf, because you will get mysterious 403 errors until you do this (as <a href="http://serverfault.com/questions/307167/apache-403-forbidden/307209#307209">others</a> have experienced):
<br />
<pre>chcon -R -t httpd_sys_content_t /home/leipzig</pre>
<br />
You forget about this change until xauth starts complaining about stuff for no apparent reason:
<br />
<pre>/usr/bin/xauth: timeout in locking authority file /home/leipzig/.Xauthority</pre>
<br />
so of course you need to do this (thanks <a href="http://www.linkedin.com/in/madhavdiwan">Madhav Diwan</a> for <a href="http://forums.fedoraforum.org/showpost.php?p=1336336&postcount=9">this</a> post):<br />
<pre>chcon unconfined_u:object_r:user_home_dir_t:s0 /home/leipzig</pre>
<br />
I have no idea what these things actually mean, nor any real interest in learning. I'm sure this stuff is great for sysadmin cocktail chat but at least for private servers it is just another brake on the wheel of getting things done. For the time being I have set the level to "permissive", which means it displays warnings but does not interfere, but am leaning toward "disabled" or maybe something else:<br />
<br />
<pre>
# This file controls the state of SELinux on the system.
# SELINUX= can take one of these three values:
# enforcing - SELinux security policy is enforced.
# permissive - SELinux prints warnings instead of enforcing.
# disabled - No SELinux policy is loaded.
SELINUX=excoriated
# SELINUXTYPE= can take one of these two values:
# targeted - Targeted processes are protected,
# mls - Multi Level Security protection.
SELINUXTYPE=targeted
</pre>
<br />
More on the pros and cons:<br />
<a href="http://unix.stackexchange.com/questions/9163/does-selinux-provide-enough-extra-security-to-be-worth-the-hassle-of-learning-set">http://unix.stackexchange.com/questions/9163/does-selinux-provide-enough-extra-security-to-be-worth-the-hassle-of-learning-set</a>
Jermdemohttp://www.blogger.com/profile/01662705354227625640noreply@blogger.com0tag:blogger.com,1999:blog-8532065960756482590.post-25107451519900763482011-08-18T10:03:00.008-04:002011-08-18T11:16:37.925-04:00Installing RStudio Server on Scientific Linux 6: My bash notebook<a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEicgtePN-Zo0_rB8lwKIywTAhl17ilz_JNT4d9K1AM5sfdBpFBWskDlk92qZHeLLy-V_tuLR1lv46MRYzUCG_rtpW-XNqx35qfBGYJDysOP-RyZAyINpIy4q9kLnZU4WLYHnVGlWaTBw70/s1600/Screen+shot+2011-08-18+at+9.58.10+AM.png" imageanchor="1" style="clear: right; float: right; margin-bottom: 1em; margin-left: 1em;"><img border="0" height="220" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEicgtePN-Zo0_rB8lwKIywTAhl17ilz_JNT4d9K1AM5sfdBpFBWskDlk92qZHeLLy-V_tuLR1lv46MRYzUCG_rtpW-XNqx35qfBGYJDysOP-RyZAyINpIy4q9kLnZU4WLYHnVGlWaTBw70/s320/Screen+shot+2011-08-18+at+9.58.10+AM.png" width="320" /></a>
Granted, not a brilliant sysadmin mind at work here, but this might help someone someday.<br />
Scientific Linux (SL) is rebuilt from Red Hat Enterprise Linux sources.<br />
<br />
See installation instructions here:<br />
<a href="http://rstudio.org/download/server">http://rstudio.org/download/server</a><br />
<div class="separator" style="clear: both; text-align: left;">
</div>
<pre class="brush: bash">
[leipzig@localhost ~]$ sudo rpm -Uvh
http://download.fedoraproject.org/pub/epel/6/x86_64/epel-release-6-5.noarch.rpm
[sudo] password for leipzig:
Retrieving
http://download.fedoraproject.org/pub/epel/6/x86_64/epel-release-6-5.noarch.rpm
warning: /var/tmp/rpm-tmp.S2RQAH: Header V3 RSA/SHA256 Signature, key ID
0608b895: NOKEY
Preparing... ########################################### [100%]
1:epel-release ########################################### [100%]
[leipzig@localhost ~]$ rpm -qa | grep epel
epel-release-6-5.noarch
[leipzig@localhost ~]$ which R
/usr/local/bin/R
[leipzig@localhost ~]$ R
R version 2.13.0 (2011-04-13)
Copyright (C) 2011 The R Foundation for Statistical Computing
ISBN 3-900051-07-0
Platform: x86_64-unknown-linux-gnu (64-bit)
R is free software and comes with ABSOLUTELY NO WARRANTY.
You are welcome to redistribute it under certain conditions.
Type 'license()' or 'licence()' for distribution details.
Natural language support but running in an English locale
R is a collaborative project with many contributors.
Type 'contributors()' for more information and
'citation()' on how to cite R or R packages in publications.
Type 'demo()' for some demos, 'help()' for on-line help, or
'help.start()' for an HTML browser interface to help.
Type 'q()' to quit R.
&gt; q()
Save workspace image? [y/n/c]: n
[leipzig@localhost ~]$ wget
https://s3.amazonaws.com/rstudio-server/rstudio-server-0.94.92-x86_64.rpm
--2011-08-17 13:06:36--
https://s3.amazonaws.com/rstudio-server/rstudio-server-0.94.92-x86_64.rpm
Resolving s3.amazonaws.com... 72.21.211.170
Connecting to s3.amazonaws.com|72.21.211.170|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 11373769 (11M) [application/x-redhat-package-manager]
Saving to: “rstudio-server-0.94.92-x86_64.rpm”
100%[===========================================================================
=============================================&gt;] 11,373,769 7.89M/s in 1.4s
2011-08-17 13:06:37 (7.89 MB/s) - “rstudio-server-0.94.92-x86_64.rpm” saved
[11373769/11373769]
[leipzig@localhost ~]$ sudo rpm -Uvh rstudio-server-0.94.92-x86_64.rpm
error: Failed dependencies:
libR.so()(64bit) is needed by rstudio-server-0.94.92-1.x86_64
libRblas.so()(64bit) is needed by rstudio-server-0.94.92-1.x86_64
libRlapack.so()(64bit) is needed by rstudio-server-0.94.92-1.x86_64
libcrypto.so.6()(64bit) is needed by rstudio-server-0.94.92-1.x86_64
libgfortran.so.1()(64bit) is needed by rstudio-server-0.94.92-1.x86_64
libssl.so.6()(64bit) is needed by rstudio-server-0.94.92-1.x86_64
[leipzig@localhost ~]$ sudo yum install R
epel/metalink
| 14 kB 00:00
epel
| 4.3 kB 00:00
epel/primary_db
| 4.0 MB 00:00
sl
| 3.2 kB 00:00
sl-security
| 1.9 kB 00:00
Setting up Install Process
Resolving Dependencies
--&gt; Running transaction check
---&gt; Package R.x86_64 0:2.13.1-1.el6 set to be updated
--&gt; Processing Dependency: libRmath-devel = 2.13.1-1.el6 for package:
R-2.13.1-1.el6.x86_64
--&gt; Processing Dependency: R-devel = 2.13.1-1.el6 for package:
R-2.13.1-1.el6.x86_64
--&gt; Running transaction check
---&gt; Package R-devel.x86_64 0:2.13.1-1.el6 set to be updated
--&gt; Processing Dependency: R-core = 2.13.1-1.el6 for package:
R-devel-2.13.1-1.el6.x86_64
--&gt; Processing Dependency: bzip2-devel for package: R-devel-2.13.1-1.el6.x86_64
--&gt; Processing Dependency: gcc-gfortran for package: R-devel-2.13.1-1.el6.x86_64
--&gt; Processing Dependency: tk-devel for package: R-devel-2.13.1-1.el6.x86_64
--&gt; Processing Dependency: pcre-devel for package: R-devel-2.13.1-1.el6.x86_64
--&gt; Processing Dependency: tcl-devel for package: R-devel-2.13.1-1.el6.x86_64
---&gt; Package libRmath-devel.x86_64 0:2.13.1-1.el6 set to be updated
--&gt; Processing Dependency: libRmath = 2.13.1-1.el6 for package:
libRmath-devel-2.13.1-1.el6.x86_64
--&gt; Running transaction check
---&gt; Package R-core.x86_64 0:2.13.1-1.el6 set to be updated
--&gt; Processing Dependency: cups for package: R-core-2.13.1-1.el6.x86_64
--&gt; Processing Dependency: libtk8.5.so()(64bit) for package:
R-core-2.13.1-1.el6.x86_64
---&gt; Package bzip2-devel.x86_64 0:1.0.5-7.el6_0 set to be updated
---&gt; Package gcc-gfortran.x86_64 0:4.4.4-13.el6 set to be updated
---&gt; Package libRmath.x86_64 0:2.13.1-1.el6 set to be updated
---&gt; Package pcre-devel.x86_64 0:7.8-3.1.el6 set to be updated
---&gt; Package tcl-devel.x86_64 1:8.5.7-6.el6 set to be updated
---&gt; Package tk-devel.x86_64 1:8.5.7-5.el6 set to be updated
--&gt; Running transaction check
---&gt; Package cups.x86_64 1:1.4.2-35.el6_0.1 set to be updated
--&gt; Processing Dependency: portreserve for package:
1:cups-1.4.2-35.el6_0.1.x86_64
--&gt; Processing Dependency: poppler-utils for package:
1:cups-1.4.2-35.el6_0.1.x86_64
---&gt; Package tk.x86_64 1:8.5.7-5.el6 set to be updated
--&gt; Running transaction check
---&gt; Package poppler-utils.x86_64 0:0.12.4-3.el6_0.1 set to be updated
---&gt; Package portreserve.x86_64 0:0.0.4-4.el6 set to be updated
--&gt; Finished Dependency Resolution
Dependencies Resolved
================================================================================
================================================================================
==
Package Arch Version
Repository Size
================================================================================
================================================================================
==
Installing:
R x86_64
2.13.1-1.el6 epel
17 k
Installing for dependencies:
R-core x86_64
2.13.1-1.el6 epel
33 M
R-devel x86_64
2.13.1-1.el6 epel
88 k
bzip2-devel x86_64
1.0.5-7.el6_0 sl-security
249 k
cups x86_64
1:1.4.2-35.el6_0.1 sl-security
2.3 M
gcc-gfortran x86_64
4.4.4-13.el6 sl
4.7 M
libRmath x86_64
2.13.1-1.el6 epel
111 k
libRmath-devel x86_64
2.13.1-1.el6 epel
21 k
pcre-devel x86_64
7.8-3.1.el6 sl
317 k
poppler-utils x86_64
0.12.4-3.el6_0.1 sl-security
72 k
portreserve x86_64
0.0.4-4.el6 sl
21 k
tcl-devel x86_64
1:8.5.7-6.el6 sl
161 k
tk x86_64
1:8.5.7-5.el6 sl
1.4 M
tk-devel x86_64
1:8.5.7-5.el6 sl
495 k
Transaction Summary
================================================================================
================================================================================
==
Install 14 Package(s)
Upgrade 0 Package(s)
Total download size: 43 M
Installed size: 89 M
Is this ok [y/N]: y
Downloading Packages:
(1/14): R-2.13.1-1.el6.x86_64.rpm
| 17 kB 00:00
(2/14): R-core-2.13.1-1.el6.x86_64.rpm
| 33 MB 00:05
(3/14): R-devel-2.13.1-1.el6.x86_64.rpm
| 88 kB 00:00
(4/14): bzip2-devel-1.0.5-7.el6_0.x86_64.rpm
| 249 kB 00:00
(5/14): cups-1.4.2-35.el6_0.1.x86_64.rpm
| 2.3 MB 00:01
(6/14): gcc-gfortran-4.4.4-13.el6.x86_64.rpm
| 4.7 MB 00:02
(7/14): libRmath-2.13.1-1.el6.x86_64.rpm
| 111 kB 00:00
(8/14): libRmath-devel-2.13.1-1.el6.x86_64.rpm
| 21 kB 00:00
(9/14): pcre-devel-7.8-3.1.el6.x86_64.rpm
| 317 kB 00:00
(10/14): poppler-utils-0.12.4-3.el6_0.1.x86_64.rpm
| 72 kB 00:00
(11/14): portreserve-0.0.4-4.el6.x86_64.rpm
| 21 kB 00:00
(12/14): tcl-devel-8.5.7-6.el6.x86_64.rpm
| 161 kB 00:00
(13/14): tk-8.5.7-5.el6.x86_64.rpm
| 1.4 MB 00:00
(14/14): tk-devel-8.5.7-5.el6.x86_64.rpm
| 495 kB 00:00
--------------------------------------------------------------------------------
--------------------------------------------------------------------------------
--
Total
3.1 MB/s | 43 MB 00:13
warning: rpmts_HdrFromFdno: Header V3 RSA/SHA256 Signature, key ID 0608b895:
NOKEY
epel/gpgkey
| 3.2 kB 00:00 ...
Importing GPG key 0x0608B895 "EPEL (6) epel@fedoraproject.org" from
/etc/pki/rpm-gpg/RPM-GPG-KEY-EPEL-6
Is this ok [y/N]: y
Running rpm_check_debug
Running Transaction Test
Transaction Test Succeeded
Running Transaction
Warning: RPMDB altered outside of yum.
Installing : 1:tk-8.5.7-5.el6.x86_64
1/14
Installing : portreserve-0.0.4-4.el6.x86_64
2/14
Installing : poppler-utils-0.12.4-3.el6_0.1.x86_64
3/14
Installing : 1:cups-1.4.2-35.el6_0.1.x86_64
4/14
Installing : R-core-2.13.1-1.el6.x86_64
5/14
Installing : gcc-gfortran-4.4.4-13.el6.x86_64
6/14
Installing : libRmath-2.13.1-1.el6.x86_64
7/14
Installing : 1:tcl-devel-8.5.7-6.el6.x86_64
8/14
Installing : 1:tk-devel-8.5.7-5.el6.x86_64
9/14
Installing : libRmath-devel-2.13.1-1.el6.x86_64
10/14
Installing : bzip2-devel-1.0.5-7.el6_0.x86_64
11/14
Installing : pcre-devel-7.8-3.1.el6.x86_64
12/14
Installing : R-devel-2.13.1-1.el6.x86_64
13/14
Installing : R-2.13.1-1.el6.x86_64
14/14
Installed:
R.x86_64 0:2.13.1-1.el6
Dependency Installed:
R-core.x86_64 0:2.13.1-1.el6 R-devel.x86_64 0:2.13.1-1.el6
bzip2-devel.x86_64 0:1.0.5-7.el6_0 cups.x86_64 1:1.4.2-35.el6_0.1
gcc-gfortran.x86_64 0:4.4.4-13.el6 libRmath.x86_64 0:2.13.1-1.el6
libRmath-devel.x86_64 0:2.13.1-1.el6 pcre-devel.x86_64 0:7.8-3.1.el6
poppler-utils.x86_64 0:0.12.4-3.el6_0.1 portreserve.x86_64 0:0.0.4-4.el6
tcl-devel.x86_64 1:8.5.7-6.el6 tk.x86_64 1:8.5.7-5.el6
tk-devel.x86_64 1:8.5.7-5.el6
Complete!
[leipzig@localhost ~]$ sudo rpm -Uvh rstudio-server-0.94.92-x86_64.rpm
error: Failed dependencies:
libcrypto.so.6()(64bit) is needed by rstudio-server-0.94.92-1.x86_64
libgfortran.so.1()(64bit) is needed by rstudio-server-0.94.92-1.x86_64
libssl.so.6()(64bit) is needed by rstudio-server-0.94.92-1.x86_64
[leipzig@localhost ~]$ sudo yum install libcrypto.so.6
Setting up Install Process
Resolving Dependencies
--&gt; Running transaction check
---&gt; Package openssl098e.i686 0:0.9.8e-17.el6 set to be updated
--&gt; Processing Dependency: libc.so.6(GLIBC_2.3.4) for package:
openssl098e-0.9.8e-17.el6.i686
--&gt; Processing Dependency: libkrb5.so.3(krb5_3_MIT) for package:
openssl098e-0.9.8e-17.el6.i686
--&gt; Processing Dependency: libc.so.6(GLIBC_2.1) for package:
openssl098e-0.9.8e-17.el6.i686
--&gt; Processing Dependency: libcom_err.so.2 for package:
openssl098e-0.9.8e-17.el6.i686
--&gt; Processing Dependency: libc.so.6(GLIBC_2.0) for package:
openssl098e-0.9.8e-17.el6.i686
--&gt; Processing Dependency: libk5crypto.so.3 for package:
openssl098e-0.9.8e-17.el6.i686
--&gt; Processing Dependency: libk5crypto.so.3(k5crypto_3_MIT) for package:
openssl098e-0.9.8e-17.el6.i686
--&gt; Processing Dependency: libdl.so.2(GLIBC_2.0) for package:
openssl098e-0.9.8e-17.el6.i686
--&gt; Processing Dependency: libc.so.6(GLIBC_2.7) for package:
openssl098e-0.9.8e-17.el6.i686
--&gt; Processing Dependency: libkrb5.so.3 for package:
openssl098e-0.9.8e-17.el6.i686
--&gt; Processing Dependency: libc.so.6(GLIBC_2.4) for package:
openssl098e-0.9.8e-17.el6.i686
--&gt; Processing Dependency: libgssapi_krb5.so.2 for package:
openssl098e-0.9.8e-17.el6.i686
--&gt; Processing Dependency: libdl.so.2(GLIBC_2.1) for package:
openssl098e-0.9.8e-17.el6.i686
--&gt; Processing Dependency: libc.so.6(GLIBC_2.1.3) for package:
openssl098e-0.9.8e-17.el6.i686
--&gt; Processing Dependency: libresolv.so.2 for package:
openssl098e-0.9.8e-17.el6.i686
--&gt; Processing Dependency: libz.so.1 for package: openssl098e-0.9.8e-17.el6.i686
--&gt; Processing Dependency: libc.so.6 for package: openssl098e-0.9.8e-17.el6.i686
--&gt; Processing Dependency: libdl.so.2 for package:
openssl098e-0.9.8e-17.el6.i686
--&gt; Processing Dependency: libc.so.6(GLIBC_2.3) for package:
openssl098e-0.9.8e-17.el6.i686
--&gt; Running transaction check
---&gt; Package glibc.i686 0:2.12-1.7.el6_0.5 set to be updated
--&gt; Processing Dependency: libfreebl3.so for package:
glibc-2.12-1.7.el6_0.5.i686
--&gt; Processing Dependency: libfreebl3.so(NSSRAWHASH_3.12.3) for package:
glibc-2.12-1.7.el6_0.5.i686
---&gt; Package krb5-libs.i686 0:1.9-9.el6_1.1 set to be updated
--&gt; Processing Dependency: libkeyutils.so.1(KEYUTILS_0.3) for package:
krb5-libs-1.9-9.el6_1.1.i686
--&gt; Processing Dependency: libkeyutils.so.1 for package:
krb5-libs-1.9-9.el6_1.1.i686
--&gt; Processing Dependency: libselinux.so.1 for package:
krb5-libs-1.9-9.el6_1.1.i686
---&gt; Package libcom_err.i686 0:1.41.12-3.el6 set to be updated
---&gt; Package zlib.i686 0:1.2.3-25.el6 set to be updated
--&gt; Running transaction check
---&gt; Package keyutils-libs.i686 0:1.4-1.el6 set to be updated
---&gt; Package libselinux.i686 0:2.0.94-2.el6 set to be updated
---&gt; Package nss-softokn-freebl.i686 0:3.12.8-1.el6_0 set to be updated
--&gt; Finished Dependency Resolution
Dependencies Resolved
================================================================================
================================================================================
==
Package Arch
Version Repository
Size
================================================================================
================================================================================
==
Installing:
openssl098e i686
0.9.8e-17.el6 sl
772 k
Installing for dependencies:
glibc i686
2.12-1.7.el6_0.5 sl-security
4.3 M
keyutils-libs i686
1.4-1.el6 sl
19 k
krb5-libs i686
1.9-9.el6_1.1 sl-security
711 k
libcom_err i686
1.41.12-3.el6 sl
33 k
libselinux i686
2.0.94-2.el6 sl
106 k
nss-softokn-freebl i686
3.12.8-1.el6_0 sl-security
108 k
zlib i686
1.2.3-25.el6 sl
71 k
Transaction Summary
================================================================================
================================================================================
==
Install 8 Package(s)
Upgrade 0 Package(s)
Total download size: 6.0 M
Installed size: 18 M
Is this ok [y/N]: y
Downloading Packages:
(1/8): glibc-2.12-1.7.el6_0.5.i686.rpm
| 4.3 MB 00:02
(2/8): keyutils-libs-1.4-1.el6.i686.rpm
| 19 kB 00:00
(3/8): krb5-libs-1.9-9.el6_1.1.i686.rpm
| 711 kB 00:00
(4/8): libcom_err-1.41.12-3.el6.i686.rpm
| 33 kB 00:00
(5/8): libselinux-2.0.94-2.el6.i686.rpm
| 106 kB 00:00
(6/8): nss-softokn-freebl-3.12.8-1.el6_0.i686.rpm
| 108 kB 00:00
(7/8): openssl098e-0.9.8e-17.el6.i686.rpm
| 772 kB 00:00
(8/8): zlib-1.2.3-25.el6.i686.rpm
| 71 kB 00:00
--------------------------------------------------------------------------------
--------------------------------------------------------------------------------
--
Total
1.2 MB/s | 6.0 MB 00:04
Running rpm_check_debug
Running Transaction Test
Transaction Test Succeeded
Running Transaction
Installing : nss-softokn-freebl-3.12.8-1.el6_0.i686
1/8
Installing : glibc-2.12-1.7.el6_0.5.i686
2/8
Installing : libcom_err-1.41.12-3.el6.i686
3/8
Installing : zlib-1.2.3-25.el6.i686
4/8
Installing : libselinux-2.0.94-2.el6.i686
5/8
Installing : keyutils-libs-1.4-1.el6.i686
6/8
Installing : krb5-libs-1.9-9.el6_1.1.i686
7/8
Installing : openssl098e-0.9.8e-17.el6.i686
8/8
Installed:
openssl098e.i686 0:0.9.8e-17.el6
Dependency Installed:
glibc.i686 0:2.12-1.7.el6_0.5 keyutils-libs.i686 0:1.4-1.el6
krb5-libs.i686 0:1.9-9.el6_1.1 libcom_err.i686 0:1.41.12-3.el6
libselinux.i686 0:2.0.94-2.el6 nss-softokn-freebl.i686 0:3.12.8-1.el6_0
zlib.i686 0:1.2.3-25.el6
Complete!
[leipzig@localhost ~]$ sudo rpm -Uvh rstudio-server-0.94.92-x86_64.rpm
error: Failed dependencies:
libcrypto.so.6()(64bit) is needed by rstudio-server-0.94.92-1.x86_64
libgfortran.so.1()(64bit) is needed by rstudio-server-0.94.92-1.x86_64
libssl.so.6()(64bit) is needed by rstudio-server-0.94.92-1.x86_64
[leipzig@localhost ~]$ sudo yum install libcrypto.so.6
Setting up Install Process
Package openssl098e-0.9.8e-17.el6.i686 already installed and latest version
Nothing to do
[leipzig@localhost ~]$ sudo yum install libgfortran.so.1
Setting up Install Process
Resolving Dependencies
--&gt; Running transaction check
---&gt; Package compat-libgfortran-41.i686 0:4.1.2-39.el6 set to be updated
--&gt; Finished Dependency Resolution
Dependencies Resolved
================================================================================
================================================================================
==
Package Arch
Version Repository Size
================================================================================
================================================================================
==
Installing:
compat-libgfortran-41 i686
4.1.2-39.el6 sl 99 k
Transaction Summary
================================================================================
================================================================================
==
Install 1 Package(s)
Upgrade 0 Package(s)
Total download size: 99 k
Installed size: 488 k
Is this ok [y/N]: y
Downloading Packages:
compat-libgfortran-41-4.1.2-39.el6.i686.rpm
| 99 kB 00:00
Running rpm_check_debug
Running Transaction Test
Transaction Test Succeeded
Running Transaction
Installing : compat-libgfortran-41-4.1.2-39.el6.i686
1/1
Installed:
compat-libgfortran-41.i686 0:4.1.2-39.el6
Complete!
[leipzig@localhost ~]$ sudo rpm -Uvh rstudio-server-0.94.92-x86_64.rpm
error: Failed dependencies:
libcrypto.so.6()(64bit) is needed by rstudio-server-0.94.92-1.x86_64
libgfortran.so.1()(64bit) is needed by rstudio-server-0.94.92-1.x86_64
libssl.so.6()(64bit) is needed by rstudio-server-0.94.92-1.x86_64
[leipzig@localhost ~]$ sudo yum install libssl.so.6
Setting up Install Process
Package openssl098e-0.9.8e-17.el6.i686 already installed and latest version
Nothing to do
[leipzig@localhost ~]$ sudo rpm -Uvh --nodeps rstudio-server-0.94.92-x86_64.rpm
Preparing... ########################################### [100%]
1:rstudio-server ########################################### [100%]
rsession: no process killed
Starting rstudio-server: /usr/lib/rstudio-server/bin/rserver: error while
loading shared libraries: libssl.so.6: cannot open shared object file: No such
file or directory
[FAILED]
#trying some stuff recommended here:
#http://support.rstudio.org/help/discussions/problems/839-installing-rstudio-from-source-after-installing-r-from-source
[leipzig@localhost ~]$ sudo yum install openssl098e-0.9.8e
Setting up Install Process
Resolving Dependencies
--&gt; Running transaction check
---&gt; Package openssl098e.x86_64 0:0.9.8e-17.el6 set to be updated
--&gt; Finished Dependency Resolution
Dependencies Resolved
================================================================================
================================================================================
==
Package Arch
Version Repository
Size
================================================================================
================================================================================
==
Installing:
openssl098e x86_64
0.9.8e-17.el6 sl
762 k
Transaction Summary
================================================================================
================================================================================
==
Install 1 Package(s)
Upgrade 0 Package(s)
Total download size: 762 k
Installed size: 2.2 M
Is this ok [y/N]: y
Downloading Packages:
openssl098e-0.9.8e-17.el6.x86_64.rpm
| 762 kB 00:00
Running rpm_check_debug
Running Transaction Test
Transaction Test Succeeded
Running Transaction
Warning: RPMDB altered outside of yum.
rstudio-server-0.94.92-1.x86_64 has missing requires of libcrypto.so.6()(64bit)
rstudio-server-0.94.92-1.x86_64 has missing requires of
libgfortran.so.1()(64bit)
rstudio-server-0.94.92-1.x86_64 has missing requires of libssl.so.6()(64bit)
Installing : openssl098e-0.9.8e-17.el6.x86_64
1/1
Installed:
openssl098e.x86_64 0:0.9.8e-17.el6
Complete!
[leipzig@localhost ~]$ sudo yum install gcc41-libgfortran-4.1.2
Setting up Install Process
No package gcc41-libgfortran-4.1.2 available.
Error: Nothing to do
[leipzig@localhost ~]$ sudo yum install pango-1.28.1
Setting up Install Process
Package pango-1.28.1-3.el6_0.5.x86_64 already installed and latest version
Nothing to do
[leipzig@localhost ~]$ sudo rpm -Uvh --nodeps rstudio-server-0.94.92-x86_64.rpm
Preparing... ########################################### [100%]
package rstudio-server-0.94.92-1.x86_64 is already installed
[leipzig@localhost ~]$ sudo rstudio-server start
[leipzig@localhost ~]$ sudo rstudio-server verify-installation
Stopping rstudio-server: [ OK ]
/usr/lib/rstudio-server/bin/rsession: error while loading shared libraries:
libgfortran.so.1: wrong ELF class: ELFCLASS32
Starting rstudio-server: [ OK ]
[leipzig@localhost ~]$ sudo yum install libgfortran.so.1
Setting up Install Process
Package compat-libgfortran-41-4.1.2-39.el6.i686 already installed and latest
version
Nothing to do
[leipzig@localhost ~]$ sudo rpm -Uvh ftp.scientificlinux.org/linux/scientific/6.0/x86_64/os/Packages/compat-libgfortran-41-4.1.2-39.el6.x86_64.rpm
error: open of
ftp.scientificlinux.org/linux/scientific/6.0/x86_64/os/Packages/compat-
libgfortran-41-4.1.2-39.el6.x86_64.rpm failed: No such file or directory
[leipzig@localhost ~]$ wget ftp.scientificlinux.org/linux/scientific/6.0/x86_64/os/Packages/compat-libgfortran-41-4.1.2-39.el6.x86_64.rpm
--2011-08-18 04:39:39--
http://ftp.scientificlinux.org/linux/scientific/6.0/x86_64/os/Packages/compat-
libgfortran-41-4.1.2-39.el6.x86_64.rpm
Resolving ftp.scientificlinux.org... 131.225.110.147
Connecting to ftp.scientificlinux.org|131.225.110.147|:80... connected.
HTTP request sent, awaiting response... 200 OK
Length: 128080 (125K) [application/x-rpm]
Saving to: “compat-libgfortran-41-4.1.2-39.el6.x86_64.rpm”
100%[===========================================================================
=============================================&gt;] 128,080 488K/s in 0.3s
2011-08-18 04:39:39 (488 KB/s) - “compat-libgfortran-41-4.1.2-39.el6.x86_64.rpm”
saved [128080/128080]
[leipzig@localhost ~]$ sudo rpm -Uvh compat-libgfortran-41-4.1.2-39.el6.x86_64.rpm
Preparing... ########################################### [100%]
1:compat-libgfortran-41 ########################################### [100%]
[leipzig@localhost ~]$ sudo rstudio-server verify-installation
Stopping rstudio-server: [ OK ]
Starting rstudio-server: [ OK ]
</pre>
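With hindsight, the working sequence condenses to the commands below. This is a reconstruction from the session above, not a re-tested script, so treat it as a sketch; package versions were current as of August 2011.<br />
<pre class="brush: bash">
# Add EPEL, then install R from it (pulls in R-core, R-devel, and friends)
sudo rpm -Uvh http://download.fedoraproject.org/pub/epel/6/x86_64/epel-release-6-5.noarch.rpm
sudo yum install R
# libssl.so.6/libcrypto.so.6 live in the x86_64 openssl098e compat package;
# asking yum for the bare library name resolves to the useless i686 build
sudo yum install openssl098e-0.9.8e
# same story for libgfortran.so.1 - fetch the x86_64 compat rpm directly
wget ftp.scientificlinux.org/linux/scientific/6.0/x86_64/os/Packages/compat-libgfortran-41-4.1.2-39.el6.x86_64.rpm
sudo rpm -Uvh compat-libgfortran-41-4.1.2-39.el6.x86_64.rpm
# now RStudio Server itself (--nodeps because its spec still names the sonames)
wget https://s3.amazonaws.com/rstudio-server/rstudio-server-0.94.92-x86_64.rpm
sudo rpm -Uvh --nodeps rstudio-server-0.94.92-x86_64.rpm
sudo rstudio-server verify-installation
</pre>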
Jermdemohttp://www.blogger.com/profile/01662705354227625640noreply@blogger.com8tag:blogger.com,1999:blog-8532065960756482590.post-89510083045237840022011-06-23T17:14:00.000-04:002011-06-24T10:07:55.055-04:00Big-Ass Servers™ and the myths of clusters in bioinformatics<div class="separator" style="clear: both; text-align: center;"></div>Spending $55k for a 512GB machine (Big-Ass Server™ or BAS™) can be a tough sell when a bioinformatics researcher pitches it to a department head.<br />
<br />
<table cellpadding="0" cellspacing="0" class="tr-caption-container" style="float: right; margin-left: 1em; text-align: right;"><tbody>
<tr><td style="text-align: center;"><a href="http://i.dell.com/images/global/products/pedge/pedge_highlights/server-poweredge-r900-overview2.jpg" imageanchor="1" style="clear: right; margin-bottom: 1em; margin-left: auto; margin-right: auto;"><img border="0" src="http://i.dell.com/images/global/products/pedge/pedge_highlights/server-poweredge-r900-overview2.jpg" /></a></td></tr>
<tr><td class="tr-caption" style="text-align: center;">Dell PowerEdge r900, available in orange and lemon-lime</td></tr>
</tbody></table>Speaking as someone who keeps his copy of <a href="http://en.wikipedia.org/wiki/Introduction_to_Algorithms">CLR</a> safely stored in the basement, ready to help rebuild society after a nuclear holocaust, I am painfully aware of the importance of algorithm development in the history of computing, and the possibilities for parallel computing to make problems tractable.<br />
<br />
Having recently spent 3 years in industry, however, I am now more inclined to just throw money at problems. In the case of hardware, I think this approach is more effective than clever programming for many of the current problems posed by NGS.<br />
<br />
From an economic and productivity perspective, I believe most bioinformatics shops doing basic research would benefit more from having access to a BAS™ than a cluster. Here's why:<br />
<ul><li>The growth of multicore/multiprocessor machines and memory capacity has outpaced the speed of networks. NGS analyses tend to be <a href="http://en.wikipedia.org/wiki/Memory_bound_function">memory-bound</a> and <a href="http://en.wikipedia.org/wiki/I/O_bound">IO-bound</a> rather than <a href="http://en.wikipedia.org/wiki/CPU_bound">CPU-bound</a>, so relying on a cluster of smaller machines can quickly overwhelm a network.</li>
<li>NGS has expanded high-performance computing from a few staples like BLAST and protein structure prediction to dozens of different little analyses, with tools that change on a monthly basis or are homegrown to deal with special circumstances. There isn't the time or the expertise to rewrite each of these for parallel architectures.</li>
</ul>If those don't sound very convincing, here is my layman's guide to dealing with the myths you might encounter concerning NGS and clusters:<br />
<br />
<h3>Myth: Google uses server farms. We should too.</h3><br />
Google has to focus on doing one thing very well: search.<br />
<br />
Bioinformatics programmers have to explore a number of different questions for any given experiment. There is no time to develop parallel solutions for most of them, since many will lead to dead ends.<br />
<br />
Many bioinformatic problems, de novo assembly being a prime example, are notoriously difficult to divide among several machines without being overwhelmed with messaging. Imagine trying to divide a jigsaw puzzle among friends sitting at several tables; you would spend more time talking about the pieces than fitting them together.<br />
<br />
<h3>Myth: Our development setup should mimic our production setup</h3><br />
An experimental computing setup built around a BAS™ allows researchers to explore big data freely without having to think about how to divide it efficiently. If an experiment is successful and there is a need to scale up to a clinical or industrial platform, that can happen later.<br />
<br />
<h3>Myth: Clusters have been around a long time so there is a lot of shell-based infrastructure to distribute workflows</h3><br />
There are tools for queueing jobs, but they offer little help in managing workflows that mix parallel and serial steps - for example, waiting for all steps to finish before merging their results.<br />
<br />
Various programming languages have features to take advantage of clusters. For example, R has <a href="http://cran.r-project.org/web/packages/snow/index.html">SNOW</a>. But Rsamtools requires you to load BAM files into memory, so a BAS™ is not just preferable for NGS analysis with R, it's required.<br />
<br />
<h3>Myth: The rise of cloud computing and Hadoop means that homegrown clusters are irrelevant, but that also means we don't need a BAS™</h3><br />
The popularity of cloud computing in bioinformatics is also driven by the newfound ability to rent time on a BAS™. The main problem with cloud computing is the bottleneck posed by transferring GBs of data to the cloud.<br />
<br />
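How bad is that bottleneck? A back-of-envelope sketch (the dataset size and uplink speed are assumed for illustration; plug in your own):<br />
<pre class="brush: bash">
# Hours to push one run's worth of data to the cloud.
gb=100      # assumed dataset size in GB
mbps=100    # assumed sustained uplink in megabits per second
seconds=$(( gb * 8 * 1000 / mbps ))
echo "$(( seconds / 3600 ))h $(( seconds % 3600 / 60 ))m"   # prints "2h 13m"
</pre>
At realistic NGS volumes the transfer time quickly rivals the analysis time.<br />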
<h3>Myth: Crossbow and Myrna are based on Hadoop, we can develop similar tools</h3><br />
Ben Langmead, Cole Trapnell, and Michael Schatz, alums of Steven Salzberg's group at UMD, have developed NGS solutions using the Hadoop MapReduce framework.<br />
<ul><li>Crossbow is a Hadoop-based implementation of Bowtie. </li>
<li>Myrna is an RNA-Seq pipeline. </li>
<li>Contrail is a de novo short read assembler. </li>
</ul>These are difficult programs to develop, and these examples are either somewhat limited experimental proofs of concept or are married to components that may be undesirable for certain analyses. The Bowtie stack (Bowtie, TopHat, Cufflinks), while revolutionary in its implementation of the Burrows-Wheeler transform, is itself built around the limitations of computers in the year 2008. For many it lacks the sensitivity to deal with, for example, 1000 Genomes data.<br />
<br />
The dynamic scripting languages used by most bioinformatics programmers are not as well suited to Hadoop as Java. To imply we can all develop tools of similar sophistication is unrealistic. Many bioinformatics programs are not even <i>threaded</i>, much less designed to work across several machines.<br />
<br />
<h3>Myth: <a href="http://en.wikipedia.org/wiki/Embarrassingly_parallel">embarrassingly parallel</a> problems imply a cluster is needed</h3><br />A server with 4 quad-core processors is often adequate for handling these embarrassing problems. Dividing the work just tends to lead to further embarrassments.<br /><br />Here is a particularly telling <a href="http://biostar.stackexchange.com/questions/2657/scalemp-vsmp-or-physical-ram">quote</a> from Biohaskell developer Ketil Malde on Biostar:<br />
<blockquote>In general, I think HPC are doing the wrong thing for bioinformatics. It's okay to spend six weeks to rewrite your meteorology program to take advantage of the latest supercomputer (all of which tend to be just a huge stack of small PCs these days) if the program is going to run continuously for the next three years. It is not okay to spend six weeks on a script that's going to run for a couple of days.<br />
<br />
In short, I keep asking for a big PC with a bunch of the latest Intel or AMD core, and as much RAM as we can afford. </blockquote><br />
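Malde's "big PC" is also where embarrassingly parallel work is simplest to run: per-chromosome jobs fan out across local cores with nothing fancier than xargs, no scheduler or message passing involved. A minimal sketch (echo is a stand-in for your actual aligner or caller invocation):<br />
<pre class="brush: bash">
# Run up to 4 per-chromosome jobs at once on local cores;
# sort just makes the interleaved output readable.
printf 'chr%s\n' 1 2 3 4 | xargs -P 4 -I{} echo "done {}" | sort
</pre>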
<h3>Myth: We don't have money for a BAS™ because we need a new cluster to handle things like BLAST</h3><br />
<table cellpadding="0" cellspacing="0" class="tr-caption-container" style="float: left; margin-right: 1em; text-align: left;"><tbody>
<tr><td style="text-align: center;"><a href="http://farm3.static.flickr.com/2758/4400143436_50bcd3843a.jpg" imageanchor="1" style="clear: left; margin-bottom: 1em; margin-left: auto; margin-right: auto;"><img border="0" height="215" src="http://farm3.static.flickr.com/2758/4400143436_50bcd3843a.jpg" width="320" /></a></td></tr>
<tr><td class="tr-caption" style="text-align: center;">IBM System x3850 X5 expandable to 1536GB, mouse not included</td></tr>
</tbody></table>Even the BLAST setup we think of as the essence of parallelism (a segmented genome index - every node gets a part of the genome) is often not the one that many institutions have settled on. Many rely on farming out queries to a cluster in which every node has the full genome index in memory.<br />
<br />
Secondly, mpiBLAST appears to be more suited to dividing an index among older machines than today's, which typically have &gt;32GB of RAM. Here is a telling FAQ entry:<br />
<br />
<blockquote>I benchmarked mpiBLAST but I don't see super-linear speedup! Why?!<br />
<br />
mpiBLAST only yields super-linear speedup when the database being searched is significantly larger than the core memory on an individual node. The super-linear speedup results published in the ClusterWorld 2003 paper describing mpiBLAST are measurements of mpiBLAST v0.9 searching a 1.2GB (compressed) database on a cluster where each node has 640MB of RAM. A single node search results in heavy disk I/O and a long search time.<br />
<a href="http://www.mpiblast.org/Docs/FAQ#super-linear">http://www.mpiblast.org/Docs/FAQ#super-linear</a></blockquote><br />
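The FAQ's criterion reduces to a single comparison: segment the database only when it is bigger than a node's RAM. With the FAQ's own 2003-era numbers (1.2GB database, 640MB of RAM per node):<br />
<pre class="brush: bash">
# Decide between segmenting and replicating a BLAST database.
# Sizes in MB, taken from the mpiBLAST FAQ's 2003 example.
db_mb=1200; ram_mb=640
if [ "$db_mb" -gt "$ram_mb" ]; then
    echo "segment the database across nodes"
else
    echo "replicate the whole index on every node"
fi
</pre>
On a modern node with &gt;32GB of RAM, the same check points the other way for almost any database you would care to search.<br />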
Your comments on this topic are welcome!Jermdemohttp://www.blogger.com/profile/01662705354227625640noreply@blogger.com17tag:blogger.com,1999:blog-8532065960756482590.post-51242414757833817762011-03-15T13:17:00.001-04:002011-08-18T10:07:35.743-04:00RStudio: My thoughts<div class="separator" style="clear: both; text-align: center;">
<a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEgvqneSN2oafZB0FiTWQGwn5RTMHksdBqGk-c0_eNjZbO_gKb8Parp2dZwWWIw3XRp7vZBLty-Jcomns0I0O346nx4_lnnrE3ULDWB4HTTGcQ83u5E8h_pJjWzWUUq_wnxqPXEVXyhyXXM/s1600/Screen+shot+2011-03-15+at+12.52.22+PM.png" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" height="416" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEgvqneSN2oafZB0FiTWQGwn5RTMHksdBqGk-c0_eNjZbO_gKb8Parp2dZwWWIw3XRp7vZBLty-Jcomns0I0O346nx4_lnnrE3ULDWB4HTTGcQ83u5E8h_pJjWzWUUq_wnxqPXEVXyhyXXM/s640/Screen+shot+2011-03-15+at+12.52.22+PM.png" width="640" /></a></div>
<br />
Let me get this out of the way: I just love <a href="http://www.rstudio.org/">RStudio</a>.<br />
<br />
Created by a team led by <a href="http://en.wikipedia.org/wiki/Joseph_J._Allaire">JJ Allaire</a>, a name that should ring a bell if you were involved in web development during the Clinton administration, RStudio is an IDE actually designed for R from the ground up. RStudio works on Linux, Mac, and Windows platforms, and can even run over the web.<br />
<br />
While borrowing many of the best features from ESS, the Mac R-GUI, and maybe Anup Parikh's <a href="http://www.red-r.org/">Red-R</a>, RStudio provides solutions to several long-standing barriers that have hampered R code development. For instance, to do Sweave-&gt;tex-&gt;pdf (then view the pdf) in ESS was a frustrating, arthritic <span style="font-family: 'Courier New',Courier,monospace;">(M-n s M-n P)</span> process that flummoxed even the <a href="http://comments.gmane.org/gmane.emacs.ess.general/4873">greatest minds of our generation</a>. RStudio has a handy button (Compile PDF) that brings you all the way from .Rnw to Acrobat. Although this command appears to run in its own session, leading to some <a href="http://support.rstudio.org/help/discussions/suggestions/9-sweave-and-the-r-session-state">unexpected behavior</a> compared to running Sweave from the command line, the fact that this IDE is already geared for Sweave bodes well for future development.<br />
<table align="center" cellpadding="0" cellspacing="0" class="tr-caption-container" style="float: right; margin-left: 1em; text-align: right;"><tbody>
<tr><td style="text-align: center;"><a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEirwZ_hyphenhyphenBEQqwjeKWdXkFUFxnnVXXftfte34ZGy60dSc6D79gwHH_UPOVVst8us9eZL4YrCZiI24m1pOa0ENH6XFmHsidrsRo2N2-D7W_KDKltZU8ztOD5NjROFaD_GlAWhoU_ugjYJQvU/s1600/Screen+shot+2011-03-15+at+12.20.03+PM.png" style="margin-left: auto; margin-right: auto;"><img border="0" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEirwZ_hyphenhyphenBEQqwjeKWdXkFUFxnnVXXftfte34ZGy60dSc6D79gwHH_UPOVVst8us9eZL4YrCZiI24m1pOa0ENH6XFmHsidrsRo2N2-D7W_KDKltZU8ztOD5NjROFaD_GlAWhoU_ugjYJQvU/s1600/Screen+shot+2011-03-15+at+12.20.03+PM.png" /></a></td></tr>
<tr><td class="tr-caption" style="text-align: center;">This is fucking genius</td></tr>
</tbody></table>
<br />
Moving commands back and forth between console and editor is another task that other editors made unnecessarily difficult - the old Mac R GUI console would not let you copy-and-paste a subset of the history, and ESS was always geared toward writing code in the editor and executing lines, never writing code in the console and then committing it to the script. RStudio makes it easy to go in either direction. Control over multiple plots (solving both the overwritten X-Window problem and the annoying type=Cairo PNG problem on OS X) is a welcome relief.<br />
<br />
RStudio offers very good autocompletion for such a relatively weird language - in addition to package methods it is aware of data frame columns and user-defined functions, for instance. <br />
<br />
RStudio has already garnered a good number of <a href="http://support.rstudio.org/help/discussions/suggestions">suggestions</a>. Here's personally what I would like to see:<br />
<ol>
<li>More support for LaTeX markup, including menu driven formatting options so users don't have to memorize stuff like \textbf{}</li>
<li> More built-in aesthetic support for <a href="http://had.co.nz/ggplot2/">ggplot2</a>, something where users are given a WYSIWYG manipulating an existing plot similar to Jeroen Ooms' <a href="http://yeroon.net/ggplot2/">ggplot2 web application</a></li>
<li>A non-sudo Linux binary, and a way of pointing RStudio at the various R and TeX installations kicking around a server without re-installing from source.</li>
<li>Better control over the working directory (<a href="http://flowingdata.com/2011/03/02/rstudio-a-new-ide-for-r-that-makes-coding-easier/">already reported and a likely future feature</a>)</li>
<li>A means of quickly seeing where source files are actually located without mouseover</li>
<li>Integration with version control.</li>
<li>Code cleanup and indenting </li>
</ol>
Jermdemohttp://www.blogger.com/profile/01662705354227625640noreply@blogger.com3tag:blogger.com,1999:blog-8532065960756482590.post-47159692793522962892010-12-23T15:42:00.000-05:002010-12-29T16:24:12.858-05:00Chromosome bias in R, my notebookMy goal is to develop a means of detecting chromosome bias from a human BAM file.<br />
<br />
Because I've been working with proprietary and novel plant genomes for the last three years, I haven't had the chance to use any of the awesome UCSC-based annotation features that have been introduced and refined in Bioconductor until now. I've returned to biomedical research and I have some catching up to do.<br />
<br />
BSgenome might sound like horsecrap, but each <b>B</b>io<b>s</b>trings-based <b>genome</b> data package is actually a huge digested version of a UCSC/NCBI genome freeze and basic sequence annotation compiled into R objects.<br />
<br />
<a href="http://www.bioconductor.org/help/bioc-views/release/bioc/html/BSgenome.html">BSgenome at Bioconductor</a><br />
<blockquote>Be careful with googling bioconductor help - often the results point to older versions. Make sure your link has "release" in the url.</blockquote><br />
Here are the BSgenomes available today:<br />
<pre class="brush: shell">> available.genomes(type=getOption("pkgType"))
BioC_mirror = http://www.bioconductor.org
Change using chooseBioCmirror().
[1] "BSgenome.Amellifera.BeeBase.assembly4" "BSgenome.Amellifera.UCSC.apiMel2"
[3] "BSgenome.Athaliana.TAIR.01222004" "BSgenome.Athaliana.TAIR.04232008"
[5] "BSgenome.Btaurus.UCSC.bosTau3" "BSgenome.Btaurus.UCSC.bosTau4"
[7] "BSgenome.Celegans.UCSC.ce2" "BSgenome.Celegans.UCSC.ce6"
[9] "BSgenome.Cfamiliaris.UCSC.canFam2" "BSgenome.Dmelanogaster.UCSC.dm2"
[11] "BSgenome.Dmelanogaster.UCSC.dm3" "BSgenome.Drerio.UCSC.danRer5"
[13] "BSgenome.Drerio.UCSC.danRer6" "BSgenome.Ecoli.NCBI.20080805"
[15] "BSgenome.Ggallus.UCSC.galGal3" "BSgenome.Hsapiens.UCSC.hg17"
[17] "BSgenome.Hsapiens.UCSC.hg18" "BSgenome.Hsapiens.UCSC.hg19"
[19] "BSgenome.Mmusculus.UCSC.mm8" "BSgenome.Mmusculus.UCSC.mm9"
[21] "BSgenome.Ptroglodytes.UCSC.panTro2" "BSgenome.Rnorvegicus.UCSC.rn4"
[23] "BSgenome.Scerevisiae.UCSC.sacCer1" "BSgenome.Scerevisiae.UCSC.sacCer2"
</pre>Select and load hg19<br />
<pre class="brush: shell">biocLite("BSgenome.Hsapiens.UCSC.hg19")
library("BSgenome.Hsapiens.UCSC.hg19")
</pre><br />
When we get an alignment file, one of the first things we want to do is look for red flags that might indicate something went awry in the lab or downstream. An example is chromosome bias - are we seeing more reads aligned to certain chromosomes than would be expected from size alone? A sticky question, since any experiment will introduce confounds based on the inherently uneven distribution of interesting genomic features, not to mention mappability. And yet I think this is still a worthwhile exercise and should be part of any NGS pipeline.<br />
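A quick version of this red-flag check can also be roughed out outside R: <span style="font-family: "Courier New",Courier,monospace;">samtools idxstats</span> on an indexed BAM prints one line per reference sequence (name, length, mapped reads, unmapped reads). A sketch with made-up counts standing in for the real idxstats output - note it uses raw lengths, whereas below I go to the trouble of using gap-masked lengths:<br />

```shell
# Rough reads-per-megabase check from idxstats-style output.
# The printf stands in for real output from:
#   samtools idxstats myIndexedSortedBamFile.bam
# Columns: reference name, reference length, mapped reads, unmapped reads.
ratios=$(printf '%s\n' \
    'chr1 249250621 3616909 0' \
    'chr2 243199373 3642052 0' \
    'chrY 59373566 120000 0' |
  awk '{ printf "%s %.1f\n", $1, $3 / ($2 / 1e6) }')
echo "$ratios"   # one "chrN reads-per-Mb" line per chromosome
```

Chromosomes whose reads-per-Mb ratio sits far from the rest are the ones worth a closer look.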
<br />
What we don't want to do is ignore that 7.6% of the GRCh37 freeze is sequence that looks like "NNNNNNN" - gaps representing unsequenceable regions such as centromeres, scaffold gap delineations, and the like. We especially don't want to ignore gaps because they are not evenly distributed across the chromosomes (chrY is 56% gaps).<br />
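That gap fraction is easy to sanity-check straight from a FASTA file by counting Ns. A minimal sketch on a toy two-record FASTA (in practice, redirect your genome FASTA into awk in place of the here-doc):<br />

```shell
# Fraction of N bases in a FASTA.  The here-doc is a toy example;
# feed awk your genome FASTA instead for the real number.
nfrac=$(awk '!/^>/ {
        total += length($0)       # bases on this sequence line
        ns += gsub(/[Nn]/, "")    # gsub returns the number of Ns it replaced
    }
    END { printf "%d/%d bases are N (%.1f%%)", ns, total, 100 * ns / total }' <<'EOF'
>toy1
ACGTNNNNACGT
>toy2
NNNNACGTACGT
EOF
)
echo "$nfrac"   # prints: 8/24 bases are N (33.3%)
```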
<br />
Raw chromosome length can be obtained from the BAM file header, but for this chromosome bias analysis I need the "non-gappy" length, the portion eligible for alignment. This is one of the "masks" turned on by default for BSgenomes in order to allow various functions to work properly (see MaskCollection in the <a href="http://www.bioconductor.org/help/bioc-views/release/bioc/html/IRanges.html">IRanges</a> package for more information).<br />
<br />
<br />
<pre class="brush: shell">> masks(Hsapiens)
Error in function (classes, fdef, mtable) :
unable to find an inherited method for function masks, for signature "BSgenome"
#oops I see masks are a member of MaskedDNAString objects (i.e. chromosomes) not BSgenome objects
> masks(Hsapiens$chrY)
MaskCollection of length 4 and width 59373566
masks:
maskedwidth maskedratio active names desc
1 33720000 0.567929506 TRUE AGAPS assembly gaps
2 0 0.000000000 TRUE AMB intra-contig ambiguities (empty)
3 16024357 0.269890426 FALSE RM RepeatMasker
4 587815 0.009900281 FALSE TRF Tandem Repeats Finder [period<=12]
all masks together:
maskedwidth maskedratio
49783032 0.8384713
all active masks together:
maskedwidth maskedratio
33720000 0.5679295
#I think the maskedwidth should reveal sum of actively masked nucleotides
> maskedwidth(Hsapiens$chrY)
[1] 33720000
#can we mess with the masks?
> active(masks(Hsapiens$chrY))["RM"]<-TRUE
Error in `$<-`(`*tmp*`, "chrY", value = < S4 object of class "MaskedDNAString">) :
no method for assigning subsets of this S4 class
#oops I can't manipulate a BSgenome this way - it is behaving like a class instead of an instance of a class
> chrY<-Hsapiens$chrY
> active(masks(chrY))["RM"]<-TRUE
> maskedwidth(chrY)
[1] 49744357
# ok maskedwidth is working as I figured, but i need unmasked width
> unmaskedWidth<-function(chr){length(chr)-maskedwidth(chr)}
> unmaskedWidth(Hsapiens$chrY)
[1] 25653566
#how can I iterate over something with a $ operator? let's try [[]]
> unmaskedWidth(Hsapiens[["chrY"]])
[1] 25653566
</pre>Now I want to create a data frame with sequence names and unmaskedWidths to go with some read counts from a BAM file. Whenever I want to go from a list, through a function, to a data frame I think <a href="http://had.co.nz/plyr/">plyr</a>, specifically ldply (<b>l</b>ist to <b>d</b>ata frame).<br />
<pre class="brush: shell"># let's take chr 1-22,X,Y, skipping the unscaffolded sequences and mitochondrial chr
> maskedSizes<-ldply(.data=seqnames(Hsapiens)[1:24],
.fun=function(x){
data.frame(chr=x,seqlength=length(Hsapiens[[x]]),
unmaskedWidth=unmaskedWidth(Hsapiens[[x]]))},
.progress="text",
.parallel=TRUE)
> maskedSizes
chr seqlength unmaskedWidth
1 chr1 249250621 225280621
2 chr2 243199373 238204518
3 chr3 198022430 194797135
4 chr4 191154276 187661676
5 chr5 180915260 177695260
6 chr6 171115067 167395066
7 chr7 159138663 155353663
8 chr8 146364022 142888922
9 chr9 141213431 120143431
10 chr10 135534747 131314738
11 chr11 135006516 131129516
12 chr12 133851895 130481393
13 chr13 115169878 95589878
14 chr14 107349540 88289540
15 chr15 102531392 81694766
16 chr16 90354753 78884753
17 chr17 81195210 77795210
18 chr18 78077248 74657229
19 chr19 59128983 55808983
20 chr20 63025520 59505520
21 chr21 48129895 35106642
22 chr22 51304566 34894545
23 chrX 155270560 151100560
24 chrY 59373566 25653566
</pre><br />
Load the BAM file and get read counts in a data frame.<br />
<pre class="brush: shell">#other methods include scanBam and readAligned
bamFile<-readBamGappedAlignments("myIndexedSortedBamFile.bam")
> levels(rname(bamFile))
[1] "1" "2" "3" "4" "5"
[6] "6" "7" "8" "9" "10"
[11] "11" "12" "13" "14" "15"
[16] "16" "17" "18" "19" "20"
[21] "21" "22" "X" "Y" "MT"
[26] "GL000207.1" "GL000226.1" "GL000229.1" "GL000231.1" "GL000210.1"
[31] "GL000239.1" "GL000235.1" "GL000201.1" "GL000247.1" "GL000245.1"
[36] "GL000197.1" "GL000203.1" "GL000246.1" "GL000249.1" "GL000196.1"
[41] "GL000248.1" "GL000244.1" "GL000238.1" "GL000202.1" "GL000234.1"
[46] "GL000232.1" "GL000206.1" "GL000240.1" "GL000236.1" "GL000241.1"
[51] "GL000243.1" "GL000242.1" "GL000230.1" "GL000237.1" "GL000233.1"
[56] "GL000204.1" "GL000198.1" "GL000208.1" "GL000191.1" "GL000227.1"
[61] "GL000228.1" "GL000214.1" "GL000221.1" "GL000209.1" "GL000218.1"
[66] "GL000220.1" "GL000213.1" "GL000211.1" "GL000199.1" "GL000217.1"
[71] "GL000216.1" "GL000215.1" "GL000205.1" "GL000219.1" "GL000224.1"
[76] "GL000223.1" "GL000195.1" "GL000212.1" "GL000222.1" "GL000200.1"
[81] "GL000193.1" "GL000194.1" "GL000225.1" "GL000192.1"
#the deflines in my reference do not match the BSgenome names, must fix at least the chromosomes of interest
levels(rname(bamFile))[1:25]<-paste('chr',c(1:22,'X','Y','M'),sep='')
#run length encoded read counts per chromosome
readRle<-rname(bamFile)
#get a data frame with chromosome and read counts
allReadsDf<-ldply(runValue(readRle),function(x){data.frame(chr=levels(runValue(readRle))[x],reads=runLength(readRle)[x])})
> head(allReadsDf)
chr reads
1 chr1 3616909
2 chr2 3642052
3 chr3 2843019
4 chr4 2636141
5 chr5 2590352
6 chr6 2497123
</pre><br />
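Relabeling the factor levels only fixes the names inside this R session; to rename the chromosomes in the BAM itself, one option (a sketch - check it against your own header before trusting it) is to rewrite the @SQ lines and hand the edited header to samtools reheader. Here the sed step is demonstrated on a literal header fragment:<br />

```shell
# Rename bare GRCh37 chromosome names to UCSC-style in @SQ header lines.
# Against a real file you would run something like:
#   samtools view -H in.bam | sed -e 's/SN:\([0-9XY]\)/SN:chr\1/' -e 's/SN:MT/SN:chrM/' > header.sam
#   samtools reheader header.sam in.bam > renamed.bam
# The printf fakes three header lines for demonstration; unplaced GL* contigs
# (which have no BSgenome counterpart anyway) are left untouched.
fixed=$(printf '@SQ\tSN:%s\tLN:%s\n' 1 249250621 MT 16569 GL000207.1 4262 |
  sed -e 's/SN:\([0-9XY]\)/SN:chr\1/' -e 's/SN:MT/SN:chrM/')
echo "$fixed"
```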
Merge the read counts with unmasked chromosome lengths and plot their relationship.<br />
<pre class="brush: shell">chrSizesReads<-merge(maskedSizes,allReadsDf,sort=FALSE)
library(ggplot2)
p<-ggplot(data=chrSizesReads, aes(x=unmaskedWidth, y=reads, label=chr)) +
geom_point() +
geom_text(vjust=2,size=3) +
stat_smooth(method="lm", se=TRUE,level=0.95) +
ylab("Reads aligned") +
xlab("Unmasked chromosome size") +
opts(title = "Reads vs Chromosome Size")
print(p)
</pre><div class="separator" style="clear: both; text-align: center;"><a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEikQqdIJXvfhCF-jd6Vzvk1PU-sbv0z0I_7k48As_jx01w9btTvEFh2JzULCIaNZhnz41pyFr3ZsnCqowzKQiI-5RF_LJusPunbursRPN5ivugxHdZ-c7tTieQkl1Sw_ftXVKwicyg1Es0/s1600/readChr.png" imageanchor="1"><img border="0" height="305" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEikQqdIJXvfhCF-jd6Vzvk1PU-sbv0z0I_7k48As_jx01w9btTvEFh2JzULCIaNZhnz41pyFr3ZsnCqowzKQiI-5RF_LJusPunbursRPN5ivugxHdZ-c7tTieQkl1Sw_ftXVKwicyg1Es0/s320/readChr.png" width="320" /></a></div>There should be a strong linear relationship between read count and chromosome size. We can test this using a linear regression model, the null hypothesis being the number of reads aligned to a chromosome is independent of its size. <br />
<pre class="brush: shell">> mylm<-lm(reads~unmaskedWidth,data=chrSizesReads)
> mysummary<-summary(mylm)
> mysummary
Call:
lm(formula = reads ~ unmaskedWidth, data = chrSizesReads)
Residuals:
Min 1Q Median 3Q Max
-271816 -108122 -43984 42826 676284
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 1.774e+05 9.505e+04 1.866 0.0754 .
unmaskedWidth 1.455e-02 7.145e-04 20.365 9.12e-16 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Residual standard error: 206600 on 22 degrees of freedom
Multiple R-squared: 0.9496, Adjusted R-squared: 0.9473
F-statistic: 414.8 on 1 and 22 DF, p-value: 9.123e-16
</pre>The low p-value (against the null hypothesis that chromosome size has no influence) and the high R-squared (a measure of the model's predictive value) suggest this model is sound.<br />
<br />
The following plot is obtained from the standardized residuals (the standardized differences between observed and expected values) of the linear model described earlier.<br />
<br />
Chromosome bias refers to an uneven distribution of read alignments across chromosomes. We can expect some chromosome bias in treatment sets because of the inherent nature of any experimental condition - recovered fragments will not be evenly distributed among chromosomes because regions of effect are not evenly distributed. Other possible sources of chromosome bias include heterochromatin, uneven repeat content, and the potential for aligning against an incorrect set of sex chromosomes. Aligners will typically assign reads that map ambiguously to multiple locations to discrete positions at random, spreading them evenly among the candidates. <br />
<pre class="brush: shell">> p<-qplot(chrSizesReads$chr,rstandard(mylm))+
aes(label=chrSizesReads$chr)+
geom_text(vjust=2,size=3)+
xlab("Chromosome")+
ylab("Std Residual from lm (reads)")+
geom_abline(slope=0,intercept=0)+
opts(axis.text.x = theme_text(angle=45,hjust=1))+
opts(title = "Linear Regression Residuals")
> print(p)
</pre><div class="separator" style="clear: both; text-align: center;"><a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEgY5Ut4RdeF93ArZHwavk_IBp0O1REonnKgMX9KimOWTJPVtv1YIi1YjCcxaG3_BIhfhBWrA5QaV6MHqJLQ-LDiUf3CsphBnvXJJTFGs6Pc2qY3GeLdq8aJyY3BRyUQnKYQXhXgVbQymZY/s1600/resid.png" imageanchor="1" style=""><img border="0" height="320" width="320" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEgY5Ut4RdeF93ArZHwavk_IBp0O1REonnKgMX9KimOWTJPVtv1YIi1YjCcxaG3_BIhfhBWrA5QaV6MHqJLQ-LDiUf3CsphBnvXJJTFGs6Pc2qY3GeLdq8aJyY3BRyUQnKYQXhXgVbQymZY/s320/resid.png" /></a></div><br />
Fortunately, there is no clear pattern to these residual values (a pattern would indicate problems with the model), but with a Z-score of 3.36, chrX appears to be an outlier. With 46M total alignments this is certainly not due to sampling error, but we can still test our observation with a Lund statistic. <br />
<pre class="brush: shell">#http://stackoverflow.com/questions/1444306/how-to-use-outlier-tests-in-r-code
>lundcrit<-function(a, n, q) {
F<-qf(c(1-(a/n)),df1=1,df2=n-q-1,lower.tail=TRUE)
crit<-((n-q)*F/(n-q-1+F))^0.5
crit
}
> n<-nrow(chrSizesReads)
> q<-length(mylm$coefficients)
> crit<-lundcrit(0.05,n,q)
> chrSizesReads[which(rstandard(mylm)>crit),"chr"]
[1] chrX
</pre><br />
Happy holidays!Jermdemohttp://www.blogger.com/profile/01662705354227625640noreply@blogger.com3tag:blogger.com,1999:blog-8532065960756482590.post-32104635919389698662010-12-09T17:41:00.000-05:002012-01-20T22:56:37.301-05:00Directory-based bash historiesUsing a directory-based bash history allows for a record of shell actions on a directory basis, so a group of developers has some record of what was done while in a directory, when, and by whom. This can be helpful when trying to reconstruct history with limited documentation.<br />
<br />
I know this setup will be of some benefit to my successor at my previous job because he has access to everything I ever did in any project directory.<br />
<br />
Place this code in your ~/.bash_profile or ~/.bashrc<br />
<br />
(type source ~/.bash_profile (or .bashrc) to load this for your current session)<br />
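The gist below is the version I actually use; the core idea boils down to a couple of lines (a minimal sketch - the function and file names here are my own):<br />

```shell
# Directory-based bash history, minimal sketch: before each prompt is drawn,
# re-point HISTFILE at a history file inside the current directory and append
# the session's new commands to it.
_dir_history() {
    export HISTFILE="$PWD/.dir_bash_history"
    history -a   # append commands entered since the last write
}
PROMPT_COMMAND="_dir_history"
```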
<br />
<script src="https://gist.github.com/1651133.js"> </script>Jermdemohttp://www.blogger.com/profile/01662705354227625640noreply@blogger.com0tag:blogger.com,1999:blog-8532065960756482590.post-9384236237543900092010-08-30T14:12:00.000-04:002010-08-31T13:50:03.872-04:00NGS viewers reviewedI gathered up some of the recent <i>free</i> next generation sequence viewers that were capable of viewing BAM files, and put each through the motions with a few BAM files and reference sequences of various sizes. While there are some great ideas and several choices to be found along the feature spectrum, I think we are still in the dark ages with this stuff. No viewer has really been able to entirely combine usability with performance and analysis capabilities, let alone extensibility and web connectivity.<br />
<br />
<div class="separator" style="clear: both; text-align: center;"><a href="http://picasaweb.google.com/lh/photo/oeGoMtCA47G7fFzw89a96Q?feat=embedwebsite" style="clear: right; float: right; margin-bottom: 1em; margin-left: 1em;"><img src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEh1xWQ4IANBnIx5d7Y5d7IU_a30v87JOEE7z3bY4goSBzEbAzsAMMsiph7ZL-HSRNH4eV7moQk72GefMHtpJ3RuSae_kIL9oLCxqkGa4E-AFzWrHoRVQCq0HyG-sDsPAaBd9k6Vp3VUIq8/s288/tview.png" /></a></div><span style="font-size: large;"><a href="http://samtools.sourceforge.net/">tview</a></span><br />
<b>My take:</b> tview is the barebones, text-rendered viewer that is included with Samtools. People who favor this as their BAM viewer probably think Vim is too polished. Even the very limited navigation is remarkably unintuitive (going to a coordinate requires chr:position even if you have just one chromosome, and no error is displayed if you forget the chr).<br />
<b>Most resembles which video game: </b>Oregon Trail<br />
<b>Good standout feature:</b> command line access<br />
<b>Bad feature:</b> text display<br />
<br />
<div class="separator" style="clear: both; text-align: center;"><a href="http://picasaweb.google.com/lh/photo/WexFGlirQLnEUmeBpTZ4CQ?feat=embedwebsite" style="clear: left; float: left; margin-bottom: 1em; margin-right: 1em;"><img src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEgJoWzwdcQ3BSEBd4s4NahXR6sph23Jjh-MfRNMoHukEHh_NuiNPKGIXWbWYcKMa1L0eTtfdG61SB99Vi0Diw4plOSDD3PgoKJN7benL8b31gPX1PIRcLZmGfscxCWZ9r0HN9LKlE6IfI4/s288/bamview.png" /></a></div><span style="font-size: large;"><a href="http://bamview.sourceforge.net/">BamView</a></span><br />
<b>My take:</b> Bamview is a wicked fast simple BAM file viewer. It doesn't have much in the way of features, but for cursory examination of BAM files it is more palatable than tview.<br />
<b>Most resembles which video game: </b>Burgertime<br />
<b>Good standout feature:</b> strand split screen<br />
<b>Bad feature:</b> drag selecting a region turns it red, and umm... that's all it does<br />
<br />
<div class="separator" style="clear: both; text-align: center;"><a href="http://picasaweb.google.com/lh/photo/gey5Qp6K7mT-ZErxkbN3IA?feat=embedwebsite" style="clear: right; float: right; margin-bottom: 1em; margin-left: 1em;"><img src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEh0IASkgTwD5wLovhSd2AdQdakYAn_z2AQOq36V4AsBwS4ZNeZmZkvxNzbtPGcRffB_GZtg0FLdChv7nM-yV83vlAG4ngiVupRdnHVzVArDwyeULz46GsSuicmmUsRh4KlEwWDI-7e2xr0/s288/genoviewer-2.png" /></a></div><span style="font-size: large;"><a href="http://www.genoviewer.com/">GenoViewer</a></span><br />
<b>My take:</b> The only NGS viewer endorsed by <a href="http://www.youtube.com/watch?v=--Vaz9jW054">Speak</a> the Hungarian rapper, unfortunately this recent entrant leaves a lot to be desired in terms of performance with large files. GenoViewer is very hard on the eyes - the indiscriminate use of primary colors looks like a kid somehow vomited up the ball pit at Chuck E. Cheese.<br />
<b>Most resembles which video game: </b>Centipede<br />
<b>Good standout feature:</b> promises that "You will not get lost in the details, and can easily figure out the true meaning behind the data. Guaranteed."<br />
<b>Bad feature:</b> graphics<br />
<br />
<div class="separator" style="clear: both; text-align: center;"><a href="http://picasaweb.google.com/lh/photo/TqnwOx1tMnM8vcYS7_lHSg?feat=embedwebsite" style="clear: left; float: left; margin-bottom: 1em; margin-right: 1em;"><img src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEgzBgPZVOPgeJE7ghvuQBZ4cURNNtKpYwGzQmbswsBr5McmRkXmXVWpIMpq6Uu6AK-7A0uu-_FtON60KAULhPAk3HhlN6SND50xyhAvCLO9rYPRf6AfNauv53uPYSYZruEDTtQ9gi84huE/s288/magicviewer.png" /></a></div><span style="font-size: large;"><a href="http://bioinformatics.zj.cn/magicviewer/">MagicViewer</a></span><br />
<b>My take:</b> MagicViewer has come a long way since its initial release last December. With its interesting pie-chart icon renderings of SNP purity, decent treatment of annotation tracks, and improved performance, MagicViewer might soon be a contender to Tablet in the midweight category. The navigation is workable but takes some getting used to - it's unclear when to use scroll bars vs. arrows.<br />
<b>Most resembles which video game: </b>Moon Patrol<br />
<b>Good standout feature:</b> primer design tool<br />
<b>Bad feature:</b> some regions are simply not visible - no reference, no reads<br />
<br />
<div class="separator" style="clear: both; text-align: center;"><a href="http://picasaweb.google.com/lh/photo/yPwsdAPqe8I5vM4gtHq_bA?feat=embedwebsite" style="clear: right; float: right; margin-bottom: 1em; margin-left: 1em;"><img src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEh2wIKIyQgJTV_RH8bQ5ow4CDQP-dBCHBhv5M4zSkCkbkO3ovWWE9tkQuA0ydzdCaEHCEYssBvs0ulGLRv8AVVSh164cJosEcsIFuPLUom36siXDtKTPlTNv-_75SdPVsglJqsEvxZdNlk/s288/tablet.png" /></a></div><span style="font-size: large;"><a href="http://bioinf.scri.ac.uk/tablet/">Tablet</a></span><br />
<b>My take:</b> Hands down the most attractive of the viewers, Tablet is aimed at fostering a delicate balance between performance, features, and aesthetics. Tablet comes with a suite of read views - Packed, Stacked, and Classic - to suit both young children and elderly scientists alike. GFF feature files can be loaded but they appear to merely serve as position indices.<br />
<b>Most resembles which video game: </b>SimCity<br />
<b>Good standout feature:</b> interface<br />
<b>Bad feature:</b> read insertions not displayed correctly<br />
<br />
<br />
<br />
<div class="separator" style="clear: both; text-align: center;"><a href="http://picasaweb.google.com/lh/photo/Scp_X0E2C49rvYzqX8_CFA?feat=embedwebsite" style="clear: left; float: left; margin-bottom: 1em; margin-right: 1em;"><img src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEjQ9cuDDp2x9Lqs2Fze8rMi89C2iSkxuKLCyoqjcHLW8hD83cA77ujvk2TXFsJ3TrmVehUW_tEt1oU-PsX6Yai_hUijUveQLHezVqz64W2VfpBfzId94m8sjtCY9dx9HUJVCYSpzkLu-mU/s280/igv.png" /></a></div><span style="font-size: large;"><a href="http://www.broadinstitute.org/igv/">IGV</a></span><br />
<b>My take:</b> The Integrative Genomics Viewer is a serious tool for exploring and analyzing large datasets. In addition to viewing, IGV is designed to allow users to extract the kind of hard publishable data that has typically been the domain of Bio* scripts. Like the true product of an Ivy League education, IGV can appear aloof and arrogant to newcomers. The viewer will let you load BAM files and other annotation tracks that have nothing to do with the reference without comment or guidance, then require an extra unintuitive click to actually generate a view. While not the easiest to use or the best performer, there is nothing comparable when it comes to generating real queries on your data from within the viewer.<br />
<b>Most resembles which video game: </b>Dig Dug<br />
<b>Good standout feature:</b> analysis tools<br />
<b>Bad feature:</b> throws a hissy fit if it cannot connect to home server<br />
<br />
<br />
<div class="separator" style="clear: both; text-align: center;"><a href="http://picasaweb.google.com/lh/photo/8pD4iTemnoCaiTx1EpeAug?feat=embedwebsite" style="clear: right; float: right; margin-bottom: 1em; margin-left: 1em;"><img src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEiPRWUeOe1OUmlWR1FYjAwiKn4Cfp-7yDS_HGI38dWq811FFn6vezgIG9XgNl3yE-I4wQlSSG9cjM9eJDn3WNDXc3nkXUuB2P2oSThIASzK-xwIlAbGeZxZJsP9yiEQ1AekdlAZ6syyQkg/s288/gb2.png" /></a></div><span style="font-size: large;"><a href="http://gmod.org/wiki/GBrowse">gbrowse2</a></span><br />
<b>My take:</b> gbrowse2 is the AJAXified protege of the venerable generic genome browser. Samtools integration is a recent addition to this highly extensible platform, which has been used for years to display everything from large genomes to small sequencing projects. Of all the viewers here, GB2 provides the best set of visualization tools, such that virtually any biological information that can be rendered linearly has been done so as gbrowse tracks. As the lone true web application here, gbrowse2 allows genomic positions to be hyperlinked or even snapshot-embedded on the web. The web provides the best platform to share visual genomic data among several users. However, a gbrowse2 instance with BAM tracks can be a massive pain to install, configure, and debug ("landmark chrI not found" is the most popular google search in all of bioinformatics). Novices can expect a minefield of historical gotchas and arcane conventions ("Name=" not "name=" field in gff3, bp_seqfeature_load instead of bp_load_gff for gff3, no validator for conf files), and even experienced users are often baffled by cryptic errors that pop up in server logs.<br />
<b>Most resembles which video game: </b>Sissyfight 2000<br />
<b>Good standout feature:</b> hyperlinks<br />
<b>Bad feature:</b> setupJermdemohttp://www.blogger.com/profile/01662705354227625640noreply@blogger.com16tag:blogger.com,1999:blog-8532065960756482590.post-9180291394140780572010-08-18T12:42:00.000-04:002010-08-19T08:15:29.212-04:00My thoughts on the acquisition of Ion Torrent by Life Technologies<a onblur="try {parent.deselectBloggerImageGracefully();} catch(e) {}" href="http://www.iontorrent.com/lib/images/products/pgm_images.jpg"><img style="float:right; margin:0 0 10px 10px;cursor:pointer; cursor:hand;width: 333px; height: 198px;" src="http://www.iontorrent.com/lib/images/products/pgm_images.jpg" border="0" alt="" /></a>Yesterday it was announced that Ion Torrent, makers of the Ion Personal Genome Machine (PGM) sequencer, would be acquired by <strike>Invitrogen</strike> <strike>ABI</strike> Life Technologies for $375M in cash and stock, with the possibility of another $350M if various milestones are met.<br />
<br />
Gregory Lucier, chairman and CEO of Life Technologies, said, "We believe Ion Torrent's technology will represent a profound change for the life sciences industry, as fundamental as the one we saw with the introduction of qPCR." That analogy might not sound earth-shattering, but I suspect he is using qPCR as an example because it is one of the few molecular biology (as opposed to biochemical) techniques regularly used in clinical settings.<br />
<br />
The PGM is a unique machine because it is the first second-generation sequencer (tentative <a href="https://spreadsheets.google.com/ccc?key=0AvaxS3m5rl-9dHdtUGRtaGlsZWNFNWJleDRXaUhQTHc&hl=en#gid=0">specs</a>: ~3 million 200 bp reads from a $500, 1 hr run on a $50k machine) that could conceivably be leased by a small clinical testing facility, like the ones fed by those LabCorp boxes you see scattered all over strip malls. Bear in mind even a capillary-based Sanger sequencer like the ABI3730xl costs a whopping <a href="http://webcache.googleusercontent.com/search?q=cache:Zr-H90V_YUcJ:www.genomeweb.com/sequencing/used-abi-3730xls-are-%E2%80%98flooding-market%E2%80%99-centers-shift-next-gen-sequencers+3730xl+price&cd=7&hl=en&ct=clnk&gl=us&client=firefox-a">$375,000</a>, which comes as a shock to those of us who mostly work with next-generation sequencing data. You can see not only why the PGM is really a game changer, but how it might fit into Life Technologies' offerings.<br />
<br />
Exactly which clinical diagnostics are, or will be, suited for this machine is unclear. With the right foolproof software this could replace a lot of PCR and microarray-based tests. Read lengths and throughput are bound to increase just as they did with Solexa. Future applications like tumor sequencing and 16S rRNA microbiome sequencing have not even entered into medical practice yet.<br />
<br />
The key to Life Technologies succeeding in these areas will likely come down to non-technical challenges:<br />
<ul><li>Getting FDA approval for the PGM as a medical device and for various protocols based on the PGM as approved clinical tests. Life Technologies has experience with this process. Ion Torrent clearly does not.</li>
<li>Convincing insurance companies and HMOs that these tests are cost-effective diagnostics. When you consider the exorbitant cost of other tests - e.g. several MRIs over a course of chemotherapy - this might not be such a hard sell.</li>
<li>Developing a business model that will allow small clinics to lease the machine for little or even no cost provided they agree to purchase a minimum amount of consumables. Life Technologies will likely market the PGM to smaller labs who themselves will inevitably face competition from larger dedicated sequencing centers trying to centralize this type of work using bigger machines from Illumina or PacBio (or even LT's SOLiD).</li>
</ul><br />
Although these obstacles seem daunting, I have sat through academic departmental meetings where very knowledgeable sequencer salespeople were invited back multiple times only to be jerked around as the faculty hemmed and hawed over whether this was the right $400k machine to buy at the time. That was undoubtedly an expensive and frustrating sales process for Solexa and 454. With the PGM, Life Technologies can pitch this much cheaper machine to an albeit different group of skeptics, clinical lab managers, but one with a clear bottom line in mind.Jermdemohttp://www.blogger.com/profile/01662705354227625640noreply@blogger.com3tag:blogger.com,1999:blog-8532065960756482590.post-63888255660687774032010-03-10T13:21:00.000-05:002010-03-10T13:40:56.724-05:00Use local instead of my in Perl when using $$ (double dollar sign) inside an enclosed blockI thought I would be all clever and initialize several hash indices at once using the $$ notation in Perl - evaluating strings as variables. Unfortunately using "my" screws this up bigtime.<br /><br />This must be another circumstance spelled out in the incomprehensible "my" treatise:<br /><a href="http://perldoc.perl.org/perlsub.html#Private-Variables-via-my%28%29">http://perldoc.perl.org/perlsub.html#Private-Variables-via-my()</a><br /><br /><pre class="brush: perl"><br />my %foo;<br />$foo{'bar'}=1;<br />foreach $fooDex(qw(foo)){<br /> $$fooDex{'bar'}=2;<br />}<br />print $foo{'bar'}."\n"; #prints 1<br /></pre><br /><br /><br /><pre class="brush: perl"><br />local %foo;<br />$foo{'bar'}=1;<br />foreach $fooDex(qw(foo)){<br /> $$fooDex{'bar'}=2;<br />}<br />print $foo{'bar'}."\n"; #prints 2<br /></pre><br /><br />This is one of those solutions that gives you a feeling of dread instead of accomplishment.Jermdemohttp://www.blogger.com/profile/01662705354227625640noreply@blogger.com1tag:blogger.com,1999:blog-8532065960756482590.post-67233663485179879372010-03-09T16:57:00.000-05:002010-04-15T10:39:09.475-04:00Getting the basics from 
readAlignedThe <a href="http://manuals.bioinformatics.ucr.edu/home/ht-seq">UCR</a> guide is a little sparse with regard to getting basic information from readAligned.<br /><br />I'd like to add to the general cookbook. If some bioc people out there can contribute some alignment recipes and fill me in on some more basics, please comment:<br /><br /><pre class="brush: shell"><br />alignedReads <- readAligned("./", pattern="output.bowtie", type="Bowtie")<br /><br />#how many reads did I attempt to align<br />#i don't think you can get this from alignedReads<br /><br />#how many reads aligned (one or more times)<br />length(unique(id(alignedReads)))<br /><br />#how many hits were there?<br />length(alignedReads)<br /><br />#how many reads produced multiple hits<br />length(unique(id(alignedReads[srduplicated(id(alignedReads))])))<br /><br />#how many reads produced multiple hits at the best strata?<br />#please fill me in on this one<br /><br />#how many reads aligned uniquely (with exactly one hit)<br />length(unique(id(alignedReads)))-length(unique(id(alignedReads[srduplicated(id(alignedReads))])))<br /><br />#how many reads aligned uniquely at the best strata (the other hits were not as good)<br />#please fill me in on this one<br /><br />#how many unique positions were hit? what if I ignore strand?<br />#please fill me in on this one<br /><br />#how many converging hits were there (two query sequences aligned to the same genomic position)<br />#please fill me in on this one<br /></pre>Jermdemohttp://www.blogger.com/profile/01662705354227625640noreply@blogger.com1tag:blogger.com,1999:blog-8532065960756482590.post-6746365734350456942010-03-03T16:48:00.000-05:002010-08-04T12:22:36.400-04:00Quality trimming in R using ShortRead and BiostringsI wrote an R function to do soft-trimming, right clipping FastQ reads based on quality.<br />
<br />
This function has the option of leaving out sequences trimmed to extinction and will do left-side fixed trimming as well.<br />
<pre class="brush: shell">#softTrim
#trim first position lower than minQuality and all subsequent positions
#omit sequences that after trimming are shorter than minLength
#left trim to firstBase (1 implies no left trim)
#input: ShortReadQ reads
# integer minQuality
# integer firstBase
# integer minLength
#output: ShortReadQ trimmed reads
library("ShortRead")
softTrim<-function(reads,minQuality,firstBase=1,minLength=5){
  qualMat<-as(FastqQuality(quality(quality(reads))),'matrix')
  qualList<-split(qualMat,row(qualMat))
  ends<-as.integer(lapply(qualList,
                          function(x){which(x < minQuality)[1]-1}))
  #length=end-start+1, so keep start no greater than end+1 to avoid a negative-length read
  starts<-as.integer(lapply(ends,function(x){min(x+1,firstBase)}))
  #use whatever QualityScore subclass is sent
  newQ<-ShortReadQ(sread=subseq(sread(reads),start=starts,end=ends),
                   quality=new(Class=class(quality(reads)),
                               quality=subseq(quality(quality(reads)),
                                              start=starts,end=ends)),
                   id=id(reads))
  #apply minLength using srFilter
  lengthCutoff <- srFilter(function(x) {
    width(x)>=minLength
  },name="length cutoff")
  newQ[lengthCutoff(newQ)]
}
</pre><br />
<br />
To use:<br />
<pre class="brush: shell">library("ShortRead")
source("softTrimFunction.R") #or whatever you want to name this
reads<-readFastq("myreads.fq")
trimmedReads<-softTrim(reads=reads,minQuality=5,firstBase=4,minLength=3)
writeFastq(trimmedReads,file="trimmed.fq")
</pre>
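For readers who think more easily in imperative code, here is the same trimming rule as a minimal Python sketch (my own illustration with hypothetical names, using 0-based indexing rather than the 1-based coordinates of the R version):

```python
def soft_trim(seq, quals, min_quality, first_base=1, min_length=5):
    # clip at the first position whose quality drops below min_quality;
    # that position and everything after it are removed
    end = len(seq)
    for i, q in enumerate(quals):
        if q < min_quality:
            end = i
            break
    # fixed left trim; first_base is 1-based, so 1 means no left trim.
    # never let start pass end, to avoid a negative-length read
    start = min(first_base - 1, end)
    trimmed = seq[start:end]
    # drop reads trimmed to extinction (shorter than min_length)
    return trimmed if len(trimmed) >= min_length else None

# quality dips below 5 at the seventh base, so the read is clipped to 6 bp
print(soft_trim("ACGTACGTAC", [30, 30, 28, 27, 20, 18, 2, 2, 2, 2], 5))  # ACGTAC
```

The R function above does the same thing, vectorized over every read in the ShortReadQ object.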
I strongly recommend reading the excellent UC Riverside HT-Sequencing Wiki cookbook and tutorial if you wish to venture into using R for NGS handling. Among other things, it will explain how to perform casting if you have Solexa-scaled (base 64) fastq files. The softTrim function should handle those encodings too, since it reuses whatever QualityScore subclass it is given.
<a href="http://manuals.bioinformatics.ucr.edu/home/ht-seq">http://manuals.bioinformatics.ucr.edu/home/ht-seq</a>Jermdemohttp://www.blogger.com/profile/01662705354227625640noreply@blogger.com3tag:blogger.com,1999:blog-8532065960756482590.post-68490141217162431352009-11-19T13:56:00.000-05:002010-12-07T14:38:56.703-05:00Using Vmatch to combine assemblies<h3>In praise of Vmatch</h3><br />
If I could take only one bioinformatics tool with me to a desert island, it would be <a href="http://www.vmatch.de/">Vmatch</a>.<br />
<br />
In addition to being a versatile alignment tool, Vmatch has many lesser-known features which also leverage its enhanced suffix array index. The dbcluster option allows Vmatch to cluster similar sequences using Vmatch index files. This method is about 1000x faster than attempting to align all sequences against each other, no matter how clever the algorithm.<br />
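Vmatch's index is far more sophisticated than this, but the basic bargain - pay once to build a sorted index over the sequences, then answer each substring query with a binary search rather than a scan - can be sketched in a few lines of Python (a toy illustration, not Vmatch's actual algorithm):

```python
def suffix_array(text):
    # sort the starting positions of every suffix of text (built once)
    return sorted(range(len(text)), key=lambda i: text[i:])

def contains(text, sa, pattern):
    # binary search for the leftmost suffix >= pattern,
    # then check whether pattern is a prefix of that suffix
    lo, hi = 0, len(sa)
    while lo < hi:
        mid = (lo + hi) // 2
        if text[sa[mid]:sa[mid] + len(pattern)] < pattern:
            lo = mid + 1
        else:
            hi = mid
    return lo < len(sa) and text[sa[lo]:sa[lo] + len(pattern)] == pattern

contigs = "AATTCAGTTGAAGTAATAGAGG"   # imagine all contigs concatenated
sa = suffix_array(contigs)           # the expensive step, done once
print(contains(contigs, sa, "GTTGAAG"))  # True
```

Building the index dominates the cost; after that, every query is a logarithmic-time lookup, which is where the dramatic speedup over all-vs-all alignment comes from.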
<br />
Originally presented as a means to cluster EST sequences, this option is useful for processing the output from de novo short read assemblers like Velvet and ABYSS. Vmatch -dbcluster allows us to easily create a non-redundant set of contig sequences originating from several assemblies.<br />
<br />
<h3>Why would we want to combine assemblies? </h3><br />
<h4>Every kmer is sacred</h4><br />
<img style="margin: 0pt 10px 10px 0pt; float: left; cursor: pointer; width: 320px; height: 313px;" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEgbxtRzh08vkPF3A1qZo0ZXMEXoOw5jnBe8qwNBPwbFOg9mUPSy3qcVIFcOg1WYa7N5uDO7Imv4loThdXnXouLZxxpLx_pHhQE2KmI0eulf4j7RYo-ciMXudo1Hw6DDTm3rapI-YDv_mpc/s640/svarP1.png" alt="" id="BLOGGER_PHOTO_ID_5405584666748028370" border="0" /><p>Why not just take the one with the highest N50? At the recent Genome Informatics meeting at CSHL, Inanc Birol (ABYSS) said "every kmer is sacred". Our experience with the de novo transcriptome assembly of plants has been that <span style="font-weight: bold;">the best set of contigs is spread out all over the parameter space</span>. Longer kmer and cvCut settings can produce a longer set of elite contigs at the expense of omitting lowly expressed (or spurious?) contigs that appear at less stringent settings.</p><br />
<br />
<div class="clear"></div><br />
<br />
<h4>Reads held hostage</h4><br />
<img style="margin: 0pt 10px 10px 0pt; float: right; cursor: pointer; width: 320px; height: 316px;" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEinO0uJ8IbeQ6G4Ap9w0kXsLx98THJMu1CTljHdez3QZcY6QOjc9Py-WZjD29BNvazVyAqZCrUEB6xUfRt8JiJDxKhvUSYTIyLjijFIjZ2u1Tb1Str8SUL5cUsPHzGvZ6o4kKFb7GCMVKs/s640/svarP2.png" alt="" id="BLOGGER_PHOTO_ID_5405585865730142370" border="0" /><p>To the point above, I would add "every cvCut is also sacred". That doesn't exactly roll off the tongue, but we have seen some instances in which an assembly will have a higher read usage at a cvCut of 10 than at 5. This paradox suggests there is a <span style="font-weight: bold;">"contig read hostage"</span> situation, in which reads critical to the formation of longer contigs can be held captive in low coverage short contigs. Raising the cvCut threshold frees up these critical reads to extend more plausible contigs and allowing additional reads from the pool to be recruited into the assembly.<br />
</p><br />
<div class="clear"></div><br />
<h3>What is a non-redundant set?</h3><br />
The following sequences have some redundancies. NODE2 and NODE3 do not add any information.<br />
The non-redundant set will have only NODE1 and NODE4.<br />
<pre>>NODE1
AATTCAGTTGAAGTAATAGAGGCAGCTGCTGTTAGAACTTCGCTACGACTAGCAACGCTATGTCGAGTTGTACCTTCCCACCTCGTATACAAGGAGCATGAAGTCATCAGCC
CTTTCTACAGAATCTAGGTCCGAAATGATGAAATTAAGAGAAAGACAACTGAAGGTTTTAGGAGGCAAGGTTTACAGGGTAATGAACACGGAAAAACCCACAAGCTAGGAAC
AGTGTGTCTTGGAGTTTAAACTGATTTGGTAGTAGTTCGAAAACAAATTGGAAGGGACATTTAAAGTCCGAGTTGACGTTATCTGAGACAACTTTGTCTTTAACCGACAGGG
AGTTGAGGTAGGAGAGAGTGTCCACATATTTAACGTTGTTCAGATATGGGATGTAGCAGTTGTAACCGAAGCATGTAGGAAGGTTAAAGGGGTCCATACCTCTATTCTAGTC
CCGAAGGTTGGTAGCTAGACTCGGTGACCTAAAATGAGAACGGAAGAAACGGAGGTGACATCCATGGGGCTTGTCGTATATCCAATCATACCTTTGGAGAAGGAAGTTAAAG
GTCAAAACTTTAAAAACCATGAGGACCATTTTATCCTCGTCACTGTCGACTATGGTGAAGTGACCTAGCAGGTTTGTGTAGTTACTGTTTTGTAAGTGTAAAGTGCGTTGCT
GGCTATAGGAGTTTCGCATGAAACATGCTCGCTCTTGTGACCCATCGGTTACCAAGTTTACAGACTAGGTGAGGACTAGGTCACCTCTGTTTGGTAACCAAAAAGTAGAGAA
AGAATATCAACAAGGTATATACAACAACTTAGAGTAGCAAACCTTAAAAGAGTTGTCGTCAGTCACAAAGGGAGAGTTGTACTAAAGGGAACTTTTGTTCACAGAGCTAAAA
CTCATTAAGACCATTTTAATGTGGTCTACCAAGTCTGTCGAAAAGAAGTGGATGCTACGGCAGA
>NODE2 a substring of NODE1
AATTCAGTTGAAGTAATAGAGGCAGCTGCTGTTAGAACTTCGCTACGACTAGCAACGCTATGTCGAGTTGTACCTTCCCACCTCGTATACAAGGAGCATGAAGTCATCAGCC
CTTTCTACAGAATCTAGGTCCGAAATGATGAAATTAAGAGAAAGACAACTGAAGGTTTTAGGAGGCAAGGTTTACAGGGTAATGAACACGGAAAAACCCACAAGCTAGGAAC
AGTGTGTCTTGGAGTTTAAACTGATTTGGTAGTAGTTCGAAAACAAATTGGAAGGGACATTTAAAGTCCGAGTTGACGTTATCTGAGACAACTTTGTCTTTAACCGACAGGG
AGTTGAGGTAGGAGAGAGTGTCCACATATTTAACGTTGTTCAGATATGGGATGTAGCAGTTGTAACCGAAGCATGTAGGAAGGT
>NODE3 a reverse complement substring of NODE1
CGACAGACTTGGTAGACCACATTAAAATGGTCTTAATGAGTTTTAGCTCTGTGAACAAAAGTTCCCTTTAGTACAACTCTCCCTTTGTGACTGACGACAACTCTTTTAAGGT
TTGCTACTCTAAGTTGTTGTATATACCTTGTTGATATTCTTTCTCTACTTTTTGGTTACCAAACAGAGGTGACCTAGTCCTCACCTAGTCTGTAAACTTGGTAACCGATGGG
TCACAAGAGCGAGCATGTTTCATGCGAAACTCCTATAGCCAGCAACGCACTT
>NODE4 a new sequence
TTACGAACGATAGCATCGATCGAAAACGCTACGCGCATCCGCTAAGCACTAGCATAATGCATCGATCGATCGACTACGCCTACGATCGACTAGCTAGCATCGAGCATCGATC
AGCATGCATCGATCGATCGAT
</pre><br />
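Put another way, a contig is redundant when it, or its reverse complement, is contained in another contig. A quick Python sketch of that test on toy sequences (illustration only; this is not how dbcluster decides):

```python
def revcomp(seq):
    # reverse complement of a DNA sequence
    return seq.translate(str.maketrans("ACGT", "TGCA"))[::-1]

def is_redundant(candidate, reference):
    # true if candidate, on either strand, is contained in reference
    return candidate in reference or revcomp(candidate) in reference

node1 = "AATTCAGTTGAAGTAATAGAGGCAGCTGCTGTTAGAACTTCG"
node2 = "CAGTTGAAGTAATAGAGG"               # a substring of node1
node3 = revcomp("GCAGCTGCTGTTAGAACT")      # a reverse complement substring of node1
node4 = "TTACGAACGATAGCATCGATCGAAAACGCTA"  # a new sequence

print(is_redundant(node2, node1), is_redundant(node3, node1), is_redundant(node4, node1))
```

Vmatch handles the reverse strand for you via the -p (palindromic) option discussed below.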
<strike><h3>How do we create a non-redundant set?</h3><br />
<h4>Do not use -nonredundant</h4><br />
No, really. There is an option called -nonredundant which should presumably do what we want, but unfortunately that writes a "representative" member of each cluster to a file, which may or may not be the longest contig. I'm not sure what makes a sequence representative of a cluster, but for this application we want the <span style="font-style: italic;">longest</span> member of each cluster.</strike><br />
<em><br />
On April 29th 2010, Vmatch 2.1.3 was released. The most important change is that the option -nonredundant now delivers the longest sequence from the corresponding cluster (instead<br />
of an unspecified representative). This should make the longestSeq.pl approach unnecessary.</em><br />
<br />
To create a non-redundant set we will produce cluster files and then extract the longest sequence from each file. Use the following commands to produce your cluster files from an index:<br />
<pre class="brush: shell">mkvtree -allout -pl -db contigs1.fa contigs2.fa -dna -indexname myIndex
#mkvtree will accept multiple fasta or gzipped fasta files
#if using Vmatch 2.1.3 or later:
vmatch -d -p -l 25 -dbcluster 100 0 -v -nonredundant nonredundantset.fa myIndex > mySeqs.rpt
# thx to Hamid Ashrafi for debugging this syntax
#if using older Vmatch
mkdir sequenceMatches
#this is where vmatch will put each cluster
vmatch -d -p -l 25 -dbcluster 100 0 mySeqs "(1,0)" -v -s myIndex > mySeqs.rpt
mv mySeqs*.match mySeqs*.fna sequenceMatches</pre><br />
<h3>What do these options do?</h3><br />
<ul><li>-d direct matches (forward strand)</li>
<li>-p palindromic matches (reverse strand)</li>
<li>-l search length (set this below your shortest sequence)</li>
<li>-dbcluster queryPerc targetPerc - runs an internal alignment of the index created by mkvtree.<br />
<ul><li>The two numeric arguments specify what percentage of your query sequence (the smaller) is involved in an alignment to the cluster sentinel/target superstring. For our purposes we require 100% of our query sequence substring to match the target. We don't care what percentage of the target is aligned, so set the second parameter to 0.<br />
</li>
<li>The third argument to dbcluster is the index prefix name that vmatch will give to the THOUSANDS of cluster .fna and .match files it will create.</li>
<li>The fourth argument "(1,0)" specifies that we want to keep singletons (in a file called mySeqs.single.fna) and that there is no limit to the number of sequences in an acceptable cluster.</li>
</ul><br />
</li>
<li>-v verbose report (redirected to the .rpt file)</li>
<li>-s create the individual cluster fasta files and match reports</li>
</ul><br />
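As I read the manual, the two -dbcluster percentages amount to a simple acceptance rule; a hypothetical Python paraphrase (my interpretation, not Vmatch code):

```python
def accept_into_cluster(match_len, query_len, target_len,
                        min_query_pct=100, min_target_pct=0):
    # -dbcluster Q T: at least Q% of the query and T% of the target
    # must be involved in the alignment
    return (100.0 * match_len / query_len >= min_query_pct and
            100.0 * match_len / target_len >= min_target_pct)

# a 150 bp contig fully contained in a 400 bp contig joins the cluster
print(accept_into_cluster(150, 150, 400))  # True
print(accept_into_cluster(140, 150, 400))  # False: only ~93% of the query matched
```

With 100 and 0, any contig fully contained in a longer one joins that cluster, which is exactly the redundancy we want to collapse.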
<br />
Consult the vmatch manual for fuzzy matches and more examples:<br />
<a href="http://www.zbh.uni-hamburg.de/vmatch/virtman.pdf">http://www.zbh.uni-hamburg.de/vmatch/virtman.pdf</a><br />
<br />
The standard output details the clusters and singlets.<br />
<pre>0:761437: NODE_83_length_31_cov_22.451612857642:
NODE_106_length_31_cov_22.451612621409: NODE_152_length_29_cov_27.758621749981:
NODE_185_length_29_cov_27.7586211:761861: NODE_531_length_424_cov_26.851416805590:
NODE_1413_length_320_cov_28.187500837480: NODE_1236_length_320_cov_28.187500858407:
NODE_937_length_320_cov_28.187500765510: NODE_1542_length_108_cov_34.870369621979:
NODE_786_length_425_cov_32.602352750915: NODE_1290_length_321_cov_34.392525
</pre><br />
<h3>Extract the longest sequence from your cluster files</h3><br />
<em>If using Vmatch 2.1.3 or later, this is unnecessary.</em><br />
Here is a Perl script, which we will call longestSeq.pl, to do that:<br />
<pre class="brush: perl">#!/usr/bin/perl
#print the longest sequence in a fasta file
use strict;my %seqs;my $defline;
while (<>) {
chomp;
if ( /^#/ || /^\n/ ) {
#comment line or empty line do nothing
}elsif (/>/) {
s/>//g;$defline = $_;
$seqs{$defline} = "";
}elsif ($defline) {
$seqs{$defline} .= $_;
}else{
die('invalid FASTA file');
}
}
my $max = $defline;
foreach my $def ( keys %seqs ) {
$max = ( length( $seqs{$def} ) > length( $seqs{$max} ) ) ? $def : $max;
}
print ">" . $max . "\n" . $seqs{$max} . "\n";</pre><br />
We want the longest member of each cluster and all sequences in the singletons file:<br />
<pre class="brush: shell">for f in sequenceMatches/*fna;
do
if [ "$f" = "sequenceMatches/mySeqs.single.fna" ];
then
cat $f >> mySeqs_longest_seqs.fa;
else perl longestSeq.pl $f >> mySeqs_longest_seqs.fa;
fi;
done
</pre><br />
That's it - now you have a comprehensive and non-redundant set of the longest contigs from a number of assemblies.Jermdemohttp://www.blogger.com/profile/01662705354227625640noreply@blogger.com6tag:blogger.com,1999:blog-8532065960756482590.post-82237558430476719632009-11-04T12:54:00.000-05:002010-04-14T15:51:28.071-04:00R's xtabs for total weighted read coverageSamtools and its BioPerl wrapper Bio::DB::Sam prefer to give read coverage on a depth per base pair basis. This is typically an array of depths, one for every position that has at least one read aligned. OK, works for me. But how can we quickly see which targets (in my case transcripts) have the greatest total weighted read coverage (i.e. the sum over every base pair of every read that aligned)?<br /><br />My solution is to take that target-pos-depth information and import a table into R with at least the following columns:<br />targetName<br />depth<br /><br />I added the pos column here to emphasize the base-pair granularity:<br /><pre> tx pos depth<br />1 tx500090 227 1<br />2 tx500090 228 1<br />3 tx500090 229 1<br />4 tx500090 230 1<br />5 tx500090 231 1<br />...<br />66 tx500123 184 1<br />67 tx500123 185 1<br />68 tx500123 186 1<br />69 tx500123 187 2<br />70 tx500123 188 2<br />71 tx500123 189 2</pre><br />In R:<br /><pre>myCoverage<-read.table("myCoverage.txt",header=TRUE)<br />myxTab<-xtabs(depth ~ tx,data=myCoverage)</pre><br /><a href="http://stat.ethz.ch/R-manual/R-patched/library/stats/html/xtabs.html">xtabs</a> will sum up depth-weighted positions by default (I suppose this is what tabulated contingency really means) and return an unsorted list of transcripts and their weighted coverage (total base pair read coverage):<br /><pre>> myxTab[1:100]<br />tx<br />tx500090 tx500123 tx500134 tx500155 tx500170 tx500178 tx500203 tx500207<br /> 38 92 610 46 176 46 92 130<br />tx500273 tx500441 tx500481 tx500482 tx500501 tx500507 tx500667 tx500684<br /> 76 2390 114 71228 762 222 542 442<br />tx500945 tx500955 tx501016 
tx501120 tx501127 tx501169 tx501190 tx501192<br /> 1378 3604 46 46 420 854 130 352<br />tx501206 tx501226 tx501229 tx501245 tx501270 tx501297 tx501390 tx501405<br /> 244 1204 206 15926 214 46 168 46<br />tx501406 tx501438 tx501504 tx501694 tx501702 tx501877 tx501902 tx502238<br /> 38 2572 7768 3274 314 298 84 198<br />tx502320 tx502364 tx502403 tx502414 tx502462 tx502515 tx502517 tx502519<br /> 122 38 588 46 46 38 38 466<br />tx502610 tx502624 tx502680 tx502841 tx502882 tx503090 tx503192 tx503204<br /> 206 38 168 3750 38 122 76 92<br />tx503416 tx503468 tx503523 tx503536 tx503571 tx503578 tx503623 tx503700<br /> 260 38 168 38 46 46 84 38<br />tx503720 tx503721 tx503722 tx503788 tx503872 tx503892 tx503930 tx503970<br /> 97112 38 38 4708 38 38 1290 84<br />tx503995 tx504107 tx504115 tx504346 tx504353 tx504355 tx504357 tx504398<br /> 46 152 206 46 3416 1402 122 290<br />tx504434 tx504483 tx504523 tx504589 tx504612 tx504711 tx504751 tx504827<br /> 290 8728 176 46 46 76 5644 1308<br />tx504828 tx504834 tx504882 tx504931 tx504952 tx505017 tx505029 tx505078<br /> 2336 328 46 34138 1000 1838 46 474<br />tx505123 tx505146 tx505159 tx505184<br /> 38 123344 160 588</pre><br />This is approximately 10000x faster than using a formula like:<br /><pre>by(myCoverage,myCoverage$tx,function(x){sum(x$depth)})</pre>Jermdemohttp://www.blogger.com/profile/01662705354227625640noreply@blogger.com2tag:blogger.com,1999:blog-8532065960756482590.post-69850021857573974442009-10-23T15:22:00.000-04:002010-04-17T15:01:10.413-04:00Installing Bio::DB::Sam from CPANBio::DB::Sam is Lincoln Stein's BioPerl API to the SamTools package.<br /><br /><span style="font-weight: bold;">Installing via CPAN might skip a necessary question that will cause it to fail.</span><br /><pre><br />cpan> install Bio::DB::Sam<br />...<br />...<br />DIED. 
FAILED tests 1-93<br /> Failed 93/93 tests, 0.00% okay<br />Failed Test Stat Wstat Total Fail Failed List of Failed<br />-------------------------------------------------------------------------------<br />t/01sam.t 2 512 93 186 200.00% 1-93<br />Failed 1/1 test scripts, 0.00% okay. 93/93 subtests failed, 0.00% okay.<br />make: *** [test_dynamic] Error 2<br />/usr/bin/make test -- NOT OK<br />Running make install<br />make test had returned bad status, won't install without force<br /></pre><br /><br />Navigate to where CPAN has downloaded the .gz file.<br />A closer examination reveals that the Build.PL file wants to know where the SamTools header files are located.<br /><pre><br />Please enter the location of the bam.h and compiled libbam.a files:<br /></pre><br /><span style="font-weight: bold;">I have no idea how to pass these arguments using CPAN. I would just avoid this method of installation.</span> <span style="font-weight: bold;">Do the local build instead.</span>Jermdemohttp://www.blogger.com/profile/01662705354227625640noreply@blogger.com2tag:blogger.com,1999:blog-8532065960756482590.post-40307524422441951622009-10-12T16:46:00.000-04:002010-04-17T15:00:57.494-04:00Disable file locking in Eclipse for OS XEclipse will refuse to use a workspace on an automounted OS X Server home directory.<div><pre>Workspace in use or cannot be created</pre></div><div><br /></div><div>To remedy this problem do the following:</div><div><ul><li>Right click the Eclipse application and select "Show Package Contents"</li><li>Contents->MacOS</li><li>Edit the eclipse.ini file in a text editor</li><li>Add -Dosgi.locking=none to the line below -vmargs</li></ul><br /><img src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEj5RFGbBySNhHA2V-rTCO_QcCMvEviXzvtoi6qQH0_mQbM3wrq8mSj2KfyCo2qJEkEKz4UJA5rYH-npb-sWwABmWtb5G4dkCQgKxboX_nyg7kUF_Yuyvm_WkkGco7seY20kyoHTBUHQXsE/s320/Picture+17.png" border="0" alt="" id="BLOGGER_PHOTO_ID_5391819444924538290" /><img style="cursor:pointer; 
cursor:hand;width: 320px; height: 170px;" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEhleEXcAmf3nBsKk8W7GEoJPHF8WFqQrx4UwSiBGsfDhBIK-EvSyYqbQ_U_sYh6Uaf0MY2YoEZLCy_NZH3U2DPhFT8TbAgIb7c4CaO8SiQLKMfEXgUsP3KTLsT1Ir9zsrs26q7b8UOXm34/s320/Picture+18.png" border="0" alt="" id="BLOGGER_PHOTO_ID_5391820040896419890" /><div><br /></div></div>Jermdemohttp://www.blogger.com/profile/01662705354227625640noreply@blogger.com2tag:blogger.com,1999:blog-8532065960756482590.post-63371732915021386782009-09-23T20:18:00.000-04:002010-04-17T15:00:37.048-04:00My 2009 Bridge-to-Bridge Experience<a onblur="try {parent.deselectBloggerImageGracefully();} catch(e) {}" href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEioxtfMheEqWQyUBnaHPiMQ-BTgL4vNCGt71nTbWUWNIvnm4GW8Df61eo0t30mjWjMq3VfVvTJcM-nl5Uu1yViLoY4Uf0rqIp88MRqVBdcoN2feeet-sLax8mK9k5NWfV2mKUlcM0F3dq0/s1600-h/DSC_0060.JPG"><img style="margin: 0pt 10px 10px 0pt; float: left; cursor: pointer; width: 400px; height: 266px;" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEioxtfMheEqWQyUBnaHPiMQ-BTgL4vNCGt71nTbWUWNIvnm4GW8Df61eo0t30mjWjMq3VfVvTJcM-nl5Uu1yViLoY4Uf0rqIp88MRqVBdcoN2feeet-sLax8mK9k5NWfV2mKUlcM0F3dq0/s400/DSC_0060.JPG" alt="" id="BLOGGER_PHOTO_ID_5384827527818904402" border="0" /></a>Bridge-to-Bridge is a 105-mile ride up to the top of <a href="http://en.wikipedia.org/wiki/Grandfather_Mountain">Grandfather Mountain</a>, one of the highest peaks in the Blue Ridge mountains.<br /><br />Although B2B is not an officially sanctioned race, the organizers conduct it just as professionally (with the exception of neutral support). Cops manage the major crossings, volunteers provide hand-offs at the dozen feed stations, and the event is officially timed with the aid of some magnetic shoe things.<br /><br />I last did this ride in 1999. 
Now 10 years older but about 10 pounds lighter I had somehow forgotten how much suffering was involved and figured this was a good time to tackle the challenge, despite falling ill a couple of weeks beforehand.<br /><br />Due to constant rain and very heavy fog, this year was utter torture for the 299 finishers and 371 <span style="font-style: italic;">non-finishers</span> who braved the elements. I believe there may have been another 130 <span style="font-style: italic;">non-starters</span> who stayed in their hotel rooms enjoying the Golden Girls marathon on tv. While I'm sure I would have felt some sense of accomplishment doing that, I was obligated to finish this ride as we had already driven down from Philadelphia (en-route to a wedding in Nashville the following weekend).<br /><a onblur="try {parent.deselectBloggerImageGracefully();} catch(e) {}" href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEiELWFdQ7mxbZvJXxWLogTLSwoXReZtlWCBFiXU-tqvLxNYGyBT3zA2FNw9MfhdttHdmDCiBccl7C_Rm8bDOEevzL7Bv_zrHtiQFlH_kBfptYLJPVIFoPn4cx8MTrOwaP00OqIxaFdv6Dw/s1600-h/DSC_0011.JPG"><img style="margin: 0pt 0pt 10px 10px; float: right; cursor: pointer; width: 320px; height: 213px;" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEiELWFdQ7mxbZvJXxWLogTLSwoXReZtlWCBFiXU-tqvLxNYGyBT3zA2FNw9MfhdttHdmDCiBccl7C_Rm8bDOEevzL7Bv_zrHtiQFlH_kBfptYLJPVIFoPn4cx8MTrOwaP00OqIxaFdv6Dw/s320/DSC_0011.JPG" alt="" id="BLOGGER_PHOTO_ID_5384834869777901266" border="0" /></a><br />The Grandfather Mountain staff said riders could not enter the park before 3pm. To accommodate both the riders and this odd rule two start times were offered - 10 and 11 am, with slower riders encouraged to start first. To me this was a welcome change from the ungodly pre-dawn start times of most big rides. Riders were advised to start at the later time if they estimated they would be pushing the 3pm threshold.<br /><br />I was on the fence about which group to join. 
I saw several very fit looking riders and expensive bikes in the 10am pack. I was still riding the same Litespeed I used in 1999. The word going around was that more rain was on the way (this turned out to be true for everyone). In the end I felt the risk of having an inexperienced rider fall in front of me to be the deciding factor to go with the second group. I knew I would have to pass several of that first group anyway but it would be on the later climbs instead of the early rollers. Ironically I almost got clipped by some idiot drifting carelessly in our pack. At the end there was considerable overlap in finishing times between the two groups.<br /><br />The two times I did this ride ('98 and '99) I got dropped by the leaders on the 13 mile climb up NC181, then spent the rest of the ride either riding alone or with a couple other guys. Despite my best efforts this year turned out roughly the same except I stuck with that front group for about half the climb instead of just the first couple miles. Remind me to buy a compact crank.<br /><br />This climb is very difficult psychologically - a relentless slog up a roughly-paved 4 lane highway. I was not very familiar with the profile and prematurely thought I had crested three times - each time putting in a kick over the "top". The fog would clear and I would see yet another rise.<br /><br />Feeling dispirited and exhausted, I nearly froze to death on the descents of 181 and the Blue Ridge Parkway and was eventually caught by a small group that stuck together until the ascent of Grandfather. I was amazed by how few words were exchanged in that group during the hour or so we traded pulls through the fog and rain, which only got worse as we neared the finish. 
One guy did say something that stuck with me, "It's like we're riding through a horror movie."<br /><br />After crawling to the finish in a 39x27, I was very fortunate that Mary Ellen had the foresight to drive up to the summit well in advance to meet me with a warm car. I thanked her by singing the Golden Girls theme song all the way to the hotel.<br /><br /><ul><br /><li>B2B '09 Results:<a href="http://www.caldwellcochamber.org/support/pagepics/09Bridge.txt"><br />http://www.caldwellcochamber.org/support/pagepics/09Bridge.txt</a></li><li>An account by Bruce Humphries (1st place)<a href="http://dieseldiaries.com/hom/?p=232"><br />http://dieseldiaries.com/hom/?p=232</a></li><li>Some more blog posts on the '09 ride:</li><ul><li><a href="http://fraught.wordpress.com/2009/09/22/theres-blue-sky-ahead/">http://fraught.wordpress.com/2009/09/22/theres-blue-sky-ahead/</a></li><li><a href="http://twentystone.blogspot.com/2009/09/2009-bridge-to-bridge-ride-report.html">http://twentystone.blogspot.com/2009/09/2009-bridge-to-bridge-ride-report.html</a></li><li><a href="http://khobama.blogspot.com/2009/09/and-saga-continues.html">http://khobama.blogspot.com/2009/09/and-saga-continues.html</a></li><li><a href="http://grimesjoseph.blogspot.com/2009/09/shenandoah-mountain-100-2009-bridge-to.html">http://grimesjoseph.blogspot.com/2009/09/shenandoah-mountain-100-2009-bridge-to.html</a><br /></li></ul></ul>Jermdemohttp://www.blogger.com/profile/01662705354227625640noreply@blogger.com2tag:blogger.com,1999:blog-8532065960756482590.post-43750081007775160942009-09-16T12:20:00.000-04:002009-09-18T17:11:14.340-04:00Printing a specific line from bash_historyOften I want to archive specific commands out of my immediate bash history to a batch script that I can run later.<br />Unfortunately I could find no way of redirecting !# (where # is the bash history line I wish to save, e.g. !58 executes line 58) to a file. 
There is the "colon p" option - where !#:p will print the command instead executing it, but I could not redirect or pipe that output either.<br /><br />So I added this one-liner to a bin directory in my path:<br /><br /><pre class="brush: shell"><br />#!/bin/bash<br />cat $HISTFILE | sed -n "$1p"<br /></pre><br /><br />I call it <span style="font-family: courier new;">getHistLine</span>. So now to save line 58 to a batch script I can just type:<br />getHistLine 58 >> myBatchScript.sh<br /><br />To save a range of lines I can type<br />getHistLine 50,58 >> myBatchScript.shJermdemohttp://www.blogger.com/profile/01662705354227625640noreply@blogger.com0tag:blogger.com,1999:blog-8532065960756482590.post-78834581603315543002009-08-26T23:27:00.000-04:002010-04-17T15:06:43.998-04:00Charles Dickens de Bruijn Graph<a onblur="try {parent.deselectBloggerImageGracefully();} catch(e) {}" href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEgbnGDaUbUf3_Y5bWT9YgiH_rqVN9zn-SiaSSThCSfoS9OcHVWSYJfsPSZrmhueAELkWjSyxHPsd0mDLup4zryan-00ubsZMRvh8sgYpSJWpYQEoYVjHpZddifT-GjYIjhmmarrG4Nh5dg/s800/dicken-debruijn.png"><img style="cursor: pointer; width: 800px; height: 314px;" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEgbnGDaUbUf3_Y5bWT9YgiH_rqVN9zn-SiaSSThCSfoS9OcHVWSYJfsPSZrmhueAELkWjSyxHPsd0mDLup4zryan-00ubsZMRvh8sgYpSJWpYQEoYVjHpZddifT-GjYIjhmmarrG4Nh5dg/s800/dicken-debruijn.png" alt="" border="0" /></a>Jermdemohttp://www.blogger.com/profile/01662705354227625640noreply@blogger.com0tag:blogger.com,1999:blog-8532065960756482590.post-89710604363587302062009-08-25T10:17:00.001-04:002010-04-17T15:02:29.138-04:00Standardized Velvet Assembly Report<a onblur="try {parent.deselectBloggerImageGracefully();} catch(e) {}" href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEgHrQd98u7IhVHhGsVD29ivy3zUScH11LP2b1VosTYa1ePKwlxoJvySGX30jswswPV6l6CXkJdixM3l9P9nCJ_QbejPu5eNHaCy5aLlf3GcKV0Wm_LXeZRMhOhqW71HoNgqPwSzbwk4nNU/s400/example.png"><img 
style="cursor:pointer; cursor:hand;width: 400px; height: 391px;" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEgHrQd98u7IhVHhGsVD29ivy3zUScH11LP2b1VosTYa1ePKwlxoJvySGX30jswswPV6l6CXkJdixM3l9P9nCJ_QbejPu5eNHaCy5aLlf3GcKV0Wm_LXeZRMhOhqW71HoNgqPwSzbwk4nNU/s400/example.png" border="0" alt="" /></a><br /><br /><a href="http://code.google.com/p/standardized-velvet-assembly-report/"><br />http://code.google.com/p/standardized-velvet-assembly-report/</a><br /><br />I finally got my Velvet Assembler report script up on google code. This "program" consists of some short scripts and a Sweave report designed to help Velvet users identify the optimal kmer and cvCut parameters.Jermdemohttp://www.blogger.com/profile/01662705354227625640noreply@blogger.com0tag:blogger.com,1999:blog-8532065960756482590.post-42998357415786211282009-08-13T15:33:00.000-04:002010-04-17T15:04:34.160-04:00GregorianCalendar foolishnessI wanted to add queries to my Grails application to find tasks that needed to be completed today, or were delinquent (due date < last midnight).<br />My application kept thinking things were delinquent the afternoon of the due date.<br /><br />The problem was that I neglected to read how HOUR_OF_DAY differed from HOUR and absentmindedly mixed the two.<br /><a href="http://java.sun.com/j2se/1.4.2/docs/api/java/util/Calendar.html">http://java.sun.com/j2se/1.4.2/docs/api/java/util/Calendar.html</a><br /><br />You can try to set HOUR to 0 but it will default to 12. 
If it is the afternoon when you request the Calendar object then it will assume you mean 12 noon.<br /><br /><pre class="brush: groovy"><br /><br />Calendar lastMidnight = Calendar.getInstance();<br />//DO NOT DO THIS!!<br /> lastMidnight.set( Calendar.HOUR, lastMidnight.getMinimum(Calendar.HOUR_OF_DAY ));<br /> ...snip...<br /></pre><br /><pre class="brush: groovy"><br /> //this is ok<br /> lastMidnight.set( Calendar.HOUR_OF_DAY, lastMidnight.getMinimum(Calendar.HOUR_OF_DAY ));<br /> ...snip...<br /></pre><br /><pre class="brush: groovy"><br /> //this is also ok<br /> lastMidnight.set( Calendar.AM_PM, Calendar.AM );<br /> lastMidnight.set( Calendar.HOUR, lastMidnight.getMinimum(Calendar.HOUR ));<br /> ...snip...<br /></pre><br /><br />The other settings are pretty self-explanatory<br /><pre class="brush: groovy"><br /> lastMidnight.set( Calendar.MINUTE, lastMidnight.getMinimum(Calendar.MINUTE));<br /> lastMidnight.set( Calendar.SECOND, lastMidnight.getMinimum(Calendar.SECOND));<br /> lastMidnight.set( Calendar.MILLISECOND,lastMidnight.getMinimum(Calendar.MILLISECOND));<br /></pre>Jermdemohttp://www.blogger.com/profile/01662705354227625640noreply@blogger.com0tag:blogger.com,1999:blog-8532065960756482590.post-32574110546007107872009-07-28T14:47:00.000-04:002010-04-17T15:00:12.421-04:00Beware! Groovy split and tokenize don't treat empty elements the sameGroovy's <span style="font-style:italic;">tokenize</span>, which returns a List, will ignore empty elements (when a delimiter appears twice in succession). <span style="font-style:italic;">Split</span> keeps such elements and returns an Array. 
If you want to use List functions but you don't want to lose your empty elements, then just use <span style="font-style:italic;">split</span> and convert your Array into a List in a separate step.<br /><br />This might be important if you are parsing CSV files with empty cells.<br /><br /><pre class="brush: groovy"><br />import groovy.util.GroovyTestCase<br /><br /><br />class StringTests extends GroovyTestCase {<br /><br /> protected void setUp() {<br /> super.setUp()<br /> }<br /><br /> protected void tearDown() {<br /> super.tearDown()<br /> }<br /><br /> void testSplitAndTokenize() {<br /> assertEquals("This,,should,have,five,items".tokenize(',').size(),5)<br /> assertEquals("This, ,should,have,six,items".tokenize(',').size(),6)<br /><br /> assertEquals("This, ,should,have,six,items".split(',').size(),6)<br /> assertEquals("This,,should,have,six,items".split(',').size(),6)<br /><br /> //convert array to List and re-evaluate<br /> def fieldArray = "This,,should,have,six,items".split(',')<br /> def fields=fieldArray.collect{it}<br /> assert fields instanceof java.util.List<br /> assertEquals(fields.size(),6)<br /> }<br />}<br /></pre>Jermdemohttp://www.blogger.com/profile/01662705354227625640noreply@blogger.com5