Thursday, December 23, 2010

Chromosome bias in R, my notebook

My goal is to develop a means of detecting chromosome bias from a human BAM file.

Because I've been working with proprietary and novel plant genomes for the last three years, I haven't had the chance to use any of the awesome UCSC-based annotational features that have been introduced and refined in Bioconductor until now. I've returned to biomedical research and I have some catching up to do.

BSgenome might sound like horsecrap, but each Biostrings-based genome data package is actually a huge digested version of a UCSC/NCBI genome freeze and basic sequence annotation compiled into R objects.

BSgenome at Bioconductor
Be careful with googling bioconductor help - often the results point to older versions. Make sure your link has "release" in the url.

Here are the BSgenomes available today:
> available.genomes(type=getOption("pkgType"))
BioC_mirror = http://www.bioconductor.org
Change using chooseBioCmirror().
 [1] "BSgenome.Amellifera.BeeBase.assembly4" "BSgenome.Amellifera.UCSC.apiMel2"     
 [3] "BSgenome.Athaliana.TAIR.01222004"      "BSgenome.Athaliana.TAIR.04232008"     
 [5] "BSgenome.Btaurus.UCSC.bosTau3"         "BSgenome.Btaurus.UCSC.bosTau4"        
 [7] "BSgenome.Celegans.UCSC.ce2"            "BSgenome.Celegans.UCSC.ce6"           
 [9] "BSgenome.Cfamiliaris.UCSC.canFam2"     "BSgenome.Dmelanogaster.UCSC.dm2"      
[11] "BSgenome.Dmelanogaster.UCSC.dm3"       "BSgenome.Drerio.UCSC.danRer5"         
[13] "BSgenome.Drerio.UCSC.danRer6"          "BSgenome.Ecoli.NCBI.20080805"         
[15] "BSgenome.Ggallus.UCSC.galGal3"         "BSgenome.Hsapiens.UCSC.hg17"          
[17] "BSgenome.Hsapiens.UCSC.hg18"           "BSgenome.Hsapiens.UCSC.hg19"          
[19] "BSgenome.Mmusculus.UCSC.mm8"           "BSgenome.Mmusculus.UCSC.mm9"          
[21] "BSgenome.Ptroglodytes.UCSC.panTro2"    "BSgenome.Rnorvegicus.UCSC.rn4"        
[23] "BSgenome.Scerevisiae.UCSC.sacCer1"     "BSgenome.Scerevisiae.UCSC.sacCer2"    
Select and load hg19
biocLite("BSgenome.Hsapiens.UCSC.hg19")
library("BSgenome.Hsapiens.UCSC.hg19")

When we get an alignment file one of the first things we want to do is look for red flags that might indicate something went awry in the lab or downstream. An example is chromosome bias - are we seeing more reads aligned to certain chromosomes than would be expected on size alone? A sticky question, since any experiment will introduce confounds based on the inherent uneven distribution of interesting genomic features, not to mention mapability. And yet I think this is still a worthwhile exercise and should be part of any ngs sequencing pipeline.

What we don't want to do is ignore that 7.6% of the GRCh37 freeze is sequence that looks like "NNNNNNN" - gaps representing unsequencable regions such as centromeres, scaffold gap delinations, and the like. We especially don't want to ignore gaps because they are not evenly distributed across the chromosomes (chrY is 56% gaps).

Raw chromosome length can be obtained from the BAM file header, but for this chromosome bias analysis I need the "non-gappy" length, the portion eligible for alignment. This is one of the "masks" turned on by default for BSgenomes in order to allow various functions to work properly (see MaskCollection in the IRanges package for more information).


> masks(Hsapiens)
Error in function (classes, fdef, mtable)  : 
  unable to find an inherited method for function masks, for signature "BSgenome"
#oops I see masks are a member of MaskedDNAString objects (i.e. chromosomes) not BSgenome objects
> masks(Hsapiens$chrY)
MaskCollection of length 4 and width 59373566
masks:
  maskedwidth maskedratio active names                               desc
1    33720000 0.567929506   TRUE AGAPS                      assembly gaps
2           0 0.000000000   TRUE   AMB   intra-contig ambiguities (empty)
3    16024357 0.269890426  FALSE    RM                       RepeatMasker
4      587815 0.009900281  FALSE   TRF Tandem Repeats Finder [period<=12]
all masks together:
  maskedwidth maskedratio
     49783032   0.8384713
all active masks together:
  maskedwidth maskedratio
     33720000   0.5679295
#I think the maskedwidth should reveal sum of actively masked nucleotides
> maskedwidth(Hsapiens$chrY)
[1] 33720000
#can we mess with the masks?
> active(masks(Hsapiens$chrY))["RM"]<-TRUE
Error in `$<-`(`*tmp*`, "chrY", value = < S4 object of class "MaskedDNAString">) : 
  no method for assigning subsets of this S4 class
#oops I can't manipulate a BSgenome this way - it is behaving like a class instead of an instance of a class
> chrY<-Hsapiens$chrY
> active(masks(chrY))["RM"]<-TRUE
> maskedwidth(chrY)
[1] 49744357
# ok maskedwidth is working as I figured, but i need unmasked width
> unmaskedWidth<-function(chr){length(chr)-maskedwidth(chr)}
> unmaskedWidth(Hsapiens$chrY)
[1] 25653566
#how can I iterate over something with a $ operator? let's try [[]]
> unmaskedWidth(Hsapiens[["chrY"]])
[1] 25653566
Now I want to create a data frame of with sequence names and unmaskedWidths to go with some read counts from a BAM file. Whenever I want to go from a list, through a function, to a data frame I think plyr, specifically ldply (list to data frame).
# let's take chr 1-22,X,Y, skipping the unscaffolded sequences and mitochondrial chr
> maskedSizes<-ldply(.data=seqnames(Hsapiens)[1:24],
  .fun=function(x){
    data.frame(chr=x,seqlength=length(Hsapiens[[x]]),
    unmaskedWidth=unmaskedWidth(Hsapiens[[x]]))},
  .progress="text",
  .parallel=TRUE)
> maskedSizes
                     chr seqlength unmaskedWidth
1                   chr1 249250621     225280621
2                   chr2 243199373     238204518
3                   chr3 198022430     194797135
4                   chr4 191154276     187661676
5                   chr5 180915260     177695260
6                   chr6 171115067     167395066
7                   chr7 159138663     155353663
8                   chr8 146364022     142888922
9                   chr9 141213431     120143431
10                 chr10 135534747     131314738
11                 chr11 135006516     131129516
12                 chr12 133851895     130481393
13                 chr13 115169878      95589878
14                 chr14 107349540      88289540
15                 chr15 102531392      81694766
16                 chr16  90354753      78884753
17                 chr17  81195210      77795210
18                 chr18  78077248      74657229
19                 chr19  59128983      55808983
20                 chr20  63025520      59505520
21                 chr21  48129895      35106642
22                 chr22  51304566      34894545
23                  chrX 155270560     151100560
24                  chrY  59373566      25653566

Load the BAM file and get read counts in a data frame.
#other methods include scanBam and readAligned
bamFile<-readBamGappedAlignments("myIndexedSortedBamFile.bam")
> levels(rname(bamFile))
 [1] "1"          "2"          "3"          "4"          "5"         
 [6] "6"          "7"          "8"          "9"          "10"        
[11] "11"         "12"         "13"         "14"         "15"        
[16] "16"         "17"         "18"         "19"         "20"        
[21] "21"         "22"         "X"          "Y"          "MT"        
[26] "GL000207.1" "GL000226.1" "GL000229.1" "GL000231.1" "GL000210.1"
[31] "GL000239.1" "GL000235.1" "GL000201.1" "GL000247.1" "GL000245.1"
[36] "GL000197.1" "GL000203.1" "GL000246.1" "GL000249.1" "GL000196.1"
[41] "GL000248.1" "GL000244.1" "GL000238.1" "GL000202.1" "GL000234.1"
[46] "GL000232.1" "GL000206.1" "GL000240.1" "GL000236.1" "GL000241.1"
[51] "GL000243.1" "GL000242.1" "GL000230.1" "GL000237.1" "GL000233.1"
[56] "GL000204.1" "GL000198.1" "GL000208.1" "GL000191.1" "GL000227.1"
[61] "GL000228.1" "GL000214.1" "GL000221.1" "GL000209.1" "GL000218.1"
[66] "GL000220.1" "GL000213.1" "GL000211.1" "GL000199.1" "GL000217.1"
[71] "GL000216.1" "GL000215.1" "GL000205.1" "GL000219.1" "GL000224.1"
[76] "GL000223.1" "GL000195.1" "GL000212.1" "GL000222.1" "GL000200.1"
[81] "GL000193.1" "GL000194.1" "GL000225.1" "GL000192.1"
#the deflines in my reference do not match the BSgenome names, must fix at least the chromosomes of interest
levels(rname(bamFile))[1:25]<-paste('chr',c(1:22,'X','Y','M'),sep='')

#run length encoded read counts per chromosome
readRle<-rname(bamFile)

#get a data frame with chromosome and read counts
allReadsDf<-ldply(runValue(readRle),function(x){data.frame(chr=levels(runValue(readRle))[x],reads=runLength(readRle)[x])})
> head(allReadsDf)
   chr   reads
1 chr1 3616909
2 chr2 3642052
3 chr3 2843019
4 chr4 2636141
5 chr5 2590352
6 chr6 2497123

Merge the read counts with unmasked chromosome lengths and plot their relationship.
chrSizesReads<-merge(maskedSizes,readCounts,sort=FALSE)
library(ggplot2)
p<-ggplot(data=chrSizesReads, aes(x=unmaskedWidth, y=reads, label=chr)) + 
  geom_point() +
  geom_text(vjust=2,size=3) +
  stat_smooth(method="lm", se=TRUE,level=0.95) +
  ylab("Reads aligned") +
  xlab("Unmasked chromosome size") +
  opts(title = "Reads vs Chromosome Size")
print(p)
There should be a strong linear relationship between read count and chromosome size. We can test this using a linear regression model, the null hypothesis being the number of reads aligned to a chromosome is independent of its size.
> mylm<-lm(reads~unmaskedWidth,data=chrSizesReads)
> mysummary<-summary(mylm)
> mysummary

Call:
lm(formula = reads ~ unmaskedWidth, data = chrSizesReads)

Residuals:
    Min      1Q  Median      3Q     Max 
-271816 -108122  -43984   42826  676284 

Coefficients:
               Estimate Std. Error t value Pr(>|t|)    
(Intercept)   1.774e+05  9.505e+04   1.866   0.0754 .  
unmaskedWidth 1.455e-02  7.145e-04  20.365 9.12e-16 ***
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1 

Residual standard error: 206600 on 22 degrees of freedom
Multiple R-squared: 0.9496, Adjusted R-squared: 0.9473 
F-statistic: 414.8 on 1 and 22 DF,  p-value: 9.123e-16 
The low p-value (that chr size has no influence) and R-squared (predictive value of the linear model) suggest this model is sound.

The following plot is obtained from the standardized residuals (the standardized difference between data observed and values expected) of the linear model described earlier.

Chromosome bias refers to uneven read alignment distribution across various chromosomes. We can expect some chromosome bias in treatment sets because of the inherient nature any experimental conditions - recovered fragments will not be evenly distributed among chromosomes because regions of affect are not evenly distributed. Other possible factors of chromosome bias include heterochromatin, uneven repeat content, and the potential for aligning the against an incorrect set of sex chromsomes. Aligners will typically randomly, evenly, assign discrete positions to reads which map ambiguously to multiple locations.
> p<-qplot(chrSizesReads$chr,rstandard(mylm))+
   aes(label=chrSizesReads$chr)+
   geom_text(vjust=2,size=3)+
   xlab("Chromosome")+
   ylab("Std Residual from lm (reads)")+
   geom_abline(slope=0,intercept=0)+
   opts(axis.text.x = theme_text(angle=45,hjust=1))+
   opts(title = "Linear Regression Residuals")
> print(p)

Fortunately, there is no clear pattern to these residual values, which would indicate some model problems, but with a Z-score of 3.36, chrX appears to be an outlier. With 46M total alignments this is certainly not due to sampling error, but we can still test our observation with a Lund statistic.
#http://stackoverflow.com/questions/1444306/how-to-use-outlier-tests-in-r-code
>lundcrit<-function(a, n, q) {
 F<-qf(c(1-(a/n)),df1=1,df2=n-q-1,lower.tail=TRUE)
 crit<-((n-q)*F/(n-q-1+F))^0.5
 crit
}
> n<-nrow(chrSizesReads)
> q<-length(mylm$coefficients)
> crit<-lundcrit(0.05,n,q)
> chrSizesReads[which(rstandard(mylm)>crit),"chr"]
[1] chrX

Happy holidays!

Thursday, December 9, 2010

Directory-based bash histories

Using a directory-based bash history allows for a record of shell actions on a directory basis, so a group of developers have some record of what was done while in a directory, when, and by whom. This can be helpful when trying to reconstruct history with limited documentation.

I know this setup will be of some benefit to my successor at my previous job because he has access to everything I ever did in any project directory.

Place this code in your ~/.bash_profile or ~/.bashrc

(type source ~/.bash_profile (or .bashrc) to load this for your current session)

Monday, August 30, 2010

NGS viewers reviewed

I gathered up some of the recent free next generation sequence viewers that were capabale of viewing BAM files - and put each through the motions with a few BAM files and reference sequences of various sizes. While there are some great ideas and several choices to be found along the feature spectrum, I think we are still in the dark ages with this stuff. No viewer has really been able to entirely combine usability with performance and analysis capabilities, let alone extensibility and web connectivity.

tview
My take: tview is the barebones, text-rendered viewer that is included with Samtools. People who favor this as their BAM viewer probably think Vim is too polished. Even the very limited navigation is remarkably unituitive (goto a coordinate requires chr:position even if you have just one chromosome, no errors are displayed if you forget chr).
Most resembles which video game: Oregon Trail
Good standout feature: command line access
Bad feature: text display

BamView
My take: Bamview is a wicked fast simple BAM file viewer. It doesn't have much in the way of features, but for cursory examination of BAM files it is more palatable than tview.
Most resembles which video game: Burgertime
Good standout feature: strand split screen
Bad feature: drag selecting a region turns it red, and umm... that's all it does

GenoViewer
My take: The only NGS viewer endorsed by Speak the Hungarian rapper, unfortunately this recent entrant leaves a lot to be desired in terms of performance with large files. GenoViewer is very hard on the eyes - the indiscriminate use of primary colors looks like a kid somehow vomited up the ball pit at Chuck E. Cheese.
Most resembles which video game: Centipede
Good standout feature: promises that "You will not get lost in the details, and can easily figure out the true meaning behind the data. Guaranteed."
Bad feature: graphics

MagicViewer
My take: MagicViewer has come a long way since its initial release last December. With its interesting pie-chart icon renderings of SNP purity, decent treatment of annotation tracks, and improved performance, MagicViewer might soon be a contender to Tablet in the midweight category. The navigation is workable but takes some getting used to - it's unclear when to use scroll bars vs. arrows
Most resembles which video game: Moon Patrol
Good standout feature: primer design tool
Bad feature: some regions are simply not visible - no reference, no reads

Tablet
My take: Hands down the most attractive of the viewers, Tablet is aimed at fostering a delicate balance between performance, features, and aesthetics. Tablet comes with a suite of read views - Packed, Stacked, and Classic - to suit both young children and elderly scientists alike. GFF feature files can be loaded but they appear to merely serve as position indices.
Most resembles which video game: SimCity
Good standout feature: interface
Bad feature: read insertions not displayed correctly



IGV
My take: The Integrative Genomics Viewer is a serious tool for exploring and analyzing large datasets. In addition to viewing, IGV is designed to allow users to extract the kind of hard publishable data that has typically been the domain of Bio* scripts. Like the true product of an Ivy League education, IGV can appear aloof and arrogant to newcomers. The viewer will let you load BAM files and other annotation tracks that have nothing to do with the reference without comment or guidance, then require an extra unitutive click to actually generate a view. While not the easiest or even the best in performance, in terms of generating real queries on your data from the viewer there is nothing comparable.
Most resembles which video game: Dig Dug
Good standout feature: analysis tools
Bad feature: throws a hissy fit if it cannot connect to home server


gbrowse2
My take: gbrowse2 is the AJAXified protege to the venerable generic genome browser. Samtools integration is a recent addition to this highly extensible platform, which has been used for years to display everything from large genomes to small sequencing projects. Of all the viewers here, GB2 provides the best set of visualization tools, such that virtually any biological information that can be rendered linearly has been done so as gbrowse tracks. The lone true web application, gbrowse genomic positions can be hyperlinked or even snapshot-embedded on the web. The web provides the best platform to share visual genomic data among several users. However, a gbrowse2 installation with BAM tracks can be a massive pain to install, configure, and debug ("landmark chrI not found" is the most popular google search in all of bioinformatics). Novices can expect a minefield of historical gotchas and arcane conventions ("Name=" not "name=" field in gff3, bp_seqfeature_load instead of bp_load_gff for gff3, no validator for conf files), and even experienced users are often baffled by cryptic errors that pop up in server logs.
Most resembles which video game: Sissyfight 2000
Good standout feature: hyperlinks
Bad feature: setup

Wednesday, August 18, 2010

My thoughts on the acquisition of Ion Torrent by Life Technologies

Yesterday it was announced that Ion Torrent, makers of the Ion Personal Genome Machine (PGM) sequencer, would be acquired by Invitrogen ABI Life Technologies for $375M in cash and stock, with the possibility of another $350M if various milestones are met.

Gregory Lucier, chairman and CEO of Life Technologies, said, "We believe Ion Torrent's technology will represent a profound change for the life sciences industry, as fundamental as the one we saw with the introduction of qPCR." That analogy might not sound earth-shattering, but I suspect he is using qPCR as an example because it is one of the few molecular biology (as opposed to biochemical) techniques regularly used in clinical settings.

The PGM is a unique machine because it is the first second generation sequencer (tentative specs: ~3 million 200bp reads at $500 1hr run, $50k machine) that could be conceivably leased by a small clinical testing facility, like the ones fed by those LabCorp boxes you see scattered all over strip malls. Bear in mind even a capillary-based sanger sequencer like the ABI3730xl costs a whopping $375,000, which comes as a shock to those of us who mostly work with next-generation sequencing data. You can see not only why the PGM is really a game changer, but how it might fit into Life Technologies offerings.

Exactly which clinical diagnostics are, or will be, suited for this machine are unclear. With the right foolproof software this could replace a lot of PCR and microarray-based tests. Read lengths and throughput are bound to increase just as they did with Solexa. Future applications like tumor sequencing and 16S rRNA microbiome sequencing have not even entered into medical practice yet.

The key to Life Technology succeeding in these areas will likely come down to non-technical challenges:
  • Getting FDA approval for the PGM as a medical device and for various protocols based on the PGM as approved clinical tests. Life Technology has experience with this process. Ion Torrent clearly does not.
  • Convincing insurance companies and HMOs that these tests are cost-effective diagnostics. When you consider the exorbitant cost of other tests - e.g. several MRIs over a course of chemotherapy - this might not be such hard sell.
  • Developing a business model that will allow small clinics to lease the machine for little or even no cost provided they agree to purchase a minimum amount of consumables. Life Technologies will likely market the PGM to smaller labs who themselves will inevitably face competition from larger dedicated sequencing centers trying to centralize this type of work using bigger machines from Illumina or PacBio (or even LT's SOLiD).

Although these obstacles seem daunting, I have sat through academic departmental meetings where very knowledgeable sequencer salespeople were invited back multiple times only to be jerked around as the faculty hemmed and hawed over whether this was the right $400k machine to buy at the time. That was undoubtedly an expensive and frustrating sales process for Solexa and 454. With the PGM, Life Technologies can pitch this much cheaper machine to an albeit different group of skeptics, clinical lab managers, but one with a clear bottom line in mind.

Wednesday, March 10, 2010

Use local instead of my in Perl when using $$ (double dollar sign) inside an enclosed block

I thought I would be all clever and initialize several hash indices at once using the $$ notation in Perl - evaluating strings as variables. Unfortunately using "my" screws this up bigtime.

This must be another circumstance spelled out in the incomprehensible "my" treatise:
http://perldoc.perl.org/perlsub.html#Private-Variables-via-my()


my %foo;
$foo{'bar'}=1;
foreach $fooDex(qw(foo)){
$$fooDex{'bar'}=2;
}
print $foo{'bar'}."\n"; #prints 1




local %foo;
$foo{'bar'}=1;
foreach $fooDex(qw(foo)){
$$fooDex{'bar'}=2;
}
print $foo{'bar'}."\n"; #prints 2


This is one of those solutions that gives you a feeling of dread instead of accomplishment.

Tuesday, March 9, 2010

Getting the basics from readAligned

The UCR guide is a little sparse with regard to getting basic information from readAligned.

I'd like to add to the general cookbook. If some bioc people out there can contribute some alignment recipes can fill me in on some more basics please comment:


alignedReads <- readAligned("./", pattern="output.bowtie", type="Bowtie")

#how many reads did I attempt to align
#i don't think you can't get this from alignedReads

#how many reads aligned (one or more times)
length(unique(id(alignedReads)))

#how many hits were there?
length(alignedReads)

#how many reads produced multiple hits
length(unique(id(alignedReads[srduplicated(id(alignedReads))])))

#how many reads produced multiple hits at the best strata?
#please fill me in on this one

#how many reads aligned uniquely (with exactly one hit)
length(unique(id(alignedReads)))-length(unique(id(alignedReads[srduplicated(id(alignedReads))])))

#how many reads aligned uniquely at the best strata (the other hits were not as good)
#please fill me in on this one

#how many unique positions were hit? what if I ignore strand?
#please fill me in on this one

#how many converging hits were there (two query sequences aligned to the same genomic position)
#please fill me in on this one

Wednesday, March 3, 2010

Quality trimming in R using ShortRead and Biostrings

I wrote an R function to do soft-trimming, right clipping FastQ reads based on quality.

This function has the option of leaving out sequences trimmed to extinction and will do left-side fixed trimming as well.
#softTrim
#trim first position lower than minQuality and all subsequent positions
#omit sequences that after trimming are shorter than minLength
#left trim to firstBase, (1 implies no left trim)
#input: ShortReadQ reads
#       integer minQuality
#       integer firstBase
#       integer minLength
#output: ShortReadQ trimmed reads
library("ShortRead")
softTrim<-function(reads,minQuality,firstBase=1,minLength=5){
qualMat<-as(FastqQuality(quality(quality(reads))),'matrix')
qualList<-split(qualMat,row(qualMat))
ends<-as.integer(lapply(qualList,
function(x){which(x < minQuality)[1]-1}))
#length=end-start+1, so set start to no more than length+1 to avoid negative-length
starts<-as.integer(lapply(ends,function(x){min(x+1,firstBase)}))
#use whatever QualityScore subclass is sent
newQ<-ShortReadQ(sread=subseq(sread(reads),start=starts,end=ends),
quality=new(Class=class(quality(reads)),
quality=subseq(quality(quality(reads)),
start=starts,end=ends)),
id=id(reads))

#apply minLength using srFilter
lengthCutoff <- srFilter(function(x) {
width(x)>=minLength
},name="length cutoff")
newQ[lengthCutoff(newQ)]
}  


To use:
library("ShortRead")
source("softTrimFunction.R") #or whatever you want to name this
reads<-readFastq("myreads.fq") trimmedReads<-softTrim(reads=reads,minQuality=5,firstBase=4,minLength=3) writeFastq(trimmedReads,file="trimmed.fq") 
I strongly recommend reading the excellent UC Riverside HT-Sequencing Wiki cookbook and tutorial if you wish to venture into using R for NGS handling. Among other things, it will explain how to perform casting if you have Solexa scaled (base 64) fastq files. The function should respect that. http://manuals.bioinformatics.ucr.edu/home/ht-seq