Thursday, December 23, 2010

Chromosome bias in R, my notebook

My goal is to develop a means of detecting chromosome bias from a human BAM file.

Because I've been working with proprietary and novel plant genomes for the last three years, I haven't had the chance to use any of the awesome UCSC-based annotational features that have been introduced and refined in Bioconductor until now. I've returned to biomedical research and I have some catching up to do.

BSgenome might sound like horsecrap, but each Biostrings-based genome data package is actually a huge digested version of a UCSC/NCBI genome freeze and basic sequence annotation compiled into R objects.

BSgenome at Bioconductor
Be careful when googling Bioconductor help - the results often point to older versions. Make sure your link has "release" in the URL.

Here are the BSgenomes available today:
> available.genomes(type=getOption("pkgType"))
BioC_mirror =
Change using chooseBioCmirror().
 [1] "BSgenome.Amellifera.BeeBase.assembly4" "BSgenome.Amellifera.UCSC.apiMel2"     
 [3] "BSgenome.Athaliana.TAIR.01222004"      "BSgenome.Athaliana.TAIR.04232008"     
 [5] "BSgenome.Btaurus.UCSC.bosTau3"         "BSgenome.Btaurus.UCSC.bosTau4"        
 [7] "BSgenome.Celegans.UCSC.ce2"            "BSgenome.Celegans.UCSC.ce6"           
 [9] "BSgenome.Cfamiliaris.UCSC.canFam2"     "BSgenome.Dmelanogaster.UCSC.dm2"      
[11] "BSgenome.Dmelanogaster.UCSC.dm3"       "BSgenome.Drerio.UCSC.danRer5"         
[13] "BSgenome.Drerio.UCSC.danRer6"          "BSgenome.Ecoli.NCBI.20080805"         
[15] "BSgenome.Ggallus.UCSC.galGal3"         "BSgenome.Hsapiens.UCSC.hg17"          
[17] "BSgenome.Hsapiens.UCSC.hg18"           "BSgenome.Hsapiens.UCSC.hg19"          
[19] "BSgenome.Mmusculus.UCSC.mm8"           "BSgenome.Mmusculus.UCSC.mm9"          
[21] "BSgenome.Ptroglodytes.UCSC.panTro2"    "BSgenome.Rnorvegicus.UCSC.rn4"        
[23] "BSgenome.Scerevisiae.UCSC.sacCer1"     "BSgenome.Scerevisiae.UCSC.sacCer2"    
Select and load hg19

When we get an alignment file, one of the first things we want to do is look for red flags that might indicate something went awry in the lab or downstream. An example is chromosome bias - are we seeing more reads aligned to certain chromosomes than would be expected on size alone? A sticky question, since any experiment will introduce confounds based on the inherent uneven distribution of interesting genomic features, not to mention mappability. And yet I think this is still a worthwhile exercise and should be part of any NGS pipeline.

What we don't want to do is ignore that 7.6% of the GRCh37 freeze is sequence that looks like "NNNNNNN" - gaps representing unsequenceable regions such as centromeres, scaffold gap delineations, and the like. We especially don't want to ignore gaps because they are not evenly distributed across the chromosomes (chrY is 57% gaps).

Raw chromosome length can be obtained from the BAM file header, but for this chromosome bias analysis I need the "non-gappy" length, the portion eligible for alignment. Assembly gaps are one of the "masks" turned on by default for BSgenomes in order to allow various functions to work properly (see MaskCollection in the IRanges package for more information).
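To make the raw versus "non-gappy" distinction concrete, here is a minimal base-R sketch on a made-up toy sequence. The sequence and the variable names are assumptions for illustration only; the real analysis relies on the BSgenome masks rather than counting characters.

```r
# Toy "chromosome": runs of N stand in for assembly gaps (hypothetical data).
toy_chr <- paste0(strrep("N", 10), "ACGTACGTAC", strrep("N", 5), "GGCCTTAAGG", "ACGTA")

bases <- strsplit(toy_chr, "")[[1]]
raw_length <- length(bases)                 # what a BAM header would report
gap_length <- sum(bases == "N")             # positions no read can align to
ungapped_length <- raw_length - gap_length  # the length that matters for bias
gap_fraction <- gap_length / raw_length

# Run-length encoding shows where the gaps sit, not just how many there are.
runs <- rle(bases == "N")
gap_runs <- runs$lengths[runs$values]
```

Two chromosomes of equal raw length can differ a lot in `ungapped_length`, which is why the header length alone is a poor denominator.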

> masks(Hsapiens)
Error in function (classes, fdef, mtable)  : 
  unable to find an inherited method for function masks, for signature "BSgenome"
#oops I see masks are a member of MaskedDNAString objects (i.e. chromosomes) not BSgenome objects
> masks(Hsapiens$chrY)
MaskCollection of length 4 and width 59373566
  maskedwidth maskedratio active names                               desc
1    33720000 0.567929506   TRUE AGAPS                      assembly gaps
2           0 0.000000000   TRUE   AMB   intra-contig ambiguities (empty)
3    16024357 0.269890426  FALSE    RM                       RepeatMasker
4      587815 0.009900281  FALSE   TRF Tandem Repeats Finder [period<=12]
all masks together:
  maskedwidth maskedratio
     49783032   0.8384713
all active masks together:
  maskedwidth maskedratio
     33720000   0.5679295
#I think the maskedwidth should reveal sum of actively masked nucleotides
> maskedwidth(Hsapiens$chrY)
[1] 33720000
#can we mess with the masks?
> active(masks(Hsapiens$chrY))["RM"]<-TRUE
Error in `$<-`(`*tmp*`, "chrY", value = < S4 object of class "MaskedDNAString">) : 
  no method for assigning subsets of this S4 class
#oops I can't manipulate a BSgenome this way - it is behaving like a class instead of an instance of a class
> chrY<-Hsapiens$chrY
> active(masks(chrY))["RM"]<-TRUE
> maskedwidth(chrY)
[1] 49744357
# ok maskedwidth is working as I figured, but i need unmasked width
> unmaskedWidth<-function(chr){length(chr)-maskedwidth(chr)}
> unmaskedWidth(Hsapiens$chrY)
[1] 25653566
#how can I iterate over something with a $ operator? let's try [[]]
> unmaskedWidth(Hsapiens[["chrY"]])
[1] 25653566
Now I want to create a data frame with sequence names and unmaskedWidths to go with some read counts from a BAM file. Whenever I want to go from a list, through a function, to a data frame I think plyr, specifically ldply (list to data frame).
# let's take chr 1-22,X,Y, skipping the unscaffolded sequences and mitochondrial chr
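> maskedSizes<-ldply(.data=seqnames(Hsapiens)[1:24],
+   # remainder of the call is assumed from the columns printed below
+   .fun=function(x){data.frame(chr=x,seqlength=length(Hsapiens[[x]]),unmaskedWidth=unmaskedWidth(Hsapiens[[x]]))})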
> maskedSizes
                     chr seqlength unmaskedWidth
1                   chr1 249250621     225280621
2                   chr2 243199373     238204518
3                   chr3 198022430     194797135
4                   chr4 191154276     187661676
5                   chr5 180915260     177695260
6                   chr6 171115067     167395066
7                   chr7 159138663     155353663
8                   chr8 146364022     142888922
9                   chr9 141213431     120143431
10                 chr10 135534747     131314738
11                 chr11 135006516     131129516
12                 chr12 133851895     130481393
13                 chr13 115169878      95589878
14                 chr14 107349540      88289540
15                 chr15 102531392      81694766
16                 chr16  90354753      78884753
17                 chr17  81195210      77795210
18                 chr18  78077248      74657229
19                 chr19  59128983      55808983
20                 chr20  63025520      59505520
21                 chr21  48129895      35106642
22                 chr22  51304566      34894545
23                  chrX 155270560     151100560
24                  chrY  59373566      25653566

Load the BAM file and get read counts in a data frame.
#other methods include scanBam and readAligned
> levels(rname(bamFile))
 [1] "1"          "2"          "3"          "4"          "5"         
 [6] "6"          "7"          "8"          "9"          "10"        
[11] "11"         "12"         "13"         "14"         "15"        
[16] "16"         "17"         "18"         "19"         "20"        
[21] "21"         "22"         "X"          "Y"          "MT"        
[26] "GL000207.1" "GL000226.1" "GL000229.1" "GL000231.1" "GL000210.1"
[31] "GL000239.1" "GL000235.1" "GL000201.1" "GL000247.1" "GL000245.1"
[36] "GL000197.1" "GL000203.1" "GL000246.1" "GL000249.1" "GL000196.1"
[41] "GL000248.1" "GL000244.1" "GL000238.1" "GL000202.1" "GL000234.1"
[46] "GL000232.1" "GL000206.1" "GL000240.1" "GL000236.1" "GL000241.1"
[51] "GL000243.1" "GL000242.1" "GL000230.1" "GL000237.1" "GL000233.1"
[56] "GL000204.1" "GL000198.1" "GL000208.1" "GL000191.1" "GL000227.1"
[61] "GL000228.1" "GL000214.1" "GL000221.1" "GL000209.1" "GL000218.1"
[66] "GL000220.1" "GL000213.1" "GL000211.1" "GL000199.1" "GL000217.1"
[71] "GL000216.1" "GL000215.1" "GL000205.1" "GL000219.1" "GL000224.1"
[76] "GL000223.1" "GL000195.1" "GL000212.1" "GL000222.1" "GL000200.1"
[81] "GL000193.1" "GL000194.1" "GL000225.1" "GL000192.1"
#the deflines in my reference do not match the BSgenome names, must fix at least the chromosomes of interest
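One way to reconcile the names is a small renaming helper. This is only a sketch: `to_ucsc_names` is a hypothetical function, not part of any package, and it touches the primary chromosomes only. Note that mapping "MT" to "chrM" aligns the names but not necessarily the sequences, since the hg19 chrM and the GRCh37 MT are different mitochondrial references; the analysis here skips the mitochondrial chromosome anyway.

```r
# Hypothetical helper: map Ensembl/NCBI-style names to UCSC-style names
# for the primary chromosomes; anything else is returned unchanged.
to_ucsc_names <- function(x) {
  primary <- c(as.character(1:22), "X", "Y")
  out <- x
  is_primary <- x %in% primary
  out[is_primary] <- paste0("chr", x[is_primary])
  out[x == "MT"] <- "chrM"   # name only; the underlying sequence may differ
  out
}

bam_names <- c("1", "2", "22", "X", "Y", "MT", "GL000207.1")
ucsc_names <- to_ucsc_names(bam_names)
```

Leaving the unplaced GL contigs untouched makes it easy to filter them out later with a simple `%in%` test against the BSgenome names.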

#run length encoded read counts per chromosome

#get a data frame with chromosome and read counts
> head(allReadsDf)
   chr   reads
1 chr1 3616909
2 chr2 3642052
3 chr3 2843019
4 chr4 2636141
5 chr5 2590352
6 chr6 2497123

Merge the read counts with unmasked chromosome lengths and plot their relationship.
p<-ggplot(data=chrSizesReads, aes(x=unmaskedWidth, y=reads, label=chr)) + 
  geom_point() +
  geom_text(vjust=2,size=3) +
  stat_smooth(method="lm", se=TRUE,level=0.95) +
  ylab("Reads aligned") +
  xlab("Unmasked chromosome size") +
  ggtitle("Reads vs Chromosome Size")
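The merge that produces chrSizesReads is not shown; a minimal sketch with base R's `merge` might look like the following. The three rows are copied from the tables earlier in this post, and the `readsPerMb` column is an extra I am adding for illustration, not something the plot needs.

```r
# Values copied from the two tables shown earlier (first three chromosomes only).
maskedSizes <- data.frame(
  chr = c("chr1", "chr2", "chr3"),
  seqlength = c(249250621, 243199373, 198022430),
  unmaskedWidth = c(225280621, 238204518, 194797135),
  stringsAsFactors = FALSE
)
allReadsDf <- data.frame(
  chr = c("chr3", "chr1", "chr2"),      # deliberately out of order
  reads = c(2843019, 3616909, 3642052),
  stringsAsFactors = FALSE
)

# Inner join on the shared 'chr' column; rows without a partner are dropped,
# which also discards any contigs whose names were never reconciled.
chrSizesReads <- merge(maskedSizes, allReadsDf, by = "chr")

# Reads per unmasked megabase: a quick sanity check before any modelling.
chrSizesReads$readsPerMb <- chrSizesReads$reads / (chrSizesReads$unmaskedWidth / 1e6)
```

Because `merge` matches on the `chr` values rather than on row order, the two data frames do not need to be sorted the same way.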
There should be a strong linear relationship between read count and chromosome size. We can test this using a linear regression model, the null hypothesis being the number of reads aligned to a chromosome is independent of its size.
> mylm<-lm(reads~unmaskedWidth,data=chrSizesReads)
> mysummary<-summary(mylm)
> mysummary

Call:
lm(formula = reads ~ unmaskedWidth, data = chrSizesReads)

Residuals:
    Min      1Q  Median      3Q     Max 
-271816 -108122  -43984   42826  676284 

Coefficients:
               Estimate Std. Error t value Pr(>|t|)    
(Intercept)   1.774e+05  9.505e+04   1.866   0.0754 .  
unmaskedWidth 1.455e-02  7.145e-04  20.365 9.12e-16 ***
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1 

Residual standard error: 206600 on 22 degrees of freedom
Multiple R-squared: 0.9496, Adjusted R-squared: 0.9473 
F-statistic: 414.8 on 1 and 22 DF,  p-value: 9.123e-16 
The low p-value (against the null hypothesis that chromosome size has no influence on read count) and the high R-squared (0.95, the share of variance in read count explained by the linear model) suggest this model is sound.
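To see what the fitted line means for a single chromosome, here is a small worked example using the coefficients as printed in the summary and the chr1 values from the earlier tables. Because the printed coefficients are rounded, the numbers are approximate; the variable names are mine.

```r
# Coefficients as printed in the summary (rounded to four significant digits).
intercept <- 1.774e+05
slope     <- 1.455e-02   # roughly one aligned read per 69 unmasked bases

# chr1 values from the tables earlier in the post.
chr1_unmasked <- 225280621
chr1_observed <- 3616909

chr1_expected <- intercept + slope * chr1_unmasked  # reads predicted from size alone
chr1_residual <- chr1_observed - chr1_expected      # positive: more reads than predicted
```

The residual for chr1 is positive but small relative to the residual standard error of 206600, so chr1 sits close to the line.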

The following plot is obtained from the standardized residuals (the standardized difference between data observed and values expected) of the linear model described earlier.

Chromosome bias refers to an uneven distribution of aligned reads across chromosomes. We can expect some chromosome bias in treatment sets because of the inherent nature of any experimental condition - recovered fragments will not be evenly distributed among chromosomes because the regions affected are not evenly distributed. Other possible sources of chromosome bias include heterochromatin, uneven repeat content, and aligning against an incorrect set of sex chromosomes. Aligners typically place a read that maps ambiguously to multiple locations at one of those locations chosen at random, with equal probability.
> p<-qplot(chrSizesReads$chr,rstandard(mylm))+
   ylab("Std Residual from lm (reads)")+
   theme(axis.text.x = element_text(angle=45,hjust=1))+
   ggtitle("Linear Regression Residuals")
> print(p)
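For anyone wondering what rstandard() actually computes, the sketch below rebuilds the standardized residuals by hand: each raw residual is divided by the residual standard error and corrected for that observation's leverage. The data are simulated, not the read counts from this post.

```r
set.seed(1)  # simulated data for illustration only
x <- seq(10, 240, length.out = 24)
y <- 5 + 2 * x + rnorm(24, sd = 10)
fit <- lm(y ~ x)

e <- residuals(fit)        # raw residuals: observed minus fitted
s <- summary(fit)$sigma    # residual standard error
h <- hatvalues(fit)        # leverage of each observation

# Internally studentized residuals, which is what rstandard() returns for lm.
manual <- e / (s * sqrt(1 - h))
agrees <- isTRUE(all.equal(unname(manual), unname(rstandard(fit))))
```

The leverage term matters here because the largest and smallest chromosomes sit at the ends of the x range and pull the fit toward themselves.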

Fortunately, there is no clear pattern in these residuals (a pattern would indicate problems with the model), but with a standardized residual (Z-score) of 3.36, chrX appears to be an outlier. With 46M total alignments this is certainly not due to sampling error, but we can still test our observation with a Lund statistic.
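> lundcrit<-function(a, n, q) {
+   # body assumed: the standard approximation to Lund's (1975) critical value
+   # a = alpha, n = number of observations, q = number of coefficients (including intercept)
+   F<-qf(1-(a/n), df1=1, df2=n-q-1)
+   sqrt((n-q)*F/(n-q-1+F))
+ }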
> n<-nrow(chrSizesReads)
> q<-length(mylm$coefficients)
> crit<-lundcrit(0.05,n,q)
> chrSizesReads[which(rstandard(mylm)>crit),"chr"]
[1] chrX

Happy holidays!


  1. Nice description! Counts within regions can be retrieved efficiently with Rsamtools::countBam using as the 'param' argument a ScanBamParam with appropriate 'which'. I did

    seqlengths2gr <- function(x, strand="*")
    ## convert 'seqlengths' of BSgenome to GRanges
    GRanges(names(x), IRanges(1, x), strand=strand)
    gr <- seqlengths2gr(seqlengths(Scerevisiae))

    and then

    cnt <- countBam(fl, param=ScanBamParam(which=gr))

    'cnt' is then a data frame with counts of reads aligning to each chromosome,

    > head(cnt)
        space start     end   width      file records nucleotides
    1 2micron     1    6318    6318 SRR002051    2068       68244
    2    chrI     1  230208  230208 SRR002051   34984     1154472
    3   chrII     1  813178  813178 SRR002051  104042     3433386
    4  chrIII     1  316617  316617 SRR002051   31024     1023792
    5   chrIV     1 1531919 1531919 SRR002051  214682     7084506
    6   chrIX     1  439885  439885 SRR002051   35439     1169487

  2. Jeremy;
    Thanks much for this post. The residual plot approach is a really nice way to display the chromosome variability. I needed exactly this for a recent project, and incorporated your code and Martin's Rsamtools suggestion into an Rscript:

  3. wow it's great to see someone use this approach