Chromosome bias in R, my notebook
My goal is to develop a means of detecting chromosome bias from a human BAM file.
Because I've been working with proprietary and novel plant genomes for the last three years, I haven't had the chance to use any of the awesome UCSC-based annotational features that have been introduced and refined in Bioconductor until now. I've returned to biomedical research and I have some catching up to do.
BSgenome might sound like horsecrap, but each Biostrings-based genome data package is actually a huge digested version of a UCSC/NCBI genome freeze and basic sequence annotation compiled into R objects.
BSgenome at Bioconductor
Here are the BSgenomes available today:
When we get an alignment file one of the first things we want to do is look for red flags that might indicate something went awry in the lab or downstream. An example is chromosome bias - are we seeing more reads aligned to certain chromosomes than would be expected on size alone? A sticky question, since any experiment will introduce confounds based on the inherent uneven distribution of interesting genomic features, not to mention mapability. And yet I think this is still a worthwhile exercise and should be part of any ngs sequencing pipeline.
What we don't want to do is ignore that 7.6% of the GRCh37 freeze is sequence that looks like "NNNNNNN" - gaps representing unsequencable regions such as centromeres, scaffold gap delinations, and the like. We especially don't want to ignore gaps because they are not evenly distributed across the chromosomes (chrY is 56% gaps).
Raw chromosome length can be obtained from the BAM file header, but for this chromosome bias analysis I need the "non-gappy" length, the portion eligible for alignment. This is one of the "masks" turned on by default for BSgenomes in order to allow various functions to work properly (see MaskCollection in the IRanges package for more information).
Load the BAM file and get read counts in a data frame.
Merge the read counts with unmasked chromosome lengths and plot their relationship.
The following plot is obtained from the standardized residuals (the standardized difference between data observed and values expected) of the linear model described earlier.
Chromosome bias refers to uneven read alignment distribution across various chromosomes. We can expect some chromosome bias in treatment sets because of the inherient nature any experimental conditions - recovered fragments will not be evenly distributed among chromosomes because regions of affect are not evenly distributed. Other possible factors of chromosome bias include heterochromatin, uneven repeat content, and the potential for aligning the against an incorrect set of sex chromsomes. Aligners will typically randomly, evenly, assign discrete positions to reads which map ambiguously to multiple locations.
Fortunately, there is no clear pattern to these residual values, which would indicate some model problems, but with a Z-score of 3.36, chrX appears to be an outlier. With 46M total alignments this is certainly not due to sampling error, but we can still test our observation with a Lund statistic.
Happy holidays!
Because I've been working with proprietary and novel plant genomes for the last three years, I haven't had the chance to use any of the awesome UCSC-based annotational features that have been introduced and refined in Bioconductor until now. I've returned to biomedical research and I have some catching up to do.
BSgenome might sound like horsecrap, but each Biostrings-based genome data package is actually a huge digested version of a UCSC/NCBI genome freeze and basic sequence annotation compiled into R objects.
BSgenome at Bioconductor
Be careful with googling bioconductor help - often the results point to older versions. Make sure your link has "release" in the url.
Here are the BSgenomes available today:
> available.genomes(type=getOption("pkgType")) BioC_mirror = http://www.bioconductor.org Change using chooseBioCmirror(). [1] "BSgenome.Amellifera.BeeBase.assembly4" "BSgenome.Amellifera.UCSC.apiMel2" [3] "BSgenome.Athaliana.TAIR.01222004" "BSgenome.Athaliana.TAIR.04232008" [5] "BSgenome.Btaurus.UCSC.bosTau3" "BSgenome.Btaurus.UCSC.bosTau4" [7] "BSgenome.Celegans.UCSC.ce2" "BSgenome.Celegans.UCSC.ce6" [9] "BSgenome.Cfamiliaris.UCSC.canFam2" "BSgenome.Dmelanogaster.UCSC.dm2" [11] "BSgenome.Dmelanogaster.UCSC.dm3" "BSgenome.Drerio.UCSC.danRer5" [13] "BSgenome.Drerio.UCSC.danRer6" "BSgenome.Ecoli.NCBI.20080805" [15] "BSgenome.Ggallus.UCSC.galGal3" "BSgenome.Hsapiens.UCSC.hg17" [17] "BSgenome.Hsapiens.UCSC.hg18" "BSgenome.Hsapiens.UCSC.hg19" [19] "BSgenome.Mmusculus.UCSC.mm8" "BSgenome.Mmusculus.UCSC.mm9" [21] "BSgenome.Ptroglodytes.UCSC.panTro2" "BSgenome.Rnorvegicus.UCSC.rn4" [23] "BSgenome.Scerevisiae.UCSC.sacCer1" "BSgenome.Scerevisiae.UCSC.sacCer2"Select and load hg19
biocLite("BSgenome.Hsapiens.UCSC.hg19") library("BSgenome.Hsapiens.UCSC.hg19")
When we get an alignment file one of the first things we want to do is look for red flags that might indicate something went awry in the lab or downstream. An example is chromosome bias - are we seeing more reads aligned to certain chromosomes than would be expected on size alone? A sticky question, since any experiment will introduce confounds based on the inherent uneven distribution of interesting genomic features, not to mention mapability. And yet I think this is still a worthwhile exercise and should be part of any ngs sequencing pipeline.
What we don't want to do is ignore that 7.6% of the GRCh37 freeze is sequence that looks like "NNNNNNN" - gaps representing unsequencable regions such as centromeres, scaffold gap delinations, and the like. We especially don't want to ignore gaps because they are not evenly distributed across the chromosomes (chrY is 56% gaps).
Raw chromosome length can be obtained from the BAM file header, but for this chromosome bias analysis I need the "non-gappy" length, the portion eligible for alignment. This is one of the "masks" turned on by default for BSgenomes in order to allow various functions to work properly (see MaskCollection in the IRanges package for more information).
> masks(Hsapiens) Error in function (classes, fdef, mtable) : unable to find an inherited method for function masks, for signature "BSgenome" #oops I see masks are a member of MaskedDNAString objects (i.e. chromosomes) not BSgenome objects > masks(Hsapiens$chrY) MaskCollection of length 4 and width 59373566 masks: maskedwidth maskedratio active names desc 1 33720000 0.567929506 TRUE AGAPS assembly gaps 2 0 0.000000000 TRUE AMB intra-contig ambiguities (empty) 3 16024357 0.269890426 FALSE RM RepeatMasker 4 587815 0.009900281 FALSE TRF Tandem Repeats Finder [period<=12] all masks together: maskedwidth maskedratio 49783032 0.8384713 all active masks together: maskedwidth maskedratio 33720000 0.5679295 #I think the maskedwidth should reveal sum of actively masked nucleotides > maskedwidth(Hsapiens$chrY) [1] 33720000 #can we mess with the masks? > active(masks(Hsapiens$chrY))["RM"]<-TRUE Error in `$<-`(`*tmp*`, "chrY", value = < S4 object of class "MaskedDNAString">) : no method for assigning subsets of this S4 class #oops I can't manipulate a BSgenome this way - it is behaving like a class instead of an instance of a class > chrY<-Hsapiens$chrY > active(masks(chrY))["RM"]<-TRUE > maskedwidth(chrY) [1] 49744357 # ok maskedwidth is working as I figured, but i need unmasked width > unmaskedWidth<-function(chr){length(chr)-maskedwidth(chr)} > unmaskedWidth(Hsapiens$chrY) [1] 25653566 #how can I iterate over something with a $ operator? let's try [[]] > unmaskedWidth(Hsapiens[["chrY"]]) [1] 25653566Now I want to create a data frame of with sequence names and unmaskedWidths to go with some read counts from a BAM file. Whenever I want to go from a list, through a function, to a data frame I think plyr, specifically ldply (list to data frame).
# let's take chr 1-22,X,Y, skipping the unscaffolded sequences and mitochondrial chr > maskedSizes<-ldply(.data=seqnames(Hsapiens)[1:24], .fun=function(x){ data.frame(chr=x,seqlength=length(Hsapiens[[x]]), unmaskedWidth=unmaskedWidth(Hsapiens[[x]]))}, .progress="text", .parallel=TRUE) > maskedSizes chr seqlength unmaskedWidth 1 chr1 249250621 225280621 2 chr2 243199373 238204518 3 chr3 198022430 194797135 4 chr4 191154276 187661676 5 chr5 180915260 177695260 6 chr6 171115067 167395066 7 chr7 159138663 155353663 8 chr8 146364022 142888922 9 chr9 141213431 120143431 10 chr10 135534747 131314738 11 chr11 135006516 131129516 12 chr12 133851895 130481393 13 chr13 115169878 95589878 14 chr14 107349540 88289540 15 chr15 102531392 81694766 16 chr16 90354753 78884753 17 chr17 81195210 77795210 18 chr18 78077248 74657229 19 chr19 59128983 55808983 20 chr20 63025520 59505520 21 chr21 48129895 35106642 22 chr22 51304566 34894545 23 chrX 155270560 151100560 24 chrY 59373566 25653566
Load the BAM file and get read counts in a data frame.
#other methods include scanBam and readAligned bamFile<-readBamGappedAlignments("myIndexedSortedBamFile.bam") > levels(rname(bamFile)) [1] "1" "2" "3" "4" "5" [6] "6" "7" "8" "9" "10" [11] "11" "12" "13" "14" "15" [16] "16" "17" "18" "19" "20" [21] "21" "22" "X" "Y" "MT" [26] "GL000207.1" "GL000226.1" "GL000229.1" "GL000231.1" "GL000210.1" [31] "GL000239.1" "GL000235.1" "GL000201.1" "GL000247.1" "GL000245.1" [36] "GL000197.1" "GL000203.1" "GL000246.1" "GL000249.1" "GL000196.1" [41] "GL000248.1" "GL000244.1" "GL000238.1" "GL000202.1" "GL000234.1" [46] "GL000232.1" "GL000206.1" "GL000240.1" "GL000236.1" "GL000241.1" [51] "GL000243.1" "GL000242.1" "GL000230.1" "GL000237.1" "GL000233.1" [56] "GL000204.1" "GL000198.1" "GL000208.1" "GL000191.1" "GL000227.1" [61] "GL000228.1" "GL000214.1" "GL000221.1" "GL000209.1" "GL000218.1" [66] "GL000220.1" "GL000213.1" "GL000211.1" "GL000199.1" "GL000217.1" [71] "GL000216.1" "GL000215.1" "GL000205.1" "GL000219.1" "GL000224.1" [76] "GL000223.1" "GL000195.1" "GL000212.1" "GL000222.1" "GL000200.1" [81] "GL000193.1" "GL000194.1" "GL000225.1" "GL000192.1" #the deflines in my reference do not match the BSgenome names, must fix at least the chromosomes of interest levels(rname(bamFile))[1:25]<-paste('chr',c(1:22,'X','Y','M'),sep='') #run length encoded read counts per chromosome readRle<-rname(bamFile) #get a data frame with chromosome and read counts allReadsDf<-ldply(runValue(readRle),function(x){data.frame(chr=levels(runValue(readRle))[x],reads=runLength(readRle)[x])}) > head(allReadsDf) chr reads 1 chr1 3616909 2 chr2 3642052 3 chr3 2843019 4 chr4 2636141 5 chr5 2590352 6 chr6 2497123
Merge the read counts with unmasked chromosome lengths and plot their relationship.
chrSizesReads<-merge(maskedSizes,readCounts,sort=FALSE) library(ggplot2) p<-ggplot(data=chrSizesReads, aes(x=unmaskedWidth, y=reads, label=chr)) + geom_point() + geom_text(vjust=2,size=3) + stat_smooth(method="lm", se=TRUE,level=0.95) + ylab("Reads aligned") + xlab("Unmasked chromosome size") + opts(title = "Reads vs Chromosome Size") print(p)There should be a strong linear relationship between read count and chromosome size. We can test this using a linear regression model, the null hypothesis being the number of reads aligned to a chromosome is independent of its size.
> mylm<-lm(reads~unmaskedWidth,data=chrSizesReads) > mysummary<-summary(mylm) > mysummary Call: lm(formula = reads ~ unmaskedWidth, data = chrSizesReads) Residuals: Min 1Q Median 3Q Max -271816 -108122 -43984 42826 676284 Coefficients: Estimate Std. Error t value Pr(>|t|) (Intercept) 1.774e+05 9.505e+04 1.866 0.0754 . unmaskedWidth 1.455e-02 7.145e-04 20.365 9.12e-16 *** --- Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1 Residual standard error: 206600 on 22 degrees of freedom Multiple R-squared: 0.9496, Adjusted R-squared: 0.9473 F-statistic: 414.8 on 1 and 22 DF, p-value: 9.123e-16The low p-value (that chr size has no influence) and R-squared (predictive value of the linear model) suggest this model is sound.
The following plot is obtained from the standardized residuals (the standardized difference between data observed and values expected) of the linear model described earlier.
Chromosome bias refers to uneven read alignment distribution across various chromosomes. We can expect some chromosome bias in treatment sets because of the inherient nature any experimental conditions - recovered fragments will not be evenly distributed among chromosomes because regions of affect are not evenly distributed. Other possible factors of chromosome bias include heterochromatin, uneven repeat content, and the potential for aligning the against an incorrect set of sex chromsomes. Aligners will typically randomly, evenly, assign discrete positions to reads which map ambiguously to multiple locations.
> p<-qplot(chrSizesReads$chr,rstandard(mylm))+ aes(label=chrSizesReads$chr)+ geom_text(vjust=2,size=3)+ xlab("Chromosome")+ ylab("Std Residual from lm (reads)")+ geom_abline(slope=0,intercept=0)+ opts(axis.text.x = theme_text(angle=45,hjust=1))+ opts(title = "Linear Regression Residuals") > print(p)
Fortunately, there is no clear pattern to these residual values, which would indicate some model problems, but with a Z-score of 3.36, chrX appears to be an outlier. With 46M total alignments this is certainly not due to sampling error, but we can still test our observation with a Lund statistic.
#http://stackoverflow.com/questions/1444306/how-to-use-outlier-tests-in-r-code >lundcrit<-function(a, n, q) { F<-qf(c(1-(a/n)),df1=1,df2=n-q-1,lower.tail=TRUE) crit<-((n-q)*F/(n-q-1+F))^0.5 crit } > n<-nrow(chrSizesReads) > q<-length(mylm$coefficients) > crit<-lundcrit(0.05,n,q) > chrSizesReads[which(rstandard(mylm)>crit),"chr"] [1] chrX
Happy holidays!
Nice description! Counts within regions can be retrieved efficiently with Rsamtools::countBam using as the 'param' argument a ScanBamParam with appropriate 'which'. I did
ReplyDeleteseqlengths2gr <- function(x, strand="*")
## convert 'seqlengths' of BSgenome to sGRanges
GRanges(names(x), IRanges(1, x), strand=strand)
gr <- seqlengths2gr(seqlengths(Scerevisiae))
and then
cnt <- countBam(fl, param=ScanBamParam(which=gr))
'cnt' is then a data frame with counts of reads aligning to each chromosome,
> head(cnt)
space start end width file records nucleotides
1 2micron 1 6318 6318 SRR002051 2068 68244
2 chrI 1 230208 230208 SRR002051 34984 1154472
3 chrII 1 813178 813178 SRR002051 104042 3433386
4 chrIII 1 316617 316617 SRR002051 31024 1023792
5 chrIV 1 1531919 1531919 SRR002051 214682 7084506
6 chrIX 1 439885 439885 SRR002051 35439 1169487
Jeremy;
ReplyDeleteThanks much for this post. The residual plot approach is a really nice way to display the chromsome variability. I needed exactly this for a recent project, and incorporated your code and Martin's Rsamtools suggestion into an Rscript:
https://github.com/chapmanb/mgh_projects/blob/master/cg_101112/chr_bias.R
wow it's great to see someone use this approach
ReplyDelete