Sequencing Errors and Biases
Biological Sequence Analysis
BNFO 691/602 Spring 2013
Mark Reimers
Outline
• Sequencing errors
• Initiation biases
• Quantification biases
• Are biases consistent across samples?
• Compensating biases
Types of mismatches in Illumina data are profoundly asymmetric and biased
From uniquely mapped tags with a single mismatch
Courtesy Thierry-Mieg
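One way to see such asymmetry is to tally each substitution as (reference base > read base). The sketch below is a minimal illustration, not the analysis behind the figure: the function name `mismatch_types` and the toy alignments are invented here, and the pairs are assumed to be gap-free and of equal length.

```python
from collections import Counter

def mismatch_types(alignments):
    """Tally substitutions as 'reference base>read base'.

    `alignments` is an iterable of (reference, read) string pairs,
    assumed to be aligned without gaps (a simplifying assumption).
    """
    counts = Counter()
    for ref, read in alignments:
        if len(ref) != len(read):
            raise ValueError("reference and read must have equal length")
        for r, q in zip(ref.upper(), read.upper()):
            # Count only true base substitutions; ignore N and other symbols.
            if r != q and r in "ACGT" and q in "ACGT":
                counts[f"{r}>{q}"] += 1
    return counts

# Toy data, invented for illustration only.
toy_alignments = [
    ("ACGTAC", "ACGTCC"),
    ("GGATTC", "GGCTTC"),
    ("TTGACA", "TTGACG"),
]
type_counts = mismatch_types(toy_alignments)
print(type_counts)
```

If errors were symmetric, the twelve possible substitution types would appear at similar rates after adjusting for base composition; strongly unequal counts indicate asymmetry.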
Position of single mismatch in uniquely mapped tags
Courtesy Thierry-Mieg
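A position profile of this kind can be built by keeping only tags with exactly one mismatch and recording where in the tag it falls. The following is a minimal sketch under stated assumptions: `single_mismatch_positions` and the toy pairs are hypothetical, the pairs are gap-free, and both strings are given in read orientation so that position 1 is the first sequenced base.

```python
from collections import Counter

def single_mismatch_positions(alignments):
    """Count the 1-based read position of the mismatch in tags that
    differ from the reference at exactly one position."""
    counts = Counter()
    for ref, read in alignments:
        if len(ref) != len(read):
            continue  # skip pairs that are not equal-length, gap-free alignments
        diffs = [i for i, (r, q) in enumerate(zip(ref, read), start=1) if r != q]
        if len(diffs) == 1:
            counts[diffs[0]] += 1
    return counts

# Toy data, invented for illustration only.
toy_pairs = [
    ("ACGTAC", "ACGTCC"),  # mismatch at position 5
    ("GGATTC", "GGCTTC"),  # mismatch at position 3
    ("TTGACA", "TTGACG"),  # mismatch at position 6
    ("GATTAC", "GATTAA"),  # mismatch at position 6
    ("AAAAAA", "ACAACA"),  # two mismatches: excluded
    ("CCCCCC", "CCCCCC"),  # perfect match: excluded
]
position_counts = single_mismatch_positions(toy_pairs)
for pos in sorted(position_counts):
    print(pos, position_counts[pos])
```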
Initiation Biases
Nucleotide frequencies versus position for stringently mapped reads.
Hansen K D et al. Nucl. Acids Res. 2010;38:e131-e131
© The Author(s) 2010. Published by Oxford University Press.
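Nucleotide frequencies versus read position can be tabulated directly from the read sequences. This is a minimal sketch, not the code used by Hansen et al: the function name `base_frequencies_by_position` and the toy reads are assumptions made for illustration.

```python
def base_frequencies_by_position(reads, n_positions):
    """Return a list with one dict per read position (0-based),
    giving the fraction of A, C, G and T observed at that position."""
    counts = [{b: 0 for b in "ACGT"} for _ in range(n_positions)]
    for read in reads:
        for i, base in enumerate(read[:n_positions].upper()):
            if base in counts[i]:  # ignore N and other ambiguity codes
                counts[i][base] += 1
    freqs = []
    for c in counts:
        total = sum(c.values())
        freqs.append({b: (c[b] / total if total else 0.0) for b in "ACGT"})
    return freqs

# Toy reads, invented for illustration only.
toy_reads = ["ACGT", "AGGT", "ACTT", "CCGT"]
freqs = base_frequencies_by_position(toy_reads, 4)
for pos, f in enumerate(freqs, start=1):
    print(pos, f)
```

Without initiation bias, the frequencies would be roughly flat across positions; a departure confined to the first few positions points to a bias at the start of the read.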
Start Position Bias is Visible in MT-RNA
Start Position Bias is Consistent Across Samples
Counts per start site in lane 1 vs lane 2 (Marioni et al, Genome Res, 2008)
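A lane-versus-lane comparison of this kind amounts to counting reads per start coordinate in each lane and pairing the counts. The sketch below assumes each lane is available as a list of read start coordinates; the helper names, the toy coordinates, and the use of a Pearson correlation as the summary are choices made here, not taken from Marioni et al.

```python
from collections import Counter
from math import sqrt

def paired_start_counts(starts_a, starts_b):
    """Return (sites, counts_a, counts_b) over every start site seen in either lane."""
    ca, cb = Counter(starts_a), Counter(starts_b)
    sites = sorted(set(ca) | set(cb))
    return sites, [ca[s] for s in sites], [cb[s] for s in sites]

def pearson(x, y):
    """Pearson correlation of two equal-length numeric lists."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    if sxx == 0 or syy == 0:
        raise ValueError("correlation is undefined for constant input")
    return sxy / sqrt(sxx * syy)

# Toy start coordinates, invented for illustration only.
lane1 = [100, 100, 100, 105, 105, 110, 120, 120, 120, 120]
lane2 = [100, 100, 105, 110, 120, 120, 120]
sites, x, y = paired_start_counts(lane1, lane2)
r = pearson(x, y)
print(sites, x, y, round(r, 3))
```

A high correlation between lanes suggests that the preference for particular start sites is reproducible rather than sampling noise.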
Quantification Biases
Consistent Technology-Specific Biases
(a) A 25-kb region of chromosome 11 amplified by three long-range PCR products (red rectangles). (b) A heat-map matrix displays the correlation of coverage depth across 260 kb of sequence between four samples sequenced by three technologies.
From Harismendy et al, Genome Biology 2009
Quantitative Biases
• Not all regions represented equally
• GC-rich regions represented more
• Independent of GC, some chromosome regions represented more – euchromatin bias
• Sequence initiation site biases
• ‘Mappability’ biases – some regions won’t have any uniquely mapped tags
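The mappability point can be illustrated by asking which k-mers occur only once in a reference. The sketch below is a deliberately simplified assumption: it considers exact matches on one strand only, whereas real mappability tracks also account for the reverse strand and allowed mismatches. The function name and toy sequence are invented here.

```python
from collections import Counter

def unique_kmer_mask(genome, k):
    """For each start position, report whether the k-mer starting there
    occurs exactly once in `genome` (single strand, exact match only)."""
    kmers = [genome[i:i + k] for i in range(len(genome) - k + 1)]
    counts = Counter(kmers)
    return [counts[kmer] == 1 for kmer in kmers]

# Toy sequence, invented for illustration; "ACGT" occurs twice.
toy_genome = "ACGTACGTTTGCA"
mask = unique_kmer_mask(toy_genome, 4)
fraction_unique = sum(mask) / len(mask)
print(mask, fraction_unique)
```

A tag starting at a position where the mask is False cannot be placed uniquely, so such positions receive no uniquely mapped tags regardless of their true abundance.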
GC Bias
• Density of reads depends strongly on GC content of regions
• Most bias seems to come from PCR reaction
• Newer techniques show less bias, but it is still strong
[Figure: number of reads in 1 kb region versus GC content (%) of 1 kb region]
From Dohm et al 2008
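The quantities on the two axes of such a plot can be computed by cutting the reference into 1 kb windows, then pairing each window's GC percentage with the number of reads starting in it. This is a minimal sketch, not the procedure of Dohm et al: the function names, the assignment of reads by start coordinate, and the toy genome are all assumptions made for illustration.

```python
def gc_percent(seq):
    """GC content (%) among unambiguous bases; None if there are none."""
    seq = seq.upper()
    acgt = sum(seq.count(b) for b in "ACGT")
    if acgt == 0:
        return None
    return 100.0 * (seq.count("G") + seq.count("C")) / acgt

def gc_and_reads_per_window(genome, read_starts, window=1000):
    """Return a list of (GC %, read count) for each full window.

    Reads are assigned to a window by their 0-based start coordinate;
    a trailing partial window is ignored.
    """
    n_windows = len(genome) // window
    counts = [0] * n_windows
    for start in read_starts:
        w = start // window
        if 0 <= w < n_windows:
            counts[w] += 1
    return [
        (gc_percent(genome[i * window:(i + 1) * window]), counts[i])
        for i in range(n_windows)
    ]

# Toy genome: one AT-only window followed by one GC-only window.
toy_genome = "AT" * 500 + "GC" * 500
toy_starts = [10, 500, 1200, 1300, 1999]
table = gc_and_reads_per_window(toy_genome, toy_starts)
print(table)
```

Plotting read count against GC percentage for all windows gives a scatter of the kind described on this slide.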
GC Bias depends on temperature
• Aird et al (Genome Biology 2011) systematically tested the effects of various conditions on GC bias
• They provided protocols that reduce GC bias but don’t eliminate it
NB. Log scale
Even Best Protocols have Bias
• GC bias in Illumina reads from a 400-bp fragment library amplified using the standard PCR protocol (Phusion HF, short denaturation) on a fast-ramping thermocycler (red squares), Phusion HF with long denaturation and 2M betaine (black triangles), AccuPrime Taq HiFi with long denaturation and primer extension at 65°C (blue diamonds) or 60°C (purple diamonds)
From Aird et al Genome Biology 2011
Biases Are NOT Consistent
• The plot on the left shows log fold changes between RPKM values from two biological replicates (NA11918, NA12761), from the data of Montgomery et al, Nature 2010
• From Hansen et al 2012
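For reference, RPKM is reads per kilobase of transcript per million mapped reads, and the log fold change compares the RPKM of the same gene in two samples. The sketch below is a minimal illustration with invented numbers; the function names and the choice of base 2 for the logarithm are assumptions, since the slide does not state the base.

```python
from math import log2

def rpkm(read_count, length_bp, total_mapped_reads):
    """Reads per kilobase of transcript per million mapped reads."""
    return 1e9 * read_count / (length_bp * total_mapped_reads)

def log2_fold_change(value_a, value_b):
    """log2 of the ratio value_a / value_b; both values must be positive."""
    if value_a <= 0 or value_b <= 0:
        raise ValueError("both values must be positive")
    return log2(value_a / value_b)

# Toy numbers, invented for illustration: one 2 kb gene in two samples.
rpkm_a = rpkm(read_count=100, length_bp=2000, total_mapped_reads=1_000_000)
rpkm_b = rpkm(read_count=400, length_bp=2000, total_mapped_reads=2_000_000)
lfc = log2_fold_change(rpkm_a, rpkm_b)
print(rpkm_a, rpkm_b, lfc)
```

If biases were identical in both samples they would cancel in this ratio; a systematic trend in the log fold change against a sequence feature such as GC content shows that they do not.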