Next-generation sequencing – the informatics angle
Gabor T. Marth
Boston College Biology Department
AGBT 2008
Marco Island, FL. February 6, 2008
T1. Roche / 454 FLX system
• pyrosequencing technology
• variable read-length
• the only new technology with >100bp reads
• tested in many published applications
• supports paired-end read protocols with up to 10kb separation size
T2. Illumina / Solexa Genome Analyzer
• fixed-length short-read sequencer
• read properties are very close to those of traditional capillary sequences
• very low INDEL error rate
• tested in many published applications
• paired-end read protocols support short (<600bp) separation
T3. AB / SOLiD system
2-base color encoding (1st base in rows, 2nd base in columns):
         2nd base
         A  C  G  T
1st   A  0  1  2  3
base  C  1  0  3  2
      G  2  3  0  1
      T  3  2  1  0
• fixed-length short-read sequencer
• employs a 2-base encoding system that can be used for error reduction and improving SNP calling accuracy
• requires color-space informatics
• published applications underway / in review
• paired-end read protocols support up to 10kb separation size
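The 2-base encoding mentioned above can be sketched in a few lines. This is a minimal illustration, not vendor software: it assumes the standard color code, in which bases are numbered A=0, C=1, G=2, T=3 and the color of a dinucleotide is the XOR of the two base numbers. The function names are hypothetical.

```python
# Assumption: standard 2-base color code, color = XOR of base codes (A=0, C=1, G=2, T=3).
BASE_TO_CODE = {"A": 0, "C": 1, "G": 2, "T": 3}
CODE_TO_BASE = "ACGT"


def encode(seq):
    """Return the first base and the list of colors, one per adjacent base pair."""
    colors = [BASE_TO_CODE[a] ^ BASE_TO_CODE[b] for a, b in zip(seq, seq[1:])]
    return seq[0], colors


def decode(first_base, colors):
    """Rebuild the base sequence from the known first base and the colors."""
    bases = [first_base]
    for color in colors:
        bases.append(CODE_TO_BASE[BASE_TO_CODE[bases[-1]] ^ color])
    return "".join(bases)


def color_differences(ref, read):
    """Indices at which the color strings of two equal-length sequences differ."""
    _, ref_colors = encode(ref)
    _, read_colors = encode(read)
    return [i for i, (r, q) in enumerate(zip(ref_colors, read_colors)) if r != q]
```

A true single-base substitution changes two adjacent colors (for example, ACGTA vs. ACTTA differ at colors 1 and 2), whereas a single miscalled color changes every base downstream of it when decoded. That asymmetry is what allows isolated color changes to be treated as likely sequencing errors.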
T4. Helicos / Heliscope system
• experimental short-read sequencer system
• single molecule sequencing
• no amplification
• variable read-length
• error rate reduced with 2-pass template sequencing
A1. Variation discovery: SNPs and short-INDELs
1. sequence alignment
2. dealing with non-unique mapping
3. looking for allelic differences
A2. Structural variation detection
• structural variations (deletions, insertions, inversions and translocations) from paired-end read map locations
• copy number (for amplifications, deletions) from depth of read coverage
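As a rough illustration of the depth-of-coverage idea, the sketch below counts read starts in fixed windows and scales the sample/control ratio by an assumed normal ploidy. The function names, the window scheme, and the assumption that sample and control were sequenced to the same total depth are all illustrative choices, not part of the talk.

```python
def window_depths(read_starts, genome_length, window):
    """Count read start positions (0-based) in consecutive fixed-size windows."""
    n_windows = (genome_length + window - 1) // window
    counts = [0] * n_windows
    for pos in read_starts:
        counts[pos // window] += 1
    return counts


def copy_number_estimates(sample_counts, control_counts, ploidy=2):
    """Estimate copy number per window as ploidy * sample/control depth.

    Assumes equal total sequencing depth in sample and control.
    Windows with no control reads yield None (no estimate possible).
    """
    estimates = []
    for s, c in zip(sample_counts, control_counts):
        estimates.append(None if c == 0 else ploidy * s / c)
    return estimates
```

In practice the ratio would also need normalization for total read count and for the representational biases discussed later in the talk.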
A3. Identification of protein-bound DNA
[Figure: aligned reads along the genome sequence]
Chromatin structure (CHIP-SEQ) (Mikkelsen et al. Nature 2007)
Transcription binding sites (Robertson et al. Nature Methods 2007)
A4. Novel transcript discovery (genes)
• novel genes / exons
[Figure: Inferred exon 1, Inferred exon 2]
• novel transcripts in known genes
[Figure: Known exon 1, Known exon 2]
A5. Novel transcript discovery (miRNAs)
Ruby et al. Cell, 2006
A6. Expression profiling by tag counting
[Figure: aligned reads over two genes]
Jones-Rhoades et al. PLoS Genetics, 2007
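Tag counting reduces to asking, for each aligned read, which annotated gene it falls in. The sketch below is one simple way to do that, assuming non-overlapping gene intervals given as hypothetical (name, start, end) tuples with half-open coordinates; the function name is illustrative.

```python
from bisect import bisect_right


def count_tags(read_positions, genes):
    """Count reads whose mapped position falls inside each gene interval.

    genes: list of (name, start, end); intervals are half-open [start, end)
    and assumed not to overlap.
    """
    genes = sorted(genes, key=lambda g: g[1])
    starts = [g[1] for g in genes]
    counts = {name: 0 for name, _, _ in genes}
    for pos in read_positions:
        # Index of the last gene starting at or before this position.
        i = bisect_right(starts, pos) - 1
        if i >= 0 and pos < genes[i][2]:
            counts[genes[i][0]] += 1
    return counts
```

Raw counts like these are only comparable across genes after accounting for gene length and total read count.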
A7. De novo organismal genome sequencing
[Figure labels: short reads, longer reads, read pairs, assembled sequence contigs]
Lander et al. Nature 2001
C1. Read length
[Figure: read length in bp, axis 0 to 300]
Roche/454: ~250 (variable)
Illumina/Solexa: 25-40 (fixed)
AB/SOLiD: 25-35 (fixed)
Helicos: 20-35 (variable)
When does read length matter?
• short reads often sufficient where the entire read length can be used for mapping:
SNPs, short-INDELs, SVs
CHIP-SEQ
short RNA discovery
counting (mRNA, miRNA)
• longer reads are needed where one must use parts of reads for mapping:
de novo sequencing
novel transcript discovery
[Figure: example sequences aacttagacttacagacttacatacgta and accgattactatacta; Known exon 1, Known exon 2]
C2. Read error rate
• error rate dictates how many errors the aligner should tolerate
• error rate typically 0.4 - 1%
• the more errors the aligner must tolerate, the lower the fraction of the reads that can be uniquely aligned
[Figure: fraction of genome (0.00 to 0.40) vs. number of mismatches allowed (0 to 2)]
• for applications where specific alleles are also essential, error rate is even more important
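A back-of-the-envelope calculation shows why tolerating more mismatches costs unique alignments. The sketch below counts how many distinct sequences lie within a given number of substitutions of a read, then estimates the expected number of chance matches under the strong simplifying assumption of a uniformly random genome. Real genomes contain repeats, so this understates the problem; the function names are hypothetical.

```python
from math import comb


def neighborhood_size(read_length, max_mismatches):
    """Number of distinct sequences within max_mismatches substitutions of a read."""
    return sum(comb(read_length, k) * 3 ** k for k in range(max_mismatches + 1))


def expected_chance_hits(read_length, max_mismatches, genome_size):
    """Expected spurious matches for one read in a uniform random genome (assumption)."""
    return genome_size * neighborhood_size(read_length, max_mismatches) / 4 ** read_length
```

For a 25 bp read the neighborhood grows from 1 sequence (exact match) to 76 (one mismatch) to 2776 (two mismatches), so the chance of a second, spurious placement rises by the same factors.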
C3. Error rate grows with each cycle
[Figure: error rate (0.00% to 10.00%) vs. position on read (0 to 40)]
• this phenomenon limits useful read length
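One way to make the "useful read length" idea concrete is to estimate the error rate at each read position from reads aligned to a trusted reference, then cut reads at the first cycle that exceeds a chosen error threshold. This is a sketch under simplifying assumptions (ungapped alignments, a user-chosen cutoff); both function names are illustrative.

```python
def per_cycle_error_rates(aligned_pairs):
    """Mismatch rate at each read position.

    aligned_pairs: iterable of (read, reference) strings, ungapped and
    of equal length within each pair.
    """
    mismatches, totals = [], []
    for read, ref in aligned_pairs:
        for i, (r, g) in enumerate(zip(read, ref)):
            if i == len(totals):
                totals.append(0)
                mismatches.append(0)
            totals[i] += 1
            if r != g:
                mismatches[i] += 1
    return [m / t for m, t in zip(mismatches, totals)]


def useful_read_length(cycle_error_rates, max_error_rate):
    """Length of the longest read prefix in which no cycle exceeds the cutoff."""
    for cycle, rate in enumerate(cycle_error_rates):
        if rate > max_error_rate:
            return cycle
    return len(cycle_error_rates)
```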
C4. Substitutions vs. INDEL errors
Where substitution errors dominate:
• SNP discovery may require higher coverage for allele confirmation
• INDELs can be discovered with very high confidence!
Where INDEL errors dominate:
• gapped alignment necessary
• good SNP discovery accuracy
• short-INDEL discovery difficult
C5. Quality values are important for allele calling
• PHRED base quality values represent the estimated likelihood of sequencing error and help us pick out true alternate alleles
• inaccurate or not well calibrated base quality values hinder allele calling
Q-values should be accurate … and high!
Quality values should be well-calibrated
• the assigned base quality value should be calibrated to match the actual base quality in every sequencing cycle
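The PHRED scale relates a quality value Q to an error probability P by Q = -10 log10(P). The sketch below converts between the two and builds a simple calibration table that compares assigned quality values with the error rate actually observed against a trusted reference. The input format and function names are assumptions made for illustration.

```python
import math
from collections import defaultdict


def phred_to_error_prob(q):
    """Error probability implied by a PHRED quality value."""
    return 10 ** (-q / 10)


def error_prob_to_phred(p):
    """PHRED quality value corresponding to an error probability."""
    return -10 * math.log10(p)


def calibration_table(observations):
    """Map each assigned Q to the empirically observed Q.

    observations: iterable of (assigned_q, is_error) for bases aligned
    to a trusted reference. A value of None means no errors were seen
    at that assigned Q, so no finite empirical Q can be computed.
    """
    totals = defaultdict(int)
    errors = defaultdict(int)
    for q, is_error in observations:
        totals[q] += 1
        if is_error:
            errors[q] += 1
    table = {}
    for q in sorted(totals):
        table[q] = None if errors[q] == 0 else error_prob_to_phred(errors[q] / totals[q])
    return table
```

If bases assigned Q20 turn out to be wrong 10% of the time, the table reports an empirical Q of 10 for them: the assigned values are overconfident and would inflate false positive allele calls.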
C6. Representational biases / library complexity
• fragmentation biases
• amplification biases (PCR)
• sequencing biases
[Figure: read coverage with regions of low/no representation and high representation]
Dispersal of read coverage
• this affects variation discovery (deeper starting read coverage is needed)
• it has a major impact on counting applications
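One common summary of coverage dispersal, not taken from the talk, is the variance-to-mean ratio of per-window read depth: it is close to 1 for ideal Poisson-like sampling and grows as coverage becomes more uneven. A minimal sketch, with a hypothetical function name:

```python
from statistics import mean, pvariance


def dispersion_index(depths):
    """Variance-to-mean ratio of read depth across windows or positions."""
    m = mean(depths)
    if m == 0:
        raise ValueError("mean depth is zero; dispersion index is undefined")
    return pvariance(depths) / m
```

Perfectly even coverage such as [10, 10, 10, 10] gives 0, while [0, 20, 0, 20] at the same mean depth gives 10.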
Amplification errors
many reads from clonal copies of a single fragment
• early PCR errors in “clonal” read copies lead to false positive allele calls
early amplification error gets propagated onto every clonal copy
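A common heuristic for limiting the effect of clonal copies is to keep only one read per mapped start position and strand, so that an early amplification error is counted once rather than many times. The talk does not prescribe this method; the sketch below is an assumption-laden illustration with a hypothetical read representation.

```python
def collapse_clonal_reads(reads):
    """Keep the highest-quality read for each (start, strand) pair.

    reads: iterable of (start, strand, sequence, mean_quality) tuples.
    Assumption: reads sharing a start position and strand are clonal
    copies of one fragment.
    """
    best = {}
    for read in reads:
        key = (read[0], read[1])
        if key not in best or read[3] > best[key][3]:
            best[key] = read
    return sorted(best.values())
```

At high coverage this heuristic also discards some independent fragments that happen to start at the same position, so it trades sensitivity for fewer false positive allele calls.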
C7. Paired-end reads
• fragment amplification: fragment length 100 - 600 bp
  • fragment length limited by amplification efficiency
• circularization: 500bp - 10kb (sweet spot ~3kb)
  • fragment length limited by library complexity
Korbel et al. Science 2007
• paired-end reads can improve read mapping accuracy (if unique map positions are required for both ends) or efficiency (if the fragment length constraint is used to rescue non-uniquely mapping ends)
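The rescue idea can be sketched as follows: given all candidate map positions for each end, keep the combination whose separation fits the expected fragment length range, and accept it only if it is the single consistent combination. This is a simplified illustration (one chromosome, strand and orientation ignored), and the function name and parameters are hypothetical.

```python
def resolve_pair(end1_hits, end2_hits, min_fragment, max_fragment):
    """Pick the placement consistent with the fragment length constraint.

    end1_hits, end2_hits: candidate start positions for each end on the
    same chromosome (strand handling omitted for brevity).
    Returns (pos1, pos2) if exactly one combination fits, else None.
    """
    consistent = [
        (p1, p2)
        for p1 in end1_hits
        for p2 in end2_hits
        if min_fragment <= abs(p2 - p1) <= max_fragment
    ]
    return consistent[0] if len(consistent) == 1 else None
```

For example, an end that maps to three locations is rescued when only one of them lies within 100 to 600 bp of its uniquely mapped mate.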
Paired-end reads for SV discovery
• longer fragments increase the chance of spanning SV breakpoints and/or entire events
• SV breakpoint detection sensitivity & resolution depend on the width of the fragment length distribution (most 2kb deletions would be detected at 10% std but missed at 30% std)
• longer fragments tend to have wider fragment length distributions
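The effect of the fragment length distribution width can be illustrated with simple arithmetic: a deletion makes a spanning pair appear longer by the deletion size, and that shift is detectable only if it is large relative to the standard deviation. The sketch below assumes a 3 kb mean fragment length and a 3-standard-deviation cutoff; neither value is stated for this example in the talk, and the function names are hypothetical.

```python
def deletion_z_score(deletion_size, mean_fragment, relative_std):
    """Shift in apparent fragment length, in units of the standard deviation."""
    return deletion_size / (relative_std * mean_fragment)


def is_detectable(deletion_size, mean_fragment, relative_std, cutoff=3.0):
    """True if a single spanning pair would exceed the cutoff (assumed 3 sd)."""
    return deletion_z_score(deletion_size, mean_fragment, relative_std) >= cutoff
```

Under these assumptions a 2 kb deletion shifts the apparent length by about 6.7 standard deviations at 10% std but only about 2.2 at 30% std, consistent with the detected-versus-missed contrast above.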
C8. Technologies / properties / applications
Technology                  Roche/454             Illumina/Solexa  AB/SOLiD
Read properties
Read length                 250bp                 20-40bp          25-35bp
Error rate                  <0.5%                 <1.0%            <0.5%
Dominant error type         INDEL                 SUB              SUB
Paired-end reads available  yes                   yes              yes
Paired-end separation       < 10kb (3kb optimal)  100 - 600bp      500bp - 10kb (3kb optimal)
Applications
SNP discovery ○ ● ●
short-INDEL discovery ● ○
SV discovery ○ ○ ●
CHIP-SEQ ○ ● ●
small RNA/gene discovery ○ ● ●
mRNA Xcript discovery ● ○ ○
Expression profiling ○ ● ●
De novo sequencing ● ? ?
Thanks
http://bioinformatics.bc.edu/marthlab
Derek Barnett
Eric Tsung
Aaron Quinlan
Damien Croteau-Chonka
Weichun Huang
Michael Stromberg
Chip Stewart
Michele Busby
MOSAIK talk Thursday, 7:40PM
Michael Egholm
David Bentley
Francisco de la Vega
Kristen Stoops
Ed Thayer
Clive Brown
Elaine Mardis