Next-generation sequencing – the informatics angle

Gabor T. Marth
Boston College Biology Department

AGBT 2008
Marco Island, FL. February 6, 2008

T1. Roche / 454 FLX system

• pyrosequencing technology
• variable read-length
• the only new technology with >100bp reads
• tested in many published applications
• supports paired-end read protocols with up to 10kb separation size

T2. Illumina / Solexa Genome Analyzer

• fixed-length short-read sequencer
• read properties are very close to those of traditional capillary sequences
• very low INDEL error rate
• tested in many published applications
• paired-end read protocols support short (<600bp) separation

T3. AB / SOLiD system

Figure: 2-base encoding table (axis label "2nd Base": A, C, G, T)

• fixed-length short-read sequencer
• employs a 2-base encoding system that can be used for error reduction and improving SNP calling accuracy
• requires color-space informatics
• published applications underway / in review
• paired-end read protocols support up to 10kb separation size
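The 2-base encoding idea can be illustrated with a minimal Python sketch. It assumes the commonly described scheme in which each base gets a 2-bit code (A=0, C=1, G=2, T=3) and the color of a dinucleotide is the XOR of the two codes; the function names `encode_colors` and `decode_colors` are hypothetical and not part of any vendor software.

```python
# Assumed 2-bit base codes; the color of two adjacent bases is the XOR of their codes.
BASE_CODE = {"A": 0, "C": 1, "G": 2, "T": 3}
CODE_BASE = "ACGT"

def encode_colors(seq):
    """Return one color (0-3) for every pair of adjacent bases in seq."""
    return [BASE_CODE[a] ^ BASE_CODE[b] for a, b in zip(seq, seq[1:])]

def decode_colors(first_base, colors):
    """Rebuild the base sequence from a known first base and its colors."""
    bases = [first_base]
    for color in colors:
        bases.append(CODE_BASE[BASE_CODE[bases[-1]] ^ color])
    return "".join(bases)

ref = "ACGGTC"
snp = "ACGATC"  # same sequence with a single-base substitution (G>A)

ref_colors = encode_colors(ref)
snp_colors = encode_colors(snp)
# Positions (0-based) where the two color strings disagree.
changed = [i for i, (x, y) in enumerate(zip(ref_colors, snp_colors)) if x != y]
print(ref_colors, snp_colors, changed)
```

In this sketch a real single-base substitution changes two adjacent colors, whereas an isolated single color difference is more likely a measurement error, which is the property that makes the encoding useful for error reduction.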

T4. Helicos / Heliscope system

• experimental short-read sequencer system
• single molecule sequencing
• no amplification
• variable read-length
• error rate reduced with 2-pass template sequencing

A1. Variation discovery: SNPs and short-INDELs

1. sequence alignment

2. dealing with non-unique mapping

3. looking for allelic differences

A2. Structural variation detection

• structural variations (deletions, insertions, inversions and translocations) from paired-end read map locations

• copy number (for amplifications, deletions) from depth of read coverage
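As a rough illustration of the depth-of-coverage idea, the sketch below scales the read depth in each genomic window by the median depth. It assumes that the median window is at normal diploid copy number and ignores mapping and GC biases; `copy_number_estimates` is a hypothetical helper, not a published method.

```python
from statistics import median

def copy_number_estimates(window_depths, ploidy=2):
    """Scale each window's read depth by the median depth.

    Assumption: the median window represents the normal (diploid) state.
    """
    baseline = median(window_depths)
    return [ploidy * depth / baseline for depth in window_depths]

# Hypothetical read depths in consecutive windows along a chromosome.
depths = [30, 31, 29, 15, 60, 30, 30]
estimates = [round(cn) for cn in copy_number_estimates(depths)]
print(estimates)
```

With these made-up depths, the window at half the median depth comes out as a candidate one-copy deletion and the window at twice the median as a candidate amplification.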

A3. Identification of protein-bound DNA

genome sequence

aligned reads

Chromatin structure (CHIP-SEQ) (Mikkelsen et al. Nature 2007)

Transcription factor binding sites (Robertson et al. Nature Methods, 2007)

A4. Novel transcript discovery (genes)

• novel genes / exons (figure labels: Inferred exon 1, Inferred exon 2)

• novel transcripts in known genes (figure labels: Known exon 1, Known exon 2)

A5. Novel transcript discovery (miRNAs)

Ruby et al. Cell, 2006

A6. Expression profiling by tag counting

aligned reads

Jones-Rhoades et al. PLoS Genetics, 2007

gene gene

A7. De novo organismal genome sequencing

assembled sequence contigs

short reads

longer reads

read pairs

Lander et al. Nature 2001

C1. Read length

Figure: read length [bp] (axis: 0 to 300)

~250 (var)

25-40 (fixed)

25-35 (fixed)

20-35 (var)

When does read length matter?

• short reads often sufficient where the entire read length can be used for mapping:

SNPs, short-INDELs, SVs

CHIP-SEQ

short RNA discovery

counting (mRNA, miRNA)

• longer reads are needed where one must use parts of reads for mapping:

de novo sequencing

novel transcript discovery

aacttagacttacagacttacatacgta

Known exon 1 Known exon 2

accgattactatacta

C2. Read error rate

• error rate dictates how many errors the aligner should tolerate

• error rate typically 0.4 - 1%

• the more errors the aligner must tolerate, the lower the fraction of the reads that can be uniquely aligned

Figure: x-axis "Number of mismatches allowed" (0, 1, 2)

• in applications where specific alleles are also essential, error rate is even more important
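To make the link between error rate and aligner tolerance concrete, the sketch below computes the fraction of reads expected to carry at most a given number of sequencing errors. It assumes independent errors at a uniform per-base rate (a simplification, since error rates vary along the read); `fraction_reads_within` is a hypothetical helper name.

```python
from math import comb

def fraction_reads_within(read_len, error_rate, max_mismatches):
    """Binomial probability that a read has at most max_mismatches errors.

    Assumption: errors are independent and equally likely at every position.
    """
    return sum(
        comb(read_len, k) * error_rate**k * (1 - error_rate) ** (read_len - k)
        for k in range(max_mismatches + 1)
    )

# A 35bp read at a 1% per-base error rate, allowing 0, 1 or 2 mismatches.
for allowed in (0, 1, 2):
    print(allowed, round(fraction_reads_within(35, 0.01, allowed), 3))
```

Under these assumptions, an aligner that allows no mismatches would lose a substantial fraction of 35bp reads to sequencing error alone, which is why some tolerance is needed even though it reduces the fraction of uniquely aligned reads.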

Figure: x-axis "Position on Read" (0 to 40); y-axis in percent

C3. Error rate grows with each cycle

• this phenomenon limits useful read length

C4. Substitutions vs. INDEL errors

• SNP discovery may require higher coverage for allele confirmation
• INDELs can be discovered with very high confidence!

• gapped alignment necessary
• good SNP discovery accuracy
• short-INDEL discovery difficult

C5. Quality values are important for allele calling

• PHRED base quality values represent the estimated likelihood of sequencing error and help us pick out true alternate alleles

• inaccurate or not well calibrated base quality values hinder allele calling

Q-values should be accurate … and high!

Quality values should be well-calibrated

assigned base quality value should be calibrated to represent the actual base quality value in every sequencing cycle
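The relationship between a PHRED quality value and the error probability it represents is Q = -10 log10(P). The sketch below shows the conversion and a simple calibration check that compares an assigned quality value with the quality implied by observed mismatches; the function names and the example counts are hypothetical.

```python
import math

def phred_to_error_prob(q):
    """Error probability represented by PHRED quality value q."""
    return 10 ** (-q / 10)

def error_prob_to_phred(p):
    """PHRED quality value for error probability p (p must be > 0)."""
    return -10 * math.log10(p)

def empirical_quality(n_mismatches, n_bases):
    """Quality implied by the observed mismatch rate among bases that were
    all assigned the same quality value (requires n_mismatches > 0)."""
    return error_prob_to_phred(n_mismatches / n_bases)

# Hypothetical example: 10,000 bases were assigned Q30, 100 of them mismatch.
assigned_q = 30
observed_q = empirical_quality(100, 10_000)
print(assigned_q, round(observed_q, 1))
```

In this made-up case the bases behave like Q20 rather than Q30, so the assigned values would be over-optimistic; a well-calibrated caller would show the two numbers agreeing in every sequencing cycle.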

C6. Representational biases / library complexity

fragmentation biases

amplification biases

sequencing biases

sequencing

low/no representation

high representation

Dispersal of read coverage

• this affects variation discovery (deeper starting read coverage is needed)
• it has a major impact on counting applications

Amplification errors

many reads from clonal copies of a single fragment

• early PCR errors in “clonal” read copies lead to false positive allele calls

early amplification error gets propagated onto every clonal copy
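One common way to limit the effect of clonal copies is to count allele support per distinct fragment rather than per read. The sketch below is an illustration under a stated assumption, not the method described in the talk: it treats reads that share chromosome, start position and strand as copies of one fragment. The name `distinct_support` and the read tuples are hypothetical.

```python
def distinct_support(reads):
    """Count each allele once per unique (chrom, start, strand).

    Assumption: reads with identical chromosome, start and strand are clonal
    copies of a single fragment. The first copy seen determines the allele.
    """
    seen = set()
    counts = {}
    for chrom, start, strand, allele in reads:
        key = (chrom, start, strand)
        if key in seen:
            continue  # skip further copies of the same fragment
        seen.add(key)
        counts[allele] = counts.get(allele, 0) + 1
    return counts

# Four clonal copies carry a "T" (possibly an early PCR error);
# three independent fragments carry the reference "C".
reads = [("chr1", 100, "+", "T")] * 4 + [
    ("chr1", 90, "+", "C"),
    ("chr1", 95, "-", "C"),
    ("chr1", 102, "+", "C"),
]
print(distinct_support(reads))
```

Counted per read, the "T" allele appears to have more support than "C"; counted per distinct fragment, it is supported by a single molecule only.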

C7. Paired-end reads

• fragment amplification: fragment length 100 - 600 bp
• fragment length limited by amplification efficiency

• circularization: 500bp - 10kb (sweet spot ~3kb)
• fragment length limited by library complexity

Korbel et al. Science 2007

• paired-end reads can improve read mapping accuracy (if unique map positions are required for both ends) or efficiency (if the fragment length constraint is used to rescue non-uniquely mapping ends)
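The rescue of a non-uniquely mapping end can be sketched as follows. This is a simplified illustration that assumes both ends map to the same chromosome and ignores read orientation; `rescue_mate` and the coordinates are hypothetical.

```python
def rescue_mate(anchor_pos, candidate_positions, min_frag, max_frag):
    """Pick the candidate position consistent with the fragment length range.

    anchor_pos is the position of the uniquely mapped end. Returns the single
    consistent candidate, or None if zero or several candidates qualify.
    """
    consistent = [
        pos
        for pos in candidate_positions
        if min_frag <= abs(pos - anchor_pos) <= max_frag
    ]
    return consistent[0] if len(consistent) == 1 else None

# One end maps uniquely at 10,000; its mate has three candidate positions.
# The library is assumed to have 100 - 600bp fragments.
print(rescue_mate(10_000, [10_300, 55_000, 912_000], 100, 600))
```

Only one candidate lies within the expected fragment length of the unique end, so the mate can be placed there; if two candidates had qualified, the pair would remain ambiguous.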

Paired-end reads for SV discovery

• longer fragments increase the chance of spanning SV breakpoints and/or entire events

• SV breakpoint detection sensitivity & resolution depend on the width of the fragment length distribution (most 2kb deletions would be detected at 10% std but missed at 30% std)

• longer fragments tend to have wider fragment length distributions
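The effect of the fragment length distribution on deletion detection can be made concrete with a small calculation. The sketch assumes a 3kb mean fragment length (the slide does not state the value behind its example) and a hypothetical detection threshold of 3 standard deviations; `deletion_z_score` and `detectable` are illustrative names.

```python
def deletion_z_score(deletion_len, mean_frag, rel_std):
    """How many standard deviations a deletion shifts the apparent fragment
    length, for a library whose std is rel_std * mean_frag."""
    return deletion_len / (mean_frag * rel_std)

def detectable(deletion_len, mean_frag, rel_std, threshold=3.0):
    """Assumed rule: a pair is flagged if its shift exceeds `threshold` stds."""
    return deletion_z_score(deletion_len, mean_frag, rel_std) > threshold

# 2kb deletion, assumed 3kb mean fragment length.
for rel_std in (0.10, 0.30):
    z = deletion_z_score(2_000, 3_000, rel_std)
    print(rel_std, round(z, 2), detectable(2_000, 3_000, rel_std))
```

Under these assumptions a 2kb deletion stands out clearly at 10% std but falls below the threshold at 30% std, in line with the bullet above.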

C8. Technologies / properties / applications

Technology                    Roche/454               Illumina/Solexa   AB/SOLiD

Read properties
Read length                   250bp                   20-40bp           25-35bp
Error rate                    <0.5%                   <1.0%             <0.5%
Dominant error type           INDEL                   SUB               SUB
Paired-end reads available    yes                     yes               yes
Paired-end separation         < 10kb (3kb optimal)    100 - 600bp       500bp - 10kb (3kb optimal)

Applications
SNP discovery                 ○                       ●                 ●
short-INDEL discovery         ● ○ (only two marks are given for this row)
SV discovery                  ○                       ○                 ●
CHIP-SEQ                      ○                       ●                 ●
small RNA/gene discovery      ○                       ●                 ●
mRNA Xcript discovery         ●                       ○                 ○
Expression profiling          ○                       ●                 ●
De novo sequencing            ●                       ?                 ?

Thanks

http://bioinformatics.bc.edu/marthlab

Derek Barnett

Eric Tsung

Aaron Quinlan

Damien Croteau-Chonka

Weichun Huang

Michael Stromberg

Chip Stewart

Michele Busby

MOSAIK talk Thursday, 7:40PM

Michael Egholm

David Bentley

Francisco de la Vega

Kristen Stoops

Ed Thayer

Clive Brown

Elaine Mardis
