
Next-generation sequencing – the informatics angle

Gabor T. Marth
Boston College Biology Department

AGBT 2008
Marco Island, FL. February 6, 2008


T1. Roche / 454 FLX system

• pyrosequencing technology
• variable read-length
• the only new technology with >100bp reads
• tested in many published applications
• supports paired-end read protocols with up to 10kb separation size


T2. Illumina / Solexa Genome Analyzer

• fixed-length short-read sequencer
• read properties are very close to those of traditional capillary sequences
• very low INDEL error rate
• tested in many published applications
• paired-end read protocols support short (<600bp) separation


T3. AB / SOLiD system

2-base color code (rows: 1st base, columns: 2nd base)

         A  C  G  T
    A    0  1  2  3
    C    1  0  3  2
    G    2  3  0  1
    T    3  2  1  0

• fixed-length short-read sequencer
• employs a 2-base encoding system that can be used for error reduction and improving SNP calling accuracy
• requires color-space informatics
• published applications underway / in review
• paired-end read protocols support up to 10kb separation size
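
The 2-base encoding can be illustrated with a short Python sketch. It assumes the commonly published SOLiD color assignment, in which the color of a base pair equals the XOR of 2-bit base codes; the function names are hypothetical. It shows why the encoding helps SNP calling: a true single-base change alters two adjacent colors, so an isolated color change is more likely a measurement error.

```python
# Assumed 2-bit base codes; the color of a dinucleotide is the XOR of its two codes.
BASE_BITS = {"A": 0, "C": 1, "G": 2, "T": 3}

def to_colors(seq):
    """Encode a base sequence as a list of colors, one per adjacent base pair."""
    return [BASE_BITS[a] ^ BASE_BITS[b] for a, b in zip(seq, seq[1:])]

def color_differences(seq_a, seq_b):
    """Return the color positions at which two equal-length sequences differ."""
    colors_a, colors_b = to_colors(seq_a), to_colors(seq_b)
    return [i for i, (x, y) in enumerate(zip(colors_a, colors_b)) if x != y]

reference = "ACGGTCA"
snp_read = "ACGATCA"  # single-base change at base position 3 (G -> A)

print(to_colors(reference))                    # [1, 3, 0, 1, 2, 1]
print(color_differences(reference, snp_read))  # [2, 3]: two adjacent colors change
```

A read whose color string differs from the reference at a single position, with no matching change in the neighboring color, cannot be explained by one base substitution under this encoding.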


T4. Helicos / Heliscope system

• experimental short-read sequencer system
• single molecule sequencing
• no amplification
• variable read-length
• error rate reduced with 2-pass template sequencing


A1. Variation discovery: SNPs and short-INDELs

1. sequence alignment

2. dealing with non-unique mapping

3. looking for allelic differences


A2. Structural variation detection

• structural variations (deletions, insertions, inversions and translocations) from paired-end read map locations

• copy number (for amplifications, deletions) from depth of read coverage
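
The depth-of-coverage idea can be sketched as follows. This is a minimal Python illustration, not the method used in the talk: it assumes read start positions on a single reference sequence, a fixed window size, and normalization by the median window depth. All function names and the toy data are hypothetical, and the sketch ignores representational biases.

```python
from collections import Counter
from statistics import median

def window_counts(read_starts, genome_length, window):
    """Count read start positions in consecutive fixed-size windows."""
    n_windows = (genome_length + window - 1) // window
    counts = Counter(pos // window for pos in read_starts)
    return [counts.get(i, 0) for i in range(n_windows)]

def copy_ratio(counts):
    """Depth of each window relative to the median window depth."""
    typical = median(counts)
    if typical == 0:
        raise ValueError("median depth is zero; coverage too low to normalize")
    return [c / typical for c in counts]

# Toy data: five 100 bp windows; the third has half the usual depth, the fifth double.
starts = [10] * 20 + [110] * 20 + [210] * 10 + [310] * 20 + [410] * 40
ratios = copy_ratio(window_counts(starts, genome_length=500, window=100))
print(ratios)  # [1.0, 1.0, 0.5, 1.0, 2.0]
```

Ratios near 0.5 suggest a heterozygous deletion and ratios near 2.0 an amplification, under the assumption that most windows are at normal copy number.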


A3. Identification of protein-bound DNA

[Figure: aligned reads shown against the genome sequence]

• chromatin structure (CHIP-SEQ) (Mikkelsen et al. Nature 2007)

• transcription factor binding sites (Robertson et al. Nature Methods, 2007)


A4. Novel transcript discovery (genes)

• novel genes / exons

[Figure: inferred exon 1, inferred exon 2]

• novel transcripts in known genes

[Figure: known exon 1, known exon 2]


A5. Novel transcript discovery (miRNAs)

Ruby et al. Cell, 2006


A6. Expression profiling by tag counting

[Figure: aligned reads over each of two genes]

Jones-Rhoades et al. PLoS Genetics, 2007


A7. De novo organismal genome sequencing

[Figure: short reads, longer reads and read pairs; assembled sequence contigs]

Lander et al. Nature 2001


C1. Read length

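[Figure: read length in bp, axis 0 to 300, one bar per technology]

• Roche / 454: ~250 (variable)
• Illumina / Solexa: 25-40 (fixed)
• AB / SOLiD: 25-35 (fixed)
• Helicos / Heliscope: 20-35 (variable)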


When does read length matter?

• short reads are often sufficient where the entire read length can be used for mapping:
    SNPs, short-INDELs, SVs
    CHIP-SEQ
    short RNA discovery
    counting (mRNA, miRNA)

• longer reads are needed where one must use parts of reads for mapping:
    de novo sequencing
    novel transcript discovery

[Figure: read sequences aacttagacttacagacttacatacgta and accgattactatacta shown against known exon 1 and known exon 2]


C2. Read error rate

• error rate dictates how many errors the aligner should tolerate

• error rate typically 0.4 - 1%

• the more errors the aligner must tolerate, the lower the fraction of the reads that can be uniquely aligned

[Figure: fraction of genome (0.00 to 0.40) vs. number of mismatches allowed (0, 1, 2)]

• in applications where the specific alleles matter in addition to the map location, error rate is even more important
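
The link between error rate and the number of mismatches the aligner should tolerate can be made concrete with a simple binomial model. The Python sketch below assumes independent errors at a uniform per-base rate, which is a simplification (the error rate actually grows along the read). The function name and the 35 bp read length are illustrative choices, not values from the talk.

```python
from math import comb

def prob_at_most_k_errors(read_length, error_rate, k):
    """P(read contains <= k errors) under an independent, uniform per-base error model."""
    return sum(
        comb(read_length, i) * error_rate**i * (1 - error_rate) ** (read_length - i)
        for i in range(k + 1)
    )

# Fraction of 35 bp reads that stay alignable when 0, 1 or 2 mismatches are allowed,
# at the upper end of the quoted error range (1%).
for k in range(3):
    print(k, round(prob_at_most_k_errors(35, 0.01, k), 3))
```

Allowing more mismatches recovers more reads that carry errors, at the cost of fewer unique map locations, which is the trade-off described in the bullets above.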


C3. Error rate grows with each cycle

[Figure: error rate (0% to 10%) vs. position on read (0 to 40)]

• this phenomenon limits useful read length


C4. Substitutions vs. INDEL errors

when substitution errors dominate:
• SNP discovery may require higher coverage for allele confirmation
• INDELs can be discovered with very high confidence!

when INDEL errors dominate:
• gapped alignment necessary
• good SNP discovery accuracy
• short-INDEL discovery difficult


C5. Quality values are important for allele calling

• PHRED base quality values represent the estimated likelihood of sequencing error and help us pick out true alternate alleles

• inaccurate or poorly calibrated base quality values hinder allele calling

Q-values should be accurate … and high!
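
For reference, the PHRED scale relates a quality value Q to an error probability P by Q = -10 log10(P). A minimal Python sketch of the conversion (the function names are hypothetical):

```python
from math import log10

def phred_to_error_prob(q):
    """Error probability implied by a PHRED quality value: P = 10^(-Q/10)."""
    return 10 ** (-q / 10)

def error_prob_to_phred(p):
    """PHRED quality value for an error probability: Q = -10 * log10(P)."""
    return -10 * log10(p)

# Q10, Q20 and Q30 correspond to 1 error in 10, 100 and 1000 base calls.
for q in (10, 20, 30):
    print(q, phred_to_error_prob(q))
```

This is why high Q-values matter for allele calling: a mismatch at a Q30 base is far less likely to be a sequencing error than one at a Q10 base.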


Quality values should be well-calibrated

assigned base quality value should be calibrated to represent the actual base quality value in every sequencing cycle
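
One way to check calibration is to compare each assigned quality value with the quality actually observed for the bases that received it. The Python sketch below assumes the reads have been aligned to a trusted reference so that every base call can be labeled correct or wrong. The input format, the function name and the add-one smoothing are all assumptions made for illustration.

```python
from collections import defaultdict
from math import log10

def empirical_quality(observations):
    """Map each assigned Q to the quality observed for bases that were given that Q.

    observations: iterable of (assigned_q, is_error) pairs, one per base call.
    """
    totals = defaultdict(int)
    errors = defaultdict(int)
    for assigned_q, is_error in observations:
        totals[assigned_q] += 1
        errors[assigned_q] += int(is_error)
    result = {}
    for q, n in totals.items():
        # Add-one smoothing (an assumption) keeps the estimate finite when no errors are seen.
        rate = (errors[q] + 1) / (n + 2)
        result[q] = -10 * log10(rate)
    return result

# Toy data: 1000 bases assigned Q20, of which 99 are wrong (about 10% errors, i.e. about Q10).
obs = [(20, True)] * 99 + [(20, False)] * 901
print(empirical_quality(obs))
```

In this toy example the bases labeled Q20 behave like Q10 bases, so the assigned values are overconfident. The same tally can be kept separately per sequencing cycle by using (cycle, assigned_q) as the key.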


C6. Representational biases / library complexity

• fragmentation biases
• amplification biases (PCR)
• sequencing biases (sequencing)

[Figure: regions with low/no representation vs. high representation]


Dispersal of read coverage

• this affects variation discovery (deeper starting read coverage is needed)
• its major impact is on counting applications


Amplification errors

many reads from clonal copies of a single fragment

• early PCR errors in “clonal” read copies lead to false positive allele calls

early amplification error gets propagated onto every clonal copy
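
The effect of clonal copies on allele support can be illustrated with a small Python sketch. It assumes that reads sharing the same start coordinate and strand are clonal copies of one fragment, which is a heuristic, not a rule stated in the talk. The tuple layout and function name are hypothetical.

```python
def allele_support(reads, allele):
    """Return (supporting reads, distinct supporting fragments) for an allele.

    reads: iterable of (fragment_start, strand, observed_allele) tuples; reads that
    share start and strand are treated as clonal copies of one fragment (assumption).
    """
    supporting = [(start, strand) for start, strand, obs in reads if obs == allele]
    return len(supporting), len(set(supporting))

# Six clonal copies carry a "T" (possibly an early PCR error); three independent
# fragments carry the reference "C".
reads = [(1040, "+", "T")] * 6 + [(1012, "+", "C"), (1055, "-", "C"), (1031, "+", "C")]
print(allele_support(reads, "T"))  # (6, 1)
print(allele_support(reads, "C"))  # (3, 3)
```

Counted by reads, the "T" allele looks well supported; counted by distinct fragments, it rests on a single molecule, which is how an early amplification error turns into a false positive call.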


C7. Paired-end reads

• fragment amplification: fragment length 100 - 600 bp
• fragment length limited by amplification efficiency

• circularization: 500bp - 10kb (sweet spot ~3kb)
• fragment length limited by library complexity

Korbel et al. Science 2007

• paired-end reads can improve read mapping accuracy (if unique map positions are required for both ends) or efficiency (if the fragment length constraint is used to rescue non-uniquely mapping ends)


Paired-end reads for SV discovery

• longer fragments increase the chance of spanning SV breakpoints and/or entire events

• SV breakpoint detection sensitivity & resolution depend on the width of the fragment length distribution (most 2kb deletions would be detected at 10% std but missed at 30% std)

• longer fragments tend to have wider fragment length distributions
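
The dependence on the width of the fragment length distribution can be sketched numerically in Python. The model below is an assumption made for illustration, not the detection method from the talk: fragment lengths are taken as normally distributed, the mean is set to 3kb, and a single spanning read pair is flagged when its mapped distance exceeds the mean by more than 3 standard deviations. The function and parameter names are hypothetical.

```python
from statistics import NormalDist

def deletion_detection_prob(deletion_size, mean_fragment, rel_std, z_cutoff=3.0):
    """Probability that one read pair spanning a deletion maps suspiciously far apart.

    A pair spanning a deletion maps (fragment length + deletion size) apart on the
    reference; it is flagged if that distance exceeds mean + z_cutoff * std.
    """
    std = rel_std * mean_fragment
    # Flagged when (fragment - mean) / std > z_cutoff - deletion_size / std.
    return 1.0 - NormalDist().cdf(z_cutoff - deletion_size / std)

for rel_std in (0.10, 0.30):
    p = deletion_detection_prob(2000, mean_fragment=3000, rel_std=rel_std)
    print(f"std = {rel_std:.0%} of mean: P(flag) = {p:.3f}")
```

Under these assumptions nearly every pair spanning a 2kb deletion is flagged at 10% standard deviation, while most are missed at 30%, in line with the example on the slide.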


C8. Technologies / properties / applications

                             Roche/454             Illumina/Solexa   AB/SOLiD

Read properties
Read length                  250bp                 20-40bp           25-35bp
Error rate                   <0.5%                 <1.0%             <0.5%
Dominant error type          INDEL                 SUB               SUB
Paired-end reads available   yes                   yes               yes
Paired-end separation        < 10kb (3kb optimal)  100 - 600bp       500bp - 10kb (3kb optimal)

Applications
SNP discovery                ○                     ●                 ●
short-INDEL discovery                              ●                 ○
SV discovery                 ○                     ○                 ●
CHIP-SEQ                     ○                     ●                 ●
small RNA/gene discovery     ○                     ●                 ●
mRNA transcript discovery    ●                     ○                 ○
Expression profiling         ○                     ●                 ●
De novo sequencing           ●                     ?                 ?


Thanks

http://bioinformatics.bc.edu/marthlab

Derek Barnett

Eric Tsung

Aaron Quinlan

Damien Croteau-Chonka

Weichun Huang

Michael Stromberg

Chip Stewart

Michele Busby

MOSAIK talk Thursday, 7:40PM

Michael Egholm

David Bentley

Francisco de la Vega

Kristen Stoops

Ed Thayer

Clive Brown

Elaine Mardis

