Informatics tools for next-generation sequence analysis Gabor T. Marth Boston College Biology Department University of Michigan October 20, 2008
Transcript
Page 1: Informatics tools for next-generation sequence analysis Gabor T. Marth Boston College Biology Department University of Michigan October 20, 2008.

Informatics tools for next-generation sequence analysis

Gabor T. Marth, Boston College Biology Department

University of Michigan, October 20, 2008

Page 2:

Next-gen. sequencers offer vast throughput

[Figure: bases per machine run (1 Mb to 10 Gb) versus read length (10 bp to 1,000 bp)]

ABI capillary sequencer

454 pyrosequencer (100-400 Mb in 200-450 bp reads)

Illumina, AB/SOLiD short-read sequencers (5-15 Gb in 25-70 bp reads)

Page 3:

Next-gen sequencing enables new applications

Meissner et al. Nature 2008

Ruby et al. Cell, 2006

Jones-Rhoades et al. PLoS Genetics, 2007

• organismal resequencing & de novo sequencing

• transcriptome sequencing for transcript discovery and expression profiling

• epigenetic analysis (e.g. DNA methylation)

Page 4:

Large-scale individual human resequencing

Page 5:

Technologies

Page 6:

Roche / 454 system

• pyrosequencing technology

• variable read-length

• the only new technology with >100 bp reads

Page 7:

Illumina / Solexa Genome Analyzer

• fixed-length short-read sequencer

• very high throughput

• read properties are very close to traditional capillary sequences

Page 8:

AB / SOLiD system

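[Figure: 2-base encoding table; rows are the 1st base (A, C, G, T), columns are the 2nd base (A, C, G, T), and each cell holds one of four colors (0, 1, 2, 3), with each color appearing four times]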

• fixed-length short-reads

• very high throughput

• 2-base encoding system

• color-space informatics
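The 2-base encoding can be sketched in a few lines. This is a minimal illustration, assuming the commonly described SOLiD scheme in which each color (0-3) equals the XOR of 2-bit codes for two adjacent bases (A=0, C=1, G=2, T=3). The function names are hypothetical and do not come from any vendor tool.

```python
# Assumed 2-bit base codes; under this scheme color = code(base1) XOR code(base2).
BASE_CODE = {"A": 0, "C": 1, "G": 2, "T": 3}
CODE_BASE = {code: base for base, code in BASE_CODE.items()}


def encode_colors(seq):
    """Return the first base plus one color per adjacent base pair."""
    colors = [BASE_CODE[a] ^ BASE_CODE[b] for a, b in zip(seq, seq[1:])]
    return seq[0], colors


def decode_colors(first_base, colors):
    """Rebuild the base sequence from the known first base and the colors."""
    bases = [first_base]
    for color in colors:
        bases.append(CODE_BASE[BASE_CODE[bases[-1]] ^ color])
    return "".join(bases)
```

Because every base is decoded from the previous one, a single miscalled color changes all downstream bases, which is one reason color-space reads are usually aligned in color space rather than decoded first.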

Page 9:

Helicos / Heliscope system

• short-read sequencer

• single molecule sequencing

• no amplification

• variable read-length

• error rate reduced with 2-pass template sequencing

Page 10:

Data characteristics

Page 11:

Read length

[Figure: read length ranges of the four platforms; axis: read length [bp], 0 to 400]

~200-450 (variable)

25-70 (fixed)

25-50 (fixed)

20-60 (variable)

Page 12:

Representational biases

• this affects genome resequencing (deeper starting read coverage is needed)

• this will have a major impact on counting applications

“dispersed” coverage distribution

Page 13:

Amplification errors

many reads from clonal copies of a single fragment

• early PCR errors in “clonal” read copies lead to false positive allele calls

early amplification error gets propagated into every clonal copy
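One common way to limit the effect of clonal copies is to collapse reads that share the same mapped start and strand, so that an early amplification error is counted once rather than many times. The slide does not prescribe this; the sketch below is only an illustration, and the read representation (dicts with `chrom`, `start`, `strand`, `quals`) is a hypothetical one.

```python
def collapse_clonal_reads(reads):
    """Keep one read per (chrom, start, strand): the one with the highest mean quality.

    `reads` is an iterable of dicts with keys chrom, start, strand, quals
    (a hypothetical in-memory format chosen for this example).
    """
    best = {}
    for read in reads:
        key = (read["chrom"], read["start"], read["strand"])
        mean_q = sum(read["quals"]) / len(read["quals"])
        if key not in best or mean_q > best[key][0]:
            best[key] = (mean_q, read)
    return [read for _, read in best.values()]
```

The trade-off is that truly independent fragments that happen to start at the same position are also collapsed, which matters more as coverage gets deeper.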

Page 14:

Read quality

Page 15:

Error rate (Illumina)

Page 16:

Error rate (454)

Page 17:

Per-read errors (Solexa)

Page 18:

Per-read errors (454)

Page 19:

Base quality values not well calibrated

Page 20:

Tools for genome resequencing

Page 21:

The resequencing informatics pipeline

(i) base calling

(ii) read mapping

(iii) read assembly

(iv) SNP and short INDEL calling

(v) SV calling

(vi) data validation, hypothesis generation

[Figure labels: REF, IND]

Page 22:

The variation discovery “toolbox”

• base callers

• read mappers

• SNP callers

• SV callers

• assembly viewers

GigaBayes

Page 23:

1. Base calling

base sequence

base quality (Q-value) sequence

diverse chemistry & sequencing error profiles

Page 24:

454 pyrosequencer error profile

• multiple bases in a homopolymeric run are incorporated in a single incorporation test, so the number of bases must be determined from a single scalar signal; as a result, the majority of errors are INDELs

Page 25:

454 base quality values

• the native 454 base caller assigns base quality values that are too low

Page 26:

PYROBAYES: determine base number
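The idea of determining the number of bases from a single flow signal can be illustrated with a small Bayesian sketch. This is not the PyroBayes model itself: the Gaussian signal distribution, its `sigma0` and `sigma_slope` parameters, and the geometric prior (`prior_decay`) are all assumptions made for this example.

```python
import math


def homopolymer_posterior(signal, max_len=8, sigma0=0.15, sigma_slope=0.05,
                          prior_decay=0.25):
    """Posterior Pr(n | signal) for homopolymer lengths n = 0..max_len.

    Assumed model: the signal for length n is Gaussian with mean n and a
    standard deviation that grows with n; the prior on n decays geometrically.
    """
    weights = []
    for n in range(max_len + 1):
        sigma = sigma0 + sigma_slope * n
        likelihood = (math.exp(-0.5 * ((signal - n) / sigma) ** 2)
                      / (sigma * math.sqrt(2 * math.pi)))
        weights.append(likelihood * prior_decay ** n)
    total = sum(weights)
    return [w / total for w in weights]


def call_homopolymer(signal, **model):
    """Return the most probable length and a Phred-scaled confidence in it."""
    post = homopolymer_posterior(signal, **model)
    n = max(range(len(post)), key=post.__getitem__)
    p_error = max(1.0 - post[n], 1e-10)  # clamp to avoid log10(0)
    return n, -10 * math.log10(p_error)
```

A signal close to an integer yields a confident call, while a signal midway between two integers yields a low quality value, which is the behavior one wants from a calibrated base caller.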

Page 27:

PYROBAYES: Performance

• assigned quality values predict measured error rate better

• higher fraction of bases are high quality

Page 28:

Base quality value calibration

Page 29:

Recalibrated base quality values (Illumina)
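A base quality value is conventionally Phred-scaled, Q = -10 log10(P_error). Recalibration can be sketched as grouping bases by their assigned Q and replacing it with the empirically measured error rate. The input format below (pairs of assigned quality and an error flag, for example from mismatches at sites believed to be non-polymorphic) and the add-one smoothing are assumptions of this example, not details given on the slide.

```python
import math
from collections import defaultdict


def phred_to_error(q):
    """Convert a Phred quality value to an error probability."""
    return 10 ** (-q / 10)


def error_to_phred(p):
    """Convert an error probability to a Phred quality value."""
    return -10 * math.log10(p)


def recalibration_table(observations):
    """Map each assigned Q to an empirical Q.

    `observations` is an iterable of (assigned_q, is_error) pairs.
    Add-one smoothing keeps the estimate finite when no errors are seen.
    """
    counts = defaultdict(lambda: [0, 0])  # assigned_q -> [errors, total]
    for q, is_error in observations:
        counts[q][0] += int(is_error)
        counts[q][1] += 1
    return {q: error_to_phred((errors + 1) / (total + 2))
            for q, (errors, total) in counts.items()}
```

For example, bases assigned Q30 that turn out wrong about once in a hundred would be remapped to roughly Q20.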

Page 30:

2. Read mapping

Read mapping is like doing a jigsaw puzzle…

…you get the pieces…

…and they give you the picture on the box

Unique pieces are easier to place than others…

Page 31:

Non-uniqueness of reads confounds mapping

• Reads from repeats cannot be uniquely mapped back to their true region of origin

• RepeatMasker does not capture all micro-repeats, i.e. repeats at the scale of the read length

Page 32:

Strategies to deal with non-unique mapping

• Non-unique read mapping: optionally either only report uniquely mapped reads or report all map locations for each read (mapping quality values for all mapped reads are being implemented)

[Figure: one read placed at three candidate loci with alignment probabilities 0.8, 0.19 and 0.01]

• mapping to multiple loci requires the assignment of alignment probabilities
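Assigning alignment probabilities to multiple candidate loci can be sketched as follows. The likelihood model (independent per-base errors derived from Phred qualities, ungapped alignment, uniform prior over candidate loci) is an assumption for illustration and is not claimed to be what MOSAIK computes.

```python
def alignment_likelihood(read, quals, ref_segment):
    """Pr(read | it came from ref_segment), assuming independent base errors."""
    likelihood = 1.0
    for base, q, ref in zip(read, quals, ref_segment):
        e = 10 ** (-q / 10)  # Phred quality -> error probability
        likelihood *= (1 - e) if base == ref else e / 3
    return likelihood


def alignment_probabilities(read, quals, candidates):
    """Normalize per-locus likelihoods into probabilities (uniform prior)."""
    likes = [alignment_likelihood(read, quals, seg) for seg in candidates]
    total = sum(likes)
    return [like / total for like in likes]
```

A read with one perfect hit and one single-mismatch hit gets most of its probability mass on the perfect hit, while two identical hits split the mass evenly, which is exactly the case a unique-only filter would discard.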

Page 33:

Paired-end reads help unique read placement

PE (paired-end)

• fragment amplification: fragment length 100 - 600 bp

• fragment length limited by amplification efficiency

MP (mate-pair)

• circularization: 500 bp - 10 kb (sweet spot ~3 kb)

• fragment length limited by library complexity

Korbel et al. Science 2007

• PE reads are now the standard for genome resequencing

Page 34:

MOSAIK

Page 35:

INDEL alleles/errors – gapped alignments

454

Page 36:

Aligning multiple read types together

ABI/capillary

454 FLX

454 GS20

Illumina

• Alignment and co-assembly of multiple read types permits simultaneous analysis of data from multiple sources with different error characteristics

Page 37:

Aligner speed

Page 38:

3. Polymorphism / mutation detection

sequencing error

polymorphism

Page 39:

New challenges for SNP calling

• deep alignments of 100s / 1000s of individuals

• trio sequences

Page 40:

Rare alleles in 100s / 1,000s of samples

Page 41:

Allele discovery is a multi-step sampling process

Population Samples Reads

Page 42:

Capturing the allele in the sample

[Figure: Prob(allele captured in sample), from 0 to 1, versus population AF, from 1E-04 to 0.5, with one curve per sample size: n=100, n=200, n=400, n=800, n=1600]
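The capture probability follows from simple sampling arithmetic. A minimal sketch, assuming that n counts diploid individuals drawn at random (the slide does not say whether n is individuals or chromosomes), so that an allele at population frequency f is missed only if all 2n sampled chromosomes lack it:

```python
def prob_allele_captured(allele_freq, n_individuals, ploidy=2):
    """Probability that at least one sampled chromosome carries the allele."""
    return 1.0 - (1.0 - allele_freq) ** (ploidy * n_individuals)
```

Under this assumption an allele at frequency 0.001 is present in a sample of 100 individuals with probability of about 0.18, and the probability rises steadily as the sample grows to 1600 individuals.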

Page 43:

Allele calling in the reads

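$$
\Pr(G_1^{k_1},\ldots,G_n^{k_n} \mid B_1,\ldots,B_n) =
\frac{\Pr(G_1^{k_1},\ldots,G_n^{k_n}) \prod_{i=1}^{n} \sum_{T_i} \Pr(B_i \mid T_i)\,\Pr(T_i \mid G_i^{k_i})}
{\sum_{l_1,\ldots,l_n} \Pr(G_1^{l_1},\ldots,G_n^{l_n}) \prod_{i=1}^{n} \sum_{T_i} \Pr(B_i \mid T_i)\,\Pr(T_i \mid G_i^{l_i})}
$$

Terms annotated on the slide: base call, base quality, individual read coverage, sample size

GigaBayes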
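The Bayesian allele-calling idea on this slide can be illustrated for a single diploid individual at one site. This is a simplified sketch, not the GigaBayes implementation: it assumes two candidate alleles, independent reads, each read sampled from either chromosome with probability 1/2, base errors spread evenly over the three other bases, and a uniform genotype prior unless one is supplied.

```python
from itertools import combinations_with_replacement


def read_likelihood(base, qual, genotype):
    """Pr(base call | genotype), summing over the template allele T."""
    e = 10 ** (-qual / 10)  # Phred quality -> error probability
    # Pr(T | G) = 1/2 for each of the two alleles in the genotype.
    return sum(0.5 * ((1 - e) if base == t else e / 3) for t in genotype)


def genotype_posteriors(bases, quals, alleles=("A", "C"), priors=None):
    """Posterior probability of each diploid genotype given the base calls."""
    genotypes = list(combinations_with_replacement(alleles, 2))
    if priors is None:
        priors = {g: 1 / len(genotypes) for g in genotypes}
    joint = {}
    for g in genotypes:
        p = priors[g]
        for base, qual in zip(bases, quals):
            p *= read_likelihood(base, qual, g)
        joint[g] = p
    total = sum(joint.values())  # normalizing sum over all genotypes
    return {g: p / total for g, p in joint.items()}
```

With two A reads and two C reads at Q30 the heterozygote dominates; with four A reads the homozygote wins, but the heterozygote keeps a non-negligible posterior because four reads can all come from the same chromosome.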

Page 44:

Allele calling in deep sequence data

aatgtagtaAgtacctac
aatgtagtaAgtacctac
aatgtagtaAgtacctac
aatgtagtaAgtacctac
aatgtagtaCgtacctac
aatgtagtaCgtacctac

aatgtagtaAgtacctac
aatgtagtaAgtacctac
aatgtagtaAgtacctac
aatgtagtaAgtacctac

aatgtagtaAgtacctac
aatgtagtaAgtacctac
aatgtagtaAgtacctac
aatgtagtaAgtacctac

      Q30    Q40    Q50    Q60
1     0.01   0.01   0.1    0.5
2     0.82   1.0    1.0    1.0
3     1.0    1.0    1.0    1.0

Page 45:

More samples or deeper coverage / sample?

Shallower read coverage from more individuals …

…or deeper coverage from fewer samples?

simulation analysis by Aaron Quinlan

Page 46:

Analysis indicates a balance

Page 47:

SNP calling in trios

[Equation: Pr(G_C | G_M, G_F), the probability of the child genotype G_C (11, 12 or 22) given the mother genotype G_M and the father genotype G_F, tabulated for each combination of parental genotypes]

• the child inherits one chromosome from each parent

• there is a small probability for a mutation in the child
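The two bullets above can be turned into a small transmission model. This is a sketch under stated assumptions rather than the formula from the slide: a biallelic site with alleles 1 and 2, each parent passing on one of its two alleles with probability 1/2, and the transmitted allele switching to the other allele with a hypothetical per-transmission mutation probability `mu`.

```python
from itertools import product


def transmit(parent_genotype, mu):
    """Distribution of the allele passed on by one parent, after mutation."""
    dist = {"1": 0.0, "2": 0.0}
    for allele in parent_genotype:  # each parental allele is chosen with prob 1/2
        other = "2" if allele == "1" else "1"
        dist[allele] += 0.5 * (1 - mu)
        dist[other] += 0.5 * mu
    return dist


def child_genotype_probs(mother, father, mu=1e-8):
    """Pr(G_C | G_M, G_F) for genotypes written as '11', '12' or '22'."""
    from_mother, from_father = transmit(mother, mu), transmit(father, mu)
    probs = {"11": 0.0, "12": 0.0, "22": 0.0}
    for a, b in product("12", repeat=2):
        probs["".join(sorted(a + b))] += from_mother[a] * from_father[b]
    return probs
```

Used as a prior in trio calling, this table makes a heterozygous child of two homozygous-reference parents very unlikely, so an apparent de novo variant needs strong read evidence to be called.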

Page 48:

SNP calling in trios

aatgtagtaAgtacctac
aatgtagtaAgtacctac
aatgtagtaAgtacctac
aatgtagtaAgtacctac
aatgtagtaAgtacctac
aatgtagtaCgtacctac

aatgtagtaAgtacctac
aatgtagtaAgtacctac
aatgtagtaAgtacctac
aatgtagtaAgtacctac
aatgtagtaAgtacctac
aatgtagtaAgtacctac

aatgtagtaAgtacctac
aatgtagtaAgtacctac
aatgtagtaAgtacctac
aatgtagtaAgtacctac
aatgtagtaAgtacctac
aatgtagtaCgtacctac

mother, father, child

P=0.79

P=0.86

Page 49:

Determining genotype directly from sequence

individual 1, individual 2, individual 3

AACGTTAGCATA
AACGTTAGCATA
AACGTTCGCATA
AACGTTCGCATA
genotype: A/C

AACGTTCGCATA
AACGTTCGCATA
AACGTTCGCATA
AACGTTCGCATA
genotype: C/C

AACGTTAGCATA
AACGTTAGCATA
genotype: A/A

Page 50:

4. Structural variation discovery

Page 51:

SV events from PE read mapping patterns

[Figure: SV event types and their PE read mapping patterns against the DNA reference; symbols as on the slide: LM, LF, Ldel, Ldup, Linv, LT1, LT2, Lins, LT]

Deletion: LM ~ LF+Ldel & depth: low

Tandem duplication: LM ~ LF-Ldup & depth: high

Inversion: LM ~ +Linv & ends flipped, LM ~ -Linv & depth: normal

Translocation: LM ~ LF+LT1, LM ~ LF+LT2, LM ~ LF-LT1-LT2 & depth: normal

Insertion: un-paired read clusters & depth: normal

Chromosomal translocation: LM ~ LF+LT & depth: normal & cross-paired read clusters
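The mapping-distance patterns can be sketched as a per-pair classifier. This is a toy illustration, assuming LM is the mapped distance of a read pair and LF the expected fragment length; the function name, the `n_sd` threshold and the returned labels are hypothetical. It ignores depth of coverage and clustering, both of which a real caller would need before reporting an event.

```python
def classify_pair(lm, lf_mean, lf_sd, same_chrom=True, ends_flipped=False,
                  mate_mapped=True, n_sd=3.0):
    """Label one read pair by how its mapping deviates from the expected fragment."""
    if not mate_mapped:
        return "insertion candidate (un-paired read)"
    if not same_chrom:
        return "chromosomal translocation candidate"
    if ends_flipped:
        return "inversion candidate"
    if lm > lf_mean + n_sd * lf_sd:
        return "deletion candidate"           # LM ~ LF + Ldel
    if lm < lf_mean - n_sd * lf_sd:
        return "tandem duplication candidate"  # LM ~ LF - Ldup
    return "concordant"
```

A single aberrant pair is weak evidence; in practice candidates would be kept only where several pairs with the same signature cluster at the same breakpoints.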

Page 52:

Deletion: Aberrant positive mapping distance

Page 53:

Copy number estimation from depth of coverage

Page 54:

Alignability – read coverage normalization

$$A(p) = \frac{\text{uniquely mapped reads}}{\text{all possible mapped reads}}$$
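Alignability can be approximated on a toy reference by asking, for each position p, what fraction of the reads that could cover p are unique in the reference. The sketch below assumes exact-match, forward-strand-only reads of a fixed length; the function names are hypothetical, and a real calculation would use the actual aligner and both strands.

```python
from collections import Counter


def alignability(reference, read_len):
    """A(p) for every position: unique read_len-mers covering p / all covering p."""
    n_kmers = len(reference) - read_len + 1
    kmers = [reference[i:i + read_len] for i in range(n_kmers)]
    counts = Counter(kmers)
    unique = [counts[k] == 1 for k in kmers]
    result = []
    for p in range(len(reference)):
        # Start positions of all reads that overlap position p.
        starts = range(max(0, p - read_len + 1), min(p, n_kmers - 1) + 1)
        result.append(sum(unique[s] for s in starts) / len(starts))
    return result


def normalized_depth(depth, align):
    """Divide observed depth by A(p); positions with A(p) = 0 are left undefined."""
    return [d / a if a > 0 else None for d, a in zip(depth, align)]
```

Dividing observed depth by A(p) raises the coverage estimate where few reads can map uniquely, so that a genuine drop in copy number stands out from a drop caused by repeats.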

Page 55:

Het deletion “revealed” by normalization

Page 56:

Tandem duplication: negative mapping distance

Page 57:

Spanner – a hybrid SV/CNV detection tool

Navigation bar

Fragment lengths in selected region

Depth of coverage in selected region

Page 58:

5. Data visualization

1. aid software development: integration of trace data viewing, fast navigation, zooming/panning

2. facilitate data validation (e.g. SNP validation): simultaneous viewing of multiple read types, quality value displays

3. promote hypothesis generation: integration of annotation tracks

Page 59:

Data visualization

Page 60:

Our software tools for next-gen data

http://bioinformatics.bc.edu/marthlab/Beta_Release

Page 61:

Data mining projects

Page 62:

SNP calling in short-read coverage

C. elegans reference genome (Bristol, N2 strain)

Pasadena, CB4858 (1 ½ machine runs)

Bristol, N2 strain (3 ½ machine runs)

• goal was to evaluate the Solexa/Illumina technology for the complete resequencing of large model-organism genomes

• 5 runs (~120 million Illumina reads) from the Wash. U. Genome Center, as part of a collaborative project led by Elaine Mardis at Washington University

• primary aim was to detect polymorphisms between the Pasadena and the Bristol strain

Page 63:

Polymorphism discovery in C. elegans

• SNP calling error rate very low:

Validation rate = 97.8% (224/229)
Conversion rate = 92.6% (224/242)
Missed SNP rate = 3.75% (26/693)

[Figure panels: SNP, INS]

• INDEL candidates validate and convert at similar rates to SNPs:

Validation rate = 89.3% (193/216)
Conversion rate = 87.3% (193/221)

• MOSAIK aligned / assembled the reads (< 4 hours on 1 CPU)

• PBSHORT found 44,642 SNP candidates (2 hours on our 160-CPU cluster)

• SNP density: 1 in 1,630 bp (of non-repeat genome sequence)

Page 64:

Mutational profiling: deep 454/Illumina/SOLiD data

• Pichia stipitis converts xylose to ethanol (bio-fuel production)

• one mutagenized strain had especially high conversion efficiency

• determine where the mutations were that caused this phenotype

• we resequenced the 15 Mb genome with 454, Illumina, and SOLiD reads

• 14 true point mutations in the entire genome

Pichia stipitis reference sequence

Image from JGI web site

Page 65:

Technology comparisons

Page 66:

Thanks

Page 67:

Credits

Elaine Mardis

Andy Clark

Aravinda Chakravarti

Doug Smith

Michael Egholm

Scott Kahn

Francisco de la Vega

Kristen Stoops

Ed Thayer

Page 68:

Lab

Page 69:

Recruitment

