Informatics tools for next-generation sequence analysis Gabor T. Marth Boston College Biology...


Informatics tools for next-generation sequence analysis

Gabor T. Marth, Boston College Biology Department

University of Michigan, October 20, 2008

Next-gen. sequencers offer vast throughput

[Figure: sequencing platforms compared by read length (axis: 10 bp, 100 bp, 1,000 bp) and throughput (axis mark: 100 Mb)]

Illumina, AB/SOLiD short-read sequencers (5-15 Gb in 25-70 bp reads)

454 pyrosequencer (100-400 Mb in 200-450 bp reads)

ABI capillary sequencer

Next-gen sequencing enables new applications

Meissner et al. Nature 2008

Ruby et al. Cell, 2006

Jones-Rhoades et al. PLoS Genetics, 2007

• organismal resequencing & de novo sequencing

• transcriptome sequencing for transcript discovery and expression profiling

• epigenetic analysis (e.g. DNA methylation)

Large-scale individual human resequencing

Technologies

Roche / 454 system

• pyrosequencing technology

• variable read-length

• the only new technology with >100 bp reads

Illumina / Solexa Genome Analyzer

• fixed-length short-read sequencer

• very high throughput

• read properties are very close to traditional capillary sequences

AB / SOLiD system

[Figure: 2-base encoding matrix; labels: A C G T, 2nd Base]

• fixed-length short reads

• very high throughput

• 2-base encoding system

• color-space informatics
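The 2-base encoding can be illustrated with a short sketch. It assumes the commonly described SOLiD color code, in which each color (0-3) encodes the transition between two adjacent bases and equals the XOR of 2-bit base codes (A=0, C=1, G=2, T=3). The function names are hypothetical and are not part of any vendor software.

```python
# Sketch of 2-base (color-space) encoding; function names are illustrative.
BASE_TO_BITS = {"A": 0, "C": 1, "G": 2, "T": 3}
BITS_TO_BASE = "ACGT"

def encode_colors(seq):
    """Return the known first base plus one color per adjacent base pair."""
    colors = [BASE_TO_BITS[a] ^ BASE_TO_BITS[b] for a, b in zip(seq, seq[1:])]
    return seq[0], colors

def decode_colors(first_base, colors):
    """Rebuild the base sequence by walking the colors from the first base."""
    bases = [first_base]
    for color in colors:
        bases.append(BITS_TO_BASE[BASE_TO_BITS[bases[-1]] ^ color])
    return "".join(bases)
```

Because each base is decoded from the previous one, a single miscalled color changes every decoded base downstream of it, whereas a true single-base difference changes two adjacent colors. This is one reason to compare reads and reference in color space rather than decoding first.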

Helicos / Heliscope system

• short-read sequencer

• single molecule sequencing

• no amplification

• variable read-length

• error rate reduced with 2-pass template sequencing

Data characteristics

Read length

[Figure: read-length ranges by platform; axis: read length [bp], 0 to 300]

~200-450 (variable)

25-70 (fixed)

25-50 (fixed)

20-60 (variable)

Representational biases

• this affects genome resequencing (deeper starting read coverage is needed)

• it will have a major impact on counting applications

“dispersed” coverage distribution

Amplification errors

many reads from clonal copies of a single fragment

• early PCR errors in “clonal” read copies lead to false positive allele calls

early amplification error gets propagated into every clonal copy
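One common way to limit the effect of clonal copies, which the slides do not describe, is to keep a single read per mapping start and strand. The sketch below assumes that approach; the `Read` record and its fields are hypothetical.

```python
from collections import namedtuple

# Hypothetical minimal read record; real aligners report many more fields.
Read = namedtuple("Read", "name chrom start strand mean_quality")

def collapse_clonal_copies(reads):
    """Keep one read per (chrom, start, strand), preferring higher quality."""
    best = {}
    for read in reads:
        key = (read.chrom, read.start, read.strand)
        if key not in best or read.mean_quality > best[key].mean_quality:
            best[key] = read
    return sorted(best.values(), key=lambda r: (r.chrom, r.start, r.strand))
```

Collapsing by position also discards independent fragments that happen to start at the same base, so it trades some coverage for fewer false positive allele calls.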

Read quality

Error rate (Illumina)

Error rate (454)

Per-read errors (Solexa)

Per read errors (454)

Base quality values not well calibrated

Tools for genome resequencing

The resequencing informatics pipeline

(i) base calling

(ii) read mapping

(iii) read assembly

(iv) SNP and short INDEL calling

(v) SV calling

(vi) data validation, hypothesis generation

The variation discovery “toolbox”

• base callers

• read mappers

• SNP callers

• SV callers

• assembly viewers

GigaBayes

1. Base calling

base sequence

base quality (Q-value) sequence
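For reference, a minimal sketch of the Phred convention that Q-values normally follow, Q = -10 log10(P_error). It assumes the base callers discussed here report Phred-scaled values; the function names are illustrative.

```python
import math

def error_prob_from_q(q):
    """Convert a Phred-scaled quality value to an error probability."""
    return 10 ** (-q / 10)

def q_from_error_prob(p):
    """Convert an error probability (0 < p <= 1) to a Phred-scaled quality."""
    return -10 * math.log10(p)
```

Under this convention Q20 corresponds to one expected error in 100 base calls and Q30 to one in 1,000.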

diverse chemistry & sequencing error profiles

454 pyrosequencer error profile

• multiple bases in a homopolymeric run are incorporated in a single incorporation test, so the number of bases must be determined from a single scalar signal; as a result, the majority of errors are INDELs
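The sketch below shows why homopolymer runs produce INDEL errors. It is a naive rounding baseline, not the PyroBayes method, and it assumes a repeating flow order of T, A, C, G and signals normalized so that one incorporated base gives a value near 1.0.

```python
def call_bases_from_flows(signals, flow_order="TACG"):
    """Naive base calling: round each flow signal to a homopolymer length."""
    bases = []
    for i, signal in enumerate(signals):
        nucleotide = flow_order[i % len(flow_order)]
        # Round half up and clamp so noisy negative signals give zero bases.
        run_length = max(0, int(signal + 0.5))
        bases.append(nucleotide * run_length)
    return "".join(bases)
```

A signal of 2.45 is called as two bases and a signal of 2.55 as three, so a small amount of signal noise on a long run inserts or deletes a base instead of substituting one.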

454 base quality values

• the native 454 base caller assigns base quality values that are too low

PYROBAYES: determine base number

PYROBAYES: Performance

• assigned quality values predict measured error rate better

• higher fraction of bases are high quality

Base quality value calibration

Recalibrated base quality values (Illumina)
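A minimal sketch of empirical quality calibration, assuming bases can be compared against a trusted reference so that each base call is known to be right or wrong. The input layout and the add-one smoothing are assumptions made for illustration, not the procedure used to produce the recalibrated values shown in the talk.

```python
import math
from collections import defaultdict

def calibration_table(observations):
    """observations: iterable of (assigned_q, is_error) pairs.

    Returns {assigned_q: measured_q} from the observed error rate per bin.
    """
    counts = defaultdict(lambda: [0, 0])  # assigned_q -> [errors, total]
    for assigned_q, is_error in observations:
        counts[assigned_q][0] += int(is_error)
        counts[assigned_q][1] += 1
    table = {}
    for assigned_q, (errors, total) in counts.items():
        # Add-one smoothing (an assumption) avoids log10(0) for error-free bins.
        rate = (errors + 1) / (total + 2)
        table[assigned_q] = round(-10 * math.log10(rate))
    return table
```

Well-calibrated quality values give a table in which each assigned Q maps to roughly the same measured Q.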

2. Read mapping

Read mapping is like doing a jigsaw puzzle…

…you get the pieces…

…and they give you the picture on the box

Unique pieces are easier to place than others…

Non-uniqueness of reads confounds mapping

• Reads from repeats cannot be uniquely mapped back to their true region of origin

• RepeatMasker does not capture all micro-repeats, i.e. repeats at the scale of the read length

Strategies to deal with non-unique mapping

• Non-unique read mapping: optionally either only report uniquely mapped reads or report all map locations for each read (mapping quality values for all mapped reads are being implemented)

[Figure: one read placed at three candidate loci with alignment probabilities 0.8, 0.19, 0.01]

• mapping to multiple loci requires the assignment of alignment probabilities
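As a sketch of how alignment probabilities might be assigned, the code below normalizes per-locus alignment likelihoods into posterior probabilities under a uniform prior. The likelihood model (independent base errors at one fixed rate) and both function names are assumptions; the slides do not specify the model used by the mapper.

```python
import math

def placement_probabilities(mismatch_counts, read_length, error_rate=0.01):
    """Posterior probability of each candidate locus under a uniform prior."""
    likelihoods = [
        (error_rate ** m) * ((1 - error_rate) ** (read_length - m))
        for m in mismatch_counts
    ]
    total = sum(likelihoods)
    return [lik / total for lik in likelihoods]

def mapping_quality(posteriors, cap=60):
    """Phred-scaled probability that the best placement is wrong."""
    p_wrong = 1 - max(posteriors)
    if p_wrong <= 0:
        return cap
    return min(cap, round(-10 * math.log10(p_wrong)))
```

A read that matches two loci equally well gets probability 0.5 at each and a mapping quality near 3, which flags it as unreliable for variant calling. A fuller model would use per-base quality values instead of one fixed error rate.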

Paired-end reads help unique read placement

• fragment amplification: fragment length 100-600 bp

• fragment length limited by amplification efficiency

Korbel et al. Science 2007

• circularization: 500 bp - 10 kb (sweet spot ~3 kb)

• fragment length limited by library complexity

• PE reads are now the standard for genome resequencing

MOSAIK

INDEL alleles/errors – gapped alignments

Aligning multiple read types together

ABI/capillary

454 FLX

454 GS20

Illumina

• Alignment and co-assembly of multiple read types permits simultaneous analysis of data from multiple sources with different error characteristics

Aligner speed

3. Polymorphism / mutation detection

sequencing error

polymorphism

New challenges for SNP calling

• deep alignments of 100s / 1000s of individuals

• trio sequences

Rare alleles in 100s / 1,000s of samples

Allele discovery is a multi-step sampling process

Population Samples Reads

Capturing the allele in the sample

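[Figure: probability of capturing an allele in the sample versus population allele frequency (axis: Population AF, visible tick 0.05; n=1600)]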
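The first sampling step can be written down directly. Assuming chromosomes are drawn at random from the population, the probability that an allele of frequency f appears at least once among N sampled chromosomes is 1 - (1 - f)^N. Whether n=1600 on the slide counts individuals or chromosomes is not stated, so the sketch takes the number of chromosomes as its argument.

```python
def capture_probability(allele_freq, n_chromosomes):
    """Probability that an allele appears at least once among sampled chromosomes."""
    return 1 - (1 - allele_freq) ** n_chromosomes
```

For example, `capture_probability(0.001, 1600)` is about 0.8, so even a sample of this size misses roughly one in five alleles at 0.1% frequency.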

Allele calling in the reads

[Equation not recoverable (GigaBayes): Bayesian posterior probability of the sample genotypes G given the base calls B; annotated terms: base call, base quality, individual read coverage, sample size]

Allele calling in deep sequence data

aatgtagtaAgtacctac
aatgtagtaAgtacctac
aatgtagtaAgtacctac
aatgtagtaAgtacctac
aatgtagtaCgtacctac
aatgtagtaCgtacctac

aatgtagtaAgtacctac
aatgtagtaAgtacctac
aatgtagtaAgtacctac
aatgtagtaAgtacctac

     Q30   Q40   Q50   Q60
1    0.01  0.01  0.1   0.5
2    0.82  1.0   1.0   1.0
3    1.0   1.0   1.0   1.0

More samples or deeper coverage / sample?

Shallower read coverage from more individuals …

…or deeper coverage from fewer samples?

simulation analysis by Aaron Quinlan

Analysis indicates a balance
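The trade-off can be illustrated with a toy model. This is not the simulation referred to above: it assumes Poisson-distributed read depth, that a heterozygous carrier contributes half of its reads from the rare allele, and that an allele counts as discovered when any one sample shows at least `min_alt_reads` reads carrying it. All names and defaults are hypothetical.

```python
import math

def poisson_at_least(k, mean):
    """P(X >= k) for X ~ Poisson(mean)."""
    below = sum(math.exp(-mean) * mean ** i / math.factorial(i) for i in range(k))
    return 1 - below

def discovery_probability(allele_freq, n_samples, coverage, min_alt_reads=2):
    """Chance that at least one sample shows >= min_alt_reads of a rare allele."""
    p_het = 2 * allele_freq * (1 - allele_freq)
    p_hom = allele_freq ** 2
    per_sample = (p_het * poisson_at_least(min_alt_reads, coverage / 2)
                  + p_hom * poisson_at_least(min_alt_reads, coverage))
    return 1 - (1 - per_sample) ** n_samples

def best_design(allele_freq, total_coverage, sample_counts):
    """Split a fixed sequencing budget across samples; return the best count."""
    return max(sample_counts,
               key=lambda n: discovery_probability(allele_freq, n, total_coverage / n))
```

With a 1% allele and a total budget of 1,000x, this toy model prefers 100 samples at 10x over both 10 samples at 100x (too few chromosomes sampled) and 10,000 samples at 0.1x (too few reads per carrier), which is the kind of balance the slide title describes.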

SNP calling in trios

[Equation not recoverable: table of Pr(child genotype | mother genotype, father genotype) for genotypes 11, 12, and 22]

• the child inherits one chromosome from each parent

• there is a small probability for a mutation in the child
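The two statements above are enough to sketch a transmission model for a biallelic site with alleles 1 and 2. The code assumes each parent passes one of its two alleles with equal probability and that each transmitted allele independently switches to the other allele with probability `mutation_rate`. The default rate is an illustrative placeholder, and the result is not claimed to match the prior used in the talk term by term.

```python
from itertools import product

def transmit(parent_genotype, mutation_rate):
    """Probability of passing allele 1 or 2 from a genotype like (1, 2)."""
    probs = {1: 0.0, 2: 0.0}
    for allele in parent_genotype:
        other = 2 if allele == 1 else 1
        probs[allele] += 0.5 * (1 - mutation_rate)
        probs[other] += 0.5 * mutation_rate
    return probs

def child_genotype_probs(mother, father, mutation_rate=1e-8):
    """Pr(child genotype | mother, father) for a biallelic site."""
    from_mother = transmit(mother, mutation_rate)
    from_father = transmit(father, mutation_rate)
    child = {(1, 1): 0.0, (1, 2): 0.0, (2, 2): 0.0}
    for a, b in product((1, 2), repeat=2):
        child[tuple(sorted((a, b)))] += from_mother[a] * from_father[b]
    return child
```

Used as a prior, such a table lets evidence from the parents' reads support or contradict a weak heterozygous call in the child.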

SNP calling in trios

[Figure: aligned reads for a trio (mother, father, child) at a candidate A/C site; annotated probabilities P=0.79 and P=0.86]

aatgtagtaAgtacctac
aatgtagtaAgtacctac
aatgtagtaAgtacctac
aatgtagtaAgtacctac
aatgtagtaAgtacctac
aatgtagtaCgtacctac

aatgtagtaAgtacctac
aatgtagtaAgtacctac
aatgtagtaAgtacctac
aatgtagtaAgtacctac
aatgtagtaAgtacctac
aatgtagtaAgtacctac

aatgtagtaAgtacctac
aatgtagtaAgtacctac
aatgtagtaAgtacctac
aatgtagtaAgtacctac
aatgtagtaAgtacctac
aatgtagtaCgtacctac

Determining genotype directly from sequence

AACGTTAGCATA
AACGTTAGCATA
AACGTTCGCATA
AACGTTCGCATA

AACGTTCGCATA
AACGTTCGCATA
AACGTTCGCATA
AACGTTCGCATA

AACGTTAGCATA
AACGTTAGCATA

[Figure labels: individual 1, individual 2, individual 3]
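A minimal sketch of calling a genotype from allele counts at one site, assuming a binomial model with a single per-read error rate and flat genotype priors. These are simplifying assumptions for illustration; the callers named in this talk also use per-base quality values.

```python
from math import comb

def genotype_posteriors(ref_count, alt_count, error_rate=0.01,
                        priors=(1/3, 1/3, 1/3)):
    """Posterior over (hom-ref, het, hom-alt) from allele counts at one site."""
    n = ref_count + alt_count
    # Probability that a single read shows the alternate allele per genotype.
    alt_read_prob = (error_rate, 0.5, 1 - error_rate)
    likelihoods = [
        comb(n, alt_count) * p ** alt_count * (1 - p) ** ref_count
        for p in alt_read_prob
    ]
    weighted = [lik * prior for lik, prior in zip(likelihoods, priors)]
    total = sum(weighted)
    return [w / total for w in weighted]
```

Treating A as the reference allele (an arbitrary choice here), the three read sets above correspond to counts of (2, 2), (0, 4), and (2, 0). With only two reads, as in the last case, the heterozygous genotype keeps a noticeable share of the posterior, which is why shallow coverage limits direct genotyping.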

4. Structural variation discovery

SV events from PE read mapping patterns

[Figure: paired-end mapping patterns relative to the DNA reference, by SV event type]

Deletion: LM ~ LF+Ldel & depth: low

Tandem duplication: LM ~ LF-Ldup & depth: high

Inversion: LM ~ +Linv & ends flipped; LM ~ -Linv; depth: normal

Translocation: LM ~ LF+LT1; LM ~ LF+LT2; LM ~ LF-LT1-LT2; depth: normal

Insertion (Lins): un-paired read clusters & depth: normal

Chromosomal translocation: LM ~ LF+LT & depth: normal & cross-paired read clusters
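The patterns above suggest a simple per-pair classification, sketched below. It assumes LM denotes the mapped distance between the two ends and LF the expected library fragment length. The tolerance, the flags, and the labels are hypothetical, and a real SV caller would cluster many such pairs and combine them with depth of coverage before making a call.

```python
def classify_pair(mapped_distance, expected_length, tolerance,
                  same_chromosome=True, orientation_ok=True):
    """Label a read pair by how its mapping departs from the library fragment."""
    if not same_chromosome:
        return "cross-paired (chromosomal translocation candidate)"
    if not orientation_ok:
        return "ends flipped (inversion candidate)"
    if mapped_distance > expected_length + tolerance:
        return "stretched (deletion candidate)"
    if mapped_distance < expected_length - tolerance:
        return "compressed (tandem duplication candidate)"
    return "concordant"
```

The mapped distance may be negative for pairs spanning a tandem duplication, which the compressed branch also covers.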

Deletion: Aberrant positive mapping distance

Copy number estimation from depth of coverage

Alignability – read coverage normalization

A(p) = (uniquely mapped reads) / (all possible mapped reads)
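One plausible way to apply the alignability A(p) is to divide the observed depth at each position by it, so that positions where only a fraction of reads can map uniquely are scaled up accordingly. The slides do not give the normalization formula, so both the division and the `min_alignability` cutoff below are assumptions.

```python
def normalized_depth(raw_depth, alignability, min_alignability=0.1):
    """Divide per-position depth by alignability A(p); mask unreliable positions."""
    result = []
    for depth, a in zip(raw_depth, alignability):
        result.append(depth / a if a >= min_alignability else None)
    return result
```

After normalization, a region whose depth is still about half that of its neighbors is a candidate heterozygous deletion and not merely a region that is hard to map.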

Het deletion “revealed” by normalization

Tandem duplication: negative mapping distance

Spanner – a hybrid SV/CNV detection tool

Navigation bar

Fragment lengths in selected region

Depth of coverage in selected region

5. Data visualization

1. aid software development: integration of trace data viewing, fast navigation, zooming/panning

2. facilitate data validation (e.g. SNP validation): simultaneous viewing of multiple read types, quality value displays

3. promote hypothesis generation: integration of annotation tracks

Data visualization

Our software tools for next-gen data

http://bioinformatics.bc.edu/marthlab/Beta_Release

Data mining projects

SNP calling in short-read coverage

C. elegans reference genome (Bristol, N2 strain)

Pasadena, CB4858 (1 ½ machine runs)

Bristol, N2 strain (3 ½ machine runs)

• goal was to evaluate the Solexa/Illumina technology for the complete resequencing of large model-organism genomes

• 5 runs (~120 million Illumina reads) from the Washington University Genome Center, as part of a collaborative project led by Elaine Mardis

• primary aim was to detect polymorphisms between the Pasadena and the Bristol strain

Polymorphism discovery in C. elegans

• SNP calling error rate very low:

Validation rate = 97.8% (224/229)

Conversion rate = 92.6% (224/242)

Missed SNP rate = 3.75% (26/693)

• INDEL candidates validate and convert at similar rates to SNPs:

Validation rate = 89.3% (193/216)

Conversion rate = 87.3% (193/221)

• MOSAIK aligned / assembled the reads (< 4 hours on 1 CPU)

• PBSHORT found 44,642 SNP candidates (2 hours on our 160-CPU cluster)

• SNP density: 1 in 1,630 bp (of non-repeat genome sequence)

Mutational profiling: deep 454/Illumina/SOLiD data

• Pichia stipitis converts xylose to ethanol (bio-fuel production)

• one mutagenized strain had especially high conversion efficiency

• determine where the mutations were that caused this phenotype

• we resequenced the 15 Mb genome with 454, Illumina, and SOLiD reads

• 14 true point mutations in the entire genome

Pichia stipitis reference sequence

Image from JGI web site

Technology comparisons

Thanks

Credits

Elaine Mardis

Andy Clark

Aravinda Chakravarti

Doug Smith

Michael Egholm

Scott Kahn

Francisco de la Vega

Kristen Stoops

Ed Thayer

Recruitment
