Informatics tools for next-generation sequence analysis

Page 1: Informatics tools for next-generation sequence analysis

Gabor Marth
Boston College Biology

Next-Generation Sequencing MiniSymposium
CHOP, Philadelphia, PA
April 6, 2009

Page 2: Informatics tools for next-generation sequence analysis

New sequencing technologies…

Page 3: Informatics tools for next-generation sequence analysis

… offer vast throughput

[Figure: bases per machine run (log scale, 1 Mb to 100 Gb) versus read length (log scale, 10 bp to 1,000 bp)]

• ABI capillary sequencer
• Roche/454 pyrosequencer (100-400 Mb in 200-450 bp reads)
• Illumina/Solexa, AB/SOLiD sequencers (10-30 Gb in 25-100 bp reads)

Page 4: Informatics tools for next-generation sequence analysis

Roche / 454

• pyrosequencing technology
• variable read-length
• the only new technology with >100 bp reads

Page 5: Informatics tools for next-generation sequence analysis

Illumina / Solexa

• fixed-length short-read sequencer
• very high throughput
• read properties are very close to traditional capillary sequences

Page 6: Informatics tools for next-generation sequence analysis

AB / SOLiD

2-base encoding (color for each pair of adjacent bases):

                2nd Base
                A  C  G  T
1st Base   A    0  1  2  3
           C    1  0  3  2
           G    2  3  0  1
           T    3  2  1  0

• fixed-length short-reads
• very high throughput
• 2-base encoding system
• color-space informatics
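The 2-base encoding can be illustrated with a short Python sketch. It relies on the standard SOLiD color assignment, which happens to equal the XOR of the two bases when they are numbered A=0, C=1, G=2, T=3. The function names `encode_colors` and `decode_colors` are illustrative only and do not come from any vendor tool.

```python
# Minimal sketch of 2-base (color-space) encoding; names are hypothetical.
BASE_BITS = {"A": 0, "C": 1, "G": 2, "T": 3}
BITS_BASE = {bits: base for base, bits in BASE_BITS.items()}


def encode_colors(seq):
    """Return one color (0-3) per pair of adjacent bases."""
    return "".join(str(BASE_BITS[a] ^ BASE_BITS[b]) for a, b in zip(seq, seq[1:]))


def decode_colors(first_base, colors):
    """Rebuild the base sequence from a known first base and a color string."""
    bases = [first_base]
    for color in colors:
        bases.append(BITS_BASE[BASE_BITS[bases[-1]] ^ int(color)])
    return "".join(bases)


colors = encode_colors("ATGGC")
print(colors)
print(decode_colors("A", colors))
```

Because each decoded base depends on the previous one, a single miscalled color changes every base after it when decoding naively. This is one reason analysis is usually kept in color space for as long as possible.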

Page 7: Informatics tools for next-generation sequence analysis

Helicos / Heliscope

• short-read sequencer
• single molecule sequencing
• no amplification
• variable read-length

Page 8: Informatics tools for next-generation sequence analysis

Many applications

• organismal resequencing & de novo sequencing

Ruby et al. Cell, 2006

Jones-Rhoades et al. PLoS Genetics, 2007

• transcriptome sequencing for transcript discovery and expression profiling

Meissner et al. Nature 2008

• epigenetic analysis (e.g. DNA methylation)

Page 9: Informatics tools for next-generation sequence analysis

Data characteristics

Page 10: Informatics tools for next-generation sequence analysis

Read length

[Figure: read-length ranges by platform on a 0-400 bp axis (read length [bp]): ~200-450 bp (variable); 25-100 bp (fixed); 25-50 bp (fixed); 25-60 bp (variable)]

Page 11: Informatics tools for next-generation sequence analysis

Error characteristics (Illumina)

Insertions: 1.43%

Deletions: 3.23%

Substitutions: 95.34%

Page 12: Informatics tools for next-generation sequence analysis

Error characteristics (454)

Page 13: Informatics tools for next-generation sequence analysis

Coverage bias

~2X genome read coverage

~20X genome read coverage

Page 14: Informatics tools for next-generation sequence analysis

Genome re-sequencing

Page 15: Informatics tools for next-generation sequence analysis

Complete human genomes

Page 16: Informatics tools for next-generation sequence analysis

The re-sequencing informatics pipeline

(i) base calling
(ii) read mapping
(iii) SNP and short INDEL calling
(iv) SV calling
(v) data viewing, hypothesis generation

[Diagram labels: REF, IND, GigaBayes]

Page 17: Informatics tools for next-generation sequence analysis

Read mapping

Page 18: Informatics tools for next-generation sequence analysis

2. Read mapping…

… is like a jigsaw puzzle

… you get the pieces…

… and they give you the picture on the box

Big and unique pieces are easier to place than others…

Page 19: Informatics tools for next-generation sequence analysis

Challenge: non-uniqueness

• Reads from repeats cannot be uniquely mapped back to their true region of origin

• RepeatMasker does not capture all micro-repeats, i.e. repeats at the scale of the read length
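To make the read-length scale of non-uniqueness concrete, the sketch below counts how many start positions in a reference yield a substring of read length that occurs only once. It is a toy illustration under simplifying assumptions: exact matches only, forward strand only, and a reference small enough to hold in memory. The name `unique_fraction` is hypothetical.

```python
from collections import Counter


def unique_fraction(reference, read_length):
    """Fraction of start positions whose read-length substring occurs exactly once."""
    if read_length <= 0 or read_length > len(reference):
        raise ValueError("read_length must be between 1 and len(reference)")
    kmers = [reference[i:i + read_length]
             for i in range(len(reference) - read_length + 1)]
    counts = Counter(kmers)
    unique = sum(1 for kmer in kmers if counts[kmer] == 1)
    return unique / len(kmers)


toy_reference = "ACGTACGTTT"  # made-up sequence with a short internal repeat
for length in (4, 6):
    print(length, unique_fraction(toy_reference, length))
```

On this toy sequence the longer read length removes the ambiguity, which mirrors the point that big and unique pieces are easier to place.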

Page 20: Informatics tools for next-generation sequence analysis

Non-unique mapping

Page 21: Informatics tools for next-generation sequence analysis

Single-end (SE) short-read alignments are error-prone

0.35%

Page 22: Informatics tools for next-generation sequence analysis

Paired-end (PE) reads

fragment length: 100 – 600bp

Korbel et al. Science 2007

fragment length: 1 – 10kb

Page 23: Informatics tools for next-generation sequence analysis

PE alignment statistics (simulated data)

[Table values; row and column labels were lost in extraction: 0.00%, 7.6%, 0.09%, 0.35%, 0.03%]

Page 24: Informatics tools for next-generation sequence analysis

The MOSAIK read mapper/aligner

Michael Strömberg

Page 25: Informatics tools for next-generation sequence analysis

Gapped alignments

Page 26: Informatics tools for next-generation sequence analysis

Aligning multiple read types together

ABI/capillary

454 FLX

454 GS20

Illumina

Page 27: Informatics tools for next-generation sequence analysis

SNP / short-INDEL discovery

Page 28: Informatics tools for next-generation sequence analysis

Polymorphism detection

[Figure labels: sequencing error; polymorphism]

Page 29: Informatics tools for next-generation sequence analysis

Allele calling in multi-individual data

Reads observed at the site in each individual:

B1: -----a----------a----------c----------c-----
Bi: -----a----------a----------a----------a----------c-----
Bn: -----c----------c----------c----------c-----

“genotype likelihoods”

P(B1=aacc|G1=aa)   P(B1=aacc|G1=cc)   P(B1=aacc|G1=ac)
P(Bi=aaaac|Gi=aa)  P(Bi=aaaac|Gi=cc)  P(Bi=aaaac|Gi=ac)
P(Bn=cccc|Gn=aa)   P(Bn=cccc|Gn=cc)   P(Bn=cccc|Gn=ac)

Prior(G1,..,Gi,..,Gn)

“genotype probabilities”

P(G1=aa|B1=aacc; Bi=aaaac; Bn=cccc)
P(G1=cc|B1=aacc; Bi=aaaac; Bn=cccc)
P(G1=ac|B1=aacc; Bi=aaaac; Bn=cccc)

P(Gi=aa|B1=aacc; Bi=aaaac; Bn=cccc)
P(Gi=cc|B1=aacc; Bi=aaaac; Bn=cccc)
P(Gi=ac|B1=aacc; Bi=aaaac; Bn=cccc)

P(Gn=aa|B1=aacc; Bi=aaaac; Bn=cccc)
P(Gn=cc|B1=aacc; Bi=aaaac; Bn=cccc)
P(Gn=ac|B1=aacc; Bi=aaaac; Bn=cccc)

P(SNP)
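The flow from genotype likelihoods through a prior to genotype probabilities and P(SNP) can be sketched in Python. This is a simplified illustration, not the GigaBayes implementation. It assumes a single symmetric base error rate, a Hardy-Weinberg prior applied independently to each individual (the slide shows a joint prior over all individuals), and that `a` is the reference allele. All function names and parameter values are hypothetical.

```python
from math import prod

GENOTYPES = ("aa", "ac", "cc")


def genotype_likelihood(bases, genotype, error_rate=0.01):
    """P(B | G), assuming independent reads and a symmetric a<->c error rate."""
    frac_c = genotype.count("c") / 2  # fraction of c alleles: 0, 0.5 or 1
    p_c = frac_c * (1 - error_rate) + (1 - frac_c) * error_rate
    return prod(p_c if base == "c" else 1 - p_c for base in bases)


def hardy_weinberg_prior(freq_c):
    """Per-individual genotype prior (assumption: independent individuals)."""
    freq_a = 1 - freq_c
    return {"aa": freq_a ** 2, "ac": 2 * freq_a * freq_c, "cc": freq_c ** 2}


def genotype_posteriors(bases, prior, error_rate=0.01):
    """P(G | B) by Bayes' rule over the three diploid genotypes."""
    joint = {g: prior[g] * genotype_likelihood(bases, g, error_rate)
             for g in GENOTYPES}
    total = sum(joint.values())
    return {g: joint[g] / total for g in GENOTYPES}


def prob_snp(posteriors_per_sample):
    """P(at least one sample carries the non-reference allele c)."""
    return 1 - prod(post["aa"] for post in posteriors_per_sample)


prior = hardy_weinberg_prior(0.5)  # illustrative allele frequency only
samples = {"B1": "aacc", "Bi": "aaaac", "Bn": "cccc"}
posteriors = {name: genotype_posteriors(b, prior) for name, b in samples.items()}
for name, post in posteriors.items():
    print(name, {g: round(p, 4) for g, p in post.items()})
print("P(SNP) =", prob_snp(posteriors.values()))
```

A joint prior over all individuals, as on the slide, would couple the samples through the shared allele frequency. The independent prior here is chosen only to keep the example short.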

Page 30: Informatics tools for next-generation sequence analysis

SNP calling in deep sample sets

[Diagram stages: Population; Samples; Reads; Allele detection]

Page 31: Informatics tools for next-generation sequence analysis

Capturing the allele in the samples

[Figure: Prob(allele captured in sample) (y-axis, 0 to 1) versus population AF (x-axis, log scale from 0.0001 to 0.5), with one curve per sample size: n=100, n=200, n=400, n=800, n=1600]
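The shape of such curves follows from a simple calculation, sketched here in Python. It assumes that n counts diploid individuals, so 2n chromosomes are drawn independently from the population. The slide does not state this, and `prob_allele_captured` is a hypothetical name.

```python
def prob_allele_captured(allele_freq, n_samples, ploidy=2):
    """P(at least one sampled chromosome carries the allele)."""
    chromosomes = ploidy * n_samples
    return 1.0 - (1.0 - allele_freq) ** chromosomes


for n in (100, 200, 400, 800, 1600):
    row = [round(prob_allele_captured(f, n), 3) for f in (0.0001, 0.001, 0.01)]
    print(f"n={n}: AF 0.0001, 0.001, 0.01 -> {row}")
```

Under these assumptions, rare alleles need sample sizes on the order of the inverse of their frequency before they are likely to be present in the sample at all.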

Page 32: Informatics tools for next-generation sequence analysis

The ability to call rare alleles

reads   Q30    Q40    Q50    Q60
1       0.01   0.01   0.1    0.5
2       0.82   1.0    1.0    1.0
3       1.0    1.0    1.0    1.0

aatgtagtaAgtacctac
aatgtagtaAgtacctac
aatgtagtaAgtacctac
aatgtagtaAgtacctac
aatgtagtaCgtacctac
aatgtagtaCgtacctac

aatgtagtaAgtacctac
aatgtagtaAgtacctac
aatgtagtaAgtacctac
aatgtagtaAgtacctac

aatgtagtaAgtacctac
aatgtagtaAgtacctac
aatgtagtaAgtacctac
aatgtagtaAgtacctac

GigaBayes

Page 33: Informatics tools for next-generation sequence analysis

Allele calling in 400 samples

Page 34: Informatics tools for next-generation sequence analysis

Detecting de novo mutations

[Equation: Pr(G_C | G_M, G_F), the probability of the child's genotype G_C given the genotypes of the mother (G_M) and father (G_F), tabulated over genotypes 11, 12, and 22; the individual matrix entries did not survive extraction]

• the child inherits one chromosome from each parent
• there is a small probability for a de novo (germ-line or somatic) mutation in the child
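The two statements above translate into a child-genotype distribution given the parents' genotypes. The Python sketch below is one simple way to write it. It assumes a biallelic site with alleles 1 and 2, and a single per-transmission mutation probability `mu` that flips the transmitted allele. Neither the parameterization nor the function names are taken from the slide.

```python
from itertools import product


def transmitted(parent_genotype, mu):
    """P(parent passes allele '1' or '2'), allowing a mutation with probability mu."""
    probs = {"1": 0.0, "2": 0.0}
    for allele in parent_genotype:  # each of the two alleles is passed with prob 1/2
        other = "2" if allele == "1" else "1"
        probs[allele] += 0.5 * (1 - mu)
        probs[other] += 0.5 * mu
    return probs


def child_genotype_probs(mother, father, mu=1e-8):
    """Pr(G_C | G_M, G_F) for genotypes written as '11', '12' or '22'."""
    from_mother = transmitted(mother, mu)
    from_father = transmitted(father, mu)
    probs = {"11": 0.0, "12": 0.0, "22": 0.0}
    for a, b in product("12", repeat=2):
        probs["".join(sorted(a + b))] += from_mother[a] * from_father[b]
    return probs


print(child_genotype_probs("11", "11", mu=0.01))
print(child_genotype_probs("12", "12", mu=0.0))
```

With `mu` set to zero the function reduces to ordinary Mendelian transmission. A heterozygous child of two 11 parents then has probability zero, so any such observation must be explained by a mutation or an error.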

Page 35: Informatics tools for next-generation sequence analysis

Capture sequencing

Page 36: Informatics tools for next-generation sequence analysis

Targeted mammalian re-sequencing

• Deep sequencing of complete human genomes is still too expensive

• There is a need to sequence target regions, typically genes, to follow up on GWAS studies

• Targeted re-sequencing with DNA fragment capture offers a potentially cost-effective alternative

• Solid phase or liquid phase capture

• 454 or Illumina sequencing

• Informatics pipeline must account for the peculiarities of capture data

Page 37: Informatics tools for next-generation sequence analysis

On/off target capture

ref allele*: 54%

non-ref allele*: 45%

[Figure labels: Target region; SNP (outside target region)]

Page 38: Informatics tools for next-generation sequence analysis

Reference allele bias

ref allele*: 54%

non-ref allele*: 45%

(*) measured at 450 het HapMap 3 sites overlapping capture target regions in sample NA07346
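A measurement of this kind can be sketched as follows. Per-site read counts at known heterozygous sites are pooled, and the reference fraction is compared with the unbiased expectation of 50%. The counts below are invented for illustration, the z-score uses a normal approximation to the binomial, and `reference_allele_bias` is a hypothetical name.

```python
from math import sqrt


def reference_allele_bias(site_counts):
    """Pooled reference-allele fraction and z-score against an expected 0.5.

    site_counts: iterable of (ref_reads, nonref_reads) at heterozygous sites.
    """
    site_counts = list(site_counts)
    ref_reads = sum(ref for ref, _ in site_counts)
    total_reads = sum(ref + alt for ref, alt in site_counts)
    if total_reads == 0:
        raise ValueError("no reads at the supplied sites")
    fraction = ref_reads / total_reads
    z_score = (fraction - 0.5) / sqrt(0.25 / total_reads)
    return fraction, z_score


made_up_counts = [(27, 23), (27, 23)]  # hypothetical (ref, non-ref) read counts
print(reference_allele_bias(made_up_counts))
```

A downstream SNP caller could use the measured fraction in place of 0.5 as the expected reference share at heterozygous sites. That is one possible way to account for this peculiarity of capture data.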

Page 39: Informatics tools for next-generation sequence analysis

SNP example

Amit Indap

Page 40: Informatics tools for next-generation sequence analysis

Structural Variation discovery

Page 41: Informatics tools for next-generation sequence analysis

Structural variations

Page 42: Informatics tools for next-generation sequence analysis

SV/CNV detection – SNP chips

• Tiling arrays and SNP-chips made whole-genome CNV scans possible

• Probe density and placement limit resolution

• Balanced events cannot be detected

Page 43: Informatics tools for next-generation sequence analysis

SV/CNV detection – resolution

[Figure: relative numbers of events (y-axis) versus CNV event length [bp] (x-axis); labels: Expected CNVs, Karyotype, Micro-array, Sequencing]

Page 44: Informatics tools for next-generation sequence analysis

Read depth

Page 45: Informatics tools for next-generation sequence analysis

CNV events found using RD

[Figure: x-axis is Chromosome 2 Position [Mb]]
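A read-depth (RD) scan can be sketched as a comparison of per-window depth against a genome-wide baseline. The Python below is a deliberately naive illustration, not the method behind the figure. It assumes depth has already been counted in fixed-size windows, uses the median as the baseline, and applies arbitrary ratio thresholds. All names and numbers are hypothetical.

```python
from statistics import median


def depth_ratio_calls(window_depths, low=0.5, high=1.5):
    """Flag windows whose depth ratio to the median suggests a copy-number change."""
    baseline = median(window_depths)
    if baseline <= 0:
        raise ValueError("median depth must be positive")
    calls = []
    for index, depth in enumerate(window_depths):
        ratio = depth / baseline
        if ratio <= low:
            calls.append((index, "loss", ratio))
        elif ratio >= high:
            calls.append((index, "gain", ratio))
    return calls


toy_depths = [20, 21, 19, 9, 10, 20, 41, 20]  # made-up reads per window
for index, kind, ratio in depth_ratio_calls(toy_depths):
    print(index, kind, round(ratio, 2))
```

Real data would also need correction for coverage bias before thresholds like these are meaningful.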

Page 46: Informatics tools for next-generation sequence analysis

PE read mapping positions

[Diagram: paired-end read mapping patterns relative to the DNA reference, using the symbols LM, LF, Ldel, Ldup, Linv, Lins, LT, LT1, LT2]

Deletion: LM ~ LF+Ldel & depth: low

Tandem duplication: LM ~ LF-Ldup & depth: high

Inversion: LM ~ +Linv & ends flipped; LM ~ -Linv; depth: normal

Translocation: LM ~ LF+LT1; LM ~ LF+LT2; LM ~ LF-LT1-LT2 & depth: normal

Insertion: un-paired read clusters & depth: normal

Chromosomal translocation: LM ~ LF+LT & depth: normal & cross-paired read clusters
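These signatures suggest a first-pass classifier for individual read pairs, sketched below in Python. It assumes LM is the mapped distance between mates and LF the expected fragment length, summarized by a mean and standard deviation; the slide does not define either symbol explicitly. It also ignores read depth and clustering, which the patterns above require for a real call. It is not the Spanner algorithm, and all names are hypothetical.

```python
def classify_pair(mapped_length, frag_mean, frag_sd, same_chromosome=True,
                  ends_flipped=False, mate_mapped=True, n_sd=3.0):
    """Assign one read pair to a candidate SV signature (evidence, not a call)."""
    if not mate_mapped:
        return "insertion candidate (un-paired read)"
    if not same_chromosome:
        return "chromosomal translocation candidate (cross-paired)"
    if ends_flipped:
        return "inversion candidate"
    if mapped_length > frag_mean + n_sd * frag_sd:
        return "deletion candidate"  # LM ~ LF + Ldel: mates map too far apart
    if mapped_length < frag_mean - n_sd * frag_sd:
        return "tandem duplication candidate"  # LM ~ LF - Ldup
    return "concordant"


print(classify_pair(2500, frag_mean=300, frag_sd=30))
print(classify_pair(310, frag_mean=300, frag_sd=30))
print(classify_pair(300, frag_mean=300, frag_sd=30, ends_flipped=True))
```

A stretched pair is also consistent with the intra-chromosomal translocation pattern, so single pairs are only evidence. Clusters of pairs with the same signature, checked against depth, are what would support an event.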

Page 47: Informatics tools for next-generation sequence analysis

The SV/CNV “event display”

Chip Stewart

Page 48: Informatics tools for next-generation sequence analysis

Spanner – specificity

Page 49: Informatics tools for next-generation sequence analysis

Data standards

Page 50: Informatics tools for next-generation sequence analysis

Data types with standard formats

SRF/FASTQ

SAM/BAM

GLF
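As an illustration of what a standard format provides, the sketch below reads four-line FASTQ records and converts quality characters to Phred scores and error probabilities. It assumes the Sanger convention (ASCII offset 33) and unwrapped records. Instrument output of this period sometimes used other offsets, so the offset is an assumption. The function names are hypothetical.

```python
def phred_to_error_prob(quality):
    """Error probability implied by a Phred quality score: 10^(-Q/10)."""
    return 10 ** (-quality / 10)


def parse_fastq(text, offset=33):
    """Parse unwrapped four-line FASTQ records into (name, sequence, qualities)."""
    lines = [line.strip() for line in text.splitlines() if line.strip()]
    if len(lines) % 4 != 0:
        raise ValueError("incomplete FASTQ record")
    records = []
    for i in range(0, len(lines), 4):
        header, sequence, separator, quality = lines[i:i + 4]
        if not header.startswith("@") or not separator.startswith("+"):
            raise ValueError("malformed FASTQ record")
        if len(sequence) != len(quality):
            raise ValueError("sequence and quality lengths differ")
        records.append((header[1:], sequence, [ord(c) - offset for c in quality]))
    return records


example = "@r1\nACGT\n+\nII5?\n"  # made-up read
for name, sequence, qualities in parse_fastq(example):
    print(name, sequence, qualities, [phred_to_error_prob(q) for q in qualities])
```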

Page 51: Informatics tools for next-generation sequence analysis

Transcriptome sequencing

Page 52: Informatics tools for next-generation sequence analysis

Data highly reproducible

Michele Busby

Page 53: Informatics tools for next-generation sequence analysis

Comparative data

Michele Busby

Page 54: Informatics tools for next-generation sequence analysis

Biological questions

Michele Busby

Page 55: Informatics tools for next-generation sequence analysis

Our software tools for next-gen data

http://bioinformatics.bc.edu/marthlab/Software_Release

Page 56: Informatics tools for next-generation sequence analysis

Credits

Elaine Mardis

Andy Clark

Aravinda Chakravarti

Doug Smith

Michael Egholm

Scott Kahn

Francisco de la Vega

Patrice Milos

John Thompson

Page 57: Informatics tools for next-generation sequence analysis

Lab

Several postdoc positions are available!

Page 58: Informatics tools for next-generation sequence analysis

Mutational profiling

Page 59: Informatics tools for next-generation sequence analysis

Chemical mutagenesis

Page 60: Informatics tools for next-generation sequence analysis

Mutational profiling: deep 454/Illumina/SOLiD data

• Pichia stipitis converts xylose to ethanol (bio-fuel production)
• one mutagenized strain had high conversion efficiency
• determine which mutations caused this phenotype
• 15 Mb genome: 454, Illumina, and SOLiD reads
• 14 true point mutations in the entire genome

Pichia stipitis reference sequence

Image from JGI web site

10-15X genome coverage required
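The coverage requirement translates directly into a read count through coverage = reads × read length / genome size. The Python sketch below works this out for the 15 Mb genome. The read lengths of 50 bp and 250 bp are illustrative choices within the short-read and 454 ranges quoted earlier in the talk, not values from this slide.

```python
from math import ceil


def reads_needed(genome_size_bp, coverage, read_length_bp):
    """Number of reads required to reach a target average coverage."""
    return ceil(genome_size_bp * coverage / read_length_bp)


genome_size = 15_000_000  # 15 Mb genome from the slide
for coverage in (10, 15):
    for read_length in (50, 250):  # assumed read lengths, for illustration
        count = reads_needed(genome_size, coverage, read_length)
        print(f"{coverage}X with {read_length} bp reads: {count:,} reads")
```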

