Statistical Methods for Next Generation Sequencing

http://www.biostat.jhsph.edu/~khansen/enar2012.html

Zhijin Wu, Brown University

Kasper Hansen, Rafael A Irizarry, Johns Hopkins University

Outline

• Introduction to NGS

• SNP calling and genotyping

• RNA-sequencing

• Hands-on exercise

Introduction to Next Generation Sequencing

Rafael A. Irizarry

http://rafalab.org

Many slides courtesy of: Héctor Corrada Bravo and Ben Langmead

D. melanogaster, Science, 2000; H. sapiens, Nature, 2000 and Science, 2000; M. musculus, Nature, 2002

Back then: millions of clones (thousand bps) in 9 months for billions of dollars

Today: billions of short reads (35-100 bps) in a week for thousands of dollars

Remember this?


Claim: Assemble a genome in weeks for less than $100,000

Start with DNA (millions of copies)


Break it


Put in sequencer


Sequence first 35-400 bps: call them “reads”

GTTGAGGCTTGCGTTTTTGGTACGCTGGACTTTGTGTACTCGTCGCTGCGTTGAGGCTTGCGTTTTTGGTATGGTACGCTGGACTTTGTAGGATACCCTCGCTTTTTGCGTTTATGGTACGCTGGACTTTGTAGGATACCCTTGCGTTTATGGTACGCTGGACTTTGTAGGATACTTGCGTTTATGGTACGCTGGACTTTGTAGGATACCGCGTTTATGGTACGCTGGACTTTGTAGGATACCCTGAGGCTTGCGTTTATGGTACGCTGGACTTTGTAGGGCGTTGAGGCTTGCGTTTATGGTACGCTGGATTTTCGTTTATGGTACGCTGGACTTTGTAGGATACCCTCATGGTACGCTGGACTTTGTAGGATACCCTCGCTTT GTTTATGGTACGCTGGACTTTGTAGGATACCCTCGTCTCGTGCTCGTCGCTGCGTTGAGGCTTGCGTTTA TGCTCGTCGCTGCGTTGAGGCTTGCGTTTATGGTAGCTCGTCGCTGCGTTGAGGCTTGCGTTTATGGTAC TATGGTACGCTGGACTTTGTAGGATACCCTCGCTTTCGTGCTCGTCGCTGCGTTGAGGCTTGCGTTTTTGCGTCGCTGCGTTGAGGCTTGCGTTTATGGTACGCTGTTGAGGCTTGCGTTTATGGTACGCTGGGCTTTTT TTGCGTTTATGGTACGCTGGACTTTGTAGGATACC


Platforms

Illumina/Solexa

• Eight lanes

• ~160M short reads (~50-70 bp) per lane

Illumina Sequencing Technology (from "Technology Spotlight: Illumina Sequencing")

The Genome Analyzer generates several billion bases of high-quality sequence per run at less than 1% of the cost of capillary-based methods. An expansive scale of research unimaginable with other technology platforms is now possible.

Figure 1: Illumina Genome Analyzer flow cell. Several samples can be loaded onto the eight-lane flow cell for simultaneous analysis on the Illumina Genome Analyzer.

Source: Metzker ML. Sequencing technologies - the next generation. Nat Rev Genet. 2010

Source: Whiteford et al. Swift: primary data analysis for the Illumina Solexa sequencing platform. Bioinformatics. 2009

name, sequence, quality scores

x 100s of millions
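Each read arrives as a name, a sequence, and a string of quality scores. A minimal sketch of reading such records, assuming the common four-line FASTQ layout with Phred+33 quality encoding (the slides do not state the file format or encoding, and the record below is made up for illustration):

```python
import io

def parse_fastq(handle):
    """Yield (name, sequence, qualities) from a four-line-per-record FASTQ stream."""
    while True:
        header = handle.readline().rstrip("\n")
        if not header:
            return  # end of stream
        seq = handle.readline().rstrip("\n")
        handle.readline()  # '+' separator line, ignored
        qual = handle.readline().rstrip("\n")
        # Assumption: Phred+33 encoding, so Q = ord(char) - 33
        quals = [ord(ch) - 33 for ch in qual]
        yield header[1:], seq, quals

def error_probability(q):
    """Phred scale: Q = -10 * log10(p), hence p = 10 ** (-Q / 10)."""
    return 10 ** (-q / 10)

# Hypothetical single-record example
example = io.StringIO("@read_1\nGTTGAGGCTT\n+\nIIIIIHHG5#\n")
records = list(parse_fastq(example))
name, seq, quals = records[0]
```

Under this encoding a quality of 20 corresponds to a reported error probability of 1 in 100, which is the scale behind thresholds such as snpq>=20.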

Not just Assembly

• Resequencing

• SNP discovery and genotyping

• Variant discovery and quantification

• TF binding sites: ChIP-Seq

• Gene expression: RNA-Seq

• Measuring methylation


1000 Genomes Project: Genotyping

Human Epigenome Project: Methylation

What to do with all these sequences?

GTTGAGGCTTGCGTTTTTGGTACGCTGGACTTTGTGTACTCGTCGCTGCGTTGAGGCTTGCGTTTTTGGTATGGTACGCTGGACTTTGTAGGATACCCTCGCTTTTTGCGTTTATGGTACGCTGGACTTTGTAGGATACCCTTGCGTTTATGGTACGCTGGACTTTGTAGGATACTTGCGTTTATGGTACGCTGGACTTTGTAGGATACCGCGTTTATGGTACGCTGGACTTTGTAGGATACCCTGAGGCTTGCGTTTATGGTACGCTGGACTTTGTAGGGCGTTGAGGCTTGCGTTTATGGTACGCTGGATTTTCGTTTATGGTACGCTGGACTTTGTAGGATACCCTCATGGTACGCTGGACTTTGTAGGATACCCTCGCTTT GTTTATGGTACGCTGGACTTTGTAGGATACCCTCGTCTCGTGCTCGTCGCTGCGTTGAGGCTTGCGTTTA TGCTCGTCGCTGCGTTGAGGCTTGCGTTTATGGTAGCTCGTCGCTGCGTTGAGGCTTGCGTTTATGGTAC TATGGTACGCTGGACTTTGTAGGATACCCTCGCTTTCGTGCTCGTCGCTGCGTTGAGGCTTGCGTTTTTGCGTCGCTGCGTTGAGGCTTGCGTTTATGGTACGCTGTTGAGGCTTGCGTTTATGGTACGCTGGGCTTTTT TTGCGTTTATGGTACGCTGGACTTTGTAGGATACC


Most apps: Start by matching to reference

GTTGAGGCTTGCGTTTTTGGTACGCTGGACTTTGT GTACTCGTCGCTGCGTTGAGGCTTGCGTTTTTGGT ATGGTACGCTGGACTTTGTAGGATACCCTCGCTTT TTGCGTTTATGGTACGCTGGACTTTGTAGGATACC CTTGCGTTTATGGTACGCTGGACTTTGTAGGATAC TTGCGTTTATGGTACGCTGGACTTTGTAGGATACC GCGTTTATGGTACGCTGGACTTTGTAGGATACCCT GAGGCTTGCGTTTATGGTACGCTGGACTTTGTAGG GCGTTGAGGCTTGCGTTTATGGTACGCTGGATTTT CGTTTATGGTACGCTGGACTTTGTAGGATACCCTC ATGGTACGCTGGACTTTGTAGGATACCCTCGCTTT GTTTATGGTACGCTGGACTTTGTAGGATACCCTCG TCTCGTGCTCGTCGCTGCGTTGAGGCTTGCGTTTA TGCTCGTCGCTGCGTTGAGGCTTGCGTTTATGGTA GCTCGTCGCTGCGTTGAGGCTTGCGTTTATGGTAC TATGGTACGCTGGACTTTGTAGGATACCCTCGCTT TCGTGCTCGTCGCTGCGTTGAGGCTTGCGTTTTTG CGTCGCTGCGTTGAGGCTTGCGTTTATGGTACGCT GTTGAGGCTTGCGTTTATGGTACGCTGGGCTTTTT TTGCGTTTATGGTACGCTGGACTTTGTAGGATACCCTCTCGTGCTCGTCGCTGCGTTGAGGCTTGCGTTTATGGTACGCTGGACTTTGTAGGATACCCTCGCTTTC


Variant detection

GATCACAGGTCTATCACCCTATTAACCACTCACGGGAGCTCTCCATGCATTTGGTATTTTCGTCTGGGGGGTATGCACGCGATAGCATTGCGAGACGCTGGAGCCGGAGCACCCTATGTCGCAGTATCTGTCTTTGATTCCTGCCTCATCCTATTATTTATCGCACCTACGTTCAATATT

Align:

GTCGCAGTANCTGTCT
||||||||| ||||||
GTCGCAGTATCTGTCT

GGATCTGCGATATACC
|||||| |||||||||
GGATCT-CGATATACC

AATCTGATCTTATTTT
||||||||||||||||
AATCTGATCTTATTTT

ATATATATATATATAT
||||||||||||||||
ATATATATATATATAT

TCTCTCCCANNAGAGC
|||||||||  |||||
TCTCTCCCAGGAGAGC

Aggregate (reads stacked on the reference):

GTCGCAGTATCTGTCT GTCGCAGTATCTGTNN TGTCGCAGTATCTGTC TATGTCGCAGTATCTG TATATCGCAGTATCTT TATATCGCAGTATCTG NATATCGCAGTATNTG CCCTATATCGCAGTAT ACACCCTATGTCGCA ACACCCTATCTCGCA ACACCCTATGTCGCA GA-CACCCTATGTCGC CCGGA-CACCCTATAT CCGGA-CACCCTATAT GCCGGA-CACCCTATG

Statistics:

Call: HET A, G

p-value: 0.0023

“Coverage”

“Pileup” or “Coverage plot”

“Depth of coverage” = 14
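The aggregate step can be sketched as a pileup: for each reference position, tally the bases that aligned reads place there, and call the column total the depth of coverage. This is a minimal illustration, assuming ungapped alignments given as hypothetical (start, sequence) pairs; the toy reference and reads are made up and are not the ones in the figure.

```python
from collections import Counter

def pileup(aligned_reads, ref_length):
    """Return one Counter of observed bases per reference position.

    aligned_reads: list of (start, sequence) pairs, start is 0-based.
    """
    columns = [Counter() for _ in range(ref_length)]
    for start, seq in aligned_reads:
        for offset, base in enumerate(seq):
            pos = start + offset
            # Skip no-calls (N) and gap characters (-)
            if 0 <= pos < ref_length and base not in "N-":
                columns[pos][base] += 1
    return columns

def depth(columns):
    """Depth of coverage: number of called bases stacked on each position."""
    return [sum(col.values()) for col in columns]

# Hypothetical toy data: position 9 shows both G (reference) and A
reference = "ACACCCTATGTCGCAGTATCTGTCT"
reads = [
    (0, "ACACCCTATGTCGCA"),
    (0, "ACACCCTATATCGCA"),
    (4, "CCTATATCGCAGTAT"),
    (9, "GTCGCAGTATCTGTCT"),
]
columns = pileup(reads, len(reference))
coverage = depth(columns)
```

A column such as `columns[9]`, holding two G and two A, is the raw material for a heterozygous call; the statistics step then asks whether that split is explained better by a true variant or by sequencing error.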

RNA-seq differential expression

GATCACAGGTCTATCACCCTATTAACCACTCACGGGAGCTCTCCATGCATTTGGTATTTTCGTCTGGGGGGTATGCACGCGATAGCATTGCGAGACGCTGGAGCCGGAGCACCCTATGTCGCAGTATCTGTCTTTGATTCCTGCCTCATCCTATTATTTATCGCACCTACGTTCAATATT

Gene 1, reads from Sample A and Sample B:

GTCGCAGTATCTGTCT GTCGCAGTATCTGTCT GTCGCAGTATCTGTCT GTCGCAGTATCTGTCT GTCGCAGTATCTGTCT TGTCGCAGTATCTGTC TATGTCGCAGTATCTG TATATCGCAGTATCTG TATATCGCAGTATCTG TATATCGCAGTATCTG CCCTATATCGCAGTAT AGCACCCTATGTCGCA AGCACCCTATATCGCA AGCACCCTATGTCGCA GAGCACCCTATGTCGC CCGGAGCACCCTATAT CCGGAGCACCCTATAT GCCGGAGCACCCTATG

TGTCGCAGTATCTGTC AGCACCCTATGTCGCA GCCGGAGCACCCTATG

GATTCCTGCCTCATCC

Align (each sample):

GTCGCAGTANCTGTCT
||||||||| ||||||
GTCGCAGTATCTGTCT

GGATCTGCGATATACC
|||||| |||||||||
GGATCT-CGATATACC

AATCTGATCTTATTTT
||||||||||||||||
AATCTGATCTTATTTT

ATATATATATATATAT
||||||||||||||||
ATATATATATATATAT

TCTCTCCCANNAGAGC
|||||||||  |||||
TCTCTCCCAGGAGAGC

Aggregate, then Statistics:

Gene 1 differentially expressed?: YES

p-value: 0.0012
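One simple way to turn two read counts for a gene into a p-value is a conditional exact test: given the total count for the gene, the count in the first sample is binomial with success probability equal to that sample's share of all sequenced reads. This is a stand-in sketch, not the method used in the slides (which do not name the test), and the counts and library sizes below are hypothetical.

```python
from math import comb

def binomial_pmf(k, n, p):
    """Probability of exactly k successes in n trials with success probability p."""
    return comb(n, k) * p**k * (1 - p) ** (n - k)

def conditional_test(count_a, count_b, total_a, total_b):
    """Two-sided exact p-value for equal expression in samples A and B.

    count_a, count_b: reads mapped to the gene in each sample.
    total_a, total_b: total mapped reads (library size) in each sample.
    """
    n = count_a + count_b
    p = total_a / (total_a + total_b)
    observed = binomial_pmf(count_a, n, p)
    # Sum over all outcomes that are no more likely than the observed one
    return sum(
        pr
        for pr in (binomial_pmf(k, n, p) for k in range(n + 1))
        if pr <= observed * (1 + 1e-9)
    )

# Hypothetical counts: 18 reads versus 3 reads, equal library sizes
p_value = conditional_test(18, 3, 1000, 1000)
```

With real data the test would be run per gene, and replicate-to-replicate variability usually calls for models beyond this single-pair comparison.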

ChIP-seq

GATCACAGGTCTATCACCCTATTAACCACTCACGGGAGCTCTCCATGCATTTGGTATTTTCGTCTGGGGGGTATGCACGCGATAGCATTGCGAGACGCTGGAGCCGGAGCACCCTATGTCGCAGTATCTGTCTTTGATTCCTGCCTCATCCTATTATTTATCGCACCTACGTTCAATATT

Align:

GTCGCAGTANCTGTCT
||||||||| ||||||
GTCGCAGTATCTGTCT

GGATCTGCGATATACC
|||||| |||||||||
GGATCT-CGATATACC

AATCTGATCTTATTTT
||||||||||||||||
AATCTGATCTTATTTT

ATATATATATATATAT
||||||||||||||||
ATATATATATATATAT

TCTCTCCCANNAGAGC
|||||||||  |||||
TCTCTCCCAGGAGAGC

Aggregate (reads stacked on the reference):

GTCGCAGTATCTGTCT GTCGCAGTATCTGTCT TGTCGCAGTATCTGTC TATGTCGCAGTATCTG TATATCGCAGTATCTG TATATCGCAGTATCTG TATATCGCAGTATCTG CCCTATATCGCAGTAT CCCTATATCGCAGTAT CCCTATATCGCAGTAT CCCTATATCGCAGTAT CCCTATATCGCAGTAT CCCTATATCGCAGTAT CCCTATATCGCAGTAT AGCACCCTATGTCGCA AGCACCCTATATCGCA AGCACCCTATGTCGCA GAGCACCCTATGTCGC CCGGAGCACCCTATAT CCGGAGCACCCTATAT GCCGGAGCACCCTATG

TATGCACGCGATAGCA GATAGCATTGCGAGAC

Statistics:

Binding occurs here

p-value: 0.0023

Matching Revisited

GTTGAGGCTTGCGTTTTTGGTACGCTGGACTTTGT GTACTCGTCGCTGCGTTGAGGCTTGCGTTTTTGGT ATGGTACGCTGGACTTTGTAGGATACCCTCGCTTT TTGCGTTTATGGTACGCTGGACTTTGTAGGATACC CTTGCGTTTATGGTACGCTGGACTTTGTAGGATAC TTGCGTTTATGGTACGCTGGACTTTGTAGGATACC GCGTTTATGGTACGCTGGACTTTGTAGGATACCCT GAGGCTTGCGTTTATGGTACGCTGGACTTTGTAGG GCGTTGAGGCTTGCGTTTATGGTACGCTGGATTTT CGTTTATGGTACGCTGGACTTTGTAGGATACCCTC ATGGTACGCTGGACTTTGTAGGATACCCTCGCTTT GTTTATGGTACGCTGGACTTTGTAGGATACCCTCG TCTCGTGCTCGTCGCTGCGTTGAGGCTTGCGTTTA TGCTCGTCGCTGCGTTGAGGCTTGCGTTTATGGTA GCTCGTCGCTGCGTTGAGGCTTGCGTTTATGGTAC TATGGTACGCTGGACTTTGTAGGATACCCTCGCTT TCGTGCTCGTCGCTGCGTTGAGGCTTGCGTTTTTG CGTCGCTGCGTTGAGGCTTGCGTTTATGGTACGCT GTTGAGGCTTGCGTTTATGGTACGCTGGGCTTTTT TTGCGTTTATGGTACGCTGGACTTTGTAGGATACCCTCTCGTGCTCGTCGCTGCGTTGAGGCTTGCGTTTATGGTACGCTGGACTTTGTAGGATACCCTCGCTTTC


Matching 10,000,000 32-bp reads

• BLAST takes more than 6 months

• BLAT takes 2 months

• MAQ takes a day and a half

• Bowtie takes 17 minutes
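Much of the speed gap comes from indexing the reference once instead of scanning it for every read. The sketch below illustrates that idea with a plain hash table of k-mers and a seed-and-verify lookup. It is an assumption-laden stand-in, not Bowtie's algorithm (Bowtie uses a Burrows-Wheeler/FM index), and the function names and mismatch threshold are hypothetical.

```python
from collections import defaultdict

def build_index(reference, k):
    """Map every k-mer in the reference to the positions where it starts."""
    index = defaultdict(list)
    for i in range(len(reference) - k + 1):
        index[reference[i:i + k]].append(i)
    return index

def match_read(read, reference, index, k, max_mismatches=2):
    """Seed with the read's first k-mer, then verify the whole read at each hit."""
    hits = []
    for pos in index.get(read[:k], []):
        window = reference[pos:pos + len(read)]
        if len(window) < len(read):
            continue  # read would run off the end of the reference
        mismatches = sum(a != b for a, b in zip(read, window))
        if mismatches <= max_mismatches:
            hits.append((pos, mismatches))
    return hits

# First 62 bases of the mitochondrial reference shown in these slides
reference = "GATCACAGGTCTATCACCCTATTAACCACTCACGGGAGCTCTCCATGCATTTGGTATTTTCG"
index = build_index(reference, 8)

# Hypothetical read: reference[20:36] with one substitution near the end
hits = match_read("ATTAACCACTCACGTG", reference, index, 8)
```

The index is built once and reused for every read, so the per-read cost is a dictionary lookup plus a short verification rather than a pass over the genome. A read whose seed itself contains an error is missed by this sketch; real aligners handle that case.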


Matching

GTTGAGGCTTGCGTTTTTGGTACGCTGGACTTTGT GTACTCGTCGCTGCGTTGAGGCTTGCGTTTTTGGT ATGGTACGCTGGACTTTGTAGGATACCCTCGCTTT TTGCGTTTATGGTACGCTGGACTTTGTAGGATACC CTTGCGTTTATGGTACGCTGGACTTTGTAGGATAC TTGCGTTTATGGTACGCTGGACTTTGTAGGATACC GCGTTTATGGTACGCTGGACTTTGTAGGATACCCT GAGGCTTGCGTTTATGGTACGCTGGACTTTGTAGG GCGTTGAGGCTTGCGTTTATGGTACGCTGGATTTT CGTTTATGGTACGCTGGACTTTGTAGGATACCCTC ATGGTACGCTGGACTTTGTAGGATACCCTCGCTTT GTTTATGGTACGCTGGACTTTGTAGGATACCCTCG TCTCGTGCTCGTCGCTGCGTTGAGGCTTGCGTTTA TGCTCGTCGCTGCGTTGAGGCTTGCGTTTATGGTA GCTCGTCGCTGCGTTGAGGCTTGCGTTTATGGTAC TATGGTACGCTGGACTTTGTAGGATACCCTCGCTT TCGTGCTCGTCGCTGCGTTGAGGCTTGCGTTTTTG CGTCGCTGCGTTGAGGCTTGCGTTTATGGTACGCT GTTGAGGCTTGCGTTTATGGTACGCTGGGCTTTTT TTGCGTTTATGGTACGCTGGACTTTGTAGGATACCCTCTCGTGCTCGTCGCTGCGTTGAGGCTTGCGTTTATGGTACGCTGGACTTTGTAGGATACCCTCGCTTTC


Mapping

Take a read:

CTCAAACTCCTGACCTTTGGTGATCCACCCGCCTNGGCCTTC

And a reference sequence:>MT dna:chromosome chromosome:GRCh37:MT:1:16569:1GATCACAGGTCTATCACCCTATTAACCACTCACGGGAGCTCTCCATGCATTTGGTATTTTCGTCTGGGGGGTATGCACGCGATAGCATTGCGAGACGCTGGAGCCGGAGCACCCTATGTCGCAGTATCTGTCTTTGATTCCTGCCTCATCCTATTATTTATCGCACCTACGTTCAATATTACAGGCGAACATACTTACTAAAGTGTGTTAATTAATTAATGCTTGTAGGACATAATAATAACAATTGAATGTCTGCACAGCCACTTTCCACACAGACATCATAACAAAAAATTTCCACCAAACCCCCCCTCCCCCGCTTCTGGCCACAGCACTTAAACACATCTCTGCCAAACCCCAAAAACAAAGAACCCTAACACCAGCCTAACCAGATTTCAAATTTTATCTTTTGGCGGTATGCACTTTTAACAGTCACCCCCCAACTAACACATTATTTTCCCCTCCCACTCCCATACTACTAATCTCATCAATACAACCCCCGCCCATCCTACCCAGCACACACACACCGCTGCTAACCCCATACCCCGAACCAACCAAACCCCAAAGACACCCCCCACAGTTTATGTAGCTTACCTCCTCAAAGCAATACACTGACCCGCTCAAACTCCTGGATTTTGGATCCACCCAGCGCCTTGGCCTAAACTAGCCTTTCTATTAGCTCTTAGTAAGATTACACATGCAAGCATCCCCGTTCCAGTGAGTTCACCCTCTAAATCACCACGATCAAAAGGAACAAGCATCAAGCACGCAGCAATGCAGCTCAAAACGCTTAGCCTAGCCACACCCCCACGGGAAACAGCAGTGATTAACCTTTAGCAATAAACGAAAGTTTAACTAAGCTATACTAACCCCAGGGTTGGTCAATTTCGTGCCAGCCACCGCGGTCACACGATTAACCCAAGTCAATAGAAGCCGGCGTAAAGAGTGTTTTAGATCACCCCCTCCCCAATAAAGCTAAAACTCACCTGAGTTGTAAAAAACTCCAGTTGACACAAAATAGACTACGAAAGTGGCTTTAACATATCTGAACACACAATAGCTAAGACCCAAACTGGGATTAGATACCCCACTATGCTTAGCCCTAAACCTCAACAGTTAAATCAACAAAACTGCTCGCCAGAACACTACGAGCCACAGCTTAAAACTCAAAGGACCTGGCGGTGCTTCATATCCCTCTAGAGGAGCCTGTTCTGTAATCGATAAACCCCGATCAACCTCACCACCTCTTGCTCAGCCTATATACCGCCATCTTCAGCAAACCCTGATGAAGGCTACAAAGTAAGCGCAAGTACCCACGTAAAGACGTTAGGTCAAGGTGTAGCCCATGAGGTGGCAAGAAATGGGCTACATTTTCTACCCCAGAAAACTACGATAGCCCTTATGAAACTTAAGGGTCGAAGGTGGATTTAGCAGTAAACTAAGAGTAGAGTGCTTAGTTGAACAGGGCCCTGAAGCGCGTACACACCGCCCGTCACCCTCCTCAAGTATACTTCAAAGGACATTTAACTAAAACCCCTACGCATTTATATAGAGGAGACAAGTCGTAACCTCAAACTCCTGCCTTTGGTGATCCACCCGCCTTGGCCTACCTGCATAATGAAGAAGCACCCAACTTACACTTAGGAGATTTCAACTTAACTTGACCGCTCTGAGCTAAACCTAGCCCCAAACCCACTCCACCTTACTACCAGACAACCTTAGCCAAACCATTTACCCAAATAAAGTATAGGCGATAGAAATTGAAACCTGGCGCAATAGATATAGTACCGCAAGGGAAAGATGAAAAATTATAACCAAGCATAATATAGCAAGGACTAACCCCTATACCTTCTGCATAATGAATTAACTAGAAATAACTTTGCAAGGAGAGCCAAAGCTAAGACCCCCGAAACCAGACGAGCTACCTAA
GAACAGCTAAAAGAGCACACCCGTCTATGTAGCAAAATAGTGGGAAGATTTATAGGTAGAGGCGACAAACCTACCGAGCCTGGTGATAGCTGGTTGTCCAAGATAGAATCTTAGTTCAACTTTAAATTTGCCCACAGAACCCTCTAAATCCCCTTGTAAATTTAACTGTTAGTCCAAAGAGGAACAGCTCTTTGGACACTAGGAAAAAACCTTGTAGAGAGAGTAAAAAATTTAACACCCATAGTAGGCCTAAAAGCAGCCACCAATTAAGAAAGCGTTCAAGCTCAACACCCACTACCTAAAAAATCCCAAACATATAACTGAACTCCTCACACCCAATTGGACCAATCTATCACCCTATAGAAGAACTAATGTTAGTATAAGTAACATGAAAACATTCTCCTCCGCATAAGCCTGCGTCAGATTAAAACACTGAACTGACAATTAACAGCCCAATATCTACAATCAACCAACAAGTCATTATTACCCTCACTGTCAACCCAACACAGGCATGCTCATAAGGAAAGGTTAAAAAAAGTAAAAGGAACTCGGCAAATCTTACCCCGCCTGTTTACCAAAAACATCACCTCTAGCATCACCAGTATTAGAGGCACCGCCTGCCCAGTGACACATGTTTAACGGCCGCGGTACCCTAACCGTGCAAAGGTAGCATAATCACTTGTTCCTTAAATAGGGACCTGTATGAATGGCTCCACGAGGGTTCAGCTGTCTCTTACTTTTAACCAGTGAAATTGACCTGCCCGTGAAGAGGCGGGCATAACACAGCAAGACGAGAAGACCCTATGGAGCTTTAATTTATTAATGCAAACAGTACCTAACAAACCCACAGGTCCTAAACTACCAAACCTGCATTAAAAATTTCGGTTGGGGCGACCTCGGAGCAGAACCCAACCTCCGAGCAGTACATGCTAAGACTTCACCAGTCAAAGCGAACTACTATACTCAATTGATCCAATAACTTGACCAACGGAACAAGTTACCCTAGGGATAACAGCGCAATCCTATTCTAGAGTCCATATCAACAATAGGGTTTACGACCTCGATGTTGGATCAGGACATCCCGATGGTGCAGCCGCTATTAAAGGTTCGTTTGTTCAACGATTAAAGTCCTACGTGATCTGAGTTCAGACCGGAGTAATCCAGGTCGGTTTCTATCTACNTTCAAATTCCTCCCTGTACGAAAGGACAAGAGAAATAAGGCCTACTTCACAAAGCGCCTTCCCCCGTAAATGATATCATCTCAACTTAGTATTATACCCACACCCACCCAAGAACAGGGTTTGTTAAGATGGC

How do we determine the read’s point of origin with respect to the reference?

Answer: sequence similarity

Hypothesis 1 (read on top, reference below):

CTCAAAGACCTGACCTTTGGTGATCCACCC-----GCCTNGGCCTTC
||||||  ||||   ||||  |||||||||     |||| |||||
CTCAAACTCCTGGATTTTG--GATCCACCCAGCTGGCCTTGGCCTAA

Hypothesis 2 (read on top, reference below):

CTCAAACTCCTGACCTTTGGTGATCCACCCGCCTNGGCCTTC
|||||||||||| ||||||||||||||||||||| ||||| |
CTCAAACTCCTG-CCTTTGGTGATCCACCCGCCTTGGCCTAC

Which hypothesis is better?

Say hypothesis 2 is correct. Why are there still mismatches and gaps?
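"Sequence similarity" can be made concrete by scoring each hypothesis column by column. A minimal sketch, assuming an illustrative scoring scheme (match +1, mismatch -1, gap -2) that is not taken from the slides; the aligned rows are the two hypotheses shown on this slide.

```python
def score_alignment(read_row, ref_row, match=1, mismatch=-1, gap=-2):
    """Score two equal-length alignment rows column by column ('-' marks a gap)."""
    if len(read_row) != len(ref_row):
        raise ValueError("alignment rows must have equal length")
    counts = {"match": 0, "mismatch": 0, "gap": 0}
    for a, b in zip(read_row, ref_row):
        if a == "-" or b == "-":
            counts["gap"] += 1
        elif a == b:
            counts["match"] += 1
        else:
            counts["mismatch"] += 1  # an 'N' in the read counts as a mismatch here
    score = (match * counts["match"]
             + mismatch * counts["mismatch"]
             + gap * counts["gap"])
    return score, counts

h1_read = "CTCAAAGACCTGACCTTTGGTGATCCACCC-----GCCTNGGCCTTC"
h1_ref = "CTCAAACTCCTGGATTTTG--GATCCACCCAGCTGGCCTTGGCCTAA"
h2_read = "CTCAAACTCCTGACCTTTGGTGATCCACCCGCCTNGGCCTTC"
h2_ref = "CTCAAACTCCTG-CCTTTGGTGATCCACCCGCCTTGGCCTAC"

score1, counts1 = score_alignment(h1_read, h1_ref)
score2, counts2 = score_alignment(h2_read, h2_ref)
```

Hypothesis 2 scores higher under this scheme (39 matches, 2 mismatches, 1 gap column, against 32, 8, and 7 for hypothesis 1). Its remaining differences are the ones a true placement can still show: a no-call (N), sequencing errors, or real variation between the sample and the reference.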

More on variants and base-calling

SNPs

GTTGAGGCTTGCGTTTTTGGTACGCTGGACTTTGT GTACTCGTCGCTGCGTTGAGGCTTGCGTTTTTGGT ATGGTACGCTGGACTTTGTAGGATACCCTCGCTTT TTGCGTTTATGGTACGCTGGACTTTGTAGGATACC CTTGCGTTTATGGTACGCTGGACTTTGTAGGATAC TTGCGTTTATGGTACGCTGGACTTTGTAGGATACC GCGTTTATGGTACGCTGGACTTTGTAGGATACCCT GAGGCTTGCGTTTATGGTACGCTGGACTTTGTAGG GCGTTGAGGCTTGCGTTTATGGTACGCTGGATTTT CGTTTATGGTACGCTGGACTTTGTAGGATACCCTC ATGGTACGCTGGACTTTGTAGGATACCCTCGCTTT GTTTATGGTACGCTGGACTTTGTAGGATACCCTCG TCTCGTGCTCGTCGCTGCGTTGAGGCTTGCGTTTA TGCTCGTCGCTGCGTTGAGGCTTGCGTTTATGGTA GCTCGTCGCTGCGTTGAGGCTTGCGTTTATGGTAC TATGGTACGCTGGACTTTGTAGGATACCCTCGCTT TCGTGCTCGTCGCTGCGTTGAGGCTTGCGTTTTTG CGTCGCTGCGTTGAGGCTTGCGTTTATGGTACGCT GTTGAGGCTTGCGTTTATGGTACGCTGGGCTTTTT TTGCGTTTATGGTACGCTGGACTTTGTAGGATACCCTCTCGTGCTCGTCGCTGCGTTGAGGCTTGCGTTTATGGTACGCTGGACTTTGTAGGATACCCTCGCTTTC

SNPs

TCTCGTGCTCGTCGCTGCGTTGAGGCTTGCGTTTA TCGTGCTCGTCGCTGCGTTGAGGCTTGCGTTTTTG GTACTCGTCGCTGCGTTGAGGCTTGCGTTTTTGGT TGCTCGTCGCTGCGTTGAGGCTTGCGTTTATGGTA GCTCGTCGCTGCGTTGAGGCTTGCGTTTATGGTAC CGTCGCTGCGTTGAGGCTTGCGTTTATGGTACGCT GCGTTGAGGCTTGCGTTTATGGTACGCTGGATTTT GTTGAGGCTTGCGTTTTTGGTACGCTGGACTTTGT GTTGAGGCTTGCGTTTATGGTACGCTGGGCTTTTT GAGGCTTGCGTTTATGGTACGCTGGACTTTGTAGG CTTGCGTTTATGGTACGCTGGACTTTGTAGGATAC TTGCGTTTATGGTACGCTGGACTTTGTAGGATACC TTGCGTTTATGGTACGCTGGACTTTGTAGGATACC TTGCGTTTATGGTACGCTGGACTTTGTAGGATACC GCGTTTATGGTACGCTGGACTTTGTAGGATACCCT CGTTTATGGTACGCTGGACTTTGTAGGATACCCTC GTTTATGGTACGCTGGACTTTGTAGGATACCCTCG TATGGTACGCTGGACTTTGTAGGATACCCTCGCTT ATGGTACGCTGGACTTTGTAGGATACCCTCGCTTT ATGGTACGCTGGACTTTGTAGGATACCCTCGCTTT CTCTCGTGCTCGTCGCTGCGTTGAGGCTTGCGTTTATGGTACGCTGGACTTTGTAGGATACCCTCGCTTTC

All Reads

[Figure: nucleotide composition (0.0 to 1.0) by sequencing cycle (0 to 30), with curves for A and T.]
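A composition-by-cycle plot of this kind can be computed directly from the reads: position k of every read was called in cycle k, so tabulating bases by read position gives the per-cycle composition. A minimal sketch with made-up reads:

```python
from collections import Counter

def composition_by_cycle(reads):
    """Return, for each cycle (read position), the fraction of each called base."""
    n_cycles = max(len(r) for r in reads)
    counts = [Counter() for _ in range(n_cycles)]
    for read in reads:
        for cycle, base in enumerate(read):
            counts[cycle][base] += 1
    fractions = []
    for col in counts:
        total = sum(col.values())  # includes no-calls such as 'N'
        fractions.append({base: col[base] / total for base in "ACGT"})
    return fractions

# Hypothetical toy reads, four cycles each
toy_reads = ["ACGT", "ACGA", "TCGA", "ACTA"]
composition = composition_by_cycle(toy_reads)
```

If reads sample the genome uniformly, the fractions should stay roughly flat across cycles; a drift with cycle number points at the base caller rather than at biology.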

1000 Genomes Data

[Figure: twelve histograms of number of calls by sequencing cycle (1 to 35) for sample NA19238. First set of six panels: UCSC loci (n=42691), non-UCSC loci (n=4986), Hapmap loci (n=19067), non-Hapmap loci (n=28610), 1kgenomes loci (n=36333), non-1kgenomes loci (n=11344). Second set of six panels: UCSC loci (n=48864), non-UCSC loci (n=15005), Hapmap loci (n=20069), non-Hapmap loci (n=43800), 1kgenomes loci (n=38306), non-1kgenomes loci (n=25563). Panel labels: All data; Filtered: snpq>=20, nreads<=360; SNPs in dbSNP; Novel SNPs.]

Here we aggregate reads and record the cycle at which the variant appears.
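The tabulation behind such histograms can be sketched as follows: for every read covering a called variant site, note which cycle produced the non-reference base. This is an illustration under simplifying assumptions (ungapped, forward-strand reads given as hypothetical (start, sequence) pairs, so that cycle = offset + 1); the toy data are made up.

```python
from collections import Counter

def variant_cycles(aligned_reads, reference, variant_positions):
    """Count, per sequencing cycle, how often a non-reference base lands on a variant site."""
    sites = set(variant_positions)
    tally = Counter()
    for start, seq in aligned_reads:
        for offset, base in enumerate(seq):
            pos = start + offset
            if pos in sites and base != "N" and base != reference[pos]:
                tally[offset + 1] += 1  # cycles are numbered from 1
    return tally

# Hypothetical toy data with one called variant at reference position 9
reference = "ACACCCTATGTCGCAGTATCTGTCT"
reads = [
    (0, "ACACCCTATATCGCA"),   # variant base seen in cycle 10
    (4, "CCTATATCGCAGTAT"),   # variant base seen in cycle 6
    (9, "GTCGCAGTATCTGTCT"),  # reference base, not counted
    (9, "ATCGCAGTATCTGTCT"),  # variant base seen in cycle 1
]
tally = variant_cycles(reads, reference, [9])
```

For true variants the supporting bases should fall on cycles roughly uniformly, since a read can start anywhere relative to the site. A pile-up at particular cycles suggests calls driven by cycle-dependent errors.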

1000 Genomes Data

[Figure repeated from the previous slide: number of calls by sequencing cycle for sample NA19238; all data and filtered (snpq>=20, nreads<=360); SNPs in dbSNP and novel SNPs.]

What is causing this?

Source: Metzker ML. Sequencing technologies - the next generation. Nat Rev Genet. 2010

Source: Whiteford et al. Swift: primary data analysis for the Illumina Solexa sequencing platform. Bioinformatics. 2009

name, sequence, quality scores

x 100s of millions

(slide courtesy of Ben Langmead)

Before Reads There were Intensities

We Want to See This

Color coded by call made: A, C, G, T

But See This

Color coded by call made: A, C, G, T

Four-channel fluorescence intensity, cycle 1 (channels A, C, G, T)

Gets Worse for higher cycles

Color coded by call made: A, C, G, T

Error Rate and Reported Quality

[Figure, two panels by sequencing cycle (0 to 60): estimated error probabilities (0.0 to 0.4) and mismatch proportions (0.000 to 0.015). One curve per substitution type: A>C, T>C, A>G, T>A, C>T, C>A, G>A, T>G, G>C, G>T, A>T, C>G.]

Remember This?

[Figure repeated from the "All Reads" slide: nucleotide composition (0.0 to 1.0) by sequencing cycle (0 to 30), with curves for A and T.]

Bias Explained

[Figure: scatter plot of A−T against 0.5(A+T), with points marked by cycle < 20 and cycle ≥ 20.]

Base Calling

1) Rougemont et al. Probabilistic base calling of Solexa sequencing data. BMC Bioinformatics (2008)

2) Erlich et al. Alta-Cyclic: a self-optimizing base caller for next-generation sequencing. Nat Methods (2008)

3) Kao et al. BayesCall: A model-based base-calling algorithm for high-throughput short-read sequencing. Genome Res (2009)

4) Corrada Bravo and Irizarry. Model-Based Quality Assessment and Base-Calling for Second-Generation Sequencing Data. Biometrics (2009)

5) Cokus et al. Shotgun bisulphite sequencing of the Arabidopsis genome reveals DNA methylation patterning. Nature (2009)

Intensity Model

[Figure: intensity by cycle; axis ticks 12.5 to 14.0.]

Intensity Model

$$u_{ijc} = \Delta_{ijc}\,(\mu_{cj\alpha} + x_j^T \alpha_i + \epsilon^{\alpha}_{ijc}) + (1 - \Delta_{ijc})\,(\mu_{cj\beta} + x_j^T \beta_i + \epsilon^{\beta}_{ijc})$$

• $u_{ijc}$: log intensity, read i, cycle j, channel c

• $\Delta_{ijc}$: indicators of nucleotide identity, read i, position j

$$\Delta_{ijc} = \begin{cases} 1 & \text{if } c \text{ is the nucleotide in read } i \text{ position } j \\ 0 & \text{otherwise} \end{cases}$$

• $x_j^T \alpha_i$ and $x_j^T \beta_i$: read-specific linear models

• $\epsilon^{\alpha}_{ijc}$ and $\epsilon^{\beta}_{ijc}$: measurement error

$$\epsilon^{\alpha}_{ijc} \sim N(0, \sigma^2_{\alpha i}), \qquad \epsilon^{\beta}_{ijc} \sim N(0, \sigma^2_{\beta i})$$
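A model of this form leads naturally to probabilistic base calls: for each candidate base, the channel matching the candidate is scored under the signal (α) component and the other three channels under the background (β) component. The sketch below is a heavily simplified illustration, not the Srfim implementation. It assumes a uniform prior over bases, independent channels, and known fixed means and standard deviations standing in for the fitted terms μ + x_j^T α_i and μ + x_j^T β_i; all numbers are hypothetical.

```python
import math

BASES = "ACGT"

def normal_logpdf(x, mean, sd):
    """Log density of a normal distribution with the given mean and standard deviation."""
    return -0.5 * math.log(2 * math.pi * sd**2) - (x - mean) ** 2 / (2 * sd**2)

def base_posteriors(log_intensity, mu_alpha, mu_beta, sd_alpha, sd_beta):
    """Posterior probability of each base at one read position.

    log_intensity: dict mapping channel (A, C, G, T) to observed log intensity u.
    """
    log_lik = {}
    for candidate in BASES:
        total = 0.0
        for channel in BASES:
            if channel == candidate:  # Delta = 1: signal component
                total += normal_logpdf(log_intensity[channel], mu_alpha, sd_alpha)
            else:                     # Delta = 0: background component
                total += normal_logpdf(log_intensity[channel], mu_beta, sd_beta)
        log_lik[candidate] = total
    # Normalize in a numerically stable way (uniform prior over the four bases)
    top = max(log_lik.values())
    weights = {b: math.exp(v - top) for b, v in log_lik.items()}
    norm = sum(weights.values())
    return {b: w / norm for b, w in weights.items()}

# Hypothetical log intensities for one read position
u = {"A": 13.8, "C": 12.6, "G": 12.4, "T": 12.7}
posterior = base_posteriors(u, mu_alpha=14.0, mu_beta=12.5, sd_alpha=0.3, sd_beta=0.3)
call = max(posterior, key=posterior.get)
```

The posterior for the called base doubles as a quality measure: 1 minus that probability is the estimated chance the call is wrong.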

Read & Cycle Effects

[Figure: panels of h(intensity) (11 to 14) by cycle (0 to 30).]

Base Identity Probability Profiles

[Figure: probability (0 to 1) by position (1 to 35).]

Before And After

[Figure, two panels: nucleotide composition (0.0 to 1.0) by sequencing cycle (0 to 30), with curves for A and T. Panels: Solexa (Default); Srfim (Statistical Approach).]

The End