Informatics tools for next-generation sequence analysis
Gabor T. Marth
Boston College Biology Department
University of Michigan
October 20, 2008
Next-gen. sequencers offer vast throughput
[Figure: bases per machine run (y-axis, 1 Mb to 10 Gb) versus read length (x-axis, 10 bp to 1,000 bp)]
• Illumina, AB/SOLiD short-read sequencers (5-15 Gb in 25-70 bp reads)
• 454 pyrosequencer (100-400 Mb in 200-450 bp reads)
• ABI capillary sequencer
Next-gen sequencing enables new applications
Meissner et al. Nature 2008
Ruby et al. Cell, 2006
Jones-Rhoades et al. PLoS Genetics, 2007
• organismal resequencing & de novo sequencing
• transcriptome sequencing for transcript discovery and expression profiling
• epigenetic analysis (e.g. DNA methylation)
Large-scale individual human resequencing
Technologies
Roche / 454 system
• pyrosequencing technology
• variable read-length
• the only new technology with >100 bp reads
Illumina / Solexa Genome Analyzer
• fixed-length short-read sequencer
• very high throughput
• read properties are very close to traditional capillary sequences
AB / SOLiD system
2-base encoding (color assigned to each pair of adjacent bases):
            2nd Base
1st Base    A   C   G   T
A           0   1   2   3
C           1   0   3   2
G           2   3   0   1
T           3   2   1   0
• fixed-length short-reads
• very high throughput
• 2-base encoding system
• color-space informatics
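A minimal sketch of the 2-base encoding, assuming the standard SOLiD color code in which each color is the XOR of the two-bit codes of adjacent bases (A=0, C=1, G=2, T=3). The function names are illustrative and are not part of any vendor software.

```python
# Two-bit codes for the bases; the color of a base pair is the XOR of the two codes.
BASE_CODE = {"A": 0, "C": 1, "G": 2, "T": 3}
CODE_BASE = "ACGT"

def encode_colors(seq):
    """Return the color string for a base sequence (one color per adjacent base pair)."""
    codes = [BASE_CODE[b] for b in seq.upper()]
    return "".join(str(a ^ b) for a, b in zip(codes, codes[1:]))

def decode_colors(first_base, colors):
    """Rebuild the base sequence from the known first base and the color string."""
    bases = [first_base.upper()]
    for c in colors:
        # XOR is its own inverse, so the previous base plus the color gives the next base.
        bases.append(CODE_BASE[BASE_CODE[bases[-1]] ^ int(c)])
    return "".join(bases)

colors = encode_colors("ATGGC")
print(colors, decode_colors("A", colors))
```

Because decoding chains through every color, a single miscalled color changes all downstream bases, which is one reason color-space reads are usually aligned in color space rather than decoded first.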
Helicos / Heliscope system
• short-read sequencer
• single molecule sequencing
• no amplification
• variable read-length
• error rate reduced with 2-pass template sequencing
Data characteristics
Read length
[Figure: read-length ranges on a 0-400 bp axis: ~200-450 bp (variable), 25-70 bp (fixed), 25-50 bp (fixed), 20-60 bp (variable)]
Representational biases
• this affects genome resequencing (deeper starting read coverage is needed)
• it will have a major impact on counting applications
“dispersed” coverage distribution
Amplification errors
many reads from clonal copies of a single fragment
• early PCR errors in “clonal” read copies lead to false positive allele calls
early amplification error gets propagated into every clonal copy
Read quality
Error rate (Illumina)
Error rate (454)
Per-read errors (Solexa)
Per read errors (454)
Base quality values not well calibrated
Tools for genome resequencing
The resequencing informatics pipeline
(i) base calling
(ii) read mapping
(iii) read assembly
(iv) SNP and short INDEL calling
(v) SV calling
(vi) data validation, hypothesis generation
[Figure labels: REF, IND]
The variation discovery “toolbox”
• base callers
• read mappers
• SNP callers
• SV callers
• assembly viewers
GigaBayes
1. Base calling
base sequence
base quality (Q-value) sequence
diverse chemistry & sequencing error profiles
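For reference, a base quality (Q-value) is conventionally a Phred-scaled error probability, Q = -10 log10(P_error). The short sketch below shows the conversion in both directions; the function names are illustrative.

```python
import math

def phred_to_error_prob(q):
    """Convert a Phred-scaled quality value to the probability that the base call is wrong."""
    return 10 ** (-q / 10)

def error_prob_to_phred(p):
    """Convert an error probability (0 < p <= 1) to a Phred-scaled quality value."""
    return -10 * math.log10(p)

for q in (10, 20, 30, 40):
    print(q, phred_to_error_prob(q))
```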
454 pyrosequencer error profile
• multiple bases in a homopolymeric run are incorporated in a single incorporation test, so the number of bases must be determined from a single scalar signal; as a result, the majority of errors are INDELs
454 base quality values
• the native 454 base caller assigns base quality values that are too low
PYROBAYES: determine base number
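One way to picture the base-number problem is as a Bayesian choice among candidate homopolymer lengths given one flow signal. The sketch below is not the PyroBayes model: it assumes a Gaussian signal centered on the true length with a spread that grows with length, and a geometric prior on length. All parameter values and names are hypothetical.

```python
import math

def homopolymer_posterior(signal, max_len=8, sigma0=0.1, sigma_slope=0.05, prior_decay=0.25):
    """Posterior probability of each homopolymer length 0..max_len for one flow signal.

    Assumptions (illustrative only): signal ~ Normal(n, sigma0 + sigma_slope * n),
    and an unnormalized prior proportional to prior_decay ** n.
    """
    weights = []
    for n in range(max_len + 1):
        sigma = sigma0 + sigma_slope * n
        z = (signal - n) / sigma
        likelihood = math.exp(-0.5 * z * z) / (sigma * math.sqrt(2 * math.pi))
        weights.append(likelihood * prior_decay ** n)
    total = sum(weights)
    return [w / total for w in weights]

posterior = homopolymer_posterior(2.05)
best = posterior.index(max(posterior))
print(best, posterior[best])
```

The posterior of the chosen length can then be turned into a quality value for the call, which is the kind of quantity the performance slide compares against measured error rates.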
PYROBAYES: Performance
• assigned quality values predict measured error rate better
• higher fraction of bases are high quality
Base quality value calibration
Recalibrated base quality values (Illumina)
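A minimal sketch of empirical quality recalibration, assuming reads have been aligned to a trusted reference so that each base can be scored as correct or incorrect. Bases are binned by reported quality only; real recalibration schemes typically use more covariates. The function name and the add-one smoothing are illustrative choices.

```python
import math
from collections import defaultdict

def recalibrate(observations):
    """Map each reported quality value to an empirical quality value.

    observations: iterable of (reported_q, is_error) pairs.
    Returns a dict {reported_q: empirical_q}.
    """
    totals = defaultdict(int)
    errors = defaultdict(int)
    for q, is_error in observations:
        totals[q] += 1
        errors[q] += int(is_error)
    table = {}
    for q in totals:
        # Add-one smoothing keeps the error rate away from 0 for error-free bins.
        rate = (errors[q] + 1) / (totals[q] + 2)
        table[q] = round(-10 * math.log10(rate))
    return table

# Bases reported as Q30 that are wrong about 1 time in 10 are really about Q10.
obs = [(30, True)] * 9 + [(30, False)] * 89
print(recalibrate(obs))
```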
2. Read mapping
Read mapping is like doing a jigsaw puzzle…
…you get the pieces…
… and they give you the picture on the box
Unique pieces are easier to place than others…
Non-uniqueness of reads confounds mapping
• Reads from repeats cannot be uniquely mapped back to their true region of origin
• RepeatMasker does not capture all micro-repeats, i.e. repeats at the scale of the read length
Strategies to deal with non-unique mapping
• Non-unique read mapping: optionally either only report uniquely mapped reads or report all map locations for each read (mapping quality values for all mapped reads are being implemented)
• mapping to multiple loci requires the assignment of alignment probabilities
[Figure: one read mapped to three loci with alignment probabilities 0.8, 0.19, and 0.01]
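A sketch of how alignment probabilities could be assigned to the candidate loci of one read. It assumes a uniform prior over loci and a likelihood that multiplies the error probabilities of the mismatched bases; this is an illustrative model, not necessarily the one used by MOSAIK. It uses `math.prod`, available in Python 3.8 and later.

```python
import math

def alignment_probabilities(mismatch_quals_per_locus):
    """Posterior probability of each candidate locus for one read.

    mismatch_quals_per_locus: one list per locus holding the Phred qualities
    of the read bases that mismatch the reference at that locus.
    """
    likelihoods = [
        math.prod(10 ** (-q / 10) for q in quals)  # empty list gives likelihood 1
        for quals in mismatch_quals_per_locus
    ]
    total = sum(likelihoods)
    return [lk / total for lk in likelihoods]

def mapping_quality(probs, cap=60):
    """Phred-scaled probability that the best locus is wrong, capped at `cap`."""
    p_wrong = 1.0 - max(probs)
    if p_wrong <= 0:
        return cap
    return min(cap, round(-10 * math.log10(p_wrong)))

# Perfect match at locus 1, one Q10 mismatch at locus 2, one Q20 mismatch at locus 3.
probs = alignment_probabilities([[], [10], [20]])
print(probs, mapping_quality(probs))
```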
Paired-end reads help unique read placement
PE (paired-end)
• fragment amplification: fragment length 100-600 bp
• fragment length limited by amplification efficiency
MP (mate-pair)
• circularization: 500 bp - 10 kb (sweet spot ~3 kb)
• fragment length limited by library complexity
Korbel et al. Science 2007
• PE reads are now the standard for genome resequencing
MOSAIK
INDEL alleles/errors – gapped alignments
454
Aligning multiple read types together
ABI/capillary
454 FLX
454 GS20
Illumina
• Alignment and co-assembly of multiple read types permits simultaneous analysis of data from multiple sources with different error characteristics
Aligner speed
3. Polymorphism / mutation detection
sequencing error
polymorphism
New challenges for SNP calling
• deep alignments of 100s / 1,000s of individuals
• trio sequences
Rare alleles in 100s / 1,000s of samples
Allele discovery is a multi-step sampling process
Population Samples Reads
Capturing the allele in the sample
[Figure: Prob(allele captured in sample) (y-axis, 0 to 1) versus population allele frequency (x-axis, 1E-04 to 0.5), with one curve per sample size: n=100, n=200, n=400, n=800, n=1600]
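As a rough check on the sampling step, the chance that an allele at population frequency f appears at least once among n diploid individuals can be approximated as 1 - (1 - f)^(2n). This sketch assumes chromosomes are drawn independently from the population, which may differ from the model behind the plotted curves; the function name is illustrative.

```python
def prob_allele_captured(allele_freq, n_individuals, ploidy=2):
    """Probability that at least one sampled chromosome carries the allele."""
    n_chromosomes = ploidy * n_individuals
    return 1.0 - (1.0 - allele_freq) ** n_chromosomes

for n in (100, 200, 400, 800, 1600):
    print(n, round(prob_allele_captured(0.001, n), 3))
```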
Allele calling in the reads
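\[
\Pr(G_1,\dots,G_n \mid B_1,\dots,B_n) =
\frac{\Pr(G_1,\dots,G_n)\,\prod_{i=1}^{n}\sum_{T_i}\Pr(B_i \mid T_i)\,\Pr(T_i \mid G_i)}
     {\sum_{l}\Pr(G_1^{l},\dots,G_n^{l})\,\prod_{i=1}^{n}\sum_{T_i}\Pr(B_i \mid T_i)\,\Pr(T_i \mid G_i^{l})}
\]
Here i runs over the n sampled individuals, G_i is the genotype and B_i the base calls of individual i, and l runs over all genotype combinations.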
base call
sample size
GigaBayes
individual read coverage
base quality
Allele calling in deep sequence data
aatgtagtaAgtacctac
aatgtagtaAgtacctac
aatgtagtaAgtacctac
aatgtagtaAgtacctac
aatgtagtaCgtacctac
aatgtagtaCgtacctac

aatgtagtaAgtacctac
aatgtagtaAgtacctac
aatgtagtaAgtacctac
aatgtagtaAgtacctac

aatgtagtaAgtacctac
aatgtagtaAgtacctac
aatgtagtaAgtacctac
aatgtagtaAgtacctac

        Q30    Q40    Q50    Q60
1       0.01   0.01   0.1    0.5
2       0.82   1.0    1.0    1.0
3       1.0    1.0    1.0    1.0
More samples or deeper coverage / sample?
Shallower read coverage from more individuals …
…or deeper coverage from fewer samples?
simulation analysis by Aaron Quinlan
Analysis indicates a balance
SNP calling in trios
[Equation: Pr(G_C | G_M, G_F), the probability of each child genotype (11, 12, or 22) for every combination of mother and father genotypes (11, 12, or 22), including a small mutation probability]
• the child inherits one chromosome from each parent
• there is a small probability for a mutation in the child
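The two rules above can be sketched as a transmission table. This example assumes a simple model in which each parent passes on one of its two alleles with equal probability and the transmitted allele flips to the other allele with probability mu; the exact parameterization on the original slide may differ, and the function names and default mu are hypothetical.

```python
from itertools import product

def transmitted_allele_probs(genotype, mu):
    """Probability that a parent with this genotype ('11', '12', '22') passes on allele 1 or 2."""
    p1 = sum(0.5 for allele in genotype if allele == "1")  # Mendelian segregation
    # A mutation flips the transmitted allele with probability mu.
    return {"1": p1 * (1 - mu) + (1 - p1) * mu,
            "2": (1 - p1) * (1 - mu) + p1 * mu}

def child_genotype_probs(mother, father, mu=1e-8):
    """Pr(G_C | G_M, G_F) for G_C in {'11', '12', '22'}."""
    from_mother = transmitted_allele_probs(mother, mu)
    from_father = transmitted_allele_probs(father, mu)
    probs = {"11": 0.0, "12": 0.0, "22": 0.0}
    for a, b in product("12", repeat=2):
        probs["".join(sorted(a + b))] += from_mother[a] * from_father[b]
    return probs

print(child_genotype_probs("11", "12"))
print(child_genotype_probs("11", "11", mu=0.01))
```

With mu set to 0 the table reduces to plain Mendelian ratios; with mu greater than 0, a child genotype that neither parent could transmit (for example 12 from two 11 parents) gets a small but nonzero probability.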
SNP calling in trios
aatgtagtaAgtacctac
aatgtagtaAgtacctac
aatgtagtaAgtacctac
aatgtagtaAgtacctac
aatgtagtaAgtacctac
aatgtagtaCgtacctac

aatgtagtaAgtacctac
aatgtagtaAgtacctac
aatgtagtaAgtacctac
aatgtagtaAgtacctac
aatgtagtaAgtacctac
aatgtagtaAgtacctac

aatgtagtaAgtacctac
aatgtagtaAgtacctac
aatgtagtaAgtacctac
aatgtagtaAgtacctac
aatgtagtaAgtacctac
aatgtagtaCgtacctac

[Figure labels: mother, father, child; P=0.79, P=0.86]
Determining genotype directly from sequence
individual 1: A/C
AACGTTAGCATA
AACGTTAGCATA
AACGTTCGCATA
AACGTTCGCATA

individual 2: C/C
AACGTTCGCATA
AACGTTCGCATA
AACGTTCGCATA
AACGTTCGCATA

individual 3: A/A
AACGTTAGCATA
AACGTTAGCATA
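A minimal sketch of calling a diploid genotype from the bases observed at one site. It assumes independent reads, equal sampling of both chromosomes, errors spread evenly over the three other bases, and a flat genotype prior; it is a simplified single-sample illustration and not the GigaBayes implementation.

```python
import math

def genotype_posteriors(bases, quals, alleles=("A", "C"), priors=None):
    """Posterior probability of each genotype given base calls and Phred qualities at one site."""
    a, b = alleles
    # For each genotype: fraction of chromosomes carrying allele a and allele b.
    genotypes = {a + "/" + a: (1.0, 0.0), a + "/" + b: (0.5, 0.5), b + "/" + b: (0.0, 1.0)}
    if priors is None:
        priors = {g: 1.0 / 3 for g in genotypes}
    weights = {}
    for g, (frac_a, frac_b) in genotypes.items():
        log_lk = 0.0
        for base, q in zip(bases, quals):
            err = 10 ** (-q / 10)
            p_given_a = (1 - err) if base == a else err / 3
            p_given_b = (1 - err) if base == b else err / 3
            log_lk += math.log(frac_a * p_given_a + frac_b * p_given_b)
        weights[g] = priors[g] * math.exp(log_lk)
    total = sum(weights.values())
    return {g: w / total for g, w in weights.items()}

# Variant column of the example reads (7th base), with an assumed quality of Q30 for every base.
for name, bases in (("individual 1", "AACC"), ("individual 2", "CCCC"), ("individual 3", "AA")):
    post = genotype_posteriors(bases, [30] * len(bases))
    print(name, max(post, key=post.get))
```

With only two reads, the homozygous call for the third individual is noticeably less certain than the other two, which illustrates why read depth per individual matters.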
4. Structural variation discovery
SV events from PE read mapping patterns
(LM: mapped distance of the read pair on the DNA reference; LF: fragment length)
• Deletion: LM ~ LF+Ldel & depth: low
• Tandem duplication: LM ~ LF-Ldup & depth: high
• Inversion: LM ~ +Linv & ends flipped, LM ~ -Linv; depth: normal
• Translocation: LM ~ LF+LT1, LM ~ LF+LT2, LM ~ LF-LT1-LT2 & depth: normal
• Insertion: un-paired read clusters & depth: normal
• Chromosomal translocation: LM ~ LF+LT & depth: normal & cross-paired read clusters
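The mapping-distance signatures can be turned into a simple per-pair rule. The sketch below flags a read pair by comparing its mapped distance with the expected fragment-length distribution; the thresholds, default library parameters, and labels are hypothetical, and a real caller such as Spanner would cluster many pairs and also use depth of coverage.

```python
def classify_pair(mapped_distance, same_chromosome=True, ends_flipped=False,
                  frag_mean=300.0, frag_sd=30.0, n_sd=3.0):
    """Label one read pair with the SV type its mapping pattern is consistent with."""
    if not same_chromosome:
        return "chromosomal translocation candidate"
    if ends_flipped:
        return "inversion candidate"
    if mapped_distance > frag_mean + n_sd * frag_sd:
        # Mates map farther apart than the fragment: LM ~ LF + Ldel.
        return "deletion candidate"
    if mapped_distance < frag_mean - n_sd * frag_sd:
        # Mates map closer than the fragment, possibly at negative distance: LM ~ LF - Ldup.
        return "tandem duplication candidate"
    return "concordant"

for distance in (310, 5300, -700):
    print(distance, classify_pair(distance))
```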
Deletion: Aberrant positive mapping distance
Copy number estimation from depth of coverage
Alignability – read coverage normalization
\[
A(p) = \frac{\text{uniquely mapped reads}}{\text{all possible mapped reads}}
\]
Het deletion “revealed” by normalization
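A sketch of how alignability can be used to normalize depth before estimating copy number. It assumes the expected depth for a fully alignable diploid region is known and that observed depth scales linearly with both copy number and A(p); the function names and example numbers are illustrative.

```python
def alignability(uniquely_mapped, all_possible):
    """A(p): fraction of the possible reads at position p that map uniquely."""
    return uniquely_mapped / all_possible

def estimate_copy_number(observed_depth, expected_depth, align, ploidy=2):
    """Copy number from depth, after scaling the expected depth by alignability."""
    if align == 0:
        return None  # no uniquely mappable reads, so depth carries no information
    return ploidy * observed_depth / (expected_depth * align)

a = alignability(30, 40)
raw = 2 * 7.5 / 20                       # ignoring alignability
normalized = estimate_copy_number(7.5, 20, a)
print(a, raw, normalized)
```

In this made-up example the raw estimate of 0.75 copies is ambiguous, while the normalized estimate of 1.0 points to a heterozygous deletion.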
Tandem duplication: negative mapping distance
Spanner – a hybrid SV/CNV detection tool
Navigation bar
Fragment lengths in selected region
Depth of coverage in selected region
5. Data visualization
1. aid software development: integration of trace data viewing, fast navigation, zooming/panning
2. facilitate data validation (e.g. SNP validation): simultaneous viewing of multiple read types, quality value displays
3. promote hypothesis generation: integration of annotation tracks
Data visualization
Our software tools for next-gen data
http://bioinformatics.bc.edu/marthlab/Beta_Release
Data mining projects
SNP calling in short-read coverage
C. elegans reference genome (Bristol, N2 strain)
Pasadena, CB4858 (1 ½ machine runs)
Bristol, N2 strain (3 ½ machine runs)
• goal was to evaluate the Solexa/Illumina technology for the complete resequencing of large model-organism genomes
• 5 runs (~120 million Illumina reads) from the Washington University Genome Center, as part of a collaborative project led by Elaine Mardis
• primary aim was to detect polymorphisms between the Pasadena and the Bristol strain
Polymorphism discovery in C. elegans
• SNP calling error rate very low:
Validation rate = 97.8% (224/229)
Conversion rate = 92.6% (224/242)
Missed SNP rate = 3.75% (26/693)
SNP
INS
• INDEL candidates validate and convert at similar rates to SNPs:
Validation rate = 89.3% (193/216)
Conversion rate = 87.3% (193/221)
• MOSAIK aligned / assembled the reads (< 4 hours on 1 CPU)
• PBSHORT found 44,642 SNP candidates (2 hours on our 160-CPU cluster)
• SNP density: 1 in 1,630 bp (of non-repeat genome sequence)
Mutational profiling: deep 454/Illumina/SOLiD data
• Pichia stipitis converts xylose to ethanol (bio-fuel production)
• one mutagenized strain had especially high conversion efficiency
• goal: determine which mutations caused this phenotype
• we resequenced the 15 Mb genome with 454, Illumina, and SOLiD reads
• 14 true point mutations in the entire genome
Pichia stipitis reference sequence
Image from JGI web site
Technology comparisons
Thanks
Credits
Elaine Mardis
Andy Clark
Aravinda Chakravarti
Doug Smith
Michael Egholm
Scott Kahn
Francisco de la Vega
Kristen Stoops
Ed Thayer
Lab
Recruitment