Preprocessing and SNP calling
Natasja S. Ehlers, PhD student
Center for Biological Sequence Analysis
Functional Human Variation Group
Next Generation Sequencing
Simon Rasmussen
Associate Professor
Center for Biological Sequence Analysis
Technical University of Denmark
2016
Library preparation
1. Create library molecules
2. Amplification (PCR)
3. Massively parallel sequencing
DNA from extract
Fragment & polish DNA
Adapters
Library molecule
Library
Amplification and immobilization
Emulsion PCR (454, SOLiD, Ion Torrent): water, oil, beads, one DNA template per droplet
Bridge PCR (Illumina): one DNA template per cluster, primers on surface, clusters grow by bridging primers
Metzker, Nat. Rev. Genet. 2010
Bridge PCR
Fluorescence detection
[Figure: Metzker, Nature Reviews Genetics, January 2010, volume 11, p. 36, Figure 2. Panel a: Illumina/Solexa reversible terminators (incorporate all four nucleotides, each labelled with a different dye; wash, four-colour imaging; cleave dye and terminating groups, wash; repeat cycles). Panel b: four-colour images. Panel c: Helicos BioSciences reversible terminators (each cycle, add a different dye-labelled dNTP; incorporate single, dye-labelled nucleotides; wash, one-colour imaging; cleave dye and inhibiting groups, cap, wash; repeat cycles). Panel d: one-colour images.]
Figure 2 | Four-colour and one-colour cyclic reversible termination methods. a | The four-colour cyclic reversible termination (CRT) method uses Illumina/Solexa’s 3′-O-azidomethyl reversible terminator chemistry (refs 23, 101; BOX 1) using solid-phase-amplified template clusters (FIG. 1b, shown as single templates for illustrative purposes). Following imaging, a cleavage step removes the fluorescent dyes and regenerates the 3′-OH group using the reducing agent tris(2-carboxyethyl)phosphine (TCEP) (ref. 23). b | The four-colour images highlight the sequencing data from two clonally amplified templates. c | Unlike Illumina/Solexa’s terminators, the Helicos Virtual Terminators (ref. 33) are labelled with the same dye and dispensed individually in a predetermined order, analogous to a single-nucleotide addition method. Following total internal reflection fluorescence imaging, a cleavage step removes the fluorescent dye and inhibitory groups using TCEP to permit the addition of the next Cy5-2′-deoxyribonucleoside triphosphate (dNTP) analogue. The free sulphhydryl groups are then capped with iodoacetamide before the next nucleotide addition (ref. 33; step not shown). d | The one-colour images highlight the sequencing data from two single-molecule templates.
Illumina - Cyclic reversible termination
Add all four dNTPs, each labelled with a different dye
Create four-colour image
Cleave dye and repeat for the next cycle
454 - Pyrosequencing
Load template beads into wells
Flow one dNTP across the wells
Polymerase incorporates nucleotide
Release of PPi leads to light
Imaging, then next dNTP
Metzker, Nat. Rev. Genet. 2010
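The pyrosequencing steps above can be sketched as a small simulation. This is a minimal, idealised sketch and not the vendor's base caller: it assumes a fixed flow order (`TACG` here is an assumption), no noise, and a light signal exactly proportional to the number of nucleotides incorporated in a flow. The function name `flowgram` is made up for illustration.

```python
def flowgram(template, flow_order="TACG", n_flows=12):
    """Idealised pyrosequencing: one dNTP is flowed at a time and the signal
    equals the number of bases incorporated during that flow.

    `template` is written as the strand being synthesised, for simplicity.
    """
    pos = 0
    signals = []
    for i in range(n_flows):
        nt = flow_order[i % len(flow_order)]
        run = 0
        # The polymerase keeps incorporating while the next base matches
        # the flowed nucleotide, so a homopolymer gives one stronger flash.
        while pos < len(template) and template[pos] == nt:
            run += 1
            pos += 1
        signals.append((nt, run))
    return signals


if __name__ == "__main__":
    for nt, signal in flowgram("TTAGGGC", n_flows=8):
        print(nt, signal)
```

In this model a run such as GGG appears as a single flow with signal 3. Telling a signal of 3 from 4 is harder than telling 0 from 1, which is one way to see why long homopolymers are difficult for this chemistry.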
2G: Imaging handout
Illumina 1:________
Illumina 2:________
454: _______________________________________________
Metzker, Nat. Rev. Genet. 2010
2G: Imaging handout - answers
Illumina: quality of base calls deteriorates after many cycles
454: homopolymer runs are problematic and give rise to indels
Metzker, Nat. Rev. Genet. 2010
Illumina: Quality deterioration
Can you think of why?
Efficiency of incorporation:
• Polymerase incorporation of base
• Cleavage of the dye
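A rough way to see why imperfect efficiency lowers quality over many cycles: if each strand in a cluster completes a cycle with some probability, the fraction of strands still in step shrinks geometrically. The sketch below is a simplified model under stated assumptions, not Illumina's actual error model. The per-cycle efficiencies (0.998 and 0.997) are made-up illustration values, and strands that fall behind are assumed never to catch up.

```python
def in_phase_fraction(p_incorporate, p_cleave, cycles):
    """Fraction of strands in a cluster that are still in phase after
    `cycles` cycles, if every cycle needs both a successful incorporation
    and a successful dye cleavage (simplified: no re-synchronisation)."""
    per_cycle = p_incorporate * p_cleave
    return per_cycle ** cycles


if __name__ == "__main__":
    # Hypothetical efficiencies, chosen only for illustration.
    for n in (1, 50, 100, 150):
        print(n, round(in_phase_fraction(0.998, 0.997, n), 3))
```

Even with efficiencies close to 1, the in-phase fraction drops steadily with cycle number, so the cluster signal becomes a mixture of bases and base calls become less certain.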
NextSeq/HiSeq3000/4000
• Chemistry is based not on 4 dyes (as before) but on 2 dyes
• T (red), C (green), A (both) and G (none = “dark”)
• Faster processing rate and cheaper reagents
• Slightly increases error rate
• Problem with G stretches because G is not dyed
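The two-dye scheme listed above can be written as a small lookup. This is a minimal sketch of the encoding only (T red, C green, A both, G none), assuming the two channels have already been thresholded to on/off. The names `TWO_DYE`, `call_base` and `call_read` are made up for illustration, and real base calling works on intensities.

```python
# (red, green) channel state -> base, following the slide:
# T (red), C (green), A (both), G (none = "dark")
TWO_DYE = {
    (True, False): "T",
    (False, True): "C",
    (True, True): "A",
    (False, False): "G",
}


def call_base(red, green):
    """Call one base from two on/off channel signals."""
    return TWO_DYE[(bool(red), bool(green))]


def call_read(signals):
    """Call a read from a list of (red, green) pairs, one pair per cycle."""
    return "".join(call_base(r, g) for r, g in signals)


if __name__ == "__main__":
    print(call_read([(1, 0), (0, 1), (1, 1), (0, 0)]))
    # A cluster that emits no light at all is called as a run of G.
    print(call_read([(0, 0)] * 5))
```

Because G is the absence of signal, this model cannot tell a true stretch of G from a cluster that has stopped producing signal, which is the problem with G stretches noted above.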
Ion Torrent
• Similar principle to 454
• Library: emulsion PCR
• Based on semiconductors
• Detection is based on changes in H+ ions (pH)
Ion Torrent
SOLiD example
@SRR349943.1 solid0420_20100825_FRAG/1
T30212302300330212112223121002232332112002222302010
+
!=:369A?:.<9=.-5=%3-:6%3&<2%(169%,0.3%&'&(&.'%%%&&,
Double-base encoding (colorspace)
Low error rate
Errors propagate
AAGT...
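Decoding a colorspace read into bases can be sketched as follows. The sketch uses the standard two-base encoding in which each colour is the XOR of the 2-bit codes of two adjacent bases (A=0, C=1, G=2, T=3), and the first character of the read is the known primer base. The function name `decode_colorspace` is made up for illustration.

```python
BASES = "ACGT"  # 2-bit codes: A=0, C=1, G=2, T=3


def decode_colorspace(read):
    """Translate a colorspace read (primer base followed by colours 0-3)
    into bases. Each colour encodes the transition between two adjacent
    bases, so every decoded base depends on all colours before it."""
    prev = BASES.index(read[0])
    decoded = []
    for colour in read[1:]:
        prev ^= int(colour)
        decoded.append(BASES[prev])
    return "".join(decoded)


if __name__ == "__main__":
    # Start of the example read above.
    print(decode_colorspace("T3021"))
    # The same read with the first colour miscalled (3 -> 2).
    print(decode_colorspace("T2021"))
```

The first call reproduces the AAGT start shown above. In the second call a single wrong colour changes every base after it, which is what "errors propagate" means and why colorspace reads are usually aligned in colorspace instead of being translated first.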
Complete Genomics
ssDNA -> DNA nanoballs (DNBs)
Use silicon chips with sticky spots
Place DNBs into each spot
Sequence using ligase and fluorescently labelled probes
You can't buy the machine - acquired by BGI - delayed - only human samples!
3rd generation
Helicos, Pacific Biosciences, Oxford Nanopore
No amplification (PCR introduces bias!)
Simple sample preparation
Pacific Biosciences
Slowed-down DNA polymerase, measure light emission
Long reads > 10 kb, high error rate (but random)
Pacific Biosciences
Oxford Nanopore
Nano-scale pores with a current across them
Drag a stretch of DNA through the pore
Measure change in current (pentamers)
Long reads (up to 200k), currently ~5k, some systematic errors
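If the current depends on the five bases in the pore at a time (pentamers), the base caller has to tell apart 4^5 = 1024 possible k-mers, and neighbouring k-mers overlap by four bases. The sketch below only enumerates k-mers to make this counting concrete; it is not a base caller and does not model any current levels. The function names are made up for illustration.

```python
from itertools import product


def all_kmers(k, alphabet="ACGT"):
    """Every possible k-mer over the alphabet (4**k of them for DNA)."""
    return ["".join(p) for p in product(alphabet, repeat=k)]


def sliding_kmers(seq, k=5):
    """The successive k-mers seen as the strand moves one base at a time."""
    return [seq[i:i + k] for i in range(len(seq) - k + 1)]


if __name__ == "__main__":
    print(len(all_kmers(5)))
    print(sliding_kmers("ACGTACG"))
```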
Oxford Nanopore
Synthetic Long Reads (2nd gen)
• Illumina Synthetic long-read sequencing (Moleculo)
• 10X Genomics
• Based on Illumina sequencing
• Using barcode system to create linked reads / read clouds
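The barcode idea can be sketched as grouping short reads by barcode, so that each group (a "read cloud") collects reads that came from the same long fragment or partition. This is a minimal sketch with hypothetical barcodes and sequences. Real pipelines extract the barcode from the read itself and also use alignment positions, which are not modelled here.

```python
from collections import defaultdict


def read_clouds(reads):
    """Group (barcode, sequence) pairs into linked reads per barcode."""
    clouds = defaultdict(list)
    for barcode, sequence in reads:
        clouds[barcode].append(sequence)
    return dict(clouds)


if __name__ == "__main__":
    # Hypothetical barcodes and short reads, for illustration only.
    reads = [("BC1", "ACGT"), ("BC2", "TTGA"), ("BC1", "GGCA")]
    for barcode, linked in read_clouds(reads).items():
        print(barcode, linked)
```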
[Figure: Goodwin et al., Nature Reviews Genetics, June 2016, volume 17, p. 344. A: Real-time long-read sequencing. B: Synthetic long-read sequencing.]
Aa Pacific Biosciences
• ZMW wells: sites where sequencing takes place
• Labelled nucleotides: all four dNTPs are labelled and available for incorporation
• Modified polymerase: as a nucleotide is incorporated by the polymerase, a camera records the emitted light
• SMRTbell template: two hairpin adapters allow continuous circular sequencing
• PacBio output: a camera records the changing colours from all ZMWs; each colour change corresponds to one base
Ab Oxford Nanopore Technologies
• Alpha-hemolysin: a large biological pore capable of sensing DNA
• Current: passes through the pore and is modulated as DNA passes through
• Leader-hairpin template: the leader sequence interacts with the pore and a motor protein to direct DNA; a hairpin allows for bidirectional sequencing
• ONT output (squiggles): each current shift as DNA translocates through the pore corresponds to a particular k-mer (plot axes: mean signal (pA) against time (seconds))
Ba Illumina
• DNA fragment: DNA is fragmented and selected to ~10 kb
• ~3,000 molecules per well
• Enzymatic cleavage: DNA is barcoded and fragmented to ~350 bp
• Barcodes: DNA from the same well shares the same barcode
• Pooling: DNA from each well is pooled and undergoes a standard library preparation
• Sequencing: DNA is sequenced on a standard short-read sequencer
Bb 10X Genomics
• Emulsion PCR: arbitrarily long DNA is mixed with beads loaded with barcoded primers, enzyme and dNTPs
• GEMs: each micelle has 1 barcode out of 750,000
• Amplification: long fragments are amplified such that the product is a barcoded fragment of ~350 bp
• Pooling: the emulsion is broken and DNA is pooled, then it undergoes a standard library preparation
• Linked reads: all reads from the same GEM derive from the long fragment, thus they are linked
• Reads are dispersed across the long fragment and no GEM achieves full coverage of a fragment
• Stacking of linked reads from the same loci achieves continuous coverage
Machine overview - I
Company/technology: current machine, key characteristics
454: GS FLX+, 700-800 bp length, 1M reads
Illumina: HiSeq 4000/HiSeq X Ten, 100-150 bp length, up to 2-4B reads
SOLiD (Life): 5500XL, 50/75 bp length, 1.5B reads
PGM (Life): Ion Proton, 200/400 bp length, 80M reads
Complete Genomics (BGISEQ): BGISEQ-500, 50-100 bp length, 200 Gb in total
PacBio: Sequel, 8-12 kb length, 350k reads
Oxford Nanopore: GridION, up to 200 kb, >100k reads
Illumina synthetic: up to 100 kb synthetic length, $1000 per Gb
10X Genomics: up to 100 kb synthetic length, +$500 per sample
Excellent overview in Goodwin et al., Nature Reviews Genetics (2016)
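To compare the machines it can help to turn read length and read count into an approximate yield per run. The sketch below simply multiplies the two. The input numbers are taken from the upper end of the ranges listed above, paired-end reads are ignored, and the genome size used for the coverage example (3.1 Gb, roughly a human genome) is an assumption added for illustration.

```python
def yield_gb(read_length_bp, n_reads):
    """Approximate yield of a run in gigabases (Gb)."""
    return read_length_bp * n_reads / 1e9


def coverage(read_length_bp, n_reads, genome_size_bp):
    """Average depth of coverage if all bases map to the genome."""
    return read_length_bp * n_reads / genome_size_bp


if __name__ == "__main__":
    # (read length in bp, number of reads), upper ends of the listed ranges.
    platforms = {
        "454 GS FLX+": (750, 1e6),
        "Illumina HiSeq": (150, 4e9),
        "PacBio Sequel": (10_000, 350_000),
    }
    for name, (length, reads) in platforms.items():
        print(name, yield_gb(length, reads), "Gb")
    # Assumed genome size of 3.1 Gb, for illustration only.
    print(round(coverage(150, 4e9, 3.1e9), 1), "x")
```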
Machine overview - II
Benchtop machines
Company/technology: current machine, key characteristics
454: GS Junior, 400-500 bp length, 100k reads
Illumina: MiSeq, 300 bp length, 50M reads
PGM (Life): Ion Torrent, 400 bp length, 5M reads
Oxford Nanopore: MinION, up to 200 kb, 100k reads
Excellent overview in Goodwin et al., Nature Reviews Genetics (2016)
More NGS material
• Elaine Mardis on NGS technology
• YouTube has many more …
• Excellent review on NGS technologies: Goodwin et al., Nature Reviews (2016)
• On Campusnet together with many other papers!