Lecture 4: DNA Sequencing in the Genomics Era
Sandy Simon
Genomics Research Fellow, Department of Biology
August 28, 2015
What is Genomics?
Study and analysis of all of the DNA contained in an organism
Comprehensive blueprint for what makes each individual unique
Powerful method for studying the integrated functions of an organism and even of a whole community
http://www.nature.com
Genomics is revolutionizing the study of living systems
Human health
Agriculture
Environment
Fundamental biology
Huge economic payoffs from genomics research:
$3.8 billion investment in human genome sequencing has yielded $796 billion in economic development and 310,000 jobs in the United States [1]
http://cisncancer.org
[1] Tripp and Grueber 2011. Economic Impact of the Human Genome Project. Battelle Memorial Institute.
Impacts of Genomics
PCR and the Molecular Revolution
PCR: Polymerase Chain Reaction
Invented by Kary Mullis in 1983
Exponential amplification of a specific sequence of DNA
Most important molecular marker techniques involve PCR
Components: primers, nucleotides, template, thermostable polymerase
http://www.dnalc.org/ddnalc/resources/pcr.html
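The exponential amplification listed above can be made concrete with a short calculation. This is a minimal sketch that assumes every template is copied once per cycle; the function name `pcr_copies` and the `efficiency` parameter are illustrative and not part of the lecture.

```python
def pcr_copies(initial_copies: int, cycles: int, efficiency: float = 1.0) -> float:
    """Expected number of template copies after `cycles` rounds of PCR.

    efficiency = 1.0 models perfect doubling each cycle (an idealization);
    lower values model reactions where only a fraction of templates are copied.
    """
    if not 0.0 <= efficiency <= 1.0:
        raise ValueError("efficiency must be between 0 and 1")
    return initial_copies * (1.0 + efficiency) ** cycles


if __name__ == "__main__":
    # Copy number grows geometrically with the number of cycles.
    for n in (10, 20, 30):
        print(f"{n} cycles: {pcr_copies(1, n):.3g} copies")
```

Under perfect doubling, a single template molecule exceeds a thousand copies after 10 cycles, which is why a specific sequence can be detected from very little starting DNA.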
Molecular markers provide closer link between phenotype and genotype
“Anonymous” molecular markers: RFLP, RAPD, AFLP and GBS: no knowledge of underlying sequence polymorphism or location in genome
“Sequence-Tagged” markers like microsatellites or SNPs derived from defined locations in genome
Often reveal higher levels of polymorphism than allozymes and morphological markers
Allow studies of neutral variation in natural populations
Molecular Markers
Anonymous markers often have short “primer” sequences (e.g., 10 bp primer sequences in RAPD)
Randomly amplify portions of genome
Sequence-Tagged markers have longer primers (e.g., 20 bp for microsatellite primers)
Anonymous and Sequence-Tagged Markers
AGTTCAGAGT
ATGCTGAGGTCGCTTAGCAGctctctctctctctctctctcctctctctctctctGGATCCTGAATGCTGACTG
TCAAGTCTCA
agctggactacctctacgtcagcTGAGACTTGAACTCTGAACT
ATGCTGAGGTCGCTTAGCAGctctctctctctctGGATCCTGAATGCTGACTG
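The two microsatellite alleles in the figure above differ in the number of "ct" repeats between the same flanking sequences. A minimal sketch of how that length difference could be measured is shown here; the function name and the two short example alleles are hypothetical stand-ins, and the slide's own sequences can be passed in the same way.

```python
import re


def longest_repeat_run(sequence: str, motif: str) -> int:
    """Return the largest number of back-to-back copies of `motif` in `sequence`.

    Matching is case-insensitive, since the slide writes the repeat in lower case.
    """
    pattern = re.compile(f"(?:{re.escape(motif)})+", re.IGNORECASE)
    runs = [len(m.group(0)) // len(motif) for m in pattern.finditer(sequence)]
    return max(runs, default=0)


if __name__ == "__main__":
    # Hypothetical alleles: same flanks, different repeat counts.
    allele_long = "TTAGCAG" + "ct" * 12 + "GGATCC"
    allele_short = "TTAGCAG" + "ct" * 7 + "GGATCC"
    for name, allele in (("long", allele_long), ("short", allele_short)):
        print(name, longest_repeat_run(allele, "ct"))
```

Because the primers sit in the conserved flanks, the PCR product length changes with the repeat count, which is what a microsatellite assay scores.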
DNA Sequencing
Direct determination of sequence of bases at a location in the genome
Shotgun versus PCR sequencing
Dye terminators (Sanger) and capillaries revolutionized DNA sequencing
Modern sequencing methods (sequencing by synthesis, pyrosequencing) have catapulted sequencing into realm of population genetics
SNPs
A Single Nucleotide Polymorphism (SNP) is a single-base mutation in DNA.
The most common source of genetic polymorphism (e.g., 90% of all human DNA polymorphisms).
Identify SNPs by screening a sample of individuals from the study population: usually 16 to 48
Once identified, SNPs are assayed in populations using high-throughput methods
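The screening step can be sketched as a column-by-column comparison of aligned sequences from the discovery panel. This is a toy illustration only: the function name and the four example sequences are hypothetical, and it ignores gaps, sequencing errors, and heterozygous calls that a real SNP-discovery pipeline must handle.

```python
from collections import Counter


def find_snps(aligned_sequences):
    """Return {position: Counter of bases} for every alignment column
    in which more than one base is observed."""
    if not aligned_sequences:
        return {}
    length = len(aligned_sequences[0])
    if any(len(seq) != length for seq in aligned_sequences):
        raise ValueError("sequences must be aligned to the same length")
    snps = {}
    for pos in range(length):
        alleles = Counter(seq[pos].upper() for seq in aligned_sequences)
        if len(alleles) > 1:  # variable column = candidate SNP
            snps[pos] = alleles
    return snps


if __name__ == "__main__":
    # Hypothetical panel of four individuals (a real screen uses 16 to 48).
    panel = ["ACGTA", "ACGTA", "ACCTA", "ATGTA"]
    for pos, alleles in find_snps(panel).items():
        print(pos, dict(alleles))
```

Positions are 0-based. A larger panel raises the chance of catching low-frequency alleles, which is one reason the screening sample size matters.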
If nucleotides occur randomly in a genome, which sequence should occur more frequently?
AGTTCAGAGT
AGTTCAGAGTAACTGATGCT
What is the expected probability of each sequence to occur once?
How many times would each sequence be expected to occur by chance in a 100 Mb genome?
AGTTCAGAGT
What is the expected probability of each sequence to occur once?
What is the sample space for the first position? A, T, G, C
Probability of “A” at that position? $\frac{1}{4}$
Probability of “A” at position 1, “G” at position 2, “T” at position 3, etc.?
$\underbrace{\frac{1}{4} \times \frac{1}{4} \times \cdots \times \frac{1}{4}}_{10\ \text{factors}} = 0.25^{10} = 9.54 \times 10^{-7}$
AGTTCAGAGTAACTGATGCT
$0.25^{20} = 9.09 \times 10^{-13}$
How many times would each sequence be expected to occur in a 100 Mb genome?
AGTTCAGAGT
$9.54 \times 10^{-7} \times 10^{8} = 95.4$
AGTTCAGAGTAACTGATGCT
$9.09 \times 10^{-13} \times 10^{8} = 9.1 \times 10^{-5}$
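The arithmetic on these slides can be reproduced in a few lines. This sketch follows the slides' own simplifications (each base has probability 0.25, positions are independent, and the number of possible start positions is taken to be the genome size); the function names are illustrative.

```python
def sequence_probability(sequence: str, base_prob: float = 0.25) -> float:
    """Probability of seeing `sequence` at one given position,
    assuming independent bases that each occur with probability `base_prob`."""
    return base_prob ** len(sequence)


def expected_occurrences(sequence: str, genome_size: int, base_prob: float = 0.25) -> float:
    """Expected count in a genome, approximating the number of
    start positions by `genome_size` as the slides do."""
    return sequence_probability(sequence, base_prob) * genome_size


if __name__ == "__main__":
    genome_size = 100_000_000  # 100 Mb
    for seq in ("AGTTCAGAGT", "AGTTCAGAGTAACTGATGCT"):
        p = sequence_probability(seq)
        n = expected_occurrences(seq, genome_size)
        print(f"{seq}: p = {p:.3g}, expected occurrences = {n:.3g}")
```

The 10-base sequence is expected roughly 95 times by chance, while the 20-base sequence is expected far less than once, which is why longer primers target a specific location.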
Why is this calculation wrong?
www.wv-inbre.net/bioinformatics/slides/IST444Genomicsequencing.ppt
Automated DNA Sequence Readouts
Capillary Sequencers
Revolutionized DNA sequencing by enabling multiple samples to be analyzed in parallel
Human Genome Project Sequencing Strategy
Clone-based physical mapping
Digest genome and make Bacterial Artificial Chromosomes (BACs, 150,000 bp each)
Digest BACs to create fingerprints
Organize BACs to form contigs
Select BAC clones for sequencing
Shear BACs and shotgun clone
Sequence clones and assemble overlaps
J. Craig Venter and Shotgun Sequencing
Proposed a whole-genome shotgun sequencing method to NIH in 1991. Proposal rejected.
Sets up The Institute for Genomic Research (TIGR) in 1992 (private and non-profit)
TIGR publishes the first complete genome sequence in 1995 (Haemophilus influenzae)
Forms Celera Genomics in 1998 to sequence human genome in three years (private, for-profit)
“The Sequence of the Human Genome” is published in Science, February 2001
Venter departs Celera, 2002
Shotgun Sequencing Strategy
Whole-genome shotgun sequencing of five individuals with 5 to 100 fold coverage
Computer assembles overlapping sequences to form contigs
Contigs are assembled into scaffolds
Scaffolds are mapped to the genome by two or more Sequence Tagged Site (STS) markers
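The step in which a computer assembles overlapping sequences into contigs can be illustrated with a toy greedy merger. This is a sketch under strong assumptions (error-free reads, a single strand, no repeats, no read contained inside another) and is not the assembler used by Celera or any production tool; all names are hypothetical.

```python
def overlap(a: str, b: str, min_len: int = 3) -> int:
    """Length of the longest suffix of `a` that equals a prefix of `b`
    (0 if shorter than `min_len`)."""
    for size in range(min(len(a), len(b)), min_len - 1, -1):
        if a.endswith(b[:size]):
            return size
    return 0


def greedy_assemble(reads, min_len: int = 3):
    """Repeatedly merge the pair of sequences with the largest overlap
    until no pair overlaps by at least `min_len` bases."""
    contigs = list(dict.fromkeys(reads))  # drop exact duplicates, keep order
    while True:
        best_size, best_i, best_j = 0, None, None
        for i, a in enumerate(contigs):
            for j, b in enumerate(contigs):
                if i != j:
                    size = overlap(a, b, min_len)
                    if size > best_size:
                        best_size, best_i, best_j = size, i, j
        if best_size == 0:
            return contigs
        merged = contigs[best_i] + contigs[best_j][best_size:]
        contigs = [c for k, c in enumerate(contigs) if k not in (best_i, best_j)]
        contigs.append(merged)


if __name__ == "__main__":
    # Hypothetical overlapping reads.
    reads = ["CGCTTAGCAG", "ATGCTGAGG", "GAGGTCGCTT"]
    print(greedy_assemble(reads))
```

Repetitive sequence breaks this approach, because the same overlap can join reads from different parts of the genome. Mapping scaffolds with STS markers helps anchor assemblies where overlaps alone are ambiguous.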
NextGen Challenge: Sequence Assembly
New sequencing technologies produce billions of small fragments of information that must be assembled to produce useful information about the target genome
Eukaryotic genomes are very large and complex
Billions of bases of DNA
Repetitive sequence
Polymorphisms
Computing
128 processors available in clusters
128 GB RAM, 24-processor server for Next-Gen sequence assembly
Currently ~200 TB of redundant storage
WVU HPC Cluster: 2300-node high-capacity cluster with up to 500 GB of RAM
STARS Server (WV-INBRE): 1.5 TB of RAM
Sequencing Technology
Next Generation Sequencing
Illumina MiSeq/HiSeq
Ion Torrent
Third Generation Sequencing
PacBio Sequencing
Nanopore Technology
Comparison of HT-NGS sequencing platforms
Sequencing platform | Sequencing chemistry | Template amplification method | Read length | Sequencing throughput/run | Sequencing run time | Approx. cost per machine
Roche (454) FLX | Pyrosequencing | Emulsion PCR | 400-800 bases | 0.40–0.60 Gb | 10 h | $700K
Illumina HiSeq | Synthesis by reversible dye terminators | Bridge PCR | 100 to 250 bases | 200–300 Gb | 6 days | $700K
Illumina MiSeq | Synthesis by reversible dye terminators | Bridge PCR | 100 to 300 bases | 1.5 to 7.5 Gb | 10 to 20 hours | $125K
ABI SOLiD | Sequencing by ligation | Emulsion PCR | 50-100 bases | 100–200 Gb | 6 days | $1M
Ion Torrent/Ion Proton | Semiconductor sequencing | Emulsion PCR | 100 to 400 bp | 0.1 to 35 Gb | 2 hours | $50K to $150K
Library Preparation
TruSeq DNA and Small RNA Sample Preparation Kits
How does the MiSeq Work?
Option 1: Flashy Illumina video: https://www.youtube.com/watch?v=womKfikWlxM
Or Option 2: my potentially boring explanation
How Does the MiSeq Work?
1. Library Preparation
2. Cluster Generation
3. Sequencing
4. Data Analysis
Cluster Generation
Bind single DNA molecules to surface
Amplify on surface
~1000 molecules per ~1 µm cluster
Hybridize Fragment & Extend
Surface of flow cell coated with a lawn of oligo pairs
Single DNA libraries are hybridized to primer lawn
Bound libraries then extended by polymerases
Figure labels: adapter sequence, 3’ extension, newly synthesized strand, original template
Denature Double-Stranded DNA
Double-stranded molecule is denatured
Original template washed away (discarded)
Newly synthesized strand is covalently attached to flow cell surface
Bridge Amplification
Single-stranded molecule flips over and forms a bridge by hybridizing to adjacent, complementary primer
Hybridized primer is extended by polymerases
Bridge Amplification
Double-stranded bridge is formed
Denature Double-Stranded Bridge
Double-stranded bridge is denatured
Result: Two copies of covalently bound single-stranded templates
Bridge Amplification
Single-stranded molecules flip over to hybridize to adjacent primers
Hybridized primer is extended by polymerase
Bridge Amplification
Bridge amplification cycle repeated until multiple bridges are formed
Linearization
dsDNA bridges are denatured
Reverse Strand Cleavage
Reverse strands cleaved and washed away, leaving a cluster with forward strands only
Blocking
Free 3’ ends are blocked to prevent unwanted DNA priming
Read 1 Primer Hybridization
Sequencing primer is hybridized to adapter sequence
MiSeq Sequencing Workflow
1. Library Preparation
2. Cluster Generation
3. Sequencing
4. Data Analysis
Add 4 Fl-NTPs + polymerase
Incorporated Fl-NTP imaged
Terminator & fluorescent dye cleaved from Fl-NTP
Repeat for 36 to 151 cycles
Sequencing by Synthesis
Sequencing
A image, C image, T image, G image
After imaging is complete for one section (tile), the flow cell is moved to the next tile and the process is repeated
Clusters are imaged using LED and filter combinations specific for each fluorescently labeled nucleotide
Imaging for the 1st cycle takes ~3 min., including focusing routines
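The idea behind reading one base per cycle from the four images can be sketched as picking the brightest channel for a cluster in each cycle. This is a toy model with made-up intensity values. Real base callers also correct for channel crosstalk and phasing and assign quality scores, none of which is shown here.

```python
BASES = "ACGT"


def call_bases(cycles):
    """Call one base per cycle for a single cluster.

    `cycles` is a list with one dict per sequencing cycle, mapping each
    base to the signal measured for this cluster in that base's image.
    The base with the strongest signal is called.
    """
    read = []
    for signals in cycles:
        read.append(max(BASES, key=lambda base: signals[base]))
    return "".join(read)


if __name__ == "__main__":
    # Hypothetical intensities for three cycles of one cluster.
    cluster_signals = [
        {"A": 0.90, "C": 0.10, "G": 0.00, "T": 0.05},
        {"A": 0.10, "C": 0.10, "G": 0.80, "T": 0.20},
        {"A": 0.00, "C": 0.10, "G": 0.10, "T": 0.70},
    ]
    print(call_bases(cluster_signals))
```

Because every molecule in a cluster carries the same sequence, the cluster emits one dominant color per cycle, which is what makes this per-cycle readout possible.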
Ion Torrent Technology
Ion Torrent Platforms
Ion PGM
10 Mb to 1 Gb capacity per run
50 to 200 bp reads
$500 per run
Ion Proton
30 Gb capacity per run
200 to 400 bp reads
$1000 per run
PacBio Sequencing
Longest reads currently available (up to 10 kb with strobing)
Very high error rates
Effective in “hybrid assemblies” combining accurate technology (Illumina) with long reads
Nanopore sequencing
Structure of Protein Nanopore
https://www.iths.org/sites/www.iths.org/files/eventmedia/ITHS_ThirdGenerationSequencers.pdf
The Future: nanopore sequencing
Supposedly will sequence a human genome in one day
Single strand sequencing
Reads in the hundreds of kb size range
Applications of NextGen Sequencing
Whole genome de-novo sequencing
Genome resequencing: discovery of polymorphisms among and within individuals
Identification of disease determinants
Diagnosis
Transcriptome sequencing/gene expression
Metagenomics
Population genetics/marker analyses
Genotyping by Sequencing
New sequencing methods generate tens of millions of short sequences per run
Combine restriction digests with sequencing and pooling to genotype thousands of markers covering genome at very high density
http://www.maizegenetics.net/images/stories/GBS_CSSA_101102sem.pdf
Generate tens of thousands of markers for <$100 per sample
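The restriction-digest step that genotyping by sequencing builds on can be sketched as an in-silico digest. The example below is an illustration only: it uses the well-known EcoRI site (GAATTC, cut after the first G) as a stand-in and is not the enzyme or protocol from the lecture; the function name and test sequence are hypothetical.

```python
import re


def digest(sequence: str, site: str, cut_offset: int):
    """Cut `sequence` at every occurrence of the recognition `site`.

    `cut_offset` is how many bases into the site the cut falls
    (1 for EcoRI, which cuts G^AATTC). Only one strand is modeled.
    """
    sequence = sequence.upper()
    # A lookahead finds every site start, including overlapping sites.
    starts = [m.start() for m in re.finditer(f"(?={re.escape(site.upper())})", sequence)]
    fragments = []
    previous = 0
    for start in starts:
        cut = start + cut_offset
        fragments.append(sequence[previous:cut])
        previous = cut
    fragments.append(sequence[previous:])
    return [fragment for fragment in fragments if fragment]


if __name__ == "__main__":
    # Hypothetical sequence containing two EcoRI sites.
    for fragment in digest("AAGAATTCTTGAATTCGG", "GAATTC", 1):
        print(len(fragment), fragment)
```

Reads start at the cut sites, so the same genomic locations are sampled in every individual. A mutation inside a recognition site removes the cut and the reads that would begin there, which is one way a presence-absence polymorphism arises.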
Presence-Absence Polymorphism
SNP
Genotyping by Sequencing Cost Example
http://www.maizegenetics.net/gbs-overview