
Lecture 4: DNA Sequencing in the Genomics Era

Sandy Simon

Genomics Research Fellow

Department of Biology

August 28, 2015

What is Genomics?

Study and analysis of all of the DNA contained in an organism

Comprehensive blueprint for what makes each individual unique

Powerful method for studying the integrated functions of an organism and even of a whole community

http://www.nature.com

Genomics is revolutionizing the study of living systems

Human health

Agriculture

Environment

Fundamental biology

Huge economic payoffs from genomics research:

$3.8 billion investment in human genome sequencing has yielded $796 billion in economic development and 310,000 jobs in the United States1

http://cisncancer.org

1. Tripp and Grueber 2011. Economic Impact of the Human Genome Project. Battelle Memorial Institute.

Impacts of Genomics

PCR and the Molecular Revolution

PCR: Polymerase Chain Reaction

Invented by Kary Mullis in 1983

Exponential amplification of a specific sequence of DNA

Most important molecular marker techniques involve PCR

Components: primers, nucleotides, template, thermostable polymerase
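
The exponential amplification described above can be illustrated with a minimal sketch. It assumes the textbook idealization in which every template is copied in every cycle; the `efficiency` parameter is a hypothetical knob added here for illustration and is not taken from the slides.

```python
def pcr_copies(initial_templates: int, cycles: int, efficiency: float = 1.0) -> float:
    """Expected number of copies after a number of PCR cycles.

    Assumes each cycle multiplies the template count by (1 + efficiency),
    so efficiency = 1.0 is perfect doubling.
    """
    if not 0.0 <= efficiency <= 1.0:
        raise ValueError("efficiency must be between 0 and 1")
    return initial_templates * (1.0 + efficiency) ** cycles


if __name__ == "__main__":
    # Starting from a single template molecule with perfect doubling
    for n in (10, 20, 30):
        print(f"{n} cycles: {pcr_copies(1, n):,.0f} copies")
```

With perfect doubling, 30 cycles turn one molecule into roughly a billion copies, which is why a trace amount of template is enough.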

http://www.dnalc.org/ddnalc/resources/pcr.html


Molecular markers provide closer link between phenotype and genotype

“Anonymous” molecular markers: RFLP, RAPD, AFLP and GBS: no knowledge of underlying sequence polymorphism or location in genome

“Sequence-Tagged” markers like microsatellites or SNPs derived from defined locations in genome

Often reveal higher levels of polymorphism than allozymes and morphological markers

Allow studies of neutral variation in natural populations

Molecular Markers

Anonymous markers often have short “primer” sequences (e.g., 10 bp primer sequences in RAPD)

Randomly amplify portions of genome

Sequence-Tagged markers have longer primers (e.g., 20 bp for microsatellite primers)

Anonymous and Sequence-Tagged Markers

AGTTCAGAGT

ATGCTGAGGTCGCTTAGCAGctctctctctctctctctctcctctctctctctctGGATCCTGAATGCTGACTG

TCAAGTCTCA

agctggactacctctacgtcagcTGAGACTTGAACTCTGAACT

ATGCTGAGGTCGCTTAGCAGctctctctctctctGGATCCTGAATGCTGACTG

DNA Sequencing

Direct determination of sequence of bases at a location in the genome

Shotgun versus PCR sequencing

Dye terminators (Sanger) and capillaries revolutionized DNA sequencing

Modern sequencing methods (sequencing by synthesis, pyrosequencing) have catapulted sequencing into the realm of population genetics

SNPs

A Single Nucleotide Polymorphism (SNP) is a single base mutation in DNA.

The most common source of genetic polymorphism (e.g., 90% of all human DNA polymorphisms).

Identify SNPs by screening a sample of individuals from the study population: usually 16 to 48

Once identified, SNPs are assayed in populations using high-throughput methods
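
As a rough illustration of the screening step, the sketch below scans a small panel of already-aligned sequences and reports the columns where more than one base is observed. The panel sequences and the function name `find_snps` are made up for this example; real SNP discovery also has to deal with alignment, sequencing error, and heterozygous calls.

```python
from collections import Counter


def find_snps(aligned_seqs):
    """Return {position: Counter of bases} for columns with more than one base.

    Positions are 0-based. All sequences must already be aligned and
    have the same length.
    """
    if len({len(s) for s in aligned_seqs}) != 1:
        raise ValueError("sequences must be non-empty and of equal length")
    snps = {}
    for pos, column in enumerate(zip(*aligned_seqs)):
        counts = Counter(base.upper() for base in column)
        if len(counts) > 1:  # more than one allele seen at this site
            snps[pos] = counts
    return snps


if __name__ == "__main__":
    # Hypothetical screening panel of four individuals
    panel = ["ATGCTGAGGT", "ATGCTGAGGT", "ATGCTAAGGT", "ATGCTGAGGT"]
    for pos, counts in find_snps(panel).items():
        print(pos, dict(counts))
```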

If nucleotides occur randomly in a genome, which sequence should occur more frequently?

AGTTCAGAGT

AGTTCAGAGTAACTGATGCT

What is the expected probability of each sequence to occur once?

How many times would each sequence be expected to occur by chance in a 100 Mb genome?

AGTTCAGAGT

What is the expected probability of each sequence to occur once?

What is the sample space for the first position?

A, T, G, C

Probability of “A” at that position?

$\frac{1}{4}$

Probability of “A” at position 1, “G” at position 2, “T” at position 3, etc.?

$\left(\frac{1}{4}\right)^{10} = 0.25^{10} = 9.54 \times 10^{-7}$

For the 20-base sequence:

$0.25^{20} = 9.09 \times 10^{-13}$

AGTTCAGAGT

How many times would each sequence be expected to occur in a 100 Mb genome?

$9.54 \times 10^{-7} \times 10^{8} = 95.4$

$9.09 \times 10^{-13} \times 10^{8} = 9.1 \times 10^{-5}$

Why is this calculation wrong?
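
The arithmetic on these slides can be reproduced with a short sketch. It deliberately keeps the slides' simplifying assumptions: every base has probability 1/4, positions are independent, only one strand is considered, and the genome length is used as the number of possible start positions. The function names are hypothetical.

```python
def kmer_probability(kmer: str, base_prob: float = 0.25) -> float:
    """Probability of seeing this exact sequence at one given position,
    assuming independent bases that each occur with probability base_prob."""
    return base_prob ** len(kmer)


def expected_occurrences(kmer: str, genome_size: int) -> float:
    """Expected number of occurrences, using the genome length as the
    number of start positions (the same approximation as the slides)."""
    return kmer_probability(kmer) * genome_size


if __name__ == "__main__":
    genome_size = 100_000_000  # 100 Mb
    for seq in ("AGTTCAGAGT", "AGTTCAGAGTAACTGATGCT"):
        p = kmer_probability(seq)
        n = expected_occurrences(seq, genome_size)
        print(f"{len(seq)}-mer: p = {p:.3g}, expected count = {n:.3g}")
```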

www.wv-inbre.net/bioinformatics/slides/IST444Genomicsequencing.ppt


Automated DNA Sequence Readouts


Capillary Sequencers


Revolutionized DNA sequencing by enabling multiple samples to be analyzed in parallel



Human Genome Project Sequencing Strategy

Clone-based physical mapping

Digest genome and make Bacterial Artificial Chromosomes (BACs, 150,000 bp each)

Digest BACs to create fingerprints

Organize BACs to form contigs

Select BAC clones for sequencing

Shear BACs and shotgun clone

Sequence clones and assemble overlaps


J. Craig Venter and Shotgun Sequencing

Proposed a whole-genome shotgun sequencing method to NIH in 1991. Proposal rejected.

Sets up The Institute for Genomic Research (TIGR) in 1992 (private and non-profit)

TIGR publishes the first complete genome sequence in 1995 (Haemophilus influenzae)

Forms Celera Genomics in 1998 to sequence human genome in three years (private, for-profit)

“The Sequence of the Human Genome” is published in Science, February 2001

Venter departs Celera, 2002

www.wv-inbre.net/bioinformatics/slides/IST444Genomicsequencing.ppt

Shotgun Sequencing Strategy

Whole-genome shotgun sequencing of five individuals with 5 to 100 fold coverage

Computer assembles overlapping sequences to form contigs

Contigs are assembled into scaffolds

Scaffolds are mapped to the genome by two or more Sequence Tagged Site (STS) markers
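
Fold coverage, as used above, is usually estimated as (number of reads × read length) / genome size. The sketch below applies that relationship; the 3 Gb genome size and 500-base read length in the usage example are illustrative assumptions, not values from the slides.

```python
import math


def fold_coverage(num_reads: int, read_length: int, genome_size: int) -> float:
    """Average number of times each base is expected to be sequenced."""
    return num_reads * read_length / genome_size


def reads_needed(target_coverage: float, read_length: int, genome_size: int) -> int:
    """Smallest number of reads that reaches the target average coverage."""
    return math.ceil(target_coverage * genome_size / read_length)


if __name__ == "__main__":
    genome = 3_000_000_000  # assumed genome size of about 3 Gb
    read_len = 500          # assumed read length in bases
    for cov in (5, 100):
        print(f"{cov}x coverage needs {reads_needed(cov, read_len, genome):,} reads")
```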



NextGen Challenge: Sequence Assembly

New sequencing technologies produce billions of small fragments of information that must be assembled to produce useful information about the target genome

Eukaryotic genomes are very large and complex

Billions of bases of DNA

Repetitive sequence

Polymorphisms
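
To make the assembly problem concrete, here is a toy greedy assembler that repeatedly merges the two fragments with the longest suffix-to-prefix overlap. It is only a sketch under strong assumptions (error-free reads, a single strand, no repeats); production assemblers use overlap-layout-consensus or de Bruijn graph methods, and the repeats and polymorphisms listed above are exactly what breaks a naive approach like this one. The reads in the example are hypothetical.

```python
def overlap(a: str, b: str, min_len: int = 3) -> int:
    """Length of the longest suffix of a that equals a prefix of b.

    Returns 0 if no overlap of at least min_len bases exists.
    """
    for k in range(min(len(a), len(b)), min_len - 1, -1):
        if a.endswith(b[:k]):
            return k
    return 0


def greedy_assemble(reads, min_len: int = 3):
    """Merge the best-overlapping pair until no overlaps remain.

    Returns the list of resulting contigs.
    """
    contigs = list(dict.fromkeys(reads))  # drop exact duplicates, keep order
    while True:
        best_k, best_i, best_j = 0, None, None
        for i, a in enumerate(contigs):
            for j, b in enumerate(contigs):
                if i != j:
                    k = overlap(a, b, min_len)
                    if k > best_k:
                        best_k, best_i, best_j = k, i, j
        if best_k == 0:
            return contigs
        merged = contigs[best_i] + contigs[best_j][best_k:]
        contigs = [c for idx, c in enumerate(contigs)
                   if idx not in (best_i, best_j)] + [merged]


if __name__ == "__main__":
    # Three hypothetical overlapping reads from a 20-base target
    reads = ["GAGGTCGCTT", "ATGCTGAGGT", "CGCTTAGCAG"]
    print(greedy_assemble(reads))
```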

Computing

128 processors available in clusters

128 GB RAM, 24-processor server for Next-Gen sequence assembly

Currently ~200 TB of redundant storage

WVU HPC Cluster: 2300 node high capacity cluster with up to 500 GB of RAM

STARS Server (WV-INBRE): 1.5 TB of RAM

Sequencing Technology

Next Generation Sequencing

Illumina MiSeq/HiSeq

Ion Torrent

Third Generation Sequencing

PacBio Sequencing

Nanopore Technology

Comparison of HT-NGS Sequencing Platforms

| Sequencing platform | Roche (454) FLX | Illumina HiSeq | Illumina MiSeq | ABI SOLiD | Ion Torrent/Ion Proton |
|---|---|---|---|---|---|
| Sequencing chemistry | Pyrosequencing | Synthesis by reversible dye terminators | Synthesis by reversible dye terminators | Sequencing by ligation | Semiconductor sequencing |
| Template amplification method | Emulsion PCR | Bridge PCR | Bridge PCR | Emulsion PCR | Emulsion PCR |
| Read length | 400 to 800 bases | 100 to 250 bases | 100 to 300 bases | 50 to 100 bases | 100 to 400 bases |
| Sequencing throughput/run | 0.40–0.60 Gb | 200–300 Gb | 1.5 to 7.5 Gb | 100–200 Gb | 0.1 to 35 Gb |
| Sequencing run time | 10 hours | 6 days | 10 to 20 hours | 6 days | 2 hours |
| Approx. cost per machine | $700K | $700K | $125K | $1M | $50K to $150K |

Library Preparation

TruSeq DNA and Small RNA Sample Preparation Kits

How does the MiSeq Work?

Option 1: Flashy Illumina video: https://www.youtube.com/watch?v=womKfikWlxM

Or Option 2: my potentially boring explanation

How Does the MiSeq Work?

1 Library Preparation

2 Cluster Generation

3 Sequencing

4 Data Analysis

Cluster Generation

Bind single DNA molecules to surface

Amplify on surface

~1000 molecules per ~1 µm cluster

Hybridize Fragment & Extend

Adapter sequence

3’ extension

Surface of flow cell coated with a lawn of oligo pairs

Single DNA libraries are hybridized to primer lawn

Bound libraries then extended by polymerases

Newly synthesized strand

Original template

Denature Double-Stranded DNA

discard

Double-stranded molecule is denatured

Original template washed away

Newly synthesized strand is covalently attached to flow cell surface

Bridge Amplification

Single-stranded molecule flips over and forms a bridge by hybridizing to adjacent, complementary primer

Hybridized primer is extended by polymerases


Double-stranded bridge is formed

Denature Double-Stranded Bridge

Double-stranded bridge is denatured

Result: Two copies of covalently bound single-stranded templates

Single-stranded molecules flip over to hybridize to adjacent primers

Hybridized primer is extended by polymerase

Bridge amplification cycle repeated until multiple bridges are formed

Linearization

dsDNA bridges are denatured

Reverse Strand Cleavage

Reverse strands cleaved and washed away, leaving a cluster with forward strands only

Blocking

Free 3’ ends are blocked to prevent unwanted DNA priming

Read 1 Primer Hybridization

Sequencing primer

Sequencing primer is hybridized to adapter sequence

MiSeq Sequencing Workflow

1 Library Preparation

2 Cluster Generation

3 Sequencing

4 Data Analysis

Add 4 Fl-NTPs + polymerase

Incorporated Fl-NTP imaged

Terminator & fluorescent dye cleaved from Fl-NTP

Repeat × 36 to 151 cycles

Sequencing by Synthesis

Sequencing

A image

C image

T image

G image

After imaging is complete for one section (tile), the flow cell is moved to the next tile and the process is repeated

Clusters are imaged using LED and filter combinations specific for each fluorescently labeled nucleotide

Imaging for the 1st cycle takes ~3 min., including focusing routines
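
The per-cycle imaging described above can be turned into a read with a very simplified base caller: for each cycle, pick the channel with the brightest signal for a given cluster. This is a toy sketch with made-up intensity values; actual instrument software also corrects for channel crosstalk and phasing and assigns a quality score to each call.

```python
CHANNELS = "ACTG"  # image order used on the slide: A, C, T, G


def call_bases(intensities) -> str:
    """Call one base per cycle for a single cluster.

    intensities is a sequence of per-cycle tuples (A, C, T, G) holding
    the signal measured in each of the four images.
    """
    read = []
    for cycle in intensities:
        if len(cycle) != len(CHANNELS):
            raise ValueError("each cycle needs exactly four channel intensities")
        brightest = max(range(len(CHANNELS)), key=lambda i: cycle[i])
        read.append(CHANNELS[brightest])
    return "".join(read)


if __name__ == "__main__":
    # Hypothetical intensities for one cluster over three cycles
    cluster = [
        (0.90, 0.10, 0.05, 0.02),
        (0.10, 0.05, 0.10, 0.80),
        (0.00, 0.10, 0.95, 0.10),
    ]
    print(call_bases(cluster))
```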

Ion Torrent Technology

Ion Torrent Platforms

Ion PGM

10 Mb to 1 Gb capacity per run

50 to 200 bp reads

$500 per run

Ion Proton

30 Gb capacity per run

200 to 400 bp reads

$1000 per run

PacBio Sequencing

Longest reads currently available (up to 10 kb with strobing)

Very high error rates

Effective in “hybrid assemblies” combining accurate technology (Illumina) with long reads

Nanopore sequencing

Structure of Protein Nanopore

https://www.iths.org/sites/www.iths.org/files/eventmedia/ITHS_ThirdGenerationSequencers.pdf

The Future: nanopore sequencing

Supposedly will sequence a human genome in one day

Single-strand sequencing

Reads in the hundreds of kb size range

Applications of NextGen Sequencing

Whole genome de-novo sequencing

Genome resequencing: discovery of polymorphisms among and within individuals

Identification of disease determinants

Diagnosis

Transcriptome sequencing/gene expression

Metagenomics

Population genetics/marker analyses

Genotyping by Sequencing

New sequencing methods generate tens of millions of short sequences per run

Combine restriction digests with sequencing and pooling to genotype thousands of markers covering genome at very high density

http://www.maizegenetics.net/images/stories/GBS_CSSA_101102sem.pdf

Generate tens of thousands of markers for <$100 per sample
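
The restriction-digest step can be sketched as an in-silico digest. The example uses the EcoRI recognition site (GAATTC, cut after the first G) purely as a familiar illustration; it is not necessarily the enzyme used in the GBS protocol referenced here, and the input sequences are made up. A single base change inside a recognition site removes the cut, which is one way a presence-absence polymorphism can arise.

```python
def digest(sequence: str, site: str = "GAATTC", cut_offset: int = 1):
    """Cut a sequence at every occurrence of a recognition site.

    cut_offset is the number of bases into the site where the cut falls
    (1 for EcoRI, which cuts between G and AATTC). Only the given strand
    is searched. Returns the fragments in order.
    """
    seq = sequence.upper()
    fragments = []
    start = 0
    pos = seq.find(site)
    while pos != -1:
        cut = pos + cut_offset
        fragments.append(seq[start:cut])
        start = cut
        pos = seq.find(site, pos + 1)
    fragments.append(seq[start:])  # remainder after the last cut
    return fragments


if __name__ == "__main__":
    allele_1 = "AAAGAATTCCCCGAATTCTTT"  # two intact sites
    allele_2 = "AAAGAATTCCCCGACTTCTTT"  # second site lost to a single base change
    print(digest(allele_1))
    print(digest(allele_2))
```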

Presence-Absence Polymorphism

SNP

Genotyping by Sequencing Cost Example

http://www.maizegenetics.net/gbs-overview
