Date post: | 13-Jan-2016 |
Upload: | eustace-butler |
Genome Sequencing: Technology and Strategies
Chuong Huynh
NIH/NLM/NCBI
[email protected]@ncbi.nlm.nih.gov
Acknowledgement: Daniel Lawson (Sanger Institute) and Jane Carlton (TIGR)
Bioinformatics Flow Chart
1a. Sequencing
1b. Analysis of nucleic acid seq.
2. Analysis of protein seq.
3. Molecular structure prediction
4. Molecular interaction
5. Metabolic and regulatory networks
6. Gene & Protein expression data
7. Drug screening
Ab initio drug design OR drug compound screening in database of molecules
8. Genetic variability
How to sequence a genome
• development of sequencing strategy and source of funding
• procurement of DNA and initial library construction
• test sequencing
• large-scale random sequencing of small (2-3 kb), medium (10 kb) and large (>50 kb) libraries
• analysis of raw sequence data by: BLAST, RepeatFinder etc.
• release of genome data onto sequencing center website
• at 8-10X coverage, random sequencing stops
• closure of sequence gaps and physical gaps
• comparison to physical map
• gene model prediction
• final gene model annotation
• release of data to GenBank and publication
Full shotgun sequencing
[Figure: genomic DNA (Marker1, Marker2) → large insert library (20-500 kb) and minimal tiling path → shotgun library: small (2-3 kb) and medium (10 kb) → sequencing (8-10X) → assembly into contigs and scaffolds → gap closure → gene prediction, annotation and analysis]
Partial shotgun sequencing
[Figure: genomic DNA → shotgun library: small (2-3 kb) and medium (10 kb) → sequencing (5X) → assembly into contigs and scaffolds → analysis]
Genome sequencing terms
• Raw sequence: unassembled sequence reads produced from sequencing of inserts from individual recombinant clones of a genomic DNA library.
• Finished sequence: complete sequence of a genome with no gaps and an accuracy of > 99.9%.
• Genome coverage: average number of times a nucleotide is represented by a high-quality base in random raw sequence.
• Full shotgun coverage: genome coverage in random raw sequence required to produce finished sequence, usually 8-10 fold (‘8-10X’).
• Partial shotgun coverage: typically 3-6X random coverage of a genome, which produces sequence data of sufficient quality to enable gene identification but which is not sufficient to produce a finished genome sequence.
• Paired reads: sequence reads determined from both ends of a cloned insert in a recombinant clone.
• Contig: contiguous DNA sequence produced from joining overlapping raw sequence reads.
• Singleton: single sequence read that cannot be joined (‘assembled’) into a contig.
• Scaffold: a group of ordered and orientated contigs known to be physically linked to each other by paired read information.
• EST: expressed sequence tag generated by sequencing one end of a recombinant clone from a cDNA library. ESTs are single-pass reads and therefore prone to contain sequence errors.
• GSS: genome survey sequence generated by sequencing one end of a recombinant clone from a genomic DNA library. The genomic DNA library can in some instances be enriched for the presence of coding regions, for example through use of mung bean nuclease digestion of genomic DNA prior to cloning.
• SNP: single nucleotide polymorphism.
• ORF: open reading frame, stretches of codons in the same reading frame uninterrupted by STOP codons and calculated from a six-frame translation of DNA sequence.
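The coverage terms above reduce to simple arithmetic. The sketch below is illustrative only: it assumes a uniform read length (the 500 bp value is an assumption, not a figure from these slides) and uses the standard Poisson (Lander-Waterman) approximation, under which a base is left unsampled with probability e^(-c) at c-fold coverage. The function names are hypothetical.

```python
import math


def fold_coverage(num_reads: int, read_length: int, genome_size: int) -> float:
    """Average number of times each base is sampled: c = N * L / G."""
    return num_reads * read_length / genome_size


def reads_needed(target_coverage: float, read_length: int, genome_size: int) -> int:
    """Number of reads of a given length needed to reach a target fold coverage."""
    return math.ceil(target_coverage * genome_size / read_length)


def expected_fraction_covered(coverage: float) -> float:
    """Poisson approximation: fraction of bases hit by at least one read."""
    return 1.0 - math.exp(-coverage)


if __name__ == "__main__":
    genome_size = 30_000_000  # ~30 Mb, the P. falciparum estimate quoted in the slides
    read_length = 500         # assumed high-quality read length (illustrative)
    for c in (3, 5, 8, 10):
        n = reads_needed(c, read_length, genome_size)
        frac = expected_fraction_covered(c)
        print(f"{c}X: {n} reads, ~{frac:.4%} of bases expected to be sampled")
```

Under this model, partial shotgun coverage (3-6X) already samples most bases, while the last fraction of a percent is what the 8-10X full shotgun target and later gap closure address. Real projects fall short of the model because cloning and sequencing are not uniformly random.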
Jan 2003
NCBI Trace Archive, Sep 23, 2003
Large-scale genome projects
Libraries
Sequencing
Release
Assembly
Annotation
Closure
Strategy
• Sequencing DNA molecules in the Mb size range
• All strategies employ the same underlying principles:
Random shotgun sequencing
[Figure: genomic DNA → shearing/sonication → subclone and sequence (shotgun reads) → assembly (contigs) → finishing (finishing reads) → complete sequence]
Strategies for sequencing
• How big can you go?
• Large-insert clones
• cosmids 30-40 kb
• BACs/PACs 50 - 100 kb
• Whole chromosomes
• Whole genomes
Genome size and sequencing strategies
[Figure: genomes placed on a log genome-size axis (0 to 4, log Mb) against the strategy used: clone-by-clone, Whole Chromosome Shotgun (WCS), Whole Genome Shotgun (WGS), and WGS with clone ‘skims’]
• E. coli (4 Mb)
• S. cerevisiae (14 Mb)
• P. falciparum (30 Mb)
• C. elegans (100 Mb)
• D. melanogaster (170 Mb)
• H. sapiens (3000 Mb)
Strategies for sequencing
• Size and GC composition of genome
• Volume of data
• Ease of cloning
• Ease of sequencing
• Genome complexity
• dispersed repetitive sequence
• telomeres & centromeres
• Politics/Funding
Strategies: Clone by Clone
• Simple (0.5 - 2 K reads)
• Few problems with repeats
• Relatively simple informatics
• Scalability
• Quality of physical map
• Fingerprint / STS maps
• End sequencing
Strategies: Whole Chromosome shotgun (WCS)
• Requires chromosome isolation
• Moderate complexity (10’s K reads)
• Problems with repeats
• Complex informatics
• Inefficient in isolation
• Quality of physical map (want good physical map)
• Skims of mapped clones
Strategies: Whole Genome shotgun (WGS)
• Moderate to High complexity (10-100’s K reads)
• Massive Problems with repeats
• Complex informatics
• Quality of physical map
• Fingerprint map
• STS markers
• End-sequences
• Skims of mapped clones
Sequencing my genome
[Figure: Politics, Production, Finishing and Annotation set against TIME and MONEY]
What do you get?
• Sequence
• incomplete → complete
• First-pass annotation
• Gene discovery
• Full annotation
• A starting point for research
DATA!!, DATA !!, and more DATA!!
Genome annotation is central to functional genomics
Gene Knockout
Expression Microarray
RNAi phenotypes
ORFeome based functional genomics
Where is the problem?
Most genomes can and will be sequenced; few problems are unsolvable.
The problem lies in understanding what you have:
• gene prediction
• annotation
Sequencing
• Library construction
• Colony picking (random)
• DNA preparation (isolate DNA)
• Sequencing reactions
• Electrophoresis
• Tracking/Base calling
Libraries
• Essentially Sub-cloning
• Generation of small insert libraries in a well characterised vector.
• Ease of propagation
• Ease of DNA purification
• e.g. pUC18, M13
Libraries - testing
• Simple concepts
• Insert/Vector ratio (Blue/White ratio)
• Real data
• Insert size
• Sequence ….
• Simple analysis
Sequence generation
• Pick colonies into growth medium
• Template preparation (DNA isolation)
• Sequence reactions
• Standard terminator chemistry
• pUC libraries sequenced with forward and reverse primers
• Tracking and noise
Sequence generation
• Electrophoresis of products
• Old style - slab gels, 32 > 64 > 96 lanes
• New style - capillary gels, 96 lanes
• Transfer of gel image to UNIX
• Sequencing machines use a slave Mac/PC
• Move data to centralised storage area for processing
Gel image processing
• Light-to-Dye estimation
• Lane tracking
• Lane editing
• Trace extraction
• Trace standardisation
• Mobility correction
• Background subtraction
Pre-processing
• Base calling using Phred
• modifies SCF file format
• Quality clipping from Phred
• Vector clipping
• Sequencing vector
• Cloning vector
• Screen for contaminants
• Feature mark up (repeats/transposons)
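Base-call quality values on the Phred scale relate to error probability by Q = -10 log10(P), so Q20 corresponds to a 1-in-100 error and Q40 to 1 in 10,000. The sketch below shows that conversion plus one simple way to quality-clip a read. The clipping rule (keep the segment that maximises the sum of Q minus a threshold) is an illustrative assumption and not necessarily what Phred itself does; the function names and the threshold of 20 are hypothetical.

```python
import math
from typing import Sequence, Tuple


def phred_to_error_prob(q: float) -> float:
    """Convert a Phred-scaled quality value to a base-call error probability."""
    return 10 ** (-q / 10)


def error_prob_to_phred(p: float) -> float:
    """Convert a base-call error probability to a Phred-scaled quality value."""
    return -10 * math.log10(p)


def quality_clip(quals: Sequence[int], threshold: int = 20) -> Tuple[int, int]:
    """Return (start, end), end exclusive, of the segment maximising sum(q - threshold).

    Returns (0, 0) when no base exceeds the threshold.
    """
    best_sum, best_span = 0, (0, 0)
    run_sum, run_start = 0, 0
    for i, q in enumerate(quals):
        if run_sum <= 0:
            # A run that has dropped to zero or below cannot help; restart here.
            run_sum, run_start = 0, i
        run_sum += q - threshold
        if run_sum > best_sum:
            best_sum, best_span = run_sum, (run_start, i + 1)
    return best_span


if __name__ == "__main__":
    quals = [5, 8, 30, 35, 40, 38, 12, 6]  # made-up per-base qualities
    start, end = quality_clip(quals)
    print(f"keep bases {start}..{end - 1}")
    print(f"Q20 error probability: {phred_to_error_prob(20):.4f}")
```

Vector clipping would then be applied on top of the quality-clipped coordinates, so that only high-quality insert sequence goes forward to assembly.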
Finishing
• Assembly: process of assembling raw single-pass reads into a contiguous consensus sequence (Phred/Phrap)
• Closure: Process of ordering and merging consensus sequences into a single contiguous sequence
• Finished is defined as sequenced on both strands using multiple clones. In the absence of multiple clones the clone must be sequenced with multiple chemistries. The overall error rate is estimated at less than 1 error per 10 kb
Genome Assembly
• Pre-assembly (assembly algorithm)
• Assembly
• Automated appraisal
• Manual review
Pre-Assembly
• Convert to CAF format
• flatfile text format
• choice of assembler
• choice of post-assembly modules
• choice of assembly editor
www.sanger.ac.uk/Software/CAF
Assembly
• Assemble using Phrap
• Read fasta & quality scores from CAF file
• Merge existing Phrap .ace file (previous assembly) as necessary
• Adjust clipping (where vector, quality start)
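To make the idea of assembly concrete, here is a toy greedy overlap-merge assembler. It is not Phrap's algorithm: it ignores quality scores, sequencing errors and reverse-complement reads, and it will mis-join repeats. It only illustrates how overlapping reads collapse into contigs and how a read with no overlap remains a singleton. All names and the minimum overlap are illustrative assumptions.

```python
from typing import List


def overlap(a: str, b: str, min_len: int) -> int:
    """Length of the longest suffix of a that equals a prefix of b (0 if below min_len)."""
    for k in range(min(len(a), len(b)), min_len - 1, -1):
        if a.endswith(b[:k]):
            return k
    return 0


def greedy_assemble(reads: List[str], min_overlap: int = 3) -> List[str]:
    """Repeatedly merge the pair of sequences with the longest exact overlap."""
    contigs = list(dict.fromkeys(reads))  # drop exact duplicates, keep order
    # Drop reads wholly contained in another read.
    contigs = [r for r in contigs if not any(r != s and r in s for s in contigs)]
    while True:
        best_len, best_i, best_j = 0, -1, -1
        for i, a in enumerate(contigs):
            for j, b in enumerate(contigs):
                if i != j:
                    k = overlap(a, b, min_overlap)
                    if k > best_len:
                        best_len, best_i, best_j = k, i, j
        if best_len == 0:
            return contigs  # nothing left to join
        merged = contigs[best_i] + contigs[best_j][best_len:]
        contigs = [c for idx, c in enumerate(contigs) if idx not in (best_i, best_j)]
        contigs.append(merged)


if __name__ == "__main__":
    reads = ["ATGGCGTA", "GCGTACCTG", "CCTGAAGTC", "TTTTTTTT"]
    for contig in greedy_assemble(reads):
        print(contig)
```

In this example the first three reads join into one contig and the fourth stays a singleton. The all-pairs comparison is quadratic in the number of reads, which is one reason real assemblers index overlaps rather than compare every pair.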
Assembly appraisal
• auto-edit
• removes 70% of read discrepancies in the sequence assembly; the remainder (which can highlight misassembly) are reviewed manually
• Remove cloning vector
• Mark up sequence features (for finisher)
• “Finish” Program (or Program “AutoFinish”)
• Identify low-quality regions
• Cover using ‘re-runs’ and ‘long-runs’
• Compare with current databases
• plate contamination
Manual Assembly appraisal
• Use a sequence editor (GAP/consed)
• Tools to identify Internal joins
• Tools to identify and import data from overlapping projects
• Tools to check failed or mis-assembled reads for inclusion in project
Manual editing
• Sanger uses 100% edit strategy
• Where additional data is required:
• Check clipping
• Additional sequencing
• Template / Primer / Chemistry
• Assemble new data into project
• GAP4 Auto-assemble
• Repeat whole process
Manual Quality Checks
• Force annotation tag consistency
• All unedited data is re-assembled using Phrap
• All high-quality discrepancies are reviewed
• Confirm restriction digest (clones)
• Check for inverted repeats
• Manually check:
• Areas of high-density edits
• Areas with no supporting unedited data
• Areas of low read coverage (need to confirm)
Gap closure
• Read pairs
• PCR reactions (long-range / combinatorial)
• Small-insert libraries
• Transposon-insertion libraries
Gap closure - contig ordering
• Read pair consistency
• STS mapping
• Physical mapping
• Genetic mapping
• Optical mapping
• Large-insert clone
• skims
• end-sequencing
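Read-pair consistency can be sketched as a small linking problem: whenever the two ends of one clone fall in different contigs, that pair is evidence that the contigs are neighbours. The example below is a simplified illustration, not any sequencing centre's actual method. It assumes each link has already been reduced to an (upstream contig, downstream contig) tuple with orientation resolved, and the `min_links` threshold of 2 is an arbitrary assumption.

```python
from collections import Counter
from typing import Dict, Iterable, List, Tuple


def _reaches(start: str, target: str, next_of: Dict[str, str]) -> bool:
    """True if following next_of from start eventually arrives at target."""
    node = start
    while node in next_of:
        node = next_of[node]
        if node == target:
            return True
    return False


def order_contigs(links: Iterable[Tuple[str, str]], min_links: int = 2) -> List[List[str]]:
    """Chain contigs into scaffolds using read-pair links seen at least min_links times."""
    counts = Counter(links)
    next_of: Dict[str, str] = {}
    prev_of: Dict[str, str] = {}
    # Accept the best-supported links first; each contig gets one neighbour per side.
    for (a, b), n in counts.most_common():
        if n < min_links or a == b:
            continue
        if a in next_of or b in prev_of:
            continue  # conflicts with a better-supported link
        if _reaches(b, a, next_of):
            continue  # would close a loop
        next_of[a] = b
        prev_of[b] = a
    contigs = {c for pair in counts for c in pair}
    scaffolds = []
    for start in sorted(c for c in contigs if c not in prev_of):
        chain = [start]
        while chain[-1] in next_of:
            chain.append(next_of[chain[-1]])
        scaffolds.append(chain)
    return scaffolds


if __name__ == "__main__":
    links = [("c1", "c2")] * 3 + [("c2", "c3")] * 2 + [("c3", "c1"), ("c4", "c5")]
    print(order_contigs(links))
```

Links supported by a single read pair are ignored here, so c4 and c5 stay unplaced. In practice the other evidence listed above (STS, physical, genetic and optical maps, large-insert clone skims and end-sequences) is used to confirm or extend such chains.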
Annotation
• DNA features (repeats/similarities)
• Gene finding
• Peptide features
• Initial role assignment
• Others- regulatory regions
Annotation of eukaryotic genomes
[Figure: genomic DNA → (transcription) → unprocessed RNA → (RNA processing) → mature mRNA with poly(A) tail → (translation) → nascent polypeptide → (folding) → active enzyme → function (Reactant A → Product B). Annotation approaches shown alongside: ab initio gene prediction, comparative gene prediction, functional identification]
Genome analysis overview: C. elegans
DNA features
• Similarity features
• mapping repeats
• simple tandem and inverted
• repeat families
• mapping DNA similarities
• EST/mRNAs in eukaryotes
• Duplications,
• RNAs
• mapping peptide similarities
• protein similarities
Gene finding
• ORF finding (simple but messy)
• ab initio prediction
• Measures of codon bias
• Simple statistical frequencies
• Comparative prediction
• Using similarity data
• Using cross-species similarities
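ORF finding, as defined in the terms list (stretches of codons uninterrupted by STOP codons in a six-frame translation), can be sketched in a few lines. This is a minimal illustration assuming the standard stop codons TAA, TAG and TGA and plain A/C/G/T input. It does not require a start codon, and the function names and length cutoff are hypothetical.

```python
from typing import List, Tuple

STOP_CODONS = {"TAA", "TAG", "TGA"}
_COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq: str) -> str:
    """Reverse complement of an A/C/G/T sequence."""
    return seq.translate(_COMPLEMENT)[::-1]


def find_orfs(seq: str, min_codons: int = 30) -> List[Tuple[str, int, int, int]]:
    """Return (strand, frame, start, end) for every stop-free run of >= min_codons codons.

    start and end (exclusive) are 0-based offsets on the strand being scanned;
    for strand "-" they refer to the reverse-complemented sequence.
    """
    seq = seq.upper()
    orfs = []
    for strand, s in (("+", seq), ("-", reverse_complement(seq))):
        for frame in range(3):
            run_start = frame
            pos = frame
            while pos + 3 <= len(s):
                if s[pos:pos + 3] in STOP_CODONS:
                    if (pos - run_start) // 3 >= min_codons:
                        orfs.append((strand, frame, run_start, pos))
                    run_start = pos + 3  # next run begins after the stop codon
                pos += 3
            # Run still open at the end of the sequence.
            if (pos - run_start) // 3 >= min_codons:
                orfs.append((strand, frame, run_start, pos))
    return orfs


if __name__ == "__main__":
    demo = "ATGAAACCCGGGTTTTAA"
    for orf in find_orfs(demo, min_codons=5):
        print(orf)
```

Even this 18-base example reports stop-free runs in several frames on both strands, which is the "messy" part: length cutoffs, codon bias measures and similarity evidence are what separate real genes from chance open frames.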
Peptide features
• Peptide features
• low-complexity regions
• trans-membrane regions
• structural information (coiled-coil)
• Similarities and alignments
• Protein families (InterPro/COGS)
Initial role assignment
• Simple attempt to describe the functional identity of a peptide
• Uses data from:
• peptide similarities
• protein families
• Vital for data mining
• Large number of predicted genes remain hypothetical or unknown
Other regulatory features
• Ribosomal binding sites
• Promoter regions
Data Release
• DNA release
• Unfinished
• Finished
• Nucleotide databases
• GENBANK/EMBL/DDBJ
• Peptide databases
• SWISSPROT/TREMBL/GENPEPT
• Others
Real World Example: Malaria Genome Project
If time permits.
Four species of malaria infect man:
• Plasmodium falciparum
• P. vivax
• P. malariae
• P. ovale
Four species of malaria infect rodents:
• P. yoelii
• P. berghei
• P. chabaudi
• P. vinckei
Sequencing the Plasmodium genomes
Plasmodium falciparum
• ~30 million base pairs (Mb)
• 80% (A+T)
• 14 chromosomes
• DNA “unstable” in E. coli
• No large insert DNA clones suitable for sequencing
• Too large for whole genome shotgun (‘96)
• Whole chromosome shotgun strategy was selected
[Figure: separation of the 14 P. falciparum chromosomes, with bands labelled 1, 2, 3, 4, 5-9, 10-12 and 13-14 and size labels from 0.8 Mb to 3.4 Mb]
Comparison of genome features
Feature | P. y. yoelii | P. falciparum
Size (Mb) | 23.1 | 22.9
No. chroms | 14 | 14
Coverage (fold) | 5 | 14.5
No. gaps | 5,812 | 93
(G+C) content (%) | 22.6 | 19.4
No. genes | 5,878 | 5,268
Mean gene length (bp) | 1,298 | 2,283
Gene density (bp/gene) | 2,566 | 4,338
Genes with introns (%) | 54.2 | 53.9
Genes with ESTs (%) | 48.9 | 49.1
Genes with proteomic data (%) | 18.2 | 51.8
Exons: mean no./gene | 2.0 | 2.4
Exons: (G+C) content (%) | 24.8 | 23.7
Introns: (G+C) content (%) | 21.1 | 13.5
Intergenic sequences: (G+C) content (%) | 20.7 | 13.6
RNAs: no. tRNAs | 39 | 43
RNAs: no. 5S rRNAs | 3 | 3
RNAs: no. rRNA units | 4 | 7
P. falciparum genome status
Chr Size (bp) No. gaps Fold coverage
1 643,293 0 13.3
2 (TIGR) 947,102 0 11.1
3 1,060,087 0 10.9
4 1,204,112 0 16.8
5 1,343,552 0 15.1
6 1,377,956 8 16.8
7 1,350,452 14 15.8
8 1,323,195 24 16.2
9 1,541,723 0 17.9
10 (TIGR) 1,694,445 4 15.6
11 (TIGR) 2,035,250 3 11.3
12 (Stanford) 2,271,477 0 16.3
13 2,747,327 37 17.2
14 (TIGR) 3,291,006 3 9.2
0 22,788 0 ND
Total 22,853,764 93 14.5
Eukaryotic annotation - TIGR
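[Figure: TIGR eukaryotic annotation pipeline. Labelled components: EGC; Annotation Station/Manatee; DDS/DPS; Annotation DB; Project DB; gene finders; alignments of genomic sequence to proteins and ESTs; gene models; BLAST, PFAM/TIGRFAM, SignalP/TMHMM; functional assignments. Example gene: PFB0680w]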
P. falciparum S. pombe S. cerevisiae D. discoideum A. thaliana
Size (bp) 22,853,764 12,462,637 12,495,682 8,100,000 115,409,949
(G+C) content (%) 19.4 36.0 38.3 22.2 34.9
No. of genes 5,268 4,929 5,770 2,799 25,498
Mean gene length* (bp) 2,283 1,426 1,424 1,626 1,310
Gene density† 4,338 2,528 2,088 2,600 4,526
Percent coding 52.6 57.5 70.5 56.3 28.8
Genes with introns (%) 53.9 43 5.0 68 79
No. tRNA genes 43 174 ND 73 ND
No. 5S rRNA genes 3 30 ND NA ND
No. rRNA units 7 200-400 ND NA 700-800
*excluding introns; †bp per gene
The P. falciparum genome
Distribution of gene lengths
[Figure: histogram of number of genes (0 to 3000) by gene length (bp, excluding introns) in bins < 300, 300-999, 1000-1999, 2000-2999, 3000-3999 and >4000, for P. falciparum, S. pombe and S. cerevisiae; annotations: 15.5%, 3.0-3.6%]
The P. falciparum proteome
Feature Number Per cent
Total predicted proteins 5,268
Hypothetical proteins 3,208 60.9
InterPro matches 2,650 52.8
PFAM matches 1,746 33.1
Gene Ontology™
Process 1,301 24.7
Function 1,244 23.6
Component 2,412 45.8
Targeted to apicoplast 551 10.4
Targeted to mitochondrion 246 4.7
Structural features
Transmembrane domain(s) 1,631 31.0
Signal peptide 544 10.3
Signal anchor 367 7.0
Non-secretory proteins 4,357 82.7
52% of predicted gene products detected by proteomics
Florens et al. Nature 419:520-526
Metabolism and transport
• Analysis based on similarity searches with sequences of known enzymes
• 14% (733) of genes encoded enzymes
• Lower than in bacterial genomes (25-33%)
• Enzymes more difficult to identify due to AT-rich genome and evolutionary distance between P.f. and other sequenced organisms
• Or P.f. has smaller proportion of genome devoted to enzymes, reduced metabolic potential
[Figure: overview of metabolism and transport in P. falciparum. Pathway panels: Glycolysis; Pentose Phosphate Pathway; Shikimic Acid Pathway; Folate Biosynthesis; Glycerolipid Metabolism; Haem Biosynthesis; Fatty Acid Biosynthesis and elongation; DOXP Pathway; Purines and Pyrimidines (purine salvage, pyrimidine synthesis); Tricarboxylic acid cycle; methionine salvage pathway. Compartments: apicoplast, mitochondrion, food vacuole (haemoglobin digestion to peptides and amino acids, haemozoin formation). Transporters: F-, V- and P-type ATPases, ABC transporters, mitochondrial/plastid carriers, and carriers for sugars, nucleosides/bases, amino compounds, cations and water/glycerol. Drugs and targets marked: pyrimethamine, cycloguanil, sulfonamides, atovaquone, fosmidomycin, triclosan, thiolactomycin, chloroquine, artemisinin, quinine, protease inhibitors, and sites labelled NOVEL INHIBITORS]
Analysis of transporters in P. falciparum
Organization of multi-gene families in P. falciparum
P. falciparum Genome Summary
Feature | Value | Comments
Genome size | 24 million base pairs | 1% of the human genome
Number of chromosomes | 14 | Human: 23 pairs
Number of gaps | 93 (0-37 per chr) | Genome >98% complete
(A+T) content | ~80.6% | Most (A+T) rich genome sequenced to date
Number of genes | ~5,300 | Yeast: 5,770; Human: ~35,000
Proteins of unknown function | 60% | More than other genomes
Possible surface proteins | ~900 | Test for use in vaccines
Gene products detected by proteomics | 52% | See Florens et al.; see Lasonder et al.
Genes conserved in rodent malaria P. yoelii yoelii | 60% | See Carlton et al.
Extra Slides