DM Church- NCBI
Assembling and Annotating Genomes
Deanna M. Church, NCBI
January 12, 2005
Of mice and men
Fleischman et al. (1991) PNAS 88:10885-10889
Both carry mutations in the Kit gene.
The Basic Model
Gene Gene Gene Gene
Structure
Mature Peptide
ProPeptide
mRNA
Transcript
Chromosome
Resources (Maps, Clones, etc)
Genomes
Organisms
Function/Phenotype
Disease
BAC insert, BAC vector
Shotgun sequence
Assemble
This part is relatively cheap and easy
Fold sequence
Gaps
Deeper sequence coverage rarely resolves all gaps
GAPS: This part is hard and expensive
“finishers” go in to manually fill the gaps, often by PCR
Putting Genomes Together
Hierarchical Shotgun Assembly
200 Kb BAC, 0.5 Kb/read: 400 reads = 1X, 2000 reads = 5X
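The coverage arithmetic on this slide can be reproduced with a short sketch. The function names are illustrative, and the uncovered-fraction estimate assumes reads land uniformly at random along the clone (a Poisson approximation), which real shotgun libraries only roughly satisfy.

```python
import math


def fold_coverage(n_reads, read_len_kb, clone_len_kb):
    """Fold coverage = total bases read / clone length."""
    return n_reads * read_len_kb / clone_len_kb


def expected_uncovered_fraction(coverage):
    """Expected fraction of bases hit by no read, assuming uniformly
    random read placement (Poisson approximation): exp(-coverage)."""
    return math.exp(-coverage)


# The slide's example: a 200 Kb BAC sequenced with 0.5 Kb reads.
one_x = fold_coverage(400, 0.5, 200)
five_x = fold_coverage(2000, 0.5, 200)
# Expected uncovered sequence (Kb) at 5X under the random-placement assumption.
uncovered_kb_at_5x = expected_uncovered_fraction(five_x) * 200
```

Under this idealized model some sequence is still expected to be missed at 5X, and cloning bias and repeats make real gaps more persistent than the model predicts.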
HTGS keywords
htgs_phase0: low coverage sequence 1-2X
htgs_phase1: generally 4-5X sequence coverage, several fragments not ordered or oriented
htgs_phase2: sequence coverage can vary (generally 5-10X) but fragments are ordered and oriented.
htgs_phase3: highly accurate, finished sequence. Error rate <10^-5
Draft sequence: phase 1 or 2, but >90% of the bases are high quality (phred 20 or better)
htgs_active_fin: center has finished shotgun phase and moved to finishing
htgs_cancelled: sequencing has discontinued on this clone
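The quality thresholds above can be related through the standard phred definition, Q = -10 log10(P), where P is the probability that a base call is wrong. The sketch below is illustrative: the function names are hypothetical, and the draft check simply applies the ">90% of bases at phred 20 or better" rule to a list of per-base scores.

```python
import math


def phred_to_error(q):
    """Error probability for a phred quality score."""
    return 10 ** (-q / 10)


def error_to_phred(p):
    """Phred quality score for an error probability."""
    return -10 * math.log10(p)


def meets_draft_quality(base_qualities, min_q=20, min_fraction=0.9):
    """True if more than min_fraction of bases have quality >= min_q."""
    good = sum(1 for q in base_qualities if q >= min_q)
    return good / len(base_qualities) > min_fraction
```

By this definition phred 20 corresponds to one error in 100 bases, and the finished-sequence bound of 10^-5 corresponds to phred 50.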
The Raw Data
- Remove contaminants (vector, E. coli, other organisms, virus)
- Bin clones by chromosome arm
- Incorporate clone order information using TPF
- Identify fragment overlaps
- Determine fragment order and orientation, remove sequence redundancy (this produces sequence contigs given NT_XXXXXX type accession numbers)
- Place contigs on chromosome
UCSC: Jim Kent
NCBI: Paul Kitts, Greg Schuler, Richa Agarwala
Putting genomes together
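The overlap, orientation, and redundancy-removal steps listed on this slide can be illustrated with a toy sketch. It is not the UCSC or NCBI pipeline: it uses exact suffix-prefix matching on two fragments, with hypothetical function names, whereas production assemblers tolerate sequencing errors and use clone order information.

```python
_COMPLEMENT = str.maketrans("ACGT", "TGCA")


def revcomp(seq):
    """Reverse complement of an uppercase DNA string."""
    return seq.translate(_COMPLEMENT)[::-1]


def overlap_len(a, b, min_len=3):
    """Length of the longest suffix of a that equals a prefix of b."""
    for n in range(min(len(a), len(b)), min_len - 1, -1):
        if a.endswith(b[:n]):
            return n
    return 0


def orient_and_merge(a, b, min_len=3):
    """Try b in both orientations; merge on the first exact overlap found.

    Returns (merged_sequence, orientation_of_b) or None. The overlapping
    bases are kept once, which removes the redundancy between fragments.
    """
    for strand, candidate in (("+", b), ("-", revcomp(b))):
        n = overlap_len(a, candidate, min_len)
        if n:
            return a + candidate[n:], strand
    return None
```

For example, `orient_and_merge("ACGTTGCA", "ACCTGCA")` finds no forward overlap, flips the second fragment, and merges on the shared `TGCA`.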
STS marker D6S1606
forward primer
reverse primer
microsatellite
PCR product size: 92 - 100 bases
GAGTTTGCACCATTGCACTCCAGCCTGGGCAAC (CA)n AACGTGGCATGTGCCTGTACTCTCCCTCAAACGTGGTAACGTGAGGTCGGACCCGTTG (GT)n TTGCACCGTACACGGACATGAGAGG
A common language for physical mapping of the human genome M. Olson, L. Hood, C. Cantor, and D. Botstein Science 245, 1434-1435 (1989).
Sequence Tagged Sites (STS)
The Original Genome Resources- STS Maps
Fragmenting the genome: meiosis (genetic map), radiation (RH map), clones (clone-based map)
- Each line represents an individual cell line/animal that carries a particular break.
- STSs can be amplified from DNA in these cell lines/animals.
- Based on cell line/animal marker content, the breaks can be determined and the markers ordered.
[Figure: gel for marker D2Wsu129e; lane labels include 129, water, hamster, and numbered lanes 1-30]
Electronic PCR (e-PCR)
STS marker D6S1606
forward primer
reverse primer
microsatellite repeat
PCR product size: 92 - 100 bases
GAGTTTGCACCATTGCACTCCAGCCTGGGCAACAAGAGTGAAACTCTGTCACAGA (CA)n AACGTGGCATGTGCCTGTACTCTCCTCAAACGTGGTAACGTGAGGTCGGACCCGTTGTTCTCACTTTGAGACAGTGTCT (GT)n TTGCACCGTACACGGACATGAGAG
E-PCR software searches DNA sequences for exact matches to both primers in correct order, orientation, and spacing to be consistent with known PCR product size.
Schuler (1997), Genome Research 7, 541-550
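The search described above can be sketched in a few lines. This is a simplified illustration, not the e-PCR program itself: it scans only the given strand, requires exact primer matches, and uses hypothetical function and parameter names. The reverse primer is matched as its reverse complement, since it anneals to the opposite strand.

```python
_COMPLEMENT = str.maketrans("ACGT", "TGCA")


def revcomp(seq):
    """Reverse complement of an uppercase DNA string."""
    return seq.translate(_COMPLEMENT)[::-1]


def find_sts(sequence, fwd_primer, rev_primer, min_size, max_size):
    """Return (start, product_size) for every placement where the forward
    primer is followed by the reverse complement of the reverse primer at
    a spacing consistent with the expected PCR product size."""
    rev_site = revcomp(rev_primer)
    hits = []
    start = sequence.find(fwd_primer)
    while start != -1:
        pos = sequence.find(rev_site, start + len(fwd_primer))
        while pos != -1:
            size = pos + len(rev_site) - start
            if size > max_size:
                break  # later sites only give larger products
            if size >= min_size:
                hits.append((start, size))
            pos = sequence.find(rev_site, pos + 1)
        start = sequence.find(fwd_primer, start + 1)
    return hits
```

To cover both strands, the same function can be run on `revcomp(sequence)`; tolerating mismatches would need an approximate matcher in place of `str.find`.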
Electronic PCR (e-PCR)
http://www.ncbi.nlm.nih.gov/sutils/e-pcr/
Putting genomes together
Ideally…
[Figure: sequence fragments carrying markers A through O compared against a non-sequence based map (A, B, C, D, FGH, KL, O, N); one fragment is flipped to agree with the map]
Putting genomes together
More like…
[Figure: the same comparison with conflicting marker content (fragments such as ZYXW, V, and CDY) and an unresolved placement marked "?"]
The Starting Material:
               Phase 1    Phase 2    Phase 3
number           10632        777      30470
Length (Kb)    1726.24     101.11    3621.30
http://www.ncbi.nlm.nih.gov/genome/guide/human/HsStats.html
Framework assemblies:
388 contigs- 3.02 Gb
Type of source sequence    Number used      Length (bp)
Draft only                          46       10,284,900
Finished only                      334    2,833,780,000
Contig Information:
Human assembly: Build 35. The assembly is now defined by AGP* files rather than a formal assembly process. These are maintained by chromosome coordinators.
*AGP = A Golden Path
Reference Contig N50#: 38.5 Mb
#N50 length: contig length at which 50% of the bases in the assembly reside in a contig of at least that size.
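The N50 definition above translates directly into code. A minimal sketch, with an illustrative function name: sort contigs from longest to shortest and report the length at which the running total first reaches half of the assembly.

```python
def n50(contig_lengths):
    """Contig length such that at least 50% of all assembled bases
    reside in contigs of that length or longer."""
    total = sum(contig_lengths)
    running = 0
    for length in sorted(contig_lengths, reverse=True):
        running += length
        if 2 * running >= total:
            return length
    return 0  # empty assembly
```

For contigs of 2, 3, 4, 5, and 6 Kb (20 Kb total), the two longest contigs already hold 11 Kb, so the N50 is 5 Kb.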
Contigs and components in the MapViewer
Mouse Genome Sequencing
Aug. 2001: 19.2M (3.5X)
Oct. 2001: 25.2M (4.5X)
Nov. 2001: 30M (5.5X)
Feb. 2002: 40.1M (7X)
WGS
Restrict and make libraries: 2, 4, 8, 10, 40, 150 kb
For the mouse project only 40 kb clones and BAC clones are available
BAC clones were constructed and end sequenced before the WGS project started
End-sequence all clones and retain pairing information ("mate-pairs")
Find sequence overlaps
Each end sequence is referred to as a read
WGS contig
David Jaffe, Jim Mullikin
tails
Putting genomes together
Constructing Supercontigs (scaffolds)
David Jaffe, Jim Mullikin
Putting genomes together
The Mouse Genome - MGSCv3 (The Mouse Genome Sequencing Consortium)
David Jaffe: Arachne
Jim Mullikin: Phusion
Waterston et al, 2004

The Starting Material
40.7 million WGS reads (2, 4, 6, 10, 40 Kb)
~450,000 BAC end sequences (RPCI-23: 197 Kb, RPCI-24: 155 Kb)
(+ 274 finished BACs, 49.5 Mb)

The Assembly
224,713 WGS contigs (CAAA01000100-)
- Length of contigs > 1 kb: 2.53 Gb
- Length of contigs with >= 1 BES: 2.06 Gb
- Length of contigs with >= 1 mapped STS: 0.344 Gb
- N50 length: 24.8 Kb
- Mapped: 173550
42,620 Supercontigs (NW_XXXXXX)
- Length of sc >= 1 BES: 2.41 Gb
- Length of sc >= 1 mapped STS: 2.4 Gb
- N50 length: 17.7 Mb
- Mapped: 366
N50 of mapped supercontigs: 17 Mb
N50 of unmapped supercontigs: 4.9 Kb
ChrUn
Total length of the assembly: 2.5 Gb (90.9% of genome)*
* Assumes a 2.75 Gb genome
The Mouse Genome - over time…
[Figure: MGSCv3, chromosomes 1-19 and X; sequence colored by source: Finished, Draft, WGS, Gap]
Contig/Supercontig size by chromosome
[Figure: bar chart by chromosome (1-19, X); series: Contig (Kb), Supercontig (Mb); y-axis 0-80]
How does MGSCv3 compare to Non-Sequence based maps
[Figure: Chromosome 7; map position (y-axis, 0-140) plotted against sequence position in basepairs (x-axis, 0-160,000,000)]
~80% of STS markers on WI-Genetic Map localized by e-PCR
~72% of STS markers on WI/MRC RH Map localized by e-PCR
<3% chromosome conflict.
Legend: WI-Gen map, WI/MRC RH map
Finished NT Contig By Build
[Figure: bar chart of finished NT contig length (bp, 0-200,000,000) by chromosome (1-19, X, Y, Un); series: Build 29, Build 30, Build 32, Build 33, Estimated Length]
Finished sequences are used to build hand-curated contigs (NT contigs)
Currently ~1.8 Gb (mostly) non-redundant sequence
1.1 Gb in Build 33
Mouse Build 30:
Integrated 730 Mb of Finished C57BL/6J sequence into the assembly
MGSCv3 was used as a Tiling Path to guide the assembly
Freeze date: Jan 27, 2003
Release date: Feb 27, 2003
The Mouse Genome - over time… (NCBI: Richa Agarwala)
[Figure: Build 30, chromosomes 1-19 and X; sequence colored by source: Finished, Draft, WGS, Gap]
The Mouse Genome - combining resources… (NCBI: Richa Agarwala, Deanna Church)
Unplaced versus Total curated Contigs, Build 30
[Figure: bar chart by chromosome (1-19, X, Y); series: Unplaced, Total contigs; y-axis 0-800. Percentage labels on the bars, in extraction order: 0.56%, 0.27%, 1.83%, 1.93%, 4.07%, 3.64%, 3.61%, 1.19%, 2.94%, 0, 0, 5.56%, 1.38%, 4.48%, 0, 0, 1.27%, 1.41%, 0, 0.9%, 100%]
780 Mb of Curated NT Sequence
The Mouse Genome - combining resources… (NCBI: Richa Agarwala, Deanna Church)
Mmu4 unplaced contigs (Build 30)
10 unplaced NT contigs (11 GenBank accessions)
- Do align to WGS contigs mapped to Mmu4: NT_039271, NT_039272, NT_039276, NT_039280
- Align to WGS contigs mapped to another chromosome: NT_039273 (MmuX)
- No hits/bad hits (mostly chrUn): NT_039269, NT_039270, NT_039274, NT_039278, NT_039279
Segmental Duplications
Large, nearly identical copies of genomic DNA: > 1 Kb, > 90% identity
Intrachromosomal or interchromosomal
Case Western Reserve: Evan Eichler, Jeff Bailey
NCBI: Deanna Church
Segmental Duplications
WGAC Analysis: Whole Genome Assembly Comparison
BLAST the genome against itself and look for sequence similarity.
Caveat: it is difficult to distinguish between biological duplication and artificial duplication introduced when producing draft assemblies.
WSSD Analysis: Whole Genome Shotgun Sequence Detection
BLAST WGS reads against an assembly and look for increased depth of coverage.
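The depth-of-coverage idea behind WSSD can be illustrated with a toy sketch. It is not the published method: it assumes read placements are already known (for example from BLAST hits), counts read starts in fixed windows, and flags windows whose count exceeds a chosen multiple of the median. The function names, window size, and fold threshold are all illustrative assumptions.

```python
from statistics import median


def window_depth(read_starts, seq_len, window):
    """Count read start positions falling in each fixed-size window.

    Assumes every start satisfies 0 <= start < seq_len.
    """
    n_windows = (seq_len + window - 1) // window
    counts = [0] * n_windows
    for start in read_starts:
        counts[start // window] += 1
    return counts


def flag_excess_depth(counts, fold=1.5):
    """Indices of windows with more than fold x the median read count,
    a crude signal that the region may be collapsed duplicated sequence."""
    baseline = median(counts)
    if baseline == 0:
        return []
    return [i for i, c in enumerate(counts) if c > fold * baseline]
```

With one read every 10 bp across 1,000 bp and a second set of reads piled onto positions 300-399, only the window covering that interval is flagged.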
Segmental Duplications
MGSCv3 (>20 Kb; >95%)
Segmental Duplications
MGSCv3 (>90% ID; >10 Kb)
60% of all duplications map to chrUn in MGSCv3
Segmental Duplications
Comparison of duplication in the Mouse and Human Genomes (WGAC analysis)

          Human Build 31    MGSCv3 (2.55 Gb)        Mouse Build 29 (0.439 Gb, Finished BACs only)
Size      (2.75 Gb)         w/ unpl    w/o unpl     initial    filtered
>1 Kb     5.25%             ND         ND           3.74%      2.35%
>5 Kb     4.78%             1.95%      1.01%        3.25%      2.00%
>10 Kb    4.52%             0.70%      0.38%        2.71%      1.60%
>20 Kb    4.06%             0.11%      0.10%        2.23%      1.14%

Duplications are underrepresented in the Whole Genome Assembly (MGSCv3)
Segmental Duplications
WSSD: Finished BACs
[Figure: series: Unique (pre-quality score), Unique (post-quality score), Duplicated (pre-quality score), Duplicated (post-quality score)]
Segmental Duplications
WSSD (>95% id) analysis of Build 30 BACs
Size      BACs     MGSCv3 w/ Un    MGSCv3 w/o Un
>1 Kb     ND       ND              ND
>5 Kb     ND       ND              ND
>10 Kb    1.51%    2.09%           0.27%
>20 Kb    1.46%    2.01%           0.23%
(4298 BACs tested)
141 dup pos BACs
The 6 BACs (5 NT clones) from Mmu4 that hit chrUn are on the duplication positive list
Segmental Duplications
Bari, Italy: Mario Ventura, Mariano Rocchi
RP23-3D2 chr.X_A3
• Validated 18/27 (67%) in silico predictions by FISH
• 16/18 (~90%) were clustered intrachromosomal duplications
This region described in Mileham and Brown (1996) as ‘a repeat sequence island’
MGSCv3 Duplication Analysis
Build 33 data
Evan Eichler, Xinwei She, Ginger Chang, Eray Tuzan, Deanna Church

Columns:
(1) WGAC (Mb)
(2) WSSD supported WGAC (Mb)
(3) WSSD overlap WGAC (%)
(4) WSSD (Mb)
(5) WGAC overlap WSSD (%)
(6) Proportion of WSSD supported WGAC in chrom (%)

Chrom      (1)      (2)      (3)      (4)      (5)      (6)
chr1      3.25     0.38    11.58     0.57    66.51     0.21
chr2      2.03     0.13     6.57     0.32    42.11     0.08
chr3      2.17     0.11     5.23     0.16    69.09     0.08
chr4      2.19     0.27    12.12     0.69    38.64     0.19
chr5      2.81     0.42    14.96     0.88    47.92     0.31
chr6      3.72     0.37     9.97     0.86    43.00     0.27
chr7      4.48     0.78    17.41     2.10    37.16     0.64
chr8      1.54     0.15     9.54     0.27    54.63     0.12
chr9      1.56     0.10     6.11     0.34    28.03     0.08
chr10     1.62     0.10     5.94     0.19    51.39     0.08
chr11     1.13     0.08     6.94     0.21    36.63     0.07
chr12     1.79     0.39    21.85     0.88    44.42     0.37
chr13     1.86     0.41    22.08     1.01    40.66     0.38
chr14     1.19     0.15    12.39     0.33    44.38     0.14
chr15     0.94     0.04     3.87     0.05    77.47     0.04
chr16     1.08     0.01     0.75     0.02    40.64     0.01
chr17     3.35     0.22     6.62     0.99    22.30     0.26
chr18     0.75     0.02     2.62     0.02    87.52     0.02
chr19     0.92     0.05     5.53     0.31    16.52     0.09
chrUn    23.78    13.03    54.80    82.02    15.89    12.91
chrX      3.17     0.31     9.91     0.86    36.41     0.23

both non redundant dup
Build 33
Reference assembly N50: 22.3 Mb
[Figure: chromosomes 1-19, X, Y; sequence colored by source: WGS, Finished, Draft, Gap]
Chromosome 7 inversion still present…
Mmu7 (3M – 6M)
Segmental Duplication: Genome annotation will under-represent the gene content if segmental duplications are not included in the reference assembly.
Large scale variation in the genome
Nature Genetics, Sept. 2004
Types of annotation
Feature: Method
Genes: By alignment, by prediction
Markers: By ePCR
Clones/Cytogenetic location: By alignment (BAC ends, insert) or assembly
Variation: By alignment
Phenotype: Via Gene identification, associated markers
Cytogenetic Position: By annotated BAC-END sequenced clones; by FISH-mapped clones used in assembly
Gene Trap Clones: By alignment
Sequence characteristics: CpG islands, source of assembly
Note: Genes from other organisms are also positioned by aligning mRNAs from one species to the genome of another. Example: the human Map Viewer shows the position of ESTs and other mRNAs from cow, pig, mouse, and rat.
Reference Sequences…
Goal: One sequence entry for each naturally occurring DNA, RNA and protein molecule
NC_000000: chromosome
NT_000000/NW_000000: contig
NG_000000: genomic
NM_000000, NR_000000: RNA
NP_000000: protein
XM_000000/XR_000000: predicted RNA
XP_000000: predicted protein
Key: Curated annotation, Calculated annotation
Multiple products for one gene are instantiated as separate RefSeqs with the same LocusID.
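The RefSeq accession prefixes on this slide lend themselves to a simple lookup. The sketch below is illustrative only: the table and function name are not part of any NCBI library, and the labels are the ones used on the slide.

```python
# Prefix -> molecule type, using the labels from the slide.
REFSEQ_PREFIXES = {
    "NC": "chromosome",
    "NT": "contig",
    "NW": "contig",
    "NG": "genomic",
    "NM": "RNA",
    "NR": "RNA",
    "NP": "protein",
    "XM": "predicted RNA",
    "XR": "predicted RNA",
    "XP": "predicted protein",
}


def refseq_type(accession):
    """Return the molecule type for a RefSeq-style accession, or None
    if the accession does not carry a known two-letter RefSeq prefix."""
    prefix, sep, _ = accession.partition("_")
    if not sep:
        return None
    return REFSEQ_PREFIXES.get(prefix)
```

For instance, `refseq_type("NM_003490")` gives `"RNA"`, while a GenBank accession such as `AY254099` has no underscore prefix and gives `None`.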
Why do we need RefSeq?
Entrez Nucleotide
mRNA alignment (Sim4, est2genome, Spidey, BLAT, SPLIGN)
• General alignment:
  – at least 50% of length or >1.0 kb
  – >95% identity, unless short exon
  – No longer one alignment per contig per strand (changed recently because this led to failure to annotate all members of a gene cluster)
  – Constraints on intron length (compactness)
  – Shift within 3 nt to find splice sites conforming to consensus (GT-AG, GC-AG, AT-AC)
  – Rank alignment by bit score, % identity, score, gaps, compactness
  – global alignment
• Best placement:
  – Add to score for introns to compensate for gap penalty
  – Known ambiguity if gene/pseudogene pairs are highly related, and few introns in gene
Aligning cDNAs to the genome
- Different algorithms can produce different results
- Trying to balance alignment with searching for splice sites.
NM_003490 (synapsin 3), between exons 7 and 8
Tools compared: Sim4, splign/gpipe/BLAT, spidey

mRNA:   ACAG++++++++++GAG
        |||           |||
genome: ACATGTxxxxACAGGAG

mRNA:   AC++++++++++AGGAG
        ||          |||||
genome: ACATGTxxxxACAGGAG

mRNA:   ACA++++++++++GGAG
        |||          ||||
genome: ACATGTxxxxACAGGAG
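The trade-off between alignment and splice-site consensus can be made concrete with a small sketch. It is an illustration, not the behavior of any named aligner: the function names are hypothetical, intron boundaries are half-open indexes into the genome string, and a shift is accepted only if the spliced product is unchanged. The toy genome reuses the slide's example with `N` standing in for the `x` placeholder bases.

```python
# Donor/acceptor dinucleotide pairs treated as consensus.
CONSENSUS_PAIRS = {("GT", "AG"), ("GC", "AG"), ("AT", "AC")}


def is_consensus(genome, start, end):
    """True if the intron genome[start:end] begins and ends with a
    consensus donor/acceptor pair."""
    intron = genome[start:end]
    return len(intron) >= 4 and (intron[:2], intron[-2:]) in CONSENSUS_PAIRS


def shift_to_consensus(genome, start, end, max_shift=3):
    """Slide both intron boundaries by up to max_shift bases, smallest
    shifts first, and return the first placement that keeps the spliced
    sequence identical and has consensus splice sites, else None."""
    spliced = genome[:start] + genome[end:]
    for shift in sorted(range(-max_shift, max_shift + 1), key=abs):
        s, e = start + shift, end + shift
        if s < 0 or e > len(genome):
            continue
        if genome[:s] + genome[e:] == spliced and is_consensus(genome, s, e):
            return s, e
    return None
```

On `ACATGTNNNNACAGGAG`, an intron placed at (3, 13) has non-consensus TG...CA ends; shifting one base left to (2, 12) yields AT...AC with the same spliced product. The GT...AG placement at (4, 14) is rejected by this sketch because it would introduce a mismatch, which is exactly the kind of choice on which algorithms differ.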
Making Gene Models (at NCBI)
Align RefSeq mRNAs to the genome
Select the best alignment (by score? exon structure?)
Run ab initio gene prediction on regions between these alignments
We use Gnomon (Genscan, GenomeScan, TwinScan, SGP)
Select best gene models:
- RefSeq alignments (NM_XXXXXXXXX)
- ab initio models with support (XM_XXXXXXXXX)
Known issues:
- Don't make ab initio models in introns of known genes
- Skewed to what we know
- Don't really predict non-coding RNAs well
- Hard to sort out genes vs. pseudogenes
Conflict resolution
Integrated comparison with Ensembl and UCSC:
- Placement of CDS
- Placement and consensus of splice junctions
- % identity between RefSeq and Genome
- Reading frame
Possible Actions:
- Review current evidence
- Review alignment algorithms
- Review current RefSeqs
Future consensus annotation
• CCDS identifier assigned to annotated proteins that are consistently placed
• Sequence may not be identical because NCBI annotates and places existing RefSeqs that are based on cDNAs, whereas Ensembl generates mRNA and protein products solely from the reference genome:
  – cDNA (and thus protein) from a different allele
  – RNA editing
  – selenoproteins
  – ribosomal slippage
  – non-AUG initiation codon
  – cDNA source has undetected sequence errors
Future consensus annotation
Preliminary Statistics based on Human Build 34.3

Count    Total    Conditions Satisfied
7802     7802     100% nucleotide+position
1499     9301     100% protein+position
3053     12336    100% exon position
23       12359    NCBI/Hinxton both "good"
1540     13899    NCBI annotation projected
1772     15671    One model better
52       15723    Other model better
Now that the genome is together
I. Text based queries
Entrez:
- organism restriction
- molecule type restrictions
- keyword restrictions
II. Sequence comparisons
- BLAST (Basic Local Alignment Search Tool)
- SSAHA (Sequence Search and Alignment by Hashing Algorithm)
- BLAT
III. Query by location
- Base pair position
- cM position
- cytogenetic position
http://www.ncbi.nlm.nih.gov/mapview/map_search.cgi?taxid=10090
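The three kinds of Entrez restriction listed under text based queries can be combined into one query string. The sketch below only builds the string and makes no network calls. The function name is hypothetical, and the field tags (`[Organism]`, `[PROP]` with a `biomol_` value, `[All Fields]`) are given as commonly used Entrez Nucleotide conventions, which should be checked against current Entrez documentation before use.

```python
def entrez_query(organism=None, molecule=None, keyword=None):
    """Combine organism, molecule-type, and keyword restrictions
    into a single Entrez-style query string joined with AND."""
    terms = []
    if organism:
        terms.append(f'"{organism}"[Organism]')
    if molecule:
        # Assumed convention: molecule type as a biomol_* property.
        terms.append(f"biomol_{molecule}[PROP]")
    if keyword:
        terms.append(f"{keyword}[All Fields]")
    return " AND ".join(terms)
```

A call such as `entrez_query("Mus musculus", "mrna", "Kit")` yields a query restricting results to mouse mRNA records that mention Kit.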
http://www.ncbi.nlm.nih.gov/genome/seq/MmBlast.html /HsBlast.html /RnBlast.html /DrBlast.html
Data Access
Entry point into the Genome: view BLAST results in the Map Viewer
DATABASES
- Assembled Sequence: Reference assembly (C57BL/6J), Alternate assemblies, Celera Mouse 16
- Input Sequences: HTGS, WGS Traces, All other Traces, BAC ends
- Transcribed Sequences: Reference mRNA, Build RNA, ESTs
- Proteins: Reference proteins, Build proteins
- Other data sets: Gene Trap Clones
Data Access
Navigating by location
Jump to chromosome
[Figure: Map Viewer display with ruler marks at 15M and 30M]
- Add & Remove Maps
- Change Map Order
- Add Rulers
- Add another organism
Multiple assemblies can be a good thing…
Alignment of human Reference mRNAs:
256: Reference assembly only
10: Celera assembly only
• Assembly Gaps
• Assembly Errors
• Biological variation
Multiple assemblies can be a good thing…
Multiple assemblies can be a good thing…
Inversions: An exon of DOCK3 is inverted in the reference assembly relative to other available information.
[Figure: Reference Assembly versus Celera Assembly]
Other sequence data indicate the reference assembly includes an inversion:
NM_004947 181 tgaaggggatctttcctgcaaattacattcacttgaaaaaggcaattgtcagtaataggg 240 +
AY254099 181 ............................................................ 240 +
AY145303 158 ............................................................ 217 +
AY145302 509 .a......c................t.....t.......................c.... 568 +
AK172930 518 .a......c................t.....t.......................c.... 577 +
AK122353 445 .c.....t..a......t.c.gc..tg.............t..ctg...a.ag..c.aa. 504 +
AY233380 158 .c.....t..a......t.c.gc...g.............t..ctg...a.ag..c.aa. 217 +
AC121608 21865 .....c................t.....t.......................c.... 21921 +
AL672208 61296 .....c................t.....t.......................c.... 61240 +
Multiple assemblies can be a good thing…
Genome assembly and annotation is an ongoing issue.
Weigh all of the evidence carefully
Multiple lines of evidence better than a single thread
Acknowledgments
RefSeq Curator Staff, BLAST Team, Entrez Team, NCBI Service Desk Staff
Genome Team: Richa Agarwala, Hsiu-Chuan Chen, Slava Chetvernin, Deanna Church, Olga Ermolaeva, Wratko Hlavina, Wonhee Jang, Jonathan Kans, Yuri Kapustin, Ken Katz, Paul Kitts, Donna Maglott, Jim Ostell, Kim Pruitt, Sergey Resenchuk, Victor Sapojnikov, Greg Schuler, Steve Sherry, Andrei Shkeda, Alexandre Souvorov, Tatiana Tatusova, Lukas Wagner
Trace and Assembly Archive: Vladimir Alekseyev, Anton Butanaev, Alexey Egorov, Andrew Klymenko, Sergey Pomorov, Eugene Yaschenko, Mike Dicuccio
Duplication Analysis: Evan Eichler, Xinwei She, Ze Cheng, Eray Tuzan, Jeff Bailey, Mario Ventura, Mariano Rocchi
Acknowledgments
Mouse Genome Sequencing Consortium:
- Sanger Institute
- Washington University Genome Sequencing Center
- Whitehead (Broad) Institute Genome Center
- Baylor College of Medicine
- Cold Spring Harbor Laboratory
- Genome Therapeutics Corporation
- Harvard Partners Genome Center
- Joint Genome Institute
- NIH Intramural Sequencing Center
- UK-MRC Sequencing Consortium
- The University of Oklahoma Advanced Center for Genome Technology
- The University of Texas Southwest