Goals of the International Human Genome Sequencing Consortium
• Completeness:
– no mapping gaps
– no sequencing gaps
• Accuracy:
– error rate < 10^-4
– based on a minimum of 3 reads (at least 1 on each strand)
Is the human genome sequence
complete ?
[Figure: four panels titled "Sequencing gaps", comparing June 00 with Apr 01, Dec 01, Nov 02 and Apr 03; x-axis 1 to 24, y-axis 0 to 25000]
Assembling the entire sequence: The Golden Path (NCBI, Build 34) (Jul 2003)
• Total estimated size (bp): 3 070 000 000
• Total euchromatin estimated size (bp): 2 864 000 000
• Size of non-overlapping assembly (bp): 2 843 433 602
• Euchromatic fraction sequenced (%): 99
• Number of cloning gaps: 281
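As a quick check, the euchromatic fraction sequenced can be recomputed from the sizes quoted for Build 34. The sketch below is only illustrative arithmetic; the variable and function names are chosen here and do not come from the slides.

```python
# Sizes quoted on the slide for NCBI Build 34 (Jul 2003), in base pairs.
euchromatin_estimated_bp = 2_864_000_000
assembly_bp = 2_843_433_602


def percent(part: int, whole: int) -> float:
    """Return part as a percentage of whole."""
    return 100.0 * part / whole


# Fraction of the estimated euchromatin covered by the non-overlapping assembly.
euchromatic_fraction = percent(assembly_bp, euchromatin_estimated_bp)

# Euchromatic sequence still missing from the assembly.
remaining_bp = euchromatin_estimated_bp - assembly_bp

print(f"Euchromatic fraction sequenced: {euchromatic_fraction:.1f}%")
print(f"Euchromatin not yet in the assembly: {remaining_bp:,} bp")
```

The result rounds to the 99% quoted on the slide.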
Is the human genome sequence
accurate ?
Sequence Accuracy
Base miscalls and small indels were determined by resequencing a sample of BACs from each sequencing center (J. Schmutz, Stanford Human Genome Center)
Error level: between 10^-4 and 10^-5 per base pair
Sequence Accuracy
Large indels were identified by mapping end-sequenced fragments of known size onto the human genome assembly
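The idea behind the large-indel check can be sketched as follows: if both end reads of a fragment of known size map to the assembly, the distance between them should be close to that size. This is a minimal illustration, not the procedure actually used; the `MappedClone` type, the `classify` function and the 20% tolerance are all assumptions made for the example.

```python
from dataclasses import dataclass


@dataclass
class MappedClone:
    name: str
    expected_size: int  # insert size known from the library (bp)
    left_end: int       # assembly coordinate of one end read
    right_end: int      # assembly coordinate of the other end read


def classify(clone: MappedClone, tolerance: float = 0.2) -> str:
    """Compare the span between the mapped ends with the known insert size.

    The tolerance (hypothetical value) absorbs normal size variation
    within a clone library.
    """
    observed = abs(clone.right_end - clone.left_end)
    low = clone.expected_size * (1 - tolerance)
    high = clone.expected_size * (1 + tolerance)
    if observed < low:
        # Ends map closer together than the fragment is long:
        # the assembly may be missing sequence here.
        return "possible deletion in assembly"
    if observed > high:
        # Ends map further apart than the fragment is long:
        # the assembly may contain extra sequence here.
        return "possible insertion in assembly"
    return "consistent"


clones = [
    MappedClone("fosmid_A", 40_000, 1_000_000, 1_039_000),
    MappedClone("fosmid_B", 40_000, 2_000_000, 2_020_000),
    MappedClone("fosmid_C", 40_000, 3_000_000, 3_075_000),
]
for clone in clones:
    print(clone.name, classify(clone))
```

In practice several independent fragments supporting the same discrepancy would be needed before calling an indel, since a single inconsistent clone may simply be chimeric or mis-mapped.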
Is the Human Genome Sequence
well annotated ?
Genes
coding parts
other transcribed segments (5’ and 3’ UTRs)
regulatory regions
Other features
structural (centromeres, telomeres...)
unknown functions
Sequence variants
DNA sequence features of biological relevance
Biological features are identified using
biological data (cDNAs, RNAs, proteins)
ab initio analyses
comparative analyses
combination of these methods
Genes
Ensembl Current Release (Jan 2004) (based on NCBI, Build 34)
• Ensembl gene predictions: 23 531
• Genscan gene predictions: 65 010
• Ensembl gene exons: 225 897
• Ensembl gene transcripts: 31 609
NCBI statistics (July 2003) (based on Build 33)
RNA genes: 185
Protein coding genes
Function known or inferred: 14858
Function unknown: 6383
Models predicted ab initio: 4288
Models predicted ab initio with EST support: 9848
Other models with EST support: 3431
Other models with mRNA support: 3289
Known and predicted pseudogenes: 6241
TOTAL features: 48523
TOTAL protein or possibly protein coding: 42282
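The two totals can be checked against the individual categories. The sketch below only re-adds the figures quoted above; the dictionary layout is an assumption made for the example. The "protein or possibly protein coding" total matches only when the 185 RNA genes are counted in it.

```python
# Category counts as quoted on the slide (NCBI, July 2003, Build 33).
non_pseudogene_counts = {
    "RNA genes": 185,
    "Function known or inferred": 14858,
    "Function unknown": 6383,
    "Models predicted ab initio": 4288,
    "Models predicted ab initio with EST support": 9848,
    "Other models with EST support": 3431,
    "Other models with mRNA support": 3289,
}
pseudogenes = 6241

# Sum of every category except pseudogenes.
subtotal = sum(non_pseudogene_counts.values())

# Adding pseudogenes gives the count of all annotated features.
total_features = subtotal + pseudogenes

print("Subtotal without pseudogenes:", subtotal)
print("Total features:", total_features)
```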
RefSeq Statistics (14 Jan 04)
Review status | RNAs | Proteins
Validated | 1214 | 1210
Inferred | 5 | 5
Provisional | 23885 | 23767
Predicted | 8720 | 8720
Reviewed | 9393 | 10393
Models | 31935 | 31882
Despite the availability of
a nearly complete sequence (99%)
of the human genome,
the gene inventory is not yet complete
Numerous annotated gene models remain
fragmentary
[Figure: gene model drawn from 5’ end to 3’ end, with a CpG island]
Annotated gene models frequently lack:
- the 5’ end
- alternative exons
- alternative splicing sites
- alternative starts or polyadenylation sites
Some neglected genome features
Small open reading frames (smORFs)
• encoding fewer than 100 amino acids
• phylogenetically conserved
• single or multiple exon genes
• other standard gene features
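To make the size criterion concrete, here is a minimal sketch that scans the three forward reading frames of a DNA string for ATG-to-stop open reading frames of fewer than 100 codons. It covers only single-exon candidates on one strand and does not test phylogenetic conservation; the function name and the `min_aa` lower bound are assumptions made for the example.

```python
STOP_CODONS = {"TAA", "TAG", "TGA"}


def find_smorfs(seq: str, max_aa: int = 99, min_aa: int = 10):
    """Return (start, end, n_codons) for short ORFs on the forward strand.

    start is the 0-based index of the ATG, end is the index just past
    the stop codon, n_codons counts codons from ATG up to (not including)
    the stop. min_aa is a hypothetical lower bound to skip trivial ORFs.
    """
    seq = seq.upper()
    found = []
    for frame in range(3):
        start = None
        for i in range(frame, len(seq) - 2, 3):
            codon = seq[i:i + 3]
            if start is None and codon == "ATG":
                start = i  # first ATG opens the reading frame
            elif start is not None and codon in STOP_CODONS:
                n_codons = (i - start) // 3
                if min_aa <= n_codons <= max_aa:
                    found.append((start, i + 3, n_codons))
                start = None  # look for the next ORF in this frame
    return found


# Toy sequence: ATG followed by 11 alanine codons and a stop codon.
example = "CC" + "ATG" + "GCT" * 11 + "TAA" + "GG"
print(find_smorfs(example))
```

A real search would also scan the reverse complement and then filter candidates by conservation in other species, since short ORFs occur frequently by chance.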
MicroRNA genes
• ~22-nucleotide non-coding RNAs
• control expression of other genes at the post-
transcriptional level
• derive from a phylogenetically conserved stem-
loop precursor with characteristic features
• 200-250 miRNA genes in the human genome
Other non-coding transcripts
• spliced, polyadenylated and cytoplasmic
• expressed at low level
• poorly conserved between human and mouse
• display some tissue specificity
• may represent 20 to 30% of the genome
• can they be considered as products of genes?
How can we identify more genomic features ?
• More ‘full length’ cDNAs
• Combining ab initio predictions with
biological data (RT PCR, SAGE)
• Global genome comparisons
The power of sequence comparison has long been known
Sequences that have a biological function have been observed to show a higher degree of sequence similarity than average
However, a higher degree of sequence similarity is not proof of biological function
Comparative Genomics
To better understand the respective roles of mutation and selection, the main forces acting on genome evolution, analyses cannot be restricted to coding sequences
Hence the use of a conservation score that can be applied to any type of genomic DNA sequence
Comparative Genomics
$$S = S(R) = \frac{\rho - \mu}{\sqrt{\mu(1-\mu)/n}}$$
n: number of sites within the window that are aligned
ρ: fraction of aligned sites that are identical
µ: average fraction of sites that are identical in aligned ancestral repeats in the surrounding region
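The score can be read as a normalized deviation: the observed identity ρ in a window is compared with the neutral expectation µ and scaled by the binomial standard deviation for n aligned sites. The sketch below assumes that reading, that is S = (ρ − µ) / sqrt(µ(1 − µ)/n); the function name and the example numbers are illustrative only.

```python
import math


def conservation_score(n_aligned: int, n_identical: int, mu: float) -> float:
    """Conservation score of one window.

    n_aligned   -- number of aligned sites in the window (n)
    n_identical -- number of aligned sites that are identical
    mu          -- average identity in aligned ancestral repeats nearby
    """
    if n_aligned <= 0:
        raise ValueError("window has no aligned sites")
    if not 0.0 < mu < 1.0:
        raise ValueError("mu must lie strictly between 0 and 1")
    rho = n_identical / n_aligned
    # Standard deviation of the identity fraction under neutral evolution.
    neutral_sd = math.sqrt(mu * (1.0 - mu) / n_aligned)
    return (rho - mu) / neutral_sd


# Hypothetical window: 50 aligned sites, 45 identical, neutral identity 0.65.
print(round(conservation_score(50, 45, 0.65), 2))
```

A window whose identity equals the local neutral rate scores 0, so the score adapts to regional variation in the neutral substitution rate.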
Non-coding sequences
coding exons
5’ UTR
200 bp upstream transcript start
known regulatory regions
introns
3’ UTR
200 bp downstream transcript end
CpG islands
Comparative Genomics
Identification of regions of biological relevance cannot rely solely on comparisons of the human/mouse pair
An important fraction of sequences under selection show conservation scores in the range of sequences evolving neutrally
Use of genome sequences from additional species can overcome this limitation
Vertebrate Genome Sequencing Projects
Human (>9X coverage, 99% complete)
Mouse (7X coverage)
Rat (6X coverage)
Zebrafish (5X coverage)
Pufferfish Tetraodon (7X coverage)
Pufferfish Fugu (6X coverage)
Vertebrate Genome Sequencing Projects
Data in Trace Archive
Species | Sequence reads (million reads)
Mouse | 78.9
Rat | 39.6
Chimpanzee | 27.5
Lemur | 0.5
Dog | 35
Cat | 0.6
Bovine | 0.7
Pig | 0.9
Chicken | 11.7
Xenopus | 5.9
Zebrafish | 13.4
Pufferfish Tetraodon | 3.0
Pufferfish Fugu | 2.0
The Tetraodon genome project
A tool for vertebrate comparative analysis
Global assembly statistics
 | Number | N50 length (Kb) | Size, gaps included (Mb) | Size, gaps excluded (Mb) | Longest (Kb) | Percentage of the genome, gaps included
All contigs | 49,609 | 16 | 312.4 | 312.4 | 258 | 89.3
Mapped contigs | 16,083 | 26 | 197.7 | 197.7 | 258 | 56.5
All scaffolds | 25,773 | 731 | 342.4 | 312.4 | 7,612 | 97.8
Mapped scaffolds | 1,338 | 1,382 | 218.2 | 197.7 | 7,612 | 62.4
All ultracontigs | 128 | 1,382 | 274.0 | 247.0 | 11,977 | 78.2
Mapped ultracontigs | 39 | 7,601 | 218.3 | 197.7 | 11,977 | 62.4
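The slides do not define the N50 length. Under the usual definition, it is the length L such that sequences of length L or longer contain at least half of the total assembled bases. A minimal sketch under that assumption, with toy lengths rather than Tetraodon data:

```python
def n50(lengths):
    """Return the N50 of a collection of sequence lengths.

    Sequences are taken from longest to shortest until their cumulative
    length reaches half of the total; the length of the last sequence
    taken is the N50.
    """
    ordered = sorted(lengths, reverse=True)
    half_total = sum(ordered) / 2
    running = 0
    for length in ordered:
        running += length
        if running >= half_total:
            return length
    raise ValueError("no lengths given")


# Toy example: total is 30, so half is 15; 10 + 8 = 18 reaches it.
print(n50([2, 4, 6, 8, 10]))
```

Because N50 weights long sequences heavily, it rises sharply when contigs are joined into scaffolds and ultracontigs even though the number of bases barely changes.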
(Tetraodon/human) ecores in annotated gene models, not overlapping exons
6180 ecores in Ensembl gene models but not matching an exon
5789 ecores in RefSeq-based gene models but not matching an exon
[Figures: gene models drawn from 5’ end to 3’ end]
(Tetraodon/human) ecores outside annotated gene models
19300 (Tetraodon/human) ecores outside Ensembl gene models
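The three categories used here (ecore matching an exon, ecore inside a gene model but matching no exon, ecore outside any gene model) amount to an interval-overlap test. The sketch below illustrates that classification with invented coordinates; the data layout and function names are assumptions for the example and are not taken from Exofish.

```python
def overlaps(a_start, a_end, b_start, b_end):
    """True if half-open intervals [a_start, a_end) and [b_start, b_end) overlap."""
    return a_start < b_end and b_start < a_end


def classify_ecore(ecore, genes):
    """Classify one ecore (start, end) against a list of gene models.

    Each gene model is a dict with a 'span' (start, end) and a list of
    'exons', each an interval (start, end) on the same chromosome.
    """
    in_gene = False
    for gene in genes:
        if not overlaps(*ecore, *gene["span"]):
            continue
        in_gene = True
        if any(overlaps(*ecore, *exon) for exon in gene["exons"]):
            return "matches exon"
    if in_gene:
        return "in gene model, no exon match"
    return "outside gene models"


# Hypothetical gene model with two exons.
genes = [{"span": (100, 1000), "exons": [(100, 200), (800, 1000)]}]

for ecore in [(150, 180), (400, 450), (2000, 2050)]:
    print(ecore, classify_ecore(ecore, genes))
```

Ecores in the second category suggest missing internal exons, while those in the third point to unannotated genes, gene extensions or pseudogenes.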
Vega gene annotations are generated by manual curation of computer-based models:
Chromosome 6: HAVANA group, Wellcome Trust Sanger Institute
Chromosome 7: Hillier et al., University of Washington Genome Centre
Chromosome 13: HAVANA group, Wellcome Trust Sanger Institute
Chromosome 14: Genoscope, CNRS
Chromosome 20: HAVANA group, Wellcome Trust Sanger Institute
Chromosome 22: Collins et al., Wellcome Trust Sanger Institute
Association of CpG islands and gene models (CpG island < 2 kb from 5’ end)
Model | CpG island (%)
Vega | 65
RefSeq | 60
Ensembl | 50
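A percentage of this kind can be obtained by taking the 5’ end of each gene model (the start coordinate on the plus strand, the end coordinate on the minus strand) and asking whether any CpG island lies within 2 kb of it. The sketch below assumes that reading of the criterion; the coordinates, names and the way distance is measured are illustrative, not the method used for the slide.

```python
def five_prime_end(start: int, end: int, strand: str) -> int:
    """Return the coordinate of the 5' end of a gene model."""
    return start if strand == "+" else end


def distance_to_interval(pos: int, interval) -> int:
    """Distance from a position to an interval (0 if the position is inside)."""
    low, high = interval
    if low <= pos <= high:
        return 0
    return min(abs(pos - low), abs(pos - high))


def percent_with_cpg_island(models, islands, max_dist: int = 2000) -> float:
    """Percentage of gene models with a CpG island closer than max_dist to the 5' end."""
    if not models:
        return 0.0
    hits = 0
    for start, end, strand in models:
        pos = five_prime_end(start, end, strand)
        if any(distance_to_interval(pos, island) < max_dist for island in islands):
            hits += 1
    return 100.0 * hits / len(models)


# Hypothetical gene models (start, end, strand) and CpG islands (start, end).
models = [(10_000, 20_000, "+"), (50_000, 60_000, "-"), (90_000, 95_000, "+")]
islands = [(9_500, 10_200), (61_500, 62_000)]
print(f"{percent_with_cpg_island(models, islands):.1f}%")
```

A lower percentage for a given annotation set is consistent with more models lacking their true 5’ end.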
Table S10. Exofish analysis of five finished human chromosomes.
Chromosome | Genes (known + putative) | Pseudogenes | Ecores in annotations | Ecores out of annotations
Chr. 6 | 1558 | 634 | 95.3% | 4.7%
Chr. 13 | 622 | 279 | 94.1% | 5.9%
Chr. 14 | 860 | 587 | 95.7% | 4.3%
Chr. 20 | 650 | 175 | 95.9% | 4.1%
Chr. 22 | 587 | 227 | 94.0% | 6.0%
Total | 4277 | 1902 | 95.1% | 4.9%
Extending Ensembl models
in new Vega models
in pseudogenes
elsewhere
32%
24%
36%
8%
(Tetraodon/human) ecores outside annotated gene models
Comparative Genomics applied to whole genomes
(1) a way to monitor the degree of completion of genome annotation
(2) a method to refine existing annotated gene models (extensions, additional internal exons)
(3) a resource for novel candidate gene models
(4) a method to identify non-transcribed and non-coding features
Contributors
Tetraodon genomics: H. Roest Crollius, A. Bernot, L. Bouneau, C. Dasilva, C. Fischer, S. Nicaud, JL Petit, Z. Skalli
Sequencing: P. Wincker
FISH: C. Ozouf-Costaz (Museum National d’Histoire Naturelle)
Informatics: O. Jaillon, J.M. Aury, V. Castelli, C. Dossat, M. Levy, E. Pelletier, C. Scarpelli, W. Saurin, V. Schächter
WIBR/MIT: N. Stange Thomann, S. Dodge, M. Zody, R. Santos, C. Nusbaum, B. Birren, E. Lander
cDNA: B. Segurens, M. Salanoubat, M. Katinka