Goals of the International Human Genome Sequencing Consortium

• Completeness:
  – no mapping gaps
  – no sequencing gaps

• Accuracy:
  – error rate < 10^-4
  – based on a minimum of 3 reads (at least 1 on each strand)

Is the human genome sequence complete?

[Figure: four bar charts of sequencing gaps per chromosome (x-axis: chromosomes 1 to 24; y-axis: 0 to 25000). Panels: June 00/Apr 01, June 00/Dec 01, June 00/Nov 02, June 00/Apr 03]

Assembling the entire sequence: The Golden Path (NCBI, Build 34) (Jul 2003)

• Total estimated size (bp): 3 070 000 000

• Total euchromatin estimated size (bp): 2 864 000 000

• Size of non-overlapping assembly (bp): 2 843 433 602

• Euchromatic fraction sequenced (%): 99

• Number of cloning gaps: 281
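As a quick check of how the 99% figure follows from the sizes quoted for Build 34, the arithmetic can be sketched as below. The constant and function names are illustrative only; the numbers are those on the slide.

```python
# Figures quoted on the slide for NCBI Build 34 (Jul 2003).
TOTAL_GENOME_BP = 3_070_000_000   # total estimated size
EUCHROMATIN_BP = 2_864_000_000    # total euchromatin estimated size
ASSEMBLY_BP = 2_843_433_602       # size of non-overlapping assembly


def percent_covered(assembled_bp, target_bp):
    """Percentage of target_bp represented by assembled_bp."""
    return 100.0 * assembled_bp / target_bp


euchromatic_pct = percent_covered(ASSEMBLY_BP, EUCHROMATIN_BP)
whole_genome_pct = percent_covered(ASSEMBLY_BP, TOTAL_GENOME_BP)
missing_euchromatin_bp = EUCHROMATIN_BP - ASSEMBLY_BP

print(f"Euchromatic fraction sequenced: {euchromatic_pct:.1f}%")
print(f"Fraction of total estimated genome: {whole_genome_pct:.1f}%")
print(f"Euchromatin not yet in the assembly: {missing_euchromatin_bp:,} bp")
```

The same assembly covers a smaller share of the total estimated genome size because that estimate also includes heterochromatin.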

Is the human genome sequence accurate?

Sequence Accuracy

Base miscalls and small indels were determined by resequencing a sample of BACs from each sequencing center (J. Schmutz, Stanford Human Genome Center)

Error level between 10^-4 and 10^-5 per base pair
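To put the quoted error range in perspective, the sketch below converts an error rate to the standard Phred scale (Q = -10 log10 of the error probability) and to an expected number of erroneous bases over the Build 34 assembly size. It assumes, purely for illustration, that the error rate is uniform along the sequence; the function names are hypothetical.

```python
import math

ASSEMBLY_BP = 2_843_433_602  # non-overlapping assembly size quoted for Build 34


def phred_quality(error_rate):
    """Phred-scaled quality: Q = -10 * log10(error probability)."""
    return -10.0 * math.log10(error_rate)


def expected_errors(error_rate, n_bases):
    """Expected number of erroneous bases, assuming a uniform error rate."""
    return error_rate * n_bases


for rate in (1e-4, 1e-5):
    print(
        f"error rate {rate:g}: Q{phred_quality(rate):.0f}, "
        f"about {expected_errors(rate, ASSEMBLY_BP):,.0f} expected errors"
    )
```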

Sequence Accuracy

Large indels were identified by mapping end-sequenced fragments of known size onto the human genome assembly

Is the Human Genome Sequence well annotated?

Genes

coding parts

other transcribed segments (5’ and 3’ UTRs)

regulatory regions

Other features

structural (centromeres, telomeres...)

unknown functions

Sequence variants

DNA sequence features of biological relevance

Biological features are identified using

biological data (cDNAs, RNAs, proteins)

ab initio analyses

comparative analyses

combination of these methods

Ensembl Current Release (Jan 2004) (based on NCBI, Build 34)

• Ensembl gene predictions: 23 531

• Genscan gene predictions: 65 010

• Ensembl gene exons: 225 897

• Ensembl gene transcripts: 31 609

NCBI statistics (July 2003) (based on Build 33)

RNA genes                                      185
Protein coding genes
  Function known or inferred                 14858
  Function unknown                            6383
Models predicted ab initio                    4288
Models predicted ab initio with EST support   9848
Other models with EST support                 3431
Other models with mRNA support                3289
Known and predicted pseudogenes               6241

TOTAL features                               48523
TOTAL protein or possibly protein coding     42282
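The two totals can be reproduced from the individual categories. The short check below is only an arithmetic sketch; how the categories are grouped is inferred from the sums, not stated on the slide.

```python
# Category counts as quoted for NCBI Build 33 (July 2003).
rna_genes = 185
pseudogenes = 6241
coding_categories = {
    "Function known or inferred": 14858,
    "Function unknown": 6383,
    "Models predicted ab initio": 4288,
    "Models predicted ab initio with EST support": 9848,
    "Other models with EST support": 3431,
    "Other models with mRNA support": 3289,
}

coding_sum = sum(coding_categories.values())
# Observation from the arithmetic: the quoted 42282 is reached only when
# the 185 RNA genes are added to the six gene and model categories.
subtotal = coding_sum + rna_genes
total_features = subtotal + pseudogenes

print(coding_sum, subtotal, total_features)
```

Read this way, the quoted total of 48523 features is the 42282 subtotal plus the 6241 known and predicted pseudogenes.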

RefSeq Statistics (14 Jan 04)

Review status    RNAs     Proteins
Validated        1214     1210
Inferred         5        5
Provisional      23885    23767
Predicted        8720     8720
Reviewed         9393     10393
Models           31935    31882

Despite the availability of a nearly complete sequence (99%) of the human genome, the gene inventory is not yet complete

Numerous annotated gene models remain fragmentary

[Figure: gene model drawn from 5’ end to 3’ end, with a CpG island]

Annotated gene models frequently lack:

- the 5’ end

- alternative exons

- alternative splicing sites

- alternative starts or polyadenylation sites

Some neglected genome features

Small open reading frames (smORFs)

• encoding fewer than 100 amino acids

• phylogenetically conserved

• single or multiple exon genes

• other standard gene features
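A minimal sketch of the size criterion alone is given below: it scans the six reading frames of a DNA string for ATG-to-stop ORFs encoding fewer than 100 amino acids. It handles single-exon candidates only and ignores phylogenetic conservation and the other gene features listed above. The function names and the `min_codons` floor are hypothetical choices for illustration.

```python
STOP_CODONS = {"TAA", "TAG", "TGA"}
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq):
    """Return the reverse complement of an upper-case DNA string."""
    return seq.translate(COMPLEMENT)[::-1]


def find_small_orfs(seq, min_codons=10, max_codons=100):
    """List (strand, start, end, n_aa) for ORFs with min_codons <= n_aa < max_codons.

    Coordinates are 0-based, end-exclusive, and relative to the scanned
    strand (for "-", relative to the reverse-complemented sequence).
    n_aa counts the start methionine and excludes the stop codon.
    """
    hits = []
    seq = seq.upper()
    for strand, s in (("+", seq), ("-", reverse_complement(seq))):
        for frame in range(3):
            start = None
            for i in range(frame, len(s) - 2, 3):
                codon = s[i:i + 3]
                if start is None and codon == "ATG":
                    start = i  # first in-frame start codon opens the ORF
                elif start is not None and codon in STOP_CODONS:
                    n_aa = (i - start) // 3
                    if min_codons <= n_aa < max_codons:
                        hits.append((strand, start, i + 3, n_aa))
                    start = None  # look for the next ORF in this frame
    return hits


example = "CC" + "ATG" + "GCT" * 11 + "TAA" + "GG"
print(find_small_orfs(example))
```

A realistic search would add conservation across species as a filter, since short ORFs occur frequently by chance.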

MicroRNA genes

• ~22-nucleotide non-coding RNAs

• control expression of other genes at the post-transcriptional level

• derive from a phylogenetically conserved stem-loop precursor with characteristic features

• 200-250 miRNA genes in the human genome

Other non-coding transcripts

• spliced, polyadenylated and cytoplasmic

• expressed at low level

• poorly conserved between human and mouse

• display some tissue specificity

• may represent 20 to 30% of the genome

• can they be considered as products of genes?

How can we identify more genomic features?

• More ‘full length’ cDNAs

• Combining ab initio predictions with biological data (RT-PCR, SAGE)

• Global genome comparisons

The power of sequence comparison has long been known

Sequences that have a biological function have been observed to show a higher degree of sequence similarity than average

However, a higher degree of sequence similarity is not proof of biological function

Comparative Genomics

To obtain a better idea of the respective roles of mutation and selection, the main forces acting on genome evolution, one cannot restrict analyses to coding sequences

Hence the use of a conservation score that can be applied to any type of genomic DNA sequence


$$S = S(R) = \frac{\rho - \mu}{\sqrt{\mu(1-\mu)/n}}$$

n: number of sites within the window that are aligned

ρ: fraction of aligned sites that are identical

µ: average fraction of sites that are identical in aligned ancestral repeats in the surrounding region
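The score can be sketched in a few lines, assuming the denominator is the binomial standard error sqrt(µ(1−µ)/n), so that S measures how many standard deviations the window's identity ρ lies above the local neutral rate µ. The function name and the example window values are hypothetical.

```python
import math


def conservation_score(n_aligned, n_identical, mu):
    """Normalized conservation score S for one alignment window.

    n_aligned   -- n, number of aligned sites in the window
    n_identical -- number of aligned sites that are identical
    mu          -- average identity in aligned ancestral repeats nearby
    """
    if n_aligned <= 0:
        raise ValueError("the window must contain at least one aligned site")
    if not 0.0 < mu < 1.0:
        raise ValueError("mu must lie strictly between 0 and 1")
    rho = n_identical / n_aligned
    # Assumed form: (rho - mu) divided by the binomial standard error.
    return (rho - mu) / math.sqrt(mu * (1.0 - mu) / n_aligned)


# Hypothetical window: 100 aligned sites, 80 identical, neutral identity 0.5.
print(round(conservation_score(100, 80, 0.5), 2))
```

A window whose identity equals the neutral rate scores 0. Because the standard error shrinks with n, the same excess identity gives a higher score in a longer window.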

Non-coding sequences

coding exons

5’ UTR

200 bp upstream transcript start

known regulatory regions

introns

3’ UTR

200 bp downstream transcript end

CpG islands


Identification of regions of biological relevance cannot rely only on comparisons of the human/mouse pair

A substantial fraction of sequences under selection show conservation scores in the range of sequences evolving neutrally

Use of genome sequences from additional species can overcome this limitation

Vertebrate Genome Sequencing Projects

Human (>9X coverage, 99% complete)

Mouse (7X coverage)

Rat (6X coverage)

Zebrafish (5X coverage)

Pufferfish Tetraodon (7X coverage)

Pufferfish Fugu (6X coverage)

Vertebrate Genome Sequencing Projects: Data in Trace Archive

Species                  Sequence reads (million reads)
Mouse                    78.9
Rat                      39.6
Chimpanzee               27.5
Lemur                    0.5
Dog                      35
Cat                      0.6
Bovine                   0.7
Pig                      0.9
Chicken                  11.7
Xenopus                  5.9
Zebrafish                13.4
Pufferfish Tetraodon     3.0
Pufferfish Fugu          2.0

The Tetraodon genome project

A tool for vertebrate comparative analysis

Global assembly statistics

                       Number    N50 length   Size, gaps      Size, gaps      Longest   Percentage of the
                                 (Kb)         included (Mb)   excluded (Mb)   (Kb)      genome, gaps included
All contigs            49,609    16           312.4           312.4           258       89.3
Mapped contigs         16,083    26           197.7           197.7           258       56.5
All scaffolds          25,773    731          342.4           312.4           7,612     97.8
Mapped scaffolds       1,338     1,382        218.2           197.7           7,612     62.4
All ultracontigs       128       1,382        274.0           247.0           11,977    78.2
Mapped ultracontigs    39        7,601        218.3           197.7           11,977    62.4
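For readers unfamiliar with the N50 length reported in these statistics, the sketch below uses the usual definition: the length L such that pieces of length L or longer contain at least half of the assembled bases. The input lengths are made up for illustration and are not Tetraodon data.

```python
def n50(lengths):
    """Return the N50 of a collection of contig or scaffold lengths."""
    if not lengths:
        raise ValueError("at least one length is required")
    total = sum(lengths)
    running = 0
    # Walk from the longest piece down until half of the bases are covered.
    for length in sorted(lengths, reverse=True):
        running += length
        if 2 * running >= total:
            return length


# Hypothetical contig lengths in Kb.
print(n50([80, 70, 50, 40, 30, 20]))
```

Joining contigs into scaffolds and ultracontigs lowers the number of pieces and typically raises the N50, which is the trend in the Number and N50 columns from contigs to mapped ultracontigs.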

(Tetraodon/human) ecores in annotated gene models, not overlapping exons

6180 ecores in Ensembl gene models but not matching an exon

5789 ecores in RefSeq-based gene models but not matching an exon

[Figure: gene model schematics, 5’ end to 3’ end]

(Tetraodon/human) ecores outside annotated gene models

19300 (Tetraodon/human) ecores outside Ensembl gene models

Vega gene annotations are generated by manual curation of computer-based models:

HAVANA group, Wellcome Trust Sanger Institute

Hillier et al., University of Washington Genome Centre

Genoscope, CNRS

Collins et al., Wellcome Trust Sanger Institute

Chromosomes: 6, 7, 13, 14, 20, 22

Association of CpG islands and gene models (CpG island < 2 kb from 5’ end)

Model      CpG island (%)
Vega       65
RefSeq     60
Ensembl    50

Table S10. Exofish analysis of five finished human chromosomes.

Chromosome   Genes (known + putative)   Pseudogenes   Ecores in annotations   Ecores out of annotations
Chr. 6       1558                       634           95.3 %                  4.7 %
Chr. 13      622                        279           94.1 %                  5.9 %
Chr. 14      860                        587           95.7 %                  4.3 %
Chr. 20      650                        175           95.9 %                  4.1 %
Chr. 22      587                        227           94.0 %                  6.0 %
Total        4277                       1902          95.1 %                  4.9 %

Extending Ensembl models

in new Vega models

in pseudogenes

elsewhere

32%

24%

36%

8%

(Tetraodon/human) ecores outside annotated gene models

Comparative Genomics applied to whole genomes

(1) a way to monitor the degree of completion of genome annotation

(2) a method to refine existing annotated gene models (extensions, additional internal exons)

(3) a resource for novel candidate gene models

(4) a method to identify non-transcribed and non-coding features

Contributors

Tetraodon genomics: H. Roest Crollius, A. Bernot, L. Bouneau, C. Dasilva, C. Fischer, S. Nicaud, JL Petit, Z. Skalli

Sequencing: P. Wincker

FISH: C. Ozouf-Costaz (Museum National d’Histoire Naturelle)

Informatics: O. Jaillon, J.M. Aury, V. Castelli, C. Dossat, M. Levy, E. Pelletier, C. Scarpelli, W. Saurin, V. Schächter

WIBR/MIT: N. Stange Thomann, S. Dodge, M. Zody, R. Santos, C. Nusbaum, B. Birren, E. Lander

cDNA: B. Segurens, M. Salanoubat, M. Katinka

Date post:	02-Feb-2021