
The Human Genome

The Human Genome. The International Human Genome Consortium, Initial sequencing and analysis of the human genome, Nature, 409, February 15, 860-921 (2001). Venter et al. (Celera), The Sequence of the Human Genome, Science, 291, February 16, 1304-1351 (2001). HC LEE, January 8, 2002.
Page 1: The Human Genome

The Human Genome

The International Human Genome Consortium
Initial sequencing and analysis of the human genome
Nature, 409, February 15, 860-921 (2001)

Venter et al. (Celera)
The Sequence of the Human Genome
Science, 291, February 16, 1304-1351 (2001)

HC LEE, January 8, 2002
Computational Biology Lab, National Central University

Page 2: The Human Genome
Page 3: The Human Genome

1984 to 1986 – first proposed at US DOE meetings

1988 – endorsed by US National Research Council
- creation of genetic, physical and sequence maps of the human genome
- parallel efforts in key model organisms: bacteria, yeast, worms, flies and mice
- development of supporting technology
- ethical, legal and social issues (ELSI)

1990 – Human Genome Project (NHGRI)

Later – UK, France, Japan, Germany, China

Page 4: The Human Genome

Time-line large scale genomic analysis

Page 5: The Human Genome

Completed sequences

1995 – First complete bacterial genomes
2002 – About 35 bacterial genomes; 0.5-5 Mb; hundreds to 2000 genes
1996 April – Yeast (Saccharomyces cerevisiae) 12 Mb, 5,500 genes
1998 Dec. – Worm (Caenorhabditis elegans) 97 Mb, 19,000 genes
2000 March – Fly (Drosophila melanogaster) 137 Mb, 13,500 genes
2000 Dec. – Mustard (Arabidopsis thaliana) 125 Mb, 25,498 genes
2000 June – Human (Homo sapiens) 1st rough draft
2001 Feb 15/16 – Human, “working draft” 3000 Mb, 35,000~40,000 genes

Page 6: The Human Genome

IHGCS paper

Nature, 409, February 15, 860-921 (2001)

Page 7: The Human Genome

Celera paper

Science, 291, February 16, 1304-1351 (2001)

Page 8: The Human Genome

Sequencing

BAC: Bacterial Artificial Chromosome clone

Contig: joined overlapping collection of sequences or clones.
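As a rough illustration of what "joined overlapping collection of sequences" means, here is a minimal sketch of greedy overlap merging. The function names, the toy reads and the `min_len` threshold are assumptions made for the example; real assemblers used for the genome projects are far more elaborate (sequencing errors, repeats, both strands).

```python
def overlap(a: str, b: str, min_len: int = 3) -> int:
    """Length of the longest suffix of a that equals a prefix of b.

    Returns 0 if no overlap of at least min_len bases exists.
    """
    for k in range(min(len(a), len(b)), min_len - 1, -1):
        if a.endswith(b[:k]):
            return k
    return 0


def merge_reads(reads, min_len: int = 3):
    """Greedily merge the pair with the largest overlap until none is left."""
    contigs = list(reads)
    while True:
        best = (0, None, None)  # (overlap length, index of left read, index of right read)
        for i, a in enumerate(contigs):
            for j, b in enumerate(contigs):
                if i != j:
                    k = overlap(a, b, min_len)
                    if k > best[0]:
                        best = (k, i, j)
        k, i, j = best
        if k == 0:
            return contigs  # nothing overlaps any more
        merged = contigs[i] + contigs[j][k:]
        contigs = [c for idx, c in enumerate(contigs) if idx not in (i, j)] + [merged]


# Three toy reads that tile one stretch of sequence.
print(merge_reads(["ATGCGT", "CGTTAC", "TACGGA"]))
```

With these toy reads the three fragments collapse into a single contig; reads that share no overlap would simply be returned as separate contigs.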

Page 9: The Human Genome

C-value paradox: Genome size does not correlate well with organismal complexity.

Human    Homo sapiens      3000 Mb
Yeast    S. cerevisiae     12 Mb
Amoeba   Amoeba dubia      600,000 Mb

Genomes can contain a large quantity of repetitive sequence, far in excess of that devoted to protein-coding genes.

Page 10: The Human Genome

Global properties

• Pericentromeric and subtelomeric regions of chromosomes filled with large recent transposable elements

• Marked decline in the overall activity of transposable elements or transposons

• Male mutation rate about twice female – most mutation occurs in males

• Recombination rates much higher in distal regions of chromosomes and on shorter chromosome arms
– > one crossover per chromosome arm in each meiosis

Page 11: The Human Genome

Important features of Human proteome

• 30,000–40,000 protein-coding genes

• Proteome (full set of proteins) more complex than those of invertebrates
– pre-existing components arranged into richer architectures

• Hundreds of genes seem to come from horizontal transfer from bacteria

• Dozens of genes seem to come from transposable elements

Page 12: The Human Genome

Human proteome is complex

• Gene codes proteins (also RNAs)

• Number of genes does not reflect complexity of organism

Organism   no. genes   no. proteins

Worm       20,000      ~20,000
Fly        13,500      >>20,000
Human      ~40,000     >>100,000

Page 13: The Human Genome

Human genome content

Total length 3000 Mb
~ 40,000 genes (coding seq)

Gene sequences < 5%
– Exons ~ 1.5% (coding)
– Introns ~ 3.5% (noncoding)

Intergenic regions (junk) > 95%

Repeats > 50%

Page 14: The Human Genome

Gene codes proteins (also RNAs)

Prokaryotes (single cell): one gene, one protein

Eukaryotes (often multicellular): gene = introns + exons; one gene, many proteins

(transcription & translation)

Page 15: The Human Genome

Fig 35a

Size distributions of exons in human, worm and fly. Human exons are shorter.

Page 16: The Human Genome

Fig 35c

Size distributions of introns in human, worm and fly. Human introns are longer.

Page 17: The Human Genome

Fig 35b

Page 18: The Human Genome

Gene recognition

• Coding and non-coding regions have different sequence profiles
– coding region is “protected” from mutation and is less random

• Gene recognition by sequence alignment

• Gene prediction by Hidden Markov Model trained on a set of known genes

• Many genes are homologs – similar in vastly different organisms
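To make the Hidden Markov Model idea concrete, the following is a minimal two-state sketch (coding vs. noncoding) decoded with the Viterbi algorithm. All probabilities are invented for illustration and are not trained on known genes as the slide describes; real gene finders use many more states (exons, introns, splice sites, codon positions). The sequence is assumed to be non-empty and to contain only A, C, G, T.

```python
import math

# Hypothetical parameters, chosen by hand for the example (not trained).
STATES = ("noncoding", "coding")
START = {"noncoding": 0.5, "coding": 0.5}
TRANS = {
    "noncoding": {"noncoding": 0.9, "coding": 0.1},
    "coding": {"noncoding": 0.1, "coding": 0.9},
}
EMIT = {
    "noncoding": {"A": 0.3, "C": 0.2, "G": 0.2, "T": 0.3},  # AT-rich
    "coding": {"A": 0.2, "C": 0.3, "G": 0.3, "T": 0.2},     # GC-rich
}


def viterbi(seq: str):
    """Return the most probable state path for seq (log-space Viterbi)."""
    score = {s: math.log(START[s]) + math.log(EMIT[s][seq[0]]) for s in STATES}
    back = []  # one dict of back-pointers per position after the first
    for base in seq[1:]:
        new_score, pointers = {}, {}
        for s in STATES:
            # Best predecessor state for ending in state s at this position.
            prev = max(STATES, key=lambda p: score[p] + math.log(TRANS[p][s]))
            new_score[s] = score[prev] + math.log(TRANS[prev][s]) + math.log(EMIT[s][base])
            pointers[s] = prev
        score = new_score
        back.append(pointers)
    # Trace back from the best final state.
    state = max(STATES, key=lambda s: score[s])
    path = [state]
    for pointers in reversed(back):
        state = pointers[state]
        path.append(state)
    return path[::-1]


path = viterbi("ATATATATATGCGCGCGCGC")
print(path)
```

Under these made-up parameters the AT-rich half is labelled noncoding and the GC-rich half coding; a trained model would estimate the transition and emission tables from annotated genes instead.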

Page 19: The Human Genome

Gene recognition is difficult for human

• Easy for prokaryotes (single cell) – one gene, one protein

• More difficult for eukaryotes (often multicellular) – one gene, many proteins

• Very difficult for human – short exons separated by long non-coding introns

Page 20: The Human Genome

Genes predicted in Human Genome

               Int’l Consortium   Celera

known genes    14,882             17,764
novel genes    16,896             21,350

Total          31,778             39,114

Page 21: The Human Genome

Two predictions disagree

John B. Hogenesch et al., Cell, Vol. 106, 413–415, August 24, 2001

“…predicted transcripts collectively contain partial matches to nearly all known genes, but the novel genes predicted by both groups are largely non-overlapping.”

Page 22: The Human Genome

Global properties with evolutionary implications

• Long-range variation in GC content not random

• CpG islands protected by genes

• Genetic and physical distance non-linear

• > 50% genome composed of repeats

Page 23: The Human Genome

Standard deviation of GC content: 15 times wider than for a random distribution

GC-rich and GC-poor regions have different biological properties, such as gene density, composition of repeat sequences, correspondence with cytogenetic bands and recombination rate.
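One way to see what "wider than random" means is to compute GC content in fixed-size windows and compare the spread with the binomial expectation for independent bases, sqrt(p(1-p)/w) for overall GC fraction p and window size w. The sketch below does this on a toy string; the function names and the window size are assumptions for the example, and it does not reproduce the analysis in the Nature paper.

```python
import math
import statistics


def gc_fraction(seq: str) -> float:
    """Fraction of G and C among the A/C/G/T bases of seq."""
    seq = seq.upper()
    acgt = sum(seq.count(b) for b in "ACGT")
    if acgt == 0:
        return 0.0
    return (seq.count("G") + seq.count("C")) / acgt


def windowed_gc(seq: str, window: int):
    """GC fraction of consecutive non-overlapping windows."""
    return [
        gc_fraction(seq[i:i + window])
        for i in range(0, len(seq) - window + 1, window)
    ]


def sd_ratio(seq: str, window: int) -> float:
    """Observed SD of window GC content divided by the binomial expectation.

    Assumes the overall GC fraction is strictly between 0 and 1.
    """
    values = windowed_gc(seq, window)
    p = gc_fraction(seq)
    expected = math.sqrt(p * (1 - p) / window)
    return statistics.pstdev(values) / expected


# Toy sequence: one all-GC window followed by one all-AT window.
print(windowed_gc("GGCCAATT", 4), sd_ratio("GGCCAATT", 4))
```

A ratio near 1 would indicate a composition consistent with independent bases; values well above 1 indicate long-range structure of the kind the slide describes.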

Page 24: The Human Genome

GC content is correlated with coding regions

Page 25: The Human Genome

GC content of introns (exons) vs. intron (exon) length.

Page 26: The Human Genome

Fig 14 CpG islands

CpG islands and genes are correlated.

CpG dinucleotides are methylated, and methyl-CpG steadily mutates to TpG. Hence CpG is greatly under-represented in human DNA, except in CpG islands near genes.
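The under-representation of CpG is usually quantified with an observed/expected ratio: the number of CG dinucleotides, times the sequence length, divided by the product of the C and G counts. The sketch below computes that ratio; the function name and the toy sequences are assumptions for the example, and no particular island threshold is implied.

```python
def cpg_obs_exp(seq: str) -> float:
    """Observed/expected CpG ratio: (#CG * length) / (#C * #G)."""
    seq = seq.upper()
    c, g = seq.count("C"), seq.count("G")
    if c == 0 or g == 0:
        return 0.0
    return seq.count("CG") * len(seq) / (c * g)


# A CpG-dense toy sequence versus one with C and G but no CG dinucleotide.
print(cpg_obs_exp("CGCGCGCG"))
print(cpg_obs_exp("CATGCATG"))
```

A ratio around 1 means CpG occurs about as often as the base composition predicts; bulk human DNA sits well below that, while CpG islands are closer to it.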

Page 27: The Human Genome

Fig 15 recomb rate (distal)

Recombination rate vs. physical position of genes from the centromere. Rate is higher in distal regions.

Page 28: The Human Genome

Fig 16 recomb rate (short arm)

Recombination rate is higher on shorter chromosome arms.

Page 29: The Human Genome

The genome mutates and copies itself

• 50%, probably much more, of genome composed of repeats
– Many traces of repeats obliterated by mutation
– Lower organisms may have longer genomes

• Five types of repeats
– transposable elements
– processed pseudogenes
– simple k-mer repeats
– segmental duplications (10-300 kb)
– (large) blocks of tandemly repeated sequences

Page 30: The Human Genome

Fig 17 transposables

Classes of transposable elements. LINE, long interspersed element. SINE, short interspersed element.

Total 45%

Interspersed repeats: fixed transposable elements copied to non-homologous regions.

Page 31: The Human Genome

Fig 21

Two regions of about 1 Mb on chromosomes 2 and 22. Red bars, interspersed repeats; blue bars, exons of known genes. Note the deficit of repeats in the HoxD cluster, which contains a collection of genes with complex, interrelated regulation.

Genes are sometimes protected from repeats

Page 32: The Human Genome

Tab 14 SSR content

Simple sequence (k-mer) repeats: SSR
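A simple sequence repeat is a short unit (k-mer) repeated in tandem. As a rough sketch, such runs can be located with a regular-expression backreference; the function name, the unit-length limit and the minimum repeat count below are assumptions for the example, not the criteria used for Table 14.

```python
import re


def find_ssrs(seq: str, max_unit: int = 6, min_repeats: int = 3):
    """Return (start, unit, copies) for tandem repeats of units up to max_unit bases.

    The lazy quantifier makes the regex prefer the shortest repeating unit,
    so AAAAAA is reported as A x 6 rather than AA x 3.
    """
    pattern = re.compile(r"([ACGT]{1,%d}?)\1{%d,}" % (max_unit, min_repeats - 1))
    hits = []
    for m in pattern.finditer(seq.upper()):
        unit = m.group(1)
        hits.append((m.start(), unit, len(m.group(0)) // len(unit)))
    return hits


# Toy sequence containing four tandem copies of the dinucleotide AC.
print(find_ssrs("TTGACACACACAGGT"))
```

Only exact, uninterrupted repeats are found this way; degenerate repeats that have picked up mutations would need an alignment-based approach.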

Page 33: The Human Genome

Fig 32b

Mosaic patterns of duplications. For each region, the top horizontal line shows a segment of sequence (100–500 kb) with interchromosomal (red) and intrachromosomal (blue) duplications displayed. Lower lines in distinct colours show separate sequence duplications. y axis: per cent nucleotide identity.

b. An ancestral region from Xq28 that has contributed various 'genic' segments to pericentromeric regions.

Page 34: The Human Genome

Fig 30

Page 35: The Human Genome

Fig 32a

An active pericentromeric region on chromosome 21.

Page 36: The Human Genome

Fig 32c

c. A pericentromeric region from chromosome 11.

Page 37: The Human Genome

Fig 32d

d. A subtelomeric region from chromosome 7p.

Page 38: The Human Genome

Fig 33

The finished human genome sequence has 1.5% interchromosomal and 2% intrachromosomal segmental duplications. The duplications are 10–50 kb long and highly homologous. Structure in the similarity may indicate that interchromosomal duplications occurred in a punctuated manner.

Page 39: The Human Genome

Human Proteome

• Number of human genes (~40,000) only twice that of worm or fly

• Many more transcripts (combination of exons in one gene)

• Many more proteins, perhaps >> 100,000

• Most proteins are still homologs of non-human proteins

• Homologs (from a common ancestor gene)
– orthologs: derived through speciation
– paralogs: derived through duplication

Page 40: The Human Genome

Completed eukaryotic proteomes

                                Human    Fly      Worm     Yeast   Mustard weed

Identified genes                32,000   13,338   18,266   6,144   25,706
Annotated domain families       1,262    1,035    1,014    861     1,010
Distinct domain architectures   1,695    1,036    1,018    310     -

Page 41: The Human Genome

Functional categories of eukaryote proteomes

Page 42: The Human Genome

Fig 38 distribution of homologs

Distribution of homologues of predicted human proteins

Page 43: The Human Genome

Simplified cladogram (relationship tree) of the 'many-to-many' relationships of classical nuclear receptors. Triangles indicate expansion within one lineage; bars represent single members. Numbers in parentheses indicate the number of paralogues in each group.

Page 44: The Human Genome

Fig 42 domain accretion

Domain accretion in chromatin proteins in various lineages before the animal divergence, in the apparent coelomate lineage and the vertebrate lineage, shown using schematic representations of domain architectures (not to scale). Asterisks, mobile domains that have participated in the accretion. Species in which a domain architecture has been identified are indicated (Y, yeast; W, worm; F, fly; V, vertebrate).

Page 45: The Human Genome

Fig 45 domain expansion

Lineage-specific expansions of domains and architectures of transcription factors

Page 46: The Human Genome

Conserved segments in human and mouse genome

Colour code: Mouse genome

Page 47: The Human Genome

Applications to medicine and biology

• Disease genes
– human genomic sequence in public databases allows rapid identification of disease genes in silico

• Drug targets
– pharmaceutical industry has depended upon a limited set of drug targets to develop new therapies
– now can find new targets in silico

• Basic biology
– basic physiology, cell biology…

Page 48: The Human Genome

The next steps

• Finishing the human sequence

• Developing the IGI (integrated gene index) and IPI (protein)

• Large-scale identification of regulatory regions

• Sequencing of additional large genomes
– mouse, super-rice, pig, fish…

• Completing the catalogue of human variation
– Single nucleotide polymorphism
– nasal and throat cancer…

• From sequence to function

