Nikolaj Blom Center for Biological Sequence Analysis BioCentrum-DTU

Date post: 26-Jan-2016
Description:
“Gene Finding in Eukaryotic Genomes”, PhD course #27803, Spring 2003. Nikolaj Blom, Center for Biological Sequence Analysis, BioCentrum-DTU, Technical University of Denmark, nikob@cbs.dtu.dk. Human Genome Published: HUGO: Nature, 15 Feb 2001; Celera: Science, 16 Feb 2001.
Transcript
Page 1: Nikolaj Blom Center for Biological Sequence Analysis BioCentrum-DTU

Center for Biologisk Sekvensanalyse

Nikolaj Blom
Center for Biological Sequence Analysis
BioCentrum-DTU
Technical University of Denmark
nikob@cbs.dtu.dk

“Gene Finding in Eukaryotic Genomes”

PhD course #27803

Spring 2003

Page 2: Nikolaj Blom Center for Biological Sequence Analysis BioCentrum-DTU

Human Genome Published

HUGO: Nature, 15 Feb 2001
Celera: Science, 16 Feb 2001

Page 3: Nikolaj Blom Center for Biological Sequence Analysis BioCentrum-DTU

We Have the Human Genome Sequence... now what?

So, what is the problem?
• Well...
• We don’t know how many genes there are!
• We don’t know where they are!
• We don’t know what they do!

Page 4: Nikolaj Blom Center for Biological Sequence Analysis BioCentrum-DTU

Page 5: Nikolaj Blom Center for Biological Sequence Analysis BioCentrum-DTU

The cellular machinery recognizes genes without access to GenBank, SwissProt or computers – can we?

Page 6: Nikolaj Blom Center for Biological Sequence Analysis BioCentrum-DTU

Needles in Haystacks...

Only 2% of the human genome is coding regions

Intron-exon structure of genes
• Large introns (average 3365 bp)
• Small exons (average 145 bp)
• Long genes (average 27 kb)
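
As a rough back-of-the-envelope illustration of these averages (not a figure from the slides), one can ask how many exons an "average" gene would contain and what fraction of it would be exonic. The sketch assumes, simplistically, that a gene is n average-sized exons separated by n-1 average-sized introns; the function name is made up for this example.

```python
def average_gene_model(gene_len=27_000, exon_len=145, intron_len=3365):
    """Back-of-the-envelope gene model built from the slide's averages.

    Assumes gene_len = n * exon_len + (n - 1) * intron_len and solves for n.
    Returns (estimated number of exons, exonic fraction of the gene).
    """
    n_exons = (gene_len + intron_len) / (exon_len + intron_len)
    exonic_fraction = n_exons * exon_len / gene_len
    return n_exons, exonic_fraction


n_exons, exonic_fraction = average_gene_model()
print(f"about {n_exons:.1f} exons, {exonic_fraction:.1%} of the gene is exonic")
```

Under these assumptions the average gene has between 8 and 9 exons, and well under a tenth of its length is exon sequence, which is why exons are the "needles".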

Page 7: Nikolaj Blom Center for Biological Sequence Analysis BioCentrum-DTU

AAGAGGTAATTAAAGCTAAATGAAGTTGTAAGAGTGGCCCTATCGCATAGGACTAGTGTCCCTATAAGAACACGAAGAAATCACCTTAGAAAGGCTGAGAAAGGGCTGCAGGGCAGTGGGAGTGCAGACTGAAAGATGCAGACCACTGGGCTTCTACTTCTGTTTCCATTTCTGATCCGGCCTGCATCTGCCTCCTTCCTGAACAGGCCAGAGAATTCATCTAAATAGCCTAAGCAGGCTGGGTGCTGTGGCTCACCTGTAATCCCAACACTTGGGAGGCCGAGGTGGGCAGATCACCTGAGGTCAGGAGTTCAAGGCTAGCCTAGCCAACATGACAAAACCCCATCTCTACTAAAAAAATACAAAAATTAGCCAGGCATAGTGGCGCCTATAGTTCCAGCTACTTGGGGGCTGAGGTAGGAAGATCGCTAGAGCCTGGGAGGTTAAGGCTGCGGTGAGCTGTGATTGTGCCACTGCACTCCAGCCTGGGTGACAGAGCAAGACCCTGCCTCAAAAATAAATAAATAAATAAATAAATAAAAATAAGAGTGCTTGGCAGCTTGATCAAGCTATGCCAGGAACCCATCTCTCAAGCAGCAGCTCTTCTCCTGTGCCATTGTCAGCTTTGTCCTGTCTGAGTCCATGGGACTCTTCTGTTTGATGGTGGTCTTCCTCATCCTCTTCATCATGTGAAGCTCCATGGAGATCACCTACCCATACCTGCTTCTGTGACCTCATGCCATTCCTGGTGTTGGAATGTGCCAAGGTTTGCCATTAAACACACATTTCTCATTTCATAATTTCATATATATTATATATATGTGTGTGTGTGTGTGTTTATATATGCGTGTGTGTGTGTGTGTGTGTGTGTGTGTGTGTATATATATATATATATATATATATATATATATATATATATAAAATATATAGGAAGAGGCACCAGAGAGCTCTCTGCATAGTCACAGAGGAAAGGTCATGTGAGGACAGCCAGAAGGCAGATGTCACAAGCCTCACCAGCAACCTACCATACCCTGCTTGTACCTCCATCCTGGAAGTCCAGCTTCTAAAATTAGAAGAAAATAGTCGGGTGTGGTGGCTCGCACCTATAATCCCAGCACTTTGGGAGGCTGATGTGGGAGGATCATTTGAGGTCAAGAGTTTGAAACCAGCCTAGGCAACATAGGGAGACCCTGTCTTTAAAAAAAATTTTTTTTTGTTTTAATTAGCTGGGTGTGATGGTGCACACCTGAGTCCTAGCTACTTGGGAGGCTGAGGTAGGAGGATCCCCTGAGCCCAGGGAAGTGGAGGCTGCAGTGAGCCATGATCACACCACTGCAATACAGCCTGGGTGACAGAGCAAGACCTTATCTCAAAATAAACAAACAAACAAAAAAGATGACAAAATAAATGTCTGTCGTTTAAGTCACCCATTCTGTGATATCTTGTTACGGCAGCCTGAACTGACCAATACACTTCCTCACCCAGTTTAAATTCCATGCTCAATCATAATCAGCCATTGCAATTACCCTCAACTGTATTATCAACCCTCAATTTGTATTAGTTGCTTGGCAAAACCCAAACCCTTGTGAAATCCAGTTCTTCTATATCTACATCGATGCTGCCGAATATGGCTGAAGAAAAGCAACTGTGTTGACTGGACTGCTTTAAATTCATGACCACTTACCTCAAGTGGGCACTTAACTTCCTGGCAATTATTCTACATTTTTCTAGTCCATTAACTCTCCTCCTCTCTGAGTTAATTATTTCACAGCTTTTCCTCCCTCTTTATACATGTTCCATCCTAACTCTCTGCTGATGACCTTGTTTCTTATTTCACTAATGGAGGCCACCAGGAGAGAACTCCCACAGCCATCAAATTCACCAAGCCAACAGCATCCTTACACAAATCCTCTGCCTTCTCTCTGGGCTGGCTGTGCCCTCTCTTTGCTCCTGCAATTTCCCTAACTCTCCTATACTGTTGTTATTCACTCTCCAGTGGA
TAATCACCATCAGGATGCAAAGATGCTGTACTAGCTTCTGAACTCTCCAAAAACCCAGGAAACAAAAAGGCAAAGGCTAAGCTTTTTCTTATTCCCCCTTCCAGCTATTGTACTGTTTCTCTGCTTTTAATTTATTTTTATTTATTTATTTATTTATTTATTTATTTATTTTTGAGATGGAGCTTCACTCTTGTTGCCCAGGCTGGAGCGCAATGGCGCGATCTCAGCTCACCGCAACCTCTACTTCCCGAATTCAAGTGATTGTCCTGCCTCAGCCTCCCGAGTAGCCGGGATTACAGGCATGCGCCACCACGCCTGGCTAATTTTGTACTTTTAGTAGAGACGGGGTTTCTCCATGTTGCTCAGCCTGGTCACAAACTCCCGATCTCAGGTGATCTGCCTGCCTCGGCCTCCCAAAGTGCTGGGATTACAGGCGTGAGCCACCACGCCCCACCGTCTCTGTTCTCTTTTAAAGCACAATCCCTCAACACAAGTGTCTATACTCAGCGTCTCCACTTTCCCTCCATCTGGTCTTCCCAGTGCCCCCTTGTCAGGTTTTCACCCCATGCTCCTCCAGGGCTAGTCTGCTCTTGCTTCCCGTCTTACTGGAAGACCAGCAGCATTTGACAGAGTTGGTCACTCTCTCCTCCTTGGACACCTTTTCTTCACTTGGTTTCCAGAACAGCATTATCTCCTGCTTATTGTCTTCCTCAGTCTACCTCAGTGAAAAGCTTTACTGGTTCCTCCACATCTCCCAGACCTCCAGTAATAACAGGAATGTACCATGCCATTGCTCTCTCTCTCTCCTTTTTTTTTTTTTTTTTTTTTTTTTGTTGAGACAGAGTCTCAATTTTATCACCCAGACTGAAGCACAATGGCATGATCATAGCTCATTGCAGTCTCGAACTCGTGGGCTCAAGCAATCCTCCCACCTCAGCCTCCTGAATAGCTGGGACTACAAGCAACACCACCATGCCCAGCTAACTTTCTATTTTTTATTTTTATTTTTTGTAGAGATGAGGTTTTACTATGTTGCCTAGGCTAGTCTTGAACTCCTGGGCCCAAATGATCCTCCCACCTTGGTCTCCCAAAGTGCTGGGATTATAGGCGTGAGCCACCGTGTCCAACTTCTCTTTCTTAATGGAATTTAGGCAAAAGTTATTACTCATGGCCTTGGAATGCTCTTTCCTCAGATAGCCACATGGCTCACCATTACTTCCTTCCAGCTTTCTTCAAAGATCCACTTCTCAGTGAAGCTTTGTCCTGACCACCCAGCTGAAAATTGCAATCCTCTTCTGTCTACCATGTACATACTCTCTATTTGCTTTCCTTCCTTTATTTCTCTCTGTAGGTGTGACCTAACATAACATATAATTTACTTCTGTACCTTGTTTGCTTTCTGTCTTCCCCTTTAGAACATAAGCTCCATGAGGGAAGGCGTTTTTGCCTGCTTTAGTCACTTTATCTCCAGCAACTACAACTATATGTATATATACACACACATATATATACACACACATATATATACACACACATATATATATACATATATATATATAGTAGGCACTCAATAAACATTCACTGAATGAATGAACAGTAATGCTCACTTGCCCATAAATACAAGTACCTCATCTTTTACCACAAAGGGTATTTGTAAATATTTAGGTTGTTTCTACCCAGATTATGGCTTGGTAATTCTTTTTTTTTTTTTCTAATTTTTATTTTTTTTCTAGGGACAGGGTCTCACTATGTTGCCCAGGATGGTCTTGAACTCCTGGGCTCAAGCATTCTGCCTGCCTTGGCCTCCTAAAGTGCTGAGATTACAGGCATGAGCCACCGTGCCTGCCTTCATGTATGTTTTTAGAACACAGAGAAAATGTGTTCTAAATGTGCTCATTGCTCAGCAATGAGCAAAGGCTTATGCAGTCACCACCAATCAAAAACTTTTTTTTTTTTTTTTGAGACAAGATCTTGCTCTGTTGCCCAGGCTGGAGTGCAGT
GGCAGGATCATAGCAAGCTGCAGTCTTGACCTCATAGGCCTAAATCATCCTCCCACCTCAGCCTCACAAGTAGCTAAGACCACAGGTACAAGCCACCGTATCTAGCTAACTTTCAAAATTTTTTGAATTTTTAAATTTAAAAATTTTGAGGCCAGGCTGGCCTCAAACTCCTGAGCTCAAGCAATCCTCCCACCTTGGCTTCCCAAAGTGCTGGGATTATAGGCGTGAGCAACTGTACCTGGCAAAAACTTTTTAAGAGCTTCGCTTCCAGGATTAGGCAACTTTAACCTTCAACAGTGATCATAACCCTTAGTTTTCAGATCCGATTAAGGGAAATGTGTAATGTCTTACTGACACACTAATCCCATCACTGCTCACACCACCCACAATTAGCTGAG

Page 8: Nikolaj Blom Center for Biological Sequence Analysis BioCentrum-DTU

[Same genomic sequence as on Page 7]

Page 9: Nikolaj Blom Center for Biological Sequence Analysis BioCentrum-DTU

Page 10: Nikolaj Blom Center for Biological Sequence Analysis BioCentrum-DTU

Genes and Signals

Page 11: Nikolaj Blom Center for Biological Sequence Analysis BioCentrum-DTU

Gene Features

Codon frequency/bias
• Organism dependent
• Hexamer statistics

Transcriptional
• Promoters/enhancers

Exon/introns
• Length distributions
• ORFs

Splicing
• Donor/acceptor sites
• Branchpoints

Translational
• Ribosome binding sites
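
Of the features listed above, the open reading frame (ORF) is the simplest to detect computationally. The following is a minimal sketch, assuming the standard stop codons and looking only at the three forward-strand frames; the function name and the `min_codons` parameter are illustrative choices, not part of any tool mentioned in these slides.

```python
STOP_CODONS = {"TAA", "TAG", "TGA"}


def find_orfs(seq, min_codons=2):
    """Return (start, end, frame) for every ATG...stop ORF on the forward strand.

    Coordinates are 0-based and end is exclusive (the stop codon is included).
    min_codons counts the codons before the stop codon.
    """
    seq = seq.upper()
    orfs = []
    for frame in range(3):
        start = None
        for i in range(frame, len(seq) - 2, 3):
            codon = seq[i:i + 3]
            if start is None and codon == "ATG":
                start = i  # first start codon opens the ORF
            elif start is not None and codon in STOP_CODONS:
                if (i - start) // 3 >= min_codons:
                    orfs.append((start, i + 3, frame))
                start = None  # look for the next ORF in this frame
    return orfs


print(find_orfs("CCATGAAATGACC"))  # [(2, 11, 2)]
```

In eukaryotic genomic DNA an ORF scan alone is not sufficient, because introns break the reading frame; it is more directly useful on prokaryotic sequence or on spliced cDNA.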

Page 12: Nikolaj Blom Center for Biological Sequence Analysis BioCentrum-DTU

Codon Bias

Gene Finders are often organism specific

Coding regions are often modelled by a 5th order Markov chain (hexamers/di-codons)
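
To make the 5th order Markov chain concrete: each base is scored by its probability given the five preceding bases, so the model is estimated from hexamer counts. The sketch below is a simplified, homogeneous version (real gene finders typically use frame-specific models and far larger training sets). The function names, the pseudocount and the toy training strings are assumptions made for illustration.

```python
import math
from collections import defaultdict


def train_markov(seqs, order=5, pseudocount=1.0):
    """Estimate P(base | previous `order` bases) from training sequences.

    Returns a function log_prob(context, base) with additive smoothing.
    """
    counts = defaultdict(lambda: defaultdict(float))
    for s in seqs:
        s = s.upper()
        for i in range(order, len(s)):
            counts[s[i - order:i]][s[i]] += 1

    def log_prob(context, base):
        c = counts.get(context, {})
        total = sum(c.values()) + 4 * pseudocount
        return math.log((c.get(base, 0.0) + pseudocount) / total)

    return log_prob


def log_odds(seq, coding_lp, noncoding_lp, order=5):
    """Log-odds score: positive values favour the coding model."""
    seq = seq.upper()
    return sum(
        coding_lp(seq[i - order:i], seq[i]) - noncoding_lp(seq[i - order:i], seq[i])
        for i in range(order, len(seq))
    )


# Toy training data (hypothetical; real models are organism specific)
coding_lp = train_markov(["ATGGCCGCCGCCGCCGCC" * 3])
noncoding_lp = train_markov(["ATATATATATATATATATAT" * 3])
print(log_odds("GCCGCCGCCGCC", coding_lp, noncoding_lp) > 0)  # True
```

Because the hexamer frequencies differ between organisms, a model trained on one genome can score poorly on another, which is one reason gene finders are organism specific.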

Page 13: Nikolaj Blom Center for Biological Sequence Analysis BioCentrum-DTU

Exon Size

[Figure: bar chart of the exon size distribution. Size classes: 1-100, 100-200, 200-300, 300-500, >500. Series: Fungi, Vertebrate. Vertical axis: 0 to 35.]

Page 14: Nikolaj Blom Center for Biological Sequence Analysis BioCentrum-DTU

Intron Size

[Figure: bar chart of the intron size distribution. Size classes: <100, <200, <1 kbp, 1 to 5, >5. Series: Fungi, Vertebrate. Vertical axis: 0 to 70.]

Page 15: Nikolaj Blom Center for Biological Sequence Analysis BioCentrum-DTU

Intron Prevalence

[Figure: bar chart of intron prevalence. Classes: 0, 1, >1. Series: Yeast, Fungi, Mammal. Vertical axis: 0 to 100.]

Page 16: Nikolaj Blom Center for Biological Sequence Analysis BioCentrum-DTU

Gene Finding Challenges

Need the correct reading frame
• Introns can interrupt an exon in mid-codon

There is no hard and fast rule for identifying donor and acceptor splice sites
• Signals are very weak
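
The reading-frame problem can be made concrete with the notion of intron phase: the number of bases of an unfinished codon left at the end of the preceding coding exon. A minimal sketch follows; the function name is illustrative, and the example lengths are taken from the HoxA10 coding exon coordinates in the exercise later in these slides (320..1226 and 2401..2675).

```python
def intron_phases(coding_exon_lengths):
    """Phase of the intron that follows each coding exon except the last.

    Phase 0: the intron falls between two codons.
    Phase 1 or 2: the intron interrupts a codon after its 1st or 2nd base.
    """
    phases = []
    total = 0
    for length in coding_exon_lengths[:-1]:
        total += length
        phases.append(total % 3)
    return phases


hoxa10_cds = [1226 - 320 + 1, 2675 - 2401 + 1]  # 907 and 275 bp
print(intron_phases(hoxa10_cds))  # [1]
print(sum(hoxa10_cds) % 3)        # 0, the spliced CDS is a whole number of codons
```

A gene finder therefore has to track the phase across every intron so that the next exon is read in the correct frame.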

Page 17: Nikolaj Blom Center for Biological Sequence Analysis BioCentrum-DTU

Page 18: Nikolaj Blom Center for Biological Sequence Analysis BioCentrum-DTU

Overpredicting Genes

Easy to predict all exons

Report all sequences flanked by ..AG and GT.. as exons

Sensitivity = 100%

Specificity ~ 0%
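
The overprediction argument can be reproduced in a few lines. This sketch assumes the convention common in gene-finding evaluations, where sensitivity is the fraction of true exons that are predicted and specificity is the fraction of predictions that are correct. The function names and the toy sequence are made up for illustration.

```python
import re


def naive_exon_candidates(seq, min_len=1):
    """Every interval preceded by AG (acceptor) and followed by GT (donor).

    Returns a set of 0-based, half-open (start, end) intervals.
    """
    seq = seq.upper()
    starts = [m.start() + 2 for m in re.finditer(r"(?=AG)", seq)]  # base after AG
    ends = [m.start() for m in re.finditer(r"(?=GT)", seq)]        # base before GT
    return {(a, d) for a in starts for d in ends if d - a >= min_len}


def sensitivity_specificity(predicted, true):
    """Exon-level sensitivity (TP / true) and specificity (TP / predicted)."""
    tp = len(predicted & true)
    sn = tp / len(true) if true else 0.0
    sp = tp / len(predicted) if predicted else 0.0
    return sn, sp


seq = "TTAGCCCGTAAAGGGGGTTT"
candidates = naive_exon_candidates(seq)
print(sorted(candidates))  # [(4, 7), (4, 16), (13, 16)]
print(sensitivity_specificity(candidates, {(4, 7)}))
```

Even in this 20-base toy sequence two of the three candidates are wrong. On real genomic sequence the number of AG/GT pairs grows roughly quadratically with length, so specificity approaches zero.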

Page 19: Nikolaj Blom Center for Biological Sequence Analysis BioCentrum-DTU

Sensor-based methods

Similarity searches miss some/many genes

cDNA/EST libraries are not perfect

Ab initio Gene Finders
• HMM-based
  • GenScan
  • HMMgene
• Neural network-based
  • GRAIL
  • NetGene2 (splice sites)

Page 20: Nikolaj Blom Center for Biological Sequence Analysis BioCentrum-DTU

Gene Prediction

“Isolated” methods
• Predict individual features
  • E.g. splice sites, coding regions
  • NetGene (Neural network)
    – http://www.cbs.dtu.dk/services/NetGene2/

“Integrated” methods
• Predict genes in context
  • “Grammar” of genes
  • Certain elements in specific order are required
    – HMMgene http://www.cbs.dtu.dk/services/HMMgene/
    – GenScan (HMM-based) http://genes.mit.edu/GENSCAN.html

Page 21: Nikolaj Blom Center for Biological Sequence Analysis BioCentrum-DTU

Gene Grammar

HAPPYEUGENEAWASGUYFINDER

Isolated features

Page 22: Nikolaj Blom Center for Biological Sequence Analysis BioCentrum-DTU

Gene Grammar

HAPPYEUGENEAWASGUYFINDER

Isolated features

Intron 3’UTR Exon Promoter Exon RBS

Page 23: Nikolaj Blom Center for Biological Sequence Analysis BioCentrum-DTU

Gene Grammar

EUGENEFINDERWASAHAPPYGUY

Integrated features

HAPPYEUGENEAWASGUYFINDER

Page 24: Nikolaj Blom Center for Biological Sequence Analysis BioCentrum-DTU

Gene Grammar

EUGENEFINDERWASAHAPPYGUY

Integrated features

Prom RBS Exon Intron Exon 3’UTR

Page 25: Nikolaj Blom Center for Biological Sequence Analysis BioCentrum-DTU

Gene Grammar

“Isolated” methods (e.g. NN):

HAPPYEUGENEAWASGUYFINDER

“Integrated” methods (e.g. HMM):

EUGENEFINDERWASAHAPPYGUY
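
The idea that features must occur in a specific order can be sketched as a check of a feature list against a small grammar. The grammar below (promoter, RBS, exon, any number of intron-exon pairs, 3’UTR) and the one-letter codes are assumptions made for this illustration; they follow the feature labels used on the preceding slides and are not the grammar of any particular program.

```python
import re

# Hypothetical one-letter codes for the feature labels used on the slides
FEATURE_CODES = {"Prom": "P", "RBS": "R", "Exon": "E", "Intron": "I", "3'UTR": "U"}

# Promoter, RBS, exon, then zero or more (intron, exon) pairs, then 3'UTR
GENE_GRAMMAR = re.compile(r"PRE(IE)*U")


def is_grammatical(features):
    """True if the features occur in an order allowed by the toy grammar."""
    encoded = "".join(FEATURE_CODES[f] for f in features)
    return GENE_GRAMMAR.fullmatch(encoded) is not None


integrated = ["Prom", "RBS", "Exon", "Intron", "Exon", "3'UTR"]
isolated = ["Intron", "3'UTR", "Exon", "Prom", "Exon", "RBS"]
print(is_grammatical(integrated))  # True
print(is_grammatical(isolated))    # False
```

An "isolated" method may find every individual feature correctly and still report them in an order that cannot form a gene. An "integrated" method only considers orders the grammar allows.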

Page 26: Nikolaj Blom Center for Biological Sequence Analysis BioCentrum-DTU

HMMs for gene finding

GenScan principle
• E = exon
• I = intron
• F = 5’ UTR
• T = 3’ UTR
• P = promoter
• N = intergenic
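
How an HMM assigns such state labels to a sequence can be sketched with the Viterbi algorithm on a toy model. This is not GenScan's model: it uses only two of the states above (N and E), and every probability is made up for illustration (an AT-rich intergenic state and a GC-rich exon state). All probabilities must be non-zero because the code works in log space.

```python
import math

# Toy parameters (hypothetical): N = intergenic (AT-rich), E = exon (GC-rich)
STATES = "NE"
START = {"N": 0.5, "E": 0.5}
TRANS = {"N": {"N": 0.9, "E": 0.1}, "E": {"N": 0.1, "E": 0.9}}
EMIT = {
    "N": {"A": 0.4, "T": 0.4, "C": 0.1, "G": 0.1},
    "E": {"A": 0.1, "T": 0.1, "C": 0.4, "G": 0.4},
}


def viterbi(obs, states, start_p, trans_p, emit_p):
    """Most probable state path for obs, computed in log space."""
    v = [{s: math.log(start_p[s]) + math.log(emit_p[s][obs[0]]) for s in states}]
    back = []
    for x in obs[1:]:
        scores, ptr = {}, {}
        for s in states:
            # best predecessor state for landing in s at this position
            prev = max(states, key=lambda r: v[-1][r] + math.log(trans_p[r][s]))
            scores[s] = v[-1][prev] + math.log(trans_p[prev][s]) + math.log(emit_p[s][x])
            ptr[s] = prev
        v.append(scores)
        back.append(ptr)
    # trace back from the best final state
    path = [max(states, key=lambda s: v[-1][s])]
    for ptr in reversed(back):
        path.append(ptr[path[-1]])
    return "".join(reversed(path))


print(viterbi("ATATAT" + "GCGCGCGC" + "ATATAT", STATES, START, TRANS, EMIT))
# NNNNNNEEEEEEEENNNNNN
```

A real gene-finding HMM adds the remaining states (introns, UTRs, promoter), length distributions and splice-site signals, but the decoding principle is the same.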

Page 27: Nikolaj Blom Center for Biological Sequence Analysis BioCentrum-DTU

Genscan http://genes.mit.edu/GENSCAN.html

Page 28: Nikolaj Blom Center for Biological Sequence Analysis BioCentrum-DTU

Genscan

Page 29: Nikolaj Blom Center for Biological Sequence Analysis BioCentrum-DTU

Genscan http://genes.mit.edu/GENSCAN.html

Page 30: Nikolaj Blom Center for Biological Sequence Analysis BioCentrum-DTU

Genscan

Page 31: Nikolaj Blom Center for Biological Sequence Analysis BioCentrum-DTU

Genscan

Page 32: Nikolaj Blom Center for Biological Sequence Analysis BioCentrum-DTU

HMMgene http://www.cbs.dtu.dk/services/HMMgene/

Page 33: Nikolaj Blom Center for Biological Sequence Analysis BioCentrum-DTU

HMMgene http://www.cbs.dtu.dk/services/HMMgene/

Columns
1. Sequence identifier
2. Program name
3. Prediction (see table below for the meaning)
4. Beginning
5. End
6. Score between 0 and 1
7. Strand: + for direct and - for complementary
8. Frame (for exons it is the position of the donor in the frame)
9. Group to which the prediction belongs. If several CDSs are found they will be called cds_1, cds_2, etc. `bestparse:' is there because alternative predictions will also be available (see below).

Name       Meaning
firstex    The coding part of the first coding exon, starting with the first base of the start codon.
exon_N     The N'th predicted internal coding exon.
lastex     The coding part of the last coding exon, ending with the last base of the stop codon.
singleex   The coding part of an exon in a gene with only one coding exon.
CDS        Coding region composed of the exon predictions prior to this line.
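
The nine columns described above map naturally onto a small record type. The following is a minimal parsing sketch, assuming whitespace-separated fields in the listed order. The sample line is hypothetical (the program name string and group label are invented for illustration; the coordinates and score are borrowed from the HoxA10 exercise later in these slides).

```python
from typing import NamedTuple


class Prediction(NamedTuple):
    seq_id: str
    program: str
    kind: str     # firstex, exon_N, lastex, singleex or CDS
    begin: int
    end: int
    score: float  # between 0 and 1
    strand: str   # "+" direct, "-" complementary
    frame: str    # kept as text, since it may not be numeric on every line
    group: str    # e.g. "bestparse:cds_1"


def parse_prediction(line):
    """Parse one whitespace-separated prediction line into a Prediction."""
    f = line.split()
    if len(f) < 9:
        raise ValueError(f"expected at least 9 columns, got {len(f)}")
    return Prediction(f[0], f[1], f[2], int(f[3]), int(f[4]),
                      float(f[5]), f[6], f[7], " ".join(f[8:]))


# Hypothetical example line, not actual program output
p = parse_prediction("HoxA10 HMMgene firstex 320 1226 0.744 + 1 bestparse:cds_1")
print(p.kind, p.end - p.begin + 1, p.score)  # firstex 907 0.744
```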

Page 34: Nikolaj Blom Center for Biological Sequence Analysis BioCentrum-DTU

Defining the term ’exon’

Gene Prediction programs often use Exon = CDS (coding sequence)

Real exons may contain 5’ or 3’ UTRs (untranslated regions)

Page 35: Nikolaj Blom Center for Biological Sequence Analysis BioCentrum-DTU

Gene Prediction – NetGene 2

Page 36: Nikolaj Blom Center for Biological Sequence Analysis BioCentrum-DTU

Gene Prediction – NetGene 2

Page 37: Nikolaj Blom Center for Biological Sequence Analysis BioCentrum-DTU

Gene Prediction – NetGene 2

Page 38: Nikolaj Blom Center for Biological Sequence Analysis BioCentrum-DTU

Gene Prediction – NetGene 2

Page 39: Nikolaj Blom Center for Biological Sequence Analysis BioCentrum-DTU

NIX – Visualizing Gene Predictions

http://www.hgmp.mrc.ac.uk/NIX/

Page 40: Nikolaj Blom Center for Biological Sequence Analysis BioCentrum-DTU

Gene Prediction – Performance of Genscan

Page 41: Nikolaj Blom Center for Biological Sequence Analysis BioCentrum-DTU

Performance of Genscan – Exon Length

Page 42: Nikolaj Blom Center for Biological Sequence Analysis BioCentrum-DTU

RepeatMasker

Repetitive sequences in human/eukaryotic genomes are a problem

Run gene predictions on large genomic regions before and after masking of repetitive sequence:
• http://ftp.genome.washington.edu/cgi-bin/RepeatMasker

Up to 45% of human genomic sequence is derived from transposable/repetitive elements
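
What "masking" means for the downstream gene finder can be sketched as follows. This only illustrates the masking step itself, given repeat coordinates that are assumed to be already known; it does not reproduce how RepeatMasker detects repeats, and the function names and coordinate convention (0-based, half-open) are choices made for this example.

```python
def hard_mask(seq, repeats, mask_char="N"):
    """Replace every repeat interval (0-based, half-open) with mask_char."""
    chars = list(seq)
    for start, end in repeats:
        for i in range(max(start, 0), min(end, len(chars))):
            chars[i] = mask_char
    return "".join(chars)


def soft_mask(seq, repeats):
    """Lower-case every repeat interval and leave the rest upper-case."""
    chars = list(seq.upper())
    for start, end in repeats:
        for i in range(max(start, 0), min(end, len(chars))):
            chars[i] = chars[i].lower()
    return "".join(chars)


seq = "ACGTATATATATACGT"
print(hard_mask(seq, [(4, 12)]))  # ACGTNNNNNNNNACGT
print(soft_mask(seq, [(4, 12)]))  # ACGTatatatatACGT
```

Running the gene finder on both the original and the masked sequence, as the slide suggests, shows which predictions depend on repetitive sequence.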

Page 43: Nikolaj Blom Center for Biological Sequence Analysis BioCentrum-DTU

RepeatMasker

Page 44: Nikolaj Blom Center for Biological Sequence Analysis BioCentrum-DTU

Future Challenges

Bootstrapping: prediction improves as more genes become known
• ’Extreme’ genes (long/short) still difficult
• Initial and terminal exons are predicted with lower confidence

Combine with Sequence Similarity Matches

Non-coding RNAs
• Most gene prediction programs only predict protein-coding genes
• tRNA and rRNA genes are not predicted

Prokaryotic gene finding
• Much easier (no introns), but still not perfect
• Especially short genes (<300 bp) difficult

Page 45: Nikolaj Blom Center for Biological Sequence Analysis BioCentrum-DTU

Gene Prediction

Take home messages
• Human genome sequence is known
• Number of human genes is unknown!
  • Before 2001: est. 30,000-140,000
  • As of 2003: 30,000-40,000
• Location, structure and function of many human genes are unknown!
• Genes may be discovered by different means and methods
• ...

Page 46: Nikolaj Blom Center for Biological Sequence Analysis BioCentrum-DTU

Gene Prediction

Take home messages
• Genes may be predicted by computer programs
• Masking of repetitive sequences may be required for large genomic sequences
• ’Unusual’ genes are difficult (high GC%, short or terminal exons)
• HMM-based gene prediction programs are suitable for “Gene Grammar”

Prediction methods are not perfect!

Page 47: Nikolaj Blom Center for Biological Sequence Analysis BioCentrum-DTU

The End

Page 48: Nikolaj Blom Center for Biological Sequence Analysis BioCentrum-DTU

Page 49: Nikolaj Blom Center for Biological Sequence Analysis BioCentrum-DTU

Gene Prediction Exercises

I. Gene Finding in Prokaryotic Sequence
II. Gene Finding in Eukaryotic Sequence

Exercises at:

http://www.cbs.dtu.dk/phdcourse/programme.html
http://www.cbs.dtu.dk/phdcourse/cookbooks/genefinding/pro.html
http://www.cbs.dtu.dk/phdcourse/cookbooks/genefinding/euk.html

Page 50: Nikolaj Blom Center for Biological Sequence Analysis BioCentrum-DTU

Gene Prediction Exercise

Sequence         GenBank      Genscan            HMMgene            NetGene2
Seq#1 (HoxA10)   320..1226    320 1226 0.871     320 1226 0.744     Donor 1227 0.95 H
                 2401..2675   2401 2675 0.988    2401 2675 0.971    Acc. 2400 1.00 H
Seq#2 (Dub-2)    398..425     -                  398 425 0.418      Donor 426 0.87
                 1208..2817   1208 2817 0.800    1208 2817 0.735    Acc. 1207 0.42
                                                                    Acc. 1210 0.71

http://www.cbs.dtu.dk/dtucourse/cookbooks/nikob/exercises/gf_exercise_solution.html
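
The exercise results can be summarised as exon-level accuracy, taking the GenBank annotation as the reference. This is a sketch under the usual gene-finding convention (sensitivity = correct exons / annotated exons, specificity = correct exons / predicted exons), counting an exon as correct only if both boundaries match exactly. The coordinates are copied from the exercise table; the variable and function names are illustrative.

```python
# Exon coordinates (begin, end) from the exercise table
GENBANK = {"HoxA10": {(320, 1226), (2401, 2675)}, "Dub-2": {(398, 425), (1208, 2817)}}
GENSCAN = {"HoxA10": {(320, 1226), (2401, 2675)}, "Dub-2": {(1208, 2817)}}
HMMGENE = {"HoxA10": {(320, 1226), (2401, 2675)}, "Dub-2": {(398, 425), (1208, 2817)}}


def exon_level_accuracy(predicted, reference):
    """Return (sensitivity, specificity) over all sequences, exact matches only."""
    tp = sum(len(predicted[name] & reference[name]) for name in reference)
    n_reference = sum(len(exons) for exons in reference.values())
    n_predicted = sum(len(predicted[name]) for name in reference)
    return tp / n_reference, tp / n_predicted


print("Genscan:", exon_level_accuracy(GENSCAN, GENBANK))  # (0.75, 1.0)
print("HMMgene:", exon_level_accuracy(HMMGENE, GENBANK))  # (1.0, 1.0)
```

The only missed exon is the 28 bp first exon of Dub-2 (398..425), which Genscan does not report and HMMgene reports with its lowest score in the table (0.418).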

Page 51: Nikolaj Blom Center for Biological Sequence Analysis BioCentrum-DTU

Page 52: Nikolaj Blom Center for Biological Sequence Analysis BioCentrum-DTU

Page 53: Nikolaj Blom Center for Biological Sequence Analysis BioCentrum-DTU

Gene Prediction – Performance of Genscan

Page 54: Nikolaj Blom Center for Biological Sequence Analysis BioCentrum-DTU

Genome Browsing - Exercise #1

How many exons are encoded by the hoxA10 gene?
• 2 exons

How many basepairs is the transcript length?
• 2542 bp

Page 55: Nikolaj Blom Center for Biological Sequence Analysis BioCentrum-DTU

Genome Browsing - Exercise #1

On what chromosome is the hoxA10 gene?
• Human chr. 7

On which arm (short/p or long/q)?
• p

What gene is located ca. 500 kb downstream of HoxA10?
• Scap2

On what mouse chromosome is the ortholog/homolog of human HoxA10 located?
• Mouse chr. 6

In the overview panel, there is a gene located ca. 300 kb downstream of HoxA10. What is its name?
• Scap2

Page 56: Nikolaj Blom Center for Biological Sequence Analysis BioCentrum-DTU

http://www.cbs.dtu.dk/dtucourse/cookbooks/nikob/exercises/gf_exercise_solution.html
