Gene Finding
• Based partly on slides by Hedi Hegyi, Irene Liu

“The Central Dogma”
DNA → (Transcription) → RNA → (Translation) → Protein

Gene Finding in Prokaryotes

Reminder: The Genetic Code
• 1 start codon (ATG), 3 stop codons (TAA, TAG, TGA)

Finding Genes in Prokaryotes
• High gene density
 – ~85% coding in E. coli
 => is every ORF a gene?
• Gene structure

Finding ORFs
• Many more ORFs than genes
 – In E. coli one finds 6500 ORFs while there are 4290 genes.
• In random DNA, one stop codon occurs every 64/3 ≈ 21 codons on average.
• The average protein is ~300 codons long.
 => search for long ORFs.
• Problems:
 – Short genes
 – Overlapping long ORFs on opposite strands
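
A minimal sketch of the long-ORF search described above, assuming an ORF is defined as an ATG followed in frame by the first stop codon. The function names and the `min_codons` threshold are illustrative assumptions, not part of the original slides. Coordinates on the "-" strand refer to the reverse-complemented sequence.

```python
STOP_CODONS = {"TAA", "TAG", "TGA"}
START_CODON = "ATG"
_COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq):
    """Return the reverse complement of an upper-case DNA string."""
    return seq.translate(_COMPLEMENT)[::-1]


def orfs_in_frame(seq, offset, min_codons):
    """Yield (start, end) for ATG...stop ORFs in one frame.

    `end` is exclusive and includes the stop codon. Only ORFs with at
    least `min_codons` codons before the stop are reported.
    """
    start = None
    for i in range(offset, len(seq) - 2, 3):
        codon = seq[i:i + 3]
        if start is None and codon == START_CODON:
            start = i
        elif start is not None and codon in STOP_CODONS:
            if (i - start) // 3 >= min_codons:
                yield start, i + 3
            start = None


def find_orfs(seq, min_codons=100):
    """Scan all six reading frames; return (strand, frame, start, end) tuples."""
    seq = seq.upper()
    results = []
    for strand, s in (("+", seq), ("-", reverse_complement(seq))):
        for offset in range(3):
            for start, end in orfs_in_frame(s, offset, min_codons):
                results.append((strand, offset + 1, start, end))
    return results


# Toy sequence: one short ORF (ATG + 5 codons + TAA) in frame +3.
toy = "CC" + "ATG" + "GCT" * 5 + "TAA" + "G"
print(find_orfs(toy, min_codons=3))
```

Raising `min_codons` trades sensitivity for specificity: random ORFs become rare beyond a few dozen codons, but short genes are lost, which is the first problem listed above.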

Codon Frequencies
• Coding DNA is not random:
 – In random DNA, expect a Leu : Ala : Trp ratio of 6 : 4 : 1
 – In real proteins, 6.9 : 6.5 : 1
• Different frequencies for different species.
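
The 6 : 4 : 1 expectation follows from the number of codons per amino acid in the standard genetic code, assuming all four bases are equally likely and independent. A short sketch that counts them (the compact table encoding in T, C, A, G order is an implementation choice made here):

```python
from collections import Counter
from itertools import product

# Standard genetic code, codons enumerated in T, C, A, G order.
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): aa
    for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}

# With uniform random bases every codon has probability 1/64, so the
# expected amino-acid ratio is just the ratio of codon counts.
codons_per_aa = Counter(CODON_TABLE.values())
print(codons_per_aa["L"], codons_per_aa["A"], codons_per_aa["W"])  # 6 4 1
```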

Using Codon Frequencies/Usage
• Assume each codon is independent.
• For codon abc calculate frequency f(abc) in coding region.
• Given coding sequence a_1 b_1 c_1, …, a_{n+1} b_{n+1} c_{n+1}
• Calculate

$$
\begin{aligned}
p_1 &= f(a_1 b_1 c_1)\, f(a_2 b_2 c_2) \cdots f(a_n b_n c_n) \\
p_2 &= f(b_1 c_1 a_2)\, f(b_2 c_2 a_3) \cdots f(b_n c_n a_{n+1}) \\
p_3 &= f(c_1 a_2 b_2)\, f(c_2 a_3 b_3) \cdots f(c_n a_{n+1} b_{n+1})
\end{aligned}
$$

• The probability that the i-th reading frame is the coding region:

$$
P_i = \frac{p_i}{p_1 + p_2 + p_3}
$$
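
A sketch of this calculation, assuming codon frequencies f(abc) are supplied as a dictionary. The products are accumulated as sums of logs to avoid underflow on long sequences, and a small `pseudo` frequency for unseen codons is an added assumption, not part of the slide.

```python
import math


def frame_probabilities(seq, codon_freq, pseudo=1e-6):
    """Return [P1, P2, P3] for the three forward reading frames of seq.

    codon_freq maps codon -> frequency f(abc) in coding regions.
    Codons missing from the table get the frequency `pseudo`.
    """
    seq = seq.upper()
    n = (len(seq) - 2) // 3  # codons that fit in every one of the 3 frames
    log_p = []
    for offset in range(3):
        total = 0.0
        for k in range(n):
            codon = seq[offset + 3 * k: offset + 3 * k + 3]
            total += math.log(codon_freq.get(codon, pseudo))
        log_p.append(total)
    # Normalise: P_i = p_i / (p_1 + p_2 + p_3), computed stably in log space.
    m = max(log_p)
    weights = [math.exp(lp - m) for lp in log_p]
    z = sum(weights)
    return [w / z for w in weights]


# Hypothetical frequencies, for illustration only.
toy_freq = {"GCT": 0.5, "CTG": 0.05, "TGC": 0.05}
print(frame_probabilities("GCTGCTGCTGCTGCT", toy_freq))
```

In the toy call, frame 1 reads GCT repeatedly and should dominate, while frames 2 and 3 tie.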

C+G Content
• C+G content (“isochore”) has a strong effect on gene density, gene length etc.
 – < 43% C+G: 62% of genome, 34% of genes
 – > 57% C+G: 3-5% of genome, 28% of genes
• Gene density in C+G rich regions is 5 times higher than in moderate C+G regions and 10 times higher than in A+T rich regions
 – Amount of intronic DNA is 3 times higher for A+T rich regions (both intron length and number).
 – Etc.
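
A sketch of how C+G content can be measured along a sequence with a sliding window. The 43% and 57% cut-offs are the ones quoted above; the window and step sizes and the class labels are illustrative assumptions.

```python
def gc_fraction(seq):
    """Fraction of G and C bases in seq (0.0 for an empty string)."""
    seq = seq.upper()
    if not seq:
        return 0.0
    return (seq.count("G") + seq.count("C")) / len(seq)


def gc_windows(seq, window=1000, step=500):
    """Return (start, gc_fraction) for each full window along seq."""
    return [
        (i, gc_fraction(seq[i:i + window]))
        for i in range(0, len(seq) - window + 1, step)
    ]


def isochore_class(gc):
    """Classify a C+G fraction using the 43% / 57% thresholds."""
    if gc < 0.43:
        return "low C+G"
    if gc > 0.57:
        return "high C+G"
    return "moderate C+G"


for start, gc in gc_windows("AAAAGGGG", window=4, step=4):
    print(start, gc, isochore_class(gc))
```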

RNA Transcription
• Not all ORFs are expressed.
• Transcription depends on regulatory regions.
• A common regulatory region: the promoter
• RNA polymerase binds tightly to a specific DNA sequence in the promoter called the binding site.

Prokaryotic Promoter
• One type of RNA polymerase.

Positional Weight Matrix
• For TATA box:
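
The matrix shown on the original slide is not recoverable here. As a stand-in, the sketch below builds a log-odds positional weight matrix from hypothetical base counts for a 6 bp TATA-like motif and scans a sequence with it. The counts, the uniform background and the pseudocount are all assumptions made for illustration, and the input is assumed to contain only A, C, G and T.

```python
import math

# HYPOTHETICAL counts (20 aligned sites), not taken from the slides.
COUNTS = {
    "A": [2, 18, 1, 17, 12, 16],
    "C": [2, 0, 1, 0, 1, 1],
    "G": [1, 1, 0, 0, 1, 1],
    "T": [15, 1, 18, 3, 6, 2],
}


def log_odds_matrix(counts, background=0.25, pseudocount=1.0):
    """Turn per-position base counts into log2(frequency / background)."""
    length = len(counts["A"])
    matrix = []
    for pos in range(length):
        total = sum(counts[b][pos] for b in "ACGT") + 4 * pseudocount
        matrix.append({
            b: math.log2(((counts[b][pos] + pseudocount) / total) / background)
            for b in "ACGT"
        })
    return matrix


def score(matrix, site):
    """Sum the per-position log-odds scores for a candidate site."""
    return sum(column[base] for column, base in zip(matrix, site))


def best_hit(matrix, seq):
    """Return (score, position) of the best-scoring window in seq."""
    seq = seq.upper()
    width = len(matrix)
    return max(
        (score(matrix, seq[i:i + width]), i)
        for i in range(len(seq) - width + 1)
    )


pwm = log_odds_matrix(COUNTS)
print(best_hit(pwm, "GCGCGCTATAAAGCGC"))
```

In practice a score threshold decides which windows are reported as candidate binding sites.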

Gene Finding in Eukaryotes

Coding density

Eukaryote gene structure
• Gene length: 30kb, coding region: 1-2kb
• Binding site: ~6bp; ~30bp upstream of TSS
• Average of 6 exons, 150bp long
• Huge variance:
 – dystrophin: 2.4Mb long
 – Blood coagulation factor: 26 exons, 69bp to 3106bp; intron 22 contains another unrelated gene

Splicing
• Splicing: the removal of the introns.
• Performed by complexes called spliceosomes, containing both proteins and snRNA.
• The snRNA recognizes the splice sites through RNA-RNA base-pairing.
• Recognition must be precise: a 1nt error can shift the reading frame, making nonsense of its message.
• Many genes have alternative splicing, which changes the protein created.
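
To illustrate why a 1nt error matters, the sketch below translates a toy coding sequence and then the same sequence with one nucleotide removed. The sequence is made up for illustration, and the DNA alphabet (T instead of U) is used for simplicity.

```python
from itertools import product

# Standard genetic code, codons enumerated in T, C, A, G order.
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): aa
    for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}


def translate(seq):
    """Translate from position 0 until the first stop codon or the end."""
    seq = seq.upper()
    protein = []
    for i in range(0, len(seq) - 2, 3):
        aa = CODON_TABLE[seq[i:i + 3]]
        if aa == "*":
            break
        protein.append(aa)
    return "".join(protein)


correct = "ATGGCTCTGTGGAAATAA"         # ATG GCT CTG TGG AAA TAA
shifted = correct[:4] + correct[5:]    # same sequence with 1 nt missing

print(translate(correct))  # MALWK
print(translate(shifted))  # MVCGN
```

Every codon downstream of the error changes, and the original stop codon is no longer read in frame.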

Splice Sites

Gene prediction programs
Scan the sequence in all 6 reading frames:
1. Start and stop codons
2. Long ORF
3. Codon usage
4. GC content
5. Gene features: promoter, terminator, poly A sites, exons and introns, …
[Figure: reading frames +1, +2, +3]

Gene prediction programs
• Genscan:
 – Predicts location and gene features.
 – Can handle a few genes in one sequence
• http://genes.mit.edu/GENSCAN.html

Gene prediction programs
• Results:

Genscan
• Burge and Karlin, Stanford, 1997
• Before the Human Genome Project
 – No alignments available
 – Estimated human gene count was 100,000
• First program to do well on realistic sequences
 – Long, multiple genes in both orientations
• Pretty good sensitivity but poor specificity
 – 70% Sn, 40% Sp
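
A sketch of how Sn and Sp can be computed at the exon level. It assumes the convention common in the gene-finding literature, where Sp = TP / (TP + FP), i.e. the fraction of predictions that are correct. The slides do not state which definition the Genscan figures use, so treat this as an assumption. Exons are represented as hypothetical (start, end) tuples and only exact matches count.

```python
def sensitivity_specificity(predicted, annotated):
    """Return (Sn, Sp) for exact-match exon predictions.

    Sn = TP / (TP + FN): fraction of annotated exons that were predicted.
    Sp = TP / (TP + FP): fraction of predicted exons that are annotated
         (gene-finding convention; an assumption here).
    """
    predicted, annotated = set(predicted), set(annotated)
    tp = len(predicted & annotated)
    fp = len(predicted - annotated)
    fn = len(annotated - predicted)
    sn = tp / (tp + fn) if tp + fn else 0.0
    sp = tp / (tp + fp) if tp + fp else 0.0
    return sn, sp


# Made-up coordinates for illustration.
annotated_exons = {(1, 10), (20, 30), (40, 50), (60, 70)}
predicted_exons = {(1, 10), (20, 30), (45, 50)}
print(sensitivity_specificity(predicted_exons, annotated_exons))
```

Under this convention, a low Sp means that many predicted exons are false positives.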

GenScan Output

An end to ab initio prediction
• ab initio gene prediction is inaccurate.
• High false positive rates for most predictors.
• Rarely used as a final product
 – Human annotation runs multiple algorithms and scores exons predicted by multiple predictors.
 – Used as a starting point for refinement/verification

Comparative Genomics
Use homologous sequences:
1. Annotated genes
2. mRNA sequences
3. Protein sequences
4. ESTs

ESTs
• EST – Expressed Sequence Tags. Short sequences which are obtained from cDNA (mRNA).

Transcript-based prediction
• Align transcript data to genomic sequence using a pair-wise sequence comparison.
[Figure: EST and cDNA alignments to the genome, and the resulting gene model]

Comparative Gene Predictions
• Exons are more conserved than introns
• From a study of 1196 genes:
 – exons: 84.6%
 – protein: 85.4%
 – introns: 35%
 – 5’ UTRs: 67%
 – 3’ UTRs: 69%

Gene Prediction using expressed sequences
• Improvement over previously existing methods, in particular when predicting CDS:
 – there is an increasingly rich representation of the transcript content of the human genome in public databases
• Improvements in the ability to call the coding bases, but in particular
 – in connecting exons into transcripts
 – in predicting alternative splice forms

Dual genome predictors
• Use statistical inference and other ‘informant’ genomes: SLAM, DoubleScan, Twinscan, SGP-2, GenomeScan etc.
• Exon prediction such as ExoFish.

Twinscan
• Korf, Flicek, Duan, Brent, Washington University in St. Louis, 2001
• Similar to GENSCAN, except it uses another informant sequence as comparison.
 – For human, the informant is normally mouse
• Slightly more sensitive than GENSCAN, much more specific
 – Exon sensitivity/specificity about 75%
• Nature, 2005
• Nature Methods, 2005

Other predictions in UCSC - 2009
• N-SCAN – extension of TwinScan
 – Allows for more species (currently uses mouse)
 – A richer model of sequence evolution
• N-SCAN PASA-EST
 – Combines conservation with evidence based on ESTs

Other predictions in UCSC - 2009
• CONTRAST (Stanford, 2007)
 – 11 informants (rhesus, mouse, cow, ..., chicken)
 – Machine learning two-phase approach: first predict exons and then combine them

• Start and stop codon classifier accuracy increases as informants are added.

• Splice site detection accuracy increases as informants are added.

Dark Matter in the Genome
• Tiling arrays for human chromosomes 20 and 22:
• 47% of positive probes outside exons
 – 22% in introns
 – 25% in intergenic regions
• What could they be?
 – Novel protein-coding genes
 – Novel non-coding genes
 – Antisense transcription
 – Alternative isoforms
 – Biological ‘artifacts’
 – False positives
• Johnson et al. 2005 TRENDS in Genetics 21:93-102

Predicting non-coding RNA?
• The clues we used so far are useless!
• Not clear which properties can be exploited
• Sequence features such as promoters are too weak
• Histone modifications + conservation are the key