Gene Finding


Based partly on slides by Hedi Hegyi and Irene Liu

“The Central Dogma”

DNA → (transcription) → RNA → (translation) → Protein

Gene Finding in Prokaryotes

Reminder: The Genetic Code

1 start codon (ATG), 3 stop codons (TAA, TAG, TGA)

Finding Genes in Prokaryotes

High gene density
– ~85% coding in E. coli

=> Is every ORF a gene?

[Figure: gene structure]

Finding ORFs

Many more ORFs than genes
– In E. coli one finds 6500 ORFs, while there are 4290 genes.

In random DNA, a stop codon occurs once every 64/3 ≈ 21 codons on average.

The average protein is ~300 codons long.
=> Search for long ORFs.

Problems:
– Short genes
– Overlapping long ORFs on opposite strands
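The long-ORF search above can be sketched as follows. This is a minimal illustration, not the method of any particular program: the function names, the coordinate convention, and the default `min_codons` threshold are assumptions, and an ORF is taken to run from the first ATG after a stop to the next in-frame stop.

```python
STOPS = {"TAA", "TAG", "TGA"}
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq):
    """Return the reverse complement of an upper-case DNA string."""
    return seq.translate(COMPLEMENT)[::-1]


def find_orfs(seq, min_codons=100):
    """Scan all six reading frames for ATG...stop ORFs.

    Returns tuples (strand, frame, start, end, n_codons). Coordinates are
    0-based and end-exclusive on the strand that was scanned; `end` includes
    the stop codon, while `n_codons` does not count it.
    """
    seq = seq.upper()
    orfs = []
    for strand, s in (("+", seq), ("-", reverse_complement(seq))):
        for frame in range(3):
            start = None
            for i in range(frame, len(s) - 2, 3):
                codon = s[i:i + 3]
                if start is None and codon == "ATG":
                    start = i  # first ATG opens a candidate ORF
                elif start is not None and codon in STOPS:
                    n_codons = (i - start) // 3
                    if n_codons >= min_codons:
                        orfs.append((strand, frame, start, i + 3, n_codons))
                    start = None  # look for the next ATG in this frame
    return orfs


# Toy sequence: ATG + 5 codons + TAA, shifted into frame 2 by a 2-nt prefix.
toy = "CC" + "ATG" + "GCT" * 5 + "TAA" + "G"
print(find_orfs(toy, min_codons=3))
```

A length threshold like this is exactly what causes the "short genes" problem listed above: any real gene shorter than `min_codons` is silently dropped.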

Codon Frequencies

Coding DNA is not random:
– In random DNA, expect a Leu : Ala : Trp ratio of 6 : 4 : 1
– In real proteins, 6.9 : 6.5 : 1

Different frequencies for different species.
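The 6 : 4 : 1 expectation for random DNA follows from the number of codons assigned to each amino acid in the standard genetic code, assuming every codon is equally likely. A short sketch that counts them (the compact table layout and variable names are illustrative choices, not from the slides):

```python
from collections import Counter
from itertools import product

# Standard genetic code, codons ordered TTT, TTC, TTA, TTG, TCT, ...
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): aa
    for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}

# Number of codons per amino acid ('*' marks a stop codon).
codons_per_aa = Counter(CODON_TABLE.values())

# Leu (L), Ala (A), Trp (W)
print(codons_per_aa["L"], codons_per_aa["A"], codons_per_aa["W"])  # 6 4 1
```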

Using Codon Frequencies/Usage

Assume each codon is independent.

For codon abc, calculate its frequency f(abc) in coding regions.

Given a coding sequence $a_1b_1c_1, \dots, a_{n+1}b_{n+1}c_{n+1}$, calculate for the three reading frames:

\[
\begin{aligned}
p_1 &= f(a_1 b_1 c_1)\, f(a_2 b_2 c_2) \cdots f(a_n b_n c_n) \\
p_2 &= f(b_1 c_1 a_2)\, f(b_2 c_2 a_3) \cdots f(b_n c_n a_{n+1}) \\
p_3 &= f(c_1 a_2 b_2)\, f(c_2 a_3 b_3) \cdots f(c_n a_{n+1} b_{n+1})
\end{aligned}
\]

The probability that the ith reading frame is the coding region:

\[
P_i = \frac{p_i}{p_1 + p_2 + p_3}
\]
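A minimal sketch of this calculation, assuming a codon frequency table is already available as a dictionary. The function name, the `pseudo` value used for codons missing from the table, and the toy frequencies in the usage lines are all made up for illustration. Products are accumulated as sums of logarithms so that long sequences do not underflow.

```python
import math


def frame_probabilities(seq, codon_freq, pseudo=1e-6):
    """Return [P_1, P_2, P_3] for the three forward reading frames of seq.

    codon_freq maps a codon to its frequency f(abc) in coding regions;
    codons absent from the table get the small value `pseudo` (an assumption).
    """
    seq = seq.upper()
    n = (len(seq) - 2) // 3  # codons available in every one of the 3 frames
    log_p = []
    for frame in range(3):
        total = 0.0
        for k in range(n):
            codon = seq[frame + 3 * k: frame + 3 * k + 3]
            total += math.log(codon_freq.get(codon, pseudo))
        log_p.append(total)
    # Normalise: P_i = p_i / (p_1 + p_2 + p_3), computed stably in log space.
    m = max(log_p)
    weights = [math.exp(lp - m) for lp in log_p]
    z = sum(weights)
    return [w / z for w in weights]


# Toy table (invented numbers): only two codons are "common" in coding DNA.
toy_freq = {"GCT": 0.5, "GAA": 0.4}
print(frame_probabilities("GCTGAAGCTGAAGCT", toy_freq))
```

In practice the three probabilities are usually computed in a window that slides along the sequence, so that the frame with the highest P_i can change from region to region.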

C+G Content

C+G content (“isochore”) has a strong effect on gene density, gene length, etc.
– < 43% C+G: 62% of genome, 34% of genes
– > 57% C+G: 3-5% of genome, 28% of genes

Gene density in C+G-rich regions is 5 times higher than in moderate C+G regions and 10 times higher than in A+T-rich regions.
– The amount of intronic DNA is 3 times higher for A+T-rich regions (both intron length and number).
– Etc.
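C+G content is straightforward to measure along a sequence. The sketch below computes it in sliding windows and bins each window using the 43% and 57% thresholds quoted above. The window and step sizes and the class labels are arbitrary illustrative choices.

```python
def gc_fraction(seq):
    """Fraction of G and C bases in seq (0.0 for an empty string)."""
    seq = seq.upper()
    if not seq:
        return 0.0
    return (seq.count("G") + seq.count("C")) / len(seq)


def gc_windows(seq, window=1000, step=500):
    """Yield (start, gc_fraction) for sliding windows along seq."""
    last_start = max(len(seq) - window, 0)
    for start in range(0, last_start + 1, step):
        yield start, gc_fraction(seq[start:start + window])


def isochore_class(gc):
    """Bin a C+G fraction using the thresholds from the slide."""
    if gc < 0.43:
        return "low C+G"
    if gc > 0.57:
        return "high C+G"
    return "moderate C+G"


for start, gc in gc_windows("GGGGCCCCAAAATTTT", window=8, step=8):
    print(start, gc, isochore_class(gc))
```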

RNA Transcription

Not all ORFs are expressed.

Transcription depends on regulatory regions.

Common regulatory region – the promoter

RNA polymerase binds tightly to a specific DNA sequence in the promoter called the binding site.

Prokaryotic Promoter

One type of RNA polymerase.

Positional Weight Matrix

For the TATA box:
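The matrix from the original slide is not available here, so the following is only a sketch of how a positional weight matrix can be built and used. The aligned sites are invented for illustration (they are not real TATA box data), and the log-odds scoring with a pseudocount and a uniform 0.25 background is one common convention, assumed rather than taken from the slides.

```python
import math

# Made-up aligned binding sites, for illustration only.
SITES = ["TATAAA", "TATAAT", "TATATA", "TATAAA", "TTTAAA", "TATAAG"]
BACKGROUND = 0.25   # assumed uniform base composition
PSEUDOCOUNT = 1.0   # avoids log(0) for bases never seen at a position


def build_pwm(sites):
    """Return a list of {base: log2-odds score}, one dict per position."""
    pwm = []
    for pos in range(len(sites[0])):
        column = [site[pos] for site in sites]
        scores = {}
        for base in "ACGT":
            freq = (column.count(base) + PSEUDOCOUNT) / (len(sites) + 4 * PSEUDOCOUNT)
            scores[base] = math.log2(freq / BACKGROUND)
        pwm.append(scores)
    return pwm


def best_hit(seq, pwm):
    """Return (position, score) of the best-scoring window in seq.

    Assumes seq contains only A, C, G, T; returns None if seq is too short.
    """
    width = len(pwm)
    best = None
    for i in range(len(seq) - width + 1):
        score = sum(pwm[j][seq[i + j]] for j in range(width))
        if best is None or score > best[1]:
            best = (i, score)
    return best


pwm = build_pwm(SITES)
print(best_hit("GGCGTATAAAGGC", pwm))
```

Scanning a promoter region then amounts to reporting every window whose score exceeds some threshold, rather than only the single best one.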

Gene Finding in Eukaryotes

Coding density

Eukaryote gene structure

• Gene length: 30kb, coding region: 1-2kb
• Binding site: ~6bp; ~30bp upstream of TSS
• Average of 6 exons, 150bp long
• Huge variance:
– Dystrophin: 2.4Mb long
– Blood coagulation factor: 26 exons, 69bp to 3106bp; intron 22 contains another unrelated gene

Splicing

Splicing: the removal of the introns.

Performed by complexes called spliceosomes, containing both proteins and snRNA.

The snRNA recognizes the splice sites through RNA-RNA base-pairing.

Recognition must be precise: a 1nt error can shift the reading frame, making nonsense of the message.

Many genes have alternative splicing, which changes the protein created.
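The effect of a 1nt splicing error can be illustrated with a toy transcript. Everything in this sketch is hypothetical: the exon and intron sequences are invented, and `splice` and `translate` are illustrative helpers, not part of any library.

```python
from itertools import product

# Standard genetic code, codons ordered TTT, TTC, TTA, TTG, TCT, ...
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): aa
    for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}


def translate(seq):
    """Translate from position 0 until a stop codon or the end of seq."""
    protein = []
    for i in range(0, len(seq) - 2, 3):
        aa = CODON_TABLE[seq[i:i + 3]]
        if aa == "*":
            break
        protein.append(aa)
    return "".join(protein)


def splice(pre_mrna, intron_start, intron_end):
    """Remove pre_mrna[intron_start:intron_end] (0-based, end-exclusive)."""
    return pre_mrna[:intron_start] + pre_mrna[intron_end:]


# Invented example: two exons separated by a 12-nt intron.
exon1, intron, exon2 = "ATGGCTGAA", "GTAAGTCCCTAG", "CTGAAATGGTAA"
pre_mrna = exon1 + intron + exon2

print(translate(splice(pre_mrna, 9, 21)))  # precise splicing: MAELKW
print(translate(splice(pre_mrna, 9, 20)))  # 1 nt of intron retained: MAEAEMV
```

Retaining a single intron base changes every codon downstream of the junction and, in this toy case, also loses the stop codon.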

Splice Sites

Gene prediction programs

Scan the sequence in all 6 reading frames:
1. Start and stop codons
2. Long ORF
3. Codon usage
4. GC content
5. Gene features: promoter, terminator, poly A sites, exons and introns, …

[Figure: reading frames +1, +2, +3]

Genscan:

Predicts location and gene features.

Can handle a few genes in one sequence.

• http://genes.mit.edu/GENSCAN.html

Results:

Genscan
Burge and Karlin, Stanford, 1997

Before the Human Genome Project
– No alignments available
– Estimated human gene count was 100,000

First program to do well on realistic sequences
– Long, multiple genes in both orientations

Pretty good sensitivity but poor specificity
– 70% Sn, 40% Sp
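The slide does not define Sn and Sp. The sketch below assumes the convention commonly used when evaluating gene predictors: Sn = TP / (TP + FN) and Sp = TP / (TP + FP), so "specificity" here is what other fields call precision. It also assumes exon-level scoring where a predicted exon counts as correct only if both boundaries match exactly; the coordinates are invented.

```python
def sensitivity_specificity(predicted, annotated):
    """Exon-level Sn and Sp for two sets of (start, end) exon coordinates.

    Sn = TP / (TP + FN): fraction of annotated exons that were predicted.
    Sp = TP / (TP + FP): fraction of predicted exons that are annotated.
    """
    tp = len(predicted & annotated)
    fn = len(annotated - predicted)
    fp = len(predicted - annotated)
    sn = tp / (tp + fn) if tp + fn else 0.0
    sp = tp / (tp + fp) if tp + fp else 0.0
    return sn, sp


# Invented coordinates: 4 annotated exons, 5 predictions, 2 exact matches.
annotated = {(1, 10), (20, 30), (40, 50), (60, 70)}
predicted = {(1, 10), (20, 30), (40, 55), (80, 90), (100, 110)}
print(sensitivity_specificity(predicted, annotated))  # (0.5, 0.4)
```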

GenScan Output

An end to ab initio prediction

Ab initio gene prediction is inaccurate.

High false positive rates for most predictors.

Rarely used as a final product
• Human annotation runs multiple algorithms and scores exons predicted by multiple predictors.
• Used as a starting point for refinement/verification

Comparative Genomics

Use homologous sequences:
1. Annotated genes
2. mRNA sequences
3. Protein sequences
4. ESTs

ESTs

EST – Expressed Sequence Tag. Short sequences obtained from cDNA (mRNA).

Transcript-based prediction

Align transcript data to the genomic sequence using a pair-wise sequence comparison.

[Figure: EST, cDNA, gene model]

Comparative Gene Predictions

Exons are more conserved than introns
• From a study of 1196 genes:
– exons: 84.6%
– protein: 85.4%
– introns: 35%
– 5’ UTRs: 67%
– 3’ UTRs: 69%
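The slide does not say how these conservation percentages were measured. One plausible reading is percent identity over aligned positions, which the sketch below computes for a pair of already-aligned sequences. Skipping gap columns is an assumption (other definitions count gaps as mismatches), and the toy alignment is invented.

```python
def percent_identity(aligned_a, aligned_b, gap="-"):
    """Percent of identical residues over gap-free alignment columns."""
    if len(aligned_a) != len(aligned_b):
        raise ValueError("aligned sequences must have the same length")
    compared = 0
    matches = 0
    for x, y in zip(aligned_a, aligned_b):
        if x == gap or y == gap:
            continue  # assumption: gap columns are ignored
        compared += 1
        if x == y:
            matches += 1
    return 100.0 * matches / compared if compared else 0.0


print(percent_identity("ACGT-A", "ACCTTA"))  # 80.0
```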

Gene Prediction using expressed sequences

Improvement over previously existing methods, in particular when predicting CDS:
– There is an increasingly rich representation of the transcript content of the human genome in public databases

Improvements in the ability to call the coding bases, but in particular
– in connecting exons into transcripts
– in predicting alternative splice forms

Dual genome predictors

Use statistical inference and other ‘informant’ genomes: SLAM, DoubleScan, Twinscan, SGP-2, GenomeScan, etc.

Exon prediction such as ExoFish.

Twinscan

Korf, Flicek, Duan, Brent, Washington University in St. Louis, 2001

Similar to GENSCAN, except it uses another informant sequence as comparison.
– For human, the informant is normally mouse

Slightly more sensitive than GENSCAN, much more specific
– Exon sensitivity/specificity about 75%

• Nature, 2005

• Nature Methods, 2005

Other predictions in UCSC - 2009

N-SCAN – extension of TwinScan
– Allows for more species (currently uses mouse)
– A richer model of sequence evolution

N-SCAN PASA-EST
– Combines conservation with evidence based on ESTs

Other predictions in UCSC - 2009

CONTRAST (Stanford, 2007)
– 11 informants (rhesus, mouse, cow, ..., chicken)
– Machine learning two-phase approach – first predict exons and then combine them

Start and stop codon classifier accuracy increases as informants are added.

Splice site detection accuracy increases as informants are added.

Dark Matter in the Genome

Tiling arrays for human chromosomes 20 and 22:

47% of positive probes outside exons
• 22% in introns
• 25% in intergenic regions

What could they be?
• Novel protein-coding genes
• Novel non-coding genes
• Antisense transcription
• Alternative isoforms
• Biological ‘artifacts’
• False positives

Johnson et al. 2005 TRENDS in Genetics 21:93-102

Predicting non-coding RNA?

The clues we used so far are useless!

Not clear which properties can be exploited

Sequence features such as promoters are too weak

Histone modifications + conservation are the key
