Gene Finding in Eukaryotes

Jan-Jaap Wesselink jjwesselink@cnio.es

Computational and Structural Biology Group, Centro Nacional de Investigaciones Oncológicas

Madrid, July 2008

Jan-Jaap Wesselink jjwesselink@cnio.es Gene Finding in Eukaryotes Madrid, July 2008 1 / 24

Outline

1 Gene Structure
    Eukaryotes
    Find Signals

2 Different Approaches To Gene Finding
    Different Information
    Search by Signal
    PWM
    Search by Content
    Coding Statistics
    HMM
    Homology
    Comparative

3 Accuracy of Predictions

4 References

Eukaryotic Gene Structure

Schematic Gene Structure

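[Figure: schematic gene. The coding region starts at ATG and ends at TGA; exons alternate with introns (Exon, Intron, Exon, Intron, Exon); each intron begins with GT and ends with AG; UTR and CDS regions are labelled.]

Gene prediction programs only predict the coding fraction of genes

Signals                  Exons       Regions
Start (ATG)              Single      Exons
Stops (TGA, TAA, TAG)    First       Introns
Donor (GT)               Internal    Intergenic
Acceptor (AG)            Terminal    5' and 3' UTRs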

Signals are difficult to find (1)

Example: Try reading this sentence:

LOOKITSMUCHEASIERLIKETHIS

Look! It’s much easier like this!

Signals Are Difficult To Find (2)

Example: Genomic DNA sequence

GTTTCAAGTGATCCTCCCGCCTCAGCCTGCCCAGGTGCTGAGATTACATGTATGAGCCACTGCACCTGGAAAGGAGCCAGAAATGTGAAGTGCTAGCTGAAGGATGAGCAGCAGCTAGCCAGGCAAAGGTAGGGTTTGGGGAAGGAAAGTGCACATTCTCTTCCCATCTGTGTTTCAGGGGGCAATGGCGGCTTCCTGTGTTCTACTGCACACTGGGCAGAAGATGCCTCTGATTGGTCTGGGTACCTGGAAGAGTGAGCCTGGTCAGGTGAGGGATGGGGGAAGAAAAAAGAAACCTCTGCTTCTCTCACCTGGCAGGTAAAAGCAGCTGTTAAGTATGCCCTTAGCGTAGGCTACCGCCACATTGATTGTGCTGCTATCTACGGCAATGAGCCTGAGATTGGGGAGGCCCTGAAGGAGGACGTGGGACCAGGCAAGGTAAGGACTGGGGTTGTAAATAGAGCTGTGGGCCCTGCCCCCTGCACTAG

Only beginning and end of introns are shown
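
To make the difficulty concrete, here is a minimal Python sketch that lists every candidate signal in a sequence by exact string matching. The function name `find_signals` and the choice of the first 60 bases of the sequence above as a test fragment are illustrative assumptions; real gene finders score signals rather than merely listing them.

```python
import re

# Signal motifs from the "Schematic Gene Structure" slide.
SIGNALS = {
    "start": ["ATG"],
    "stop": ["TGA", "TAA", "TAG"],
    "donor": ["GT"],
    "acceptor": ["AG"],
}

def find_signals(seq):
    """Return {signal name: sorted 0-based positions of every exact match}."""
    seq = seq.upper()
    hits = {}
    for name, motifs in SIGNALS.items():
        positions = []
        for motif in motifs:
            # A lookahead reports overlapping occurrences as well.
            positions += [m.start() for m in re.finditer(f"(?={motif})", seq)]
        hits[name] = sorted(positions)
    return hits

# First 60 bases of the genomic sequence shown on the slide.
fragment = "GTTTCAAGTGATCCTCCCGCCTCAGCCTGCCCAGGTGCTGAGATTACATGTATGAGCCAC"
for name, positions in find_signals(fragment).items():
    print(name, len(positions), positions)
```

Even a short fragment yields several GT and AG matches, most of which are not real splice sites, which is why signals have to be scored and combined with other evidence.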

All Signals Predicted by geneid in a Genomic DNA Sequence

All Exons Predicted by geneid in a Genomic DNA Sequence

Different Approaches to Gene Finding

Different Types of Information Can Be Used:
- Signals: search for signals of transcription, splicing, translation. Typically, these signals are assigned a score, and the highest-scoring signals are combined.
- Content: here, one tries to discriminate the protein-coding from non-coding regions. Statistical models of nucleotide frequencies and dependencies in codons are used here.
- Homology: significant sequence similarity of a genomic DNA sequence to a known gene implies that it is likely to share its function. This information may be used in the gene prediction process.

Search by Signal

Signals are usually represented as patterns

Example (patterns)
- Strings: P = GCCACCTAGG
- Consensus sequences: substitutions occur at certain positions
- Regular expressions: describe the set of strings generated by a regular language
- Decision trees
- Position Weight Matrices

Patterns: examples

Example (patterns 2)
Consensus sequence:

p1 = CTAAAAATAA

p2 = TTAAAAATAA

p3 = TTTAAAATAA

p4 = CTATAAATAA

p5 = TTATAAATAA

p6 = CTTAAAATAG

p7 = TTTAAAATAG

P = YTWWAAATAR

Y = pyrimidine (C or T), W = A or T, R = purine (A or G)

Regular expression: Prosite pattern:

P = G-[GN]-[SGA]-G-x-R-x-[SGA]-C-x(2)-[IV]
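
Both pattern types can be turned into ordinary regular expressions. The following Python sketch is only an illustration: the helper names `consensus_to_regex` and `prosite_to_regex` are assumptions, the IUPAC table covers only the symbols used on this slide, and the Prosite converter handles only the syntax elements that appear in the pattern above (residues, bracketed sets, `x` and `x(n)`).

```python
import re

# Subset of the IUPAC nucleotide codes: only the symbols used on the slide.
IUPAC = {"A": "A", "C": "C", "G": "G", "T": "T",
         "Y": "[CT]", "R": "[AG]", "W": "[AT]"}

def consensus_to_regex(consensus):
    """Translate a consensus such as YTWWAAATAR into a regular expression."""
    return "".join(IUPAC[symbol] for symbol in consensus)

def prosite_to_regex(pattern):
    """Translate a simple Prosite pattern (hyphen-separated elements)."""
    parts = []
    for element in pattern.split("-"):
        element = element.replace("x", ".")               # x = any residue
        element = re.sub(r"\((\d+)\)", r"{\1}", element)  # x(2) -> .{2}
        parts.append(element)
    return "".join(parts)

sites = ["CTAAAAATAA", "TTAAAAATAA", "TTTAAAATAA", "CTATAAATAA",
         "TTATAAATAA", "CTTAAAATAG", "TTTAAAATAG"]
site_regex = consensus_to_regex("YTWWAAATAR")
prosite_regex = prosite_to_regex("G-[GN]-[SGA]-G-x-R-x-[SGA]-C-x(2)-[IV]")

print(site_regex)
print(prosite_regex)
print(all(re.fullmatch(site_regex, s) for s in sites))
```

A consensus accepts every combination of the allowed symbols, so it also matches strings that are not among p1 to p7; this loss of specificity is one motivation for weight matrices.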

Position Weight Matrices

Construction of a PWM for splice sites (...GT.... or ...AG....)
1 align sequences for true splice sites
2 calculate relative frequencies
3 same for false donor sites
4 calculate log odds ratio of true/false frequencies

$$M_{i,j} = \ln\left(\frac{f_{\mathrm{true}}}{f_{\mathrm{false}}}\right)$$

5 score of a given sequence is the sum of matrix coefficients
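
The five steps can be sketched in Python as follows. This is a toy illustration, not the implementation used by any particular gene finder: the function names, the tiny sets of "true" and "false" sites, and the pseudocount of 1 (added so that no frequency is zero and the logarithm is always defined) are all assumptions made for the example.

```python
import math

ALPHABET = "ACGT"

def relative_frequencies(aligned, pseudocount=1.0):
    """Steps 1-3: per-position relative frequencies of pre-aligned sites."""
    freqs = []
    for pos in range(len(aligned[0])):
        counts = {base: pseudocount for base in ALPHABET}
        for seq in aligned:
            counts[seq[pos]] += 1
        total = sum(counts.values())
        freqs.append({base: counts[base] / total for base in ALPHABET})
    return freqs

def build_pwm(true_sites, false_sites, pseudocount=1.0):
    """Step 4: log odds ratio of true versus false frequencies."""
    f_true = relative_frequencies(true_sites, pseudocount)
    f_false = relative_frequencies(false_sites, pseudocount)
    return [{base: math.log(ft[base] / ff[base]) for base in ALPHABET}
            for ft, ff in zip(f_true, f_false)]

def score(pwm, seq):
    """Step 5: sum of the matrix coefficients selected by the sequence."""
    return sum(column[base] for column, base in zip(pwm, seq))

# Hypothetical toy data: every site has GT at positions 2-3.
true_sites = ["AGGTAAGT", "AGGTGAGT", "TGGTAAGT", "AGGTAAGA"]
false_sites = ["CTGTCCTT", "GAGTTTCA", "TCGTACGC", "ACGTGGAC"]

pwm = build_pwm(true_sites, false_sites)
print(score(pwm, "AGGTAAGT"))  # resembles the true sites
print(score(pwm, "CTGTCCTT"))  # resembles the false sites
```

Because the GT dinucleotide is present in both sets, its columns contribute a log odds of zero; the discrimination comes entirely from the flanking positions.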

Example: Donor Sites

Search by Content

- Coding statistics
- Hidden Markov Models

Coding Statistics
- there are 64 codons for 20 amino acids (plus stop signals), so most amino acids are encoded by several codons
- codon frequencies in exons differ from triplet frequencies in introns
- compute codon usage from coding DNA → frequencies
- for a sequence S, the higher the number of codons in S that occur frequently in coding sequences, the higher the probability that S codes for a protein

Coding Statistics

The probability of a sequence S being protein coding:

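$$p(S) = \frac{p(C_1) \cdot p(C_2) \cdots p(C_n)}{\frac{1}{64} \cdot \frac{1}{64} \cdots \frac{1}{64}}$$

$$p(S) \approx \frac{f(C_1) \cdot f(C_2) \cdots f(C_n)}{\frac{1}{64} \cdot \frac{1}{64} \cdots \frac{1}{64}}$$

Here C_1 ... C_n are the successive codons of S, f(C) is the frequency of codon C observed in known coding DNA, and 1/64 is the frequency expected if all codons were used equally. Strictly, this is a ratio against uniform codon usage, so values above 1 favour coding.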
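
A minimal Python sketch of this codon-usage score follows. It works with the logarithm of the ratio, so that long sequences do not underflow. The function names, the two short training sequences and the pseudocount of 1 per codon (so that unseen codons do not give a frequency of zero) are assumptions made for illustration, and input is assumed to be upper-case A, C, G, T read in frame 0.

```python
import math
from itertools import product

CODONS = ["".join(triplet) for triplet in product("ACGT", repeat=3)]

def codons_of(seq):
    """Split a sequence into successive codons, dropping a trailing partial codon."""
    usable = len(seq) - len(seq) % 3
    return [seq[i:i + 3] for i in range(0, usable, 3)]

def codon_frequencies(coding_seqs, pseudocount=1.0):
    """Estimate f(C) for all 64 codons from known coding sequences."""
    counts = {codon: pseudocount for codon in CODONS}
    for seq in coding_seqs:
        for codon in codons_of(seq):
            counts[codon] += 1
    total = sum(counts.values())
    return {codon: n / total for codon, n in counts.items()}

def log_coding_score(seq, freqs):
    """ln of the ratio: sum over codons of ln(f(C) / (1/64))."""
    return sum(math.log(freqs[codon] * 64) for codon in codons_of(seq))

# Hypothetical training set of two short coding sequences.
training = ["ATGGCTGCTGAAGCTTGA", "ATGGAAGCTGCTGAATAA"]
freqs = codon_frequencies(training)

print(log_coding_score("GCTGAAGCT", freqs))  # codons common in the training set
print(log_coding_score("CCCCCCCCC", freqs))  # codons absent from the training set
```

A positive score means the codons of S are more frequent in the training set than under uniform usage; a negative score means the opposite.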

Hidden Markov Models

- can have HMMs for entire genes
- can have HMMs for coding/non-coding regions
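
As a rough illustration of the second case, the sketch below defines a two-state HMM (coding, non-coding) and uses the Viterbi algorithm to label each base with its most likely state. Every number in it is a made-up assumption: the start, transition and emission probabilities are not trained on any data, and real gene-finding HMMs use many more states (for example codon positions, splice sites and introns). The input is assumed to be a non-empty string of A, C, G, T.

```python
import math

STATES = ("coding", "noncoding")

# Hypothetical parameters chosen only for illustration.
START = {"coding": 0.5, "noncoding": 0.5}
TRANS = {"coding": {"coding": 0.9, "noncoding": 0.1},
         "noncoding": {"coding": 0.1, "noncoding": 0.9}}
EMIT = {"coding": {"A": 0.2, "C": 0.3, "G": 0.3, "T": 0.2},
        "noncoding": {"A": 0.3, "C": 0.2, "G": 0.2, "T": 0.3}}

def viterbi(seq):
    """Return the most probable state path for a non-empty sequence."""
    # Log-probability of the best path ending in each state at the first base.
    best = {s: math.log(START[s]) + math.log(EMIT[s][seq[0]]) for s in STATES}
    back = []  # one dict of back-pointers per base after the first
    for base in seq[1:]:
        prev, best, pointers = best, {}, {}
        for s in STATES:
            # Best predecessor state for ending in state s at this base.
            p = max(STATES, key=lambda q: prev[q] + math.log(TRANS[q][s]))
            best[s] = prev[p] + math.log(TRANS[p][s]) + math.log(EMIT[s][base])
            pointers[s] = p
        back.append(pointers)
    # Trace back from the best final state.
    state = max(STATES, key=lambda s: best[s])
    path = [state]
    for pointers in reversed(back):
        state = pointers[state]
        path.append(state)
    return path[::-1]

print(viterbi("GCGCGCGCGCATATATATAT"))
```

With these toy parameters the GC-rich half is labelled coding and the AT-rich half non-coding, because ten bases of better emissions outweigh the cost of a single state change.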

Search by Homology

Various Uses of Homology in Gene Prediction
- one can compare a genomic DNA sequence with a database of ESTs (using e.g. blastn)
- genomic DNA sequences can be compared to a database of protein sequences (using blastx) to identify coding regions
- comparison of predicted peptides with a protein sequence database can be used to assign putative functions
- the genome of one species can be compared to the genome of another, closely related species: conserved regions often correspond to conserved functions (e.g. exons, parts of promoters)

Comparative Gene Prediction

[Figure: comparing the human FOS gene with (a) mouse, (b) chicken and (c) pufferfish, using tblastx]

Gene Prediction Accuracy

- measured in annotated sequences
- can be measured at nucleotide, exon and gene level

$$Sn = \frac{TP}{TP + FN} \qquad Sp = \frac{TP}{TP + FP}$$
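
At the nucleotide level, TP is the number of annotated coding bases that are also predicted as coding, FN the annotated coding bases that were missed, and FP the predicted coding bases that are not annotated. The Python sketch below computes Sn and Sp under that reading; the function names, the inclusive (start, end) exon coordinates and the example exons are assumptions made for illustration.

```python
def positions(exons):
    """Expand inclusive (start, end) exon coordinates into a set of positions."""
    return {p for start, end in exons for p in range(start, end + 1)}

def nucleotide_accuracy(annotated_exons, predicted_exons):
    """Return (Sn, Sp) at the nucleotide level."""
    annotated = positions(annotated_exons)
    predicted = positions(predicted_exons)
    tp = len(annotated & predicted)   # coding and predicted coding
    fn = len(annotated - predicted)   # coding but missed
    fp = len(predicted - annotated)   # predicted coding but not annotated
    sn = tp / (tp + fn) if tp + fn else 0.0
    sp = tp / (tp + fp) if tp + fp else 0.0
    return sn, sp

# Hypothetical annotation and prediction on the same sequence.
annotated_exons = [(10, 19), (40, 49)]
predicted_exons = [(12, 19), (40, 54)]

sn, sp = nucleotide_accuracy(annotated_exons, predicted_exons)
print(f"Sn = {sn:.3f}, Sp = {sp:.3f}")
```

In this example 18 of the 20 annotated coding bases are recovered (Sn = 18/20) and 18 of the 23 predicted bases are correct (Sp = 18/23). The same two ratios can be computed at exon or gene level by counting exactly matching exons or genes instead of bases.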

References

1 http://genome.imim.es/courses/BioinformaticaUPF/

2 http://genome.imim.es/~jjw/oeiras05/

3 Guigó, R. (1999). DNA Composition, Codon Usage and Exon Prediction. In Bishop, M., ed. Genetic Databases. Academic Press.

4 Eddy, S. (2004). What is a Hidden Markov Model? Nature Biotechnology, 22:1315–6.

5 Burset, M. and Guigó, R. (1996). Evaluation of gene structure prediction programs. Genomics, 34:353–67.

6 Brent, M.R. and Guigó, R. (2004). Recent advances in gene structure prediction. Curr. Opin. Struct. Biol., 14:264–72.

7 Guigó, R. and Reese, M.G. (2005). EGASP: collaboration through competition to find human genes. Nature Methods, 2:575–6.
