NC
BI
Gene Prediction and Annotation Techniques: Basics
Chuong Huynh, NIH/NLM/NCBI
Sept 30, 2004
Acknowledgement: Daniel Lawson, Neil Hall
What is gene prediction?
Detecting meaningful signals in uncharacterised DNA sequences.
Knowledge of the interesting information in DNA.
Sorting the ‘wheat from the chaff’
Gene prediction is ‘recognising protein-coding regions in genomic sequence’
GATCGGTCGAGCGTAAGCTAGCTAG
ATCGATGATCGATCGGCCATATATC
ACTAGAGCTAGAATCGATAATCGAT
CGATATAGCTATAGCTATAGCCTAT
Basic Gene Prediction Flow Chart
Obtain new genomic DNA sequence
1. Translate in all six reading frames and compare to protein sequence databases
2. Perform a database similarity search against the expressed sequence tag (EST) database of the same organism, or cDNA sequences if available
Use gene prediction program to locate genes
Analyze regulatory sequences in the gene
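The "translate in all six reading frames" step of the flow chart can be sketched in a few lines of Python. This is a minimal illustration, assuming the standard genetic code and an input containing only A, C, G and T; the function names are hypothetical and are not taken from any tool mentioned in these slides.

```python
# Standard genetic code, laid out in the conventional T, C, A, G order.
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    a + b + c: AMINO_ACIDS[16 * i + 4 * j + k]
    for i, a in enumerate(BASES)
    for j, b in enumerate(BASES)
    for k, c in enumerate(BASES)
}
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(dna):
    """Return the reverse complement of a DNA string."""
    return dna.upper().translate(COMPLEMENT)[::-1]


def translate(dna):
    """Translate codon by codon; '*' marks a stop, 'X' an unrecognised codon."""
    dna = dna.upper()
    return "".join(
        CODON_TABLE.get(dna[i:i + 3], "X") for i in range(0, len(dna) - 2, 3)
    )


def six_frame_translations(dna):
    """Return a dict such as {'+1': ..., '-3': ...} with one peptide per frame."""
    frames = {}
    for strand, seq in (("+", dna.upper()), ("-", reverse_complement(dna))):
        for offset in range(3):
            frames[f"{strand}{offset + 1}"] = translate(seq[offset:])
    return frames


if __name__ == "__main__":
    for name, peptide in six_frame_translations("GATCGGTCGAGCGTAAGCTAGCTAG").items():
        print(name, peptide)
```

Each of the six peptides would then be compared against a protein database; that search step is not shown here.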
ACEDB View
Why is gene prediction important?
-Increased volume of genome data generated
-Paradigm shift from gene by gene sequencing (small scale) to large-scale genome sequencing.
-No more one gene at a time. A lot of data.
-Foundation for all further investigation. Knowledge of the protein-coding regions underpins functional genomics.
Note: this presentation covers only the prediction of genes that encode proteins; it does not cover promoter prediction, i.e. the sequences that regulate the activity of protein-coding genes.
Map Viewer (figure): labelled tracks are Genes, Genome Scan, Models, Human EST hits, Contig, GenBank and Mouse EST hits.
Artemis – Free Genome Visualization/Annotation Workbench
Genome WorkBench
Knowing what to look for
What is a gene?
Not a full transcript with control regions
The coding sequence (ATG -> STOP)
Start -> Middle (N times) -> End
ORF Finding in Prokaryotes
• The simplest method of finding DNA sequences that encode proteins is to search for open reading frames (ORFs)
• An ORF is a DNA sequence that contains a contiguous set of codons, each of which specifies an amino acid
• Six possible reading frames
• Good for prokaryotic systems (little or no post-transcriptional processing such as splicing)
• Runs from Met (AUG) on the mRNA to a stop codon, TER (UAA, UAG, UGA)
• NCBI ORF Finder: http://www.ncbi.nlm.nih.gov/gorf/
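The ORF search described above can be sketched as a scan of all six reading frames for stretches running from ATG to the first in-frame stop. This is a simplified illustration, not the NCBI ORF Finder algorithm; `find_orfs` and its `min_codons` parameter are hypothetical names, and coordinates on the "-" strand refer to the reverse-complemented sequence.

```python
COMPLEMENT = str.maketrans("ACGT", "TGCA")
STOP_CODONS = {"TAA", "TAG", "TGA"}


def reverse_complement(dna):
    """Return the reverse complement of a DNA string."""
    return dna.upper().translate(COMPLEMENT)[::-1]


def find_orfs(dna, min_codons=3):
    """Return (strand, frame, start, end) tuples for ATG..stop ORFs.

    start and end are 0-based, end-exclusive positions on the strand
    that was scanned; the stop codon is included in the ORF.
    """
    orfs = []
    for strand, seq in (("+", dna.upper()), ("-", reverse_complement(dna))):
        for frame in range(3):
            start = None  # position of the first ATG since the last stop
            for i in range(frame, len(seq) - 2, 3):
                codon = seq[i:i + 3]
                if codon == "ATG" and start is None:
                    start = i
                elif codon in STOP_CODONS and start is not None:
                    if (i + 3 - start) // 3 >= min_codons:
                        orfs.append((strand, frame + 1, start, i + 3))
                    start = None
    return orfs


if __name__ == "__main__":
    print(find_orfs("CCATGGCCTAAGG"))
```

In practice a much larger `min_codons` threshold would be used, since short ORFs occur by chance in any sequence.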
ORF Finder (Open Reading Frame Finder)
Annotation of eukaryotic genomes
Genomic DNA -> (transcription) -> Unprocessed RNA -> (RNA processing) -> Mature mRNA (poly-A tail, AAAAAAA) -> (translation) -> Nascent polypeptide -> (folding) -> Active enzyme -> Function (Reactant A -> Product B)
Annotation approaches:
ab initio gene prediction (w/o prior knowledge)
Comparative gene prediction (use other biological data)
Functional identification
Two Classes of Sequence Information
• Signal terms – short sequence motifs (such as splice sites, branch points, polypyrimidine tracts, start codons, and stop codons)
• Content terms – patterns of codon usage that are characteristic of a species and allow coding sequences to be distinguished from the surrounding non-coding sequences by a statistical detection algorithm
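A content term can be illustrated with a simple codon-usage score: estimate codon frequencies from known coding sequences, then score a window by its log-odds against a background. The sketch below assumes a uniform background (each codon at 1/64) and add-one smoothing; both choices, and the function names, are illustrative assumptions rather than the method of any particular gene finder.

```python
import math
from collections import Counter
from itertools import product

CODONS = ["".join(p) for p in product("ACGT", repeat=3)]


def codon_frequencies(coding_seqs, pseudocount=1.0):
    """Estimate codon frequencies from in-frame coding sequences."""
    counts = Counter()
    for seq in coding_seqs:
        seq = seq.upper()
        for i in range(0, len(seq) - 2, 3):
            codon = seq[i:i + 3]
            if set(codon) <= set("ACGT"):  # skip codons with ambiguous bases
                counts[codon] += 1
    total = sum(counts[c] for c in CODONS) + pseudocount * len(CODONS)
    return {c: (counts[c] + pseudocount) / total for c in CODONS}


def coding_score(window, freqs):
    """Log-odds of an in-frame window: codon model versus uniform usage."""
    window = window.upper()
    score = 0.0
    for i in range(0, len(window) - 2, 3):
        codon = window[i:i + 3]
        if codon in freqs:
            score += math.log(freqs[codon] * 64)
    return score


if __name__ == "__main__":
    # Toy "training set"; real training sets hold many confirmed genes.
    freqs = codon_frequencies(["ATGGCCGCCGCCTAA"] * 5)
    print(coding_score("GCCGCCGCC", freqs), coding_score("TTTTTTTTT", freqs))
```

A positive score means the window uses codons the way the training genes do; a negative score means it looks more like background.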
Problem Using Codon Usage
• Program must be taught what the codon usage patterns look like by presenting the program with a TRAINING SET of known coding sequences.
• Different programs search for different patterns.
• A NEW training set is needed for each species
• Untranslated regions (UTRs) at the ends of genes cannot be detected, but most programs can identify polyadenylation sites
• Non-protein-coding RNA genes cannot be detected (a few specialized programs attempt this)
• None of these programs can detect alternatively spliced transcripts
Explanation of False Positive/Negative in Gene Prediction Programs
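False positives and false negatives can be counted at the nucleotide level by comparing predicted exon coordinates with a trusted annotation. The sketch below is one possible way to do this, assuming 0-based, end-exclusive intervals; the function names are hypothetical. Note that gene-prediction evaluations often report TP/(TP+FP) under the name "specificity", which is called precision here.

```python
def intervals_to_positions(intervals):
    """Expand (start, end) intervals into a set of covered positions."""
    positions = set()
    for start, end in intervals:  # 0-based, end-exclusive
        positions.update(range(start, end))
    return positions


def nucleotide_accuracy(true_exons, predicted_exons):
    """Count TP/FP/FN coding nucleotides and derive two summary rates."""
    truth = intervals_to_positions(true_exons)
    pred = intervals_to_positions(predicted_exons)
    tp = len(truth & pred)   # coding and predicted coding
    fp = len(pred - truth)   # predicted coding but not coding
    fn = len(truth - pred)   # coding but missed
    return {
        "TP": tp,
        "FP": fp,
        "FN": fn,
        "sensitivity": tp / (tp + fn) if tp + fn else 0.0,
        "precision": tp / (tp + fp) if tp + fp else 0.0,
    }


if __name__ == "__main__":
    print(nucleotide_accuracy([(0, 100)], [(50, 150)]))
```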
Gene finding: Issues
Issues regarding gene finding in general
Genome size (larger genome ~ more genes, but …)
Genome composition
Genome complexity (more complexity -> less coding density; fewer genes per kb)
cis-splicing (processing of mRNA in eukaryotes)
trans-splicing (in kinetoplastids)
alternate splicing (e.g. in different tissues; higher organisms)
Variation of genetic code from the universal code
Gene finding: genome
• Genome composition
– Long ORFs tend to be coding
– Presence of more putative ORFs in GC-rich genomes (the stop codons UAA, UAG and UGA are AT-rich, so they occur less often by chance)
• Genome complexity
– Simple repetitive sequences (e.g. dinucleotide) and dispersed repeats tend to be anti-coding
– May need to mask sequence prior to gene prediction
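Masking of simple repeats before gene prediction can be illustrated with a regular expression that replaces runs of a repeated 1 or 2 base unit with N. This is only a toy sketch with hypothetical defaults (`max_unit`, `min_copies`); dedicated tools such as RepeatMasker, listed later in these slides, also handle dispersed repeats and far more complex cases.

```python
import re


def mask_simple_repeats(dna, max_unit=2, min_copies=5, mask_char="N"):
    """Replace tandem repeats of short units (e.g. ATATAT...) with mask_char.

    A run is masked when a unit of 1..max_unit bases occurs at least
    min_copies times in a row. Sequence length is preserved.
    """
    pattern = re.compile(r"([ACGT]{1,%d}?)\1{%d,}" % (max_unit, min_copies - 1))
    return pattern.sub(lambda m: mask_char * len(m.group(0)), dna.upper())


if __name__ == "__main__":
    print(mask_simple_repeats("GATC" + "AT" * 6 + "GCCA"))
```

Keeping the masked sequence the same length as the input means that coordinates reported by a gene finder still refer to the original sequence.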
Gene finding: coding density
As the coding/non-coding length ratio decreases, exon prediction becomes more complex
(Figure: coding density compared for Human, Fugu, worm and E. coli)
Gene finding: splicing
cis-splicing of genes
Finding multiple (short) exons is harder than finding a single (long) exon.
(Figure: worm and E. coli gene structures compared)
trans-splicing of genes
A trans-splice acceptor is no different from a normal splice acceptor
Gene finding: alternate splicing
(Figure: isoforms labelled Human A, Human B and Human C)
Alternatively spliced transcripts (isoforms) are very difficult to predict.
ab initio prediction
What is ab initio gene prediction?
Prediction from first principles using the raw DNA sequence only.
Requires ‘training sets’ of known gene structures to generate statistical tests for the likelihood of a prediction being real.
GATCGGTCGAGCGTAAGCTAGCTAG
ATCGATGATCGATCGGCCATATATC
ACTAGAGCTAGAATCGATAATCGAT
CGATATAGCTATAGCTATAGCCTAT
Gene finding: ab initio
• What features of an ORF can we use?
– Size: large open reading frames
– DNA composition: codon usage / 3rd position codon bias
– Kozak sequence CCGCCAUGG
– Ribosome binding sites
– Termination signal (stops)
– Splice junction boundaries (acceptor/donor)
Gene finding: features
Think of a CDS gene prediction as a linear series of sequence features:
Initiation codon -> Coding sequence (exon) -> [Splice donor (5’) -> Non-coding sequence (intron) -> Splice acceptor (3’) -> Coding sequence (exon)] N times -> Termination codon
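The linear series of features suggests a simple consistency check for a candidate CDS model: splice the exons together and confirm the start codon, stop codon, reading frame and the canonical GT...AG intron boundaries. The sketch below assumes a forward-strand model with 0-based, end-exclusive exon coordinates; the function names are hypothetical.

```python
STOP_CODONS = {"TAA", "TAG", "TGA"}


def spliced_cds(genome, exons):
    """Join exon sequences in genomic order to form the coding sequence."""
    return "".join(genome[s:e] for s, e in sorted(exons)).upper()


def check_gene_model(genome, exons):
    """Return a dict of simple pass/fail checks for a forward-strand CDS model."""
    exons = sorted(exons)
    cds = spliced_cds(genome, exons)
    codons = [cds[i:i + 3] for i in range(0, len(cds) - 2, 3)]
    introns = [
        genome[exons[i][1]:exons[i + 1][0]].upper() for i in range(len(exons) - 1)
    ]
    return {
        "starts_with_atg": cds.startswith("ATG"),
        "ends_with_stop": cds[-3:] in STOP_CODONS,
        "length_multiple_of_3": len(cds) % 3 == 0,
        "no_internal_stop": all(c not in STOP_CODONS for c in codons[:-1]),
        # Canonical introns start with GT (donor) and end with AG (acceptor).
        "introns_gt_ag": all(i.startswith("GT") and i.endswith("AG") for i in introns),
    }


if __name__ == "__main__":
    genome = "CC" + "ATGGCC" + "GTAAAAAG" + "GCCTAA" + "TT"
    print(check_gene_model(genome, [(2, 8), (16, 22)]))
```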
A model ab initio predictor
Locate and score all sequence features used in gene models
Use dynamic programming to build the highest-scoring model from the available features.
e.g. Genefinder (Green)
Run a 5’->3’ pass of the sequence through a Markov model based on a typical gene model
e.g. TBparse (Krogh), GENSCAN (Burge) or GLIMMER (Salzberg)
Run a 5’->3’ pass of the sequence through a neural net trained with confirmed gene models
e.g. GRAIL (Oak Ridge)
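The dynamic programming idea can be illustrated with a much-reduced problem: given candidate exons that already have scores, pick the non-overlapping set with the highest total score. This is a generic weighted interval scheduling sketch, not the Genefinder algorithm; it ignores reading-frame and splice-site compatibility, and the scores in the example are made up.

```python
from bisect import bisect_right


def best_exon_chain(candidates):
    """Choose non-overlapping exons with maximal total score.

    candidates: list of (start, end, score), 0-based and end-exclusive.
    Returns (total_score, chain) with the chain in genomic order.
    """
    exons = sorted(candidates, key=lambda e: e[1])
    ends = [e[1] for e in exons]
    # best[i] holds the optimal (score, chain) using only the first i exons.
    best = [(0.0, [])]
    for i, (start, end, score) in enumerate(exons):
        # j = number of earlier exons that end at or before this exon's start.
        j = bisect_right(ends, start, 0, i)
        take = (best[j][0] + score, best[j][1] + [(start, end, score)])
        skip = best[i]
        best.append(take if take[0] > skip[0] else skip)
    return best[-1]


if __name__ == "__main__":
    candidates = [(0, 100, 5.0), (50, 150, 8.0), (160, 250, 4.0), (90, 170, 3.0)]
    print(best_exon_chain(candidates))
```

Real predictors extend the same recurrence with additional states so that only frame-consistent exon combinations can be chained.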
Ab initio Gene finding programs
• Most gene finding software packages use some variant of Hidden Markov Models (HMMs).
• Predict coding, intergenic, and intron sequences
• Need to be trained on a specific organism.
• Never perfect!
What is an HMM?
• A statistical model that represents a gene.
• Similar to a “weight matrix” that can recognise gaps and treat them in a systematic way.
• Has different “states” that represent introns, exons, and intergenic regions.
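A toy three-state HMM makes the idea of "states" concrete. The sketch below labels each base as intergenic, exon or intron using the Viterbi algorithm. All probabilities are hypothetical values chosen for illustration (GC-rich exons, AT-rich introns), not trained parameters, and real gene finders use many more states, for example one per codon position and per splice signal. The input is assumed to contain only A, C, G and T.

```python
import math

STATES = ("intergenic", "exon", "intron")

# Hypothetical parameters for illustration only; real models are trained.
START = {"intergenic": 0.9, "exon": 0.1, "intron": 0.0}
TRANS = {
    "intergenic": {"intergenic": 0.9, "exon": 0.1, "intron": 0.0},
    "exon": {"intergenic": 0.05, "exon": 0.9, "intron": 0.05},
    "intron": {"intergenic": 0.0, "exon": 0.1, "intron": 0.9},
}
EMIT = {
    "intergenic": {"A": 0.25, "C": 0.25, "G": 0.25, "T": 0.25},
    "exon": {"A": 0.15, "C": 0.35, "G": 0.35, "T": 0.15},
    "intron": {"A": 0.35, "C": 0.15, "G": 0.15, "T": 0.35},
}


def _log(p):
    """Natural log that maps probability 0 to minus infinity."""
    return math.log(p) if p > 0 else float("-inf")


def viterbi(seq):
    """Return the most probable state label for every base of seq."""
    seq = seq.upper()
    if not seq:
        return []
    scores = [{s: _log(START[s]) + _log(EMIT[s][seq[0]]) for s in STATES}]
    back = []  # back[t][s] = best predecessor of state s at position t + 1
    for base in seq[1:]:
        prev = scores[-1]
        col, ptr = {}, {}
        for s in STATES:
            best_prev = max(STATES, key=lambda p: prev[p] + _log(TRANS[p][s]))
            col[s] = prev[best_prev] + _log(TRANS[best_prev][s]) + _log(EMIT[s][base])
            ptr[s] = best_prev
        scores.append(col)
        back.append(ptr)
    # Trace back from the best final state.
    state = max(STATES, key=lambda s: scores[-1][s])
    path = [state]
    for ptr in reversed(back):
        state = ptr[state]
        path.append(state)
    return path[::-1]


if __name__ == "__main__":
    print(viterbi("ATATATGCGCGCGCGCGCGCATATAT"))
```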
Malaria Gene Prediction Tool
• Hexamer – ftp://ftp.sanger.ac.uk/pub/pathogens/software/hexamer/
• Genefinder – email [email protected]
• GlimmerM – http://www.tigr.org/softlab/glimmerm
• Phat – http://www.stat.berkeley.edu/users/scawley/Phat
• Already trained for malaria! The more experimentally derived genes are used to train a gene prediction tool, the more reliable its predictions.
GlimmerM
Salzberg et al. (1999) Genomics 59: 24-31
• Adaptation of the prokaryotic gene finder Glimmer.
Delcher et al. (1999) NAR 27: 4636-4641
• Based on an interpolated HMM (IHMM).
• Only uses short chains of bases (Markov chains) to generate probabilities.
• Trained identically to Phat
An end to ab initio prediction
• ab initio gene prediction is inaccurate
• Most predictors have high false positive rates but low false negative rates
• Incorporating similarity information is meant to reduce the false positive rate, but at the same time it also increases the false negative rate.
• Biggest determinant of false positive/negative is gene size.
• Exon prediction sensitivity can be good
• Rarely used as a final product
– Human annotation runs multiple algorithms and scores exons predicted by multiple predictors.
– Used as a starting point for refinement/verification
• Predictions need correction and validation
• Why not just build gene models by comparative means?
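Scoring exons by how many predictors agree on them can be sketched as a simple vote. The example assumes that two predictions count as the same exon only when their coordinates match exactly, which is a simplification; the predictor names in the example are placeholders and `min_votes` is a hypothetical parameter.

```python
from collections import Counter


def consensus_exons(predictions, min_votes=2):
    """Keep exons reported by at least min_votes predictors.

    predictions: dict mapping a predictor name to a list of (start, end) exons.
    Returns the supported exons sorted by coordinate.
    """
    votes = Counter()
    for exons in predictions.values():
        for exon in set(exons):  # each predictor votes at most once per exon
            votes[exon] += 1
    return sorted(exon for exon, n in votes.items() if n >= min_votes)


if __name__ == "__main__":
    predictions = {
        "predictor_a": [(10, 50), (100, 200)],
        "predictor_b": [(10, 50), (120, 200)],
        "predictor_c": [(10, 50), (100, 200), (300, 350)],
    }
    print(consensus_exons(predictions))
```

Raising `min_votes` trades false positives for false negatives, which mirrors the trade-off described in the bullets above.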
If a cell was human?
The cell ‘knows’ how to splice a gene together.
We know some of these signals but not all and not all of the time
So compare with known examples from the species and others
Central dogma of molecular biology:
DNA (genome) -> RNA (transcriptome) -> Protein (proteome)
When a human looks at a cell
Compare with the rest of the genome/transcriptome/proteome data
DNA: extract DNA and sequence the genome
RNA: extract RNA, reverse transcribe and sequence cDNA
Protein: peptide sequence inferred from gene prediction
comparative gene prediction
Use knowledge of known coding sequences to identify coding regions of genomic DNA by similarity
transcriptome - transcribed DNA sequence
proteome - peptide sequence
genome - related genomic sequence
Transcript-based prediction: datasets
Generation of large numbers of Expressed Sequence Tags (ESTs)
Quick, cheap but random
Subtractive hybridisation to find rare transcripts
Use multiple libraries for different life-stages/conditions
Single-pass sequence prone to errors
Generation of small number of full length cDNA sequences
Slow and laborious but focused
Large-scale sequencing of (presumed) full length cDNAs
Systematic, multiplexed cloning/sequencing of CDS
Expensive and only viable if part of bigger project
Gene Prediction in Eukaryotes – Simplified
• For highly conserved proteins:
– Translate the DNA sequence in all 6 reading frames
– Use BLASTX or FASTX to compare the sequence to a protein sequence database
– Or
– Compare a protein against a nucleic acid database, including genomic sequence, that is translated in all six possible reading frames by the TBLASTN or TFASTX/TFASTY programs.
• Note: approximation of the gene structure only.
Transcript-based prediction: How it works
Align transcript data (EST, cDNA) to genomic sequence using a pair-wise sequence comparison to build the gene model.
Transcript-based gene prediction: algorithm
BLAST (Altschul) (36 hours)
Widely used and understood
HSPs often have ‘ragged’ ends, so they extend into the ends of the introns
EST_GENOME (Mott) (3 days)
Dynamic programming post-process of BLAST
Slow and sometimes cryptic
BLAT (Kent) (1/2 hour)
Next generation of alignment algorithm
Designed for looking at nearly identical sequences
Faster and more accurate than BLAST
Peptide-based gene prediction: algorithm
BLAST (Altschul)
Widely used and understood
Smith-Waterman
Preliminary to further processing
Used in preference to DNA-based similarities for evolutionarily diverged species, as peptide conservation is significantly higher than nucleotide conservation
Genomic-based gene prediction: algorithm
BLAST (Altschul)
Can be used in TBLASTX mode
BLAT (Kent)
Can be used in a translated DNA vs translated DNA mode
Significantly faster than BLAST
WABA (Kent)
Designed to allow for 3rd position codon wobble
Slow with some outstanding problems
Only really used in the C. elegans v C. briggsae analysis
Comparative gene predictors
This can be viewed as an extension of the ab initio prediction tools, where coding exons are defined by similarities and not codon bias
GAZE (Howe) is an extension of Phil Green’s Genefinder in which transcript data is used to define coding exons. Other features are scored as in the original Genefinder implementation. This is being evaluated and used in the C. elegans project.
GENEWISE (Birney) is an HMM-based gene predictor which attempts to predict the closest CDS to a supplied peptide sequence. This is the workhorse predictor for the ENSEMBL project.
Comparative gene predictors
A new generation of comparative gene prediction tools is being developed to utilise the large amount of genomic sequence available.
Twinscan (WashU) attempts to predict genes using related genomic sequences.
Doublescan (Sanger) is an HMM-based gene predictor which attempts to predict 2 orthologous CDSs from genomic regions pre-defined as matching.
Both of these predictors are in development and will be used for the C. elegans v C. briggsae match and the Mouse v Human match later this year.
Summary
Genes are complex structures which are difficult to predict with the required level of accuracy/confidence
We can predict stops better than starts
We can only give gross confidence levels to predictions (i.e. confirmed, partially confirmed or predicted)
Gene prediction is only part of the annotation procedure
Movement from ab initio to comparative methodology as sequence data becomes available/affordable
Curation of gene models is an active process – the set of gene models for a genome is fluid and WILL change over time.
The Annotation Process
DNA SEQUENCE -> ANALYSIS SOFTWARE -> Useful Information -> Annotator
Annotation Process
DNA sequence
DNA-level analysis: RepeatMasker, Blastn, Halfwise, Blastx, Gene finders, tRNA scan
Features identified: Repeats, Promoters, Pseudo-genes, rRNA, Genes, tRNA
Protein-level analysis: Fasta, BlastP, Pfam, Prosite, Psort, SignalP, TMHMM
Artemis
• Artemis is a free DNA sequence viewer and annotation tool that allows visualization of sequence features and the results of analyses within the context of the sequence, and its six-frame translation.
• http://www.sanger.ac.uk/Software/Artemis/
atcttttacttttttcatcatctatacaaaaaatcatagaatattcatcatgttgtttaaaataatgtattccattatgaactttattacaaccctcgtttttaattaattcacattttatatctttaagtataatatcatttaacattatgttatcttcctcagtgtttttcattattatttgcatgtacagtttatcatttttatgtaccaaactatatcttatattaaatggatctctacttataaagttaaaatctttttttaattttttcttttcacttccaattttatattccgcagtacatcgaattctaaaaaaaaaaataaataatatataatatataataaataatatataataaataatatataatatataataaataatatataatatataatatataataaataatatataatatataatatataataaataatatataataaataatatataatatataatatataatactttggaaagattatttatatgaatatatacacctttaataggatacacacatcatatttatatatatacatataaatattccataaatatttatacaacctcaaataaaataaacatacatatatatatataaatatatacatatatgtatcattacgtaaaaacatcaaagaaatatactggaaaacatgtcacaaaactaaaaaaggtattaggagatatatttactgattcctcatttttataaatgttaaaattattatccctagtccaaatatccacatttattaaattcacttgaatattgttttttaaattgctagatatattaatttgagatttaaaattctgacctatataaacctttcgagaatttataggtagacttaaacttatttcatttgataaactaatattatcatttatgtccttatcaaaatttattttctccatttcagttattttaaacatattccaaatattgttattaaacaagggcggacttaaacgaagtaattcaatcttaactccctccttcacttcactcattttatatattccttaatttttactatgtttattaaattaacatatatataaacaaatatgtcactaataatatatatatatatatatatatatatatatattataaatgttttactctattttcacatcttgtccttttttttttaaaaatcccaattcttattcattaaataataatgtattttttttttttttttttttttttattaattattatgttactgttttattatatacactcttaatcatatatatatatttatatatatatatatatatatatatatattattcccttttcatgttttaaacaagaaaaaaaactaaaaaaaaaaaaaataataaaatatatttttataacatatgtattattaaaatgtatatataaaaatatatattccatttattattatttttttatatacattgttataagagtatcttctcccttctggtttatattactaccatttcactttgaacttttcataaaaattaatagaatatcaaatatgtataatatataacaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaatatatatatatatatatatacatataatatatatttcatctaatcatttaaaattattattatatattttttaaaaaatatatttatgataacataaaaagaatttaattttaattaaatatatataattacatacatctaatattattatatatatataataagttttccaaatagaatacttatatattatatatatatatatatatatatatattcttccataaaaagaataaaataaaataaaaacaccttaaaagtatttgtaaaaaattccccacattgaatatatagttgtatttataaaattaaagaaaaagcataaagttaccatttaatagtggagattagtaacattttcttcattatcaaaaatatttatttcctaattttttttttttg
taaaatatatttaaaaatgtaatagattatgtattaaataatataaatatagcaaaatgttcaattttagaaatttgcctctttttgacaaggataattcaaaagatacaggtaaaaaaaaaaaaataaagtaaaacaaaacaaaacaaaaaacaaaaaaaaaaaaaaaaaaaaaaatgacatgttataatataatataataaataaaaattatgtaatatatcataatcgaagaaacatatatgaaaccaaaaagaaacagatcttgatttattaatacatatataactaacattcatatctttatttttgtagatgatataaaaaattttataaactcttatgaagggatatatttttcatcatccaataaatttataaatgtatttctagacaaaattctgatcattgatccgtcttccttaaatgttattacaataaatacagatctgtatgtagttgatttcctttttaatgagaaaaataagaatcttattgttttagggtaatgaaatatatatagatttatatttttatttatttattatatattattttttaatttttcttttatatatttattttatttagtgtataaaatgatatcctttatatttatatttacatgggatattcaaataataacaaaaatgagtatacacatatatatatatatatatatatatatgtatattttttttttttttttatgttcctataggaaagggaagaattcactgatttgtagtgtttacaatattagggaatgcaactttacacttttgaaaaaaattcagttaagcaaaaatattaataacattaaaaagacactgatagcaaaatgtaatgaatatataataacattagaaaataagaaaattactttttatttcttaaataaagattatagtataaatcaaagtgaattaatagaagacggaaaagaacttattgaaaatatctatttgtcaaaaaatcatatcttgttagtaataaaaaattcatatgtatatatataccaattagatattaaaaattcccatattagttatacacttattgatagtttcaatttaaatttatcctacctcagagaatctataaataataaaaaaaagcatataaataaaataaatgatgtatcaaataatgacccaaaaaaggataataatgaaaaaaatacttcatctaataatataacacataacaattataatgacatatcaaataataataataataataataatattaatggggtgaaagaccatataaataataacactctggaaaataatgatgaaccaatcttatctatatataatgaagatcttaatgttttatatatatgccaaaatatgtataacgtcctttttgttttgaatttaaataacctaagt
DNA in Artemis
(Figure: Artemis display with panels for GC content, forward translations, reverse translations, and DNA and amino acids)
Black bar = stop codon
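A GC content track like the one in the Artemis display can be approximated with a sliding window. The sketch below is an independent illustration, not Artemis code; the `window` and `step` defaults are arbitrary assumptions.

```python
def gc_content_windows(dna, window=120, step=60):
    """Return (window_start, gc_fraction) pairs along the sequence."""
    dna = dna.upper()
    results = []
    for start in range(0, max(len(dna) - window, 0) + 1, step):
        chunk = dna[start:start + window]
        if not chunk:
            break
        gc = sum(chunk.count(base) for base in "GC")
        results.append((start, gc / len(chunk)))
    return results


if __name__ == "__main__":
    for start, gc in gc_content_windows("GGCCAATTGGCCAATT", window=4, step=4):
        print(start, round(gc, 2))
```

For a very AT-rich sequence such as the one shown on the preceding slide, local rises in GC content are one hint that a region may be coding.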