NCBI Gene Prediction and Annotation techniques Basics Chuong Huynh NIH/NLM/NCBI Sept 30, 2004...

Gene Prediction and Annotation techniques

Basics

Chuong Huynh

NIH/NLM/NCBI

Sept 30, 2004

Acknowledgement: Daniel Lawson, Neil Hall

What is gene prediction?

Detecting meaningful signals in uncharacterised DNA sequences.

Knowledge of the interesting information in DNA.

Sorting the ‘chaff from the wheat’

Gene prediction is ‘recognising protein-coding regions in genomic sequence’

GATCGGTCGAGCGTAAGCTAGCTAG

ATCGATGATCGATCGGCCATATATC

ACTAGAGCTAGAATCGATAATCGAT

CGATATAGCTATAGCTATAGCCTAT

Basic Gene Prediction Flow Chart

Obtain new genomic DNA sequence

1. Translate in all six reading frames and compare to protein sequence databases

2. Perform a database similarity search against the expressed sequence tag (EST) database of the same organism, or against cDNA sequences if available

Use gene prediction program to locate genes

Analyze regulatory sequences in the gene
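The first step of the flow chart, translating in all six reading frames, can be sketched in a few lines of Python. This is a minimal illustration, assuming the standard genetic code and an input containing only A, C, G and T; the function names are hypothetical and not taken from any tool named in this presentation.

```python
# Standard genetic code, built from the conventional TCAG codon ordering.
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    a + b + c: AMINO[16 * i + 4 * j + k]
    for i, a in enumerate(BASES)
    for j, b in enumerate(BASES)
    for k, c in enumerate(BASES)
}


def reverse_complement(dna):
    """Return the reverse complement of a DNA string."""
    return dna.upper().translate(str.maketrans("ACGT", "TGCA"))[::-1]


def translate(dna):
    """Translate complete codons from position 0; '*' marks a stop codon."""
    dna = dna.upper()
    return "".join(
        CODON_TABLE.get(dna[i:i + 3], "X")  # 'X' for codons with unknown bases
        for i in range(0, len(dna) - 2, 3)
    )


def six_frame_translations(dna):
    """Return a dict mapping frame labels (+1..+3, -1..-3) to peptides."""
    frames = {}
    for strand, seq in (("+", dna.upper()), ("-", reverse_complement(dna))):
        for offset in range(3):
            frames[f"{strand}{offset + 1}"] = translate(seq[offset:])
    return frames
```

Each of the six peptides would then be compared against a protein database, for example with a translated search program.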

ACEDB View

Why is gene prediction important?

-Increased volume of genome data generated

-Paradigm shift from gene by gene sequencing (small scale) to large-scale genome sequencing.

-No more one gene at a time. A lot of data.

-Foundation for all further investigation. Knowledge of the protein-coding regions underpins functional genomics.

Note: this presentation covers only the prediction of protein-coding genes. It does not cover promoter prediction, i.e. the sequences that regulate the activity of protein-coding genes.

Map Viewer

[Figure: Map Viewer display with tracks labelled Genes, Genome Scan, Models, Human EST hits, Contig, GenBank and Mouse EST hits]

Artemis – Free Genome Visualization/Annotation Workbench

Genome WorkBench

Knowing what to look for

What is a gene?

Not a full transcript with control regions

The coding sequence (ATG -> STOP)

Start, Middle (N times), End

ORF Finding in Prokaryotes

• The simplest method of finding DNA sequences that encode proteins is to search for open reading frames (ORFs)

• An ORF is a DNA sequence that contains a contiguous set of codons, each of which specifies an amino acid

• Six possible reading frames

• Good for prokaryotic systems (little or no post-transcriptional processing such as splicing)

• Runs from Met (AUG) on the mRNA to a stop codon, TER (UAA, UAG, UGA)

• NCBI ORF Finder: http://www.ncbi.nlm.nih.gov/gorf/
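The ORF definition above translates into a short scan over the six reading frames. The following is a minimal sketch, not the algorithm used by NCBI ORF Finder: it assumes an ORF runs from the first ATG after the previous stop to the next in-frame stop codon. The function name, the `min_codons` parameter and the coordinate convention (0-based, end-exclusive, relative to the scanned strand) are choices made here for illustration.

```python
STOP_CODONS = {"TAA", "TAG", "TGA"}


def reverse_complement(dna):
    """Return the reverse complement of a DNA string."""
    return dna.upper().translate(str.maketrans("ACGT", "TGCA"))[::-1]


def find_orfs(dna, min_codons=30):
    """Return (strand, frame, start, end, sequence) for each ATG...stop ORF.

    Coordinates are 0-based, end-exclusive, on the strand being scanned.
    min_codons counts the start and stop codons as well.
    """
    orfs = []
    for strand, seq in (("+", dna.upper()), ("-", reverse_complement(dna))):
        for frame in range(3):
            start = None  # position of the current candidate start codon
            for i in range(frame, len(seq) - 2, 3):
                codon = seq[i:i + 3]
                if start is None and codon == "ATG":
                    start = i
                elif start is not None and codon in STOP_CODONS:
                    if (i + 3 - start) // 3 >= min_codons:
                        orfs.append((strand, frame + 1, start, i + 3, seq[start:i + 3]))
                    start = None  # look for the next ORF in this frame
    return orfs
```

On a real prokaryotic genome, `min_codons` would typically be set high enough to suppress the many short ORFs that occur by chance.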

ORF Finder (Open Reading Frame Finder)

Annotation of eukaryotic genomes

[Figure: Genomic DNA -> (transcription) -> Unprocessed RNA -> (RNA processing) -> Mature mRNA (poly-A tail, AAAAAAA) -> (translation) -> Nascent polypeptide -> (folding) -> Active enzyme -> Function (Reactant A -> Product B). The stages are annotated with: ab initio gene prediction (w/o prior knowledge); comparative gene prediction (use other biological data); functional identification.]

Two Classes of Sequence Information

• Signal Terms – short sequence motifs (such as splice sites, branch points, polypyrimidine tracts, start codons, and stop codons)

• Content Terms – patterns of codon usage that are characteristic of a species and allow coding sequences to be distinguished from surrounding non-coding sequences by a statistical detection algorithm

Problem Using Codon Usage

• Program must be taught what the codon usage patterns look like by presenting the program with a TRAINING SET of known coding sequences.

• Different programs search for different patterns.

• A NEW training set is needed for each species

• Untranslated regions (UTRs) at the ends of genes cannot be detected, but most programs can identify polyadenylation sites

• Non-protein-coding RNA genes cannot be detected (a few specialized programs attempt this)

• None of these programs can detect alternatively spliced transcripts
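To make the idea of a training set concrete, the sketch below estimates codon frequencies from known coding sequences and scores a candidate region by its log-odds against a uniform background. It is an illustrative simplification under stated assumptions (uniform 1/64 background, add-one pseudocounts, hypothetical function names), not the scoring scheme of any particular gene finder.

```python
import math
from collections import Counter
from itertools import product

CODONS = ["".join(c) for c in product("ACGT", repeat=3)]
CODON_SET = set(CODONS)


def codons_in(seq):
    """Split a sequence into complete in-frame codons, starting at position 0."""
    seq = seq.upper()
    return [seq[i:i + 3] for i in range(0, len(seq) - 2, 3)]


def train_codon_frequencies(training_cds, pseudocount=1.0):
    """Estimate codon frequencies from a training set of known coding sequences."""
    counts = Counter()
    for cds in training_cds:
        counts.update(c for c in codons_in(cds) if c in CODON_SET)
    total = sum(counts.values()) + pseudocount * len(CODONS)
    return {c: (counts[c] + pseudocount) / total for c in CODONS}


def coding_log_odds(seq, freqs):
    """Sum of log2(trained frequency / uniform frequency) over in-frame codons.

    Positive scores mean the codon usage resembles the training set.
    """
    return sum(
        math.log2(freqs[c] * len(CODONS))
        for c in codons_in(seq)
        if c in freqs
    )
```

Because the frequencies come entirely from the training sequences, a model trained on one species scores another species' genes poorly, which is why a new training set is needed for each species.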

Explanation of False Positive/Negative in Gene Prediction Programs
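The figure for this slide did not survive extraction. As a hedged illustration of the terms in its title, the sketch below counts true positives, false positives and false negatives at the nucleotide level by comparing predicted exons with annotated ones. It assumes the convention common in gene-prediction evaluation where sensitivity is TP/(TP+FN) and "specificity" is TP/(TP+FP); the function names and the interval format are hypothetical.

```python
def coding_positions(exons):
    """Expand (start, end) exons (0-based, end-exclusive) into a set of positions."""
    return {pos for start, end in exons for pos in range(start, end)}


def nucleotide_accuracy(true_exons, predicted_exons):
    """Compare predicted and annotated coding nucleotides."""
    truth = coding_positions(true_exons)
    predicted = coding_positions(predicted_exons)
    tp = len(truth & predicted)   # coding, predicted coding
    fp = len(predicted - truth)   # non-coding, predicted coding
    fn = len(truth - predicted)   # coding, missed by the predictor
    return {
        "TP": tp,
        "FP": fp,
        "FN": fn,
        "sensitivity": tp / (tp + fn) if tp + fn else 0.0,
        "specificity": tp / (tp + fp) if tp + fp else 0.0,
    }
```

For example, a predicted exon at 50-150 scored against a true exon at 0-100 gives 50 true positives, 50 false positives and 50 false negatives.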

Gene finding: Issues

Issues regarding gene finding in general

Genome size (larger genome ~ more genes, but …)

Genome composition

Genome complexity (more complexity -> less coding density; fewer genes per kb)

cis-splicing (processing of mRNA in eukaryotes)

trans-splicing (in kinetoplastids)

alternate splicing (e.g. in different tissues; higher organisms)

Variation of genetic code from the universal code

Gene finding: genome

• Genome composition

– Long ORFs tend to be coding

– Presence of more putative ORFs in GC-rich genomes (stop codons = UAA, UAG & UGA)

• Genome complexity

– Simple repetitive sequences (e.g. dinucleotide) and dispersed repeats tend to be non-coding

– May need to mask sequence prior to gene prediction
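Masking can be illustrated with a toy filter that replaces mono- and dinucleotide repeats with N before the sequence is passed to a gene finder. This is only a sketch: the threshold of six or more copies is an arbitrary assumption, the input is assumed to contain only A, C, G and T, and dispersed repeats need a dedicated repeat-masking program with a repeat library.

```python
import re

# A unit of 1-2 bases followed by at least five further copies of itself.
SIMPLE_REPEAT = re.compile(r"([ACGT]{1,2})\1{5,}")


def mask_simple_repeats(dna):
    """Replace runs of mono- or dinucleotide repeats with N of equal length."""
    return SIMPLE_REPEAT.sub(lambda m: "N" * len(m.group(0)), dna.upper())
```

Keeping the masked region the same length preserves the coordinates of everything downstream, so predictions still map back to the original sequence.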

Gene finding: coding density

As the coding/non-coding length ratio decreases, exon prediction becomes more complex

[Figure: coding density compared across Human, Fugu, worm and E. coli]

Gene finding: splicing

cis-splicing of genes

Finding multiple (short) exons is harder than finding a single (long) exon.

[Figure: gene structures in worm and E. coli]

trans-splicing of genes

A trans-splice acceptor is no different to a normal splice acceptor

Gene finding: alternate splicing

Human A

Human B

Human C

Alternatively spliced isoforms are very difficult to predict.

ab initio prediction

What is ab initio gene prediction?

Prediction from first principles using the raw DNA sequence only.

Requires ‘training sets’ of known gene structures to generate statistical tests for the likelihood of a prediction being real.

GATCGGTCGAGCGTAAGCTAGCTAG

ATCGATGATCGATCGGCCATATATC

ACTAGAGCTAGAATCGATAATCGAT

CGATATAGCTATAGCTATAGCCTAT

Gene finding: ab initio

• What features of an ORF can we use?

– Size - large open reading frames

– DNA composition - codon usage / 3rd position codon bias

– Kozak sequence CCGCCAUGG

– Ribosome binding sites

– Termination signal (stops)

– Splice junction boundaries (acceptor/donor)

Gene finding: features

Think of a CDS gene prediction as a linear series of sequence features:

Initiation codon

Coding sequence (exon)

Non-coding sequence (intron)

Termination codon

Splice donor (5’)

Splice acceptor (3’)

Coding sequence (exon)

N times
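Viewing a CDS as a linear series of features suggests a simple consistency check for any proposed gene model. The sketch below assumes forward-strand exons given as 0-based, end-exclusive intervals and canonical GT...AG introns; these conventions and the function name are assumptions made for illustration, not something stated on the slide.

```python
STOP_CODONS = {"TAA", "TAG", "TGA"}


def check_gene_model(genome, exons):
    """Return a list of problems found in a forward-strand CDS gene model.

    exons: sorted list of (start, end) intervals, 0-based and end-exclusive.
    An empty list means the model passes every check.
    """
    genome = genome.upper()
    problems = []
    cds = "".join(genome[start:end] for start, end in exons)

    if not cds.startswith("ATG"):
        problems.append("no initiation codon (ATG)")
    if cds[-3:] not in STOP_CODONS:
        problems.append("no termination codon")
    if len(cds) % 3:
        problems.append("CDS length is not a multiple of 3")
    # Every codon except the last must be a sense codon.
    if any(cds[i:i + 3] in STOP_CODONS for i in range(0, len(cds) - 3, 3)):
        problems.append("in-frame stop codon inside the CDS")
    # Each intron lies between the end of one exon and the start of the next.
    for (_, donor), (acceptor, _) in zip(exons, exons[1:]):
        intron = genome[donor:acceptor]
        if not (intron.startswith("GT") and intron.endswith("AG")):
            problems.append(f"intron {donor}-{acceptor} lacks GT...AG boundaries")
    return problems
```

A check like this does not say whether a model is biologically correct, only whether it is internally consistent with the feature series listed above.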

A model ab initio predictor

Locate and score all sequence features used in gene models

Use dynamic programming to build the highest-scoring model from the available features.

e.g. Genefinder (Green)

Run the sequence, in a 5’->3’ pass, through a Markov model based on a typical gene model

e.g. TBparse (Krogh), GENSCAN (Burge) or GLIMMER (Salzberg)

Run the sequence, in a 5’->3’ pass, through a neural net trained with confirmed gene models

e.g. GRAIL (Oak Ridge)
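The dynamic programming approach can be illustrated with a deliberately simplified problem: given scored candidate exons, pick the non-overlapping subset with the highest total score. This is a generic weighted-interval sketch, not Genefinder's actual algorithm; it ignores reading-frame compatibility and splice-site pairing, and the tuple format is an assumption.

```python
from bisect import bisect_right


def best_exon_chain(candidates):
    """Choose non-overlapping exons with the maximum total score.

    candidates: list of (start, end, score), 0-based and end-exclusive.
    Returns (best_score, chain), with the chain in left-to-right order.
    """
    exons = sorted(candidates, key=lambda e: e[1])  # sort by end coordinate
    ends = [e[1] for e in exons]
    # best[i] holds the optimal (score, chain) using only the first i exons.
    best = [(0.0, [])]
    for i, (start, end, score) in enumerate(exons):
        # j = number of earlier exons that end at or before this exon starts.
        j = bisect_right(ends, start, 0, i)
        score_with = best[j][0] + score
        if score_with > best[i][0]:
            best.append((score_with, best[j][1] + [(start, end, score)]))
        else:
            best.append(best[i])  # skipping this exon is at least as good
    return best[-1]
```

The key property is that each candidate is considered once and the optimal answer for a prefix is reused, which is what lets real gene finders evaluate a very large number of possible gene structures efficiently.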

Ab initio Gene finding programs

• Most gene finding software packages use some variant of Hidden Markov Models (HMM).

• Predict coding, intergenic, and intron sequences

• Need to be trained on a specific organism.

• Never perfect!

What is an HMM?

• A statistical model that represents a gene.

• Similar to a “weight matrix” that can recognise gaps and treat them in a systematic way.

• Has different “states” that represent introns, exons, and intergenic regions.
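A toy example may help show how the states are used. The sketch below runs the Viterbi algorithm over a three-state model (intergenic, exon, intron) and returns the most probable state for each base. All probabilities are invented for illustration (GC-rich exons, AT-rich introns) and are not trained on any organism; real gene finders use far richer state structures. The input is assumed to contain only A, C, G and T.

```python
import math

STATES = ("intergenic", "exon", "intron")

# Illustrative, untrained parameters (assumptions for this sketch only).
START = {"intergenic": 0.8, "exon": 0.1, "intron": 0.1}
TRANS = {
    "intergenic": {"intergenic": 0.9, "exon": 0.1, "intron": 0.0},
    "exon": {"intergenic": 0.1, "exon": 0.8, "intron": 0.1},
    "intron": {"intergenic": 0.0, "exon": 0.2, "intron": 0.8},
}
EMIT = {
    "intergenic": {"A": 0.25, "C": 0.25, "G": 0.25, "T": 0.25},
    "exon": {"A": 0.15, "C": 0.35, "G": 0.35, "T": 0.15},
    "intron": {"A": 0.35, "C": 0.15, "G": 0.15, "T": 0.35},
}


def _log(p):
    """Natural log that maps probability 0 to -infinity instead of raising."""
    return math.log(p) if p > 0 else float("-inf")


def viterbi(seq):
    """Return the most probable state path for a DNA sequence."""
    seq = seq.upper()
    if not seq:
        return []
    scores = {s: _log(START[s]) + _log(EMIT[s][seq[0]]) for s in STATES}
    backpointers = []
    for base in seq[1:]:
        new_scores, pointers = {}, {}
        for s in STATES:
            # Best previous state from which to enter state s.
            prev = max(STATES, key=lambda p: scores[p] + _log(TRANS[p][s]))
            new_scores[s] = scores[prev] + _log(TRANS[prev][s]) + _log(EMIT[s][base])
            pointers[s] = prev
        backpointers.append(pointers)
        scores = new_scores
    # Trace back from the best final state.
    state = max(STATES, key=scores.get)
    path = [state]
    for pointers in reversed(backpointers):
        state = pointers[state]
        path.append(state)
    return path[::-1]
```

The zero entries in `TRANS` encode structural rules, such as an intron never being entered directly from intergenic sequence, which is how an HMM enforces a valid gene grammar while scoring composition.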

Malaria Gene Prediction Tool

• Hexamer – ftp://ftp.sanger.ac.uk/pub/pathogens/software/hexamer/

• Genefinder – email [email protected]

• GlimmerM – http://www.tigr.org/softlab/glimmerm

• Phat – http://www.stat.berkeley.edu/users/scawley/Phat

• Already trained for malaria! The more experimentally derived genes are used to train a gene prediction tool, the more reliable the predictor.

GlimmerM

Salzberg et al. (1999) Genomics 59: 24-31

• Adaptation of the prokaryotic gene finder Glimmer.

Delcher et al. (1999) NAR 27: 4636-4641

• Based on an interpolated HMM (IHMM).

• Only uses short chains of bases (Markov chains) to generate probabilities.

• Trained identically to Phat

An end to ab initio prediction

• ab initio gene prediction is inaccurate

• Most predictors have high false positive rates, but also low false negative rates

• Incorporating similarity information is meant to reduce the false positive rate, but at the same time it also increases the false negative rate

• Biggest determinant of false positive/negative is gene size

• Exon prediction sensitivity can be good

• Rarely used as a final product

– Human annotation runs multiple algorithms and scores exons predicted by multiple predictors

– Used as a starting point for refinement/verification

• Predictions need correction and validation

• Why not just build gene models by comparative means?

Annotation of eukaryotic genomes

[Figure: Genomic DNA -> (transcription) -> Unprocessed RNA -> (RNA processing) -> Mature mRNA (poly-A tail, AAAAAAA) -> (translation) -> Nascent polypeptide -> (folding) -> Active enzyme -> Function (Reactant A -> Product B). The stages are annotated with: ab initio gene prediction (w/o prior knowledge); comparative gene prediction (use other biological data); functional identification.]

If a cell was human?

The cell ‘knows’ how to splice a gene together.

We know some of these signals but not all and not all of the time

So compare with known examples from the species and others

Central dogma for molecular biology

Genome (DNA) -> Transcriptome (RNA) -> Proteome (Protein)

When a human looks at a cell

Compare with the rest of the genome/transcriptome/proteome data

DNA: Extract DNA and sequence genome

RNA: Extract RNA, reverse transcribe and sequence cDNA

Protein: Peptide sequence inferred from gene prediction

comparative gene prediction

Use knowledge of known coding sequences to identify regions of genomic DNA by similarity

transcriptome - transcribed DNA sequence

proteome - peptide sequence

genome - related genomic sequence

Transcript-based prediction: datasets

Generation of large numbers of Expressed Sequence Tags (ESTs)

Quick, cheap but random

Subtractive hybridisation to find rare transcripts

Use multiple libraries for different life-stages/conditions

Single-pass sequence prone to errors

Generation of small number of full length cDNA sequences

Slow and laborious but focused

Large-scale sequencing of (presumed) full length cDNAs

Systematic, multiplexed cloning/sequencing of CDS

Expensive and only viable if part of bigger project

Gene Prediction in Eukaryotes – Simplified

• For highly conserved proteins:

– Translate DNA sequence in all 6 reading frames

– BLASTX or FASTX to compare the sequence to a protein sequence database

– Or

– Protein compared against a nucleic acid database, including genomic sequence, that is translated in all six possible reading frames by the TBLASTN or TFASTX/TFASTY programs

• Note: Approximation of the gene structure only.

Transcript-based prediction: How it works

EST

cDNA

Align transcript data to genomic sequence using a pair-wise sequence comparison

GeneModel:
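The idea of aligning a transcript to the genome to recover a gene model can be shown with a toy exact-match mapper. This sketch is far simpler than the alignment programs discussed in this presentation: it assumes an error-free transcript from the forward strand, uses a greedy seed-and-extend strategy, and may place an exon boundary a base or two off when the intron start happens to match the next exon. The function name and `seed` parameter are hypothetical.

```python
def map_transcript_to_genome(transcript, genome, seed=8):
    """Return exon blocks as (genome_start, genome_end), 0-based, end-exclusive.

    Each block is found by locating an exact seed match downstream of the
    previous block and extending it until the first mismatch.
    """
    transcript, genome = transcript.upper(), genome.upper()
    blocks = []
    t_pos = 0  # next unaligned transcript position
    g_pos = 0  # genome position after the previous block
    while t_pos < len(transcript):
        hit = genome.find(transcript[t_pos:t_pos + seed], g_pos)
        if hit == -1:
            raise ValueError(f"no exact match for transcript position {t_pos}")
        length = 0
        while (
            t_pos + length < len(transcript)
            and hit + length < len(genome)
            and transcript[t_pos + length] == genome[hit + length]
        ):
            length += 1
        blocks.append((hit, hit + length))
        t_pos += length
        g_pos = hit + length
    return blocks
```

The gaps between consecutive blocks correspond to introns, so the returned blocks are already a rough gene model. Real ESTs contain sequencing errors, which is why practical tools use tolerant alignment rather than exact matching.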

Transcript-based gene prediction: algorithm

BLAST (Altschul) (36 hours)

Widely used and understood

HSPs often have ‘ragged’ ends, so they extend into the ends of the introns

EST_GENOME (Mott) (3 days)

Dynamic programming post-process of BLAST

Slow and sometimes cryptic

BLAT (Kent) (1/2 hour)

Next generation of alignment algorithm

Designed for looking at nearly identical sequences

Faster and more accurate than BLAST

Peptide-based gene prediction: algorithm

BLAST (Altschul)

Widely used and understood

Smith-Waterman

Preliminary to further processing

Used in preference to DNA-based similarities for evolutionarily diverged species, as peptide conservation is significantly higher than nucleotide conservation

Genomic-based gene prediction: algorithm

BLAST (Altschul)

Can be used in TBLASTX mode

BLAT (Kent)

Can be used in a translated DNA vs translated DNA mode

Significantly faster than BLAST

WABA (Kent)

Designed to allow for 3rd position codon wobble

Slow with some outstanding problems

Only really used in C. elegans v C. briggsae analysis

Comparative gene predictors

This can be viewed as an extension of the ab initio prediction tools, where coding exons are defined by similarities and not codon bias

GAZE (Howe) is an extension of Phil Green’s Genefinder in which transcript data is used to define coding exons. Other features are scored as in the original Genefinder implementation. This is being evaluated and used in the C. elegans project.

GENEWISE (Birney) is an HMM-based gene predictor which attempts to predict the closest CDS to a supplied peptide sequence. This is the workhorse predictor for the ENSEMBL project.

Comparative gene predictors

A new generation of comparative gene prediction tools is being developed to utilise the large amount of genomic sequence available.

Twinscan (WashU) attempts to predict genes using related genomic sequences.

Doublescan (Sanger) is an HMM-based gene predictor which attempts to predict two orthologous CDSs from genomic regions pre-defined as matching.

Both of these predictors are in development and will be used for the C. elegans v C. briggsae match and the Mouse v Human match later this year.

Summary

Genes are complex structures that are difficult to predict with the required level of accuracy/confidence

We can predict stops better than starts

We can only give gross confidence levels to predictions (i.e. confirmed, partially confirmed or predicted)

Gene prediction is only part of the annotation procedure

Movement from ab initio to comparative methodology as sequence data becomes available/affordable

Curation of gene models is an active process – the set of gene models for a genome is fluid and WILL change over time.

The Annotation Process

DNA SEQUENCE

ANALYSIS SOFTWARE

Useful Information

Annotator

Annotation Process

DNA sequence

RepeatMasker, Blastn, Halfwise, Blastx, Gene finders, tRNA scan

Repeats, Promoters, Pseudo-genes, rRNA, Genes, tRNA

Fasta, BlastP, Pfam, Prosite, Psort, SignalP, TMHMM

Artemis

• Artemis is a free DNA sequence viewer and annotation tool that allows visualization of sequence features and the results of analyses within the context of the sequence, and its six-frame translation.

• http://www.sanger.ac.uk/Software/Artemis/

atcttttacttttttcatcatctatacaaaaaatcatagaatattcatcatgttgtttaaaataatgtattccattatgaactttattacaaccctcgtttttaattaattcacattttatatctttaagtataatatcatttaacattatgttatcttcctcagtgtttttcattattatttgcatgtacagtttatcatttttatgtaccaaactatatcttatattaaatggatctctacttataaagttaaaatctttttttaattttttcttttcacttccaattttatattccgcagtacatcgaattctaaaaaaaaaaataaataatatataatatataataaataatatataataaataatatataatatataataaataatatataatatataatatataataaataatatataatatataatatataataaataatatataataaataatatataatatataatatataatactttggaaagattatttatatgaatatatacacctttaataggatacacacatcatatttatatatatacatataaatattccataaatatttatacaacctcaaataaaataaacatacatatatatatataaatatatacatatatgtatcattacgtaaaaacatcaaagaaatatactggaaaacatgtcacaaaactaaaaaaggtattaggagatatatttactgattcctcatttttataaatgttaaaattattatccctagtccaaatatccacatttattaaattcacttgaatattgttttttaaattgctagatatattaatttgagatttaaaattctgacctatataaacctttcgagaatttataggtagacttaaacttatttcatttgataaactaatattatcatttatgtccttatcaaaatttattttctccatttcagttattttaaacatattccaaatattgttattaaacaagggcggacttaaacgaagtaattcaatcttaactccctccttcacttcactcattttatatattccttaatttttactatgtttattaaattaacatatatataaacaaatatgtcactaataatatatatatatatatatatatatatatatattataaatgttttactctattttcacatcttgtccttttttttttaaaaatcccaattcttattcattaaataataatgtattttttttttttttttttttttttattaattattatgttactgttttattatatacactcttaatcatatatatatatttatatatatatatatatatatatatatattattcccttttcatgttttaaacaagaaaaaaaactaaaaaaaaaaaaaataataaaatatatttttataacatatgtattattaaaatgtatatataaaaatatatattccatttattattatttttttatatacattgttataagagtatcttctcccttctggtttatattactaccatttcactttgaacttttcataaaaattaatagaatatcaaatatgtataatatataacaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaatatatatatatatatatatacatataatatatatttcatctaatcatttaaaattattattatatattttttaaaaaatatatttatgataacataaaaagaatttaattttaattaaatatatataattacatacatctaatattattatatatatataataagttttccaaatagaatacttatatattatatatatatatatatatatatatattcttccataaaaagaataaaataaaataaaaacaccttaaaagtatttgtaaaaaattccccacattgaatatatagttgtatttataaaattaaagaaaaagcataaagttaccatttaatagtggagattagtaacattttcttcattatcaaaaatatttatttcctaattttttttttttg
taaaatatatttaaaaatgtaatagattatgtattaaataatataaatatagcaaaatgttcaattttagaaatttgcctctttttgacaaggataattcaaaagatacaggtaaaaaaaaaaaaataaagtaaaacaaaacaaaacaaaaaacaaaaaaaaaaaaaaaaaaaaaaatgacatgttataatataatataataaataaaaattatgtaatatatcataatcgaagaaacatatatgaaaccaaaaagaaacagatcttgatttattaatacatatataactaacattcatatctttatttttgtagatgatataaaaaattttataaactcttatgaagggatatatttttcatcatccaataaatttataaatgtatttctagacaaaattctgatcattgatccgtcttccttaaatgttattacaataaatacagatctgtatgtagttgatttcctttttaatgagaaaaataagaatcttattgttttagggtaatgaaatatatatagatttatatttttatttatttattatatattattttttaatttttcttttatatatttattttatttagtgtataaaatgatatcctttatatttatatttacatgggatattcaaataataacaaaaatgagtatacacatatatatatatatatatatatatatgtatattttttttttttttttatgttcctataggaaagggaagaattcactgatttgtagtgtttacaatattagggaatgcaactttacacttttgaaaaaaattcagttaagcaaaaatattaataacattaaaaagacactgatagcaaaatgtaatgaatatataataacattagaaaataagaaaattactttttatttcttaaataaagattatagtataaatcaaagtgaattaatagaagacggaaaagaacttattgaaaatatctatttgtcaaaaaatcatatcttgttagtaataaaaaattcatatgtatatatataccaattagatattaaaaattcccatattagttatacacttattgatagtttcaatttaaatttatcctacctcagagaatctataaataataaaaaaaagcatataaataaaataaatgatgtatcaaataatgacccaaaaaaggataataatgaaaaaaatacttcatctaataatataacacataacaattataatgacatatcaaataataataataataataataatattaatggggtgaaagaccatataaataataacactctggaaaataatgatgaaccaatcttatctatatataatgaagatcttaatgttttatatatatgccaaaatatgtataacgtcctttttgttttgaatttaaataacctaagt

GC content

Forward translations

Reverse Translations

DNA and amino acids

DNA in Artemis

Black bar = stop codon
