Page 1: Bioinformatics: Advancing Biotechnology through …nopr.niscair.res.in/bitstream/123456789/11294/1/IJBT 2(2...Bioinformatics: Advancing Biotechnology through Information Technology

Indian Journal of Biotechnology
Vol 2, April 2003, pp 159-174

Bioinformatics: Advancing Biotechnology through Information Technology
Part II: In silico Gene Prediction

Sudeshna Adak*
GE Global Research, John F Welch Technology Center, EPIP Phase 2, Hoodi Village, Whitefield Road, Bangalore 560 066, India

Received 8 June 2002; accepted 10 September 2002

The paper reviews computational tools and algorithms for in silico genome annotation and, in particular, the Bioinformatics resources available for in silico gene prediction. Gene prediction requires a combination of algorithms with different types of biological databases, and the author intends to provide the biologist or biotechnology researcher an insight into the use of these methods. The explosion of information seen in molecular biology has created a veritable maze, through which careful navigation is required for research and innovation in biotechnology. Online databases have given scientists and researchers across the world access to unimaginable volumes of biologically relevant data. Bioinformatics, a truly multidisciplinary science, aims to bring the benefits of computer technologies to bear in understanding the biology of life itself. In this second paper of a three part series, the tremendous value of gene prediction algorithms is discussed, in the context of transforming raw biological sequence data into biologically useful knowledge. It is the first step in harnessing the biological data arising out of the genome projects into a new and improved understanding of biology of organisms.

Keywords: homology, ab initio, synteny, EST, genome annotation

Introduction

Gene prediction is best described (Semple, 2000) as the "beginning of the end" - the end of a genome sequencing project being marked by the beginning of efforts to annotate the genome. In silico genome annotation has developed as part of, or perhaps because of, the Human Genome Project - to develop the necessary computational tools of algorithms, software and databases to interpret and disseminate the vast quantities of information arising out of the research on the human genome (Pearson & Soli, 1999). Historically, gene prediction methods have been employed even with relatively short DNA sequences from partially complete genomes. However, the speed at which complete genomes are becoming available is increasing and today, gene prediction is shifting from primarily DNA based to genome based, with new methodologies and genome comparisons being part of genome annotation.

Genome annotation is the transformation of raw genomic data into organized biological knowledge and leads to new and improved understanding of genome organization and regulation. For the computational biologist, genome annotation refers to the process of assigning "features" or "labels" to raw DNA sequences by integrating information from the sequence with computational tools, auxiliary data, and biological knowledge. The first step of genome annotation is usually to assign labels of gene and gene structure information, and in silico gene prediction refers to the computational tools and algorithms that are useful in this step of genome annotation. Some examples of genome annotation labels not included in gene prediction are: (1) additional structural features such as CpG islands, methylation sites, phosphorylation sites, etc.; (2) control elements such as promoters, enhancers, splice sites, etc.; (3) functional elements such as polymorphism or mutation regions, allelic variants, protein function, etc. Gene prediction is still the most important and widely used of all genome annotations and hence the focus of this review.

*Tel: 080-28412050 ext 2697; Fax: 080-28412111; Email: [email protected]

The biotechnology issues involved in sequencing complete genomes have essentially been solved. Today, there already exist sufficient solutions for ascertaining sequencing error rates and for assembling sequence data. Currently, however, standards or rules for the annotation process are still an outstanding problem. How should the genomes be annotated, what should be annotated, which computational tools are
most effective, how reliable are these annotations, how organism-specific do the tools have to be and ultimately how should the computational results be presented to the community? All these questions are unsolved. In silico gene prediction has evolved in the last 20 years from simple methods based on coding region statistics (Shulman et al, 1981) of the early 80s to sophisticated methodologies that can incorporate biological constraints and paradigms into computational algorithms. This paper is intended as a review of these developments, to give the biologist a perspective on the use of these algorithms, and to help answer the following questions for the practicing biotechnologist:

• When does one need to use a gene prediction algorithm?

• What algorithms does one use and in what order? What customizations are necessary to annotate the genome of a particular species?

• How reliable are the algorithms?

• How can one verify and validate the results?

• What are the available resources for in silico gene prediction?

Gene Prediction Algorithms

Homology Based Gene Prediction

Homology based gene prediction falls into two categories: (i) gene prediction through detection of homology to known proteins, (ii) gene prediction through comparison with expressed sequence tag (EST) databases. Homology detection and sequence alignment tools and resources were discussed previously in the first of this three part review (Adak & Srivastava, 2001). Gene prediction through homology to known proteins uses sequence alignment of the translated DNA sequence (using six possible reading frames) with databases of known proteins. Most users prefer BLASTx or FASTx, which are the adaptations of BLAST (Altschul et al, 1990) (http://www.ncbi.nlm.nih.gov/BLAST) or FASTA (Pearson & Lipman, 1988) (http://www.ebi.ac.uk/fasta33) for translation and alignment of translated DNA sequences to non-redundant protein databases such as SWISS-PROT (http://www.expasy.ch/sprot) and the Protein Information Resource database (PIR, http://pir.georgetown.edu). It is preferred to consider six reading frame translations of the query sequence for alignment rather than aligning the nucleotide query sequence to the nucleotide sequences of proteins - this is because while similarity is as high as 85% in the exons, it can be as low as 15% in the intron regions, which may result in a high proportion of false negatives.

Similarity to expressed sequence tags (ESTs), found in the dbEST database of the National Center for Biotechnology Information (NCBI), Bethesda, MD, USA, has become increasingly popular as part of homology based gene prediction algorithms (Bailey et al, 1998). ESTs are derived (in theory) from the 3' end of poly A+ transcripts, and often extend far enough towards the 5' end to reach the coding sequence and thus overlap with predicted exons. This method does not work perfectly, particularly for genes that are expressed at low levels and with larger transcripts (>2 kb). However, for other genes, comparison to EST databases such as dbEST from NCBI is popular using BLASTN or tBLASTx and tFASTx. The algorithms tBLASTx and tFASTx are variants of BLAST and FASTA designed to align translated nucleotide sequences to translated nucleotide sequence databases. Six reading frame translations of the query nucleotide sequence are compared to the translated EST sequences. The reason that the sequences are translated prior to comparison (rather than a nucleotide-nucleotide comparison) is that ESTs, being cDNA, do not contain introns, and thus introns from the query nucleotide sequence need to be spliced out prior to alignment.
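The six reading frame translation that BLASTx-style searches start from can be sketched in a few lines. This is a minimal illustration using the standard genetic code, not the code of any of the tools named above; the function names are chosen for the example, and unknown or ambiguous triplets are rendered as "X" by assumption.

```python
# Standard genetic code, built from the conventional T, C, A, G ordering.
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    a + b + c: AMINO_ACIDS[16 * i + 4 * j + k]
    for i, a in enumerate(BASES)
    for j, b in enumerate(BASES)
    for k, c in enumerate(BASES)
}
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(dna):
    """Return the reverse complement of a DNA string."""
    return dna.upper().translate(COMPLEMENT)[::-1]


def translate(dna):
    """Translate complete triplets; '*' marks a stop, 'X' an unknown triplet."""
    dna = dna.upper()
    return "".join(
        CODON_TABLE.get(dna[i:i + 3], "X") for i in range(0, len(dna) - 2, 3)
    )


def six_frame_translations(dna):
    """Map frame labels (+1, +2, +3, -1, -2, -3) to protein strings."""
    frames = {}
    for strand, seq in (("+", dna), ("-", reverse_complement(dna))):
        for offset in range(3):
            frames[f"{strand}{offset + 1}"] = translate(seq[offset:])
    return frames
```

Each of the six protein strings would then be aligned against the protein database, and the frame giving the best alignment indicates the likely coding strand and phase.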

Homology based gene prediction is traditionally the first and the most commonly used tool to discover new genes, but it has never been sufficient to use homology alone for genome annotation, for two reasons:

• Orthologs versus paralogs

The underlying assumption of gene recognition through homology is that sequence similarity implies functional similarity, i.e. homologous sequences are also orthologous (similar sequences from different species encode proteins of the same function), which is not necessarily true.

• High accuracy and low coverage of homology based methods

In case of microbial genomes, it has been found that as much as 40-50% of genes may code for proteins of unknown function, and 20-30% of genes may encode unknown proteins that are unique to the species. The underlying assumption is that annotated genomes of "similar" species are available in the database, which is often not the case.

Fig. 1-Alternate exon assemblies (panel labels: exons from two reading frames; exon assemblies not considered during sequence alignment to protein database).

In addition to the above drawbacks of homology based gene prediction, another drawback, which is more a result of the limited scope of sequence alignment tools, is that traditional sequence alignments consider six possible reading frames and align the resulting six possible translated sequences to the known protein database: they do not consider the possibility that the collection of exons for the protein may actually come from different reading frames (as exemplified by the exon assemblies shown in Fig. 1). Procrustes (Gelfand et al, 1996) is a method that considers all possible exon assemblies from the six reading frames, and in aligning to target proteins also determines the best exon assembly. Given a target protein sequence to which the query sequence is being compared, the exon assembly problem can be described as follows: Find the set of blocks in the genomic query sequence whose concatenation (splicing) best fits the target sequence. A naive two-stage approach to the spliced alignment problem, consisting of detecting all relatively high similarities between each block and the target sequence followed by the construction of an optimal exon assembly, was proposed in the early 80s (Wilbur & Lipman, 1983). The extremely high computational time and space requirements of this two-stage approach made it untenable for use in gene prediction. Procrustes was a spliced alignment algorithm that combined the splicing search with the sequence alignment in a single step using a dynamic programming formulation. This allowed the Procrustes methodology to avoid the shortcomings of the two-stage approach in a space-and-time efficient algorithm. Procrustes, at the time that the paper was published, was consistently demonstrating 87% accuracy (correctly assembled 87% of human genes). However, since 1996, ab initio methods have been developed that perform better than Procrustes. It is nevertheless an important example of how bioinformatics is evolving to incorporate biological mechanisms into computational algorithms.
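The second stage of the naive two-stage approach, building an assembly from already scored candidate blocks, can be illustrated with a small dynamic program. This is a deliberately simplified sketch and not Procrustes itself: it assumes each candidate block arrives as a hypothetical (start, end, score) tuple whose score summarizes its similarity to the target, and it simply selects the highest scoring chain of non-overlapping blocks.

```python
from bisect import bisect_right


def best_exon_chain(blocks):
    """Pick the highest scoring chain of non-overlapping blocks.

    blocks: list of (start, end, score) with integer, inclusive coordinates.
    Returns (total_score, chain) where chain is ordered along the sequence.
    """
    blocks = sorted(blocks, key=lambda b: b[1])
    ends = [b[1] for b in blocks]
    # best[k] is the optimum using only the first k blocks (sorted by end).
    best = [(0.0, [])]
    for k, (start, end, score) in enumerate(blocks):
        # Number of earlier blocks that end strictly before this one starts.
        j = bisect_right(ends, start - 1, 0, k)
        take_total = best[j][0] + score
        if take_total > best[k][0]:
            best.append((take_total, best[j][1] + [(start, end, score)]))
        else:
            best.append(best[k])
    return best[-1]


# Toy candidate blocks, invented for the example.
CANDIDATES = [(1, 10, 5.0), (5, 20, 8.0), (12, 30, 6.0), (25, 40, 4.0)]
```

With the toy candidates above, the chain (5, 20) followed by (25, 40) wins with a total of 12.0. A spliced alignment method differs in that the block scores are not fixed in advance but are computed jointly with the chaining.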

OTTO, the gene prediction method used by Celera, combines homology to known human genes with homology to EST and protein databases, and actually outperforms most ab initio algorithms. Also popular is EXOFISH (Exon finding by sequence homology), which is based on sequence alignment searches to identify genes in the human genome using the genome of another vertebrate, Tetraodon nigroviridis, a type of pufferfish. The advantage of the pufferfish is in the structure of its genome, which is eight times more compact than the human genome (Crollius et al, 2000). Bioinformatics algorithms are also evolving to keep pace with the requirements of the experimentalist and the data available - as more complete genome sequences become available, the next generation of sequence alignment methods for gene prediction are syntenic alignments, discussed in a later section on syntenic gene prediction.

Ab Initio Gene Prediction

Ab initio gene prediction encompasses the class of "statistical learning" algorithms that are used for in silico gene recognition. These algorithms have evolved from the simple codon usage based methods of the early 80s (Staden & McLachlan, 1982) to sophisticated algorithms based on neural networks and hidden Markov models (HMMs). The basic paradigm of the ab initio algorithms can be described in general as follows:

A DNA sequence is represented as $S = s_1, s_2, \ldots, s_L$, where $s_i$ is a character from the set of bases $\{A, C, T, G\}$. The objective of a gene prediction algorithm is to determine a "parse" of the sequence, $P(S) = (d_1, v_1, d_2, v_2, \ldots, d_m, v_m)$, where $1 \le d_1 < d_2 < \ldots < d_m = L$, $d_i$ is the end position of the $i$th feature and $v_i$ is the label of the $i$th feature. In ab initio gene prediction algorithms, it is assumed that DNA sequences follow a probability or statistical or mathematical "model", $m(S \mid P(S), Q)$, where $Q$ is a set of ancillary "parameters" associated with the mathematical model. This means that given the parse $P(S)$ and the values of $Q$ for the query sequence, the model can determine the "likelihood" of observing the sequence of characters seen in the query sequence.

Fig. 2-Ab initio gene prediction (flowchart labels: training data, annotated DNA sequences; control parameter selection; model is trained in advance; output, gene predictions).

The "learning" or "training" (Fig. 2) comes from using annotated sequences to learn how to associate different values of $Q$ to different gene features. Then, given a query sequence $S$, the algorithm determines the optimal parse $P^*(S)$, using the mathematical model $m(S \mid P(S), Q)$. The class of ab initio gene prediction algorithms uses two categories of information in the learning step: (i) algorithms that rely on "content sensors", in which the parameters $Q$ are learnt solely from the query DNA sequence; (ii) algorithms that rely on "signal sensors", in which the parameters $Q$ are learnt using auxiliary biological data and information, such as the location of splice sites. The best ab initio algorithms in use today are typically of a more hybrid nature, using both content and signal information to optimally determine gene locations, such as some of the neural network based methods.

Ab Initio Gene Prediction Based on Oligonucleotide Usage

This includes the class of methods that determine an optimal parse based on usage statistics. The mathematical models used are of the form:

$$\mu(S \mid i, Q) = \mu(C_1^i \mid Q)\,\mu(C_2^i \mid Q) \cdots \mu(C_k^i \mid Q),$$

where for reading frame $i$,

$$\mu(C_j^i \mid Q) = \frac{\text{Probability of the triplet } C_j^i \text{ as codon}}{\text{Probability of the triplet } C_j^i \text{ as non-coding}}.$$

Here, the parameters $Q$ are based on codon usage statistics for the particular species under consideration. The various methods proposed in this class of algorithms are based on different paradigms for calculation of the required probabilities in the above equation. Table 1 shows some of the different methods that have been proposed in the literature.
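As a rough illustration of the product formula above, the sketch below scores each of the three forward reading frames by summing the logarithm of the coding to non-coding probability ratio over successive triplets. The probability tables are hypothetical toy values chosen only for the example; real values would come from codon usage statistics of the species under study, and the fallback probability used for triplets missing from the tables is likewise an assumption.

```python
import math


def frame_log_score(dna, frame, p_coding, p_noncoding, floor=1e-3):
    """Sum of log(coding / non-coding) ratios over triplets in `frame` (0, 1 or 2)."""
    score = 0.0
    for pos in range(frame, len(dna) - 2, 3):
        triplet = dna[pos:pos + 3]
        ratio = p_coding.get(triplet, floor) / p_noncoding.get(triplet, floor)
        score += math.log(ratio)
    return score


# Hypothetical toy tables, not real codon usage statistics.
P_CODING = {"ATG": 0.05, "GCC": 0.04, "AAG": 0.04, "TGG": 0.01, "CCA": 0.01}
P_NONCODING = {"ATG": 0.015, "GCC": 0.015, "AAG": 0.015, "TGG": 0.015, "CCA": 0.015}


def best_frame(dna):
    """Return (index of the highest scoring frame, list of the three scores)."""
    scores = [frame_log_score(dna, f, P_CODING, P_NONCODING) for f in range(3)]
    return max(range(3), key=scores.__getitem__), scores
```

Working in log space turns the product of ratios into a sum, which avoids numerical underflow on long sequences; a positive total favours the coding hypothesis for that frame.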

It has been found that higher order oligonucleotide usage (for example, hexamer usage; Claverie et al, 1990) provides increasing power to discriminate between coding and non-coding regions. This can be clearly seen in Fig. 3, where the difference between exons and introns is larger for hexamer usage than for codon usage.

Another approach has been to use the periodic appearance of certain bases in protein coding regions - for example, thymine (T) occurs preferentially in the second position of codons. The observed 3-periodicity in DNA sequences led to the Fourier spectrum based method of searching for coding regions (Ramachandran & Ramakrishna, 1999).
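The 3-periodicity idea can be sketched with a plain discrete Fourier transform. The version below assumes the common indicator sequence formulation (one 0/1 sequence per base) and compares the power at frequency 1/3 with the average power over all non-zero frequencies; it is an illustration of the principle, not the cited authors' exact procedure, and the function names are chosen for the example.

```python
import cmath


def power_at_frequency(dna, freq):
    """Total spectral power of the four base indicator sequences at `freq` (cycles per base)."""
    total = 0.0
    for base in "ACGT":
        coeff = sum(
            cmath.exp(-2j * cmath.pi * freq * n)
            for n, s in enumerate(dna.upper())
            if s == base
        )
        total += abs(coeff) ** 2
    return total


def periodicity_ratio(dna):
    """Power at period 3 divided by the mean power over all non-zero DFT frequencies."""
    n = len(dna)
    mean_power = sum(power_at_frequency(dna, k / n) for k in range(1, n)) / (n - 1)
    return power_at_frequency(dna, 1 / 3) / mean_power
```

A ratio well above 1 in a sliding window suggests a coding region; a window with no period-3 component gives a ratio near zero. The direct sum costs quadratic time per window, which is acceptable for a sketch but would normally be replaced by a fast Fourier transform.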

Table 1-Statistical features used in ab initio gene prediction

| Method | Probability formula |
|---|---|
| Codon usage [Staden & McLachlan, 1982] | F(Cd) = frequency of use of codon Cd in genes |
| Codon preference [Gribskov et al, 1984] | F(Cd)/ΣF(Cd') = codon usage/total codon usage of synonymous codons |
| Amino acid usage [Fickett & Tung, 1992] | ΣF(Cd') = total codon usage for synonymous codons |
| Codon prototype [Fickett & Tung, 1992] | F(Cd) = F(Cd[1],1)F(Cd[2],2)F(Cd[3],3), where F(b,r) = frequency of usage of nucleotide b in position r |
| Position asymmetry [Fickett & Tung, 1992] | Asymmetry(Cd) = Σ[F(Cd[r],r) - F(Cd[r])]^2 |
| Hexamer frequency [Claverie et al, 1990] | F(hexamer) = frequency of hexamer usage in codons |
| Dicodon bias [Badger & Olsen, 1999] | Differences in hexamer usage between coding and non-coding regions |

Fig. 3-Likelihood scores for codon and hexamer usage, differentiating between exon and intron regions.

Ab Initio Gene Prediction Based on Markov Models

It has been found that biological sequences can be mathematically modeled as Markov models: the model $m(S \mid P(S), Q)$ is that of a stochastic process in which the probability for a given nucleotide to occur at any given position $r$ depends on the nucleotides occupying the $p$ previous positions. Such a representation is called a $p$th-order Markov model, and the parameters $Q$ are the "transition probabilities" associated with the Markov model, which depend on $P(S)$. During the training phase, the parameters $Q$ are determined for coding and non-coding regions. Given a query sequence, the optimal parse is obtained by simply determining whether a given region is more likely to be generated by the coding model or by the non-coding model. For example, the 5th order Markov model (which corresponds to exploiting hexamer based statistics for gene prediction) was used in the algorithm GeneMark (Borodovsky & McIninch, 1993).
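The comparison of a coding model against a non-coding model can be sketched as follows. This is a minimal illustration and not GeneMark: the training sequences are toy data invented for the example, the order is kept at 2 rather than 5 so that the toy data suffice, and add-one smoothing is an assumption made so that unseen contexts do not produce zero probabilities.

```python
import math
from collections import defaultdict


def train_markov(sequences, order):
    """Count transitions (context of `order` bases -> next base)."""
    counts = defaultdict(lambda: defaultdict(int))
    for seq in sequences:
        for i in range(order, len(seq)):
            counts[seq[i - order:i]][seq[i]] += 1
    return counts


def log_likelihood(seq, counts, order):
    """Log-probability of seq after its first `order` bases, with add-one smoothing."""
    total = 0.0
    for i in range(order, len(seq)):
        ctx = counts.get(seq[i - order:i], {})
        total += math.log((ctx.get(seq[i], 0) + 1) / (sum(ctx.values()) + 4))
    return total


def coding_log_odds(seq, coding, noncoding, order):
    """Positive when the coding model explains seq better than the non-coding model."""
    return log_likelihood(seq, coding, order) - log_likelihood(seq, noncoding, order)


# Toy training data, invented for the example.
ORDER = 2
CODING_MODEL = train_markov(["ATGGCCGCCGCCGCCGCC"], ORDER)
NONCODING_MODEL = train_markov(["ATATATATATATATATAT"], ORDER)
```

In practice the log-odds would be evaluated in a sliding window, and for coding regions a separate model is usually kept for each codon position; both refinements are omitted here.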

However, it was soon realized that the Markov model structure was too rigid and did not allow for incorporation of biological signal information such as splice sites, internal exons, donor/acceptor signals, etc. In order to overcome some of these drawbacks, the Markov model gene prediction algorithms evolved into more flexible hidden Markov model (HMM) algorithms and generalized hidden Markov model (GHMM) algorithms. For example, GeneMark evolved into GeneMark.hmm (Lukashin & Borodovsky, 1998) and the authors showed that by embedding the original GeneMark model into naturally derived hidden Markov models, the resulting algorithm GeneMark.hmm was significantly more accurate than the original GeneMark.

Fig. 4 shows an example of a simple hidden Markov gene model. In HMMs, the sequence is assumed to follow a probability model $m(S \mid P(S), Q)$, for a given "state" $P(S)$ of the sequence $S$. In addition, it is assumed that these states are "hidden" or unobserved and that the probability of being in one state depends on the previous states (Markov). In the example of Fig. 4, there are four states: coding region, intergenic region, start codon (ATG) and stop codon. The probabilities of observing each of the bases A, C, T, G in the coding and intergenic regions are shown and define the mathematical model for the sequence within these two states. The probability of transition from one state to another is also shown. Thus, for example, given that the sequence is currently in the coding state, there is a probability of 0.1 of "transitioning" into a stop codon and a probability of 0.9 of remaining in the same state. Imagine that at a particular position of the coding region of the sequence, an unfair coin is tossed for which the probability of heads is 0.9. If the coin comes up heads, the next base in the sequence will also be coding and it will be A (probability = 0.9), C (probability = 0.03), G (probability = 0.04), T (probability = 0.03). If the coin shows a tail, the next 3 bases in the sequence will be a stop codon.

Fig. 4-Example of a simple HMM model for gene prediction (emission probabilities shown: coding region A 0.9, C 0.03, G 0.04, T 0.03; intergenic region A 0.25, C 0.25, G 0.25, T 0.25).
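To make the idea of hidden states concrete, the sketch below runs the Viterbi algorithm on a reduced two-state version of the model (coding and intergenic only), using the emission probabilities quoted for Fig. 4. The start and stop codon states are omitted, and the transition and initial probabilities are assumptions chosen for illustration, since only the 0.9/0.1 split out of the coding state is given in the text.

```python
import math

STATES = ("coding", "intergenic")
EMIT = {
    "coding": {"A": 0.9, "C": 0.03, "G": 0.04, "T": 0.03},
    "intergenic": {"A": 0.25, "C": 0.25, "G": 0.25, "T": 0.25},
}
# Assumed transitions: leaving the coding state (probability 0.1) goes straight
# to intergenic here, and the intergenic row is chosen symmetrically.
TRANS = {
    "coding": {"coding": 0.9, "intergenic": 0.1},
    "intergenic": {"coding": 0.1, "intergenic": 0.9},
}
START = {"coding": 0.5, "intergenic": 0.5}  # assumed initial distribution


def viterbi(seq):
    """Most probable state path for a non-empty seq, computed in log space."""
    score = {s: math.log(START[s]) + math.log(EMIT[s][seq[0]]) for s in STATES}
    back = []
    for base in seq[1:]:
        prev, score, pointers = score, {}, {}
        for s in STATES:
            best = max(STATES, key=lambda p: prev[p] + math.log(TRANS[p][s]))
            pointers[s] = best
            score[s] = prev[best] + math.log(TRANS[best][s]) + math.log(EMIT[s][base])
        back.append(pointers)
    # Trace back from the best final state.
    state = max(STATES, key=score.get)
    path = [state]
    for pointers in reversed(back):
        state = pointers[state]
        path.append(state)
    return path[::-1]
```

On a sequence that starts with a run of A and then switches to a mixed composition, the decoded path switches from coding to intergenic at the composition change, which is the "optimal parse" in the sense used above for this toy model.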

Generalized hidden Markov models (GHMM) are the same as hidden Markov models (HMM), except that instead of assuming a simple probability model for the sequence within a given state, the model can be a complex mathematical model suitable to the current state (for example, using position weight matrices to model the sequence while it is in an "exon state").

Table 2 shows the various Markov model, HMM, and GHMM algorithms in use today. Accuracy and validity of these algorithms along with other gene prediction methods will be discussed later in this paper. The use of HMM for in silico gene prediction was introduced in ECOPARSE (Krogh, 1994a), a gene finding algorithm for Escherichia coli, and was later used to decode the Mycobacterium tuberculosis genome (Cole et al, 1998). GLIMMER (Salzberg et al, 1998), which used "interpolated Markov models", was found to perform better than fixed order Markov models and comparably to the best HMM based method, and is among the most commonly used methods for bacterial genomes today. The use of GHMM was introduced in the algorithm GENIE (Kulp et al, 1996). More recently, GENSCAN (Burge & Karlin, 1997) used a GHMM architecture for human genes. In contrast with previous algorithms, GENSCAN optimized the lower level modules performing recognition of the basic signals (e.g., transcriptional, translational and splicing signals), and incorporated the influence of (C+G) content.

The main disadvantages of Markov model based algorithms have been their tendency to over-predict the number of exons, and to over-predict the exonic sequence (incorrect exon boundaries). However, GENSCAN is one of the best algorithms for gene prediction in vertebrate genomes today - an example of how computational algorithms for biology can be significantly improved by incorporation of different types of biological signals for better performance.

Table 2-Markov model based ab initio gene prediction algorithms

| Algorithm | Description | Genomes |
|---|---|---|
| ECOPARSE [Krogh et al, 1994] | HMM for E. coli | Bacterial |
| GeneMark [Borodovsky & McIninch, 1993] | Markov models | Bacterial |
| GeneMark.HMM [Lukashin & Borodovsky, 1998] | HMM with nine hidden states | Bacterial |
| GLIMMER [Salzberg et al, 1998] | Interpolated Markov model, including Markov models up to order 8 | Bacterial |
| GENEWISE [Birney & Durbin, 1997] | Profile hidden Markov method that allows comparison to protein sequences and "profiles" of protein families directly | Human |
| HMMGene [Krogh, 1998] | HMM based method, using a 4th order Markov model. If the sequence analyzed already has some subregions identified (hits to EST or protein database, repeated elements), those regions can be locked as coding or noncoding and submitted to HMMgene. | Human, other |
| VEIL [Henderson et al, 1997] | HMM with 241 hidden states | Human, other |
| GENIE [Kulp et al, 1996] | GHMM with codon frequency models for the "exon state" and neural network models for the "splice site state" | Human, other |
| GENSCAN [Burge & Karlin, 1997] | GHMM with 27 states, using a 5th order Markov model. The signals for exons, introns, their splice sites and promoter regions are modeled by weight matrices, weight arrays, and maximal dependence decomposition. | Human, other |

Ab Initio Gene Prediction Based on Statistical Pattern Recognition and Classification

Discriminant analysis is a standard method of statistical pattern recognition, pioneered by the great statistical geneticist R A Fisher more than 60 years ago. It is a method by which different features can be combined to "discriminate" between two or more functional classes. For in silico gene prediction, the goal is to discriminate between coding and non-coding regions. In this class of methods, the mathematical model $m(S \mid P(S), Q)$ is based on a projection of the sequence of length $L$ into a lower dimensional space of "sequence features", $f(S)$. In most discriminant analysis methods, it is assumed that $f(S)$ has a Gaussian probability distribution whose parameters $Q$ depend on $P(S)$.

The algorithm HEXON (Solovyev et al, 1994) predicts internal exons by applying Fisher's linear discriminant analysis (LDA) to open reading frames flanked by GT and AG base pairs. The features used in the linear discriminant analysis function are a combination of characteristics describing donor and acceptor splice sites, 5'- and 3'-intron regions and also the coding region for each open reading frame flanked by GT and AG base pairs. The current version of HEXON predicts only internal exons with GT and AG conserved base pairs for donor and acceptor splice sites, respectively. FEXH, an extension of HEXON, also predicts potential 5'- and 3'-exons by corresponding discriminant functions on the left side of the first internal exon and on the right side of the last internal exon, respectively. An extension of Fisher's LDA is quadratic discriminant analysis (QDA) (McLachlan, 1992), and its use in exon prediction was demonstrated in the algorithm MZEF (Michael Zhang's Exon Finder) (Zhang, 1997). MZEF used nine features (log10 of exon length in bases, hexamer usage and hexamer preference statistics, splice site associated statistics) in the QDA.
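A minimal sketch of Fisher's linear discriminant for two classes and two features follows. The feature vectors (for instance a hexamer score and a splice site score per candidate region) are hypothetical toy numbers, and placing the decision threshold at the midpoint of the two class means is a simplifying assumption; the sketch illustrates the technique rather than HEXON's actual discriminant function.

```python
def mean(vectors):
    """Component-wise mean of a list of 2-dimensional feature vectors."""
    n = len(vectors)
    return [sum(v[d] for v in vectors) / n for d in range(2)]


def scatter(vectors, mu):
    """2x2 within-class scatter matrix around the class mean mu."""
    s = [[0.0, 0.0], [0.0, 0.0]]
    for v in vectors:
        diff = [v[0] - mu[0], v[1] - mu[1]]
        for r in range(2):
            for c in range(2):
                s[r][c] += diff[r] * diff[c]
    return s


def fisher_lda(coding, noncoding):
    """Return (w, threshold); x is called coding when w.x > threshold."""
    mu_c, mu_n = mean(coding), mean(noncoding)
    sc, sn = scatter(coding, mu_c), scatter(noncoding, mu_n)
    sw = [[sc[r][c] + sn[r][c] for c in range(2)] for r in range(2)]
    det = sw[0][0] * sw[1][1] - sw[0][1] * sw[1][0]
    inv = [[sw[1][1] / det, -sw[0][1] / det], [-sw[1][0] / det, sw[0][0] / det]]
    d = [mu_c[0] - mu_n[0], mu_c[1] - mu_n[1]]
    # Fisher direction: inverse pooled scatter times the difference of means.
    w = [inv[0][0] * d[0] + inv[0][1] * d[1], inv[1][0] * d[0] + inv[1][1] * d[1]]
    midpoint = [(mu_c[0] + mu_n[0]) / 2, (mu_c[1] + mu_n[1]) / 2]
    return w, w[0] * midpoint[0] + w[1] * midpoint[1]


def is_coding(x, w, threshold):
    return w[0] * x[0] + w[1] * x[1] > threshold


# Hypothetical toy feature vectors, invented for the example.
CODING = [(2.0, 1.5), (2.5, 2.0), (3.0, 1.0), (2.2, 2.4)]
NONCODING = [(0.5, 0.2), (1.0, -0.3), (0.2, 0.6), (0.8, 0.1)]
```

The discriminant collapses the feature vector to a single score along the direction that best separates the class means relative to the within-class spread; QDA replaces the shared scatter matrix with one per class, which makes the decision boundary quadratic.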

Classification trees (Breiman et al, 1984; Quinlan, 1993), also known as decision tree classifiers, are a well established technique for learning classification rules, arising out of the statistical pattern recognition and machine learning literature. They differ from discriminant analysis in that the sequence feature $f(S)$ is not assumed to have a simple parametric probability distribution, such as a Gaussian distribution. Instead, a non-parametric tree structure is assumed (parameters $Q$) and learnt during the training phase. MORGAN (Salzberg et al, 1997), a gene prediction algorithm based on decision trees, specifically uses OC1 (Murthy et al, 1994), an oblique decision tree methodology. The internal nodes of the trees are based on sequence features, which included in-frame hexamer frequency (Claverie et al, 1990), asymmetry measure (Fickett & Tung, 1992), the start site score as computed by a conditional probability matrix, and the scores given to the donor and acceptor sites by Markov models. It thus combines the ab initio methods based on oligonucleotide usage and those based on Markov models to extract the features that are input into the decision tree. The decision tree automatically determines the features that have most discriminatory power, i.e. can best distinguish between coding and non-coding regions. The success of MORGAN can be attributed to the authors' care in adapting the classification tree methodology to incorporate biological principles. For example, MORGAN abides by the biological rules that (i) the first coding region of a gene begins with a start codon ATG; (ii) a gene has exactly one in-frame stop codon, which appears as the last codon in the gene; (iii) each exon must be in the same reading frame as the previous exon; (iv) each DNA sequence presented for analysis will start and end with a noncoding region and contains a single gene; (v) every intron begins with the dinucleotide GT and ends with the dinucleotide AG.
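Rules (i), (ii) and (v) above are simple enough to check directly on a candidate gene structure. The sketch below is an illustration under stated assumptions, not MORGAN's code: the gene model is passed as a hypothetical list of exon (start, end) coordinates (0-based, end exclusive) on the forward strand, the spliced coding sequence is assumed to consist of complete codons, and rule (iii) is covered only implicitly by reading the spliced exons as one continuous frame.

```python
STOP_CODONS = {"TAA", "TAG", "TGA"}


def obeys_gene_rules(dna, exons):
    """Check rules (i), (ii) and (v) for exons given as sorted (start, end) pairs."""
    coding = "".join(dna[s:e] for s, e in exons)
    # Assumption: a valid candidate has at least a start and a stop codon,
    # and its spliced length is a multiple of three.
    if len(coding) < 6 or len(coding) % 3 != 0:
        return False
    codons = [coding[i:i + 3] for i in range(0, len(coding), 3)]
    # (i) the first coding region begins with the start codon ATG
    if codons[0] != "ATG":
        return False
    # (ii) exactly one in-frame stop codon, and it is the last codon
    if codons[-1] not in STOP_CODONS or any(c in STOP_CODONS for c in codons[:-1]):
        return False
    # (v) every intron begins with GT and ends with AG
    for (_, left_end), (right_start, _) in zip(exons, exons[1:]):
        intron = dna[left_end:right_start]
        if len(intron) < 4 or not (intron.startswith("GT") and intron.endswith("AG")):
            return False
    return True
```

Hard constraints of this kind prune biologically impossible parses before any scoring takes place, so the learned classifier only has to rank candidates that are already structurally valid.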

Ab Initio Gene Prediction Based on Neural Networks

The neural network is also a classification technique, but one that has arisen from the artificial intelligence (AI) community. In this class of methods, the mathematical model m(S | f(S), Q) is based on a projection of the sequence of length L into a lower dimensional space of "sequence features", f(S), as in the case of the statistical classification methods. These features are input into a "neural network" in which the relation of the output classification to the input layer is determined through a series of "hidden layers", the parameters Q being the weights given to each connection in the neural network model. Early work on neural network methods for gene prediction (Lapedes et al, 1990; Brunak et al, 1991) has been surpassed by GRAIL (Uberbacher & Mural, 1991). GRAIL stands for gene recognition and analysis internet link and was unique in that it uses the outputs of different gene feature recognition algorithms (sensor based algorithms, content based algorithms, as well as biological rules) as shown in Fig. 5. Thus, the neural network serves as a "decision fusion method", integrating information from different algorithms. Today, the GRAIL gene recognition algorithm has


Fig. 5-Inputs to the GRAIL neural network. Exon candidate parameters: 6-mer in-frame (isochore), 6-mer in-frame (candidate), Markov, isochore GC composition, exon GC composition, size probability profile, length, donor, acceptor, intron vocabulary (isochore), intron vocabulary (candidate), intron vocabulary 2 (isochore), intron vocabulary 2 (candidate). Output: discrete exon score.

been embedded in GRAILexp, a web-based genome annotation system. The GRAIL gene recognition algorithm, like many other ab initio algorithms, works in four steps:

1. Candidate generation: generate all possible candidate exons (genes) based on simple biological rules;

2. Candidate elimination: heuristic rules based on biology are used to eliminate a large number of candidates;

3. Evaluation of likely candidates: a pre-trained neural network is used in this step to assign scores to the likely candidates; and

4. Gene modeling and post processing: the highest scoring candidate exons are selected, and rules are applied to determine which exons splice into a single gene.
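The four steps above can be sketched as a small pipeline. This is a toy illustration under stated assumptions, not GRAIL itself: candidates are taken to be intervals between an AG and a GT dinucleotide, the elimination rules are a minimum length and the presence of an open frame, the "pre-trained network" is replaced by a single logistic unit with hypothetical weights, and gene modeling is reduced to a greedy choice of non-overlapping candidates.

```python
import math

STOPS = {"TAA", "TAG", "TGA"}

def generate_candidates(seq):
    """Step 1: every interval bounded by an AG (acceptor) and a GT (donor)."""
    acceptors = [i + 2 for i in range(len(seq) - 1) if seq[i:i + 2] == "AG"]
    donors = [j for j in range(len(seq) - 1) if seq[j:j + 2] == "GT"]
    return [(a, d) for a in acceptors for d in donors if d > a]

def has_open_frame(exon):
    """True if at least one of the three frames contains no stop codon."""
    for frame in range(3):
        codons = (exon[i:i + 3] for i in range(frame, len(exon) - 2, 3))
        if not any(c in STOPS for c in codons):
            return True
    return False

def eliminate(seq, candidates, min_len=9):
    """Step 2: heuristic filters (minimum length, an open reading frame)."""
    return [(a, d) for a, d in candidates
            if d - a >= min_len and has_open_frame(seq[a:d])]

def score(seq, candidate, weights=(4.0, 0.05), bias=-3.0):
    """Step 3: stand-in for the pre-trained network, one logistic unit over
    two features (GC fraction, length). The weights are hypothetical."""
    a, d = candidate
    exon = seq[a:d]
    gc = (exon.count("G") + exon.count("C")) / len(exon)
    z = weights[0] * gc + weights[1] * len(exon) + bias
    return 1.0 / (1.0 + math.exp(-z))

def select_exons(seq, candidates):
    """Step 4: greedily keep the best-scoring non-overlapping candidates."""
    chosen = []
    for cand in sorted(candidates, key=lambda c: score(seq, c), reverse=True):
        if all(cand[1] <= c[0] or cand[0] >= c[1] for c in chosen):
            chosen.append(cand)
    return sorted(chosen)

def predict(seq):
    return select_exons(seq, eliminate(seq, generate_candidates(seq)))
```

In a real system the scoring step would take the full set of inputs shown in Fig. 5, and the final step would also enforce reading-frame compatibility between the selected exons.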

Syntenic Gene Prediction

Traditionally, there were only two classes of methods for gene prediction: the homology based methods and the ab initio methods discussed in the previous two sections. The ab initio methods use different types of sequence based information (content, signal, etc.) to learn how to differentiate between coding and noncoding regions, and to identify different regions in a gene. Homology based methods rely on similarity to known or previously identified proteins and ESTs. A limitation of both approaches is that they rely critically on information derived from already known genes, so they tend to be biased towards finding genes that are similar to known genes.

At the same time, the success of biotechnology in genome sequencing has resulted in massive amounts of unannotated genomic sequences being deposited in databases across the world. The availability of this data has given birth to a new way of predicting genes and other functional elements in DNA sequences; this emerging methodology is called syntenic gene prediction. Syntenic gene prediction is gene recognition by using cross-species sequence comparison to identify and align relevant regions and then searching for the presence of exonic features at corresponding positions in both species simultaneously. Fig. 6 shows the genes in human chromosome 1 that are "syntenic" to the mouse genome. The rationale behind syntenic gene prediction is simple: during evolution, functional elements in DNA sequences such as exons tend to be more highly conserved than non-functional regions, so local conservation identified through comparison of genomes of related species usually indicates biological functionality. If a protein encoded by a gene is already known in one organism, it is relatively simple to search genomic DNA from another organism to identify genes encoding a similar protein; this is the basis of homology based gene prediction. If genes have not been completely identified in either of the two species, homology based methods can no longer be used. However, the basic rationale of homology still holds, and that is the premise by which syntenic gene prediction can simultaneously identify genes in two species. Syntenic gene prediction began with two algorithms, ROSETTA (Batzoglou et al, 2000) and the Conserved Exon Method (CEM) (Bafna & Huson, 2000). Both ROSETTA and CEM go through the following steps:

1. Repeats are masked in both genomic sequences;

2. The repeat-masked sequences are then aligned [CEM uses TBLASTX, a variant of BLAST (Altschul et al, 1990), while ROSETTA uses GLASS, a global alignment algorithm specially designed to handle long genomic sequences]. A candidate list of exons is generated from the high scoring pairs (regions with alignment scores above a preset threshold). Regions of weak alignment are thus discarded; and

3. The resulting list of candidate exons is scored (CEM and ROSETTA assign scores differently) and algorithms (CEM uses a graph theoretic algorithm while ROSETTA uses a dynamic programming algorithm) are used to search for the highest scoring path. The highest scoring segments are predicted to be the best conserved regions.
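The search for the highest scoring path in step 3 can be illustrated with a generic chaining recurrence. The sketch below reproduces neither CEM's graph algorithm nor ROSETTA's dynamic program; it assumes that each candidate is an aligned pair of intervals with a precomputed score, and it finds the best-scoring set of candidates that are colinear and non-overlapping in both sequences. The tuple layout and function name are assumptions.

```python
def best_chain(segments):
    """Highest-scoring colinear chain of aligned candidate exons.

    segments -- list of (h_start, h_end, m_start, m_end, score), where the
                first interval lies in one genome and the second in the
                other (0-based, half-open, non-empty).
    Returns (total_score, chain).
    """
    if not segments:
        return 0.0, []
    # A valid predecessor always ends earlier, so it sorts earlier.
    segs = sorted(segments, key=lambda s: (s[1], s[3]))
    best = [s[4] for s in segs]      # best chain score ending at segment i
    prev = [None] * len(segs)        # back-pointers for the traceback
    for i, (hs, _, ms, _, sc) in enumerate(segs):
        for j in range(i):
            _, phe, _, pme, _ = segs[j]
            # Segment j may precede i only if it ends before i starts
            # in both sequences.
            if phe <= hs and pme <= ms and best[j] + sc > best[i]:
                best[i] = best[j] + sc
                prev[i] = j
    i = max(range(len(segs)), key=lambda k: best[k])
    total, chain = best[i], []
    while i is not None:
        chain.append(segs[i])
        i = prev[i]
    return total, chain[::-1]
```

The quadratic loop is adequate for the short candidate lists left after weak alignments are discarded. Longer lists would call for a sparse chaining algorithm.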


In the last few years, the success of syntenic gene prediction in identifying new functional elements in genomic sequences (Ansari-Lari et al, 1998; Jareborg et al, 1999; Batzoglou et al, 2000; Gottgens et al, 2000; Loots et al, 2000; Gottgens et al, 2001; Morgenstern et al, 2002) has led to wide acceptance that comparative sequence analysis is a powerful and universally applicable tool for functional genomics. It has also resulted in an explosion of new Bioinformatics algorithms (Table 3) to meet the challenges of syntenic gene prediction.

Fig. 6-Synteny between mouse genome and human chromosome (Homo sapiens chromosome 1).


Table 3-Resources for in silico gene prediction

Algorithm | Link | Reference

Genome Annotation Systems
Human Genome Browser | http://genome.ucsc.edu/ |
Genome Channel | http://compbio.ornl.gov/channel |
Ensembl | http://www.ensembl.org/ |
NIX | http://www.hgmp.mrc.ac.uk/NIX/ |
RUMMAGE | http://gen100.imb-jena.de/~baumgart/rummage/ | Taudien et al, 2000
GENOTATOR | http://www.fruitfly.org/~nomi/genotator |

Homology based gene prediction methods
PROCRUSTES | http://www-hto.usc.edu/software/procrustes/index.html | Gelfand et al, 1996
AAT | http://genome.cs.mtu.edu/aat.html |
Otto | Celera Genomics |
EXOFISH | | Crollius et al, 2000

Ab Initio gene prediction
FEXH/HEXON | http://dot.imgen.bcm.tmc.edu:9331/gene-finder/gf.html | Solovyev et al, 1994
GENEID | http://apolo.imim.es/geneid.html |
GENEMARK & GENEMARK.hmm | http://genemark.biology.gatech.edu/GeneMark/ | Borodovsky & McIninch, 1993; Lukashin & Borodovsky, 1998
GENEWISE | http://www.sanger.ac.uk/Software/Wise2 | Birney & Durbin, 1997
GENSCAN | http://genes.mit.edu/GENSCAN.html | Burge & Karlin, 1997
GLIMMER | http://www.tigr.org/~salzberg/glimmer.html | Salzberg et al, 1998
GRAIL | http://compbio.ornl.gov/Grail-1.3/ | Uberbacher & Mural, 1991
MORGAN | http://www.tigr.org/~salzberg/morgan.html | Salzberg et al, 1997
MZEF | http://sciclio.cshl.org/genefinder/ | Zhang, 1997
VEIL | http://www.tigr.org/~salzberg/veil.html | Henderson et al, 1997

Syntenic Gene Prediction
AGenDA | http://www.bioinfo.de/isb/2002/02/0018/ | Rinner & Morgenstern, 2002
Alfresco | http://www.sanger.ac.uk/Software/Alfresco | Jareborg & Durbin, 2000
CGAT | http://ftp.inertia.bs.jhmi.edu/roger/CGAT/CGAT.html | Lund et al, 2000
EnteriX | http://ftp.globin.cse.psu.edu/enterix | Florea et al, 2000
GLASS | http://ftp.plover.lcs.mit.edu | Batzoglou et al, 2000
Gibbs | http://www.wadsworth.org/res&res/bioinfo | Wasserman et al, 2000
Intronerator | http://www.cse.ucsc.edu/~kent/intronerator | Kent & Zahler, 2000
LAJ | http://ftp.web.uvic.ca/~bioweb/laj.html | Wilson et al, 2001
MUMmer | http://www.tigr.org/softlab | Delcher et al, 1999
PipMaker | http://ftp.bio.cse.psu.edu | Schwartz et al, 2000
Rosetta | http://ftp.plover.lcs.mit.edu | Batzoglou et al, 2000
SGP | http://ftp.soft.ice.mpg.de/sgp-1 | Wiehe et al, 2000
SynPlot | http://www.sanger.ac.uk/Users/jgrg/SynPlot | Gottgens et al, 2001
VISTA | http://www.gsd.lbl.gov/vista | Dubchak et al, 2000
WABA | http://www.cse.ucsc.edu/~kent/xenoAli/index.html | Kent & Zahler, 2000


Syntenic gene prediction presents its own challenges (Miller, 2001):

• Improved software that aligns two genomic sequences and has a rigorous statistical basis.

• An industrial-strength gene prediction system that effectively combines genomic sequence comparisons, intrinsic sequence properties, and results from searching databases of protein sequences and ESTs.

• Reliable and automatic software for aligning three or more genomic sequences.

• Better methods for displaying and browsing genomic sequence alignments.

• Improved datasets and protocols for evaluating the correctness and performance of genomic alignment software.

The next generation of in silico gene prediction algorithms will have to meet these challenges and combine the previous generation of homology based and ab initio algorithms to create superior hybrid methodologies.

Genome Annotation Systems

As per the latest release of Entrez Genomes, a database of whole genomes from NCBI, there are complete genomic sequences from over 1000 viruses, over 100 microbes and 9 eukaryotic organisms. Hundreds of genome sequencing projects are ongoing in different parts of the world, and many more are being initiated. These large-scale sequencing efforts generate massive amounts of raw genomic sequence data, most of which is biologically uncharacterized. Developments in Bioinformatics have made automated in silico annotation of these DNA sequences possible, but applying Bioinformatics tools still remains a challenge for the Biotechnology researcher. This is primarily due to the specialist skills required to use the algorithms and interpret their output, the difficulties in resolving conflicting results, and the difficulty of integrating biological parameters and existing knowledge into the available software. The need to annotate sequences efficiently with information that is accurate and consistent (Bork et al, 1992) has led to the development of several genome annotation systems:

• NIX: NIX [http://www.hgmp.mrc.ac.uk/NIX/] provides several gene prediction algorithms, the most noteworthy of which are GRAIL, GENSCAN, Fgene, FEX and HEXON, MZEF, GeneMark, Genefinder (P Green, unpublished data), and HMMGene, discussed in the previous section. Homology searching using BLAST against EST databases and the European Molecular Biology Laboratory (EMBL) protein database is also enabled. Among the other tools provided are tools for masking repeats, predicting tRNA genes, and predicting promoters and poly A sites. The sequence is analysed in both directions and the results are divided according to strand. Results from programs that perform similar functions are grouped together, i.e. CpG island and promoter predictions, exon and gene predictions, poly A site predictions, and BLAST searches of particular databases. NIX is considered the best genome annotation system, but one disadvantage is that it uses default sets of parameters, meaning that the only way to alter the stringency of these programs is to run each program individually at other websites.

• RUMMAGE: RUMMAGE (Taudien et al, 2000; http://gen100.imb-jena.de/~baumgart/rummage/) is by far the most comprehensive of the web-based genome annotation systems, running the GRAIL, GENSCAN, MZEF and Xpound gene prediction algorithms. The biggest selling point of RUMMAGE is that it also defines consensus exons from their predictions. As part of its comprehensive genome annotations, it also provides BLAST data for sequence tagged sites (STSs), ESTs and proteins from several databases, catalogues the strength of exons and CpG islands, and classifies repeat sequences. While RUMMAGE scores high as a comprehensive tool, it is not as easy to use and interact with as some of the other systems.

• GENOTATOR: GENOTATOR [http://www.fruitfly.org/~nomi/genotator] uses the same gene prediction algorithms as RUMMAGE, with additional tools that include the novel promoter identification program NNPP (M G Reese & F H Eeckman, unpublished data), the gene prediction program GENIE (Kulp et al, 1996) and a BLASTX search of all the GenBank coding sequences translated into proteins.
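One way to picture how an annotation system might combine several predictors is a simple vote over identical exon coordinates. The sketch below is an assumption about what "consensus exons" could mean; the source does not describe the rule RUMMAGE actually applies, and the function name, the program labels and the `min_votes` parameter are hypothetical.

```python
from collections import Counter

def consensus_exons(predictions, min_votes=2):
    """Exons predicted with identical coordinates by at least min_votes programs.

    predictions -- dict mapping a program name to its list of (start, end) exons
    """
    votes = Counter()
    for exons in predictions.values():
        votes.update(set(exons))   # each program votes once per exon
    return sorted(exon for exon, n in votes.items() if n >= min_votes)

# Hypothetical outputs of three programs on the same sequence.
example = {
    "grail": [(10, 20), (30, 40)],
    "genscan": [(10, 20), (30, 42)],
    "mzef": [(30, 40), (50, 60)],
}
```

Here `consensus_exons(example)` keeps the two exons that two programs agree on exactly. A more tolerant rule could accept overlapping exons with slightly different boundaries, at the cost of having to choose which boundary to report.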

For a comprehensive review and comparison of these and other genome annotation systems, see the software reviews of genome sequence analysis tools (Jones et al, 2002; Fortna & Gardiner, 2001). Other systems do not provide interactive use of the tools but instead provide pre-computed annotations: the Human Genome Browser [http://genome.ucsc.edu/], Ensembl [http://www.ensembl.org/], and the Genome Channel [http://compbio.ornl.gov/channel] are important sources of information for the biotechnologist as well as the bioinformatician.

Accuracy and Validity of Gene Prediction Algorithms

In silico gene prediction is the first step in the annotation of genomes. Inaccuracies in in silico gene prediction will travel down the line, resulting in errors at the transcription level and the proteome level, and could ultimately affect, or at least hinder, our understanding of the biology of the species. It is important to select the best computational algorithms, to cross-check and validate annotations with multiple algorithms, and to validate the results experimentally. It is also vital that annotations be updated and revalidated periodically, because (a) new data is constantly being generated and (b) new and improved algorithms are being developed, both of which can considerably improve the quality and quantity of annotation that is possible. Assessment of the relative performance of in silico gene prediction algorithms is made at three levels (Burset & Guigó, 1996): the nucleotide level, the exon level, and the protein product level. At the nucleotide level, the accuracy assessment is based on a comparison of the actual label (coding or non-coding) with the predicted label for each nucleotide in the test sequence. The performance measures used are (Fig. 7):

• Sensitivity (Sn) = Number of coding nucleotides correctly predicted as coding/Number of coding nucleotides

• Specificity (Sp) = Number of coding nucleotides correctly predicted as coding/Number of nucleotides predicted to be coding
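The two nucleotide-level measures can be computed directly from a pair of label strings. The sketch below assumes a hypothetical encoding in which each position is labelled 'C' (coding) or 'N' (non-coding); the function name is likewise an assumption.

```python
def nucleotide_accuracy(actual, predicted):
    """Return (Sn, Sp) at the nucleotide level.

    actual, predicted -- equal-length strings of 'C' (coding) / 'N' (non-coding)
    """
    if len(actual) != len(predicted):
        raise ValueError("label strings must have equal length")
    pairs = list(zip(actual, predicted))
    tp = sum(a == "C" and p == "C" for a, p in pairs)  # coding, predicted coding
    fn = sum(a == "C" and p == "N" for a, p in pairs)  # coding, missed
    fp = sum(a == "N" and p == "C" for a, p in pairs)  # non-coding, predicted coding
    sn = tp / (tp + fn) if tp + fn else float("nan")
    sp = tp / (tp + fp) if tp + fp else float("nan")
    return sn, sp
```

Note that specificity as defined here is the fraction of predicted coding nucleotides that are truly coding, which is the convention used in the gene prediction literature rather than the true negative rate.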

Fig. 7-Assessing accuracy of gene prediction at the nucleotide level. Each position of the prediction is compared with the ground truth and labelled TP, FP, TN or FN.

Fig. 8-Assessment of accuracy of gene prediction at the exon level (ground truth versus prediction).

Other measures used are the correlation coefficient and the approximate correlation (see Burset & Guigó, 1996 for a definition of these measures). At the exon level (Fig. 8), in addition to sensitivity (= number of correct exons/number of actual exons) and specificity (= number of correct exons/number of predicted exons), there are two additional measures that count the exons that are completely missed (ME = number of missing exons/number of actual exons) and the predicted exons that are wrong (WE = number of wrong exons/number of predicted exons). At the protein product level, the translation of the predicted gene sequence is compared with the translation of the actual gene sequence. Frame shift errors and errors in a few nucleotide labels, which may appear insignificant at the nucleotide or exon level, may be magnified at the protein level. Measures of accuracy at the protein level capture the impact of errors in the gene prediction algorithm on the functional assignment that is the ultimate goal of genome annotation.
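The exon-level measures described above can be sketched in the same way. The code assumes that a "correct" exon matches an actual exon at both boundaries, that a "missing" exon is an actual exon overlapping no predicted exon, and that a "wrong" exon is a predicted exon overlapping no actual exon. These readings, the coordinate convention and the function names are assumptions made for the example.

```python
def overlaps(a, b):
    """True if half-open intervals a and b share at least one position."""
    return a[0] < b[1] and b[0] < a[1]

def exon_accuracy(actual, predicted):
    """Exon-level Sn, Sp, ME and WE for lists of (start, end) exons."""
    actual_set, predicted_set = set(actual), set(predicted)
    if not actual_set or not predicted_set:
        raise ValueError("both exon lists must be non-empty")
    correct = len(actual_set & predicted_set)   # exact boundary matches
    missing = sum(not any(overlaps(a, p) for p in predicted_set)
                  for a in actual_set)
    wrong = sum(not any(overlaps(p, a) for a in actual_set)
                for p in predicted_set)
    return {
        "Sn": correct / len(actual_set),
        "Sp": correct / len(predicted_set),
        "ME": missing / len(actual_set),
        "WE": wrong / len(predicted_set),
    }
```

An exon predicted with one boundary slightly off counts against both Sn and Sp but is neither missing nor wrong, which is why ME and WE are reported alongside the two main measures.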

Unlike computational algorithms in other areas of application, biological sequence analysis methods have a unique requirement: the accuracy of the algorithms must be reassessed periodically. This is essential because most of the algorithms depend on the quantity and quality of the data available in genomic databases, and that data has been changing rapidly in the last few years. For example, the initial assessments (Burset & Guigó, 1996) were made on relatively short DNA sequences, each encoding a single, complete gene. However, by the year 2000, much of the raw genomic data in DNA sequence databases consisted of long sequences with multiple genes, coming from complete or partially complete genome sequencing projects. The accuracy assessment had to be updated to consider long DNA sequences (Guigó et al, 2000a), and the results of the comparison of three popular algorithms using 178 long DNA sequences are shown in Table 4.

The first challenge in assessing gene prediction algorithms is to determine appropriate benchmark sets: sequences in which the genes have been correctly annotated and validated through experiments, and which yet capture the complexity of the genomic data available in databases today. While benchmark sets have been created for human gene sequences, the next challenge is to determine benchmark sets for other species. The immediate need is a benchmark set of plant sequences, to evaluate the algorithms being developed specially to handle plant genomic sequences. Microbial genome annotation also relies greatly on claims of accuracy made by the authors of algorithms, and there is as much need for a comparison of the methods used in microbial gene prediction. Assessment of accuracy is also required when algorithms are adapted to incorporate new types of biological data, for example, when algorithms were developed to utilize EST databases (Guigó et al, 2000b). The next challenge of accuracy assessment will be to develop appropriate benchmark sets and assess the variety of syntenic gene prediction algorithms being proposed today.

Gene Prediction Algorithms: Putting them to Use

In the previous sections of this review, the author has reviewed a variety of algorithms and computational techniques that are used for in silico gene prediction and discussed methods for assessing the validity and accuracy of these techniques. However, the accuracy of an algorithm varies from species to species, depending on GC content, the complexity of the gene structure, etc.

Table 5 shows the variation in the accuracy of GENSCAN alone (one of the most popular human gene prediction algorithms), which illustrates clearly that in silico methods still need to be tailored to the characteristics of the genome being annotated. In this section, we discuss how algorithms have been specifically designed or tailored to handle the special needs of different genomes.
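Because accuracy depends on GC content, a test sequence is usually assigned to a composition class before results are compared. The sketch below computes the C + G percentage and maps it to the four subsets used in Table 5. How the boundary values 40, 50 and 60 are assigned is an assumption, since the table does not say, and the function names are hypothetical.

```python
def gc_percent(seq):
    """Percentage of G and C among the unambiguous bases of seq."""
    seq = seq.upper()
    gc = seq.count("G") + seq.count("C")
    total = gc + seq.count("A") + seq.count("T")
    if total == 0:
        raise ValueError("sequence contains no A, C, G or T")
    return 100.0 * gc / total

def gc_subset(seq):
    """Name of the Table 5 composition subset for seq (boundaries assumed)."""
    pct = gc_percent(seq)
    if pct < 40:
        return "C + G <40"
    if pct < 50:
        return "C + G 40-50"
    if pct <= 60:
        return "C + G 50-60"
    return "C + G >60"
```

Reporting Sn and Sp separately for each subset, as Table 5 does, makes it visible when a program trained on one composition class degrades on another.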

Gene Prediction for Prokaryotes

In silico gene prediction for prokaryotes has been relatively easier and more accurate due to the inherent simplicity of prokaryotic gene structure. The interest generated by the completion of the 1.83 Mb (megabase) Haemophilus influenzae genome, the first complete genome sequence of a cellular life form, marked only the beginning of the focus on sequencing bacterial genomes. Today, the race is on to completely sequence and annotate an increasing number of microbial genomes (xxxx). This has been the catalyst in the development of microbial gene prediction algorithms, starting with simple homology

Table 4 - Assessment of accuracy of selected gene prediction algorithms

Program | No. | Nucleotide Sn | Nucleotide Sp | Nucleotide CC | Exon Sn | Exon Sp | Exon (Sn+Sp)/2 | Exon ME | Exon WE | Gene MG | Gene WG
GenScan | 43 | 0.89 | 0.64 | 0.76 | 0.64 | 0.44 | 0.54 | 0.14 | 0.41 | 0.03 | 0.28
 | | 0.92 | 0.92 | 0.91 | 0.76 | 0.76 | 0.76 | 0.09 | 0.09 | |
Genewise | 43 | 0.98 | 0.98 | 0.97 | 0.88 | 0.91 | 0.89 | 0.06 | 0.02 | |
 | | 0.98 | 0.98 | 0.97 | 0.88 | 0.91 | 0.89 | 0.06 | 0.02 | |
Procrustes | 43 | 0.93 | 0.94 | 0.93 | 0.80 | 0.75 | 0.77 | 0.10 | 0.16 | |
 | | 0.93 | 0.95 | 0.93 | 0.76 | 0.82 | 0.79 | 0.11 | 0.14 | |

Table 5 - Dependence of accuracy on genomic features

Subset | Sequences | Nucleotide Sn | Nucleotide Sp | Nucleotide AC | Nucleotide CC | Exon Sn | Exon Sp | Exon Avg. | Exon ME | Exon WE
C + G <40 | 86 (3) | 0.90 | 0.95 | 0.90 | 0.93 | 0.78 | 0.87 | 0.84 | 0.14 | 0.05
C + G 40-50 | 220 (1) | 0.94 | 0.92 | 0.91 | 0.91 | 0.80 | 0.82 | 0.82 | 0.08 | 0.05
C + G 50-60 | 208 (4) | 0.93 | 0.93 | 0.90 | 0.92 | 0.75 | 0.77 | 0.77 | 0.08 | 0.05
C + G >60 | 56 (0) | 0.97 | 0.89 | 0.90 | 0.90 | 0.76 | 0.77 | 0.76 | 0.07 | 0.08
Primates | 237 (1) | 0.96 | 0.94 | 0.93 | 0.94 | 0.81 | 0.82 | 0.82 | 0.07 | 0.05
Rodents | 191 (4) | 0.90 | 0.93 | 0.89 | 0.91 | 0.75 | 0.80 | 0.78 | 0.11 | 0.05
Non-mam. Vert. | 72 (2) | 0.93 | 0.93 | 0.90 | 0.93 | 0.81 | 0.85 | 0.84 | 0.11 | 0.06


searches (comparing the H. influenzae genome to the well understood E. coli) to the ab initio methods of GENEMARK and GLIMMER. Algorithms such as ECOPARSE and TbPARSE (Krogh et al, 1994a, 1994b) have also been designed to handle specific microbial genomes such as E. coli and M. tuberculosis. GLIMMER has been reported to have over 97% accuracy on the H. influenzae and Helicobacter pylori genomes. In recent years, however, new "hybrid" algorithms such as ORPHEUS (Frishman et al, 1998) and CRITICA (Badger & Olsen, 1999) have been developed that combine homology information with ab initio methodologies. ORPHEUS uses homology to identify potential gene fragments, in combination with statistical properties of protein-coding regions to identify ribosomal binding sites, and thus provides improved predictions of the starts of genes. In CRITICA, homology is used in combination with dicodon usage measures to better predict gene positions.

Gene Prediction for Eukaryotes (excluding plants)

Eukaryotic gene prediction has for the most part focused on determining human genes, or at least genes in vertebrates. However, special algorithms are needed for lower eukaryotes such as those of the yeast family to handle their inherent differences from higher eukaryotes. For example, in some lower eukaryotic organisms such as Saccharomyces cerevisiae, introns are very rare, while in other lower eukaryotes such as S. pombe introns are common but much shorter. Thus, algorithms such as GENSCAN, which use exon and intron length distributions from vertebrate genome sequences, will not give the same level of accuracy in lower eukaryotes. The algorithms FEXY and POMBE, for example, adapt the discriminant analysis methods of FEXH and HEXON (Solovyev et al, 1994) to predict splice sites and exons in yeast DNA sequences.

Plant Gene Prediction

Completion of the first model plant genome, that of Arabidopsis thaliana, and the ongoing efforts in sequencing various crop genomes such as rice and maize have given birth to renewed interest in gene prediction algorithms. A review (Pavy et al, 1999) of in silico gene prediction methods on a benchmark set of A. thaliana sequences showed that current methods must be modified to meet characteristics of genomic sequences that are specific to plants. In plants, the bias in the nucleotide composition of exons and introns has in particular been assigned an important role in the correct recognition of splice sites.

Conclusion

In silico gene prediction algorithms have come a long way since the early days, both in terms of their ability to incorporate different types of biological information and their accuracy. The last decade has seen a revolution in genomics and in bioinformatics algorithms for genomics, i.e. in silico gene prediction. Today, attention is shifting from the development of these algorithms to combining the various existing algorithms and methodologies into "hybrid systems" that are more accurate and more robust to changes in GC content and even in the species under consideration. The remaining challenges for annotation at the genome level include:

• Detecting partial genes and overlapping genes in raw DNA sequences;

• Improved prediction of gene/exon boundaries;

• Syntenic gene prediction that allows comparison between more than 2 species;

• Prediction of non protein-coding genes, such as tRNA genes; and

• Prediction of alternately spliced transcripts.

The beginning of the new millennium coincided with the dawn of a new era for biology. In silico gene prediction is and will continue to be the most important step in genome annotation. This review paper was intended to provide only a flavour of the scope and resources available today for gene prediction.

The real success of in silico gene prediction algorithms lies in their successful integration into the toolboxes used by biologists and biotechnology researchers. Even with the high success rates of such computational methods, human intervention will always be necessary, and more collaborative efforts will be needed to verify and validate the huge volumes of annotation data that are generated through such techniques.

References

Adak S & Srivastava B, 2001. Bioinformatics: Advancing biotechnology through information technology. Part I: Molecular Biology Databases. Indian J Biotechnol, 1, 101-116.

Altschul S F et al, 1990. Basic local alignment search tool. J Mol Biol, 215, 403-410.

Ansari-Lari M A et al, 1998. Comparative sequence analysis of a gene-rich cluster at human chromosome 12p13 and its syntenic region in mouse chromosome 6. Genome Res, 8, 29-40.


Badger J H & Olsen G J, 1999. CRITICA: Coding region identification tool invoking comparative analysis. Mol Biol Evol, 16, 512-524.

Bafna V & Huson D H, 2000. The Conserved Exon Method for gene finding. Proc 8th Int Conf Intell Syst Mol Biol, San Diego, CA, USA. Pp 3-12.

Bailey L C Jr et al, 1998. Analysis of EST-driven gene annotation in human genomic sequence. Genome Res, 8, 362-376.

Batzoglou S et al, 2000. Human and mouse gene structure: Comparative analysis and application to exon prediction. Genome Res, 7, 950-958.

Birney E & Durbin R, 1997. Dynamite: A flexible code generating language for dynamic programming methods used in sequence comparison. in Proc 5th Int Conf Intell Syst Mol Biol, edited by T Gaasterland et al. AAAI Press, Menlo Park, CA, USA. Pp 56-64.

Bork P et al, 1992. What's in a genome? Nature (Lond), 338, 287.

Borodovsky M & McIninch J, 1993. GENMARK: Parallel gene recognition for both DNA strands. Comput Chem, 17, 123-133.

Breiman L et al, 1984. Classification and Regression in Trees. Wadsworth, Belmont, CA, USA.

Brunak S et al, 1991. Prediction of human mRNA donor and acceptor sites from the DNA sequence. J Mol Biol, 220, 49-65.

Burge C & Karlin S, 1997. Prediction of complete gene structure in human genomic DNA. J Mol Biol, 268, 1-17.

Burset M & Guigó R, 1996. Evaluation of gene structure prediction programs. Genomics, 34, 353-367.

Claverie J M et al, 1990. K-tuple frequency analysis: From intron/exon discrimination to T-cell epitope mapping. Methods Enzymol, 183, 237-252.

Cole S et al, 1998. Deciphering the biology of M. tuberculosis from the complete genome sequence. Nature (Lond), 393, 537-544.

Crollius H R et al, 2000. Estimate of human gene number provided by genome-wide analysis using Tetraodon nigroviridis DNA sequence. Nature Genet, 25, 235-238.

Delcher A L et al, 1999. Alignment of whole genomes. Nucleic Acids Res, 27, 2369-2376.

Dubchak I et al, 2000. Active conservation of noncoding sequences revealed by three-way species comparison. Genome Res, 10, 1304-1306.

Florea L et al, 2000. Web-based visualization tools for bacterial genome alignments. Nucleic Acids Res, 28, 3486-3496.

Fickett J W & Tung C-S, 1992. Assessment of protein coding measures. Nucleic Acids Res, 20, 6441-6450.

Fisher R A, 1936. Ann Eugenics, 7, 179-188.

Fleischmann R D et al, 1995. Whole-genome random sequencing and assembly of Haemophilus influenzae Rd. Science, 269, 496-512.

Fortna A & Gardiner K, 2001. Genome sequence analysis tools: A user's guide. Trends Genet, 17, 158-164.

Frishman D et al, 1998. Combining diverse evidence for gene recognition in completely sequenced bacterial genomes. Nucleic Acids Res, 26, 2941-2947.

Gelfand M S et al, 1996. Gene recognition via spliced sequence alignment. Proc Natl Acad Sci USA, 93, 9061-9066.

Gottgens B et al, 2000. Analysis of vertebrate SCL loci identifies conserved enhancers. Nature Biotechnol, 18, 181-186.


Gottgens B et al, 2001. Long-range comparison of human andmouse SCL loci: Localized regions of sensitivity torestriction endonucleases correspond precisely with peaks ofconserved noncoding sequences. Genome Res, 11,87-97.

Gribskov M et al, 1984. The codon preference plot: Graphicanalysis of protein coding sequences and prediction of geneexpression. Nucleic Acids Res, 12,529-549.

Guigo R et al, 2000a. An assessment of gene prediction accuracyin large DNA sequences. Genome Res, 10, 1631-1642.

Guigo i R et al, 2000b. Sequence similarity based gene prediction.in Genomics and proteomics: Functional and Computationalaspects, edited by S Suhai. Kluwer Academic/PlenumPublishing, New York. Pp 95-105.

Henderson J et al, 1997. Finding genes in DNA with a hiddenMarkov model. J Comput Bioi, 4, 127-142.

Jareborg N & Durbin R, 2000. Alfresco-a workbench forcomparative genomic sequence analysis. Genome Res, 10,1148-1157.

Jareborg N et al, 1999. Comparative analysis of non-codingregions of 77 orthologous mouse and human gene pairs.Genome Res, 9, 815-824.

Jones et al, 2002. A comparative guide to gene prediction toolsfor the bioinformatics amateur. In! J Oncol, 20, 697-705.

Kent & Zahler, 2000 The lntronerator: Exploring introns andalternative splicing in C. elegans. Nucleic Acids Res, 28, 91-93.

Krogh A 1997. Two methods for improving performance of anHMM and their application for gene-finding. in Proc s" IntConf lntell Systr Mol Biol,edited by T. Gaasterland. AAAJPress, Menlo Park, CA, USA Pp 179-186.

Krogh A, et al, 1994a. A hidden markov model that finds genes inE. coli DNA. Nucleic Acids Res, 22,4768-4778.

Krogh A et al, 1994b. Hidden Markov models in computationalbiology: application to protein modeling. J Mol Bioi, 235,1501-1531.

Kulp D et al, 1996. A generalized hidden Markov model forthe recognition of human genes in DNA. Proc 4'h liltConf In/ell Syst Mol Bioi, St Louis, Missouri, USA. Pp 134-142.

Lapedes A S et al, 1990. Application of neural networks and other machine learning algorithms to DNA sequence analysis. in Computers and DNA, edited by G Bell & T Marr. Addison-Wesley Longman, Redwood City, CA, USA.

Loots G G et al, 2000. Identification of a coordinate regulator of interleukins 4, 13, and 5 by cross-species sequence comparisons. Science, 288, 136-140.

Lukashin A V & Borodovsky M, 1998. GeneMark.hmm: New solutions for gene finding. Nucleic Acids Res, 26, 1107-1115.

Lund J et al, 2000. Comparative sequence analysis of 634 kb of the mouse chromosome 16 region of conserved synteny with the human velocardiofacial syndrome region on chromosome 22q11.2. Genomics, 63, 374-383.

McLachlan G J, 1992. Discriminant Analysis and Statistical Pattern Recognition. Wiley, New York.

Miller W, 2001. Comparison of genomic DNA sequences: Solved and unsolved problems. Bioinformatics, 17, 391-397.

Morgenstern B et al, 2002. Exon discovery by genomic sequence alignment. Bioinformatics, 18, 777-787.

Murthy et al, 1994. A system for induction of oblique decision trees. J Artif Intell Res, 2, 1-33.

Pavy N et al, 1999. Evaluation of gene prediction software using a genomic dataset: Application to Arabidopsis thaliana sequences. Bioinformatics, 15, 887-899.

Pearson W R & Lipman D J, 1988. Improved tools for biological sequence analysis. Proc Natl Acad Sci USA, 85, 2444-2448.

Pearson M L & Soll D, 1991. The Human Genome Project: A paradigm for information management in the life sciences. FASEB J, 5, 35-39.

Quinlan J R, 1993. C4.5: Programs for Machine Learning. Morgan Kaufmann Publishers, San Mateo, CA, USA.

Ramachandran S & Ramakrishna R, 1999. Gene identification in bacterial and organellar genomes using GeneScan. Comput Chem, 23, 165-174.

Rinner & Morgenstern, 2002. AGenDA: Gene prediction by comparative sequence analysis. In Silico Biol, 2, 195-205.

Salzberg S et al, 1998. A decision tree system for finding genes in DNA. Technical Report 1997-03, Department of Computer Science, Johns Hopkins University, Baltimore, Maryland, USA.

Schwartz S et al, 2000. PipMaker - A web server for aligning two genomic DNA sequences. Genome Res, 10, 577-586.

Semple C, 2000. Gene prediction: The end of the beginning. http://bioinformer.ebi.ac.uk/newsletter/archives/6/genepred2000_6.html

Shulman M J et al, 1981. The coding function of nucleotide sequences can be discerned by statistical analysis. J Theor Biol, 88, 409-420.

Staden R & McLachlan A D, 1982. Codon preference and its use in identifying protein coding regions in long DNA sequences. Nucleic Acids Res, 10, 141-156.

Solovyev V V et al, 1994. Predicting internal exons by oligonucleotide composition and discriminant analysis of spliceable open reading frames. Nucleic Acids Res, 22, 5156-5163.

Taudien S et al, 2000. RUMMAGE - a high throughput sequence annotation system. Trends Genet, 16, 519-521.

Uberbacher E C & Mural R J, 1991. Locating Protein Coding Regions in Human DNA Sequences Using a Multiple Sensor-Neural Network Approach. Proc Natl Acad Sci USA, 88, 11261-11265.

Wasserman W et al, 2000. Human-mouse genome comparisons to locate regulatory sites. Nature Genet, 26, 225-228.

Wiehe T et al, 2000. Genome sequence comparisons: Hurdles in the fast lane to functional genomics. Brief Bioinformatics, 1, 381-388.

Wilbur W & Lipman D J, 1983. Proc Natl Acad Sci USA, 80, 726-730.

Wilson M D et al, 2001. Comparative analysis of the gene dense ACHE/TFR2 region on human chromosome 7q22 with the orthologous region on mouse chromosome 5. Nucleic Acids Res, 29, 1352-1365.

Zhang M Q, 1997. Identification of protein coding regions in the human genome by quadratic discriminant analysis. Proc Natl Acad Sci USA, 94, 565-568.

