Bioinformatics—An Introduction for Computer Scientists

JACQUES COHEN

Brandeis University

Abstract. The article aims to introduce computer scientists to the new field of bioinformatics. This area has arisen from the needs of biologists to utilize and help interpret the vast amounts of data that are constantly being gathered in genomic research—and its more recent counterparts, proteomics and functional genomics. The ultimate goal of bioinformatics is to develop in silico models that will complement in vitro and in vivo biological experiments. The article provides a bird’s eye view of the basic concepts in molecular cell biology, outlines the nature of the existing data, and describes the kind of computer algorithms and techniques that are necessary to understand cell behavior. The underlying motivation for many of the bioinformatics approaches is the evolution of organisms and the complexity of working with incomplete and noisy data. The topics covered include: descriptions of the current software especially developed for biologists, computer and mathematical cell models, and areas of computer science that play an important role in bioinformatics.

Categories and Subject Descriptors: A.1 [Introductory and Survey]; F.1.1 [Computation by Abstract Devices]: Models of Computation—Automata (e.g., finite, push-down, resource-bounded); F.4.2 [Mathematical Logic and Formal Languages]: Grammars and Other Rewriting Systems; G.2.0 [Discrete Mathematics]: General; G.3 [Probability and Statistics]; H.3.0 [Information Storage and Retrieval]: General; I.2.8 [Artificial Intelligence]: Problem Solving, Control Methods, and Search; I.5.3 [Pattern Recognition]: Clustering; I.5.4 [Pattern Recognition]: Applications—Text processing; I.6.8 [Simulation and Modeling]: Types of Simulation—Continuous; discrete event; I.7.0 [Document and Text Processing]: General; J.3 [Life and Medical Sciences]: Biology and genetics

General Terms: Algorithms, Languages, Theory

Additional Key Words and Phrases: Molecular cell biology, computer, DNA, alignments, dynamic programming, parsing biological sequences, hidden-Markov-models, phylogenetic trees, RNA and protein structure, cell simulation and modeling, microarray

1. INTRODUCTION

It is undeniable that, among the sciences, biology played a key role in the twentieth century. That role is likely to acquire further importance in the years to come. In the wake of the work of Watson and Crick [2003] and the sequencing of the human genome, far-reaching discoveries are constantly being made.

Author’s address: Department of Computer Science, Brandeis University, Waltham, MA 02454; email: [email protected]
Permission to make digital/hard copy of part or all of this work for personal or classroom use is granted without fee provided that the copies are not made or distributed for profit or commercial advantage, the copyright notice, the title of the publication, and its date appear, and notice is given that copying is by permission of ACM, Inc. To copy otherwise, to republish, to post on servers, or to redistribute to lists requires prior specific permission and/or a fee.
© 2004 ACM 0360-0300/04/0600-0122 $5.00

ACM Computing Surveys, Vol. 36, No. 2, June 2004, pp. 122–158.

One of the central factors promoting the importance of biology is its relationship with medicine. Fundamental progress in medicine depends on elucidating some of the mysteries that occur in the biological sciences.

Biology depended on chemistry to make major strides, and this led to the development of biochemistry. Similarly, the need to explain biological phenomena at the atomic level led to biophysics. The enormous amount of data gathered by biologists—and the need to interpret it—requires tools that are in the realm of computer science. Thus, bioinformatics.

Both chemistry and physics have benefited from the symbiotic work done with biologists. The collaborative work functions as a source of inspiration for novel pursuits in the original science. It seems certain that the same benefit will accrue to computer science—work with biologists will inspire computer scientists to make discoveries that improve their own discipline.

A common problem with the maturation of an interdisciplinary subject is that, inevitably, the forerunner disciplines call for differing perspectives. I see these differences in working with my biologist colleagues. Nevertheless, we are so interested in the success of our dialogue, that we make special efforts to understand each other’s point of view. That willingness is critical for joint work, and this is particularly true in bioinformatics.

An area called computational biology preceded what is now called bioinformatics. Computational biologists also gathered their inspiration from biology and developed some very important algorithms that are now used by biologists. Computational biologists take justified pride in the formal aspects of their work. Those often involve proofs of algorithmic correctness, complexity estimates, and other themes that are central to theoretical computer science.

Nevertheless, the biologists’ needs are so pressing and broad that many other aspects related to computer science have to be explored. For example, biologists need software that is reliable and can deal with huge amounts of data, as well as interfaces that facilitate the human-machine interactions.

I believe it is futile to argue the differences and scope of computational biology as compared to bioinformatics. Presently, the latter is more widely used among biologists than the former, even though there is no agreed definition for the two terms.

A distinctive aspect of bioinformatics is its widespread use of the Web. It could not be otherwise. The immense databases containing DNA sequences and 3D protein structures are available to almost any researcher. Furthermore, the community interested in bioinformatics has developed a myriad of application programs accessible through the Internet. Some of these programs (e.g., BLAST) have taken years of development and have been finely tuned. The vast numbers of daily visits to some of the NIH sites containing genomic databases are comparable to those of widely used search engines or active software downloading sites. This explains the great interest that bioinformaticians have in script languages such as Perl and Python that allow the automatic examination and gathering of information from websites.

With the above preface, we can put forward the objectives of this article and state the background material necessary for reading it. The article is both a tutorial and a survey. As its title indicates, it is oriented towards computer scientists.

Some biologists may argue that the proper way to learn bioinformatics is to have a good background in organic chemistry and biochemistry and to take a full course in molecular cell biology. I beg to disagree: In an interdisciplinary field like bioinformatics there must be several entry points and one of them is using the language that is familiar to computer scientists. This does not imply that one can skip the fundamental knowledge available in a cell and molecular biology text. It means that a computer scientist interested in learning what is presently being done in bioinformatics can save some precious time by reading the material in this article.

It was mentioned above that, in interdisciplinary areas like bioinformatics, the players often view topics from a different perspective. This is not surprising since both biologists and computer scientists have gone through years of education in their respective fields. A related plausible reason for such differences is as follows:

In computer science, we favor generality and abstractions; our approach is often top-down, as if we were developing a program or writing a manual. In contrast, biologists often favor a bottom-up approach. This is understandable because the minutiae are so important and biologists are often involved with time-consuming experiments that may yield ambiguous results, which in turn have to be resolved by further tests. The work of synthesis eventually has to take place but, since in biology most of the rules have exceptions, biologists are wary of generalizations.

Naturally, in this article’s presentation, I have used a top-down approach. To make the contents of the article clear and self-contained, certain details have had to be omitted. From a pragmatic point of view, articles like this one are useful in bridging disciplines, provided that the readers are aware of the complexities lying behind abstractions.

The article should be viewed as a tour of bioinformatics enabling the interested reader to search subsequently for deeper knowledge. One should expect to expend considerable effort gaining that knowledge because bioinformatics is inextricably tied to biology, and it takes time to learn the basic concepts of biology.

This work is directed to a mature computer scientist who wishes to learn more about the areas of bioinformatics and computational biology. The reader is expected to be at ease with the basic concepts of algorithms, complexity, language and automata theory, topics in artificial intelligence, and parallelism.

As to knowledge in biology, I hope the reader will be able to recall some of the rudiments of biology learned in secondary education or in an undergraduate course in the biological sciences. An appendix to this article reviews a minimal set of facts needed to understand the material in the subsequent sections. In addition, The Cartoon Guide to Genetics [Gonick and Wheelis 1991] is an informative and amusing text explaining the fundamentals of cell and molecular biology.

The reader should keep in mind the pervasive role of evolution in biology. It is evolution that allows us to infer the cell behavior of one species, say the human, from existing information about the cell behavior of other species like the mouse, the worm, the fruit fly, and even yeast.

In the past decades, biologists have gathered information about the cell characteristics of many species. With the help of evolutionary principles, that information can be extrapolated to other species. However, most available data is fragmented, incomplete, and noisy. So if one had to characterize bioinformatics in logical terms, it would be: reasoning with incomplete information. That includes providing ancillary tools allowing researchers to compare carefully the relationship between new data and data that has been validated by experiments. Since understanding the human cell is a primary concern in medicine, one usually wishes to infer human cell behavior from that of other species.

This article’s contents aim to shed some light on the following questions:

—How can one describe the actors and processes that take place within a living cell?

—What can be determined or measured to infer cell behavior?

—What data is presently available for that purpose?

—What are the present major problems in bioinformatics and how are they being solved?

—What areas in computer science relate most to bioinformatics?

The next section offers some words of caution that should be kept in the reader’s mind. The subsequent sections aim at answering the above questions. A final section provides information about how to proceed if one wishes to further explore this new discipline.



2. WORDS OF CAUTION

Naturally, it is impossible to condense the material presently available in many bioinformatics texts into a single survey and tutorial. Compromises had to be made that may displease purists who think that no shortcuts are possible to explain this new field. Nevertheless, the objective of this work will be fulfilled if it incites the reader to continue along the path opened by reading this precis.

There is a dichotomy between the various presentations available in bioinformatics articles and texts. At one extreme are those catering to algorithms, complexity, statistics, and probability. On the other are those that utilize tools to infer new biological knowledge from existing data. The material in this article should be helpful to initiate potential practitioners in both areas. It is always worthwhile to keep in mind that new algorithms become useful when they are developed into packages that are used by biologists.

It is also wise to recall some of the distinguishing aspects of biology to which computer scientists are not accustomed. David B. Searls, in a highly recommended article on Grand challenges in computational biology [Searls 1998], points out that, in biology:

—There are no rules without exception;

—Every phenomenon has a nonlocal component;

—Every problem is intertwined with others.

For example, for some time, it was thought that a gene was responsible for producing a single protein. Recent work in alternate splicing indicates that a gene may generate several proteins. This may explain why the number of genes in the human genome is smaller than had been anticipated earlier. Another example of biology’s fundamentally dynamic and empirical state is that it has been recently determined that gene-generated proteins may contain amino acids beyond the 20 that are normally used as constituents of those proteins.

The second item in Searls’ list warns us that the existence of common local features cannot be generalized. For example, similar 3D substructures may originate from different sequences of amino acids. This implies that similarity at one level cannot be generalized to another.

Finally, the third item cautions us to consider biological problems as an aggregate and not to get lost in studying only individual components. For example, simple changes of nucleotides in DNA may result in entirely different protein structures and function. This implies that the study of genomes has to be tied to the study of the resulting proteins.

3. BRIEF DESCRIPTION OF A SINGLE CELL AT THE MOLECULAR LEVEL

In this section, we assume that the reader has a rudimentary level of knowledge in cell and molecular biology. (The appendix reviews some of that material.) The intent here is to show the importance three-dimensional structures have in understanding the behavior of a living cell. Cells in different organisms or within the same organism vary significantly in shape, size, and behavior. However, they all share common characteristics that are essential for life.

The cell is made up of molecular components, which can be viewed as 3D-structures of various shapes. These molecules can be quite large (like DNA molecules) or relatively small (like the proteins that make up the cell membrane). The membrane acts as a filter that controls the access of exterior elements and also allows certain molecules to exit the cell.

Biological molecules in isolation usually maintain their structure; however, they may also contain articulations that allow movements of their subparts (thus, the interest of nano-technology researchers in those molecules).

The intracellular components are made of various types of molecules. Some of them navigate randomly within the media inside the membrane. Other molecules are attracted to each other.



In a living cell, the molecules interact with each other. By interaction it is meant that two or more molecules are combined to form one or more new molecules, that is, new 3D-structures with new shapes. Alternatively, as a result of an interaction, a molecule may be disassembled into smaller fragments. An interaction may also reflect mutual influence among molecules. These interactions are due to attractions and repulsions that take place at the atomic level. An important type of interaction involves catalysis, that is, the presence of a molecule that facilitates the interaction. These facilitators are called enzymes.

Interactions amount to chemical reactions that change the energy level of the cell. A living cell has to maintain its orderly state and this takes energy, which is supplied by surrounding light and nutrients.

It can be said that biological interactions frequently occur because of the shape and location of the cell’s constituent molecules. In other words, the proximity of components and the shape of components trigger interactions. Life exists only when the interactions can take place.

A cell grows because of the availability of external molecules (nutrients) that can penetrate the cell’s membrane and participate in interactions with existing intracellular molecules. Some of those may also exit through the membrane. Thus, a cell is able to “digest” surrounding nutrients and produce other molecules that are able to exit through the cell’s membrane. A metabolic pathway is a chain of molecular interactions involving enzymes. Signaling pathways are molecular interactions that enable communication through the cell’s membrane. The notions of metabolic and signaling pathways will be useful in understanding gene regulation, a topic that will be covered later.

Cells, then, are capable of growing by absorbing outside nutrients. Copies of existing components are made by interactions among exterior and interior molecules. A living cell is thus capable of reproduction: this occurs when there are enough components in the original cell to produce a duplicate of the original cell, capable of acting independently.

So far, we have intuitively explained the concepts of growth, metabolism, and reproduction. These are some of the basic characteristics of living organisms. Other important characteristics of living cells are: motility, the capability of searching for nutrients, and eventually death.

Notice that we assumed the initial existence of a cell, and it is fair to ask the question: how could one possibly have engineered such a contraption? The answer lies in evolution. When the cell duplicates it may introduce slight (random) changes in the structure of its components. If those changes extend the life span of the cell they tend to be incorporated in future generations. It is still unclear what ingredients made up the primordial living cell that eventually generated all other cells by the process of evolution.

The above description is very general and abstract. To make it more detailed one has to introduce the differences between the various components of a cell. Let us differentiate between two types of cell molecules: DNA and proteins. DNA can be viewed as a template for producing additional (duplicate) DNA and also for producing proteins.

Protein production is carried out using cascading transformations. In bacterial cells (called prokaryotes), RNA is first generated from DNA and proteins are produced from RNA. In a more developed type of cells (eukaryotes), there is an additional intermediate transformation: pre-RNA is generated from DNA, RNA from pre-RNA, and proteins from RNA. Indeed, the present paragraph expresses what is known as the central dogma in molecular biology. (Graphical representations of these transformations are available in the site of the National Health Museum [http://www.accessexcellence.org/AB/GG/].)

Note that the above transformations are actually molecular interactions such as we had previously described. A transformation A → B means that the resulting molecules of B are constructed anew using subcomponents that are “copies” of the existing molecules of A. (Notice the similarity with Lisp programs that use constructors like cons to carry out transformations of lists.)

The last two paragraphs implicitly assume the existence of processors capable of effecting the transformations. Indeed that is the case with RNA-polymerases, spliceosomes, and ribosomes. These are marvelous machineries that are made themselves of proteins and RNA, which in turn are produced from DNA! They demonstrate the omnipresence of loops in cell behavior.

One can summarize the molecular transformations that interest us using the notation:

$$\text{DNA} \xrightarrow[\text{RNA-polymerase}]{} \text{pre-RNA} \xrightarrow[\text{Spliceosome}]{} \text{RNA} \xrightarrow[\text{Ribosome}]{} \text{Protein}$$

The arrows denote transformations and the entities below them indicate the processors responsible for carrying out the corresponding transformations. Some important remarks are in order:

(1) All the constituents in the above transformation are three-dimensional structures.

(2) It is more appropriate to consider a gene (a subpart of DNA) as the original material processed by RNA-polymerase.

(3) An arsenal of processors in the vicinity of the DNA molecule works on multiple genes simultaneously.

(4) The proteins generated by various genes are used as constituents making up the various processors.

(5) A generated protein may prevent (or accelerate) the production of other proteins. For example, a protein Pi may place itself at the origin of gene Gk and prevent Pk from being produced. It is said that Pi represses Pk. In other cases, the opposite occurs: one protein activates the production of another.

(6) It is known that a spliceosome is capable of generating different RNAs (alternate splicing) and therefore the old notion that a given gene produces one given protein no longer holds true. As a matter of fact, a gene may produce several different proteins, though the mechanism of this is still a subject of research.

(7) It is never repetitious to point out that in biology, most rules have exceptions [Searls 1998].
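The chain of transformations summarized above can be sketched as three string-rewriting steps. This is only a toy illustration under strong assumptions: the gene is given as its coding strand, exon coordinates are supplied by hand, and the function names, the example gene, and the five-entry codon table are hypothetical choices made for this sketch (the table is a small excerpt of the standard genetic code). Real molecules are 3D structures, as remark (1) stresses, so the strings below abstract away almost everything.

```python
# Toy sketch of DNA -> pre-RNA -> RNA -> protein.
# All names, coordinates, and the example gene are illustrative assumptions.

# Excerpt of the standard genetic code (only the codons used below).
CODON_TABLE = {
    "AUG": "M",   # methionine, the usual start codon
    "GCU": "A",   # alanine
    "AAA": "K",   # lysine
    "UGG": "W",   # tryptophan
    "UAA": None,  # stop codon
}

def transcribe(gene: str) -> str:
    """RNA-polymerase step: copy the coding strand into pre-RNA (T becomes U)."""
    return gene.replace("T", "U")

def splice(pre_rna: str, exons: list[tuple[int, int]]) -> str:
    """Spliceosome step: keep the exon intervals [start, end), drop the rest."""
    return "".join(pre_rna[start:end] for start, end in exons)

def translate(rna: str) -> str:
    """Ribosome step: read codons three at a time until a stop codon.

    Raises KeyError for codons outside the small excerpt above.
    """
    protein = []
    for i in range(0, len(rna) - 2, 3):
        amino_acid = CODON_TABLE[rna[i:i + 3]]
        if amino_acid is None:
            break
        protein.append(amino_acid)
    return "".join(protein)

# Exon (0-6), intron (6-15), exon (15-24).
gene = "ATGGCT" + "GTAAGTCAG" + "AAATGGTAA"
pre_rna = transcribe(gene)

print(translate(splice(pre_rna, [(0, 6), (15, 24)])))  # MAKW
# A different choice of retained intervals yields a different protein,
# loosely mimicking the alternate splicing of remark (6).
print(translate(splice(pre_rna, [(0, 6), (18, 24)])))  # MAW
```

Each function returns a new string built from copies of its input rather than modifying it, which mirrors the earlier observation that the molecules of B are constructed anew from the molecules of A.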

The term, gene expression, refers to the production of RNA by a given gene. Presumably, the amount of RNA generated by the various genes of an organism establishes an estimate of the corresponding protein levels.

An important datum that can be obtained by laboratory experiments is an estimate of the simultaneous RNA production of thousands of genes. Gene expressions vary depending on a given state of the cell (e.g., starvation or lack of light, abnormal behavior, etc.).

3.1. Analogy with Computer Science Programs

We now open a parenthesis to recall the relationship that exists between computer programs and data; that relationship has analogies that are applicable to understanding cell behavior. Not all biologists will agree with a metaphor equating DNA to a computer program. Nevertheless, I have found that metaphor useful in explaining DNA to computer scientists.

In the universal Turing Machine (TM) model of computing, one does not distinguish between program and data—they coexist in the machine’s tape and it is the TM interpreter that is commanded to start computations at a given state examining a given element of the tape.

Let us introduce the notion of interpretation in our simplified description of a single biological cell. Both DNA and proteins are components of our model, but the interactions that take place between DNA and other components (existing proteins) result in producing new proteins each of which has a specific function needed for cell survival (growth, metabolism, replication, and others).

The following statement is crucial to understanding the process of interpretation occurring within a cell. Let a gene G in the DNA component be responsible for producing a protein P. An interpreter I capable of processing any gene may well utilize P as one of its components. This implies that if P has not been assembled into the machinery of I no interpretation takes place.

Another instance in which P cannot be produced is the already mentioned fact that another protein P′ may position itself at the beginning of gene G and (temporarily) prevent the transcription.

The interpreter in the biological case is either one that already exists in a given cell (prior to cell replication) or else it can be assembled from proteins and RNA generated by specific genes (e.g., ribosomal genes). In biology the interpreter can be viewed as a mechanical gadget that is made of moving parts that produce new components based on given templates (DNA or RNA). The construction of new components is made by subcomponents that happen to be in the vicinity. If they are not, interpretation cannot proceed.

One can imagine a similar situation when interpreting computer programs (although it is unlikely to occur in actual interpreters). Assume that the components of I are first generated on the fly and once I is assembled (as data), control is transferred to the execution of I (as a program).

The above situation can be simulated in actual computers by utilizing concurrent processes that comprise a multitude of interrupts to control program execution. This could be implemented using interpreters that first test that all the components have been assembled: execution proceeds only if that is the case; otherwise an interrupt takes place until the assembly is completed. Alternatively one can execute program parts as soon as they are produced and interrupt execution if a sequel has not yet been fully generated. In Section 7.5.1, we will describe one such model of gene interaction.
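The first of these two schemes (test that the interpreter is fully assembled, and produce nothing otherwise) can be mimicked with a few lines of Python. The sketch below is a deliberately crude, sequential stand-in for the concurrent processes mentioned above; the function `step`, the gene and protein names, and the representation of a cell as a set of protein names are all assumptions made for illustration and are not the model of Section 7.5.1.

```python
# Hypothetical toy model: a cell is a set of protein names, genes map to the
# protein they encode, and the interpreter I needs certain proteins as parts.

def step(cell: set[str], genes: dict[str, str], interpreter_parts: set[str],
         repressors: dict[str, str]) -> set[str]:
    """Perform one round of interpretation and return the proteins present."""
    # The interpreter runs only once all of its own parts are in the cell.
    if not interpreter_parts <= cell:
        return set(cell)  # "interrupt": nothing is produced in this round
    produced = set()
    for gene, protein in genes.items():
        blocker = repressors.get(gene)
        if blocker is not None and blocker in cell:
            continue  # a repressor sits at the beginning of this gene
        produced.add(protein)
    return cell | produced

genes = {"G": "P", "Gk": "Pk"}   # gene G encodes P, gene Gk encodes Pk
interpreter_parts = {"P"}        # I uses P as one of its components
repressors = {"Gk": "Pi"}        # Pi represses Pk

print(sorted(step(set(), genes, interpreter_parts, repressors)))        # []
print(sorted(step({"P"}, genes, interpreter_parts, repressors)))        # ['P', 'Pk']
print(sorted(step({"P", "Pi"}, genes, interpreter_parts, repressors)))  # ['P', 'Pi']
```

The first call shows the circularity discussed in the text: without an already existing copy of P, the interpreter cannot run, so P is never produced. The third call shows the repression case, in which the interpreter runs but Pk is withheld.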

4. LABORATORY TOOLS FOR DETERMINING BIOLOGICAL DATA

We start with a warning that the explanations that follow are necessarily coarse. The goal of this section is to enable the reader to have some grasp of how biological information is gathered and of the degree of difficulty in obtaining it. This will be helpful in understanding the various types of data available and the programs needed to utilize and interpret that data.

Sequencers are machines capable of reading off a sequence of nucleotides in a strand of DNA in biological samples. The machines are linked to computers that display the DNA sequence being analyzed. The display also provides the degree of confidence in identifying each nucleotide. Present sequencers can produce over 300k base pairs per day at very reasonable costs. It is also instructive to remark that the inverse operation of sequencing can also be performed rather inexpensively: it is now common practice to order from biotech companies vials containing short sequences of nucleotides specified by a user.

A significant difficulty in obtaining an entire genome’s DNA is the fact that the sequences gathered in a wet lab consist of relatively short random segments that have to be reassembled using computer programs; this is referred to as the shotgun method of sequencing. Since DNA material contains many repeated subsequences, performing the assemblage can be tricky. This is due to the fact that a fragment can be placed ambiguously in two or more positions of the genome being assembled. (DNA assembly will be revisited in Section 7.7.)
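To make the reassembly problem concrete, here is a minimal sketch of a greedy overlap-merging strategy, together with a helper that exhibits the ambiguity caused by repeats. It is not the assembly method discussed in Section 7.7: the greedy heuristic, the function names, the minimum overlap of 3, and the tiny example genome are assumptions chosen for illustration, and the code ignores sequencing errors, reverse strands, and fragments contained inside other fragments.

```python
def overlap(a: str, b: str, min_len: int = 3) -> int:
    """Length of the longest suffix of a that is also a prefix of b."""
    for k in range(min(len(a), len(b)), min_len - 1, -1):
        if a.endswith(b[:k]):
            return k
    return 0

def greedy_assemble(fragments: list[str], min_len: int = 3) -> list[str]:
    """Repeatedly merge the pair of fragments with the largest overlap."""
    contigs = list(dict.fromkeys(fragments))  # drop exact duplicates
    while True:
        best_k, best_i, best_j = 0, -1, -1
        for i, a in enumerate(contigs):
            for j, b in enumerate(contigs):
                if i != j:
                    k = overlap(a, b, min_len)
                    if k > best_k:
                        best_k, best_i, best_j = k, i, j
        if best_k == 0:
            return contigs  # no remaining pair overlaps enough
        merged = contigs[best_i] + contigs[best_j][best_k:]
        contigs = [c for idx, c in enumerate(contigs)
                   if idx not in (best_i, best_j)]
        contigs.append(merged)

def placements(fragment: str, genome: str) -> list[int]:
    """All positions at which a fragment matches the genome exactly."""
    return [i for i in range(len(genome) - len(fragment) + 1)
            if genome.startswith(fragment, i)]

genome = "ATGCGTACGTTAGC"
reads = ["ATGCGTAC", "GTACGTTA", "GTTAGC"]
print(greedy_assemble(reads))     # ['ATGCGTACGTTAGC']
print(placements("CGT", genome))  # [3, 7]
```

The last line shows the difficulty described above: the short fragment CGT fits at two positions of the genome, so a read that short cannot be placed unambiguously.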

Recently, there has been a trend to attempt to identify proteins using mass spectroscopy. The technique involves determining genes and obtaining the corresponding proteins in purified form. These are cut into short sequences of amino acids (called peptides) whose molecular weights can be determined by a mass spectrograph. It is then computationally possible to infer the constituents of the peptides yielding those molecular weights. By using existing genomic sequences, one can attempt to reassemble the desired sequence of amino acids.
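The inference step mentioned here (from a measured weight back to possible constituents) can be illustrated by brute-force enumeration. The sketch below rests on several simplifying assumptions that are not in the text: it uses rounded (nominal) integer residue masses for only six amino acids, ignores the extra water molecule and the charge of a real peptide, and caps the peptide length; the function name `compositions` is hypothetical.

```python
from itertools import combinations_with_replacement

# Nominal (integer) residue masses in daltons for six amino acids.
RESIDUE_MASS = {"G": 57, "A": 71, "S": 87, "P": 97, "V": 99, "T": 101}

def compositions(target: int, max_len: int = 4) -> list[str]:
    """Amino acid multisets (as sorted strings) whose masses sum to target."""
    found = []
    for n in range(1, max_len + 1):
        for combo in combinations_with_replacement(sorted(RESIDUE_MASS), n):
            if sum(RESIDUE_MASS[r] for r in combo) == target:
                found.append("".join(combo))
    return found

print(compositions(215))  # ['AGS', 'GGT']
```

Even in this tiny setting a single weight is compatible with two different sets of constituents, and a set says nothing about their order. This is one way to see why existing genomic sequences are needed to settle on the actual sequence of amino acids.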

The 3D structure of proteins is mainly determined by X-ray crystallography and by nuclear magnetic resonance (NMR). Both these experiments are time consuming and costly. In X-ray crystallography, one attempts to infer the 3D position of each of the protein’s atoms from a projection obtained by passing X-rays through a crystallized sample of that protein. One of the major difficulties of the process is the obtaining of good crystals. X-ray experiments may require months and even years of laboratory work.

In the NMR technique, one obtains a number of matrices that express the fact that two atoms—that are not in the same backbone chain of the protein—are within a certain distance. One then deduces a 3D shape from those matrices. The NMR experiments are also costly. A great advantage is that they allow one to study mobile parts of proteins, a task which cannot be done using crystals.

The preceding paragraphs explain why DNA data is so much more abundant than 3D protein data.

Another type of valuable information obtainable through lab experiments is known as ESTs or expressed sequence tags. These are RNA chunks that can be gathered from a cell in minute quantities, but can easily be duplicated. Those chunks are very useful since they do not contain material that would be present in introns (see the Appendix for a definition). The availability of EST databases comprising many organisms allows bioinformaticians to infer the positions of introns and even deduce alternate splicing.

A powerful new tool available in biology is microarrays. They allow determining simultaneously the amount of mRNA production of thousands of genes. As mentioned earlier, this amount corresponds to gene-expression; it is presumed that the amount of RNA generated by the various genes of an organism establishes an estimate of the corresponding protein levels.

Microarray experiments require three phases. In the first phase one places thousands of different one-stranded chunks of RNA in minuscule wells on the surface of a small glass chip. (This task is not unlike that done by a jet printer using thousands of different colors and placing each of them in different spots of a surface.) The chunks correspond to the RNA known to have been generated by a given gene. The 2D coordinates of each of the wells are of course known. Some companies mass produce custom preloaded chips for cells of various organisms and sell them to biological labs.

The second phase consists of spreading—on the surface of the glass—genetic material (again one-stranded RNA) obtained by a cell experiment one wishes to perform. Those could be the RNAs produced by a diseased cell, or by a cell being subjected to starvation, high temperature, etc. The RNA already in the glass chip combines with the RNA produced by the cell one wishes to study. The degree of combined material obtained by complementing nucleotides is an indicator of how much RNA is being expressed by each one of the genes of the cell being studied.

The third phase consists of using a laser scanner connected to a computer. The apparatus measures the amount of combined material in each chip well and determines the degree of gene expression—a real number—for each of the genes originally placed on the chip. Microarray data is becoming available in huge amounts. A problem with this data is that it is noisy and its interpretation is difficult. Microarrays are becoming invaluable for biologists studying how genes interact with each other. This is crucial in understanding disease mechanisms.

The microarray approach has been extended to the study of protein expression. There exist chips whose wells contain molecules that can be bound to particular proteins.

Another recent development in experimental biology is the determination of protein interaction by what is called two-hybrid experiments. The goal of such experiments is to construct huge Boolean matrices, whose rows and columns represent the proteins of a genome. If a protein interacts with another, the corresponding position in the matrix is set to true. Again, one has to deal with thousands of proteins (genes); the data of their interactions is invaluable in reconstructing metabolic and signaling pathways.
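The Boolean matrix described above can be sketched as follows. The protein names and observed pairs are made up for illustration, the helper names are hypothetical, and treating interactions as symmetric is an assumption of this sketch rather than something stated in the text. At the scale of thousands of proteins one would likely prefer a sparse representation over the dense nested lists used here.

```python
def interaction_matrix(proteins: list[str],
                       observed_pairs: list[tuple[str, str]]) -> list[list[bool]]:
    """Build a square Boolean matrix; entry [i][j] is True if i and j interact."""
    index = {p: i for i, p in enumerate(proteins)}
    n = len(proteins)
    matrix = [[False] * n for _ in range(n)]
    for a, b in observed_pairs:
        i, j = index[a], index[b]
        matrix[i][j] = True
        matrix[j][i] = True  # assumption: an interaction is symmetric
    return matrix

def partners(protein: str, proteins: list[str],
             matrix: list[list[bool]]) -> list[str]:
    """List the proteins recorded as interacting with the given one."""
    row = matrix[proteins.index(protein)]
    return [p for p, hit in zip(proteins, row) if hit]

proteins = ["P1", "P2", "P3", "P4"]
matrix = interaction_matrix(proteins, [("P1", "P2"), ("P2", "P3")])
print(partners("P2", proteins, matrix))  # ['P1', 'P3']
print(partners("P4", proteins, matrix))  # []
```

Following chains of partners (P1 to P2 to P3 in this example) is the kind of traversal that reconstructing a pathway from such data would start from.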

A final experimental tool described in this section is the availability of libraries of variants of a given organism, yeast being a notable example. Each variant corresponds to cells having a single one of its genes knocked out. (Of course researchers are only interested in living cells since certain genes are vital to life.) These libraries enable biologists to perform experiments (say, using microarray) and deduce information about cell behavior and fault tolerance.

A promising development in experimental biology is the use of RNA-i (the i denoting interference). It has been found that when chunks of the RNA of a given gene are inserted in the nucleus of a cell, they may prevent the expression of that gene. This possibility is not dissimilar to that offered by libraries of knocked-out genes.

The above descriptions highlight the trend of molecular biology experiments being done by ordering components, or by having them analyzed by large biotech companies.

5. BIOLOGICAL DATA AVAILABLE

In a previous section, we mentioned that all the components of a living cell are 3D structures and that shape is crucial in understanding molecular interactions. A fundamental abstraction often made in biology is to replace the spatial 3D information specifying chemical bindings with a much simpler sequence of symbols: nucleotides or amino acids. In the case of DNA, we know that the helix is the underlying 3D structure.

Although it is much more convenient to deal with sequences of symbols than with complex 3D entities, the problem of shape determination remains a critical one in the case of RNA and proteins.

The previous section outlined the laboratory tools for gathering biological data. The vast majority of the existing information has been obtained through sequencing, and it is expressible by strings, that is, sequences of symbols. These sequences specify mostly nucleotides (genomic data) but there is also substantial information on sequences of amino acids.

Next in volume of available information are the results of microarray experiments. These can be viewed as very large, usually dense matrices of real numbers. These matrices may have thousands of rows and columns. That is also the case for the sparse Boolean matrices describing protein interactions.

The information about 3D structures of proteins pales in comparison to that available in sequence form. The Protein Data Bank (PDB) is the repository for all known three-dimensional protein structures.

In a recent search, I found that there are now about 26 billion base pairs (bp) representing the various genomes available in the server of the National Center for Biotechnology Information (NCBI). Besides the human genome with about 3 billion bp, many other species have their complete genome available there. These include several bacteria (e.g., E. coli) and higher organisms including yeast, worm, fruit fly, mouse, and plants (e.g., Arabidopsis).

The largest known gene in the NCBI server has about 20 million base pairs and the largest protein consists of about 34,000 amino acids. These figures give an idea of the lengths of the entities we have to deal with.

In contrast, the PDB has a catalogue of only 45,000 proteins specified by their 3D structure. These proteins originate from various organisms. The relatively meager protein data shows the enormous need for inferring protein shape from data available in the form of sequences. This is one of the major tasks facing biologists. But many others lie ahead.

The goal of understanding protein structure is only part of the task. Next we have to understand how proteins interact and form the metabolic and signaling pathways in the cell.

There is information available about metabolic pathways in simple organisms, and parts of those pathways are known for human cells. The formidable task is to put all the available information together so that it can be used to understand better the functioning of the human cell. That pursuit is called functional genomics.

The term genomics is used to denote the study of various genomes as entities having similar contents. In the past few years other terms ending with the suffixes -ome or -omics have been popularized. That explains proteomics (the study of all the proteins of a genome), transcriptome, metabolome, and so forth.

6. PRESENT GOALS OF BIOINFORMATICS

The present role of bioinformatics is to aid biologists in gathering and processing genomic data to study protein function. Another important role is to aid researchers at pharmaceutical companies in making detailed studies of protein structures to facilitate drug design. Typical tasks done in bioinformatics include:

—Inferring a protein’s shape and function from a given sequence of amino acids,

—Finding all the genes and proteins in a given genome,

—Determining sites in the protein structure where drug molecules can be attached.

To perform these tasks, one usually has to investigate homologous sequences or proteins for which genes have been determined and structures are available. Homology between two sequences (or structures) suggests that they have a common ancestor. Since those ancestors may well be extinct, one hopes that similarity at the sequence or structural level is a good indicator of homology.

It is important to keep in mind that sequence similarity does not always imply similarity in structure, and vice versa. As a matter of fact, it is known that two fairly dissimilar sequences of amino acids may fold into similar 3D structures.

Nevertheless, the search for similarity is central to bioinformatics. When given a sequence (nucleotides or amino acids) one usually performs a similarity search against databases that comprise all available genomes and known proteins. Usually, the search yields many sequences with varying degrees of similarity. It is up to the user to select those that may well turn out to be homologous.

In the next section we describe the various computer science algorithms that are frequently used by bioinformaticians.

7. ALGORITHMS FREQUENTLY USED IN BIOINFORMATICS

We recall that a major role of bioinformatics is to help infer gene function from existing data. Since that data is varied, incomplete, noisy, and covers a variety of organisms, one has to constantly resort to the biological principles of evolution to filter out useful information.

Based on the availability of the data and goals described in Sections 4 to 6, we now present the various algorithms that lead to a better understanding of gene function. They can be summarized as follows:

(1) Comparing Sequences. Given the huge number of sequences available, there is an urgent need to develop algorithms capable of comparing large numbers of long sequences. These algorithms should allow the deletion, insertion, and replacement of symbols representing nucleotides or amino acids, for such transmutations occur in nature.

(2) Constructing Evolutionary (Phylogenetic) Trees. These trees are often constructed after comparing sequences belonging to different organisms. Trees group the sequences according to their degree of similarity. They serve as a guide to reasoning about how these sequences have been transformed through evolution. For example, they infer homology from similarity, and may rule out erroneous assumptions that contradict known evolutionary processes.



(3) Detecting Patterns in Sequences. There are certain parts of DNA and amino acid sequences that need to be detected. Two prime examples are the search for genes in DNA and the determination of subcomponents of a sequence of amino acids (secondary structure). There are several ways to perform these tasks. Many of them are based on machine learning and include probabilistic grammars or neural networks.

(4) Determining 3D Structures from Sequences. The problems in bioinformatics that relate sequences to 3D structures are computationally difficult. The determination of RNA shape from sequences requires algorithms of cubic complexity. The inference of shapes of proteins from amino acid sequences remains an unsolved problem.

(5) Inferring Cell Regulation. The function of a gene or protein is best described by its role in a metabolic or signaling pathway. Genes interact with each other; proteins can also prevent or assist in the production of other proteins. The available approximate models of cell regulation can be either discrete or continuous. One usually distinguishes between cell simulation and modeling. The latter amounts to inferring the former from experimental data (say microarrays). This process is usually called reverse engineering.

(6) Determining Protein Function and Metabolic Pathways. This is one of the most challenging areas of bioinformatics and one for which there is not considerable data readily available. The objective here is to interpret human annotations for protein function and also to develop databases representing graphs that can be queried for the existence of nodes (specifying reactions) and paths (specifying sequences of reactions).

(7) Assembling DNA Fragments. Fragments provided by sequencing machines are assembled using computers. The tricky part of that assemblage is that DNA has many repetitive regions and the same fragment may belong to different regions. The algorithms for assembling DNA are mostly used by large companies (like the former Celera).

(8) Using Script Languages. Many of the above applications are already available on websites. Their usage requires scripting that provides data for an application, receives the results back, and then analyzes them.

The algorithms required to perform the above tasks are detailed in the following subsections. What differentiates bioinformatics problems from others is the huge size of the data and its (sometimes questionable) quality. That explains the need for approximate solutions.

It should be remarked that several of the problems in bioinformatics are constrained optimization problems. The solution to those problems is usually computationally expensive. One of the efficient known methods in optimization is dynamic programming. That explains why this technique is often used in bioinformatics. Other approaches like branch-and-bound are also used, but they are known to have higher complexity.

7.1. Comparing Sequences

From the biological point of view, sequence comparison is motivated by the fact that all living organisms are related by evolution. That implies that the genes of species that are closer to each other should exhibit similarities at the DNA level; one hopes that those similarities also extend to gene function.

The following definitions are useful in understanding what is meant by the comparison of two or more sequences. An alignment is the process of lining up sequences to achieve a maximal level of identity. That level expresses the degree of similarity between sequences. Two sequences are homologous if they share a common ancestor, which is not always easy to determine. The degree of similarity obtained by alignment can be useful in determining the possibility of homology between two sequences.



In biology, the sequences to be compared are either nucleotides (DNA, RNA) or amino acids (proteins). In the case of nucleotides, one usually aligns identical nucleotide symbols. When dealing with amino acids the alignment of two amino acids occurs if they are identical or if one can be derived from the other by substitutions that are likely to occur in nature.

An alignment can be either local or global. In the former, only portions of the sequences are aligned, whereas in the latter one aligns over the entire length of the sequences. Usually, one uses gaps, represented by the symbol “-”, to indicate that it is preferable not to align two symbols because in so doing, many other pairs can be aligned. In local alignments there are larger regions of gaps. In global alignments, gaps are scattered throughout the alignment.

A measure of likeness between two sequences is percent identity: once an alignment is performed we count the number of columns containing identical symbols. The percent identity is the ratio between that number and the number of symbols in the (longest) sequence. A possible measure or score of an alignment is calculated by summing up the matches of identical (or similar) symbols and counting gaps as negative.
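The two measures just defined can be illustrated with a minimal Python sketch. The function names and the default weights (1 for a match, −1 for a mismatch, −2 for a gap) are assumptions made for this illustration; the text at this point does not fix particular values.

```python
def percent_identity(row_a, row_b):
    """Percent identity of two aligned rows (equal length, '-' marks a gap)."""
    if len(row_a) != len(row_b):
        raise ValueError("aligned rows must have the same length")
    # Columns in which both rows carry the same non-gap symbol.
    identical = sum(1 for x, y in zip(row_a, row_b) if x == y and x != "-")
    # Lengths of the original (ungapped) sequences; the longest is the divisor.
    len_a = len(row_a.replace("-", ""))
    len_b = len(row_b.replace("-", ""))
    return 100.0 * identical / max(len_a, len_b)


def alignment_score(row_a, row_b, match=1, mismatch=-1, gap=-2):
    """Sum of column scores; the default weights are illustrative assumptions."""
    score = 0
    for x, y in zip(row_a, row_b):
        if x == "-" or y == "-":
            score += gap
        elif x == y:
            score += match
        else:
            score += mismatch
    return score
```

For the aligned rows `abcd` and `-b-d`, two of the four columns hold identical symbols, so the percent identity is 50 and, with the assumed weights, the score is −2.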

With these preliminary definitions in mind, we are ready to describe the algorithms that are often used in sequence comparison.

7.1.1. Pairwise Alignment. Many of the methods of pattern matching used in computer science assume that matches contain no gaps. Thus there is no match for the pattern bd in the text abcd. In biological sequences, gaps are allowed and an alignment of abcd with bd yields the representation:

a b c d
- b - d

Similarly, an alignment of abcd with buc yields:

a b - c d
- b u c -

The above implies that gaps can appear both in the text and in the pattern. Therefore there is no point in distinguishing texts from patterns. Both are called sequences. Notice that, in the above examples, the alignments maximize matches of identical symbols in both sequences. Therefore, sequence alignment is an optimization problem.

A similar problem exists when we attempt to automatically correct typing errors like character replacements, insertions, and deletions. Google and Word, for example, are able to handle some typing errors and display suggestions for possible corrections. That implies searching a dictionary for best matches.

An intuitive way of aligning two sequences is by constructing dot matrices. These are Boolean matrices representing possible alignments that can be detected visually. Assume that the symbols of a first sequence label the rows of the Boolean matrix and those of the second sequence label the columns. The matrix is initially set to zero (or false). An entry becomes a one (or true) if the labels of the corresponding row and column are identical.

Consider now two identical sequences. Initially, assume that all the symbols in a sequence are different. The corresponding dot matrix has ones in the diagonal indicating a perfect match. If the second sequence is a subsequence of the first, the dot matrix also shows a shorter diagonal line of ones indicating where matches occur.

The usage of dot matrices requires a visual inspection to detect large chunks of diagonals indicating potential common regions of identical symbols. Notice, however, that if the two sequences are long and contain symbols of a small vocabulary (like the four nucleotides in DNA) then noise occurs: there will be a substantial number of scattered ones throughout the matrix, and there may be several possible diagonals that need to be inspected to find the one that maximizes the number of symbol matches. Comparing multiple symbols instead of just two (one in a row, the other in a column) may reduce noise.
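The construction of a dot matrix, and the noise reduction obtained by comparing several symbols at once, can be sketched as follows. This is a minimal illustration; the names `dot_matrix`, `render`, and the `window` parameter are assumptions introduced here, not part of any particular package.

```python
def dot_matrix(seq_a, seq_b, window=1):
    """Boolean dot matrix; rows follow seq_a, columns follow seq_b.

    With window=1 an entry is True when the row and column symbols are
    identical. With window=k an entry is True only when the k symbols
    starting at that row and column all match, which removes isolated dots.
    """
    rows, cols = len(seq_a), len(seq_b)
    matrix = [[False] * cols for _ in range(rows)]
    for i in range(rows - window + 1):
        for j in range(cols - window + 1):
            matrix[i][j] = seq_a[i:i + window] == seq_b[j:j + window]
    return matrix


def render(matrix):
    """Text rendering for visual inspection: '*' for True, '.' for False."""
    return "\n".join("".join("*" if cell else "." for cell in row) for row in matrix)
```

Printing `render(dot_matrix("abcd", "bc"))` shows the short diagonal expected when the second sequence is contained in the first. With a four-letter alphabet, raising `window` trades sensitivity for fewer scattered dots.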



An interesting case that often occurs in biology is one in which a sequence contains repeated copies of its subsequences. That results in multiple diagonals, and again visual inspection is used to detect the best matches.

The time complexity of constructing dot matrices for two sequences of lengths m and n is m∗n. The space complexity is also m∗n. These values may well be intolerable for very large values of m and n. Notice also that, if we know that the sequences are very similar, we do not need to build the entire matrix. It suffices to construct the elements around the diagonal. Therefore, one can hope to achieve almost linear complexity in those cases. It will be seen later that the most widely used pairwise sequence alignment algorithm (BLAST) can be described in terms of finding diagonals without constructing an entire dot matrix.

The dot matrix approach is useful but does not yield precise measures of similarity among sequences. To do so, we introduce the notion of costs for the gaps, for exact matches, and for the fact that sometimes an alignment of different symbols is tolerated and considered better than introducing gaps.

We will now open a parenthesis to convince the reader of the advantages of using dynamic programming in finding optimal paths in graphs. Let us consider a directed acyclic graph (DAG) with possibly multiple entry nodes and a single exit node. Each directed edge is labeled with a number indicating the cost (or weight) of that edge in a path from the entry to the exit node. The goal is to find an optimal (maximal or minimal) path from an entry to the exit node.

The dynamic programming (DP) approach consists of determining the best path to a given node. The principle is simple: consider all the incoming edges ei to a node V. Each of these edges is labeled by a value vi indicating the weight of the edge ei. Let pi be the optimal values for the nodes that are the immediate predecessors of V. An optimal path from an entry point to V is the one corresponding to the maximum (or the minimum) of the quantities: p1 + v1, p2 + v2, p3 + v3, and so on.

Starting from the entry nodes one determines the optimal path connecting that node to its successors. Each successor node is then considered and processed as above. The time complexity of the DP algorithm for DAGs is linear in its number of nodes (more precisely, in the number of nodes plus edges). Notice that the algorithm determines a single value representing the total cost of the optimal path.

If one wished to determine the sequence of nodes in that path, then one would have to perform a second (backward) pass starting from the exit node and retrace one by one the nodes in that optimal path. The complexity of the backward pass is also linear. A word of caution is in order. Let us assume that there are several paths that yield the same total cost. Then the complexity of the backward pass could be exponential if all of them have to be enumerated!
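The forward and backward passes described above can be sketched in Python. The edge-list representation, the function name `best_path`, and the use of the standard library's `graphlib` (Python 3.9 or later) to order the nodes are choices made for this illustration only. The backward pass retraces a single optimal path, which keeps it linear even when several paths tie.

```python
from graphlib import TopologicalSorter  # standard library, Python 3.9+


def best_path(edges, exit_node, maximize=True):
    """Optimal path to exit_node in a DAG given as (source, target, weight) triples.

    Returns (total_cost, path). Entry nodes are the nodes with no incoming edge.
    """
    incoming = {}  # node -> list of (predecessor, weight)
    for source, target, weight in edges:
        incoming.setdefault(source, [])
        incoming.setdefault(target, []).append((source, weight))

    pick = max if maximize else min
    best = {}       # node -> optimal value p of a path from some entry to node
    came_from = {}  # node -> predecessor chosen on that optimal path

    # Forward pass: visit each node only after all of its predecessors.
    order = TopologicalSorter(
        {node: [p for p, _ in preds] for node, preds in incoming.items()}
    ).static_order()
    for node in order:
        if not incoming[node]:
            best[node], came_from[node] = 0, None  # entry node
            continue
        # The candidates are the quantities p_i + v_i from the text.
        value, pred = pick(
            ((best[p] + w, p) for p, w in incoming[node]), key=lambda c: c[0]
        )
        best[node], came_from[node] = value, pred

    # Backward pass: retrace one optimal path from the exit node.
    path, node = [], exit_node
    while node is not None:
        path.append(node)
        node = came_from[node]
    return best[exit_node], path[::-1]
```

With the hypothetical edges `[("a", "c", 2), ("b", "c", 5), ("c", "d", 1), ("a", "d", 4)]` and exit node `d`, the maximal path is b, c, d with cost 6, and the minimal path is a, c, d with cost 3.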

We will now show how to construct a DAG that can be used to determine an optimal pairwise sequence alignment. Let us assume that the cost of introducing a gap is g, the cost of matching two identical symbols is s, and the cost of tolerating the alignment of different symbols is d. In practice when matching nucleotide sequences it is common to use the weights g = −2, s = 1, and d = −1. That attribution of weights penalizes gaps most, followed by a tolerance of unmatched symbols; identical symbols induce the highest weight. The optimal path being sought is the one with a total maximum cost.

Consider now the following part of a DAG expressing the three choices dealing with a pair of symbols being aligned.

The horizontal and vertical arrows state that a gap may be introduced either in the top or in the bottom sequence. The diagonal indicates that the symbols will be aligned and the cost of that choice is either s (if the symbols match) or d (if they do not match).

Consider now a two-dimensional matrix organized as the previously described dot matrix, with its rows and columns being labeled by the elements of the sequences being aligned. The matrix entries are the nodes of a DAG, each node having the three outgoing directed edges as in the above representation. In other words the matrix is tiled with copies of the above subgraph. Notice that the bottommost row (and the rightmost column) consists only of a sequence of directed horizontal (vertical) edges labeled by g.

An optimal alignment amounts to determining the optimal path in the overall DAG. The single entry node is the matrix element with indices [0,0] and the single exit node is the element indexed by [m, n] where m and n are the lengths of the two sequences being aligned.

The DP measure of similarity of the pairwise alignment is a score denoting the sum of weights in the optimal path. With the weights mentioned earlier (g = −2, s = 1, and d = −1), the best score occurs when two identical sequences of length n are aligned; the resulting score is n, since the cost attributed to a diagonal edge is 1. The worst (unrealistic) case occurs when aligning a sequence with the empty sequence, resulting in a score of −2n. Another possible outcome is that different symbols become aligned since the resulting path yields better scores than those that introduce gaps. In that case, the score becomes −n.
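A minimal Python sketch of this DP, using the weights g = −2, s = 1, and d = −1 from the text, is given below. The function name and the table layout are assumptions of this illustration; the table is filled from node [0,0] using incoming edges, which is equivalent to the outgoing-edge description above. The backward pass returns only one of the possibly many optimal alignments.

```python
def global_alignment(a, b, g=-2, s=1, d=-1):
    """Score and one optimal global alignment of a and b (maximum total weight)."""
    m, n = len(a), len(b)
    # score[i][j]: best path weight from node [0,0] to node [i,j].
    score = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(1, m + 1):
        score[i][0] = i * g  # only gap edges along the border
    for j in range(1, n + 1):
        score[0][j] = j * g
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            diagonal = score[i - 1][j - 1] + (s if a[i - 1] == b[j - 1] else d)
            score[i][j] = max(diagonal, score[i - 1][j] + g, score[i][j - 1] + g)

    # Backward pass: retrace one optimal path from [m,n] to [0,0].
    top, bottom, i, j = [], [], m, n
    while i > 0 or j > 0:
        pair = s if i > 0 and j > 0 and a[i - 1] == b[j - 1] else d
        if i > 0 and j > 0 and score[i][j] == score[i - 1][j - 1] + pair:
            top.append(a[i - 1])
            bottom.append(b[j - 1])
            i, j = i - 1, j - 1
        elif i > 0 and score[i][j] == score[i - 1][j] + g:
            top.append(a[i - 1])
            bottom.append("-")
            i -= 1
        else:
            top.append("-")
            bottom.append(b[j - 1])
            j -= 1
    return score[m][n], "".join(reversed(top)), "".join(reversed(bottom))
```

Calling `global_alignment("abcd", "bd")` reproduces the alignment of abcd with bd shown earlier, and the three boundary cases of the paragraph above (n, −2n, and −n) can be checked with short inputs.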

Now a few words about the complexity of the DP approach for alignment. Since there are m∗n nodes in the DAG, the time and space complexity is quadratic when the two sequences have similar lengths. As previously pointed out, that complexity can become intolerable (exponential) if there exist multiple optimal solutions and all of them need to be determined. (That occurs in the backward pass.)

The two often-used algorithms for pairwise alignment are those developed by the pairs of co-authors Needleman–Wunsch and Smith–Waterman. They differ in the costs attributed to the outermost horizontal and vertical edges of the DAG. In the Needleman–Wunsch approach, one uses weights for the outermost edges that encourage the best overall (global) alignment. In contrast, the Smith–Waterman approach favors the contiguity of segments being aligned.
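The local variant can be sketched as it is commonly presented in textbooks: a score is never allowed to fall below zero, and the best cell anywhere in the table ends the alignment. This is one standard formulation, offered as an assumption; it is not necessarily the exact weighting of the outermost edges that the description above has in mind. The function name and the returned end positions are also choices of this sketch.

```python
def local_alignment_score(a, b, g=-2, s=1, d=-1):
    """Best local alignment score and the table cell (i, j) where it ends."""
    m, n = len(a), len(b)
    score = [[0] * (n + 1) for _ in range(m + 1)]
    best, best_end = 0, (0, 0)
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            diagonal = score[i - 1][j - 1] + (s if a[i - 1] == b[j - 1] else d)
            # The extra 0 lets an alignment restart anywhere at no cost.
            score[i][j] = max(0, diagonal, score[i - 1][j] + g, score[i][j - 1] + g)
            if score[i][j] > best:
                best, best_end = score[i][j], (i, j)
    return best, best_end
```

For the hypothetical inputs `xxACGTyy` and `zzACGTww`, the shared segment ACGT yields a score of 4 ending at cell (6, 6), while the differing flanks contribute nothing.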

Most of the textbooks mentioned in the references (e.g., Setubal and Meidanis [1997], Durbin et al. [1998], and Dwyer [2002]) contain good descriptions of using dynamic programming for performing pairwise alignments.

7.1.2. Aligning Amino Acids Sequences. The DP algorithm is applicable to any sequence provided the weights for comparisons and gaps are properly chosen. When aligning nucleotide sequences the previously mentioned weights yield good results. A more careful assessment of the weights has to be done when aligning sequences of amino acids. This is because the comparison between any two amino acids should take evolution into consideration.

Biologists have developed 20×20 triangular matrices that provide the weights for comparing identical and different amino acids as well as the weight that should be attributed to gaps. The two more frequently used matrices are known as PAM (Percent Accepted Mutation) and BLOSUM (Blocks Substitution Matrix). These matrices reflect the weights obtained by comparing the amino acid substitutions that have occurred through evolution. They are often called substitution matrices.

One usually qualifies those matrices by a number X, as in PAM X or BLOSUM X, that indicates how lenient the matrix is in estimating the difference between two amino acids. For PAM, higher values of X indicate more lenience; for BLOSUM the numbering runs the other way, and lower values of X are the more lenient ones. An analogy with the previously mentioned weights clarifies what is meant by lenience: a weight of 1 attributed to identical symbols and 0 attributed to different symbols is more lenient than retaining the weight of 1 for symbol identity and utilizing the weight −1 for nonidentity.



Many bioinformatics texts (e.g., Mount [2001] and Pevzner [2000]) provide detailed descriptions of how substitution matrices are computed.

7.1.3. Complexity Considerations and BLAST. The quadratic complexity of the DP-based algorithms renders their usage prohibitive for very large sequences. Recall that the present genomic database contains tens of billions of base pairs (nucleotides), and thousands of users accessing that database simultaneously would like to determine if a sequence being studied and made up of thousands of symbols can be aligned with existing data. That is a formidable problem!

The program called BLAST (Basic Local Alignment Search Tool) developed by the National Center for Biotechnology Information (NCBI) has been designed to meet that challenge. The best way to explain the workings of BLAST is to recall the approach using dot matrices. In BLAST the sequence, whose presence one wishes to investigate in a huge database, is split into smaller subsequences. The presence of those subsequences in the database can be determined efficiently (say by hashing and indexing).

BLAST then attempts to pursue further matching by extending the left and right contexts of the subsequences. The pairings that do not succeed well are abandoned and the best match is chosen as a result of the search. The functioning of BLAST can therefore be described as finding portions of the diagonals in a dot matrix and then attempting to determine the ones that can be extended as much as possible. It appears that such a technique yields practically linear complexity. The BLAST site handles about 90,000 searches per day. Its success demonstrates that excellent hacking has its place in computer science.
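The idea of indexing short subsequences and then extending the hits along a diagonal can be illustrated with a toy Python sketch. It is emphatically not the BLAST program: it extends only while symbols match exactly, whereas the real tool scores extensions and abandons weak ones. The function names and the word length `k` are assumptions of this illustration.

```python
from collections import defaultdict


def index_words(database, k):
    """Hash every length-k word of the database to its start positions."""
    index = defaultdict(list)
    for pos in range(len(database) - k + 1):
        index[database[pos:pos + k]].append(pos)
    return index


def seed_and_extend(query, database, k=3):
    """Longest exact diagonal run found by extending k-word seeds left and right.

    Returns (query_start, database_start, length), or None if no seed matches.
    """
    index = index_words(database, k)
    best = None
    for q in range(len(query) - k + 1):
        for dpos in index.get(query[q:q + k], []):
            # Extend the seed to the left along the diagonal.
            left_q, left_d = q, dpos
            while left_q > 0 and left_d > 0 and query[left_q - 1] == database[left_d - 1]:
                left_q -= 1
                left_d -= 1
            # Extend the seed to the right along the diagonal.
            right_q, right_d = q + k, dpos + k
            while (right_q < len(query) and right_d < len(database)
                   and query[right_q] == database[right_d]):
                right_q += 1
                right_d += 1
            hit = (left_q, left_d, right_q - left_q)
            if best is None or hit[2] > best[2]:
                best = hit
    return best
```

With the hypothetical query `GGACGTAC` and database `TTTACGTACCC`, the seeds ACG and TAC both extend to the same run ACGTAC, reported as `(2, 3, 6)`.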

BLAST allows comparisons of either nucleotide or amino acid sequences with those existing in the NCBI database. In the case of amino acids, the user is offered various options for selecting the applicable substitution matrices. Other input parameters are also available.

Among the important information provided by a BLAST search is the p-value associated with each of the sequences that match a user-specified sequence. A p-value is a measure of how much evidence we have against the null hypothesis. (The null hypothesis is that observations are purely the result of chance.) A very small p-value (say of the order of 10^−19) indicates that it is very unlikely that the sequence provided by the search is totally unrelated to the one provided by the user. The home page of BLAST is an excellent source for a tutorial and a wealth of other information (http://www.ncbi.nlm.nih.gov/BLAST/).

FASTA is another frequently used search program with a strategy similar to that of BLAST. The differences between BLAST and FASTA are discussed in many texts (e.g., Pevsner [2003]).

A topic that has attracted the attention of present researchers is the comparison between two entire genomes. That involves aligning sequences containing billions of nucleotides. Programs have been developed to handle these time-consuming tasks. Among these programs is Pipmaker [Schwartz et al. 2000]; a discussion of the specific problems of comparing two genomes is presented in Miller [2001]. An interesting method of entire genome comparison using suffix trees is described in Delcher et al. [1999].

7.1.4. Multiple Alignments. Let us assume that a multiple alignment is performed for a set of sequences. One calls the consensus sequence the one obtained by selecting for each column of the alignment the symbol that appears most often in that column.
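The definition of a consensus sequence translates directly into a few lines of Python. In this minimal sketch the function name is an assumption, gap symbols are counted like any other symbol, and ties are broken by first appearance in the column, which is an arbitrary choice.

```python
from collections import Counter


def consensus(aligned_rows):
    """Most frequent symbol of each column of a multiple alignment."""
    if len({len(row) for row in aligned_rows}) != 1:
        raise ValueError("all rows of the alignment must have the same length")
    # Ties are broken by first appearance in the column (an arbitrary choice).
    return "".join(Counter(column).most_common(1)[0][0] for column in zip(*aligned_rows))
```

For the three hypothetical rows `ACGT`, `ACGA`, and `TCGA`, the consensus is `ACGA`.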

Multiple alignments are usually performed using sequences of amino acids that are believed to have similar structures. The biological motivation for multiple alignments is to find common patterns that are conserved among all the sequences being considered. Those patterns may elucidate aspects of the structure of a protein being studied.

Trying to extend the dot matrix and DP approaches to the alignment of three or more sequences is a tricky proposition. One soon gets into difficult time-consuming problems. A three-dimensional dot matrix cannot be easily inspected visually. The DP approach has to consider DAGs whose nodes have seven outgoing edges instead of the three edges needed for pairwise alignment (see, e.g., Dwyer [2002]).

As dimensionality grows so does algorithmic complexity. It has been proved that multiple alignments have exponential complexity in the number of sequences to be aligned. That does not prevent biologists from using approximate methods. These approximate approaches are sometimes unsatisfactory; therefore, multiple alignment remains a worthy topic of research. Among the approximate approaches, we consider two.

The first is to reduce a multiple alignment to a series of pairwise alignments and then combine the results. One can use the DP approach to align all pairs of sequences and display the result in a triangular matrix form such that each entry [i, j] represents the score obtained by aligning sequence i with sequence j.

What follows is more an art than a science. One can select a center sequence C as the one that yields a maximum sum of pairwise scores with all others. Other sequences are then aligned with C following the empirical rule: once a gap is introduced it is never removed. As in the case of pairwise alignments, one can obtain global or local alignments. The former attempts to obtain an alignment with maximum score regardless of the positions of the symbols. In contrast, local alignments favor contiguity of matched symbols.
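The selection of the center sequence C from the triangular matrix of pairwise scores can be sketched as follows. The dictionary representation of the triangle and the function name are assumptions of this illustration, and the subsequent step of aligning the other sequences with C is not shown.

```python
def center_sequence(n, pair_scores):
    """Index of the sequence with the largest sum of pairwise scores.

    pair_scores holds the upper triangle: pair_scores[(i, j)] for i < j.
    """
    totals = [0] * n
    for (i, j), value in pair_scores.items():
        # Each pairwise score counts toward both of its sequences.
        totals[i] += value
        totals[j] += value
    return max(range(n), key=lambda idx: totals[idx])
```

With the hypothetical scores `{(0, 1): 3, (0, 2): -1, (1, 2): 4}`, the sums are 2, 7, and 3, so sequence 1 is chosen as the center.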

Another approach for performing multiple alignments is using Hidden Markov Models (HMMs), which are covered in Section 7.3.2.

CLUSTAL and its variants are software packages often used to produce multiple alignments. As in the case of pairwise alignments these packages offer capabilities of utilizing substitution matrices like BLOSUM or PAM. A description of CLUSTAL W appears in Thompson et al. [1994].

7.1.5. Pragmatic Aspects of Alignments. An important aspect of providing results for sequence alignments is their presentation. Visual inspection is crucial in obtaining meaningful interpretations of those results. The more elaborate packages that perform alignments use various colors to indicate regions that are conserved and provide statistical data to assess the confidence level of the results.

Another aspect worth mentioning is the variety of formats that are available for input and display of sequences. Some packages require specific formats and, in many cases, it is necessary to translate from one format to another.

7.2. Phylogenetic Trees

Since evolution plays a key role in biology, it is natural to attempt to depict it using trees. These are referred to as phylogenetic trees: their leaves represent various organisms, species, or genomic sequences; an internal node P stands for an abstract organism (species, sequence) whose existence is presumed and whose evolution led to the direct descendants on the branches emanating from P.

A motivation for depicting trees is to express, in graphical form, the outcome of multiple alignments by the relationships that exist between pairs or groups of sequences. These trees may reveal evolutionary inconsistencies that have to be resolved. In that sense the construction of phylogenetic trees validates or invalidates conjectures made about possible ancestors of a group of organisms.

Consider a multiple alignment: among its sequences one can select the two whose pairwise score yields the highest value. We then create an abstract node representing the direct ancestor of the two sequences.

A tricky step then is to reconstruct, among several possible sequences, one that best represents its children. This requires both ingenuity and intuition. Once the choice is made, the abstract ancestor sequence replaces its two children and the algorithm continues recursively until a root node is determined. The result is a binary tree whose root represents the primordial sequence that is assumed to have generated all the others. We will soon revisit this topic.

There are several types of trees used in bioinformatics. Among them, we mention the following:

(1) Unrooted trees are those that specify distances (differences) between species. The length of a path between any two leaves represents the accumulated differences.

(2) Cladograms are rooted trees in which the branches’ lengths have no meaning; the initial example in this section is a cladogram.

(3) Phylograms are extended cladograms in which the length of a branch quantifies the number of genetic transformations that occurred between a given node and its immediate ancestor.

(4) Ultrametric trees are phylograms in which the accumulated distance from the root to each of the leaves is quantified by the same number; ultrametric trees are therefore the ones that provide most information about evolutionary changes. They are also the most difficult to construct.

The above definitions suggest establishing some sort of molecular clock in which mutations occur at some predictable rate and there exists a linear relationship between time and number of changes. These rates are known to be different for different organisms and even for the various cell components (e.g., DNA and proteins). That shows the magnitude and difficulty of establishing correct phylogenies.

An algorithm frequently used to construct trees is called UPGMA (for Unweighted Pair Group Method using Arithmetic averages). Let us reconsider the initial example of multiple alignments and assume that one can quantify the distance between any two sequences from their pairwise alignment. (The DP score of those alignments could yield the information about distances: the higher the score, the lower the distance between the sequences.) The various distances can be summarized in triangular matrix form.

The UPGMA algorithm is similar to the one described in the initial example. Consider two elements E1 and E2 having the lowest distance between them. They are grouped into a new element (E1, E2). An updated matrix is constructed in which the new distances to the grouped elements are the averages of the previous distances to E1 and E2. The algorithm continues until all nodes are collapsed into a single node. Note that if the original matrix contains many identical small entries there would be multiple solutions and the results may be meaningless. In bioinformatics, as in any field, one has to exercise care and judgment in interpreting program output.
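The grouping procedure can be sketched in Python as follows. The representation (distances keyed by `frozenset` pairs, merged groups as nested tuples) and the function name are assumptions of this illustration. The update uses an average weighted by group size, the usual UPGMA rule, which reduces to the plain average of the two previous distances when E1 and E2 are single elements. Ties are resolved by taking the first lowest pair found, so the caution above about multiple solutions applies.

```python
def upgma(names, dist):
    """Cluster names by repeatedly merging the two closest groups.

    dist maps frozenset({a, b}) to the distance between elements a and b.
    Returns a nested tuple describing the merge order.
    """
    size = {name: 1 for name in names}
    d = dict(dist)
    clusters = list(names)
    while len(clusters) > 1:
        # Pick the pair with the lowest distance (first one found on ties).
        e1, e2 = min(
            ((a, b) for i, a in enumerate(clusters) for b in clusters[i + 1:]),
            key=lambda pair: d[frozenset(pair)],
        )
        merged = (e1, e2)
        size[merged] = size[e1] + size[e2]
        for other in clusters:
            if other not in (e1, e2):
                # Size-weighted average of the previous distances to e1 and e2.
                d[frozenset((merged, other))] = (
                    size[e1] * d[frozenset((e1, other))]
                    + size[e2] * d[frozenset((e2, other))]
                ) / size[merged]
        clusters = [c for c in clusters if c not in (e1, e2)] + [merged]
    return clusters[0]
```

For four hypothetical elements in which A and B are closest, C is next, and D is farthest from all, the result is the nested grouping `("D", ("C", ("A", "B")))`.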

The notion of parsimony is often invokedin constructing phylograms, rooted treeswhose branches are labeled by the numberof evolutionary steps. Parsimony is basedon the hypothesis that mutations occurrarely. Consequently, the overall numberof mutations assumed to have occurredthroughout evolutionary steps ought to beminimal. If one considers the change of asingle nucleotide as a mutation, the prob-lem of constructing trees from sequencesbecomes hugely combinatorial.

An example illustrates the difficulty. Let us assume an unrooted tree with two types of labels. Sequences label the nodes and numbers label the branches. The numbers specify the mutations (symbol changes within a given position of the sequence) occurring between the sequences labeling adjacent nodes.

The problem of constructing a tree using parsimony is: given a small number of short nucleotide sequences, place them in the nodes and leaves of an unrooted tree so that the overall number of mutations is minimal.
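To make the objective concrete, the following sketch computes the quantity that parsimony seeks to minimize for one fully labeled tree. It assumes, for illustration only, that the tree is given as a list of edges whose endpoints are sequences of equal length, and that a mutation is a single-position symbol change; the names hamming and tree_cost are hypothetical.

```python
def hamming(s, t):
    """Number of positions at which two equal-length sequences differ."""
    if len(s) != len(t):
        raise ValueError("sequences must have the same length")
    return sum(a != b for a, b in zip(s, t))

def tree_cost(edges):
    """Total number of mutations over all branches of a labeled tree."""
    return sum(hamming(u, v) for u, v in edges)

# A star-shaped unrooted tree: one inner node "acga" joined to three leaves.
edges = [("acga", "acgt"), ("acga", "tcga"), ("acga", "acgg")]
cost = tree_cost(edges)   # each branch carries one mutation
```

Scoring one labeled tree is cheap; the difficulty lies in the number of tree shapes and inner-node labelings over which this cost would have to be minimized.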

One can imagine a solution in which all possible trees are generated, their nodes labeled, and the tree with minimal overall mutations is chosen. This approach is of course prohibitive for large sets of long sequences, and it is another example of the ubiquity of difficult combinatorial optimization problems in bioinformatics. Mathematicians and theoretical computer scientists have devoted considerable effort to solving these types of problems efficiently (see, e.g., Gusfield [1997]).

We end this section by presenting an interesting recent development in attempting to determine the evolutionary trees for entire genomes using data compression [Bennett et al. 2003]. Let us assume that there is a set of long genomic sequences that we want to organize as a cladogram. Each sequence often includes a great deal of intergenic (noncoding) DNA material whose function is still not well understood. The initial example in this section is the basis for constructing the cladogram.

Given a text T and its compressed form C, the ratio r = |C|/|T| (where |α| is the length of the sequence α) expresses the degree of compression that has been achieved by the program. The smaller the ratio, the more compressed C is. Data compressors usually operate by using dictionaries to replace commonly used words by pointers to words in the dictionary.

If two sequences S1 and S2 are very similar, it is likely that their respective r ratios are close to each other. Assume now that all the sequences in our set have been compressed and the ratios r are known. It is then easy to construct a triangular matrix whose entries specify the differences in compression ratios between any two sequences.

As in the UPGMA algorithm, we consider the smallest entry and construct the first binary node representing the pair of sequences Si and Sj that are most similar. We then update the matrix by replacing the rows corresponding to Si and Sj by a row representing the combination of Si with Sj. The compression ratio for the combined sequences can be taken to be the average between the compression ratios of Si and Sj. The algorithm proceeds as before by searching for the smallest entry and so on, until the entire cladogram is constructed.
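The first two steps, computing the ratios r and filling the triangular matrix, can be sketched as follows. The sketch uses zlib from the Python standard library as a stand-in for a dictionary-based compressor; the function names and the sample sequences are illustrative assumptions, and the ratio difference used here is the simple measure described in the text, not necessarily the distance used in the cited work.

```python
import zlib
from itertools import combinations

def compression_ratio(sequence):
    """r = |C| / |T| for a nucleotide string, using zlib as the compressor."""
    raw = sequence.encode("ascii")
    if not raw:
        raise ValueError("cannot compress an empty sequence")
    return len(zlib.compress(raw, 9)) / len(raw)

def ratio_differences(sequences):
    """Triangular matrix of |r_i - r_j|, keyed by frozenset({name_i, name_j})."""
    ratios = {name: compression_ratio(seq) for name, seq in sequences.items()}
    return {
        frozenset((a, b)): abs(ratios[a] - ratios[b])
        for a, b in combinations(ratios, 2)
    }

# Hypothetical toy sequences; real genomic inputs would be far longer.
genomes = {"x": "acgt" * 250, "y": "acgt" * 250, "z": "a" * 1000}
matrix = ratio_differences(genomes)
```

The smallest entry of matrix identifies the first pair to be joined; the averaging step then proceeds as in the clustering procedure described above.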

An immense advantage of the data compression approach is that it will consider as very similar two sequences αβγδ and αδγβ, where α, β, γ, and δ are long subsequences that have been swapped around. This is because the two sequences are likely to have comparable compression ratios. Genome rearrangements occur during evolution and could be handled by using the data compression approach.

Finally, we should point out the existence of horizontal transfers in molecular biology. This term implies that the DNA of a given organism can be modified by the inclusion of foreign DNA material that cannot be explained by evolutionary arguments. That occurrence may possibly be handled using the notion of similarity derived from data compression ratios.

A valuable reference on phylogenetic trees is the recent text by Felsenstein [2003]. It includes a description of PHYLIP (Phylogenetic Inference Package), a frequently used software package developed by Felsenstein for determining trees expressing the relationships among multiple sequences.

7.3. Finding Patterns in Sequences

It is frequently the case in bioinformatics that one wishes to delimit parts of sequences that have a biological meaning. Typical examples are determining the locations of promoters, exons, and introns in RNA, that is, gene finding, or detecting the boundaries of α-helices, β-sheets, and coils in sequences of amino acids. There are several approaches for performing those tasks. They include neural nets, machine learning, and grammars, especially variants of grammars called probabilistic [Wetherell 1980].

In this subsection, we will deal with two such approaches. One uses grammars and parsing. The other, called Hidden Markov Models or HMMs, is a probabilistic variant of parsing using finite-state grammars.

It should be remarked that the recent capabilities of aligning entire genomes (see Section 7.1.3) also provide a means for gene finding in new genomes: assuming that all the genes of a genome G1 have been determined, a comparison with the genome G2 should reveal likely positions for the genes in G2.

7.3.1. Grammars and Parsing. Chomsky’s language theory is based on grammar rules used to generate sentences. In that theory, a nonterminal is an identifier naming groups of contiguous words that may have subgroups identified by other nonterminals.

In the Chomsky hierarchy of grammars and languages, the finite-state (FS) model is the lowest. In that case, a nonterminal corresponds to a state in a finite-state automaton. In context-free grammars one can specify a potentially infinite number of states. Context-free grammars (CFG) allow the description of palindromes or matching parentheses, which cannot be described or generated by finite-state models.

Higher than the context-free languages are the so-called context-sensitive ones (CSL). Those can specify repetitions of sequences of words like ww, where w is any sequence using a vocabulary. These repetitions cannot be described by CFGs.

Parsing is the technique of retracing the generation of a sentence using the given grammar rules. The complexity of parsing depends on the language or grammar being considered. Deterministic finite-state models can be parsed in linear time. The worst-case parsing complexity of CF languages is cubic. Little is known about the complexity of general CS languages, but parsing of their strings can be done in finite time.

The parse of sentences in a finite-state language can be represented by the sequence of states taken by the corresponding finite-state automaton when it scans the input string. A tree conveniently represents the parse of a sentence in a context-free language. Finally, one can represent the parse of a sentence in a CSL by a graph. Essentially, an edge of the graph denotes the symbols (or nonterminals) that are grouped together.

This prelude is helpful in relating language theory with biological sequences. Palindromes and repetitions of groups of symbols often occur in those sequences and they can be given a semantic meaning. Searls [1992, 2002] has been a pioneer in relating linguistics to biology and his papers are highly recommended.

All the above grammars, including finite-state, can generate ambiguous strings, and ambiguity and nondeterminism are often present when analyzing biological sequences. In ambiguous situations, as in natural language, one is interested in the most likely parse, and that parse can be determined by using probabilities and contexts. In biology, one can also use energy considerations and dynamic programming to disambiguate multiple parses of sequences.

There is an important difference between linguistics as used in natural language processing and linguistics applied to biology. The sentences or speech utterances in natural language usually amount to relatively few words. In biology, we have to deal with thousands!

It would be wonderful if we could produce a grammar defining a gene, as a nonterminal in a language defining DNA. But that is an extremely difficult task. Similarly, it would be very desirable to have a grammar expressing protein folds. In that case, a nonterminal would correspond to a given structure in 3D space. As in context-sensitive languages, the parse (a graph) would indicate the subcomponents that are close together.

It will be seen later (Section 7.4.1) that CFGs can be conveniently used to map RNA sequences into 2D structures. However, it is doubtful that practical grammars would exist for detecting genes in DNA or determining tertiary structure of proteins.

In what follows, we briefly describe the types of patterns that are necessary to detect genes in DNA. A nonterminal G, defined by the rules below, can roughly describe the syntax of genes:

    G → P R
    P → N
    R → E I R | E
    E → N
    I → gt N ag,

where N denotes a sequence of nucleotides a, c, g, t; E is an exon, I an intron, R a sequence of alternating exons and introns, and P is a promoter region, that is, a heading announcing the presence of the gene. In this simplified grammar, the markers gt and ag are delimiters for introns.

Notice that it is possible to transform the above CFG into an equivalent FSG since there is a regular expression that defines the above language. But the important remark is that the grammar is highly ambiguous since the markers gt or ag could appear anywhere within an exon, an intron, or a promoter region. Therefore, the grammar is descriptive but not usable in constructing a parser.
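The two remarks above can be illustrated with a short sketch. It assumes, as a simplification not stated in the text, that N may be empty; the pattern GENE and the helper candidate_introns are hypothetical names. The regular expression transcribes the grammar rule by rule (promoter, then zero or more exon-intron pairs, then a final exon), and the helper lists every gt...ag span that the grammar would accept as an intron.

```python
import re

# G -> P R, with P and E both nucleotide runs and I = gt N ag.
GENE = re.compile(r"[acgt]*(?:[acgt]*gt[acgt]*ag)*[acgt]*")

def candidate_introns(seq):
    """All (start, end) spans such that seq[start:end] has the form gt...ag."""
    starts = [i for i in range(len(seq) - 1) if seq[i:i + 2] == "gt"]
    ends = [j + 2 for j in range(len(seq) - 1) if seq[j:j + 2] == "ag"]
    return [(s, e) for s in starts for e in ends if e >= s + 4]

print(GENE.fullmatch("ccgtaaagcc") is not None)  # the string is in the language
print(candidate_introns("ccgtaaagcc"))           # one possible intron
print(candidate_introns("gtagag"))               # two competing introns
```

Because every nucleotide string is accepted, and a string may contain many overlapping gt...ag spans, the pattern by itself says nothing about which reading is the biologically relevant one.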

One could introduce further constraints about the nature of promoters, and require that the lengths of introns and exons adhere to certain bounds and that the combined lengths of all the exons be a multiple of three, since a gene is transcribed and spliced to form a sequence of triplets (codons).

Notice that even if one transforms the above rules with constraints into finite-state rules, the ambiguities will remain. The case of alternative splicing bears witness to the presence of ambiguities. The alternation of exons and introns can be interpreted in many different ways, thus accounting for the fact that a given gene may generate alternate proteins depending on context. Furthermore, in biology there are exceptions applicable to most rules. All this implies that probabilities will have to be introduced. This is a good motivation for the need for probabilistic grammars, as shown in the following section.

7.3.2. Hidden Markov Models (HMMs). HMMs are widely used in biological sequence analysis. They originated in, and still play a significant role in, speech recognition [Rabiner 1989].

HMMs can be viewed as variants of probabilistic or stochastic finite-state transducers (FSTs). In an FST, the automaton changes states according to the input symbols being examined. On a given state, the automaton also outputs a symbol. Therefore, FSTs are defined by sets of states, transitions, and input and output vocabularies. There is as usual an initial state and one or more final states.

In a probabilistic FST, the transitions are specified by probabilities denoting the chance that a given state will be changed to a new one upon examining a symbol in the input string. Obviously, the probabilities of transition emanating from a given state for a given symbol have to add up to 1. The automata that we are dealing with can be and usually are nondeterministic. Therefore, upon examining a given input symbol, the transition depends on the specified probabilities.

As in the case of an FST, an output is produced upon reaching a new state. An HMM is a probabilistic FST in which there is also a set of pairs [p, s] associated to each state; p is a probability and s is a symbol of the output vocabulary. The sum of the p’s in each set of pairs within a given state also has to equal 1. One can assume that the input vocabulary for an HMM consists of a unique dummy symbol (say, the equivalent of an empty symbol). Actually, in the HMM paradigm, we are solely interested in state transitions and output symbols. As in the case of finite-state automata, there is an initial state and a final state.

Upon reaching a given state, the HMM automaton produces the output symbol s with a probability p. The p’s are called emission probabilities. As we have described it so far, the HMM behaves as a string generator.

The following example, inspired by Durbin et al. [1998], is helpful in understanding the algorithms involved in HMMs.

Assume we have two coins: one, which is unbiased, the other biased. We will use the letters F (fair) for the former and L (loaded) for the latter. When tossed, the L coin yields Tails 75% of the time. The two coins are indistinguishable from each other in appearance.

Now imagine the following experiment: the person tossing the coins uses only one coin at a given time. From time to time, he alternates between the fair and the crooked coin. However, we do not know at a given time which coin is being used (hence, the term hidden in HMM). But let us assume that the transition probabilities of switching coins are known. The transition from F to L has probability u, and the transition from L to F has probability v.

Let F be the state representing the usage of the fair coin and L the state representing the usage of the loaded coin. The emission probabilities for the F state are obviously 1/2 for Heads and 1/2 for Tails. Let us assume that the corresponding probabilities while in L are 3/4 for Tails and 1/4 for Heads.

Let [r, O] denote “emission of the symbol O with probability r” and {S, [r, O], w, S′} denote the transition from state S to state S′ with probability w. In our particular example, we have:

    {F, [1/2, H], u, L}
    {F, [1/2, T], u, L}
    {F, [1/2, H], 1 − u, F}
    {F, [1/2, T], 1 − u, F}

    {L, [3/4, T], v, F}
    {L, [1/4, H], v, F}
    {L, [3/4, T], 1 − v, L}
    {L, [1/4, H], 1 − v, L}.

Let us assume that both u and v are small, that is, one rarely switches from one coin to another. Then a generator simulating the above HMM could produce the string:

. . . HTHTTHTT . . . . . . HTTTTTTTHT . . .

. . . FFFFFFFF . . . . . . LLLLLLLLLLL . . . .

The sequence of states below each emitted symbol indicates the present state F or L of the generator.
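A generator of this kind is easy to sketch. The version below is only an illustration: the function name generate, the choice of F as the initial state, and the use of a seeded random.Random object are assumptions made for the example and are not part of the model described above.

```python
import random

def generate(n, u, v, rng):
    """Emit n tosses from the two-coin HMM; return (tosses, states) as strings.

    u is the probability of switching F -> L, v that of switching L -> F.
    """
    tails_probability = {"F": 0.5, "L": 0.75}   # emission probabilities
    state = "F"                                 # assumed initial state
    tosses, states = [], []
    for _ in range(n):
        states.append(state)
        tosses.append("T" if rng.random() < tails_probability[state] else "H")
        # Decide whether to switch coins before the next toss.
        if rng.random() < (u if state == "F" else v):
            state = "L" if state == "F" else "F"
    return "".join(tosses), "".join(states)

tosses, states = generate(40, 0.05, 0.05, random.Random(2004))
print(tosses)
print(states)
```

With small u and v the second line shows long runs of F's and L's, in the manner of the strings displayed above.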

The main usage of HMMs is in the reverse problem: recognition or parsing. Given a sequence of H’s and T’s, attempt to determine the most likely corresponding state sequence of F’s and L’s.

We now pause to mention an often-neglected characteristic of nondeterministic and even ambiguous finite-state automata (FSA). Given an input string accepted by the automaton, it is possible to generate a directed acyclic graph (DAG) expressing all possible parses (sequences of transition states). The DAG is a beautifully compact form to express all those parses. The complexity of the DAG construction is O(n · |S|), in which n is the size of the input string and |S| is the number of states. If |S| is small, then the DAG construction is linear!

Let us return to the biased-coin example. Given an input sequence of H’s and T’s produced by an HMM generator and also the numeric values for the transition and emission probabilities, we could generate a labeled DAG expressing all possible parses for the given input string. The labels of the edges and nodes of the graph correspond to the transition and emission probabilities.

To determine the optimal parse (path), we are therefore back to dynamic programming (DP) as presented in the section on alignments (7.1). The DP algorithm that finds the path of maximum likelihood in the DAG is known as the Viterbi algorithm: given an HMM and an input string it accepts, determine the most likely parse, that is, the sequence of states that best represents the parsing of the input string.
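For the two-coin model, the Viterbi recurrence can be sketched as follows. The sketch works with logarithms to avoid numerical underflow on long strings; the function name, the equal initial probabilities for F and L, and the hard-coded emission table are assumptions made for illustration.

```python
import math

def viterbi(tosses, u, v):
    """Most likely sequence of F/L states for a string of H/T tosses."""
    if not tosses:
        return ""
    states = "FL"
    emit = {"F": {"H": 0.5, "T": 0.5}, "L": {"H": 0.25, "T": 0.75}}
    trans = {"F": {"F": 1 - u, "L": u}, "L": {"F": v, "L": 1 - v}}

    def log(p):
        return math.log(p) if p > 0 else float("-inf")

    # Assumed initial distribution: either coin with probability 1/2.
    score = {s: log(0.5) + log(emit[s][tosses[0]]) for s in states}
    back = []                      # back[k][s] = best predecessor of s
    for symbol in tosses[1:]:
        prev, score, pointers = score, {}, {}
        for s in states:
            best = max(states, key=lambda q: prev[q] + log(trans[q][s]))
            pointers[s] = best
            score[s] = prev[best] + log(trans[best][s]) + log(emit[s][symbol])
        back.append(pointers)
    # Trace the best path backwards from the best final state.
    path = [max(states, key=score.get)]
    for pointers in reversed(back):
        path.append(pointers[path[-1]])
    return "".join(reversed(path))

print(viterbi("HTHHTHTHTTTTTTTTTTTT", 0.1, 0.1))
```

Each input symbol requires work proportional to the square of the number of states, so for a fixed model the running time grows linearly with the length of the string.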

A biological application closely related to the coin-tossing example is the determination of GC islands in DNA. Those are regions of DNA in which the nucleotides G and C appear in a higher frequency than the others. The detection of GC islands is relevant since genes usually occur in those regions.

An important consideration not yet discussed is how to determine the HMM’s transition and emission probabilities. To answer that question, we have to enter the realm of machine learning.

Let us assume the existence of a learning set in which the sequence of tosses is annotated by very good guesses of when the changes of coins occurred. If that set is available, one can compute the transition and emission probabilities simply by ratios of counts. This learning is referred to as supervised learning since the user provides the answers (states) to each sequence of tosses.
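The "ratios of counts" can be sketched directly. In the illustration below, the learning set is a single string of tosses with a parallel string of annotated states; the function name estimate and the convention of assigning probability 0 to events from a state that never occurs are assumptions of the sketch.

```python
from collections import Counter

def estimate(tosses, states):
    """Estimate (transition, emission) tables from an annotated toss sequence."""
    transition_counts = Counter(zip(states, states[1:]))  # (from, to) pairs
    emission_counts = Counter(zip(states, tosses))        # (state, symbol) pairs

    def normalise(counts, rows, columns):
        table = {}
        for r in rows:
            total = sum(counts[(r, c)] for c in columns)
            table[r] = {
                c: (counts[(r, c)] / total if total else 0.0) for c in columns
            }
        return table

    return (normalise(transition_counts, "FL", "FL"),
            normalise(emission_counts, "FL", "HT"))

# A tiny hypothetical learning set: tosses and the coin used for each toss.
transition, emission = estimate("HTTT", "FFLL")
print(transition["F"]["L"])   # one of the two transitions out of F goes to L
print(emission["L"]["T"])     # both tosses made in state L were Tails
```

Such a small learning set gives crude estimates, of course; in practice one would want many annotated tosses, and possibly pseudocounts to avoid zero probabilities.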



An even more ambitious question is: can one determine the probabilities without having to annotate the input string? The answer is yes, with reservations. First, one has to suspect the existence of different states and the topology of the HMM; furthermore, the generated probabilities may not be of practical use if the data is noisy. This approach is called unsupervised learning and the corresponding algorithm is called Baum-Welch.

The interested reader is strongly encouraged to consult the book by Durbin et al. [1998], where the algorithms of Viterbi and Baum-Welch (probability generator) are explained in detail. A very readable paper by Krogh [1998] is also recommended. That paper describes interesting HMM applications such as multiple alignments and gene finding. Many applications of HMMs in bioinformatics consist of finding subsequences of nucleotides or amino acids that have biological significance. These include determining promoter or regulatory sites, and protein secondary structure.

It should be remarked that there is also a theory for stochastic context-free grammars; those grammars have been used to determine RNA structure. That topic is discussed in the following section.

7.4. Determining Structure

From the beginning of this article, we have reminded the reader of the importance of structure in biology and its relation to function. In this section, we review some of the approaches that have been used to determine 3D structure from linear sequences. A particular case of structure determination is that of RNA, whose structure can be approximated in two dimensions. Nevertheless, it is known that 3D knot-like structures exist in RNA.

This section has two subsections. In the first, we cover some approaches available to infer 2D representations from RNA sequences. In the second, we describe one of the most challenging problems in biology: the determination of the 3D structure of proteins from sequences of amino acids. Both problems deal with minimizing energy functions.

7.4.1. RNA Structure. It is very convenient to describe the RNA structure problem in terms of parsing strings generated by context-free grammars (CFG). As in the case of finite-state automata used in HMMs, we have to deal with highly ambiguous grammars. The generated strings can be parsed in multiple ways and one has to choose an optimal parse based on energy considerations.

RNA structure is determined by the attractions among its nucleotides: A (adenine) attracts U (uracil) and C (cytosine) attracts G (guanine). These nucleotides will be represented using lowercase letters. The CFG rules:

S → aSu/uSa/ε

generate palindrome-like sequences of u’s and a’s of even length. One could map this palindrome to a 2D representation in which each a in the left of the generated string matches the corresponding u in the right part of the string and vice versa. In this particular case, the number of matches is maximal.

This grammar is nondeterministic since a parser would not normally know where the middle of the string to be parsed lies. The grammar becomes highly ambiguous if we introduce a new nonterminal N generating any sequence of a’s and u’s.

S → aSu/uSa/N
N → aN/uN/ε.

Now the problem becomes much harder since any string admits a very large number of parses and we have to choose among all those parses the one that matches most a’s with u’s and vice versa. The corresponding 2D representation of that parse is what is called a hairpin loop.

The parsing becomes even more complex if we introduce the additional rule:

S → SS.

That rule is used to depict bifurcations of RNA material. For example, two hairpin structures may be formed, one corresponding to the first S, the second to the second S. The above rule exponentially increases the number of parses.

An actual grammar describing RNA should also include the rules specifying the attractions among c’s and g’s:

S → cSg/gSc/.

And N would be further revised to allow for left and right bulges in the 2D representation. These will correspond to left and right recursions for the new rules defining N:

N → aN/uN/cN/gN/Na/Nu/Nc/Ng/ε.

The problem remains that, from all parses, we have to select the one yielding the maximal number of complementary pairs. And there could be several parses having that property.

Zuker has developed a clever algorithm, using DP, that is able to find the best parse in O(n^3) time, where n is the length of the sequence (see Zuker and Stiegler [1981]). That is quite an accomplishment since just the parsing of strings generated by general (ambiguous) CFGs is also cubic.
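The flavor of such a DP can be conveyed by a simpler problem: finding the maximal number of complementary pairs permitted by the grammar above. The sketch below is not Zuker's energy-based algorithm; it is a pair-counting recurrence of the kind usually attributed to Nussinov, it ignores constraints such as a minimum loop length, and the function name is an assumption made for illustration.

```python
def max_pairs(rna):
    """Maximal number of nested a-u / c-g pairs in a lowercase RNA string."""
    complementary = {("a", "u"), ("u", "a"), ("c", "g"), ("g", "c")}
    n = len(rna)
    if n == 0:
        return 0
    # best[i][j] = maximal number of pairs within rna[i..j] (inclusive).
    best = [[0] * n for _ in range(n)]
    for span in range(1, n):
        for i in range(n - span):
            j = i + span
            # Case N: leave the first or the last nucleotide unpaired.
            score = max(best[i + 1][j], best[i][j - 1])
            # Case aSu, uSa, cSg, gSc: pair position i with position j.
            if (rna[i], rna[j]) in complementary:
                inner = best[i + 1][j - 1] if i + 1 <= j - 1 else 0
                score = max(score, inner + 1)
            # Case S -> SS: split into two independent substructures.
            for k in range(i + 1, j):
                score = max(score, best[i][k] + best[k + 1][j])
            best[i][j] = score
    return best[0][n - 1]

print(max_pairs("aaauuu"))   # a single hairpin stem
print(max_pairs("aucg"))     # two adjacent pairs via the bifurcation rule
```

The three nested loops over i, j, and k account for the cubic running time mentioned above.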

Recent work by Rivas and Eddy [2000] shows that one can use variants of context-sensitive grammars to map RNA sequences onto structures containing knots, that is, overlaps that actually make the structure three-dimensional. That results in higher than cubic complexity.

We should point out that the cubic complexity is acceptable in natural language processing or in speech recognition where the sentences involved are relatively short. Determining the structure of RNA strings involving thousands of nucleotides would imply unbearable computation times.

One should also keep in mind that multiple solutions in the vicinity of a theoretical optimum may well exist; some of those may be of interest to biologists and portray better what happens in nature. Ideally, one would want to introduce constraints and ask questions like: Given an RNA string and a 2D pattern constrained to satisfy given geometrical criteria, is there an RNA configuration that exhibits that 2D pattern and is close to the optimum?

We end this subsection by pointing out a worthy extension of CFGs called stochastic or probabilistic CFGs. Recall from Section 7.3.2 that HMMs could be viewed as probabilistic finite-state transducers. Stochastic CFGs have been proposed and utilized in biology (see Durbin et al. [1998]). Ideally, one would like to develop the counterparts of the Viterbi and Baum-Welch algorithms applicable to stochastic CFGs, and that topic is being investigated. This implies that the probabilities associated to a given CFG could be determined by a learning set, in a manner similar to that used to determine probabilities for HMMs.

7.4.2. Protein Structure. We have already mentioned the importance of 3D structures in biology and the difficulty in obtaining the actual 3D structures for proteins described by a sequence of amino acids. The largest repository of 3D protein structures is the PDB (Protein Data Bank): it records the actual x, y, z coordinates of each atom making up each of its proteins. That information has been gathered mostly by X-ray crystallography and NMR techniques.

There are very valuable graphical packages (e.g., Rasmol) that can present the dense information in the PDB in a visually attractive and useful form, allowing the user to observe a protein by rotating it to inspect its details viewed from different angles.

The outer surface of a protein consists of the amino acids that are hydrophilic (they tolerate well the water medium that surrounds the protein). In contrast, the hydrophobic amino acids usually occupy the protein’s core. The configuration taken by the protein is one that minimizes the energy of the various attractions and repulsions among the constituent atoms.

Within the 3D representation of a protein, one can distinguish the following components. A domain is a portion of the protein that has its own function. Domains are capable of independently folding into a stable structure. The combination of domains determines the protein’s function.

A motif is a generalization of a short pattern (also called signature or fingerprint) in a sequence of amino acids, representing a feature that is important for a given function. A motif can be defined by regular expressions involving the concatenation, union, and repetition of amino acid symbols. Since function is the ultimate objective in the study of proteins, both domains and motifs are used to characterize function.
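As a small illustration of a motif written as a regular expression, consider the sketch below. Both the pattern and the protein fragment are invented for the example and do not correspond to any catalogued signature; the pattern reads "a cysteine, any two residues, a cysteine, any three residues, then one of L, I, V, or M", and thus uses concatenation, repetition, and union.

```python
import re

# Hypothetical motif over one-letter amino acid codes.
MOTIF = re.compile(r"C.{2}C.{3}[LIVM]")

def find_motif(protein):
    """Return (position, matched text) for each non-overlapping occurrence."""
    return [(m.start(), m.group()) for m in MOTIF.finditer(protein)]

hits = find_motif("MKCAACGHTLW")   # made-up sequence fragment
print(hits)
```

Scanning a sequence against a collection of such patterns is one simple way of suggesting which functional features it may carry.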

In what follows, we will present the bioinformatics approaches that are being used to describe and determine 3D protein structure. We mentioned in Section 7.3 that there exist several approaches that attempt to determine secondary structure of proteins by detecting 3D patterns (α-helices, β-sheets, and coils) in a given sequence of amino acids. That detection does not give any information as to how close those substructures are to each other in three-dimensional space.

A team at the EBI (European Bioinformatics Institute) has suggested the use of what are called cartoons [Gilbert et al. 1999]. These are two-dimensional representations that express the proximity among components (α-helices and β-sheets).

The cartoon uses graphical conventions (sheets represented by triangles, helices by circles), and lines joining components indicate their 3D closeness. This can be viewed as an extension of the secondary structure notation in which pointers are used to indicate spatial proximity. In essence, cartoons depict the topology of a protein. The EBI group has developed a database with information about cartoons for each protein in the PDB. The objective of the notation is to allow biologists to find groups of combined helices and sheets (domains) that have certain characteristics and function.

Protein folding, the determination of protein structure from a given sequence of amino acids, is one of the most difficult problems in present-day science. The approaches that have been used to solve it can only handle short sequences and require the capabilities of the fastest parallel computers available. (Incidentally, the IBM team that originated the chess-winning program is now developing programs to attempt to solve this major problem.)

Essentially, protein folding can be viewed as an n-body problem as studied by physicists. Assuming that one knows the various attracting and repelling forces among atoms, the problem is to find the configuration that minimizes the total energy of the system.

A related approach utilizes lattice models: these assume that the backbone of the protein can be represented by a sequence of edges in mini-cubes packed on a larger cubic volume. In theory, one would have to determine all valid paths within the large cube. This determination requires huge computational resources (see, e.g., Li et al. [1996]). Random walks are often used to generate a valid path and an optimizer computes the corresponding energy; the path is then modified slightly in the search for minimal energy configurations. As in many problems of this kind, optimizers try to avoid local minima.
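The two ingredients named above, a random valid path and its energy, can be sketched as follows. Several simplifications are assumptions of the sketch rather than statements from the text: the walk is not confined to a bounding cube, the energy is that of the simple hydrophobic-polar (HP) model in which each non-consecutive pair of adjacent H residues contributes -1, and all function names are hypothetical.

```python
import random

STEPS = [(1, 0, 0), (-1, 0, 0), (0, 1, 0), (0, -1, 0), (0, 0, 1), (0, 0, -1)]

def random_self_avoiding_walk(n, rng):
    """Random path of n distinct lattice points joined by unit steps."""
    if n < 1:
        raise ValueError("n must be at least 1")
    while True:                      # restart whenever the walk traps itself
        path = [(0, 0, 0)]
        occupied = {path[0]}
        while len(path) < n:
            x, y, z = path[-1]
            free = [(x + dx, y + dy, z + dz) for dx, dy, dz in STEPS
                    if (x + dx, y + dy, z + dz) not in occupied]
            if not free:
                break
            path.append(rng.choice(free))
            occupied.add(path[-1])
        if len(path) == n:
            return path

def hp_energy(sequence, path):
    """-1 for every pair of H residues adjacent on the lattice but not in the chain."""
    index_at = {point: i for i, point in enumerate(path)}
    contacts = 0
    for i, (x, y, z) in enumerate(path):
        if sequence[i] != "H":
            continue
        for dx, dy, dz in STEPS:
            j = index_at.get((x + dx, y + dy, z + dz))
            if j is not None and j > i + 1 and sequence[j] == "H":
                contacts += 1
    return -contacts

walk = random_self_avoiding_walk(20, random.Random(7))
square = [(0, 0, 0), (1, 0, 0), (1, 1, 0), (0, 1, 0)]
print(hp_energy("HHHH", square))    # residues 0 and 3 are in contact
```

An optimizer would repeatedly perturb such a path, for instance by re-growing a short segment, and keep the changes that lower the energy.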

The above brute-force approaches may be considered as long-term efforts requiring significant investment in computer equipment. The more manageable present formulations often use what is called the inverse protein-folding problem: given a known 3D structure S of a protein corresponding to a sequence, attempt to find all other sequences that will fold in a manner similar to S. As mentioned earlier (Section 2), structure similarity does not imply sequence similarity.

An interesting approach called threading is based on the inverse protein paradigm. Given a sequence of amino acids, a threading program compares it with all the existing proteins in the PDB and determines a possible variant of the PDB protein that best matches the one being considered.



More details about threading are as follows: Given a sequence s, one initially determines a variant of its secondary structure T defined by intervals within s where each possible helix or sheet may occur; let us refer to helices and sheets simply as components. The threading program uses those intervals and an energy function E that takes into account the proximity of any pair of components. It then uses branch-and-bound algorithms to minimize E and determine the most likely boundaries between the components [Lathrop and Smith 1996]. A disadvantage of the threading approach is that it cannot discover new folds (structures).

There are several threading programs available on the Web (Threader being one of them). These programs are given an input sequence s and provide a list of all the structures S in the PDB that are good “matches” for s.

There are also programs that match 3D structures. For example, it is often desirable to know if a domain of a protein appears in other proteins.

Protein structure specialists have an annual competition (called CASP for Critical Assessment of Techniques for Protein Structure Prediction) in which the participating teams are challenged to predict the structure of a protein given by its amino acid sequence. That protein is one whose structure has been recently determined exactly by experiments but is not yet publicly available. The teams can use any of the available approaches.

In recent years, there has been some success with the so-called ab initio techniques. They consist of initially predicting secondary structure and then attempting to position the helices and sheets in 3D so as to minimize the total energy. This differs from threading in the sense that all possible combinations of proximity of helices and sheets are considered in the energy calculations. (Recall that in threading, intervals are provided to define the boundaries of helices and sheets.) One can think of ab initio methods as those that place the linkages among the components of the above-mentioned cartoons.

7.5. Cell Regulation

In this section, we present two among the existing approaches to simulate and model gene interaction. The terms simulation and modeling are usually given different meanings. A simulation mimics genes’ interactions and produces results that can be compared with actual experimental data to check if the model used in the simulation is realistic. In modeling, the experimental data is provided and one is asked to provide the model. Modeling is a reverse engineering problem that is much harder than simulation. Modeling is akin to program synthesis from data.

Although, in this section, we only deal with gene interactions, the desirable outcome of regulation research is to produce micro-level flowcharts representing metabolic and signaling pathways (Section 7.6).

It is important to note the significance of intergenic DNA material in cell regulation. These regions of noncoding DNA play a key role in allowing RNA-polymerase to start gene transcription. This is because there has to be a suitable docking between the 3D configurations of the DNA strand and those of the constituents of RNA-polymerase.

7.5.1. Discrete Model. We start by showing how one can easily simulate the interaction of two or more genes by a program involving threads. Consider the following scenario:

Gene G1 produces protein P1 in T1 units of time; P1 dissipates in time U1 and triggers condition C1. Similarly:

Gene G2 produces P2 in T2 units of time; P2 dissipates in time U2 and triggers condition C2.

Once produced, P2 positions itself in G1 for U2 units of time, preventing P1 from being produced. We further assume that the production of a protein can only take place if a precondition is satisfied. That precondition is a function of the various post-conditions C′_i.

The above statements can be presented in program form by assuming the existence of a procedure process involving five parameters:

- The gene identification G (possibly a string)

- A pre-condition C allowing a protein P to be processed (a constraint)

- The units of time T needed to produce protein P

- The time U for protein P to completely dissipate

- A post-condition C′ to be performed after the protein is produced (a constraint).

Notice that the precondition C can be a general Boolean function and the post-condition C′ can trigger changes in the parameters of any C. A rough description of process is:

    process (Gene, Pre-Condition, Process-Time, Decay-Time, Post-Condition)
        if Gene is not available (Pre-condition)
        then wait until it becomes available
        else {produce protein in Process-Time,
              trigger Post-condition,
              wait for the given Decay-Time}

The order in which the constituents of the else part of the if-statement are executed is subject to different interpretations and it is left unspecified.

Now let us imagine that process acts like a thread that can be executed in parallel with other threads. We also make the simplifying assumption that the process of a given gene G cannot be invoked until the previous incarnation of that process has terminated. Consider the program segment:

forever do
    process (“G1”, P2 is not on, 50, 20, none)
    process (“G2”, none, 200, 50, P2 is on “G1”)

Let t denote a current time in the execution of the above program. It should be clear to the reader that the behavior of the program can be displayed by successive Boolean vectors V(t) denoting the state on or off of each of the genes at time t.
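The program segment above can be approximated by a small sketch that replaces threads with a fixed-step loop, which keeps the run deterministic. Several choices here are assumptions not fixed by the text: a gene counts as "on" while its protein is present, production restarts from zero when the precondition fails, and the names `Gene` and `simulate_discrete` are hypothetical.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Gene:
    name: str
    process_time: int                 # T: ticks needed to produce the protein
    decay_time: int                   # U: ticks the protein stays present
    blocked_by: Optional[str] = None  # gene whose protein represses this one
    progress: int = 0                 # ticks of production completed so far
    protein_left: int = 0             # ticks until the protein has dissipated


def simulate_discrete(genes, steps):
    """Return the Boolean vectors V(t), one tuple per tick.

    Assumption: entry i of V(t) is True while the protein of gene i is present.
    """
    history = []
    for _ in range(steps):
        # Snapshot taken before any update, so all genes see the same state.
        present = {g.name: g.protein_left > 0 for g in genes}
        history.append(tuple(present[g.name] for g in genes))
        for g in genes:
            if g.protein_left > 0:
                g.protein_left -= 1          # wait for the given decay time
            elif g.blocked_by is not None and present[g.blocked_by]:
                g.progress = 0               # precondition fails: wait
            else:
                g.progress += 1              # produce protein
                if g.progress == g.process_time:
                    g.progress = 0
                    g.protein_left = g.decay_time  # post-condition: protein is on
    return history


# The two processes of the text: G1 (50, 20) is blocked while P2 is on;
# G2 (200, 50) has no precondition.
genes = [Gene("G1", 50, 20, blocked_by="G2"), Gene("G2", 200, 50)]
history = simulate_discrete(genes, 320)
```

Each entry of `history` is one vector V(t); with threads instead of ticks, the interleaving would depend on the scheduler, which is one reading of the order the text leaves unspecified.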

The above program is a minuscule example of the type of concurrent processes that take place within the cell. The processes can be likened to RNA-polymerases, spliceosomes and ribosomes.

Notice that in the case of eukaryotic cells there would be three levels of cascading processes since different conditions would be applicable to simulate the generation of a given protein. This is because there could be interruptions not only in RNA production, but also in the splicing, and generation of proteins.

Interesting organisms will have thousands of genes and many of them will interact with others in a complex manner that we do not yet know. As mentioned, the state of the program at time t is describable by the vectors V(t). Actually these vectors correspond to information that can be gathered by microarray experiments. Usually microarrays detect not step functions but continuous ones expressing the amount of RNA produced by a cell at a given time under certain conditions.

Now we can state a major and enormously difficult problem in biology: given the vectors V(t), deduce the pre- and post-conditions for a program simulating gene interactions. This is a reverse engineering problem that is probably undecidable. Nevertheless, we can attempt to solve more manageable problems of the sort: given the results of microarray experiments, is a given conjecture for the pre- or post-conditions possible?
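The more manageable question can be illustrated with a minimal sketch: given a series of observed Boolean state vectors, test whether a conjectured rule for one gene is contradicted by any consecutive pair of observations. The observations below are invented for illustration, and the function names are hypothetical; real microarray data would first have to be discretized.

```python
def consistent(series, target, rule):
    """True if rule(state at t) equals the target gene's state at t+1
    for every pair of consecutive observations."""
    return all(rule(before) == after[target]
               for before, after in zip(series, series[1:]))


# Invented observations of two genes at successive time points.
observed = [
    {"G1": True,  "G2": False},
    {"G1": True,  "G2": True},
    {"G1": False, "G2": True},
    {"G1": False, "G2": False},
    {"G1": True,  "G2": False},
]


def repressed_by_g2(state):
    # Conjecture: G1 is on at the next step exactly when G2 is off now.
    return not state["G2"]


def activated_by_g2(state):
    # Competing conjecture: G1 follows G2.
    return state["G2"]


repression_possible = consistent(observed, "G1", repressed_by_g2)
activation_possible = consistent(observed, "G1", activated_by_g2)
```

A conjecture that survives this check is only not refuted by the data; many different rules can be consistent with the same short series.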

There are groups of computer scientists and biologists working on such problems. One of these groups, led by Regev and Shapiro [2002], uses Milner’s Pi-calculus to attempt to answer logical questions about conjectures made by biologists. (The Pi-calculus is a formal language for concurrent computational processes, like those used in mobile telephone systems.) Since the results of microarrays are often noisy and uncertain, one has to resort to a probabilistic (or stochastic) variant of the Pi-calculus.

Statistical methods like Bayesian networks and support vector machines have also been used in inferring gene behavior from microarray data [Friedman et al. 2000; Brown et al. 2000; Bar-Joseph et al. 2002]. Clustering algorithms (see, e.g., Jain et al. [1999]) are often used to group the genes exhibiting similar behavior, therefore reducing the problem’s dimensionality.

Fig. 1.

7.5.2. Continuous Models. As in the discrete case, we will consider the two aspects: simulation and modeling. The continuous simulation approach is based on the theory of dynamic systems. It is assumed that the expression level of each gene is describable by a differential equation. If there are n genes that interact with each other, then the continuous simulation consists of a system of n nonlinear differential equations.

Let $x_i$ denote the expression level of the $i$th gene. Then the resulting system of differential equations becomes:

$$\frac{dx_i}{dt} = f_i(x) - \gamma_i x_i, \qquad x_i \ge 0,$$

where $x$ is the vector $(x_1, x_2, \ldots, x_n)$.

The term $-\gamma_i x_i$ states that the concentration of the $i$th product decreases through spontaneous processes like degradation, diffusion, etc. $f_{ij}$ is the function specifying a combination of sigmoids (highly nonlinear) which describes the interaction between genes $i$ and $j$; $m$ is a parameter specifying the steepness of the function around $\theta_{ij}$ (see Figure 1).

$$f_{ij} = \frac{x_j^m}{x_j^m + \theta_{ij}^m}$$

The above specifies that gene expression increases (or decreases) sharply when a gene interacts with another. It is possible to generate the system of differential equations from a graph whose nodes represent the genes and the branches their interactions. Additionally, the branches can be labeled with +’s or −’s indicating the fact that a gene activates or represses another gene.

Once the graph and the above parameters are known, the system of equations can be generated, solved numerically and yield curves that describe gene expression as a function of time.
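A minimal sketch of this generate-and-solve step follows, using forward Euler integration in plain Python. Several details are assumptions made for illustration and are not taken from the text: each gene has a maximal production rate `kappa[i]`, regulators combine multiplicatively, a repressing edge contributes one minus the sigmoid, and a single threshold `theta` and steepness `m` are shared by all edges.

```python
def hill(x, theta, m):
    """Sigmoid x^m / (x^m + theta^m): near 0 below theta, near 1 above it."""
    return x**m / (x**m + theta**m)


def integrate(x0, edges, gamma, kappa, dt, steps, theta=1.0, m=4):
    """Forward-Euler solution of dx_i/dt = f_i(x) - gamma_i * x_i.

    edges maps a target gene index to a list of (source index, sign) pairs,
    where sign > 0 marks activation and sign < 0 marks repression.
    """
    x = list(x0)
    trajectory = [tuple(x)]
    for _ in range(steps):
        derivative = []
        for i in range(len(x)):
            production = kappa[i]  # assumed maximal production rate
            for j, sign in edges.get(i, []):
                h = hill(x[j], theta, m)
                production *= h if sign > 0 else 1.0 - h
            derivative.append(production - gamma[i] * x[i])
        # Expression levels are clamped at zero, matching x_i >= 0.
        x = [max(0.0, xi + dt * di) for xi, di in zip(x, derivative)]
        trajectory.append(tuple(x))
    return trajectory


# Two genes: gene 0 is unregulated, gene 1 is repressed by gene 0.
trajectory = integrate([0.0, 0.0], {1: [(0, -1)]},
                       gamma=[1.0, 1.0], kappa=[2.0, 2.0],
                       dt=0.01, steps=2000)
```

Plotting the columns of `trajectory` against time gives smooth expression curves. A production solver would normally use an adaptive method rather than fixed-step Euler, which is chosen here only to keep the sketch free of dependencies.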

These results are the continuous counterparts of those displayed for the discrete simulation described in the previous section. In the discrete case the gene expression was either on or off, whereas in the continuous case the gene expression curves vary smoothly.

The fact remains that it would be extremely difficult to do the reverse engineering task of modeling, that is, generating from existing data the system of equations and their parameters. The clustering algorithms mentioned in Section 9 have become indispensable to reduce the complexity of gene regulation analysis from microarray data.

Somogyi and his co-workers [Liang et al. 1998] have proposed an interesting approach for both simulation and modeling of gene interaction. The simulation uses a Boolean approach and the modeling amounts to generating circuits (or equivalent Boolean formulas) from data.

De Jong [2002] has recently published an extensive survey about work done in cell regulation both in simulation and in modeling. One interesting way of solving the above differential equations is by qualitative reasoning, a subject developed in artificial intelligence by Kuipers [1994] to deal with discrete versions of differential equations. Cohen [2001] proposes the use of constraints to describe various cell regulation methods.

E-CELL is an ambitious Japanese project that aims at simulating cells by using stochastic systems of nonlinear differential equations [Tomita et al. 1999]. It has been used to simulate the behavior of various cells including that of the human heart. Versions of the E-CELL simulator are available for various platforms.



7.6. Determining Function and Metabolic Pathways

In the previous section, we mentioned that protein domains and motifs were important in determining protein function. Function is a subjective topic that may mean different things for different people. The Protein Data Bank (PDB) contains annotations—in natural language—that explain the role of the protein in the larger context of cell behavior. Incidentally, great care has to be taken to interpret annotations since different researchers use different terms that are supposed to be equivalent. In a simplistic manner, the function of a protein is its PDB annotation complemented by related observations.

A typical example of an annotation for gene function is as follows: “The gene, known as 5-HTT, has been a focus of depression studies because it contains the code to produce a protein that escorts the chemical messenger serotonin across the spaces between brain cells, or synapses, and then clears away the leftover serotonin” (New York Times, July 18, 2003).

The ultimate way to express protein function is by finding its role in metabolic, regulation, and signaling pathways. (These have been briefly defined in the previous sections.) Karp [2001] has studied this topic extensively. He has implemented some of those pathways for E. coli and other organisms in the form of databases.

Karp rightfully points out that it is impossible to develop a theory about a complex system without the aid of a properly designed database of facts and interactions among facts. Such a database is essentially the representation of large labeled graphs. Each node of the graph represents a chemical reaction, the proteins involved, and the enzymes catalyzing that reaction.

Graphical interfaces are mandatory to display the results of queries about metabolic pathways. For example, one should be able to have graphical responses to questions of the type: (i) determine all the reactions in which a given enzyme acts as a catalyzer, (ii) find the different enzymes catalyzing similar reactions, (iii) specify all paths going through a pair of reactions, and so forth. In the EcoCyc system developed by Karp, the results of such queries are graph representations with highlighted nodes or paths.
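The three kinds of query can be sketched on a toy pathway graph. This is not the EcoCyc schema or interface: as a simplifying assumption, metabolites are the nodes and each reaction is an edge labeled with its enzyme, and the data is limited to the first steps of glycolysis. The function names are hypothetical.

```python
# (substrate, product, enzyme) triples for the first steps of glycolysis.
REACTIONS = [
    ("glucose", "glucose-6-phosphate", "hexokinase"),
    ("glucose", "glucose-6-phosphate", "glucokinase"),
    ("glucose-6-phosphate", "fructose-6-phosphate", "phosphoglucose isomerase"),
    ("fructose-6-phosphate", "fructose-1,6-bisphosphate", "phosphofructokinase"),
]


def reactions_of(enzyme):
    """Query (i): all reactions in which the given enzyme is the catalyst."""
    return [(s, p) for s, p, e in REACTIONS if e == enzyme]


def enzymes_for(substrate, product):
    """Query (ii): the different enzymes catalyzing the same reaction."""
    return sorted(e for s, p, e in REACTIONS if (s, p) == (substrate, product))


def paths(source, target, seen=()):
    """Query (iii): all cycle-free paths from source to target."""
    if source == target:
        return [[source]]
    visited = seen + (source,)
    found = []
    for s, p in sorted({(s, p) for s, p, _ in REACTIONS}):
        if s == source and p not in visited:
            found += [[source] + rest for rest in paths(p, target, visited)]
    return found
```

In a graphical interface, the lists returned here would determine which nodes or paths to highlight.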

Karp’s research is an ambitious one. Ultimately one wants to attempt to generate metabolic pathways from genomic data of similar organisms. The system MetaCyc is a meta-system developed for that purpose. This type of research should eventually merge with that proposed by Shapiro and briefly described in Section 7.5.1.

The Japanese have also developed a widely used metabolic pathway database called KEGG (Kyoto Encyclopedia of Genes and Genomes) (http://www.genome.ad.jp/kegg/pathway.html).

7.7. Assembling DNA Fragments

The problem of DNA assembly became very important for sequencing very large genomes such as the human genome. Companies like Celera use the so-called whole genome shotgun method that consists of sequencing relatively small fragments of DNA and then relying on computer programs to assemble those fragments. Eugene Myers [1999], formerly from Celera, now at Berkeley, has been a pioneer in this effort.

Fragments are of the order of 500 base pairs (bp). The target sequence—the one to be reconstructed—is of the order of 50k to 100k bp, and there are about 1,000 fragments to be assembled.

The problem of assembly becomes complex because of several factors that include orientation, repeats, and sequencing errors. Fragments can originate from each of the two DNA strands, and orientation means that either a given sequence or its reverse complement is a valid candidate for being assembled into the target sequence. Repeated subsequences in the target sequence make the assembly more difficult because one does not know to which copy a given fragment belongs.
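The orientation issue can be made concrete with a short sketch that computes the reverse complement of a fragment, so that both candidates can be offered to an assembler. The function names are hypothetical, and the input is assumed to contain only the letters A, C, G, and T.

```python
# Watson-Crick pairing: A with T, C with G.
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(fragment):
    """Complement every base, then reverse to read the opposite strand."""
    return fragment.translate(COMPLEMENT)[::-1]


def candidate_orientations(fragment):
    """Both readings of a fragment that an assembler has to consider."""
    return (fragment, reverse_complement(fragment))
```

Applying `reverse_complement` twice returns the original fragment, which is a convenient sanity check.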

A fragment Fi overlaps with a fragment Fj if the left (right) end of Fi shares a common subsequence with the right (left) end of Fj. (If one fragment is a subsequence of another, then the smaller one can be discarded.) A region of contiguously overlapping fragments is called a contig.

The assembly problem can be stated as finding the shortest superstring S such that each fragment is a subsequence of S. Note that this problem bears some similarity with finding multiple alignments. (As we have seen in Section 7.1.4, the latter is known to be a computationally difficult problem.)

It is easily shown that we can construct a labeled directed graph G representing the overlaps of each pair of fragments. Let us assume that Fi overlaps q symbols with Fj and that the length of Fi is greater than the length of Fj. Then the graph G contains a directed edge labeled by q joining the node Fi to the node Fj. Notice that pairwise alignments can be used to determine the edges of the graph and their labels.

It is not difficult to see that a path in G that contains no cycle represents a contig. Therefore the shortest superstring problem amounts to finding the shortest Hamiltonian path in G. That is computationally difficult and one has to resort to approximations. A greedy algorithm is often used to determine that path. The problems of orientation and repeats will also have to be surmounted. A helpful hint is that one knows the approximate size of the target sequence. A recent article by Myers and his colleagues reflects some of the latest work done in DNA assembly [Huson et al. 2002].
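One common greedy heuristic repeatedly merges the pair of fragments with the largest overlap. The sketch below illustrates that idea under strong simplifying assumptions: overlaps are exact suffix-prefix matches rather than alignments, all fragments come from the same strand, and there are no sequencing errors or repeats. It is not the method of any particular assembler, and the function names are hypothetical.

```python
def overlap(a, b):
    """Length of the longest suffix of a that is also a prefix of b."""
    for k in range(min(len(a), len(b)), 0, -1):
        if a.endswith(b[:k]):
            return k
    return 0


def greedy_assemble(fragments):
    """Merge the best-overlapping pair until a single string remains."""
    unique = set(fragments)
    # Fragments contained in another fragment can be discarded.
    frags = sorted(f for f in unique
                   if not any(f != g and f in g for g in unique))
    while len(frags) > 1:
        best_k, best_i, best_j = -1, None, None
        for i, a in enumerate(frags):
            for j, b in enumerate(frags):
                if i != j:
                    k = overlap(a, b)  # edge label of the overlap graph
                    if k > best_k:
                        best_k, best_i, best_j = k, i, j
        merged = frags[best_i] + frags[best_j][best_k:]
        frags = [f for idx, f in enumerate(frags)
                 if idx not in (best_i, best_j)] + [merged]
    return frags[0]


reads = ["ATTAGACCTG", "CCTGCCGGAA", "AGACCTGCCG", "GCCGGAATAC"]
assembled = greedy_assemble(reads)
```

The result is a superstring of all fragments but is not guaranteed to be the shortest one; repeats in the target are the usual reason a greedy choice goes wrong.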

A problem related to assembly is that of physical mapping of DNA. The fragments for a given target sequence are obtained from parts of chromosomes containing several hundred thousand base pairs. These very large fragments have markers that enable the reconstruction of the original chromosomal DNA. As in the case of DNA assembly, the reconstruction is based on graphs and again the computational complexity is very high. The reader is directed to the text of Setubal and Meidanis [1997] where that topic is presented pedagogically.

7.8. Using Script Languages

Consider the following typical problem in bioinformatics. Given a sequence of amino acids representing a protein P, we want to use BLAST to determine the proteins in the GenBank database that are homologous to P but have a given degree of similarity specified by a p-value threshold. Following that search we may also want to perform a multiple alignment with those homologous proteins (using CLUSTAL) and possibly utilize a package like PHYLIP to determine the phylogenetic tree corresponding to the multiple alignments. Finally, we would like to check if any of the proteins in the multiple alignments has a 3D structure in the PDB.

The cascading use of the above packages would require a researcher to take active part in requesting the URL of a package, performing format changes if needed, inspecting and rejecting some data, and so forth. Script languages allow their users to write programs that automatically perform these tasks.

Perl and Python are probably the most often-used languages in bioinformatics. Perl is older and has many ready-made packages available for searching websites and downloading results. Python, the more recent language, is gaining momentum in bioinformatics applications.

One of the frequent tasks done using script languages is finding certain patterns in files containing information in various formats (e.g., html). Regular expressions (RE) are often used to specify those sets of patterns. A more specific example of RE usage is as follows. Assume that we want to test if a pattern of nucleotides defines the boundaries between exons and introns (these are called splice-sites). Also assume that one knows the splice-sites for many genes of a given organism O and can express them by a RE that takes into account the left and right contexts of the splice-sites.

Suppose now that we want to determine the splice-sites for another organism O′ that is possibly related to the first. A search using the RE applicable to O may reveal interesting putative splice-sites in O′. If that is the case, one may wish to revise the RE for O to handle new organisms. These situations occur very often in bioinformatics.
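A small Python sketch shows how such a search might look. The pattern is a deliberately simplified assumption: it uses the fact that many introns begin with GT, requires an AG left context at the end of the preceding exon, and adds a short right context. It is not a validated splice-site model for any organism, and the names are hypothetical.

```python
import re

# Assumed donor-site pattern: left context "AG" (end of the exon), then the
# start of the intron "GT", a purine (A or G), and "AGT".
DONOR = re.compile(r"(?<=AG)GT[AG]AGT")


def putative_donor_sites(dna):
    """Return the positions at which an intron could begin."""
    return [match.start() for match in DONOR.finditer(dna)]
```

Running the same compiled pattern over sequences from a related organism yields candidate sites; revising the RE then amounts to editing the pattern string and repeating the search.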

8. A CHIMERICAL PROGRAM

In the previous sections, we reviewed the approaches that are currently being used to solve typical problems in bioinformatics. In this section, we will try to take a glimpse into the future. What follows is the author’s extrapolation from current work being done in bioinformatics, and it is admittedly speculative.

The hypothetical program below describes the essence of functional genomics. Given the genome of an organism, it seeks to generate a program that simulates the cell behavior for that organism. At the top level, the program finds all the genes of the genome, then determines the function of each gene, and finally combines the results into a simulator. Paraphrasing in program form one has:

Generation of a cell simulator for a new organism

Find-Genes (DNA, Genome)
for each Gene in Genome
    Process (Gene, Function)
Combine (Function, Cell-Behavior)

The first parameter in each of the procedures being called represents either given data or results obtained from a preceding procedure call. Find-Genes is probably a version of some of the programs mentioned in Section 7. Process embodies the Central Dogma, and Combine is admittedly a “pie in the sky” that will have to be worked out in the future. Keep in mind, however, that this is the goal of Karp’s project, briefly described in Section 7.6.

In a first phase, Combine should generate a program similar to the discrete model example in Section 7.5.1, but involving all genes of the genome. Eventually, one would like to obtain a program that not only mimics gene interactions but also depicts in detail the workings of metabolic and signaling pathways.

An elaborated version of Process is given below. It introduces the types of the parameters whenever possible. Even though all the components in the cell are 3D structures, the abstraction of DNA sequences into the type “string” is likely to remain applicable. Nevertheless, it is known that the transcription by RNA-polymerase depends on the (elongated) shape of the helix segment that contains the gene.

Process (Gene: string, Function: program-fragments)
    The central dogma
    RNA-Polymerase (Gene, Pre-RNA)
    Note the possibility of alternate splicing (multiple RNAs)
    Spliceosome (Pre-RNA, RNA)
    Ribosome (RNA, Aminoacid-Sequence)
    Fold (Aminoacid-Sequence, Structure)
    Determine-Function (Structure, Function)

The above hypothetical procedure is not unlike the thread process described in Section 7.5.1 and leaves open how function can be determined from structure and other data. It is clear that results obtained through microarray experiments, protein interactions, and known metabolic and signaling pathways will have to be taken into consideration.

9. TOPICS THAT PLAY AN IMPORTANT ROLE IN BIOINFORMATICS

A perusal of the material in the previous sections provides insights on CS topics that are likely to influence bioinformatics. A recurring theme in the currently used algorithms is optimization. Alignments, parsimony in phylogeny, determining RNA structure, and protein threading can all be viewed as optimization problems.

The interest in dynamic programming (DP) is that it enables an efficient (polynomial) solution of certain optimization problems. This occurs when a problem can be transformed into determining the maximal (or minimal) path in a DAG. It was seen in the case of pairwise alignments that one could formulate the problem using a DAG with $n^2$ nodes, where n is the length of the sequences being aligned. However, the use of DP becomes prohibitive in the case of multiple alignments.
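The pairwise case can be sketched as a global alignment score computed over the grid of prefix pairs, which plays the role of the DAG. The scoring values (match +1, mismatch −1, gap −1) are assumptions chosen for illustration, and the function name is hypothetical.

```python
def alignment_score(a, b, match=1, mismatch=-1, gap=-1):
    """Best global alignment score of a and b by dynamic programming.

    Cell (i, j) holds the best score for aligning a[:i] with b[:j]; only the
    previous row is kept, so space is linear while time stays quadratic.
    """
    previous = [j * gap for j in range(len(b) + 1)]
    for i in range(1, len(a) + 1):
        current = [i * gap]
        for j in range(1, len(b) + 1):
            diagonal = previous[j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            current.append(max(diagonal,               # align a[i-1] with b[j-1]
                               previous[j] + gap,      # gap in b
                               current[j - 1] + gap))  # gap in a
        previous = current
    return previous[-1]
```

Extending the same recurrence to k sequences requires a k-dimensional table, which is why the approach becomes prohibitive for multiple alignments.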

Inevitably, in the case of algorithms with higher complexity one has to resort to heuristics. Typically, heuristic strategies are used in the case of NP problems or polynomial problems involving large volumes of data.

For example, the DNA assembly problem requires suitable heuristics for greedy algorithms to determine possible Hamiltonian paths in a graph. Genetic algorithms have been used for that purpose [Parsons et al. 1995]. BLAST illustrates the case in which even a quadratic space and time complexity makes the DP algorithm unusable for practical problems involving huge sequences.

Machine learning, data mining, neural networks, and genetic algorithms occupy a prominent position among the CS approaches used in bioinformatics (see, e.g., Mitchell [1997] and Hand et al. [2000]). This is because there is an enormous amount of data available and, fortunately, biologists have annotated some of this data. Typical examples include gene finding and secondary structure determination (Section 7.3).

There are thousands of genes whose locations in various genomes have been determined using laboratory experiments. This information is recorded in a vast repository of sequences, with markings specifying the locations of promoters, exons, and introns. These annotations enable the determination of the most likely contexts for desired boundaries. The problem becomes: given this learning set, infer the corresponding boundaries for new sequences not in the learning set (supervised learning).

A similar situation occurs when attempting to determine the secondary structure of proteins. An annotated learning set can be obtained from the Protein Data Bank (PDB), where thousands of proteins have been studied in detail and for which boundaries of helices and sheets have been accurately determined.

The above are typical problems that can be solved by machine-learning and neural network techniques. Many gene-finders and secondary structure estimators utilize these approaches.

Classification and data clustering (see, e.g., Jain et al. [1999]) are cognate to supervised machine learning. Assume that we are given a large set of lists each containing the values of n parameters and their known classification (say, an identifier). One then groups the lists into clusters that have the same classification.

Given a new list of parameters, we wish to determine the most likely cluster it should belong to. In two-dimensional cases, the answer can be obtained by the evaluation of a simple equation representing the straight line that separates the two semispaces representing the clusters. The n-dimensional case is considered in the relatively new area of support vector machines (SVM). The SVM approach divides the n-dimensional space into areas delimited by semiplanes. These techniques have acquired great significance in reducing the complexity of the task of inferring gene regulation from microarray data.

It is undeniable that probability and statistics play an influential role in bioinformatics. This is not surprising since the data available is huge, varied, and noisy. Recent articles on interpreting microarray experiments utilize statistical approaches such as SVMs and Bayesian networks [Friedman et al. 2000; Brown et al. 2000; Bar-Joseph et al. 2002].

Hidden Markov Models are also machine-learning techniques. In this approach, one starts by specifying a topology of finite-states representing the structure one believes is applicable. Based on the learning set, the probabilities are computed. Given a new sequence, we can then use DP (the Viterbi algorithm) to determine the most likely succession of states corresponding to the given sequence.
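The decoding step can be sketched with a compact Viterbi implementation working in log space. The two-state model below (an AT-rich and a GC-rich region) and all of its probabilities are invented for illustration rather than estimated from a learning set, and the function assumes a non-empty sequence and strictly positive probabilities.

```python
import math


def viterbi(observations, states, start_p, trans_p, emit_p):
    """Most likely state sequence for the observations (log-space DP)."""
    score = {s: math.log(start_p[s]) + math.log(emit_p[s][observations[0]])
             for s in states}
    back_pointers = []
    for symbol in observations[1:]:
        previous, score, pointers = score, {}, {}
        for s in states:
            # Best predecessor of state s at this position.
            best = max(states, key=lambda r: previous[r] + math.log(trans_p[r][s]))
            score[s] = (previous[best] + math.log(trans_p[best][s])
                        + math.log(emit_p[s][symbol]))
            pointers[s] = best
        back_pointers.append(pointers)
    # Trace back from the best final state.
    path = [max(states, key=lambda s: score[s])]
    for pointers in reversed(back_pointers):
        path.append(pointers[path[-1]])
    return path[::-1]


# Hypothetical model parameters.
STATES = ["AT", "GC"]
START = {"AT": 0.5, "GC": 0.5}
TRANS = {"AT": {"AT": 0.9, "GC": 0.1}, "GC": {"AT": 0.1, "GC": 0.9}}
EMIT = {"AT": {"A": 0.4, "T": 0.4, "C": 0.1, "G": 0.1},
        "GC": {"A": 0.1, "T": 0.1, "C": 0.4, "G": 0.4}}

decoded = viterbi("AATTGGCC", STATES, START, TRANS, EMIT)
```

Gene finders based on HMMs use the same recurrence with states such as exon, intron, and intergenic region, and with probabilities estimated from annotated sequences.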

All these methods amount to the generation of probabilistic grammars from a learning set. The topology of states in HMMs is generalized to correspond to the presumed grammar rules whose frequency one wishes to estimate. Therefore, methods using probabilistic grammars are expected to have a salient place in bioinformatics.

Data mining is akin to machine learning. In data mining one hopes to detect certain patterns in huge amounts of data (unsupervised learning). Data mining has been used in forecasting protein interactions [Thierry-Mieg 2000].

It should be apparent that database design and development are an integral part of bioinformatics. The best single place to look for information on biological databases is the annual database issue of Nucleic Acids Research: http://nar.oupjournals.org/content/vol32/suppl_1/index.shtml. A recommended précis of the problems in the design of genomic and genetic databases and their integration is given in [Ashburner and Goodman 1997].

Computational geometry should also play a key role in analyzing 3D structures. An example is 3D pattern matching in proteins: in this case, the “pattern” is a portion of a protein’s backbone, and the “text” corresponds to all the proteins in the PDB. One would want to determine the set of proteins that exhibit that pattern. As in the case of alignments, we would like to tolerate small discrepancies between the pattern and elements in the text.

The example of phylogenetic tree construction using data compression (Section 7.2) illustrates the importance of information theory in analyzing massively long sequences of symbols.

An interesting CS application in bioinformatics is that of natural language processing (NLP). For example, biotech companies hire teams of biologists to examine the large scientific literature available to detect descriptions of possible gene or protein interactions. It would be desirable to automate that process. Another possible NLP application is to attempt to make sense of annotations made by biologists to explain gene function. Questions of the type: Are two annotations comparable? are difficult inquiries that one would want to be able to answer.

Graphics and graphical interfaces are of course a necessity for displaying biological data. As in the other CS applications, knowledge of biology and the capacity to interact with biologists are vital to successful software development in bioinformatics.

10. LEARNING MORE ABOUT BIOINFORMATICS

The material covered in this article is but an introduction to the field. The interested reader will have to expand his or her knowledge significantly to become proficient in bioinformatics. A few hints as to how to proceed are discussed below.

We dealt with 3D structures in an abstract manner and showed their importance in the molecular interactions that are crucial to cell life. To understand molecular structure and interactions in detail, one has to plunge into biochemistry. Therefore, an introductory course in biochemistry is a prerequisite for doing work in bioinformatics. That and a course in molecular biology are long-term undertakings.

This author favors a continual updating of knowledge by reading the tutorial material available on the Web, and most of all, by interacting with biologists. As mentioned earlier, this is not always an easy task since we have been educated to reason in different modes. Nevertheless, such interactions are necessary in order to infer which tools are best suited to help biologists tackle unsolved problems. And that effort can lead to the development of novel algorithms and approaches.

A recommended introductory bioinformatics text, by Krane and Raymer [2003], has recently been published. It provides an easy to read introduction to the field. A good companion for that book is the Cartoon Guide to Genetics by Gonick and Wheelis [1991].

Computer scientists interested in computational biology are referred to the several textbooks currently available that are listed as references. We should distinguish two types of texts: those that emphasize the discrete and combinatorial aspects of the field (e.g., Setubal and Meidanis [1997]), and those that favor a probabilistic and statistical approach (e.g., Durbin et al. [1998]).

For the reader interested in the research aspects of bioinformatics, the compendium edited by Salzberg et al. [1998] is recommended. The encyclopedic book by Mount [2001] is an excellent reference text. An interesting article by Luscombe et al. [2001] defining the goals of bioinformatics is certainly worth reading. An aperçu of recent advances in bioinformatics appears in Goodman [2002].

Several texts in bioinformatics have been published recently. Among them we note: Dwyer’s book stressing programming in bioinformatics using Perl [Dwyer 2002]; a compendium of recent topics in bioinformatics [Orengo et al. 2003]; a practical approach to the field [Claverie and Notredame 2003]; and a treatise on bioinformatics and functional genomics by Pevsner [2003]. Suggestions for implementing bioinformatics undergraduate level courses have appeared in Cohen [2003]. A recent undergraduate text by Jones and Pevzner [2004] is highly recommended.

11. FINAL REMARKS

Searls [1998] rightly pointed out that many current problems, such as those briefly described in Section 7, remain challenging tasks. His list includes: protein structure prediction, homology search, multiple alignment and phylogeny construction, genomic sequence analysis, and gene finding. The most recent developments in biology point in the direction of functional genomics research. That topic not only encompasses Searls’ list of challenges but also includes cell simulation and modeling, as well as metabolic pathways.

Nearly all the contents of the present article have been devoted to explaining single cell behavior. The generic-type cell—also called a stem cell—can be transformed into any other type of cell that specializes in performing specific functions in a multicellular organism. Blocking the production of certain proteins and encouraging the expression of others achieve this specialization. This process is not yet well understood. Nevertheless, the geographic position of the cell and its neighbors is known to have significant roles as to which genes are turned on and which are switched off.

An interesting article written by computer scientists at MIT deals with the simulation of multiple cells and proposes the paradigm of amorphous computing [Abelson et al. 1995]. It has been inspired by biology, and it develops a massively parallel model that accounts for changes in the shapes of a network of distributed asynchronous computers. This is an example of how biology can be inspirational to computer science. Another prime example is DNA computing, that is, using DNA strands to solve computationally difficult problems.

As to the future of the relationship between computer science and biology, it is worth mentioning an interview given by Knuth [1993]. He argues that major discoveries in computer science are unlikely to occur as frequently as they did in the past few decades. On the other hand, he states that “Biology easily has 500 years of exciting problems to work on. . .”.

The accomplishments made in molecular biology in the past half century have been remarkable. Nevertheless, they pale in comparison to the wondrous tasks that lie ahead. Consider, for example, attempting to answer questions like:

—How do brain cells establish linkages among themselves while an embryo is being formed?

—Is it possible to understand better the origins of language and the nature-nurture paradigm?

—How does Darwinian evolutionary theory operate at the molecular level?

These questions pose enormous challenges and Knuth’s forecast may even turn out to be conservative.

With the increasing relevance of biology (and bioinformatics) also comes responsibility. In a recent article in the New York Times, Kelly [2003], the president of the Federation of American Scientists, points out that a graduate student in biology using the wet lab (and available bioinformatics tools) could concoct viruses with great potential for harm.

Since the Manhattan Project, physicists have been in a similar predicament. It is now the turn of the biologists (and bioinformaticians) to make sure that developments will be used for lofty purposes. For example, understanding the mechanisms of cell differentiation and inferring the gene interactions that produce cancerous cells will no doubt revolutionize medicine.

APPENDIX. A SUMMARY OF CONCEPTS INMOLECULAR CELL BIOLOGY

The material in this appendix is designedas a concise refresher for the backgroundin molecular cell biology needed to readthe main article. Even though we haveavoided the description of chemical struc-tures, they are essential to understandingmolecular interactions at the atomic level.

There are several detailed texts incell and molecular biology available. Twooften-used ones are those by Lodish et al.[2003] and Alberts et al. [2004]. The readeris also referred to the numerous glossariesand tutorials that exist on the Web. It of-ten suffices to use Google with the desiredkeywords, followed by the terms “tutorial”or “applet,” to obtain a wealth of pedagog-ical information about a topic not coveredin this appendix. Preceding the referenceswe present a handful of URLs that arehelpful in providing additional informa-tion.

DNA is helix-shaped molecule whoseconstituents are two parallel strands ofnucleotides. There are four types of nu-cleotides in DNA and they correspond tothe letters A (for adenine), T (thymine), C(cytosine) and G (guanine). DNA is usu-ally represented by sequences of these fournucleotides. This assumes that only onestrand is considered; the second strand isalways derivable from the first by pair-ing A’s with T’s and C’s with G’s and vice-versa. That derivation is called finding thereverse complementary pair of a strand.

Genes are contiguous subparts of single-stranded DNA that are templates for pro-ducing proteins. Genes can appear in ei-

ther of the DNAs strands. The set of allgenes in a given organism is called thegenome for that organism. The functionof DNA material between genes is largelyunknown. Certain intergenic regions ofDNA (called noncoding) are known to playa major role in cell regulation, the pro-cess that controls the production of pro-teins and their possible interactions withDNA.

Proteins are produced from DNA using three operations or transformations called transcription, splicing, and translation. In humans and higher species (eukaryotes) the genes are only a minute part of the total DNA that exists in a cell. For the purposes of this article, chromosomes are compact chains of coiled DNA. In more rudimentary types of cells that do not have a nucleus (prokaryotes), the phase of splicing does not occur.

DNA is capable of replicating itself. The cell machinery that performs that task is called DNA-polymerase. Biologists call the capability of DNA for replication and undergoing the above three (or two) transformations the central dogma.

Genes are transcribed into pre-RNA by a complex ensemble of molecules called RNA-polymerase. During transcription the nucleotide T (thymine) is substituted by another one designated by the letter U (for uracil). Pre-RNA can be represented by alternations of sequence segments called exons and introns. The exons represent the parts of pre-RNA that will be expressed, that is, translated into proteins.

Next comes the operation called splicing; an ensemble of proteins called the spliceosome performs it. Splicing consists of concatenating the exons and excising the introns to form what is known as mRNA, or simply RNA.
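Viewed as string operations, transcription and splicing can be sketched as follows. The sketch follows the simplified view used in this appendix: the gene is written as the strand that reads like the RNA, so transcription amounts to replacing T by U. The toy gene, the exon coordinates and the function names are hypothetical; in practice exon boundaries come from annotation or from gene-finding programs.

```python
# A toy gene made of two exons separated by one intron (invented data).
EXON_1 = "ATGGCC"
INTRON = "GTAAGTTTCAG"
EXON_2 = "GAATAA"
gene = EXON_1 + INTRON + EXON_2


def transcribe(dna: str) -> str:
    """Pre-RNA copy of a gene: same letters, with T replaced by U."""
    return dna.upper().replace("T", "U")


def splice(pre_rna: str, exons: list) -> str:
    """Concatenate the exons, given as half-open (start, end) intervals.

    Everything outside the listed intervals (the introns) is dropped.
    """
    return "".join(pre_rna[start:end] for start, end in exons)


# Exon coordinates are assumed to be known for this example.
exons = [(0, len(EXON_1)), (len(EXON_1) + len(INTRON), len(gene))]

pre_rna = transcribe(gene)
mrna = splice(pre_rna, exons)
print(pre_rna)  # AUGGCCGUAAGUUUCAGGAAUAA
print(mrna)     # AUGGCCGAAUAA
```

For a prokaryotic gene, where splicing does not occur, the exon list would contain a single interval covering the whole transcript.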

The final phase, called translation, is essentially a “table look-up” performed by complex molecules called ribosomes (an ensemble of RNA and proteins). Translation repeatedly considers a triplet of consecutive nucleotides in RNA and produces one corresponding amino acid. The triplet is called a codon. In RNA, there is one special codon called a start codon and a few others called the stop codons. An open reading frame (ORF) is a sequence of codons starting with a start codon and ending with a stop codon. The ORF is thus the sequence of nucleotides that is used by the ribosome to produce the sequence of amino acids that makes up a protein.
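The “table look-up” and the notion of an ORF can be illustrated with a short Python sketch. It assumes the standard genetic code, written with one-letter amino acid symbols and `*` for the three stop codons, and it takes AUG as the start codon. The function names and the sample RNA are hypothetical, and the input is assumed to contain only the letters A, C, G and U. The ORF search is deliberately naive: it looks only at the first AUG and only at the frame that AUG defines.

```python
from itertools import product

# Standard genetic code; codons are enumerated in UCAG order
# (UUU, UUC, UUA, UUG, UCU, ...), and '*' marks a stop codon.
BASES = "UCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
GENETIC_CODE = {
    "".join(codon): aa for codon, aa in zip(product(BASES, repeat=3), AMINO)
}


def find_orf(rna: str):
    """Return the ORF that begins at the first AUG, stop codon included.

    Returns None if there is no AUG or no in-frame stop codon after it.
    """
    start = rna.find("AUG")
    if start == -1:
        return None
    for i in range(start, len(rna) - 2, 3):
        if GENETIC_CODE[rna[i:i + 3]] == "*":
            return rna[start:i + 3]
    return None


def translate(orf: str) -> str:
    """Look up each codon in turn and stop at the first stop codon."""
    protein = []
    for i in range(0, len(orf) - 2, 3):
        amino_acid = GENETIC_CODE[orf[i:i + 3]]
        if amino_acid == "*":
            break
        protein.append(amino_acid)
    return "".join(protein)


orf = find_orf("GGAUGGCCGAAUAAGC")
print(orf)             # AUGGCCGAAUAA
print(translate(orf))  # MAE
```

A realistic ORF finder would examine every start codon in all three reading frames of both strands; the sketch omits this to keep the correspondence with the text visible.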

There are basically 20 amino acids but, in certain rare situations, others can be added to that list. Since there are 64 different codons and 20 amino acids, the “table look-up” for translating each codon into an amino acid is redundant in the sense that multiple codons can produce the same amino acid. The “table” used by nature to perform translation is called the genetic code. Due to the redundancy of the genetic code, certain nucleotide changes in DNA may not alter the resulting protein.
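The redundancy can be checked directly by counting how many codons map to each amino acid. The sketch below again assumes the standard genetic code in RNA letters (a change in DNA shows up as the corresponding change in the RNA codon); the helper name `is_silent` is an assumption introduced for the example.

```python
from collections import Counter
from itertools import product

# Standard genetic code, codons enumerated in UCAG order; '*' = stop.
BASES = "UCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
GENETIC_CODE = {
    "".join(codon): aa for codon, aa in zip(product(BASES, repeat=3), AMINO)
}

# Number of codons per amino acid, ignoring the stop codons.
codons_per_amino_acid = Counter(
    aa for aa in GENETIC_CODE.values() if aa != "*"
)
print(len(codons_per_amino_acid))           # 20 amino acids
print(sum(codons_per_amino_acid.values()))  # 61 codons encode them
print(codons_per_amino_acid["L"])           # 6 codons for leucine
print(codons_per_amino_acid["W"])           # 1 codon for tryptophan


def is_silent(codon_before: str, codon_after: str) -> bool:
    """True if replacing one codon by the other leaves the amino acid unchanged."""
    return GENETIC_CODE[codon_before] == GENETIC_CODE[codon_after]


print(is_silent("GCC", "GCA"))  # True: both encode alanine
print(is_silent("GAA", "GAU"))  # False: glutamate becomes aspartate
```

The 64 codons thus split into 61 that encode amino acids and 3 stop codons, and the unequal counts per amino acid are what allow some nucleotide changes to leave the protein unchanged.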

Once a protein is produced, it folds (most of the time) into a unique structure in 3D space.

In the 3D representation of a protein, one can distinguish three different types of components: α-helices, β-sheets and coils. The secondary structure of a protein is its sequence of amino acids, annotated to distinguish the boundaries of each component: helices, sheets, and coils. The tertiary structure of a protein is its 3D representation.
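One simple way to hold such an annotation in a program is a second string of the same length as the amino acid sequence, with one label per residue. The sketch below assumes the common three-letter convention H (helix), E (sheet strand) and C (coil); the sequence and its annotation are invented for illustration and do not describe a real protein.

```python
from itertools import groupby


def segments(annotation: str) -> list:
    """Collapse a per-residue annotation into (label, start, end) runs.

    Intervals are half-open: the run covers positions start .. end - 1.
    """
    result = []
    position = 0
    for label, run in groupby(annotation):
        length = len(list(run))
        result.append((label, position, position + length))
        position += length
    return result


sequence = "MAEELLKKAGVTVEV"    # hypothetical amino acid sequence
annotation = "CHHHHHHHCCEEEEE"  # hypothetical labels, one per residue

for label, start, end in segments(annotation):
    print(label, sequence[start:end])
```

The boundaries between components are then simply the positions at which the label changes.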

The function of a protein is the way it participates with other proteins and molecules in keeping the cell alive and interacting with its environment. Function is closely related to tertiary structure. In functional genomics, one studies the function of all the proteins of a genome. One of the important goals of bioinformatics is to help biologists in deciphering the function of proteins.

ACKNOWLEDGMENTS

The author wishes to express his gratitude to Mark Gerstein, Nathan Goodman and the reviewers who provided many suggestions to improve the original manuscript.

REFERENCES

For an interesting graphical gallery of biology (downloadable drawings), sponsored by the National Health Museum, consult http://www.accessexcellence.org/AB/GG/.

A recommended glossary of genetic terms http://www.ornl.gov/TechResources/Human Genome/publicat/primer2001/glossary.html.

NCBI (National Center for Biotechnology Information) http://www.ncbi.nlm.nih.gov.

A summary of interesting sites in bioinformatics is given by the following URLs.

Online lectures in bioinformatics (Heidelberg) http://www.dkfz-heidelberg.de/tbi/bioinfo/Biol/Intro/.

A special interest group with news and pointers http://www.bioinformatrix.com.

Bioinformatics Bulletin Board http://bioinformatics.org/faq/#education.

Bioinformatics resources http://www.brc.dcs.gla.ac.uk/~actan/resources.html.

Interesting and useful URLs on existing courses.

Jackson’s Laboratory Web Page with educational links http://www.jax.org/courses.

Course in bioinformatics (recommended set of slides by R. L. Bernstein) http://www.swbic.org/education/bioinfo/.

Highly recommended texts in molecular cell biology [Alberts et al. 2004; Lodish et al. 2003].

Some texts in computational biology or bioinformatics [Baldi and Brunak 2002; Baxevanis and Ouellette 1998; Campbell and Heyer 2002; Claverie and Notredame 2003; Durbin et al. 1998; Dwyer 2002; Felsenstein 2003; Gonick and Wheelis 1991; Gusfield 1997; Krane and Raymer 2003; Jones and Pevzner 2004; Mount 2001; Orengo et al. 2003; Pevsner 2003; Pevzner 2000; Setubal and Meidanis 1997; Salzberg et al. 1998; Waterman 1995].

Main Journals in Bioinformatics
Bioinformatics, Oxford University Press
IEEE/ACM Transactions on Computational Biology and Bioinformatics (TCBB)
Journal of Computational Biology, Mary Ann Liebert, Inc., Publishers

Note: Many biology journals publish articles related to bioinformatics, e.g., Science, Nature, Nucleic Acids Research, Journal of Molecular Biology, Proceedings of the National Academy of Sciences (PNAS), etc. In particular, Nucleic Acids Research publishes a compendium of URLs in its yearly January issue.

Yearly Conferences
RECOMB, Research in Computational Molecular Biology
IEEE Computer Society Bioinformatics Conference
PSB, Pacific Symposium on Biocomputing
ISMB, Intelligent Systems for Molecular Biology

Articles and Books

ABELSON, H., ALLEN, D., COORE, D., HANSON, C., HOMSY, G., KNIGHT, JR., T. F., NAGPAL, R., RAUCH, E., SUSSMAN, G. J., AND WEISS, R. 1995. Amorphous Computing. Commun. ACM.



ALBERTS, B., BRAY, D., JOHNSON, A., LEWIS, J., RAFF, M., ROBERTS, K., AND WALTER, F. 2004. Essential Cell Biology, 2nd ed. Garland Publishing.

ASHBURNER, M. AND GOODMAN, N. 1997. Informatics: Genome and genetic databases. Curr. Op. Gen. Develop. 7, 750–756.

BALDI, P. AND BRUNAK, S. 2002. Bioinformatics: The Machine Learning Approach, MIT Press.

BAR-JOSEPH, Z., GERBER, G., GIFFORD, D., AND JAAKKOLA, T. 2002. A new approach to analyzing gene expression time series data. In RECOMB The Sixth Annual International Conference on Research in Computational Molecular Biology.

BAXEVANIS, A., AND OUELLETTE, B. F. F. (EDS.). 1998. Bioinformatics: A Practical Guide to the Analysis of Genes and Proteins. Wiley, New York.

BENNETT, C., LI, M., AND MA, B. 2003. Linking chain letters. Sci. Amer. (June) 77–81.

BROWN, M. P. S., GRUNDY, W. N., LIN, D., CRISTIANINI, N., SUGNET, C. W., FUREY, T. S., ARES, JR., M. AND HAUSSLER, D. 2000. Knowledge-based analysis of microarray gene expression data using support vector machines. Proc. Nat. Acad. Sci. 97, 1, 262–267.

CAMPBELL, A. M. AND HEYER, L. 2002. Discovering Genomics, Proteomics and BioInformatics. Benjamin Cummings.

CLAVERIE, J. M. AND NOTREDAME, C. 2003. Bioinformatics for Dummies. Wiley, New York.

COHEN, J. 2001. Classification of approaches used to study cell regulation: Search for a unified view using constraints and machine learning. Electronic Transactions in Artificial Intelligence, Machine Intelligence 18. Linkoping Electronic Articles in Computer and Information Science ISSN 1401-9841, 6(025).

COHEN, J. 2003. Guidelines for establishing undergraduate bioinformatics courses. J. Sci. Educat. Tech. 12, 4 (Dec.) 449–456.

DEJONG, H. 2002. Modeling and simulation of genetic regulatory systems: A literature review. J. Comput. Biol. 9, 1, 67–103.

DELCHER, A., KASIF, S., FLEISCHMANN, R. D., PETERSON, J., WHITE, O., AND SALZBERG, S. L. 1999. Alignment of whole genomes. Nucl. Acids Res. 27, 11, 2369–2376.

DUENWALD, M. 2003. Gene is linked to susceptibility to depression. The New York Times, July 18, Sect. A, Page 14, Col. 1.

DURBIN, R., EDDY, S., KROGH, A., AND MITCHISON, G. 1998. Biological Sequence Analysis. Cambridge University Press, Cambridge, Mass.

DWYER, R. A. 2002. Genomic Perl: From Bioinformatics Basics to Working Code. Cambridge University Press, Cambridge, Mass.

FELSENSTEIN, J. 2003. Inferring Phylogenies, Sinauer Associates.

FRIEDMAN, N., LINIAL, M., NACHMAN, I., AND PEER, D. 2000. Using Bayesian networks to analyze expression data. In Proceedings RECOMB: Computational Molecular Biology, pp. 127–135.

GILBERT, D. R., WESTHEAD, D. R., NAGANO, N., AND THORNTON, J. M. 1999. Motif-based searching in TOPS protein topology databases. Bioinformatics 5, 4, 317–326. Also see http://www.sander.embl-ebi.ac.uk/tops/.

GONICK, L. AND WHEELIS, M. 1991. A Cartoon Guide to Genetics. Harper Perennial.

GOODMAN, N. 2002. Biological data becomes computer literate: new advances in bioinformatics. Curr. Op. Biotech. 13, 66–71.

GUSFIELD, D. 1997. Algorithms on Strings, Trees, and Sequences: Computer Science and Computational Biology. Cambridge University Press.

HAND, D. J., MANNILA, H., AND SMYTH, P. 2000. Principles of Data Mining. MIT Press, Cambridge, Mass.

HUSON, D. H., REINERT, K., AND MYERS, E. W. 2002. The greedy path-merging algorithm for contig scaffolding. J. ACM 49, 5 (Sept.), 603–615.

JAIN, A., MURTY, M., AND FLYNN, P. 1999. Data clustering: A review. ACM Comput. Surv. 31, 3, 264–323.

JONES, N. C. AND PEVZNER, P. A. 2004. An Introduction to Bioinformatics Algorithms, MIT Press, Cambridge, Mass.

KARP, P. 2001. Pathway databases: A case study in computational symbolic theories. Science 293, 2040–2044.

KELLY, H. C. 2003. Terrorism and the biology lab. New York Times Op-Ed Page, July 2.

KNUTH, D. E. 1993. Computer Literacy Bookshops Interview (Dec.) (Available at http://dmoz.org/Computers/History/Pioneers/Knuth, Donald/).

KRANE, D. AND RAYMER, M. 2003. Fundamental Concepts of BioInformatics. Benjamin Cummings.

KROGH, A. 1998. An introduction to hidden Markov models for biological sequences. In S. L. Salzberg, D. B. Searls, and S. Kasif (eds.), Computational Methods in Molecular Biology. Elsevier, Amsterdam, The Netherlands, pp. 45–63.

KUIPERS, B. J. 1994. Qualitative Reasoning: Modeling and Simulation with Incomplete Knowledge. MIT Press, Cambridge, Mass.

LATHROP, R. H. AND SMITH, T. F. 1996. Global optimum protein threading with gapped alignment and empirical pair potentials. J. Molec. Biol. 255, 641–665.

LI, H., HELLING, R., TANG, C., AND WINGREEN, N. 1996. Emergence of preferred structures in a simple model of protein folding. Science 273, 666–669.

LIANG, S., FUHRMAN, S., AND SOMOGYI, R. 1998. REVEAL, A general reverse engineering algorithm for inference of genetic network architectures. In Pacific Symposium on Biocomputing 3, pp. 18–29.



LODISH, H., BERK, A., MATSUDAIRA, P., KAISER, C. A., KRIEGER, M., SCOTT, M. P., ZIPURSKY, L., AND DARNELL, J. 2003. Molecular Cell Biology. W.H. Freeman.

LUSCOMBE, N. M., GREENBAUM, D., AND GERSTEIN, M. 2001. What is bioinformatics? A proposed definition and overview of the field. Methods Inf. Med. 40, 346–358 (Also available at http://bioinfo.mbb.yale.edu/papers/).

MILLER, W. 2001. Comparison of genomic DNA sequences: Solved and unsolved problems. Bioinformatics 17, 5, 391–397.

MITCHELL, T. 1997. Machine Learning, McGraw Hill, New York.

MOUNT, D. W. 2001. Bioinformatics: Sequence and Genome Analysis, Cold Spring Harbor Press, Cold Spring Harbor, N.Y.

MYERS, E. 1999. Whole genome DNA-sequencing. IEEE Computat. Eng. Sci. 3, 1, 33–43.

ORENGO, C. A., JONES, D. T., AND THORNTON, J. M. 2003. Bioinformatics: Genes, Proteins and Computers. BIOS Scientific Publishers, Oxford, England.

PARSONS, R. J., FORREST, S., AND BURKS, C. 1995. Genetic algorithms, operators, and DNA fragment assembly. Mach. Learn. 21, 1–2, 11–33. (Also see paper by Parsons in Computational Methods in Molecular Biology, S. L. Salzberg, D. B. Searls, and S. Kasif (Eds.). Elsevier, Amsterdam, The Netherlands, 1998.)

PEVSNER, J. 2003. Bioinformatics and Functional Genomics. Wiley-Liss.

PEVZNER, P. A. 2000. Computational Molecular Biology: An Algorithmic Approach. MIT Press, Cambridge, Mass.

RABINER, L. R. 1989. A tutorial on hidden Markov models and selected applications in speech recognition. Proc. IEEE 77, 2, 257–286.

REGEV, E. AND SHAPIRO, E. 2002. Cellular abstractions: Cells as computation. Nature 419 (Sept.), 419–443.

RIVAS, E. AND EDDY, S. R. 2000. The language of RNA: A formal grammar that includes pseudoknots. Bioinformatics 18, 4, 334–340.

SALZBERG, S. L., SEARLS, D. B., AND KASIF, S., EDS. 1998. Computational Methods in Molecular Biology. Elsevier, Amsterdam, The Netherlands.

SCHWARTZ, S., ZHANG, Z., FRAZER, K. A., SMIT, A., RIEMER, C., BOUCK, J., GIBBS, R., HARDISON, R., AND MILLER, W. 2000. PipMaker: A web server for aligning two genomic DNA sequences. Genome Res. 10, 4 (Apr.), 577–586.

SEARLS, D. B. 1992. The linguistics of DNA. Amer. Sci. 80, 579–591.

SEARLS, D. B. 1998. Grand challenges in computational biology. In Computational Methods in Molecular Biology, S. L. Salzberg, D. B. Searls, and S. Kasif, Eds. Elsevier, Amsterdam, The Netherlands.

SEARLS, D. B. 2002. The language of genes. Nature 420 (November), 211–217.

SETUBAL, J. AND MEIDANIS, J. 1997. Introduction to Computational Molecular Biology, PWS Publishing.

THIERRY-MIEG, N. 2000. Protein-protein interaction prediction for C. elegans. In Knowledge Discovery in Biology, Workshop at the PKDD2000 (Conference on Principles and Practice of Knowledge Discovery in Databases) (Lyon, France, Sept.).

THOMPSON, J. D., HIGGINS, D. G., AND GIBSON, T. J. 1994. CLUSTAL W: Improving the sensitivity of progressive multiple sequence alignment through sequence weighting, position-specific gap penalties and weight matrix choice. Nucl. Acids Res. 22, 4673–4680.

TOMITA, M., HASHIMOTO, K., TAKAHASHI, K., SHIMIZU, T. S., MATSUZAKI, Y., MIYOSHI, F., SAITO, K., TANIDA, S., YUGI, K., VENTER, J. C., AND HUTCHINSON III, C. A. 1999. E-CELL: Software environment for whole cell simulation. Bioinformatics 15, 1, 72–84.

WATERMAN, M. S. 1995. Introduction to Computational Biology: Maps, Sequences and Genomes. CRC Press.

WATSON, J. D. AND BERRY, A. 2003. DNA: The Secret of Life. Knopf.

WETHERELL, C. S. 1980. Probabilistic languages: A review and some open questions. ACM Comput. Surv. 12, 4, 361–379.

ZUKER, M. AND STIEGLER, P. 1981. Optimal computer folding of large RNA sequences using thermodynamics and auxiliary information. Nucl. Acids Res. 9, 133–148. (Also see http://www.bioinfo.rpi.edu/~zukerm/).

Received July 2003; accepted August 2004

