
Guest Editorial: WABI Special Section Part II


Junhyong Kim and Inge Jonassen

The Fourth International Workshop on Algorithms in Bioinformatics (WABI) 2004 was held in Bergen, Norway, in September 2004. The program committee consisted of 33 members and selected, among 117 submissions, 39 to be presented at the workshop and included in the workshop proceedings (volume 3240 of Lecture Notes in Bioinformatics, series edited by Sorin Istrail, Pavel Pevzner, and Michael Waterman).

The WABI 2004 program committee selected a small number of papers among the 39 to be invited to submit extended versions of their papers to a special section of the IEEE/ACM Transactions on Computational Biology and Bioinformatics. Four papers were published in the October-December 2004 issue of the journal and this issue contains an additional three papers. We would like to thank both the entire program committee for WABI and the reviewers of the papers in this issue for their valuable contributions.

The first of the papers is “A New Distance for High Level RNA Secondary Structure Comparison,” authored by Julien Allali and Marie-France Sagot. This paper describes algorithms for comparing secondary structures of RNA molecules where the structures are represented by trees. The problem of classifying RNA secondary structure is becoming critical as biologists are discovering more and more noncoding functional elements in the genome (e.g., miRNA). Most likely, the major functional determinant of these elements is their secondary structure and, therefore, a metric between such secondary structures will also help delineate clusters of functional groups. In Allali and Sagot’s paper, two tree representations of secondary structure are compared by analysing how one tree can be transformed into the other using an allowed set of operations. Each operation can be associated with a cost and the distance between two trees can then be defined as the minimum cost associated with a transformation of one tree into the other. Allali and Sagot introduce two new operations that they name edge fusion and node fusion and show that these alleviate limitations associated with the classical tree edit operations used for RNA comparison. Importantly, they also present algorithms for calculating the distance between trees allowing the new operations in addition to the classical ones, and analyze the performance of the algorithms.

The second paper is “Topological Rearrangements and Local Search Method for Tandem Duplication Trees” and is authored by Denis Bertrand and Olivier Gascuel. The paper approaches the problem of estimating the evolutionary history of tandem repeats. A tandem repeat is a stretch of DNA sequence that contains an element that is repeated multiple times and where the repeat occurrences are next to each other in the sequence. Since the repeats are subject to mutations, they are not identical. Tandem repeats arise through evolution by “copying” (duplication) of repeat elements in blocks of varying size. Bertrand and Gascuel address the problem of finding the most likely sequence of events giving rise to the observed set of repeats. Each sequence of events can be described by a duplication tree and one searches for the tree that is the most parsimonious, i.e., one that explains how the sequence has evolved from an ancestral single copy with a minimum number of mutations along the branches of the tree. The main difference from the standard phylogeny problem is that the linear ordering of the tandem duplications imposes constraints on the possible binary tree forms. This paper describes a local search method that allows exploration of the complete space of possible duplication trees and shows that the method is superior to other existing methods for reconstructing the tree and recovering its duplication events.

The third paper is “Optimizing Multiple Seeds for Homology Search,” authored by Daniel G. Brown. The paper presents an approach to selecting starting points for pairwise local alignments of protein sequences. The problem of pairwise local alignment is to find a segment from each sequence so that the two local segments can be aligned to obtain a high score. For commonly used scoring schemes, this can be solved exactly using dynamic programming. However, pairwise alignment is frequently applied to large data sets, and heuristic methods for restricting the alignments to be considered are frequently used, for instance, in the BLAST programs. The key is to restrict the number of alignments as much as possible, by choosing a few good seeds, without missing high scoring alignments. The paper shows that this can be formulated as an integer programming problem and presents an algorithm for choosing optimal seeds. Analysis is presented showing that the approach gives four times fewer false positives (unnecessary seeds) in comparison with BLASTP without losing more good hits.

Junhyong Kim

Inge Jonassen

Guest Editors

IEEE/ACM TRANSACTIONS ON COMPUTATIONAL BIOLOGY AND BIOINFORMATICS, VOL. 2, NO. 1, JANUARY-MARCH 2005 1

J. Kim is with the Department of Biology, University of Pennsylvania, 3451 Walnut Street, Philadelphia, PA 19104. E-mail: [email protected].

I. Jonassen is with the Department of Informatics and Computational Biology Unit, University of Bergen, HIB N5020 Bergen, Norway. E-mail: [email protected].

For information on obtaining reprints of this article, please send e-mail to: [email protected].

1545-5963/05/$20.00 © 2005 IEEE Published by the IEEE CS, CI, and EMB Societies & the ACM

Junhyong Kim is the Edmund J. and Louise Kahn Term Endowed Professor in the Department of Biology at the University of Pennsylvania. He holds joint appointments in the Department of Computer and Information Science, Penn Center for Bioinformatics, and the Penn Genomics Institute. He serves on the editorial board of Molecular Development and Evolution and the IEEE/ACM Transactions on Computational Biology and Bioinformatics, the council of the Society for Systematic Biology, and the executive committee of the CyberInfrastructure for Phylogenetics Research. His research focuses on computational and experimental approaches to comparative development. The current focus of his lab is in three areas: computational phylogenetics, in silico gene discovery, and comparative development using genome-wide gene expression data.

Inge Jonassen is a professor of computer science in the Department of Informatics at the University of Bergen in Norway, where he is a member of the bioinformatics group. He is also affiliated with the Bergen Center for Computational Science at the same university, where he heads the Computational Biology Unit. He is also vice president of the Society for Bioinformatics in the Nordic Countries (SocBiN) and a member of the board of the Nordic Bioinformatics Network. He coordinates the technology platform for bioinformatics funded by the Norwegian Research Council functional genomics programme FUGE. He has worked in the field of bioinformatics since the early 1990s, where he has primarily focused on methods for discovery of patterns with applications to biological sequences and structures and on methods for the analysis of microarray gene expression data.

For more information on this or any other computing topic, please visit our Digital Library at www.computer.org/publications/dlib.


A New Distance for High Level RNA Secondary Structure Comparison

Julien Allali and Marie-France Sagot

Abstract—We describe an algorithm for comparing two RNA secondary structures coded in the form of trees that introduces two new

operations, called node fusion and edge fusion, besides the tree edit operations of deletion, insertion, and relabeling classically used in

the literature. This allows us to address some serious limitations of the more traditional tree edit operations when the trees represent

RNAs and what is searched for is a common structural core of two RNAs. Although the algorithm complexity has an exponential term, this

term depends only on the number of successive fusions that may be applied to the same node, not on the total number of fusions. The

algorithm remains therefore efficient in practice and is used for illustrative purposes on ribosomal as well as on other types of RNAs.

Index Terms—Tree comparison, edit operation, distance, RNA, secondary structure.

1 INTRODUCTION

RNAs are one of the fundamental elements of a cell. Their role in regulation has recently been shown to be far more prominent than initially believed (see the 20 December 2002 issue of Science, which designated small RNAs with regulatory function as the scientific breakthrough of the year). It is now known, for instance, that there is massive transcription of noncoding RNAs. Yet current mathematical and computer tools remain mostly inadequate to identify, analyze, and compare RNAs.

An RNA may be seen as a string over the alphabet of nucleotides (also called bases), {A, C, G, U}. Inside a cell, RNAs do not retain a linear form, but instead fold in space. The fold is given by the set of nucleotide bases that pair. The main type of pairing, called canonical, corresponds to bonds of the type A-U and G-C. Other rarer types of bonds may be observed, the most frequent among them being G-U, also called the wobble pair. Fig. 1 shows the sequence of a folded RNA. Each box represents a consecutive sequence of bonded pairs, corresponding to a helix in 3D space. The secondary structure of an RNA is the set of helices (or the list of paired bases) making up the RNA. Pseudoknots, which may be described as a pair of interleaved helices, are in general excluded from the secondary structure of an RNA. RNA secondary structures can thus be represented as planar graphs. An RNA primary structure is its sequence of nucleotides while its tertiary structure corresponds to the geometric form the RNA adopts in space.
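The pairing rules and the exclusion of pseudoknots described above can be made concrete with a small sketch. The function names and the encoding of a secondary structure as a list of index pairs (0-based positions in the sequence) are assumptions made for illustration; they do not come from the paper.

```python
# Pairing rules as stated in the text: A-U and G-C are canonical, G-U is the wobble pair.
CANONICAL = {frozenset("AU"), frozenset("GC")}
WOBBLE = frozenset("GU")


def pair_type(x: str, y: str) -> str:
    """Classify a pair of RNA bases as 'canonical', 'wobble', or 'other'."""
    pair = frozenset((x.upper(), y.upper()))
    if pair in CANONICAL:
        return "canonical"
    if pair == WOBBLE:
        return "wobble"
    return "other"


def has_pseudoknot(pairs) -> bool:
    """Return True if two base pairs (i, j) and (k, l) interleave, i.e. i < k < j < l.

    `pairs` is a hypothetical encoding: an iterable of (position, position) tuples.
    """
    norm = sorted((min(i, j), max(i, j)) for i, j in pairs)
    for a, (i, j) in enumerate(norm):
        for k, l in norm[a + 1:]:
            if i < k < j < l:
                return True
    return False
```

A structure for which `has_pseudoknot` returns False has only nested or disjoint pairs, which is the property that lets it be drawn without crossings and, later in the paper, coded as a tree.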

Apart from helices, the other main structural elements in

an RNA are:

1. hairpin loops, which are sequences of unpaired bases closing a helix;

2. internal loops, which are sequences of unpaired bases linking two different helices;

3. bulges, which are internal loops with unpaired bases on one side only of a helix;

4. multiloops, which are unpaired bases linking at least three helices.

Stems are successions of one or more among helices,

internal loops, and/or bulges.

The comparison of RNA secondary structures is one of

the main basic computational problems raised by the study

of RNAs. It is the problem we address in this paper. The

motivations are many. RNA structure comparison has been

used in at least one approach to RNA structure prediction

that takes as initial data a set of unaligned sequences

supposed to have a common structural core [1]. For each

sequence, a set of structural predictions are made (for

instance, all suboptimal structures predicted by an algorithm like Zuker’s MFOLD [15], or all suboptimal sets of

compatible helices or stems). The common structure is then

found by comparing all the structures obtained from the

initial set of sequences, and identifying a substructure

common to all, or to some of the sequences. RNA structure

comparison is also an essential element in the discovery of

RNA structural motifs, or profiles, or of more general

models that may then be used to search for other RNAs of

the same type in newly sequenced genomes. For instance,

general models for tRNAs and introns of group I have been

derived by hand [3], [10]. It is an open question whether

models at least as accurate as these, or perhaps even more

accurate, could have been derived in an automatic way. The

identification of smaller structural motifs is an equally

important topic that requires comparing structures.

As we saw, the comparison of RNA structures may concern known RNA structures (that is, structures that were experimentally determined) or predicted structures. The objective in both cases is the same: to find the common parts of such structures.

J. Allali is with the Institut Gaspard-Monge, Universite de Marne-la-Vallee, Cite Descartes, Champs-sur-Marne, 77454, Marne-la-Vallee Cedex 2, France. E-mail: [email protected].

M.-F. Sagot is with Inria Rhone-Alpes, Universite Claude Bernard, Lyon I, 43 Bd du Novembre 1918, 69622 Villeurbanne cedex, France. E-mail: [email protected].

Manuscript received 11 Oct. 2004; accepted 20 Dec. 2004; published online 30 Mar. 2005. For information on obtaining reprints of this article, please send e-mail to: [email protected], and reference IEEECS Log Number TCBB-0164-1004.

In [11], Shapiro suggested mathematically modeling RNA

secondary structures without pseudoknots by means of

trees. The trees are rooted and ordered, which means that

the order among the children of a node matters. This order

corresponds to the 5’-3’ orientation of an RNA sequence.

Given two trees, each representing an RNA, there are two

main ways for comparing them. One is based on the

computation of the edit distance between the two trees

while the other consists in aligning the trees and using the

score of the alignment as a measure of the distance between

the trees. Contrary to what happens with sequences, the

two, alignment and edit distance, are not equivalent. The

alignment distance is a restrained form of the edit distance

between two trees, where all insertions must be performed

before any deletions. The alignment distance for general

trees was defined in 1994 by Jiang et al. in [9] and extended

to an alignment distance between forests in [6]. More

recently, Hochsmann et al. [7] applied the tree alignment

distance to the comparison of two RNA secondary

structures. Because of the restriction on the way edit

operations can be applied in an alignment, we are not

concerned in this paper with tree alignment distance and

we therefore address exclusively from now on the problem

of tree edit distance.

Our way of comparing two RNA secondary structures is then to apply a number of tree edit operations in one or both of the trees representing the RNAs until isomorphic trees are obtained. The currently most popular program using this approach is probably the Vienna package [5], [4]. The tree edit operations considered are derived from the operations classically applied to sequences [13]: substitution, deletion, and insertion. In 1989, Zhang and Shasha [14] gave a dynamic programming algorithm for comparing two trees. Shapiro and Zhang then showed [12] how to use tree editing to compare RNAs. The latter also proposed various tree models that could be used for representing RNA secondary structures. Each suggested tree offers a more or less detailed view of an RNA structure. Figs. 2b, 2c, 2d, and 2e present a few examples of such possible views for the RNA given in Fig. 2a. In Fig. 2, the nodes of the tree in Fig. 2b represent either unpaired bases (leaves) or paired bases (internal nodes). Each node is labeled with, respectively, a base or a pair of bases. A node of the tree in Fig. 2c represents a set of successive unpaired bases or of stacked paired ones. The label of a node is an integer indicating, respectively, the number of unpaired bases or the height of the stack of paired ones. The nodes of the tree in Fig. 2d represent elements of secondary structure: hairpin loop (H), bulge (B), internal loop (I), or multiloop (M). The edges correspond to helices. Finally, the tree in Fig. 2e contains only the information concerning the skeleton of multiloops of an RNA. The last representation, though giving a highly simplified view of an RNA, is important nevertheless as it is generally accepted that it is this skeleton which is usually the most constrained part of an RNA. The last two models may be enriched with information concerning, for instance, the number of (unpaired) bases in a loop (hairpin, internal, multi) or bulge, and the number of paired bases in a helix. The first labels the nodes of the tree, the second its edges. Other types of information may be added (such as overall composition of the elements of secondary structure). In fact, one could consider working with various representations simultaneously or in an interlocked, multilevel fashion. This goes beyond the scope of this paper which is concerned with comparing RNA secondary structures using any one among the many tree representations possible. We shall, however, comment further on this multilevel approach later on.
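To make the most detailed of these views concrete, the following sketch builds a tree in the style of Fig. 2b, where internal nodes are base pairs and leaves are unpaired bases. The dot-bracket input notation, the `Node` class, the artificial root, and the function name are all assumptions chosen for illustration; the paper does not prescribe an input format.

```python
from dataclasses import dataclass, field


@dataclass
class Node:
    label: str                                   # a base ("A") or a pair of bases ("G-C")
    children: list = field(default_factory=list)


def base_level_tree(seq: str, dotbracket: str) -> Node:
    """Build a Fig. 2b-style tree from a sequence and a pseudoknot-free structure.

    '(' opens a base pair, ')' closes it, '.' marks an unpaired base.
    An artificial root (an assumption of this sketch) holds the top-level elements.
    """
    if len(seq) != len(dotbracket):
        raise ValueError("sequence and structure must have the same length")
    root = Node("root")
    stack = [root]
    for base, sym in zip(seq, dotbracket):
        if sym == "(":
            node = Node(base)                    # label is completed when the pair closes
            stack[-1].children.append(node)
            stack.append(node)
        elif sym == ")":
            if len(stack) == 1:
                raise ValueError("unbalanced structure")
            node = stack.pop()
            node.label = f"{node.label}-{base}"
        elif sym == ".":
            stack[-1].children.append(Node(base))  # unpaired base becomes a leaf
        else:
            raise ValueError(f"unexpected symbol {sym!r}")
    if len(stack) != 1:
        raise ValueError("unbalanced structure")
    return root
```

Because children are appended while reading the sequence from left to right, their order follows the 5'-3' orientation, as required for the ordered trees used in the rest of the paper.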

The objectives of this paper are twofold. The first is to give some indications on why the classical edit operations that have been considered so far in the literature for comparing trees present some limitations when the trees stand for RNA structures. Three cases of such limitations will be illustrated through examples in Section 3. In Section 4, we then introduce two novel operations, so-called node-fusion and edge-fusion, that enable us to address some of these limitations and then give a dynamic programming algorithm for comparing two RNA structures with these two additional operations. Implementation issues and initial results are presented in Section 4. In Section 5, we give a first application of our algorithm to the comparison of two RNA secondary structures. Finally, in Section 6, we sketch the main ideas behind the multilevel RNA comparison approach mentioned above. Before that, we start by introducing some notation and by recalling in the next section the basics about classical tree edit operations and tree mapping.

Fig. 1. Primary and secondary structures of a transfer RNA.

Fig. 2. Example of different tree representations ((b), (c), (d), and (e)) of the same RNA (a).

This paper is an extended version of a paper presented at

the Workshop on Algorithms in BioInformatics (WABI) in

2004, in Bergen, Norway. A few more examples are given to

illustrate some of the points made in the WABI paper,

complexity and implementation issues are discussed in

more depth as are the cost functions and a multilevel

approach to comparing RNAs.

2 TREE EDITING AND MAPPING

Let $T$ be an ordered rooted tree, that is, a tree where the order among the children of a node matters. We define three kinds of operations on $T$: deletion, insertion, and relabeling (corresponding to a substitution in sequence comparison). The operations are shown in Fig. 3. The deletion (Fig. 3b) of a node $u$ removes $u$ from the tree. The children of $u$ become the children of $u$'s father. An insertion (Fig. 3c) is the symmetric of a deletion. Given a node $u$, we remove a consecutive (in relation to the order among the children) set $u_1, \ldots, u_p$ of its children, create a new node $v$, make $v$ a child of $u$ by attaching it at the place where the set was, and, finally, make the set $u_1, \ldots, u_p$ (in the same order) the children of $v$. The relabeling of a node (Fig. 3d) consists simply in changing its label.
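The three operations can be sketched on a minimal ordered-tree structure as follows. The `Node` class and the function signatures are hypothetical and only meant to illustrate the definitions above; in particular, the caller is assumed to pass the father of the node to be deleted.

```python
from dataclasses import dataclass, field


@dataclass(eq=False)                 # eq=False: nodes are compared by identity
class Node:
    label: str
    children: list = field(default_factory=list)


def delete(father: Node, u: Node) -> None:
    """Remove u; its children take its place, in order, among the father's children."""
    i = father.children.index(u)
    father.children[i:i + 1] = u.children


def insert(u: Node, start: int, stop: int, label: str) -> Node:
    """Insert a new node v under u that adopts the consecutive children u.children[start:stop]."""
    v = Node(label, u.children[start:stop])
    u.children[start:stop] = [v]     # v is attached where the adopted children were
    return v


def relabel(u: Node, label: str) -> None:
    """Change the label of u."""
    u.label = label
```

Deleting a node and then inserting a node with the same label over the same run of children restores the original tree, which is the sense in which insertion is the symmetric of deletion.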

Given two trees $T$ and $T'$, we define $S = \{s_1, \ldots, s_e\}$ to be a series of edit operations such that, if we successively apply the operations in $S$ to the tree $T$, we obtain $T'$ (i.e., $T$ and $T'$ become isomorphic). A series of operations like $S$ realizes the editing of $T$ into $T'$ and is denoted by $T \xrightarrow{S} T'$.

We define a function $cost$ from the set of possible edit operations (deletion, insertion, relabeling) to the integers (or the reals) such that $cost_s$ is the score of the edit operation $s$. If $S$ is a series of edit operations, we define by extension that $cost_S$ is $\sum_{s \in S} cost_s$. We can define the edit distance between two trees as the minimal cost of a series of operations that performs the editing of $T$ into $T'$:

$$distance(T, T') = \min \{ cost_S \mid T \xrightarrow{S} T' \}.$$

Let an insertion or a deletion cost one and the relabeling of a node cost zero if the label is the same and one otherwise. For the two trees of the figure on the left, the series $relabel(A \to F).delete(B).insert(G)$ realizes the editing of the left tree into the right one and costs 3. Another possibility is the series $delete(B).relabel(A \to G).insert(F)$, which also costs 3. The distance between these two trees is 3.
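Under the unit costs just described, the distance between two small ordered trees can be computed with a memoized recursion over forests. This is a minimal sketch of the classical edit distance only: it is neither the Zhang and Shasha algorithm [14] nor the fusion-aware algorithm of this paper, and the nested-tuple encoding of trees is an assumption made for illustration. It is intended for small examples, not for efficiency.

```python
from functools import lru_cache

# A tree is (label, children), where children is a tuple of trees.
# A forest is a tuple of trees. Insertions and deletions cost 1.


def relabel_cost(a, b):
    """Relabeling costs 0 if the labels are equal and 1 otherwise."""
    return 0 if a == b else 1


@lru_cache(maxsize=None)
def forest_distance(f, g):
    """Edit distance between two ordered forests under unit costs."""
    if not f and not g:
        return 0
    if not g:                                    # only deletions remain
        _, kids = f[-1]
        return forest_distance(f[:-1] + kids, g) + 1
    if not f:                                    # only insertions remain
        _, kids = g[-1]
        return forest_distance(f, g[:-1] + kids) + 1
    (lv, kv), (lw, kw) = f[-1], g[-1]            # rightmost roots of each forest
    return min(
        forest_distance(f[:-1] + kv, g) + 1,     # delete the rightmost root of f
        forest_distance(f, g[:-1] + kw) + 1,     # insert the rightmost root of g
        forest_distance(kv, kw)                  # map the two roots onto each other
        + forest_distance(f[:-1], g[:-1])
        + relabel_cost(lv, lw),
    )


def tree_distance(t1, t2):
    """Edit distance between two ordered rooted trees."""
    return forest_distance((t1,), (t2,))
```

For instance, with the hypothetical trees `("A", (("B", ()), ("C", ())))` and `("A", (("C", ()),))`, a single deletion of the node labeled B suffices, so the distance is 1.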

Given a series of operations $S$, let us consider the nodes of $T$ that are not deleted (in the initial tree or after some relabeling). Such nodes are associated with nodes of $T'$. The mapping $M_S$ relative to $S$ is the set of couples $(u, u')$ with $u \in T$ and $u' \in T'$ such that $u$ is associated with $u'$ by $S$.

The operations described above are the “classical tree edit operations” that have been commonly used in the literature for RNA secondary structure comparison. We now present a few results obtained using such classical operations that will allow us to illustrate a few limitations they may present when used for comparing RNA structures.

3 LIMITATIONS OF CLASSICAL TREE EDIT OPERATIONS FOR RNA COMPARISON

As suggested in [12], the tree edit operations recalled in the

previous section can be used on any type of tree coding of

an RNA secondary structure.

Fig. 4 shows two RNase Ps extracted from the database [2] (they are found, respectively, in Streptococcus gordonii and Thermotoga maritima). For the example we discuss now, we code the RNAs using the tree representation indicated in Fig. 2b where an internal node represents a base pair and a leaf an unpaired base. After applying a few edit operations to the trees, we obtain the result indicated in Fig. 4, with deleted/inserted bases in gray. We have surrounded a few regions that match in the two trees. Bases in the rectangular box at the bottom of the RNA on the left are thus associated with bases in the bottom rightmost rectangular box of the RNA on the right. The same is observed for the bases in the oval boxes for both RNAs. Such matches illustrate one of the main problems with the classical tree edit operations: Bases in one RNA may be mapped to identically labeled bases in the other RNA to minimise the total cost, while such bases should not be associated in terms of the elements of secondary structure to which they belong. In fact, such elements are often distant from one another along the common RNA structure. We call this problem the “scattering effect.” It is related to the definition of tree edit operations. In the case of this example and of the representation adopted, the problem might have been avoided if structural information had been used. Indeed, the problem appears also because the structural location of an unpaired base is not taken into account. It is therefore possible to match, for instance, an unpaired base from a hairpin loop with an unpaired base from a multiloop. Using another type of representation, as we shall do, would, however, not be enough to solve all problems as we see next.

Fig. 3. Edit operations: (a) the original tree $T$, (b) deletion of the node labeled D, (c) insertion of the node labeled I, and (d) relabeling of a node in $T$ (the label A of the root is changed into K).

Indeed, to compare the same two RNAs, we can also use a more abstract tree representation such as the one given in Fig. 2d. In this case, the internal nodes represent a multiloop, internal loop, or bulge, the leaves code for hairpin loops, and the edges for helices. The result of the editing of $T$ into $T'$ for some cost function is presented in Fig. 5 (we shall come back later to the cost functions used in the case of such more abstract RNA representations; for the sake of this example, we may assume an arbitrary one is used).

The problem we wish to illustrate in this case is shown

by the boxes in the figure. Consider the boxes at the bottom.

In the left RNA, we have a helix made up of 13 base pairs. In

the right RNA, the helix is formed by seven base pairs

followed by an internal loop and another helix of size 5. By

definition (see Section 2), the algorithm can only associate

one element in the first tree to one element in the second

tree. In this case, we would like to associate the helix of the

left tree to the two helices of the second tree since it seems

clear that the internal loop represents either an inserted

element in the second RNA, or the unbonding of one base

pair. This, however, is not possible with classical edit

operations.

A third type of problem one can meet when using only the three classical edit operations to compare trees standing for RNAs is similar to the previous one, but concerns this time a node instead of edges in the same tree representation. Often, an RNA may present a very small helix between two elements (multiloop, internal loop, bulge, or hairpin loop) while such a helix is absent in the other RNA. In this case, we would therefore have liked to be able to associate one node in a tree representing an RNA with two or more nodes in the tree for the other RNA. Once again, this is not possible with any of the classical tree edit operations. An illustration of this problem is shown in Fig. 6.

Fig. 4. Illustration of the scattering effect problem. Result of the matching of two RNase Ps, of Streptococcus gordonii and of Thermotoga maritima, using the model given in Fig. 2b.

Fig. 5. Illustration of the one-to-one association problem with edges. Result of the matching of the two RNase Ps, of Saccharomyces uvarum and of Saccharomyces kluyveri, using the model given in Fig. 2d.

We shall use RNA representations that take the elements

of the structure of an RNA into account to avoid some of the

scattering effect. Furthermore, in addition to considering

information of a structural nature, labels are attached, in

general, to both nodes and edges of the tree representing an

RNA. Such labels are numerical values (integers or reals).

They represent in most cases the size of the corresponding

element, but may also further indicate its composition, etc.

Such additional information is then incorporated into the

cost functions for all three edit operations. It is important to

observe that when dealing with trees labeled at both the

nodes and edges, any node and the edge that leads to it (or,

in an alternative perspective, departs from it) represent a

single object from the point of view of computing an edit

distance between the trees.

It remains now to deal with the last two problems that

are a consequence of the one-to-one associations between

nodes and edges enforced by the classical tree edit

operations. To that purpose, we introduce two novel tree

edit operations, called the edge fusion and the node fusion.

4 INTRODUCING NOVEL TREE EDIT OPERATIONS

4.1 Edge Fusion and Node Fusion

In order to address some of the limitations of the classical tree

edit operations that were illustrated in the previous section,

we need to introduce two novel operations. These are the edge

fusion and the node fusion. They may be applied to any of the

tree representations given in Figs. 2c, 2d, and 2e.

An example of edge fusion is shown in Fig. 7a. Let $e_u$ be an edge leading to a node $u$, $c_i$ a child of $u$, and $e_{c_i}$ the edge between $u$ and $c_i$. The edge fusion of $e_u$ and $e_{c_i}$ consists in replacing $e_{c_i}$ and $e_u$ with a new single edge $e$. The edge $e$ links the father of $u$ to $c_i$. Its label then becomes a function of the (numerical) labels of $e_u$, $u$, and $e_{c_i}$. For instance, if such labels indicate the size of each element (e.g., for a helix, the number of its stacked pairs, and for a loop, the min, max, or average of its unpaired bases on each side of the loop), the label of $e$ could be the sum of the sizes of $e_u$, $u$, and $e_{c_i}$. Observe that merging two edges implies deleting all subtrees rooted at the children $c_j$ of $u$ for $j$ different from $i$. The cost of such deletions is added to the cost of the edge fusion.

An example of node fusion is given in Fig. 7b. Let $u$ be a node and $c_i$ one of its children. Performing a node fusion of $u$ and $c_i$ consists in making $u$ the father of all children of $c_i$ and in relabeling $u$ with a value that is a function of the values of the labels of $u$, $c_i$, and of the edge between them. Observe that a node fusion may be simulated using the classical edit operations by a deletion followed by a relabeling. However, the difference between a node fusion and a deletion/relabeling is in the cost associated with both operations. We shall come back to this point later.

Like insertions and deletions, edge fusions and node fusions have symmetric counterparts, which are the edge split and the node split.
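The two fusion operations described above can be sketched on a small in-memory tree. This is only an illustration under stated assumptions: the `Node` class, its field names, and the choice of summing labels are hypothetical (the text only says the new label is "a function of" the old ones, with the sum given as one possibility for edge fusion), and no cost is computed here.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Node:
    label: float                 # size of the element stored at the node
    edge: float = 0.0            # size of the element on the edge leading to this node
    children: List["Node"] = field(default_factory=list)


def node_fusion(u: Node, i: int) -> None:
    """Fuse the i-th child c_i into u: u becomes the father of c_i's children."""
    c = u.children[i]
    # Assumed relabeling: sum of the labels of u, c_i and the edge between them.
    u.label = u.label + c.edge + c.label
    # The children of c_i take the place of c_i, preserving left-to-right order.
    u.children[i:i + 1] = c.children


def edge_fusion(parent: Node, k: int, i: int) -> List[Node]:
    """Fuse e_u (edge to u = parent.children[k]) with e_ci (edge to u's i-th child).

    Returns the subtrees rooted at the other children of u, which are deleted
    and whose deletion cost must be charged to the operation.
    """
    u = parent.children[k]
    c = u.children[i]
    deleted = u.children[:i] + u.children[i + 1:]
    # Assumed relabeling: the new edge carries the sizes of e_u, u and e_ci.
    c.edge = u.edge + u.label + c.edge
    # The new edge links the father of u directly to c_i.
    parent.children[k] = c
    return deleted
```

Returning the deleted siblings from `edge_fusion` is a design choice of this sketch: it lets the caller add their deletion cost to the cost of the fusion, as the text requires.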

Given two rooted, ordered, and labeled trees $T$ and $T'$, we define the “edit distance with fusion” between $T$ and $T'$

ALLALI AND SAGOT: A NEW DISTANCE FOR HIGH LEVEL RNA SECONDARY STRUCTURE COMPARISON 7

Fig. 7. (a) An example of edge fusion. (b) An example of node fusion.

Fig. 6. Illustration of the one-to-one association problem with nodes. The two RNAs used here are RNAsePs from Pyrococcus furiosus and

Metallosphaera sedula. Triangles stand for bulges, diamonds stand for internal loops, and squares for hairpin loops.

as $\mathrm{distance}_{fusion}(T, T') = \min\{\mathrm{cost}_S \mid T \xrightarrow{S} T'\}$, with $\mathrm{cost}_s$ the cost associated with each of the seven edit operations now considered (relabeling, insertion, deletion, node fusion and split, edge fusion and split).

Proposition 1. If the following is verified:

• $\mathrm{cost}_{match}(a, b)$ is a distance,

• $\mathrm{cost}_{ins}(a) = \mathrm{cost}_{del}(a) \geq 0$,

• $\mathrm{cost}_{node\,fusion}(a, b, c) = \mathrm{cost}_{node\,split}(a, b, c) \geq 0$, and

• $\mathrm{cost}_{edge\,fusion}(a, b, c) = \mathrm{cost}_{edge\,split}(a, b, c) \geq 0$,

then $\mathrm{distance}_{fusion}$ is indeed a distance.

Proof. The positiveness of $\mathrm{distance}_{fusion}$ is given by the fact that all elementary cost functions are positive. Its symmetry is guaranteed by the symmetry in the costs of the insertion/deletion and (node/edge) fusion/split operations. Finally, it is straightforward to see that $\mathrm{distance}_{fusion}$ satisfies the triangular inequality. □

Besides the above properties that must be satisfied by the cost functions in order to obtain a distance, others may be introduced for specific purposes. Some will be discussed in Section 5.

We now present an algorithm to compute the tree edit distance between two trees using the classical tree edit operations plus the two operations just introduced.

4.2 Algorithm

The method we introduce is a dynamic programming

algorithm based on the one proposed by Zhang and Shasha.

Their algorithm is divided in two parts: They first compute

the edit distance between two trees (this part is denoted by

TDist) and then the distance between two forests (this part

is denoted by FDist). Fig. 8 illustrates in pictorial form the

part TDist and Fig. 9 the FDist part of the computation.

In order to take our two new operations into account, we

need to compute a few more things in the TDist part.

Indeed, we must add the possibility for each tree to have a

node fusion (inversely, node split) between the root and one

of its children, or to have an edge fusion (inversely edge

split) between the root and one of its children. These

additional operations are indicated in the right box of Fig. 8.

We present now a formal description of the algorithm. Let $T$ be an ordered rooted tree with $|T|$ nodes. We denote by $t_i$ the $i$th node in a postfix order. For each node $t_i$, $l(i)$ is the index of the leftmost leaf of the subtree rooted at $t_i$. Let $T(i \ldots j)$ denote the forest composed by the nodes $t_i \ldots t_j$ ($T \equiv T(0 \ldots |T|)$). To simplify notation, from now on, when there is no ambiguity, $i$ will refer to the node $t_i$. In this case, $\mathrm{distance}(i_1 \ldots i_2, j_1 \ldots j_2)$ will be equivalent to $\mathrm{distance}(T(i_1 \ldots i_2), T'(j_1 \ldots j_2))$.

The algorithm of Zhang and Shasha is fully described by the following recurrence formula:

if $(i_1 = l(i_2))$ and $(j_1 = l(j_2))$:

$$
\min \begin{cases}
\mathrm{distance}(i_1 \ldots i_2 - 1,\; j_1 \ldots j_2) + \mathrm{cost}_{del}(i_2) \\
\mathrm{distance}(i_1 \ldots i_2,\; j_1 \ldots j_2 - 1) + \mathrm{cost}_{ins}(j_2) \\
\mathrm{distance}(i_1 \ldots i_2 - 1,\; j_1 \ldots j_2 - 1) + \mathrm{cost}_{match}(i_2, j_2)
\end{cases}
\tag{1}
$$

else:

$$
\min \begin{cases}
\mathrm{distance}(i_1 \ldots i_2 - 1,\; j_1 \ldots j_2) + \mathrm{cost}_{del}(i_2) \\
\mathrm{distance}(i_1 \ldots i_2,\; j_1 \ldots j_2 - 1) + \mathrm{cost}_{ins}(j_2) \\
\mathrm{distance}(i_1 \ldots l(i_2) - 1,\; j_1 \ldots l(j_2) - 1) + \mathrm{distance}(l(i_2) \ldots i_2,\; l(j_2) \ldots j_2)
\end{cases}
\tag{2}
$$

Part (1) of the formula corresponds to Fig. 8, while part (2) corresponds to Fig. 9. In practice, the algorithm stores in a matrix the score between each subtree of $T$ and $T'$. The space complexity is therefore $O(|T| \times |T'|)$. To reach this complexity, the computation must be done in a certain order (see Section 4.3). The time complexity of the algorithm is

$$O(|T| \times \min(\mathrm{leaf}(T), \mathrm{height}(T)) \times |T'| \times \min(\mathrm{leaf}(T'), \mathrm{height}(T'))),$$

where $\mathrm{leaf}(T)$ and $\mathrm{height}(T)$ represent, respectively, the number of leaves and the height of a tree $T$.
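Recurrences (1) and (2) can be turned into a short program. The following is a minimal sketch of the classical Zhang-Shasha computation only (no fusion or split), under assumptions that are not in the original: trees are nested `(label, children)` tuples, indices are 0-based, unit costs are used by default, and all function names are hypothetical.

```python
def postorder(tree):
    """Return node labels in postfix order and, for each node, its leftmost leaf index."""
    labels, lml = [], []

    def visit(node):
        label, children = node
        first = None
        for child in children:
            leaf = visit(child)
            if first is None:
                first = leaf
        labels.append(label)
        lml.append(len(labels) - 1 if first is None else first)
        return lml[-1]

    visit(tree)
    return labels, lml


def keyroots(lml):
    """For each leftmost-leaf value keep the largest node index having it (the set LR)."""
    return sorted({leaf: k for k, leaf in enumerate(lml)}.values())


def tree_edit_distance(t1, t2,
                       cost_del=lambda a: 1,
                       cost_ins=lambda b: 1,
                       cost_match=lambda a, b: 0 if a == b else 1):
    a, l1 = postorder(t1)
    b, l2 = postorder(t2)
    td = [[0] * len(b) for _ in a]          # distances between subtrees
    for i in keyroots(l1):
        for j in keyroots(l2):
            li, lj = l1[i], l2[j]
            rows, cols = i - li + 2, j - lj + 2
            # fd[x][y]: distance between forests T(li .. li+x-1) and T'(lj .. lj+y-1)
            fd = [[0] * cols for _ in range(rows)]
            for x in range(1, rows):
                fd[x][0] = fd[x - 1][0] + cost_del(a[li + x - 1])
            for y in range(1, cols):
                fd[0][y] = fd[0][y - 1] + cost_ins(b[lj + y - 1])
            for x in range(1, rows):
                for y in range(1, cols):
                    i2, j2 = li + x - 1, lj + y - 1
                    delete = fd[x - 1][y] + cost_del(a[i2])
                    insert = fd[x][y - 1] + cost_ins(b[j2])
                    if l1[i2] == li and l2[j2] == lj:
                        # Recurrence (1): both forests are trees.
                        match = fd[x - 1][y - 1] + cost_match(a[i2], b[j2])
                        fd[x][y] = min(delete, insert, match)
                        td[i2][j2] = fd[x][y]
                    else:
                        # Recurrence (2): reuse the stored subtree distance.
                        split = fd[l1[i2] - li][l2[j2] - lj] + td[i2][j2]
                        fd[x][y] = min(delete, insert, split)
    return td[-1][-1]
```

Processing the key roots in increasing postfix order is what guarantees that `td[i2][j2]` is already filled in whenever the second branch reads it.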


Fig. 8. Zhang and Shasha’s dynamic programming algorithm: the tree distance part. The right box corresponds to the additional operations added to

take fusion into account.

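The formula to compute the edit score allowing for both node and edge fusions follows.

if $(i_1 \geq l(i_k))$ and $(j_1 \geq l(j_{k'}))$:

$$
\min \left\{
\begin{array}{l}
\mathrm{distance}(\{i_1 \ldots i_{k-1}\}, \emptyset, \{j_1 \ldots j_{k'}\}, \mathit{path}') + \mathrm{cost}_{del}(i_k) \\[2pt]
\mathrm{distance}(\{i_1 \ldots i_k\}, \mathit{path}, \{j_1 \ldots j_{k'-1}\}, \emptyset) + \mathrm{cost}_{ins}(j_{k'}) \\[2pt]
\mathrm{distance}(\{i_1 \ldots i_{k-1}\}, \emptyset, \{j_1 \ldots j_{k'-1}\}, \emptyset) + \mathrm{cost}_{match}(i_k, j_{k'}) \\[6pt]
\text{for each child } i_c \text{ of } i_k \text{ in } \{i_1, \ldots, i_k\}, \text{ set } i_l = l(i_c): \\[2pt]
\quad \mathrm{distance}(\{i_1 \ldots i_{c-1}, i_{c+1} \ldots i_k\}, \mathit{path}.(u, i_c), \{j_1 \ldots j_{k'}\}, \mathit{path}') \\
\qquad +\; \mathrm{cost}_{node\,fusion}(i_c, i_k) \quad (\text{obs.: } i_k \text{ data are changed}) \\[2pt]
\quad \mathrm{distance}(\{i_l \ldots i_{c-1}, i_k\}, \mathit{path}.(e, i_c), \{j_1 \ldots j_{k'}\}, \mathit{path}') \\
\qquad +\; \mathrm{cost}_{edge\,fusion}(i_c, i_k) + \mathrm{distance}(\{i_1 \ldots i_{l-1}\}, \emptyset, \emptyset, \emptyset) \\
\qquad +\; \mathrm{distance}(\{i_{c+1} \ldots i_{k-1}\}, \emptyset, \emptyset, \emptyset) \quad (\text{obs.: } i_k \text{ data are changed}) \\[6pt]
\text{for each child } j_{c'} \text{ of } j_{k'} \text{ in } \{j_1, \ldots, j_{k'}\}, \text{ set } j_{l'} = l(j_{c'}): \\[2pt]
\quad \mathrm{distance}(\{i_1 \ldots i_k\}, \mathit{path}, \{j_1 \ldots j_{c'-1}, j_{c'+1} \ldots j_{k'}\}, \mathit{path}'.(u, j_{c'})) \\
\qquad +\; \mathrm{cost}_{node\,split}(j_{c'}, j_{k'}) \quad (\text{obs.: } j_{k'} \text{ data are changed}) \\[2pt]
\quad \mathrm{distance}(\{i_1 \ldots i_k\}, \mathit{path}, \{j_{l'} \ldots j_{c'-1}, j_{k'}\}, \mathit{path}'.(e, j_{c'})) \\
\qquad +\; \mathrm{cost}_{edge\,split}(j_{c'}, j_{k'}) + \mathrm{distance}(\emptyset, \emptyset, \{j_1 \ldots j_{l'-1}\}, \emptyset) \\
\qquad +\; \mathrm{distance}(\emptyset, \emptyset, \{j_{c'+1} \ldots j_{k'-1}\}, \emptyset) \quad (\text{obs.: } j_{k'} \text{ data are changed})
\end{array}
\right.
\tag{3}
$$

else, set $i_l = l(i_k)$ and $j_{l'} = l(j_{k'})$:

$$
\min \left\{
\begin{array}{l}
\mathrm{distance}(\{i_1 \ldots i_{k-1}\}, \emptyset, \{j_1 \ldots j_{k'}\}, \mathit{path}') + \mathrm{cost}_{del}(i_k) \\[2pt]
\mathrm{distance}(\{i_1 \ldots i_k\}, \mathit{path}, \{j_1 \ldots j_{k'-1}\}, \emptyset) + \mathrm{cost}_{ins}(j_{k'}) \\[2pt]
\mathrm{distance}(\{i_1 \ldots i_{l-1}\}, \emptyset, \{j_1 \ldots j_{l'-1}\}, \emptyset) + \mathrm{distance}(\{i_l \ldots i_k\}, \mathit{path}, \{j_{l'} \ldots j_{k'}\}, \mathit{path}')
\end{array}
\right.
\tag{4}
$$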

Given two nodes $u$ and $v$ such that $v$ is a child of $u$, $\mathrm{node\,fusion}(u, v)$ is the fusion of node $v$ with $u$, and $\mathrm{edge\,fusion}(u, v)$ is the edge fusion between the edges leading to, respectively, nodes $u$ and $v$. The symmetric operations are denoted by, respectively, $\mathrm{node\,split}(u, v)$ and $\mathrm{edge\,split}(u, v)$.

The distance computation takes two new parameters, $\mathit{path}$ and $\mathit{path}'$. These are sets of pairs $(e \text{ or } u, v)$ which indicate, for node $i_k$ (respectively, $j_{k'}$), the series of fusions that were done. Thus, a pair $(e, v)$ indicates that an edge fusion has been performed between $i_k$ and $v$, while a pair $(u, v)$ indicates that a node $v$ has been merged with node $i_k$.

The notation $\mathit{path}.(e, v)$ indicates that the operation $(e, v)$ has been performed in relation to node $i_k$ and the information is thus concatenated to the set $\mathit{path}$ of pairs currently linked with $i_k$.
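One way to picture the path parameter is as an immutable sequence of pairs that can serve as part of a memoization key. The sketch below is only an assumption about a possible encoding (the names `Fusion`, `Path`, `extend`, and the `max_len` bound on consecutive fusions are hypothetical and not taken from the authors' implementation).

```python
from typing import Optional, Tuple

Fusion = Tuple[str, int]      # ("e", v): edge fusion with node v; ("u", v): node fusion with v
Path = Tuple[Fusion, ...]     # immutable, hence usable inside a dictionary key


def extend(path: Path, kind: str, v: int, max_len: int) -> Optional[Path]:
    """Return path.(kind, v), or None if more than max_len consecutive fusions would result."""
    if kind not in ("e", "u"):
        raise ValueError("kind must be 'e' (edge fusion) or 'u' (node fusion)")
    if len(path) >= max_len:
        return None
    return path + ((kind, v),)
```

A table of distances could then be indexed by a tuple such as `(forest, path, forest_prime, path_prime)`, which is where the extra factor in the space complexity comes from.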

4.3 Implementation and Complexity

The previous section gave the recurrence formulæ for

calculating the edit distance between two trees allowing for

node and edge fusion and split. We now discuss the

complexity of the algorithm. This requires paying attention

to some high-level implementation details that, in the case

of the tree edit distance problem, may have an important

influence on the theoretical complexity of the algorithm.

Such details were first observed by Zhang and Shasha. They

concern the order in which to perform the operations

indicated in (2) and (1) to obtain an algorithm that is time and space efficient.

Let us consider the last line of (2). We may observe that the computation of the distance between two forests refers to the computation of the distance between two trees $T(l(i_2) \ldots i_2)$ and $T'(l(j_2) \ldots j_2)$. We must therefore memorise the distance between any two subtrees of $T$ and $T'$. Furthermore, we have to carry out the computation from the leaves to the root because when we compute the distance between two subtrees $U$ and $U'$, the distance between any subtrees of $U$ and $U'$ must already have been measured. This explains the space complexity, which is in $O(|T| \times |T'|)$ and corresponds to the size of the table used for storing such distances in memory.

If we look at (1) now, we see that it is not necessary to calculate separately the distance between the subtrees rooted at $i'$ and $j'$ if $i'$ is on the path from $l(i)$ to $i$ and $j'$ is on the path from $l(j)$ to $j$, for $i$ and $j$ nodes of, respectively, $T$ and $T'$.

We define a set $LR(T)$ of the left roots of $T$ as follows:

$$LR(T) = \{k \mid 1 \leq k \leq |T| \text{ and } \nexists\, k' > k \text{ such that } l(k') = l(k)\}$$


Fig. 9. Zhang and Shasha’s dynamic programming algorithm: the forest distance part.

The algorithm for computing the edit distance between $T$ and $T'$ consists then in computing the distance between each subtree rooted at a node in $LR(T)$ and each subtree rooted at a node in $LR(T')$. Such subtrees are considered from the leaves to the root of $T$ and $T'$, that is, in the order of their indexes.
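The set of left roots can be computed in a single postfix traversal. The sketch below assumes trees given as nested `(label, children)` tuples and uses 1-based postfix indices to match the definition of $LR(T)$; the function name is hypothetical.

```python
def left_roots(tree):
    """Return (l, LR): l[k] is the leftmost leaf of the subtree rooted at node k
    (1-based postfix indices, l[0] unused) and LR is the sorted list of left roots."""
    l = [None]

    def visit(node):
        _label, children = node
        leaves = [visit(child) for child in children]
        # A leaf is its own leftmost leaf; its index is the current length of l.
        l.append(leaves[0] if leaves else len(l))
        return l[-1]

    visit(tree)
    # For each leftmost-leaf value, later (larger) indices overwrite earlier ones,
    # so only the k with no k' > k such that l(k') = l(k) survives.
    largest = {l[k]: k for k in range(1, len(l))}
    return l, sorted(largest.values())
```

For the tree f(d(a, c(b)), e), whose postfix order is a, b, c, d, e, f, this gives the left roots 3, 5, and 6: only the subtrees rooted at c, e, and f need their own forest-distance table.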

Zhang and Shasha proved that this algorithm has a time complexity in $O(|T| \times \min(\mathrm{leaf}(T), \mathrm{height}(T)) \times |T'| \times \min(\mathrm{leaf}(T'), \mathrm{height}(T')))$, $\mathrm{leaf}(T)$ designating the number of leaves of $T$ and $\mathrm{height}(T)$ its height. In the worst case (fan tree), the complexity is in $O(|T|^2 \times |T'|^2)$.

Taking fusion and split operations into account does not change the above reasoning. However, we must now store in memory the distance between all subtrees $T(l(i_2) \ldots i_2)$ and $T'(l(j_2) \ldots j_2)$, and all the possible values of $\mathit{path}$ and $\mathit{path}'$.

We must therefore determine the number of values that $\mathit{path}$ can take. This amounts to determining the total number of successive fusions that could be applied to a given node. We recall that $\mathit{path}$ is a list of pairs $(e \text{ or } u, v)$. Let $\mathit{path} = \{(e \text{ or } u, v_1), (e \text{ or } u, v_2), \ldots, (e \text{ or } u, v_\ell)\}$ be the list for node $i$ of $T$. The first fusion can be performed only with a child $v_1$ of $i$. If $d$ is the maximum degree of $T$, there are $d$ possible choices for $v_1$. The second fusion can be done with one of the children of $i$ or with one of its grandchildren. Let $v_2$ be the node chosen. There are $d + d^2$ possible choices for $v_2$. Following the same reasoning, there are $\sum_{k=1}^{\ell} d^k$ possible choices for the $\ell$th node $v_\ell$ to be fused with $i$. Furthermore, we must take into account the fact that a fusion can concern a node or an edge. The total number of values possible for the variable $\mathit{path}$ is therefore:

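$$2^\ell \times \prod_{k=1}^{\ell} \sum_{j=1}^{k} d^j \;\leq\; 2^\ell \prod_{k=1}^{\ell} \frac{d^{k+1} - 1}{d - 1},$$

that is:

$$2^\ell \times \left(\frac{1}{d - 1}\right)^{\ell} \prod_{k=1}^{\ell} \left(d^{k+1} - 1\right) \;<\; 2^\ell \times \left(\frac{1}{d - 1}\right)^{\ell} \times d^{\frac{(\ell+1)(\ell+2)}{2}}.$$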

A node $i$ may then be involved in $O((2d)^\ell)$ possible successive (node/edge) fusions.
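The counting argument above is easy to evaluate numerically for small values. The following sketch simply computes the product $2^\ell \prod_{k=1}^{\ell} \sum_{j=1}^{k} d^j$ and the final upper bound $2^\ell (d-1)^{-\ell} d^{(\ell+1)(\ell+2)/2}$; the function names are hypothetical, and $d \geq 2$ is assumed so that the bound is defined.

```python
from fractions import Fraction


def path_value_count(d: int, ell: int) -> int:
    """Number of values of path for maximum degree d and at most ell successive fusions."""
    total = 2 ** ell                       # each fusion is either a node or an edge fusion
    for k in range(1, ell + 1):
        total *= sum(d ** j for j in range(1, k + 1))   # choices for the k-th fused node
    return total


def path_value_upper_bound(d: int, ell: int) -> Fraction:
    """Closing upper bound of the derivation (requires d >= 2)."""
    if d < 2:
        raise ValueError("the bound divides by d - 1, so d must be at least 2")
    exponent = (ell + 1) * (ell + 2) // 2
    return Fraction(2 ** ell * d ** exponent, (d - 1) ** ell)
```

For instance, with $d = 3$ and $\ell = 2$ the count is $4 \times 3 \times 12 = 144$, against a bound of 729, which gives a feeling for how quickly the table grows with $\ell$.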

As indicated, we must store in memory the distance between each subtree $T(l(i_2) \ldots i_2)$ and $T'(l(j_2) \ldots j_2)$ for all possible values of $\mathit{path}$ and $\mathit{path}'$. The space complexity of our algorithm is thus in $O((2d)^\ell \times (2d')^\ell \times |T| \times |T'|)$, with $d$ and $d'$ the maximum degrees of, respectively, $T$ and $T'$.

The computation of the time complexity of our algorithm is done in a similar way as for the algorithm of Zhang and Shasha. For each node of $T$ and $T'$, one must compute the number of subtree distance computations the node will be involved in by considering all subtrees rooted in, respectively, a node of $LR(T)$ and a node of $LR(T')$. In our case, one must also take into account for each node the possibility of applying a fusion. This leads to a time complexity in

$$O((2d)^\ell \times |T| \times \min(\mathrm{leaf}(T), \mathrm{height}(T)) \times (2d')^\ell \times |T'| \times \min(\mathrm{leaf}(T'), \mathrm{height}(T'))).$$

This complexity suggests that the fusion operations may be used only for reasonable trees (typically, fewer than 100 nodes) and small values of $\ell$ (typically, less than 4). It is however important to observe that the overall number of fusions one may perform can be much greater than $\ell$ without affecting the worst-case complexity of the algorithm. Indeed, any number of fusions can be made while still retaining the bound of

$$O((2d)^\ell \times |T| \times \min(\mathrm{leaf}(T), \mathrm{height}(T)) \times |T'| \times \min(\mathrm{leaf}(T'), \mathrm{height}(T')))$$

so long as one does not realize more than $\ell$ consecutive fusions for each node.

In general, also, most interesting tree representations of

an RNA are of small enough size as will be shown next,

together with some initial results obtained in practice.

5 APPLICATION TO RNA SECONDARY STRUCTURES COMPARISON

The algorithm presented in the previous section has been coded using C++. An online version is available at http://www-igm.univ-mlv.fr/~allali/migal/.

We recall that RNAs are relatively small molecules with

sizes limited to a few kilobases. For instance, the small

ribosomal subunit of Sulfolobus acidocaldarius (D14876) is

made up of 1,147 bases. Using the representation shown in

Fig. 2b, the tree obtained contains 440 internal nodes and

567 leaves, that is 1,007 nodes overall. Using the representation

in Fig. 2d, the tree is composed of 78 nodes. Finally, the

tree obtained using the representation given in Fig. 2e

contains only 48 nodes. We therefore see that even for large

RNAs, any of the known abstract tree-representations (that

is, representations which take the elements of the secondary

structure of an RNA into account) that we can use leads to a

tree of manageable size for our algorithm. In fact, for small

values of $\ell$ (2 or 3), the tree comparison takes reasonable

time (a few minutes) and memory (less than 1Gb).

As we already mentioned, a fusion (respectively, split) can

be viewed as an alternative to a deletion (respectively,

insertion) followed by a relabeling. Therefore, the cost

function for a fusion must be chosen carefully.


To simplify, we reason on the cost of a node fusion

without considering the label of the edges leading to the

nodes that are fusioned with a father. The formal definition

of the cost functions takes the edges also into account.

Let us assume that the cost function returns a real value between zero and one. If we want to compute the cost of a fusion between two nodes $u$ and $v$, the aim is to give to such fusion a cost slightly greater than the cost of deleting $v$ and relabeling $u$; that is, we wish to have $\mathrm{cost}_{node\,fusion}(u, v) = \min(\mathrm{cost}_{del}(v) + t, 1)$. The parameter $t$ is a tuning parameter for the fusion.

Suppose that the new node $w$ resulting from the fusion of $u$ and $v$ matches with another node $z$. The cost of this match is $\mathrm{cost}_{match}(w, z)$. If we do not allow for node fusions, the algorithm will first match $u$ with $z$, then will delete $v$. If we compare the two possibilities, on one hand we have a total cost of $\mathrm{cost}_{node\,fusion}(u, v) + \mathrm{cost}_{match}(w, z)$ for the fusion, that is, $\mathrm{cost}_{del}(v) + t + \mathrm{cost}_{match}(w, z)$; on the other hand, a cost of $\mathrm{cost}_{del}(v) + \mathrm{cost}_{match}(u, z)$. Thus, $t$ represents the gain that must be obtained by $\mathrm{cost}_{match}(w, z)$ with regard to $\mathrm{cost}_{match}(u, z)$, that is, by a match without fusion. This is illustrated in Fig. 10.

In this example, the cost associated with the path on the top is $\mathrm{cost}_{match}(5, 9) + \mathrm{cost}_{del}(3)$. The path at the bottom has a cost of $\mathrm{cost}_{node\,fusion}(5, 3) = \mathrm{cost}_{del}(3) + t$ for the node fusion, to which is added a relabeling cost of $\mathrm{cost}_{match}(8, 9)$, leading to a total of $\mathrm{cost}_{match}(8, 9) + \mathrm{cost}_{del}(3) + t$. A node fusion will therefore be chosen if $\mathrm{cost}_{match}(8, 9) + t < \mathrm{cost}_{match}(5, 9)$, that is, if the score of a match with fusion is better by at least $t$ than a match without fusion.

We apply the same reasoning to the cost of an edge fusion. The cost functions for a node fusion and an edge fusion between a node $u$ and a node $v$, with $e_u$ denoting the edge leading to $u$ and $e_v$ the edge leading to $v$, are defined as follows:

$$\mathrm{cost}_{node\,fusion}(u, v) = \mathrm{cost}_{del}(v) + \mathrm{cost}_{del}(e_v) + t$$

$$\mathrm{cost}_{edge\,fusion}(u, v) = \mathrm{cost}_{del}(u) + \mathrm{cost}_{del}(e_u) + t + \sum_{c \text{ sibling of } v} \mathrm{cost}(\text{deleting the subtree rooted at } c).$$
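The two cost definitions above, and the comparison between a fusion and a plain deletion followed by a match, can be written down directly. This is a sketch under assumptions: deletion and match costs are passed in as plain numbers, and the function names are hypothetical.

```python
from typing import Iterable


def cost_node_fusion(del_v: float, del_ev: float, t: float) -> float:
    """Cost of fusing child v (reached through edge e_v) into its father."""
    return del_v + del_ev + t


def cost_edge_fusion(del_u: float, del_eu: float, t: float,
                     sibling_subtree_costs: Iterable[float]) -> float:
    """Cost of fusing e_u with e_v; the siblings of v are deleted with their subtrees."""
    return del_u + del_eu + t + sum(sibling_subtree_costs)


def node_fusion_is_cheaper(match_fused: float, match_plain: float,
                           del_v: float, del_ev: float, t: float) -> bool:
    """Compare 'fuse v into u, then match' with 'match u, then delete v and e_v'."""
    with_fusion = cost_node_fusion(del_v, del_ev, t) + match_fused
    without_fusion = del_v + del_ev + match_plain
    return with_fusion < without_fusion
```

Since the deletion terms appear on both sides, the last function reduces to testing whether the match after fusion improves on the match without fusion by more than $t$.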

The tuning parameter $t$ is thus an important parameter that allows us to control fusions. Always considering a cost function that produces real values between 0 and 1, if $t$ is equal to 0.1, a fusion will be performed only if it improves the score by 0.1. In practice, we use values of $t$ between 0 and 0.2.

For practical considerations, we also set a further condition on the cost and relabeling functions related to a node or edge resulting from a fusion, which is as follows:

$$\mathrm{cost}_{del}(a) + \mathrm{cost}_{del}(b) \leq \mathrm{cost}_{del}(c)$$

with $c$ the label of the node/edge resulting from the fusion of the nodes/edges labeled $a$ and $b$. Indeed, if this condition is not fulfilled, the algorithm may systematically fuse the nodes or edges to reduce the overall cost.

An important consequence of the conditions seen above is that a node fusion cannot be followed by an edge fusion. Below, the node fusion followed by an edge fusion costs:

$$(\mathrm{cost}_{del}(b) + \mathrm{cost}_{del}(B) + t) + (\mathrm{cost}_{del}(AB) + \mathrm{cost}_{del}(a) + t).$$

The alternative is to destroy node $B$ (together with edge $b$) and then to operate an edge fusion, the whole costing:

$$(\mathrm{cost}_{del}(b) + \mathrm{cost}_{del}(B)) + (\mathrm{cost}_{del}(A) + \mathrm{cost}_{del}(a) + t).$$

The difference between these two costs is $t + \mathrm{cost}_{del}(AB) - \mathrm{cost}_{del}(A)$, which is always positive.
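The sign of this difference can be checked in one line. The derivation below assumes the condition on fused labels in the form $\mathrm{cost}_{del}(a) + \mathrm{cost}_{del}(b) \leq \mathrm{cost}_{del}(c)$, applied with $a = A$, $b = B$, and $c = AB$, together with $t \geq 0$ and nonnegative deletion costs:

$$
\begin{aligned}
&\big[(\mathrm{cost}_{del}(b) + \mathrm{cost}_{del}(B) + t) + (\mathrm{cost}_{del}(AB) + \mathrm{cost}_{del}(a) + t)\big] \\
&\qquad - \big[(\mathrm{cost}_{del}(b) + \mathrm{cost}_{del}(B)) + (\mathrm{cost}_{del}(A) + \mathrm{cost}_{del}(a) + t)\big] \\
&\quad = t + \mathrm{cost}_{del}(AB) - \mathrm{cost}_{del}(A) \\
&\quad \geq t + \mathrm{cost}_{del}(B) \;\geq\; 0.
\end{aligned}
$$

Under these assumptions the difference is strictly positive as soon as $t > 0$ or $\mathrm{cost}_{del}(B) > 0$, so deleting $B$ and then performing the edge fusion is never more expensive than the node fusion followed by the edge fusion.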

This observation allows us to significantly improve the performance of the algorithm in practice.

We have applied the new algorithm to the two RNAs shown in Fig. 5 (these are eukaryotic nuclear P RNAs from Saccharomyces uvarum and Saccharomyces kluveri), coded using the same type of representation as in Fig. 2d. We have limited the number of consecutive fusions to one ($\ell = 1$). The computation of the edit distance between the two trees, taking node and edge fusions into account besides deletions, insertions, and relabeling, required less than a second. The total cost allowing for fusions is 6.18 with $t = 0.05$, against 7.42 without fusions. As indicated in Fig. 11, the last two problems discussed in Section 3 disappear thanks to some edge fusions (represented by the boxes).

An example of node fusions required when comparing two “real” RNAs is given in Fig. 12. The RNAs are coded using the same type of representation as in Fig. 2d. The figure shows part of the mapping obtained between the small subunits of two ribosomal RNAs retrieved from [8] (from Bacillaria paxillifer and Calicophoron calicophorum). The node fusion has been circled.


Fig. 10. Illustration of the gain that must be obtained using a fusion

instead of a deletion/relabeling.

6 MULTILEVEL RNA STRUCTURE COMPARISON: SKETCH OF THE MAIN IDEA

We briefly discuss now an approach which addresses in

part the “scattering effect” problem (see Section 2). This

approach is being currently validated and will be more fully

described in another paper. We therefore present here the

main idea only.

To start with, it is important to understand the nature of

this “scattering effect.” Let us consider first a trivial case: the

cost functions are unitary (insertion, deletion, and relabeling

each cost 1) and we compute the edit distance between two

trees composed of a single node each. The obtained mapping

will associate the single node in the first tree with the single

one in the second tree, independently from the labels of the

nodes. This example can be extended to the comparison of

two trees whose node labels are all different. In this case, the obtained mapping corresponds to the maximum homeomorphic subtree common to both trees.

If the two RNA secondary structures compared using a

tree representation which models both the base pairs and

the nonpaired bases are globally similar but present some

local dissimilarity, then an edit operation will almost always associate the nodes of the locally divergent regions that are located at the same positions relative to the global common structure. This is a normal, expected behavior in the context of editing. However, it seems clear also when

we look at Fig. 4 that the bases of a terminal loop should not

be mapped to those of a multiple loop.

To reduce this problem, one possible solution consists of

adding to each node that corresponds to a base some information

concerning the element of secondary structure to which the

base belongs. The cost functions are then adapted to take

this type of information into account. This solution,

although producing interesting results, is not entirely

satisfying. Indeed, the algorithm will tend to systematically

put into correspondence nodes (and, thus, bases) belonging

to structural elements of the same type, which is also not

necessarily a good choice as these elements may not be

related in the overall structure. It seems therefore preferable

to have a structural approach first, mapping initially the

elements of secondary structure to each other and taking

care of the nucleotides in a second step only.

The approach we have elaborated may be briefly

described as follows: Given two RNA secondary structures,

the first step consists in coding the RNAs by trees of type (c) in Fig. 2 (nodes represent bulges or multiple, internal, or


Fig. 12. Part of a mapping between two rRNA small subunits. The node fusion is circled.

Fig. 11. Result of the editing between the two RNAs shown in Fig. 4 allowing for node and edge fusions.

terminal loops while edges code for helices). We then

compute the edit distance between these two trees using the

two novel fusion operations described in this paper. This

also produces a mapping between the two trees. Each node

and edge of the trees, that is, each element of secondary

structure, is then colored according to this mapping. Two

elements are thus of a same color if they have been mapped

in the first step. We now have at our disposal information concerning the structural similarity of the two RNAs. We can then code the RNAs using a tree of type (b). In these trees, we add to each node the color of the structural element to which it belongs. We need now only to restrict the match operation to nodes of the same color. Two nodes can therefore match only if they belong to secondary elements that have been identified in the first step as being similar.

To illustrate the use of this algorithm, we have applied it to the two RNAs of Fig. 4. Fig. 13 presents the trees of the type in Fig. 2c coding for these structures, and the mapping produced by the computation of the edit distance with fusion. In particular, the noncolored fine dashed nodes and edges correspond, respectively, to deleted nodes/edges. One can see that in the left RNA, the two hairpin loops involved in the scattering effect problem in Fig. 4 (indicated by the arrows) have been destroyed and will not be mapped to one another anymore when the edit operations are applied to the trees of the type in Fig. 2b.
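The restriction of the match operation to nodes of the same color, used in the second step of the approach, can be sketched as a wrapper around any base-level match cost. Everything here is an assumption made for illustration: the `(base, color)` node encoding, the use of an infinite cost to forbid a match, and the treatment of uncolored nodes (those whose structural element was deleted in the first step) as unmatchable.

```python
import math
from typing import Callable, Hashable, Optional, Tuple

ColoredNode = Tuple[str, Optional[Hashable]]   # (base, color of its structural element)


def colored_match_cost(a: ColoredNode, b: ColoredNode,
                       base_cost: Callable[[str, str], float]) -> float:
    """Match cost that only allows nodes whose structural elements were mapped together."""
    base_a, color_a = a
    base_b, color_b = b
    # Uncolored nodes belong to elements deleted in the first step: never matched here.
    if color_a is None or color_b is None or color_a != color_b:
        return math.inf
    return base_cost(base_a, base_b)
```

With an infinite match cost, a dynamic programming recurrence falls back on deletions and insertions for nodes of different colors, which is one simple way to obtain the intended restriction.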

This approach yields interesting results.

Furthermore, it considerably reduces the complexity of

the algorithm for comparing two RNA structures coded

with trees of the type in Fig. 2b. However, it is important to

observe that the scattering effect problem is not specific to

the tree representations of the type in Fig. 2b. Indeed, the

same problem may be observed, to a lesser degree, with

trees of the type in Fig. 2c. This is the reason why we

generalize the process by adopting a modelling of RNA

secondary structures at different levels of abstraction. This

model, and the accompanying algorithm for comparing

RNA structures, is in progress.

7 FURTHER WORK AND CONCLUSION

We have proposed an algorithm that addresses two main

limitations of the classical tree edit operations for comparing RNA secondary structures. Its complexity is high in theory if many fusions are applied in succession to the same node, but the total number of fusions that may be performed is not limited. In practice, the algorithm is fast enough for most situations one is likely to meet.

To provide a more complete solution to the problem of

the scattering effect, we also proposed a new multilevel

approach for comparing two RNA secondary structures

whose main idea was sketched in this paper. Further details

and evaluation of this novel comparison scheme will be the

subject of another paper.

REFERENCES

[1] D. Bouthinon and H. Soldano, “A New Method to Predict the Consensus Secondary Structure of a Set of Unaligned RNA Sequences,” Bioinformatics, vol. 15, no. 10, pp. 785-798, 1999.

[2] J.W. Brown, “The Ribonuclease P Database,” Nucleic Acids Research, vol. 24, no. 1, p. 314, 1999.

[3] N. el Mabrouk and F. Lisacek, “and Very Fast Identification of RNA Motifs in Genomic DNA. Application to tRNA Search in the Yeast Genome,” J. Molecular Biology, vol. 264, no. 1, pp. 46-55, 1996.

[4] I. Hofacker, “The Vienna RNA Secondary Structure Server,” 2003.

[5] I. Hofacker, W. Fontana, P.F. Stadler, L. Sebastian Bonhoeffer, M. Tacker, and P. Schuster, “Fast Folding and Comparison of RNA Secondary Structures,” Monatshefte fur Chemie, vol. 125, pp. 167-188, 1994.

[6] M. Hochsmann, T. Toller, R. Giegerich, and S. Kurtz, “Local Similarity in RNA Secondary Structures,” Proc. IEEE Computer Soc. Conf. Bioinformatics, p. 159, 2003.

[7] M. Hochsmann, B. Voss, and R. Giegerich, “Pure Multiple RNA Secondary Structure Alignments: A Progressive Profile Approach,” IEEE/ACM Trans. Computational Biology and Bioinformatics, vol. 1, no. 1, pp. 53-62, 2004.

[8] T. Winkelmans, J. Wuyts, Y. Van de Peer, and R. De Wachter, “The European Database on Small Subunit Ribosomal RNA,” Nucleic Acids Research, vol. 30, no. 1, pp. 183-185, 2002.

[9] T. Jiang, L. Wang, and K. Zhang, “Alignment of Trees—An Alternative to Tree Edit,” Proc. Fifth Ann. Symp. Combinatorial Pattern Matching, pp. 75-86, 1994.

[10] F. Lisacek, Y. Diaz, and F. Michel, “Automatic Identification of Group I Intron Cores in Genomic DNA Sequences,” J. Molecular Biology, vol. 235, no. 4, pp. 1206-1217, 1994.


Fig. 13. Result of the comparison of the two RNAs of Fig. 4 using trees in Fig. 2c. The thick dash lines indicate some of the associations resulting

from the computation of the edit distance between these two trees. Triangular nodes stand for bulges, diamonds for internal loops, squares for

hairpin loops, and circles for multiloops. Noncolored fine dashed nodes and lines correspond, respectively, to deleted nodes/edges.

[11] B. Shapiro, “An Algorithm for Multiple RNA Secondary Structures,” Computer Applications in the Biosciences, vol. 4, no. 3, pp. 387-393, 1988.

[12] B.A. Shapiro and K. Zhang, “Comparing Multiple RNA Secondary Structures Using Tree Comparisons,” Computer Applications in the Biosciences, vol. 6, no. 4, pp. 309-318, 1990.

[13] K.-C. Tai, “The Tree-to-Tree Correction Problem,” J. ACM, vol. 26, no. 3, pp. 422-433, 1979.

[14] K. Zhang and D. Shasha, “Simple Fast Algorithms for the Editing Distance between Trees and Related Problems,” SIAM J. Computing, vol. 18, no. 6, pp. 1245-1262, 1989.

[15] M. Zuker, “Mfold Web Server for Nucleic Acid Folding and Hybridization Prediction,” Nucleic Acids Research, vol. 31, no. 13, pp. 3406-3415, 2003.

Julien Allali studied at the University of Marne la Vallee (France), where he received the MSc degree in computer science and computational genomics. In 2001, he began his PhD in computational genomics at the Gaspard Monge Institute of the University of Marne la Vallee. His thesis focused on the study of RNA secondary structures and, in particular, their comparison using a tree distance. In 2004, he received the PhD degree.

Marie-France Sagot received the BSc degree in computer science from the University of Sao Paulo, Brazil, in 1991, the PhD degree in theoretical computer science and applications from the University of Marne-la-Vallee, France, in 1996, and the Habilitation from the same university in 2000. From 1997 to 2001, she worked as a research associate at the Pasteur Institute in Paris, France. In 2001, she moved to Lyon, France, as a research associate at the INRIA, the French National Institute for Research in Computer Science and Control. Since 2003, she has been the Director of Research at the INRIA. Her research interests are in computational biology, algorithmics, and combinatorics.



Topological Rearrangements and Local Search Method for Tandem Duplication Trees

Denis Bertrand and Olivier Gascuel

Abstract—The problem of reconstructing the duplication history of a set of tandemly repeated sequences was first introduced by Fitch

[4]. Many recent studies deal with this problem, showing the validity of the unequal recombination model proposed by Fitch, describing numerous inference algorithms, and exploring the combinatorial properties of these new mathematical objects, which are duplication trees. In this paper, we deal with the topological rearrangement of these trees. Classical rearrangements used in phylogeny (NNI, SPR, TBR, ...) cannot be applied directly on duplication trees. We show that restricting the neighborhood defined by the SPR (Subtree Pruning and Regrafting) rearrangement to valid duplication trees allows the whole duplication tree space to be explored. We use these restricted rearrangements in a local search method which improves an initial tree via successive rearrangements. This method is applied to the optimization of parsimony and minimum evolution criteria. We show through simulations that this method improves on all existing programs for both reconstructing the topology of the true tree and recovering its duplication events. We apply this approach to tandemly repeated human Zinc finger genes and observe that a much better duplication tree is obtained by our method than by any other program.

Index Terms—Tandem duplication trees, phylogeny, topological rearrangements, local search, parsimony, minimum evolution, Zinc

finger genes.

1 INTRODUCTION

REPEATED sequences constitute an important fraction of

most genomes, from the well-studied Escherichia coli

bacterial genome [1] to the Human genome [2]. For

example, it is estimated that more than 50 percent of the

Human genome consists of repeated sequences [2], [3].

There exist three major types of repeated sequences:

transposon-derived repeats, micro or minisatellites, and

large duplicated sequences, the last often containing one or

several RNA or protein-coding genes. Micro or minisatel-

lites arise through a mechanism called slipped-strand

mispairing, and are always arranged in tandem: copies of

a same basic unit are linearly ordered on the chromosome.

Large duplicated sequences are also often found in tandem

and, when this is the case, unequal recombination is widely

assumed to be responsible for their formation.

Both the linear order among tandemly repeated se-

quences, and the knowledge of the biological mechanisms

responsible for their generation, suggest a simple model of

evolution by duplication. This model, first described by

Fitch in 1977 [4], introduces tandem duplication trees as

phylogenies constrained by the unequal recombination

mechanism. Although it is a completely different biological mechanism, slipped-strand mispairing leads to the same

duplication model [5]. A formal recursive definition of this

model is provided in Section 2, but its main features can be

grasped from the examples of Fig. 1. Fig. 1a shows the

duplication history of the 13 Antennapedia-class homeobox

genes from the cognate group [6]. In this history, the

ancestral locus has undergone a series of simple duplication events where one of the genes has been duplicated into

two adjacent copies. Starting from the unique ancestral

gene, this series of events has produced the extant locus

containing the 13 linearly ordered contemporary genes. It is

easily seen [7] that trees only containing simple duplication

events are equivalent to binary search trees with labeled

leaves. They differ from standard phylogenies in that node

children have left/right orientation. Fig. 1b shows another

example corresponding to the nine variable genes of the

human T cell receptor Gamma (TRGV) locus [8]. In this

history, the most recent event involves a double duplica-

tion where two adjacent genes have been simultaneously

duplicated to produce four adjacent copies. Duplication

trees containing multiple duplication events differ from

binary search trees, but are less general than phylogenies.

The model proposed by Fitch [4] covers both simple and

multiple duplication trees.

Fitch’s paper [4] received relatively little attention at the time of its publication, probably due to the lack of available

sequence data. Rediscovered by Benson and Dong [9],

Tang et al. [10], and Elemento et al. [8], tandemly repeated

sequences and their suggested duplication model have

recently received much interest, providing several new

computational biology problems and challenges [11], [12].

The main challenge consists of creating algorithms

incorporating the model constraints to reconstruct the


The authors are with Projet Methodes et Algorithmes pour la Bioinformatique, LIRMM (UMR 5506, CNRS-Univ. Montpellier 2), 161 rue Ada, 34392 Montpellier Cedex 5, France. E-mail: [email protected].

Manuscript received 11 Oct. 2004; revised 17 Dec. 2004; accepted 20 Dec. 2004; published online 30 Mar. 2005. For information on obtaining reprints of this article, please send e-mail to: [email protected], and reference IEEECS Log Number TCBBSI-0170-1004.

1545-5963/05/$20.00 © 2005 IEEE. Published by the IEEE CS, CI, and EMB Societies & the ACM

duplication history of tandemly repeated sequences.

Indeed, accurate reconstruction of duplication histories

will be useful to elucidate various aspects of genome

evolution. They will provide new insights into the

mechanisms and determinants of gene and protein domain

duplication, often recognized as major generators of

novelty [13]. Several important gene families, such as

immunity-related genes, are arranged in tandem; better

understanding their evolution should provide new insights

into their duplication dynamics and clues about their

functional specialization. Studying the evolution of micro

and minisatellites could resolve unanswered biological

questions regarding human migrations or the evolution of

bacterial diseases [14].

Given a set of aligned and ordered sequences (DNA or

proteins), the aim is to find the duplication tree that best

explains these sequences, according to usual criteria in

phylogenetics, e.g., parsimony or minimum evolution. Few

studies have focused on the computational hardness of this

problem, and all of these studies only deal with the

restricted version where simultaneous duplication of multiple adjacent segments is not allowed. In this context, Jaitly et al. [15] show that finding the optimal single copy duplication tree with parsimony is NP-Hard and that this problem has a PTAS (Polynomial Time Approximation Scheme). Another closely related PTAS is given by Tang et al. [10] for the same problem. On the other hand, Elemento et al. [7] describe a polynomial distance-based algorithm that reconstructs optimal single copy tandem duplication trees with minimum evolution.

However, it is commonly believed, as in phylogeny, that

most (especially multiple) duplication tree inference problems are NP-Hard. This explains the development of heuristic approaches. Benson and Dong [9] provide various parsimony-based heuristic reconstruction algorithms to infer duplication trees, especially from minisatellites. Elemento et al. [8] present an enumerative algorithm that computes the most parsimonious duplication tree; this algorithm (by its exhaustive approach) is limited to datasets of fewer than 15 repeats. Several distance-based methods have also been described. The WINDOW method [10] uses an agglomeration scheme similar to UPGMA [16] and NJ [17], but the cost function used to judge potential duplication is based on the assumption that the sequences follow a molecular clock mode of evolution. The DTSCORE method [18] uses the same scheme but corrects this limitation using a score criterion [19], like ADDTREE [20]. DTSCORE can be used with sequences that do not follow the molecular clock, which is, for example, essential when dealing with gene families containing pseudogenes that evolve much faster than functional genes.

Finally, GREEDY SEARCH [21] corresponds to a different

approach divided into two steps: First, a phylogeny is

computed with a classical reconstruction method (NJ), then,

with nearest neighbor interchange (NNI) rearrangements, a

duplication tree close to this phylogeny is computed. This

approach is noteworthy since it implements topological

rearrangements which are highly useful in phylogenetics

[22], but it works blindly and does not ensure that good

duplication trees will be found (cf. Section 5.2).

Topological rearrangements have an essential function in

phylogenetic inference, where they are used to improve an

initial phylogeny by subtree movement or exchange.

Rearrangements are very useful for all common criteria

(parsimony, distance, maximum likelihood) and are inte-

grated into all classical programs like PAUP* [23] or

PHYLIP [24]. Furthermore, they are used to define various

distances between phylogenies and are the foundation of

much mathematical work [25]. Unfortunately, they cannot

be directly used here, as shown by a simple example given


Fig. 1. (a) Rooted duplication tree describing the evolutionary history of the 13 Antennapedia-class homeobox genes from the cognate group [6]. (b) Rooted duplication tree describing the evolutionary history of the nine variable genes of the human T cell receptor Gamma (TRGV) locus [8]. In both examples, the contemporary genes are adjacent and linearly ordered along the extant locus.

later. Indeed, when applied to a duplication tree, they do

not guarantee that another valid duplication tree will be

produced.

In this paper, we describe a set of topological rearrange-

ments to stay inside the duplication tree space and explore

the whole space from any of its elements. We then show the

advantages of this approach for duplication tree inference

from sequences. In Section 2, we describe the duplication

model introduced by [4], [8], [10], as well as an algorithm to

recognize duplication trees in linear time. Thanks to this

algorithm, we restrict the neighborhoods defined by

classical phylogeny rearrangements, namely, nearest neigh-

bor interchange (NNI) and subtree pruning and regrafting

(SPR), to valid duplication trees. We demonstrate (Section 3)

that for NNI moves this restricted neighborhood does not

allow the exploration of the whole duplication tree space.

On the other hand, we demonstrate that the restricted

neighborhood of SPR rearrangement allows the whole

space to be explored. In this way, we define a local search

method, applied here to parsimony and minimum evolu-

tion (Section 4). We compare this method to other existing

approaches using simulated and real data sets (Section 5).

We conclude by discussing the positive results obtained by

our method, and indicate directions for further research

(Section 6).

2 MODEL

2.1 Duplication History and Duplication Tree

The tandem duplication model used in this article was first

introduced by Fitch [4] then studied independently by [8],

[10]. It is based on unequal recombination which is assumed

to be the sole evolution mechanism (except point mutations)

acting on sequences. Although it is a completely different

biological mechanism, slipped-strand mispairing leads to

the same duplication model [5], [9].

Let $O = (1, 2, \ldots, n)$ be the ordered set of sequences

representing the extant locus. Initially containing a single

copy, the locus grew through a series of consecutive

duplications. As shown in Fig. 2a, a duplication history

may contain simple duplication events. When the dupli-

cated fragment contains two, three, or k repeats, we say that

it involves a multiple duplication event. Under this

duplication model, a duplication history is a rooted tree

with n labeled and ordered leaves, in which internal nodes

of degree 3 correspond to duplication events. In a real

duplication history (Fig. 2a), the time intervals between

consecutive duplications are completely known, and the

internal nodes are ordered from top to bottom according to

the moment they occurred in the course of evolution. Any

ordered segment set of the same height then represents an

ancestral state of the locus. We call such a set a floor, and

we say that two nodes $i, j$ are adjacent ($i \prec j$) if there is a floor where $i$ and $j$ are consecutive and $i$ is on the left of $j$.

However, in the absence of a molecular clock mode of

evolution (a typical problem), it is impossible to recover the

order between the duplication events of two different

lineages from the sequences. In this case, we are only able to

infer a duplication tree (DT) (Fig. 2b) or a rooted

duplication tree (RDT) (Fig. 2c).

A duplication tree is an unrooted phylogeny with

ordered leaves, whose topology is compatible with at least

one duplication history. Also, internal nodes of duplication

trees are partitioned into events (or “blocks” following

[10]), each containing one or more (ordered) nodes. We

distinguish “simple” duplication events that contain a

unique internal node (e.g., b and f in Fig. 2c) and “multiple”

duplication events which group a series of adjacent and

simultaneous duplications (e.g., c, d, and e in Fig. 2c). Let $E = (s_i, s_{i+1}, \ldots, s_k)$ denote an event containing internal nodes $s_i, s_{i+1}, \ldots, s_k$ in left to right order. We say that two consecutive nodes of the same event are adjacent ($s_j \prec s_{j+1}$)

just like in histories, as any event belongs to a floor in all of


Fig. 2. (a) Duplication history; each segment represents a copy; extant segments are numbered. (b) Duplication tree (DT); the black points show the possible root locations. (c) Rooted duplication tree (RDT) corresponding to history (a) and root position $\rho_1$ on (b).

the histories that are compatible with the DT being

considered. The same notation will also be used for leaves

to express the segment order in the extant locus. When the

tree is rooted, every internal node $s_j$ is unambiguously associated to one parent and two child nodes; moreover, one child of $s_j$ is “left” and the other one is “right,” which is denoted as $l_j$ and $r_j$, respectively. In this case, for any duplication history that is compatible with this tree, child nodes of an event $s_i, s_{i+1}, \ldots, s_k$ are organized as follows:

$$l_i \prec l_{i+1} \prec \ldots \prec l_k \prec r_i \prec r_{i+1} \prec \ldots \prec r_k.$$

In [8], [26], [27], it was shown that rooting a duplication tree is different from rooting a phylogeny: the root of a duplication tree necessarily lies on the tree path between the most distant repeats on the locus, i.e., 1 and $n$; moreover, the root is always located “above” all multiple duplications. For example, Fig. 1b shows that there are only three valid root positions, and that the root cannot be a direct ancestor of 12.

2.2 Recursive Definition of Rooted and Unrooted Duplication Trees

A duplication tree is compatible with at least one duplica-

tion history. This suggests a recursive definition, which

progressively reconstructs a possible history, given a

phylogeny $T$ and a leaf ordering $O$. We define a cherry $(l, s, r)$ as a pair of leaves ($l$ and $r$) separated by a single node $s$ in $T$, and we call $C(T)$ the set of cherries of $T$. This

recursive definition reverses evolution: It searches for a

“visible duplication event,” “agglomerates” this event, and

checks whether the “reduced” tree is a duplication tree. In

case of rooted trees, we have:

$(T, O)$ defines a duplication tree with root $\rho$ if and only if:

1. $(T, O)$ only contains $\rho$, or

2. there is in $C(T)$ a series of cherries $(l_i, s_i, r_i), (l_{i+1}, s_{i+1}, r_{i+1}), \ldots, (l_k, s_k, r_k)$ with $k \geq i$ and $l_i \prec l_{i+1} \prec \ldots \prec l_k \prec r_i \prec r_{i+1} \prec \ldots \prec r_k$ in $O$, such that $(T', O')$ defines a duplication tree with root $\rho$, where $T'$ is obtained from $T$ by removing $l_i, l_{i+1}, \ldots, l_k, r_i, r_{i+1}, \ldots, r_k$, and $O'$ is obtained by replacing $(l_i, l_{i+1}, \ldots, l_k, r_i, r_{i+1}, \ldots, r_k)$ by $(s_i, s_{i+1}, \ldots, s_k)$ in $O$.

The definition for unrooted trees is quite similar:

$(T, O)$ defines an unrooted duplication tree if and only if:

1. $(T, O)$ contains 1 segment, or

2. same as for rooted trees with $(T', O')$ now defining an unrooted duplication tree.

Those definitions provide a recursive algorithm, RADT

(Recognition Algorithm for Duplication Trees), to check

whether any given phylogeny with ordered leaves is a

duplication tree. In case of success, this algorithm can also

be used to reconstruct duplication events: At each step, the series of internal nodes above denoted as $(s_i, s_{i+1}, \ldots, s_k)$ is a duplication event. When the tree is rooted, $l_j$ is the left child of $s_j$ and $r_j$ its right child, for every $j$, $i \leq j \leq k$. This algorithm can be implemented in $O(n)$ [26] where $n$ is the number of leaves. Another linear algorithm is proposed by

Zhang et al. [21] using a top down approach instead of a

bottom-up one, but applies only to rooted duplication trees.
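The recursive definition of rooted duplication trees can be sketched in code. The Python sketch below is only illustrative: it assumes a rooted binary tree given as nested 2-tuples with integer or string leaf labels, all function names are hypothetical, and it repeatedly agglomerates the first visible event it finds, which takes roughly cubic time in the worst case. It is therefore not the linear-time implementation of [26]. It also assumes, as the recursive definition suggests, that agglomerating any visible event is safe.

```python
def _index(node, parent, counter):
    """Give every node an id and record the parent id of each child."""
    if isinstance(node, tuple):
        node_id = ("internal", counter[0])
        counter[0] += 1
        for child in node:
            parent[_index(child, parent, counter)] = node_id
        return node_id
    return ("leaf", node)


def _find_visible_event(current, parent):
    """Return (start, size) of a block l_i..l_k r_i..r_k of sibling pairs, or None."""
    for start in range(len(current)):
        for size in range(1, (len(current) - start) // 2 + 1):
            if all(
                parent.get(current[start + j]) is not None
                and parent.get(current[start + j])
                == parent.get(current[start + size + j])
                for j in range(size)
            ):
                return start, size
    return None


def is_rooted_duplication_tree(tree, order):
    """Check whether (tree, order) satisfies the recursive RDT definition."""
    parent = {}
    root = _index(tree, parent, [0])
    current = [("leaf", label) for label in order]
    tree_leaves = {n for n in list(parent) + [root] if n[0] == "leaf"}
    if len(current) != len(tree_leaves) or set(current) != tree_leaves:
        return False  # the ordering must list every leaf exactly once
    while len(current) > 1:
        event = _find_visible_event(current, parent)
        if event is None:
            return False  # no visible duplication event: not a duplication tree
        start, size = event
        # Agglomerate: replace l_i..l_k r_i..r_k by their parents s_i..s_k.
        current[start:start + 2 * size] = [
            parent[current[start + j]] for j in range(size)
        ]
    return current == [root]
```

For instance, `is_rooted_duplication_tree(((1, 3), (2, 4)), [1, 2, 3, 4])` recognizes a double duplication, whereas the tree `((1, 4), (2, 3))` with the same ordering is rejected because the cherry of leaves 1 and 4 never becomes adjacent.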

3 TOPOLOGICAL REARRANGEMENTS FOR

DUPLICATION TREES

This section shows how to explore the DT space using SPR

rearrangements. First, we describe some NNI, SPR, and

TBR rearrangement properties with standard phylogenies.

But, these rearrangements cannot be directly used to

explore the DT space. Indeed, when applied to a duplica-

tion tree, they do not guarantee that another valid

duplication tree will be produced. So, we have decided to

restrict the neighborhood defined by those rearrangements

to duplication trees. If we only used NNI rearrangements,

the neighborhood would be too restricted (as shown by a

simple example) and would not allow the whole DT space

to be explored. On the other hand, we can distinguish two types of SPR rearrangements which, when applied to a rooted duplication tree, guarantee that another valid duplication tree will be produced. Thanks to these specific

rearrangements, we demonstrate that restricting the neigh-

borhood of SPR rearrangements allows the whole space of

duplication trees to be explored.


Fig. 3. The tree obtained by applying an NNI move to a DT is not always a valid DT: $T$, whose $RT$ is a rooted version; $T'$ is obtained by applying NNI(5,4) around the bold edge; none of the possible root positions of $T'$ (a, b, c, and d) leads to a valid RDT, cf. tree (b) which corresponds to root b in $T'$.

3.1 Topological Rearrangements for Phylogeny

There are many ways of carrying out topological rearrangements on phylogeny [22]. We only describe NNI (Nearest Neighbor Interchange), SPR (Subtree Pruning and Regrafting), and TBR (Tree Bisection and Reconnection) rearrangements.

The NNI move is a simple rearrangement which

exchanges two subtrees adjacent to the same internal edge

(Figs. 3 and 4). There are two possible NNIs for each internal edge, so $2(n - 3)$ neighboring trees for one tree with $n$ leaves. This rearrangement allows the whole space of phylogeny to be explored; i.e., there is a succession of NNI moves making it possible to transform any phylogeny $P_1$ into any phylogeny $P_2$ [28].

The SPR move consists of pruning a subtree and

regrafting it, by its root, to an edge of the resulting tree

(Figs. 6 and 7). We note that the neighborhood of a tree

defined by the NNI rearrangements is included in the

neighborhood defined by SPRs. The latter rearrangement defines a neighborhood of size $2(n - 3)(2n - 7)$ [25].

Finally, TBR generalizes SPR by allowing the pruned subtree to be reconnected by any of its edges to the resulting tree. These three rearrangements (NNI, SPR, and TBR) are reversible, that is, if $T'$ is obtained from $T$ by a particular rearrangement, then $T$ can be obtained from $T'$ using the same type of rearrangement.
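To give a feel for how quickly these neighborhoods grow, the two size formulas quoted above for unrestricted phylogenies can be evaluated directly. The helper names in this small Python sketch are hypothetical, and the functions simply restate the formulas for a tree with $n$ leaves.

```python
def nni_neighborhood_size(n):
    """Number of NNI neighbors of an unrooted binary tree with n leaves: 2(n - 3)."""
    if n < 3:
        raise ValueError("an unrooted binary tree needs at least 3 leaves")
    return 2 * (n - 3)


def spr_neighborhood_size(n):
    """Number of SPR neighbors of an unrooted binary tree with n leaves: 2(n - 3)(2n - 7)."""
    if n < 3:
        raise ValueError("an unrooted binary tree needs at least 3 leaves")
    return 2 * (n - 3) * (2 * n - 7)
```

With $n = 48$ leaves, these formulas give 90 NNI neighbors and 8010 SPR neighbors, which illustrates why the SPR neighborhood is quadratic in $n$ while the NNI neighborhood is linear.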

3.2 NNI Rearrangements Do Not Stay in DT Space

The classical phylogenetic rearrangements (NNI, SPR,

TBR,...) do not always stay in DT space. So, if we apply

an NNI to a DT (e.g., Fig. 3), the resulting tree is not always

a valid DT. This property is also true for SPR and TBR

rearrangements since NNI rearrangements are included in

these two rearrangement classes.

3.3 Restricted NNI Does Not Allow the Whole DT Space to Be Explored

To restrict the neighborhood defined by NNI rearrangements to duplication trees, each element of the neighborhood is filtered using the recognition algorithm (RADT). However, this restricted neighborhood does not allow the whole DT space to be explored. Fig. 4 gives an example of a duplication tree, $T$, the neighborhood of which does not contain any DT. So, its restricted neighborhood is empty, and there is no succession of restricted NNIs allowing $T$ to be transformed into any other DT.

3.4 Restricted SPR Allows the Whole DT Space to Be Explored

As before, we restrict (using RADT) the neighborhood defined by SPR rearrangements to duplication trees. We call restricted SPRs those SPR moves that, starting from a duplication tree, lead to another duplication tree.

Main Theorem. Let $T_1$ and $T_2$ be any given duplication trees; $T_1$ can be transformed into $T_2$ via a succession of restricted SPRs.

Proof. To demonstrate the Main Theorem, we define two types of special SPR that ensure staying within the space of rooted duplication trees (RDT). Given these two types of SPRs, we demonstrate that it is possible to transform any rooted duplication tree into a caterpillar, i.e., a rooted tree in which all internal nodes belong to the tree path between the leaf 1 and the tree root $\rho$ (cf. Fig. 5).

This result demonstrates the theorem. Indeed, let $T_1$ and $T_2$ be two RDTs. We can transform $T_1$ and $T_2$ into a caterpillar by a succession of restricted SPRs. So, it is possible to transform $T_1$ into $T_2$ by a succession of restricted SPRs, with (possibly) a caterpillar as intermediate tree. This property holds since the reciprocal movement of an SPR is an SPR. As the two SPR types proposed ensure that we stay within the RDT space, we have the desired result for rooted duplication trees. This result extends to unrooted duplication trees since two DTs can be arbitrarily rooted, transformed from one

to the other using restricted SPRs, then unrooted. □

The first special SPR allows multiple duplication events to be destroyed. Let $E = (s_i, s_{i+1}, \ldots, s_k)$ be a duplication event, $r_i$ and $l_k$ respectively the right child of $s_i$ and the left child of $s_k$, and let $p_i$ be the father of $s_i$. The DELETE rearrangement consists of pruning the subtree of root $r_i$ and grafting this subtree on the edge $(s_k, l_k)$, while $l_i$ is renamed $s_i$ and the edge $(l_i, s_i)$ is deleted. Fig. 6 demonstrates this rearrangement.

Fig. 4. The NNI neighborhood of a duplication tree does not always contain duplication trees: $T$, whose $RT$ is a rooted version; $T'$ is obtained by exchanging subtrees 1 and (2 5); none of the possible root positions of $T'$ (a, b, and c) leads to a valid duplication tree, cf. tree (b) which corresponds to root b in $T'$; and the same holds for every neighbor of $T$ being obtained by NNI.

Fig. 5. A six-leaf caterpillar.

Lemma 1. DELETE preserves the RDT property.

Proof. Let $T$ be the initial tree (Fig. 6a), $E = (s_i, s_{i+1}, \ldots, s_k)$ be an event of $T$, and $T'$ be the tree obtained from $T$ by applying DELETE to $E$ (Fig. 6b). Children of any node $s_j$ ($i \leq j \leq k$) are denoted $l_j$ and $r_j$.

By definition, for any duplication history compatible with $T$ we have

$$l_i \prec l_{i+1} \prec \ldots \prec l_k \prec r_i \prec r_{i+1} \prec \ldots \prec r_k.$$

Thus, there is a way to partially agglomerate $T$ (using an RADT-like procedure) such that these nodes become leaves. The same agglomeration can be applied to $T'$ as only ancestors of the $l_j$s and $r_j$s are affected by DELETE. Now, 1) agglomerate the event $E$ of $T$, and 2) reduce $T'$ by agglomerating the cherry $(l_k, r_i)$ and then agglomerating the event $(s_{i+1}, \ldots, s_k)$. Two identical trees follow, which concludes the proof. □

By successively applying DELETE to any duplication

tree, we remove all multiple duplication events. The

following SPR rearrangement allows duplications to be

moved within simple RDT, i.e., any RDT containing only simple duplications. Let $p$ be a node of a simple RDT $T$, $l$ its left child, $r$ its right child, and $x$ the left child of $r$. This rearrangement consists of pruning the subtree of root $x$ and regrafting it to the edge $(l, p)$ (Fig. 7). This rearrangement is

an SPR (in fact an NNI); we name it LEFT as it moves the

subtree root towards the left. It is obvious that the tree

obtained by applying such a rearrangement to a simple

RDT, is a simple RDT. We now establish the following

lemma which shows that any simple tree can be trans-

formed into a caterpillar.

Lemma 2. Let T be a simple RDT; T can be transformed into a

caterpillar by a succession of LEFT rearrangements.

Proof. In a caterpillar all internal nodes are ancestors of 1. If

T is not a caterpillar, there is an internal node r that is not

an ancestor of 1. If r is the right child of its father, we can

apply LEFT to the left child of r (Fig. 7). If r is the left

child of its father, we consider its father: It cannot be an

ancestor of 1 since its children are r and a node on the

right of r. So, we can apply the same argument: Either

the father of r is adequate for performing LEFT, or we

consider its father again. In this way, we necessarily

obtain a node for which the rearrangement is possible. T

is then transformed into a caterpillar by successively

applying the LEFT rearrangement to nodes which are not on the path between 1 and $\rho$. After a finite number of steps, all internal nodes are ancestors of 1 and $T$ has been transformed into a caterpillar. This concludes the proof of Lemma 2 and, therefore, of our Main Theorem. □
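The constructive argument of Lemma 2 can be illustrated with a short Python sketch. It assumes a simple RDT stored as nested pairs `(left, right)` with non-tuple leaf labels, and the function names are hypothetical. In this representation a LEFT move at a node is a left rotation, and applying it until every right child is a leaf yields the caterpillar while preserving the left-to-right leaf order.

```python
def left(node):
    """LEFT move at node p = (l, (x, y)): prune x and regraft it on edge (l, p)."""
    l, (x, y) = node
    return ((l, x), y)


def to_caterpillar(tree):
    """Transform a simple RDT into a caterpillar; return (caterpillar, number of moves)."""
    moves = 0

    def fix(node):
        nonlocal moves
        if not isinstance(node, tuple):
            return node
        l, r = node
        # While the right child is internal, it is not an ancestor of the
        # leftmost leaf, so a LEFT move can be applied at this node.
        while isinstance(r, tuple):
            l, r = left((l, r))
            moves += 1
        return (fix(l), r)

    return fix(tree), moves


def leaves(node):
    """Left-to-right leaf labels of a nested-pair tree."""
    if not isinstance(node, tuple):
        return [node]
    return leaves(node[0]) + leaves(node[1])
```

Each LEFT move shortens the right subtree of the node it is applied to, so the loop terminates, in line with the finiteness argument of the proof.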

4 LOCAL SEARCH METHOD

We consider data consisting of an alignment of n segments

with length k, and of the ordering O of the segments along

the locus. This alignment has been created before tree

construction and the problem is not to build simultaneously

the alignment and the tree, a much more complicated task

[29]. The aim is to find a (nearly) optimal duplication tree,

where “optimal” is defined by some usual phylogenetic

criterion and the ordered and aligned segments at hand.

Topological rearrangements described in the previous

section naturally lead to a local search method for this

purpose. We discuss its use to optimize the usual Wagner

parsimony [22] and the distance-based balanced minimum

evolution criterion (BME) [30], [31]. First, we describe our

local search method, then we define briefly these two

criteria and explain how to compute them during local

search.


Fig. 7. LEFT rearrangement.

Fig. 6. DELETE rearrangement.

4.1 The LSDT Method

Our method, LSDT (Local Search for Duplication Trees),

follows a classical local search procedure in which, at each

step, we try to strictly improve the current tree. This

approach can be used to optimize various criteria. In this

study, we restrict ourselves to parsimony and balanced minimum evolution; $f(T)$ represents the value (to be minimized) of one of these criteria for the duplication tree $T$ and the sequence set.

Algorithm 1 summarizes LSDT. The neighborhood of the current DT, $T_{current}$, is computed using SPR. As we explained earlier, we use the RADT procedure to restrict this neighborhood to valid DTs. When a tree is a valid DT, its $f$ criterion value is computed. That way, we select the best neighbor of $T_{current}$. If this DT improves the value obtained so far (i.e., $f(T_{best})$), the local search restarts with this new topology. If no neighbor of $T_{current}$ improves $T_{best}$, the local search is stopped and returns $T_{best}$.
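The local search loop described above can be sketched as follows. This is a minimal illustration in Python, not the authors' implementation: `spr_neighbors`, `is_duplication_tree`, and `f` are hypothetical callables standing for the enumeration of SPR neighbors, the RADT test, and the criterion value (parsimony or balanced minimum evolution), and trees can be any objects these callables understand.

```python
def lsdt(initial_tree, spr_neighbors, is_duplication_tree, f):
    """Local search: move to the best valid SPR neighbor while it strictly improves f."""
    t_best, f_best = initial_tree, f(initial_tree)
    while True:
        # Restrict the SPR neighborhood to valid duplication trees and score them.
        scored = [
            (f(t), t) for t in spr_neighbors(t_best) if is_duplication_tree(t)
        ]
        if not scored:
            return t_best, f_best
        f_next, t_next = min(scored, key=lambda pair: pair[0])
        if f_next >= f_best:
            # No neighbor strictly improves the best value: stop.
            return t_best, f_best
        t_best, f_best = t_next, f_next
```

As a toy usage example, with integers playing the role of trees, `lsdt(0, lambda x: [x - 2, x - 1, x + 1, x + 2], lambda x: x % 2 == 0, lambda x: (x - 6) ** 2)` walks through the even numbers and stops at the minimizer of the toy criterion.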

To analyze the time complexity of one LSDT step, we have to consider the size of the neighborhood defined by the restricted SPR. In the worst case, this size is of the same order as the size of an unrestricted SPR neighborhood, i.e., $O(n^2)$. Indeed, for the “double caterpillar” (Fig. 8), it is possible to move any subtree rooted on the path between $n/2$ and $\rho$ towards any edge of the path between $(n + 1)/2$ and $\rho$, and vice versa. Thus, for this tree, $O(n^2)$ restricted SPRs can be performed. In the worst case, restricting the neighborhood defined by SPR to duplication trees does not significantly decrease the neighborhood size. However, on average the diminution is quite significant; e.g., with $n = 48$, only 5 percent of the neighborhood corresponds to valid DTs, assuming DTs are uniformly distributed [26].

Since the time complexity of the recognition algorithm (RADT) is $O(n)$, computing the neighborhood defined by restricted SPR requires $O(n^3)$. The calculation of the criterion value is done for each tree of the restricted neighborhood. Thus one local search step basically requires $O(n^3 + n^2 g)$, where $g$ represents the time complexity of computing the criterion value. However, preprocessing allows this time complexity to be lowered, both for parsimony and minimum evolution, as we shall explain in the following sections.

4.2 The Maximum Parsimony Criterion

Parsimony is commonly acknowledged [22] to be a good

criterion when dealing with slightly divergent sequences,

which is usually the case with tandemly duplicated genes

[8]. The parsimony criterion involves selecting the tree

which minimizes the number of substitutions needed to

explain the evolution of the given sequences. Finding the

most parsimonious tree [22] or duplication tree [15] is

NP-hard, but we can find the optimal labeling of the

internal nodes and the parsimony score of a given tree T in

polynomial time using the Fitch-Hartigan algorithm [32],

[33]. The parsimony score and optimal labeling of internal nodes are independently computed for each position within sequences, using a postorder depth-first search algorithm that requires $O(n)$ time [32], [33]. Thus, computing the parsimony score of $n$ sequences of length $k$ requires $O(kn)$ time. Hence, if we use this algorithm during our local search method, one local search step is computed in $O(kn^3)$, which is relatively high.
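The per-position postorder computation mentioned above can be sketched in Python. The sketch implements Fitch's bottom-up pass for rooted binary trees, which yields the parsimony score; it assumes a tree given as nested 2-tuples with non-tuple leaf labels and a dictionary `seqs` mapping each leaf label to an aligned sequence. These names and the data layout are illustrative assumptions, not the paper's data structure.

```python
def fitch_site(node, seqs, site):
    """Return (candidate state set, substitution count) for one alignment position."""
    if not isinstance(node, tuple):
        return {seqs[node][site]}, 0
    left_set, left_cost = fitch_site(node[0], seqs, site)
    right_set, right_cost = fitch_site(node[1], seqs, site)
    common = left_set & right_set
    if common:
        # The children agree on at least one state: no extra substitution.
        return common, left_cost + right_cost
    # Disjoint state sets: one substitution is needed on this node's child edges.
    return left_set | right_set, left_cost + right_cost + 1


def parsimony_score(tree, seqs):
    """Sum the per-position scores over the k alignment positions, in O(kn) time."""
    k = len(next(iter(seqs.values())))
    return sum(fitch_site(tree, seqs, site)[1] for site in range(k))
```

For example, with `seqs = {1: "AAG", 2: "AAG", 3: "CCT", 4: "CCT"}`, the tree `((1, 2), (3, 4))` needs one substitution per position, while `((1, 3), (2, 4))` needs two.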

To speed up this process, we adapted techniques

commonly used in phylogeny for fast calculation of

parsimony. Our implementation uses a data structure

implemented (among others) in DNAPARS [24] and

described in [34], [35]. Let $T_p$ be the pruned subtree and $T_r$ be the resulting tree. A preprocessing stage computes the parsimony vector (i.e., the optimal score and optimal labeling of all sequence positions) of every rooted subtree of $T_r$ using a double depth-first search [36] (Fig. 9a); the first search is postordered and computes the parsimony vector of down-subtrees; the second search is preordered and computes the parsimony vector of up-subtrees. Each search requires $O(nk)$ time. Thanks to this data structure, the parsimony score of the tree obtained by regrafting $T_p$ on any given edge of $T_r$ is computed in $O(k)$ (Fig. 9b). Hence, computing the SPR neighbor with minimum parsimony of any given duplication tree is achieved in $O(n^3 + n \times nk + n^2 k) = O(n^3 + n^2 k)$; the first term $(n^3)$ represents the neighborhood computation; the second term $(n \times nk)$ corresponds to the time required by the $n$ preprocessing stages; the third term $(n^2 k)$ is the time to test the $n$ subtrees and the $n$ possible insertion edges.

Fig. 8. A simple rooted duplication tree with a double caterpillar structure.

4.3 The Distance-Based Balanced Minimum Evolution Principle

As in any distance-based approach, we first estimate the

matrix of pairwise evolutionary distances between the

segments, using some standard distance estimator [22],

e.g., the Kimura two-parameter estimator [37] in case of

DNA or the JTT method with proteins [38]. Let $\Delta$ be this matrix and $\Delta_{ij}$ be the distance between segments $i$ and $j$. The $\Delta$ matrix plus the segment order is the input of the reconstruction method.

The minimum evolution principle (ME) [39], [40]

involves selecting the shortest tree to be the tree which

best explains the observed sequences. The tree length is

equal to the sum of all the edge lengths, and the edge

lengths are estimated by minimizing a least squares fit

criterion. The problem of inferring optimal phylogenies

within ME is commonly assumed to be NP-hard, as are

many other distance-based phylogeny inference problems

[41]. Nonetheless, ME forms the basis of several phyloge-

netic reconstruction methods, generally based on greedy

heuristics. Among them is the popular Neighbor-Joining

(NJ) algorithm [17]. Starting from a star tree, NJ iteratively

agglomerates external pairs of taxa so as to minimize the

tree length at each step.

Recently, Pauplin [30] proposed a new simple formula to

estimate the tree length $L(T)$ of tree $T$:

$$L(T) = \sum_{i<j} 2^{1 - T_{ij}} \, \delta_{ij},$$

where $\delta_{ij}$ is the estimated distance between segments i and j and $T_{ij}$ is the topological distance (number of edges) in $T$

between these two segments. The correctness of this formula

was shown by Semple and Steel [42], while Desper and

Gascuel [31] showed that this formula is a special case of

weighted-least squares tree fitting. Moreover, Desper and

Gascuel demonstrated that selecting the shortest tree (as

computed from above formula) is statistically consistent and

well suited for phylogenetic inference. They called this new

version of ME “balanced minimum evolution” (BME) [31].
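A small numerical check may help to see how Pauplin's formula weights each pairwise distance by the topological distance between the two leaves. The sketch below is only illustrative: the adjacency-list encoding, the function names, and the four-leaf tree (pendant edges of length 1, 2, 3, and 4 and an internal edge of length 5) are assumptions made for this example. Because the toy distances are exactly the path lengths in that tree, the formula should return the sum of its edge lengths.

```python
from collections import deque
from itertools import combinations


def topological_distances(adj, leaves):
    """Number of edges between every ordered pair of leaves (breadth-first search)."""
    dist = {}
    for src in leaves:
        seen = {src: 0}
        queue = deque([src])
        while queue:
            u = queue.popleft()
            for v in adj[u]:
                if v not in seen:
                    seen[v] = seen[u] + 1
                    queue.append(v)
        for dst in leaves:
            dist[src, dst] = seen[dst]
    return dist


def balanced_tree_length(adj, leaves, delta):
    """Sum over leaf pairs of 2^(1 - topological distance) times the pairwise distance."""
    topo = topological_distances(adj, leaves)
    return sum(2.0 ** (1 - topo[i, j]) * delta[i, j]
               for i, j in combinations(leaves, 2))


# Hypothetical unrooted tree ((a,b),(c,d)) with internal nodes u and v.
adj = {"a": ["u"], "b": ["u"], "c": ["v"], "d": ["v"],
       "u": ["a", "b", "v"], "v": ["c", "d", "u"]}
leaves = ["a", "b", "c", "d"]
# Path-length distances for pendant edges a=1, b=2, c=3, d=4 and internal edge 5.
delta = {("a", "b"): 3, ("a", "c"): 9, ("a", "d"): 10,
         ("b", "c"): 10, ("b", "d"): 11, ("c", "d"): 7}

print(balanced_tree_length(adj, leaves, delta))
```

This direct evaluation visits every leaf pair, so it is quadratic in the number of leaves once the topological distances are known; it is meant as a check of the formula, not as the faster procedure discussed next.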

Using the above formula, the length of any given tree is

computed in $O(n^2)$, so computing one LSDT local search

step can be achieved in $O(n^4)$. However, a faster implementation

is possible using a straightforward modification

of our BME addition algorithm [43]. This involves:

1. pruning a rooted subtree Tp from tree T,

2. computing the average distance between all nonintersecting subtree pairs in the remaining tree Tr,

3. computing the average distance between Tp and any subtree of Tr in T, and

4. using formula (10) from [43] and RADT to find the best allowed edge to regraft Tp.

Steps 2 and 3 are based on algorithms described in [43],

which follow the same approach as the double depth-first

search described in the previous section. These two steps

require $O(n^2)$, just as Step 4. As there are $O(n)$ subtrees to

prune and regraft, this implementation requires $O(n^3)$ to

perform one search step.

5 RESULTS

5.1 Simulation Protocol

We applied our method and other existing methods to

simulated datasets obtained using the procedure described

in [18]. We uniformly randomly generated rooted tandem

duplication trees (see [26]) with 12, 24, and 48 leaves and

assigned lengths to the edges of these trees using the

coalescent model [44]. We then obtained molecular clock

trees (MC), which might be unrealistic in numerous cases,

e.g., when the sequences being studied contain pseudo-

genes which evolve much faster than functional genes.

Then, we generated nonmolecular clock trees (NO-MC)

from the previous trees by independently multiplying

22 IEEE/ACM TRANSACTIONS ON COMPUTATIONAL BIOLOGY AND BIOINFORMATICS, VOL. 2, NO. 1, JANUARY-MARCH 2005

Fig. 9. (a) Every edge defines one down-subtree and one up-subtree; e.g., A represents the down-subtree (2 3) defined by the edge e while D corresponds to the up-subtree (1 (4 5)). Moreover, only the parsimony vector of the five leaves is known before the preprocessing stage. The postorder search computes the parsimony vector of down-subtrees: A is computed from 2 and 3, B from 4 and 5, C from A and B. The preorder search computes the parsimony vector of up-subtrees: D is obtained from 1 and B, E is obtained from D and 3, etc. (b) When the parsimony vector of every subtree in Tr is known, regrafting Tp on any given edge and computing the parsimony score of the resulting tree only requires analyzing the parsimony vector of three subtrees and is done in $O(k)$ time.

every edge length by $1 + 0.8X$, where X was drawn from

an exponential distribution with parameter 1. MC trees

were rescaled by multiplying every edge length by 1.8.

The trees thus obtained (MC and NO-MC) have a

maximum leaf-to-leaf divergence in the range [0.1, 0.7], and in NO-MC trees the ratio between the longest and

shortest root-to-leaf lineages is about 3.0 on average. Both

values are in accordance with real data, e.g., gene families

[8] or repeated protein domains [10].
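The edge-length perturbation described above can be sketched in a few lines of Python. Since an exponential variable with parameter 1 has mean 1, the expected multiplier 1 + 0.8X equals 1.8, which appears to be why the MC trees are rescaled by 1.8: both families of trees then have comparable total lengths on average. The function names, the seed, and the constant toy edge length below are hypothetical choices for this illustration only.

```python
import random


def perturb_edge_lengths(edge_lengths, rng, scale=0.8):
    """NO-MC trees: multiply every edge independently by 1 + scale * X, X ~ Exp(1)."""
    return [length * (1.0 + scale * rng.expovariate(1.0)) for length in edge_lengths]


def rescale_clock_tree(edge_lengths, factor=1.8):
    """MC trees: multiply every edge by the same constant factor."""
    return [length * factor for length in edge_lengths]


rng = random.Random(0)            # fixed seed so the sketch is reproducible
clock_edges = [0.01] * 200_000    # toy edge lengths, not real simulation data
no_clock_edges = perturb_edge_lengths(clock_edges, rng)
rescaled_edges = rescale_clock_tree(clock_edges)

# Ratio of total lengths; it should be close to 1 if the expected multiplier is 1.8.
mean_ratio = sum(no_clock_edges) / sum(rescaled_edges)
print(round(mean_ratio, 2))
```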

SEQGEN [45] was used to produce a 1,000 bp-long

nucleotide multiple alignment from each of the generated

trees using the Kimura two-parameter model of substitution

[46], and a distance matrix was computed by DNADIST [24]

from this alignment using the same substitution model. For

MC and NO-MC cases, 1,000 trees (and, then, 1,000 sequence

sets and 1,000 distance matrices) were generated per tree

size. These data sets were used to compare the ability of the

various methods to recover the original trees from the

sequences or from the distance matrices, depending on the

method being tested. We measured the percentage of trees

(out of 1,000) being correctly reconstructed (%tr). For the

phylogeny reconstruction methods, we also kept the

percentage of duplication trees among the set of inferred

trees. Due to the random process used for generating these

trees and datasets, some short branches might not have

undergone any substitution (as can happen during evolution) and, thus,

are unobtainable, except by chance. When n and, thus, the

branch number is high, it becomes hard or impossible to

find the entire tree. So, we also measured the percentage of

duplication events in the true tree recovered by the inferred

tree (%ev). A duplication event involves one or more

internal nodes and is the lowest common ancestor of a set

of leaves; we say it “covers” its descendent leaves. However,

the leaves covered by a simple duplication event can change

when the root position changes. As regards the true tree, the

root is known and each event is defined by the set of leaves

which it covers. But, the inferred tree is unrooted. To avoid

ambiguity, we then tested all possible root positions and

chose the one which gave the highest proximity in number

of events detected between the true tree and the inferred

tree, where two events are identical if they cover the same

leaves. Finally, we kept the average parsimony value of each

method (pars).
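The event-based score can be illustrated with a deliberately simplified Python sketch. It assumes that both trees are already rooted and that every internal node is a simple duplication event identified by the set of leaves it covers; the search over all root positions of the inferred tree and the handling of multiple duplication events are left out. The nested-tuple encoding and the function names are hypothetical.

```python
def covered_leaf_sets(tree):
    """Return the leaf sets covered by the internal nodes of a rooted tree."""
    events = set()

    def visit(node):
        if not isinstance(node, tuple):
            return frozenset([node])
        leaves = frozenset().union(*(visit(child) for child in node))
        events.add(leaves)
        return leaves

    visit(tree)
    return events


def percent_events_recovered(true_tree, inferred_tree):
    """Percentage of events of the true tree that cover the same leaves in the inferred tree."""
    true_events = covered_leaf_sets(true_tree)
    inferred_events = covered_leaf_sets(inferred_tree)
    return 100.0 * len(true_events & inferred_events) / len(true_events)


true_tree = ((1, 2), ((3, 4), 5))
inferred_tree = ((1, 2), (3, (4, 5)))
print(percent_events_recovered(true_tree, inferred_tree))
```

In the full protocol, this comparison would be repeated for each admissible root position of the inferred tree and the best value retained.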

5.2 Performance and Comparison

Using this protocol, we compared NJ [17], TNT [47], and

GREEDY-SEARCH (GS) [21] which starts from the NJ tree, a

modified version of GREEDY TRHIST RESTRICTED (GTR)

[9] to infer multiple duplication trees, WINDOW [10],

DTSCORE [18], and eight versions of our local search

method LSDT corresponding to different starting duplica-

tion trees (GS, GTR, WINDOW, and DTSCORE) and

different criteria (parsimony and BME). TNT and GS use

the parsimony criterion, but the others are distance-based

methods. TNT is acknowledged as one of the very best

parsimony packages; it was run with 10 replicates and TBR

rearrangements. TNT often returns a set of equally

parsimonious trees. When this set contained duplication

trees, we randomly selected one of them; when no

duplication tree was inferred by TNT, we randomly

selected one of the output trees.

Results are given in Tables 1 and 2. First, we observe that

with n = 48 the true tree is almost never entirely found, for

the reasons explained earlier. On the other hand, the best

methods recover 80 to 95 percent of the duplication events,

indicating that the tested datasets are relatively easy. NJ

and TNT perform relatively well, but they often output

trees that are not duplication trees, which is unsatisfactory

(e.g., with 48 leaves and NO-MC, NJ and TNT only infer

1 percent and 5 percent of duplication trees, respectively).

The GS approach is noteworthy since it modifies the trees

inferred by NJ to transform them into duplication trees.

However, GS is only slightly better than NJ regarding the

proportion of correctly reconstructed trees, but consider-

ably degrades the number of recovered duplication events,

which could be explained by the blind search it performs

to transform NJ trees into duplication trees. GTR also

obtains relatively poor results. As expected from its

assumptions, WINDOW performs better in the MC case

than in the NO-MC one. Finally, DTSCORE obtains the best

performance among the four existing methods, whatever

the topological criterion considered.

Applying our method to starting trees produced by GS,

GTR, WINDOW, and DTSCORE reveals the advantages of

the local search approach. Optimizing parsimony or BME

gives similar results, with a slight advantage for parsimony

as expected from the relatively low divergence rates in our

data sets. The trees produced by GS, GTR, and WINDOW

are clearly improved and, for most, are better than those

obtained by DTSCORE. DTSCORE trees are also improved,

even though this improvement is not very high from a

topological point of view. This could be explained by the

fact that DTSCORE is already an accurate method with

respect to the datasets used.

When we consider the parsimony criterion, the gain

achieved by LSDT is appreciable for each start method. This

could be expected for GS, WINDOW, and DTSCORE, which

do not optimize this criterion; with n = 48 in the NO-MC case,

the gain for GS is about 329, thus confirming that this

method is clearly suboptimal; the gains for WINDOW and

DTSCORE are about 42 and 15, which are lower but still

significant. The results for GTR, which optimizes parsimony,

are more surprising since the gain (again with n = 48 in the

NO-MC case) is about 77 on average, which is very high.

Moreover, the parsimony value obtained by LSDT is very

close to that of TNT, in spite of a much more restricted

search space. This confirms the good performance of our

local search method. It should be stressed that these gains

are obtained at low computational cost as dealing with any

of the 48-taxon datasets only requires about 10 seconds

for parsimony and five seconds for BME on a standard

PC-Pentium 4.

5.3 Analysis of the ZNF45 Family

Zinc finger (ZNF) genes code for proteins that contain one

or more zinc finger motifs. The zinc finger motif is one of

the most common motifs involved in nucleic acid-protein

interaction. Experimental studies on functions of ZNF genes

suggest that many of them code for transcription factors,

and some of them are known to take part in cellular growth

and development [48]. However, the biological functions of

most ZNF genes are currently unknown. The 16 members of

ZNF45 gene family are found in the q13.2 gene cluster on

human chromosome 19 [49]. The organization and features

of the members of the ZNF45 family suggest that the genes

in the family may have been produced by a series of in situ

TABLE 2. Performance Comparison Using Simulations (No Molecular Clock of Evolution)

Note: see Table 1.

TABLE 1. Performance Comparison Using Simulations (Molecular Clock Mode of Evolution)

X+LSDT_Y: X is the method used to obtain the starting tree and Y the criterion being optimized by LSDT; %tr: the percentage of trees being correctly reconstructed; the percentage of duplication trees obtained by phylogeny reconstruction methods is given between parentheses; %ev: the percentage of duplication events in the true tree being recovered by the inferred tree; pars: the average parsimony value.

gene duplication events [49]. The ZNF45 gene family has

been previously studied by Tang et al. [10] and Zhang et al.

[21], who proposed different tandem duplication trees to

explain its evolutionary history.

We downloaded the DNA sequences of the 16 members

of ZNF45 from NCBI. Multiple alignment was achieved

using TCOFFEE (see footnote 1) with default settings. We removed gaps

as usual in phylogenetics [22] and third codon positions

which look saturated (734 parsimony steps are required to

explain the evolution of these 237 sites). We thus obtained a

final alignment (see footnote 2) containing 474 homologous sites, with a

maximum pairwise divergence of 0.45.
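The column filtering described above can be sketched as follows. This is a hypothetical illustration rather than the procedure actually used: it assumes that the alignment is held as a dictionary of equal-length strings, that it is in frame with column 0 at codon position 1, and that gaps are written as "-". The function name and the toy sequences are invented for the example.

```python
def filter_alignment(seqs):
    """Drop columns that contain a gap and columns at third codon positions."""
    n_columns = len(next(iter(seqs.values())))
    keep = [
        col for col in range(n_columns)
        if col % 3 != 2                                   # not a third codon position
        and all(seq[col] != "-" for seq in seqs.values())  # no gap in any sequence
    ]
    return {name: "".join(seq[col] for col in keep) for name, seq in seqs.items()}


toy_alignment = {"x": "ATGC-A", "y": "ATGCCA"}
print(filter_alignment(toy_alignment))
```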

PAUP* [23] was used to estimate the matrix of pairwise

distances, assuming the GTR substitution model [50] and a

gamma distribution of rates with parameter 1.

We used this distance matrix and DTSCORE to build a

starting tree, which was then refined by LSDT using

parsimony. We selected this criterion because of its good

performance with simulated data (Tables 1 and 2). The

resulting tree (Figs. 10a and 10b) is a simple DT requiring

897 steps to explain the extant sequences. We tried to

improve this score using a computationally intensive

ratchet approach [51], but were unable to obtain any other

DT with better (or even identical) parsimony. We also ran

TNT with ratchet, 1,000 random taxon addition replicates

and TBR branch swapping (i.e., all TNT options to intensify

the search) and found one maximum-parsimony phylogeny

requiring 896 steps. This phylogeny (Fig. 10c) contains an

unresolved node with degree 4 and is not a duplication tree.

The TNT phylogeny is close to the LSDT duplication tree. To

transform from one to the other only three taxa have to be

1. http://igs-server.cnrs-mrs.fr/Tcoffee/tcoffee_cgi/index.cgi.

2. Available on request.

Fig. 10. (a) Duplication tree for the 16 genes of the human ZNF45 family inferred by DTSCORE plus LSDT with parsimony; black dots represent the only allowed root positions, according to the tandem duplication model; the (arbitrarily) selected root position is circled. (b) Rooted duplication tree corresponding to tree (a). (c) Phylogeny inferred by TNT. Tree (a) can be obtained from tree (c) by moving ZNF45 and ZNF228 to edge 1, and ZNF233 to edge 2. Edge lengths in tree (a) and tree (c) were estimated by maximum likelihood [52]. Lengths in tree (b) are meaningless and were adjusted to obtain a readable drawing.

moved (Fig. 10), and both trees differ by only 1 parsimony

step. A similar difference was commonly observed in

simulation where TNT found (non-DT) phylogenies requir-

ing one parsimony step less (on average) than the DTs

found by LSDT (Tables 1 and 2), though the true tree used

to generate the sequences was a DT. Thus, having (only)

one parsimony step of difference between the best DT and

the best phylogeny is not significant and can be seen as

supporting the duplication model. Moreover, the discre-

pancy between the two trees can be explained by long

branch attraction, a phenomenon that frequently affects

parsimony-based reconstructions [53]. Indeed, ZNF180 and

ZNF229 genes are distant from the other genes (Figs. 10a

and 10c) and might perturb the whole tree. When removing

those two genes from the data set, both LSDT and TNT

found the same tree, which is identical to the LSDT tree of

Fig. 10a without the two genes. With 14 segments, the

probability of randomly picking up a duplication tree

among all distinct phylogenies is less than $10^{-4}$ [26]. This

extremely small probability indicates that the identity

between LSDT and TNT trees is very unlikely to be due

to chance. This provides a strong support for the tandem

duplication model and indicates that our LSDT tree likely

represents most—if not all—of the history of ZNF45 family.

We compared trees obtained by Tang et al. [10], Zhang

et al. [21], and those of the other programs to the LSDT tree

of Fig. 10. We computed the parsimony score of each tree

and the percentage of events shared by each tree with the

LSDT tree. Just as in the simulation study, we tested GS

[21], GTR [9], WINDOW [10], DTSCORE [18], and LSDT

using different starting points but optimizing parsimony in

all cases.

Results are displayed in Table 3 and confirm those

obtained with simulated data sets. Results of trees from

[10] and [21] are poor, which was expected as these

methods (WINDOW and GS, respectively) do not

optimize the parsimony criterion and as we did not use

the same alignment. GS is relatively poor, while

DTSCORE, WINDOW, and GTR perform better. LSDT

clearly improves these four methods, with gains ranging

from 10 to 50 parsimony steps. In all cases but GTR,

LSDT recovers the most parsimonious DT of Fig. 10.

6 CONCLUSION AND PROSPECTS

We have demonstrated that restricting the neighborhood

defined by the SPR rearrangement to valid duplication trees

allows the whole DT space to be explored. Thanks to these

rearrangements, we have defined a general local search

method which we used to optimize the parsimony and

balanced minimum evolution criteria. We have thus

improved the topological accuracy of all the tested

methods.

Several research directions are possible. Finding the set

of combinatorial configurations for the SPR rearrangement

that necessarily produce a duplication tree could allow

the neighborhood computation to be accelerated (e.g., for

n = 48 only 5 percent of the SPR neighborhood corresponds

to duplication trees) and, furthermore, provide more insight

into the nature of duplication trees, which are just starting

to be investigated mathematically [12], [26], [27]. Our local

search method could be improved using restricted TBR

rearrangements or with the help of different stochastic

approaches (taboo, noising, ...) in order to avoid local

minima. Moreover, it would be relevant to test this local

search method with other criteria like maximum likelihood.

Finally, combining the tandem duplication events with

speciation events, as described in [54] and [55] for

nontandem duplications, would be relevant for real

applications where we have homologous tandem repeats

from several genomes.

ACKNOWLEDGMENTS

The authors would like to thank Wafae El Alaoui for her help

with ZNF45 family genes, and Richard Desper, Wim Hordijk,

and the referees of the Workshop on Algorithms in

Bioinformatics (WABI ’04) for reading preliminary versions

of this paper. This work was supported by ACI-IMPBIO

(Ministère de la Recherche, France).

TABLE 3. Analysis of the ZNF45 Data Set

REFERENCES

[1] F. Blattner, G. Plunkett, C. Bloch, N. Perna, V. Burland, M. Riley, J. Collado-Vides, J. Glasner, C. Rode, G. Mayhew, J. Gregor, N. Davis, H. Kirkpatrick, M. Goeden, D. Rose, B. Mau, and Y. Shao, “The Complete Genome Sequence of Escherichia Coli K-12,” Science, vol. 277, no. 5331, pp. 1453-1474, 1997.

[2] E. Lander et al., “Initial Sequencing and Analysis of the Human Genome,” Nature, vol. 409, pp. 860-921, 2001.

[3] A. Smit, “Interspersed Repeats and Other Mementos of Transposable Elements in Mammalian Genomes,” Current Opinion in Genetics & Development, vol. 9, pp. 657-663, 1999.

[4] W. Fitch, “Phylogenies Constrained by Cross-Over Process as Illustrated by Human Hemoglobins in a Thirteen-Cycle, Eleven Amino-Acid Repeat in Human Apolipoprotein A-I,” Genetics, vol. 86, pp. 623-644, 1977.

[5] G. Levinson and G. Gutman, “Slipped-Strand Mispairing: A Major Mechanism for DNA Sequence Evolution,” Molecular Biology and Evolution, vol. 4, pp. 203-221, 1987.

[6] J. Zhang and M. Nei, “Evolution of Antennapedia-Class Homeobox Genes,” Genetics, vol. 142, no. 1, pp. 295-303, 1996.

[7] O. Elemento and O. Gascuel, “An Exact and Polynomial Distance-Based Algorithm to Reconstruct Single Copy Tandem Duplication Trees,” Proc. 14th Ann. Symp. Combinatorial Pattern Matching (CPM2003), 2003.

[8] O. Elemento, O. Gascuel, and M.-P. Lefranc, “Reconstructing the Duplication History of Tandemly Repeated Genes,” Molecular Biology and Evolution, vol. 19, pp. 278-288, 2002.

[9] G. Benson and L. Dong, “Reconstructing the Duplication History of a Tandem Repeat,” Proc. Intelligent Systems in Molecular Biology (ISMB1999), T. Lengauer, ed., pp. 44-53, 1999.

[10] M. Tang, M. Waterman, and S. Yooseph, “Zinc Finger Gene Clusters and Tandem Gene Duplication,” J. Computational Biology, vol. 9, pp. 429-446, 2002.

[11] E. Rivals, “A Survey on Algorithmic Aspects of Tandem Repeats Evolution,” Int’l J. Foundations of Computer Science, vol. 15, no. 2, pp. 225-257, 2004.

[12] O. Gascuel, D. Bertrand, and O. Elemento, “Reconstructing the Duplication History of Tandemly Repeated Sequences,” Math. of Evolution and Phylogeny, O. Gascuel, ed., 2004.

[13] S. Ohno, Evolution by Gene Duplication. Springer Verlag, 1970.

[14] P.L. Fleche, Y. Hauck, L. Onteniente, A. Prieur, F. Denoeud, V. Ramisse, P. Sylvestre, G. Benson, F. Ramisse, and G. Vergnaud, “A Tandem Repeats Database for Bacterial Genomes: Application to the Genotyping of Yersinia Pestis and Bacillus Anthracis,” BioMed Central Microbiology, vol. 1, pp. 2-15, 2001.

[15] D. Jaitly, P. Kearney, G. Lin, and B. Ma, “Methods for Reconstructing the History of Tandem Repeats and Their Application to the Human Genome,” J. Computer and System Sciences, vol. 65, pp. 494-507, 2002.

[16] P. Sneath and R. Sokal, Numerical Taxonomy. pp. 230-234, San Francisco: W.H. Freeman and Company, 1973.

[17] N. Saitou and M. Nei, “The Neighbor-Joining Method: A New Method for Reconstructing Phylogenetic Trees,” Molecular Biology and Evolution, vol. 4, pp. 406-425, 1987.

[18] O. Elemento and O. Gascuel, “A Fast and Accurate Distance-Based Algorithm to Reconstruct Tandem Duplication Trees,” Bioinformatics, vol. 18, pp. 92-99, 2002.

[19] J. Barthelemy and A. Guenoche, Trees and Proximity Representations. Wiley and Sons, 1991.

[20] S. Sattath and A. Tversky, “Additive Similarity Trees,” Psychometrika, vol. 42, pp. 319-345, 1977.

[21] L. Zhang, B. Ma, L. Wang, and Y. Xu, “Greedy Method for Inferring Tandem Duplication History,” Bioinformatics, vol. 19, pp. 1497-1504, 2003.

[22] D. Swofford, P. Olsen, P. Waddell, and D. Hillis, Molecular Systematics. pp. 407-514, Sunderland, Mass.: Sinauer Associates, 1996.

[23] D. Swofford, PAUP*. Phylogenetic Analysis Using Parsimony (*and Other Methods), version 4. Sunderland, Mass.: Sinauer Associates, 1999.

[24] J. Felsenstein, “PHYLIP - PHYLogeny Inference Package,” Cladistics, vol. 5, pp. 164-166, 1989.

[25] C. Semple and M. Steel, Phylogenetics. Oxford Univ. Press, 2003.

[26] O. Gascuel, M. Hendy, A. Jean-Marie, and S. McLachlan, “The Combinatorics of Tandem Duplication Trees,” Systematic Biology, vol. 52, pp. 110-118, 2003.

[27] J. Yang and L. Zhang, “On Counting Tandem Duplication Trees,” Molecular Biology and Evolution, vol. 21, pp. 1160-1163, 2004.

[28] D. Robinson, “Comparison of Labeled Trees with Valency Three,” J. Combinatorial Theory, vol. 11, pp. 105-119, 1971.

[29] L. Wang and D. Gusfield, “Improved Approximation Algorithms for Tree Alignment,” J. Algorithms, vol. 25, pp. 255-273, 1997.

[30] Y. Pauplin, “Direct Calculation of a Tree Length Using a Distance Matrix,” J. Molecular Evolution, vol. 51, pp. 41-47, 2000.

[31] R. Desper and O. Gascuel, “Theoretical Foundation of the Balanced Minimum Evolution Method of Phylogenetic Inference and Its Relationship to Weighted Least-Squares Tree Fitting,” Molecular Biology and Evolution, vol. 21, no. 3, pp. 587-598, 2004.

[32] W. Fitch, “Toward Defining the Course of Evolution: Minimum Change for a Specified Tree Topology,” Systematic Zoology, vol. 20, pp. 406-416, 1971.

[33] J. Hartigan, “Minimum Mutation Fits to a Given Tree,” Biometrics, vol. 29, pp. 53-65, 1973.

[34] G. Ganapathy, V. Ramachandran, and T. Warnow, “Better Hill-Climbing Searches for Parsimony,” Proc. Third Int’l Workshop Algorithms in Bioinformatics, 2003.

[35] P.A. Goloboff, “Methods for Faster Parsimony Analysis,” Cladistics, vol. 12, pp. 199-220, 1996.

[36] V. Berry and O. Gascuel, “Inferring Evolutionary Trees with Strong Combinatorial Evidence,” Theoretical Computer Science, vol. 240, pp. 271-298, 2000.

[37] M. Kimura, “A Simple Model for Estimating Evolutionary Rates of Base Substitutions through Comparative Studies of Nucleotide Sequences,” J. Molecular Evolution, vol. 16, pp. 111-120, 1980.

[38] D. Jones, W. Taylor, and J. Thornton, “The Rapid Generation of Mutation Data Matrices from Protein Sequences,” Computer Applications in Biosciences, vol. 8, pp. 275-282, 1992.

[39] K. Kidd and L. Sgaramella-Zonta, “Phylogenetic Analysis: Concepts and Methods,” Am. J. Human Genetics, vol. 23, pp. 235-252, 1971.

[40] A. Rzhetsky and M. Nei, “Theoretical Foundation of the Minimum-Evolution Method of Phylogenetic Inference,” Molecular Biology and Evolution, vol. 10, pp. 1073-1095, 1993.

[41] W. Day, “Computational Complexity of Inferring Phylogenies from Dissimilarity Matrices,” Bull. Math. Biology, vol. 49, pp. 461-467, 1987.

[42] C. Semple and M. Steel, “Cyclic Permutations and Evolutionary Trees,” Advances in Applied Math., vol. 32, no. 4, pp. 669-680, 2004.

[43] R. Desper and O. Gascuel, “Fast and Accurate Phylogeny Reconstruction Algorithms Based on the Minimum-Evolution Principle,” J. Computational Biology, vol. 9, pp. 687-706, 2002.

[44] M. Kuhner and J. Felsenstein, “A Simulation Comparison of Phylogeny Algorithms under Equal and Unequal Evolutionary Rates,” Molecular Biology and Evolution, vol. 11, pp. 459-468, 1994.

[45] A. Rambaut and N. Grassly, “Seq-Gen: An Application for the Monte Carlo Simulation of DNA Sequence Evolution Along Phylogenetic Trees,” Computer Applications in Biosciences, vol. 13, pp. 235-238, 1997.

[46] J. Felsenstein and G. Churchill, “A Hidden Markov Model Approach to Variation Among Sites in Rate of Evolution,” Molecular Biology and Evolution, vol. 13, pp. 93-104, 1996.

[47] P.A. Goloboff, J.S. Farris, and K. Nixon, “TNT: Tree Analysis Using New Technology,” 2000, www.cladistics.com.

[48] T. El-Barabi and T. Pieler, “Zinc Finger Proteins: What We Know and What We Would Like to Know,” Mechanisms of Development, vol. 33, pp. 155-169, 1991.

[49] M. Shannon, J. Kim, L. Ashworth, E. Branscomb, and L. Stubbs, “Tandem Zinc-Finger Gene Families in Mammals: Insights and Unanswered Questions,” DNA Sequence - The J. Sequencing and Mapping, vol. 8, no. 5, pp. 303-315, 1998.

[50] P. Waddell and M. Steel, “General Time Reversible Distances with Unequal Rates Across Sites: Mixing Γ and Inverse Gaussian Distributions with Invariant Sites,” Molecular Phylogenetics and Evolution, vol. 8, pp. 398-414, 1997.

[51] K.C. Nixon, “The Parsimony Ratchet, a New Method for Rapid Parsimony Analysis,” Cladistics, vol. 15, pp. 407-414, 1999.

[52] S. Guindon and O. Gascuel, “A Simple, Fast and Accurate Method to Estimate Large Phylogenies by Maximum-Likelihood,” Systematic Biology, vol. 52, no. 5, pp. 696-704, 2003.

[53] J. Felsenstein, “Cases in Which Parsimony or Compatibility Methods Will Be Positively Misleading,” Systematic Zoology, vol. 27, pp. 401-410, 1978.

[54] D. Page and M. Charleston, “From Gene to Organismal Phylogeny: Reconciled Trees and the Gene Tree/Species Tree Problem,” Molecular Phylogenetics and Evolution, vol. 7, pp. 231-240, 1997.

[55] M. Hallett, J. Lagergren, and A. Tofigh, “Simultaneous Identification of Duplications and Lateral Transfers,” Proc. Conf. Research and Computational Molecular Biology (RECOMB2004), pp. 347-356, 2004.

Denis Bertrand is a PhD student under the supervision of Olivier Gascuel. His research subject is the study of tandemly repeated sequences. His main areas of interest are phylogenetics, combinatorics, and algorithms.

Olivier Gascuel is Directeur de Recherche at the Centre National de la Recherche Scientifique (France). He is the head of the bioinformatics group from the LIRMM laboratory, belongs to the editorial board of Systematic Biology and of BMC Evolutionary Biology, and has served in a number of program committees of bioinformatics conferences (ISMB, WABI). He started in this field in the mid 1980s, with works on sequence analysis and protein structure prediction. Since the beginning of the 1990s, he turned his efforts to phylogenetics, focusing on the mathematical and computational tools and concepts. He (co)authored several well-known phylogeny inference programs (BioNJ, PHYML, FastME).

Optimizing Multiple Seeds for Protein Homology Search

Daniel G. Brown

Abstract—We present a framework for improving local protein alignment algorithms. Specifically, we discuss how to extend local

protein aligners to use a collection of vector seeds or ungapped alignment seeds to reduce noise hits. We model picking a set of seed

models as an integer programming problem and give algorithms to choose such a set of seeds. While the problem is NP-hard, and

Quasi-NP-hard to approximate to within a logarithmic factor, it can be solved easily in practice. A good set of seeds we have chosen

allows four to five times fewer false positive hits, while preserving essentially identical sensitivity as BLASTP.

Index Terms—Bioinformatics database applications, similarity measures, biology and genetics.

1 INTRODUCTION

PAIRWISE alignment is one of the most important problems

in bioinformatics. Here, we continue an exploration into

the seeding and structure of local pairwise alignments and

show that a recent strategy for seeding nucleotide align-

ments can be expanded to protein alignment. Heuristic

protein sequence aligners, exemplified by BLASTP [1], find

almost all high-scoring alignments. However, the sensitivity

of heuristic aligners to moderate-scoring alignments can

still be poor. In particular, alignments with BLASTP score

between 40 and 60 are commonly missed by BLASTP, even

though many are of truly homologous sequences. We focus

on these alignments and show that a change to the seeding

strategy gives success rates comparable to BLASTP with far

fewer false positive hits.

Specifically, multiple spaced seeds [2] and their relatives,

vector seeds [3], can be used in local protein alignment to

reduce the false positive rate in the seeding step of alignment

by a factor of four. We present a protocol for choosing

multiple vector seeds that allows us to find good seeds that

work well together. Our approach is based on solving a set-

cover integer program whose solution gives optimal thresh-

olds for a collection of seeds. Our IP is prone to overtraining,

so we discuss how to reduce the dependency of the solution

on the set of training alignments, both by increasing the false

positive rate of the seeds found slightly and by making the

program less sensitive to outliers. The problem we are trying

to solve is NP-hard and Quasi-NP-hard to approximate to a

sublogarithmic factor, so we present heuristics for it, though

most instances are of moderate enough size to use integer

programming solvers.

Our successful result here contrasts with our previous

work [3] in which we introduced vector seeds. There, we

found that using only one vector seed would not substan-

tially improve BLASTP’s sensitivity or selectivity. The use

of multiple seeds is the important change in the present

work. This successful use of multiple seeds is similar to

what has been reported recently for pairwise nucleotide

alignment [4], [5], [6], but the approach we use is different

since protein aligners require extremely high sensitivity. We

note that, independently of our work, the authors of

PatternHunter, the first program to use optimized spaced

seeds, have developed a protein aligner based on seeding

approaches similar to those we discuss here [7]; however,

they have not offered theoretical justification for their

approach, which, in some sense, we provide here.

Our results confirm the themes developed by us and

others since the initial development of spaced seeds. The

first theme is that spaced seeds help in heuristic alignment

because the very surprisingly conserved regions that one

uses as a basis for building an alignment happen more

independently in true alignments than for unspaced seeds.

In protein alignments, there are often many small regions of

high conservation, each of which has a chance to have a hit

to a seed in it. With unspaced seeds, the probability that any

one of these regions is hit is low, but, when a region is hit,

there may be several more hits, which is unhelpful. By

contrast, a spaced seed is likely to hit a given region fewer

times, wasting less runtime, and will also hit at least one

region in more alignments, increasing sensitivity.

The second theme is that the more one understands how

local and global alignments look, the more possible it is to

tailor alignment seeding strategies to a particular applica-

tion, reducing false positives and improving true positives.

Here, by basing our set of seeds on sensitivity to true

alignments, we choose a set of seed models that hit diverse

types of short conserved alignment subregions. Conse-

quently, the probability that one of them hits a given

alignment is high since they complement each other well.

IEEE/ACM TRANSACTIONS ON COMPUTATIONAL BIOLOGY AND BIOINFORMATICS, VOL. 2, NO. 1, JANUARY-MARCH 2005 29

The author is with the School of Computer Science, University of Waterloo, 200 University Ave., West, Waterloo, ON N2L 3G1, Canada. E-mail: [email protected].

Manuscript received 1 Nov. 2004; revised 2 Jan. 2005; accepted 11 Jan. 2005; published online 30 Mar. 2005. For information on obtaining reprints of this article, please send e-mail to: [email protected], and reference IEEECS Log Number TCBBSI-0183-1104.

1545-5963/05/$20.00 © 2005 IEEE Published by the IEEE CS, CI, and EMB Societies & the ACM

2 BACKGROUND: HEURISTIC ALIGNMENT AND

SPACED SEEDS

Since the development of heuristic sequence aligners [1], the

same approach has been commonly used: identify short,

highly conserved regions and build local alignments

around these “hits.” This avoids the use of the Smith-

Waterman algorithm [8] for pairwise local alignment, which

has $\Theta(nm)$ runtime on input sequences $A$ and $B$ of length $n$

and $m$, respectively. (We will use the notation $A[i]$ to

represent the $i$th character of sequence $A$.)

Instead, assuming random sequences, the expected

runtime of this heuristic search method is $h(n,m) + a(n,m)$, where $h(n,m)$ is the amount of time needed to find hits in the

two sequences and $a(n,m)$ is the expected time needed to

compute the alignments from the hits. Most heuristic aligners

have $h(n,m) = \Theta(n + m + nm/k)$, while $a(n,m) = \Theta(nm/k)$ for some large constant $k$. There are many assumptions in

these formulas. First, even when we align sequences with true

homologies, most hits are between unrelated positions, so the

estimation of the runtime need not consider whether the

sequences are related. Further, this simplification assumes

that each hit found in the first phase results in a constant

amount of work being done in the second phase to identify

that it is false (or that true hits are rare). It is the speedup factor

of $k$ that is important here; assuming $m$ and $n$ are large, the

overall runtime is much faster.

Most heuristic aligners look at the scores of matching

characters in short regions and use high-scoring short

regions as hits. For example, BLASTP [1] hits are three

consecutive positions in the two sequences where the total

score, according to a BLOSUM or PAM scoring matrix, of

aligning the three letters in one sequence to the three letters

of the other sequence is at least +13. Finding such hits can

be done easily, for example, by making a hash table of one

sequence and searching positions of the hash table for the

other sequence, in time proportional to the length of the

sequences and the number of hits found. BLASTP uses

more complicated data structures for this process, but the

principle is similar.
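The hash-table idea can be sketched as follows. This is a toy illustration only, not BLASTP's actual data structure: the four-letter alphabet, the `toy_score` function, and the threshold of 9 are assumptions made so that the example is self-contained. One sequence is indexed by its words, and each word of the other sequence is looked up through its "neighborhood" of words that score at least the threshold against it.

```python
from collections import defaultdict
from itertools import product

ALPHABET = "ACDE"  # hypothetical toy alphabet, not the 20 amino acids


def toy_score(a, b):
    """Hypothetical scoring function standing in for a BLOSUM or PAM matrix."""
    return 5 if a == b else -1


def word_score(w1, w2):
    """Total score of aligning two equal-length words position by position."""
    return sum(toy_score(a, b) for a, b in zip(w1, w2))


def find_hits(seq_a, seq_b, k=3, threshold=9):
    """Return sorted (i, j) pairs whose k-letter words score >= threshold."""
    # Hash table: every k-letter word of A mapped to its start positions.
    index = defaultdict(list)
    for i in range(len(seq_a) - k + 1):
        index[seq_a[i:i + k]].append(i)
    # For each possible word, the words that score at least the threshold
    # against it (feasible here only because the toy alphabet is tiny).
    words = ["".join(p) for p in product(ALPHABET, repeat=k)]
    neighbors = {w: [u for u in words if word_score(w, u) >= threshold]
                 for w in words}
    hits = []
    for j in range(len(seq_b) - k + 1):
        for u in neighbors.get(seq_b[j:j + k], []):
            for i in index.get(u, []):
                hits.append((i, j))
    return sorted(hits)
```

After the index and neighborhoods are built, the work is proportional to the length of the second sequence plus the number of hits reported, which matches the running time described above.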

2.1 Seeding Models

To generalize BLASTP’s hits, we defined vector seeds [3], [9].

A vector seed is a pair $(v, T)$. Vector $v = (v_1, \ldots, v_k)$ is a

vector of position multipliers and $T$ is a threshold. Given

two sequences $A$ and $B$, let $s_{i,j}$ be the score in our scoring

matrix of aligning $A[i]$ to $B[j]$. If we consider position $i$

in $A$ and $j$ in $B$, we then get a hit to the vector seed at those

positions when $v \cdot (s_{i,j}, s_{i+1,j+1}, \ldots, s_{i+k-1,j+k-1}) \ge T$. In this

framework, BLASTP’s seed is ((1, 1, 1), 13).
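A minimal sketch of the hit test may make the definition concrete. The function name `vector_seed_hits` and the caller-supplied `score` function are assumptions for illustration; the quadratic scan over all position pairs is used only for clarity and is not how a real aligner would search.

```python
def vector_seed_hits(seq_a, seq_b, v, threshold, score):
    """List positions (i, j) at which the vector seed (v, threshold) has a hit.

    score(a, b) is the scoring-matrix entry for aligning letters a and b.
    """
    k = len(v)
    hits = []
    for i in range(len(seq_a) - k + 1):
        for j in range(len(seq_b) - k + 1):
            # Dot product of the multipliers with the positional scores.
            total = sum(v[t] * score(seq_a[i + t], seq_b[j + t])
                        for t in range(k))
            if total >= threshold:
                hits.append((i, j))
    return hits
```

With a hypothetical score of +5 for identical letters and -1 otherwise, the sequences `MKVLA` and `MKTLA` have no hit to ((1, 1, 1), 13), because every window of three contains the V/T mismatch, while the seed ((1, 0, 1, 1), 13) has a hit at position pair (1, 1), where the mismatch falls on the zero multiplier.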

Vector seeds generalize the earlier idea of spaced seeds

[2] for nucleotide alignments, where both scores and the

vector are 0/1 vectors and where T, the threshold, equals

the number of 1s in v. A spaced seed requires an exact

match in the positions where the vector is 1 and the places

where the vector is 0 are “don’t care” positions. In our

original work with vector seeds [3], the freedom to allow

positions of v to have values besides 0 and 1 was not

extremely useful, so the vector seeds we discuss here all

have binary vectors v.

Spaced seeds have the same expected number of junk

hits as unspaced seeds. For unrelated noise DNA sequences,

this is $nm4^{-w}$, where $w$ is the number of ones in

the seed (its support). Their advantage comes because more

distinct internal subregions of a given alignment will match

a spaced seed than the unspaced seed; this happens because

the hits are more independent of each other. The probability

that an alignment of length 64 with 70 percent conservation

matches a good spaced seed of support 11 can be greater

than 45 percent because there are likely to be more

subregions that match the spaced seed than the unspaced

seed; by contrast, the default BLASTN seed, which is

11 consecutive required matches, hits only 30 percent of

alignments.
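These figures can be checked approximately by simulation. The sketch below assumes the simple model in which each of the 64 alignment positions matches independently with probability 0.7. The pattern `111010010100110111` is the weight-11 seed published with PatternHunter; the text above does not say which spaced seed it has in mind, so using that pattern here is an assumption. The function names are also illustrative.

```python
import random


def expected_junk_hits(n, m, w):
    """Expected hits between unrelated uniform DNA sequences: n * m * 4^(-w)."""
    return n * m * 4.0 ** (-w)


def hit_probability(pattern, length=64, p_match=0.7, trials=5000, rng_seed=0):
    """Monte Carlo estimate of Pr[an alignment contains a hit to the pattern].

    pattern is a 0/1 string; positions marked 1 must be matches.
    """
    rng = random.Random(rng_seed)
    required = [t for t, c in enumerate(pattern) if c == "1"]
    span = len(pattern)
    hits = 0
    for _ in range(trials):
        matches = [rng.random() < p_match for _ in range(length)]
        if any(all(matches[s + t] for t in required)
               for s in range(length - span + 1)):
            hits += 1
    return hits / trials
```

Calling `hit_probability("111010010100110111")` and `hit_probability("1" * 11)` gives estimates for the spaced and the consecutive seed respectively. Both seeds have support 11, so `expected_junk_hits` returns the same value for either.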

Spaced seeds have three advantages over unspaced

seeds. First, their hits are more independent, which means

that it is more likely that a given alignment has at least one

hit to a seed; fewer alignments have many. Second, the seed

model can be tailored to a particular application: If there is

structure or periodicity to alignments, this can be reflected

in the design of the seeds chosen. For example, in searching

for homologous codons, they can be tailored to the three-

periodic structure of such alignments [10], [11]. Finally, the

use of multiple seeds allows us to boost sensitivity well

above what is achievable with a single seed, which, for

nucleotide alignment, can give near 100 percent sensitivity

in reasonable runtime [4].

Keich et al. [12] have given an algorithm for a simple

model of alignments to compute the probability that an

alignment hits a seed; this has been extended by both

Buhler et al. [10] and Brejova et al. [11] to more complex

sequence models. Choi et al. [13] have also shown

experimental results for spaced seeds with high sensitivity

across a wide range of homologies. Kucherov et al. [14]

show how to adapt spaced seeds to the interesting case of

alignments where no subregion of the alignment has a

higher score than the entire alignment.

2.2 Some Newer Seeding Models

Another seeding model, which has recently arisen [7], [15]

is of ungapped alignment seeds. These were developed by

Brown and Hudek [15] to anchor global alignments of

ambiguous DNA sequences and, independently, by Kisman

et al. [7] in their heuristic protein aligner, tPatternHunter.

An ungapped alignment seed is a vector v, a global

threshold T, and a vector of positional minimum scores b.

There is a match between positions in the two sequences

when the vector of pairwise match scores is at least as large,

position-by-position, as the minimum scores vector b and

where the dot product of the position-by-position scores and

the multiplier vector v is at least T. These seeds are a

compromise between spaced seeds and consecutive seeds:

They require spaced positions to have good scores (those

where the lower bound vector b has high values), while also

focusing on the quality of the local alignment at the seed by

possibly examining all of the positions of the seed. It is not

possible to cast an ungapped alignment seed in the language

of vector seeds because of the requirement that each

individual position’s score is greater than its bound. It is

possible to cast a vector seed as an ungapped alignment seed,

by setting the b vector to $-\infty$ in all positions, thus removing

the position-by-position lower bound requirement.
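The relationship between the two seed types can be sketched in a few lines. The function names are illustrative assumptions; `scores` stands for the pairwise match scores at the positions under the seed.

```python
import math


def ungapped_seed_hit(scores, v, b, threshold):
    """True if every score meets its lower bound in b and v . scores >= threshold."""
    meets_bounds = all(s >= lo for s, lo in zip(scores, b))
    dot = sum(w * s for w, s in zip(v, scores))
    return meets_bounds and dot >= threshold


def vector_seed_hit(scores, v, threshold):
    """A vector seed is an ungapped alignment seed whose bounds are all -infinity."""
    no_bounds = [-math.inf] * len(v)
    return ungapped_seed_hit(scores, v, no_bounds, threshold)
```

For example, with `v = [1, 1, 1, 1]`, `b = [4, -math.inf, 4, -math.inf]`, and threshold 13, the scores `[3, 5, 5, 4]` sum to 17 and so satisfy the vector seed, but they fail the ungapped alignment seed because the first score is below its bound of 4.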

Csuros [16] has also extended this framework of seeding to

look at variable-length seeds, where the length of the regions

that must match depends on their positional scores. While

this approach can also be brought into the framework of the

present work, we have not done so in our experiments.

2.3 Multiple Seeds

Another important extension to these ideas of seeding has

been the use of multiple seeds of different sorts in basing

alignments. In this approach, an attempt is made to perform

extension when any of a collection of seed models has a hit.

This will work well if each chosen seed has a very low false

positive rate so that their total false positive rate is still

below that of one seed of comparable sensitivity.

Several authors [2], [3], [4], [6], [10], [17] have proposed

using multiple seeds and given heuristics to choose them.

This problem was recently given a theoretical framework by

Xu et al. [5] and, independently, Kucherov et al. [18] studied

heuristic algorithms for identifying sets of good seeds. In

work unrelated to the present work, Kisman et al. [7] have

heuristically used multiple ungapped alignment seeds

(though not called by that term) for protein alignment. To

the best of our knowledge, the present work is the first work

to choose multiple seeds for protein alignment with a

theoretical basis.

3 CHOOSING A GOOD SET OF SEEDS

Spaced seeds have made a substantial impact in nucleotide

alignments, but less in protein alignment. Here, we show

that they have use in this domain as well. Specifically,

multiple vector seeds or multiple ungapped alignment

seeds, with high thresholds, give essentially the sensitivity

of BLASTP with four times fewer noise hits. Slightly fewer

alignments are hit, but the regions of alignment hit by the

vector seeds are all of the same good ones as hit by the

BLASTP seed and a few more. In other words, BLASTP hits

more alignments, but the hits found by BLASTP and not the

vector seeds are mostly in areas unlikely to be expanded to

full alignments.

We adapt a framework for identifying sets of seeds

introduced by Xu et al. [5]. We model multiple seed

selection as a set cover problem and give heuristics for the

problem. For our purposes, one advantage of the formula-

tion is that it works with explicit alignments: Since real

alignments may not look like a probabilistic model, we can

pick a set of seeds for sensitivity to a collection of true

alignments. Unfortunately, this also gives rise to problems,

as the thresholds may be set high due to overtraining for a

given set of alignments.

Most of our experiments concern themselves with vector

seeds, but the framework can be expanded straightforwardly

to ungapped alignment seeds as well. This is because we do

not compute theoretical sensitivity of the seeds, but, instead,

only identify hits in existing real alignments. Indeed, our

framework is quite broad and extends to many different

models for seeding as long as the assumption that false

positives are additive is reasonably accurate and that one can

compute that false positive rate for the seed models. Where

the ungapped alignment seeds require some thought, we

present the addition needed for them.

3.1 Background Rates

One important detail that we need before we begin is the

background hit rate for a given vector or ungapped

alignment seed. We noted previously [3] that this can be

computed for vector seeds, given a scoring matrix; it is also

straightforward to compute for ungapped alignment seeds

as well. Namely, from the scoring matrix, we can compute

the distribution of letters in random sequences implied by

the matrix; this can then be used to compute the distribu-

tion of scores found in unrelated sequences. Using this, we

can compute the probability that unrelated sequences give a

hit to a given seed at a random position, which we call the

false positive rate for that seed. In fact, we can easily

compute the entire probability distribution on the score for

a given seed vector at a random position. Similarly, we can

compute this probability under the constraint that posi-

tional scores have minimum value, thus expanding to

ungapped alignment seeds.
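The computation described above can be sketched as a convolution of per-position score distributions. This sketch takes the background letter frequencies as an input rather than deriving them from the scoring matrix, and it assumes positions are independent; both are simplifying assumptions, and the function names are illustrative.

```python
from collections import defaultdict


def position_score_distribution(freqs, score):
    """Distribution of the score of aligning two independent random letters."""
    dist = defaultdict(float)
    for a, pa in freqs.items():
        for b, pb in freqs.items():
            dist[score(a, b)] += pa * pb
    return dict(dist)


def seed_score_distribution(v, pos_dist, b=None):
    """Distribution of v . (s_1, ..., s_k) for independent positional scores.

    If b is given, positional scores below their bound are discarded, so the
    remaining mass is the joint probability that all bounds are met.
    """
    total = {0: 1.0}
    for idx, weight in enumerate(v):
        lo = b[idx] if b is not None else float("-inf")
        nxt = defaultdict(float)
        for t, pt in total.items():
            for s, ps in pos_dist.items():
                if s >= lo:
                    nxt[t + weight * s] += pt * ps
        total = dict(nxt)
    return total


def false_positive_rate(v, threshold, pos_dist, b=None):
    """Probability that a random position pair scores at least the threshold."""
    dist = seed_score_distribution(v, pos_dist, b)
    return sum(p for s, p in dist.items() if s >= threshold)
```

The intermediate dictionary returned by `seed_score_distribution` is the entire probability distribution on the seed score mentioned above, so the false positive rate for every candidate threshold of a pattern can be read off from a single pass.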

For the default BLASTP seed, the probability that two

random unrelated positions have a hit is quite high, 1/

1,600. Because of this high level of false positives, BLASTP

must filter hits further in hopes of throwing out hits in

unrelated sequences. Specifically, BLASTP rapidly exam-

ines the local area around a hit and, if this region is not also

well-conserved, the hit is thrown out. Sometimes, this

filtering throws out all of the hits found in some true

alignments and, thus, BLASTP misses them, even though

they hit the seed. One way of modeling this filtering is to

view BLASTP as testing two seeds simultaneously: The

vector seed ((1, 1, 1), 13) and an ungapped alignment seed

that looks at the region surrounding the seed hit.

Our goal in using other seed models here is to reduce the

false positive rate, while still hitting the overwhelming

majority of alignments and hitting them in places that are

highly enough conserved as to make a full alignment likely.

A flowchart of our proposal, and the approach of BLASTP,

is in Fig. 1.

For a set Q of alignment seeds, we say that its false positive

rate is the probability that any seed in Q has a hit to two

random positions in unrelated sequences. This is not equal to

the sum of the false positive rates for all seeds in Q since hits to

BROWN: OPTIMIZING MULTIPLE SEEDS FOR PROTEIN HOMOLOGY SEARCH 31

one seed may overlap hits to another. However, we will use

this approximation in our optimization. As we extend to a

very large collection of seeds in Q, this can become worrisome

as the same false positive may be counted many times.

However, this may be appropriate, in fact, depending on how

the search is done to find the false hits.

3.2 An Integer Program to Choose Many Seeds

Here, we give an integer program to find the set of seeds

that hits all alignments in a given training set with overall

lowest possible false positive rate. We will show that our IP

encodes the Set-Cover problem and that it is NP-hard to

solve and Quasi-NP-hard even to approximate to a

sublogarithmic factor. However, for moderate-sized train-

ing sets, we can solve it, in practice, or use simple heuristics

to get good solutions.

Given a set of alignment seeds Q, we say that they hit a

given alignment a if any member of Q has a hit to the

alignment. Our goal in picking such a set will be to

minimize the false positive rate of the set Q, with the

requirement that we hit all alignments in a training

collection, A.

This optimization goal is the alternative to the goal of Xu

et al. [5]. In that work, we maximized seed sensitivity when

a maximum number of spaced seeds is allowed; given that

all possible seeds had the same false positive rate, this was

equivalent to maximizing sensitivity for a given false

positive rate. The alternative goal of minimizing false

positives while requiring 100 percent sensitivity on the

training set is appropriate for protein alignment, where

we want to achieve extremely high sensitivity, as close to

100 percent as possible.

3.2.1 The Integer Program

Here, we show how to cast this seed selection problem as an

integer program. Recall that a vector seed is described by the vector v of

multipliers and an ungapped alignment seed by the vector v

of multipliers together with the vector b of positional lower bounds.

We will call this vector or vectors the “pattern” of a seed.

We can then view choosing a set of vector or ungapped

alignment seeds as choosing thresholds for each pattern.

More formally, suppose we are given a collection of

alignments $A = \{a_1, \ldots, a_m\}$ and a set of seed patterns

$P = \{p_1, \ldots, p_n\}$. We will choose thresholds $(T^*_1, \ldots, T^*_n)$ for

the patterns of $P$ such that the seed model set $Q^* = \{(p_1, T^*_1), \ldots, (p_n, T^*_n)\}$

hits all alignments in $A$ and the false

positive rate of $Q^*$ is as low as possible. The $T^*_i$ may be $\infty$,

which corresponds to not choosing the pattern $p_i$ at all.

We require that each alignment must be hit, so one of

the thresholds must be low enough to hit it. To verify this,

we compute the best-scoring hit for each seed pattern $p_i$ in

each alignment $a_j$; let the score of this hit be $T_{i,j}$. If we

choose $T^*_i$ so that it is at most $T_{i,j}$, then the seed $(p_i, T^*_i)$ will

hit alignment $a_j$.

To model this as an integer program, we have a collection

of integer variables $x_{i,T}$, one for each possible threshold value $T$ for

seed pattern $p_i$. We note that we are requiring that the

number of possible thresholds is small or can be granularized reasonably

since each possible threshold will get its own constraint. For

simple seeds from a BLOSUM matrix, the scores at a position

come in a small range of integers, so the possible reasonable

thresholds form a small range; let $T_m$ be the smallest such

threshold. We will set variable $x_{i,T}$ to 1 when the threshold for

seed pattern $p_i$ is at most $T$; for each pattern $p_i$, its threshold

chosen is the smallest $T$ where $x_{i,T} = 1$.

To compute the false positive rate, we let $r_{i,T}$ be the

probability that a random place in the background model

has score exactly $T$ according to the seed model $(p_i, T)$. We

add these up for all of the false hits with score equal to or

greater than the chosen thresholds. Our integer program is

as follows:

\[
\begin{aligned}
\min \; & \sum_{i,T} x_{i,T}\, r_{i,T}, \quad \text{such that} \\
& \sum_{i} x_{i,T_{i,j}} \ge 1 && \text{for all alignments } a_j, \\
& x_{i,T} \ge x_{i,T-1} && \text{for all thresholds } T > T_m, \\
& x_{i,T} \in \{0, 1\} && \text{for all } i \text{ and } T.
\end{aligned}
\]

Our framework is quite general: Given any collection of

alignments and the sensitivity of a collection of seeds to the

alignments, one can use this IP formulation to choose

thresholds to hit all alignments while minimizing false

positives. In particular, one could require that a hit satisfy

multiple seeds simultaneously or use more complicated hit

formulations. Of course, for these harder models, one might

have a more difficult time optimizing the integer program.
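For a very small instance, the optimization can be illustrated by exhaustive search over threshold choices. This is a sketch for intuition only and is not how the experiments reported here were solved. The input layout is an assumption: `best_hit[i][j]` holds the best hit score of pattern i in alignment j, and `fp_rate[i][T]` holds the false positive rate of pattern i at threshold T, that is, the sum of the per-score probabilities for all scores at or above T.

```python
from itertools import product


def choose_thresholds(best_hit, fp_rate):
    """Exhaustively pick one threshold per pattern (None means unused).

    Returns (total_rate, thresholds) minimizing the summed false positive
    rate subject to every alignment being hit, or None if infeasible.
    """
    n_patterns = len(best_hit)
    n_alignments = len(best_hit[0])
    options = [[None] + sorted(fp_rate[i]) for i in range(n_patterns)]
    best = None
    for choice in product(*options):
        # Alignment j is hit by pattern i when its threshold is <= best_hit[i][j].
        covered = all(
            any(t is not None and t <= best_hit[i][j]
                for i, t in enumerate(choice))
            for j in range(n_alignments)
        )
        if not covered:
            continue
        rate = sum(fp_rate[i][t] for i, t in enumerate(choice) if t is not None)
        if best is None or rate < best[0]:
            best = (rate, choice)
    return best
```

The search is exponential in the number of patterns, so it is useful only as a reference against which to check a solver or a heuristic on toy inputs.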

3.2.2 NP-Hardness

We now show that the problem of optimizing the seed set to

minimize the false positive rate while hitting all alignments

is NP-hard and that it is Quasi-NP-hard to approximate to

within a logarithmic factor [19]. (That is, assuming NP does

not have deterministic algorithms running

in $O(n^{O(\log \log n)})$ time, no polynomial-time algorithm exists

with approximation ratio $o(\log n)$.)

Fig. 1. Flowchart contrasting BLASTP’s approach to heuristic sequence alignment to the one proposed here. The only difference is in the initial collection of hits. The smaller collection of hits found with the variations on seeds gives as many hits to true alignments that survive to the third stage as does BLASTP, yet far fewer noise hits must be filtered out.

We show this by giving an approximation-preserving

reduction of the Set-Cover problem to this problem. Since

Set-Cover is Quasi-NP-hard to approximate to within a

logarithmic factor [19], so is our problem.

An instance of Set-Cover is a ground set $S$ and a

collection $\mathcal{T} = \{T_1, \ldots, T_m\}$ of subsets of $S$; the goal is the

smallest cardinality subset of $\mathcal{T}$ whose union is $S$. The

connection to our problem is clear: We will produce one

alignment per ground set member and, for each of the

elements of $\mathcal{T}$, we will have one seed. For simplicity, we will

assume that $S = \{1, \ldots, n\}$. To fill the construction out, we

will assign the vector seed

\[ v_i = ((1, \overbrace{0, \ldots, 0}^{i}, 1), 1) \]

to every ground set element $s_i$. In a model of sequence

where all positions are independent of all others, each of

these seeds has the same false positive rate, so the false

positive rate will be proportional to the number of ground

set members chosen.

Then, for each set $T_j \in \mathcal{T}$, we create an alignment $A_j$ of

length $2n^2 + 4n$ by pasting together $n$ blocks of length

$2n + 4$. If $i$ is in $T_j$, then we make the $i$th block of the

alignment have the first and $(i+2)$nd positions be of score 1,

while all other positions in the block have score zero, while

if $i \notin T_j$, then the $i$th block is all score zero. Then, it is clear

that if we choose the seed $v_i$, we will hit all alignments $A_j$,

where $i \in T_j$. If we desire the minimum false positive rate to

hit all alignments, this is exactly equivalent to choosing the

minimum cardinality set to cover all of the $T_j$.

Thus, we have presented an approximation-preserving

transformation from Set-Cover to our problem and it is both

NP-hard and Quasi-NP-hard to approximate to within a

logarithmic factor.

3.2.3 Expansions of the Framework

In our experiments, we use the vector seed requirement as a

threshold; one could use a more complicated threshold

scheme to focus on hits that would be expanded to full

alignments. That is, $T_{i,j}$ could

be the score of the highest-scoring hit of seed pattern $p_i$ in alignment $a_j$

that is expanded to a full alignment. We could also have a more

complicated way of seeding alignments and, still, as long as

we could compute false positive rates, we could require that

all alignments are hit and minimize false positive rates.

Also, we can limit the total number of vector seeds used

in the true solution (in other words, limit the number of

vectors with finite threshold). We do this by putting an

upper bound on $\sum_i x_{i,T}$ for the maximum threshold $T$. In

practice, one might want an upper bound of four or eight

seeds, as each chosen seed requires a method to identify hits

and one might not want to have to use too many such

methods in the goal of keeping fewer indexes of a protein

sequence database, for example.

Further, we might want to not allow seeds to be chosen

with very high threshold. The optimal solution to the

problem will have the thresholds on the seeds as high as

possible while still hitting each alignment. This allows

overtraining: Since even a tiny increase in the thresholds

would have caused a missed alignment, we may easily

expect that, in another set of alignments, there may be

alignments just barely missed by the chosen thresholds.

This is particularly possible if thresholds are allowed to get

extremely high and only useful for a single alignment. This

overtraining happened in some of our experiments, so we

lowered the maximum allowed threshold so that thresholds either fell in a

fairly narrow range (+13 to +25) or were set to $\infty$ when a seed

was not used. As another way of addressing overtraining,

we considered lowering the thresholds obtained from the IP

uniformly or just lowering the thresholds that have been set

to high values.

And, finally, the framework can be extended to allow a

specific number of alignments to be missed. For each

alignment, rather than requiring that

\[ \sum_{i} x_{i,T_{i,j}} \ge 1, \]

which requires that some threshold be chosen so that the

alignment is hit, we can add a 0/1 slack variable $s_j$ to count

how many are missed, changing the constraint to

\[ \sum_{i} x_{i,T_{i,j}} + s_j \ge 1. \]

Then, if we require that

\[ \sum_{j} s_j \le M, \]

this allows at most $M$ alignments to be so missed. This may

be appropriate to allow the optimization framework to be

less sensitive to a small number of outliers. We show

experiments with this slightly expanded framework in the

next section.

We note one simplification of our formulation: False hit

rates are not additive. Given two spaced seeds, a hit to one

may coincide with a hit to the other, so the background rate

of false positives is lower than estimated by the program.

When we give such background rates later, we will

distinguish those found by the IP from the true values.
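The gap between the summed rate and the true rate can be seen exactly in a toy setting. The sketch below assumes 0/1 spaced patterns anchored at the same start position and a background in which each position matches independently with probability `p_match`; the function name is illustrative.

```python
from itertools import product


def hit_rates(patterns, p_match):
    """Exact Pr[each pattern hits] and Pr[any pattern hits] at one start position."""
    span = max(len(p) for p in patterns)
    individual = [0.0] * len(patterns)
    union = 0.0
    # Enumerate every match/mismatch configuration of the covered positions.
    for matches in product([False, True], repeat=span):
        prob = 1.0
        for m in matches:
            prob *= p_match if m else 1.0 - p_match
        hit = [all(matches[t] for t, c in enumerate(p) if c == "1")
               for p in patterns]
        for idx, h in enumerate(hit):
            if h:
                individual[idx] += prob
        if any(hit):
            union += prob
    return individual, union
```

For the patterns `1101` and `1011` with `p_match = 0.25`, the two individual rates sum to 2/64 = 0.03125, while the probability that at least one of them hits is 0.03125 - 1/256 = 0.02734375, because a window in which all four positions match is counted twice by the sum.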

3.2.4 Solving the IP and Heuristics

To solve this integer program or its variations is not

necessarily straightforward since the problem is NP-hard.

In our experiments, we used sets of approximately 400 alignments

and the IP could be solved directly and quickly using

the CPLEX 9.0 integer programming solver.

Straightforward heuristics also work well for the

problem, such as solving the LP relaxation and rounding

to 1 all variables with values close to 1, until all alignments

are hit, or setting all variables with fractional LP solutions to

1 and then raising thresholds on seeds until we start to miss

alignments.

We finally note that a simple greedy heuristic also works well

for the problem: Start with low thresholds for all

seed patterns and repeatedly increase the threshold whose

increase most reduces the false positive rate until no such

increase can be made without missing an alignment. This

simple heuristic performed essentially comparably to the

integer program in our experiments, but, since the IP solved

quickly, we present its results.
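The greedy heuristic can be sketched as follows. The input layout is an assumption: `best_hit[i][j]` is the best hit score of pattern i in alignment j, and `fp_rate[i][T]` is the false positive rate of pattern i at candidate threshold T. The sketch also assumes that the lowest thresholds together hit every alignment, and it treats "unused" (`None`) as the highest step a threshold can be raised to.

```python
def greedy_thresholds(best_hit, fp_rate):
    """Raise thresholds one step at a time, always taking the largest rate drop."""
    n_patterns = len(best_hit)
    n_alignments = len(best_hit[0])
    # Candidate thresholds per pattern in increasing order; None = pattern unused.
    ladders = [sorted(fp_rate[i]) + [None] for i in range(n_patterns)]
    level = [0] * n_patterns  # every pattern starts at its lowest threshold

    def rate(i, lvl):
        t = ladders[i][lvl]
        return 0.0 if t is None else fp_rate[i][t]

    def covers_all(levels):
        for j in range(n_alignments):
            if not any(ladders[i][lvl] is not None
                       and ladders[i][lvl] <= best_hit[i][j]
                       for i, lvl in enumerate(levels)):
                return False
        return True

    while True:
        best_gain, best_i = 0.0, None
        for i in range(n_patterns):
            if level[i] + 1 >= len(ladders[i]):
                continue  # pattern already unused
            trial = level[:i] + [level[i] + 1] + level[i + 1:]
            gain = rate(i, level[i]) - rate(i, level[i] + 1)
            if gain > best_gain and covers_all(trial):
                best_gain, best_i = gain, i
        if best_i is None:
            break  # no increase is possible without missing an alignment
        level[best_i] += 1

    thresholds = tuple(ladders[i][lvl] for i, lvl in enumerate(level))
    total = sum(rate(i, lvl) for i, lvl in enumerate(level))
    return total, thresholds
```

Unlike an exact solver, this procedure can stop at a locally optimal set of thresholds, so its result should be read as an upper bound on the best achievable false positive rate.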

One other advantage to the IP formulation is that the

false-positive rate from the LP relaxation is a lower bound

on what can possibly be achieved; the simple greedy

heuristic offers no such lower bound.

4 EXPERIMENTAL RESULTS

Here, we present the results of experiments with our

multiple seed selection framework in the context of protein

alignments. Our goal is to identify collections of seed

models which together have extremely high sensitivity to

even moderately strong alignments, while admitting a very

low false positive rate.

Since we pick seeds with a relatively small number of

alignments, we run the serious risk of overtraining. In

particular, requiring that our set of seeds has

100 percent sensitivity on the training data does not guarantee

that it also has comparable sensitivity overall. In one

example, the particular choice of training examples was

apparently quite unrepresentative since a 100 percent

sensitivity to this set of alignments still gave only 96 percent

sensitivity on a testing set. (Or, presumably, the testing set

may be unrepresentative.) As a simple way of exploring this,

we examined what happened when we lowered the thresh-

old on some seeds that were chosen by the integer program

to modestly increase their false positive rates and sensitivity

in the hope of still keeping very high sensitivity.

We first present simple experiments with vector seeds

and with ungapped alignment seeds on a small sample of

alignments discovered with BLASTP; in this section, we

also allow for seed sets that miss a small number of the

training alignments.

Then, we explore how well these seed sets do in hitting

alignments that we did not use BLASTP to identify. Here,

we note that our vector seed sets do not appear to do as well

as BLASTP for sensitivity to alignments in general, but they

do hit more alignments with high-scoring short regions;

presumably, these alignments are more likely true.

4.1 Preliminary Experiments

We begin by exploring several sets of alignments generated

using BLASTP. Our target score range for our alignments is

BLASTP score between +40 and +60 (BLOSUM score +112

to +168). These moderate-scoring alignments can happen by

chance, but also are often true. Alignments below this

threshold are much more likely to be errors, while, in a

database of proteins we used, such alignments are likely to

happen to a random sequence by chance only one time in

10,000, according to BLASTP’s statistics.

We begin by identifying a set of BLASTP alignments in

this score range. To avoid overrepresenting certain families

of alignments in our test set, we did an all-versus-all

comparison of 8,654 human proteins from the SWISS-PROT

database [20]. (We note that this is the same set of proteins

and alignments we used in our previous vector seed work

[3]. We have used this test set in part to confirm our belief

that, while a single seed may not help much, in comparison

to BLASTP, many seeds will be of assistance.) We then

divided the proteins into families so that all alignments

with BLASTP score greater than 100 are between two

sequences in the same family and there are as many families

as possible. We then chose 10 sets of alignments in our

target score range such that, in each set of alignments, a

particular family will only contribute at most eight

alignments to that set. Note that, since our threshold for

sharing family membership is a BLASTP score greater than

100 and the alignments we are seeking score between +40

and +60, many chosen alignments will be between members

of different families. We divided the sets of alignments into

five training sets and five testing sets. It is possible that the

same alignments will occur in a training and testing set as

we did not take any efforts to avoid this, though the set of

possible alignments is large enough to make this a rare

occurrence.
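The family construction described above amounts to taking connected components of the graph whose edges are alignments scoring above 100, since that yields as many families as possible while keeping every such alignment inside one family. A minimal union-find sketch follows; the function name and the `(protein_a, protein_b, score)` input layout are assumptions for illustration.

```python
def build_families(proteins, alignments, cutoff=100):
    """Group proteins so that every alignment scoring above cutoff is within a family.

    alignments is an iterable of (protein_a, protein_b, blastp_score) triples.
    """
    parent = {p: p for p in proteins}

    def find(x):
        # Follow parent links to the representative, compressing the path.
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    for a, b, score in alignments:
        if score > cutoff:
            parent[find(a)] = find(b)

    families = {}
    for p in proteins:
        families.setdefault(find(p), set()).add(p)
    return list(families.values())
```

Alignments in the target range of +40 to +60 never merge families under this rule, which is why many of the chosen alignments end up joining members of different families.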

We note that we are using this somewhat complicated

system specifically because we want to avoid imposing a

preexisting bias on the set of alignments: Many true yet

moderate-scoring alignments will be between proteins with

different function or from different biological families. For the

same reason, we have used alignments from dynamic

programming as our standard, rather than structural align-

ments of known proteins or curated alignments because our

goal is to improve the quality of heuristic alignments.

Certainly, many of the alignments we consider will not be

precise; still, a heuristic dynamic programming-based align-

ment that finds a hit between two proteins and then uses the

same scoring matrix as BLASTP will find the exact same,

potentially inaccurate, alignment as did BLASTP.

4.1.1 Multiple Vector Seeds

We then considered the set of all 35 vector patterns of length

at most 7 that include three or four 1s (the support of the

seed). We used this collection of vector patterns as we have

seen no evidence that nonbinary seed vectors are preferable

to binary ones for proteins and because it is more difficult to

find hits to seeds with higher support than four due to the

high number of needed hash table keys.
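The pattern set can be enumerated directly. The sketch below assumes that a pattern of a given length begins and ends with a 1, so that its length is unambiguous; under that reading there are 15 patterns of support three and 20 of support four, which reproduces the count of 35. The function name is illustrative.

```python
from itertools import combinations


def seed_patterns(max_length=7, supports=(3, 4)):
    """All 0/1 patterns of length <= max_length with the given numbers of 1s.

    The first and last positions are always 1 (an assumption of this sketch).
    """
    patterns = []
    for length in range(min(supports), max_length + 1):
        for support in supports:
            if support > length:
                continue
            # Choose where the remaining 1s go among the interior positions.
            for inner in combinations(range(1, length - 1), support - 2):
                ones = {0, length - 1, *inner}
                patterns.append(tuple(1 if t in ones else 0
                                      for t in range(length)))
    return patterns
```

Each returned tuple can be used directly as the multiplier vector v of a vector seed once a threshold has been chosen for it.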

We computed the optimal set of thresholds for these

vector seeds such that every alignment in a training set has

a hit to at least one of the seeds, while minimizing the

background rate of hits to the seeds and only using at most

10 vector patterns. Then, we examined the sensitivity of the

chosen seeds for a training set to its corresponding test set.

The results are found in Table 1. Some seed sets chosen

showed signs of overtraining, but others were quite

successful, where the chosen seeds work well for their

testing set as well and have low false positive rate.

We took the best seed set with near 100 percent

sensitivity for both its training and testing data, which

was the third of our experimental sets and used it in further

experiments. This seed set is shown in Table 2. We note that

this seed set has a five times lower false positive rate

(1/8,000) than does BLASTP, while still hitting all of its

testing alignments but four (which is not statistically

significantly different from zero). We also considered a set of thresholds

where we lowered the higher thresholds slightly to

allow more hits and possibly avoid overtraining on the

initial set of alignments. These altered thresholds are shown

as well in Table 2 and give a total false positive rate of

1/6,900. (This set of thresholds also hits all 402 test

alignments for that instance.)

4.1.2 A Weaker Requirement on the Sensitivity

As noted previously, we can alter our integer program so

that it does not require 100 percent sensitivity on the

training data set. We performed experiments on this

formulation, using five subsets of the training alignments

chosen as before, where we allowed between zero and five

alignments from the training set to be missed by the seed

set. We show results in Table 3, using again a randomly

chosen testing set for each training set. The training data

sets varied in size from 304 to 415, while the testing sets

ranged from 392 to 407 in size.

Unsurprisingly, if we did not hit all alignments in the

training set, we often miss alignments in the testing set as

well. However, the ranges of the sensitivities we saw in

testing data for the seed sets picked allowing some misses

in the training data were much less wide, suggesting that

there may be fewer seed thresholds lowered merely to

accommodate a single outlier in the training data. As such,

if slightly lower sensitivity is acceptable, this approach may

give much more predictable results than training to require

all alignments to be hit.

4.1.3 Multiple Ungapped Alignment Seeds

Ungapped alignment seeds can be seen as breaking the

model we have for alignment speed. The most straightforward

implementation of ungapped alignment seeds would

involve a hash table keyed on the letters corresponding to

the positions in the bounds vector b, where there is a

nontrivial lower bound on the score of a position. Even

after the first step, where we identified pairs of positions

satisfying the minimum bounds scores, we still need

another test to verify that a pair of positions satisfies the

requirement of the dot product of the local alignment score

with the vector v of positional multipliers being higher than

the threshold. Similar limitations affect any such two-phase

seed, such as requiring that two hypothetically aligned

positions satisfy two vector seeds at once.
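The two-phase test described above can be sketched as follows. This is a minimal illustration, not the paper's implementation: `scores` is assumed to hold the pairwise substitution scores of the aligned positions under the seed, a bound of negative infinity means "no restriction", and the threshold comparison is written as `>=`, which is an assumption (the text says "higher than").

```python
NEG_INF = float("-inf")

def ungapped_seed_hit(scores, bounds, multipliers, threshold):
    """Return True if the aligned positions satisfy an ungapped alignment seed.

    scores:      pairwise scores of the aligned positions (assumed input)
    bounds:      vector b of per-position lower bounds
    multipliers: vector v of positional multipliers
    """
    if not (len(scores) == len(bounds) == len(multipliers)):
        raise ValueError("scores, bounds and multipliers must have equal length")
    # Phase 1: every position must meet its lower bound.
    if any(s < b for s, b in zip(scores, bounds)):
        return False
    # Phase 2: the dot product with the multipliers must reach the threshold.
    return sum(m * s for m, s in zip(multipliers, scores)) >= threshold

bounds = (0, NEG_INF, 0, 0)
ones = (1, 1, 1, 1)
print(ungapped_seed_hit([4, -1, 5, 6], bounds, ones, 13))   # True
print(ungapped_seed_hit([4, -1, -2, 12], bounds, ones, 13))  # False
```

The second call fails in phase 1 even though its total score is 13, which is the situation where a hash-table lookup alone cannot decide the hit.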

If we assume, however, that testing a hit to the simple

hash table, to verify whether the dot product of the local alignment

score with the vector of multipliers v is greater than

the threshold T, is so rapid that we can throw out misses

without having to count them, then we return to the case

from before, where we need count only the fraction of

positions expected to pass both levels of filtration. This

assumption may be appropriate, assuming that the small

amount of time taken to throw out a hash-table hit that does

not satisfy the dot product threshold is much, much smaller

than the amount of time needed to throw out a hit to the

whole ungapped alignment seed that still does not make a

good local alignment.


TABLE 3. Weakening Sensitivity to Testing Alignment Reduces Sensitivity on Training Alignments

TABLE 2. Seeds and Thresholds Chosen by Integer Programming for 409 Test Alignments

TABLE 1. Hit Rates for Optimal Seed Sets for Various Sets of Training Alignments when Applied to an Unrelated Test Set

With this in mind, we tested our set of moderate

alignments on a simple collection of ungapped alignment

seed patterns to identify whether ungapped alignment seeds

form a potentially superior seed filtering approach to vector

seeds. Of course, since they include vector seeds as a special

case, this is trivial, but our interest is primarily whether the

advantage of ungapped alignments is large enough to merit

their consideration over that of vector seeds.

In our experiments, we used ungapped alignment seeds

where the vector of score lower bounds consisted of only

the values 0 and $-\infty$ (which results in no score restriction);

we also allowed the vector of pairwise multipliers to only

be the all-ones vector. This simple approach, which was

used independently in the multiple aligner of Brown and

Hudek [15] and in the tPatternHunter protein aligner [7],

simply requires a good local region, with certain specified

positions having positive score. We required that the

bounds vector have at most four active positions and

considered seed lengths between three and six. Note that, in

this model, the bounds vector $(0, 0, 0, -\infty)$ behaves quite

differently than the bounds vector $(0, 0, 0)$ because we will

be adding pairwise scores of four positions in the former

case and three in the latter.

The results of our experiment are shown in Table 4. We

used the same testing and training data sets as for Table 3.

In general, these results are slightly worse than the results

of our original experiments with vector seeds when we

require 100 percent sensitivity to testing data, but improve

when we allow some misses in the training data. False

positive rates on the order of 1/10,000 are typical

with testing sensitivity of approximately 99 percent, as

before; again, the corresponding false positive rate for

BLASTP’s seed is approximately 1/1,600.

A positive note to the ungapped alignment seeds is that

there seems to be less overtraining: As the training

sensitivity is allowed to go down slightly, the testing

sensitivity does not plummet as quickly as for vector seeds.

One reason for this is that an ungapped alignment seed,

both times they have been implemented [7], [15], still

requires high-scoring short local alignment around the

seed. As we show in the next section, focusing on very

narrow alignments in seeding may be inappropriate and

one should instead focus on longer windows around a hit

before discarding it with a filter.

4.2 A Broader Set of Alignments

Returning to our set of vector seeds from Table 2, we then

considered a larger set of alignments in our target range of

good, but not great scores to verify if the advantage of

multiple seeds still holds. We used the Smith-Waterman

algorithm to compute all alignments between pairs of a

1,000-sequence subset of our protein data set and computed

how many of them were not found by BLASTP. Only 970

out of 2,950 Smith-Waterman alignments with BLOSUM62

score between +112 and +168 had been identified by

BLASTP, even though alignments in this score range would

have happened by chance only one time in 10,000 according

to BLASTP’s statistics.

Almost all of these 2,950 alignments, 2,942, had a hit to

the BLASTP default seed. Despite this, however, only 970

actually built a successful BLASTP alignment. Our set of

eight seeds had hits to 1,939 of the 1,980 that did not build a

BLASTP alignment and to 955 of the 970 that did build a

BLASTP alignment, so, at first glance, the situation does not

look good. However, the difference between having a hit

and having a hit in a good region of the alignment is where

we are able to show substantial improvement.
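The counts quoted in the last two paragraphs can be tallied as follows. This is plain arithmetic on the numbers given in the text; the variable names are illustrative.

```python
total = 2950             # Smith-Waterman alignments in the score range
built = 970              # alignments for which BLASTP built an alignment
not_built = total - built
blastp_seed_hits = 2942  # alignments with a hit to the BLASTP default seed
multi_seed_hits = 1939 + 955  # hits of the eight seeds in the two groups

def pct(part, whole):
    """Percentage rounded to one decimal place."""
    return round(100 * part / whole, 1)

print(not_built)                     # 1980
print(pct(blastp_seed_hits, total))  # 99.7
print(pct(multi_seed_hits, total))   # 98.1
print(pct(built, total))             # 32.9
```

So the eight seeds hit about 98.1 percent of the alignments against about 99.7 percent for the BLASTP seed, while BLASTP itself reports only about a third of them.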

The discrepancy between hits and alignments comes

because the BLASTP seed can have a hit in a bad part of the

alignment, which is filtered out. Typically, such hits occur

in a region where the source of positive score is quite short,

which is much more likely with an unspaced seed than with

a spaced seed. We looked at all of the regions of length

10 amino acids of alignments that included a hit to a seed

(either the BLASTP seed or one of the multiple seeds), and

assigned the best score of such a region to that alignment; if

no ungapped region of length 10 surrounded a hit, we

assumed it would certainly be filtered out. The data,

shown in Table 5, show that the alignments hit by the

spaced seeds are hit in regions that are essentially

identical in conservation to where the BLASTP seed hits

them. For example, 47.7 percent of the alignments contain a

10-amino acid region around a hit to the ((1, 1, 1), 13) seed

with BLOSUM score at least +30, while 46.7 percent contain

such a region surrounding a hit to one of the multiple seeds

with higher threshold. If we use the lower thresholds that

allow slightly more false positives, their performance is

actually slightly better than BLASTP’s.
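The windowing step can be sketched as below. This is an illustration under assumptions: `scores` is taken to be the per-column scores of one ungapped stretch of an alignment, a hit is given by its start column and seed length, and the function name is hypothetical.

```python
def best_window_score(scores, hit_start, seed_len, window=10):
    """Best total score of a length-`window` region that contains the hit.

    Returns None when no full window surrounds the hit, in which case the
    hit is assumed to be filtered out.
    """
    n = len(scores)
    first = max(0, hit_start + seed_len - window)  # leftmost window start
    last = min(hit_start, n - window)              # rightmost window start
    if last < first:
        return None
    return max(sum(scores[w:w + window]) for w in range(first, last + 1))

scores = [0, 0, 5, 5, 5, 0, 0, 0, 0, 0, 0, 9]
print(best_window_score(scores, hit_start=2, seed_len=3))   # 24
print(best_window_score([5, 5, 5], hit_start=0, seed_len=3))  # None
```

An alignment would then be assigned the maximum of this value over all of its seed hits.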

Table 5 also shows that the higher-threshold seed ((1, 1, 1),

15), which has a worse false positive rate (1/5,700) than our

ensembles of seeds, performs substantially worse: Namely,

only 64 percent of the alignments have a hit to the single seed

found in a region with local score above +25, while 73 percent

of the alignments have a hit to one of the multiple seeds with

this property. This single seed strategy is clearly worse than

the multiple seed strategy of comparable false positive rate

and the optimized seeds perform comparably to BLASTP in


TABLE 4. Ungapped Alignment Seeds Offer Similar Performance to Vector Seeds

identifying the alignments that actually have a core conserved region.

Our experiments show that multiple seed models can have

an impact on local alignment of protein sequences. Using

many spaced seeds, which we picked by optimizing an

integer program, we find seed models with a chance of

finding a good hit in a moderate-scoring alignment comparable

to that of the BLASTP seed, with four to five times fewer

noise hits. The difficulty with the BLASTP seed is that it not

only has more junk hits and more hits in overlapping places, it

also has more hits in short regions of true alignments, which

are likely to be filtered and thrown out.

5 CONCLUSIONS

We have given a theoretical framework to the problem of

using spaced seeds for protein homology search detection.

Our result shows that using multiple vector or ungapped

alignment seeds can give sensitivity to good parts of local

protein alignments essentially comparable to BLASTP,

while reducing the false positive rate of the search

algorithm by a factor of four to five.

Our set of vector seeds is chosen by optimizing an

integer programming framework for choosing multiple

seeds when we want 100 percent sensitivity to a collection

of training alignments. The framework is general enough to

accommodate many extensions, such as requiring a fixed

amount of sensitivity on the training (not only 100 percent),

allowing only a small number of seeds to be chosen or

allowing for many different sorts of seeding strategies. We

have mostly used it to optimize sets of vector seeds because

they encapsulate an approach to homology search for

nucleotides that has been very successful.

One difficulty with our approach is that it relies on a

theoretical estimate of the runtime of a homology search

program: namely, that the program will take time proportional

to the number of false positives found by the seeding

method. As seeding methods become more complex, such

as the two-step ungapped alignment seeds, it may become

harder to identify what a “false positive” is. In particular, if

a false positive fits through one step of a filter, but is quickly

discarded before the next step, should it count toward the

estimated runtime? Using our framework, we identified a

set of seeds for moderate-scoring protein alignments whose

total false positive rate in random sequence is four-to-five

times lower than the default BLASTP seed. This set of seeds

had hits to slightly fewer alignments in a test set of

moderate-scoring alignments found by the Smith-Waterman

algorithm than did the BLASTP seed; however, the

BLASTP seeds hit subregions of these alignments that were

actually slightly worse than those hit by the spaced seeds. Hence,

given the filtering used by BLASTP, we expect that the two

alignment strategies would give comparable sensitivity,

while the spaced seeds give four times fewer false hits.

ACKNOWLEDGMENTS

The author would like to thank Ming Li for introducing him

to the idea of spaced seeds. This work is supported by the

Natural Science and Engineering Research Council of

Canada and by the Human Frontier Science Program. A

preliminary version of this paper [21] appeared at the

Workshop on Algorithms in Bioinformatics, held in Bergen,

Norway, in September, 2004.

REFERENCES

[1] S.F. Altschul, W. Gish, W. Miller, E.W. Myers, and D.J. Lipman, “Basic Local Alignment Search Tool,” J. Molecular Biology, vol. 215, no. 3, pp. 403-410, 1990.

[2] B. Ma, J. Tromp, and M. Li, “PatternHunter: Faster and More Sensitive Homology Search,” Bioinformatics, vol. 18, no. 3, pp. 440-445, Mar. 2002.

[3] B. Brejova, D. Brown, and T. Vinar, “Vector Seeds: An Extension to Spaced Seeds Allows Substantial Improvements in Sensitivity and Specificity,” Proc. Third Ann. Workshop Algorithms in Bioinformatics, pp. 39-54, 2003.

[4] M. Li, B. Ma, D. Kisman, and J. Tromp, “Patternhunter II: Highly Sensitive and Fast Homology Search,” J. Bioinformatics and Computational Biology, vol. 2, no. 3, pp. 419-439, 2004.

[5] J. Xu, D. Brown, M. Li, and B. Ma, “Optimizing Multiple Spaced Seeds for Homology Search,” Proc. 15th Ann. Symp. Combinatorial Pattern Matching, pp. 47-58, 2004.

[6] Y. Sun and J. Buhler, “Designing Multiple Simultaneous Seeds for DNA Similarity Search,” Proc. Eighth Ann. Int’l Conf. Computational Biology, pp. 76-84, 2004.

[7] D. Kisman, M. Li, B. Ma, and L. Wang, “TPatternHunter: Gapped, Fast and Sensitive Translated Homology Search,” Bioinformatics, 2004.


TABLE 5. Hits in Locally Good Regions of Alignments

[8] T. Smith and M. Waterman, “Identification of Common Molecular Subsequences,” J. Molecular Biology, vol. 147, pp. 195-197, 1981.

[9] B. Brejova, D. Brown, and T. Vinar, “Vector Seeds: An Extension to Spaced Seeds,” J. Computer and System Sciences, 2005, pending publication.

[10] J. Buhler, U. Keich, and Y. Sun, “Designing Seeds for Similarity Search in Genomic DNA,” Proc. Seventh Ann. Int’l Conf. Computational Biology, pp. 67-75, 2003.

[11] B. Brejova, D. Brown, and T. Vinar, “Optimal Spaced Seeds for Homologous Coding Regions,” J. Bioinformatics and Computational Biology, vol. 1, pp. 595-610, Jan. 2004.

[12] U. Keich, M. Li, B. Ma, and J. Tromp, “On Spaced Seeds for Similarity Search,” Discrete Applied Math., vol. 138, pp. 253-263, 2004.

[13] K.P. Choi, F. Zeng, and L. Zhang, “Good Spaced Seeds for Homology Search,” Bioinformatics, vol. 20, no. 7, pp. 1053-1059, 2004.

[14] G. Kucherov, L. Noe, and Y. Ponty, “Estimating Seed Sensitivity on Homogeneous Alignments,” Proc. Fourth IEEE Int’l Symp. BioInformatics and BioEng., pp. 387-394, 2004.

[15] D. Brown and A. Hudek, “New Algorithms for Multiple DNA Sequence Alignment,” Proc. Fourth Ann. Workshop Algorithms in Bioinformatics, pp. 314-326, 2004.

[16] M. Csuros, “Performing Local Similarity Searches with Variable Length Seeds,” Proc. 15th Ann. Symp. Combinatorial Pattern Matching, pp. 373-387, 2004.

[17] K. Choi and L. Zhang, “Sensitive Analysis and Efficient Method for Identifying Optimal Spaced Seeds,” J. Computer and System Sciences, vol. 68, pp. 22-40, 2004.

[18] G. Kucherov, L. Noe, and Y. Ponty, “Multiseed Lossless Filtration,” Proc. 15th Ann. Symp. Combinatorial Pattern Matching, pp. 297-310, 2004.

[19] U. Feige, “A Threshold of ln n for Approximating Set Cover,” J. ACM, vol. 45, pp. 634-652, 1998.

[20] A. Bairoch and R. Apweiler, “The SWISS-PROT Protein Sequence Database and Its Supplement TrEMBL in 2000,” Nucleic Acids Research, vol. 28, no. 1, pp. 45-48, 2000.

[21] D. Brown, “Multiple Vector Seeds for Protein Alignment,” Proc. Fourth Ann. Workshop Algorithms in Bioinformatics, pp. 170-181, 2004.

Daniel G. Brown received the undergraduate degree in mathematics with computer science from the Massachusetts Institute of Technology in 1995 and the PhD degree in computer science from Cornell University in 2000. He then spent a year as a research scientist at the Whitehead Institute/MIT Center for Genome Research in Cambridge, Massachusetts, working on the Human and Mouse Genome Projects. Since 2001, he has been an assistant professor in the School of Computer Science at the University of Waterloo.



IEEE/ACM TRANSACTIONS ON COMPUTATIONAL BIOLOGY AND BIOINFORMATICS, VOL. 2, NO. 1, JANUARY-MARCH 2005 39

1545-5963/05/$20.00 © 2005 IEEE Published by the IEEE CS, CI, and EMB Societies & the ACM


It is a pleasure to write this editorial at the beginning of the second year of the publication of the IEEE/ACM Transactions on Computational Biology and Bioinformatics (TCBB). The last year saw the publication of four issues of TCBB, the first of which was mailed out roughly nine months after our initial call for submissions. That accomplishment was the result of tremendous cooperation and hard work on the part of authors, reviewers, associate editors, and staff. I would like to thank everyone for making that possible.

During the past year, we received roughly 205 submissions and, presently, we have about 50 of those under review. In our first year, we published 16 papers, including Part I of a special section on The Best Papers from WABI (Workshop on Algorithms in Bioinformatics). Part II will appear this year, along with a special issue on Machine Learning in Computational Biology and Bioinformatics. Other special issues are also in the planning stages. The papers that we have published are establishing TCBB as a venue for the highest quality research in a broad range of topics in computational biology and bioinformatics. I know that some of the papers we have already published will be cited as the foundational or the definitive papers in several subareas of the field.

A goal for the future is to attract more submissions from the biology community and this will be facilitated when TCBB is indexed in MEDLINE, which requires two years of publication before it will consider indexing a journal. So, this second year of publication will hopefully lead to the inclusion of TCBB in MEDLINE.

Finally, I would like to share some wonderful news we received in February. The Association of American Publishers, Professional and Scholarly Publishing Division awarded TCBB their “Honorable Mention” award for The Best New Journal in any category for the year 2004. Only one Honorable Mention is awarded. Again, the credit for this accomplishment goes to all the authors, reviewers, associate editors, and staff who have worked so hard to establish TCBB in this last year. I look forward to continued growth and success of TCBB in our second year of publication.

Dan Gusfield

Editor-in-Chief

Editorial: State of the Transaction

Dan Gusfield

Bases of Motifs for Generating Repeated Patterns with Wild Cards

Nadia Pisanti, Maxime Crochemore, Roberto Grossi, and Marie-France Sagot

Abstract—Motif inference represents one of the most important areas of research in computational biology, and one of its oldest ones.

Despite this, the problem remains very much open in the sense that no existing definition is fully satisfying, either in formal terms, or in

relation to the biological questions that involve finding such motifs. Two main types of motifs have been considered in the literature:

matrices (of letter frequency per position in the motif) and patterns. There is no conclusive evidence in favor of either, and recent work

has attempted to integrate the two types into a single model. In this paper, we address the formal issue in relation to motifs as patterns.

This is essential to get at a better understanding of motifs in general. In particular, we consider a promising idea that was recently

proposed, which attempted to avoid the combinatorial explosion in the number of motifs by means of a generator set for the motifs.

Instead of exhibiting a complete list of motifs satisfying some input constraints, what is produced is a basis of such motifs from which all

the other ones can be generated. We study the computational cost of determining such a basis of repeated motifs with wild cards in a

sequence. We give new upper and lower bounds on such a cost, introducing a notion of basis that is provably contained in (and, thus,

smaller than) previously defined ones. Our basis can be computed in less time and space, and is still able to generate the same set of

motifs. We also prove that the number of motifs in all bases defined so far grows exponentially with the quorum, that is, with the

minimal number of times a motif must appear in a sequence, something unnoticed in previous work. We show that there is no hope to

efficiently compute such bases unless the quorum is fixed.

Index Terms—Motifs basis, repeated motifs.

1 INTRODUCTION

IDENTIFYING motifs in biological sequences is one of the oldest fields in computational biology. Yet, it remains also very much an open problem in the sense that no currently existing definition of a “motif” is fully satisfying for the purposes of accurately and sensitively identifying the biological features that such motifs are supposed to represent. Among the most difficult to model are binding sites, as they are often quite degenerate. Indeed, variability may be considered part of their function. Such variability translates itself into changes in the motif, mostly substitutions, that do not affect the biological function. Two main schools of thought on how to define motifs in biology have coexisted for years, each valid in its own way. The first works with a statistical representation of motifs, usually given in the form of what is called in the literature a PSSM (“Position Specific Scoring Matrix” [9], [11], [13], [12] or a profile which is one type of PSSM). Interesting PSSMs are those that have a high information value (measured, for instance, by the relative entropy of the corresponding matrix). The second school defines a motif as a consensus [4], [24]. A motif is therefore a pattern that appears repeatedly, in general, approximately, that is, up to a certain number of differences (most often substitutions only) in a sequence or set of sequences of interest.

It is generally accepted that PSSMs are more appropriate for modeling an already known (in the sense of well-characterized) biological feature for the purpose of then identifying other occurrences of the feature, even though the false positive rate of this further identification remains very high. Identifying the PSSM itself ab initio is still, however, a difficult problem, particularly for large data sets or when the amount of noise may be high. The methods used are also no-guarantee heuristics, leaving an uncertainty as to whether motifs that are statistically as meaningful as those reported have not been missed.

On the other hand, formulating the problem of identifying approximate motifs as patterns enables one to address the motif identification problem in an exhaustive fashion, even though the algorithmic complexity of the problem remains relatively high, and the model may appear more limited than PSSMs. Because of the lower algorithmic complexity of identifying repeated patterns, the model may, however, be made more complex and biologically pertinent in other ways. One could think of introducing motifs composed of various different submotifs separated by variable-length distances that may then also be found in a relatively efficient way [14]. Motifs presenting such a high level of combinatorial complexity are indeed frequent, particularly in eukaryotes. Exhaustively seeking for approximately repeated patterns may however have the drawback of producing many “solutions,” that is, many motifs. In fact, the number of motifs identified with this model may be so high (e.g., exponential in the size of the input) that it is as impossible to manage as the initial input sequence(s), even though they provide a first way of


• N. Pisanti and R. Grossi are with the Dipartimento di Informatica, Università di Pisa, Italy. E-mail: {pisanti, grossi}@di.unipi.it.

• M. Crochemore is with the Institut Gaspard-Monge, University of Marne-la-Vallée, France and King’s College London. E-mail: [email protected].

• M.-F. Sagot is with INRIA Rhône-Alpes, Laboratoire de Biométrie et Biologie Évolutive, Université Claude Bernard Lyon 1, France and King’s College London. E-mail: [email protected].

Manuscript received 14 Mar. 2004; revised 2 Dec. 2004; accepted 16 Feb. 2005; published online 30 Mar. 2005. For information on obtaining reprints of this article, please send e-mail to: [email protected], and reference IEEECS Log Number TCBB-0036-0304.


structuring such input. Yet, it appeared clear also to any computational biologist working with motifs as patterns that there was further structure to be extracted from the set of motifs found, even when such a set is huge. Furthermore, such a structure could reflect some additional biological information, thus providing additional motivation for inferring it. Doing this is generally addressed by means of clustering, or even by attempting to bring together the two types of motif models (PSSMs and patterns). Indeed, recently researchers have been using pattern detection as a first filter-flavored step toward inferring PSSMs from biological sequences [6]. This seems very promising although much work remains to be done to precisely determine the relation between the two types of models, and to fully explore the biological implications this may have.

Again, each of the two above approaches is valid, but the question remained open whether or not the inner structure of a set of motifs could be expressed in a manner that would be more satisfying from both the mathematical and the biological points of view. Then, in 2000, a paper by Parida et al. [17] seemed to present a way of extracting such an inner structure in a very elegant and powerful way for a particular type of motif. The power of their proposal resided in the fact that the above mentioned structure corresponded to a well-known and precisely defined mathematical object and, moreover, guaranteed that no solution would be lost. Exhaustiveness in relation to the chosen type of motif is also preserved, thus enabling a biologist to draw some conclusions even in the face of negative answers (i.e., when no motifs, or no a priori “expected” motifs are found in a given input), something which PSSM-detecting methods do not allow. The structure is that of a basis of motifs. Informally speaking, it is a subset of all the motifs satisfying some input parameters (related, for instance, to which differences between a pattern and its occurrences are allowed) from which it is possible to recover all the other motifs, in the sense that all motifs not in the basis are a combination of some (in general, a few only) motifs in the basis. Such a combination is modeled by simple rules to systematically generate the other motifs with an output sensitive cost [18]. A basis would therefore also provide a way of characterizing the input, which then might be used to compare different inputs without resorting to the traditional alignment methods with all the pitfalls they present. The idea of a basis would fulfill such expectations if its size could be proven to be small enough. The argument [17] seemed to be that, for the type of motifs considered, a compact enough basis could always be found.

The motifs considered in [17] were patterns with wild card symbols occurring in a given sequence $s$ of $n$ symbols drawn over an alphabet $\Sigma$. A wild card symbol is a special symbol “$\circ$” matching any other element.$^{1}$ For example, the pattern $\mathtt{T}\circ\mathtt{G}$ matches both $\mathtt{TTG}$ and $\mathtt{TGG}$ inside $s = \mathtt{TTGG}$. Parida et al. focused on patterns which appear at least $q$ times in $s$ for an input parameter $q \geq 2$, called the quorum. This may, at first sight, seem an even more restrictive type of motif than patterns in general. It, however, has the merit of capturing one aspect of biological features that current PSSMs in general ignore, or address only in an indirect way. This aspect often concerns isolated positions inside a motif that are not part of the biological feature being captured. This is the case, for instance, with some binding sites, particularly at the protein level. Studying patterns with wild cards has a further very important motivation in biology, even when no differences (such as substitutions) are allowed. Indeed, motifs such as these or closely related ones can be used as seeds for finding long repeats and for aligning, pairwise or multiple-wise, a set of sequences or even whole genomes [15], [23].

The basis introduced by Parida et al. had interesting features, but presented some unsatisfying properties. In particular, as we show in this paper, there is an infinite family of strings for which the authors’ basis contains $\Omega(n^2)$ motifs for $q = 2$. This contradicts the upper bound of $3n$ for any $q \geq 2$ given in [17]. As a result, the algorithm taking $O(n^3 \log n)$ time, mentioned in [17], for finding the basis of motifs does not hold since it relies on the upper bound of $3n$, thus leaving open the problem of efficiently discovering a basis. A refinement of the definition of basis and an incremental construction in $O(n^3)$ time has recently been described by Apostolico and Parida [2]. A comparative survey of several notions of bases can be found in [22].

Closely following previous work, here we introduce a new definition of basis. The condition for the new basis is stronger than that of [17] and, hence, our basis is included in that of [17] (and is thus smaller) while both are able to generate the same set of motifs with mechanical rules. Our basis is moreover symmetric: Given a string $s$, the motifs in the basis for its reverse $\tilde{s}$ are the reversals of the motifs in the basis for $s$. Moreover, the number of motifs in our basis can provably be upper bounded in the worst case by $n - 1$ for $q = 2$, and these motifs occur in $s$ a total of $2n$ times at most. However, we reveal an exponential dependency on $q$ for the number of motifs in all bases defined so far (i.e., including our basis, Parida’s and Pelfrene et al.’s [19]), something unnoticed in previous work. Consequently, no polynomial-time algorithm can exist for finding one of these bases with arbitrary values of $q \geq 2$.

2 NOTATION AND TERMINOLOGY

We consider strings that are finite sequences of letters drawn from an alphabet $\Sigma$, whose elements are also called solid characters. We introduce an additional symbol (denoted by $\circ$ and called wild card) that does not belong to $\Sigma$ and matches any letter; a wild card clearly matches itself. The length of a string $t$, denoted by $|t|$, is the number of letters and wild cards in $t$, and $t[i]$ indicates the letter or wild card at position $i$ in $t$ for $0 \leq i \leq |t| - 1$ (hence, $t = t[0]\,t[1] \cdots t[|t| - 1]$, also noted $t[0..|t| - 1]$).

Definition 1 (pattern). Given the alphabet $\Sigma$, a pattern is a string in $\Sigma \cup \Sigma(\Sigma \cup \{\circ\})^{*}\Sigma$ (that is, it starts and ends with a solid character).
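Definition 1 can be turned into a direct membership test. The sketch below is illustrative only: it writes the wild card as `"."` (an assumption made for plain-text convenience, and assumed not to be a letter of the alphabet), and the name `is_pattern` is hypothetical.

```python
WILD = "."  # stand-in for the wild card symbol; assumed not to be in the alphabet

def is_pattern(t, alphabet):
    """True if t is a single solid character, or starts and ends with a solid
    character and contains only solid characters and wild cards in between."""
    if not t:
        return False
    if any(c != WILD and c not in alphabet for c in t):
        return False
    return t[0] != WILD and t[-1] != WILD

print(is_pattern("T.G", "ACGT"))  # True
print(is_pattern(".TG", "ACGT"))  # False: starts with a wild card
print(is_pattern("T", "ACGT"))    # True: a single solid character
```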

The patterns are related by the following specificity relation $\preceq$.


1. In the literature on sequence analysis and pattern matching, the wild card is often referred to as do not care (as it is in the literature on bases of motifs). Therefore, we will use this latter term when referring to the sequence analysis and string matching literature.

Definition 2 ($\preceq$). For individual characters $\sigma_1, \sigma_2 \in \Sigma \cup \{\circ\}$, we have $\sigma_1 \preceq \sigma_2$ if $\sigma_1 = \circ$ or $\sigma_1 = \sigma_2$. Relation $\preceq$ extends to strings in $(\Sigma \cup \{\circ\})^{*}$ under the convention that each string $t$ is implicitly surrounded by wild cards, namely, letter $t[j]$ is $\circ$ when $j \geq |t|$. Hence, $v$ is more specific than $u$ (written $u \preceq v$) if $u[j] \preceq v[j]$ for any integer $j$.
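The specificity relation of Definition 2 can be sketched in a few lines. This is an illustration, assuming the wild card is written as `"."` and that positions outside a string count as wild cards; the helper names are hypothetical.

```python
WILD = "."  # stand-in for the wild card symbol

def char_at(t, j):
    """Character of t at position j, with wild cards outside the string."""
    return t[j] if 0 <= j < len(t) else WILD

def less_specific(u, v):
    """True if v is at least as specific as u, position by position."""
    # Beyond the end of u every character is a wild card, so only the
    # positions of u need to be checked.
    return all(char_at(u, j) in (WILD, char_at(v, j)) for j in range(len(u)))

print(less_specific("T.G", "TTG"))  # True
print(less_specific("TTG", "T.G"))  # False
print(less_specific("T", "TG"))     # True
print(less_specific("TG", "T"))     # False
```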

We can now formally define the occurrences of patterns $x$ in $s$ and their lists.

Definition 3 (occurrence, $\mathcal{L}$). We say that $u$ occurs at position $\ell$ in $v$ if $u[j] \preceq v[j + \ell]$, for $0 \leq j \leq |u| - 1$ (equivalently, we say that $u$ matches $v[\ell..\ell + |u| - 1]$). For the input string $s \in \Sigma^{*}$ with $n = |s|$, we consider the location list $\mathcal{L}_x \subseteq \{0..n - 1\}$ as the set of all the positions on $s$ at which $x$ occurs.

When a pattern u occurs in another pattern (or into astring) v, we also say that v contains u. For example, thelocation list of x ¼ T � G in s ¼ TTGG is Lx ¼ f0; 1g, hence scontains x.
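To make Definition 3 concrete, the following Python sketch computes location lists by direct scanning. It is only an illustration under our own assumptions: the wild card is encoded as the ASCII character `*` (assumed not to belong to the alphabet), and the function names `occurs_at`, `location_list`, and `is_motif` are hypothetical, not taken from any published implementation.

```python
WILD = "*"  # assumed ASCII stand-in for the wild card symbol


def occurs_at(u, v, pos):
    """Return True if pattern u occurs at position pos in v (Definition 3).

    A character of u matches when it is a wild card or equals the
    corresponding character of v; positions outside v never match a
    solid character.
    """
    if pos < 0 or pos + len(u) > len(v):
        return False
    return all(a == WILD or a == b for a, b in zip(u, v[pos:pos + len(u)]))


def location_list(x, s):
    """Return the sorted list of positions of s at which x occurs."""
    return [pos for pos in range(len(s) - len(x) + 1) if occurs_at(x, s, pos)]


def is_motif(x, s, q=2):
    """A pattern is a motif when it occurs at least q times (quorum q)."""
    return len(location_list(x, s)) >= q


if __name__ == "__main__":
    print(location_list("T*G", "TTGG"))
```

This scan costs $O(|x| \cdot n)$ time per pattern, which is sufficient for checking small examples by hand.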

Definition 4 (motif). Given a parameter $q \ge 2$, called quorum, we say that pattern $x$ is a motif in $s$ when $|L_x| \ge q$.

Given any location list $L_x$ and any integer $d$, we adopt the notation $L_x + d = \{\ell + d \mid \ell \in L_x\}$ for indicating the occurrences in $L_x$ "displaced" by the offset $d$.

Definition 5 (maximality). A motif $x$ is maximal if for any other motif $y$ that contains $x$, we have no integer $d$ such that $L_y = L_x + d$.

In other words, making a maximal motif $x$ more specific (thus obtaining $y$) reduces the number of its occurrences in $s$. Definition 5 is equivalent to that meant in [17] stating that $x$ is maximal if there exist no other motif $y$ and no integer $d \ge 0$ verifying $L_x = L_y + d$, such that $x[j] \preceq y[j + d]$ for $0 \le j \le |x| - 1$ (that is, $x$ occurs in $y$ at position $d$ in our terminology).^2

Definition 6 (irredundant motif). A maximal motif $x$ is irredundant if, for any maximal motifs $y_1, y_2, \ldots, y_k$ such that $L_x = \bigcup_{i=1}^{k} L_{y_i}$, motif $x$ must be one of the $y_i$s. Conversely, if all the $y_i$s are different from $x$, pattern $x$ is said to be covered by motifs $y_1, y_2, \ldots, y_k$.

The basis of irredundant motifs for string $s$ is the set of all irredundant motifs in $s$. The definition is given with respect to the set of maximal motifs of the input string, which is unique; indeed, such a basis is unique and it can be used as a generator for all maximal motifs in $s$ as proved in [17]. The size of the basis is the number of irredundant motifs contained in it. We illustrate the notions given so far by employing the example string $s = \mathtt{FABCXFADCYZEADCEADC}$. For this string and $q = 2$, the location list of motif $x_1 = \mathtt{A}{\circ}\mathtt{C}$ is $L_{x_1} = \{1, 6, 12, 16\}$, and that of motif $x_2 = \mathtt{FA}{\circ}\mathtt{C}$ is $L_{x_2} = \{0, 5\}$. They are both maximal because they lose at least one of their occurrences when extended with solid characters at one side (possibly with wild cards in between), or when their wild cards are replaced by solid characters. However, motif $x_3 = \mathtt{DC}$, having list $L_{x_3} = \{7, 13, 17\}$, is not maximal. It occurs in $x_4 = \mathtt{ADC}$, where $L_{x_4} = \{6, 12, 16\}$, and its occurrences can be obtained from those of $x_4$ by a displacement of $d = 1$ positions. The basis of the irredundant motifs for $s$ is made up of $x_1 = \mathtt{A}{\circ}\mathtt{C}$, $x_2 = \mathtt{FA}{\circ}\mathtt{C}$, $x_4 = \mathtt{ADC}$, and $x_5 = \mathtt{EADC}$. The location list of each of them cannot be obtained from the union of any of the other location lists.

3 IRREDUNDANT MOTIFS: THE BASIS AND ITS SIZE FOR QUORUM $q = 2$

In this section, we show the existence of an infinite family of strings $s_k$ ($k \ge 5$) for which there are $\Omega(n^2)$ irredundant motifs in the basis for quorum $q = 2$, where $n = |s_k|$. In this way, we disprove the claimed upper bound of $3n$ [17] mentioned in Section 1. Each string $s_k$ will be constructed from a shorter string $t_k$, which we now define. For each $k$, $t_k = \mathtt{A}^k \mathtt{T} \mathtt{A}^k$, where $\mathtt{A}^k$ denotes the letter $\mathtt{A}$ repeated $k$ times (our argument works, in general, for $z^k w z^k$, where $z$ and $w$ are strings of equal length not sharing any common character). String $t_k$ contains an exponential number of maximal motifs, including those having the form $\mathtt{A}\{\mathtt{A}, \circ\}^{k-2}\mathtt{A}$ with exactly two wild cards. To see why, each such motif $x$ occurs four times in $t_k$: Specifically, two occurrences of $x$ match the first and the last $k$ letters in $t_k$, while each distinct wild card in $x$ matching the letter $\mathtt{T}$ in $t_k$ contributes to one of the two remaining occurrences. Extending $x$ or replacing a wild card with a solid character reduces the number of these occurrences, so $x$ is maximal. The idea of our proof is to obtain strings $s_k$ by prefixing $t_k$ with $O(|t_k|)$ symbols so that these motifs $x$ become irredundant in $s_k$. Since there are $\Theta(k^2)$ of them, and $n = |s_k| = \Theta(|t_k|) = \Theta(k)$, this leads to the claimed result.

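In order to define the strings $s_k$ on the alphabet $\Sigma = \{\mathtt{A}, \mathtt{T}, \mathtt{u}, \mathtt{v}, \mathtt{w}, \mathtt{x}, \mathtt{y}, \mathtt{z}, a_1, a_2, \ldots, a_{k-2}\}$, we introduce some notation. Let $\widetilde{u}$ denote the reversal of $u$, and let $ev_k$, $od_k$, $u_k$, $v_k$ be the strings thus defined:

if $k$ is even:
$ev_k = a_2 a_4 \cdots a_{k-2}$,
$od_k = a_1 a_3 \cdots a_{k-3}$,
$u_k = ev_k \, \mathtt{u} \, \widetilde{ev_k} \, \mathtt{vw} \, ev_k$,
$v_k = od_k \, \mathtt{xy} \, \widetilde{od_k} \, \mathtt{z} \, od_k$;

if $k$ is odd:
$ev_k = a_2 a_4 \cdots a_{k-3}$,
$od_k = a_1 a_3 \cdots a_{k-2}$,
$u_k = ev_k \, \mathtt{uv} \, \widetilde{ev_k} \, \mathtt{wx} \, ev_k$,
$v_k = od_k \, \mathtt{y} \, \widetilde{od_k} \, \mathtt{z} \, od_k$.

The strings $s_k$ are then defined by $s_k = u_k v_k t_k$ for $k \ge 5$. Fig. 1 shows them for $k = 7$.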

Fact 1. The length of $u_k v_k$ is $3k$, and that of $s_k$ is $n = 5k + 1$.

42 IEEE/ACM TRANSACTIONS ON COMPUTATIONAL BIOLOGY AND BIOINFORMATICS, VOL. 2, NO. 1, JANUARY-MARCH 2005

2. Actually, the definition literally reported in [17] is "Definition 4 (Maximal Motif). Let $p_1, p_2, \ldots, p_k$ be the motifs in a sequence $s$. Let $p_i[j]$ be "." if $j > |p_i|$. A motif $p_i$ is maximal if and only if there exists no $p_l$, $l \ne i$ and no integer $0 \le \delta$ such that $L_{p_i} + \delta = L_{p_l}$ and $p_l[\delta + j] \preceq p_i[j]$ hold for $1 \le j \le |p_i|$." (The symbols in $p_i$ and $p_l$ are indexed starting from 1 onward.) The corresponding example in the paper illustrates the definition for $s = \mathtt{ABCDABCD}$, stating that $p_i = \mathtt{ABCD}$ is maximal while $p_l = \mathtt{ABC}$ is not. However, $p_i$ does not match the definition because of the existence of its prefix $p_l$ (setting $\delta = 0$); hence, we suspect a minor typo in the definition, for which the definition should read as "... such that $L_{p_i} = L_{p_l} + \delta$ and $p_i[j] \preceq p_l[\delta + j]$."

Proof. Whatever the parity of $k$, the string $u_k v_k$ contains the six letters $\mathtt{u}$, $\mathtt{v}$, $\mathtt{w}$, $\mathtt{x}$, $\mathtt{y}$, $\mathtt{z}$, two occurrences each of $ev_k$ and $od_k$, and one occurrence each of $\widetilde{ev_k}$ and $\widetilde{od_k}$. Since $od_k$ and $ev_k$ together contain one occurrence of each letter $a_1, a_2, \ldots, a_{k-2}$, we have $|od_k| + |ev_k| = k - 2$. Moreover, $|\widetilde{ev_k}| = |ev_k|$ and $|\widetilde{od_k}| = |od_k|$, so that $|u_k v_k| = 6 + 3(k - 2) = 3k$. This proves the first statement. For the second statement, the total length of $s_k$ follows by observing that $|t_k| = 2k + 1$, and so $n = |s_k| = 3k + 2k + 1 = 5k + 1$. □
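The construction of $s_k$ and the lengths in Fact 1 can be checked mechanically. The sketch below is an illustration under our own conventions: each symbol $a_i$ is encoded as the string `"ai"`, strings are represented as Python lists of symbols, and the names `build_parts` and `build_sk` are hypothetical.

```python
def build_parts(k):
    """Return (u_k, v_k, t_k) as lists of symbols, following the definition above."""
    if k < 5:
        raise ValueError("the construction is stated for k >= 5")

    def a(i):
        return f"a{i}"  # assumed encoding of the symbol a_i

    if k % 2 == 0:
        ev = [a(i) for i in range(2, k - 1, 2)]  # a2 a4 ... a_{k-2}
        od = [a(i) for i in range(1, k - 2, 2)]  # a1 a3 ... a_{k-3}
        uk = ev + ["u"] + ev[::-1] + ["v", "w"] + ev
        vk = od + ["x", "y"] + od[::-1] + ["z"] + od
    else:
        ev = [a(i) for i in range(2, k - 2, 2)]  # a2 a4 ... a_{k-3}
        od = [a(i) for i in range(1, k - 1, 2)]  # a1 a3 ... a_{k-2}
        uk = ev + ["u", "v"] + ev[::-1] + ["w", "x"] + ev
        vk = od + ["y"] + od[::-1] + ["z"] + od
    tk = ["A"] * k + ["T"] + ["A"] * k
    return uk, vk, tk


def build_sk(k):
    """Return s_k = u_k v_k t_k as a list of symbols."""
    uk, vk, tk = build_parts(k)
    return uk + vk + tk


if __name__ == "__main__":
    for k in range(5, 10):
        uk, vk, _ = build_parts(k)
        print(k, len(uk) + len(vk), len(build_sk(k)))
```

For each $k$ the printed lengths can be compared against $3k$ and $5k + 1$.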

Proposition 1. For $1 \le p \le k - 2$, no motif of the form $\mathtt{A}^p \circ \mathtt{A}^{k-p-1}$ can be maximal in $s_k$. Also, motif $\mathtt{A}^k$ cannot be maximal in $s_k$.

Proof. Let $w$ be an arbitrary motif of the form $\mathtt{A}^p \circ \mathtt{A}^{k-p-1}$, with $1 \le p \le k - 2$. Its location list is $L_w = \{0, k - p, k + 1\} + |u_k v_k| = \{3k, 4k - p, 4k + 1\}$ since $|u_k v_k| = 3k$ by Fact 1 and $w$ matches the two substrings $\mathtt{A}^k$ of $s_k$ as well as $\mathtt{A}^p \mathtt{T} \mathtt{A}^{k-p-1}$. The occurrences are shown in Fig. 1 for $k = 7$ and $p = 2$. No other occurrences are possible. Let us consider the position, say $i$, of the leftmost appearance of letter $a_p$ in $s_k$ (recall that there are three positions on $s_k$ at which letter $a_p$ occurs; we have $i = 0$ in our example of Fig. 1 with $p = 2$). We claim that motif $y = a_p \circ^{3k-i-1} w$ satisfies $L_y = L_w - (3k - i)$. Since $w$ appears in $y$, it follows that $w$ cannot be maximal in $s_k$ by Definition 5 (setting $d = -3k + i$). To see why $L_w = L_y + (3k - i)$, it suffices to prove that the distance in $s_k$ between the positions of the two leftmost letters $a_p$ is $k - p$ while that of the leftmost and the rightmost $a_p$ is $k + 1$. The verification is a bit tedious because four cases arise according to the fact that each of $k$ and $p$ can be even or odd. Since the cases are analogous, we detail only two of them, namely, when both $k$ and $p$ are even, and when $k$ is even and $p$ is odd.

In the first case, the three occurrences of $a_p$ are all in $u_k$. Moreover, the distance between the two leftmost letters $a_p$ is the length of the substring $a_p a_{p+2} \cdots a_{k-2} \, \mathtt{u} \, a_{k-2} a_{k-4} \cdots a_{p+2}$, that is, $2|a_{p+2} \cdots a_{k-2}| + 2 = 2(k - 2 - p)/2 + 2 = k - p$. The distance between the leftmost and rightmost $a_p$ is the length of $a_p a_{p+2} \cdots a_{k-2} \, \mathtt{u} \, \widetilde{ev_k} \, \mathtt{vw} \, a_2 a_4 \cdots a_{p-2}$. This is also the length of $\mathtt{u} \, \widetilde{ev_k} \, \mathtt{vw} \, a_2 a_4 \cdots a_{p-2} a_p a_{p+2} \cdots a_{k-2} = \mathtt{u} \, \widetilde{ev_k} \, \mathtt{vw} \, ev_k$, that is, $2(k - 2)/2 + 3 = k + 1$ as expected.

In the second case where $k$ is even and $p$ is odd, the occurrences of $a_p$ are all in $v_k$. Analogously to the first case, the distance between the two leftmost letters $a_p$ is the length of $a_p a_{p+2} \cdots a_{k-3} \, \mathtt{xy} \, a_{k-3} \cdots a_{p+2}$, that is, $2|a_{p+2} \cdots a_{k-3}| + 3 = 2(k - 3 - p)/2 + 3 = k - p$. The distance between the leftmost and the rightmost $a_p$ is the length of the string $a_p a_{p+2} \cdots a_{k-3} \, \mathtt{xy} \, \widetilde{od_k} \, \mathtt{z} \, a_1 a_3 \cdots a_{p-2}$, which equals $k + 1$, the length of $\mathtt{xy} \, \widetilde{od_k} \, \mathtt{z} \, od_k$. The analogous verification of the other two cases yields the fact that $w$ cannot be maximal.

The second part of the proposition, for motif $\mathtt{A}^k$, proceeds along the same lines, except that we choose $y = a_p \circ^{3k-i-1} \mathtt{A}^k$ with $i$ as before (note that $y$ is not required to be maximal and that the motifs in the statement are maximal in $t_k$). □

Proposition 2. Each motif of the form $\mathtt{A}\{\mathtt{A}, \circ\}^{k-2}\mathtt{A}$ with exactly two $\circ$s is irredundant in $s_k$.

Proof. Let $x$ be an arbitrary motif of the form $\mathtt{A}\{\mathtt{A}, \circ\}^{k-2}\mathtt{A}$ with two $\circ$s, namely, $x = \mathtt{A}^{p_1} \circ \mathtt{A}^{p_2 - p_1 - 1} \circ \mathtt{A}^{k - p_2 - 1}$ for $1 \le p_1 < p_2 \le k - 2$. To prove that $x$ is an irredundant motif, we first show that $x$ is maximal. Its location list is $L_x = \{0, k - p_2, k - p_1, k + 1\} + 3k$ since $|u_k v_k| = 3k$ by Fact 1 and $x$ matches the two substrings $\mathtt{A}^k$ of $s_k$ as well as $\mathtt{A}^{p_1} \mathtt{T} \mathtt{A}^{k - p_1 - 1}$ and $\mathtt{A}^{p_2} \mathtt{T} \mathtt{A}^{k - p_2 - 1}$. Any other motif $y$ such that $x$ occurs in $y$ can be obtained by replacing at least one wild card (at position $p_1$ or $p_2$) in $x$ with a solid character, but this would cause the removal of position $4k - p_1$ or $4k - p_2$ from $L_x$. Analogously, extending $x$ to the right by putting a solid character at position $|x|$ or larger would eliminate position $4k + 1$ from $L_x$. Finally, extending $x$ to the left by a solid character would eliminate at least one position from $L_x$ because no symbol occurs four times in $u_k v_k$. In conclusion, for any motif $y$ such that $x$ occurs in $y$, we have $L_y \ne L_x + d$ for any integer $d$ and, thus, $x$ is a maximal motif by Definition 5.

We now prove that $x$ is irredundant according to Definition 6. Let us consider an arbitrary set of maximal motifs $y_1, y_2, \ldots, y_h$ such that $L_x = \bigcup_{i=1}^{h} L_{y_i}$. We claim that at least one $y_i$ is of the form $\mathtt{A}\{\mathtt{A}, \circ\}^{k-2}\mathtt{A}$. Indeed, there must exist a location list $L_{y_i}$ containing position $4k + 1$ since that position belongs to $L_x$. This implies that $y_i$ occurs in the suffix $\mathtt{A}^k$ of $s_k$. It cannot be that $|y_i| < k$ since $y_i$ would occur also in some position $j > 4k + 1$ whereas $j \notin L_x$, so it is impossible. Consequently, $y_i$ is of length $k$ and matches $\mathtt{A}^k$, thus being of the form $\mathtt{A}\{\mathtt{A}, \circ\}^{k-2}\mathtt{A}$. We observe that $y_i$ cannot contain zero or one $\circ$s, as it would not be maximal by Proposition 1. Also, $y_i$ cannot contain three or more $\circ$s, as each distinct $\circ$ symbol would match the letter $\mathtt{T}$ in $s_k$ giving $|L_{y_i}| > |L_x|$, which is impossible. The only possibility is that $y_i$ contains exactly two $\circ$s, as $x$ does, at the same positions because $L_{y_i} \subseteq L_x$ and they are maximal. It follows that $y_i = x$, proving the proposition. □

Theorem 2. The basis for string $s_k$ contains $\Omega(n^2)$ irredundant motifs, where $n = |s_k|$ and $k \ge 5$.

Proof. By Proposition 2, the number of irredundant motifs in $s_k$ is at least $\binom{k-2}{2} = \Theta(k^2)$, the number of choices of two positions in $\{\mathtt{A}, \circ\}^{k-2}$. Since $|s_k| = 5k + 1$ by Fact 1, we get the conclusion. □
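As a sanity check on the counting argument, the sketch below enumerates the two-wild-card motifs of Proposition 2 for small $k$ and confirms that there are $\binom{k-2}{2}$ of them, each with the four-element location list stated in the proof. It does not test irredundancy itself. The encodings (`*` for the wild card, `"ai"` for $a_i$) and all function names are assumptions made for this illustration.

```python
from itertools import combinations
from math import comb

WILD = "*"  # assumed stand-in for the wild card symbol


def build_sk(k):
    """Return s_k = u_k v_k t_k as a list of symbols (a_i encoded as 'ai')."""
    if k % 2 == 0:
        ev = [f"a{i}" for i in range(2, k - 1, 2)]
        od = [f"a{i}" for i in range(1, k - 2, 2)]
        uk = ev + ["u"] + ev[::-1] + ["v", "w"] + ev
        vk = od + ["x", "y"] + od[::-1] + ["z"] + od
    else:
        ev = [f"a{i}" for i in range(2, k - 2, 2)]
        od = [f"a{i}" for i in range(1, k - 1, 2)]
        uk = ev + ["u", "v"] + ev[::-1] + ["w", "x"] + ev
        vk = od + ["y"] + od[::-1] + ["z"] + od
    return uk + vk + ["A"] * k + ["T"] + ["A"] * k


def location_list(x, s):
    """Positions of s at which the pattern x (a list of symbols) occurs."""
    return [p for p in range(len(s) - len(x) + 1)
            if all(c == WILD or c == s[p + j] for j, c in enumerate(x))]


def two_wildcard_motifs(k):
    """Yield (p1, p2, x) for every motif A{A,*}^(k-2)A with exactly two wild cards."""
    for p1, p2 in combinations(range(1, k - 1), 2):
        x = ["A"] * k
        x[p1] = x[p2] = WILD
        yield p1, p2, x


if __name__ == "__main__":
    for k in range(5, 9):
        s = build_sk(k)
        motifs = list(two_wildcard_motifs(k))
        ok = all(location_list(x, s) == [3 * k, 4 * k - p2, 4 * k - p1, 4 * k + 1]
                 for p1, p2, x in motifs)
        print(k, len(s), len(motifs), comb(k - 2, 2), ok)
```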


Fig. 1. Example string $s_7$ ($a_i$ of the definition is simply denoted by $i$). Above it, there are the occurrences of $w$ of the Proof of Proposition 1, while the three lines below show the occurrences of motif $x = 4 \circ^{19} \mathtt{AAAA}{\circ}\mathtt{AA}$ in $s_7$. The letter 4 corresponds to position 4 of the wild card in $\mathtt{AAAA}{\circ}\mathtt{AA}$.

4 TILING MOTIFS: THE BASIS AND ITS PROPERTIES

4.1 Terminology and Properties

In this section, we introduce a natural notion of a basis for generating all maximal motifs occurring in a string $s$ of length $n$.

Definition 7 (tiling motif). A maximal motif $x$ is tiling if, for any maximal motifs $y_1, y_2, \ldots, y_k$ and for any integers $d_1, d_2, \ldots, d_k$ such that $L_x = \bigcup_{i=1}^{k} (L_{y_i} + d_i)$, motif $x$ must be one of the $y_i$s. Conversely, if all the $y_i$s are different from $x$, pattern $x$ is said to be tiled by motifs $y_1, y_2, \ldots, y_k$.

The notion of tiling is in general more selective than that of irredundancy. Continuing our example string $s = \mathtt{FABCXFADCYZEADCEADC}$, we have seen in Section 2 that motif $x_1 = \mathtt{A}{\circ}\mathtt{C}$ is irredundant for $s$. Now, $x_1$ is tiled by $x_2 = \mathtt{FA}{\circ}\mathtt{C}$ and $x_4 = \mathtt{ADC}$ according to Definition 7 since its location list, $L_{x_1} = \{1, 6, 12, 16\}$, can be obtained from the union of $L_{x_2} = \{0, 5\}$ and $L_{x_4} = \{6, 12, 16\}$ with respective displacements $d_2 = 1$ and $d_4 = 0$.
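The tiling condition of Definition 7 is a plain set identity on displaced location lists, which the following Python sketch checks for a given choice of tilers and displacements. The function name `tiled_by` is hypothetical and the lists are entered by hand from the example above.

```python
def tiled_by(lx, tilers):
    """Return True if lx equals the union of the displaced lists (ly + d).

    `tilers` is an iterable of pairs (ly, d), where ly is a location list
    and d is the displacement applied to it.
    """
    covered = set()
    for ly, d in tilers:
        covered |= {pos + d for pos in ly}
    return covered == set(lx)


if __name__ == "__main__":
    l_x1 = [1, 6, 12, 16]   # A*C
    l_x2 = [0, 5]           # FA*C
    l_x4 = [6, 12, 16]      # ADC
    print(tiled_by(l_x1, [(l_x2, 1), (l_x4, 0)]))
```

Note that this only verifies one proposed tiling; deciding whether a motif is tiling requires ruling out every possible choice of tilers.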

Remark 1. A fairly direct consequence of Definition 7 is that if $x$ is tiled by $y_1, y_2, \ldots, y_k$ with associated displacements $d_1, d_2, \ldots, d_k$, then $x$ occurs at position $d_i$ in $y_i$ for $1 \le i \le k$. As a consequence, we have that $d_i \ge 0$ in Definition 7. Note also that the $y_i$s in Definition 7 are not necessarily distinct and that $k > 1$ for tiled motifs. (It follows from the fact that $L_x = L_{y_1} + d_1$ with $x \ne y_1$ would contradict the maximality of both $x$ and $y_1$.) As a result, a maximal motif $x$ occurring exactly $q$ times in $s$ is tiling as it cannot be tiled by any other motifs because such motifs would occur less than $q$ times.

The basis of tiling motifs is the complete set of all tiling motifs for $s$, and the size of the basis is the number of these motifs. For example, the basis, let us denote it by $B$, for $\mathtt{FABCXFADCYZEADCEADC}$ contains $\mathtt{FA}{\circ}\mathtt{C}$, $\mathtt{EADC}$, and $\mathtt{ADC}$ as tiling motifs. Although Definition 7 is derived from that of irredundant motifs given in Definition 6, the difference is much more substantial than it may appear. The basis of tiling motifs relies on the fact that tiling motifs are considered as invariant by displacement, as for maximality. Consequently, our definition of basis is symmetric, that is, each tiling motif in the basis for the reverse string $\widetilde{s}$ is the reverse of a tiling motif in the basis of $s$. This follows from the symmetry in Definition 7 and from the fact that maximality is also symmetric in Definition 5. It is a sine qua non condition for having a notion of basis invariant by the left-to-right or right-to-left order of the symbols in $s$ (like the entropy of $s$), while this property does not hold for the irredundant motifs.

The basis of tiling motifs has further interesting properties for quorum $q = 2$, illustrated in Sections 4.2, 4.3, and 4.4. In Section 4.2, we show that our basis is linear (that is, its size is at most $n - 1$). In Section 4.3, we show that the total size of the location lists for the tiling motifs is less than $2n$, describing how to find them in $O(n^2 \log n \log |\Sigma|)$ time. In Section 4.4, we discuss some applications such as generating all maximal motifs with the basis and finding motifs with a constraint on the number of undefined symbols.

4.2 A Linear Upper Bound for the Tiling Motifs with Quorum $q = 2$

Given a string $s$ of length $n$, let $B$ denote its basis of tiling motifs for quorum $q = 2$. Although the number of maximal motifs may be exponential and the basis of irredundant motifs may be at least quadratic (see Section 3), we show that the size of $B$ is always less than $n$. For this, we introduce an operator $\oplus$ between the symbols of $\Sigma$ to define the merges, which are at the heart of the properties of $B$. Given two letters $\sigma_1, \sigma_2 \in \Sigma$ with $\sigma_1 \ne \sigma_2$, the operator satisfies $\sigma_1 \oplus \sigma_2 = \circ$ and $\sigma_1 \oplus \sigma_1 = \sigma_1$. The operator applies to any pair of strings $x, y \in \Sigma^*$, so that $u = x \oplus y$ satisfies $u[j] = x[j] \oplus y[j]$ for all integers $j$.

Definition 8 (Merge). For $1 \le k \le n - 1$, let $s_k$ be the (infinite) string whose character at position $i$ is $s_k[i] = s[i] \oplus s[i + k]$. If $s_k$ contains at least one solid character, $\mathrm{Merge}_k$ denotes the motif obtained by removing all the leading and trailing $\circ$s in $s_k$ (that is, those appearing before the leftmost solid character and after the rightmost solid character).

For example, $\mathtt{FABCXFADCYZEADCEADC}$ has $\mathrm{Merge}_4 = \mathtt{EADC}$, $\mathrm{Merge}_5 = \mathtt{FA}{\circ}\mathtt{C}$, $\mathrm{Merge}_6 = \mathrm{Merge}_{10} = \mathtt{ADC}$, and $\mathrm{Merge}_{11} = \mathrm{Merge}_{15} = \mathtt{A}{\circ}\mathtt{C}$. The latter is the only merge that is not a tiling motif.
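Definition 8 translates almost literally into code. In the sketch below, which is only an illustration, the wild card is encoded as `*` (assumed not to occur in the input), positions beyond the end of $s$ are simply dropped since they could only contribute leading or trailing wild cards, and the names `merge` and `merges` are hypothetical.

```python
WILD = "*"  # assumed stand-in for the wild card symbol


def merge(s, k):
    """Return Merge_k of s, or None when s_k contains no solid character."""
    # s_k[i] = s[i] (+) s[i + k]: keep the letter on agreement, else a wild card.
    sk = "".join(a if a == b else WILD for a, b in zip(s, s[k:]))
    stripped = sk.strip(WILD)  # drop leading and trailing wild cards
    return stripped if stripped else None


def merges(s):
    """Return a dict mapping each k with an existing merge to Merge_k."""
    result = {}
    for k in range(1, len(s)):
        m = merge(s, k)
        if m is not None:
            result[k] = m
    return result


if __name__ == "__main__":
    for k, m in sorted(merges("FABCXFADCYZEADCEADC").items()):
        print(k, m)
```

Each merge is computed in $O(n - k)$ time, so all of them together take $O(n^2)$ time, in line with Step 1 of the algorithm of Section 4.3.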

Lemma 1. If $\mathrm{Merge}_k$ exists, it must be a maximal motif.

Proof. Motif $x = \mathrm{Merge}_k$ occurs at positions, say, $i$ and $i + k$ in $s$. Character $s_k[i]$ is solid by Definitions 4 and 8. We use the fact that $x$ occurs at least twice in $s$ for showing that it is maximal. Suppose it is not maximal. By Definition 5, there exists $y \ne x$ such that $x$ occurs in $y$ and $L_y = L_x + d$ for some integer $d$ (in this case $d \le 0$). Since $y$ is more specific than $x$ displaced by $d$, there must exist at least one position $j$ with $0 \le j < |y|$ such that $x[j + d] = \circ$ and $y[j] = \sigma \in \Sigma$. Hence, $x[j + d] = s[i + (j + d)] \oplus s[i + k + (j + d)] = \circ$, and so $s[(i + d) + j] \ne s[(i + k + d) + j]$. Since $y[j]$ cannot match both of the latter symbols in $s$, at least one of $i + d$ or $i + k + d$ is not a position of $y$ in $s$. This contradicts the hypothesis that $L_y = L_x + d$, whereas both $i, i + k \in L_x$. □

Lemma 2. For each tiling motif $x$ in the basis $B$, there is at least one $k$ for which $\mathrm{Merge}_k = x$.

Proof. As mentioned in Remark 1, a maximal motif occurring exactly twice in $s$ is tiling. Hence, if $|L_x| = 2$, say $L_x = \{i, j\}$ with $j > i$, then $x = \mathrm{Merge}_k$ with $k = j - i$ by the maximality of $x$ and that of the merges by Lemma 1. Let us now consider the case where $|L_x| > 2$. For any pair $i, j \in L_x$, we denote by $u_{ij}$ the string $s[i..i + |x| - 1] \oplus s[j..j + |x| - 1]$ obtained by applying the $\oplus$ operator to the two substrings of $s$ matching $x$ at positions $i$ and $j$, respectively. We have $x \preceq u_{ij}$ since $x$ occurs at positions $i$ and $j$, and $L_x = \bigcup_{i,j \in L_x} L_{u_{ij}}$ since we are taking all pairs of occurrences of $x$. Letting $k = |j - i|$ for $i, j \in L_x$, we observe that $u_{ij}$ is a substring of $\mathrm{Merge}_k$ occurring at position, say, $\delta_k$ in it. Thus,

$$\bigcup_{i,j \in L_x} L_{u_{ij}} = \bigcup_{k = |j - i| \,:\, i,j \in L_x} \left( L_{\mathrm{Merge}_k} + \delta_k \right) = L_x.$$

By Definition 7, the fact that $x$ is tiling implies that $x$ must be one $\mathrm{Merge}_k$, proving the lemma. □


We now state the main property of tiling bases that follows directly from Lemma 2.

Theorem 3 (linearity of the basis). Given a string $s$ of length $n$ and the quorum $q = 2$, let $M$ be the set of $\mathrm{Merge}_k$, for $1 \le k \le n - 1$ such that $\mathrm{Merge}_k$ exists. The basis $B$ of tiling motifs for $s$ satisfies $B \subseteq M$ and, therefore, the size of $B$ is at most $n - 1$.

A simple consequence of Theorem 3 is a tight bound on the number of tiling motifs for periodic strings. If $s = w^e$ for a string $w$ repeated $e > 1$ times, then $s$ has at most $|w|$ tiling motifs.

Corollary 1. The number of tiling motifs for $s$ is at most $p$, the smallest period of $s$.

The bound in Corollary 1 is not valid for irredundant motifs. String $s = \mathtt{ATATATATA}$ has period $p = 2$ and only one tiling motif $\mathtt{ATATATA}$, while its irredundant motifs are $\mathtt{A}$, $\mathtt{ATA}$, $\mathtt{ATATA}$, and $\mathtt{ATATATA}$.

4.3 A Simple Algorithm for Computing Tiling Motifs with Quorum $q = 2$

We describe how to compute the basis $B$ for string $s$ when $q = 2$. A brute-force algorithm generating first all maximal motifs of $s$ takes exponential time in the worst case. Theorem 3 plays a crucial role in that we first compute the motifs in $M$ and then discard those being tiled. Since $B \subseteq M$, what remains is exactly $B$. To appreciate this approach, it is worth noting that we are left with the problem of selecting $B$ from $n - 1$ maximal motifs in $M$ at most, rather than selecting $B$ among all the maximal motifs in $s$, which may be exponential in number. Our simple algorithm takes $O(n^2 \log n \log |\Sigma|)$ time and is faster than previous (and more complicated) methods discussed in Section 1.

Step 1. Compute the multiset $M'$ of merges. Letting $s_k[i]$ be the leftmost solid character of string $s_k$ in Definition 8, we define $occ_x = \{i, i + k\}$ to be the positions of the two occurrences of $x$ whose superposition generates $x = \mathrm{Merge}_k$. For $k = 1, 2, \ldots, n - 1$, we compute string $s_k$ in $O(n - k)$ time. If $s_k$ contains some solid characters, we compute $x = \mathrm{Merge}_k$ and $occ_x$ in the same time complexity. As a result, we compute the multiset $M'$ of merges in $O(n^2)$ time. Each merge $x$ in $M'$ is identified by a triple $\langle i, i + k, |x| \rangle$, from which we can recover the $j$th symbol of $x$ in constant time by simple arithmetic operations and comparisons.

Step 2. Transform the multiset $M'$ into the set $M$ of merges. Since there can be two or more merges in $M'$ that are identical and correspond to the same merge in $M$, we put together all identical merges in $M'$ by radix sorting them. The total cost of this step is dominated by radix sorting, giving $O(n^2)$ time. As a byproduct, we produce the temporary location list $T_x = \bigcup_{x' = x \,:\, x' \in M'} occ_{x'}$ for each distinct $x \in M$ thus obtained.

Lemma 3. Each motif $x \in B$ satisfies $T_x = L_x$.

Proof. For a fixed $x \in B$, the fact that $x$ is equal to at least one merge by Lemma 2 implies that $T_x$ is well defined, with $|T_x| \ge 2$. Since $T_x \subseteq L_x$, let us assume by contradiction that $L_x - T_x \ne \emptyset$. For each pair $i \in L_x - T_x$ and $j \in T_x$, let $m_{ij} = \mathrm{Merge}_{|j - i|}$, which is maximal by Lemma 1. Note that each $m_{ij} \ne x$ by our assumption, as otherwise $i$ would belong to $T_x$; however, $x$ must occur in $m_{ij}$, say, at position $\delta_{ij}$ in $m_{ij}$. Consequently, $\bigcup_{i \in L_x - T_x,\, j \in T_x} \left( L_{m_{ij}} + \delta_{ij} \right) = L_x$ since any occurrence of $x$ is either $i \in L_x - T_x$ or $j \in T_x$. At this point, we apply Definition 7 to the tiling motif $x$, obtaining the contradiction that $x$ must be equal to one $m_{ij}$. □

Notice that the conclusion of Lemma 3 does not necessarily hold for the motifs in $M - B$. For the previous example string $\mathtt{FADABCXFADCYZEADCEADCFADC}$, one such motif is $x = \mathtt{ADC}$ with $L_x = \{8, 14, 18, 22\}$ while $T_x = \{8, 18\}$.

Step 3. Select $M^* \subseteq M$, where $M^* = \{x \in M : T_x = L_x\}$. In order to build $M^*$, we employ the Fischer-Paterson algorithm based on convolution [8] for string matching with don't cares to compute the whole list of occurrences $L_x$ for each merge $x \in M$. Its cost is $O((|x| + n) \log n \log |\Sigma|)$ time for each merge $x$. Since $|x| < n$ and there are at most $n - 1$ motifs $x \in M$, we obtain $O(n^2 \log n \log |\Sigma|)$ time to construct all lists $L_x$. We can compute $M^*$ by discarding the merges $x \in M$ such that $T_x \ne L_x$ in additional $O(n^2)$ time.

Lemma 4. The set $M^*$ satisfies the conditions $B \subseteq M^*$ and $\sum_{x \in M^*} |L_x| < 2n$.

Proof. The first condition follows from the fact that the motifs in $M - M^*$ are surely tiled by Lemma 3. The second condition follows from the definition of $M^*$ and from the observation that

$$\sum_{x \in M^*} |L_x| = \sum_{x \in M^*} |T_x| \le \sum_{x \in M'} |occ_x| < 2n,$$

since $|occ_x| = 2$ (see Step 1) and there are less than $n$ of them. □

The property of $M^*$ in Lemma 4 is crucial in that $\sum_{x \in M} |L_x| = \Theta(n^2)$ when many lists contain $\Theta(n)$ entries. For example, $s = \mathtt{A}^n$ has $n - 1$ distinct merges, each of the form $x = \mathtt{A}^i$ for $1 \le i \le n - 1$, and so $|L_x| = n - i + 1$. This would be a sharp drawback in Step 4 when removing tiled motifs as it may turn into a $\Theta(n^3)$ algorithm. Using $M^*$ instead, we are guaranteed that $\sum_{x \in M^*} |L_x| = O(n)$; hence, we may still have some tiled motifs in $M^*$, but their total number of occurrences is $O(n)$.

Step 4. Discard the tiled motifs in $M^*$. We can now check for tiling motifs in $O(n^2)$ time. Given two distinct motifs $x, y \in M^*$, we want to test whether $L_x + d \subseteq L_y$ for some integer $d$ and, in that case, we want to mark the entries in $L_y$ that are also in $L_x + d$. At the end of this task, the lists having all entries marked are tiled (see Definition 7). By removing their corresponding motifs from $M^*$, we eventually obtain the basis $B$ by Lemma 4. Since the meaningful values of $d$ are as many as the entries of $L_y$, we have only $|L_y|$ possible values to check. For a given value of $d$, we avoid merging $L_x$ and $L_y$ in $O(|L_x| + |L_y|)$ time to perform the test, as it would contribute to a total of $\Theta(n^3)$ time. Instead, we exploit the fact that each list has values ranging from 1 to $n$, and use two bit-vectors of size $n$ to perform the above check in $O(|L_x| \times |L_y|)$ time for all values of $d$. This gives $O\left(\sum_y \sum_x |L_x| \times |L_y|\right) = O\left(\sum_y |L_y| \times \sum_x |L_x|\right) = O(n^2)$ by Lemma 4.


We therefore detail how to perform the above check with $L_x$ and $L_y$ in $O(|L_x| \times |L_y|)$ time. We use two bit-vectors $V_1$ and $V_2$ of length $n$ initially set to all zeros. Given $y \in M^*$, we set $V_1[i] = 1$ if $i \in L_y$. For each $x \in M^* - \{y\}$ and for each $d \in (L_y - m)$ (where $m$ is the smallest entry of $L_x$), we then perform the following test. If all $j \in L_x + d$ satisfy $V_1[j] = 1$, we set $V_2[j] = 1$ for all such $j$. Otherwise, we take the next value of $d$, or the next motif if there are no more values of $d$, and we repeat the test. After examining all $x \in M^* - \{y\}$, we check whether $V_1[i] = V_2[i]$ for all $i \in L_y$. If so, $y$ is tiled as its list is covered by possibly shifted location lists of other motifs. We then reset the ones in both vectors in $O(|L_y|)$ time.
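The marking procedure just described can be sketched in Python as follows. This is an illustration only: Python lists of 0/1 play the role of the bit-vectors, location lists use 0-based positions, the vectors are allocated per call rather than reset and reused, and the function name `is_tiled` is hypothetical.

```python
def is_tiled(ly, other_lists, n):
    """Return True if ly is covered by shifted copies of the lists in other_lists.

    ly          -- location list of the motif y under test
    other_lists -- location lists L_x of the other candidate motifs
    n           -- length of the input string (size of the bit-vectors)
    """
    v1 = [0] * n  # v1[i] = 1 iff i belongs to L_y
    v2 = [0] * n  # v2[i] = 1 iff i has been covered by some L_x + d
    for i in ly:
        v1[i] = 1
    for lx in other_lists:
        m = min(lx)
        for entry in ly:
            d = entry - m  # align the smallest entry of L_x with an entry of L_y
            shifted = [j + d for j in lx]
            if all(0 <= j < n and v1[j] for j in shifted):
                for j in shifted:
                    v2[j] = 1
    return all(v2[i] for i in ly)


if __name__ == "__main__":
    n = 19  # length of FABCXFADCYZEADCEADC
    # Lists of FA*C, ADC, EADC tested against the list of A*C.
    print(is_tiled([1, 6, 12, 16], [[0, 5], [6, 12, 16], [11, 15]], n))
```

Each pair of lists costs $O(|L_x| \times |L_y|)$ operations here, matching the bound used in Step 4; a production version would reset and reuse the two vectors as described above.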

Summing up Steps 1-4, we have that the dominant cost is that of Step 3 and that we have proved the following result.

Theorem 4. Given an input string $s$ of length $n$ over the alphabet $\Sigma$, the basis of tiling motifs with quorum $q = 2$ can be computed in $O(n^2 \log n \log |\Sigma|)$ time. The total number of motifs in the basis is less than $n$, and the total number of their occurrences in $s$ is less than $2n$.
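For experimenting with small inputs, the whole pipeline can be imitated by brute force. The sketch below is not the $O(n^2 \log n \log |\Sigma|)$ algorithm of Theorem 4: it replaces radix sorting by a Python set, the Fischer-Paterson matching by direct scanning, and the bit-vector test by set operations, and it skips the intermediate set $M^*$. The wild card encoding `*` and the function names are assumptions of this illustration.

```python
WILD = "*"  # assumed stand-in for the wild card symbol


def merge(s, k):
    """Return Merge_k of s (Definition 8), or None if it does not exist."""
    sk = "".join(a if a == b else WILD for a, b in zip(s, s[k:]))
    stripped = sk.strip(WILD)
    return stripped if stripped else None


def location_list(x, s):
    """Return the set of positions of s at which x occurs (direct scan)."""
    return {p for p in range(len(s) - len(x) + 1)
            if all(c == WILD or c == s[p + j] for j, c in enumerate(x))}


def tiling_basis(s):
    """Return the set of tiling motifs of s for quorum q = 2 (brute force)."""
    candidates = set()
    for k in range(1, len(s)):
        m = merge(s, k)
        if m is not None:
            candidates.add(m)  # by Theorem 3, the basis is a subset of the merges
    lists = {x: location_list(x, s) for x in candidates}
    basis = set()
    for x in candidates:
        covered = set()
        for y in candidates:
            if y == x:
                continue
            base = min(lists[y])
            for target in lists[x]:
                shifted = {pos - base + target for pos in lists[y]}
                if shifted <= lists[x]:
                    covered |= shifted
        if covered != lists[x]:  # not all occurrences are covered: x is tiling
            basis.add(x)
    return basis


if __name__ == "__main__":
    print(sorted(tiling_basis("FABCXFADCYZEADCEADC")))
    print(sorted(tiling_basis("ATATATATA")))
```

The two test strings are the running examples of Sections 2 and 4.2, so the output can be compared with the bases stated there.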

We have implemented the algorithm underlying Theorem 4, and we report here the lessons learned from our experiments. Step 1 requires, in practice, less than the predicted $O(n^2)$ running time. If $p = 1/|\Sigma|$ denotes the probability that two randomly chosen symbols of $\Sigma$ match in the uniform distribution, the probability of finding the first solid character in a merge follows the binomial distribution, and so the expected number of examined characters in $s$ is $O(1/p) = O(|\Sigma|)$, yielding $O(n|\Sigma|)$ time on the average to locate the first (scanning $s$ from the beginning) and the last (scanning $s$ from the end backward) solid character in each merge. A similar approach can be followed in Step 2 for finding the distinct merges. In this case, the merges are first partially sorted using hashing and exploiting the fact that the input is almost sorted. Insertion sort is then the best choice and works very efficiently in our experiments (at least 50 percent faster than Quicksort). We do not yet compute the full merges at this stage, but we delay this expensive part to a later stage on a small set of buckets that require explicit representation of the merges. As a result, the average case is almost linear. For example, executing Steps 1 and 2 on chromosome V of C. elegans, containing more than 21 million bases, took around 15 minutes on a machine with 512 MB of RAM running Linux on a 1 GHz AMD Athlon processor. Step 3 is expensive also in practice and the worst case predicted by theory shows up in the experiments. Running this step on sequences much shorter than chromosome V of C. elegans took many hours. Step 4 is not much of a problem. As a result, an alternative way of selecting $M^*$ from $M$ in Step 3 that works fast in practice would considerably improve the overall performance.

4.4 Some Applications

Checking whether a pattern is a motif. The main property underlying the notion of basis is that it is a generator of all motifs. The generation can be done as follows: First select segments of motifs in the basis that start and end with solid characters, then replace any number of internal solid characters by wild cards. However, since the number of motifs, and even maximal motifs, can be exponential, this is not really meaningful unless this number is small and the time complexity of the algorithm is proportional to the total size of the output. An attempt in this direction is made in [18]. The dual problem concerns testing only one pattern. We show how, given a pattern $x$, it can be tested whether $x$ is a motif for string $s$, that is, if pattern $x$ occurs at least $q$ times in $s$. There are two possible ways of performing such a test, depending on whether we test directly on the string or on the basis. The answer relies on iterative applications of the observation made in Remark 1, according to which any tiled motif must occur in at least one tiling motif. The next two statements deal with the alternative. In both cases, we assume that integer $k$ comes from the decomposition of pattern $x$ in the form $u_0 \circ^{\ell_0} u_1 \circ^{\ell_1} \cdots u_{k-1} \circ^{\ell_{k-1}} u_k$, where the subwords $u_i$ contain no wild cards ($u_i \in \Sigma^*$, $0 \le i \le k$) and $\ell_j$ are positive integers, $0 \le j \le k - 1$. The next proposition states a well-known fact on matching such a pattern in a text without any wild card that we report here because it is used in the sequel.

Proposition 3. The positions of the occurrences of a pattern $x$ in a string of length $n$ can be computed in time $O(kn)$.

Proof. This is a mere application of matching a pattern with do not cares inside a text without do not cares. Using, for instance, Fischer and Paterson's algorithm [8] is not necessary. Instead, the positions of the subwords $u_i$ are computed by a multiple string-matching algorithm, such as the Aho-Corasick algorithm [1]. For each position $p$, a counter associated with position $p - \ell$ on $s$ is incremented, where $\ell$ is the position of $u_i$ in $x$ ($\ell$ is the offset of $u_i$ in $x$). Counters whose value is $k + 1$ correspond then to occurrences of $x$ in $s$. It remains to check if $x$ occurs at least $q$ times in $s$. The running time is governed by the string-matching algorithm, which is $O(kn)$ (equivalent to running $k$ times a linear-time string matching algorithm). □
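The counter technique of the proof can be sketched as follows. As a simplification that we assume only for illustration, Python's built-in `str.find` is run once per subword in place of the Aho-Corasick automaton, so the sketch does not achieve the stated worst-case bound; the wild card is encoded as `*` and the function names are hypothetical.

```python
WILD = "*"  # assumed stand-in for the wild card symbol


def subwords_with_offsets(x):
    """Split x into its maximal solid subwords u_i, each with its offset in x."""
    parts, start = [], None
    for pos, c in enumerate(x + WILD):  # sentinel wild card flushes the last run
        if c == WILD:
            if start is not None:
                parts.append((x[start:pos], start))
                start = None
        elif start is None:
            start = pos
    return parts


def occurrences(x, s):
    """Return the positions of s at which x occurs, using one counter per position."""
    parts = subwords_with_offsets(x)
    counters = [0] * len(s)
    for word, offset in parts:
        p = s.find(word)
        while p != -1:
            if p - offset >= 0:
                counters[p - offset] += 1  # candidate start of x implied by this hit
            p = s.find(word, p + 1)
    # A position is an occurrence when every subword was found at its offset.
    return [i for i, c in enumerate(counters)
            if c == len(parts) and i + len(x) <= len(s)]


if __name__ == "__main__":
    print(occurrences("A*C", "FABCXFADCYZEADCEADC"))
```

Testing whether $x$ is a motif then amounts to comparing the length of the returned list with the quorum $q$.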

Proposition 4. Given the basis $B$ of string $s$, testing if pattern $x$ is a motif or a maximal motif can be done in $O(kb)$ time, where $b = \sum_{y \in B} |y|$.

Proof. From Remark 1, testing if $x$ is a maximal motif requires only finding if $x$ occurs in an element $y$ of the basis. To do this, we can apply the procedure of the previous proof because wild cards in $y$ should be viewed as extra characters that do not match any letter of $\Sigma$. The time complexity of the procedure is thus $O(kb)$. Since a nonmaximal motif occurs in a maximal motif, the same procedure applies to test if $x$ is a general motif. □

As a consequence of Propositions 3 and 4, we get an upper bound on the time complexity for testing motifs.

Corollary 2. Testing whether or not pattern $u_0 \circ^{\ell_0} u_1 \circ^{\ell_1} \cdots u_{k-1} \circ^{\ell_{k-1}} u_k$ is a motif in a string of length $n$ having a basis of total size $b$ can be done in time $O(k \times \min\{b, n\})$.

Remark 2. Inside the procedure described in the proofs of Propositions 3 and 4, it is also possible to use bit-vector pattern matching methods [3], [16], [25] to compute the occurrences of $x$. This leads to practically efficient solutions running in time proportional to the length of the string $n$ or the total size of the basis $b$, in the bit-vector model of machine. This is certainly a method of choice for short patterns.

Finding the longest motif with bounded number of wild cards. We address an interesting question concerning the computation of a longest motif occurring repeatedly in a string. Given an integer $g \ge 0$, let $LM_g(s)$ be the maximal length of motifs occurring in a string $s$ of length $n$ with quorum $q = 2$, and containing no more than $g$ wild cards. If $g = 0$, the value can be computed in $O(n \log |\Sigma|)$ time with the help of the suffix tree of $s$ (see [5] or [10]). For $g > 0$, we can show that $LM_g(s)$ can be computed in $O(gn^2)$ time using the suffix tree augmented (in linear time) to accept lowest common ancestor (LCA) queries as follows: For each possible pair $(i, j)$ of positions on $s$ for which $s[i] = s[j]$, we compute the longest common prefix of $s[i..n-1]$ and $s[j..n-1]$ in constant time through an LCA query on the suffix tree. If $\ell$ is the length of the prefix, we get the first part $s[i..i + \ell - 1]\circ$ of a possible longest motif. The second part is found similarly by considering the pair of positions $(i + \ell + 1, j + \ell + 1)$. The process is iterated $g$ times (or less) and provides a longest motif containing at most $g$ wild cards and occurring at positions $i$ and $j$. Length $LM_g(s)$ is obtained by taking the maximum length of motifs for all pairs of positions $(i, j)$. This yields the next result.

Proposition 5. Using the suffix tree, $LM_g(s)$ can be computed in $O(gn^2)$ time.
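The pairwise extension scheme can be illustrated without a suffix tree. In the sketch below, which is an assumption-laden simplification, each longest-common-prefix query is replaced by a character-by-character scan, so the running time is $O(n^3)$ in the worst case rather than $O(gn^2)$; the function name `lm_naive` is hypothetical.

```python
def lm_naive(s, g):
    """Return LM_g(s): the length of a longest motif with quorum 2 and at most g wild cards.

    For every pair of positions i < j with s[i] == s[j], the two suffixes are
    scanned in parallel; each mismatch is turned into a wild card until the
    budget g is exhausted. Trailing wild cards are not counted in the length.
    """
    n = len(s)
    best = 0
    for i in range(n):
        for j in range(i + 1, n):
            if s[i] != s[j]:
                continue
            wilds = 0
            length = 0  # length up to the last matching (solid) position
            off = 0
            while j + off < n:
                if s[i + off] == s[j + off]:
                    length = off + 1
                else:
                    wilds += 1
                    if wilds > g:
                        break
                off += 1
            best = max(best, length)
    return best


if __name__ == "__main__":
    for g in range(3):
        print(g, lm_naive("ABCDEABXDE", g))
```

With LCA queries, each run of matching characters in the inner loop would be skipped in constant time, which is where the factor $g$ in the bound of Proposition 5 comes from.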

What makes the use of the basis of tiling motifs interesting is that computing $LM_g(s)$ becomes a mere pattern matching exercise because of the strong properties of the basis. This contrasts with the previous result grounded on the deep algorithmic technique for LCA queries.

Proposition 6. Using the basis $B$ of tiling motifs, $LM_g(s)$ can be computed in time $O(b)$, where $b = \sum_{y \in B} |y|$.

Proof. Let $x$ be a motif yielding $LM_g(s)$ (i.e., $x$ is of length $LM_g(s)$); hence, $x$ occurs at least twice in $s$. Let $y$ be a maximal motif in which $x$ occurs (we have $y = x$ if $x$ is itself maximal). Let $z$ be a tiling motif in which $y$ occurs (again we may have $z = y$ if $y$ is a tiling motif). The word $x$ then occurs in $z$, which belongs to the basis. Let us say that it matches $z[i..j]$. Assume that $x$ is not a tiling motif, that is, $x \ne z$. Certainly, $i = 0$ or $z[i-1] = \circ$; otherwise, $x$ would not be the longest with its property. For the same reason, $j = |z| - 1$ or $z[j+1] = \circ$. But, indeed, $x$ occurs exactly in $z$, which means that the wild card symbols do not match any solid symbol. Otherwise, $z[i..j]$ would contain fewer than $g$ don't cares and could be extended by at least one symbol to the left or to the right because $x \ne z$, yielding a contradiction with the definition of $x$. Therefore, either $x$ is a tiling motif or it matches exactly a segment of one of the tiling motifs. Searching for $x$ thus reduces to finding a longest segment of a tiling motif in $B$ that contains no more than $g$ wild cards. The computation can be done in linear time with only two pointers on $s$, which proves the result. □

By Proposition 6, it is clear that a small basis $B$ leads to an efficient computation once $B$ is given. If we have to build $B$ from scratch, we can observe that no (maximal) motif can give a larger value of $LM_g(s)$ if it does not belong to $B$. With this observation, we have $O(n^2)$ running time, which always beats the $O(g \cdot n^2)$ cost of using the suffix tree. In particular, it is interesting to notice that the running time of the algorithm using the basis is independent of the parameter $g$.
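The two-pointer scan mentioned in the proof of Proposition 6 can be illustrated with a sliding window over each motif of the basis. The sketch below assumes the basis is already available as a list of strings in which `.` stands for the wild card; the example basis and the function names are hypothetical and are not taken from the paper.

```python
WILD = "."  # assumed stand-in for the wild card symbol

def longest_segment(motif: str, g: int) -> int:
    """Longest segment of `motif` with at most g wild cards that
    starts and ends with a solid character (two-pointer scan)."""
    best = 0
    left = 0
    wild = 0
    for right, symbol in enumerate(motif):
        if symbol == WILD:
            wild += 1
        # Shrink the window until it holds at most g wild cards.
        while wild > g:
            if motif[left] == WILD:
                wild -= 1
            left += 1
        # A motif cannot begin with a wild card: skip leading ones.
        while left <= right and motif[left] == WILD:
            wild -= 1
            left += 1
        if symbol != WILD:
            best = max(best, right - left + 1)
    return best

def longest_motif_from_basis(basis: list[str], g: int) -> int:
    """Maximum over all motifs of the basis; time linear in their total length."""
    return max((longest_segment(y, g) for y in basis), default=0)
```

Each symbol of each motif is visited a constant number of times by the two pointers, which is why the running time does not depend on g.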

5 PSEUDOPOLYNOMIAL BASES FOR HIGHER QUORUM

We now discuss the general case of quorum $q \ge 2$ for finding the basis of a string of length $n$. Unlike previous work, we show in Section 5.1 that no polynomial-time algorithm can exist for arbitrary values of $q$ in the worst case, both for the basis of irredundant motifs and for the basis of tiling motifs. The size of these bases provably depends exponentially on suitable values of $q \ge 2$, that is, we give a lower bound of $\binom{\frac{n-1}{2}-1}{q-1} = \Omega\left(\frac{1}{2^q}\binom{n-1}{q-1}\right)$. In practice, this size has an exponential growth for increasing values of $q$ up to $O(\log n)$, but larger values of $q$ are theoretically possible in the worst case. Fixing $q = (n-1)/4 + 1$ in our lower bound, we get a size of $\Omega(2^{(n-1)/4})$ motifs in the bases. On the average, $q = O(\log_{|\Sigma|} n)$ by extending the argument after Theorem 4, namely, using the fact that on the average the number of simultaneous comparisons to find the first solid character of a merge is $O(|\Sigma|^{q-1})$, which must be less than $n$.

We show a further property for the basis of tiling motifs in Section 5.2, giving an upper bound of $\binom{n-1}{q-1}$ on its size with a simple proof. Since we can find an algorithm taking time proportional to the square of that size, we can conclude that a worst-case polynomial-time algorithm for finding the basis of tiling motifs exists if and only if the quorum $q$ satisfies either $q = O(1)$ or $q = n - O(1)$ (the latter condition is hardly meaningful in practice).

5.1 A Lower Bound of $\binom{\frac{n-1}{2}-1}{q-1}$ on the Bases

We show the existence of a family of strings for which there are at least $\binom{\frac{n-1}{2}-1}{q-1}$ tiling motifs for a quorum $q$. Since a tiling motif is also irredundant, this gives a lower bound for the irredundant motifs to be combined with that in Section 3 (note that the lower bound in Section 3 still gives $\Omega(n^2)$ for $q \ge 2$). For $q > 2$, this gives a lower bound of $\Omega\binom{\frac{n-1}{2}-1}{q-1} = \Omega\left(\frac{1}{2^q}\binom{n-1}{q-1}\right)$ for the number of both tiling and irredundant motifs.

The strings are this time of the form $t_k = A^k T A^k$ ($k \ge 5$), without the left extension used in the bound of Section 3. The proof proceeds by exhibiting $\binom{k-1}{q-1}$ motifs that are maximal and have each exactly $q$ occurrences, from which it follows immediately that they are tiling. Indeed, Remark 1 for tiling motifs holds for any $q \ge 2$. Namely, all maximal motifs that occur exactly $q$ times in a string are tiling.

Proposition 7. For $2 \le q \le k$ and $1 \le p \le k - q + 1$, any motif $A^p \circ \{A, \circ\}^{k-p-1} \circ A^p$ with exactly $q$ wild cards is tiling (and so irredundant) in $t_k$.

PISANTI ET AL.: BASES OF MOTIFS FOR GENERATING REPEATED PATTERNS WITH WILD CARDS 47

Proof. Let $x$ be an arbitrary motif $A^p \circ \{A, \circ\}^{k-p-1} \circ A^p$ with $1 \le p \le k - q + 1$ and $q$ wild cards; namely, $x = A^{p_1} \circ A^{p_2-p_1-1} \circ \cdots \circ A^{p_{q-1}-p_{q-2}-1} \circ A^{k-p_{q-1}-1} \circ A^{p_1}$ for $1 \le p_1 < p_2 < \cdots < p_{q-1} \le k - 1$ and $p = p_1$. We first have to prove that $x$ is a maximal motif according to Definition 5. Its length is $k + 1 + p_1$ and its location list is $L_x = \{0, k - p_{q-1}, \ldots, k - p_2, k - p_1\}$. Observe that the number of its occurrences is exactly the number of times the wild card appears in $x$, which is equal to $q$. A motif $y$ different from $x$ such that $x$ occurs in $y$ can be obtained by replacing the wild card at position $p_i$ with a solid symbol, for $1 \le i \le q - 1$, but this eliminates $k - p_i$ from the location list of $y$. Also, $y$ can be obtained by extending $x$ to the right by a solid symbol (at any position $\ge |x|$), but then position $k - p_1$ is not in $L_y$ because the last symbol in that occurrence of $y$ occupies position $(k - p_1) + |y| - 1 \ge (k - p_1) + |x| = (k - p_1) + (k + 1 + p_1) > |t_k| - 1$ in $t_k$, which is impossible. Analogously, $y$ can be obtained by extending $x$ to the left by a solid symbol (at any position $d < 0$), but position 0 is no longer in $L_y$. Consequently, for any motif $y$ more specific than $x$, we have $L_y \ne L_x + d$, implying that $x$ is maximal. As previously mentioned, $x$ is tiling because it has exactly $q$ occurrences. □

Theorem 5. String $t_k$ has $\binom{\frac{n-1}{2}-1}{q-1} = \Omega\left(\frac{1}{2^q}\binom{n-1}{q-1}\right)$ tiling (and irredundant) motifs, where $n = |t_k|$ and $k \ge 2$.

Proof. By Proposition 7, the tiling or irredundant motifs in $t_k$ are at least $\binom{k-1}{q-1}$, the number of choices of $q - 1$ positions on $A^{k-1}$. Since $n = 2k + 1$, we obtain the statement. □
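The counting part of Proposition 7 and Theorem 5 can be checked by brute force on small instances. The sketch below builds every motif of the stated form for given k and q, and verifies that its location list in $t_k = A^k T A^k$ is exactly $\{0, k - p_{q-1}, \ldots, k - p_1\}$. It only checks occurrences, not maximality; `.` is an assumed stand-in for the wild card and all function names are hypothetical.

```python
from itertools import combinations
from math import comb

WILD = "."  # assumed stand-in for the wild card symbol

def occurrences(motif: str, text: str) -> list[int]:
    """All start positions where `motif` matches `text` (wild card matches anything)."""
    return [
        start
        for start in range(len(text) - len(motif) + 1)
        if all(c == WILD or c == text[start + d] for d, c in enumerate(motif))
    ]

def proposition7_motifs(k: int, q: int):
    """Yield (motif, expected location list) for every choice of
    1 <= p_1 < ... < p_{q-1} <= k-1 (requires 2 <= q <= k)."""
    for ps in combinations(range(1, k), q - 1):
        body = ["A"] * k
        for p in ps:
            body[p] = WILD
        # Wild cards at p_1..p_{q-1} and at position k, then A^{p_1}.
        motif = "".join(body) + WILD + "A" * ps[0]
        expected = [0] + sorted(k - p for p in ps)
        yield motif, expected

def count_checked_motifs(k: int, q: int) -> int:
    """Number of motifs whose location list matches the one in the proof."""
    text = "A" * k + "T" + "A" * k
    checked = 0
    for motif, expected in proposition7_motifs(k, q):
        assert occurrences(motif, text) == expected
        checked += 1
    return checked
```

For k = 5 and q = 3 the check visits `comb(4, 2)` motifs, matching the count $\binom{k-1}{q-1}$ used in the proof of Theorem 5.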

5.2 An Upper Bound of $\binom{n-1}{q-1}$ Tiling Motifs

We now prove that $\binom{n-1}{q-1}$ is an upper bound for the size of a basis of tiling motifs for a string $s$ and quorum $q \ge 2$. Let us denote as before such a basis by $B$. To prove the upper bound, we use again the notion of a merge, except that it now involves $q$ strings. The operator $\oplus$ between the elements of $\Sigma$ extends to more than two arguments, so that the result is a $\circ$ if at least two arguments differ. Let $k$ denote now an array of $q - 1$ positive values $k_1, \ldots, k_{q-1}$ with $1 \le k_i < k_j \le n - 1$ for all $1 \le i < j \le q - 1$.

Definition 9. Let $s_k$ denote the string such that its $j$th character is $s_k[j] = s[j] \oplus s[j + k_1] \oplus \cdots \oplus s[j + k_{q-1}]$ for all integers $j$. $\mathit{Merge}_k$ is the pattern obtained by removing all the leading and trailing $\circ$s in $s_k$ (that is, appearing before the leftmost solid character and after the rightmost solid character).

Lemmas 5 and 6 reported below extend Lemmas 1 and 2 for $q > 2$.

Lemma 5. If $\mathit{Merge}_k$ exists for quorum $q$, then it must be a maximal motif.

Proof. Let $x = \mathit{Merge}_k$ denote the (nonempty) pattern, and let $s_k[i]$ be its first character, which is solid by Definition 9. Since $x$ occurs at least $q$ times in $s$, at positions $i, i + k_1, \ldots, i + k_{q-1}$, then $x$ is a motif for quorum $q$. We show that $x$ is maximal. Suppose it is not maximal. By Definition 5, there exists $y \ne x$ such that $x$ occurs in $y$ and $L_y = L_x + d$ for some integer $d$. This implies there exists at least one position $j$ with $0 \le j < |y|$ such that $y[j] = \sigma \in \Sigma$ and $x[j + d] = \circ$. Since

$$x[j + d] = s[i + j + d] \oplus s[i + j + k_1 + d] \oplus \cdots \oplus s[i + j + k_{q-1} + d],$$

then at least one among $i + d, i + k_1 + d, \ldots, i + k_{q-1} + d$ is not an occurrence of $y$, contradicting the hypothesis that $L_y = L_x + d$ (since $i, i + k_1, \ldots, i + k_{q-1} \in L_x$). □

Lemma 6. For each tiling motif $x$ in the basis $B$ with quorum $q$, there is at least one $k$ for which $\mathit{Merge}_k = x$.

Proof. If $|L_x| = q$ and $L_x = \{i_1, \ldots, i_q\}$ with $i_1 < \cdots < i_q$, then $x = \mathit{Merge}_k$ where $k$ is the array of values $i_2 - i_1, i_3 - i_1, \ldots, i_q - i_1$. Let us now consider the case where $|L_x| > q$. Given any $q$-tuple $i_1, \ldots, i_q \in L_x$, let $u_k$ denote $s[i_1..i_1 + |x| - 1] \oplus \cdots \oplus s[i_q..i_q + |x| - 1]$, which is a substring of $\mathit{Merge}_k$ introduced in Definition 9. We have that $x \preceq u_k$ and $L_x = \bigcup_{i_1, i_2, \ldots, i_q \in L_x} L_{u_k}$. Since each $u_k$ for $i_1, i_2, \ldots, i_q \in L_x$ is a substring of $\mathit{Merge}_k$, we infer that $L_x = \bigcup_{i_1, i_2, \ldots, i_q \in L_x} \left(L_{\mathit{Merge}_k} + \delta_k\right)$, where the $\delta_k$s are non-negative integers. By Definition 7, if $\mathit{Merge}_k$ were different from $x$, then $x$ would not be tiling, which is a contradiction. Therefore, at least one $\mathit{Merge}_k$ is $x$. □

The following property of tiling bases follows from Lemmas 5 and 6.

Theorem 6. Given a string $s$ of length $n$ and a quorum $q \ge 2$, let $M$ be the set of $\mathit{Merge}_k$, for any of the $\binom{n-1}{q-1}$ possible choices of $k$ for which $\mathit{Merge}_k$ exists. The basis $B$ of tiling motifs for $s$ satisfies $B \subseteq M$ and, therefore, the size of $B$ is at most $\binom{n-1}{q-1}$.

The tiling motifs in our basis appear in $s$ for a total of $q\binom{n-1}{q-1}$ times at most. A variation of the algorithm given in Section 4.3 gives a pseudopolynomial-time complexity of

$$O\left(q^2 \binom{n-1}{q-1}^2\right).$$

When this upper bound is combined with the lower bound of Section 5.1, we obtain that there exists a polynomial-time algorithm for finding the basis if and only if either $q = O(1)$ or $q = n - O(1)$.

6 CONCLUSIONS

The work presented in this paper is theoretical in nature, but it should be clear by now that its practical consequences, particularly (but not exclusively) for computational biology, are relevant. Whether motifs as patterns are used for inferring binding sites or repeats of any length, for characterizing sequences, as a filtering step in a whole genome comparison algorithm, or before inferring PSSMs, we show that wild cards alone are not enough for a biologically satisfying definition of the patterns of interest. Simply throwing away the pattern-type of motif detection is not a good way to address the problem. This is confirmed by various biological publications [24], [7] as well as by the not yet published (but already publicly available) results of a first motif detection competition, http://bio.cs.washington.edu/assessment/. Even if patterns are not the best way of modeling biological features, they deserve an important function in any future improved algorithm for inferring motifs ab initio from biological sequences. As such, the purpose of this paper is to shed some further light on the inner structure of one important type of motif.

ACKNOWLEDGMENTS

Many suggestions from the anonymous referees greatly improved the original form of this paper. The authors are thankful to them for this and to M.H. ter Beek for improving the English. A preliminary version of the results in this paper has been described in the technical report IGM-2002-10, July 2002 [20], and in [21]. Work was partially supported by the French program bioinformatique EPST 2002 “Algorithms for Modelling and Inference Problems in Molecular Biology.” N. Pisanti and R. Grossi were partially supported by the Italian PRIN project “ALINWEB: Algorithmics for Internet and the Web.” M.-F. Sagot was partially supported by CNRS-INRIA-INRA-INSERM action BioInformatique and the Wellcome Trust Foundation. M. Crochemore was partially supported by CNRS action AlBio, NATO Science Programme grant PST.CLG.977017, and the Wellcome Trust Foundation.

REFERENCES

[1] A. Aho and M. Corasick, “Efficient String Matching: An Aid to Bibliographic Search,” Comm. ACM, vol. 18, no. 6, pp. 333-340, 1975.

[2] A. Apostolico and L. Parida, “Incremental Paradigms of Motif Discovery,” J. Computational Biology, vol. 11, no. 1, pp. 15-25, 2004.

[3] R. Baeza-Yates and G. Gonnet, “A New Approach to Text Searching,” Comm. ACM, vol. 35, pp. 74-82, 1992.

[4] A. Brazma, I. Jonassen, I. Eidhammer, and D. Gilbert, “Approaches to the Automatic Discovery of Patterns in Biosequences,” J. Computational Biology, vol. 5, pp. 279-305, 1998.

[5] M. Crochemore and W. Rytter, Jewels of Stringology. World Scientific Publishing, 2002.

[6] E. Eskin, “From Profiles to Patterns and Back Again: A Branch and Bound Algorithm for Finding Near Optimal Motif Profiles,” RECOMB’04: Proc. Eighth Ann. Int’l Conf. Computational Molecular Biology, pp. 115-124, 2004.

[7] E. Eskin, U. Keich, M. Gelfand, and P. Pevzner, “Genome-Wide Analysis of Bacterial Promoter Regions,” Proc. Pacific Symp. Biocomputing, pp. 29-40, 2003.

[8] M. Fischer and M. Paterson, “String Matching and Other Products,” SIAM AMS Complexity of Computation, R. Karp, ed., pp. 113-125, 1974.

[9] M. Gribskov, A. McLachlan, and D. Eisenberg, “Profile Analysis: Detection of Distantly Related Proteins,” Proc. Nat’l Academy of Sciences, vol. 84, no. 13, pp. 4355-4358, 1987.

[10] D. Gusfield, Algorithms on Strings, Trees and Sequences: Computer Science and Computational Biology. Cambridge Univ. Press, 1997.

[11] G.Z. Hertz and G.D. Stormo, “Escherichia Coli Promoter Sequences: Analysis and Prediction,” Methods in Enzymology, vol. 273, pp. 30-42, 1996.

[12] C.E. Lawrence, S.F. Altschul, M.S. Boguski, J.S. Liu, A.F. Neuwald, and J.C. Wooton, “Detecting Subtle Sequence Signals: A Gibbs Sampling Strategy for Multiple Alignment,” Science, vol. 262, pp. 208-214, 1993.

[13] C.E. Lawrence and A.A. Reilly, “An Expectation Maximization (EM) Algorithm for the Identification and Characterization of Common Sites in Unaligned Biopolymer Sequences,” Proteins: Structure, Function, and Genetics, vol. 7, pp. 41-51, 1990.

[14] L. Marsan and M.-F. Sagot, “Algorithms for Extracting Structured Motifs Using a Suffix Tree with an Application to Promoter and Regulatory Site Consensus Identification,” J. Computational Biology, vol. 7, pp. 345-362, 2000.

[15] W. Miller, “Comparison of Genomic DNA Sequences: Solved and Unsolved Problems,” Bioinformatics, vol. 17, pp. 391-397, 2001.

[16] G. Myers, “A Fast Bit-Vector Algorithm for Approximate String Matching Based on Dynamic Programming,” J. ACM, vol. 46, no. 3, pp. 395-415, 1999.

[17] L. Parida, I. Rigoutsos, A. Floratos, D. Platt, and Y. Gao, “Pattern Discovery on Character Sets and Real-Valued Data: Linear Bound on Irredundant Motifs and Efficient Polynomial Time Algorithm,” Proc. SIAM Symp. Discrete Algorithms (SODA), 2000.

[18] L. Parida, I. Rigoutsos, and D. Platt, “An Output-Sensitive Flexible Pattern Discovery Algorithm,” Combinatorial Pattern Matching, A. Amir and G. Landau, eds., pp. 131-142, Springer-Verlag, 2001.

[19] J. Pelfrêne, S. Abdeddaïm, and J. Alexandre, “Extracting Approximate Patterns,” Combinatorial Pattern Matching, pp. 328-347, Springer-Verlag, 2003.

[20] N. Pisanti, M. Crochemore, R. Grossi, and M.-F. Sagot, “A Basis for Repeated Motifs in Pattern Discovery and Text Mining,” Technical Report IGM 2002-10, Institut Gaspard-Monge, Univ. of Marne-la-Vallée, July 2002.

[21] N. Pisanti, M. Crochemore, R. Grossi, and M.-F. Sagot, “A Basis of Tiling Motifs for Generating Repeated Patterns and Its Complexity for Higher Quorum,” Math. Foundations of Computer Science (MFCS), B. Rovan and P. Vojtas, eds., pp. 622-631, Springer-Verlag, 2003.

[22] N. Pisanti, M. Crochemore, R. Grossi, and M.-F. Sagot, String Algorithmics, chapter: A Comparative Study of Bases for Motif Inference, pp. 195-225, KCL Press, 2004.

[23] D. Pollard, C. Bergman, J. Stoye, S. Celniker, and M. Eisen, “Benchmarking Tools for the Alignment of Functional Noncoding DNA,” BMC Bioinformatics, vol. 5, pp. 6-23, 2004.

[24] A. Vanet, L. Marsan, and M.-F. Sagot, “Promoter Sequences and Algorithmical Methods for Identifying Them,” Research in Microbiology, vol. 150, pp. 779-799, 1999.

[25] S. Wu and U. Manber, “Path-Matching Problems,” Algorithmica, vol. 8, no. 2, pp. 89-101, 1992.

Nadia Pisanti received the laurea degree in computer science in 1996 from the University of Pisa (Italy), the French DEA in fundamental informatics with applications to genome treatment in 1998 from the University of Marne-la-Vallée (France), and the PhD degree in computer science in 2002 from the University of Pisa. She has been a postdoctoral researcher at INRIA and at the University of Paris 13, and she is currently a research fellow in the Department of Computer Science of the University of Pisa. Her interests are in computational biology and, in particular, in motif extraction and genome rearrangement.

Maxime Crochemore received the PhD degree in 1978 and the Doctorat d’État in 1983 from the University of Rouen. He received his first professorship position at the University of Paris-Nord in 1975, where he acted as President of the Department of Mathematics and Computer Science for two years. He became a professor at the University Paris 7 in 1989 and was involved in the creation of the University of Marne-la-Vallée, where he is presently a professor. He also created the Computer Science Research Laboratory of this university in 1991. Since then, he has been the director of the laboratory, which now has around 45 permanent researchers. Professor Crochemore has been a senior research fellow at King’s College London since 2002. He has been the recipient of several French grants on string algorithmics and bioinformatics. He participated in a good number of international projects in algorithmics and supervised 20 PhD students.


Roberto Grossi received the laurea degree in computer science in 1988, and the PhD degree in computer science in 1993, at the University of Pisa. He joined the University of Florence in 1993 as an associate researcher. Since 1998, he has been an associate professor of computer science in the Dipartimento di Informatica, University of Pisa. He has visited several international research institutions. His interests are in the design and analysis of algorithms and data structures, namely, dynamic and external memory algorithms, graph algorithms, experimental and algorithm engineering, fast lookup tables and dictionaries, pattern matching algorithms, text indexing, and compressed data structures.

Marie-France Sagot received the BSc degree in computer science from the University of São Paulo, Brazil, in 1991, the PhD degree in theoretical computer science and applications from the University of Marne-la-Vallée, France, in 1996, and the Habilitation from the same university in 2000. From 1997 to 2001, she worked as a research associate at the Pasteur Institute in Paris, France. In 2001, she moved to Lyon, France, as a research associate at the INRIA, the French National Institute for Research in Computer Science and Control. Since 2003, she has been director of research at the INRIA. Her research interests are in computational biology, algorithmics, and combinatorics.


50 IEEE/ACM TRANSACTIONS ON COMPUTATIONAL BIOLOGY AND BIOINFORMATICS, VOL. 2, NO. 1, JANUARY-MARCH 2005

Multiseed Lossless Filtration

Gregory Kucherov, Laurent Noé, and Mikhail Roytberg

Abstract—We study a method of seed-based lossless filtration for approximate string matching and related bioinformatics applications. The method is based on a simultaneous use of several spaced seeds rather than a single seed as studied by Burkhardt and Kärkkäinen [1]. We present algorithms to compute several important parameters of seed families, study their combinatorial properties, and describe several techniques to construct efficient families. We also report a large-scale application of the proposed technique to the problem of oligonucleotide selection for an EST sequence database.

Index Terms—Filtration, string matching, gapped seed, gapped q-gram, local alignment, sequence similarity, seed family, multiple spaced seeds, dynamic programming, EST, oligonucleotide selection.

1 INTRODUCTION

FILTERING is a widely used technique in biosequence analysis. Applied to the approximate string matching problem [2], it can be summarized by the following two-stage scheme: To find approximate occurrences (matches) of a given string in a sequence (text), one first quickly discards (filters out) those sequence regions where matches cannot occur, and then checks the remaining parts of the sequence for actual matches. The filtering is done according to small patterns of a specified form that the searched string is assumed to share, in the exact way, with its approximate occurrences. A similar filtration scheme is used by heuristic local alignment algorithms ([3], [4], [5], [6], to mention a few): They first identify potential similarity regions that share some patterns and then actually check whether those regions represent a significant similarity by computing a corresponding alignment.

Two types of filtering should be distinguished: lossless and lossy. A lossless filtration guarantees to detect all sequence fragments of interest, while a lossy filtration may miss some of them, but still tries to detect a majority of them. Local alignment algorithms usually use a lossy filtration. On the other hand, lossless filtration has been studied in the context of the approximate string matching problem [7], [1]. In this paper, we focus on lossless filtration.

In the case of lossy filtration, its efficiency is measured by two parameters, usually called selectivity and sensitivity. The sensitivity measures the part of sequence fragments of interest that are missed by the filter (false negatives), and the selectivity indicates what part of detected candidate fragments do not actually represent a solution (false positives). In the case of lossless filtration, only the selectivity parameter makes sense and is therefore the main characteristic of the filtration efficiency.

The choice of patterns that must be contained in the searched sequence fragments is a key ingredient of the filtration algorithm. Gapped seeds (spaced seeds, gapped q-grams) have been recently shown to significantly improve the filtration efficiency over the “traditional” technique of contiguous seeds. In the framework of lossy filtration for sequence alignment, the use of designed gapped seeds has been introduced by the PATTERNHUNTER method [4] and then used by some other algorithms (e.g., [5], [6]). In [8], [9], spaced seeds have been shown to improve indexing schemes for similarity search in sequence databases. The estimation of the sensitivity of spaced seeds (as well as of some extended seed models) has been the subject of several recent studies [10], [11], [12], [13], [14], [15]. In the framework of lossless filtration for approximate pattern matching, gapped seeds were studied in [1] (see also [7]) and have also been shown to increase the filtration efficiency considerably.

In this paper, we study an extension of the lossless single-seed filtration technique [1]. The extension is based on using seed families rather than individual seeds. The idea of simultaneous use of multiple seeds for DNA local alignment was already envisaged in [4] and applied in the PATTERNHUNTER II software [16]. The problem of designing efficient seed families has also been studied in [17]. In [18], multiple seeds have been applied to protein search. However, the issues analyzed in the present paper are quite different, due to the proposed requirement for the search to be lossless.

The rest of the paper is organized as follows: After formally introducing the concept of multiple seed filtering in Section 2, Section 3 is devoted to dynamic programming algorithms to compute several important parameters of seed families. In Section 4, we first study several combinatorial properties of families of seeds and, in particular, seeds having a periodic structure. These results are used to obtain a method for constructing efficient seed families. We also outline a heuristic genetic programming algorithm for constructing seed families. Finally, in Section 5, we present several seed families we computed, and we report a large-scale experimental application of the method to a practical problem of oligonucleotide selection.

IEEE/ACM TRANSACTIONS ON COMPUTATIONAL BIOLOGY AND BIOINFORMATICS, VOL. 2, NO. 1, JANUARY-MARCH 2005 51

• G. Kucherov and L. Noé are with the INRIA/LORIA, 615, rue du Jardin Botanique, B.P. 101, 54602 Villers-lès-Nancy, France. E-mail: {Gregory.Kucherov, Laurent.Noe}@loria.fr.

• M. Roytberg is with the Institute of Mathematical Problems in Biology, Pushchino, Moscow Region, Russia. E-mail: [email protected].

Manuscript received 24 Sept. 2004; revised 13 Dec. 2004; accepted 10 Jan. 2005; published online 30 Mar. 2005. For information on obtaining reprints of this article, please send e-mail to: [email protected], and reference IEEECS Log Number TCBB-0154-0904.

1545-5963/05/$20.00 © 2005 IEEE. Published by the IEEE CS, CI, and EMB Societies & the ACM

2 MULTIPLE SEED FILTERING

A seed $Q$ (also called a spaced seed or gapped q-gram) is a list $\{p_1, p_2, \ldots, p_d\}$ of non-negative integers, called matching positions, such that $p_1 < p_2 < \ldots < p_d$. By convention, we always assume $p_1 = 0$. The span of a seed $Q$, denoted $s(Q)$, is the quantity $p_d + 1$. The number $d$ of matching positions is called the weight of the seed and denoted $w(Q)$. Often, we will use a more visual representation of seeds, adopted in [1], as words of length $s(Q)$ over the two-letter alphabet {#, -}, where # occurs at all matching positions and - at all positions in between. For example, seed {0, 1, 2, 4, 6, 9, 10, 11} of weight 8 and span 12 is represented by the word ###-#-#--###. The character - is called a joker. Note that, unless otherwise stated, the seed has the character # at its first and last positions.
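The two representations of a seed are easy to convert into one another. The helpers below are an illustrative sketch; their names (`seed_to_word`, `word_to_seed`, `span`, `weight`) are hypothetical, and a seed is assumed to be given as a sorted list of matching positions starting at 0.

```python
def seed_to_word(positions: list[int]) -> str:
    """Word over {#, -} for a seed given as sorted matching positions."""
    matching = set(positions)
    return "".join("#" if p in matching else "-" for p in range(positions[-1] + 1))

def word_to_seed(word: str) -> list[int]:
    """Matching positions of a seed given as a word over {#, -}."""
    return [i for i, c in enumerate(word) if c == "#"]

def span(positions: list[int]) -> int:
    """s(Q) = p_d + 1."""
    return positions[-1] + 1

def weight(positions: list[int]) -> int:
    """w(Q) = d, the number of matching positions."""
    return len(positions)
```

Applied to the seed of the example, `seed_to_word([0, 1, 2, 4, 6, 9, 10, 11])` gives the word `###-#-#--###`, with span 12 and weight 8.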

Intuitively, a seed specifies the set of patterns that, if shared by two sequences, indicate a possible similarity between them. Two sequences are similar if the Hamming distance between them is smaller than a certain threshold. For example, sequences CACTCGT and CACACTT are similar within Hamming distance 2 and this similarity is detected by the seed ##-# at position 2. We are interested in seeds that detect all similarities of a given length with a given Hamming distance.

Formally, a gapless similarity (hereafter simply similarity) of two sequences of length $m$ is a binary word $w \in \{0, 1\}^m$ interpreted as a sequence of matches (1s) and mismatches (0s) of individual characters from the alphabet of input sequences. A seed $Q = \{p_1, p_2, \ldots, p_d\}$ matches a similarity $w$ at position $i$, $1 \le i \le m - p_d$, iff for every $j \in [1..d]$, we have $w[i + p_j] = 1$. In this case, we also say that seed $Q$ has an occurrence in similarity $w$ at position $i$. A seed $Q$ is said to detect a similarity $w$ if $Q$ has at least one occurrence in $w$.

Given a similarity length $m$ and a number of mismatches $k$, consider all similarities of length $m$ containing $k$ 0s and $(m - k)$ 1s. These similarities are called $(m, k)$-similarities. A seed $Q$ solves the detection problem $(m, k)$ (for short, the $(m, k)$-problem) iff all of the $\binom{m}{k}$ $(m, k)$-similarities $w$ are detected by $Q$. For example, one can check that seed #-##--#-## solves the $(15, 2)$-problem.
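The definition can be checked directly by enumerating all $\binom{m}{k}$ similarities. The following brute-force sketch is meant only for small m and k and is not one of the paper's algorithms; the function names are hypothetical, positions are 0-based in the code, and the checker accepts a list of seeds so that a single seed is passed as a list of one.

```python
from itertools import combinations

def word_to_seed(word: str) -> list[int]:
    """Matching positions of a seed written over {#, -}."""
    return [i for i, c in enumerate(word) if c == "#"]

def detects(seed: list[int], w: list[int]) -> bool:
    """True if the seed has at least one occurrence in similarity w."""
    seed_span = seed[-1] + 1
    return any(
        all(w[i + p] == 1 for p in seed)
        for i in range(len(w) - seed_span + 1)
    )

def solves(seeds: list[list[int]], m: int, k: int) -> bool:
    """True if every (m, k)-similarity is detected by at least one seed."""
    for zeros in combinations(range(m), k):
        w = [1] * m
        for z in zeros:
            w[z] = 0
        if not any(detects(seed, w) for seed in seeds):
            return False
    return True
```

For example, `solves([word_to_seed("#-##--#-##")], 15, 2)` confirms the claim in the text, while the same seed fails for three mismatches (mismatches at 0-based positions 3, 4, and 14 destroy all six placements).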

Note that the weight of the seed is directly related to the selectivity of the corresponding filtration procedure. A larger weight improves the selectivity, as fewer similarities will pass through the filter. On the other hand, a smaller weight reduces the filtration efficiency. Therefore, the goal is to solve an $(m, k)$-problem by a seed with the largest possible weight.

Solving $(m, k)$-problems by a single seed has been studied by Burkhardt and Kärkkäinen [1]. An extension we propose here is to use a family of seeds, instead of a single seed, to solve the $(m, k)$-problem. Formally, a finite family of seeds $F = \langle Q_l \rangle_{l=1}^{L}$ solves an $(m, k)$-problem iff for any $(m, k)$-similarity $w$, there exists a seed $Q_l \in F$ that detects $w$.

Note that the seeds of the family are used in the complementary (or disjunctive) fashion, i.e., a similarity is detected if it is detected by one of the seeds. This differs from the conjunctive approach of [7] where a similarity should be detected by two seeds simultaneously.

The following example motivates the use of multiple seeds. In [1], it has been shown that a seed solving the $(25, 2)$-problem has the maximal weight 12. The only such seed (up to reversal) is

###-#--###-#--###-#.

However, the problem can be solved by the family composed of the following two seeds of weight 14:

#####-##---#####-##

and

#-##---#####-##---####.

Clearly, using these two seeds increases the selectivity of the search, as only similarities having 14 or more matching characters pass the filter versus 12 matching characters in the case of a single seed. On uniform Bernoulli sequences, this results in the decrease of the number of candidate similarities by the factor of $|A|^2/2$, where $A$ is the input alphabet. This illustrates the advantage of the multiple seed approach: it makes it possible to increase the selectivity while preserving a lossless search. The price to pay for this gain in selectivity is multiplying the work on identifying the seed occurrences. In the case of large sequences, however, this is largely compensated by the decrease in the number of false positives caused by the increase of the seed weight.
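The factor $|A|^2/2$ can be recovered with a short calculation, sketched here under the assumption that the two compared positions are independent and uniformly distributed, so that a seed of weight $w$ occurs at a given pair of positions with probability $|A|^{-w}$, and that the two seeds of the family contribute additively.

```latex
\[
  \frac{\Pr[\text{candidate, one seed of weight } 12]}
       {\Pr[\text{candidate, two seeds of weight } 14]}
  \;\approx\; \frac{|A|^{-12}}{2\,|A|^{-14}}
  \;=\; \frac{|A|^{2}}{2},
  \qquad\text{e.g. } \frac{4^{2}}{2} = 8 \text{ for the DNA alphabet } (|A| = 4).
\]
```

Under these assumptions, the two-seed family yields roughly eight times fewer random candidates on DNA than the single seed of weight 12.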

3 COMPUTING PROPERTIES OF SEED FAMILIES

Burkhardt and Kärkkäinen [1] proposed a dynamic programming algorithm to compute the optimal threshold of a given seed, that is, the minimal number of its occurrences over all possible $(m, k)$-similarities. In this section, we describe an extension of this algorithm for seed families and, on the other hand, describe dynamic programming algorithms for computing two other important parameters of seed families that we will use in a later section.

Consider an $(m, k)$-problem and a family of seeds $F = \langle Q_l \rangle_{l=1}^{L}$. We need the following notations:

• $s_{max} = \max\{s(Q_l)\}_{l=1}^{L}$, $s_{min} = \min\{s(Q_l)\}_{l=1}^{L}$,

• for a binary word $w$ and a seed $Q_l$, $\mathit{suff}(Q_l, w) = 1$ if $Q_l$ matches $w$ at position $(|w| - s(Q_l) + 1)$ (i.e., matches a suffix of $w$), otherwise $\mathit{suff}(Q_l, w) = 0$,

• $\mathit{last}(w) = 1$ if the last character of $w$ is 1, otherwise $\mathit{last}(w) = 0$, and

• $\mathit{zeros}(w)$ is the number of 0s in $w$.

3.1 Optimal Threshold

Given an $(m, k)$-problem, a family of seeds $F = \langle Q_l \rangle_{l=1}^{L}$ has the optimal threshold $T_F(m, k)$ if every $(m, k)$-similarity has at least $T_F(m, k)$ occurrences of seeds of $F$ and this is the maximal number with this property. Note that overlapping occurrences of a seed as well as occurrences of different seeds at the same position are counted separately. For example, the singleton family {###-##} has threshold 2 for the $(15, 2)$-problem.

Clearly, $F$ solves an $(m, k)$-problem if and only if $T_F(m, k) > 0$. If $T_F(m, k) > 1$, then one can strengthen the detection criterion by requiring several seed occurrences for a similarity to be detected. This shows the importance of the optimal threshold parameter.
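As a reference point for small instances, the optimal threshold can be computed straight from its definition by enumerating all $(m, k)$-similarities. The sketch below is a brute-force check, not the dynamic programming algorithm of the paper, and its function names are hypothetical.

```python
from itertools import combinations

def word_to_seed(word: str) -> list[int]:
    """Matching positions of a seed written over {#, -}."""
    return [i for i, c in enumerate(word) if c == "#"]

def count_occurrences(seeds: list[list[int]], w: list[int]) -> int:
    """Total number of occurrences of all seeds in similarity w.
    Overlapping occurrences and different seeds at the same position
    are counted separately."""
    total = 0
    for seed in seeds:
        seed_span = seed[-1] + 1
        for i in range(len(w) - seed_span + 1):
            if all(w[i + p] == 1 for p in seed):
                total += 1
    return total

def optimal_threshold(seeds: list[list[int]], m: int, k: int) -> int:
    """Minimum number of seed occurrences over all (m, k)-similarities."""
    best = None
    for zeros in combinations(range(m), k):
        w = [1] * m
        for z in zeros:
            w[z] = 0
        occ = count_occurrences(seeds, w)
        best = occ if best is None else min(best, occ)
    return best
```

On the example of the text, `optimal_threshold([word_to_seed("###-##")], 15, 2)` returns 2: the seed has ten possible placements and two mismatches can destroy at most eight of them.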

We now describe a dynamic programming algorithm for computing the optimal threshold $T_F(m, k)$. For a binary word $w$, consider the quantity $T_F(m, k, w)$ defined as the minimal number of occurrences of seeds of $F$ in all $(m, k)$-similarities which have the suffix $w$. By definition, $T_F(m, k) = T_F(m, k, \varepsilon)$. Assume that we precomputed values $\mathcal{T}_F(j, w) = T_F(s_{max}, j, w)$, for all $j \le \max\{k, s_{max}\}$, $|w| = s_{max}$. The algorithm is based on the following recurrence relations on $T_F(i, j, w)$, for $i \ge s_{max}$:

$$
T_F(i, j, w[1..n]) =
\begin{cases}
\mathcal{T}_F(j, w), & \text{if } i = s_{max}; \\
T_F(i-1, j-1, w[1..n-1]), & \text{if } w[n] = 0; \\
T_F(i-1, j, w[1..n-1]) + \left[\sum_{l=1}^{L} \mathit{suff}(Q_l, w)\right], & \text{if } n = s_{max}; \\
\min\{T_F(i, j, 1.w),\; T_F(i, j, 0.w)\}, & \text{if } \mathit{zeros}(w) < j; \\
T_F(i, j, 1.w), & \text{if } \mathit{zeros}(w) = j.
\end{cases}
$$

The first relation is an initial condition of the recurrence. The second one is based on the fact that if the last symbol of $w$ is 0, then no seed can match a suffix of $w$ (as the last position of a seed is always assumed to be a matching position). The third relation reduces the size of the problem by counting the number of suffix seed occurrences. The fourth one splits the counting into two cases, by considering two possible characters occurring on the left of $w$. If $w$ already contains $j$ 0s, then only 1 can occur on the left of $w$, as stated by the last relation.

A dynamic programming implementation of the above recurrence allows one to compute $T_F(m,k,\varepsilon)$ in a bottom-up fashion, starting from the initial values $\mathcal{T}_F(j,w)$ and applying the above relations in the order in which they are given. A straightforward dynamic programming implementation requires $O(m \cdot k \cdot 2^{(s_{\max}+1)})$ time and space. However, the space complexity can be immediately improved: If values of $i$ are processed successively, then only $O(k \cdot 2^{(s_{\max}+1)})$ space is needed. Furthermore, for each $i$ and $j$, it is not necessary to consider all $2^{(s_{\max}+1)}$ different strings $w$, but only those which contain up to $j$ 0s. The number of those $w$ is $g(j, s_{\max}) = \sum_{e=0}^{j} \binom{s_{\max}}{e}$. For each $i$, $j$ ranges from 0 to $k$. Therefore, for each $i$, we need to store $f(k, s_{\max}) = \sum_{j=0}^{k} g(j, s_{\max}) = \sum_{j=0}^{k} \binom{s_{\max}}{j} \cdot (k - j + 1)$ values. This yields the same space complexity as for computing the optimal threshold for one seed [1].

The quantity $\sum_{l=1}^{L} \mathrm{suff}(Q_l, w)$ can be precomputed for all considered words $w$ in time $O(L \cdot g(k, s_{\max}))$ and space $O(g(k, s_{\max}))$, under the assumption that checking an individual match is done in constant time. This leads to the overall time complexity $O(m \cdot f(k, s_{\max}) + L \cdot g(k, s_{\max}))$ with the leading term $m \cdot f(k, s_{\max})$ (as $L$ is usually small compared to $m$ and $g(k, s_{\max})$ is smaller than $f(k, s_{\max})$).
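To get a feel for these counts, the following sketch evaluates $g$ and $f$ directly from their definitions and checks the closed form for $f$. It is only an illustration: the function names and the sample values $s_{\max} = 19$, $k = 2$ are chosen here and do not come from the paper.

```python
from math import comb


def g(j: int, s_max: int) -> int:
    """Number of binary words of length s_max containing at most j zeros."""
    return sum(comb(s_max, e) for e in range(j + 1))


def f(k: int, s_max: int) -> int:
    """Values stored per i: the sum of g(j, s_max) for j = 0..k."""
    return sum(g(j, s_max) for j in range(k + 1))


def f_closed(k: int, s_max: int) -> int:
    """Closed form: sum over j of C(s_max, j) * (k - j + 1)."""
    return sum(comb(s_max, j) * (k - j + 1) for j in range(k + 1))


# Illustrative values (not taken from the paper).
s_max, k = 19, 2
print(g(k, s_max), f(k, s_max), f_closed(k, s_max), 2 ** (s_max + 1))
# prints: 191 212 212 1048576
```

For these values the restricted table holds 212 entries per $i$, against roughly a million for the unrestricted enumeration of all words.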

3.2 Number of Undetected Similarities

We now describe a dynamic programming algorithm that computes another characteristic of a seed family, which will be used later in Section 4.4. Consider an $(m,k)$-problem. Given a seed family $F = \langle Q_l \rangle_{l=1}^{L}$, we are interested in the number $U_F(m,k)$ of $(m,k)$-similarities that are not detected by $F$. For a binary word $w$, define $U_F(m,k,w)$ to be the number of undetected $(m,k)$-similarities that have the suffix $w$.

Similar to [10], let $X(F)$ be the set of binary words $w$ such that 1) $|w| \le s_{\max}$, 2) for any $Q_l \in F$, $\mathrm{suff}(Q_l, 1^{s_{\max}-|w|}w) = 0$, and 3) no proper suffix of $w$ satisfies 2). Note that word 0 belongs to $X(F)$, as the last position of every seed is a matching position.

The following recurrence relations allow one to compute $U_F(i,j,w)$ for $i \le m$, $j \le k$, and $|w| \le s_{\max}$:

\[
U_F(i,j,w[1..n]) =
\begin{cases}
\binom{i-|w|}{j-\mathrm{zeros}(w)}, & \text{if } i < s_{\min};\\
0, & \text{if } \exists l \in [1..L],\ \mathrm{suff}(Q_l, w) = 1;\\
U_F(i-1,\, j-\mathrm{last}(w),\, w[1..n-1]), & \text{if } w \in X(F);\\
U_F(i,j,1.w) + U_F(i,j,0.w), & \text{if } \mathrm{zeros}(w) < j;\\
U_F(i,j,1.w), & \text{if } \mathrm{zeros}(w) = j.
\end{cases}
\]

The first condition says that if $i < s_{\min}$, then no word of length $i$ will be detected, hence the binomial coefficient. The second condition is straightforward. The third relation follows from the definition of $X(F)$ and allows us to reduce the size of the problem. The last two conditions are similar to those from the previous section.

The set $X(F)$ can be precomputed in time $O(L \cdot g(k, s_{\max}))$ and the worst-case time complexity of the whole algorithm remains $O(m \cdot f(k, s_{\max}) + L \cdot g(k, s_{\max}))$.
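For small instances, $U_F(m,k)$ can be cross-checked by plain enumeration, which is handy when testing an implementation of the recurrence. The sketch below is such a brute-force reference and not the dynamic programming algorithm itself. It assumes that seeds are written as strings over `#` (match) and `-` (joker), and that an $(m,k)$-similarity is a binary word of length $m$ with exactly $k$ zeros, which is the reading suggested by the binomial coefficient in the first case of the recurrence. The function names are hypothetical.

```python
from itertools import combinations


def detects(seed: str, zeros: frozenset, m: int) -> bool:
    """True if the seed fits at some position of a length-m word so that
    no '#' of the seed lies over a mismatch (a position in `zeros`)."""
    offsets = [p for p, c in enumerate(seed) if c == "#"]
    return any(
        all(start + p not in zeros for p in offsets)
        for start in range(m - len(seed) + 1)
    )


def count_undetected(family: list, m: int, k: int) -> int:
    """Brute-force U_F(m, k): words of length m with exactly k zeros
    that no seed of the family detects."""
    return sum(
        1
        for zs in combinations(range(m), k)
        if not any(detects(q, frozenset(zs), m) for q in family)
    )


print(count_undetected(["##"], 3, 1))  # 1 (only the word 101 escapes)
print(count_undetected(["##"], 4, 1))  # 0
```

The enumeration visits $\binom{m}{k}$ words, so it is only practical for small $m$ and $k$.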

3.3 Contribution of a Seed

Using a similar dynamic programming technique, one can compute, for a given seed of the family, the number of $(m,k)$-similarities that are detected only by this seed and not by the others. Together with the number of undetected similarities, this parameter will be used later in Section 4.4.

Given an $(m,k)$-problem and a family $F = \langle Q_l \rangle_{l=1}^{L}$, we define $S_F(m,k,l)$ to be the number of $(m,k)$-similarities detected by the seed $Q_l$ exclusively (through one or several occurrences), and $S_F(m,k,l,w)$ to be the number of those similarities ending with the suffix $w$. A dynamic programming algorithm similar to the one described in the previous sections can be applied to compute $S_F(m,k,l)$. The recurrence is given below.

KUCHEROV ET AL.: MULTISEED LOSSLESS FILTRATION 53

\[
S_F(i,j,l,w[1..n]) =
\begin{cases}
0, & \text{if } i < s_{\min} \text{ or } \exists l' \ne l,\ \mathrm{suff}(Q_{l'}, w) = 1;\\
S_F(i-1,\, j-1,\, l,\, w[1..n-1]), & \text{if } w[n] = 0;\\
S_F(i-1,\, j,\, l,\, w[1..n-1]), & \text{if } n = |Q_l| \text{ and } \mathrm{suff}(Q_l, w) = 0;\\
S_F(i-1,\, j,\, l,\, w[1..n-1]) + U_F(i-1,\, j,\, w[1..n-1]), & \text{if } n = s_{\max} \text{ and } \mathrm{suff}(Q_l, w) = 1 \text{ and } \forall l' \ne l,\ \mathrm{suff}(Q_{l'}, w) = 0;\\
S_F(i,j,l,1.w[1..n]) + S_F(i,j,l,0.w[1..n]), & \text{if } \mathrm{zeros}(w) < j;\\
S_F(i,j,l,1.w[1..n]), & \text{if } \mathrm{zeros}(w) = j.
\end{cases}
\]

The third and fourth relations play the principal role: if $Q_l$ does not match a suffix of $w[1..n]$, then we simply drop out the last letter. If $Q_l$ matches a suffix of $w[1..n]$, but no other seed does, then we count prefixes matched by $Q_l$ exclusively (term $S_F(i-1, j, l, w[1..n-1])$) together with prefixes matched by no seed at all (term $U_F(i-1, j, w[1..n-1])$). The latter is computed by the algorithm of the previous section.

The complexity of computing $S_F(m,k,l)$ for a given $l$ is the same as the complexity of dynamic programming algorithms from the previous sections.
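The exclusive contribution of each seed can likewise be checked by enumeration on toy instances. The sketch below is a brute-force reference, not the recurrence above. It assumes seeds written over `#` and `-` and similarities with exactly $k$ zeros; the function names and the two-seed toy family are chosen here for illustration.

```python
from itertools import combinations


def detects(seed: str, zeros: frozenset, m: int) -> bool:
    """True if some placement of the seed has no '#' over a mismatch."""
    offsets = [p for p, c in enumerate(seed) if c == "#"]
    return any(
        all(start + p not in zeros for p in offsets)
        for start in range(m - len(seed) + 1)
    )


def exclusive_counts(family: list, m: int, k: int) -> list:
    """For each seed, count the similarities detected by that seed only."""
    counts = [0] * len(family)
    for zs in combinations(range(m), k):
        zeros = frozenset(zs)
        hits = [idx for idx, q in enumerate(family) if detects(q, zeros, m)]
        if len(hits) == 1:
            counts[hits[0]] += 1
    return counts


print(exclusive_counts(["##", "#-#"], 3, 1))  # [2, 1]
```

In this toy case the words 011 and 110 are found only by `##` and the word 101 only by `#-#`, so neither seed is redundant.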

4 SEED DESIGN

In the previous section we showed how to compute various useful characteristics of a given family of seeds. A much more difficult task is to find an efficient seed family that solves a given $(m,k)$-problem. Note that there exists a trivial solution where the family consists of all $\binom{m}{k}$ position combinations, but this is in general unacceptable in practice because of a huge number of seeds. Our goal is to find families of reasonable size (typically, with the number of seeds smaller than 10), with a good filtration efficiency.

In this section, we present several results that contribute

to this goal. In Section 4.1, we start with the case of a single

seed with a fixed number of jokers and show, in particular,

that for one joker, there exists one best seed in a sense that

will be defined. We then show in Section 4.2 that a solution

for a larger problem can be obtained from a smaller one by a

regular expansion operation. In Section 4.3, we focus on

seeds that have a periodic structure and show how those

seeds can be constructed by iterating some smaller seeds.

We then show a way to build efficient families of periodic

seeds. Finally, in Section 4.4, we briefly describe a heuristic

approach to constructing efficient seed families that we

used in the experimental part of this work presented in

Section 5.

4.1 Single Seeds with a Fixed Number of Jokers

Assume that we fixed a class of seeds of interest (e.g., seeds of a given minimal weight). One possible way to define the seed design problem is to fix a similarity length $m$ and find a seed that solves the $(m,k)$-problem with the largest possible value of $k$. A complementary definition is to fix $k$ and minimize $m$ provided that the $(m,k)$-problem is still solved. In this section, we adopt the second definition and present an optimal solution for one particular case.

For a seed $Q$ and a number of mismatches $k$, define the $k$-critical length for $Q$ as the minimal value $m$ such that $Q$ solves the $(m,k)$-problem. For a class of seeds $C$ and a value $k$, a seed $Q$ is $k$-optimal in $C$ if $Q$ has the minimal $k$-critical length among all seeds of $C$.

One interesting class of seeds $C$ is obtained by putting an upper bound on the possible number of jokers in the seed, i.e., on the number $(s(Q) - w(Q))$. We have found a general solution of the seed design problem for the class $C_1(d)$ consisting of seeds of weight $d$ with only one joker, i.e., seeds $\#^{d-r}\text{-}\#^{r}$.

Consider first the case of one mismatch, i.e., $k = 1$. A 1-optimal seed from $C_1(d)$ is $\#^{d-r}\text{-}\#^{r}$ with $r = \lfloor d/2 \rfloor$. To see this, consider an arbitrary seed $Q = \#^{p}\text{-}\#^{q}$, $p + q = d$, and assume by symmetry that $p \ge q$. Observe that the longest $(m,1)$-similarity that is not detected by $Q$ is $1^{p-1}01^{p+q}$ of length $(2p + q)$. Therefore, we have to minimize $2p + q = d + p$, and since $p \ge \lceil d/2 \rceil$, the minimum is reached for $p = \lceil d/2 \rceil$, $q = \lfloor d/2 \rfloor$.

However, for $k \ge 2$, an optimal seed has an asymmetric structure described by the following theorem.

Theorem 1. Let $d$ be an integer and $r = [d/3]$ ($[x]$ is the closest integer to $x$). For every $k \ge 2$, seed $Q(d) = \#^{d-r}\text{-}\#^{r}$ is $k$-optimal among the seeds of $C_1(d)$.

Proof. Again, consider a seed $Q = \#^{p}\text{-}\#^{q}$, $p + q = d$, and assume that $p \ge q$. Consider the longest word $S(k)$ from $(1^*0)^k1^*$, $k \ge 1$, which is not detected by $Q$ and let $L(k)$ be the length of $S(k)$. By the above remark, $S(1) = 1^{p-1}01^{p+q}$ and $L(1) = 2p + q$.

It is easily seen that for every $k$, $S(k)$ starts either with $1^{p-1}0$, or with $1^{p+q}01^{q-1}0$. Define $L'(k)$ to be the maximal length of a word from $(1^*0)^k1^*$ that is not detected by $Q$ and starts with $1^{q-1}0$. Since prefix $1^{q-1}0$ implies no additional constraint on the rest of the word, we have $L'(k) = q + L(k-1)$. Observe that $L'(1) = p + 2q$ (word $1^{q-1}01^{p+q}$). To summarize, we have the following recurrences for $k \ge 2$:

\[
L'(k) = q + L(k-1), \tag{1}
\]
\[
L(k) = \max\{p + L(k-1),\ p + q + 1 + L'(k-1)\}, \tag{2}
\]

with initial conditions $L'(1) = p + 2q$, $L(1) = 2p + q$.

Two cases should be distinguished. If $p \ge 2q + 1$, then the straightforward induction shows that the first term in (2) is always greater, and we have

\[
L(k) = (k+1)p + q, \tag{3}
\]

and the corresponding longest word is

\[
S(k) = (1^{p-1}0)^k 1^{p+q}. \tag{4}
\]

54 IEEE/ACM TRANSACTIONS ON COMPUTATIONAL BIOLOGY AND BIOINFORMATICS, VOL. 2, NO. 1, JANUARY-MARCH 2005

If $q \le p \le 2q + 1$, then by induction, we obtain

\[
L(k) =
\begin{cases}
(\ell+1)p + (k+1)q + \ell & \text{if } k = 2\ell,\\
(\ell+2)p + kq + \ell & \text{if } k = 2\ell+1,
\end{cases}
\tag{5}
\]

and

\[
S(k) =
\begin{cases}
(1^{p+q}01^{q-1}0)^{\ell}1^{p+q} & \text{if } k = 2\ell,\\
1^{p-1}0(1^{p+q}01^{q-1}0)^{\ell}1^{p+q} & \text{if } k = 2\ell+1.
\end{cases}
\tag{6}
\]

By definition of $L(k)$, seed $\#^{p}\text{-}\#^{q}$ detects any word from $(1^*0)^k1^*$ of length $(L(k) + 1)$ or more, and this is the tight bound. Therefore, we have to find $p, q$ which minimize $L(k)$. Recall that $p + q = d$, and observe that for $p \ge 2q + 1$, $L(k)$ (defined by (3)) is increasing in $p$, while for $p \le 2q + 1$, $L(k)$ (defined by (5)) is decreasing in $p$. Therefore, both functions reach their minimum when $p = 2q + 1$. Therefore, if $d \equiv 1 \pmod 3$, we obtain $q = \lfloor d/3 \rfloor$ and $p = d - q$. If $d \equiv 0 \pmod 3$, a routine computation shows that the minimum is reached at $q = d/3$, $p = 2d/3$, and if $d \equiv 2 \pmod 3$, the minimum is reached at $q = \lceil d/3 \rceil$, $p = d - q$. Putting the three cases together results in $q = [d/3]$, $p = d - q$. □

To illustrate Theorem 1, seed ####-## is optimal among all seeds of weight 6 with one joker. This means that this seed solves the $(m,2)$-problem for all $m \ge 16$ and this is the smallest possible bound over all seeds of this class. Similarly, this seed solves the $(m,3)$-problem for all $m \ge 20$, which is the best possible bound, etc.
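The bounds quoted in this illustration can be reproduced from recurrences (1) and (2). The following sketch computes $L(1), \ldots, L(k)$ for a seed with $p$ matching positions, one joker, and $q$ matching positions, and searches for the best split of a given weight $d$. The function names are hypothetical, and the search simply tries every split with $p \ge q \ge 1$ instead of using the closed forms (3) and (5).

```python
def critical_lengths(p: int, q: int, k_max: int) -> list:
    """Return [L(1), ..., L(k_max)] for the seed #^p - #^q with p >= q >= 1,
    using L(1) = 2p + q, L'(1) = p + 2q and recurrences (1) and (2)."""
    L, L_prime = 2 * p + q, p + 2 * q
    lengths = [L]
    for _ in range(2, k_max + 1):
        # Both right-hand sides use the values from step k - 1.
        L, L_prime = max(p + L, p + q + 1 + L_prime), q + L
        lengths.append(L)
    return lengths


def best_split(d: int, k: int) -> tuple:
    """Return (p, q) with p + q = d, p >= q >= 1, minimising L(k).
    Ties are broken in favour of the smaller q."""
    candidates = [(d - q, q) for q in range(1, d // 2 + 1)]
    return min(candidates, key=lambda pq: critical_lengths(pq[0], pq[1], k)[-1])


print(critical_lengths(4, 2, 3))  # [10, 15, 19]
print(best_split(6, 2))           # (4, 2)
```

For ####-## the longest undetected words with 2 and 3 mismatches have lengths 15 and 19, which gives the thresholds $m \ge 16$ and $m \ge 20$ stated above. Note that for some weights two splits tie for a given $k$ (for example $d = 8$, $k = 2$), so the search may return a split different from $[d/3]$ with the same value of $L(k)$.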

4.2 Regular Expansion and Contraction of Seeds

We now show that seeds solving larger problems can be

obtained from seeds solving smaller problems, and vice

versa, using regular expansion and regular contraction

operations.

Given a seed $Q$, its $i$-regular expansion $i \otimes Q$ is obtained by multiplying each matching position by $i$. This is equivalent to inserting $i - 1$ jokers between every two successive positions along the seed. For example, if $Q = \{0, 2, 3, 5\}$ (or #-##-#), then the 2-regular expansion of $Q$ is $2 \otimes Q = \{0, 4, 6, 10\}$ (or #---#-#---#).

Given a family $F$, its $i$-regular expansion $i \otimes F$ is the family obtained by applying the $i$-regular expansion on each seed of $F$.

Lemma 1. If a family $F$ solves an $(m,k)$-problem, then the $(im, (i+1)k-1)$-problem is solved both by family $F$ and by its $i$-regular expansion $F_i = i \otimes F$.

Proof. Consider an $(im, (i+1)k-1)$-similarity $w$. By the pigeon hole principle, it contains at least one substring of length $m$ with $k$ mismatches or less and, therefore, $F$ solves the $(im, (i+1)k-1)$-problem. On the other hand, consider $i$ disjoint subsequences of $w$ each one consisting of $m$ positions equal modulo $i$. Again, by the pigeon hole principle, at least one of them contains $k$ mismatches or less and, therefore, the $(im, (i+1)k-1)$-problem is solved by $i \otimes F$. □

The following lemma is the inverse of Lemma 1. It states that if seeds solving a bigger problem have a regular structure, then a solution for a smaller problem can be obtained by the regular contraction operation, inverse to the regular expansion.

Lemma 2. If a family $F_i = i \otimes F$ solves an $(im, k)$-problem, then $F$ solves both the $(im, k)$-problem and the $(m, \lfloor k/i \rfloor)$-problem.

Proof. One can even show that $F$ solves the $(im, k)$-problem with the additional restriction for $F$ to match inside one of the position intervals $[1..m], [m+1..2m], \ldots, [(i-1)m+1..im]$. This is done by using the bijective mapping from Lemma 1: Given an $(im, k)$-similarity $w$, consider $i$ disjoint subsequences $w_j$ ($0 \le j \le i-1$) of $w$ obtained by picking $m$ positions equal to $j$ modulo $i$, and then consider the concatenation $w' = w_1 w_2 \ldots w_{i-1} w_0$.

For every $(im, k)$-similarity $w'$, its inverse image $w$ is detected by $F_i$, and therefore $F$ detects $w'$ at one of the intervals

\[
[1..m],\ [m+1..2m],\ \ldots,\ [(i-1)m+1..im].
\]

Furthermore, for any $(m, \lfloor k/i \rfloor)$-similarity $v$, consider $w' = v^i$ and its inverse image $w$. As $w'$ is detected by $F_i$, $v$ is detected by $F$. □

Example 1. To illustrate the two lemmas above, we give the following example pointed out in [1]. The following two seeds are the only seeds of weight 12 that solve the $(50,5)$-problem:

#-#-#---#-----#-#-#---#-----#-#-#---#

and

###-#--###-#--###-#.

The first one is the 2-regular expansion of the second. The second one is the only seed of weight 12 that solves the $(25,2)$-problem.

The regular expansion makes it possible, in some cases, to obtain an efficient solution for a larger problem by reducing it to a smaller problem for which an optimal or a near-optimal solution is known.
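The expansion operation and Lemma 1 are easy to try out on small cases. The sketch below assumes seeds written as strings over `#` and `-`; `expand` and `solves` are hypothetical helper names, and `solves` is a brute-force check that enumerates words with exactly $k$ mismatches (a word with fewer mismatches is detected whenever a word with additional mismatches is).

```python
from itertools import combinations


def expand(seed: str, i: int) -> str:
    """i-regular expansion: i - 1 jokers between every two successive positions."""
    return ("-" * (i - 1)).join(seed)


def solves(family: list, m: int, k: int) -> bool:
    """Brute-force check that the family solves the (m, k)-problem."""
    for zs in combinations(range(m), k):
        zeros = set(zs)
        hit = any(
            all(start + p not in zeros for p, c in enumerate(q) if c == "#")
            for q in family
            for start in range(m - len(q) + 1)
        )
        if not hit:
            return False
    return True


print(expand("#-##-#", 2))  # #---#-#---#
small = "###-#--###-#--###-#"
print(expand(small, 2) == "#-#-#---#-----#-#-#---#-----#-#-#---#")  # True
# Lemma 1 with F = {##}, m = 4, k = 1, i = 2: the (8, 2)-problem.
print(solves(["##"], 4, 1), solves(["##"], 8, 2), solves([expand("##", 2)], 8, 2))
# prints: True True True
```

Checking the $(50,5)$-problem of Example 1 the same way requires enumerating about two million words, which is feasible but slow in pure Python, so it is left out of the sketch.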

4.3 Periodic Seeds

In this section, we study seeds with a periodic structure that can be obtained by iterating a smaller seed. Such seeds often turn out to be among maximally weighted seeds solving a given $(m,k)$-problem. Interestingly, this contrasts with the lossy framework where optimal seeds usually have a “random” irregular structure.

Consider two seeds $Q_1, Q_2$ represented as words over $\{\#, \text{-}\}$. In this section, we lift the assumption that a seed must start and end with a matching position. We denote by $[Q_1, Q_2]^i$ the seed defined as $(Q_1 Q_2)^i Q_1$. For example, [###-#, --]^2 = ###-#--###-#--###-#.

We also need a modification of the $(m,k)$-problem, where $(m,k)$-similarities are considered modulo a cyclic permutation. We say that a seed family $F$ solves a cyclic $(m,k)$-problem, if for every $(m,k)$-similarity $w$, $F$ detects one of the cyclic permutations of $w$. Trivially, if $F$ solves an $(m,k)$-problem, it also solves the cyclic $(m,k)$-problem. To


distinguish from a cyclic problem, we sometimes call an $(m,k)$-problem a linear problem.

We first restrict ourselves to the single-seed case. The following lemma demonstrates that iterating smaller seeds solving a cyclic problem allows one to obtain a solution for bigger problems, for the same number of mismatches.

Lemma 3. If a seed $Q$ solves a cyclic $(m,k)$-problem, then for every $i \ge 0$, the seed $Q_i = [Q, \text{-}^{(m-s(Q))}]^i$ solves the linear $(m \cdot (i+1) + s(Q) - 1, k)$-problem. If $i \ne 0$, the inverse holds too.

Proof. ($\Rightarrow$) Consider an $(m \cdot (i+1) + s(Q) - 1, k)$-similarity $u$. Transform $u$ into a similarity $u'$ for the cyclic $(m,k)$-problem as follows: For each mismatch position $\ell$ of $u$, set 0 at position $(\ell \bmod m)$ in $u'$. The other positions of $u'$ are set to 1. Clearly, there are at most $k$ 0s in $u'$. As $Q$ solves the cyclic $(m,k)$-problem, we can find at least one position $j$, $1 \le j \le m$, such that $Q$ detects $u'$ cyclically.

We show now that $Q_i$ matches at position $j$ of $u$ (which is a valid position as $1 \le j \le m$ and $s(Q_i) = im + s(Q)$). As the positions of 1 in $u$ are projected modulo $m$ to matching positions of $Q$, there is no 0 under any matching element of $Q_i$ and, thus, $Q_i$ detects $u$.

($\Leftarrow$) Consider a seed $Q_i = [Q, \text{-}^{(m-s(Q))}]^i$ solving the $(m \cdot (i+1) + s(Q) - 1, k)$-problem. As $i > 0$, consider $(m \cdot (i+1) + s(Q) - 1, k)$-similarities having all their mismatches located inside the interval $[m, 2m-1]$. For each such similarity, there exists a position $j$, $1 \le j \le m$, such that $Q_i$ detects it. Note that the span of $Q_i$ is at least $m + s(Q)$, which implies that there is either an entire occurrence of $Q$ inside the window $[m, 2m-1]$, or a prefix of $Q$ matching a suffix of the window and the complementary suffix of $Q$ matching a prefix of the window. This implies that $Q$ solves the cyclic $(m,k)$-problem. □

Example 2. Observe that the seed ###-# solves the cyclic $(7,2)$-problem. From Lemma 3, this implies that for every $i \ge 0$, the $(11 + 7i, 2)$-problem is solved by the seed [###-#, --]^i of span $5 + 7i$. Moreover, for $i = 1, 2, 3$, this seed is optimal (maximally weighted) over all seeds solving the problem.

By a similar argument based on Lemma 3, the periodic seed [#####-##, ---]^i solves the $(18 + 11i, 2)$-problem. Note that its weight grows as $\frac{7}{11}m$ compared to $\frac{4}{7}m$ for the seed from the previous paragraph. However, when $m \to \infty$, this is not an asymptotically optimal bound, as we will see later.

The $(18 + 11i, 3)$-problem is solved by the seed [###-#--#, ---]^i, as seed ###-#--# solves the cyclic $(11,3)$-problem. For $i = 1, 2$, the former is a maximally weighted seed among all solving the $(18 + 11i, 3)$-problem.

One question raised by these examples is whether iterating some seed could provide an asymptotically optimal solution, i.e., a seed of maximal asymptotic weight. The following theorem establishes a tight asymptotic bound on the weight of an optimal seed, for a fixed number of mismatches. It gives a negative answer to this question, as it shows that the maximal weight grows faster than any linear fraction of the similarity size.

Theorem 2. Consider a constant $k$. Let $w(m)$ be the maximal weight of a seed solving the cyclic $(m,k)$-problem. Then, $(m - w(m)) = \Theta(m^{\frac{k-1}{k}})$.

Proof. Note first that all seeds solving a cyclic $(m,k)$-problem can be considered as seeds of span $m$. The number of jokers in any seed $Q$ is then $n = m - w(Q)$. The theorem states that the minimal number of jokers of a seed solving the $(m,k)$-problem is $\Theta(m^{\frac{k-1}{k}})$ for every fixed $k$.

Lower bound. Consider a cyclic $(m,k)$-problem. The number $D(m,k)$ of distinct cyclic $(m,k)$-similarities satisfies

\[
\frac{\binom{m}{k}}{m} \le D(m,k), \tag{7}
\]

as every linear $(m,k)$-similarity has at most $m$ cyclically equivalent ones. Consider a seed $Q$. Let $n$ be the number of jokers in $Q$ and $J_Q(m,k)$ the number of distinct cyclic $(m,k)$-similarities detected by $Q$. Observe that $J_Q(m,k) \le \binom{n}{k}$ and if $Q$ solves the cyclic $(m,k)$-problem, then

\[
D(m,k) = J_Q(m,k) \le \binom{n}{k}. \tag{8}
\]

From (7) and (8), we have

\[
\frac{\binom{m}{k}}{m} \le \binom{n}{k}. \tag{9}
\]

Using the Stirling formula, this gives $n(k) = \Omega(m^{\frac{k-1}{k}})$.

Upper bound. To prove the upper bound, we construct a seed $Q$ that has no more than $k \cdot m^{\frac{k-1}{k}}$ joker positions and solves the cyclic $(m,k)$-problem.

We start with the seed $Q_0$ of span $m$ with all matching positions, and introduce jokers into it in $k$ steps. After step $i$, the obtained seed is denoted $Q_i$, and $Q = Q_k$.

Let $B = \lceil m^{\frac{1}{k}} \rceil$. $Q_1$ is obtained by introducing into $Q_0$ individual jokers with periodicity $B$ by placing jokers at positions $1, B+1, 2B+1, \ldots$. At step 2, we introduce into $Q_1$ contiguous intervals of jokers of length $B$ with periodicity $B^2$, such that jokers are placed at positions $[1 \ldots B], [B^2+1 \ldots B^2+B], [2B^2+1 \ldots 2B^2+B], \ldots$.

In general, at step $i$ ($i \le k$), we introduce into $Q_{i-1}$ intervals of $B^{i-1}$ jokers with periodicity $B^i$ at positions $[1 \ldots B^{i-1}], [B^i+1 \ldots B^i+B^{i-1}], \ldots$ (see Fig. 1).

Note that $Q_i$ is periodic with periodicity $B^i$. Note also that at each step $i$, we introduce at most $\lfloor m^{1-\frac{i}{k}} \rfloor$ intervals of $B^{i-1}$ jokers. Moreover, due to overlaps with already added jokers, each interval adds $(B-1)^{i-1}$ new jokers.

This implies that the total number of jokers added at step $i$ is at most $m^{1-\frac{i}{k}} \cdot (B-1)^{i-1} \le m^{1-\frac{i}{k}} \cdot m^{\frac{1}{k}(i-1)} = m^{\frac{k-1}{k}}$. Thus, the total number of jokers in $Q$ is less than $k \cdot m^{\frac{k-1}{k}}$.

By induction on $i$, we prove that for any $(m,i)$-similarity $u$ ($i \le k$), $Q_i$ detects $u$ cyclically, that is, there is a cyclic shift of $Q_i$ such that all $i$ mismatches of $u$ are covered with jokers introduced at steps $1, \ldots, i$.

For $i = 1$, the statement is obvious, as we can always cover the single mismatch by shifting $Q_1$ by at most $(B - 1)$ positions. Assuming that the statement


holds for $(i-1)$, we show now that it holds for $i$ too. Consider an $(m,i)$-similarity $u$. Select one mismatch of $u$. By induction hypothesis, the other $(i-1)$ mismatches can be covered by $Q_{i-1}$. Since $Q_{i-1}$ has period $B^{i-1}$ and $Q_i$ differs from $Q_{i-1}$ by having at least one contiguous interval of $B^{i-1}$ jokers, we can always shift $Q_i$ by $j \cdot B^{i-1}$ positions such that the selected mismatch falls into this interval. This shows that $Q_i$ detects $u$. We conclude that $Q = Q_k$ solves the cyclic $(m,k)$-problem. □

Using Theorem 2, we obtain the following bound on the number of jokers for the linear $(m,k)$-problem.

Lemma 4. Consider a constant $k$. Let $w(m)$ be the maximal weight of a seed solving the linear $(m,k)$-problem. Then, $(m - w(m)) = \Theta(m^{\frac{k}{k+1}})$.

Proof. To prove the upper bound, we construct a seed $Q$ that solves the linear $(m,k)$-problem and satisfies the asymptotic bound. Consider some $l < m$ that will be defined later, and let $P$ be a seed that solves the cyclic $(l,k)$-problem. Without loss of generality, we assume $s(P) = l$.

For a real number $e \ge 1$, define $P^e$ to be the maximally weighted seed of span at most $le$ of the form $P' \cdot P \cdots P \cdot P''$, where $P'$ and $P''$ are, respectively, a suffix and a prefix of $P$. Due to the condition of maximal weight, $w(P^e) \ge e \cdot w(P)$.

We now set $Q = P^e$ for some real $e$ to be defined. Observe that if $e \cdot l \le m - l$, then $Q$ solves the linear $(m,k)$-problem. Therefore, we set $e = \frac{m-l}{l}$.

From the proof of Theorem 2, we have $l - w(P) \le k \cdot l^{\frac{k-1}{k}}$. We then have

\[
w(Q) = e \cdot w(P) \ge \frac{m-l}{l} \cdot \left(l - k \cdot l^{\frac{k-1}{k}}\right). \tag{10}
\]

If we set

\[
l = m^{\frac{k}{k+1}}, \tag{11}
\]

we obtain

\[
m - w(Q) \le (k+1)\, m^{\frac{k}{k+1}} - k\, m^{\frac{k-1}{k+1}}, \tag{12}
\]

and as $k$ is constant,

\[
m - w(Q) = O(m^{\frac{k}{k+1}}). \tag{13}
\]
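For completeness, the step from (10) and (11) to (12) can be spelled out as follows; this is a routine expansion added for the reader and is not part of the original argument. Expanding the right-hand side of (10) gives

\[
m - w(Q) \;\le\; m - \frac{m-l}{l}\left(l - k\, l^{\frac{k-1}{k}}\right)
\;=\; l + k\, m\, l^{-\frac{1}{k}} - k\, l^{\frac{k-1}{k}} .
\]

With $l = m^{\frac{k}{k+1}}$ we have $l^{-\frac{1}{k}} = m^{-\frac{1}{k+1}}$ and $l^{\frac{k-1}{k}} = m^{\frac{k-1}{k+1}}$, hence

\[
m - w(Q) \;\le\; m^{\frac{k}{k+1}} + k\, m^{\frac{k}{k+1}} - k\, m^{\frac{k-1}{k+1}}
\;=\; (k+1)\, m^{\frac{k}{k+1}} - k\, m^{\frac{k-1}{k+1}} .
\]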

The lower bound is obtained similarly to Theorem 2. Let $Q$ be a seed solving a linear $(m,k)$-problem, and let $n = m - w(Q)$. From simple combinatorial considerations, we have

\[
\binom{m}{k} \le \binom{n}{k} \cdot (m - s(Q)) \le \binom{n}{k} \cdot n, \tag{14}
\]

which implies $n = \Omega(m^{\frac{k}{k+1}})$ for constant $k$. □
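The joker placement used in the upper-bound part of the proof of Theorem 2 can be sketched in a few lines. The code below uses 0-based positions (the proof uses 1-based ones) and hypothetical function names. How intervals are truncated when $B^k$ exceeds $m$ is a choice made here, so the checks are restricted to $m = B^k$, where every period divides $m$ and the induction of the proof applies directly.

```python
from itertools import combinations


def integer_root(m: int, k: int) -> int:
    """Smallest B with B**k >= m, i.e. ceil(m ** (1/k)) without floating point."""
    b = 1
    while b ** k < m:
        b += 1
    return b


def joker_positions(m: int, k: int) -> set:
    """0-based joker positions of Q_k: at step i, intervals of B**(i-1)
    jokers are placed with periodicity B**i."""
    B = integer_root(m, k)
    jokers = set()
    for i in range(1, k + 1):
        period, length = B ** i, B ** (i - 1)
        for start in range(0, m, period):
            jokers.update(p for p in range(start, start + length) if p < m)
    return jokers


def covers_cyclically(jokers: set, m: int, k: int) -> bool:
    """Check that any k mismatches on a cycle of length m can be placed
    under jokers by some cyclic shift of the seed of span m."""
    for zs in combinations(range(m), k):
        if not any(all((z - j) % m in jokers for z in zs) for j in range(m)):
            return False
    return True


print(len(joker_positions(16, 2)), covers_cyclically(joker_positions(16, 2), 16, 2))
# prints: 7 True
```

For $m = 16$, $k = 2$ the construction uses 7 jokers, below the bound $k \cdot m^{\frac{k-1}{k}} = 8$; for $m = 27$, $k = 3$ it uses 19, below the bound 27.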

The following simple lemma is also useful for constructing efficient seeds.

Lemma 5. Assume that a family $F$ solves an $(m,k)$-problem. Let $F'$ be the family obtained from $F$ by cutting out $l$ characters from the left and $r$ characters from the right of each seed of $F$. Then $F'$ solves the $(m - r - l, k)$-problem.

Example 3. The $(9 + 7i, 2)$-problem is solved by the seed [###, -#--]^i which is optimal for $i = 1, 2, 3$. Using Lemma 5, this seed can be immediately obtained from the seed [###-#, --]^i from Example 2, solving the $(11 + 7i, 2)$-problem.

We now apply the above results for the single seed case to the case of multiple seeds.

For a seed $Q$ considered as a word over $\{\#, \text{-}\}$, we denote by $Q[i]$ its cyclic shift to the left by $i$ characters. For example, if $Q$ = ####-#-##--, then $Q[5]$ = #-##--####-. The following lemma gives a way to construct seed families solving bigger problems from an individual seed solving a smaller cyclic problem.

Lemma 6. Assume that a seed $Q$ solves a cyclic $(m,k)$-problem and assume that $s(Q) = m$ (otherwise, we pad $Q$ on the right with $(m - s(Q))$ jokers). Fix some $i > 1$. For some $L > 0$, consider a list of $L$ integers $0 \le j_1 < \cdots < j_L < m$, and define a family of seeds $F = \langle \|(Q[j_l])^i\| \rangle_{l=1}^{L}$, where $\|(Q[j_l])^i\|$ stands for the seed obtained from $(Q[j_l])^i$ by deleting the joker characters at the left and right edges. Define $\delta(l) = ((j_{l-1} - j_l) \bmod m)$ (or, alternatively, $\delta(l) = ((j_l - j_{l-1}) \bmod m)$) for all $l$, $1 \le l \le L$. Let $m' = \max\{s(\|(Q[j_l])^i\|) + \delta(l)\}_{l=1}^{L} - 1$. Then, $F$ solves the $(m', k)$-problem.

Proof. The proof is an extension of the proof of Lemma 3. Here, the seeds of the family are constructed in such a way that for any instance of the linear $(m', k)$-problem, there exists at least one seed that satisfies the property required in the proof of Lemma 3 and, therefore, matches this instance. □

In applying Lemma 6, integers $j_l$ are chosen from the interval $[0, m]$ in such a way that values $s(\|(Q[j_l])^i\|) + \delta(l)$ are close to each other. We illustrate Lemma 6 with two examples that follow.

Example 4. Let $m = 11$, $k = 2$. Consider the seed $Q$ = ####-#-##-- solving the cyclic $(11,2)$-problem. Choose $i = 2$, $L = 2$, $j_1 = 0$, $j_2 = 5$. This gives two seeds:

$Q_1 = \|(Q[0])^2\|$ = ####-#-##--####-#-##


Fig. 1. Construction of seeds Qi from the proof of Theorem 2. Jokers are

represented in white and matching positions in black.

and

$Q_2 = \|(Q[5])^2\|$ = #-##--####-#-##--####

of span 20 and 21, respectively, $\delta(1) = 6$ and $\delta(2) = 5$. $\max\{20 + 6, 21 + 5\} - 1 = 25$. Therefore, family $F = \{Q_1, Q_2\}$ solves the $(25,2)$-problem.

Example 5. Let $m = 11$, $k = 3$. The seed $Q$ = ###-#--#--- solves the cyclic $(11,3)$-problem. Choose $i = 2$, $L = 2$, $j_1 = 0$, $j_2 = 4$. The two seeds are

$Q_1 = \|(Q[0])^2\|$ = ###-#--#---###-#--#

(span 19) and

$Q_2 = \|(Q[4])^2\|$ = #--#---###-#--#---###

(span 21), with $\delta(1) = 7$ and $\delta(2) = 4$. $\max\{19 + 7, 21 + 4\} - 1 = 25$. Therefore, family $F = \{Q_1, Q_2\}$ solves the $(25,3)$-problem.
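The construction of Lemma 6 is mechanical enough to script, and the two examples above can then be reproduced and checked by enumeration. The sketch below makes several assumptions that are not spelled out in the text: seeds are strings over `#` and `-`, the shift values are listed in increasing order, each shifted seed starts with `#`, the per-seed offset is computed with the alternative form $(j_l - j_{l-1}) \bmod m$ (the one that matches the numbers in Examples 4 and 5), and $j_0$ is taken to be $j_L$. The function names are hypothetical.

```python
from itertools import combinations


def build_family(q: str, shifts: list, i: int):
    """Return (seeds, offsets, m_prime) for the family built from the cyclic
    shifts Q[j] of q, each repeated i times and trimmed of edge jokers."""
    m = len(q)
    seeds = [((q[j:] + q[:j]) * i).strip("-") for j in shifts]
    # (j_l - j_{l-1}) mod m with j_0 := j_L; a single shift covers all m phases.
    offsets = [(shifts[l] - shifts[l - 1]) % m or m for l in range(len(shifts))]
    m_prime = max(len(s) + d for s, d in zip(seeds, offsets)) - 1
    return seeds, offsets, m_prime


def solves(family: list, m: int, k: int) -> bool:
    """Brute-force check of the linear (m, k)-problem (exactly k mismatches)."""
    for zs in combinations(range(m), k):
        zeros = set(zs)
        if not any(
            all(s + p not in zeros for p, c in enumerate(q) if c == "#")
            for q in family
            for s in range(m - len(q) + 1)
        ):
            return False
    return True


seeds, offsets, m_prime = build_family("####-#-##--", [0, 5], 2)
print(seeds)             # ['####-#-##--####-#-##', '#-##--####-#-##--####']
print(offsets, m_prime)  # [6, 5] 25
print(solves(seeds, m_prime, 2))  # True
```

Calling `build_family("###-#--#---", [0, 4], 2)` reproduces the seeds, the offsets 7 and 4, and the length 25 of Example 5 in the same way.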

4.4 Heuristic Seed Design

Results of Sections 4.1, 4.2, and 4.3 allow one to construct efficient seed families in certain cases, but still do not allow a systematic seed design. Recently, linear programming approaches to designing efficient seed families were proposed in [19] and in [18], respectively, for DNA and protein similarity search. However, neither of these methods aims at constructing lossless families.

In this section, we outline a heuristic genetic programming algorithm for designing lossless seed families. The algorithm will be used in the experimental part of this work, which we present in the next section. Note that this algorithm uses the dynamic programming algorithms discussed in Section 3. Since the algorithm uses standard genetic programming techniques, we give only a high-level description here without going into all details.

The algorithm tries to iteratively improve characteristics of a population of seed families until it finds a small family that detects all $(m,k)$-similarities (i.e., is lossless). The first step of each iteration is based on screening current families against a set of difficult similarities, that is, similarities that have been detected by fewer families. This set is continually reordered and updated according to the number of families that do not detect those similarities. For this, each set is stored in a tree and the reordering is done using the list-as-a-tree principle [20]: Each time a similarity is not detected by a family, it is moved towards the root of the tree such that its height is divided by two.

For those families that pass through the screening, the number of undetected similarities is computed by the dynamic programming algorithm of Section 3.2. The family is kept if it produces a smaller number than the families currently known. An undetected similarity obtained during this computation is added as a leaf to the tree of difficult similarities.

To detect seeds to be improved inside a family, we compute the contribution of each seed by the dynamic programming algorithm of Section 3.3. The seeds with the least contribution are then modified with a higher probability. In general, the population of seed families is evolving by mutating and crossing over according to the set of similarities they do not detect. Moreover, random seed families are regularly injected into the population in order to avoid local optima.

The described heuristic procedure often allows efficient or even optimal solutions to be computed in a reasonable time. For example, in 10 runs of the algorithm, we found three of the six existing families of two seeds of weight 14 solving the $(25,2)$-problem. The whole computation took less than 1 hour, compared to a week of computation needed to exhaustively test all seed pairs. Note that the randomized-greedy approach (incremental completion of the seed set by adding the best random seed) applied a dozen times to the same problem yielded only sets of three and sometimes four, but never two seeds, taking about 1 hour at each run.

5 EXPERIMENTS

We describe two groups of experiments that we made. The first one concerns the design of efficient seed families, and the second one applies a multiseed lossless filtration to the identification of unique oligos in a large set of EST sequences.

5.1 Seed Design Experiments

We considered several $(m,k)$-problems. For each problem, and for a fixed number of seeds in the family, we computed families solving the problem and realizing the largest possible seed weight (under a natural assumption that all seeds in a family have the same weight). We also kept track of the ways (periodic seeds, genetic programming heuristics, exhaustive search) in which those families can be computed.

Tables 1 and 2 summarize some results obtained for the $(25,2)$-problem and the $(25,3)$-problem, respectively. Families of periodic seeds (that can be found using Lemma 6) are marked with p, those that are found using a genetic algorithm are marked with g, and those which are obtained by an exhaustive search are marked with e. Only in this latter case, the families are guaranteed to be optimal. Families of periodic seeds are shifted according to their construction (see Lemma 6).

Moreover, to compare the selectivity of different families solving a given $(m,k)$-problem, we estimated the probability $\delta$ for at least one of the seeds of the family to match at a given position of a uniform Bernoulli four-letter sequence. This has been done using the inclusion-exclusion formula.

Note that the simple fact of passing from a single seed to a two-seed family results in a considerable gain in efficiency: In both examples shown in the tables there is a change of about one order of magnitude in the selectivity estimator $\delta$.
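The inclusion-exclusion estimate can be sketched as follows. The text does not say how the seeds of a family are aligned at the "given position", so the code assumes they are all left-aligned there; it also assumes that each aligned position matches independently with probability $1/4$. Under these assumptions a set of seeds matches simultaneously with probability $4^{-u}$, where $u$ is the size of the union of their matching positions. The function name and the toy family are chosen here for illustration and are not taken from the tables.

```python
from itertools import combinations


def match_probability(family: list, alphabet_size: int = 4) -> float:
    """Probability that at least one seed matches at a fixed position,
    all seeds being left-aligned at that position (an assumption)."""
    position_sets = [
        frozenset(p for p, c in enumerate(seed) if c == "#") for seed in family
    ]
    total = 0.0
    for r in range(1, len(position_sets) + 1):
        sign = 1.0 if r % 2 else -1.0
        for subset in combinations(position_sets, r):
            union = frozenset().union(*subset)
            total += sign * alphabet_size ** (-len(union))
    return total


print(match_probability(["###"]))        # 0.015625
print(match_probability(["##", "#-#"]))  # 0.109375
```

The loop runs over all nonempty subsets of the family, which is unproblematic for the family sizes considered here (fewer than 10 seeds).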

5.2 Oligo Selection Using Multiseed Filtering

An important practical application of lossless filtration is the selection of reliable oligonucleotides for DNA microarray experiments. Oligonucleotides (oligos) are small DNA sequences of fixed size (usually ranging from 10 to 50) designed to hybridize only with a specific region of the genome sequence. In microarray experiments, oligos are expected to match ESTs that stem from a given gene and not


to match those of other genes. As the first approximation, the problem of oligo selection can then be formulated as the search for strings of a fixed length that occur in a given sequence but do not occur, within a specified distance, in other sequences of a given (possibly very large) sample. Different approaches to this problem apply different distance measures and different algorithmic techniques [21], [22], [23], [24]. The experiments we briefly present here demonstrate that the multiseed filtering provides an efficient computation of candidate oligonucleotides. These should then be further processed by complementary methods in order to take into account other physicochemical factors occurring in hybridisation, such as the melting temperature or the possible hairpin structure of palindromic oligos.

Here, we adopt the formalization of the oligo selection problem as the problem of identifying in a given sequence (or a sequence database) all substrings of length $m$ that have no occurrences elsewhere in the sequence within the Hamming distance $k$. The parameters $m$ and $k$ were set to 32 and 5, respectively. For the $(32,5)$-problem, different seed families were designed and their selectivity was estimated. Those are summarized in the table in Fig. 2, using the same conventions as in Tables 1 and 2 above. The family composed of six seeds of weight 11 was selected for the filtration experiment (shown in Fig. 2).

The filtering has been applied to a database of rice EST sequences composed of 100,015 sequences for a total length of 42,845,242 bp (see footnote 1). Substrings matching other substrings with five substitution errors or less were computed. The computation took slightly more than one hour on a


Pentium 4 3 GHz computer. Before applying the filtering using the family for the $(32,5)$-problem, we made a rough prefiltering using one spaced seed of weight 16 to detect, with a high selectivity, almost identical regions. Sixty-five percent of the database has been discarded by this prefiltering. Another 22 percent of the database has been filtered out using the chosen seed family, leaving the remaining 13 percent as oligo candidates.

1. Source: http://bioserver.myongji.ac.kr/ricemac.html, The Korea Rice Genome Database.

TABLE 1. Seed Families for (25,2)-Problem

TABLE 2. Seed Families for (25,3)-Problem

6 CONCLUSION

In this paper, we studied a lossless filtration method based on multiseed families and demonstrated that it represents an improvement compared to the single-seed approach considered in [1]. We showed how some important characteristics of seed families can be computed using dynamic programming. We presented several combinatorial results that allow one to construct efficient families composed of seeds with a periodic structure. Finally, we described a large-scale computational experiment of designing reliable oligonucleotides for DNA microarrays. The obtained experimental results provided evidence of the applicability and efficiency of the whole method.

The results of Sections 4.1, 4.2, and 4.3 establish several combinatorial properties of seed families, but many more of them remain to be elucidated. The structure of optimal or near-optimal seed families can be reduced to number-theoretic questions, but this relation remains to be clearly established. In general, constructing an algorithm to systematically design seed families with a quality guarantee remains an open problem. Some complexity issues remain open too: For example, what is the complexity of testing if a single seed is lossless for given m, k? Section 3 implies a time bound exponential in the number of jokers. Note that for multiple seeds, computing the number of detected similarities is NP-complete [16, Section 3.1].

Another direction is to consider different distance measures, especially the Levenshtein distance, or at least to allow some restricted insertion/deletion errors. The method proposed in [25] does not seem to be easily generalized to multiseed families, and further work is required to improve lossless filtering in this case.

ACKNOWLEDGMENTS

G. Kucherov and L. Noe have been supported by the French Action Specifique “Algorithmes et Sequences” of CNRS. A part of this work has been done during a stay of M. Roytberg at LORIA, Nancy, supported by INRIA. M. Roytberg has been supported by the Russian Foundation for Basic Research (project nos. 03-04-49469, 02-07-90412) and by grants from the RF Ministry for Industry, Science, and Technology (20/2002, 5/2003) and NWO. An extended abstract of this work has been presented to the Combinatorial Pattern Matching Conference (Istanbul, July 2004).

REFERENCES

[1] S. Burkhardt and J. Karkkainen, “Better Filtering with Gapped q-Grams,” Fundamenta Informaticae, vol. 56, nos. 1-2, pp. 51-70, 2003, preliminary version in Combinatorial Pattern Matching 2001.

[2] G. Navarro and M. Raffinot, Flexible Pattern Matching in Strings: Practical On-Line Search Algorithms for Texts and Biological Sequences. Cambridge Univ. Press, 2002.

[3] S. Altschul, T. Madden, A. Schaffer, J. Zhang, Z. Zhang, W. Miller, and D. Lipman, “Gapped BLAST and PSI-BLAST: A New Generation of Protein Database Search Programs,” Nucleic Acids Research, vol. 25, no. 17, pp. 3389-3402, 1997.

[4] B. Ma, J. Tromp, and M. Li, “PatternHunter: Faster and More Sensitive Homology Search,” Bioinformatics, vol. 18, no. 3, pp. 440-445, 2002.

[5] S. Schwartz, J. Kent, A. Smit, Z. Zhang, R. Baertsch, R. Hardison, D. Haussler, and W. Miller, “Human-Mouse Alignments with BLASTZ,” Genome Research, vol. 13, pp. 103-107, 2003.

[6] L. Noe and G. Kucherov, “Improved Hit Criteria for DNA Local Alignment,” BMC Bioinformatics, vol. 5, no. 149, Oct. 2004.

[7] P. Pevzner and M. Waterman, “Multiple Filtration and Approximate Pattern Matching,” Algorithmica, vol. 13, pp. 135-154, 1995.

[8] A. Califano and I. Rigoutsos, “Flash: A Fast Look-Up Algorithm for String Homology,” Proc. First Int’l Conf. Intelligent Systems for Molecular Biology, pp. 56-64, July 1993.

[9] J. Buhler, “Provably Sensitive Indexing Strategies for Biosequence Similarity Search,” Proc. Sixth Ann. Int’l Conf. Computational Molecular Biology (RECOMB ’02), pp. 90-99, Apr. 2002.

[10] U. Keich, M. Li, B. Ma, and J. Tromp, “On Spaced Seeds for Similarity Search,” Discrete Applied Math., vol. 138, no. 3, pp. 253-263, 2004.

[11] J. Buhler, U. Keich, and Y. Sun, “Designing Seeds for Similarity Search in Genomic DNA,” Proc. Seventh Ann. Int’l Conf. Computational Molecular Biology (RECOMB ’03), pp. 67-75, Apr. 2003.

[12] B. Brejova, D. Brown, and T. Vinar, “Vector Seeds: An Extension to Spaced Seeds Allows Substantial Improvements in Sensitivity and Specificity,” Proc. Third Int’l Workshop Algorithms in Bioinformatics (WABI), pp. 39-54, Sept. 2003.

[13] G. Kucherov, L. Noe, and Y. Ponty, “Estimating Seed Sensitivity on Homogeneous Alignments,” Proc. IEEE Fourth Symp. Bioinformatics and Bioeng. (BIBE 2004), May 2004.

[14] K. Choi and L. Zhang, “Sensitivity Analysis and Efficient Method for Identifying Optimal Spaced Seeds,” J. Computer and System Sciences, vol. 68, pp. 22-40, 2004.

[15] M. Csuros, “Performing Local Similarity Searches with Variable Length Seeds,” Proc. 15th Ann. Combinatorial Pattern Matching Symp. (CPM), pp. 373-387, 2004.

60 IEEE/ACM TRANSACTIONS ON COMPUTATIONAL BIOLOGY AND BIOINFORMATICS, VOL. 2, NO. 1, JANUARY-MARCH 2005

Fig. 2. Computed seed families for the (32, 5)-problem and the chosen family (six seeds of weight 11).

[16] M. Li, B. Ma, D. Kisman, and J. Tromp, “PatternHunter II: Highly Sensitive and Fast Homology Search,” J. Bioinformatics and Computational Biology, vol. 2, no. 3, pp. 417-440, Sept. 2004.

[17] Y. Sun and J. Buhler, “Designing Multiple Simultaneous Seeds for DNA Similarity Search,” Proc. Eighth Ann. Int’l Conf. Research in Computational Molecular Biology (RECOMB 2004), pp. 76-84, Mar. 2004.

[18] D.G. Brown, “Multiple Vector Seeds for Protein Alignment,” Proc. Fourth Int’l Workshop Algorithms in Bioinformatics (WABI), pp. 170-181, Sept. 2004.

[19] J. Xu, D. Brown, M. Li, and B. Ma, “Optimizing Multiple Spaced Seeds for Homology Search,” Proc. 15th Symp. Combinatorial Pattern Matching, pp. 47-58, 2004.

[20] J. Oommen and J. Dong, “Generalized Swap-with-Parent Schemes for Self-Organizing Sequential Linear Lists,” Proc. 1997 Int’l Symp. Algorithms and Computation (ISAAC ’97), pp. 414-423, Dec. 1997.

[21] F. Li and G. Stormo, “Selection of Optimal DNA Oligos for Gene Expression Arrays,” Bioinformatics, vol. 17, pp. 1067-1076, 2001.

[22] L. Kaderali and A. Schliep, “Selecting Signature Oligonucleotides to Identify Organisms Using DNA Arrays,” Bioinformatics, vol. 18, no. 10, pp. 1340-1349, 2002.

[23] S. Rahmann, “Fast Large Scale Oligonucleotide Selection Using the Longest Common Factor Approach,” J. Bioinformatics and Computational Biology, vol. 1, no. 2, pp. 343-361, 2003.

[24] J. Zheng, T. Close, T. Jiang, and S. Lonardi, “Efficient Selection of Unique and Popular Oligos for Large EST Databases,” Proc. 14th Ann. Combinatorial Pattern Matching Symp. (CPM), pp. 273-283, 2003.

[25] S. Burkhardt and J. Karkkainen, “One-Gapped q-Gram Filters for Levenshtein Distance,” Proc. 13th Symp. Combinatorial Pattern Matching (CPM ’02), vol. 2373, pp. 225-234, 2002.

Gregory Kucherov received the PhD degree in computer science in 1988 from the USSR Academy of Sciences, and a Habilitation degree in 2000 from the Henri Poincare University in Nancy. He is a senior INRIA researcher with the LORIA research unit in Nancy, France. For the last 10 years, he has been doing research on word combinatorics, text algorithms and combinatorial algorithms for bioinformatics, and computational biology.

Laurent Noe studied computer science at the ESIAL engineering school in Nancy, France. He received the MS degree in 2002 and is currently a PhD student in computational biology at LORIA.

Mikhail Roytberg received the PhD degree in computer science in 1983 from Moscow State University. He is a leader of the Computational Molecular Biology Group in the Institute of Mathematical Problems in Biology of the Russian Academy of Sciences at Pushchino, Russia. During the last years, his main research field has been the development of algorithms for comparative analysis of biological sequences.


Text Mining Biomedical Literature for Discovering Gene-to-Gene Relationships: A Comparative Study of Algorithms

Ying Liu, Shamkant B. Navathe, Jorge Civera, Venu Dasigi, Ashwin Ram, Brian J. Ciliax, and Ray Dingledine

Abstract—Partitioning closely related genes into clusters has become an important element of practically all statistical analyses of microarray data. A number of computer algorithms have been developed for this task. Although these algorithms have demonstrated their usefulness for gene clustering, some basic problems remain. This paper describes our work on extracting functional keywords from MEDLINE for a set of genes that are isolated for further study from microarray experiments based on their differential expression patterns. The sharing of functional keywords among genes is used as a basis for clustering in a new approach called BEA-PARTITION in this paper. Functional keywords associated with genes were extracted from MEDLINE abstracts. We modified the Bond Energy Algorithm (BEA), which is widely accepted in psychology and database design but is virtually unknown in bioinformatics, to cluster genes by functional keyword associations. The results showed that BEA-PARTITION and the hierarchical clustering algorithm outperformed k-means clustering and self-organizing map by correctly assigning 25 of 26 genes in a test set of four known gene groups. To evaluate the effectiveness of BEA-PARTITION for clustering genes identified by microarray profiles, 44 yeast genes that are differentially expressed during the cell cycle and have been widely studied in the literature were used as a second test set. Using established measures of cluster quality, the results produced by BEA-PARTITION had higher purity, lower entropy, and higher mutual information than those produced by k-means and self-organizing map. Whereas BEA-PARTITION and the hierarchical clustering produced similar quality of clusters, BEA-PARTITION provides clear cluster boundaries compared to the hierarchical clustering. BEA-PARTITION is simple to implement and provides a powerful approach to clustering genes or to any clustering problem where starting matrices are available from experimental observations.

Index Terms—Bond energy algorithm, microarray, MEDLINE, text analysis, cluster analysis, gene function.

1 INTRODUCTION

DNA microarrays, among the most rapidly growing tools for genome analysis, are introducing a paradigmatic change in biology by shifting experimental approaches from single gene studies to genome-level analyses [1], [2]. Increasingly accessible microarray platforms allow the rapid generation of large expression data sets [3]. One of the key challenges of microarray studies is to derive biological insights from the unprecedented quantities of data on gene-expression patterns [5]. Partitioning genes into closely related groups has become an element of practically all analyses of microarray data [4].

A number of computer algorithms have been applied to gene clustering. One of the earliest was a hierarchical algorithm developed by Eisen et al. [6]. Other popular algorithms, such as k-means [7] and Self-Organizing Maps (SOM) [8], have also been widely used. These algorithms have demonstrated their usefulness in gene clustering, but some basic problems remain [2], [9]. Hierarchical clustering organizes expression data into a binary tree, in which the leaves are genes and the interior nodes (or branch points) are candidate clusters. True clusters with discrete boundaries are not produced [10]. Although SOM is efficient and simple to implement, studies suggest that it typically performs worse than the traditional techniques, such as k-means [11].

Based on the assumption that genes with the same function or in the same biological pathway usually show similar expression patterns, the functions of unknown genes can be inferred from those of the known genes with similar expression profile patterns. Therefore, expression profile gene clustering by all the algorithms mentioned above has received much attention; however, the task of finding functional relationships between specific genes is left to the investigator. Manual scanning of the biological literature (for example, via MEDLINE) for clues regarding potential functional relationships among a set of genes is not feasible when the number of genes to be explored rises above approximately 10. Restricting the scan (manual or automatic) to annotation fields of GenBank, SwissProt, or LocusLink is quicker but can suffer from the ad hoc relationship of keywords to the research interests of whoever submitted the entry. Moreover, keeping annotation fields current as new


• Y. Liu, S.B. Navathe, J. Civera, and A. Ram are with the College of Computing, Georgia Institute of Technology, 801 Atlantic Drive, Atlanta, GA 30322. E-mail: {yingliu, sham, ashwin}@cc.gatech.edu, [email protected].

• V. Dasigi is with the Department of Computer Science, School of Computing and Software Engineering, Southern Polytechnic State University, Marietta, GA 30060. E-mail: [email protected].

• B.J. Ciliax is with the Department of Neurology, Emory University School of Medicine, Atlanta, GA 30322. E-mail: [email protected].

• R. Dingledine is with the Department of Pharmacology, Emory University School of Medicine, Atlanta, GA 30322. E-mail: [email protected].

Manuscript received 4 Apr. 2004; revised 1 Oct. 2004; accepted 10 Feb. 2005; published online 30 Mar. 2005. For information on obtaining reprints of this article, please send e-mail to: [email protected], and reference IEEECS Log Number TCBB-0043-0404.

1545-5963/05/$20.00 © 2005 IEEE. Published by the IEEE CS, CI, and EMB Societies & the ACM

information appears in the literature is a major challenge that is rarely met adequately.

If, instead of organizing by expression pattern similarity, genes were grouped according to shared function, investigators might more quickly discover patterns or themes of biological processes that were revealed by their microarray experiments and focus on a select group of functionally related genes. A number of clustering strategies based on shared functions rather than similar expression patterns have been devised. Chaussabel and Sher [3] analyzed literature profiles generated by extracting the frequencies of certain terms from the abstracts in MEDLINE and then clustered the genes based on these terms, essentially applying the same algorithm used for expression pattern clustering. Jenssen et al. [12] used co-occurrence of gene names in abstracts to create networks of related genes automatically. Text analysis of biomedical literature has also been applied successfully to incorporate functional information about the genes in the analysis of gene expression data [1], [10], [13], [14] without generating clusters de novo. For example, Blaschke et al. [1] extracted information about the common biological characteristics of gene clusters from MEDLINE using Andrade and Valencia’s statistical text mining approach, which accepts user-supplied abstracts related to a protein of interest and returns an ordered set of keywords that occur in those abstracts more often than would be expected by chance [15].

We extended Andrade and Valencia’s approach [15] to functional gene clustering by applying the Bond Energy Algorithm (BEA) [16], [17], which, to our knowledge, has not been used in bioinformatics. We modified it so that the “affinity” among attributes (in our case, genes) is defined based on the sharing of keywords between them, and we devised a scheme for partitioning the clustered affinity matrix to produce clusters of genes. We call the resulting algorithm BEA-PARTITION. BEA was originally conceived as a technique to cluster questions in psychological instruments [16], has been used in operations research, production engineering, marketing, and various other fields [18], and is a popular clustering algorithm in distributed database system (DDBS) design. The fundamental task of BEA in DDBS design is to group attributes based on their affinity, which indicates how closely related the attributes are, as determined by the inclusion of these attributes by the same database transactions. In our case, each gene was considered as an attribute. Hence, the basic premise is that two genes would have higher affinity, thus higher bond energy, if abstracts mentioning these genes shared many informative keywords. BEA has several useful properties [16], [19]. First, it groups attributes with larger affinity values together, and the ones with smaller values together (i.e., during the permutation of columns and rows, it shuffles the attributes towards those with which they have higher affinity and away from those with which they have lower affinity). Second, the composition and order of the final groups are insensitive to the order in which items are presented to the algorithm. Finally, it seeks to uncover and display the association and interrelationships of the clustered groups with one another.

In order to explore whether this algorithm could be useful for clustering genes derived from microarray experiments, we compared the performance of BEA-PARTITION, the hierarchical clustering algorithm, the self-organizing map, and the k-means algorithm for clustering functionally related genes based on shared keywords, using purity, entropy, and mutual information as metrics for evaluating cluster quality.

2 METHODS

2.1 Keyword Extraction from Biomedical Literature

We used statistical methods to extract keywords from MEDLINE citations, based on the work of [15]. This method estimates the significance of words by comparing the frequency of words in a given gene-related set (Test Set) of abstracts with their frequency in a background set of abstracts. We modified the original method by using 1) a different background set, 2) a different stemming algorithm (Porter’s stemmer), and 3) a customized stop list. The details were reported by Liu et al. [20], [21].

For each gene analyzed, word frequencies were calculated from a group of abstracts retrieved by an SQL (structured query language) search of MEDLINE for the specific gene name, gene symbol, or any known aliases (see LocusLink, ftp://ftp.ncbi.nih.gov/refseq/LocusLink/LL_tmpl.gz for gene aliases) in the TITLE field. The resulting set of abstracts (the Test Set) was processed to generate a specific keyword list.

Test Sets of Genes. We compared BEA-PARTITION and other clustering algorithms (k-means, hierarchical, and SOM) on two test sets.

1. Twenty-six genes in four well-defined functional groups consisting of 10 glutamate receptor subunits, seven enzymes in catecholamine metabolism, five cytoskeletal proteins, and four enzymes in tyrosine and phenylalanine synthesis. The gene names and aliases are listed in Table 1. This experiment was performed to determine whether keyword associations can be used to group genes appropriately and whether the four gene families or clusters that were known a priori would also be predicted by a clustering algorithm simply using the affinity metric based on keywords.

2. Forty-four yeast genes involved in the cell cycle of budding yeast (Saccharomyces cerevisiae) that had altered expression patterns on spotted DNA microarrays [6]. These genes were analyzed by Cherepinsky et al. [4] to demonstrate their Shrinkage algorithm for gene clustering. A master list of member genes for each cluster was assembled according to a combination of 1) common cell-cycle functions and regulatory systems and 2) the corresponding transcriptional activators for each gene [4] (Table 2).

Keyword Assessment. Statistical formulae from [15] for word frequencies were used without modification. These calculations were repeated for all gene names in the test set, a process that generated a database of keywords associated with specific genes, the strength of the association being reflected by a z-score. The z-score of word a for gene g is defined as:

\[ Z^a_g = \frac{F^a_g - \bar{F}^a}{\sigma^a}, \tag{1} \]

where $F^a_g$ equals the frequency of word a in Test Set g (i.e., in Test Set g, the number of abstracts where the word a occurs divided by the total number of abstracts) and $\bar{F}^a$ and $\sigma^a$ are the average frequency and standard deviation, respectively, of word a in the background set. Intuitively, the score Z compares the “importance” or “discriminatory relevance” of a keyword in the test set of abstracts with the background set that represents the expected occurrence of that word in the literature at large.
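As an illustration of the z-score in (1), the following minimal sketch computes it for one word. It assumes each abstract has already been reduced to a set of stemmed words and that the background mean and standard deviation are supplied as numbers; how the background statistics are obtained is described in [15] and is not reproduced here. The function names are hypothetical.

```python
def word_frequency(abstracts, word):
    """Fraction of abstracts that contain `word`.

    `abstracts` is a list of sets of (already stemmed) words; this
    representation is an assumption made for the sketch.
    """
    return sum(word in a for a in abstracts) / len(abstracts)


def z_score(test_abstracts, word, bg_mean, bg_std):
    """Z-score of `word` for one gene's Test Set, following Eq. (1).

    bg_mean and bg_std stand for the average frequency and standard
    deviation of the word in the background set.
    """
    return (word_frequency(test_abstracts, word) - bg_mean) / bg_std
```

For example, a word present in half of a gene's abstracts, with a background mean of 0.1 and a standard deviation of 0.05, would receive a z-score of about 8.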

Keyword Selection for Gene Clustering. We used z-score thresholds to select the keywords used for gene clustering. Those keywords with z-scores less than the threshold were discarded. The z-score thresholds we tested were 0, 5, 8, 10, 15, 20, 30, 50, and 100. The database generated by this algorithm is represented as a sparse word (rows) × gene (columns) matrix with cells containing z-scores. The matrix is characterized as “sparse” because each gene only has a fraction of all words associated with it. The output of the keyword selection for all genes in each Test Set is represented as a sparse keyword (rows) × gene (columns) matrix with cells containing z-scores.

2.2 BEA-PARTITION: Detailed Working of the Algorithm

The BEA-PARTITION takes a symmetric matrix as input, permutes its rows and columns, and generates a sorted matrix, which is then partitioned to form a clustered matrix.

Constructing the Symmetric Gene × Gene Matrix. The sparse word × gene matrix, with the cells containing the z-scores of each word-gene pair, was converted to a gene × gene matrix with the cells containing the sum of products of z-scores for shared keywords. The z-score value was set to zero if the value was less than the threshold. Larger values reflect stronger and more extensive keyword associations between gene-gene pairs. For each gene pair $(G_i, G_j)$ and every word a they share in the sparse word × gene matrix, the $G_i \times G_j$ cell value $\mathrm{aff}(G_i, G_j)$ in the gene × gene matrix represents the affinity of the two genes for each other and is calculated as:

\[ \mathrm{aff}(G_i, G_j) = \frac{\sum_{a=1}^{N} \left( Z^a_{G_i} \times Z^a_{G_j} \right)}{1{,}000}. \tag{2} \]

Dividing the sum of the z-score products by 1,000 was done to reduce the typically large numbers to a more readable format in the output matrix.

Sorting the Matrix [19]. The sorted matrix is generated as follows:


TABLE 1
Twenty-Six Genes Manually Clustered Based on Functional Similarity

TABLE 2
Forty-Four Yeast Genes Grouped by Transcriptional Activators and Cell Cycle Functions [4]

1. Initialization. Place and fix one of the columns of the symmetric matrix arbitrarily into the clustered matrix.

2. Iteration. Pick one of the remaining n − i columns (where i is the number of columns already in the sorted matrix). Choose the placement in the sorted matrix that maximizes the change in bond energy as described below (3). Repeat this step until no more columns remain.

3. Row ordering. Once the column ordering is determined, the placement of the rows should also be changed correspondingly so that their relative positions match the relative position of the columns. This restores the symmetry to the sorted matrix.

To calculate the change in bond energy for each possible placement of the next (i + 1) column, the bonds between that column (k) and each of two newly adjacent columns (i, j) are added and the bond that would be broken between the latter two columns is subtracted. Thus, the “bond energy” between these three columns i, j, and k (representing gene i ($G_i$), gene j ($G_j$), and gene k ($G_k$)) is calculated by the following interaction contribution measure:

\[ \mathrm{energy}(G_i, G_j, G_k) = 2 \times \left[ \mathrm{bond}(G_i, G_k) + \mathrm{bond}(G_k, G_j) - \mathrm{bond}(G_i, G_j) \right], \tag{3} \]

where $\mathrm{bond}(G_i, G_j)$ is the bond energy between gene $G_i$ and gene $G_j$ and

\[ \mathrm{bond}(G_i, G_j) = \sum_{r=1}^{N} \mathrm{aff}(G_r, G_i) \times \mathrm{aff}(G_r, G_j), \tag{4} \]

\[ \mathrm{aff}(G_0, G_i) = \mathrm{aff}(G_i, G_0) = \mathrm{aff}(G_{n+1}, G_i) = \mathrm{aff}(G_i, G_{n+1}) = 0. \tag{5} \]

The last set of conditions (5) takes care of cases where a gene is being placed in the sorted matrix to the left of the leftmost gene or to the right of the rightmost gene during column permutations, and prior to the topmost row and following the last row during row permutations.
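The affinity matrix of (2) and the column-placement rule of (3)-(5) can be sketched together as follows. This is an illustrative reading of the description, not the authors' implementation: the function names are hypothetical, the input is assumed to be a dictionary mapping each gene to its keyword z-scores, columns are taken in index order (the text leaves the pick order open), and ties between equally good placements are broken by taking the leftmost one.

```python
def affinity_matrix(z, threshold=10.0, scale=1000.0):
    """Gene x gene affinity per Eq. (2).

    z: dict gene -> dict keyword -> z-score (assumed input format).
    Z-scores below `threshold` are treated as zero, i.e., dropped.
    """
    genes = sorted(z)
    kept = {g: {w: s for w, s in z[g].items() if s >= threshold} for g in genes}
    aff = [
        [
            sum(kept[gi][w] * kept[gj][w] for w in kept[gi].keys() & kept[gj].keys())
            / scale
            for gj in genes
        ]
        for gi in genes
    ]
    return genes, aff


def bond(aff, i, j):
    """Bond between columns i and j per Eq. (4); None is a border (Eq. (5))."""
    if i is None or j is None:
        return 0.0
    return sum(row[i] * row[j] for row in aff)


def bea_order(aff):
    """Greedy column ordering that maximizes the energy change of Eq. (3)."""
    n = len(aff)
    if n == 0:
        return []
    order = [0]  # initialization: first column fixed arbitrarily
    for k in range(1, n):
        best_pos, best_gain = 0, float("-inf")
        for pos in range(len(order) + 1):
            left = order[pos - 1] if pos > 0 else None
            right = order[pos] if pos < len(order) else None
            gain = 2 * (bond(aff, left, k) + bond(aff, k, right) - bond(aff, left, right))
            if gain > best_gain:
                best_pos, best_gain = pos, gain
        order.insert(best_pos, k)
    return order
```

Applying the returned order to both rows and columns of the affinity matrix gives the sorted, symmetric matrix described in step 3.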

Partitioning the Sorted Matrix. The original BEA algorithm [16] did not propose how to partition the sorted matrix. The partitioning heuristic was added by Navathe et al. [17] for the problems in distributed database design. These heuristics were constructed using the goals of design: to minimize access time and storage costs. We do not have the luxury of such a clear-cut objective function in our case. Hence, to partition the sorted matrix into submatrices, each representing a gene cluster, we experimented with different heuristics and, finally, derived a heuristic that identifies the boundaries between clusters by sequentially finding the maximum sum of the quotients for corresponding cells in adjacent columns across the matrix. With each successive split, only those rows corresponding to the remaining columns were processed, i.e., only the remaining symmetrical portion of the submatrix was used for further iterations of the splitting algorithm. The number of clusters into which the gene affinity matrix was partitioned was determined by AUTOCLASS (described below); however, other heuristics might be useful for this determination. The boundary metric (B) for columns $G_i$ and $G_j$ used for placement of new column k between existing columns i and j was defined as:

\[ B(G_i, G_j) = \max_{p-1 \le q \le p} \sum_{k=p-1}^{p} \frac{\max\left(\mathrm{aff}(k, q), \mathrm{aff}(k, q+1)\right)}{\min\left(\mathrm{aff}(k, q), \mathrm{aff}(k, q+1)\right)}, \tag{6} \]

where q is the new splitting point (for simplicity, we use the number of the leftmost column in the new submatrix that is to the right of the splitting point), which will split the submatrix defined between two previous splitting points, p and p − 1 (which do not necessarily represent contiguous columns). To partition the entire sorted matrix, the following initial conditions are set: p = N, p − 1 = 0.
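One possible reading of the boundary heuristic in (6) is sketched below. Several details are assumptions rather than statements from the text: a small constant `eps` is added to both cells so that zero affinities do not cause division by zero, every current submatrix is examined at each step and the single best boundary overall is cut, and the process stops when the requested number of clusters is reached. The names `boundary_score` and `partition` are hypothetical.

```python
def boundary_score(a, lo, hi, c, eps=1e-6):
    """Sum over rows lo..hi-1 of max/min for adjacent columns c and c+1.

    `eps` is an assumed smoothing term; the source does not say how
    zero affinities are handled.
    """
    total = 0.0
    for k in range(lo, hi):
        x, y = a[k][c] + eps, a[k][c + 1] + eps
        total += max(x, y) / min(x, y)
    return total


def partition(a, n_clusters, eps=1e-6):
    """Split the sorted symmetric matrix `a` into contiguous column ranges.

    Returns a list of (start, end) half-open index ranges, one per cluster.
    """
    segments = [(0, len(a))]
    while len(segments) < n_clusters:
        best = None  # (score, segment index, cut position)
        for s, (lo, hi) in enumerate(segments):
            for c in range(lo, hi - 1):
                score = boundary_score(a, lo, hi, c, eps)
                if best is None or score > best[0]:
                    best = (score, s, c + 1)
        if best is None:  # nothing left to split
            break
        _, s, cut = best
        lo, hi = segments[s]
        segments[s:s + 1] = [(lo, cut), (cut, hi)]
    return segments
```

With this reading, a boundary scores highly when adjacent columns have very different affinity profiles, which is where a block-diagonal sorted matrix changes from one block to the next.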

2.3 K-Means Algorithm and Hierarchical Clustering Algorithm

K-means and hierarchical clustering analysis were performed using Cluster/Treeview programs available online (http://bonsai.ims.u-tokyo.ac.jp/~mdehoon/software/cluster/software.htm).

2.4 Self-Organizing Map

Self-organizing map was performed using GeneCluster 2.0 (http://www.broad.mit.edu/cancer/software/software.html).

The Euclidean distance measure was used when the gene × keyword matrix was used as input. When the gene × gene matrix was used as input, the gene similarity was calculated by (2).

2.5 Number of Clusters

In order to apply BEA-PARTITION and k-means clustering algorithms, the investigator needs to have a priori knowledge about the number of clusters in the test set. We determined the number of clusters by applying AUTOCLASS, an unsupervised Bayesian classification system developed by [22]. AUTOCLASS, which seeks a maximum posterior probability classification, determines the optimal number of classes in large data sets. Among a variety of applications, AUTOCLASS has been used for the discovery of new classes of infra-red stars in the IRAS Low Resolution Spectral catalogue, new classes of airports in a database of all US airports, and discovery of classes of proteins, introns and other patterns in DNA/protein sequence data [22]. We applied an open source implementation of AUTOCLASS (http://ic.arc.nasa.gov/ic/projects/bayes-group/autoclass/autoclass-c-program.html). The resulting number of clusters was then used as the endpoint for the partitioning step of the BEA-PARTITION algorithm. To determine whether AUTOCLASS could discover the number of clusters in the test sets correctly, we also tested numbers of clusters other than the ones AUTOCLASS predicted.


2.6 Evaluating the Clustering Results

To evaluate the quality of our resultant clusters, we used the established metrics of Purity, Entropy, and Mutual Information, which are briefly described below [23]. Let us assume that we have C classes (i.e., C expert clusters, as shown in Tables 1 and 2), while our clustering algorithms produce K clusters, $\pi_1, \pi_2, \ldots, \pi_K$.

Purity. Purity can be interpreted as classification accuracy under the assumption that all objects of a cluster are classified to be members of the dominant class for that cluster. If the majority of genes in cluster A are in class X, then class X is the dominant class. Purity is defined as the ratio between the number of items in cluster $\pi_i$ from dominant class j and the size of cluster $\pi_i$, that is:

\[ P(\pi_i) = \frac{1}{n_i} \max_j \left( n_i^j \right), \quad i = 1, 2, \ldots, K, \tag{7} \]

where $n_i = |\pi_i|$, that is, the size of cluster i, and $n_i^j$ is the number of genes in $\pi_i$ that belong to class j, $j = 1, 2, \ldots, C$. The closer to 1 the purity value is, the more similar this cluster is to its dominant class. Purity is measured for each cluster and the average purity of each test gene set cluster result was calculated.
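Purity as defined in (7) can be computed with a few lines of code. In this minimal sketch a cluster is represented by the list of expert class labels of its genes, and the average is taken as an unweighted mean over clusters; both choices are assumptions, since the text does not specify the data layout or the weighting. The function names are hypothetical.

```python
from collections import Counter


def purity(cluster_labels):
    """Purity of one cluster per Eq. (7).

    cluster_labels: expert class label of every gene in the cluster.
    """
    counts = Counter(cluster_labels)
    return max(counts.values()) / len(cluster_labels)


def average_purity(clusters):
    """Unweighted mean purity over clusters (the weighting is an assumption)."""
    return sum(purity(c) for c in clusters) / len(clusters)
```

For instance, a cluster holding two glutamate receptor genes and one cytoskeletal gene would have a purity of 2/3.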

Entropy. Entropy denotes how uniform the cluster is. If a cluster is composed of genes coming from different classes, then the value of entropy will be close to 1. If a cluster only contains one class, the value of entropy will be close to 0. The ideal value for entropy would be zero. Lower values of entropy would indicate better clustering. Entropy is also measured for each cluster and is defined as:

\[ E(\pi_i) = -\frac{1}{\log C} \sum_{j=1}^{C} \frac{n_i^j}{n_i} \log \left( \frac{n_i^j}{n_i} \right). \tag{8} \]

The average entropy of each test gene set cluster result was also calculated.
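The entropy of (8) can be sketched in the same style. The function name is hypothetical, a cluster is again assumed to be given as a list of expert class labels, and classes absent from the cluster are taken to contribute zero (the usual convention for 0 log 0). The number of classes C must be at least 2 for the normalization to be defined.

```python
import math
from collections import Counter


def entropy(cluster_labels, n_classes):
    """Normalized entropy of one cluster per Eq. (8).

    cluster_labels: expert class label of every gene in the cluster.
    n_classes: total number of expert classes C (must be >= 2).
    """
    n = len(cluster_labels)
    h = -sum((c / n) * math.log(c / n) for c in Counter(cluster_labels).values())
    return h / math.log(n_classes)
```

A cluster drawn from a single class gives 0, and a cluster split evenly over all C classes gives 1.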

Mutual Information. One problem with purity and entropy is that they are inherently biased to favor small clusters. For example, if we had one object for each cluster, then the value of purity would be 1 and entropy would be zero, no matter what the distribution of objects in the expert classes is.

Mutual information is a symmetric measure for the degree of dependency between clusters and classes. Unlike correlation, mutual information also takes higher order dependencies into account. We use mutual information because it captures how related clusters are to classes without bias towards small clusters. Mutual information is a measure of the agreement between the algorithm-derived clusters and the actual clusters. It measures how much information the algorithm-derived clusters give us for inferring the actual clusters. Random clustering has mutual information of 0 in the limit. Higher mutual information indicates higher similarity between the algorithm-derived clusters and the actual clusters. Mutual information is defined as:

\[ M(\pi) = \frac{2}{N} \sum_{i=1}^{K} \sum_{j=1}^{C} n_i^j \, \frac{\log \left( \dfrac{n_i^j \times N}{\sum_{t=1}^{K} n_t^j \, \sum_{t=1}^{C} n_i^t} \right)}{\log (K \times C)}, \tag{9} \]

where N is the total number of genes being clustered, K is the number of clusters the algorithm produced, and C is the number of expert classes.
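The mutual information of (9) can be computed from two parallel label lists, as in the following sketch. The function name is hypothetical, K and C are taken as the numbers of distinct cluster and class labels that actually occur (an assumption that every cluster and class is non-empty), and cells with a zero count are skipped because they contribute nothing to the sum.

```python
import math
from collections import Counter


def mutual_information(cluster_ids, class_ids):
    """Normalized mutual information per Eq. (9).

    cluster_ids[g] is the algorithm-derived cluster of gene g and
    class_ids[g] its expert class. Requires K * C > 1.
    """
    n = len(cluster_ids)
    joint = Counter(zip(cluster_ids, class_ids))   # n_i^j
    cluster_size = Counter(cluster_ids)            # sum over classes of n_i^t
    class_size = Counter(class_ids)                # sum over clusters of n_t^j
    k, c = len(cluster_size), len(class_size)
    total = sum(
        nij * math.log(nij * n / (cluster_size[i] * class_size[j]))
        for (i, j), nij in joint.items()
    )
    return (2.0 / n) * total / math.log(k * c)
```

With two equally sized classes, a clustering that reproduces the classes exactly scores 1, while a clustering that is independent of the classes scores 0.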

2.7 Top-Scoring Keywords Shared among Members of a Gene Cluster

Keywords were ranked according to their highest shared z-scores in each cluster. The keyword sharing strength metric ($K^a$) is defined as the sum of z-scores for a shared keyword a within the cluster, multiplied by the number of genes (M) within the cluster with which the word is associated; in this calculation z-scores less than a user-selected threshold are set to zero and are not counted.

\[ K^a = \sum_{g=1}^{M} \left( z^a_g \right) \times \sum_{g=1}^{M} \mathrm{Count}\left( z^a_g \right). \tag{10} \]

Thus, larger values reflect stronger and more extensive keyword associations within a cluster. We identified the 30 highest scoring keywords for each of the four clusters and provided these four lists to approximately 20 students, postdoctoral fellows, and faculty, asking them to guess a major function of the underlying genes that gave rise to the four keyword lists.
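The keyword sharing strength of (10) and the resulting ranking can be sketched as follows. The function names and the input format (a dictionary mapping each gene in the cluster to its keyword z-scores) are assumptions made for illustration, and Count is read as 1 for a z-score at or above the threshold and 0 otherwise.

```python
def keyword_strength(z_scores, threshold=10.0):
    """Sharing strength of one keyword per Eq. (10).

    z_scores: the keyword's z-score for each gene in the cluster.
    Scores below `threshold` are set to zero and not counted.
    """
    kept = [z for z in z_scores if z >= threshold]
    return sum(kept) * len(kept)


def top_keywords(cluster, threshold=10.0, top=30):
    """Rank the keywords of a cluster by sharing strength.

    cluster: dict gene -> dict keyword -> z-score (assumed input format).
    Returns up to `top` (keyword, strength) pairs, strongest first.
    """
    words = {w for scores in cluster.values() for w in scores}
    ranked = [
        (w, keyword_strength([s.get(w, 0.0) for s in cluster.values()], threshold))
        for w in words
    ]
    ranked.sort(key=lambda pair: (-pair[1], pair[0]))
    return ranked[:top]
```

Multiplying by the gene count rewards keywords that are shared broadly across the cluster over keywords with a single very high z-score.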

3 RESULTS

3.1 Keywords and Keyword × Gene Matrix Generation

A list of keywords was generated for each gene to build the keyword × gene matrix. Keywords were sorted according to their z-scores. The keyword selection experiment (see below) showed that a z-score threshold of 10 generally produced better results, which suggests that keywords with z-scores lower than 10, e.g., “cell” and “express,” have less information content. The relative values of z-scores depended on the size of the background set (data not shown). Since we used 5.6 million abstracts as the background set, the z-scores of most of the informative keywords were well above 10 (based on smaller values of standard deviation in the definition of z-score). The keyword × gene matrices were used directly as inputs to k-means, the hierarchical clustering algorithm, and the self-organizing map. As required by the BEA approach, they were first converted to gene × gene matrices based on shared keywords, and these gene × gene matrices were used as inputs to BEA-PARTITION. An overview of the process of clustering genes by shared keywords is provided in Fig. 1.
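The conversion from a keyword × gene matrix to a gene × gene matrix can be sketched as follows. The sketch follows the description in the caption of Fig. 3, where each cell is the sum of z-score products over the keywords shared by a gene pair. The function name, the list-of-rows layout, and the treatment of the diagonal (a gene paired with itself) are assumptions made for illustration.

```python
def gene_affinity_matrix(keyword_gene, threshold=10.0):
    """keyword_gene[k][g] is the z-score of keyword k for gene g (assumed layout)."""
    n_genes = len(keyword_gene[0])
    affinity = [[0.0] * n_genes for _ in range(n_genes)]
    for row in keyword_gene:
        # Keywords below the z-score threshold are discarded for that gene.
        kept = [z if z >= threshold else 0.0 for z in row]
        for g in range(n_genes):
            if kept[g] == 0.0:
                continue
            for h in range(n_genes):
                # A keyword contributes only when both genes retain it.
                affinity[g][h] += kept[g] * kept[h]
    return affinity


# Three keywords (rows) by three genes (columns); invented z-scores.
toy = [
    [12.0, 20.0, 0.0],
    [15.0, 0.0, 30.0],
    [5.0, 40.0, 50.0],
]
aff = gene_affinity_matrix(toy)
```

The resulting matrix is symmetric, so it can be handed to any method that expects pairwise affinities between genes.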

3.2 Effect of Keyword Selection on Gene Clustering

The effect of using different z-score thresholds for keyword selection on the quality of the resulting clusters is shown in Figs. 2A1 and 2B1. For both test sets, BEA-PARTITION produced clusters with higher mutual information when z-score thresholds were within a range of 10 to 20. For the 44-gene set, k-means produced clusters with the highest mutual information when the z-score threshold was 8, while, for the 26-gene set, mutual information was highest when the z-score threshold was 15. For the remaining studies, we chose to use a z-score threshold of 10 to keep as many functional keywords as possible.

3.3 Number of Clusters

We then used AUTOCLASS to decide the number of clusters in the test sets. AUTOCLASS took the keyword × gene matrix as input and predicted that there were five clusters in the set of 26 genes and nine clusters in the set of 44 yeast genes. The effect of the number of clusters on algorithm performance is shown in Figs. 2A2 and 2B2. BEA-PARTITION again produced a better result regardless of the number of clusters used. BEA-PARTITION had the highest mutual information when the numbers of clusters were five (26-gene set) and nine (44-gene set), whereas k-means worked marginally better when the numbers of clusters were 8 (26-gene set) and 10 (44-gene set). Based on these results, we chose to use five and nine clusters, respectively, because AUTOCLASS assigned these choices higher probabilities than the alternatives.

3.4 Clustering of the 26-Gene Set by Keyword Associations

To determine whether keyword associations could be used to group genes appropriately, we clustered the 26-gene set with BEA-PARTITION, k-means, the hierarchical algorithm, SOM, and AUTOCLASS. Keyword lists were generated for each of these 26 genes, each of which belonged to one of four well-defined functional groups (Table 1). The resulting word × gene matrix had 26 columns (genes) and approximately 8,540 rows (words with z-scores ≥ 10 appearing in any of the query sets). BEA-PARTITION, with a z-score threshold of 10, correctly assigned 25 of 26 genes to the appropriate cluster based on the strength of keyword associations (Fig. 3). Tyrosine transaminase (TT) was the only outlier. As expected from BEA-PARTITION, cells inside clusters tended to have much higher values than those outside. The hierarchical clustering algorithm, with the gene × keyword matrix as input, generated a result similar to that of BEA-PARTITION (five clusters, with TT as the outlier) (Fig. 4a). The results with the gene × gene matrix as input are shown in tables in the supplementary materials, which can be found at www.computer.org/publications/dlib.

While BEA-PARTITION and the hierarchical clustering algorithm produced clusters very similar to the original functional classes, those produced by k-means (Table 4), self-organizing map (Table 5), and AUTOCLASS (Table 6), with the gene × keyword matrix as input, were heterogeneous and, thus, more difficult to explain. The average purity, average entropy, and mutual information of the BEA-PARTITION and hierarchical algorithm results were 1, 0, and 0.88, respectively; those of the k-means result were 0.53, 0.65, and 0.28; those of the SOM result were 0.76, 0.35, and 0.18; and those of the AUTOCLASS result were 0.82, 0.28, and 0.56 (Table 3; gene × keyword matrix as input). When the gene × gene matrix was used as input to the hierarchical algorithm, k-means, and SOM, the results were even worse as measured by purity, entropy, and mutual information (Table 3).

Fig. 1. Procedure for clustering genes by the strength of their associated keywords.

Fig. 2. Effect of keyword selection by z-score thresholds (A1 and B1) and different number of clusters (A2 and B2) on the cluster quality. Z-score thresholds were used to select the keywords for gene clustering. Those keywords with z-scores less than the threshold were discarded. To determine the effect of keyword selection by z-score thresholds on cluster quality, we tested z-score thresholds 0, 5, 8, 10, 15, 20, 30, 50, and 100. To determine whether AUTOCLASS could be used to discover the number of clusters in the test sets correctly, we tested numbers of clusters other than the ones AUTOCLASS predicted (four for the 26-gene set and nine for the 44-gene set).

3.5 Yeast Microarray Gene Clustering by Keyword Association

To determine whether our text mining/gene clustering approach could be used to group genes identified in microarray experiments, we clustered 44 yeast genes taken from Eisen et al. [6] via Cherepinsky et al. [4], again using BEA-PARTITION, the hierarchical algorithm, SOM, AUTOCLASS, and k-means. Keyword lists were generated for each of the 44 yeast genes (Table 2), and a matrix of 3,882 rows (words appearing in the query sets with z-scores ≥ 10) by 44 columns (genes) was created. The clusters produced by BEA-PARTITION, k-means, SOM, and AUTOCLASS are shown in Tables 7, 8, 9, and 10, respectively, whereas those produced by the hierarchical algorithm are shown in Fig. 4b. The average purity, average entropy, and mutual information of the BEA-PARTITION result were 0.74, 0.24, and 0.60, whereas those of the hierarchical algorithm, SOM, k-means, and AUTOCLASS results (gene × keyword matrix as input) were 0.86, 0.12, and 0.58; 0.60, 0.37, and 0.46; 0.61, 0.33, and 0.39; and 0.57, 0.39, and 0.49, respectively (Table 3).

3.6 Keywords Indicative of Major Shared Functions with a Gene Cluster

Keywords shared among genes (26-gene set) within each cluster were ranked according to a metric based on both the degree of significance (the sum of z-scores for each keyword) and the breadth of distribution (the number of genes within the cluster for which the keyword has a z-score greater than a selected threshold). This double-pronged metric obviated the difficulty encountered with keywords that had extremely high z-scores for single genes within the cluster but modest z-scores for the remainder. The 30 highest scoring keywords for each of the four clusters were tabulated (Table 11). The respective keyword lists appeared to be highly informative about the general function of the original, preselected clusters when shown to medical students, faculty, and postdoctoral fellows.

4 DISCUSSION

In this paper, we clustered genes by shared functional keywords. Our gene clustering strategy is similar to document clustering in information retrieval. Document clustering, defined as grouping documents into clusters according to their topics or main contents in an unsupervised manner, organizes large amounts of information into a small number of meaningful clusters and improves information retrieval performance via cluster-driven dimensionality reduction, term weighting, or query expansion [9], [24], [25], [26], [27].

Term vector-based document clustering has been widely studied in information retrieval [9], [24], [25], [26], [27]. A number of clustering algorithms have been proposed and many of them have been applied to bioinformatics research. In this report, we introduced a new algorithm for clustering genes, BEA-PARTITION. Our results showed that BEA-PARTITION, in conjunction with the heuristic developed for partitioning the sorted matrix, outperforms the k-means algorithm and SOM in two test sets. In the first set of genes (26-gene set), BEA-PARTITION, as well as the hierarchical algorithm, correctly assigned 25 of 26 genes in a test set of four known gene groups with one outlier, whereas k-means and SOM mixed the genes into five more evenly sized but less well functionally defined groups. In the 44-gene set, the result generated by BEA-PARTITION had the highest mutual information, indicating that BEA-PARTITION outperformed all the other four clustering algorithms.

Fig. 3. Gene clusters by keyword associations using BEA-PARTITION. Keywords with z-scores ≥ 10 were extracted from MEDLINE abstracts for 26 genes in four functional classes. The resulting word × gene sparse matrix was converted to a gene × gene matrix. The cell values are the sum of z-score products for all keywords shared by the gene pair. This value is divided by 1,000 for purpose of display. A modified bond energy algorithm [16], [17] was used to group genes into five clusters based on the strength of keyword associations, and the resulting gene clusters are boxed.

4.1 BEA-PARTITION versus k-Means

In this study, z-score thresholds were used for keyword selection. When the threshold was 0, all words, including noise (noninformative words and misspelled words), were used to cluster genes. Under the tested conditions, clusters produced by BEA-PARTITION had higher quality than those produced by k-means. BEA-PARTITION clusters genes based on their shared keywords. It is unlikely that genes within the same cluster share the same noisy words with high z-scores, which suggests that BEA-PARTITION is less sensitive to noise than k-means. In fact, BEA-PARTITION performed better than k-means in the two test gene sets under almost all test conditions (Fig. 2). BEA-PARTITION performed best when z-score thresholds were 10, 15, and 20, which indicated 1) that the words with z-scores less than 10 were less informative and 2) that few words with z-scores between 10 and 20 were shared by at least two genes, so they did not improve the cluster quality. When z-score thresholds were high (> 30 in the 26-gene set and > 20 in the 44-gene set), more informative words were discarded and, as a result, the cluster quality was degraded.


Fig. 4. Gene clusters by keyword associations using the hierarchical clustering algorithm. Keywords with z-scores ≥ 10 were extracted from MEDLINE abstracts for (a) 26 genes in four functional classes and (b) 44 genes in nine classes. The resulting word × gene sparse matrix was used as input to the hierarchical algorithm.

BEA-PARTITION is designed to group cells with larger values together, and the ones with smaller values together. The final order of the genes within the cluster reflected deeper interrelationships. Among the 10 glutamate receptor genes examined, GluR1, GluR2, and GluR4 are AMPA receptors, while GluR6, KA1, and KA2 are kainate receptors. The observation that BEA-PARTITION placed GluR6 and KA2 next to each other confirms that the literature associations between GluR6 and KA2 are higher than those between GluR6 and AMPA receptors. Furthermore, the association and interrelationships of the clustered groups with one another can be seen in the final clustering matrix. For example, TT was an outlier in Fig. 3; however, it still had higher affinity to PD1 (affinity = 202) and PD2 (affinity = 139) than to any other genes. Thus, TT appears to be strongly related to genes in the tyrosine and phenylalanine synthesis cluster, from which it originated.
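The idea of grouping large cell values together can be illustrated with the bond energy measure of McCormick et al. [16], which scores an ordering of rows and columns by how strongly each cell is bonded to its four neighbors. The Python sketch below computes that general measure for a symmetric gene × gene matrix under a given gene ordering. It only illustrates the objective and is not the BEA-PARTITION implementation used in this study; the function name and the toy matrix are hypothetical.

```python
def bond_energy(matrix, order):
    """Half the sum, over all cells, of each cell times its four neighbors (borders count as 0)."""
    n = len(order)
    # Apply the same permutation to rows and columns of the symmetric matrix.
    a = [[matrix[r][c] for c in order] for r in order]

    def at(i, j):
        return a[i][j] if 0 <= i < n and 0 <= j < n else 0.0

    total = 0.0
    for i in range(n):
        for j in range(n):
            total += a[i][j] * (at(i, j - 1) + at(i, j + 1) + at(i - 1, j) + at(i + 1, j))
    return total / 2.0


# Toy affinities: genes 0 and 2 are related, as are genes 1 and 3.
toy = [
    [9.0, 0.0, 8.0, 0.0],
    [0.0, 9.0, 0.0, 8.0],
    [8.0, 0.0, 9.0, 0.0],
    [0.0, 8.0, 0.0, 9.0],
]
interleaved = bond_energy(toy, [0, 1, 2, 3])
grouped = bond_energy(toy, [0, 2, 1, 3])
```

Placing related genes next to each other raises the measure, so an ordering that maximizes it tends to gather high-affinity cells into blocks along the diagonal.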

BEA-PARTITION has several advantages over the k-means algorithm: 1) while k-means generally produces a locally optimal clustering [2], BEA-PARTITION produces the globally optimal clustering by permuting the columns and rows of the symmetric matrix; 2) the k-means algorithm is sensitive to initial seed selection and noise [9].

TABLE 3. The Quality of the Gene Clusters Derived by Different Clustering Algorithms, Measured by Purity, Entropy, and Mutual Information

TABLE 4. Twenty-Six Gene Set k-Means Result (Gene × Keyword Matrix as Input)

4.2 BEA-PARTITION versus Hierarchical Algorithm

The hierarchical clustering algorithm, k-means, and self-organizing maps have all been widely used in microarray expression profile analysis. Hierarchical clustering organizes expression data into a binary tree without providing a clear indication of how the hierarchy should be cut into clusters. In practice, investigators define clusters by a manual scan of the genes in each node and rely on their biological expertise to notice shared functional properties of genes. Therefore, the definition of the clusters is subjective and, as a result, different investigators may interpret the same clustering result differently. Some have proposed automatically defining boundaries based on statistical properties of the gene expression profiles; however, the same statistical criteria may not be generally applicable to identify all relevant biological functions [10]. We believe that an algorithm that produces clusters with clear boundaries can provide more objective results and possibly new discoveries that are beyond the experts’ knowledge. In this report, our results showed that BEA-PARTITION can perform comparably to a hierarchical algorithm while providing distinct cluster boundaries.

4.3 k-Means versus SOM

The k-means algorithm and SOM can group objects into different clusters and provide clear boundaries. Despite its simplicity and efficiency, the SOM algorithm has several weaknesses that make its theoretical analysis difficult and limit its practical usefulness. Various studies have suggested that it is hard to find any criteria under which the SOM algorithm performs better than traditional techniques such as k-means [11]. Balakrishnan et al. [28] compared the SOM algorithm with k-means clustering on 108 multivariate normal clustering problems. The results showed that the SOM algorithm performed significantly worse than the k-means clustering algorithm. Our results also showed that k-means performed better than SOM by generating clusters with higher mutual information.

TABLE 5. Twenty-Six Gene SOM Result (Gene × Keyword Matrix as Input)

TABLE 6. Twenty-Six Gene AUTOCLASS Result (Gene × Keyword Matrix as Input)

4.4 Computing Time

The computing time of BEA-PARTITION, like that of the hierarchical algorithm and SOM, is on the order of N², commonly denoted O(N²), which means that it grows in proportion to the square of the number of genes. The computing time of k-means is on the order of N × K × T, denoted O(NKT), where N is the number of genes tested, K is the number of clusters, and T is the number of improvement steps (iterations) performed by k-means. In our study, the number of improvement steps was 1,000. Therefore, when the number of genes tested is about 1,000, BEA-PARTITION runs (a × K + b) times faster than k-means, where a and b are constants. As long as the number of genes to be clustered is less than the product of the number of clusters and the number of iterations, BEA-PARTITION will run faster than k-means.

TABLE 7. Forty-Four Yeast Genes BEA-PARTITION Result (Gene × Keyword Matrix as Input)

TABLE 8. Forty-Four Yeast Gene SOM Result (Gene × Keyword as Input)
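The crossover condition stated in Section 4.4 follows from comparing the two growth rates directly. The short derivation below is a sketch that ignores lower-order terms; the per-operation constants c_1 and c_2 are hypothetical placeholders, and the example uses K = 9 (the 44-gene set) with T = 1,000.

```latex
\[
\frac{\text{time}_{k\text{-means}}}{\text{time}_{\text{BEA}}}
  \approx \frac{c_1\, N K T}{c_2\, N^2}
  = \frac{c_1}{c_2} \cdot \frac{K T}{N},
\qquad
N^2 < N K T \iff N < K T .
\]
\[
\text{With } N \approx T = 1000:\quad \frac{K T}{N} \approx K ;
\qquad
\text{with } K = 9,\ T = 1000:\quad K T = 9000 .
\]
```

When N is close to T, the ratio is roughly proportional to K, which agrees with the linear form (a × K + b) given in the text. For comparable constants, the advantage holds in this example for gene sets of fewer than about 9,000 genes.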

4.5 Number of Clusters

One disadvantage of BEA-PARTITION and k-means compared to hierarchical clustering is that the investigator needs to have a priori knowledge about the number of clusters in the test set, which may not be known. We approached this problem by using AUTOCLASS to predict the number of clusters in the test sets. BEA-PARTITION performed best when it grouped the genes into five clusters (26-gene set) and nine clusters (44-gene set), which were predicted by AUTOCLASS with higher probabilities. Therefore, AUTOCLASS appears to be an effective tool to assist BEA-PARTITION in gene clustering.

TABLE 9. Forty-Four Yeast Gene k-Means Result (Gene × Keyword Matrix as Input)

TABLE 10. Forty-Four Yeast Gene AUTOCLASS Result (Gene × Keyword Matrix as Input)

5 CONCLUSIONS AND FUTURE WORK

There are several aspects of the BEA approach that we are currently exploring with more detailed studies. For example, although the BEA-PARTITION described here performs relatively well on small sets of genes, the larger gene lists expected from microarray experiments need to be tested. Furthermore, we derived a heuristic to partition the clustered affinity matrix into clusters. We anticipate that this heuristic, which is simply based on the sum of ratios of corresponding values from adjacent columns, will generally work regardless of the type of items being clustered. Generally, optimizing the heuristic to partition a sorted matrix after BEA-based clustering will be valuable. Finally, we are developing a Web-based tool that will include a text mining phase to identify functional keywords, and a gene clustering phase to cluster the genes based on the shared functional keywords. We believe that this tool should be useful for discovering novel relationships among sets of genes because it links genes by shared functional keywords rather than just reporting known interactions based on published reports. Thus, genes that never co-occur in the same publication could still be linked by their shared keywords.

The BEA approach has been applied successfully to other disciplines, such as operations research, production engineering, and marketing [18]. The BEA-PARTITION algorithm represents our extension to the BEA approach specifically for dealing with the problem of discovering functional similarity among genes based on functional keywords extracted from literature. We believe that this important clustering technique, which was originally proposed by McCormick et al. [16] to cluster questions on psychological instruments and later introduced by Navathe et al. [17] for clustering of data items in database design, has promise for application to other bioinformatics problems where starting matrices are available from experimental observations.

ACKNOWLEDGMENTS

This work was supported by NINDS (RD) and the Emory-Georgia Tech Research Consortium. The authors would like to thank Brian Revennaugh and Alex Pivoshenk for research support.

REFERENCES

[1] C. Blaschke, J.C. Oliveros, and A. Valencia, “Mining Functional Information Associated with Expression Arrays,” Functional & Integrative Genomics, vol. 1, pp. 256-268, 2001.

[2] Y. Xu, V. Olman, and D. Xu, “EXCAVATOR: A Computer Program for Efficiently Mining Gene Expression Data,” Nucleic Acids Research, vol. 31, pp. 5582-5589, 2003.

[3] D. Chaussabel and A. Sher, “Mining Microarray Expression Data by Literature Profiling,” Genome Biology, vol. 3, pp. 1-16, 2002.

[4] V. Cherepinsky, J. Feng, M. Rejali, and B. Mishra, “Shrinkage-Based Similarity Metric for Cluster Analysis of Microarray Data,” Proc. Nat’l Academy of Sciences USA, vol. 100, pp. 9668-9673, 2003.

[5] J. Quackenbush, “Computational Analysis of Microarray Data,” Nature Rev. Genetics, vol. 2, pp. 418-427, 2001.


TABLE 11. Top Ranking Keywords Associated with Each Gene Cluster

[6] M.B. Eisen, P.T. Spellman, P.O. Brown, and D. Botstein, “Cluster Analysis and Display of Genome-Wide Expression Patterns,” Proc. Nat’l Academy of Sciences USA, vol. 95, pp. 14863-14868, 1998.

[7] R. Herwig, A.J. Poustka, C. Müller, C. Bull, H. Lehrach, and J. O’Brien, “Large-Scale Clustering of cDNA-Fingerprinting Data,” Genome Research, vol. 9, pp. 1093-1105, 1999.

[8] P. Tamayo, D. Slonim, J. Mesirov, Q. Zhu, S. Kitareewan, E. Dmitrovsky, E.S. Lander, and T.R. Golub, “Interpreting Patterns of Gene Expression with Self-Organizing Maps: Methods and Application to Hematopoietic Differentiation,” Proc. Nat’l Academy of Sciences USA, vol. 96, pp. 2907-2912, 1999.

[9] A.K. Jain, M.N. Murty, and P.J. Flynn, “Data Clustering: A Review,” ACM Computing Surveys, vol. 31, pp. 264-323, 1999.

[10] S. Raychaudhuri, J.T. Chang, F. Imam, and R.B. Altman, “The Computational Analysis of Scientific Literature to Define and Recognize Gene Expression Clusters,” Nucleic Acids Research, vol. 15, pp. 4553-4560, 2003.

[11] B. Kegl, “Principal Curves: Learning, Design, and Applications,” PhD dissertation, Dept. of Computer Science, Concordia Univ., Montreal, Quebec, 2002.

[12] T.K. Jenssen, A. Laegreid, J. Komorowski, and E. Hovig, “A Literature Network of Human Genes for High-Throughput Analysis of Gene Expression,” Nat’l Genetics, vol. 178, pp. 139-143, 2001.

[13] D.R. Masys, J.B. Welsh, J.L. Fink, M. Gribskov, I. Klacansky, and J. Corbeil, “Use of Keyword Hierarchies to Interpret Gene Expression Patterns,” Bioinformatics, vol. 17, pp. 319-326, 2001.

[14] S. Raychaudhuri, H. Schutze, and R.B. Altman, “Using Text Analysis to Identify Functionally Coherent Gene Groups,” Genome Research, vol. 12, pp. 1582-1590, 2002.

[15] M. Andrade and A. Valencia, “Automatic Extraction of Keywords from Scientific Text: Application to the Knowledge Domain of Protein Families,” Bioinformatics, vol. 14, pp. 600-607, 1998.

[16] W.T. McCormick, P.J. Schweitzer, and T.W. White, “Problem Decomposition and Data Reorganization by a Clustering Technique,” Operations Research, vol. 20, pp. 993-1009, 1972.

[17] S. Navathe, S. Ceri, G. Wiederhold, and J. Dou, “Vertical Partitioning Algorithms for Database Design,” ACM Trans. Database Systems, vol. 9, pp. 680-710, 1984.

[18] P. Arabie and L.J. Hubert, “The Bond Energy Algorithm Revisited,” IEEE Trans. Systems, Man, and Cybernetics, vol. 20, pp. 268-274, 1990.

[19] A.T. Ozsu and P. Valduriez, Principles of Distributed Database Systems, second ed. Prentice Hall Inc., 1999.

[20] Y. Liu, M. Brandon, S. Navathe, R. Dingledine, and B.J. Ciliax, “Text Mining Functional Keywords Associated with Genes,” Proc. Medinfo 2004, pp. 292-296, Sept. 2004.

[21] Y. Liu, B.J. Ciliax, K. Borges, V. Dasigi, A. Ram, S. Navathe, and R. Dingledine, “Comparison of Two Schemes for Automatic Keyword Extraction from MEDLINE for Functional Gene Clustering,” Proc. IEEE Computational Systems Bioinformatics Conf. (CSB 2004), pp. 394-404, Aug. 2004.

[22] P. Cheeseman and J. Stutz, “Bayesian Classification (Autoclass): Theory and Results,” Advances in Knowledge Discovery and Data Mining, pp. 153-180, AAAI/MIT Press, 1996.

[23] A. Strehl, “Relationship-Based Clustering and Cluster Ensembles for High-Dimensional Data Mining,” PhD dissertation, Dept. of Electric and Computer Eng., The University of Texas at Austin, 2002.

[24] R. Baeza-Yates and B. Ribeiro-Neto, Modern Information Retrieval. New York: Addison Wesley Longman, 1999.

[25] F. Sebastiani, “Machine Learning in Automated Text Categorization,” ACM Computing Surveys, vol. 34, pp. 1-47, 1999.

[26] P. Willett, “Recent Trends in Hierarchic Document Clustering: A Critical Review,” Information Processing and Management, vol. 24, pp. 577-597, 1988.

[27] J. Aslam, A. Leblanc, and C. Stein, “Clustering Data without Prior Knowledge,” Proc. Algorithm Eng.: Fourth Int’l Workshop, 1982.

[28] P.V. Balakrishnan, M.C. Cooper, V.S. Jacob, and P.A. Lewis, “A Study of the Classification Capabilities of Neural Networks Using Unsupervised Learning: A Comparison with K-Means Clustering,” Psychometrika, vol. 59, pp. 509-525, 1994.

Ying Liu received the BS degree in environmental biology from Nanjing University, China. He received Master’s degrees in bioinformatics and computer science from Georgia Institute of Technology in 2002. He is a PhD candidate in the College of Computing, Georgia Institute of Technology, where he works on text mining biomedical literature to discover gene-to-gene relationships. His research interests include bioinformatics, computational biology, data mining, text mining, and database systems. He is a student member of the IEEE Computer Society.

Shamkant B. Navathe received the PhD degree from the University of Michigan in 1976. He is a professor in the College of Computing, Georgia Institute of Technology. He has published more than 130 refereed papers in database research; his important contributions are in database modeling, database conversion, database design, conceptual clustering, distributed database allocation, data mining, and database integration. Current projects include text mining of medical literature databases, creation of databases for biological applications, transaction models in P2P and Web applications, and data mining for better understanding of genomic/proteomic and medical data. His recent work has been focusing on issues of mobility, scalability, interoperability, and personalization of databases in scientific, engineering, and e-commerce applications. He is an author of the book Fundamentals of Database Systems, with R. Elmasri (Addison Wesley, fourth edition, 2004), which is currently the leading database textbook worldwide. He also coauthored the book Conceptual Design: An Entity Relationship Approach (Addison Wesley, 1992) with Carlo Batini and Stefano Ceri. He was the general cochairman of the 1996 International VLDB (Very Large Data Base) Conference in Bombay, India. He was also program cochair of ACM SIGMOD 1985 at Austin, Texas. He is also on the editorial boards of Data and Knowledge Engineering (North Holland), Information Systems (Pergamon Press), Distributed and Parallel Databases (Kluwer Academic Publishers), and World Wide Web Journal (Kluwer). He has been an associate editor of IEEE Transactions on Knowledge and Data Engineering. He is a member of the IEEE.

Jorge Civera received the BSc degree in computer science from the Universidad Politecnica de Valencia in 2002, and the MSc degree in computer science from Georgia Institute of Technology in 2003. He is currently a PhD student at Departamento de Sistemas Informaticos y Computacion and a research assistant in the Instituto Tecnologico de Informatica. He also holds a fellowship from the Spanish Ministry of Education and Culture. His research interests include bioinformatics, machine translation, and text mining.

Venu Dasigi received the BE degree in electronics and communication engineering from Andhra University in 1979, the MEE degree in electronic engineering from the Netherlands Universities Foundation for International Cooperation in 1981, and the MS and PhD degrees in computer science from the University of Maryland, College Park in 1985 and 1988, respectively. He is currently professor and chair of computer science at Southern Polytechnic State University in Marietta, Georgia. He is also an honorary professor at Gandhi Institute of Technology and Management in India. He held research fellowships at the Oak Ridge National Laboratory and the Air Force Research Laboratory. His research interests include text mining, information retrieval, natural language processing, artificial intelligence, bioinformatics, and computer science education. He is a member of ACM and the IEEE Computer Society.


Ashwin Ram received the PhD degree from Yale University in 1989, the MS degree from the University of Illinois in 1984, and the BTech degree from IIT Delhi in 1982. He is an associate professor in the College of Computing at the Georgia Institute of Technology, an associate professor of Cognitive Science, and an adjunct professor in the School of Psychology. He has published two books and more than 80 scientific papers in international forums. His research interests lie in artificial intelligence and cognitive science, and include machine learning, natural language processing, case-based reasoning, educational technology, and artificial intelligence applications.

Brian J. Ciliax received the BS degree in biochemistry from Michigan State University in 1981, and the PhD degree in pharmacology from the University of Michigan in 1987. He is currently an assistant professor in the Department of Neurology at Emory University School of Medicine. His research interests include the functional neuroanatomy of the basal ganglia, particularly as it relates to hyperkinetic movement disorders such as Tourette’s Syndrome. Since 2000, he has collaborated with the coauthors on the development of a system to functionally cluster genes (identified by high-throughput genomic and proteomic assays) according to keywords mined from relevant MEDLINE abstracts.

Ray Dingledine received the PhD degree in pharmacology from Stanford. He is currently professor and chair of pharmacology at Emory University and serves on the Scientific Council of NINDS at NIH. His research interests include the application of microarray and associated technologies to identify novel molecular targets for neurologic disease, the normal functions and pathobiology of glutamate receptors, and the role of COX2 signaling in neurologic disease.


2004 Reviewers List

We thank the following reviewers for the time and energy they have given to TCBB:

A

John Aach, Tatsuya Akutsu, David Aldous, Aijun An, Iannis Apostolakis, Lars Arvestad, Daniel Ashlock, Kevin Atteson, Wai-Ho Au

B

Rolf Backofen, David Bader, Tim Bailey, Tomas Balla, Serafim Batzoglou, Gil Bejerano, Amir Ben-Dor, Asa Ben-Hur, Anne Bergeron, Olaf Bininda-Emonds, Riccardo Boscolo, Guillaume Bourque, Alvis Brazma, Daniel Brown, Duncan Brown, Barb Bryant, David Bryant, Jeremy Buhler, Joachim Buhmann

C

Andrea Califano, Colin Campbell, Alberto Caprara, Keith Chan, Claudine Chaouiya, Ferdinando Cicalese, Melissa Cline, David Corne, Nello Cristianini, Miklos Csuros, Adele Cutler

D

Patrik D’haeseleer, Michiel de Hoon, Arthur Delcher, Alain Denise, Marcel Dettling, Inderjit S. Dhillon, Diego di Bernardo, Adrian Dobra, Bruce R. Donald, Sebastián Dormido-Canto, Zhihua Du, Blythe Durbin

E

Nadia El-Mabrouk, Charles Elkan, Eleazar Eskin

F

Giancarlo Ferrari-Trecate, Liliana Florea, Gary Fogel, Yoav Freund, Jane Fridlyand, Yan Fu, Terrence Furey, Cesare Furlanello

G

Olivier Gascuel, Dan Geiger, Zoubin Ghahramani, Debashis Ghosh, Pulak Ghosh, Raffaele Giancarlo, Robert Giegerich, David Gilbert, Jan Gorodkin, John Goutsias, Daniel Gusfield, Isabelle M. Guyon, Adolfo Guzman-Arenas

H

Sridhar Hannenhalli, Alexander Hartemink, Tzvika Hartman, Lisa Holm, Paul Horton, Steve Horvath, Xiao Hu, Haiyan Huang, Alan Hubbard, Katharina Huber, Dirk Husmeier, Daniel Huson

J

Inge Jonassen, Rebecka Jornsten

K

Jaap Kaandorp, Markus Kalisch, Rachel Karchin, Juha Karkkainen, Kevin Karplus, Simon Kasif, Samuel Kaski, Ed Keedwell, Purvesh Khatri, Hyunsoo Kim, Junhyong Kim, Ross D. King, Andrzej Konopka, Hamid Krim, Nandini Krishnamurthy, Gregory Kucherov, David Kulp

L

Michelle Lacey, Wai Lam, Giuseppe Lancia, Michael Lappe, Richard Lathrop, Nicolas Le Novere, Thierry LeCroq, Hansheng Lei, Boaz Lerner, Christina Leslie, Ilya Levner, Dequan Li, Fan Li, Jinyan Li, Wentian Li, Jie Liang, Olivier Lichtarge, Charles Ling, Michal Linial, Huan Liu, Zhenqiu Liu, Stanley Loh, Heitor Lopes, Rune Lyngsoe

M

Bin Ma, Patrick Ma, François Major, Elisabetta Manduchi, Mark Marron, Jens Meiler, Stefano Merler, Webb Miller, Marta Milo, Satoru Miyano, Annette Molinaro, Shinichi Morishita, Vincent Moulton, Marcus Mueller, Sayan Mukherjee, Rory Mulvaney, T.M. Murali, Simon Myers

N

Iftach Nachman, Luay Nakhleh, Anand Narasimhamurthy, Gonzalo Navarro, William Noble

O

Enno Ohlebusch, Arlindo Oliveira, Jose Oliver, Christos Ouzounis

P

Junfeng Pan, Rong Pan, Wei Pan, Paul Pavlidis, Itsik Pe’er, Christian Pedersen, Anton Petrov, Tuan Pham, Katherine Pollard, Gianluca Pollastri, Calton Pu

R

John Rachlin, Mark Ragan, Jagath Rajapakse, R.S. Ramakrishna, Isidore Rigoutsos, Dave Ritchie, Fredrik Ronquist, Juho Rousu, Jem Rowland, Larry Ruzzo, Leszek Rychlewski

S

Gerhard Sagerer, Steven Salzberg, Herbert Sauro, Alejandro Schaffer, Alexander Schliep, Scott Schmidler, Jeanette Schmidt, Alexander Schönhuth, Charles Semple, Soheil Shams, Roded Sharan, Chad Shaw, Dinggang Shen, Dou Shen, Lisan Shen, Stanislav Shvartsman, Amandeep Sidhu, Richard Simon, Sameer Singh, Janne Sinkkonen, Steven S. Skiena, Quinn Snell, Carol Soderlund, Rainer Spang, Peter Stadler, Mike Steel, Gerhard Steger, Jens Stoye, Jack Sullivan, Krister Swenson

T

Pablo Tamayo, Amos Tanay, Chun Tang, Jijun Tang, Thomas Tang, Glenn Tesler, Robert Tibshirani, Martin Tompa, Anna Tramontano, James Troendle, Jerry Tsai, Koji Tsuda, John Tyson

V

Eugene van Someren, Stella Veretnik, David Vogel, Gwenn Volkert

W

Baoying Wang, Chang Wang, Lisan Wang, Tandy Warnow, Michael K. Weir, Jason Weston, Ydo Wexler, Nalin Wickramarachchi, Chris Wiggins, David Wild, Tiffani Williams, Thomas Wu

X

Dong Xu, Jinbo Xu

Y

Qiang Yang, Yee Hwa Yang, Zizhen Yao, Daniel Yekutieli, Jeffrey Yu

Z

Mohammed J. Zaki, An-Ping Zeng, Chengxiang Zhai, Jingfen Zhang, Kaizhong Zhang, Xuegong Zhang, Yang Zhang, Zhi-Hua Zhou, Zonglin Zhou, Ji Zhu


