Guest Editorial: WABI Special Section Part II
Junhyong Kim and Inge Jonassen
THE Fourth International Workshop on Algorithms in Bioinformatics (WABI) 2004 was held in Bergen, Norway, September 2004. The program committee consisted of 33 members and selected, among 117 submissions, 39 to be presented at the workshop and included in the proceedings from the workshop (volume 3240 of Lecture Notes in Bioinformatics, series edited by Sorin Istrail, Pavel Pevzner, and Michael Waterman).
The WABI 2004 program committee selected a small
number of papers among the 39 to be invited to submit
extended versions of their papers to a special section of the
IEEE/ACM Transactions on Computational Biology and Bioin-
formatics. Four papers were published in the October-
December 2004 issue of the journal and this issue contains
an additional three papers. We would like to thank both the
entire program committee for WABI and the reviewers of
the papers in this issue for their valuable contributions.
The first of the papers is “A New Distance for High Level
RNA Secondary Structure Comparison” authored by Julien
Allali and Marie-France Sagot. This paper describes algorithms
for comparing secondary structures of RNA molecules
where the structures are represented by trees. The problem of
classifying RNA secondary structure is becoming critical as
biologists are discovering more and more noncoding func-
tional elements in the genome (e.g., miRNA). Most likely, the
major functional determinants of the elements are their
secondary structure and, therefore, a metric between such
secondary structures will also help delineate clusters of
functional groups. In Allali and Sagot’s paper, two tree
representations of secondary structure are compared by
analysing how one tree can be transformed into the other
using an allowed set of operations. Each operation can be
associated with a cost and the distance between two trees can
then be defined as the minimum cost associated with a
transform of one tree to the other. Allali and Sagot introduce
two new operations that they name edge fusion and node
fusion and show that these alleviate limitations associated
with the classical tree edit operations used for RNA
comparison. Importantly, they also present algorithms for
calculating the distance between trees allowing the new
operations in addition to the classical ones, and analyze the
performance of the algorithms.
The second paper is “Topological Rearrangements and Local Search Method for Tandem Duplication Trees” and is authored by Denis Bertrand and Olivier Gascuel. The paper approaches the problem of estimating the evolutionary history of tandem repeats. A tandem repeat is a stretch of DNA sequence that contains an element that is repeated multiple times and where the repeat occurrences are next to each other in the sequence. Since the repeats are subject to mutations, they are not identical. Tandem repeats arise through evolution by “copying” (duplication) of repeat elements in blocks of varying size. Bertrand and Gascuel address the problem of finding the most likely sequence of events giving rise to the observed set of repeats. Each sequence of events can be described by a duplication tree and one searches for the tree that is the most parsimonious, i.e., one that explains how the sequence has evolved from an ancestral single copy with a minimum number of mutations along the branches of the tree. The main difference with the standard phylogeny problem is that the linear ordering of the tandem duplications imposes constraints on the possible binary tree form. This paper describes a local search method that allows exploration of the complete space of possible duplication trees and shows that the method is superior to other existing methods for reconstructing the tree and recovering its duplication events.
The third paper is “Optimizing Multiple Seeds for Homology Search” authored by Daniel G. Brown. The paper presents an approach to selecting starting points for pairwise local alignments of protein sequences. The problem of pairwise local alignment is to find a segment from each sequence so that the two local segments can be aligned to obtain a high score. For commonly used scoring schemes, this can be solved exactly using dynamic programming. However, pairwise alignment is frequently applied to large data sets and heuristic methods for restricting the alignments to be considered are frequently used, for instance, in the BLAST programs. The key is to restrict the number of alignments as much as possible, by choosing a few good seeds, without missing high scoring alignments. The paper shows that this can be formulated as an integer programming problem and presents an algorithm for choosing optimal seeds. Analysis is presented showing that the approach gives four times fewer false positives (unnecessary seeds) in comparison with BLASTP without losing more good hits.
Junhyong Kim
Inge Jonassen
Guest Editors
IEEE/ACM TRANSACTIONS ON COMPUTATIONAL BIOLOGY AND BIOINFORMATICS, VOL. 2, NO. 1, JANUARY-MARCH 2005 1
. J. Kim is with the Department of Biology, University of Pennsylvania, 3451 Walnut Street, Philadelphia, PA 19104. E-mail: [email protected].
. I. Jonassen is with the Department of Informatics and Computational Biology Unit, University of Bergen, HIB N5020 Bergen, Norway. E-mail: [email protected].
For information on obtaining reprints of this article, please send e-mail to: [email protected].
1545-5963/05/$20.00 © 2005 IEEE Published by the IEEE CS, CI, and EMB Societies & the ACM
Junhyong Kim is the Edmund J. and Louise Kahn Term Endowed Professor in the Department of Biology at the University of Pennsylvania. He holds joint appointments in the Department of Computer and Information Science, Penn Center for Bioinformatics, and the Penn Genomics Institute. He serves on the editorial board of Molecular Development and Evolution and the IEEE/ACM Transactions on Computational Biology and Bioinformatics, the council of the Society for Systematic Biology, and the executive committee of the Cyber Infrastructure for Phylogenetics Research. His research focuses on computational and experimental approaches to comparative development. The current focus of his lab is in three areas: computational phylogenetics, in silico gene discovery, and comparative development using genome-wide gene expression data.
Inge Jonassen is a professor of computer science in the Department of Informatics at the University of Bergen in Norway, where he is a member of the bioinformatics group. He is also affiliated with the Bergen Center for Computational Science at the same university where he heads the Computational Biology Unit. He is also vice president of the Society for Bioinformatics in the Nordic Countries (SocBiN) and a member of the board of the Nordic Bioinformatics Network. He coordinates the technology platform for bioinformatics funded by the Norwegian Research Council functional genomics programme FUGE. He has worked in the field of bioinformatics since the early 1990s, where he has primarily focused on methods for discovery of patterns with applications to biological sequences and structures and on methods for the analysis of microarray gene expression data.
A New Distance for High Level RNA Secondary Structure Comparison
Julien Allali and Marie-France Sagot
Abstract—We describe an algorithm for comparing two RNA secondary structures coded in the form of trees that introduces two new
operations, called node fusion and edge fusion, besides the tree edit operations of deletion, insertion, and relabeling classically used in
the literature. This allows us to address some serious limitations of the more traditional tree edit operations when the trees represent
RNAs and what is searched for is a common structural core of two RNAs. Although the algorithm complexity has an exponential term, this
term depends only on the number of successive fusions that may be applied to a same node, not on the total number of fusions. The
algorithm remains therefore efficient in practice and is used for illustrative purposes on ribosomal as well as on other types of RNAs.
Index Terms—Tree comparison, edit operation, distance, RNA, secondary structure.
1 INTRODUCTION
RNAs are one of the fundamental elements of a cell. Their
role in regulation has been recently shown to be far more prominent than initially believed (20 December 2002
issue of Science, which designated small RNAs with
regulatory function as the scientific breakthrough of the
year). It is now known, for instance, that there is massive
transcription of noncoding RNAs. Yet current mathematical
and computer tools remain mostly inadequate to identify,
analyze, and compare RNAs.
An RNA may be seen as a string over the alphabet of nucleotides (also called bases), {A, C, G, U}. Inside a cell, RNAs do not retain a linear form, but instead fold in space. The fold is given by the set of nucleotide bases that pair. The main type of pairing, called canonical, corresponds to bonds of the type A-U and G-C. Other rarer types of bonds may be observed, the most frequent among them being G-U, also called the wobble pair. Fig. 1 shows the sequence of a folded RNA. Each box represents a consecutive sequence of bonded pairs, corresponding to a helix in 3D space. The secondary structure of an RNA is the set of helices (or the list of paired bases) making up the RNA. Pseudoknots, which may be described as a pair of interleaved helices, are in general excluded from the secondary structure of an RNA. RNA secondary structures can thus be represented as planar graphs. An RNA primary structure is its sequence of nucleotides while its tertiary structure corresponds to the geometric form the RNA adopts in space.
Apart from helices, the other main structural elements in
an RNA are:
1. hairpin loops which are sequences of unpaired bases closing a helix;
2. internal loops which are sequences of unpaired bases linking two different helices;
3. bulges which are internal loops with unpaired bases on one side only of a helix;
4. multiloops which are unpaired bases linking at least three helices.
Stems are successions of one or more among helices,
internal loops, and/or bulges.
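The definitions above (secondary structure as a list of paired bases, helices as runs of consecutive bonded pairs) can be illustrated with a minimal Python sketch. It assumes the structure is given in dot-bracket notation, an input format that is not used in this paper and is chosen here only for convenience; the function names `parse_pairs` and `helices` are hypothetical.

```python
def parse_pairs(dot_bracket):
    """Return the paired positions (i, j), i < j, of a pseudoknot-free structure.

    Assumption: '(' and ')' mark paired bases, any other symbol an unpaired base.
    """
    stack, pairs = [], []
    for pos, symbol in enumerate(dot_bracket):
        if symbol == "(":
            stack.append(pos)
        elif symbol == ")":
            pairs.append((stack.pop(), pos))
    return sorted(pairs)


def helices(pairs):
    """Group sorted pairs into helices, i.e., maximal runs (i, j), (i+1, j-1), ..."""
    result, current = [], []
    for i, j in pairs:
        if current and current[-1] == (i - 1, j + 1):
            current.append((i, j))      # stacked on the previous pair
        else:
            if current:
                result.append(current)
            current = [(i, j)]          # a new helix starts here
    if current:
        result.append(current)
    return result


# Two helices of two pairs each, separated by an internal loop.
structure = "((..((...))..))"
print(helices(parse_pairs(structure)))
```

Under this reading, the unpaired positions between two consecutive helices of the same stem form an internal loop or a bulge, and those enclosed by the innermost pair form a hairpin loop.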
The comparison of RNA secondary structures is one of
the main basic computational problems raised by the study
of RNAs. It is the problem we address in this paper. The
motivations are many. RNA structure comparison has been
used in at least one approach to RNA structure prediction
that takes as initial data a set of unaligned sequences
supposed to have a common structural core [1]. For each
sequence, a set of structural predictions are made (for
instance, all suboptimal structures predicted by an algorithm
like Zuker’s MFOLD [15], or all suboptimal sets of
compatible helices or stems). The common structure is then
found by comparing all the structures obtained from the
initial set of sequences, and identifying a substructure
common to all, or to some of the sequences. RNA structure
comparison is also an essential element in the discovery of
RNA structural motifs, or profiles, or of more general
models that may then be used to search for other RNAs of
the same type in newly sequenced genomes. For instance,
general models for tRNAs and introns of group I have been
derived by hand [3], [10]. It is an open question whether
models at least as accurate as these, or perhaps even more
accurate, could have been derived in an automatic way. The
identification of smaller structural motifs is an equally
important topic that requires comparing structures.
As we saw, the comparison of RNA structures may
concern known RNA structures (that is, structures that were
experimentally determined) or predicted structures. The
. J. Allali is with the Institut Gaspard-Monge, Universite de Marne-la-Vallee, Cite Descartes, Champs-sur-Marne, 77454, Marne-la-Vallee Cedex 2, France. E-mail: [email protected].
. M.-F. Sagot is with Inria Rhone-Alpes, Universite Claude Bernard, Lyon I, 43 Bd du Novembre 1918, 69622 Villeurbanne cedex, France. E-mail: [email protected].
Manuscript received 11 Oct. 2004; accepted 20 Dec. 2004; published online 30 Mar. 2005. For information on obtaining reprints of this article, please send e-mail to: [email protected], and reference IEEECS Log Number TCBB-0164-1004.
objective in both cases is the same: to find the common parts
of such structures.
In [11], Shapiro suggested to mathematically model RNA
secondary structures without pseudoknots by means of
trees. The trees are rooted and ordered, which means that
the order among the children of a node matters. This order
corresponds to the 5’-3’ orientation of an RNA sequence.
Given two trees representing each an RNA, there are two
main ways for comparing them. One is based on the
computation of the edit distance between the two trees
while the other consists in aligning the trees and using the
score of the alignment as a measure of the distance between
the trees. Contrary to what happens with sequences, the
two, alignment and edit distance, are not equivalent. The
alignment distance is a restrained form of the edit distance
between two trees, where all insertions must be performed
before any deletions. The alignment distance for general
trees was defined in 1994 by Jiang et al. in [9] and extended
to an alignment distance between forests in [6]. More
recently, Hochsmann et al. [7] applied the tree alignment
distance to the comparison of two RNA secondary
structures. Because of the restriction on the way edit
operations can be applied in an alignment, we are not
concerned in this paper with tree alignment distance and
we therefore address exclusively from now on the problem
of tree edit distance.
Our way for comparing two RNA secondary structures is
then to apply a number of tree edit operations in one or both of
the trees representing the RNAs until isomorphic trees are
obtained. The currently most popular program using this
approach is probably the Vienna package [5], [4]. The tree edit
operations considered are derived from the operations
classically applied to sequences [13]: substitution, deletion,
and insertion. In 1989, Zhang and Shasha [14] gave a dynamic
programming algorithm for comparing two trees. Shapiro
and Zhang then showed [12] how to use tree editing to
compare RNAs. The latter also proposed various tree models
that could be used for representing RNA secondary struc-
tures. Each suggested tree offers a more or less detailed view
of an RNA structure. Figs. 2b, 2c, 2d, and 2e present a few
examples of such possible views for the RNA given in Fig. 2a.
In Fig. 2, the nodes of the tree in Fig. 2b represent either
unpaired bases (leaves) or paired bases (internal nodes). Each
node is labeled with, respectively, a base or a pair of bases. A
node of the tree in Fig. 2c represents a set of successive
unpaired bases or of stacked paired ones. The label of a node
is an integer indicating, respectively, the number of unpaired
bases or the height of the stack of paired ones. The nodes of the
tree in Fig. 2d represent elements of secondary structure:
hairpin loop (H), bulge (B), internal loop (I), or multiloop (M).
The edges correspond to helices. Finally, the tree in Fig. 2e
contains only the information concerning the skeleton of
multiloops of an RNA. The last representation, though giving
a highly simplified view of an RNA, is important nevertheless
as it is generally accepted that it is this skeleton which is
usually the most constrained part of an RNA. The last two
models may be enriched with information concerning, for
instance, the number of (unpaired) bases in a loop (hairpin,
internal, multi) or bulge, and the number of paired bases in a
helix. The first label the nodes of the tree, the second its edges.
Other types of information may be added (such as overall
composition of the elements of secondary structure). In fact,
one could consider working with various representations
simultaneously or in an interlocked, multilevel fashion. This
goes beyond the scope of this paper which is concerned with
comparing RNA secondary structures using any one among
the many tree representations possible. We shall, however,
comment further on this multilevel approach later on.
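As an illustration of the most detailed representation (the one of Fig. 2b, with unpaired bases as leaves and base pairs as internal nodes), the following Python sketch builds such a tree. It is a sketch under assumptions not stated in the paper: the structure is supplied in dot-bracket notation, an artificial node labeled "root" collects the top-level elements, and the names `Node` and `base_pair_tree` are hypothetical.

```python
from dataclasses import dataclass, field


@dataclass
class Node:
    label: str
    children: list = field(default_factory=list)


def base_pair_tree(sequence, dot_bracket):
    """Build an ordered rooted tree: leaves are unpaired bases, internal nodes base pairs.

    Children are appended in sequence order, so their order follows the
    5'-3' orientation of the RNA.
    """
    root = Node("root")          # assumption: artificial root above all elements
    stack = [root]               # chain of currently open base pairs
    opening = []                 # positions of the unmatched '(' symbols
    for pos, symbol in enumerate(dot_bracket):
        if symbol == "(":
            node = Node("")      # label is filled in when the pair closes
            stack[-1].children.append(node)
            stack.append(node)
            opening.append(pos)
        elif symbol == ")":
            node = stack.pop()
            node.label = sequence[opening.pop()] + "-" + sequence[pos]
        else:
            stack[-1].children.append(Node(sequence[pos]))
    return root


tree = base_pair_tree("GGAAACC", "((...))")
print(tree)
```

Coarser views such as those of Figs. 2c to 2e could be derived from this tree by collapsing runs of stacked pairs and of unpaired bases, which is not attempted here.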
Concerning the objectives of this paper, they are twofold.
The first is to give some indications on why the classical edit
operations that have been considered so far in the literature
for comparing trees present some limitations when the trees
stand for RNA structures. Three cases of such limitations will
be illustrated through examples in Section 3. In Section 4, we
then introduce two novel operations, so-called node-fusion
and edge-fusion, that enable us to address some of these
limitations and then give a dynamic programming algorithm
for comparing two RNA structures with these two additional
operations. Implementation issues and initial results are
presented in Section 4. In Section 5, we give a first application
Fig. 1. Primary and secondary structures of a transfer RNA.
Fig. 2. Example of different tree representations ((b), (c), (d), and (e)) of
the same RNA (a).
of our algorithm to the comparison of two RNA secondary
structures. Finally, in Section 6, we sketch the main ideas
behind the multilevel RNA comparison approach mentioned
above. Before that, we start by introducing some notation and
by recalling in the next section the basics about classical tree
edit operations and tree mapping.
This paper is an extended version of a paper presented at
the Workshop on Algorithms in BioInformatics (WABI) in
2004, in Bergen, Norway. A few more examples are given to
illustrate some of the points made in the WABI paper,
complexity and implementation issues are discussed in
more depth as are the cost functions and a multilevel
approach to comparing RNAs.
2 TREE EDITING AND MAPPING
Let T be an ordered rooted tree, that is, a tree where the
order among the children of a node matters. We define
three kinds of operations on T : deletion, insertion, and
relabeling (corresponding to a substitution in sequence
comparison). The operations are shown in Fig. 3. The
deletion (Fig. 3b) of a node u removes u from the tree. The
children of u become the children of u’s father. An insertion
(Fig. 3c) is the symmetric counterpart of a deletion. Given a node $u$, we
remove a consecutive (in relation to the order among the
children) set $u_1, \ldots, u_p$ of its children, create a new node $v$,
make $v$ a child of $u$ by attaching it at the place where the set
was, and, finally, make the set $u_1, \ldots, u_p$ (in the same order)
the children of $v$. The relabeling of a node (Fig. 3d) consists
simply in changing its label.
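The three operations just described can be sketched in Python on a simple ordered-tree structure. This is only an illustrative sketch: the class `Node` and the function names are hypothetical, and a node to delete is addressed through its father and its position among the father's children.

```python
from dataclasses import dataclass, field


@dataclass
class Node:
    label: str
    children: list = field(default_factory=list)


def delete(father, index):
    """Delete the child of `father` at `index`; its children take its place, in order."""
    removed = father.children[index]
    father.children[index:index + 1] = removed.children
    return removed


def insert(father, start, stop, label):
    """Insert a new node that adopts the consecutive children father.children[start:stop]."""
    new = Node(label, father.children[start:stop])
    father.children[start:stop] = [new]
    return new


def relabel(node, label):
    """Change the label of a node, leaving the shape of the tree untouched."""
    node.label = label


# A(B(C, D), E): deleting B makes C and D children of A, between nothing and E.
tree = Node("A", [Node("B", [Node("C"), Node("D")]), Node("E")])
delete(tree, 0)
print([child.label for child in tree.children])
```

Calling `insert(tree, 0, 2, "B")` on the result restores the original tree, which reflects the symmetry between insertion and deletion.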
Given two trees $T$ and $T'$, we define $S = \{s_1 \ldots s_e\}$ to be a series of edit operations such that, if we apply successively the operations in $S$ to the tree $T$, we obtain $T'$ (i.e., $T$ and $T'$ become isomorphic). A series of operations like $S$ realizes the editing of $T$ into $T'$ and is denoted by $T \xrightarrow{S} T'$.
We define a function $\mathrm{cost}$ from the set of possible edit operations (deletion, insertion, relabeling) to the integers (or the reals) such that $\mathrm{cost}_s$ is the score of the edit operation $s$. If $S$ is a series of edit operations, we define by extension that $\mathrm{cost}_S$ is $\sum_{s \in S} \mathrm{cost}_s$. We can define the edit distance between two trees as the minimal cost of a series of operations that performs the editing of $T$ into $T'$:
$\mathrm{distance}(T, T') = \min\{\mathrm{cost}_S \mid T \xrightarrow{S} T'\}$.
Let an insertion or a deletion cost one and the relabeling of a node cost zero if the label is the same and one otherwise. For the two trees of the figure on the left, the series $\mathrm{relabel}(A \to F).\mathrm{delete}(B).\mathrm{insert}(G)$ realizes the editing of the left tree into the right one and costs 3. Another possibility is the series $\mathrm{delete}(B).\mathrm{relabel}(A \to G).\mathrm{insert}(F)$ which also costs 3. The distance between these two trees is 3.
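The arithmetic of this example can be checked with a short Python sketch of the unit cost function. The encoding of an operation as a tuple and the names `cost` and `series_cost` are assumptions made for illustration; the sketch only sums the costs of a given series and does not search for the minimal one.

```python
def cost(op):
    """Unit costs: 1 for an insertion or deletion, 0 or 1 for a relabeling."""
    kind = op[0]
    if kind in ("insert", "delete"):
        return 1
    if kind == "relabel":
        _, old, new = op
        return 0 if old == new else 1
    raise ValueError(f"unknown operation: {kind}")


def series_cost(series):
    """Cost of a series of operations: the sum of the costs of its operations."""
    return sum(cost(op) for op in series)


first = [("relabel", "A", "F"), ("delete", "B"), ("insert", "G")]
second = [("delete", "B"), ("relabel", "A", "G"), ("insert", "F")]
print(series_cost(first), series_cost(second))  # 3 3
```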
Given a series of operations $S$, let us consider the nodes of $T$ that are not deleted (in the initial tree or after some relabeling). Such nodes are associated with nodes of $T'$. The mapping $M_S$ relative to $S$ is the set of couples $(u, u')$ with $u \in T$ and $u' \in T'$ such that $u$ is associated with $u'$ by $S$.
The operations described above are the “classical tree edit operations” that have been commonly used in the literature for RNA secondary structure comparison. We now present a few results obtained using such classical operations that will allow us to illustrate a few limitations they may present when used for comparing RNA structures.
3 LIMITATIONS OF CLASSICAL TREE EDIT
OPERATIONS FOR RNA COMPARISON
As suggested in [12], the tree edit operations recalled in the
previous section can be used on any type of tree coding of
an RNA secondary structure.
Fig. 4 shows two RNAsePs extracted from the database [2]
(they are found, respectively, in Streptococcus gordonii and
Thermotoga maritima). For the example we discuss now, we
code the RNAs using the tree representation indicated in
Fig. 2b where a node represents a base pair and a leaf an
unpaired base. After applying a few edit operations to the
trees, we obtain the result indicated in Fig. 4, with deleted/inserted
bases in gray. We have surrounded a few regions that
match in the two trees. Bases in the rectangular box at the
bottom of the RNA on the left are thus associated with bases in
the bottom rightmost rectangular box of the RNA on the right.
The same is observed for the bases in the oval boxes for both
RNAs. Such matches illustrate one of the main problems with
the classical tree edit operations: Bases in one RNA may be
mapped to identically labeled bases in the other RNA to
minimise the total cost, while such bases should not be
associated in terms of the elements of secondary structure to
which they belong. In fact, such elements are often distant
from one another along the common RNA structure. We call
this problem the “scattering effect.” It is related to the
definition of tree edit operations. In the case of this example
and of the representation adopted, the problem might have
been avoided if structural information had been used.
Indeed, the problem appears also because the structural
Fig. 3. Edit operations: (a) the original tree T, (b) deletion of the node
labeled D, (c) insertion of the node labeled I, and (d) relabeling of a
node in T (the label A of the root is changed into K).
location of an unpaired base is not taken into account. It is
therefore possible to match, for instance, an unpaired base
from a hairpin loop with an unpaired base from a multiloop.
Using another type of representation, as we shall do, would,
however, not be enough to solve all problems as we see next.
Indeed, to compare the same two RNAs, we can also use a
more abstract tree representation such as the one given in
Fig. 2d. In this case, the internal nodes represent a multiloop,
internal-loop, or bulge, the leaves code for hairpin loops and
edges for helices. The result of the editing of $T$ into $T'$ for some
cost function is presented in Fig. 5 (we shall come back later to
the cost functions used in the case of such more abstract RNA
representations; for the sake of this example, we may assume
an arbitrary one is used).
The problem we wish to illustrate in this case is shown
by the boxes in the figure. Consider the boxes at the bottom.
In the left RNA, we have a helix made up of 13 base pairs. In
the right RNA, the helix is formed by seven base pairs
followed by an internal loop and another helix of size 5. By
definition (see Section 2), the algorithm can only associate
one element in the first tree to one element in the second
tree. In this case, we would like to associate the helix of the
left tree to the two helices of the second tree since it seems
clear that the internal loop represents either an inserted
element in the second RNA, or the unbonding of one base
pair. This, however, is not possible with classical edit
operations.
A third type of problem one can meet when using only
the three classical edit operations to compare trees standing
for RNAs is similar to the previous one, but concerns this
time a node instead of edges in the same tree representa-
tion. Often, an RNA may present a very small helix between
two elements (multiloop, internal-loop, bulge, or hairpin-
loop) while such helix is absent in the other RNA. In this
case, we would therefore have liked to be able to associate
one node in a tree representing an RNA with two or more
Fig. 5. Illustration of the one-to-one association problem with edges. Result of the matching of the two RNAsePs, of Saccharomyces uvarum and of
Saccharomyces kluyveri, using the model given in Fig. 2d.
Fig. 4. Illustration of the scattering effect problem. Result of the matching of two RNAsePs, of Streptococcus gordonii and of Thermotoga maritima,
using the model given in Fig. 2b.
nodes in the tree for the other RNA. Once again, this is not
possible with any of the classical tree edit operations. An
illustration of this problem is shown in Fig. 6.
We shall use RNA representations that take the elements
of the structure of an RNA into account to avoid some of the
scattering effect. Furthermore, in addition to considering
information of a structural nature, labels are attached, in
general, to both nodes and edges of the tree representing an
RNA. Such labels are numerical values (integers or reals).
They represent in most cases the size of the corresponding
element, but may also further indicate its composition, etc.
Such additional information is then incorporated into the
cost functions for all three edit operations. It is important to
observe that when dealing with trees labeled at both the
nodes and edges, any node and the edge that leads to it (or,
in an alternative perspective, departs from it) represent a
single object from the point of view of computing an edit
distance between the trees.
It remains now to deal with the last two problems that
are a consequence of the one-to-one associations between
nodes and edges enforced by the classical tree edit
operations. To that purpose, we introduce two novel tree
edit operations, called the edge fusion and the node fusion.
4 INTRODUCING NOVEL TREE EDIT OPERATIONS
4.1 Edge Fusion and Node Fusion
In order to address some of the limitations of the classical tree
edit operations that were illustrated in the previous section,
we need to introduce two novel operations. These are the edge
fusion and the node fusion. They may be applied to any of the
tree representations given in Figs. 2c, 2d, and 2e.
An example of edge fusion is shown in Fig. 7a. Let $e_u$ be an edge leading to a node $u$, $c_i$ a child of $u$, and $e_{c_i}$ the edge between $u$ and $c_i$. The edge fusion of $e_u$ and $e_{c_i}$ consists in replacing $e_{c_i}$ and $e_u$ with a new single edge $e$. The edge $e$ links the father of $u$ to $c_i$. Its label then becomes a function of the (numerical) labels of $e_u$, $u$, and $e_{c_i}$. For instance, if such labels indicated the size of each element (e.g., for a helix, the number of its stacked pairs, and for a loop, the min, max, or average of its unpaired bases on each side of the loop), the label of $e$ could be the sum of the sizes of $e_u$, $u$, and $e_{c_i}$. Observe that merging two edges implies deleting all subtrees rooted at the children $c_j$ of $u$ for $j$ different from $i$. The cost of such deletions is added to the cost of the edge fusion.
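A minimal Python sketch of the edge fusion may help fix ideas. It assumes a hypothetical `Node` class in which every node stores its own numerical label (`size`) together with the label of the edge leading to it (`edge`), and it uses the sum mentioned above as the label of the new edge. The function returns the deleted nodes so that a caller can add their deletion costs; no particular cost function is implied.

```python
from dataclasses import dataclass, field


@dataclass
class Node:
    size: float                 # label of the node, e.g., size of a loop
    edge: float = 0.0           # label of the edge leading to the node, e.g., helix size
    children: list = field(default_factory=list)


def subtree_nodes(node):
    """Yield every node of the subtree rooted at `node`."""
    yield node
    for child in node.children:
        yield from subtree_nodes(child)


def edge_fusion(father, u_index, i):
    """Fuse the edge leading to u = father.children[u_index] with the edge to its child c_i.

    The new edge links `father` to c_i and is labeled with the sum of the labels
    of e_u, u, and e_ci. Subtrees rooted at the other children of u are deleted;
    their nodes are returned.
    """
    u = father.children[u_index]
    c_i = u.children[i]
    deleted = [n for j, c in enumerate(u.children) if j != i for n in subtree_nodes(c)]
    c_i.edge = u.edge + u.size + c_i.edge
    father.children[u_index] = c_i
    return deleted


# A helix of size 7, an internal loop of (assumed) size 1, and a helix of size 5
# are merged into a single edge of size 13.
hairpin = Node(size=4, edge=5)
loop = Node(size=1, edge=7, children=[hairpin])
root = Node(size=0, children=[loop])
edge_fusion(root, 0, 0)
print(root.children[0].edge)  # 13
```

The figures 7 and 5 echo the helix sizes of the example of Section 3; the loop size of 1 is an assumption made only to obtain a helix of 13 pairs.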
An example of node fusion is given in Fig. 7b. Let $u$ be a node and $c_i$ one of its children. Performing a node fusion of $u$ and $c_i$ consists in making $u$ the father of all children of $c_i$ and in relabeling $u$ with a value that is a function of the values of the labels of $u$, $c_i$, and of the edge between them. Observe that a node fusion may be simulated using the classical edit operations by a deletion followed by a relabeling. However, the difference between a node fusion and a deletion/relabeling is in the cost associated with both operations. We shall come back to this point later.
Like insertions and deletions, edge fusions and node fusions have symmetric counterparts, which are the edge split and the node split.
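The node fusion can be sketched in the same spirit. The `Node` class below is hypothetical, as is the choice of placing the children of the fused child at the position it occupied (which preserves the order among siblings); the default `combine` function, a plain sum, is only one possible way of computing the new label.

```python
from dataclasses import dataclass, field


@dataclass
class Node:
    size: float                 # label of the node
    edge: float = 0.0           # label of the edge leading to the node
    children: list = field(default_factory=list)


def node_fusion(u, i, combine=lambda u_size, edge, c_size: u_size + edge + c_size):
    """Fuse u with its child c_i = u.children[i].

    The children of c_i become children of u, in place of c_i, and u is
    relabeled with a function of the labels of u, c_i, and the edge between them.
    """
    c_i = u.children[i]
    u.children[i:i + 1] = c_i.children
    u.size = combine(u.size, c_i.edge, c_i.size)
    return c_i


a, b, x, y = Node(1), Node(1), Node(1), Node(1)
c = Node(size=2, edge=1, children=[x, y])
u = Node(size=3, children=[a, c, b])
node_fusion(u, 1)
print(u.size, len(u.children))  # 6 4
```

Structurally this is the deletion of the child followed by a relabeling of its father, as observed above; what distinguishes the fusion is the cost attached to it.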
Given two rooted, ordered, and labeled trees $T$ and $T'$, we define the “edit distance with fusion” between $T$ and $T'$
Fig. 7. (a) An example of edge fusion. (b) An example of node fusion.
Fig. 6. Illustration of the one-to-one association problem with nodes. The two RNAs used here are RNAsePs from Pyrococcus furiosus and
Metallosphaera sedula. Triangles stand for bulges, diamonds stand for internal loops, and squares for hairpin loops.
as $\mathrm{distance}_{fusion}(T, T') = \min\{\mathrm{cost}_S \mid T \xrightarrow{S} T'\}$ with $\mathrm{cost}_s$ the cost associated to each of the seven edit operations now considered (relabeling, insertion, deletion, node fusion and split, edge fusion and split).
Proposition 1. If the following is verified:
. $\mathrm{cost}_{match}(a, b)$ is a distance,
. $\mathrm{cost}_{ins}(a) = \mathrm{cost}_{del}(a) \geq 0$,
. $\mathrm{cost}_{nodefusion}(a, b, c) = \mathrm{cost}_{nodesplit}(a, b, c) \geq 0$, and
. $\mathrm{cost}_{edgefusion}(a, b, c) = \mathrm{cost}_{edgesplit}(a, b, c) \geq 0$,
then $\mathrm{distance}_{fusion}$ is indeed a distance.
Proof. The positiveness of $\mathrm{distance}_{fusion}$ is given by the fact that all elementary cost functions are positive. Its symmetry is guaranteed by the symmetry in the costs of the insertion/deletion and (node/edge) fusion/split operations. Finally, it is straightforward to see that $\mathrm{distance}_{fusion}$ satisfies the triangular inequality. $\square$
Besides the above properties that must be satisfied by the cost functions in order to obtain a distance, others may be introduced for specific purposes. Some will be discussed in Section 5.
We now present an algorithm to compute the tree edit distance between two trees using the classical tree edit operations plus the two operations just introduced.
4.2 Algorithm
The method we introduce is a dynamic programming
algorithm based on the one proposed by Zhang and Shasha.
Their algorithm is divided in two parts: They first compute
the edit distance between two trees (this part is denoted by
TDist) and then the distance between two forests (this part
is denoted by FDist). Fig. 8 illustrates in pictorial form the
part TDist and Fig. 9 the FDist part of the computation.
In order to take our two new operations into account, we
need to compute a few more things in the TDist part.
Indeed, we must add the possibility for each tree to have a
node fusion (inversely, node split) between the root and one
of its children, or to have an edge fusion (inversely edge
split) between the root and one of its children. These
additional operations are indicated in the right box of Fig. 8.
We present now a formal description of the algorithm. Let $T$ be an ordered rooted tree with $|T|$ nodes. We denote by $t_i$ the $i$th node in a postfix order. For each node $t_i$, $l(i)$ is the index of the leftmost leaf of the subtree rooted at $t_i$. Let $T(i \ldots j)$ denote the forest composed by the nodes $t_i \ldots t_j$ ($T \equiv T(0 \ldots |T|)$). To simplify notation, from now on, when there is no ambiguity, $i$ will refer to the node $t_i$. In this case, $\mathrm{distance}(i_1 \ldots i_2, j_1 \ldots j_2)$ will be equivalent to $\mathrm{distance}(T(i_1 \ldots i_2), T'(j_1 \ldots j_2))$.
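The postfix numbering and the function $l$ can be computed in a single traversal, as in the following Python sketch. The `Node` class and the function name `postorder` are hypothetical, indices start at 0, and $l(i)$ is taken to be the index of the leftmost leaf of the subtree rooted at $t_i$, as in Zhang and Shasha's algorithm.

```python
from dataclasses import dataclass, field


@dataclass
class Node:
    label: str
    children: list = field(default_factory=list)


def postorder(root):
    """Return (nodes, l): nodes in postfix order and, for each index i,
    the index l[i] of the leftmost leaf of the subtree rooted at nodes[i]."""
    nodes, l = [], []

    def visit(node):
        firsts = [visit(child) for child in node.children]
        nodes.append(node)
        # A leaf is its own leftmost leaf; otherwise inherit from the first child.
        l.append(firsts[0] if firsts else len(nodes) - 1)
        return l[-1]

    visit(root)
    return nodes, l


tree = Node("A", [Node("B", [Node("C"), Node("D")]), Node("E")])
nodes, l = postorder(tree)
print([n.label for n in nodes], l)  # ['C', 'D', 'B', 'E', 'A'] [0, 1, 0, 3, 0]
```

With this numbering, the nodes of the subtree rooted at index $i$ are exactly those with indices $l(i) \ldots i$, which is why a forest $T(i_1 \ldots i_2)$ is a single tree when $i_1 = l(i_2)$.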
The algorithm of Zhang and Shasha is fully described by the following recurrence formula:

if $(i_1 == l(i_2))$ and $(j_1 == l(j_2))$

$$
\mathrm{MIN}
\begin{cases}
\mathrm{distance}(i_1 \ldots i_2 - 1,\; j_1 \ldots j_2) + \mathrm{cost}_{del}(i_2) \\
\mathrm{distance}(i_1 \ldots i_2,\; j_1 \ldots j_2 - 1) + \mathrm{cost}_{ins}(j_2) \\
\mathrm{distance}(i_1 \ldots i_2 - 1,\; j_1 \ldots j_2 - 1) + \mathrm{cost}_{match}(i_2, j_2)
\end{cases}
\qquad (1)
$$

else

$$
\mathrm{MIN}
\begin{cases}
\mathrm{distance}(i_1 \ldots i_2 - 1,\; j_1 \ldots j_2) + \mathrm{cost}_{del}(i_2) \\
\mathrm{distance}(i_1 \ldots i_2,\; j_1 \ldots j_2 - 1) + \mathrm{cost}_{ins}(j_2) \\
\mathrm{distance}(i_1 \ldots l(i_2) - 1,\; j_1 \ldots l(j_2) - 1) + \mathrm{distance}(l(i_2) \ldots i_2,\; l(j_2) \ldots j_2)
\end{cases}
\qquad (2)
$$
Part (1) of the formula corresponds to Fig. 8, while part (2)
corresponds to Fig. 9. In practice, the algorithm stores in a
matrix the score between each subtree of $T$ and $T'$. The space
complexity is therefore $O(|T| \times |T'|)$. To reach this complexity,
the computation must be done in a certain order (see
Section 4.3). The time complexity of the algorithm is
$O(|T| \times \min(\mathrm{leaf}(T), \mathrm{height}(T)) \times |T'| \times \min(\mathrm{leaf}(T'), \mathrm{height}(T')))$,
where $\mathrm{leaf}(T)$ and $\mathrm{height}(T)$ represent, respectively, the
number of leaves and the height of a tree $T$.
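Recurrences (1) and (2) can be sketched directly as a memoized recursion. The following is only an illustrative sketch, not the authors' C++ implementation: the nested-tuple tree encoding `(label, [children])`, the helper names `postorder` and `tree_edit_distance`, and the unit default costs are assumptions made for the example. It also takes $l(i)$ to be the leftmost leaf of the subtree rooted at $i$, and it memoizes every forest pair rather than following the space-efficient evaluation order of Section 4.3.

```python
from functools import lru_cache


def postorder(tree):
    """Number the nodes of `tree` in postfix order (1-indexed, slot 0 unused).

    `tree` is a hypothetical encoding: (label, [children]).
    Returns (labels, lml) where lml[i] is the index of the leftmost leaf
    of the subtree rooted at node i.
    """
    labels, lml = [None], [0]

    def visit(node):
        label, children = node
        first = None
        for child in children:
            leaf = visit(child)
            if first is None:
                first = leaf
        labels.append(label)
        index = len(labels) - 1
        lml.append(first if first is not None else index)
        return lml[index]

    visit(tree)
    return labels, lml


def tree_edit_distance(t1, t2,
                       cost_del=lambda a: 1,
                       cost_ins=lambda b: 1,
                       cost_match=lambda a, b: 0 if a == b else 1):
    """Classical edit distance between two ordered trees, following (1) and (2)."""
    a, la = postorder(t1)
    b, lb = postorder(t2)

    @lru_cache(maxsize=None)
    def dist(i1, i2, j1, j2):
        # Empty forests: only insertions or deletions remain.
        if i1 > i2 and j1 > j2:
            return 0
        if i1 > i2:
            return dist(i1, i2, j1, j2 - 1) + cost_ins(b[j2])
        if j1 > j2:
            return dist(i1, i2 - 1, j1, j2) + cost_del(a[i2])
        delete = dist(i1, i2 - 1, j1, j2) + cost_del(a[i2])
        insert = dist(i1, i2, j1, j2 - 1) + cost_ins(b[j2])
        if la[i2] == i1 and lb[j2] == j1:
            # Part (1): both forests are trees, so the roots can be matched.
            match = dist(i1, i2 - 1, j1, j2 - 1) + cost_match(a[i2], b[j2])
        else:
            # Part (2): split off the rightmost trees of both forests.
            match = (dist(i1, la[i2] - 1, j1, lb[j2] - 1)
                     + dist(la[i2], i2, lb[j2], j2))
        return min(delete, insert, match)

    return dist(1, len(a) - 1, 1, len(b) - 1)


# Two six-node trees that differ by one deletion and one insertion.
t1 = ("f", [("d", [("a", []), ("c", [("b", [])])]), ("e", [])])
t2 = ("f", [("c", [("d", [("a", []), ("b", [])])]), ("e", [])])
print(tree_edit_distance(t1, t2))
```

With unit costs, turning `t1` into `t2` requires deleting `c` and reinserting it above `d`, so the two trees are at distance 2.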
8 IEEE/ACM TRANSACTIONS ON COMPUTATIONAL BIOLOGY AND BIOINFORMATICS, VOL. 2, NO. 1, JANUARY-MARCH 2005
Fig. 8. Zhang and Shasha’s dynamic programming algorithm: the tree distance part. The right box corresponds to the additional operations added to
take fusion into account.
The formula to compute the edit score allowing for both
node and edge fusions follows.
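if $i_1 \ge l(i_k)$ and $j_1 \ge l(j_{k'})$:
\[
\min
\begin{cases}
\mathrm{distance}(\{i_1 \ldots i_{k-1}\}, \emptyset, \{j_1 \ldots j_{k'}\}, path') + \mathrm{cost}_{\mathrm{del}}(i_k) \\
\mathrm{distance}(\{i_1 \ldots i_k\}, path, \{j_1 \ldots j_{k'-1}\}, \emptyset) + \mathrm{cost}_{\mathrm{ins}}(j_{k'}) \\
\mathrm{distance}(\{i_1 \ldots i_{k-1}\}, \emptyset, \{j_1 \ldots j_{k'-1}\}, \emptyset) + \mathrm{cost}_{\mathrm{match}}(i_k, j_{k'}) \\
\text{for each child } i_c \text{ of } i_k \text{ in } \{i_1, \ldots, i_k\}, \text{ set } i_l = l(i_c): \\
\quad \mathrm{distance}(\{i_1 \ldots i_{c-1}, i_{c+1} \ldots i_k\}, path.(u, i_c), \{j_1 \ldots j_{k'}\}, path') \\
\qquad + \mathrm{cost}_{\mathrm{node\_fusion}}(i_c, i_k) \quad (\text{obs.: } i_k \text{ data are changed}) \\
\quad \mathrm{distance}(\{i_l \ldots i_{c-1}, i_k\}, path.(e, i_c), \{j_1 \ldots j_{k'}\}, path') \\
\qquad + \mathrm{cost}_{\mathrm{edge\_fusion}}(i_c, i_k) + \mathrm{distance}(\{i_1 \ldots i_{l-1}\}, \emptyset, \emptyset, \emptyset) \\
\qquad + \mathrm{distance}(\{i_{c+1} \ldots i_{k-1}\}, \emptyset, \emptyset, \emptyset) \quad (\text{obs.: } i_k \text{ data are changed}) \\
\text{for each child } j_{c'} \text{ of } j_{k'} \text{ in } \{j_1, \ldots, j_{k'}\}, \text{ set } j_{l'} = l(j_{c'}): \\
\quad \mathrm{distance}(\{i_1 \ldots i_k\}, path, \{j_1 \ldots j_{c'-1}, j_{c'+1} \ldots j_{k'}\}, path'.(u, j_{c'})) \\
\qquad + \mathrm{cost}_{\mathrm{node\_split}}(j_{c'}, j_{k'}) \quad (\text{obs.: } j_{k'} \text{ data are changed}) \\
\quad \mathrm{distance}(\{i_1 \ldots i_k\}, path, \{j_{l'} \ldots j_{c'}, j_{k'}\}, path'.(e, j_{c'})) \\
\qquad + \mathrm{cost}_{\mathrm{edge\_split}}(j_{c'}, j_{k'}) + \mathrm{distance}(\emptyset, \emptyset, \{j_1 \ldots j_{l'-1}\}, \emptyset) \\
\qquad + \mathrm{distance}(\emptyset, \emptyset, \{j_{c'+1} \ldots j_{k'-1}\}, \emptyset) \quad (\text{obs.: } j_{k'} \text{ data are changed})
\end{cases}
\tag{3}
\]
else, set $i_l = l(i_k)$ and $j_{l'} = l(j_{k'})$:
\[
\min
\begin{cases}
\mathrm{distance}(\{i_1 \ldots i_{k-1}\}, \emptyset, \{j_1 \ldots j_{k'}\}, path') + \mathrm{cost}_{\mathrm{del}}(i_k) \\
\mathrm{distance}(\{i_1 \ldots i_k\}, path, \{j_1 \ldots j_{k'-1}\}, \emptyset) + \mathrm{cost}_{\mathrm{ins}}(j_{k'}) \\
\mathrm{distance}(\{i_1 \ldots i_{l-1}\}, \emptyset, \{j_1 \ldots j_{l'-1}\}, \emptyset) + \mathrm{distance}(\{i_l \ldots i_k\}, path, \{j_{l'} \ldots j_{k'}\}, path')
\end{cases}
\tag{4}
\]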
Given two nodes $u$ and $v$ such that $v$ is a child of $u$,
$\mathrm{node\_fusion}(u, v)$ is the fusion of node $v$ with $u$, and
$\mathrm{edge\_fusion}(u, v)$ is the edge fusion between the edges
leading to, respectively, nodes $u$ and $v$. The symmetric
operations are denoted by, respectively, $\mathrm{node\_split}(u, v)$ and $\mathrm{edge\_split}(u, v)$.
The distance computation takes two new parameters
$path$ and $path'$. These are sets of pairs $(e \text{ or } u, v)$ which
indicate, for node $i_k$ (respectively, $j_{k'}$), the series of fusions
that were done. Thus, a pair $(e, v)$ indicates that an edge
fusion has been performed between $i_k$ and $v$, while for $(u, v)$ a node $v$ has been merged with node $i_k$.
The notation $path.(e, v)$ indicates that the operation $(e, v)$ has been performed in relation to node $i_k$ and the
information is thus concatenated to the set $path$ of pairs
currently linked with $i_k$.
4.3 Implementation and Complexity
The previous section gave the recurrence formulæ for
calculating the edit distance between two trees allowing for
node and edge fusion and split. We now discuss the
complexity of the algorithm. This requires paying attention
to some high-level implementation details that, in the case
of the tree edit distance problem, may have an important
influence on the theoretical complexity of the algorithm.
Such details were first observed by Zhang and Shasha. They
concern the order in which to perform the operations
indicated in (1) and (2) to obtain an algorithm that is time
and space efficient.
Let us consider the last line of (2). We may observe that
the computation of the distance between two forests refers to the computation of the distance between two trees $T(l(i_2) \ldots i_2)$ and $T'(l(j_2) \ldots j_2)$. We must therefore memorise the distance between any two subtrees of $T$ and $T'$. Furthermore, we have to carry out the computation from the leaves to the root because when we compute the distance between two subtrees $U$ and $U'$, the distance between any subtrees of $U$ and $U'$ must already have been measured. This explains the space complexity, which is in $O(|T| \times |T'|)$ and corresponds to the size of the table used for storing such distances in memory.
If we look at (1) now, we see that it is not necessary to calculate separately the distance between the subtrees rooted at $i'$ and $j'$ if $i'$ is on the path from $l(i)$ to $i$ and $j'$
is on the path from $l(j)$ to $j$, for $i$ and $j$ nodes of, respectively, $T$ and $T'$.
We define a set $LR(T)$ of the left roots of $T$ as follows:
\[
LR(T) = \{k \mid 1 \le k \le |T| \text{ and } \nexists\, k' > k \text{ such that } l(k') = l(k)\}
\]
ALLALI AND SAGOT: A NEW DISTANCE FOR HIGH LEVEL RNA SECONDARY STRUCTURE COMPARISON 9
Fig. 9. Zhang and Shasha’s dynamic programming algorithm: the forest distance part.
The algorithm for computing the edit distance between $T$
and $T'$ then consists in computing the distance between
each subtree rooted at a node in $LR(T)$ and each subtree
rooted at a node in $LR(T')$. Such subtrees are considered
from the leaves to the root of $T$ and $T'$, that is, in the order
of their indexes.
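The set $LR(T)$ can be obtained in a single pass over the nodes. The sketch below is illustrative only: the function name `left_roots` and the 1-indexed list `lml` (with `lml[i]` holding $l(i)$, taken here as the leftmost leaf of the subtree rooted at node $i$, slot 0 unused) are assumptions made for the example.

```python
def left_roots(lml):
    """Return LR(T) as a sorted list of node indexes.

    A node k belongs to LR(T) when no node k' > k shares its leftmost leaf,
    so for each leftmost-leaf value we keep the largest index seen.
    """
    last = {}
    for k in range(1, len(lml)):
        last[lml[k]] = k  # later (larger) indexes overwrite earlier ones
    return sorted(last.values())


# Postfix numbering of the tree f(d(a, c(b)), e): a=1, b=2, c=3, d=4, e=5, f=6.
lml = [0, 1, 2, 2, 1, 5, 1]
print(left_roots(lml))
```

Each leaf is the leftmost leaf of exactly one left root, so the size of $LR(T)$ equals the number of leaves of $T$. Returning the indexes in increasing order matches the leaves-to-root processing order described above.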
Zhang and Shasha proved that this algorithm has a
time complexity in $O(|T| \times \min(\mathrm{leaf}(T), \mathrm{height}(T)) \times |T'| \times \min(\mathrm{leaf}(T'), \mathrm{height}(T')))$, $\mathrm{leaf}(T)$ designating the number of leaves of $T$ and $\mathrm{height}(T)$ its height. In the worst
case (fan tree), the complexity is in $O(|T|^2 \times |T'|^2)$.
Taking fusion and split operations into account does
not change the above reasoning. However, we must now
store in memory the distance between all subtrees
$T(l(i_2) \ldots i_2)$ and $T'(l(j_2) \ldots j_2)$, and all the possible values
of $path$ and $path'$.
We must therefore determine the number of values that
$path$ can take. This amounts to determining the total number
of successive fusions that could be applied to a given node.
We recall that $path$ is a list of pairs $(e \text{ or } u, v)$. Let $path = \{(e \text{ or } u, v_1), (e \text{ or } u, v_2), \ldots, (e \text{ or } u, v_\ell)\}$ be the list for node $i$
of $T$. The first fusion can be performed only with a child $v_1$ of $i$. If $d$ is the maximum degree of $T$, there are $d$ possible
choices for $v_1$. The second fusion can be done with one of
the children of $i$ or with one of its grandchildren. Let $v_2$ be
the node chosen. There are $d + d^2$ possible choices for $v_2$.
Following the same reasoning, there are $\sum_{k=1}^{\ell} d^k$ possible
choices for the $\ell$th node $v_\ell$ to be fused with $i$.
Furthermore, we must take into account the fact that a
fusion can concern a node or an edge. The total number of
possible values for the variable $path$ therefore satisfies:
\[
2^\ell \times \prod_{k=1}^{\ell} \sum_{j=1}^{k} d^j \;\le\; 2^\ell \prod_{k=1}^{\ell} \frac{d^{k+1} - 1}{d - 1},
\]
where the right-hand side can be rewritten and bounded as:
\[
2^\ell \times \left(\frac{1}{d - 1}\right)^{\ell} \prod_{k=1}^{\ell} (d^{k+1} - 1) \;<\; 2^\ell \times \left(\frac{1}{d - 1}\right)^{\ell} \times d^{\frac{(\ell+1)(\ell+2)}{2}}.
\]
A node $i$ may then be involved in $O((2d)^\ell)$ possible
successive (node/edge) fusions.
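As a quick numerical check of the counting argument above, the product and the closed-form bound can be evaluated for small values. This is only a sketch: the function names `count_paths` and `upper_bound` are hypothetical, and the bound is evaluated for $d \ge 2$ so that the denominator is nonzero.

```python
from math import prod


def count_paths(d, ell):
    """Evaluate 2^ell * prod_{k=1..ell} sum_{j=1..k} d^j."""
    return 2 ** ell * prod(
        sum(d ** j for j in range(1, k + 1)) for k in range(1, ell + 1)
    )


def upper_bound(d, ell):
    """Evaluate 2^ell * (1/(d-1))^ell * d^((ell+1)(ell+2)/2), for d >= 2."""
    return 2 ** ell * d ** ((ell + 1) * (ell + 2) // 2) / (d - 1) ** ell


for d, ell in [(2, 1), (2, 2), (3, 2)]:
    print(d, ell, count_paths(d, ell), upper_bound(d, ell))
```

For instance, with $d = 3$ and $\ell = 2$ the product is $4 \times 3 \times 12 = 144$, below the bound $4 \times 3^6 / 4 = 729$.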
As indicated, we must store in memory the distance
between each subtree $T(l(i_2) \ldots i_2)$ and $T'(l(j_2) \ldots j_2)$ for all possible values of $path$ and $path'$. The space complexity of
our algorithm is thus in $O((2d)^\ell \times (2d')^\ell \times |T| \times |T'|)$, with $d$
and $d'$ the maximum degrees of, respectively, $T$ and $T'$.
The computation of the time complexity of our algorithm
is done in a similar way as for the algorithm of Zhang and
Shasha. For each node of $T$ and $T'$, one must compute the
number of subtree distance computations the node will be
involved in by considering all subtrees rooted in, respectively,
a node of $LR(T)$ and a node of $LR(T')$. In our case,
one must also take into account for each node the possibility
of applying a fusion. This leads to a time complexity in
$O((2d)^\ell \times |T| \times \min(\mathrm{leaf}(T), \mathrm{height}(T)) \times (2d')^\ell \times |T'| \times \min(\mathrm{leaf}(T'), \mathrm{height}(T')))$.
This complexity suggests that the fusion operations may
be used only for trees of reasonable size (typically, fewer than
100 nodes) and small values of $\ell$ (typically, less than 4). It is
however important to observe that the overall number of
fusions one may perform can be much greater than $\ell$
without affecting the worst-case complexity of the algorithm.
Indeed, any number of fusions can be made while
still retaining the bound of
$O((2d)^\ell \times |T| \times \min(\mathrm{leaf}(T), \mathrm{height}(T)) \times |T'| \times \min(\mathrm{leaf}(T'), \mathrm{height}(T')))$
so long as one does not perform more than $\ell$ consecutive
fusions for each node.
In general, also, most interesting tree representations of
an RNA are of small enough size as will be shown next,
together with some initial results obtained in practice.
5 APPLICATION TO RNA SECONDARY STRUCTURES COMPARISON
The algorithm presented in the previous section has been coded in C++. An online version is available at http://www-igm.univ-mlv.fr/~allali/migal/.
We recall that RNAs are relatively small molecules with
sizes limited to a few kilobases. For instance, the small
ribosomal subunit of Sulfolobus acidocaldarius (D14876) is
made up of 1,147 bases. Using the representation shown in
Fig. 2b, the tree obtained contains 440 internal nodes and
567 leaves, that is 1,007 nodes overall. Using the representa-
tion in Fig. 2d, the tree is composed of 78 nodes. Finally, the
tree obtained using the representation given in Fig. 2e
contains only 48 nodes. We therefore see that even for large
RNAs, any of the known abstract tree-representations (that
is, representations which take the elements of the secondary
structure of an RNA into account) that we can use leads to a
tree of manageable size for our algorithm. In fact, for small
values of l (2 or 3), the tree comparison takes reasonable
time (a few minutes) and memory (less than 1Gb).
As we already mentioned, a fusion (respectively, split) can
be viewed as an alternative to a deletion (respectively,
insertion) followed by a relabeling. Therefore, the cost
function for a fusion must be chosen carefully.
To simplify, we reason on the cost of a node fusion
without considering the label of the edges leading to the
nodes that are fused with a father. The formal definition
of the cost functions takes the edges also into account.
Let us assume that the cost function returns a real
value between zero and one. If we want to compute the cost of a fusion between two nodes $u$ and $v$, the aim is to give to such a fusion a cost slightly greater than the cost of deleting $v$ and relabeling $u$; that is, we wish to have $\mathrm{cost}_{\mathrm{node\_fusion}}(u, v) = \min(\mathrm{cost}_{\mathrm{del}}(v) + t, 1)$. The parameter $t$ is a tuning parameter for the fusion.
Suppose that the new node $w$ resulting from the fusion of
$u$ and $v$ matches with another node $z$. The cost of this match
is $\mathrm{cost}_{\mathrm{match}}(w, z)$. If we do not allow for node fusions, the
algorithm will first match $u$ with $z$, then will delete $v$. If we
compare the two possibilities, on one hand we have a total
cost of $\mathrm{cost}_{\mathrm{node\_fusion}}(u, v) + \mathrm{cost}_{\mathrm{match}}(w, z)$ for the fusion,
that is, $\mathrm{cost}_{\mathrm{del}}(v) + t + \mathrm{cost}_{\mathrm{match}}(w, z)$; on the other hand, a
cost of $\mathrm{cost}_{\mathrm{del}}(v) + \mathrm{cost}_{\mathrm{match}}(u, z)$. Thus, $t$ represents the gain that must be obtained by $\mathrm{cost}_{\mathrm{match}}(w, z)$ with regard to
$\mathrm{cost}_{\mathrm{match}}(u, z)$, that is, by a match without fusion. This is
illustrated in Fig. 10.
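In this example, the cost associated with the path on the top
is $\mathrm{cost}_{\mathrm{match}}(5, 9) + \mathrm{cost}_{\mathrm{del}}(3)$. The path at the bottom has a cost
of $\mathrm{cost}_{\mathrm{node\_fusion}}(5, 3) = \mathrm{cost}_{\mathrm{del}}(3) + t$ for the node fusion, to
which is added a relabeling cost of $\mathrm{cost}_{\mathrm{match}}(8, 9)$, leading to a total of $\mathrm{cost}_{\mathrm{match}}(8, 9) + \mathrm{cost}_{\mathrm{del}}(3) + t$. Since the algorithm minimizes the total cost, a node fusion will
therefore be chosen if $\mathrm{cost}_{\mathrm{match}}(8, 9) + t < \mathrm{cost}_{\mathrm{match}}(5, 9)$, that is, if the match with fusion is cheaper by at
least $t$ than the match without fusion.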
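We apply the same reasoning to the cost of an edge fusion.
The cost functions for a node fusion and an edge fusion between a
node $u$ and a node $v$, with $e_u$ denoting the edge leading to $u$
and $e_v$ the edge leading to $v$, are defined as follows:
\[
\begin{aligned}
\mathrm{cost}_{\mathrm{node\_fusion}}(u, v) &= \mathrm{cost}_{\mathrm{del}}(v) + \mathrm{cost}_{\mathrm{del}}(e_v) + t, \\
\mathrm{cost}_{\mathrm{edge\_fusion}}(u, v) &= \mathrm{cost}_{\mathrm{del}}(u) + \mathrm{cost}_{\mathrm{del}}(e_u) + t + \sum_{c \text{ sibling of } v} \mathrm{cost}(\text{deleting the subtree rooted at } c).
\end{aligned}
\]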
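The two cost functions can be sketched as small helpers. This is an illustration under stated assumptions rather than the authors' code: the deletion cost is passed in as a callable `cost_del`, the deletion costs of the sibling subtrees are supplied as a precomputed list, and `fusion_is_cheaper` is a hypothetical helper expressing that, in a minimization, a fusion pays off only when the match with fusion is cheaper than the match without it by more than $t$.

```python
def cost_node_fusion(cost_del, v, e_v, t):
    """Delete child v and its incoming edge e_v, plus the tuning penalty t."""
    return cost_del(v) + cost_del(e_v) + t


def cost_edge_fusion(cost_del, u, e_u, t, sibling_subtree_costs):
    """Delete u and its incoming edge e_u, pay t, and delete v's sibling subtrees."""
    return cost_del(u) + cost_del(e_u) + t + sum(sibling_subtree_costs)


def fusion_is_cheaper(match_with_fusion, match_without_fusion, t):
    """True when the match after fusion is cheaper by more than t."""
    return match_with_fusion + t < match_without_fusion


# Illustrative deletion costs in [0, 1]; the labels are arbitrary strings.
costs = {"u": 0.5, "e_u": 0.25, "v": 0.5, "e_v": 0.25}
t = 0.125
print(cost_node_fusion(costs.get, "v", "e_v", t))
print(cost_edge_fusion(costs.get, "u", "e_u", t, [0.5, 0.75]))
print(fusion_is_cheaper(0.25, 0.5, t))
```

Passing the sibling costs as a list keeps the sketch independent of any particular tree data structure.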
The tuning parameter $t$ is thus an important parameter
that allows us to control fusions. Always considering a cost
function that produces real values between 0 and 1, if $t$ is
equal to 0.1, a fusion will be performed only if it improves
the score by at least 0.1. In practice, we use values of $t$ between 0
and 0.2.
For practical considerations, we also set a further
condition on the cost and relabeling functions related to a
node or edge resulting from a fusion, which is as follows:
\[
\mathrm{cost}_{\mathrm{del}}(a) + \mathrm{cost}_{\mathrm{del}}(b) \le \mathrm{cost}_{\mathrm{del}}(c)
\]
with $c$ the label of the node/edge resulting from the fusion
of the nodes/edges labeled $a$ and $b$. Indeed, if this condition
is not fulfilled, the algorithm may systematically fuse the
nodes or edges to reduce the overall cost.
An important consequence of the conditions seen above
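is that a node fusion cannot be followed by an edge fusion.
Below, the node fusion followed by an edge fusion costs:
\[
(\mathrm{cost}_{\mathrm{del}}(b) + \mathrm{cost}_{\mathrm{del}}(B) + t) + (\mathrm{cost}_{\mathrm{del}}(AB) + \mathrm{cost}_{\mathrm{del}}(a) + t).
\]
The alternative is to destroy node $B$ (together with edge $b$) and
then to operate an edge fusion, the whole costing $(\mathrm{cost}_{\mathrm{del}}(b) + \mathrm{cost}_{\mathrm{del}}(B)) + (\mathrm{cost}_{\mathrm{del}}(A) + \mathrm{cost}_{\mathrm{del}}(a) + t)$. The difference between these two costs is $t + \mathrm{cost}_{\mathrm{del}}(AB) - \mathrm{cost}_{\mathrm{del}}(A)$, which is
always positive.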
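The difference can be checked term by term. The names $C_1$ (node fusion followed by edge fusion) and $C_2$ (deletion of $B$ followed by edge fusion) are introduced here only for this illustration. The last step assumes $t \ge 0$, nonnegative deletion costs, and the condition on fused labels read as $\mathrm{cost}_{\mathrm{del}}(A) + \mathrm{cost}_{\mathrm{del}}(B) \le \mathrm{cost}_{\mathrm{del}}(AB)$:

\[
\begin{aligned}
C_1 &= \mathrm{cost}_{\mathrm{del}}(b) + \mathrm{cost}_{\mathrm{del}}(B) + t + \mathrm{cost}_{\mathrm{del}}(AB) + \mathrm{cost}_{\mathrm{del}}(a) + t, \\
C_2 &= \mathrm{cost}_{\mathrm{del}}(b) + \mathrm{cost}_{\mathrm{del}}(B) + \mathrm{cost}_{\mathrm{del}}(A) + \mathrm{cost}_{\mathrm{del}}(a) + t, \\
C_1 - C_2 &= t + \mathrm{cost}_{\mathrm{del}}(AB) - \mathrm{cost}_{\mathrm{del}}(A) \;\ge\; t + \mathrm{cost}_{\mathrm{del}}(B) \;\ge\; 0.
\end{aligned}
\]

Under these assumptions the difference is strictly positive as soon as $t > 0$ or $\mathrm{cost}_{\mathrm{del}}(B) > 0$, so the sequence "node fusion, then edge fusion" never needs to be explored.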
This observation allows us to significantly improve the
practical performance of the algorithm.
We have applied the new algorithm on the two RNAs
shown in Fig. 5 (these are eukaryotic nuclear P RNAs from
Saccharomyces uvarum and Saccharomyces kluyveri) and coded
using the same type of representation as in Fig. 2d. We have
limited the number of consecutive fusions to one ($\ell = 1$).
The computation of the edit distance between the two trees
taking node and edge fusions into account besides deletions,
insertions, and relabeling has required less than a
second. The total cost allowing for fusions is 6.18 with $t = 0.05$ against 7.42 without fusions. As indicated in Fig. 11, the
last two problems discussed in Section 3 disappear thanks
to some edge fusions (represented by the boxes).
An example of node fusions required when comparing
two “real” RNAs is given in Fig. 12. The RNAs are coded
using the same type of representation as in Fig. 2d. The
figure shows part of the mapping obtained between the
small subunits of two ribosomal RNAs retrieved from [8]
(from Bacillaria paxillifer and Calicophoron calicophorum). The
node fusion has been circled.
Fig. 10. Illustration of the gain that must be obtained using a fusion
instead of a deletion/relabeling.
6 MULTILEVEL RNA STRUCTURE COMPARISON: SKETCH OF THE MAIN IDEA
We briefly discuss now an approach which addresses in
part the “scattering effect” problem (see Section 2). This
approach is being currently validated and will be more fully
described in another paper. We therefore present here the
main idea only.
To start with, it is important to understand the nature of
this “scattering effect.” Let us consider first a trivial case: the
cost functions are unitary (insertion, deletion, and relabeling
each cost 1) and we compute the edit distance between two
trees composed of a single node each. The obtained mapping
will associate the single node in the first tree with the single
one in the second tree, independently from the labels of the
nodes. This example can be extended to the comparison of
two trees whose node labels are all different. In this case, the
obtained mapping corresponds to the maximum
homeomorphic subtree common to both trees.
If the two RNA secondary structures compared using a
tree representation which models both the base pairs and
the nonpaired bases are globally similar but present some
local dissimilarity, then an edit operation will almost
always associate the nodes of the locally divergent regions
that are located at the same positions relative to the global
common structure. This is a normal, expected behavior in
the context of an edit distance. However, it seems clear also when
we look at Fig. 4 that the bases of a terminal loop should not
be mapped to those of a multiple loop.
To reduce this problem, one possible solution consists of
adding, to each node corresponding to a base, information
concerning the element of secondary structure to which the
base belongs. The cost functions are then adapted to take
this type of information into account. This solution,
although producing interesting results, is not entirely
satisfying. Indeed, the algorithm will tend to systematically
put into correspondence nodes (and, thus, bases) belonging
to structural elements of the same type, which is also not
necessarily a good choice as these elements may not be
related in the overall structure. It seems therefore preferable
to have a structural approach first, mapping initially the
elements of secondary structure to each other and taking
care of the nucleotides in a second step only.
The approach we have elaborated may be briefly
described as follows: Given two RNA secondary structures,
the first step consists in coding the RNAs by trees of type (c) in Fig. 2 (nodes represent bulges or multiple, internal or
Fig. 12. Part of a mapping between two rRNA small subunits. The node fusion is circled.
Fig. 11. Result of the editing between the two RNAs shown in Fig. 4 allowing for node and edge fusions.
terminal loops while edges code for helices). We then
compute the edit distance between these two trees using the
two novel fusion operations described in this paper. This
also produces a mapping between the two trees. Each node
and edge of the trees, that is, each element of secondary
structure, is then colored according to this mapping. Two
elements are thus of a same color if they have been mapped
in the first step. We now have at our disposal
information concerning the structural similarity of the two
RNAs. We can then code the RNAs using a tree of type (b). To these trees, we add to each node the color of the
structural element to which it belongs. We need now only to
restrict the match operation to nodes of the same color. Two
nodes can therefore match only if they belong to secondary
elements that have been identified in the first step as being
similar.
To illustrate the use of this algorithm, we have applied it
to the two RNAs of Fig. 4. Fig. 13 presents the trees of the type in Fig. 2c coding for these structures, and the mapping produced by the computation of the edit distance with fusion. In particular, the noncolored fine dashed nodes and edges correspond, respectively, to deleted nodes/edges. One can see that in the left RNA, the two hairpin loops involved in the scattering effect problem in Fig. 4 (indicated by the arrows) have been destroyed and will not be mapped to one another anymore when the edit operations are applied to the trees of the type in Fig. 2b.
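The color restriction of the second step can be sketched as a wrapper around an ordinary match cost. This is a hypothetical illustration only, since the multilevel method is merely outlined in this paper: the encoding of a node as a `(label, color)` pair, the use of `None` for elements left unmapped in the first step, and the infinite cost used to forbid a match are all assumptions made for the example.

```python
import math


def colored_match_cost(base_cost):
    """Wrap `base_cost` so that only nodes of the same color may be matched."""

    def cost(node_a, node_b):
        (label_a, color_a), (label_b, color_b) = node_a, node_b
        # Elements deleted in the first step carry no color and never match.
        if color_a is None or color_a != color_b:
            return math.inf
        return base_cost(label_a, label_b)

    return cost


unit_cost = lambda a, b: 0 if a == b else 1
match = colored_match_cost(unit_cost)
print(match(("A", 1), ("G", 1)))  # same structural element: ordinary cost
print(match(("A", 1), ("A", 2)))  # different elements: match forbidden
```

With an infinite match cost, a minimizing edit distance falls back on deleting and inserting the two bases instead of pairing them.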
This approach yields interesting results.
Furthermore, it considerably reduces the complexity of
the algorithm for comparing two RNA structures coded
with trees of the type in Fig. 2b. However, it is important to
observe that the scattering effect problem is not specific to
the tree representations of the type in Fig. 2b. Indeed, the
same problem may be observed, to a lesser degree, with
trees of the type in Fig. 2c. This is the reason why we
generalize the process by adopting a modelling of RNA
secondary structures at different levels of abstraction. This
model, and the accompanying algorithm for comparing
RNA structures, is in progress.
7 FURTHER WORK AND CONCLUSION
We have proposed an algorithm that addresses two main
limitations of the classical tree edit operations for compar-
ing RNA secondary structures. Its complexity is high in
theory if many fusions are applied in succession to any
given (the same) node, but the total number of fusions that
may be performed is not limited. In practice, the algorithm
is fast enough for most situations one is likely to encounter.
To provide a more complete solution to the problem of
the scattering effect, we also proposed a new multilevel
approach for comparing two RNA secondary structures
whose main idea was sketched in this paper. Further details
and an evaluation of this novel comparison scheme will be the
subject of another paper.
REFERENCES
[1] D. Bouthinon and H. Soldano, “A New Method to Predict the Consensus Secondary Structure of a Set of Unaligned RNA Sequences,” Bioinformatics, vol. 15, no. 10, pp. 785-798, 1999.
[2] J.W. Brown, “The Ribonuclease P Database,” Nucleic Acids Research, vol. 24, no. 1, p. 314, 1999.
[3] N. el Mabrouk and F. Lisacek, “Very Fast Identification of RNA Motifs in Genomic DNA. Application to tRNA Search in the Yeast Genome,” J. Molecular Biology, vol. 264, no. 1, pp. 46-55, 1996.
[4] I. Hofacker, “The Vienna RNA Secondary Structure Server,” 2003.
[5] I. Hofacker, W. Fontana, P.F. Stadler, L. Sebastian Bonhoeffer, M. Tacker, and P. Schuster, “Fast Folding and Comparison of RNA Secondary Structures,” Monatshefte fur Chemie, vol. 125, pp. 167-188, 1994.
[6] M. Hochsmann, T. Toller, R. Giegerich, and S. Kurtz, “Local Similarity in RNA Secondary Structures,” Proc. IEEE Computer Soc. Conf. Bioinformatics, p. 159, 2003.
[7] M. Hochsmann, B. Voss, and R. Giegerich, “Pure Multiple RNA Secondary Structure Alignments: A Progressive Profile Approach,” IEEE/ACM Trans. Computational Biology and Bioinformatics, vol. 1, no. 1, pp. 53-62, 2004.
[8] T. Winkelmans, J. Wuyts, Y. Van de Peer, and R. De Wachter, “The European Database on Small Subunit Ribosomal RNA,” Nucleic Acids Research, vol. 30, no. 1, pp. 183-185, 2002.
[9] T. Jiang, L. Wang, and K. Zhang, “Alignment of Trees—An Alternative to Tree Edit,” Proc. Fifth Ann. Symp. Combinatorial Pattern Matching, pp. 75-86, 1994.
[10] F. Lisacek, Y. Diaz, and F. Michel, “Automatic Identification of Group I Intron Cores in Genomic DNA Sequences,” J. Molecular Biology, vol. 235, no. 4, pp. 1206-1217, 1994.
Fig. 13. Result of the comparison of the two RNAs of Fig. 4 using trees in Fig. 2c. The thick dash lines indicate some of the associations resulting
from the computation of the edit distance between these two trees. Triangular nodes stand for bulges, diamonds for internal loops, squares for
hairpin loops, and circles for multiloops. Noncolored fine dashed nodes and lines correspond, respectively, to deleted nodes/edges.
[11] B. Shapiro, “An Algorithm for Multiple RNA Secondary Structures,” Computer Applications in the Biosciences, vol. 4, no. 3, pp. 387-393, 1988.
[12] B.A. Shapiro and K. Zhang, “Comparing Multiple RNA Secondary Structures Using Tree Comparisons,” Computer Applications in the Biosciences, vol. 6, no. 4, pp. 309-318, 1990.
[13] K.-C. Tai, “The Tree-to-Tree Correction Problem,” J. ACM, vol. 26, no. 3, pp. 422-433, 1979.
[14] K. Zhang and D. Shasha, “Simple Fast Algorithms for the Editing Distance between Trees and Related Problems,” SIAM J. Computing, vol. 18, no. 6, pp. 1245-1262, 1989.
[15] M. Zuker, “Mfold Web Server for Nucleic Acid Folding and Hybridization Prediction,” Nucleic Acids Research, vol. 31, no. 13, pp. 3406-3415, 2003.
Julien Allali studied at the University of Marne la Vallee (France), where he received the MSc degree in computer science and computational genomics. In 2001, he began his PhD in computational genomics at the Gaspard Monge Institute of the University of Marne la Vallee. His thesis focused on the study of RNA secondary structures and, in particular, their comparison using a tree distance. In 2004, he received the PhD degree.
Marie-France Sagot received the BSc degree in computer science from the University of Sao Paulo, Brazil, in 1991, the PhD degree in theoretical computer science and applications from the University of Marne-la-Vallee, France, in 1996, and the Habilitation from the same university in 2000. From 1997 to 2001, she worked as a research associate at the Pasteur Institute in Paris, France. In 2001, she moved to Lyon, France, as a research associate at the INRIA, the French National Institute for Research in Computer Science and Control. Since 2003, she has been the Director of Research at the INRIA. Her research interests are in computational biology, algorithmics, and combinatorics.
Topological Rearrangements and Local Search Method for Tandem Duplication Trees
Denis Bertrand and Olivier Gascuel
Abstract—The problem of reconstructing the duplication history of a set of tandemly repeated sequences was first introduced by Fitch
[4]. Many recent studies deal with this problem, showing the validity of the unequal recombination model proposed by Fitch, describing
numerous inference algorithms, and exploring the combinatorial properties of these new mathematical objects, which are duplication
trees. In this paper, we deal with the topological rearrangement of these trees. Classical rearrangements used in phylogeny (NNI, SPR,
TBR, ...) cannot be applied directly on duplication trees. We show that restricting the neighborhood defined by the SPR (Subtree
Pruning and Regrafting) rearrangement to valid duplication trees, allows exploring the whole duplication tree space. We use these
restricted rearrangements in a local search method which improves an initial tree via successive rearrangements. This method is
applied to the optimization of parsimony and minimum evolution criteria. We show through simulations that this method improves all
existing programs for both reconstructing the topology of the true tree and recovering its duplication events. We apply this approach to
tandemly repeated human Zinc finger genes and observe that a much better duplication tree is obtained by our method than using any
other program.
Index Terms—Tandem duplication trees, phylogeny, topological rearrangements, local search, parsimony, minimum evolution, Zinc
finger genes.
1 INTRODUCTION
REPEATED sequences constitute an important fraction of
most genomes, from the well-studied Escherichia coli
bacterial genome [1] to the Human genome [2]. For
example, it is estimated that more than 50 percent of the
Human genome consists of repeated sequences [2], [3].
There exist three major types of repeated sequences:
transposon-derived repeats, micro or minisatellites, and
large duplicated sequences, the last often containing one or
several RNA or protein-coding genes. Micro or minisatel-
lites arise through a mechanism called slipped-strand
mispairing, and are always arranged in tandem: copies of
a same basic unit are linearly ordered on the chromosome.
Large duplicated sequences are also often found in tandem
and, when this is the case, unequal recombination is widely
assumed to be responsible for their formation.
Both the linear order among tandemly repeated se-
quences, and the knowledge of the biological mechanisms
responsible for their generation, suggest a simple model of
evolution by duplication. This model, first described by
Fitch in 1977 [4], introduces tandem duplication trees as
phylogenies constrained by the unequal recombination
mechanism. Although being a completely different biologi-
cal mechanism, slipped-strand mispairing leads to the same
duplication model [5]. A formal recursive definition of this
model is provided in Section 2, but its main features can be
grasped from the examples of Fig. 1. Fig. 1a shows the
duplication history of the 13 Antennapedia-class homeobox
genes from the cognate group [6]. In this history, the
ancestral locus has undergone a series of simple duplication
events where one of the genes has been duplicated into
two adjacent copies. Starting from the unique ancestral
gene, this series of events has produced the extant locus
containing the 13 linearly ordered contemporary genes. It is
easily seen [7] that trees only containing simple duplication
events are equivalent to binary search trees with labeled
leaves. They differ from standard phylogenies in that node
children have left/right orientation. Fig. 1b shows another
example corresponding to the nine variable genes of the
human T cell receptor Gamma (TRGV) locus [8]. In this
history, the most recent event involves a double duplica-
tion where two adjacent genes have been simultaneously
duplicated to produce four adjacent copies. Duplication
trees containing multiple duplication events differ from
binary search trees, but are less general than phylogenies.
The model proposed by Fitch [4] covers both simple and
multiple duplication trees.
Fitch’s paper [4] received relatively little attention at the
time of its publication probably due to the lack of available
sequence data. Rediscovered by Benson and Dong [9],
Tang et al. [10], and Elemento et al. [8], tandemly repeated
sequences and their suggested duplication model have
recently received much interest, providing several new
computational biology problems and challenges [11], [12].
The main challenge consists of creating algorithms
incorporating the model constraints to reconstruct the
The authors are with Projet Methodes et Algorithmes pour la Bioinformatique, LIRMM (UMR 5506, CNRS—Univ. Montpellier 2), 161 rue Ada, 34392 Montpellier Cedex 5—France. E-mail: [email protected].
Manuscript received 11 Oct. 2004; revised 17 Dec. 2004; accepted 20 Dec. 2004; published online 30 Mar. 2005.
For information on obtaining reprints of this article, please send e-mail to: [email protected], and reference IEEECS Log Number TCBBSI-0170-1004.
1545-5963/05/$20.00 © 2005 IEEE Published by the IEEE CS, CI, and EMB Societies & the ACM
duplication history of tandemly repeated sequences.
Indeed, accurate reconstruction of duplication histories
will be useful to elucidate various aspects of genome
evolution. They will provide new insights into the
mechanisms and determinants of gene and protein domain
duplication, often recognized as major generators of
novelty [13]. Several important gene families, such as
immunity-related genes, are arranged in tandem; better
understanding their evolution should provide new insights
into their duplication dynamics and clues about their
functional specialization. Studying the evolution of micro
and minisatellites could resolve unanswered biological
questions regarding human migrations or the evolution of
bacterial diseases [14].
Given a set of aligned and ordered sequences (DNA or
proteins), the aim is to find the duplication tree that best
explains these sequences, according to usual criteria in
phylogenetics, e.g., parsimony or minimum evolution. Few
studies have focused on the computational hardness of this
problem, and all of these studies only deal with the
restricted version where simultaneous duplication of multiple
adjacent segments is not allowed. In this context, Jaitly
et al. [15] show that finding the optimal single copy
duplication tree with parsimony is NP-Hard and that this
problem has a PTAS (Polynomial Time Approximation
Scheme). Another closely related PTAS is given by Tang
et al. [10] for the same problem. On the other hand,
Elemento et al. [7] describe a polynomial distance-based
algorithm that reconstructs optimal single copy tandem
duplication trees with minimum evolution.
However, it is commonly believed, as in phylogeny, that
most (especially multiple) duplication tree inference pro-
blems are NP-Hard. This explains the development of
heuristic approaches. Benson and Dong [9] provides various
parsimony-based heuristic reconstruction algorithms to infer
duplication trees, especially from minisatellites. Elemento
et al. [8] present an enumerative algorithm that computes the
most parsimonious duplication tree; this algorithm (by its
exhaustive approach) is limited to datasets of less than 15
repeats. Several distance-based methods have also been
described. The WINDOW method [10] uses an agglomeration
scheme similar to UPGMA [16] and NJ [17], but the cost
function used to judge potential duplication is based on the
assumption that the sequences follow a molecular clock mode
of evolution. The DTSCORE method [18] uses the same
scheme but corrects this limitation using a score criterion [19],
like ADDTREE [20]. DTSCORE can be used with sequences
that do not follow the molecular clock, which is, for example,
essential when dealing with gene families containing
pseudogenes that evolve much faster than functional genes.
Finally, GREEDY SEARCH [21] corresponds to a different
approach divided into two steps: First, a phylogeny is
computed with a classical reconstruction method (NJ), then,
with nearest neighbor interchange (NNI) rearrangements, a
duplication tree close to this phylogeny is computed. This
approach is noteworthy since it implements topological
rearrangements which are highly useful in phylogenetics
[22], but it works blindly and does not ensure that good
duplication trees will be found (cf. Section 5.2).
Topological rearrangements have an essential function in
phylogenetic inference, where they are used to improve an
initial phylogeny by subtree movement or exchange.
Rearrangements are very useful for all common criteria
(parsimony, distance, maximum likelihood) and are inte-
grated into all classical programs like PAUP* [23] or
PHYLIP [24]. Furthermore, they are used to define various
distances between phylogenies and are the foundation of
much mathematical work [25]. Unfortunately, they cannot
be directly used here, as shown by a simple example given
16 IEEE/ACM TRANSACTIONS ON COMPUTATIONAL BIOLOGY AND BIOINFORMATICS, VOL. 2, NO. 1, JANUARY-MARCH 2005
Fig. 1. (a) Rooted duplication tree describing the evolutionary history of the 13 Antennapedia-class homeobox genes from the cognate group [6].
(b) Rooted duplication tree describing the evolutionary history of the nine variable genes of the human T cell receptor Gamma (TRGV) locus [8]. In
both examples, the contemporary genes are adjacent and linearly ordered along the extant locus.
later. Indeed, when applied to a duplication tree, they do
not guarantee that another valid duplication tree will be
produced.
In this paper, we describe a set of topological rearrange-
ments to stay inside the duplication tree space and explore
the whole space from any of its elements. We then show the
advantages of this approach for duplication tree inference
from sequences. In Section 2, we describe the duplication
model introduced by [4], [8], [10], as well as an algorithm to
recognize duplication trees in linear time. Thanks to this
algorithm, we restrict the neighborhoods defined by
classical phylogeny rearrangements, namely, nearest neigh-
bor interchange (NNI) and subtree pruning and regrafting
(SPR), to valid duplication trees. We demonstrate (Section 3)
that for NNI moves this restricted neighborhood does not
allow the exploration of the whole duplication tree space.
On the other hand, we demonstrate that the restricted
neighborhood of SPR rearrangement allows the whole
space to be explored. In this way, we define a local search
method, applied here to parsimony and minimum evolu-
tion (Section 4). We compare this method to other existing
approaches using simulated and real data sets (Section 5).
We conclude by discussing the positive results obtained by
our method, and indicate directions for further research
(Section 6).
2 MODEL
2.1 Duplication History and Duplication Tree
The tandem duplication model used in this article was first
introduced by Fitch [4] then studied independently by [8],
[10]. It is based on unequal recombination which is assumed
to be the sole evolution mechanism (except point mutations)
acting on sequences. Although it is a completely different
biological mechanism, slipped-strand mispairing leads to
the same duplication model [5], [9].
Let $O = (1, 2, \ldots, n)$ be the ordered set of sequences
representing the extant locus. Initially containing a single
copy, the locus grew through a series of consecutive
duplications. As shown in Fig. 2a, a duplication history
may contain simple duplication events. When the dupli-
cated fragment contains two, three, or k repeats, we say that
it involves a multiple duplication event. Under this
duplication model, a duplication history is a rooted tree
with n labeled and ordered leaves, in which internal nodes
of degree 3 correspond to duplication events. In a real
duplication history (Fig. 2a), the time intervals between
consecutive duplications are completely known, and the
internal nodes are ordered from top to bottom according to
the moment they occurred in the course of evolution. Any
ordered segment set of the same height then represents an
ancestral state of the locus. We call such a set a floor, and
we say that two nodes $i, j$ are adjacent ($i \prec j$) if there is a
floor where i and j are consecutive and i is on the left of j.
However, in the absence of a molecular clock mode of
evolution (a typical problem), it is impossible to recover the
order between the duplication events of two different
lineages from the sequences. In this case, we are only able to
infer a duplication tree (DT) (Fig. 2b) or a rooted
duplication tree (RDT) (Fig. 2c).
A duplication tree is an unrooted phylogeny with
ordered leaves, whose topology is compatible with at least
one duplication history. Also, internal nodes of duplication
trees are partitioned into events (or “blocks” following
[10]), each containing one or more (ordered) nodes. We
distinguish “simple” duplication events that contain a
unique internal node (e.g., b and f in Fig. 2c) and “multiple”
duplication events which group a series of adjacent and
simultaneous duplications (e.g., c, d, and e in Fig. 2c). Let
$E = (s_i, s_{i+1}, \ldots, s_k)$ denote an event containing internal
nodes $s_i, s_{i+1}, \ldots, s_k$ in left to right order. We say that two
consecutive nodes of the same event are adjacent ($s_j \prec s_{j+1}$)
just like in histories, as any event belongs to a floor in all of
BERTRAND AND GASCUEL: TOPOLOGICAL REARRANGEMENTS AND LOCAL SEARCH METHOD FOR TANDEM DUPLICATION TREES 17
Fig. 2. (a) Duplication history; each segment represents a copy; extant segments are numbered. (b) Duplication tree (DT); the black points show the
possible root locations. (c) Rooted duplication tree (RDT) corresponding to history (a) and root position $\rho_1$ on (b).
the histories that are compatible with the DT being
considered. The same notation will also be used for leaves
to express the segment order in the extant locus. When the
tree is rooted, every internal node $s_j$ is unambiguously
associated with one parent and two child nodes; moreover,
one child of $s_j$ is “left” and the other one is “right”; they are
denoted $l_j$ and $r_j$, respectively. In this case, for any
duplication history that is compatible with this tree, child
nodes of an event $(s_i, s_{i+1}, \ldots, s_k)$ are organized as follows:
$$l_i \prec l_{i+1} \prec \ldots \prec l_k \prec r_i \prec r_{i+1} \prec \ldots \prec r_k.$$
In [8], [26], [27], it was shown that rooting a
duplication tree is different than rooting a phylogeny:
the root of a duplication tree necessarily lies on the tree
path between the most distant repeats on the locus, i.e., 1
and n; moreover, the root is always located “above” all
multiple duplications. For example, Fig. 1b shows that there are
only three valid root positions: the root cannot be a direct
ancestor of 12.
2.2 Recursive Definition of Rooted and Unrooted Duplication Trees
A duplication tree is compatible with at least one duplica-
tion history. This suggests a recursive definition, which
progressively reconstructs a possible history, given a
phylogeny $T$ and a leaf ordering $O$. We define a cherry
$(l, s, r)$ as a pair of leaves ($l$ and $r$) separated by a single
node $s$ in $T$, and we call $C(T)$ the set of cherries of $T$. This
recursive definition reverses evolution: It searches for a
“visible duplication event,” “agglomerates” this event, and
checks whether the “reduced” tree is a duplication tree. In
case of rooted trees, we have:
$(T, O)$ defines a duplication tree with root $\rho$ if and only if:
1. $(T, O)$ only contains $\rho$, or
2. there is in $C(T)$ a series of cherries
$(l_i, s_i, r_i), (l_{i+1}, s_{i+1}, r_{i+1}), \ldots, (l_k, s_k, r_k)$ with $k \geq i$ and
$l_i \prec l_{i+1} \prec \ldots \prec l_k \prec r_i \prec r_{i+1} \prec \ldots \prec r_k$ in $O$, such
that $(T', O')$ defines a duplication tree with root $\rho$,
where $T'$ is obtained from $T$ by removing
$l_i, l_{i+1}, \ldots, l_k, r_i, r_{i+1}, \ldots, r_k$, and $O'$ is obtained by
replacing $(l_i, l_{i+1}, \ldots, l_k, r_i, r_{i+1}, \ldots, r_k)$ by
$(s_i, s_{i+1}, \ldots, s_k)$ in $O$.
The definition for unrooted trees is quite similar:
$(T, O)$ defines an unrooted duplication tree if and only if:
1. $(T, O)$ contains 1 segment, or
2. same as for rooted trees with $(T', O')$ now defining an
unrooted duplication tree.
Those definitions provide a recursive algorithm, RADT
(Recognition Algorithm for Duplication Trees), to check
whether any given phylogeny with ordered leaves is a
duplication tree. In case of success, this algorithm can also
be used to reconstruct duplication events: At each step, the
series of internal nodes above denoted as $(s_i, s_{i+1}, \ldots, s_k)$ is
a duplication event. When the tree is rooted, $l_j$ is the left
child of $s_j$ and $r_j$ its right child, for every $j$, $i \leq j \leq k$. This
algorithm can be implemented in $O(n)$ [26] where $n$ is the
number of leaves. Another linear algorithm is proposed by
Zhang et al. [21] using a top down approach instead of a
bottom-up one, but applies only to rooted duplication trees.
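The recursive definition for rooted trees can be transcribed almost literally into code. The sketch below is only an illustration and is not the linear-time RADT implementation of [26]: it assumes the rooted tree is given as nested 2-tuples whose leaves are the integers 1 to n (their numeric order playing the role of O), and it tries every visible event by backtracking. The function name and the tree encoding are hypothetical choices made for this example.

```python
from functools import lru_cache
from itertools import count


def is_rooted_duplication_tree(tree):
    """Naive check of the recursive RDT definition (not linear time)."""
    parent = {}  # node id -> parent id (None for the root)
    label = {}   # leaf id -> leaf label
    ids = count()

    def build(node, par):
        nid = next(ids)
        parent[nid] = par
        if isinstance(node, tuple):
            if len(node) != 2:
                raise ValueError("tree must be binary")
            for child in node:
                build(child, nid)
        else:
            label[nid] = node

    build(tree, None)
    # The leaf ordering O: leaves sorted by their label.
    leaves = tuple(sorted(label, key=label.get))

    @lru_cache(maxsize=None)
    def reducible(order):
        if len(order) == 1:  # only the root remains
            return True
        # Try every visible event made of m consecutive cherries.
        for m in range(1, len(order) // 2 + 1):
            for p in range(len(order) - 2 * m + 1):
                lefts = order[p:p + m]
                rights = order[p + m:p + 2 * m]
                if all(parent[l] == parent[r] for l, r in zip(lefts, rights)):
                    merged = tuple(parent[l] for l in lefts)
                    if reducible(order[:p] + merged + order[p + 2 * m:]):
                        return True
        return False

    return reducible(leaves)


print(is_rooted_duplication_tree(((1, 3), (2, 4))))  # one double duplication
print(is_rooted_duplication_tree(((1, 4), (2, 3))))  # no valid history
```

In the first tree, the cherries (1, 3) and (2, 4) form a visible double duplication; in the second, agglomerating the cherry (2, 3) leaves leaves 1 and 4 separated by the new node, so no further event is visible.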
3 TOPOLOGICAL REARRANGEMENTS FOR
DUPLICATION TREES
This section shows how to explore the DT space using SPR
rearrangements. First, we describe some NNI, SPR, and
TBR rearrangement properties with standard phylogenies.
But, these rearrangements cannot be directly used to
explore the DT space. Indeed, when applied to a duplica-
tion tree, they do not guarantee that another valid
duplication tree will be produced. So, we have decided to
restrict the neighborhood defined by those rearrangements
to duplication trees. If we only used NNI rearrangements,
the neighborhood would be too restricted (as shown by a
simple example) and would not allow the whole DT space
to be explored. On the other hand, we can distinguish two
types of SPR rearrangements which, when applied to a
rooted duplication tree guarantee that another valid
duplication tree will be produced. Thanks to these specific
rearrangements, we demonstrate that restricting the neigh-
borhood of SPR rearrangements allows the whole space of
duplication trees to be explored.
Fig. 3. The tree obtained by applying an NNI move to a DT is not always a valid DT: $T$, of which $RT$ is a rooted version; $T'$ is obtained by
applying NNI(5,4) around the bold edge; none of the possible root positions of $T'$ (a, b, c, and d) leads to a valid RDT, cf. tree (b) which
corresponds to root b in $T'$.
3.1 Topological Rearrangements for Phylogeny
There are many ways of carrying out topological rearrange-
ments on phylogeny [22]. We only describe NNI (Nearest
Neighbor Interchange), SPR (Subtree Pruning Regrafting),
and TBR (Tree Bisection and Reconnection) rearrangements.
The NNI move is a simple rearrangement which
exchanges two subtrees adjacent to the same internal edge
(Figs. 3 and 4). There are two possible NNIs for each
internal edge, so $2(n - 3)$ neighboring trees for one tree
with n leaves. This rearrangement allows the whole space of
phylogeny to be explored; i.e., there is a succession of NNI
moves making it possible to transform any phylogeny P1
into any phylogeny P2 [28].
The SPR move consists of pruning a subtree and
regrafting it, by its root, to an edge of the resulting tree
(Figs. 6 and 7). We note that the neighborhood of a tree
defined by the NNI rearrangements is included in the
neighborhood defined by SPRs. The latter rearrangement
defines a neighborhood of size $2(n - 3)(2n - 7)$ [25].
Finally, TBR generalizes SPR by allowing the pruned
subtree to be reconnected by any of its edges to the resulting
tree. These three rearrangements (NNI, SPR, and TBR) are
reversible, that is, if $T'$ is obtained from $T$ by a particular
rearrangement, then $T$ can be obtained from $T'$ using the
same type of rearrangement.
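To give a sense of scale, the two neighborhood sizes quoted above can be evaluated for the tree sizes used later in the simulations. The function names below are hypothetical; the formulas are those stated in the text for unrooted binary phylogenies with n leaves.

```python
def nni_neighborhood_size(n):
    """Number of NNI neighbors: two per internal edge, n - 3 internal edges."""
    return 2 * (n - 3)


def spr_neighborhood_size(n):
    """Size of the SPR neighborhood as given in the text [25]."""
    return 2 * (n - 3) * (2 * n - 7)


for n in (12, 24, 48):
    print(n, nni_neighborhood_size(n), spr_neighborhood_size(n))
```

The NNI neighborhood grows linearly with n while the SPR neighborhood grows quadratically, which is why the cost of filtering SPR neighbors matters in the local search.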
3.2 NNI Rearrangements Do Not Stay in DT Space
The classical phylogenetic rearrangements (NNI, SPR,
TBR, ...) do not always stay in DT space. So, if we apply
an NNI to a DT (e.g., Fig. 3), the resulting tree is not always
a valid DT. This property is also true for SPR and TBR
rearrangements since NNI rearrangements are included in
these two rearrangement classes.
3.3 Restricted NNI Does Not Allow the Whole DT Space to Be Explored
To restrict the neighborhood defined by NNI rearrange-
ments to duplication trees, each element of the neighbor-
hood is filtered thanks to the recognition algorithm (RADT).
But, this restricted neighborhood does not allow the whole
DT space to be explored. Fig. 4 gives an example of a
duplication tree, T , the neighborhood of which does not
contain any DT. So, its restricted neighborhood is empty,
and there is no succession of restricted NNIs allowing T to
be transformed into any other DT.
3.4 Restricted SPR Allows the Whole DT Space to Be Explored
As before, we restrict (using RADT) the neighborhood
defined by SPR rearrangements to duplication trees. We
call an SPR move restricted if, starting from a
duplication tree, it leads to another duplication tree.
Main Theorem. Let T1 and T2 be any given duplication trees; T1
can be transformed into T2 via a succession of restricted SPRs.
Proof. To demonstrate the Main Theorem, we define two
types of special SPR that ensure staying within the space
of rooted duplication trees (RDT). Given these two types
of SPRs, we demonstrate that it is possible to transform
any rooted duplication tree into a caterpillar, i.e., a
rooted tree in which all internal nodes belong to the tree
path between the leaf 1 and the tree root $\rho$ (cf. Fig. 5).
This result demonstrates the theorem. Indeed, let T1
and T2 be two RDTs. We can transform T1 and T2 into a
caterpillar by a succession of restricted SPRs. So, it is
possible to transform T1 into T2 by a succession of
restricted SPRs, with (possibly) a caterpillar as inter-
mediate tree. This property holds since the reciprocal
movement of an SPR is an SPR. As the two SPR types
proposed ensure that we stay within the RDTs space, we
have the desired result for rooted duplication trees. And,
this result extends to unrooted duplication trees since
two DTs can be arbitrarily rooted, transformed from one
to the other using restricted SPRs, then unrooted. □
The first special SPR allows multiple duplication
events to be destroyed. Let $E = (s_i, s_{i+1}, \ldots, s_k)$ be a
duplication event, $r_i$ and $l_k$ respectively right child of $s_i$
Fig. 5. A six-leaf caterpillar.
Fig. 4. The NNI neighborhood of a duplication tree does not always contain duplication trees: $T$, of which $RT$ is a rooted version; $T'$ is obtained by
exchanging subtrees 1 and (2 5); none of the possible root positions of $T'$ (a, b, and c) leads to a valid duplication tree, cf. tree (b) which corresponds
to root b in $T'$; and the same holds for every neighbor of $T$ being obtained by NNI.
and left child of $s_k$, and let $p_i$ be the father of $s_i$. The
DELETE rearrangement consists of pruning the subtree of
root $r_i$ and grafting this subtree on the edge $(s_k, l_k)$, while
$l_i$ is renamed $s_i$ and the edge $(l_i, s_i)$ is deleted. Fig. 6
demonstrates this rearrangement.
Lemma 1. DELETE preserves the RDT property.
Proof. Let $T$ be the initial tree (Fig. 6a), $E = (s_i, s_{i+1}, \ldots, s_k)$
be an event of $T$, and $T'$ be the tree obtained from $T$ by
applying DELETE to $E$ (Fig. 6b). Children of any node $s_j$
($i \leq j \leq k$) are denoted $l_j$ and $r_j$.
By definition, for any duplication history compatible
with $T$ we have
$$l_i \prec l_{i+1} \prec \ldots \prec l_k \prec r_i \prec r_{i+1} \prec \ldots \prec r_k.$$
Thus, there is a way to partially agglomerate $T$ (using an
RADT-like procedure) such that these nodes become
leaves. The same agglomeration can be applied to $T'$ as
only ancestors of the $l_j$s and $r_j$s are affected by DELETE.
Now, 1) agglomerate the event $E$ of $T$, and 2) reduce $T'$
by agglomerating the cherry $(l_k, r_i)$ and then agglomerating
the event $(s_{i+1}, \ldots, s_k)$. Two identical trees follow,
which concludes the proof. □
By successively applying DELETE to any duplication
tree, we remove all multiple duplication events. The
following SPR rearrangement allows duplications to be
moved within simple RDT, i.e., any RDT containing only
simple duplications. Let p be a node of a simple RDT T , l its
left child, r its right child, and x the left child of r. This
rearrangement consists of pruning the subtree of root x and
regrafting it to the edge $(l, p)$ (Fig. 7). This rearrangement is
an SPR (in fact an NNI); we name it LEFT as it moves the
subtree root towards the left. It is obvious that the tree
obtained by applying such a rearrangement to a simple
RDT, is a simple RDT. We now establish the following
lemma which shows that any simple tree can be trans-
formed into a caterpillar.
Lemma 2. Let T be a simple RDT; T can be transformed into a
caterpillar by a succession of LEFT rearrangements.
Proof. In a caterpillar all internal nodes are ancestors of 1. If
T is not a caterpillar, there is an internal node r that is not
an ancestor of 1. If r is the right child of its father, we can
apply LEFT to the left child of r (Fig. 7). If r is the left
child of its father, we consider its father: It cannot be an
ancestor of 1 since its children are r and a node on the
right of r. So, we can apply the same argument: Either
the father of r is adequate for performing LEFT, or we
consider its father again. In this way, we necessarily
obtain a node for which the rearrangement is possible. T
is then transformed into a caterpillar by successively
applying the LEFT rearrangement to nodes which are not
on the path between 1 and $\rho$. After a finite number of
steps, all internal nodes are ancestors of 1 and T has been
transformed into a caterpillar. This concludes the proof
of Lemma 2 and, therefore, of our Main Theorem. □
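The argument of Lemma 2 can be illustrated on a small example. In the sketch below, a simple RDT is assumed to be encoded as nested 2-tuples (left child, right child) with integer leaves; this encoding and the function names are hypothetical. A LEFT move at a node p = (l, (x, y)) prunes x and regrafts it on the edge (l, p), giving ((l, x), y), and the moves are repeated until every internal node lies on the path between leaf 1 and the root.

```python
def left_move(node):
    """LEFT at p = (l, (x, y)): prune x, regraft it on edge (l, p)."""
    left, right = node
    x, y = right
    return ((left, x), y)


def to_caterpillar(tree):
    """Turn a simple RDT into a caterpillar; return (tree, number of moves)."""
    moves = 0

    def straighten(node):
        nonlocal moves
        if not isinstance(node, tuple):
            return node
        left, right = node
        # While the right child is internal, it is not an ancestor of
        # leaf 1, so a LEFT move can be applied at this node.
        while isinstance(right, tuple):
            left, right = left_move((left, right))
            moves += 1
        return (straighten(left), right)

    return straighten(tree), moves


def leaves(tree):
    """Leaves in left-to-right order."""
    if not isinstance(tree, tuple):
        return [tree]
    return leaves(tree[0]) + leaves(tree[1])


start = (1, ((2, 3), (4, (5, 6))))
caterpillar, n_moves = to_caterpillar(start)
print(caterpillar, n_moves)
```

Each move leaves the left-to-right leaf order unchanged, so the segment order along the locus is preserved throughout.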
4 LOCAL SEARCH METHOD
We consider data consisting of an alignment of n segments
with length k, and of the ordering O of the segments along
the locus. This alignment has been created before tree
construction and the problem is not to build simultaneously
the alignment and the tree, a much more complicated task
[29]. The aim is to find a (nearly) optimal duplication tree,
where “optimal” is defined by some usual phylogenetic
criterion and the ordered and aligned segments at hand.
Topological rearrangements described in the previous
section naturally lead to a local search method for this
purpose. We discuss its use to optimize the usual Wagner
parsimony [22] and the distance-based balanced minimum
evolution criterion (BME) [30], [31]. First, we describe our
local search method, then we define briefly these two
criteria and explain how to compute them during local
search.
Fig. 7. LEFT rearrangement.
Fig. 6. DELETE rearrangement.
4.1 The LSDT Method
Our method, LSDT (Local Search for Duplication Trees),
follows a classical local search procedure in which, at each
step, we try to strictly improve the current tree. This
approach can be used to optimize various criteria. In this
study, we restrict ourselves to parsimony and balanced
minimum evolution; $f(T)$ represents the value (to be
minimized) of one of these criteria for the duplication tree
T and the sequence set.
Algorithm 1 summarizes LSDT. The neighborhood of the
current DT, $T_{current}$, is computed using SPR. As we
explained earlier, we use the RADT procedure to restrict
this neighborhood to valid DTs. When a tree is a valid DT,
its $f$ criterion value is computed. That way, we select the
best neighbor of $T_{current}$. If this DT improves the value
obtained so far (i.e., $f(T_{best})$), the local search restarts with
this new topology. If no neighbor of $T_{current}$ improves $T_{best}$,
the local search is stopped and returns $T_{best}$.
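The procedure just described can be sketched generically as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: `neighbors`, `is_valid`, and `f` are hypothetical callables standing in for the SPR neighborhood, the RADT filter, and the criterion to be minimized, and the toy usage at the end replaces trees by integers purely to show the control flow.

```python
def local_search(start, neighbors, is_valid, f):
    """Strict-improvement local search over valid neighbors."""
    best, best_score = start, f(start)
    while True:
        # Score only the neighbors accepted by the validity filter.
        scored = [(f(t), t) for t in neighbors(best) if is_valid(t)]
        if not scored:
            return best
        score, candidate = min(scored, key=lambda pair: pair[0])
        if score < best_score:
            best, best_score = candidate, score  # restart from the new tree
        else:
            return best  # no neighbor strictly improves the criterion


# Toy usage: "trees" are integers, valid ones are even, optimum at 6.
result = local_search(
    start=0,
    neighbors=lambda x: [x - 2, x - 1, x + 1, x + 2],
    is_valid=lambda x: x % 2 == 0,
    f=lambda x: (x - 6) ** 2,
)
print(result)
```

Because only strict improvements are accepted, the search terminates as soon as the current tree is a local optimum within its restricted neighborhood.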
To analyze the time complexity of one LSDT step, we
have to consider the size of the neighborhood defined by
the restricted SPR. In the worst case, this size is of the same
order as the size of an unrestricted SPR neighborhood, i.e.,
$O(n^2)$. Indeed for the “double caterpillar” (Fig. 8), it is
possible to move any subtree being rooted on the path
between $n/2$ and $\rho$ towards any edge of the path between
$(n + 1)/2$ and $\rho$; and inversely. Thus, for this tree, $O(n^2)$
restricted SPRs can be performed. In the worst case,
restricting the neighborhood defined by SPR to duplication
trees does not significantly decrease the neighborhood size.
However, on average the diminution is quite significant;
e.g., with $n = 48$, only 5 percent of the neighborhood
corresponds to valid DTs, assuming DTs are uniformly
distributed [26].
Since the time complexity of the recognition algorithm
(RADT) is $O(n)$, computing the neighborhood defined by
restricted SPR requires $O(n^3)$. The calculation of the
criterion value is done for each tree of the restricted
neighborhood. Thus one local search step basically requires
$O(n^3 + n^2 g)$, where $g$ represents the time complexity of
computing the criterion value. However, preprocessing
allows this time complexity to be lowered, both for
parsimony and minimum evolution, as we shall explain in
the following sections.
4.2 The Maximum Parsimony Criterion
Parsimony is commonly acknowledged [22] to be a good
criterion when dealing with slightly divergent sequences,
which is usually the case with tandemly duplicated genes
[8]. The parsimony criterion involves selecting the tree
which minimizes the number of substitutions needed to
explain the evolution of the given sequences. Finding the
most parsimonious tree [22] or duplication tree [15] is
NP-hard, but we can find the optimal labeling of the
internal nodes and the parsimony score of a given tree T in
polynomial time using the Fitch-Hartigan algorithm [32],
[33]. The parsimony score and optimal labeling of internal
nodes are independently computed for each position within
sequences, using a postorder depth-first search algorithm
that requires $O(n)$ time [32], [33]. Thus, computing the
parsimony score of $n$ sequences of length $k$ requires $O(kn)$
time. Hence, if we use this algorithm during our local
search method, one local search step is computed in $O(kn^3)$,
which is relatively high.
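The postorder computation of the parsimony score can be sketched as below. This is a plain Fitch pass on a rooted binary tree, without the speed-ups discussed next and without Hartigan's generalization; the nested-tuple tree encoding, the `sequences` dictionary, and the function name are assumptions made for the example.

```python
def fitch_score(tree, sequences):
    """Parsimony score of a rooted binary tree (nested 2-tuples).

    `sequences` maps each leaf label to an aligned string; all strings
    are assumed to have the same length k.
    """
    total = 0

    def postorder(node):
        nonlocal total
        if not isinstance(node, tuple):
            # Leaf: one singleton state set per alignment column.
            return [{c} for c in sequences[node]]
        left = postorder(node[0])
        right = postorder(node[1])
        states = []
        for a, b in zip(left, right):
            common = a & b
            if common:
                states.append(common)
            else:
                states.append(a | b)
                total += 1  # one substitution is needed at this site
        return states

    postorder(tree)
    return total


seqs = {1: "AA", 2: "AA", 3: "GG", 4: "GG"}
print(fitch_score(((1, 2), (3, 4)), seqs))
print(fitch_score(((1, 3), (2, 4)), seqs))
```

Each internal node is visited once and handles k columns, which gives the O(kn) cost per tree mentioned above.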
To speed up this process, we adapted techniques
commonly used in phylogeny for fast calculation of
parsimony. Our implementation uses a data structure
implemented (among others) in DNAPARS [24] and
described in [34], [35]. Let $T_p$ be the pruned subtree and
$T_r$ be the resulting tree. A preprocessing stage computes
the parsimony vector (i.e., the optimal score and optimal
labeling of all sequence positions) of every rooted subtree
of $T_r$ using a double depth-first search [36] (Fig. 9a); the
first search is postordered and computes the parsimony
vector of down-subtrees; the second search is preordered
and computes the parsimony vector of up-subtrees. Each
search requires $O(nk)$ time. Thanks to this data structure,
the parsimony score of the tree obtained by regrafting $T_p$
on any given edge of $T_r$ is computed in $O(k)$ (Fig. 9b).
Hence, computing the SPR neighbor with minimum
parsimony of any given duplication tree is achieved in
$O(n^3 + n \times nk + n^2 k) = O(n^3 + n^2 k)$; the first term ($n^3$)
represents the neighborhood computation; the second
term ($n \times nk$) corresponds to the time required by the $n$
Fig. 8. A simple rooted duplication tree with a double caterpillar
structure.
preprocessing stages; the third term ($n^2 k$) is the time to
test the n subtrees and the n possible insertion edges.
4.3 The Distance-Based Balanced Minimum Evolution Principle
As in any distance-based approach, we first estimate the
matrix of pairwise evolutionary distances between the
segments, using some standard distance estimator [22],
e.g., the Kimura two-parameter estimator [37] in case of
DNA or the JTT method with proteins [38]. Let $\Delta$ be this
matrix and $\delta_{ij}$ be the distance between segments $i$ and $j$.
The $\Delta$ matrix plus the segment order is the input of the
reconstruction method.
The minimum evolution principle (ME) [39], [40]
involves selecting the shortest tree to be the tree which
best explains the observed sequences. The tree length is
equal to the sum of all the edge lengths, and the edge
lengths are estimated by minimizing a least squares fit
criterion. The problem of inferring optimal phylogenies
within ME is commonly assumed to be NP-hard, as are
many other distance-based phylogeny inference problems
[41]. Nonetheless, ME forms the basis of several phyloge-
netic reconstruction methods, generally based on greedy
heuristics. Among them is the popular Neighbor-Joining
(NJ) algorithm [17]. Starting from a star tree, NJ iteratively
agglomerates external pairs of taxa so as to minimize the
tree length at each step.
Recently, Pauplin [30] proposed a new simple formula to
estimate the tree length $L(T)$ of tree $T$:
$$L(T) = \sum_{i < j} 2^{1 - T_{ij}} \, \delta_{ij},$$
where $T_{ij}$ is the topological distance (number of edges) in $T$
between segments i and j. The correctness of this formula
was shown by Semple and Steel [42], while Desper and
Gascuel [31] showed that this formula is a special case of
weighted-least squares tree fitting. Moreover, Desper and
Gascuel demonstrated that selecting the shortest tree (as
computed from above formula) is statistically consistent and
well suited for phylogenetic inference. They called this new
version of ME “balanced minimum evolution” (BME) [31].
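Pauplin's formula is simple enough to evaluate directly. The sketch below assumes, for illustration only, that the unrooted tree is given as a list of edges between node labels and that the distances are stored in a nested dictionary `delta[i][j]` for each leaf pair; these names and this encoding are hypothetical. Topological distances are obtained by breadth-first search from each leaf.

```python
from collections import deque


def pauplin_length(edges, leaves, delta):
    """Tree length: sum over leaf pairs i < j of 2^(1 - T_ij) * delta_ij."""
    adjacency = {}
    for u, v in edges:
        adjacency.setdefault(u, []).append(v)
        adjacency.setdefault(v, []).append(u)

    def topological_distances(source):
        # Number of edges from `source` to every other node.
        dist = {source: 0}
        queue = deque([source])
        while queue:
            u = queue.popleft()
            for v in adjacency[u]:
                if v not in dist:
                    dist[v] = dist[u] + 1
                    queue.append(v)
        return dist

    total = 0.0
    for index, i in enumerate(leaves):
        dist = topological_distances(i)
        for j in leaves[index + 1:]:
            total += 2.0 ** (1 - dist[j]) * delta[i][j]
    return total


# Quartet ((1,2),(3,4)) with all five edges of length 1.
edges = [(1, "u"), (2, "u"), ("u", "v"), (3, "v"), (4, "v")]
delta = {1: {2: 2, 3: 3, 4: 3}, 2: {3: 3, 4: 3}, 3: {4: 2}}
print(pauplin_length(edges, [1, 2, 3, 4], delta))
```

With additive distances derived from a quartet whose five edges all have length 1, the formula returns the true tree length, 5.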
Using the above formula, the length of any given tree is
computed in $O(n^2)$, so computing one LSDT local search
step can be achieved in $O(n^4)$. However, a faster implementation
is possible using a straightforward modification
of our BME addition algorithm [43]. This involves:
1. pruning a rooted subtree $T_p$ from tree $T$,
2. computing the average distance between all nonintersecting
subtree pairs in the remaining tree $T_r$,
3. computing the average distance between $T_p$ and any
subtree of $T_r$ in $T$, and
4. using formula (10) from [43] and RADT to find the
best allowed edge to regraft $T_p$.
Steps 2 and 3 are based on algorithms described in [43],
which follow the same approach as the double depth-first
search described in the previous section. These two steps
require $O(n^2)$, just as Step 4. As there are $O(n)$ subtrees to
prune and regraft, this implementation requires $O(n^3)$ to
perform one search step.
5 RESULTS
5.1 Simulation Protocol
We applied our method and other existing methods to
simulated datasets obtained using the procedure described
in [18]. We uniformly randomly generated rooted tandem
duplication trees (see [26]) with 12, 24, and 48 leaves and
assigned lengths to the edges of these trees using the
coalescent model [44]. We then obtained molecular clock
trees (MC), which might be unrealistic in numerous cases,
e.g., when the sequences being studied contain pseudo-
genes which evolve much faster than functional genes.
Then, we generated nonmolecular clock trees (NO-MC)
from the previous trees by independently multiplying
Fig. 9. (a) Every edge defines one down-subtree and one up-subtree; e.g., A represents the down-subtree (2 3) defined by the edge e while D corresponds to the up-subtree (1 (4 5)). Moreover, only the parsimony vector of the five leaves is known before the preprocessing stage. The postorder search computes the parsimony vector of down-subtrees: A is computed from 2 and 3, B from 4 and 5, C from A and B. The preorder search computes the parsimony vector of up-subtrees: D is obtained from 1 and B, E is obtained from D and 3, etc. (b) When the parsimony vector of every subtree in $T_r$ is known, regrafting $T_p$ on any given edge and computing the parsimony score of the resulting tree only requires analyzing the parsimony vector of three subtrees and is done in $O(k)$ time.
every edge length by $1 + 0.8X$, where $X$ was drawn from
an exponential distribution with parameter 1. MC trees
were rescaled by multiplying every edge length by 1.8.
The trees thus obtained (MC and NO-MC) have a
maximum leaf-to-leaf divergence in the range $[0.1, 0.7]$,
and in NO-MC trees the ratio between the longest and
shortest root-to-leaf lineages is about 3.0 on average. Both
values are in accordance with real data, e.g., gene families
[8] or repeated protein domains [10].
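The edge-length transformation used to derive the two kinds of trees can be sketched as follows. The function name, the flat list of edge lengths, and the explicit random generator are assumptions made for this illustration; the simulation software of [18] is not reproduced here. Since an exponential variable with parameter 1 has mean 1, the factor $1 + 0.8X$ has mean 1.8, which matches the constant used to rescale the MC trees.

```python
import random


def rescale_edge_lengths(lengths, rng, molecular_clock):
    """Return new edge lengths for an MC or a NO-MC tree."""
    if molecular_clock:
        # MC trees: every edge is multiplied by the same constant.
        return [1.8 * x for x in lengths]
    # NO-MC trees: each edge gets its own factor 1 + 0.8 X, X ~ Exp(1).
    return [x * (1.0 + 0.8 * rng.expovariate(1.0)) for x in lengths]


rng = random.Random(2005)  # arbitrary seed, for reproducibility only
clock_like = [0.05, 0.10, 0.02]
print(rescale_edge_lengths(clock_like, rng, molecular_clock=True))
print(rescale_edge_lengths(clock_like, rng, molecular_clock=False))
```

Drawing an independent factor for each edge is what breaks the molecular clock, while the common scaling keeps the overall divergence of MC and NO-MC trees comparable on average.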
SEQGEN [45] was used to produce a 1,000 bp-long
nucleotide multiple alignment from each of the generated
trees using the Kimura two-parameter model of substitution
[46], and a distance matrix was computed by DNADIST [24]
from this alignment using the same substitution model. For
MC and NO-MC cases, 1,000 trees (and, then, 1,000 sequence
sets and 1,000 distance matrices) were generated per tree
size. These data sets were used to compare the ability of the
various methods to recover the original trees from the
sequences or from the distance matrices, depending on the
method being tested. We measured the percentage of trees
(out of 1,000) being correctly reconstructed (%tr). For the
phylogeny reconstruction methods, we also kept the
percentage of duplication trees among the set of inferred
trees. Due to the random process used for generating these
trees and datasets, some short branches might not have
undergone any substitution (as during evolution) and, thus,
are unobtainable, except by chance. When n and, thus, the
branch number is high, it becomes hard or impossible to
find the entire tree. So, we also measured the percentage of
duplication events in the true tree recovered by the inferred
tree (%ev). A duplication event involves one or more
internal nodes and is the lowest common ancestor of a set
of leaves; we say it “covers” its descendent leaves. However,
the leaves covered by a simple duplication event can change
when the root position changes. As regards the true tree, the
root is known and each event is defined by the set of leaves
which it covers. But, the inferred tree is unrooted. To avoid
ambiguity, we then tested all possible root positions and
chose the one which gave the highest proximity in number
of events detected between the true tree and the inferred
tree, where two events are identical if they cover the same
leaves. Finally, we kept the average parsimony value of each
method (pars).
5.2 Performance and Comparison
Using this protocol, we compared NJ [17], TNT [47], and
GREEDY-SEARCH (GS) [21] which starts from the NJ tree, a
modified version of GREEDY TRHIST RESTRICTED (GTR)
[9] to infer multiple duplication trees, WINDOW [10],
DTSCORE [18], and eight versions of our local search
method LSDT corresponding to different starting duplica-
tion trees (GS, GTR, WINDOW, and DTSCORE) and
different criteria (parsimony and BME). TNT and GS use
the parsimony criterion, but the others are distance-based
methods. TNT is acknowledged as one of the very best
parsimony packages; it was run with 10 replicates and TBR
rearrangements. TNT often returns a set of equally
parsimonious trees. When this set contained duplication
trees, we randomly selected one of them; when no
duplication tree was inferred by TNT, we randomly
selected one of the output trees.
Results are given in Tables 1 and 2. First, we observe that
with $n = 48$ the true tree is almost never entirely found, for
the reasons explained earlier. On the other hand, the best
methods recover 80 to 95 percent of the duplication events,
indicating that the tested datasets are relatively easy. NJ
and TNT perform relatively well, but they often output
trees that are not duplication trees, which is unsatisfactory
(e.g., with 48 leaves and NO-MC, NJ and TNT only infer
1 percent and 5 percent of duplication trees, respectively).
The GS approach is noteworthy since it modifies the trees
inferred by NJ to transform them into duplication trees.
However, GS is only slightly better than NJ regarding the
proportion of correctly reconstructed trees, and it considerably
degrades the number of recovered duplication events,
which could be explained by the blind search it performs
to transform NJ trees into duplication trees. GTR also
obtains relatively poor results. As expected from its
assumptions, WINDOW performs better in the MC case
than in the NO-MC one. Finally, DTSCORE obtains the best
performance among the four existing methods, whatever
the topological criterion considered.
Applying our method to starting trees produced by GS,
GTR, WINDOW, and DTSCORE reveals the advantages of
the local search approach. Optimizing parsimony or BME
gives similar results, with a slight advantage for parsimony
as expected from the relatively low divergence rates in our
data sets. The trees produced by GS, GTR, and WINDOW
are clearly improved and, for most, are better than those
obtained by DTSCORE. DTSCORE trees are also improved,
even though this improvement is not very high from a
topological point of view. This could be explained by the
fact that DTSCORE is already an accurate method with
respect to the datasets used.
When we consider the parsimony criterion, the gain
achieved by LSDT is appreciable for each starting method. This
could be expected for GS, WINDOW, and DTSCORE, which
do not optimize this criterion: with n = 48 in the NO-MC case,
the gain for GS is about 329, confirming that this
method is clearly suboptimal, while the gains for WINDOW and
DTSCORE are about 42 and 15, which are lower but still
significant. The results for GTR, which optimizes parsimony,
are more surprising, since the gain (again with n = 48 in the
NO-MC case) is about 77 on average, which is very high.
Moreover, the parsimony value obtained by LSDT is very
close to that of TNT, in spite of a much more restricted
search space. This confirms the good performance of our
BERTRAND AND GASCUEL: TOPOLOGICAL REARRANGEMENTS AND LOCAL SEARCH METHOD FOR TANDEM DUPLICATION TREES 23
local search method. It should be stressed that these gains
are obtained at low computational cost as dealing with any
of the 48-taxon datasets only requires about 10 seconds
for parsimony and five seconds for BME on a standard
PC-Pentium 4.
5.3 Analysis of the ZNF45 Family
Zinc finger (ZNF) genes code for proteins that contain one
or more zinc finger motifs. The zinc finger motif is one of
the most common motifs involved in nucleic acid-protein
interaction. Experimental studies on functions of ZNF genes
suggest that many of them code for transcription factors,
and some of them are known to take part in cellular growth
and development [48]. However, the biological functions of
most ZNF genes are currently unknown. The 16 members of
the ZNF45 gene family are found in the q13.2 gene cluster on
human chromosome 19 [49]. The organization and features
of the members of the ZNF45 family suggest that the genes
in the family may have been produced by a series of in situ
gene duplication events [49]. The ZNF45 gene family has
been previously studied by Tang et al. [10] and Zhang et al.
[21], who proposed different tandem duplication trees to
explain its evolutionary history.
24 IEEE/ACM TRANSACTIONS ON COMPUTATIONAL BIOLOGY AND BIOINFORMATICS, VOL. 2, NO. 1, JANUARY-MARCH 2005
TABLE 1 Performance Comparison Using Simulations (Molecular Clock Mode of Evolution)
X+LSDT_Y: X is the method used to obtain the starting tree and Y the criterion being optimized by LSDT; %tr: the percentage of trees being correctly reconstructed; the percentage of duplication trees obtained by phylogeny reconstruction methods is given between parentheses; %ev: the percentage of duplication events in the true tree being recovered by the inferred tree; pars: the average parsimony value.
TABLE 2 Performance Comparison Using Simulations (No Molecular Clock of Evolution)
Note: see Table 1.
We downloaded the DNA sequences of the 16 members
of ZNF45 from NCBI. A multiple alignment was computed
with TCOFFEE (footnote 1) using default settings. We removed gaps,
as usual in phylogenetics [22], and third codon positions,
which look saturated (734 parsimony steps are required to
explain the evolution of the 237 sites). We thus obtained a
final alignment (footnote 2) containing 474 homologous sites, with a
maximum pairwise divergence of 0.45.
PAUP* [23] was used to estimate the matrix of pairwise
distances, assuming the GTR substitution model [50] and a
gamma distribution of rates with parameter 1.
We used this distance matrix and DTSCORE to build a
starting tree, which was then refined by LSDT using
parsimony. We selected this criterion because of its good
performance with simulated data (Tables 1 and 2). The
resulting tree (Figs. 10a and 10b) is a simple DT requiring
897 steps to explain the extant sequences. We tried to
improve this score using a computationally intensive
ratchet approach [51], but were unable to obtain any other
DT with better (or even identical) parsimony. We also ran
TNT with ratchet, 1,000 random taxon addition replicates
and TBR branch swapping (i.e., all TNT options to intensify
the search) and found one maximum-parsimony phylogeny
requiring 896 steps. This phylogeny (Fig. 10c) contains an
unresolved node with degree 4 and is not a duplication tree.
1. http://igs-server.cnrs-mrs.fr/Tcoffee/tcoffee_cgi/index.cgi.
2. Available on request.
Fig. 10. (a) Duplication tree for the 16 genes of the human ZNF45 family inferred by DTSCORE plus LSDT with parsimony; black dots represent the only allowed root positions, according to the tandem duplication model; the (arbitrarily) selected root position is circled. (b) Rooted duplication tree corresponding to tree (a). (c) Phylogeny inferred by TNT. Tree (a) can be obtained from tree (c) by moving ZNF45 and ZNF228 to edge 1, and ZNF233 to edge 2. Edge lengths in tree (a) and tree (c) were estimated by maximum likelihood [52]. Lengths in tree (b) are meaningless and were adjusted to obtain a readable drawing.
The TNT phylogeny is close to the LSDT duplication tree. To
transform one into the other, only three taxa have to be
moved (Fig. 10), and the two trees differ by only 1 parsimony
step. A similar difference was commonly observed in
simulation where TNT found (non-DT) phylogenies requir-
ing one parsimony step less (on average) than the DTs
found by LSDT (Tables 1 and 2), though the true tree used
to generate the sequences was a DT. Thus, having (only)
one parsimony step of difference between the best DT and
the best phylogeny is not significant and can be seen as
supporting the duplication model. Moreover, the discre-
pancy between the two trees can be explained by long
branch attraction, a phenomenon that frequently affects
parsimony-based reconstructions [53]. Indeed, ZNF180 and
ZNF229 genes are distant from the other genes (Figs. 10a
and 10c) and might perturb the whole tree. When removing
those two genes from the data set, both LSDT and TNT
found the same tree, which is identical to the LSDT tree of
Fig. 10a without the two genes. With 14 segments, the
probability of randomly picking up a duplication tree
among all distinct phylogenies is less than $10^{-4}$ [26]. This
extremely small probability indicates that the identity
between LSDT and TNT trees is very unlikely to be due
to chance. This provides strong support for the tandem
duplication model and indicates that our LSDT tree likely
represents most, if not all, of the history of the ZNF45 family.
We compared trees obtained by Tang et al. [10], Zhang
et al. [21], and those of the other programs to the LSDT tree
of Fig. 10. We computed the parsimony score of each tree
and the percentage of events shared by each tree with the
LSDT tree. Just as in the simulation study, we tested GS
[21], GTR [9], WINDOW [10], DTSCORE [18], and LSDT
using different starting points but optimizing parsimony in
all cases.
Results are displayed in Table 3 and confirm those
obtained with simulated data sets. Results for the trees from
[10] and [21] are poor, which was expected as these
methods (WINDOW and GS, respectively) do not
optimize the parsimony criterion and as we did not use
the same alignment. GS is relatively poor, while
DTSCORE, WINDOW, and GTR perform better. LSDT
clearly improves these four methods, with gains ranging
from 10 to 50 parsimony steps. In all cases but GTR,
LSDT recovers the most parsimonious DT of Fig. 10.
6 CONCLUSION AND PROSPECTS
We have demonstrated that restricting the neighborhood
defined by the SPR rearrangement to valid duplication trees
allows the whole DT space to be explored. Thanks to these
rearrangements, we have defined a general local search
method which we used to optimize the parsimony and
balanced minimum evolution criteria. We have thus
improved the topological accuracy of all the tested
methods.
Several research directions are possible. Finding the set
of combinatorial configurations for the SPR rearrangement
which necessarily produce a duplication tree could allow
the neighborhood computation to be accelerated (e.g., for
n = 48, only 5 percent of the SPR neighborhood corresponds
to duplication trees) and, furthermore, give more insight
into the nature of duplication trees, which are just starting
to be investigated mathematically [12], [26], [27]. Our local
search method could be improved using restricted TBR
rearrangements or with the help of different stochastic
approaches (taboo, noising, ...) in order to avoid local
minima. Moreover, it would be relevant to test this local
search method with other criteria like maximum likelihood.
Finally, combining the tandem duplication events with
speciation events, as described in [54] and [55] for
nontandem duplications, would be relevant for real
applications where we have homologous tandem repeats
from several genomes.
ACKNOWLEDGMENTS
The authors would like to thank Wafae El Alaoui for her help
with ZNF45 family genes, and Richard Desper, Wim Hordijk
and the referees of the Workshop on Algorithms in
Bioinformatics (WABI ’04) for reading preliminary versions
of this paper. This work was supported by ACI-IMPBIO
(Ministere de la Recherche, France).
TABLE 3 Analysis of the ZNF45 Data Set
REFERENCES
[1] F. Blattner, G. Plunkett, C. Bloch, N. Perna, V. Burland, M. Riley, J. Collado-Vides, J. Glasner, C. Rode, G. Mayhew, J. Gregor, N. Davis, H. Kirkpatrick, M. Goeden, D. Rose, B. Mau, and Y. Shao, “The Complete Genome Sequence Of Escherichia Coli k-12,” Science, vol. 277, no. 5331, pp. 1453-1474, 1997.
[2] E. Lander et al., “Initial Sequencing and Analysis of the Human Genome,” Nature, vol. 409, pp. 860-921, 2001.
[3] A. Smit, “Interspersed Repeats and Other Mementos of Transposable Elements in Mammalian Genomes,” Current Opinion in Genetics & Development, vol. 9, pp. 657-663, 1999.
[4] W. Fitch, “Phylogenies Constrained by Cross-Over Process as Illustrated by Human Hemoglobins in a Thirteen-Cycle, Eleven Amino-Acid Repeat in Human Apolipoprotein A-I,” Genetics, vol. 86, pp. 623-644, 1977.
[5] G. Levinson and G. Gutman, “Slipped-Strand Mispairing: A Major Mechanism for DNA Sequence Evolution,” Molecular Biology and Evolution, vol. 4, pp. 203-221, 1987.
[6] J. Zhang and M. Nei, “Evolution of Antennapedia-Class Homeobox Genes,” Genetics, vol. 142, no. 1, pp. 295-303, 1996.
[7] O. Elemento and O. Gascuel, “An Exact and Polynomial Distance-Based Algorithm to Reconstruct Single Copy Tandem Duplication Trees,” Proc. 14th Ann. Symp. Combinatorial Pattern Matching (CPM2003), 2003.
[8] O. Elemento, O. Gascuel, and M.-P. Lefranc, “Reconstructing the Duplication History of Tandemly Repeated Genes,” Molecular Biology and Evolution, vol. 19, pp. 278-288, 2002.
[9] G. Benson and L. Dong, “Reconstructing the Duplication History of a Tandem Repeat,” Proc. Intelligent Systems in Molecular Biology (ISMB1999), T. Lengauer, ed., pp. 44-53, 1999.
[10] M. Tang, M. Waterman, and S. Yooseph, “Zinc Finger Gene Clusters and Tandem Gene Duplication,” J. Computational Biology, vol. 9, pp. 429-446, 2002.
[11] E. Rivals, “A Survey on Algorithmic Aspects of Tandem Repeats Evolution,” Int’l J. Foundations of Computer Science, vol. 15, no. 2, pp. 225-257, 2004.
[12] O. Gascuel, D. Bertrand, and O. Elemento, “Reconstructing the Duplication History of Tandemly Repeated Sequences,” Math. of Evolution and Phylogeny, O. Gascuel, ed., 2004.
[13] S. Ohno, Evolution by Gene Duplication. Springer Verlag, 1970.
[14] P.L. Fleche, Y. Hauck, L. Onteniente, A. Prieur, F. Denoeud, V. Ramisse, P. Sylvestre, G. Benson, F. Ramisse, and G. Vergnaud, “A Tandem Repeats Database for Bacterial Genomes: Application to the Genotyping of Yersinia Pestis and Bacillus Anthracis,” BioMed Central Microbiology, vol. 1, pp. 2-15, 2001.
[15] D. Jaitly, P. Kearney, G. Lin, and B. Ma, “Methods for Reconstructing the History of Tandem Repeats and Their Application to the Human Genome,” J. Computer and System Sciences, vol. 65, pp. 494-507, 2002.
[16] P. Sneath and R. Sokal, Numerical Taxonomy. pp. 230-234, San Francisco: W.H. Freeman and Company, 1973.
[17] N. Saitou and M. Nei, “The Neighbor-Joining Method: A New Method for Reconstructing Phylogenetic Trees,” Molecular Biology and Evolution, vol. 4, pp. 406-425, 1987.
[18] O. Elemento and O. Gascuel, “A Fast and Accurate Distance-Based Algorithm to Reconstruct Tandem Duplication Trees,” Bioinformatics, vol. 18, pp. 92-99, 2002.
[19] J. Barthelemy and A. Guenoche, Trees and Proximity Representations. Wiley and Sons, 1991.
[20] S. Sattath and A. Tversky, “Additive Similarity Trees,” Psychometrika, vol. 42, pp. 319-345, 1977.
[21] L. Zhang, B. Ma, L. Wang, and Y. Xu, “Greedy Method for Inferring Tandem Duplication History,” Bioinformatics, vol. 19, pp. 1497-1504, 2003.
[22] D. Swofford, P. Olsen, P. Waddell, and D. Hillis, Molecular Systematics. pp. 407-514, Sunderland, Mass.: Sinauer Associates, 1996.
[23] D. Swofford, PAUP*. Phylogenetic Analysis Using Parsimony (*and Other Methods), version 4. Sunderland, Mass.: Sinauer Associates, 1999.
[24] J. Felsenstein, “PHYLIP - PHYLogeny Inference Package,” Cladistics, vol. 5, pp. 164-166, 1989.
[25] C. Semple and M. Steel, Phylogenetics. Oxford Univ. Press, 2003.
[26] O. Gascuel, M. Hendy, A. Jean-Marie, and S. McLachlan, “The Combinatorics of Tandem Duplication Trees,” Systematic Biology, vol. 52, pp. 110-118, 2003.
[27] J. Yang and L. Zhang, “On Counting Tandem Duplication Trees,” Molecular Biology and Evolution, vol. 21, pp. 1160-1163, 2004.
[28] D. Robinson, “Comparison of Labeled Trees with Valency Trees,” J. Combinatorial Theory, vol. 11, pp. 105-119, 1971.
[29] L. Wang and D. Gusfield, “Improved Approximation Algorithms for Tree Alignment,” J. Algorithms, vol. 25, pp. 255-273, 1997.
[30] Y. Pauplin, “Direct Calculation of a Tree Length Using a Distance Matrix,” J. Molecular Evolution, vol. 51, pp. 41-47, 2000.
[31] R. Desper and O. Gascuel, “Theoretical Foundation of the Balanced Minimum Evolution Method of Phylogenetic Inference and Its Relationship to Weighted Least-Squares Tree Fitting,” Molecular Biology and Evolution, vol. 21, no. 3, pp. 587-598, 2004.
[32] W. Fitch, “Toward Defining the Course of Evolution: Minimum Change for a Specified Tree Topology,” Systematic Zoology, vol. 20, pp. 406-416, 1971.
[33] J. Hartigan, “Minimum Mutation Fits to a Given Tree,” Biometrics, vol. 29, pp. 53-65, 1973.
[34] G. Ganapathy, V. Ramachandran, and T. Warnow, “Better Hill-Climbing Searches for Parsimony,” Proc. Third Int’l Workshop Algorithms in Bioinformatics, 2003.
[35] P.A. Goloboff, “Methods for Faster Parsimony Analysis,” Cladistics, vol. 12, pp. 199-220, 1996.
[36] V. Berry and O. Gascuel, “Inferring Evolutionary Trees with Strong Combinatorial Evidence,” Theoretical Computer Science, vol. 240, pp. 271-298, 2000.
[37] M. Kimura, “A Simple Model for Estimating Evolutionary Rates of Base Substitutions through Comparative Studies of Nucleotide Sequences,” J. Molecular Evolution, vol. 16, pp. 111-120, 1980.
[38] D. Jones, W. Taylor, and J. Thornton, “The Rapid Generation of Mutation Data Matrices from Protein Sequences,” Computer Applications in Biosciences, vol. 8, pp. 275-282, 1992.
[39] K. Kidd and L. Sgaramella-Zonta, “Phylogenetic Analysis: Concepts and Methods,” Am. J. Human Genetics, vol. 23, pp. 235-252, 1971.
[40] A. Rzhetsky and M. Nei, “Theoretical Foundation of the Minimum-Evolution Method of Phylogenetic Inference,” Molecular Biology and Evolution, vol. 10, pp. 173-1095, 1993.
[41] W. Day, “Computational Complexity of Inferring Phylogenies from Dissimilarity Matrices,” Bull. Math. Biology, vol. 49, pp. 461-467, 1987.
[42] C. Semple and M. Steel, “Cyclic Permutations and Evolutionary Trees,” Advances in Applied Math., vol. 32, no. 4, pp. 669-680, 2004.
[43] R. Desper and O. Gascuel, “Fast and Accurate Phylogeny Reconstruction Algorithms Based on the Minimum-Evolution Principle,” J. Computational Biology, vol. 9, pp. 687-706, 2002.
[44] M. Kuhner and J. Felsenstein, “A Simulation Comparison of Phylogeny Algorithms under Equal and Unequal Evolutionary Rates,” Molecular Biology and Evolution, vol. 11, pp. 459-468, 1994.
[45] A. Rambault and N. Grassly, “Seq-Gen: An Application for the Monte Carlo Simulation of DNA Sequence Evolution Along Phylogenetic Trees,” Computer Applied Biosciences, vol. 13, pp. 235-238, 1997.
[46] J. Felsenstein and G. Churchill, “A Hidden Markov Model Approach to Variation Among Sites in Rate of Evolution,” Molecular Biology and Evolution, vol. 13, pp. 93-104, 1996.
[47] P.A. Goloboff, J.S. Farris, and K. Nixon, “TNT: Tree Analysis Using New Technology,” 2000, www.cladistics.com.
[48] T. El-Barabi and T. Pieler, “Zinc Finger Proteins: What We Know and What We Would Like to Know,” Mechanisms of Development, vol. 33, pp. 155-169, 1991.
[49] M. Shannon, J. Kim, L. Ashworth, E. Branscomb, and L. Stubbs, “Tandem Zinc-Finger Gene Families in Mammals: Insights and Unanswered Questions,” DNA Sequence - The J. Sequencing and Mapping, vol. 8, no. 5, pp. 303-315, 1998.
[50] P. Waddel and M. Steel, “General Time Reversible Distances with Unequal Rates Across Sites: Mixing T and Inverse Gaussian Distributions with Invariant Sites,” Molecular Phylogeny and Evolution, vol. 8, pp. 398-414, 1997.
[51] K.C. Nixon, “The Parsimony Ratchet, a New Method for Rapid Parsimony Analysis,” Cladistics, vol. 15, pp. 407-414, 1999.
[52] S. Guindon and O. Gascuel, “A Simple, Fast and Accurate Method to Estimate Large Phylogenies by Maximum-Likelihood,” Systematic Biology, vol. 52, no. 5, pp. 696-704, 2003.
[53] J. Felsenstein, “Cases in Which Parsimony or Compatibility Methods Will Be Positively Misleading,” Systematic Zoology, vol. 27, pp. 401-410, 1978.
[54] D. Page and M. Charleston, “From Gene to Organismal Phylogeny: Reconciled Trees and the Gene Tree/Species Tree Problem,” Molecular Phylogenetics and Evolution, vol. 7, pp. 231-240, 1997.
[55] M. Hallett, J. Lagergren, and A. Tofigh, “Simultaneous Identification of Duplications and Lateral Transfers,” Proc. Conf. Research and Computational Molecular Biology (RECOMB2004), pp. 347-356, 2004.
Denis Bertrand is a PhD student under the supervision of Olivier Gascuel. His research subject is the study of tandemly repeated sequences. His main areas of interest are phylogenetics, combinatorics, and algorithms.
Olivier Gascuel is Directeur de Recherche at the Centre National de la Recherche Scientifique (France). He is the head of the bioinformatics group from the LIRMM laboratory, belongs to the editorial board of Systematic Biology and of BMC Evolutionary Biology, and has served in a number of program committees of bioinformatics conferences (ISMB, WABI). He started in this field in the mid 1980s, with works on sequence analysis and protein structure prediction. Since the beginning of the 1990s, he turned his efforts to phylogenetics, focusing on the mathematical and computational tools and concepts. He (co)authored several well-known phylogeny inference programs (BioNJ, PHYML, FastME).
Optimizing Multiple Seeds for Protein Homology Search
Daniel G. Brown
Abstract—We present a framework for improving local protein alignment algorithms. Specifically, we discuss how to extend local
protein aligners to use a collection of vector seeds or ungapped alignment seeds to reduce noise hits. We model picking a set of seed
models as an integer programming problem and give algorithms to choose such a set of seeds. While the problem is NP-hard, and
Quasi-NP-hard to approximate to within a logarithmic factor, it can be solved easily in practice. A good set of seeds we have chosen
allows four to five times fewer false positive hits, while preserving essentially identical sensitivity as BLASTP.
Index Terms—Bioinformatics database applications, similarity measures, biology and genetics.
1 INTRODUCTION
PAIRWISE alignment is one of the most important problems
in bioinformatics. Here, we continue an exploration into
the seeding and structure of local pairwise alignments and
show that a recent strategy for seeding nucleotide align-
ments can be expanded to protein alignment. Heuristic
protein sequence aligners, exemplified by BLASTP [1], find
almost all high-scoring alignments. However, the sensitivity
of heuristic aligners to moderate-scoring alignments can
still be poor. In particular, alignments with BLASTP score
between 40 and 60 are commonly missed by BLASTP, even
though many are of truly homologous sequences. We focus
on these alignments and show that a change to the seeding
strategy gives success rates comparable to BLASTP with far
fewer false positive hits.
Specifically, multiple spaced seeds [2] and their relatives,
vector seeds [3], can be used in local protein alignment to
reduce the false positive rate in the seeding step of alignment
by a factor of four. We present a protocol for choosing
multiple vector seeds that allows us to find good seeds that
work well together. Our approach is based on solving a set-
cover integer program whose solution gives optimal thresh-
olds for a collection of seeds. Our IP is prone to overtraining,
so we discuss how to reduce the dependency of the solution
on the set of training alignments, both by increasing the false
positive rate of the seeds found slightly and by making the
program less sensitive to outliers. The problem we are trying
to solve is NP-hard and Quasi-NP-hard to approximate to a
sublogarithmic factor, so we present heuristics for it, though
most instances are of moderate enough size to use integer
programming solvers.
Our successful result here contrasts with our previous
work [3] in which we introduced vector seeds. There, we
found that using only one vector seed would not substan-
tially improve BLASTP’s sensitivity or selectivity. The use
of multiple seeds is the important change in the present
work. This successful use of multiple seeds is similar to
what has been reported recently for pairwise nucleotide
alignment [4], [5], [6], but the approach we use is different
since protein aligners require extremely high sensitivity. We
note that, independently of our work, the authors of
PatternHunter, the first program to use optimized spaced
seeds, have developed a protein aligner based on seeding
approaches similar to those we discuss here [7]; however,
they have not offered theoretical justification for their
approach, which, in some sense, we provide here.
Our results confirm the themes developed by us and
others since the initial development of spaced seeds. The
first theme is that spaced seeds help in heuristic alignment
because their hits, the highly conserved regions that one
uses as a basis for building an alignment, occur more
independently in true alignments than hits to unspaced seeds.
In protein alignments, there are often many small regions of
high conservation, each of which has a chance to have a hit
to a seed in it. With unspaced seeds, the probability that any
one of these regions is hit is low, but, when a region is hit,
there may be several more hits, which is unhelpful. By
contrast, a spaced seed is likely to hit a given region fewer
times, wasting less runtime, and will also hit at least one
region in more alignments, increasing sensitivity.
The second theme is that the more one understands how
local and global alignments look, the more possible it is to
tailor alignment seeding strategies to a particular applica-
tion, reducing false positives and improving true positives.
Here, by basing our set of seeds on sensitivity to true
alignments, we choose a set of seed models that hit diverse
types of short conserved alignment subregions. Conse-
quently, the probability that one of them hits a given
alignment is high since they complement each other well.
IEEE/ACM TRANSACTIONS ON COMPUTATIONAL BIOLOGY AND BIOINFORMATICS, VOL. 2, NO. 1, JANUARY-MARCH 2005 29
The author is with the School of Computer Science, University of Waterloo, 200 University Ave., West, Waterloo, ON N2L 3G1, Canada. E-mail: [email protected].
Manuscript received 1 Nov. 2004; revised 2 Jan. 2005; accepted 11 Jan. 2005; published online 30 Mar. 2005.
For information on obtaining reprints of this article, please send e-mail to: [email protected], and reference IEEECS Log Number TCBBSI-0183-1104.
1545-5963/05/$20.00 © 2005 IEEE Published by the IEEE CS, CI, and EMB Societies & the ACM
2 BACKGROUND: HEURISTIC ALIGNMENT AND
SPACED SEEDS
Since the development of heuristic sequence aligners [1], the
same approach has been commonly used: identify short,
highly conserved regions and build local alignments
around these “hits.” This avoids the use of the Smith-
Waterman algorithm [8] for pairwise local alignment, which
has $\Theta(nm)$ runtime on input sequences $A$ and $B$ of length $n$
and $m$, respectively. (We will use the notation $A[i]$ to
represent the $i$th character of sequence $A$.)
Instead, assuming random sequences, the expected
runtime of this heuristic search method is $h(n,m) + a(n,m)$,
where $h(n,m)$ is the amount of time needed to find hits in the
two sequences and $a(n,m)$ is the expected time needed to
compute the alignments from the hits. Most heuristic aligners
have $h(n,m) = \Theta(n + m + nm/k)$, while $a(n,m) = \Theta(nm/k)$
for some large constant $k$. There are many assumptions in
these formulas. First, even when we align sequences with true
homologies, most hits are between unrelated positions, so the
estimation of the runtime need not consider whether the
sequences are related. Further, this simplification assumes
that each hit found in the first phase results in a constant
amount of work being done in the second phase to identify
that it is false (or that true hits are rare). It is the speedup factor
of $k$ that is important here; assuming $m$ and $n$ are large, the
overall runtime is much faster.
Most heuristic aligners look at the scores of matching
characters in short regions and use high-scoring short
regions as hits. For example, BLASTP [1] hits are three
consecutive positions in the two sequences where the total
score, according to a BLOSUM or PAM scoring matrix, of
aligning the three letters in one sequence to the three letters
of the other sequence is at least +13. Finding such hits can
be done easily, for example, by making a hash table of one
sequence and searching positions of the hash table for the
other sequence, in time proportional to the length of the
sequences and the number of hits found. BLASTP uses
more complicated data structures for this process, but the
principle is similar.
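The hash-table idea can be sketched as follows. This is a deliberately simplified illustration and not BLASTP's data structure: it reports only exact matches between three-letter words, whereas BLASTP also accepts non-identical words whose total score reaches the threshold. The function name `word_hits` and the example sequences are hypothetical.

```python
from collections import defaultdict
from typing import Dict, List, Tuple


def word_hits(a: str, b: str, k: int = 3) -> List[Tuple[int, int]]:
    """Return (i, j) pairs such that a[i:i+k] == b[j:j+k].

    One sequence is indexed in a hash table keyed by its k-letter words;
    the other sequence is then scanned word by word. The running time is
    proportional to the sequence lengths plus the number of hits found.
    """
    index: Dict[str, List[int]] = defaultdict(list)
    for i in range(len(a) - k + 1):
        index[a[i:i + k]].append(i)

    hits: List[Tuple[int, int]] = []
    for j in range(len(b) - k + 1):
        # .get avoids creating empty entries for words absent from a
        for i in index.get(b[j:j + k], ()):
            hits.append((i, j))
    return hits


# Hypothetical toy sequences sharing the stretch "VLAAG".
example_hits = word_hits("MKVLAAGIW", "TTVLAAGQ")
```

Here the shared stretch produces three overlapping word hits on the same diagonal, which illustrates how clustered the hits of a consecutive seed tend to be.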
2.1 Seeding Models
To generalize BLASTP’s hits, we defined vector seeds [3], [9].
A vector seed is a pair $(v, T)$. Vector $v = (v_1, \ldots, v_k)$ is a
vector of position multipliers and $T$ is a threshold. Given
two sequences $A$ and $B$, let $s_{i,j}$ be the score in our scoring
matrix of aligning $A[i]$ to $B[j]$. If we consider position $i$
in $A$ and $j$ in $B$, we then get a hit to the vector seed at those
positions when $v \cdot (s_{i,j}, s_{i+1,j+1}, \ldots, s_{i+k-1,j+k-1}) \geq T$. In this
framework, BLASTP’s seed is ((1, 1, 1), 13).
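A direct transcription of this definition is sketched below. It is a brute-force illustration only: it examines every pair of positions, which a real aligner would avoid by indexing, and the scoring function `toy_score` (+5 for identical letters, -1 otherwise) is a hypothetical stand-in for a BLOSUM or PAM matrix, chosen so that the example is self-contained.

```python
from typing import Callable, List, Sequence, Tuple


def toy_score(x: str, y: str) -> int:
    """Hypothetical scoring function; NOT a BLOSUM or PAM matrix."""
    return 5 if x == y else -1


def vector_seed_hits(
    a: str,
    b: str,
    v: Sequence[int],
    threshold: int,
    score: Callable[[str, str], int] = toy_score,
) -> List[Tuple[int, int]]:
    """All (i, j) with v . (s_{i,j}, ..., s_{i+k-1,j+k-1}) >= threshold."""
    k = len(v)
    hits: List[Tuple[int, int]] = []
    for i in range(len(a) - k + 1):
        for j in range(len(b) - k + 1):
            total = sum(v[t] * score(a[i + t], b[j + t]) for t in range(k))
            if total >= threshold:
                hits.append((i, j))
    return hits


# The seed ((1, 1, 1), 13) under the toy scores: three identical letters
# score 15 and pass; two identical letters and one mismatch score 9 and fail.
strict_hits = vector_seed_hits("MKVLAAGIW", "TTVLAAGQ", (1, 1, 1), 13)
# Lowering the threshold to 9 also admits words with a single mismatch.
loose_hits = vector_seed_hits("MKVLAAGIW", "TTVLAAGQ", (1, 1, 1), 9)
```

Lowering the threshold can only add hits, which is the sensitivity versus false positive trade-off that the choice of $T$ controls.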
Vector seeds generalize the earlier idea of spaced seeds
[2] for nucleotide alignments, where both scores and the
vector are 0/1 vectors and where $T$, the threshold, equals
the number of 1s in $v$. A spaced seed requires an exact
match in the positions where the vector is 1, and the places
where the vector is 0 are “don’t care” positions. In our
original work with vector seeds [3], the freedom to allow
positions of $v$ to have values besides 0 and 1 was not
extremely useful, so the vector seeds we discuss here all
have binary vectors $v$.
Spaced seeds have the same expected number of junk
hits as unspaced seeds. For unrelated noise DNA sequences,
this is $nm4^{-w}$, where $w$ is the number of ones in
the seed (its support). Their advantage comes because more
distinct internal subregions of a given alignment will match
a spaced seed than the unspaced seed; this happens because
the hits are more independent of each other. The probability
that an alignment of length 64 with 70 percent conservation
matches a good spaced seed of support 11 can be greater
than 45 percent because there are likely to be more
subregions that match the spaced seed than the unspaced
seed; by contrast, the default BLASTN seed, which is
11 consecutive required matches, hits only 30 percent of
alignments.
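The comparison above can be explored with a small Monte Carlo sketch. Several things here are assumptions made for illustration: alignments are modeled as 64 independent positions, each a match with probability 0.7; the spaced seed used is the weight-11 pattern 111010010100110111 published with PatternHunter, which is not necessarily the seed behind the figure quoted above; and all function names are hypothetical.

```python
import random
from typing import List, Optional


def seed_hits_alignment(matches: List[bool], seed: str) -> bool:
    """True if some window of the alignment matches at every '1' of the seed."""
    required = [t for t, c in enumerate(seed) if c == "1"]
    for start in range(len(matches) - len(seed) + 1):
        if all(matches[start + t] for t in required):
            return True
    return False


def estimate_sensitivity(
    seed: str,
    length: int = 64,
    identity: float = 0.7,
    trials: int = 2000,
    rng: Optional[random.Random] = None,
) -> float:
    """Fraction of simulated alignments containing at least one seed hit."""
    if rng is None:
        rng = random.Random(0)
    hit_count = 0
    for _ in range(trials):
        matches = [rng.random() < identity for _ in range(length)]
        if seed_hits_alignment(matches, seed):
            hit_count += 1
    return hit_count / trials


def expected_noise_hits(n: int, m: int, weight: int) -> float:
    """Expected junk hits n * m * 4^(-w) between unrelated DNA sequences."""
    return n * m * 4.0 ** (-weight)


CONSECUTIVE = "1" * 11             # BLASTN-style seed, support 11
SPACED = "111010010100110111"      # assumed PatternHunter seed, support 11

spaced_sensitivity = estimate_sensitivity(SPACED, rng=random.Random(1))
consecutive_sensitivity = estimate_sensitivity(CONSECUTIVE, rng=random.Random(1))
```

Both seeds have support 11, so `expected_noise_hits` returns the same value for either; only the estimated sensitivities differ, and under this model the spaced seed should come out clearly ahead.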
Spaced seeds have three advantages over unspaced
seeds. First, their hits are more independent, which means
that it is more likely that a given alignment has at least one
hit to a seed; fewer alignments have many. Second, the seed
model can be tailored to a particular application: If there is
structure or periodicity to alignments, this can be reflected
in the design of the seeds chosen. For example, in searching
for homologous codons, they can be tailored to the three-
periodic structure of such alignments [10], [11]. Finally, the
use of multiple seeds allows us to boost sensitivity well
above what is achievable with a single seed, which, for
nucleotide alignment, can give near 100 percent sensitivity
in reasonable runtime [4].
Keich et al. [12] have given an algorithm for a simple
model of alignments to compute the probability that an
alignment hits a seed; this has been extended by both
Buhler et al. [10] and Brejova et al. [11] to more complex
sequence models. Choi et al. [13] have also shown
experimental results for spaced seeds with high sensitivity
across a wide range of homologies. Kucherov et al. [14]
show how to adapt spaced seeds to the interesting case of
alignments where no subregion of the alignment has a
higher score than the entire alignment.
2.2 Some Newer Seeding Models
Another seeding model, which has recently arisen [7], [15],
is that of ungapped alignment seeds. These were developed by
Brown and Hudek [15] to anchor global alignments of
ambiguous DNA sequences and, independently, by Kisman
et al. [7] in their heuristic protein aligner, tPatternHunter.
An ungapped alignment seed is a vector v, a global
threshold T , and a vector of positional minimum scores b.
There is a match between positions in the two sequences
when the vector of pairwise match scores is at least as large,
position-by-position, as the minimum scores vector b and
where the dot product of the position-by-position scores and
the multiplier vector v is at least T . These seeds are a
compromise between spaced seeds and consecutive seeds:
They require spaced positions to have good scores (those
where the lower bound vector b has high values), while also
focusing on the quality of the local alignment at the seed by
possibly examining all of the positions of the seed. It is not
possible to cast an ungapped alignment seed in the language
of vector seeds because of the requirement that each
individual position’s score is greater than its bound. It is
possible to cast a vector seed as an ungapped alignment seed,
by setting the b vector to $-\infty$ in all positions, thus removing
the position-by-position lower bound requirement.
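As an illustration of the definition above, the hit test for an ungapped alignment seed can be sketched in a few lines of Python. The function names and the representation of a candidate hit as a list of positional pairwise scores are assumptions made for this sketch and are not taken from any existing aligner; a bound of negative infinity stands for "no positional requirement."

```python
import math


def ungapped_seed_hit(scores, v, b, T):
    """Return True if the positional scores satisfy the seed (v, b, T).

    scores: pairwise match scores at the positions covered by the seed
    v:      vector of positional multipliers
    b:      vector of positional minimum scores
    T:      global threshold on the dot product of scores and v
    """
    if not (len(scores) == len(v) == len(b)):
        raise ValueError("scores, v and b must have the same length")
    # Every position must reach its own lower bound.
    if any(s < lower for s, lower in zip(scores, b)):
        return False
    # The weighted sum of positional scores must reach the global threshold.
    return sum(s * w for s, w in zip(scores, v)) >= T


def vector_seed_hit(scores, v, T):
    """A vector seed is the special case where every bound is -infinity."""
    return ungapped_seed_hit(scores, v, [-math.inf] * len(v), T)
```

With positional scores (5, -2, 6), the vector seed ((1, 0, 1), 11) is hit, while the same multipliers with bounds (0, 0, 0) are not, because the middle position falls below its bound.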
Csuros [16] has also extended this framework of seeding to
look at variable-length seeds, where the length of the regions
that must match depends on their positional scores. While
this approach can also be brought into the framework of the
present work, we have not done so in our experiments.
2.3 Multiple Seeds
Another important extension to these ideas of seeding has
been the use of multiple seeds of different sorts in basing
alignments. In this approach, an attempt is made to perform
extension when any of a collection of seed models has a hit.
This will work well if each chosen seed has a very low false
positive rate so that their total false positive rate is still
below that of one seed of comparable sensitivity.
Several authors [2], [3], [4], [6], [10], [17] have proposed
using multiple seeds and given heuristics to choose them.
This problem was recently given a theoretical framework by
Xu et al. [5] and, independently, Kucherov et al. [18] studied
heuristic algorithms for identifying sets of good seeds. In
work unrelated to the present work, Kisman et al. [7] have
heuristically used multiple ungapped alignment seeds
(though not called by that term) for protein alignment. To
the best of our knowledge, the present work is the first work
to choose multiple seeds for protein alignment with a
theoretical basis.
3 CHOOSING A GOOD SET OF SEEDS
Spaced seeds have made a substantial impact in nucleotide
alignments, but less in protein alignment. Here, we show
that they have use in this domain as well. Specifically,
multiple vector seeds or multiple ungapped alignment
seeds, with high thresholds, give essentially the sensitivity
of BLASTP with four times fewer noise hits. Slightly fewer
alignments are hit, but the regions of alignment hit by the
vector seeds are all of the same good ones as hit by the
BLASTP seed and a few more. In other words, BLASTP hits
more alignments, but the hits found by BLASTP and not the
vector seeds are mostly in areas unlikely to be expanded to
full alignments.
We adapt a framework for identifying sets of seeds
introduced by Xu et al. [5]. We model multiple seed
selection as a set cover problem and give heuristics for the
problem. For our purposes, one advantage of the formula-
tion is that it works with explicit alignments: Since real
alignments may not look like a probabilistic model, we can
pick a set of seeds for sensitivity to a collection of true
alignments. Unfortunately, this also gives rise to problems,
as the thresholds may be set high due to overtraining for a
given set of alignments.
Most of our experiments concern themselves with vector
seeds, but the framework can be expanded straightforwardly
to ungapped alignment seeds as well. This is because we do
not compute theoretical sensitivity of the seeds, but, instead,
only identify hits in existing real alignments. Indeed, our
framework is quite broad and extends to many different
models for seeding as long as the assumption that false
positives are additive is reasonably accurate and that one can
compute that false positive rate for the seed models. Where
the ungapped alignment seeds require some thought, we
present the addition needed for them.
3.1 Background Rates
One important detail that we need before we begin is the
background hit rate for a given vector or ungapped
alignment seed. We noted previously [3] that this can be
computed for vector seeds, given a scoring matrix; it is also
straightforward to compute for ungapped alignment seeds.
Namely, from the scoring matrix, we can compute
the distribution of letters in random sequences implied by
the matrix; this can then be used to compute the distribu-
tion of scores found in unrelated sequences. Using this, we
can compute the probability that unrelated sequences give a
hit to a given seed at a random position, which we call the
false positive rate for that seed. In fact, we can easily
compute the entire probability distribution on the score for
a given seed vector at a random position. Similarly, we can
compute this probability under the constraint that posi-
tional scores have minimum value, thus expanding to
ungapped alignment seeds.
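The computation described above can be sketched as a position-by-position convolution. This is a minimal illustration under two assumptions that are ours: positions in unrelated sequences are independent, and the distribution of pairwise scores (here the hypothetical input `pair_dist`) has already been derived from the scoring matrix. The function names are also hypothetical.

```python
import math
from collections import defaultdict


def seed_score_distribution(pair_dist, v, b=None):
    """Distribution of the dot-product score of pattern v at a random position.

    pair_dist maps a pairwise score to its probability for two unrelated
    letters. If bounds b are given, mass from positions that violate a bound
    is dropped, so the result sums to the probability that all bounds hold.
    """
    if b is None:
        b = [-math.inf] * len(v)
    dist = {0: 1.0}
    for weight, lower in zip(v, b):
        nxt = defaultdict(float)
        for total, p in dist.items():
            for s, q in pair_dist.items():
                if s >= lower:
                    nxt[total + weight * s] += p * q
        dist = dict(nxt)
    return dist


def false_positive_rate(pair_dist, v, T, b=None):
    """Probability that a random position scores at least T under the seed."""
    dist = seed_score_distribution(pair_dist, v, b)
    return sum(p for score, p in dist.items() if score >= T)
```

For a toy distribution with scores -1, 0, and 2 occurring with probabilities 0.5, 0.25, and 0.25, the seed ((1, 0, 1), 4) requires both outer positions to score 2, giving a rate of 0.0625; adding the bounds (0, 0, 0) halves this, since the middle position must then be nonnegative.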
For the default BLASTP seed, the probability that two
random unrelated positions have a hit is quite high, 1/
1,600. Because of this high level of false positives, BLASTP
must filter hits further in hopes of throwing out hits in
unrelated sequences. Specifically, BLASTP rapidly exam-
ines the local area around a hit and, if this region is not also
well-conserved, the hit is thrown out. Sometimes, this
filtering throws out all of the hits found in some true
alignments and, thus, BLASTP misses them, even though
they hit the seed. One way of modeling this filtering is to
view BLASTP as testing two seeds simultaneously: The
vector seed ((1, 1, 1), 13) and an ungapped alignment seed
that looks at the region surrounding the seed hit.
Our goal in using other seed models here is to reduce the
false positive rate, while still hitting the overwhelming
majority of alignments and hitting them in places that are
highly enough conserved as to make a full alignment likely.
A flowchart of our proposal, and the approach of BLASTP,
is in Fig. 1.
For a set Q of alignment seeds, we say that its false positive
rate is the probability that any seed in Q has a hit to two
random positions in unrelated sequences. This is not equal to
the sum of the false positive rates for all seeds in Q since hits to
one seed may overlap hits to another. However, we will use
this approximation in our optimization. As we extend to a
very large collection of seeds in Q, this can become worrisome
as the same false positive may be counted many times.
However, this may be appropriate, in fact, depending on how
the search is done to find the false hits.
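The relationship between the true rate and the additive approximation can be written out explicitly. Here $h_q$ is a symbol introduced only for this illustration, denoting the event that seed $q$ has a hit at a random pair of positions in unrelated sequences:

```latex
\[
\Pr\Bigl[\,\bigcup_{q \in Q} h_q\Bigr] \;\le\; \sum_{q \in Q} \Pr[h_q],
\qquad\text{and, for two seeds,}\qquad
\Pr[h_1 \cup h_2] = \Pr[h_1] + \Pr[h_2] - \Pr[h_1 \cap h_2].
\]
```

The sum therefore overestimates the false positive rate of the set by exactly the probability mass of coinciding hits.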
3.2 An Integer Program to Choose Many Seeds
Here, we give an integer program to find the set of seeds
that hits all alignments in a given training set with overall
lowest possible false positive rate. We will show that our IP
encodes the Set-Cover problem and that it is NP-hard to
solve and Quasi-NP-hard even to approximate to a
sublogarithmic factor. However, for moderate-sized train-
ing sets, we can solve it, in practice, or use simple heuristics
to get good solutions.
Given a set of alignment seeds Q, we say that they hit a
given alignment a if any member of Q has a hit to the
alignment. Our goal in picking such a set will be to
minimize the false positive rate of the set Q, with the
requirement that we hit all alignments in a training
collection, A.
This optimization goal is the alternative to the goal of Xu
et al. [5]. In that work, we maximized seed sensitivity when
a maximum number of spaced seeds is allowed; given that
all possible seeds had the same false positive rate, this was
equivalent to maximizing sensitivity for a given false
positive rate. This alternative goal of minimizing false
positives when we want 100 percent sensitivity on the
training set is appropriate for protein alignment, where
we want to achieve extremely high sensitivity, as close to
100 percent as possible.
3.2.1 The Integer Program
Here, we show how to cast this seed selection problem as an
integer program. Recall that a seed model is specified by the vector v of
multipliers or, for an ungapped alignment seed, by the vector v
of multipliers and the vector b of positional lower bounds.
We will call this vector or vectors the “pattern” of a seed.
We can then view choosing a set of vector or ungapped
alignment seeds as choosing thresholds for each pattern.
More formally, suppose we are given a collection of
alignments $A = \{a_1, \ldots, a_m\}$ and a set of seed patterns
$P = \{p_1, \ldots, p_n\}$. We will choose thresholds $(T_1^*, \ldots, T_n^*)$ for
the patterns of $P$ such that the seed model set
$Q^* = \{(p_1, T_1^*), \ldots, (p_n, T_n^*)\}$ hits all alignments in $A$ and the false
positive rate of $Q^*$ is as low as possible. The $T_i^*$ may be $\infty$,
which corresponds to not choosing the pattern $p_i$ at all.
We require that each alignment $a_j$ must be hit, so one of
the thresholds must be low enough to hit $a_j$. To verify this,
we compute the best-scoring hit for each seed pattern $p_i$ in
each alignment $a_j$; let the score of this hit be $T_{i,j}$. If we
choose $T_i^*$ so that it is at most $T_{i,j}$, then the seed $(p_i, T_i^*)$ will
hit alignment $a_j$.
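The table of best-scoring hits $T_{i,j}$ can be computed by sliding each pattern along each alignment. The sketch below assumes, purely for illustration, that an alignment is given as a list of ungapped segments, each a list of positional pairwise scores; the function names are hypothetical.

```python
def best_hit_score(segments, v):
    """Highest dot-product score of pattern v over all placements in the
    ungapped segments of one alignment (-inf if the pattern fits nowhere)."""
    best = float("-inf")
    k = len(v)
    for seg in segments:
        for start in range(len(seg) - k + 1):
            score = sum(w * s for w, s in zip(v, seg[start:start + k]))
            best = max(best, score)
    return best


def hit_score_table(patterns, alignments):
    """table[i][j] is the best-scoring hit of pattern i in alignment j."""
    return [[best_hit_score(a, p) for a in alignments] for p in patterns]
```

A seed with pattern $p_i$ and threshold at most `table[i][j]` then hits alignment $a_j$, which is the only information about the alignments that the optimization needs.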
To model this as an integer program, we have a collection
of integer variables $x_{i,T}$, one for each possible threshold value $T$ for
seed pattern $p_i$. We note that we are requiring that the
number of possible thresholds is small or can be granularized reasonably
since each possible threshold will get its own constraint. For
simple seeds from a BLOSUM matrix, the scores at a position
come in a small range of integers, so the possible reasonable
thresholds form a small range; let $T_m$ be the smallest such
threshold. We will set variable $x_{i,T}$ to 1 when the threshold for
seed pattern $p_i$ is at most $T$; for each pattern $p_i$, its threshold
chosen is the smallest $T$ where $x_{i,T} = 1$.
To compute the false positive rate, we let $r_{i,T}$ be the
probability that a random place in the background model
has score exactly $T$ according to the seed model $(p_i, T)$. We
add these up for all of the false hits with score equal to or
greater than the chosen thresholds. Our integer program is
as follows:
$$
\begin{aligned}
\min \ & \sum_{i,T} x_{i,T}\, r_{i,T}, \quad \text{such that} \\
& \sum_{i} x_{i,T_{i,j}} \ge 1 && \text{for all alignments } a_j \\
& x_{i,T} \ge x_{i,T-1} && \text{for all thresholds } T > T_m \\
& x_{i,T} \in \{0, 1\} && \text{for all } i \text{ and } T.
\end{aligned}
$$
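To make the objective and the coverage constraint concrete, the following sketch solves a tiny instance by exhaustive enumeration of thresholds. It is not the integer programming approach itself and scales exponentially in the number of patterns; the function names and the dictionary representation of $r_{i,T}$ are assumptions made for the example.

```python
import math
from itertools import product


def tail_rate(r_i, T):
    """False positive rate of one pattern at threshold T: the sum of the
    background probabilities r_i[t] over all scores t >= T."""
    return sum(p for t, p in r_i.items() if t >= T)


def choose_thresholds(hit_scores, r):
    """Pick one threshold per pattern (math.inf means unused) minimizing the
    summed false positive rate while hitting every alignment.

    hit_scores[i][j]: best hit score T_{i,j} of pattern i in alignment j
    r[i]:             dict mapping score T to the probability of exactly T
    """
    n, m = len(hit_scores), len(hit_scores[0])
    options = [sorted(r_i) + [math.inf] for r_i in r]
    best_cost, best_choice = math.inf, None
    for choice in product(*options):
        covered = all(
            any(choice[i] <= hit_scores[i][j] for i in range(n))
            for j in range(m)
        )
        if not covered:
            continue
        cost = sum(tail_rate(r[i], choice[i]) for i in range(n))
        if cost < best_cost:
            best_cost, best_choice = cost, choice
    return best_choice, best_cost
```

On a toy instance with two patterns and three alignments, neither pattern alone is the cheapest cover: combining a moderate threshold on one pattern with a high threshold on the other gives a lower total rate than using either pattern at its lowest threshold.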
Our framework is quite general: Given any collection of
alignments and the sensitivity of a collection of seeds to the
alignments, one can use this IP formulation to choose
thresholds to hit all alignments while minimizing false
positives. In particular, one could require that a hit satisfy
multiple seeds simultaneously or use more complicated hit
formulations. Of course, for these harder models, one might
have a more difficult time optimizing the integer program.
3.2.2 NP-Hardness
We now show that the problem of optimizing the seed set to
minimize the false positive rate while hitting all alignments
is NP-hard and that it is Quasi-NP-hard to approximate to
within a logarithmic factor [19]. (That is, assuming NP does
not have deterministic algorithms running
in $O(n^{O(\log \log n)})$ time, no polynomial-time algorithm exists
with approximation ratio $o(\log n)$.)
Fig. 1. Flowchart contrasting BLASTP’s approach to heuristic sequence alignment to the one proposed here. The only difference is in the initial collection of hits. The smaller collection of hits found with the variations on seeds gives as many hits to true alignments that survive to the third stage as does BLASTP, yet far fewer noise hits must be filtered out.
We show this by giving an approximation-preserving
reduction of the Set-Cover problem to this problem. Since
Set-Cover is Quasi-NP-hard to approximate to within a
logarithmic factor [19], so is our problem.
An instance of Set-Cover is a ground set $S$ and a
collection $T = \{T_1, \ldots, T_m\}$ of subsets of $S$; the goal is the
smallest cardinality subset of $T$ whose union is $S$. The
connection to our problem is clear: We will produce one
alignment per ground set member and, for each of the
elements of $T$, we will have one seed. For simplicity, we will
assume that $S = \{1, \ldots, n\}$. To fill the construction out, we
will assign the vector seed
$$v_i = ((1, \overbrace{0, \ldots, 0}^{i}, 1), 1)$$
to every ground set element $s_i$. In a model of sequence
where all positions are independent of all others, each of
these seeds has the same false positive rate, so the false
positive rate will be proportional to the number of ground
set members chosen.
Then, for each set $T_j \in T$, we create an alignment $A_j$ of
length $2n^2 + 4n$ by pasting together $n$ blocks of length
$2n + 4$. If $i$ is in $T_j$, then we make the $i$th block of the
alignment have the first and $(i+2)$nd positions be of score 1,
while all other positions in the block have score zero;
if $i \notin T_j$, then the $i$th block is all score zero. Then, it is clear
that if we choose the seed $v_i$, we will hit all alignments $A_j$,
where $i \in T_j$. If we desire the minimum false positive rate to
hit all alignments, this is exactly equivalent to choosing the
minimum cardinality set to cover all of the $T_j$.
Thus, we have presented an approximation-preserving
transformation from Set-Cover to our problem and it is both
NP-hard and Quasi-NP-hard to approximate to within a
logarithmic factor.
3.2.3 Expansions of the Framework
In our experiments, we use the vector seed requirement as a
threshold; one could use a more complicated threshold
scheme to focus on hits that would be expanded to full
alignments. That is, our minimum threshold for $T_{i,j}$ could
be the highest-scoring hit of seed pattern $p_i$ in alignment $a_j$
that is expanded to a full alignment. We could also have a more
complicated way of seeding alignments and, still, as long as
we could compute false positive rates, we could require that
all alignments are hit and minimize false positive rates.
Also, we can limit the total number of vector seeds used
in the true solution (in other words, limit the number of
vectors with finite threshold). We do this by putting an
upper bound on $\sum_i x_{i,T}$ for the maximum threshold $T$. In
practice, one might want an upper bound of four or eight
seeds, as each chosen seed requires a method to identify hits
and one might not want to have to use too many such
methods in the goal of keeping fewer indexes of a protein
sequence database, for example.
Further, we might want to not allow seeds to be chosen
with very high threshold. The optimal solution to the
problem will have the thresholds on the seeds as high as
possible while still hitting each alignment. This allows
overtraining: Since even a tiny increase in the thresholds
would have caused a missed alignment, we may easily
expect that, in another set of alignments, there may be
alignments just barely missed by the chosen thresholds.
This is particularly possible if thresholds are allowed to get
extremely high and only useful for a single alignment. This
overtraining happened in some of our experiments, so we
lowered the maximum allowed threshold so that thresholds were either found in a
fairly narrow range (+13 to +25) or set to $\infty$ when a seed
was not used. As one way of also addressing overtraining,
we considered lowering the thresholds obtained from the IP
uniformly or just lowering the thresholds that have been set
to high values.
And, finally, the framework can be extended to allow a
specific number of alignments to be missed. For each
alignment, rather than requiring that
$$\sum_{i} x_{i,T_{i,j}} \ge 1,$$
which requires that some threshold be chosen so that the
alignment is hit, we can add a 0/1 slack variable $s_j$ to count
how many are missed, changing the constraint to
$$\sum_{i} x_{i,T_{i,j}} + s_j \ge 1.$$
Then, if we require that
$$\sum_{j} s_j \le M,$$
this allows at most $M$ alignments to be so missed. This may
be appropriate to allow the optimization framework to be
less sensitive to a small number of outliers. We show
experiments with this slightly expanded framework in the
next section.
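For reference, the relaxed program obtained by combining the slack constraints with the original formulation can be written in one place, using the same symbols as above:

```latex
\begin{align*}
\min \ & \sum_{i,T} x_{i,T}\, r_{i,T} \\
\text{such that } & \sum_{i} x_{i,T_{i,j}} + s_j \ge 1 && \text{for all alignments } a_j \\
& \sum_{j} s_j \le M \\
& x_{i,T} \ge x_{i,T-1} && \text{for all thresholds } T > T_m \\
& x_{i,T},\ s_j \in \{0, 1\} && \text{for all } i,\ j, \text{ and } T.
\end{align*}
```

Setting $M = 0$ forces every $s_j$ to zero and recovers the original program.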
We note one simplification of our formulation: False hit
rates are not additive. Given two spaced seeds, a hit to one
may coincide with a hit to the other, so the background rate
of false positives is lower than estimated by the program.
When we give such background rates later, we will
distinguish those found by the IP from the true values.
3.2.4 Solving the IP and Heuristics
To solve this integer program or its variations is not
necessarily straightforward since the problem is NP-hard.
In our experiments, we used sets of approximately 400 alignments
and the IP could be solved directly and quickly using
the CPLEX 9.0 integer programming solver.
Straightforward heuristics also work well for the
problem, such as solving the LP relaxation and rounding
to 1 all variables with values close to 1, until all alignments
are hit, or setting all variables with fractional LP solutions to
1 and then raising thresholds on seeds until we start to miss
alignments.
We finally note that a simple greedy heuristic works well
for the problem, as well: Start with low thresholds for all
seed patterns and repeatedly increase the threshold whose
increase most reduces the false positive rate until no such
increase can be made without missing an alignment. This
simple heuristic performed essentially comparably to the
integer program in our experiments, but, since the IP solved
quickly, we present its results.
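The greedy heuristic just described can be sketched as follows. The details left open in the text are filled in by assumption: thresholds move one allowed level at a time, the gain of a move is the background probability mass it removes, and ties go to the first pattern. The function name and input layout are hypothetical.

```python
import math


def greedy_thresholds(hit_scores, r):
    """Start every pattern at its lowest threshold and repeatedly raise the
    threshold whose one-step increase removes the most false positive mass,
    as long as every alignment is still hit.

    hit_scores[i][j]: best hit score of pattern i in alignment j
    r[i]:             dict mapping score T to the probability of exactly T
    """
    n, m = len(hit_scores), len(hit_scores[0])
    levels = [sorted(r_i) + [math.inf] for r_i in r]  # inf means unused
    pos = [0] * n  # current level index for each pattern

    def covered(p):
        return all(
            any(levels[i][p[i]] <= hit_scores[i][j] for i in range(n))
            for j in range(m)
        )

    if not covered(pos):
        raise ValueError("even the lowest thresholds miss an alignment")
    while True:
        best_gain, best_i = None, None
        for i in range(n):
            if pos[i] + 1 >= len(levels[i]):
                continue  # already unused
            trial = pos[:i] + [pos[i] + 1] + pos[i + 1:]
            if not covered(trial):
                continue
            gain = r[i][levels[i][pos[i]]]  # mass at the level being dropped
            if best_gain is None or gain > best_gain:
                best_gain, best_i = gain, i
        if best_i is None:
            break
        pos[best_i] += 1
    return tuple(levels[i][pos[i]] for i in range(n))
```

On small instances the result can be compared against an exhaustive search; unlike the LP relaxation, the greedy procedure gives no bound on how far it is from optimal.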
One other advantage to the IP formulation is that the
false-positive rate from the LP relaxation is a lower bound
on what can possibly be achieved; the simple greedy
heuristic offers no such lower bound.
4 EXPERIMENTAL RESULTS
Here, we present the results of experiments with our
multiple seed selection framework in the context of protein
alignments. Our goal is to identify collections of seed
models which together have extremely high sensitivity to
even moderately strong alignments, while admitting a very
low false positive rate.
Since we pick seeds with a relatively small number of
alignments, we run the serious risk of overtraining. In
particular, the requirement that our set of seeds has
100 percent sensitivity on the training data need not imply
that it also has comparable sensitivity overall. In one
example, the particular choice of training examples was
apparently quite unrepresentative since a 100 percent
sensitivity to this set of alignments still gave only 96 percent
sensitivity on a testing set. (Or, presumably, the testing set
may be unrepresentative.) As a simple way of exploring this,
we examined what happened when we lowered the thresh-
old on some seeds that were chosen by the integer program
to modestly increase their false positive rates and sensitivity
in the hope of still keeping very high sensitivity.
We first present simple experiments with vector seeds
and with ungapped alignment seeds on a small sample of
alignments discovered with BLASTP; in this section, we
also allow for seed sets that miss a small number of the
training alignments.
Then, we explore how well these seed sets do in hitting
alignments that we did not use BLASTP to identify. Here,
we note that our vector seed sets do not appear to do as well
as BLASTP for sensitivity to alignments in general, but they
do hit more alignments with high-scoring short regions;
presumably, these alignments are more likely true.
4.1 Preliminary Experiments
We begin by exploring several sets of alignments generated
using BLASTP. Our target score range for our alignments is
BLASTP score between +40 and +60 (BLOSUM score +112
to +168). These moderate-scoring alignments can happen by
chance, but also are often true. Alignments below this
threshold are much more likely to be errors, while, in a
database of proteins we used, such alignments are likely to
happen to a random sequence by chance only one time in
10,000, according to BLASTP’s statistics.
We begin by identifying a set of BLASTP alignments in
this score range. To avoid overrepresenting certain families
of alignments in our test set, we did an all-versus-all
comparison of 8,654 human proteins from the SWISS-PROT
database [20]. (We note that this is the same set of proteins
and alignments we used in our previous vector seed work
[3]. We have used this test set in part to confirm our belief
that, while a single seed may not help much, in comparison
to BLASTP, many seeds will be of assistance.) We then
divided the proteins into families so that all alignments
with BLASTP score greater than 100 are between two
sequences in the same family and there are as many families
as possible. We then chose 10 sets of alignments in our
target score range such that, in each set of alignments, a
particular family will only contribute at most eight
alignments to that set. Note that, since our threshold for
sharing family membership is a BLASTP score greater than
100 and the alignments we are seeking score between +40
and +60, many chosen alignments will be between members
of different families. We divided the sets of alignments into
five training sets and five testing sets. It is possible that the
same alignments will occur in a training and testing set as
we did not take any efforts to avoid this, though the set of
possible alignments is large enough to make this a rare
occurrence.
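The family construction above amounts to taking connected components of the graph that links two proteins whenever they share an alignment scoring above the cutoff. A minimal union-find sketch follows; the function name and the input format (a list of `(protein_a, protein_b, score)` triples) are assumptions made for illustration.

```python
def protein_families(proteins, scored_pairs, cutoff=100):
    """Group proteins so that any pair with score > cutoff shares a family,
    using as many families as possible (connected components)."""
    parent = {p: p for p in proteins}

    def find(p):
        # Follow parent links to the representative, compressing the path.
        while parent[p] != p:
            parent[p] = parent[parent[p]]
            p = parent[p]
        return p

    for a, b, score in scored_pairs:
        if score > cutoff:
            parent[find(a)] = find(b)

    families = {}
    for p in proteins:
        families.setdefault(find(p), set()).add(p)
    return list(families.values())
```

Pairs scoring at or below the cutoff never merge families, which is why alignments in the +40 to +60 range frequently connect members of different families.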
We note that we are using this somewhat complicated
system specifically because we want to avoid imposing a
preexisting bias on the set of alignments: Many true yet
moderate-scoring alignments will be between proteins with
different function or from different biological families. For the
same reason, we have used alignments from dynamic
programming as our standard, rather than structural align-
ments of known proteins or curated alignments because our
goal is to improve the quality of heuristic alignments.
Certainly, many of the alignments we consider will not be
precise; still, a heuristic dynamic programming-based align-
ment that finds a hit between two proteins and then uses the
same scoring matrix as BLASTP will find the exact same,
potentially inaccurate, alignment as did BLASTP.
4.1.1 Multiple Vector Seeds
We then considered the set of all 35 vector patterns of length
at most 7 that include three or four 1s (the support of the
seed). We used this collection of vector patterns as we have
seen no evidence that nonbinary seed vectors are preferable
to binary ones for proteins and because it is more difficult to
find hits to seeds with higher support than four due to the
high number of needed hash table keys.
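The count of 35 can be reproduced by direct enumeration, assuming (as the count suggests) that a pattern's length is its span, so that every pattern begins and ends with a 1. The function name below is hypothetical.

```python
from itertools import combinations


def binary_patterns(max_length=7, supports=(3, 4)):
    """List 0/1 seed patterns of span at most max_length whose first and
    last entries are 1 and whose number of 1s is one of `supports`."""
    patterns = []
    for length in range(2, max_length + 1):
        interior = range(1, length - 1)
        for support in supports:
            # Two of the 1s are fixed at the ends; place the rest inside.
            for ones in combinations(interior, support - 2):
                chosen = {0, length - 1, *ones}
                patterns.append(
                    tuple(1 if k in chosen else 0 for k in range(length))
                )
    return patterns


print(len(binary_patterns()))  # 35
```

Of these, 15 patterns have support three and 20 have support four.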
We computed the optimal set of thresholds for these
vector seeds such that every alignment in a training set has
a hit to at least one of the seeds, while minimizing the
background rate of hits to the seeds and only using at most
10 vector patterns. Then, we examined the sensitivity of the
chosen seeds for a training set to its corresponding test set.
The results are found in Table 1. Some seed sets chosen
showed signs of overtraining, but others were quite
successful, where the chosen seeds work well for their
testing set as well and have low false positive rate.
We took the best seed set with near 100 percent
sensitivity for both its training and testing data, which
was the third of our experimental sets, and used it in further
experiments. This seed set is shown in Table 2. We note that
this seed set has five times lower false positive rate
(1/8,000) than does BLASTP, while still hitting all of its
testing alignments but four (which is not statistically
significantly different from zero). We also considered a set of thresholds
where we lowered the higher thresholds slightly to
allow more hits and possibly avoid overtraining on the
initial set of alignments. These altered thresholds are shown
as well in Table 2 and give a total false positive rate of
1/6,900. (This set of thresholds also hits all 402 test
alignments for that instance.)
4.1.2 A Weaker Requirement on the Sensitivity
As noted previously, we can alter our integer program so
that it does not require 100 percent sensitivity on the
training data set. We performed experiments on this
formulation, using five subsets of the training alignments
chosen as before, where we allowed between zero and five
alignments from the training set to be missed by the seed
set. We show results in Table 3, using again a randomly
chosen testing set for each training set. The training data
sets varied in size from 304 to 415, while the testing sets
ranged from 392 to 407 in size.
Unsurprisingly, if we did not hit all alignments in the
training set, we often miss alignments in the testing set as
well. However, the ranges of the sensitivities we saw in
testing data for the seed sets picked allowing some misses
in the training data were much less wide, suggesting that
there may be fewer seed thresholds lowered merely to
accommodate a single outlier in the training data. As such,
if slightly lower sensitivity is acceptable, this approach may
give much more predictable results than training to require
all alignments to be hit.
4.1.3 Multiple Ungapped Alignment Seeds
Ungapped alignment seeds can be seen as breaking the
model we have for alignment speed. The most straightfor-
ward implementation of ungapped alignment seeds would
involve a hash table keyed on the letters corresponding to
the positions in the bounds vector b, where there is a
nontrivial lower bound on the score of a position. Even
after the first step, where we identify pairs of positions
satisfying the minimum bounds scores, we still need
another test to verify that a pair of positions satisfies the
requirement of the dot product of the local alignment score
with the vector v of positional multipliers being higher than
the threshold. Similar limitations affect any such two-phase
seed, such as requiring that two hypothetically aligned
positions satisfy two vector seeds at once.
If we assume, however, that testing a hit to the simple
hash-table to verify whether the dot product of the local alignment
score with the vector of multipliers v exceeds
the threshold T can be done so rapidly that we can throw out misses
without having to count them, then we return to the case
from before, where we need count only the fraction of
positions expected to pass both levels of filtration. This
assumption may be appropriate, assuming that the small
amount of time taken to throw out a hash-table hit that does
not satisfy the dot product threshold is much, much smaller
than the amount of time needed to throw out a hit to the
whole ungapped alignment seed that still does not make a
good local alignment.
TABLE 1 Hit Rates for Optimal Seed Sets for Various Sets of Training Alignments when Applied to an Unrelated Test Set
TABLE 2 Seeds and Thresholds Chosen by Integer Programming for 409 Test Alignments
TABLE 3 Weakening Sensitivity to Testing Alignment Reduces Sensitivity on Training Alignments
With this in mind, we tested our set of moderate
alignments on a simple collection of ungapped alignment
seed patterns to identify whether ungapped alignment seeds
form a potentially superior seed filtering approach to vector
seeds. Of course, since they include vector seeds as a special
case, this is trivial, but our interest is primarily whether the
advantage of ungapped alignments is large enough to merit
their consideration over that of vector seeds.
In our experiments, we used ungapped alignment seeds
where the vector of score lower bounds consisted of only
the values 0 and $-\infty$ (which results in no score restriction);
we also allowed the vector of pairwise multipliers to only
be the all-ones vector. This simple approach, which was
used independently in the multiple aligner of Brown and
Hudek [15] and in the tPatternHunter protein aligner [7],
simply requires a good local region, with certain specified
positions having positive score. We required that the
bounds vector have at most four active positions and
considered seed lengths between three and six. Note that, in
this model, the bounds vector $(0, 0, 0, -\infty)$ behaves quite
differently than the bounds vector $(0, 0, 0)$ because we will
be adding pairwise scores of four positions in the former
case and three in the latter.
The results of our experiment are shown in Table 4. We
used the same testing and training data sets as for Table 3.
In general, these results are slightly worse than the results
of our original experiments with vector seeds when we
require 100 percent sensitivity to testing data, but improve
when we allow some misses in the training data. False
positive rates on the order of 1/10,000 are common
with testing sensitivity of approximately 99 percent, as
before; again, the corresponding false positive rate for
BLASTP’s seed is approximately 1/1,600.
A positive note to the ungapped alignment seeds is that
there seems to be less overtraining: As the training
sensitivity is allowed to go down slightly, the testing
sensitivity does not plummet as quickly as for vector seeds.
One reason for this is that an ungapped alignment seed,
both times they have been implemented [7], [15], still
requires high-scoring short local alignment around the
seed. As we show in the next section, focusing on very
narrow alignments in seeding may be inappropriate and
one should instead focus on longer windows around a hit
before discarding it with a filter.
4.2 A Broader Set of Alignments
Returning to our set of vector seeds from Table 2, we then
considered a larger set of alignments in our target range of
good, but not great scores to verify if the advantage of
multiple seeds still holds. We used the Smith-Waterman
algorithm to compute all alignments between pairs of a
1,000-sequence subset of our protein data set and computed
how many of them were not found by BLASTP. Only 970
out of 2,950 Smith-Waterman alignments with BLOSUM62
score between +112 and +168 had been identified by
BLASTP, even though alignments in this score range would
have happened by chance only one time in 10,000 according
to BLASTP’s statistics.
Almost all of these 2,950 alignments, 2,942, had a hit to
the BLASTP default seed. Despite this, however, only 970
actually built a successful BLASTP alignment. Our set of
eight seeds had hits to 1,939 of the 1,980 that did not build a
BLASTP alignment and to 955 of the 970 that did build a
BLASTP alignment, so, at first glance, the situation does not
look good. However, the difference between having a hit
and having a hit in a good region of the alignment is where
we are able to show substantial improvement.
The discrepancy between hits and alignments comes
because the BLASTP seed can have a hit in a bad part of the
alignment, which is filtered out. Typically, such hits occur
in a region where the source of positive score is quite short,
which is much more likely with an unspaced seed than with
a spaced seed. We looked at all of the regions of length
10 amino acids of alignments that included a hit to a seed
(either the BLASTP seed or one of the multiple seeds), and
assigned the best score of such a region to that alignment; if
no ungapped region of length 10 surrounded a hit, we
assumed it would certainly be filtered out. The data are
shown in Table 5 and show that of the alignments hit by the
spaced seeds, they are hit in regions that are essentially
identical in conservation to where the BLASTP seed hits
them. For example, 47.7 percent of the alignments contain a
10-amino acid region around a hit to the ((1, 1, 1), 13) seed
with BLOSUM score at least +30, while 46.7 percent contain
such a region surrounding a hit to one of the multiple seeds
with higher threshold. If we use the lower thresholds that
allow slightly more false positives, their performance is
actually slightly better than BLASTP’s.
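The window-scoring step used in this comparison can be sketched for a single hit as follows. The function name and arguments are hypothetical, and we assume that a qualifying window must lie within one ungapped segment and contain the whole seed hit; the score assigned to an alignment would then be the maximum of this value over all of its hits.

```python
def best_window_score(segment, hit_start, seed_length, window=10):
    """Best total score of a length-`window` stretch of an ungapped segment
    that contains the whole seed hit, or None if no such stretch exists.

    segment:     positional pairwise scores of one ungapped segment
    hit_start:   index in the segment where the seed hit begins
    seed_length: number of positions the seed spans
    """
    if len(segment) < window or seed_length > window:
        return None  # treated as a hit that would be filtered out
    hit_end = hit_start + seed_length  # exclusive end of the hit
    first = max(0, hit_end - window)
    last = min(hit_start, len(segment) - window)
    if first > last:
        return None
    return max(sum(segment[s:s + window]) for s in range(first, last + 1))
```

A hit in a segment shorter than ten positions returns `None`, matching the assumption that such hits are certainly filtered out.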
Table 5 also shows that the higher-threshold seed ((1, 1, 1),
15), which has a worse false positive rate (1/5,700) than our
ensembles of seeds, performs substantially worse: Namely,
only 64 percent of the alignments have a hit to the single seed
found in a region with local score above +25, while 73 percent
of the alignments have a hit to one of the multiple seeds with
this property. This single seed strategy is clearly worse than
the multiple seed strategy of comparable false positive rate
and the optimized seeds perform comparably to BLASTP in
identifying the alignments that actually have a core conserved region.
TABLE 4. Ungapped Alignment Seeds Offer Similar Performance to Vector Seeds
Our experiments show that multiple seed models can have
an impact on local alignment of protein sequences. Using
many spaced seeds, which we picked by optimizing an
integer program, we find seed models with a chance of
finding a good hit in a moderate-scoring alignment
comparable to that of the BLASTP seed, with four to five times fewer
noise hits. The difficulty with the BLASTP seed is that it not
only has more junk hits and more hits in overlapping places, it
also has more hits in short regions of true alignments, which
are likely to be filtered and thrown out.
5 CONCLUSIONS
We have given a theoretical framework to the problem of
using spaced seeds for protein homology search detection.
Our result shows that using multiple vector or ungapped
alignment seeds can give sensitivity to good parts of local
protein alignments essentially comparable to BLASTP,
while reducing the false positive rate of the search
algorithm by a factor of four to five.
Our set of vector seeds is chosen by optimizing an
integer programming framework for choosing multiple
seeds when we want 100 percent sensitivity to a collection
of training alignments. The framework is general enough to
accommodate many extensions, such as requiring a fixed
amount of sensitivity on the training (not only 100 percent),
allowing only a small number of seeds to be chosen or
allowing for many different sorts of seeding strategies. We
have mostly used it to optimize sets of vector seeds because
they encapsulate an approach to homology search for
nucleotides that has been very successful.
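For orientation, a set-cover style integer program of the kind summarized in this paragraph can be written as follows. This is a hedged sketch rather than a restatement of the paper's exact formulation: the symbols $x_i$ (a binary choice variable for candidate seed $Q_i$), $p_i$ (that seed's false positive rate), $m$ (the number of candidate seeds), and $a_j$ (a training alignment) are notation assumed here for illustration.

```latex
\begin{align*}
\min \quad & \sum_{i=1}^{m} p_i \, x_i \\
\text{subject to} \quad
  & \sum_{i \,:\, Q_i \text{ hits } a_j} x_i \;\ge\; 1
    && \text{for every training alignment } a_j, \\
  & x_i \in \{0, 1\} && \text{for } i = 1, \dots, m.
\end{align*}
```

Under these assumptions, the extensions mentioned in the text correspond to relaxing the covering constraints (less than 100 percent sensitivity) or adding a cardinality bound such as $\sum_i x_i \le c$ for a hypothetical limit $c$ on the number of seeds.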
One difficulty with our approach is that it relies on a
theoretical estimate of the runtime of a homology search
program: namely, that the program will take time propor-
tional to the number of false positives found by the seeding
method. As seeding methods become more complex, such
as the two-step ungapped alignment seeds, it may become
harder to identify what a “false positive” is. In particular, if
a false positive passes one step of a filter but is quickly
discarded before the next step, should it count toward the
estimated runtime?
Using our framework, we identified a
set of seeds for moderate-scoring protein alignments whose
total false positive rate in random sequence is four to five
times lower than that of the default BLASTP seed. This set of seeds
had hits to slightly fewer alignments in a test set of
moderate-scoring alignments found by the Smith-Waterman
algorithm than did BLASTP; however, the
BLASTP seeds hit subregions of these alignments that were
actually slightly worse than those hit by the spaced seeds. Hence,
given the filtering used by BLASTP, we expect that the two
alignment strategies would give comparable sensitivity,
while the spaced seeds give four times fewer false hits.
ACKNOWLEDGMENTS
The author would like to thank Ming Li for introducing him
to the idea of spaced seeds. This work is supported by the
Natural Science and Engineering Research Council of
Canada and by the Human Frontier Science Program. A
preliminary version of this paper [21] appeared at the
Workshop on Algorithms in Bioinformatics, held in Bergen,
Norway, in September, 2004.
REFERENCES
[1] S.F. Altschul, W. Gish, W. Miller, E.W. Myers, and D.J. Lipman, “Basic Local Alignment Search Tool,” J. Molecular Biology, vol. 215, no. 3, pp. 403-410, 1990.
[2] B. Ma, J. Tromp, and M. Li, “PatternHunter: Faster and More Sensitive Homology Search,” Bioinformatics, vol. 18, no. 3, pp. 440-445, Mar. 2002.
[3] B. Brejova, D. Brown, and T. Vinar, “Vector Seeds: An Extension to Spaced Seeds Allows Substantial Improvements in Sensitivity and Specificity,” Proc. Third Ann. Workshop Algorithms in Bioinformatics, pp. 39-54, 2003.
[4] M. Li, B. Ma, D. Kisman, and J. Tromp, “Patternhunter II: Highly Sensitive and Fast Homology Search,” J. Bioinformatics and Computational Biology, vol. 2, no. 3, pp. 419-439, 2004.
[5] J. Xu, D. Brown, M. Li, and B. Ma, “Optimizing Multiple Spaced Seeds for Homology Search,” Proc. 15th Ann. Symp. Combinatorial Pattern Matching, pp. 47-58, 2004.
[6] Y. Sun and J. Buhler, “Designing Multiple Simultaneous Seeds for DNA Similarity Search,” Proc. Eighth Ann. Int’l Conf. Computational Biology, pp. 76-84, 2004.
[7] D. Kisman, M. Li, B. Ma, and L. Wang, “TPatternHunter: Gapped, Fast and Sensitive Translated Homology Search,” Bioinformatics, 2004.
TABLE 5. Hits in Locally Good Regions of Alignments
[8] T. Smith and M. Waterman, “Identification of Common Molecular Subsequences,” J. Molecular Biology, vol. 147, pp. 195-197, 1981.
[9] B. Brejova, D. Brown, and T. Vinar, “Vector Seeds: An Extension to Spaced Seeds,” J. Computer and System Sciences, 2005, pending publication.
[10] J. Buhler, U. Keich, and Y. Sun, “Designing Seeds for Similarity Search in Genomic DNA,” Proc. Seventh Ann. Int’l Conf. Computational Biology, pp. 67-75, 2003.
[11] B. Brejova, D. Brown, and T. Vinar, “Optimal Spaced Seeds for Homologous Coding Regions,” J. Bioinformatics and Computational Biology, vol. 1, pp. 595-610, Jan. 2004.
[12] U. Keich, M. Li, B. Ma, and J. Tromp, “On Spaced Seeds for Similarity Search,” Discrete Applied Math., vol. 138, pp. 253-263, 2004.
[13] K.P. Choi, F. Zeng, and L. Zhang, “Good Spaced Seeds for Homology Search,” Bioinformatics, vol. 20, no. 7, pp. 1053-1059, 2004.
[14] G. Kucherov, L. Noe, and Y. Ponty, “Estimating Seed Sensitivity on Homogeneous Alignments,” Proc. Fourth IEEE Int’l Symp. BioInformatics and BioEng., pp. 387-394, 2004.
[15] D. Brown and A. Hudek, “New Algorithms for Multiple DNA Sequence Alignment,” Proc. Fourth Ann. Workshop Algorithms in Bioinformatics, pp. 314-326, 2004.
[16] M. Csuros, “Performing Local Similarity Searches with Variable Length Seeds,” Proc. 15th Ann. Symp. Combinatorial Pattern Matching, pp. 373-387, 2004.
[17] K. Choi and L. Zhang, “Sensitive Analysis and Efficient Method for Identifying Optimal Spaced Seeds,” J. Computer and System Sciences, vol. 68, pp. 22-40, 2004.
[18] G. Kucherov, L. Noe, and Y. Ponty, “Multiseed Lossless Filtration,” Proc. 15th Ann. Symp. Combinatorial Pattern Matching, pp. 297-310, 2004.
[19] U. Feige, “A Threshold of ln n for Approximating Set Cover,” J. ACM, vol. 45, pp. 634-652, 1998.
[20] A. Bairoch and R. Apweiler, “The SWISS-PROT Protein Sequence Database and Its Supplement TrEMBL in 2000,” Nucleic Acids Research, vol. 28, no. 1, pp. 45-48, 2000.
[21] D. Brown, “Multiple Vector Seeds for Protein Alignment,” Proc. Fourth Ann. Workshop Algorithms in Bioinformatics, pp. 170-181, 2004.
Daniel G. Brown received the undergraduate degree in mathematics with computer science from the Massachusetts Institute of Technology in 1995 and the PhD degree in computer science from Cornell University in 2000. He then spent a year as a research scientist at the Whitehead Institute/MIT Center for Genome Research in Cambridge, Massachusetts, working on the Human and Mouse Genome Projects. Since 2001, he has been an assistant professor in the School of Computer Science at the University of Waterloo.
Editorial: State of the Transaction
Dan Gusfield
IT is a pleasure to write this editorial at the beginning of the second year of the publication of the IEEE/ACM Transactions on Computational Biology and Bioinformatics (TCBB). The last year saw the publication of four issues of TCBB, the first of which was mailed out roughly nine months after our initial call for submissions. That accomplishment was the result of tremendous cooperation and hard work on the part of authors, reviewers, associate editors, and staff. I would like to thank everyone for making that possible.
During the past year, we received roughly 205 submissions and, presently, we have about 50 of those under review. In our first year, we published 16 papers, including Part I of a special section on The Best Papers from WABI (Workshop on Algorithms in Bioinformatics). Part II will appear this year, along with a special issue on Machine Learning in Computational Biology and Bioinformatics. Other special issues are also in the planning stages. The papers that we have published are establishing TCBB as a venue for the highest quality research in a broad range of topics in computational biology and bioinformatics. I know that some of the papers we have already published will be cited as the foundational or the definitive papers in several subareas of the field.
A goal for the future is to attract more submissions from the biology community and this will be facilitated when TCBB is indexed in MEDLINE, which requires two years of publication before it will consider indexing a journal. So, this second year of publication will hopefully lead to the inclusion of TCBB in MEDLINE.
Finally, I would like to share some wonderful news we received in February. The Association of American Publishers, Professional and Scholarly Publishing Division awarded TCBB their “Honorable Mention” award for The Best New Journal in any category for the year 2004. Only one Honorable Mention is awarded. Again, the credit for this accomplishment goes to all the authors, reviewers, associate editors, and staff who have worked so hard to establish TCBB in this last year. I look forward to continued growth and success of TCBB in our second year of publication.
Dan Gusfield
Editor-in-Chief
Bases of Motifs for Generating Repeated Patterns with Wild Cards
Nadia Pisanti, Maxime Crochemore, Roberto Grossi, and Marie-France Sagot
Abstract—Motif inference represents one of the most important areas of research in computational biology, and one of its oldest ones.
Despite this, the problem remains very much open in the sense that no existing definition is fully satisfying, either in formal terms, or in
relation to the biological questions that involve finding such motifs. Two main types of motifs have been considered in the literature:
matrices (of letter frequency per position in the motif) and patterns. There is no conclusive evidence in favor of either, and recent work
has attempted to integrate the two types into a single model. In this paper, we address the formal issue in relation to motifs as patterns.
This is essential to get at a better understanding of motifs in general. In particular, we consider a promising idea that was recently
proposed, which attempted to avoid the combinatorial explosion in the number of motifs by means of a generator set for the motifs.
Instead of exhibiting a complete list of motifs satisfying some input constraints, what is produced is a basis of such motifs from which all
the other ones can be generated. We study the computational cost of determining such a basis of repeated motifs with wild cards in a
sequence. We give new upper and lower bounds on such a cost, introducing a notion of basis that is provably contained in (and, thus,
smaller than) previously defined ones. Our basis can be computed in less time and space, and is still able to generate the same set of
motifs. We also prove that the number of motifs in all bases defined so far grows exponentially with the quorum, that is, with the
minimal number of times a motif must appear in a sequence, something unnoticed in previous work. We show that there is no hope of
efficiently computing such bases unless the quorum is fixed.
Index Terms—Motifs basis, repeated motifs.
1 INTRODUCTION
IDENTIFYING motifs in biological sequences is one of the oldest fields in computational biology. Yet, it remains also very much an open problem in the sense that no currently existing definition of a “motif” is fully satisfying for the purposes of accurately and sensitively identifying the biological features that such motifs are supposed to represent. Among the most difficult to model are binding sites, as they are often quite degenerate. Indeed, variability may be considered part of their function. Such variability translates itself into changes in the motif, mostly substitutions, that do not affect the biological function. Two main schools of thought on how to define motifs in biology have coexisted for years, each valid in its own way. The first works with a statistical representation of motifs, usually given in the form of what is called in the literature a PSSM (“Position Specific Scoring Matrix” [9], [11], [13], [12], or a profile, which is one type of PSSM). Interesting PSSMs are those that have a high information value (measured, for instance, by the relative entropy of the corresponding matrix). The second school defines a motif as a consensus [4], [24]. A motif is therefore a pattern that appears repeatedly, in general, approximately, that is, up to a certain number of differences (most often substitutions only) in a sequence or set of sequences of interest.
It is generally accepted that PSSMs are more appropriate for modeling an already known (in the sense of well-characterized) biological feature for the purpose of then identifying other occurrences of the feature, even though the false positive rate of this further identification remains very high. Identifying the PSSM itself ab initio is still, however, a difficult problem, particularly for large data sets or when the amount of noise may be high. The methods used are also heuristics that offer no guarantee, leaving an uncertainty as to whether motifs that are statistically as meaningful as those reported have been missed.
On the other hand, formulating the problem of identifying approximate motifs as patterns enables one to address the motif identification problem in an exhaustive fashion, even though the algorithmic complexity of the problem remains relatively high, and the model may appear more limited than PSSMs. Because of the lower algorithmic complexity of identifying repeated patterns, the model may, however, be made more complex and biologically pertinent in other ways. One could think of introducing motifs composed of various different submotifs separated by variable-length distances that may then also be found in a relatively efficient way [14]. Motifs presenting such a high level of combinatorial complexity are indeed frequent, particularly in eukaryotes. Exhaustively searching for approximately repeated patterns may, however, have the drawback of producing many “solutions,” that is, many motifs. In fact, the number of motifs identified with this model may be so high (e.g., exponential in the size of the input) that it is as impossible to manage as the initial input sequence(s), even though they provide a first way of structuring such input. Yet, it appeared clear also to any computational biologist working with motifs as patterns that there was further structure to be extracted from the set of motifs found, even when such a set is huge. Furthermore, such a structure could reflect some additional biological information, thus providing additional motivation for inferring it. Doing this is generally addressed by means of clustering, or even by attempting to bring together the two types of motif models (PSSMs and patterns). Indeed, recently researchers have been using pattern detection as a first filter-flavored step toward inferring PSSMs from biological sequences [6]. This seems very promising although much work remains to be done to precisely determine the relation between the two types of models, and to fully explore the biological implications this may have.
• N. Pisanti and R. Grossi are with the Dipartimento di Informatica, Università di Pisa, Italy. E-mail: {pisanti, grossi}@di.unipi.it.
• M. Crochemore is with the Institut Gaspard-Monge, University of Marne-la-Vallée, France and King’s College London. E-mail: [email protected].
• M.-F. Sagot is with INRIA Rhône-Alpes, Laboratoire de Biométrie et Biologie Évolutive, Université Claude Bernard Lyon 1, France and King’s College London. E-mail: [email protected].
Manuscript received 14 Mar. 2004; revised 2 Dec. 2004; accepted 16 Feb. 2005; published online 30 Mar. 2005. For information on obtaining reprints of this article, please send e-mail to: [email protected], and reference IEEECS Log Number TCBB-0036-0304.
Again, each of the two above approaches is valid, but the question remained open whether or not the inner structure of a set of motifs could be expressed in a manner that would be more satisfying from both the mathematical and the biological points of view. Then, in 2000, a paper by Parida et al. [17] seemed to present a way of extracting such an inner structure in a very elegant and powerful way for a particular type of motif. The power of their proposal resided in the fact that the above mentioned structure corresponded to a well-known and precisely defined mathematical object and, moreover, guaranteed that no solution would be lost. Exhaustiveness in relation to the chosen type of motif is also preserved, thus enabling a biologist to draw some conclusions even in the face of negative answers (i.e., when no motifs, or no a priori “expected” motifs are found in a given input), something which PSSM-detecting methods do not allow. The structure is that of a basis of motifs. Informally speaking, it is a subset of all the motifs satisfying some input parameters (related, for instance, to which differences between a pattern and its occurrences are allowed) from which it is possible to recover all the other motifs, in the sense that all motifs not in the basis are a combination of some (in general, a few only) motifs in the basis. Such a combination is modeled by simple rules to systematically generate the other motifs with an output sensitive cost [18]. A basis would therefore also provide a way of characterizing the input, which then might be used to compare different inputs without resorting to the traditional alignment methods with all the pitfalls they present. The idea of a basis would fulfill such expectations if its size could be proven to be small enough. The argument of [17] seemed to be that, for the type of motifs considered, a compact enough basis could always be found.
The motifs considered in [17] were patterns with wild card symbols occurring in a given sequence $s$ of $n$ symbols drawn over an alphabet $\Sigma$. A wild card symbol is a special symbol “$\circ$” matching any other element.$^1$ For example, the pattern $\mathtt{T}\circ\mathtt{G}$ matches both $\mathtt{TTG}$ and $\mathtt{TGG}$ inside $s = \mathtt{TTGG}$. Parida et al. focused on patterns which appear at least $q$ times in $s$ for an input parameter $q \ge 2$, called the quorum. This may, at first sight, seem an even more restrictive type of motif than patterns in general. It, however, has the merit of capturing one aspect of biological features that current PSSMs in general ignore, or address only in an indirect way. This aspect often concerns isolated positions inside a motif that are not part of the biological feature being captured. This is the case, for instance, with some binding sites, particularly at the protein level. Studying patterns with wild cards has a further very important motivation in biology, even when no differences (such as substitutions) are allowed. Indeed, motifs such as these or closely related ones can be used as seeds for finding long repeats and for aligning, pairwise or multiple-wise, a set of sequences or even whole genomes [15], [23].
The basis introduced by Parida et al. had interesting features, but presented some unsatisfying properties. In particular, as we show in this paper, there is an infinite family of strings for which the authors’ basis contains $\Omega(n^2)$ motifs for $q = 2$. This contradicts the upper bound of $3n$ for any $q \ge 2$ given in [17]. As a result, the algorithm taking $O(n^3 \log n)$ time, mentioned in [17], for finding the basis of motifs does not hold since it relies on the upper bound of $3n$, thus leaving open the problem of efficiently discovering a basis. A refinement of the definition of basis and an incremental construction in $O(n^3)$ time has recently been described by Apostolico and Parida [2]. A comparative survey of several notions of bases can be found in [22].
Closely following previous work, here we introduce a new definition of basis. The condition for the new basis is stronger than that of [17] and, hence, our basis is included in that of [17] (and is thus smaller) while both are able to generate the same set of motifs with mechanical rules. Our basis is moreover symmetric: Given a string $s$, the motifs in the basis for its reverse $\tilde{s}$ are the reversals of the motifs in the basis for $s$. Moreover, the number of motifs in our basis can provably be upper bounded in the worst case by $n - 1$ for $q = 2$, and these motifs occur in $s$ a total of at most $2n$ times. However, we reveal an exponential dependency on $q$ for the number of motifs in all bases defined so far (i.e., including our basis, Parida’s and Pelfrene et al.’s [19]), something unnoticed in previous work. Consequently, no polynomial-time algorithm can exist for finding one of these bases with arbitrary values of $q \ge 2$.
2 NOTATION AND TERMINOLOGY
We consider strings that are finite sequences of letters drawn from an alphabet $\Sigma$, whose elements are also called solid characters. We introduce an additional symbol (denoted by $\circ$ and called wild card) that does not belong to $\Sigma$ and matches any letter; a wild card clearly matches itself. The length of a string $t$, denoted by $|t|$, is the number of letters and wild cards in $t$, and $t[i]$ indicates the letter or wild card at position $i$ in $t$ for $0 \le i \le |t| - 1$ (hence, $t = t[0]\,t[1] \cdots t[|t|-1]$, also noted $t[0..|t|-1]$).
Definition 1 (pattern). Given the alphabet $\Sigma$, a pattern is a string in $\Sigma \cup \Sigma(\Sigma \cup \{\circ\})^{*}\Sigma$ (that is, it starts and ends with a solid character).
The patterns are related by the following specificity relation $\preceq$.
1. In the literature on sequence analysis and pattern matching, the wildcard is often referred to as do not care (as it is in the literature on bases ofmotifs). Therefore, we will use this latter term when referring to thesequence analysis and string matching literature.
Definition 2 ($\preceq$). For individual characters $\sigma_1, \sigma_2 \in \Sigma \cup \{\circ\}$, we have $\sigma_1 \preceq \sigma_2$ if $\sigma_1 = \circ$ or $\sigma_1 = \sigma_2$. Relation $\preceq$ extends to strings in $(\Sigma \cup \{\circ\})^{*}$ under the convention that each string $t$ is implicitly surrounded by wild cards, namely, letter $t[j]$ is $\circ$ when $j \ge |t|$. Hence, $v$ is more specific than $u$ (written $u \preceq v$) if $u[j] \preceq v[j]$ for any integer $j$.
We can now formally define the occurrences of patterns $x$ in $s$ and their lists.
Definition 3 (occurrence, $L$). We say that $u$ occurs at position $\ell$ in $v$ if $u[j] \preceq v[j+\ell]$, for $0 \le j \le |u| - 1$ (equivalently, we say that $u$ matches $v[\ell..\ell+|u|-1]$). For the input string $s \in \Sigma^{*}$ with $n = |s|$, we consider the location list $L_x \subseteq \{0..n-1\}$ as the set of all the positions on $s$ at which $x$ occurs.
When a pattern $u$ occurs in another pattern (or in a string) $v$, we also say that $v$ contains $u$. For example, the location list of $x = \mathtt{T}\circ\mathtt{G}$ in $s = \mathtt{TTGG}$ is $L_x = \{0, 1\}$, hence $s$ contains $x$.
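Definitions 2 and 3 translate directly into a few lines of Python. The sketch below is illustrative only: it assumes single-character symbols, uses `.` as a stand-in for the wild card, and the function names are hypothetical.

```python
WILD = "."  # stand-in for the wild card symbol (an assumption of this sketch)


def at_most_as_specific(a: str, b: str) -> bool:
    """Character-level specificity relation of Definition 2: a is a wild card or a equals b."""
    return a == WILD or a == b


def location_list(x: str, s: str) -> list:
    """All positions l of s at which pattern x occurs (Definition 3)."""
    return [
        l
        for l in range(len(s) - len(x) + 1)
        if all(at_most_as_specific(x[j], s[l + j]) for j in range(len(x)))
    ]


print(location_list("T.G", "TTGG"))  # [0, 1]
```

This brute-force scan takes time proportional to the product of the pattern and string lengths, which is enough for checking small examples by hand.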
Definition 4 (motif). Given a parameter $q \ge 2$, called quorum, we say that pattern $x$ is a motif in $s$ when $|L_x| \ge q$.
Given any location list $L_x$ and any integer $d$, we adopt the notation $L_x + d = \{\ell + d \mid \ell \in L_x\}$ for indicating the occurrences in $L_x$ “displaced” by the offset $d$.
Definition 5 (maximality). A motif $x$ is maximal if for any other motif $y$ that contains $x$, we have no integer $d$ such that $L_y = L_x + d$.
In other words, making a maximal motif $x$ more specific (thus obtaining $y$) reduces the number of its occurrences in $s$. Definition 5 is equivalent to that meant in [17] stating that $x$ is maximal if there exist no other motif $y$ and no integer $d \ge 0$ verifying $L_x = L_y + d$, such that $x[j] \preceq y[j+d]$ for $0 \le j \le |x| - 1$ (that is, $x$ occurs in $y$ at position $d$ in our terminology).$^2$
Definition 6 (irredundant motif). A maximal motif $x$ is irredundant if, for any maximal motifs $y_1, y_2, \ldots, y_k$ such that $L_x = \bigcup_{i=1}^{k} L_{y_i}$, motif $x$ must be one of the $y_i$s. Conversely, if all the $y_i$s are different from $x$, pattern $x$ is said to be covered by motifs $y_1, y_2, \ldots, y_k$.
The basis of irredundant motifs for string $s$ is the set of all irredundant motifs in $s$. The definition is given with respect to the set of maximal motifs of the input string which is unique; indeed, such basis is unique and it can be used as a generator for all maximal motifs in $s$ as proved in [17]. The size of the basis is the number of irredundant motifs contained in it. We illustrate the notions given so far by employing the example string $s = \mathtt{FABCXFADCYZEADCEADC}$. For this string and $q = 2$ the location list of motif $x_1 = \mathtt{A}\circ\mathtt{C}$ is $L_{x_1} = \{1, 6, 12, 16\}$, and that of motif $x_2 = \mathtt{FA}\circ\mathtt{C}$ is $L_{x_2} = \{0, 5\}$. They are both maximal because they lose at least one of their occurrences when extended with solid characters at one side (possibly with wild cards in between), or when their wild cards are replaced by solid characters. However, motif $x_3 = \mathtt{DC}$ having list $L_{x_3} = \{7, 13, 17\}$ is not maximal. It occurs in $x_4 = \mathtt{ADC}$, where $L_{x_4} = \{6, 12, 16\}$, and its occurrences can be obtained from those of $x_4$ by a displacement of $d = 1$ positions. The basis of the irredundant motifs for $s$ is made up of $x_1 = \mathtt{A}\circ\mathtt{C}$, $x_2 = \mathtt{FA}\circ\mathtt{C}$, $x_4 = \mathtt{ADC}$, and $x_5 = \mathtt{EADC}$. The location list of each of them cannot be obtained from the union of any of the other location lists.
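The numbers in this example can be reproduced mechanically. The following sketch is not part of the paper: it assumes `.` as the wild card, uses hypothetical helper names, and checks coverage only among the four listed basis motifs rather than among all maximal motifs of the string.

```python
WILD = "."  # stand-in for the wild card symbol (an assumption of this sketch)


def location_list(x, s):
    """Positions of s at which pattern x occurs."""
    return [
        l
        for l in range(len(s) - len(x) + 1)
        if all(c == WILD or c == s[l + j] for j, c in enumerate(x))
    ]


def displacement(lx, ly):
    """Return d such that ly equals lx shifted by d, or None if there is none."""
    if not lx or len(lx) != len(ly):
        return None
    d = min(ly) - min(lx)
    return d if {l + d for l in lx} == set(ly) else None


s = "FABCXFADCYZEADCEADC"
motifs = {"x1": "A.C", "x2": "FA.C", "x3": "DC", "x4": "ADC", "x5": "EADC"}
lists = {name: location_list(x, s) for name, x in motifs.items()}

basis = ["x1", "x2", "x4", "x5"]


def covered_by_others(name):
    """True if the list of `name` is a union of lists of the other basis motifs."""
    target = set(lists[name])
    parts = [set(lists[o]) for o in basis if o != name and set(lists[o]) <= target]
    return bool(parts) and set().union(*parts) == target


for name in sorted(lists):
    print(name, motifs[name], lists[name])
print("L_x3 = L_x4 + d with d =", displacement(lists["x4"], lists["x3"]))
print("covered:", {name: covered_by_others(name) for name in basis})
```

The displacement check confirms why $\mathtt{DC}$ is not maximal, and the coverage check confirms that none of the four listed location lists is a union of the others.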
3 IRREDUNDANT MOTIFS: THE BASIS AND ITS SIZE FOR QUORUM $q = 2$
In this section, we show the existence of an infinite family of strings $s_k$ ($k \ge 5$) for which there are $\Omega(n^2)$ irredundant motifs in the basis for quorum $q = 2$, where $n = |s_k|$. In this way, we disprove the claimed upper bound of $3n$ [17] mentioned in Section 1. Each string $s_k$ will be constructed from a shorter string $t_k$, which we now define. For each $k$, $t_k = \mathtt{A}^k\mathtt{T}\mathtt{A}^k$, where $\mathtt{A}^k$ denotes the letter $\mathtt{A}$ repeated $k$ times (our argument works, in general, for $z^k w z^k$, where $z$ and $w$ are strings of equal length not sharing any common character). String $t_k$ contains an exponential number of maximal motifs, including those having the form $\mathtt{A}\{\mathtt{A}, \circ\}^{k-2}\mathtt{A}$ with exactly two wild cards. To see why, each such motif $x$ occurs four times in $t_k$: Specifically, two occurrences of $x$ match the first and the last $k$ letters in $t_k$ while each distinct wild card in $x$ matching the letter $\mathtt{T}$ in $t_k$ contributes to one of the two remaining occurrences. Extending $x$ or replacing a wild card with a solid character reduces the number of these occurrences, so $x$ is maximal. The idea of our proof is to obtain strings $s_k$ by prefixing $t_k$ with $O(|t_k|)$ symbols so that these motifs $x$ become irredundant in $s_k$. Since there are $\Theta(k^2)$ of them, and $n = |s_k| = \Theta(|t_k|) = \Theta(k)$, this leads to the claimed result.
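The counting claim about $t_k$ is easy to confirm by brute force for a small $k$. The sketch below is an illustration under stated assumptions (`.` as wild card, hypothetical function names, $k = 7$ chosen arbitrarily): it enumerates the patterns of length $k$ with exactly two internal wild cards and checks that each occurs four times in $t_k$.

```python
from itertools import combinations

WILD = "."  # stand-in for the wild card symbol (an assumption of this sketch)


def location_list(x, s):
    """Positions of s at which pattern x occurs."""
    return [
        l
        for l in range(len(s) - len(x) + 1)
        if all(c == WILD or c == s[l + j] for j, c in enumerate(x))
    ]


def two_wildcard_motifs(k):
    """Patterns of length k over {A, wild card} with exactly two wild cards,
    both at internal positions 1..k-2."""
    for p1, p2 in combinations(range(1, k - 1), 2):
        x = ["A"] * k
        x[p1] = x[p2] = WILD
        yield "".join(x)


k = 7
t_k = "A" * k + "T" + "A" * k
candidates = list(two_wildcard_motifs(k))

print(len(candidates))  # (k - 2)(k - 3) / 2 patterns
print(all(len(location_list(x, t_k)) == 4 for x in candidates))
print(location_list("A.A.AAA", t_k))
```

The number of such patterns is $(k-2)(k-3)/2$, which is the quadratic growth that the construction of $s_k$ exploits.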
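In order to define the strings $s_k$ on the alphabet $\Sigma = \{\mathtt{A}, \mathtt{T}, u, v, w, x, y, z, a_1, a_2, \ldots, a_{k-2}\}$, we introduce some notation. Let $\tilde{u}$ denote the reversal of $u$, and let $\mathit{ev}_k$, $\mathit{od}_k$, $u_k$, $v_k$ be the strings thus defined:
\[
\begin{array}{lll}
\text{if } k \text{ is even:} & \mathit{ev}_k = a_2 a_4 \cdots a_{k-2}, & \mathit{od}_k = a_1 a_3 \cdots a_{k-3}, \\
 & u_k = \mathit{ev}_k \, u \, \widetilde{\mathit{ev}_k} \, v w \, \mathit{ev}_k, & v_k = \mathit{od}_k \, x y \, \widetilde{\mathit{od}_k} \, z \, \mathit{od}_k; \\
\text{if } k \text{ is odd:} & \mathit{ev}_k = a_2 a_4 \cdots a_{k-3}, & \mathit{od}_k = a_1 a_3 \cdots a_{k-2}, \\
 & u_k = \mathit{ev}_k \, u v \, \widetilde{\mathit{ev}_k} \, w x \, \mathit{ev}_k, & v_k = \mathit{od}_k \, y \, \widetilde{\mathit{od}_k} \, z \, \mathit{od}_k.
\end{array}
\]
The strings $s_k$ are then defined by $s_k = u_k v_k t_k$ for $k \ge 5$. Fig. 1 shows them for $k = 7$.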
Fact 1. The length of $u_k v_k$ is $3k$, and that of $s_k$ is $n = 5k + 1$.
2. Actually, the definition literally reported in [17] is “Definition 4 (Maximal Motif). Let $p_1, p_2, \ldots, p_k$ be the motifs in a sequence $s$. Let $p_i[j]$ be “.” if $j > |p_i|$. A motif $p_i$ is maximal if and only if there exists no $p_l$, $l \ne i$ and no integer $0 \le \delta$ such that $L_{p_i} + \delta = L_{p_l}$ and $p_l[\delta + j] \preceq p_i[j]$ hold for $1 \le j \le |p_i|$.” (The symbols in $p_i$ and $p_l$ are indexed starting from 1 onward.) The corresponding example in the paper illustrates the definition for $s = \mathtt{ABCDABCD}$, stating that $p_i = \mathtt{ABCD}$ is maximal while $p_l = \mathtt{ABC}$ is not. However, $p_i$ does not match the definition because of the existence of its prefix $p_l$ (setting $\delta = 0$); hence, we suspect a minor typo in the definition, for which the definition should read as “... such that $L_{p_i} = L_{p_l} + \delta$ and $p_i[j] \preceq p_l[\delta + j]$.”
Proof. Whatever the parity of $k$, the string $u_k v_k$ contains the six letters $u$, $v$, $w$, $x$, $y$, $z$, two occurrences each of $\mathit{ev}_k$ and $\mathit{od}_k$, and one occurrence each of $\widetilde{\mathit{ev}_k}$ and $\widetilde{\mathit{od}_k}$. Since $\mathit{od}_k$ and $\mathit{ev}_k$ together contain one occurrence of each letter $a_1, a_2, \ldots, a_{k-2}$, we have $|\mathit{od}_k| + |\mathit{ev}_k| = k - 2$. Moreover, $|\widetilde{\mathit{ev}_k}| = |\mathit{ev}_k|$ and $|\widetilde{\mathit{od}_k}| = |\mathit{od}_k|$, so that $|u_k v_k| = 6 + 3(k - 2) = 3k$. This proves the first statement. For the second statement, the total length of $s_k$ follows by observing that $|t_k| = 2k + 1$, and so $n = |s_k| = 3k + 2k + 1 = 5k + 1$. $\square$
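The construction of $s_k$ and the lengths in Fact 1 can be checked with a short script. This is a minimal sketch, not code from the paper: strings are represented as Python lists of symbol names (so that $a_{10}$ and beyond remain single symbols), and `build_prefix_and_tk` is a hypothetical helper name.

```python
def build_prefix_and_tk(k):
    """Return (u_k v_k, t_k) as lists of symbols, following the definition for k >= 5."""
    if k % 2 == 0:
        ev = [f"a{i}" for i in range(2, k - 1, 2)]  # a2 a4 ... a_{k-2}
        od = [f"a{i}" for i in range(1, k - 2, 2)]  # a1 a3 ... a_{k-3}
        u_k = ev + ["u"] + ev[::-1] + ["v", "w"] + ev
        v_k = od + ["x", "y"] + od[::-1] + ["z"] + od
    else:
        ev = [f"a{i}" for i in range(2, k - 2, 2)]  # a2 a4 ... a_{k-3}
        od = [f"a{i}" for i in range(1, k - 1, 2)]  # a1 a3 ... a_{k-2}
        u_k = ev + ["u", "v"] + ev[::-1] + ["w", "x"] + ev
        v_k = od + ["y"] + od[::-1] + ["z"] + od
    t_k = ["A"] * k + ["T"] + ["A"] * k
    return u_k + v_k, t_k


for k in range(5, 12):
    prefix, t_k = build_prefix_and_tk(k)
    s_k = prefix + t_k
    print(k, len(prefix), len(s_k))  # expected: 3k and 5k + 1

prefix7, _ = build_prefix_and_tk(7)
print(" ".join(prefix7))
```

For $k = 7$ the prefix starts with $a_2$, which agrees with the remark in the proof of Proposition 1 that the leftmost $a_2$ sits at position $0$.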
Proposition 1. For $1 \le p \le k - 2$, no motif of the form $\mathtt{A}^p \circ \mathtt{A}^{k-p-1}$ can be maximal in $s_k$. Also, motif $\mathtt{A}^k$ cannot be maximal in $s_k$.
Proof. Let $w$ be an arbitrary motif of the form $\mathtt{A}^p \circ \mathtt{A}^{k-p-1}$, with $1 \le p \le k - 2$. Its location list is $L_w = \{0, k - p, k + 1\} + |u_k v_k| = \{3k, 4k - p, 4k + 1\}$ since $|u_k v_k| = 3k$ by Fact 1 and $w$ matches the two substrings $\mathtt{A}^k$ of $s_k$ as well as $\mathtt{A}^p \mathtt{T} \mathtt{A}^{k-p-1}$. The occurrences are shown in Fig. 1 for $k = 7$ and $p = 2$. No other occurrences are possible. Let us consider the position, say $i$, of the leftmost appearance of letter $a_p$ in $s_k$ (recall that there are three positions on $s_k$ at which letter $a_p$ occurs; we have $i = 0$ in our example of Fig. 1 with $p = 2$). We claim that motif $y = a_p \circ^{3k-i-1} w$ satisfies $L_y = L_w - (3k - i)$. Since $w$ appears in $y$, it follows that $w$ cannot be maximal in $s_k$ by Definition 5 (setting $d = -3k + i$). To see why $L_w = L_y + (3k - i)$, it suffices to prove that the distance in $s_k$ between the positions of the two leftmost letters $a_p$ is $k - p$ while that of the leftmost and the rightmost $a_p$ is $k + 1$. The verification is a bit tedious because four cases arise according to the fact that each of $k$ and $p$ can be even or odd. Since the cases are analogous, we detail only two of them, namely, when both $k$ and $p$ are even, and when $k$ is even and $p$ is odd. In the first case, the three occurrences of $a_p$ are all in $u_k$. Moreover, the distance between the two leftmost letters $a_p$ is the length of the substring $a_p a_{p+2} \cdots a_{k-2} \, u \, a_{k-2} a_{k-4} \cdots a_{p+2}$, that is, $2|a_{p+2} \cdots a_{k-2}| + 2 = 2(k - 2 - p)/2 + 2 = k - p$. The distance between the leftmost and rightmost $a_p$ is the length of $a_p a_{p+2} \cdots a_{k-2} \, u \, \widetilde{\mathit{ev}_k} \, v w \, a_2 a_4 \cdots a_{p-2}$. This is also the length of $u \, \widetilde{\mathit{ev}_k} \, v w \, a_2 a_4 \cdots a_{p-2} a_p a_{p+2} \cdots a_{k-2} = u \, \widetilde{\mathit{ev}_k} \, v w \, \mathit{ev}_k$, that is, $2(k - 2)/2 + 3 = k + 1$ as expected. In the second case where $k$ is even and $p$ is odd, the occurrences of $a_p$ are all in $v_k$. Analogously to the first case, the distance between the two leftmost letters $a_p$ is the length of $a_p a_{p+2} \cdots a_{k-3} \, x y \, a_{k-3} \cdots a_{p+2}$, that is, $2|a_{p+2} \cdots a_{k-3}| + 3 = 2(k - 3 - p)/2 + 3 = k - p$. The distance between the leftmost and the rightmost $a_p$ is the length of the string $a_p a_{p+2} \cdots a_{k-3} \, x y \, \widetilde{\mathit{od}_k} \, z \, a_1 a_3 \cdots a_{p-2}$, which equals $k + 1$, the length of $x y \, \widetilde{\mathit{od}_k} \, z \, \mathit{od}_k$. The analogous verification of the other two cases yields the fact that $w$ cannot be maximal.
The second part of the proposition for motif $\mathtt{A}^k$ proceeds along the same lines, except that we choose $y = a_p \circ^{3k-i-1} \mathtt{A}^k$ with $i$ as before (note that $y$ is not required to be maximal and that the motifs in the statement are maximal in $t_k$). $\square$
Proposition 2. Each motif of the form $\mathtt{A}\{\mathtt{A}, \circ\}^{k-2}\mathtt{A}$ with exactly two $\circ$s is irredundant in $s_k$.
Proof. Let $x$ be an arbitrary motif of the form $\mathtt{A}\{\mathtt{A}, \circ\}^{k-2}\mathtt{A}$ with two $\circ$s, namely, $x = \mathtt{A}^{p_1} \circ \mathtt{A}^{p_2-p_1-1} \circ \mathtt{A}^{k-p_2-1}$ for $1 \le p_1 < p_2 \le k-2$. To prove that $x$ is an irredundant motif, we first show that $x$ is maximal. Its location list is $L_x = \{0, k-p_2, k-p_1, k+1\} + 3k$ since $|u_k v_k| = 3k$ by Fact 1 and $x$ matches the two substrings $\mathtt{A}^k$ of $s_k$ as well as $\mathtt{A}^{p_1} \mathtt{T} \mathtt{A}^{k-p_1-1}$ and $\mathtt{A}^{p_2} \mathtt{T} \mathtt{A}^{k-p_2-1}$. Any other motif $y$ such that $x$ occurs in $y$ can be obtained by replacing at least one wild card (at position $p_1$ or $p_2$) in $x$ with a solid character, but this would cause the removal of position $4k-p_1$ or $4k-p_2$ from $L_x$. Analogously, extending $x$ to the right by putting a solid character at position $|x|$ or larger would eliminate position $4k+1$ from $L_x$. Finally, extending $x$ to the left by a solid character would eliminate at least one position from $L_x$ because no symbol occurs four times in $u_k v_k$. In conclusion, for any motif $y$ such that $x$ occurs in $y$, we have $L_y \ne L_x + d$ for any integer $d$ and, thus, $x$ is a maximal motif by Definition 5.
We now prove that $x$ is irredundant according to Definition 6. Let us consider an arbitrary set of maximal motifs $y_1, y_2, \ldots, y_h$ such that $L_x = \bigcup_{i=1}^{h} L_{y_i}$. We claim that at least one $y_i$ is of the form $\mathtt{A}\{\mathtt{A}, \circ\}^{k-2}\mathtt{A}$. Indeed, there must exist a location list $L_{y_i}$ containing position $4k+1$ since that position belongs to $L_x$. This implies that $y_i$ occurs in the suffix $\mathtt{A}^k$ of $s_k$. It cannot be that $|y_i| < k$ since $y_i$ would also occur at some position $j > 4k+1$ whereas $j \notin L_x$, so it is impossible. Consequently, $y_i$ is of length $k$ and matches $\mathtt{A}^k$, thus being of the form $\mathtt{A}\{\mathtt{A}, \circ\}^{k-2}\mathtt{A}$. We observe that $y_i$ cannot contain zero or one $\circ$s, as it would not be maximal by Proposition 1. Also, $y_i$ cannot contain three or more $\circ$s, as each distinct $\circ$ symbol would match the letter T in $s_k$ giving $|L_{y_i}| > |L_x|$, which is impossible. The only possibility is that $y_i$ contains exactly two $\circ$s as $x$ does at the same positions because $L_{y_i} \subseteq L_x$ and they are maximal. It follows that $y_i = x$, proving the proposition. □
Theorem 2. The basis for string $s_k$ contains $\Omega(n^2)$ irredundant motifs, where $n = |s_k|$ and $k \ge 5$.
Proof. By Proposition 2, the number of irredundant motifs in $s_k$ is at least $\binom{k-2}{2} = \Omega(k^2)$, the number of choices of two positions in $\{\mathtt{A}, \circ\}^{k-2}$. Since $|s_k| = 5k+1$ by Fact 1, we get the conclusion. □
Fig. 1. Example string $s_7$ ($a_i$ of the definition is simply denoted by $i$). Above it, there are the occurrences of $w$ of the proof of Proposition 1, while the three lines below show the occurrences of motif $x = 4 \circ^{19} \mathtt{AAAA} \circ \mathtt{AA}$ in $s_7$. The letter 4 corresponds to position 4 of the wild card in AAAA◦AA.
4 TILING MOTIFS: THE BASIS AND ITS PROPERTIES
4.1 Terminology and Properties
In this section, we introduce a natural notion of a basis for
generating all maximal motifs occurring in a string s of
length n.
Definition 7 (tiling motif). A maximal motif $x$ is tiling if, for any maximal motifs $y_1, y_2, \ldots, y_k$ and for any integers $d_1, d_2, \ldots, d_k$ such that $L_x = \bigcup_{i=1}^{k} (L_{y_i} + d_i)$, motif $x$ must be one of the $y_i$s. Conversely, if all the $y_i$s are different from $x$, pattern $x$ is said to be tiled by motifs $y_1, y_2, \ldots, y_k$.
The notion of tiling is in general more selective than that of irredundancy. Continuing our example string $s =$ FABCXFADCYZEADCEADC, we have seen in Section 2 that motif $x_1 =$ A◦C is irredundant for $s$. Now, $x_1$ is tiled by $x_2 =$ FA◦C and $x_4 =$ ADC according to Definition 7 since its location list, $L_{x_1} = \{1, 6, 12, 16\}$, can be obtained from the union of $L_{x_2} = \{0, 5\}$ and $L_{x_4} = \{6, 12, 16\}$ with respective displacements $d_2 = 1$ and $d_4 = 0$.
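The tiling relation in this example can be checked mechanically. The following is a minimal sketch, not part of the original algorithm: it assumes 0-based positions, uses the ASCII character `.` as a stand-in for the wild card, and the helper names `occurs_at`, `location_list`, and `covers` are hypothetical.

```python
WILD = "."  # ASCII stand-in for the wild card symbol (an assumption of this sketch)

def occurs_at(motif, s, pos):
    """True if motif matches s at 0-based position pos; WILD matches any letter."""
    if pos < 0 or pos + len(motif) > len(s):
        return False
    return all(m == WILD or m == c for m, c in zip(motif, s[pos:pos + len(motif)]))

def location_list(motif, s):
    """Set of all 0-based positions at which motif occurs in s."""
    return {p for p in range(len(s) - len(motif) + 1) if occurs_at(motif, s, p)}

def covers(x, tiles, s):
    """Check whether L_x equals the union of (L_y + d) over the given (y, d) pairs."""
    union = set()
    for y, d in tiles:
        union |= {p + d for p in location_list(y, s)}
    return union == location_list(x, s)

s = "FABCXFADCYZEADCEADC"
print(sorted(location_list("A.C", s)))              # [1, 6, 12, 16]
print(covers("A.C", [("FA.C", 1), ("ADC", 0)], s))  # True
```

The check only verifies the union condition of Definition 7 for the given motifs and displacements; it does not test that the motifs involved are maximal.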
Remark 1. A fairly direct consequence of Definition 7 is that if $x$ is tiled by $y_1, y_2, \ldots, y_k$ with associated displacements $d_1, d_2, \ldots, d_k$, then $x$ occurs at position $d_i$ in $y_i$ for $1 \le i \le k$. As a consequence, we have that $d_i \ge 0$ in Definition 7. Note also that the $y_i$s in Definition 7 are not necessarily distinct and that $k > 1$ for tiled motifs. (It follows from the fact that $L_x = L_{y_1} + d_1$ with $x \ne y_1$ would contradict the maximality of both $x$ and $y_1$.) As a result, a maximal motif $x$ occurring exactly $q$ times in $s$ is tiling as it cannot be tiled by any other motifs because such motifs would occur fewer than $q$ times.
The basis of tiling motifs is the complete set of all tiling motifs for $s$, and the size of the basis is the number of these motifs. For example, the basis, let us denote it by $B$, for FABCXFADCYZEADCEADC contains FA◦C, EADC, and ADC as tiling motifs. Although Definition 7 is derived from that of irredundant motifs given in Definition 6, the difference is much more substantial than it may appear. The basis of tiling motifs relies on the fact that tiling motifs are considered as invariant by displacement, as for maximality. Consequently, our definition of basis is symmetric, that is, each tiling motif in the basis for the reverse string $\tilde{s}$ is the reverse of a tiling motif in the basis of $s$. This follows from the symmetry in Definition 7 and from the fact that maximality is also symmetric in Definition 5. It is a sine qua non for having a notion of basis invariant by the left-to-right or right-to-left order of the symbols in $s$ (like the entropy of $s$), while this property does not hold for the irredundant motifs.
The basis of tiling motifs has further interesting properties for quorum $q = 2$, illustrated in Sections 4.2, 4.3, and 4.4. In Section 4.2, we show that our basis is linear (that is, its size is at most $n-1$). In Section 4.3, we show that the total size of the location lists for the tiling motifs is less than $2n$, describing how to find them in $O(n^2 \log n \log |\Sigma|)$ time. In Section 4.4, we discuss some applications such as generating all maximal motifs with the basis and finding motifs with a constraint on the number of undefined symbols.
4.2 A Linear Upper Bound for the Tiling Motifs with Quorum $q = 2$
Given a string $s$ of length $n$, let $B$ denote its basis of tiling motifs for quorum $q = 2$. Although the number of maximal motifs may be exponential and the basis of irredundant motifs may be at least quadratic (see Section 3), we show that the size of $B$ is always less than $n$. For this, we introduce an operator $\oplus$ between the symbols of $\Sigma$ to define the merges, which are at the heart of the properties of $B$. Given two letters $\sigma_1, \sigma_2 \in \Sigma$ with $\sigma_1 \ne \sigma_2$, the operator satisfies $\sigma_1 \oplus \sigma_2 = \circ$ and $\sigma_1 \oplus \sigma_1 = \sigma_1$. The operator applies to any pair of strings $x, y \in \Sigma^*$, so that $u = x \oplus y$ satisfies $u[j] = x[j] \oplus y[j]$ for all integers $j$.
Definition 8 (Merge). For $1 \le k \le n-1$, let $s_k$ be the (infinite) string whose character at position $i$ is $s_k[i] = s[i] \oplus s[i+k]$. If $s_k$ contains at least one solid character, $\mathit{Merge}_k$ denotes the motif obtained by removing all the leading and trailing $\circ$s in $s_k$ (that is, those appearing before the leftmost solid character and after the rightmost solid character).
For example, FABCXFADCYZEADCEADC has $\mathit{Merge}_4 =$ EADC, $\mathit{Merge}_5 =$ FA◦C, $\mathit{Merge}_6 = \mathit{Merge}_{10} =$ ADC, and $\mathit{Merge}_{11} = \mathit{Merge}_{15} =$ A◦C. The latter is the only merge that is not a tiling motif.
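The merges of the example string can be reproduced with a few lines of code. This is a minimal sketch under stated assumptions: positions are 0-based, the character `.` stands in for the wild card, positions of $s_k$ outside the overlap of $s$ with its shifted copy are treated as wild cards, and the function name `merge` is hypothetical.

```python
WILD = "."  # ASCII stand-in for the wild card symbol (an assumption of this sketch)

def merge(s, k):
    """Return (Merge_k, i), with i the position of its leftmost solid character,
    or None when s_k contains no solid character."""
    n = len(s)
    # s_k[i] = s[i] (+) s[i + k]: equal letters stay solid, different letters give WILD.
    sk = [s[i] if s[i] == s[i + k] else WILD for i in range(n - k)]
    solid = [i for i, c in enumerate(sk) if c != WILD]
    if not solid:
        return None
    first, last = solid[0], solid[-1]
    # Strip the leading and trailing wild cards.
    return "".join(sk[first:last + 1]), first

s = "FABCXFADCYZEADCEADC"
merges = {}
for k in range(1, len(s)):
    result = merge(s, k)
    if result is not None:
        merges[k] = result[0]
print(merges)
```

For this string the sketch finds merges only for $k \in \{4, 5, 6, 10, 11, 15\}$, in agreement with the example above.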
Lemma 1. If $\mathit{Merge}_k$ exists, it must be a maximal motif.
Proof. Motif $x = \mathit{Merge}_k$ occurs at positions, say, $i$ and $i+k$ in $s$. Character $s_k[i]$ is solid by Definitions 4 and 8. We use the fact that $x$ occurs at least twice in $s$ to show that it is maximal. Suppose it is not maximal. By Definition 5, there exists $y \ne x$ such that $x$ occurs in $y$ and $L_y = L_x + d$ for some integer $d$ (in this case $d \le 0$). Since $y$ is more specific than $x$ displaced by $d$, there must exist at least one position $j$ with $0 \le j < |y|$ such that $x[j+d] = \circ$ and $y[j] = \sigma \in \Sigma$. Hence, $x[j+d] = s[i+(j+d)] \oplus s[i+k+(j+d)] = \circ$, and so $s[(i+d)+j] \ne s[(i+k+d)+j]$. Since $y[j]$ cannot match both of the latter symbols in $s$, at least one of $i+d$ or $i+k+d$ is not a position of $y$ in $s$. This contradicts the hypothesis that $L_y = L_x + d$, whereas both $i, i+k \in L_x$. □
Lemma 2. For each tiling motif $x$ in the basis $B$, there is at least one $k$ for which $\mathit{Merge}_k = x$.
Proof. As mentioned in Remark 1, a maximal motif occurring exactly twice in $s$ is tiling. Hence, if $|L_x| = 2$, say $L_x = \{i, j\}$ with $j > i$, then $x = \mathit{Merge}_k$ with $k = j-i$ by the maximality of $x$ and that of the merges by Lemma 1. Let us now consider the case where $|L_x| > 2$. For any pair $i, j \in L_x$, we denote by $u_{ij}$ the string $s[i..i+|x|-1] \oplus s[j..j+|x|-1]$ obtained by applying the $\oplus$ operator to the two substrings of $s$ matching $x$ at positions $i$ and $j$, respectively. We have $x \preceq u_{ij}$ since $x$ occurs at positions $i$ and $j$, and $L_x = \bigcup_{i,j \in L_x} L_{u_{ij}}$ since we are taking all pairs of occurrences of $x$. Letting $k = |j-i|$ for $i, j \in L_x$, we observe that $u_{ij}$ is a substring of $\mathit{Merge}_k$ occurring at position, say, $\delta_k$ in it. Thus,
$$\bigcup_{i,j \in L_x} L_{u_{ij}} = \bigcup_{k = |j-i| \,:\, i,j \in L_x} \left( L_{\mathit{Merge}_k} + \delta_k \right) = L_x.$$
By Definition 7, the fact that $x$ is tiling implies that $x$ must be one $\mathit{Merge}_k$, proving the lemma. □
We now state the main property of tiling bases that follows directly from Lemma 2.
Theorem 3 (linearity of the basis). Given a string $s$ of length $n$ and the quorum $q = 2$, let $M$ be the set of $\mathit{Merge}_k$, for $1 \le k \le n-1$ such that $\mathit{Merge}_k$ exists. The basis $B$ of tiling motifs for $s$ satisfies $B \subseteq M$ and, therefore, the size of $B$ is at most $n-1$.
A simple consequence of Theorem 3 implies a tight bound on the number of tiling motifs for periodic strings. If $s = w^e$ for a string $w$ repeated $e > 1$ times, then $s$ has at most $|w|$ tiling motifs.
Corollary 1. The number of tiling motifs for $s$ is at most $p$, the smallest period of $s$.
The bound in Corollary 1 is not valid for irredundant motifs. String $s =$ ATATATATA has period $p = 2$ and only one tiling motif ATATATA, while its irredundant motifs are A, ATA, ATATA, and ATATATA.
4.3 A Simple Algorithm for Computing Tiling Motifs with Quorum $q = 2$
We describe how to compute the basis $B$ for string $s$ when $q = 2$. A brute-force algorithm generating first all maximal motifs of $s$ takes exponential time in the worst case. Theorem 3 plays a crucial role in that we first compute the motifs in $M$ and then discard those being tiled. Since $B \subseteq M$, what remains is exactly $B$. To appreciate this approach, it is worth noting that we are left with the problem of selecting $B$ from $n-1$ maximal motifs in $M$ at most, rather than selecting $B$ among all the maximal motifs in $s$, which may be exponential in number. Our simple algorithm takes $O(n^2 \log n \log |\Sigma|)$ time and is faster than previous (and more complicated) methods discussed in Section 1.
Step 1. Compute the multiset $M'$ of merges. Letting $s_k[i]$ be the leftmost solid character of string $s_k$ in Definition 8, we define $\mathit{occ}_x = \{i, i+k\}$ to be the positions of the two occurrences of $x$ whose superposition generates $x = \mathit{Merge}_k$. For $k = 1, 2, \ldots, n-1$, we compute string $s_k$ in $O(n-k)$ time. If $s_k$ contains some solid characters, we compute $x = \mathit{Merge}_k$ and $\mathit{occ}_x$ in the same time complexity. As a result, we compute the multiset $M'$ of merges in $O(n^2)$ time. Each merge $x$ in $M'$ is identified by a triplet $\langle i, i+k, |x| \rangle$, from which we can recover the $j$th symbol of $x$ in constant time by simple arithmetic operations and comparisons.
Step 2. Transform the multiset $M'$ into the set $M$ of merges. Since there can be two or more merges in $M'$ that are identical and correspond to the same merge in $M$, we put together all identical merges in $M'$ by radix sorting them. The total cost of this step is dominated by radix sorting, giving $O(n^2)$ time. As a byproduct, we produce the temporary location list $T_x = \bigcup_{x' = x \,:\, x' \in M'} \mathit{occ}_{x'}$ for each distinct $x \in M$ thus obtained.
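Steps 1 and 2 can be sketched as follows. This is an illustration under stated assumptions, not the implementation used in the experiments: positions are 0-based, `.` stands in for the wild card, a dictionary keyed by the merge string replaces the radix sort, and the function names `merge_triplets`, `symbol`, and `temporary_lists` are hypothetical.

```python
from collections import defaultdict

def merge_triplets(s):
    """Step 1 (sketch): one triplet (i, i + k, length) per existing Merge_k."""
    n = len(s)
    triplets = []
    for k in range(1, n):
        solid = [i for i in range(n - k) if s[i] == s[i + k]]
        if solid:
            i, last = solid[0], solid[-1]
            triplets.append((i, i + k, last - i + 1))
    return triplets

def symbol(s, triplet, j, wild="."):
    """Recover the j-th symbol of a merge from its triplet by one comparison."""
    i, i2, length = triplet
    assert 0 <= j < length
    return s[i + j] if s[i + j] == s[i2 + j] else wild

def temporary_lists(s):
    """Step 2 (sketch): group identical merges and collect T_x for each distinct merge x.
    A dictionary stands in for the radix sort described in the text."""
    T = defaultdict(set)
    for t in merge_triplets(s):
        x = "".join(symbol(s, t, j) for j in range(t[2]))
        T[x] |= {t[0], t[1]}
    return dict(T)

print(temporary_lists("FABCXFADCYZEADCEADC"))
```

For the running example, the merge A◦C receives the temporary list $\{1, 12, 16\}$, which misses its occurrence at position 6.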
Lemma 3. Each motif $x \in B$ satisfies $T_x = L_x$.
Proof. For a fixed $x \in B$, the fact that $x$ is equal to at least one merge by Lemma 2 implies that $T_x$ is well defined, with $|T_x| \ge 2$. Since $T_x \subseteq L_x$, let us assume by contradiction that $L_x - T_x \ne \emptyset$. For each pair $i \in L_x - T_x$ and $j \in T_x$, let $m_{ij} = \mathit{Merge}_{|j-i|}$, which is maximal by Lemma 1. Note that each $m_{ij} \ne x$ by our assumption as otherwise $i$ would belong to $T_x$; however, $x$ must occur in $m_{ij}$, say, at position $\delta_{ij}$ in $m_{ij}$. Consequently, $\bigcup_{i \in L_x - T_x,\, j \in T_x} (L_{m_{ij}} + \delta_{ij}) = L_x$ since any occurrence of $x$ is either $i \in L_x - T_x$ or $j \in T_x$. At this point, we apply Definition 7 to the tiling motif $x$, obtaining the contradiction that $x$ must be equal to one $m_{ij}$. □
Notice that the conclusion of Lemma 3 does not necessarily hold for the motifs in $M - B$. For the previous example string FADABCXFADCYZEADCEADCFADC, one such motif is $x =$ ADC with $L_x = \{8, 14, 18, 22\}$ while $T_x = \{8, 18\}$.
Step 3. Select $M^* \subseteq M$, where $M^* = \{x \in M : T_x = L_x\}$. In order to build $M^*$, we employ the Fischer-Paterson algorithm based on convolution [8] for string matching with don't cares to compute the whole list of occurrences $L_x$ for each merge $x \in M$. Its cost is $O((|x| + n) \log n \log |\Sigma|)$ time for each merge $x$. Since $|x| < n$ and there are at most $n-1$ motifs $x \in M$, we obtain $O(n^2 \log n \log |\Sigma|)$ time to construct all lists $L_x$. We can compute $M^*$ by discarding the merges $x \in M$ such that $T_x \ne L_x$ in additional $O(n^2)$ time.
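The selection of Step 3 can be illustrated with a short sketch. It is deliberately simplified: the location lists are computed by naive $O(n|x|)$ matching rather than by the Fischer-Paterson convolution, positions are 0-based, `.` stands in for the wild card, and the function name `select_m_star` and the dictionary representation of the lists $T_x$ are assumptions of the sketch.

```python
def select_m_star(s, T, wild="."):
    """Keep only the merges x whose temporary list T_x equals the full location list L_x.
    T maps each distinct merge (a string) to its temporary list (a set of positions)."""
    def location_list(x):
        # Naive matching with wild cards; a stand-in for the convolution-based method.
        return {p for p in range(len(s) - len(x) + 1)
                if all(a == wild or a == b for a, b in zip(x, s[p:p + len(x)]))}
    return {x: Tx for x, Tx in T.items() if location_list(x) == Tx}

s = "FABCXFADCYZEADCEADC"
T = {"EADC": {11, 15}, "FA.C": {0, 5}, "ADC": {6, 12, 16}, "A.C": {1, 12, 16}}
print(sorted(select_m_star(s, T)))  # ['ADC', 'EADC', 'FA.C']
```

In this example the merge A◦C is discarded because its location list also contains position 6.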
Lemma 4. The set $M^*$ satisfies the conditions $B \subseteq M^*$ and $\sum_{x \in M^*} |L_x| < 2n$.
Proof. The first condition follows from the fact that the motifs in $M - M^*$ are surely tiled by Lemma 3. The second condition follows from the definition of $M^*$ and from the observation that
$$\sum_{x \in M^*} |L_x| = \sum_{x \in M^*} |T_x| \le \sum_{x \in M} |\mathit{occ}_x| < 2n,$$
since $|\mathit{occ}_x| = 2$ (see Step 1) and there are fewer than $n$ of them. □
The property of $M^*$ in Lemma 4 is crucial in that $\sum_{x \in M} |L_x| = \Theta(n^2)$ when many lists contain $\Omega(n)$ entries. For example, $s = \mathtt{A}^n$ has $n-1$ distinct merges, each of the form $x = \mathtt{A}^i$ for $1 \le i \le n-1$, and so $|L_x| = n-i+1$. This would be a sharp drawback in Step 4 when removing tiled motifs as it may turn into a $\Theta(n^3)$ algorithm. Using $M^*$ instead, we are guaranteed that $\sum_{x \in M^*} |L_x| = O(n)$; hence, we may still have some tiled motifs in $M^*$, but their total number of occurrences is $O(n)$.
Step 4. Discard the tiled motifs in $M^*$. We can now check for tiling motifs in $O(n^2)$ time. Given two distinct motifs $x, y \in M^*$, we want to test whether $L_x + d \subseteq L_y$ for some integer $d$ and, in that case, we want to mark the entries in $L_y$ that are also in $L_x + d$. At the end of this task, the lists having all entries marked are tiled (see Definition 7). By removing their corresponding motifs from $M^*$, we eventually obtain the basis $B$ by Lemma 4. Since the meaningful values of $d$ are as many as the entries of $L_y$, we have only $|L_y|$ possible values to check. For a given value of $d$, we avoid merging $L_x$ and $L_y$ in $O(|L_x| + |L_y|)$ time to perform the test, as it would contribute to a total of $\Theta(n^3)$ time. Instead, we exploit the fact that each list has values ranging from 1 to $n$, and use two bit-vectors of size $n$ to perform the above check in $O(|L_x| \times |L_y|)$ time for all values of $d$. This gives $O(\sum_y \sum_x |L_x| \times |L_y|) = O(\sum_y |L_y| \times \sum_x |L_x|) = O(n^2)$ by Lemma 4.
We therefore detail how to perform the above check with $L_x$ and $L_y$ in $O(|L_x| \times |L_y|)$ time. We use two bit-vectors $V_1$ and $V_2$ of length $n$ initially set to all zeros. Given $y \in M^*$, we set $V_1[i] = 1$ if $i \in L_y$. For each $x \in M^* - \{y\}$ and for each $d \in (L_y - m)$ (where $m$ is the smallest entry of $L_x$), we then perform the following test. If all $j \in L_x + d$ satisfy $V_1[j] = 1$, we set $V_2[j] = 1$ for all such $j$. Otherwise, we take the next value of $d$, or the next motif if there are no more values of $d$, and we repeat the test. After examining all $x \in M^* - \{y\}$, we check whether $V_1[i] = V_2[i]$ for all $i \in L_y$. If so, $y$ is tiled as its list is covered by possibly shifted location lists of other motifs. We then reset the ones in both vectors in $O(|L_y|)$ time.
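The bit-vector test can be sketched as follows. This is a minimal illustration, assuming that the location lists are given as sets of 0-based positions in a dictionary keyed by motif; the function name `discard_tiled` is hypothetical, and Python lists of 0/1 values play the role of the bit-vectors $V_1$ and $V_2$.

```python
def discard_tiled(lists, n):
    """Return the motifs whose location lists are not covered by shifted lists of other motifs.
    lists: dict mapping each motif to its location list (set of positions in range(n))."""
    V1 = [0] * n
    V2 = [0] * n
    kept = []
    for y, Ly in lists.items():
        for i in Ly:
            V1[i] = 1
        for x, Lx in lists.items():
            if x == y:
                continue
            m = min(Lx)
            # Only displacements aligning the smallest entry of L_x with an entry of L_y matter.
            for d in (p - m for p in Ly):
                shifted = [j + d for j in Lx]
                if all(0 <= j < n and V1[j] for j in shifted):
                    for j in shifted:
                        V2[j] = 1
        if not all(V2[i] for i in Ly):
            kept.append(y)
        # V2 is only ever set at positions of L_y, so resetting along L_y is enough.
        for i in Ly:
            V1[i] = V2[i] = 0
    return kept

lists = {"FA.C": {0, 5}, "EADC": {11, 15}, "ADC": {6, 12, 16}}
print(sorted(discard_tiled(lists, 19)))  # ['ADC', 'EADC', 'FA.C']
```

On the three motifs of the running example nothing is discarded, which matches the basis given earlier. If A◦C with list $\{1, 6, 12, 16\}$ is added to the input, it is reported as tiled.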
Summing up Steps 1-4, we have that the dominant cost is that of Step 3 and that we have proved the following result.
Theorem 4. Given an input string $s$ of length $n$ over the alphabet $\Sigma$, the basis of tiling motifs with quorum $q = 2$ can be computed in $O(n^2 \log n \log |\Sigma|)$ time. The total number of motifs in the basis is less than $n$, and the total number of their occurrences in $s$ is less than $2n$.
We have implemented the algorithm underlying Theorem 4, and we report here the lessons learned from our experiments. Step 1 requires, in practice, less than the predicted $O(n^2)$ running time. If $p = 1/|\Sigma|$ denotes the probability that two randomly chosen symbols of $\Sigma$ match in the uniform distribution, the probability of finding the first solid character in a merge follows the binomial distribution, and so the expected number of examined characters in $s$ is $O(1/p) = O(|\Sigma|)$, yielding $O(n|\Sigma|)$ time on the average to locate the first (scanning $s$ from the beginning) and the last (scanning $s$ from the end backward) solid character in each merge. A similar approach can be followed in Step 2 for finding the distinct merges. In this case, the merges are first partially sorted using hashing and exploiting the fact that the input is almost sorted. Insertion sort is then the best choice and works very efficiently in our experiments (at least 50 percent faster than Quicksort). We do not yet compute the full merges at this stage, but we delay this expensive part to a later stage on a small set of buckets that require explicit representation of the merges. As a result, the average case is almost linear. For example, executing Steps 1 and 2 on chromosome V of C. elegans, containing more than 21 million bases, took around 15 minutes on a machine with 512 MB of RAM running Linux on a 1 GHz AMD Athlon processor. Step 3 is expensive in practice as well, and the worst case predicted by theory shows up in the experiments. Running this step on sequences much shorter than chromosome V of C. elegans took many hours. Step 4 is not much of a problem. As a result, an alternative way of selecting $M^*$ from $M$ in Step 3 that works fast in practice would considerably improve the overall performance.
4.4 Some Applications
Checking whether a pattern is a motif. The main property underlying the notion of basis is that it is a generator of all motifs. The generation can be done as follows: First select segments of motifs in the basis that start and end with solid characters, then replace any number of internal solid characters by wild cards. However, since the number of motifs, and even maximal motifs, can be exponential, this is not really meaningful unless this number is small and the time complexity of the algorithm is proportional to the total size of the output. An attempt in this direction is made in [18]. The dual problem concerns testing only one pattern. We show how, given a pattern $x$, it can be tested whether $x$ is a motif for string $s$, that is, if pattern $x$ occurs at least $q$ times in $s$. There are two possible ways of performing such a test, depending on whether we test directly on the string or on the basis. The answer relies on iterative applications of the observation made in Remark 1, according to which any tiled motif must occur in at least one tiling motif. The next two statements deal with the alternative. In both cases, we assume that integer $k$ comes from the decomposition of pattern $x$ in the form $u_0 \circ^{\ell_0} u_1 \circ^{\ell_1} \cdots u_{k-1} \circ^{\ell_{k-1}} u_k$, where the subwords $u_i$ contain no wild cards ($u_i \in \Sigma^*$, $0 \le i \le k$) and $\ell_j$ are positive integers, $0 \le j \le k-1$. The next proposition states a well-known fact on matching such a pattern in a text without any wild card that we report here because it is used in the sequel.
Proposition 3. The positions of the occurrences of a pattern $x$ in a string of length $n$ can be computed in time $O(kn)$.
Proof. This is a mere application of matching a pattern with don't cares inside a text without don't cares. Using, for instance, the Fischer and Paterson algorithm [8] is not necessary. Instead, the positions of the subwords $u_i$ are computed by a multiple string-matching algorithm, such as the Aho-Corasick algorithm [1]. For each position $p$ of an occurrence of $u_i$, a counter associated with position $p - \ell$ on $s$ is incremented, where $\ell$ is the position of $u_i$ in $x$ ($\ell$ is the offset of $u_i$ in $x$). Counters whose value is $k+1$ then correspond to occurrences of $x$ in $s$. It remains to check if $x$ occurs at least $q$ times in $s$. The running time is governed by the string-matching algorithm, which is $O(kn)$ (equivalent to running $k$ times a linear-time string matching algorithm). □
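The counting scheme of this proof can be sketched in a few lines. The sketch makes several simplifying assumptions: repeated calls to Python's `str.find` replace the Aho-Corasick automaton (so the stated time bound is not guaranteed), `.` stands in for the wild card, the pattern is assumed to start and end with a solid character, and the function name `match_with_wildcards` is hypothetical.

```python
import re

def match_with_wildcards(x, s, wild="."):
    """Positions of s where pattern x occurs, found by counting subword hits."""
    # Solid subwords u_0, ..., u_k of x together with their offsets in x.
    subwords = [(m.start(), m.group())
                for m in re.finditer(r"[^%s]+" % re.escape(wild), x)]
    counters = [0] * len(s)
    for offset, u in subwords:
        p = s.find(u)
        while p != -1:
            # An occurrence of u at p votes for an occurrence of x at p - offset.
            if p - offset >= 0:
                counters[p - offset] += 1
            p = s.find(u, p + 1)
    need = len(subwords)  # k + 1 subwords must all agree
    return [i for i, c in enumerate(counters) if c == need and i + len(x) <= len(s)]

s = "FABCXFADCYZEADCEADC"
print(match_with_wildcards("A.C", s))   # [1, 6, 12, 16]
print(match_with_wildcards("E..C", s))  # [11, 15]
```

With quorum $q$, the pattern is a motif when the returned list has at least $q$ entries.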
Proposition 4. Given the basis $B$ of string $s$, testing if pattern $x$ is a motif or a maximal motif can be done in $O(kb)$ time, where $b = \sum_{y \in B} |y|$.
Proof. From Remark 1, testing if $x$ is a maximal motif requires only finding if $x$ occurs in an element $y$ of the basis. To do this, we can apply the procedure of the previous proof because wild cards in $y$ should be viewed as extra characters that do not match any letter of $\Sigma$. The time complexity of the procedure is thus $O(kb)$. Since a nonmaximal motif occurs in a maximal motif, the same procedure applies to test if $x$ is a general motif. □
As a consequence of Propositions 3 and 4, we get an upper bound on the time complexity for testing motifs.
Corollary 2. Testing whether or not pattern $u_0 \circ^{\ell_0} u_1 \circ^{\ell_1} \cdots u_{k-1} \circ^{\ell_{k-1}} u_k$ is a motif in a string of length $n$ having a basis of total size $b$ can be done in time $O(k \times \min\{b, n\})$.
Remark 2. Inside the procedure described in the proofs of Propositions 3 and 4, it is also possible to use bit-vector pattern matching methods [3], [16], [25] to compute the occurrences of $x$. This leads to practically efficient solutions running in time proportional to the length of the string $n$ or the total size of the basis $b$, in the bit-vector model of machine. This is certainly a method of choice for short patterns.
Finding the longest motif with bounded number of wild cards. We address an interesting question concerning the computation of a longest motif occurring repeatedly in a string. Given an integer $g \ge 0$, let $LM_g(s)$ be the maximal length of motifs occurring in a string $s$ of length $n$ with quorum $q = 2$, and containing no more than $g$ wild cards. If $g = 0$, the value can be computed in $O(n \log |\Sigma|)$ time with the help of the suffix tree of $s$ (see [5] or [10]). For $g > 0$, we can show that $LM_g(s)$ can be computed in $O(gn^2)$ time using the suffix tree augmented (in linear time) to accept lowest common ancestor (LCA) queries as follows: For each possible pair $(i, j)$ of positions on $s$ for which $s[i] = s[j]$, we compute the longest common prefix of $s[i..n-1]$ and $s[j..n-1]$ in constant time through an LCA query on the suffix tree. If $\ell$ is the length of the prefix, we get the first part $s[i..i+\ell-1]\circ$ of a possible longest motif. The second part is found similarly by considering the pair of positions $(i+\ell+1, j+\ell+1)$. The process is iterated $g$ times (or less) and provides a longest motif containing at most $g$ wild cards and occurring at positions $i$ and $j$. Length $LM_g(s)$ is obtained by taking the maximum length of motifs for all pairs of positions $(i, j)$. This yields the next result.
Proposition 5. Using the suffix tree, $LM_g(s)$ can be computed in $O(gn^2)$ time.
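The pairwise extension behind Proposition 5 can be illustrated without a suffix tree. The sketch below is a simplification: it replaces the constant-time LCA queries by a direct character-by-character scan, so it runs in cubic time in the worst case instead of $O(gn^2)$, and the function name `longest_motif_length` is hypothetical.

```python
def longest_motif_length(s, g):
    """Length of a longest motif occurring at least twice in s with at most g wild cards.
    Naive stand-in for the suffix-tree method: mismatches are scanned one by one."""
    n = len(s)
    best = 0
    for i in range(n):
        for j in range(i + 1, n):
            if s[i] != s[j]:
                continue
            wild = 0    # wild cards used so far (one per mismatching position)
            length = 0  # length up to the last matching (solid) position
            t = 0
            while j + t < n:
                if s[i + t] == s[j + t]:
                    length = t + 1
                else:
                    wild += 1
                    if wild > g:
                        break
                t += 1
            best = max(best, length)
    return best

print(longest_motif_length("FABCXFADCYZEADCEADC", 0))  # 4
print(longest_motif_length("ABCXDEYABCZDE", 1))        # 6
```

The recorded length always ends at a matching position, so the corresponding motif starts and ends with a solid character.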
What makes the use of the basis of tiling motifs interesting is that computing $LM_g(s)$ becomes a mere pattern matching exercise because of the strong properties of the basis. This contrasts with the previous result grounded on the deep algorithmic technique for LCA queries.
Proposition 6. Using the basis $B$ of tiling motifs, $LM_g(s)$ can be computed in time $O(b)$, where $b = \sum_{y \in B} |y|$.
Proof. Let $x$ be a motif yielding $LM_g(s)$ (i.e., $x$ is of length $LM_g(s)$); hence, $x$ occurs at least twice in $s$. Let $y$ be a maximal motif in which $x$ occurs (we have $y = x$ if $x$ is itself maximal). Let $z$ be a tiling motif in which $y$ occurs (again we may have $z = y$ if $y$ is a tiling motif). The word $x$ then occurs in $z$ that belongs to the basis. Let us say that it matches $z[i..j]$. Assume that $x$ is not a tiling motif, that is, $x \ne z$. Certainly, $i = 0$ or $z[i-1] = \circ$, otherwise, $x$ would not be the longest with its property. For the same reason, $j = |z| - 1$ or $z[j+1] = \circ$. But, indeed, $x$ occurs exactly in $z$, which means that the wild card symbols do not match any solid symbol, because, otherwise, $z[i..j]$ would contain fewer than $g$ don't cares and could be extended by at least one symbol to the left or to the right because $x \ne z$, yielding a contradiction with the definition of $x$. Therefore, either $x$ is a tiling motif or it matches exactly a segment of one of the tiling motifs. Searching for $x$ thus reduces to finding a longest segment of a tiling motif in $B$ that contains no more than $g$ wild cards. The computation can be done in linear time with only two pointers on $s$, which proves the result. □
By Proposition 6, it is clear that a small basis $B$ leads to an efficient computation once $B$ is given. If we have to build $B$ from scratch, we can observe that no (maximal) motif can give a larger value of $LM_g(s)$ if it does not belong to $B$. With this observation, we have $O(n^2)$ running time, which always beats the $O(g \times n^2)$ cost of using the suffix tree. In particular, it is interesting to notice that the running time of the algorithm using the basis is independent of the parameter $g$.
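The two-pointer search of Proposition 6 can be sketched as a sliding window over each motif of the basis. The sketch assumes that the basis is given as a list of strings with `.` standing in for the wild card, and that a valid segment must start and end with a solid character; the function names `longest_segment` and `longest_in_basis` are hypothetical.

```python
def longest_segment(motif, g, wild="."):
    """Longest segment of motif that starts and ends with a solid character
    and contains at most g wild cards (two-pointer sliding window)."""
    best = 0
    left = 0
    wilds = 0
    for right, c in enumerate(motif):
        if c == wild:
            wilds += 1
        # Shrink from the left until at most g wild cards remain in the window.
        while wilds > g:
            if motif[left] == wild:
                wilds -= 1
            left += 1
        # A segment may not start with a wild card.
        while left <= right and motif[left] == wild:
            wilds -= 1
            left += 1
        # A segment may not end with a wild card either.
        if c != wild:
            best = max(best, right - left + 1)
    return best

def longest_in_basis(basis, g, wild="."):
    """Maximum of longest_segment over all motifs of the basis."""
    return max((longest_segment(y, g, wild) for y in basis), default=0)

basis = ["FA.C", "EADC", "ADC"]  # basis of FABCXFADCYZEADCEADC given in the text
print(longest_in_basis(basis, 0))  # 4
print(longest_in_basis(basis, 1))  # 4
```

Each motif is scanned once, so the total work is proportional to $b$, the total length of the motifs in the basis.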
5 PSEUDOPOLYNOMIAL BASES FOR HIGHER QUORUM
We now discuss the general case of quorum $q \ge 2$ for finding the basis of a string of length $n$. Differently from previous work, we show in Section 5.1 that no polynomial-time algorithm can exist for any arbitrary value of $q$ in the worst case, both for the basis of irredundant motifs and for the basis of tiling motifs. The size of these bases provably depends exponentially on suitable values of $q \ge 2$, that is, we give a lower bound of $\binom{\frac{n-1}{2}-1}{q-1} = \Omega\left(\frac{1}{2^q}\binom{n-1}{q-1}\right)$. In practice, this size has an exponential growth for increasing values of $q$ up to $O(\log n)$, but larger values of $q$ are theoretically possible in the worst case. Fixing $q = (n-1)/4 + 1$ in our lower bound, we get a size of $\Omega(2^{(n-1)/4})$ motifs in the bases. On the average, $q = O(\log_{|\Sigma|} n)$ by extending the argument after Theorem 4, namely, using the fact that on the average the number of simultaneous comparisons to find the first solid character of a merge is $O(|\Sigma|^{q-1})$, which must be less than $n$.
We show a further property for the basis of tiling motifs in Section 5.2, giving an upper bound of $\binom{n-1}{q-1}$ on its size with a simple proof. Since we can find an algorithm taking time proportional to the square of that size, we can conclude that a worst-case polynomial-time algorithm for finding the basis of tiling motifs exists if and only if the quorum $q$ satisfies either $q = O(1)$ or $q = n - O(1)$ (the latter condition is hardly meaningful in practice).
5.1 A Lower Bound of $\binom{\frac{n-1}{2}-1}{q-1}$ on the Bases
We show the existence of a family of strings for which there are at least $\binom{\frac{n-1}{2}-1}{q-1}$ tiling motifs for a quorum $q$. Since a tiling motif is also irredundant, this gives a lower bound for the irredundant motifs to be combined with that in Section 3 (note that the lower bound in Section 3 still gives $\Omega(n^2)$ for $q \ge 2$). For $q > 2$, this gives a lower bound of $\Omega\left(\binom{\frac{n-1}{2}-1}{q-1}\right) = \Omega\left(\frac{1}{2^q}\binom{n-1}{q-1}\right)$ for the number of both tiling and irredundant motifs.
The strings are this time of the form $t_k = \mathtt{A}^k \mathtt{T} \mathtt{A}^k$ ($k \ge 5$), without the left extension used in the bound of Section 3. The proof proceeds by exhibiting $\binom{k-1}{q-1}$ motifs that are maximal and have each exactly $q$ occurrences, from which it follows immediately that they are tiling. Indeed, Remark 1 for tiling motifs holds for any $q \ge 2$. Namely, all maximal motifs that occur exactly $q$ times in a string are tiling.
Proposition 7. For $2 \le q \le k$ and $1 \le p \le k-q+1$, any motif $\mathtt{A}^p \circ \{\mathtt{A}, \circ\}^{k-p-1} \circ \mathtt{A}^p$ with exactly $q$ wild cards is tiling (and so irredundant) in $t_k$.
Proof. Let $x$ be an arbitrary motif $\mathtt{A}^p \circ \{\mathtt{A}, \circ\}^{k-p-1} \circ \mathtt{A}^p$ with $1 \le p \le k-q+1$ and $q$ wild cards; namely, $x = \mathtt{A}^{p_1} \circ \mathtt{A}^{p_2-p_1-1} \circ \cdots \circ \mathtt{A}^{p_{q-1}-p_{q-2}-1} \circ \mathtt{A}^{k-p_{q-1}-1} \circ \mathtt{A}^{p_1}$ for $1 \le p_1 < p_2 < \cdots < p_{q-1} \le k-1$ and $p = p_1$. We first have to prove that $x$ is a maximal motif according to Definition 5. Its length is $k+1+p_1$ and its location list is $L_x = \{0, k-p_{q-1}, \ldots, k-p_2, k-p_1\}$. Observe that the number of its occurrences is exactly the number of times the wild card appears in $x$, which is equal to $q$. A motif $y$ different from $x$ such that $x$ occurs in $y$ can be obtained by replacing the wild card at position $p_i$ with a solid symbol, for $1 \le i \le q-1$, but this eliminates $k-p_i$ from the location list of $y$. Also, $y$ can be obtained by extending $x$ to the right by a solid symbol (at any position $\ge |x|$), but then position $k-p_1$ is not in $L_y$ because the last symbol in that occurrence of $y$ occupies position $(k-p_1)+|y|-1 \ge (k-p_1)+|x| = (k-p_1)+(k+1+p_1) > |t_k|-1$ in $t_k$, which is impossible. Analogously, $y$ can be obtained by extending $x$ to the left by a solid symbol (at any position $d < 0$), but position 0 is no longer in $L_y$. Consequently, for any motif $y$ more specific than $x$, we have $L_y \ne L_x + d$, implying that $x$ is maximal. As previously mentioned, $x$ is tiling because it has exactly $q$ occurrences. □
Theorem 5. String $t_k$ has $\binom{\frac{n-1}{2}-1}{q-1} = \Omega\left(\frac{1}{2^q}\binom{n-1}{q-1}\right)$ tiling (and irredundant) motifs, where $n = |t_k|$ and $k \ge 2$.
Proof. By Proposition 7, the tiling or irredundant motifs in $t_k$ are at least $\binom{k-1}{q-1}$, the number of choices of $q-1$ positions on $\mathtt{A}^{k-1}$. Since $n = 2k+1$, we obtain the statement. □
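The counting argument can be checked on small instances. The following sketch enumerates the motifs described in the proof of Proposition 7 for $t_k = \mathtt{A}^k \mathtt{T} \mathtt{A}^k$ and verifies their number and their location lists by naive matching. It does not verify maximality; `.` stands in for the wild card, and the function names `prop7_motifs` and `location_list` are hypothetical.

```python
from itertools import combinations
from math import comb

def prop7_motifs(k, q, wild="."):
    """Motifs with wild cards at p_1 < ... < p_{q-1} in [1, k-1] and at position k,
    followed by A^{p_1}, as in the proof of Proposition 7. Returns (positions, motif) pairs."""
    motifs = []
    for ps in combinations(range(1, k), q - 1):
        body = ["A"] * k
        for p in ps:
            body[p] = wild
        motifs.append((ps, "".join(body) + wild + "A" * ps[0]))
    return motifs

def location_list(x, s, wild="."):
    """All 0-based positions at which x occurs in s (naive matching)."""
    return [p for p in range(len(s) - len(x) + 1)
            if all(a == wild or a == b for a, b in zip(x, s[p:p + len(x)]))]

k, q = 6, 3
t = "A" * k + "T" + "A" * k
motifs = prop7_motifs(k, q)
print(len(motifs), comb(k - 1, q - 1))  # both counts should agree
for ps, x in motifs:
    # Expected list from the proof: 0 together with k - p_i for each chosen p_i.
    assert location_list(x, t) == sorted([0] + [k - p for p in ps])
```

For $k = 6$ and $q = 3$ this produces $\binom{5}{2} = 10$ motifs, each occurring exactly three times in $t_6$.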
5.2 An Upper Bound of $\binom{n-1}{q-1}$ Tiling Motifs
We now prove that $\binom{n-1}{q-1}$ is an upper bound for the size of a basis of tiling motifs for a string $s$ and quorum $q \ge 2$. Let us denote as before such a basis by $B$. To prove the upper bound, we use again the notion of a merge, except that it now involves $q$ strings. The operator $\oplus$ between the elements of $\Sigma$ extends to more than two arguments, so that the result is a $\circ$ if at least two arguments differ. Let $k$ denote now an array of $q-1$ positive values $k_1, \ldots, k_{q-1}$ with $1 \le k_i < k_j \le n-1$ for all $1 \le i < j \le q-1$.
Definition 9. Let $s_k$ denote the string such that its $j$th character is $s_k[j] = s[j] \oplus s[j+k_1] \oplus \cdots \oplus s[j+k_{q-1}]$ for all integers $j$. $\mathit{Merge}_k$ is the pattern obtained by removing all the leading and trailing $\circ$s in $s_k$ (that is, appearing before the leftmost solid character and after the rightmost solid character).
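The generalized merge of Definition 9 can be sketched in the same way as for two strings. The sketch assumes 0-based positions, uses `.` as a stand-in for the wild card, treats positions where some argument falls outside $s$ as wild cards, and the function name `merge_q` is hypothetical.

```python
def merge_q(s, ks, wild="."):
    """Merge_k of Definition 9 for the shifts ks = (k_1, ..., k_{q-1}),
    or None when s_k contains no solid character."""
    n = len(s)
    shifts = (0,) + tuple(ks)
    limit = n - max(shifts)  # beyond this, at least one argument falls outside s
    sk = []
    for j in range(limit):
        letters = {s[j + d] for d in shifts}
        # The result is solid only if all q superposed letters coincide.
        sk.append(s[j] if len(letters) == 1 else wild)
    solid = [j for j, c in enumerate(sk) if c != wild]
    if not solid:
        return None
    return "".join(sk[solid[0]:solid[-1] + 1])

s = "FABCXFADCYZEADCEADC"
print(merge_q(s, (6, 10)))      # quorum 3: superposes positions j, j + 6, j + 10
print(merge_q(s, (5, 11, 15)))  # quorum 4
```

For the running example, the shifts $(6, 10)$ give ADC and the shifts $(5, 11, 15)$ give A◦C, while a single shift reproduces the merges of Definition 8.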
Lemmas 5 and 6 reported below extend Lemmas 1 and 2 for $q > 2$.
Lemma 5. If $\mathit{Merge}_k$ exists for quorum $q$, then it must be a maximal motif.
Proof. Let $x = \mathit{Merge}_k$ denote the (nonempty) pattern, and let $s_k[i]$ be its first character, which is solid by Definition 9. Since $x$ occurs at least $q$ times in $s$, at positions $i, i+k_1, \ldots, i+k_{q-1}$, then $x$ is a motif for quorum $q$. We show that $x$ is maximal. Suppose it is not maximal. By Definition 5, there exists $y \ne x$ such that $x$ occurs in $y$ and $L_y = L_x + d$ for some integer $d$. This implies there exists at least one position $j$ with $0 \le j < |y|$ such that $y[j] = \sigma \in \Sigma$ and $x[j+d] = \circ$. Since
$$x[j+d] = s[i+j+d] \oplus s[i+j+k_1+d] \oplus \cdots \oplus s[i+j+k_{q-1}+d],$$
then at least one among $i+d, i+k_1+d, \ldots, i+k_{q-1}+d$ is not an occurrence of $y$, contradicting the hypothesis that $L_y = L_x + d$ (since $i, i+k_1, \ldots, i+k_{q-1} \in L_x$). □
Lemma 6. For each tiling motif $x$ in the basis $B$ with quorum $q$, there is at least one $k$ for which $\mathit{Merge}_k = x$.
Proof. If $|L_x| = q$ and $L_x = \{i_1, \ldots, i_q\}$ with $i_1 < \cdots < i_q$, then $x = \mathit{Merge}_k$ where $k$ is the array of values $i_2-i_1, i_3-i_1, \ldots, i_q-i_1$. Let us now consider the case where $|L_x| > q$. Given any $q$-tuple $i_1, \ldots, i_q \in L_x$, let $u_k$ denote $s[i_1..i_1+|x|-1] \oplus \cdots \oplus s[i_q..i_q+|x|-1]$, which is a substring of $\mathit{Merge}_k$ introduced in Definition 9. We have that $x \preceq u_k$ and $L_x = \bigcup_{i_1, i_2, \ldots, i_q \in L_x} L_{u_k}$. Since each $u_k$ for $i_1, i_2, \ldots, i_q \in L_x$ is a substring of $\mathit{Merge}_k$, we infer that $L_x = \bigcup_{i_1, i_2, \ldots, i_q \in L_x} (L_{\mathit{Merge}_k} + \delta_k)$ where the $\delta_k$s are non-negative integers. By Definition 7, if $\mathit{Merge}_k$ were different from $x$, then $x$ would not be tiling, which is a contradiction. Therefore, at least one $\mathit{Merge}_k$ is $x$. □
The following property of tiling bases follows from Lemmas 5 and 6.
Theorem 6. Given a string $s$ of length $n$ and a quorum $q \ge 2$, let $M$ be the set of $\mathit{Merge}_k$, for any of the $\binom{n-1}{q-1}$ possible choices of $k$ for which $\mathit{Merge}_k$ exists. The basis $B$ of tiling motifs for $s$ satisfies $B \subseteq M$ and, therefore, the size of $B$ is at most $\binom{n-1}{q-1}$.
The tiling motifs in our basis appear in $s$ for a total of $q\binom{n-1}{q-1}$ times at most. A variation of the algorithm given in Section 4.3 gives a pseudopolynomial-time complexity of
$$O\left(q^2 \binom{n-1}{q-1}^2\right).$$
When this upper bound is combined with the lower bound of Section 5.1, we obtain that there exists a polynomial-time algorithm for finding the basis if and only if either $q = O(1)$ or $q = n - O(1)$.
6 CONCLUSIONS
The work presented in this paper is theoretical in nature, but it should be clear by now that its practical consequences, particularly (but not exclusively) for computational biology, are relevant. Whether motifs as patterns are used for inferring binding sites or repeats of any length, for characterizing sequences, or as a filtering step in a whole genome comparison algorithm or before inferring PSSMs, we show that wild cards alone are not enough for a biologically satisfying definition of the patterns of interest. Simply throwing away the pattern type of motif detection is not a good way to address the problem. This is confirmed by various biological publications [24], [7] as well as by the not yet published (but already publicly available) results of a first motif detection competition, http://bio.cs.washington.edu/assessment/. Even if patterns are not the best way of modeling biological features, they deserve an important function in any future improved algorithm for inferring motifs ab initio from biological sequences. As such, the purpose of this paper is to shed some further light on the inner structure of one important type of motif.
ACKNOWLEDGMENTS
Many suggestions from the anonymous referees greatly improved the original form of this paper. The authors are thankful to them for this and to M.H. ter Beek for improving the English. A preliminary version of the results in this paper has been described in the technical report IGM-2002-10, July 2002 [20], and in [21]. Work was partially supported by the French program bioinformatique EPST 2002 “Algorithms for Modelling and Inference Problems in Molecular Biology.” N. Pisanti and R. Grossi were partially supported by the Italian PRIN project “ALINWEB: Algorithmics for Internet and the Web.” M.-F. Sagot was partially supported by CNRS-INRIA-INRA-INSERM action BioInformatique and the Wellcome Trust Foundation. M. Crochemore was partially supported by CNRS action AlBio, NATO Science Programme grant PST.CLG.977017, and the Wellcome Trust Foundation.
REFERENCES
[1] A. Aho and M. Corasick, “Efficient String Matching: An Aid to Bibliographic Search,” Comm. ACM, vol. 18, no. 6, pp. 333-340, 1975.
[2] A. Apostolico and L. Parida, “Incremental Paradigms of Motif Discovery,” J. Computational Biology, vol. 11, no. 1, pp. 15-25, 2004.
[3] R. Baeza-Yates and G. Gonnet, “A New Approach to Text Searching,” Comm. ACM, vol. 35, pp. 74-82, 1992.
[4] A. Brazma, I. Jonassen, I. Eidhammer, and D. Gilbert, “Approaches to the Automatic Discovery of Patterns in Biosequences,” J. Computational Biology, vol. 5, pp. 279-305, 1998.
[5] M. Crochemore and W. Rytter, Jewels of Stringology. World Scientific Publishing, 2002.
[6] E. Eskin, “From Profiles to Patterns and Back Again: A Branch and Bound Algorithm for Finding Near Optimal Motif Profiles,” RECOMB’04: Proc. Eighth Ann. Int’l Conf. Computational Molecular Biology, pp. 115-124, 2004.
[7] E. Eskin, U. Keich, M. Gelfand, and P. Pevzner, “Genome-Wide Analysis of Bacterial Promoter Regions,” Proc. Pacific Symp. Biocomputing, pp. 29-40, 2003.
[8] M. Fischer and M. Paterson, “String Matching and Other Products,” SIAM AMS Complexity of Computation, R. Karp, ed., pp. 113-125, 1974.
[9] M. Gribskov, A. McLachlan, and D. Eisenberg, “Profile Analysis: Detection of Distantly Related Proteins,” Proc. Nat’l Academy of Sciences, vol. 84, no. 13, pp. 4355-4358, 1987.
[10] D. Gusfield, Algorithms on Strings, Trees and Sequences: Computer Science and Computational Biology. Cambridge Univ. Press, 1997.
[11] G.Z. Hertz and G.D. Stormo, “Escherichia Coli Promoter Sequences: Analysis and Prediction,” Methods in Enzymology, vol. 273, pp. 30-42, 1996.
[12] C.E. Lawrence, S.F. Altschul, M.S. Boguski, J.S. Liu, A.F. Neuwald, and J.C. Wooton, “Detecting Subtle Sequence Signals: A Gibbs Sampling Strategy for Multiple Alignment,” Science, vol. 262, pp. 208-214, 1993.
[13] C.E. Lawrence and A.A. Reilly, “An Expectation Maximization (EM) Algorithm for the Identification and Characterization of Common Sites in Unaligned Biopolymer Sequences,” Proteins: Structure, Function, and Genetics, vol. 7, pp. 41-51, 1990.
[14] L. Marsan and M.-F. Sagot, “Algorithms for Extracting Structured Motifs Using a Suffix Tree with an Application to Promoter and Regulatory Site Consensus Identification,” J. Computational Biology, vol. 7, pp. 345-362, 2000.
[15] W. Miller, “Comparison of Genomic DNA Sequences: Solved and Unsolved Problems,” Bioinformatics, vol. 17, pp. 391-397, 2001.
[16] G. Myers, “A Fast Bit-Vector Algorithm for Approximate String Matching Based on Dynamic Programming,” J. ACM, vol. 46, no. 3, pp. 395-415, 1999.
[17] L. Parida, I. Rigoutsos, A. Floratos, D. Platt, and Y. Gao, “Pattern Discovery on Character Sets and Real-Valued Data: Linear Bound on Irredundant Motifs and Efficient Polynomial Time Algorithm,” Proc. SIAM Symp. Discrete Algorithms (SODA), 2000.
[18] L. Parida, I. Rigoutsos, and D. Platt, “An Output-Sensitive Flexible Pattern Discovery Algorithm,” Combinatorial Pattern Matching, A. Amir and G. Landau, eds., pp. 131-142, Springer-Verlag, 2001.
[19] J. Pelfrêne, S. Abdeddaïm, and J. Alexandre, “Extracting Approximate Patterns,” Combinatorial Pattern Matching, pp. 328-347, Springer-Verlag, 2003.
[20] N. Pisanti, M. Crochemore, R. Grossi, and M.-F. Sagot, “A Basis for Repeated Motifs in Pattern Discovery and Text Mining,” Technical Report IGM 2002-10, Institut Gaspard-Monge, Univ. of Marne-la-Vallée, July 2002.
[21] N. Pisanti, M. Crochemore, R. Grossi, and M.-F. Sagot, “A Basis of Tiling Motifs for Generating Repeated Patterns and Its Complexity for Higher Quorum,” Math. Foundations of Computer Science (MFCS), B. Rovan and P. Vojtas, eds., pp. 622-631, Springer-Verlag, 2003.
[22] N. Pisanti, M. Crochemore, R. Grossi, and M.-F. Sagot, String Algorithmics, chapter: A Comparative Study of Bases for Motif Inference, pp. 195-225, KCL Press, 2004.
[23] D. Pollard, C. Bergman, J. Stoye, S. Celniker, and M. Eisen, “Benchmarking Tools for the Alignment of Functional Noncoding DNA,” BMC Bioinformatics, vol. 5, pp. 6-23, 2004.
[24] A. Vanet, L. Marsan, and M.-F. Sagot, “Promoter Sequences and Algorithmical Methods for Identifying Them,” Research in Microbiology, vol. 150, pp. 779-799, 1999.
[25] S. Wu and U. Manber, “Path-Matching Problems,” Algorithmica, vol. 8, no. 2, pp. 89-101, 1992.
Nadia Pisanti received the laurea degree in computer science in 1996 from the University of Pisa (Italy), the French DEA in fundamental informatics with applications to genome treatment in 1998 from the University of Marne-la-Vallée (France), and the PhD degree in computer science in 2002 from the University of Pisa. She has been a postdoctoral researcher at INRIA and at the University of Paris 13 and she is currently a research fellow in the Department of Computer Science of the University of Pisa. Her interests are in computational biology and, in particular, in motif extraction and genome rearrangement.
Maxime Crochemore received the PhD degree in 1978 and the Doctorat d'État in 1983 from the University of Rouen. He received his first professorship position at the University of Paris-Nord in 1975 where he acted as President of the Department of Mathematics and Computer Science for two years. He became a professor at the University Paris 7 in 1989 and was involved in the creation of the University of Marne-la-Vallée where he is presently a professor. He also created the Computer Science Research Laboratory of this university in 1991. Since then, he has been the director of the laboratory, which now has around 45 permanent researchers. Professor Crochemore has been a senior research fellow at King’s College London since 2002. He has been the recipient of several French grants on string algorithmics and bioinformatics. He participated in a good number of international projects in algorithmics and supervised 20 PhD students.
Roberto Grossi received the laurea degree in computer science in 1988, and the PhD degree in computer science in 1993, at the University of Pisa. He joined the University of Florence in 1993 as an associate researcher. Since 1998, he has been an associate professor of computer science in the Dipartimento di Informatica, University of Pisa. He has visited several international research institutions. His interests are in the design and analysis of algorithms and data structures, namely, dynamic and external memory algorithms, graph algorithms, experimental and algorithm engineering, fast lookup tables and dictionaries, pattern matching algorithms, text indexing, and compressed data structures.
Marie-France Sagot received the BSc degree in computer science from the University of São Paulo, Brazil, in 1991, the PhD degree in theoretical computer science and applications from the University of Marne-la-Vallée, France, in 1996, and the Habilitation from the same university in 2000. From 1997 to 2001, she worked as a research associate at the Pasteur Institute in Paris, France. In 2001, she moved to Lyon, France, as a research associate at the INRIA, the French National Institute for Research in Computer Science and Control. Since 2003, she has been director of research at the INRIA. Her research interests are in computational biology, algorithmics, and combinatorics.
Multiseed Lossless Filtration
Gregory Kucherov, Laurent Noé, and Mikhail Roytberg
Abstract—We study a method of seed-based lossless filtration for approximate string matching and related bioinformatics
applications. The method is based on a simultaneous use of several spaced seeds rather than a single seed as studied by Burkhardt
and Kärkkäinen [1]. We present algorithms to compute several important parameters of seed families, study their combinatorial
properties, and describe several techniques to construct efficient families. We also report a large-scale application of the proposed
technique to the problem of oligonucleotide selection for an EST sequence database.
Index Terms—Filtration, string matching, gapped seed, gapped q-gram, local alignment, sequence similarity, seed family, multiple
spaced seeds, dynamic programming, EST, oligonucleotide selection.
1 INTRODUCTION
FILTERING is a widely-used technique in biosequence analysis. Applied to the approximate string matching
problem [2], it can be summarized by the following two-
stage scheme: To find approximate occurrences (matches) of
a given string in a sequence (text), one first quickly discards
(filters out) those sequence regions where matches cannot
occur, and then checks out the remaining parts of the
sequence for actual matches. The filtering is done according
to small patterns of a specified form that the searched string is assumed to share, in the exact way, with its approximate
occurrences. A similar filtration scheme is used by heuristic
local alignment algorithms ([3], [4], [5], [6], to mention a
few): They first identify potential similarity regions that
share some patterns and then actually check whether those
regions represent a significant similarity by computing a
corresponding alignment.
Two types of filtering should be distinguished: lossless and lossy. A lossless filtration is guaranteed to detect all sequence fragments of interest, while a lossy filtration may miss some of them but still tries to detect the majority. Local alignment algorithms usually use lossy filtration. Lossless filtration, on the other hand, has been studied in the context of the approximate string matching problem [7], [1]. In this paper, we focus on lossless filtration.
In the case of lossy filtration, its efficiency is measured by
two parameters, usually called selectivity and sensitivity. The
sensitivity measures the part of sequence fragments of
interest that are missed by the filter (false negatives), and
the selectivity indicates what part of detected candidate
fragments do not actually represent a solution (false
positives). In the case of lossless filtration, only the
selectivity parameter makes sense and is therefore the main
characteristic of the filtration efficiency.
The choice of patterns that must be contained in the
searched sequence fragments is a key ingredient of the
filtration algorithm. Gapped seeds (spaced seeds, gapped q-
grams) have been recently shown to significantly improve
the filtration efficiency over the “traditional” technique of
contiguous seeds. In the framework of lossy filtration for
sequence alignment, the use of designed gapped seeds has
been introduced by the PATTERNHUNTER method [4] and
then used by some other algorithms (e.g., [5], [6]). In [8], [9],
spaced seeds have been shown to improve indexing
schemes for similarity search in sequence databases. The
estimation of the sensitivity of spaced seeds (as well as of
some extended seed models) has been the subject of several
recent studies [10], [11], [12], [13], [14], [15]. In the
framework of lossless filtration for approximate pattern
matching, gapped seeds were studied in [1] (see also [7])
and have also been shown to increase the filtration
efficiency considerably.
In this paper, we study an extension of the lossless
single-seed filtration technique [1]. The extension is based
on using seed families rather than individual seeds. The idea
of simultaneous use of multiple seeds for DNA local
alignment was already envisaged in [4] and applied in
PATTERNHUNTER II software [16]. The problem of design-
ing efficient seed families has also been studied in [17]. In
[18], multiple seeds have been applied to the protein search.
However, the issues analyzed in the present paper are quite
different, due to the proposed requirement for the search to
be lossless.
The rest of the paper is organized as follows: After
formally introducing the concept of multiple seed filtering
in Section 2, Section 3 is devoted to dynamic programming
algorithms to compute several important parameters of
seed families. In Section 4, we first study several combina-
torial properties of families of seeds and, in particular, seeds
having a periodic structure. These results are used to obtain
a method for constructing efficient seed families. We also
outline a heuristic genetic programming algorithm for
constructing seed families. Finally, in Section 5, we present several seed families we computed, and we report a large-scale experimental application of the method to a practical problem of oligonucleotide selection.
. G. Kucherov and L. Noé are with the INRIA/LORIA, 615, rue du Jardin Botanique, B.P. 101, 54602 Villers-lès-Nancy, France. E-mail: {Gregory.Kucherov, Laurent.Noe}@loria.fr.
. M. Roytberg is with the Institute of Mathematical Problems in Biology, Pushchino, Moscow Region, Russia. E-mail: [email protected].
Manuscript received 24 Sept. 2004; revised 13 Dec. 2004; accepted 10 Jan. 2005; published online 30 Mar. 2005. IEEECS Log Number TCBB-0154-0904.
IEEE/ACM Transactions on Computational Biology and Bioinformatics, vol. 2, no. 1, January-March 2005. 1545-5963/05/$20.00 © 2005 IEEE. Published by the IEEE CS, CI, and EMB Societies & the ACM.
2 MULTIPLE SEED FILTERING
A seed $Q$ (also called a spaced seed or gapped q-gram) is a list $\{p_1, p_2, \ldots, p_d\}$ of nonnegative integers, called matching positions, such that $p_1 < p_2 < \ldots < p_d$. By convention, we always assume $p_1 = 0$. The span of a seed $Q$, denoted $s(Q)$, is the quantity $p_d + 1$. The number $d$ of matching positions is called the weight of the seed and denoted $w(Q)$. Often, we will use a more visual representation of seeds, adopted in [1], as words of length $s(Q)$ over the two-letter alphabet {#, -}, where # occurs at all matching positions and - at all positions in between. For example, seed {0, 1, 2, 4, 6, 9, 10, 11} of weight 8 and span 12 is represented by the word ###-#-#--###. The character - is called a joker. Note that, unless otherwise stated, the seed has the character # at its first and last positions.
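The two representations of a seed can be converted into one another mechanically. The following is a minimal sketch in Python; the function names are illustrative assumptions and do not come from the paper.

```python
def seed_to_positions(word):
    """Return the 0-based matching positions of a seed written over {#, -}."""
    if not word or set(word) - {"#", "-"}:
        raise ValueError("a seed must be a non-empty word over '#' and '-'")
    return [i for i, c in enumerate(word) if c == "#"]


def positions_to_seed(positions):
    """Inverse conversion: build the {#, -} word from the matching positions."""
    marked = set(positions)
    return "".join("#" if i in marked else "-" for i in range(max(positions) + 1))


def span(positions):
    return positions[-1] + 1      # s(Q) = p_d + 1


def weight(positions):
    return len(positions)         # w(Q) = d


q = [0, 1, 2, 4, 6, 9, 10, 11]
print(positions_to_seed(q), span(q), weight(q))   # ###-#-#--### 12 8
```

The printed values agree with the example given in the text (weight 8, span 12).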
Intuitively, a seed specifies the set of patterns that, if
shared by two sequences, indicate a possible similarity
between them. Two sequences are similar if the Hamming
distance between them is smaller than a certain threshold.
For example, sequences CACTCGT and CACACTT are similar
within Hamming distance 2 and this similarity is detected
by the seed ##-# at position 2. We are interested in seeds
that detect all similarities of a given length with a given
Hamming distance.
Formally, a gapless similarity (hereafter simply similarity) of two sequences of length $m$ is a binary word $w \in \{0,1\}^m$ interpreted as a sequence of matches (1s) and mismatches (0s) of individual characters from the alphabet of input sequences. A seed $Q = \{p_1, p_2, \ldots, p_d\}$ matches a similarity $w$ at position $i$, $1 \le i \le m - s(Q) + 1$, iff for every $j \in [1..d]$, we have $w[i + p_j] = 1$. In this case, we also say that seed $Q$ has an occurrence in similarity $w$ at position $i$. A seed $Q$ is said to detect a similarity $w$ if $Q$ has at least one occurrence in $w$.
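The definitions of similarity, occurrence, and detection translate directly into code. The sketch below is illustrative only (the helper names are assumptions); positions are reported 1-based, as in the text, and it replays the CACTCGT / CACACTT example.

```python
def similarity(a, b):
    """Gapless similarity of two equal-length sequences: '1' = match, '0' = mismatch."""
    if len(a) != len(b):
        raise ValueError("sequences must have the same length")
    return "".join("1" if x == y else "0" for x, y in zip(a, b))


def occurrences(seed, w):
    """1-based positions i at which the seed (a word over {#, -}) matches w."""
    pos = [p for p, c in enumerate(seed) if c == "#"]
    last_start = len(w) - len(seed) + 1          # m - s(Q) + 1
    return [i for i in range(1, last_start + 1)
            if all(w[i - 1 + p] == "1" for p in pos)]


def detects(seed, w):
    """A seed detects w if it has at least one occurrence in w."""
    return bool(occurrences(seed, w))


w = similarity("CACTCGT", "CACACTT")
print(w, occurrences("##-#", w))   # 1110101 [2]
```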
Given a similarity length $m$ and a number of mismatches $k$, consider all similarities of length $m$ containing $k$ 0s and $(m-k)$ 1s. These similarities are called $(m,k)$-similarities. A seed $Q$ solves the detection problem $(m,k)$ (for short, the $(m,k)$-problem) iff all of $\binom{m}{k}$ $(m,k)$-similarities $w$ are detected by $Q$. For example, one can check that seed #-##--#-## solves the $(15,2)$-problem.
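For small parameters, the claim that a seed solves an $(m,k)$-problem can be checked by exhaustive enumeration of all $\binom{m}{k}$ similarities. The sketch below is a brute-force check written for illustration (the function names are assumptions); it is not the dynamic programming approach developed later in the paper.

```python
from itertools import combinations


def detects(seed, w):
    """True iff the seed (a word over {#, -}) has an occurrence in the binary word w."""
    pos = [p for p, c in enumerate(seed) if c == "#"]
    return any(all(w[i + p] == "1" for p in pos)
               for i in range(len(w) - len(seed) + 1))


def solves(seed, m, k):
    """True iff every (m, k)-similarity is detected by the seed."""
    for zeros in combinations(range(m), k):
        w = "".join("0" if i in zeros else "1" for i in range(m))
        if not detects(seed, w):
            return False
    return True


print(solves("#-##--#-##", 15, 2))   # True
print(solves("#-##--#-##", 14, 2))   # False
```

With $m = 14$ the check fails, for instance on the similarity whose two mismatches are adjacent at positions 4 and 5 (1-based).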
Note that the weight of the seed is directly related to the selectivity of the corresponding filtration procedure. A larger weight improves the selectivity, as fewer similarities will pass through the filter. On the other hand, a smaller weight reduces the filtration efficiency. Therefore, the goal is to solve an $(m,k)$-problem by a seed with the largest possible weight.
Solving $(m,k)$-problems by a single seed has been studied by Burkhardt and Kärkkäinen [1]. An extension we propose here is to use a family of seeds, instead of a single seed, to solve the $(m,k)$-problem. Formally, a finite family of seeds $F = \langle Q_l \rangle_{l=1}^{L}$ solves an $(m,k)$-problem iff for any $(m,k)$-similarity $w$, there exists a seed $Q_l \in F$ that detects $w$.
Note that the seeds of the family are used in the
complementary (or disjunctive) fashion, i.e., a similarity is
detected if it is detected by one of the seeds. This differs from
the conjunctive approach of [7] where a similarity should be
detected by two seeds simultaneously.
The following example motivates the use of multiple
seeds. In [1], it has been shown that a seed solving the $(25,2)$-problem has the maximal weight 12. The only such seed (up to reversal) is
###-#--###-#--###-#.
However, the problem can be solved by the family composed of the following two seeds of weight 14:
#####-##---#####-##
and
#-##---#####-##---####.
Clearly, using these two seeds increases the selectivity of the search, as only similarities having 14 or more matching characters pass the filter versus 12 matching characters in the case of a single seed. On uniform Bernoulli sequences, this results in the decrease of the number of candidate similarities by the factor of $|A|^2/2$, where $A$ is the input alphabet. This illustrates the advantage of the multiple seed approach: it makes it possible to increase the selectivity while preserving a lossless search. The price to pay for this gain
in selectivity is multiplying the work on identifying the
seed occurrences. In the case of large sequences, however,
this is largely compensated by the decrease in the number
of false positives caused by the increase of the seed weight.
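The $(25,2)$ example can be replayed by brute force. The sketch below (illustrative names, exhaustive enumeration rather than the paper's algorithms) checks that both the single seed of weight 12 and the two-seed family of weight 14 detect all 300 similarities. It also evaluates the selectivity factor under the uniform Bernoulli assumption made in the text: one seed of weight 12 hits a random position with probability $|A|^{-12}$, two seeds of weight 14 with probability about $2|A|^{-14}$, which gives the ratio $|A|^2/2$.

```python
from fractions import Fraction
from itertools import combinations


def detects(seed, w):
    pos = [p for p, c in enumerate(seed) if c == "#"]
    return any(all(w[i + p] == "1" for p in pos)
               for i in range(len(w) - len(seed) + 1))


def family_solves(family, m, k):
    """True iff every (m, k)-similarity is detected by at least one seed of the family."""
    for zeros in combinations(range(m), k):
        w = "".join("0" if i in zeros else "1" for i in range(m))
        if not any(detects(q, w) for q in family):
            return False
    return True


def candidate_ratio(alphabet_size, single_weight, family_weight, family_size):
    """(|A|^-single_weight) / (family_size * |A|^-family_weight), kept exact."""
    return Fraction(alphabet_size ** (family_weight - single_weight), family_size)


single = ["###-#--###-#--###-#"]
family = ["#####-##---#####-##", "#-##---#####-##---####"]
print(family_solves(single, 25, 2), family_solves(family, 25, 2))   # True True
print(candidate_ratio(4, 12, 14, 2))                                # 8 for DNA
```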
3 COMPUTING PROPERTIES OF SEED FAMILIES
Burkhardt and Kärkkäinen [1] proposed a dynamic programming algorithm to compute the optimal threshold of a given seed, i.e., the minimal number of its occurrences over all possible $(m,k)$-similarities. In this section, we describe an extension of this algorithm for seed families and, in addition, describe dynamic programming algorithms for computing two other important parameters of seed families that we will use in a later section.
Consider an $(m,k)$-problem and a family of seeds $F = \langle Q_l \rangle_{l=1}^{L}$. We need the following notations:
. $s_{max} = \max\{s(Q_l)\}_{l=1}^{L}$, $s_{min} = \min\{s(Q_l)\}_{l=1}^{L}$,
. for a binary word $w$ and a seed $Q_l$, $\mathit{suff}(Q_l, w) = 1$ if $Q_l$ matches $w$ at position $(|w| - s(Q_l) + 1)$ (i.e., matches a suffix of $w$), otherwise $\mathit{suff}(Q_l, w) = 0$,
. $\mathit{last}(w) = 1$ if the last character of $w$ is 1, otherwise $\mathit{last}(w) = 0$, and
. $\mathit{zeros}(w)$ is the number of 0s in $w$.
3.1 Optimal Threshold
Given an $(m,k)$-problem, a family of seeds $F = \langle Q_l \rangle_{l=1}^{L}$ has the optimal threshold $T_F(m,k)$ if every $(m,k)$-similarity has at least $T_F(m,k)$ occurrences of seeds of $F$ and this is the maximal number with this property. Note that overlapping occurrences of a seed as well as occurrences of different seeds at the same position are counted separately. For example, the singleton family {###-##} has threshold 2 for the $(15,2)$-problem.
Clearly, $F$ solves an $(m,k)$-problem if and only if $T_F(m,k) > 0$. If $T_F(m,k) > 1$, then one can strengthen the detection criterion by requiring several seed occurrences for a similarity to be detected. This shows the importance of the optimal threshold parameter.
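Before turning to dynamic programming, the definition of the optimal threshold can be made concrete with an exhaustive computation. The following sketch is a brute-force illustration (the function names are assumptions) that counts all occurrences of all seeds in every $(m,k)$-similarity and takes the minimum.

```python
from itertools import combinations


def count_occurrences(family, w):
    """Total number of occurrences of the seeds of the family in w.
    Overlapping occurrences and occurrences of different seeds are counted separately."""
    total = 0
    for seed in family:
        pos = [p for p, c in enumerate(seed) if c == "#"]
        for i in range(len(w) - len(seed) + 1):
            if all(w[i + p] == "1" for p in pos):
                total += 1
    return total


def threshold_bruteforce(family, m, k):
    """Optimal threshold T_F(m, k) by enumeration of all (m, k)-similarities."""
    best = None
    for zeros in combinations(range(m), k):
        w = "".join("0" if i in zeros else "1" for i in range(m))
        n = count_occurrences(family, w)
        best = n if best is None else min(best, n)
    return best


print(threshold_bruteforce(["###-##"], 15, 2))   # 2
```

This reproduces the threshold 2 stated in the text for the singleton family {###-##}.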
We now describe a dynamic programming algorithm for computing the optimal threshold $T_F(m,k)$. For a binary word $w$, consider the quantity $T_F(m,k,w)$ defined as the minimal number of occurrences of seeds of $F$ in all $(m,k)$-similarities which have the suffix $w$. By definition, $T_F(m,k) = T_F(m,k,\varepsilon)$. Assume that we precomputed values $\mathcal{T}_F(j,w) = T_F(s_{max}, j, w)$, for all $j \le \max\{k, s_{max}\}$, $|w| = s_{max}$. The algorithm is based on the following recurrence relations on $T_F(i,j,w)$, for $i \ge s_{max}$.
\[
T_F(i, j, w[1..n]) =
\begin{cases}
\mathcal{T}_F(j, w), & \text{if } i = s_{max}; \\
T_F(i-1, j-1, w[1..n-1]), & \text{if } w[n] = 0; \\
T_F(i-1, j, w[1..n-1]) + \left[\sum_{l=1}^{L} \mathit{suff}(Q_l, w)\right], & \text{if } n = s_{max}; \\
\min\{T_F(i, j, 1.w),\; T_F(i, j, 0.w)\}, & \text{if } \mathit{zeros}(w) < j; \\
T_F(i, j, 1.w), & \text{if } \mathit{zeros}(w) = j.
\end{cases}
\]
The first relation is an initial condition of the recurrence. The second one is based on the fact that if the last symbol of $w$ is 0, then no seed can match a suffix of $w$ (as the last position of a seed is always assumed to be a matching position). The third relation reduces the size of the problem by counting the number of suffix seed occurrences. The fourth one splits the counting into two cases, by considering two possible characters occurring on the left of $w$. If $w$ already contains $j$ 0s, then only 1 can occur on the left of $w$, as stated by the last relation.
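The sketch below computes the same quantity with a memoized dynamic program. Note the assumption: it is not a transcription of the recurrence above, which works on suffixes, but a simpler left-to-right formulation whose state is the prefix length, the number of 0s used so far, and the last $s_{max}$ characters. Each occurrence is counted once, at the moment its last character is read.

```python
from functools import lru_cache


def optimal_threshold(family, m, k):
    """Minimal number of seed occurrences over all (m, k)-similarities."""
    positions = [[p for p, c in enumerate(q) if c == "#"] for q in family]
    spans = [len(q) for q in family]
    smax = max(spans)

    def suffix_hits(window):
        # Number of seeds of the family that match a suffix of `window`.
        hits = 0
        for pos, s in zip(positions, spans):
            start = len(window) - s
            if start >= 0 and all(window[start + p] == "1" for p in pos):
                hits += 1
        return hits

    @lru_cache(maxsize=None)
    def best(length, zeros, window):
        # Fewest occurrences still to come for a prefix of the given length
        # with `zeros` 0s whose last (at most smax) characters are `window`.
        if length == m:
            return 0 if zeros == k else float("inf")
        result = float("inf")
        for ch in "10":
            z = zeros + (ch == "0")
            if z > k:
                continue
            nxt = (window + ch)[-smax:]
            result = min(result, suffix_hits(nxt) + best(length + 1, z, nxt))
        return result

    return best(0, 0, "")


print(optimal_threshold(["###-##"], 15, 2))   # 2
```

Only windows with at most $k$ zeros are ever visited, which mirrors the observation in the text that words with more than $j$ 0s need not be considered.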
A dynamic programming implementation of the above recurrence makes it possible to compute $T_F(m,k,\varepsilon)$ in a bottom-up fashion, starting from initial values $\mathcal{T}_F(j,w)$ and applying the above relations in the order in which they are given. A straightforward dynamic programming implementation requires $O(m \cdot k \cdot 2^{(s_{max}+1)})$ time and space. However, the space complexity can be immediately improved: If values of $i$ are processed successively, then only $O(k \cdot 2^{(s_{max}+1)})$ space is needed. Furthermore, for each $i$ and $j$, it is not necessary to consider all $2^{(s_{max}+1)}$ different strings $w$, but only those which contain up to $j$ 0s. The number of those $w$ is $g(j, s_{max}) = \sum_{e=0}^{j} \binom{s_{max}}{e}$. For each $i$, $j$ ranges from 0 to $k$. Therefore, for each $i$, we need to store $f(k, s_{max}) = \sum_{j=0}^{k} g(j, s_{max}) = \sum_{j=0}^{k} \binom{s_{max}}{j} \cdot (k - j + 1)$ values. This yields the same space complexity as for computing the optimal threshold for one seed [1].
The quantity $\sum_{l=1}^{L} \mathit{suff}(Q_l, w)$ can be precomputed for all considered words $w$ in time $O(L \cdot g(k, s_{max}))$ and space $O(g(k, s_{max}))$, under the assumption that checking an individual match is done in constant time. This leads to the overall time complexity $O(m \cdot f(k, s_{max}) + L \cdot g(k, s_{max}))$ with the leading term $m \cdot f(k, s_{max})$ (as $L$ is usually small compared to $m$ and $g(k, s_{max})$ is smaller than $f(k, s_{max})$).
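To give a feeling for the saving, the counts $g$ and $f$ can be evaluated numerically. The sketch below is a small illustration; the choice $s_{max} = 22$, $k = 2$ is taken from the $(25,2)$ example family and is otherwise arbitrary. It also checks that the two expressions given for $f$ agree.

```python
from math import comb


def g(j, smax):
    """Number of binary words of length smax with at most j zeros."""
    return sum(comb(smax, e) for e in range(j + 1))


def f(k, smax):
    """Number of values stored per value of i: sum of g(j, smax) for j = 0..k."""
    return sum(g(j, smax) for j in range(k + 1))


def f_closed(k, smax):
    """Equivalent single-sum form: each word with j zeros is counted (k - j + 1) times."""
    return sum(comb(smax, j) * (k - j + 1) for j in range(k + 1))


smax, k = 22, 2
print(g(k, smax), f(k, smax), 2 ** (smax + 1))   # 254 278 8388608
```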
3.2 Number of Undetected Similarities
We now describe a dynamic programming algorithm that computes another characteristic of a seed family, that will be used later in Section 4.4. Consider an $(m,k)$-problem. Given a seed family $F = \langle Q_l \rangle_{l=1}^{L}$, we are interested in the number $U_F(m,k)$ of $(m,k)$-similarities that are not detected by $F$. For a binary word $w$, define $U_F(m,k,w)$ to be the number of undetected $(m,k)$-similarities that have the suffix $w$.
Similar to [10], let $X(F)$ be the set of binary words $w$ such that 1) $|w| \le s_{max}$, 2) for any $Q_l \in F$, $\mathit{suff}(Q_l, 1^{s_{max}-|w|}w) = 0$, and 3) no proper suffix of $w$ satisfies 2). Note that word 0 belongs to $X(F)$, as the last position of every seed is a matching position.
The following recurrence relations make it possible to compute $U_F(i,j,w)$ for $i \le m$, $j \le k$, and $|w| \le s_{max}$:
\[
U_F(i, j, w[1..n]) =
\begin{cases}
\binom{i-|w|}{j-\mathit{zeros}(w)}, & \text{if } i < s_{min}; \\
0, & \text{if } \exists l \in [1..L],\ \mathit{suff}(Q_l, w) = 1; \\
U_F(i-1, j-\mathit{last}(w), w[1..n-1]), & \text{if } w \in X(F); \\
U_F(i, j, 1.w) + U_F(i, j, 0.w), & \text{if } \mathit{zeros}(w) < j; \\
U_F(i, j, 1.w), & \text{if } \mathit{zeros}(w) = j.
\end{cases}
\]
The first condition says that if $i < s_{min}$, then no word of length $i$ will be detected, hence the binomial coefficient. The second condition is straightforward. The third relation follows from the definition of $X(F)$ and allows us to reduce the size of the problem. The last two conditions are similar to those from the previous section.
The set $X(F)$ can be precomputed in time $O(L \cdot g(k, s_{max}))$ and the worst-case time complexity of the whole algorithm remains $O(m \cdot f(k, s_{max}) + L \cdot g(k, s_{max}))$.
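The number of undetected similarities can also be obtained with a short memoized program. As a loudly stated assumption, the sketch below does not implement the suffix-based recurrence or the set $X(F)$ above; it scans similarities from left to right, abandons a branch as soon as some seed matches a suffix of the current prefix, and counts the completed words that survive.

```python
from functools import lru_cache


def count_undetected(family, m, k):
    """Number of (m, k)-similarities in which no seed of the family occurs."""
    positions = [[p for p, c in enumerate(q) if c == "#"] for q in family]
    spans = [len(q) for q in family]
    smax = max(spans)

    def some_seed_matches_suffix(window):
        for pos, s in zip(positions, spans):
            start = len(window) - s
            if start >= 0 and all(window[start + p] == "1" for p in pos):
                return True
        return False

    @lru_cache(maxsize=None)
    def count(length, zeros, window):
        # Number of undetected completions of a prefix with the given length,
        # number of 0s, and last (at most smax) characters.
        if length == m:
            return 1 if zeros == k else 0
        total = 0
        for ch in "10":
            z = zeros + (ch == "0")
            if z > k:
                continue
            nxt = (window + ch)[-smax:]
            if some_seed_matches_suffix(nxt):
                continue          # this branch is detected, so it contributes nothing
            total += count(length + 1, z, nxt)
        return total

    return count(0, 0, "")


print(count_undetected(["##"], 3, 1))            # 1  (only 101 is missed)
print(count_undetected(["#-##--#-##"], 15, 2))   # 0  (the seed solves the problem)
```

A family solves the $(m,k)$-problem exactly when this count is 0.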
3.3 Contribution of a Seed
Using a similar dynamic programming technique, one can compute, for a given seed of the family, the number of $(m,k)$-similarities that are detected only by this seed and not by the others. Together with the number of undetected similarities, this parameter will be used later in Section 4.4.
Given an $(m,k)$-problem and a family $F = \langle Q_l \rangle_{l=1}^{L}$, we define $S_F(m,k,l)$ to be the number of $(m,k)$-similarities detected by the seed $Q_l$ exclusively (through one or several occurrences), and $S_F(m,k,l,w)$ to be the number of those similarities ending with the suffix $w$. A dynamic programming algorithm similar to the one described in the previous sections can be applied to compute $S_F(m,k,l)$. The recurrence is given below.
\[
S_F(i, j, l, w[1..n]) =
\begin{cases}
0, & \text{if } i < s_{min} \text{ or } \exists l' \neq l,\ \mathit{suff}(Q_{l'}, w) = 1; \\
S_F(i-1, j-1, l, w[1..n-1]), & \text{if } w[n] = 0; \\
S_F(i-1, j, l, w[1..n-1]), & \text{if } n = |Q_l| \text{ and } \mathit{suff}(Q_l, w) = 0; \\
S_F(i-1, j, l, w[1..n-1]) + U_F(i-1, j, w[1..n-1]), & \text{if } n = s_{max} \text{ and } \mathit{suff}(Q_l, w) = 1 \text{ and } \forall l' \neq l,\ \mathit{suff}(Q_{l'}, w) = 0; \\
S_F(i, j, l, 1.w[1..n]) + S_F(i, j, l, 0.w[1..n]), & \text{if } \mathit{zeros}(w) < j; \\
S_F(i, j, l, 1.w[1..n]), & \text{if } \mathit{zeros}(w) = j.
\end{cases}
\]
The third and fourth relations play the principal role: if $Q_l$ does not match a suffix of $w[1..n]$, then we simply drop out the last letter. If $Q_l$ matches a suffix of $w[1..n]$, but no other seed does, then we count prefixes matched by $Q_l$ exclusively (term $S_F(i-1, j, l, w[1..n-1])$) together with prefixes matched by no seed at all (term $U_F(i-1, j, w[1..n-1])$). The latter is computed by the algorithm of the previous section.
The complexity of computing $S_F(m,k,l)$ for a given $l$ is the same as the complexity of dynamic programming algorithms from the previous sections.
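For small instances, the exclusive contribution of each seed can be cross-checked by enumeration. The sketch below is a brute-force illustration with assumed function names, not the recurrence above; it returns, for each seed, the number of similarities detected by that seed only, together with the number of undetected similarities.

```python
from itertools import combinations


def detects(seed, w):
    pos = [p for p, c in enumerate(seed) if c == "#"]
    return any(all(w[i + p] == "1" for p in pos)
               for i in range(len(w) - len(seed) + 1))


def exclusive_contributions(family, m, k):
    """Return (counts, undetected): counts[l] is the number of (m, k)-similarities
    detected by seed l and by no other seed of the family."""
    counts = [0] * len(family)
    undetected = 0
    for zeros in combinations(range(m), k):
        w = "".join("0" if i in zeros else "1" for i in range(m))
        hit = [l for l, q in enumerate(family) if detects(q, w)]
        if not hit:
            undetected += 1
        elif len(hit) == 1:
            counts[hit[0]] += 1
    return counts, undetected


# Toy family: 011 and 110 are detected only by ##, 101 only by #-#.
print(exclusive_contributions(["##", "#-#"], 3, 1))   # ([2, 1], 0)
```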
4 SEED DESIGN
In the previous section we showed how to compute various
useful characteristics of a given family of seeds. A much
more difficult task is to find an efficient seed family that
solves a given $(m,k)$-problem. Note that there exists a trivial
solution where the family consists of all $\binom{m}{k}$ position
combinations, but this is in general unacceptable in practice
because of a huge number of seeds. Our goal is to find
families of reasonable size (typically, with the number of
seeds smaller than 10), with a good filtration efficiency.
In this section, we present several results that contribute
to this goal. In Section 4.1, we start with the case of single
seed with a fixed number of jokers and show, in particular,
that for one joker, there exists one best seed in a sense that
will be defined. We then show in Section 4.2 that a solution
for a larger problem can be obtained from a smaller one by a
regular expansion operation. In Section 4.3, we focus on
seeds that have a periodic structure and show how those
seeds can be constructed by iterating some smaller seeds.
We then show a way to build efficient families of periodic
seeds. Finally, in Section 4.4, we briefly describe a heuristic
approach to constructing efficient seed families that we
used in the experimental part of this work presented in
Section 5.
4.1 Single Seeds with a Fixed Number of Jokers
Assume that we fixed a class of seeds under interest (e.g.,
seeds of a given minimal weight). One possible way to
define the seed design problem is to fix a similarity length
$m$ and find a seed that solves the $(m,k)$-problem with the largest possible value of $k$. A complementary definition is to fix $k$ and minimize $m$ provided that the $(m,k)$-problem is still solved. In this section, we adopt the second definition and present an optimal solution for one particular case.
For a seed $Q$ and a number of mismatches $k$, define the k-critical length for $Q$ as the minimal value $m$ such that $Q$ solves the $(m,k)$-problem. For a class of seeds $C$ and a value $k$, a seed $Q$ is k-optimal in $C$ if $Q$ has the minimal k-critical length among all seeds of $C$.
One interesting class of seeds $C$ is obtained by putting an upper bound on the possible number of jokers in the seed, i.e., on the number $(s(Q) - w(Q))$. We have found a general solution of the seed design problem for the class $C_1(d)$ consisting of seeds of weight $d$ with only one joker, i.e., seeds $\#^{d-r}\,\texttt{-}\,\#^{r}$.
Consider first the case of one mismatch, i.e., $k = 1$. A 1-optimal seed from $C_1(d)$ is $\#^{d-r}\,\texttt{-}\,\#^{r}$ with $r = \lfloor d/2 \rfloor$. To see this, consider an arbitrary seed $Q = \#^{p}\,\texttt{-}\,\#^{q}$, $p + q = d$, and assume by symmetry that $p \ge q$. Observe that the longest $(m,1)$-similarity that is not detected by $Q$ is $1^{p-1}01^{p+q}$ of length $(2p + q)$. Therefore, we have to minimize $2p + q = d + p$, and since $p \ge \lceil d/2 \rceil$, the minimum is reached for $p = \lceil d/2 \rceil$, $q = \lfloor d/2 \rfloor$.
However, for $k \ge 2$, an optimal seed has an asymmetric structure described by the following theorem.
Theorem 1. Let $d$ be an integer and $r = [d/3]$ ($[x]$ is the closest integer to $x$). For every $k \ge 2$, seed $Q(d) = \#^{d-r}\,\texttt{-}\,\#^{r}$ is k-optimal among the seeds of $C_1(d)$.
Proof. Again, consider a seed $Q = \#^{p}\,\texttt{-}\,\#^{q}$, $p + q = d$, and assume that $p \ge q$. Consider the longest word $S(k)$ from $(1^*0)^k1^*$, $k \ge 1$, which is not detected by $Q$ and let $L(k)$ be the length of $S(k)$. By the above remark, $S(1) = 1^{p-1}01^{p+q}$ and $L(1) = 2p + q$.
It is easily seen that for every $k$, $S(k)$ starts either with $1^{p-1}0$, or with $1^{p+q}01^{q-1}0$. Define $L'(k)$ to be the maximal length of a word from $(1^*0)^k1^*$ that is not detected by $Q$ and starts with $1^{q-1}0$. Since prefix $1^{q-1}0$ implies no additional constraint on the rest of the word, we have $L'(k) = q + L(k-1)$. Observe that $L'(1) = p + 2q$ (word $1^{q-1}01^{p+q}$). To summarize, we have the following recurrences for $k \ge 2$:
\[ L'(k) = q + L(k-1), \tag{1} \]
\[ L(k) = \max\{p + L(k-1),\; p + q + 1 + L'(k-1)\}, \tag{2} \]
with initial conditions $L'(1) = p + 2q$, $L(1) = 2p + q$.
Two cases should be distinguished. If $p \ge 2q + 1$, then the straightforward induction shows that the first term in (2) is always greater, and we have
\[ L(k) = (k+1)p + q, \tag{3} \]
and the corresponding longest word is
\[ S(k) = (1^{p-1}0)^k 1^{p+q}. \tag{4} \]
If $q \le p \le 2q + 1$, then by induction, we obtain
$$L(k) = \begin{cases} (\ell+1)p + (k+1)q + \ell & \text{if } k = 2\ell, \\ (\ell+2)p + kq + \ell & \text{if } k = 2\ell + 1, \end{cases} \tag{5}$$
and
$$S(k) = \begin{cases} (1^{p+q}01^{q-1}0)^{\ell}\,1^{p+q} & \text{if } k = 2\ell, \\ 1^{p-1}0\,(1^{p+q}01^{q-1}0)^{\ell}\,1^{p+q} & \text{if } k = 2\ell + 1. \end{cases} \tag{6}$$
By definition of $L(k)$, seed $\#^{p}\,\texttt{-}\,\#^{q}$ detects any word from $(1^*0)^k1^*$ of length $L(k) + 1$ or more, and this is the tight bound. Therefore, we have to find $p, q$ which minimize $L(k)$. Recall that $p + q = d$, and observe that for $p \ge 2q + 1$, $L(k)$ (defined by (3)) is increasing in $p$, while for $p \le 2q + 1$, $L(k)$ (defined by (5)) is decreasing in $p$. Therefore, both functions reach their minimum when $p = 2q + 1$. Therefore, if $d \equiv 1 \pmod 3$, we obtain $q = \lfloor d/3 \rfloor$ and $p = d - q$. If $d \equiv 0 \pmod 3$, a routine computation shows that the minimum is reached at $q = d/3$, $p = 2d/3$, and if $d \equiv 2 \pmod 3$, the minimum is reached at $q = \lceil d/3 \rceil$, $p = d - q$. Putting the three cases together results in $q = [d/3]$, $p = d - q$. □
To illustrate Theorem 1, seed ####-## is optimal among all seeds of weight 6 with one joker. This means that this seed solves the $(m,2)$-problem for all $m \ge 16$ and this is the smallest possible bound over all seeds of this class. Similarly, this seed solves the $(m,3)$-problem for all $m \ge 20$, which is the best possible bound, etc.
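The illustration above can be checked by brute force for small parameters. The following is a minimal sketch, not the authors' algorithm: it enumerates every placement of $k$ mismatches and tests whether the seed fits somewhere. The function names (`detects`, `solves`, `smallest_solved_length`) are hypothetical, seeds are written with `#` for a matching position and `-` for a joker, and positions are 0-based.

```python
from itertools import combinations

def detects(seed, zeros, m):
    """True if `seed` can be placed in a length-m similarity whose
    mismatches sit at the 0-based positions in `zeros`."""
    matching = [i for i, c in enumerate(seed) if c == "#"]
    bad = set(zeros)
    return any(
        all(start + p not in bad for p in matching)
        for start in range(m - len(seed) + 1)
    )

def solves(seed, m, k):
    """True if `seed` detects every similarity of length m with k mismatches.
    Checking exactly k mismatches suffices: fewer mismatches are easier."""
    return all(detects(seed, zeros, m) for zeros in combinations(range(m), k))

def smallest_solved_length(seed, k, m_max=40):
    """Smallest m <= m_max such that `seed` solves the (m, k)-problem."""
    for m in range(len(seed), m_max + 1):
        if solves(seed, m, k):
            return m
    return None

if __name__ == "__main__":
    # Compare the three one-joker seeds of weight 6.
    for seed in ("#####-#", "####-##", "###-###"):
        print(seed, [smallest_solved_length(seed, k) for k in (2, 3)])
```

The enumeration is exponential in $k$ and is only meant for sanity checks on small instances such as the weight-6 seeds discussed here.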
4.2 Regular Expansion and Contraction of Seeds
We now show that seeds solving larger problems can be
obtained from seeds solving smaller problems, and vice
versa, using regular expansion and regular contraction
operations.
Given a seed $Q$, its $i$-regular expansion $i \otimes Q$ is obtained by multiplying each matching position by $i$. This is equivalent to inserting $i - 1$ jokers between every two successive positions along the seed. For example, if $Q = \{0, 2, 3, 5\}$ (or #-##-#), then the 2-regular expansion of $Q$ is $2 \otimes Q = \{0, 4, 6, 10\}$ (or #---#-#---#).
Given a family $F$, its $i$-regular expansion $i \otimes F$ is the family obtained by applying the $i$-regular expansion on each seed of $F$.
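As a small illustration of the definition, the expansion and its inverse can be written directly on the string representation. This is a sketch under the assumption that a seed is a string over `#` and `-` starting with a matching position; the helper names are hypothetical.

```python
def from_positions(positions):
    """Build the seed string whose matching positions are `positions`."""
    chars = ["-"] * (max(positions) + 1)
    for p in positions:
        chars[p] = "#"
    return "".join(chars)

def regular_expansion(seed, i):
    """i-regular expansion: multiply every matching position by i."""
    return from_positions([p * i for p, c in enumerate(seed) if c == "#"])

def regular_contraction(seed, i):
    """Inverse of the i-regular expansion.
    Raises ValueError if some matching position is not a multiple of i."""
    positions = [p for p, c in enumerate(seed) if c == "#"]
    if any(p % i for p in positions):
        raise ValueError("seed is not an i-regular expansion")
    return from_positions([p // i for p in positions])

if __name__ == "__main__":
    print(regular_expansion("#-##-#", 2))
```

Applying `regular_expansion` to each seed of a list gives the expansion of a family.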
Lemma 1. If a family $F$ solves an $(m,k)$-problem, then the $(im, (i+1)k - 1)$-problem is solved both by family $F$ and by its $i$-regular expansion $F_i = i \otimes F$.
Proof. Consider an $(im, (i+1)k - 1)$-similarity $w$. By the pigeonhole principle, it contains at least one substring of length $m$ with $k$ mismatches or less and, therefore, $F$ solves the $(im, (i+1)k - 1)$-problem. On the other hand, consider $i$ disjoint subsequences of $w$, each one consisting of $m$ positions equal modulo $i$. Again, by the pigeonhole principle, at least one of them contains $k$ mismatches or less and, therefore, the $(im, (i+1)k - 1)$-problem is solved by $i \otimes F$. □
The following lemma is the inverse of Lemma 1. It states that if seeds solving a bigger problem have a regular structure, then a solution for a smaller problem can be obtained by the regular contraction operation, inverse to the regular expansion.
Lemma 2. If a family $F_i = i \otimes F$ solves an $(im, k)$-problem, then $F$ solves both the $(im, k)$-problem and the $(m, \lfloor k/i \rfloor)$-problem.
Proof. One can even show that $F$ solves the $(im, k)$-problem with the additional restriction for $F$ to match inside one of the position intervals $[1..m], [m+1..2m], \ldots, [(i-1)m+1..im]$. This is done by using the bijective mapping from Lemma 1: Given an $(im, k)$-similarity $w$, consider $i$ disjoint subsequences $w_j$ ($0 \le j \le i - 1$) of $w$ obtained by picking $m$ positions equal to $j$ modulo $i$, and then consider the concatenation $w' = w_1 w_2 \ldots w_{i-1} w_0$.
For every $(im, k)$-similarity $w'$, its inverse image $w$ is detected by $F_i$, and therefore $F$ detects $w'$ at one of the intervals
$[1..m], [m+1..2m], \ldots, [(i-1)m+1..im]$.
Furthermore, for any $(m, \lfloor k/i \rfloor)$-similarity $v$, consider $w' = v^i$ and its inverse image $w$. As $w'$ is detected by $F_i$, $v$ is detected by $F$. □
Example 1. To illustrate the two lemmas above, we give the following example pointed out in [1]. The following two seeds are the only seeds of weight 12 that solve the $(50,5)$-problem:
#-#-#---#-----#-#-#---#-----#-#-#---#
and
###-#--###-#--###-#.
The first one is the 2-regular expansion of the second. The second one is the only seed of weight 12 that solves the $(25,2)$-problem.
In some cases, the regular expansion makes it possible to obtain an
efficient solution for a larger problem by reducing it to a
smaller problem for which an optimal or a near-optimal
solution is known.
4.3 Periodic Seeds
In this section, we study seeds with a periodic structure that
can be obtained by iterating a smaller seed. Such seeds often
turn out to be among maximally weighted seeds solving a
given $(m,k)$-problem. Interestingly, this contrasts with the
lossy framework where optimal seeds usually have a
“random” irregular structure.
Consider two seeds $Q_1, Q_2$ represented as words over $\{\#, \texttt{-}\}$. In this section, we lift the assumption that a seed must start and end with a matching position. We denote by $[Q_1, Q_2]^i$ the seed defined as $(Q_1Q_2)^iQ_1$. For example, [###-#, --]^2 = ###-#--###-#--###-#.
We also need a modification of the $(m,k)$-problem, where $(m,k)$-similarities are considered modulo a cyclic permutation. We say that a seed family $F$ solves a cyclic $(m,k)$-problem, if for every $(m,k)$-similarity $w$, $F$ detects one of the cyclic permutations of $w$. Trivially, if $F$ solves an $(m,k)$-problem, it also solves the cyclic $(m,k)$-problem. To distinguish it from a cyclic problem, we sometimes call an $(m,k)$-problem a linear problem.
KUCHEROV ET AL.: MULTISEED LOSSLESS FILTRATION 55
We first restrict ourselves to the single-seed case. The following lemma demonstrates that iterating smaller seeds solving a cyclic problem makes it possible to obtain a solution for bigger problems, for the same number of mismatches.
Lemma 3. If a seed $Q$ solves a cyclic $(m,k)$-problem, then for every $i \ge 0$, the seed $Q_i = [Q, \texttt{-}^{(m - s(Q))}]^i$ solves the linear $(m \cdot (i+1) + s(Q) - 1, k)$-problem. If $i \ne 0$, the inverse holds too.
Proof. ($\Rightarrow$) Consider an $(m \cdot (i+1) + s(Q) - 1, k)$-similarity $u$. Transform $u$ into a similarity $u'$ for the cyclic $(m,k)$-problem as follows: For each mismatch position $\ell$ of $u$, set 0 at position $(\ell \bmod m)$ in $u'$. The other positions of $u'$ are set to 1. Clearly, there are at most $k$ 0s in $u'$. As $Q$ solves the cyclic $(m,k)$-problem, we can find at least one position $j$, $1 \le j \le m$, such that $Q$ detects $u'$ cyclically.
We show now that $Q_i$ matches at position $j$ of $u$ (which is a valid position as $1 \le j \le m$ and $s(Q_i) = im + s(Q)$). As the positions of 1 in $u$ are projected modulo $m$ to matching positions of $Q$, there is no 0 under any matching element of $Q_i$ and, thus, $Q_i$ detects $u$.
($\Leftarrow$) Consider a seed $Q_i = [Q, \texttt{-}^{(m - s(Q))}]^i$ solving the $(m \cdot (i+1) + s(Q) - 1, k)$-problem. As $i > 0$, consider $(m \cdot (i+1) + s(Q) - 1, k)$-similarities having all their mismatches located inside the interval $[m, 2m - 1]$. For each such similarity, there exists a position $j$, $1 \le j \le m$, such that $Q_i$ detects it. Note that the span of $Q_i$ is at least $m + s(Q)$, which implies that there is either an entire occurrence of $Q$ inside the window $[m, 2m - 1]$, or a prefix of $Q$ matching a suffix of the window and the complementary suffix of $Q$ matching a prefix of the window. This implies that $Q$ solves the cyclic $(m,k)$-problem. □
Example 2. Observe that the seed ###-# solves the cyclic $(7,2)$-problem. From Lemma 3, this implies that for every $i \ge 0$, the $(11 + 7i, 2)$-problem is solved by the seed [###-#, --]^i of span $5 + 7i$. Moreover, for $i = 1, 2, 3$, this seed is optimal (maximally weighted) over all seeds solving the problem.
By a similar argument based on Lemma 3, the periodic seed [#####-##, ---]^i solves the $(18 + 11i, 2)$-problem. Note that its weight grows as $\frac{7}{11}m$ compared to $\frac{4}{7}m$ for the seed from the previous paragraph. However, when $m \to \infty$, this is not an asymptotically optimal bound, as we will see later.
The $(18 + 11i, 3)$-problem is solved by the seed [###-#--#, ---]^i, as seed ###-#--# solves the cyclic $(11,3)$-problem. For $i = 1, 2$, the former is a maximally weighted seed among all solving the $(18 + 11i, 3)$-problem.
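The first claim of Example 2 is small enough to verify exhaustively. The sketch below is an illustration only, with hypothetical function names: it builds $[Q_1, Q_2]^i$ as a string, checks the cyclic $(7,2)$-problem for ###-#, and then checks the linear $(11 + 7i, 2)$-problems that Lemma 3 derives from it. Positions are 0-based and a seed shorter than $m$ is implicitly padded with jokers in the cyclic check.

```python
from itertools import combinations

def iterate(q1, q2, i):
    """The seed [q1, q2]^i = (q1 q2)^i q1."""
    return (q1 + q2) * i + q1

def matching_positions(seed):
    return [p for p, c in enumerate(seed) if c == "#"]

def solves_cyclic(seed, m, k):
    """True if, for every choice of k mismatches on a cycle of length m,
    some cyclic shift of `seed` has no matching position on a mismatch."""
    pos = matching_positions(seed)
    return all(
        any(all((s + p) % m not in zeros for p in pos) for s in range(m))
        for zeros in map(set, combinations(range(m), k))
    )

def solves_linear(seed, m, k):
    """True if `seed` detects every linear similarity of length m
    with k mismatches."""
    pos = matching_positions(seed)
    starts = range(m - len(seed) + 1)
    return all(
        any(all(s + p not in zeros for p in pos) for s in starts)
        for zeros in map(set, combinations(range(m), k))
    )

if __name__ == "__main__":
    q, m, k = "###-#", 7, 2
    print("cyclic:", solves_cyclic(q, m, k))
    for i in range(3):
        seed = iterate(q, "-" * (m - len(q)), i)
        print(seed, solves_linear(seed, m * (i + 1) + len(q) - 1, k))
```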
One question raised by these examples is whether
iterating some seed could provide an asymptotically
optimal solution, i.e., a seed of maximal asymptotic weight. The following theorem establishes a tight asymptotic bound
on the weight of an optimal seed, for a fixed number of
mismatches. It gives a negative answer to this question, as it
shows that the maximal weight grows faster than any linear
fraction of the similarity size.
Theorem 2. Consider a constant $k$. Let $w(m)$ be the maximal weight of a seed solving the cyclic $(m,k)$-problem. Then, $(m - w(m)) = \Theta(m^{\frac{k-1}{k}})$.
Proof. Note first that all seeds solving a cyclic $(m,k)$-problem can be considered as seeds of span $m$. The number of jokers in any seed $Q$ is then $n = m - w(Q)$. The theorem states that the minimal number of jokers of a seed solving the $(m,k)$-problem is $\Theta(m^{\frac{k-1}{k}})$ for every fixed $k$.
Lower bound. Consider a cyclic $(m,k)$-problem. The number $D(m,k)$ of distinct cyclic $(m,k)$-similarities satisfies
$$\frac{\binom{m}{k}}{m} \le D(m,k), \tag{7}$$
as every linear $(m,k)$-similarity has at most $m$ cyclically equivalent ones. Consider a seed $Q$. Let $n$ be the number of jokers in $Q$ and $J_Q(m,k)$ the number of distinct cyclic $(m,k)$-similarities detected by $Q$. Observe that $J_Q(m,k) \le \binom{n}{k}$ and if $Q$ solves the cyclic $(m,k)$-problem, then
$$D(m,k) = J_Q(m,k) \le \binom{n}{k}. \tag{8}$$
From (7) and (8), we have
$$\frac{\binom{m}{k}}{m} \le \binom{n}{k}. \tag{9}$$
Using the Stirling formula, this gives $n(k) = \Omega(m^{\frac{k-1}{k}})$.
Upper bound. To prove the upper bound, we construct a seed $Q$ that has no more than $k \cdot m^{\frac{k-1}{k}}$ joker positions and solves the cyclic $(m,k)$-problem.
We start with the seed $Q_0$ of span $m$ with all matching positions, and introduce jokers into it in $k$ steps. After step $i$, the obtained seed is denoted $Q_i$, and $Q = Q_k$.
Let $B = \lceil m^{\frac{1}{k}} \rceil$. $Q_1$ is obtained by introducing into $Q_0$ individual jokers with periodicity $B$ by placing jokers at positions $1, B+1, 2B+1, \ldots$. At step 2, we introduce into $Q_1$ contiguous intervals of jokers of length $B$ with periodicity $B^2$, such that jokers are placed at positions $[1 \ldots B], [B^2+1 \ldots B^2+B], [2B^2+1 \ldots 2B^2+B], \ldots$.
In general, at step $i$ ($i \le k$), we introduce into $Q_{i-1}$ intervals of $B^{i-1}$ jokers with periodicity $B^i$ at positions $[1 \ldots B^{i-1}], [B^i+1 \ldots B^i+B^{i-1}], \ldots$ (see Fig. 1).
Note that $Q_i$ is periodic with periodicity $B^i$. Note also that at each step $i$, we introduce at most $\lfloor m^{1-\frac{i}{k}} \rfloor$ intervals of $B^{i-1}$ jokers. Moreover, due to overlaps with already added jokers, each interval adds $(B-1)^{i-1}$ new jokers. This implies that the total number of jokers added at step $i$ is at most $m^{1-\frac{i}{k}} \cdot (B-1)^{i-1} \le m^{1-\frac{i}{k}} \cdot m^{\frac{1}{k}(i-1)} = m^{\frac{k-1}{k}}$. Thus, the total number of jokers in $Q$ is less than $k \cdot m^{\frac{k-1}{k}}$.
By induction on $i$, we prove that for any $(m,i)$-similarity $u$ ($i \le k$), $Q_i$ detects $u$ cyclically, that is, there is a cyclic shift of $Q_i$ such that all $i$ mismatches of $u$ are covered with jokers introduced at steps $1, \ldots, i$.
For $i = 1$, the statement is obvious, as we can always cover the single mismatch by shifting $Q_1$ by at most $(B - 1)$ positions. Assuming that the statement holds for $(i - 1)$, we show now that it holds for $i$ too. Consider an $(m,i)$-similarity $u$. Select one mismatch of $u$. By the induction hypothesis, the other $(i - 1)$ mismatches can be covered by $Q_{i-1}$. Since $Q_{i-1}$ has period $B^{i-1}$ and $Q_i$ differs from $Q_{i-1}$ by having at least one contiguous interval of $B^{i-1}$ jokers, we can always shift $Q_i$ by $j \cdot B^{i-1}$ positions such that the selected mismatch falls into this interval. This shows that $Q_i$ detects $u$. We conclude that $Q$ solves the cyclic $(m,k)$-problem. □
Using Theorem 2, we obtain the following bound on the number of jokers for the linear $(m,k)$-problem.
Lemma 4. Consider a constant $k$. Let $w(m)$ be the maximal weight of a seed solving the linear $(m,k)$-problem. Then, $(m - w(m)) = \Theta(m^{\frac{k}{k+1}})$.
Proof. To prove the upper bound, we construct a seed $Q$ that solves the linear $(m,k)$-problem and satisfies the asymptotic bound. Consider some $l < m$ that will be defined later, and let $P$ be a seed that solves the cyclic $(l,k)$-problem. Without loss of generality, we assume $s(P) = l$.
For a real number $e \ge 1$, define $P^e$ to be the maximally weighted seed of span at most $le$ of the form $P' \cdot P \cdots P \cdot P''$, where $P'$ and $P''$ are, respectively, a suffix and a prefix of $P$. Due to the condition of maximal weight, $w(P^e) \ge e \cdot w(P)$.
We now set $Q = P^e$ for some real $e$ to be defined. Observe that if $e \cdot l \le m - l$, then $Q$ solves the linear $(m,k)$-problem. Therefore, we set $e = \frac{m-l}{l}$.
From the proof of Theorem 2, we have $l - w(P) \le k \cdot l^{\frac{k-1}{k}}$. We then have
$$w(Q) = e \cdot w(P) \ge \frac{m - l}{l} \cdot (l - k \cdot l^{\frac{k-1}{k}}). \tag{10}$$
If we set
$$l = m^{\frac{k}{k+1}}, \tag{11}$$
we obtain
$$m - w(Q) \le (k+1)\,m^{\frac{k}{k+1}} - k\,m^{\frac{k-1}{k+1}}, \tag{12}$$
and as $k$ is constant,
$$m - w(Q) = O(m^{\frac{k}{k+1}}). \tag{13}$$
The lower bound is obtained similarly to Theorem 2. Let $Q$ be a seed solving a linear $(m,k)$-problem, and let $n = m - w(Q)$. From simple combinatorial considerations, we have
$$\binom{m}{k} \le \binom{n}{k} \cdot (m - s(Q)) \le \binom{n}{k} \cdot n, \tag{14}$$
which implies $n = \Omega(m^{\frac{k}{k+1}})$ for constant $k$. □
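The upper-bound construction from the proof of Theorem 2 can be reproduced on small instances. The sketch below makes one simplifying assumption that is not in the text: it takes $m = B^k$ exactly, so that every period divides $m$ and the cyclic shifts used in the induction are exact. Function names are hypothetical and joker positions are 0-based (the proof uses 1-based positions).

```python
from itertools import combinations

def theorem2_jokers(B, k):
    """Joker positions (0-based) of Q_k for m = B**k: at step i, intervals
    of B**(i-1) jokers are placed with periodicity B**i."""
    m = B ** k
    jokers = set()
    for i in range(1, k + 1):
        block, period = B ** (i - 1), B ** i
        for start in range(0, m, period):
            jokers.update(range(start, start + block))
    return m, jokers

def solves_cyclic(jokers, m, k):
    """True if every set of k mismatches on a cycle of length m can be
    covered by jokers after some cyclic shift of the seed."""
    return all(
        any(all((z - s) % m in jokers for z in zeros) for s in range(m))
        for zeros in combinations(range(m), k)
    )

if __name__ == "__main__":
    for B, k in ((4, 2), (3, 3)):
        m, jokers = theorem2_jokers(B, k)
        bound = k * B ** (k - 1)  # equals k * m^((k-1)/k) when m = B^k
        print(m, k, len(jokers), bound, solves_cyclic(jokers, m, k))
```

For $m = 16$, $k = 2$ the construction yields 7 jokers against the bound $k \cdot m^{(k-1)/k} = 8$.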
The following simple lemma is also useful for constructing efficient seeds.
Lemma 5. Assume that a family $F$ solves an $(m,k)$-problem. Let $F'$ be the family obtained from $F$ by cutting out $l$ characters from the left and $r$ characters from the right of each seed of $F$. Then $F'$ solves the $(m - r - l, k)$-problem.
Example 3. The $(9 + 7i, 2)$-problem is solved by the seed [###, -#--]^i, which is optimal for $i = 1, 2, 3$. Using Lemma 5, this seed can be immediately obtained from the seed [###-#, --]^i from Example 2, solving the $(11 + 7i, 2)$-problem.
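Example 3 can be replayed for $i = 1$ with a few lines of code. This is a sketch with hypothetical helper names: it trims two characters from the right of the seed of Example 2 and checks by exhaustive enumeration that the result solves the $(16,2)$-problem, as Lemma 5 predicts.

```python
from itertools import combinations

def iterate(q1, q2, i):
    """The seed [q1, q2]^i = (q1 q2)^i q1."""
    return (q1 + q2) * i + q1

def trim(seed, left, right):
    """Cut `left` characters from the left and `right` from the right."""
    return seed[left:len(seed) - right]

def solves_linear(seed, m, k):
    """Exhaustive check that `seed` solves the linear (m, k)-problem."""
    pos = [p for p, c in enumerate(seed) if c == "#"]
    starts = range(m - len(seed) + 1)
    return all(
        any(all(s + p not in zeros for p in pos) for s in starts)
        for zeros in map(set, combinations(range(m), k))
    )

if __name__ == "__main__":
    long_seed = iterate("###-#", "--", 1)   # solves the (18,2)-problem
    short_seed = trim(long_seed, 0, 2)      # candidate for the (16,2)-problem
    print(short_seed, solves_linear(short_seed, 16, 2))
```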
We now apply the above results for the single-seed case to the case of multiple seeds.
For a seed $Q$ considered as a word over $\{\#, \texttt{-}\}$, we denote by $Q[i]$ its cyclic shift to the left by $i$ characters. For example, if $Q$ = ####-#-##--, then $Q[5]$ = #-##--####-. The following lemma gives a way to construct seed families solving bigger problems from an individual seed solving a smaller cyclic problem.
Lemma 6. Assume that a seed $Q$ solves a cyclic $(m,k)$-problem and assume that $s(Q) = m$ (otherwise, we pad $Q$ on the right with $(m - s(Q))$ jokers). Fix some $i > 1$. For some $L > 0$, consider a list of $L$ integers $0 \le j_1 < \cdots < j_L < m$, and define a family of seeds $F = \langle \|(Q[j_l])^i\| \rangle_{l=1}^{L}$, where $\|(Q[j_l])^i\|$ stands for the seed obtained from $(Q[j_l])^i$ by deleting the joker characters at the left and right edges. Define $\delta(l) = ((j_{l-1} - j_l) \bmod m)$ (or, alternatively, $\delta(l) = ((j_l - j_{l-1}) \bmod m)$) for all $l$, $1 \le l \le L$. Let $m' = \max\{s(\|(Q[j_l])^i\|) + \delta(l)\}_{l=1}^{L} - 1$. Then, $F$ solves the $(m',k)$-problem.
Proof. The proof is an extension of the proof of Lemma 3. Here, the seeds of the family are constructed in such a way that for any instance of the linear $(m',k)$-problem, there exists at least one seed that satisfies the property required in the proof of Lemma 3 and, therefore, matches this instance. □
In applying Lemma 6, integers $j_l$ are chosen from the interval $[0, m]$ in such a way that the values $s(\|(Q[j_l])^i\|) + \delta(l)$ are close to each other. We illustrate Lemma 6 with the two examples that follow.
Fig. 1. Construction of seeds $Q_i$ from the proof of Theorem 2. Jokers are represented in white and matching positions in black.
Example 4. Let $m = 11$, $k = 2$. Consider the seed $Q$ = ####-#-##-- solving the cyclic $(11,2)$-problem. Choose $i = 2$, $L = 2$, $j_1 = 0$, $j_2 = 5$. This gives two seeds:
$Q_1 = \|(Q[0])^2\|$ = ####-#-##--####-#-##
and
$Q_2 = \|(Q[5])^2\|$ = #-##--####-#-##--####
of span 20 and 21, respectively, with $\delta(1) = 6$ and $\delta(2) = 5$. Since $\max\{20 + 6, 21 + 5\} - 1 = 25$, family $F = \{Q_1, Q_2\}$ solves the $(25,2)$-problem.
Example 5. Let $m = 11$, $k = 3$. The seed $Q$ = ###-#--#--- solves the cyclic $(11,3)$-problem. Choose $i = 2$, $L = 2$, $j_1 = 0$, $j_2 = 4$. The two seeds are
$Q_1 = \|(Q[0])^2\|$ = ###-#--#---###-#--#
(span 19) and
$Q_2 = \|(Q[4])^2\|$ = #--#---###-#--#---###
(span 21), with $\delta(1) = 7$ and $\delta(2) = 4$. Since $\max\{19 + 7, 21 + 4\} - 1 = 25$, family $F = \{Q_1, Q_2\}$ solves the $(25,3)$-problem.
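The construction of Lemma 6 is mechanical and can be sketched in a few lines. The code below is an illustration with hypothetical function names. It assumes the convention $\delta(l) = (j_l - j_{l-1}) \bmod m$ with $j_0 = j_L$, which is the one that reproduces the values of $\delta$ quoted in Examples 4 and 5, and it verifies the $(25,2)$ claim of Example 4 by exhaustive enumeration.

```python
from itertools import combinations

def lemma6_family(q, js, i):
    """Seeds ||(Q[j])^i|| for j in js, together with m' from Lemma 6.
    `q` must already be padded with jokers to span m = len(q)."""
    m = len(q)
    # Cyclic shift to the left by j, repeat i times, strip edge jokers.
    seeds = [((q[j:] + q[:j]) * i).strip("-") for j in js]
    # delta(l) = (j_l - j_{l-1}) mod m; js[-1] plays the role of j_0 = j_L.
    deltas = [(js[l] - js[l - 1]) % m for l in range(len(js))]
    m_prime = max(len(s) + d for s, d in zip(seeds, deltas)) - 1
    return seeds, m_prime

def family_solves(seeds, m, k):
    """True if every length-m similarity with k mismatches is detected
    by at least one seed of the family."""
    placements = [
        [{s + p for p, c in enumerate(seed) if c == "#"}
         for s in range(m - len(seed) + 1)]
        for seed in seeds
    ]
    return all(
        any(not (cover & zeros) for per_seed in placements for cover in per_seed)
        for zeros in map(set, combinations(range(m), k))
    )

if __name__ == "__main__":
    seeds, m_prime = lemma6_family("####-#-##--", [0, 5], 2)
    print(seeds, m_prime, family_solves(seeds, m_prime, 2))
```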
4.4 Heuristic Seed Design
Results of Sections 4.1, 4.2, and 4.3 allow one to construct efficient seed families in certain cases, but still do not allow a systematic seed design. Recently, linear programming approaches to designing efficient seed families were proposed in [19] and in [18], respectively, for DNA and protein similarity search. However, neither of these methods aims at constructing lossless families.
In this section, we outline a heuristic genetic programming algorithm for designing lossless seed families. The algorithm will be used in the experimental part of this work, which we present in the next section. Note that this algorithm uses the dynamic programming algorithms discussed in Section 3. Since the algorithm uses standard genetic programming techniques, we give only a high-level description here without going into all details.
The algorithm tries to iteratively improve characteristics of a population of seed families until it finds a small family that detects all $(m,k)$-similarities (i.e., is lossless). The first step of each iteration is based on screening current families against a set of difficult similarities, that is, similarities that have been detected by fewer families. This set is continually reordered and updated according to the number of families that do not detect those similarities. For this, each set is stored in a tree and the reordering is done using the list-as-a-tree principle [20]: Each time a similarity is not detected by a family, it is moved towards the root of the tree such that its height is divided by two.
For those families that pass through the screening, the number of undetected similarities is computed by the dynamic programming algorithm of Section 3.2. The family is kept if it produces a smaller number than the families currently known. An undetected similarity obtained during this computation is added as a leaf to the tree of difficult similarities.
To detect seeds to be improved inside a family, we compute the contribution of each seed by the dynamic programming algorithm of Section 3.3. The seeds with the least contribution are then modified with a higher probability. In general, the population of seed families evolves by mutation and crossover according to the set of similarities they do not detect. Moreover, random seed families are regularly injected into the population in order to avoid local optima.
The described heuristic procedure often allows efficient or even optimal solutions to be computed in a reasonable time. For example, in 10 runs of the algorithm, we found three of the six existing families of two seeds of weight 14 solving the $(25,2)$-problem. The whole computation took less than 1 hour, compared to a week of computation needed to exhaustively test all seed pairs. Note that the randomized-greedy approach (incremental completion of the seed set by adding the best random seed) applied a dozen times to the same problem yielded only sets of three and sometimes four, but never two seeds, taking about 1 hour at each run.
5 EXPERIMENTS
We describe two groups of experiments that we made. The first one concerns the design of efficient seed families, and the second one applies a multiseed lossless filtration to the identification of unique oligos in a large set of EST sequences.
5.1 Seed Design Experiments
We considered several $(m,k)$-problems. For each problem, and for a fixed number of seeds in the family, we computed families solving the problem and realizing the largest possible seed weight (under a natural assumption that all seeds in a family have the same weight). We also kept track of the ways (periodic seeds, genetic programming heuristics, exhaustive search) in which those families can be computed.
Tables 1 and 2 summarize some results obtained for the $(25,2)$-problem and the $(25,3)$-problem, respectively. Families of periodic seeds (that can be found using Lemma 6) are marked with p, those that are found using a genetic algorithm are marked with g, and those which are obtained by an exhaustive search are marked with e. Only in this latter case are the families guaranteed to be optimal. Families of periodic seeds are shifted according to their construction (see Lemma 6).
Moreover, to compare the selectivity of different families solving a given $(m,k)$-problem, we estimated the probability $\delta$ for at least one of the seeds of the family to match at a given position of a uniform Bernoulli four-letter sequence. This has been done using the inclusion-exclusion formula.
Note that the simple fact of passing from a single seed to a two-seed family results in a considerable gain in efficiency: In both examples shown in the tables, there is a change of about one order of magnitude in the selectivity estimator $\delta$.
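One way to read the inclusion-exclusion estimate is sketched below. This is an interpretation, not the authors' code: it assumes that all seeds of the family are anchored at the same start position, that letters are independent and uniform over four symbols, and that a seed matches when the two compared letters agree at each of its matching positions. Under these assumptions a set of seeds matches simultaneously with probability $4^{-u}$, where $u$ is the size of the union of their matching positions. The function name `selectivity` is hypothetical.

```python
from fractions import Fraction
from itertools import combinations

def selectivity(seeds, alphabet_size=4):
    """Probability that at least one seed matches at a fixed position,
    computed exactly by inclusion-exclusion over subsets of seeds."""
    position_sets = [
        frozenset(i for i, c in enumerate(seed) if c == "#") for seed in seeds
    ]
    total = Fraction(0)
    for r in range(1, len(position_sets) + 1):
        sign = 1 if r % 2 == 1 else -1
        for subset in combinations(position_sets, r):
            # All seeds in the subset match iff every position in the
            # union of their matching positions is a letter match.
            union = frozenset().union(*subset)
            total += sign * Fraction(1, alphabet_size ** len(union))
    return total

if __name__ == "__main__":
    print(float(selectivity(["####-#-##--####-#-##", "#-##--####-#-##--####"])))
```

The loop visits every nonempty subset of the family, so it is practical only for families of a handful of seeds, which is the regime considered here.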
5.2 Oligo Selection Using Multiseed Filtering
An important practical application of lossless filtration is the selection of reliable oligonucleotides for DNA microarray experiments. Oligonucleotides (oligos) are small DNA sequences of fixed size (usually ranging from 10 to 50) designed to hybridize only with a specific region of the genome sequence. In microarray experiments, oligos are expected to match ESTs that stem from a given gene and not to match those of other genes. As a first approximation, the problem of oligo selection can then be formulated as the search for strings of a fixed length that occur in a given sequence but do not occur, within a specified distance, in other sequences of a given (possibly very large) sample. Different approaches to this problem apply different distance measures and different algorithmic techniques [21], [22], [23], [24]. The experiments we briefly present here demonstrate that multiseed filtering provides an efficient computation of candidate oligonucleotides. These should then be further processed by complementary methods in order to take into account other physico-chemical factors occurring in hybridisation, such as the melting temperature or the possible hairpin structure of palindromic oligos.
Here, we adopt the formalization of the oligo selection problem as the problem of identifying in a given sequence (or a sequence database) all substrings of length $m$ that have no occurrences elsewhere in the sequence within the Hamming distance $k$. The parameters $m$ and $k$ were set to 32 and 5, respectively. For the $(32,5)$-problem, different seed families were designed and their selectivity was estimated. Those are summarized in the table in Fig. 2, using the same conventions as in Tables 1 and 2 above. The family composed of six seeds of weight 11 was selected for the filtration experiment (shown in Fig. 2).
The filtering has been applied to a database of rice EST sequences composed of 100,015 sequences for a total length of 42,845,242 bp [footnote 1]. Substrings matching other substrings with five substitution errors or less were computed. The computation took slightly more than one hour on a Pentium 4 3 GHz computer. Before applying the filtering using the family for the $(32,5)$-problem, we made a rough prefiltering using one spaced seed of weight 16 to detect, with a high selectivity, almost identical regions. Sixty-five percent of the database has been discarded by this prefiltering. Another 22 percent of the database has been filtered out using the chosen seed family, leaving the remaining 13 percent as oligo candidates.
Footnote 1. Source: http://bioserver.myongji.ac.kr/ricemac.html, The Korea Rice Genome Database.
TABLE 1. Seed Families for (25,2)-Problem
TABLE 2. Seed Families for (25,3)-Problem
6 CONCLUSION
In this paper, we studied a lossless filtration method based
on multiseed families and demonstrated that it represents
an improvement compared to the single-seed approach
considered in [1]. We showed how some important
characteristics of seed families can be computed using
dynamic programming. We presented several combinatorial
results that allow one to construct efficient families
composed of seeds with a periodic structure. Finally, we
described a large-scale computational experiment of de-
signing reliable oligonucleotides for DNA microarrays. The
obtained experimental results provided evidence of the
applicability and efficiency of the whole method.
The results of Sections 4.1, 4.2, and 4.3 establish several
combinatorial properties of seed families, but many more of
them remain to be elucidated. The structure of optimal or
near-optimal seed families can be reduced to number-
theoretic questions, but this relation remains to be clearly
established. In general, constructing an algorithm to
systematically design seed families with quality guarantees
remains an open problem. Some complexity issues remain
open too: For example, what is the complexity of testing if a
single seed is lossless for given $m, k$? Section 3 implies a
time bound exponential in the number of jokers. Note that
for multiple seeds, computing the number of detected
similarities is NP-complete [16, Section 3.1].
Another direction is to consider different distance
measures, especially the Levenshtein distance, or at least to
allow some restricted insertion/deletion errors. The method
proposed in [25] does not seem to be easily generalized to
multiseed families, and further work is required to
improve lossless filtering in this case.
ACKNOWLEDGMENTS
G. Kucherov and L. Noe have been supported by the French Action Specifique “Algorithmes et Sequences” of CNRS. A part
of this work has been done during a stay of M. Roytberg at
LORIA, Nancy, supported by INRIA. M. Roytberg has been
supported by the Russian Foundation for Basic Research
(project nos. 03-04-49469, 02-07-90412) and by grants from
the RF Ministry for Industry, Science, and Technology (20/
2002, 5/2003) and NWO. An extended abstract of this work
has been presented to the Combinatorial Pattern Matching
Conference (Istanbul, July 2004).
REFERENCES
[1] S. Burkhardt and J. Karkkainen, “Better Filtering with Gapped q-Grams,” Fundamenta Informaticae, vol. 56, nos. 1-2, pp. 51-70, 2003, preliminary version in Combinatorial Pattern Matching 2001.
[2] G. Navarro and M. Raffinot, Flexible Pattern Matching in Strings: Practical On-Line Search Algorithms for Texts and Biological Sequences. Cambridge Univ. Press, 2002.
[3] S. Altschul, T. Madden, A. Schaffer, J. Zhang, Z. Zhang, W. Miller, and D. Lipman, “Gapped BLAST and PSI-BLAST: A New Generation of Protein Database Search Programs,” Nucleic Acids Research, vol. 25, no. 17, pp. 3389-3402, 1997.
[4] B. Ma, J. Tromp, and M. Li, “PatternHunter: Faster and More Sensitive Homology Search,” Bioinformatics, vol. 18, no. 3, pp. 440-445, 2002.
[5] S. Schwartz, J. Kent, A. Smit, Z. Zhang, R. Baertsch, R. Hardison, D. Haussler, and W. Miller, “Human-Mouse Alignments with BLASTZ,” Genome Research, vol. 13, pp. 103-107, 2003.
[6] L. Noe and G. Kucherov, “Improved Hit Criteria for DNA Local Alignment,” BMC Bioinformatics, vol. 5, no. 149, Oct. 2004.
[7] P. Pevzner and M. Waterman, “Multiple Filtration and Approximate Pattern Matching,” Algorithmica, vol. 13, pp. 135-154, 1995.
[8] A. Califano and I. Rigoutsos, “Flash: A Fast Look-Up Algorithm for String Homology,” Proc. First Int’l Conf. Intelligent Systems for Molecular Biology, pp. 56-64, July 1993.
[9] J. Buhler, “Provably Sensitive Indexing Strategies for Biosequence Similarity Search,” Proc. Sixth Ann. Int’l Conf. Computational Molecular Biology (RECOMB ’02), pp. 90-99, Apr. 2002.
[10] U. Keich, M. Li, B. Ma, and J. Tromp, “On Spaced Seeds for Similarity Search,” Discrete Applied Math., vol. 138, no. 3, pp. 253-263, 2004.
[11] J. Buhler, U. Keich, and Y. Sun, “Designing Seeds for Similarity Search in Genomic DNA,” Proc. Seventh Ann. Int’l Conf. Computational Molecular Biology (RECOMB ’03), pp. 67-75, Apr. 2003.
[12] B. Brejova, D. Brown, and T. Vinar, “Vector Seeds: An Extension to Spaced Seeds Allows Substantial Improvements in Sensitivity and Specificity,” Proc. Third Int’l Workshop Algorithms in Bioinformatics (WABI), pp. 39-54, Sept. 2003.
[13] G. Kucherov, L. Noe, and Y. Ponty, “Estimating Seed Sensitivity on Homogeneous Alignments,” Proc. IEEE Fourth Symp. Bioinformatics and Bioeng. (BIBE 2004), May 2004.
[14] K. Choi and L. Zhang, “Sensitivity Analysis and Efficient Method for Identifying Optimal Spaced Seeds,” J. Computer and System Sciences, vol. 68, pp. 22-40, 2004.
[15] M. Csuros, “Performing Local Similarity Searches with Variable Length Seeds,” Proc. 15th Ann. Combinatorial Pattern Matching Symp. (CPM), pp. 373-387, 2004.
Fig. 2. Computed seed families for the $(32,5)$-problem and the chosen family (six seeds of weight 11).
[16] M. Li, B. Ma, D. Kisman, and J. Tromp, “PatternHunter II: Highly Sensitive and Fast Homology Search,” J. Bioinformatics and Computational Biology, vol. 2, no. 3, pp. 417-440, Sept. 2004.
[17] Y. Sun and J. Buhler, “Designing Multiple Simultaneous Seeds for DNA Similarity Search,” Proc. Eighth Ann. Int’l Conf. Research in Computational Molecular Biology (RECOMB 2004), pp. 76-84, Mar. 2004.
[18] D.G. Brown, “Multiple Vector Seeds for Protein Alignment,” Proc. Fourth Int’l Workshop Algorithms in Bioinformatics (WABI), pp. 170-181, Sept. 2004.
[19] J. Xu, D. Brown, M. Li, and B. Ma, “Optimizing Multiple Spaced Seeds for Homology Search,” Proc. 15th Symp. Combinatorial Pattern Matching, pp. 47-58, 2004.
[20] J. Oommen and J. Dong, “Generalized Swap-with-Parent Schemes for Self-Organizing Sequential Linear Lists,” Proc. 1997 Int’l Symp. Algorithms and Computation (ISAAC ’97), pp. 414-423, Dec. 1997.
[21] F. Li and G. Stormo, “Selection of Optimal DNA Oligos for Gene Expression Arrays,” Bioinformatics, vol. 17, pp. 1067-1076, 2001.
[22] L. Kaderali and A. Schliep, “Selecting Signature Oligonucleotides to Identify Organisms Using DNA Arrays,” Bioinformatics, vol. 18, no. 10, pp. 1340-1349, 2002.
[23] S. Rahmann, “Fast Large Scale Oligonucleotide Selection Using the Longest Common Factor Approach,” J. Bioinformatics and Computational Biology, vol. 1, no. 2, pp. 343-361, 2003.
[24] J. Zheng, T. Close, T. Jiang, and S. Lonardi, “Efficient Selection of Unique and Popular Oligos for Large EST Databases,” Proc. 14th Ann. Combinatorial Pattern Matching Symp. (CPM), pp. 273-283, 2003.
[25] S. Burkhardt and J. Karkkainen, “One-Gapped q-Gram Filters for Levenshtein Distance,” Proc. 13th Symp. Combinatorial Pattern Matching (CPM ’02), vol. 2373, pp. 225-234, 2002.
Gregory Kucherov received the PhD degree in computer science in 1988 from the USSR Academy of Sciences, and a Habilitation degree in 2000 from the Henri Poincare University in Nancy. He is a senior INRIA researcher with the LORIA research unit in Nancy, France. For the last 10 years, he has been doing research on word combinatorics, text algorithms and combinatorial algorithms for bioinformatics, and computational biology.
Laurent Noe studied computer science at the ESIAL engineering school in Nancy, France. He received the MS degree in 2002 and is currently a PhD student in computational biology at LORIA.
Mikhail Roytberg received the PhD degree in computer science in 1983 from Moscow State University. He is a leader of the Computational Molecular Biology Group in the Institute of Mathematical Problems in Biology of the Russian Academy of Sciences at Pushchino, Russia. During the last years, his main research field has been the development of algorithms for comparative analysis of biological sequences.
Text Mining Biomedical Literature for Discovering Gene-to-Gene Relationships:
A Comparative Study of Algorithms
Ying Liu, Shamkant B. Navathe, Jorge Civera, Venu Dasigi, Ashwin Ram, Brian J. Ciliax, and Ray Dingledine
Abstract—Partitioning closely related genes into clusters has become an important element of practically all statistical analyses of
microarray data. A number of computer algorithms have been developed for this task. Although these algorithms have demonstrated their
usefulness for gene clustering, some basic problems remain. This paper describes our work on extracting functional keywords from
MEDLINE for a set of genes that are isolated for further study from microarray experiments based on their differential expression patterns.
The sharing of functional keywords among genes is used as a basis for clustering in a new approach called BEA-PARTITION in this paper.
Functional keywords associated with genes were extracted from MEDLINE abstracts. We modified the Bond Energy Algorithm (BEA),
which is widely accepted in psychology and database design but is virtually unknown in bioinformatics, to cluster genes by functional
keyword associations. The results showed that BEA-PARTITION and hierarchical clustering algorithm outperformed k-means clustering
and self-organizing map by correctly assigning 25 of 26 genes in a test set of four known gene groups. To evaluate the effectiveness of
BEA-PARTITION for clustering genes identified by microarray profiles, 44 yeast genes that are differentially expressed during the cell
cycle and have been widely studied in the literature were used as a second test set. Using established measures of cluster quality, the
results produced by BEA-PARTITION had higher purity, lower entropy, and higher mutual information than those produced by k-means
andself-organizingmap.WhereasBEA-PARTITIONand thehierarchical clusteringproducedsimilar quality of clusters,BEA-PARTITION
provides clear cluster boundaries compared to the hierarchical clustering. BEA-PARTITION is simple to implement and provides a
powerful approach to clustering genes or to any clustering problemwhere startingmatrices are available fromexperimental observations.
Index Terms—Bond energy algorithm, microarray, MEDLINE, text analysis, cluster analysis, gene function.
1 INTRODUCTION
DNA microarrays, among the most rapidly growing tools for genome analysis, are introducing a paradigmatic change in biology by shifting experimental approaches from single gene studies to genome-level analyses [1], [2]. Increasingly accessible microarray platforms allow the rapid generation of large expression data sets [3]. One of the key challenges of microarray studies is to derive biological insights from the unprecedented quantities of data on gene-expression patterns [5]. Partitioning genes into closely related groups has become an element of practically all analyses of microarray data [4].
A number of computer algorithms have been applied to
gene clustering. One of the earliest was a hierarchical
algorithm developed by Eisen et al. [6]. Other popular
algorithms, such as k-means [7] and Self-Organizing Maps
(SOM) [8], have also been widely used. These algorithms have
demonstrated their usefulness in gene clustering, but some
basic problems remain [2], [9]. Hierarchical clustering
organizes expression data into a binary tree, in which the
leaves are genes and the interior nodes (or branch points) are
candidate clusters. True clusters with discrete boundaries are
not produced [10]. Although SOM is efficient and simple to
implement, studies suggest that it typically performs worse
than the traditional techniques, such as k-means [11].
Based on the assumption that genes with the same function
or in the same biological pathway usually show similar
expression patterns, the functions of unknown genes can be
inferred from those of the known genes with similar
expression profile patterns. Therefore, expression profile
gene clustering by all the algorithms mentioned above has
received much attention; however, the task of finding
functional relationships between specific genes is left to the
investigator. Manual scanning of the biological literature (for
example, via MEDLINE) for clues regarding potential
functional relationships among a set of genes is not feasible
when the number of genes to be explored rises above
approximately 10. Restricting the scan (manual or automatic)
to annotation fields of GenBank, SwissProt, or LocusLink is
quicker but can suffer from the ad hoc relationship of
keywords to the research interests of whoever submitted
the entry. Moreover, keeping annotation fields current as new
62 IEEE/ACM TRANSACTIONS ON COMPUTATIONAL BIOLOGY AND BIOINFORMATICS, VOL. 2, NO. 1, JANUARY-MARCH 2005
. Y. Liu, S.B. Navathe, J. Civera, and A. Ram are with the College of Computing, Georgia Institute of Technology, 801 Atlantic Drive, Atlanta, GA 30322. E-mail: {yingliu, sham, ashwin}@cc.gatech.edu, [email protected].
. V. Dasigi is with the Department of Computer Science, School of Computing and Software Engineering, Southern Polytechnic State University, Marietta, GA 30060. E-mail: [email protected].
. B.J. Ciliax is with the Department of Neurology, Emory University School of Medicine, Atlanta, GA 30322. E-mail: [email protected].
. R. Dingledine is with the Department of Pharmacology, Emory University School of Medicine, Atlanta, GA 30322. E-mail: [email protected].
Manuscript received 4 Apr. 2004; revised 1 Oct. 2004; accepted 10 Feb. 2005; published online 30 Mar. 2005.
For information on obtaining reprints of this article, please send e-mail to: [email protected], and reference IEEECS Log Number TCBB-0043-0404.
1545-5963/05/$20.00 © 2005 IEEE Published by the IEEE CS, CI, and EMB Societies & the ACM
information appears in the literature is a major challenge that is rarely met adequately.
If, instead of organizing by expression pattern similarity, genes were grouped according to shared function, investigators might more quickly discover patterns or themes of biological processes that were revealed by their microarray experiments and focus on a select group of functionally related genes. A number of clustering strategies based on shared functions rather than similar expression patterns have been devised. Chaussabel and Sher [3] analyzed literature profiles generated by extracting the frequencies of certain terms from the abstracts in MEDLINE and then clustered the genes based on these terms, essentially applying the same algorithm used for expression pattern clustering. Jenssen et al. [12] used co-occurrence of gene names in abstracts to create networks of related genes automatically. Text analysis of biomedical literature has also been applied successfully to incorporate functional information about the genes in the analysis of gene expression data [1], [10], [13], [14] without generating clusters de novo. For example, Blaschke et al. [1] extracted information about the common biological characteristics of gene clusters from MEDLINE using Andrade and Valencia’s statistical text mining approach, which accepts user-supplied abstracts related to a protein of interest and returns an ordered set of keywords that occur in those abstracts more often than would be expected by chance [15].
We expanded and extended Andrade and Valencia’s approach [15] to functional gene clustering by using an approach that applies an algorithm called the Bond Energy Algorithm (BEA) [16], [17], which, to our knowledge, has not been used in bioinformatics. We modified it so that the “affinity” among attributes (in our case, genes) is defined based on the sharing of keywords between them and we came up with a scheme for partitioning the clustered affinity matrix to produce clusters of genes. We call the resulting algorithm BEA-PARTITION. BEA was originally conceived as a technique to cluster questions in psychological instruments [16], has been used in operations research, production engineering, marketing, and various other fields [18], and is a popular clustering algorithm in distributed database system (DDBS) design. The fundamental task of BEA in DDBS design is to group attributes based on their affinity, which indicates how closely related the attributes are, as determined by the inclusion of these attributes by the same database transactions. In our case, each gene was considered as an attribute. Hence, the basic premise is that two genes would have higher affinity, thus higher bond energy, if abstracts mentioning these genes shared many informative keywords. BEA has several useful properties [16], [19]. First, it groups attributes with larger affinity values together, and the ones with smaller values together (i.e., during the permutation of columns and rows, it shuffles the attributes towards those with which they have higher affinity and away from those with which they have lower affinity). Second, the composition and order of the final groups are insensitive to the order in which items are presented to the algorithm. Finally, it seeks to uncover and display the association and interrelationships of the clustered groups with one another.
In order to explore whether this algorithm could be
useful for clustering genes derived from microarray
experiments, we compared the performance of BEA-
PARTITION, hierarchical clustering algorithm, self-organiz-
ing map, and the k-means algorithm for clustering func-
tionally-related genes based on shared keywords, using
purity, entropy, and mutual information as metrics for
evaluating cluster quality.
2 METHODS
2.1 Keyword Extraction from Biomedical Literature
We used statistical methods to extract keywords from
MEDLINE citations, based on the work of [15]. This method
estimates the significance of words by comparing the
frequency of words in a given gene-related set (Test Set)
of abstracts with their frequency in a background set of
abstracts. We modified the original method by using
1) a different background set, 2) a different stemming
algorithm (Porter’s stemmer), and 3) a customized stop list.
The details were reported by Liu et al. [20], [21].
For each gene analyzed, word frequencies were calculated from a group of abstracts retrieved by an SQL (structured query language) search of MEDLINE for the specific gene name, gene symbol, or any known aliases (see LocusLink, ftp://ftp.ncbi.nih.gov/refseq/LocusLink/LL_tmpl.gz for gene aliases) in the TITLE field. The resulting set of abstracts (the Test Set) was processed to generate a specific keyword list.
Test Sets of Genes. We compared BEA-PARTITION and other clustering algorithms (k-means, hierarchical, and SOM) on two test sets.
1. Twenty-six genes in four well-defined functional groups consisting of 10 glutamate receptor subunits, seven enzymes in catecholamine metabolism, five cytoskeletal proteins, and four enzymes in tyrosine and phenylalanine synthesis. The gene names and aliases are listed in Table 1. This experiment was performed to determine whether keyword associations can be used to group genes appropriately and whether the four gene families or clusters that were known a priori would also be predicted by a clustering algorithm simply using the affinity metric based on keywords.
2. Forty-four yeast genes involved in the cell cycle of budding yeast (Saccharomyces cerevisiae) that had altered expression patterns on spotted DNA microarrays [6]. These genes were analyzed by Cherepinsky et al. [4] to demonstrate their Shrinkage algorithm for gene clustering. A master list of member genes for each cluster was assembled according to a combination of 1) common cell-cycle functions and regulatory systems and 2) the corresponding transcriptional activators for each gene [4] (Table 2).
Keyword Assessment. Statistical formulae from [15] for
word frequencies were used without modification. These
calculations were repeated for all gene names in the test
LIU ET AL.: TEXT MINING BIOMEDICAL LITERATURE FOR DISCOVERING GENE-TO-GENE RELATIONSHIPS: A COMPARATIVE STUDY OF... 63
set, a process that generated a database of keywords
associated with specific genes, the strength of the associa-
tion being reflected by a z-score. The z-score of word a for
gene g is defined as:
$$Z_a^g = \frac{F_a^g - \bar{F}_a}{\sigma_a}, \tag{1}$$

where $F_a^g$ equals the frequency of word $a$ in Test Set $g$ (i.e., in Test Set $g$, the number of abstracts where the word $a$ occurs divided by the total number of abstracts), and $\bar{F}_a$ and $\sigma_a$ are the average frequency and standard deviation, respectively, of word $a$ in the background set. Intuitively, the score $Z$ compares the “importance” or “discriminatory relevance” of a keyword in the test set of abstracts with the background set that represents the expected occurrence of that word in the literature at large.
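As a minimal sketch of Eq. (1), the Python fragment below computes a z-score for one word. It assumes, purely for illustration, that each abstract has already been reduced to a set of stemmed words and that the background is available as several subsets of abstracts, so that a mean and a standard deviation of the word's frequency can be taken across subsets. The source does not state how the background statistics were aggregated, and the function names and toy data are hypothetical.

```python
from statistics import mean, pstdev


def word_frequency(abstracts, word):
    """Fraction of abstracts (each a set of stemmed words) that contain `word`."""
    return sum(word in abstract for abstract in abstracts) / len(abstracts)


def z_score(test_abstracts, background_subsets, word):
    """Eq. (1): (F_a^g - mean background frequency) / background std deviation.

    Assumption: the background mean and deviation are taken across
    several background subsets; this aggregation is illustrative only.
    """
    f_test = word_frequency(test_abstracts, word)
    f_background = [word_frequency(subset, word) for subset in background_subsets]
    sigma = pstdev(f_background)
    if sigma == 0:
        return 0.0  # assumption: a word with no background spread is uninformative
    return (f_test - mean(f_background)) / sigma


# Toy data (made up): two background subsets and one gene-specific Test Set.
background = [
    [{"glutam"}, {"cell"}, {"cell"}, {"cell"}],
    [{"glutam"}, {"glutam"}, {"glutam"}, {"cell"}],
]
test_set = [{"glutam", "receptor"}, {"glutam"}]
z_glutam = z_score(test_set, background, "glutam")
```

In this toy case the word is more frequent in the Test Set than in either background subset, so its z-score is positive; a word that is rarer in the Test Set than in the background receives a negative score.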
Keyword Selection for Gene Clustering. We used z-score thresholds to select the keywords used for gene clustering. Those keywords with z-scores less than the threshold were discarded. The z-score thresholds we tested were 0, 5, 8, 10, 15, 20, 30, 50, and 100. The database generated by this algorithm is represented as a sparse word (rows) × gene (columns) matrix with cells containing z-scores. The matrix is characterized as “sparse” because each gene only has a fraction of all words associated with it. The output of the keyword selection for all genes in each Test Set is represented as a sparse keyword (rows) × gene (columns) matrix with cells containing z-scores.
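The thresholding step can be illustrated with a short Python sketch. It stores the sparse keyword × gene matrix as a dictionary of dictionaries (gene to word to z-score), which is one convenient representation and not necessarily the one used by the authors; the gene labels and z-scores are invented for the example.

```python
def select_keywords(z_by_gene, threshold):
    """Drop every (word, z-score) pair whose z-score is below `threshold`.

    `z_by_gene` maps gene -> {word: z-score}; absent words are the
    empty cells of the sparse keyword x gene matrix.
    """
    return {
        gene: {word: z for word, z in words.items() if z >= threshold}
        for gene, words in z_by_gene.items()
    }


# Toy z-scores (made up for illustration).
raw = {
    "GRIA1": {"glutam": 52.0, "receptor": 18.5, "cell": 3.2},
    "TH": {"catecholamin": 61.0, "express": 7.9},
}
selected = select_keywords(raw, threshold=10)
```

With a threshold of 10, low-information stems such as "cell" and "express" are removed while the gene-specific stems are kept.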
2.2 BEA-PARTITION: Detailed Working of the Algorithm
The BEA-PARTITION takes a symmetric matrix as input,
permutes its rows and columns, and generates a sorted
matrix, which is then partitioned to form a clustered matrix.
Constructing the Symmetric Gene × Gene Matrix. The sparse word × gene matrix, with the cells containing the z-scores of each word-gene pair, was converted to a gene × gene matrix with the cells containing the sum of products of z-scores for shared keywords. The z-score value was set to zero if the value was less than the threshold. Larger values reflect stronger and more extensive keyword associations between gene-gene pairs. For each gene pair $(G_i, G_j)$ and every word $a$ they share in the sparse word × gene matrix, the $G_i \times G_j$ cell value $\mathrm{aff}(G_i, G_j)$ in the gene × gene matrix represents the affinity of the two genes for each other and is calculated as:

$$\mathrm{aff}(G_i, G_j) = \frac{\sum_{a=1}^{N} \left( Z_a^{G_i} \times Z_a^{G_j} \right)}{1{,}000}. \tag{2}$$
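A small Python sketch of Eq. (2) follows. It assumes the sparse word × gene matrix is held as a dictionary mapping each gene to its word z-scores, treats z-scores below the threshold as zero, and keeps the division by 1,000. The helper names, gene labels, and z-scores are hypothetical, and the diagonal is computed by the same formula because the text does not single it out.

```python
def affinity(z_i, z_j, threshold=10.0, scale=1000.0):
    """Eq. (2): sum of z-score products over shared keywords, divided by 1,000.

    z-scores below `threshold` are treated as zero, so they add nothing.
    """
    shared = z_i.keys() & z_j.keys()
    total = sum(
        z_i[w] * z_j[w]
        for w in shared
        if z_i[w] >= threshold and z_j[w] >= threshold
    )
    return total / scale


def affinity_matrix(z_by_gene, genes, threshold=10.0):
    """Symmetric gene x gene matrix as a list of rows, ordered like `genes`."""
    return [
        [affinity(z_by_gene[a], z_by_gene[b], threshold) for b in genes]
        for a in genes
    ]


# Toy z-scores (made up for illustration).
z_by_gene = {
    "GRIA1": {"glutam": 50.0, "receptor": 20.0, "cell": 5.0},
    "GRIA2": {"glutam": 40.0, "receptor": 30.0},
    "TH": {"catecholamin": 60.0, "cell": 4.0},
}
genes = ["GRIA1", "GRIA2", "TH"]
aff = affinity_matrix(z_by_gene, genes)
```

Here the two genes that share high-scoring stems obtain a positive affinity, whereas the pair that shares only a sub-threshold stem obtains zero.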
Dividing the sum of the z-score products by 1,000 was done to reduce the typically large numbers to a more readable format in the output matrix.
Sorting the Matrix [19]. The sorted matrix is generated as follows:
TABLE 1. Twenty-Six Genes Manually Clustered Based on Functional Similarity
TABLE 2. Forty-Four Yeast Genes Grouped by Transcriptional Activators and Cell Cycle Functions [4]
1. Initialization. Place and fix one of the columns of the symmetric matrix arbitrarily into the clustered matrix.
2. Iteration. Pick one of the remaining n − i columns (where i is the number of columns already in the sorted matrix). Choose the placement in the sorted matrix that maximizes the change in bond energy as described below (3). Repeat this step until no more columns remain.
3. Row ordering. Once the column ordering is determined, the placement of the rows should also be changed correspondingly so that their relative positions match the relative position of the columns. This restores the symmetry to the sorted matrix.
To calculate the change in bond energy for each possible placement of the next $(i + 1)$ column, the bonds between that column $(k)$ and each of two newly adjacent columns $(i, j)$ are added and the bond that would be broken between the latter two columns is subtracted. Thus, the “bond energy” between these three columns $i$, $j$, and $k$ (representing gene $i$ $(G_i)$, gene $j$ $(G_j)$, and gene $k$ $(G_k)$) is calculated by the following interaction contribution measure:

$$\mathrm{energy}(G_i, G_j, G_k) = 2 \times [\mathrm{bond}(G_i, G_k) + \mathrm{bond}(G_k, G_j) - \mathrm{bond}(G_i, G_j)], \tag{3}$$

where $\mathrm{bond}(G_i, G_j)$ is the bond energy between gene $G_i$ and gene $G_j$ and

$$\mathrm{bond}(G_i, G_j) = \sum_{r=1}^{N} \mathrm{aff}(G_r, G_i) \times \mathrm{aff}(G_r, G_j), \tag{4}$$

$$\mathrm{aff}(G_0, G_i) = \mathrm{aff}(G_i, G_0) = \mathrm{aff}(G_{n+1}, G_i) = \mathrm{aff}(G_i, G_{n+1}) = 0. \tag{5}$$
The last set of conditions (5) takes care of cases where a
gene is being placed in the sorted matrix to the left of the
leftmost gene or to the right of the rightmost gene during
column permutations, and prior to the topmost row and
following the last row during row permutations.
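The sorting procedure and Eqs. (3) to (5) can be sketched in Python as follows. This is an illustrative reading, not the authors' implementation: it fixes column 0 first, takes the remaining columns in index order (the text only says to pick one of the remaining columns), breaks ties by the leftmost position, and represents the virtual boundary columns $G_0$ and $G_{n+1}$ by `None`. All function names are hypothetical.

```python
def bond(aff, i, j):
    """Eq. (4): sum over rows r of aff[r][i] * aff[r][j].

    A column index of None stands for a virtual boundary column,
    whose affinities are all zero by Eq. (5).
    """
    if i is None or j is None:
        return 0.0
    return sum(row[i] * row[j] for row in aff)


def energy(aff, left, k, right):
    """Eq. (3): gain from placing column k between columns left and right."""
    return 2 * (bond(aff, left, k) + bond(aff, k, right) - bond(aff, left, right))


def bea_order(aff):
    """Return a column ordering built by greedy best-gain insertion."""
    order = [0]  # step 1: fix an arbitrary column (here column 0)
    for k in range(1, len(aff)):  # step 2: place each remaining column
        best_pos, best_gain = 0, float("-inf")
        for pos in range(len(order) + 1):
            left = order[pos - 1] if pos > 0 else None
            right = order[pos] if pos < len(order) else None
            gain = energy(aff, left, k, right)
            if gain > best_gain:
                best_pos, best_gain = pos, gain
        order.insert(best_pos, k)
    return order


def reorder(aff, order):
    """Step 3: apply the same permutation to rows and columns."""
    return [[aff[r][c] for c in order] for r in order]


# Toy affinity matrix: genes 0 and 2 are related, as are genes 1 and 3.
toy = [
    [9, 0, 8, 0],
    [0, 9, 0, 8],
    [8, 0, 9, 0],
    [0, 8, 0, 9],
]
order = bea_order(toy)
sorted_matrix = reorder(toy, order)
```

On the toy matrix the related pairs end up in adjacent columns, so the reordered matrix shows two dense blocks along the diagonal, which is the structure the partitioning step then looks for.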
Partitioning the Sorted Matrix. The original BEA
algorithm [16] did not propose how to partition the sorted
matrix. The partitioning heuristic was added by Navathe
et al. [17] for the problems in the distributed database
design. These heuristics were constructed using the goals of
design: to minimize access time and storage costs. We do
not have the luxury of such a clear cut objective function in
our case. Hence, to partition the sorted matrix into
submatrices, each representing a gene cluster, we experi-
mented with different heuristics and, finally, derived a
heuristic that identifies the boundaries between clusters by
sequentially finding the maximum sum of the quotients for
corresponding cells in adjacent columns across the matrix.
With each successive split, only those rows corresponding
to the remaining columns were processed, i.e., only the
remaining symmetrical portion of the submatrix was used
for further iterations of the splitting algorithm. The number
of clusters into which the gene affinity matrix was
partitioned was determined by AUTOCLASS (described below); however, other heuristics might be useful for this determination. The boundary metric $(B)$ for columns $G_i$ and $G_j$ used for placement of new column $k$ between existing columns $i$ and $j$ was defined as:

$$B(G_i, G_j) = \max_{p_{-1} \le q \le p} \sum_{k=p_{-1}}^{p} \frac{\max(\mathrm{aff}(k, q), \mathrm{aff}(k, q+1))}{\min(\mathrm{aff}(k, q), \mathrm{aff}(k, q+1))}, \tag{6}$$

where $q$ is the new splitting point (for simplicity, we use the number of the leftmost column in the new submatrix that is to the right of the splitting point), which will split the submatrix defined between two previous splitting points, $p$ and $p_{-1}$ (which do not necessarily represent contiguous columns). To partition the entire sorted matrix, the following initial conditions are set: $p = N$, $p_{-1} = 0$.
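One possible reading of the boundary metric in Eq. (6) is sketched below in Python. Several choices here are assumptions that the text does not settle: a candidate boundary is scored between adjacent columns q and q + 1 using only the rows of the current submatrix, a small constant guards against division by zero, and at each step the segment whose best boundary scores highest is split next, until the requested number of clusters is reached. The function names are hypothetical.

```python
EPS = 1e-9  # assumption: avoids division by zero; the source does not say how zeros are handled


def boundary_score(aff, lo, hi, q):
    """Sum over rows lo..hi of max/min for adjacent columns q and q + 1."""
    total = 0.0
    for k in range(lo, hi + 1):
        a, b = aff[k][q], aff[k][q + 1]
        total += max(a, b) / max(min(a, b), EPS)
    return total


def best_split(aff, lo, hi):
    """Best (score, q) inside the segment lo..hi, or None for a single column."""
    candidates = [(boundary_score(aff, lo, hi, q), q) for q in range(lo, hi)]
    return max(candidates) if candidates else None


def partition(aff, n_clusters):
    """Split the sorted matrix into column segments given as (first, last) pairs."""
    segments = [(0, len(aff) - 1)]
    while len(segments) < n_clusters:
        scored = []
        for seg in segments:
            split = best_split(aff, *seg)
            if split is not None:
                scored.append((split[0], split[1], seg))
        if not scored:
            break  # every segment is a single column
        _, q, (lo, hi) = max(scored)
        segments.remove((lo, hi))
        segments += [(lo, q), (q + 1, hi)]
    return sorted(segments)


# Toy sorted matrix with two dense diagonal blocks.
sorted_toy = [
    [9, 8, 1, 1],
    [8, 9, 1, 1],
    [1, 1, 9, 8],
    [1, 1, 8, 9],
]
clusters = partition(sorted_toy, 2)
```

The ratio of the larger to the smaller affinity is close to 1 for two columns in the same block and large for columns on either side of a block edge, which is why the maximum of the summed quotients marks a cluster boundary.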
2.3 K-Means Algorithm and Hierarchical Clustering Algorithm
K-means and hierarchical clustering analysis were performed using Cluster/Treeview programs available online (http://bonsai.ims.u-tokyo.ac.jp/~mdehoon/software/cluster/software.htm).
2.4 Self-Organizing Map
Self-organizing map was performed using GeneCluster 2.0 (http://www.broad.mit.edu/cancer/software/software.html).
The Euclidean distance measure was used when the gene × keyword matrix was the input. When the gene × gene matrix was used as input, the gene similarity was calculated by (2).
2.5 Number of Clusters
In order to apply BEA-PARTITION and k-means cluster-
ing algorithms, the investigator needs to have a priori
knowledge about the number of clusters in the test set.
We determined the number of clusters by applying
AUTOCLASS, an unsupervised Bayesian classification
system developed by [22]. AUTOCLASS, which seeks a
maximum posterior probability classification, determines
the optimal number of classes in large data sets. Among
a variety of applications, AUTOCLASS has been used
for the discovery of new classes of infra-red stars in the
IRAS Low Resolution Spectral catalogue, new classes of
airports in a database of all US airports, and discovery
of classes of proteins, introns and other patterns in
DNA/protein sequence data [22]. We applied an open
source implementation of AUTOCLASS (http://ic.arc.nasa.gov/ic/projects/bayes-group/autoclass/autoclass-c-program.html). The resulting number of
clusters was then used as the endpoint for the
partitioning step of the BEA-PARTITION algorithm. To
determine whether AUTOCLASS could discover the
number of clusters in the test sets correctly, we also
tested numbers of clusters other than the ones
AUTOCLASS predicted.
2.6 Evaluating the Clustering Results
To evaluate the quality of our resultant clusters, we used
the established metrics of Purity, Entropy, and Mutual
Information, which are briefly described below [23]. Let us
assume that we have C classes (i.e., C expert clusters, as
shown in Tables 1 and 2), while our clustering algorithms
produce $K$ clusters, $\pi_1, \pi_2, \ldots, \pi_K$.
Purity. Purity can be interpreted as classification accuracy under the assumption that all objects of a cluster are classified to be members of the dominant class for that cluster. If the majority of genes in cluster A are in class X, then class X is the dominant class. Purity is defined as the ratio between the number of items in cluster $\pi_i$ from dominant class $j$ and the size of cluster $\pi_i$, that is:

$$P(\pi_i) = \frac{1}{n_i} \max_j \left( n_i^j \right), \quad i = 1, 2, \ldots, K, \tag{7}$$

where $n_i = |\pi_i|$, that is, the size of cluster $i$, and $n_i^j$ is the number of genes in $\pi_i$ that belong to class $j$, $j = 1, 2, \ldots, C$.
The closer to 1 the purity value is, the more similar this
cluster is to its dominant class. Purity is measured for each
cluster and the average purity of each test gene set cluster
result was calculated.
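Purity as defined in Eq. (7) can be computed with a few lines of Python. The sketch assumes that each cluster is given as a list of expert class labels, one per gene, and that the reported average is an unweighted mean over clusters; the text does not say whether the average is weighted by cluster size. The labels are invented.

```python
from collections import Counter


def purity(cluster_labels):
    """Eq. (7): share of the dominant expert class within one cluster."""
    counts = Counter(cluster_labels)
    return max(counts.values()) / len(cluster_labels)


def average_purity(clusters):
    """Unweighted mean purity over clusters (an assumption, see above)."""
    return sum(purity(c) for c in clusters) / len(clusters)


# Toy clustering: each inner list holds the expert class of every gene in a cluster.
toy_clusters = [
    ["glutamate", "glutamate", "glutamate", "catecholamine"],
    ["cytoskeletal", "cytoskeletal"],
]
mean_purity = average_purity(toy_clusters)
```

The first toy cluster contains one gene from a foreign class and therefore has a purity below 1, while the second is pure.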
Entropy. Entropy denotes how uniform the cluster is. If a
cluster is composed of genes coming from different classes,
then the value of entropy will be close to 1. If a cluster only
contains one class, the value of entropy will be close to 0.
The ideal value for entropy would be zero. Lower values of
entropy would indicate better clustering. Entropy is also
measured for each cluster and is defined as:
$$E(\pi_i) = -\frac{1}{\log C} \sum_{j=1}^{C} \frac{n_i^j}{n_i} \log\left( \frac{n_i^j}{n_i} \right). \tag{8}$$
The average entropy of each test gene set cluster result was
also calculated.
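The normalized entropy of Eq. (8) can be sketched in the same style. The code assumes that a cluster is a list of expert class labels and that classes absent from the cluster contribute nothing to the sum (the usual convention that 0 log 0 = 0); the function name and labels are hypothetical.

```python
import math
from collections import Counter


def entropy(cluster_labels, n_classes):
    """Eq. (8): class entropy of one cluster, normalized by log(C).

    Requires n_classes >= 2; classes missing from the cluster add zero.
    """
    if n_classes < 2:
        raise ValueError("entropy needs at least two expert classes")
    n = len(cluster_labels)
    counts = Counter(cluster_labels)
    h = -sum((c / n) * math.log(c / n) for c in counts.values())
    return h / math.log(n_classes)


mixed = entropy(["a", "a", "b", "b"], n_classes=2)   # evenly mixed cluster
pure = entropy(["a", "a", "a"], n_classes=2)         # single-class cluster
```

Because of the division by log C, an evenly mixed cluster over all C classes reaches the maximum of 1 and a single-class cluster yields 0, in line with the description above.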
Mutual Information. One problem with purity and
entropy is that they are inherently biased to favor small
clusters. For example, if we had one object for each cluster,
then the value of purity would be 1 and entropy would be
zero, no matter what the distribution of objects in the expert
classes is.
Mutual information is a symmetric measure for the
degree of dependency between clusters and classes. Unlike
correlation, mutual information also takes higher order
dependencies into account. We use mutual information
because it captures how related clusters are to classes
without bias towards small clusters. Mutual information is
a measure of the agreement between the algorithm-derived
clusters and the actual clusters: it quantifies how much
the algorithm-derived clusters tell us about the actual
clusters. Random clustering
has mutual information of 0 in the limit. Higher mutual
information indicates higher similarity between the algo-
rithm-derived clusters and the actual clusters. Mutual
information is defined as:
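$$M(\pi) = \frac{2}{N} \sum_{i=1}^{K} \sum_{j=1}^{C} n_i^j \, \frac{\log\left( \dfrac{n_i^j \times N}{\sum_{t=1}^{K} n_t^j \, \sum_{t=1}^{C} n_i^t} \right)}{\log(K \times C)}, \tag{9}$$

where $N$ is the total number of genes being clustered, $K$ is the number of clusters the algorithm produced, and $C$ is the number of expert classes.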
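A compact Python sketch of Eq. (9) is given below. It assumes two parallel lists, one with the cluster assigned to each gene and one with its expert class, takes K and C as the number of distinct values in each list, and skips empty cells of the contingency table (0 log 0 = 0). The function name and the toy labels are hypothetical.

```python
import math
from collections import Counter


def mutual_information(cluster_ids, class_ids):
    """Eq. (9): normalized mutual information between clusters and classes."""
    n = len(cluster_ids)
    joint = Counter(zip(cluster_ids, class_ids))  # n_i^j
    cluster_size = Counter(cluster_ids)           # sum over classes of n_i^t
    class_size = Counter(class_ids)               # sum over clusters of n_t^j
    k, c = len(cluster_size), len(class_size)
    if k * c == 1:
        return 0.0  # assumption: a single cluster and a single class carry no information
    total = 0.0
    for (i, j), n_ij in joint.items():
        total += n_ij * math.log(n_ij * n / (cluster_size[i] * class_size[j]))
    return 2.0 / n * total / math.log(k * c)


classes = ["a", "a", "b", "b"]
perfect = mutual_information([0, 0, 1, 1], classes)    # clusters match classes
unrelated = mutual_information([0, 1, 0, 1], classes)  # clusters cut across classes
```

When the clusters coincide with the expert classes the measure reaches 1 in this balanced toy case, and when every cluster contains the classes in the same proportions as the whole set it is 0.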
2.7 Top-Scoring Keywords Shared among Members of a Gene Cluster
Keywords were ranked according to their highest shared z-scores in each cluster. The keyword sharing strength metric $(K_a)$ is defined as the sum of z-scores for a shared keyword $a$ within the cluster, multiplied by the number of genes $(M)$ within the cluster with which the word is associated; in this calculation z-scores less than a user-selected threshold are set to zero and are not counted.

$$K_a = \sum_{g=1}^{M} \left( z_a^g \right) \times \sum_{g=1}^{M} \mathrm{Count}\left( z_a^g \right). \tag{10}$$

Thus, larger values reflect stronger and more extensive keyword associations within a cluster. We identified the 30 highest scoring keywords for each of the four clusters and provided these four lists to approximately 20 students, postdoctoral fellows, and faculty, asking them to guess a major function of the underlying genes that gave rise to the four keyword lists.
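The keyword sharing strength of Eq. (10) can be illustrated with the following Python sketch. It assumes the z-scores are stored as a dictionary from gene to word z-scores and that Count contributes 1 for each gene whose z-score for the word reaches the threshold; the function names, gene labels, and values are hypothetical.

```python
def keyword_strength(z_by_gene, cluster, word, threshold=10.0):
    """Eq. (10): (sum of z-scores) x (number of genes), counting only
    genes in `cluster` whose z-score for `word` is at least `threshold`."""
    scores = [z_by_gene[g].get(word, 0.0) for g in cluster]
    kept = [z for z in scores if z >= threshold]
    return sum(kept) * len(kept)


def top_keywords(z_by_gene, cluster, threshold=10.0, top_n=30):
    """Rank every word seen in the cluster by its sharing strength."""
    words = set().union(*(z_by_gene[g].keys() for g in cluster))
    ranked = sorted(
        words,
        key=lambda w: keyword_strength(z_by_gene, cluster, w, threshold),
        reverse=True,
    )
    return ranked[:top_n]


# Toy z-scores (made up): "kinas" scores very high, but for one gene only.
toy_z = {
    "G1": {"glutam": 50.0, "receptor": 20.0},
    "G2": {"glutam": 40.0, "kinas": 120.0},
}
ranking = top_keywords(toy_z, ["G1", "G2"])
```

In the toy data the stem shared by both genes outranks the stem with a single very high z-score, which is the behavior the combined metric is meant to produce.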
3 RESULTS
3.1 Keywords and Keyword × Gene Matrix Generation
A list of keywords was generated for each gene to build the
keyword × gene matrix. Keywords were sorted according
to their z-scores. The keyword selection experiment (see
below) showed that a z-score threshold of 10 generally
produced better results, which suggests that keywords with
z-scores lower than 10 have less information content, e.g.,
“cell,” “express.” The relative values of z-scores depended
on the size of the background set (data not shown). Since we
used 5.6 million abstracts as the background set, the
z-scores of most of the informative keywords were well
above 10 (based on smaller values of standard deviation in
the definition of z-score). The keyword × gene matrices
were used as inputs to k-means, hierarchical clustering
algorithm, and self-organizing map, while, as required by the
BEA approach, they were first converted to a gene × gene
matrix based on common shared keywords, and these gene
× gene matrices were used as inputs to BEA-PARTITION.
An overview of the gene clustering by shared keyword
process is provided in Fig. 1.
3.2 Effect of Keyword Selection on Gene Clustering
The effect of using different z-score thresholds for keyword
selection on the quality of resulting clusters is shown in
Figs. 2A1 and 2B1. For both test sets, BEA-PARTITION
produced clusters with higher mutual information when z-
score thresholds were within a range of 10 to 20. For the 44-
gene set, K-means produced clusters with the highest
mutual information when the z-score threshold was 8,
while, for the 26-gene set, mutual information was highest
when z-score threshold was 15. For the remaining studies,
we chose to use a z-score threshold of 10 to keep as many
functional keywords as possible.
3.3 Number of Clusters
We then used AUTOCLASS to decide the number of clusters in the test sets. AUTOCLASS took the keyword × gene matrix as input and predicted that there were five clusters in the set of 26 genes and nine clusters in the set of 44 yeast genes. The effect of the number of clusters on the algorithm performance is shown in Figs. 2A2 and 2B2. BEA-PARTITION again produced a better result regardless of the number of clusters used. BEA-PARTITION had the highest mutual information when the numbers of clusters were five (26-gene set) and nine (44-gene set), whereas k-means worked marginally better when the numbers of clusters were 8 (26-gene set) and 10 (44-gene set). Based on these results we chose to use five and nine clusters, respectively, because the probabilities were higher than the other choices.
3.4 Clustering of the 26-Gene Set by Keyword Associations
To determine whether keyword associations could be used to group genes appropriately, we clustered the 26-gene set with BEA-PARTITION, k-means, the hierarchical algorithm, SOM, and AUTOCLASS. Keyword lists were generated for each of these 26 genes, which belonged to one of four well-defined functional groups (Table 1). The resulting word × gene matrix had 26 columns (genes) and approximately 8,540 rows (words with z-scores ≥ 10 appearing in any of the query sets). BEA-PARTITION, with a z-score threshold of 10, correctly assigned 25 of 26 genes to the appropriate cluster based on the strength of keyword associations (Fig. 3). Tyrosine transaminase was the only outlier. As expected from BEA-PARTITION, cells inside clusters tended to have much higher values than those outside. The hierarchical clustering algorithm, with the gene × keyword matrix as the input, generated a result similar to that of BEA-PARTITION (five clusters, with TT as the outlier) (Fig. 4a). The results with the gene × gene matrix as the input are shown in tables in the supplementary materials, which can be found at www.computer.org/publications/dlib.
While BEA-PARTITION and the hierarchical clustering algorithm produced clusters very similar to the original functional classes, those produced by k-means (Table 4), self-organizing map (Table 5), and AUTOCLASS (Table 6), with the gene × keyword matrix as input, were heterogeneous and, thus, more difficult to explain. The average purity,
Fig. 1. Procedure for clustering genes by the strength of their associated keywords.
Fig. 2. Effect of keyword selection by z-score thresholds (A1 and B1) and different number of clusters (A2 and B2) on the cluster quality. Z-score thresholds were used to select the keywords for gene clustering. Those keywords with z-scores less than the threshold were discarded. To determine the effect of keyword selection by z-score thresholds on cluster quality, we tested z-score thresholds 0, 5, 8, 10, 15, 20, 30, 50, and 100. To determine whether AUTOCLASS could be used to discover the number of clusters in the test sets correctly, we tested a different number of clusters other than the ones AUTOCLASS predicted (four for the 26-gene set and nine for the 44-gene set).
average entropy, and mutual information of the BEA-PARTITION and hierarchical algorithm results were 1, 0, and 0.88, while those of the k-means result were 0.53, 0.65, and 0.28, those of the SOM result were 0.76, 0.35, and 0.18, and those of the AUTOCLASS result were 0.82, 0.28, and 0.56, respectively (Table 3) (gene × keyword matrix as input). When the gene × gene matrix was used as input to the hierarchical algorithm, k-means, and SOM, the results were even worse as measured by purity, entropy, and mutual information (Table 3).
3.5 Yeast Microarray Gene Clustering by Keyword Association
To determine whether our text mining/gene clustering approach could be used to group genes identified in microarray experiments, we clustered 44 yeast genes taken from Eisen et al. [6] via Cherepinsky et al. [4], again using BEA-PARTITION, hierarchical algorithm, SOM, AUTOCLASS, and k-means. Keyword lists were generated for each of the 44 yeast genes (Table 2) and a 3,882 (words appearing in the query sets with z-score greater than or equal to 10) × 44 (genes) matrix was created. The clusters produced by BEA-PARTITION, k-means, SOM, and AUTOCLASS are shown in Tables 7, 8, 9, and 10, respectively, whereas those produced by the hierarchical algorithm are shown in Fig. 4b. The average purity, average entropy, and mutual information of the BEA-PARTITION result were 0.74, 0.24, and 0.60, whereas those of the hierarchical algorithm, SOM, k-means, and AUTOCLASS results (gene × keyword matrix as input) were 0.86, 0.12, and 0.58; 0.60, 0.37, and 0.46; 0.61, 0.33, and 0.39; and 0.57, 0.39, and 0.49, respectively (Table 3).
3.6 Keywords Indicative of Major Shared Functions with a Gene Cluster
Keywords shared among genes (26-gene set) within each cluster were ranked according to a metric based on both the degree of significance (the sum of z-scores for each keyword) and the breadth of distribution (the sum of the number of genes within the cluster for which the keyword has a z-score greater than a selected threshold). This double-pronged metric obviated the difficulty encountered with keywords that had extremely high z-scores for single genes within the cluster but modest z-scores for the remainder. The 30 highest scoring keywords for each of the four clusters were tabulated (Table 11). The respective keyword lists appeared to be highly informative about the general function of the original, preselected clusters when shown to medical students, faculty, and postdoctoral fellows.
4 DISCUSSION
In this paper, we clustered the genes by shared functional
keywords. Our gene clustering strategy is similar to the
document clustering in information retrieval. Document
clustering, defined as grouping documents into clusters
according to their topics or main contents in an unsuper-
vised manner, organizes large amounts of information into
a small number of meaningful clusters and improves the
information retrieval performance either via cluster-driven
dimensionality reduction, term-weighting, or query expan-
sion [9], [24], [25], [26], [27].
Term vector-based document clustering has been widely
studied in information retrieval [9], [24], [25], [26], [27]. A
Fig. 3. Gene clusters by keyword associations using BEA-PARTITION. Keywords with z-scores ≥ 10 were extracted from MEDLINE abstracts for 26 genes in four functional classes. The resulting word × gene sparse matrix was converted to a gene × gene matrix. The cell values are the sum of z-score products for all keywords shared by the gene pair. This value is divided by 1,000 for purpose of display. A modified bond energy algorithm [16], [17] was used to group genes into five clusters based on the strength of keyword associations, and the resulting gene clusters are boxed.
number of clustering algorithms have been proposed and
many of them have been applied to bioinformatics research.
In this report, we introduced a new algorithm for clustering
genes, BEA-PARTITION. Our results showed that BEA-
PARTITION, in conjunction with the heuristic developed
for partitioning the sorted matrix, outperforms the k-means
algorithm and SOM in two test sets. In the first set of genes
(26-gene set), BEA-PARTITION, as well as the hierarchical
algorithm, correctly assigned 25 of 26 genes in a test set of
four known gene groups with one outlier, whereas k-means
and SOM mixed the genes into five more evenly sized but
less well functionally defined groups. In the 44-gene set, the
result generated by BEA-PARTITION had the highest
mutual information, indicating that BEA-PARTITION
outperformed the other four clustering algorithms.
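To make the comparison by mutual information concrete, the sketch below scores a clustering against known functional classes. It uses textbook definitions of purity and (unnormalized) mutual information in bits; the exact normalization behind the values reported in Table 3 is not given in this section and may differ. The function names and the toy labels are hypothetical.

```python
import math
from collections import Counter
from typing import Hashable, Sequence


def purity(clusters: Sequence[Hashable], classes: Sequence[Hashable]) -> float:
    """Fraction of items that fall in the majority class of their cluster."""
    joint = Counter(zip(clusters, classes))
    best = Counter()
    for (k, _c), count in joint.items():
        best[k] = max(best[k], count)
    return sum(best.values()) / len(clusters)


def mutual_information(clusters: Sequence[Hashable], classes: Sequence[Hashable]) -> float:
    """Mutual information (bits) between cluster labels and class labels."""
    n = len(clusters)
    joint = Counter(zip(clusters, classes))
    per_cluster = Counter(clusters)
    per_class = Counter(classes)
    return sum(
        (count / n) * math.log2(count * n / (per_cluster[k] * per_class[c]))
        for (k, c), count in joint.items()
    )


# Toy example: four genes in two known classes.
known = ["a", "a", "b", "b"]
perfect = [0, 0, 1, 1]   # clusters match the classes exactly
mixed = [0, 1, 0, 1]     # clusters are independent of the classes

mi_perfect = mutual_information(perfect, known)
mi_mixed = mutual_information(mixed, known)
```

A clustering that reproduces the known classes attains the maximum mutual information for the label distribution, while one that is independent of the classes scores zero.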
4.1 BEA-PARTITION versus k-Means
In this study, the z-score thresholds were used for keyword
selection. When the threshold was 0, all words, including
noise (noninformative words and misspelled words), were
used to cluster genes. Under the tested conditions, clusters
produced by BEA-PARTITION had higher quality than
those produced by k-means. BEA-PARTITION clusters
genes based on their shared keywords. It is unlikely that
genes within the same cluster shared the same noisy words
with high z-scores, indicating that BEA-PARTITION is less
sensitive to noise than k-means. In fact, BEA-PARTITION
performed better than k-means in the two test gene sets
under almost all test conditions (Fig. 2). BEA-PARTITION
performed best when z-score thresholds were 10, 15, and 20,
which indicated 1) that words with z-scores below 10
were less informative and 2) that few words with z-scores
between 10 and 20 were shared by at least two genes and
did not improve the cluster quality. When z-score thresholds
were high (> 30 in the 26-gene set and > 20 in the
44-gene set), more informative words were discarded, and
as a result, the cluster quality was degraded.
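The effect of the z-score threshold on gene-to-gene affinities can be illustrated with a small sketch. Following the description in the caption of Fig. 3, the affinity of a gene pair is taken as the sum of z-score products over the keywords both genes retain after thresholding. The function name, the dictionary layout, the choice to leave the diagonal at zero, and the example z-scores are assumptions made for illustration only.

```python
from typing import Dict, List


def affinity_matrix(
    zscores: Dict[str, Dict[str, float]],  # keyword -> {gene -> z-score}
    genes: List[str],
    threshold: float = 10.0,
) -> List[List[float]]:
    """Gene x gene matrix: sum of z-score products over shared keywords."""
    index = {gene: i for i, gene in enumerate(genes)}
    n = len(genes)
    aff = [[0.0] * n for _ in range(n)]
    for per_gene in zscores.values():
        # Keep only genes for which this keyword passes the threshold.
        kept = [(index[g], z) for g, z in per_gene.items()
                if g in index and z >= threshold]
        for i, zi in kept:
            for j, zj in kept:
                if i != j:  # assumption: diagonal is left at zero
                    aff[i][j] += zi * zj
    return aff


genes = ["g1", "g2", "g3"]
zscores = {
    "kwA": {"g1": 20.0, "g2": 15.0, "g3": 5.0},
    "kwB": {"g2": 12.0, "g3": 30.0},
}
strict = affinity_matrix(zscores, genes, threshold=10.0)
loose = affinity_matrix(zscores, genes, threshold=0.0)
```

With the threshold at 10, the weak association of `g3` with `kwA` is ignored and `g1` and `g3` share no keyword; with the threshold at 0, every word contributes, including low-scoring ones.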
LIU ET AL.: TEXT MINING BIOMEDICAL LITERATURE FOR DISCOVERING GENE-TO-GENE RELATIONSHIPS: A COMPARATIVE STUDY OF... 69
Fig. 4. Gene clusters by keyword associations using hierarchical clustering algorithm. Keywords with z-scores ≥ 10 were extracted from MEDLINE
abstracts for (a) 26 genes in four functional classes and (b) 44 genes in nine classes. The resulting word × gene sparse matrix was used as input to
the hierarchical algorithm.
BEA-PARTITION is designed to group cells with larger
values together, and the ones with smaller values together.
The final order of the genes within the cluster reflected
deeper interrelationships. Among the 10 glutamate receptor
genes examined, GluR1, GluR2, and GluR4 are AMPA
receptors, while GluR6, KA1, and KA2 are kainate receptors.
The observation that BEA-PARTITION placed GluR6
and KA2 next to each other confirms that the literature
associations between GluR6 and KA2 are higher than those
between GluR6 and AMPA receptors. Furthermore, the
association and interrelationships of the clustered groups
with one another can be seen in the final clustering matrix.
For example, TT was an outlier in Fig. 3; however, it still
had higher affinity to PD1 (affinity = 202) and PD2 (affinity
= 139) than to any other genes. Thus, TT appears to be
strongly related to genes in the tyrosine and phenylalanine
synthesis cluster, from which it originated.
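The idea of ordering a symmetric affinity matrix so that large values end up next to each other can be sketched with the classic greedy-insertion form of the bond energy algorithm. This is a simplified illustration, not the modified algorithm used for BEA-PARTITION: the contribution formula is the textbook one, the matrix is assumed to carry a nonzero diagonal (self-affinity), and the affinity values are invented for the example.

```python
from typing import List, Optional


def bond(aff: List[List[float]], x: Optional[int], y: Optional[int]) -> float:
    """Inner product of columns x and y; a missing neighbor contributes 0."""
    if x is None or y is None:
        return 0.0
    return sum(row[x] * row[y] for row in aff)


def bea_order(aff: List[List[float]]) -> List[int]:
    """Greedy column ordering that tries to maximize adjacent bonds."""
    n = len(aff)
    order = list(range(min(n, 2)))  # start with the first two columns
    for k in range(2, n):
        best_pos, best_gain = 0, float("-inf")
        for pos in range(len(order) + 1):
            left = order[pos - 1] if pos > 0 else None
            right = order[pos] if pos < len(order) else None
            # Net gain of placing column k between `left` and `right`.
            gain = (2 * bond(aff, left, k) + 2 * bond(aff, k, right)
                    - 2 * bond(aff, left, right))
            if gain > best_gain:
                best_pos, best_gain = pos, gain
        order.insert(best_pos, k)
    return order


# Hypothetical affinities: items 0 and 2 are strongly linked, as are 1 and 3.
aff = [
    [60, 1, 50, 1],
    [1, 60, 1, 40],
    [50, 1, 60, 1],
    [1, 40, 1, 60],
]
order = bea_order(aff)
```

Because the matrix is symmetric, the same permutation is applied to rows and columns. In this toy case the strongly linked pairs become neighbors in the final order, which is the property the discussion of GluR6 and KA2 relies on.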
BEA-PARTITION has several advantages over the
k-means algorithm: 1) while k-means generally produces a
locally optimal clustering [2], BEA-PARTITION produces
TABLE 3. The Quality of the Gene Clusters Derived by Different Clustering Algorithms, Measured by Purity, Entropy, and Mutual Information
TABLE 4. Twenty-Six Gene Set k-Means Result (Gene × Keyword Matrix as Input)
the globally optimal clustering by permuting the columns
and rows of the symmetric matrix; 2) the k-means algorithm
is sensitive to initial seed selection and noise [9].
4.2 BEA-PARTITION versus Hierarchical Algorithm
Hierarchical clustering algorithms, as well as k-means and
Self-Organizing Maps, have been widely used in microarray
expression profile analysis. Hierarchical clustering organizes
expression data into a binary tree without providing a
clear indication of how the hierarchy should be clustered. In
practice, investigators define clusters by a manual scan of
the genes in each node and rely on their biological expertise
to notice shared functional properties of genes. Therefore,
the definition of the clusters is subjective, and as a result,
different investigators may interpret the same clustering
result differently. Some have proposed automatically
defining boundaries based on statistical properties of the
gene expression profiles; however, the same statistical
criteria may not be generally applicable to identify all
relevant biological functions [10]. We believe that an
algorithm that produces clusters with clear boundaries
can provide more objective results and possibly new
discoveries, which are beyond the experts’ knowledge. In
this report, our results showed that BEA-PARTITION can
perform similarly to a hierarchical algorithm while
providing distinct cluster boundaries.
4.3 k-Means versus SOM
The k-means algorithm and SOM can group objects into
different clusters and provide clear boundaries. Despite its
TABLE 5. Twenty-Six Gene SOM Result (Gene × Keyword Matrix as Input)
TABLE 6. Twenty-Six Gene AUTOCLASS Result (Gene × Keyword Matrix as Input)
simplicity and efficiency, the SOM algorithm has several
weaknesses that make its theoretical analysis difficult and
limit its practical usefulness. Various studies have sug-
gested that it is hard to find any criteria under which the
SOM algorithm performs better than the traditional
techniques, such as k-means [11]. Balakrishnan et al. [28]
compared the SOM algorithm with k-means clustering on
108 multivariate normal clustering problems. The results
showed that the SOM algorithm performed significantly
worse than the k-means clustering algorithm. Our results
also showed that k-means performed better than SOM by
generating clusters with higher mutual information.
4.4 Computing Time
The computing time of BEA-PARTITION, like that of the hierarchical algorithm and SOM, is on the order of $N^2$, which means that it grows proportionally to the square of the number of genes and is commonly denoted as $O(N^2)$, and that of k-means is on the order of $N \times K \times T$ ($O(NKT)$), where N is the number of genes tested, K is the number of clusters, and T is the number of improvement steps (iterations) performed by k-means. In our study, the number of improvement steps was 1,000. Therefore, when the number of genes tested is about 1,000, BEA-PARTITION runs ($a \times K + b$) times faster than k-means, where a and b are constants. As long as the number of genes to be clustered is less than the product of the number
TABLE 7. Forty-Four Yeast Genes BEA-PARTITION Result (Gene × Keyword Matrix as Input)
TABLE 8. Forty-Four Yeast Gene SOM Result (Gene × Keyword as Input)
of clusters and the number of iterations, BEA-PARTITION
will run faster than k-means.
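The comparison above amounts to simple arithmetic on the two growth rates. The sketch below ignores all constant factors (the a and b mentioned in the text), so the ratios are only indicative; the function names are hypothetical.

```python
def bea_steps(n: int) -> int:
    """Idealized O(N^2) step count, constants ignored."""
    return n * n


def kmeans_steps(n: int, k: int, t: int) -> int:
    """Idealized O(NKT) step count, constants ignored."""
    return n * k * t


def speedup(n: int, k: int, t: int) -> float:
    """How many times fewer steps the N^2 method needs (equals K*T/N)."""
    return kmeans_steps(n, k, t) / bea_steps(n)


def bea_is_cheaper(n: int, k: int, t: int) -> bool:
    """True when N < K*T, the condition stated in the text."""
    return n < k * t


# N = 1,000 genes, K = 5 clusters, T = 1,000 iterations.
ratio_large = speedup(1000, 5, 1000)
# N = 26 genes, as in the smaller test set.
ratio_small = speedup(26, 5, 1000)
```

With T fixed at 1,000 and N near 1,000, the ratio reduces to roughly K, which matches the linear-in-K form given in the text once constants are restored.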
4.5 Number of Clusters
One disadvantage of BEA-PARTITION and k-means com-
pared to hierarchical clustering is that the investigator needs
to have a priori knowledge about the number of clusters in the
test set, which may not be known. We approached this
problem by using AUTOCLASS to predict the number of
clusters in the test sets. BEA-PARTITION performed best
when it grouped the genes into five clusters (26-gene set) and
nine clusters (44-gene set), which were predicted by AUTO-
CLASS with higher probabilities. Therefore, AUTOCLASS
TABLE 9. Forty-Four Yeast Gene k-Means Result (Gene × Keyword Matrix as Input)
TABLE 10. Forty-Four Yeast Gene AUTOCLASS Result (Gene × Keyword Matrix as Input)
appears to be an effective tool to assist BEA-PARTITION in gene clustering.
5 CONCLUSIONS AND FUTURE WORK
There are several aspects of the BEA approach that we are
currently exploring with more detailed studies. For example,
although the BEA-PARTITION described here performs
relatively well on small sets of genes, the larger gene lists
expected from microarray experiments need to be tested.
Furthermore, we derived a heuristic to partition the clustered
affinity matrix into clusters. We anticipate that this heuristic,
which is simply based on the sum of ratios of corresponding
values from adjacent columns, will generally work regardless
of the type of items being clustered. Generally, optimizing the
heuristic to partition a sorted matrix after BEA-based
clustering will be valuable. Finally, we are developing a
Web-based tool that will include a text mining phase to
identify functional keywords, and a gene clustering phase to
cluster the genes based on the shared functional keywords.
We believe that this tool should be useful for discovering
novel relationships among sets of genes because it links genes
by shared functional keywords rather than just reporting
known interactions based on published reports. Thus, genes
that never co-occur in the same publication could still be
linked by their shared keywords.
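As a rough illustration of what a partitioning heuristic based on the sum of ratios of corresponding values from adjacent columns might look like, consider the sketch below. The details are assumptions and not the published heuristic: each ratio is taken as the smaller value divided by the larger one, entries are assumed nonnegative, and the K - 1 boundaries with the lowest summed ratio are chosen as cluster breaks. All names and the example matrix are hypothetical.

```python
from typing import List


def boundary_scores(sorted_aff: List[List[float]]) -> List[float]:
    """Score each boundary between adjacent columns j and j+1.

    Assumed ratio: min/max of corresponding entries, summed over rows.
    A low score suggests the two columns belong to different clusters.
    """
    n = len(sorted_aff)
    scores = []
    for j in range(n - 1):
        total = 0.0
        for row in sorted_aff:
            hi = max(row[j], row[j + 1])
            if hi > 0:
                total += min(row[j], row[j + 1]) / hi
        scores.append(total)
    return scores


def partition(sorted_aff: List[List[float]], k: int) -> List[List[int]]:
    """Split column positions into k contiguous groups at the weakest boundaries."""
    scores = boundary_scores(sorted_aff)
    weakest = sorted(range(len(scores)), key=scores.__getitem__)[: k - 1]
    groups, start = [], 0
    for cut in sorted(weakest):
        groups.append(list(range(start, cut + 1)))
        start = cut + 1
    groups.append(list(range(start, len(sorted_aff))))
    return groups


# Hypothetical affinity matrix, already reordered so related items are adjacent.
sorted_aff = [
    [60, 50, 1, 1],
    [50, 60, 1, 1],
    [1, 1, 60, 40],
    [1, 1, 40, 60],
]
groups = partition(sorted_aff, 2)
```

The groups are lists of positions in the sorted matrix, so they would still need to be mapped back to gene names through the ordering produced by the clustering step.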
The BEA approach has been applied successfully to other
disciplines, such as operations research, production en-
gineering, and marketing [18]. The BEA-PARTITION
algorithm represents our extension to the BEA approach
specifically for dealing with the problem of discovering
functional similarity among genes based on functional
keywords extracted from literature. We believe that this
important clustering technique, which was originally
proposed by [16] to cluster questions on psychological
instruments and later introduced by [17] for clustering of
data items in database design, has promise for application
to other bioinformatics problems where starting matrices
are available from experimental observations.
ACKNOWLEDGMENTS
This work was supported by NINDS (RD) and the Emory-Georgia Tech Research Consortium. The authors would like to thank Brian Revennaugh and Alex Pivoshenk for research support.
REFERENCES
[1] C. Blaschke, J.C. Oliveros, and A. Valencia, “Mining Functional Information Associated with Expression Arrays,” Functional & Integrative Genomics, vol. 1, pp. 256-268, 2001.
[2] Y. Xu, V. Olman, and D. Xu, “EXCAVATOR: A Computer Program for Efficiently Mining Gene Expression Data,” Nucleic Acids Research, vol. 31, pp. 5582-5589, 2003.
[3] D. Chaussabel and A. Sher, “Mining Microarray Expression Data by Literature Profiling,” Genome Biology, vol. 3, pp. 1-16, 2002.
[4] V. Cherepinsky, J. Feng, M. Rejali, and B. Mishra, “Shrinkage-Based Similarity Metric for Cluster Analysis of Microarray Data,” Proc. Nat’l Academy of Sciences USA, vol. 100, pp. 9668-9673, 2003.
[5] J. Quackenbush, “Computational Analysis of Microarray Data,” Nature Rev. Genetics, vol. 2, pp. 418-427, 2001.
TABLE 11. Top Ranking Keywords Associated with Each Gene Cluster
[6] M.B. Eisen, P.T. Spellman, P.O. Brown, and D. Botstein, “Cluster Analysis and Display of Genome-Wide Expression Patterns,” Proc. Nat’l Academy of Sciences USA, vol. 95, pp. 14863-14868, 1998.
[7] R. Herwig, A.J. Poustka, C. Müller, C. Bull, H. Lehrach, and J. O’Brien, “Large-Scale Clustering of cDNA-Fingerprinting Data,” Genome Research, vol. 9, pp. 1093-1105, 1999.
[8] P. Tamayo, D. Slonim, J. Mesirov, Q. Zhu, S. Kitareewan, E. Dmitrovsky, E.S. Lander, and T.R. Golub, “Interpreting Patterns of Gene Expression with Self-Organizing Maps: Methods and Application to Hematopoietic Differentiation,” Proc. Nat’l Academy of Sciences USA, vol. 96, pp. 2907-2912, 1999.
[9] A.K. Jain, M.N. Murty, and P.J. Flynn, “Data Clustering: A Review,” ACM Computing Surveys, vol. 31, pp. 264-323, 1999.
[10] S. Raychaudhuri, J.T. Chang, F. Imam, and R.B. Altman, “The Computational Analysis of Scientific Literature to Define and Recognize Gene Expression Clusters,” Nucleic Acids Research, vol. 15, pp. 4553-4560, 2003.
[11] B. Kegl, “Principle Curves: Learning, Design, and Applications,” PhD dissertation, Dept. of Computer Science, Concordia Univ., Montreal, Quebec, 2002.
[12] T.K. Jenssen, A. Laegreid, J. Komorowski, and E. Hovig, “A Literature Network of Human Genes for High-Throughput Analysis of Gene Expression,” Nat’l Genetics, vol. 178, pp. 139-143, 2001.
[13] D.R. Masys, J.B. Welsh, J.L. Fink, M. Gribskov, I. Klacansky, and J. Corbeil, “Use of Keyword Hierarchies to Interpret Gene Expression Patterns,” Bioinformatics, vol. 17, pp. 319-326, 2001.
[14] S. Raychaudhuri, H. Schutze, and R.B. Altman, “Using Text Analysis to Identify Functionally Coherent Gene Groups,” Genome Research, vol. 12, pp. 1582-1590, 2002.
[15] M. Andrade and A. Valencia, “Automatic Extraction of Keywords from Scientific Text: Application to the Knowledge Domain of Protein Families,” Bioinformatics, vol. 14, pp. 600-607, 1998.
[16] W.T. McCormick, P.J. Schweitzer, and T.W. White, “Problem Decomposition and Data Reorganization by a Clustering Technique,” Operations Research, vol. 20, pp. 993-1009, 1972.
[17] S. Navathe, S. Ceri, G. Wiederhold, and J. Dou, “Vertical Partitioning Algorithms for Database Design,” ACM Trans. Database Systems, vol. 9, pp. 680-710, 1984.
[18] P. Arabie and L.J. Hubert, “The Bond Energy Algorithm Revisited,” IEEE Trans. Systems, Man, and Cybernetics, vol. 20, pp. 268-274, 1990.
[19] A.T. Ozsu and P. Valduriez, Principles of Distributed Database Systems, second ed. Prentice Hall Inc., 1999.
[20] Y. Liu, M. Brandon, S. Navathe, R. Dingledine, and B.J. Ciliax, “Text Mining Functional Keywords Associated with Genes,” Proc. Medinfo 2004, pp. 292-296, Sept. 2004.
[21] Y. Liu, B.J. Ciliax, K. Borges, V. Dasigi, A. Ram, S. Navathe, and R. Dingledine, “Comparison of Two Schemes for Automatic Keyword Extraction from MEDLINE for Functional Gene Clustering,” Proc. IEEE Computational Systems Bioinformatics Conf. (CSB 2004), pp. 394-404, Aug. 2004.
[22] P. Cheeseman and J. Stutz, “Bayesian Classification (Autoclass): Theory and Results,” Advances in Knowledge Discovery and Data Mining, pp. 153-180, AAAI/MIT Press, 1996.
[23] A. Strehl, “Relationship-Based Clustering and Cluster Ensembles for High-Dimensional Data Mining,” PhD dissertation, Dept. of Electric and Computer Eng., The University of Texas at Austin, 2002.
[24] R. Baeza-Yates and B. Ribeiro-Neto, Modern Information Retrieval. New York: Addison Wesley Longman, 1999.
[25] F. Sebastiani, “Machine Learning in Automated Text Categorization,” ACM Computing Surveys, vol. 34, pp. 1-47, 1999.
[26] P. Willett, “Recent Trends in Hierarchic Document Clustering: A Critical Review,” Information Processing and Management, vol. 24, pp. 577-597, 1988.
[27] J. Aslam, A. Leblanc, and C. Stein, “Clustering Data without Prior Knowledge,” Proc. Algorithm Eng.: Fourth Int’l Workshop, 1982.
[28] P.V. Balakrishnan, M.C. Cooper, V.S. Jacob, and P.A. Lewis, “A Study of the Classification Capabilities of Neural Networks Using Unsupervised Learning: A Comparison with K-Means Clustering,” Psychometrika, vol. 59, pp. 509-525, 1994.
Ying Liu received the BS degree in environmental biology from Nanjing University, China. He received Master’s degrees in bioinformatics and computer science from Georgia Institute of Technology in 2002. He is a PhD candidate in College of Computing, Georgia Institute of Technology, where he works on text mining biomedical literature to discover gene-to-gene relationships. His research interests include bioinformatics, computational biology, data mining, text mining, and database system. He is a student member of IEEE Computer Society.
Shamkant B. Navathe received the PhD degree from the University of Michigan in 1976. He is a professor in the College of Computing, Georgia Institute of Technology. He has published more than 130 refereed papers in database research; his important contributions are in database modeling, database conversion, database design, conceptual clustering, distributed database allocation, data mining, and database integration. Current projects include text mining of medical literature databases, creation of databases for biological applications, transaction models in P2P and Web applications, and data mining for better understanding of genomic/proteomic and medical data. His recent work has been focusing on issues of mobility, scalability, interoperability, and personalization of databases in scientific, engineering, and e-commerce applications. He is an author of the book, Fundamentals of Database Systems, with R. Elmasri (Addison Wesley, fourth edition, 2004) which is currently the leading database textbook worldwide. He also coauthored the book Conceptual Design: An Entity Relationship Approach (Addison Wesley, 1992) with Carlo Batini and Stefano Ceri. He was the general cochairman of the 1996 International VLDB (Very Large Data Base) Conference in Bombay, India. He was also program cochair of ACM SIGMOD 1985 at Austin, Texas. He is also on the editorial boards of Data and Knowledge Engineering (North Holland), Information Systems (Pergamon Press), Distributed and Parallel Databases (Kluwer Academic Publishers), and World Wide Web Journal (Kluwer). He has been an associate editor of IEEE Transactions on Knowledge and Data Engineering. He is a member of the IEEE.
Jorge Civera received the BSc degree in computer science from the Universidad Politecnica de Valencia in 2002, and the MSc degree in computer science from Georgia Institute of Technology in 2003. He is currently a PhD student at Departamento de Sistemas Informaticos y Computacion and a research assistant in the Instituto Tecnologico de Informatica. He is also with a fellowship from the Spanish Ministry of Education and Culture. His research interests include bioinformatics, machine translation, and text mining.
Venu Dasigi received the BE degree in electronics and communication engineering from Andhra University in 1979, the MEE degree in electronic engineering from the Netherlands Universities Foundation for International Cooperation in 1981, and the MS and PhD degrees in computer science from the University of Maryland, College Park in 1985 and 1988, respectively. He is currently professor and chair of computer science at Southern Polytechnic State University in Marietta, Georgia. He is also an honorary professor at Gandhi Institute of Technology and Management in India. He held research fellowships at the Oak Ridge National Laboratory and the Air Force Research Laboratory. His research interests include text mining, information retrieval, natural language processing, artificial intelligence, bioinformatics, and computer science education. He is a member of ACM and the IEEE Computer Society.
Ashwin Ram received the PhD degree from Yale University in 1989, the MS degree from the University of Illinois in 1984, and the BTech degree from IIT Delhi in 1982. He is an associate professor in the College of Computing at the Georgia Institute of Technology, an associate professor of Cognitive Science, and an adjunct professor in the School of Psychology. He has published two books and more than 80 scientific papers in international forums. His research interests lie in artificial intelligence and cognitive science, and include machine learning, natural language processing, case-based reasoning, educational technology, and artificial intelligence applications.
Brian J. Ciliax received the BS degree in biochemistry from Michigan State University in 1981, and the PhD degree in pharmacology from the University of Michigan in 1987. He is currently an assistant professor in the Department of Neurology at Emory University School of Medicine. His research interests include the functional neuroanatomy of the basal ganglia, particularly as it relates to hyperkinetic movement disorders such as Tourette’s Syndrome. Since 2000, he has collaborated with the coauthors on the development of a system to functionally cluster genes (identified by high-throughput genomic and proteomic assays) according to keywords mined from relevant MEDLINE abstracts.
Ray Dingledine received the PhD degree in pharmacology from Stanford. He is currently professor and chair of pharmacology at Emory University and serves on the Scientific Council of NINDS at NIH. His research interests include the application of microarray and associated technologies to identify novel molecular targets for neurologic disease, the normal functions and pathobiology of glutamate receptors, and the role of COX2 signaling in neurologic disease.
2004 Reviewers List
We thank the following reviewers for the time and energy they have given to TCBB:
A
John Aach, Tatsuya Akutsu, David Aldous, Aijun An, Iannis Apostolakis, Lars Arvestad, Daniel Ashlock, Kevin Atteson, Wai-Ho Au
B
Rolf Backofen, David Bader, Tim Bailey, Tomas Balla, Serafim Batzoglou, Gil Bejerano, Amir Ben-Dor, Asa Ben-Hur, Anne Bergeron, Olaf Bininda-Emonds, Riccardo Boscolo, Guillaume Bourque, Alvis Brazma, Daniel Brown, Duncan Brown, Barb Bryant, David Bryant, Jeremy Buhler, Joachim Buhmann
C
Andrea Califano, Colin Campbell, Alberto Caprara, Keith Chan, Claudine Chaouiya, Ferdinando Cicalese, Melissa Cline, David Corne, Nello Cristianini, Miklos Csuros, Adele Cutler
D
Patrik D’haeseleer, Michiel de Hoon, Arthur Delcher, Alain Denise, Marcel Dettling, Inderjit S. Dhillon, Diego di Bernardo, Adrian Dobra, Bruce R. Donald, Sebastián Dormido-Canto, Zhihua Du, Blythe Durbin
E
Nadia El-Mabrouk, Charles Elkan, Eleazar Eskin
F
Giancarlo Ferrari-Trecate, Liliana Florea, Gary Fogel, Yoav Freund, Jane Fridlyand, Yan Fu, Terrence Furey, Cesare Furlanello
G
Olivier Gascuel, Dan Geiger, Zoubin Ghahramani, Debashis Ghosh, Pulak Ghosh, Raffaele Giancarlo, Robert Giegerich, David Gilbert, Jan Gorodkin, John Goutsias, Daniel Gusfield, Isabelle M. Guyon, Adolfo Guzman-Arenas
H
Sridhar Hannenhalli, Alexander Hartemink, Tzvika Hartman, Lisa Holm, Paul Horton, Steve Horvath, Xiao Hu, Haiyan Huang, Alan Hubbard, Katharina Huber, Dirk Husmeier, Daniel Huson
J
Inge Jonassen, Rebecka Jornsten
K
Jaap Kaandorp, Markus Kalisch, Rachel Karchin, Juha Karkkainen, Kevin Karplus, Simon Kasif, Samuel Kaski, Ed Keedwell, Purvesh Khatri, Hyunsoo Kim, Junhyong Kim, Ross D. King, Andrzej Konopka, Hamid Krim, Nandini Krishnamurthy, Gregory Kucherov, David Kulp
L
Michelle Lacey, Wai Lam, Giuseppe Lancia, Michael Lappe, Richard Lathrop, Nicolas Le Novere, Thierry LeCroq, Hansheng Lei, Boaz Lerner, Christina Leslie, Ilya Levner, Dequan Li, Fan Li, Jinyan Li, Wentian Li, Jie Liang, Olivier Lichtarge, Charles Ling, Michal Linial, Huan Liu, Zhenqiu Liu, Stanley Loh, Heitor Lopes, Rune Lyngsoe
M
Bin Ma, Patrick Ma, François Major, Elisabetta Manduchi, Mark Marron, Jens Meiler, Stefano Merler, Webb Miller, Marta Milo, Satoru Miyano, Annette Molinaro, Shinichi Morishita, Vincent Moulton, Marcus Mueller, Sayan Mukherjee, Rory Mulvaney, T.M. Murali, Simon Myers
N
Iftach Nachman, Luay Nakhleh, Anand Narasimhamurthy, Gonzalo Navarro, William Noble
O
Enno Ohlebusch, Arlindo Oliveira, Jose Oliver, Christos Ouzounis
P
Junfeng Pan, Rong Pan, Wei Pan, Paul Pavlidis, Itsik Pe’er, Christian Pedersen, Anton Petrov, Tuan Pham, Katherine Pollard, Gianluca Pollastri, Calton Pu
R
John Rachlin, Mark Ragan, Jagath Rajapakse, R.S. Ramakrishna, Isidore Rigoutsos, Dave Ritchie, Fredrik Ronquist, Juho Rousu, Jem Rowland, Larry Ruzzo, Leszek Rychlewski
S
Gerhard Sagerer, Steven Salzberg, Herbert Sauro, Alejandro Schaffer, Alexander Schliep, Scott Schmidler, Jeanette Schmidt, Alexander Schönhuth, Charles Semple, Soheil Shams, Roded Sharan, Chad Shaw, Dinggang Shen, Dou Shen, Lisan Shen, Stanislav Shvartsman, Amandeep Sidhu, Richard Simon, Sameer Singh, Janne Sinkkonen, Steven S. Skiena, Quinn Snell, Carol Soderlund, Rainer Spang, Peter Stadler, Mike Steel, Gerhard Steger, Jens Stoye, Jack Sullivan, Krister Swenson
T
Pablo Tamayo, Amos Tanay, Chun Tang, Jijun Tang, Thomas Tang, Glenn Tesler, Robert Tibshirani, Martin Tompa, Anna Tramontano, James Troendle, Jerry Tsai, Koji Tsuda, John Tyson
V
Eugene van Someren, Stella Veretnik, David Vogel, Gwenn Volkert
W
Baoying Wang, Chang Wang, Lisan Wang, Tandy Warnow, Michael K. Weir, Jason Weston, Ydo Wexler, Nalin Wickramarachchi, Chris Wiggins, David Wild, Tiffani Williams, Thomas Wu
X
Dong Xu, Jinbo Xu
Y
Qiang Yang, Yee Hwa Yang, Zizhen Yao, Daniel Yekutieli, Jeffrey Yu
Z
Mohammed J. Zaki, An-Ping Zeng, Chengxiang Zhai, Jingfen Zhang, Kaizhong Zhang, Xuegong Zhang, Yang Zhang, Zhi-Hua Zhou, Zonglin Zhou, Ji Zhu