
A Polynomial-time Maximum Common Subgraph Algorithm for Outerplanar Graphs and its Application to Chemoinformatics

Leander Schietgat, Jan Ramon, Maurice Bruynooghe

Department of Computer Science - Katholieke Universiteit Leuven
{firstname.lastname}@cs.kuleuven.be

Corresponding Author

Leander Schietgat
Department of Computer Science - Katholieke Universiteit Leuven
Celestijnenlaan 200A, 3001 Leuven, Belgium

Tel: +32 (0)16 32 76 37
Fax: +32 (0)16 32 79 96
E-mail: [email protected]

Abstract

Metrics for structured data have received an increasing interest in the machine learning community. Graphs provide a natural representation for structured data, but a lot of operations on graphs are computationally intractable. In this article, we present a polynomial-time algorithm that computes a maximum common subgraph of two outerplanar graphs. The algorithm makes use of the block-and-bridge preserving subgraph isomorphism, which has significant efficiency benefits and is also motivated from a chemical perspective. We focus on the application of learning structure-activity relationships, where the task is to predict the chemical activity of molecules. We show how the algorithm can be used to construct a metric for structured data and we evaluate this metric and more generally also the block-and-bridge preserving matching operator on 60 molecular datasets, obtaining state-of-the-art results in terms of predictive performance and efficiency.

Keywords: metrics for structured data, graph mining, chemoinformatics, learning structure-activity relationships


1 Introduction

Metrics are important components of several machine learning methods, such as instance-based learning or clustering. Recently, there has been an increased interest in metrics that express a similarity between structured objects instead of propositional objects. This is relevant for multiple application domains, including drug discovery [5], image recognition and computer vision [42].

The application on which we focus is the learning of structure-activity relationships (SAR) in drug discovery, which is an important step in the identification of molecules that play an active role in the regulation of biological processes or disease states. Even though it has become easier in recent years to acquire molecular data, this process, which is often referred to as screening, is still costly and time-consuming. For this reason, pharmaceutical companies are interested in SAR techniques that can predict the activity of molecules automatically and can therefore seriously decrease the number of molecules that need to be screened in the lab. Various properties of molecules are of interest in drug development, such as toxicity, carcinogenicity or biodegradability.

Since it is widely known that molecules with a similar structure tend to have the same function [24], structural similarity search among small molecules is the standard tool used for virtual screening and in silico drug development. The task then comes down to finding an appropriate similarity measure between molecules. Such a structural similarity measure should ideally fulfil two requirements: (1) it should be efficiently computable, which is important when analyzing large molecular databases, and (2) the notion of similarity should discriminate between molecules w.r.t. the activity of interest. Finding such similarity measures is one of the current challenges in chemoinformatics [5, 12].

Because graphs provide a natural representation of the atom-bond structure of molecules, they have become very popular in the context of SAR: each vertex of the graph then represents an atom, while an edge represents a molecular bond [38]. However, algorithms on graphs that aim at using all available structural information often involve the matching of subgraphs or other combinatorial operations trying to align graphs optimally. For this reason, typical approaches have resorted to a propositional representation by transforming molecules into vectors. In such transformations either information is lost or the size of the resulting representation can grow exponentially [12].

In order to define a similarity measure between molecules, we use the notion of a maximum common subgraph (MCS) between their graph representations. Intuitively, an MCS of two graphs is a connected subgraph common to both graphs, such that no other common subgraph has a size that is strictly larger. The ratio between the size of the MCS of the two graphs and their own size then reflects their similarity. The idea behind this is that an MCS of two molecular graphs may reflect shared properties that relate to the molecular activity. An example of an MCS of two molecular graphs is shown in Fig. 1. Given our target application, the motivation for this MCS-based similarity measure is fourfold: (1) the MCS of structurally related drugs is likely to be an important component of their activities [4], (2) it is known that methods based on computing MCSs


Figure 1: A maximum common subgraph (MCS) of two molecular graphs (highlighted in gray).

work well for this type of task [38], (3) MCSs are able to reflect local similarities in the graphs rather than global ones, and (4) because of the visual nature of graphs, it is possible for chemists to investigate the MCSs that are predicted to be responsible for certain activities.

Finding the possibly multiple MCSs of two graphs, however, is known to be an NP-hard problem [14]. Nevertheless, previous theoretical work on graphs has shown that in many cases, when certain constraints on the structure of the graphs hold, efficient matching algorithms exist [33, 30]. Sequences, trees and outerplanar graphs (the latter are planar graphs which can be embedded in the plane in such a way that all of their vertices lie on the boundary of the outer face) are examples of such subclasses of general graphs. Investigation of the NCI database1, a collection of datasets containing over 250,000 chemical compounds, revealed that 94.5% of these graphs are outerplanar while only 8.8% of them are trees [21]. Hence, algorithms that can deal with outerplanar graphs have a much broader applicability than those handling trees, while still being efficient.

Recently, Horvath et al. [21, 22] introduced a new "block-and-bridge-preserving" (BBP) matching operator for outerplanar graphs. It is based on the idea that it only makes sense to match "similar" parts of the graphs. The authors show that, using this operator, the computation of pattern embeddings, that is, checking whether a pattern is subgraph isomorphic to graphs in the dataset, is polynomial, such that frequent patterns can be mined efficiently in outerplanar graphs.

In this research, the task is to find a connected graph of maximum size that is subgraph isomorphic to two given graphs. Maximum common subgraphs can be regarded as local patterns in the sense that they are frequent with support 2 (or more). It is shown in [40] that they are useful patterns for predictive tasks. In Schietgat et al. [39], we presented a polynomial-time algorithm that computes a maximum common subgraph between two outerplanar graphs under the block-and-bridge preserving subgraph isomorphism and we showed that a metric using

1 National Cancer Institute (http://cactus.nci.nih.gov/).


this algorithm obtains favorable results on several prediction tasks in chemoinformatics. This metric has three important properties: (1) it is computable in polynomial time, (2) as it reflects the size of the MCS, it has an intuitive meaning making results of methods such as instance-based learning interpretable, and (3) it provides the right level of abstraction in order to discriminate between molecules with different activities.

In this paper, we extend this work along several dimensions:

• We provide a more detailed technical description of the MCS algorithm, with a proof of its correctness. We also provide new complexity bounds on the algorithm and prove these bounds.

• We compare our algorithm to a state-of-the-art algorithm that computes MCSs under the general subgraph isomorphism, in terms of predictive performance and efficiency.

• We extend our experimental evaluation, not only showing that our metric outperforms state-of-the-art metrics on several classification tasks from chemoinformatics, but also by showing the advantages of the BBP subgraph isomorphism over the general subgraph isomorphism in terms of predictive performance and efficiency in other learning techniques.

The text is organized as follows. In Sect. 2, we start by introducing the necessary graph-theoretical concepts and then we present our MCS algorithm in detail. In Sect. 3, we present an experimental evaluation of the algorithm. In Sect. 4, we discuss previous work and in Sect. 5, we draw conclusions and present future work.

2 Computing a Maximum Common Subgraph of Two Outerplanar Graphs

In this section, we first give the necessary graph-theoretical concepts (Sect. 2.1) and then present the MCS algorithm (Sect. 2.2).

2.1 Graph-theoretical concepts

This section gives the relevant definitions necessary to understand the algorithm that computes an MCS of two outerplanar graphs. For an overview of graph theory, we refer to an introductory textbook [13].

Definition 1 (Rooted Graph). A labeled graph G is a quadruple (V,E,Σ,λ), with V a finite set of vertices and E ⊆ {{u, v} | u, v ∈ V, u ≠ v} a set of edges. Σ is a finite set of labels and λ : V ∪ E → Σ is a function assigning a label to each element of V ∪ E. A rooted graph G is a quintuple (V,E,Σ,λ,r) with (V,E,Σ,λ) a labeled graph and r, the root, one of its vertices.


Given a graph G, we denote its vertices with V(G) and its edges with E(G). If the graph is rooted, we denote its root with r(G). Sometimes, it will be convenient to denote a rooted graph as Gr with r the root of the graph. The set of all graphs is denoted as G. As we define edges as sets of two distinct vertices, our graphs do not have reflexive edges (loops) and all edges are undirected, i.e., we consider simple graphs.

Definition 2 (Size). The size |·| : G → ℝ of a graph is a function mapping a graph to a real number of the form |G| = Σ_{x∈V(G)∪E(G)} w_{λG(x)}, where each possible label l ∈ Σ has been assigned a non-negative weight w_l.

Example 1. Assume that all vertices v ∈ V(G) have the same label λG(v) = l1 and all edges e ∈ E(G) have the same label λG(e) = l2. Let w_{l1} = 1 and w_{l2} = 0. Then the corresponding size function maps every graph G to its number of vertices.
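To make Definition 2 concrete, the following minimal Python sketch computes |G| as the weighted sum over vertex and edge labels; the dictionary-based graph representation and the weight table are illustrative assumptions, not part of the paper.

    # A labeled graph as plain dictionaries: vertex -> label and edge (frozenset) -> label.
    # Hypothetical example data; the weights assign 1 to atom labels and 0 to bond labels,
    # so size(G) reduces to the number of vertices, as in Example 1.

    def graph_size(vertex_labels, edge_labels, w):
        """|G| = sum of w[label(x)] over all vertices and edges x of G."""
        total = 0.0
        for label in vertex_labels.values():
            total += w[label]
        for label in edge_labels.values():
            total += w[label]
        return total

    vertex_labels = {1: "C", 2: "C", 3: "N"}
    edge_labels = {frozenset({1, 2}): "single", frozenset({2, 3}): "double"}
    w = {"C": 1, "N": 1, "single": 0, "double": 0}
    print(graph_size(vertex_labels, edge_labels, w))  # 3.0 = number of vertices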

Definition 3 (Path, Cycle). A sequence x0, x1, . . . , xn of vertices is a path from x0 to xn iff {xi, xi+1} ∈ E(G) for all i ∈ [0, n−1]. A cycle x0, . . . , xn is a path such that x0 = xn. A path without repeated vertices is a simple path2; a cycle without repeated vertices apart from the start and end vertex is a simple cycle.

Definition 4 (Connected). A graph G is connected if there is a path between any pair of its vertices in G; it is biconnected if for any two vertices u and v of G, there is a simple cycle in G containing u and v.

Definition 5 (Planar graph). A graph is planar if it has a planar embedding, i.e., it can be drawn in the plane in such a way that no two edges intersect except at a vertex in common. The regions formed by the edges in a planar embedding are called faces. There is one unbounded face, which is called the outer face.

Definition 6 (Outerplanar graph). An outerplanar graph is a planar graph which can be embedded in the plane in such a way that all of its vertices lie on the boundary of the outer face.

We denote the set of all outerplanar graphs with Gop. Note that the boundary of the outer face is not connected when the outerplanar graph is not connected.

Definition 7 (Tree). A tree T is a graph for which there is a unique path between every pair of vertices u, v ∈ T.

Definition 8 (Subgraph). Let G and H be graphs. G is a subgraph of H if (i) V(G) ⊆ V(H), (ii) E(G) ⊆ E(H), and (iii) λG(x) = λH(x) holds for every x ∈ V(G) ∪ E(G).

Definition 9 (Block). A biconnected component or block of a graph G is a maximal subgraph of G that is biconnected.

2 In the literature, alternative terminology might be used. Our definition of a path is then called a walk, and our definition of a simple path is then called a path.


Definition 10 (Bridge). In a graph G, a bridge is an edge of G that does not belong to a block.

For a graph G we number the blocks and denote them Bl1(G), Bl2(G), . . . , Bln(G). With Bl(G), we denote all the edges that are involved in blocks of G and with Br(G) all the bridges of G.

Any graph consists entirely of blocks and bridges, that is, its edges can be partitioned as E(G) = Br(G) ∪ Bl1(G) ∪ . . . ∪ Bln(G) with n the number of blocks, Br(G) ∩ Bl(G) = ∅ and ∀i, j (i ≠ j) : Bli(G) ∩ Blj(G) = ∅. Two blocks can share a vertex, but no edge. If two blocks shared an edge, they could not both be maximal biconnected subgraphs, as their union would be larger and still biconnected.

In outerplanar graphs, the bridges are always adjacent to the outer face, while block edges are adjacent to two faces, one of which can possibly be the outer face.
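As an illustration of the block/bridge decomposition used throughout this section (not the authors' implementation), the following sketch uses the networkx library to split a small molecule-like graph into its bridges and blocks; the example graph is made up for illustration.

    import networkx as nx

    # A toy graph: a 3-cycle (one block) with two pendant edges (bridges).
    G = nx.Graph()
    G.add_edges_from([(1, 2), (2, 3), (3, 1),   # block (ring)
                      (3, 4), (4, 5)])          # bridges (linear fragment)

    bridges = set(nx.bridges(G))
    # Blocks are the biconnected components with more than two vertices.
    blocks = [c for c in nx.biconnected_components(G) if len(c) > 2]

    print("bridges:", bridges)   # {(3, 4), (4, 5)} (up to edge orientation)
    print("blocks:", blocks)     # [{1, 2, 3}]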

Example 2. Figure 2a shows an example of a non-outerplanar graph. Indeed, in all possible embeddings of the graph, there is one vertex (labeled v) that is not on the outside of the graph. The graphs in Fig. 2b and Fig. 2c are, however, outerplanar. Every vertex is labeled with a color representing a chemical element: black for carbon, white for hydrogen and blue for nitrogen. Note that in the graph of Figure 2a, the edges between carbons and hydrogens are bridges, while the rest of the edges are involved in blocks. From a chemical viewpoint, blocks correspond to ring structures while bridges are linear fragments of the molecule.

Definition 11 (Graph isomorphism). Two graphs G and H are isomorphic under ϕ if ϕ is a function from V(G) to V(H) such that (i) ϕ is a bijection, (ii) ∀u, v : {u, v} ∈ E(G) ⇔ {ϕ(u), ϕ(v)} ∈ E(H), (iii) ∀u : λG(u) = λH(ϕ(u)), (iv) ∀u, v : {u, v} ∈ E(G) ⇒ λG({u, v}) = λH({ϕ(u), ϕ(v)}).

Definition 12 (Subgraph isomorphism). A graph G is subgraph isomorphic to H under ϕ, denoted G ≼ϕ H, iff G is isomorphic to a subgraph of H under ϕ.

If the function ϕ is clear from the context, we omit it from the notation, i.e., G ≼ H. The subgraph isomorphism problem, which decides whether G is subgraph isomorphic to H, is known to be NP-complete [14]; this also holds for outerplanar graphs [45].
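For readers who want to experiment with the (general, label-preserving) subgraph isomorphism of Definition 12, the sketch below uses networkx's VF2 matcher; the molecule-like graphs and label names are illustrative assumptions, and subgraph_is_monomorphic (available in recent networkx versions) is used because Definition 12 does not require an induced subgraph.

    import networkx as nx
    from networkx.algorithms import isomorphism

    def make_graph(edges):
        """Build a labeled graph; nodes carry an 'atom' label, edges a 'bond' label."""
        G = nx.Graph()
        for u, lu, v, lv, bond in edges:
            G.add_node(u, atom=lu)
            G.add_node(v, atom=lv)
            G.add_edge(u, v, bond=bond)
        return G

    H = make_graph([(1, "C", 2, "C", "single"), (2, "C", 3, "N", "single")])
    G = make_graph([(10, "C", 11, "N", "single")])   # pattern: C-N

    matcher = isomorphism.GraphMatcher(
        H, G,
        node_match=isomorphism.categorical_node_match("atom", None),
        edge_match=isomorphism.categorical_edge_match("bond", None))
    print(matcher.subgraph_is_monomorphic())  # True: G is subgraph isomorphic to H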

Definition 13 (BBP subgraph isomorphism). A subgraph isomorphism from G to H under ϕ is a block-and-bridge-preserving (BBP) subgraph isomorphism if (i) ∀u, v : {u, v} ∈ Br(G) ⇔ {ϕ(u), ϕ(v)} ∈ Br(H) and (ii) ϕ maps edges from different blocks to edges from different blocks.

We denote that a graph G is BBP subgraph isomorphic to H under ϕ by G ⊑ϕ H. Again, we omit the subscript ϕ if not needed. The BBP subgraph isomorphism maps each bridge of G to a bridge of H and each block of G to a part of a block of H. While the latter part is biconnected, it is not necessarily the whole block. Contrary to the subgraph isomorphism problem, the BBP



Figure 2: Examples of molecular graphs. The colors of the vertices correspond to their labels: black for carbon, white for hydrogen and blue for nitrogen. (a) Example of a non-outerplanar graph. For all of its plane embeddings, v never lies in the outer face. (b) Two molecules with a maximum common subgraph computed under the general subgraph isomorphism (MCS≼), highlighted in gray. (c) The same two molecules with a maximum common subgraph computed under the BBP subgraph isomorphism (MCS⊑), highlighted in gray.

subgraph isomorphism problem is computable in polynomial time for outerplanar graphs [21]. For trees, which form a subset of outerplanar graphs (they are connected and block-free), the BBP subgraph isomorphism is equivalent to the subtree isomorphism [41].

Definition 14 (Maximum common subgraph). A common subgraph I of two graphs G and H is a connected graph such that I is isomorphic to a subgraph of G and to a subgraph of H (I ≼ G and I ≼ H); it is a maximum common subgraph (MCS) when in addition there exists no other common subgraph J such that size(I) < size(J).

Note that we only consider connected MCSs. We use the abbreviation MCS≼ to stress that we refer to a maximum common subgraph under the (general) subgraph isomorphism and MCS⊑ when it is under the BBP subgraph isomorphism. As a block can have subblocks, an MCS⊑ can include only a part of a block of the given graphs. Since the BBP subgraph isomorphism is a restricted version of the general subgraph isomorphism, an MCS⊑ is subgraph isomorphic to one of the MCSs≼. For example, the MCS⊑ of the graphs in Fig. 2c is subgraph isomorphic to the MCS≼ of the graphs in Fig. 2b. In the worst case, two general graphs can have an exponential number of MCSs⊑ and computing an MCS⊑ of two general graphs is NP-hard [14]. In Sect. 2.2.5, however, we will show that it is possible to compute an MCS⊑ of two outerplanar graphs in polynomial time by using the block-and-bridge-preserving (BBP) subgraph isomorphism [21]. From an application point of view, the BBP subgraph isomorphism


is motivated by the fact that, in molecules, cyclic structures and linear fragments usually behave differently, and hence treating them separately is likely to be beneficial.

Figure 2 shows a comparison between the MCS≼ (b) and the MCS⊑ (c). In both examples, the MCS is highlighted in gray. Note that one of the edges is a bridge in the upper graph, while it belongs to a block in the lower graph (marked with a * in both graphs) and hence, it cannot be mapped under the BBP subgraph isomorphism. Chemists consider it relevant not to map linear fragments onto fragments that are part of a ring structure. From now on, we often use MCS to indicate an MCS⊑.

We end this section by recalling some definitions and results about matchings, as introduced in [34].

Definition 15 (Matching). A matching f between sets A and B is a relation (i.e. a subset of A × B) such that for all (a1, b1), (a2, b2) ∈ f : a1 = a2 iff b1 = b2, so each element of A is associated with at most one element of B and vice versa.

Definition 16 (Weighted maximal matching). The weighted maximal matching problem, also known as the assignment problem, is an optimisation problem where two sets A and B are given together with a weight function w : A × B → ℝ and the task is to find a matching m between A and B such that Σ_{(a,b)∈m} w(a, b) is maximal.

Computing a weighted maximal matching can be done in time O((n + m) · n^{1/2}), with n = |A| + |B| and m the number of pairs between A and B having a weight greater than zero [19].
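As a concrete illustration of the assignment problem (independent of the matching routine used by the authors), the sketch below solves a small weighted maximal matching with SciPy; the weight matrix is made-up example data.

    import numpy as np
    from scipy.optimize import linear_sum_assignment

    # weights[i][j] = w(a_i, b_j); zero means "no useful pairing".
    weights = np.array([[3.0, 0.0, 2.0],
                        [1.0, 4.0, 0.0]])

    # linear_sum_assignment with maximize=True solves the assignment problem.
    rows, cols = linear_sum_assignment(weights, maximize=True)

    # Keep only pairs with positive weight, mirroring a maximal matching
    # in which useless pairs are simply left unmatched.
    matching = [(i, j) for i, j in zip(rows, cols) if weights[i, j] > 0]
    total = sum(weights[i, j] for i, j in matching)
    print(matching, total)   # [(0, 0), (1, 1)] 7.0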

2.2 Algorithm

For simplicity of exposition, in this section we construct an MCS⊑ for a pair of connected outerplanar graphs. The extension to non-connected graphs is straightforward. So, from here onwards, we assume graphs are connected.

To compute an MCS⊑ of two outerplanar graphs, we apply a dynamic programming strategy. The graphs are decomposed into subgraphs (parts) and the MCS is computed from the MCSs between the subgraphs. To come to an efficient algorithm, we should not consider all possible subgraphs, but have to judiciously design a decomposition that includes enough subgraphs to find an MCS, while at the same time avoiding redundancy. A first step is to consider rooted graphs and to compute their rooted maximum common subgraph (RMCS), which we can define as follows.

Definition 17 (Rooted maximum common subgraph). S is a rooted common subgraph of rooted outerplanar graphs G and H iff S is a common subgraph of G and H, i.e., ∃ϕG, ϕH : S ⊑ϕG G and S ⊑ϕH H, and moreover, ϕG(r(S)) = r(G) and ϕH(r(S)) = r(H). S is a rooted maximum common subgraph (RMCS⊑) iff there does not exist another rooted common subgraph T that is larger, i.e., for which size(T) > size(S).


Without loss of generality, in the rest of the section we can assume that S is a subgraph of G with the same root as G, i.e., that ϕG is the identity mapping.

We first describe the decomposition of a connected rooted outerplanar graph into parts (Sect. 2.2.1). Then, in Sect. 2.2.2 we define the parts to be used in our algorithm and prove some properties about them. In Sect. 2.2.3, we give the algorithm for computing an MCS. In Sect. 2.2.4, we prove its correctness and finally, in Sect. 2.2.5, we describe the time complexity of the algorithm.

2.2.1 Decomposition of a connected rooted outerplanar graph

The BBP subgraph isomorphism maps bridges to bridges and blocks to (parts of) blocks, hence we can say that an MCS of two graphs is block-preserving, a concept we formally define as:

Definition 18 (Block-preserving subgraph). A subgraph S of G is a block-preserving subgraph (BPS) iff bridges of S are bridges in G.

From the definition, it follows that blocks in S are also (parts of) blocks in G. We distinguish between basic-root BPSs and compound-root BPSs.

Definition 19 (Basic-root and compound-root BPS). A rooted BPS S of a graph G is basic-root if either r(S) is incident with at most one edge (a bridge) or all edges incident with r(S) belong to the same block. In the other case, S is a compound-root BPS.

Let G and H be connected graphs and assume S is one of their MCSs (S ⊑ϕG G – note that ϕG is the identity mapping – and S ⊑ϕH H). Arbitrarily selecting one vertex of G as root turns G into Gr, a rooted BPS and subgraph of G. The idea behind our approach is to decompose Gr into rooted parts in such a way that one can be sure to obtain a part P s with its root s an element of S and such that Ss ⊑ϕG P s. H is decomposed in a similar way such that one can be sure to obtain a part Qt with Ss ⊑ϕH Qt and ϕH(s) = t. Now Ss is the RMCS of P s and Qt and the decomposition is such that this RMCS can be easily constructed from the RMCSs between the parts of P s and Qt. To achieve this and obtain an RMCS computation in polynomial time, we have to carefully design a decomposition that includes enough parts while avoiding too much redundancy.

Selecting a vertex r as root in a connected outerplanar graph G turns the graph into a rooted BPS of G which can either be basic-root or compound-root. First we describe how to decompose a compound-root BPS, then we do the same for a basic-root BPS.

Definition 20 (Elementary parts of a compound-root BPS). Let Gr be a compound-root BPS of a graph G.

• For each bridge {r, i} in Gr, the connected subgraph P r with root r that remains after removing all edges {r, j} for j ≠ i is an elementary part of Gr.



Figure 3: A compound-root BPS Gr (a) and its two elementary parts. The elementary part P r1 (b) is obtained by removing all edges not belonging to the block in which r is involved. The elementary part P r2 (c) is obtained by removing all edges adjacent to r except for the bridge {r, i}.

• For each block B in Gr that contains r, the connected subgraph P r with root r that remains after removing all edges {r, i} that do not belong to B is an elementary part of Gr.

Note that the elementary parts of a compound-root BPS of G are basic-root BPSs of G. Figure 3 shows two examples of elementary parts of a compound-root BPS. Both are basic-root BPS elementary parts.

To decompose a basic-root BPS, the idea is to remove an edge containing the root. This is straightforward when there is only one such edge, such as graph (a) in Fig. 4. However, it is more involved when the root is part of a block. We need some notation to make clear which edge(s) are removed from a block. Every block of an outerplanar graph G has a unique Hamiltonian cycle over its vertices [33]. Given a planar embedding of G, we can enumerate the block's vertices according to either the clockwise (↻) or the counterclockwise (↺) orientation. Given an orientation and a root, the block's vertices are ordered by the Hamiltonian cycle {r, . . . , r} following the orientation; we say i < j when i comes before j; we can refer to the first and the last vertex, to the vertices and/or edges in an interval [u, v], and so on. When removing vertices and edges of a block along the Hamiltonian cycle, the resulting subgraph is not necessarily a block-preserving subgraph. If not, we call it a block-splitting subgraph.

Definition 21 (Block-splitting subgraph). Let Gr be a basic-root BPS with root r belonging to a block B; let o be an orientation and let {r, . . . , f, . . . , r} be the Hamiltonian cycle following orientation o of block B. With f ≠ r, the graph obtained by removing all vertices and edges on the interval [r, f] of the Hamiltonian cycle (except for r and f themselves) and removing zero or more edges


{f, i} not belonging to block B (as well as all edges adjacent to a removed vertex and parts no longer connected to the remaining segment) is a block-splitting subgraph (BSS) of Gr if it has no edge between f and r. We denote it as o[f, r] and f is its root.

Note that the subgraph construction in the definition results in a BPS when, after removing the edges of the interval [r, f], there is still an edge between f and r. This is for example the case in graph (c) of Fig. 4 when removing the vertices and edges in ↺b[i, r]. For computing an MCS, there will be no need to consider subgraphs where more than one block is split. As for BPSs, we can distinguish two kinds of BSSs.

Definition 22 (Basic-root and compound-root BSS). A BSS o[f, r] of a BPS Gr is a basic-root BSS if f participates only in edges of the split block; otherwise it is a compound-root BSS.

When relevant to make the distinction, we write oc[f, r] and ob[f, r] for respectively a compound-root and a basic-root BSS over the same interval. To construct an MCS we do not need all BSSs of a basic-root BPS. Also here we make a judicious selection.

Definition 23 (Elementary parts of a basic-root BPS). Let Gr be a basic-root BPS of a graph G. If r belongs to a bridge {r, i}, then the BPS P i with root i that remains after removing the bridge is the only elementary part of Gr.

Otherwise, r belongs to a block B. Let o be an orientation and {r, 1, . . . , n, r} (n > 1) be the Hamiltonian cycle following orientation o.

• For each edge {r, i} with 1 ≤ i < n, the BSS o[i, r] obtained by removing {r, i} and all vertices of B not belonging to the block interval [i, r] (and the edges adjacent to such vertices and the parts not connected to the vertices on the interval) is an elementary part of Gr.

• In case vertex 1 belongs to more than two edges of B, it is part of a subblock not containing r. In that case, let j ≠ r be the largest vertex (in the order {1, . . . , n} of orientation o) such that {1, j} is an edge of B. The BPS P 1 with root 1 obtained by removing all vertices of B not belonging to the block interval [1, j] (and the edges adjacent to such vertices and the parts not connected to the vertices on the interval [1, j]) is an elementary part of Gr.

The elementary parts can be basic-root or compound-root, depending on the property of their root. An example is shown in Fig. 4. In graph (a), the root is on a bridge and there is a single BPS, namely graph (b). In graph (c), the root belongs to a block. There are two BSSs according to orientation ↻ (graphs (d) and (e)) and two according to orientation ↺ (graphs (f) and (g)). Finally, for orientation ↻, there is also a BPS (graph (h)).

The attentive reader may wonder whether both orientations need to be considered. As will become clear, when computing the MCS of graphs G and H,


[Figure 4 panels: (a) Gr, (b) P i, (c) Gr, (d) ↻b[1, r], (e) ↻b[i, r], (f) ↺b[n, r], (g) ↺b[i, r], (h) P 1.]

Figure 4: (a) A basic-root BPS Gr. (b) Its only elementary part, a compound-root BPS. (c) Another basic-root BPS Gr, now with r belonging to a block. It has 5 elementary parts. Graphs (d), (e), (f) and (g) are basic-root BSSs, graph (h) a basic-root BPS.

a fixed orientation will suffice when decomposing G. However, for decomposing H, both orientations will be needed.

We still have to decompose a BSS into elementary parts. For a compound-root BSS, this is very similar to the decomposition of a compound-root BPS. The parts almost partition the graph; only the root is shared between them.

Definition 24 (Elementary parts of a compound-root BSS). Let oc[i, r] be a compound-root BSS of Gr.

• For each bridge {i, j} in oc[i, r], the connected basic-root BPS P i that remains after removing all edges {i, k} for k ≠ j is an elementary part of oc[i, r].

• For each block B in oc[i, r] that contains i and is different from the split block, the connected basic-root BPS P i that remains after removing all edges {i, j} that do not belong to B is an elementary part of oc[i, r].

• The connected basic-root BSS ob[i, r] of Gr obtained by removing all edges


[Figure 5 panels: (a) ↺c[i, r], (b) P i1, (c) P i2, (d) ↺b[i, r].]

Figure 5: (a) A compound-root BSS o[i, r]. Graphs (b) and (c) are its two BPS elementary parts; graph (d) is its BSS elementary part.

{i, j} of oc[i, r] not belonging to the split block is an elementary part of oc[i, r].

An example is shown in Fig. 5. Graph (a) has three elementary parts, two BPS parts (graphs (b) and (c)) and one BSS part (graph (d)).

The decomposition of a basic-root BSS of a BPS Gr is similar to the decomposition of Gr itself.

Definition 25 (Elementary parts of a basic-root BSS). Let ob[i, r] be a basic-root BSS of Gr. Let {i, j, . . . , n, r} be what remains of the Hamiltonian cycle along orientation o of block B. If i = n, i.e., only one edge is left, then there are no elementary parts. Otherwise:

• For each edge {i, k} with i < k ≤ n, the connected BSS o[k, r] obtained by removing {i, k} and all vertices of B not belonging to the block interval [k, r] is an elementary part of ob[i, r]; it is denoted as o[k, r].

• In case vertex j belongs to more than two edges of B, it is part of a subblock not containing i. In that case, let k ≠ r be the largest vertex such that {j, k} is an edge of B. The connected BPS P j with root j obtained by removing all vertices of B not belonging to the block interval [j, k] is an elementary part of ob[i, r].

An example is given in Fig. 6. Graph (a) has two BSS parts (graphs (b) and (c)) and one BPS part (graph (d)). In constructing o[k, r], the vertex i is removed as well as the block's edges connected with vertex i; note that there can be several such edges. Again, the elementary parts can be basic-root or compound-root, depending on the property of their root.


[Figure 6 panels: (a) ↻b[i, r], (b) ↻c[j, r], (c) ↻b[k, r], (d) P j.]

Figure 6: (a) A basic-root BSS o[i, r]. Graphs (b) and (c) are its BSS elementary parts; graph (d) is its BPS elementary part.

2.2.2 Parts and their properties

Having defined decompositions of rooted BPSs and rooted BSSs, we can now formalize our claims. As before, G and H are the given connected rooted outerplanar graphs. First, we define which parts of G are used during the search for an MCS.

Definition 26 (Parts). Given a graph G, let r be an arbitrarily selected vertex of G. Then Parts(Gr) is inductively defined as follows:

• Gr ∈ Parts(Gr)

• If P p ∈ Parts(Gr) then the elementary parts of P p are in Parts(Gr).

For every connected subgraph S of G, we can claim that Parts(Gr) has an element P with its root in S and with S a subgraph of P.

Proposition 1. Let S be a non-empty and connected subgraph that is block-and-bridge-preserving subgraph isomorphic to G. Then, Parts(Gr) contains a rooted BPS P such that r(P) ∈ S and S is a subgraph of P.

Proof. Note that Parts(Gr) contains Gr and that Gr is a BPS that contains S. If the root of Gr is part of S, then the proposition is satisfied. Otherwise, we will gradually reduce Gr, always selecting a part that contains S, until we obtain a part with the desired property. More formally, let Gr be R. If r(R) ∈ S, then R, being a BPS, satisfies the proposition. Otherwise, R satisfies the following invariant: R is a BPS or a BSS, S is a subgraph of R and r(R) ∉ S. The


proposition follows from the argument made below that R has an elementary part R′ (and hence a member of Parts(Gr)) that either satisfies the proposition or satisfies the invariant. As S is non-empty, the proposition then follows. The argument is by case analysis.

• When R is a compound-root BPS or a compound-root BSS, each of the elementary parts is either a basic-root BPS or a basic-root BSS and does not overlap with the others except for the root r(R), while their union covers the whole graph. Because r(R) is not part of S, S must be a subgraph of one of the elementary parts, say R′. This R′ satisfies the invariant.

• When R is a basic-root BPS and r(R) is incident with a bridge, its only elementary part R′ is obtained by removing that bridge. As r(R) ∉ S, S is a subgraph of R′; either R′ is the BPS that satisfies the proposition or R′ satisfies the invariant.

• When R is a basic-root BPS and r(R) is part of a block, R has one or more BSS elementary parts and possibly one BPS elementary part. We do a case analysis and use the notations of Definition 23.

– When the BPS P 1 exists and vertex 1 belongs to S, then P 1 satisfies the proposition.

– Otherwise, the BSS o[1, r] satisfies the invariant.3

• When R is a basic-root BSS o[i, r], it has one or more BSS elementary parts and possibly one BPS elementary part. We do a case analysis and use the notations of Definition 25.

– If the BPS P j exists and vertex j belongs to S, then P j satisfies the proposition.

– Otherwise, the BSS o[j, r] satisfies the invariant.

For this proof, it suffices to use only one orientation when constructing the elementary parts of a basic-root BPS with a root belonging to a block.

Now, we turn our attention to the decomposition of H, where we consider every vertex as root.

Definition 27 (Parts∗). Given a graph H, Parts∗(H) is defined as follows:

Parts∗(H) = ∪_{s∈V(H)} Parts(Hs)

Proposition 2. Let S be a rooted connected graph such that r(S) = s and S ⊑ϕH H. Then Parts∗(H) contains a rooted BPS P such that S is a subgraph of P and ϕH(s) = r(P).

3 The other BSSs o[i, r] are used in the proof of Proposition 3.


Algorithm 1 Computing an MCS of two outerplanar graphs G and H

Require: G and H are outerplanar graphs; Pairs(G,H) as in Definition 28.
Ensure: MCSbest candidate is a subgraph of G and an MCS⊑ of G and H.
 1: MCSbest candidate ← ((∅, ∅), 0)
 2: for all (P,Q) ∈ Pairs(G,H)
 3:    RMCS = RMCS(P,Q)
 4:    if P is a BPS and |RMCS| > |MCSbest candidate|
 5:       MCSbest candidate ← RMCS
 6: return MCSbest candidate
 7: function RMCS(P,Q)
 8:    if P and Q are compound-root graphs
 9:       return RMCScompound(P,Q)
10:    else if P and Q are basic-root BPS graphs
11:       return RMCSbps(P,Q)
12:    else // P and Q are basic-root BSS graphs
13:       return RMCSbss(P,Q)

Proof. HϕH(s) satisfies the claim and is in Parts∗(H).

Let S be an MCS of graphs G (mapping ϕG) and H (mapping ϕH); let r be an arbitrary vertex of G. We have now established that S has a vertex s such that Ss ⊑ϕG Gr (Proposition 1) and Ss ⊑ϕH HϕH(s) (Proposition 2), i.e., Ss is one of their RMCSs. This means that considering the subgraphs defined in Parts(Gr) and Parts∗(H) will lead to a valid RMCS between Gr and Hs.

2.2.3 Computing an MCS of two outerplanar graphs

The algorithm below constructs pairwise RMCSs for elements in Parts(Gr) and Parts∗(H). While doing so, it keeps track of the largest RMCS found so far; it uses a dynamic programming approach and stores results, so that the RMCS for each pair is computed only once. When all pairs have been processed, the largest RMCS found is an MCS of graphs G and H. To reduce the number of cases to four, it assumes that the initial graphs as well as the elementary parts of a basic-root graph are compound-root. If a compound-root graph turns out to be basic-root, the graph itself is its elementary basic-root part. So, Parts(G) and Parts∗(H) may have two copies of some graphs, one copy labeled as compound-root, the other as basic-root. It then suffices to compute the RMCS of pairs of basic-root graphs and of pairs of compound-root graphs. The pairs to be considered can be formally defined as follows.

Definition 28 (Pairs(G,H)). Pairs(G,H) is the set of pairs {(P p, Qq) | λ(p) = λ(q), P p ∈ Parts(Gr), Qq ∈ Parts∗(H), P p is basic-root (compound-root) iff Qq is basic-root (compound-root), and P p is a BPS (BSS) iff Qq is a BPS (BSS)}.

Algorithm 1 shows the top level of the method. Maximum common subgraphs are represented by pairs consisting of a graph and a size.


Algorithm 2 Computing an RMCS of two compound-root graphs Gr and Hs

Require: Gr and Hs are compound-root graphs; their roots have the same label.
1: function RMCScompound(Gr, Hs)
2:    M ← MaxMatch(EPs(Gr), EPs(Hs))
3:    RMCS ← ∪(X,Y)∈M Graph(RMCS(X,Y))
4:    size ← Σ(X,Y)∈M |RMCS(X,Y)| − (|M| − 1)·w(λ(r))
5:    return (RMCS, size)

A graph in turn is represented as a pair (V,E) with V the vertices and E the edges. We use the functions Graph, Edges, Vertices and |·| to select respectively the graph, the edges, the vertices and the size of an MCS. We need not specify the labels of an MCS as the MCS is a subgraph of G and it inherits the labels of G. We will write (V1, E1) ∪ (V2, E2) as a short notation for (V1 ∪ V2, E1 ∪ E2). A global variable MCSbest candidate keeps track of the maximum common subgraph that has been found so far; it is initialized as the empty graph (size 0). The algorithm iterates over all pairs (P,Q) of rooted graphs. The RMCS is computed by the function RMCS, which does a case analysis and, depending on the case (compound-root graphs, basic-root BPS or basic-root BSS), calls one of RMCScompound, RMCSbps or RMCSbss. The results of the function RMCS(P,Q) are stored so that the function is evaluated only once for each pair. The value of MCSbest candidate is updated when the RMCS of a pair of BPSs was computed and the found RMCS is larger than the currently best common subgraph (under BBP subgraph isomorphism).
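The bookkeeping described above can be summarized by the following Python sketch of the top-level loop; it is a schematic rendering of Algorithm 1, and the helper functions (pairs, is_bps, is_compound_root, rmcs_compound, rmcs_bps, rmcs_bss) are assumed to exist elsewhere.

    from functools import lru_cache

    EMPTY = ((frozenset(), frozenset()), 0)   # (graph, size) for the empty graph

    @lru_cache(maxsize=None)                  # memoization: each (P, Q) is solved once
    def rmcs(P, Q):                           # P, Q must be hashable part descriptions
        if is_compound_root(P) and is_compound_root(Q):
            return rmcs_compound(P, Q)
        elif is_bps(P) and is_bps(Q):
            return rmcs_bps(P, Q)
        else:                                  # both are basic-root BSSs
            return rmcs_bss(P, Q)

    def mcs_bbp(G, H):
        """The largest RMCS over all admissible pairs is an MCS of G and H."""
        best = EMPTY
        for P, Q in pairs(G, H):               # Pairs(G, H) as in Definition 28
            candidate = rmcs(P, Q)
            if is_bps(P) and candidate[1] > best[1]:
                best = candidate
        return best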

First, we discuss the function RMCScompound (Algorithm 2) that computes an RMCS of two compound-root graphs Gr and Hs with the same label in their roots. A compound-root graph has basic-root graphs as elementary parts. An RMCS is obtained by computing a weighted maximal matching between pairs of elementary parts, using the size of the RMCS as weight (see Proposition 3 in the next section). The RMCS is then the union of the RMCSs of the pairs that participate in the maximal matching. The algorithm uses MaxMatch to compute a maximal matching and the function EPs to access the sets of elementary parts. These functions are omitted. When computing the size of the RMCS, simply adding the sizes of the RMCSs of the pairs participating in the matching would count w(λ(r)), the weight of the label of the root, |M| times, so a correction is made.
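A sketch of the matching step and the root-weight correction of Algorithm 2 is given below. It is illustrative only: the helpers elementary_parts and union_graphs, the memoized rmcs function from the previous sketch and the root weight are assumptions, and networkx's max_weight_matching stands in for the unspecified MaxMatch routine.

    import networkx as nx

    def rmcs_compound(Gr, Hs, root_weight):
        """RMCS of two compound-root graphs: match elementary parts by RMCS size."""
        B = nx.Graph()
        for X in elementary_parts(Gr):
            for Y in elementary_parts(Hs):
                size = rmcs(X, Y)[1]
                if size > 0:
                    B.add_edge(("G", X), ("H", Y), weight=size)

        # Weighted maximal matching between the two sets of elementary parts.
        matching = nx.max_weight_matching(B, maxcardinality=False)

        parts, total = [], 0
        for a, b in matching:
            X, Y = (a[1], b[1]) if a[0] == "G" else (b[1], a[1])
            graph, size = rmcs(X, Y)
            parts.append(graph)
            total += size

        # Every matched pair contains the shared root, so its weight was counted
        # |M| times; keep it only once (Algorithm 2, line 4).
        total -= (len(matching) - 1) * root_weight
        return union_graphs(parts), total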

Second, we discuss the RMCS computation of two basic-root BPS graphs (Algorithm 3). It first handles the case that the root belongs to a bridge. If so, both graphs have a single elementary part (accessed with the function EP). If the two bridges match, then the RMCS consists of the bridge and the RMCS between the elementary parts; otherwise, the RMCS consists of the root. If the root belongs to a block in one graph and to a bridge in the other, then the RMCS consists of the root. This is the initial RMCS and is returned as the for loop is skipped in that case. If the root belongs to a block in both graphs then the algorithm has to consider the pairs of elementary parts that are BSSs. If


Algorithm 3 Computing an RMCS of two basic-root BPSs Gr and Hs

Require: Gr and Hs are two basic-root block-preserving graphs; λ(r) = λ(s).
 1: function RMCSbps(Gr, Hs)
 2:    if r is part of the bridge {r, r′} and s part of the bridge {s, s′}
 3:       if λ({r, r′}) = λ({s, s′}) ∧ λ(r′) = λ(s′) // the bridges match
 4:          RMCS ← Graph(RMCS(EP(Gr), EP(Hs))) ∪ ({r}, {{r, r′}})
 5:          size ← |RMCS(EP(Gr), EP(Hs))| + w(λ({r, r′})) + w(λ(r))
 6:       else
 7:          RMCS ← ({r}, ∅)
 8:          size ← w(λ(r))
 9:       return (RMCS, size)
10:    else
11:       RMCS ← (({r}, ∅), w(λ(r)))
12:       for all pairs (oG[i, r], oH[i′, s]) such that oG[i, r] ∈ EPs(Gr), oH[i′, s] ∈ EPs(Hs) // pairs of BSSs
             ∧ λ({i, r}) = λ({i′, s}) ∧ λ(i) = λ(i′) // and the edges match
13:          if |RMCS(oG[i, r], oH[i′, s])| > 0 // the pair has an RMCS
14:             Cand ← Graph(RMCS(oG[i, r], oH[i′, s])) ∪ (∅, {{r, i}})
15:             size ← |RMCS(oG[i, r], oH[i′, s])| + w(λ({r, i}))
16:             if size > |RMCS|
17:                RMCS ← (Cand, size)
18:       return RMCS

the removed edges match and the pair has a non-empty RMCS, then a BPS is composed. If it improves the best result so far, it becomes the new RMCS.

Finally, Algorithm 4 computes the RMCS between two basic-root BSSs. Although it is quite similar to the previous one, there are some differences. First, the root is known to be part of a block, so the bridge case can be skipped. Second, it is possible that one of {r′, r} and {s′, s} is the final edge of the Hamiltonian path, in which case that graph has no elementary parts. If there is an edge between r and r′ and an edge with the same label between s and s′, then we have an RMCS that consists of this edge. Otherwise, we have to iterate over the pairs oG[r′′, r] and oH[s′′, s] of BSSs that are their elementary parts. If the edges {r′′, r′} and {s′′, s′} match and the pair has a non-empty RMCS, the RMCS between oG[r′′, r] and oH[s′′, s] can be extended into an RMCS of oG[r′, r] and oH[s′, s]. Note that r′ may have other edges {r′, i} with i in the interval [r′′, r]. If i is in the RMCS (its image in oH[s′′, s] is ϕH(i)) and the edge {s′, ϕH(i)} is present in oH[s′, s] and matches with {r′, i}, then the edge is added to the RMCS of oG[r′, r] and oH[s′, s] (such edges are collected in Edges). Note also that in case the RMCS of oG[r′′, r] and oH[s′′, s] is empty, but the edges {r, r′} and {s, s′} exist and match, then these edges form the RMCS between oG[r′, r] and oH[s′, s].


Algorithm 4 Computing an RMCS of two basic-root BSSs oG[r′, r] and oH [s′, s]

Require: oG[r′, r] and oH[s′, s] are basic-root block-splitting graphs; λ(r) = λ(s) and λ(r′) = λ(s′).
 1: function RMCSbss(oG[r′, r], oH[s′, s])
 2:    if one of oG[r′, r] and oH[s′, s] has no elementary parts
 3:       if λ({r, r′}) = λ({s, s′}) // the edges exist and match
 4:          return (({r′, r}, {{r′, r}}), w(λ(r)) + w(λ(r′)) + w(λ({r′, r})))
 5:       else
 6:          return ((∅, ∅), 0)
 7:    else RMCS ← ((∅, ∅), 0)
 8:    for all pairs (oG[r′′, r], oH[s′′, s]) such that oG[r′′, r] ∈ EPs(oG[r′, r]) and oH[s′′, s] ∈ EPs(oH[s′, s]) // BSSs
          ∧ λ({r′′, r′}) = λ({s′′, s′}) ∧ λ(r′′) = λ(s′′) // the edges match
 9:       if |RMCS(oG[r′′, r], oH[s′′, s])| > 0 // the pair has an RMCS
10:          Edges ← {{r′, i} | i ≠ r′′ ∧ i ∈ Vertices(RMCS(oG[r′′, r], oH[s′′, s])) ∧ {s′, ϕH(i)} ∈ oH[s′, s] ∧ λ({r′, i}) = λ({s′, ϕH(i)})}
11:          Cand ← Graph(RMCS(oG[r′′, r], oH[s′′, s])) ∪ ({r′}, {{r′, r′′}}) ∪ (∅, Edges)
12:          size ← |RMCS(oG[r′′, r], oH[s′′, s])| + w(λ({r′, r′′})) + w(λ(r′)) + Σe∈Edges w(λ(e))
13:       else if λ({r, r′}) = λ({s, s′}) // the edges exist and match
14:          Cand ← ({r′, r}, {{r′, r}})
15:          size ← w(λ(r)) + w(λ(r′)) + w(λ({r′, r}))
16:       else
17:          Cand ← (∅, ∅); size ← 0
18:       if size > |RMCS|
19:          RMCS ← (Cand, size)
20:    return RMCS

2.2.4 Proof of correctness

In this section, we prove that the algorithms described in Sect. 2.2.3 work correctly. We do this by the following proposition, which shows that an RMCS between an element of Parts(Gr) and an element of Parts∗(H) is constructed from the RMCSs between their parts (Algorithms 2-4). This ensures that the RMCS between Gr and Hs will be computed correctly.

Proposition 3. Let P p be a BPS (BSS) element of Parts(Gr), Qq a BPS (BSS) element of Parts∗(H) and Sp be a BPS (BSS) that is an RMCS of P p and Qq, i.e., Sp ⊑ϕG P p and Sp ⊑ϕH Qq. According to the different cases, Algorithms 2-4 compute an RMCS of P p and Qq that has the same size as Sp, which is constructed from the RMCSs between the elementary parts of P p and Qq.

Proof. We prove this result by induction over the size of the RMCS Sp. When


the RMCS of P p and Qq is empty, i.e., p and q have a different label, then nothing is to be constructed. For the induction step, we do a case analysis.

• P p, Qq and Sp are BPSs.

– P p, Qq and Sp are compound-root BPSs (Alg. 2). Definition 20 allows us to make the following observations: Sp is the union of its elementary parts.4 For each elementary part Spe of Sp, there is an elementary part P pe of P p such that Spe ⊑ϕG P pe and an elementary part Qqe of Qq such that Spe ⊑ϕH Qqe. Moreover, we can claim that Spe is an RMCS of P pe and Qqe (if P pe and Qqe had a larger RMCS, then P p and Qq would have an RMCS larger than Sp). Spe is smaller than Sp, so we can apply the induction hypothesis and hence P pe and Qqe have an RMCS of the size of Spe. So, when computing the weighted maximal matching between the elementary parts of P p and Qq (using the size of their RMCS as weight), we obtain a matching which has as weight the sum of the sizes of the elementary parts of Sp. Taking the union of the RMCSs involved in this matching, we obtain an RMCS with a size equal to the size of Sp. Note that the matching cannot result in a larger weight as that would lead to an RMCS larger than Sp.

– P p, Qq and Sp are basic-root BPSs (Alg. 3). Let o be the orientation that is chosen in the decomposition of P p. Let {p, i, . . . , m, p} be the Hamiltonian cycle following orientation o in Sp. Sp can be reconstructed from the edge {p, i} and from oS[i, . . . , p], an elementary part of Sp (Definition 23). It follows from the same definition that oG[i, . . . , p] with oG = oS is an elementary part of P p and that Qq has an elementary part oH[ϕH(i), . . . , ϕH(p)] (with oH one of {↻, ↺}). Moreover, oS[i, . . . , p] is an RMCS of oG[i, . . . , p] and oH[ϕH(i), . . . , ϕH(p)], hence the induction hypothesis is applicable and both oG[i, . . . , p] and oH[ϕH(i), . . . , ϕH(p)] have an RMCS with the size of oS[i, . . . , p]. Extending the latter RMCS with the edge {p, i} gives an RMCS of P p and Qq which has the same size as Sp. As above, the assumption that oG[i, . . . , p] and oH[ϕH(i), . . . , ϕH(p)] have an RMCS larger than oS[i, . . . , p] contradicts that Sp is an RMCS of P p and Qq.

– Sp is a basic-root BPS and P p and/or Qq are compound-root BPSs (Alg. 3). If P p is compound-root, then it has an elementary part P′p such that Sp ⊑ϕG P′p; if not, let P′p = P p. Similarly, we construct Q′q such that Sp ⊑ϕH Q′q. Now we can apply the argument of the previous case to claim that P′p and Q′q have an RMCS of the size of Sp. This must also be an RMCS of P p and Qq as they cannot have a larger one.

4 If the compound Sp has only itself as elementary basic BPS, there is strictly speaking no size reduction. However, note that one could define an auxiliary function size′(G) = 2·size(G) + 1 if G is compound and 2·size(G) otherwise. Formally, the induction is on size′.


• P p, Qq and Sp are BSSs. Using Definition 24 and Definition 25, we can make a similar argument as for BPSs. Note however that for the case of basic-root BSSs (Alg. 4), the difference between oS[i, j, . . . , p] and its elementary part oS[j, . . . , p] consists of the vertex i, the edge {i, j}, and zero or more edges {i, k}. So when extending the RMCS between oG[j, . . . , p] and oH[ϕH(j), . . . , ϕH(p)], not only vertex i and edge {i, j} have to be added, but also the edges {i, k} for which k belongs to the RMCS of oG[j, . . . , p] and oH[ϕH(j), . . . , ϕH(p)].

The following theorem now directly follows from Propositions 1, 2 and 3.

Theorem 1. Algorithm 1 computes an MCS⊑ of two given outerplanar graphs.

2.2.5 Time complexity

In this section, we outline the proof for the following theorem:

Theorem 2. Algorithm 1, computing an MCS under BBP subgraph isomorphism of two outerplanar graphs G and H, runs in time O(|V(G)| · |V(H)| · (|V(G)| + |V(H)|)^{1/2}).

Proof. The result follows from counting the total time spent in Algorithms 1-4 and making use of known bounds on the running times of the algorithms used in the dynamic programming step.

For convenience, we account for the time spent in Alg. 1, which calls the appropriate function for every pair (P,Q) ∈ Pairs(G,H), in the discussion of the functions themselves. More specifically, the iteration over the appropriate pairs is taken into consideration during the discussion of the three functions RMCScompound, RMCSbps and RMCSbss.

First, the time needed to decompose G, i.e., to compute Parts(Gr), can be bounded by O(|V(G)|). Since this is clearly not the dominating term in the complexity of the algorithm, we will only provide a sketch. The number of elementary parts of a BPS or BSS with root r depends on the degree of r. Therefore, the number of elements in Parts(Gr) is linear in the number of edges in G. For outerplanar graphs it holds that O(E(G)) = O(V(G)) [33], so this gives us the above upper bound. In addition, the time needed to compute Parts∗(H) can be bounded by O(|V(H)| · |V(H)|). The time to decompose both G and H is then bounded by O(|V(G)| + |V(H)|²). With |V(H)| ≤ |V(G)|, this can be reduced to O(|V(G)| · |V(H)|).

Second, the total time spent in RMCScompound (Alg. 2) can be bounded by O(|V(G)| · |V(H)| · (|V(G)| + |V(H)|)^{1/2}). This can be explained as follows. Every vertex g ∈ V(G) and every vertex h ∈ V(H) has at most deg(g) (resp. deg(h)) elementary parts involved in a maximal matching, with deg(g) the degree of vertex g and deg(h) the degree of vertex h. According to Hopcroft [19], the weighted maximal matching problem can be solved in time O((n + m) · n^{1/2}), with n the number of nodes and m the number of edges in a bipartite graph.


Since n = deg(g) + deg(h) and m = deg(g) · deg(h), this allows us to bound the time spent in RMCScompound by Tcomp:

Tcomp = Σ_{g∈V(G)} Σ_{h∈V(H)} ((deg(g) + deg(h)) + (deg(g) · deg(h))) · (deg(g) + deg(h))^{1/2}.

In the non-trivial case, deg(g), deg(h) ≥ 1, so that ((deg(g) + deg(h)) + (deg(g) · deg(h))) ≤ 3(deg(g) · deg(h)). Furthermore, Σ_{v∈V(G)} deg(v) = 2|E(G)| and for outerplanar graphs it holds that O(E(G)) = O(V(G)) [33]. This gives us an upper bound on Tcomp:

Tcomp ≤ Σ_{g∈V(G)} Σ_{h∈V(H)} 3 · (deg(g) · deg(h)) · (deg(g) + deg(h))^{1/2} = O(|V(G)| · |V(H)| · (|V(G)| + |V(H)|)^{1/2}).

Third, the total time spent in RMCSbps (Alg. 3) can be bounded by O(|V(G)| · |V(H)|). If r ∈ V(G) and s ∈ V(H) are part of a bridge, this function only requires constant time. If r and s are involved in a block of G or H, we count the number of elementary parts of type BSS of the BPSs having r and s as roots, which can be upper bounded by deg(r) and deg(s), respectively. Making use again of Σ_{v∈V(G)} deg(v) = 2|E(G)|, this allows us to bound the time spent in RMCSbps by Tbps:

Tbps = Σ_{r∈V(G)} Σ_{s∈V(H)} deg(r) · deg(s) = Σ_{r∈V(G)} deg(r) · Σ_{s∈V(H)} deg(s) = O(|V(G)| · |V(H)|).

Finally, the time spent in RMCSbss (Alg. 4) can be bounded by O(|V(G)| · |V(H)|) as well. If oG[r′, r] (r, r′ ∈ V(G)) and oH[s′, s] (s, s′ ∈ V(H)) have no elementary parts, this function requires constant time. Otherwise, all combinations of the deg(r) elementary parts of oG[r′, r] and the deg(s) elementary parts of oH[s′, s] need to be considered. As such, we can bound the time spent in RMCSbss by Tbss:

Tbss = Σ_{r∈V(G)} Σ_{s∈V(H)} deg(r) · deg(s) = Σ_{r∈V(G)} deg(r) · Σ_{s∈V(H)} deg(s) = O(|V(G)| · |V(H)|).

The overall time complexity is dominated by the time spent in RMCScompound, so we can conclude that there is an algorithm running in a time bounded by a polynomial of degree 3 in the number of vertices of the two graphs.

We end this section with a practical remark. In fact, we need not consider all vertices of H as root. If we were dealing only with tree structures, it would be sufficient to consider only their leaves as roots. As outerplanar graphs also have blocks, we have to consider a larger class, which we call root-candidates.


Definition 29 (Root-candidate of an outerplanar graph). A vertex is a root-candidate of an outerplanar graph if either it is incident with a single bridge or it belongs to a block.

To decompose G, we select an arbitrary root-candidate r of G and decompose Gr. To decompose H, we consider all root-candidates s of H and decompose each Hs.
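A root-candidate check following Definition 29 can be sketched with networkx as follows; this is an illustrative implementation, not the authors' code, and the example graph is made up.

    import networkx as nx

    def root_candidates(G):
        """Vertices incident with exactly one bridge or belonging to a block."""
        bridges = set(frozenset(e) for e in nx.bridges(G))
        block_vertices = set()
        for comp in nx.biconnected_components(G):
            if len(comp) > 2:            # components with >2 vertices are blocks
                block_vertices |= comp

        candidates = set(block_vertices)
        for v in G.nodes():
            incident_bridges = [e for e in G.edges(v) if frozenset(e) in bridges]
            if len(incident_bridges) == 1:
                candidates.add(v)
        return candidates

    # Example: a triangle {1,2,3} (block) with a path 3-4-5 attached (bridges).
    G = nx.Graph([(1, 2), (2, 3), (3, 1), (3, 4), (4, 5)])
    print(sorted(root_candidates(G)))    # [1, 2, 3, 5] (vertex 4 touches two bridges)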

From now on, we refer to our algorithm as MCS-BBP. It can be downloaded at http://dtai.cs.kuleuven.be/PMCSFG.

3 Experimental Evaluation

In this section, we compare MCS-BBP against state-of-the-art algorithms in terms of efficiency and predictive performance. We also want to investigate the advantages of using the BBP subgraph isomorphism as opposed to the general subgraph isomorphism. More precisely, we want to answer the following experimental questions:

Q1 What are the differences in terms of efficiency and predictive performance between MCS-BBP and a state-of-the-art MCS algorithm that uses the general subgraph isomorphism?

Q2 How does the predictive performance of a metric based on MCS-BBP compare to the predictive performance of state-of-the-art metrics?

Q3 What is the influence on predictive performance when using the BBP subgraph isomorphism as matching operator in other learning techniques?

In order to answer Q1, we compare MCS-BBP against the MCS algorithm of Cao et al. [4]. To compare the predictive performance, we construct a metric from the MCS algorithms and evaluate them based on the performance of the metric in an instance-based learning (IBL) setting.

In order to answer Q2, we compare our metric based on the MCS⊑ to a metric based on 2D fingerprints [48] and a metric based on the WDK kernel [5]. These metrics are also evaluated through instance-based learning.

Although the gains in efficiency that can be obtained by using the BBP matching operator have been studied in the context of frequent subgraph mining by Horvath et al. [21, 22], the impact of the BBP matching operator on predictive performance has not been investigated before. Evaluating the BBP matching operator is interesting in its own right and these results give us an answer to Q3. To this end, we compare the different matching operators when used in support vector machines (SVMs) [23], a method that has become very popular also for the classification of molecules (e.g., [44, 12, 5]). Rather than obtaining state-of-the-art results, the purpose in this context is to investigate whether the BBP subgraph isomorphism has larger predictive power than the general one in classification tasks from chemoinformatics.


3.1 Datasets

The Developmental Therapeutics Program (DTP) at the U.S. National Cancer Institute (NCI) has screened a large number of compounds for their ability to inhibit the growth of human tumor cell lines.⁵ The target values correspond to the parameter GI50, the concentration that causes 50% growth inhibition.

In total, there are 60 datasets, each corresponding to a particular cell line. We use the roughly balanced datasets of [44], which have become a popular benchmark for QSAR algorithms, and are often referred to as NCI60. Each cell line has inhibition data on about 3500 compounds, which defines a binary classification problem. There are 3910 distinct molecules over all datasets. The average number of vertices is 23, while the average number of edges is 25.

Each molecule is described in the Tripos Sybyl MOL2 format⁶, from which we extract a graph in which the vertices are labeled with general atom types (e.g., N, C) and the edges are labeled single, double, triple, amide or aromatic. Hydrogen atoms are dropped.

In order to compare with the BBP-based methods, we have removed the non-outerplanar examples (≈9.69%) from these datasets.⁷

⁵ http://dtp.nci.nih.gov/docs/cancer/cancer_data.html
⁶ http://www.tripos.com/data/support/mol2.pdf
⁷ These examples cannot be handled efficiently and could be covered by an algorithm using the general subgraph isomorphism. In [40], an efficient feature generation approach based on the MCSv was introduced, also covering the non-outerplanar examples.

3.2 Method

Q1: What are the differences between MCS-BBP and a state-of-the-art MCS algorithm that uses the general subgraph isomorphism in terms of efficiency and predictive performance? The algorithm of Cao et al. [4] computes an unconnected MCS under the general subgraph isomorphism and is not restricted to outerplanar graphs. It makes use of heuristics to drastically reduce the search space compared to other MCS algorithms (see Sect. 4.1). To the best of our knowledge, this is the only MCS algorithm that is freely available in the public domain. The algorithm, part of a package called ChemMiner, can be downloaded at http://biowb.ucr.edu/ChemMineV2/help/mcs.html. From now on, we refer to this algorithm as MCS-CAO.

To compare the efficiency, we randomly selected a set of 50 graphs from NCI60, for which there are 1225 possible pairs of graphs. For each of these pairs, we run MCS-BBP (computing an MCSv) and MCS-CAO (computing an MCS�) and measure runtimes.

All experiments were run on an Intel Core2Quad 2.33 GHz processor. A time-out of 24 hours per MCS computation was enforced.

To compare the predictive performance of MCS-BBP and MCS-CAO, we construct a metric from the MCS. Bunke and Shearer [3] proposed a distance function on graphs based on the maximum common subgraph:

dbs(G,H) = 1 − |MCS(G,H)| / max(|G|, |H|),   (1)

with |G| equal to the number of vertices in G. They proved that dbs is a metric. We can easily extend this proof to the more general case where |G| is determined by the function size. Some variants with similar properties and performances are reviewed in [37].

We instantiate the size of a graph as the sum of its number of vertices and its number of edges, that is, we choose w^λ_G(x) = 1 for every x ∈ V(G) ∪ E(G). We call the metric computed under the BBP subgraph isomorphism dv and the metric computed under the general subgraph isomorphism d�.
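For illustration, Eq. 1 with this choice of the size function can be written down directly. The sketch below assumes networkx-style graphs and a helper mcs_size(G, H), hypothetical here, that returns the number of vertices plus edges of a maximum common subgraph (e.g., as computed by MCS-BBP).

def graph_size(G) -> int:
    # Size of a graph: number of vertices plus number of edges.
    return G.number_of_nodes() + G.number_of_edges()

def d_bs(G, H, mcs_size) -> float:
    # Bunke-Shearer distance (Eq. 1) with size = |V| + |E|;
    # mcs_size(G, H) must return the size of a maximum common subgraph.
    return 1.0 - mcs_size(G, H) / max(graph_size(G), graph_size(H))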

To compare the predictive performance of the different metrics, we use k-nearest neighbour classification (kNN): in order to classify a given molecule, we select the k neighbour molecules that are closest according to the metric (we choose k = 11, which resulted in an optimal AUROC for all metrics). For each molecule, we obtain a prediction equal to the percentage of positive votes of its neighbours. In this way, we can rank the predictions and compute the area under the ROC curve (AUROC). We use leave-one-out cross-validation.
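This evaluation protocol can be sketched as follows, assuming a precomputed symmetric distance matrix D and binary labels y; scikit-learn is only used to compute the AUROC (an illustration of the protocol, not the exact experimental code).

import numpy as np
from sklearn.metrics import roc_auc_score

def loo_knn_auroc(D: np.ndarray, y: np.ndarray, k: int = 11) -> float:
    # Leave-one-out kNN: score each molecule by the fraction of positive
    # labels among its k closest neighbours (excluding itself), then
    # compute the area under the ROC curve over all molecules.
    n = len(y)
    scores = np.empty(n)
    for i in range(n):
        order = np.argsort(D[i])
        neighbours = [j for j in order if j != i][:k]  # drop the molecule itself
        scores[i] = y[neighbours].mean()               # percentage of positive votes
    return roc_auc_score(y, scores)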

Q2: How does the predictive performance of a metric based on the MCS algorithm using the BBP subgraph isomorphism compare to the predictive performance of state-of-the-art metrics? Next to the MCS-based metrics, we construct a metric based on 2D fingerprints and a metric based on the WDK kernel. For each molecule, we construct a 1024-bit FP2 fingerprint using OpenBabel v2.1.1⁸ and we define a metric on these fingerprints using the Tanimoto coefficient, which is considered among chemists to produce state-of-the-art results for virtual screening [48]. We call this metric FP2. Although the WDK kernel is not intended to be used in this way, for reasons of comparison we define a metric according to the following formula: d²(x, y) = κ(x, x) − 2κ(x, y) + κ(y, y), with κ the WDK kernel function. We call this metric WDK.

⁸ http://openbabel.sourceforge.net
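Both baseline dissimilarities are simple to state. The following sketch is our illustration (not the implementations used in the experiments); the fingerprints are assumed to be 0/1 integer vectors and kappa can be any kernel function.

import numpy as np

def tanimoto_distance(a: np.ndarray, b: np.ndarray) -> float:
    # 1 - Tanimoto coefficient for binary fingerprint vectors a and b.
    both = np.sum(a & b)
    either = np.sum(a | b)
    return (1.0 - both / either) if either else 0.0

def kernel_distance(x, y, kappa) -> float:
    # Distance induced by a kernel: d(x, y)^2 = kappa(x,x) - 2*kappa(x,y) + kappa(y,y).
    return np.sqrt(kappa(x, x) - 2.0 * kappa(x, y) + kappa(y, y))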

Q3: What is the influence on predictive performance when using the BBP subgraph isomorphism as matching operator in other learning techniques? To test the performance of the BBP subgraph isomorphism in a broader context, we follow a propositionalization approach and transform molecules into bit-vectors in two ways. The first approach mines frequent subgraphs under the BBP subgraph isomorphism using the FOG algorithm [21]. The second approach mines frequent subgraphs under the general subgraph isomorphism, using an efficient implementation [2] of the gSpan algorithm [49]. In both cases, the bit-vector encodes the occurrence of the frequent patterns as follows.


Given a set of k patterns, each graph g is encoded as a k-dimensional binary vector, where a 1 is marked in the i-th position if the i-th subgraph is subgraph isomorphic to g and a 0 otherwise. We aim at obtaining a similar number of features by selecting an appropriate frequency threshold for the mining algorithms: 4% for FOG, leading to 1376 patterns, and 5% for gSpan, leading to 1292 patterns.
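The encoding step itself can be sketched as follows, assuming a hypothetical predicate is_subgraph(pattern, graph) that implements the appropriate matching operator (BBP for the FOG patterns, general subgraph isomorphism for the gSpan patterns).

import numpy as np

def encode(graphs, patterns, is_subgraph) -> np.ndarray:
    # Each graph becomes a k-dimensional binary vector: position i is 1
    # iff the i-th frequent pattern matches the graph.
    fingerprints = np.zeros((len(graphs), len(patterns)), dtype=np.uint8)
    for g_idx, g in enumerate(graphs):
        for p_idx, p in enumerate(patterns):
            if is_subgraph(p, g):
                fingerprints[g_idx, p_idx] = 1
    return fingerprints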

We use SVMs as models for these vectors. As implementation we use SVMlight [23]. For all methods, we use exactly the same settings, which involve applying a polynomial kernel of degree 2 and using a 10-fold stratified cross-validation. For each fold, the regularization parameter is tuned by holding out a development set from each training fold of the cross-validation. Finally, we combine the predictions from each test fold, rank all the predictions and again compute the AUROC.

Statistical significance We compute the statistical significance of the different methods in two ways. To estimate the significance of the AUROC comparison between two classifiers, we use the (two-sided) Wilcoxon signed rank test [47], which is a non-parametric alternative to the paired Student's t-test that does not make any assumption about the distribution of the measurements. In the results, we report the p-value of the test.
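With the per-dataset AUROC values of two classifiers, the test amounts to a single SciPy call; a minimal sketch, assuming the two arrays are aligned over the 60 datasets.

import numpy as np
from scipy.stats import wilcoxon

def compare_classifiers(aurocs_a: np.ndarray, aurocs_b: np.ndarray) -> float:
    # Two-sided Wilcoxon signed rank test on paired per-dataset AUROCs;
    # returns the p-value.
    _, p_value = wilcoxon(aurocs_a, aurocs_b, alternative="two-sided")
    return p_value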

To compute statistical comparisons of multiple classifiers over multiple datasets, we use the Friedman test combined with a Nemenyi post-hoc test [11]. The Friedman test is a non-parametric test for statistical comparisons of multiple classifiers. It ranks the algorithms for each dataset separately, with the best performing algorithm getting the rank of 1. In case of a tie, the average rank of the tied models is assigned. Then, a Nemenyi post-hoc test is used to analyse which of the classifiers' ranks differ significantly from each other: the performance of two classifiers is significantly different if the corresponding average ranks differ by at least the critical difference, which depends on the significance level and the number of classifiers [11].
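For completeness, the critical difference of the Nemenyi test is, following [11],

CD = q_α · √( k(k+1) / (6N) ),

with k the number of classifiers, N the number of datasets and q_α a critical value derived from the Studentized range statistic. With k = 3, N = 60 and q_0.01 ≈ 2.91 (from the standard tables), this gives CD ≈ 0.53, which is the value reported in Table 2 below.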

3.3 Results

Q1: What are the differences between MCS-BBP and a state-of-the-art MCS algorithm that uses the general subgraph isomorphism in terms of efficiency and predictive performance? A comparison of the runtimes can be found in Fig. 7. Note that a logarithmic scale is used. The figure clearly shows the exponential behavior of MCS-CAO. For 18 of the 1225 MCS computations, MCS-CAO timed out at 24 hours; we do not show the runtimes of MCS-BBP on these 18 comparisons either. We also report the order statistics of the computation time for both algorithms in Table 1. These show that MCS-BBP computes an MCSv in less than 0.713 seconds in 95% of the cases, while MCS-CAO needs up to 1643 seconds to compute the respective MCS�. However, in 204 cases, MCS-CAO was faster than MCS-BBP, which indicates that both algorithms are competitive when computing MCSs for smaller molecules. The graph in Fig. 7 confirms this.

While comparing these runtimes, we need to keep in mind that, next to the difference in type of subgraph isomorphism, there is also another difference between the MCS algorithms.


Figure 7: Runtimes of MCSv and MCS� (in seconds), plotted against |V(G)| ∗ |V(H)|.

MCS-CAO computes an unconnected MCS, while the MCS of MCS-BBP is connected. This factor also contributes to the improved runtimes of MCS-BBP and it would be interesting to be able to measure the impact of these two restrictions separately.

Because the computation of an MCS� has exponential behavior, it was not always possible to compute d� while performing k-nearest-neighbour classification. A time-out of 24 hours per computation of one distance (corresponding to one MCS computation) was used. On average, the computation of d� timed out for 0.1% of the molecules. However, since the molecules for which this occurs are large, the denominator in Eq. 1 is large too, which would result in large distances to these molecules. This reduces the chance of them being selected by the kNN algorithm, which indicates that this should not have a big impact on the classification performance.

Figure 8 plots the AUROC of both IBL classifiers. IBL-dv scores better than IBL-d� on all 60 datasets. According to the (two-sided) Wilcoxon signed rank test, this is statistically significant (p = 1.60 · 10⁻¹¹). The average AUROC for IBL-dv is 0.760, while for IBL-d� it is 0.728.


Table 1: Distribution of runtimes of the two MCS algorithms for 1207 comparisons (in seconds).

              MCS�        MCSv
  Average     805.668     0.174
  Minimum     0.002       0.002
  O.05        0.003       0.003
  O.25        0.025       0.012
  O.5         0.161       0.049
  O.75        3.590       0.238
  O.95        1643.850    0.713
  Maximum     79625.900   2.951

Conclusion The results of the experimental comparison between MCS-CAO and MCS-BBP show that MCS-BBP outperforms MCS-CAO in terms of efficiency. This can be explained by four restrictions imposed by MCS-BBP: it works with outerplanar graphs, makes use of the BBP subgraph isomorphism, computes connected MCSs and selects a single MCS at random when there are multiple MCSs.

Concerning predictive performance, the experimental results indicate that the metric based on the MCSv performs better than the metric based on the MCS�. This means that using the BBP subgraph isomorphism not only leads to a metric that is efficiently computable, but also to one that has a significantly better predictive performance when used for the classification of molecules. The metric based on the BBP subgraph isomorphism thus seems more meaningful on a chemical level, especially for small pharmacological molecules.

Q2: How does the predictive performance of a metric based on the MCS algorithm using the BBP subgraph isomorphism compare to the predictive performance of state-of-the-art metrics? In Figure 9, we plot the AUROC of IBL-dv, IBL-FP and IBL-WDK. We find that IBL-dv performs consistently better (60 wins out of 60) than the other two methods. According to the Friedman test (Table 2), which shows that the difference in average ranks is larger than the critical difference at 1%, this is statistically significant. Table 2 also shows the average AUROC for the different methods.

Conclusion IBL-dv performs significantly better than the current state-of-the-art metrics for molecules.

Q3: What is the influence on predictive performance of using the BBP subgraph isomorphism as matching operator in other learning techniques? Figure 10 shows a similar comparison between the SVM-based classifiers on the 60 datasets.


Figure 8: Predictive performance of the IBL-dv and IBL-d� classifiers on NCI60.

SVM-FOG scores better than SVM-gSpan on all datasets. According to the (two-sided) Wilcoxon signed rank test, this is statistically significant (p = 1.60 · 10⁻¹¹). The average AUROC for SVM-FOG is 0.762, while for SVM-gSpan it is 0.737.

Conclusion The features generated by the BBP matching operator have greater predictive power. This can be explained by the fact that SVM-FOG uses a more constrained language, i.e., outerplanar graphs, leading to less redundant patterns. Additional experiments show that it is still possible to boost the performance of SVM-FOG and SVM-gSpan by lowering the support threshold (and in this way obtaining more patterns), but this does not change the above conclusion.

4 Related Work

In this section, we discuss related research from two different backgrounds. First, we focus on algorithms that compute an MCS of two graphs (Sect. 4.1). Second, since we evaluate the metric on a number of classification tasks from chemoinformatics, we also give an overview of general learning methods that have been used for learning quantitative structure-activity relationships in molecules (Sect. 4.2).


Figure 9: Predictive performance of the state-of-the-art IBL classifiers on NCI60.

4.1 MCS algorithms

There exist a number of approaches that compute MCSs of graphs and there are a lot of differences in the way they handle the NP-completeness of the problem. For example, certain algorithms solve the MCS problem optimally, while others do it approximately. Moreover, there exist slightly different notions of an MCS, e.g., whether it is connected or whether it is subgraph-induced. We first discuss these different notions.

First, the computed MCS can be connected or unconnected. When computing an unconnected MCS, a trade-off has to be made between the number of connected fragments in the MCS and its usefulness as a pattern. If too many connected components are allowed, the pattern becomes too general. If the number is too small, the pattern may be too specific. For example, when two molecules share two benzene rings but differ with respect to the connection between the rings, only one of these rings can appear in a connected MCS. Despite this potential drawback, we focus on finding connected MCSs here.

A second distinction is made between the maximum common induced subgraph (MCIS) and the maximum common edge subgraph (MCES). The MCIS requires the common subgraph to be induced, in the sense that two vertices in the first graph are linked by an edge if and only if they are mapped to two vertices in the second graph that are linked by an edge as well.⁹


Table 2: Average scores and ranks for AUROC for the state-of-the-art IBL classifiers on NCI60.

  Method      Average AUROC   Average rank
  IBL-dv      0.760           1
  IBL-FP      0.742           2.15
  IBL-WDK     0.731           2.85

Critical difference for the average ranks at the 1% significance level: 0.53

The MCES, which does not have this requirement, simply tries to maximize the number of edges in the common subgraph. Chemists have argued that the MCES more adequately expresses the notion of chemical similarity than does the MCIS, since the bonded interactions between atoms are most responsible for the perceived molecular activity [38]. An MCS that is computed between outerplanar graphs under the BBP subgraph isomorphism corresponds to neither of these notions, but tries to maximize the edges given that only bridges are mapped to bridges and block edges to block edges.

Here we also introduce a third distinction, which is the kind of subgraph isomorphism used in the matching. Usually, the general subgraph isomorphism is used to compute the MCS, while we propose to use the BBP subgraph isomorphism instead.

An overview of MCS algorithms has been given by Raymond and Willett [38] and Conte et al. [8]. McGregor [32] proposes a simple backtracking strategy that uses various heuristics in order to reduce the number of backtrackings. Akutsu [1] proposes a polynomial algorithm for computing the connected MCES of “almost trees of bounded degree” under the general subgraph isomorphism. This is another class of graphs which is suited for the representation of molecules. Koch [27] introduces an algorithm for enumerating all connected MCESs in two graphs, based on a transformation of the two graphs into an association graph. The MCS problem is then transformed into a clique detection problem. Raymond et al. [36] use the same problem transformation and propose an exact multi-step algorithm which defines a similarity based on computing the unconnected MCES. The problem still remains theoretically NP-complete, but their algorithm makes use of advanced heuristics to reduce the number of matchings required.

Cao et al. [4] propose an improved backtracking algorithm that is specialized for molecules and works directly on the chemical graph. Because of branch-and-bound heuristics, they can drastically reduce the search space and obtain an efficient algorithm for unconnected MCIS computation. They show that their MCS-based similarity measure outperforms traditional descriptor-based and topology-based similarity measures.

⁹ The definition of an induced subgraph G of a graph H (under the mapping ϕ) requires that for every u, v ∈ V(G): {u, v} ∈ E(G) iff {ϕ(u), ϕ(v)} ∈ E(H).


Figure 10: Comparison of the performance of the SVM classifiers on NCI60.

Instead of using heuristics, an alternative approach to reduce the complexity could be to consider common substructures which can be computed more easily, such as multisets of common vertex labels [25]. However, not taking into account the more complex shared substructures is an important drawback.

Our algorithm is different from the other approaches in the sense that it computes a connected MCS that fulfils the requirements of the BBP subgraph isomorphism in an exact way. The algorithm works directly on the chemical graph and does not require an advanced graph-theoretical problem transformation.

4.2 QSAR methods

Determining quantitative structure-activity relationships (QSAR) is the process in which chemical structure is quantitatively correlated with a biological or chemical activity. Usually, a function is learned that takes as input a number of physicochemical and structural properties of a molecule and then returns its predicted activity. A number of learning approaches have been proposed for QSAR. Depending on the representation they use, we organise them into several categories.


4.2.1 Descriptor-based methods

The first methods for QSAR were descriptor-based: using various measurable physicochemical properties (such as polarity or hydrophobicity) of a molecule as descriptors, the molecule can be transformed into a vector of real numbers, each number representing a specific property [16]. Then, a number of propositional similarity measures (e.g., the Tanimoto coefficient [48]) and machine learning methods can be applied.

There are two main difficulties with this approach: (1) domain knowledge is required to select the proper descriptors, and (2) information about the molecular structure is lost. As there is common agreement that the structure is important when developing similarity measures, there has been an increased interest in relational learning methods.

4.2.2 Relational methods

Because relational methods take structural information directly into account, these methods are often referred to as topology-based approaches. One framework in which relational approaches have been developed is that of inductive logic programming (ILP), where the molecules and models for them are described with first-order logic. In this domain, a number of algorithms have been proposed that obtain excellent results on several QSAR tasks [26, 18]. However, because of computational complexity issues, it is still a challenge for ILP algorithms to cope with large datasets.

Actually, the ILP representation is much more general than what is needed to model molecules and to discover substructures. Therefore, a number of methods aim at achieving better performance by building systems that use special-purpose data structures for representing graphs and provide well-chosen primitives for manipulating them (e.g., [49, 35, 7]). In this way, they transform the problem of finding chemical substructures into that of finding subgraphs in a graph. Still, finding subgraphs requires expensive subgraph isomorphism matchings.

A number of such specialised graph algorithms have been developed to compute similarities between graphs [38, 8], such as the MCS algorithms that were discussed in Sect. 4.1. De Raedt and Ramon [10] show that the notion of a maximum common subgraph (or a minimally general generalisation in relational learning terms) can be used to construct a distance measure.

Furthermore, kernels can be viewed as a kind of similarity measure as well. Following the increased emphasis on structure, several kernel functions have been developed that work on graphs directly [20, 15, 44, 5, 43]. Again, the main problem here is to select kernel functions that can capture the molecule's activity while still remaining efficient to compute. Many types of graph kernels correspond to some implicit feature space where features are generated for certain subgraphs or graph properties. The kernel then counts how many subgraphs two examples have in common.


The subgraphs are typically defined in such a way that they can be enumerated implicitly and such that the counting procedure can be done efficiently (e.g., via dynamic programming procedures). For example, Ceroni et al. [5] have introduced the weighted decomposition kernel (WDK), which obtains state-of-the-art results when used for the classification of molecules. It is based on a decomposition of the molecule into a selector (a single vertex) and a context (a fixed-radius subgraph surrounding the selector). By selecting an appropriate kernel for these structures, the computation of the WDK kernel remains feasible.

4.2.3 Propositionalization methods

To avoid having to deal with subgraph isomorphisms during the learning phase, a popular approach in relational learning is to propositionalize the graph-based representation into an attribute-valued representation [9]. In this context, the learning process involves two separate steps: first, the complete dataset of graphs is searched for subgraphs and then, each example is represented as a bit-vector, often called a fingerprint, that encodes the occurrences of these subgraphs in the example (see Sect. 3.2). The difference with the above mentioned descriptor-based methods is that the vectors here encode topological features instead of numerical ones, trying to minimize the loss of structural information. Since the graph miner automatically selects the “most interesting” features, this approach solves the first difficulty mentioned above, while the second difficulty is transformed into selecting a suitable criterion that finds the most interesting features, i.e., the patterns that best describe the molecule's topology in order to discriminate molecules w.r.t. their activity.

To this end, various techniques have been used to generate features of interest [29]. In the chemoinformatics context, there are two important approaches: either the patterns are generated exhaustively, or some kind of constraint is used to select the most interesting patterns.

The current state-of-the-art methods for feature construction for molecules generate all sequences [48] or subgraphs [46] of size up to k that occur in at least one molecule. Because even for small values of k this rapidly leads to vast numbers of features, the generated features are typically compressed by hashing the occurrences of the paths onto a fixed-length vector. Because of its size, the resulting pattern set is hard to interpret, and the use of hashing methods only aggravates this problem.
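A minimal sketch of this hashing idea (our illustration, not the implementation of any particular fingerprinting tool): each path, represented as a string of vertex and edge labels, is hashed onto a position of a fixed-length bit vector.

import hashlib
import numpy as np

def hashed_fingerprint(paths, n_bits: int = 1024) -> np.ndarray:
    # Fold an arbitrary number of label paths (e.g. 'C-single-C-aromatic-N')
    # into a fixed-length bit vector; hash collisions are accepted deliberately.
    fp = np.zeros(n_bits, dtype=np.uint8)
    for path in paths:
        digest = hashlib.md5(path.encode("utf-8")).hexdigest()
        fp[int(digest, 16) % n_bits] = 1
    return fp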

In order to limit the size of the pattern set, a number of constraint-based graph mining methods have been proposed, where constraints on the subgraphs of interest are formulated. A wide variety of different constraints has been considered, such as frequency [12, 46], statistical significance measures such as χ2 [2, 17], syntactical constraints [28], randomization [6] or a combination of those [31]. As mentioned in the introduction, Horváth et al. [21] proposed to use the BBP subgraph isomorphism on outerplanar graphs, yielding a set of patterns that are topologically restricted and can be mined more efficiently. In these types of approaches, one typically performs a complete search, which leads to finding all patterns satisfying the constraints.


5 Conclusions and Future Work

In this article, we have introduced an algorithm computing the MCS of two outerplanar graphs under the BBP subgraph isomorphism. We have proved its correctness and shown that it runs in polynomial time, both theoretically and empirically.

The polynomial time complexity is realized by four restrictions. First, the algorithm only works on outerplanar graphs. Most molecular graphs, which are of most interest to us, are outerplanar. Second, the algorithm uses the BBP subgraph isomorphism as matching operator. Again, in the chemical context, this seems a sensible choice. Third, we only consider connected MCSs. Fourth, when there are multiple possible MCSs, we select only one at random.

We have shown that our algorithm outperforms a state-of-the-art MCS algorithm that computes an unconnected MCS under the general subgraph isomorphism, both in terms of efficiency and predictive performance. To evaluate the predictive performance, we have constructed a metric based on the MCS and used it in an instance-based learning classification task.

We have also investigated our MCS-based metric and it turns out that this BBP-based metric outperforms previously published metrics. One reason may be that dealing differently with cycles and linear fragments makes sense in chemical applications. Moreover, it is more intuitive than other metrics and graph kernels and it uses the original graph structure.

Finally, we have investigated the performance of the BBP matching operator in general. Experimental results show that the BBP matching operator obtains good performance as well when used to generate features.

We can conclude that, at least for molecular datasets, the BBP matching operator can, in addition to the gain in efficiency, also improve the predictive performance of graph mining techniques and hence, it is an interesting matching operator for molecules. Therefore, depending on the situation (e.g., the number of examples) and the user's preferences (e.g., the interpretability of the predictions), either a BBP-based metric (IBL) or BBP-generated features can be good choices.

Since a fraction of the molecules in the NCI database cannot be represented by outerplanar graphs, it is useful to investigate how the ideas of the BBP matching operator can be extended to non-outerplanar graphs. It would also be interesting to run our algorithm on other outerplanar datasets such as RNA molecules and to investigate other classes of graphs which would be suitable to represent molecules (e.g., graphs of bounded treewidth) and for which we can design polynomial algorithms. Finally, comparing with similar algorithms, such as the one of Akutsu [1], remains interesting future work.

Acknowledgements

Leander Schietgat is supported by the ERC Starting Grant 240186 “MiGraNT” and the Research Fund KU Leuven. Jan Ramon is supported by the ERC Starting Grant 240186 “MiGraNT” and the KU Leuven GOA project “Probabilistic


Logic Learning”. This research was also supported by FWO Vlaanderen. We thank Fabrizio Costa for valuable discussions and Mostafa Haghir Chehreghani for proofreading.

References

[1] T. Akutsu. A polynomial time algorithm for finding a largest common subgraph of almost trees of bounded degree. IEICE Transactions on Fundamentals of Electronics, Communications and Computer Science, E76-A:1488–1493, 1993.

[2] B. Bringmann, A. Zimmermann, L. De Raedt, and S. Nijssen. Don't be afraid of simpler patterns. In Proceedings of the 10th European Conference on Principles and Practice of Knowledge Discovery in Databases, pages 55–66, 2006.

[3] H. Bunke and K. Shearer. A graph distance metric based on the maximal common subgraph. Pattern Recognition Letters, 19(3–4):255–259, 1998.

[4] Y. Cao, T. Jiang, and T. Girke. A maximum common substructure-based algorithm for searching and predicting drug-like compounds. Bioinformatics, 24(13):i366–i374, 2008.

[5] A. Ceroni, F. Costa, and P. Frasconi. Classification of small molecules by two- and three-dimensional decomposition kernels. Bioinformatics, 23(16):2038–2045, 2007.

[6] V. Chaoji, M. Al Hasan, S. Salem, J. Besson, and M.J. Zaki. Origami: A novel and effective approach for mining representative orthogonal graph patterns. Statistical Analysis and Data Mining, 1(2):67–84, 2008.

[7] Y. Chi, R.R. Muntz, S. Nijssen, and J.N. Kok. Frequent subtree mining – an overview. Fundamenta Informaticae, 66(1–2):161–198, 2005.

[8] D. Conte, P. Foggia, C. Sansone, and M. Vento. Thirty years of graph matching in pattern recognition. International Journal of Pattern Recognition and Artificial Intelligence, 18(3):265–298, 2004.

[9] L. De Raedt. Logical and Relational Learning. Springer, 2008.

[10] L. De Raedt and J. Ramon. Deriving distance metrics from generality relations. Pattern Recognition Letters, 30(3):187–191, 2009.

[11] J. Demšar. Statistical comparisons of classifiers over multiple data sets. Journal of Machine Learning Research, 7(Jan):1–30, 2006.

[12] M. Deshpande, M. Kuramochi, N. Wale, and G. Karypis. Frequent substructure-based approaches for classifying chemical compounds. IEEE Transactions on Knowledge and Data Engineering, 17(8):1036–1050, 2005.


[13] R. Diestel. Graph Theory. Springer-Verlag, 2000.

[14] M.R. Garey and D.S. Johnson. Computers and Intractability: A Guide to the Theory of NP-Completeness. Freeman and Co., 1979.

[15] T. Gärtner. Kernels for Structured Data. World Scientific, 2008.

[16] C. Hansch, P.P. Maloney, T. Fujita, and R.M. Muir. Correlation of biological activity of phenoxyacetic acids with Hammett substituent constants and partition coefficients. Nature, 194:178–180, 1962.

[17] H. He and A.K. Singh. GraphRank: Statistical modeling and mining of significant subgraphs in the feature space. In ICDM '06: Proceedings of the Sixth International Conference on Data Mining, pages 885–890, Washington, DC, USA, 2006. IEEE Computer Society.

[18] C. Helma, S. Kramer, and L. De Raedt. Data mining and machine learning techniques for the identification of mutagenicity inducing substructures and structure activity relationships of noncongeneric compounds. Journal of Chemical Information and Computer Sciences, 44(4):1402–1411, 2004.

[19] J.E. Hopcroft and R.M. Karp. An n^{5/2} algorithm for maximum matchings in bipartite graphs. SIAM Journal on Computing, 2:225–231, 1973.

[20] T. Horváth, T. Gärtner, and S. Wrobel. Cyclic pattern kernels for predictive graph mining. In KDD '04: Proceedings of the Tenth ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pages 158–167, 2004.

[21] T. Horváth, J. Ramon, and S. Wrobel. Frequent subgraph mining in outerplanar graphs. In KDD '06: Proceedings of the Twelfth ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pages 197–206, Philadelphia, PA, August 2006.

[22] T. Horváth, J. Ramon, and S. Wrobel. Frequent subgraph mining in outerplanar graphs. Data Mining and Knowledge Discovery, 21(3):472–508, 2010.

[23] T. Joachims. Learning to Classify Text using Support Vector Machines: Methods, Theory, and Algorithms. Springer, 2002.

[24] M.A. Johnson and G.M. Maggiora. Concepts and Applications of Molecular Similarity. John Wiley, 1990.

[25] T. Karunaratne and H. Boström. Learning to classify structured data by graph propositionalization. In Proceedings of the Second IASTED International Conference on Computational Intelligence, pages 393–398, 2006.


[26] R.D. King, S. Muggleton, A. Srinivasan, and M.J.E. Sternberg. Structure-activity relationships derived by machine learning: The use of atoms and their bond connectivities to predict mutagenicity by inductive logic programming. Proceedings of the National Academy of Sciences, 93:438–442, 1996.

[27] I. Koch. Enumerating all connected maximal common subgraphs in two graphs. Theoretical Computer Science, 250(1–2):1–30, 2001.

[28] S. Kramer, L. De Raedt, and C. Helma. Molecular feature mining in HIV data. In Proceedings of the Seventh ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (KDD-01), pages 136–143. ACM Press, 2001.

[29] S. Kramer, N. Lavrač, and P. Flach. Propositionalization approaches to relational data mining. In S. Džeroski and N. Lavrač, editors, Relational Data Mining, pages 262–291. Springer-Verlag, 2001.

[30] A. Lingas. Subgraph isomorphism for biconnected outerplanar graphs in cubic time. Theoretical Computer Science, 63:295–302, 1989.

[31] A. Maunz, C. Helma, and S. Kramer. Large-scale graph mining using backbone refinement classes. In KDD '09: Proceedings of the 15th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pages 617–626, New York, NY, USA, 2009. ACM.

[32] J.J. McGregor. Backtrack search algorithms and the maximal common subgraph problem. Software: Practice and Experience, 12:23–34, 1982.

[33] S.L. Mitchell. Linear algorithms to recognize outerplanar and maximal outerplanar graphs. Information Processing Letters, 9(5):229–232, 1979.

[34] J. Munkres. Algorithms for the assignment and transportation problems. Journal of the Society for Industrial and Applied Mathematics, 5(1):32–38, 1957.

[35] S. Nijssen and J.N. Kok. A quickstart in frequent structure mining can make a difference. In Proceedings of the Tenth ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (KDD), pages 647–652, 2004.

[36] J. Raymond, E. Gardiner, and P. Willett. RASCAL: Calculation of graph similarity using maximum common edge subgraphs. The Computer Journal, 45:631–644, 2002.

[37] J. Raymond and P. Willett. Effectiveness of graph-based and fingerprint-based similarity measures for virtual screening of 2D chemical structure databases. Journal of Computer-Aided Molecular Design, 16:59–71, 2002.


[38] J. Raymond and P. Willett. Maximum common subgraph isomorphism algorithms for the matching of chemical structures. Journal of Computer-Aided Molecular Design, 16:521–533, 2002.

[39] L. Schietgat, J. Ramon, M. Bruynooghe, and H. Blockeel. An efficiently computable graph-based metric for the classification of small molecules. In Proceedings of the 11th International Conference on Discovery Science, volume 5255 of Lecture Notes in Artificial Intelligence, pages 197–209, 2008.

[40] L. Schietgat, F. Costa, J. Ramon, and L. De Raedt. Effective feature construction by maximum common subgraph sampling. Machine Learning, 83(2):137–161, 2011.

[41] R. Shamir and D. Tsur. Faster subtree isomorphism. Journal of Algorithms, 33(2):267–280, 1999.

[42] K. Shearer, H. Bunke, and S. Venkatesh. Video indexing and similarity retrieval by largest common subgraph detection using decision trees. Pattern Recognition, 34(5):1075–1091, 2001.

[43] N. Shervashidze and K. Borgwardt. Fast subtree kernels on graphs. In Y. Bengio, D. Schuurmans, J. Lafferty, C.K.I. Williams, and A. Culotta, editors, Advances in Neural Information Processing Systems 22, pages 1660–1668, 2009.

[44] S.J. Swamidass, J. Chen, J. Bruand, P. Phung, L. Ralaivola, and P. Baldi. Kernels for small molecules and the prediction of mutagenicity, toxicity and anti-cancer activity. Bioinformatics, 21(suppl 1):i359–i368, 2005.

[45] M. Sysło. The subgraph isomorphism problem for outerplanar graphs. Theoretical Computer Science, 17(1):91–97, 1982.

[46] N. Wale, I.A. Watson, and G. Karypis. Comparison of descriptor spaces for chemical compound retrieval and classification. Knowledge and Information Systems, 14:347–375, 2008.

[47] F. Wilcoxon. Individual comparisons by ranking methods. Biometrics Bulletin, 1(6):80–83, 1945.

[48] P. Willett. Similarity-based virtual screening using 2D fingerprints. Drug Discovery Today, 11(23/24):1046–1051, 2006.

[49] X. Yan and J. Han. gSpan: Graph-based substructure pattern mining. In Proceedings of the 2002 IEEE International Conference on Data Mining (ICDM 2002), pages 721–724, Japan, 2002. IEEE Computer Society.
