
Chapter 11

GRAPH CLASSIFICATION

Koji Tsuda
Computational Biology Research Center, National Institute of Advanced Industrial Science and Technology (AIST)
Tokyo, Japan

[email protected]

Hiroto Saigo
Max Planck Institute for Informatics
Saarbrücken, Germany

[email protected]

Abstract    Supervised learning on graphs is a central subject in graph data processing. In graph classification and regression, we assume that the target values of a certain number of graphs or a certain part of a graph are available as a training dataset, and our goal is to derive the target values of other graphs or the remaining part of the graph. In drug discovery applications, for example, a graph and its target value correspond to a chemical compound and its chemical activity. In this chapter, we review state-of-the-art methods of graph classification. In particular, we focus on two representative methods, graph kernels and graph boosting, and we present other methods in relation to the two methods. We describe the strengths and weaknesses of different graph classification methods and recent efforts to overcome the challenges.

Keywords: graph classification, graph mining, graph kernels, graph boosting

1. Introduction

Graphs are general and powerful data structures that can be used to represent diverse kinds of objects. Much of the real world data is represented not as vectors, but as graphs (including sequences and trees, which are specialized graphs). Examples include biological sequences, semi-structured texts such as HTML and XML, chemical compounds, RNA secondary structures, API call graphs, etc. The topic of graph data processing is not new. Over the last three decades, there have been continuous efforts in developing new methods for processing graph data. Recently we have seen a surge of interest in this topic, fueled partly by new technical advances, for example, the development of graph kernels [21] and graph mining [52] techniques, and partly by demands from new applications, for example, chemical informatics. In fact, chemical informatics is one of the most prominent fields that deal with large repositories of graph data. For example, NCBI's PubChem has millions of chemical compounds that are naturally represented as molecular graphs. Also, many different kinds of chemical activity data are available, which provides a huge test-bed for graph classification methods.

Figure 11.1. Graph classification and label propagation.

This chapter aims at giving an overview of existing graph classification methods. The term "graph classification" can mean two different tasks. The first task is to build a model to predict the class label of a whole graph (Figure 11.1, left). The second task is to predict the class labels of nodes in a large graph (Figure 11.1, right). For clarity, we use the term to represent the first task, and we call the second task "label propagation" [6]. This chapter mainly deals with graph classification, but we will provide a short review of label propagation in Section 5.

Graph classification tasks can either be unsupervised or supervised. Unsupervised methods classify graphs into a certain number of categories by similarity [47, 46]. In supervised classification, a classification model is constructed by learning from training data. In the training data, each graph (e.g., a chemical compound) has a target value or a class label (e.g., biochemical activity). Supervised methods are more fundamental from a technical point of view, because unsupervised learning problems can be solved by supervised methods via probabilistic modeling of latent class labels [46]. In this chapter, we focus on two supervised methods for graph classification: graph kernels and graph boosting [40], which are similarity- and feature-based, respectively. The two methods differ in many aspects, and a characterization of the difference of these two methods would be helpful in characterizing other methods.

Figure 11.2. Prediction rules of kernel methods.

Kernel methods, such as support vector machines, construct a prediction rule based on a similarity function between two objects [42]. Similarity functions which satisfy a mathematical condition called positive definiteness are called kernel functions. For example, in Figure 11.2, the similarity between two objects is represented by a kernel function K(x, x′). The prediction function f(x) is a linear combination of x's similarities to each training example K(x, xi), i = 1, . . . , n. In order to apply kernel methods to graph data, it is necessary to define a kernel function for graphs that can measure the similarity between two graphs. It is natural to use the number of shared substructures in two graphs as a similarity measure. However, the enumeration of subgraphs of a given graph is NP-hard [12]. Therefore, one needs to use simpler substructures such as paths and trees. Graph kernels [21] are based on the weighted counts of common paths. A clever recursive algorithm is employed to compute the similarity without total enumeration of substructures.
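For concreteness, such a kernel-based prediction rule typically takes the standard support vector machine form (the coefficients αi and the offset b below are part of this sketch, not notation used elsewhere in this chapter):

$$f(x) = \sum_{i=1}^{n} \alpha_i\, y_i\, K(x, x_i) + b,$$

where the coefficients αi are obtained by training on the examples (xi, yi), and typically only a subset of them (the support vectors) are nonzero.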

One obvious drawback of graph kernels is that it is not clear which substructures have the biggest contribution to classification. For a new graph classified by similarity, it is not always possible to know which part of the compound is essential in classification. In many chemical applications, the users are interested not only in accurate prediction of biochemical activities, but also in the mechanism creating the activities. This interpretation problem motivates us to reexamine the approach of subgraph enumeration. Recently, frequent subgraph enumeration algorithms such as AGM [18], Gaston [33] and gSpan [52] have been proposed. They can enumerate all the subgraph patterns that appear more than m times in a graph database. The threshold m is called the minimum support. Frequent subgraph patterns are determined by branch-and-bound search in a tree shaped search space (Figure 11.7). The computational time crucially depends on the minimum support parameter. For larger values of the support parameter, the search tree can be pruned earlier. For chemical compound datasets, it is easy to mine tens of thousands of graphs on a commodity desktop computer if the minimum support is reasonably high (e.g., 10% of the number of graphs). However, it is known that, to achieve the best accuracy, the minimum support has to be set to a small value (e.g., smaller than 1%) [51, 23, 16]. In such a setting, graph mining becomes prohibitively inefficient, because the algorithm creates millions of patterns. This also makes subsequent processing very expensive. Graph boosting [40] progressively constructs the prediction rule in an iterative fashion, and in each iteration only a few informative subgraphs are discovered. In comparison to the naïve method of using frequent mining and support vector machines, the graph mining routine has to be invoked multiple times. However, an additional search tree pruning condition can speed up each call, and the overall time is shorter than the naïve method.

The rest of this chapter is organized as follows. In Section 2, we will explain graph kernels, and review their recent extensions for graph classification. In Section 3, we will discuss graph boosting and other methods based on explicit substructure mining. Applications of graph classification methods are reviewed in Section 4. Section 5 briefly presents label propagation techniques. We conclude the chapter in Section 6.

2. Graph Kernels

We consider a graph kernel as a similarity measure for two graphs whose nodes and edges are labeled (Figure 11.3). In this section, we present the most fundamental kernel, called the marginalized graph kernel [21], which is based on graph paths. Recently, different versions of graph kernels have been proposed using different substructures. Examples include cyclic paths [17] and trees [29].

The proposed graph kernel is based on the idea of random walking. For the labeled graph shown in Figure 11.3a, a label sequence is produced by traversing the graph. A representative example is as follows:

(A, c, C, b, A, a, B).    (2.1)

The vertex labels A, B, C, D and the edge labels a, b, c, d appear alternately. By repeating random walks with random initial and end points, it is possible to obtain the probabilities for all possible walks (Figure 11.3b). The essential idea of the graph kernel is to derive a similarity measure of two graphs by comparing their probability tables. It is computationally infeasible to perform all possible random walks. Therefore, we employ a recursive algorithm which can estimate the underlying probabilities. The node and edge labels are either discrete symbols or vectors. In the latter case, it is necessary to define node kernels and edge kernels to specify the similarity of vectors.

Figure 11.3. (a) An example of labeled graphs. Vertices and edges are labeled by uppercase and lowercase letters, respectively. By traversing along the bold edges, the label sequence (2.1) is produced. (b) By repeating random walks, one can construct a list of probabilities.

Before describing technical details, we formally define a labeled graph. Let ΣV denote the set of vertex labels, and ΣE the set of edge labels. Let X be a finite nonempty set of vertices, and v be a function v : X → ΣV. Let ℒ be a set of vertex pairs that denote edges, and e be a function e : ℒ → ΣE. (We assume that there are no multiple edges from one vertex to another.) Then G = (X, v, ℒ, e) is a labeled graph with directed edges. Our task is to construct a kernel function k(G, G′) between two labeled graphs G and G′.

2.1 Random Walks on Graphs

We extract features (labeled sequences) from a graph G by performing random walks. At the first step, we sample a node x1 ∈ X from an initial probability distribution ps(x1). Subsequently, at the ith step, the next vertex xi ∈ X is sampled subject to a transition probability pt(xi∣xi−1), or the random walk ends at node xi−1 with probability pq(xi−1). In other words, at the ith step, we have

$$\sum_{k=1}^{|\mathcal{X}|} p_t(x_k \mid x_{i-1}) + p_q(x_{i-1}) = 1, \qquad (2.2)$$

that is, at each step, the probabilities of transition and termination sum to 1. When we do not have any prior knowledge, we can set the initial probability distribution ps to be the uniform distribution, the transition probability pt to be a uniform distribution over the vertices adjacent to the current vertex, and the termination probability pq to be a small constant probability.

From the random walk, we obtain a sequence of vertices called a path:

x = (x1, x2, . . . , xℓ), (2.3)

where ℓ is the length of x (possibly infinite). The final probability of obtaining path x is the product of the probabilities that the path starts with x1, transits from xi−1 to xi for each i, and finally terminates with xℓ:

$$p(\boldsymbol{x} \mid G) = p_s(x_1) \prod_{i=2}^{\ell} p_t(x_i \mid x_{i-1})\, p_q(x_\ell).$$
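As an illustration of this sampling scheme, the following sketch draws label sequences from a labeled graph with a uniform initial distribution, uniform transitions over out-neighbors, and a constant termination probability. The graph encoding (dictionaries of vertex labels and edge labels) is an assumption of this sketch, not a data structure prescribed by the chapter.

```python
import random

def sample_label_sequence(vertex_labels, edge_labels, p_q=0.1, rng=random):
    """Sample one label sequence h_x by a random walk on a labeled graph.

    vertex_labels: dict mapping vertex -> label (e.g., {0: 'A', 1: 'C', ...})
    edge_labels:   dict mapping (u, v) -> label (directed edges)
    p_q:           constant termination probability at every step
    """
    # Adjacency lists derived from the labeled edges.
    adjacency = {u: [] for u in vertex_labels}
    for (u, v) in edge_labels:
        adjacency[u].append(v)

    # Uniform initial distribution p_s over all vertices.
    current = rng.choice(list(vertex_labels))
    sequence = [vertex_labels[current]]

    while True:
        neighbors = adjacency[current]
        # Terminate with probability p_q (or when the walk has nowhere to go).
        if not neighbors or rng.random() < p_q:
            return tuple(sequence)
        # Uniform transition p_t over the out-neighbors of the current vertex.
        nxt = rng.choice(neighbors)
        sequence.append(edge_labels[(current, nxt)])
        sequence.append(vertex_labels[nxt])
        current = nxt

# Example: the walk alternates vertex and edge labels, e.g. ('A', 'c', 'C', 'b', 'A').
vertices = {0: 'A', 1: 'C', 2: 'A', 3: 'B', 4: 'D'}
edges = {(0, 1): 'c', (1, 2): 'b', (2, 3): 'a', (3, 4): 'd', (4, 0): 'b'}
print(sample_label_sequence(vertices, edges))
```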

Let us define a label sequence as a sequence of alternating vertex labels and edge labels:

$$\boldsymbol{h} = (h_1, h_2, \ldots, h_{2\ell-1}) \in (\Sigma_V \Sigma_E)^{\ell-1}\,\Sigma_V.$$

Associated with a path x, we obtain a label sequence

$$\boldsymbol{h}_{\boldsymbol{x}} = (v_{x_1}, e_{x_1 x_2}, v_{x_2}, e_{x_2 x_3}, \ldots, v_{x_\ell}),$$

which is a sequence of alternating vertex and edge labels. Since multiple vertices (edges) may have the same label, multiple paths may map to one label sequence. The probability of obtaining a label sequence h is thus the sum of the probabilities of each path that emits h. This can be expressed as

$$p(\boldsymbol{h} \mid G) = \sum_{\boldsymbol{x}} \delta(\boldsymbol{h} = \boldsymbol{h}_{\boldsymbol{x}}) \cdot \left( p_s(x_1) \prod_{i=2}^{\ell} p_t(x_i \mid x_{i-1})\, p_q(x_\ell) \right),$$

where δ is a function that returns 1 if its argument holds, and 0 otherwise.

2.2 Label Sequence Kernel

We now define a kernel kz between two label sequences h and h′. The sequence kernel is defined based on kernels for vertex labels and edge labels. We assume that two kernel functions, kv(v, v′) and ke(e, e′), are readily defined between vertex labels and edge labels, and we constrain both kernels to be non-negative (this constraint will play an important role in proving the convergence of our kernel). An example of a vertex label kernel is the identity kernel, that is, the kernel returns 1 if the two labels are the same, and 0 otherwise. It can be expressed as

$$k_v(v, v') = \delta(v = v'), \qquad (2.4)$$

where δ(·) is a function that returns 1 if its argument holds, and 0 otherwise. The above kernel (2.4) is for labels of discrete values. If the labels are defined in ℝ, then the Gaussian kernel can be used as a natural choice [42]:

$$k_v(v, v') = \exp\left( -\| v - v' \|^2 / 2\sigma^2 \right). \qquad (2.5)$$

Edge kernels can be defined in the same way as in (2.4) and (2.5). Based on the vertex label and edge label kernels, we define the kernel for label sequences. If two sequences h and h′ are of the same length, i.e., ℓ(h) = ℓ(h′), then the sequence kernel is defined as the product of the label kernels:

$$k_z(\boldsymbol{h}, \boldsymbol{h}') = k_v(h_1, h'_1) \prod_{i=2}^{\ell} k_e(h_{2i-2}, h'_{2i-2})\, k_v(h_{2i-1}, h'_{2i-1}). \qquad (2.6)$$

If the two sequences are of different lengths, i.e., ℓ(h) ≠ ℓ(h′), then the sequence kernel returns 0, that is, kz(h, h′) = 0. Finally, our label sequence kernel is defined as the expectation of kz over all possible h ∈ G and h′ ∈ G′:

$$k(G, G') = \sum_{\boldsymbol{h}} \sum_{\boldsymbol{h}'} k_z(\boldsymbol{h}, \boldsymbol{h}')\, p(\boldsymbol{h} \mid G)\, p(\boldsymbol{h}' \mid G'). \qquad (2.7)$$

Here, p(h∣G)p(h′∣G′) is the probability that h and h′ occur in G and G′, respectively, and kz(h, h′) is their similarity. This kernel is valid, as it can be described as an inner product of two vectors p(h∣G) and p(h′∣G′).
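The sequence-level computation in (2.6) is easy to state directly in code. The sketch below uses the identity kernel (2.4) for both vertex and edge labels; the function names are illustrative and not part of any library.

```python
def k_label(a, b):
    """Identity kernel (2.4) for discrete vertex or edge labels."""
    return 1.0 if a == b else 0.0

def k_z(h, h_prime, k_v=k_label, k_e=k_label):
    """Label sequence kernel (2.6): product of label kernels along two
    alternating label sequences of equal length, 0 otherwise."""
    if len(h) != len(h_prime):
        return 0.0
    value = k_v(h[0], h_prime[0])
    # Odd positions hold edge labels, even positions hold vertex labels.
    for i in range(1, len(h), 2):
        value *= k_e(h[i], h_prime[i]) * k_v(h[i + 1], h_prime[i + 1])
    return value

# Two label sequences of length ell = 3 (vertex, edge, vertex, edge, vertex).
print(k_z(('A', 'c', 'C', 'b', 'A'), ('A', 'c', 'C', 'a', 'A')))  # 0.0
print(k_z(('A', 'c', 'C', 'b', 'A'), ('A', 'c', 'C', 'b', 'A')))  # 1.0
```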

2.3 Efficient Computation of Label Sequence Kernels

The label sequence kernel (2.7) defined above can be expanded as follows:

$$
\begin{aligned}
k(G,G') = \sum_{\ell=1}^{\infty} \sum_{\boldsymbol{h}} \sum_{\boldsymbol{h}'}
& \; k_v(h_1,h'_1) \Big( \prod_{i=2}^{\ell} k_e(h_{2i-2},h'_{2i-2})\,k_v(h_{2i-1},h'_{2i-1}) \Big) \\
& \times \Big( \sum_{\boldsymbol{x}} \delta(\boldsymbol{h}=\boldsymbol{h}_{\boldsymbol{x}}) \cdot p_s(x_1) \prod_{i=2}^{\ell} p_t(x_i \mid x_{i-1})\, p_q(x_\ell) \Big) \\
& \times \Big( \sum_{\boldsymbol{x}'} \delta(\boldsymbol{h}'=\boldsymbol{h}_{\boldsymbol{x}'}) \cdot p'_s(x'_1) \prod_{i=2}^{\ell} p'_t(x'_i \mid x'_{i-1})\, p'_q(x'_\ell) \Big).
\end{aligned}
$$

The straightforward enumeration of all terms to compute the sum has a prohibitive computational cost. In particular, for cyclic graphs, it is infeasible to perform this computation in an enumerative way, because the possible length of a sequence spans from 1 to infinity. Nevertheless, there is an efficient method to compute this kernel, as shown below. The method is based on the observation that the kernel has the following nested structure.

$$
k(G,G') = \lim_{L\to\infty} \sum_{\ell=1}^{L} \sum_{x_1,x'_1} s(x_1,x'_1)
\Bigg( \sum_{x_2,x'_2} t(x_2,x'_2,x_1,x'_1)
\bigg( \sum_{x_3,x'_3} t(x_3,x'_3,x_2,x'_2) \times
\cdots \times \sum_{x_\ell,x'_\ell} t(x_\ell,x'_\ell,x_{\ell-1},x'_{\ell-1})\,q(x_\ell,x'_\ell) \bigg) \cdots \Bigg), \qquad (2.8)
$$

where

$$
\begin{aligned}
s(x_1, x'_1) &= p_s(x_1)\, p'_s(x'_1)\, k_v(v_{x_1}, v'_{x'_1}), \\
q(x_\ell, x'_\ell) &= p_q(x_\ell)\, p'_q(x'_\ell), \\
t(x_i, x'_i, x_{i-1}, x'_{i-1}) &= p_t(x_i \mid x_{i-1})\, p'_t(x'_i \mid x'_{i-1})\, k_v(v_{x_i}, v'_{x'_i})\, k_e(e_{x_{i-1} x_i}, e'_{x'_{i-1} x'_i}).
\end{aligned}
$$

Intuitively, (2.8) computes the expectation of the kernel function over all possible pairs of paths of the same length ℓ. Consider one such pair: (x1, . . . , xℓ) in G and (x′1, . . . , x′ℓ) in G′. Here, ps, pt, and pq denote the initial, transition, and termination probabilities of nodes in graph G, and p′s, p′t, and p′q denote the initial, transition, and termination probabilities of nodes in graph G′. Thus, s(x1, x′1) is the probability-weighted similarity of the first elements in the two paths, q(xℓ, x′ℓ) is the probability that the two paths end with xℓ and x′ℓ, and t(xi, x′i, xi−1, x′i−1) is the probability-weighted similarity of the ith node pair and edge pair in the two paths.

Acyclic Graphs.    Let us first consider the case of acyclic graphs. In an acyclic graph, if there is a directed path from vertex x1 to x2, then there is no directed path from vertex x2 to x1. It is well known that the vertices of a directed acyclic graph can be numbered in a topological order (topological sorting of a graph G can be done in O(|X| + |ℒ|) time [7]) such that every edge from a vertex numbered i to a vertex numbered j satisfies i < j (see Figure 11.4).

Since there are no directed paths from vertex j to vertex i if i < j, we can employ dynamic programming to achieve our goal. Given that both G and G′ are directed acyclic graphs, we can rewrite (2.8) into the following:

$$
k(G,G') = \sum_{x_1,x'_1} s(x_1,x'_1)\,q(x_1,x'_1)
+ \lim_{L\to\infty} \sum_{\ell=2}^{L} \sum_{x_1,x'_1} s(x_1,x'_1)
\Bigg( \sum_{x_2>x_1,\,x'_2>x'_1} t(x_2,x'_2,x_1,x'_1)
\bigg( \sum_{x_3>x_2,\,x'_3>x'_2} t(x_3,x'_3,x_2,x'_2) \times
\cdots \Big( \sum_{x_\ell>x_{\ell-1},\,x'_\ell>x'_{\ell-1}} t(x_\ell,x'_\ell,x_{\ell-1},x'_{\ell-1})\,q(x_\ell,x'_\ell) \Big) \cdots \bigg) \Bigg). \qquad (2.9)
$$

The first term corresponds to paths of length 1, and the second term corresponds to paths longer than 1. We define r(·, ·) as follows:

$$
r(x_1,x'_1) := q(x_1,x'_1)
+ \lim_{L\to\infty} \sum_{\ell=2}^{L}
\Bigg( \sum_{x_2>x_1,\,x'_2>x'_1} t(x_2,x'_2,x_1,x'_1)
\Big( \cdots \Big( \sum_{x_\ell>x_{\ell-1},\,x'_\ell>x'_{\ell-1}} t(x_\ell,x'_\ell,x_{\ell-1},x'_{\ell-1})\,q(x_\ell,x'_\ell) \Big) \Big) \cdots \Bigg). \qquad (2.10)
$$

We can rewrite (2.9) as follows:

$$k(G,G') = \sum_{x_1,x'_1} s(x_1,x'_1)\, r(x_1,x'_1).$$

The merit of defining (2.10) is that we can exploit the following recursive equation:

$$r(x_1,x'_1) = q(x_1,x'_1) + \sum_{j>x_1,\, j'>x'_1} t(j, j', x_1, x'_1)\, r(j, j'). \qquad (2.11)$$

Since all vertices are topologically ordered, r(x1, x′1) can be efficiently computed by dynamic programming (Figure 11.5) for all x1 and x′1. The worst-case time complexity of computing k(G, G′) is O(c · c′ · |X| · |X′|), where c and c′ are the maximum out-degrees of G and G′, respectively.
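A compact way to realize the recursion (2.11) is memoization over topologically numbered vertex pairs. The sketch below assumes the per-pair quantities s, q, and t have already been computed; the names mirror the definitions under (2.8), but the data layout (plain dictionaries) is an assumption of this sketch.

```python
from functools import lru_cache

def acyclic_kernel(vertices, vertices_prime, s, q, t, successors, successors_prime):
    """Compute k(G, G') for two DAGs via the recursion (2.11).

    vertices / vertices_prime: topologically numbered vertex lists of G and G'
    s, q: dicts mapping (x1, x1') -> float, as defined under (2.8)
    t:    dict mapping (x_i, x_i', x_{i-1}, x_{i-1}') -> float (absent if no edge)
    successors / successors_prime: dicts mapping a vertex to its out-neighbors
    """
    @lru_cache(maxsize=None)
    def r(x1, x1p):
        # r(x1, x1') = q(x1, x1') + sum over successor pairs of t(...) * r(...)
        value = q[(x1, x1p)]
        for j in successors[x1]:
            for jp in successors_prime[x1p]:
                value += t.get((j, jp, x1, x1p), 0.0) * r(j, jp)
        return value

    # k(G, G') = sum over all start pairs of s(x1, x1') * r(x1, x1').
    return sum(s[(x1, x1p)] * r(x1, x1p)
               for x1 in vertices for x1p in vertices_prime)
```

Summing over out-neighbors instead of all pairs j > x1, j′ > x′1 is equivalent, because t vanishes whenever the corresponding edge does not exist.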

Figure 11.4. A topologically sorted directed acyclic graph. The label sequence kernel can be efficiently computed by dynamic programming running from right to left.

Figure 11.5. Recursion for computing r(x1, x′1) using the recursive equation (2.11). r(x1, x′1) can be computed based on the precomputed values of r(x2, x′2), x2 > x1, x′2 > x′1.

General Directed Graphs.    For cyclic graphs, nodes cannot be topologically sorted. This means that we cannot employ a one-pass dynamic programming algorithm as for acyclic graphs. However, we can obtain a recursive form of the kernel like (2.11), and reduce the problem to solving a system of simultaneous linear equations.

Let us rewrite (2.8) as

$$k(G,G') = \lim_{L\to\infty} \sum_{\ell=1}^{L} \sum_{x_1,x'_1} s(x_1,x'_1)\, r_\ell(x_1,x'_1), \qquad (2.12)$$

where

$$r_1(x_1,x'_1) := q(x_1,x'_1)$$

and

$$
r_\ell(x_1,x'_1) := \sum_{x_2,x'_2} t(x_2,x'_2,x_1,x'_1)
\Bigg( \sum_{x_3,x'_3} t(x_3,x'_3,x_2,x'_2) \times
\cdots \Big( \sum_{x_\ell,x'_\ell} t(x_\ell,x'_\ell,x_{\ell-1},x'_{\ell-1})\,q(x_\ell,x'_\ell) \Big) \cdots \Bigg)
\quad \text{for } \ell \ge 2.
$$

Replacing the order of summation in (2.12), we have the following:

$$
k(G,G') = \sum_{x_1,x'_1} s(x_1,x'_1) \lim_{L\to\infty} \sum_{\ell=1}^{L} r_\ell(x_1,x'_1)
= \sum_{x_1,x'_1} s(x_1,x'_1) \lim_{L\to\infty} R_L(x_1,x'_1), \qquad (2.13)
$$

where

$$R_L(x_1,x'_1) := \sum_{\ell=1}^{L} r_\ell(x_1,x'_1).$$

Thus we need to compute R∞(x1, x′1) to obtain k(G,G′).

Now let us restate this problem in terms of linear system theory [38]. The following recursive relationship holds between rk and rk−1 (k ≥ 2):

$$r_k(x_1,x'_1) = \sum_{i,j} t(i, j, x_1, x'_1)\, r_{k-1}(i, j). \qquad (2.14)$$

Using (2.14), the recursive relationship for RL also holds as follows:

$$
\begin{aligned}
R_L(x_1,x'_1) &= r_1(x_1,x'_1) + \sum_{k=2}^{L} r_k(x_1,x'_1) \\
&= r_1(x_1,x'_1) + \sum_{k=2}^{L} \sum_{i,j} t(i,j,x_1,x'_1)\, r_{k-1}(i,j) \\
&= r_1(x_1,x'_1) + \sum_{i,j} t(i,j,x_1,x'_1)\, R_{L-1}(i,j). \qquad (2.15)
\end{aligned}
$$

Thus, RL can be perceived as a discrete-time linear system [38] evolving as the time L increases. Assuming that RL converges (see [21] for the convergence condition), we have the following equilibrium equation:

$$R_\infty(x_1,x'_1) = r_1(x_1,x'_1) + \sum_{i,j} t(i,j,x_1,x'_1)\, R_\infty(i,j). \qquad (2.16)$$

Therefore, the computation of the kernel finally requires solving the simultaneous linear equations (2.16) and substituting the solutions into (2.13).

Now let us restate the above discussion in the language of matrices. Let s, r1, and r∞ be |X| · |X′|-dimensional vectors such that

$$
\boldsymbol{s} = (\cdots, s(i,j), \cdots)^\top, \qquad
\boldsymbol{r}_1 = (\cdots, r_1(i,j), \cdots)^\top, \qquad
\boldsymbol{r}_\infty = (\cdots, R_\infty(i,j), \cdots)^\top.
$$

Let the transition probability matrix T be a |X||X′| × |X||X′| matrix with

$$[T]_{(i,j),(k,l)} = t(i, j, k, l).$$

Equation (2.13) can be rewritten as

$$k(G,G') = \boldsymbol{r}_\infty^\top \boldsymbol{s}. \qquad (2.17)$$

Similarly, the recursive equation (2.16) is rewritten as

$$\boldsymbol{r}_\infty = \boldsymbol{r}_1 + T \boldsymbol{r}_\infty.$$

The solution of this equation is

$$\boldsymbol{r}_\infty = (I - T)^{-1} \boldsymbol{r}_1.$$

Finally, the matrix form of the kernel is

$$k(G,G') = \boldsymbol{s}^\top (I - T)^{-1} \boldsymbol{r}_1. \qquad (2.18)$$

Computing the kernel requires solving a linear equation or inverting a matrix with |X||X′| × |X||X′| coefficients. However, the matrix I − T is actually sparse, because the number of non-zero elements of T is less than c · c′ · |X| · |X′|, where c and c′ are the maximum out-degrees of G and G′, respectively. Therefore, we can employ efficient numerical algorithms that exploit sparsity [3]. In our implementation, we employed a simple iterative method that updates RL by using (2.15) until convergence, starting from R1(x1, x′1) = r1(x1, x′1).
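Once the vectors s and r1 and the matrix T have been assembled over vertex pairs, the kernel value follows from one sparse linear solve, as in the sketch below. The construction of s, r1, and T from the two graphs is assumed to have been done already; only the final steps corresponding to (2.18) and (2.15) are shown.

```python
import numpy as np
from scipy.sparse import identity, csr_matrix
from scipy.sparse.linalg import spsolve

def label_sequence_kernel(s, r1, T):
    """Evaluate k(G, G') = s^T (I - T)^{-1} r_1 as in (2.18).

    s, r1: dense vectors of length |X|*|X'| over vertex pairs
    T:     sparse |X||X'| x |X||X'| matrix collecting the t values, laid out so
           that (T r)(x1, x1') = sum_{i,j} t(i, j, x1, x1') r(i, j), as in (2.16)
    """
    T = csr_matrix(T)
    r_inf = spsolve(identity(T.shape[0], format="csr") - T, r1)  # (I - T) r_inf = r_1
    return float(s @ r_inf)

def label_sequence_kernel_iterative(s, r1, T, tol=1e-10, max_iter=10000):
    """Same kernel via the fixed-point iteration R_L = r_1 + T R_{L-1} of (2.15)."""
    T = csr_matrix(T)
    R = r1.copy()
    for _ in range(max_iter):
        R_next = r1 + T @ R
        if np.linalg.norm(R_next - R, ord=np.inf) < tol:
            break
        R = R_next
    return float(s @ R_next)
```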

2.4 Extensions

Vishwanathan et al. [50] proposed a fast way to compute the graph kernel based on the Sylvester equation. Let AX, AY and B denote M × M, N × N and M × N matrices, respectively. They used the following identity to speed up the computation:

$$(A_Y^\top \otimes A_X)\,\mathrm{vec}(B) = \mathrm{vec}(A_X B A_Y),$$

where ⊗ denotes the Kronecker product (tensor product) and vec is the column-stacking vectorization operator. Evaluating the left hand side directly requires O(M²N²) time, while the right hand side requires only O(MN(M + N)) time. Notice that this trick ("vec-trick") has recently been used in link prediction tasks as well [20].
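The identity, and the gap in cost, is easy to check numerically. The sketch below is only a sanity check of the vec-trick on random matrices, not the kernel computation of [50] itself.

```python
import numpy as np

rng = np.random.default_rng(0)
M, N = 4, 3
A_X = rng.standard_normal((M, M))
A_Y = rng.standard_normal((N, N))
B = rng.standard_normal((M, N))

def vec(X):
    """Column-stacking vectorization."""
    return X.reshape(-1, order="F")

# Left hand side: O(M^2 N^2) work through the explicit Kronecker product.
lhs = np.kron(A_Y.T, A_X) @ vec(B)
# Right hand side: O(MN(M+N)) work through two ordinary matrix products.
rhs = vec(A_X @ B @ A_Y)

print(np.allclose(lhs, rhs))  # True
```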

A random walk can trace the same edge back and forth many times ("tottering"), which could be harmful for similarity measurement. Mahé et al. [28] presented an extension of the kernel without tottering and applied it successfully to chemical informatics data.

3. Graph Boosting

Frequent pattern mining techniques are important tools in data mining [14]. The simplest form is the classic problem of itemset mining [1], where frequent subsets are enumerated from a series of sets. The original work on this topic is for transactional data, and since then, researchers have applied frequent pattern mining to other structured data such as sequences [35] and trees [2]. Every pattern mining method uses a search tree to systematically organize the patterns. For general graphs, there are technical difficulties with duplication: it is possible to generate the same graph with different paths of the search tree. Methods such as AGM [18] and gSpan [52] solve this duplication problem by pruning the search nodes whenever duplicates are found.

The simplest way to apply such pattern mining techniques to graph classification is to build a binary feature vector based on the presence or absence of frequent patterns and apply an off-the-shelf classifier. Such methods are employed in a few chemical informatics papers [16, 23]. However, they are obviously suboptimal, because frequent patterns are not necessarily useful for classification. In chemical data, patterns such as C-C or C-C-C are frequent, but have almost no significance.

Figure 11.6. Feature space based on subgraph patterns. The feature vector consists of binary pattern indicators.

To discuss pattern mining strategies for graph classification, let us first define the binary classification problem. The task is to learn a prediction rule from training examples {(Gi, yi)}ni=1, where Gi is a training graph and yi ∈ {+1, −1} is its associated class label. Let P be the set of all patterns, i.e., the set of all subgraphs included in at least one training graph, and d := |P|. Then, each graph Gi is encoded as a d-dimensional vector

$$
x_{i,p} =
\begin{cases}
1 & \text{if } p \subseteq G_i, \\
-1 & \text{otherwise.}
\end{cases}
$$

This feature space is illustrated in Figure 11.6. Since the whole feature space is intractably large, we need to obtain a set of informative patterns without enumerating all patterns (i.e., discriminative pattern mining). This problem is close to feature selection in machine learning. The difference is that it is not allowed to scan all features. As in feature selection, we can consider the following three categories of discriminative pattern mining methods: filter, wrapper and embedded [24]. In filter methods, discriminative patterns are collected by a mining call before the learning algorithm is started. They employ a simple statistical criterion such as information gain [31]. In wrapper and embedded methods, the learning algorithm chooses features via minimization of a sparsity-inducing objective function. Typically, they have a high dimensional weight vector, and most of these weights converge to zero after optimization. In most cases, the sparsity is induced by L1-norm regularization [40]. The difference between wrapper and embedded methods is subtle, but wrapper methods tend to be based on heuristic ideas of reducing the features recursively (recursive feature elimination) [13]. Graph boosting is an embedded method, but to deal with graphs, we need to combine L1-norm regularization with graph mining.


3.1 Formulation of Graph Boosting

The name 'boosting' comes from the fact that linear programming boosting (LPBoost) is used as the fundamental computational framework. In chemical informatics experiments [40], it was shown that the accuracy of graph boosting is better than that of graph kernels. At the same time, key substructures are explicitly discovered.

Our prediction rule is a convex combination of the binary indicators xi,p, and has the form

$$f(\boldsymbol{x}_i) = \sum_{p \in \mathcal{P}} \alpha_p\, x_{i,p}, \qquad (3.1)$$

where α is a |P|-dimensional column vector such that Σp∈P αp = 1 and αp ≥ 0.

This is a linear discriminant function in an intractably large dimensional space. To obtain an interpretable rule, we need to obtain a sparse weight vector α, where only a few weights are nonzero. In the following, we will present a linear programming approach for efficiently capturing such patterns. Our formulation is based on that of LPBoost [8], and the learning problem is represented as

$$\min_{\alpha}\; \|\alpha\|_1 + \lambda \sum_{i=1}^{n} \left[ 1 - y_i f(\boldsymbol{x}_i) \right]_+, \qquad (3.2)$$

where ∥α∥1 = Σp |αp| denotes the ℓ1 norm of α, λ is a regularization parameter, and the subscript "+" indicates the positive part. A soft-margin formulation of the above problem exists [8], and can be written as follows:

$$
\begin{aligned}
\min_{\alpha, \xi, \rho}\;\; & -\rho + \lambda \sum_{i=1}^{n} \xi_i \qquad (3.3) \\
\text{s.t.}\;\; & y_i \sum_{p \in \mathcal{P}} \alpha_p x_{i,p} + \xi_i \ge \rho, \quad \xi_i \ge 0, \quad i = 1, \ldots, n, \qquad (3.4) \\
& \sum_{p \in \mathcal{P}} \alpha_p = 1, \quad \alpha_p \ge 0,
\end{aligned}
$$

where ξ are slack variables, ρ is the margin separating negative examples from positives, and λ = 1/(νn), where ν ∈ (0, 1) is a parameter controlling the cost of misclassification, which has to be found using model selection techniques such as cross-validation. It is known that the optimal solution has the following ν-property:

Theorem 11.1 ([36]). Assume that the solution of (3.3) satisfies ρ ≥ 0. The following statements hold:

1. ν is an upper bound of the fraction of margin errors, i.e., the examples with yi Σp∈P αp xi,p < ρ.

2. ν is a lower bound of the fraction of the examples such that yi Σp∈P αp xi,p ≤ ρ.
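For a small, pre-enumerated pattern set, the soft-margin problem (3.3)-(3.4) is just a linear program and can be handed to a generic LP solver. The sketch below does exactly that with scipy.optimize.linprog on a toy ±1 feature matrix; it is meant only to make the formulation concrete, since for real pattern spaces the point of graph boosting is precisely to avoid materializing X.

```python
import numpy as np
from scipy.optimize import linprog

def lpboost_primal(X, y, nu=0.4):
    """Solve the soft-margin LP (3.3)-(3.4) for a fixed +/-1 feature matrix X.

    X: (n, d) matrix of pattern indicators x_{i,p} in {+1, -1}
    y: (n,) vector of labels in {+1, -1}
    Returns (alpha, xi, rho).
    """
    n, d = X.shape
    lam = 1.0 / (nu * n)

    # Decision variables z = [alpha (d), xi (n), rho (1)].
    c = np.concatenate([np.zeros(d), lam * np.ones(n), [-1.0]])

    # Margin constraints: rho - y_i * (X alpha)_i - xi_i <= 0 for each i.
    A_ub = np.hstack([-(y[:, None] * X), -np.eye(n), np.ones((n, 1))])
    b_ub = np.zeros(n)

    # Simplex constraint: sum_p alpha_p = 1.
    A_eq = np.concatenate([np.ones(d), np.zeros(n), [0.0]])[None, :]
    b_eq = np.array([1.0])

    bounds = [(0, None)] * d + [(0, None)] * n + [(None, None)]
    res = linprog(c, A_ub=A_ub, b_ub=b_ub, A_eq=A_eq, b_eq=b_eq, bounds=bounds)
    alpha, xi, rho = res.x[:d], res.x[d:d + n], res.x[-1]
    return alpha, xi, rho

# Toy example: 4 graphs, 3 hypothetical patterns.
X = np.array([[+1, -1, +1], [+1, +1, -1], [-1, +1, +1], [-1, -1, -1]], float)
y = np.array([+1, +1, -1, -1], float)
print(lpboost_primal(X, y)[0])  # mixing weights alpha over the 3 patterns
```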

Directly solving this optimization problem is intractable due to the large number of variables in α. So we solve the following equivalent dual problem instead:

$$
\begin{aligned}
\min_{\boldsymbol{u}, v}\;\; & v \qquad (3.5) \\
\text{s.t.}\;\; & \sum_{i=1}^{n} u_i y_i x_{i,p} \le v, \quad \forall p \in \mathcal{P}, \qquad (3.6) \\
& \sum_{i=1}^{n} u_i = 1, \quad 0 \le u_i \le \lambda, \quad i = 1, \ldots, n.
\end{aligned}
$$

After solving the dual problem, the primal solution α is obtained from the Lagrange multipliers [8]. The dual problem has a limited number of variables, but a huge number of constraints. Such a linear program can be solved by the column generation technique [27]: starting with an empty pattern set, the pattern whose corresponding constraint is violated the most is identified and added iteratively. Each time a pattern is added, the optimal solution is updated by solving the restricted dual problem. Denote by u(k), v(k) the optimal solution of the restricted problem at iteration k = 0, 1, . . ., and denote by X̂(k) ⊆ P the pattern set at iteration k. Initially, X̂(0) is empty and u(0)i = 1/n. The restricted problem is defined by replacing the set of constraints (3.6) with

$$\sum_{i=1}^{n} u^{(k)}_i y_i x_{i,p} \le v, \quad \forall p \in \hat{X}^{(k)}.$$

The left hand side of the inequality is called the gain in the boosting literature. After solving the problem, X̂(k) is updated to X̂(k+1) by adding a column. Several criteria have been proposed to select new columns [10], but we adopt the simplest rule that is amenable to graph mining: we select the constraint with the largest gain,

$$p^* = \mathop{\mathrm{argmax}}_{p \in \mathcal{P}} \sum_{i=1}^{n} u^{(k)}_i y_i x_{i,p}. \qquad (3.7)$$

The solution set is updated as X̂(k+1) ← X̂(k) ∪ {p∗}. In the next section, we discuss how to efficiently find the largest gain in detail.
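The column selection step (3.7) reduces to maximizing the weighted sum Σi ui yi xi,p over candidate patterns. Ignoring for a moment how graph mining enumerates the candidates, the selection itself looks like the following sketch; the candidate indicator columns are assumed to be available, which is exactly what the pruning of Section 3.2 avoids having to build in full.

```python
import numpy as np

def select_best_pattern(u, y, candidate_columns):
    """Pick the pattern with the largest gain, as in (3.7).

    u: (n,) dual weights u^{(k)} from the restricted problem
    y: (n,) labels in {+1, -1}
    candidate_columns: dict mapping pattern id -> (n,) vector of +/-1 indicators
    Returns (best_pattern_id, best_gain).
    """
    weighted = u * y                       # u_i y_i, shared across all patterns
    gains = {p: float(weighted @ x_p) for p, x_p in candidate_columns.items()}
    best = max(gains, key=gains.get)       # argmax_p sum_i u_i y_i x_{i,p}
    return best, gains[best]

# Toy usage with three hypothetical patterns.
u = np.full(4, 0.25)
y = np.array([+1, +1, -1, -1.0])
cols = {"p1": np.array([+1, +1, -1, -1.0]),
        "p2": np.array([+1, -1, +1, -1.0]),
        "p3": np.array([-1, -1, +1, +1.0])}
print(select_best_pattern(u, y, cols))     # ('p1', 1.0)
```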

One of the big advantages of our method is that we have a stopping criterion that guarantees that the optimal solution is found: if there is no p ∈ P such that

$$\sum_{i=1}^{n} u^{(k)}_i y_i x_{i,p} > v^{(k)}, \qquad (3.8)$$

then the current solution is the optimal dual solution. Empirically, the patterns found in the last few iterations have negligibly small weights. The number of iterations can be decreased by relaxing the condition as

$$\sum_{i=1}^{n} u^{(k)}_i y_i x_{i,p} > v^{(k)} + \epsilon. \qquad (3.9)$$

Let us define the primal objective function as V = −ρ + λ Σi ξi. Due to convex duality, we can guarantee that, for the solution obtained from the early termination (3.9), the objective satisfies V ≤ V∗ + ε, where V∗ is the optimal value with the exact termination (3.8) [8]. In our experiments, ε = 0.01 is always used.

Figure 11.7. Schematic figure of the tree-shaped search space of graph patterns (i.e., the DFS code tree). To find the optimal pattern efficiently, the tree is systematically expanded by rightmost extensions.

3.2 Optimal Pattern Search

Our search strategy is a branch-and-bound algorithm that requires a canonical search space in which a whole set of patterns are enumerated without duplication. As the search space, we adopt the DFS (depth first search) code tree [52]. The basic idea of the DFS code tree is to organize patterns as a tree, where a child node has a supergraph of the parent's pattern (Figure 11.7). A pattern is represented as a text string called the DFS code. The patterns are enumerated by generating the tree from the root to leaves using a recursive algorithm. To avoid duplications, node generation is systematically done by rightmost extensions.

All embeddings of a pattern in the graphs {Gi}ni=1 are maintained in each node. If a pattern matches a graph in different ways, all such embeddings are stored. When a new pattern is created by adding an edge, it is not necessary to perform full isomorphism checks with respect to all graphs in the database. A new list of embeddings is made by extending the embeddings of the parent [52]. Technically, it is necessary to devise a data structure such that the embeddings are stored incrementally, because it takes a prohibitive amount of memory to keep all embeddings independently in each node. As mentioned in (3.7), our aim is to find the optimal hypothesis that maximizes the gain g(p):

$$g(p) = \sum_{i=1}^{n} u^{(k)}_i y_i x_{i,p}. \qquad (3.10)$$

For efficient search, it is important to minimize the size of the actual search space. To this aim, tree pruning is crucially important: suppose the search tree is generated up to the pattern p, and denote by g∗ the maximum gain among the ones observed so far. If it is guaranteed that the gain of any supergraph p′ is not larger than g∗, we can avoid the generation of downstream nodes without losing the optimal pattern. We employ the following pruning condition.

Theorem 11.2 ([30, 26]). Let us define

$$\mu(p) = 2 \sum_{\{i \mid y_i = +1,\, p \subseteq G_i\}} u^{(k)}_i - \sum_{i=1}^{n} y_i u^{(k)}_i.$$

If the following condition is satisfied,

$$g^* > \mu(p), \qquad (3.11)$$

then the inequality g(p′) < g∗ holds for any p′ such that p ⊆ p′.

The gBoost algorithm is summarized in Algorithms 12 and 13.
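In code, the gain (3.10) and the pruning bound of Theorem 11.2 are both simple weighted sums over the graphs that contain the current pattern, as the following sketch illustrates; the occurrence indicator passed in would come from the embeddings maintained in the DFS code tree, which this sketch takes as given.

```python
import numpy as np

def gain(u, y, occurs):
    """Gain g(p) of (3.10); occurs[i] is True iff pattern p is a subgraph of G_i."""
    x = np.where(occurs, 1.0, -1.0)        # x_{i,p} in {+1, -1}
    return float(np.sum(u * y * x))

def pruning_bound(u, y, occurs):
    """Bound mu(p) of Theorem 11.2 on the gain of any supergraph of p."""
    positives = occurs & (y > 0)           # graphs with y_i = +1 containing p
    return float(2.0 * np.sum(u[positives]) - np.sum(y * u))

# A supergraph p' of p can only occur in a subset of the graphs containing p,
# so once the current best gain g* exceeds mu(p), the subtree below p can be
# pruned without losing the optimal pattern.
u = np.full(4, 0.25)
y = np.array([+1, +1, -1, -1.0])
occurs = np.array([True, True, True, False])
print(gain(u, y, occurs), pruning_bound(u, y, occurs))
```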

3.3 Computational Experiments

In [40], it is shown that graph boosting performs better than graph kernels in classification accuracy on chemical compound datasets. The top 20 discriminative subgraphs for a mutagenicity dataset called CPDB are displayed in Figure 11.8. We found that the top 3 substructures with positive weights (0.0672, 0.0656, 0.0577) correspond to known toxicophores [23]. They correspond to aromatic amine, aliphatic halide, and three-membered heterocycle, respectively. In addition, the patterns with weights 0.0431, 0.0412, 0.0411 and 0.0318 seem to be related to polycyclic aromatic systems. From this result alone, we cannot conclude that graph boosting is better on general data. However, since important chemical substructures cannot be represented as paths, it would be reasonable to say that subgraph features are better for chemical data.

Figure 11.8. Top 20 discriminative subgraphs from the CPDB dataset. Each subgraph is shown with the corresponding weight, and ordered by the absolute value from the top left to the bottom right. H atoms are omitted, and C atoms are represented as dots for simplicity. Aromatic bonds appearing in an open form are displayed by the combination of dashed and solid lines.

Algorithm 12 gBoost algorithm: main part
1: X̂(0) = ∅, u(0)i = 1/n, k = 0
2: loop
3:   Find the optimal pattern p∗ based on u(k)
4:   if termination condition (3.9) holds then
5:     break
6:   end if
7:   X̂(k+1) ← X̂(k) ∪ {p∗}
8:   Solve the restricted dual problem (3.5) to obtain u(k+1)
9:   k = k + 1
10: end loop

Algorithm 13 Finding the Optimal Pattern
1: Procedure Optimal Pattern
2: Global variables: g∗, p∗
3: g∗ = −∞
4: for p ∈ DFS codes with single nodes do
5:   project(p)
6: end for
7: return p∗
8: EndProcedure
9:
10: Function project(p)
11: if p is not a minimum DFS code then
12:   return
13: end if
14: if pruning condition (3.11) holds then
15:   return
16: end if
17: if g(p) > g∗ then
18:   g∗ = g(p), p∗ = p
19: end if
20: for p′ ∈ rightmost extensions of p do
21:   project(p′)
22: end for
23: EndFunction

3.4 Related Work

Graph algorithms can be designed based on existing statistical frameworks (i.e., mother algorithms). This allows us to use theoretical results and insights accumulated in past studies. In graph boosting, we employed LPBoost as the mother algorithm. It is possible to employ other algorithms such as partial least squares regression (PLS) [39] and least angle regression (LARS) [45].

When applied to ordinary vectorial data, partial least squares regression extracts a few orthogonal features and performs least squares regression in the projected space [37]. A PLS feature is a linear combination of the original features, and it is often the case that correlated features are summarized into a PLS feature. Sometimes, the subgraph features chosen by graph boosting are not robust against bootstrapping or other data perturbations, whereas the classification accuracy is quite stable. This is due to strong correlation among features corresponding to similar subgraphs. The graph mining version of PLS, gPLS [39], solves this problem by summarizing similar subgraphs into each feature (Figure 11.9). Since only one graph mining call is required to construct each feature, gPLS can build the classification rule more quickly than graph boosting.

Figure 11.9. Patterns obtained by gPLS. Each column corresponds to the patterns of a PLS component.

In graph boosting, it is necessary to set the regularization parameter λ in (3.2). Typically it is determined by cross validation, but there is a different approach called "regularization path tracking". When λ = 0, the weight vector converges to the origin. As λ is increased continuously, the weight vector draws a piecewise linear path. Because of this property, one can track the whole path by repeatedly jumping to the next turning point. We combined the tracking with graph mining in [45]. In ordinary tracking, a feature is added or removed at each turning point. In our graph version, a subgraph to add or remove is found by a customized gSpan search.

The examples shown above were for supervised classification. For unsupervised clustering of graphs, combinations with the EM algorithm [46] and the Dirichlet process [47] have been reported.


4. Applications of Graph Classification

Borgwardt et al. [5] applied the graph kernel method to classify protein 3D structures. It outperformed classical alignment-based approaches. Karklin et al. [19] built a classifier for non-coding RNAs employing a graph representation of RNAs. Outside biology and chemistry, Harchaoui and Bach [15] applied graph kernels to image classification, where each region corresponds to a node and their positional relationships are represented by edges.

Traditionally, graph mining methods are mainly used for small chemical compounds [28, 9]. However, new application areas are emerging. In image processing [34], geometric relationships between points are represented as edges. Software bug detection is an interesting area, where the relationships of APIs are represented as directed graphs and anomalous patterns are detected to identify bugs [11]. In natural language processing, the relationships between words are represented as a graph (e.g., predicate-argument structures) and key phrases are identified as subgraphs [26].

5. Label Propagation

In the previous discussion, the term graph classification means classifying an entire graph. In many applications, we are interested in classifying the nodes. For example, in large-scale network analysis for social networks and biological networks, it is a central task to classify unlabeled nodes given a limited number of labeled nodes (Figure 11.1, right). In Facebook, one can label people who responded to a certain advertisement as positive nodes, and people who did not respond as negative nodes. Based on these labeled nodes, our task is to predict other people's response to the advertisement.

In earlier studies, diffusion kernels were used in combination with support vector machines [25, 48]. The basic idea is to compute the closeness between two nodes in terms of the commute time of random walks between the nodes. Though this approach gained popularity in the machine learning community, a significant drawback is that the derived kernel matrix is dense. For large networks, the diffusion kernel is not suitable because it takes O(n3) time and O(n2) memory. In contrast, label propagation methods use simpler computational strategies that exploit the sparsity of the adjacency matrix [54, 53]. The label propagation method of Zhou et al. [53] is achieved by solving simultaneous linear equations with a sparse coefficient matrix. The time complexity is nearly linear in the number of non-zero entries of the coefficient matrix [49], which is much more efficient than the diffusion kernels. Due to its efficiency, label propagation is gaining popularity in applications with biological networks, where web servers should return the propagation result without much delay [32]. However, the classification performance is quite sensitive to methodological details. For example, Shin et al. pointed out that the introduction of directional propagation can increase the performance significantly [43]. Also, Mostafavi et al. [32] reported that their engineered version outperformed the vanilla version [53]. Label propagation is still an active research field. Recent extensions include the automatic combination of multiple networks [49, 22] and the introduction of probabilistic inference in label propagation [54, 44].
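As a rough illustration of this style of method, the sketch below implements one common label propagation scheme in the spirit of Zhou et al. [53], scoring unlabeled nodes by solving a sparse linear system built from the normalized adjacency matrix. The particular normalization and the value of the mixing parameter are assumptions of this sketch rather than details taken from the chapter.

```python
import numpy as np
import scipy.sparse as sp
from scipy.sparse.linalg import spsolve

def propagate_labels(adjacency, labels, alpha=0.9):
    """Score all nodes from a few labeled ones by label propagation.

    adjacency: sparse symmetric (n, n) adjacency matrix of the network
    labels:    (n,) array with +1 / -1 for labeled nodes and 0 for unlabeled ones
    alpha:     mixing parameter in (0, 1); larger values spread labels further
    Returns an (n,) array of real-valued scores; the sign is the predicted class.
    """
    A = sp.csr_matrix(adjacency, dtype=float)
    degrees = np.asarray(A.sum(axis=1)).ravel()
    d_inv_sqrt = sp.diags(1.0 / np.sqrt(np.maximum(degrees, 1e-12)))
    S = d_inv_sqrt @ A @ d_inv_sqrt              # symmetrically normalized adjacency
    # Solve the sparse system (I - alpha * S) f = y; cost grows with nnz(S).
    return spsolve(sp.identity(A.shape[0], format="csr") - alpha * S, labels)

# Tiny example: a 5-node path graph with the two endpoints labeled.
edges = [(0, 1), (1, 2), (2, 3), (3, 4)]
A = sp.lil_matrix((5, 5))
for i, j in edges:
    A[i, j] = A[j, i] = 1.0
y = np.array([+1.0, 0.0, 0.0, 0.0, -1.0])
print(np.sign(propagate_labels(A, y)))
```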

6. Concluding Remarks

We have covered the two different methods for graph classification. A graph kernel is a similarity measure between two graphs, while graph mining methods can derive characteristic subgraphs that can be used for any subsequent machine learning algorithm. We have the impression that, so far, graph kernels are more frequently applied. Probably this is due to the fact that graph kernels are easier to implement and currently used graph datasets are not so large. However, graph kernels are not suitable for very large data, because it takes O(n2) time to derive the kernel matrix of n training graphs, which is very hard to improve. Toward large scale data, graph mining methods seem more promising because they require only O(n) time. Nevertheless, there remains much to be done in graph mining methods. Existing methods such as gSpan enumerate all subgraphs satisfying a certain frequency-based criterion. However, it is often pointed out that, for graph classification, it is not always necessary to enumerate all subgraphs. Recently, Boley and Grosskreutz proposed a uniform sampling method for frequent itemsets [4]. Such theoretically guaranteed sampling procedures will certainly contribute to graph classification as well.

One fact that hinders the further popularity of graph mining methods is that it is not common to make the code public in the machine learning and data mining community. We have made several easy-to-use codes available: SPIDER (http://www.kyb.tuebingen.mpg.de/bs/people/spider/) contains code for graph kernels, and the gBoost package contains code for graph mining and boosting (http://www.kyb.mpg.de/bs/people/nowozin/gboost/).

References

[1] R. Agrawal and R. Srikant. Fast algorithms for mining association rules in large databases. In Proc. VLDB 1994, pages 487–499, 1994.

[2] T. Asai, K. Abe, S. Kawasoe, H. Arimura, H. Sakamoto, and S. Arikawa. Efficient substructure discovery from large semi-structured data. In Proc. 2nd SIAM Data Mining Conference (SDM), pages 158–174, 2002.

[3] R. Barrett, M. Berry, T. F. Chan, J. Demmel, J. Donato, J. Dongarra, V. Eijkhout, R. Pozo, C. Romine, and H. Van der Vorst. Templates for the Solution of Linear Systems: Building Blocks for Iterative Methods, 2nd Edition. SIAM, Philadelphia, PA, 1994.

[4] M. Boley and H. Grosskreutz. A randomized approach for approximating the number of frequent sets. In Proceedings of the 8th IEEE International Conference on Data Mining, pages 43–52, 2008.

[5] K. M. Borgwardt, C. S. Ong, S. Schönauer, S. V. N. Vishwanathan, A. J. Smola, and H.-P. Kriegel. Protein function prediction via graph kernels. Bioinformatics, 21(suppl. 1):i47–i56, 2006.

[6] O. Chapelle, A. Zien, and B. Schölkopf, editors. Semi-Supervised Learning. MIT Press, Cambridge, MA, 2006.

[7] T. Cormen, C. Leiserson, and R. Rivest. Introduction to Algorithms. MIT Press and McGraw Hill, 1990.

[8] A. Demiriz, K. P. Bennett, and J. Shawe-Taylor. Linear programming boosting via column generation. Machine Learning, 46(1-3):225–254, 2002.

[9] M. Deshpande, M. Kuramochi, N. Wale, and G. Karypis. Frequent substructure-based approaches for classifying chemical compounds. IEEE Trans. Knowl. Data Eng., 17(8):1036–1050, 2005.

[10] O. du Merle, D. Villeneuve, J. Desrosiers, and P. Hansen. Stabilized column generation. Discrete Mathematics, 194:229–237, 1999.

[11] F. Eichinger, K. Böhm, and M. Huber. Mining edge-weighted call graphs to localise software bugs. In Proceedings of the European Conference on Machine Learning and Principles and Practice of Knowledge Discovery in Databases (ECML PKDD), pages 333–348, 2008.

[12] T. Gärtner, P. Flach, and S. Wrobel. On graph kernels: Hardness results and efficient alternatives. In Proc. of the Sixteenth Annual Conference on Computational Learning Theory, 2003.

[13] I. Guyon, J. Weston, S. Barnhill, and V. Vapnik. Gene selection for cancer classification using support vector machines. Machine Learning, 46(1-3):389–422, 2002.

[14] J. Han and M. Kamber. Data Mining: Concepts and Techniques. Morgan Kaufmann, 2000.

[15] Z. Harchaoui and F. Bach. Image classification with segmentation graph kernels. In 2007 IEEE Computer Society Conference on Computer Vision and Pattern Recognition. IEEE Computer Society, 2007.

[16] C. Helma, T. Cramer, S. Kramer, and L. D. Raedt. Data mining and machine learning techniques for the identification of mutagenicity inducing substructures and structure activity relationships of noncongeneric compounds. J. Chem. Inf. Comput. Sci., 44:1402–1411, 2004.

[17] T. Horvath, T. Gärtner, and S. Wrobel. Cyclic pattern kernels for predictive graph mining. In Proceedings of the 10th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pages 158–167, 2004.

[18] A. Inokuchi. Mining generalized substructures from a set of labeled graphs. In Proceedings of the 4th IEEE International Conference on Data Mining, pages 415–418. IEEE Computer Society, 2005.

[19] Y. Karklin, R. F. Meraz, and S. R. Holbrook. Classification of non-coding RNA using graph representations of secondary structure. In Pacific Symposium on Biocomputing, pages 4–15, 2005.

[20] H. Kashima, T. Kato, Y. Yamanishi, M. Sugiyama, and K. Tsuda. Link propagation: A fast semi-supervised learning algorithm for link prediction. In 2009 SIAM Conference on Data Mining, pages 1100–1111, 2009.

[21] H. Kashima, K. Tsuda, and A. Inokuchi. Marginalized kernels between labeled graphs. In Proceedings of the 21st International Conference on Machine Learning, pages 321–328. AAAI Press, 2003.

[22] T. Kato, H. Kashima, and M. Sugiyama. Robust label propagation on multiple networks. IEEE Trans. Neural Networks, 20(1):35–44, 2008.

[23] J. Kazius, S. Nijssen, J. Kok, T. Bäck, and A. P. Ijzerman. Substructure mining using elaborate chemical representation. J. Chem. Inf. Model., 46:597–605, 2006.

[24] R. Kohavi and G. H. John. Wrappers for feature subset selection. Artificial Intelligence, 1-2:273–324, 1997.

[25] R. I. Kondor and J. Lafferty. Diffusion kernels on graphs and other discrete input. In ICML 2002, 2002.

[26] T. Kudo, E. Maeda, and Y. Matsumoto. An application of boosting to graph classification. In Advances in Neural Information Processing Systems 17, pages 729–736. MIT Press, 2005.

[27] D. G. Luenberger. Optimization by Vector Space Methods. Wiley, 1969.

[28] P. Mahé, N. Ueda, T. Akutsu, J.-L. Perret, and J.-P. Vert. Graph kernels for molecular structure-activity relationship analysis with support vector machines. J. Chem. Inf. Model., 45:939–951, 2005.

[29] P. Mahé and J.-P. Vert. Graph kernels based on tree patterns for molecules. Machine Learning, 75:3–35, 2009.

[30] S. Morishita. Computing optimal hypotheses efficiently for boosting. In Discovery Science, pages 471–481, 2001.

[31] S. Morishita and J. Sese. Traversing itemset lattices with statistical metric pruning. In Proceedings of the ACM SIGACT-SIGMOD-SIGART Symposium on Database Systems (PODS), pages 226–236, 2000.

[32] S. Mostafavi, D. Ray, D. Warde-Farley, C. Grouios, and Q. Morris. GeneMANIA: a real-time multiple association network integration algorithm for predicting gene function. Genome Biology, 9(Suppl. 1):S4, 2008.

[33] S. Nijssen and J. N. Kok. A quickstart in frequent structure mining can make a difference. In Proceedings of the 10th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pages 647–652. ACM Press, 2004.

[34] S. Nowozin, K. Tsuda, T. Uno, T. Kudo, and G. Bakir. Weighted substructure mining for image analysis. In IEEE Computer Society Conference on Computer Vision and Pattern Recognition (CVPR). IEEE Computer Society, 2007.

[35] J. Pei, J. Han, B. Mortazavi-Asl, J. Wang, H. Pinto, Q. Chen, U. Dayal, and M. Hsu. Mining sequential patterns by pattern-growth: The PrefixSpan approach. IEEE Transactions on Knowledge and Data Engineering, 16(11):1424–1440, 2004.

[36] G. Rätsch, S. Mika, B. Schölkopf, and K.-R. Müller. Constructing boosting algorithms from SVMs: an application to one-class classification. IEEE Trans. Patt. Anal. Mach. Intell., 24(9):1184–1199, 2002.

[37] R. Rosipal and N. Krämer. Overview and recent advances in partial least squares. In Subspace, Latent Structure and Feature Selection Techniques, pages 34–51. Springer, 2006.

[38] W. J. Rugh. Linear System Theory. Prentice Hall, 1995.

[39] H. Saigo, N. Krämer, and K. Tsuda. Partial least squares regression for graph mining. In Proceedings of the 14th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pages 578–586, 2008.

[40] H. Saigo, S. Nowozin, T. Kadowaki, T. Kudo, and K. Tsuda. gBoost: A mathematical programming approach to graph classification and regression. Machine Learning, 2008.

[41] A. Sanfeliu and K. S. Fu. A distance measure between attributed relational graphs for pattern recognition. IEEE Trans. Syst. Man Cybern., 13:353–362, 1983.

[42] B. Schölkopf and A. J. Smola. Learning with Kernels: Support Vector Machines, Regularization, Optimization, and Beyond. MIT Press, 2002.

[43] H. Shin, A. M. Lisewski, and O. Lichtarge. Graph sharpening plus graph integration: a synergy that improves protein functional classification. Bioinformatics, 23:3217–3224, 2007.

[44] A. Subramanya and J. Bilmes. Soft-supervised learning for text classification. In Proceedings of the 2008 Conference on Empirical Methods in Natural Language Processing, pages 1090–1099, 2008.

[45] K. Tsuda. Entire regularization paths for graph data. In Proceedings of the 24th International Conference on Machine Learning, pages 919–926, 2007.

[46] K. Tsuda and T. Kudo. Clustering graphs by weighted substructure mining. In Proceedings of the 23rd International Conference on Machine Learning, pages 953–960. ACM Press, 2006.

[47] K. Tsuda and K. Kurihara. Graph mining with variational Dirichlet process mixture models. In SIAM Conference on Data Mining (SDM), 2008.

[48] K. Tsuda and W. S. Noble. Learning kernels from biological networks by maximizing entropy. Bioinformatics, 20(Suppl. 1):i326–i333, 2004.

[49] K. Tsuda, H. J. Shin, and B. Schölkopf. Fast protein classification with multiple networks. Bioinformatics, 21(Suppl. 2):ii59–ii65, 2005.

[50] S. V. N. Vishwanathan, K. M. Borgwardt, and N. N. Schraudolph. Fast computation of graph kernels. In Advances in Neural Information Processing Systems 19, Cambridge, MA, 2006. MIT Press.

[51] N. Wale and G. Karypis. Comparison of descriptor spaces for chemical compound retrieval and classification. In Proceedings of the 2006 IEEE International Conference on Data Mining, pages 678–689, 2006.

[52] X. Yan and J. Han. gSpan: Graph-based substructure pattern mining. In Proceedings of the 2002 IEEE International Conference on Data Mining, pages 721–724. IEEE Computer Society, 2002.

[53] D. Zhou, O. Bousquet, J. Weston, and B. Schölkopf. Learning with local and global consistency. In Advances in Neural Information Processing Systems (NIPS) 16, pages 321–328. MIT Press, 2004.

[54] X. Zhu, Z. Ghahramani, and J. Lafferty. Semi-supervised learning using Gaussian fields and harmonic functions. In Proc. of the Twentieth International Conference on Machine Learning (ICML), pages 912–919. AAAI Press, 2003.

