
Overlapping correlation clustering

Description:
Overlapping clustering, where a data point can be assigned to more than one cluster, is desirable in various applications, such as bioinformatics, information retrieval, and social network analysis. In this paper we generalize the framework of correlation clustering to deal with overlapping clusters. In short, we formulate an optimization problem in which each point in the dataset is mapped to a small set of labels, representing membership in different clusters. The number of labels does not have to be the same for all data points. The objective is to find a mapping so that the distances between points in the dataset agree as much as possible with distances taken over their sets of labels. For defining distances between sets of labels, we consider two measures: set-intersection indicator and the Jaccard coefficient. To solve the problem we propose a local-search algorithm. Iterative improvement within our algorithm gives rise to non-trivial optimization problems, which, for the measures of set intersection and Jaccard, we solve using a greedy method and non-negative least squares, respectively.
Transcript
Page 1: Overlapping correlation clustering

overlapping correlation clustering

francesco bonchi
aris gionis

antti ukkonen

yahoo! research barcelona

Monday, September 26, 2011

Page 2: Overlapping correlation clustering

overlapping clusters are very natural

- social networks

- proteins

- documents


Page 3: Overlapping correlation clustering

most clustering algorithms produce disjoint partitions


Page 4: Overlapping correlation clustering

overlapping is conceptually challenging to formulate

- why assign a point to a further center?

- why/how to generate less good clusters?


Page 5: Overlapping correlation clustering

correlation clustering


$C_{cc}(\ell) = \sum_{(u,v)} |s(u, v) - I(\ell(u) = \ell(v))|$
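As a rough illustration of this objective (not the authors' code), the following sketch evaluates the correlation-clustering cost for a given single-label assignment and a dictionary of pairwise similarities; the data structures are assumptions made only for the example.

from itertools import combinations

def cc_cost(labels, sim):
    # Correlation-clustering cost: sum over pairs of |s(u, v) - I(l(u) == l(v))|.
    # labels: dict mapping each object to one cluster label.
    # sim: dict mapping a pair (u, v) to a similarity in [0, 1].
    cost = 0.0
    for u, v in combinations(sorted(labels), 2):
        s = sim.get((u, v), sim.get((v, u), 0.0))
        agree = 1.0 if labels[u] == labels[v] else 0.0
        cost += abs(s - agree)
    return cost

# toy check: a labeling that matches the similarities perfectly has cost 0
labels = {"a": 1, "b": 1, "c": 2}
sim = {("a", "b"): 1.0, ("a", "c"): 0.0, ("b", "c"): 0.0}
print(cc_cost(labels, sim))  # 0.0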


Page 6: Overlapping correlation clustering


Page 7: Overlapping correlation clustering


1


Page 8: Overlapping correlation clustering


0


Page 9: Overlapping correlation clustering


0.33???


Page 10: Overlapping correlation clustering


0.33


Page 11: Overlapping correlation clustering


0.33

multiple labels = multi-cluster assignment


Page 12: Overlapping correlation clustering


0.5


Page 13: Overlapping correlation clustering


0.67


Page 14: Overlapping correlation clustering


54167/108301???


Page 15: Overlapping correlation clustering

overlapping correlation clustering


$C_{occ}(\ell) = \sum_{(u,v)} |s(u, v) - H(\ell(u), \ell(v))|$


Page 16: Overlapping correlation clustering

comparing sets of labels

- Jaccard coefficient

- set intersection indicator


$H(\ell(u), \ell(v))$
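A minimal sketch of the two choices of H compared on this slide, assuming label sets are Python sets; treating two empty label sets as having similarity 1 is a convention chosen for the example, not taken from the paper.

def jaccard(a, b):
    # Jaccard coefficient |a & b| / |a | b| between two label sets.
    a, b = set(a), set(b)
    if not a and not b:
        return 1.0  # convention for the degenerate case
    return len(a & b) / len(a | b)

def set_intersection_indicator(a, b):
    # 1 if the two label sets share at least one label, 0 otherwise.
    return 1.0 if set(a) & set(b) else 0.0

print(jaccard({1}, {1, 2, 3}))               # 0.333...
print(set_intersection_indicator({1}, {4}))  # 0.0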


Page 17: Overlapping correlation clustering

correlation clustering


overlapping correlation clustering


Page 18: Overlapping correlation clustering

correlation clustering


set of labels L, |L| = k

overlapping correlation clustering


Page 19: Overlapping correlation clustering

correlation clustering


set of labels L, |L| = k

overlapping correlation clustering

$\ell(u) \in L$ (correlation clustering) vs. $\ell(u) \subseteq L$ (overlapping)


Page 20: Overlapping correlation clustering

correlation clustering


set of labels L, |L| = k

overlapping correlation clustering

$\ell(u) \in L$ (correlation clustering) vs. $\ell(u) \subseteq L$ (overlapping)

$C(\ell) = \sum_{(u,v)} |s(u, v) - H(\ell(u), \ell(v))|$


Page 21: Overlapping correlation clustering

correlation clustering


set of labels L, |L| = k

overlapping correlation clustering

$\ell(u) \in L$ (correlation clustering) vs. $\ell(u) \subseteq L$ (overlapping)

$|\ell(u)| \leq p$

$C(\ell) = \sum_{(u,v)} |s(u, v) - H(\ell(u), \ell(v))|$


Page 22: Overlapping correlation clustering


dimensionality reduction

- mapping to sets instead of vectors


Page 23: Overlapping correlation clustering



Page 29: Overlapping correlation clustering


objects u, v, x, y


Page 30: Overlapping correlation clustering


objects u, v, x, y; pairs (u, v) and (x, y)


Page 31: Overlapping correlation clustering


IV. ALGORITHMS

We propose a local-search algorithm that optimizes the labels of one object in the dataset, when the labels of all other objects are fixed. We apply this framework both for the Jaccard coefficient and the intersection-function variants of the problem, proposing novel algorithms for these two local optimization problems in Sec. IV-B and IV-C, respectively.

A. The local-search framework

A typical approach for multivariate optimization problems is to iteratively find the optimal value for one variable given values for the remaining variables. The global solution is found by repeatedly optimizing each of the variables in turn until the objective function value no longer improves. In most cases such a method will converge to a local optimum. The algorithm we propose falls into this framework. At the core of our algorithm is an efficient method for finding a good labeling of a single object given a fixed labeling of the other objects. We can guarantee that the value of Equation (2) is non-increasing with respect to such optimization steps. First, we observe that the cost of Equation (2) can be rewritten as

$C_{occ}(V, \ell) = \frac{1}{2} \sum_{v \in V} \sum_{u \in V \setminus \{v\}} |H(\ell(v), \ell(u)) - s(v, u)| = \frac{1}{2} \sum_{v \in V} C_{v,p}(\ell(v) \mid \ell),$

where

$C_{v,p}(\ell(v) \mid \ell) = \sum_{u \in V \setminus \{v\}} |H(\ell(v), \ell(u)) - s(v, u)| \qquad (3)$

expresses the error incurred by vertex v when it has the labels ℓ(v), and the remaining nodes are labeled according to ℓ. The subscript p in Cv,p serves to remind us that the set ℓ(v) should have at most p labels. Our general local-search strategy is summarized in Algorithm 1.

Algorithm 1 LocalSearch
1: initialize ℓ to a valid labeling;
2: while Cocc(V, ℓ) decreases do
3:   for each v ∈ V do
4:     find the label set L that minimizes Cv,p(L | ℓ);
5:     update ℓ so that ℓ(v) = L;
6: return ℓ

Line 4 is the step in which LocalSearch seeks to find an optimal set of labels for an object v by solving Equation (3). This is also the place where our framework differentiates between the measures of Jaccard coefficient and set-intersection.
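The following Python sketch mirrors the structure of Algorithm 1 under some simplifying assumptions: labels are small Python sets, similarities come from a dictionary, and the local step of line 4 is approximated by scoring a caller-supplied list of candidate label sets rather than by the NNLS or greedy procedures developed below; candidate_label_sets is a hypothetical helper, not part of the paper.

def local_search(objects, sim, candidate_label_sets, H, max_iters=100):
    # Coordinate-descent loop of Algorithm 1: repeatedly re-optimize the label
    # set of one object while all other label sets are kept fixed.
    labeling = {v: {0} for v in objects}  # some valid initial labeling

    def s(u, v):
        return sim.get((u, v), sim.get((v, u), 0.0))

    def local_cost(v, label_set):
        # C_{v,p}(label_set | labeling), Eq. (3): error of v against all others
        return sum(abs(H(label_set, labeling[u]) - s(v, u))
                   for u in objects if u != v)

    def total_cost():
        # Eq. (2) rewritten as half the sum of the per-object costs
        return 0.5 * sum(local_cost(v, labeling[v]) for v in objects)

    best = total_cost()
    for _ in range(max_iters):
        for v in objects:
            labeling[v] = min(candidate_label_sets(v),
                              key=lambda L: local_cost(v, L))
        cost = total_cost()
        if cost >= best:  # objective no longer decreases: stop
            break
        best = cost
    return labeling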

B. Local step for Jaccard coefficient

Problem 3 (JACCARD-TRIANGULATION): Consider the set {⟨Sj, zj⟩}j=1...n, where Sj are subsets of a ground set U = {1, . . . , k}, and zj are fractional numbers in the interval [0, 1]. The task is to find a set X ⊆ U that minimizes the distance

$d(X, \{\langle S_j, z_j \rangle\}_{j=1 \ldots n}) = \sum_{j=1}^{n} |J(X, S_j) - z_j|. \qquad (4)$

The intuition behind Equation (4) is that we are given sets Sj and "target similarities" zj, and we want to find a set whose Jaccard coefficient with each set Sj is as close as possible to the target similarity zj. A moment's thought can convince us that Equation (4) corresponds exactly to the error term Cv,p(ℓ(v) | ℓ) defined by Equation (3), and thus to the problem solved in the local-improvement step of the LocalSearch algorithm.

To our knowledge, JACCARD-TRIANGULATION is a new and interesting problem, which has not been studied before, in particular in the context of overlapping clustering. The most-related problem that we are aware of is the problem of finding the Jaccard median, which was recently studied by Chierichetti et al. [5]. The Jaccard-median problem is a special case of the JACCARD-TRIANGULATION problem, where all similarities zj are equal to 1. Chierichetti et al. provide a PTAS for the Jaccard-median problem. However, their techniques seem mostly of theoretical interest, and do not extend beyond the special case where all zj = 1.

However, since JACCARD-TRIANGULATION is a generalization of the Jaccard-median problem, which has been proven NP-hard [5], the following is immediate.

Theorem 2: JACCARD-TRIANGULATION is NP-hard.

We next proceed to discuss our proposed algorithm for the JACCARD-TRIANGULATION problem. The idea is to introduce a variable xi for every element i ∈ U. The variable xi indicates if element i belongs in the solution set X. In particular, xi = 1 if i ∈ X and xi = 0 otherwise. We then assume that the size of set X is t, that is,

$\sum_{i \in U} x_i - t = 0. \qquad (5)$

Now, given a set Sj with target similarity zj, we want to obtain J(X, Sj) = zj for all j = 1, . . . , n, or

$J(X, S_j) = \frac{\sum_{i \in S_j} x_i}{|S_j| + t - \sum_{i \in S_j} x_i} = z_j,$

which is equivalent to

$z_j t - (1 + z_j) \sum_{i \in S_j} x_i = -z_j |S_j|, \qquad (6)$

and we have one equation of type (6) for each pair ⟨Sj, zj⟩. We observe that Equations (5) and (6) are linear with respect to the unknowns xi and t. On the other hand, the variables xi and t take integral values, which implies that the system of Equations (5) and (6) cannot be solved efficiently. Instead we propose to relax the integrality constraints to non-negativity constraints xi, t ≥ 0 and solve the above system in the least-squares sense. Thus, we apply a non-negative least-squares optimization method (NNLS) and we obtain estimates for the variables xi and t.

The solution we obtain from the NNLS solver has two drawbacks: (i) it does not incorporate the constraint of having at most p labels, and (ii) most importantly, it does not have a clear interpretation as a set X, since the variables xi may take any non-negative value, not only 0–1. We address both of these problems with a greedy post-processing of the fractional solution.
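A sketch of this local step, assuming the label universe is {0, ..., k-1} and using scipy's non-negative least-squares solver; the final rounding (keep at most the p largest fractional values) is a simple stand-in for the greedy post-processing, whose exact form is not spelled out here.

import numpy as np
from scipy.optimize import nnls

def jaccard_local_step(sets, targets, k, p):
    # Build the linear system of Eqs. (5)-(6) in the unknowns (x_0..x_{k-1}, t)
    # and solve it in the least-squares sense under non-negativity constraints.
    rows, rhs = [], []
    rows.append(np.append(np.ones(k), -1.0))   # Eq. (5): sum_i x_i - t = 0
    rhs.append(0.0)
    for S, z in zip(sets, targets):            # one Eq. (6) per pair <S_j, z_j>
        row = np.zeros(k + 1)
        for i in S:
            row[i] = -(1.0 + z)
        row[k] = z
        rows.append(row)
        rhs.append(-z * len(S))
    x, _ = nnls(np.vstack(rows), np.array(rhs))
    # crude rounding: keep the (at most p) labels with the largest fractional value
    order = np.argsort(-x[:k])
    return {int(i) for i in order[:p] if x[i] > 0}

# example: two neighbours with label sets {0} and {1, 2} and target similarities
print(jaccard_local_step([{0}, {1, 2}], [1.0, 0.5], k=4, p=2))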


Page 32: Overlapping correlation clustering


Jaccard triangulation

given {⟨Sj, zj⟩}j=1...n

find X ⊆ U

to minimize $d(X, \{\langle S_j, z_j \rangle\}_{j=1 \ldots n}) = \sum_{j=1}^{n} |J(X, S_j) - z_j|$


Page 33: Overlapping correlation clustering

set-intersection indicator

hit-n-miss sets

approximation

greedy approach


O(√n log n)


Page 34: Overlapping correlation clustering

experimental evaluation


Page 35: Overlapping correlation clustering

EMOTION: 593 objects, 6 labels

YEAST: 2417 objects, 14 labels


Page 36: Overlapping correlation clustering


Fig. 1. Cost per edge, precision and recall of OCC-JACC on the EMOTION and YEAST datasets, as a function of k, the total number of distinct labels (top row), and as a function of p, the maximum number of labels per vertex (bottom row).


Fig. 2. Same as in Figure 1, but using OCC-ISECT.

In the basic form the input to our algorithms contains all |V| × |V| pairwise similarities. However, it turns out that there can be a lot of redundancy in this input. Often we can prune most of the pairwise comparisons with negligible loss in quality. This is an important characteristic, as it allows us to apply the algorithm also to larger data sets. Selecting the best set of edges to prune is an interesting problem in its own right. In this experiment we took the simple approach and prune edges at random: an edge is taken into consideration with probability q (denoted the pruning threshold) independently of the other edges. In Figure 3 we show edge-specific cost as well as precision and recall as a function of q for the OCC-JACC algorithm (the curves are again medians over 30 trials). Clearly, with these example data sets the pruning threshold can be set very low. Also, there is a noticeable "threshold effect" in the cost/edge that may serve as an indicator to find the pruning threshold in a setting where a ground truth is not available. This suggests that in practice it is not necessary to use all pairwise comparisons; a sample of the graph may be enough.
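A minimal sketch of the random pruning just described (an assumed helper, not the authors' code): each pair is retained independently with probability q, and only the retained pairs enter the objective.

import random

def sample_pairs(objects, q, seed=0):
    # Keep each unordered pair (u, v) independently with probability q.
    rng = random.Random(seed)
    kept = []
    for i, u in enumerate(objects):
        for v in objects[i + 1:]:
            if rng.random() < q:
                kept.append((u, v))
    return kept

print(len(sample_pairs(list(range(100)), q=0.05)))  # roughly 5% of the 4950 pairs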


Fig. 3. Pruning experiment using OCC-JACC. Cost/edge and precision and recall as a function of the pruning threshold q.

In fact, the results for YEAST shown in Figures 1 and 2 were computed with q = 0.05. In terms of computational speedup, pruning has a very positive effect. Using the full YEAST data (without pruning) our Python implementation takes 400 seconds to finish on a 1.8 GHz CPU; with pruning (q = 0.05) this can be reduced to 70 seconds.

VI. APPLICATIONS

Since our method uses pairwise similarities of objects as the input, it lends itself naturally to the task of clustering structured objects for which feature vectors can be difficult to obtain. In this section we discuss clusterings for two such data types: movement trajectories and protein sequences.

A. Overlapping Clustering of Trajectories

Spatio-temporal data mining is an active research area with a huge variety of application domains, e.g., mobility management, video surveillance, mobile social networks, atmosphere analysis, biology and zoology, just to mention a few. The basic entities in the analysis are usually trajectories of moving objects, i.e., sequences ⟨(x0, y0, t0), . . . , (xn, yn, tn)⟩ of observations in space and time. Trajectories are structured, complex objects, and mining them is often an involved task. They might have different lengths (i.e., different numbers of observations), and therefore viewing them simply as vectors and using standard distance functions (e.g., Euclidean) is not feasible. Moreover, the different nature of space and time implies different granularity and resolution issues. While there has been quite some research on clustering trajectories (e.g., [8], [9]), to the best of our knowledge the problem of overlapping clustering of trajectories has been largely left unexplored. This is quite surprising, as overlapping clustering intuitively seems very well suited for trajectories.

Consider, for instance, a well-shaped cluster Ci of trajectories of GPS-equipped cars going from a south-west suburb to the city center between 8 and 9 AM, and another cluster Cj moving inside the city center along a specific path, from 3 to 4 PM. Now consider a trajectory that moves from the south-west suburb to the city center in the morning, and within the city center in the afternoon: it is very natural for this trajectory to belong to both clusters Ci and Cj.

Developing overlapping clustering within our framework is straightforward: we just need to compute a distance (or similarity) for every trajectory pair in our input. For this purpose, we chose the EDR distance [10].



Page 38: Overlapping correlation clustering


protein clustering

- pairwise similarities based on matching of amino-acid sequences

- compare using a hand-made taxonomy


Page 39: Overlapping correlation clustering


TABLE II. Precision, recall, and their harmonic mean F-score, for non-overlapping clusterings of protein sequence datasets computed using SCPS [14] and the OCC algorithms. BL is the precision of a baseline that assigns all sequences to the same cluster.

dataset   BL prec   SCPS prec/recall/F-score   OCC-ISECT prec/recall/F-score   OCC-JACC prec/recall/F-score
D1        0.21      0.56 / 0.82 / 0.664        0.70 / 0.67 / 0.683             0.57 / 0.55 / 0.561
D2        0.17      0.59 / 0.89 / 0.708        0.86 / 0.83 / 0.844             0.64 / 0.63 / 0.637
D3        0.38      0.93 / 0.88 / 0.904        0.81 / 0.43 / 0.558             0.73 / 0.39 / 0.505
D4        0.14      0.30 / 0.64 / 0.408        0.64 / 0.56 / 0.598             0.44 / 0.39 / 0.412

Summarizing: C3 and C4 contain elks and deer that stay away from cattle (C3 moving at higher X than C4); C1 also contains only elks and deer, but those move in the higher-Y area where the cattle also move; C2 is the cattle cluster and it also contains a few elks and deer; finally, C5 is another mixed cluster which overlaps with C2 only for the cattle and with C1 for the elks and deer.

B. Overlapping Clustering of Protein Sequences

An important problem in genomics is the study of evolutionary relatedness of proteins using sequence data. We use our algorithms to cluster proteins into homologous groups given pairwise similarities of amino-acid sequences. Such similarities are computed by the sequence alignment tool BLAST [12]. We follow the approach of Paccanaro et al. [13] and Nepusz et al. [14], and compare the computed clustering against a ground truth given by SCOP, a manually crafted taxonomy of proteins [15]. The SCOP taxonomy is a tree with proteins at the leaf nodes. The ground-truth clusters used in the experiments are subsets of the leaves, i.e., proteins, rooted at different SCOP superfamilies. These are nodes on the 3rd level below the root.

We compare our algorithm with the SCPS algorithm [13], [14], a spectral method for clustering biological sequence data. The experiment is run using datasets 1–4 from [14]⁶, which contain pre-computed sequence similarities in the range [0, 1] (appropriately transformed BLAST E-values; please refer to [13] and [14] for details) for various subsets of the SCOP (ver. 1.75) proteins, together with the ground-truth clusterings.

To compare with the SCPS algorithm we first computed non-overlapping clusterings of all four datasets. All algorithms were given the correct number of clusters as a parameter. Results are shown in Table II. SCPS has a higher recall in every case, but with datasets 1, 2, and 4, the OCC-ISECT algorithm achieves a substantially higher precision. In practice this means that if OCC-ISECT assigns two sequences to the same cluster, they belong to the same cluster also in the ground truth with higher probability than when using SCPS.⁷
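For concreteness, here is one standard pairwise way to compute such precision and recall figures; the paper and [14] use their own definitions, so this is only an assumed illustration in which a pair counts as co-clustered when the two objects share at least one label.

from itertools import combinations

def pairwise_precision_recall(pred, truth):
    # pred, truth: dicts mapping each object to a set of cluster labels.
    # A pair is "co-clustered" when the two label sets intersect.
    tp = fp = fn = 0
    for u, v in combinations(sorted(pred), 2):
        same_pred = bool(pred[u] & pred[v])
        same_true = bool(truth[u] & truth[v])
        if same_pred and same_true:
            tp += 1
        elif same_pred:
            fp += 1
        elif same_true:
            fn += 1
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall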

⁶ The data are bundled with the SCPS application available at http://www.paccanarolab.org/software/scps/. D1 contains 669 sequences and 5 ground-truth clusters, D2 587 sequences and 6 clusters, D3 567 sequences and 5 clusters, and D4 contains 654 sequences and 8 ground-truth clusters.

⁷ Note that these numbers are not directly comparable with the ones in [14], as they define precision and recall in a slightly different way.

TABLE III. Comparing clustering cost based on distance on the SCOP taxonomy, for different values of p, the maximum number of labels per protein.

        SCPS    OCC-ISECT-p1   OCC-ISECT-p2   OCC-ISECT-p3
D1      0.231   0.196          0.194          0.193
D2      0.188   0.112          0.107          0.106
D3      0.215   0.214          0.214          0.231
D4      0.289   0.139          0.133          0.139

        SCPS    OCC-JACC-p1    OCC-JACC-p2    OCC-JACC-p3
D1      0.231   0.208          0.202          0.205
D2      0.188   0.137          0.130          0.127
D3      0.215   0.243          0.242          0.221
D4      0.289   0.158          0.141          0.152

We also conduct a more fine-grained analysis of the results using the SCOP taxonomy. In fact, different cluster "errors" should have a different cost, depending on the distance in the taxonomy. If we misplace a protein into two clusters that are extremely close in the taxonomy, the error should have a small cost. Following this intuition, we define the SCOP similarity between two proteins as follows:

$\mathrm{sim}(u, v) = \frac{d(\mathrm{lca}(u, v))}{\max(d(u), d(v)) - 1}, \qquad (8)$

where d(u) is the depth of a node in the tree (the root is at depth 0), and lca(u, v) denotes the lowest common ancestor of the nodes u and v. The above similarity has a value of zero if lca(u, v) is the root, and a value of one if lca(u, v) is the common parent of u and v. Based on sim(u, v) we define the cost of a clustering by paying 1 − sim(u, v) for two proteins that end up in the same cluster, and sim(u, v) for two proteins belonging to different clusters, as in Eq. (1) and similarly to Eq. (2) for the overlapping clusterings.
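A sketch of Equation (8), assuming the SCOP tree is given as a parent dictionary (each node mapped to its parent, the root mapped to None); this representation is an assumption made only for illustration.

def scop_similarity(u, v, parent):
    # sim(u, v) = depth(lca(u, v)) / (max(depth(u), depth(v)) - 1), Eq. (8)
    def ancestors(node):
        path = []
        while node is not None:
            path.append(node)
            node = parent[node]
        return path  # node, parent, ..., root

    def depth(node):
        return len(ancestors(node)) - 1  # the root has depth 0

    on_v_path = set(ancestors(v))
    lca = next(a for a in ancestors(u) if a in on_v_path)  # deepest common ancestor
    return depth(lca) / (max(depth(u), depth(v)) - 1)

# tiny taxonomy: root -> fam -> {p1, p2}, root -> q1
parent = {"root": None, "fam": "root", "p1": "fam", "p2": "fam", "q1": "root"}
print(scop_similarity("p1", "p2", parent))  # 1.0: lca is the common parent
print(scop_similarity("p1", "q1", parent))  # 0.0: lca is the root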

The results of Table III suggest that the OCC algorithms, thanks to the overlaps, find a clustering that is in better agreement with the SCOP taxonomy than are the clusterings found by SCPS. However, while allowing overlaps is beneficial, we do not observe a significant improvement as the node-specific constraint p is increased. Moreover, we observe that only a small number of proteins are assigned to multiple clusters. We conjecture that this is due to the similarities produced by BLAST, which imply very well defined clusters in most of the cases. Nevertheless, it is worth noting that our methods, regardless of the parameters used, do not find unnecessarily large overlaps when this is not dictated by the data.

VII. RELATED WORK

Correlation Clustering. The problem of CORRELATION-CLUSTERING was first defined by Bansal et al. [1]. In their definition, the input is a complete graph with positive and negative edges. The objective is to partition the nodes of the graph so as to minimize the number of positive edges that are cut and the number of negative edges that are not cut, corresponding to our problem instance (b, H, 1). This is an APX-hard optimization problem which has received a great deal of attention in the field of theoretical computer science [16], [17], [18], [19].

Ailon et al. [16] considered a variety of correlation clustering problems. They proposed an algorithm that achieves expected approximation ratio 5 if the weights obey the probability condition. If the weights Xij also obey the triangle inequality, then the algorithm achieves expected approximation ratio 2. Swamy [19] has applied semi-definite programming



Page 41: Overlapping correlation clustering


future work

- scaling up

- approximation algorithm

- jaccard triangulation

- more experimentation and applications


Page 42: Overlapping correlation clustering

thank you!


