
Overlapping correlation clustering

Description:
Overlapping clustering, where a data point can be assigned to more than one cluster, is desirable in various applications, such as bioinformatics, information retrieval, and social network analysis. In this paper we generalize the framework of correlation clustering to deal with overlapping clusters. In short, we formulate an optimization problem in which each point in the dataset is mapped to a small set of labels, representing membership in different clusters. The number of labels does not have to be the same for all data points. The objective is to find a mapping so that the distances between points in the dataset agree as much as possible with distances taken over their sets of labels. For defining distances between sets of labels, we consider two measures: set-intersection indicator and the Jaccard coefficient. To solve the problem we propose a local-search algorithm. Iterative improvement within our algorithm gives rise to non-trivial optimization problems, which, for the measures of set intersection and Jaccard, we solve using a greedy method and non-negative least squares, respectively.
Transcript
Page 1: Overlapping correlation clustering

overlapping correlation clustering

francesco bonchi
aris gionis

antti ukkonen

yahoo! research barcelona

Monday, September 26, 2011

Page 2: Overlapping correlation clustering

overlapping clusters are very natural

- social networks

- proteins

- documents


Page 3: Overlapping correlation clustering

most clustering algorithms produce disjoint partitions


Page 4: Overlapping correlation clustering

overlapping is conceptually challenging to formulate

- why assign a point to a further center?

- why/how to generate less good clusters?


Page 5: Overlapping correlation clustering

correlation clustering


$C_{cc}(\ell) = \sum_{(u,v)} |s(u, v) - I(\ell(u) = \ell(v))|$
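As a rough illustration of this objective (not the authors' code), the following sketch evaluates the correlation-clustering cost for a given single-label assignment and a dictionary of pairwise similarities; the data structures are assumptions made only for the example.

from itertools import combinations

def cc_cost(labels, sim):
    # Correlation-clustering cost: sum over pairs of |s(u, v) - I(l(u) == l(v))|.
    # labels: dict mapping each object to one cluster label.
    # sim: dict mapping a pair (u, v) to a similarity in [0, 1].
    cost = 0.0
    for u, v in combinations(sorted(labels), 2):
        s = sim.get((u, v), sim.get((v, u), 0.0))
        agree = 1.0 if labels[u] == labels[v] else 0.0
        cost += abs(s - agree)
    return cost

# toy check: a labeling that matches the similarities perfectly has cost 0
labels = {"a": 1, "b": 1, "c": 2}
sim = {("a", "b"): 1.0, ("a", "c"): 0.0, ("b", "c"): 0.0}
print(cc_cost(labels, sim))  # 0.0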


Page 6: Overlapping correlation clustering


Page 7: Overlapping correlation clustering


1


Page 8: Overlapping correlation clustering


0


Page 9: Overlapping correlation clustering


0.33???


Page 10: Overlapping correlation clustering


0.33


Page 11: Overlapping correlation clustering


0.33

multiple labels = multi-cluster assignment


Page 12: Overlapping correlation clustering


0.5


Page 13: Overlapping correlation clustering


0.67


Page 14: Overlapping correlation clustering


54167/108301???


Page 15: Overlapping correlation clustering

overlapping correlation clustering


$C_{occ}(\ell) = \sum_{(u,v)} |s(u, v) - H(\ell(u), \ell(v))|$


Page 16: Overlapping correlation clustering

comparing sets of labels

- Jaccard coefficient

- set intersection indicator


$H(\ell(u), \ell(v))$
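A minimal sketch of the two choices of H compared on this slide, assuming label sets are Python sets; treating two empty label sets as having similarity 1 is a convention chosen for the example, not taken from the paper.

def jaccard(a, b):
    # Jaccard coefficient |a & b| / |a | b| between two label sets.
    a, b = set(a), set(b)
    if not a and not b:
        return 1.0  # convention for the degenerate case
    return len(a & b) / len(a | b)

def set_intersection_indicator(a, b):
    # 1 if the two label sets share at least one label, 0 otherwise.
    return 1.0 if set(a) & set(b) else 0.0

print(jaccard({1}, {1, 2, 3}))               # 0.333...
print(set_intersection_indicator({1}, {4}))  # 0.0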


Page 17: Overlapping correlation clustering

correlation clustering


overlapping correlation clustering


Page 18: Overlapping correlation clustering

correlation clustering


set of labels L, |L| = k

overlapping correlation clustering


Page 19: Overlapping correlation clustering

correlation clustering


set of labels L, |L| = k

overlapping correlation clustering

$\ell(u) \in L$ (correlation clustering) vs. $\ell(u) \subseteq L$ (overlapping)


Page 20: Overlapping correlation clustering

correlation clustering


set of labels L, |L| = k

overlapping correlation clustering

$\ell(u) \in L$ (correlation clustering) vs. $\ell(u) \subseteq L$ (overlapping)

$C(\ell) = \sum_{(u,v)} |s(u, v) - H(\ell(u), \ell(v))|$


Page 21: Overlapping correlation clustering

correlation clustering


set of labels L, |L| = k

overlapping correlation clustering

$\ell(u) \in L$ (correlation clustering) vs. $\ell(u) \subseteq L$ (overlapping)

$|\ell(u)| \leq p$

$C(\ell) = \sum_{(u,v)} |s(u, v) - H(\ell(u), \ell(v))|$


Page 22: Overlapping correlation clustering


dimensionality reduction

- mapping to sets instead of vectors


Page 23: Overlapping correlation clustering



Page 29: Overlapping correlation clustering


objects u, v, x, y


Page 30: Overlapping correlation clustering


objects u, v, x, y; pairs (u, v) and (x, y)


Page 31: Overlapping correlation clustering


IV. ALGORITHMS

We propose a local-search algorithm that optimizes the labels of one object in the dataset, when the labels of all other objects are fixed. We apply this framework both for the Jaccard coefficient and the intersection-function variants of the problem, proposing novel algorithms for these two local optimization problems in Sec. IV-B and IV-C, respectively.

A. The local-search framework

A typical approach for multivariate optimization problems is to iteratively find the optimal value for one variable given values for the remaining variables. The global solution is found by repeatedly optimizing each of the variables in turn until the objective function value no longer improves. In most cases such a method will converge to a local optimum. The algorithm we propose falls into this framework. At the core of our algorithm is an efficient method for finding a good labeling of a single object given a fixed labeling of the other objects. We can guarantee that the value of Equation (2) is non-increasing with respect to such optimization steps. First, we observe that the cost of Equation (2) can be rewritten as

$C_{occ}(V, \ell) = \frac{1}{2} \sum_{v \in V} \sum_{u \in V \setminus \{v\}} |H(\ell(v), \ell(u)) - s(v, u)| = \frac{1}{2} \sum_{v \in V} C_{v,p}(\ell(v) \mid \ell),$

where

$C_{v,p}(\ell(v) \mid \ell) = \sum_{u \in V \setminus \{v\}} |H(\ell(v), \ell(u)) - s(v, u)| \qquad (3)$

expresses the error incurred by vertex v when it has the labels ℓ(v), and the remaining nodes are labeled according to ℓ. The subscript p in Cv,p serves to remind us that the set ℓ(v) should have at most p labels. Our general local-search strategy is summarized in Algorithm 1.

Algorithm 1 LocalSearch
1: initialize ℓ to a valid labeling;
2: while Cocc(V, ℓ) decreases do
3:   for each v ∈ V do
4:     find the label set L that minimizes Cv,p(L | ℓ);
5:     update ℓ so that ℓ(v) = L;
6: return ℓ

Line 4 is the step in which LocalSearch seeks to find an optimal set of labels for an object v by solving Equation (3). This is also the place where our framework differentiates between the measures of Jaccard coefficient and set-intersection.
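The following Python sketch mirrors the structure of Algorithm 1 under some simplifying assumptions: labels are small Python sets, similarities come from a dictionary, and the local step of line 4 is approximated by scoring a caller-supplied list of candidate label sets rather than by the NNLS or greedy procedures developed below; candidate_label_sets is a hypothetical helper, not part of the paper.

def local_search(objects, sim, candidate_label_sets, H, max_iters=100):
    # Coordinate-descent loop of Algorithm 1: repeatedly re-optimize the label
    # set of one object while all other label sets are kept fixed.
    labeling = {v: {0} for v in objects}  # some valid initial labeling

    def s(u, v):
        return sim.get((u, v), sim.get((v, u), 0.0))

    def local_cost(v, label_set):
        # C_{v,p}(label_set | labeling), Eq. (3): error of v against all others
        return sum(abs(H(label_set, labeling[u]) - s(v, u))
                   for u in objects if u != v)

    def total_cost():
        # Eq. (2) rewritten as half the sum of the per-object costs
        return 0.5 * sum(local_cost(v, labeling[v]) for v in objects)

    best = total_cost()
    for _ in range(max_iters):
        for v in objects:
            labeling[v] = min(candidate_label_sets(v),
                              key=lambda L: local_cost(v, L))
        cost = total_cost()
        if cost >= best:  # objective no longer decreases: stop
            break
        best = cost
    return labeling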

B. Local step for Jaccard coefficient

Problem 3 (JACCARD-TRIANGULATION): Consider the set {⟨Sj, zj⟩}j=1...n, where Sj are subsets of a ground set U = {1, . . . , k}, and zj are fractional numbers in the interval [0, 1]. The task is to find a set X ⊆ U that minimizes the distance

$d(X, \{\langle S_j, z_j \rangle\}_{j=1 \ldots n}) = \sum_{j=1}^{n} |J(X, S_j) - z_j|. \qquad (4)$

The intuition behind Equation (4) is that we are given sets Sj and "target similarities" zj, and we want to find a set whose Jaccard coefficient with each set Sj is as close as possible to the target similarity zj. A moment's thought can convince us that Equation (4) corresponds exactly to the error term Cv,p(ℓ(v) | ℓ) defined by Equation (3), and thus to the problem solved in the local-improvement step of the LocalSearch algorithm.

To our knowledge, JACCARD-TRIANGULATION is a new and interesting problem, which has not been studied before, in particular in the context of overlapping clustering. The most-related problem that we are aware of is the problem of finding the Jaccard median, which was recently studied by Chierichetti et al. [5]. The Jaccard-median problem is a special case of the JACCARD-TRIANGULATION problem, where all similarities zj are equal to 1. Chierichetti et al. provide a PTAS for the Jaccard-median problem. However, their techniques seem mostly of theoretical interest, and do not extend beyond the special case where all zj = 1.

However, since JACCARD-TRIANGULATION is a generalization of the Jaccard-median problem, which has been proven NP-hard [5], the following is immediate.

Theorem 2: JACCARD-TRIANGULATION is NP-hard.

We next proceed to discuss our proposed algorithm for the JACCARD-TRIANGULATION problem. The idea is to introduce a variable xi for every element i ∈ U. The variable xi indicates if element i belongs in the solution set X. In particular, xi = 1 if i ∈ X and xi = 0 otherwise. We then assume that the size of set X is t, that is,

$\sum_{i \in U} x_i - t = 0. \qquad (5)$

Now, given a set Sj with target similarity zj, we want to obtain J(X, Sj) = zj for all j = 1, . . . , n, or

$J(X, S_j) = \frac{\sum_{i \in S_j} x_i}{|S_j| + t - \sum_{i \in S_j} x_i} = z_j,$

which is equivalent to

$z_j t - (1 + z_j) \sum_{i \in S_j} x_i = -z_j |S_j|, \qquad (6)$

and we have one equation of type (6) for each pair ⟨Sj, zj⟩. We observe that Equations (5) and (6) are linear with respect to the unknowns xi and t. On the other hand, the variables xi and t take integral values, which implies that the system of Equations (5) and (6) cannot be solved efficiently. Instead we propose to relax the integrality constraints to non-negativity constraints xi, t ≥ 0 and solve the above system in the least-squares sense. Thus, we apply a non-negative least-squares optimization method (NNLS) and we obtain estimates for the variables xi and t.

The solution we obtain from the NNLS solver has two drawbacks: (i) it does not incorporate the constraint of having at most p labels, and (ii) most importantly, it does not have a clear interpretation as a set X, since the variables xi may take any non-negative value, not only 0–1. We address both of these problems with a greedy post-processing of the fractional solution.
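A sketch of this local step, assuming the label universe is {0, ..., k-1} and using scipy's non-negative least-squares solver; the final rounding (keep at most the p largest fractional values) is a simple stand-in for the greedy post-processing, whose exact form is not spelled out here.

import numpy as np
from scipy.optimize import nnls

def jaccard_local_step(sets, targets, k, p):
    # Build the linear system of Eqs. (5)-(6) in the unknowns (x_0..x_{k-1}, t)
    # and solve it in the least-squares sense under non-negativity constraints.
    rows, rhs = [], []
    rows.append(np.append(np.ones(k), -1.0))   # Eq. (5): sum_i x_i - t = 0
    rhs.append(0.0)
    for S, z in zip(sets, targets):            # one Eq. (6) per pair <S_j, z_j>
        row = np.zeros(k + 1)
        for i in S:
            row[i] = -(1.0 + z)
        row[k] = z
        rows.append(row)
        rhs.append(-z * len(S))
    x, _ = nnls(np.vstack(rows), np.array(rhs))
    # crude rounding: keep the (at most p) labels with the largest fractional value
    order = np.argsort(-x[:k])
    return {int(i) for i in order[:p] if x[i] > 0}

# example: two neighbours with label sets {0} and {1, 2} and target similarities
print(jaccard_local_step([{0}, {1, 2}], [1.0, 0.5], k=4, p=2))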


Page 32: Overlapping correlation clustering


Jaccard triangulation

given {⟨Sj, zj⟩}j=1...n

find X ⊆ U

to minimize $d(X, \{\langle S_j, z_j \rangle\}_{j=1 \ldots n}) = \sum_{j=1}^{n} |J(X, S_j) - z_j|$


Page 33: Overlapping correlation clustering

set-intersection indicator

hit-n-miss sets

approximation

greedy approach


O(√n log n)


Page 34: Overlapping correlation clustering

experimental evaluation


Page 35: Overlapping correlation clustering

EMOTION: 593 objects, 6 labels

YEAST: 2417 objects, 14 labels


Page 36: Overlapping correlation clustering


Fig. 1. Cost per edge, precision and recall of OCC-JACC on the EMOTION and YEAST datasets, as a function of k, the total number of distinct labels (top row), and as a function of p, the maximum number of labels per vertex (bottom row).


Fig. 2. Same as in Figure 1, but using OCC-ISECT.

In the basic form the input to our algorithms contains all |V| × |V| pairwise similarities. However, it turns out that there can be a lot of redundancy in this input. Often we can prune most of the pairwise comparisons with negligible loss in quality. This is an important characteristic, as it allows us to apply the algorithm also to larger data sets. Selecting the best set of edges to prune is an interesting problem in its own right. In this experiment we took the simple approach and prune edges at random: an edge is taken into consideration with probability q (denoted the pruning threshold) independently of the other edges. In Figure 3 we show edge-specific cost as well as precision and recall as a function of q for the OCC-JACC algorithm (the curves are again medians over 30 trials). Clearly, with these example data sets the pruning threshold can be set very low. Also, there is a noticeable "threshold effect" in the cost/edge that may serve as an indicator to find the pruning threshold in a setting where a ground truth is not available. This suggests that in practice it is not necessary to use all pairwise comparisons; a sample of the graph may be enough.
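A minimal sketch of the random pruning just described (an assumed helper, not the authors' code): each pair is retained independently with probability q, and only the retained pairs enter the objective.

import random

def sample_pairs(objects, q, seed=0):
    # Keep each unordered pair (u, v) independently with probability q.
    rng = random.Random(seed)
    kept = []
    for i, u in enumerate(objects):
        for v in objects[i + 1:]:
            if rng.random() < q:
                kept.append((u, v))
    return kept

print(len(sample_pairs(list(range(100)), q=0.05)))  # roughly 5% of the 4950 pairs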


Fig. 3. Pruning experiment using OCC-JACC. Cost/edge and precision and recall as a function of the pruning threshold q.

In fact, the results for YEAST shown in Figures 1 and 2 were computed with q = 0.05. In terms of computational speedup, pruning has a very positive effect. Using the full YEAST data (without pruning) our Python implementation takes 400 seconds to finish on a 1.8 GHz CPU; with pruning (q = 0.05) this can be reduced to 70 seconds.

VI. APPLICATIONS

Since our method uses pairwise similarities of objects as the input, it lends itself naturally to the task of clustering structured objects for which feature vectors can be difficult to obtain. In this section we discuss clusterings for two such data types: movement trajectories and protein sequences.

A. Overlapping Clustering of Trajectories

Spatio-temporal data mining is an active research area with a huge variety of application domains, e.g., mobility management, video surveillance, mobile social networks, atmosphere analysis, biology and zoology, just to mention a few. The basic entities in the analysis are usually trajectories of moving objects, i.e., sequences ⟨(x0, y0, t0), . . . , (xn, yn, tn)⟩ of observations in space and time. Trajectories are structured, complex objects, and mining them is often an involved task. They might have different lengths (i.e., different numbers of observations), and therefore viewing them simply as vectors and using standard distance functions (e.g., Euclidean) is not feasible. Moreover, the different nature of space and time implies different granularity and resolution issues. While there has been quite some research on clustering trajectories (e.g., [8], [9]), to the best of our knowledge the problem of overlapping clustering of trajectories has been largely left unexplored. This is quite surprising, as overlapping clustering intuitively seems very well suited for trajectories.

Consider, for instance, a well-shaped cluster Ci of trajectories of GPS-equipped cars going from a south-west suburb to the city center between 8 and 9 AM, and another cluster Cj moving inside the city center along a specific path, from 3 to 4 PM. Now consider a trajectory that moves from the south-west suburb to the city center in the morning, and within the city center in the afternoon: it is very natural for this trajectory to belong to both clusters Ci and Cj.

Developing overlapping clustering within our framework is straightforward: we just need to compute a distance (or similarity) for every trajectory pair in our input. For this purpose, we chose the EDR distance [10].



Page 38: Overlapping correlation clustering


protein clustering

- pairwise similarities based on matching of amino-acid sequences

- compare using a hand-made taxonomy


Page 39: Overlapping correlation clustering


TABLE II. Precision, recall, and their harmonic mean F-score, for non-overlapping clusterings of protein sequence datasets computed using SCPS [14] and the OCC algorithms. BL is the precision of a baseline that assigns all sequences to the same cluster.

dataset   BL prec   SCPS prec/recall/F-score   OCC-ISECT prec/recall/F-score   OCC-JACC prec/recall/F-score
D1        0.21      0.56 / 0.82 / 0.664        0.70 / 0.67 / 0.683             0.57 / 0.55 / 0.561
D2        0.17      0.59 / 0.89 / 0.708        0.86 / 0.83 / 0.844             0.64 / 0.63 / 0.637
D3        0.38      0.93 / 0.88 / 0.904        0.81 / 0.43 / 0.558             0.73 / 0.39 / 0.505
D4        0.14      0.30 / 0.64 / 0.408        0.64 / 0.56 / 0.598             0.44 / 0.39 / 0.412

Summarizing: C3 and C4 contain elks and deer that stay away from cattle (C3 moving at higher X than C4); C1 also contains only elks and deer, but those move in the higher-Y area where the cattle also move; C2 is the cattle cluster and it also contains a few elks and deer; finally, C5 is another mixed cluster which overlaps with C2 only for the cattle and with C1 for the elks and deer.

B. Overlapping Clustering of Protein Sequences

An important problem in genomics is the study of evolutionary relatedness of proteins using sequence data. We use our algorithms to cluster proteins into homologous groups given pairwise similarities of amino-acid sequences. Such similarities are computed by the sequence alignment tool BLAST [12]. We follow the approach of Paccanaro et al. [13] and Nepusz et al. [14], and compare the computed clustering against a ground truth given by SCOP, a manually crafted taxonomy of proteins [15]. The SCOP taxonomy is a tree with proteins at the leaf nodes. The ground-truth clusters used in the experiments are subsets of the leaves, i.e., proteins, rooted at different SCOP superfamilies. These are nodes on the 3rd level below the root.

We compare our algorithm with the SCPS algorithm [13], [14], a spectral method for clustering biological sequence data. The experiment is run using datasets 1–4 from [14]⁶, which contain pre-computed sequence similarities in the range [0, 1] (appropriately transformed BLAST E-values; please refer to [13] and [14] for details) for various subsets of the SCOP (ver. 1.75) proteins, together with the ground-truth clusterings.

To compare with the SCPS algorithm we first computed non-overlapping clusterings of all four datasets. All algorithms were given the correct number of clusters as a parameter. Results are shown in Table II. SCPS has a higher recall in every case, but with datasets 1, 2, and 4, the OCC-ISECT algorithm achieves a substantially higher precision. In practice this means that if OCC-ISECT assigns two sequences to the same cluster, they belong to the same cluster also in the ground truth with higher probability than when using SCPS.⁷
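For concreteness, here is one standard pairwise way to compute such precision and recall figures; the paper and [14] use their own definitions, so this is only an assumed illustration in which a pair counts as co-clustered when the two objects share at least one label.

from itertools import combinations

def pairwise_precision_recall(pred, truth):
    # pred, truth: dicts mapping each object to a set of cluster labels.
    # A pair is "co-clustered" when the two label sets intersect.
    tp = fp = fn = 0
    for u, v in combinations(sorted(pred), 2):
        same_pred = bool(pred[u] & pred[v])
        same_true = bool(truth[u] & truth[v])
        if same_pred and same_true:
            tp += 1
        elif same_pred:
            fp += 1
        elif same_true:
            fn += 1
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall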

⁶ The data are bundled with the SCPS application available at http://www.paccanarolab.org/software/scps/. D1 contains 669 sequences and 5 ground-truth clusters, D2 587 sequences and 6 clusters, D3 567 sequences and 5 clusters, and D4 contains 654 sequences and 8 ground-truth clusters.

⁷ Note that these numbers are not directly comparable with the ones in [14], as they define precision and recall in a slightly different way.

TABLE III. Comparing clustering cost based on distance on the SCOP taxonomy, for different values of p, the maximum number of labels per protein.

        SCPS    OCC-ISECT-p1   OCC-ISECT-p2   OCC-ISECT-p3
D1      0.231   0.196          0.194          0.193
D2      0.188   0.112          0.107          0.106
D3      0.215   0.214          0.214          0.231
D4      0.289   0.139          0.133          0.139

        SCPS    OCC-JACC-p1    OCC-JACC-p2    OCC-JACC-p3
D1      0.231   0.208          0.202          0.205
D2      0.188   0.137          0.130          0.127
D3      0.215   0.243          0.242          0.221
D4      0.289   0.158          0.141          0.152

We also conduct a more fine-grained analysis of the results using the SCOP taxonomy. In fact, different cluster "errors" should have a different cost, depending on the distance in the taxonomy. If we misplace a protein into two clusters that are extremely close in the taxonomy, the error should have a small cost. Following this intuition, we define the SCOP similarity between two proteins as follows:

$\mathrm{sim}(u, v) = \frac{d(\mathrm{lca}(u, v))}{\max(d(u), d(v)) - 1}, \qquad (8)$

where d(u) is the depth of a node in the tree (the root is at depth 0), and lca(u, v) denotes the lowest common ancestor of the nodes u and v. The above similarity has a value of zero if lca(u, v) is the root, and a value of one if lca(u, v) is the common parent of u and v. Based on sim(u, v) we define the cost of a clustering by paying 1 − sim(u, v) for two proteins that end up in the same cluster, and sim(u, v) for two proteins belonging to different clusters, as in Eq. (1) and similarly to Eq. (2) for the overlapping clusterings.
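A sketch of Equation (8), assuming the SCOP tree is given as a parent dictionary (each node mapped to its parent, the root mapped to None); this representation is an assumption made only for illustration.

def scop_similarity(u, v, parent):
    # sim(u, v) = depth(lca(u, v)) / (max(depth(u), depth(v)) - 1), Eq. (8)
    def ancestors(node):
        path = []
        while node is not None:
            path.append(node)
            node = parent[node]
        return path  # node, parent, ..., root

    def depth(node):
        return len(ancestors(node)) - 1  # the root has depth 0

    on_v_path = set(ancestors(v))
    lca = next(a for a in ancestors(u) if a in on_v_path)  # deepest common ancestor
    return depth(lca) / (max(depth(u), depth(v)) - 1)

# tiny taxonomy: root -> fam -> {p1, p2}, root -> q1
parent = {"root": None, "fam": "root", "p1": "fam", "p2": "fam", "q1": "root"}
print(scop_similarity("p1", "p2", parent))  # 1.0: lca is the common parent
print(scop_similarity("p1", "q1", parent))  # 0.0: lca is the root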

The results of Table III suggest that the OCC algorithms, thanks to the overlaps, find a clustering that is in better agreement with the SCOP taxonomy than are the clusterings found by SCPS. However, while allowing overlaps is beneficial, we do not observe a significant improvement as the node-specific constraint p is increased. Moreover, we observe that only a small number of proteins are assigned to multiple clusters. We conjecture that this is due to the similarities produced by BLAST, which imply very well defined clusters in most of the cases. Nevertheless, it is worth noting that our methods, regardless of the parameters used, do not find unnecessarily large overlaps when this is not dictated by the data.

VII. RELATED WORK

Correlation Clustering. The problem of CORRELATION-CLUSTERING was first defined by Bansal et al. [1]. In their definition, the input is a complete graph with positive and negative edges. The objective is to partition the nodes of the graph so as to minimize the number of positive edges that are cut and the number of negative edges that are not cut, corresponding to our problem instance (b, H, 1). This is an APX-hard optimization problem which has received a great deal of attention in the field of theoretical computer science [16], [17], [18], [19].

Ailon et al. [16] considered a variety of correlation clustering problems. They proposed an algorithm that achieves expected approximation ratio 5 if the weights obey the probability condition. If the weights Xij also obey the triangle inequality, then the algorithm achieves expected approximation ratio 2. Swamy [19] has applied semi-definite programming



Page 41: Overlapping correlation clustering


future work

- scaling up

- approximation algorithm

- jaccard triangulation

- more experimentation and applications


Page 42: Overlapping correlation clustering

thank you!


