
On a Theory of Similarity functions for Learning and Clustering

Transcript
Page 1: On a Theory of Similarity functions for Learning and Clustering

On a Theory of Similarity functions for Learning and Clustering

Avrim Blum, Carnegie Mellon University

This talk is based on joint work with Nina Balcan, Nati Srebro, and Santosh Vempala.

Theory and Practice of Computational Learning, 2009

Page 2: On a Theory of Similarity functions for Learning and Clustering

2-minute version

• Suppose we are given a set of images and want to learn a rule to distinguish men from women. Problem: the pixel representation is not so good.

• A powerful technique for such settings is to use a kernel: a special kind of pairwise similarity function K( , ).

• But the theory is in terms of implicit mappings.

Q: Can we develop a theory that just views K as a measure of similarity? Develop a more general and intuitive theory of when K is useful for learning?

Page 3: On a Theory of Similarity functions for Learning and Clustering

2-minute version

• Suppose we are given a set of images and want to learn a rule to distinguish men from women. Problem: the pixel representation is not so good.

• A powerful technique for such settings is to use a kernel: a special kind of pairwise similarity function K( , ).

• But the theory is in terms of implicit mappings.

Q: What if we only have unlabeled data (i.e., clustering)? Can we develop a theory of properties that are sufficient to be able to cluster well?

Page 4: On a Theory of Similarity functions for Learning and Clustering

2-minute version

• Suppose we are given a set of images and want to learn a rule to distinguish men from women. Problem: the pixel representation is not so good.

• A powerful technique for such settings is to use a kernel: a special kind of pairwise similarity function K( , ).

• But the theory is in terms of implicit mappings.

Develop a kind of PAC model for clustering.

Page 5: On a Theory of Similarity functions for Learning and Clustering

Part 1: On similarity functions for learning

Page 6: On a Theory of Similarity functions for Learning and Clustering

Theme of this part

• Theory of natural sufficient conditions for similarity functions to be useful for classification learning problems. Doesn't require PSD or implicit spaces, but includes the notion of a large-margin kernel.

At a formal level, can even allow you to learn more: one can define classes of functions that have no large-margin kernel (even allowing substantial hinge loss) but that do have a good similarity function under this notion.

Page 7: On a Theory of Similarity functions for Learning and Clustering

Kernels

• We have a lot of great algorithms for learning linear separators (perceptron, SVM, …). But a lot of the time, data is not linearly separable.
– "Old" answer: use a multi-layer neural network.
– "New" answer: use a kernel function!

• Many algorithms only interact with the data via dot-products.
– So, let's just re-define the dot-product.
– E.g., K(x,y) = (1 + x·y)^d.

• K(x,y) = φ(x)·φ(y), where φ() is an implicit mapping into an n^d-dimensional space.
– The algorithm acts as if the data is in "φ-space". Allows it to produce a non-linear curve in the original space.

[Figure: + and – points separated by a non-linear curve in the original space.]

Page 8: On a Theory of Similarity functions for Learning and Clustering

Example

E.g., for n=2, d=2, the kernel K(x,y) = (x·y)^d corresponds to the mapping from the original (x1, x2) space to the (z1, z2, z3) φ-space.

[Figure: X and O points that are not linearly separable in the original space become linearly separable in φ-space.]
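A minimal sketch (my illustration, not code from the talk) of this example: for n=2, d=2 the kernel K(x,y) = (x·y)^2 agrees with an ordinary dot product after the explicit map φ(x) = (x1², x2², √2·x1·x2); the function names are mine.

import numpy as np

def poly_kernel(x, y, d=2):
    # kernel computed directly in the original 2-d space
    return np.dot(x, y) ** d

def phi(x):
    # explicit feature map corresponding to the d=2 polynomial kernel
    return np.array([x[0] ** 2, x[1] ** 2, np.sqrt(2) * x[0] * x[1]])

x = np.array([1.0, 2.0])
y = np.array([3.0, -1.0])
print(poly_kernel(x, y))           # (1*3 + 2*(-1))^2 = 1.0
print(np.dot(phi(x), phi(y)))      # same value (up to floating point) via the mapping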

Page 9: On a Theory of Similarity functions for Learning and Clustering

Moreover, generalizes well if good margin

• If data is linearly separable by a large margin γ in φ-space, then good sample complexity. (Assume |φ(x)| ≤ 1.)

[Figure: + and – points separated by a margin γ in φ-space.]

If margin γ in φ-space, then need sample size of only Õ(1/γ²) to get confidence in generalization.

• Kernels useful in practice for dealing with many, many different kinds of data.

[no dependence on dimension]

Page 10: On a Theory of Similarity functions for Learning and Clustering

Limitations of the Current Theory

Existing theory: in terms of margins in implicit spaces.

In practice: kernels are constructed by viewing them as measures of similarity.

The kernel requirement rules out many natural similarity functions.

Not best for intuition.

Alternative, perhaps more general, theoretical explanation?

Page 11: On a Theory of Similarity functions for Learning and Clustering

A notion of a good similarity function that is:

1) In terms of natural, direct quantities.
• no implicit high-dimensional spaces
• no requirement that K(x,y) = φ(x)·φ(y)

2) Is broad: includes the usual notion of a good kernel (one that has a large-margin separator in φ-space).

3) Even formally allows you to do more.

If K satisfies the definition, K can be used to learn well.

[Figure: nested sets — good kernels ⊆ first attempt ⊆ main notion.]

[Balcan-Blum, ICML 2006] [Balcan-Blum-Srebro, MLJ 2008] [Balcan-Blum-Srebro, COLT 2008]

Page 12: On a Theory of Similarity functions for Learning and Clustering

A First Attempt

K is (ε,γ)-good for P if at least a 1-ε prob. mass of x satisfy:

  E_{y~P}[K(x,y) | l(y)=l(x)]  ≥  E_{y~P}[K(x,y) | l(y)≠l(x)] + γ

(average similarity to points of the same label ≥ average similarity to points of the opposite label, plus a gap γ)

P is a distribution over labeled examples (x, l(x)). Goal: output a classification rule good for P.

K is good if most x are on average more similar to points y of their own type than to points y of the other type.

Page 13: On a Theory of Similarity functions for Learning and Clustering

A First Attempt

Algorithm

• Draw sets S+, S- of positive and negative examples.

• Classify x based on average similarity to S+ versus to S-.

K is (ε,γ)-good for P if at least a 1-ε prob. mass of x satisfy:

  E_{y~P}[K(x,y) | l(y)=l(x)]  ≥  E_{y~P}[K(x,y) | l(y)≠l(x)] + γ

[Figure: a point x compared by its similarity values to sampled sets S+ and S–.]

Page 14: On a Theory of Similarity functions for Learning and Clustering

A First Attempt

K is (ε,γ)-good for P if at least a 1-ε prob. mass of x satisfy:

  E_{y~P}[K(x,y) | l(y)=l(x)]  ≥  E_{y~P}[K(x,y) | l(y)≠l(x)] + γ

Algorithm

• Draw sets S+, S- of positive and negative examples.

• Classify x based on average similarity to S+ versus to S-.

Theorem

If |S+| and |S-| are Θ((1/γ²) ln(1/(δε'))), then with probability ≥ 1-δ, error ≤ ε+ε'.
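A minimal sketch (my illustration, not the authors' code) of this algorithm: classify a new point by comparing its average similarity to the sampled positives S+ and negatives S-. The dot-product similarity and the toy samples below are assumptions chosen only for illustration.

import numpy as np

def classify(x, S_plus, S_minus, K):
    # +1 if x is on average more K-similar to S_plus than to S_minus
    avg_plus = np.mean([K(x, y) for y in S_plus])
    avg_minus = np.mean([K(x, y) for y in S_minus])
    return 1 if avg_plus >= avg_minus else -1

K = lambda a, b: float(np.dot(a, b))                     # illustrative similarity
S_plus = [np.array([1.0, 1.0]), np.array([0.9, 1.1])]
S_minus = [np.array([-1.0, -1.0]), np.array([-1.1, -0.9])]
print(classify(np.array([0.8, 0.7]), S_plus, S_minus, K))   # -> 1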

Page 15: On a Theory of Similarity functions for Learning and Clustering

A First Attempt: Not Broad Enough

E_{y~P}[K(x,y) | l(y)=l(x)]  ≥  E_{y~P}[K(x,y) | l(y)≠l(x)] + γ

The similarity function K(x,y) = x·y has a large-margin separator, but does not satisfy our definition: a typical + is more similar on average to the – points (½) than to the + points (½·1 + ½·(-½) = ¼).

[Figure: + points in two clumps and – points in between on the unit circle (30° angles marked); a + can be more similar to the typical – than to the typical +.]

Page 16: On a Theory of Similarity functions for Learning and Clustering

A First Attempt: Not Broad Enough

E_{y~P}[K(x,y) | l(y)=l(x)]  ≥  E_{y~P}[K(x,y) | l(y)≠l(x)] + γ

[Figure: the same example, with a highlighted region R of "reasonable" points.]

Broaden: ∃ a non-negligible R s.t. most x are on average more similar to y ∈ R of the same label than to y ∈ R of the other label.

[even if we do not know R in advance]

Page 17: On a Theory of Similarity functions for Learning and Clustering

Broader Definition

K is (ε, γ, τ)-good if ∃ a set R of "reasonable" y (allow probabilistic) s.t. a 1-ε fraction of x satisfy:

  E_{y~P}[K(x,y) | l(y)=l(x), R(y)]  ≥  E_{y~P}[K(x,y) | l(y)≠l(x), R(y)] + γ

  (technically hinge loss)

At least a τ prob. mass of reasonable positives & negatives.

Algorithm

• Draw S={y1, …, yd}, a set of landmarks. Re-represent the data:

  F: x → R^d,  F(x) = [K(x,y1), …, K(x,yd)].

• If enough landmarks (d = Õ(1/(γ²τ))), then with high prob. there exists a good L1 large-margin linear separator:

  w = [0, 0, 1/n+, 1/n+, 0, 0, 0, -1/n-, 0, 0]

Page 18: On a Theory of Similarity functions for Learning and Clustering

Broader Definition

K is (ε, γ, τ)-good if ∃ a set R of "reasonable" y (allow probabilistic) s.t. a 1-ε fraction of x satisfy:

  E_{y~P}[K(x,y) | l(y)=l(x), R(y)]  ≥  E_{y~P}[K(x,y) | l(y)≠l(x), R(y)] + γ

  (technically hinge loss)

At least a τ prob. mass of reasonable positives & negatives.

Algorithm

• Draw S={y1, …, yd}, a set of landmarks. Re-represent the data:

  F: x → R^d,  F(x) = [K(x,y1), …, K(x,yd)]

[Figure: X and O points mapped by F into the landmark space F(P).]

• Take a new set of labeled examples, project to this space, and run a good L1 linear separator alg. (e.g., Winnow, etc.).

Sample sizes: d_u = Õ(1/(γ²τ)) unlabeled landmarks, d_l = O((1/(γ²ε²_acc)) ln d_u) labeled examples.
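A minimal sketch (my illustration) of the two-stage procedure: map each point to its similarities to d unlabeled landmarks, then fit a sparse linear separator on labeled data in that space. The similarity function, data, and labels are toy assumptions, and scikit-learn's L1-penalized LogisticRegression is only a stand-in for "a good L1 linear separator alg (e.g., Winnow)".

import numpy as np
from sklearn.linear_model import LogisticRegression

def remap(X, landmarks, K):
    # F(x) = [K(x, y1), ..., K(x, yd)] for each row x of X
    return np.array([[K(x, y) for y in landmarks] for x in X])

rng = np.random.default_rng(0)
K = lambda a, b: float(np.dot(a, b))            # illustrative similarity
X = rng.normal(size=(200, 5))
labels = (X[:, 0] + X[:, 1] > 0).astype(int)    # toy labeling rule

landmarks = X[rng.choice(len(X), size=20, replace=False)]   # "unlabeled" sample
F = remap(X, landmarks, K)

clf = LogisticRegression(penalty="l1", solver="liblinear").fit(F, labels)
print("train accuracy:", clf.score(F, labels))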

Page 19: On a Theory of Similarity functions for Learning and Clustering

Kernels and Similarity Functions

Theorem

K is a good kernel ⇒ K is also a good similarity function (but γ gets squared): if K has margin γ in the implicit space, then for any ε, K is good in our sense with a gap of roughly γ².

[Figure: large-margin kernels ⊆ good similarity functions.]

Page 20: On a Theory of Similarity functions for Learning and Clustering

Kernels and Similarity Functions

Can also show a separation.

Theorem

There exist a class C and a distribution D s.t. ∃ a similarity function with large γ for all f in C, but no large-margin kernel function exists.

Theorem (from before)

K is a good kernel ⇒ K is also a good similarity function (but γ gets squared).

[Figure: large-margin kernels strictly contained in good similarity functions.]

Page 21: On a Theory of Similarity functions for Learning and Clustering

Kernels and Similarity Functions

Theorem

For any class C of pairwise uncorrelated functions, ∃ a similarity function good for all f in C, but no such good kernel function exists.

• In principle, should be able to learn from O(ε⁻¹ log(|C|/δ)) labeled examples.

• Claim 1: can define a generic (0, 1, 1/|C|)-good similarity function achieving this bound. (Assume D is not too concentrated.)

• Claim 2: there is no (ε,γ)-good kernel in hinge loss, even if ε = 1/2 and γ = 1/|C|^{1/2}. So, margin-based sample complexity is Ω(|C|).

Page 22: On a Theory of Similarity functions for Learning and Clustering

Learning with Multiple Similarity Functions

• Let K1, …, Kr be similarity functions s.t. some (unknown) convex combination of them is (ε,γ)-good.

Algorithm

• Draw S={y1, …, yd}, a set of landmarks. Concatenate features:

F(x) = [K1(x,y1), …,Kr(x,y1), …, K1(x,yd),…,Kr(x,yd)].

• Run same L1 optimization algorithm as before in this new feature space.
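A minimal sketch (my illustration) of the concatenation step: each point's similarities to the d landmarks are computed under every Ki and stacked into one feature vector, after which the same L1-based learner as above can be run. The two similarity functions below are arbitrary choices for illustration.

import numpy as np

def remap_multi(X, landmarks, Ks):
    # F(x) = [K1(x,y1), ..., Kr(x,y1), ..., K1(x,yd), ..., Kr(x,yd)]
    return np.array([[K(x, y) for y in landmarks for K in Ks] for x in X])

Ks = [lambda a, b: float(np.dot(a, b)),
      lambda a, b: float(np.exp(-np.sum((a - b) ** 2)))]

rng = np.random.default_rng(1)
X = rng.normal(size=(100, 3))
landmarks = X[:10]
F = remap_multi(X, landmarks, Ks)
print(F.shape)    # (100, 20): d = 10 landmarks times r = 2 similarity functions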

Page 23: On a Theory of Similarity functions for Learning and Clustering

Learning with Multiple Similarity Functions

• Let K1, …, Kr be similarity functions s.t. some (unknown) convex combination of them is (ε,γ)-good.

Guarantee: Whp the induced distribution F(P) in R^{2dr} has a separator of error ≤ ε + ε_acc at L1 margin at least γ/4.

Algorithm

• Draw S={y1, …, yd}, a set of landmarks. Concatenate features:

  F(x) = [K1(x,y1), …, Kr(x,y1), …, K1(x,yd), …, Kr(x,yd)].

Sample complexity only increases by a log(r) factor!

Page 24: On a Theory of Similarity functions for Learning and Clustering

Learning with Multiple Similarity Functions

• Let K1, …, Kr be similarity functions s.t. some (unknown) convex combination of them is (ε,γ)-good.

Guarantee: Whp the induced distribution F(P) in R^{2dr} has a separator of error ≤ ε + ε_acc at L1 margin at least γ/4.

Algorithm

• Draw S={y1, …, yd}, a set of landmarks. Concatenate features:

  F(x) = [K1(x,y1), …, Kr(x,y1), …, K1(x,yd), …, Kr(x,yd)].

Proof: imagine the mapping Fo(x) = [Ko(x,y1), …, Ko(x,yd)] for the good similarity function Ko = α1 K1 + … + αr Kr.

Consider wo = (w1, …, wd) of L1 norm 1, margin γ/4.

The vector w = (α1 w1, α2 w1, …, αr w1, …, α1 wd, α2 wd, …, αr wd) also has L1 norm 1 and satisfies w·F(x) = wo·Fo(x).

Page 25: On a Theory of Similarity functions for Learning and Clustering

• Because the property is defined in terms of L1, no change in margin!
– Only a log(r) penalty for concatenating feature spaces.
– If L2, the margin would drop by a factor r^{1/2}, giving an O(r) penalty in sample complexity.

• Algorithm is also very simple (just concatenate).

• Alternative algorithm: do joint optimization:
– solve for Ko = (α1 K1 + … + αr Kr) and a vector wo s.t. wo has good L1 margin in the space defined by Fo(x) = [Ko(x,y1), …, Ko(x,yd)]
– Bound also holds here since the capacity is only lower.
– But we don't know how to do this efficiently…

Learning with Multiple Similarity Functions

Page 26: On a Theory of Similarity functions for Learning and Clustering

Part 2: Can we use this angle to help think about clustering?

Page 27: On a Theory of Similarity functions for Learning and Clustering

• Given a set of documents or search results, cluster them by topic.

• Given a collection of protein sequences, cluster them by function.

• Given a set of images of people, cluster by who is in them.

• …

Clustering comes up in many places

Page 28: On a Theory of Similarity functions for Learning and Clustering

Can model clustering like this:

• Given a data set S of n objects. [news articles]

• There is some (unknown) "ground truth" clustering C1*, C2*, …, Ck*. [sports] [politics] …

• Goal: produce a hypothesis clustering C1, C2, …, Ck that matches the target as much as possible. [minimize # mistakes up to renumbering of indices]

Problem: no labeled data! But: we do have a measure of similarity…

Page 29: On a Theory of Similarity functions for Learning and Clustering

Can model clustering like this:

• Given a data set S of n objects. [news articles]

• There is some (unknown) "ground truth" clustering C1*, C2*, …, Ck*. [sports] [politics] …

• Goal: produce a hypothesis clustering C1, C2, …, Ck that matches the target as much as possible. [minimize # mistakes up to renumbering of indices]

Problem: no labeled data! But: we do have a measure of similarity…

What conditions on a similarity measure would be enough to allow one to cluster well?

Page 30: On a Theory of Similarity functions for Learning and Clustering

Contrast with a more standard approach to clustering analysis:

• View similarity/distance info as "ground truth".

• Analyze abilities of algorithms to achieve different optimization criteria (min-sum, k-means, k-median, …).

• Or, assume a generative model, like a mixture of Gaussians.

• Here, no generative assumptions. Instead: given data, how powerful a K do we need to be able to cluster it well?

What conditions on a similarity measure would be enough to allow one to cluster well?

Page 31: On a Theory of Similarity functions for Learning and Clustering

Here is a condition that trivially works. Suppose K has the property that:
• K(x,y) > 0 for all x,y such that C*(x) = C*(y).
• K(x,y) < 0 for all x,y such that C*(x) ≠ C*(y).

If we have such a K, then clustering is easy. Now, let's try to make this condition a little weaker…

What conditions on a similarity measure would be enough to allow one to cluster well?
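A minimal sketch (my illustration) of why this condition makes clustering easy: the target clusters are exactly the connected components of the graph that links x and y whenever K(x,y) > 0. The small similarity matrix below is a toy assumption.

import numpy as np
from scipy.sparse import csr_matrix
from scipy.sparse.csgraph import connected_components

def cluster_by_sign(S):
    # S: symmetric n x n similarity matrix; returns a cluster label per point
    adjacency = csr_matrix((S > 0).astype(int))
    _, cluster_labels = connected_components(adjacency, directed=False)
    return cluster_labels

S = np.array([[ 1.0,  0.5, -0.3],
              [ 0.5,  1.0, -0.2],
              [-0.3, -0.2,  1.0]])
print(cluster_by_sign(S))    # [0 0 1]: points 0 and 1 together, point 2 alone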

Page 32: On a Theory of Similarity functions for Learning and Clustering

What conditions on a similarity measure would be enough to allow one to cluster well?

Suppose K has the property that all x are more similar to all points y in their own cluster than to any y' in other clusters.

• Still a very strong condition. Problem: the same K can satisfy this for two very different clusterings of the same data!

[Figure: baseball, basketball, math, and physics documents — K is consistent both with the four topic clusters and with a coarser sports/science split.]

Page 33: On a Theory of Similarity functions for Learning and Clustering

Suppose K has the property that all x are more similar to all points y in their own cluster than to any y' in other clusters.

• Still a very strong condition. Problem: the same K can satisfy this for two very different clusterings of the same data!

What conditions on a similarity measure would be enough to allow one to cluster well?

[Figure: baseball, basketball, math, and physics documents — K is consistent both with the four topic clusters and with a coarser sports/science split.]

Page 34: On a Theory of Similarity functions for Learning and Clustering

Let's weaken our goals a bit…

• OK to produce a hierarchical clustering (tree) such that target clustering is apx some pruning of it.

– E.g., in case from last slide:

– Can view as saying “if any of these clusters is too broad, just click and I will split it for you”

• Or, OK to output a small # of clusterings such that at least one has low error (like list-decoding) but won’t talk about this one today.

[Figure: hierarchical tree — leaves baseball, basketball, math, physics; internal nodes sports and science; root all documents.]

Page 35: On a Theory of Similarity functions for Learning and Clustering

Then you can start getting somewhere….

1. "all x more similar to all y in their own cluster than to any y' from any other cluster"

is sufficient to get a hierarchical clustering such that the target is some pruning of the tree. (Kruskal's / single-linkage works.)
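A minimal sketch (my illustration) of condition 1 in action: run single-linkage agglomeration on dissimilarities derived from K, which builds a tree whose prunings include the target clustering when this strict separation property holds. The data and similarity below are toy assumptions.

import numpy as np
from scipy.cluster.hierarchy import linkage
from scipy.spatial.distance import cdist, squareform

def similarity_to_tree(S):
    # S: symmetric n x n similarity matrix; returns a single-linkage merge tree
    D = S.max() - S                       # convert similarity to dissimilarity
    np.fill_diagonal(D, 0.0)
    return linkage(squareform(D, checks=False), method="single")

rng = np.random.default_rng(2)
X = np.vstack([rng.normal(size=(5, 2)), rng.normal(size=(5, 2)) + 8.0])
S = np.exp(-cdist(X, X) ** 2)             # illustrative similarity function
tree = similarity_to_tree(S)
print(tree[-1])   # the last merge joins the two well-separated groups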

Page 36: On a Theory of Similarity functions for Learning and Clustering

Then you can start getting somewhere….

1. "all x more similar to all y in their own cluster than to any y' from any other cluster"

is sufficient to get a hierarchical clustering such that the target is some pruning of the tree. (Kruskal's / single-linkage works.)

2. Weaker condition: ground truth is "stable":

For all clusters C, C', for all A ⊆ C, A' ⊆ C': A and A' are not both more similar on average to each other than to the rest of their own clusters. [View K(x,y) as the attraction between x and y.]

(plus technical conditions at the boundary)

Sufficient to get a good tree using the average single-linkage algorithm.

Page 37: On a Theory of Similarity functions for Learning and Clustering


Analysis for slightly simpler version

Assume for all C, C', all A ⊂ C, A' ⊆ C', we have K(A, C−A) > K(A, A'), and say K is symmetric. [Here K(A, B) denotes Avg_{x∈A, y∈B}[K(x,y)].]

Algorithm: average single-linkage
• Like Kruskal, but at each step merge the pair of clusters whose average similarity is highest.

Analysis: (all clusters made are laminar wrt target)
• Failure iff we merge C1, C2 s.t. C1 ⊂ C, C2 ∩ C = ∅.

Page 38: On a Theory of Similarity functions for Learning and Clustering


Analysis for slightly simpler version

Assume for all C, C', all A ⊂ C, A' ⊆ C', we have K(A, C−A) > K(A, A'), and say K is symmetric. [Here K(A, B) denotes Avg_{x∈A, y∈B}[K(x,y)].]

Algorithm: average single-linkage
• Like Kruskal, but at each step merge the pair of clusters whose average similarity is highest.

Analysis: (all clusters made are laminar wrt target)
• Failure iff we merge C1, C2 s.t. C1 ⊂ C, C2 ∩ C = ∅.
• But there must exist C3 ⊂ C at least as similar to C1 as the average. Contradiction.

[Figure: clusters C1 and C3 inside C, with C2 outside.]
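A minimal sketch (my illustration) of the average single-linkage procedure analyzed here: repeatedly merge the pair of current clusters whose average pairwise similarity is highest, recording each merge to build the tree. The small similarity matrix is a toy assumption.

import numpy as np

def average_linkage(S):
    # S: symmetric n x n similarity matrix; returns the sequence of merged clusters
    clusters = [frozenset([i]) for i in range(len(S))]
    merges = []
    while len(clusters) > 1:
        best = None
        for a in range(len(clusters)):
            for b in range(a + 1, len(clusters)):
                avg = np.mean([S[i, j] for i in clusters[a] for j in clusters[b]])
                if best is None or avg > best[0]:
                    best = (avg, a, b)
        _, a, b = best
        merged = clusters[a] | clusters[b]
        merges.append(set(merged))
        clusters = [c for k, c in enumerate(clusters) if k not in (a, b)] + [merged]
    return merges

S = np.array([[1.0, 0.9, 0.2, 0.1],
              [0.9, 1.0, 0.3, 0.2],
              [0.2, 0.3, 1.0, 0.8],
              [0.1, 0.2, 0.8, 1.0]])
print(average_linkage(S))   # merges {0,1}, then {2,3}, then everything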

Page 39: On a Theory of Similarity functions for Learning and Clustering

More sufficient properties:

3. "all x more similar to all y in their own cluster than to any y' from any other cluster", but add noisy data.

– Noisy data can ruin bottom-up algorithms, but we can show a generate-and-test style algorithm works:
– Create a collection of plausible clusters.
– Use a series of pairwise tests to remove/shrink clusters until consistent with a tree.

Page 40: On a Theory of Similarity functions for Learning and Clustering

More sufficient properties:

3. "all x more similar to all y in their own cluster than to any y' from any other cluster", but add noisy data.

4. Implicit assumptions made by the optimization approach:

"Any approximately-optimal k-median solution is close (in terms of how points are clustered) to the target." [Nina Balcan's talk on Saturday]

Page 41: On a Theory of Similarity functions for Learning and Clustering

Can also analyze the inductive setting

Assume for all C, C', all A ⊂ C, A' ⊆ C', we have K(A, C−A) > K(A, A') + γ, but we only see a small sample S.

Can use "regularity"-type results of [AFKK] to argue that whp, a reasonable-size S will give good estimates of all desired quantities.

Once S is hierarchically partitioned, we can insert new points as they arrive.

Page 42: On a Theory of Similarity functions for Learning and Clustering

Like a PAC model for clustering

• A property is a relation between the target and the similarity information (data). Like a data-dependent concept class in learning.

• Given data and a similarity function K, a property induces a "concept class" C of all clusterings c such that (c, K) is consistent with the property.

• Tree model: want a tree T s.t. the set of prunings of T forms an ε-cover of C.

• In the inductive model, want this with prob. 1-δ.

Page 43: On a Theory of Similarity functions for Learning and Clustering

Summary (part II)

• Exploring the question: what does an algorithm need in order to cluster well?

• What natural properties allow a similarity measure to be useful for clustering?

– To get a good theory, helps to relax what we mean by “useful for clustering”.

– The user can then decide how specific to be in each part of the domain.

• Analyze a number of natural properties and prove guarantees on algorithms able to use them.

Page 44: On a Theory of Similarity functions for Learning and Clustering

Wrap-up

• Tour through learning and clustering by similarity functions.
– A user with some knowledge of the problem domain comes up with a pairwise similarity measure K(x,y) that makes sense for the given problem.
– The algorithm uses this (together with labeled data in the case of learning) to find a good solution.

• Goals of a theory:
– Give guidance to the similarity-function designer (what properties to shoot for?).
– Understand what properties are sufficient for learning/clustering, and by what algorithms.

• For learning, get theory of kernels without need for “implicit spaces”.

• For clustering, “reverses” the usual view. Suggests giving the algorithm some slack (tree vs partitioning).

• A lot of interesting questions still open in these areas.

