
Manifold Regularization

Vikas Sindhwani

Department of Computer Science

University of Chicago

Joint Work with Mikhail Belkin and Partha Niyogi

TTI-C Talk, September 14, 2004

The Problem of Learning

Labeled examples $(x_1, y_1), \ldots, (x_l, y_l)$ are drawn from an unknown probability distribution $P$ on $X \times Y$.

A learning algorithm maps the data to an element $f$ of a hypothesis space of functions mapping $X \to Y$. $f$ should provide good labels for future examples.

Regularization: choose a simple function that agrees with the data.


The Problem of Learning

Notions of simplicity are the key to successful learning. Here's a simple function that agrees with the data.


Learning and Prior Knowledge

But simplicity is a relative concept. Prior knowledge of the marginal can modify our notions of simplicity.


Motivation

How can we exploit prior knowledge of the marginal distribution $P_X$? More practically, how can we use unlabeled examples drawn from $P_X$?

Why is this important?

Natural data has structure to exploit.

Natural learning is largely semi-supervised.

Labels are expensive; unlabeled data is cheap and plentiful.


Contributions

A data-dependent, geometric regularization framework for learning from examples.

Representer Theorems provide solutions.

Extensions of SVM and RLS for semi-supervised learning.

Regularized Spectral Clustering and dimensionality reduction.

The problem of out-of-sample extensions in graph methods is resolved.

Good empirical performance.

Regularization with RKHS

Learning in Reproducing Kernel Hilbert Spaces:

$$f^\star = \operatorname*{argmin}_{f \in \mathcal{H}_K} \ \frac{1}{l} \sum_{i=1}^{l} V(x_i, y_i, f) + \gamma \|f\|_K^2$$

Regularized Least Squares (RLS): $V(x_i, y_i, f) = (y_i - f(x_i))^2$

Support Vector Machine (SVM): $V(x_i, y_i, f) = \max\big(0,\ 1 - y_i f(x_i)\big)$

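As a concrete reference point for the supervised case, here is a minimal NumPy sketch of RLS with a Gaussian kernel; the parameter values (`gam`, `sigma`) are illustrative placeholders, not settings from the talk.

```python
import numpy as np

def gaussian_gram(X, Z, sigma=1.0):
    """Gram matrix K[i, j] = exp(-||X[i] - Z[j]||^2 / (2 sigma^2))."""
    d2 = ((X[:, None, :] - Z[None, :, :]) ** 2).sum(-1)
    return np.exp(-d2 / (2.0 * sigma ** 2))

def rls_fit(X, y, gam=0.1, sigma=1.0):
    """Kernel RLS: minimize (1/l) ||y - K a||^2 + gam * a^T K a.
    Setting the gradient to zero gives (K + gam * l * I) a = y
    (assuming the Gram matrix K is invertible)."""
    l = len(y)
    K = gaussian_gram(X, X, sigma)
    alpha = np.linalg.solve(K + gam * l * np.eye(l), y)
    return alpha

def rls_predict(X_train, alpha, X_test, sigma=1.0):
    """f(x) = sum_i alpha_i K(x_i, x), by the Representer Theorem."""
    return gaussian_gram(X_test, X_train, sigma) @ alpha
```

The semi-supervised variants introduced later keep the same kernel expansion and only change the linear system that determines `alpha`.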

What are RKHS ?

Hilbert spaces with a nice property: if two functions $f, g$ are close in the distance derived from the inner product, then their values $f(x), g(x)$ are close for all $x$.

Reproducing property: the evaluation functional $E_x : f \mapsto f(x)$ is linear and continuous. By the Riesz representation theorem, there exists $K_x \in \mathcal{H}$ such that $E_x(f) = \langle f, K_x \rangle = f(x)$.

Kernel function of the RKHS: $K(x, z) = K_x(z) = \langle K_x, K_z \rangle$.

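This "nice property" follows from the reproducing property together with the Cauchy-Schwarz inequality:

$$|f(x) - g(x)| = |\langle f - g, K_x \rangle| \ \le\ \|f - g\|_K \, \|K_x\|_K = \|f - g\|_K \, \sqrt{K(x, x)}$$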

Why RKHS ?

Rich function spaces with complexity control, e.g., the Gaussian kernel

$$K(x, z) = e^{-\frac{\|x - z\|^2}{2\sigma^2}}, \qquad \|f\|_K^2 \ \propto\ \int |\tilde{f}(\omega)|^2 \, e^{\frac{\sigma^2 \|\omega\|^2}{2}} \, d\omega$$

Representer Theorems show that the minimizer has the form

$$f^\star(\cdot) = \sum_{i=1}^{l} \alpha_i K(x_i, \cdot)$$

and therefore

$$\|f^\star\|_K^2 = \langle f^\star, f^\star \rangle = \sum_{i,j} \alpha_i \alpha_j K(x_i, x_j)$$

Motivates kernelization (KPCA, KFD, etc).

Good empirical performance.


Known Marginal

If the marginal $P_X$ is known, solve:

$$f^\star = \operatorname*{argmin}_{f \in \mathcal{H}_K} \ \frac{1}{l} \sum_{i=1}^{l} V(x_i, y_i, f) + \gamma_A \|f\|_K^2 + \gamma_I \|f\|_I^2$$

Extrinsic and intrinsic regularization: $\gamma_A$ controls complexity in the ambient space; $\gamma_I$ controls complexity in the intrinsic geometry of $P_X$.


Continuous Representer Theorem

Assume that the penalty term $\|f\|_I$ is sufficiently smooth with respect to the RKHS norm $\|f\|_K$. Then the solution $f^\star$ to the optimization problem exists and admits the following representation:

$$f^\star(x) = \sum_{i=1}^{l} \alpha_i K(x_i, x) + \int_{\mathcal{M}} \alpha(z)\, K(x, z)\, dP_X(z)$$

where $\mathcal{M} = \operatorname{supp}\{P_X\}$ is the support of the marginal $P_X$.

A Manifold Regularizer

If $\mathcal{M}$, the support of the marginal, is a compact submanifold $\mathcal{M} \subset \mathbb{R}^n$, it seems natural to choose:

$$\|f\|_I^2 = \int_{\mathcal{M}} \langle \nabla_{\mathcal{M}} f, \nabla_{\mathcal{M}} f \rangle$$

and to find $f \in \mathcal{H}_K$ that minimizes:

$$\frac{1}{l} \sum_{i=1}^{l} V(x_i, y_i, f) + \gamma_A \|f\|_K^2 + \gamma_I \int_{\mathcal{M}} \langle \nabla_{\mathcal{M}} f, \nabla_{\mathcal{M}} f \rangle$$


Laplace Beltrami Operator

The intrinsic regularizer is a quadratic form involving the Laplace-Beltrami operator $\Delta_{\mathcal{M}} f = -\operatorname{div} \nabla_{\mathcal{M}} f$ on the manifold:

$$\|f\|_I^2 = \int_{\mathcal{M}} \langle \nabla_{\mathcal{M}} f, \nabla_{\mathcal{M}} f \rangle = \int_{\mathcal{M}} f \, \Delta_{\mathcal{M}} f$$

because some calculus on manifolds establishes that for any vector field $X$,

$$\int_{\mathcal{M}} \langle X, \nabla_{\mathcal{M}} f \rangle = -\int_{\mathcal{M}} f \, \operatorname{div}(X)$$

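Taking $X = \nabla_{\mathcal{M}} f$ in this identity (with $\mathcal{M}$ compact and without boundary, so no boundary terms appear) gives the quadratic form above:

$$\int_{\mathcal{M}} \langle \nabla_{\mathcal{M}} f, \nabla_{\mathcal{M}} f \rangle = -\int_{\mathcal{M}} f \, \operatorname{div}(\nabla_{\mathcal{M}} f) = \int_{\mathcal{M}} f \, \Delta_{\mathcal{M}} f$$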

Passage to the Discrete

In reality, $P_X$ is unknown and sampled only via the examples $\{x_i\}_{i=1}^{l+u}$. Labels are not required for empirical estimates of $\|f\|_I^2$.

Manifold $\rightarrow$ Graph: vertices are the data points $\{x_i\}_{i=1}^{l+u}$; edge weights $W_{ij} = e^{-\|x_i - x_j\|^2 / 4t}$ if $x_i \sim x_j$ (nearest neighbors), and $0$ otherwise.

Laplace-Beltrami $\rightarrow$ Graph Laplacian: $L = D - W$, where $D$ is diagonal with $D_{ii} = \sum_j W_{ij}$.

$$\|f\|_I^2 \ \rightarrow\ \widehat{\|f\|}_I^2 = \mathbf{f}^T L \, \mathbf{f} = \frac{1}{2} \sum_{i,j} \big(f(x_i) - f(x_j)\big)^2 W_{ij}$$

where $\mathbf{f} = \big(f(x_1), \ldots, f(x_{l+u})\big)^T$.

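A minimal sketch of how this empirical estimate might be computed, using a k-nearest-neighbor graph with heat-kernel edge weights; the neighborhood size `k` and heat parameter `t` are illustrative choices.

```python
import numpy as np

def graph_laplacian(X, k=6, t=1.0):
    """Build a kNN adjacency graph with heat-kernel weights and return L = D - W."""
    n = len(X)
    d2 = ((X[:, None, :] - X[None, :, :]) ** 2).sum(-1)   # squared distances
    W = np.zeros((n, n))
    for i in range(n):
        nbrs = np.argsort(d2[i])[1:k + 1]                  # skip the point itself
        W[i, nbrs] = np.exp(-d2[i, nbrs] / (4.0 * t))
    W = np.maximum(W, W.T)                                 # symmetrize the graph
    return np.diag(W.sum(1)) - W

def intrinsic_norm(f_vals, L):
    """Empirical estimate f^T L f of the intrinsic smoothness penalty."""
    return f_vals @ L @ f_vals
```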

Algorithms

We have motivated the following optimization problem: find a function $f \in \mathcal{H}_K$ that minimizes

$$\frac{1}{l} \sum_{i=1}^{l} V(x_i, y_i, f) + \gamma_A \|f\|_K^2 + \frac{\gamma_I}{(u+l)^2} \, \mathbf{f}^T L \, \mathbf{f}$$

Laplacian RLS: $V(x_i, y_i, f) = (y_i - f(x_i))^2$

Laplacian SVM: $V(x_i, y_i, f) = \max\big(0,\ 1 - y_i f(x_i)\big)$


Empirical Representer Theorem

The minimizer admits an expansion

$$f^\star(x) = \sum_{i=1}^{l+u} \alpha_i K(x_i, x)$$

Proof: write any $f \in \mathcal{H}_K$ as

$$f = \sum_{i=1}^{l+u} \alpha_i K(x_i, \cdot) + f^{\perp}, \qquad f(x_j) = \langle f, K_{x_j} \rangle = \sum_{i=1}^{l+u} \alpha_i K(x_i, x_j)$$

The orthogonal component $f^{\perp}$ only increases the norm. So $f^{\perp} = 0$.

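Spelling out the orthogonality argument: $f^{\perp}$ is orthogonal to the span of $\{K_{x_i}\}_{i=1}^{l+u}$, so it vanishes at every data point and only adds to the norm,

$$f^{\perp}(x_j) = \langle f^{\perp}, K_{x_j} \rangle = 0, \qquad \|f\|_K^2 = \Big\| \sum_{i=1}^{l+u} \alpha_i K(x_i, \cdot) \Big\|_K^2 + \|f^{\perp}\|_K^2$$

Since the loss and the graph penalty depend on $f$ only through its values at the data points, dropping $f^{\perp}$ can only decrease the objective.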

Laplacian RLS

By the Representer Theorem, the problem becomes finite-dimensional. For Laplacian RLS, we find the $\alpha^\star \in \mathbb{R}^{l+u}$ that minimizes

$$\frac{1}{l} \|Y - J K \alpha\|^2 + \gamma_A \, \alpha^T K \alpha + \frac{\gamma_I}{(u+l)^2} \, \alpha^T K L K \alpha$$

where $K$ is the Gram matrix, $Y = (y_1, \ldots, y_l, 0, \ldots, 0)^T$, and $J = \operatorname{diag}(1, \ldots, 1, 0, \ldots, 0)$ with $l$ ones. The solution is

$$\alpha^\star = \Big( J K + \gamma_A l \, I + \frac{\gamma_I \, l}{(u+l)^2} \, L K \Big)^{-1} Y$$

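A sketch of the resulting closed-form solve in NumPy, reusing the `gaussian_gram` and `graph_laplacian` helpers sketched earlier; all parameter values are placeholders.

```python
import numpy as np

def laprls_fit(X, y_labeled, gam_A=0.1, gam_I=0.1, sigma=1.0, k=6, t=1.0):
    """Laplacian RLS. X holds the l labeled points followed by the u unlabeled
    points; y_labeled holds the l labels. Returns the expansion coefficients alpha."""
    l, n = len(y_labeled), len(X)
    K = gaussian_gram(X, X, sigma)                 # (l+u) x (l+u) Gram matrix
    L = graph_laplacian(X, k, t)                   # graph Laplacian on all points
    Y = np.concatenate([y_labeled, np.zeros(n - l)])
    J = np.diag(np.concatenate([np.ones(l), np.zeros(n - l)]))
    # alpha = (J K + gam_A l I + gam_I l / (l+u)^2 * L K)^{-1} Y
    A = J @ K + gam_A * l * np.eye(n) + (gam_I * l / n ** 2) * (L @ K)
    return np.linalg.solve(A, Y)
```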

Laplacian SVM

For Laplacian SVMs, we solve a QP:

$$\beta^\star = \operatorname*{argmax}_{\beta \in \mathbb{R}^l} \ \sum_{i=1}^{l} \beta_i - \frac{1}{2} \, \beta^T Q \, \beta$$

subject to: $\sum_{i=1}^{l} y_i \beta_i = 0$ and $0 \le \beta_i \le \frac{1}{l}$,

where $Q = Y J K \Big( 2\gamma_A I + 2 \frac{\gamma_I}{(u+l)^2} L K \Big)^{-1} J^T Y$ (here $Y = \operatorname{diag}(y_1, \ldots, y_l)$ and $J$ is the $l \times (l+u)$ matrix $[I \ \ 0]$), and then invert a linear system:

$$\alpha^\star = \Big( 2\gamma_A I + 2 \frac{\gamma_I}{(u+l)^2} L K \Big)^{-1} J^T Y \beta^\star$$


Manifold Regularization

Input: $l$ labeled and $u$ unlabeled examples.

Output: $f : \mathbb{R}^n \rightarrow \mathbb{R}$

Algorithm:
Construct the adjacency graph. Compute the Laplacian $L$.
Choose a kernel $K(x, z)$. Compute the Gram matrix $K$.
Choose $\gamma_A, \gamma_I$. (?)
Compute $\alpha^\star$.
Output $f(x) = \sum_{i=1}^{l+u} \alpha_i^\star K(x_i, x)$.

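Because the expansion runs over both labeled and unlabeled points, out-of-sample prediction is just a kernel evaluation against all $l+u$ training points; a small sketch reusing the helpers above (the binary decision by sign is an illustrative assumption):

```python
def manifold_reg_predict(X_train, alpha, X_new, sigma=1.0):
    """Out-of-sample evaluation f(x) = sum_i alpha_i K(x_i, x) at new points."""
    return gaussian_gram(X_new, X_train, sigma) @ alpha

# Example usage (illustrative names and shapes only):
#   alpha = laprls_fit(X_all, y_labeled)     # X_all stacks labeled then unlabeled points
#   y_hat = np.sign(manifold_reg_predict(X_all, alpha, X_test))
```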

Unity of Learning

Supervised: SVM / RLS

Partially Supervised: Graph Regularization; Out-of-sample Extension

Unsupervised: Graph Mincut; Spectral Clustering; Out-of-sample Extension; Regularized Spectral Clustering

[Table comparing the corresponding optimization problems, obtained as special cases of the manifold regularization objective.]


Regularized Spectral Clustering

Unsupervised Manifold Regularization:

$$\operatorname*{argmin}_{\substack{f \in \mathcal{H}_K \\ \sum_i f(x_i) = 0,\ \sum_i f(x_i)^2 = 1}} \ \gamma \|f\|_K^2 + \mathbf{f}^T L \, \mathbf{f}$$

Representer Theorem: $f(x) = \sum_{i=1}^{u} \alpha_i K(x_i, x)$, which leads to the generalized eigenvalue problem

$$P \big( \gamma K + K L K \big) P \, v = \lambda \, P K^2 P \, v$$

and $\alpha^\star = P v$, where $v$ is the smallest-eigenvalue eigenvector and $P$ projects orthogonal to $K \mathbf{1}$ (enforcing the centering constraint).

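One way the eigenproblem could be handled numerically is to work directly in the subspace cut out by the centering constraint; the sketch below (reusing `gaussian_gram` and `graph_laplacian`) is an illustration under that reading, and the sign-based two-way cut and all parameter values are assumptions rather than prescriptions from the talk.

```python
import numpy as np
from scipy.linalg import eigh, null_space

def reg_spectral_clustering(X, gamma=1e-2, sigma=1.0, k=6, t=1.0):
    """One-dimensional regularized spectral embedding; sign of f gives a 2-way cut."""
    K = gaussian_gram(X, X, sigma)
    L = graph_laplacian(X, k, t)
    q = K @ np.ones(len(X))                  # centering constraint: q^T alpha = 0
    B = null_space(q[None, :])               # orthonormal basis of {alpha : q^T alpha = 0}
    A = B.T @ (gamma * K + K @ L @ K) @ B    # objective, restricted to the subspace
    M = B.T @ (K @ K) @ B                    # norm constraint alpha^T K^2 alpha = 1
    evals, evecs = eigh(A, M)                # generalized symmetric eigenproblem
    alpha = B @ evecs[:, 0]                  # smallest-eigenvalue solution
    f_vals = K @ alpha                       # embedding values at the data points
    return np.sign(f_vals), alpha
```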

Experiments : Synthetic

[Figure: decision boundaries of SVM and Laplacian SVM on a 2-D synthetic data set. Panels: SVM with $\gamma_A = 0.03125$, $\gamma_I = 0$; Laplacian SVM with $\gamma_A = 0.03125$ and $\gamma_I = 0.01, 1$; Laplacian SVM with $\gamma_I = 1$ and $\gamma_A = 10^{-6}, 10^{-4}, 0.1$.]

Related Algorithms

Transductive SVMs [Joachims, Vapnik]:

$$f^\star = \operatorname*{argmin}_{f \in \mathcal{H}_K,\ y_{l+1}, \ldots, y_{l+u}} \ \frac{1}{l} \sum_{i=1}^{l} \big(1 - y_i f(x_i)\big)_+ + \frac{\lambda'}{u} \sum_{i=l+1}^{l+u} \big(1 - y_i f(x_i)\big)_+ + \lambda \|f\|_K^2$$

where the labels $y_{l+1}, \ldots, y_{l+u}$ of the unlabeled points are also optimized over.

Semi-supervised SVMs [Bennett, Fung et al]:

$$f^\star = \operatorname*{argmin}_{f \in \mathcal{H}_K} \ \frac{1}{l} \sum_{i=1}^{l} \big(1 - y_i f(x_i)\big)_+ + C \sum_{i=l+1}^{l+u} \min\!\Big( \big(1 - f(x_i)\big)_+,\ \big(1 + f(x_i)\big)_+ \Big) + \lambda \|f\|_K^2$$

Measure-based Regularization [Bousquet et al]:

$$f^\star = \operatorname*{argmin}_{f} \ \frac{1}{l} \sum_{i=1}^{l} V\big(f(x_i), y_i\big) + \lambda \int \langle \nabla f(x), \nabla f(x) \rangle \, dP_X(x)$$

Experiments : Synthetic Data

[Figure: decision boundaries on a 2-D synthetic data set. Panels: SVM, Transductive SVM, Laplacian SVM.]

Experiments : Digits

[Figure: error rates on 45 pairwise classification problems. Panels: RLS vs LapRLS, SVM vs LapSVM, TSVM vs LapSVM (error rate per problem); out-of-sample extension, LapRLS and LapSVM error on unlabeled vs test points; performance deviation of SVM (o) and TSVM (x) vs LapSVM deviation.]

Experiments : Digits

[Figure: average error rate vs number of labeled examples (2 to 128). Panels: SVM vs LapSVM and RLS vs LapRLS, each reported on unlabeled (U) and test (T) data.]

Experiments : Speech

[Figure: error rate vs number of labeled speakers. Panels: RLS vs LapRLS and SVM vs TSVM vs LapSVM, on the unlabeled set and on the test set.]

Experiments : Speech

[Figure: error rate on the test set vs error rate on the unlabeled set for RLS, LapRLS, SVM, and LapSVM (Experiments 1 and 2).]

Experiments : Text

Method        PRBEP          Error
k-NN          73.2           13.3
SGT           86.2           6.2
Naive-Bayes   —              12.9
Cotraining    —              6.20
SVM           76.39 (5.6)    10.41 (2.5)
TSVM          88.15 (1.0)    5.22 (0.5)
LapSVM        87.73 (2.3)    5.41 (1.0)
RLS           73.49 (6.2)    11.68 (2.7)
LapRLS        86.37 (3.1)    5.99 (1.4)


Experiments : Text

[Figure: PRBEP vs number of labeled examples (2 to 64). Panels: performance of RLS vs LapRLS and SVM vs LapSVM on unlabeled (U) and test (T) data; LapSVM PRBEP on unlabeled and test data for different amounts of unlabeled data (U = 779 - l, 350, 150).]

Future Work

Generalization as a function of labeled and unlabeled examples.

Additional Structure: Structured Outputs, Invariances.

Active Learning, Feature Selection.

Efficient Algorithms: Linear Methods, Sparse Solutions.

Applications: Bioinformatics, Text, Speech, Vision, ...
