Manifold Regularization
Vikas Sindhwani
Department of Computer Science
University of Chicago
Joint Work with Mikhail Belkin and Partha Niyogi
TTI-C Talk, September 14, 2004
The Problem of Learning
Labeled examples $S = \{(x_i, y_i)\}_{i=1}^{l}$ are drawn from an unknown probability distribution $P$ on $X \times Y$. A Learning Algorithm maps $S$ to an element $f$ of a hypothesis space of functions mapping $X \to Y$. $f$ should provide good labels for future examples.
Regularization: Choose a simple function that agrees with the data.
The Problem of Learning
Notions of simplicity are the key to successful learning. Here's a simple function that agrees with the data.
Learning and Prior Knowledge
But Simplicity is a Relative Concept. Prior Knowledge of the Marginal can modify our notions of simplicity.
Motivation
How can we exploit prior knowledge of the marginal distribution $P_X$?
More practically, how can we use unlabeled examples drawn from $P_X$?
Why is this important?
Natural Data has structure to exploit.
Natural Learning is largely semi-supervised.
Labels are Expensive; Unlabeled data is cheap and plentiful.
Contributions
A data-dependent, Geometric Regularization Framework for Learning from examples.
Representer Theorems provide solutions.
Extensions of SVM and RLS for Semi-supervised Learning.
Regularized Spectral Clustering and Dimensionality Reduction.
The problem of Out-of-sample extensions in graph methods is resolved.
Good Empirical Performance.
Regularization with RKHS
Learning in Reproducing Kernel Hilbert Spaces:
$$f^* = \operatorname*{argmin}_{f \in \mathcal{H}_K}\; \frac{1}{l} \sum_{i=1}^{l} V(x_i, y_i, f) + \gamma \|f\|_K^2$$
Regularized Least Squares (RLS): $V(x_i, y_i, f) = (y_i - f(x_i))^2$
Support Vector Machine (SVM): $V(x_i, y_i, f) = \max(0,\, 1 - y_i f(x_i))$
What are RKHS?
Hilbert Spaces with a nice property: if two functions $f, g \in \mathcal{H}$ are close in the distance derived from the inner product, their values $f(x), g(x)$ are close for all $x$.
Reproducing Property: the evaluation functional $E_x : f \mapsto f(x)$ is linear and continuous. By the Riesz Representation theorem, there exists $K_x \in \mathcal{H}$ such that
$$E_x(f) = \langle f, K_x \rangle = f(x).$$
Kernel Function of the RKHS: $K(x, z) = K_x(z) = \langle K_x, K_z \rangle$
Why RKHS?
Rich Function Spaces with complexity control
e.g. the Gaussian Kernel
$$K(x, z) = e^{-\|x - z\|^2 / 2\sigma^2}, \qquad \|f\|_K^2 \propto \int |\hat{f}(\omega)|^2\, e^{\sigma^2 \|\omega\|^2 / 2}\, d\omega$$
Representer Theorems show that the minimizer has the form:
$$f^*(x) = \sum_{i=1}^{l} \alpha_i K(x, x_i)$$
and therefore,
$$\|f^*\|_K^2 = \langle f^*, f^* \rangle = \sum_{i,j} \alpha_i \alpha_j K(x_i, x_j)$$
Motivates kernelization (KPCA, KFD, etc).
Good empirical performance.
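To make the representer expansion concrete, here is a minimal numpy sketch (my own illustration, not code from the talk): it builds a Gaussian Gram matrix and evaluates the RKHS norm $\alpha^T K \alpha$ of a function in the span of the kernels.

```python
import numpy as np

def gaussian_gram(X, Z, sigma=1.0):
    """Gram matrix K[i, j] = exp(-||X[i] - Z[j]||^2 / (2 sigma^2))."""
    sq = ((X[:, None, :] - Z[None, :, :]) ** 2).sum(-1)
    return np.exp(-sq / (2 * sigma ** 2))

# A function in the span of the kernels, f(x) = sum_i alpha_i K(x, x_i),
# has RKHS norm ||f||_K^2 = alpha^T K alpha.
X = np.random.randn(20, 2)        # toy inputs (assumed, for illustration)
K = gaussian_gram(X, X)
alpha = np.random.randn(20)
norm_sq = alpha @ K @ alpha
```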
Known Marginal
If the marginal $P_X$ is known, solve:
$$f^* = \operatorname*{argmin}_{f \in \mathcal{H}_K}\; \frac{1}{l} \sum_{i=1}^{l} V(x_i, y_i, f) + \gamma_A \|f\|_K^2 + \gamma_I \|f\|_I^2$$
Extrinsic and Intrinsic Regularization: $\gamma_A$ controls complexity in the ambient space; $\gamma_I$ controls complexity in the intrinsic geometry of $P_X$.
Continuous Representer Theorem
Assume that the penalty term $\|f\|_I$ is sufficiently smooth with respect to the RKHS norm $\|f\|_K$. Then the solution $f^*$ to the optimization problem exists and admits the following representation:
$$f^*(x) = \sum_{i=1}^{l} \alpha_i K(x, x_i) + \int_{\mathcal{M}} \alpha(z)\, K(x, z)\, dP(z)$$
where $\mathcal{M} = \operatorname{supp}(P_X)$ is the support of the marginal $P_X$.
A Manifold Regularizer
If $\mathcal{M}$, the support of the marginal, is a compact submanifold $\mathcal{M} \subset \mathbb{R}^n$, it seems natural to choose:
$$\|f\|_I^2 = \int_{\mathcal{M}} \langle \nabla_{\mathcal{M}} f, \nabla_{\mathcal{M}} f \rangle$$
and to find $f \in \mathcal{H}_K$ that minimizes:
$$\frac{1}{l} \sum_{i=1}^{l} V(x_i, y_i, f) + \gamma_A \|f\|_K^2 + \gamma_I \int_{\mathcal{M}} \langle \nabla_{\mathcal{M}} f, \nabla_{\mathcal{M}} f \rangle$$
Laplace Beltrami Operator
The intrinsic regularizer is a quadratic form involving the Laplace-Beltrami operator $\Delta_{\mathcal{M}} f = -\operatorname{div}(\nabla_{\mathcal{M}} f)$ on the manifold:
$$\|f\|_I^2 = \int_{\mathcal{M}} \langle \nabla_{\mathcal{M}} f, \nabla_{\mathcal{M}} f \rangle = \int_{\mathcal{M}} f\, \Delta_{\mathcal{M}} f$$
because some calculus on manifolds establishes that for any vector field $X$,
$$\int_{\mathcal{M}} \langle \nabla_{\mathcal{M}} f, X \rangle = -\int_{\mathcal{M}} f\, \operatorname{div}(X);$$
taking $X = \nabla_{\mathcal{M}} f$ gives the identity above.
Passage to the Discrete
In reality, $\mathcal{M}$ is unknown and sampled only via the examples $\{x_i\}_{i=1}^{l+u}$. Labels are not required for empirical estimates of $\|f\|_I^2$.
Manifold $\to$ Graph: connect nearby examples, e.g. with heat-kernel edge weights $W_{ij} = e^{-\|x_i - x_j\|^2 / 4t}$.
Laplace-Beltrami $\to$ Graph Laplacian: $L = D - W$, where $D_{ii} = \sum_j W_{ij}$.
$$\|f\|_I^2 \;\approx\; \mathbf{f}^T L \mathbf{f} \;=\; \frac{1}{2} \sum_{i,j} W_{ij}\, (f(x_i) - f(x_j))^2, \qquad \mathbf{f} = (f(x_1), \ldots, f(x_{l+u}))^T$$
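A minimal numpy sketch of this graph construction (the k-NN sparsification rule and the default values of k and t are my assumptions, not the talk's):

```python
import numpy as np

def graph_laplacian(X, t=1.0, k=6):
    """Sketch: k-NN adjacency graph with heat-kernel weights, L = D - W."""
    n = X.shape[0]
    sq = ((X[:, None, :] - X[None, :, :]) ** 2).sum(-1)
    W = np.exp(-sq / (4 * t))
    # keep only each point's k nearest neighbours, then symmetrize
    mask = np.zeros_like(W, dtype=bool)
    nn = np.argsort(sq, axis=1)[:, 1:k + 1]   # skip self (distance 0)
    mask[np.arange(n)[:, None], nn] = True
    W = W * (mask | mask.T)
    D = np.diag(W.sum(axis=1))
    return D - W, W
```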
Algorithms
We have motivated the following optimization problem: find a function $f \in \mathcal{H}_K$ that minimizes:
$$\frac{1}{l} \sum_{i=1}^{l} V(x_i, y_i, f) + \gamma_A \|f\|_K^2 + \frac{\gamma_I}{(l+u)^2}\, \mathbf{f}^T L \mathbf{f}$$
Laplacian RLS: $V(x_i, y_i, f) = (y_i - f(x_i))^2$
Laplacian SVM: $V(x_i, y_i, f) = \max(0,\, 1 - y_i f(x_i))$
Empirical Representer Theorem
The minimizer admits an expansion
$$f^*(x) = \sum_{i=1}^{l+u} \alpha_i K(x, x_i)$$
Proof: write any $f \in \mathcal{H}_K$ as $f = \sum_{i=1}^{l+u} \alpha_i K(x_i, \cdot) + f_\perp$, with $f_\perp$ orthogonal to $\operatorname{span}\{K(x_i, \cdot)\}$. Then
$$f(x_j) = \langle f, K_{x_j} \rangle = \sum_{i=1}^{l+u} \alpha_i K(x_i, x_j),$$
so $f_\perp$ leaves the values at the data points unchanged but increases the norm. Hence $f_\perp = 0$ at the minimizer.
Laplacian RLS
By the Representer Theorem, the problem becomes finite dimensional. For Laplacian RLS, we find $\alpha \in \mathbb{R}^{l+u}$ that minimizes:
$$\frac{1}{l}\, \|Y - J K \alpha\|^2 + \gamma_A\, \alpha^T K \alpha + \frac{\gamma_I}{(l+u)^2}\, \alpha^T K L K \alpha$$
where $K$ is the Gram matrix, $Y = (y_1, \ldots, y_l, 0, \ldots, 0)^T$, and $J = \operatorname{diag}(1, \ldots, 1, 0, \ldots, 0)$ with $l$ ones followed by $u$ zeros. The solution is:
$$\alpha^* = \left( J K + \gamma_A l\, I + \frac{\gamma_I l}{(l+u)^2}\, L K \right)^{-1} Y$$
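A direct numpy transcription of this closed form (a sketch; the helper name and conventions are mine, not the talk's):

```python
import numpy as np

def lap_rls(K, L, y, l, gamma_A, gamma_I):
    """Sketch of the Laplacian RLS solve.
    K: (n,n) Gram matrix over l labeled + u unlabeled points,
    L: graph Laplacian, y: (l,) labels for the first l points."""
    n = K.shape[0]                                     # n = l + u
    J = np.diag(np.r_[np.ones(l), np.zeros(n - l)])    # selects labeled points
    Y = np.r_[y, np.zeros(n - l)]
    A = J @ K + gamma_A * l * np.eye(n) + (gamma_I * l / n ** 2) * (L @ K)
    return np.linalg.solve(A, Y)                       # alpha*

# f(x) = sum_i alpha[i] * K(x, x_i) then labels any new point x.
```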
Laplacian SVM
For Laplacian SVMs, we solve a QP:
$$\beta^* = \operatorname*{argmax}_{\beta \in \mathbb{R}^l}\; \sum_{i=1}^{l} \beta_i - \frac{1}{2}\, \beta^T Q \beta$$
subject to: $\sum_{i=1}^{l} y_i \beta_i = 0$ and $0 \le \beta_i \le \frac{1}{l}$, where
$$Q = Y J K \left( 2\gamma_A I + \frac{2\gamma_I}{(l+u)^2}\, L K \right)^{-1} J^T Y$$
with $Y = \operatorname{diag}(y_1, \ldots, y_l)$ and $J = [I \;\; 0]$ the $l \times (l+u)$ selector of labeled points, and then invert a linear system:
$$\alpha^* = \left( 2\gamma_A I + \frac{2\gamma_I}{(l+u)^2}\, L K \right)^{-1} J^T Y \beta^*$$
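A hedged numpy/scipy sketch of this training procedure (the helper name, the generic SLSQP solver, and the starting point are my choices; in practice a dedicated QP solver would be used):

```python
import numpy as np
from scipy.optimize import minimize

def lap_svm(K, L, y, l, gamma_A, gamma_I):
    """Sketch of Laplacian SVM training.
    K: (n,n) Gram matrix over l labeled + u unlabeled points,
    L: graph Laplacian, y: (l,) labels in {-1,+1} for the first l points."""
    n = K.shape[0]                                     # n = l + u
    J = np.hstack([np.eye(l), np.zeros((l, n - l))])   # l x n selector [I 0]
    Y = np.diag(y.astype(float))
    M = 2 * gamma_A * np.eye(n) + (2 * gamma_I / n ** 2) * (L @ K)
    MinvJTY = np.linalg.solve(M, J.T @ Y)              # M^{-1} J^T Y
    Q = Y @ J @ K @ MinvJTY                            # dual quadratic term
    # dual QP: max sum(beta) - 0.5 beta^T Q beta, 0 <= beta <= 1/l, y^T beta = 0
    res = minimize(lambda b: 0.5 * b @ Q @ b - b.sum(),
                   np.full(l, 0.5 / l),
                   bounds=[(0.0, 1.0 / l)] * l,
                   constraints={'type': 'eq', 'fun': lambda b: y @ b})
    return MinvJTY @ res.x                             # alpha*
```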
Manifold Regularization
Input: $l$ labeled and $u$ unlabeled examples.
Output: $f : \mathbb{R}^n \to \mathbb{R}$.
Algorithm (an end-to-end sketch follows below):
Construct the adjacency graph; compute the Laplacian $L$.
Choose the Kernel $K(x, z)$; compute the Gram matrix $K$.
Choose $\gamma_A$, $\gamma_I$. (?)
Compute $\alpha^*$.
Output $f^*(x) = \sum_{i=1}^{l+u} \alpha_i^* K(x, x_i)$.
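Assuming the hypothetical helpers sketched earlier (`gaussian_gram`, `graph_laplacian`, `lap_rls`), the whole pipeline might look like this on toy data; all names and parameter values are illustrative, not from the talk:

```python
import numpy as np

# toy data: first l points labeled, remaining u unlabeled (assumed setup)
l, u = 10, 200
X = np.random.randn(l + u, 2)
y = np.sign(X[:l, 0])

L_graph, _ = graph_laplacian(X, t=1.0, k=6)    # adjacency graph -> Laplacian
K = gaussian_gram(X, X, sigma=1.0)             # Gram matrix
alpha = lap_rls(K, L_graph, y, l, gamma_A=1e-4, gamma_I=1e-2)

# out-of-sample prediction at new points X_new
X_new = np.random.randn(5, 2)
f_new = gaussian_gram(X_new, X) @ alpha
```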
Unity of Learning
Supervised. SVM/RLS:
$$\min_{f \in \mathcal{H}_K}\; \frac{1}{l}\sum_{i=1}^{l} V(x_i, y_i, f) + \gamma \|f\|_K^2$$
Partially Supervised. Graph Regularization:
$$\min_{\mathbf{f}}\; \sum_{i=1}^{l} V(f_i, y_i) + \gamma\, \mathbf{f}^T L \mathbf{f}$$
with Manifold Regularization as its Out-of-sample Extension.
Unsupervised. Graph Mincut / Spectral Clustering:
$$\min_{\mathbf{f}}\; \mathbf{f}^T L \mathbf{f} \quad \text{(suitably normalized)}$$
with Regularized Spectral Clustering, $\min_f\, \gamma \|f\|_K^2 + \mathbf{f}^T L \mathbf{f}$, as its Out-of-sample Extension.
Regularized Spectral Clustering
Unsupervised Manifold Regularization:
$$\min_{f \in \mathcal{H}_K}\; \gamma \|f\|_K^2 + \mathbf{f}^T L \mathbf{f} \qquad \text{subject to normalization constraints (e.g. } \mathbf{1}^T \mathbf{f} = 0,\; \|\mathbf{f}\| = 1\text{)}$$
Representer Theorem:
$$f^*(x) = \sum_{i=1}^{u} \alpha_i K(x, x_i)$$
leads to a generalized eigenvalue problem:
$$P (\gamma K + K L K) P \mathbf{v} = \lambda\, P K^2 P \mathbf{v}$$
and $\alpha^* = P \mathbf{v}^*$, where $\mathbf{v}^*$ is the smallest-eigenvalue eigenvector and $P$ projects orthogonally to $K\mathbf{1}$.
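A simplified numpy/scipy sketch of the resulting eigenproblem (for brevity it handles the constraint by discarding the lowest, near-constant eigenvector rather than projecting with $P$; the ridge term is an added numerical safeguard, not part of the talk):

```python
import numpy as np
from scipy.linalg import eigh

def reg_spectral_clustering(K, L, gamma):
    """Sketch: solve (gamma*K + K L K) v = lam K^2 v, then threshold."""
    u = K.shape[0]
    A = gamma * K + K @ L @ K
    B = K @ K + 1e-9 * np.eye(u)        # small ridge keeps B positive definite
    lam, V = eigh(A, B)                 # generalized symmetric eigenproblem
    alpha = V[:, 1]                     # skip the trivial/constant direction
    f = K @ alpha                       # f(x_i) at the data points
    return np.sign(f)                   # two-way cluster assignment
```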
Experiments : Synthetic
[Figure: decision boundaries on synthetic data. Panels: SVM ($\gamma_A = 0.03125$, $\gamma_I = 0$); Laplacian SVM with $\gamma_A = 0.03125$ and $\gamma_I \in \{0.01, 1\}$; Laplacian SVM with $\gamma_I = 1$ and $\gamma_A \in \{10^{-6}, 10^{-4}, 0.1\}$.]
Related Algorithms
Transductive SVMs [Joachims; Vapnik]:
$$f^* = \operatorname*{argmin}_{f,\; y_{l+1}, \ldots, y_{l+u}}\; \frac{1}{l}\sum_{i=1}^{l} (1 - y_i f(x_i))_+ \;+\; \gamma' \frac{1}{u}\sum_{i=l+1}^{l+u} (1 - y_i f(x_i))_+ \;+\; \gamma \|f\|_K^2$$
Semi-supervised SVMs [Bennett; Fung et al.]:
$$f^* = \operatorname*{argmin}_{f}\; \frac{1}{l}\sum_{i=1}^{l} (1 - y_i f(x_i))_+ \;+\; C \sum_{i=l+1}^{l+u} \min\!\left[(1 - f(x_i))_+,\; (1 + f(x_i))_+\right] \;+\; \gamma \|f\|_K^2$$
Measure-based Regularization [Bousquet et al.]:
$$f^* = \operatorname*{argmin}_{f}\; \frac{1}{l}\sum_{i=1}^{l} V(f(x_i), y_i) \;+\; \lambda \int \langle \nabla f(x), \nabla f(x) \rangle\, p(x)\, dx$$
Experiments : Synthetic Data
[Figure: decision boundaries on synthetic data: SVM, Transductive SVM, and Laplacian SVM.]
Experiments : Digits
[Figure: error rates on 45 binary classification problems. Panels: RLS vs. LapRLS; SVM vs. LapSVM; TSVM vs. LapSVM. Out-of-sample extension: LapRLS and LapSVM error on unlabeled vs. test points. Performance deviation: SVM (o) and TSVM (x) deviation vs. LapSVM deviation.]
Experiments : Digits
[Figure: average error rate vs. number of labeled examples (2 to 128). Panels: SVM vs. LapSVM and RLS vs. LapRLS, each on test (T) and unlabeled (U) data.]
Experiments : Speech
[Figure: error rates as the labeled speaker varies (speakers 1 to 30). Panels: RLS vs. LapRLS and SVM vs. TSVM vs. LapSVM, each on the unlabeled set and on the test set.]
Experiments : Speech
[Figure: error rate on the unlabeled set vs. error rate on the test set, for RLS, LapRLS, SVM, and LapSVM (Experiments 1 and 2).]
Experiments : Text
Method        PRBEP          Error
k-NN          73.2           13.3
SGT           86.2            6.2
Naive-Bayes     —            12.9
Cotraining      —             6.20
SVM           76.39 (5.6)    10.41 (2.5)
TSVM          88.15 (1.0)     5.22 (0.5)
LapSVM        87.73 (2.3)     5.41 (1.0)
RLS           73.49 (6.2)    11.68 (2.7)
LapRLS        86.37 (3.1)     5.99 (1.4)
(Numbers in parentheses are standard deviations.)
Experiments : Text
[Figure: PRBEP vs. number of labeled examples (2 to 64). Panels: performance of RLS and LapRLS; performance of SVM and LapSVM, each on unlabeled (U) and test (T) data; LapSVM performance on unlabeled and on test data for U = 779−l, 350, 150.]
Future Work
Generalization as a function of labeled and unlabeled examples.
Additional Structure: Structured Outputs, Invariances.
Active Learning, Feature Selection.
Efficient Algorithms: Linear Methods, Sparse Solutions.
Applications: Bioinformatics, Text, Speech, Vision, ...