Post on 23-Jan-2016
transcript
Semi-Supervised Learning in Gigantic Image Collections
Rob Fergus (NYU), Yair Weiss (Hebrew U.), Antonio Torralba (MIT)
What does the world look like?
High-level image statistics
Object recognition for large-scale search
Gigantic Image Collections
Spectrum of Label Information
Human annotations → Noisy labels → Unlabeled
Semi-Supervised Learning using Graph Laplacian
G = (V, E)
V = data points; E given by the n x n affinity matrix W
W_ij = exp(−||x_i − x_j||² / 2ε²)
D_ii = Σ_j W_ij
Normalized graph Laplacian: L = D^(−1/2)(D − W)D^(−1/2) = I − D^(−1/2) W D^(−1/2)
[Zhu 03, Zhou 04]
SSL using Graph Laplacian
• Want to find a label function f that minimizes:
  J(f) = fᵀLf + Σ_{i=1..l} λ (f(i) − y_i)²
  (first term: smoothness; second term: agreement with the labels)
• y = labels, λ = weight on the labeled points
• Rewrite as:
  J(f) = fᵀLf + (f − y)ᵀΛ(f − y)
  where Λ_ii = λ if point i is labeled, Λ_ii = 0 if unlabeled
• Straightforward solution
Eigenvectors of Laplacian
• Smooth vectors will be linear combinations of the eigenvectors U of L with small eigenvalues:
  f = Uα, U = [φ₁, …, φ_k]
[Belkin & Niyogi 06, Schoelkopf & Smola 02, Zhu et al. 03, 08]
Rewrite System
• Let U = smallest k eigenvectors of L, α = coefficients, so f = Uα
• Objective becomes (Σ = diagonal matrix of the k smallest eigenvalues):
  J(α) = αᵀΣα + (Uα − y)ᵀΛ(Uα − y)
• Optimal α is now the solution to a k x k system:
  (Σ + UᵀΛU)α = UᵀΛy
• Recover the label function as f = Uα
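As a concrete illustration, the reduced k x k solve can be written in a few lines of NumPy. This is a minimal sketch, not the authors' code; the function name and the default label weight λ = 100 are assumptions, and Σ here is the diagonal matrix of the k smallest eigenvalues:

```python
import numpy as np

def ssl_solve(U, Sigma, y, labeled, lam=100.0):
    """Solve the reduced k x k system (Sigma + U^T Λ U) α = U^T Λ y
    and return the label function f = U α.
    U: (n, k) smallest-k eigenvectors; Sigma: (k,) their eigenvalues;
    y: (n,) labels (arbitrary on unlabeled points);
    labeled: (n,) boolean mask; lam: label weight λ (assumed default)."""
    Lam = np.where(labeled, lam, 0.0)            # diagonal of Λ
    A = np.diag(Sigma) + U.T @ (Lam[:, None] * U)  # Σ + UᵀΛU
    b = U.T @ (Lam * y)                          # UᵀΛy
    alpha = np.linalg.solve(A, b)
    return U @ alpha                             # f = Uα
```

With λ large and the eigenvalues in Σ small, f is driven to match the labels at the labeled points while staying smooth elsewhere.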
Computational Bottleneck
• Consider a dataset of 80 million images
• Inverting L
  – Inverting an 80 million x 80 million matrix
• Finding eigenvectors of L
  – Diagonalizing an 80 million x 80 million matrix
Large Scale SSL - Related Work
• Nystrom method: pick a small set of landmark points
  – Compute the exact solution on these
  – Interpolate the solution to the rest
• Others iteratively use classifiers to label data
  – E.g. the boosting-based method of Loeff et al., ICML '08
[see Zhu '08 survey]
[Figure: data points and landmarks]
Our Approach
Overview of Our Approach
[Figure: Nystrom reduces n by picking landmarks from the data; our approach takes the limit n → ∞ and works directly with the density]
Consider Limit as n → ∞
• Consider x to be drawn from a 2D distribution p(x)
• Let L_p(F) be a smoothness operator on p(x), for a function F(x):
  L_p(F) = ½ ∫∫ (F(x₁) − F(x₂))² W(x₁, x₂) p(x₁) p(x₂) dx₁ dx₂
  where W(x₁, x₂) = exp(−||x₁ − x₂||² / 2ε²)
• Analyze the eigenfunctions of L_p(F)
Eigenvectors & Eigenfunctions
Key assumption: separability of the input data
• Claim: if p is separable, i.e. p(x₁, x₂) = p(x₁) p(x₂), then the eigenfunctions of the marginals are also eigenfunctions of the joint density, with the same eigenvalues
[Nadler et al. 06, Weiss et al. 08]
Numerical Approximations to Eigenfunctions in 1D
• 300k points drawn from distribution p(x)
• Consider the marginal p(x₁)
[Figure: data drawn from p(x); histogram h(x₁) of the marginal p(x₁)]
• Solve for the values g of the eigenfunction at a set of discrete locations (the histogram bin centers), and the associated eigenvalues σ
  – B x B system (# histogram bins B = 50)
• P is diag(h(x₁))

Numerical Approximations to Eigenfunctions in 1D
• Generalized eigenvalue problem:
  P(D̃ − W̃)P g = σ P D̂ g
  where W̃ = affinity between the discrete locations,
  D̃ = diag(Σ_j W̃_ij), D̂ = diag(Σ_j (P W̃)_ij)
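The histogram-based approximation above fits in a short NumPy/SciPy sketch. This is a minimal illustration under stated assumptions, not the authors' code: the function name, the bandwidth heuristic for ε, and the small regularizer on the density are all assumptions:

```python
import numpy as np
from scipy.linalg import eigh

def eigenfunctions_1d(x, n_bins=50, eps=None, k=3):
    """Numerically approximate the smallest-k eigenfunctions of the
    1-D marginal density of samples x, via a B x B histogram system.
    n_bins = B (the slides use 50); eps = affinity bandwidth."""
    h, edges = np.histogram(x, bins=n_bins, density=True)
    centers = 0.5 * (edges[:-1] + edges[1:])   # discrete locations
    if eps is None:
        eps = 0.2 * np.std(x)                  # assumed heuristic default
    p = h + 1e-10                              # density at bin centers (regularized)
    P = np.diag(p)
    # Affinity between the discrete locations
    W = np.exp(-(centers[:, None] - centers[None, :]) ** 2 / (2 * eps ** 2))
    D_tilde = np.diag(W.sum(axis=1))
    D_hat = np.diag((P @ W).sum(axis=1))
    # Generalized eigenproblem  P(D~ − W~)P g = σ P D^ g
    A = P @ (D_tilde - W) @ P
    B = P @ D_hat
    sigma, g = eigh(A, B)                      # eigenvalues in ascending order
    return centers, g[:, :k], sigma[:k]
```

Because the system is only B x B (50 x 50 here), this solve is trivial regardless of how many data points went into the histogram.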
1D Approximate Eigenfunctions
• Solve P(D̃ − W̃)P g = σ P D̂ g
[Figure: 1st, 2nd, and 3rd eigenfunctions of h(x₁)]
Separability over Dimension
• Build a histogram over dimension 2: h(x₂)
• Now solve for the eigenfunctions of h(x₂)
[Figure: data; 1st, 2nd, and 3rd eigenfunctions of h(x₂)]
From Eigenfunctions to Approximate Eigenvectors
• Take each data point
• Do a 1-D interpolation in each eigenfunction
  → k-dimensional vector (for k eigenfunctions)
• Very fast operation (has to be done nk times)
[Figure: eigenfunction value vs. histogram bin (1–50)]
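Each of these nk operations is an ordinary 1-D linear interpolation, which NumPy provides directly. A minimal sketch (the function name and data layout are assumptions, not the authors' code):

```python
import numpy as np

def approx_eigenvectors(X, centers, funcs):
    """Interpolate each data point into the selected 1-D eigenfunctions,
    giving an n x k matrix of approximate eigenvectors.
    X: (n, d) rotated data; centers: dict mapping dimension -> bin centers;
    funcs: list of (dim, g) pairs, one per selected eigenfunction, where g
    holds that eigenfunction's values at the dimension's bin centers."""
    cols = [np.interp(X[:, dim], centers[dim], g) for dim, g in funcs]
    return np.stack(cols, axis=1)
```

Each column of the result approximates one eigenvector of the graph Laplacian, at a cost linear in n.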
Preprocessing
• Need to make the data separable
• Rotate using PCA
[Figure: not separable → rotate → separable]
Overall Algorithm
1. Rotate data to maximize separability (currently use PCA)
2. For each dimension:
   – Construct a 1D histogram
   – Solve numerically for eigenfunctions/values
3. Order eigenfunctions from all dimensions by increasing eigenvalue & take the first k
4. Interpolate data into the k eigenfunctions
   – Yields approximate eigenvectors of the normalized Laplacian
5. Solve the k x k least-squares system to give the label function
Experiments on Toy Data
Comparison of Approaches
[Figure: data; exact eigenvectors; approximate eigenfunctions]
Exact vs. approximate eigenvalues:
  0.0531 vs. 0.0535
  0.1920 vs. 0.1928
  0.2049 vs. 0.2068
  0.2480 vs. 0.5512
  0.3580 vs. 0.7979
Nystrom Comparison
• Too few landmark points result in highly unstable eigenvectors
Nystrom Comparison
• Eigenfunctions fail when data has significant dependencies between dimensions
Experiments on Real Data
Experiments
• Images from 126 classes downloaded from Internet search engines; 63,000 images in total (e.g. "dump truck", "emu")
• Labels (correct/incorrect) provided by Geoff Hinton, Alex Krizhevsky, Vinod Nair (U. Toronto and CIFAR)
Input Image Representation
• Pixels are not a convenient representation
• Use the Gist descriptor (Oliva & Torralba, 2001)
• PCA down to 64 dimensions
• L2 distance between Gist vectors is a rough substitute for human perceptual distance
Are Dimensions Independent?
[Figure: joint histograms for pairs of dimensions from the raw 384-dimensional Gist vs. after PCA. MI is the mutual information score; 0 = independent]
Real 1-D Eigenfunctions of PCA'd Gist Descriptors
[Figure: eigenfunctions 1–256 plotted over histogram bins 1–50 (x_min to x_max); y-axis = eigenfunction value, color = input dimension (1–64)]
Protocol
• Task is to re-rank the images of each class
• Measure precision at 15% recall
• Vary the # of labeled examples
[Figure: mean precision at 15% recall, averaged over 16 classes (0.25 to 0.7), vs. log2 number of positive training examples per class (−Inf to 7); curves for Eigenfunction, Nystrom, Least-squares, Eigenvector, SVM, NN, and Chance are added over successive slides]
80 Million Images
Running on 80 million images
• PCA to 32 dims, k = 48 eigenfunctions
• Precompute approximate eigenvectors (~20 GB)
• For each class, labels propagate through all 80 million images
Summary
• Semi-supervised scheme that can scale to really large problems
• Rather than sub-sampling the data, we take the limit of infinite unlabeled data
• Assumes input data distribution is separable
• Can propagate labels in a graph with 80 million nodes in a fraction of a second
Future Work
• Can potentially use 2D or 3D histograms instead of 1D
  – Requires more data
• Consider diagonal eigenfunctions
• Sharing of labels between classes
Are Dimensions Independent?
[Figure: joint histograms for pairs of dimensions from the raw 384-dimensional Gist vs. after ICA. MI is the mutual information score; 0 = independent]
Overview of Our Approach
• Existing large-scale SSL methods try to reduce the # of points
• We consider what happens as n → ∞
• Eigenvectors → Eigenfunctions
• Assume the input distribution is separable
• Make a crude numerical approximation to the eigenfunctions
• Interpolate the data in these approximate eigenfunctions to give approximate eigenvectors
Eigenfunctions
• Eigenfunctions are the limit of eigenvectors as n → ∞:
  (1/n²) fᵀLf → L_p(F)
• Analytical forms of eigenfunctions exist only in a few cases: uniform, Gaussian
• Instead, we calculate numerical approximations to the eigenfunctions
[Coifman et al. 05, Nadler et al. 06, Belkin & Niyogi 07, Weiss et al. 08]
Complexity Comparison

Nystrom:
1. Select m landmark points
2. Get smallest k eigenvectors of an m x m system
3. Interpolate n points into the k eigenvectors
4. Solve a k x k linear system
→ Polynomial in # landmarks

Eigenfunction:
1. Rotate the n points
2. Form d 1-D histograms
3. Solve d linear systems, each b x b
4. Do k 1-D interpolations of the n points
5. Solve a k x k linear system
→ Linear in # data points

Key: n = # data points (big, >10^6); l = # labeled points (small, <100); m = # landmark points; d = # input dims (~100); k = # eigenvectors (~100); b = # histogram bins (~50)
Key Assumption: Separability of Input Data
• Can't build accurate high-dimensional histograms
  – Need too many points
• Currently just use 1-D histograms
  – 2D or 3D ones possible with enough data
• This assumes the distribution is separable:
  p(x) = p(x₁) p(x₂) … p(x_d)
• For separable distributions, the eigenfunctions are also separable
[Nadler et al. 06, Weiss et al. 08]
Varying # Training Examples
[Figure: mean precision at 15% recall, averaged over 16 classes (0.25 to 0.7), vs. log2 number of positive training examples per class (−Inf to 7); curves for Eigenfunction, Nystrom, Least-squares, Eigenvector, SVM, NN, and Chance]