Modeling Latent Variable Uncertainty for Loss-based Learning

Daphne Koller, Stanford University
Ben Packer, Stanford University
M. Pawan Kumar, École Centrale Paris / École des Ponts ParisTech / INRIA Saclay, Île-de-France
Aim: accurate learning with weakly supervised data

Training data: inputs xi with outputs yi (object detection over six mammal
classes: Bison, Deer, Elephant, Giraffe, Llama, Rhino).

Input x: an image
Output y = "Deer"
Latent variable h: not annotated in the training data
Feature Ψ(x,y,h) (e.g. HOG)

Prediction function f : Ψ(x,y,h) → (-∞, +∞)
Prediction: (y(f), h(f)) = argmax_{y,h} f(Ψ(x,y,h))

Learning: f* = argmin_f Objective(f)

Aim: find a suitable objective function to learn f* — one that encourages
accurate prediction under a user-specified criterion for accuracy.
Latent SVM

Linear prediction function parameterized by w.
Prediction: (y(w), h(w)) = argmax_{y,h} wᵀΨ(x,y,h)
Learning: min_w Σi Δ(yi, yi(w), hi(w)), with a user-defined loss Δ

✔ Loss-based learning
✖ The loss is independent of the true (unknown) latent variable
✖ Does not model uncertainty in latent variables
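The Latent SVM prediction rule above is an exhaustive argmax over output–latent pairs when both sets are finite. A minimal sketch, assuming a user-supplied joint feature map `psi` and explicit candidate sets (all names are illustrative, not from the talk):

```python
import numpy as np

def predict(w, x, labels, latent_vals, psi):
    """Latent SVM prediction: (y(w), h(w)) = argmax_{y,h} w^T psi(x, y, h).

    `labels` and `latent_vals` are the finite candidate sets for y and h;
    `psi` maps (x, y, h) to a feature vector of the same dimension as w.
    """
    best_score, best_pair = -np.inf, None
    for y in labels:
        for h in latent_vals:
            score = w @ psi(x, y, h)
            if score > best_score:
                best_score, best_pair = score, (y, h)
    return best_pair
```

In practice (e.g. bounding boxes in object detection) the latent space is too large to enumerate and the argmax is computed with a specialized search, but the scoring rule is the same.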
Expectation Maximization

Joint probability: Pθ(y,h|x) = exp(θᵀΨ(x,y,h)) / Z
Prediction: (y(θ), h(θ)) = argmax_{y,h} θᵀΨ(x,y,h)
Learning: max_θ Σi log Pθ(yi|xi) = max_θ Σi log Σhi Pθ(yi,hi|xi)

✔ Models uncertainty in latent variables
✖ Does not model the accuracy of the latent variable prediction
✖ No user-defined loss function
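The learning objective above marginalizes the latent variable inside the log. A small numerical sketch of one such term, with a toy feature map and a stable log-sum-exp (all names are illustrative):

```python
import numpy as np

def log_likelihood(theta, x, y, labels, latent_vals, psi):
    """log P_theta(y|x) for P_theta(y,h|x) = exp(theta^T psi(x,y,h)) / Z,
    computed as log sum_h P_theta(y, h | x)."""
    scores = np.array([[theta @ psi(x, yy, h) for h in latent_vals]
                       for yy in labels])
    # log Z: stable log-sum-exp over all (y, h) pairs
    m = scores.max()
    log_z = m + np.log(np.exp(scores - m).sum())
    # log of the unnormalized marginal for the observed label y
    s_y = scores[labels.index(y)]
    m_y = s_y.max()
    return m_y + np.log(np.exp(s_y - m_y).sum()) - log_z
```

With θ = 0 the joint is uniform, so log Pθ(y|x) reduces to −log |labels|, which is a convenient sanity check.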
Solution

Use two different distributions for the two different tasks:

Pθ(hi|yi,xi) — a conditional distribution over the latent variable hi,
to model uncertainty in latent variables.

Pw(yi,hi|xi) — a delta distribution concentrated on the prediction
(yi(w), hi(w)), to model the accuracy of latent variable predictions.
The Ideal Case: no latent variable uncertainty, correct prediction

Pθ(hi|yi,xi) places all its mass on the true latent variable hi, and the
prediction (yi(w), hi(w)) coincides with the ground truth (yi, hi).

In Practice: restrictions in the representational power of the models mean
the two distributions cannot match exactly.
Our Framework: minimize the dissimilarity between the two distributions

The dissimilarity between Pw(yi,hi|xi) and Pθ(hi|yi,xi) is measured by a
user-defined measure — Rao's dissimilarity coefficient — built from the
loss Δ:

  Hi(w,θ) = Σh Δ(yi,h,yi(w),hi(w)) Pθ(h|yi,xi)
  Hi(θ,θ) = Σh,h' Δ(yi,h,yi,h') Pθ(h|yi,xi) Pθ(h'|yi,xi)

  min_{w,θ} Σi Hi(w,θ) − β Hi(θ,θ) − (1−β) Δ(yi(w),hi(w),yi(w),hi(w))

The last term vanishes, since the loss of a prediction against itself is 0,
leaving

  min_{w,θ} Σi Hi(w,θ) − β Hi(θ,θ)
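For a finite latent space, the per-sample terms Hi(w,θ) and Hi(θ,θ) are plain weighted sums. A sketch of evaluating them, assuming the conditional Pθ(h|yi,xi) and the loss values are given as arrays (all names are hypothetical):

```python
import numpy as np

def H_w_theta(loss_to_pred, p_h):
    """H_i(w,theta) = sum_h Delta(y_i, h, y_i(w), h_i(w)) P_theta(h|y_i,x_i).
    loss_to_pred[h] holds Delta(y_i, h, y_i(w), h_i(w))."""
    return float(np.dot(loss_to_pred, p_h))

def H_theta_theta(loss_pairs, p_h):
    """H_i(theta,theta) = sum_{h,h'} Delta(y_i,h,y_i,h') P(h) P(h')."""
    return float(p_h @ loss_pairs @ p_h)

def dissimilarity(loss_to_pred, loss_pairs, p_h, beta):
    """Per-sample objective H_i(w,theta) - beta * H_i(theta,theta)."""
    return H_w_theta(loss_to_pred, p_h) - beta * H_theta_theta(loss_pairs, p_h)
```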
Optimization

min_{w,θ} Σi Hi(w,θ) − β Hi(θ,θ)

Initialize the parameters to w0 and θ0.
Repeat until convergence:
  1. Fix w and optimize θ.
  2. Fix θ and optimize w.
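The alternating scheme above is a generic block-coordinate loop. A sketch, with the two inner solvers passed in as placeholders (names are illustrative):

```python
def alternate_minimize(w0, theta0, optimize_theta, optimize_w,
                       objective, tol=1e-6, max_iters=100):
    """Block-coordinate descent: fix w and optimize theta, then fix
    theta and optimize w, until the objective stops decreasing."""
    w, theta = w0, theta0
    prev = objective(w, theta)
    for _ in range(max_iters):
        theta = optimize_theta(w, theta)  # w held fixed
        w = optimize_w(w, theta)          # theta held fixed
        cur = objective(w, theta)
        if prev - cur < tol:
            break
        prev = cur
    return w, theta
```

Each step can only decrease the objective, so the loop converges to a local minimum (the joint problem is non-convex, as with Latent SVM and EM).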
Optimization of θ

min_θ Σi Σh Δ(yi,h,yi(w),hi(w)) Pθ(h|yi,xi) − β Hi(θ,θ)

Case I: yi(w) = yi. The expected-loss term encourages Pθ(h|yi,xi) to place
its mass on latent values close, under Δ, to the prediction hi(w).

Case II: yi(w) ≠ yi. The predicted output is wrong; for a loss such as 0/1
the expected-loss term is then constant in θ, and only the −β Hi(θ,θ) term
shapes the distribution.

Solved by stochastic subgradient descent.
Optimization of w

min_w Σi Σh Δ(yi,h,yi(w),hi(w)) Pθ(h|yi,xi)

This is an expected loss, which models the uncertainty in h. The form of
the optimization is similar to Latent SVM and is solved with the
Concave-Convex Procedure (CCCP).

Observation: when Δ is independent of the true h, our framework is
equivalent to Latent SVM.
Object Detection

Mammals dataset, six classes: Bison, Deer, Elephant, Giraffe, Llama, Rhino.
Input x: image. Output y = "Deer". Latent variable h.
Training data: inputs xi with outputs yi only.
60/40 train/test split, 5 folds.
Results – 0/1 Loss

[Bar chart: average test loss on folds 1–5 (y-axis 0 to 0.9) for LSVM vs.
our method; the difference is statistically significant.]
Results – Overlap Loss

[Bar chart: average test loss on folds 1–5 (y-axis 0 to 0.6) for LSVM vs.
our method.]
Action Detection

PASCAL VOC 2011, ten action classes: Jumping, Phoning, Playing Instrument,
Reading, Riding Bike, Riding Horse, Running, Taking Photo, Using Computer,
Walking.
Input x: image. Output y = "Using Computer". Latent variable h.
Training data: inputs xi with outputs yi only.
60/40 train/test split, 5 folds.
Results – 0/1 Loss

[Bar chart: average test loss on folds 1–5 (y-axis 0 to 1.2) for LSVM vs.
our method; the difference is statistically significant.]
Results – Overlap Loss

[Bar chart: average test loss on folds 1–5 (y-axis 0.62 to 0.74) for LSVM
vs. our method; the difference is statistically significant.]