Page 1: cvpr2009 tutorial: kernel methods in computer vision: part II: Statistics and Clustering with Kernels, Structured Output Learning

Statistics and Clustering with Kernels

Christoph Lampert & Matthew Blaschko

Max-Planck-Institute for Biological Cybernetics, Department Schölkopf: Empirical Inference

Tübingen, Germany

Visual Geometry Group, University of Oxford, United Kingdom

June 20, 2009

Page 2: cvpr2009 tutorial: kernel methods in computer vision: part II: Statistics and Clustering with Kernels, Structured Output Learning

Overview

Kernel Ridge Regression
Kernel PCA
Spectral Clustering
Kernel Covariance and Canonical Correlation Analysis
Kernel Measures of Independence

Page 3: cvpr2009 tutorial: kernel methods in computer vision: part II: Statistics and Clustering with Kernels, Structured Output Learning

Kernel Ridge Regression

Regularized least squares regression:

$$\min_{w} \; \sum_{i=1}^{n} \left( y_i - \langle w, x_i \rangle \right)^2 + \lambda \|w\|^2$$

Replace $w$ with $\sum_{i=1}^{n} \alpha_i x_i$:

$$\min_{\alpha} \; \sum_{i=1}^{n} \left( y_i - \sum_{j=1}^{n} \alpha_j \langle x_i, x_j \rangle \right)^2 + \lambda \sum_{i=1}^{n} \sum_{j=1}^{n} \alpha_i \alpha_j \langle x_i, x_j \rangle$$

$\alpha^*$ can be obtained in closed form:

$$\alpha^* = (K + \lambda I)^{-1} y$$
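
A minimal NumPy sketch of this closed form, assuming a Gaussian kernel; the helper names and the toy data are illustrative, not part of the tutorial:

```python
import numpy as np

def gaussian_kernel(A, B, gamma=1.0):
    # pairwise squared distances between rows of A and rows of B
    d2 = np.sum(A**2, 1)[:, None] + np.sum(B**2, 1)[None, :] - 2 * A @ B.T
    return np.exp(-gamma * d2)

def kernel_ridge_fit(K, y, lam=1e-2):
    # alpha* = (K + lambda I)^{-1} y
    return np.linalg.solve(K + lam * np.eye(K.shape[0]), y)

def kernel_ridge_predict(K_test_train, alpha):
    # f(x) = sum_i alpha_i k(x, x_i)
    return K_test_train @ alpha

# toy usage
rng = np.random.default_rng(0)
X = rng.normal(size=(50, 3))
y = np.sin(X[:, 0]) + 0.1 * rng.normal(size=50)
K = gaussian_kernel(X, X)
alpha = kernel_ridge_fit(K, y)
y_hat = kernel_ridge_predict(K, alpha)   # predictions on the training points
```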

Page 5: cvpr2009 tutorial: kernel methods in computer vision: part II: Statistics and Clustering with Kernels, Structured Output Learning

PCA

Equivalent formulations:

- Minimize squared error between the original data and a projection of our data into a lower dimensional subspace
- Maximize variance of the projected data

Solutions: Eigenvectors of the empirical covariance matrix

(fig: Tristan Jehan)

Page 6: cvpr2009 tutorial: kernel methods in computer vision: part II: Statistics and Clustering with Kernels, Structured Output Learning

PCA continued

Empirical covariance matrix (biased):

$$C = \frac{1}{n} \sum_i (x_i - \mu)(x_i - \mu)^T$$

where $\mu$ is the sample mean.

$C$ is symmetric and positive (semi-)definite.

PCA:

$$\max_{w} \; \frac{w^T C w}{\|w\|^2}$$

Page 7: cvpr2009 tutorial: kernel methods in computer vision: part II: Statistics and Clustering with Kernels, Structured Output Learning

Data Centering

We use the notation $X$ to denote the design matrix, where every column of $X$ is a data sample.

We can define a centering matrix

$$H = I - \frac{1}{n} e e^T$$

where $e$ is a vector of all ones.

$H$ is idempotent, symmetric, and positive semi-definite (rank $n-1$).

The design matrix of centered data can be written compactly in matrix form as $XH$.

- The $i$th column of $XH$ is equal to $x_i - \mu$, where $\mu = \frac{1}{n} \sum_j x_j$ is the sample mean.

Page 10: cvpr2009 tutorial: kernel methods in computer vision: part II: Statistics and Clustering with Kernels, Structured Output Learning

Kernel PCA

PCA:

$$\max_{w} \; \frac{w^T C w}{\|w\|^2}$$

Kernel PCA:

- Replace $w$ by $\sum_i \alpha_i (x_i - \mu)$; this can be represented compactly in matrix form by $w = XH\alpha$, where $X$ is the design matrix, $H$ is the centering matrix, and $\alpha$ is the coefficient vector.
- Compute $C$ in matrix form as $C = \frac{1}{n} X H X^T$.
- Denote the matrix of pairwise inner products $K = X^T X$, i.e. $K_{ij} = \langle x_i, x_j \rangle$.

$$\max_{w} \frac{w^T C w}{\|w\|^2} = \max_{\alpha} \frac{\frac{1}{n}\, \alpha^T H K H K H \alpha}{\alpha^T H K H \alpha}$$

This is a Rayleigh quotient with known solution

$$H K H \beta_i = \lambda_i \beta_i$$

Page 14: cvpr2009 tutorial: kernel methods in computer vision: part II: Statistics and Clustering with Kernels, Structured Output Learning

Kernel PCA

Set $\beta$ to be the eigenvectors of $HKH$, and $\lambda$ the corresponding eigenvalues.

Set $\alpha = \beta \lambda^{-\frac{1}{2}}$.

Example, image super-resolution:

(fig: Kim et al., PAMI 2005.)
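
A short NumPy sketch of these steps (center the kernel matrix, take the leading eigenvectors of $HKH$, rescale by $\lambda^{-1/2}$); the function name and the returned quantities are illustrative:

```python
import numpy as np

def kernel_pca(K, n_components=2):
    """Kernel PCA from a precomputed kernel matrix K (sketch)."""
    n = K.shape[0]
    H = np.eye(n) - np.ones((n, n)) / n          # centering matrix
    HKH = H @ K @ H
    eigvals, eigvecs = np.linalg.eigh(HKH)       # ascending eigenvalues
    idx = np.argsort(eigvals)[::-1][:n_components]
    lam, beta = eigvals[idx], eigvecs[:, idx]    # leading eigenpairs (assumed positive)
    alpha = beta / np.sqrt(lam)                  # alpha = beta * lambda^{-1/2}
    projections = HKH @ alpha                    # coordinates of the training points
    return projections, alpha
```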

Page 16: cvpr2009 tutorial: kernel methods in computer vision: part II: Statistics and Clustering with Kernels, Structured Output Learning

Spectral Clustering

Represent similarity of images by weights on a graph.

Normalized cuts optimizes the ratio of the cost of a cut and the volume of each cluster:

$$\mathrm{Ncut}(A_1, \dots, A_k) = \sum_{i=1}^{k} \frac{\mathrm{cut}(A_i, \bar{A}_i)}{\mathrm{vol}(A_i)}$$

Exact optimization is NP-hard, but a relaxed version can be solved by finding the eigenvectors of the graph Laplacian

$$L = I - D^{-\frac{1}{2}} A D^{-\frac{1}{2}}$$

where $D$ is the diagonal matrix with entries equal to the row sums of the similarity matrix $A$.

Page 18: cvpr2009 tutorial: kernel methods in computer vision: part II: Statistics and Clustering with Kernels, Structured Output Learning

Spectral Clustering (continued)

Compute $L = I - D^{-\frac{1}{2}} A D^{-\frac{1}{2}}$.

Map data points based on the eigenvectors of $L$.

Example, handwritten digits (0-9):

(fig: Xiaofei He)

Cluster in the mapped space using k-means.
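
A compact sketch of this pipeline in NumPy (normalized Laplacian, spectral embedding, then a plain k-means step); it assumes a connected graph with positive degrees, and the inline k-means is only illustrative:

```python
import numpy as np

def spectral_clustering(A, k, n_iter=100, seed=0):
    """Cluster points given a similarity matrix A into k groups (sketch)."""
    d = A.sum(axis=1)                                  # degrees (assumed > 0)
    D_inv_sqrt = np.diag(1.0 / np.sqrt(d))
    L = np.eye(len(A)) - D_inv_sqrt @ A @ D_inv_sqrt   # normalized graph Laplacian
    _, eigvecs = np.linalg.eigh(L)
    U = eigvecs[:, :k]                                 # eigenvectors of the k smallest eigenvalues
    U = U / np.linalg.norm(U, axis=1, keepdims=True)   # row-normalize the embedding
    rng = np.random.default_rng(seed)
    centers = U[rng.choice(len(U), k, replace=False)]  # basic k-means on the embedded points
    for _ in range(n_iter):
        labels = np.argmin(((U[:, None, :] - centers[None]) ** 2).sum(-1), axis=1)
        centers = np.array([U[labels == j].mean(axis=0) if np.any(labels == j) else centers[j]
                            for j in range(k)])
    return labels
```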

Page 20: cvpr2009 tutorial: kernel methods in computer vision: part II: Statistics and Clustering with Kernels, Structured Output Learning

Multimodal Data

A latent aspect relates data that are present in multiple modalities, e.g. images and text.

(diagram: a latent aspect $z$ with arrows to $\varphi_x(x)$ and $\varphi_y(y)$, where $x$ is an image and $y$ is the caption "A view from Idyllwild, California, with pine trees and snow capped Marion Mountain under a blue sky.")

Learn kernelized projections that relate both spaces

Page 22: cvpr2009 tutorial: kernel methods in computer vision: part II: Statistics and Clustering with Kernels, Structured Output Learning

Kernel Covariance

KPCA is maximization of auto-covariance. Instead, maximize cross-covariance:

$$\max_{w_x, w_y} \frac{w_x^T C_{xy} w_y}{\|w_x\|\, \|w_y\|}$$

This can also be kernelized (replace $w_x$ by $\sum_i \alpha_i (x_i - \mu_x)$, etc.):

$$\max_{\alpha, \beta} \frac{\alpha^T H K_x H K_y H \beta}{\sqrt{\alpha^T H K_x H \alpha \;\; \beta^T H K_y H \beta}}$$

The solution is given by the (generalized) eigenproblem

$$\begin{pmatrix} 0 & H K_x H K_y H \\ H K_y H K_x H & 0 \end{pmatrix} \begin{pmatrix} \alpha \\ \beta \end{pmatrix} = \lambda \begin{pmatrix} H K_x H & 0 \\ 0 & H K_y H \end{pmatrix} \begin{pmatrix} \alpha \\ \beta \end{pmatrix}$$

Page 25: cvpr2009 tutorial: kernel methods in computer vision: part II: Statistics and Clustering with Kernels, Structured Output Learning

Kernel Canonical Correlation Analysis (KCCA)

Alternately, maximize correlation instead of covariance:

$$\max_{w_x, w_y} \frac{w_x^T C_{xy} w_y}{\sqrt{w_x^T C_{xx} w_x \;\; w_y^T C_{yy} w_y}}$$

Kernelization is straightforward as before:

$$\max_{\alpha, \beta} \frac{\alpha^T H K_x H K_y H \beta}{\sqrt{\alpha^T (H K_x H)^2 \alpha \;\; \beta^T (H K_y H)^2 \beta}}$$

Page 27: cvpr2009 tutorial: kernel methods in computer vision: part II: Statistics and Clustering with Kernels, Structured Output Learning

KCCA (continued)

Problem:

If the data in either modality are linearly independent (as many dimensions as data points), there exists a projection of the data that respects any arbitrary ordering.

Perfect correlation can always be achieved.

This is even more likely when a kernel is used (e.g. Gaussian).

Solution: regularize

$$\max_{w_x, w_y} \frac{w_x^T C_{xy} w_y}{\sqrt{\left(w_x^T C_{xx} w_x + \varepsilon_x \|w_x\|^2\right)\left(w_y^T C_{yy} w_y + \varepsilon_y \|w_y\|^2\right)}}$$

As $\varepsilon_x \to \infty$, $\varepsilon_y \to \infty$, the solution approaches maximum covariance.

Page 30: cvpr2009 tutorial: kernel methods in computer vision: part II: Statistics and Clustering with Kernels, Structured Output Learning

KCCA Algorithm

Compute $K_x$, $K_y$.

Solve for $\alpha$ and $\beta$ as the eigenvectors of

$$\begin{pmatrix} 0 & H K_x H K_y H \\ H K_y H K_x H & 0 \end{pmatrix} \begin{pmatrix} \alpha \\ \beta \end{pmatrix} = \lambda \begin{pmatrix} (H K_x H)^2 + \varepsilon_x H K_x H & 0 \\ 0 & (H K_y H)^2 + \varepsilon_y H K_y H \end{pmatrix} \begin{pmatrix} \alpha \\ \beta \end{pmatrix}$$

Page 31: cvpr2009 tutorial: kernel methods in computer vision: part II: Statistics and Clustering with Kernels, Structured Output Learning

Content Based Image Retrieval with KCCA

Hardoon et al., 2004.

Training data consists of images with text captions.

Learn embeddings of both spaces using KCCA and appropriately chosen image and text kernels.

Retrieval consists of finding images whose embeddings are related to the embedding of the text query.

A kind of multi-variate regression

Page 34: cvpr2009 tutorial: kernel methods in computer vision: part II: Statistics and Clustering with Kernels, Structured Output Learning

Kernel Measures of Independence

We know how to measure correlation in the kernelized space

Independence implies zero correlation

Different kernels encode different statistical properties of the data.

Use an appropriate kernel such that zero correlation in the Hilbert space implies independence.

Page 38: cvpr2009 tutorial: kernel methods in computer vision: part II: Statistics and Clustering with Kernels, Structured Output Learning

Example: Polynomial Kernel

A first degree polynomial kernel (i.e. linear) captures correlation only.

A second degree polynomial kernel captures all second order statistics.

...

A Gaussian kernel can be written

$$k(x_i, x_j) = e^{-\gamma \|x_i - x_j\|^2} = e^{-\gamma \langle x_i, x_i \rangle}\, e^{2\gamma \langle x_i, x_j \rangle}\, e^{-\gamma \langle x_j, x_j \rangle}$$

and we can use the identity

$$e^z = \sum_{i=0}^{\infty} \frac{1}{i!} z^i$$

We can view the Gaussian kernel as being related to an appropriately scaled infinite dimensional polynomial kernel, which captures statistics of all orders.

Page 42: cvpr2009 tutorial: kernel methods in computer vision: part II: Statistics and Clustering with Kernels, Structured Output Learning

Hilbert-Schmidt Independence Criterion

$\mathcal{F}$ is an RKHS on $\mathcal{X}$ with kernel $k_x(x, x')$; $\mathcal{G}$ is an RKHS on $\mathcal{Y}$ with kernel $k_y(y, y')$.

Covariance operator: $C_{xy}: \mathcal{G} \to \mathcal{F}$ such that

$$\langle f, C_{xy} g \rangle_{\mathcal{F}} = \mathbb{E}_{x,y}[f(x) g(y)] - \mathbb{E}_x[f(x)]\, \mathbb{E}_y[g(y)]$$

HSIC is the Hilbert-Schmidt norm of $C_{xy}$ (Fukumizu et al., 2008):

$$\mathrm{HSIC} := \|C_{xy}\|^2_{HS}$$

(Biased) empirical HSIC:

$$\mathrm{HSIC} := \frac{1}{n^2} \mathrm{Tr}(K_x H K_y H)$$
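
A direct NumPy translation of the biased empirical estimator, assuming precomputed kernel matrices; the toy example with Gaussian kernels is only illustrative:

```python
import numpy as np

def hsic_biased(Kx, Ky):
    """Biased empirical HSIC = (1/n^2) Tr(Kx H Ky H)."""
    n = Kx.shape[0]
    H = np.eye(n) - np.ones((n, n)) / n
    return np.trace(Kx @ H @ Ky @ H) / n**2

# toy usage: x and y are strongly dependent, so HSIC is clearly non-zero
rng = np.random.default_rng(0)
x = rng.normal(size=(200, 1))
y = x**2 + 0.1 * rng.normal(size=(200, 1))
Kx = np.exp(-np.square(x - x.T))     # Gaussian kernels with gamma = 1
Ky = np.exp(-np.square(y - y.T))
print(hsic_biased(Kx, Ky))
```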

Page 46: cvpr2009 tutorial: kernel methods in computer vision: part II: Statistics and Clustering with Kernels, Structured Output Learning

Hilbert-Schmidt Independence Criterion (continued)

Ring-shaped density, correlation approx. zero.

Maximum singular vectors (functions) of $C_{xy}$.

(figure: scatter of the ring-shaped joint density of X and Y with correlation -0.00; the dependence witness functions f(x) and g(y); and a scatter of f(X) against g(Y) with correlation -0.90, COCO: 0.14)

Page 47: cvpr2009 tutorial: kernel methods in computer vision: part II: Statistics and Clustering with Kernels, Structured Output Learning

Hilbert-Schmidt Normalized Independence Criterion

The Hilbert-Schmidt Independence Criterion is analogous to cross-covariance. Can we construct a version analogous to correlation?

Simple modification: decompose the covariance operator (Baker, 1973)

$$C_{xy} = C_{xx}^{\frac{1}{2}} V_{xy} C_{yy}^{\frac{1}{2}}$$

where $V_{xy}$ is the normalized cross-covariance operator (maximum singular value is bounded by 1).

Use the norm of $V_{xy}$ instead of the norm of $C_{xy}$.

Page 50: cvpr2009 tutorial: kernel methods in computer vision: part II: Statistics and Clustering with Kernels, Structured Output Learning

Hilbert-Schmidt Normalized Independence Criterion (continued)

Define the normalized independence criterion to be the Hilbert-Schmidt norm of $V_{xy}$:

$$\mathrm{HSNIC} := \frac{1}{n^2} \mathrm{Tr}\!\left[ H K_x H \left(H K_x H + \varepsilon_x I\right)^{-1} H K_y H \left(H K_y H + \varepsilon_y I\right)^{-1} \right]$$

where $\varepsilon_x$ and $\varepsilon_y$ are regularization parameters as in KCCA.

If the kernels on $x$ and $y$ are characteristic (e.g. Gaussian kernels, see Fukumizu et al., 2008),

$$\|C_{xy}\|^2_{HS} = \|V_{xy}\|^2_{HS} = 0 \;\text{ iff }\; x \text{ and } y \text{ are independent!}$$
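
The corresponding normalized criterion in NumPy, under the same assumptions as the HSIC sketch above (precomputed kernel matrices; the regularization values are illustrative):

```python
import numpy as np

def hsnic(Kx, Ky, eps_x=1e-3, eps_y=1e-3):
    """(1/n^2) Tr[ HKxH (HKxH + eps_x I)^-1 HKyH (HKyH + eps_y I)^-1 ] (sketch)."""
    n = Kx.shape[0]
    H = np.eye(n) - np.ones((n, n)) / n
    Kxc, Kyc = H @ Kx @ H, H @ Ky @ H
    Rx = Kxc @ np.linalg.inv(Kxc + eps_x * np.eye(n))
    Ry = Kyc @ np.linalg.inv(Kyc + eps_y * np.eye(n))
    return np.trace(Rx @ Ry) / n**2
```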

Page 52: cvpr2009 tutorial: kernel methods in computer vision: part II: Statistics and Clustering with Kernels, Structured Output Learning

Applications of HS(N)IC

Independence tests: is there anything to gain from the use of multi-modal data?

Kernel ICA

Maximize dependence with respect to some model parameters:
- Kernel target alignment (Cristianini et al., 2001)
- Learning spectral clustering (Bach & Jordan, 2003): relates kernel learning and clustering
- Taxonomy discovery (Blaschko & Gretton, 2008)

Page 57: cvpr2009 tutorial: kernel methods in computer vision: part II: Statistics and Clustering with Kernels, Structured Output Learning

Summary

In this section we learned how to:

Do basic operations in kernel space, like:
- Regularized least squares regression
- Data centering
- PCA

Learn with multi-modal data:
- Kernel Covariance
- KCCA

Use kernels to construct statistical independence tests:
- Use appropriate kernels to capture relevant statistics
- Measure dependence by the norm of the (normalized) covariance operator
- Closed form solutions requiring only kernel matrices for each modality

Questions?

Page 61: cvpr2009 tutorial: kernel methods in computer vision: part II: Statistics and Clustering with Kernels, Structured Output Learning

Structured Output Learning

Christoph Lampert & Matthew Blaschko

Max-Planck-Institute for Biological Cybernetics, Department Schölkopf: Empirical Inference

Tübingen, Germany

Visual Geometry Group, University of Oxford, United Kingdom

June 20, 2009

Page 62: cvpr2009 tutorial: kernel methods in computer vision: part II: Statistics and Clustering with Kernels, Structured Output Learning

What is Structured Output Learning?

Regression maps from an input space to an output space

g : X → Y

In typical scenarios, $\mathcal{Y} \equiv \mathbb{R}$ (regression) or $\mathcal{Y} \equiv \{-1, 1\}$ (classification).

Structured output learning extends this concept to more complex and interdependent output spaces.

Page 65: cvpr2009 tutorial: kernel methods in computer vision: part II: Statistics and Clustering with Kernels, Structured Output Learning

Examples of Structured Output Problems in Computer Vision

Multi-class classification (Crammer & Singer, 2001)
Hierarchical classification (Cai & Hofmann, 2004)
Segmentation of 3d scan data (Anguelov et al., 2005)
Learning a CRF model for stereo vision (Li & Huttenlocher, 2008)
Object localization (Blaschko & Lampert, 2008)
Segmentation with a learned CRF model (Szummer et al., 2008)
...
More examples at CVPR 2009

Page 66: cvpr2009 tutorial: kernel methods in computer vision: part II: Statistics and Clustering with Kernels, Structured Output Learning

Generalization of Regression

Direct discriminative learning of $g: \mathcal{X} \to \mathcal{Y}$:
- Penalize errors for this mapping

Two basic assumptions employed:
- Use of a compatibility function $f: \mathcal{X} \times \mathcal{Y} \to \mathbb{R}$
- $g$ takes the form of a decoding function, $g(x) = \operatorname{argmax}_y f(x, y)$
- $f$ is linear w.r.t. a joint kernel, $f(x, y) = \langle w, \varphi(x, y) \rangle$

Page 69: cvpr2009 tutorial: kernel methods in computer vision: part II: Statistics and Clustering with Kernels, Structured Output Learning

Multi-Class Joint Feature Map

A simple joint kernel map: define $\varphi_y(y_i)$ to be the vector with 1 in place of the current class, and 0 elsewhere,

$$\varphi_y(y_i) = [0, \dots, \underbrace{1}_{k\text{th position}}, \dots, 0]^T$$

if $y_i$ represents a sample that is a member of class $k$.

$\varphi_x(x_i)$ can result from any kernel over $\mathcal{X}$:

$$k_x(x_i, x_j) = \langle \varphi_x(x_i), \varphi_x(x_j) \rangle$$

Set $\varphi(x_i, y_i) = \varphi_y(y_i) \otimes \varphi_x(x_i)$, where $\otimes$ represents the Kronecker product.
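
A tiny sketch of this construction; the class indexing convention (labels 0..K-1) and the function name are assumptions for illustration:

```python
import numpy as np

def joint_feature_map(phi_x, y, n_classes):
    """phi(x, y) = phi_y(y) Kronecker phi_x(x) for a multi-class problem, y in {0, ..., n_classes-1}."""
    phi_y = np.zeros(n_classes)
    phi_y[y] = 1.0
    return np.kron(phi_y, phi_x)   # places phi_x in the block that belongs to class y

# example: a 3-dimensional input feature and 4 classes give a 12-dimensional joint feature
phi = joint_feature_map(np.array([0.5, -1.0, 2.0]), y=2, n_classes=4)
```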

Page 72: cvpr2009 tutorial: kernel methods in computer vision: part II: Statistics and Clustering with Kernels, Structured Output Learning

Multiclass Perceptron

Reminder: we want

$$\langle w, \varphi(x_i, y_i) \rangle > \langle w, \varphi(x_i, y) \rangle \quad \forall y \ne y_i$$

Example: perceptron training with a multiclass joint feature map. The gradient of the loss for example $i$ is

$$\partial_w \ell(x_i, y_i, w) = \begin{cases} 0 & \text{if } \langle w, \varphi(x_i, y_i) \rangle \ge \langle w, \varphi(x_i, y) \rangle \;\; \forall y \ne y_i \\ \varphi(x_i, y_i) - \varphi(x_i, \hat{y}) & \text{otherwise, with } \hat{y} = \operatorname{argmax}_{y \ne y_i} \langle w, \varphi(x_i, y) \rangle \end{cases}$$
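
A runnable sketch of perceptron training with this joint feature map; the data layout (rows of X are inputs, integer labels) and the epoch count are illustrative assumptions:

```python
import numpy as np

def joint_feature(phi_x, y, n_classes):
    out = np.zeros(n_classes * len(phi_x))
    out[y * len(phi_x):(y + 1) * len(phi_x)] = phi_x   # same as phi_y(y) Kronecker phi_x
    return out

def multiclass_perceptron(X, y, n_classes, n_epochs=10):
    """Perceptron updates in the joint feature space (sketch)."""
    d = X.shape[1]
    w = np.zeros(n_classes * d)
    for _ in range(n_epochs):
        for xi, yi in zip(X, y):
            scores = [w @ joint_feature(xi, c, n_classes) for c in range(n_classes)]
            y_hat = int(np.argmax(scores))
            if y_hat != yi:
                # move w toward the correct joint feature and away from the predicted one
                w += joint_feature(xi, yi, n_classes) - joint_feature(xi, y_hat, n_classes)
    return w

# toy usage on three well-separated 2-D clusters
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(m, 0.3, size=(30, 2)) for m in ([0, 0], [3, 0], [0, 3])])
labels = np.repeat([0, 1, 2], 30)
w = multiclass_perceptron(X, labels, n_classes=3)
```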

Page 74: cvpr2009 tutorial: kernel methods in computer vision: part II: Statistics and Clustering with Kernels, Structured Output Learning

Perceptron Training with Multiclass Joint Feature Map

(Pages 74-90 show an animation of the perceptron iterations on a toy multi-class problem, ending with the final result. Credit: Lyndsey Pickup)

Page 91: cvpr2009 tutorial: kernel methods in computer vision: part II: Statistics and Clustering with Kernels, Structured Output Learning

Crammer & Singer Multi-Class SVM

Instead of training using a perceptron, we can enforce a large margin and do a batch convex optimization:

$$\min_w \; \frac{1}{2}\|w\|^2 + C \sum_{i=1}^{n} \xi_i$$
$$\text{s.t.} \quad \langle w, \varphi(x_i, y_i) \rangle - \langle w, \varphi(x_i, y) \rangle \ge 1 - \xi_i \quad \forall y \ne y_i$$

This can also be written only in terms of kernels:

$$w = \sum_x \sum_y \alpha_{xy}\, \varphi(x, y)$$

We can use a joint kernel

$$k: \mathcal{X} \times \mathcal{Y} \times \mathcal{X} \times \mathcal{Y} \to \mathbb{R}, \qquad k(x_i, y_i, x_j, y_j) = \langle \varphi(x_i, y_i), \varphi(x_j, y_j) \rangle$$

Page 93: cvpr2009 tutorial: kernel methods in computer vision: part II: Statistics and Clustering with Kernels, Structured Output Learning

Structured Output Support Vector Machines (SO-SVM)

Frame structured prediction as a multiclass problem:
- predict a single element of $\mathcal{Y}$ and pay a penalty for mistakes

Not all errors are created equally:
- e.g. in an HMM, making only one mistake in a sequence should be penalized less than making 50 mistakes

Pay a loss proportional to the difference between the true and predicted output (task dependent):

$$\Delta(y_i, y)$$

Page 96: cvpr2009 tutorial: kernel methods in computer vision: part II: Statistics and Clustering with Kernels, Structured Output Learning

Margin Rescaling

Variant: Margin-Rescaled Joint-Kernel SVM for output space $\mathcal{Y}$ (Tsochantaridis et al., 2005).

Idea: some wrong labels are worse than others: loss $\Delta(y_i, y)$.

Solve

$$\min_w \; \|w\|^2 + C \sum_{i=1}^{n} \xi_i$$
$$\text{s.t.} \quad \langle w, \varphi(x_i, y_i) \rangle - \langle w, \varphi(x_i, y) \rangle \ge \Delta(y_i, y) - \xi_i \quad \forall y \in \mathcal{Y} \setminus \{y_i\}$$

Classify new samples using $g: \mathcal{X} \to \mathcal{Y}$:

$$g(x) = \operatorname{argmax}_{y \in \mathcal{Y}} \langle w, \varphi(x, y) \rangle$$

Another variant is slack rescaling (see Tsochantaridis et al., 2005).

Page 98: cvpr2009 tutorial: kernel methods in computer vision: part II: Statistics and Clustering with Kernels, Structured Output Learning

Label Sequence Learning

For, e.g., handwritten character recognition, it may be useful to include a temporal model in addition to learning each character individually.

As a simple example, take an HMM.

We need to model emission probabilities and transition probabilities:
- Learn these discriminatively

Page 100: cvpr2009 tutorial: kernel methods in computer vision: part II: Statistics and Clustering with Kernels, Structured Output Learning

A Joint Kernel Map for Label Sequence Learning

Emissions (blue):
- $f_e(x_i, y_i) = \langle w_e, \varphi_e(x_i, y_i) \rangle$
- Can simply use the multi-class joint feature map for $\varphi_e$

Transitions (green):
- $f_t(y_i, y_{i+1}) = \langle w_t, \varphi_t(y_i, y_{i+1}) \rangle$
- Can use $\varphi_t(y_i, y_{i+1}) = \varphi_y(y_i) \otimes \varphi_y(y_{i+1})$

Page 105: cvpr2009 tutorial: kernel methods in computer vision: part II: Statistics and Clustering with Kernels, Structured Output Learning

A Joint Kernel Map for Label Sequence Learning (continued)

$$p(x, y) \propto \prod_i e^{f_e(x_i, y_i)} \prod_i e^{f_t(y_i, y_{i+1})} \quad \text{for an HMM}$$

$$f(x, y) = \sum_i f_e(x_i, y_i) + \sum_i f_t(y_i, y_{i+1}) = \left\langle w_e, \sum_i \varphi_e(x_i, y_i) \right\rangle + \left\langle w_t, \sum_i \varphi_t(y_i, y_{i+1}) \right\rangle$$

Page 107: cvpr2009 tutorial: kernel methods in computer vision: part II: Statistics and Clustering with Kernels, Structured Output Learning

Constraint Generation

$$\min_w \; \|w\|^2 + C \sum_{i=1}^{n} \xi_i$$
$$\text{s.t.} \quad \langle w, \varphi(x_i, y_i) \rangle - \langle w, \varphi(x_i, y) \rangle \ge \Delta(y_i, y) - \xi_i \quad \forall y \in \mathcal{Y} \setminus \{y_i\}$$

Initialize the constraint set to be empty.

Iterate until convergence:
- Solve the optimization using the current constraint set
- Add the maximally violated constraint for the current solution, as sketched below
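
A rough, self-contained sketch of this cutting-plane loop for a small, finite output set, where decoding is done by brute force and the intermediate QPs are handed to a generic SciPy solver; the function name, tolerance, and solver choice are illustrative simplifications rather than the implementation used in practice:

```python
import numpy as np
from scipy.optimize import minimize

def so_svm_cutting_plane(X, Y, phi, delta, labels, C=1.0, n_rounds=20):
    """Margin-rescaled SO-SVM by constraint generation (sketch).

    phi(x, y)    -> joint feature vector (NumPy array)
    delta(y, y') -> task loss
    labels       -> list of all candidate outputs (brute-force decoding)
    """
    d, n = len(phi(X[0], Y[0])), len(X)
    w, xi = np.zeros(d), np.zeros(n)
    working_set = []                                    # selected (i, y) constraints

    def objective(z):
        return np.dot(z[:d], z[:d]) + C * z[d:].sum()   # ||w||^2 + C * sum_i xi_i

    for _ in range(n_rounds):
        added = 0
        for i, (x, y_true) in enumerate(zip(X, Y)):
            # maximally violated constraint: argmax_y <w, phi(x, y)> + Delta(y_i, y)
            y_hat = max(labels, key=lambda y: w @ phi(x, y) + delta(y_true, y))
            slack_needed = delta(y_true, y_hat) - (w @ phi(x, y_true) - w @ phi(x, y_hat))
            if y_hat != y_true and slack_needed > xi[i] + 1e-6 and (i, y_hat) not in working_set:
                working_set.append((i, y_hat))
                added += 1
        if added == 0:
            break                                       # no violated constraints: done
        cons = [{'type': 'ineq',
                 'fun': (lambda z, i=i, y=y: z[:d] @ (phi(X[i], Y[i]) - phi(X[i], y))
                         - delta(Y[i], y) + z[d + i])}
                for i, y in working_set]
        cons.append({'type': 'ineq', 'fun': lambda z: z[d:]})   # slacks stay non-negative
        res = minimize(objective, np.concatenate([w, xi]), constraints=cons, method='SLSQP')
        w, xi = res.x[:d], res.x[d:]
    return w
```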

Page 109: cvpr2009 tutorial: kernel methods in computer vision: part II: Statistics and Clustering with Kernels, Structured Output Learning

Constraint Generation with the Viterbi Algorithm

To find the maximally violated constraint, we need to maximize w.r.t. $y$

$$\langle w, \varphi(x_i, y) \rangle + \Delta(y_i, y)$$

For arbitrary output spaces, we would need to iterate over all elements in $\mathcal{Y}$.

For HMMs, $\max_y \langle w, \varphi(x_i, y) \rangle$ can be found using the Viterbi algorithm.

It is a simple modification of this procedure to incorporate $\Delta(y_i, y)$ (Tsochantaridis et al., 2004).

Page 113: cvpr2009 tutorial: kernel methods in computer vision: part II: Statistics and Clustering with Kernels, Structured Output Learning

Discriminative Training of Object Localization

Structured output learning is not restricted to outputs specified by graphical models.

We can formulate object localization as a regression from an image to a bounding box:

$$g: \mathcal{X} \to \mathcal{Y}$$

$\mathcal{X}$ is the space of all images; $\mathcal{Y}$ is the space of all bounding boxes.

Page 115: cvpr2009 tutorial: kernel methods in computer vision: part II: Statistics and Clustering with Kernels, Structured Output Learning

Joint Kernel between Images and Boxes: Restriction Kernel

Note: $x|_y$ (the image restricted to the box region) is again an image.

Compare two images with boxes by comparing the images within the boxes:

$$k_{\text{joint}}((x, y), (x', y')) = k_{\text{image}}(x|_y, x'|_{y'})$$

Any common image kernel is applicable:
- linear on cluster histograms: $k(h, h') = \sum_i h_i h'_i$
- $\chi^2$-kernel: $k_{\chi^2}(h, h') = \exp\left(-\frac{1}{\gamma} \sum_i \frac{(h_i - h'_i)^2}{h_i + h'_i}\right)$
- pyramid matching kernel, ...

The resulting joint kernel is positive definite.
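
A small sketch of this restriction kernel, pairing a crop step with the χ²-kernel on gray-level histograms from the list above; the box encoding (top, left, bottom, right) in pixel coordinates and the histogram settings are assumptions for illustration:

```python
import numpy as np

def restriction_kernel(image1, box1, image2, box2, image_kernel):
    """k_joint((x, y), (x', y')) = k_image(x|_y, x'|_y'): compare the two box contents."""
    t1, l1, b1, r1 = box1
    t2, l2, b2, r2 = box2
    return image_kernel(image1[t1:b1, l1:r1], image2[t2:b2, l2:r2])

def chi2_histogram_kernel(crop1, crop2, gamma=1.0, bins=32):
    """Chi-squared kernel on gray-level histograms of the two crops."""
    h1, _ = np.histogram(crop1, bins=bins, range=(0, 256), density=True)
    h2, _ = np.histogram(crop2, bins=bins, range=(0, 256), density=True)
    return np.exp(-np.sum((h1 - h2) ** 2 / (h1 + h2 + 1e-12)) / gamma)

# toy usage with random gray-level images and two boxes
rng = np.random.default_rng(0)
img_a, img_b = rng.integers(0, 256, (100, 100)), rng.integers(0, 256, (100, 100))
value = restriction_kernel(img_a, (10, 10, 60, 60), img_b, (20, 5, 70, 55), chi2_histogram_kernel)
```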

Page 116: cvpr2009 tutorial: kernel methods in computer vision: part II: Statistics and Clustering with Kernels, Structured Output Learning

Restriction Kernel: Examples

(examples: $k_{\text{joint}}$ of two image/box pairs is large when the box contents match, small when they do not, and could also be large for different images whose box contents look alike)

Note: this behaves differently from the common tensor products,

$$k_{\text{joint}}((x, y), (x', y')) \ne k(x, x')\, k(y, y')$$

Page 117: cvpr2009 tutorial: kernel methods in computer vision: part II: Statistics and Clustering with Kernels, Structured Output Learning

Constraint Generation with Branch and Bound

As before, we must solve

$$\max_{y \in \mathcal{Y}} \langle w, \varphi(x_i, y) \rangle + \Delta(y_i, y)$$

where

$$\Delta(y_i, y) = \begin{cases} 1 - \dfrac{\mathrm{Area}(y_i \cap y)}{\mathrm{Area}(y_i \cup y)} & \text{if } y_{i\omega} = y_\omega = 1 \\[2ex] 1 - \frac{1}{2}(y_{i\omega} y_\omega + 1) & \text{otherwise} \end{cases}$$

and $y_{i\omega}$ specifies whether there is an instance of the object present in the image at all.

Solution: use branch-and-bound over the space of all rectangles in the image (Blaschko & Lampert, 2008).
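
The loss above can be written down directly; a sketch, assuming a hypothetical label encoding (omega, box) with omega in {-1, +1} for object presence and box = (top, left, bottom, right):

```python
def box_area(box):
    top, left, bottom, right = box
    return max(0, bottom - top) * max(0, right - left)

def localization_loss(y_true, y_pred):
    """Delta(y_i, y): 1 - area overlap if both labels contain the object, else 1 - (w_i * w + 1) / 2."""
    w_true, box_true = y_true
    w_pred, box_pred = y_pred
    if w_true == 1 and w_pred == 1:
        inter_box = (max(box_true[0], box_pred[0]), max(box_true[1], box_pred[1]),
                     min(box_true[2], box_pred[2]), min(box_true[3], box_pred[3]))
        inter = box_area(inter_box)
        union = box_area(box_true) + box_area(box_pred) - inter
        return 1.0 - inter / union
    return 1.0 - 0.5 * (w_true * w_pred + 1)

# identical boxes give loss 0; a present/absent mismatch gives loss 1
assert localization_loss((1, (0, 0, 10, 10)), (1, (0, 0, 10, 10))) == 0.0
assert localization_loss((1, (0, 0, 10, 10)), (-1, (0, 0, 10, 10))) == 1.0
```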

Page 119: cvpr2009 tutorial: kernel methods in computer vision: part II: Statistics and Clustering with Kernels, Structured Output Learning

Discriminative Training of Image Segmentation

Frame discriminative image segmentation as learning the parameters of a random field model.

Like sequence learning, the problem decomposes over cliques in the graph.

Set the loss to the number of incorrect pixels

Page 122: cvpr2009 tutorial: kernel methods in computer vision: part II: Statistics and Clustering with Kernels, Structured Output Learning

Constraint Generation with Graph Cuts

As the graph is loopy, we cannot use Viterbi.

Loopy belief propagation is approximate and can lead to poor learning performance for structured output learning of graphical models (Finley & Joachims, 2008).

Solution: use graph cuts (Szummer et al., 2008).

$\Delta(y_i, y)$ can be easily incorporated into the energy function.

Page 125: cvpr2009 tutorial: kernel methods in computer vision: part II: Statistics and Clustering with Kernels, Structured Output Learning

Summary of Structured Output Learning

Structured output learning is the prediction of items in complex and interdependent output spaces.

We can train regressors into these spaces using a generalization of the support vector machine.

We have shown examples for:
- Label sequence learning with Viterbi
- Object localization with branch and bound
- Image segmentation with graph cuts

Questions?
