
10-701/15-781 Recitation: Kernels

Manojit Nandi

February 27, 2014

Outline

Mathematical Theory
• Banach Spaces and Hilbert Spaces
• Kernels
• Commonly Used Kernels

Kernel Theory
• One Weird Kernel Trick
• Representer Theorem
• Do You Want to Build a ~~Snowman~~ Kernel? It Doesn't Have to Be a ~~Snowman~~ Kernel.
• Operations that Preserve/Create Kernels

Current Research in Kernel Theory
• Flaxman Test
• Fast Food

References


IMPORTANT: The kernels used in Support Vector Machines are different from the kernels used in kernel regression.

To avoid confusion, I'll refer to the Support Vector Machine kernels as Mercer Kernels and the kernel regression kernels as Smoothing Kernels.


Recall: A square matrix $A \in \mathbb{R}^{n \times n}$ is positive semi-definite if $u^T A u \ge 0$ for all vectors $u \in \mathbb{R}^n$.

For a kernel function $K : \mathcal{X} \times \mathcal{X} \to \mathbb{R}$ defined over a set $\mathcal{X}$ with $|\mathcal{X}| = n$, the Gram matrix $G$ is the $n \times n$ matrix with entries $G_{ij} = K(x_i, x_j) = \langle \phi(x_i), \phi(x_j) \rangle$.
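To make the definition concrete, here is a minimal Python sketch (added for this write-up, not part of the original slides) that builds a Gram matrix from an arbitrary kernel function; the helper name gram_matrix and the toy data are made up for illustration.

```python
import numpy as np

def gram_matrix(K, X):
    """Build the n x n Gram matrix G with G[i, j] = K(x_i, x_j)."""
    n = len(X)
    G = np.empty((n, n))
    for i in range(n):
        for j in range(n):
            G[i, j] = K(X[i], X[j])
    return G

# Example: the linear kernel K(x, z) = <x, z> on three toy points.
X = np.array([[1.0, 2.0], [0.5, -1.0], [3.0, 0.0]])
G = gram_matrix(lambda x, z: x @ z, X)
print(G)                      # symmetric
print(np.linalg.eigvalsh(G))  # eigenvalues are >= 0 (up to rounding)
```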


Mathematical Theory


Banach Spaces and Hilbert Spaces

A Banach Space is a vector space V equipped with a norm, with respect to which it is complete.

Examples:

• $\ell_p^m$ spaces: $\mathbb{R}^m$ equipped with the norm $\|x\| = \left(\sum_{i=1}^m |x_i|^p\right)^{1/p}$

• Function spaces, with $\|f\| := \left(\int_{\mathcal{X}} |f(x)|^p \, dx\right)^{1/p}$


A Hilbert Space is a complete vector space V equipped with an inner product $\langle \cdot, \cdot \rangle$. The inner product induces a norm ($\|x\| = \sqrt{\langle x, x \rangle}$), so Hilbert Spaces are a special case of Banach Spaces.

Example: Euclidean space with the standard inner product: for $x, y \in \mathbb{R}^n$, $\langle x, y \rangle = \sum_{i=1}^n x_i y_i = x^T y$.

Because kernel functions are inner products in some higher-dimensional space, each kernel function corresponds to some Hilbert Space $\mathcal{H}$.


Kernels

A Mercer Kernel is a function $K$ defined over a set $\mathcal{X}$ such that $K : \mathcal{X} \times \mathcal{X} \to \mathbb{R}$. The Mercer Kernel represents an inner product in some higher-dimensional space.

A Mercer Kernel is symmetric and positive semi-definite. By Mercer's theorem, we know that any symmetric positive semi-definite kernel represents the inner product in some higher-dimensional Hilbert Space $\mathcal{H}$.

Symmetric and Positive Semi-Definite $\Leftrightarrow$ Kernel Function $\Leftrightarrow$ $K(x, x') = \langle \phi(x), \phi(x') \rangle$ for some $\phi(\cdot)$.


Why Symmetric?
Inner products are symmetric by definition, so if the kernel function represents an inner product in some Hilbert Space, then the kernel function must be symmetric as well:
$K(x, z) = \langle \phi(x), \phi(z) \rangle = \langle \phi(z), \phi(x) \rangle = K(z, x)$.


Why Positive Semi-definite?
In order for $K$ to be positive semi-definite, the Gram matrix $G \in \mathbb{R}^{n \times n}$ must satisfy $u^T G u \ge 0$ for all $u \in \mathbb{R}^n$. Using functional analysis, one can show that $u^T G u$ corresponds to $\langle h_u, h_u \rangle$ for some $h_u$ in some Hilbert Space $\mathcal{H}$. Because $\langle h_u, h_u \rangle \ge 0$ by the definition of an inner product and $\langle h_u, h_u \rangle = u^T G u$, it follows that $u^T G u \ge 0$.
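To build intuition, here is a small numerical check (a sketch added for this write-up, not from the slides): the Gram matrix of a valid kernel passes both forms of the PSD test, $u^T G u \ge 0$ for arbitrary $u$ and non-negative eigenvalues.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.standard_normal((20, 3))          # 20 points in R^3

# Gram matrix for the (valid) quadratic kernel K(x, z) = <x, z>^2
G = (X @ X.T) ** 2

# 1) u^T G u >= 0 for arbitrary vectors u
for _ in range(1000):
    u = rng.standard_normal(20)
    assert u @ G @ u >= -1e-10

# 2) Equivalently, all eigenvalues of the symmetric Gram matrix are >= 0
print(np.linalg.eigvalsh(G).min())
```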


Why are Kernels useful?
Let $x = (x_1, x_2)$ and $z = (z_1, z_2)$. Then

$$K(x, z) = \langle x, z \rangle^2 = (x_1 z_1 + x_2 z_2)^2 = x_1^2 z_1^2 + 2 x_1 z_1 x_2 z_2 + x_2^2 z_2^2$$
$$= \langle (x_1^2, \sqrt{2}\, x_1 x_2, x_2^2),\ (z_1^2, \sqrt{2}\, z_1 z_2, z_2^2) \rangle = \langle \phi(x), \phi(z) \rangle,$$

where $\phi(x) = (x_1^2, \sqrt{2}\, x_1 x_2, x_2^2)$.

What if $x = (x_1, x_2, x_3, x_4)$, $z = (z_1, z_2, z_3, z_4)$, and $K(x, z) = \langle x, z \rangle^{50}$?
In this case, $\phi(x)$ and $\phi(z)$ are 23426-dimensional* vectors.

*: From combinatorics, placing 50 unlabeled balls into 4 labeled boxes gives $\binom{53}{3} = 23426$ possibilities.


We can either:

1. Calculate $\langle x, z \rangle$ (the inner product of two four-dimensional vectors) and raise this value (a single float) to the 50th power.

2. OR map $x$ and $z$ to $\phi(x)$ and $\phi(z)$, and calculate the inner product of two 23426-dimensional vectors.

Which is faster? (See the quick comparison below.)
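Here is a quick comparison sketch (my addition, not from the slides). For the degree-2 case the explicit feature map is small enough to write out and the two routes agree exactly; for degree 50 the kernel route is still one 4-dimensional dot product and a power, while the explicit route would need all 23426 monomial features.

```python
import numpy as np

x = np.array([1.0, 2.0])
z = np.array([0.5, -1.0])

# Kernel trick: one 2-dimensional dot product, then a power.
k_direct = (x @ z) ** 2

# Explicit feature map phi(v) = (v1^2, sqrt(2) v1 v2, v2^2).
def phi(v):
    return np.array([v[0] ** 2, np.sqrt(2) * v[0] * v[1], v[1] ** 2])

k_explicit = phi(x) @ phi(z)
print(k_direct, k_explicit)   # identical (up to rounding)

# For K(x, z) = <x, z>^50 on 4-dimensional inputs the direct route stays cheap:
x4 = np.array([1.0, 2.0, 3.0, 4.0])
z4 = np.array([0.5, 0.6, 0.7, 0.8])
k50 = (x4 @ z4) ** 50         # no 23426-dimensional vectors needed
print(k50)
```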


Commonly Used Kernels

Some commonly used Kernels are:

• Linear Kernel: $K(x, z) = \langle x, z \rangle$

• Polynomial Kernel: $K(x, z) = \langle x, z \rangle^d$, where $d$ is the degree of the polynomial

• Gaussian RBF: $K(x, z) = \exp\!\left(-\frac{1}{2\sigma^2} \|x - z\|^2\right) = \exp\!\left(-\gamma \|x - z\|^2\right)$

• Sigmoid Kernel: $K(x, z) = \tanh(\alpha x^T z + \beta)$

(A short code sketch of these four kernels follows below.)
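This sketch is my own addition; the parameter defaults ($d$, $\gamma$, $\alpha$, $\beta$) are arbitrary choices for illustration.

```python
import numpy as np

def linear(x, z):
    return x @ z

def polynomial(x, z, d=3):
    return (x @ z) ** d

def gaussian_rbf(x, z, gamma=0.5):           # gamma = 1 / (2 sigma^2)
    return np.exp(-gamma * np.sum((x - z) ** 2))

def sigmoid(x, z, alpha=0.01, beta=0.0):
    return np.tanh(alpha * (x @ z) + beta)

x = np.array([1.0, 2.0, 3.0])
z = np.array([0.0, 1.0, -1.0])
for k in (linear, polynomial, gaussian_rbf, sigmoid):
    print(k.__name__, k(x, z))
```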


Other Kernels:

1. Laplacian Kernel

2. ANOVA Kernel

3. Circular Kernel

4. Spherical Kernel

5. Wave Kernel

6. Power Kernel

7. Log Kernel

8. B-Spline Kernel

9. Bessel Kernel

10. Cauchy Kernel

11. χ2 Kernel

12. Wavelet Kernel

13. Bayesian Kernel

14. Histogram Kernel

As you can see, there are a lot of kernels.


The Gaussian RBF Kernel is infinite-dimensional
In class, Barnabas mentioned that the Gaussian RBF Kernel corresponds to an infinite-dimensional vector space. From the Moore-Aronszajn theorem, we know there is a unique Hilbert space of functions for which the Gaussian RBF is a reproducing kernel.

One can show that this Hilbert Space has an infinite basis, and so the Gaussian RBF Kernel corresponds to an infinite-dimensional vector space. A YouTube video of Abu-Mostafa explaining this in more detail is included in the references at the end of this presentation.
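One quick way to see where the infinitely many dimensions come from (a sketch added to this write-up, not from the original slides) is to factor the kernel and Taylor-expand the exponential:

$$\exp\!\left(-\tfrac{\|x - z\|^2}{2\sigma^2}\right)
= \exp\!\left(-\tfrac{\|x\|^2}{2\sigma^2}\right)\exp\!\left(-\tfrac{\|z\|^2}{2\sigma^2}\right)\exp\!\left(\tfrac{\langle x, z\rangle}{\sigma^2}\right)
= \exp\!\left(-\tfrac{\|x\|^2}{2\sigma^2}\right)\exp\!\left(-\tfrac{\|z\|^2}{2\sigma^2}\right)\sum_{k=0}^{\infty} \frac{\langle x, z\rangle^k}{\sigma^{2k}\, k!}.$$

Each term $\langle x, z\rangle^k$ is a polynomial kernel of degree $k$, so the expansion stacks feature maps of every degree, and the corresponding $\phi$ has infinitely many coordinates.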


Kernel Theory


One Weird Kernel Trick
The kernel trick is one of the reasons kernel methods are so popular in machine learning. With the kernel trick, we can substitute any dot product with a kernel function. This means any linear dot product in the formulation of a machine learning algorithm can be replaced with a non-linear kernel function. The non-linear kernel function may allow us to find separation or structure in the data that was not present with the linear dot product.

Kernel Trick: $\langle x_i, x_j \rangle$ can be replaced with $K(x_i, x_j)$ for any valid kernel function $K$. Furthermore, we can swap any kernel function for any other kernel function.
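Here is the trick in action (a sketch added for this write-up, not from the slides): a dual-form perceptron in which every inner product $\langle x_i, x_j \rangle$ has been replaced by $K(x_i, x_j)$. The function names and the toy disc-shaped data are made up for illustration.

```python
import numpy as np

def rbf(x, z, gamma=1.0):
    return np.exp(-gamma * np.sum((x - z) ** 2))

def kernel_perceptron(X, y, K, epochs=10):
    """Dual perceptron: predictions use sign(sum_i alpha_i y_i K(x_i, x))."""
    n = len(X)
    alpha = np.zeros(n)
    G = np.array([[K(xi, xj) for xj in X] for xi in X])  # Gram matrix
    for _ in range(epochs):
        for i in range(n):
            pred = np.sign(np.sum(alpha * y * G[:, i])) or 1.0
            if pred != y[i]:
                alpha[i] += 1.0                          # mistake-driven update
    return alpha

# Toy data that is not linearly separable: +1 inside a disc, -1 outside.
rng = np.random.default_rng(0)
X = rng.uniform(-1, 1, size=(60, 2))
y = np.where(np.sum(X ** 2, axis=1) < 0.5, 1.0, -1.0)

alpha = kernel_perceptron(X, y, rbf)
scores = [np.sum(alpha * y * np.array([rbf(xi, x) for xi in X])) for x in X]
print("training accuracy:", np.mean(np.sign(scores) == y))
```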


Some examples of algorithms that take advantage of the kernel trick:

• Support Vector Machines (Poster Child for Kernel Methods).

• Kernel Principal Component Analysis

• Kernel Independent Component Analysis

• Kernel K-Means algorithm

• Kernel Gaussian Process Regression

• Kernel Deep Learning


Representer Theorem
Theorem. Let $\mathcal{X}$ be a non-empty set and $k$ a positive-definite real-valued kernel on $\mathcal{X} \times \mathcal{X}$ with corresponding reproducing kernel Hilbert Space $\mathcal{H}_k$. Given a training sample $[(x_1, y_1), \ldots, (x_n, y_n)] \in \mathcal{X} \times \mathbb{R}$, a strictly monotonically increasing real-valued function $g : [0, \infty) \to \mathbb{R}$, and an arbitrary empirical loss function $E : (\mathcal{X} \times \mathbb{R}^2)^n \to \mathbb{R} \cup \{\infty\}$, any $f^*$ satisfying

$$f^* = \underset{f \in \mathcal{H}_k}{\arg\min}\; \{E[(x_1, y_1, f(x_1)), \ldots, (x_n, y_n, f(x_n))] + g(\|f\|)\}$$

can be represented in the form

$$f^*(\cdot) = \sum_{i=1}^n \alpha_i k(\cdot, x_i), \quad \text{where } \alpha_i \in \mathbb{R} \text{ for all } i.$$


Example: Let $\mathcal{H}_k$ be some RKHS function space with kernel $k(\cdot, \cdot)$. Let $[(x_1, y_1), \ldots, (x_n, y_n)]$ be a training input-output set, and suppose our task is to find the $f^*$ that minimizes the following regularized objective:

$$\hat{f} = \underset{f \in \mathcal{H}_k}{\arg\min}\; \left(\prod_{i=1}^n |f(x_i)|^6\right) \sum_{i=1}^n \left[\, \bigl|\sin(\|x_i\|\, y_i - f(x_i))\bigr|^{25} + y_i\, |f(x_i)|^{249} \right] + \exp(\|f\|_{\mathcal{F}})$$


$$\hat{f} = \underset{f \in \mathcal{H}_k}{\arg\min}\; \left(\prod_{i=1}^n |f(x_i)|^6\right) \sum_{i=1}^n \left[\, \bigl|\sin(\|x_i\|\, y_i - f(x_i))\bigr|^{25} + y_i\, |f(x_i)|^{249} \right] + \exp(\|f\|_{\mathcal{F}})$$

Well, $\exp(\cdot)$ is a strictly monotonically increasing function, so $\exp(\|f\|_{\mathcal{F}})$ fits the $g(\|f\|_{\mathcal{F}})$ part.

$\left(\prod_{i=1}^n |f(x_i)|^6\right) \sum_{i=1}^n \left[\, \bigl|\sin(\|x_i\|\, y_i - f(x_i))\bigr|^{25} + y_i\, |f(x_i)|^{249} \right]$ is some arbitrary empirical loss function of the $[x_i, y_i, f(x_i)]$. So this optimization can be expressed as

$$\hat{f} = \underset{f \in \mathcal{H}_k}{\arg\min}\; \{E[(x_1, y_1, f(x_1)), \ldots, (x_n, y_n, f(x_n))] + g(\|f\|)\}.$$

Therefore, by the Representer Theorem, $f^*(\cdot) = \sum_{i=1}^n \alpha_i K(x_i, \cdot)$.
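As a more practical instance of the representer theorem, consider kernel ridge regression, with squared loss and $g(\|f\|) = \lambda \|f\|^2$. The sketch below is my addition (not from the slides); in this case the coefficients have the well-known closed form $\alpha = (G + \lambda I)^{-1} y$, and the fitted function is exactly $f(\cdot) = \sum_i \alpha_i K(x_i, \cdot)$.

```python
import numpy as np

def rbf_gram(X, Z, gamma=1.0):
    """Gram matrix K[i, j] = exp(-gamma * ||X_i - Z_j||^2)."""
    d = np.sum((X[:, None, :] - Z[None, :, :]) ** 2, axis=-1)
    return np.exp(-gamma * d)

# Toy 1-D regression problem.
rng = np.random.default_rng(0)
X = rng.uniform(-3, 3, size=(40, 1))
y = np.sin(X[:, 0]) + 0.1 * rng.standard_normal(40)

lam = 0.1
G = rbf_gram(X, X)
alpha = np.linalg.solve(G + lam * np.eye(len(X)), y)   # alpha = (G + lam I)^{-1} y

# Predictions take the representer-theorem form f(x) = sum_i alpha_i K(x_i, x).
X_test = np.linspace(-3, 3, 5).reshape(-1, 1)
print(rbf_gram(X_test, X) @ alpha)
```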


Do You Want to Build a ~~Snowman~~ Kernel? It Doesn't Have to Be a ~~Snowman~~ Kernel.


Operations that Preserve/Create Kernels
Let $k_1, \ldots, k_m$ be valid kernel functions defined over some set $\mathcal{X}$ with $|\mathcal{X}| = n$, and let $\alpha_1, \ldots, \alpha_m$ be non-negative coefficients. The following are valid kernels (a numerical sanity check of the first two appears below):

• $K(x, z) = \sum_{i=1}^m \alpha_i k_i(x, z)$ (closed under non-negative linear combinations)

• $K(x, z) = \prod_{i=1}^m k_i(x, z)$ (closed under multiplication)

• $K(x, z) = k_1(f(x), f(z))$ for any function $f : \mathcal{X} \to \mathcal{X}$

• $K(x, z) = g(x)\, g(z)$ for any function $g : \mathcal{X} \to \mathbb{R}$

• $K(x, z) = x^T A^T A z$ for any matrix $A \in \mathbb{R}^{m \times n}$
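The following check is my addition (numerical evidence, not a proof): build Gram matrices for two base kernels, combine them, and confirm the smallest eigenvalue stays non-negative (up to floating-point error).

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.standard_normal((30, 4))

G1 = X @ X.T                                                          # linear kernel
G2 = np.exp(-0.5 * np.sum((X[:, None] - X[None, :]) ** 2, axis=-1))   # Gaussian RBF kernel

def min_eig(G):
    return np.linalg.eigvalsh(G).min()

print(min_eig(2.0 * G1 + 3.0 * G2))   # non-negative linear combination
print(min_eig(G1 * G2))               # product kernel = elementwise (Schur) product of Gram matrices
```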


Proof: $K(x, z) = \sum_{i=1}^m \alpha_i k_i(x, z)$ is a valid kernel.

Symmetry: $K(z, x) = \sum_{i=1}^m \alpha_i k_i(z, x)$. Because each $k_i$ is a kernel function, it is symmetric, so $k_i(z, x) = k_i(x, z)$. Therefore
$$K(z, x) = \sum_{i=1}^m \alpha_i k_i(z, x) = \sum_{i=1}^m \alpha_i k_i(x, z) = K(x, z).$$

Positive semi-definiteness: Let $u \in \mathbb{R}^n$ be arbitrary. The Gram matrix of $K$, denoted $G$, has entries $G_{jl} = K(x_j, x_l) = \sum_{i=1}^m \alpha_i k_i(x_j, x_l)$, so $G = \alpha_1 G_1 + \cdots + \alpha_m G_m$, where $G_i$ is the Gram matrix of $k_i$. Now
$$u^T G u = u^T(\alpha_1 G_1 + \cdots + \alpha_m G_m)\,u = \alpha_1 u^T G_1 u + \cdots + \alpha_m u^T G_m u = \sum_{i=1}^m \alpha_i\, u^T G_i u.$$
Each $u^T G_i u \ge 0$ because each $k_i$ is positive semi-definite, and each $\alpha_i \ge 0$, so every term $\alpha_i u^T G_i u \ge 0$ and hence $u^T G u \ge 0$.


Proof: $K(x, z) = x^T A^T A z$ for any matrix $A \in \mathbb{R}^{m \times n}$ is a valid kernel.

For this proof, we show that $K(x, z)$ is an inner product in some Hilbert Space. Let $\phi(x) = Ax$. Then $\langle \phi(x), \phi(z) \rangle = \phi(x)^T \phi(z) = (Ax)^T(Az) = x^T A^T A z = K(x, z)$.

Therefore, $K(x, z)$ is an inner product in some Hilbert Space.
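A two-line numerical check of the same identity (added here purely as an illustration):

```python
import numpy as np

rng = np.random.default_rng(0)
A = rng.standard_normal((5, 3))                # A in R^{m x n} with m = 5, n = 3
x, z = rng.standard_normal(3), rng.standard_normal(3)

# K(x, z) = x^T A^T A z equals <phi(x), phi(z)> with phi(x) = Ax.
print(x @ A.T @ A @ z, (A @ x) @ (A @ z))      # the two numbers agree
```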


Just because it is a valid kernel does not mean it is a good kernel


Some Exercises
Prove the following are not valid kernels (a numerical hint for the first appears below):

1. $K(x, z) = k_1(x, z) - k_2(x, z)$ for $k_1, k_2$ valid kernels

2. $K(x, z) = \exp\!\left(\gamma \|x - z\|^2\right)$ for some $\gamma > 0$
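As a hint for the first exercise (my addition; a counterexample search, not a proof): choose $k_1$ and $k_2$ so that the difference produces a Gram matrix with a negative eigenvalue.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.standard_normal((10, 2))

G1 = X @ X.T            # k1: linear kernel
G2 = 2.0 * (X @ X.T)    # k2: the same kernel scaled by 2 (still valid)

G = G1 - G2             # Gram matrix of k1 - k2 = -k1
print(np.linalg.eigvalsh(G).min())   # negative, so k1 - k2 is not positive semi-definite
```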


Current Research in Kernel Theory


Flaxman Test

Correlates of homicide: New space/time interaction tests for spatiotemporal pointprocesses

Goal: Develop a statistical test for the strength of the interaction between time and space for different types of homicides.

Result: Using Reproducing Kernel Hilbert Spaces, one can project the space-time data into a higher-dimensional space and test for a significant interaction between the kernelized distances in space and time using the Hilbert-Schmidt Independence Criterion.


Fast Food
Fastfood - Approximating Kernel Expansions in Loglinear Time

Problem: Kernel methods do not scale well to large datasets, especially at prediction time, because the algorithm has to compute the kernel distance between the new vector and all of the data in the training set.

Solution: Using advanced mathematical black magic, dense random Gaussian matrices can be approximated using Hadamard matrices and diagonal Gaussian matrices:

$$V = \frac{1}{\sigma\sqrt{d}}\, S H G \Pi H B$$


where $\Pi \in \{0, 1\}^{d \times d}$ is a permutation matrix, $H$ is a Hadamard matrix, $S$ is a random scaling matrix with elements along the diagonal, $B$ has random $\pm 1$ entries along the diagonal, and $G$ has random Gaussian entries.
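Here is a rough sketch of assembling $V$ from the formula above (my reading, with simplifying assumptions: $d$ is a power of 2, every factor is $d \times d$, and the diagonal scaling $S$ is a naive choice; the actual Fastfood algorithm never materializes $V$ and instead applies the Hadamard factors with fast transforms):

```python
import numpy as np
from scipy.linalg import hadamard

def fastfood_V(d, sigma, rng):
    """Assemble V = (1 / (sigma * sqrt(d))) * S H G Pi H B as dense d x d matrices."""
    H = hadamard(d).astype(float)                  # Hadamard matrix (d must be a power of 2)
    B = np.diag(rng.choice([-1.0, 1.0], size=d))   # random +/-1 signs on the diagonal
    Pi = np.eye(d)[rng.permutation(d)]             # permutation matrix
    G = np.diag(rng.standard_normal(d))            # diagonal Gaussian
    S = np.diag(np.sqrt(rng.chisquare(df=d, size=d) / d))   # one simple scaling choice
    return (1.0 / (sigma * np.sqrt(d))) * S @ H @ G @ Pi @ H @ B

rng = np.random.default_rng(0)
V = fastfood_V(d=8, sigma=1.0, rng=rng)
x = rng.standard_normal(8)
print(V @ x)   # Vx plays the role of the random projection in RBF random features
```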

This algorithm performs 100x faster than the previous leading algorithm (RandomKitchen Sinks) and uses 1000x less memory.


References

• Representer Theorem and Kernel Methods
• Predicting Structured Data by Alex Smola
• Kernel Machines
• String and Tree Kernels
• Why the Gaussian Kernel is Infinite-Dimensional (YouTube video by Abu-Mostafa)


Questions?