
Kernel Methods

Arie Nakhmani

Outline
- Kernel Smoothers
- Kernel Density Estimators
- Kernel Density Classifiers

Kernel Smoothers – The Goal
- Estimating a function $f(X): \mathbb{R}^p \to \mathbb{R}$ using noisy observations, when the parametric model for this function is unknown
- The resulting function should be smooth
- The level of "smoothness" should be set by a single parameter

Example
[Figure: scatter plot of the data, Y vs. X, N = 100 sample points]

What does "smooth enough" mean?

[Figure: the same data with a candidate smooth curve, Y vs. X]

Example
$Y = \sin(X) + \varepsilon$, with $X \sim U[0, 2\pi]$ and $\varepsilon \sim N(0, 1/4)$
N = 100 sample points

Exponential Smoother
[Figure: exponentially smoothed curve over the sample points, α = 0.25]
$\hat Y(i) = (1-\alpha)\,\hat Y(i-1) + \alpha\, Y_{\text{sorted}}(i), \qquad \hat Y(1) = Y_{\text{sorted}}(1), \quad 0 \le \alpha \le 1$
Smaller α gives a smoother line, but with more delay.

Exponential Smoother
$\hat Y(i) = (1-\alpha)\,\hat Y(i-1) + \alpha\, Y_{\text{sorted}}(i)$
- Simple
- Sequential
- Single parameter
- Single-value memory
- Too rough
- Delayed
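As a concrete illustration, here is a minimal Python sketch of this smoother; the synthetic data (Y = sin(X) + noise, N = 100) and α = 0.25 follow the example above, and all names are illustrative:

```python
import numpy as np

def exponential_smoother(y_sorted, alpha=0.25):
    """Exponential smoother: Y_hat(i) = (1-alpha)*Y_hat(i-1) + alpha*Y_sorted(i)."""
    y_hat = np.empty(len(y_sorted), dtype=float)
    y_hat[0] = y_sorted[0]                    # initialization: Y_hat(1) = Y_sorted(1)
    for i in range(1, len(y_sorted)):
        y_hat[i] = (1 - alpha) * y_hat[i - 1] + alpha * y_sorted[i]
    return y_hat

# Synthetic data as in the example: Y = sin(X) + eps, X ~ U[0, 2*pi], Var(eps) = 1/4
rng = np.random.default_rng(0)
x = np.sort(rng.uniform(0.0, 2 * np.pi, 100))
y = np.sin(x) + rng.normal(0.0, 0.5, 100)
y_smooth = exponential_smoother(y, alpha=0.25)
```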

Moving Average Smoother
For m = 5:
$\hat Y(1) = Y_{\text{sorted}}(1)$
$\hat Y(2) = \big(Y_{\text{sorted}}(1) + Y_{\text{sorted}}(2) + Y_{\text{sorted}}(3)\big)/3$
$\hat Y(3) = \big(Y_{\text{sorted}}(1) + Y_{\text{sorted}}(2) + Y_{\text{sorted}}(3) + Y_{\text{sorted}}(4) + Y_{\text{sorted}}(5)\big)/5$
$\hat Y(4) = \big(Y_{\text{sorted}}(2) + Y_{\text{sorted}}(3) + Y_{\text{sorted}}(4) + Y_{\text{sorted}}(5) + Y_{\text{sorted}}(6)\big)/5$
...
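A minimal Python sketch of this moving-average rule (the function name is illustrative; the shrinking window at the end-points mirrors the m = 5 example above):

```python
import numpy as np

def moving_average_smoother(y_sorted, m=5):
    """Centered moving average over the points sorted by X; the window is
    shrunk symmetrically near the end-points, as in the m = 5 example."""
    n = len(y_sorted)
    half = m // 2
    y_hat = np.empty(n, dtype=float)
    for i in range(n):
        k = min(i, n - 1 - i, half)           # usable half-width at position i
        y_hat[i] = np.mean(y_sorted[i - k:i + k + 1])
    return y_hat
```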

Moving Average Smoother
[Figure: moving-average smoothed curve over the sample points, m = 11]
Larger m gives a smoother, but straighter (over-smoothed) line.

Moving Average Smoother
- Sequential
- Single parameter: the window size m
- Memory for m values
- Irregularly smooth
- What if we have a p-dimensional problem with p > 1?

Nearest Neighbors Smoother
[Figure: nearest-neighbor smoothed curve $\hat Y(x)$ over the sample points, m = 160, evaluated at a point $x_0$]
$\hat Y(x_0) = \text{Average}\{\, y_i \mid x_i \in \text{Neighborhood}_m(x_0) \,\}$
Larger m gives a smoother, but more biased line.

Nearest Neighbors Smoother
- Not sequential
- Single parameter: the number of neighbors m
- Trivially extended to any number of dimensions
- Memory for m values
- Depends on the metric definition
- Not smooth enough
- Biased end-points
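A minimal Python sketch of the nearest-neighbor smoother defined above (the query grid and the default m are illustrative choices):

```python
import numpy as np

def knn_smoother(x, y, x0_grid, m=30):
    """Nearest-neighbor smoother: Y_hat(x0) is the average of the y-values
    of the m sample points closest to x0."""
    x, y = np.asarray(x, float), np.asarray(y, float)
    y_hat = np.empty(len(x0_grid), dtype=float)
    for j, x0 in enumerate(x0_grid):
        nearest = np.argsort(np.abs(x - x0))[:m]   # indices of the m nearest x_i
        y_hat[j] = y[nearest].mean()
    return y_hat
```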

Low Pass Filter
[Figure: low-pass-filtered curve over the sample points, Y vs. X]
2nd order Butterworth: $H(z) = 0.0078\,\dfrac{z^2 + 2z + 1}{z^2 - 1.73\,z + 0.77}$
Why do we need kernel smoothers?

Low Pass Filter
[Figure: the same filter applied to a logarithmic function, Y vs. X]

Low Pass Filter
- Smooth
- Simply extended to any number of dimensions
- Effectively, 3 parameters: type, order, and bandwidth
- Biased end-points
- Inappropriate for some functions (depends on bandwidth)

Kernel Average Smoother
[Figure: kernel-average smoothed curve over the sample points, with the estimate $\hat Y(x_0)$ marked at a point $x_0$]
$\hat Y(x_0) = \text{Average}\{\, w_i\, y_i \mid x_i \,\}$, where the weight $w_i$ depends on the distance $|x_i - x_0|$

Kernel Average Smoother
Nadaraya-Watson kernel-weighted average:
$$\hat Y(x_0) = \frac{\sum_{i=1}^{N} K_\lambda(x_0, x_i)\, y_i}{\sum_{i=1}^{N} K_\lambda(x_0, x_i)}$$
with the kernel:
$$K_\lambda(x_0, x) = D\!\left(\frac{|x - x_0|}{h_\lambda(x_0)}\right)$$
where $h_\lambda(x_0) = |x_0 - x_{[m]}|$ for the Nearest Neighbor Smoother and $h_\lambda(x_0) = \lambda$ for the Locally Weighted Average.

Popular Kernels
Epanechnikov kernel: $D(t) = \begin{cases} \tfrac{3}{4}(1 - t^2), & |t| \le 1 \\ 0, & \text{otherwise} \end{cases}$
Tri-cube kernel: $D(t) = \begin{cases} (1 - |t|^3)^3, & |t| \le 1 \\ 0, & \text{otherwise} \end{cases}$
Gaussian kernel: $D(t) = \phi(t) = \dfrac{1}{\sqrt{2\pi}} \exp\!\left(-\dfrac{t^2}{2}\right)$
[Figure: the three kernels plotted over $t \in [-3, 3]$: Epanechnikov, Tri-cube, Gaussian]
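Putting the last two slides together, here is a minimal sketch of the Nadaraya-Watson average with the Epanechnikov kernel and a constant window width λ (the names and the default λ are illustrative):

```python
import numpy as np

def epanechnikov(t):
    """D(t) = 3/4 * (1 - t^2) for |t| <= 1, and 0 otherwise."""
    t = np.asarray(t, float)
    return np.where(np.abs(t) <= 1.0, 0.75 * (1.0 - t ** 2), 0.0)

def nadaraya_watson(x, y, x0_grid, lam=0.5, kernel=epanechnikov):
    """Nadaraya-Watson kernel-weighted average with K(x0, x) = D(|x - x0| / lam)."""
    x, y = np.asarray(x, float), np.asarray(y, float)
    y_hat = np.full(len(x0_grid), np.nan)
    for j, x0 in enumerate(x0_grid):
        w = kernel(np.abs(x - x0) / lam)      # K_lambda(x0, x_i)
        if w.sum() > 0:                       # leave NaN where no point falls in the window
            y_hat[j] = np.dot(w, y) / w.sum()
    return y_hat
```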

Non-Symmetric Kernel
Kernel example:
$$D(t) = \begin{cases} \alpha\,(1-\alpha)^{t}, & t \ge 0,\ 0 < \alpha < 1 \\ 0, & \text{otherwise} \end{cases}$$
Which kernel is that? It is the kernel of the exponential smoother, which expands to
$$\hat Y_i = \alpha Y_i + \alpha(1-\alpha)\, Y_{i-1} + \alpha(1-\alpha)^2\, Y_{i-2} + \ldots + (1-\alpha)^{i-1}\, Y_1, \qquad \alpha \sum_{j \ge 0} (1-\alpha)^j = 1$$
[Figure: plot of the non-symmetric kernel over $t \in [-3, 3]$]

Kernel Average Smoother
- Single parameter: window width
- Smooth
- Trivially extended to any number of dimensions
- Memory-based method: little or no training is required
- Depends on the metric definition
- Biased end-points

Local Linear Regression
The kernel-weighted average minimizes:
$$\min_{\alpha(x_0)} \sum_{i=1}^{N} K_\lambda(x_0, x_i)\,\big[y_i - \alpha(x_0)\big]^2, \qquad \hat Y(x_0) = \hat\alpha(x_0)$$
Local linear regression minimizes:
$$\min_{\alpha(x_0),\,\beta(x_0)} \sum_{i=1}^{N} K_\lambda(x_0, x_i)\,\big[y_i - \alpha(x_0) - \beta(x_0)\, x_i\big]^2, \qquad \hat Y(x_0) = \hat\alpha(x_0) + \hat\beta(x_0)\, x_0$$

Local Linear Regression
Solution:
$$\hat Y(x_0) = [\,1,\ x_0\,]\,\big(\mathbf{B}^T \mathbf{W}(x_0)\, \mathbf{B}\big)^{-1} \mathbf{B}^T \mathbf{W}(x_0)\, \mathbf{y}$$
where:
$$\mathbf{B}^T = \begin{pmatrix} 1 & 1 & \cdots & 1 \\ x_1 & x_2 & \cdots & x_N \end{pmatrix}, \qquad \mathbf{W}(x_0) = \mathrm{diag}\{K_\lambda(x_0, x_i)\}_{N \times N}, \qquad \mathbf{y} = \begin{pmatrix} y_1 \\ \vdots \\ y_N \end{pmatrix}$$
Other representation:
$$\hat Y(x_0) = \sum_{i=1}^{N} l_i(x_0)\, y_i \quad \text{(the } l_i(x_0) \text{ form the equivalent kernel)}, \qquad \sum_{i=1}^{N} l_i(x_0) = 1$$
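A minimal Python sketch of this closed-form solution, using Epanechnikov weights and a constant window width λ (illustrative defaults; it assumes at least two sample points fall inside each window):

```python
import numpy as np

def local_linear_regression(x, y, x0_grid, lam=0.5):
    """Local linear regression: Y_hat(x0) = [1, x0] (B^T W(x0) B)^{-1} B^T W(x0) y."""
    x, y = np.asarray(x, float), np.asarray(y, float)
    B = np.column_stack([np.ones_like(x), x])       # N x 2 regression matrix
    y_hat = np.empty(len(x0_grid), dtype=float)
    for j, x0 in enumerate(x0_grid):
        t = np.abs(x - x0) / lam
        w = np.where(t <= 1.0, 0.75 * (1.0 - t ** 2), 0.0)   # Epanechnikov weights
        BtW = B.T * w                               # B^T W(x0)
        alpha, beta = np.linalg.solve(BtW @ B, BtW @ y)
        y_hat[j] = alpha + beta * x0
    return y_hat
```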

Local Linear Regression
[Figure: local linear regression fit, with the estimate $\hat Y(x_0)$ marked at a point $x_0$]

Equivalent Kernels

Local Polynomial Regression
Why stop at local linear fits? Let's minimize:
$$\min_{\alpha(x_0),\,\beta_j(x_0),\ j=1,\ldots,d} \ \sum_{i=1}^{N} K_\lambda(x_0, x_i)\,\Big[y_i - \alpha(x_0) - \sum_{j=1}^{d} \beta_j(x_0)\, x_i^{\,j}\Big]^2$$
$$\hat Y(x_0) = \hat\alpha(x_0) + \sum_{j=1}^{d} \hat\beta_j(x_0)\, x_0^{\,j}$$

Local Polynomial Regression
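Extending the local linear sketch above, a local polynomial fit of degree d can be written as follows (a sketch with tri-cube weights; the defaults are illustrative):

```python
import numpy as np

def local_poly_regression(x, y, x0_grid, lam=0.5, d=2):
    """Local polynomial regression of degree d: weighted least squares on the
    basis (1, x, ..., x^d) around each x0, evaluated at x0."""
    x, y = np.asarray(x, float), np.asarray(y, float)
    B = np.vander(x, d + 1, increasing=True)        # columns 1, x, ..., x^d
    y_hat = np.empty(len(x0_grid), dtype=float)
    for j, x0 in enumerate(x0_grid):
        t = np.abs(x - x0) / lam
        w = np.where(t <= 1.0, (1.0 - t ** 3) ** 3, 0.0)     # tri-cube weights
        BtW = B.T * w
        theta = np.linalg.solve(BtW @ B, BtW @ y)   # (alpha, beta_1, ..., beta_d)
        y_hat[j] = sum(theta[k] * x0 ** k for k in range(d + 1))
    return y_hat
```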

Variance Compromise
$$\mathrm{Var}\big(\hat Y(x_0)\big) = \sigma^2\, \|l(x_0)\|^2, \qquad \text{for } y_i = f(x_i) + \varepsilon_i,\ \ \mathrm{Var}(\varepsilon_i) = \sigma^2,\ \ E(\varepsilon_i) = 0$$
[Figure: λ = 0.2, tri-cube kernel]

Conclusions
- Local linear fits can reduce bias dramatically at the boundaries, at a modest cost in variance. Local linear fits are more reliable for extrapolation.
- Local quadratic fits do little for bias at the boundaries, but increase the variance a lot.
- Local quadratic fits tend to be most helpful in reducing bias due to curvature in the interior of the domain.
- λ controls the tradeoff between bias and variance: larger λ gives lower variance but higher bias.

Local Regression in $\mathbb{R}^p$
Radial kernel:
$$K_\lambda(x_0, x) = D\!\left(\frac{\|x - x_0\|}{h_\lambda(x_0)}\right)$$
$$\hat\beta(x_0) = \arg\min_{\beta(x_0)} \sum_{i=1}^{N} K_\lambda(x_0, x_i)\,\big(y_i - b(x_i)^T \beta(x_0)\big)^2$$
$$b(X) = \big(1,\ X_1,\ X_2,\ X_1^2,\ X_2^2,\ X_1 X_2,\ \ldots\big), \qquad \hat Y(x_0) = b(x_0)^T \hat\beta(x_0)$$

Popular Kernels
Epanechnikov kernel: $D(t) = \begin{cases} \tfrac{3}{4}(1 - t^2), & |t| \le 1 \\ 0, & \text{otherwise} \end{cases}$
Tri-cube kernel: $D(t) = \begin{cases} (1 - |t|^3)^3, & |t| \le 1 \\ 0, & \text{otherwise} \end{cases}$
Gaussian kernel: $D(t) = \phi(t) = \dfrac{1}{\sqrt{2\pi}} \exp\!\left(-\dfrac{t^2}{2}\right)$

Example
[Figure: sample points and local regression fit with two predictors, Z vs. (X, Y)]
$Z = \sin(X) + \varepsilon$, with $X \sim U[0, 2\pi]$ and $\varepsilon \sim N(0, 0.2)$

Higher Dimensions
- Boundary estimation is problematic
- Many sample points are needed to reduce the bias
- Local regression is less useful for p > 3
- It is impossible to maintain localness (low bias) and sizeable samples (low variance) at the same time

Structured Kernels
Non-radial kernel:
$$K_{\lambda, \mathbf{A}}(x_0, x) = D\!\left(\frac{(x - x_0)^T \mathbf{A}\,(x - x_0)}{\lambda}\right), \qquad \mathbf{A} \succeq 0$$
- Coordinates or directions can be downgraded or omitted by imposing restrictions on A.
- The covariance can be used to adapt the metric A (related to the Mahalanobis distance).
- Projection-pursuit model

Structured Regression
- Divide the predictors $X \in \mathbb{R}^p$ into a set $(X_1, X_2, \ldots, X_q)$ with $q < p$, and collect the remaining variables in a vector $Z$.
- Conditionally linear model:
$$f(X) = \alpha(Z) + \beta_1(Z)\, X_1 + \ldots + \beta_q(Z)\, X_q$$
- For a given $Z = z_0$, fit the model by locally weighted least squares:
$$\min_{\alpha(z_0),\,\beta(z_0)} \sum_{i=1}^{N} K_\lambda(z_0, z_i)\,\big(y_i - \alpha(z_0) - \beta_1(z_0)\, x_{1i} - \ldots - \beta_q(z_0)\, x_{qi}\big)^2$$

Density Estimation
[Figure: original distribution, constant-window estimate, and sample set; density vs. X]
$$\hat f_X(x_0) = \frac{\#\{x_i \in \text{Neighborhood}(x_0)\}}{N\,\lambda}$$
Mixture of two normal distributions, N = 600 sample points

Kernel Density Estimation
Smooth Parzen estimate:
$$\hat f_X(x_0) = \frac{1}{N}\sum_{i=1}^{N} K_\lambda(x_0, x_i)$$
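A minimal sketch of a Gaussian Parzen estimate in Python (the grid and bandwidth λ are illustrative; the kernel here is the normalized Gaussian density, so the 1/N factor suffices):

```python
import numpy as np

def gaussian_kde_1d(samples, x_grid, lam=0.5):
    """Parzen estimate f_hat(x0) = (1/N) * sum_i phi_lam(x0 - x_i),
    where phi_lam is the N(0, lam^2) density."""
    samples = np.asarray(samples, float)
    x_grid = np.asarray(x_grid, float)
    t = (x_grid[:, None] - samples[None, :]) / lam           # (n_grid, N) standardized gaps
    phi = np.exp(-0.5 * t ** 2) / (lam * np.sqrt(2.0 * np.pi))
    return phi.mean(axis=1)                                  # average over the N samples
```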

Comparison
[Figure: density estimates for a mixture of two normal distributions: data samples, nearest-neighbor, Epanechnikov, and Gaussian estimates; density vs. X]
Usually, bandwidth selection is more important than the choice of kernel function.

Kernel Density Estimation
Gaussian kernel density estimation:
$$\hat f_X(x) = \frac{1}{N}\sum_{i=1}^{N} \phi_\lambda(x - x_i) = \big(\hat F * \phi_\lambda\big)(x)$$
where $\phi_\lambda$ denotes the Gaussian density with mean zero and standard deviation $\lambda$ (a low-pass-filtered version of the empirical distribution $\hat F$).
Generalization to $\mathbb{R}^p$:
$$\hat f_X(x_0) = \frac{1}{N\,(2\lambda^2\pi)^{p/2}}\sum_{i=1}^{N} \exp\!\left(-\tfrac{1}{2}\big(\|x_i - x_0\|/\lambda\big)^2\right)$$

Kernel Density Classification
For a J-class problem, estimate the class-conditional densities $\hat f_j(x)$ of $\Pr(X = x \mid G = j)$ and the class priors $\hat\pi_j$, then:
$$\widehat{\Pr}(G = j \mid X = x_0) = \frac{\hat\pi_j\, \hat f_j(x_0)}{\sum_{k=1}^{J} \hat\pi_k\, \hat f_k(x_0)}, \qquad j = 1, \ldots, J$$
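A minimal sketch of this classifier in Python, with a Gaussian kernel density estimate per class; taking the class priors as the class proportions is an assumption for illustration:

```python
import numpy as np

def kde_posteriors(x0, class_samples, lam=0.5):
    """Kernel density classifier: Pr_hat(G=j | X=x0) is proportional to
    pi_hat_j * f_hat_j(x0), with Gaussian KDE class densities and priors
    pi_hat_j taken as the class proportions."""
    sizes = np.array([len(s) for s in class_samples], dtype=float)
    priors = sizes / sizes.sum()                      # pi_hat_j
    unnorm = []
    for samples, prior in zip(class_samples, priors):
        t = (x0 - np.asarray(samples, float)) / lam
        f_hat = np.mean(np.exp(-0.5 * t ** 2)) / (lam * np.sqrt(2.0 * np.pi))
        unnorm.append(prior * f_hat)                  # pi_hat_j * f_hat_j(x0)
    unnorm = np.array(unnorm)
    return unnorm / unnorm.sum()                      # normalize over the J classes
```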

Radial Basis Functions
The function f(x) is represented as an expansion in basis functions:
$$f(x) = \sum_{j=1}^{M} \beta_j\, h_j(x)$$
Radial basis function expansion (RBF):
$$f(x) = \sum_{j=1}^{M} K_{\lambda_j}(\xi_j, x)\,\beta_j = \sum_{j=1}^{M} D\!\left(\frac{\|x - \xi_j\|}{\lambda_j}\right)\beta_j$$
where the sum-of-squares is minimized with respect to all the parameters (for the Gaussian kernel):
$$\min_{\{\lambda_j,\,\xi_j,\,\beta_j\}_{1}^{M}} \ \sum_{i=1}^{N}\left(y_i - \sum_{j=1}^{M} \beta_j \exp\!\left(-\frac{(x_i - \xi_j)^T (x_i - \xi_j)}{\lambda_j^2}\right)\right)^2$$

Radial Basis Functions
When assuming a constant width $\lambda_j = \lambda$: the problem of "holes" (regions where no basis function has appreciable support).
The solution - Renormalized RBF:
$$h_j(x) = \frac{D\big(\|x - \xi_j\| / \lambda\big)}{\sum_{k=1}^{M} D\big(\|x - \xi_k\| / \lambda\big)}$$
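A minimal sketch of renormalized Gaussian RBFs with fixed centers and a constant width λ, fitted by ordinary least squares (the center placement, λ, and function names are illustrative assumptions):

```python
import numpy as np

def renormalized_rbf_features(x, centers, lam=1.0):
    """h_j(x) = D(|x - xi_j| / lam) / sum_k D(|x - xi_k| / lam), Gaussian D."""
    x = np.asarray(x, float)[:, None]                 # (N, 1)
    xi = np.asarray(centers, float)[None, :]          # (1, M)
    D = np.exp(-0.5 * ((x - xi) / lam) ** 2)          # Gaussian basis values
    return D / D.sum(axis=1, keepdims=True)           # renormalize across the M centers

def fit_rbf(x, y, centers, lam=1.0):
    """Least-squares fit of f(x) = sum_j beta_j h_j(x) with the centers held fixed."""
    H = renormalized_rbf_features(x, centers, lam)
    beta, *_ = np.linalg.lstsq(H, np.asarray(y, float), rcond=None)
    return beta
```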

Additional Applications
- Local likelihood
- Mixture models for density estimation and classification
- Mean-shift

Conclusions
- Memory-based methods: the model is the entire training data set
- Infeasible for many real-time applications
- Provide good smoothing results for an arbitrarily sampled function
- Appropriate for interpolation and extrapolation
- When the model is known, it is better to use another fitting method