Kernel Methods
Dept. Computer Science & Engineering,
Shanghai Jiao Tong University
Outline
• One-Dimensional Kernel Smoothers
• Local Regression
• Local Likelihood
• Kernel Density Estimation
• Naive Bayes
• Radial Basis Functions
• Mixture Models and EM
One-Dimensional Kernel Smoothers
• k-NN: $\hat f(x) = \mathrm{Ave}(y_i \mid x_i \in N_k(x))$
• The 30-NN curve is bumpy, since $\hat f(x)$ is discontinuous in $x$.
• The average changes in a discrete way as points enter and leave the neighborhood, leading to a discontinuous $\hat f(x)$.
One-Dimensional Kernel Smoothers
• Nadaraya-Watson kernel-weighted average:
  $\hat f(x_0) = \dfrac{\sum_{i=1}^N K_\lambda(x_0, x_i)\, y_i}{\sum_{i=1}^N K_\lambda(x_0, x_i)}$
• Epanechnikov quadratic kernel (see the sketch below):
  $K_\lambda(x_0, x) = D\!\left(\dfrac{|x - x_0|}{\lambda}\right), \qquad D(t) = \begin{cases} \tfrac{3}{4}(1 - t^2) & \text{if } |t| \le 1 \\ 0 & \text{otherwise} \end{cases}$
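The following is a minimal NumPy sketch (not part of the original slides) of the Nadaraya-Watson average with the Epanechnikov kernel; the function names, the toy data, and the bandwidth lam are illustrative choices.

```python
import numpy as np

def epanechnikov(t):
    """Quadratic profile D(t) = 3/4 (1 - t^2) for |t| <= 1, and 0 otherwise."""
    return np.where(np.abs(t) <= 1, 0.75 * (1 - t**2), 0.0)

def nadaraya_watson(x0, x, y, lam=0.2):
    """Kernel-weighted average: sum_i K_lam(x0, x_i) y_i / sum_i K_lam(x0, x_i)."""
    w = epanechnikov((x - x0) / lam)
    if w.sum() == 0:              # no training points fall inside the window
        return np.nan
    return np.sum(w * y) / np.sum(w)

# toy data: noisy sine curve, smoothed on a grid of target points
rng = np.random.default_rng(0)
x = np.sort(rng.uniform(0, 1, 100))
y = np.sin(4 * x) + rng.normal(scale=0.3, size=100)
grid = np.linspace(0, 1, 50)
fhat = np.array([nadaraya_watson(x0, x, y) for x0 in grid])
```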
One-Dimensional Kernel Smoothers
• More general kernel:
  $K_\lambda(x_0, x) = D\!\left(\dfrac{|x - x_0|}{h_\lambda(x_0)}\right)$
  – $h_\lambda(x_0)$: a width function that determines the width of the neighborhood at $x_0$.
  – For the quadratic kernel, $h_\lambda(x_0) = \lambda$ is constant, so the bias is constant.
  – For the k-NN kernel, $h_k(x_0) = |x_0 - x_{[k]}|$, where $x_{[k]}$ is the k-th closest point to $x_0$; here the variance is constant.
  – The Epanechnikov kernel has compact support.
One-Dimensional Kernel Smoothers
• Three popular kernels for local smoothing, each of the form $K_\lambda(x_0, x) = D\!\left(\dfrac{|x - x_0|}{\lambda}\right)$ (definitions in the sketch below):
• The Epanechnikov kernel and the tri-cube kernel have compact support; the tri-cube kernel additionally has two continuous derivatives at the boundary of its support.
• The Gaussian kernel has infinite support.
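For reference, a small sketch of the three kernels as profile functions D(t) with t = |x - x0| / lam; the Epanechnikov profile is repeated from the earlier sketch so this block stands alone.

```python
import numpy as np

def epanechnikov(t):
    # compact support: 3/4 (1 - t^2) on |t| <= 1
    return np.where(np.abs(t) <= 1, 0.75 * (1 - t**2), 0.0)

def tricube(t):
    # compact support, two continuous derivatives at the boundary of the support
    return np.where(np.abs(t) <= 1, (1 - np.abs(t)**3)**3, 0.0)

def gaussian(t):
    # infinite support: standard normal density
    return np.exp(-0.5 * t**2) / np.sqrt(2 * np.pi)
```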
Local Linear Regression
• Boundary issue:
  – The kernel average is badly biased on the boundaries of the domain, because of the asymmetry of the kernel in that region.
  – Fitting straight lines locally, rather than constants, removes this bias to first order.
Local Linear Regression
• Locally weighted linear regression makes a first-order correction.
• A separate weighted least squares problem is solved at each target point $x_0$:
  $\min_{\alpha(x_0),\,\beta(x_0)}\ \sum_{i=1}^N K_\lambda(x_0, x_i)\,\big[y_i - \alpha(x_0) - \beta(x_0)\, x_i\big]^2$
• The estimate: $\hat f(x_0) = \hat\alpha(x_0) + \hat\beta(x_0)\, x_0$
• With $b(x)^T = (1, x)$, $\mathbf{B}$ the $N \times 2$ regression matrix with $i$-th row $b(x_i)^T$, and $\mathbf{W}(x_0) = \mathrm{diag}\big(K_\lambda(x_0, x_i)\big)$ the $N \times N$ weight matrix, this can be written as (see the sketch below)
  $\hat f(x_0) = b(x_0)^T \big(\mathbf{B}^T \mathbf{W}(x_0)\, \mathbf{B}\big)^{-1} \mathbf{B}^T \mathbf{W}(x_0)\, \mathbf{y} = \sum_{i=1}^N l_i(x_0)\, y_i$
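A minimal sketch of the weighted least squares fit above at a single target point, using Epanechnikov weights; the helper name local_linear and the default bandwidth are my own choices.

```python
import numpy as np

def local_linear(x0, x, y, lam=0.2):
    """Weighted least squares fit of y ~ a + b*x at x0, i.e.
    f_hat(x0) = b(x0)^T (B^T W B)^{-1} B^T W y with Epanechnikov weights.
    Assumes at least two points receive positive weight."""
    t = (x - x0) / lam
    w = np.where(np.abs(t) <= 1, 0.75 * (1 - t**2), 0.0)   # kernel weights
    B = np.column_stack([np.ones_like(x), x])               # N x 2 regression matrix
    W = np.diag(w)                                          # N x N weight matrix
    alpha, beta = np.linalg.solve(B.T @ W @ B, B.T @ W @ y)
    return alpha + beta * x0
```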
Local Linear Regression
• The weights $l_i(x_0)$ combine the weighting kernel $K_\lambda(x_0, \cdot)$ and the least squares operations; they are often called the equivalent kernel.
Local Linear Regression
• Expansion of $E\hat f(x_0)$, using the linearity of local regression and a series expansion of the true function $f$ around $x_0$:
  $E\hat f(x_0) = \sum_{i=1}^N l_i(x_0)\, f(x_i) = f(x_0) \sum_{i=1}^N l_i(x_0) + f'(x_0) \sum_{i=1}^N (x_i - x_0)\, l_i(x_0) + \dfrac{f''(x_0)}{2} \sum_{i=1}^N (x_i - x_0)^2\, l_i(x_0) + R$
  where the remainder $R$ involves third- and higher-order derivatives of $f$.
• For local linear regression, $\sum_{i=1}^N l_i(x_0) = 1$ and $\sum_{i=1}^N (x_i - x_0)\, l_i(x_0) = 0$.
• The bias $E\hat f(x_0) - f(x_0)$ therefore depends only on quadratic and higher-order terms in the expansion of $f$.
Local Polynomial Regression
• Fit local polynomials of any degree d (see the sketch below):
  $\min_{\alpha(x_0),\,\beta_j(x_0),\, j=1,\dots,d}\ \sum_{i=1}^N K_\lambda(x_0, x_i)\,\Big[y_i - \alpha(x_0) - \sum_{j=1}^d \beta_j(x_0)\, x_i^j\Big]^2$
• The estimate: $\hat f(x_0) = \hat\alpha(x_0) + \sum_{j=1}^d \hat\beta_j(x_0)\, x_0^j$
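A sketch of the degree-d generalization, assuming the same Epanechnikov weights as before; expanding in powers of (x_i - x0) is an equivalent parameterization in which the intercept is the fitted value at x0.

```python
import numpy as np

def local_poly(x0, x, y, d=2, lam=0.3):
    """Degree-d local polynomial fit at x0 with Epanechnikov weights.
    Fitting in powers of (x_i - x0) makes the intercept the estimate f_hat(x0)."""
    t = (x - x0) / lam
    w = np.where(np.abs(t) <= 1, 0.75 * (1 - t**2), 0.0)
    B = np.vander(x - x0, N=d + 1, increasing=True)   # columns: 1, (x-x0), ..., (x-x0)^d
    W = np.diag(w)
    coef = np.linalg.solve(B.T @ W @ B, B.T @ W @ y)
    return coef[0]
```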
Local Polynomial Regression
• The bias now has components only of degree d+1 and higher.
• The reduction in bias comes at the cost of increased variance: $\mathrm{var}\big(\hat f(x_0)\big) = \sigma^2\, \|l(x_0)\|^2$, and $\|l(x_0)\|$ increases with d.
Choosing the Kernel Width
• In the kernel $K_\lambda$, $\lambda$ is a parameter that controls the width of the kernel:
  – For a kernel with compact support, $\lambda$ is the radius of the support region.
  – For the Gaussian kernel, $\lambda$ is the standard deviation.
  – For the k-nearest-neighbor kernel, $\lambda$ is the fraction k/N.
• The window width implies a bias-variance trade-off:
  – A narrow window gives a large variance and a small bias.
  – A wide window gives a small variance and a large bias.
Structured Local Regression
• Structured kernels (see the sketch below)
  – Introduce structure by imposing appropriate restrictions on the matrix $A$:
    $K_{\lambda, A}(x_0, x) = D\!\left(\dfrac{(x - x_0)^T A\, (x - x_0)}{\lambda}\right)$
• Structured regression functions
  – Introduce structure by eliminating some of the higher-order interaction terms:
    $f(X_1, X_2, \dots, X_p) = \alpha + \sum_{j} g_j(X_j) + \sum_{k < l} g_{kl}(X_k, X_l)$
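An illustrative sketch of the structured kernel above; passing the quadratic form directly to an Epanechnikov-type profile is one possible choice of D, and the matrix A here is a toy example that ignores the second coordinate.

```python
import numpy as np

def structured_kernel(x0, x, A, lam=1.0):
    """K(x0, x) = D((x - x0)^T A (x - x0) / lam), here with an Epanechnikov-type
    profile; restricting A (e.g. zeroing entries) downweights whole coordinates."""
    t = (x - x0) @ A @ (x - x0) / lam      # quadratic form in the difference
    return 0.75 * (1 - t**2) if abs(t) <= 1 else 0.0

# toy A that ignores the second coordinate entirely
A = np.diag([1.0, 0.0])
print(structured_kernel(np.array([0.0, 0.0]), np.array([0.3, 5.0]), A))
```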
Local Likelihood & Other Models
• Any parametric model can be made local:
  – Parameter associated with $x_i$: $\theta_i = \theta(x_i) = x_i^T \beta$
  – Log-likelihood: $l(\beta) = \sum_{i=1}^N l(y_i, x_i^T \beta)$
  – Model likelihood local to $x_0$:
    $l(\beta(x_0)) = \sum_{i=1}^N K_\lambda(x_0, x_i)\, l\big(y_i, x_i^T \beta(x_0)\big)$
  – A varying coefficient model $\theta(z)$:
    $l(\theta(z_0)) = \sum_{i=1}^N K_\lambda(z_0, z_i)\, l\big(y_i, \eta(x_i, \theta(z_0))\big), \qquad \text{e.g. } \eta(x, \theta) = x^T \theta$
Local Likelihood & Other Models
• Local Logistic Regression (see the sketch below)
  – The J-class logistic model:
    $\Pr(G = j \mid X = x) = \dfrac{\exp\big(\beta_{j0} + \beta_j^T x\big)}{1 + \sum_{k=1}^{J-1} \exp\big(\beta_{k0} + \beta_k^T x\big)}$
  – The local log-likelihood for this J-class model, with the local regressions centered at $x_0$:
    $\sum_{i=1}^N K_\lambda(x_0, x_i)\, \Big\{ \beta_{g_i 0}(x_0) + \beta_{g_i}(x_0)^T (x_i - x_0) - \log\Big[ 1 + \sum_{k=1}^{J-1} \exp\big( \beta_{k0}(x_0) + \beta_k(x_0)^T (x_i - x_0) \big) \Big] \Big\}$
  – The fitted posterior probabilities:
    $\hat\Pr(G = j \mid X = x_0) = \dfrac{\exp\big(\hat\beta_{j0}(x_0)\big)}{1 + \sum_{k=1}^{J-1} \exp\big(\hat\beta_{k0}(x_0)\big)}$
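A sketch of two-class local logistic regression at a point, assuming scikit-learn is available; the tri-cube weighting, the metric bandwidth, and scikit-learn's default L2 regularization are choices made here, not prescribed by the slides.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

def local_logistic(x0, X, g, lam=1.0):
    """Two-class local logistic regression at x0: a weighted logistic fit on the
    centered features (x_i - x0), so the fitted intercept is beta_0_hat(x0) and the
    local posterior estimate is sigmoid(beta_0_hat(x0))."""
    t = np.linalg.norm(X - x0, axis=1) / lam
    w = np.where(t <= 1, (1 - t**3)**3, 0.0)     # tri-cube kernel weights
    keep = w > 0                                 # drop points outside the window
    clf = LogisticRegression().fit(X[keep] - x0, g[keep], sample_weight=w[keep])
    return 1.0 / (1.0 + np.exp(-clf.intercept_[0]))   # Pr_hat(G = 1 | X = x0)

# toy usage: two noisy classes in the plane
rng = np.random.default_rng(1)
X = rng.normal(size=(200, 2))
g = (X[:, 0] + 0.5 * rng.normal(size=200) > 0).astype(int)
print(local_logistic(np.array([0.5, 0.0]), X, g, lam=2.0))
```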
Kernel Density Estimation
• A natural local estimate:
  $\hat f_X(x_0) = \dfrac{\#\{x_i \in N(x_0)\}}{N \lambda}$
  where $N(x_0)$ is a small metric neighborhood of width $\lambda$ around $x_0$.
• The smooth Parzen estimate:
  $\hat f_X(x_0) = \dfrac{1}{N \lambda} \sum_{i=1}^N K_\lambda(x_0, x_i)$
  – For the Gaussian kernel, $K_\lambda(x_0, x) = \phi\big(|x - x_0| / \lambda\big)$.
  – The estimate becomes (see the sketch below):
    $\hat f_X(x_0) = \dfrac{1}{N\, (2 \lambda^2 \pi)^{p/2}} \sum_{i=1}^N \exp\Big(-\tfrac{1}{2} \big(\|x_i - x_0\| / \lambda\big)^2\Big)$
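A sketch of the Gaussian Parzen estimate above; the sample, the bandwidth, and the function name are illustrative.

```python
import numpy as np

def gaussian_kde(x0, X, lam=0.5):
    """Parzen estimate with a p-dimensional Gaussian kernel:
    f_hat(x0) = sum_i exp(-||x_i - x0||^2 / (2 lam^2)) / (N (2 pi lam^2)^(p/2))."""
    X = np.atleast_2d(X)
    N, p = X.shape
    sq = np.sum((X - x0)**2, axis=1)
    return np.sum(np.exp(-0.5 * sq / lam**2)) / (N * (2 * np.pi * lam**2) ** (p / 2))

# toy usage on one-dimensional, blood-pressure-like data
rng = np.random.default_rng(2)
sample = rng.normal(loc=120.0, scale=15.0, size=(500, 1))
print(gaussian_kde(np.array([120.0]), sample, lam=5.0))
```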
Kernel Density Estimation
• A kernel density estimate for systolic blood pressure. The density estimate at each point is the average contribution from each of the kernels at that point.
Kernel Density Classification
• Bayes' theorem:
  $\hat\Pr(G = j \mid X = x_0) = \dfrac{\hat\pi_j\, \hat f_j(x_0)}{\sum_{k=1}^J \hat\pi_k\, \hat f_k(x_0)}$
  where $\hat f_j$ is a kernel density estimate for class $j$ and $\hat\pi_j$ is the class prior (sample proportion); a sketch follows below.
• The estimate for CHD uses the tri-cube kernel with a k-NN bandwidth.
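A sketch of kernel density classification under these definitions: one Gaussian KDE per class, combined with the sample class proportions as priors (the tri-cube / k-NN variant mentioned above would swap in a different kernel).

```python
import numpy as np

def kde_classify(x0, X, g, lam=0.5):
    """Pr_hat(G = j | X = x0) = pi_hat_j f_hat_j(x0) / sum_k pi_hat_k f_hat_k(x0),
    with one Gaussian kernel density estimate fitted per class."""
    X = np.atleast_2d(X)
    N, p = X.shape
    classes = np.unique(g)
    scores = []
    for j in classes:
        Xj = X[g == j]
        pi_j = len(Xj) / N                                    # class prior estimate
        sq = np.sum((Xj - x0)**2, axis=1)
        f_j = np.sum(np.exp(-0.5 * sq / lam**2)) / (len(Xj) * (2 * np.pi * lam**2) ** (p / 2))
        scores.append(pi_j * f_j)
    scores = np.array(scores)
    return classes, scores / scores.sum()
```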
Kernel Density Classification
• The population class densities and the posterior probabilities
Naïve Bayes
• The naïve Bayes model assumes that, given a class $G = j$, the features $X_k$ are independent:
  $f_j(X) = \prod_{k=1}^p f_{jk}(X_k)$
  – $\hat f_{jk}(X_k)$ is a one-dimensional kernel density estimate (or a Gaussian density estimate) for coordinate $X_k$ in class $j$.
  – If $X_k$ is categorical, use a histogram estimate instead.
• The logit transform then has a generalized additive form (a sketch follows below):
  $\log \dfrac{\Pr(G = l \mid X)}{\Pr(G = J \mid X)} = \log \dfrac{\pi_l\, f_l(X)}{\pi_J\, f_J(X)} = \log \dfrac{\pi_l \prod_{k=1}^p f_{lk}(X_k)}{\pi_J \prod_{k=1}^p f_{Jk}(X_k)} = \log \dfrac{\pi_l}{\pi_J} + \sum_{k=1}^p \log \dfrac{f_{lk}(X_k)}{f_{Jk}(X_k)} = \alpha_l + \sum_{k=1}^p g_{lk}(X_k)$
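A sketch of naive Bayes with univariate Gaussian KDEs per coordinate, following the independence assumption above; the small constant guarding the logarithm is an implementation detail.

```python
import numpy as np

def naive_bayes_kde(x0, X, g, lam=0.5):
    """Naive Bayes posterior at x0: within each class, f_j(x) is approximated by a
    product over coordinates of univariate Gaussian kernel density estimates."""
    N, p = X.shape
    classes = np.unique(g)
    log_post = []
    for j in classes:
        Xj = X[g == j]
        log_fj = np.log(len(Xj) / N)                       # log prior
        for k in range(p):                                 # independence over coordinates
            d = (Xj[:, k] - x0[k]) / lam
            fjk = np.mean(np.exp(-0.5 * d**2)) / (lam * np.sqrt(2 * np.pi))
            log_fj += np.log(fjk + 1e-300)                 # guard against log(0)
        log_post.append(log_fj)
    post = np.exp(np.array(log_post) - np.max(log_post))
    return classes, post / post.sum()
```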
Radial Basis Functions & Kernels
• Radial basis functions treat kernel functions as basis functions, combining the locality of kernel methods with the flexibility of basis expansions:
  $f(x) = \sum_{j=1}^M K_{\lambda_j}(\xi_j, x)\, \beta_j = \sum_{j=1}^M D\!\left(\dfrac{\|x - \xi_j\|}{\lambda_j}\right) \beta_j$
  – Each basis element is indexed by a location (prototype) parameter $\xi_j$ and a scale parameter $\lambda_j$.
  – A popular choice for $D$ is the standard Gaussian density function.
Radial Basis Functions & Kernels
• For simplicity, focus on least squares methods for regression, and use the Gaussian kernel.
• The RBF network model is fit by solving
  $\min_{\{\lambda_j,\, \xi_j,\, \beta_j\}_{1}^{M}}\ \sum_{i=1}^N \Big( y_i - \beta_0 - \sum_{j=1}^M \beta_j \exp\Big\{ -\dfrac{(x_i - \xi_j)^T (x_i - \xi_j)}{\lambda_j^2} \Big\} \Big)^2$
• In practice the $\{\lambda_j, \xi_j\}$ are often estimated separately from the $\beta_j$ (see the sketch below).
• An undesirable side effect of this simplification is the creation of holes: regions of $\mathbb{R}^p$ where none of the kernels has appreciable support.
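A sketch of the simplified fitting strategy just described, assuming the prototypes are chosen in an unsupervised way (here simply random training points) and a common width lam, after which the beta coefficients are obtained by ordinary least squares.

```python
import numpy as np

def fit_rbf(X, y, M=10, lam=1.0, seed=None):
    """Fit f(x) = beta_0 + sum_j beta_j exp(-||x - xi_j||^2 / lam^2): the prototypes
    xi_j are chosen unsupervised (random training points here), lam is fixed, and the
    beta are then obtained by ordinary least squares."""
    rng = np.random.default_rng(seed)
    centers = X[rng.choice(len(X), size=M, replace=False)]

    def basis(Z):
        sq = np.sum((Z[:, None, :] - centers[None, :, :])**2, axis=2)
        return np.column_stack([np.ones(len(Z)), np.exp(-sq / lam**2)])

    beta, *_ = np.linalg.lstsq(basis(X), y, rcond=None)
    return lambda Z: basis(np.atleast_2d(Z)) @ beta

# toy usage
rng = np.random.default_rng(3)
X = rng.uniform(-3, 3, size=(200, 1))
y = np.sin(X[:, 0]) + 0.1 * rng.normal(size=200)
model = fit_rbf(X, y, M=15, lam=1.0, seed=0)
print(model(np.array([[0.0], [1.5]])))
```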
Radial Basis Functions & Kernels
• Gaussian radial basis functions with a fixed width can leave holes; renormalized Gaussian radial basis functions produce basis functions similar in some respects to B-splines.
• Renormalized radial basis functions (sketch below):
  $h_j(x) = \dfrac{D\big(\|x - \xi_j\| / \lambda\big)}{\sum_{k=1}^M D\big(\|x - \xi_k\| / \lambda\big)}$
• The Nadaraya-Watson kernel regression estimator can be viewed as an expansion in renormalized radial basis functions:
  $\hat f(x_0) = \sum_{i=1}^N y_i\, \dfrac{K_\lambda(x_0, x_i)}{\sum_{i'=1}^N K_\lambda(x_0, x_{i'})} = \sum_{i=1}^N y_i\, h_i(x_0)$
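A short sketch of the renormalized basis functions with a Gaussian profile; the choice of profile and the 1/2 factor in the exponent are mine.

```python
import numpy as np

def renormalized_basis(X, centers, lam=1.0):
    """h_j(x) = D(||x - xi_j|| / lam) / sum_k D(||x - xi_k|| / lam): at every x the M
    basis functions sum to one, which avoids holes between the prototypes."""
    sq = np.sum((X[:, None, :] - centers[None, :, :])**2, axis=2)
    D = np.exp(-0.5 * sq / lam**2)             # Gaussian profile
    return D / D.sum(axis=1, keepdims=True)    # rows sum to one
```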
Mixture Models & EM
• Gaussian mixture model:
  $f(x) = \sum_{m=1}^M \alpha_m\, \phi(x;\, \mu_m, \Sigma_m)$
  – The $\alpha_m$ are the mixture proportions, with $\sum_m \alpha_m = 1$.
• EM algorithm for mixtures (two-component case):
  – Given observations $x_1, x_2, \dots, x_N$, the log-likelihood is
    $l(\theta;\, x) = \sum_{i=1}^N \log\big[ (1 - \pi)\, \phi_{\theta_1}(x_i) + \pi\, \phi_{\theta_2}(x_i) \big]$
    which is hard to maximize directly, because of the sum inside the logarithm.
  – Suppose instead we also observe latent binary variables $z_i$, with $z_i = 1$ meaning $x_i \sim \phi_{\theta_2}$ and $z_i = 0$ meaning $x_i \sim \phi_{\theta_1}$. The complete-data log-likelihood is then easy to maximize:
    $l_0(\theta;\, x, z) = \sum_{i=1}^N \big[ (1 - z_i) \log \phi_{\theta_1}(x_i) + z_i \log \phi_{\theta_2}(x_i) \big] + \sum_{i=1}^N \big[ (1 - z_i) \log(1 - \pi) + z_i \log \pi \big]$
Mixture Models & EM
• Given the current parameter estimates $\hat\theta$, compute the responsibilities (E-step):
  $\hat\gamma_i = E\big(z_i \mid \hat\theta, x\big) = \Pr\big(z_i = 1 \mid \hat\theta, x_i\big)$
  and then maximize the expected complete-data log-likelihood $E\big[l_0(\theta;\, x, z) \mid \hat\theta, x\big]$ over $\theta$ (M-step).
• In the two-component Gaussian example:
  $\hat\gamma_i = \dfrac{\hat\pi\, \phi_{\hat\theta_2}(x_i)}{(1 - \hat\pi)\, \phi_{\hat\theta_1}(x_i) + \hat\pi\, \phi_{\hat\theta_2}(x_i)}$
  and the M-step updates are weighted means and variances with weights $\hat\gamma_i$ (or $1 - \hat\gamma_i$), e.g. $\hat\mu_2 = \dfrac{\sum_i \hat\gamma_i\, x_i}{\sum_i \hat\gamma_i}$, together with $\hat\pi = \dfrac{1}{N} \sum_i \hat\gamma_i$; see the sketch below.
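A sketch of the two-component EM iteration described above for univariate data; the initialization and iteration count are arbitrary choices.

```python
import numpy as np

def em_two_gaussians(x, n_iter=50):
    """EM for a two-component univariate Gaussian mixture: the E-step computes the
    responsibilities gamma_i, the M-step re-estimates weighted means, variances and
    the mixing proportion pi."""
    mu1, mu2 = np.percentile(x, [25, 75])      # crude initialization from the data
    s1 = s2 = np.var(x)
    pi = 0.5
    phi = lambda v, mu, s: np.exp(-0.5 * (v - mu)**2 / s) / np.sqrt(2 * np.pi * s)
    for _ in range(n_iter):
        # E-step: gamma_i = Pr(z_i = 1 | theta_hat, x_i)
        g = pi * phi(x, mu2, s2) / ((1 - pi) * phi(x, mu1, s1) + pi * phi(x, mu2, s2))
        # M-step: weighted maximum likelihood updates
        mu1, mu2 = np.sum((1 - g) * x) / np.sum(1 - g), np.sum(g * x) / np.sum(g)
        s1 = np.sum((1 - g) * (x - mu1)**2) / np.sum(1 - g)
        s2 = np.sum(g * (x - mu2)**2) / np.sum(g)
        pi = np.mean(g)
    return mu1, s1, mu2, s2, pi

# toy usage: two overlapping Gaussian clusters
rng = np.random.default_rng(4)
x = np.concatenate([rng.normal(0.0, 1.0, 300), rng.normal(4.0, 1.0, 200)])
print(em_two_gaussians(x))
```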
Mixture Models & EM
• Application of mixtures to the heart disease risk factor study.
Mixture Models & EM
• Mixture model used for classification of the simulated data