
Machine Learning Algorithms (IFT6266 A07)
Prof. Douglas Eck, Université de Montréal

These slides follow closely the (English) course textbook Pattern Recognition and Machine Learning by Christopher Bishop


A note: Methods

• We (perhaps unwisely) skipped Bishop 2.1–2.4 until just before our graphical models lectures

• Give it a look.


Nonparametric Methods

• Parametric methods: fit a small number of parameters to a data set.

• Must make assumptions about the underlying distribution of data.

• If those assumptions are wrong, the model is flawed, e.g. trying to fit a Gaussian model to multimodal data.

• Nonparametric methods: fit a large number of parameters to a data set.

• Number of parameters scales with number of data points.

• In general one parameter per data point

• Main advantage is that we need to make only very weak assumptions about the underlying distribution of the data.


Histograms

[Figure: histogram density estimates of the same data for three different bin widths]

\[ p_i = \frac{n_i}{N \Delta_i}, \qquad \int p(x)\,\mathrm{d}x = 1 \]

Histograms

• Discontinuities at bin boundaries

• Curse of dimensionality: if we divide each variable in a D-dimensional space into M bins, we get M^D bins

• Locality is important (density defined via evaluation of points in a local neighborhood)

• Smoothing is governed by the bin width (equivalently, the number of bins); we want neither “too much” nor “too little” smoothing (a sketch follows this list)

• Relates to model complexity and regularization in parametric modeling
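As a concrete illustration of the estimator above and of the effect of the bin count, here is a minimal NumPy sketch (the function names, the toy bimodal sample, and the choice of bin counts are my own, not from the slides):

import numpy as np

def histogram_density(data, num_bins, low=0.0, high=1.0):
    # Histogram estimator: p_i = n_i / (N * Delta_i), Delta_i = bin width.
    counts, edges = np.histogram(data, bins=num_bins, range=(low, high))
    bin_width = edges[1] - edges[0]
    return counts / (len(data) * bin_width), edges

# Toy bimodal data on [0, 1].
rng = np.random.default_rng(0)
data = np.concatenate([rng.normal(0.3, 0.06, 300), rng.normal(0.7, 0.10, 200)])
data = data[(data >= 0.0) & (data <= 1.0)]

for m in (5, 12, 50):  # too much smoothing, about right, too little
    density, _ = histogram_density(data, m)
    print(m, density.sum() * (1.0 / m))  # integrates to 1 for any bin count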


Kernel density estimators

• Assume Euclidean distance; we observe data drawn from an unknown density p(x)

• Consider a small region R containing x. The probability mass of the region is:

• The number K of the N observed points that fall inside R follows a binomial distribution:

• Recall from Chapter 2.1 on binary variables:

\[ P = \int_{R} p(\mathbf{x})\,\mathrm{d}\mathbf{x} \]

\[ \mathrm{Bin}(K \mid N, P) = \frac{N!}{K!\,(N-K)!}\, P^{K} (1-P)^{N-K} \]

\[ p(x = 1 \mid \mu) = \mu \]

\[ \mathrm{Bin}(m \mid N, \mu) = \binom{N}{m} \mu^{m} (1-\mu)^{N-m}, \qquad \binom{N}{m} \equiv \frac{N!}{(N-m)!\,m!} \]

\[ \mathbb{E}[m] \equiv \sum_{m=0}^{N} m\,\mathrm{Bin}(m \mid N, \mu) = N\mu \]

\[ \mathrm{var}[m] \equiv \sum_{m=0}^{N} \left(m - \mathbb{E}[m]\right)^{2}\,\mathrm{Bin}(m \mid N, \mu) = N\mu(1-\mu) \]
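A quick numerical sanity check of the mean and variance used on the next slide (a hedged NumPy sketch; the values of N, P, and the number of simulated trials are arbitrary choices):

import numpy as np

# K ~ Bin(N, P): number of the N observed points that land inside region R.
N, P, trials = 100, 0.2, 200_000
rng = np.random.default_rng(1)
K = rng.binomial(N, P, size=trials)

print((K / N).mean())  # ~ P          = 0.2
print((K / N).var())   # ~ P(1-P)/N   = 0.0016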

Kernel density estimators

• Thus we see that mean fraction of points falling into region is E[K/N]=P

• Variance is var[K/N] = P(1-P) / N

• For large N, distribution will be sharply peaked around the mean so K ≅ NP

• If region R is sufficiently small that the density p(x) is roughly constant over it, then P ≃ p(x)V, where V is the volume of R

• Combining these results gives p(x) ≃ K / (NV) (a small worked example follows this list)

• This rests on contradictory assumptions: R must be small enough that the density is roughly constant over it, yet large enough that the number K of points falling inside it gives a sharply peaked binomial distribution

• If we fix K and determine V we get K-nearest-neighbors

• If we fix V and determine K we get the kernel density estimator
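For concreteness, a small worked example with invented numbers: with N = 1000 observations, if a region of volume V = 0.01 around x contains K = 25 points, then

\[ p(\mathbf{x}) \simeq \frac{K}{NV} = \frac{25}{1000 \times 0.01} = 2.5 \]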


Kernel density estimators

• Wish to determine density around x in region R centered on x.

• We will count points using the kernel

\[ k(\mathbf{u}) = \begin{cases} 1, & |u_i| \le 1/2, \ \ i = 1, \dots, D \\ 0, & \text{otherwise} \end{cases} \]

• k(u) is a kernel function, here called a Parzen window

• Total number of data points inside the cube of side h centered on x:

• Substituting into p(x) = K / (NV), where the volume of a hypercube of side h in D dimensions is V = h^D, gives:

\[ K = \sum_{n=1}^{N} k\!\left(\frac{\mathbf{x} - \mathbf{x}_n}{h}\right) \]

\[ p(\mathbf{x}) = \frac{1}{N} \sum_{n=1}^{N} \frac{1}{h^{D}}\, k\!\left(\frac{\mathbf{x} - \mathbf{x}_n}{h}\right) \]
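A minimal NumPy sketch of this hypercube (Parzen window) estimator, assuming the data sit row-wise in an (N, D) array; the function and variable names are my own:

import numpy as np

def parzen_hypercube_density(x, data, h):
    # p(x) = (1/N) sum_n (1/h^D) k((x - x_n)/h) with the unit-cube kernel.
    N, D = data.shape
    u = (x - data) / h                          # shape (N, D)
    inside = np.all(np.abs(u) <= 0.5, axis=1)   # k(u) = 1 iff |u_i| <= 1/2 for all i
    K = np.count_nonzero(inside)
    return K / (N * h**D)

rng = np.random.default_rng(0)
data = rng.normal(size=(500, 2))
print(parzen_hypercube_density(np.zeros(2), data, h=0.4))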

Parzen estimator

• Model suffers from discontinuities at hypercube boundaries

• Substitute Gaussian (where h represents standard deviation of Gaussian components):

• As expected, h acts to smooth; there is a tradeoff between noise sensitivity and oversmoothing

• Any kernel can be used provided k(u) ≥ 0 and ∫ k(u) du = 1

• No computation for training phase

• But must store entire data set to evaluate

\[ p(\mathbf{x}) = \frac{1}{N} \sum_{n=1}^{N} \frac{1}{(2\pi h^{2})^{D/2}} \exp\!\left( -\frac{\|\mathbf{x} - \mathbf{x}_n\|^{2}}{2h^{2}} \right) \]

[Figure: Gaussian kernel density estimates of the same data for three values of h]
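The same estimator with the Gaussian kernel substituted, again a hedged sketch with assumed names, sweeping h to show the noise/oversmoothing tradeoff:

import numpy as np

def gaussian_kde_at(x, data, h):
    # p(x) = (1/N) sum_n (2*pi*h^2)^(-D/2) exp(-||x - x_n||^2 / (2 h^2))
    N, D = data.shape
    sq_dist = np.sum((x - data) ** 2, axis=1)
    return np.exp(-sq_dist / (2 * h**2)).mean() / (2 * np.pi * h**2) ** (D / 2)

rng = np.random.default_rng(0)
data = rng.normal(size=(500, 2))
for h in (0.05, 0.3, 1.0):  # too noisy, reasonable, oversmoothed
    print(h, gaussian_kde_at(np.zeros(2), data, h))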

Nearest neighbor methods

• One weakness of kernel density estimation is that the kernel width h is fixed for all kernels

• Fix K and determine V

• Consider a small sphere centered on x and allow it to grow until it contains precisely K points. The density estimate is given by the same formula: p(x) = K / (NV)

• K governs the degree of smoothing. Compare the Parzen estimates (left) to KNN (right)

[Figure: Parzen estimates (left) and K-nearest-neighbour estimates (right) of the same data for three settings of the smoothing parameter]
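A corresponding sketch of the K-nearest-neighbour density estimate (Euclidean distance and all names are assumptions): grow the sphere around x until it holds exactly K points, then apply p(x) = K/(NV) with V the volume of that sphere:

import numpy as np
from math import gamma, pi

def knn_density(x, data, K):
    # p(x) = K / (N * V), with V the smallest ball around x containing K points.
    N, D = data.shape
    dists = np.sort(np.linalg.norm(data - x, axis=1))
    r = dists[K - 1]                                 # radius reaching the K-th neighbour
    V = pi ** (D / 2) / gamma(D / 2 + 1) * r ** D    # volume of a D-ball of radius r
    return K / (N * V)

rng = np.random.default_rng(0)
data = rng.normal(size=(500, 2))
for K in (1, 5, 30):  # small K: noisy estimate; large K: smoother
    print(K, knn_density(np.zeros(2), data, K))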

Classification using KNN

• We can do classification by applying KNN density estimation to each class and then applying Bayes' theorem:

• To classify a new point x, draw a sphere containing precisely K points, irrespective of class.

• Suppose sphere contains Kk points from class Ck. Then model p(x) = K/NV provides a density estimate:

• We also obtain unconditional density:

• With class priors:

• Applying Bayes’ theorem yields:

\[ p(\mathbf{x} \mid \mathcal{C}_k) = \frac{K_k}{N_k V} \]

\[ p(\mathbf{x}) = \frac{K}{N V} \]

\[ p(\mathcal{C}_k) = \frac{N_k}{N} \]

\[ p(\mathcal{C}_k \mid \mathbf{x}) = \frac{p(\mathbf{x} \mid \mathcal{C}_k)\, p(\mathcal{C}_k)}{p(\mathbf{x})} = \frac{K_k}{K} \]
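A sketch of the resulting classifier (array names and the toy two-class data are my own): the posterior p(C_k | x) is simply the fraction K_k / K among the K nearest training points:

import numpy as np

def knn_posteriors(x, data, labels, K, num_classes):
    # p(C_k | x) = K_k / K over the K nearest training points.
    nearest = np.argsort(np.linalg.norm(data - x, axis=1))[:K]
    counts = np.bincount(labels[nearest], minlength=num_classes)
    return counts / K

def knn_classify(x, data, labels, K, num_classes):
    return int(np.argmax(knn_posteriors(x, data, labels, K, num_classes)))

rng = np.random.default_rng(0)
data = np.vstack([rng.normal(0.0, 1.0, (50, 2)), rng.normal(3.0, 1.0, (50, 2))])
labels = np.repeat([0, 1], 50)
print(knn_posteriors(np.array([2.5, 2.5]), data, labels, K=5, num_classes=2))
print(knn_classify(np.array([2.5, 2.5]), data, labels, K=5, num_classes=2))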

Classification using KNN

• To minimize the misclassification rate, always choose the class with the largest K_k/K. For K = 1 this yields a decision boundary composed of hyperplane segments that are perpendicular bisectors of pairs of points from different classes

[Figure: KNN decision boundaries in the (x1, x2) plane. Left (a): K = 3, Right (b): K = 1]

KNN Example

• K acts as a regularizer.

• Tree-based search can be used to find approximate nearest neighbours efficiently (a sketch follows the figure below).

[Figure: KNN classification of the oil flow dataset. Left: K = 1, Middle: K = 3, Right: K = 31]
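As the last bullet suggests, a tree structure can speed up the neighbour search; a minimal sketch using SciPy's cKDTree (this particular library choice is mine, not the slides'):

import numpy as np
from scipy.spatial import cKDTree

rng = np.random.default_rng(0)
data = rng.normal(size=(10_000, 2))

tree = cKDTree(data)                        # built once over the training set
dists, idx = tree.query(np.zeros(2), k=3)   # 3 nearest neighbours of the query point
print(idx, dists)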