Page 1

Machine Learning Algorithms (IFT6266 A07)
Prof. Douglas Eck, Université de Montréal

These slides closely follow the (English) course textbook Pattern Recognition and Machine Learning by Christopher Bishop.


Page 2

A note: Methods

• We (perhaps unwisely) skipped Bishop 2.1–2.4 until just before our graphical models lectures.

• Give it a look…


Page 3

Nonparametric Methods

• Parametric methods: fit a small number of parameters to a data set.

• Must make assumptions about the underlying distribution of data.

• If those assumptions are wrong, the model is flawed, e.g. trying to fit a Gaussian model to multimodal data (see the sketch after this list).

• Nonparametric methods: fit a large number of parameters to a data set.

• Number of parameters scales with number of data points.

• In general, one parameter per data point.

• The main advantage is that we need to make only very weak assumptions about the underlying distribution of the data.
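A minimal sketch of the flawed-assumption example above; the bimodal mixture is made up for illustration, and a single fitted Gaussian ends up placing more density between the two modes (where the data are sparse) than at the modes themselves:

```python
import numpy as np

rng = np.random.default_rng(0)

# Bimodal data: a 50/50 mixture of two well-separated Gaussians (illustrative).
data = np.concatenate([rng.normal(-3.0, 0.5, 500),
                       rng.normal(+3.0, 0.5, 500)])

# Parametric fit: a single Gaussian, i.e. two parameters regardless of N.
mu, sigma = data.mean(), data.std()

def gaussian_pdf(x, mu, sigma):
    return np.exp(-0.5 * ((x - mu) / sigma) ** 2) / (sigma * np.sqrt(2 * np.pi))

print(f"fitted mean = {mu:.2f}, fitted std = {sigma:.2f}")
print("fitted density between the modes (x = 0):", round(gaussian_pdf(0.0, mu, sigma), 3))
print("fitted density at a true mode   (x = 3):", round(gaussian_pdf(3.0, mu, sigma), 3))
```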


Page 4

Histograms

[Figure: histogram density estimates of the same data set for three different bin widths]

p_i = \frac{n_i}{N \Delta_i}, \qquad \int p(x)\,dx = 1
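A minimal sketch of the histogram estimator above (illustrative data and bin count): each bin's density is its count n_i divided by N times its width Δ_i, so the piecewise-constant estimate integrates to one.

```python
import numpy as np

rng = np.random.default_rng(1)
data = rng.beta(2.0, 5.0, size=1000)    # a 1-D sample on [0, 1] (illustrative)

M = 20                                  # number of bins (controls smoothing)
edges = np.linspace(0.0, 1.0, M + 1)
counts, _ = np.histogram(data, bins=edges)

delta = np.diff(edges)                  # bin widths Delta_i
p_hat = counts / (len(data) * delta)    # p_i = n_i / (N * Delta_i)

print(np.sum(p_hat * delta))            # the estimate integrates to 1.0
```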


Page 5

Histograms

• Discontinuities at bin boundaries

• Curse of dimensionality: if we divide each variable in a D-dimensional space into M bins, we get M^D bins (e.g., M = 10 bins per axis in D = 10 dimensions already gives 10^10 bins).

• Locality is important (density defined via evaluation of points in a local neighborhood)

• Smoothing is controlled by the number of bins (equivalently, the bin width); we want neither “too much” nor “too little” smoothing.

• Relates to model complexity and regularization in parametric modeling


Page 6

Kernel density estimators

• Assume Euclidean distance; we observe data drawn from an unknown density p(x).

• Consider a small region R containing x. The probability mass of the region is:

• The number of points falling in R is binomially distributed:

• From Chapter 2.1 on binary variables:


P = \int_R p(\mathbf{x})\,d\mathbf{x}

\mathrm{Bin}(K \mid N, P) = \frac{N!}{K!\,(N-K)!}\, P^K (1-P)^{N-K}

p(x = 1 \mid \mu) = \mu

\mathrm{Bin}(m \mid N, \mu) = \binom{N}{m} \mu^m (1-\mu)^{N-m}, \qquad \binom{N}{m} \equiv \frac{N!}{(N-m)!\,m!}

\mathbb{E}[m] \equiv \sum_{m=0}^{N} m\,\mathrm{Bin}(m \mid N, \mu) = N\mu

\mathrm{var}[m] \equiv \sum_{m=0}^{N} (m - \mathbb{E}[m])^2\,\mathrm{Bin}(m \mid N, \mu) = N\mu(1 - \mu)


Page 7

Kernel density estimators

• Thus the mean fraction of points falling into the region is E[K/N] = P.

• The variance is var[K/N] = P(1-P)/N.

• For large N, the distribution is sharply peaked around the mean, so K ≅ NP.

• If the region R is sufficiently small that the density p(x) is roughly constant over it, then P ≅ p(x)V, where V is the volume of R.

• Combining these, we get p(x) = K/(NV).

• This rests on contradictory assumptions: R must be small enough that the density is roughly constant over it, yet large enough that the number K of points falling into the region gives a sharply peaked distribution.

• If we fix K and determine V from the data, we get K-nearest neighbors.

• If we fix V and determine K from the data, we get the kernel density estimator (both routes are sketched numerically below).
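A minimal numerical sketch of the estimate p(x) = K/(NV) on 1-D data whose true density is known (uniform on [0, 1], so p(x) = 1 everywhere); both routes, fixing V and fixing K, use illustrative values:

```python
import numpy as np

rng = np.random.default_rng(3)
N = 10_000
data = rng.uniform(0.0, 1.0, N)   # true density p(x) = 1 on [0, 1]
x = 0.5                           # point at which to estimate the density
dist = np.abs(data - x)           # distance from every data point to x

# Route 1 -- fix V, count K (kernel-density style): interval of half-width h.
h = 0.05
V = 2 * h
K = np.sum(dist < h)
print("fixed V:", K / (N * V))    # close to the true density 1.0

# Route 2 -- fix K, grow V (K-nearest-neighbor style).
K = 50
V = 2 * np.sort(dist)[K - 1]      # smallest interval around x holding K points
print("fixed K:", K / (N * V))    # close to the true density 1.0
```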


Page 8

Kernel density estimators

• Wish to determine density around x in region R centered on x.

• We count points with the kernel

k(\mathbf{u}) = \begin{cases} 1, & |u_i| \le 1/2, \quad i = 1, \dots, D \\ 0, & \text{otherwise} \end{cases}

• k(u) is a kernel function, here called a Parzen window

• The total number of data points lying inside the cube of side h centered on x, as counted by k(u), is:

• Substituting into p(x) = K/(NV), where the volume of a hypercube of side h in D dimensions is V = h^D, gives:


K = \sum_{n=1}^{N} k\!\left(\frac{\mathbf{x} - \mathbf{x}_n}{h}\right)

p(\mathbf{x}) = \frac{1}{N} \sum_{n=1}^{N} \frac{1}{h^D}\, k\!\left(\frac{\mathbf{x} - \mathbf{x}_n}{h}\right)
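A minimal sketch of the hypercube (Parzen window) estimator above; the 2-D Gaussian data and the bandwidth h are illustrative choices:

```python
import numpy as np

def parzen_hypercube(x, data, h):
    """Top-hat Parzen estimate: count points in the hypercube of side h
    centered on x, then divide by N * h^D."""
    N, D = data.shape
    u = (x - data) / h                        # (x - x_n) / h for every point
    inside = np.all(np.abs(u) <= 0.5, axis=1)
    return inside.sum() / (N * h**D)

rng = np.random.default_rng(4)
data = rng.normal(0.0, 1.0, size=(5000, 2))   # 2-D standard Gaussian sample

print(parzen_hypercube(np.zeros(2), data, h=0.3))  # true p(0,0) = 1/(2*pi) ~ 0.159
```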


Page 9

Parzen estimator

• Model suffers from discontinuities at hypercube boundaries

• Substitute a Gaussian kernel (where h is the standard deviation of the Gaussian components):

• As expected, h acts to smooth: there is a tradeoff between noise sensitivity and oversmoothing.

• Any kernel can be used provided k(u) ≥ 0 and ∫ k(u) du = 1.

• No computation for training phase

• But must store entire data set to evaluate


p(\mathbf{x}) = \frac{1}{N} \sum_{n=1}^{N} \frac{1}{(2\pi h^2)^{D/2}} \exp\!\left( -\frac{\lVert \mathbf{x} - \mathbf{x}_n \rVert^2}{2h^2} \right)

[Figure: Gaussian kernel density estimates of the same data set for three settings of h]
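A minimal sketch of the Gaussian kernel estimator above (illustrative data and bandwidth; D = 2 here):

```python
import numpy as np

def gaussian_kde(x, data, h):
    """Gaussian kernel density estimate at a single point x of shape (D,)."""
    N, D = data.shape
    sq_dist = np.sum((x - data) ** 2, axis=1)                  # ||x - x_n||^2
    k = np.exp(-sq_dist / (2 * h**2)) / (2 * np.pi * h**2) ** (D / 2)
    return k.mean()                                            # (1/N) * sum_n

rng = np.random.default_rng(5)
data = rng.normal(0.0, 1.0, size=(5000, 2))    # 2-D standard Gaussian sample

print(gaussian_kde(np.zeros(2), data, h=0.2))  # close to p(0,0) = 1/(2*pi) ~ 0.159
```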


Page 10

Nearest neighbor methods

• One weakness of kernel density estimation is that the kernel width h is fixed for all kernels.

• Fix K and determine V

• Consider a small sphere centered on x and allow it to grow until it contains precisely K points. The estimate of the density p(x) is given by the same formula: p(x) = K/(NV).

• K governs the degree of smoothing. Compare the Parzen estimates (left) to KNN (right) in the figure below; a numerical sketch follows the figure.


[Figure: Parzen (kernel) density estimates (left) and K-nearest-neighbor estimates (right) of the same data for three smoothing settings each]
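A minimal sketch of the fixed-K estimator; the 1-D Gaussian data and the values of K are illustrative:

```python
import numpy as np

def knn_density(x, data, K):
    """1-D K-nearest-neighbor estimate: grow an interval around x until it
    contains exactly K points, then return p(x) = K / (N * V)."""
    r = np.sort(np.abs(data - x))[K - 1]     # distance to the K-th neighbor
    return K / (len(data) * 2 * r)           # V = 2r for an interval

rng = np.random.default_rng(6)
data = rng.normal(0.0, 1.0, 5000)            # 1-D standard Gaussian sample

# K governs the degree of smoothing: a tiny K gives a noisy, run-dependent
# estimate, while a very large K averages over a wide interval.
# True value at the query point: p(0) = 1/sqrt(2*pi) ~ 0.399.
for K in (2, 50, 2000):
    print(K, round(knn_density(0.0, data, K), 3))
```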


Page 11

Classification using KNN

• We can do classification by applying KNN density estimation to each class and then applying Bayes' theorem:

• To classify a new point x, draw a sphere around x containing precisely K points, irrespective of class.

• Suppose the sphere has volume V and contains K_k points from class C_k. Then the model p(x) = K/(NV) provides a class-conditional density estimate:

• We also obtain the unconditional density:

• And the class priors:

• Applying Bayes' theorem then yields:


p(\mathbf{x} \mid \mathcal{C}_k) = \frac{K_k}{N_k V}

p(\mathbf{x}) = \frac{K}{N V}

p(\mathcal{C}_k) = \frac{N_k}{N}

p(\mathcal{C}_k \mid \mathbf{x}) = \frac{p(\mathbf{x} \mid \mathcal{C}_k)\, p(\mathcal{C}_k)}{p(\mathbf{x})} = \frac{K_k}{K}
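A minimal sketch of the resulting decision rule (assign x to the class with the largest K_k among its K nearest neighbors); the two-class toy data and K are illustrative:

```python
import numpy as np

def knn_classify(x, data, labels, K):
    """Assign x to the class most common among its K nearest neighbors,
    i.e. the class maximizing p(C_k | x) = K_k / K."""
    dist = np.linalg.norm(data - x, axis=1)
    nearest = np.argsort(dist)[:K]            # indices of the K nearest points
    votes = np.bincount(labels[nearest])      # K_k for each class
    return np.argmax(votes)

rng = np.random.default_rng(7)
class0 = rng.normal([-1.0, 0.0], 0.7, size=(200, 2))   # toy 2-D class 0
class1 = rng.normal([+1.0, 0.0], 0.7, size=(200, 2))   # toy 2-D class 1
data = np.vstack([class0, class1])
labels = np.array([0] * 200 + [1] * 200)

print(knn_classify(np.array([-0.8, 0.1]), data, labels, K=5))   # -> 0
print(knn_classify(np.array([+0.9, -0.2]), data, labels, K=5))  # -> 1
```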


Page 12

Classification using KNN

• To minimize the misclassification rate, always choose the class with the largest K_k/K. For K=1, this yields a decision boundary composed of hyperplanes that form perpendicular bisectors of pairs of points from different classes.


[Figure: KNN classification in (x1, x2) space. Left: K = 3, Right: K = 1]


Page 13

KNN Example

• K acts as a regularizer.

• Tree-based search can be used to find (approximate) nearest neighbors efficiently; a sketch follows.
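A minimal sketch of tree-based neighbor search; scipy's KDTree is my choice of data structure here (the slides do not name a library), and the data, labels, and K are illustrative:

```python
import numpy as np
from scipy.spatial import KDTree

rng = np.random.default_rng(8)
data = rng.normal(size=(10_000, 3))        # 10,000 points in 3-D
labels = rng.integers(0, 2, 10_000)        # toy class labels

tree = KDTree(data)                        # build the tree once

x = np.zeros(3)
dist, idx = tree.query(x, k=31)            # 31 nearest neighbors of x (exact here)
print(np.bincount(labels[idx]).argmax())   # majority vote, as on the slide
```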


[Figure: KNN classification of the oil dataset. Left: K = 1, Middle: K = 3, Right: K = 31]


