Machine Learning Algorithms (IFT6266 A07)
Prof. Douglas Eck, Université de Montréal
These slides follow closely the (English) course textbook Pattern Recognition and Machine Learning
by Christopher Bishop
A note: Methods
• We (perhaps unwisely) skipped Bishop 2.1–2.4 until just before our graphical models lectures
• Give it a look...
Nonparametric Methods
• Parametric methods: fit a small number of parameters to a data set.
• Must make assumptions about the underlying distribution of data.
• If those assumptions are wrong, the model is flawed, e.g. trying to fit a Gaussian model to multimodal data.
• Nonparametric methods: fit a large number of parameters to a data set.
• Number of parameters scales with number of data points.
• In general, one parameter per data point
• Main advantage is that we need make only very weak assumptions about the underlying distribution of data.
Histograms
[Figure: histogram density estimates of the same data set for three different bin widths Δ]

Partition x into bins of width Δi; with ni of the N observations falling in bin i, the density estimate is:

$$p_i = \frac{n_i}{N \Delta_i}, \qquad \int p(x)\,dx = 1$$
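A minimal numpy sketch of this estimator (function and variable names are my own, not from the text):

```python
import numpy as np

def histogram_density(data, n_bins=10, lo=0.0, hi=1.0):
    """Histogram density estimate: p_i = n_i / (N * delta_i)."""
    edges = np.linspace(lo, hi, n_bins + 1)
    counts, _ = np.histogram(data, bins=edges)    # n_i per bin
    delta = np.diff(edges)                        # bin widths delta_i
    p = counts / (len(data) * delta)              # p_i = n_i / (N delta_i)
    assert np.isclose(np.sum(p * delta), 1.0)     # estimate integrates to 1
    return edges, p

# Example: bimodal data, where a single-Gaussian fit would fail
rng = np.random.default_rng(0)
x = np.clip(np.concatenate([rng.normal(0.3, 0.05, 500),
                            rng.normal(0.7, 0.10, 500)]), 0.0, 1.0)
edges, p = histogram_density(x, n_bins=20)
```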
Histograms
• Discontinuities at bin boundaries
• Curse of dimensionality: if we divide each variable in a D-dimensional space into M bins, we get $M^D$ bins
• Locality is important (density defined via evaluation of points in a local neighborhood)
• Smoothing achieved by bin count; we want neither “too much” nor “too little” smoothing
• Relates to model complexity and regularization in parametric modeling
Kernel density estimators
• Assume a Euclidean space; we observe data drawn from an unknown density p(x)
• Consider small region R containing x. Mass for region is:
• Distribution of points is binomial:
• From Chapter 2.1 on binary variables:
$$P = \int_R p(\mathbf{x})\,d\mathbf{x}$$

$$\mathrm{Bin}(K|N,P) = \frac{N!}{K!\,(N-K)!}\,P^K (1-P)^{N-K}$$

$$p(x=1|\mu) = \mu$$

$$\mathrm{Bin}(m|N,\mu) = \binom{N}{m} \mu^m (1-\mu)^{N-m}, \qquad \binom{N}{m} \equiv \frac{N!}{(N-m)!\,m!}$$

$$\mathbb{E}[m] \equiv \sum_{m=0}^{N} m\,\mathrm{Bin}(m|N,\mu) = N\mu$$

$$\mathrm{var}[m] \equiv \sum_{m=0}^{N} (m - \mathbb{E}[m])^2\,\mathrm{Bin}(m|N,\mu) = N\mu(1-\mu)$$
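A quick Monte Carlo check of the last two identities (my own sketch; assumes SciPy for the normal CDF):

```python
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(1)
N, trials = 1000, 5000
a, b = -0.5, 0.5                       # region R = [a, b]
P = norm.cdf(b) - norm.cdf(a)          # P = integral of p(x) over R

# K ~ Bin(N, P): count how many of N standard-normal draws land in R
X = rng.standard_normal((trials, N))
K = ((X >= a) & (X <= b)).sum(axis=1)

print(K.mean(), N * P)                 # E[K]   ~= N P
print(K.var(), N * P * (1 - P))        # var[K] ~= N P (1 - P)
```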
Kernel density estimators
• Thus we see that mean fraction of points falling into region is E[K/N]=P
• Variance is var[K/N] = P(1-P) / N
• For large N, distribution will be sharply peaked around the mean so K ≅ NP
• If region R is sufficiently small that the density p(x) is roughly constant over it, then P ≅ p(x)V, where V is the volume of R
• Combining, we get p(x) = K / (NV)
• Depends on contradictory assumptions: R must be small enough that the density is approximately constant over the region, yet large enough that the number K of points falling inside it yields a sharply peaked distribution
• If we fix K and determine V we get K-nearest-neighbors
• If we fix V and determine K, we get the kernel density estimator
Kernel density estimators
• Wish to determine density around x in region R centered on x.
• We will count points using:

$$k(\mathbf{u}) = \begin{cases} 1, & |u_i| \le 1/2, \quad i = 1,\ldots,D \\ 0, & \text{otherwise} \end{cases}$$
• k(u) is a kernel function, here called a Parzen window
• Total number of data points inside a cube of side h, defined by k(u):
• Substituting into p(x) = K / (NV), where the volume of a hypercube of side h in D dimensions is $V = h^D$:
$$K = \sum_{n=1}^{N} k\!\left(\frac{\mathbf{x} - \mathbf{x}_n}{h}\right)$$

$$p(\mathbf{x}) = \frac{1}{N} \sum_{n=1}^{N} \frac{1}{h^D}\, k\!\left(\frac{\mathbf{x} - \mathbf{x}_n}{h}\right)$$
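A minimal sketch of this estimator (my own names; the kernel is the unit hypercube above):

```python
import numpy as np

def parzen_hypercube(x, data, h):
    """p(x) = K / (N V) with a hypercube (Parzen window) kernel.

    k(u) = 1 if |u_i| <= 1/2 for all i, else 0;  V = h**D.
    """
    x = np.atleast_1d(x)
    data = np.atleast_2d(data)                    # shape (N, D)
    N, D = data.shape
    u = (x - data) / h                            # u_n = (x - x_n) / h
    K = np.sum(np.all(np.abs(u) <= 0.5, axis=1))  # points inside the cube
    return K / (N * h**D)

rng = np.random.default_rng(2)
data = rng.standard_normal((1000, 2))
print(parzen_hypercube(np.zeros(2), data, h=0.5))  # ~1/(2*pi) ~= 0.159
```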
Parzen estimator
• Model suffers from discontinuities at hypercube boundaries
• Substitute Gaussian (where h represents standard deviation of Gaussian components):
• As expected, h acts to smooth: tradeoff between noise sensitivity (h too small) and oversmoothing (h too large)
• Any kernel can be used provided $k(\mathbf{u}) \ge 0$ and $\int k(\mathbf{u})\,d\mathbf{u} = 1$
• No computation for training phase
• But must store entire data set to evaluate
$$p(\mathbf{x}) = \frac{1}{N} \sum_{n=1}^{N} \frac{1}{(2\pi h^2)^{D/2}} \exp\!\left(-\frac{\|\mathbf{x} - \mathbf{x}_n\|^2}{2h^2}\right)$$
[Figure: Gaussian kernel density estimates of the same data for three values of h]
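A corresponding sketch with Gaussian kernels (my own code, a direct transcription of the sum above):

```python
import numpy as np

def gaussian_kde(x, data, h):
    """Kernel density estimate with Gaussian components of width h."""
    x = np.atleast_1d(x)
    data = np.atleast_2d(data)                    # shape (N, D)
    N, D = data.shape
    sq_dist = np.sum((x - data) ** 2, axis=1)     # ||x - x_n||^2
    norm_const = (2.0 * np.pi * h**2) ** (D / 2)  # per-component normalizer
    return np.mean(np.exp(-sq_dist / (2.0 * h**2))) / norm_const

rng = np.random.default_rng(3)
data = rng.standard_normal((1000, 1))
print(gaussian_kde([0.0], data, h=0.2))           # ~0.399 for N(0,1)
```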
Nearest neighbor methods
• One weakness of kernel density estimation is that the kernel width h is fixed for all kernels.
• Fix K and determine V
• Consider a small sphere centered at x; allow the sphere to grow until it contains precisely K points. The estimate of the density p(x) is given by the same formula: p(x) = K / (NV)
• K governs the degree of smoothing. Compare Parzen windows (left) to KNN (right); a code sketch follows the figure
[Figure: Parzen window estimates (left) vs. K-nearest-neighbour density estimates (right) for three smoothing settings]
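A sketch of the fixed-K estimator (my own code; it uses the D-ball volume formula, which is not in the slides):

```python
import numpy as np
from math import gamma, pi

def knn_density(x, data, K):
    """Fix K, grow the sphere: p(x) = K / (N V)."""
    x = np.atleast_1d(x)
    data = np.atleast_2d(data)                     # shape (N, D)
    N, D = data.shape
    dist = np.sqrt(np.sum((x - data) ** 2, axis=1))
    r = np.sort(dist)[K - 1]                       # radius of K-th neighbor
    V = pi ** (D / 2) * r ** D / gamma(D / 2 + 1)  # volume of a D-ball
    return K / (N * V)

rng = np.random.default_rng(4)
data = rng.standard_normal((2000, 1))
print(knn_density([0.0], data, K=30))              # ~0.399 for N(0,1)
```

Note (Bishop 2.5.2): the KNN estimate is not a true density model, since its integral over all space diverges.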
Classification using KNN
• Can do classification by applying KNN density estimation to each class and then applying Bayes' theorem:
• To classify a new point x, draw a sphere containing precisely K points.
• Suppose sphere contains Kk points from class Ck. Then model p(x) = K/NV provides a density estimate:
• We also obtain unconditional density:
• With class priors:
• Applying Bayes’ theorem yields:
$$p(\mathbf{x}|\mathcal{C}_k) = \frac{K_k}{N_k V}$$

$$p(\mathbf{x}) = \frac{K}{NV}$$

$$p(\mathcal{C}_k) = \frac{N_k}{N}$$

$$p(\mathcal{C}_k|\mathbf{x}) = \frac{p(\mathbf{x}|\mathcal{C}_k)\,p(\mathcal{C}_k)}{p(\mathbf{x})} = \frac{K_k}{K}$$
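A brute-force sketch of this rule (my own names and toy data):

```python
import numpy as np

def knn_posterior(x, data, labels, K):
    """p(C_k|x) = K_k / K from the K nearest neighbors of x."""
    x = np.atleast_1d(x)
    dist = np.sqrt(np.sum((x - data) ** 2, axis=1))
    nearest = np.argsort(dist)[:K]                 # indices of K nearest
    classes, counts = np.unique(labels[nearest], return_counts=True)
    return classes[np.argmax(counts)], dict(zip(classes, counts / K))

# Two Gaussian classes in 2-D
rng = np.random.default_rng(5)
data = np.vstack([rng.normal(0.0, 1.0, (100, 2)),
                  rng.normal(3.0, 1.0, (100, 2))])
labels = np.repeat([0, 1], 100)
print(knn_posterior(np.array([2.5, 2.5]), data, labels, K=5))
```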
Classification using KNN
• To minimize the misclassification rate, always choose the class with the largest Kk/K. For K=1, this yields a decision boundary composed of hyperplanes that form perpendicular bisectors of pairs of points from different classes
[Figure: KNN classification in (x1, x2) space. Left: K=3, Right: K=1]
KNN Example
• K acts as a regularizer.
• Tree-based search can be used to find approximate near neighbors efficiently (sketch below).
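For example, SciPy's `scipy.spatial.cKDTree` supports exact queries and, via its `eps` parameter, approximate ones (the toy data below are my own):

```python
import numpy as np
from scipy.spatial import cKDTree

rng = np.random.default_rng(6)
data = rng.standard_normal((10000, 2))
labels = rng.integers(0, 3, size=10000)

tree = cKDTree(data)                       # built once, queried many times
dist, idx = tree.query([0.0, 0.0], k=31)   # 31 nearest neighbors of x
values, counts = np.unique(labels[idx], return_counts=True)
print(values[np.argmax(counts)])           # majority-vote class at x
```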
[Figure: KNN classification of the oil flow dataset. Left: K=1, Middle: K=3, Right: K=31]