ICS583 Non-Parametric density estimation 2
Acknowledgement
• The material in these slides is based on:
• R. O. Duda, P. E. Hart, and D. G. Stork, Pattern Classification, 2nd ed.,
John Wiley & Sons, 2001.
• S. Theodoridis and K. Koutroumbas, Pattern Recognition, 4th ed.,
Academic Press, 2009.
• Pattern Recognition and Machine Learning, C. M. Bishop, Springer,
2006.
• Lectures of Prof. Andrew W. Moore, School of Computer Science,
Carnegie Mellon University
• And other internet resources
Non-Parametric Density Estimation
• Introduction
• Histogram-based estimation
• Parzen windows estimation
• Nearest neighbor estimation
Introduction
• In supervised learning we assumed that the parametric
forms of underlying (class conditional) density functions
were known.
• However, in practical pattern recognition this assumption
may be doubtful.
• Classical parametric densities are unimodal (have a single
local maximum), whereas many practical problems involve
multi-modal densities.
• Non-parametric techniques can be used with arbitrary
distributions and without knowing the parametric form of
the underlying (class conditional) densities.
Density Estimation and Parzen Window
Types of non-parametric methods
1. Estimation of density functions p(x|ωi) using
sample patterns.
2. Estimation of a posteriori probabilities P(ωi|x)
directly based on sample patterns or prototypes.
We will consider two approaches for both types of non-parametric methods above:
1. Parzen-window-based;
2. k-Nearest neighbor-based;
Density estimation
• Density estimation - estimating the probability density function
p(x) based on a given set of training samples D = {x1,...,xn}.
• The estimated density is denoted by p̂(x).
• Assume that the training samples are independent and identically
distributed (i.i.d.) - each has the same probability distribution as
the others and all are mutually independent and distributed
according to p(x).
• The difference between parametric and non-parametric
estimation:
1. In the non-parametric case we try to estimate a function of the
distribution p(x) instead of a parameter vector.
2.We have a finite number of training samples meaning that
there will be some errors in the function estimation.
Histograms
• Histograms are the simplest approach to density estimation.
• The feature space is divided into m equal-sized cells or bins Bi.
• The number ni of training samples falling into each bin is
counted, and the estimate within bin Bi is
p̂(x) = ni / (nV),
where n is the total number of samples and V is the volume of a
cell. (All cells have equal volume, so no index on V is needed.)
• The histogram estimate is not a very good way to estimate
densities, especially when there are many features. It leads
to discontinuous density estimates.
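The binning procedure above can be sketched in a few lines of Python (a minimal illustration using NumPy; the function name `histogram_density` and the sampling setup are invented for the demo, not from the slides):

```python
import numpy as np

def histogram_density(samples, bin_edges):
    """Histogram density estimate: p_hat(x) = n_i / (n * V) for x in bin B_i."""
    samples = np.asarray(samples)
    n = len(samples)
    counts, _ = np.histogram(samples, bins=bin_edges)
    volumes = np.diff(bin_edges)        # cell volume V (bin width in 1-D)
    return counts / (n * volumes)       # estimate within each bin

# 1000 samples from a standard normal, 20 equal-sized bins on [-4, 4]
rng = np.random.default_rng(0)
x = rng.standard_normal(1000)
edges = np.linspace(-4, 4, 21)
p_hat = histogram_density(x, edges)
# The estimate integrates to (approximately) one over the covered range
print(np.sum(p_hat * np.diff(edges)))
```

Note the discontinuity mentioned above: the estimate jumps at every bin edge, which is one reason histograms are a poor estimator in high dimensions.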
Histograms (cont.)
Density estimation
In words : Place a segment of length h at x and count points inside it.
The estimate is
p̂(x) ≈ (1/h)(kN/N), for x̂ − h/2 ≤ x ≤ x̂ + h/2,
where kN is the number of the N samples falling inside the segment.
• If p(x) is continuous, p̂(x) → p(x) as N → ∞, provided that
hN → 0, kN → ∞, and kN/N → 0.
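The segment-counting estimate can be sketched as follows (a minimal Python illustration; the sampling setup and function name are invented for the demo):

```python
import numpy as np

def window_estimate(x_hat, samples, h):
    """p_hat(x_hat) = k / (N * h), with k = #points in [x_hat - h/2, x_hat + h/2]."""
    samples = np.asarray(samples)
    k = np.sum(np.abs(samples - x_hat) <= h / 2)
    return k / (len(samples) * h)

rng = np.random.default_rng(1)
data = rng.standard_normal(10000)
# The true N(0,1) density at 0 is 1/sqrt(2*pi), about 0.399
print(window_estimate(0.0, data, h=0.2))
```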
Density estimation
• Consider the problem of selecting k labeled balls without
replacement from an urn containing n balls.
• In how many different ways may we select those k balls?
There are
C(n, k) = n! / (k! (n − k)!)
combinations (subsets) containing k elements,
where n! = n · (n − 1) · · · 2 · 1 and 0! = 1.
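A quick numeric check of the formula (Python's standard library already provides `math.comb` for exactly this count):

```python
from math import comb, factorial

# Number of ways to choose k balls out of n, ignoring order:
# C(n, k) = n! / (k! (n - k)!)
def n_choose_k(n, k):
    return factorial(n) // (factorial(k) * factorial(n - k))

print(n_choose_k(5, 2))  # 10 ways to pick 2 balls from 5
print(comb(5, 2))        # the same value from the standard library
```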
Density estimation
• Assume that we wish to estimate the value of the density function p at x
based on training samples x1 , . . . , xn.
• If we consider a region R around x, then the probability that a training
sample xj will fall in the region R is
P = ∫R p(x′) dx′ ----- (1)
P is an averaged version of the density function p(x).
Assume that there are k (of n) training samples in the region R.
The probability that exactly k of the n training samples fall into region R
is given by the binomial density
Pk = C(n, k) P^k (1 − P)^(n−k) ----- (2)
Density estimation
The expected value for k is
E[k] = nP ----- (3)
Estimating E[k] by the observed k leads to the estimate P̂ = k/n.
Therefore, the ratio k/n is a good estimate for the probability P and
hence for the density function p.
Assume p(x) is continuous and that the region R is so small that p
does not vary significantly within it; then we can write
∫R p(x′) dx′ ≈ p(x)V, ----- (4)
where x is a point within R and V is the volume enclosed by R.
Combining equations (1), (3), and (4):
p(x) ≈ (k/n) / V ---- (5)
Density estimation …
• Variance
• The smaller the h, the higher the variance.
(Figures: estimates with h = 0.1, N = 1000 and h = 0.8, N = 1000.)
Density estimation …
h = 0.1, N = 10000
The larger the N, the better the accuracy.
Example :
CURSE OF DIMENSIONALITY
• In all the methods so far, we saw that the higher the
number of points N, the better the resulting estimate.
• If in the one-dimensional space an interval filled with N
points is adequate (for good estimation), in the two-
dimensional space the corresponding square will require
N^2 points, and in the ℓ-dimensional space the ℓ-dimensional
cube will require N^ℓ points.
• The exponential increase in the number of necessary
points is known as the curse of dimensionality. This is a
major problem one is confronted with in high-
dimensional spaces.
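The exponential growth can be made concrete with a few lines of Python (N = 100 points per axis is an arbitrary choice for illustration):

```python
# If N points give adequate coverage of an interval in 1-D, covering an
# l-dimensional cube at the same resolution needs N**l points.
N = 100
for l in (1, 2, 3, 5, 10):
    print(f"l = {l:2d}: {N**l:.2e} points required")
```

Already at ℓ = 10 the required sample size (10^20) is far beyond any realistic data set.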
Density estimation …
• Condition for convergence
The fraction k/(nV) is a space-averaged value of p(x); the true
p(x) is obtained only in the limit V → 0. With a fixed number of
samples n, however, two problems arise as V → 0:
• If no samples fall in R (k = 0), then lim p̂(x) = 0: an
uninteresting case!
• If one or more samples coincide with x (k ≠ 0), the estimate
diverges, lim p̂(x) → ∞: again an uninteresting case!
Density estimation …
• The volume V needs to approach 0 anyway if we
want to use this estimation
• Practically, V cannot be allowed to become arbitrarily
small, since the number of samples is always limited.
• One will have to accept a certain amount of
variance in the ratio k/n.
Unlimited number of samples
Theoretically, if an unlimited number of samples is
available, we can circumvent this difficulty
To estimate the density of x, we form a sequence of regions
R1, R2,…containing x: the first region contains one sample,
the second two samples and so on.
Let Vn be the volume of Rn, kn the number of samples
falling in Rn and pn(x) be the nth estimate for p(x):
pn(x) = (kn/n)/Vn (7)
Unlimited number of samples
For lim(n→∞) pn(x) = p(x), three conditions are required:
1. lim(n→∞) Vn = 0
2. lim(n→∞) kn = ∞
3. lim(n→∞) kn/n = 0
The first condition assures that the space-averaged P/V converges
to p(x).
The second condition assures that the frequency ratio kn/n
converges to the probability P.
The third condition states that the number of samples in region
Rn is small compared to the total number of samples. This is
required for pn(x) to converge.
Unlimited number of samples
There are two ways of obtaining sequences of regions
that satisfy these constraints (so that pn(x) → p(x)):
1. Shrink an initial region by specifying Vn as some
function of n, e.g. Vn = 1/√n.
This is called "the Parzen-window estimation
method".
2. Specify kn as some function of n, e.g. kn = √n, and let Vn
grow until it encloses kn neighbors of x.
This is called "the kn-nearest-neighbor estimation
method".
Parzen windows - An example
• Assume that the region Rn is a d-dimensional hypercube.
If hn is the length of the side of the hypercube, its volume is
given by Vn = hn^d.
• We can obtain an analytic expression for kn, the number of
samples falling into the hypercube, by defining the
following window function:
φ(u) = 1 if |uj| ≤ 1/2 for j = 1, …, d, and 0 otherwise.
• φ((x − xi)/hn) is equal to unity if xi falls within the hypercube
of volume Vn centered at x, and equal to zero otherwise.
Parzen windows - An example
It follows that φ((x − xi)/hn) = 1 if xi falls in the hypercube
of volume Vn centered at x, and φ((x − xi)/hn) = 0
otherwise.
The number of samples in this hypercube is therefore given by
kn = Σ(i=1…n) φ((x − xi)/hn)
By substituting kn in equation (7), we obtain the following
estimate:
pn(x) = (1/n) Σ(i=1…n) (1/Vn) φ((x − xi)/hn)
pn(x) estimates p(x) as an average of functions of x and the
samples xi (i = 1, …, n).
These functions can be general!
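The hypercube window and the resulting estimate can be sketched as follows (a minimal Python illustration; the sample points are made up for the demo):

```python
import numpy as np

def phi_hypercube(u):
    """Unit hypercube window: 1 if every coordinate satisfies |u_j| <= 1/2."""
    u = np.atleast_2d(u)
    return np.all(np.abs(u) <= 0.5, axis=1).astype(float)

def parzen_hypercube(x, samples, h):
    """p_n(x) = k_n / (n * h**d), with k_n = sum_i phi((x - x_i) / h)."""
    samples = np.atleast_2d(samples)
    n, d = samples.shape
    k_n = phi_hypercube((x - samples) / h).sum()  # samples inside the cube at x
    return k_n / (n * h**d)

# Three 2-D points; only the first two fall in the unit cube centred at the origin
pts = np.array([[0.0, 0.0], [0.1, 0.1], [2.0, 2.0]])
print(parzen_hypercube(np.array([0.0, 0.0]), pts, h=1.0))  # 2/(3*1) = 0.666...
```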
Parzen windows
• The Parzen-window density estimate using n training
samples and the window function φ is defined by
pn(x) = (1/n) Σ(i=1…n) (1/Vn) φ((x − xi)/hn)
• The estimate pn(x) is an average of (window) functions.
Usually the window function has its maximum at the origin,
and its values become smaller as we move further away
from the origin. Each training sample then contributes
to the estimate in accordance with its distance from x.
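With a Gaussian window (introduced on the next slides), the estimator above can be sketched in Python as follows (the data and bandwidth are invented for the demo):

```python
import numpy as np

def parzen_gaussian(x, samples, h):
    """1-D Parzen estimate with Gaussian window phi(u) = exp(-u^2/2)/sqrt(2*pi)."""
    u = (x - np.asarray(samples)) / h
    phi = np.exp(-u**2 / 2) / np.sqrt(2 * np.pi)
    return np.mean(phi / h)

rng = np.random.default_rng(2)
data = rng.standard_normal(5000)
# Close to the true N(0,1) density at 0, which is 1/sqrt(2*pi), about 0.399
print(parzen_gaussian(0.0, data, h=0.3))
```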
Illustration
The behavior of the Parzen-window method:
case where p(x) ~ N(0, 1).
Let φ(u) = (1/√(2π)) exp(−u²/2) and hn = h1/√n (n > 1)
(h1: known parameter).
Thus
pn(x) = (1/n) Σ(i=1…n) (1/hn) φ((x − xi)/hn)
is an average of normal densities centered at the samples xi.
Parzen windows …
Numerical results:
For n = 1 and h1 = 1:
p1(x) = φ(x − x1) = (1/√(2π)) e^(−(x−x1)²/2) ~ N(x1, 1)
For n = 10 and h = 0.1, the contributions of the
individual samples are clearly observable!
Analogous results are also obtained in two dimensions as
illustrated:
Parzen windows …
Case where p(x) = λ1·U(a, b) + λ2·T(c, d) (unknown density:
a mixture of a uniform and a triangle density)
Parzen windows example
Question: Given the set of five data points x1 = 2, x2 =
2.5, x3 = 3, x4 = 1, and x5 = 6, find the Parzen probability
density function (pdf) estimate at x = 3, using a
Gaussian window function with h = 1.
Parzen windows example …
Sample contributions φ(xi − x) = (1/√(2π)) e^(−(xi−x)²/2):
x1 = 2:   (1/√(2π)) e^(−(2−3)²/2)   = 0.2420
x2 = 2.5: (1/√(2π)) e^(−(2.5−3)²/2) = 0.3521
x3 = 3:   (1/√(2π)) e^(−(3−3)²/2)   = 0.3989
x4 = 1:   (1/√(2π)) e^(−(1−3)²/2)   = 0.0540
x5 = 6:   (1/√(2π)) e^(−(6−3)²/2)   = 0.0044
p(x = 3) = (0.2420 + 0.3521 + 0.3989 + 0.0540 + 0.0044)/5 = 0.2103
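The computation above can be checked directly in plain Python (h = 1, as in the question):

```python
import math

samples = [2.0, 2.5, 3.0, 1.0, 6.0]
x = 3.0
h = 1.0  # window width, matching the slide's Gaussian window

# Gaussian window value for each sample: exp(-((xi - x)/h)^2 / 2) / sqrt(2*pi)
kernel_values = [math.exp(-((xi - x) / h) ** 2 / 2) / math.sqrt(2 * math.pi)
                 for xi in samples]
p_hat = sum(kernel_values) / (len(samples) * h)
print([round(v, 4) for v in kernel_values])  # [0.242, 0.3521, 0.3989, 0.054, 0.0044]
print(round(p_hat, 4))                       # 0.2103
```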
Parzen windows notes
• The window width h (equivalently, the volume V) is the most
critical parameter in the Parzen-window approach.
• It is typically selected by cross-validation: a portion of the
training set is held out to form a validation set.
• The classifier is trained using different values of h.
• The h that results in the smallest error on the validation
set is selected as the optimal one.
This technique is commonly used with algorithms that
have parameters to tune.
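For density estimation itself, the same idea can be sketched with held-out log-likelihood instead of classification error (a Python illustration assuming a Gaussian window; the candidate grid and data are invented for the demo):

```python
import numpy as np

def parzen_loglik(h, train, valid):
    """Mean held-out log-likelihood of a Gaussian Parzen estimate with width h."""
    u = (valid[:, None] - train[None, :]) / h
    dens = np.mean(np.exp(-u**2 / 2) / (np.sqrt(2 * np.pi) * h), axis=1)
    return np.mean(np.log(dens + 1e-300))  # guard against log(0)

rng = np.random.default_rng(3)
data = rng.standard_normal(2000)
train, valid = data[:1500], data[1500:]   # hold out part of the data for validation
candidates = [0.01, 0.05, 0.1, 0.3, 0.6, 1.0, 3.0]
best_h = max(candidates, key=lambda h: parzen_loglik(h, train, valid))
print(best_h)
```

Very small h overfits (spiky estimate, poor held-out likelihood) and very large h oversmooths, so the selected h lands in between.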
Classification example
In classifiers based on Parzen-window estimation:
• We estimate the densities for each category and
classify a test point by the label corresponding to the
maximum posterior
• The decision region for a Parzen-window classifier
depends upon the choice of window function as
illustrated in the following figure.
Example of Nearest Neighbor Rule
• Two-class problem: yellow triangles and blue
squares. The circle represents the unknown sample x;
as its nearest neighbor comes from class θ1, it
is labeled as class θ1.
Example of k-NN rule with k = 3
• There are two classes: yellow triangles and blue
squares. The circle represents the unknown sample
x and as two of its nearest neighbors come from
class θ2, it is labeled class θ2.
• The number k should be:
• 1) large, to minimize the probability of misclassifying x;
• 2) small (with respect to the number of samples), so that the
neighbors are close enough to x to give an accurate estimate
of the true class of x.
Kn - Nearest neighbor estimation
• Goal: a solution for the problem of the unknown “best”
window function
• Let the cell volume be a function of the training data
• Center a cell about x and let it grow until it captures kn samples
(kn = f(n))
• These kn samples are called the kn nearest neighbors of x.
• Two possibilities can occur:
• If the density is high near x, the cell will be small, which
provides good resolution.
• If the density is low, the cell will grow large, stopping only
when it reaches higher-density regions.
We can obtain a family of estimates by setting kn = k1√n
and choosing different values for k1.
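The growing-cell estimate can be sketched in 1-D as follows (a Python illustration; the data and the kn = √n choice follow the discussion above, the function name is invented):

```python
import numpy as np

def knn_density(x, samples, k):
    """p_hat(x) = (k/n) / V_n, where V_n is the length of the smallest interval
    centred at x that contains the k nearest samples (1-D case)."""
    dists = np.sort(np.abs(np.asarray(samples) - x))
    radius = dists[k - 1]            # distance to the k-th nearest neighbour
    volume = 2 * radius              # 1-D "volume" of the grown cell
    return (k / len(samples)) / volume

rng = np.random.default_rng(4)
data = rng.standard_normal(5000)
k = int(np.sqrt(len(data)))          # k_n = sqrt(n), i.e. k_1 = 1
# Roughly 1/sqrt(2*pi), about 0.399, in expectation
print(knn_density(0.0, data, k))
```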
Illustration
For kn = √n = 1 (i.e., n = 1), the estimate becomes:
pn(x) = kn / (nVn)
= 1 / V1
= 1 / (2|x − x1|)
Kn - Nearest neighbor estimation …
• k = 10, N = 200
Kn - Nearest neighbor estimation …
◦ Estimation of a posteriori probabilities
Goal: estimate P(ωi | x) from a set of n labeled samples.
◦ Place a cell of volume V around x and capture k samples.
◦ If ki samples amongst the k turn out to be labeled ωi, then:
pn(x, ωi) = ki / (nV)
An estimate for Pn(ωi | x) is:
Pn(ωi | x) = pn(x, ωi) / Σ(j=1…c) pn(x, ωj) = ki / k
ICS583 Non-Parametric density estimation 54
Kn - Nearest neighbor estimation …
• ki/k is the fraction of the samples within the cell that are labeled ωi.
• For minimum error rate, the most frequently represented category
within the cell is selected.
• If k is large and the cell sufficiently small, the performance will
approach the best possible.
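The label-fraction estimate ki/k can be sketched on a 1-D toy example (Python; the data and labels are invented for the demo):

```python
from collections import Counter
import numpy as np

def knn_posteriors(x, samples, labels, k):
    """P_hat(w_i | x) = k_i / k: label fractions among the k nearest samples."""
    samples = np.asarray(samples, dtype=float)
    order = np.argsort(np.abs(samples - x))[:k]   # indices of the k nearest
    counts = Counter(labels[i] for i in order)
    return {label: c / k for label, c in counts.items()}

xs = np.array([0.0, 0.1, 0.2, 0.9, 1.0, 1.1])
ys = ["a", "a", "a", "b", "b", "b"]
print(knn_posteriors(0.15, xs, ys, k=3))  # all 3 nearest neighbours are class "a"
```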
Nearest Neighbor
• The nearest-neighbor rule
• Let Dn = {x1, x2, …, xn} be a set of n labeled prototypes.
• Let x′ ∈ Dn be the closest prototype to a test point x.
• The nearest-neighbor rule for classifying x is
to assign it the label associated with x′.
Nearest Neighbor
• The nearest-neighbor rule is a sub-optimal procedure.
• The nearest-neighbor rule leads to an error rate greater
than the minimum possible: the Bayes rate.
• If the number of prototypes is large (unlimited), the error
rate of the nearest-neighbor classifier is never worse than
twice the Bayes rate (this can be proven!)
• If n → ∞, it is always possible to find x′ sufficiently
close to x so that:
P(ωi | x′) ≈ P(ωi | x)
Bounds on Error Rate of k-Nearest Neighbor Rule
• As k gets larger, the error rate approaches the Bayes rate.
• k should be a small fraction of the total number of
samples.
Nearest Neighbor …
• If P(ωm | x) ≈ 1, then the nearest-neighbor
selection is almost always the same as the Bayes
selection.
Nearest Neighbor …
• The nearest-neighbor classifier effectively partitions the feature
space into cells consisting of all points closer to a given training
point x′ than to any other training point.
• All points in such a cell are thus labeled by the category of
the training point: a Voronoi tessellation of the space.
The k –Nearest-Neighbor Rule
• Classify x by assigning it the label most frequently
represented among the k nearest samples and use a
voting scheme
Example
Example: k = 3 (odd value) and x = (0.10, 0.25)^t.
The closest vectors to x, with their labels, are:
{(0.10, 0.28; ω2), (0.12, 0.20; ω2), (0.15, 0.35; ω1)}
One voting scheme assigns the label ω2 to x, since ω2 is the
most frequently represented.

Prototypes       Labels
(0.15, 0.35)     ω1
(0.10, 0.28)     ω2
(0.09, 0.30)     ω5
(0.12, 0.20)     ω2
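The vote can be reproduced in a few lines of Python (note that with a plain Euclidean metric the three nearest prototypes come out slightly differently from the set listed on the slide, but the winning label is still ω2):

```python
from collections import Counter
import numpy as np

prototypes = np.array([[0.15, 0.35],
                       [0.10, 0.28],
                       [0.09, 0.30],
                       [0.12, 0.20]])
labels = [1, 2, 5, 2]          # class indices (w_1, w_2, w_5, w_2) from the table
x = np.array([0.10, 0.25])

# Euclidean distance to every prototype, take the k = 3 nearest, majority vote
d = np.linalg.norm(prototypes - x, axis=1)
nearest = np.argsort(d)[:3]
vote = Counter(labels[i] for i in nearest)
pred = vote.most_common(1)[0][0]
print(pred)  # class 2 (w_2) wins the vote
```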