INSTANCE BASED APPROACH
KNN Classifier
Instance Based Classification 2
A simple classification technique:
Handed an instance you wish to classify, look around the nearby region to see what other classes are around.
Whichever class is most common, make that the prediction.
8/29/03
[Scatter plot: instances of two classes plotted against X and Y]
K-nearest neighbor: assign the most common class among the K nearest neighbors (like a vote).
KNN CLASSIFIER
How do we train it?
We don't.
Let's get specific.
Train: load the training data.
Classify: read in the instance, find the K nearest neighbors in the training data, and assign the most common class among those K neighbors (like a vote).
Euclidean distance, where a_r is the r-th attribute (dimension):
d(x_i, x_j) ≡ √( Σ_{r=1..n} ( a_r(x_i) − a_r(x_j) )² )
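In code, the distance is a straightforward translation (a minimal sketch, not code from the lecture):

```python
import math

def euclidean_distance(xi, xj):
    # square each per-attribute difference, sum, and take the square root
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(xi, xj)))
```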
How do we find the nearest neighbors?
Naïve approach: exhaustive search.
For the instance to be classified, visit every training sample and calculate its distance, sort by distance, and take the first K in the list.
Voting formula:
f̂(x_q) ← argmax_{v ∈ V} Σ_{i=1..k} δ(v, f(x_i))
where f(x_i) is x_i's class, and δ(a, b) = 1 if a = b; 0 otherwise.
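Putting the exhaustive search and the vote together (a minimal sketch with invented names, not code from the lecture):

```python
from collections import Counter

def knn_classify(train, query, k=3):
    # train: list of (point, label) pairs; query: a tuple of numbers
    # 1. visit every training sample and compute its squared distance
    scored = [(sum((a - b) ** 2 for a, b in zip(p, query)), label)
              for p, label in train]
    # 2. sort by distance and keep the first K
    scored.sort(key=lambda t: t[0])
    top_k = [label for _, label in scored[:k]]
    # 3. assign the most common class among the K neighbors (the vote)
    return Counter(top_k).most_common(1)[0][0]
```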
Classifying is a lot of work.
The work that must be performed: visit every training sample and calculate the distance, then sort. Lots of floating-point calculations.
The classifier puts off the work until it is time to classify.
Lazy
This is known as a "lazy" learning method. A method that does most of its work during the training stage is known as "eager."
Our next classifier, Naïve Bayes, will be eager: training takes a while, but it can classify fast.
Which do you think is better?
Lazy vs. Eager
The difference is where the work happens: during training or during classifying.
The book mentions the k-d tree. From Wikipedia: a k-d tree is a space-partitioning data structure for organizing points in a k-dimensional space. k-d trees are a useful data structure for several applications, such as searches involving a multidimensional search key (e.g. range searches and nearest-neighbor searches). k-d trees are a special case of BSP trees.
If we use such a data structure…
It speeds up classification, but probably slows "training."
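As a rough sketch of the trade-off (a toy k-d tree, not the book's implementation): building the tree is the added "training" cost, and querying it avoids visiting every sample by pruning whole subtrees.

```python
def build_kdtree(points, depth=0):
    # recursively split the points on alternating axes (the "training" cost)
    if not points:
        return None
    axis = depth % len(points[0])
    points = sorted(points, key=lambda p: p[axis])
    mid = len(points) // 2
    return {"point": points[mid], "axis": axis,
            "left": build_kdtree(points[:mid], depth + 1),
            "right": build_kdtree(points[mid + 1:], depth + 1)}

def nearest(node, target, best=None):
    # descend toward the target, backtracking into the far subtree only
    # when the splitting plane is closer than the best point found so far
    if node is None:
        return best
    dist2 = lambda p: sum((a - b) ** 2 for a, b in zip(p, target))
    if best is None or dist2(node["point"]) < dist2(best):
        best = node["point"]
    diff = target[node["axis"]] - node["point"][node["axis"]]
    close, away = ((node["left"], node["right"]) if diff < 0
                   else (node["right"], node["left"]))
    best = nearest(close, target, best)
    if diff ** 2 < dist2(best):  # could a closer point sit on the far side?
        best = nearest(away, target, best)
    return best
```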
How do we choose K?
Choosing K can be a bit of an art.
What if you could include all data points (K = n)? How might you do such a thing?
How can we include all data points? What if we weighted the vote of each training sample by its distance from the point being classified?
Weighted voting formula:
f̂(x_q) ← argmax_{v ∈ V} Σ_{i=1..k} w_i · δ(v, f(x_i))
where w_i ≡ 1 / d(x_q, x_i)², and δ(v, f(x_i)) is 1 if x_i is a member of class v (i.e. where f returns the class of x_i); 0 otherwise.
Weight curve: 1 over distance squared.
Could get less fancy and go linear, but then training data very far away would still have a strong influence.
[Plots: weight vs. distance for w = 1/d², shown at two axis scales]
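The distance-weighted vote over all training samples can be sketched as follows (a toy illustration with invented names; an exact-match query is returned directly because 1/d² blows up at d = 0):

```python
from collections import defaultdict

def weighted_knn_classify(train, query):
    # train: list of (point, label) pairs; every sample votes,
    # weighted by 1 / (distance squared)
    votes = defaultdict(float)
    for point, label in train:
        d2 = sum((a - b) ** 2 for a, b in zip(point, query))
        if d2 == 0:
            return label  # the query coincides with a training point
        votes[label] += 1.0 / d2
    return max(votes, key=votes.get)
```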
Could go more fancy: other radial basis functions, sometimes known as kernel functions.
One of the more common is the Gaussian:
K(d(x, x_t)) = 1/(σ√(2π)) · e^( −d(x, x_t)² / (2σ²) )
[Plot: Gaussian weight vs. distance]
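A sketch of the Gaussian weight as a function of distance (assuming the form above, with σ as a free parameter):

```python
import math

def gaussian_kernel(d, sigma=1.0):
    # Gaussian radial basis function of the distance d:
    # weight falls off smoothly, never reaching zero
    return math.exp(-d ** 2 / (2 * sigma ** 2)) / (sigma * math.sqrt(2 * math.pi))
```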
Issues
The work is back-loaded, and it gets worse the bigger the training data; this can be alleviated with data structures.
What else?
Other issues?
What if only some dimensions contribute to the ability to classify? Differences in the other dimensions would put distance between such a point and the target.
Curse of dimensionality
More is not always better. Two instances might be identical in the important dimensions while the other dimensions are simply random, making the instances seem distant.
From Wikipedia: In applied mathematics, the curse of dimensionality (a term coined by Richard E. Bellman), also known as the Hughes effect or Hughes phenomenon (named after Gordon F. Hughes), refers to the problem caused by the exponential increase in volume associated with adding extra dimensions to a mathematical space.
For example, 100 evenly spaced sample points suffice to sample a unit interval with no more than 0.01 distance between points; an equivalent sampling of a 10-dimensional unit hypercube with a lattice spacing of 0.01 between adjacent points would require 10^20 sample points. Thus, in some sense, the 10-dimensional hypercube can be said to be a factor of 10^18 "larger" than the unit interval. (Adapted from an example by R. E. Bellman.)
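Bellman's numbers can be checked with a one-line computation (an illustrative sketch; the function name is made up for this note):

```python
def lattice_points(dims, per_axis=100):
    # lattice samples needed to keep the same per-axis density in `dims` dimensions
    return per_axis ** dims

# 1 dimension: 100 points; 10 dimensions: 100^10 = 10^20 points
```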
Gene expression data: thousands of genes, but relatively few patients. Is there a curse?
gene →
patient ↓   g1     g2     g3     …    gn     disease
p1          x1,1   x1,2   x1,3   …    x1,n   Y
p2          x2,1   x2,2   x2,3   …    x2,n   N
⋮
pm          xm,1   xm,2   xm,3   …    xm,n   ?
Can it classify discrete data?
Think of discrete data as being pre-binned.
Remember the RNA classification: the data in each dimension was A, C, U, or G.
How do we measure distance? A might be closer to G than to C or U (A and G are both purines, while C and U are pyrimidines). Dimensional distance becomes domain specific.
Representation becomes all-important: if the data could be arranged appropriately, we could use techniques like the Hamming distance.
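A Hamming-distance sketch for such pre-binned sequences (a toy helper, not from the lecture):

```python
def hamming_distance(s1, s2):
    # count the positions at which the corresponding symbols differ
    if len(s1) != len(s2):
        raise ValueError("sequences must be the same length")
    return sum(c1 != c2 for c1, c2 in zip(s1, s2))
```

Note this treats every mismatch as equally far apart; capturing "A is closer to G than to C" would need a domain-specific per-symbol distance table instead.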
Another issue
Redness Yellowness Mass Volume Class
4.816472 2.347954 125.5082 25.01441 apple
2.036318 4.879481 125.8775 18.2101 lemon
2.767383 3.353061 109.9687 33.53737 orange
4.327248 3.322961 118.4266 19.07535 peach
2.96197 4.124945 159.2573 29.00904 orange
5.655719 1.706671 147.0695 39.30565 apple
These are the first few records in the training data. See any issues? Hint: think of how the Euclidean distance is calculated.
We should really normalize the data: rescale each entry in a dimension so that no one dimension dominates the distance simply because of its units.
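In the table above, Mass sits around 100-160 while Redness sits around 2-6, so Mass would dominate the Euclidean distance. A min-max rescaling sketch (a hypothetical helper, assuming purely numeric columns):

```python
def min_max_normalize(rows):
    # rescale each column (dimension) to [0, 1]
    cols = list(zip(*rows))
    los = [min(c) for c in cols]
    his = [max(c) for c in cols]
    return [tuple((v - lo) / (hi - lo) if hi > lo else 0.0
                  for v, lo, hi in zip(row, los, his))
            for row in rows]
```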
Other uses of instance-based approaches
Function approximation, i.e. real-valued prediction: take the average of the nearest K neighbors.
If we don't know the function, and/or it is too complex to "learn," just plug in a new value: the KNN approach can "learn" the predicted value on the fly by averaging the nearest neighbors.
f̂(x_q) ← ( Σ_{i=1..k} f(x_i) ) / k
[Scatter plot: noisy sample points plotted against X and Y]
Why average?
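The averaging step can be sketched for one-dimensional inputs (a toy illustration with invented names):

```python
def knn_regress(train, x_query, k=3):
    # train: list of (x, y) pairs; predict y as the mean
    # of the y values of the k nearest neighbors
    neighbors = sorted(train, key=lambda p: abs(p[0] - x_query))[:k]
    return sum(y for _, y in neighbors) / k
```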
Regression
Choose an m and b that minimize the squared error.
But again, computationally, how?
[Scatter plot with a fitted line: the m and b that minimize the squared error]
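For reference, simple linear regression also has a closed-form answer, which a sketch can compute directly (illustrative only; the slides instead build toward an iterative answer):

```python
def fit_line(points):
    # ordinary least squares for y = m*x + b, via the normal equations
    n = len(points)
    sx = sum(x for x, _ in points)
    sy = sum(y for _, y in points)
    sxx = sum(x * x for x, _ in points)
    sxy = sum(x * y for x, y in points)
    m = (n * sxy - sx * sy) / (n * sxx - sx * sx)
    b = (sy - m * sx) / n
    return m, b
```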
Other things that can be learned
If we want to learn an instantaneous slope, we can do local regression: get the slope of a line that fits just the local data.
[Plot: nonlinear curve on X and Y axes; local regression fits a line to just the nearby points]
Summary
KNN is highly effective for many practical problems, given sufficient training data.
It is robust to noisy training data, its work is back-loaded, and it is susceptible to the curse of dimensionality.
The How: Big Picture
For each training datum we know what Y should be.
If we have a randomly generated m and b, these, along with X, tell us a predicted Y.
So we know whether this m and b yield too large or too small a prediction.
We can therefore nudge m and b in an appropriate direction (+ or −).
Sum these proposed nudges across all the training data.
[Diagram: the line represents the output (predicted Y); a sample whose target Y is above the line ("target Y too low"), with nudges Δm and Δb]
Instance Based Classification 25
Gradient Descent
Which way should m go to reduce the error?
y_pred = m_guess · x + b_guess
The per-sample error is y_pred − y_act.
[Diagram: the guessed line, its rise (slope) and intercept b, with actual y values plotted around it]
Could average the nudges across the samples, then do the same for b, then do it all again.
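The nudge-average-repeat loop above can be sketched as (a minimal illustration with made-up names; the learning rate and epoch count are arbitrary choices):

```python
def gradient_descent_line(points, lr=0.05, epochs=2000):
    # fit y = m*x + b by repeatedly nudging m and b
    # against the averaged gradient of the squared error
    m, b = 0.0, 0.0
    n = len(points)
    for _ in range(epochs):
        # d(MSE)/dm and d(MSE)/db, averaged over the training data
        grad_m = sum(2 * (m * x + b - y) * x for x, y in points) / n
        grad_b = sum(2 * (m * x + b - y) for x, y in points) / n
        m -= lr * grad_m  # nudge m opposite its gradient
        b -= lr * grad_b  # then do the same for b, then do it again
    return m, b
```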
Back to why we went down this road
Locally weighted linear regression: we would still perform gradient descent, but weight each training example by its distance from the query point. Used over all the training examples, it becomes a global function approximation.
[Plot: nonlinear curve on X and Y axes]
f̂(x) = w_0 + w_1·a_1(x) + … + w_n·a_n(x)