Page 1: Chapter 7 – K-Nearest-Neighbor

Data Mining for Business Intelligence
Shmueli, Patel & Bruce

© Galit Shmueli and Peter Bruce 2010

Page 2: Characteristics

Data-driven, not model-driven

Makes no assumptions about the data


Page 3: Basic Idea

For a given record to be classified, identify nearby records

“Near” means records with similar predictor values X1, X2, … Xp

Classify the record as whatever the predominant class is among the nearby records (the “neighbors”)


Page 4: How to measure "nearby"?

The most popular distance measure is Euclidean distance
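To make "nearby" concrete, here is a minimal Python sketch of Euclidean distance and the k-NN majority vote (the four training records are taken from the riding-mower data shown later; in practice the predictors are usually standardized first so that no single variable dominates the distance):

    import math
    from collections import Counter

    def euclidean(u, v):
        # Euclidean distance: square root of the sum of squared differences
        return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))

    def knn_classify(new_record, train_X, train_y, k):
        # Sort training records by distance to the new record, then take a
        # majority vote among the k nearest neighbors.
        neighbors = sorted(zip(train_X, train_y),
                           key=lambda pair: euclidean(pair[0], new_record))
        votes = Counter(label for _, label in neighbors[:k])
        return votes.most_common(1)[0][0]

    # Four (Income, Lot_Size) records from the riding-mower example
    train_X = [(60.0, 18.4), (85.5, 16.8), (75.0, 19.6), (52.8, 20.8)]
    train_y = ["owner", "owner", "non-owner", "non-owner"]
    print(knn_classify((70.0, 19.0), train_X, train_y, k=3))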


Page 5: Choosing k

k is the number of nearby neighbors used to classify the new record

k = 1 means use the single nearest record
k = 5 means use the 5 nearest records

Typically, choose the value of k that has the lowest error rate on the validation data (see the selection sketch below)
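A sketch of that selection loop, reusing knn_classify and the training lists from the sketch above (the two validation records are illustrative, drawn from the same table):

    # Assumed validation split
    valid_X = [(64.8, 21.6), (49.2, 17.6)]
    valid_y = ["owner", "non-owner"]

    def validation_error(k):
        # Fraction of validation records misclassified at this value of k
        wrong = sum(knn_classify(x, train_X, train_y, k) != y
                    for x, y in zip(valid_X, valid_y))
        return wrong / len(valid_y)

    # Score every candidate k and keep the one with the lowest validation error
    errors = {k: validation_error(k) for k in range(1, len(train_X) + 1)}
    best_k = min(errors, key=errors.get)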


Page 6: Low k vs. High k

Low values of k (1, 3, …) capture local structure in data (but also noise)

High values of k provide more smoothing, less noise, but may miss local structure

Note: the extreme case of k = n (i.e., the entire data set) is the same as the “naïve rule” (classify all records according to majority class)


Page 7: Example: Riding Mowers

Data: 24 households classified as owning or not owning riding mowers

Predictors: Income, Lot Size


Page 8:

Income   Lot_Size   Ownership
 60.0    18.4       owner
 85.5    16.8       owner
 64.8    21.6       owner
 61.5    20.8       owner
 87.0    23.6       owner
110.1    19.2       owner
108.0    17.6       owner
 82.8    22.4       owner
 69.0    20.0       owner
 93.0    20.8       owner
 51.0    22.0       owner
 81.0    20.0       owner
 75.0    19.6       non-owner
 52.8    20.8       non-owner
 64.8    17.2       non-owner
 43.2    20.4       non-owner
 84.0    17.6       non-owner
 49.2    17.6       non-owner
 59.4    16.0       non-owner
 66.0    18.4       non-owner
 47.4    16.4       non-owner
 33.0    18.8       non-owner
 51.0    14.0       non-owner
 63.0    14.8       non-owner

Page 9: XLMiner Output

For each record in the validation data (6 records), XLMiner finds its neighbors among the training data (18 records).

Each record is scored for k = 1, 2, …, 18.

The best k appears to be k = 8.

k = 9, 10, 12, and 14 share the same low validation error rate, but it is best to choose the lowest such k.


Page 10:

Value of k   % Error Training   % Error Validation
 1            0.00              33.33
 2           16.67              33.33
 3           11.11              33.33
 4           22.22              33.33
 5           11.11              33.33
 6           27.78              33.33
 7           22.22              33.33
 8           22.22              16.67   <--- Best k
 9           22.22              16.67
10           22.22              16.67
11           16.67              33.33
12           16.67              16.67
13           11.11              33.33
14           11.11              16.67
15            5.56              33.33
16           16.67              33.33
17           11.11              33.33
18           50.00              50.00

Page 11: Using K-NN for Prediction (for Numerical Outcome)

Instead of "majority vote determines class," use the average of the neighbors' response values

May be a weighted average, with weight decreasing with distance (see the sketch below)
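A minimal sketch of distance-weighted k-NN prediction, reusing the euclidean helper defined earlier (inverse-distance weighting is one common choice, not the only one):

    def knn_predict(new_record, train_X, train_y_numeric, k, eps=1e-9):
        # Inverse-distance weighted average of the k nearest responses:
        # closer neighbors get proportionally more weight.
        nearest = sorted(((euclidean(x, new_record), y)
                          for x, y in zip(train_X, train_y_numeric)))[:k]
        weights = [1.0 / (d + eps) for d, _ in nearest]  # eps guards against d == 0
        return sum(w * y for w, (_, y) in zip(weights, nearest)) / sum(weights)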


Page 12: Advantages

Simple

No assumptions required about Normal distribution, etc.

Effective at capturing complex interactions among variables without having to define a statistical model


Page 13: Shortcomings

The required size of the training set increases exponentially with the number of predictors, p

This is because the expected distance to the nearest neighbor increases with p (with a large vector of predictors, all records end up "far away" from each other)

In a large training set, it takes a long time to find distances to all the neighbors and then identify the nearest one(s)

Together, these problems constitute the "curse of dimensionality" (the simulation sketch below illustrates the distance effect)
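The distance effect can be seen in a small simulation, again reusing the euclidean helper (a sketch, assuming predictors drawn uniformly from the unit cube):

    import random

    def mean_nn_distance(p, n=100, trials=20):
        # Average distance from a random query point to its nearest neighbor
        # among n points drawn uniformly from the unit p-dimensional cube.
        total = 0.0
        for _ in range(trials):
            points = [[random.random() for _ in range(p)] for _ in range(n)]
            query = [random.random() for _ in range(p)]
            total += min(euclidean(pt, query) for pt in points)
        return total / trials

    for p in (2, 10, 50):
        print(p, round(mean_nn_distance(p), 2))  # grows steadily with p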


Page 14: Dealing with the Curse

Reduce dimension of predictors (e.g., with PCA)

Computational shortcuts that settle for “almost nearest neighbors”


Page 15: Summary

Find the distance between the record-to-be-classified and all other records

Select the k nearest records

Classify the record according to the majority vote of its nearest neighbors

Or, for prediction, take the average of the nearest neighbors' responses

“Curse of dimensionality” – need to limit # of predictors


Page 16: Chapter 8 – Naïve Bayes

Data Mining for Business Intelligence
Shmueli, Patel & Bruce

© Galit Shmueli and Peter Bruce 2010

Page 17: Characteristics

Data-driven, not model-driven

Makes no assumptions about the data


Page 18: Naïve Bayes: The Basic Idea

For a given new record to be classified, find other records like it (i.e., same values for the predictors)

What is the prevalent class among those records?

Assign that class to your new record


Page 19: Usage

Requires categorical variables

Numerical variables must be binned and converted to categorical (see the binning sketch after this list)

Can be used with very large data sets

Example: spell-check programs assign your misspelled word to an established "class" (i.e., a correctly spelled word)
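As a sketch of the binning step with pandas (the bin count and labels are arbitrary choices; the values are from the riding-mower data):

    import pandas as pd

    income = pd.Series([60.0, 85.5, 110.1, 43.2, 33.0])
    # Convert the numerical predictor into three equal-width categories
    income_cat = pd.cut(income, bins=3, labels=["low", "medium", "high"])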


Page 20: Exact Bayes Classifier

Relies on finding other records that share the same predictor values as the record-to-be-classified.

Want to find the "probability of belonging to class C, given specified values of the predictors."

Even with large data sets, it may be hard to find other records that exactly match your record in terms of predictor values.


Page 21: Solution – Naïve Bayes

Assume independence of the predictor variables (within each class)

Use the multiplication rule (see the formula after this list)

Estimate the same probability that the record belongs to class C, given its predictor values, without limiting the calculation to records that share all of those values
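In symbols, with predictor values x1, …, xp, the multiplication rule gives, up to a normalizing constant,

    P(C | x1, …, xp)  ∝  P(C) × P(x1 | C) × … × P(xp | C)

and the class probabilities are then scaled to sum to 1 (step 5 of the calculations on the next slide).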


Page 22: Calculations

1. Take a record, and note its predictor values
2. Find the probabilities of those predictor values occurring across all records in C1
3. Multiply them together, then multiply by the proportion of records belonging to C1
4. Do the same for C2, C3, etc.
5. The probability of belonging to C1 is the value from step (3) divided by the sum of all such values for C1 … Cn
6. Establish and adjust a "cutoff" probability for the class of interest


Page 23: Example: Financial Fraud

Target variable: audit outcome (fraud / no fraud)

Predictors: prior pending legal charges (yes/no), size of firm (small/large)


Page 24:

Charges?   Size    Outcome
y          small   truthful
n          small   truthful
n          large   truthful
n          large   truthful
n          small   truthful
n          small   truthful
y          small   fraud
y          large   fraud
n          large   fraud
y          large   fraud


Page 25: Exact Bayes Calculations

Goal: classify (as “fraudulent” or as “truthful”) a small firm with charges filed

There are 2 firms like that, one fraudulent and the other truthful

P(fraud | charges=y, size=small) = ½ = 0.50

Note: calculation is limited to the two firms matching those characteristics


Page 26: Naïve Bayes Calculations

Same goal as before

Compute two quantities:

Proportion of "charges = y" among frauds, times proportion of "small" among frauds, times proportion of frauds = 3/4 * 1/4 * 4/10 = 0.075

Proportion of "charges = y" among truthfuls, times proportion of "small" among truthfuls, times proportion of truthfuls = 1/6 * 4/6 * 6/10 = 0.067

P(fraud | charges = y, size = small) = 0.075 / (0.075 + 0.067) = 0.53
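A minimal Python sketch of these calculations on the ten-record fraud table (function and variable names are illustrative):

    # The ten (charges, size, outcome) records from the fraud example
    records = [
        ("y", "small", "truthful"), ("n", "small", "truthful"),
        ("n", "large", "truthful"), ("n", "large", "truthful"),
        ("n", "small", "truthful"), ("n", "small", "truthful"),
        ("y", "small", "fraud"),    ("y", "large", "fraud"),
        ("n", "large", "fraud"),    ("y", "large", "fraud"),
    ]

    def nb_score(charges, size, outcome):
        # P(charges | class) * P(size | class) * P(class)
        in_class = [r for r in records if r[2] == outcome]
        p_charges = sum(r[0] == charges for r in in_class) / len(in_class)
        p_size = sum(r[1] == size for r in in_class) / len(in_class)
        return p_charges * p_size * len(in_class) / len(records)

    f = nb_score("y", "small", "fraud")     # 3/4 * 1/4 * 4/10 = 0.075
    t = nb_score("y", "small", "truthful")  # 1/6 * 4/6 * 6/10 ≈ 0.067
    print(round(f / (f + t), 2))            # 0.53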

Page 27: Naïve Bayes, cont.

Note that the probability estimate (0.53) does not differ greatly from the exact Bayes estimate (0.50)

All records are used in calculations, not just those matching predictor values

This makes calculations practical in most circumstances

Relies on assumption of independence between predictor variables within each class


Page 28: Independence Assumption

Not strictly justified (variables often correlated with one another)

Often “good enough”


Page 29: Advantages

Handles purely categorical data well

Works well with very large data sets

Simple & computationally efficient


Page 30: Shortcomings

Requires a large number of records

Problematic when a predictor category is not present in the training data

In that case, naïve Bayes assigns a probability of 0 to the response, ignoring the information in the other variables


Page 31: On the other hand…

Probability rankings are more accurate than the actual probability estimates

Good for applications using lift (e.g., response to a mailing), less so for applications requiring accurate probabilities (e.g., credit scoring)


Page 32: Summary

No statistical models involved

Naïve Bayes (like KNN) pays attention to complex interactions and local structure

Computational challenges remain


