Classification & Regression (COSC 526, Class 7)

Transcript

Page 1:
Classification & Regression

COSC 526 Class 7

Arvind Ramanathan
Computational Science & Engineering Division
Oak Ridge National Laboratory, Oak Ridge
Ph: 865-576-7266
E-mail: [email protected]

Page 2:

Last Class

• Streaming Data Analytics:
  – Sampling, counting, frequent item sets, etc.

• Berkeley Data Analytic Stack (BDAS)

Page 3:

Project Deliverables and Deadlines

Deliverable                          Due date          % Grade
Initial selection of topics [1][2]   Jan 27, 2015      10
Project Description and Approach     Feb 20, 2015      20
Initial Report                       Mar 20, 2015      10
Project Demonstration                Apr 16-19, 2015   10
Final Project Report (10 pages)      Apr 21, 2015      25
Poster (12-16 slides)                Apr 23, 2015      25

• [1] Projects can come with their own data (e.g., from your project) or data can be provided

• [2] Datasets need to be open! Please don't use datasets that have proprietary limitations

• All reports will be in NIPS format: http://nips.cc/Conferences/2013/PaperInformation/StyleFiles

Page 4:

This class and next…

• Classification

• Decision Tree Methods

• Instance-based Learning

• Support Vector Machines (SVM)

• Latent Dirichlet Allocation (LDA)

• How do we make these work with large-scale datasets?

Page 5:

Part I: Classification and a few new terms…

Page 6:

Classification

• Given a collection of records (training):
  – Each record has a set of attributes [x1, x2, …, xn]
  – One attribute is referred to as the class [y]

• Training goal: build a model for the class attribute as a function of the other attributes, y = f(x1, x2, …, xn)

• Testing goal: previously unseen records should be assigned a class attribute as accurately as possible

Page 7:

Illustration of Classification

Training set:

ID  Attrib1  Attrib2  Attrib3   Class
1   Yes      Large    120,000   No
2   No       Medium   100,000   No
3   No       Small    70,000    No
4   Yes      Medium   120,000   No
5   No       Large    95,000    Yes
6   Yes      Large    220,000   Yes
…   …        …        …         …

Test set:

ID  Attrib1  Attrib2  Attrib3  Class
20  No       Small    55,000   ?
21  Yes      Medium   90,000   ?
22  Yes      …        …        ?

(Diagram: a learning algorithm learns a model from the training set ("Learn Model"); the model is then applied to the test set ("Apply Model").)
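Not part of the slides: a minimal sketch of the learn-model / apply-model workflow above, assuming Python with pandas and scikit-learn; the numeric encodings of Attrib1/Attrib2 are illustrative, and a decision tree is used only as an example classifier.

```python
# Hypothetical illustration: learn a model on the training records above and
# apply it to the unlabeled test records.
import pandas as pd
from sklearn.tree import DecisionTreeClassifier

train = pd.DataFrame({
    "Attrib1": [1, 0, 0, 1, 0, 1],             # Yes = 1, No = 0
    "Attrib2": [2, 1, 0, 1, 2, 2],             # Small = 0, Medium = 1, Large = 2
    "Attrib3": [120000, 100000, 70000, 120000, 95000, 220000],
    "Class":   ["No", "No", "No", "No", "Yes", "Yes"],
})
test = pd.DataFrame({
    "Attrib1": [0, 1],
    "Attrib2": [0, 1],
    "Attrib3": [55000, 90000],
})

features = ["Attrib1", "Attrib2", "Attrib3"]
model = DecisionTreeClassifier().fit(train[features], train["Class"])   # "Learn Model"
print(model.predict(test))                                              # "Apply Model"
```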

Page 8:

Examples of classification

• Predicting whether tumor cells are benign or cancerous

• Classifying credit card transactions as legitimate or fraudulent

• Classifying whether a protein sequence region is a helix, a beta-sheet, or a random coil

• Classifying whether a tweet is talking about sports, finance, terrorism, etc.

Page 9:

Two Problem Settings

• Supervised Learning:
  – Training and testing data
  – Training data includes labels (the class attribute y)
  – Testing data consists of all the other attributes

• Unsupervised Learning:
  – Training data does not usually include labels

– We will cover unsupervised learning later (mid Feb)

Page 10:

Two more terms…

• Parametric:
  – A particular functional form is assumed (e.g., Naïve Bayes)
  – Simplicity: easy to interpret and understand
  – High bias: real data might not follow the assumed functional form

• Non-parametric:
  – Estimation is purely data-driven
  – No functional form assumption

Page 11:

Part II: Classification with Decision Trees

Examples illustrated from Tan, Steinbach, and Kumar, Introduction to Data Mining textbook

Page 12:

Decision Trees

• One of the more popular learning techniques

• Uses a tree-structured plan of a set of attributes to test in order to predict the output

• The decision of which attribute to test is based on information gain

Page 13:

Example of a Decision Tree

Training data (Refund: categorical, Marital Status: categorical, Taxable Income: continuous, Cheat: class):

Tid  Refund  Marital Status  Taxable Income  Cheat
1    Yes     Single          125K            No
2    No      Married         100K            No
3    No      Single          70K             No
4    Yes     Married         120K            No
5    No      Divorced        95K             Yes
6    No      Married         60K             No
7    Yes     Divorced        220K            No
8    No      Single          85K             Yes
9    No      Married         75K             No
10   No      Single          90K             Yes

Model (a decision tree; Refund, MarSt, and TaxInc are the splitting attributes):

Refund?
  Yes -> NO
  No  -> MarSt?
           Married -> NO
           Single, Divorced -> TaxInc?
                                 < 80K  -> NO
                                 >= 80K -> YES
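Not from the slides: the decision tree above written as nested conditionals, a sketch with hypothetical function and argument names.

```python
# Sketch: the decision tree on this page as nested conditionals.
def predict_cheat(refund: str, marital_status: str, taxable_income: float) -> str:
    """Return 'Yes' or 'No' for the Cheat class, mirroring the tree above."""
    if refund == "Yes":                       # Refund? -> Yes branch
        return "No"
    if marital_status == "Married":           # MarSt? -> Married branch
        return "No"
    # Single or Divorced: split on Taxable Income at 80K
    return "No" if taxable_income < 80_000 else "Yes"

print(predict_cheat("No", "Married", 80_000))   # -> "No" (the test record on Page 16)
```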

Page 14:

Another Example of Decision Tree

Training data: same as on Page 13.

Model (an alternative decision tree) that splits on MarSt first:

MarSt?
  Married -> NO
  Single, Divorced -> Refund?
                        Yes -> NO
                        No  -> TaxInc?
                                 < 80K  -> NO
                                 >= 80K -> YES

There could be more than one tree that fits the same data!

Page 15:

Decision Tree Classification Task

(Diagram: the training set is fed to a tree induction algorithm ("Learn Model", induction) to build a decision tree; the tree is then applied to the test set ("Apply Model", deduction).)

Training set:

Tid  Attrib1  Attrib2  Attrib3  Class
1    Yes      Large    125K     No
2    No       Medium   100K     No
3    No       Small    70K      No
4    Yes      Medium   120K     No
5    No       Large    95K      Yes
6    No       Medium   60K      No
7    Yes      Large    220K     No
8    No       Small    85K      Yes
9    No       Medium   75K      No
10   No       Small    90K      Yes

Test set:

Tid  Attrib1  Attrib2  Attrib3  Class
11   No       Small    55K      ?
12   Yes      Medium   80K      ?
13   Yes      Large    110K     ?
14   No       Small    95K      ?
15   No       Large    67K      ?

Page 16:

Apply Model to Test Data

Decision tree (the model from Page 13):

Refund?
  Yes -> NO
  No  -> MarSt?
           Married -> NO
           Single, Divorced -> TaxInc?
                                 < 80K  -> NO
                                 >= 80K -> YES

Test record:

Refund  Marital Status  Taxable Income  Cheat
No      Married         80K             ?

Start from the root of the tree.

Pages 17-21:

Apply Model to Test Data (continued)

(The same tree and test record are repeated on each of these pages, stepping through the model: Refund = No, so follow the "No" branch to MarSt; Marital Status = Married, so follow the "Married" branch, which reaches a leaf.)

Assign Cheat to "No"

Page 22:

Decision Tree Classification Task

(Same diagram, training set, and test set as on Page 15.)

Page 23:

Decision Tree Induction

• Many algorithms:
  – Hunt's Algorithm (one of the earliest)
  – CART
  – ID3, C4.5
  – SLIQ, SPRINT

Page 24:

General Structure of Hunt’s Algorithm

• Let Dt be the set of training records that reach a node t

• General procedure:
  – If Dt contains records that all belong to the same class yt, then t is a leaf node labeled as yt
  – If Dt is an empty set, then t is a leaf node labeled with the default class yd
  – If Dt contains records that belong to more than one class, use an attribute test to split the data into smaller subsets
  – Recursively apply the procedure to each subset (see the sketch below)

(Figure: the training data table from Page 13, with Dt denoting the records that reach the node currently being split.)
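A minimal, runnable sketch of the general procedure just described (not the course's code); for simplicity it splits on the first attribute that still has more than one value, leaving the choice of the best test to the impurity measures on the following pages.

```python
# Sketch of Hunt's general procedure for categorical attributes.
from collections import Counter

def hunt(records, labels, default="No"):
    if not records:
        return default                               # empty Dt -> default class yd
    if len(set(labels)) == 1:
        return labels[0]                             # all records in one class yt -> leaf
    attrs = [a for a in records[0] if len({r[a] for r in records}) > 1]
    if not attrs:
        return Counter(labels).most_common(1)[0][0]  # no attribute test left -> majority
    test = attrs[0]        # any attribute test; Gini/entropy (Pages 36-44) pick the best
    majority = Counter(labels).most_common(1)[0][0]
    node = {test: {}}
    for value in {r[test] for r in records}:
        idx = [i for i, r in enumerate(records) if r[test] == value]
        node[test][value] = hunt([records[i] for i in idx],
                                 [labels[i] for i in idx], majority)
    return node

# Toy usage with two attributes from the running example:
data = [{"Refund": "Yes", "MarSt": "Single"}, {"Refund": "No", "MarSt": "Married"},
        {"Refund": "No", "MarSt": "Single"}]
print(hunt(data, ["No", "No", "Yes"]))   # nested dict mirroring the tree structure
```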

Page 25:

Hunt’s Algorithm

(Figure: Hunt's algorithm applied to the training data from Page 13, growing the tree in stages:
 1. A single leaf predicting "Don't Cheat".
 2. Split on Refund: Yes -> Don't Cheat; No -> Don't Cheat.
 3. Under Refund = No, split on Marital Status: Married -> Don't Cheat; Single, Divorced -> Cheat.
 4. Under Single, Divorced, split on Taxable Income: < 80K -> Don't Cheat; >= 80K -> Cheat.)

(Training data: the table from Page 13.)

Page 26:

Tree Induction

• Greedy strategy
  – Split the records based on an attribute test that optimizes a certain criterion

• Issues
  – Determine how to split the records
    • How to specify the attribute test condition?
    • How to determine the best split?
  – Determine when to stop splitting

Page 27:

Tree Induction (repeated from Page 26)

Page 28:

How to Specify Test Condition?

• Depends on attribute types
  – Nominal
  – Ordinal
  – Continuous

• Depends on number of ways to split
  – 2-way split
  – Multi-way split

Page 29:

Splitting Based on Nominal Attributes

• Multi-way split: Use as many partitions as distinct values.

• Binary split: divides values into two subsets; need to find the optimal partitioning.

Multi-way split on CarType: {Family}, {Sports}, {Luxury}

Binary splits on CarType: {Family, Luxury} vs. {Sports}, OR {Sports, Luxury} vs. {Family}
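Not from the slides: a sketch that enumerates the candidate binary partitions of a nominal attribute such as CarType; with k distinct values there are 2^(k-1) - 1 of them.

```python
# Sketch: enumerate candidate binary partitions of a nominal attribute.
from itertools import combinations

def binary_partitions(values):
    values = sorted(values)
    for r in range(1, len(values) // 2 + 1):
        for left in combinations(values, r):
            right = tuple(v for v in values if v not in left)
            if r < len(values) - r or left < right:   # emit each unordered pair once
                yield set(left), set(right)

for left, right in binary_partitions({"Family", "Sports", "Luxury"}):
    print(sorted(left), "vs", sorted(right))
# {Family} vs {Luxury, Sports}; {Luxury} vs {Family, Sports}; {Sports} vs {Family, Luxury}
```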

Page 30:

Splitting Based on Ordinal Attributes

• Multi-way split: use as many partitions as there are distinct values.

• Binary split: divides values into two subsets; need to find the optimal partitioning.

Multi-way split on Size: {Small}, {Medium}, {Large}

Binary splits on Size: {Medium, Large} vs. {Small}, OR {Small, Medium} vs. {Large}

What about {Small, Large} vs. {Medium}? It does not preserve the order of an ordinal attribute.

Page 31:

Splitting Based on Continuous Attributes

• Different ways of handling:
  – Discretization to form an ordinal categorical attribute
    • Static – discretize once at the beginning
    • Dynamic – ranges can be found by equal-interval bucketing, equal-frequency bucketing (percentiles), or clustering
  – Binary decision: (A < v) or (A ≥ v)
    • Consider all possible splits and find the best cut
    • Can be more compute-intensive

Page 32:

Splitting Based on Continuous Attributes

(i) Binary split: Taxable Income > 80K? -> Yes / No

(ii) Multi-way split: Taxable Income -> < 10K, [10K, 25K), [25K, 50K), [50K, 80K), > 80K

Page 33:

Tree Induction (repeated from Page 26)

Page 34:

How to determine the Best Split

Before splitting: 10 records of class C0 and 10 records of class C1.

Candidate test conditions:

Own Car?     Yes: C0 = 6, C1 = 4     No: C0 = 4, C1 = 6
Car Type?    Family: C0 = 1, C1 = 3     Sports: C0 = 8, C1 = 0     Luxury: C0 = 1, C1 = 7
Student ID?  c1: C0 = 1, C1 = 0   …   c10: C0 = 1, C1 = 0   c11: C0 = 0, C1 = 1   …   c20: C0 = 0, C1 = 1

Which test condition is the best?

Page 35:

How to determine the Best Split

• Greedy approach:
  – Nodes with a homogeneous class distribution are preferred

• Need a measure of node impurity:

C0: 5, C1: 5 – non-homogeneous, high degree of impurity
C0: 9, C1: 1 – homogeneous, low degree of impurity

Page 36:

Measures of Node Impurity

• Gini Index

• Entropy

• Misclassification error

Page 37:

How to Find the Best Split

Before splitting: counts C0 = N00, C1 = N01; impurity M0.

Split on A? (Yes / No): Node N1 with counts (C0 = N10, C1 = N11) and Node N2 with counts (C0 = N20, C1 = N21); impurities M1 and M2 combine into M12.

Split on B? (Yes / No): Node N3 with counts (C0 = N30, C1 = N31) and Node N4 with counts (C0 = N40, C1 = N41); impurities M3 and M4 combine into M34.

Compare Gain = M0 – M12 vs. M0 – M34.

Page 38:

Measure of Impurity: GINI

• Gini index for a given node t:

  GINI(t) = 1 − Σ_j [ p(j | t) ]²

  (NOTE: p(j | t) is the relative frequency of class j at node t.)

  – Maximum (1 − 1/nc, where nc is the number of classes) when records are equally distributed among all classes, implying the least interesting information
  – Minimum (0.0) when all records belong to one class, implying the most interesting information

Examples:
C1 = 0, C2 = 6: Gini = 0.000
C1 = 2, C2 = 4: Gini = 0.444
C1 = 3, C2 = 3: Gini = 0.500
C1 = 1, C2 = 5: Gini = 0.278
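Not from the slides: a short sketch of the Gini index computed from per-class counts, reproducing the four example nodes above.

```python
# Sketch: Gini index of a node from its per-class counts.
def gini(counts):
    n = sum(counts)
    return 1.0 - sum((c / n) ** 2 for c in counts)

for counts in ([0, 6], [1, 5], [2, 4], [3, 3]):
    print(counts, round(gini(counts), 3))   # 0.0, 0.278, 0.444, 0.5
```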

Page 39:

Examples for computing GINI

GINI(t) = 1 − Σ_j [ p(j | t) ]²

C1 = 0, C2 = 6:  P(C1) = 0/6 = 0,  P(C2) = 6/6 = 1
                 Gini = 1 – P(C1)² – P(C2)² = 1 – 0 – 1 = 0

C1 = 1, C2 = 5:  P(C1) = 1/6,  P(C2) = 5/6
                 Gini = 1 – (1/6)² – (5/6)² = 0.278

C1 = 2, C2 = 4:  P(C1) = 2/6,  P(C2) = 4/6
                 Gini = 1 – (2/6)² – (4/6)² = 0.444

Page 40:

Splitting Based on GINI

• Used in CART, SLIQ, SPRINT.

• When a node p is split into k partitions (children), the quality of the split is computed as

  GINI_split = Σ_{i=1..k} (n_i / n) · GINI(i)

  where n_i = number of records at child i, and n = number of records at node p.
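Not from the slides: a sketch of GINI_split as defined above; the example counts are the N1/N2 partition used on the next page.

```python
# Sketch: weighted Gini of a split from the per-class counts of each partition.
def gini(counts):
    n = sum(counts)
    return 1.0 - sum((c / n) ** 2 for c in counts)

def gini_split(children):
    """children: list of per-class count lists, one per partition."""
    n = sum(sum(c) for c in children)
    return sum(sum(c) / n * gini(c) for c in children)

print(round(gini_split([[5, 2], [1, 4]]), 3))   # ~0.371 for N1 = (5, 2), N2 = (1, 4)
```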

Page 41:

Binary Attributes: Computing GINI Index

• Splits into two partitions
• Effect of weighting the partitions:
  – Larger and purer partitions are sought

Split on B? (Yes -> Node N1, No -> Node N2)

Parent: C1 = 6, C2 = 6, Gini = 0.500

       N1   N2
C1     5    1
C2     2    4

Gini(N1) = 1 – (5/7)² – (2/7)² = 0.408
Gini(N2) = 1 – (1/5)² – (4/5)² = 0.320
Gini(children) = 7/12 × 0.408 + 5/12 × 0.320 = 0.371

Page 42:

Categorical Attributes: Computing Gini Index

• For each distinct value, gather counts for each class in the dataset

• Use the count matrix to make decisions

Two-way split (find the best partition of values):

        CarType
        {Sports, Luxury}   {Family}
C1      3                  1
C2      2                  4
Gini    0.400

        CarType
        {Sports}   {Family, Luxury}
C1      2          2
C2      1          5
Gini    0.419

Multi-way split:

        CarType
        Family   Sports   Luxury
C1      1        2        1
C2      4        1        1
Gini    0.393

Page 43:

Continuous Attributes: Computing Gini Index

• Use a binary decision based on one value, e.g. Taxable Income > 80K? (Yes / No)

• Several choices for the splitting value
  – Number of possible splitting values = number of distinct values

• Each splitting value v has a count matrix associated with it
  – Class counts in each of the partitions, A < v and A ≥ v

• Simple method to choose the best v
  – For each v, scan the database to gather the count matrix and compute its Gini index
  – Computationally inefficient! Repetition of work.

(Training data: the table from Page 13.)

Page 44:

Continuous Attributes: Computing Gini Index...

• For efficient computation, for each attribute:
  – Sort the attribute values
  – Linearly scan these values, each time updating the count matrix and computing the Gini index
  – Choose the split position that has the least Gini index

Sorted values (Taxable Income): 60, 70, 75, 85, 90, 95, 100, 120, 125, 220
Class (Cheat):                  No, No, No, Yes, Yes, Yes, No, No, No, No

Split positions and class counts ("a | b" = count with value <= v | count with value > v):

v      55      65      72      80      87      92      97      110     122     172     230
Yes    0 | 3   0 | 3   0 | 3   0 | 3   1 | 2   2 | 1   3 | 0   3 | 0   3 | 0   3 | 0   3 | 0
No     0 | 7   1 | 6   2 | 5   3 | 4   3 | 4   3 | 4   3 | 4   4 | 3   5 | 2   6 | 1   7 | 0
Gini   0.420   0.400   0.375   0.343   0.417   0.400   0.300   0.343   0.375   0.400   0.420

The best split is at v = 97 (Gini = 0.300).
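Not from the slides: a sketch of the single sorted scan above; class counts are updated incrementally at each candidate position, and candidate thresholds are taken as midpoints between consecutive distinct values.

```python
# Sketch: best binary split of a continuous attribute via one sorted scan.
def gini(counts):
    n = sum(counts)
    return 1.0 - sum((c / n) ** 2 for c in counts) if n else 0.0

def best_split(values, labels):
    pairs = sorted(zip(values, labels))
    classes = sorted(set(labels))
    left = {c: 0 for c in classes}
    right = {c: labels.count(c) for c in classes}
    best = (float("inf"), None)
    for i in range(len(pairs) - 1):
        _, y = pairs[i]
        left[y] += 1                        # move one record from right to left
        right[y] -= 1
        if pairs[i][0] == pairs[i + 1][0]:
            continue                        # only split between distinct values
        cut = (pairs[i][0] + pairs[i + 1][0]) / 2
        n, nl = len(pairs), i + 1
        g = nl / n * gini(list(left.values())) + (n - nl) / n * gini(list(right.values()))
        best = min(best, (g, cut))
    return best                             # (Gini of split, threshold)

income = [125, 100, 70, 120, 95, 60, 220, 85, 75, 90]
cheat  = ["No", "No", "No", "No", "Yes", "No", "No", "Yes", "No", "Yes"]
print(best_split(income, cheat))   # -> (0.3, 97.5); the slide reports the same minimum near 97
```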

Page 45:

How do we do this with large datasets?

• GINI computation is expensive:
  – Sorting along numeric attributes is O(n log n)!
  – Needs an in-memory hash table to split based on attributes

• Empirical observations:
  – GINI scores tend to increase or decrease slowly
  – The minimum Gini score along the splitting attribute is lower than at other data points

• CLOUDS (Classification for Large OUt-of-core DataSets) builds on these observations

Page 46:

How to hasten GINI computation? (1)

• Sampling of Splitting Points (SS):
  – Use quantiles to divide the points into q intervals
  – Evaluate the Gini index at each interval boundary
  – Determine the minimum value, GINI_min
  – Split along the attribute with GINI_min

• Pro:
  – Only one pass over the entire dataset

• Cons:
  – Greedy
  – Time depends on the number of points in each interval
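A sketch of the SS idea (not the CLOUDS implementation): evaluate the Gini of the split only at q quantile boundaries; the synthetic data and q = 16 are made up for illustration.

```python
# Sketch: sample candidate split points at quantile boundaries instead of at
# every distinct value, then keep the boundary with the lowest Gini.
import numpy as np

def gini_of_split(threshold, values, labels):
    left, right = labels[values <= threshold], labels[values > threshold]
    def gini(part):
        if part.size == 0:
            return 0.0
        _, counts = np.unique(part, return_counts=True)
        p = counts / part.size
        return 1.0 - np.sum(p ** 2)
    n = labels.size
    return left.size / n * gini(left) + right.size / n * gini(right)

rng = np.random.default_rng(0)
values = rng.normal(size=10_000)
labels = (values + rng.normal(scale=0.5, size=values.size) > 0).astype(int)

q = 16
candidates = np.quantile(values, np.linspace(0, 1, q + 2)[1:-1])   # interval boundaries
scores = [gini_of_split(t, values, labels) for t in candidates]
print(candidates[int(np.argmin(scores))])   # boundary achieving GINI_min over the sample
```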

Page 47:

How to hasten GINI computation? (2)

• Sampling Splitting points with Estimation (SSE):
  – Find GINI_min as outlined in SS
  – Estimate a lower bound GINI_est for each interval
  – Eliminate (prune) all intervals where GINI_est ≥ GINI_min
  – Derive the list of ("alive") intervals in which the correct split can still be found

• How do we determine what to prune?
  – Goal: estimate the minimum Gini within an interval [v_l, v_u]
  – Approach: hill-climbing heuristic
  – Inputs: n – number of data points (records); c – number of classes
  – Compute: x_i (y_i) – number of elements of class i that are less than or equal to v_l (v_u); c_i – total number of elements of class i; n_l (n_u) – number of elements less than or equal to v_l (v_u)
  – Take the partial derivative along the subset of classes; the class with the minimum gradient determines the split point (for the next split)

• Pros:
  – Instead of depending on n, the hill-climbing heuristic depends on c
  – Other optimizations can include prioritizing alive intervals based on the estimated GINI_min
  – Better accuracy than SS in estimating the splitting points

• Cons:
  – Pays more for splitting the data (I/O costs are higher)

Page 48:

Summary

• Decision trees are an elegant way to design classifiers:
  – Simple, with an intuitive meaning associated with the splits

• More than one tree representation is possible for the same data
  – Many heuristic measures are available for deciding how to construct a tree

• For Big Data, we need to design heuristics that provide approximate (yet close to accurate) solutions

Page 49:

Part III: Classification with k-Nearest Neighbors

Examples illustrated from Tom Mitchell, Andrew Moore, Jure Leskovec, …

Page 50:

Instance-based Learning

• Given a number of examples {(x, y)} and a query vector q:
  – Find the closest example(s) x*
  – Predict y*

• Learning:
  – Store all training examples

• Prediction:
  – Classification problem: return the majority class among the k closest examples
  – Regression problem: return the average y value of the k closest examples
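Not from the slides: a brute-force sketch of instance-based prediction with Euclidean distance; the example points and labels are made up.

```python
# Sketch: k-NN prediction (classification or regression) by brute force.
import numpy as np
from collections import Counter

def knn_predict(X_train, y_train, q, k=3, regression=False):
    dists = np.linalg.norm(X_train - q, axis=1)        # Euclidean distance to every example
    nearest = np.argsort(dists)[:k]                    # indices of the k closest examples
    if regression:
        return float(np.mean([y_train[i] for i in nearest]))        # average y value
    return Counter(y_train[i] for i in nearest).most_common(1)[0][0]  # majority class

X = np.array([[0.0, 0.0], [0.1, 0.2], [1.0, 1.0], [0.9, 1.1]])
y = ["blue", "blue", "red", "red"]
print(knn_predict(X, y, q=np.array([0.2, 0.1]), k=3))   # -> "blue"
```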

Page 51:

Application Examples

• Collaborative filtering
  – Find the k people most similar to user x, i.e., people who have rated movies in a similar way
  – Predict the rating y_x of x as an average of the neighbors' ratings y_k

Page 52:

Visual Interpretation of how k-Nearest Neighbors works

(Figure: example decision regions for k = 1 and k = 3.)

Page 53:

1-Nearest Neighbor

• Distance metric:
  – E.g., Euclidean

• How many neighbors to look at?
  – One

• Weighting function (optional):
  – Not used

• How to fit with local points?
  – Predict the same output as the nearest neighbor

Page 54:

k-Nearest Neighbor

• Distance metric:
  – E.g., Euclidean

• How many neighbors to look at?
  – k (e.g., k = 9)

• Weighting function (optional):
  – Not used

• How to fit with local points?
  – Predict the same output as the majority of the k nearest neighbors

(Figure: a query point and its k = 9 nearest neighbors.)

Page 55:

Generalizing k-NN: Kernel Regression

• Instead of looking at only k neighbors, we look at all the points

• Weight the points using a Gaussian function of the distance d(x_i, q)

• Fit the local points with the weighted average…

(Figure: the weight w_i as a function of distance, largest where d(x_i, q) = 0.)
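A sketch of kernel regression under an assumed Gaussian weight w_i = exp(-d(x_i, q)² / K_w); the slide's own formula is in a figure that did not survive extraction, so the exact form and the kernel width K_w are assumptions.

```python
# Sketch (assumed form): Gaussian-weighted average over ALL training points,
# prediction = sum(w_i * y_i) / sum(w_i), with w_i largest at distance 0.
import numpy as np

def kernel_regression(X, y, q, kernel_width=1.0):
    d2 = np.sum((X - q) ** 2, axis=1)           # squared distances d(x_i, q)^2
    w = np.exp(-d2 / kernel_width)              # Gaussian weights
    return float(np.sum(w * y) / np.sum(w))     # weighted average of the outputs

X = np.array([[0.0], [1.0], [2.0], [3.0]])
y = np.array([0.0, 1.0, 4.0, 9.0])
print(kernel_regression(X, y, q=np.array([1.5]), kernel_width=0.5))
```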

Page 56:

Distance Weighted k-NN

• Points that are close by are weighted more heavily than points farther away

• Prediction rule: a weighted vote (classification) or weighted average (regression) over the neighbors

• The weights w_i are a decreasing function of d, the distance between x and x_i (e.g., inversely proportional to the squared distance)

Page 57:

How to compute d(x,xi)?

• One approach: Shepard's method (inverse-distance weighting)

Page 58:

How to find nearest neighbors?

• Given: a set P of n points in R^d

• Goal: given a query point q
  – NN: find the nearest neighbor p of q in P
  – Range search: find one/all points in P within distance r of q

(Figure: a query point q and its nearest neighbor p.)

Page 59:

Issues with large datasets

• Distance measure to use

• Curse of dimensionality:
  – In high dimensions, the nearest neighbor might not be near at all

• Irrelevant attributes:
  – Many attributes (in big data) are not informative

• k-NN wants data in memory!

• It must make a pass over the entire data to do classification

Page 60:

Distance Metrics

• D(x, xi) must be positive

• D(x, xi) = 0 iff x = xi (reflexive)

• D(x, xi) = D(xi, x) (symmetric)

• For any other data point y, D(x, xi) + D(xi, y) >= D(x, y) (triangle inequality)

Euclidean distance is one of the most commonly used metrics.

Page 61:

But… Euclidean is not fully useful!

• Makes sense when all data points are expressed in the same units (e.g., SI units)

• Can generalize the Euclidean distance to the L_k norm, also called the Minkowski distance…

• L1: Manhattan distance (L1 norm)

• L2: Euclidean distance (L2 norm)
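Not from the slides: a sketch of the Minkowski (L_k) distance; k = 1 recovers Manhattan and k = 2 recovers Euclidean.

```python
# Sketch: Minkowski (L_k) distance between two vectors.
import numpy as np

def minkowski(x, y, k=2):
    return float(np.sum(np.abs(x - y) ** k) ** (1.0 / k))

a, b = np.array([0.0, 0.0]), np.array([3.0, 4.0])
print(minkowski(a, b, k=1), minkowski(a, b, k=2))   # 7.0 (Manhattan), 5.0 (Euclidean)
```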

Page 62:

Curse of Dimensionality…

• What happens if d is very large?

• Say we have 10,000 points uniformly distributed in a unit hypercube, and we want to find the k = 5 nearest neighbors

• The query point is the origin (0, 0, …, 0):
  – In d = 1: we have to go 5/10000 = 0.0005 on average to capture our 5 neighbors
  – In d = 2: we have to go (0.0005)^(1/2) ≈ 0.022
  – In general d: we have to go (0.0005)^(1/d)

• Very expensive and can break down!
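A quick check of the numbers above (not from the slides): the edge length (5/10000)^(1/d) of the neighborhood as d grows.

```python
# Sketch: how far we must reach, per dimension, to enclose 5 of 10,000 uniform points.
for d in (1, 2, 3, 10, 100):
    print(d, round((5 / 10_000) ** (1 / d), 4))
# d = 1: 0.0005, d = 2: ~0.022, d = 10: ~0.47, d = 100: ~0.93; the "neighborhood"
# spans almost the whole unit cube in high dimensions.
```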

Page 63:

Irrelevant features…

• A large number of features are uninformative about the data:
  – They do not discriminate across classes (but may vary within a class)

• We may be left with a classic “hole” in the data when computing the neighbors…

Page 64:

Efficient indexing (for computing neighbors)

• A kd-tree is an efficient data structure:
  – Split at the median value of the dimension with the highest variance
  – Points are stored in the leaves…
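A sketch of kd-tree based neighbor search using SciPy (the library choice is mine, not the slides'): build the tree once, then answer nearest-neighbor and range queries.

```python
# Sketch: kd-tree queries instead of a brute-force scan.
import numpy as np
from scipy.spatial import cKDTree

rng = np.random.default_rng(42)
P = rng.random((10_000, 3))            # n points in R^d (here d = 3)
tree = cKDTree(P)                      # build the kd-tree over P

q = np.array([0.5, 0.5, 0.5])
dist, idx = tree.query(q, k=5)                 # 5 nearest neighbors of q
in_range = tree.query_ball_point(q, r=0.05)    # range search: all points within r
print(idx, len(in_range))
```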

Page 65:

Space and Time Complexity of Kd-trees

• Space: O(n)!!!
  – Storage can be reduced with other efficient data structures
  – E.g., the ball-tree structure (does better with higher-dimensional datasets)

• Time:
  – O(n) – worst-case scenario
  – O(log n) – normal scenarios…

Page 66:

Now, how do we make this online?

• Instead of assuming a static k-NN classifier, how about we make our k-NN adapt to incoming data?
  – Locally weighted linear regression (LWLR)

• Form an explicit approximation f̂ for a region surrounding the query q
  – Called a piece-wise approximation
  – Minimize the error over the k nearest neighbors of q
  – Minimize the error over all training examples (weighting by distances)
  – Combine the two

Page 67:

LWLR: Mathematical form

• Linear regression:

• Error:

• Minimize error over k-nearest neighbors

• Minimize the error over the entire set of examples, weighting by distances

• Combine the two (a sketch follows)
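A sketch of one common LWLR formulation; the slide's exact error terms are in figures that did not survive extraction, so the Gaussian weighting and the weighted least-squares solve below are assumptions, not the course's definition.

```python
# Sketch (assumed formulation): fit a local linear model around the query q by
# minimizing sum_i K(d(x_i, q)) * (y_i - theta . x_i)^2, then predict at q.
import numpy as np

def lwlr_predict(X, y, q, tau=0.5):
    Xb = np.hstack([np.ones((X.shape[0], 1)), X])        # add intercept term
    qb = np.concatenate([[1.0], q])
    d2 = np.sum((X - q) ** 2, axis=1)
    k = np.exp(-d2 / (2 * tau ** 2))                      # distance-based weights
    W = np.diag(k)
    theta = np.linalg.solve(Xb.T @ W @ Xb, Xb.T @ W @ y)  # weighted least squares
    return float(qb @ theta)

X = np.linspace(0, 3, 20).reshape(-1, 1)
y = X.ravel() ** 2                         # a curve that a single global line can't fit
print(lwlr_predict(X, y, q=np.array([1.5])))   # local linear prediction near x = 1.5
```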

Page 68:

Example of LWLR

(Figure: an example LWLR fit, with a simple regression shown for comparison.)

Page 69:

Large-scale Classification & Regression

• Decision Trees:
  – Querying and inference is easy
  – Induction needs special constructs for handling large-scale datasets
  – Can adapt to incoming data streams

• k-NN:
  – Intuitive algorithm
  – Needs efficient data structures to handle large datasets
  – Can be made online (e.g., locally weighted regression)

• Next class: More classification & Regression!

