Data Mining Classification: Basic Concepts, Lecture Notes for Chapter 4 - 5 Introduction to Data Mining by Tan, Steinbach, Kumar
Page 1: Data Mining Classification: Basic Concepts, Lecture Notes for Chapter 4 - 5 Introduction to Data Mining by Tan, Steinbach, Kumar.

Data Mining Classification: Basic Concepts,

Lecture Notes for Chapter 4 - 5

Introduction to Data Mining by

Tan, Steinbach, Kumar

Page 2: Data Mining Classification: Basic Concepts, Lecture Notes for Chapter 4 - 5 Introduction to Data Mining by Tan, Steinbach, Kumar.

Classification: Definition

• Given a collection of records (training set)
– Each record contains a set of attributes; one of the attributes is the class.

• Find a model for the class attribute as a function of the values of the other attributes.

Page 3: Data Mining Classification: Basic Concepts, Lecture Notes for Chapter 4 - 5 Introduction to Data Mining by Tan, Steinbach, Kumar.

• Goal: previously unseen records should be assigned a class as accurately as possible.
– A test set is used to determine the accuracy of the model. Usually, the given data set is divided into training and test sets, with the training set used to build the model and the test set used to validate it.

Classification: Definition
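A minimal sketch of this train/test workflow (assuming scikit-learn; the built-in iris data stands in for a generic record collection):

from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import accuracy_score

X, y = load_iris(return_X_y=True)                  # records: attributes X, class y

# Divide the given data set into training and test sets.
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=0)

model = DecisionTreeClassifier().fit(X_train, y_train)   # build the model on the training set
y_pred = model.predict(X_test)                           # assign a class to unseen records

print("Test accuracy:", accuracy_score(y_test, y_pred))  # validate on the test set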

Page 4: Data Mining Classification: Basic Concepts, Lecture Notes for Chapter 4 - 5 Introduction to Data Mining by Tan, Steinbach, Kumar.

Tid   Refund   Marital Status   Taxable Income   Cheat
 1    Yes      Single           125K             No
 2    No       Married          100K             No
 3    No       Single           70K              No
 4    Yes      Married          120K             No
 5    No       Divorced         95K              Yes
 6    No       Married          60K              No
 7    Yes      Divorced         220K             No
 8    No       Single           85K              Yes
 9    No       Married          75K              No
10    No       Single           90K              Yes

Illustrating Classification Task

(Diagram: the training set is fed to a learning algorithm, which learns a model; the model is then applied to new test records.)

Page 5: Data Mining Classification: Basic Concepts, Lecture Notes for Chapter 4 - 5 Introduction to Data Mining by Tan, Steinbach, Kumar.

Examples of Classification Task

• Predicting tumor cells as benign or malignant

• Classifying credit card transactions as legitimate or fraudulent

• Classifying spam mail

• Classifying users in a social network

• Categorizing news stories as finance, weather, entertainment, sports, etc.

Page 6: Data Mining Classification: Basic Concepts, Lecture Notes for Chapter 4 - 5 Introduction to Data Mining by Tan, Steinbach, Kumar.

Classification Techniques

• Decision Tree
• Naïve Bayes
• Instance-Based Learning
• Rule-based Methods
• Neural Networks
• Bayesian Belief Networks
• Support Vector Machines

Page 7: Data Mining Classification: Basic Concepts, Lecture Notes for Chapter 4 - 5 Introduction to Data Mining by Tan, Steinbach, Kumar.

Usually the Accuracy measure is used:

Accuracy = (Number of correctly classified records) / (Total number of records in the test set)

Classification: Measure the quality
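As an illustration of the formula in plain Python (the label lists below are made up for the example):

y_true = ["No", "No", "Yes", "No", "Yes"]   # true classes of the test records
y_pred = ["No", "Yes", "Yes", "No", "Yes"]  # classes predicted by the model

correct = sum(1 for t, p in zip(y_true, y_pred) if t == p)
accuracy = correct / len(y_true)
print(accuracy)   # 4 correct out of 5 -> 0.8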

Page 8: Data Mining Classification: Basic Concepts, Lecture Notes for Chapter 4 - 5 Introduction to Data Mining by Tan, Steinbach, Kumar.

Decision Tree

Uses a tree structure to model the training set

Classifies a new record following the path in the tree

Inner nodes represent attributes and leaf nodes represent the class

Page 9: Data Mining Classification: Basic Concepts, Lecture Notes for Chapter 4 - 5 Introduction to Data Mining by Tan, Steinbach, Kumar.

Example of a Decision Tree

Tid   Refund   Marital Status   Taxable Income   Cheat
 1    Yes      Single           125K             No
 2    No       Married          100K             No
 3    No       Single           70K              No
 4    Yes      Married          120K             No
 5    No       Divorced         95K              Yes
 6    No       Married          60K              No
 7    Yes      Divorced         220K             No
 8    No       Single           85K              Yes
 9    No       Married          75K              No
10    No       Single           90K              Yes

Attributes: Refund (categorical), Marital Status (categorical), Taxable Income (continuous); Cheat is the class.

Training Data (above) and Model: Decision Tree (splitting attributes Refund, MarSt, TaxInc):

Refund?
  Yes -> NO
  No  -> MarSt?
           Married -> NO
           Single, Divorced -> TaxInc?
                                 < 80K  -> NO
                                 >= 80K -> YES

Page 10: Data Mining Classification: Basic Concepts, Lecture Notes for Chapter 4 - 5 Introduction to Data Mining by Tan, Steinbach, Kumar.

Another Example of Decision Tree

Tid   Refund   Marital Status   Taxable Income   Cheat
 1    Yes      Single           125K             No
 2    No       Married          100K             No
 3    No       Single           70K              No
 4    Yes      Married          120K             No
 5    No       Divorced         95K              Yes
 6    No       Married          60K              No
 7    Yes      Divorced         220K             No
 8    No       Single           85K              Yes
 9    No       Married          75K              No
10    No       Single           90K              Yes

Attributes: Refund (categorical), Marital Status (categorical), Taxable Income (continuous); Cheat is the class.

Model: Decision Tree

MarSt?
  Married -> NO
  Single, Divorced -> Refund?
                        Yes -> NO
                        No  -> TaxInc?
                                 < 80K  -> NO
                                 >= 80K -> YES

There could be more than one tree that fits the same data!

Page 11: Data Mining Classification: Basic Concepts, Lecture Notes for Chapter 4 - 5 Introduction to Data Mining by Tan, Steinbach, Kumar.

Decision Tree Classification Task

Tid   Refund   Marital Status   Taxable Income   Cheat
 1    Yes      Single           125K             No
 2    No       Married          100K             No
 3    No       Single           70K              No
 4    Yes      Married          120K             No
 5    No       Divorced         95K              Yes
 6    No       Married          60K              No
 7    Yes      Divorced         220K             No
 8    No       Single           85K              Yes
 9    No       Married          75K              No
10    No       Single           90K              Yes

(Diagram: the training set is fed to a learning algorithm, which learns a model, in this case a decision tree; the model is then applied to new records.)

Page 12: Data Mining Classification: Basic Concepts, Lecture Notes for Chapter 4 - 5 Introduction to Data Mining by Tan, Steinbach, Kumar.

Apply Model to Test Data

Model (decision tree):

Refund?
  Yes -> NO
  No  -> MarSt?
           Married -> NO
           Single, Divorced -> TaxInc?
                                 < 80K  -> NO
                                 >= 80K -> YES

Test Data:

Refund   Marital Status   Taxable Income   Cheat
No       Married          80K              ?

Start from the root of the tree.


Page 17: Data Mining Classification: Basic Concepts, Lecture Notes for Chapter 4 - 5 Introduction to Data Mining by Tan, Steinbach, Kumar.

Apply Model to Test Data

Model (decision tree):

Refund?
  Yes -> NO
  No  -> MarSt?
           Married -> NO
           Single, Divorced -> TaxInc?
                                 < 80K  -> NO
                                 >= 80K -> YES

Test Data:

Refund   Marital Status   Taxable Income   Cheat
No       Married          80K              ?

Following the path Refund = No, then MarSt = Married, we reach the leaf NO.

Assign Cheat to “No”
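A minimal sketch of this traversal in plain Python (the function hard-codes the example tree; the field names are chosen here for illustration):

def classify(record):
    # Decision tree from the slides: Refund, then MarSt, then TaxInc.
    if record["Refund"] == "Yes":
        return "No"
    if record["MaritalStatus"] == "Married":
        return "No"
    if record["TaxableIncome"] < 80:     # income in thousands (K)
        return "No"
    return "Yes"

test_record = {"Refund": "No", "MaritalStatus": "Married", "TaxableIncome": 80}
print(classify(test_record))             # -> "No", so assign Cheat = No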

Page 18: Data Mining Classification: Basic Concepts, Lecture Notes for Chapter 4 - 5 Introduction to Data Mining by Tan, Steinbach, Kumar.

Decision Tree Classification Task

Tid   Refund   Marital Status   Taxable Income   Cheat
 1    Yes      Single           125K             No
 2    No       Married          100K             No
 3    No       Single           70K              No
 4    Yes      Married          120K             No
 5    No       Divorced         95K              Yes
 6    No       Married          60K              No
 7    Yes      Divorced         220K             No
 8    No       Single           85K              Yes
 9    No       Married          75K              No
10    No       Single           90K              Yes

(Diagram: the training set is fed to a tree-induction learning algorithm, which learns a model, a decision tree; the model is then applied to new records.)

Page 19: Data Mining Classification: Basic Concepts, Lecture Notes for Chapter 4 - 5 Introduction to Data Mining by Tan, Steinbach, Kumar.

Decision Tree Induction

• Many algorithms:
– Hunt's Algorithm (one of the earliest)
– CART
– ID3, C4.5 (J48 in WEKA)
– SLIQ, SPRINT
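As a rough sketch of inducing a tree on the slides' toy data (assuming pandas and scikit-learn; scikit-learn grows CART-style trees, so this only approximates the algorithms listed above):

import pandas as pd
from sklearn.tree import DecisionTreeClassifier, export_text

# Toy training set from the earlier slides (Taxable Income in thousands).
data = pd.DataFrame({
    "Refund":        ["Yes", "No", "No", "Yes", "No", "No", "Yes", "No", "No", "No"],
    "MaritalStatus": ["Single", "Married", "Single", "Married", "Divorced",
                      "Married", "Divorced", "Single", "Married", "Single"],
    "TaxableIncome": [125, 100, 70, 120, 95, 60, 220, 85, 75, 90],
    "Cheat":         ["No", "No", "No", "No", "Yes", "No", "No", "Yes", "No", "Yes"],
})

X = pd.get_dummies(data[["Refund", "MaritalStatus", "TaxableIncome"]])
y = data["Cheat"]

tree = DecisionTreeClassifier(criterion="entropy").fit(X, y)
print(export_text(tree, feature_names=list(X.columns)))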

Page 20: Data Mining Classification: Basic Concepts, Lecture Notes for Chapter 4 - 5 Introduction to Data Mining by Tan, Steinbach, Kumar.

Tree Induction

• Greedy strategy.
– Split the records based on an attribute test that optimizes a certain criterion.

• Issues
– Determine how to split the records
  • How to specify the attribute test condition?
  • How to determine the best split?
– Determine when to stop splitting

Page 21: Data Mining Classification: Basic Concepts, Lecture Notes for Chapter 4 - 5 Introduction to Data Mining by Tan, Steinbach, Kumar.

Tree Induction

• Greedy strategy.
– Split the records based on an attribute test that optimizes a certain criterion.

• Issues
– Determine how to split the records
  • How to specify the attribute test condition?
  • How to determine the best split?
– Determine when to stop splitting

Page 22: Data Mining Classification: Basic Concepts, Lecture Notes for Chapter 4 - 5 Introduction to Data Mining by Tan, Steinbach, Kumar.

How to Specify Test Condition?

• Depends on attribute types
– Nominal
– Ordinal
– Continuous

• Depends on number of ways to split
– 2-way split
– Multi-way split

Page 23: Data Mining Classification: Basic Concepts, Lecture Notes for Chapter 4 - 5 Introduction to Data Mining by Tan, Steinbach, Kumar.

Splitting Based on Nominal Attributes

• Multi-way split: use as many partitions as there are distinct values.
  Example: Marital Status -> {Single}, {Married}, {Divorced}

• Binary split: divides values into two subsets; need to find the optimal partitioning.
  Example: Marital Status -> {Married, Single} vs {Divorced}, OR {Divorced, Single} vs {Married}

Page 24: Data Mining Classification: Basic Concepts, Lecture Notes for Chapter 4 - 5 Introduction to Data Mining by Tan, Steinbach, Kumar.

Splitting Based on Ordinal Attributes

• Multi-way split: use as many partitions as there are distinct values.
  Example: an attribute SIZE defined over the ordered set {Small, Medium, Large} splits into {Small}, {Medium}, {Large}

Page 25: Data Mining Classification: Basic Concepts, Lecture Notes for Chapter 4 - 5 Introduction to Data Mining by Tan, Steinbach, Kumar.

Splitting Based on Ordinal Attributes

• Binary split: divides values into two subsets; need to find the optimal partitioning.
  Example: Size -> {Medium, Large} vs {Small}, OR {Small, Medium} vs {Large}

• What about the split {Small, Large} vs {Medium}?

Page 26: Data Mining Classification: Basic Concepts, Lecture Notes for Chapter 4 - 5 Introduction to Data Mining by Tan, Steinbach, Kumar.

Splitting Based on Continuous Attributes

• Different ways of handling
– Discretization to form an ordinal categorical attribute
  • Static – discretize once at the beginning
  • Dynamic – ranges can be found by equal-interval bucketing, equal-frequency bucketing (percentiles), or clustering
– Binary decision: (A < v) or (A >= v)
  • consider all possible splits and find the best cut (see the sketch below)
  • can be more compute-intensive
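A minimal sketch of that exhaustive search for the best binary cut (plain Python, using the Gini index as the criterion; the income values and labels come from the slides' toy data):

incomes = [125, 100, 70, 120, 95, 60, 220, 85, 75, 90]                     # Taxable Income (K)
labels  = ["No", "No", "No", "No", "Yes", "No", "No", "Yes", "No", "Yes"]  # Cheat

def gini(group):
    n = len(group)
    if n == 0:
        return 0.0
    p = group.count("Yes") / n
    return 1.0 - p ** 2 - (1 - p) ** 2

def split_quality(v):
    # Weighted Gini of the two partitions induced by the cut (A < v) vs (A >= v).
    left = [c for x, c in zip(incomes, labels) if x < v]
    right = [c for x, c in zip(incomes, labels) if x >= v]
    n = len(labels)
    return len(left) / n * gini(left) + len(right) / n * gini(right)

# Candidate cuts: midpoints between consecutive sorted attribute values.
values = sorted(set(incomes))
cuts = [(a + b) / 2 for a, b in zip(values, values[1:])]
best = min(cuts, key=split_quality)
print("best cut:", best, "weighted Gini:", split_quality(best))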

Page 27: Data Mining Classification: Basic Concepts, Lecture Notes for Chapter 4 - 5 Introduction to Data Mining by Tan, Steinbach, Kumar.

Splitting Based on Continuous Attributes

(i) Binary split: TaxableIncome > 80K? -> Yes / No

(ii) Multi-way split: TaxableIncome? -> < 10K, [10K, 25K), [25K, 50K), [50K, 80K), > 80K

Page 28: Data Mining Classification: Basic Concepts, Lecture Notes for Chapter 4 - 5 Introduction to Data Mining by Tan, Steinbach, Kumar.

Tree Induction

• Greedy strategy.
– Split the records based on an attribute test that optimizes a certain criterion.

• Issues
– Determine how to split the records
  • How to specify the attribute test condition?
  • How to determine the best split?
– Determine when to stop splitting

Page 29: Data Mining Classification: Basic Concepts, Lecture Notes for Chapter 4 - 5 Introduction to Data Mining by Tan, Steinbach, Kumar.

Example

Tid   Refund   Marital Status   Taxable Income   Cheat
 1    Yes      Single           125K             No
 2    No       Married          100K             No
 3    No       Single           70K              No
 4    Yes      Married          120K             No
 5    No       Divorced         95K              Yes
 6    No       Married          60K              No
 7    Yes      Divorced         220K             No
 8    No       Single           85K              Yes
 9    No       Married          75K              No
10    No       Single           90K              Yes

Candidate split on MarSt: Married vs Single, Divorced

                     Cheat = Yes   Cheat = No
Married                   0             4
Single, Divorced          3             3

Page 30: Data Mining Classification: Basic Concepts, Lecture Notes for Chapter 4 - 5 Introduction to Data Mining by Tan, Steinbach, Kumar.

Example

MarSt?
  Married -> NO  {NO: 4, YES: 0}
  Single, Divorced -> Refund? (Yes / No)

Class counts among the Single, Divorced records:

                     Cheat = Yes   Cheat = No
Refund = No               3             1
Refund = Yes              0             2

Page 31: Data Mining Classification: Basic Concepts, Lecture Notes for Chapter 4 - 5 Introduction to Data Mining by Tan, Steinbach, Kumar.

Example

MarSt?
  Married -> NO  {NO: 4, YES: 0}
  Single, Divorced -> Refund?
                        Yes -> NO  {NO: 2, YES: 0}
                        No  -> TaxInc? (< 80K or >= 80K)

Class counts among the Single, Divorced records with Refund = No:

                     Cheat = Yes   Cheat = No
TaxInc < 80K              0             1
TaxInc >= 80K             3             0

Page 32: Data Mining Classification: Basic Concepts, Lecture Notes for Chapter 4 - 5 Introduction to Data Mining by Tan, Steinbach, Kumar.

Example

MarSt?
  Married -> NO  {NO: 4, YES: 0}
  Single, Divorced -> Refund?
                        Yes -> NO  {NO: 2, YES: 0}
                        No  -> TaxInc?
                                 < 80K  -> NO  {NO: 1, YES: 0}
                                 >= 80K -> YES {NO: 0, YES: 3}

Page 33: Data Mining Classification: Basic Concepts, Lecture Notes for Chapter 4 - 5 Introduction to Data Mining by Tan, Steinbach, Kumar.

How to determine the Best Split

• Greedy approach:
– Nodes with a homogeneous class distribution are preferred
• Need a measure of node impurity:

C0: 5, C1: 5  ->  Non-homogeneous, high degree of impurity
C0: 9, C1: 1  ->  Homogeneous, low degree of impurity

Page 34: Data Mining Classification: Basic Concepts, Lecture Notes for Chapter 4 - 5 Introduction to Data Mining by Tan, Steinbach, Kumar.

Measures of Node Impurity

Given a node t:

• Gini Index:  GINI(t) = 1 - Σ_j [p(j | t)]²

• Entropy:  Entropy(t) = - Σ_j p(j | t) log₂ p(j | t)

• Misclassification error:  Error(t) = 1 - max_j p(j | t)
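A minimal sketch of the three measures in plain Python (counts is a list of per-class record counts at node t):

from math import log2

def gini(counts):
    n = sum(counts)
    return 1.0 - sum((c / n) ** 2 for c in counts)

def entropy(counts):
    n = sum(counts)
    return -sum((c / n) * log2(c / n) for c in counts if c > 0)

def classification_error(counts):
    n = sum(counts)
    return 1.0 - max(counts) / n

# The two nodes from the previous slide:
print(gini([5, 5]), entropy([5, 5]), classification_error([5, 5]))  # high impurity
print(gini([9, 1]), entropy([9, 1]), classification_error([9, 1]))  # low impurity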

Page 35: Data Mining Classification: Basic Concepts, Lecture Notes for Chapter 4 - 5 Introduction to Data Mining by Tan, Steinbach, Kumar.

Tree Induction

• Greedy strategy.
– Split the records based on an attribute test that optimizes a certain criterion.

• Issues
– Determine how to split the records
  • How to specify the attribute test condition?
  • How to determine the best split?
– Determine when to stop splitting

Page 36: Data Mining Classification: Basic Concepts, Lecture Notes for Chapter 4 - 5 Introduction to Data Mining by Tan, Steinbach, Kumar.

Stopping Criteria for Tree Induction

• Stop expanding a node when all the records belong to the same class

• Stop expanding a node when all the records have similar attribute values

Page 37: Data Mining Classification: Basic Concepts, Lecture Notes for Chapter 4 - 5 Introduction to Data Mining by Tan, Steinbach, Kumar.

Decision Tree Based Classification

• Advantages:

– Inexpensive to construct

– Extremely fast at classifying unknown records

– Easy to interpret for small-sized trees

– Accuracy is comparable to other classification techniques for many simple data sets

Page 38: Data Mining Classification: Basic Concepts, Lecture Notes for Chapter 4 - 5 Introduction to Data Mining by Tan, Steinbach, Kumar.

Naive Bayes

Uses probability theory to model the training set

Assumes independence between attributes

Produces a model for each class

Page 39: Data Mining Classification: Basic Concepts, Lecture Notes for Chapter 4 - 5 Introduction to Data Mining by Tan, Steinbach, Kumar.

Conditional probability:  P(C | A) = P(A, C) / P(A)

Bayes theorem:  P(C | A) = P(A | C) P(C) / P(A)

Bayes Theorem

Page 40: Data Mining Classification: Basic Concepts, Lecture Notes for Chapter 4 - 5 Introduction to Data Mining by Tan, Steinbach, Kumar.

Example of Bayes Theorem

• Given:

– A doctor knows that meningitis causes headache 50% of the time

– Prior probability of any patient having meningitis is 1/50,000– Prior probability of any patient having headache is 1/20

• If a patient has headache, what’s the probability he/she has meningitis ? (M = meningitis, S = headache)
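Applying Bayes theorem to the numbers above:

P(M | S) = P(S | M) P(M) / P(S) = (0.5 × 1/50,000) / (1/20) = 0.0002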

Page 41: Data Mining Classification: Basic Concepts, Lecture Notes for Chapter 4 - 5 Introduction to Data Mining by Tan, Steinbach, Kumar.

Bayesian Classifiers

• Consider each attribute and class label as random variables

• Given a record with attributes (A1, A2, …, An)
– Goal is to predict class C
– Specifically, we want to find the value of C that maximizes P(C | A1, A2, …, An)

• Can we estimate P(C| A1, A2,…,An ) directly from data?

Page 42: Data Mining Classification: Basic Concepts, Lecture Notes for Chapter 4 - 5 Introduction to Data Mining by Tan, Steinbach, Kumar.

Bayesian Classifiers

• Approach:
– Compute the posterior probability P(C | A1, A2, …, An) for all values of C using Bayes theorem:

  P(C | A1, A2, …, An) = P(A1, A2, …, An | C) P(C) / P(A1, A2, …, An)

– Choose the value of C that maximizes P(C | A1, A2, …, An)

– Equivalent to choosing the value of C that maximizes P(A1, A2, …, An | C) P(C)

• How to estimate P(A1, A2, …, An | C)?

Page 43: Data Mining Classification: Basic Concepts, Lecture Notes for Chapter 4 - 5 Introduction to Data Mining by Tan, Steinbach, Kumar.

Naïve Bayes Classifier

• Assume independence among the attributes Ai when the class is given:

  P(A1, A2, …, An | Cj) = P(A1 | Cj) P(A2 | Cj) … P(An | Cj)

– Can estimate P(Ai | Cj) for all Ai and Cj.

– A new point is classified to Cj if P(Cj) Π_i P(Ai | Cj) is maximal.

Page 44: Data Mining Classification: Basic Concepts, Lecture Notes for Chapter 4 - 5 Introduction to Data Mining by Tan, Steinbach, Kumar.

How to Estimate Probabilities from Data?

• Class prior: P(C) = Nc / N
– e.g., P(No) = 7/10, P(Yes) = 3/10

• For discrete attributes: P(Ai | Ck) = |Aik| / Nc
– where |Aik| is the number of instances that have attribute value Ai and belong to class Ck

– Examples:
  P(Status=Married | No) = 4/7
  P(Refund=Yes | Yes) = 0

Tid   Refund   Marital Status   Taxable Income   Evade
 1    Yes      Single           125K             No
 2    No       Married          100K             No
 3    No       Single           70K              No
 4    Yes      Married          120K             No
 5    No       Divorced         95K              Yes
 6    No       Married          60K              No
 7    Yes      Divorced         220K             No
 8    No       Single           85K              Yes
 9    No       Married          75K              No
10    No       Single           90K              Yes
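A minimal sketch of estimating these quantities from the table (assuming pandas; the column names mirror the table above):

import pandas as pd

data = pd.DataFrame({
    "Refund":        ["Yes", "No", "No", "Yes", "No", "No", "Yes", "No", "No", "No"],
    "MaritalStatus": ["Single", "Married", "Single", "Married", "Divorced",
                      "Married", "Divorced", "Single", "Married", "Single"],
    "TaxableIncome": [125, 100, 70, 120, 95, 60, 220, 85, 75, 90],
    "Evade":         ["No", "No", "No", "No", "Yes", "No", "No", "Yes", "No", "Yes"],
})

# Class priors P(C) = Nc / N
print(data["Evade"].value_counts(normalize=True))        # No: 0.7, Yes: 0.3

# P(Status = Married | Evade = No)  ->  4/7
no_records = data[data["Evade"] == "No"]
print((no_records["MaritalStatus"] == "Married").mean())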

Page 45: Data Mining Classification: Basic Concepts, Lecture Notes for Chapter 4 - 5 Introduction to Data Mining by Tan, Steinbach, Kumar.

How to Estimate Probabilities from Data?

• For continuous attributes:
– Discretize the range into bins
  • one ordinal attribute per bin
  • violates the independence assumption
– Two-way split: (A < v) or (A > v)
  • choose only one of the two splits as the new attribute
– Probability density estimation:
  • assume the attribute follows a normal distribution
  • use the data to estimate the parameters of the distribution (e.g., mean and standard deviation)
  • once the probability distribution is known, it can be used to estimate the conditional probability P(Ai | c)

Page 46: Data Mining Classification: Basic Concepts, Lecture Notes for Chapter 4 - 5 Introduction to Data Mining by Tan, Steinbach, Kumar.

How to Estimate Probabilities from Data?

Tid   Refund   Marital Status   Taxable Income   Evade
 1    Yes      Single           125K             No
 2    No       Married          100K             No
 3    No       Single           70K              No
 4    Yes      Married          120K             No
 5    No       Divorced         95K              Yes
 6    No       Married          60K              No
 7    Yes      Divorced         220K             No
 8    No       Single           85K              Yes
 9    No       Married          75K              No
10    No       Single           90K              Yes

Compute:

P(Status=Married | Yes) = ?
P(Refund=Yes | No) = ?
P(Status=Divorced | Yes) = ?
P(TaxableInc > 80K | Yes) = ?
P(TaxableInc > 80K | No) = ?

Page 47: Data Mining Classification: Basic Concepts, Lecture Notes for Chapter 4 - 5 Introduction to Data Mining by Tan, Steinbach, Kumar.

How to Estimate Probabilities from Data?

Tid   Refund   Marital Status   Taxable Income   Evade
 1    Yes      Single           125K             No
 2    No       Married          100K             No
 3    No       Single           70K              No
 4    Yes      Married          120K             No
 5    No       Divorced         95K              Yes
 6    No       Married          60K              No
 7    Yes      Divorced         220K             No
 8    No       Single           85K              Yes
 9    No       Married          75K              No
10    No       Single           90K              Yes

Compute:

P(Status=Married | Yes) = 0/3
P(Refund=Yes | No) = 3/7
P(Status=Divorced | Yes) = 1/3
P(TaxableInc > 80K | Yes) = 3/3
P(TaxableInc > 80K | No) = 4/7

Page 48: Data Mining Classification: Basic Concepts, Lecture Notes for Chapter 4 - 5 Introduction to Data Mining by Tan, Steinbach, Kumar.

Example of Naïve Bayes Classifier

Given a test record: X = (Refund = No, Married, Income >= 80K)

REFUND
P(Refund=Yes | No) = 3/7
P(Refund=No | No) = 4/7
P(Refund=Yes | Yes) = 0
P(Refund=No | Yes) = 3/3

MARITAL STATUS
P(Marital Status=Single | No) = 2/7
P(Marital Status=Divorced | No) = 1/7
P(Marital Status=Married | No) = 4/7
P(Marital Status=Single | Yes) = 2/3
P(Marital Status=Divorced | Yes) = 1/3
P(Marital Status=Married | Yes) = 0

TAXABLE INCOME
P(TaxableInc >= 80K | Yes) = 3/3
P(TaxableInc >= 80K | No) = 4/7
P(TaxableInc < 80K | Yes) = 0/3
P(TaxableInc < 80K | No) = 3/7

Class priors: P(Class=No) = 7/10, P(Class=Yes) = 3/10

P(Cj | A1, …, An) ∝ P(A1, …, An | Cj) P(Cj) = P(A1 | Cj) … P(An | Cj) P(Cj)

Page 49: Data Mining Classification: Basic Concepts, Lecture Notes for Chapter 4 - 5 Introduction to Data Mining by Tan, Steinbach, Kumar.

P(X | Class=No) = P(Refund=No | Class=No) × P(Married | Class=No) × P(Income>=80K | Class=No)
               = 4/7 × 4/7 × 4/7 = 0.1865

P(X | Class=Yes) = P(Refund=No | Class=Yes) × P(Married | Class=Yes) × P(Income>=80K | Class=Yes)
                = 1 × 0 × 1 = 0

Example of Naïve Bayes Classifier

P(X|No)P(No) = 0.1865 * 0.7 = 0.1306

P(X|Yes)P(Yes) = 0 * 0.3 = 0

Since P(X|No)P(No) > P(X|Yes)P(Yes)

Therefore P(No|X) > P(Yes|X) => Class = No

Page 50: Data Mining Classification: Basic Concepts, Lecture Notes for Chapter 4 - 5 Introduction to Data Mining by Tan, Steinbach, Kumar.

Given a test record: X = (Refund = No, Single, Income >= 80K)

Example of Naïve Bayes Classifier(2)

REFUND P(Refund=Yes|No) = 3/7P(Refund=No|No) = 4/7P(Refund=Yes|Yes) = 0P(Refund=No|Yes) = 3/3

MARITAL STATUS

P(Marital Status=Single|No) = 2/7P(Marital Status=Divorced|No) = 1/7P(Marital Status=Married|No) = 4/7P(Marital Status=Single|Yes) = 2/3P(Marital Status=Divorced|Yes) = 1/3P(Marital Status=Married|Yes) = 0

TAXABLE INCOME

P(TaxableInc >= 80K|Yes) = 3/3P(TaxableInc >= 80K|NO) = 4/7P(TaxableInc < 80K|Yes) = 0/3P(TaxableInc < 80K|NO) = 3/7

Class=No 7/10

Class=Yes 3/10

P(A1, A2, …, An | Cj) P(Cj) = P(A1 | Cj) P(A2 | Cj) … P(An | Cj) P(Cj)

Page 51: Data Mining Classification: Basic Concepts, Lecture Notes for Chapter 4 - 5 Introduction to Data Mining by Tan, Steinbach, Kumar.

P(X | Class=No) = P(Refund=No | Class=No) × P(Single | Class=No) × P(Income>=80K | Class=No)
               = 4/7 × 2/7 × 4/7 = 0.0933

P(X | Class=Yes) = P(Refund=No | Class=Yes) × P(Single | Class=Yes) × P(Income>=80K | Class=Yes)
                = 1 × 2/3 × 1 = 0.667

Example of Naïve Bayes Classifier(2)

P(X|No)P(No) = 0.0933 × 0.7 = 0.0653

P(X|Yes)P(Yes) = 0.667 × 0.3 = 0.2

Since P(X|No)P(No) < P(X|Yes)P(Yes)

Therefore P(No|X) < P(Yes|X) => Class = Yes
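A minimal sketch that reproduces these posterior comparisons in plain Python (the probability tables are transcribed from the slides above):

# Conditional probabilities P(attribute value | class), from the slides.
p_no  = {"Refund=No": 4/7, "Married": 4/7, "Single": 2/7, "Income>=80K": 4/7}
p_yes = {"Refund=No": 1.0, "Married": 0.0, "Single": 2/3, "Income>=80K": 1.0}
prior = {"No": 0.7, "Yes": 0.3}

def score(record, cond, cls):
    # Unnormalized posterior P(X | class) * P(class) under the naive independence assumption.
    s = prior[cls]
    for value in record:
        s *= cond[value]
    return s

X = ["Refund=No", "Single", "Income>=80K"]
print("No :", score(X, p_no, "No"))    # 0.0653...
print("Yes:", score(X, p_yes, "Yes"))  # 0.2  -> Class = Yes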

Page 52: Data Mining Classification: Basic Concepts, Lecture Notes for Chapter 4 - 5 Introduction to Data Mining by Tan, Steinbach, Kumar.

Naïve Bayes (Summary)

• Robust to isolated noise points

• Model each class separately

• Robust to irrelevant attributes

• Uses the whole set of attributes to perform classification

• Independence assumption may not hold for some attributes

Page 53: Data Mining Classification: Basic Concepts, Lecture Notes for Chapter 4 - 5 Introduction to Data Mining by Tan, Steinbach, Kumar.

53

Ensemble Learning

Roberto Esposito and Dino Ienco

http://www2.lirmm.fr/~ienco/Dino_Ienco_Home_Page/Index.html

Acknowledgements

Most of the material is based on Nicholaus Kushmerick's slides. You can find his original slideshow at:

www.cs.ucd.ie/staff/nick/home/COMP-4030/L14,15.ppt

Several pictures are taken from the slides by Thomas Dietterich. You can find his original slideshow (see slides about Bias/Variance theory) at:

http://web.engr.oregonstate.edu/~tgd/classes/534/index.html

Page 54: Data Mining Classification: Basic Concepts, Lecture Notes for Chapter 4 - 5 Introduction to Data Mining by Tan, Steinbach, Kumar.

54

Agenda

• 1. What is ensemble learning

• 2. (Ensemble) Classification:
– 2.1 Bagging
– 2.2 Boosting
– 2.3 Why do Ensemble Classifiers Work?

• 3. (Ensemble) Clustering:
– Cluster-based Similarity Partitioning Algorithm (CSPA)
– HyperGraph-Partitioning Algorithm (HGPA)
– Some hints on how to build base clusters

Page 55: Data Mining Classification: Basic Concepts, Lecture Notes for Chapter 4 - 5 Introduction to Data Mining by Tan, Steinbach, Kumar.

55

part 1. What is ensemble learning?

Ensemble learning refers to a collection of methods that learn a target function by training a number of individual learners and combining their predictions

[Freund & Schapire, 1995]

Page 56: Data Mining Classification: Basic Concepts, Lecture Notes for Chapter 4 - 5 Introduction to Data Mining by Tan, Steinbach, Kumar.

56

Ensemble learning

(Diagram)
Learning phase: from the training set T, build the training sets T1, T2, …, TS (different training sets and/or learning algorithms) and learn the hypotheses h1, h2, …, hS.
Application phase: a new instance (x, ?) is classified by the combined hypothesis h* = F(h1, h2, …, hS), producing (x, y*).

Page 57: Data Mining Classification: Basic Concepts, Lecture Notes for Chapter 4 - 5 Introduction to Data Mining by Tan, Steinbach, Kumar.

57

How to make an effective ensemble?

Two basic decisions when designing ensembles:

1. How to generate the base models h1, h2, …?

2. How to integrate/combine them? F(h1(x), h2(x), …)

Page 58: Data Mining Classification: Basic Concepts, Lecture Notes for Chapter 4 - 5 Introduction to Data Mining by Tan, Steinbach, Kumar.

58

Ensemble Classification

Ensemble of Classifiers:

– How to generate each classifier in an ensemble schema (which training set should it use)?

– How to combine the vote of each classifier?

– Why do Ensemble Classifiers work?

Page 59: Data Mining Classification: Basic Concepts, Lecture Notes for Chapter 4 - 5 Introduction to Data Mining by Tan, Steinbach, Kumar.

59

Question 2: How to integrate them

• Usually take a weighted vote:

  ensemble(x) = f( Σ_i wi hi(x) )

– wi is the "weight" of hypothesis hi
– wi > wj means "hi is more reliable than hj"
– typically wi > 0 (though one could have wi < 0, meaning "hi is more often wrong than right")

• (Fancier schemes are possible but uncommon)
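A minimal sketch of such a weighted vote for binary classifiers (plain Python; representing hypotheses as functions that return +1 or -1 is an assumption made here for illustration):

def weighted_vote(hypotheses, weights, x):
    # Combine the hypotheses h_i (each returning +1 or -1) with weights w_i.
    total = sum(w * h(x) for h, w in zip(hypotheses, weights))
    return 1 if total >= 0 else -1

# Toy usage: three simple threshold rules on a scalar x.
hs = [lambda x: 1 if x > 0 else -1,
      lambda x: 1 if x > 2 else -1,
      lambda x: 1 if x > -1 else -1]
ws = [0.5, 0.3, 0.2]
print(weighted_vote(hs, ws, 1.0))   # 0.5 - 0.3 + 0.2 = 0.4 >= 0  ->  1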

Page 60: Data Mining Classification: Basic Concepts, Lecture Notes for Chapter 4 - 5 Introduction to Data Mining by Tan, Steinbach, Kumar.

60

Question 1: How to generate base classifiers

• Lots of approaches…
• A. Bagging
• B. Boosting
• …

Page 61: Data Mining Classification: Basic Concepts, Lecture Notes for Chapter 4 - 5 Introduction to Data Mining by Tan, Steinbach, Kumar.

61

BAGGing = Bootstrap AGGregation

(Breiman, 1996)

• for i = 1, 2, …, K:
– Ti <- randomly select M training instances with replacement
– hi <- learn(Ti)   [ID3, NB, kNN, neural net, …]

• Now combine the hi together with uniform voting (wi = 1/K for all i)
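A minimal sketch of this procedure (assuming scikit-learn; a hand-rolled bootstrap loop rather than the library's ready-made bagging class, to mirror the pseudocode above):

import numpy as np
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)
K, M = 10, len(X)                            # K base models, M bootstrap samples each
rng = np.random.default_rng(0)

models = []
for _ in range(K):
    idx = rng.integers(0, len(X), size=M)    # sample M instances with replacement
    models.append(DecisionTreeClassifier().fit(X[idx], y[idx]))

# Uniform voting: majority class over the K predictions.
preds = np.array([m.predict(X) for m in models])
majority = np.apply_along_axis(lambda col: np.bincount(col).argmax(), 0, preds)
print("training accuracy of the bagged ensemble:", (majority == y).mean())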

Page 62: Data Mining Classification: Basic Concepts, Lecture Notes for Chapter 4 - 5 Introduction to Data Mining by Tan, Steinbach, Kumar.

62

Page 63: Data Mining Classification: Basic Concepts, Lecture Notes for Chapter 4 - 5 Introduction to Data Mining by Tan, Steinbach, Kumar.

63

decision tree learning algorithm; along the lines of ID3

Page 64: Data Mining Classification: Basic Concepts, Lecture Notes for Chapter 4 - 5 Introduction to Data Mining by Tan, Steinbach, Kumar.

64

shades of blue/red indicate strength of vote for particular classification

Page 65: Data Mining Classification: Basic Concepts, Lecture Notes for Chapter 4 - 5 Introduction to Data Mining by Tan, Steinbach, Kumar.

65

Page 66: Data Mining Classification: Basic Concepts, Lecture Notes for Chapter 4 - 5 Introduction to Data Mining by Tan, Steinbach, Kumar.

66

Page 67: Data Mining Classification: Basic Concepts, Lecture Notes for Chapter 4 - 5 Introduction to Data Mining by Tan, Steinbach, Kumar.

67

Page 68: Data Mining Classification: Basic Concepts, Lecture Notes for Chapter 4 - 5 Introduction to Data Mining by Tan, Steinbach, Kumar.

68

Boosting

• Bagging was one simple way to generate ensemble members, with trivial (uniform) vote weighting

• Boosting is another…

• "Boost" as in "give a hand up to"
– suppose algorithm A can learn a hypothesis that is better than rolling a dice – but perhaps only a tiny bit better
– Theorem: boosting A yields an ensemble with arbitrarily low error on the training data!

(Figure: ensemble error rate versus ensemble size, from 1 to 500 members. The error rate of A by itself is just below 50%, the error rate of flipping a coin; the ensemble error rate decreases steadily as the ensemble grows.)

Page 69: Data Mining Classification: Basic Concepts, Lecture Notes for Chapter 4 - 5 Introduction to Data Mining by Tan, Steinbach, Kumar.

69

Boosting

Idea:
• assign a weight to every training set instance
• initially, all instances have the same weight
• as boosting proceeds, adjust the weights based on how well we have predicted the data points so far
  - data points correctly predicted -> low weight
  - data points mispredicted -> high weight

Result: as learning proceeds, the learner is forced to focus on portions of the data space not previously well predicted

Page 70: Data Mining Classification: Basic Concepts, Lecture Notes for Chapter 4 - 5 Introduction to Data Mining by Tan, Steinbach, Kumar.

70

Page 71: Data Mining Classification: Basic Concepts, Lecture Notes for Chapter 4 - 5 Introduction to Data Mining by Tan, Steinbach, Kumar.

71

blue/red = class; size of dot = weight; hypothesis = horizontal or vertical line

Time=0

Page 72: Data Mining Classification: Basic Concepts, Lecture Notes for Chapter 4 - 5 Introduction to Data Mining by Tan, Steinbach, Kumar.

72

The WL error is 30%

The ensemble error is 30% (note T=1)

Time=1

Page 73: Data Mining Classification: Basic Concepts, Lecture Notes for Chapter 4 - 5 Introduction to Data Mining by Tan, Steinbach, Kumar.

73

Time=3

Page 74: Data Mining Classification: Basic Concepts, Lecture Notes for Chapter 4 - 5 Introduction to Data Mining by Tan, Steinbach, Kumar.

74

Time=7

Page 75: Data Mining Classification: Basic Concepts, Lecture Notes for Chapter 4 - 5 Introduction to Data Mining by Tan, Steinbach, Kumar.

75

Time=21

Notice the slope of the weak learner error: AdaBoost creates problems of increasing difficulty.

Page 76: Data Mining Classification: Basic Concepts, Lecture Notes for Chapter 4 - 5 Introduction to Data Mining by Tan, Steinbach, Kumar.

76

Time=51

Look, the training error is zero. One could think that we cannot improve the test error any more.

Page 77: Data Mining Classification: Basic Concepts, Lecture Notes for Chapter 4 - 5 Introduction to Data Mining by Tan, Steinbach, Kumar.

77

But… the test error still decreases!

Time=57

Page 78: Data Mining Classification: Basic Concepts, Lecture Notes for Chapter 4 - 5 Introduction to Data Mining by Tan, Steinbach, Kumar.

78

Time=110

Page 79: Data Mining Classification: Basic Concepts, Lecture Notes for Chapter 4 - 5 Introduction to Data Mining by Tan, Steinbach, Kumar.

79

Page 80: Data Mining Classification: Basic Concepts, Lecture Notes for Chapter 4 - 5 Introduction to Data Mining by Tan, Steinbach, Kumar.

80

AdaBoost (Freund and Schapire)

• initialize the instance weights to 1/N
• at each round t, normalize the weights wt to get a probability distribution pt (Σ_i pt,i = 1); mistakes on high-weight instances are penalized more
• if ht correctly classifies xi, multiply its weight by βt < 1; otherwise multiply its weight by 1 (leave it unchanged)
• binary class y ∈ {0, 1}
• final prediction: weighted vote, with weight wt = log(1/βt)
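A minimal sketch of this scheme (assuming scikit-learn decision stumps as the weak learner; it follows the common exponential-weight AdaBoost recipe rather than the exact βt notation of the slide):

import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.tree import DecisionTreeClassifier

X, y = load_breast_cancer(return_X_y=True)    # binary labels {0, 1}
y_pm = np.where(y == 1, 1, -1)                # work with {-1, +1} internally

n, T = len(X), 20
w = np.full(n, 1.0 / n)                       # initial weights 1/N
stumps, alphas = [], []

for _ in range(T):
    stump = DecisionTreeClassifier(max_depth=1).fit(X, y_pm, sample_weight=w)
    pred = stump.predict(X)
    err = np.sum(w[pred != y_pm]) / np.sum(w)
    if err >= 0.5:                            # weak learner no better than chance
        break
    alpha = 0.5 * np.log((1 - err) / max(err, 1e-12))
    w *= np.exp(-alpha * y_pm * pred)         # down-weight correct, up-weight mistakes
    w /= w.sum()                              # renormalize to a probability distribution
    stumps.append(stump)
    alphas.append(alpha)

# Weighted vote of the stumps.
scores = sum(a * s.predict(X) for a, s in zip(alphas, stumps))
print("training accuracy:", (np.sign(scores) == y_pm).mean())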

Page 81: Data Mining Classification: Basic Concepts, Lecture Notes for Chapter 4 - 5 Introduction to Data Mining by Tan, Steinbach, Kumar.

81

Learning from weighted instances?

• One piece of the puzzle is missing…

• So far, learning algorithms have just taken as input a set of equally important learning instances.

Reweighting
• What if we also get a weight vector (with entries in [0,1]) saying how important each instance is?
• It turns out that it is very easy to modify most learning algorithms to deal with weighted instances:
– ID3: easy to modify the entropy / information-gain equations to take into account the weights associated with the examples, rather than only the counts (which simply assumes all weights = 1)
– Naïve Bayes: ditto
– k-NN: multiply the vote from an instance by its weight

Page 82: Data Mining Classification: Basic Concepts, Lecture Notes for Chapter 4 - 5 Introduction to Data Mining by Tan, Steinbach, Kumar.

82

Learning from weighted instances?

Resampling

As an alternative to modifying learning algorithms to support weighted datasets, we can build a new dataset which is not weighted but shows the same properties as the weighted one.

1. Let L' be the empty set
2. Let (w1, …, wn) be the weights of the examples in L, sorted in some fixed order (we assume wi corresponds to example xi)
3. Draw n ∈ [0..1] according to U(0,1)
4. Set L' <- L' ∪ {xk}, where k is such that Σ_{i=1}^{k-1} wi < n ≤ Σ_{i=1}^{k} wi
5. If enough examples have been drawn, return L'
6. Else go to 3
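A minimal sketch of this resampling loop in plain Python (assuming the weights are normalized to sum to 1):

import random

def resample(examples, weights, size):
    # Draw `size` examples with probability proportional to their weights.
    cumulative, total = [], 0.0
    for w in weights:
        total += w
        cumulative.append(total)          # running sums of the normalized weights
    new_set = []
    while len(new_set) < size:
        u = random.random()               # draw u ~ U(0, 1)
        k = next(i for i, c in enumerate(cumulative) if u <= c)
        new_set.append(examples[k])
    return new_set

examples = ["x1", "x2", "x3", "x4"]
weights = [0.1, 0.2, 0.3, 0.4]
print(resample(examples, weights, size=4))   # rule of thumb: |L'| = |L|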

Page 83: Data Mining Classification: Basic Concepts, Lecture Notes for Chapter 4 - 5 Introduction to Data Mining by Tan, Steinbach, Kumar.

83

Learning from weighted instances?

• How many examples are “enough”?

The higher the number, the better L' approximates a dataset following the distribution induced by W.

As a rule of thumb: |L’|=|L| usually works reasonably well.

• Why don’t we always use resampling instead of reweighting?

Resampling can always be applied; unfortunately, it requires more resources and produces less accurate results. One should use this technique only when it is too costly (or infeasible) to use reweighting.

Page 84: Data Mining Classification: Basic Concepts, Lecture Notes for Chapter 4 - 5 Introduction to Data Mining by Tan, Steinbach, Kumar.

84

Why do ensemble classifiers work?

1. Statistical
2. Representational
3. Computational

[T. G. Dietterich. Ensemble methods in machine learning. Lecture Notes in Computer Science, 1857:1–15, 2000.]

Page 85: Data Mining Classification: Basic Concepts, Lecture Notes for Chapter 4 - 5 Introduction to Data Mining by Tan, Steinbach, Kumar.

85

1. Statistical
• Given a finite amount of data, many hypotheses are typically equally good. How can the learning algorithm select among them?

Optimal Bayes classifier recipe: take a weighted majority vote of all hypotheses, weighted by their posterior probability. That is, put most weight on hypotheses consistent with the data.

Hence, ensemble learning may be viewed as an approximation of the Optimal Bayes rule (which is provably the best possible classifier).

Page 86: Data Mining Classification: Basic Concepts, Lecture Notes for Chapter 4 - 5 Introduction to Data Mining by Tan, Steinbach, Kumar.

86

2. Representational
The desired target function may not be implementable with individual classifiers, but may be approximated by ensemble averaging.

Suppose you want to build a diagonal decision boundary with decision trees. The decision boundaries of decision trees are hyperplanes parallel to the coordinate axes. By averaging a large number of such "staircases", the diagonal decision boundary can be approximated with arbitrarily good accuracy.

Page 87: Data Mining Classification: Basic Concepts, Lecture Notes for Chapter 4 - 5 Introduction to Data Mining by Tan, Steinbach, Kumar.

87

3. Computational
• All learning algorithms do some sort of search through some space of hypotheses to find one that is "good enough" for the given training data

• Since interesting hypothesis spaces are huge/infinite, heuristic search is essential (e.g., ID3 does greedy search in the space of possible decision trees)

• So the learner might get stuck in a local minimum

• One strategy for avoiding local minima: repeat the search many times with random restarts -> bagging

Page 88: Data Mining Classification: Basic Concepts, Lecture Notes for Chapter 4 - 5 Introduction to Data Mining by Tan, Steinbach, Kumar.

88

Reading

• Dietterich: Ensemble methods in machine learning (2000).

• Schapire: A brief introduction to boosting (1999). [Sec 1-2, 5-6]

• Dietterich & Bakiri: Solving multiclass learning problems via error-correcting output codes (1995). [Skim]

Page 89: Data Mining Classification: Basic Concepts, Lecture Notes for Chapter 4 - 5 Introduction to Data Mining by Tan, Steinbach, Kumar.

89

Summary…

• Ensemble Classifiers: basic motivation – creating a committee of experts is more effective than trying to derive a single super-genius

• Key issues:
– Generation of base models
– Integration of base models

• Popular ensemble techniques
– manipulate the training data: bagging and boosting (an ensemble of "experts", each specializing on different portions of the instance space)

• Current research: ensemble pruning (reduce the number of classifiers, selecting only the non-redundant and informative ones)


Page 90: Data Mining Classification: Basic Concepts, Lecture Notes for Chapter 4 - 5 Introduction to Data Mining by Tan, Steinbach, Kumar.

90

Ensemble Clustering

Ensemble of Clusterings (Partitions):

– How to formulate the problem?

– How to define how far apart (or how similar) two different clustering solutions are?

– How to combine different partitions? (CSPA, HGPA)

– How to generate different partitions?

Page 91: Data Mining Classification: Basic Concepts, Lecture Notes for Chapter 4 - 5 Introduction to Data Mining by Tan, Steinbach, Kumar.

91

Example

We have a dataset of 7 examples; we run 3 clustering algorithms and obtain 3 different clustering solutions:

C1 = (1,1,1,2,2,3,3)
C2 = (1,1,2,2,3,3,3)
C3 = (2,2,2,3,3,1,1)

Each clustering solution is represented as a vector with as many components as the number of original examples (7); each component holds the label of the cluster assigned to the corresponding example.

How much information is shared among the different partitions?
How do we combine them?

Page 92: Data Mining Classification: Basic Concepts, Lecture Notes for Chapter 4 - 5 Introduction to Data Mining by Tan, Steinbach, Kumar.

92

How to formulate the problem

GOAL: seek a final clustering that shares the most information with the original clusterings.

Find a partition that shares as much information as possible with all the individual clustering results.

If we have N clustering results, we want to obtain a solution Copt such that:

Copt = argmax_C Σ_{i=1}^{N} φ(C, Ci)

Page 93: Data Mining Classification: Basic Concepts, Lecture Notes for Chapter 4 - 5 Introduction to Data Mining by Tan, Steinbach, Kumar.

93

How to formulate the problem

where Ci is one of the individual clustering solutions, Copt is the optimal solution to the clustering ensemble problem, and φ is a measure able to evaluate the similarity between two clustering results.

Page 94: Data Mining Classification: Basic Concepts, Lecture Notes for Chapter 4 - 5 Introduction to Data Mining by Tan, Steinbach, Kumar.

94

How to define similarity between partitions

How can we define the similarity function φ?

- We use the Normalized Mutual Information (original paper: Cluster Ensembles – A Knowledge Reuse Framework for Combining Multiple Partitions, Alexander Strehl and Joydeep Ghosh)

- It is a normalized version of the Mutual Information

- Mutual Information is usually used to evaluate the correlation between two random variables
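A minimal sketch of measuring this similarity between two of the example clusterings (assuming scikit-learn's normalized_mutual_info_score):

from sklearn.metrics import normalized_mutual_info_score

C1 = [1, 1, 1, 2, 2, 3, 3]
C2 = [1, 1, 2, 2, 3, 3, 3]

# NMI is symmetric and invariant to a relabeling of the clusters.
print(normalized_mutual_info_score(C1, C2))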

Page 95: Data Mining Classification: Basic Concepts, Lecture Notes for Chapter 4 - 5 Introduction to Data Mining by Tan, Steinbach, Kumar.

95

How to combine different partitions?

Cluster-based Similarity Partitioning Algorithm (CSPA)

Simple approach: given an ensemble of partitions:

- if M is the number of objects, build an M x M matrix

- each cell of the matrix counts how many times the two objects co-occur in the same cluster

- the matrix can be seen as an object-object similarity matrix

- re-cluster the matrix with a clustering algorithm (in the original paper they apply METIS, a graph-based clustering approach, to re-cluster the similarity matrix)

Page 96: Data Mining Classification: Basic Concepts, Lecture Notes for Chapter 4 - 5 Introduction to Data Mining by Tan, Steinbach, Kumar.

96

Example (I)

Given the different partitions of the dataset:
C1 = (1,1,1,2,2,3,3)
C2 = (1,1,2,2,3,3,3)
C3 = (2,2,2,3,3,1,1)

We obtain this co-occurrence matrix:

3 3 2 0 0 0 0
3 3 2 0 0 0 0
2 2 3 1 0 0 0
0 0 1 3 2 0 0
0 0 0 2 3 1 1
0 0 0 0 1 3 3
0 0 0 0 1 3 3
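A minimal sketch of building this co-association matrix (plain Python with NumPy; the cluster labels are those of the example):

import numpy as np

clusterings = [
    [1, 1, 1, 2, 2, 3, 3],   # C1
    [1, 1, 2, 2, 3, 3, 3],   # C2
    [2, 2, 2, 3, 3, 1, 1],   # C3
]

M = len(clusterings[0])
S = np.zeros((M, M), dtype=int)
for labels in clusterings:
    for i in range(M):
        for j in range(M):
            if labels[i] == labels[j]:
                S[i, j] += 1             # objects i and j share a cluster

print(S)   # diagonal = number of clusterings (3)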

Page 97: Data Mining Classification: Basic Concepts, Lecture Notes for Chapter 4 - 5 Introduction to Data Mining by Tan, Steinbach, Kumar.

97

Example (II)

We obtain this co-occurrence matrix:

3 3 2 0 0 0 0
3 3 2 0 0 0 0
2 2 3 1 0 0 0
0 0 1 3 2 0 0
0 0 0 2 3 1 1
0 0 0 0 1 3 3
0 0 0 0 1 3 3

We obtain this graph of co-occurrence: nodes X1, …, X7 with weighted edges X1-X2 (3), X1-X3 (2), X2-X3 (2), X3-X4 (1), X4-X5 (2), X5-X6 (1), X5-X7 (1), X6-X7 (3).

Page 98: Data Mining Classification: Basic Concepts, Lecture Notes for Chapter 4 - 5 Introduction to Data Mining by Tan, Steinbach, Kumar.

98

How to combine different partitions?

HyperGraph-Partitioning Algorithm (HGPA)

Hypergraph-based approach: given an ensemble of partitions:

- each cluster of each partition is seen as a hyperedge of the hypergraph

- each object is a node

- all nodes and hyperedges have the same weight

- try to eliminate hyperedges so as to obtain K unconnected components of approximately the same size

- apply standard techniques to partition the hypergraph

Page 99: Data Mining Classification: Basic Concepts, Lecture Notes for Chapter 4 - 5 Introduction to Data Mining by Tan, Steinbach, Kumar.

99

Hypergraph

A hypergraph is defined as:
- a set of nodes
- a set of hyperedges over the nodes

Each hyperedge is not a relation between exactly 2 nodes: it is a relation among any number of nodes.

Hyperedge: defined as a set of nodes that are in some relation with each other.

Page 100: Data Mining Classification: Basic Concepts, Lecture Notes for Chapter 4 - 5 Introduction to Data Mining by Tan, Steinbach, Kumar.

100

Example

Given the different partitions of the dataset:
C1 = (1,1,1,2,2,3,3)
C2 = (1,1,2,2,3,3,3)
C3 = (2,2,2,3,3,1,1)
The dataset is composed of 7 examples.

We obtain a hypergraph over the nodes X1, …, X7, with one hyperedge per cluster (for instance, the first cluster of C1 yields the hyperedge {X1, X2, X3}).

Page 101: Data Mining Classification: Basic Concepts, Lecture Notes for Chapter 4 - 5 Introduction to Data Mining by Tan, Steinbach, Kumar.

101

How to generate different partitions (I)

Ways to generate different partitions:

1) Using the same clustering algorithm:

a) fix the dataset and change the parameter values in order to obtain different results.

b) Produce as many random projection of the original dataset as the number of partitions needed and then use the clustering algorithm to obtain a partition over each projected dataset.

Page 102: Data Mining Classification: Basic Concepts, Lecture Notes for Chapter 4 - 5 Introduction to Data Mining by Tan, Steinbach, Kumar.

102

How to generate different partitions (II)

Ways to generate different partitions:

2) Using different clustering algorithms:

a) fix the dataset and run as many clustering algorithms as you need. If the number of algorithms available is smaller than the number of partition needed, re-run the same algorithms with different parameters.

b) Produce some random projections of the dataset and then apply the different clustering algorithms.

Page 103: Data Mining Classification: Basic Concepts, Lecture Notes for Chapter 4 - 5 Introduction to Data Mining by Tan, Steinbach, Kumar.

103

Reading

• Strehl and Ghosh: Cluster Ensembles – A Knowledge Reuse Framework for Combining Multiple Partitions (2002).

• Al-Razgan and Domeniconi: Weighted

Clustering Ensembles (2006).

• Fern and Lin: Cluster Ensemble Selection (2008).

Page 104: Data Mining Classification: Basic Concepts, Lecture Notes for Chapter 4 - 5 Introduction to Data Mining by Tan, Steinbach, Kumar.

104

Summary…
• Ensemble Clustering: basic motivation – no single clustering algorithm can adequately handle all types of cluster shapes and structures

• Key issues:
– How to choose an appropriate algorithm?
– How to interpret the different partitions produced by different clustering algorithms?
– More difficult than ensembles of classifiers

• Popular ensemble techniques
– combine partitions obtained in different ways, using the similarity between partitions or co-occurrence

• Current research: cluster selection (reduce the number of clusterings, selecting only the non-redundant and informative ones)


