Decision Tree Learning
Page 1: Decision Tree Learning


Decision Tree Learning

Ata Kaban

The University of Birmingham

Page 2: Decision Tree Learning

Today we learn about:
– Decision Tree Representation
– Entropy, Information Gain
– ID3 learning algorithm for classification
– Avoiding overfitting

Page 3: Decision Tree Learning

Decision Tree Representation for ‘Play Tennis?’

Internal node ~ test an attribute
Branch ~ attribute value
Leaf ~ classification result

Page 4: Decision Tree Learning

When is it useful?

– Medical diagnosis
– Equipment diagnosis
– Credit risk analysis
– etc.

Page 5: Decision Tree Learning


Page 6: Decision Tree Learning

Sunburn Data Collected

Name   Hair    Height   Weight   Lotion  Result
Sarah  Blonde  Average  Light    No      Sunburned
Dana   Blonde  Tall     Average  Yes     None
Alex   Brown   Short    Average  Yes     None
Annie  Blonde  Short    Average  No      Sunburned
Emily  Red     Average  Heavy    No      Sunburned
Pete   Brown   Tall     Heavy    No      None
John   Brown   Average  Heavy    No      None
Kate   Blonde  Short    Light    Yes     None

Page 7: Decision Tree Learning

Decision Tree 1: is_sunburned

[Tree diagram: the root tests Height (short / average / tall). tall → {Dana, Pete}; the short and average branches go on to test Hair colour and Weight, ending in the leaves {Sarah}, {Emily}, {Annie} (sunburned) and {Alex}, {John}, {Katie} (not sunburned).]

Page 8: Decision Tree Learning

Sunburn sufferers are ...

If height=“average” then
  – if weight=“light” then
      • return(true)            ;;; Sarah
  – elseif weight=“heavy” then
      • if hair_colour=“red” then
          – return(true)        ;;; Emily
elseif height=“short” then
  – if hair_colour=“blonde” then
      • if weight=“average” then
          – return(true)        ;;; Annie
else return(false)              ;;; everyone else

Page 9: Decision Tree Learning

Decision Tree 2: is_sunburned

[Tree diagram: the root tests Lotion used (no / yes); each branch then tests Hair colour. no → blonde: {Sarah, Annie} (sunburned), red: {Emily} (sunburned), brown: {Pete, John}. yes → blonde: {Dana, Katie}, brown: {Alex}.]

Page 10: Decision Tree Learning

Decision Tree 3: is_sunburned

[Tree diagram: the root tests Hair colour. blonde → test Lotion used (no: {Sarah, Annie}, sunburned; yes: {Dana, Katie}); red → {Emily} (sunburned); brown → {Alex, Pete, John}.]

Page 11: Decision Tree Learning

Summing up

– Irrelevant attributes do not classify the data well
– Using irrelevant attributes thus causes larger decision trees
– A computer could look for simpler decision trees

Q: How?

Page 12: Decision Tree Learning

A: The same way WE did it!

Q: Which is the best attribute for splitting up the data?
A: The one which is most informative for the classification we want to get.

Q: What does ‘more informative’ mean?
A: The attribute which best reduces the uncertainty, or the disorder.

Q: How can we measure something like that?
A: Simple – just listen ;-)

Page 13: Decision Tree Learning

We need a quantity to measure the disorder in a set of examples

S = {s1, s2, s3, …, sn}   where s1 = “Sarah”, s2 = “Dana”, …

Then we need a quantity to measure how much that disorder is reduced by knowing the value of a particular attribute

Page 14: Decision Tree Learning

What properties should the Disorder (D) have?

Suppose that D(S) = 0 means that all the examples in S have the same class

Suppose that D(S) = 1 means that half the examples in S are of one class and half are of the opposite class

Page 15: Decision Tree Learning

Examples

D({“Dana”,“Pete”}) = 0
D({“Sarah”,“Annie”,“Emily”}) = 0
D({“Sarah”,“Emily”,“Alex”,“John”}) = 1
D({“Sarah”,“Emily”,“Alex”}) = ?

Page 16: Decision Tree Learning

D({“Sarah”,“Emily”,“Alex”}) = 0.918

[Plot: Disorder D against the proportion of positive examples p+. D = 0 at p+ = 0 and at p+ = 1, D = 1.0 at p+ = 0.5, and D = 0.918 at p+ = 0.67 (two positives out of three).]

Page 17: Decision Tree Learning

Definition of Disorder

The Entropy measures the disorder of a set S containing a total of n examples, of which n+ are positive and n− are negative, and it is given by

$$D(n_+, n_-) = \mathrm{Entropy}(S) = -\frac{n_+}{n}\log_2\frac{n_+}{n} - \frac{n_-}{n}\log_2\frac{n_-}{n}$$

where $\log_2 x$ means the logarithm of $x$ to base 2.

Check it!

D(0,1) = ? D(1,0)=? D(0.5,0.5)=?
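These checks (and the 0.918 value on the previous slide) are easy to verify numerically. Below is a minimal Python sketch of the disorder measure, my own illustration rather than code from the lecture; the function name `disorder` and the convention 0·log2(0) = 0 are assumptions made for the example.

```python
# Illustrative sketch (not from the slides): disorder / entropy of a two-class set.
from math import log2

def disorder(n_pos, n_neg):
    """D(n+, n-): entropy of a set with n_pos positive and n_neg negative examples."""
    n = n_pos + n_neg
    if n == 0:
        return 0.0
    d = 0.0
    for k in (n_pos, n_neg):
        p = k / n
        if p > 0:              # convention: 0 * log2(0) = 0
            d -= p * log2(p)
    return d

print(disorder(0, 1))   # 0.0  -- all examples in the same class
print(disorder(1, 0))   # 0.0
print(disorder(1, 1))   # 1.0  -- half and half, i.e. D(0.5, 0.5) = 1
print(disorder(2, 1))   # 0.918... e.g. {"Sarah", "Emily", "Alex"}
print(disorder(3, 5))   # 0.954... the whole sunburn data set (next slide)
```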

Page 18: Decision Tree Learning

Back to the beach (or the disorder of sunbathers)!

D({“Sarah”,“Dana”,“Alex”,“Annie”,“Emily”,“Pete”,“John”,“Katie”})

$$= D(3, 5) = -\frac{3}{8}\log_2\frac{3}{8} - \frac{5}{8}\log_2\frac{5}{8} \approx 0.954$$

Page 19: Decision Tree Learning

Some more useful properties of the Entropy

$D(n, m) = D(m, n)$

$D(0, m) = 0$

$D(m, m) = 1$

Page 20: Decision Tree Learning

So: we can measure the disorder. What’s left:

– We want to measure how much the disorder of a set would be reduced by knowing the value of a particular attribute.

Page 21: Decision Tree Learning

The Information Gain measures the expected reduction in entropy due to splitting on an attribute A

$$Gain(S, A) = \mathrm{Entropy}(S) - \sum_{v \in \mathrm{Values}(A)} \frac{|S_v|}{|S|}\,\mathrm{Entropy}(S_v)$$

The average disorder is just the weighted sum of the disorders in the branches (subsets) created by the values of A.

We want:
– a large Gain
– which is the same as: a small average disorder created
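Read as code, the formula is just "entropy of S minus the weighted entropy of each subset S_v". The sketch below is my own illustration under an assumed data layout (each example is a tuple of attribute values with the class label last); it is not code from the lecture.

```python
# Illustrative sketch (not from the slides): information gain for one attribute.
from collections import Counter
from math import log2

def entropy(labels):
    """Disorder of a collection of class labels."""
    n = len(labels)
    return -sum((c / n) * log2(c / n) for c in Counter(labels).values()) if n else 0.0

def gain(rows, col):
    """Gain(S, A), where each row is (attribute values..., class label) and
    col is the index of the attribute A being tested."""
    avg_disorder = 0.0
    for v in {row[col] for row in rows}:              # one subset S_v per value of A
        subset = [row[-1] for row in rows if row[col] == v]
        avg_disorder += len(subset) / len(rows) * entropy(subset)
    return entropy([row[-1] for row in rows]) - avg_disorder
```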

Page 22: Decision Tree Learning

Back to the beach: calculate the Average Disorder associated with Hair Colour

$$\frac{|S_{blonde}|}{|S|}\,D(S_{blonde}) + \frac{|S_{red}|}{|S|}\,D(S_{red}) + \frac{|S_{brown}|}{|S|}\,D(S_{brown})$$

[Diagram: splitting on Hair colour sends blonde → {Sarah, Annie, Dana, Katie}, red → {Emily}, brown → {Alex, Pete, John}.]

Page 23: Decision Tree Learning

…Calculating the Disorder of the “blondes”

The first term of the sum:

$$\frac{|S_{blonde}|}{|S|}\,D(S_{blonde}) = \frac{4}{8}\,D(\{\text{“Sarah”,“Annie”,“Dana”,“Katie”}\}) = \frac{4}{8}\,D(2,2) = 0.5 \times 1 = 0.5$$

Page 24: Decision Tree Learning

…Calculating the disorder of the others

The second and third terms of the sum:
$S_{red}$ = {“Emily”}
$S_{brown}$ = {“Alex”, “Pete”, “John”}

These are both 0 because within each set all the examples have the same class

So the avg disorder created when splitting on ‘hair colour’ is 0.5+0+0=0.5

Page 25: Decision Tree Learning

Which decision variable minimises the disorder?

Test     Avg Disorder
Hair     0.5   – this is what we just computed
Height   0.69
Weight   0.94
Lotion   0.61

(These are the avg disorders of the other attributes, computed in the same way.)

Which decision variable maximises the Info Gain then?
Remember: it’s the one which minimises the avg disorder (see slide 21 for a memory refresh).
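The whole table can be reproduced with a short script. The following sketch (again my own illustration, repeating the entropy helper from the earlier sketch so that it runs on its own) computes the average disorder of each attribute on the sunburn data; it should print approximately 0.50, 0.69, 0.94 and 0.61.

```python
# Illustrative sketch (not from the slides): average disorder per attribute.
from collections import Counter
from math import log2

def entropy(labels):
    n = len(labels)
    return -sum((c / n) * log2(c / n) for c in Counter(labels).values()) if n else 0.0

# (name, hair, height, weight, lotion, sunburned) -- the table from slide 6
data = [
    ("Sarah", "blonde", "average", "light",   "no",  True),
    ("Dana",  "blonde", "tall",    "average", "yes", False),
    ("Alex",  "brown",  "short",   "average", "yes", False),
    ("Annie", "blonde", "short",   "average", "no",  True),
    ("Emily", "red",    "average", "heavy",   "no",  True),
    ("Pete",  "brown",  "tall",    "heavy",   "no",  False),
    ("John",  "brown",  "average", "heavy",   "no",  False),
    ("Kate",  "blonde", "short",   "light",   "yes", False),
]

def avg_disorder(rows, col):
    """Weighted sum of the disorders of the subsets created by one attribute column."""
    total = 0.0
    for v in {r[col] for r in rows}:
        sub = [r[-1] for r in rows if r[col] == v]
        total += len(sub) / len(rows) * entropy(sub)
    return total

for name, col in [("hair", 1), ("height", 2), ("weight", 3), ("lotion", 4)]:
    print(f"{name:7s}{avg_disorder(data, col):.2f}")
```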

Page 26: Decision Tree Learning

So what is the best decision tree?

[Partial tree: is_sunburned. The root tests Hair colour: blonde → {Sarah, Annie, Dana, Katie} (still mixed, marked “?”, needs a further split); red → {Emily}; brown → {Alex, Pete, John}.]

Page 27: Decision Tree Learning


ID3 algorithm

Greedy search in the hypothesis space
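"Greedy search in the hypothesis space" means: at each node ID3 picks the attribute with the largest Information Gain (equivalently, the smallest average disorder), splits the examples on it, and recurses until a subset is pure or no attributes remain. The sketch below is my own rendering of that loop, not the lecture's pseudocode; the row layout, the majority-vote tie-break and the nested-dict tree encoding are illustrative choices.

```python
# Illustrative ID3 sketch (not from the slides): greedy tree growing.
from collections import Counter
from math import log2

def entropy(labels):
    n = len(labels)
    return -sum((c / n) * log2(c / n) for c in Counter(labels).values()) if n else 0.0

def id3(rows, attributes):
    """rows: tuples of attribute values with the class label last.
    attributes: dict mapping attribute name -> column index.
    Returns a class label (leaf) or a nested dict {attribute: {value: subtree}}."""
    labels = [row[-1] for row in rows]
    if len(set(labels)) == 1:                       # pure node -> leaf
        return labels[0]
    if not attributes:                              # no attributes left -> majority class
        return Counter(labels).most_common(1)[0][0]

    def avg_disorder(col):
        total = 0.0
        for v in {r[col] for r in rows}:
            sub = [r[-1] for r in rows if r[col] == v]
            total += len(sub) / len(rows) * entropy(sub)
        return total

    # greedy step: the attribute with the smallest average disorder maximises the gain
    best = min(attributes, key=lambda a: avg_disorder(attributes[a]))
    col = attributes[best]
    rest = {a: c for a, c in attributes.items() if a != best}
    return {best: {v: id3([r for r in rows if r[col] == v], rest)
                   for v in {r[col] for r in rows}}}
```

Run on the sunburn data from slide 6, this sketch splits on hair colour first and then, within the blondes, on lotion used, i.e. it grows Decision Tree 3.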

Page 28: Decision Tree Learning

Is this all? Is it really that simple?

Of course not… Where do we stop growing the tree? What if there are also noisy (mislabelled) examples in the data set?

Page 29: Decision Tree Learning


Overfitting in Decision Tree Learning

Page 30: Decision Tree Learning

Overfitting

Consider the error of hypothesis h over:
– Training data: error_train(h)
– The whole data set (new data as well): error_D(h)

If there is another hypothesis h’ such that error_train(h) < error_train(h’) and error_D(h) > error_D(h’), then we say that hypothesis h overfits the training data.

Page 31: Decision Tree Learning

How can we avoid overfitting?

Split the data into a training set & a validation set

Train on the training set, and stop growing the tree when further splitting deteriorates performance on the validation set

Or: grow the full tree first and then post-prune

What if data is limited?
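One common way to realise "grow the full tree first and then post-prune" is cost-complexity pruning selected on a held-out validation set. The sketch below uses scikit-learn purely as an illustration; the library choice, the 70/30 split and the function name `prune_with_validation` are my assumptions, not something the slides prescribe.

```python
# Illustrative sketch (not from the slides): post-pruning chosen on a validation set.
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

def prune_with_validation(X, y, random_state=0):
    """Grow trees at increasing pruning strengths; keep the one that scores best
    on a held-out validation set."""
    X_train, X_val, y_train, y_val = train_test_split(
        X, y, test_size=0.3, random_state=random_state)

    # candidate pruning levels from the cost-complexity pruning path of the full tree
    path = DecisionTreeClassifier(random_state=random_state).cost_complexity_pruning_path(
        X_train, y_train)

    best_tree, best_score = None, -1.0
    for alpha in path.ccp_alphas:
        tree = DecisionTreeClassifier(ccp_alpha=max(alpha, 0.0), random_state=random_state)
        tree.fit(X_train, y_train)
        score = tree.score(X_val, y_val)        # validation accuracy
        if score > best_score:
            best_tree, best_score = tree, score
    return best_tree
```

When the data is too limited for a separate validation set, the same selection is usually done with cross-validation instead.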

Page 32: Decision Tree Learning


…looks a bit better now

Page 33: Decision Tree Learning

Summary

– Decision Tree Representation
– Entropy, Information Gain
– ID3 learning algorithm
– Overfitting and how to avoid it

Page 34: Decision Tree Learning


When to consider Decision Trees?

If data is described by a finite number of attributes, each having a (finite) number of possible values

The target function is discrete valued (i.e. classification problem)

Possibly noisy data
Possibly missing values

E.g.:
– Medical diagnosis
– Equipment diagnosis
– Credit risk analysis
– etc.

