Decision Trees - BIU
Tirgul 5
Page 1: Decision Trees - BIU

Decision Trees
Tirgul 5

Page 2: Decision Trees - BIU

Using Decision Trees

• It could be difficult to decide which pet is right for you.

• We’ll find a nice algorithm to help us decide what to choose without having to think about it.

2

Page 3: Decision Trees - BIU

Using Decision Trees

• Another example: the tree helps a student decide what to do in the evening.

• Features: <Party? (Yes, No), Deadline (Urgent, Near, None), Lazy? (Yes, No)>

• Labels: Party / Study / TV / Pub

3

Page 4: Decision Trees - BIU

Definition

• A decision tree is a flowchart-like structure which provides a useful way to describe a hypothesis ℎ from a domain 𝒳 to a label set 𝒴 = {0, 1, …, k}.

• Given 𝑥 ∈ 𝒳, the prediction ℎ(𝑥) of a decision tree ℎ corresponds to a path from the root of the tree to a leaf.

• Most of the time we do not distinguish between the representation (the tree) and the hypothesis.

• An internal node corresponds to a “question”.

• A branch corresponds to an “answer”.

• A leaf corresponds to a label.

4

Page 5: Decision Trees - BIU

Should I study?

• Equivalent to:

• (Party == No ∧ Deadline == Urgent) ∨ (Party == No ∧ Deadline == Near ∧ Lazy == No)

5
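As a sanity check, the same predicate can be written directly in Python. This is only a sketch: the last clause is truncated on the slide and is assumed here to end with Lazy == No, matching the student tree on the earlier slide.

# The "Study" condition above as a Python predicate (assuming the truncated
# clause is Lazy == "No": a non-lazy student with a near deadline studies).
def should_study(party, deadline, lazy):
    return ((party == "No" and deadline == "Urgent")
            or (party == "No" and deadline == "Near" and lazy == "No"))

print(should_study("No", "Urgent", "Yes"))   # True
print(should_study("Yes", "None", "No"))     # False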

Page 6: Decision Trees - BIU

Constructing the Tree

• Features: Party, Deadline, Lazy.

• Based on these features, how do we construct the tree?

• The DT algorithm uses the following principle: build the tree greedily; starting at the root, choose the most informative feature at each step.

6

Page 7: Decision Trees - BIU

Constructing the Tree

• “Informative” features? Choosing which feature to use next in the decision tree can be thought of as playing the game ‘20 Questions’.

• At each stage, you choose a question that gives you the most information given what you know already.

• Thus, you would ask ‘Is it an animal?’ before you ask ‘Is it a cat?’.

7

Page 8: Decision Trees - BIU

Constructing the Tree

• “20 Questions” example.

(Figure: an Akinator game; the player's character on one side, Akinator's questions on the other.)

8

Page 9: Decision Trees - BIU

Constructing the Tree

• The idea: quantify how much information is provided.

• Mathematically: Information Theory

9

Page 10: Decision Trees - BIU

Pivot example:

10

Page 11: Decision Trees - BIU

Quick Aside: Information Theory

Page 12: Decision Trees - BIU

Entropy

• S is a sample of training examples.

• p+ is the proportion of positive examples in S

• p- is the proportion of negative examples in S

• Entropy measures the impurity of S:

• Entropy(S) = - p+ log2 p+ - p- log2 p-

• The smaller the better

• (We define 0 · log 0 = 0.)

Entropy(p) = −Σᵢ pᵢ log₂ pᵢ

12
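A minimal Python helper for this definition (not from the slides), using the convention 0 · log 0 = 0:

import math

def entropy(probs):
    # Entropy of a distribution given as a list of probabilities;
    # the convention 0 * log2(0) = 0 is handled by skipping zero entries.
    return -sum(p * math.log2(p) for p in probs if p > 0)

print(entropy([0.5, 0.5]))   # 1.0 (fair coin: maximal uncertainty)
print(entropy([1.0, 0.0]))   # 0.0 (certain outcome; printed as -0.0)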

Page 13: Decision Trees - BIU

Entropy

• Generally, entropy refers to disorder or uncertainty.

• Entropy = 0 if outcome is certain.

• E.g., consider a fair coin toss: the probability of heads equals the probability of tails.

• The entropy of the coin toss is as high as it could be.

• This is because there is no way to predict the outcome of the coin toss ahead of time: the best we can do is predict that the coin will come up heads, and our prediction will be correct with probability 1/2.

13

Page 14: Decision Trees - BIU

Entropy: Example

Entropy(p) = −Σᵢ pᵢ log₂ pᵢ

The dataset: 14 examples, 9 positive and 5 negative.

Entropy(S) = −p("yes") log₂ p("yes") − p("no") log₂ p("no")
           = −(9/14) log₂(9/14) − (5/14) log₂(5/14)
           = 0.409 + 0.530 = 0.939

14
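The same number can be checked with a few lines of Python; the small gap comes from the slide rounding the two terms before summing.

import math

counts = [9, 5]                     # 9 positive and 5 negative examples
total = sum(counts)
h = -sum(c / total * math.log2(c / total) for c in counts)
print(round(h, 3))                  # 0.94 (the slide's 0.409 + 0.530 = 0.939 uses truncated terms)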

Page 15: Decision Trees - BIU

Information Gain

• Important Idea: find how much the entropy of the whole training set would decrease if we choose each particular feature for the next classification step.

• Called “Information Gain”: defined as the entropy of the whole set minus the entropy when a particular feature is chosen.

15

Page 16: Decision Trees - BIU

Information Gain

• The information gain is the expected reduction in entropy caused by partitioning the examples with respect to an attribute.

• Given S is the set of examples (at the current node), A the attribute, and Sv the subset of S for which attribute A has value v:

• IG(S, A) = Entropy(S) − Σ_{v ∈ Values(A)} (|S_v| / |S|) · Entropy(S_v)

• That is, current entropy minus new entropy

16
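A sketch of this formula in Python (the helper names are mine, not from the slides; each split is described by a list of class counts per attribute value):

import math

def entropy_from_counts(counts):
    # Entropy of a node described by class counts, e.g. [9, 5].
    total = sum(counts)
    return -sum(c / total * math.log2(c / total) for c in counts if c > 0)

def information_gain(parent_counts, child_counts_per_value):
    # IG(S, A) = Entropy(S) - sum_v |S_v|/|S| * Entropy(S_v)
    total = sum(parent_counts)
    remainder = sum(sum(child) / total * entropy_from_counts(child)
                    for child in child_counts_per_value)
    return entropy_from_counts(parent_counts) - remainder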

Page 17: Decision Trees - BIU

Information Gain: Example

• Attribute A: Outlook. Values(A) = {sunny, overcast, rain}

Split by Outlook: Sunny [2+, 3−], Overcast [4+, 0−], Rain [3+, 2−]

17

Page 18: Decision Trees - BIU

18

Page 19: Decision Trees - BIU

Information Gain: Example

• Entropy(S) = 0.939

• Values(A = Outlook) = {sunny, overcast, rain}

• Need (|S_sunny|/|S|) · Entropy(S_sunny), (|S_overcast|/|S|) · Entropy(S_overcast), and (|S_rain|/|S|) · Entropy(S_rain)

IG(S, A) = Entropy(S) − Σ_{v ∈ Values(A)} (|S_v|/|S|) · Entropy(S_v)

Entropy(p) = −Σᵢ pᵢ log₂ pᵢ

Outlook: [9+, 5−] at the root; Sunny [2+, 3−], Overcast [4+, 0−], Rain [3+, 2−]

19

Page 20: Decision Trees - BIU

Information Gain: Example

• Entropy(S) = 0.939

• Values(A = Outlook) = {sunny, overcast, rain}

• (|S_sunny|/|S|) · Entropy(S_sunny) = (5/14) · Entropy(S_sunny)

• (|S_overcast|/|S|) · Entropy(S_overcast) = (4/14) · Entropy(S_overcast)

• (|S_rain|/|S|) · Entropy(S_rain) = (5/14) · Entropy(S_rain)

IG(S, A) = Entropy(S) − Σ_{v ∈ Values(A)} (|S_v|/|S|) · Entropy(S_v)

Entropy(p) = −Σᵢ pᵢ log₂ pᵢ

Outlook: [9+, 5−] at the root; Sunny [2+, 3−], Overcast [4+, 0−], Rain [3+, 2−]

20

Page 21: Decision Trees - BIU

Information Gain: Example

• Entropy(S) = 0.939

• Values(A = Outlook) = {sunny, overcast, rain}

• (|S_sunny|/|S|) · Entropy(S_sunny) = (5/14) · (−(2/5) log₂(2/5) − (3/5) log₂(3/5)) = (5/14) · 0.970

• (|S_overcast|/|S|) · Entropy(S_overcast) = (4/14) · (−(4/4) log₂(4/4) − (0/4) log₂(0/4)) = (4/14) · 0

• (|S_rain|/|S|) · Entropy(S_rain) = (5/14) · (−(3/5) log₂(3/5) − (2/5) log₂(2/5)) = (5/14) · 0.970

IG(S, A) = Entropy(S) − Σ_{v ∈ Values(A)} (|S_v|/|S|) · Entropy(S_v)

Entropy(p) = −Σᵢ pᵢ log₂ pᵢ

Outlook: [9+, 5−] at the root; Sunny [2+, 3−], Overcast [4+, 0−], Rain [3+, 2−]

21

Page 22: Decision Trees - BIU

Information Gain: Example

• Entropy(S) = 0.939

• Values(A = Outlook) = {sunny, overcast, rain}

• (|S_sunny|/|S|) · Entropy(S_sunny) = (5/14) · 0.970 = 0.346

• (|S_overcast|/|S|) · Entropy(S_overcast) = (4/14) · 0 = 0

• (|S_rain|/|S|) · Entropy(S_rain) = (5/14) · 0.970 = 0.346

IG(S, A) = Entropy(S) − Σ_{v ∈ Values(A)} (|S_v|/|S|) · Entropy(S_v)

Outlook: [9+, 5−] at the root; Sunny [2+, 3−], Overcast [4+, 0−], Rain [3+, 2−]

IG(S, A = Outlook) = 0.939 − (0.346 + 0 + 0.346) = 0.247

22
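This value can be reproduced from the split counts above with a short, self-contained check (names are illustrative):

import math

def H(counts):
    # Entropy from class counts, e.g. H([9, 5]).
    t = sum(counts)
    return -sum(c / t * math.log2(c / t) for c in counts if c > 0)

S = [9, 5]                                               # the whole dataset
splits = {"sunny": [2, 3], "overcast": [4, 0], "rain": [3, 2]}
ig = H(S) - sum(sum(c) / sum(S) * H(c) for c in splits.values())
print(round(ig, 3))                                      # 0.247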

Page 23: Decision Trees - BIU

Information Gain: Example

• Attribute A: Wind. Values(A) = {weak, strong}

Split by Wind: [9+, 5−] at the root; Weak [6+, 2−], Strong [3+, 3−]

23

Page 24: Decision Trees - BIU

Information Gain: Example

• Entropy(S) = 0.939

• Values(A = Wind) = {weak, strong}

• (|S_weak|/|S|) · Entropy(S_weak) = (8/14) · (−(6/8) log₂(6/8) − (2/8) log₂(2/8)) = (8/14) · 0.811

• (|S_strong|/|S|) · Entropy(S_strong) = (6/14) · (−(3/6) log₂(3/6) − (3/6) log₂(3/6)) = (6/14) · 1

IG(S, A) = Entropy(S) − Σ_{v ∈ Values(A)} (|S_v|/|S|) · Entropy(S_v)

Entropy(p) = −Σᵢ pᵢ log₂ pᵢ

Wind: [9+, 5−] at the root; Weak [6+, 2−], Strong [3+, 3−]

24

Page 25: Decision Trees - BIU

Information Gain: Example

• Entropy(S) = 0.939

• Values(A = Wind) = {weak, strong}

• (|S_weak|/|S|) · Entropy(S_weak) = (8/14) · 0.811 = 0.463

• (|S_strong|/|S|) · Entropy(S_strong) = (6/14) · 1 = 0.428

IG(S, A) = Entropy(S) − Σ_{v ∈ Values(A)} (|S_v|/|S|) · Entropy(S_v)

IG(S, A = Wind) = 0.939 − (0.463 + 0.428) = 0.048

25
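The same kind of check reproduces the Wind result:

import math

H = lambda c: -sum(x / sum(c) * math.log2(x / sum(c)) for x in c if x > 0)
ig_wind = H([9, 5]) - (8/14 * H([6, 2]) + 6/14 * H([3, 3]))
print(round(ig_wind, 3))   # 0.048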

Page 26: Decision Trees - BIU

Information Gain

• The smaller the value of Σ_v (|S_v|/|S|) · Entropy(S_v) is, the larger IG becomes.

• The ID3 algorithm computes this IG for each attribute and chooses the one that produces the highest value.
• Greedy.

IG(S, A) = Entropy(S) − Σ_{v ∈ Values(A)} (|S_v|/|S|) · Entropy(S_v)

(“Before”: Entropy(S). “After”: the weighted sum, where |S_v|/|S| is the probability of getting to the corresponding node.)

26

Page 27: Decision Trees - BIU

ID3 Algorithm (Artificial Intelligence: A Modern Approach)

27

Page 28: Decision Trees - BIU

ID3 Algorithm

• MajorityValue function: returns the label of the majority of the training examples in the current subtree.

• ChooseAttribute function: chooses the attribute that maximizes the Information Gain.

• (Measures other than IG could also be used.)

28
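The slides reference the ID3 pseudocode from Artificial Intelligence: A Modern Approach; the following is only an illustrative Python sketch of that structure (the data layout and function names are my own, not the book's exact pseudocode):

from collections import Counter
import math

def entropy(labels):
    # Entropy of a list of labels, e.g. ["yes", "no", "yes"].
    n = len(labels)
    return -sum(c / n * math.log2(c / n) for c in Counter(labels).values())

def information_gain(examples, labels, attr):
    # examples: list of dicts attribute -> value; labels: parallel list.
    n = len(labels)
    remainder = 0.0
    for v in set(ex[attr] for ex in examples):
        sub = [labels[i] for i, ex in enumerate(examples) if ex[attr] == v]
        remainder += len(sub) / n * entropy(sub)
    return entropy(labels) - remainder

def majority_value(labels):
    # Label of the majority of the training examples in the current subtree.
    return Counter(labels).most_common(1)[0][0]

def id3(examples, labels, attributes):
    if len(set(labels)) == 1:        # pure node -> leaf with that label
        return labels[0]
    if not attributes:               # no attributes left -> majority label
        return majority_value(labels)
    # Greedy step: choose the attribute with maximal information gain.
    best = max(attributes, key=lambda a: information_gain(examples, labels, a))
    tree = {best: {}}
    for v in set(ex[best] for ex in examples):
        idx = [i for i, ex in enumerate(examples) if ex[best] == v]
        tree[best][v] = id3([examples[i] for i in idx],
                            [labels[i] for i in idx],
                            [a for a in attributes if a != best])
    return tree

Calling id3 on the weather examples (each example a dict such as {"Outlook": "Sunny", "Wind": "Weak", ...}) returns a nested dict of the same shape as the trees drawn on the following slides.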

Page 29: Decision Trees - BIU

Back to the example

• IG(S, A = Outlook) = 0.247  (maximal value)

• IG(S, A = Wind) = 0.048

• IG(S, A = Temperature) = 0.028

• IG(S, A = Humidity) = 0.151

• The tree (for now): Outlook at the root, with one branch per value.

29

Page 30: Decision Trees - BIU

Decision Tree after first step:

(Figure: the dataset split by Outlook after the first step; each example is shown under its branch with its Temperature value.)

30

Page 31: Decision Trees - BIU

Decision Tree after first step:

(Figure: the same tree; the Overcast branch contains only positive examples, so it becomes a "yes" leaf.)

31

Page 32: Decision Trees - BIU

The second step:

Within the Sunny branch [2+, 3−], the candidate splits are:

• Wind: Weak [1+, 2−], Strong [1+, 1−]

• Temperature: Hot [0+, 2−], Mild [1+, 1−], Cool [1+, 0−]

• Humidity: High [0+, 3−], Normal [2+, 0−] — both children have entropy 0, so its IG is maximal.

32
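These numbers can be verified directly from the counts shown above: within the Sunny subset [2+, 3−], Humidity separates the examples perfectly, so its information gain equals the subset's entire entropy (about 0.97).

import math

H = lambda c: -sum(x / sum(c) * math.log2(x / sum(c)) for x in c if x > 0)

sunny = [2, 3]                                     # the Sunny branch [2+, 3-]
candidates = {
    "wind":        [[1, 2], [1, 1]],               # weak, strong
    "temperature": [[0, 2], [1, 1], [1, 0]],       # hot, mild, cool
    "humidity":    [[0, 3], [2, 0]],               # high, normal
}
for name, parts in candidates.items():
    ig = H(sunny) - sum(sum(p) / sum(sunny) * H(p) for p in parts)
    print(name, round(ig, 3))   # wind 0.02, temperature 0.571, humidity 0.971 (maximal)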

Page 33: Decision Trees - BIU

Decision Tree after second step:

33

Page 34: Decision Trees - BIU

Next…

34

Page 35: Decision Trees - BIU

Final DT:

35

Page 36: Decision Trees - BIU

Minimal Description Length

• The attribute we choose is the one with the highest information gain: we minimize the amount of information that is left.

• Thus, the algorithm is biased towards smaller trees.

• This is consistent with the well-known principle that short solutions are usually better than longer ones.

36

Page 37: Decision Trees - BIU

Minimal Description Length

• MDL: the shortest description of something, i.e., the most compressed one, is the best description.

• a.k.a. Occam’s Razor: among competing hypotheses, the one with the fewest assumptions should be selected.

37

Page 38: Decision Trees - BIU

The final tree:

Outlook
• Sunny → Humidity: High → False, Normal → True
• Overcast → True
• Rain → Wind: Weak → True, Strong → False

New data: <Rain, Mild, High, Weak>, true label: False

Prediction: True

Wait, what!?

38

Page 39: Decision Trees - BIU

Overfitting

• Overfitting happens when learning is performed for too long, or when the learner adjusts to very specific random features of the training data that have no causal relation to the target function.

39

Page 40: Decision Trees - BIU

Overfitting

40

Page 41: Decision Trees - BIU

Overfitting

• The performance on the training examples still increases while the performance on unseen data becomes worse.

41

Page 42: Decision Trees - BIU

42

Page 43: Decision Trees - BIU

Pruning

43

Page 44: Decision Trees - BIU

44

Page 45: Decision Trees - BIU

ID3 for real-valued features

• Until now, we assumed that the splitting rules are of the form 𝕀[xᵢ = 1].

• For real-valued features, use threshold-based splitting rules: 𝕀[xᵢ < θ].

* 𝕀(boolean expression) is the indicator function (equals 1 if the expression is true and 0 otherwise).

45
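One common way to implement the threshold rule 𝕀[xᵢ < θ] (a sketch, not taken from the slides) is to sort the feature values and evaluate the information gain at midpoints between consecutive distinct values:

import math

def entropy(labels):
    n = len(labels)
    counts = {}
    for y in labels:
        counts[y] = counts.get(y, 0) + 1
    return -sum(c / n * math.log2(c / n) for c in counts.values())

def best_threshold(values, labels):
    # Evaluate the split I[x < theta] at midpoints between consecutive
    # distinct sorted values; return (theta, information_gain) of the best one.
    pairs = sorted(zip(values, labels))
    base = entropy(labels)
    best = (None, 0.0)
    for (v1, _), (v2, _) in zip(pairs, pairs[1:]):
        if v1 == v2:
            continue
        theta = (v1 + v2) / 2
        left = [y for x, y in pairs if x < theta]
        right = [y for x, y in pairs if x >= theta]
        ig = base - (len(left) / len(pairs) * entropy(left)
                     + len(right) / len(pairs) * entropy(right))
        if ig > best[1]:
            best = (theta, ig)
    return best

print(best_threshold([1.0, 2.0, 3.0, 8.0, 9.0], ["no", "no", "no", "yes", "yes"]))
# (5.5, ~0.971): the split x < 5.5 separates the two labels perfectly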

Page 46: Decision Trees - BIU

Random forest

46

Page 47: Decision Trees - BIU

Random forest

• Collection of decision trees.

• Prediction: a majority vote over the predictions of the individual trees.

• Constructing the trees:
• Take a random subsample S’ (of size m’) from S, with replacement.
• Construct a sequence I₁, I₂, …, where each Iₜ is a subset of the attributes (of size k).
• The algorithm grows a DT (using ID3) based on the sample S’; at each splitting stage it chooses the feature that maximizes IG from Iₜ.

47
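A simplified sketch of this construction (it draws one random attribute subset per tree rather than a fresh Iₜ at every split, and assumes hypothetical build_tree / predict_tree callables such as the ID3 sketch earlier in these notes):

import random
from collections import Counter

def random_forest(examples, labels, attributes, build_tree,
                  n_trees=10, sample_size=None, k=2):
    # Grow n_trees decision trees, roughly as described on the slide:
    #  - draw a bootstrap sample S' of size m' from S, with replacement;
    #  - restrict each tree to a random attribute subset I_t of size k.
    # build_tree(examples, labels, attributes) is assumed to be an
    # ID3-style builder.
    m = sample_size or len(examples)
    forest = []
    for _ in range(n_trees):
        idx = [random.randrange(len(examples)) for _ in range(m)]   # with replacement
        sample_ex = [examples[i] for i in idx]
        sample_lb = [labels[i] for i in idx]
        attrs = random.sample(attributes, min(k, len(attributes)))  # I_t
        forest.append(build_tree(sample_ex, sample_lb, attrs))
    return forest

def forest_predict(forest, predict_tree, x):
    # Majority vote over the individual trees' predictions for example x.
    votes = [predict_tree(tree, x) for tree in forest]
    return Counter(votes).most_common(1)[0][0]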

Page 48: Decision Trees - BIU

Weka

• http://www.cs.waikato.ac.nz/ml/weka/

48

Page 49: Decision Trees - BIU

Summary

• Intro: Decision Trees

• Constructing the tree:
• Information Theory: Entropy, IG
• ID3 Algorithm
• MDL

• Overfitting:
• Pruning

• ID3 for real-valued features

• Random Forests

• Weka

49

