Artificial intelligence
Decision trees
PRISM - Nicolas Sutton-Charani
18/01/2021
Plan
1. Introduction
2. Use of decision trees
   2.1 Prediction
   2.2 Interpretability: descriptive data analysis
3. Learning of decision trees
   3.1 Purity criteria
   3.2 Stopping criteria
   3.3 Learning algorithm
4. Pruning of decision trees
   4.1 Cost-complexity trade-off
5. Extension: random forest
Introduction
What is a decision tree? → supervised learning

[Figure: a decision tree; internal nodes test attributes (attribute J1, ..., attribute J4), branches carry attribute values, and leaves give label predictions]
A little history
⚠ machine learning (or data mining) decision trees ≠ decision theory decision trees
Types of decision trees
type of class label
▶ numerical → regression tree
▶ nominal → classification tree

type of algorithm (→ structure)
▶ CART: statistics, binary tree
▶ C4.5: computer science, small tree
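As a quick illustration (a sketch assuming scikit-learn, which implements CART-style binary trees; not part of the original slides), the two label types map onto two different estimators:

```python
# Sketch (assumes scikit-learn): CART-style trees for the two types of class label.
from sklearn.datasets import load_diabetes, load_iris
from sklearn.tree import DecisionTreeClassifier, DecisionTreeRegressor

# Nominal class label -> classification tree
X_c, y_c = load_iris(return_X_y=True)
clf = DecisionTreeClassifier(criterion="gini", random_state=0).fit(X_c, y_c)

# Numerical class label -> regression tree
X_r, y_r = load_diabetes(return_X_y=True)
reg = DecisionTreeRegressor(random_state=0).fit(X_r, y_r)

print(clf.predict(X_c[:3]), reg.predict(X_r[:3]))
```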
Use of decision trees

Prediction
Classification trees (example figures):
▶ Will the badminton match take place?
▶ What fruit is it?
▶ Will he/she come to my party?
▶ Will they wait?
▶ Who will win the US presidential election?

Regression trees (example figure): What grade will a student get (given their homework average grade)?
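To make the prediction step concrete, here is a tiny hand-written sketch in the spirit of the badminton example; the attributes, thresholds, and labels are hypothetical, since the slide's tree only exists as a figure.

```python
# Hypothetical decision tree for "will the badminton match take place?" (illustration only).
def will_match_take_place(outlook: str, humidity: float, windy: bool) -> bool:
    # Each nested test is an internal node; each return is a leaf's label prediction.
    if outlook == "sunny":
        return humidity <= 70          # dry enough -> the match takes place
    elif outlook == "overcast":
        return True                    # always play
    else:                              # "rain"
        return not windy               # play only if there is no wind

print(will_match_take_place("sunny", humidity=65, windy=False))  # True
print(will_match_take_place("rain", humidity=80, windy=True))    # False
```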
Interpretability: descriptive data analysis
Data analysis tool
Trees are very interpretable: they partition the attribute space
→ a tree can be summarized by its leaves, which define a mixture of distributions
→ a wonderful collaboration tool with domain experts
⚠ INSTABILITY ← overfitting
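As a sketch of this descriptive use (assuming scikit-learn; the dataset and depth are arbitrary), a fitted tree can be read back as a small set of rules:

```python
# Sketch (assumes scikit-learn): a shallow tree printed as human-readable rules.
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier, export_text

data = load_iris()
tree = DecisionTreeClassifier(max_depth=2, random_state=0).fit(data.data, data.target)

# Each root-to-leaf path is one interpretable rule over the attribute space.
print(export_text(tree, feature_names=list(data.feature_names)))
```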
Learning of decision trees
Formalism
Learning dataset (supervised learning):
(x_1, y_1), ..., (x_N, y_N), i.e. a table whose i-th row contains the attribute values x_i^1, ..., x_i^J and the label y_i; the samples are assumed to be i.i.d.

▶ Attributes X = (X^1, ..., X^J) ∈ 𝒳 = 𝒳^1 × ... × 𝒳^J
▶ The spaces 𝒳^j can be categorical or numerical
▶ Class label Y ∈ Ω = {ω_1, ..., ω_K} (real-valued for regression)

Tree
P_H = {t_1, ..., t_H} (the leaves) and π_h = P(t_h) ≈ |t_h| / N, with |t_h| = #{i : x_i ∈ t_h}
Recursive partitioning

[Figures: the recursive partitioning of the attribute space, shown step by step]
Learning principle
▶ Start with the whole dataset in the initial node
▶ Choose the best splits (on attributes) in order to get pure leaves

Classification trees
purity = homogeneity in terms of class labels
▶ CART → Gini impurity: i(t_h) = Σ_{k=1..K} p_k (1 − p_k)
▶ ID3, C4.5 → Shannon entropy: i(t_h) = −Σ_{k=1..K} p_k log2(p_k)
with p_k = P(Y = ω_k | t_h)

Regression trees
purity = low variance of the class labels
→ i(t_h) = Var(Y | t_h) = (1 / |t_h|) Σ_{x_i ∈ t_h} (y_i − E(Y | t_h))², with E(Y | t_h) = (1 / |t_h|) Σ_{x_i ∈ t_h} y_i
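A minimal NumPy sketch of these three impurity measures (not the lecture's code; the function names are ours):

```python
import numpy as np

def gini(labels):
    """Gini impurity i(t) = sum_k p_k * (1 - p_k) of the labels falling in a node."""
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    return float(np.sum(p * (1 - p)))

def shannon_entropy(labels):
    """Shannon entropy i(t) = -sum_k p_k * log2(p_k)."""
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    return float(-np.sum(p * np.log2(p)))

def node_variance(values):
    """Regression impurity: variance of the labels contained in the node."""
    values = np.asarray(values, dtype=float)
    return float(np.mean((values - values.mean()) ** 2))

print(gini(["yes", "yes", "no"]), shannon_entropy(["yes", "yes", "no"]), node_variance([1.0, 2.0, 4.0]))
```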
Impurity measures

[Figure: comparison of the impurity measures]
Purity criteria

[Figure: a leaf t_h is split on an attribute into a left child t_L and a right child t_R]

Impurity measure + tree structure → criterion

CART, ID3: purity gain → Δi = i(t_h) − π_L · i(t_L) − π_R · i(t_R)
C4.5: information gain ratio → IGR = Δi / H(π_L, π_R)

Regression trees
CART: variance minimisation → Δi = i(t_h) − π_L · i(t_L) − π_R · i(t_R)
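A small sketch of the CART purity gain for one candidate split "X^j ≤ threshold" (assumes NumPy and the Gini impurity; the data are toy values):

```python
import numpy as np

def gini(labels):
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    return np.sum(p * (1 - p))

def purity_gain(x, y, threshold):
    """Delta_i = i(t_h) - pi_L * i(t_L) - pi_R * i(t_R) for the split 'x <= threshold'."""
    left, right = y[x <= threshold], y[x > threshold]
    pi_left, pi_right = len(left) / len(y), len(right) / len(y)
    return gini(y) - pi_left * gini(left) - pi_right * gini(right)

x = np.array([1.0, 2.0, 3.0, 4.0, 5.0, 6.0])
y = np.array(["no", "no", "no", "yes", "yes", "yes"])
print(purity_gain(x, y, threshold=3.5))  # a perfect split: gain = i(t_h) = 0.5
```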
Stopping criteria (pre-pruning)

For all leaves t_h, h = 1, ..., H, and their potential children:
▶ leaf purity: ∃ k ∈ {1, ..., K} such that p_k = 1
▶ leaf and children sizes: |t_h| ≤ minLeafSize
▶ leaf and children weights: π_h = |t_h| / |t_0| ≤ minLeafProba
▶ number of leaves: H ≥ maxNumberLeaves
▶ tree depth: depth(P_H) ≥ maxDepth
▶ purity gain: Δi ≤ minPurityGain
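For reference, scikit-learn's CART implementation exposes pre-pruning criteria of this kind as constructor hyperparameters; a sketch of the (approximate) correspondence, with arbitrary values:

```python
from sklearn.tree import DecisionTreeClassifier

# Each argument roughly corresponds to one of the stopping criteria above.
tree = DecisionTreeClassifier(
    min_samples_leaf=5,             # leaf / children sizes   (minLeafSize)
    min_weight_fraction_leaf=0.01,  # leaf / children weights (minLeafProba)
    max_leaf_nodes=20,              # number of leaves        (maxNumberLeaves)
    max_depth=6,                    # tree depth              (maxDepth)
    min_impurity_decrease=1e-3,     # purity gain             (minPurityGain)
)
```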
Learning algorithm

Result: learnt tree
Start with all the learning data in an initial node (a single leaf);
while the stopping criteria are not verified for all leaves do
    for each splittable leaf do
        compute the purity gains obtained from all possible splits;
    end
    SPLIT: select the split achieving the maximum purity gain;
end
prune the obtained tree;

→ recursive partitioning
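A compact, self-contained sketch of this greedy algorithm (binary splits on numerical attributes, Gini impurity, no pruning step); it is an illustration under those assumptions, not the lecture's reference implementation:

```python
import numpy as np

def gini(y):
    _, counts = np.unique(y, return_counts=True)
    p = counts / counts.sum()
    return np.sum(p * (1 - p))

def majority(y):
    values, counts = np.unique(y, return_counts=True)
    return values[np.argmax(counts)]

def best_split(X, y):
    """Best (purity gain, attribute index, threshold) over all splits 'X^j <= threshold'."""
    best = (0.0, None, None)
    for j in range(X.shape[1]):
        for thr in np.unique(X[:, j])[:-1]:  # keeps both children non-empty
            left = X[:, j] <= thr
            gain = gini(y) - left.mean() * gini(y[left]) - (~left).mean() * gini(y[~left])
            if gain > best[0]:
                best = (gain, j, thr)
    return best

def grow(X, y, depth=0, max_depth=3):
    gain, j, thr = best_split(X, y)
    # Stopping criteria: pure leaf, no useful split, or maximum depth reached.
    if gini(y) == 0 or j is None or depth == max_depth:
        return {"leaf": True, "prediction": majority(y)}
    left = X[:, j] <= thr
    return {"leaf": False, "attribute": j, "threshold": thr,
            "left": grow(X[left], y[left], depth + 1, max_depth),
            "right": grow(X[~left], y[~left], depth + 1, max_depth)}

def predict(tree, x):
    while not tree["leaf"]:
        tree = tree["left"] if x[tree["attribute"]] <= tree["threshold"] else tree["right"]
    return tree["prediction"]

X = np.array([[1.0], [2.0], [3.0], [4.0]])
y = np.array(["no", "no", "yes", "yes"])
print(predict(grow(X, y), np.array([3.5])))  # -> "yes"
```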
ID3 worked example (figures)
▶ Training examples – [9+, 5−]
▶ Selecting the next attribute
▶ Best attribute – Outlook
▶ S_sunny (the subset of examples in the Outlook = Sunny branch)
▶ Results
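The starting point of this worked example, the [9+, 5−] class counts, fixes the entropy of the root node; a quick check with the standard formula (the figures themselves are not reproduced here):

```python
from math import log2

# Root node of the ID3 example: 9 positive and 5 negative training examples.
p_pos, p_neg = 9 / 14, 5 / 14
root_entropy = -p_pos * log2(p_pos) - p_neg * log2(p_neg)
print(round(root_entropy, 3))  # ~0.940
```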
Pruning of decision trees
Overfitting

[Figures: overfitting illustration]

Remark: decision trees do not need variable selection or dimension reduction (in terms of accuracy).
Cost-complexity trade-off
Cost-Complexity Pruning

The idea
▶ trade-off between predictive efficiency and complexity
▶ find a subtree that fulfills this trade-off

Metrics
▶ 'Err' ← misclassification rate or MSE
▶ criterion: R_α = Err + α · H (H = number of leaves)

Steps
▶ find a useful sequence of nested subtrees
▶ choose the right subtree
Sequence of subtrees creation

Result: a sequence of trees that are all subtrees of T_0: T_0 ⊃ T_1 ⊃ T_2 ⊃ T_3 ⊃ ... ⊃ P_1 (the initial node)
Learn the biggest tree T_s = T_0 := P_Hmax, obtained for α_0 = 0 (s = 0);
while T_s ≠ P_1 do
    T_{s+1} = argmin_{t ∈ subtrees(T_s)} [ R_{α_s}(t) − R_{α_s}(T_s) ];
    α_{s+1} = R_{α_s}(T_{s+1}) − R_{α_s}(T_s);
end

We get two bijective sets {T_0, ..., T_S} and {α_0, ..., α_S} (with T_S = P_1)

Selection: T_{s*} = argmin_{T_s ∈ {T_0, ..., T_S}} Err(T_s) ← pruning set or cross-validation
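scikit-learn implements this pruning scheme; a sketch of computing the α sequence and then selecting a subtree by cross-validation (the dataset and CV settings are arbitrary choices):

```python
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

X, y = load_breast_cancer(return_X_y=True)

# Sequence of alphas, one per nested subtree of the fully grown tree.
path = DecisionTreeClassifier(random_state=0).cost_complexity_pruning_path(X, y)

# Selection step: keep the subtree (i.e. the alpha) with the best cross-validated accuracy.
scores = [(cross_val_score(DecisionTreeClassifier(ccp_alpha=a, random_state=0), X, y, cv=5).mean(), a)
          for a in path.ccp_alphas]
best_score, best_alpha = max(scores)
print(best_alpha, best_score)
```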
Figure – Sequence of nested subtrees
Here, α_1 < α_2 ⟹ T − T_1 ⊂ T − T_2 (a larger α prunes away a larger part of the tree)
Extension: random forest
Motivation
▶ trees instability
▶ bias-variance trade-off

Averaging reduces variance:
Var(X̄) = Var(X) / N (for independent predictions)

→ average models to reduce model variance

One problem:
- only one training set
- where do multiple models come from?
Bagging: Bootstrap Aggregation

▶ Tin Kam Ho (1995) → Leo Breiman (2001)
▶ Take repeated bootstrap samples from the training set
▶ Bootstrap sampling: given a training set D containing N examples, draw N examples at random with replacement from D
▶ Bagging:
  - create B bootstrap samples D_1, ..., D_B
  - train a distinct classifier on each D_b
  - classify a new instance by majority vote / averaging / aggregating the predictions
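A minimal bagging sketch on top of decision trees (assumes NumPy and scikit-learn; B and the dataset are arbitrary):

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)
rng = np.random.default_rng(0)
B = 25

# Train one tree per bootstrap sample D_b (N draws with replacement from the training set).
forest = []
for _ in range(B):
    idx = rng.integers(0, len(X), size=len(X))
    forest.append(DecisionTreeClassifier().fit(X[idx], y[idx]))

# Classify new instances by majority vote over the B trees.
votes = np.stack([tree.predict(X[:5]) for tree in forest])  # shape (B, 5)
majority_vote = np.apply_along_axis(lambda col: np.bincount(col).argmax(), 0, votes)
print(majority_vote)
```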
References

* L. Breiman, J. Friedman, R. A. Olshen, and C. J. Stone, Classification and Regression Trees, 1984.
* J. R. Quinlan, "Induction of decision trees," Machine Learning, vol. 1, pp. 81–106, 1986.
* L. Breiman, "Random forests," Machine Learning, vol. 45, pp. 5–32, 2001.
* G. Biau, L. Devroye, and G. Lugosi, "Consistency of random forests and other averaging classifiers," Journal of Machine Learning Research, vol. 9, pp. 2015–2033, 2008.