Arthur CHARPENTIER - Arbres de Classification
Classification Trees
Arthur Charpentier
http://freakonometrics.hypotheses.org/
April 2014
Séminaires en méthodes d'analyses quantitatives et qualitatives
The problem

Patients admitted to the emergency room after a myocardial infarction. For each patient we observe:
◦ heart rate (FRCAR)
◦ cardiac index (INCAR)
◦ systolic index (INSYS)
◦ diastolic pressure (PRDIA)
◦ pulmonary arterial pressure (PAPUL)
◦ ventricular pressure (PVENT)
◦ pulmonary resistance (REPUL)
◦ whether the patient died or survived

Source: Saporta, G. (2006).
Outline
• Introduction
◦ The classification problem
◦ A reminder on logistic regression
• Assessing a classification
◦ Errors, false positives, false negatives
◦ The ROC curve and related curves
• Classification trees
◦ Splitting criteria: Gini and entropy
◦ The CART method
◦ Robustification by bootstrap and random forests
Classification: modelling a 0/1 variable

[Figure: observed outcomes (DECES / SURVIE) plotted against each covariate: FRCAR, INCAR, INSYS, PAPUL, PVENT, REPUL]
Classification: modelling a 0/1 variable

[Figure: the same panels, outcome (DECES / SURVIE) against FRCAR, INCAR, INSYS, PAPUL, PVENT, REPUL]
Simple linear regression

E(Y|X = x) = β0 + β1x

[Figure: linear fit of the outcome against each covariate: FRCAR, INCAR, INSYS, PAPUL, PVENT, REPUL]
Simple logistic regression

E(Y|X = x) = P(Y = 1|X = x) = exp[β0 + β1x] / (1 + exp[β0 + β1x])

[Figure: logistic fit of the outcome against each covariate: FRCAR, INCAR, INSYS, PAPUL, PVENT, REPUL]
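A fit of this kind can be sketched by maximum likelihood with a few Newton-Raphson steps; the data below are simulated stand-ins for a MYOCARDE-like covariate and outcome, not the actual sample:

```python
import numpy as np

# Simple logistic regression fitted by Newton-Raphson (IRLS).
# Toy data: x is a single covariate, y a 0/1 outcome -- simulated
# stand-ins, NOT the actual MYOCARDE data.
rng = np.random.default_rng(0)
x = rng.uniform(10, 50, size=200)
p_true = 1 / (1 + np.exp(-(-6 + 0.25 * x)))
y = rng.binomial(1, p_true)

X = np.column_stack([np.ones_like(x), x])   # design matrix with intercept
beta = np.zeros(2)
for _ in range(25):                          # Newton-Raphson iterations
    eta = X @ beta
    p = 1 / (1 + np.exp(-eta))               # P(Y = 1 | x)
    W = p * (1 - p)                          # working weights
    grad = X.T @ (y - p)
    hess = X.T @ (X * W[:, None])
    beta = beta + np.linalg.solve(hess, grad)

print(beta)   # estimates of (beta0, beta1)
```

The estimated slope should land near the value 0.25 used to simulate the data.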
Multiple logistic regression

E(Y|X = x) = P(Y = 1|X = x) = exp[β0 + β1x1 + ··· + βkxk] / (1 + exp[β0 + β1x1 + ··· + βkxk])

[Figure: observations in the plane pulmonary resistance (REPUL) by ventricular pressure (PVENT)]
Multiple logistic regression

E(Y|X = x) = P(Y = 1|X = x) = exp[β0 + β1x1 + ··· + βkxk] / (1 + exp[β0 + β1x1 + ··· + βkxk])

[Figure: fitted survival probability surface over pulmonary resistance (REPUL) and ventricular pressure (PVENT)]
Multiple logistic regression

E(Y|X = x) = P(Y = 1|X = x) = exp[s(x1, ···, xk)] / (1 + exp[s(x1, ···, xk)])

[Figure: fitted survival probability surface over pulmonary resistance (REPUL) and ventricular pressure (PVENT), with a smooth score s]
How good is our model?

Classify with a 50% cutoff:
◦ if P(Y = 1|x) < 50%, predict death
◦ if P(Y = 1|x) > 50%, predict survival

            Ŷi = 0   Ŷi = 1
Yi = 0        24        5      29
Yi = 1         4       38      42
              28       43      71

[Figure: classification regions in the (REPUL, PVENT) plane at the 50% cutoff]
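Confusion tables like the one above can be computed mechanically from predicted probabilities; a minimal sketch with hypothetical scores (the counts come from the toy vectors below, not the MYOCARDE data):

```python
import numpy as np

# Confusion table for a probability classifier at cutoff s:
# predict class 1 when P(Y = 1 | x) > s.
# Rows: true class 0/1; columns: predicted class 0/1.
def confusion(y_true, p_hat, s):
    y_pred = (p_hat > s).astype(int)
    tn = int(np.sum((y_true == 0) & (y_pred == 0)))
    fp = int(np.sum((y_true == 0) & (y_pred == 1)))
    fn = int(np.sum((y_true == 1) & (y_pred == 0)))
    tp = int(np.sum((y_true == 1) & (y_pred == 1)))
    return np.array([[tn, fp], [fn, tp]])

# Hypothetical true labels and fitted probabilities.
y = np.array([0, 0, 0, 1, 1, 1, 1])
p = np.array([0.1, 0.4, 0.6, 0.3, 0.7, 0.8, 0.9])
print(confusion(y, p, 0.5))
```

Changing s moves counts between the two columns, exactly as on the 50%, 80% and 20% slides.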
How good is our model?

Classify with an 80% cutoff:
◦ if P(Y = 1|x) < 80%, predict death
◦ if P(Y = 1|x) > 80%, predict survival

            Ŷi = 0   Ŷi = 1
Yi = 0        26        3      29
Yi = 1        12       30      42
              38       33      71

[Figure: classification regions in the (REPUL, PVENT) plane at the 80% cutoff]
How good is our model?

Classify with a 20% cutoff:
◦ if P(Y = 1|x) < 20%, predict death
◦ if P(Y = 1|x) > 20%, predict survival

            Ŷi = 0   Ŷi = 1
Yi = 0        18       11      29
Yi = 1         2       40      42
              20       51      71

[Figure: classification regions in the (REPUL, PVENT) plane at the 20% cutoff]
How good is our model?

Impact of the cutoff, two extreme cases:

            Ŷi = 0   Ŷi = 1                    Ŷi = 0   Ŷi = 1
Yi = 0         0       29      29    Yi = 0      29        0      29
Yi = 1         0       42      42    Yi = 1      42        0      42
               0       71      71                71        0      71

[Figure: number of false negatives (FN) and of false positives (FP) as functions of the cutoff (%)]
How good is our model?

                 Ŷi = 0                  Ŷi = 1
Yi = 0    TN (true negative)     FP (false positive)    specificity, TNR (true negative rate)
Yi = 1    FN (false negative)    TP (true positive)     sensitivity, TPR (true positive rate)
                                 precision, PPV (positive predictive value)

TPR = TP / (TP + FN)   and   FPR = FP / (FP + TN)

with

TPR(s) = P(Ŷ(s) = 1 | Y = 1)   and   FPR(s) = P(Ŷ(s) = 1 | Y = 0)
How good is our model?

True positive rate:
TPR(s) = P(Ŷ(s) = 1 | Y = 1) = n_{Ŷ=1, Y=1} / n_{Y=1}

False positive rate:
FPR(s) = P(Ŷ(s) = 1 | Y = 0) = n_{Ŷ=1, Y=0} / n_{Y=0}

→ the sensitivity/specificity curve, also called the ROC curve (Receiver Operating Characteristic),
{(FPR(s), TPR(s)), s ∈ (0, 1)}
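The definitions above translate directly into code; a sketch with hypothetical scores, not the fitted model from the slides:

```python
import numpy as np

# Trace the ROC curve {(FPR(s), TPR(s)) : s in (0, 1)} for given scores.
def roc_points(y, p, thresholds):
    pts = []
    for s in thresholds:
        yhat = (p > s).astype(int)
        tpr = np.sum((yhat == 1) & (y == 1)) / np.sum(y == 1)
        fpr = np.sum((yhat == 1) & (y == 0)) / np.sum(y == 0)
        pts.append((fpr, tpr))
    return pts

# Hypothetical labels and probabilities.
y = np.array([0, 0, 0, 1, 1, 1, 1])
p = np.array([0.1, 0.4, 0.6, 0.3, 0.7, 0.8, 0.9])
pts = roc_points(y, p, [0.2, 0.5, 0.8])
print(pts)
```

Lowering s moves the point up and to the right along the curve; raising s moves it toward (0, 0).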
How good is our model?

The ROC curve is {(FPR(s), TPR(s)), s ∈ (0, 1)}

→ one curve per model.
How good is our model?

[Figure: sensitivity (%) against specificity (%), and the corresponding ROC curve (True Positive Rate against False Positive Rate), with cutoffs s = 0.1, ..., 0.9 marked along the curve]
Hierarchical Construction of Classification Trees

Classification = grouping individuals into a limited number of classes.
These classes are built step by step
→ group similar individuals together, separate individuals with different characteristics.
History: dates back to the 1960s; renewed interest followed the publication of Breiman et al. (1984).
The tool has become popular in machine learning.
Hierarchical Construction of Classification Trees

Top-down classification:
◦ select the explanatory variable most strongly related to the response Y
→ this gives a first split of the sample
◦ iterate within each class
→ each class should be as homogeneous as possible in Y.

Differences with logistic regression:
◦ the explanatory variables are used sequentially
◦ the output is displayed as a decision tree, i.e. a sequence of nodes
Hierarchical Construction of Classification Trees

The algorithm is built from:
◦ a criterion to choose the 'best' split of a branch
◦ a rule to decide whether a node is terminal, becoming a leaf
◦ a method to assign a value within each leaf.

[Figure: fitted tree. Root split on REPUL (p < 0.001) at 1093; left branch split on PVENT (p = 0.179) at 11.5, with leaves of n = 31 and n = 8; right branch split on REPUL (p = 0.341) at 1583, with leaves of n = 16 and n = 16; each leaf shows the SURVIE/DECES proportions]
Splitting the space: one explanatory variable

Y ∈ {0, 1} and X ∈ R: split at a threshold s,
X = A if X ≤ s, X = B if X > s

          X = A     X = B
          X ≤ s     X > s
Y = 0     nA,0      nB,0      n·,0
Y = 1     nA,1      nB,1      n·,1
          nA,·      nB,·      n

Gini index:

gini(Y|X) = −Σ_{x∈{A,B}} (nx,·/n) Σ_{y∈{0,1}} (nx,y/nx,·) (1 − nx,y/nx,·)
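The Gini criterion above can be evaluated directly from the counts; a minimal sketch on toy data:

```python
import numpy as np

# Gini criterion from the slide: for a cut X <= s vs X > s,
# gini(Y|X) = - sum_x (nx./n) sum_y (nxy/nx.) (1 - nxy/nx.)
# (negative by convention here, so a perfect split gives 0).
def gini_split(x, y, s):
    total = 0.0
    n = len(y)
    for side in (x <= s, x > s):
        nx = side.sum()
        if nx == 0:
            continue
        for cls in (0, 1):
            pxy = np.sum(y[side] == cls) / nx
            total += (nx / n) * pxy * (1 - pxy)
    return -total

# Toy data with a perfectly separating threshold between 3 and 10.
x = np.array([1.0, 2.0, 3.0, 10.0, 11.0, 12.0])
y = np.array([0, 0, 0, 1, 1, 1])
print(gini_split(x, y, 3.0))   # pure split
print(gini_split(x, y, 2.0))   # impure split
```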
Splitting the space: one explanatory variable

Y ∈ {0, 1} and X ∈ R: split at a threshold s,
X = A if X ≤ s, X = B if X > s

          X = A     X = B
          X ≤ s     X > s
Y = 0     nA,0      nB,0      n·,0
Y = 1     nA,1      nB,1      n·,1
          nA,·      nB,·      n

Entropy:

entropy(Y|X) = −Σ_{x∈{A,B}} (nx,·/n) Σ_{y∈{0,1}} (nx,y/nx,·) log(nx,y/nx,·)
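The same sketch works for the entropy criterion, with the convention 0 · log 0 = 0 for empty cells:

```python
import numpy as np

# Entropy criterion from the slide:
# entropy(Y|X) = - sum_x (nx./n) sum_y (nxy/nx.) log(nxy/nx.)
def entropy_split(x, y, s):
    total = 0.0
    n = len(y)
    for side in (x <= s, x > s):
        nx = side.sum()
        if nx == 0:
            continue
        for cls in (0, 1):
            p = np.sum(y[side] == cls) / nx
            if p > 0:                      # 0 * log 0 = 0 by convention
                total += (nx / n) * p * np.log(p)
    return -total

# Same toy data as for the Gini sketch.
x = np.array([1.0, 2.0, 3.0, 10.0, 11.0, 12.0])
y = np.array([0, 0, 0, 1, 1, 1])
print(entropy_split(x, y, 3.0))   # pure split -> 0
print(entropy_split(x, y, 2.0))   # impure split -> positive
```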
Splitting the space: one explanatory variable

Split and Gini index:

−Σ_{x∈{A,B}} (nx,·/n) Σ_{y∈{0,1}} (nx,y/nx,·) (1 − nx,y/nx,·)
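Choosing the cut point then amounts to scanning every candidate s and keeping the one with the best Gini index (closest to 0 with the slide's sign convention); toy INSYS-like data with a clean cut at 19, not the MYOCARDE sample:

```python
import numpy as np

# Gini index of the two-class split X <= s vs X > s, negative by
# convention, so the best threshold maximizes it (brings it to 0).
def gini_index(x, y, s):
    g, n = 0.0, len(y)
    for side in (x <= s, x > s):
        nx = side.sum()
        if nx == 0:
            continue
        for cls in (0, 1):
            p = np.sum(y[side] == cls) / nx
            g -= (nx / n) * p * (1 - p)
    return g

rng = np.random.default_rng(2)
x = rng.uniform(10, 50, 80)            # toy covariate, NOT real INSYS values
y = (x > 19).astype(int)               # outcome determined by a cut at 19
candidates = np.sort(x)[:-1]           # every observed value but the largest
best = max(candidates, key=lambda s: gini_index(x, y, s))
print(best, gini_index(x, y, best))
```

The scan recovers the separating threshold: the best candidate is the largest observed value below 19 and attains a Gini index of 0.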
Splitting the space: one explanatory variable

Fix s and look for a second split,
A = (−∞, s2], B = (s2, s], C = (s, ∞)

          X = A      X = B           X = C
          X ≤ s2     X ∈ (s2, s]     X > s
Y = 0     nA,0       nB,0        nC,0      n·,0
Y = 1     nA,1       nB,1        nC,1      n·,1
          nA,·       nB,·        nC,·      n

Split and Gini index:

−Σ_{x∈{A,B,C}} (nx,·/n) Σ_{y∈{0,1}} (nx,y/nx,·) (1 − nx,y/nx,·)

[Figure: outcome (DECES / SURVIE) against INSYS, and the Gini index as a function of the second cut point s2 ≤ s]
Splitting the space: one explanatory variable

Fix s and look for a second split,
A = (−∞, s], B = (s, s2], C = (s2, ∞)

          X = A     X = B           X = C
          X ≤ s     X ∈ (s, s2]     X > s2
Y = 0     nA,0      nB,0        nC,0      n·,0
Y = 1     nA,1      nB,1        nC,1      n·,1
          nA,·      nB,·        nC,·      n

Split and Gini index:

−Σ_{x∈{A,B,C}} (nx,·/n) Σ_{y∈{0,1}} (nx,y/nx,·) (1 − nx,y/nx,·)

[Figure: outcome (DECES / SURVIE) against INSYS, and the Gini index as a function of the second cut point s2 > s]
Sequential splitting: computational aspects

At each step, fix a node and split one of its classes in two
→ a tree structure.

An important computational advantage: if s can take n values,
{s1, s2, ···, sn}, e.g. {1/(n+1), 2/(n+1), ···, n/(n+1)},
there are n!/(n − k)! = n(n − 1)(n − 2) ··· (n − k + 1) partitions into k classes.

With the sequential method, only n + (n − 1) + ··· + (n − k) partitions are considered.

Example: n = 100 and k = 5: 9,034,502,400 possible partitions vs. 400 trees.
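The two counts can be checked directly (the 9,034,502,400 figure on the slide is n!/(n − k)! with n = 100 and k = 5):

```python
from math import factorial

# Exhaustive count: k ordered cut points chosen among n candidate values.
def exhaustive(n, k):
    return factorial(n) // factorial(n - k)

# Greedy count: one split at a time, only n + (n-1) + ... + (n-k)
# candidate splits are ever examined.
def greedy(n, k):
    return sum(n - j for j in range(k + 1))

print(exhaustive(100, 5))   # 9034502400, the figure on the slide
print(greedy(100, 5))       # a few hundred, versus billions
```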
Pruning the tree

Step 1: grow the tree by a recursive process of binary splits.

Step 2: prune the tree, removing branches that are nearly empty or poorly representative → this requires a pruning criterion (e.g. the gain in entropy).

[Figure: the full tree on the MYOCARDE data, with successive splits at 18.85, 21.55, 19.75, 28.25 and 31.6 on one covariate and leaves labelled DECES/SURVIE, and the pruned version]
Several quantitative variables

Y ∈ {0, 1} and X1, X2 ∈ R: split X1 at a threshold s,
X = A if X1 ≤ s, X = B if X1 > s

          X = A      X = B
          X1 ≤ s     X1 > s
Y = 0     nA,0       nB,0      n·,0
Y = 1     nA,1       nB,1      n·,1
          nA,·       nB,·      n

−Σ_{x∈{A,B}} (nx,·/n) Σ_{y∈{0,1}} (nx,y/nx,·) (1 − nx,y/nx,·)

[Figure: observations in the (REPUL, PVENT) plane, and the Gini index as a function of the cut point on REPUL]
Several quantitative variables

Y ∈ {0, 1} and X1, X2 ∈ R: split X2 at a threshold s,
X = A if X2 ≤ s, X = B if X2 > s

          X = A      X = B
          X2 ≤ s     X2 > s
Y = 0     nA,0       nB,0      n·,0
Y = 1     nA,1       nB,1      n·,1
          nA,·       nB,·      n

−Σ_{x∈{A,B}} (nx,·/n) Σ_{y∈{0,1}} (nx,y/nx,·) (1 − nx,y/nx,·)

[Figure: observations in the (REPUL, PVENT) plane, and the Gini index as a function of the cut point on PVENT]
Qualitative variables

X takes values in {a, b, c, d}.

Instead of building the tree by successive cuts, we can build it by successive groupings of categories:
{a, b, c, d}; {(a, b), c, d}; {(a, c), b, d}; {(a, d), b, c}; {(b, c), a, d}; {(b, d), a, c}; {(c, d), a, b};
{(b, c, a), d}; {(b, c, d), a}; {(b, c), (a, d)}; ...

[Figure: bar chart of the response rate for each category A, B, C, D]
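The candidate binary splits of the four levels into two groups can be enumerated; there are 2^(4−1) − 1 = 7 of them:

```python
from itertools import combinations

# All ways to split the levels {a, b, c, d} of a qualitative X into two
# non-empty groups: the candidate binary splits at a tree node.
levels = ["a", "b", "c", "d"]
splits = []
for r in range(1, len(levels)):
    for left in combinations(levels, r):
        right = tuple(l for l in levels if l not in left)
        if left < right:            # keep each unordered split only once
            splits.append((left, right))

print(len(splits))   # 2^(4-1) - 1 = 7
print(splits)
```

With q levels there are 2^(q−1) − 1 such splits, which is why exhaustive search quickly becomes costly for categorical variables with many levels.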
Qualitative variables

X takes values in {a, b, c, d}.

Grouping the categories successively yields a tree on X directly:

[Figure: fitted tree on X. Root split (p < 0.001): A vs {B, C, D}, with a leaf of n = 100; then (p < 0.001) C vs {B, D}, with a leaf of n = 100; then (p = 0.066) B vs D, two leaves of n = 100 each; each leaf shows the class proportions]
The CART method and extensions

Splits are made only along X1 or X2 (axis-parallel cuts).

[Figure: simulated sample in the unit square, and the axis-parallel partition produced by the tree]
The CART method and extensions

But this does not work well in the presence of nonlinearities, or of rotated decision boundaries.

[Figure: simulated sample with a diagonal boundary, and the staircase-shaped partition produced by axis-parallel cuts]
The CART method and extensions

One can also try trees on transformed covariates, e.g. on X1 + X2.

[Figure: the same sample, now split along the direction X1 + X2]
Robustness of trees

Trees are attractive, but not very robust,

cf. Classification and regression trees, bagging, and boosting,
http://mason.gmu.edu/~csutton/vt6.pdf

A possible idea: bootstrap (resample) and then aggregate the predictions.
Resampling and regression

In a linear model, e.g. modelling a person's weight (Y) as a function of height (X), the Gaussian linear model is

Y|X = x ∼ N(β0 + β1x, σ²)

E(Y|X = x) = β0 + β1x = Ŷ(x)

and Y ∈ [Ŷ(x) ± u_{1−α/2} · σ], with u_{1−α/2} ≈ 1.96 at the 95% level.
Resampling and regression

Given a sample (X1, Y1), ···, (Xn, Yn):
◦ resample, i.e. draw n observations with replacement
◦ fit a model on this bootstrap sample
◦ store the prediction
◦ and repeat this resampling step many times.
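This resampling loop can be sketched for a linear regression; the heights and weights below are simulated under an assumed line, not real data:

```python
import numpy as np

# Bootstrap for a linear regression: resample the pairs (x_i, y_i) with
# replacement, refit, and keep the prediction at a point x0; the spread
# of the replicated predictions estimates its sampling variability.
rng = np.random.default_rng(5)
n = 100
x = rng.uniform(150, 200, n)                 # simulated "heights" (cm)
y = 0.9 * x - 90 + rng.normal(0, 5, n)       # simulated "weights" (kg)

def fit_predict(x, y, x0):
    b1, b0 = np.polyfit(x, y, 1)             # least-squares slope, intercept
    return b0 + b1 * x0

preds = []
for _ in range(500):
    idx = rng.integers(0, n, n)              # draw n indices with replacement
    preds.append(fit_predict(x[idx], y[idx], 175.0))
preds = np.array(preds)
print(preds.mean(), preds.std())
```

The bootstrap mean should sit near the true value 0.9 · 175 − 90 = 67.5, and the standard deviation of the replicates approximates the standard error of the prediction.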
Resampling and classification trees

◦ resample, i.e. draw n observations with replacement
◦ grow a tree on this bootstrap sample
◦ repeating this generates a forest: a random forest.
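A minimal hand-rolled version of this bagging step, using scikit-learn trees on simulated data (a full random forest additionally subsamples the covariates at each split):

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier

# Bagging by hand: draw n observations with replacement, grow a tree on
# each bootstrap sample, aggregate the predictions by majority vote.
# Toy data with a diagonal boundary, NOT the MYOCARDE sample.
rng = np.random.default_rng(42)
n = 300
X = rng.uniform(0, 1, size=(n, 2))
y = (X[:, 0] + X[:, 1] > 1).astype(int)

def bagged_predict(X_tr, y_tr, X_new, n_trees=50):
    votes = np.zeros(len(X_new))
    for _ in range(n_trees):
        idx = rng.integers(0, len(y_tr), len(y_tr))   # bootstrap draw
        tree = DecisionTreeClassifier(random_state=0).fit(X_tr[idx], y_tr[idx])
        votes += tree.predict(X_new)
    return (votes / n_trees > 0.5).astype(int)        # majority vote

pred = bagged_predict(X, y, X)
print((pred == y).mean())
```

Individual trees approximate the diagonal boundary with a staircase; averaging many bootstrapped trees smooths it out, which is the point of the forest.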
Resampling and classification trees

[Figure: the (REPUL, PVENT) sample, with the classification regions obtained from the aggregated trees]
References

Trevor Hastie, Robert Tibshirani & Jerome Friedman (2013). The Elements of Statistical Learning: Data Mining, Inference, and Prediction. Springer.
http://statweb.stanford.edu/~tibs/ElemStatLearn/printings/ESLII_print10.pdf

Leo Breiman, Jerome Friedman, Charles J. Stone & R.A. Olshen (1984). Classification and Regression Trees. CRC Press.

Kevin P. Murphy (2012). Machine Learning: A Probabilistic Perspective. MIT Press.