Presentation G. Biau: Random Forests


Introduction to random forests, sparse case.


Random forests

Gérard Biau

Groupe de travail prévision, May 2011

G. Biau (UPMC) 1 / 106

Outline

1 Setting

2 A random forests model

3 Layered nearest neighbors and random forests

G. Biau (UPMC) 2 / 106


Leo Breiman (1928-2005)

G. Biau (UPMC) 4 / 106


Classification

[Figures: a toy classification sample and a query point "?" to be classified.]

G. Biau (UPMC) 6-8 / 106

Regression

[Figures: a toy regression sample with real-valued responses and a query point "?" whose value is to be predicted.]

G. Biau (UPMC) 9-11 / 106

Trees

[Figures: recursive partitioning of the input space by a decision tree and the resulting prediction for the query point "?".]

G. Biau (UPMC) 12-33 / 106

From trees to forests

Leo Breiman promoted random forests.

Idea: Using tree averaging as a means of obtaining good rules.

The base trees are simple and randomized.

Breiman’s ideas were decisively influenced by

Amit and Geman (1997, geometric feature selection).

Ho (1998, random subspace method).

Dietterich (2000, random split selection approach).

stat.berkeley.edu/users/breiman/RandomForests

G. Biau (UPMC) 34 / 106


Random forests

They have emerged as serious competitors to state-of-the-art methods.

They are fast and easy to implement, produce highly accurate predictions and can handle a very large number of input variables without overfitting.

In fact, forests are among the most accurate general-purpose learners available.

The algorithm is difficult to analyze and its mathematical properties remain to date largely unknown.

Most theoretical studies have concentrated on isolated parts or stylized versions of the procedure.

G. Biau (UPMC) 35 / 106


Key references

Breiman (2000, 2001, 2004).

. Definition, experiments and intuitions.

Lin and Jeon (2006).

. Link with layered nearest neighbors.

Biau, Devroye and Lugosi (2008).

. Consistency results for stylized versions.

Biau (2010).

. Sparsity and random forests.

G. Biau (UPMC) 36 / 106


Outline

1 Setting

2 A random forests model

3 Layered nearest neighbors and random forests

G. Biau (UPMC) 37 / 106

Three basic ingredients

1-Randomization and no pruning. For each tree, select at random, at each node, a small group of input coordinates to split on.

. Calculate the best split based on these features and cut.

. The tree is grown to maximum size, without pruning.

G. Biau (UPMC) 38 / 106


Three basic ingredients

2-Aggregation. Final predictions are obtained by aggregating over the ensemble.

. It is fast and easily parallelizable.

G. Biau (UPMC) 39 / 106


Three basic ingredients

3-Bagging. The subspace randomization scheme is blended with bagging.

. Breiman (1996).

. Bühlmann and Yu (2002).

. Biau, Cérou and Guyader (2010).

G. Biau (UPMC) 40 / 106
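As a concrete companion to these three ingredients, here is a minimal sketch using scikit-learn's RandomForestRegressor, an off-the-shelf random forest implementation rather than the simplified model analyzed later in the talk; the dataset and parameter values are arbitrary.

```python
# Minimal sketch (assumes scikit-learn is installed); parameter values are arbitrary.
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import train_test_split

X, y = make_regression(n_samples=1000, n_features=20, n_informative=5,
                       noise=1.0, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

forest = RandomForestRegressor(
    n_estimators=500,     # ingredient 2: aggregate many base trees
    max_features="sqrt",  # ingredient 1: random subset of coordinates at each node
    max_depth=None,       # ingredient 1: trees grown to maximum size, no pruning
    bootstrap=True,       # ingredient 3: each tree sees a bootstrap sample (bagging)
    n_jobs=-1,            # aggregation is easily parallelizable
    random_state=0,
)
forest.fit(X_train, y_train)
print("test R^2:", forest.score(X_test, y_test))
```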


Mathematical framework

A training sample: $\mathcal{D}_n = \{(X_1, Y_1), \dots, (X_n, Y_n)\}$, i.i.d. $[0,1]^d \times \mathbb{R}$-valued random variables.

A generic pair: $(X, Y)$ satisfying $\mathbb{E}Y^2 < \infty$.

Our mission: For fixed $x \in [0,1]^d$, estimate the regression function $r(x) = \mathbb{E}[Y \mid X = x]$ using the data.

Quality criterion: $\mathbb{E}[r_n(X) - r(X)]^2$.

G. Biau (UPMC) 41 / 106


The model

A random forest is a collection of randomized base regression trees $\{r_n(x, \Theta_m, \mathcal{D}_n),\, m \ge 1\}$.

These random trees are combined to form the aggregated regression estimate

$$\bar{r}_n(X, \mathcal{D}_n) = \mathbb{E}_\Theta\left[r_n(X, \Theta, \mathcal{D}_n)\right].$$

$\Theta$ is assumed to be independent of $X$ and the training sample $\mathcal{D}_n$.

However, we allow $\Theta$ to be based on a test sample, independent of, but distributed as, $\mathcal{D}_n$.

G. Biau (UPMC) 42 / 106


The procedure

. Fix $k_n \ge 2$ and repeat the following procedure $\lceil \log_2 k_n \rceil$ times:

1 At each node, a coordinate of $X = (X^{(1)}, \dots, X^{(d)})$ is selected, with the $j$-th feature having a probability $p_{nj} \in (0,1)$ of being selected.

2 At each node, once the coordinate is selected, the split is at the midpoint of the chosen side.

. Thus

$$\bar{r}_n(X) = \mathbb{E}_\Theta\left[\frac{\sum_{i=1}^n Y_i \mathbf{1}_{[X_i \in A_n(X,\Theta)]}}{\sum_{i=1}^n \mathbf{1}_{[X_i \in A_n(X,\Theta)]}}\, \mathbf{1}_{E_n(X,\Theta)}\right],$$

where

$$E_n(X,\Theta) = \left[\sum_{i=1}^n \mathbf{1}_{[X_i \in A_n(X,\Theta)]} \neq 0\right].$$

G. Biau (UPMC) 43 / 106
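A minimal NumPy sketch of this simplified procedure may help fix ideas; the function names, the toy data and the Monte Carlo average over a finite number of trees (standing in for the expectation over $\Theta$) are my own choices, not part of the slides.

```python
import numpy as np

def random_cell(x, d, levels, p, rng):
    """One draw of Theta: follow x down a purely random tree. At each of
    `levels` nodes, pick coordinate j with probability p[j] and split the
    current rectangular cell at its midpoint, keeping the side containing x.
    Returns the cell A_n(x, Theta) as (lower corner, upper corner)."""
    lows, highs = np.zeros(d), np.ones(d)
    for _ in range(levels):
        j = rng.choice(d, p=p)
        mid = 0.5 * (lows[j] + highs[j])
        if x[j] <= mid:
            highs[j] = mid
        else:
            lows[j] = mid
    return lows, highs

def forest_estimate(x, X, Y, k_n, p, n_trees=200, seed=0):
    """Monte Carlo approximation of bar r_n(x): each tree predicts the mean of
    the Y_i whose X_i fall in the cell containing x (empty cells contribute 0,
    as enforced by the indicator of E_n)."""
    rng = np.random.default_rng(seed)
    d = X.shape[1]
    levels = int(np.ceil(np.log2(k_n)))
    preds = []
    for _ in range(n_trees):
        lows, highs = random_cell(x, d, levels, p, rng)
        in_cell = np.all((X >= lows) & (X <= highs), axis=1)
        preds.append(Y[in_cell].mean() if in_cell.any() else 0.0)
    return float(np.mean(preds))

# Toy usage on [0,1]^2 with r(x) = x1 + x2 and uniform coordinate probabilities.
rng = np.random.default_rng(1)
X = rng.random((500, 2))
Y = X.sum(axis=1) + 0.1 * rng.standard_normal(500)
print(forest_estimate(np.array([0.3, 0.7]), X, Y, k_n=32, p=np.array([0.5, 0.5])))
```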


Binary trees

[Figures: examples of binary trees and the associated recursive partitions.]

G. Biau (UPMC) 44-48 / 106

General comments

Each individual tree has exactly $2^{\lceil \log_2 k_n \rceil}$ ($\approx k_n$) terminal nodes, and each leaf has Lebesgue measure $2^{-\lceil \log_2 k_n \rceil}$ ($\approx 1/k_n$).

If $X$ has uniform distribution on $[0,1]^d$, there will be on average about $n/k_n$ observations per terminal node.

The choice $k_n = n$ induces a very small number of cases in the final leaves.

This scheme is close to what the original random forests algorithm does.

G. Biau (UPMC) 49 / 106


Consistency

Theorem. Assume that the distribution of $X$ has support on $[0,1]^d$. Then the random forests estimate $\bar{r}_n$ is consistent whenever $p_{nj} \log k_n \to \infty$ for all $j = 1, \dots, d$ and $k_n/n \to 0$ as $n \to \infty$.

In the purely random model, $p_{nj} = 1/d$, independently of $n$ and $j$, and consistency is ensured as long as $k_n \to \infty$ and $k_n/n \to 0$.

This is however a radically simplified version of the random forests used in practice.

A more in-depth analysis is needed.

G. Biau (UPMC) 50 / 106


Sparsity

There is empirical evidence that many signals in high-dimensional spaces admit a sparse representation.

. Image wavelet coefficients.

. High-throughput technologies.

Sparse estimation is playing an increasingly important role in the statistics and machine learning communities.

Several methods have recently been developed in both fields, which rely upon the notion of sparsity.

G. Biau (UPMC) 51 / 106


Our vision

The regression function $r(X) = \mathbb{E}[Y \mid X]$ depends in fact only on a nonempty subset $\mathcal{S}$ (for Strong) of the $d$ features.

In other words, letting $X_\mathcal{S} = (X_j : j \in \mathcal{S})$ and $S = \operatorname{Card}\, \mathcal{S}$, we have

$$r(X) = \mathbb{E}[Y \mid X_\mathcal{S}].$$

In the dimension reduction scenario we have in mind, the ambient dimension $d$ can be very large, much larger than $n$.

As such, the value $S$ characterizes the sparsity of the model: the smaller $S$, the sparser $r$.

G. Biau (UPMC) 52 / 106


Sparsity and random forests

Ideally, $p_{nj} = 1/S$ for $j \in \mathcal{S}$.

To stick to reality, we will rather require that $p_{nj} = (1/S)(1 + \xi_{nj})$.

Such a randomization mechanism may be designed on the basis of a test sample.

Action plan

$$\mathbb{E}\left[\bar{r}_n(X) - r(X)\right]^2 = \mathbb{E}\left[\bar{r}_n(X) - \tilde{r}_n(X)\right]^2 + \mathbb{E}\left[\tilde{r}_n(X) - r(X)\right]^2,$$

where

$$\tilde{r}_n(X) = \sum_{i=1}^n \mathbb{E}_\Theta\left[W_{ni}(X,\Theta)\right] r(X_i).$$

G. Biau (UPMC) 53 / 106


Variance

Proposition

Assume that $X$ is uniformly distributed on $[0,1]^d$ and, for all $x \in \mathbb{R}^d$,

$$\sigma^2(x) = \mathbb{V}[Y \mid X = x] \le \sigma^2$$

for some positive constant $\sigma^2$. Then, if $p_{nj} = (1/S)(1 + \xi_{nj})$ for $j \in \mathcal{S}$,

$$\mathbb{E}\left[\bar{r}_n(X) - \tilde{r}_n(X)\right]^2 \le C \sigma^2 \left(\frac{S^2}{S-1}\right)^{S/(2d)} \frac{(1 + \xi_n)\, k_n}{n\, (\log k_n)^{S/(2d)}},$$

where

$$C = \frac{288}{\pi}\left(\frac{\pi \log 2}{16}\right)^{S/(2d)}.$$

G. Biau (UPMC) 54 / 106

Bias

Proposition

Assume that $X$ is uniformly distributed on $[0,1]^d$ and $r$ is $L$-Lipschitz. Then, if $p_{nj} = (1/S)(1 + \xi_{nj})$ for $j \in \mathcal{S}$,

$$\mathbb{E}\left[\tilde{r}_n(X) - r(X)\right]^2 \le \frac{2 S L^2}{k_n^{\frac{0.75}{S \log 2}(1 + \gamma_n)}} + \left[\sup_{x \in [0,1]^d} r^2(x)\right] e^{-n/(2k_n)}.$$

The rate at which the bias decreases to 0 depends on the number of strong variables, not on $d$.

$k_n^{-(0.75/(S \log 2))(1 + \gamma_n)} = o(k_n^{-2/d})$ as soon as $S \le \lfloor 0.54\, d \rfloor$.

The term $e^{-n/(2k_n)}$ prevents the extreme choice $k_n = n$.

G. Biau (UPMC) 55 / 106


Main result

Theorem. If $p_{nj} = (1/S)(1 + \xi_{nj})$ for $j \in \mathcal{S}$, with $\xi_{nj} \log n \to 0$ as $n \to \infty$, then for the choice

$$k_n \propto \left(\frac{L^2}{\Xi}\right)^{1/\left(1 + \frac{0.75}{S \log 2}\right)} n^{1/\left(1 + \frac{0.75}{S \log 2}\right)},$$

we have

$$\limsup_{n \to \infty}\ \sup_{(X,Y) \in \mathcal{F}}\ \frac{\mathbb{E}\left[\bar{r}_n(X) - r(X)\right]^2}{\left(\Xi L^2\, \frac{S \log 2}{0.75}\right)^{\frac{0.75}{S \log 2 + 0.75}}\, n^{\frac{-0.75}{S \log 2 + 0.75}}} \le \Lambda.$$

Take-home message

The rate $n^{\frac{-0.75}{S \log 2 + 0.75}}$ is strictly faster than the usual minimax rate $n^{-2/(d+2)}$ as soon as $S \le \lfloor 0.54\, d \rfloor$.

G. Biau (UPMC) 56 / 106
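To see the take-home message numerically, here is a small script of my own comparing the two rate exponents for $d = 100$:

```python
import math

def forest_exponent(S):
    """Exponent of n in the sparse forest rate n^{-0.75 / (S log 2 + 0.75)}."""
    return 0.75 / (S * math.log(2) + 0.75)

def minimax_exponent(d):
    """Exponent of n in the usual d-dimensional minimax rate n^{-2/(d+2)}."""
    return 2.0 / (d + 2)

d = 100
for S in (5, 10, 54, 55):
    faster = forest_exponent(S) > minimax_exponent(d)
    print(f"S={S:3d}: forest {forest_exponent(S):.4f}  "
          f"minimax {minimax_exponent(d):.4f}  faster={faster}")
# For d = 100 the forest exponent beats 2/(d+2) exactly for S <= 54 = floor(0.54 d).
```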


Dimension reduction

[Figure: plot against the number of strong variables $S$ (x-axis from 10 to 100), with reference value 0.0196; presumably the rate exponent compared with the minimax exponent $2/(d+2)$.]

G. Biau (UPMC) 57 / 106

Discussion — Bagging

The optimal parameter $k_n$ depends on the unknown distribution of $(X, Y)$.

To correct this situation, adaptive choices of $k_n$ should preserve the rate of convergence of the estimate.

Another route we may follow is to analyze the effect of bagging.

Biau, Cérou and Guyader (2010).

G. Biau (UPMC) 58 / 106



Discussion — Choosing the $p_{nj}$'s

Imaginary scenario. The following splitting scheme is iteratively repeated at each node:

1 Select at random $M_n$ candidate coordinates to split on.

2 If the selection is all weak, then choose one at random to split on.

3 If there is more than one strong variable elected, choose one at random and cut.

G. Biau (UPMC) 70 / 106


Discussion — Choosing the $p_{nj}$'s

Each coordinate in $\mathcal{S}$ will be cut with the "ideal" probability

$$p_n^\star = \frac{1}{S}\left[1 - \left(1 - \frac{S}{d}\right)^{M_n}\right].$$

The parameter $M_n$ should satisfy

$$\left(1 - \frac{S}{d}\right)^{M_n} \log n \to 0 \quad \text{as } n \to \infty.$$

This is true as soon as

$$M_n \to \infty \quad \text{and} \quad \frac{M_n}{\log n} \to \infty \quad \text{as } n \to \infty.$$

G. Biau (UPMC) 71 / 106
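A quick numerical illustration of my own showing how $p_n^\star$ approaches the ideal value $1/S$ as $M_n$ grows, together with the quantity $(1 - S/d)^{M_n} \log n$ that must vanish (the values of $S$, $d$, $M_n$ and $n$ are arbitrary):

```python
import math

def p_star(S, d, M):
    """Probability that a given strong coordinate is cut at a node when M
    candidate coordinates are drawn at random among d, S of which are strong."""
    return (1.0 / S) * (1.0 - (1.0 - S / d) ** M)

S, d = 5, 100
for M in (10, 50, 200, 1000):
    # (1 - S/d)^M log n is the quantity that must vanish; shown here for n = 10**6.
    remainder = (1.0 - S / d) ** M * math.log(10**6)
    print(f"M={M:5d}  p*={p_star(S, d, M):.4f}  target 1/S={1/S:.4f}  "
          f"(1-S/d)^M log n={remainder:.2e}")
```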


Assumptions

We have at hand an independent test set $\mathcal{D}'_n$.

The model is linear:

$$Y = \sum_{j \in \mathcal{S}} a_j X^{(j)} + \varepsilon.$$

For a fixed node $A = \prod_{j=1}^d A_j$, fix a coordinate $j$ and look at the weighted conditional variance $\mathbb{V}[Y \mid X^{(j)} \in A_j]\, \mathbb{P}(X^{(j)} \in A_j)$.

If $j \in \mathcal{S}$, then the best split is at the midpoint of the node, with a variance decrease equal to $a_j^2/16 > 0$.

If $j \in \mathcal{W}$, the decrease of the variance is always 0, whatever the location of the split.

G. Biau (UPMC) 72 / 106


Discussion — Choosing the $p_{nj}$'s

Near-reality scenario. The following splitting scheme is iteratively repeated at each node:

1 Select at random $M_n$ candidate coordinates to split on.

2 For each of the $M_n$ elected coordinates, calculate the best split.

3 Select the coordinate which outputs the best within-node sum of squares decrease, and cut.

Conclusion. For $j \in \mathcal{S}$,

$$p_{nj} \approx \frac{1}{S}\left(1 + \xi_{nj}\right),$$

where $\xi_{nj} \to 0$ and satisfies the constraint $\xi_{nj} \log n \to 0$ as $n$ tends to infinity, provided $k_n \log n / n \to 0$, $M_n \to \infty$ and $M_n / \log n \to \infty$.

G. Biau (UPMC) 73 / 106
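The near-reality splitting step can be sketched in a few lines of NumPy; the helper names and the brute-force search over observed split points are my own choices for illustration:

```python
import numpy as np

def sse(y):
    """Within-node sum of squares around the node mean (0 for empty/singleton nodes)."""
    return 0.0 if len(y) < 2 else float(np.sum((y - y.mean()) ** 2))

def best_split_on_coordinate(x_j, y):
    """Best threshold on one coordinate: the cut minimizing the sum of the two
    children's within-node sums of squares. Returns (decrease, threshold)."""
    order = np.argsort(x_j)
    x_sorted, y_sorted = x_j[order], y[order]
    parent = sse(y_sorted)
    best = (0.0, None)
    for cut in range(1, len(y_sorted)):
        if x_sorted[cut] == x_sorted[cut - 1]:
            continue  # cannot separate tied values
        decrease = parent - sse(y_sorted[:cut]) - sse(y_sorted[cut:])
        if decrease > best[0]:
            best = (decrease, 0.5 * (x_sorted[cut - 1] + x_sorted[cut]))
    return best

def choose_split(X_node, y_node, M_n, rng):
    """Draw M_n candidate coordinates at random, compute each one's best split,
    and return (coordinate, threshold) with the largest variance decrease."""
    d = X_node.shape[1]
    candidates = rng.choice(d, size=M_n, replace=False)
    scored = [(j, *best_split_on_coordinate(X_node[:, j], y_node)) for j in candidates]
    j, _, thr = max(scored, key=lambda t: t[1])
    return j, thr

# Toy usage: only coordinates 0 and 1 are strong, so they should be picked often.
rng = np.random.default_rng(0)
X = rng.random((300, 20))
y = 3.0 * X[:, 0] + 2.0 * X[:, 1] + 0.1 * rng.standard_normal(300)
print(choose_split(X, y, M_n=10, rng=rng))
```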


Outline

1 Setting

2 A random forests model

3 Layered nearest neighbors and random forests

G. Biau (UPMC) 74 / 106


Layered Nearest Neighbors

Definition. Let $X_1, \dots, X_n$ be a sample of i.i.d. random vectors in $\mathbb{R}^d$, $d \ge 2$. An observation $X_i$ is said to be an LNN of a point $x$ if the hyperrectangle defined by $x$ and $X_i$ contains no other data points.

[Figure: the hyperrectangle defined by $x$ and an LNN is empty.]

G. Biau (UPMC) 99 / 106

What is known about $L_n(x)$?

... a lot when $X_1, \dots, X_n$ are uniformly distributed over $[0,1]^d$.

For example,

$$\mathbb{E}L_n(x) = \frac{2^d (\log n)^{d-1}}{(d-1)!} + O\big((\log n)^{d-2}\big)$$

and

$$\frac{(d-1)!\, L_n(x)}{2^d (\log n)^{d-1}} \to 1 \quad \text{in probability as } n \to \infty.$$

This is the problem of maxima in random vectors (Barndorff-Nielsen and Sobel, 1966).

G. Biau (UPMC) 100 / 106
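To get a feel for how slowly the number of LNN grows with $n$, here is a quick evaluation of the leading term above (an illustrative script of my own; finite-$n$ values differ by the lower-order correction):

```python
import math

def expected_lnn_leading_term(n, d):
    """Leading term of E L_n(x) for uniform data on [0,1]^d:
    2^d (log n)^(d-1) / (d-1)!"""
    return 2 ** d * math.log(n) ** (d - 1) / math.factorial(d - 1)

for d in (2, 3, 5):
    for n in (10**3, 10**6, 10**9):
        print(f"d={d}  n={n:>10}  ~E L_n(x) = {expected_lnn_leading_term(n, d):10.1f}")
```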


Two results (Biau and Devroye, 2010)

Model. $X_1, \dots, X_n$ are independently distributed according to some probability density $f$ (with probability measure $\mu$).

Theorem. For $\mu$-almost all $x \in \mathbb{R}^d$, one has

$$L_n(x) \to \infty \quad \text{in probability as } n \to \infty.$$

Theorem. Suppose that $f$ is $\lambda$-almost everywhere continuous. Then

$$\frac{(d-1)!\, \mathbb{E}L_n(x)}{2^d (\log n)^{d-1}} \to 1 \quad \text{as } n \to \infty,$$

at $\mu$-almost all $x \in \mathbb{R}^d$.

G. Biau (UPMC) 101 / 106


LNN regression estimation

Model. $(X, Y), (X_1, Y_1), \dots, (X_n, Y_n)$ are i.i.d. random vectors of $\mathbb{R}^d \times \mathbb{R}$. Moreover, $|Y|$ is bounded and $X$ has a density.

The regression function $r(x) = \mathbb{E}[Y \mid X = x]$ may be estimated by

$$r_n(x) = \frac{1}{L_n(x)} \sum_{i=1}^n Y_i \mathbf{1}_{[X_i \in L_n(x)]}.$$

1 No smoothing parameter.

2 A scale-invariant estimate.

3 Intimately connected to Breiman’s random forests.

G. Biau (UPMC) 102 / 106
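A minimal NumPy sketch of the LNN set and the corresponding estimate; the function names and the brute-force $O(n^2 d)$ search are my own choices:

```python
import numpy as np

def lnn_indices(x, X):
    """Indices i such that X[i] is a layered nearest neighbor of x: the
    axis-aligned hyperrectangle spanned by x and X[i] contains no other sample."""
    lo, hi = np.minimum(x, X), np.maximum(x, X)   # rectangle corners, one row per i
    keep = []
    for i in range(len(X)):
        inside = np.all((X >= lo[i]) & (X <= hi[i]), axis=1)
        inside[i] = False                          # X[i] itself does not count
        if not inside.any():
            keep.append(i)
    return np.array(keep)

def lnn_regression(x, X, Y):
    """r_n(x): average of the responses Y_i over the LNN of x
    (no smoothing parameter, scale-invariant)."""
    idx = lnn_indices(x, X)
    return float(Y[idx].mean())

# Toy usage in dimension 2 with r(x) = x1 * x2.
rng = np.random.default_rng(0)
X = rng.random((2000, 2))
Y = X[:, 0] * X[:, 1] + 0.05 * rng.standard_normal(len(X))
x = np.array([0.4, 0.6])
print("number of LNN:", len(lnn_indices(x, X)), " estimate:", lnn_regression(x, X, Y))
```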


Consistency

Theorem (Pointwise $L^p$-consistency). Assume that the regression function $r$ is $\lambda$-almost everywhere continuous and that $Y$ is bounded. Then, for $\mu$-almost all $x \in \mathbb{R}^d$ and all $p \ge 1$,

$$\mathbb{E}\left|r_n(x) - r(x)\right|^p \to 0 \quad \text{as } n \to \infty.$$

Theorem (Global $L^p$-consistency). Under the same conditions, for all $p \ge 1$,

$$\mathbb{E}\left|r_n(X) - r(X)\right|^p \to 0 \quad \text{as } n \to \infty.$$

1 No universal consistency result with respect to r is possible.

2 The results do not impose any condition on the density.

3 They are also scale-free.

G. Biau (UPMC) 103 / 106


Back to random forests

A random forest can be viewed as a weighted LNN regression estimate

$$\bar{r}_n(x) = \sum_{i=1}^n Y_i W_{ni}(x),$$

where the weights concentrate on the LNN and satisfy

$$\sum_{i=1}^n W_{ni}(x) = 1.$$

G. Biau (UPMC) 104 / 106

Non-adaptive strategies

Consider the non-adaptive random forests estimate

$$\bar{r}_n(x) = \sum_{i=1}^n Y_i W_{ni}(x),$$

where the weights concentrate on the LNN.

Proposition

For any $x \in \mathbb{R}^d$, assume that $\sigma^2 = \mathbb{V}[Y \mid X = x]$ is independent of $x$. Then

$$\mathbb{E}\left[\bar{r}_n(x) - r(x)\right]^2 \ge \frac{\sigma^2}{\mathbb{E}L_n(x)}.$$

G. Biau (UPMC) 105 / 106


Rate of convergence

At $\mu$-almost all $x$, when $f$ is $\lambda$-almost everywhere continuous,

$$\mathbb{E}\left[\bar{r}_n(x) - r(x)\right]^2 \gtrsim \frac{\sigma^2 (d-1)!}{2^d (\log n)^{d-1}}.$$

Improving the rate of convergence

1 Stop as soon as a future rectangle split would cause a sub-rectangle to have fewer than $k_n$ points.

2 Resort to bagging and randomize using random subsamples.

G. Biau (UPMC) 106 / 106
