Post on 30-Jun-2015
Random forests
Gérard Biau
Groupe de travail prévision, May 2011
G. Biau (UPMC) 1 / 106
Outline
1 Setting
2 A random forests model
3 Layered nearest neighbors and random forests
Leo Breiman (1928-2005)
Classification
[Figure: a two-class sample of points, with a query point marked “?” to be classified.]
Regression
[Figure: a cloud of sample points with real-valued responses (1.09, 5.89, −2.33, 12.00, …); a query point marked “?” receives the prediction 2.25.]
Trees
[Figure: a sequence of axis-parallel splits recursively partitioning the space, showing how a decision tree answers the query “?”.]
From trees to forests
Leo Breiman promoted random forests.
Idea: use tree averaging as a means of obtaining good rules.
The base trees are simple and randomized.
Breiman’s ideas were decisively influenced by
Amit and Geman (1997, geometric feature selection),
Ho (1998, random subspace method), and
Dietterich (2000, random split selection approach).
stat.berkeley.edu/users/breiman/RandomForests
Random forests
They have emerged as serious competitors to state-of-the-art methods.
They are fast and easy to implement, produce highly accurate predictions, and can handle a very large number of input variables without overfitting.
In fact, forests are among the most accurate general-purpose learners available.
The algorithm is difficult to analyze, and its mathematical properties remain to date largely unknown.
Most theoretical studies have concentrated on isolated parts or stylized versions of the procedure.
Key references
Breiman (2000, 2001, 2004): definition, experiments and intuitions.
Lin and Jeon (2006): link with layered nearest neighbors.
Biau, Devroye and Lugosi (2008): consistency results for stylized versions.
Biau (2010): sparsity and random forests.
Outline
1 Setting
2 A random forests model
3 Layered nearest neighbors and random forests
Three basic ingredients
1. Randomization and no pruning. For each tree, select at random, at each node, a small group of input coordinates to split on.
Calculate the best split based on these features and cut.
The tree is grown to maximum size, without pruning.
Three basic ingredients
2. Aggregation. Final predictions are obtained by aggregating over the ensemble.
This is fast and easily parallelizable.
Three basic ingredients
3. Bagging. The subspace randomization scheme is blended with bagging.
Breiman (1996).
Bühlmann and Yu (2002).
Biau, Cérou and Guyader (2010).
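The three ingredients can be put together in a short sketch. The code below is an illustrative toy, not Breiman’s actual implementation: all names (`grow_tree`, `forest`, `mtry`, …) are ours, the stopping rule `min_leaf` is a simplification of “maximum size”, each split minimizes the CART sum-of-squares criterion over a random subset of `mtry` coordinates, and predictions are averaged over bootstrap samples.

```python
import random
import statistics

def sse(ys):
    # within-node sum of squares around the node mean
    m = statistics.fmean(ys)
    return sum((y - m) ** 2 for y in ys)

def best_split(data, features):
    # ingredient 1: best (coordinate, threshold) over the candidate
    # features, by the sum-of-squares (CART regression) criterion
    best = None  # (score, j, t)
    for j in features:
        values = sorted({x[j] for x, _ in data})
        for lo, hi in zip(values, values[1:]):
            t = (lo + hi) / 2.0
            left = [y for x, y in data if x[j] <= t]
            right = [y for x, y in data if x[j] > t]
            score = sse(left) + sse(right)
            if best is None or score < best[0]:
                best = (score, j, t)
    return best

def grow_tree(data, mtry, min_leaf=2):
    # grown deep, without pruning
    ys = [y for _, y in data]
    if len(data) < 2 * min_leaf or len(set(ys)) == 1:
        return statistics.fmean(ys)           # leaf: local average
    d = len(data[0][0])
    features = random.sample(range(d), mtry)  # random subset of coordinates
    split = best_split(data, features)
    if split is None:                         # all candidates were constant
        return statistics.fmean(ys)
    _, j, t = split
    left = [(x, y) for x, y in data if x[j] <= t]
    right = [(x, y) for x, y in data if x[j] > t]
    return (j, t, grow_tree(left, mtry, min_leaf), grow_tree(right, mtry, min_leaf))

def predict_tree(node, x):
    while isinstance(node, tuple):            # internal nodes are (j, t, left, right)
        j, t, left, right = node
        node = left if x[j] <= t else right
    return node

def forest(data, x, n_trees=50, mtry=1):
    preds = []
    for _ in range(n_trees):
        boot = [random.choice(data) for _ in data]   # ingredient 3: bagging
        preds.append(predict_tree(grow_tree(boot, mtry), x))
    return statistics.fmean(preds)                   # ingredient 2: aggregation
```

With `mtry` equal to the ambient dimension and no bootstrap, this collapses to a single CART-style tree; it is the combination of feature subsampling and bagging that decorrelates the trees.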
Mathematical framework
A training sample: Dn = {(X1, Y1), . . . , (Xn, Yn)}, i.i.d. [0,1]^d × R-valued random variables.
A generic pair: (X, Y) satisfying E[Y^2] < ∞.
Our mission: for fixed x ∈ [0,1]^d, estimate the regression function r(x) = E[Y | X = x] using the data.
Quality criterion: E[rn(X) − r(X)]^2.
The model
A random forest is a collection of randomized base regression trees {rn(x, Θm, Dn), m ≥ 1}.
These random trees are combined to form the aggregated regression estimate
r̄n(X, Dn) = EΘ[rn(X, Θ, Dn)].
Θ is assumed to be independent of X and of the training sample Dn.
However, we allow Θ to be based on a test sample, independent of, but distributed as, Dn.
The procedure
Fix kn ≥ 2 and repeat the following procedure ⌈log2 kn⌉ times:
1. At each node, a coordinate of X = (X^(1), . . . , X^(d)) is selected, with the j-th feature having probability pnj ∈ (0,1) of being selected.
2. At each node, once the coordinate is selected, the split is at the midpoint of the chosen side.
Thus
r̄n(X) = EΘ[ (Σ_{i=1}^n Yi 1[Xi ∈ An(X,Θ)] / Σ_{i=1}^n 1[Xi ∈ An(X,Θ)]) 1_{En(X,Θ)} ],
where En(X,Θ) is the event
[ Σ_{i=1}^n 1[Xi ∈ An(X,Θ)] ≠ 0 ].
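A Monte Carlo sketch of this particular model may help fix ideas. Everything below is an illustration under the slide’s assumptions (data in [0,1]^d, coordinate j selected with probability p[j], midpoint splits, ⌈log2 kn⌉ levels); the function names are ours, the expectation EΘ is approximated by averaging over independent draws of Θ, and an empty cell contributes 0, which is exactly the role of the indicator 1_{En(X,Θ)}.

```python
import math
import random

def random_cell(x, data, k_n, p, rng):
    # one draw of Θ: repeat ceil(log2 k_n) times: pick a coordinate j with
    # probability p[j], split the current cell at its midpoint, and keep
    # the side containing x; return the sample points falling in A_n(x, Θ)
    lower = [0.0] * len(x)
    upper = [1.0] * len(x)
    cell = list(data)
    for _ in range(math.ceil(math.log2(k_n))):
        j = rng.choices(range(len(x)), weights=p)[0]
        mid = (lower[j] + upper[j]) / 2.0
        if x[j] < mid:
            upper[j] = mid
            cell = [(xi, yi) for xi, yi in cell if xi[j] < mid]
        else:
            lower[j] = mid
            cell = [(xi, yi) for xi, yi in cell if xi[j] >= mid]
    return cell

def forest_estimate(x, data, k_n, p, n_trees=100, seed=0):
    # Monte Carlo version of E_Θ[...]: average the within-cell means of Y
    # over independent trees; empty cells contribute 0 (the 1_{E_n} term)
    rng = random.Random(seed)
    total = 0.0
    for _ in range(n_trees):
        cell = random_cell(x, data, k_n, p, rng)
        if cell:
            total += sum(y for _, y in cell) / len(cell)
    return total / n_trees
```

Note that when x is itself a training point, its cell is never empty, since the retained side always contains x.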
Binary trees
[Figure: successive midpoint splits of the unit square, building up a binary tree partition.]
General comments
Each individual tree has exactly 2^⌈log2 kn⌉ (≈ kn) terminal nodes, and each leaf has Lebesgue measure 2^−⌈log2 kn⌉ (≈ 1/kn).
If X has the uniform distribution on [0,1]^d, there will be on average about n/kn observations per terminal node.
The choice kn = n induces a very small number of cases in the final leaves.
This scheme is close to what the original random forests algorithm does.
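The counting in the first bullet is easy to check numerically (a throwaway helper, not part of the procedure):

```python
import math

def tree_geometry(k_n):
    # number of terminal nodes and their common Lebesgue measure,
    # for a tree grown over ceil(log2 k_n) levels of midpoint splits
    levels = math.ceil(math.log2(k_n))
    return 2 ** levels, 2.0 ** (-levels)

# k_n = 1000 gives 2^10 = 1024 leaves, each of measure 1/1024 ≈ 1/k_n
leaves, measure = tree_geometry(1000)
```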
Consistency
Theorem. Assume that the distribution of X has support on [0,1]^d. Then the random forests estimate r̄n is consistent whenever pnj log kn → ∞ for all j = 1, . . . , d and kn/n → 0 as n → ∞.
In the purely random model, pnj = 1/d, independently of n and j, and consistency is ensured as long as kn → ∞ and kn/n → 0.
This is however a radically simplified version of the random forests used in practice.
A more in-depth analysis is needed.
Sparsity
There is empirical evidence that many signals in high-dimensional spaces admit a sparse representation.
Wavelet coefficients of images.
High-throughput technologies.
Sparse estimation is playing an increasingly important role in the statistics and machine learning communities.
Several methods that rely upon the notion of sparsity have recently been developed in both fields.
Our vision
The regression function r(X) = E[Y | X] depends in fact only on a nonempty subset S (for Strong) of the d features.
In other words, letting XS = (Xj : j ∈ S) and S = Card S, we have
r(X) = E[Y | XS].
In the dimension reduction scenario we have in mind, the ambient dimension d can be very large, much larger than n.
As such, the value S characterizes the sparsity of the model: the smaller S, the sparser r.
Sparsity and random forests
Ideally, pnj = 1/S for j ∈ S.
To stick to reality, we will rather require that pnj = (1/S)(1 + ξnj).
Such a randomization mechanism may be designed on the basis of a test sample.
Action plan
E[r̄n(X) − r(X)]^2 = E[r̄n(X) − r̃n(X)]^2 + E[r̃n(X) − r(X)]^2,
where
r̃n(X) = Σ_{i=1}^n EΘ[Wni(X,Θ)] r(Xi).
Variance
Proposition. Assume that X is uniformly distributed on [0,1]^d and, for all x ∈ R^d,
σ^2(x) = V[Y | X = x] ≤ σ^2
for some positive constant σ^2. Then, if pnj = (1/S)(1 + ξnj) for j ∈ S,
E[r̄n(X) − r̃n(X)]^2 ≤ C σ^2 (S^2/(S − 1))^{S/(2d)} (1 + ξn) kn / (n (log kn)^{S/(2d)}),
where
C = (288/π) (π log 2 / 16)^{S/(2d)}.
Bias
Proposition. Assume that X is uniformly distributed on [0,1]^d and r is L-Lipschitz. Then, if pnj = (1/S)(1 + ξnj) for j ∈ S,
E[r̃n(X) − r(X)]^2 ≤ 2SL^2 kn^{−(0.75/(S log 2))(1+γn)} + [sup_{x∈[0,1]^d} r^2(x)] e^{−n/(2kn)}.
The rate at which the bias decreases to 0 depends on the number of strong variables, not on d.
kn^{−(0.75/(S log 2))(1+γn)} = o(kn^{−2/d}) as soon as S ≤ ⌊0.54d⌋.
The term e^{−n/(2kn)} prevents the extreme choice kn = n.
Main result
Theorem. If pnj = (1/S)(1 + ξnj) for j ∈ S, with ξnj log n → 0 as n → ∞, then for the choice
kn ∝ (L^2/Ξ)^{1/(1 + 0.75/(S log 2))} n^{1/(1 + 0.75/(S log 2))},
we have
limsup_{n→∞} sup_{(X,Y)∈F} E[r̄n(X) − r(X)]^2 / [ (Ξ L^2 S ln 2 / 0.75)^{0.75/(S ln 2 + 0.75)} n^{−0.75/(S log 2 + 0.75)} ] ≤ Λ.
Take-home message
The rate n^{−0.75/(S log 2 + 0.75)} is strictly faster than the usual minimax rate n^{−2/(d+2)} as soon as S ≤ ⌊0.54d⌋.
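The take-home claim can be checked numerically; the helper names below are ours, not the slides’.

```python
import math

def forest_exponent(S):
    # exponent a in the forest rate n^(-a) = n^(-0.75 / (S log 2 + 0.75))
    return 0.75 / (S * math.log(2) + 0.75)

def minimax_exponent(d):
    # exponent in the usual Lipschitz minimax rate n^(-2/(d+2))
    return 2.0 / (d + 2)

# with d = 100, the threshold is floor(0.54 d) = 54: the forest rate
# beats the minimax rate for S = 54 strong variables but not for S = 60
d = 100
print(forest_exponent(54) > minimax_exponent(d))   # True
print(forest_exponent(60) > minimax_exponent(d))   # False
```

Indeed, solving 0.75/(S log 2 + 0.75) ≥ 2/(d + 2) for S gives S ≤ 0.375 d / log 2 ≈ 0.541 d, which is where the ⌊0.54d⌋ condition comes from.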
Dimension reduction
[Figure: convergence-rate exponent as a function of the number S of strong variables, with the reference level 0.0196 = 2/(d + 2) for d = 100 marked.]
Discussion — Bagging
The optimal parameter kn depends on the unknown distribution of (X, Y).
To correct this situation, adaptive choices of kn should preserve the rate of convergence of the estimate.
Another route we may follow is to analyze the effect of bagging.
Biau, Cérou and Guyader (2010).
Discussion — Choosing the pnj’s
Imaginary scenario. The following splitting scheme is iteratively repeated at each node:
1. Select at random Mn candidate coordinates to split on.
2. If the selection is all weak, then choose one at random to split on.
3. If there is more than one strong variable elected, choose one at random and cut.
Discussion — Choosing the pnj’s
Each coordinate in S will be cut with the “ideal” probability
p*n = (1/S)[1 − (1 − S/d)^{Mn}].
The parameter Mn should satisfy
(1 − S/d)^{Mn} log n → 0 as n → ∞.
This is true as soon as
Mn → ∞ and Mn/log n → ∞ as n → ∞.
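Numerically, p*n converges to the ideal 1/S very quickly in Mn (the helper name is ours):

```python
def p_star(S, d, M_n):
    # probability that a given strong coordinate is the one cut at a node,
    # when M_n candidates are drawn among d features, S of them strong
    return (1.0 / S) * (1.0 - (1.0 - S / d) ** M_n)

# S = 10 strong among d = 100 features: p*_n -> 1/S = 0.1 as M_n grows
for M_n in (1, 10, 100):
    print(p_star(10, 100, M_n))
```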
Assumptions
We have at hand an independent test set D′n.
The model is linear:
Y = Σ_{j∈S} aj X^(j) + ε.
For a fixed node A = Π_{j=1}^d Aj, fix a coordinate j and look at the weighted conditional variance V[Y | X^(j) ∈ Aj] P(X^(j) ∈ Aj).
If j ∈ S, then the best split is at the midpoint of the node, with a variance decrease equal to a_j^2/16 > 0.
If j ∈ W, the decrease of the variance is always 0, whatever the location of the split.
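The a_j^2/16 figure can be recovered by direct computation when X^(j) is uniform on [0,1] (a sanity-check helper of ours, not part of the slides): splitting [0,1] at t leaves within-cell variances a_j^2 t^2/12 and a_j^2 (1−t)^2/12, with weights t and 1−t, so the weighted within-node variance is a_j^2 (t^3 + (1−t)^3)/12, against the unsplit a_j^2/12.

```python
def split_variance_decrease(a_j, t):
    # decrease of the weighted within-node variance of a_j * X^(j),
    # X^(j) ~ Uniform[0,1], when the node is split at location t
    before = a_j ** 2 / 12.0
    after = a_j ** 2 * (t ** 3 + (1.0 - t) ** 3) / 12.0
    return before - after

# the decrease is maximal at the midpoint t = 1/2, where it equals
# a_j^2/12 - a_j^2/48 = a_j^2/16
decrease_at_midpoint = split_variance_decrease(2.0, 0.5)
```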
Discussion — Choosing the p_nj's
Near-reality scenario
The following splitting scheme is iteratively repeated at each node:
1 Select at random M_n candidate coordinates to split on.
2 For each of the M_n elected coordinates, calculate the best split.
3 Select the coordinate which yields the largest within-node sum of squares decrease, and cut.
Conclusion
For j ∈ S,
p_nj ≈ (1/S)(1 + ξ_nj),
where ξ_nj → 0 and satisfies the constraint ξ_nj log n → 0 as n tends to infinity, provided k_n log n/n → 0, M_n → ∞ and M_n/log n → ∞.
G. Biau (UPMC) 73 / 106
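This splitting scheme can be sketched in a few lines. The helper names `best_split_on` and `random_candidate_split` below are hypothetical, and real CART implementations differ in details; the sketch only mirrors steps 1–3 above.

```python
import numpy as np

def best_split_on(x_col, y):
    """Best cut on one coordinate, scored by the within-node sum-of-squares decrease."""
    order = np.argsort(x_col)
    xs, ys = x_col[order], y[order]
    n, S = len(y), ys.sum()
    k = np.arange(1, n)
    left = np.cumsum(ys)[:-1]            # response sums of the k smallest-x points
    # SSE decrease = between-children sum of squares, via prefix sums
    dec = left**2 / k + (S - left)**2 / (n - k) - S**2 / n
    best = int(dec.argmax())
    return float(dec[best]), float(0.5 * (xs[best] + xs[best + 1]))

def random_candidate_split(X, y, M, rng):
    """One node-splitting step: draw M candidate coordinates, keep the best cut."""
    candidates = rng.choice(X.shape[1], size=min(M, X.shape[1]), replace=False)
    scored = [(j,) + best_split_on(X[:, j], y) for j in candidates]
    return max(scored, key=lambda s: s[1])   # (coordinate, decrease, cut point)

rng = np.random.default_rng(1)
X = rng.uniform(size=(500, 6))                         # here S = {0, 1}, the rest weak
y = 5 * X[:, 0] + 3 * X[:, 1] + rng.normal(size=500)
j, dec, t = random_candidate_split(X, y, M=6, rng=rng)
print(j, round(t, 2))    # the strongest coordinate wins, with a cut near the midpoint
```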
Outline
1 Setting
2 A random forests model
3 Layered nearest neighbors and random forests
G. Biau (UPMC) 74 / 106
Layered Nearest Neighbors
Definition
Let X_1, …, X_n be a sample of i.i.d. random vectors in R^d, d ≥ 2. An observation X_i is said to be an LNN of a point x if the hyperrectangle defined by x and X_i contains no other data points.
G. Biau (UPMC) 99 / 106
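The definition translates directly into code. A brute-force sketch (O(n²) per query, for illustration only; the name `layered_nearest_neighbors` is our own):

```python
import numpy as np

def layered_nearest_neighbors(x, X):
    """Indices i such that the open hyperrectangle spanned by x and X[i] is empty."""
    lnn = []
    for i in range(len(X)):
        lo, hi = np.minimum(x, X[i]), np.maximum(x, X[i])
        inside = np.all((X > lo) & (X < hi), axis=1)   # strictly inside the rectangle
        inside[i] = False    # X[i] sits on a corner, never strictly inside
        if not inside.any():
            lnn.append(i)
    return lnn

X = np.array([[0.6, 0.6], [0.7, 0.7], [0.4, 0.8]])
x = np.array([0.5, 0.5])
print(layered_nearest_neighbors(x, X))   # [0, 2]: point 1 is blocked by point 0
```

Point (0.7, 0.7) is not an LNN because (0.6, 0.6) lies strictly inside the rectangle it spans with x; (0.4, 0.8), in another quadrant, has an empty rectangle of its own.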
What is known about L_n(x)?
... a lot when X_1, …, X_n are uniformly distributed over [0,1]^d.
For example,
E L_n(x) = 2^d (log n)^{d−1}/(d−1)! + O((log n)^{d−2})
and
(d−1)! L_n(x) / (2^d (log n)^{d−1}) → 1 in probability as n → ∞.
This is the problem of maxima in random vectors (Barndorff-Nielsen and Sobel, 1966).
G. Biau (UPMC) 100 / 106
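The 2^d (log n)^{d−1}/(d−1)! growth can be eyeballed by simulation. A rough Monte Carlo sketch (illustrative code; at these sample sizes the O((log n)^{d−2}) correction is not negligible, so only the order of magnitude matches):

```python
import numpy as np
from math import log, factorial

def lnn_count(x, X):
    """Number of layered nearest neighbors of x among the rows of X (brute force)."""
    count = 0
    for i in range(len(X)):
        lo, hi = np.minimum(x, X[i]), np.maximum(x, X[i])
        inside = np.all((X > lo) & (X < hi), axis=1)
        count += not inside.any()
    return count

rng = np.random.default_rng(3)
d, n, reps = 2, 800, 20
x = np.full(d, 0.5)
mean_ln = np.mean([lnn_count(x, rng.uniform(size=(n, d))) for _ in range(reps)])
theory = 2**d * log(n) ** (d - 1) / factorial(d - 1)
print(mean_ln, theory)   # same order of magnitude; agreement only as n grows
```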
Two results (Biau and Devroye, 2010)
Model
X_1, …, X_n are independently distributed according to some probability density f (with probability measure µ).
Theorem
For µ-almost all x ∈ R^d, one has
L_n(x) → ∞ in probability as n → ∞.
Theorem
Suppose that f is λ-almost everywhere continuous. Then
(d−1)! E L_n(x) / (2^d (log n)^{d−1}) → 1 as n → ∞,
at µ-almost all x ∈ R^d.
G. Biau (UPMC) 101 / 106
LNN regression estimation
Model
(X, Y), (X_1, Y_1), …, (X_n, Y_n) are i.i.d. random vectors of R^d × R. Moreover, |Y| is bounded and X has a density.
The regression function r(x) = E[Y | X = x] may be estimated by
r_n(x) = (1/L_n(x)) ∑_{i=1}^n Y_i 1[X_i ∈ L_n(x)],
where L_n(x) denotes the set of LNN of x (and, by abuse of notation, its cardinality).
1 No smoothing parameter.
2 A scale-invariant estimate.
3 Intimately connected to Breiman's random forests.
G. Biau (UPMC) 102 / 106
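A direct sketch of r_n (brute force; the name `lnn_regress` is our own). The last call also illustrates the scale-invariance claim: rescaling a coordinate, for both the query point and the data, leaves the LNN set, and hence the estimate, unchanged.

```python
import numpy as np

def lnn_regress(x, X, Y):
    """r_n(x): average of the Y_i over the layered nearest neighbors of x."""
    idx = [i for i in range(len(X))
           if not np.any(np.all((X > np.minimum(x, X[i]))
                                & (X < np.maximum(x, X[i])), axis=1))]
    return float(np.mean(Y[idx]))

rng = np.random.default_rng(4)
X = rng.uniform(size=(2000, 2))
Y = X[:, 0] + X[:, 1] + 0.1 * rng.normal(size=2000)   # r(x) = x1 + x2
x0 = np.array([0.5, 0.5])
est = lnn_regress(x0, X, Y)
scaled = lnn_regress(x0 * np.array([4.0, 1.0]), X * np.array([4.0, 1.0]), Y)
print(est, scaled)   # identical estimates: no smoothing parameter, scale-invariant
```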
Consistency
Theorem (Pointwise L_p-consistency)
Assume that the regression function r is λ-almost everywhere continuous and that Y is bounded. Then, for µ-almost all x ∈ R^d and all p ≥ 1,
E|r_n(x) − r(x)|^p → 0 as n → ∞.
Theorem (Global L_p-consistency)
Under the same conditions, for all p ≥ 1,
E|r_n(X) − r(X)|^p → 0 as n → ∞.
1 No universal consistency result with respect to r is possible.
2 The results do not impose any condition on the density.
3 They are also scale-free.
G. Biau (UPMC) 103 / 106
Back to random forests
A random forest can be viewed as a weighted LNN regression estimate
r̄_n(x) = ∑_{i=1}^n Y_i W_ni(x),
where the weights concentrate on the LNN and satisfy
∑_{i=1}^n W_ni(x) = 1.
G. Biau (UPMC) 104 / 106
Non-adaptive strategies
Consider the non-adaptive random forests estimate
r̄_n(x) = ∑_{i=1}^n Y_i W_ni(x),
where the weights concentrate on the LNN.
Proposition
For any x ∈ R^d, assume that σ² = V[Y | X = x] is independent of x. Then
E[r̄_n(x) − r(x)]² ≥ σ² / E L_n(x).
G. Biau (UPMC) 105 / 106
Rate of convergence
At µ-almost all x, when f is λ-almost everywhere continuous,
E[r̄_n(x) − r(x)]² ≳ σ² (d−1)! / (2^d (log n)^{d−1}).
Improving the rate of convergence
1 Stop as soon as a future rectangle split would cause a sub-rectangle to have fewer than k_n points.
2 Resort to bagging and randomize using random subsamples.
G. Biau (UPMC) 106 / 106