Exponentially weighted aggregation: Laplace prior for linear regression

Arnak Dalalyan, Edwin Grappin & Quentin Paris
edwin.grappin@ensae.fr
JPS - Les Houches - 2016
Goals & settings
We observe n labels (Yᵢ)_{i ∈ {1,…,n}}, and there is a linear relation between the label and the p features (X_{i,j})_{j ∈ {1,…,p}}:

Y = Xβ⋆ + ξ,

where Y ∈ ℝⁿ, X ∈ ℝ^{n×p}, β⋆ ∈ ℝᵖ, and ξ ∈ ℝⁿ is a random vector with ξᵢ ~ N(0, σ²).

Our interests are:
- low prediction loss ‖X(β⋆ − β̂)‖₂² (estimating β⋆ itself is less important),
- good performance when p is large (p ≫ n),
- efficient use of the sparsity of β⋆ (β⋆ is s-sparse if at most s of its entries are non-zero).
Least squares method
The ordinary least squares (OLS) estimator is defined by:

β̂_OLS = arg min_{β ∈ ℝᵖ} ‖Y − Xβ‖₂².

OLS minimizes the sum of the squared residuals.

Overfitting. When p is very large, OLS gives poor predictions:
- the solution is not unique when p > n,
- it does not detect the meaningful features among all features,
- it focuses on fitting the observed data rather than predicting new labels.
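As a quick numerical illustration (a sketch, not from the slides; the simulated design and all names are assumptions), the minimum-norm least squares fit interpolates the training labels when p > n yet predicts fresh data poorly:

```python
import numpy as np

rng = np.random.default_rng(0)
n, p, s = 50, 200, 5                         # p >> n; beta_star is s-sparse
X = rng.standard_normal((n, p))
beta_star = np.zeros(p)
beta_star[:s] = 3.0
y = X @ beta_star + rng.standard_normal(n)   # noise with sigma = 1

# Minimum-norm least squares solution (one of infinitely many when p > n).
beta_ols, *_ = np.linalg.lstsq(X, y, rcond=None)
print("train residual:", np.linalg.norm(y - X @ beta_ols))   # ~0: perfect fit

# Fresh data from the same model: the prediction error is large.
X_new = rng.standard_normal((n, p))
y_new = X_new @ beta_star + rng.standard_normal(n)
print("test error:", np.linalg.norm(y_new - X_new @ beta_ols) ** 2 / n)
```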
Penalized regression
In our setting, a good estimator has the following properties:
- guarantees on its prediction performance,
- use of the sparsity assumption to handle p > n,
- computational speed (of paramount importance when p is large).

Penalized regression combines the usual fitting term with a penalty term:

β̂_pen = arg min_{β ∈ ℝᵖ} ( ‖Y − Xβ‖₂² + λ P(β) ),

where P is the penalty function and λ ≥ 0 controls the trade-off between the two terms.
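To make the template concrete, here is a minimal sketch (illustrative, not the authors' code; `penalized_objective` is a hypothetical helper) with the penalty P passed in as a function:

```python
import numpy as np

def penalized_objective(beta, X, y, lam, penalty):
    """Fitting term plus a generic penalty term lam * P(beta)."""
    return np.sum((y - X @ beta) ** 2) + lam * penalty(beta)

# Two classical choices of P:
ridge = lambda b: np.sum(b ** 2)      # convex, shrinks but keeps every feature
lasso = lambda b: np.sum(np.abs(b))   # convex and induces exact zeros
```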
Subset selection with an ℓ₀ penalization
An intuitive candidate is a penalty based on the ℓ₀ pseudo-norm (the sparsity level):

‖β‖₀ = ∑_{i=1}^{p} 1{βᵢ ≠ 0}.

β̂_ℓ₀ = arg min_{β ∈ ℝᵖ} ( ‖Y − Xβ‖₂² + λ ‖β‖₀ ).

The penalty forces many entries of β̂ to be zero: it selects the most important features. However, because of the ℓ₀ pseudo-norm, the objective function is non-convex, so the computational cost grows exponentially with p.
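A brute-force sketch of the ℓ₀ problem (an illustrative assumption, not a practical method): enumerating all 2ᵖ supports makes the exponential cost explicit.

```python
import numpy as np
from itertools import combinations

def best_subset(X, y, lam):
    """Exact l0-penalized least squares by enumerating every support.

    Cost is O(2^p), so this is feasible only for very small p.
    """
    n, p = X.shape
    best_val, best_beta = np.sum(y ** 2), np.zeros(p)   # empty support
    for k in range(1, p + 1):
        for support in combinations(range(p), k):
            idx = list(support)
            b, *_ = np.linalg.lstsq(X[:, idx], y, rcond=None)
            beta = np.zeros(p)
            beta[idx] = b
            val = np.sum((y - X @ beta) ** 2) + lam * k
            if val < best_val:
                best_val, best_beta = val, beta
    return best_beta
```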
Choice of the penalization term
For q > 0, consider the family of estimators

β̂_q = arg min_{β ∈ ℝᵖ} ( ‖Y − Xβ‖₂² + λ ‖β‖_q^q ).

- If q < 1, the solution is sparse but the problem is non-convex.
- If q > 1, the problem is convex but the solution is not sparse.
- If q = 1, the solution is sparse and the problem is convex.
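A one-dimensional sketch of why q = 1 is the sweet spot (illustrative; `prox_l1` and `prox_l2sq` are hypothetical helper names): the proximal step of the ℓ₁ penalty is soft-thresholding and produces exact zeros, while the squared-ℓ₂ step only shrinks.

```python
import numpy as np

def prox_l1(z, t):
    """argmin_b (b - z)^2 / 2 + t*|b|: soft-thresholding, returns exact zeros."""
    return np.sign(z) * np.maximum(np.abs(z) - t, 0.0)

def prox_l2sq(z, t):
    """argmin_b (b - z)^2 / 2 + t*b^2: pure shrinkage, never exactly zero."""
    return z / (1.0 + 2.0 * t)

z = np.array([-1.5, -0.3, 0.1, 0.8, 2.0])
print(prox_l1(z, 0.5))    # [-1.  -0.   0.   0.3  1.5] -> small entries zeroed
print(prox_l2sq(z, 0.5))  # all entries shrunk toward 0, none exactly zero
```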
Lasso, the ℓ₁ norm
The Lasso estimator is defined by:

β̂_L = arg min_{β ∈ ℝᵖ} ( ‖Y − Xβ‖₂² / (2n) + λ ‖β‖₁ ).

Theorem (Dalalyan et al. (2014), On the Prediction Performance of the Lasso).
Let λ = 2σ √(2 log(p/δ) / n). Then, with probability at least 1 − δ,

‖X(β⋆ − β̂_L)‖₂² / n ≤ inf_{β ∈ ℝᵖ s-sparse} ( ‖X(β⋆ − β)‖₂² / n + 10 s σ² log(p/δ) / (n κ) ),

where κ is a constant depending on the design X.
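A minimal sketch of fitting the Lasso with the theorem's tuning (the simulated data are illustrative, and plugging in λ assumes the noise level σ is known). scikit-learn's Lasso minimizes ‖Y − Xβ‖₂²/(2n) + α‖β‖₁, which matches the normalization above:

```python
import numpy as np
from sklearn.linear_model import Lasso

rng = np.random.default_rng(0)
n, p, s, sigma, delta = 100, 500, 5, 1.0, 0.05
X = rng.standard_normal((n, p))
beta_star = np.zeros(p)
beta_star[:s] = 3.0
y = X @ beta_star + sigma * rng.standard_normal(n)

# Tuning parameter from the theorem (assumes sigma is known).
lam = 2 * sigma * np.sqrt(2 * np.log(p / delta) / n)

model = Lasso(alpha=lam, fit_intercept=False).fit(X, y)
beta_hat = model.coef_

print("non-zero coefficients:", int(np.sum(beta_hat != 0)))
print("prediction loss:", np.linalg.norm(X @ (beta_star - beta_hat)) ** 2 / n)
```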
EWA: definition
The Lasso estimator is a maximum a posteriori estimator under a Laplace prior:

β̂_L = arg min_{β ∈ ℝᵖ} ( ‖Y − Xβ‖₂² / (2n) + λ ‖β‖₁ )
    = arg max_{β ∈ ℝᵖ} [ exp( −‖Y − Xβ‖₂² / (2σ²) ) · exp( −(λn/σ²) ‖β‖₁ ) ],

where the first factor is proportional to the N(Xβ, σ²Iₙ) likelihood and the second factor is proportional to the Laplace prior π₀(β).

Let V(β) = ‖Y − Xβ‖₂² / (2σ²) + (λn/σ²) ‖β‖₁ and π̂_T(β) ∝ exp( −V(β) / T ). We define the exponentially weighted average (EWA) estimator with Laplace prior by

β̂_EWA = ∫_{ℝᵖ} β π̂_T(β) dβ.
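Since β̂_EWA is the mean of π̂_T, it can be approximated by sampling. Below is a minimal random-walk Metropolis sketch; this is an illustrative assumption, not the authors' algorithm (in practice one would rather use e.g. Langevin Monte Carlo), and it is only reasonable for small p:

```python
import numpy as np

def ewa_estimate(X, y, lam, sigma, T, n_iter=50_000, step=0.05, seed=0):
    """Approximate beta_EWA = E_{pi_T}[beta] by random-walk Metropolis.

    The target pi_T(beta) is proportional to exp(-V(beta) / T), with
    V(beta) = ||y - X beta||^2 / (2 sigma^2) + (lam*n / sigma^2) * ||beta||_1.
    """
    n, p = X.shape
    V = lambda b: (np.sum((y - X @ b) ** 2) / (2 * sigma ** 2)
                   + (lam * n / sigma ** 2) * np.sum(np.abs(b)))
    rng = np.random.default_rng(seed)
    beta, v = np.zeros(p), V(np.zeros(p))
    total = np.zeros(p)
    for _ in range(n_iter):
        prop = beta + step * rng.standard_normal(p)
        v_prop = V(prop)
        # Accept with probability exp(-(v_prop - v) / T).
        if np.log(rng.uniform()) < (v - v_prop) / T:
            beta, v = prop, v_prop
        total += beta
    return total / n_iter   # Monte Carlo estimate of the posterior mean

# As T -> 0, pi_T concentrates on the minimizer of V, i.e. the Lasso.
```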
Results
Theorem. Let λ = 2σ √(2 log(p/δ) / n). Then, with probability at least 1 − δ,

‖X(β⋆ − β̂_EWA)‖₂² / n ≤ inf_{β ∈ ℝᵖ s-sparse} ( ‖X(β⋆ − β)‖₂² / n + 10 s σ² log(p/δ) / (n κ) ) + 2 H(T),

where

H(T) = pT − ∫ G(β) π̂_T(β) dβ + G(β̂_EWA),

and G(β) = ‖Xβ‖₂² / n + λ ‖β‖₁. Since G is convex and β̂_EWA is the mean of π̂_T, Jensen's inequality gives G(β̂_EWA) ≤ ∫ G(β) π̂_T(β) dβ, hence H(T) ≤ pT.
The choice of T
- If T = 0, then β̂_EWA = β̂_L.
- We are interested in T < 1/p; recall that H(T) ≤ pT.
- The larger T is, the larger the variance of the posterior π̂_T.
- We believe that this variance brings robustness to the choice of λ.
Conclusion & questions
Results:
- EWA with a Laplace prior is a family of estimators that includes the Lasso.
- A sharp oracle inequality holds for this family of estimators.

Questions:
- What is a good value of T?
- Can we prove a result on the robustness with respect to λ?
- Can we compute this estimator efficiently?

Thank you!