Exponentially weighted aggregation: Laplace prior for linear regression

Arnak Dalalyan, Edwin Grappin & Quentin Paris
edwin.grappin@ensae.fr
JPS - Les Houches - 2016
Goals & settings
We observe n labels (Yᵢ)_{i ∈ {1,…,n}}, and there is a linear relation between the label and the p features (X_{i,j})_{j ∈ {1,…,p}}:

Y = Xβ⋆ + ξ,

where Y ∈ ℝⁿ, X ∈ ℝ^{n×p}, β⋆ ∈ ℝᵖ, and ξ ∈ ℝⁿ is a random vector with ξᵢ ~ N(0, σ²).

Our interests are:
- low prediction loss ‖X(β⋆ − β̂)‖₂² (estimating β⋆ itself is less important),
- good performance when p is large (p ≫ n),
- efficient use of the sparsity of β⋆ (β⋆ is s-sparse if at most s of its entries are non-zero).
Least squares method
The ordinary least squares (OLS) estimator is defined by:

β̂_OLS = arg min_{β ∈ ℝᵖ} ‖Y − Xβ‖₂².

OLS minimizes the sum of the squared residuals.

Overfitting. When p is very large, OLS gives poor predictions:
- the solution is not unique when p > n,
- it does not detect the meaningful features among all features,
- it focuses on fitting the observed data rather than predicting new labels.
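As a quick numerical illustration (a sketch, not from the slides; the simulated design and all names are assumptions), the minimum-norm least squares fit interpolates the training labels when p > n yet predicts fresh data poorly:

```python
import numpy as np

rng = np.random.default_rng(0)
n, p, s = 50, 200, 5                         # p >> n; beta_star is s-sparse
X = rng.standard_normal((n, p))
beta_star = np.zeros(p)
beta_star[:s] = 3.0
y = X @ beta_star + rng.standard_normal(n)   # noise with sigma = 1

# Minimum-norm least squares solution (one of infinitely many when p > n).
beta_ols, *_ = np.linalg.lstsq(X, y, rcond=None)
print("train residual:", np.linalg.norm(y - X @ beta_ols))   # ~0: perfect fit

# Fresh data from the same model: the prediction error is large.
X_new = rng.standard_normal((n, p))
y_new = X_new @ beta_star + rng.standard_normal(n)
print("test error:", np.linalg.norm(y_new - X_new @ beta_ols) ** 2 / n)
```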
Penalized regression
In our setting, a good estimator has the following properties:
- guarantees on its prediction performance,
- use of the sparsity assumption to handle p > n,
- computational speed (of paramount importance when p is large).

Penalized regression combines the usual fitting term with a penalty term:

β̂_pen = arg min_{β ∈ ℝᵖ} ( ‖Y − Xβ‖₂² + λ P(β) ),

where P is the penalty function and λ ≥ 0 controls the trade-off between the two terms.
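To make the template concrete, here is a minimal sketch (illustrative, not the authors' code; `penalized_objective` is a hypothetical helper) with the penalty P passed in as a function:

```python
import numpy as np

def penalized_objective(beta, X, y, lam, penalty):
    """Fitting term plus a generic penalty term lam * P(beta)."""
    return np.sum((y - X @ beta) ** 2) + lam * penalty(beta)

# Two classical choices of P:
ridge = lambda b: np.sum(b ** 2)      # convex, shrinks but keeps every feature
lasso = lambda b: np.sum(np.abs(b))   # convex and induces exact zeros
```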
Subset selection with an ℓ₀ penalization
An intuitive candidate is a penalty based on the ℓ₀ pseudo-norm (the sparsity level):

‖β‖₀ = ∑_{i=1}^{p} 1{βᵢ ≠ 0}.

β̂_ℓ₀ = arg min_{β ∈ ℝᵖ} ( ‖Y − Xβ‖₂² + λ ‖β‖₀ ).

The penalty forces many entries of β̂ to be zero: it selects the most important features. However, because of the ℓ₀ pseudo-norm, the objective function is non-convex, so the computational cost grows exponentially with p.
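A brute-force sketch of the ℓ₀ problem (an illustrative assumption, not a practical method): enumerating all 2ᵖ supports makes the exponential cost explicit.

```python
import numpy as np
from itertools import combinations

def best_subset(X, y, lam):
    """Exact l0-penalized least squares by enumerating every support.

    Cost is O(2^p), so this is feasible only for very small p.
    """
    n, p = X.shape
    best_val, best_beta = np.sum(y ** 2), np.zeros(p)   # empty support
    for k in range(1, p + 1):
        for support in combinations(range(p), k):
            idx = list(support)
            b, *_ = np.linalg.lstsq(X[:, idx], y, rcond=None)
            beta = np.zeros(p)
            beta[idx] = b
            val = np.sum((y - X @ beta) ** 2) + lam * k
            if val < best_val:
                best_val, best_beta = val, beta
    return best_beta
```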
Choice of the penalization term
For q > 0, consider the family of estimators

β̂_q = arg min_{β ∈ ℝᵖ} ( ‖Y − Xβ‖₂² + λ ‖β‖_q^q ).

- If q < 1, the solution is sparse but the problem is non-convex.
- If q > 1, the problem is convex but the solution is not sparse.
- If q = 1, the solution is sparse and the problem is convex.
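A one-dimensional sketch of why q = 1 is the sweet spot (illustrative; `prox_l1` and `prox_l2sq` are hypothetical helper names): the proximal step of the ℓ₁ penalty is soft-thresholding and produces exact zeros, while the squared-ℓ₂ step only shrinks.

```python
import numpy as np

def prox_l1(z, t):
    """argmin_b (b - z)^2 / 2 + t*|b|: soft-thresholding, returns exact zeros."""
    return np.sign(z) * np.maximum(np.abs(z) - t, 0.0)

def prox_l2sq(z, t):
    """argmin_b (b - z)^2 / 2 + t*b^2: pure shrinkage, never exactly zero."""
    return z / (1.0 + 2.0 * t)

z = np.array([-1.5, -0.3, 0.1, 0.8, 2.0])
print(prox_l1(z, 0.5))    # [-1.  -0.   0.   0.3  1.5] -> small entries zeroed
print(prox_l2sq(z, 0.5))  # all entries shrunk toward 0, none exactly zero
```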
Lasso, the ℓ₁ norm
The Lasso estimator is defined by:

β̂_L = arg min_{β ∈ ℝᵖ} ( ‖Y − Xβ‖₂² / (2n) + λ ‖β‖₁ ).

Theorem (Dalalyan et al. (2014), On the Prediction Performance of the Lasso).
Let λ = 2σ √(2 log(p/δ) / n). Then, with probability at least 1 − δ,

‖X(β⋆ − β̂_L)‖₂² / n ≤ inf_{β ∈ ℝᵖ s-sparse} ( ‖X(β⋆ − β)‖₂² / n + 10 s σ² log(p/δ) / (n κ) ),

where κ is a constant depending on the design X.
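A minimal sketch of fitting the Lasso with the theorem's tuning (the simulated data are illustrative, and plugging in λ assumes the noise level σ is known). scikit-learn's Lasso minimizes ‖Y − Xβ‖₂²/(2n) + α‖β‖₁, which matches the normalization above:

```python
import numpy as np
from sklearn.linear_model import Lasso

rng = np.random.default_rng(0)
n, p, s, sigma, delta = 100, 500, 5, 1.0, 0.05
X = rng.standard_normal((n, p))
beta_star = np.zeros(p)
beta_star[:s] = 3.0
y = X @ beta_star + sigma * rng.standard_normal(n)

# Tuning parameter from the theorem (assumes sigma is known).
lam = 2 * sigma * np.sqrt(2 * np.log(p / delta) / n)

model = Lasso(alpha=lam, fit_intercept=False).fit(X, y)
beta_hat = model.coef_

print("non-zero coefficients:", int(np.sum(beta_hat != 0)))
print("prediction loss:", np.linalg.norm(X @ (beta_star - beta_hat)) ** 2 / n)
```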
EWA: definition
The Lasso estimator is a maximum a posteriori estimator under a Laplace prior:

β̂_L = arg min_{β ∈ ℝᵖ} ( ‖Y − Xβ‖₂² / (2n) + λ ‖β‖₁ )
    = arg max_{β ∈ ℝᵖ} [ exp( −‖Y − Xβ‖₂² / (2σ²) ) · exp( −(λn/σ²) ‖β‖₁ ) ],

where the first factor is proportional to the N(Xβ, σ²Iₙ) likelihood and the second factor is proportional to the Laplace prior π₀(β).

Let V(β) = ‖Y − Xβ‖₂² / (2σ²) + (λn/σ²) ‖β‖₁ and π̂_T(β) ∝ exp( −V(β) / T ). We define the exponentially weighted average (EWA) estimator with Laplace prior by

β̂_EWA = ∫_{ℝᵖ} β π̂_T(β) dβ.
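Since β̂_EWA is the mean of π̂_T, it can be approximated by sampling. Below is a minimal random-walk Metropolis sketch; this is an illustrative assumption, not the authors' algorithm (in practice one would rather use e.g. Langevin Monte Carlo), and it is only reasonable for small p:

```python
import numpy as np

def ewa_estimate(X, y, lam, sigma, T, n_iter=50_000, step=0.05, seed=0):
    """Approximate beta_EWA = E_{pi_T}[beta] by random-walk Metropolis.

    The target pi_T(beta) is proportional to exp(-V(beta) / T), with
    V(beta) = ||y - X beta||^2 / (2 sigma^2) + (lam*n / sigma^2) * ||beta||_1.
    """
    n, p = X.shape
    V = lambda b: (np.sum((y - X @ b) ** 2) / (2 * sigma ** 2)
                   + (lam * n / sigma ** 2) * np.sum(np.abs(b)))
    rng = np.random.default_rng(seed)
    beta, v = np.zeros(p), V(np.zeros(p))
    total = np.zeros(p)
    for _ in range(n_iter):
        prop = beta + step * rng.standard_normal(p)
        v_prop = V(prop)
        # Accept with probability exp(-(v_prop - v) / T).
        if np.log(rng.uniform()) < (v - v_prop) / T:
            beta, v = prop, v_prop
        total += beta
    return total / n_iter   # Monte Carlo estimate of the posterior mean

# As T -> 0, pi_T concentrates on the minimizer of V, i.e. the Lasso.
```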
Results
Theorem. Let λ = 2σ √(2 log(p/δ) / n). Then, with probability at least 1 − δ,

‖X(β⋆ − β̂_EWA)‖₂² / n ≤ inf_{β ∈ ℝᵖ s-sparse} ( ‖X(β⋆ − β)‖₂² / n + 10 s σ² log(p/δ) / (n κ) ) + 2 H(T),

where

H(T) = pT − ∫ G(β) π̂_T(β) dβ + G(β̂_EWA),

and G(β) = ‖Xβ‖₂² / n + λ ‖β‖₁. Since G is convex and β̂_EWA is the mean of π̂_T, Jensen's inequality gives G(β̂_EWA) ≤ ∫ G(β) π̂_T(β) dβ, hence H(T) ≤ pT.
The choice of T
- If T = 0, then β̂_EWA = β̂_L.
- We are interested in T < 1/p; recall that H(T) ≤ pT.
- The larger T is, the larger the variance of the posterior π̂_T.
- We believe that this variance brings robustness to the choice of λ.
Conclusion & questions
Results:
- EWA with a Laplace prior is a family of estimators that includes the Lasso.
- A sharp oracle inequality holds for this family of estimators.

Questions:
- What is a good value of T?
- Can we prove a result on the robustness with respect to λ?
- Can we compute this estimator efficiently?

Thank you!