Sparse Methods in Statistical Learning

Guillaume Obozinski

INRIA - Ecole Normale Superieure - Paris

E.S. Ecole des Mines, 16 April 2010

Outline

1 Overfitting and Regularization

2 ℓ1 Regularization

3 Block Sparsity

4 Dictionary Learning and Sparse PCA
  Image denoising and restoration
  Structured sparse dictionaries for face recognition
  Hierarchical dictionaries


High dimension

Key characteristic of learning problems: a large number of variables/parameters p relative to the number of available observations n.

$p \lesssim n$


Overfitting

Linear regression: $Y = w_0 + w_1 X + w_2 X^2 + \dots + w_p X^p + \varepsilon$

\[
\min_w \; \frac{1}{2n} \sum_{i=1}^n \bigl( y_i - (w_0 + w_1 x_i + w_2 x_i^2 + \dots + w_p x_i^p) \bigr)^2
\]

[Figure: four panels showing fits for M = 0, M = 1, M = 3 and M = 9; horizontal axis x, vertical axis t.]
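The polynomial least-squares problem above can be tried numerically. The following is a minimal sketch, assuming NumPy is available; the sine-plus-noise data, the noise level 0.3 and the helper names `fit_polynomial` and `mean_squared_error` are illustrative assumptions, not taken from the slides. It fits degrees M = 0, 1, 3, 9 and reports the training and test errors.

```python
import numpy as np

def fit_polynomial(x, t, degree):
    """Least-squares fit of t ~ w0 + w1*x + ... + w_degree * x**degree."""
    # Design matrix with columns 1, x, x^2, ..., x^degree.
    Phi = np.vander(x, degree + 1, increasing=True)
    w, *_ = np.linalg.lstsq(Phi, t, rcond=None)
    return w

def mean_squared_error(w, x, t):
    Phi = np.vander(x, len(w), increasing=True)
    return float(np.mean((Phi @ w - t) ** 2))

# Hypothetical data: a noisy sine curve (an assumption for illustration).
rng = np.random.default_rng(0)
n = 10
x_train = np.linspace(0.0, 1.0, n)
t_train = np.sin(2 * np.pi * x_train) + 0.3 * rng.standard_normal(n)
x_test = np.linspace(0.0, 1.0, 200)
t_test = np.sin(2 * np.pi * x_test) + 0.3 * rng.standard_normal(200)

train_errors, test_errors = {}, {}
for M in (0, 1, 3, 9):
    w = fit_polynomial(x_train, t_train, M)
    train_errors[M] = mean_squared_error(w, x_train, t_train)
    test_errors[M] = mean_squared_error(w, x_test, t_test)
    print(M, train_errors[M], test_errors[M])
```

Because the models are nested, the training error can only decrease as M grows, and with 10 points a degree-9 polynomial interpolates the data. The test error is the quantity to watch for overfitting.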


Overfitting

$p \lesssim n$ vs $p \ll n$

[Figure: two panels showing fits with N = 15 and N = 100 data points; horizontal axis x, vertical axis t.]

Poor generalization:

[Figure: error E_RMS as a function of M, for the training and test sets.]

⇒ Control the complexity of the models

Reduce the number of variables

Regularization


Regularization

Tikhonov regularization

\[
\frac{1}{2n} \sum_{i=1}^n \ell\bigl(y_i, w^\top \Phi(x_i)\bigr) + \frac{\lambda}{2n} \|w\|_2^2
\]

$\lambda$: regularization coefficient

Convex problem if $\ell$ is convex

Shrinkage method

Regression case: ridge regression

Effect of regularization, p = 9, n = 10

[Figure: two panels showing fits for ln λ = −18 and ln λ = 0; horizontal axis x, vertical axis t.]
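For the square loss, the Tikhonov-regularized problem has a closed-form solution. The sketch below assumes NumPy, polynomial features of degree 9 for Φ and a hypothetical noisy-sine data set; the function name `ridge_fit` is also an assumption. Setting the gradient of $\frac{1}{2n}\|y - \Phi w\|^2 + \frac{\lambda}{2n}\|w\|_2^2$ to zero gives $(\Phi^\top \Phi + \lambda I) w = \Phi^\top y$.

```python
import numpy as np

def ridge_fit(Phi, y, lam):
    """Minimize (1/2n)*||y - Phi w||^2 + (lam/2n)*||w||^2 in closed form."""
    p = Phi.shape[1]
    # Stationarity condition: (Phi^T Phi + lam * I) w = Phi^T y.
    return np.linalg.solve(Phi.T @ Phi + lam * np.eye(p), Phi.T @ y)

# Hypothetical data (assumption): noisy sine, degree-9 polynomial features.
rng = np.random.default_rng(0)
n, degree = 10, 9
x = np.linspace(0.0, 1.0, n)
y = np.sin(2 * np.pi * x) + 0.3 * rng.standard_normal(n)
Phi = np.vander(x, degree + 1, increasing=True)

w_weak = ridge_fit(Phi, y, np.exp(-18.0))   # ln(lambda) = -18: almost no shrinkage
w_strong = ridge_fit(Phi, y, np.exp(0.0))   # ln(lambda) = 0: strong shrinkage
print(np.linalg.norm(w_weak), np.linalg.norm(w_strong))
```

The norm of the solution shrinks as λ grows, which is the "shrinkage" effect named on the slide.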


Link with the SVM


Variable selection

Principle of parsimony, Occam's razor,

“pluralitas non est ponenda sine necessitate”

Interpretability

Controlling complexity

For $x_i \in \mathbb{R}^p$

\[
\frac{1}{2n} \sum_{i=1}^n \ell(y_i, w^\top x_i) + \lambda \, \bigl|\{ j \mid w_j \neq 0 \}\bigr|
\]

Non-convex, discontinuous

Theoretical work on the choice of λ: Akaike Information Criterion (AIC), Bayesian Information Criterion (BIC), Minimum Description Length (MDL) theory.

For $|\{ j \mid w_j \neq 0 \}| = k$ fixed, there are $\binom{p}{k}$ possible models
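To get a feel for why exhaustive search over supports is impractical, the binomial count can be evaluated directly. This is a small sketch using only the Python standard library; the function name `count_models` and the chosen (p, k) pairs are illustrative assumptions.

```python
from math import comb

def count_models(p, k):
    """Number of supports {j : w_j != 0} of size exactly k among p variables."""
    return comb(p, k)

for p, k in [(10, 3), (100, 5), (1000, 10)]:
    print(p, k, count_models(p, k))

# Summing over all support sizes enumerates every subset of the variables.
assert sum(count_models(20, k) for k in range(21)) == 2 ** 20
```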


Outline


2 ℓ1 Regularization




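A sparsity-inducing regularization

LASSO (Tibshirani, 1996)

\[
\frac{1}{2n} \sum_{i=1}^n (y_i - w^\top x_i)^2 + \lambda \|w\|_1
\]

Least Absolute Shrinkage and Selection Operator

Intuition

\[
\frac{1}{2n} \sum_{i=1}^n (y_i - w^\top x_i)^2 \quad \text{s.t. } \|w\|_1 \leq C(\lambda)
\]

[Figure: two panels with axes $w_1$ and $w_2$.]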


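Directional derivatives for ℓ1 regularization

• Function $J(w) = \sum_{i=1}^n \ell(y_i, w^\top x_i) + \lambda \|w\|_1 = L(w) + \lambda \|w\|_1$

• ℓ1-norm:
\[
\|w + \varepsilon\Delta\|_1 - \|w\|_1 = \sum_{j,\, w_j \neq 0} \bigl\{ |w_j + \varepsilon\Delta_j| - |w_j| \bigr\} + \sum_{j,\, w_j = 0} |\varepsilon\Delta_j|
\]

• Thus,
\[
\nabla J(w, \Delta) = \nabla L(w)^\top \Delta + \lambda \sum_{j,\, w_j \neq 0} \operatorname{sign}(w_j)\Delta_j + \lambda \sum_{j,\, w_j = 0} |\Delta_j|
\]
\[
= \sum_{j,\, w_j \neq 0} \bigl[ \nabla L(w)_j + \lambda \operatorname{sign}(w_j) \bigr] \Delta_j + \sum_{j,\, w_j = 0} \bigl[ \nabla L(w)_j \Delta_j + \lambda |\Delta_j| \bigr]
\]

• Separability of optimality conditions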


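Optimality conditions for ℓ1 regularization

• General loss: w optimal if and only if for all $j \in \{1, \dots, p\}$,

\[
\operatorname{sign}(w_j) \neq 0 \;\Rightarrow\; \nabla L(w)_j + \lambda \operatorname{sign}(w_j) = 0
\]
\[
\operatorname{sign}(w_j) = 0 \;\Rightarrow\; |\nabla L(w)_j| \leq \lambda
\]

• Square loss: w optimal if and only if for all $j \in \{1, \dots, p\}$,

\[
\operatorname{sign}(w_j) \neq 0 \;\Rightarrow\; -X_j^\top (y - Xw) + \lambda \operatorname{sign}(w_j) = 0
\]
\[
\operatorname{sign}(w_j) = 0 \;\Rightarrow\; |X_j^\top (y - Xw)| \leq \lambda
\]

– For $J \subset \{1, \dots, p\}$, $X_J \in \mathbb{R}^{n \times |J|} = X(:, J)$ denotes the columns of X indexed by J, i.e., variables indexed by J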
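The square-loss conditions can be checked numerically for a candidate w. The sketch below assumes NumPy and takes $L(w) = \frac{1}{2}\|y - Xw\|^2$, so that $\nabla L(w) = -X^\top(y - Xw)$; the function name `lasso_kkt_violation` and the toy data are assumptions for illustration. With X equal to the identity, the minimizer is known in closed form (soft-thresholding of y), which gives a simple sanity check.

```python
import numpy as np

def lasso_kkt_violation(X, y, w, lam):
    """Largest violation of the square-loss optimality conditions.

    Uses L(w) = 0.5 * ||y - X w||^2, so grad L(w) = -X^T (y - X w).
    The result is 0 (up to rounding) when w satisfies the conditions.
    """
    corr = X.T @ (y - X @ w)                 # equals -grad L(w)
    active = w != 0
    # Active coordinates: -corr_j + lam * sign(w_j) must vanish.
    viol_active = np.abs(-corr[active] + lam * np.sign(w[active]))
    # Inactive coordinates: |corr_j| must not exceed lam.
    viol_inactive = np.maximum(np.abs(corr[~active]) - lam, 0.0)
    return float(max(viol_active.max(initial=0.0),
                     viol_inactive.max(initial=0.0)))

# Toy example (assumption): identity design, so w* = soft-threshold(y, lam).
y = np.array([3.0, -0.5, 1.2, -2.0])
X = np.eye(4)
lam = 1.0
w_star = np.sign(y) * np.maximum(np.abs(y) - lam, 0.0)
print(lasso_kkt_violation(X, y, w_star, lam))   # optimal point
print(lasso_kkt_violation(X, y, y, lam))        # unregularized solution, not optimal
```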


Optimization methods

For the Lasso, the problem is a quadratic program:

\[
\frac{1}{2}\|y - Xw\|^2 + \lambda \sum_{j=1}^p (w_j^+ + w_j^-) \quad \text{such that } w = w^+ - w^-,\; w^+ \geq 0,\; w^- \geq 0
\]

Generic solvers ⇒ too slow

Coordinate descent (Fu, 1998; Wu and Lange, 2008; Friedman et al., 2007)

convergent here under reasonable assumptions!

“η-trick” (Micchelli and Pontil, 2006; Rakotomamonjy et al., 2008)

We have
\[
\sum_{j=1}^p |w_j| = \min_{\eta \geq 0} \frac{1}{2} \sum_{j=1}^p \Bigl\{ \frac{w_j^2}{\eta_j} + \eta_j \Bigr\}
\]

One can use alternating minimization in η (closed form) and w (ridge-type problem)

Dedicated algorithms that use sparsity (active sets and homotopy methods)
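As an illustration of the coordinate descent approach listed above, here is a minimal sketch for the objective $\frac{1}{2}\|y - Xw\|^2 + \lambda\|w\|_1$, assuming NumPy. It is a plain cyclic variant with a fixed number of sweeps, not the exact procedure of any of the cited papers; the function names, the synthetic data and the value of `lam` are assumptions. Each coordinate update is a one-dimensional Lasso problem solved by soft-thresholding.

```python
import numpy as np

def soft_threshold(z, tau):
    """Scalar soft-thresholding: sign(z) * max(|z| - tau, 0)."""
    return np.sign(z) * max(abs(z) - tau, 0.0)

def lasso_coordinate_descent(X, y, lam, n_sweeps=200):
    """Cyclic coordinate descent for 0.5 * ||y - X w||^2 + lam * ||w||_1."""
    n, p = X.shape
    w = np.zeros(p)
    residual = y.copy()                      # residual = y - X w, kept up to date
    col_sq_norms = (X ** 2).sum(axis=0)
    for _ in range(n_sweeps):
        for j in range(p):
            if col_sq_norms[j] == 0.0:
                continue
            # Correlation of column j with the residual that ignores w_j.
            rho = X[:, j] @ residual + col_sq_norms[j] * w[j]
            w_new = soft_threshold(rho, lam) / col_sq_norms[j]
            residual += X[:, j] * (w[j] - w_new)
            w[j] = w_new
    return w

# Synthetic sparse regression problem (assumption, for illustration only).
rng = np.random.default_rng(1)
n, p = 50, 10
X = rng.standard_normal((n, p))
w_true = np.zeros(p)
w_true[:3] = [2.0, -1.5, 1.0]
y = X @ w_true + 0.1 * rng.standard_normal(n)
lam = 5.0
w_hat = lasso_coordinate_descent(X, y, lam)
print(np.round(w_hat, 3))
```

A practical implementation would add a stopping criterion, for instance based on the optimality conditions for the square loss, instead of a fixed sweep count.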


LARS algorithm

The regularization path is piecewise linear and can be computed by solving small linear systems

[Figure: regularization path, weights as a function of the regularization parameter.]


Theoretical properties of the Lasso

The Lasso is effective in high dimension

Suppose the data are generated using s variables.

n = number of observations, p = number of variables.

High-dimensional regime: log(p) = o(n)

The Lasso is then

Consistent for variable selection
Convergent in norm: $\|w - w^*\| \to 0$
Effective for prediction

Many works address the conditions under which the Lasso succeeds:

1 $n > \theta\, s \log p$
2 The variables in the model and those outside the model must not be too correlated (Wainwright, 2009; Meinshausen and Yu, 2009; Bunea et al., 2007)


Empirical comparison: ℓ1 vs ℓ2 vs greedy

Gaussian design matrix, k = 4, n = 64, p ∈ [2, 256], SNR = 1

Note the robustness in the non-sparse case

[Figure: two panels, "Sparse" and "Rotated (non sparse)", showing mean square error as a function of log2(p) for L1, L2 and greedy, with an additional oracle curve in the first panel.]


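Block variable selection

Groups of variables

\[
x = (x_{g_1}^\top, \dots, x_{g_k}^\top)^\top \in \mathbb{R}^{p_1} \times \dots \times \mathbb{R}^{p_k} = \mathbb{R}^p
\]

Variables corresponding to the different values of a nominal variable (e.g. occupation) or of an ordinal variable, $x_i \in \{\text{normal}, \text{bad}, \text{very bad}\}$

Different nonlinear functions of each variable

\[
Y = \sum_{j=1}^p \sum_{d=1}^D w_{jd} X_j^d + \varepsilon
\]

Penalize the number of groups

\[
\min_w \; \frac{1}{n} \sum_{i=1}^n \Bigl( y^{(i)} - \sum_{g \in G} w_g^\top x_g^{(i)} \Bigr)^2 + \lambda \, \bigl|\{ g \mid \|w_g\| \neq 0 \}\bigr|
\]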



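Regularization inducing block sparsity

Group Lasso (Yuan and Lin, 2006)

The ℓ1/ℓ2 norm, $\sum_{g \in G} \|w_g\|_2 = \sum_{g \in G} \bigl( \sum_{j \in g} w_j^2 \bigr)^{1/2}$, with G a partition of $\{1, \dots, p\}$

\[
\min_{w \in \mathbb{R}^p} \; \frac{1}{n} \sum_{i=1}^n (y^{(i)} - w^\top x^{(i)})^2 + \lambda \sum_{g \in G} \|w_g\|
\]

The ℓ1/ℓ2 block-norm: $\Omega(w) := \sum_{g \in G} \|w_g\|$

Unit ball in $\mathbb{R}^3$: $\|(w_1, w_2)\| + \|w_3\| \leq 1$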
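The block norm is easy to compute, and its effect on sparsity can be seen through block soft-thresholding. The sketch below assumes NumPy. The proximal operator shown is not on the slides; it is a standard companion of the ℓ1/ℓ2 norm, added here as an illustration, and the function names are assumptions. The example uses the same group structure as the unit ball above: groups (w1, w2) and (w3).

```python
import numpy as np

def block_norm(w, groups):
    """Omega(w) = sum over groups g of ||w_g||_2, for a partition `groups`."""
    return float(sum(np.linalg.norm(w[g]) for g in groups))

def block_soft_threshold(w, groups, tau):
    """Proximal operator of tau * Omega: shrink each block towards zero.

    A block whose Euclidean norm is at most tau is set to zero as a whole.
    """
    out = np.zeros_like(w, dtype=float)
    for g in groups:
        norm_g = np.linalg.norm(w[g])
        if norm_g > tau:
            out[g] = (1.0 - tau / norm_g) * w[g]
    return out

groups = [[0, 1], [2]]                 # partition of the three coordinates
w = np.array([3.0, 4.0, 0.5])
print(block_norm(w, groups))
print(block_soft_threshold(w, groups, 1.0))
```

Whole groups are zeroed out together, which is the block-level analogue of the coordinate-wise sparsity produced by the ℓ1 norm.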



Two types of sparsity for a matrix M

I - Directly on the entries of M

Many zero entries: $M_{ij} = 0$

[Figure: matrix M.]

Many zero rows (or columns): $(M_{i1}, \dots, M_{ip}) = 0$

[Figure: matrix M.]

Two types of sparsity for a matrix M

II - Via a factorization

Matrix $M = UV^\top$, $U \in \mathbb{R}^{n \times k}$ and $V \in \mathbb{R}^{p \times k}$

Low rank: k small

[Figure: $M = UV^\top$.]

Sparse decomposition: U sparse

[Figure: $M = UV^\top$ with U sparse.]


Matrix Factorization

$X = D\alpha$

low rank factorization e.g., collaborative filtering: users and movies

D : dictionary of prototype users

α : decomposition coefficients

Typically: PCA

Introduce structure/sparsity?
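As a baseline before adding structure or sparsity, the low-rank factorization $X \approx D\alpha$ can be obtained from a truncated SVD. The sketch below assumes NumPy and works on uncentered data for simplicity (PCA proper would center X first); the function name `pca_factorization` and the synthetic rank-3 matrix are assumptions.

```python
import numpy as np

def pca_factorization(X, k):
    """Best rank-k factorization X ~ D @ alpha in Frobenius norm (truncated SVD)."""
    U, s, Vt = np.linalg.svd(X, full_matrices=False)
    D = U[:, :k]                       # p x k dictionary with orthonormal columns
    alpha = s[:k, None] * Vt[:k]       # k x n decomposition coefficients
    return D, alpha

# Synthetic matrix of exact rank k (assumption, for illustration only).
rng = np.random.default_rng(0)
p, n, k = 20, 50, 3
X = rng.standard_normal((p, k)) @ rng.standard_normal((k, n))
D, alpha = pca_factorization(X, k)
print(np.linalg.norm(X - D @ alpha))   # close to zero for an exactly rank-k X
```

Neither D nor α is sparse here in general, which is what motivates the question on the slide.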




Sparse PCA / Dictionary Learning

Sparse PCA

$X = D\alpha$

e.g. microarray data

sparse dictionary

(Witten et al., 2009; Bach et al., 2008)

Dictionary Learning

$X = D\alpha$

e.g. overcomplete dictionaries for natural images

sparse decomposition

(Olshausen and Field, 1997; Elad and Aharon, 2006; Raina et al., 2007)

Other constraints

positivity: Non-negative matrix factorization

simplex constraint: multinomial PCA (Buntine, 2002) (closely related to LDA of Blei et al. (2003))


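Formulations for SPCA and dictionary learning

SPCA:

\[
\min_{A \in \mathbb{R}^{k \times n},\; D \in \mathbb{R}^{p \times k}} \; \sum_{i=1}^n \|x_i - D\alpha_i\|_2^2 + \lambda \sum_{j=1}^k \|d_j\|_1 \quad \text{s.t. } \forall j,\ \|\alpha_j\|_2 \leq 1.
\]

Dictionary learning:

\[
\min_{A \in \mathbb{R}^{k \times n},\; D \in \mathbb{R}^{p \times k}} \; \sum_{i=1}^n \bigl( \|x_i - D\alpha_i\|_2^2 + \lambda \|\alpha_i\|_1 \bigr) \quad \text{s.t. } \forall j,\ \|d_j\|_2 \leq 1.
\]

In both cases no orthogonality

Not jointly convex but convex in each $d_j$ and $\alpha_j$

⇒ efficient block-coordinate descent algorithms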
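The dictionary learning formulation lends itself to alternating updates. The sketch below, assuming NumPy, alternates one proximal-gradient step on A (with D fixed) and an exact column-by-column update of D under the constraint $\|d_j\|_2 \leq 1$. It is a simplified illustration, not the algorithm of any cited paper; all function names, the random data and the value of λ are assumptions.

```python
import numpy as np

def soft_threshold(Z, tau):
    return np.sign(Z) * np.maximum(np.abs(Z) - tau, 0.0)

def objective(X, D, A, lam):
    """sum_i ||x_i - D alpha_i||_2^2 + lam * ||alpha_i||_1, columns of A = alpha_i."""
    return float(np.sum((X - D @ A) ** 2) + lam * np.abs(A).sum())

def sparse_coding_step(X, D, A, lam):
    """One proximal-gradient step on A with D fixed (step size 1/Lipschitz)."""
    lipschitz = 2.0 * np.linalg.norm(D, 2) ** 2 + 1e-12
    grad = 2.0 * D.T @ (D @ A - X)
    return soft_threshold(A - grad / lipschitz, lam / lipschitz)

def dictionary_step(X, D, A):
    """Exact minimization over each column d_j subject to ||d_j||_2 <= 1."""
    D = D.copy()
    for j in range(D.shape[1]):
        a_j = A[j]                                   # row j of A
        sq = a_j @ a_j
        if sq == 0.0:
            continue                                 # unused atom: leave unchanged
        residual = X - D @ A + np.outer(D[:, j], a_j)
        u = residual @ a_j / sq                      # unconstrained minimizer
        D[:, j] = u / max(1.0, np.linalg.norm(u))    # project onto the unit ball
    return D

def dictionary_learning(X, k, lam, n_iter=30, seed=0):
    rng = np.random.default_rng(seed)
    p, n = X.shape
    D = rng.standard_normal((p, k))
    D /= np.linalg.norm(D, axis=0)                   # start with unit-norm atoms
    A = np.zeros((k, n))
    history = [objective(X, D, A, lam)]
    for _ in range(n_iter):
        A = sparse_coding_step(X, D, A, lam)
        D = dictionary_step(X, D, A)
        history.append(objective(X, D, A, lam))
    return D, A, history

# Random data (assumption): p = 8 features, n = 40 signals, k = 5 atoms.
X = np.random.default_rng(0).standard_normal((8, 40))
D, A, history = dictionary_learning(X, k=5, lam=0.1)
print(history[0], history[-1])
```

Each of the two steps cannot increase the objective, so the recorded values decrease monotonically even though the problem is not jointly convex; only convergence to a stationary point, not a global minimum, should be expected.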


Dictionary Learning for Image Denoising

Example from Mairal et al. (2009)



Dictionary Learning for Image Inpainting



Image Inpainting (Mairal et al., 2009)








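Sparse Structured PCA and structured dictionary learning

Sparse structured PCA (sparse and structured dictionary elements):

\[
\min_{A \in \mathbb{R}^{k \times n},\; D \in \mathbb{R}^{p \times k}} \; \sum_{i=1}^n \|x_i - D\alpha_i\|_2^2 + \lambda \sum_{j=1}^k \Omega(d_j) \quad \text{s.t. } \forall j,\ \|\alpha_j\|_2 \leq 1.
\]

(Jenatton, Obozinski and Bach, 2010)

Dictionary learning with structured sparsity for α:

\[
\min_{A \in \mathbb{R}^{k \times n},\; D \in \mathbb{R}^{p \times k}} \; \sum_{i=1}^n \bigl( \|x_i - D\alpha_i\|_2^2 + \lambda\, \Omega(\alpha_i) \bigr) \quad \text{s.t. } \forall j,\ \|d_j\|_2 \leq 1.
\]

(Jenatton, Mairal, Obozinski and Bach, 2010)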


Faces

Faces

A basis to decompose faces?

Eigenfaces

Find parts?

Localized components

NMF (Lee and Seung, 1999)


Faces

Faces NMF


Rectangular supports

$\Omega(d) = \sum_{g \in G} \|d_g\|_2$: Selection of rectangles on the 2-D grid.

G is the set of blue/green groups (with their complements, not displayed)

Any union of blue/green groups set to zero leads to the selection of a rectangle


General “convex” supports

$\Omega(d) = \sum_{g \in G} \|d_g\|_2$: Selection of “convex” patterns on a 2-D grid.

It is possible to extend such settings to 3-D space, or more complex topologies


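Sparse Structured PCA (Jenatton, Obozinski and Bach, 2009)

Learning sparse and structured dictionary elements:

\[
\min_{A \in \mathbb{R}^{k \times n},\; D \in \mathbb{R}^{p \times k}} \; \sum_{i=1}^n \|x_i - D\alpha_i\|_2^2 + \lambda \sum_{j=1}^k \Omega(d_j) \quad \text{s.t. } \forall i,\ \|\alpha_i\|_2 \leq 1
\]

Structure of the dictionary elements determined by the choice of G (and thus Ω)

Efficient learning procedures through variational formulation.

Reweighted ℓ2:
\[
\sum_{g \in G} \|y_g\|_2 = \min_{\eta_g \geq 0,\; g \in G} \; \frac{1}{2} \sum_{g \in G} \Bigl\{ \frac{\|y_g\|_2^2}{\eta_g} + \eta_g \Bigr\}
\]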
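The reweighted ℓ2 identity can be verified numerically. The sketch below assumes NumPy and uses a small hypothetical vector with two non-zero groups; the function name `variational_bound` is an assumption. For any positive η the right-hand side is an upper bound on the block norm (by the arithmetic-geometric mean inequality), and the bound is tight at $\eta_g = \|y_g\|_2$.

```python
import numpy as np

def variational_bound(y, groups, eta):
    """0.5 * sum_g (||y_g||_2^2 / eta_g + eta_g), an upper bound on sum_g ||y_g||_2."""
    sq_norms = np.array([np.sum(y[g] ** 2) for g in groups])
    return float(0.5 * np.sum(sq_norms / eta + eta))

# Hypothetical example with two groups of non-zero norm.
groups = [[0, 1], [2, 3, 4]]
y = np.array([3.0, 4.0, 1.0, 2.0, 2.0])
block_norms = np.array([np.linalg.norm(y[g]) for g in groups])

eta_opt = block_norms                          # closed-form minimizer eta_g = ||y_g||_2
print(variational_bound(y, groups, eta_opt), block_norms.sum())
print(variational_bound(y, groups, np.array([1.0, 1.0])))   # a looser bound
```

In an alternating scheme, η is updated in closed form as above, and the remaining problem in the other variables is a weighted ℓ2 (ridge-type) problem.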


Faces

AR Face database

100 individuals (50 W/50 M)

For each

14 non-occluded
12 occluded
lateral illuminations
reduced resolution to 38 × 27 pixels


Decomposition of faces

SPCA SSPCA


Decomposition of faces II

SPCA SSPCA


k-NN classification based on decompositions


Tree of Topics

NIPS abstracts

1714 documents

8274 words


Hierarchical dictionary for image patches


Conclusions

Regularization is a key concept in learning.

There exist regularizations that induce different forms of sparsity.

The corresponding estimation problems are non-differentiable optimization problems.

However, there are efficient algorithms based on structured optimization.

These regularizations make it possible to address, for example,

variable selection problems,
learning Gaussian graphical models,
identifying latent structure in data.


References I

Bach, F., Mairal, J., and Ponce, J. (2008). Convex sparse matrix factorizations. Technical Report 0812.1869, ArXiv.

Blei, D., Ng, A., and Jordan, M. (2003). Latent Dirichlet allocation. The Journal of Machine Learning Research, 3:993–1022.

Bunea, F., Tsybakov, A., and Wegkamp, M. (2007). Aggregation for Gaussian regression. Annals of Statistics, 35(4):1674–1697.

Buntine, W. (2002). Variational extensions to EM and multinomial PCA. Lecture Notes in Computer Science, pages 23–34.

Elad, M. and Aharon, M. (2006). Image denoising via sparse and redundant representations over learned dictionaries. IEEE Transactions on Image Processing, 15(12):3736–3745.

Friedman, J., Hastie, T., Hofling, H., and Tibshirani, R. (2007). Pathwise coordinate optimization. Annals of Applied Statistics, 1(2):302–332.

Fu, W. (1998). Penalized regressions: the bridge versus the Lasso. Journal of Computational and Graphical Statistics, pages 397–416.

Jenatton, R., Obozinski, G., and Bach, F. (2009). Structured sparse principal component analysis. Technical report. Preprint arXiv:0909.1440.

Lee, D. and Seung, H. (1999). Learning the parts of objects by non-negative matrix factorization. Nature, 401(6755):788–791.


References II

Mairal, J., Bach, F., Ponce, J., Sapiro, G., and Zisserman, A. (2009). Non-local sparse models for image restoration. In Proceedings of the IEEE International Conference on Computer Vision (ICCV).

Meinshausen, N. and Yu, B. (2009). Lasso-type recovery of sparse representations for high-dimensional data. Annals of Statistics, 37:2246–2270.

Micchelli, C. and Pontil, M. (2006). Learning the kernel function via regularization. Journal of Machine Learning Research, 6(2):1099.

Olshausen, B. and Field, D. (1997). Sparse coding with an overcomplete basis set: A strategy employed by V1? Vision Research, 37(23):3311–3325.

Raina, R., Battle, A., Lee, H., Packer, B., and Ng, A. (2007). Self-taught learning: transfer learning from unlabeled data. In Proceedings of the 24th International Conference on Machine Learning, page 766. ACM.

Rakotomamonjy, A., Bach, F., Canu, S., and Grandvalet, Y. (2008). SimpleMKL. Journal of Machine Learning Research, 9:2491–2521.

Wainwright, M. J. (2009). Sharp thresholds for high-dimensional and noisy sparsity recovery using ℓ1-constrained quadratic programming (Lasso). IEEE Trans. Information Theory, 55:2183–2202.

Witten, D., Tibshirani, R., and Hastie, T. (2009). A penalized matrix decomposition, with applications to sparse principal components and canonical correlation analysis. Biostatistics, 10(3):515–534.


References III

Wu, T. and Lange, K. (2008). Coordinate descent algorithms for Lasso penalized regression. Annals of Applied Statistics, 2(1):224–244.

Yuan, M. and Lin, Y. (2006). Model selection and estimation in regression with grouped variables. Journal of the Royal Statistical Society B, 68(1):49–67.

