
Topographic Dictionary Learning with Structured Sparsity

Julien Mairal¹, Rodolphe Jenatton², Guillaume Obozinski², Francis Bach²

¹ UC Berkeley   ² INRIA - SIERRA Project-Team

San Diego, Wavelets and Sparsity XIV, August 2011


What this work is about

Group sparsity with overlapping groups.

Hierarchical, topographic dictionary learning.

More generally: structured dictionaries of natural image patches.

Related publications:

[1] J. Mairal, R. Jenatton, G. Obozinski and F. Bach. Network Flow Algorithms for Structured Sparsity. NIPS, 2010.

[2] R. Jenatton, J. Mairal, G. Obozinski and F. Bach. Proximal Methods for Hierarchical Sparse Coding. JMLR, 2011.


Part I: Introduction to Dictionary Learning


What is a Sparse Linear Model?

Let x ∈ R^m be a signal.

Let D = [d_1, . . . , d_p] ∈ R^(m×p) be a set of normalized "basis vectors". We call it a dictionary.

D is "adapted" to x if it can represent it with a few basis vectors, that is, if there exists a sparse vector α in R^p such that x ≈ Dα. We call α the sparse code.

Schematically, x ∈ R^m is approximated as x ≈ Dα, where D = [d_1 d_2 · · · d_p] ∈ R^(m×p) and α = (α_1, α_2, . . . , α_p)^⊤ ∈ R^p is sparse.


The Sparse Decomposition Problem

min_{α ∈ R^p}   (1/2) ‖x − Dα‖²₂   +   λ ψ(α)

The first term is the data-fitting term; λψ(α) is the sparsity-inducing regularization.

ψ induces sparsity in α:

the ℓ0 "pseudo-norm": ‖α‖0 = #{i : α_i ≠ 0} (NP-hard),

the ℓ1 norm: ‖α‖1 = Σ_{i=1}^p |α_i| (convex),

. . .

This is a selection problem. When ψ is the ℓ1-norm, the problem is called the Lasso [Tibshirani, 1996] or basis pursuit [Chen et al., 1999].
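To make the formulation concrete, here is a minimal numpy sketch that evaluates the penalized objective for the ℓ0 and ℓ1 choices of ψ. The dictionary, signal, and regularization weight below are made-up toy values, not data from the talk.

```python
import numpy as np

def objective(x, D, alpha, lam, penalty="l1"):
    """Evaluate 0.5 * ||x - D @ alpha||_2^2 + lam * psi(alpha)."""
    data_fit = 0.5 * np.sum((x - D @ alpha) ** 2)   # data-fitting term
    if penalty == "l0":
        psi = np.count_nonzero(alpha)               # l0 "pseudo-norm"
    else:
        psi = np.sum(np.abs(alpha))                 # l1 norm (Lasso / basis pursuit)
    return data_fit + lam * psi

# Toy example: m = 8, p = 12, a 2-sparse code alpha.
rng = np.random.default_rng(0)
D = rng.standard_normal((8, 12))
D /= np.linalg.norm(D, axis=0)                      # normalized dictionary columns
alpha = np.zeros(12)
alpha[[2, 7]] = [1.5, -0.8]
x = D @ alpha + 0.01 * rng.standard_normal(8)       # x is approximately D @ alpha
print(objective(x, D, alpha, lam=0.1, penalty="l1"))
```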


Sparse representations for image restoration

Designed dictionaries

[Haar, 1910], [Zweig, Morlet, Grossman ∼70s], [Meyer, Mallat, Daubechies, Coifman, Donoho, Candès ∼80s–today]. . . Wavelets, curvelets, wedgelets, bandlets, . . . lets

Learned dictionaries of patches

[Olshausen and Field, 1997, Engan et al., 1999, Lewicki and Sejnowski, 2000, Aharon et al., 2006], . . .

min_{α_i, D ∈ D}   Σ_{i=1}^n  (1/2) ‖x_i − Dα_i‖²₂   +   λ ψ(α_i)

The first term is the reconstruction error; λψ(α_i) is the sparsity term, with for instance

ψ(α) = ‖α‖0 ("ℓ0 pseudo-norm")   or   ψ(α) = ‖α‖1 (ℓ1 norm).


Sparse representations for image restoration: grayscale vs. color image patches

Figure: Left: learned on grayscale image patches. Right: learned on color image patches (after removing the mean color from each patch).


Algorithms

min_{α ∈ R^(p×n), D ∈ D}   Σ_{i=1}^n  (1/2) ‖x_i − Dα_i‖²₂ + λ ψ(α_i).

How do we optimize that?

alternate between D and α [Engan et al., 1999], or other variants [Elad and Aharon, 2006]

online learning [Olshausen and Field, 1997, Mairal et al., 2009, Skretting and Engan, 2010]

The SPAMS code is available at http://www.di.ens.fr/willow/SPAMS/, now open-source!
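To make the alternating scheme concrete, here is a hedged numpy sketch of one possible implementation (not the SPAMS code): sparse coding by a few ISTA iterations with an ℓ1 penalty, then a least-squares dictionary update followed by renormalization of the columns. All function names and parameter values are illustrative.

```python
import numpy as np

def soft_threshold(v, t):
    return np.sign(v) * np.maximum(np.abs(v) - t, 0.0)

def sparse_codes(X, D, lam, n_iter=100):
    """ISTA on each column of X: min_a 0.5*||x - D a||_2^2 + lam*||a||_1."""
    L = np.linalg.norm(D, 2) ** 2             # Lipschitz constant of the gradient
    A = np.zeros((D.shape[1], X.shape[1]))
    for _ in range(n_iter):
        A = soft_threshold(A - D.T @ (D @ A - X) / L, lam / L)
    return A

def dictionary_update(X, A, eps=1e-10):
    """Least-squares update of D, then renormalize columns to unit norm."""
    D = X @ A.T @ np.linalg.pinv(A @ A.T + eps * np.eye(A.shape[0]))
    return D / np.maximum(np.linalg.norm(D, axis=0), eps)

def dictionary_learning(X, p, lam, n_outer=20, seed=0):
    """Alternate between the codes alpha and the dictionary D."""
    rng = np.random.default_rng(seed)
    D = rng.standard_normal((X.shape[0], p))
    D /= np.linalg.norm(D, axis=0)
    for _ in range(n_outer):
        A = sparse_codes(X, D, lam)
        D = dictionary_update(X, A)
    return D, A
```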


Part II: Introduction to Structured Sparsity (Let us play with ψ)


Group Sparsity-Inducing Norms

min_{α ∈ R^p}   (1/2) ‖x − Dα‖²₂   +   λ ψ(α),   with ψ a sparsity-inducing norm.

The most popular choice for ψ:

The ℓ1 norm, ψ(α) = ‖α‖1.

However, the ℓ1 norm encodes poor information, just cardinality!

Another popular choice for ψ:

The ℓ1-ℓq norm [Turlach et al., 2005], with q = 2 or q = ∞:

ψ(α) = Σ_{g ∈ G} ‖α_g‖_q,   with G a partition of {1, . . . , p}.

The ℓ1-ℓq norm sets to zero groups of non-overlapping variables (as opposed to single variables for the ℓ1 norm).
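As a minimal illustration, the sketch below evaluates this group norm for a given partition; the group indices and coefficient values are made-up examples.

```python
import numpy as np

def group_norm(alpha, groups, q=2):
    """Sum over groups of the l_q norm of the coordinates in each group."""
    return sum(np.linalg.norm(alpha[g], ord=q) for g in groups)

alpha = np.array([0.0, 0.0, 1.2, -0.3, 0.0, 0.5])
groups = [np.array([0, 1]), np.array([2, 3]), np.array([4, 5])]  # partition of the p = 6 variables
print(group_norm(alpha, groups, q=2))       # l1-l2: encourages whole groups to be zero
print(group_norm(alpha, groups, q=np.inf))  # l1-l_inf variant
```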


Structured Sparsity with Overlapping Groups

Warning: under the name "structured sparsity" appear in fact significantly different formulations!

1. non-convex

zero-tree wavelets [Shapiro, 1993]
sparsity patterns are in a predefined collection: [Baraniuk et al., 2010]
select a union of groups: [Huang et al., 2009]
structure via Markov random fields: [Cevher et al., 2008]

2. convex

tree-structure: [Zhao et al., 2009]
non-zero patterns are a union of groups: [Jacob et al., 2009]
zero patterns are a union of groups: [Jenatton et al., 2009]
other norms: [Micchelli et al., 2010]


Structured Sparsity with Overlapping Groups

ψ(α) = Σ_{g ∈ G} ‖α_g‖_q

What happens when the groups overlap? [Jenatton et al., 2009]

Inside the groups, the ℓ2-norm (or ℓ∞) does not promote sparsity.

Variables belonging to the same groups are encouraged to be set to zero together.


Examples of sets of groups G [Jenatton et al., 2009]

Selection of contiguous patterns on a sequence, p = 6.

G is the set of blue groups.

Any union of blue groups set to zero leads to the selection of a contiguous pattern.
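The figure with the blue groups is not reproduced in this transcript. In the construction of Jenatton et al. [2009] that this slide refers to (stated here as an assumption), the groups on a sequence of length p are the "left" intervals {1, . . . , k} and the "right" intervals {k, . . . , p}, so that setting any union of groups to zero leaves a contiguous nonzero pattern. A small sketch checking this property:

```python
from itertools import combinations

p = 6
# Assumed construction: left and right intervals (0-indexed here).
groups = [set(range(0, k)) for k in range(1, p)] + \
         [set(range(k, p)) for k in range(1, p)]

def is_contiguous(indices):
    s = sorted(indices)
    return not s or s == list(range(s[0], s[-1] + 1))

# The complement of any union of groups (the nonzero pattern) is contiguous.
for r in range(1, len(groups) + 1):
    for combo in combinations(groups, r):
        zeros = set().union(*combo)
        assert is_contiguous(set(range(p)) - zeros)
print("every union of groups yields a contiguous nonzero pattern")
```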


Hierarchical Norms [Zhao et al., 2009]

A node can be active only if its ancestors are active. The selected patterns are rooted subtrees.


Algorithms/Difficulties [Jenatton et al., 2010, Mairal et al., 2011]

min_{α ∈ R^p}   (1/2) ‖x − Dα‖²₂   +   λ Σ_{g ∈ G} ‖α_g‖_q.

The function is convex and non-differentiable; the regularizer is a sum of simple terms which are non-separable.

How do we optimize that?

hierarchical norms: same complexity as ℓ1 with proximal methods.

general case: augmented Lagrangian techniques.

general case with ℓ∞-norms: proximal methods combined with network flow optimization.

Also implemented in the toolbox SPAMS.


Part III: Learning Structured Dictionaries


Topographic Dictionary Learning

[Kavukcuoglu et al., 2009]: organize the dictionary elements on a 2D grid and use ψ with e × e overlapping groups.

[Garrigues and Olshausen, 2010]: sparse coding + a probabilistic model of lateral interactions.

topographic ICA by Hyvarinen et al. [2001].
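As a hedged illustration of the first point, the sketch below builds overlapping e × e neighborhood groups for p dictionary elements arranged on a √p × √p grid; the grid size, the value of e, and the wrap-around (toroidal) boundary are illustrative assumptions, not details from the talk.

```python
import numpy as np

def grid_groups(side, e):
    """Overlapping e x e neighborhoods (with wrap-around) on a side x side grid
    of dictionary elements; each group is an array of flat indices."""
    groups = []
    for i in range(side):
        for j in range(side):
            idx = [((i + di) % side) * side + (j + dj) % side
                   for di in range(e) for dj in range(e)]
            groups.append(np.array(idx))
    return groups

groups = grid_groups(side=10, e=3)      # p = 100 elements, 3 x 3 neighborhoods
print(len(groups), groups[0])           # 100 overlapping groups of size 9

# The topographic penalty on a code alpha is the sum of the group norms:
alpha = np.zeros(100)
penalty = sum(np.linalg.norm(alpha[g]) for g in groups)
```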


Topographic Dictionary Learning [Mairal, Jenatton, Obozinski, and Bach, 2011], 3 × 3 neighborhoods


Topographic Dictionary Learning [Mairal, Jenatton, Obozinski, and Bach, 2011], 4 × 4 neighborhoods


Hierarchical Dictionary Learning [Jenatton, Mairal, Obozinski, and Bach, 2010]


Conclusion / Discussion

Structured sparsity is a natural framework for learning structured dictionaries...

...and has efficient optimization tools.

Other applications in natural language processing, bio-informatics, neuroscience...


SPAMS toolbox (open-source)

C++ interfaced with Matlab.

proximal gradient methods for ℓ0, ℓ1, elastic-net, fused-Lasso, group-Lasso, tree group-Lasso, tree-ℓ0, sparse group Lasso, overlapping group Lasso...

...for square, logistic, multi-class logistic loss functions.

handles sparse matrices,

provides duality gaps.

also coordinate descent, block coordinate descent algorithms.

fastest available implementation of OMP and LARS.

dictionary learning and matrix factorization (NMF, sparse PCA).

fast projections onto some convex sets.

Try it! http://www.di.ens.fr/willow/SPAMS/


References I

M. Aharon, M. Elad, and A. M. Bruckstein. The K-SVD: An algorithm for designing overcomplete dictionaries for sparse representations. IEEE Transactions on Signal Processing, 54(11):4311–4322, November 2006.

R. G. Baraniuk, V. Cevher, M. Duarte, and C. Hegde. Model-based compressive sensing. IEEE Transactions on Information Theory, 2010. To appear.

A. Beck and M. Teboulle. A fast iterative shrinkage-thresholding algorithm for linear inverse problems. SIAM Journal on Imaging Sciences, 2(1):183–202, 2009.

V. Cevher, M. F. Duarte, C. Hegde, and R. G. Baraniuk. Sparse signal recovery using Markov random fields. In Advances in Neural Information Processing Systems, 2008.


References II

S. S. Chen, D. L. Donoho, and M. A. Saunders. Atomic decomposition by basis pursuit. SIAM Journal on Scientific Computing, 20:33–61, 1999.

M. Elad and M. Aharon. Image denoising via sparse and redundant representations over learned dictionaries. IEEE Transactions on Image Processing, 15(12):3736–3745, December 2006.

K. Engan, S. O. Aase, and J. H. Husoy. Frame based signal compression using method of optimal directions (MOD). In Proceedings of the 1999 IEEE International Symposium on Circuits and Systems, volume 4, 1999.

P. Garrigues and B. Olshausen. Group sparse coding with a Laplacian scale mixture prior. In Advances in Neural Information Processing Systems, 2010.

A. Haar. Zur Theorie der orthogonalen Funktionensysteme. Mathematische Annalen, 69:331–371, 1910.


References III

J. Huang, Z. Zhang, and D. Metaxas. Learning with structured sparsity. In Proceedings of the International Conference on Machine Learning (ICML), 2009.

A. Hyvarinen, P. Hoyer, and M. Inki. Topographic independent component analysis. Neural Computation, 13(7):1527–1558, 2001.

L. Jacob, G. Obozinski, and J.-P. Vert. Group Lasso with overlap and graph Lasso. In Proceedings of the International Conference on Machine Learning (ICML), 2009.

R. Jenatton, J.-Y. Audibert, and F. Bach. Structured variable selection with sparsity-inducing norms. Technical report, 2009. Preprint arXiv:0904.3523v1.

R. Jenatton, J. Mairal, G. Obozinski, and F. Bach. Proximal methods for sparse hierarchical dictionary learning. In Proceedings of the International Conference on Machine Learning (ICML), 2010.


References IV

K. Kavukcuoglu, M. Ranzato, R. Fergus, and Y. LeCun. Learning invariant features through topographic filter maps. In Proceedings of CVPR, 2009.

M. S. Lewicki and T. J. Sejnowski. Learning overcomplete representations. Neural Computation, 12(2):337–365, 2000.

J. Mairal, F. Bach, J. Ponce, and G. Sapiro. Online dictionary learning for sparse coding. In Proceedings of the International Conference on Machine Learning (ICML), 2009.

J. Mairal, R. Jenatton, G. Obozinski, and F. Bach. Convex and network flow optimization for structured sparsity. Preprint arXiv:1104.1872, 2011.

C. A. Micchelli, J. M. Morales, and M. Pontil. A family of penalty functions for structured sparsity. In Advances in Neural Information Processing Systems, 2010.


References V

Y. Nesterov. A method for solving a convex programming problem with convergence rate O(1/k²). Soviet Math. Dokl., 27:372–376, 1983.

Y. Nesterov. Gradient methods for minimizing composite objective function. Technical report, CORE, 2007.

B. A. Olshausen and D. J. Field. Sparse coding with an overcomplete basis set: A strategy employed by V1? Vision Research, 37:3311–3325, 1997.

J. M. Shapiro. Embedded image coding using zerotrees of wavelet coefficients. IEEE Transactions on Signal Processing, 41(12):3445–3462, 1993.

K. Skretting and K. Engan. Recursive least squares dictionary learning algorithm. IEEE Transactions on Signal Processing, 58(4):2121–2130, 2010.

R. Tibshirani. Regression shrinkage and selection via the Lasso. Journal of the Royal Statistical Society, Series B, 58(1):267–288, 1996.


References VI

B. A. Turlach, W. N. Venables, and S. J. Wright. Simultaneous variable selection. Technometrics, 47(3):349–363, 2005.

P. Zhao, G. Rocha, and B. Yu. The composite absolute penalties family for grouped and hierarchical variable selection. The Annals of Statistics, 37(6A):3468–3497, 2009.


First-order/proximal methods

min_{α ∈ R^p}   f(α) + λ Ω(α)

f is strictly convex and differentiable with a Lipschitz gradient.

Generalizes the idea of gradient descent

α_{k+1} ← argmin_{α ∈ R^p}   f(α_k) + ∇f(α_k)^⊤ (α − α_k)   +   (L/2) ‖α − α_k‖²₂   +   λ Ω(α)

(the first two terms are the linear approximation of f, the third is the quadratic term)

        ← argmin_{α ∈ R^p}   (1/2) ‖α − (α_k − (1/L) ∇f(α_k))‖²₂   +   (λ/L) Ω(α)

When λ = 0, α_{k+1} ← α_k − (1/L) ∇f(α_k): this is equivalent to a classical gradient descent step.
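A minimal sketch of this update for a generic regularizer, assuming the proximal operator is available as a function (names and signature are illustrative):

```python
import numpy as np

def proximal_gradient(grad_f, prox, alpha0, L, lam, n_iter=200):
    """Iterate alpha <- prox_{(lam/L) * Omega}(alpha - grad_f(alpha) / L)."""
    alpha = np.array(alpha0, dtype=float)
    for _ in range(n_iter):
        alpha = prox(alpha - grad_f(alpha) / L, lam / L)
    return alpha
```

The soft-thresholding example on the next slide provides a concrete `prox` for the ℓ1 norm.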


First-order/proximal methods

They require solving the proximal operator efficiently:

min_{α ∈ R^p}   (1/2) ‖u − α‖²₂   +   λ Ω(α)

For the ℓ1-norm, this amounts to soft-thresholding:

α⋆_i = sign(u_i) (|u_i| − λ)₊.

There exist accelerated versions based on Nesterov's optimal first-order method (gradient method with "extrapolation") [Beck and Teboulle, 2009, Nesterov, 2007, 1983],

suited for large-scale experiments.
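A self-contained sketch of the ℓ1 prox (soft-thresholding) plugged into the proximal-gradient iteration, on a made-up least-squares problem:

```python
import numpy as np

def prox_l1(u, t):
    """Soft-thresholding: entrywise sign(u_i) * max(|u_i| - t, 0)."""
    return np.sign(u) * np.maximum(np.abs(u) - t, 0.0)

# Toy Lasso: f(alpha) = 0.5 * ||x - D alpha||_2^2, Omega = l1 norm.
rng = np.random.default_rng(0)
D = rng.standard_normal((20, 50))
x = rng.standard_normal(20)
lam = 0.1
L = np.linalg.norm(D, 2) ** 2           # Lipschitz constant of the gradient
alpha = np.zeros(50)
for _ in range(300):
    alpha = prox_l1(alpha - D.T @ (D @ alpha - x) / L, lam / L)
print(np.count_nonzero(alpha), "nonzero coefficients")
```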


Tree-structured groups

Proposition [Jenatton, Mairal, Obozinski, and Bach, 2010]

If G is a tree-structured set of groups, i.e., for all g, h ∈ G:

g ∩ h = ∅   or   g ⊂ h   or   h ⊂ g.

For q = 2 or q = ∞, we define Prox^g and Prox^Ω as

Prox^g : u → argmin_{α ∈ R^p}   (1/2) ‖u − α‖²₂ + λ ‖α_g‖_q,

Prox^Ω : u → argmin_{α ∈ R^p}   (1/2) ‖u − α‖²₂ + λ Σ_{g ∈ G} ‖α_g‖_q.

If the groups are sorted from the leaves to the root, then

Prox^Ω = Prox^{g_m} ∘ · · · ∘ Prox^{g_1}.

→ Tree-structured regularization: efficient linear-time algorithm.
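A hedged sketch of this composition for q = 2 (each Prox^g reduces to block soft-thresholding of the coordinates in g, which is standard; the tiny tree of groups below is a made-up example):

```python
import numpy as np

def prox_group_l2(u, g, lam):
    """Prox of lam * ||alpha_g||_2: block soft-thresholding on group g."""
    v = u.copy()
    norm_g = np.linalg.norm(v[g])
    v[g] *= max(0.0, 1.0 - lam / norm_g) if norm_g > 0 else 0.0
    return v

def prox_tree(u, groups, lam):
    """Compose the group proxes; `groups` must be ordered from leaves to root."""
    v = u.copy()
    for g in groups:
        v = prox_group_l2(v, g, lam)
    return v

# Made-up tree on p = 4 variables: leaves {1}, {2}, {3}, then the root {0,1,2,3}.
groups = [np.array([1]), np.array([2]), np.array([3]), np.array([0, 1, 2, 3])]
u = np.array([0.5, 1.0, -0.2, 0.05])
print(prox_tree(u, groups, lam=0.3))
```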


General Overlapping Groups for q = ∞ [Mairal, Jenatton, Obozinski, and Bach, 2011]

Dual formulation

The solutions α⋆ and ξ⋆ of the following optimization problems

min_{α ∈ R^p}   (1/2) ‖u − α‖²₂ + λ Σ_{g ∈ G} ‖α_g‖_∞,   (Primal)

min_{ξ ∈ R^(p×|G|)}   (1/2) ‖u − Σ_{g ∈ G} ξ^g‖²₂   s.t.   ∀ g ∈ G, ‖ξ^g‖₁ ≤ λ and ξ^g_j = 0 if j ∉ g,   (Dual)

satisfy

α⋆ = u − Σ_{g ∈ G} ξ⋆^g.   (Primal–dual relation)

The dual formulation has more variables, but it is equivalent to a quadratic min-cost flow problem.
