
Topographic Dictionary Learning with Structured Sparsity

Julien Mairal¹, Rodolphe Jenatton², Guillaume Obozinski², Francis Bach²

¹UC Berkeley  ²INRIA - SIERRA Project-Team

San Diego, Wavelets and Sparsity XIV, August 2011

Julien Mairal, UC Berkeley Topographic Dictionary Learning 1/32


What this work is about

Group sparsity with overlapping groups.

Hierarchical, topographic dictionary learning.

More generally: structured dictionaries of natural image patches.

Related publications:

[1] J. Mairal, R. Jenatton, G. Obozinski and F. Bach. Network Flow Algorithms for Structured Sparsity. NIPS, 2010.

[2] R. Jenatton, J. Mairal, G. Obozinski and F. Bach. Proximal Methods for Hierarchical Sparse Coding. JMLR, 2011.


Part I: Introduction to Dictionary Learning


What is a Sparse Linear Model?

Let x ∈ R^m be a signal.

Let D = [d1, . . . , dp] ∈ R^(m×p) be a set of normalized "basis vectors". We call it the dictionary.

D is "adapted" to x if it can represent it with a few basis vectors, that is, if there exists a sparse vector α ∈ R^p such that x ≈ Dα. We call α the sparse code.

x ≈ [d1 d2 · · · dp] (α1, α2, . . . , αp)ᵀ,  with x ∈ R^m, D ∈ R^(m×p), and α ∈ R^p sparse.

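The model can be made concrete in a few lines of numpy; the sizes and the random dictionary below are illustrative assumptions, not values from the talk:

```python
import numpy as np

# Minimal sketch of a sparse linear model x ≈ Dα (all sizes made up).
rng = np.random.default_rng(0)
m, p, k = 20, 50, 3                        # signal dim, dictionary size, sparsity

D = rng.standard_normal((m, p))
D /= np.linalg.norm(D, axis=0)             # normalized "basis vectors"

alpha = np.zeros(p)
support = rng.choice(p, size=k, replace=False)
alpha[support] = rng.standard_normal(k)    # sparse code: only k nonzeros

x = D @ alpha                              # the signal combines a few atoms
```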

The Sparse Decomposition Problem

min_{α∈R^p}  ½‖x − Dα‖₂²  +  λ ψ(α)

where ½‖x − Dα‖₂² is the data-fitting term and ψ is a sparsity-inducing regularization.

ψ induces sparsity in α. Typical choices:

the ℓ0 "pseudo-norm": ‖α‖₀ = #{i : αᵢ ≠ 0} (NP-hard);

the ℓ1 norm: ‖α‖₁ = ∑_{i=1}^p |αᵢ| (convex);

. . .

This is a selection problem. When ψ is the ℓ1 norm, the problem is called the Lasso [Tibshirani, 1996] or basis pursuit [Chen et al., 1999].

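One way to see the selection behavior: when D is orthonormal (take D = I), the Lasso has a closed-form solution given by coordinate-wise soft-thresholding, so small coordinates are discarded. A minimal sketch (the vector and λ are made up):

```python
import numpy as np

def soft_threshold(v, lam):
    # closed-form Lasso solution when D is orthonormal (here D = I):
    # each coordinate is shrunk toward zero; small ones are zeroed out
    return np.sign(v) * np.maximum(np.abs(v) - lam, 0.0)

x = np.array([3.0, 0.4, -1.2, 0.05])   # made-up signal
alpha = soft_threshold(x, lam=0.5)     # coordinates below 0.5 are dropped
```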

Sparse representations for image restoration

Designed dictionaries

[Haar, 1910], [Zweig, Morlet, Grossman ∼70s], [Meyer, Mallat, Daubechies, Coifman, Donoho, Candes ∼80s-today]. . . Wavelets, Curvelets, Wedgelets, Bandlets, . . . lets

Learned dictionaries of patches

[Olshausen and Field, 1997, Engan et al., 1999, Lewicki and Sejnowski, 2000, Aharon et al., 2006], . . .

min_{αᵢ, D∈D}  ∑_{i=1}^n  ½‖xᵢ − Dαᵢ‖₂² + λ ψ(αᵢ)

where the first term measures reconstruction and ψ induces sparsity:

ψ(α) = ‖α‖₀ ("ℓ0 pseudo-norm")  or  ψ(α) = ‖α‖₁ (ℓ1 norm).


Sparse representations for image restoration: grayscale vs. color image patches

Figure: Left: learned on grayscale image patches. Right: learned on color image patches (after removing the mean color from each patch).


Algorithms

min_{α∈R^(p×n), D∈D}  ∑_{i=1}^n  ½‖xᵢ − Dαᵢ‖₂² + λ ψ(αᵢ).

How do we optimize that?

alternate between D and α [Engan et al., 1999], or other variants [Elad and Aharon, 2006]

online learning [Olshausen and Field, 1997, Mairal et al., 2009, Skretting and Engan, 2010]

Code SPAMS available: http://www.di.ens.fr/willow/SPAMS/, now open-source!

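The alternating scheme can be sketched as a toy implementation; this is a simplified illustration (ISTA for the coding step, ridge-regularized least squares plus atom renormalization for the dictionary step), not the algorithms of the cited papers:

```python
import numpy as np

def soft_threshold(v, t):
    # elementwise prox of t*||.||_1
    return np.sign(v) * np.maximum(np.abs(v) - t, 0.0)

def sparse_code(X, D, lam, n_iter):
    # ISTA on every column of X at once, starting from zero codes
    L = np.linalg.norm(D, 2) ** 2 + 1e-12          # Lipschitz constant of grad
    A = np.zeros((D.shape[1], X.shape[1]))
    for _ in range(n_iter):
        A = soft_threshold(A - D.T @ (D @ A - X) / L, lam / L)
    return A

def learn_dictionary(X, p, lam=0.1, n_outer=10, n_ista=50, seed=0):
    # Alternate sparse coding (A fixed-D) and a ridge-regularized
    # least-squares dictionary update, renormalizing atoms each time.
    rng = np.random.default_rng(seed)
    D = rng.standard_normal((X.shape[0], p))
    D /= np.linalg.norm(D, axis=0)
    for _ in range(n_outer):
        A = sparse_code(X, D, lam, n_ista)
        D = X @ A.T @ np.linalg.inv(A @ A.T + 1e-6 * np.eye(p))
        D /= np.maximum(np.linalg.norm(D, axis=0), 1e-12)
    A = sparse_code(X, D, lam, n_ista)             # codes for the final D
    return D, A

# toy run on random signals (sizes are arbitrary)
X = np.random.default_rng(1).standard_normal((10, 30))
D, A = learn_dictionary(X, p=15, lam=0.1)
```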

Part II: Introduction to Structured Sparsity (Let us play with ψ)


Group Sparsity-Inducing Norms

min_{α∈R^p}  ½‖x − Dα‖₂² + λ ψ(α)

where ψ is a sparsity-inducing norm.

The most popular choice for ψ: the ℓ1 norm, ψ(α) = ‖α‖₁. However, the ℓ1 norm encodes poor information, just cardinality!

Another popular choice for ψ: the ℓ1-ℓq norm [Turlach et al., 2005], with q = 2 or q = ∞:

ψ(α) = ∑_{g∈G} ‖α_g‖_q,  with G a partition of {1, . . . , p}.

The ℓ1-ℓq norm sets groups of non-overlapping variables to zero (as opposed to single variables for the ℓ1 norm).

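For non-overlapping groups and q = 2, the proximal operator of this norm has a closed form (block soft-thresholding) that makes the group-zeroing behavior visible; a small sketch with made-up groups:

```python
import numpy as np

def group_soft_threshold(u, groups, t):
    # prox of t * Σ_g ||a_g||_2 when the groups partition the indices:
    # each group is either zeroed or shrunk toward zero as a whole block
    a = np.zeros_like(u)
    for g in groups:
        idx = np.asarray(g)
        norm = np.linalg.norm(u[idx])
        if norm > t:
            a[idx] = (1.0 - t / norm) * u[idx]
    return a

u = np.array([3.0, 4.0, 0.1, 0.2])
out = group_soft_threshold(u, [[0, 1], [2, 3]], t=1.0)
# the first group is shrunk, the second is set to zero as a whole
```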

Structured Sparsity with Overlapping Groups

Warning: significantly different formulations in fact appear under the name "structured sparsity"!

1. Non-convex:
zero-tree wavelets [Shapiro, 1993];
sparsity patterns in a predefined collection [Baraniuk et al., 2010];
selecting a union of groups [Huang et al., 2009];
structure via Markov random fields [Cehver et al., 2008].

2. Convex:
tree structures [Zhao et al., 2009];
non-zero patterns are a union of groups [Jacob et al., 2009];
zero patterns are a union of groups [Jenatton et al., 2009];
other norms [Micchelli et al., 2010].


Structured Sparsity with Overlapping Groups

ψ(α) = ∑_{g∈G} ‖α_g‖_q

What happens when the groups overlap? [Jenatton et al., 2009]

Inside the groups, the ℓ2-norm (or ℓ∞) does not promote sparsity.

Variables belonging to the same group are encouraged to be set to zero together.


Examples of sets of groups G [Jenatton et al., 2009]

Selection of contiguous patterns on a sequence, p = 6.

G is the set of blue groups.

Any union of blue groups set to zero leads to the selection of a contiguous pattern.


Hierarchical Norms [Zhao et al., 2009]

A node can be active only if its ancestors are active. The selected patterns are rooted subtrees.


Algorithms/Difficulties [Jenatton et al., 2010, Mairal et al., 2011]

min_{α∈R^p}  ½‖x − Dα‖₂² + λ ∑_{g∈G} ‖α_g‖_q.

The function is convex but non-differentiable, and the regularizer is a sum of simple, non-separable terms.

How do we optimize that?

hierarchical norms: same complexity as ℓ1 with proximal methods.

general case: augmented Lagrangian techniques.

general case with ℓ∞-norms: proximal methods combined with network flow optimization.

Also implemented in the toolbox SPAMS


Part III: Learning Structured Dictionaries


Topographic Dictionary Learning

[Kavukcuoglu et al., 2009]: organize the dictionary elements on a 2D grid and use ψ with e × e overlapping groups.

[Garrigues and Olshausen, 2010]: sparse coding + a probabilistic model of lateral interactions.

topographic ICA by Hyvarinen et al. [2001]:


Topographic Dictionary Learning [Mairal, Jenatton, Obozinski, and Bach, 2011], 3 × 3 neighborhoods


Topographic Dictionary Learning [Mairal, Jenatton, Obozinski, and Bach, 2011], 4 × 4 neighborhoods


Hierarchical Dictionary Learning [Jenatton, Mairal, Obozinski, and Bach, 2010]


Conclusion / Discussion

Structured sparsity is a natural framework for learning structured dictionaries...

...and has efficient optimization tools.

other applications in natural language processing, bio-informatics, neuroscience...


SPAMS toolbox (open-source)

C++ interfaced with Matlab.

proximal gradient methods for ℓ0, ℓ1, elastic-net, fused-Lasso, group-Lasso, tree group-Lasso, tree-ℓ0, sparse group-Lasso, overlapping group-Lasso...

...for square, logistic, multi-class logistic loss functions.

handles sparse matrices,

provides duality gaps.

also coordinate descent, block coordinate descent algorithms.

fastest available implementation of OMP and LARS.

dictionary learning and matrix factorization (NMF, sparse PCA).

fast projections onto some convex sets.

Try it! http://www.di.ens.fr/willow/SPAMS/


References I

M. Aharon, M. Elad, and A. M. Bruckstein. K-SVD: An algorithm for designing overcomplete dictionaries for sparse representation. IEEE Transactions on Signal Processing, 54(11):4311–4322, November 2006.

R. G. Baraniuk, V. Cevher, M. Duarte, and C. Hegde. Model-based compressive sensing. IEEE Transactions on Information Theory, 2010. To appear.

A. Beck and M. Teboulle. A fast iterative shrinkage-thresholding algorithm for linear inverse problems. SIAM Journal on Imaging Sciences, 2(1):183–202, 2009.

V. Cehver, M. F. Duarte, C. Hegde, and R. G. Baraniuk. Sparse signal recovery using Markov random fields. In Advances in Neural Information Processing Systems, 2008.


References II

S. S. Chen, D. L. Donoho, and M. A. Saunders. Atomic decomposition by basis pursuit. SIAM Journal on Scientific Computing, 20:33–61, 1999.

M. Elad and M. Aharon. Image denoising via sparse and redundant representations over learned dictionaries. IEEE Transactions on Image Processing, 15(12):3736–3745, December 2006.

K. Engan, S. O. Aase, and J. H. Husoy. Frame based signal compression using method of optimal directions (MOD). In Proceedings of the 1999 IEEE International Symposium on Circuits and Systems, volume 4, 1999.

P. Garrigues and B. Olshausen. Group sparse coding with a Laplacian scale mixture prior. In Advances in Neural Information Processing Systems, 2010.

A. Haar. Zur Theorie der orthogonalen Funktionensysteme. Mathematische Annalen, 69:331–371, 1910.


References III

J. Huang, Z. Zhang, and D. Metaxas. Learning with structured sparsity. In Proceedings of the International Conference on Machine Learning (ICML), 2009.

A. Hyvarinen, P. Hoyer, and M. Inki. Topographic independent component analysis. Neural Computation, 13(7):1527–1558, 2001.

L. Jacob, G. Obozinski, and J.-P. Vert. Group Lasso with overlap and graph Lasso. In Proceedings of the International Conference on Machine Learning (ICML), 2009.

R. Jenatton, J.-Y. Audibert, and F. Bach. Structured variable selection with sparsity-inducing norms. Technical report, 2009. Preprint arXiv:0904.3523v1.

R. Jenatton, J. Mairal, G. Obozinski, and F. Bach. Proximal methods for sparse hierarchical dictionary learning. In Proceedings of the International Conference on Machine Learning (ICML), 2010.


References IV

K. Kavukcuoglu, M. Ranzato, R. Fergus, and Y. LeCun. Learning invariant features through topographic filter maps. In Proceedings of CVPR, 2009.

M. S. Lewicki and T. J. Sejnowski. Learning overcomplete representations. Neural Computation, 12(2):337–365, 2000.

J. Mairal, F. Bach, J. Ponce, and G. Sapiro. Online dictionary learning for sparse coding. In Proceedings of the International Conference on Machine Learning (ICML), 2009.

J. Mairal, R. Jenatton, G. Obozinski, and F. Bach. Convex and network flow optimization for structured sparsity. Preprint arXiv:1104.1872, 2011.

C. A. Micchelli, J. M. Morales, and M. Pontil. A family of penalty functions for structured sparsity. In Advances in Neural Information Processing Systems, 2010.


References V

Y. Nesterov. A method for solving a convex programming problem with convergence rate O(1/k²). Soviet Math. Dokl., 27:372–376, 1983.

Y. Nesterov. Gradient methods for minimizing composite objective function. Technical report, CORE, 2007.

B. A. Olshausen and D. J. Field. Sparse coding with an overcomplete basis set: A strategy employed by V1? Vision Research, 37:3311–3325, 1997.

J. M. Shapiro. Embedded image coding using zerotrees of wavelet coefficients. IEEE Transactions on Signal Processing, 41(12):3445–3462, 1993.

K. Skretting and K. Engan. Recursive least squares dictionary learning algorithm. IEEE Transactions on Signal Processing, 58(4):2121–2130, 2010.

R. Tibshirani. Regression shrinkage and selection via the Lasso. Journal of the Royal Statistical Society, Series B, 58(1):267–288, 1996.


References VI

B. A. Turlach, W. N. Venables, and S. J. Wright. Simultaneous variable selection. Technometrics, 47(3):349–363, 2005.

P. Zhao, G. Rocha, and B. Yu. The composite absolute penalties family for grouped and hierarchical variable selection. Annals of Statistics, 37(6A):3468–3497, 2009.


First-order/proximal methods

min_{α∈R^p}  f(α) + λ Ω(α)

f is strictly convex and differentiable with a Lipschitz gradient.

Proximal methods generalize the idea of gradient descent:

α_{k+1} ← argmin_{α∈R^p}  f(α_k) + ∇f(α_k)ᵀ(α − α_k)  +  (L/2)‖α − α_k‖₂²  +  λ Ω(α)

= argmin_{α∈R^p}  ½‖α − (α_k − (1/L)∇f(α_k))‖₂²  +  (λ/L) Ω(α),

where f(α_k) + ∇f(α_k)ᵀ(α − α_k) is a linear approximation of f and (L/2)‖α − α_k‖₂² is a quadratic term.

When λ = 0, α_{k+1} ← α_k − (1/L)∇f(α_k); this is equivalent to a classical gradient descent step.

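The update above is one prox call away from code; a minimal sketch, with a made-up toy Lasso instance (sizes, λ, and the random data are illustrative assumptions):

```python
import numpy as np

def proximal_gradient(grad_f, prox, L, alpha0, n_iter=200):
    # alpha_{k+1} = prox( alpha_k - (1/L) grad f(alpha_k) );
    # with prox = identity this is plain gradient descent
    alpha = alpha0.copy()
    for _ in range(n_iter):
        alpha = prox(alpha - grad_f(alpha) / L)
    return alpha

# Toy instance: f(a) = 0.5*||x - D a||^2, Omega = l1 (prox = soft-thresholding)
rng = np.random.default_rng(0)
D = rng.standard_normal((8, 12))
x = rng.standard_normal(8)
lam = 0.3
L = np.linalg.norm(D, 2) ** 2                       # Lipschitz constant of grad f
prox = lambda v: np.sign(v) * np.maximum(np.abs(v) - lam / L, 0.0)
alpha = proximal_gradient(lambda a: D.T @ (D @ a - x), prox, L, np.zeros(12))
```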

First-order/proximal methods

They require efficiently solving the proximal operator:

min_{α∈R^p}  ½‖u − α‖₂² + λ Ω(α)

For the ℓ1 norm, this amounts to soft-thresholding:

α⋆ᵢ = sign(uᵢ)(|uᵢ| − λ)₊.

There exist accelerated versions based on Nesterov's optimal first-order method (a gradient method with "extrapolation") [Beck and Teboulle, 2009, Nesterov, 2007, 1983], suited for large-scale experiments.

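The closed form can be sanity-checked against a brute-force grid search on one coordinate; a throwaway verification sketch (the values of u and λ are made up):

```python
import numpy as np

def prox_l1(u, lam):
    # soft-thresholding: the closed-form prox of lam*||.||_1
    return np.sign(u) * np.maximum(np.abs(u) - lam, 0.0)

# brute-force check on one coordinate: minimize 0.5*(u - a)^2 + lam*|a|
u, lam = 1.7, 0.5
grid = np.linspace(-3.0, 3.0, 600001)
brute = grid[np.argmin(0.5 * (u - grid) ** 2 + lam * np.abs(grid))]
```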

Tree-structured groups

Proposition [Jenatton, Mairal, Obozinski, and Bach, 2010]

Suppose G is a tree-structured set of groups, i.e., ∀ g, h ∈ G, either g ∩ h = ∅, g ⊂ h, or h ⊂ g. For q = 2 or q = ∞, define Prox^g and Prox_Ω as

Prox^g : u ↦ argmin_{α∈R^p}  ½‖u − α‖₂² + λ‖α_g‖_q,

Prox_Ω : u ↦ argmin_{α∈R^p}  ½‖u − α‖₂² + λ ∑_{g∈G} ‖α_g‖_q.

If the groups are sorted from the leaves to the root, then

Prox_Ω = Prox^{g_m} ∘ . . . ∘ Prox^{g_1}.

→ Tree-structured regularization: efficient linear-time algorithm.

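The composition is straightforward to sketch for q = 2; the nested groups below are a made-up three-variable tree, and with disjoint singleton groups (a trivial forest) the composition reduces to plain soft-thresholding:

```python
import numpy as np

def prox_group(u, g, lam):
    # prox of lam*||a_g||_2, leaving coordinates outside g untouched
    a = u.copy()
    idx = np.asarray(g)
    norm = np.linalg.norm(u[idx])
    a[idx] = 0.0 if norm <= lam else (1.0 - lam / norm) * u[idx]
    return a

def prox_tree(u, groups, lam):
    # groups must be tree-structured and ordered from the leaves to the
    # root; the full prox is then the composition of the group proxes
    a = u.copy()
    for g in groups:
        a = prox_group(a, g, lam)
    return a

# Nested groups on 3 variables: {2} ⊂ {1,2} ⊂ {0,1,2}, leaves first
nested = prox_tree(np.array([2.0, 1.0, 0.5]), [[2], [1, 2], [0, 1, 2]], 0.4)
```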

General Overlapping Groups for q = ∞ [Mairal, Jenatton, Obozinski, and Bach, 2011]

Dual formulation

The solutions α⋆ and ξ⋆ of the following optimization problems

min_{α∈R^p}  ½‖u − α‖₂² + λ ∑_{g∈G} ‖α_g‖_∞    (Primal)

min_{ξ∈R^(p×|G|)}  ½‖u − ∑_{g∈G} ξ^g‖₂²  s.t.  ∀ g ∈ G, ‖ξ^g‖₁ ≤ λ and ξ^g_j = 0 if j ∉ g    (Dual)

satisfy

α⋆ = u − ∑_{g∈G} ξ⋆^g.    (Primal-dual relation)

The dual formulation has more variables, but is equivalent to a quadratic min-cost flow problem.

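A special case one can check numerically: with a single group covering all coordinates, the dual above is the Euclidean projection onto an ℓ1-ball of radius λ, and the primal-dual relation recovers the prox of λ‖·‖∞. A sketch; the sort-based projection routine is a standard method, an assumption here rather than part of the talk:

```python
import numpy as np

def project_l1_ball(u, radius):
    # Euclidean projection onto {v : ||v||_1 <= radius} (sort-based method)
    if np.abs(u).sum() <= radius:
        return u.copy()
    s = np.sort(np.abs(u))[::-1]
    cssv = np.cumsum(s) - radius
    rho = np.nonzero(s * np.arange(1, len(u) + 1) > cssv)[0][-1]
    theta = cssv[rho] / (rho + 1.0)
    return np.sign(u) * np.maximum(np.abs(u) - theta, 0.0)

def prox_linf(u, lam):
    # Single group g = {1,...,p}: the dual solution is the projection onto
    # the l1-ball of radius lam, and alpha* = u - xi* (primal-dual relation)
    return u - project_l1_ball(u, lam)
```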