Topographic Dictionary Learning with Structured Sparsity
Julien Mairal (1), Rodolphe Jenatton (2), Guillaume Obozinski (2), Francis Bach (2)
(1) UC Berkeley  (2) INRIA - SIERRA Project-Team
San Diego, Wavelets and Sparsity XIV, August 2011
Julien Mairal, UC Berkeley Topographic Dictionary Learning 1/32
What this work is about
Group sparsity with overlapping groups.
Hierarchical, topographic dictionary learning.
More generally: structured dictionaries of natural image patches.
Related publications:
[1] J. Mairal, R. Jenatton, G. Obozinski and F. Bach. Network Flow Algorithms for Structured Sparsity. NIPS, 2010.
[2] R. Jenatton, J. Mairal, G. Obozinski and F. Bach. Proximal Methods for Hierarchical Sparse Coding. JMLR, 2011.
Part I: Introduction to Dictionary Learning
What is a Sparse Linear Model?
Let x ∈ R^m be a signal.
Let D = [d1, . . . , dp] ∈ R^{m×p} be a set of normalized "basis vectors". We call it a dictionary.
D is "adapted" to x if it can represent it with a few basis vectors, that is, if there exists a sparse vector α ∈ R^p such that x ≈ Dα. We call α the sparse code.
x ≈ Dα,  with x ∈ R^m, D ∈ R^{m×p}, and α ∈ R^p sparse.
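A minimal numpy sketch of this model (the dimensions, atom indices, and coefficient values below are illustrative, not from the slides):

```python
import numpy as np

rng = np.random.default_rng(0)
m, p = 8, 20                       # signal dimension, number of atoms
D = rng.standard_normal((m, p))
D /= np.linalg.norm(D, axis=0)     # normalize the "basis vectors" (columns)

alpha = np.zeros(p)                # a sparse code: 3 non-zeros out of 20
alpha[[2, 7, 15]] = [1.5, -0.8, 0.3]

x = D @ alpha                      # the signal is a sparse combination of atoms
print(np.count_nonzero(alpha), x.shape)   # 3 (8,)
```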
The Sparse Decomposition Problem
min_{α ∈ R^p} (1/2)‖x − Dα‖₂² + λ ψ(α)

where the first term is the data-fitting term and ψ is a sparsity-inducing regularization.
ψ induces sparsity in α:
the ℓ0 "pseudo-norm": ‖α‖0 = #{i s.t. αi ≠ 0} (NP-hard),
the ℓ1 norm: ‖α‖1 = Σ_{i=1}^p |αi| (convex),
. . .
This is a selection problem. When ψ is the ℓ1 norm, the problem is called the Lasso [Tibshirani, 1996] or basis pursuit [Chen et al., 1999].
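For the ℓ1 case this problem can be solved with a few lines of proximal-gradient (ISTA) iterations; the sketch below is a toy illustration under our own names and parameter choices, not the SPAMS implementation:

```python
import numpy as np

def soft_threshold(v, t):
    """Entrywise soft-thresholding: the proximal operator of t * ||.||_1."""
    return np.sign(v) * np.maximum(np.abs(v) - t, 0.0)

def lasso_ista(x, D, lam, n_iter=500):
    """Solve min_alpha 0.5 * ||x - D alpha||_2^2 + lam * ||alpha||_1 by ISTA."""
    L = np.linalg.norm(D, 2) ** 2        # Lipschitz constant of the gradient
    alpha = np.zeros(D.shape[1])
    for _ in range(n_iter):
        grad = D.T @ (D @ alpha - x)     # gradient of the data-fitting term
        alpha = soft_threshold(alpha - grad / L, lam / L)
    return alpha

rng = np.random.default_rng(0)
D = rng.standard_normal((10, 30))
D /= np.linalg.norm(D, axis=0)             # normalized dictionary
x = D[:, [3, 11]] @ np.array([1.0, -0.5])  # signal built from two atoms
alpha = lasso_ista(x, D, lam=0.05)
print(np.count_nonzero(np.abs(alpha) > 1e-3))  # only a few atoms are selected
```

Soft-thresholding produces exact zeros, which is what makes this a selection procedure rather than mere shrinkage.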
Sparse representations for image restoration
Designed dictionaries
[Haar, 1910], [Zweig, Morlet, Grossman ∼70s], [Meyer, Mallat, Daubechies, Coifman, Donoho, Candes ∼80s-today]. . .
Wavelets, Curvelets, Wedgelets, Bandlets, . . . lets
Learned dictionaries of patches
[Olshausen and Field, 1997, Engan et al., 1999, Lewicki and Sejnowski, 2000, Aharon et al., 2006], . . .
min_{αi, D ∈ D} Σ_{i=1}^n (1/2)‖xi − Dαi‖₂² + λ ψ(αi)

(reconstruction term + sparsity term), with
ψ(α) = ‖α‖0 ("ℓ0 pseudo-norm"),
ψ(α) = ‖α‖1 (ℓ1 norm).
Sparse representations for image restoration
Grayscale vs color image patches
Figure: Left: learned on grayscale image patches. Right: learned on color image patches (after removing the mean color from each patch).
Algorithms
min_{α ∈ R^{p×n}, D ∈ D} Σ_{i=1}^n (1/2)‖xi − Dαi‖₂² + λ ψ(αi).
How do we optimize that?
alternate between D and α [Engan et al., 1999], or other variants [Elad and Aharon, 2006]
online learning [Olshausen and Field, 1997, Mairal et al., 2009, Skretting and Engan, 2010]
Code SPAMS available: http://www.di.ens.fr/willow/SPAMS/, now open-source!
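A toy sketch of the alternating scheme (a sparse-coding step with D fixed, then a dictionary step with the codes fixed). This is a didactic simplification with our own function names and constants, not the algorithm of any cited paper:

```python
import numpy as np

def soft_threshold(v, t):
    return np.sign(v) * np.maximum(np.abs(v) - t, 0.0)

def dict_learning(X, p, lam=0.1, n_outer=20, n_inner=50, seed=0):
    """Toy alternating minimization for sum_i 0.5||x_i - D a_i||^2 + lam||a_i||_1:
    a sparse-coding step (ISTA) with D fixed, then a dictionary step with A fixed."""
    rng = np.random.default_rng(seed)
    m, n = X.shape
    D = rng.standard_normal((m, p))
    D /= np.linalg.norm(D, axis=0)
    A = np.zeros((p, n))
    for _ in range(n_outer):
        # 1) sparse coding (all columns at once; the problem is separable)
        L = np.linalg.norm(D, 2) ** 2
        for _ in range(n_inner):
            A = soft_threshold(A - D.T @ (D @ A - X) / L, lam / L)
        # 2) dictionary update: ridge-regularized least squares, then renormalize
        D = X @ A.T @ np.linalg.inv(A @ A.T + 1e-6 * np.eye(p))
        D /= np.maximum(np.linalg.norm(D, axis=0), 1e-12)
    return D, A

rng = np.random.default_rng(1)
X = rng.standard_normal((8, 50))      # 50 toy "patches" of dimension 8
D, A = dict_learning(X, p=12, lam=0.2)
print(D.shape, A.shape)               # (8, 12) (12, 50)
```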
Part II: Introduction to Structured Sparsity (Let us play with ψ)
Group Sparsity-Inducing Norms
min_{α ∈ R^p} (1/2)‖x − Dα‖₂² + λ ψ(α)

with ψ a sparsity-inducing norm.
The most popular choice for ψ:
The ℓ1 norm, ψ(α) = ‖α‖1.
However, the ℓ1 norm encodes little information beyond cardinality!
Another popular choice for ψ:
The ℓ1-ℓq norm [Turlach et al., 2005], with q = 2 or q = ∞:

ψ(α) = Σ_{g ∈ G} ‖αg‖q,  with G a partition of {1, . . . , p}.

The ℓ1-ℓq norm sets groups of non-overlapping variables to zero (as opposed to single variables for the ℓ1 norm).
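For a partition (non-overlapping groups) and q = 2, the proximal operator of this norm has a closed form, block soft-thresholding, which makes the group-zeroing behavior concrete. A small sketch (the group layout and values are illustrative):

```python
import numpy as np

def group_soft_threshold(u, groups, lam):
    """Prox of lam * sum_g ||alpha_g||_2 for a partition of the indices:
    each block is shrunk toward zero and vanishes as a whole."""
    alpha = np.array(u, dtype=float)
    for g in groups:
        norm = np.linalg.norm(alpha[g])
        scale = max(1.0 - lam / norm, 0.0) if norm > 0 else 0.0
        alpha[g] *= scale
    return alpha

u = np.array([0.1, -0.2, 3.0, 2.0, 0.05, 0.0])
groups = [[0, 1], [2, 3], [4, 5]]     # a partition of {0, ..., 5}
alpha = group_soft_threshold(u, groups, lam=0.5)
print(alpha)   # the two weak groups vanish as blocks; the strong one is shrunk
```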
Structured Sparsity with Overlapping Groups
Warning: under the name "structured sparsity", significantly different formulations in fact appear!
1 non-convex
zero-tree wavelets [Shapiro, 1993]
sparsity patterns are in a predefined collection: [Baraniuk et al., 2010]
select a union of groups: [Huang et al., 2009]
structure via Markov Random Fields: [Cevher et al., 2008]
2 convex
tree-structure: [Zhao et al., 2009]
non-zero patterns are a union of groups: [Jacob et al., 2009]
zero patterns are a union of groups: [Jenatton et al., 2009]
other norms: [Micchelli et al., 2010]
Structured Sparsity with Overlapping Groups
ψ(α) = Σ_{g ∈ G} ‖αg‖q
What happens when the groups overlap? [Jenatton et al., 2009]
Inside the groups, the ℓ2-norm (or ℓ∞) does not promote sparsity.
Variables belonging to the same groups are encouraged to be set to zero together.
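A tiny numerical illustration of why this happens (the groups and values below are ours): a vector whose zero set is a union of groups kills whole group norms and pays a smaller penalty, so such patterns are favored:

```python
import numpy as np

def overlapping_group_norm(alpha, groups):
    """psi(alpha) = sum over (possibly overlapping) groups g of ||alpha_g||_2."""
    return sum(np.linalg.norm(alpha[g]) for g in groups)

# Three overlapping groups on p = 4 variables.
groups = [[0, 1], [1, 2], [2, 3]]
a = np.array([1.0, 0.0, 0.0, 1.0])   # zero set {1, 2} is the union of no... of group [1, 2]
b = np.array([1.0, 0.0, 1.0, 0.0])   # zero set {1, 3} is not a union of groups
print(overlapping_group_norm(a, groups))  # 1 + 0 + 1 = 2.0
print(overlapping_group_norm(b, groups))  # 1 + 1 + 1 = 3.0
```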
Examples of sets of groups G [Jenatton et al., 2009]
Selection of contiguous patterns on a sequence, p = 6.
G is the set of blue groups.
Any union of blue groups set to zero leads to the selection of a contiguous pattern.
Hierarchical Norms [Zhao et al., 2009]
A node can be active only if its ancestors are active. The selected patterns are rooted subtrees.
Algorithms/Difficulties [Jenatton et al., 2010, Mairal et al., 2011]
min_{α ∈ R^p} (1/2)‖x − Dα‖₂² + λ Σ_{g ∈ G} ‖αg‖q.
The objective is convex and non-differentiable; the regularizer is a sum of simple but non-separable terms.
How do we optimize that?
hierarchical norms: same complexity as ℓ1 with proximal methods.
general case: Augmented Lagrangian techniques.
general case with ℓ∞-norms: proximal methods combined with network-flow optimization.
Also implemented in the SPAMS toolbox.
Part III: Learning Structured Dictionaries
Topographic Dictionary Learning
[Kavukcuoglu et al., 2009]: organize the dictionary elements on a 2D grid and use ψ with e × e overlapping groups.
[Garrigues and Olshausen, 2010]: sparse coding + a probabilistic model of lateral interactions.
topographic ICA by Hyvarinen et al. [2001].
Topographic Dictionary Learning [Mairal, Jenatton, Obozinski, and Bach, 2011], 3 × 3 neighborhoods
Topographic Dictionary Learning [Mairal, Jenatton, Obozinski, and Bach, 2011], 4 × 4 neighborhoods
Hierarchical Dictionary Learning [Jenatton, Mairal, Obozinski, and Bach, 2010]
Conclusion / Discussion
Structured sparsity is a natural framework for learning structureddictionaries...
...and has efficient optimization tools.
other applications in natural language processing, bio-informatics,neuroscience...
SPAMS toolbox (open-source)
C++ interfaced with Matlab.
proximal gradient methods for ℓ0, ℓ1, elastic-net, fused-Lasso, group-Lasso, tree group-Lasso, tree-ℓ0, sparse group-Lasso, overlapping group-Lasso...
...for square, logistic, and multi-class logistic loss functions.
handles sparse matrices,
provides duality gaps.
also coordinate descent, block coordinate descent algorithms.
fastest available implementation of OMP and LARS.
dictionary learning and matrix factorization (NMF, sparse PCA).
fast projections onto some convex sets.
Try it! http://www.di.ens.fr/willow/SPAMS/
References I
M. Aharon, M. Elad, and A. M. Bruckstein. The K-SVD: An algorithm for designing overcomplete dictionaries for sparse representations. IEEE Transactions on Signal Processing, 54(11):4311–4322, November 2006.
R. G. Baraniuk, V. Cevher, M. Duarte, and C. Hegde. Model-based compressive sensing. IEEE Transactions on Information Theory, 2010. To appear.
A. Beck and M. Teboulle. A fast iterative shrinkage-thresholding algorithm for linear inverse problems. SIAM Journal on Imaging Sciences, 2(1):183–202, 2009.
V. Cevher, M. F. Duarte, C. Hegde, and R. G. Baraniuk. Sparse signal recovery using Markov random fields. In Advances in Neural Information Processing Systems, 2008.
References II
S. S. Chen, D. L. Donoho, and M. A. Saunders. Atomic decomposition by basis pursuit. SIAM Journal on Scientific Computing, 20:33–61, 1999.
M. Elad and M. Aharon. Image denoising via sparse and redundant representations over learned dictionaries. IEEE Transactions on Image Processing, 15(12):3736–3745, December 2006.
K. Engan, S. O. Aase, and J. H. Husoy. Frame based signal compression using method of optimal directions (MOD). In Proceedings of the 1999 IEEE International Symposium on Circuits and Systems, volume 4, 1999.
P. Garrigues and B. Olshausen. Group sparse coding with a Laplacian scale mixture prior. In Advances in Neural Information Processing Systems, 2010.
A. Haar. Zur Theorie der orthogonalen Funktionensysteme. Mathematische Annalen, 69:331–371, 1910.
References III
J. Huang, Z. Zhang, and D. Metaxas. Learning with structured sparsity. In Proceedings of the International Conference on Machine Learning (ICML), 2009.
A. Hyvarinen, P. Hoyer, and M. Inki. Topographic independent component analysis. Neural Computation, 13(7):1527–1558, 2001.
L. Jacob, G. Obozinski, and J.-P. Vert. Group Lasso with overlap and graph Lasso. In Proceedings of the International Conference on Machine Learning (ICML), 2009.
R. Jenatton, J.-Y. Audibert, and F. Bach. Structured variable selection with sparsity-inducing norms. Technical report, 2009. Preprint arXiv:0904.3523v1.
R. Jenatton, J. Mairal, G. Obozinski, and F. Bach. Proximal methods for sparse hierarchical dictionary learning. In Proceedings of the International Conference on Machine Learning (ICML), 2010.
References IV
K. Kavukcuoglu, M. Ranzato, R. Fergus, and Y. LeCun. Learning invariant features through topographic filter maps. In Proceedings of CVPR, 2009.
M. S. Lewicki and T. J. Sejnowski. Learning overcomplete representations. Neural Computation, 12(2):337–365, 2000.
J. Mairal, F. Bach, J. Ponce, and G. Sapiro. Online dictionary learning for sparse coding. In Proceedings of the International Conference on Machine Learning (ICML), 2009.
J. Mairal, R. Jenatton, G. Obozinski, and F. Bach. Convex and network flow optimization for structured sparsity. Preprint arXiv:1104.1872, 2011.
C. A. Micchelli, J. M. Morales, and M. Pontil. A family of penalty functions for structured sparsity. In Advances in Neural Information Processing Systems, 2010.
References V
Y. Nesterov. A method for solving a convex programming problem with convergence rate O(1/k^2). Soviet Math. Dokl., 27:372–376, 1983.
Y. Nesterov. Gradient methods for minimizing composite objective function. Technical report, CORE, 2007.
B. A. Olshausen and D. J. Field. Sparse coding with an overcomplete basis set: A strategy employed by V1? Vision Research, 37:3311–3325, 1997.
J. M. Shapiro. Embedded image coding using zerotrees of wavelet coefficients. IEEE Transactions on Signal Processing, 41(12):3445–3462, 1993.
K. Skretting and K. Engan. Recursive least squares dictionary learning algorithm. IEEE Transactions on Signal Processing, 58(4):2121–2130, 2010.
R. Tibshirani. Regression shrinkage and selection via the Lasso. Journal of the Royal Statistical Society, Series B, 58(1):267–288, 1996.
References VI
B. A. Turlach, W. N. Venables, and S. J. Wright. Simultaneous variable selection. Technometrics, 47(3):349–363, 2005.
P. Zhao, G. Rocha, and B. Yu. The composite absolute penalties family for grouped and hierarchical variable selection. Annals of Statistics, 37(6A):3468–3497, 2009.
First-order/proximal methods
min_{α ∈ R^p} f(α) + λ Ω(α)

f is strictly convex and differentiable with a Lipschitz gradient.

Generalizes the idea of gradient descent:

α_{k+1} ← argmin_{α ∈ R^p} f(α_k) + ∇f(α_k)^T (α − α_k) + (L/2)‖α − α_k‖₂² + λ Ω(α)
        = argmin_{α ∈ R^p} (1/2)‖α − (α_k − (1/L)∇f(α_k))‖₂² + (λ/L) Ω(α)

(a linear approximation of f around α_k, plus a quadratic term keeping the update close to α_k).

When λ = 0, α_{k+1} ← α_k − (1/L)∇f(α_k), which is a classical gradient descent step.
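The update only needs a gradient of f and a proximal operator of Ω; a generic sketch (names, the quadratic f, and all constants are ours), checking that with λ = 0 the iteration reduces to plain gradient descent:

```python
import numpy as np

def proximal_gradient(grad_f, prox, alpha0, L, lam, n_iter=500):
    """Generic proximal-gradient iteration:
    alpha <- prox_{(lam/L) Omega}(alpha - grad_f(alpha) / L)."""
    alpha = alpha0.copy()
    for _ in range(n_iter):
        alpha = prox(alpha - grad_f(alpha) / L, lam / L)
    return alpha

# Sanity check: with lam = 0 the prox is the identity and the iteration is
# plain gradient descent, so it reaches the minimum of a strictly convex f.
rng = np.random.default_rng(0)
A = rng.standard_normal((5, 5))
Q = A.T @ A + np.eye(5)                  # f(a) = 0.5 a'Qa - b'a, strictly convex
b = rng.standard_normal(5)
grad_f = lambda a: Q @ a - b
L = np.linalg.norm(Q, 2)                 # Lipschitz constant of grad f
alpha = proximal_gradient(grad_f, lambda v, t: v, np.zeros(5), L, lam=0.0)
print(np.allclose(Q @ alpha, b, atol=1e-6))   # True: Q alpha = b at the optimum
```

Plugging in a non-trivial prox (soft-thresholding, block soft-thresholding, ...) recovers the regularized algorithms above without touching the loop.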
First-order/proximal methods
They require efficiently solving the proximal problem:

min_{α ∈ R^p} (1/2)‖u − α‖₂² + λ Ω(α)

For the ℓ1 norm, this amounts to soft-thresholding:

α⋆_i = sign(u_i)(|u_i| − λ)_+.
There exist accelerated versions based on Nesterov's optimal first-order method (a gradient method with "extrapolation") [Beck and Teboulle, 2009, Nesterov, 2007, 1983], well suited for large-scale experiments.
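A quick brute-force sanity check of the soft-thresholding formula (the values are arbitrary; since the ℓ1 prox is separable across coordinates, checking the scalar case suffices):

```python
import numpy as np

def soft_threshold(u, lam):
    """Closed form of argmin_a 0.5*(u - a)**2 + lam*|a|."""
    return np.sign(u) * np.maximum(np.abs(u) - lam, 0.0)

# Minimize the scalar proximal objective on a fine grid and compare.
u, lam = 1.3, 0.4
grid = np.linspace(-3.0, 3.0, 600001)            # grid spacing 1e-5
brute = grid[np.argmin(0.5 * (u - grid) ** 2 + lam * np.abs(grid))]
print(brute, soft_threshold(u, lam))             # both approximately 0.9
```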
Tree-structured groups
Proposition [Jenatton, Mairal, Obozinski, and Bach, 2010]

If G is a tree-structured set of groups, i.e., ∀ g, h ∈ G,

g ∩ h = ∅ or g ⊂ h or h ⊂ g,

and, for q = 2 or q = ∞, we define Prox^g and Prox_Ω as

Prox^g : u → argmin_{α ∈ R^p} (1/2)‖u − α‖₂² + λ‖αg‖q,
Prox_Ω : u → argmin_{α ∈ R^p} (1/2)‖u − α‖₂² + λ Σ_{g ∈ G} ‖αg‖q,

then, if the groups are sorted from the leaves to the root,

Prox_Ω = Prox^{gm} ∘ . . . ∘ Prox^{g1}.

→ Tree-structured regularization: efficient linear-time algorithm.
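A small numpy sketch of this composition for q = 2 (the tree, λ, and input below are illustrative):

```python
import numpy as np

def block_soft_threshold(alpha, g, lam):
    """Prox of lam*||alpha_g||_2, acting on the variables in group g only."""
    out = alpha.copy()
    norm = np.linalg.norm(out[g])
    out[g] *= max(1.0 - lam / norm, 0.0) if norm > 0 else 0.0
    return out

def tree_prox(u, groups, lam):
    """Prox of lam * sum_g ||alpha_g||_2 for tree-structured groups, computed
    as the composition of the group proxes, ordered from leaves to root."""
    alpha = np.array(u, dtype=float)
    for g in groups:                  # groups assumed sorted leaves -> root
        alpha = block_soft_threshold(alpha, g, lam)
    return alpha

# A tiny tree on 3 variables: {2} inside {1, 2} inside the root {0, 1, 2}.
groups = [[2], [1, 2], [0, 1, 2]]     # leaves first, root last
u = np.array([2.0, 1.0, 0.1])
res = tree_prox(u, groups, lam=0.3)
print(res)                            # variable 2 is zeroed; 0 and 1 are shrunk
```

Once a leaf group is zeroed, no later (ancestor) prox can revive it, which is exactly the rooted-subtree support pattern of hierarchical norms.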
General Overlapping Groups for q = ∞ [Mairal, Jenatton, Obozinski, and Bach, 2011]
Dual formulation
The solutions α⋆ and ξ⋆ of the following optimization problems

min_{α ∈ R^p} (1/2)‖u − α‖₂² + λ Σ_{g ∈ G} ‖αg‖∞,   (Primal)

min_{ξ ∈ R^{p×|G|}} (1/2)‖u − Σ_{g ∈ G} ξ^g‖₂²  s.t.  ∀ g ∈ G, ‖ξ^g‖₁ ≤ λ and ξ^g_j = 0 if j ∉ g,   (Dual)

satisfy

α⋆ = u − Σ_{g ∈ G} ξ⋆^g.   (Primal-dual relation)
The dual formulation has more variables, but is equivalent to a quadratic min-cost flow problem.