Understanding Trainable Sparse Coding with Matrix Factorization
Thomas Moreau CMLA - ENS Paris-Saclay
J. Bruna, J. Audiffren
Plan
1 Physiological signals
2 End-to-end Approaches for Time Series
3 Post-training for Deep Learning
4 Adaptive Iterative Soft Thresholding
5 Numerical Experiments
Physiological signals
ECG
Physiological signals
EEG
Physiological signals
Oculometric signals
[Figure: eye-position traces (x-axis: Time [s], 0-6 s; y-axis: Eye positions)]
Physiological signals
Accelerometers
[Figure: accelerometer and angular-velocity traces (x-axis: Time [s]; y-axis: Acceleration/Angular Velocity)]
Statistical analysis for Time Series
- Failure of vectorial distances,
- Alignment issues, different lengths (can be addressed with DTW),
- "Curse of dimensionality",
- Different approaches, which can be classified in 2 categories:
  - Model-based methods: feature extraction + vectorial methods, ...
  - Data-driven methods: end-to-end models, neural networks, ...
End-to-end Approaches
Neural Networks:
- Raw signal as input: no feature engineering,
- Internally selects the data representation: adaptive,
- Representation adapted to the task: performant,
- Simple training algorithms: scalable.
Theoretical guarantees
The risk splits into 3 error terms [Bottou and Bousquet, 2008]:
- Approximation error: universal approximation [Hornik, 1991],
- Estimation error: generalization bounds [Kawaguchi et al., 2017],
- Optimization error: learning convexification [Haeffele and Vidal, 2017].
Neural Networks
Main drawback:
Lack of interpretability: the network is often seen as a black box.
How can we bring interpretability into the internal representation?
End-to-end Approaches
Task-driven Dictionary Learning [Mairal et al., 2012]:
- Raw signal as input: no feature engineering,
- Representation adapted to the task: performant,
- Complex training algorithms: not scalable,
- Highlights local structures: interpretable.
Problem statement
Can we study the links between these two models to bring more
interpretability into neural networks?
Post-training for Deep Learning
Paper with J. Audiffren: arXiv:1611.04499
Idea: split the representation learning from the task resolution:
- Post-training step: only train the last layer,
- Easy problem: this problem is often convex,
- Link with kernels: closed-form solution for the optimal last layer (sketched below),
- Experiments: consistent performance boost across multiple architectures.
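To make the closed-form idea concrete, here is a minimal numpy sketch (an assumed setup, not the paper's code): freeze the trained network, treat the penultimate activations as a fixed feature map, and solve for the last linear layer as a ridge regression.

```python
import numpy as np

def post_train_last_layer(phi, y, reg=1e-3):
    """Closed-form ridge solution for the last linear layer:
    W = (Phi^T Phi + reg * I)^(-1) Phi^T Y.

    phi : (n_samples, n_features) activations of the frozen penultimate layer
    y   : (n_samples, n_outputs) training targets
    """
    gram = phi.T @ phi + reg * np.eye(phi.shape[1])
    return np.linalg.solve(gram, phi.T @ y)

# Usage (names hypothetical):
#   phi = frozen_net_features(X_train)
#   W_last = post_train_last_layer(phi, Y_train)
```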
LASSO [Tibshirani, 1996]
The LASSO, or sparse coding, problem is defined as

    argmin_z F(z) := ‖x − Dz‖₂² + λ‖z‖₁ ,        (1)

with data-fit term E(z) = ‖x − Dz‖₂², where x ∈ R^P, D ∈ R^{P×K} and z ∈ R^K.

(1) can be rewritten as a proximal problem for the B-norm,

    argmin_z (y − z)ᵀB(y − z) + λ‖z‖₁   (= F(z)) ,

where B = DᵀD is the Gram matrix of D and y = D†x.
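A quick numerical sanity check of this rewriting (a numpy sketch, assuming the overcomplete case P ≤ K with D of full row rank, so that DD† = I_P):

```python
import numpy as np

rng = np.random.default_rng(0)
P, K, lam = 64, 100, 0.01
D = rng.standard_normal((P, K))    # overcomplete: full row rank almost surely
x = rng.standard_normal(P)
z = rng.standard_normal(K)

B = D.T @ D                        # Gram matrix of D
y = np.linalg.pinv(D) @ x          # y = D† x

F = np.sum((x - D @ z) ** 2) + lam * np.abs(z).sum()
F_B = (y - z) @ B @ (y - z) + lam * np.abs(z).sum()
print(np.allclose(F, F_B))         # True: the two forms coincide
```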
Learned ISTA [Gregor and Lecun, 2010]
Accelerate the LASSO resolution using a neural network.
[Figure: the ISTA recurrence (input X, weights W_e and W_g, output Z) unfolded into a feed-forward network with layer weights W_e^(0), W_g^(1), W_e^(1), W_g^(2), W_e^(2).]
Figure: Adapted from [Gregor and Lecun, 2010]
It links the dictionary learning model with the sparse representation.
Why does it work?
ISTA (1/2) [Daubechies et al., 2004, Beck and Teboulle, 2009]
Surrogate function F_q associated with the point z^(q):

    F_q(z) = E(z^(q)) + ⟨B(z^(q) − y), z − z^(q)⟩ + (‖B‖₂ / 2)‖z − z^(q)‖₂² + λ‖z‖₁ .

Properties
This surrogate function satisfies:
1. F_q(z^(q)) = F(z^(q)),
2. for all z, F_q(z) ≥ F(z),
3. solving argmin_z F_q(z) is computationally efficient.
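Property 3 holds because F_q is separable in the coordinates of z: its minimizer is a gradient step followed by coordinate-wise soft-thresholding (the proximal operator of the ℓ1-norm), argmin_z F_q(z) = soft_threshold(z^(q) − B(z^(q) − y)/‖B‖₂, λ/‖B‖₂). A minimal sketch of the operator:

```python
import numpy as np

def soft_threshold(u, tau):
    """Proximal operator of tau * ||.||_1, applied coordinate-wise."""
    return np.sign(u) * np.maximum(np.abs(u) - tau, 0.0)
```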
ISTA (2/2)
Iterative procedure: proximal splitting

    z^(q+1) = argmin_z F_q(z)
            = prox_{λ‖·‖₁}( z^(q) − (1/L)∇E(z^(q)) )        (2)

Properties
1. z* is a fixed point of (2),
2. efficient computation of z^(q+1), as the problem is separable,
3. convergence in O(1/q) in general.

[Figure: the quadratic surrogate F_q majorizes the cost F around z^(q); its minimizer gives the next iterate z^(q+1).]
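Combining (2) with the soft_threshold operator above gives a minimal ISTA sketch for problem (1) (a generic implementation, variable names assumed):

```python
import numpy as np

def ista(x, D, lam, n_iter=100):
    """ISTA for F(z) = ||x - Dz||_2^2 + lam * ||z||_1, as in (1)."""
    B = D.T @ D
    L = 2 * np.linalg.norm(B, 2)      # Lipschitz constant of grad E for E(z) = ||x - Dz||^2
    z = np.zeros(D.shape[1])
    for _ in range(n_iter):
        grad = 2 * D.T @ (D @ z - x)  # grad E(z)
        z = soft_threshold(z - grad / L, lam / L)
    return z
```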
Why does it work?
- Guaranteed descent:
  the construction of the next point guarantees that the cost function decreases,

      F(z^(q+1)) ≤ F_q(z^(q+1)) ≤ F_q(z^(q)) = F(z^(q)) .

- Efficient computation:
  with the isotropic quadratic form (L/2)·I_K, the function F_q is separable, so the computations are linear in K.
Toward an adaptive procedure
We define Q_S(u, v) = ½(u − v)ᵀS(u − v) + λ‖u‖₁ .

ISTA:

    F_q(z) = E(z^(q)) + ⟨B(z^(q) − y), z − z^(q)⟩ + Q_{L·I_K}(z, z^(q)) ,
    → min_z Q_{L·I_K}( z, z^(q) − (1/L)B(z^(q) − y) )
    ⇒ replace B with an upper bound L·I_K.

FacNet: for any diagonal matrix S and unitary matrix A, we define:

    F̃_q(z) = E(z^(q)) + ⟨B(z^(q) − y), z − z^(q)⟩ + Q_S(Az, Az^(q)) ,
    → min_z Q_S( Az, Az^(q) − S⁻¹AB(z^(q) − y) )
    ⇒ replace B with an approximation AᵀSA.

Can we choose A, S to accelerate the optimization compared to ISTA?
Toward an adaptive procedure
A similar iterative procedure, with steps adapted to the topology of the problem:

    F̃_q(z) = F(z) + (z − z^(q))ᵀR(z − z^(q)) + δ_A(z)

Tradeoff between:
- Rotation, to align the norm ‖·‖_B with the ℓ1-norm (computation): R = AᵀSA − B,
- Deformation of the ℓ1-norm by the rotation A (accuracy): δ_A(z) = λ(‖Az‖₁ − ‖z‖₁).
One step improvement
Proposition
Suppose that R = AᵀSA − B ≻ 0 is positive definite, and define

    z^(q+1) = argmin_z F̃_q(z) .

Then

    F(z^(q+1)) − F(z*) ≤ ½(z^(q) − z*)ᵀR(z^(q) − z*) + δ_A(z*) − δ_A(z^(q+1)) .

We are interested in factorizations (A, S) for which ‖R‖₂ and δ_A are small.
Adaptive Iterative Soft thresholding - Convergence rate
Theorem
Let (A_q, S_q) be the pairs of unitary and diagonal matrices at iteration q, chosen such that R_q = A_qᵀS_qA_q − B ⪰ 0. It results that

    F(z^(q)) − F(z*) ≤ [ (z* − z^(0))ᵀR_0(z* − z^(0)) + 2L_{A_0}(z^(1))‖z* − z^(1)‖₂ ] / (2q) + (α_q − β_q) / (2q) ,

with

    α_q = Σ_{i=1}^{q−1} ( 2L_{A_i}(z^(i+1))‖z* − z^(i+1)‖ + (z* − z^(i))ᵀ(R_{i−1} − R_i)(z* − z^(i)) ) ,

    β_q = Σ_{i=0}^{q−1} (i + 1)( (z^(i+1) − z^(i))ᵀR_i(z^(i+1) − z^(i)) + 2δ_{A_i}(z^(i+1)) − 2δ_{A_i}(z^(i)) ) ,

where L_A(z) denotes the local Lipschitz constant of δ_A at z.
Interpretation
- For A_q = I_K and S_q = ‖B‖₂·I_K, the procedure is equivalent to ISTA, with the same rate of convergence.
- If ‖R_0‖₂ + 2L_{A_0}(z^(1)) / ‖z* − z^(0)‖₂ ≤ ‖B‖₂ / 2, with A_q = I_K and S_q = ‖B‖₂·I_K for q > 0, then the procedure gets a head start compared to ISTA.
- Phase transition: the upper bound is improved when

      ‖R_q‖₂ + 2L_{A_q}(z^(q+1)) / ‖z* − z^(q)‖₂ ≤ ‖B‖₂ / 2 ;

  it is thus harder to gain as ‖z^(q) − z*‖₂ → 0.
Generic Dictionaries
A dictionary D ∈ R^{P×K} is a generic dictionary when its columns D_i are drawn uniformly over the ℓ2 unit sphere S^{P−1}.

Theorem (Acceleration conditions)
In expectation over the generic dictionary D, the factorization algorithm using a diagonally dominant matrix A ∈ E_δ has better performance at iteration q + 1 than the plain ISTA iteration (which uses the identity) when

    λ E_z[ ‖z^(q+1)‖₁ + ‖z*‖₁ ] ≤ (√(K(K−1)) / P) E_z[ ‖z^(q) − z*‖₂² ] ,

where the right-hand side is the expected resolution at iteration q.
Generic Dictionaries
Corollary (Acceleration conditions)
If the input distribution and the regularization parameter λ verify

    λ√P / 8 ≤ E_z[ ‖z*‖₁ ] ,

then for any resolution E_z[‖z^(q) − z*‖₂] = ε > 0 at iteration q, the performance of our factorization algorithm is better than that of ISTA, in expectation over the generic dictionaries.

FacNet can improve performance over ISTA whenever this condition is verified.
Learned ISTA [Gregor and Lecun, 2010]
[Figure: the ISTA recurrence (input X, weights W_e and W_g, output Z) and its unfolded feed-forward version with layer weights W_e^(0), W_g^(1), W_e^(1), W_g^(2), W_e^(2).]
Figure: Network architecture for ISTA/LISTA. LISTA is the unfolded version of the RNN of ISTA, trainable with back-propagation.
If W_e = Dᵀ/L and W_g = I − B/L, this network is exactly 2 iterations of ISTA.
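A minimal numpy sketch of this unfolded network (parameter names assumed): with W_e = Dᵀ/L, W_g = I − B/L and threshold λ/L it reproduces plain ISTA iterations, while LISTA instead learns W_e, W_g and the threshold by back-propagation.

```python
import numpy as np

def soft_threshold(u, tau):
    return np.sign(u) * np.maximum(np.abs(u) - tau, 0.0)

def lista_forward(x, We, Wg, theta, n_layers=2):
    """Forward pass of the unfolded (L)ISTA network."""
    z = soft_threshold(We @ x, theta)               # first layer: z = ST(W_e x)
    for _ in range(n_layers - 1):
        z = soft_threshold(Wg @ z + We @ x, theta)  # z = ST(W_g z + W_e x)
    return z
```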
FacNet
Specialization of LISTA:

    z^(q+1) = Aᵀ prox_S( Az^(q) − S⁻¹AB(z^(q) − y) ) ,

with A unitary and S diagonal. Same architecture, with more constraints on the parameter space:

    W_e = S⁻¹ADᵀ ,
    W_g = Aᵀ − S⁻¹ABAᵀ .

⇒ LISTA can be at least as good as this model.
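One FacNet iteration as a numpy sketch (shapes assumed): by the definition of Q_S, the prox_S step soft-thresholds coordinate k at level λ/s_k.

```python
import numpy as np

def facnet_step(z, y, B, A, s, lam):
    """One FacNet iteration: z+ = A^T prox_S(Az - S^-1 A B (z - y)),
    with A unitary and S = diag(s), s > 0."""
    v = A @ z - (A @ (B @ (z - y))) / s                     # gradient step, rotated frame
    v = np.sign(v) * np.maximum(np.abs(v) - lam / s, 0.0)   # prox_S: threshold at lam/s_k
    return A.T @ v
```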
Learned FISTA
The same ideas can also be applied to FISTA to obtain similar procedures:
[Figure: unfolded L-FISTA network with layer weights W_e^(0), W_g^(1)+W_m^(1), W_e^(1), W_g^(2)+W_m^(2), W_e^(2), W_g^(3), W_e^(3).]
Figure: Network architecture for L-FISTA.
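A sketch of such an unfolded L-FISTA layer, under the assumption (suggested by the figure) that each layer mixes the two previous iterates through an extra learned momentum weight W_m:

```python
import numpy as np

def soft_threshold(u, tau):
    return np.sign(u) * np.maximum(np.abs(u) - tau, 0.0)

def lfista_forward(x, We, Wg, Wm, theta, n_layers=3):
    """Forward pass of an unfolded L-FISTA-style network (assumed parameterization)."""
    z_prev = np.zeros(We.shape[0])
    z = soft_threshold(We @ x, theta)
    for _ in range(n_layers - 1):
        v = Wg @ z + Wm @ z_prev + We @ x    # learned momentum mixes two past iterates
        z_prev, z = z, soft_threshold(v, theta)
    return z
```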
Artificial simulation
Generating model:
- D = ( d₁/‖d₁‖₂ , ..., d_K/‖d_K‖₂ ) with d_k ~ N(0, I_P) for all k ∈ ⟦1, K⟧,
- z = (z₁, ..., z_K) constructed following a Bernoulli-Gaussian distribution:

      z_k = b_k·a_k ,  with b_k ~ B(ρ) and a ~ N(0, σ·I_K) ,

with K = 100 and P = 64 for the dimensions, σ = 10 and λ = 0.01.
⇒ The sparsity patterns are uniformly distributed.
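A numpy sketch of this generating model (sampling conventions assumed):

```python
import numpy as np

rng = np.random.default_rng(0)
K, P, rho, sigma = 100, 64, 1 / 20, 10.0

# Dictionary: K Gaussian atoms, normalized to the l2 unit sphere.
D = rng.standard_normal((P, K))
D /= np.linalg.norm(D, axis=0)

# Bernoulli-Gaussian codes: z_k = b_k * a_k, b_k ~ B(rho), a ~ N(0, sigma * I_K).
b = rng.random(K) < rho
a = sigma * rng.standard_normal(K)
z = b * a
x = D @ z    # observed signal
```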
Artificial simulation
[Figure: curves for ISTA, FISTA, Linear, L-ISTA, FacNet and L-FISTA.]
Evolution of the cost function F(z^(q)) − F(z*) with the number of layers/iterations q, for a sparse model (ρ = 1/20).
Artificial simulation
[Figure: curves for ISTA, FISTA, Linear, L-ISTA, FacNet and L-FISTA.]
Evolution of the cost function F(z^(q)) − F(z*) with the number of layers/iterations q, for a denser model (ρ = 1/4).
Adversarial dictionary
Adversarial dictionary:

    D = [ d₁ ... d_K ] ∈ R^{P×K} ,

with atoms built from complex exponentials, (d_j)_q = e^{−2iπjζ_q}, for a random subset of frequencies {ζ_i}_{i≤m}.

⇒ The eigenvectors of the Gram matrix B = DᵀD are far from the canonical basis.
Adversarial dictionary
[Figure: curves for ISTA, FISTA, L-ISTA and FacNet.]
Evolution of the cost function F(z^(q)) − F(z*) with the number of layers/iterations q, with an adversarial dictionary.
PASCAL 08
Sparse coding for the PASCAL 08 dataset over the Haar wavelet family.
The sparse coding is performed on patches of size 8 × 8.
Training over 500 images, testing over 100 images.
PASCAL 08
[Figure: curves for ISTA, FISTA, Linear, L-ISTA, FacNet and L-FISTA.]
Evolution of the cost function F(z^(q)) − F(z*) with the number of layers/iterations q, for PASCAL VOC 2008.
MNIST
Dictionary D with K = 100 atoms learned on 10,000 MNIST samples (17 × 17) with dictionary learning. LISTA is trained on the MNIST training set and tested on the MNIST test set.
[Figure: curves for ISTA, FISTA, Linear, L-ISTA, FacNet and L-FISTA.]
Evolution of the cost function F(z^(q)) − F(z*) with the number of layers/iterations q, for MNIST.
Conclusion
- Non-asymptotic acceleration is possible: approximate matrix factorization of B = DᵀD
  - Nearly diagonalize the kernel,
  - The ℓ1-norm is nearly invariant under this orthogonal transformation.
- Future work:
  - Improve the factorization formulation:

        min_{AᵀA = I_K} f(‖DA‖₁,₂) + λ_q‖A‖₁,₁ ,

  - Give generic bounds for sub-Gaussian D,
  - Link to Sparse PCA.
Conclusion
Questions?
Code: tomMoral/AdaptiveOptim
Paper: https://arxiv.org/abs/1706.01338
More at tommoral.github.io (@tomMoral)
References
Beck, A. and Teboulle, M. (2009). A Fast Iterative Shrinkage-Thresholding Algorithm for Linear Inverse Problems. SIAM Journal on Imaging Sciences, 2(1):183-202.
Bottou, L. and Bousquet, O. (2008). Learning using large datasets. Mining Massive Data Sets for Security, 3.
Daubechies, I., Defrise, M., and De Mol, C. (2004). An iterative thresholding algorithm for linear inverse problems with a sparsity constraint. Communications on Pure and Applied Mathematics, 57(11):1413-1457.
Gregor, K. and Lecun, Y. (2010). Learning Fast Approximations of Sparse Coding. In International Conference on Machine Learning (ICML), pages 399-406, Haifa, Israel.
Haeffele, B. D. and Vidal, R. (2017). Global Optimality in Neural Network Training. In Conference on Computer Vision and Pattern Recognition (CVPR), pages 7331-7339, Honolulu, HI, USA.
Hornik, K. (1991). Approximation capabilities of multilayer feedforward networks. Neural Networks, 4(2):251-257.
Kawaguchi, K., Pack Kaelbling, L., and Bengio, Y. (2017). Generalization in Deep Learning. Preprint, arXiv:1710.05468.
Mairal, J., Bach, F., and Ponce, J. (2012). Task-driven dictionary learning. IEEE Transactions on Pattern Analysis and Machine Intelligence, 34(4):791-804.
Tibshirani, R. (1996). Regression Shrinkage and Selection via the Lasso. Journal of the Royal Statistical Society, Series B (Methodological), 58(1):267-288.