Understanding Trainable Sparse Coding with Matrix Factorization
Thomas Moreau CMLA - ENS Paris-Saclay
J. Bruna, J. Audiffren
Plan
1 Physiological signals
2 End-to-end Approaches for Time Series
3 Post-training for Deep Learning
4 Adaptive Iterative Soft Thresholding
5 Numerical Experiments
Physiological signals
ECG
Physiological signals
EEG
Physiological signals
Oculometric signals
[Figure: eye-position traces (x-axis: Time [s], 0-6 s; y-axis: Eye positions)]
Physiological signals
Accelerometers
[Figure: accelerometer and angular-velocity traces (x-axis: Time [s]; y-axis: Acceleration/Angular Velocity)]
Statistical analysis for Time Series
- Failure of vectorial distances,
- Alignment issues, different lengths (can be addressed with DTW),
- "Curse of dimensionality",
- Different approaches, which can be classified in 2 categories:
  - Model-based methods: feature extraction + vectorial methods, ...
  - Data-driven methods: end-to-end models, neural networks, ...
End-to-end Approaches
Neural Networks:
- Raw signal as input: no feature engineering,
- Internally selects the data representation: adaptive,
- Representation adapted to the task: performant,
- Simple training algorithms: scalable.
Theoretical guarantees
The risk splits into 3 error terms [Bottou and Bousquet, 2008]:
- Approximation error: universal approximation [Hornik, 1991],
- Estimation error: generalization bounds [Kawaguchi et al., 2017],
- Optimization error: learning convexification [Haeffele and Vidal, 2017].
Neural Networks
Main drawback:
Lack of interpretability: the network is often seen as a black box.
How can we bring interpretability into the internal representation?
End-to-end Approaches
Task-driven Dictionary Learning [Mairal et al., 2012]:
- Raw signal as input: no feature engineering,
- Representation adapted to the task: performant,
- Complex training algorithms: not scalable,
- Highlights local structures: interpretable.
Problem statement
Can we study the links between these two models to bring more
interpretability into neural networks?
Post-training for Deep Learning
Paper with J. Audiffren: arXiv:1611.04499
Idea: split the representation learning from the task resolution:
- Post-training step: only train the last layer,
- Easy problem: this problem is often convex,
- Link with kernels: closed-form solution for the optimal last layer (sketched below),
- Experiments: consistent performance boost across multiple architectures.
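To make the closed-form idea concrete, here is a minimal numpy sketch (an assumed setup, not the paper's code): freeze the trained network, treat the penultimate activations as a fixed feature map, and solve for the last linear layer as a ridge regression.

```python
import numpy as np

def post_train_last_layer(phi, y, reg=1e-3):
    """Closed-form ridge solution for the last linear layer:
    W = (Phi^T Phi + reg * I)^(-1) Phi^T Y.

    phi : (n_samples, n_features) activations of the frozen penultimate layer
    y   : (n_samples, n_outputs) training targets
    """
    gram = phi.T @ phi + reg * np.eye(phi.shape[1])
    return np.linalg.solve(gram, phi.T @ y)

# Usage (names hypothetical):
#   phi = frozen_net_features(X_train)
#   W_last = post_train_last_layer(phi, Y_train)
```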
LASSO [Tibshirani, 1996]
The LASSO, or sparse coding, problem is defined as

    argmin_z F(z) := ‖x − Dz‖₂² + λ‖z‖₁ ,        (1)

with data-fit term E(z) = ‖x − Dz‖₂², where x ∈ R^P, D ∈ R^{P×K} and z ∈ R^K.

(1) can be rewritten as a proximal problem for the B-norm,

    argmin_z (y − z)ᵀB(y − z) + λ‖z‖₁   (= F(z)) ,

where B = DᵀD is the Gram matrix of D and y = D†x.
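A quick numerical sanity check of this rewriting (a numpy sketch, assuming the overcomplete case P ≤ K with D of full row rank, so that DD† = I_P):

```python
import numpy as np

rng = np.random.default_rng(0)
P, K, lam = 64, 100, 0.01
D = rng.standard_normal((P, K))    # overcomplete: full row rank almost surely
x = rng.standard_normal(P)
z = rng.standard_normal(K)

B = D.T @ D                        # Gram matrix of D
y = np.linalg.pinv(D) @ x          # y = D† x

F = np.sum((x - D @ z) ** 2) + lam * np.abs(z).sum()
F_B = (y - z) @ B @ (y - z) + lam * np.abs(z).sum()
print(np.allclose(F, F_B))         # True: the two forms coincide
```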
Learned ISTA [Gregor and Lecun, 2010]
Accelerate the LASSO resolution using a neural network.
[Figure: the ISTA recurrence (input X, weights W_e and W_g, output Z) unfolded into a feed-forward network with layer weights W_e^(0), W_g^(1), W_e^(1), W_g^(2), W_e^(2).]
Figure: Adapted from [Gregor and Lecun, 2010]
It links the dictionary learning model with the sparse representation.
Why does it work?
ISTA (1/2) [Daubechies et al., 2004, Beck and Teboulle, 2009]
Surrogate function F_q associated with the point z^(q):

    F_q(z) = E(z^(q)) + ⟨B(z^(q) − y), z − z^(q)⟩ + (‖B‖₂ / 2)‖z − z^(q)‖₂² + λ‖z‖₁ .

Properties
This surrogate function satisfies:
1. F_q(z^(q)) = F(z^(q)),
2. for all z, F_q(z) ≥ F(z),
3. solving argmin_z F_q(z) is computationally efficient.
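Property 3 holds because F_q is separable in the coordinates of z: its minimizer is a gradient step followed by coordinate-wise soft-thresholding (the proximal operator of the ℓ1-norm), argmin_z F_q(z) = soft_threshold(z^(q) − B(z^(q) − y)/‖B‖₂, λ/‖B‖₂). A minimal sketch of the operator:

```python
import numpy as np

def soft_threshold(u, tau):
    """Proximal operator of tau * ||.||_1, applied coordinate-wise."""
    return np.sign(u) * np.maximum(np.abs(u) - tau, 0.0)
```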
ISTA (2/2)
Iterative procedure: proximal splitting

    z^(q+1) = argmin_z F_q(z)
            = prox_{λ‖·‖₁}( z^(q) − (1/L)∇E(z^(q)) )        (2)

Properties
1. z* is a fixed point of (2),
2. efficient computation of z^(q+1), as the problem is separable,
3. convergence in O(1/q) in general.

[Figure: the quadratic surrogate F_q majorizes the cost F around z^(q); its minimizer gives the next iterate z^(q+1).]
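Combining (2) with the soft_threshold operator above gives a minimal ISTA sketch for problem (1) (a generic implementation, variable names assumed):

```python
import numpy as np

def ista(x, D, lam, n_iter=100):
    """ISTA for F(z) = ||x - Dz||_2^2 + lam * ||z||_1, as in (1)."""
    B = D.T @ D
    L = 2 * np.linalg.norm(B, 2)      # Lipschitz constant of grad E for E(z) = ||x - Dz||^2
    z = np.zeros(D.shape[1])
    for _ in range(n_iter):
        grad = 2 * D.T @ (D @ z - x)  # grad E(z)
        z = soft_threshold(z - grad / L, lam / L)
    return z
```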
Why does it work?
- Guaranteed descent:
  the construction of the next point guarantees that the cost function decreases,

      F(z^(q+1)) ≤ F_q(z^(q+1)) ≤ F_q(z^(q)) = F(z^(q)) .

- Efficient computation:
  with the isotropic quadratic form (L/2)·I_K, the function F_q is separable, so the computations are linear in K.
Toward an adaptive procedure
We define Q_S(u, v) = ½(u − v)ᵀS(u − v) + λ‖u‖₁ .

ISTA:

    F_q(z) = E(z^(q)) + ⟨B(z^(q) − y), z − z^(q)⟩ + Q_{L·I_K}(z, z^(q)) ,
    → min_z Q_{L·I_K}( z, z^(q) − (1/L)B(z^(q) − y) )
    ⇒ replace B with an upper bound L·I_K.

FacNet: for any diagonal matrix S and unitary matrix A, we define:

    F̃_q(z) = E(z^(q)) + ⟨B(z^(q) − y), z − z^(q)⟩ + Q_S(Az, Az^(q)) ,
    → min_z Q_S( Az, Az^(q) − S⁻¹AB(z^(q) − y) )
    ⇒ replace B with an approximation AᵀSA.

Can we choose A, S to accelerate the optimization compared to ISTA?
Toward an adaptive procedure
A similar iterative procedure, with steps adapted to the topology of the problem:

    F̃_q(z) = F(z) + (z − z^(q))ᵀR(z − z^(q)) + δ_A(z)

Tradeoff between:
- Rotation, to align the norm ‖·‖_B with the ℓ1-norm (computation): R = AᵀSA − B,
- Deformation of the ℓ1-norm by the rotation A (accuracy): δ_A(z) = λ(‖Az‖₁ − ‖z‖₁).
One step improvement
Proposition
Suppose that R = AᵀSA − B ≻ 0 is positive definite, and define

    z^(q+1) = argmin_z F̃_q(z) .

Then

    F(z^(q+1)) − F(z*) ≤ ½(z^(q) − z*)ᵀR(z^(q) − z*) + δ_A(z*) − δ_A(z^(q+1)) .

We are interested in factorizations (A, S) for which ‖R‖₂ and δ_A are small.
Adaptive Iterative Soft thresholding - Convergence rate
Theorem
Let (A_q, S_q) be the pairs of unitary and diagonal matrices at iteration q, chosen such that R_q = A_qᵀS_qA_q − B ⪰ 0. It results that

    F(z^(q)) − F(z*) ≤ [ (z* − z^(0))ᵀR_0(z* − z^(0)) + 2L_{A_0}(z^(1))‖z* − z^(1)‖₂ ] / (2q) + (α_q − β_q) / (2q) ,

with

    α_q = Σ_{i=1}^{q−1} ( 2L_{A_i}(z^(i+1))‖z* − z^(i+1)‖ + (z* − z^(i))ᵀ(R_{i−1} − R_i)(z* − z^(i)) ) ,

    β_q = Σ_{i=0}^{q−1} (i + 1)( (z^(i+1) − z^(i))ᵀR_i(z^(i+1) − z^(i)) + 2δ_{A_i}(z^(i+1)) − 2δ_{A_i}(z^(i)) ) ,

where L_A(z) denotes the local Lipschitz constant of δ_A at z.
Interpretation
- For A_q = I_K and S_q = ‖B‖₂·I_K, the procedure is equivalent to ISTA, with the same rate of convergence.
- If ‖R_0‖₂ + 2L_{A_0}(z^(1)) / ‖z* − z^(0)‖₂ ≤ ‖B‖₂ / 2, with A_q = I_K and S_q = ‖B‖₂·I_K for q > 0, then the procedure gets a head start compared to ISTA.
- Phase transition: the upper bound is improved when

      ‖R_q‖₂ + 2L_{A_q}(z^(q+1)) / ‖z* − z^(q)‖₂ ≤ ‖B‖₂ / 2 ;

  it is thus harder to gain as ‖z^(q) − z*‖₂ → 0.
Generic Dictionaries
A dictionary D ∈ R^{P×K} is a generic dictionary when its columns D_i are drawn uniformly over the ℓ2 unit sphere S^{P−1}.

Theorem (Acceleration conditions)
In expectation over the generic dictionary D, the factorization algorithm using a diagonally dominant matrix A ∈ E_δ has better performance at iteration q + 1 than the plain ISTA iteration (which uses the identity) when

    λ E_z[ ‖z^(q+1)‖₁ + ‖z*‖₁ ] ≤ (√(K(K−1)) / P) E_z[ ‖z^(q) − z*‖₂² ] ,

where the right-hand side is the expected resolution at iteration q.
Generic Dictionaries
Corollary (Acceleration conditions)
If the input distribution and the regularization parameter λ verify

    λ√P / 8 ≤ E_z[ ‖z*‖₁ ] ,

then for any resolution E_z[‖z^(q) − z*‖₂] = ε > 0 at iteration q, the performance of our factorization algorithm is better than that of ISTA, in expectation over the generic dictionaries.

FacNet can improve performance over ISTA whenever this condition is verified.
Learned ISTA [Gregor and Lecun, 2010]
[Figure: the ISTA recurrence (input X, weights W_e and W_g, output Z) and its unfolded feed-forward version with layer weights W_e^(0), W_g^(1), W_e^(1), W_g^(2), W_e^(2).]
Figure: Network architecture for ISTA/LISTA. LISTA is the unfolded version of the RNN of ISTA, trainable with back-propagation.
If W_e = Dᵀ/L and W_g = I − B/L, this network is exactly 2 iterations of ISTA.
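A minimal numpy sketch of this unfolded network (parameter names assumed): with W_e = Dᵀ/L, W_g = I − B/L and threshold λ/L it reproduces plain ISTA iterations, while LISTA instead learns W_e, W_g and the threshold by back-propagation.

```python
import numpy as np

def soft_threshold(u, tau):
    return np.sign(u) * np.maximum(np.abs(u) - tau, 0.0)

def lista_forward(x, We, Wg, theta, n_layers=2):
    """Forward pass of the unfolded (L)ISTA network."""
    z = soft_threshold(We @ x, theta)               # first layer: z = ST(W_e x)
    for _ in range(n_layers - 1):
        z = soft_threshold(Wg @ z + We @ x, theta)  # z = ST(W_g z + W_e x)
    return z
```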
FacNet
Specialization of LISTA:

    z^(q+1) = Aᵀ prox_S( Az^(q) − S⁻¹AB(z^(q) − y) ) ,

with A unitary and S diagonal. Same architecture, with more constraints on the parameter space:

    W_e = S⁻¹ADᵀ ,
    W_g = Aᵀ − S⁻¹ABAᵀ .

⇒ LISTA can be at least as good as this model.
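One FacNet iteration as a numpy sketch (shapes assumed): by the definition of Q_S, the prox_S step soft-thresholds coordinate k at level λ/s_k.

```python
import numpy as np

def facnet_step(z, y, B, A, s, lam):
    """One FacNet iteration: z+ = A^T prox_S(Az - S^-1 A B (z - y)),
    with A unitary and S = diag(s), s > 0."""
    v = A @ z - (A @ (B @ (z - y))) / s                     # gradient step, rotated frame
    v = np.sign(v) * np.maximum(np.abs(v) - lam / s, 0.0)   # prox_S: threshold at lam/s_k
    return A.T @ v
```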
Learned FISTA
The same ideas can also be applied to FISTA to obtain similar procedures:
[Figure: unfolded L-FISTA network with layer weights W_e^(0), W_g^(1)+W_m^(1), W_e^(1), W_g^(2)+W_m^(2), W_e^(2), W_g^(3), W_e^(3).]
Figure: Network architecture for L-FISTA.
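A sketch of such an unfolded L-FISTA layer, under the assumption (suggested by the figure) that each layer mixes the two previous iterates through an extra learned momentum weight W_m:

```python
import numpy as np

def soft_threshold(u, tau):
    return np.sign(u) * np.maximum(np.abs(u) - tau, 0.0)

def lfista_forward(x, We, Wg, Wm, theta, n_layers=3):
    """Forward pass of an unfolded L-FISTA-style network (assumed parameterization)."""
    z_prev = np.zeros(We.shape[0])
    z = soft_threshold(We @ x, theta)
    for _ in range(n_layers - 1):
        v = Wg @ z + Wm @ z_prev + We @ x    # learned momentum mixes two past iterates
        z_prev, z = z, soft_threshold(v, theta)
    return z
```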
Artificial simulation
Generating model:
- D = ( d₁/‖d₁‖₂ , ..., d_K/‖d_K‖₂ ) with d_k ~ N(0, I_P) for all k ∈ ⟦1, K⟧,
- z = (z₁, ..., z_K) constructed following a Bernoulli-Gaussian distribution:

      z_k = b_k·a_k ,  with b_k ~ B(ρ) and a ~ N(0, σ·I_K) ,

with K = 100 and P = 64 for the dimensions, σ = 10 and λ = 0.01.
⇒ The sparsity patterns are uniformly distributed.
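A numpy sketch of this generating model (sampling conventions assumed):

```python
import numpy as np

rng = np.random.default_rng(0)
K, P, rho, sigma = 100, 64, 1 / 20, 10.0

# Dictionary: K Gaussian atoms, normalized to the l2 unit sphere.
D = rng.standard_normal((P, K))
D /= np.linalg.norm(D, axis=0)

# Bernoulli-Gaussian codes: z_k = b_k * a_k, b_k ~ B(rho), a ~ N(0, sigma * I_K).
b = rng.random(K) < rho
a = sigma * rng.standard_normal(K)
z = b * a
x = D @ z    # observed signal
```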
Artificial simulation
[Figure: curves for ISTA, FISTA, Linear, L-ISTA, FacNet and L-FISTA.]
Evolution of the cost function F(z^(q)) − F(z*) with the number of layers/iterations q, for a sparse model (ρ = 1/20).
Artificial simulation
[Figure: curves for ISTA, FISTA, Linear, L-ISTA, FacNet and L-FISTA.]
Evolution of the cost function F(z^(q)) − F(z*) with the number of layers/iterations q, for a denser model (ρ = 1/4).
Adversarial dictionary
Adversarial dictionary:

    D = [ d₁ ... d_K ] ∈ R^{P×K} ,

with atoms built from complex exponentials, (d_j)_q = e^{−2iπjζ_q}, for a random subset of frequencies {ζ_i}_{i≤m}.

⇒ The eigenvectors of the Gram matrix B = DᵀD are far from the canonical basis.
Adversarial dictionary
[Figure: curves for ISTA, FISTA, L-ISTA and FacNet.]
Evolution of the cost function F(z^(q)) − F(z*) with the number of layers/iterations q, with an adversarial dictionary.
PASCAL 08
Sparse coding for the PASCAL 08 dataset over the Haar wavelet family.
The sparse coding is performed on patches of size 8 × 8.
Training over 500 images, testing over 100 images.
PASCAL 08
[Figure: curves for ISTA, FISTA, Linear, L-ISTA, FacNet and L-FISTA.]
Evolution of the cost function F(z^(q)) − F(z*) with the number of layers/iterations q, for PASCAL VOC 2008.
MNIST
Dictionary D with K = 100 atoms learned on 10,000 MNIST samples (17 × 17) with dictionary learning. LISTA is trained on the MNIST training set and tested on the MNIST test set.
[Figure: curves for ISTA, FISTA, Linear, L-ISTA, FacNet and L-FISTA.]
Evolution of the cost function F(z^(q)) − F(z*) with the number of layers/iterations q, for MNIST.
Conclusion
- Non-asymptotic acceleration is possible: approximate matrix factorization of B = DᵀD
  - Nearly diagonalize the kernel,
  - The ℓ1-norm is nearly invariant under this orthogonal transformation.
- Future work:
  - Improve the factorization formulation:

        min_{AᵀA = I_K} f(‖DA‖₁,₂) + λ_q‖A‖₁,₁ ,

  - Give generic bounds for sub-Gaussian D,
  - Link to Sparse PCA.
Conclusion
Questions?
Code: tomMoral/AdaptiveOptim
Paper: https://arxiv.org/abs/1706.01338
More at tommoral.github.io (@tomMoral)
References
Beck, A. and Teboulle, M. (2009). A Fast Iterative Shrinkage-Thresholding Algorithm for Linear Inverse Problems. SIAM Journal on Imaging Sciences, 2(1):183-202.
Bottou, L. and Bousquet, O. (2008). Learning using large datasets. Mining Massive Data Sets for Security, 3.
Daubechies, I., Defrise, M., and De Mol, C. (2004). An iterative thresholding algorithm for linear inverse problems with a sparsity constraint. Communications on Pure and Applied Mathematics, 57(11):1413-1457.
Gregor, K. and Lecun, Y. (2010). Learning Fast Approximations of Sparse Coding. In International Conference on Machine Learning (ICML), pages 399-406, Haifa, Israel.
Haeffele, B. D. and Vidal, R. (2017). Global Optimality in Neural Network Training. In Conference on Computer Vision and Pattern Recognition (CVPR), pages 7331-7339, Honolulu, HI, USA.
Hornik, K. (1991). Approximation capabilities of multilayer feedforward networks. Neural Networks, 4(2):251-257.
Kawaguchi, K., Pack Kaelbling, L., and Bengio, Y. (2017). Generalization in Deep Learning. Preprint, arXiv:1710.05468.
Mairal, J., Bach, F., and Ponce, J. (2012). Task-driven dictionary learning. IEEE Transactions on Pattern Analysis and Machine Intelligence, 34(4):791-804.
Tibshirani, R. (1996). Regression Shrinkage and Selection via the Lasso. Journal of the Royal Statistical Society, Series B (Methodological), 58(1):267-288.