
Stochastic Subsampling for Massive Matrix Factorization

Arthur Mensch, Julien Mairal, Gaël Varoquaux, Bertrand Thirion

Inria Parietal, Université Paris-Saclay

April 13, 2018

Matrix factorization

$$X = DA, \qquad X \in \mathbb{R}^{p \times n},\quad D \in \mathbb{R}^{p \times k},\quad A \in \mathbb{R}^{k \times n}$$

Flexible tool for unsupervised data analysis

The dataset has lower underlying complexity than its apparent size

How to scale it to very large datasets? (Brain imaging: 4 TB; hyperspectral imaging: 100 GB)


Example: resting-state fMRI

Resting-state data analysis:

Input: 3D brain images across time for 900 subjects

$X \in \mathbb{R}^{p \times n}$, $n = 5 \cdot 10^6$, $p = 2 \cdot 10^5$

Goal: Extract representative sparse brain components D

Functional networks: correlated brain activity in localized areas (e.g., auditory, visual, motor cortex)

[Figure: data matrix (voxels × time) = k spatial maps × time courses]


Other examples

Computer vision:

Patches of an image

Decompose onto a dictionary

Sparse loadings

Collaborative filtering:

Incomplete user-item rating matrix

Decompose into a low-rank factorization


Designing new efficient algorithms


X is large (5 TB) in both the number of samples $n$ and the sample dimension $p$

New stochastic algorithms that scale in both directions


Formalism and methods

Non-convex matrix factorization:

$$\min_{D \in \mathcal{C},\, A \in \mathbb{R}^{k \times n}} \|X - DA\|_F^2 + \lambda\,\Omega(A)$$

Constraints on the dictionary $D$: each column $d^{(j)}$ lies in the $\ell_2$ or $\ell_1$ ball

Penalty on the code $A$: $\ell_1$, $\ell_2$ (+ non-negativity)


Naive resolution:

Alternating minimization: uses the full $X$ at each iteration

Slow: single-iteration cost in $O(np)$
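To make that cost concrete, here is a minimal sketch of the naive approach in Python, assuming an $\ell_2$ penalty $\Omega$ (so the code update has a closed form) and $\ell_2$-ball columns for $\mathcal{C}$; names and defaults are illustrative, not the paper's implementation:

```python
import numpy as np

def alternating_minimization(X, k, lam=0.1, n_iter=50, seed=0):
    """Naive alternating minimization for min ||X - DA||_F^2 + lam*||A||_F^2.

    Every iteration touches the full matrix X, hence the O(np) cost
    per iteration mentioned above.
    """
    rng = np.random.default_rng(seed)
    p, n = X.shape
    D = rng.standard_normal((p, k))
    for _ in range(n_iter):
        # Code update: ridge regression, closed form (lam > 0 keeps it well posed)
        A = np.linalg.solve(D.T @ D + lam * np.eye(k), D.T @ X)
        # Dictionary update: least squares in D, then project columns onto the l2 ball
        D = np.linalg.lstsq(A.T, X.T, rcond=None)[0].T
        D /= np.maximum(np.linalg.norm(D, axis=0), 1.0)
    return D, A
```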


Online matrix factorization [Mairal et al., 2010]

Scaling in n:

Stream $(x_t)$ and update $(D_t)$ at each $t$

Single iteration cost in O(p)

Convergence in a few epochs → large speed-up


Use case:

Large $n$, regular $p$, e.g., image patches: $p = 256$, $n \approx 10^6$, ≈1 GB

Low-rank factorization / sparse coding


Scaling-up for massive matrices

Out-of-the-box online algorithm?


Limited time budget?

Need to accommodate large $p$

235 h run time → 1 full epoch

10 h run time → 1/24 epoch



Scaling-up in both directions

[Diagram: data access, code computation, and dictionary update. Alternating minimization touches the full matrix at each iteration; online matrix factorization streams columns (samples seen at t, t+1, unseen at t) while updating the dictionary over all dimensions at each t]


Scaling-up in both directions

[Diagram: the proposed algorithm streams columns and additionally subsamples rows, so code computation and dictionary update at iteration t touch only a subset of the dimensions]


Online dictionary learning: details

We learn the left-side factor: $D^\star$ solution of

$$\min_{D \in \mathcal{C}} \frac{1}{n} \sum_{i=1}^{n} \|x^{(i)} - D\,\alpha^{(i)}(D)\|_2^2, \qquad \alpha^{(i)}(D) = \operatorname*{argmin}_{\alpha \in \mathbb{R}^k} \|x^{(i)} - D\alpha\|_2^2 + \lambda\,\Omega(\alpha)$$
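For concreteness, a minimal ISTA sketch for the code subproblem when $\Omega = \ell_1$ (a sketch only; coordinate descent or LARS are common alternatives):

```python
import numpy as np

def ista_code(x, D, lam=0.1, n_iter=100):
    """ISTA for argmin_alpha 0.5*||x - D alpha||_2^2 + lam*||alpha||_1
    (the 0.5 factor only rescales lam relative to the slide's objective)."""
    L = np.linalg.norm(D, 2) ** 2          # Lipschitz constant of the gradient
    alpha = np.zeros(D.shape[1])
    for _ in range(n_iter):
        grad = D.T @ (D @ alpha - x)       # gradient of the smooth part
        z = alpha - grad / L               # gradient step
        alpha = np.sign(z) * np.maximum(np.abs(z) - lam / L, 0)  # soft-threshold
    return alpha
```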

Expected risk minimization problem, with $(i_t)_t$ sampled uniformly:

$$\min_{D \in \mathcal{C}} \mathbb{E}[f_t(D)], \qquad f_t(D) \triangleq \|x^{(i_t)} - D\,\alpha^{(i_t)}(D)\|_2^2$$

How to minimize it?



A majorization-minimization algorithm

Expected risk minimization: $x_t \triangleq x^{(i_t)}$

$$\min_{D \in \mathcal{C}} f(D) = \mathbb{E}[f_t(D)], \qquad f_t(D) \triangleq \|x_t - D\,\alpha_t(D)\|_2^2$$

At iteration $t$, we build a pointwise majorizing surrogate:

$$\alpha_t = \operatorname*{argmin}_{\alpha \in \mathbb{R}^k} \|x_t - D_{t-1}\alpha\|_2^2 + \lambda\,\Omega(\alpha)$$

$$g_t(D) = \|x_t - D\alpha_t\|_2^2 \;\ge\; f_t(D), \qquad g_t(D_{t-1}) = f_t(D_{t-1})$$

(the inequality holds because $\alpha_t$ is optimal only at $D_{t-1}$)

We minimize the aggregated surrogate

$$\bar{g}_t(D) \triangleq \frac{1}{t} \sum_{s=1}^{t} g_s(D) \;\ge\; \frac{1}{t} \sum_{s=1}^{t} f_s(D) \triangleq \bar{f}_t(D)$$

and obtain $D_t$.
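Expanding the squares shows why this aggregated surrogate is tractable (anticipating the next slide, which states the quadratic form; the slides drop the factor 2 and the constant term):

```latex
\bar{g}_t(D) = \frac{1}{t}\sum_{s=1}^{t} \|x_s - D\alpha_s\|_2^2
             = \operatorname{Tr}\!\big(D^\top D \, C_t\big)
               - 2\operatorname{Tr}\!\big(D^\top B_t\big)
               + \frac{1}{t}\sum_{s=1}^{t}\|x_s\|_2^2,
\qquad
C_t = \frac{1}{t}\sum_{s=1}^{t} \alpha_s\alpha_s^\top,\quad
B_t = \frac{1}{t}\sum_{s=1}^{t} x_s\alpha_s^\top .
```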



A majorization-minimization algorithm

The surrogate is a simple quadratic

$$\bar{g}_t(D) = \operatorname{Tr}\big(D^\top D\, C_t - D^\top B_t\big)$$

with parameters that can be updated online

$$C_t = \frac{1}{t} \sum_{s=1}^{t} \alpha_s \alpha_s^\top, \qquad B_t = \frac{1}{t} \sum_{s=1}^{t} x_s \alpha_s^\top$$

The surrogate is minimized using block coordinate descent
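A sketch of both pieces, assuming an $\ell_2$-ball constraint for $\mathcal{C}$ (function and variable names are illustrative):

```python
import numpy as np

def update_surrogate_params(C, B, x, alpha, t):
    """Online running averages C_t and B_t: no past samples are stored."""
    C += (np.outer(alpha, alpha) - C) / t
    B += (np.outer(x, alpha) - B) / t

def minimize_surrogate(D, C, B, n_passes=1, eps=1e-12):
    """Block coordinate descent on the quadratic surrogate: update each
    dictionary column in turn, then project it onto the l2 ball."""
    for _ in range(n_passes):
        for j in range(D.shape[1]):
            if C[j, j] > eps:
                D[:, j] += (B[:, j] - D @ C[:, j]) / C[j, j]
                D[:, j] /= max(np.linalg.norm(D[:, j]), 1.0)
```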


Convergence guarantees (informal)

$(D_t)_t$ converges a.s. towards a critical point of the expected risk

$$\min_{D \in \mathcal{C}} f(D) = \mathbb{E}[f_t(D)]$$

Major lemma: $\bar{g}_t$ is a tighter and tighter surrogate:

$$\bar{g}_t(D_{t-1}) - \bar{f}_t(D_{t-1}) \to 0$$

and the algorithm is asymptotically a majorization-minimization algorithm.



Algorithm design: summary

Online dictionary learning [Mairal et al., 2010]

1 Compute code – $O(p)$

$$\alpha_t = \operatorname*{argmin}_{\alpha \in \mathbb{R}^k} \|x_t - D_{t-1}\alpha\|_2^2 + \lambda\,\Omega(\alpha)$$

2 Update surrogate – $O(p)$

$$\bar{g}_t(D) = \frac{1}{t} \sum_{s=1}^{t} \|x_s - D\alpha_s\|_2^2 = \operatorname{Tr}\big(D^\top D\, C_t - D^\top B_t\big)$$

3 Minimize surrogate – $O(p)$

$$D_t = \operatorname*{argmin}_{D \in \mathcal{C}} \bar{g}_t(D) = \operatorname*{argmin}_{D \in \mathcal{C}} \operatorname{Tr}\big(D^\top D\, C_t - D^\top B_t\big)$$

Access to $x_t$ → algorithm in $O(p)$ (the per-iteration complexity depends on $p$, not $n$)
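Putting the three steps together, a self-contained sketch of the online loop (assuming an $\ell_2$ penalty $\Omega$ for a closed-form code and $\ell_2$-ball columns; a sketch, not the reference implementation):

```python
import numpy as np

def online_matrix_factorization(X, k, lam=0.1, n_iter=None, seed=0):
    """Streamed online matrix factorization over the columns of X."""
    rng = np.random.default_rng(seed)
    p, n = X.shape
    n_iter = n_iter or n
    D = rng.standard_normal((p, k))
    D /= np.linalg.norm(D, axis=0)
    C, B = np.zeros((k, k)), np.zeros((p, k))
    for t in range(1, n_iter + 1):
        x = X[:, rng.integers(n)]          # stream one sample
        # 1. Compute code (ridge, closed form)
        alpha = np.linalg.solve(D.T @ D + lam * np.eye(k), D.T @ x)
        # 2. Update surrogate parameters C_t, B_t
        C += (np.outer(alpha, alpha) - C) / t
        B += (np.outer(x, alpha) - B) / t
        # 3. Minimize surrogate: one BCD pass with l2-ball projection
        for j in range(k):
            if C[j, j] > 1e-12:
                D[:, j] += (B[:, j] - D @ C[:, j]) / C[j, j]
                D[:, j] /= max(np.linalg.norm(D[:, j]), 1.0)
    return D
```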



Stochastic subsampling

How to reduce the single-iteration cost $O(p)$?

Sample a masking matrix $M_t$: diagonal, with rescaled Bernoulli coefficients, $\mathbb{E}[\operatorname{rank} M_t] = q$

$$x_t \to M_t x_t, \qquad \mathbb{E}[M_t x_t] = x_t$$

Use only $M_t x_t$ in the algorithm's computations

Noisy updates, but single iteration in $O(q)$
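A sketch of one way to draw such a mask (keeping each coordinate with probability $q/p$ is a natural choice; names are illustrative):

```python
import numpy as np

def bernoulli_mask(x, q, rng):
    """Draw a rescaled Bernoulli mask M_t and return M_t x.

    Each coordinate is kept with probability q/p and divided by it, so
    E[M_t x] = x while only ~q entries are nonzero: downstream work
    drops from O(p) to O(q)."""
    p = x.shape[0]
    keep = rng.random(p) < q / p
    mx = np.zeros_like(x)
    mx[keep] = x[keep] * (p / q)
    return mx, keep

rng = np.random.default_rng(0)
x = rng.standard_normal(10_000)
mx, keep = bernoulli_mask(x, q=500, rng=rng)   # ~500 surviving coordinates
```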


Subsampled Online Matrix Factorization (SOMF)

Adapt the 3 parts of the algorithm to obtain $O(q)$ complexity:

1 Code computation

2 Surrogate update

3 Surrogate minimization



1. Code computation

Linear regression with random sampling

$$\alpha_t = \operatorname*{argmin}_{\alpha \in \mathbb{R}^k} \|M_t(x_t - D_{t-1}\alpha)\|_2^2 + \lambda\,\Omega(\alpha)$$

an approximate (sketched) solution of

$$\alpha_t^\star = \operatorname*{argmin}_{\alpha \in \mathbb{R}^k} \|x_t - D_{t-1}\alpha\|_2^2 + \lambda\,\Omega(\alpha)$$

$(M_t)_t$ introduces errors in the computation of $(\alpha_t)_t$

The error can be controlled and reduced
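A sketch of the subsampled code computation, again assuming an $\ell_2$ penalty for a closed form (the paper's estimators for the masked problem are more refined; this only illustrates the $O(q)$ cost):

```python
import numpy as np

def masked_ridge_code(x, D, keep, lam=0.1):
    """Solve the regression restricted to the q rows selected by the mask:
    only D[keep] and x[keep] are read, so the cost is O(q k^2), not O(p k^2)."""
    Dq, xq = D[keep], x[keep]
    k = D.shape[1]
    return np.linalg.solve(Dq.T @ Dq + lam * np.eye(k), Dq.T @ xq)
```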


3. Surrogate minimization

Original OMF: block coordinate descent with projection onto $\mathcal{C}$

$$\min_{D \in \mathcal{C}} \bar{g}_t(D): \qquad d^{(j)} \leftarrow \operatorname{proj}_{\mathcal{C}}\Big(d^{(j)} - \frac{1}{C_{j,j}}\big(D\,c_t^{(j)} - b_t^{(j)}\big)\Big)$$

SOMF: Freeze the rows not selected by Mt

$$\min_{\substack{D \in \mathcal{C} \\ P_t^\perp D = P_t^\perp D_{t-1}}} \bar{g}_t(D) \qquad \text{(rows not selected by } M_t \text{ are frozen)}$$

Reduces to a block coordinate descent in $\mathbb{R}^{q \times k}$!
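A sketch of the subsampled surrogate minimization: unselected rows stay frozen, so the BCD pass only touches the $q$ selected rows (for simplicity this uses the plain full column projection; the partial-projection variant from the experiments is omitted):

```python
import numpy as np

def minimize_surrogate_subsampled(D, C, B, keep, eps=1e-12):
    """BCD on the surrogate restricted to the rows selected by M_t:
    one pass costs O(q k^2) instead of O(p k^2)."""
    Dq = D[keep]                      # copy of the q active rows
    for j in range(D.shape[1]):
        if C[j, j] > eps:
            Dq[:, j] += (B[keep, j] - Dq @ C[:, j]) / C[j, j]
    D[keep] = Dq                      # write the updated rows back
    # project each full column onto the l2 ball
    D /= np.maximum(np.linalg.norm(D, axis=0), 1.0)
```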



Theoretical analysis

Convergence theorem (informal)

$f(D_t)$ converges with probability one, and every limit point $D_\infty$ of $(D_t)_t$ is a stationary point of $f$: for all $D \in \mathcal{C}$,

$$\nabla f(D_\infty, D - D_\infty) \ge 0$$

Proof: control the perturbations (surrogate approximation, partial minimization) with respect to the original online matrix factorization algorithm.



Results: up to 12× speed-up


Resting-state fMRI

Online dictionary learning:

235 h run time → 1 full epoch

10 h run time → 1/24 epoch

Proposed method:

10 h run time → 1/2 epoch (reduction r = 12)

Qualitatively, usable maps are obtained 10× faster


Hyperspectral imaging

[Figure: components 1–3 per setting]

OMF, 14 h: 841k patches
OMF (r = 1), 177 s: 3k patches
SOMF (r = 24), 179 s: 87k patches

SOMF atoms are more focal and less noisy for a given time budget


Collaborative filtering

$M_t x_t$: the movie ratings observed for user $t$ (the missing entries naturally play the role of the mask)

vs. coordinate descent for MMMF loss (no hyperparameters)

[Plot: test RMSE vs. run time on Netflix (140M), 100 s to 1000 s: coordinate descent vs. proposed method (full and partial projection)]

Dataset      Test RMSE (CD)   Test RMSE (MODL)   Speed-up
ML 1M        0.872            0.866              ×0.75
ML 10M       0.802            0.799              ×3.7
NF (140M)    0.938            0.934              ×6.8

Outperform coordinate descent beyond 10M ratings

Same prediction performance

Speed-up 6.8× on Netflix


Conclusion

New efficient algorithm with many potential use cases.

Subsampling mini-batches at each iteration.

Python package (github.com/arthurmensch/modl)

Perspectives:

Efficient heuristics and adaptive subsampling ratio

Is this kind of approach transposable to the SGD setting?


Publications

[Mairal et al., 2010] Mairal, J., Bach, F., Ponce, J., and Sapiro, G. (2010). Online learning for matrix factorization and sparse coding. The Journal of Machine Learning Research, 11:19–60.

[Mensch et al., 2016] Mensch, A., Mairal, J., Thirion, B., and Varoquaux, G. (2016). Dictionary learning for massive matrix factorization. In 33rd International Conference on Machine Learning (ICML).

[Mensch et al., 2017] Mensch, A., Mairal, J., Thirion, B., and Varoquaux, G. (2017). Stochastic subsampling for factorizing huge matrices. arXiv:1701.05363 [cs, math, q-bio, stat].
