
Low Rank Tucker Approximation of a Tensor from Streaming Data

Madeleine Udell

Operations Research and Information Engineering, Cornell University

Based on joint work with Yiming Sun (Cornell), Yang Guo (UW Madison),

Charlene Luo (Columbia), and Joel Tropp (Caltech)

April 1, 2019

Madeleine Udell, Cornell. Streaming Tucker Approximation. 1

Outline

Applications

Tucker factorization

Sketching

Reconstruction

Numerics

*

Madeleine Udell, Cornell. Streaming Tucker Approximation. 2

Big data, small laptop

X = H1 + · · · + HT

Madeleine Udell, Cornell. Streaming Tucker Approximation. 3

Distributed data

X = H1 + · · · + HT

Madeleine Udell, Cornell. Streaming Tucker Approximation. 4

Streaming data

X(t) = H1 + · · · + Ht

Madeleine Udell, Cornell. Streaming Tucker Approximation. 5

Streaming multilinear algebra

turnstile model:

X = H1 + · · ·+ HT

I tensor X presented as sum of smaller, simpler tensors Ht

I must discard Ht after it is processed

I Goal: without storing X, approximate X after seeing all updates (with guaranteed accuracy)

applications:

I scientific simulation

I sensor measurements

I memory- or communication-limited computing

I low memory optimization

Madeleine Udell, Cornell. Streaming Tucker Approximation. 6

Linear sketch

X = H1 + · · ·+ HT

L(X) = L(H1) + · · ·+ L(HT )

I select a linear map L independent of X

I sketch L(X) is much smaller than input tensor X

I use randomness so sketch works for an arbitrary input

I essentially the only way to handle the turnstile model [Li, Nguyen & Woodruff 2014]

examples:

I L(X) = X×n Ω for some matrix Ω

I L(X) = {X ×n Ωn}n∈[N] for some matrices {Ωn}n∈[N]

Madeleine Udell, Cornell. Streaming Tucker Approximation. 7

Main idea

sketch suffices for (Tucker) approximation:

I compute (randomized) linear sketch of tensor

I recover low rank (Tucker) approximation from sketch

I (optional) improve approximation by revisiting data

Madeleine Udell, Cornell. Streaming Tucker Approximation. 8

Big data, small laptop: sketch

X = H1 + · · · + HT → L(X)

I (+) reduced communication

I (+) sketch of data fits on laptop

Madeleine Udell, Cornell. Streaming Tucker Approximation. 9

Distributed data: sketch

L(X(T)) = L(H1 + · · · + HT−1) + L(HT)

I (+) reduced communication

I (+) no PITI (personally identifiable toast information)

I (+) sketch of data fits on laptop

Madeleine Udell, Cornell. Streaming Tucker Approximation. 10

Streaming data: sketch

L(X(t)) = L(H1 + · · · + Ht−1) + L(Ht)

I (+) even a toaster can form sketch

Madeleine Udell, Cornell. Streaming Tucker Approximation. 11

Outline

Applications

Tucker factorization

Sketching

Reconstruction

Numerics

*

Madeleine Udell, Cornell. Streaming Tucker Approximation. 12

Notation

tensor to compress:

I tensor X ∈ RI1×···×IN with N modes

I sometimes assume I1 = · · · = IN = I for simplicity

indexing:

I [N] = {1, . . . , N}

I I(−n) = I1 × · · · × In−1 × In+1 × · · · × IN

tensor operations:

I mode-n product: for A ∈ Rk×In, X ×n A ∈ RI1×···×In−1×k×In+1×···×IN

I unfolding X(n) ∈ RIn×I(−n) stacks the mode-n fibers of X as columns of a matrix
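
To make this notation concrete, here is a minimal numpy sketch of the mode-n unfolding and mode-n product (the helper names unfold, fold, and mode_n_product are illustrative, not from the talk's code); the later examples on these slides reuse them.

    import numpy as np

    def unfold(X, n):
        # Mode-n unfolding X_(n): arrange the mode-n fibers of X as columns of a matrix.
        return np.moveaxis(X, n, 0).reshape(X.shape[n], -1)

    def fold(M, n, shape):
        # Inverse of unfold: rebuild a tensor with the given shape from its mode-n unfolding.
        rest = [s for i, s in enumerate(shape) if i != n]
        return np.moveaxis(M.reshape([shape[n]] + rest), 0, n)

    def mode_n_product(X, A, n):
        # Mode-n product X x_n A: multiply every mode-n fiber of X by the matrix A.
        shape = list(X.shape)
        shape[n] = A.shape[0]
        return fold(A @ unfold(X, n), n, shape)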

Madeleine Udell, Cornell. Streaming Tucker Approximation. 13

Tucker factorization

rank r = (r1, . . . , rN) Tucker factorization of X ∈ RI1×···×IN :

X = G ×1 U1 · · · ×N UN =: ⟦G; U1, . . . , UN⟧

where

I G ∈ Rr1×···×rN is the core tensor

I Un ∈ RIn×rn is the factor matrix for each mode n ∈ [N]

(sometimes assume r1 = · · · = rN = r for simplicity)

Tucker is useful for compression: when N is small,

I Tucker stores O(r^N + NrI) numbers for a rank (r, . . . , r) approximation

I CP stores O(NrI) numbers for a rank r approximation

future work: one pass ST-HOSVD / tensor train?

Madeleine Udell, Cornell. Streaming Tucker Approximation. 14

Computing Tucker: HOSVD

Algorithm Higher order singular value decomposition (HOSVD) [De Lathauwer, De Moor & Vandewalle 2000, Tucker 1966]

Given: tensor X, rank r = (r1, . . . , rN)

1. Factors. Compute the top rn left singular vectors Un of the unfolding X(n) for each n ∈ [N].

2. Core. Contract these with X to form the core

G = X ×1 U1ᵀ · · · ×N UNᵀ.

Return: Tucker approximation X̂HOSVD = ⟦G; U1, . . . , UN⟧
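
A minimal numpy sketch of HOSVD under these definitions, assuming numpy is imported as np and the unfold / mode_n_product helpers from the notation slide are in scope:

    def hosvd(X, ranks):
        # HOSVD, a minimal sketch of the two steps above.
        N = X.ndim
        factors = []
        for n in range(N):
            # Step 1: top r_n left singular vectors of the mode-n unfolding.
            U, _, _ = np.linalg.svd(unfold(X, n), full_matrices=False)
            factors.append(U[:, :ranks[n]])
        # Step 2: contract X with the factor transposes to form the core G.
        G = X
        for n in range(N):
            G = mode_n_product(G, factors[n].T, n)
        return G, factors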

Madeleine Udell, Cornell. Streaming Tucker Approximation. 15

Two pass HOSVD

HOSVD can be computed in two passes over the tensor:

I Factors. Use randomized linear algebra: need to find the span of the fibers of X along the nth mode,

range(Un) ≈ range(X(n))

I if rank(Ω) ≥ rank(X(n)), then whp for random Ω,

range(X(n)) = range(X(n)Ω)

algorithm:
1. compute sketch L(X) = {X(n)Ωn}n∈[N]

2. use QR on sketch to approximate range(X(n))

I Core. Computation is linear in X:

G = X ×1 U1ᵀ · · · ×N UNᵀ.

Source: [Halko, Martinsson & Tropp 2011, Zhou, Cichocki & Xie 2014, Battaglino, Ballard & Kolda 2019]
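
A minimal numpy sketch of this randomized range-finding step, assuming a Gaussian test matrix (the name randomized_range is illustrative):

    def randomized_range(Xn, k, rng):
        # Randomized range finder for an unfolding X_(n): multiply by a random
        # test matrix, then orthonormalize with QR.
        Omega = rng.standard_normal((Xn.shape[1], k))
        Q, _ = np.linalg.qr(Xn @ Omega)   # range(Q) approximates range(X_(n)) whp
        return Q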

Madeleine Udell, Cornell. Streaming Tucker Approximation. 16

Computing Tucker: HOOI

Algorithm Higher order orthogonal iteration (HOOI) [De Lathauwer et al. 2000]

Given: tensor X, rank r = (r1, . . . , rN)
Initialize: compute X ≈ ⟦G; U1, . . . , UN⟧ using HOSVD
Repeat:

1. Factors. For n ∈ [N],

Un ← argminUn ‖⟦G; U1, . . . , UN⟧ − X‖²F,

2. Core.

G ← argminG ‖⟦G; U1, . . . , UN⟧ − X‖²F.

Return: Tucker approximation X̂HOOI = ⟦G; U1, . . . , UN⟧

I core update has closed form G ← X ×1 U1ᵀ · · · ×N UNᵀ

Madeleine Udell, Cornell. Streaming Tucker Approximation. 17
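
A minimal numpy sketch of HOOI as described above, reusing hosvd, unfold, and mode_n_product from the earlier slides (the iteration count is an illustrative placeholder, not a recommended setting):

    def hooi(X, ranks, n_iter=5):
        # HOOI: initialize with HOSVD, then alternate factor and core updates.
        N = X.ndim
        G, factors = hosvd(X, ranks)
        for _ in range(n_iter):
            for n in range(N):
                # Factor update: contract X with every factor transpose except mode n,
                # then take the top r_n left singular vectors of the mode-n unfolding.
                Y = X
                for m in range(N):
                    if m != n:
                        Y = mode_n_product(Y, factors[m].T, m)
                U, _, _ = np.linalg.svd(unfold(Y, n), full_matrices=False)
                factors[n] = U[:, :ranks[n]]
            # Core update (closed form): G = X x_1 U_1^T ... x_N U_N^T.
            G = X
            for n in range(N):
                G = mode_n_product(G, factors[n].T, n)
        return G, factors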

Previous work: one pass algorithm via HOOI

[Malik & Becker 2018]:

I (+) sketch design matrix to reduce size of HOOI subproblems

I (+) exploit Tucker structure of design matrix

I (-) expensive, slow reconstruction (via iterative optimization)

I (-) no error guarantees for one pass algorithm

Madeleine Udell, Cornell. Streaming Tucker Approximation. 18

Outline

Applications

Tucker factorization

Sketching

Reconstruction

Numerics

*

Madeleine Udell, Cornell. Streaming Tucker Approximation. 19

Background: randomized sketches

idea: a random matrix Ω is not orthogonal to the range of interest (whp)

range(X(n)) = range(X(n)Ω)

a dimension reduction map (DRM) (approximately) preserves the range of its argument

examples of DRMs: multiplication by a random matrix Ω that is

I gaussian

I sparse [Achlioptas 2003, Li, Hastie & Church 2006]

I SSRFT [Woolfe, Liberty, Rokhlin & Tygert 2008]

I tensor random projection (TRP) [Sun, Guo, Tropp & Udell 2018]

I . . .

Madeleine Udell, Cornell. Streaming Tucker Approximation. 20

The sketch

approximate factor matrices and core:

I Factor sketch (k). For each n ∈ [N], fix random DRM Ωn ∈ RI(−n)×kn and compute the sketch

Vn = X(n)Ωn ∈ RIn×kn.

I Core sketch (s). For each n ∈ [N], fix random DRM Φn ∈ RIn×sn. Compute the sketch

H = X ×1 Φ1ᵀ · · · ×N ΦNᵀ ∈ Rs1×···×sN.

I Rule of thumb. Pick k as big as you can afford, pick s = 2k.

I define (H, V1, . . . , VN) = Sketch(X; {Φn, Ωn}n∈[N])

Madeleine Udell, Cornell. Streaming Tucker Approximation. 21
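
A minimal numpy sketch of this sketching map, reusing the unfold and mode_n_product helpers from the notation slide; the name tucker_sketch and the Gaussian DRMs in the example are illustrative assumptions, not the talk's code:

    def tucker_sketch(X, Omegas, Phis):
        # Factor sketches V_n = X_(n) Omega_n and core sketch
        # H = X x_1 Phi_1^T ... x_N Phi_N^T.
        N = X.ndim
        Vs = [unfold(X, n) @ Omegas[n] for n in range(N)]   # each V_n is I_n x k_n
        H = X
        for n in range(N):
            H = mode_n_product(H, Phis[n].T, n)             # H is s_1 x ... x s_N
        return H, Vs

    # Example with Gaussian DRMs and the rule of thumb s = 2k + 1:
    rng = np.random.default_rng(0)
    dims, k, s = (50, 60, 70), 10, 21
    X = rng.standard_normal(dims)
    Omegas = [rng.standard_normal((np.prod(dims) // dims[n], k)) for n in range(3)]
    Phis = [rng.standard_normal((dims[n], s)) for n in range(3)]
    H, Vs = tucker_sketch(X, Omegas, Phis)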

Low memory DRMs

factor sketch DRMs are big!

I I(−n) × kn for each n ∈ [N]

how to store?

I don’t store DRMs; instead, use a pseudorandom number generator to generate (parts of) DRMs as needed.

I use structured DRM:
I TRP generates the DRM as a Khatri-Rao product of simpler, smaller DRMs
I behaves approximately like a Gaussian sketch

Source: [Sun et al. 2018, Rudelson 2012]
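
A minimal numpy sketch of the TRP construction, with illustrative names khatri_rao and trp; the 1/sqrt(k) scaling is one common normalization, not necessarily the one used in [Sun et al. 2018]:

    def khatri_rao(A, B):
        # Column-wise Kronecker product: (I x k) and (J x k) -> (IJ x k).
        I, k = A.shape
        J, _ = B.shape
        return (A[:, None, :] * B[None, :, :]).reshape(I * J, k)

    def trp(dims, k, rng):
        # Tensor random projection: a (prod(dims) x k) DRM built from small
        # Gaussian factors. Materialized here only for illustration; in practice
        # only the small factors are stored (or regenerated from a seed).
        factors = [rng.standard_normal((d, k)) / np.sqrt(k) for d in dims]
        Omega = factors[0]
        for A in factors[1:]:
            Omega = khatri_rao(Omega, A)
        return Omega

    # e.g. the mode-0 factor DRM for a 50 x 60 x 70 tensor:
    Omega0 = trp((60, 70), 10, np.random.default_rng(0))   # shape (4200, 10)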

Madeleine Udell, Cornell. Streaming Tucker Approximation. 22

Outline

Applications

Tucker factorization

Sketching

Reconstruction

Numerics

*

Madeleine Udell, Cornell. Streaming Tucker Approximation. 23

Recovery: factor matrices

I compute QR factorization of each factor sketch Vn:

Vn = QnRn

where Qn has orthonormal columns and Rn is upper triangular

Madeleine Udell, Cornell. Streaming Tucker Approximation. 24

Two pass algorithm

Algorithm Two Pass Sketch and Low Rank Recovery

Given: tensor X, rank r = (r1, . . . , rN), DRMs {Φn, Ωn}n∈[N]

I Sketch. (H, V1, . . . , VN) = Sketch(X; {Φn, Ωn}n∈[N])

I Recover factor matrices. For n ∈ [N],

(Qn,∼)← QR(Vn)

I Recover core.

W ← X ×1 Q1ᵀ · · · ×N QNᵀ

Return: Tucker approximation X̂ = ⟦W; Q1, . . . , QN⟧

accesses X twice: 1) to sketch 2) to recover core
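
A minimal numpy sketch of the two pass algorithm, reusing tucker_sketch and mode_n_product from the earlier slides (the core sketch H is formed but not needed in this variant):

    def two_pass_recovery(X, Omegas, Phis):
        H, Vs = tucker_sketch(X, Omegas, Phis)      # pass 1: sketch
        Qs = [np.linalg.qr(V)[0] for V in Vs]       # orthonormal bases for factor ranges
        W = X                                       # pass 2: contract X with Q_n^T
        for n, Q in enumerate(Qs):
            W = mode_n_product(W, Q.T, n)
        return W, Qs                                # approximation is [[W; Q_1, ..., Q_N]]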

Madeleine Udell, Cornell. Streaming Tucker Approximation. 25

Intuition: one pass core recovery

I we want to know W: compression of X using factor range approximations Qn

I we observe H: compression of X using random projections Φn

how to approximate W?

X ≈ X ×1 Q1Q1ᵀ ×2 · · · ×N QNQNᵀ
= (X ×1 Q1ᵀ · · · ×N QNᵀ) ×1 Q1 · · · ×N QN
= W ×1 Q1 · · · ×N QN

H = X ×1 Φ1ᵀ · · · ×N ΦNᵀ ≈ W ×1 Φ1ᵀQ1 ×2 · · · ×N ΦNᵀQN

we can solve for W: s > k, so each ΦnᵀQn has a left inverse (whp):

W ≈ H ×1 (Φ1ᵀQ1)† ×2 · · · ×N (ΦNᵀQN)†

Madeleine Udell, Cornell. Streaming Tucker Approximation. 26

One pass algorithm

Algorithm One Pass Sketch and Low Rank Recovery

Given: tensor X, rank r = (r1, . . . , rN), DRMs {Φn, Ωn}n∈[N]

I Sketch. (H, V1, . . . , VN) = Sketch(X; {Φn, Ωn}n∈[N])

I Recover factor matrices. For n ∈ [N],

(Qn,∼)← QR(Vn)

I Recover core.

W ← H ×1 (Φ1ᵀQ1)† ×2 · · · ×N (ΦNᵀQN)†

Return: Tucker approximation X̂ = ⟦W; Q1, . . . , QN⟧

accesses X only once, to sketch

Source: [Sun, Guo, Luo, Tropp & Udell 2019]
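
A minimal numpy sketch of the one pass core recovery, reusing mode_n_product from the notation slide; note that X itself never appears:

    def one_pass_recovery(H, Vs, Phis):
        # Recover the core from the core sketch H via left inverses of Phi_n^T Q_n.
        Qs = [np.linalg.qr(V)[0] for V in Vs]
        W = H
        for n, (Phi, Q) in enumerate(zip(Phis, Qs)):
            W = mode_n_product(W, np.linalg.pinv(Phi.T @ Q), n)
        return W, Qs                                # approximation is [[W; Q_1, ..., Q_N]]

    # usage: H, Vs = tucker_sketch(X, Omegas, Phis); W, Qs = one_pass_recovery(H, Vs, Phis)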

Madeleine Udell, Cornell. Streaming Tucker Approximation. 27

Fixed rank approximation

to truncate reconstruction to rank r, truncate core:

Lemma
For a tensor W ∈ Rk1×···×kN and matrices Qn ∈ RIn×kn with orthonormal columns,

⟦W ×1 Q1 · · · ×N QN⟧r = ⟦W⟧r ×1 Q1 · · · ×N QN,

where ⟦·⟧r denotes the best rank r Tucker approximation.

=⇒ compute the fixed rank approximation using, e.g., HOOI on the (small) core approximation W
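
A minimal numpy sketch of this truncation step, reusing the hooi routine sketched earlier as the fixed rank solver on the small core:

    def fixed_rank(W, Qs, ranks):
        # Truncate the recovered core to rank r, then push the small factors through Q_n.
        G, Ps = hooi(W, ranks)
        Us = [Q @ P for Q, P in zip(Qs, Ps)]
        return G, Us                                # rank-r approximation [[G; U_1, ..., U_N]]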

Madeleine Udell, Cornell. Streaming Tucker Approximation. 28

Tail energy

For each unfolding X(n), define its ρth tail energy as

(τ(n)ρ)² := ∑_{ρ < k ≤ min(In, I(−n))} σ²k(X(n)),

where σk(X(n)) is the kth largest singular value of X(n).
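
A minimal numpy sketch of this quantity, reusing the unfold helper from the notation slide:

    def tail_energy_sq(X, n, rho):
        # Squared rho-th tail energy of the mode-n unfolding: the energy in the
        # singular values beyond the top rho.
        sigma = np.linalg.svd(unfold(X, n), compute_uv=False)
        return float(np.sum(sigma[rho:] ** 2))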

Madeleine Udell, Cornell. Streaming Tucker Approximation. 29

Guarantees (I)

Theorem (Recommended parameters [Sun et al. 2019])

Sketch X with Gaussian DRMs of parameters k and s = 2k + 1. Form a rank r Tucker approximation X̂ using the one pass algorithm. Then

E‖X − X̂‖²F ≤ 4 ∑n∈[N] (τ(n)rn)².

If X is truly rank r, we obtain the true Tucker factorization!

Madeleine Udell, Cornell. Streaming Tucker Approximation. 30

Guarantees (II)

Theorem (Detailed guarantee [Sun et al. 2019])

Sketch X with Gaussian DRMs of parameters k and s. Form a rank r Tucker approximation X̂ using the one pass algorithm. Then

E‖X − X̂‖²F ≤ (1 + ∆) min_{1 ≤ ρn < kn − 1} ∑n∈[N] (1 + ρn/(kn − ρn − 1)) (τ(n)ρn)²

where ∆ = max_{n∈[N]} kn/(sn − kn − 1)

Madeleine Udell, Cornell. Streaming Tucker Approximation. 31

Outline

Applications

Tucker factorization

Sketching

Reconstruction

Numerics

*

Madeleine Udell, Cornell. Streaming Tucker Approximation. 32

Different DRMs perform similarly

[Figure: difference in relative error vs. compression factor δ1 = k/I for five synthetic data settings (Low Rank γ = 0.01, Sparse Low Rank γ = 0.01, Polynomial Decay, Low Rank γ = 0.1, Low Rank γ = 1), I = 600, comparing SSRFT, Gaussian TRP, and Sparse TRP.]

Comments: Synthetic data, I = 600 and r = (5, 5, 5). k/I = .4 =⇒ 20× compression.

Madeleine Udell, Cornell. Streaming Tucker Approximation. 33

Sensible reconstruction at practical compression level

[Figure: difference in relative error vs. memory use for five synthetic data settings (Low Rank γ = 0.01, Sparse Low Rank γ = 0.01, Polynomial Decay, Low Rank γ = 0.1, Low Rank γ = 1), I = 300, comparing Two Pass, One Pass, and TS.]

Comments: Error of fixed-rank approximation relative to HOOI for r = 10, I = 300 using TRP. Total memory use is ((2k + 1)^N + kIN) and (K·r^(2N) + K·r^(2N−2)). Low-rank data uses γ = 0.01, 0.1, 1.

Madeleine Udell, Cornell. Streaming Tucker Approximation. 34

Combustion simulation

[Figure: 128 × 128 slices of the combustion data: Original, HOOI, Two Pass, and One Pass reconstructions.]

Comments: 1408 × 128 × 128 simulated combustion data from [Lapointe, Savard & Blanquart 2015].

Madeleine Udell, Cornell. Streaming Tucker Approximation. 35

Video scene classification

[Figure: scene classifications over frames 0–2000, one row per method: Linear Sketch (k = 20); Two-Pass Tucker (k = 20, r = 10); One-Pass Tucker (k = 20, r = 10); One-Pass Tucker (k = 300, r = 10).]

Comments: Video data 2200 × 1080 × 1980. Classify scenes using k-means on: 1) the linear sketch along the time dimension, k = 20 (Row 1); 2) the Tucker factor along the time dimension, computed via our two pass (Row 2) and one pass (Row 3) sketching algorithms, (r, k, s) = (10, 20, 41); 3) the Tucker factor along the time dimension, computed via our one pass (Row 4) sketching algorithm, (r, k, s) = (10, 300, 601).

Madeleine Udell, Cornell. Streaming Tucker Approximation. 36

Summary

Streaming Tucker approximation compresses a tensor without storing it.

useful for:

I streaming data

I distributed data

I low memory compute

key ideas:

I form linear sketch of tensor and recover from sketch

I random projection of tensor preserves dominant information

Madeleine Udell, Cornell. Streaming Tucker Approximation. 37

Future work + references

let’s talk!

I bigger tensors to compress?

I streaming compression for 〈your research〉?

references:

I Sun, Y., Guo, Y., Tropp, J. A., and Udell, M. (2018). Tensor random projection for low memory dimension reduction. In NeurIPS Workshop on Relational Representation Learning.

I Sun, Y., Guo, Y., Luo, C., Tropp, J. A., and Udell, M. (2019). Low rank Tucker approximation of a tensor from streaming data. In preparation.

I Tropp, J. A., Yurtsever, A., Udell, M., and Cevher, V. (2019). Streaming low-rank matrix approximation with an application to scientific simulation. Submitted to SISC.

Madeleine Udell, Cornell. Streaming Tucker Approximation. 38

Outline

Applications

Tucker factorization

Sketching

Reconstruction

Numerics

*

Madeleine Udell, Cornell. Streaming Tucker Approximation. 39

References

Achlioptas, D. (2003). Database-friendly random projections: Johnson-Lindenstrauss with binary coins. Journal of Computer and System Sciences, 66(4), 671–687.

Battaglino, C., Ballard, G., & Kolda, T. G. (2019). Faster parallel Tucker tensor decomposition using randomization.

De Lathauwer, L., De Moor, B., & Vandewalle, J. (2000). A multilinear singular value decomposition. SIAM Journal on Matrix Analysis and Applications, 21(4), 1253–1278.

Halko, N., Martinsson, P.-G., & Tropp, J. A. (2011). Finding structure with randomness: Probabilistic algorithms for constructing approximate matrix decompositions. SIAM Review, 53(2), 217–288.

Lapointe, S., Savard, B., & Blanquart, G. (2015). Differential diffusion effects, distributed burning, and local extinctions in high Karlovitz premixed flames. Combustion and Flame, 162(9), 3341–3355.

Li, P., Hastie, T. J., & Church, K. W. (2006). Very sparse random projections. In Proceedings of the 12th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, (pp. 287–296). ACM.

Li, Y., Nguyen, H. L., & Woodruff, D. P. (2014). Turnstile streaming algorithms might as well be linear sketches. In Proceedings of the Forty-Sixth Annual ACM Symposium on Theory of Computing, (pp. 174–183). ACM.

Malik, O. A. & Becker, S. (2018). Low-rank Tucker decomposition of large tensors using TensorSketch. In Advances in Neural Information Processing Systems, (pp. 10116–10126).

Madeleine Udell, Cornell. Streaming Tucker Approximation. 39

Rudelson, M. (2012). Row products of random matrices. Advances in Mathematics, 231(6), 3199–3231.

Sun, Y., Guo, Y., Luo, C., Tropp, J. A., & Udell, M. (2019). Low rank Tucker approximation of a tensor from streaming data. In preparation.

Sun, Y., Guo, Y., Tropp, J. A., & Udell, M. (2018). Tensor random projection for low memory dimension reduction. In NeurIPS Workshop on Relational Representation Learning.

Tucker, L. R. (1966). Some mathematical notes on three-mode factor analysis. Psychometrika, 31(3), 279–311.

Woolfe, F., Liberty, E., Rokhlin, V., & Tygert, M. (2008). A fast randomized algorithm for the approximation of matrices. Applied and Computational Harmonic Analysis, 25(3), 335–366.

Zhou, G., Cichocki, A., & Xie, S. (2014). Decomposition of big tensors with low multilinear rank. arXiv preprint arXiv:1412.1885.

Madeleine Udell, Cornell. Streaming Tucker Approximation. 39