Transcript (source: roseyu.com/Papers/nips2015_workshop_poster.pdf)

Efficient Spatio-Temporal Sampling via Low-Rank Tensor Sketching
Rose Yu, Sanjay Purushotham, Yan Liu
{qiyu, spurusho, yanliu.cs}@usc.edu

1. Abstract
• We formulate the spatio-temporal sampling task as a tensor sketching problem.
• We generalize sparse subspace embedding to the low-rank tensor domain.
• Our algorithm achieves accurate predictions with significant speed-up in social media and climate applications.

2. Spatio-Temporal Sampling
• Definition: identify important locations or time stamps and extract samples from them, with the merit of computational efficiency.
• Challenges of existing samplers:
  – Voronoi diagrams: no theoretical guarantee.
  – Sequential sampling [Krause 2008]: requires a submodularity assumption.
  – Determinantal point processes [Kulesza 2012]: require an expensive eigen-decomposition.

3. Tensor Representation
• Multivariate spatio-temporal data can be naturally represented by a tensor 𝒳 ∈ ℝ^{P×T×M}, with p = 1, …, P indexing space, t = 1, …, T indexing time, and M variables.
• Low-rank tensors can capture structure in spatio-temporal data [Yu 2014, 2015].
• Many spatio-temporal analysis tasks can be formulated as low-rank tensor learning problems.
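As a concrete illustration of this layout, a minimal NumPy sketch (the sizes, array names, and the mode-1 unfolding are our own illustrative choices, not taken from the poster):

```python
import numpy as np

# Illustrative sizes: P locations, T time stamps, M variables.
P, T, M = 30, 100, 4

# X[p, t, m] = value of variable m at location p and time t,
# matching the poster's tensor in R^{P x T x M} (space x time x variables).
X = np.random.randn(P, T, M)

# A mode-n unfolding flattens all but one mode; low n-rank of such
# unfoldings is what "low-rank tensor" refers to here.
X_mode1 = X.reshape(P, T * M)
print(X_mode1.shape, np.linalg.matrix_rank(X_mode1))
```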

4. Forecasting Formulation
• Concatenate the historical measurements of L lags, 𝒳̃_{t,:,m} = [𝒳_{t−1,:,m}, 𝒳_{t−2,:,m}, …, 𝒳_{t−L,:,m}], giving 𝒳̃ ∈ ℝ^{T×(PL)×M}.
• Goal: learn a model tensor 𝒲 ∈ ℝ^{(PL)×P×M} by solving
  𝒲̂ = argmin_𝒲 Σ_m ‖𝒵_{:,:,m} 𝒲_{:,:,m} − 𝒳_{:,:,m}‖²_F   subject to rank(𝒲) ≤ R,
  where 𝒵 denotes the lagged data 𝒳̃ and the prediction is 𝒳̂_{t,q,m} = 𝒳̃_{t,:,m} 𝒲_{:,q,m}.
• With a spatial regularizer built from a matrix L, the objective becomes
  𝒲̂ = argmin_𝒲 ‖𝒳̂ − 𝒳‖²_F + μ Σ_{m=1}^{M} trace(𝒳̂_{:,:,m}ᵀ L 𝒳̂_{:,:,m})   subject to rank(𝒲) ≤ R.
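A minimal NumPy sketch of this formulation under simplifying assumptions (the rank constraint is ignored and each variable's least-squares problem is solved independently; the helper names are ours):

```python
import numpy as np

def lagged_design(X, L):
    """Stack L lagged copies of X (shape P x T x M) into Z of shape (T-L) x (P*L) x M."""
    P, T, M = X.shape
    Z = np.zeros((T - L, P * L, M))
    for t in range(L, T):
        # concatenate X[:, t-1, m], ..., X[:, t-L, m] for every variable m
        Z[t - L] = np.concatenate([X[:, t - l, :] for l in range(1, L + 1)], axis=0)
    return Z

def fit_var(X, L):
    """Unconstrained least-squares estimate of W[:, :, m] for each variable m."""
    P, T, M = X.shape
    Z = lagged_design(X, L)                    # (T-L) x (P*L) x M
    Y = np.transpose(X[:, L:, :], (1, 0, 2))   # (T-L) x P x M targets
    W = np.zeros((P * L, P, M))
    for m in range(M):
        W[:, :, m], *_ = np.linalg.lstsq(Z[:, :, m], Y[:, :, m], rcond=None)
    return W
```

A low-rank solution would additionally project W (for instance via Tucker decomposition) after this unconstrained step, as in [1].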

5. Subsampled Randomized Low-Rank Tensor Learning
• Double sketching: (1) sketch the data tensor along the time dimension; (2) sketch the model tensor along the location and variable dimensions.
• ℓ2-sparse subspace embedding [Clarkson & Woodruff 2013]: for each column j, uniformly pick one of the K rows i and assign −1 or +1 with equal probability to S_{i,j}.
[Diagram: the data tensors 𝒵 and 𝒳 are sketched along time by Sₜ, the model tensor 𝒲′ is sketched along locations and variables by S₁ˢ, S₂ˢ, S₃ˢ, and applying the tensor operation to the sketched tensors yields an ε-approximation of the optimal result.]
Theoretical analysis:
• Lemma 1 (adapted from [5]): for any 0 < δ < 1, if Sₜ is a sparse ℓ2-subspace embedding matrix with K = O(P²M²/(δε²)) rows, then with probability 1 − δ we obtain a (1 + ε)-approximation to the tensor least-squares problem in O(nnz(𝒳)) time.
• Lemma 2: let Sₙˢ, n = 1, 2, 3, be sparse ℓ2-subspace embedding matrices with K = O(R/ε) rows; then, with high probability, we obtain a (1 + ε)-approximation to the low-rank tensor approximation in O(nnz(𝒲′)) + poly(P + Q + M)·poly(R/ε) time.
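A minimal sketch of the ℓ2-sparse subspace embedding applied along the time dimension (a dense S is built here only for clarity; the sizes and names are illustrative assumptions):

```python
import numpy as np

def sparse_embedding(K, T, rng):
    """Sparse l2-subspace embedding S (K x T): one nonzero per column,
    with a uniformly chosen row and a +/-1 sign of equal probability."""
    S = np.zeros((K, T))
    rows = rng.integers(0, K, size=T)
    signs = rng.choice([-1.0, 1.0], size=T)
    S[rows, np.arange(T)] = signs
    return S

rng = np.random.default_rng(0)
T, PL, P, M, K = 3000, 60, 30, 20, 200
Z = np.random.randn(T, PL, M)                 # lagged data tensor
X = np.random.randn(T, P, M)                  # target tensor

S_t = sparse_embedding(K, T, rng)             # sketch along the time dimension
Z_sk = np.einsum('kt,tpm->kpm', S_t, Z)       # K x PL x M
X_sk = np.einsum('kt,tpm->kpm', S_t, X)       # K x P x M
# Solving the least squares on (Z_sk, X_sk) instead of (Z, X) gives the
# (1 + eps)-approximation of Lemma 1, with sketching cost proportional to nnz(Z).
```

The second half of the double-sketching scheme would analogously apply S₁ˢ, S₂ˢ, S₃ˢ along the location and variable modes of the model tensor 𝒲′.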

6. Experiments
• Synthetic: 30000 time stamps generated from a VAR(2) model with parameter tensor W ∈ ℝ^{30×60×20}; the procedure is repeated 10 times.
[Figures: parameter estimation error and run time (sec) versus sketch size, comparing Sparse, Gaussian, and SRP sketches on the synthetic data.]
• Real-world datasets:
  – Foursquare: check-ins of 121 users, 15 categories of business venues, 1200 time intervals.
  – AWS: measurements from 153 weather stations, 4 climate variables, 76 time stamps.
• Settings: 90% training data on both datasets, VAR models with different lags, averaged run time.
[Figures: run time and forecasting RMSE on Foursquare and AWS, comparing Sparse, Gaussian, and SRP.]
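For concreteness, a small sketch of how synthetic VAR(2) data of this kind can be generated (scaled-down sizes; this is our assumed setup, not the authors' exact generator):

```python
import numpy as np

rng = np.random.default_rng(0)
P, M, T, L = 30, 20, 1000, 2               # locations, variables, time stamps, lags

# One random VAR(2) coefficient matrix per lag and variable, scaled so the
# series typically stays stable.
A = 0.3 * rng.standard_normal((L, P, P, M)) / np.sqrt(P)

X = np.zeros((P, T, M))
X[:, :L, :] = rng.standard_normal((P, L, M))
for t in range(L, T):
    for m in range(M):
        X[:, t, m] = sum(A[l, :, :, m] @ X[:, t - 1 - l, m] for l in range(L))
        X[:, t, m] += 0.1 * rng.standard_normal(P)   # innovation noise
```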

Preliminary:
• Tucker decomposition: 𝒲 ≈ 𝒮 ×₁ U₁ ×₂ U₂ ×₃ U₃, where 𝒮 ∈ ℝ^{R₁×R₂×R₃} is the core tensor and U₁, U₂, U₃ are factor (projection) matrices.
• Tensor n-product: ×ₙ multiplies a tensor by a matrix along its n-th mode, e.g. (𝒮 ×₁ U₁)(i, j, k) = Σ_r U₁(i, r) 𝒮(r, j, k).
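A minimal NumPy sketch of the tensor n-product and a Tucker reconstruction for a third-order tensor (the helper name mode_n_product is ours):

```python
import numpy as np

def mode_n_product(ten, U, n):
    """Multiply tensor `ten` by matrix U along mode n (the tensor n-product)."""
    moved = np.moveaxis(ten, n, 0)                 # bring mode n to the front
    out = np.tensordot(U, moved, axes=([1], [0]))  # contract U's columns with mode n
    return np.moveaxis(out, 0, n)

# Tucker reconstruction: W ~= S x_1 U1 x_2 U2 x_3 U3
R1, R2, R3, I1, I2, I3 = 3, 4, 5, 10, 12, 14
S = np.random.randn(R1, R2, R3)                    # core tensor
U1, U2, U3 = (np.random.randn(I, R) for I, R in [(I1, R1), (I2, R2), (I3, R3)])
W = mode_n_product(mode_n_product(mode_n_product(S, U1, 0), U2, 1), U3, 2)
print(W.shape)   # (10, 12, 14)
```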

Background from [1], Accelerated Online Low-Rank Tensor Learning (ALTO):

In the application of our interest, i.e., multivariate spatio-temporal stream analysis, both Z and X grow along the temporal dimension as time T increases. We define W_m = W_{:,:,m}, and similarly for the other tensors; the unconstrained optimization problem at time T can then be written as min_W ‖W Z_{1:T} − X_{1:T}‖²_F, where the index m is omitted for simplicity. Suppose that at time stamp T we receive a new batch of data of size b. The parameter tensor at the k-th iteration, W^(k), can be updated with two possible strategies: exact update and increment update.

Exact update. A closed-form solution of W^(k) is obtained by using all the data from time stamp 1 to T + b:
W^(k) = X_{1:T+b} Z†_{1:T+b},
where † denotes the matrix pseudo-inverse. The pseudo-inverse can be computed efficiently via the Woodbury matrix identity (Woodbury, 1950). At each iteration, the inverse of the complete data covariance (Z_{1:T+b} Z_{1:T+b}ᵀ)^{−1} can be computed by inverting a smaller matrix constructed from the new data Z_{T+1:T+b}, at a computational cost linear in the batch size b, with a small memory overhead to store the inverse of the previous covariance matrix (Z_{1:T} Z_{1:T}ᵀ)^{−1}. The details are deferred to Appendix B.1 of [1].

Increment update. The value of W can also be updated incrementally given the new data:
W^(k) = (1 − α) W^(k−1) + α X_{T+1:T+b} Z†_{T+1:T+b}.

The difference between the two updating schemes lies in the variables stored in memory. For the exact update, the data statistics required to reconstruct the model are stored; it gives an exact solution of the linear regression problem given all the historical observations. For the incremental update, the previous model is stored, the solution is computed for the current data only, and a convex combination of the two models is taken. The different statistical properties of the two updating schemes may require different theoretical analysis tools, but the low-rank projection of the solution is invariant to the updating strategy.
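A minimal sketch of the two update rules for a single variable m, so that W, X, Z are matrices (the streaming bookkeeping and the Woodbury-based covariance update of Appendix B.1 are omitted, and the function signatures are ours):

```python
import numpy as np

def exact_update(X_hist, Z_hist):
    """Closed-form solution W = X_{1:T+b} Z^+_{1:T+b} using all data seen so far."""
    return X_hist @ np.linalg.pinv(Z_hist)

def increment_update(W_prev, X_new, Z_new, alpha=0.5):
    """Convex combination of the previous model and the solution on the new batch."""
    return (1 - alpha) * W_prev + alpha * (X_new @ np.linalg.pinv(Z_new))
```

Here X_hist is P x (T+b) with one column per time stamp and Z_hist is (PL) x (T+b), so both updates return a P x (PL) matrix W.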

Online low-rank tensor approximation. In Step 2, the solution from Step 1 must be projected onto the low-rank tensor space. ALTO measures rank with respect to the sum-n-rank of the tensor: the maximum n-rank of W over all modes is restricted to be no larger than R. To obtain the n-rank projection, it resorts to the Tucker decomposition (De Lathauwer et al., 2000), which decomposes a tensor into a core tensor and a set of projection matrices; the dimensions of the core tensor are the n-ranks of the tensor itself. This projection is generally time-consuming, as it usually involves an SVD of the unfolded matrix at each mode of a full tensor, and in the online setting it must be repeated at every iteration, which is infeasible for large-scale applications. ALTO instead uses the projection results from the previous iteration to approximate the current projection, eliminating the need for SVDs of the unfoldings of a full tensor: it performs dimension reduction and computes SVDs of the unfoldings of a low-dimensional tensor.

Without loss of generality, ALTO is described for a third-order tensor. Given the Tucker decomposition of W ∈ ℝ^{N×N×N} from the previous iteration,
W^(k−1) = S^(k−1) ×₁ U₁^(k−1) ×₂ U₂^(k−1) ×₃ U₃^(k−1),
each Uᵢ^(k−1) ∈ ℝ^{N×R} is first augmented with K random column vectors (i = 1, 2, 3), drawn from a zero-mean Gaussian distribution and introduced as a noise perturbation. The Gram–Schmidt process is then applied to create orthonormal augmented projection matrices Vᵢ^(k−1) ∈ ℝ^{N×(R+K)}, which have K more columns than Uᵢ^(k−1).

With the augmented projection matrices Vᵢ^(k−1), the tensor W^(k) is projected onto an augmented core tensor S′^(k) of dimension (R+K)×(R+K)×(R+K):
S′^(k) = W^(k) ×₁ V₁^(k−1)ᵀ ×₂ V₂^(k−1)ᵀ ×₃ V₃^(k−1)ᵀ.

The rank-R approximation of the augmented core is then computed by decomposing S′^(k):
S′^(k) ≈ S^(k) ×₁ V′₁^(k) ×₂ V′₂^(k) ×₃ V′₃^(k),
where S^(k) is the new core tensor of dimension R×R×R and each V′ᵢ^(k) is of size (R+K)×R. The projection matrices are updated as Uᵢ^(k) = Vᵢ^(k−1) V′ᵢ^(k) for i = 1, 2, 3, and the final low-rank projection of the solution tensor at the current iteration is
W^(k) = S^(k) ×₁ U₁^(k) ×₂ U₂^(k) ×₃ U₃^(k).

The overall workflow is summarized in Algorithm 1 of [1]. The rank-R approximation of the augmented core S′^(k) is computed by iterating over all the modes and sequentially mapping the unfolded tensor into the rank-R subspace, a procedure called low-rank Tensor Sequential Mapping (TSM; Algorithm 2 of [1]). ALTO is computationally efficient because the augmented core S′^(k) has dimension (R+K)×(R+K)×(R+K), which is much smaller than W^(k): at each iteration, TSM only involves a top-R SVD of matrices of size (R+K)×(R+K)², compared with the expensive top-R SVD of N×N² matrices in most existing low-rank tensor learning approaches.
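A minimal sketch of this projection step (augment, orthonormalize, project onto the augmented core, re-compress to rank R); the rank-R step below uses a per-mode truncated SVD as a stand-in for the paper's TSM procedure, which is not reproduced here:

```python
import numpy as np

def mode_n_product(ten, U, n):
    out = np.tensordot(U, np.moveaxis(ten, n, 0), axes=([1], [0]))
    return np.moveaxis(out, 0, n)

def alto_project(W_k, U_prev, R, K, rng):
    """One online low-rank projection: W_k is N x N x N, U_prev = [U1, U2, U3], each N x R."""
    # 1. Augment each factor with K Gaussian columns and orthonormalize (QR ~ Gram-Schmidt).
    V = [np.linalg.qr(np.hstack([U, rng.standard_normal((U.shape[0], K))]))[0] for U in U_prev]
    # 2. Project W_k onto the augmented (R+K)^3 core.
    S_aug = W_k
    for n, Vn in enumerate(V):
        S_aug = mode_n_product(S_aug, Vn.T, n)
    # 3. Rank-R factors of the small augmented core (truncated SVD per mode, in place of TSM).
    V_small = []
    for n in range(3):
        unfold = np.moveaxis(S_aug, n, 0).reshape(S_aug.shape[n], -1)
        u, _, _ = np.linalg.svd(unfold, full_matrices=False)
        V_small.append(u[:, :R])
    S_core = S_aug
    for n in range(3):
        S_core = mode_n_product(S_core, V_small[n].T, n)
    # 4. New factors U_n^(k) = V_n^(k-1) V'_n^(k) and the rank-R reconstruction of W.
    U_new = [V[n] @ V_small[n] for n in range(3)]
    W_lowrank = S_core
    for n in range(3):
        W_lowrank = mode_n_product(W_lowrank, U_new[n], n)
    return W_lowrank, U_new
```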



7. References
[1] Rose Yu, Dehua Cheng, Yan Liu. Accelerated Online Low-Rank Tensor Learning for Multivariate Spatio-Temporal Streams. International Conference on Machine Learning (ICML), 2015.
[2] Rose Yu*, Mohammad Taha Bahadori*, Yan Liu (*equal contributions). Fast Multivariate Spatio-temporal Analysis via Low Rank Tensor Learning. Advances in Neural Information Processing Systems (NIPS), 2014, spotlight.
[3] Edwin V. Bonilla, Kian M. Chai, Christopher Williams. Multi-task Gaussian Process Prediction. Advances in Neural Information Processing Systems, 2007.
[4] A. Kulesza, B. Taskar. Determinantal Point Processes for Machine Learning. Foundations and Trends in Machine Learning, 5(2-3):123–286, 2012.
[5] D. P. Woodruff. Sketching as a Tool for Numerical Linear Algebra. Foundations and Trends in Theoretical Computer Science, 10(1-2):1–157, 2014.