Big Tensor Data Reduction - Nikos Sidiropoulos, Dept. ECE, University of Minnesota - NSF/ECCS Big Data, 3/21/2013
Transcript
Page 1:

Big Tensor Data Reduction

Nikos Sidiropoulos, Dept. ECE

University of Minnesota

NSF/ECCS Big Data, 3/21/2013


Page 2:

STAR Group, Collaborators, Credits

Signal and Tensor Analytics Research (STAR) group: https://sites.google.com/a/umn.edu/nikosgroup/home

Signal processing, big data, preference measurement, cognitive radio, spectrum sensing.

Christos Faloutsos, Tom Mitchell, Vaggelis Papalexakis (CMU), George Karypis (UMN), NSF-NIH/BIGDATA: Big Tensor Mining: Theory, Scalable Algorithms and Applications
Timos Tsakonas (KTH)
Tasos Kyrillidis (EPFL)


Page 3:

Tensor? What is this?

Has a different formal meaning in physics (spin, symmetries).
Informally adopted in CS as shorthand for a three-way array: a dataset X indexed by three indices, with (i, j, k)-th entry X(i, j, k).
For two vectors a (I × 1) and b (J × 1), a ◦ b is an I × J rank-one matrix with (i, j)-th element a(i)b(j); i.e., a ◦ b = ab^T.
For three vectors a (I × 1), b (J × 1), c (K × 1), a ◦ b ◦ c is an I × J × K rank-one three-way array with (i, j, k)-th element a(i)b(j)c(k) (see the numpy sketch below).
The rank of a three-way array X is the smallest number of outer products needed to synthesize X.
'Curiosities':

Two-way (I × J): row-rank = column-rank = rank ≤ min(I, J); three-way: row-rank ≠ column-rank ≠ "tube"-rank ≠ rank.
Two-way: rank(randn(I,J)) = min(I, J) w.p. 1; three-way: rank(randn(2,2,2)) is a random variable (2 w.p. π/4 ≈ 0.79, 3 w.p. 1 − π/4 ≈ 0.21).
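A minimal numpy sketch (not from the original slides; sizes are illustrative) of the three-way outer product and the rank-one structure it produces:

```python
import numpy as np

rng = np.random.default_rng(0)
I, J, K = 4, 3, 5
a, b, c = rng.standard_normal(I), rng.standard_normal(J), rng.standard_normal(K)

# a ◦ b = a b^T : I × J rank-one matrix with (i, j)-th element a(i) b(j)
M = np.outer(a, b)

# a ◦ b ◦ c : I × J × K rank-one three-way array with (i, j, k)-th element a(i) b(j) c(k)
T = np.einsum('i,j,k->ijk', a, b, c)

assert np.allclose(M, T[:, :, 0] / c[0])                 # every frontal slab is a scaled copy of a b^T
assert np.linalg.matrix_rank(T.reshape(I, J * K)) == 1   # any matricization of a rank-one tensor has rank 1
```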


Page 4:

NELL @ CMU / Tom Mitchell

Crawl the web, learn language 'like children do': encounter new concepts, learn from context.
NELL triplets of "subject-verb-object" naturally lead to a 3-mode tensor.

[Figure: the subject × verb × object tensor X written as a sum of rank-one terms, X ≈ a_1 ◦ b_1 ◦ c_1 + a_2 ◦ b_2 ◦ c_2 + · · · + a_F ◦ b_F ◦ c_F.]

Each rank-one factor corresponds to a concept, e.g., 'leaders' or 'tools'.
E.g., say (a_1, b_1, c_1) corresponds to 'leaders': subjects/rows with a high score on a_1 will be "Obama", "Merkel", "Steve Jobs"; objects/columns with a high score on b_1 will be "USA", "Germany", "Apple Inc."; and verbs/fibers with a high score on c_1 will be verbs like "lead", "is-president-of", and "is-CEO-of".


Page 5:

Low-rank tensor decomposition / approximation

X ≈ ∑_{f=1}^{F} a_f ◦ b_f ◦ c_f,

Parallel factor analysis (PARAFAC) model [Harshman '70-'72], a.k.a. canonical decomposition [Carroll & Chang '70], a.k.a. CP; cf. [Hitchcock '27].
PARAFAC can be written as a system of matrix equations X_k = A D_k(C) B^T, where D_k(C) is a diagonal matrix holding the k-th row of C on its diagonal; or in compact matrix form as X ≈ (B ⊙ A) C^T, using the Khatri-Rao product ⊙.
In particular, employing a property of the Khatri-Rao product,

X ≈ (B ⊙ A) C^T ⟺ vec(X) ≈ (C ⊙ B ⊙ A) 1,

where 1 is a vector of all 1’s.
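A minimal numpy sketch (not from the original slides) that builds a small PARAFAC tensor and checks the slab, matricized, and vectorized forms above; the khatri_rao helper and all sizes are illustrative:

```python
import numpy as np

def khatri_rao(A, B):
    """Column-wise Kronecker (Khatri-Rao) product: column f is kron(A[:, f], B[:, f])."""
    (I, F), (J, _) = A.shape, B.shape
    return (A[:, None, :] * B[None, :, :]).reshape(I * J, F)

rng = np.random.default_rng(0)
I, J, K, F = 4, 3, 5, 2
A, B, C = rng.standard_normal((I, F)), rng.standard_normal((J, F)), rng.standard_normal((K, F))

X = np.einsum('if,jf,kf->ijk', A, B, C)                  # X = sum_f a_f ◦ b_f ◦ c_f

# frontal slabs: X_k = A D_k(C) B^T
assert all(np.allclose(X[:, :, k], A @ np.diag(C[k]) @ B.T) for k in range(K))

# tall (IJ × K) matricization: X ≈ (B ⊙ A) C^T, rows ordered with the J-index varying slowest
X_tall = np.einsum('ijk->jik', X).reshape(I * J, K)
assert np.allclose(X_tall, khatri_rao(B, A) @ C.T)

# vectorization: vec(X) ≈ (C ⊙ B ⊙ A) 1
assert np.allclose(X_tall.T.reshape(-1), khatri_rao(C, khatri_rao(B, A)) @ np.ones(F))
```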


Page 6:

Uniqueness

The distinguishing feature of the PARAFAC model is its essential uniqueness: under certain conditions, (A, B, C) can be identified from X, i.e., they are unique up to permutation and scaling of columns [Kruskal '77, Sidiropoulos et al. '00-'07, De Lathauwer '04-, Stegeman '06-].
Consider an I × J × K tensor X of rank F. In vectorized form, it can be written as the IJK × 1 vector x = (A ⊙ B ⊙ C) 1, for some A (I × F), B (J × F), and C (K × F) - a PARAFAC model of size I × J × K and order F parameterized by (A, B, C).
The Kruskal-rank of A, denoted k_A, is the maximum k such that any k columns of A are linearly independent (k_A ≤ r_A := rank(A)).
Given X (⟺ x), if k_A + k_B + k_C ≥ 2F + 2, then (A, B, C) are unique up to a common column permutation and scaling.
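A small brute-force sketch (not from the original slides, and only practical for small matrices) of the Kruskal-rank and of checking Kruskal's condition:

```python
import numpy as np
from itertools import combinations

def kruskal_rank(A, tol=1e-10):
    """Largest k such that every set of k columns of A is linearly independent (brute force)."""
    F = A.shape[1]
    for k in range(F, 0, -1):
        if all(np.linalg.matrix_rank(A[:, list(cols)], tol=tol) == k
               for cols in combinations(range(F), k)):
            return k
    return 0

rng = np.random.default_rng(0)
I, J, K, F = 6, 5, 4, 3
A, B, C = rng.standard_normal((I, F)), rng.standard_normal((J, F)), rng.standard_normal((K, F))

kA, kB, kC = kruskal_rank(A), kruskal_rank(B), kruskal_rank(C)
print("k_A, k_B, k_C =", kA, kB, kC,
      "| Kruskal condition k_A + k_B + k_C >= 2F + 2:", kA + kB + kC >= 2 * F + 2)
```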


Page 7:

Big data: need for compression

Tensors can easily become really big! Size is exponential in the number of dimensions ('ways', or 'modes').
Cannot load them in main memory; they may reside in cloud storage.
Tensor compression?
Commonly used compression method for 'moderate'-size tensors: fit an orthogonal Tucker3 model, regress the data onto the fitted mode bases.
Lossless if exact mode bases are used [CANDELINC]; but Tucker3 fitting is itself cumbersome for big tensors (big matrix SVDs), and one cannot compress below the mode ranks without introducing errors.
If the tensor is sparse, it can be stored as [i, j, k, value] + specialized sparse matrix/tensor algorithms can be used [(Sparse) Tensor Toolbox, Bader & Kolda]. Useful if the sparse representation fits in main memory.


Page 8:

Tensor compression

Consider compressing x into y = Sx, where S is d × IJK, d ≪ IJK. In particular, consider a specially structured compression matrix S = U^T ⊗ V^T ⊗ W^T.

This corresponds to multiplying (every slab of) X from the I-mode with U^T, from the J-mode with V^T, and from the K-mode with W^T, where U is I × L, V is J × M, and W is K × N, with L ≤ I, M ≤ J, N ≤ K and LMN ≪ IJK (see the sketch after the figure).

[Figure: mode-wise compression of the I × J × K tensor X by U^T, V^T, and W^T into the much smaller L × M × N tensor Y.]
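A minimal numpy sketch (not from the original slides; the vectorization ordering and sizes are illustrative choices) showing that the structured matrix S = U^T ⊗ V^T ⊗ W^T acting on vec(X) is exactly mode-wise multiplication of X:

```python
import numpy as np

rng = np.random.default_rng(0)
I, J, K = 8, 7, 6
L, M, N = 4, 3, 3
X = rng.standard_normal((I, J, K))
U, V, W = rng.standard_normal((I, L)), rng.standard_normal((J, M)), rng.standard_normal((K, N))

# mode-wise compression: multiply the I-mode by U^T, the J-mode by V^T, the K-mode by W^T
Y = np.einsum('ijk,il,jm,kn->lmn', X, U, V, W)

# equivalently, one big structured matrix S = U^T ⊗ V^T ⊗ W^T acting on vec(X)
# (vec here orders the I-index slowest and the K-index fastest, matching np.kron)
S = np.kron(U.T, np.kron(V.T, W.T))
assert np.allclose(S @ X.reshape(-1), Y.reshape(-1))
print("compressed", X.size, "entries down to", Y.size)
```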


Page 9:

Key

Due to a property of the Kronecker product,

(U^T ⊗ V^T ⊗ W^T)(A ⊙ B ⊙ C) = (U^T A) ⊙ (V^T B) ⊙ (W^T C),

from which it follows that

y = ((U^T A) ⊙ (V^T B) ⊙ (W^T C)) 1 = (Ã ⊙ B̃ ⊙ C̃) 1,

i.e., the compressed data follow a PARAFAC model of size L × M × N and order F parameterized by (Ã, B̃, C̃), with Ã := U^T A, B̃ := V^T B, C̃ := W^T C.
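A quick numerical check (not from the original slides; the khatri_rao helper and all sizes are illustrative) of the mixed-product property and of the fact that the compressed data follow a small PARAFAC model in (U^T A, V^T B, W^T C):

```python
import numpy as np

def khatri_rao(A, B):
    """Column-wise Kronecker (Khatri-Rao) product: column f is kron(A[:, f], B[:, f])."""
    (I, F), (J, _) = A.shape, B.shape
    return (A[:, None, :] * B[None, :, :]).reshape(I * J, F)

rng = np.random.default_rng(0)
I, J, K, F = 8, 7, 6, 3
L, M, N = 4, 3, 3
A, B, C = rng.standard_normal((I, F)), rng.standard_normal((J, F)), rng.standard_normal((K, F))
U, V, W = rng.standard_normal((I, L)), rng.standard_normal((J, M)), rng.standard_normal((K, N))

x = khatri_rao(A, khatri_rao(B, C)) @ np.ones(F)   # uncompressed PARAFAC data (vectorized)
S = np.kron(U.T, np.kron(V.T, W.T))                # structured compression matrix

# (U^T ⊗ V^T ⊗ W^T)(A ⊙ B ⊙ C) = (U^T A) ⊙ (V^T B) ⊙ (W^T C)
lhs = S @ khatri_rao(A, khatri_rao(B, C))
rhs = khatri_rao(U.T @ A, khatri_rao(V.T @ B, W.T @ C))
assert np.allclose(lhs, rhs)

# hence y = S x is itself a (small) PARAFAC model parameterized by (U^T A, V^T B, W^T C)
assert np.allclose(S @ x, rhs @ np.ones(F))
```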


Page 10:

Random multi-way compression can be better!

Sidiropoulos & Kyrillidis, IEEE SPL, Oct. 2012.
Assume that the columns of A, B, C are sparse, and let n_a (n_b, n_c) be an upper bound on the number of nonzero elements per column of A (respectively B, C).
Let the mode-compression matrices U (I × L, L ≤ I), V (J × M, M ≤ J), and W (K × N, N ≤ K) be randomly drawn from an absolutely continuous distribution with respect to the Lebesgue measure in R^{IL}, R^{JM}, and R^{KN}, respectively.
If

min(L, k_A) + min(M, k_B) + min(N, k_C) ≥ 2F + 2, and

L ≥ 2n_a, M ≥ 2n_b, N ≥ 2n_c,

then the original factor loadings A, B, C are almost surely identifiable from the compressed data.


Page 11:

Proof rests on two lemmas + Kruskal

Lemma 1: Consider Ã := U^T A, where A is I × F, and let the I × L matrix U be randomly drawn from an absolutely continuous distribution with respect to the Lebesgue measure in R^{IL} (e.g., multivariate Gaussian with a nonsingular covariance matrix). Then k_Ã = min(L, k_A) almost surely (with probability 1).
Lemma 2: Consider Ã := U^T A, where Ã and U are given and A is sought. Suppose that every column of A has at most n_a nonzero elements, and that k_{U^T} ≥ 2n_a. (The latter holds with probability 1 if the I × L matrix U is randomly drawn from an absolutely continuous distribution with respect to the Lebesgue measure in R^{IL}, and min(I, L) ≥ 2n_a.) Then A is the unique solution with at most n_a nonzero elements per column [Donoho & Elad '03].
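A minimal sketch (not the paper's algorithm) of the per-column sparse recovery that Lemma 2 guarantees: each column of A is recovered from Ã = U^T A with a greedy orthogonal matching pursuit, used here as a heuristic stand-in for exact ℓ0 recovery; all sizes and the sparsity level are illustrative and assumed known:

```python
import numpy as np

def omp(D, y, k):
    """Greedy orthogonal matching pursuit: look for a k-sparse a with D a ≈ y.
    A heuristic stand-in for the exact l0 search whose answer Lemma 2 says is unique."""
    support, residual = [], y.copy()
    for _ in range(k):
        support.append(int(np.argmax(np.abs(D.T @ residual))))
        coef, *_ = np.linalg.lstsq(D[:, support], y, rcond=None)
        residual = y - D[:, support] @ coef
    a = np.zeros(D.shape[1])
    a[support] = coef
    return a

rng = np.random.default_rng(0)
I, L, F, na = 60, 24, 4, 3                       # L >= 2 * na, as the lemma requires
U = rng.standard_normal((I, L))

A = np.zeros((I, F))                             # columns of A have at most na nonzeros
for f in range(F):
    A[rng.choice(I, na, replace=False), f] = rng.standard_normal(na)

A_tilde = U.T @ A                                # the compressed factor Lemma 2 starts from
A_hat = np.column_stack([omp(U.T, A_tilde[:, f], na) for f in range(F)])
print("max entrywise recovery error:", np.abs(A_hat - A).max())
```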


Page 12:

Complexity

First fitting PARAFAC in the compressed space and then recovering the sparse A, B, C from the fitted compressed factors entails complexity O(LMNF + (I^3.5 + J^3.5 + K^3.5)F).
Using sparsity first and then fitting PARAFAC in the raw space entails complexity O(IJKF + (IJK)^3.5) - the difference is huge (see the back-of-the-envelope numbers below).
Also note that the proposed approach does not require computations in the uncompressed data domain, which is important for big data that do not fit in memory for processing.
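Purely illustrative arithmetic (the problem sizes below are assumptions, not from the slides) comparing the two estimates:

```python
# Illustrative sizes only; the point is the orders of magnitude.
I = J = K = 1000
L = M = N = 50
F = 10

compressed_first = L * M * N * F + (I**3.5 + J**3.5 + K**3.5) * F   # fit in compressed space, then per-mode sparse recovery
raw_first = I * J * K * F + (I * J * K) ** 3.5                      # sparse recovery on the raw vectorized data first
print(f"compressed-domain approach: ~{compressed_first:.2e}")
print(f"raw-domain approach:        ~{raw_first:.2e}")
```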


Page 13:

Further compression - down to O(√F) in 2/3 modes

Sidiropoulos & Kyrillidis, IEEE SPL, Oct. 2012.
Assume that the columns of A, B, C are sparse, and let n_a (n_b, n_c) be an upper bound on the number of nonzero elements per column of A (respectively B, C).
Let the mode-compression matrices U (I × L, L ≤ I), V (J × M, M ≤ J), and W (K × N, N ≤ K) be randomly drawn from an absolutely continuous distribution with respect to the Lebesgue measure in R^{IL}, R^{JM}, and R^{KN}, respectively.
If

r_A = r_B = r_C = F,

L(L − 1)M(M − 1) ≥ 2F(F − 1), N ≥ F, and

L ≥ 2n_a, M ≥ 2n_b, N ≥ 2n_c,

then the original factor loadings A, B, C are almost surely identifiable from the compressed data up to a common column permutation and scaling.
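A tiny sketch (illustrative, not from the paper) showing that with L = M the condition L(L − 1)M(M − 1) ≥ 2F(F − 1) is met for L on the order of √F, which is where the O(√F) in the title comes from:

```python
import math

def smallest_symmetric_L(F):
    """Smallest L = M with L(L-1) * M(M-1) >= 2F(F-1)."""
    L = 2
    while (L * (L - 1)) ** 2 < 2 * F * (F - 1):
        L += 1
    return L

for F in (10, 100, 1000, 10000):
    L = smallest_symmetric_L(F)
    print(f"F = {F:6d}: smallest L = M is {L:4d}  (≈ {L / math.sqrt(F):.2f} · sqrt(F))")
```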


Page 14:

Proof: Lemma 3 + results on a.s. ID of PARAFAC

Lemma 3: Consider Ã = U^T A, where A (I × F) is deterministic, tall/square (I ≥ F) and full column rank (r_A = F), and the elements of U (I × L) are i.i.d. Gaussian zero-mean, unit-variance random variables. Then the distribution of Ã is nonsingular multivariate Gaussian.
From [Stegeman, ten Berge, De Lathauwer 2006] (see also [Jiang, Sidiropoulos 2004]), we know that PARAFAC is almost surely identifiable if the loading matrices A, B are randomly drawn from an absolutely continuous distribution with respect to the Lebesgue measure in R^{(L+M)F}, C is full column rank, and L(L − 1)M(M − 1) ≥ 2F(F − 1).


Page 15:

Generalization to higher-way arrays

Theorem 3: Let x = (A_1 ⊙ · · · ⊙ A_δ) 1 ∈ R^{∏_{d=1}^δ I_d}, where A_d is I_d × F, and consider compressing it to

y = (U_1^T ⊗ · · · ⊗ U_δ^T) x = ((U_1^T A_1) ⊙ · · · ⊙ (U_δ^T A_δ)) 1 = (Ã_1 ⊙ · · · ⊙ Ã_δ) 1 ∈ R^{∏_{d=1}^δ L_d},

where the mode-compression matrices U_d (I_d × L_d, L_d ≤ I_d) are randomly drawn from an absolutely continuous distribution with respect to the Lebesgue measure in R^{I_d L_d}. Assume that the columns of A_d are sparse, and let n_d be an upper bound on the number of nonzero elements per column of A_d, for each d ∈ {1, · · · , δ}. If

∑_{d=1}^δ min(L_d, k_{A_d}) ≥ 2F + δ − 1, and L_d ≥ 2n_d, ∀ d ∈ {1, · · · , δ},

then the original factor loadings {A_d}_{d=1}^δ are almost surely identifiable from the compressed data y up to a common column permutation and scaling.
Various additional results are possible, e.g., a generalization of Theorem 2.
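A tiny helper (not from the paper; the Kruskal ranks and sparsity bounds are assumed given) that checks Theorem 3's conditions for a δ-way compression plan:

```python
def theorem3_conditions(F, L, k, n):
    """Check sum_d min(L_d, k_{A_d}) >= 2F + delta - 1 and L_d >= 2 n_d for all d."""
    delta = len(L)
    kruskal_ok = sum(min(Ld, kd) for Ld, kd in zip(L, k)) >= 2 * F + delta - 1
    sparsity_ok = all(Ld >= 2 * nd for Ld, nd in zip(L, n))
    return kruskal_ok and sparsity_ok

# illustrative 4-way example: F = 5, Kruskal rank 5 in every mode, 3-sparse columns
print(theorem3_conditions(F=5, L=[6, 6, 6, 6], k=[5, 5, 5, 5], n=[3, 3, 3, 3]))  # True
print(theorem3_conditions(F=5, L=[4, 4, 4, 4], k=[5, 5, 5, 5], n=[3, 3, 3, 3]))  # False: L_d < 2 n_d
```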


Page 16:

PARCUBE: Parallel sampling-based tensor decomposition

Papalexakis, Faloutsos, Sidiropoulos, ECML-PKDD 2012


Challenge: different permutations, scaling.
'Anchor' in a small common sample.
Hadoop implementation → 100-fold improvement (size/speedup).


Page 17:

Road ahead

Important first steps/results pave the way, but we have only scratched the surface.
Randomized tensor algorithms based on generalized sampling.
Other models?
Rate-distortion theory for big tensor data compression?
Statistically and computationally efficient algorithms - a big open issue.
Distributed computation - not all data reside in one place - Hadoop / multicore.
Statistical inference for big tensors.
Applications.


Page 18:

Switch gears: Large-scale Conjoint Analysis

Preference Measurement (PM): Goals
Predict responses of individuals based on previously observed preference data (ratings, choices, buying patterns, etc.).
Reveal the utility function - marketing sensitivity.

PM workhorse: Conjoint Analysis (CA)
Long history in marketing, retailing, health care, ...
Traditionally offline, assuming rational individuals and responses that regress upon a few variables.
No longer true for modern large-scale PM systems, especially web-based ones.


Page 19:

Conjoint Analysis

An individual rates J profiles {p_i}_{i=1}^J, e.g., p_i = [screen size, MP, GB, price]^T.

w is the unknown vector of partworths.

Given choice data {d_i, y_i}_{i=1}^N, d_i ∈ R^p, y_i ∈ {−1, +1}, d_i := p_i^(1) − p_i^(2), assumed to obey y_i = sign(d_i^T w + e_i), ∀ i.

Estimate the partworth vector w.
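A minimal simulation sketch (not from the slides; the attribute count, sample size, and noise level are illustrative) of the choice model above, plus a crude least-squares estimate of the partworth direction:

```python
import numpy as np

rng = np.random.default_rng(0)
p, N, sigma = 4, 1000, 0.1                       # e.g. attributes [screen size, MP, GB, price]
w = rng.standard_normal(p)                       # unknown partworths

profiles_1 = rng.standard_normal((N, p))         # first profile in each paired comparison
profiles_2 = rng.standard_normal((N, p))         # second profile
D = profiles_1 - profiles_2                      # d_i = p_i^(1) - p_i^(2)
y = np.sign(D @ w + sigma * rng.standard_normal(N))   # observed choices in {-1, +1}

# crude estimate: least squares on the signs recovers the partworth direction (up to scale)
w_ls, *_ = np.linalg.lstsq(D, y, rcond=None)
print("cosine similarity with true w:", w_ls @ w / (np.linalg.norm(w_ls) * np.linalg.norm(w)))
```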


Page 20:

Robust statistical choice-based CA

Preference data can be inconsistent (unmodeled dynamics, when seeking the w of 'population' averages; ... but also spammers, fraudsters, pranksters!).
Introduce gross errors {o_i}_{i=1}^N in the response model (before the sign).
Sensible to assume that the gross errors are sparse.
The number of attributes p in w can be very large (e.g., cellphones), but only a few features matter to any given individual.
Can we exploit these two pieces of prior information in the CA context?
Sparse CA model formulation:

y_i = sign(d_i^T w + o_i + e_i), i = 1, · · · , N,

with constraints ||w||_0 ≤ κ_w and ||o||_0 ≤ κ_o.
Small 'typical' errors e_i modeled as random i.i.d. N(0, σ²).

Tsakonas, Jalden, Sidiropoulos, Ottersten, 2012


Page 21:

MLE

The log-likelihood l(w, o) can be shown to be

l(w, o) = log p_y(w, o) = ∑_{i=1}^N log Φ( (y_i d_i^T w + y_i o_i) / σ ),

to be maximized over ||w||_0 ≤ κ_w and ||o||_0 ≤ κ_o.
Φ(·) is the Gaussian c.d.f., so the ML metric is a concave function.
Cardinality constraints are hard; relaxing them to ℓ1-norm constraints yields a convex relaxation.
Identifiability? Best achievable MSE performance (CRB)?
It turns out that sparsity plays a key role in both.
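A rough sketch (not the authors' implementation) of the ℓ1-relaxed probit MLE: the cardinality constraints are replaced by ℓ1 penalties, and a generic quasi-Newton solver is used even though the penalty is nonsmooth; all sizes, penalty weights, and the solver choice are illustrative assumptions:

```python
import numpy as np
from scipy.optimize import minimize
from scipy.stats import norm

rng = np.random.default_rng(0)
N, p, sigma = 500, 30, 0.1
w_true = np.zeros(p); w_true[rng.choice(p, 4, replace=False)] = rng.standard_normal(4)
o_true = np.zeros(N); o_true[rng.choice(N, N // 25, replace=False)] = 5.0   # sparse gross errors
D = rng.standard_normal((N, p))
y = np.sign(D @ w_true + o_true + sigma * rng.standard_normal(N))

lam_w, lam_o = 1.0, 1.0          # l1 weights standing in for ||w||_0 <= kappa_w, ||o||_0 <= kappa_o

def neg_penalized_loglik(z):
    w, o = z[:p], z[p:]
    margins = y * (D @ w + o) / sigma
    return -norm.logcdf(margins).sum() + lam_w * np.abs(w).sum() + lam_o * np.abs(o).sum()

# L-BFGS-B ignores the nondifferentiability of the l1 terms at zero; good enough for a sketch,
# a proper implementation would use a convex solver or a proximal method.
res = minimize(neg_penalized_loglik, np.zeros(p + N), method="L-BFGS-B")
w_hat = res.x[:p]
print("largest estimated partworths:", np.argsort(-np.abs(w_hat))[:4],
      "| true support:", np.flatnonzero(w_true))
```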


Page 22:

Algorithms for Big Data

Huge volumes of preference data; cannot be analyzed in real time.
Decentralized collection and/or storage of datasets.
Distributed CA algorithms are highly desirable:

Solve large-scale problems
Privacy / confidentiality
Fault tolerance

The relaxed ML problem is of the form

minimize_ξ ∑_{i=1}^M f_i(ξ),

and we wish to 'split' with respect to the training examples only.
Many distributed optimization techniques can be used; one appealing (and recently popular) method is the ADMoM (alternating direction method of multipliers).
We developed a fully decentralized MLE for our CA formulation based on ADMoM (a generic consensus-style sketch follows below).
Tsakonas, Jalden, Sidiropoulos, Ottersten, 2012.
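A minimal consensus-ADMM sketch (not the authors' implementation): the sum of local losses is split across M workers, each holding a local copy ξ_i forced to agree with a consensus variable z. For brevity each local loss is a stand-in least-squares term so the local update has a closed form; in the CA problem each f_i would be the relaxed negative log-likelihood of that worker's choice data. All names and sizes are illustrative.

```python
import numpy as np

rng = np.random.default_rng(1)
M, p, n_local, rho = 4, 10, 50, 1.0
xi_true = rng.standard_normal(p)
workers = []
for _ in range(M):                                # each worker holds its own slice of the data
    Di = rng.standard_normal((n_local, p))
    workers.append((Di, Di @ xi_true + 0.01 * rng.standard_normal(n_local)))

xi = np.zeros((M, p))    # local primal variables (one parameter copy per worker)
u = np.zeros((M, p))     # scaled dual variables
z = np.zeros(p)          # global consensus variable

for _ in range(100):
    for i, (Di, yi) in enumerate(workers):
        # local update: argmin 0.5*||Di xi - yi||^2 + (rho/2)*||xi - z + u_i||^2 (closed form here)
        xi[i] = np.linalg.solve(Di.T @ Di + rho * np.eye(p), Di.T @ yi + rho * (z - u[i]))
    z = (xi + u).mean(axis=0)        # consensus (averaging) step
    u += xi - z                      # dual update enforcing xi_i = z

print("consensus error:", np.linalg.norm(z - xi_true))
```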


Page 23:

Experiments

[Plot: RMSE versus number of samples (500-3000); curves: CRLB and MLE.]

Figure: RMSE comparison of the MLE versus the CRLB for different sample sizes N, when outliers are not present in the data.


Page 24:

Experiments

[Plot: RMSE (log scale) versus number of samples (500-3000); curves: CRLB, MLE with outlier detection, MLE without outlier detection.]

Figure: RMSE comparison of the MLE versus the CRLB for different numbers of samples N, when outliers are present in the data [outlier percentage 4%].


