Beyond matrices: statistical method for higher-order tensors and its application. I

Miaoyan Wang
Department of Statistics, University of Wisconsin-Madison

Fudan University, July 2019
Page 1: Beyond matrices: statistical method for higher-order tensors (slides: pages.stat.wisc.edu/~miaoyan/tensor_tutorial_fudan.pdf)

Beyond matrices: statistical method for higher-order tensors and its application. I

Miaoyan Wang

Department of Statistics, University of Wisconsin-Madison

Fudan University, July 2019

Page 2

Introduction Population inference Tensor spectral norm Tensor decomposition Conclusions

Introduction: session aim

This session focuses on statistical machine learning methods for tensor and matrix analysis. We aim to cover:

I Spectral theory for higher-order tensors

I PCA and population structure inference

I Structured tensor decomposition and its statistical properties

I Application of tensor decomposition to genetics

I Low-rank tensor estimation from binary data

I Multiway clustering via tensor block models

Page 3

Introduction: about me

I Assistant professor in Statistics at University of Wisconsin-Madison, USA

I Past experiences:

I Postdoc in Computer Science at UC Berkeley
I Simons Math + Biology visitor at University of Pennsylvania
I PhD in Statistics at UChicago
I B.S. in Pure and Applied Mathematics, Fudan University

Page 4

My research

Statistical machine learning:

I structured tensor decomposition, optimization

Numerical analysis:

I functional properties of higher-order tensors

Biological applications:

I statistical genetics, complex traits, gene expression analysis

Page 5

Introduction: resources

The class site is http://www.stat.wisc.edu/~miaoyan/tensor.html. It provides:

I PDF copies of slides

I Datasets needed for exercises

I Links to software packages

Page 6

A success story: PCA of Europeans

[Figure: genotype matrix with individuals as rows and SNPs as columns; 1,389 samples, ~200k SNPs. Novembre et al. (2008)]

Page 7

Matrix methods are powerful, however...

[Figure: left, matrix PCA; right, principal components of kurtosis. All Gaussian except points 17 and 39. Figure credit: Jason Morton and Lek-Heng Lim (2009, 2015 Rasmus Bro).]

Page 8

What is a tensor?

I Tensors are generalizations of vectors and matrices:

I An order-k tensor A = [a_{i1...ik}] ∈ F^{d1×···×dk} is a hypermatrix with dimensions (d1, . . . , dk) and entries a_{i1...ik} ∈ F.

I This talk will focus on F = R or {0, 1}.
I We focus on tensors of order 3 or greater, also known as higher-order tensors.

Page 9

Tensors in statistical modeling

“Tensors are the new matrices” that tie together a wide range of areas:

I Longitudinal social network data {Y_t : t = 1, . . . , n}
I Spatio-temporal transcriptome data

I Joint probability table of a set of variables P(X1, X2, X3)

I Higher-order moments in topic models

I Markov models for the phylogenetic tree K1,3

Liu, Yuan, and Zhao 2017; Hoff 2015; Montanari and Richard 2014; Anandkumar et al. 2014; Mossel et al. 2004; McCullagh 1987.

Page 10

Tensors in genomics

I Many biomedical datasets come naturally in a multiway form.

I Multi-tissue, multi-individual gene expression measures can be organized as a multiway dataset A = [a_{git}] ∈ R^{nG×nI×nT}.

[Figure: analysis workflow: normalization, imputation, multiway clustering.]

To identify subsets of genes that are similarly expressed within subsets of individuals and tissues, we seek local blocks in the expression tensor.

Page 11

Tensors in scientific computing

Tensor algebra software speeds big-data analysis 100-fold (Science Daily).

I Deep learning frameworks: tensorflow / torch / theano

Page 12

Talk outline

Prohibitive Computational Complexity

Most higher-order tensor problems are NP-hard [Hillar & Lim, 2013].

Topics I will address:

I PCA and population structure inference

I Spectral theory for higher-order tensors

I Structured tensor decomposition and its statistical properties

Page 13

A success story: PCA of Europeans

[Figure: genotype matrix with individuals as rows and SNPs as columns; 1,389 samples, ~200k SNPs. Novembre et al. (2008)]

Page 14

Background: Population structure

I Many organisms (humans, Arabidopsis) spread across the world many thousands of years ago.

I Migration and genetic drift led to genetic diversity between groups.

Page 15

Population structure inferences

I Inference on genetic ancestry differences among individuals from different populations, or population structure, has been motivated by a variety of applications:

I population genetics
I genetic association studies
I personalized medicine
I forensics

I Advances in genotyping technologies have greatly facilitated the investigation of genetic diversity at remarkably high levels of detail.

I A variety of methods have been proposed for identifying genetic ancestry differences among individuals in a sample using high-density genome-screen data.

Page 16

Inferring Population Structure with PCA

I Principal components analysis (PCA) is the most widely used approach for identifying and adjusting for ancestry differences among sample individuals.

I PCA applied to genotype data yields principal components (PCs) that explain differences among the sample individuals in the genetic data.

I The top PCs are viewed as continuous axes of variation that reflect genetic variation due to ancestry in the sample.

I PCA is an unsupervised learning tool for dimension reduction in multivariate analysis.

Page 17

Data structure

I Sample of n individuals, indexed by i = 1, 2, . . . , n.

I Genome-screen data on m autosomal genetic markers, indexed by ℓ = 1, 2, . . . , m.

I At each marker, for each individual, we have a genotype value x_{iℓ}.

I Here we consider bi-allelic SNP data, so x_{iℓ} takes values 0, 1, or 2, corresponding to the number of reference alleles.

I We center and standardize these genotype values:

z_{iℓ} = (x_{iℓ} − 2p_ℓ) / √(2p_ℓ(1 − p_ℓ)),

where p_ℓ is an estimate of the reference allele frequency for marker ℓ.
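A minimal numpy sketch of this standardization, using a small hypothetical genotype matrix (the data and dimensions below are made up for illustration):

```python
import numpy as np

# Hypothetical genotype matrix: n = 4 individuals x m = 5 SNPs,
# entries in {0, 1, 2} = number of reference alleles.
X = np.array([[0, 1, 2, 0, 1],
              [1, 1, 0, 2, 0],
              [2, 0, 1, 1, 1],
              [1, 2, 1, 0, 2]], dtype=float)

# Estimate the reference allele frequency p_l at each marker:
# each individual carries 2 alleles, so p_l = mean(x_il) / 2.
p = X.mean(axis=0) / 2.0

# Center by 2*p_l and scale by sqrt(2*p_l*(1 - p_l)).
Z = (X - 2 * p) / np.sqrt(2 * p * (1 - p))

print(Z.shape)         # (4, 5)
print(Z.mean(axis=0))  # each column has mean 0 after centering
```

Because p_ℓ is estimated from the same sample, each standardized column has mean exactly zero.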

Page 18

Genetic Correlation Estimation

I Create an n × m matrix Z of centered and standardized genotype values, and from this, a genetic correlation matrix (GRM):

Φ = (1/m) Z Zᵀ

I Φ_{ij} is an estimate of the genome-wide average genetic correlation between individuals i and j.

I PCA relies on individuals from the same ancestral population being more genetically correlated than individuals from different ancestral populations.

Page 19

Standard Principal Components Analysis (PCA)

I PCA is performed by obtaining the eigendecomposition of Φ.

I Top eigenvectors (PCs) are used as surrogates for population structure.

I This identifies orthogonal axes of variation, i.e. linear combinations of SNPs, that best explain the genotypic variability among the n sample individuals.

I Individuals with “similar” values for a particular top principal component tend to have “similar” ancestry.
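The GRM-plus-eigendecomposition pipeline above can be sketched in a few lines of numpy; here Z is a randomly generated stand-in for a real standardized genotype matrix:

```python
import numpy as np

rng = np.random.default_rng(0)

# Stand-in standardized genotype matrix Z (n individuals x m SNPs).
n, m = 50, 500
Z = rng.standard_normal((n, m))

# Genetic correlation matrix (GRM): Phi = (1/m) Z Z^T, an n x n matrix.
Phi = Z @ Z.T / m

# Eigendecomposition; eigh returns eigenvalues in ascending order.
evals, evecs = np.linalg.eigh(Phi)

# Top principal components = eigenvectors with the largest eigenvalues;
# each individual gets one coordinate per PC.
pcs = evecs[:, ::-1][:, :2]
print(pcs.shape)   # (50, 2)
```

With real data, plotting the two columns of `pcs` against each other is exactly the kind of map shown in the Novembre et al. example.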

Page 20

PCA of Europeans

An application of principal components to genetic data from European samples showed that the first two principal components computed using 200K SNPs could map their country of origin accurately.

[Figure: genotype matrix with individuals as rows and SNPs as columns; 1,389 samples, ~200k SNPs. Novembre et al. (2008)]

Page 21

Population structure among Arabidopsis (host) sample

An application of PCA to genetic data from the 1001 Arabidopsis project largely captures the geographical origins of the Arabidopsis accessions:

I US vs. European
I Smaller regional groups among European accessions

[Figure: PCA scatter plots of accessions. Panel A: PC 1 vs. PC 2; Panel B: PC 2 vs. PC 3.]

Page 22

Population structure among pathogen sample

We develop a method for genetic correlation matrix (GRM) estimation using both mutation and deletion polymorphisms [PNAS, Vol. 115 (24), 2018].

I GRM can be used for clustering analysis.
I The Xanthomonas sample exhibits strong population stratification.

Page 23

HapMap ASW and MXL Ancestry

I Genome-screen data on 150,872 autosomal SNPs was used to estimate ancestry.

I Genome-wide ancestry proportions of every individual were estimated using the ADMIXTURE software (Alexander et al., 2009).

I A supervised analysis was conducted using genotype data from the following reference population samples for three “ancestral” populations:

I HapMap YRI for West African ancestry
I HapMap CEU samples for northern and western European ancestry
I HGDP Native American samples for Native American ancestry

Page 24

Conomos, Matthew P et al. Genetic epidemiology 39.4 (2015): 276-293.

Page 25

Figure source: SISG 2017. Timothy Thornton and Michael Wu.

Page 26

Table source: SISG 2017. Timothy Thornton and Michael Wu.

Page 27

Outline

PCA and population structure inference

Spectral theory for higher-order tensors

Structured tensor decomposition and its statistical properties

Page 28

Tensor spectral norm

I An order-k tensor can be viewed as a k-linear functional A : R^{d1} × · · · × R^{dk} → R given by

⟨A, x_1 ⊗ · · · ⊗ x_k⟩ = Σ_{i1,...,ik} a_{i1...ik} x^{(1)}_{i1} · · · x^{(k)}_{ik},

where x_1 ⊗ · · · ⊗ x_k is a rank-1 tensor and x_n = (x^{(n)}_1, . . . , x^{(n)}_{dn})ᵀ ∈ R^{dn}, n ∈ [k].

I Spectral norm. Determine the value of

‖A‖_2 = max_{‖x_i‖_2 = 1, x_i ∈ R^{di}} ⟨A, x_1 ⊗ · · · ⊗ x_k⟩.

I Finding ‖A‖2 is closely related to the best rank-1 tensor approximation.

Key question

Can we provide polynomial-time computable bounds for ‖A‖2?

Page 29

Unfolding

I Matricization. Rearrange the slices of the tensor in different modes into a matrix. For an order-3 tensor A:

Unfold_π(A) ∈ R^{d2 × d1 d3} for π = {{2}, {1, 3}}
Unfold_π(A) ∈ R^{d1 × d2 d3} for π = {{1}, {2, 3}}
Unfold_π(A) ∈ R^{d3 × d1 d2} for π = {{3}, {1, 2}}

I General unfolding. The set of all possible unfoldings of an order-k tensor is in one-to-one correspondence with the set P_[k] of all partitions of [k] := {1, . . . , k}.

I For π = {B_1, . . . , B_ℓ} ∈ P_[k], Unfold_π(A) is obtained by combining the modes in each block B_n into a single mode.
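Unfolding is a permute-then-reshape operation; a numpy sketch with a hypothetical `unfold` helper (the function name and toy tensor are mine, not from the slides):

```python
import numpy as np

# Toy order-3 tensor with dimensions (d1, d2, d3) = (2, 3, 4).
A = np.arange(24, dtype=float).reshape(2, 3, 4)

def unfold(tensor, row_modes):
    """Matricization: modes in `row_modes` index rows; the rest index columns."""
    k = tensor.ndim
    col_modes = [m for m in range(k) if m not in row_modes]
    perm = list(row_modes) + col_modes
    rows = int(np.prod([tensor.shape[m] for m in row_modes]))
    return np.transpose(tensor, perm).reshape(rows, -1)

print(unfold(A, [1]).shape)     # pi = {{2}, {1,3}}: a 3 x 8 matrix
print(unfold(A, [0]).shape)     # pi = {{1}, {2,3}}: a 2 x 12 matrix
print(unfold(A, [0, 1]).shape)  # pi = {{1,2}, {3}}: a 6 x 4 matrix
```

Every unfolding is a rearrangement of the same entries, so the Frobenius norm is unchanged; it is the spectral norm that varies with π.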


Page 31

Partition lattice

Partial order on P_[k]: π ≤ π′ if π is a refinement of π′.

Example: k = 4. [Figure: three example partitions π1, π2, π3 of {1, 2, 3, 4}.] Here π3 ≤ π1, while π1 and π2 are not comparable.

I The set of all partitions of [k] with two blocks ↔ all matricizations.

Page 32

Norm inequalities between two arbitrary unfoldings

Theorem (W. et al., 2017a)

Let A ∈ R^{d1×···×dk} be an arbitrary order-k tensor, and π1, π2 any two partitions in P_[k]. Define dim(A) = Π_{n=1}^{k} d_n. Then

√(dim_A(π1, π2) / dim(A)) ‖Unfold_{π1}(A)‖_2 ≤ ‖Unfold_{π2}(A)‖_2 ≤ √(dim(A) / dim_A(π2, π1)) ‖Unfold_{π1}(A)‖_2.

Given A ∈ R^{d1×···×dk}, we define the map dim_A : P_[k] × P_[k] → N_+ as

dim_A(π1, π2) = Π_{B ∈ π1} [max_{B′ ∈ π2} (Π_{n ∈ B ∩ B′} d_n)], where π1, π2 ∈ P_[k].

Wang et al., Linear Algebra and its Applications, Vol. 520 (2017), 44-66.

Page 33

Corollaries

Bottom-up inequality

Let A ∈ R^{d×···×d} be an order-k tensor, and let P^ℓ_[k] denote the set of all partitions of [k] with ℓ blocks. Then, for all 1 ≤ ℓ ≤ k,

(1 / d^{(k−ℓ)/2}) max_{π ∈ P^ℓ_[k]} ‖Unfold_π(A)‖_2 ≤ ‖A‖_2 ≤ min_{π ∈ P^ℓ_[k]} ‖Unfold_π(A)‖_2.
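The ℓ = 2 case can be checked numerically: ‖A‖_2 itself is NP-hard to compute, but both bounds are ordinary matrix spectral norms of matricizations. A numpy sketch (random tensor, helper function mine):

```python
import numpy as np

rng = np.random.default_rng(1)
d, k = 4, 3
A = rng.standard_normal((d, d, d))

def unfold(tensor, row_modes):
    # Moves `row_modes` to the front and flattens them into the row index.
    nd = tensor.ndim
    perm = list(row_modes) + [m for m in range(nd) if m not in row_modes]
    rows = int(np.prod([tensor.shape[m] for m in row_modes]))
    return np.transpose(tensor, perm).reshape(rows, -1)

# Matrix spectral norms of the three matricizations (l = 2 blocks).
norms = [np.linalg.norm(unfold(A, [m]), ord=2) for m in range(k)]

lower = max(norms) / d ** ((k - 2) / 2)  # lower bound on ||A||_2
upper = min(norms)                       # upper bound on ||A||_2
print(lower, upper)                      # lower <= ||A||_2 <= upper
```

Since both quantities sandwich ‖A‖_2, the lower bound never exceeds the upper bound, which the sketch confirms on random inputs.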

Page 34

Corollaries

Frobenius norm vs. spectral norm

‖A‖_F = max_{π ∈ P_[k]} ‖Unfold_π(A)‖_2,   ‖A‖_2 = min_{π ∈ P_[k]} ‖Unfold_π(A)‖_2,

‖A‖_F ≤ [(Π_n d_n) / max_{n ∈ [k]} d_n]^{1/2} ‖A‖_2.

This bound improves over the recent result of Friedland and Lim [Lemma 5.1, 2016], namely ‖A‖_F ≤ (Π_n d_n)^{1/2} ‖A‖_2.

Page 35

Orthogonal decomposability

I In general, the unfolding operation may change the spectral norm by up to a poly(d) factor.

I How about specially-structured tensors?

Page 36

Definition (Orthogonally decomposable)

A tensor A ∈ R^{d1×···×dk} is called orthogonally decomposable, or 0_[k]-OD, if it admits the decomposition

A = λ_1 a^{(1)}_1 ⊗ a^{(1)}_2 ⊗ · · · ⊗ a^{(1)}_k + · · · + λ_r a^{(r)}_1 ⊗ a^{(r)}_2 ⊗ · · · ⊗ a^{(r)}_k,

where the set of vectors {a^{(n)}_i} satisfies

⟨a^{(n)}_i, a^{(m)}_i⟩ = δ_{nm}, for all n, m ∈ [r].

Page 37

Definition (π-orthogonally decomposable)

A tensor A ∈ R^{d1×···×dk} is called π-orthogonally decomposable, or π-OD, if it admits the decomposition

A = λ_1 a^{(1)}_1 ⊗ a^{(1)}_2 ⊗ · · · ⊗ a^{(1)}_k + · · · + λ_r a^{(r)}_1 ⊗ a^{(r)}_2 ⊗ · · · ⊗ a^{(r)}_k,

with the factors grouped into the blocks B_1, . . . , B_ℓ of π, and where the set of vectors {a^{(n)}_i} satisfies

⟨⊗_{i ∈ B} a^{(n)}_i, ⊗_{i ∈ B} a^{(m)}_i⟩ = δ_{nm}, for all B ∈ π and all n, m ∈ [r].

Page 38

π-OD tensors and norm-preserving cones

Suppose A is a π-OD tensor and define c := ‖A‖_2.

I ‖Unfold_τ(A)‖_2 = c for all τ ∈ C_π := {τ : τ ≥ π} ∪ {τ : τ ≤ π} \ 1_[k].

I π-OD implies π′-OD for all π′ ≥ π. Hence ‖Unfold_τ(A)‖_2 = c for all τ ∈ C_{π1} ∪ · · · ∪ C_{πs}, where π1, . . . , πs are the matricizations obtained by merging blocks of π.

I This yields sharper bounds for spectral norms.
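For a concrete check of this norm preservation in the simplest symmetric case, one can build a symmetric orthogonally decomposable tensor and verify that all three matricizations share the spectral norm max_i |λ_i| (a numpy sketch with randomly chosen orthonormal vectors):

```python
import numpy as np

rng = np.random.default_rng(2)
d, r = 5, 3
lam = np.array([3.0, 2.0, 1.0])

# Orthonormal vectors u_1, ..., u_r via QR.
U, _ = np.linalg.qr(rng.standard_normal((d, r)))

# Orthogonally decomposable order-3 tensor A = sum_i lam_i u_i (x) u_i (x) u_i.
A = np.einsum('i,ai,bi,ci->abc', lam, U, U, U)

def unfold(tensor, mode):
    return np.moveaxis(tensor, mode, 0).reshape(tensor.shape[mode], -1)

# Every matricization has the same spectral norm: max |lam_i| = 3.
for mode in range(3):
    s = np.linalg.norm(unfold(A, mode), ord=2)
    assert np.isclose(s, lam.max())
print("all matricization norms equal", lam.max())
```

The reason: each unfolding is a sum of rank-1 terms λ_i u_i Vec(u_i ⊗ u_i)ᵀ with orthonormal left and right factors, i.e. already an SVD, so its spectral norm is max_i |λ_i| regardless of the mode.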


Page 43

Outline

I PCA and population structure inference

I Spectral theory for higher-order tensors

I Structured tensor decomposition and its statistical properties

Page 44

Review of matrix eigendecomposition

Matrix perturbation theorem (Davis-Kahan 1970)

Let A and E be symmetric matrices, and Ā = A + E. Let u_i, ū_i denote the i-th eigenvectors of A and Ā, respectively. Then

sin Θ(u_i, ū_i) ≤ 2‖E‖_2 / min_{j ≠ i} |λ_j − λ_i|.

[Figure: schematic of a matrix decomposed into a sum of rank-1 terms plus noise.]

I Does there exist a tensor analogue of matrix eigendecomposition? How about perturbation analysis?


Page 46

Symmetric tensors

Definition (Symmetric tensors)

A tensor A = [a_{i1...ik}] ∈ R^{d1×···×dk} is called symmetric if d1 = · · · = dk and

a_{i1 i2 ... ik} = a_{i_{σ(1)} i_{σ(2)} ... i_{σ(k)}}

for all permutations σ of [k].

I By the spectral theorem, every symmetric matrix A admits an eigendecomposition

A = λ_1 u_1^{⊗2} + λ_2 u_2^{⊗2} + · · · + λ_r u_r^{⊗2}.

I This does not hold for general symmetric tensors.

Page 47

SOD tensors

I A tensor A is called symmetric and orthogonally decomposable (SOD) if

A = Σ_{i=1}^{r} λ_i u_i^{⊗k},

where {u_i} are orthonormal vectors in R^d and {λ_i} are non-zero scalars.

I For example, k = 3 and r = 3: [Figure: tensor written as a sum of three rank-1 terms.]

I Kruskal's theorem implies that {u_i} is unique even in the case of degenerate λ_i's.

I Eigen-components of a 3rd cumulant tensor are closely related to parameter estimation in latent variable models [Anandkumar et al. 2014].

Page 48

Tensor decomposition

I Nearly SOD tensors:

A = Σ_{i=1}^{r} λ_i u_i^{⊗k} + E,

where E ∈ R^{d×···×d} is a symmetric but otherwise arbitrary tensor with ‖E‖_2 ≤ ε.

I For example, k = 3 and r = 3: [Figure: sum of three rank-1 terms plus a noise tensor.]

Key question

Can we recover the vectors {ui} from the noisy observation A?

Page 49

Decomposition of SOD tensors: noiseless case

I The structure of A = Σ_{i=1}^{r} λ_i u_i^{⊗k} implies a common eigenspace for all matrix slices. [Figure: matrix slices sharing a common set of eigenvectors.]

I Is it possible to recover {u_i}_{i ∈ [r]} using the left singular vectors of the 1-mode unfolding, A_{(1)(2...k)}?


Page 51

Decomposition of SOD tensors: noiseless case

I The structure of A = Σ_{i=1}^{r} λ_i u_i^{⊗k} implies that the one-mode unfolding is

A_{(1)(2...k)} = Σ_{i=1}^{r} λ_i u_i Vec(u_i^{⊗(k−1)})ᵀ.

[Figure: unfolding of the rank-1 terms.]

I Is it possible to recover {u_i}_{i ∈ [r]} using the left singular vectors of the 1-mode unfolding, A_{(1)(2...k)}?
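In the noiseless case with distinct |λ_i|, that unfolding is literally an SVD, so the u_i can be read off the left singular vectors up to sign. A numpy sketch (random orthonormal u_i of my choosing):

```python
import numpy as np

rng = np.random.default_rng(3)
d, r = 6, 3
lam = np.array([3.0, 2.0, 1.5])

# Orthonormal u_1, ..., u_r and the SOD tensor A = sum_i lam_i u_i^{(x)3}.
U, _ = np.linalg.qr(rng.standard_normal((d, r)))
A = np.einsum('i,ai,bi,ci->abc', lam, U, U, U)

# One-mode unfolding A_(1)(23): a d x d^2 matrix.
A1 = A.reshape(d, d * d)

# Because the Vec(u_i (x) u_i) are orthonormal, this is exactly an SVD:
# singular values |lam_i|, left singular vectors u_i up to sign.
u_left, s, _ = np.linalg.svd(A1, full_matrices=False)
for i in range(r):
    overlap = abs(u_left[:, i] @ U[:, i])
    assert np.isclose(overlap, 1.0)   # u_i recovered up to sign
print(s[:r])
```

With degenerate λ_i the singular subspaces are no longer unique, which is exactly the caveat the next slides address.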

Page 52

Matrix vs. tensor decompositions


Caveats:

I A rank-r matrix with r > 1 can be decomposed in multiple ways as a sum of r outer-product (rank-1) terms in the case of degenerate λ_i's.

I Kruskal's theorem guarantees that the set of vectors {u_i}_{i∈[r]} of an SOD tensor is unique up to signs even when some λ_i's are degenerate.


Two-mode HOSVD via rank-1 matrix pursuit

Key idea: Instead of A(1)(2...k), we consider the two-mode unfolding of A.

Two-mode unfolding

A_{(12)(3...k)} is a d² × d^{k−2} matrix obtained by grouping the first 2 indices as the row index and the remaining (k − 2) indices as the column index.
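In numpy, both the one-mode and two-mode unfoldings are plain row-major reshapes (a quick illustrative check, not from the slides):

```python
import numpy as np

d, k = 4, 4
A = np.arange(d**k, dtype=float).reshape((d,) * k)

# Two-mode unfolding: modes (1,2) grouped as the row index,
# modes (3,...,k) grouped as the column index.
A_12 = A.reshape(d * d, d ** (k - 2))
# One-mode unfolding for comparison: d x d^{k-1}.
A_1 = A.reshape(d, d ** (k - 1))

print(A_12.shape, A_1.shape)  # (16, 16) (4, 64)
# Entry (i1, i2, i3, i4) lands at row i1*d + i2, column i3*d + i4.
print(A_12[1 * d + 2, 3 * d + 0] == A[1, 2, 3, 0])  # True
```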


Our results

Given an order-k nearly SOD tensor A ∈ R^{d×···×d},

A = ∑_{i=1}^r λ_i u_i^{⊗k} + E, where ‖E‖₂ ≤ ε.

Goal: recover {ui} from A.

I Noiseless case: every rank-1 matrix in the left singular space of A_{(12)(3...k)} is (up to a scalar) the Kronecker square of some robust tensor eigenvector u_i.

I Noisy case: if ε/|λ|_min ≲ d^{−(k−2)/2}, we can recover {u_i} up to error O(ε) in polynomial time.

Wang, M. and Song, Y.S., Journal of Machine Learning Research W&CP, Vol. 54 (2017) 614-622.



Comparison of tensor decomposition algorithms

I The error bound in tensor decomposition does not depend on the eigenvalue gap ⇒ more stable than matrix decomposition.

Method                                          Noise threshold (ε/|λ|_min ≤)        Recovery accuracy (‖û_i − u_i‖₂ ≤)
Power iteration (Anandkumar et al., 2014)       O(d^{−1}) for order 3                8ε/λ_i
Joint diagonalization (Kuleshov et al., 2015)   –                                    2ε√(‖λ‖₁ λ_max)/λ_i² + o(ε)
Our method (W. and Song, 2017b)                 O(d^{−1/2}) for order 3              2ε/λ_i + o(ε)
                                                O(d^{−(k−2)/2}) for order k

Wang, M. and Song, Y.S., Journal of Machine Learning Research W&CP, Vol. 54 (2017) 614-622.


Numerical experiments

Our method achieves a higher estimation accuracy and performs favorably as the order increases.

[figure: estimation error comparison, panels (a)-(c)]

TPM: tensor power method (JMLR 2014); OJD: orthogonal joint diagonalization (AISTATS 2015)


Numerical experiments

Our method has better convergence performance compared with other decomposition methods.

I Order-3:

I Order-4:

TPM: tensor power method (JMLR 2014); OJD: orthogonal joint diagonalization (AISTATS 2015)


Conclusions

I We see a successful application of the matrix spectral method in revealing latent structure in genetics data.

I We establish a full picture of the norm landscape over all possible unfoldings, providing the mathematical foundations of tensor algorithms.

I We propose a new tensor decomposition algorithm that provably handles a higher level of noise while achieving high recovery accuracy.


Future work

Keywords: higher-order tensor, genomics data, randomized algorithm.

I Developing statistical tools for large-scale genomics data:

I Integrative analysis of multiple omics datasets;

I Spatial-temporal gene expression analysis;

I Single-cell RNA-seq gene expression studies.

I Developing random tensor theory for probabilistic algorithms:

I Preliminary results: Gaussian random tensors and the array normal distribution; concentration properties;

I Open problems: a tensor analogue of the Tracy-Widom law for the top eigenvalue? A Bernstein-type inequality?


Publications:

I M. Wang and Y. S. Song. Tensor decomposition via two-mode higher-order SVD (HOSVD). Journal of Machine Learning Research W&CP (AISTATS track), Vol. 54 (2017) 614-622.

I M. Wang, K. Dao Duc, J. Fischer, and Y. S. Song. Operator norm inequalities between tensor unfoldings on the partition lattice. Linear Algebra and its Applications, Vol. 520 (2017) 44-66.

I M. Wang, J. Fischer, and Y. S. Song. Three-way clustering of multi-tissue multi-individual gene expression data via semi-nonnegative tensor decomposition. Annals of Applied Statistics, Vol. 13, No. 2 (2019) 1124-1148.


More

Example

A symmetric tensor that is not orthogonally decomposable:

A(:, :, 1) = [2 1; 1 1],    A(:, :, 2) = [1 1; 1 1].


Orthogonality

Definition (π-orthogonally decomposable)

A tensor A ∈ R^{d1×···×dk} is called π-OD with partition π = {B_1, . . . , B_ℓ} if it admits the decomposition

A = λ_1 a_1^{(1)} ⊗ a_2^{(1)} ⊗ · · · ⊗ a_k^{(1)} + · · · + λ_r a_1^{(r)} ⊗ a_2^{(r)} ⊗ · · · ⊗ a_k^{(r)},

with the modes of each term grouped according to the blocks B_1, . . . , B_ℓ, where the set of vectors {a_i^{(n)}} satisfies

⟨⊗_{i∈B} a_i^{(n)}, ⊗_{i∈B} a_i^{(m)}⟩ = δ_{nm},

for all B ∈ π and all n, m ∈ [r].


Unfolding of an order-k tensor

I General unfolding. The set of all possible unfoldings of an order-k tensor is in one-to-one correspondence with the set P[k] of all partitions of [k] = {1, . . . , k}.

I For π = {B1, . . . , B`} ∈ P[k], Unfoldπ(A) is obtained by combiningthe modes in each block Bn into a single mode.

Example. An order-4 tensor A = ⟦a_{i1 i2 i3 i4}⟧ ∈ R^{2×2×2×2} with a_{i1 i2 i3 i4} = 1 if i1 = i2 = i3 = i4 and 0 otherwise can be matricized into:

I a 2 × 2³ matrix: Unfold_[1|234](A) = [1 0 0 0 0 0 0 0; 0 0 0 0 0 0 0 1];

I a 2² × 2² matrix: Unfold_[12|34](A) = [1 0 0 0; 0 0 0 0; 0 0 0 0; 0 0 0 1].


Definition (Inner product)

For any two tensors A = ⟦a_{i1...ik}⟧, B = ⟦b_{i1...ik}⟧ ∈ R^{d1×···×dk} of identical order and dimensions, their inner product is defined as

⟨A, B⟩ = ∑_{i1,...,ik} a_{i1...ik} b_{i1...ik}.

The tensor Frobenius norm of A is defined as ‖A‖_F = √⟨A, A⟩.
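In code, the inner product and Frobenius norm are entrywise operations; a minimal sketch:

```python
import numpy as np

rng = np.random.default_rng(1)
A = rng.standard_normal((3, 4, 5))
B = rng.standard_normal((3, 4, 5))

inner = np.sum(A * B)         # <A, B>: sum over all matching entries
fro = np.sqrt(np.sum(A * A))  # ||A||_F = sqrt(<A, A>)

# Equivalent formulations via vectorization:
print(np.isclose(inner, A.ravel() @ B.ravel()))  # True
print(np.isclose(fro, np.linalg.norm(A)))        # True
```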


Norm inequalities between any two tensor unfoldings

Given A ∈ R^{d1×···×dk}, we define the map dim_A : P[k] × P[k] → N_+ as

dim_A(π_1, π_2) = ∏_{B∈π_1} [ max_{B′∈π_2} ( ∏_{n∈B∩B′} d_n ) ], where π_1, π_2 ∈ P[k].

Theorem (p-norm inequalities)

Let A ∈ R^{d1×···×dk} be an arbitrary order-k tensor, and π_1, π_2 any two partitions in P[k]. Define dim(A) = ∏_{i=1}^k d_i. Then,

(a) For any 1 ≤ p ≤ 2,

([dim(A)]^{−1/p} / [dim_A(π_1, π_2)]^{−1/2}) ‖Unfold_{π_1}(A)‖_p ≤ ‖Unfold_{π_2}(A)‖_p ≤ ([dim(A)]^{1/p} / [dim_A(π_2, π_1)]^{1/2}) ‖Unfold_{π_1}(A)‖_p.

(b) For any 2 ≤ p ≤ ∞,

([dim(A)]^{1/p − 1} / [dim_A(π_1, π_2)]^{−1/2}) ‖Unfold_{π_1}(A)‖_p ≤ ‖Unfold_{π_2}(A)‖_p ≤ ([dim(A)]^{1 − 1/p} / [dim_A(π_2, π_1)]^{1/2}) ‖Unfold_{π_1}(A)‖_p.


Two-mode HOSVD algorithm for tensors with noise

Rank-1 matrices in LS(r) are sufficient to find {u_i}.

I Define the two-mode left singular space by

LS(r) := Span{a_i ∈ R^{d²} : a_i is the ith left singular vector of A_{(12)(3...k)}}.

I Look for a "nearly" rank-1 matrix M in the linear space LS(r):

maximize_{M∈R^{d×d}} ‖M‖₂, subject to M ∈ LS(r) and ‖M‖_F = 1.

Justification of the optimization: ‖M‖₂ ≤ ‖M‖_F ≤ √(rank M) ‖M‖₂.

I Apply an eigendecomposition to the matrix M to recover u_i.
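The justifying inequality is easy to check numerically (an illustrative sketch): ‖M‖₂ = ‖M‖_F forces rank one, so among unit-Frobenius matrices the spectral norm is maximized exactly by rank-1 matrices.

```python
import numpy as np

rng = np.random.default_rng(2)
M = rng.standard_normal((5, 5))

spec = np.linalg.norm(M, 2)       # spectral norm = largest singular value
fro = np.linalg.norm(M, 'fro')
r = np.linalg.matrix_rank(M)

print(spec <= fro)               # True
print(fro <= np.sqrt(r) * spec)  # True

# Equality ||M||_2 = ||M||_F holds exactly for rank-1 matrices, so a
# unit-Frobenius rank-1 matrix attains the maximal spectral norm 1.
u = rng.standard_normal(5)
u /= np.linalg.norm(u)
M1 = np.outer(u, u)              # rank one, ||M1||_F = 1
print(np.isclose(np.linalg.norm(M1, 2), 1.0))  # True
```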


Exact Recovery for SOD Tensors in the noiseless case

Optimization to recover the desired factors {u_i} of A:

maximize_{M∈R^{d×d}} ‖M‖₂, subject to M ∈ LS0 and ‖M‖_F = 1.   (1)

Theorem (W. and Song, 2017)

The optimization problem (1) has exactly r pairs of local maximizers {±M*_i : i ∈ [r]}. Furthermore, they satisfy the following three properties:

1. ‖M*_i‖₂ = 1 for all i ∈ [r].

2. |⟨Vec(M*_i), Vec(M*_j)⟩| = δ_{ij} for all i, j ∈ [r], where ⟨·, ·⟩ denotes the inner product.

3. There exists a permutation π on [r] such that M*_i = ±u_{π(i)}^{⊗2} for all i ∈ [r].


Two-mode HOSVD algorithm for tensors with noise

Optimization to recover the desired factors {u_i} of A:

maximize_{M∈R^{d×d}} ‖M‖₂, subject to M ∈ LS(r) and ‖M‖_F = 1.

Algorithm 1 Two-mode HOSVD

Input: noisy tensor T = ∑_{i=1}^r λ_i u_i^{⊗k} + E, number of factors r.
Output: r pairs of estimators (û_i, λ̂_i).
1: Reshape the tensor T into a d²-by-d^{k−2} matrix T_{(12)(3...k)};
2: Find the top r left singular vectors of T_{(12)(3...k)}, denoted {a_1, . . . , a_r};
3: Initialize LS(r) = Span{a_i : i ∈ [r]};
4: for i = 1 to r do
5:   Solve M_i = argmax_{M∈LS(r), ‖M‖_F=1} ‖M‖_σ and û_i = argmax_{u∈S^{d−1}} |u^T M_i u|;
6:   Update M_i ← T_{(1)(2)(3...k)}(I, I, Vec(û_i^{⊗(k−2)})) and û_i ← argmax_{u∈S^{d−1}} |u^T M_i u|;
7:   Return (û_i, λ̂_i) ← (û_i, T(û_i, . . . , û_i));
8:   Set LS(r) ← LS(r) ∩ [Vec(û_i^{⊗2})]^⊥;
9: end for
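A simplified noiseless sketch of the two-mode idea in numpy (illustrative only: the spectral-norm maximization in step 5 is replaced here by eigendecomposing a random combination of the singular-space basis, which suffices when E = 0 and is not the algorithm's actual step):

```python
import numpy as np

rng = np.random.default_rng(3)
d, r = 10, 4

# Ground-truth SOD tensor A = sum_i lam_i u_i^{⊗3}.
U, _ = np.linalg.qr(rng.standard_normal((d, r)))
lam = rng.uniform(1, 2, size=r)
A = np.einsum('i,ai,bi,ci->abc', lam, U, U, U)

# Two-mode unfolding (d^2 x d) and its top-r left singular space,
# which spans {vec(u_i u_i^T)}.
A12 = A.reshape(d * d, d)
L, _, _ = np.linalg.svd(A12, full_matrices=False)
basis = L[:, :r]

# A random combination of the basis is S = sum_i w_i u_i u_i^T with
# a.s.-distinct weights, so one eigendecomposition recovers all u_i.
S = (basis @ rng.standard_normal(r)).reshape(d, d)
S = (S + S.T) / 2
w, V = np.linalg.eigh(S)
Uhat = V[:, np.argsort(-np.abs(w))[:r]]

overlap = np.abs(Uhat.T @ U)  # a permutation matrix up to sign
print(np.round(overlap.max(axis=0), 6))  # [1. 1. 1. 1.]
```

In the noisy case the random-combination shortcut is no longer reliable, which is why the algorithm instead seeks the nearly rank-1 matrix of maximal spectral norm in LS(r).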

[flowchart: Two-Mode HOSVD → Nearly Rank-1 Matrix → Post-Processing → Deflation]


Theorem (W. and Song, 2017b)

Let T = ∑_{i=1}^r λ_i u_i^{⊗k} + E ∈ R^{d×···×d}, where {u_i}_{i∈[r]} are orthonormal vectors, λ_i > 0 for all i ∈ [r], and ‖E‖₂ ≤ ε. Suppose ε ≤ |λ|_min/[c_0 d^{(k−2)/2}], where c_0 > 0 is a sufficiently large constant that does not depend on d. Let {(û_i, λ̂_i)}_{i∈[r]} be the output of Algorithm 1 for inputs T and r. Then there exists a permutation π on [r] such that for all i ∈ [r],

Loss(û_i, u_{π(i)}) ≤ 2ε/λ_{π(i)} + o(ε),    Loss(λ̂_i, λ_{π(i)}) ≤ 2ε + o(ε),

and

‖T − ∑_{i=1}^r λ̂_i û_i^{⊗k}‖₂ ≤ Cε + o(ε),

where C = C(k) > 0 is a constant that only depends on k.

For two unit vectors a, b ∈ R^d, define Loss(a, b) = min(‖a − b‖₂, ‖a + b‖₂). If a, b are two scalars in R, we define Loss(a, b) = min(|a − b|, |a + b|).


Run time comparison

Complexity (for order-3 tensors):

I TPM (Anandkumar et al., 2014): O(d³M) per iteration, where M is the number of restarts.

I OJD (Kuleshov et al., 2015): O(d³L) per iteration, where L is the number of projected matrices.

I Our method (W. and Song, 2017b): O(d³) per iteration.

Simulation study: decompose A ∈ R^{18000×500×40} into 10 components.

I SDA (Hore et al., 2016): 73,989 seconds (∼ 20.6 hrs)

I HOSVD (Omberg et al., 2007): 5,849 seconds (∼ 1.6 hrs)

I Our method (W. et al., 2017c): 6,047 seconds (∼ 1.7 hrs)


Theorem

Let A ∈ ⊗^k R^d be an order-k, dimension-d random tensor with i.i.d. standard Gaussian entries. Then

d^{1/2} < E‖A‖₂ < k d^{1/2}.

Further, ‖A‖₂ concentrates tightly around its expectation. Namely, for any s ≥ 0,

P(|‖A‖₂ − E‖A‖₂| ≥ s) ≤ 2e^{−s²/2}.

With little modification, the above result can be generalized to order-k tensors of dimensions (d_1, . . . , d_k). Specifically, we have

√d_max < E‖A‖₂ < ∑_{i=1}^k √d_i.

This implies ‖A‖₂ = O_P(√d_max) asymptotically for large d and fixed k.


Theorem (Non-Asymptotic Chain)

Let A ∈ ⊗^k R^d be an order-k, dimension-d random tensor with i.i.d. standard Gaussian entries. Then for any d ≥ 4 and k ≥ 2,

E‖Mat₁(A)‖₂ > E‖Mat₂(A)‖₂ > · · · > E‖Mat_{⌊k/2⌋}(A)‖₂.

Further, for any 1 ≤ p ≤ ⌊k/2⌋,

d^{(k−p)/2} < E‖Mat_p(A)‖₂ < d^{(k−p)/2} + d^{p/2}.

The following inequality chain holds almost surely as d → ∞ at any fixed k:

‖Mat₁(A)‖₂ > ‖Mat₂(A)‖₂ > · · · > ‖Mat_{⌊k/2⌋}(A)‖₂.

Further, for any 1 ≤ p ≤ ⌊k/2⌋,

‖Mat_p(A)‖₂ / d^{(k−p)/2} →_{a.s.} 1 + 1{p = k − p}  as d → ∞.
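A quick Monte-Carlo illustration of the chain for k = 4 (a sketch with loose numeric checks, not the theorem's exact constants):

```python
import numpy as np

rng = np.random.default_rng(4)
d, k = 20, 4
A = rng.standard_normal((d,) * k)

# Spectral norms of the p-mode unfoldings Mat_p(A) for p = 1, 2.
norm1 = np.linalg.norm(A.reshape(d, d**3), 2)     # Mat_1: d x d^3
norm2 = np.linalg.norm(A.reshape(d**2, d**2), 2)  # Mat_2: d^2 x d^2

# The theorem predicts norm1 > norm2, with Mat_1 near d^{(k-1)/2} =
# d^{3/2} and the square unfolding (p = k - p case) near 2d.
print(norm1 > norm2)                    # True
print(norm1 / d**1.5, norm2 / (2 * d))  # both close to 1
```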


[figure: operator norms of the 1- through 5-mode flattenings, plotted against the bound d^{p/2} + d^{(k−p)/2}. Left panel: "Multi-Mode Flattening of Random Tensor"; right panel: "Multi-Mode Flattening of Random Symmetric Tensor".]


Proposition (lp-norm vs. lq-norm)

Let A ∈ R^{d1×···×dk} be an order-k tensor and suppose q ≥ p ≥ 1. Then,

‖A‖_p ≤ ‖A‖_q ≤ [dim(A)]^{1/p − 1/q} ‖A‖_p.
