
arXiv:1203.0683v3 [cs.LG] 5 Sep 2012

    A Method of Moments for Mixture Models and

    Hidden Markov Models

Animashree Anandkumar (1), Daniel Hsu (2), and Sham M. Kakade (2)

(1) Department of EECS, University of California, Irvine
(2) Microsoft Research New England

    September 7, 2012

    Abstract

Mixture models are a fundamental tool in applied statistics and machine learning for treating data taken from multiple subpopulations. The current practice for estimating the parameters of such models relies on local search heuristics (e.g., the EM algorithm) which are prone to failure, and existing consistent methods are unfavorable due to their high computational and sample complexity, which typically scale exponentially with the number of mixture components. This work develops an efficient method of moments approach to parameter estimation for a broad class of high-dimensional mixture models with many components, including multi-view mixtures of Gaussians (such as mixtures of axis-aligned Gaussians) and hidden Markov models. The new method leads to rigorous unsupervised learning results for mixture models that were not achieved by previous works; and, because of its simplicity, it offers a viable alternative to EM for practical deployment.

    1 Introduction

Mixture models are a fundamental tool in applied statistics and machine learning for treating data taken from multiple subpopulations (Titterington et al., 1985). In a mixture model, the data are generated from a number of possible sources, and it is of interest to identify the nature of the individual sources. As such, estimating the unknown parameters of the mixture model from sampled data, especially the parameters of the underlying constituent distributions, is an important statistical task. For most mixture models, including the widely used mixtures of Gaussians and hidden Markov models (HMMs), the current practice relies on the Expectation-Maximization (EM) algorithm, a local search heuristic for maximum likelihood estimation. However, EM has a number of well-documented drawbacks regularly faced by practitioners, including slow convergence and suboptimal local optima (Redner and Walker, 1984).

An alternative to maximum likelihood and EM, especially in the context of mixture models, is the method of moments approach. The method of moments dates back to the origins of mixture models with Pearson's solution for identifying the parameters of a mixture of two univariate Gaussians (Pearson, 1894).

    E-mail: [email protected], [email protected], [email protected]


In this approach, model parameters are chosen to specify a distribution whose $p$-th order moments, for several values of $p$, are equal to the corresponding empirical moments observed in the data. Since Pearson's work, the method of moments has been studied and adapted for a variety of problems; its intuitive appeal is also complemented with a guarantee of statistical consistency under mild conditions. Unfortunately, the method often runs into trouble with large mixtures of high-dimensional distributions. This is because the equations determining the parameters are typically based on moments of order equal to the number of model parameters, and high-order moments are exceedingly difficult to estimate accurately due to their large variance.

This work develops a computationally efficient method of moments based on only low-order moments that can be used to estimate the parameters of a broad class of high-dimensional mixture models with many components. The resulting estimators can be implemented with standard numerical linear algebra routines (singular value and eigenvalue decompositions), and the estimates have low variance because they only involve low-order moments. The class of models covered by the method includes certain multivariate Gaussian mixture models and HMMs, as well as mixture models with no explicit likelihood equations. The method exploits the availability of multiple indirect views of a model's underlying latent variable that determines the source distribution, although the notion of a view is rather general. For instance, in an HMM, the past, present, and future observations can be thought of as different noisy views of the present hidden state; in a mixture of product distributions (such as axis-aligned Gaussians), the coordinates in the output space can be partitioned (say, randomly) into multiple non-redundant views. The new method of moments leads to unsupervised learning guarantees for mixture models under mild rank conditions that were not achieved by previous works; in particular, the sample complexity of accurate parameter estimation is shown to be polynomial in the number of mixture components and other relevant quantities. Finally, due to its simplicity, the new method (or variants thereof) also offers a viable alternative to EM and maximum likelihood for practical deployment.

    1.1 Related work

Gaussian mixture models. The statistical literature on mixture models is vast (a more thorough treatment can be found in the texts of Titterington et al. (1985) and Lindsay (1995)), and many advances have been made in computer science and machine learning over the past decade or so, in part due to their importance in modern applications. The use of mixture models for clustering data comprises a large part of this work, beginning with the work of Dasgupta (1999) on learning mixtures of $k$ well-separated $d$-dimensional Gaussians. This and subsequent work (Arora and Kannan, 2001; Dasgupta and Schulman, 2007; Vempala and Wang, 2002; Kannan et al., 2005; Achlioptas and McSherry, 2005; Chaudhuri and Rao, 2008; Brubaker and Vempala, 2008; Chaudhuri et al., 2009) have focused on efficient algorithms that provably recover the parameters of the constituent Gaussians from data generated by such a mixture distribution, provided that the distance between each pair of means is sufficiently large (roughly either $d^c$ or $k^c$ times the standard deviation of the Gaussians, for some $c > 0$). Such separation conditions are natural to expect in many clustering applications, and a number of spectral projection techniques have been shown to enhance the separation (Vempala and Wang, 2002; Kannan et al., 2005; Brubaker and Vempala, 2008; Chaudhuri et al., 2009). More recently, techniques have been developed for learning mixtures of Gaussians without any separation condition (Kalai et al., 2010; Belkin and Sinha, 2010; Moitra and Valiant, 2010), although the computational and sample complexities of these methods grow exponentially with the number of mixture components $k$. This dependence has also been shown to be inevitable


    without further assumptions (Moitra and Valiant, 2010).

Method of moments. The latter works of Belkin and Sinha (2010), Kalai et al. (2010), and Moitra and Valiant (2010) (as well as the algorithms of Feldman et al. (2005, 2006) for a related but different learning objective) can be thought of as modern implementations of the method of moments, and their exponential dependence on $k$ is not surprising given the literature on other moment methods for mixture models. In particular, a number of moment methods for both discrete and continuous mixture models have been developed using techniques such as the Vandermonde decompositions of Hankel matrices (Lindsay, 1989; Lindsay and Basak, 1993; Boley et al., 1997; Gravin et al., 2012). In these methods, following the spirit of Pearson's original solution, the model parameters are derived from the roots of polynomials whose coefficients are based on moments up to the $\Omega(k)$-th order. The accurate estimation of such moments generally has computational and sample complexity exponential in $k$.

Spectral approach to parameter estimation with low-order moments. The present work is based on a notable exception to the above situation, namely Chang's spectral decomposition technique for discrete Markov models of evolution (Chang, 1996) (see also Mossel and Roch (2006) and Hsu et al. (2009) for adaptations to other discrete mixture models such as discrete HMMs). This spectral technique depends only on moments up to the third order; consequently, the resulting algorithms have computational and sample complexity that scales only polynomially in the number of mixture components $k$. The success of the technique depends on a certain rank condition of the transition matrices; this condition is much milder than the separation conditions of clustering works, and it remains sufficient even when the dimension of the observation space is very large (Hsu et al., 2009). In this work, we extend Chang's spectral technique to develop a general method of moments approach to parameter estimation, which is applicable to a large class of mixture models and HMMs with both discrete and continuous component distributions in high-dimensional spaces. Like the moment methods of Moitra and Valiant (2010) and Belkin and Sinha (2010), our algorithm does not require a separation condition; but unlike those previous methods, the algorithm has computational and sample complexity polynomial in $k$.

Some previous spectral approaches for related learning problems only use second-order moments, but these approaches can only estimate a subspace containing the parameter vectors and not the parameters themselves (McSherry, 2001). Indeed, it is known that the parameters of even very simple discrete mixture models are not generally identifiable from only second-order moments (Chang, 1996).¹ We note that moments beyond the second order (specifically, fourth-order moments) have been exploited in the methods of Frieze et al. (1996) and Nguyen and Regev (2009) for the problem of learning a parallelepiped from random samples, and that these methods are very related to techniques used for independent component analysis (Hyvarinen and Oja, 2000). Adapting these techniques for other parameter estimation problems is an enticing possibility.

Multi-view learning. The spectral technique we employ depends on the availability of multiple views, and such a multi-view assumption has been exploited in previous works on learning mixtures of well-separated distributions (Chaudhuri and Rao, 2008; Chaudhuri et al., 2009). In these previous works, a projection based on a canonical correlation analysis (Hotelling, 1935) between two views is used to reinforce the separation between the mixture components, and to cancel out noise orthogonal to the separation directions. The present work, which uses similar correlation-based projections, shows that the availability of a third view of the data can remove the separation condition entirely.

¹See Appendix G for an example of Chang (1996) demonstrating the non-identifiability of parameters from only second-order moments in a simple class of Markov models.


The multi-view assumption substantially generalizes the case where the component distributions are product distributions (such as axis-aligned Gaussians), which has been previously studied in the literature (Dasgupta, 1999; Vempala and Wang, 2002; Chaudhuri and Rao, 2008; Feldman et al., 2005, 2006); the combination of this and a non-degeneracy assumption is what allows us to avoid the sample complexity lower bound of Moitra and Valiant (2010) for Gaussian mixture models. The multi-view assumption also naturally arises in many applications, such as in multimedia data with (say) text, audio, and video components (Blaschko and Lampert, 2008; Chaudhuri et al., 2009); as well as in linguistic data, where the different words in a sentence or paragraph are considered noisy predictors of the underlying semantics (Gale et al., 1992). In the vein of this latter example, we consider estimation in a simple bag-of-words document topic model as a warm-up to our general method; even this simpler model illustrates the power of pair-wise and triple-wise (i.e., bigram and trigram) statistics that were not exploited by previous works on multi-view learning.

    1.2 Outline

Section 2 first develops the method of moments in the context of a simple discrete mixture model motivated by document topic modeling; an explicit algorithm and convergence analysis are also provided. The general setting is considered in Section 3, where the main algorithm and its accompanying correctness and efficiency guarantee are presented. Applications to learning multi-view mixtures of Gaussians and HMMs are discussed in Section 4. All proofs are given in the appendix.

    1.3 Notations

The standard inner product between vectors $u$ and $v$ is denoted by $\langle u, v \rangle = u^\top v$. We denote the $p$-norm of a vector $v$ by $\|v\|_p$. For a matrix $A \in \mathbb{R}^{m \times n}$, we let $\|A\|_2$ denote its spectral norm $\|A\|_2 := \sup_{v \neq 0} \|Av\|_2/\|v\|_2$, $\|A\|_F$ denote its Frobenius norm, $\sigma_i(A)$ denote its $i$-th largest singular value, and $\kappa(A) := \sigma_1(A)/\sigma_{\min(m,n)}(A)$ denote its condition number. Let $\Delta^{n-1} := \{(p_1, p_2, \dots, p_n) \in \mathbb{R}^n : p_i \geq 0\ \forall i,\ \sum_{i=1}^n p_i = 1\}$ denote the probability simplex in $\mathbb{R}^n$, and let $S^{n-1} := \{u \in \mathbb{R}^n : \|u\|_2 = 1\}$ denote the unit sphere in $\mathbb{R}^n$. Let $e_i \in \mathbb{R}^d$ denote the $i$-th coordinate vector, whose $i$-th entry is 1 and the rest are zero. Finally, for a positive integer $n$, let $[n] := \{1, 2, \dots, n\}$.

    2 Warm-up: bag-of-words document topic modeling

We first describe our method of moments in the simpler context of bag-of-words models for documents.

    2.1 Setting

Suppose a document corpus can be partitioned by topic, with each document being assigned a single topic. Further, suppose the words in a document are drawn independently from a multinomial distribution corresponding to the document's topic. Let $k$ be the number of distinct topics in the corpus, $d$ be the number of distinct words in the vocabulary, and $\ell \geq 3$ be the number of words in each document (so the documents may be quite short).

    The generative process for a document is given as follows:


1. The document's topic is drawn according to the multinomial distribution specified by the probability vector $w = (w_1, w_2, \dots, w_k) \in \Delta^{k-1}$. This is modeled as a discrete random variable $h$ such that $\Pr[h = j] = w_j$ for all $j \in [k]$.

2. Given the topic $h$, the document's $\ell$ words are drawn independently according to the multinomial distribution specified by the probability vector $\mu_h \in \Delta^{d-1}$. The random vectors $x_1, x_2, \dots, x_\ell \in \mathbb{R}^d$ represent the $\ell$ words by setting
$$x_v = e_i \quad\Longleftrightarrow\quad \text{the } v\text{-th word in the document is } i, \qquad i \in [d]$$
(the reason for this encoding of words will become clear in the next section). Therefore, for each word $v \in [\ell]$ in the document,
$$\Pr[x_v = e_i \mid h = j] = \langle e_i, \mu_j \rangle = M_{i,j}, \qquad i \in [d],\ j \in [k],$$
where $M \in \mathbb{R}^{d \times k}$ is the matrix of conditional probabilities $M := [\mu_1 | \mu_2 | \cdots | \mu_k]$.

This probabilistic model has the conditional independence structure depicted in Figure 2(a) as a directed graphical model. We assume the following condition on $w$ and $M$.

Condition 2.1 (Non-degeneracy: document topic model). $w_j > 0$ for all $j \in [k]$, and $M$ has rank $k$.

This condition requires that each topic has non-zero probability, and also prevents any topic's word distribution from being a mixture of the other topics' word distributions.

    2.2 Pair-wise and triple-wise probabilities

Define $\mathrm{Pairs} \in \mathbb{R}^{d \times d}$ to be the matrix of pair-wise probabilities whose $(i,j)$-th entry is
$$\mathrm{Pairs}_{i,j} := \Pr[x_1 = e_i,\ x_2 = e_j], \qquad i, j \in [d].$$
Also define $\mathrm{Triples} \in \mathbb{R}^{d \times d \times d}$ to be the third-order tensor of triple-wise probabilities whose $(i,j,\ell)$-th entry is
$$\mathrm{Triples}_{i,j,\ell} := \Pr[x_1 = e_i,\ x_2 = e_j,\ x_3 = e_\ell], \qquad i, j, \ell \in [d].$$
The identification of words with coordinate vectors allows Pairs and Triples to be viewed as expectations of tensor products of the random vectors $x_1$, $x_2$, and $x_3$:
$$\mathrm{Pairs} = \mathbb{E}[x_1 \otimes x_2] \quad\text{and}\quad \mathrm{Triples} = \mathbb{E}[x_1 \otimes x_2 \otimes x_3]. \qquad (1)$$
We may also view Triples as a linear operator $\mathrm{Triples}\colon \mathbb{R}^d \to \mathbb{R}^{d \times d}$ given by
$$\mathrm{Triples}(\eta) := \mathbb{E}[(x_1 \otimes x_2)\,\langle \eta, x_3 \rangle].$$
In other words, the $(i,j)$-th entry of $\mathrm{Triples}(\eta)$ for $\eta = (\eta_1, \eta_2, \dots, \eta_d)$ is
$$\mathrm{Triples}(\eta)_{i,j} = \sum_{x=1}^d \eta_x\,\mathrm{Triples}_{i,j,x} = \sum_{x=1}^d \eta_x\,\mathrm{Triples}(e_x)_{i,j}.$$
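Before relating these quantities to the model parameters, it may help to see how the empirical counterparts of Pairs and the contracted tensor $\mathrm{Triples}(\eta)$ are formed in practice. The following is a minimal NumPy sketch (an illustration, not the authors' code), assuming each document is given as a list of at least three word indices in $\{0, \dots, d-1\}$:

    import numpy as np

    def empirical_moments(docs, d):
        """docs: iterable of documents, each a sequence of >= 3 word indices in [0, d).
        Returns (pairs_hat, triples_op) where triples_op(eta) is the empirical Triples(eta)."""
        pairs = np.zeros((d, d))
        triples = np.zeros((d, d, d))
        n = 0
        for doc in docs:
            i, j, l = doc[0], doc[1], doc[2]   # use the first three words as x1, x2, x3
            pairs[i, j] += 1.0
            triples[i, j, l] += 1.0
            n += 1
        pairs /= n
        triples /= n
        # Triples(eta)_{i,j} = sum_x eta_x * Triples_{i,j,x}
        triples_op = lambda eta: np.tensordot(triples, eta, axes=([2], [0]))
        return pairs, triples_op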

The following lemma shows that Pairs and $\mathrm{Triples}(\eta)$ can be viewed as certain matrix products involving the model parameters $M$ and $w$.


Lemma 2.1. $\mathrm{Pairs} = M\operatorname{diag}(w)M^\top$ and $\mathrm{Triples}(\eta) = M\operatorname{diag}(M^\top\eta)\operatorname{diag}(w)M^\top$ for all $\eta \in \mathbb{R}^d$.

Proof. Since $x_1$, $x_2$, and $x_3$ are conditionally independent given $h$,
$$\mathrm{Pairs}_{i,j} = \Pr[x_1 = e_i,\ x_2 = e_j] = \sum_{t=1}^k \Pr[x_1 = e_i,\ x_2 = e_j \mid h = t]\Pr[h = t] = \sum_{t=1}^k \Pr[x_1 = e_i \mid h = t]\Pr[x_2 = e_j \mid h = t]\Pr[h = t] = \sum_{t=1}^k M_{i,t}M_{j,t}w_t,$$
so $\mathrm{Pairs} = M\operatorname{diag}(w)M^\top$. Moreover, writing $\eta = (\eta_1, \eta_2, \dots, \eta_d)$,
$$\mathrm{Triples}(\eta)_{i,j} = \sum_{x=1}^d \eta_x\Pr[x_1 = e_i,\ x_2 = e_j,\ x_3 = e_x] = \sum_{x=1}^d\sum_{t=1}^k \eta_x M_{i,t}M_{j,t}M_{x,t}w_t = \sum_{t=1}^k M_{i,t}M_{j,t}w_t\,(M^\top\eta)_t,$$
so $\mathrm{Triples}(\eta) = M\operatorname{diag}(M^\top\eta)\operatorname{diag}(w)M^\top$. $\square$
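As a quick sanity check on these identities (a small synthetic verification, not part of the paper), one can draw a random topic model and compare the population moments against the factored forms of Lemma 2.1:

    import numpy as np

    rng = np.random.default_rng(0)
    k, d = 3, 7
    w = rng.dirichlet(np.ones(k))                 # mixing weights
    M = rng.dirichlet(np.ones(d), size=k).T       # d x k matrix of topic-word distributions

    # Population moments computed directly from the generative model.
    pairs = sum(w[t] * np.outer(M[:, t], M[:, t]) for t in range(k))
    eta = rng.standard_normal(d)
    triples_eta = sum(w[t] * (M[:, t] @ eta) * np.outer(M[:, t], M[:, t]) for t in range(k))

    # Factored forms from Lemma 2.1.
    assert np.allclose(pairs, M @ np.diag(w) @ M.T)
    assert np.allclose(triples_eta, M @ np.diag(M.T @ eta) @ np.diag(w) @ M.T)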

    2.3 Observable operators and their spectral properties

The pair-wise and triple-wise probabilities can be related in a way that essentially reveals the conditional probability matrix $M$. This is achieved through a matrix called an observable operator. Similar observable operators were previously used to characterize multiplicity automata (Schutzenberger, 1961; Jaeger, 2000) and, more recently, for learning discrete HMMs (via an operator parameterization) (Hsu et al., 2009).

Lemma 2.2. Assume Condition 2.1. Let $U \in \mathbb{R}^{d \times k}$ and $V \in \mathbb{R}^{d \times k}$ be matrices such that both $U^\top M$ and $V^\top M$ are invertible. Then $U^\top\mathrm{Pairs}\,V$ is invertible, and for all $\eta \in \mathbb{R}^d$, the observable operator $B(\eta) \in \mathbb{R}^{k \times k}$, given by
$$B(\eta) := (U^\top\mathrm{Triples}(\eta)\,V)(U^\top\mathrm{Pairs}\,V)^{-1},$$
satisfies $B(\eta) = (U^\top M)\operatorname{diag}(M^\top\eta)(U^\top M)^{-1}$.

Proof. Since $\operatorname{diag}(w) \succ 0$ by Condition 2.1 and $U^\top\mathrm{Pairs}\,V = (U^\top M)\operatorname{diag}(w)M^\top V$ by Lemma 2.1, it follows that $U^\top\mathrm{Pairs}\,V$ is invertible by the assumptions on $U$ and $V$. Moreover, also by Lemma 2.1,
$$B(\eta) = (U^\top\mathrm{Triples}(\eta)\,V)(U^\top\mathrm{Pairs}\,V)^{-1} = (U^\top M\operatorname{diag}(M^\top\eta)\operatorname{diag}(w)M^\top V)(U^\top\mathrm{Pairs}\,V)^{-1} = (U^\top M)\operatorname{diag}(M^\top\eta)(U^\top M)^{-1}(U^\top M\operatorname{diag}(w)M^\top V)(U^\top\mathrm{Pairs}\,V)^{-1} = (U^\top M)\operatorname{diag}(M^\top\eta)(U^\top M)^{-1}. \qquad\square$$

The matrix $B(\eta)$ is called observable because it is only a function of the observable variables' joint probabilities (e.g., $\Pr[x_1 = e_i,\ x_2 = e_j]$). In the case $\eta = e_x$ for some $x \in [d]$, the matrix


Algorithm A

1. Obtain empirical frequencies of word pairs and triples from a given sample of documents, and form the tables $\widehat{\mathrm{Pairs}} \in \mathbb{R}^{d \times d}$ and $\widehat{\mathrm{Triples}} \in \mathbb{R}^{d \times d \times d}$ corresponding to the population quantities Pairs and Triples.

2. Let $\widehat{U} \in \mathbb{R}^{d \times k}$ and $\widehat{V} \in \mathbb{R}^{d \times k}$ be, respectively, matrices of orthonormal left and right singular vectors of $\widehat{\mathrm{Pairs}}$ corresponding to its top $k$ singular values.

3. Pick $\eta \in \mathbb{R}^d$ (see remark in the main text), and compute the right eigenvectors $\hat\xi_1, \hat\xi_2, \dots, \hat\xi_k$ (of unit Euclidean norm) of
$$\widehat{B}(\eta) := (\widehat{U}^\top\widehat{\mathrm{Triples}}(\eta)\,\widehat{V})(\widehat{U}^\top\widehat{\mathrm{Pairs}}\,\widehat{V})^{-1}.$$
(Fail if not possible.)

4. Let $\hat\mu_j := \widehat{U}\hat\xi_j / \langle \vec{1}, \widehat{U}\hat\xi_j \rangle$ for all $j \in [k]$.

5. Return $\widehat{M} := [\hat\mu_1 | \hat\mu_2 | \cdots | \hat\mu_k]$.

Figure 1: Topic-word distribution estimator (Algorithm A).
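The following NumPy sketch spells out steps 2-5 on already-formed empirical tables. It is an illustrative transcription (not the authors' reference implementation); pairs_hat is the $d \times d$ table and triples_eta_hat is the $d \times d$ matrix $\widehat{\mathrm{Triples}}(\eta)$ for a chosen $\eta$:

    import numpy as np

    def algorithm_A(pairs_hat, triples_eta_hat, k):
        """Spectral topic-word estimator (steps 2-5 of Algorithm A)."""
        # Step 2: top-k singular vectors of the empirical Pairs table.
        U, s, Vt = np.linalg.svd(pairs_hat)
        U, V = U[:, :k], Vt[:k, :].T
        # Step 3: observable operator and its right eigenvectors.
        B = (U.T @ triples_eta_hat @ V) @ np.linalg.inv(U.T @ pairs_hat @ V)
        eigvals, xis = np.linalg.eig(B)            # columns of xis have unit Euclidean norm
        if np.abs(eigvals.imag).max() > 1e-8:
            raise RuntimeError("eigenvalues not real/separated; retry with a new eta")
        # Step 4: map eigenvectors back to word space and normalize to probability vectors.
        cols = U @ xis.real
        M_hat = cols / cols.sum(axis=0, keepdims=True)
        return M_hat                                # columns estimate the topic-word distributions

In practice, $\eta$ can be chosen as in the remark below on the choice of $\eta$, e.g. $\eta = \widehat{U}\theta$ for a random unit vector $\theta$.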

$B(e_x)$ is similar (in the linear algebraic sense) to the diagonal matrix $\operatorname{diag}(M^\top e_x)$; the collection of matrices $\{\operatorname{diag}(M^\top e_x) : x \in [d]\}$ (together with $w$) can be used to compute joint probabilities under the model (see, e.g., Hsu et al. (2009)). Note that the columns of $U^\top M$ are eigenvectors of $B(e_x)$, with the $j$-th column having an associated eigenvalue equal to $\Pr[x_v = x \mid h = j]$. If the word $x$ has distinct probabilities under every topic, then $B(e_x)$ has exactly $k$ distinct eigenvalues, each having geometric multiplicity one and corresponding to a column of $U^\top M$.

    2.4 Topic-word distribution estimator and convergence guarantee

The spectral properties of the observable operators $B(\eta)$ implied by Lemma 2.2 suggest the estimation procedure (Algorithm A) in Figure 1. The procedure is essentially a plug-in approach based on the equations relating the second- and third-order moments in Lemma 2.2. We focus on estimating $M$; estimating the mixing weights $w$ is easily handled as a secondary step (see Appendix B.5 for the estimator in the context of the general model in Section 3.1).

On the choice of $\eta$. As discussed in the previous section, a suitable choice for $\eta$ can be based on prior knowledge about the topic-word distributions, such as $\eta = e_x$ for some $x \in [d]$ that has different conditional probabilities under each topic. In the absence of such information, one may select $\eta$ randomly from the subspace $\operatorname{range}(\widehat{U})$. Specifically, take $\eta := \widehat{U}\theta$, where $\theta \in \mathbb{R}^k$ is a random unit vector distributed uniformly over $S^{k-1}$.

    The following theorem establishes the convergence rate of Algorithm A.

Theorem 2.1. There exists a constant $C > 0$ such that the following holds. Pick any $\delta \in (0, 1)$. Assume the document topic model from Section 2.1 satisfies Condition 2.1. Further, assume that in Algorithm A, $\widehat{\mathrm{Pairs}}$ and $\widehat{\mathrm{Triples}}$ are, respectively, the empirical averages of $N$ independent copies of $x_1 \otimes x_2$ and $x_1 \otimes x_2 \otimes x_3$; and that $\eta = \widehat{U}\theta$, where $\theta \in \mathbb{R}^k$ is an independent random unit vector distributed uniformly over $S^{k-1}$. If
$$N \geq C \cdot \frac{k^7\,\ln(1/\delta)}{\sigma_k(M)^6\,\sigma_k(\mathrm{Pairs})^4\,\delta^2},$$


Figure 2: (a) The multi-view mixture model. (b) A hidden Markov model.

then with probability at least $1 - \delta$, the parameters returned by Algorithm A have the following guarantee: there exists a permutation $\tau$ on $[k]$ and scalars $c_1, c_2, \dots, c_k \in \mathbb{R}$ such that, for each $j \in [k]$,
$$\|c_j\hat\mu_j - \mu_{\tau(j)}\|_2 \leq C\,\|\mu_{\tau(j)}\|_2\cdot\frac{k^5}{\sigma_k(M)^4\,\sigma_k(\mathrm{Pairs})^2}\sqrt{\frac{\ln(1/\delta)}{N}}.$$

The proof of Theorem 2.1, as well as some illustrative empirical results on using Algorithm A, are presented in Appendix A. A few remarks about the theorem are in order.

On boosting the confidence. Although the convergence depends polynomially on $1/\delta$, where $\delta$ is the failure probability, it is possible to boost the confidence by repeating Step 3 of Algorithm A with different random $\theta$ until the eigenvalues of $\widehat{B}(\eta)$ are sufficiently separated (as judged by confidence intervals).

On the scaling factors $c_j$. With a larger sample complexity that depends on $d$, an error bound can be established for $\|\hat\mu_j - \mu_{\tau(j)}\|_1$ directly (without the unknown scaling factors $c_j$). We also remark that the scaling factors can be estimated from the eigenvalues of $\widehat{B}(\eta)$, but we do not pursue this approach as it is subsumed by Algorithm B anyway.

    3 A method of moments for multi-view mixture models

We now consider a much broader class of mixture models and present a general method of moments in this context.

    3.1 General setting

Consider the following multi-view mixture model; $k$ denotes the number of mixture components, and $\ell$ denotes the number of views. We assume $\ell \geq 3$ throughout. Let $w = (w_1, w_2, \dots, w_k) \in \Delta^{k-1}$ be a vector of mixing weights, and let $h$ be a (hidden) discrete random variable with $\Pr[h = j] = w_j$ for all $j \in [k]$. Let $x_1, x_2, \dots, x_\ell \in \mathbb{R}^d$ be random vectors that are conditionally independent given $h$; the directed graphical model is depicted in Figure 2(a).

Define the conditional mean vectors as
$$\mu_{v,j} := \mathbb{E}[x_v \mid h = j], \qquad v \in [\ell],\ j \in [k],$$
and let $M_v \in \mathbb{R}^{d \times k}$ be the matrix whose $j$-th column is $\mu_{v,j}$. Note that we do not specify anything else about the (conditional) distribution of $x_v$; it may be continuous, discrete, or even a hybrid depending on $h$.

We assume the following conditions on $w$ and the $M_v$.


Condition 3.1 (Non-degeneracy: general setting). $w_j > 0$ for all $j \in [k]$, and $M_v$ has rank $k$ for all $v \in [\ell]$.

We remark that it is easy to generalize to the case where views have different dimensionality (e.g., $x_v \in \mathbb{R}^{d_v}$ for possibly different dimensions $d_v$). For notational simplicity, we stick to the same dimension for each view. Moreover, Condition 3.1 can be relaxed in some cases; we discuss one such case in Section 4.1 in the context of Gaussian mixture models.

Because the conditional distribution of $x_v$ is not specified beyond its conditional means, it is not possible to develop a maximum likelihood approach to parameter estimation. Instead, as in the document topic model, we develop a method of moments based on solving polynomial equations arising from eigenvalue problems.

    3.2 Observable moments and operators

We focus on the moments concerning $\{x_1, x_2, x_3\}$, but the same properties hold for other triples of the random vectors $\{x_a, x_b, x_c\} \subseteq \{x_v : v \in [\ell]\}$ as well.

As in (1), we define the matrix $P_{1,2} \in \mathbb{R}^{d \times d}$ of second-order moments, and the tensor $P_{1,2,3} \in \mathbb{R}^{d \times d \times d}$ of third-order moments, by
$$P_{1,2} := \mathbb{E}[x_1 \otimes x_2] \quad\text{and}\quad P_{1,2,3} := \mathbb{E}[x_1 \otimes x_2 \otimes x_3].$$
Again, $P_{1,2,3}$ is regarded as the linear operator $P_{1,2,3}(\eta) := \mathbb{E}[(x_1 \otimes x_2)\,\langle\eta, x_3\rangle]$.

Lemma 3.1 and Lemma 3.2 are straightforward generalizations of Lemma 2.1 and Lemma 2.2.

Lemma 3.1. $P_{1,2} = M_1\operatorname{diag}(w)M_2^\top$ and $P_{1,2,3}(\eta) = M_1\operatorname{diag}(M_3^\top\eta)\operatorname{diag}(w)M_2^\top$ for all $\eta \in \mathbb{R}^d$.

Lemma 3.2. Assume Condition 3.1. For $v \in \{1, 2, 3\}$, let $U_v \in \mathbb{R}^{d \times k}$ be a matrix such that $U_v^\top M_v$ is invertible. Then $U_1^\top P_{1,2}U_2$ is invertible, and for all $\eta \in \mathbb{R}^d$, the observable operator $B_{1,2,3}(\eta) \in \mathbb{R}^{k \times k}$, given by $B_{1,2,3}(\eta) := (U_1^\top P_{1,2,3}(\eta)U_2)(U_1^\top P_{1,2}U_2)^{-1}$, satisfies
$$B_{1,2,3}(\eta) = (U_1^\top M_1)\operatorname{diag}(M_3^\top\eta)(U_1^\top M_1)^{-1}.$$
In particular, the $k$ roots of the polynomial $\det(B_{1,2,3}(\eta) - \lambda I)$ are $\{\langle\eta, \mu_{3,j}\rangle : j \in [k]\}$.

Recall that Algorithm A relates the eigenvectors of $B(\eta)$ to the matrix of conditional means $M$. However, eigenvectors are only defined up to a scaling of each vector; without prior knowledge of the correct scaling, the eigenvectors are not sufficient to recover the parameters $M$. Nevertheless, the eigenvalues also carry information about the parameters, as shown in Lemma 3.2, and it is possible to reconstruct the parameters from the observable operators applied to different vectors $\eta$. This idea is captured in the following lemma.

Lemma 3.3. Consider the setting and definitions from Lemma 3.2. Let $\Theta \in \mathbb{R}^{k \times k}$ be an invertible matrix, and let $\theta_i \in \mathbb{R}^k$ be its $i$-th row. Moreover, for all $i \in [k]$, let $\lambda_{i,1}, \lambda_{i,2}, \dots, \lambda_{i,k}$ denote the $k$ eigenvalues of $B_{1,2,3}(U_3\theta_i)$ in the order specified by the matrix of right eigenvectors $U_1^\top M_1$. Let $L \in \mathbb{R}^{k \times k}$ be the matrix whose $(i,j)$-th entry is $\lambda_{i,j}$. Then
$$\Theta\,U_3^\top M_3 = L.$$


Observe that the unknown parameters $M_3$ are expressed as the solution to a linear system in the above equation, where the elements of the right-hand side $L$ are the roots of $k$-th degree polynomials derived from the second- and third-order observable moments (namely, the characteristic polynomials of the $B_{1,2,3}(U_3\theta_i)$, $i \in [k]$). This template is also found in other moment methods based on decompositions of a Hankel matrix. A crucial distinction, however, is that the $k$-th degree polynomials in Lemma 3.3 only involve low-order moments, whereas standard methods may involve up to $\Omega(k)$-th order moments, which are difficult to estimate (Lindsay, 1989; Lindsay and Basak, 1993; Gravin et al., 2012).

    3.3 Main result: general estimation procedure and sample complexity bound

The lemmas in the previous section suggest the estimation procedure (Algorithm B) presented in Figure 3.

Algorithm B

1. Compute empirical averages from $N$ independent copies of $x_1 \otimes x_2$ to form $\widehat{P}_{1,2} \in \mathbb{R}^{d \times d}$. Similarly do the same for $x_1 \otimes x_3$ to form $\widehat{P}_{1,3} \in \mathbb{R}^{d \times d}$, and for $x_1 \otimes x_2 \otimes x_3$ to form $\widehat{P}_{1,2,3} \in \mathbb{R}^{d \times d \times d}$.

2. Let $\widehat{U}_1 \in \mathbb{R}^{d \times k}$ and $\widehat{U}_2 \in \mathbb{R}^{d \times k}$ be, respectively, matrices of orthonormal left and right singular vectors of $\widehat{P}_{1,2}$ corresponding to its top $k$ singular values. Let $\widehat{U}_3 \in \mathbb{R}^{d \times k}$ be the matrix of orthonormal right singular vectors of $\widehat{P}_{1,3}$ corresponding to its top $k$ singular values.

3. Pick an invertible matrix $\Theta \in \mathbb{R}^{k \times k}$, with its $i$-th row denoted as $\theta_i \in \mathbb{R}^k$. In the absence of any prior information about $M_3$, a suitable choice for $\Theta$ is a random rotation matrix. Form the matrix
$$\widehat{B}_{1,2,3}(\widehat{U}_3\theta_1) := (\widehat{U}_1^\top\widehat{P}_{1,2,3}(\widehat{U}_3\theta_1)\widehat{U}_2)(\widehat{U}_1^\top\widehat{P}_{1,2}\widehat{U}_2)^{-1}.$$
Compute $\widehat{R}_1 \in \mathbb{R}^{k \times k}$ (with unit Euclidean norm columns) that diagonalizes $\widehat{B}_{1,2,3}(\widehat{U}_3\theta_1)$, i.e., $\widehat{R}_1^{-1}\widehat{B}_{1,2,3}(\widehat{U}_3\theta_1)\widehat{R}_1 = \operatorname{diag}(\hat\lambda_{1,1}, \hat\lambda_{1,2}, \dots, \hat\lambda_{1,k})$. (Fail if not possible.)

4. For each $i \in \{2, \dots, k\}$, obtain the diagonal entries $\hat\lambda_{i,1}, \hat\lambda_{i,2}, \dots, \hat\lambda_{i,k}$ of $\widehat{R}_1^{-1}\widehat{B}_{1,2,3}(\widehat{U}_3\theta_i)\widehat{R}_1$, and form the matrix $\widehat{L} \in \mathbb{R}^{k \times k}$ whose $(i,j)$-th entry is $\hat\lambda_{i,j}$.

5. Return $\widehat{M}_3 := \widehat{U}_3\Theta^{-1}\widehat{L}$.

Figure 3: General method of moments estimator (Algorithm B).
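For concreteness, here is a compact NumPy sketch of these steps. It is an illustration under simplifying assumptions (the empirical moment arrays are already formed, and a plain eigendecomposition stands in for a more robust joint-diagonalization routine), not the authors' implementation:

    import numpy as np

    def algorithm_B(P12, P13, P123, k, rng=None):
        """Sketch of Algorithm B. P12, P13: (d, d) arrays; P123: (d, d, d) array.
        Returns an estimate of M3 (d x k), up to column ordering."""
        rng = np.random.default_rng() if rng is None else rng
        # Step 2: singular subspaces of the pairwise moments.
        U1, _, V2t = np.linalg.svd(P12); U1, U2 = U1[:, :k], V2t[:k, :].T
        _, _, V3t = np.linalg.svd(P13); U3 = V3t[:k, :].T
        # Step 3: random rotation Theta (rows theta_i) via QR of a Gaussian matrix.
        Theta, _ = np.linalg.qr(rng.standard_normal((k, k)))
        contract = lambda eta: np.tensordot(P123, eta, axes=([2], [0]))   # P123(eta)
        W = np.linalg.inv(U1.T @ P12 @ U2)
        B1 = (U1.T @ contract(U3 @ Theta[0]) @ U2) @ W
        R1 = np.linalg.eig(B1)[1].real        # assumes well-separated real eigenvalues
        R1inv = np.linalg.inv(R1)
        # Step 4: diagonal entries of every operator in the order fixed by R1.
        L = np.zeros((k, k))
        for i in range(k):
            Bi = (U1.T @ contract(U3 @ Theta[i]) @ U2) @ W
            L[i] = np.diag(R1inv @ Bi @ R1)
        # Step 5: solve the linear system Theta (U3^T M3) = L.
        return U3 @ np.linalg.inv(Theta) @ L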

As stated, Algorithm B yields an estimator for $M_3$, but the method can easily be applied to estimate $M_v$ for all other views $v$. One caveat is that the estimators may not yield the same ordering of the columns, due to the unspecified order of the eigenvectors obtained in the third step of the method, and therefore some care is needed to obtain a consistent ordering. We outline one solution in Appendix B.4.

The sample complexity of Algorithm B depends on the specific concentration properties of $x_1$, $x_2$, $x_3$. We abstract away this dependence in the following condition.

Condition 3.2. There exist positive scalars $N_0$, $C_{1,2}$, $C_{1,3}$, $C_{1,2,3}$, and a function $f(N, \delta)$ (decreasing in $N$ and $\delta$) such that for any $N \geq N_0$ and $\delta \in (0, 1)$,

1. $\Pr\big[\|\widehat{P}_{a,b} - P_{a,b}\|_2 \leq C_{a,b}\,f(N, \delta)\big] \geq 1 - \delta$ for $\{a, b\} \in \{\{1,2\}, \{1,3\}\}$,

2. $\forall v \in \mathbb{R}^d$, $\Pr\big[\|\widehat{P}_{1,2,3}(v) - P_{1,2,3}(v)\|_2 \leq C_{1,2,3}\,\|v\|_2\,f(N, \delta)\big] \geq 1 - \delta$.


Moreover (for technical convenience), $\widehat{P}_{1,3}$ is independent of $\widehat{P}_{1,2,3}$ (which may be achieved, say, by splitting a sample of size $2N$).

For discrete models such as the document topic model of Section 2.1 and discrete HMMs (Mossel and Roch, 2006; Hsu et al., 2009), Condition 3.2 holds with $N_0 = C_{1,2} = C_{1,3} = C_{1,2,3} = 1$ and $f(N, \delta) = (1 + \sqrt{\ln(1/\delta)})/\sqrt{N}$. Using standard techniques (e.g., Chaudhuri et al. (2009); Vershynin (2012)), the condition can also be shown to hold for mixtures of various continuous distributions such as multivariate Gaussians.

    Now we are ready to present the main theorem of this section (proved in Appendix B.6).

Theorem 3.1. There exists a constant $C > 0$ such that the following holds. Assume the three-view mixture model satisfies Condition 3.1 and Condition 3.2. Pick any $\delta \in (0, 1)$ and $\epsilon \in (0, 1)$. Further, assume $\Theta \in \mathbb{R}^{k \times k}$ is an independent random rotation matrix distributed uniformly over the Stiefel manifold $\{Q \in \mathbb{R}^{k \times k} : Q^\top Q = I\}$. If the number of samples $N$ satisfies $N \geq N_0$ and
$$f(N, \delta/k) \leq C\,\epsilon\cdot\frac{\min_{i \neq j}\|M_3(e_i - e_j)\|_2\,\sigma_k(P_{1,2})}{C_{1,2,3}\,k^5\,\kappa(M_1)^4\,\sqrt{\ln(k/\delta)}},$$
$$f(N, \delta) \leq C\,\epsilon\cdot\min\left\{\frac{\min_{i \neq j}\|M_3(e_i - e_j)\|_2\,\sigma_k(P_{1,2})^2}{C_{1,2}\,\|P_{1,2,3}\|_2\,k^5\,\kappa(M_1)^4\,\sqrt{\ln(k/\delta)}},\ \frac{\sigma_k(P_{1,3})}{C_{1,3}}\right\},$$
where $\|P_{1,2,3}\|_2 := \max_{\|v\|_2 = 1}\|P_{1,2,3}(v)\|_2$, then with probability at least $1 - 5\delta$, Algorithm B returns $\widehat{M}_3 = [\hat\mu_{3,1} | \hat\mu_{3,2} | \cdots | \hat\mu_{3,k}]$ with the following guarantee: there exists a permutation $\tau$ on $[k]$ such that for each $j \in [k]$,
$$\|\hat\mu_{3,j} - \mu_{3,\tau(j)}\|_2 \leq \epsilon\,\max_{j' \in [k]}\|\mu_{3,j'}\|_2.$$

    4 Applications

In addition to the document clustering model from Section 2, a number of natural latent variable models fit into this multi-view framework. We describe two such cases in this section: Gaussian mixture models and HMMs, both of which have been (at least partially) studied in the literature. In both cases, the estimation technique of Algorithm B leads to new learnability results that were not achieved by previous works.

    4.1 Multi-view Gaussian mixture models

The standard Gaussian mixture model is parameterized by a mixing weight $w_j$, mean vector $\mu_j \in \mathbb{R}^D$, and covariance matrix $\Sigma_j \in \mathbb{R}^{D \times D}$ for each mixture component $j \in [k]$. The hidden discrete random variable $h$ selects a component $j$ with probability $\Pr[h = j] = w_j$; the conditional distribution of the observed random vector $x$ given $h$ is a multivariate Gaussian with mean $\mu_h$ and covariance $\Sigma_h$.

The multi-view assumption for Gaussian mixture models asserts that for each component $j$, the covariance $\Sigma_j$ has a block diagonal structure $\Sigma_j = \operatorname{blkdiag}(\Sigma_{1,j}, \Sigma_{2,j}, \dots, \Sigma_{\ell,j})$ (a special case is an axis-aligned Gaussian). The various blocks correspond to the different views of the data $x_1, x_2, \dots, x_\ell \in \mathbb{R}^d$ (for $d = D/\ell$), which are conditionally independent given $h$. The mean vector for each component $j$ is similarly partitioned into the views as $\mu_j = (\mu_{1,j}, \mu_{2,j}, \dots, \mu_{\ell,j})$. Note that in the case of an axis-aligned Gaussian, each covariance matrix $\Sigma_j$ is diagonal, and therefore the


original coordinates $[D]$ can be partitioned into $\ell = O(D/k)$ views (each of dimension $d = \Omega(k)$) in any way (say, randomly), provided that Condition 3.1 holds.

Condition 3.1 requires that the conditional mean matrix $M_v = [\mu_{v,1} | \mu_{v,2} | \cdots | \mu_{v,k}]$ for each view $v$ have full column rank. This is similar to the non-degeneracy and spreading conditions used in previous studies of multi-view clustering (Chaudhuri and Rao, 2008; Chaudhuri et al., 2009). In these previous works, the multi-view and non-degeneracy assumptions are shown to reduce the minimum separation required for various efficient algorithms to learn the model parameters. In comparison, Algorithm B does not require a minimum separation condition at all. See Appendix D.3 for details.

While Algorithm B recovers just the means of the mixture components (see Appendix D.4 for details concerning Condition 3.2), we remark that a slight variation can be used to recover the covariances as well. Note that
$$\mathbb{E}[x_v \otimes x_v \mid h] = (M_ve_h) \otimes (M_ve_h) + \Sigma_{v,h} = \mu_{v,h} \otimes \mu_{v,h} + \Sigma_{v,h}$$
for all $v \in [\ell]$. For a pair of vectors $\eta \in \mathbb{R}^d$ and $\eta' \in \mathbb{R}^d$, define the matrix $Q_{1,2,3}(\eta, \eta') \in \mathbb{R}^{d \times d}$ of fourth-order moments by
$$Q_{1,2,3}(\eta, \eta') := \mathbb{E}\big[(x_1 \otimes x_2)\,\langle\eta, x_3\rangle\,\langle\eta', x_3\rangle\big].$$

Proposition 4.1. Under the setting of Lemma 3.2, the matrix given by
$$F_{1,2,3}(\eta, \eta') := (U_1^\top Q_{1,2,3}(\eta, \eta')U_2)(U_1^\top P_{1,2}U_2)^{-1}$$
satisfies $F_{1,2,3}(\eta, \eta') = (U_1^\top M_1)\operatorname{diag}\big(\langle\eta, \mu_{3,t}\rangle\langle\eta', \mu_{3,t}\rangle + \langle\eta, \Sigma_{3,t}\eta'\rangle : t \in [k]\big)(U_1^\top M_1)^{-1}$ and hence is diagonalizable (in fact, by the same matrices as $B_{1,2,3}(\eta)$).

Finally, we note that even if Condition 3.1 does not hold (e.g., if $\mu_{v,j} = m \in \mathbb{R}^d$ (say) for all $v \in [\ell]$, $j \in [k]$, so all of the Gaussians have the same mean), one may still apply Algorithm B to the model $(h, y_1, y_2, \dots, y_\ell)$, where $y_v \in \mathbb{R}^{d + d(d+1)/2}$ is the random vector that includes both first- and second-order terms of $x_v$, i.e., $y_v$ is the concatenation of $x_v$ and the upper triangular part of $x_v \otimes x_v$. In this case, Condition 3.1 is replaced by a requirement that the matrices
$$M_v := \big[\,\mathbb{E}[y_v \mid h = 1]\ \big|\ \mathbb{E}[y_v \mid h = 2]\ \big|\ \cdots\ \big|\ \mathbb{E}[y_v \mid h = k]\,\big] \in \mathbb{R}^{(d + d(d+1)/2) \times k}$$
of conditional means and covariances have full rank. This requirement can be met even if the means $\mu_{v,j}$ of the mixture components are all the same.
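A small helper sketch (illustrative, not from the paper) for forming the augmented view $y_v$ from an observation $x_v$:

    import numpy as np

    def augment_view(x):
        """Concatenate x with the upper-triangular part of x (x^T), as suggested above."""
        x = np.asarray(x)
        iu = np.triu_indices(x.shape[0])
        return np.concatenate([x, np.outer(x, x)[iu]])   # length d + d(d+1)/2

    # Example: a 3-dimensional view becomes a 3 + 6 = 9 dimensional augmented view.
    print(augment_view(np.array([1.0, 2.0, 3.0])).shape)   # (9,)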

    4.2 Hidden Markov models

A hidden Markov model is a latent variable model in which a hidden state sequence $h_1, h_2, \dots, h_\ell$ forms a Markov chain $h_1 \to h_2 \to \cdots \to h_\ell$ over $k$ possible states $[k]$; and given the state $h_t$ at time $t \in [\ell]$, the observation $x_t$ at time $t$ (a random vector taking values in $\mathbb{R}^d$) is conditionally independent of all other observations and states. The directed graphical model is depicted in Figure 2(b).

The vector $\pi \in \Delta^{k-1}$ is the initial state distribution:
$$\Pr[h_1 = i] = \pi_i, \qquad i \in [k].$$


For simplicity, we only consider time-homogeneous HMMs, although it is possible to generalize to the time-varying setting. The matrix $T \in \mathbb{R}^{k \times k}$ is a stochastic matrix describing the hidden state Markov chain:
$$\Pr[h_{t+1} = i \mid h_t = j] = T_{i,j}, \qquad i, j \in [k],\ t \in [\ell - 1].$$
Finally, the columns of the matrix $O = [o_1 | o_2 | \cdots | o_k] \in \mathbb{R}^{d \times k}$ are the conditional means of the observation $x_t$ at time $t$ given the corresponding hidden state $h_t$:
$$\mathbb{E}[x_t \mid h_t = i] = Oe_i = o_i, \qquad i \in [k],\ t \in [\ell].$$
Note that both discrete and continuous observations are readily handled in this framework. For instance, the conditional distribution of $x_t$ given $h_t = i$ (for $i \in [k]$) could be a high-dimensional multivariate Gaussian with mean $o_i \in \mathbb{R}^d$. Such models were not handled by previous methods (Chang, 1996; Mossel and Roch, 2006; Hsu et al., 2009).

The restriction of the HMM to three time steps, say $t \in \{1, 2, 3\}$, is an instance of the three-view mixture model.

Proposition 4.2. If the hidden variable $h$ (from the three-view mixture model of Section 3.1) is identified with the second hidden state $h_2$, then $\{x_1, x_2, x_3\}$ are conditionally independent given $h$, and the parameters of the resulting three-view mixture model on $(h, x_1, x_2, x_3)$ are
$$w := T\pi, \qquad M_1 := O\operatorname{diag}(\pi)T^\top\operatorname{diag}(T\pi)^{-1}, \qquad M_2 := O, \qquad M_3 := OT.$$

From Proposition 4.2, it is easy to verify that $B_{3,1,2}(\eta) = (U_3^\top OT)\operatorname{diag}(O^\top\eta)(U_3^\top OT)^{-1}$. Therefore, after recovering the observation conditional mean matrix $O$ using Algorithm B, the Markov chain transition matrix can be recovered using the matrix of right eigenvectors $R$ of $B_{3,1,2}(\eta)$ and the equation $(U_3^\top O)^{-1}R = T$ (up to scaling of the columns).
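To illustrate this last step, here is a hypothetical sketch, assuming an estimate O_hat and empirical moment arrays P31 = E[x3 ⊗ x1] and P312 = E[x3 ⊗ x1 ⊗ x2] are already available (for example from the Algorithm B sketch above). The function names and the column renormalization are illustrative choices, not the paper's:

    import numpy as np

    def recover_transition(O_hat, P31, P312, eta, k):
        """Estimate T from O_hat via B_{3,1,2}(eta); columns are rescaled to sum to one."""
        # U3: top-k left singular vectors of the (view-3, view-1) moment matrix P31.
        U3, _, V1t = np.linalg.svd(P31); U3, U1 = U3[:, :k], V1t[:k, :].T
        B = (U3.T @ np.tensordot(P312, eta, axes=([2], [0])) @ U1) \
            @ np.linalg.inv(U3.T @ P31 @ U1)
        R = np.linalg.eig(B)[1].real                   # right eigenvectors of B_{3,1,2}(eta)
        T_hat = np.linalg.pinv(U3.T @ O_hat) @ R       # equals T up to column scaling/order
        return T_hat / T_hat.sum(axis=0, keepdims=True)  # fix scaling so columns are stochastic

The column ordering of T_hat still has to be matched to that of O_hat, e.g. by the eigenvalue-matching idea described in Appendix B.4.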

    Acknowledgments

We thank Kamalika Chaudhuri and Tong Zhang for many useful discussions, Karl Stratos for comments on an early draft, David Sontag and an anonymous reviewer for some pointers to related work, and Adel Javanmard for pointing out a problem with Theorem D.1 in an earlier version of the paper.

    References

    D. Achlioptas and F. McSherry. On spectral learning of mixtures of distributions. In COLT, 2005.

R. Ahlswede and A. Winter. Strong converse for identification via quantum channels. IEEE Transactions on Information Theory, 48(3):569–579, 2002.

S. Arora and R. Kannan. Learning mixtures of arbitrary Gaussians. In STOC, 2001.

M. Belkin and K. Sinha. Polynomial learning of distribution families. In FOCS, 2010.

M. B. Blaschko and C. H. Lampert. Correlational spectral clustering. In CVPR, 2008.

D. L. Boley, F. T. Luk, and D. Vandevoorde. Vandermonde factorization of a Hankel matrix. In Scientific Computing, 1997.


    S. C. Brubaker and S. Vempala. Isotropic PCA and affine-invariant clustering. In FOCS, 2008.

J. T. Chang. Full reconstruction of Markov models on evolutionary trees: Identifiability and consistency. Mathematical Biosciences, 137:51–73, 1996.

K. Chaudhuri and S. Rao. Learning mixtures of product distributions using correlations and independence. In COLT, 2008.

K. Chaudhuri, S. M. Kakade, K. Livescu, and K. Sridharan. Multi-view clustering via canonical correlation analysis. In ICML, 2009.

S. Dasgupta. Learning mixtures of Gaussians. In FOCS, 1999.

S. Dasgupta and A. Gupta. An elementary proof of a theorem of Johnson and Lindenstrauss. Random Structures and Algorithms, 22(1):60–65, 2003.

S. Dasgupta and L. Schulman. A probabilistic analysis of EM for mixtures of separated, spherical Gaussians. Journal of Machine Learning Research, 8(Feb):203–226, 2007.

J. Feldman, R. O'Donnell, and R. Servedio. Learning mixtures of product distributions over discrete domains. In FOCS, 2005.

J. Feldman, R. O'Donnell, and R. Servedio. PAC learning mixtures of axis-aligned Gaussians with no separation assumption. In COLT, 2006.

    A. M. Frieze, M. Jerrum, and R. Kannan. Learning linear transformations. In FOCS, 1996.

W. A. Gale, K. W. Church, and D. Yarowsky. One sense per discourse. In 4th DARPA Speech and Natural Language Workshop, 1992.

N. Gravin, J. Lasserre, D. Pasechnik, and S. Robins. The inverse moment problem for convex polytopes. Discrete and Computational Geometry, 2012. To appear.

H. Hotelling. The most predictable criterion. Journal of Educational Psychology, 26(2):139–142, 1935.

D. Hsu, S. M. Kakade, and T. Zhang. A spectral algorithm for learning hidden Markov models. In COLT, 2009.

D. Hsu, S. M. Kakade, and T. Zhang. A spectral algorithm for learning hidden Markov models. Journal of Computer and System Sciences, 2012. To appear.

A. Hyvarinen and E. Oja. Independent component analysis: algorithms and applications. Neural Networks, 13(4–5):411–430, 2000.

H. Jaeger. Observable operator models for discrete stochastic time series. Neural Computation, 12(6), 2000.

A. T. Kalai, A. Moitra, and G. Valiant. Efficiently learning mixtures of two Gaussians. In STOC, 2010.


    R. Kannan, H. Salmasian, and S. Vempala. The spectral method for general mixture models. InCOLT, 2005.

B. G. Lindsay. Moment matrices: applications in mixtures. Annals of Statistics, 17(2):722–740, 1989.

B. G. Lindsay. Mixture models: theory, geometry and applications. American Statistical Association, 1995.

B. G. Lindsay and P. Basak. Multivariate normal mixtures: a fast consistent method. Journal of the American Statistical Association, 88(422):468–476, 1993.

    F. McSherry. Spectral partitioning of random graphs. In FOCS, 2001.

    A. Moitra and G. Valiant. Settling the polynomial learnability of mixtures of Gaussians. In FOCS,2010.

E. Mossel and S. Roch. Learning nonsingular phylogenies and hidden Markov models. Annals of Applied Probability, 16(2):583–614, 2006.

P. Q. Nguyen and O. Regev. Learning a parallelepiped: Cryptanalysis of GGH and NTRU signatures. Journal of Cryptology, 22(2):139–160, 2009.

K. Pearson. Contributions to the mathematical theory of evolution. Philosophical Transactions of the Royal Society, London, A., page 71, 1894.

R. A. Redner and H. F. Walker. Mixture densities, maximum likelihood and the EM algorithm. SIAM Review, 26(2):195–239, 1984.

M. P. Schutzenberger. On the definition of a family of automata. Information and Control, 4:245–270, 1961.

    G. W. Stewart and Ji-Guang Sun. Matrix Perturbation Theory. Academic Press, 1990.

D. M. Titterington, A. F. M. Smith, and U. E. Makov. Statistical analysis of finite mixture distributions. Wiley, 1985.

    S. Vempala and G. Wang. A spectral algorithm for learning mixtures of distributions. In FOCS,2002.

R. Vershynin. Introduction to the non-asymptotic analysis of random matrices. In Y. Eldar and G. Kutyniok, editors, Compressed Sensing, Theory and Applications, chapter 5, pages 210–268. Cambridge University Press, 2012.

    A Analysis of Algorithm A

In this appendix, we give an analysis of Algorithm A (but defer most perturbation arguments to Appendix C), and also present some illustrative empirical results on text data using a modified implementation.


    A.1 Accuracy of moment estimates

Lemma A.1. Fix $\delta \in (0, 1)$. Let $\widehat{\mathrm{Pairs}}$ be the empirical average of $N$ independent copies of $x_1 \otimes x_2$, and let $\widehat{\mathrm{Triples}}$ be the empirical average of $N$ independent copies of $x_1 \otimes x_2 \otimes x_3$. Then

1. $\Pr\Big[\|\widehat{\mathrm{Pairs}} - \mathrm{Pairs}\|_F \leq \frac{1 + \sqrt{\ln(1/\delta)}}{\sqrt{N}}\Big] \geq 1 - \delta$, and

2. $\Pr\Big[\forall\eta \in \mathbb{R}^d,\ \|\widehat{\mathrm{Triples}}(\eta) - \mathrm{Triples}(\eta)\|_F \leq \frac{\|\eta\|_2\,(1 + \sqrt{\ln(1/\delta)})}{\sqrt{N}}\Big] \geq 1 - \delta$.

Proof. The first claim follows from applying Lemma F.1 to the vectorizations of $\widehat{\mathrm{Pairs}}$ and Pairs (whereupon the Frobenius norm is the Euclidean norm of the vectorized matrices). For the second claim, we also apply Lemma F.1 to $\widehat{\mathrm{Triples}}$ and Triples in the same way to obtain, with probability at least $1 - \delta$,
$$\sum_{i=1}^d\sum_{j=1}^d\sum_{x=1}^d(\widehat{\mathrm{Triples}}_{i,j,x} - \mathrm{Triples}_{i,j,x})^2 \leq \frac{(1 + \sqrt{\ln(1/\delta)})^2}{N}.$$
Now condition on this event. For any $\eta = (\eta_1, \eta_2, \dots, \eta_d) \in \mathbb{R}^d$,
$$\|\widehat{\mathrm{Triples}}(\eta) - \mathrm{Triples}(\eta)\|_F^2 = \sum_{i=1}^d\sum_{j=1}^d\Big(\sum_{x=1}^d\eta_x(\widehat{\mathrm{Triples}}_{i,j,x} - \mathrm{Triples}_{i,j,x})\Big)^2 \leq \sum_{i=1}^d\sum_{j=1}^d\|\eta\|_2^2\sum_{x=1}^d(\widehat{\mathrm{Triples}}_{i,j,x} - \mathrm{Triples}_{i,j,x})^2 \leq \frac{\|\eta\|_2^2\,(1 + \sqrt{\ln(1/\delta)})^2}{N},$$
where the first inequality follows by Cauchy-Schwarz. $\square$

    A.2 Proof of Theorem 2.1

Let $E_1$ be the event in which
$$\|\widehat{\mathrm{Pairs}} - \mathrm{Pairs}\|_2 \leq \frac{1 + \sqrt{\ln(1/\delta)}}{\sqrt{N}} \qquad (2)$$
and
$$\|\widehat{\mathrm{Triples}}(v) - \mathrm{Triples}(v)\|_2 \leq \frac{\|v\|_2\,(1 + \sqrt{\ln(1/\delta)})}{\sqrt{N}} \qquad (3)$$
for all $v \in \mathbb{R}^d$. By Lemma A.1, a union bound, and the fact that $\|A\|_2 \leq \|A\|_F$, we have $\Pr[E_1] \geq 1 - 2\delta$. Now condition on $E_1$, and let $E_2$ be the event in which
$$\gamma := \min_{i \neq j}|\langle\widehat{U}\theta, M(e_i - e_j)\rangle| = \min_{i \neq j}|\langle\theta, \widehat{U}^\top M(e_i - e_j)\rangle| > \frac{\sqrt{2}\,\sigma_k(\widehat{U}^\top M)\,\delta}{\sqrt{e}\,\sqrt{k}\,k^2}. \qquad (4)$$
By Lemma C.6 and the fact $\|\widehat{U}^\top M(e_i - e_j)\|_2 \geq \sqrt{2}\,\sigma_k(\widehat{U}^\top M)$, we have $\Pr[E_2 \mid E_1] \geq 1 - \delta$, and thus $\Pr[E_1 \cap E_2] \geq (1 - 2\delta)(1 - \delta) \geq 1 - 3\delta$. So henceforth condition on this joint event $E_1 \cap E_2$.

Let $\epsilon_0 := \frac{\|\widehat{\mathrm{Pairs}} - \mathrm{Pairs}\|_2}{\sigma_k(\mathrm{Pairs})}$, $\epsilon_1 := \frac{\epsilon_0}{1 - \epsilon_0}$, and $\epsilon_2 := \frac{\epsilon_0}{(1 - \epsilon_1^2)(1 - \epsilon_0 - \epsilon_1^2)}$. The conditions on $N$ and the bound in (2) imply that $\epsilon_0 < \frac{1}{1 + \sqrt{2}} \leq \frac{1}{2}$, so Lemma C.1 implies that $\sigma_k(\widehat{U}^\top M) \geq \sqrt{1 - \epsilon_1^2}\,\sigma_k(M)$, $\|(\widehat{U}^\top M)^{-1}\|_2 \leq \frac{1}{\sqrt{1 - \epsilon_1^2}\,\sigma_k(M)}$, and that $\widehat{U}^\top\mathrm{Pairs}\,\widehat{V}$ is invertible. By Lemma 2.2,
$$B(\eta) := (\widehat{U}^\top\mathrm{Triples}(\eta)\widehat{V})(\widehat{U}^\top\mathrm{Pairs}\,\widehat{V})^{-1} = (\widehat{U}^\top M)\operatorname{diag}(M^\top\eta)(\widehat{U}^\top M)^{-1}.$$
Thus, Lemma C.2 implies
$$\|\widehat{B}(\eta) - B(\eta)\|_2 \leq \frac{\|\widehat{\mathrm{Triples}}(\eta) - \mathrm{Triples}(\eta)\|_2}{(1 - \epsilon_0)\,\sigma_k(\mathrm{Pairs})} + \frac{\epsilon_2}{\sigma_k(\mathrm{Pairs})}. \qquad (5)$$
Let $R := \widehat{U}^\top M\operatorname{diag}(\|\widehat{U}^\top Me_1\|_2, \|\widehat{U}^\top Me_2\|_2, \dots, \|\widehat{U}^\top Me_k\|_2)^{-1}$ and $\epsilon_3 := \frac{\|\widehat{B}(\eta) - B(\eta)\|_2\,\kappa(R)}{\gamma}$. Note that $R$ has unit norm columns, and that $R^{-1}B(\eta)R = \operatorname{diag}(M^\top\eta)$. By Lemma C.5 and the fact $\|M\|_2 \leq \sqrt{k}\,\|M\|_1 = \sqrt{k}$,
$$\|R^{-1}\|_2 \leq \frac{\|M\|_2}{\sqrt{1 - \epsilon_1^2}\,\sigma_k(M)} \leq \frac{\sqrt{k}}{\sqrt{1 - \epsilon_1^2}\,\sigma_k(M)} \qquad (6)$$
and
$$\kappa(R) \leq \frac{\sqrt{k}}{(1 - \epsilon_1^2)\,\sigma_k(M)^2}. \qquad (7)$$
The conditions on $N$ and the bounds in (2), (3), (4), (5), and (7) imply that $\epsilon_3 < \frac{1}{2}$. By Lemma C.3, there exists a permutation $\tau$ on $[k]$ such that, for all $j \in [k]$,
$$\big\|s_j\hat\xi_j - \widehat{U}^\top\mu_{\tau(j)}/c_j\big\|_2 = \big\|s_j\hat\xi_j - Re_{\tau(j)}\big\|_2 \leq 4k\,\|R^{-1}\|_2\,\epsilon_3, \qquad (8)$$
where $s_j := \operatorname{sign}(\langle\hat\xi_j, \widehat{U}^\top\mu_{\tau(j)}\rangle)$ and $c_j := \|\widehat{U}^\top\mu_{\tau(j)}\|_2$ (the eigenvectors $\hat\xi_j$ are unique up to sign $s_j$ because each eigenvalue has geometric multiplicity 1). Since $\mu_{\tau(j)} \in \operatorname{range}(U)$, Lemma C.1 and the bounds in (8) and (6) imply
$$\|s_j\widehat{U}\hat\xi_j - \mu_{\tau(j)}/c_j\|_2 \leq \sqrt{\|s_j\hat\xi_j - \widehat{U}^\top\mu_{\tau(j)}/c_j\|_2^2 + \|\mu_{\tau(j)}/c_j\|_2^2\,\epsilon_1^2} \leq \|s_j\hat\xi_j - \widehat{U}^\top\mu_{\tau(j)}/c_j\|_2 + \|\mu_{\tau(j)}/c_j\|_2\,\epsilon_1 \leq 4k\,\|R^{-1}\|_2\,\epsilon_3 + \epsilon_1 \leq \frac{4k\sqrt{k}}{\sqrt{1 - \epsilon_1^2}\,\sigma_k(M)}\,\epsilon_3 + \epsilon_1.$$
Therefore, for $\hat{c}_j := s_jc_j\langle\vec{1}, \widehat{U}\hat\xi_j\rangle$, we have
$$\|\hat{c}_j\hat\mu_j - \mu_{\tau(j)}\|_2 = \|c_js_j\widehat{U}\hat\xi_j - \mu_{\tau(j)}\|_2 \leq \|\mu_{\tau(j)}\|_2\Big(\frac{4k\sqrt{k}}{\sqrt{1 - \epsilon_1^2}\,\sigma_k(M)}\,\epsilon_3 + \epsilon_1\Big).$$


    B Proofs and details from Section 3

In this section, we provide omitted proofs and discussion from Section 3.

    B.1 Proof of Lemma 3.1

By conditional independence,
$$P_{1,2} = \mathbb{E}[\mathbb{E}[x_1 \otimes x_2 \mid h]] = \mathbb{E}[\mathbb{E}[x_1 \mid h] \otimes \mathbb{E}[x_2 \mid h]] = \mathbb{E}[(M_1e_h) \otimes (M_2e_h)] = M_1\Big(\sum_{t=1}^k w_t\,e_t \otimes e_t\Big)M_2^\top = M_1\operatorname{diag}(w)M_2^\top.$$
Similarly,
$$P_{1,2,3}(\eta) = \mathbb{E}[\mathbb{E}[(x_1 \otimes x_2)\langle\eta, x_3\rangle \mid h]] = \mathbb{E}\big[\mathbb{E}[x_1 \mid h] \otimes \mathbb{E}[x_2 \mid h]\,\langle\eta, \mathbb{E}[x_3 \mid h]\rangle\big] = \mathbb{E}\big[(M_1e_h) \otimes (M_2e_h)\,\langle\eta, M_3e_h\rangle\big] = M_1\Big(\sum_{t=1}^k w_t\,e_t \otimes e_t\,\langle\eta, M_3e_t\rangle\Big)M_2^\top = M_1\operatorname{diag}(M_3^\top\eta)\operatorname{diag}(w)M_2^\top. \qquad\square$$

    B.2 Proof of Lemma 3.2

We have $U_1^\top P_{1,2}U_2 = (U_1^\top M_1)\operatorname{diag}(w)(M_2^\top U_2)$ by Lemma 3.1, which is invertible by the assumptions on $U_v$ and Condition 3.1. Moreover, also by Lemma 3.1,
$$B_{1,2,3}(\eta) = (U_1^\top P_{1,2,3}(\eta)U_2)(U_1^\top P_{1,2}U_2)^{-1} = (U_1^\top M_1\operatorname{diag}(M_3^\top\eta)\operatorname{diag}(w)M_2^\top U_2)(U_1^\top P_{1,2}U_2)^{-1} = (U_1^\top M_1)\operatorname{diag}(M_3^\top\eta)(U_1^\top M_1)^{-1}(U_1^\top M_1\operatorname{diag}(w)M_2^\top U_2)(U_1^\top P_{1,2}U_2)^{-1} = (U_1^\top M_1)\operatorname{diag}(M_3^\top\eta)(U_1^\top M_1)^{-1}. \qquad\square$$

    B.3 Proof of Lemma 3.3

By Lemma 3.2,
$$(U_1^\top M_1)^{-1}B_{1,2,3}(U_3\theta_i)(U_1^\top M_1) = \operatorname{diag}(M_3^\top U_3\theta_i) = \operatorname{diag}\big(\langle\theta_i, U_3^\top M_3e_1\rangle, \langle\theta_i, U_3^\top M_3e_2\rangle, \dots, \langle\theta_i, U_3^\top M_3e_k\rangle\big) = \operatorname{diag}(\lambda_{i,1}, \lambda_{i,2}, \dots, \lambda_{i,k})$$
for all $i \in [k]$, and therefore
$$L = \begin{bmatrix} \langle\theta_1, U_3^\top M_3e_1\rangle & \langle\theta_1, U_3^\top M_3e_2\rangle & \cdots & \langle\theta_1, U_3^\top M_3e_k\rangle \\ \langle\theta_2, U_3^\top M_3e_1\rangle & \langle\theta_2, U_3^\top M_3e_2\rangle & \cdots & \langle\theta_2, U_3^\top M_3e_k\rangle \\ \vdots & \vdots & \ddots & \vdots \\ \langle\theta_k, U_3^\top M_3e_1\rangle & \langle\theta_k, U_3^\top M_3e_2\rangle & \cdots & \langle\theta_k, U_3^\top M_3e_k\rangle \end{bmatrix} = \Theta\,U_3^\top M_3. \qquad\square$$


    B.4 Ordering issues

Although Algorithm B only explicitly yields estimates for $M_3$, it can easily be applied to estimate $M_v$ for all other views $v$. The main caveat is that the estimators may not yield the same ordering of the columns, due to the unspecified order of the eigenvectors obtained in the third step of the method, and therefore some care is needed to obtain a consistent ordering. However, this ordering issue can be handled by exploiting consistency across the multiple views.

The first step is to perform the estimation of $M_3$ using Algorithm B as is. Then, to estimate $M_2$, one may re-use the eigenvectors in $\widehat{R}_1$ to diagonalize $\widehat{B}_{1,3,2}(\eta)$, as $B_{1,2,3}(\eta)$ and $B_{1,3,2}(\eta)$ share the same eigenvectors. The same goes for estimating $M_v$ for all other views $v$ except $v = 1$.

It remains to provide a way to estimate $M_1$. Observe that $M_2$ can be estimated in at least two ways: via the operators $\widehat{B}_{1,3,2}(\eta)$, or via the operators $\widehat{B}_{3,1,2}(\eta)$. This is because the eigenvalues of $B_{3,1,2}(\eta)$ and $B_{1,3,2}(\eta)$ are identical. Because the eigenvalues are also sufficiently separated from each other, the eigenvectors $\widehat{R}_3$ of $\widehat{B}_{3,1,2}(\eta)$ can be put in the same order as the eigenvectors $\widehat{R}_1$ of $\widehat{B}_{1,3,2}(\eta)$ by (approximately) matching up their respective corresponding eigenvalues. Finally, the appropriately re-ordered eigenvectors $\widehat{R}_3$ can then be used to diagonalize $\widehat{B}_{3,2,1}(\eta)$ to estimate $M_1$.
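A small illustrative helper (hypothetical, not from the paper) for the eigenvalue-matching step just described: given the diagonal entries produced with $\widehat{R}_1$ and with $\widehat{R}_3$, it returns the column permutation that aligns the latter with the former.

    import numpy as np

    def match_order(lams_ref, lams_other):
        """Greedy one-to-one matching of two k-vectors of (well-separated) eigenvalues.
        Returns perm such that lams_other[perm] approximately equals lams_ref."""
        k = len(lams_ref)
        perm, used = np.empty(k, dtype=int), set()
        for j in range(k):
            order = np.argsort(np.abs(lams_other - lams_ref[j]))
            perm[j] = next(i for i in order if i not in used)
            used.add(perm[j])
        return perm

    # Example: R3 = R3[:, match_order(lams_from_R1, lams_from_R3)] re-orders the eigenvectors.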

    B.5 Estimating the mixing weights

Given the estimate of $M_3$, one can obtain an estimate of $w$ using
$$\widehat{w} := \widehat{M}_3^\dagger\,\widehat{\mathbb{E}}[x_3],$$
where $A^\dagger$ denotes the Moore-Penrose pseudoinverse of $A$ (though other generalized inverses may work as well), and $\widehat{\mathbb{E}}[x_3]$ is the empirical average of $x_3$. This estimator is based on the following observation:
$$\mathbb{E}[x_3] = \mathbb{E}[\mathbb{E}[x_3 \mid h]] = M_3\mathbb{E}[e_h] = M_3w,$$
and therefore
$$M_3^\dagger\,\mathbb{E}[x_3] = M_3^\dagger M_3w = w,$$
since $M_3$ has full column rank.
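In code this step is a single pseudoinverse solve. The sketch below is illustrative (the clipping and renormalization onto the simplex is a practical touch, not part of the paper's estimator); M3_hat is the estimated mean matrix and X3 holds the third-view observations as rows:

    import numpy as np

    def estimate_weights(M3_hat, X3):
        """w_hat = pinv(M3_hat) @ mean(x3); then project crudely onto the simplex."""
        w_hat = np.linalg.pinv(M3_hat) @ X3.mean(axis=0)
        w_hat = np.clip(w_hat, 0.0, None)
        return w_hat / w_hat.sum()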

    B.6 Proof of Theorem 3.1

The proof is similar to that of Theorem 2.1, so we just describe the essential differences. As before, most perturbation arguments are deferred to Appendix C.

First, let $E_1$ be the event in which
$$\|\widehat{P}_{1,2} - P_{1,2}\|_2 \leq C_{1,2}\,f(N, \delta), \qquad \|\widehat{P}_{1,3} - P_{1,3}\|_2 \leq C_{1,3}\,f(N, \delta),$$
and $\|\widehat{P}_{1,2,3}(\widehat{U}_3\theta_i) - P_{1,2,3}(\widehat{U}_3\theta_i)\|_2 \leq C_{1,2,3}\,f(N, \delta/k)$ for all $i \in [k]$. By Condition 3.2 and a union bound, we have $\Pr[E_1] \geq 1 - 3\delta$. Second, let $E_2$ be the event in which
$$\gamma := \min_{i \in [k]}\min_{j \neq j'}|\langle\theta_i, \widehat{U}_3^\top M_3(e_j - e_{j'})\rangle| > \frac{\min_{j \neq j'}\|\widehat{U}_3^\top M_3(e_j - e_{j'})\|_2\,\delta}{\sqrt{e}\,\sqrt{k}\,k^2\,k}$$
and
$$\gamma_{\max} := \max_{i,j \in [k]}|\langle\theta_i, \widehat{U}_3^\top M_3e_j\rangle| \leq \frac{\max_{j \in [k]}\|M_3e_j\|_2}{\sqrt{k}}\Big(1 + \sqrt{2\ln(k^2/\delta)}\Big).$$
Since each $\theta_i$ is distributed uniformly over $S^{k-1}$, it follows from Lemma C.6 and a union bound that $\Pr[E_2 \mid E_1] \geq 1 - 2\delta$. Therefore $\Pr[E_1 \cap E_2] \geq (1 - 3\delta)(1 - 2\delta) \geq 1 - 5\delta$.

Let $U_3 \in \mathbb{R}^{d \times k}$ be the matrix of top $k$ orthonormal left singular vectors of $M_3$. By Lemma C.1 and the conditions on $N$, we have $\sigma_k(U_3^\top\widehat{U}_3) \geq 1/2$, and therefore
$$\gamma > \frac{\min_{i \neq i'}\|M_3(e_i - e_{i'})\|_2\,\delta}{2\sqrt{e}\,\sqrt{k}\,k^2\,k} \qquad\text{and}\qquad \frac{\gamma_{\max}}{\gamma} \leq \frac{\sqrt{e}\,k^3\big(1 + \sqrt{2\ln(k^2/\delta)}\big)}{\delta}\,\kappa_0(M_3),$$
where
$$\kappa_0(M_3) := \frac{\max_{i \in [k]}\|M_3e_i\|_2}{\min_{i \neq i'}\|M_3(e_i - e_{i'})\|_2}.$$
Let $\eta_i := \widehat{U}_3\theta_i$ for $i \in [k]$. By Lemma C.1, $\widehat{U}_1^\top\widehat{P}_{1,2}\widehat{U}_2$ is invertible, so we may define $\widehat{B}_{1,2,3}(\eta_i) := (\widehat{U}_1^\top\widehat{P}_{1,2,3}(\eta_i)\widehat{U}_2)(\widehat{U}_1^\top\widehat{P}_{1,2}\widehat{U}_2)^{-1}$. By Lemma 3.2,
$$B_{1,2,3}(\eta_i) = (\widehat{U}_1^\top M_1)\operatorname{diag}(M_3^\top\eta_i)(\widehat{U}_1^\top M_1)^{-1}.$$
Also define $R := \widehat{U}_1^\top M_1\operatorname{diag}(\|\widehat{U}_1^\top M_1e_1\|_2, \|\widehat{U}_1^\top M_1e_2\|_2, \dots, \|\widehat{U}_1^\top M_1e_k\|_2)^{-1}$. Using most of the same arguments as in the proof of Theorem 2.1, we have
$$\|R^{-1}\|_2 \leq 2\kappa(M_1), \qquad (9)$$
$$\kappa(R) \leq 4\kappa(M_1)^2, \qquad (10)$$
$$\|\widehat{B}_{1,2,3}(\eta_i) - B_{1,2,3}(\eta_i)\|_2 \leq \frac{2\,\|\widehat{P}_{1,2,3}(\eta_i) - P_{1,2,3}(\eta_i)\|_2}{\sigma_k(P_{1,2})} + \frac{2\,\|P_{1,2,3}\|_2\,\|\widehat{P}_{1,2} - P_{1,2}\|_2}{\sigma_k(P_{1,2})^2}.$$
By Lemma C.3, the operator $\widehat{B}_{1,2,3}(\eta_1)$ has $k$ distinct eigenvalues, and hence its matrix of right eigenvectors $\widehat{R}_1$ is unique up to column scaling and ordering. This in turn implies that $\widehat{R}_1^{-1}$ is unique up to row scaling and ordering. Therefore, for each $i \in [k]$, the $\hat\lambda_{i,j} = e_j^\top\widehat{R}_1^{-1}\widehat{B}_{1,2,3}(\eta_i)\widehat{R}_1e_j$ for $j \in [k]$ are uniquely defined up to ordering. Moreover, by Lemma C.4 and the above bounds on $\|\widehat{B}_{1,2,3}(\eta_i) - B_{1,2,3}(\eta_i)\|_2$ and $\gamma$, there exists a permutation $\tau$ on $[k]$ such that, for all $i, j \in [k]$,
$$|\hat\lambda_{i,j} - \lambda_{i,\tau(j)}| \leq \Big(3\kappa(R) + 16k^{1.5}\,\kappa(R)\,\|R^{-1}\|_2^2\,\frac{\gamma_{\max}}{\gamma}\Big)\|\widehat{B}_{1,2,3}(\eta_i) - B_{1,2,3}(\eta_i)\|_2 \leq \Big(12\kappa(M_1)^2 + 256k^{1.5}\,\kappa(M_1)^4\,\frac{\gamma_{\max}}{\gamma}\Big)\|\widehat{B}_{1,2,3}(\eta_i) - B_{1,2,3}(\eta_i)\|_2, \qquad (11)$$
where the second inequality uses (9) and (10). Let $\hat\lambda_j := (\hat\lambda_{1,j}, \hat\lambda_{2,j}, \dots, \hat\lambda_{k,j}) \in \mathbb{R}^k$ and $\lambda_j := (\lambda_{1,j}, \lambda_{2,j}, \dots, \lambda_{k,j}) \in \mathbb{R}^k$. Observe that $\lambda_j = \Theta\widehat{U}_3^\top M_3e_j = \Theta\widehat{U}_3^\top\mu_{3,j}$ by Lemma 3.3. By the orthogonality of $\Theta$, the fact $\|v\|_2 \leq \sqrt{k}\,\|v\|_\infty$ for $v \in \mathbb{R}^k$, and (11),
$$\|\Theta^{-1}\hat\lambda_j - \widehat{U}_3^\top\mu_{3,\tau(j)}\|_2 = \|\Theta^{-1}(\hat\lambda_j - \lambda_{\tau(j)})\|_2 = \|\hat\lambda_j - \lambda_{\tau(j)}\|_2 \leq \sqrt{k}\,\max_i|\hat\lambda_{i,j} - \lambda_{i,\tau(j)}| \leq \Big(12\sqrt{k}\,\kappa(M_1)^2 + 256k^2\,\kappa(M_1)^4\,\frac{\gamma_{\max}}{\gamma}\Big)\max_{i \in [k]}\|\widehat{B}_{1,2,3}(\eta_i) - B_{1,2,3}(\eta_i)\|_2.$$


Finally, by Lemma C.1 (as applied to $P_{1,3}$ and $\widehat{P}_{1,3}$),
$$\|\hat\mu_{3,j} - \mu_{3,\tau(j)}\|_2 \leq \|\Theta^{-1}\hat\lambda_j - \widehat{U}_3^\top\mu_{3,\tau(j)}\|_2 + 2\,\|\mu_{3,\tau(j)}\|_2\,\frac{\|\widehat{P}_{1,3} - P_{1,3}\|_2}{\sigma_k(P_{1,3})}.$$
Making all of the substitutions into the above bound gives
$$\|\hat\mu_{3,j} - \mu_{3,\tau(j)}\|_2 \leq \frac{6C\,k^5\,\kappa(M_1)^4\,\kappa_0(M_3)\,\sqrt{\ln(k/\delta)}}{\delta}\left(\frac{C_{1,2,3}\,f(N, \delta/k)}{\sigma_k(P_{1,2})} + \frac{\|P_{1,2,3}\|_2\,C_{1,2}\,f(N, \delta)}{\sigma_k(P_{1,2})^2}\right) + 6C\,\|\mu_{3,\tau(j)}\|_2\,\frac{C_{1,3}\,f(N, \delta)}{\sigma_k(P_{1,3})} \leq \frac{\epsilon}{2}\max_{j' \in [k]}\|\mu_{3,j'}\|_2 + \frac{\epsilon}{2}\|\mu_{3,\tau(j)}\|_2 \leq \epsilon\,\max_{j' \in [k]}\|\mu_{3,j'}\|_2. \qquad\square$$

    C Perturbation analysis for observable operators

The following lemma establishes the accuracy of approximating the fundamental subspaces (i.e., the row and column spaces) of a matrix $X$ by computing the singular value decomposition of a perturbation $\widehat{X}$ of $X$.

Lemma C.1. Let $X \in \mathbb{R}^{m \times n}$ be a matrix of rank $k$. Let $U \in \mathbb{R}^{m \times k}$ and $V \in \mathbb{R}^{n \times k}$ be matrices with orthonormal columns such that $\operatorname{range}(U)$ and $\operatorname{range}(V)$ are spanned by, respectively, the left and right singular vectors of $X$ corresponding to its $k$ largest singular values. Similarly define $\widehat{U} \in \mathbb{R}^{m \times k}$ and $\widehat{V} \in \mathbb{R}^{n \times k}$ relative to a matrix $\widehat{X} \in \mathbb{R}^{m \times n}$. Define $\epsilon_X := \|\widehat{X} - X\|_2$, $\epsilon_0 := \frac{\epsilon_X}{\sigma_k(X)}$, and $\epsilon_1 := \frac{\epsilon_0}{1 - \epsilon_0}$. Assume $\epsilon_0 < \frac{1}{2}$. Then

1. $\epsilon_1 < 1$;
2. $\sigma_k(\widehat{X}) \geq (1 - \epsilon_0)\,\sigma_k(X) > 0$;
3. $\sigma_k(U^\top\widehat{U}) \geq \sqrt{1 - \epsilon_1^2}$;
4. $\sigma_k(V^\top\widehat{V}) \geq \sqrt{1 - \epsilon_1^2}$;
5. $\sigma_k(\widehat{U}^\top X\widehat{V}) \geq (1 - \epsilon_1^2)\,\sigma_k(X)$;
6. for any $\alpha \in \mathbb{R}^k$ and $v \in \operatorname{range}(U)$, $\|\widehat{U}\alpha - v\|_2^2 \leq \|\alpha - \widehat{U}^\top v\|_2^2 + \|v\|_2^2\,\epsilon_1^2$.

Proof. The first claim follows from the assumption on $\epsilon_0$. The second claim follows from the assumptions and Weyl's theorem (Lemma E.1). Let the columns of $\widehat{U}_\perp \in \mathbb{R}^{m \times (m-k)}$ be an orthonormal basis for the orthogonal complement of $\operatorname{range}(\widehat{U})$, so that $\|\widehat{U}_\perp^\top U\|_2 \leq \epsilon_X/\sigma_k(\widehat{X}) \leq \epsilon_1$ by Wedin's theorem (Lemma E.2). The third claim then follows because $\sigma_k(U^\top\widehat{U})^2 = 1 - \|\widehat{U}_\perp^\top U\|_2^2 \geq 1 - \epsilon_1^2$. The fourth claim is analogous to the third claim, and the fifth claim follows from the third and fourth. The sixth claim follows from writing $v = U\beta$ for some $\beta \in \mathbb{R}^k$, and using the decomposition $\|\widehat{U}\alpha - v\|_2^2 = \|\widehat{U}\widehat{U}^\top(\widehat{U}\alpha - v)\|_2^2 + \|\widehat{U}_\perp\widehat{U}_\perp^\top v\|_2^2 = \|\alpha - \widehat{U}^\top v\|_2^2 + \|\widehat{U}_\perp^\top U\beta\|_2^2 \leq \|\alpha - \widehat{U}^\top v\|_2^2 + \|\widehat{U}_\perp^\top U\|_2^2\,\|\beta\|_2^2 \leq \|\alpha - \widehat{U}^\top v\|_2^2 + \|v\|_2^2\,\epsilon_1^2$, where the last inequality follows from the argument for the third claim, and uses the orthonormality of the columns of $U$ (so $\|\beta\|_2 = \|v\|_2$). $\square$


The next lemma bounds the error of the observable operator in terms of the errors in estimating the second-order and third-order moments.

Lemma C.2. Consider the setting and definitions from Lemma C.1, and let $Y \in \mathbb{R}^{m \times n}$ and $\widehat{Y} \in \mathbb{R}^{m \times n}$ be given. Define $\epsilon_2 := \frac{\epsilon_0}{(1 - \epsilon_1^2)(1 - \epsilon_0 - \epsilon_1^2)}$ and $\epsilon_Y := \|\widehat{Y} - Y\|_2$. Assume $\epsilon_0 < \frac{1}{1 + \sqrt{2}}$. Then

1. $\widehat{U}^\top X\widehat{V}$ and $\widehat{U}^\top\widehat{X}\widehat{V}$ are both invertible, and $\|(\widehat{U}^\top\widehat{X}\widehat{V})^{-1} - (\widehat{U}^\top X\widehat{V})^{-1}\|_2 \leq \frac{\epsilon_2}{\sigma_k(X)}$;
2. $\|(\widehat{U}^\top\widehat{Y}\widehat{V})(\widehat{U}^\top\widehat{X}\widehat{V})^{-1} - (\widehat{U}^\top Y\widehat{V})(\widehat{U}^\top X\widehat{V})^{-1}\|_2 \leq \frac{\epsilon_Y}{(1 - \epsilon_0)\,\sigma_k(X)} + \frac{\|Y\|_2\,\epsilon_2}{\sigma_k(X)}$.

Proof. Let $S := \widehat{U}^\top X\widehat{V}$ and $\widehat{S} := \widehat{U}^\top\widehat{X}\widehat{V}$. By Lemma C.1, $\widehat{U}^\top X\widehat{V}$ is invertible, $\sigma_k(S) \geq \sigma_k(U^\top\widehat{U})\,\sigma_k(X)\,\sigma_k(V^\top\widehat{V}) \geq (1 - \epsilon_1^2)\,\sigma_k(X)$ (so $S$ is also invertible), and $\|\widehat{S} - S\|_2 \leq \epsilon_0\,\sigma_k(X) \leq \frac{\epsilon_0}{1 - \epsilon_1^2}\,\sigma_k(S)$. The assumption on $\epsilon_0$ implies that $\frac{\epsilon_0}{1 - \epsilon_1^2} < 1$.


Since $\widehat{A}$ is real, all non-real eigenvalues of $\widehat{A}$ must come in conjugate pairs; so the existence of a non-real eigenvalue of $\widehat{A}$ would contradict (12). This proves the first claim.

For the second claim, assume for notational simplicity that the permutation $\tau$ is the identity permutation. Let $\widehat{R} \in \mathbb{R}^{k \times k}$ be the matrix whose $i$-th column is $\hat\xi_i$. Define $\ell_i \in \mathbb{R}^k$ to be the $i$-th row of $R^{-1}$ (i.e., the $i$-th left eigenvector of $A$), and similarly define $\hat\ell_i \in \mathbb{R}^k$ to be the $i$-th row of $\widehat{R}^{-1}$. Fix a particular $i \in [k]$. Since $\{\xi_1, \xi_2, \dots, \xi_k\}$ forms a basis for $\mathbb{R}^k$, we can write $\hat\xi_i = \sum_{j=1}^k c_{i,j}\,\xi_j$ for some coefficients $c_{i,1}, c_{i,2}, \dots, c_{i,k} \in \mathbb{R}$. We may assume $c_{i,i} \geq 0$ (or else we replace $\hat\xi_i$ with $-\hat\xi_i$). The fact that $\|\hat\xi_i\|_2 = \|\xi_j\|_2 = 1$ for all $j \in [k]$ and the triangle inequality imply $1 = \|\hat\xi_i\|_2 \leq c_{i,i}\|\xi_i\|_2 + \sum_{j \neq i}|c_{i,j}|\,\|\xi_j\|_2 = c_{i,i} + \sum_{j \neq i}|c_{i,j}|$, and therefore
$$\|\hat\xi_i - \xi_i\|_2 \leq |1 - c_{i,i}|\,\|\xi_i\|_2 + \sum_{j \neq i}|c_{i,j}|\,\|\xi_j\|_2 \leq 2\sum_{j \neq i}|c_{i,j}|$$
again by the triangle inequality. Therefore, it suffices to show $|c_{i,j}| \leq 2\|R^{-1}\|_2\,\epsilon_3$ for $j \neq i$ to prove the second claim.

Observe that $A\hat\xi_i = A\big(\sum_{j=1}^k c_{i,j}\xi_j\big) = \sum_{j=1}^k c_{i,j}\lambda_j\xi_j$, and therefore
$$\sum_{j=1}^k c_{i,j}\lambda_j\xi_j + (\widehat{A} - A)\hat\xi_i = \widehat{A}\hat\xi_i = \hat\lambda_i\hat\xi_i = \lambda_i\sum_{j=1}^k c_{i,j}\xi_j + (\hat\lambda_i - \lambda_i)\hat\xi_i.$$
Multiplying through the above equation by $\ell_j^\top$, and using the fact that $\ell_j^\top\xi_i = \mathbb{1}\{j = i\}$, gives
$$c_{i,j}\lambda_j + \ell_j^\top(\widehat{A} - A)\hat\xi_i = \lambda_ic_{i,j} + (\hat\lambda_i - \lambda_i)\,\ell_j^\top\hat\xi_i.$$
The above equation rearranges to $(\lambda_j - \lambda_i)c_{i,j} = (\hat\lambda_i - \lambda_i)\,\ell_j^\top\hat\xi_i + \ell_j^\top(A - \widehat{A})\hat\xi_i$, and therefore
$$|c_{i,j}| \leq \frac{\|\ell_j\|_2\,\big(|\hat\lambda_i - \lambda_i| + \|(\widehat{A} - A)\hat\xi_i\|_2\big)}{|\lambda_j - \lambda_i|} \leq \frac{\|R^{-1}\|_2\,\big(|\hat\lambda_i - \lambda_i| + \|\widehat{A} - A\|_2\big)}{|\lambda_j - \lambda_i|}$$
by the Cauchy-Schwarz and triangle inequalities and the sub-multiplicative property of the spectral norm. The bound $|c_{i,j}| \leq 2\|R^{-1}\|_2\,\epsilon_3$ then follows from the first claim.

The third claim follows from standard comparisons of matrix norms. $\square$

The next lemma gives perturbation bounds for estimating the eigenvalues of simultaneously diagonalizable matrices $A_1, A_2, \dots, A_k$. The eigenvectors $\widehat{R}$ are taken from a perturbation of the first matrix $A_1$, and are then subsequently used to approximately diagonalize the perturbations of the remaining matrices $A_2, \dots, A_k$. In practice, one may use Jacobi-like procedures to approximately solve the joint eigenvalue problem.

The next lemma gives perturbation bounds for estimating the eigenvalues of simultaneously diagonalizable matrices $A_1, A_2, \dots, A_k$. The eigenvectors $\hat R$ are taken from a perturbation of the first matrix $A_1$, and are then subsequently used to approximately diagonalize the perturbations of the remaining matrices $A_2, \dots, A_k$. In practice, one may use Jacobi-like procedures to approximately solve the joint eigenvalue problem.
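For concreteness, here is a minimal sketch (synthetic matrices; all names are hypothetical) of the simple non-Jacobi variant of this procedure: take eigenvectors from the perturbed $\hat A_1$ and reuse them to approximately diagonalize the other perturbed matrices.

```python
import numpy as np

rng = np.random.default_rng(2)
k = 4

# Simultaneously diagonalizable A_1, ..., A_k sharing an invertible R with unit-length columns.
R = rng.standard_normal((k, k))
R /= np.linalg.norm(R, axis=0, keepdims=True)
true_eigs = [np.sort(rng.uniform(-1.0, 1.0, size=k)) for _ in range(k)]
A = [R @ np.diag(lam) @ np.linalg.inv(R) for lam in true_eigs]
A_hat = [Ai + 1e-6 * rng.standard_normal((k, k)) for Ai in A]

# Take eigenvectors from the perturbation of A_1 only ...
_, R_hat = np.linalg.eig(A_hat[0])
R_hat_inv = np.linalg.inv(R_hat)

# ... and reuse them to approximately diagonalize every other perturbed matrix.
# Up to one common permutation of the columns of R_hat, these recover true_eigs[i].
est_eigs = [np.real(np.diag(R_hat_inv @ Ah @ R_hat)) for Ah in A_hat]
```

A Jacobi-type joint diagonalization would instead pool information across all of the $\hat A_i$, which can be more robust in practice.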

Lemma C.4. Let $A_1, A_2, \dots, A_k \in \mathbb{R}^{k\times k}$ be diagonalizable matrices that are diagonalized by the same invertible matrix $R \in \mathbb{R}^{k\times k}$ with unit-length columns ($\|R e_j\|_2 = 1$ for all $j \in [k]$), such that each $A_i$ has $k$ distinct real eigenvalues:
\[
R^{-1} A_i R = \mathrm{diag}(\lambda_{i,1}, \lambda_{i,2}, \dots, \lambda_{i,k}) .
\]

Let $\hat A_1, \hat A_2, \dots, \hat A_k \in \mathbb{R}^{k\times k}$ be given. Define $\epsilon_A := \max_i \|\hat A_i - A_i\|_2$, $\gamma_A := \min_i \min_{j\ne j'} |\lambda_{i,j} - \lambda_{i,j'}|$, $\lambda_{\max} := \max_{i,j} |\lambda_{i,j}|$, $\epsilon_3 := \kappa(R)\,\epsilon_A/\gamma_A$, and $\epsilon_4 := 4 k^{1.5}\, \|R^{-1}\|_2\, \epsilon_3$.

1. $\Pr\Bigl[\ \forall\,\{i,j\} \subseteq [n],\ i \ne j:\ |\langle \theta, A(e_i - e_j)\rangle| \ge \frac{\|A(e_i - e_j)\|_2\,\delta}{\sqrt{e\,m}\,\binom{n}{2}}\ \Bigr] \ge 1 - \delta$.

2. $\Pr\Bigl[\ \forall\, i \in [m]:\ |\langle \theta, A e_i\rangle| \le \frac{\|A e_i\|_2}{\sqrt{m}}\,\bigl(1 + \sqrt{2\ln(m/\delta)}\bigr)\ \Bigr] \ge 1 - \delta$.

Proof. For the first claim, let $\delta_0 := \delta/\binom{n}{2}$. By Lemma F.2, for any fixed pair $\{i,j\} \subseteq [n]$ and $\epsilon := \delta_0/\sqrt{e}$,
\[
\Pr\Bigl[\,|\langle\theta, A(e_i - e_j)\rangle| \le \|A(e_i - e_j)\|_2\,\frac{1}{\sqrt{m}}\cdot\frac{\delta_0}{\sqrt{e}}\,\Bigr]
\le \exp\Bigl(\tfrac{1}{2}\bigl(1 - (\delta_0^2/e) + \ln(\delta_0^2/e)\bigr)\Bigr)
\le \delta_0 .
\]
Therefore the first claim follows by a union bound over all $\binom{n}{2}$ pairs $\{i,j\}$.

For the second claim, apply Lemma F.2 with $\beta := 1 + t$ and $t := \sqrt{2\ln(m/\delta)}$ to obtain
\[
\Pr\Bigl[\,|\langle\theta, A e_i\rangle| \ge \frac{\|A e_i\|_2}{\sqrt{m}}\,(1 + t)\,\Bigr]
\le \exp\Bigl(\tfrac{1}{2}\bigl(1 - (1+t)^2 + 2\ln(1+t)\bigr)\Bigr)
\le \exp\Bigl(\tfrac{1}{2}\bigl(1 - (1+t)^2 + 2t\bigr)\Bigr)
= e^{-t^2/2} = \delta/m .
\]


Therefore the second claim follows by taking a union bound over all $i \in [m]$.

    D Proofs and details from Section 4

In this section, we provide omitted proofs and details from Section 4.

    D.1 Proof of Proposition 4.1

    As in the proof of Lemma 3.1, it is easy to show that

\[
Q_{1,2,3}(\eta, \lambda)
= \mathbb{E}\bigl[\mathbb{E}[x_1|h] \otimes \mathbb{E}[x_2|h]\ \langle \eta,\ \mathbb{E}[x_3 \otimes x_3|h]\,\lambda\rangle\bigr]
= M_1\, \mathbb{E}\bigl[e_h \otimes e_h\ \langle \eta,\ (\mu_{3,h} \otimes \mu_{3,h} + \Sigma_{3,h})\,\lambda\rangle\bigr]\, M_2^\top
= M_1\, \mathrm{diag}\bigl(\langle\eta, \mu_{3,t}\rangle\,\langle\lambda, \mu_{3,t}\rangle + \langle\eta, \Sigma_{3,t}\,\lambda\rangle : t \in [k]\bigr)\, \mathrm{diag}(\vec w)\, M_2^\top .
\]

    The claim then follows from the same arguments used in the proof of Lemma 3.2.

D.2 Proof of Proposition 4.2

The conditional independence properties follow from the HMM conditional independence assumptions. To check the parameters, observe first that
\[
\Pr[h_1 = i \mid h_2 = j]
= \frac{\Pr[h_2 = j \mid h_1 = i]\,\Pr[h_1 = i]}{\Pr[h_2 = j]}
= \frac{T_{j,i}\,\pi_i}{(T\pi)_j}
= e_i^\top\, \mathrm{diag}(\pi)\, T^\top\, \mathrm{diag}(T\pi)^{-1}\, e_j
\]
by Bayes' rule. Therefore
\[
M_1 e_j = \mathbb{E}[x_1 \mid h_2 = j] = O\, \mathbb{E}[e_{h_1} \mid h_2 = j] = O\, \mathrm{diag}(\pi)\, T^\top\, \mathrm{diag}(T\pi)^{-1}\, e_j .
\]
The remaining parameters are verified similarly.
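For a quick numerical confirmation of the displayed identity, the following sketch (hypothetical, randomly generated HMM parameters $T$, $O$, $\pi$) compares $O\,\mathrm{diag}(\pi)\,T^\top\mathrm{diag}(T\pi)^{-1}$ with $\mathbb{E}[x_1 \mid h_2 = j]$ computed directly by Bayes' rule.

```python
import numpy as np

rng = np.random.default_rng(0)
k, d = 3, 5

# Hypothetical HMM parameters: columns of T and O are probability vectors.
T = rng.random((k, k)); T /= T.sum(axis=0, keepdims=True)   # T[j, i] = Pr[h_2 = j | h_1 = i]
O = rng.random((d, k)); O /= O.sum(axis=0, keepdims=True)   # O[:, i] = E[x_1 | h_1 = i]
pi = rng.random(k); pi /= pi.sum()                          # pi[i] = Pr[h_1 = i]

# Closed form from the proof: M_1 = O diag(pi) T^T diag(T pi)^{-1}.
M1 = O @ np.diag(pi) @ T.T @ np.diag(1.0 / (T @ pi))

# Direct computation of E[x_1 | h_2 = j] by Bayes' rule over h_1.
M1_direct = np.zeros((d, k))
for j in range(k):
    post = T[j, :] * pi / (T @ pi)[j]      # Pr[h_1 = i | h_2 = j]
    M1_direct[:, j] = O @ post

assert np.allclose(M1, M1_direct)
```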

    D.3 Learning mixtures of product distributions

In this section, we show how to use Algorithm B with mixtures of product distributions in $\mathbb{R}^n$ that satisfy an incoherence condition on the means $\mu_1, \mu_2, \dots, \mu_k \in \mathbb{R}^n$ of the $k$ component distributions. Note that product distributions are just a special case of the more general class of multi-view distributions, which are directly handled by Algorithm B.

The basic idea is to randomly partition the coordinates into $\ell \ge 3$ views, each of roughly the same dimension. Under the assumption that the component distributions are product distributions, the multi-view assumption is satisfied. What remains to be checked is that the non-degeneracy condition (Condition 3.1) is satisfied. Theorem D.1 (below) shows that it suffices that the original matrix of component means have rank $k$ and satisfy the following incoherence condition.
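A minimal numerical sketch of the random coordinate partitioning is given below (a Gaussian matrix stands in for a generic incoherent mean matrix; all names are hypothetical). Theorem D.1 predicts that each view retains a constant fraction of the $k$-th singular value.

```python
import numpy as np

rng = np.random.default_rng(1)
n, k, ell = 300, 4, 3

# Hypothetical mean matrix M with dense (incoherent) singular vectors.
M = rng.standard_normal((n, k))

# Randomly assign each coordinate to one of ell views.
assignment = rng.integers(ell, size=n)
views = [M[assignment == v] for v in range(ell)]

sigma_k = np.linalg.svd(M, compute_uv=False)[k - 1]
for v, Mv in enumerate(views):
    sigma_k_v = np.linalg.svd(Mv, compute_uv=False)[k - 1]
    # Theorem D.1: with high probability, sigma_k(M_v) >= sigma_k(M) / (2 sqrt(ell)).
    print(v, sigma_k_v >= sigma_k / (2 * np.sqrt(ell)))
```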

Condition D.1 (Incoherence condition). Let $\delta \in (0,1)$, $\ell \in [n]$, and $M = [\mu_1|\mu_2|\cdots|\mu_k] \in \mathbb{R}^{n\times k}$ be given; let $M = USV^\top$ be the thin singular value decomposition of $M$, where $U \in \mathbb{R}^{n\times k}$ is a matrix of orthonormal columns, $S = \mathrm{diag}(\sigma_1(M), \sigma_2(M), \dots, \sigma_k(M)) \in \mathbb{R}^{k\times k}$, and $V \in \mathbb{R}^{k\times k}$ is orthogonal; and let
\[
c_M := \max_{j\in[n]} \frac{n}{k}\,\|U^\top e_j\|_2^2 .
\]
The following inequality holds:
\[
c_M \le \frac{9}{32}\cdot\frac{n}{\ell\, k\, \ln\frac{\ell k}{\delta}} .
\]

Note that $c_M$ is always in the interval $[1, n/k]$; it is smallest when the left singular vectors in $U$ have $\pm 1/\sqrt{n}$ entries (as in a Hadamard basis), and largest when the singular vectors are the coordinate axes. Roughly speaking, the incoherence condition requires that the non-degeneracy of a matrix $M$ be witnessed by many vertical blocks of $M$. When the condition is satisfied, then with high probability, a random partitioning of the coordinates into $\ell$ groups induces a block partitioning of $M$ into matrices $M_1, M_2, \dots, M_\ell$ (with roughly equal numbers of rows) such that the $k$-th largest singular value of $M_v$ is not much smaller than that of $M$ (for each $v \in [\ell]$).

Chaudhuri and Rao (2008) show that under a similar condition (which they call a spreading condition), a random partitioning of the coordinates into two views preserves the separation between the means of the $k$ component distributions. They then follow this preprocessing with a projection based on the correlations across the two views (similar to CCA). However, their overall algorithm requires a minimum separation condition on the means of the component distributions. In contrast, Algorithm B does not require a minimum separation condition at all in this setting.

Theorem D.1. Assume Condition D.1 holds. Independently put each coordinate $i \in [n]$ into one of $\ell$ different sets $\mathcal{I}_1, \mathcal{I}_2, \dots, \mathcal{I}_\ell$ chosen uniformly at random. With probability at least $1 - \delta$, for each $v \in [\ell]$, the matrix $M_v \in \mathbb{R}^{|\mathcal{I}_v|\times k}$, formed by selecting the rows of $M$ indexed by $\mathcal{I}_v$, satisfies
\[
\sigma_k(M_v) \ge \sigma_k(M)/(2\sqrt{\ell}) .
\]

    Proof. Follows from Lemma D.1 (below) together with a union bound.

Lemma D.1. Assume Condition D.1 holds. Consider a random submatrix $\hat M$ of $M$ obtained by independently deciding to include each row of $M$ with probability $1/\ell$. Then
\[
\Pr\bigl[\,\sigma_k(\hat M) \ge \sigma_k(M)/(2\sqrt{\ell})\,\bigr] \ge 1 - \delta/\ell .
\]
Proof. Let $z_1, z_2, \dots, z_n \in \{0,1\}$ be independent indicator random variables, each with $\Pr[z_i = 1] = 1/\ell$. Note that $\hat M^\top \hat M = M^\top \mathrm{diag}(z_1, z_2, \dots, z_n)\, M = \sum_{i=1}^n z_i\, M^\top e_i e_i^\top M$, and that
\[
\sigma_k(\hat M)^2 = \lambda_{\min}(\hat M^\top \hat M)
\ge \lambda_{\min}(S)^2\; \lambda_{\min}\Bigl(\sum_{i=1}^n z_i\, U^\top e_i e_i^\top U\Bigr) .
\]
Moreover, $0 \preceq z_i\, U^\top e_i e_i^\top U \preceq (k/n)\, c_M\, I$ and $\lambda_{\min}\bigl(\mathbb{E}\bigl[\sum_{i=1}^n z_i\, U^\top e_i e_i^\top U\bigr]\bigr) = 1/\ell$. By Lemma F.3 (a Chernoff bound on extremal eigenvalues of random symmetric matrices),
\[
\Pr\Bigl[\,\lambda_{\min}\Bigl(\sum_{i=1}^n z_i\, U^\top e_i e_i^\top U\Bigr) \le \frac{1}{4\ell}\,\Bigr]
\le k\, e^{-(3/4)^2\,(1/\ell)/(2 c_M k/n)}
\le \delta/\ell
\]
by the assumption on $c_M$.


    D.4 Empirical moments for multi-view mixtures of subgaussian distributions

The required concentration behavior of the empirical moments used by Algorithm B can be easily established for multi-view Gaussian mixture models using known techniques (Chaudhuri et al., 2009). This is clear for the second-order statistics $\hat P_{a,b}$ for $\{a,b\} \in \{\{1,2\},\{1,3\}\}$, and remains true for the third-order statistics $\hat P_{1,2,3}$ because $x_3$ is conditionally independent of $x_1$ and $x_2$ given $h$. The magnitude of $\langle \hat U_3 \theta_i, x_3\rangle$ can be bounded for all samples (with a union bound; recall that we make the simplifying assumption that $\hat P_{1,3}$ is independent of $\hat P_{1,2,3}$, and therefore so are $\hat U_3$ and $\hat P_{1,2,3}$). Therefore, one effectively only needs spectral norm error bounds for second-order statistics, as provided by existing techniques.
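For reference, a minimal sketch of the empirical moments in question (hypothetical variable names; `eta` plays the role of the projection vector $\eta$): the pairwise matrices $\hat P_{1,2}, \hat P_{1,3}$ and the projected third-order matrix $\hat P_{1,2,3}(\eta)$.

```python
import numpy as np

def empirical_moments(X1, X2, X3, eta):
    """Empirical cross moments from N samples of the three views.

    X1, X2, X3 are (N, d) arrays of per-view observations; eta is a (d,) vector."""
    N = X1.shape[0]
    P12 = X1.T @ X2 / N                                 # estimates E[x1 x2^T]
    P13 = X1.T @ X3 / N                                 # estimates E[x1 x3^T]
    P123_eta = X1.T @ (X2 * (X3 @ eta)[:, None]) / N    # estimates E[<eta, x3> x1 x2^T]
    return P12, P13, P123_eta
```

These are the quantities whose spectral-norm errors the bounds in this subsection control.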

Indeed, it is possible to establish Condition 3.2 in the case where the conditional distribution of $x_v$ given $h$ (for each view $v$) is subgaussian. Specifically, we assume that there exists some $\gamma > 0$ such that for each view $v$ and each component $j \in [k]$,
\[
\mathbb{E}\Bigl[\exp\Bigl(\lambda\,\bigl\langle u,\ \mathrm{cov}(x_v|h=j)^{-1/2}\,\bigl(x_v - \mathbb{E}[x_v|h=j]\bigr)\bigr\rangle\Bigr)\Bigr]
\le \exp(\gamma^2\lambda^2/2),
\qquad \lambda \in \mathbb{R},\ u \in S^{d-1},
\]
where $\mathrm{cov}(x_v|h=j) := \mathbb{E}\bigl[(x_v - \mathbb{E}[x_v|h=j])(x_v - \mathbb{E}[x_v|h=j])^\top \mid h=j\bigr]$ is assumed to be positive definite. Using standard techniques (e.g., Vershynin (2012)), Condition 3.2 can be shown to hold under the above conditions with the following parameters (for some universal constant $c > 0$):
\[
w_{\min} := \min_{j\in[k]} w_j ,
\]
\[
N_0 := \frac{\gamma^{3/2}\,(d + \log(1/\delta))}{w_{\min}}\,\log\frac{\gamma^{3/2}\,(d + \log(1/\delta))}{w_{\min}} ,
\]
\[
C_{a,b} := c\,\max\Bigl\{\,\|\mathrm{cov}(x_v|h=j)^{1/2}\|_2,\ \|\mathbb{E}[x_v|h=j]\|_2 \ :\ v \in \{a,b\},\ j \in [k]\,\Bigr\}^2 ,
\]
\[
C_{1,2,3} := c\,\max\Bigl\{\,\|\mathrm{cov}(x_v|h=j)^{1/2}\|_2,\ \|\mathbb{E}[x_v|h=j]\|_2 \ :\ v \in [3],\ j \in [k]\,\Bigr\}^3 ,
\]
\[
f(N,\delta) := \sqrt{\frac{k^2\,\log(1/\delta)}{N}} + \frac{\gamma^{3/2}\,\log(N/\delta)\,(d + \log(1/\delta))}{w_{\min}\, N} .
\]

    E General results from matrix perturbation theory

The lemmas in this section are standard results from matrix perturbation theory, taken from Stewart and Sun (1990).

Lemma E.1 (Weyl's theorem). Let $A, E \in \mathbb{R}^{m\times n}$ with $m \ge n$ be given. Then
\[
\max_{i\in[n]} |\sigma_i(A+E) - \sigma_i(A)| \le \|E\|_2 .
\]
Proof. See Theorem 4.11, p. 204 in Stewart and Sun (1990).
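As a sanity check, Weyl's inequality is easy to verify numerically on random matrices (hypothetical sizes):

```python
import numpy as np

rng = np.random.default_rng(2)
A = rng.standard_normal((6, 4))
E = 1e-2 * rng.standard_normal((6, 4))

sA = np.linalg.svd(A, compute_uv=False)
sAE = np.linalg.svd(A + E, compute_uv=False)

# Weyl's theorem: every singular value moves by at most ||E||_2.
assert np.max(np.abs(sAE - sA)) <= np.linalg.norm(E, 2) + 1e-12
```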

Lemma E.2 (Wedin's theorem). Let $A, E \in \mathbb{R}^{m\times n}$ with $m \ge n$ be given. Let $A$ have the singular value decomposition
\[
\begin{bmatrix} U_1^\top \\ U_2^\top \\ U_3^\top \end{bmatrix} A \begin{bmatrix} V_1 & V_2 \end{bmatrix}
= \begin{bmatrix} \Sigma_1 & 0 \\ 0 & \Sigma_2 \\ 0 & 0 \end{bmatrix} .
\]
Let $\tilde A := A + E$, with analogous singular value decomposition $(\tilde U_1, \tilde U_2, \tilde U_3, \tilde\Sigma_1, \tilde\Sigma_2, \tilde V_1, \tilde V_2)$. Let $\Phi$ be the matrix of canonical angles between $\mathrm{range}(U_1)$ and $\mathrm{range}(\tilde U_1)$, and $\Theta$ be the matrix of canonical angles between $\mathrm{range}(V_1)$ and $\mathrm{range}(\tilde V_1)$. If there exist $\alpha, \delta > 0$ such that $\min_i \sigma_i(\tilde\Sigma_1) \ge \alpha + \delta$ and $\max_i \sigma_i(\Sigma_2) \le \alpha$, then
\[
\max\bigl\{\,\|\sin\Phi\|_2,\ \|\sin\Theta\|_2\,\bigr\} \le \frac{\|E\|_2}{\delta} .
\]

    Proof. See Theorem 4.4, p. 262 in Stewart and Sun (1990).

Lemma E.3 (Bauer–Fike theorem). Let $A, E \in \mathbb{R}^{k\times k}$ be given. If $A = V\,\mathrm{diag}(\lambda_1, \lambda_2, \dots, \lambda_k)\,V^{-1}$ for some invertible $V \in \mathbb{R}^{k\times k}$, and $\hat A := A + E$ has eigenvalues $\hat\lambda_1, \hat\lambda_2, \dots, \hat\lambda_k$, then
\[
\max_{i\in[k]} \min_{j\in[k]} |\hat\lambda_i - \lambda_j| \le \|V^{-1} E V\|_2 .
\]

    Proof. See Theorem 3.3, p. 192 in Stewart and Sun (1990).

Lemma E.4. Let $A, E \in \mathbb{R}^{k\times k}$ be given. If $A$ is invertible, and $\|A^{-1}E\|_2 < 1$, then $\hat A := A + E$ is invertible, and
\[
\|\hat A^{-1} - A^{-1}\|_2 \le \frac{\|E\|_2\,\|A^{-1}\|_2^2}{1 - \|A^{-1}E\|_2} .
\]

    Proof. See Theorem 2.5, p. 118 in Stewart and Sun (1990).

    F Probability inequalities

Lemma F.1 (Accuracy of empirical probabilities). Fix $\pi = (\pi_1, \pi_2, \dots, \pi_m) \in \Delta^{m-1}$. Let $x$ be a random vector for which $\Pr[x = e_i] = \pi_i$ for all $i \in [m]$, and let $x_1, x_2, \dots, x_n$ be $n$ independent copies of $x$. Set $\hat\pi := (1/n)\sum_{i=1}^n x_i$. For all $t > 0$,
\[
\Pr\Bigl[\,\|\hat\pi - \pi\|_2 > \frac{1 + \sqrt{t}}{\sqrt{n}}\,\Bigr] \le e^{-t} .
\]
Proof. This is a standard application of McDiarmid's inequality (using the fact that $\|\hat\pi - \pi\|_2$ has $\sqrt{2}/n$ bounded differences when a single $x_i$ is changed), together with the bound $\mathbb{E}[\|\hat\pi - \pi\|_2] \le 1/\sqrt{n}$. See Proposition 19 in Hsu et al. (2012).

Lemma F.2 (Random projection). Let $\theta \in \mathbb{R}^n$ be a random vector distributed uniformly over $S^{n-1}$, and fix a vector $v \in \mathbb{R}^n$.

1. If $\epsilon \in (0,1)$, then
\[
\Pr\Bigl[\,|\langle\theta, v\rangle| \le \frac{\|v\|_2\,\epsilon}{\sqrt{n}}\,\Bigr]
\le \exp\Bigl(\tfrac{1}{2}\bigl(1 - \epsilon^2 + \ln \epsilon^2\bigr)\Bigr) .
\]

2. If $\beta > 1$, then
\[
\Pr\Bigl[\,|\langle\theta, v\rangle| \ge \frac{\|v\|_2\,\beta}{\sqrt{n}}\,\Bigr]
\le \exp\Bigl(\tfrac{1}{2}\bigl(1 - \beta^2 + \ln \beta^2\bigr)\Bigr) .
\]

    Proof. This is a special case of Lemma 2.2 from Dasgupta and Gupta (2003).
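A small Monte Carlo sketch of the first bound (hypothetical parameters) compares the empirical frequency of the event with the stated upper bound:

```python
import numpy as np

rng = np.random.default_rng(3)
n, trials, eps = 50, 20000, 0.3
v = rng.standard_normal(n)

# theta uniform on the unit sphere: normalize a standard Gaussian vector.
theta = rng.standard_normal((trials, n))
theta /= np.linalg.norm(theta, axis=1, keepdims=True)

freq = np.mean(np.abs(theta @ v) <= eps * np.linalg.norm(v) / np.sqrt(n))
bound = np.exp(0.5 * (1 - eps**2 + np.log(eps**2)))
print(freq, bound)   # the empirical frequency should not exceed the bound (up to sampling error)
```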


Lemma F.3 (Matrix Chernoff bound). Let $X_1, X_2, \dots, X_n$ be independent and symmetric $m\times m$ random matrices such that $0 \preceq X_i \preceq r I$, and set $\mu_{\min} := \lambda_{\min}(\mathbb{E}[X_1 + X_2 + \cdots + X_n])$. For any $\gamma \in [0,1]$,
\[
\Pr\Bigl[\,\lambda_{\min}\Bigl(\sum_{i=1}^n X_i\Bigr) \le (1-\gamma)\,\mu_{\min}\,\Bigr]
\le m\, e^{-\gamma^2 \mu_{\min}/(2r)} .
\]

    Proof. This is a direct corollary of Theorem 19 from Ahlswede and Winter (2002).

    G Insufficiency of second-order moments

Chang (1996) shows that a simple class of Markov models used in mathematical phylogenetics cannot be identified from pair-wise probabilities alone. Below, we restate (a specialization of) this result in terms of the document topic model from Section 2.1.

Proposition G.1 (Chang, 1996). Consider the model from Section 2.1 on $(h, x_1, x_2, \dots, x_\ell)$ with parameters $M$ and $\vec w$. Let $Q \in \mathbb{R}^{k\times k}$ be an invertible matrix such that the following hold:

1. $\vec 1^\top Q = \vec 1^\top$;

2. $M Q^{-1}$, $Q\,\mathrm{diag}(\vec w)\,M^\top\,\mathrm{diag}(M\vec w)^{-1}$, and $Q\vec w$ have non-negative entries;

3. $Q\,\mathrm{diag}(\vec w)\,Q^\top$ is a diagonal matrix.

Then the marginal distribution over $(x_1, x_2)$ is identical to that in the case where the model has parameters $\widetilde M := M Q^{-1}$ and $\widetilde w := Q\vec w$.

A simple example for $d = k = 2$ can be obtained from
\[
M := \begin{bmatrix} p & 1-p \\ 1-p & p \end{bmatrix}, \qquad
\vec w := \begin{bmatrix} 1/2 \\ 1/2 \end{bmatrix}, \qquad
Q := \begin{bmatrix} p & \frac{1+\sqrt{1+4p(1-p)}}{2} \\[2pt] 1-p & \frac{1-\sqrt{1+4p(1-p)}}{2} \end{bmatrix}
\]
for some $p \in (0,1)$. We take $p = 0.25$, in which case $Q$ satisfies the conditions of Proposition G.1, and
\[
M = \begin{bmatrix} 0.25 & 0.75 \\ 0.75 & 0.25 \end{bmatrix}, \qquad
\vec w = \begin{bmatrix} 0.5 \\ 0.5 \end{bmatrix}, \qquad
\widetilde M = M Q^{-1} \approx \begin{bmatrix} 0.6614 & 0.1129 \\ 0.3386 & 0.8871 \end{bmatrix}, \qquad
\widetilde w = Q\vec w \approx \begin{bmatrix} 0.7057 \\ 0.2943 \end{bmatrix} .
\]

In this case, both $(M, \vec w)$ and $(\widetilde M, \widetilde w)$ give rise to the same pair-wise probabilities
\[
\widetilde M\,\mathrm{diag}(\widetilde w)\,\widetilde M^\top = M\,\mathrm{diag}(\vec w)\,M^\top =
\begin{bmatrix} 0.3125 & 0.1875 \\ 0.1875 & 0.3125 \end{bmatrix} .
\]
However, the triple-wise probabilities, for $\eta = (1,0)^\top$, differ: for $(M, \vec w)$, we have
\[
M\,\mathrm{diag}(M^\top\eta)\,\mathrm{diag}(\vec w)\,M^\top \approx
\begin{bmatrix} 0.2188 & 0.0938 \\ 0.0938 & 0.0938 \end{bmatrix} ;
\]
while for $(\widetilde M, \widetilde w)$, we have
\[
\widetilde M\,\mathrm{diag}(\widetilde M^\top\eta)\,\mathrm{diag}(\widetilde w)\,\widetilde M^\top \approx
\begin{bmatrix} 0.2046 & 0.1079 \\ 0.1079 & 0.0796 \end{bmatrix} .
\]
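The computation in this example is easy to reproduce; the following sketch (using the same $p$, $M$, $\vec w$, and $Q$ as above) checks that the pair-wise moments agree while the triple-wise moments do not.

```python
import numpy as np

p = 0.25
s = np.sqrt(1 + 4 * p * (1 - p))
M = np.array([[p, 1 - p], [1 - p, p]])
w = np.array([0.5, 0.5])
Q = np.array([[p, (1 + s) / 2], [1 - p, (1 - s) / 2]])

M_alt = M @ np.linalg.inv(Q)     # alternative topic matrix
w_alt = Q @ w                    # alternative mixing weights
eta = np.array([1.0, 0.0])

pairwise = lambda M, w: M @ np.diag(w) @ M.T
triple = lambda M, w: M @ np.diag(M.T @ eta) @ np.diag(w) @ M.T

print(np.allclose(pairwise(M, w), pairwise(M_alt, w_alt)))   # True: pair-wise moments agree
print(np.allclose(triple(M, w), triple(M_alt, w_alt)))       # False: triple-wise moments differ
```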
