
Improving Optimization in Models With Continuous Symmetry Breaking

Robert Bamler 1 Stephan Mandt 1

Abstract

Many loss functions in representation learning are invariant under a continuous symmetry transformation. For example, the loss function of word embeddings (Mikolov et al., 2013b) remains unchanged if we simultaneously rotate all word and context embedding vectors. We show that representation learning models for time series possess an approximate continuous symmetry that leads to slow convergence of gradient descent. We propose a new optimization algorithm that speeds up convergence using ideas from gauge theory in physics. Our algorithm leads to orders of magnitude faster convergence and to more interpretable representations, as we show for dynamic extensions of matrix factorization and word embedding models. We further present an example application of our proposed algorithm that translates modern words into their historic equivalents.

1. Introduction

Symmetries frequently occur in machine learning. They express that the loss function of a model is invariant under a certain group of transformations. For example, the loss function of matrix factorization or word embedding models remains unchanged if we simultaneously rotate all embedding vectors with the same rotation matrix. This is an example of a continuous symmetry, since the rotations are parameterized by a continuum of real-valued angles.

Sometimes, the symmetry of a loss function is broken, e.g., due to the presence of an additional term that violates the symmetry. For example, this may be a weak regularizer. In this paper, we show that such symmetry breaking may induce slow convergence problems in gradient descent, in particular when the symmetry breaking is weak. We solve this problem with a new optimization algorithm.

¹ Los Angeles, CA, USA. Correspondence to: Robert Bamler <[email protected]>, Stephan Mandt <[email protected]>.

Proceedings of the 35th International Conference on Machine Learning, Stockholm, Sweden, PMLR 80, 2018. Copyright 2018 by the author(s).

Weak continuous symmetry breaking leads to an ill-conditioned optimization problem. When a loss function is invariant under a continuous symmetry, it has a manifold of degenerate (equivalent) minima. This is usually not a problem because any such minimum is a valid solution of the optimization problem. However, adding a small symmetry breaking term to the loss function lifts the degeneracy and forces the model to prefer one minimum over all others. As we show, this leads to an ill-conditioned Hessian of the loss, with a small curvature along symmetry directions and a large curvature perpendicular to them. The ill-conditioned Hessian results in slow convergence of gradient descent.

We propose an optimization algorithm that speeds up convergence by separating the optimization in the symmetry directions from the optimization in the remaining directions. At regular intervals, the algorithm efficiently minimizes the small symmetry breaking term in such a way that the minimization does not degrade the symmetry invariant term.

Symmetries can be broken explicitly, e.g., due to an additional term such as an L1 regularizer in a word embedding model (Sun et al., 2016). However, perhaps more interestingly, symmetries can also be broken by couplings between model parameters. This is known as spontaneous symmetry breaking in the physics community.

One of our main findings is that spontaneous symmetry breaking occurs in certain time series models, such as dynamic matrix factorizations and dynamic word embedding models (Lu et al., 2009; Koren, 2010; Charlin et al., 2015; Bamler & Mandt, 2017; Rudolph & Blei, 2017). In these models, it turns out that model parameters may be twisted along the time axis, and that these twists contribute only little to the loss, thus leading to a small gradient. These low-cost configurations are known in the physics community as Goldstone modes (Altland & Simons, 2010).

Our contributions are as follows:

• We identify a broad class of models that suffer from slow convergence of gradient descent due to Goldstone modes. We explain both mathematically and pictorially how Goldstone modes lead to slow convergence.

• Using ideas from gauge theories in physics, we propose Goldstone Gradient Descent (Goldstone-GD), an optimization algorithm that speeds up convergence by separating the optimization along symmetry directions from the remaining coordinate directions.

• We evaluate the Goldstone-GD algorithm experimentally with dynamic matrix factorizations and Dynamic Word Embeddings. We find that Goldstone-GD converges orders of magnitude faster and finds more interpretable embedding vectors than standard gradient descent (GD) or GD with diagonal preconditioning.

• For Dynamic Word Embeddings (Bamler & Mandt, 2017), Goldstone-GD allows us to find historic synonyms of modern English words, such as “wagon” for “car”. Without our advanced optimization algorithm, we were not able to perform this task.

Our paper is structured as follows. Section 2 describes related work. In Section 3, we specify the model class under consideration, introduce concrete example models, and discuss the slow convergence problem. In Section 4, we propose the Goldstone-GD algorithm that solves the slow convergence problem. We report experimental results in Section 5 and provide concluding remarks in Section 6.

2. Related Work

Our paper discusses continuous symmetries in machine learning and proposes a new optimization algorithm. In this section, we summarize related work on both aspects.

Most work on symmetries in machine learning focuses on discrete symmetries. Convolutional neural networks (LeCun et al., 1998) exploit the discrete translational symmetry of images. This idea was generalized to arbitrary discrete symmetries (Gens & Domingos, 2014), to the permutation symmetry of sets (Zaheer et al., 2017), and to discrete symmetries in graphical models (Bui et al., 2012; Noessner et al., 2013). Discrete symmetries do not cause an ill-conditioned optimization problem because they lead to isolated degenerate minima rather than a manifold of degenerate minima.

In this work, we consider models with continuous symmetries. A specialized optimization algorithm for a loss function with a continuous symmetry was presented in (Choi et al., 1999). Our algorithm applies to a broader class of models since our assumptions are more permissive. We only require invariance under a collective rotation of all feature vectors, and not under independent symmetry transformations of each individual feature. Rotational symmetries have been identified in deep neural networks (Badrinarayanan et al., 2015), matrix factorization (Mnih & Salakhutdinov, 2008; Gopalan et al., 2015), linear factor models (Murphy, 2012), and word embeddings (Mikolov et al., 2013a;b; Pennington et al., 2014; Barkan, 2017). Dynamic matrix factorizations (Lu et al., 2009; Koren, 2010; Sun et al., 2012; Charlin et al., 2015) and dynamic word embeddings (Bamler & Mandt, 2017; Rudolph & Blei, 2017) generalize these models to sequential data. These are the models whose optimization we address in this paper.

Figure 1. a) The loss ℓ of a rotationally symmetric model has a continuum of degenerate minima. b) A Goldstone mode; similar configurations arise during the optimization of time series models, and they lead to a slowdown of gradient descent (see Figure 2).

The slow convergence in these models is caused by shallow directions of the loss function. Popular methods to escape a shallow valley of a loss function (Duchi et al., 2011; Zeiler, 2012; Kingma & Ba, 2014) use diagonal preconditioning. As confirmed by our experiments, diagonal preconditioning does not speed up convergence when the shallow directions correspond to collective rotations of many model parameters, which are not aligned with the coordinate axes.

Natural gradients (Amari, 1998; Martens, 2014) are a more sophisticated form of preconditioning, which has been applied to deep learning (Pascanu & Bengio, 2013) and to variational inference (Hoffman et al., 2013). Our proposed algorithm uses natural gradients in a subspace where they are cheap to obtain. Unlike the Krylov subspace method (Vinyals & Povey, 2012), we construct the subspace such that it always contains the shallow directions of the loss.

3. Problem Setting

In this section, we formalize the notion of continuous symmetry breaking (Section 3.1), and we specify the type of models that we investigate in this paper (Section 3.2). We then show that the introduced models exhibit a specific type of symmetry breaking that is generically weak and therefore leads to slow convergence of gradient descent (Section 3.3).

3.1. Symmetry Breaking in Representation Learning

We formalize the concept of weak continuous symmetry breaking and show that it leads to an ill-conditioned optimization problem. Figure 1a illustrates the main effects of a continuous symmetry by means of a small toy model. The red surface depicts the loss function. Since the loss is rotationally symmetric, it has a ring of degenerate (i.e., equivalent) minima. The purple sphere depicts one arbitrarily chosen minimum. Notice that the loss is flat in the direction tangential to the ring of degenerate minima (blue arrows), i.e., its curvature along this direction is zero.

For a machine learning example, consider factorizing a large matrix X into the product U^T V of two smaller matrices U and V by minimizing the loss ℓ(U, V) = ||X − U^T V||_2^2. Rotating all columns of U and V by the same orthogonal¹ rotation matrix R such that U ← RU and V ← RV does not change ℓ since (RU)^T RV = U^T (R^T R) V = U^T V.

¹ We call a square matrix R 'orthogonal' if R^T R is the identity. This is sometimes also called an 'orthonormal' matrix.
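The following short numpy check (a minimal sketch, not from the paper; the matrix sizes, random seed, and data are arbitrary choices) illustrates this invariance numerically: the loss ℓ(U, V) = ||X − U^T V||_2^2 is unchanged when both U and V are multiplied by the same rotation matrix.

import numpy as np

rng = np.random.default_rng(0)
d, n = 3, 10                                 # embedding dimension and matrix size (arbitrary)
U, V = rng.normal(size=(d, n)), rng.normal(size=(d, n))
X = rng.normal(size=(n, n))                  # data matrix (arbitrary here)

def loss(U, V):
    # Squared Frobenius norm ||X - U^T V||_2^2.
    return np.sum((X - U.T @ V) ** 2)

# Build a random rotation matrix R via QR decomposition (R^T R = I).
R, _ = np.linalg.qr(rng.normal(size=(d, d)))

# The loss is invariant under the collective rotation U <- RU, V <- RV.
print(loss(U, V), loss(R @ U, R @ V))        # both values agree up to floating point error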

The continuous rotational symmetry of ℓ leads to a manifold of degenerate minima, and to a zero eigenvalue of the Hessian of ℓ at any of the minima. If (U*, V*) minimizes ℓ, then so does (RU*, RV*) for any rotation matrix R. On the manifold of degenerate minima, the gradient assumes the constant value of zero. A constant gradient (first derivative) means that the curvature (second derivative) is zero. More precisely, the Hessian of ℓ has a zero eigenvalue for all eigenvectors that are tangential to the manifold of degenerate minima. Usually, a zero eigenvalue of the Hessian indicates a maximally ill-conditioned optimization problem, but this is not an issue here. The zero eigenvalue only means that convergence within the manifold of degenerate minima is infinitely slow. This is of no concern since any minimum is a valid solution of the optimization problem.

A problem arises when the continuous symmetry is weakly broken, e.g., by adding a small L1 regularizer to the loss. A sufficiently small regularizer changes the eigenvalues of the Hessian only slightly, leaving it still poorly conditioned. However, even if the regularizer has a tiny coefficient, it lifts the degeneracy and turns the manifold of exactly degenerate minima into a shallow valley with one preferred minimum. Convergence along this shallow valley is slow because of the small eigenvalues of the Hessian. This is the slow convergence problem that we address in this paper. We present a more natural setup in which the problem occurs in Sections 3.2-3.3. Our solution, presented in Section 4, is to separate the optimization in the symmetry directions from the optimization in the remaining directions.

3.2. Representation Learning for Time Series

We define a broad class of representation learning models for sequential data, and introduce three example models that are investigated experimentally in this paper. As we show in Section 3.3, the models presented here suffer from slow convergence due to a specific kind of symmetry breaking.

General Model Class. We consider data X ≡ {X_t}_{t=1:T} that are associated with additional metadata t, such as a time stamp. For each t, the task is to learn a low dimensional representation Z_t by minimizing what we call a 'local loss function' ℓ(X_t; Z_t). We add a quadratic regularizer ψ(Z) that couples the representations Z ≡ {Z_t}_{t=1:T} along the t-dimension. In a Bayesian setup, ψ comes from the log-prior of the model. The total loss function is thus

L(Z) = Σ_{t=1}^T ℓ(X_t; Z_t) + ψ(Z).    (1)

For each task t, the representation Z_t is a matrix whose columns are low dimensional embedding vectors. We assume that ℓ is invariant under a collective rotation of all columns of Z_t: let R be an arbitrary orthogonal rotation matrix of the same dimension as the embedding space, then

ℓ(X_t; R Z_t) = ℓ(X_t; Z_t).    (2)

Finally, we consider a specific form of the regularizer ψ which is quadratic in Z, and which is defined in terms of a sparse symmetric coupling matrix L ∈ R^{T×T}:

ψ(Z) = ½ Tr(Z^T L Z).    (3)

Here, the matrix-vector multiplications are carried out in t-space, and the trace runs over the remaining dimensions. Note that, unlike in Section 3.1, we do not require ψ to have a small coefficient. We only require the coupling matrix L to be sparse. In the examples below, L is tridiagonal and results from a Gaussian Markovian time series prior. In a more general setup, L = D − A is the Laplacian matrix of a sparse weighted graph (Poignard et al., 2018). Here, A is the adjacency matrix, whose entries are the coupling strengths, and the degree matrix D is diagonal and defined such that the entries of each row of L sum up to zero.
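As an illustration of Eq. 3 (a sketch only; the graph, dimensions, and coupling strengths below are made up, not taken from the paper), the snippet builds the Laplacian L = D − A of a small weighted graph and evaluates ψ(Z) = ½ Tr(Z^T L Z), with the representations Z_t stacked along the t-dimension.

import numpy as np

T, d, N = 5, 3, 4                       # number of tasks, embedding dim, columns (arbitrary)
rng = np.random.default_rng(1)

# Sparse symmetric adjacency matrix A of a small coupling graph (here: a chain).
A = np.zeros((T, T))
for t in range(T - 1):
    A[t, t + 1] = A[t + 1, t] = 1.0     # coupling strength 1 for simplicity

D = np.diag(A.sum(axis=1))              # degree matrix
L = D - A                               # graph Laplacian; each row sums to zero
assert np.allclose(L.sum(axis=1), 0)

# Stack the representations Z_t as rows of a T x (d*N) array.
Z = rng.normal(size=(T, d * N))

# psi(Z) = 1/2 Tr(Z^T L Z); the matrix multiplication acts in t-space,
# the trace runs over the remaining (flattened) dimensions.
psi = 0.5 * np.trace(Z.T @ L @ Z)
print(psi)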

Equations 1, 2, and 3 specify the problem class of interest in this paper. The following paragraphs introduce the specific example models used in our experiments. In Section 3.3, we show that the sparse coupling in these models leads to weak continuous symmetry breaking and therefore to slow convergence of gradient descent (GD).

Model 1: Dense Dynamic Matrix Factorization. Consider the task of factorizing a large matrix X_t into a product U_t^T V_t of two smaller matrices. The latent representation is the concatenation of the two embedding matrices,

Z_t ≡ (U_t, V_t).    (4)

In a Gaussian matrix factorization, the local loss function is

ℓ(X_t; Z_t) = − log N(X_t; U_t^T V_t, I).    (5)

In dynamic matrix factorization models, the data X are observed sequentially at discrete time steps t, and the representations Z capture the temporal evolution of latent embedding vectors. We use a Markovian Gaussian time series prior with a coupling strength λ, resulting in the regularizer

ψ(Z) = (λ/2) Σ_{i=1}^N Σ_{t=1}^{T−1} ||z_{t+1,i} − z_{t,i}||_2^2.    (6)

Here, the vector z_{t,i} is the ith column of the matrix Z_t, i.e., the ith embedding vector, and N is the number of columns. The regularizer allows the model to share statistical strength across time. By multiplying out the square, we find that ψ has the form of Eq. 3 with a tridiagonal coupling matrix,

L = λ \begin{pmatrix} 1 & -1 & & & \\ -1 & 2 & -1 & & \\ & \ddots & \ddots & \ddots & \\ & & -1 & 2 & -1 \\ & & & -1 & 1 \end{pmatrix}.    (7)
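As a quick consistency check (a sketch with arbitrary sizes and random data, not the paper's code), the chain regularizer of Eq. 6 can be verified numerically to equal ½ Tr(Z^T L Z) with the tridiagonal L of Eq. 7.

import numpy as np

T, d, N, lam = 6, 3, 4, 10.0            # arbitrary sizes; coupling strength lambda
rng = np.random.default_rng(2)
Z = rng.normal(size=(T, d, N))          # Z[t, :, i] is the embedding vector z_{t,i}

# Regularizer of Eq. 6: (lambda/2) * sum_i sum_t ||z_{t+1,i} - z_{t,i}||^2.
psi_direct = 0.5 * lam * np.sum((Z[1:] - Z[:-1]) ** 2)

# Tridiagonal coupling matrix of Eq. 7.
L = lam * (np.diag(np.r_[1.0, 2.0 * np.ones(T - 2), 1.0])
           - np.diag(np.ones(T - 1), 1) - np.diag(np.ones(T - 1), -1))

# Eq. 3: psi(Z) = 1/2 Tr(Z^T L Z), with the t-index contracted against L.
psi_quadratic = 0.5 * np.einsum('tdi,ts,sdi->', Z, L, Z)

print(psi_direct, psi_quadratic)        # the two values agree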

Model 2: Sparse Dynamic Matrix Factorization. In a sparse matrix factorization, the local loss ℓ involves only few components of the matrix U_t^T V_t. The latent representation is again Z_t ≡ (U_t, V_t). We consider a model for movie ratings where each user rates only few movies. When user i rates movie j in time step t, we model the log-likelihood to obtain the binary rating x ∈ {±1} with a logistic regression,

log p(x | u_{t,i}, v_{t,j}) = log σ(x u_{t,i}^T v_{t,j})    (8)

with the sigmoid function σ(ξ) = 1/(1 + e^{−ξ}). Eq. 8 is the log-likelihood of the rating x of a single movie by a single user. We obtain the full log-likelihood log p(X_t | Z_t) for time step t by summing over the log-likelihoods of all ratings observed at time step t. The local loss is

ℓ(X_t; Z_t) = − log p(X_t | Z_t) + (γ/2) ||Z_t||_2^2.    (9)

Here, ||·||_2 is the Frobenius norm, and we add a quadratic regularizer with strength γ since data for some users or movies may be scarce. We distinguish this local regularizer from the time series regularizer ψ, given again in Eq. 6, as the local regularizer does not break the rotational symmetry.
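The local loss of Eqs. 8-9 for one time step can be written compactly as in the sketch below (the observed (user, movie, rating) triples and all sizes are toy placeholders, not MovieLens data).

import numpy as np

d, n_users, n_movies, gamma = 4, 5, 6, 1.0     # arbitrary toy sizes
rng = np.random.default_rng(3)
U_t = rng.normal(size=(d, n_users))            # user embeddings at time step t
V_t = rng.normal(size=(d, n_movies))           # movie embeddings at time step t

# Observed ratings at time step t: (user i, movie j, binary rating x in {+1, -1}).
ratings = [(0, 2, +1), (1, 4, -1), (3, 5, +1)]

def local_loss(U_t, V_t, ratings, gamma):
    # Negative log-likelihood of Eq. 8 summed over all observed ratings...
    nll = 0.0
    for i, j, x in ratings:
        score = x * (U_t[:, i] @ V_t[:, j])
        nll += np.logaddexp(0.0, -score)       # -log sigma(score), computed stably
    # ...plus the local quadratic regularizer of Eq. 9 (Frobenius norm of Z_t = (U_t, V_t)).
    return nll + 0.5 * gamma * (np.sum(U_t ** 2) + np.sum(V_t ** 2))

print(local_loss(U_t, V_t, ratings, gamma))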

Model 3: Dynamic Word Embeddings. Word embeddings map words from a large vocabulary to a low dimensional representation space such that neighboring words are semantically similar, and differences between word embedding vectors capture syntactic and semantic relations. We consider the Dynamic Word Embeddings model (Bamler & Mandt, 2017), which uses a probabilistic interpretation of the Skip-Gram model with negative sampling, also known as word2vec (Mikolov et al., 2013b; Barkan, 2017), and combines it with a time series prior. The model is trained on T text sources with time stamps t, and it assigns two time dependent embedding vectors u_{t,i} and v_{t,i} to each word i from a fixed vocabulary. The embedding vectors are obtained by simultaneously factorizing two matrices, which contain so-called positive and negative counts of word-context pairs. Therefore, the representation Z_t ≡ (U_t, V_t) for each time step is invariant under orthogonal transformations. The regularizer ψ comes from a time series prior that is a discretized Ornstein-Uhlenbeck process, i.e., it combines a random diffusion process with a local quadratic regularizer.

Figure 2. Goldstone modes and slow convergence of GD in a dynamic matrix factorization with a 3d embedding space (details in Section 5.1). Colored points in the 3d plots show the evolution of each embedding vector along the time dimension of the model.

3.3. Symmetry Breaking in Representation Learning for Time Series

We show that the time series models introduced in Section 3.2 exhibit a variant of the symmetry breaking discussed in Section 3.1. The symmetry breaking in time series models is generically weak, thus causing slow convergence.

Geometric Picture: Goldstone Modes. Figure 1b graphically illustrates the class of time series representation learning problems introduced in Eq. 1. Each of the seven purple spheres depicts an embedding vector Z_t with a time stamp t in a rotationally symmetric local loss function ℓ (red surface). For a simpler visualization, we assume here that the local loss function is the same for every sphere, and that each Z_t contains only a single embedding vector.

The embedding vectors are coupled via the term ψ in Eq. 1. We can think of this coupling as springs between neighboring spheres (not drawn) that try to pull the spheres together. Figure 1b shows a typical configuration that GD may find after a few update steps. Each embedding Z_t minimizes its local loss ℓ. However, the chain is not yet contracted to a single point, which reflects a deviation from the minimum of the total loss L. Such a configuration is called a Goldstone mode in the physics literature (Altland & Simons, 2010).

As we show below, the gradient of the total loss is suppressed in a Goldstone mode (Goldstone theorem) (Altland & Simons, 2010). This leads to slow convergence of GD, as we illustrate in Figure 2. The 3d plots show snapshots of the embedding space in a Gaussian dynamic matrix factorization with T = 30 time stamps and a 3d representation space (details in Section 5.1). Each color corresponds to an embedding vector. Points of equal color show the same embedding vector at different time stamps t.

In this toy experiment, the local loss ℓ is again identical for all time stamps. Thus, in the optimum, the embedding vectors at different times should be the same, so that the trajectories contract to single points. We see that GD (top row of 3d plots) takes a long time for the chains to contract, while our algorithm finds the optimum much faster.

Formal Picture: Eigenvalues of the Hessian. Goldstone modes decay slowly in GD because the gradient of the total loss L is suppressed in a Goldstone mode. This can be seen by analyzing the eigenvalues of the Hessian H of L at its minimum. For a configuration Z that is close to the true minimum Z*, the gradient is approximately H(Z − Z*).

The Hessian of L in Eq. 1 is the sum of the Hessians of the local loss functions ℓ plus the Hessian of the regularizer ψ. As discussed in Section 3.1, the Hessians of ℓ all have exact zero eigenvalues along the symmetry directions. Within this nullspace, only the Hessian H(ψ) of the regularizer ψ remains. From Eq. 3, we find H(ψ) = I_{N×N} ⊗ I_{d×d} ⊗ L, where ⊗ is the tensor product, and I_{N×N} and I_{d×d} are identity matrices in the input and embedding space, respectively. Thus, H(ψ) has the same eigenvalues as the Laplacian matrix L of the coupling graph, each with multiplicity Nd.

Since the rows of L sum up to zero, L has a zero eigenvalue for the eigenvector (1, ..., 1)^T. This is a consequence of the fact that, despite the symmetry breaking regularizer ψ, the total loss L is still invariant if we rotate all embeddings Z by the same orthogonal matrix. As discussed in Section 3.1, this remaining symmetry leads to zero eigenvalues of the Hessian, but these do not induce slow convergence of GD.

The speed of convergence of GD is governed by the lowest nonzero eigenvalue of the Hessian, and therefore of L. In a Markovian time series model, L in Eq. 7 couples only neighbors along a chain of length T. Its lowest nonzero eigenvalue is 2λ(1 − cos(π/T)) (de Abreu, 2007), which vanishes as O(1/T²) for large T. Thus, even if the coupling strength λ is large, the gradient of ψ is suppressed for the corresponding eigenvectors, i.e., the Goldstone modes. A more general model may couple tasks t along a sparse graph other than a chain. The second lowest eigenvalue of the Laplacian matrix L of a graph is called the algebraic connectivity (de Abreu, 2007), and it is small in sparse graphs. This makes the optimization problem ill-conditioned.
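A quick numerical check of this scaling (a sketch; the values of λ and T are arbitrary): the smallest nonzero eigenvalue of the chain Laplacian of Eq. 7 matches 2λ(1 − cos(π/T)) and shrinks roughly as 1/T².

import numpy as np

lam = 10.0
for T in (10, 30, 100):
    # Tridiagonal chain Laplacian of Eq. 7.
    L = lam * (np.diag(np.r_[1.0, 2.0 * np.ones(T - 2), 1.0])
               - np.diag(np.ones(T - 1), 1) - np.diag(np.ones(T - 1), -1))
    eigvals = np.linalg.eigvalsh(L)          # ascending order
    # eigvals[0] is the zero mode (global rotation); eigvals[1] governs convergence.
    print(T, eigvals[1], 2 * lam * (1 - np.cos(np.pi / T)))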

4. Goldstone Gradient Descent

In this section, we present our solution to the slow convergence problem that we identified in Section 3.3. Algorithm 1 summarizes the proposed Goldstone Gradient Descent (Goldstone-GD) algorithm. We lay out details in Section 4.1, and discuss hyperparameters in Section 4.2.

The algorithm minimizes a loss function L of the form of Eqs. 1-3.

Algorithm 1: Goldstone Gradient Descent (Goldstone-GD)

Input: Loss function L of the form of Eqs. 1-3; learning rate ρ; integer hyperparameters k1 and k2
Output: Local minimum of L.
1  Initialize model parameters Z randomly
2  Initialize gauge fields Γ ← 0
3  repeat
4      repeat k1 times
5          Set Z ← Z − ρ ∇_Z L(Z)                           ▷ gradient step in full parameter space
       end
6      Obtain M and ρ′ from Eqs. 15 and 17                   ▷ transformation to symmetry subspace
7      repeat k2 times
8          Set Γ ← Γ − ρ′ L⁺ ∇_Γ L″(Γ; M)                    ▷ natural gradient step in symmetry subspace
       end
9      Set Z_t ← Z_t + (Γ_t − Γ_t^T) Z_t  ∀t ∈ {1, ..., T}   ▷ transformation back to full parameter space
   until convergence

The main idea is to periodically minimize the symmetry breaking regularizer ψ more efficiently than GD, without degrading the symmetry invariant part of the loss. We alternate between this specialized minimization of ψ (lines 7-8 in Algorithm 1) and standard GD on the full loss function L (lines 4-5). Switching between the two types of updates involves an overhead due to coordinate transformations (lines 6 and 9). We therefore always perform several updates of each type in a row (hyperparameters k1 and k2). Algorithm 1 presents Goldstone-GD in its simplest form. It is straightforward to combine it with adaptive learning rates and minibatch sampling, see experiments in Section 5.

4.1. Optimization in the Symmetry Subspace

We explain lines 6-9 of Algorithm 1. These steps minimize the total loss function L(Z) while restricting updates of the model parameters Z to symmetry transformations. Let R ≡ {R_t}_{t=1:T} denote T orthogonal matrices. The task is to minimize the following auxiliary loss function over R,

L′(Z; R) ≡ L(R_1 Z_1, ..., R_T Z_T) − L(Z_1, ..., Z_T)    (10)

with the nonlinear constraint R_t^T R_t = I ∀t. If R* minimizes L′, then updating Z_t ← R*_t Z_t decreases the loss L by eliminating all Goldstone modes. The second term on the right-hand side of Eq. 10 does not influence the minimization as it is independent of R. Subtracting this term makes L′ independent of the local loss functions ℓ: by using Eqs. 1-2, we can write L′ in terms of only the regularizer ψ,

L′(Z; R) = ψ(R_1 Z_1, ..., R_T Z_T) − ψ(Z_1, ..., Z_T).    (11)

Artificial Gauge Fields. We turn the constrained minimization of L′ over R into an unconstrained minimization using a result from the theory of Lie groups (Hall, 2015). Every special orthogonal matrix R_t ∈ SO(d) is the matrix exponential of a skew symmetric d × d matrix Γ_t. Here, skew symmetry means that Γ_t^T = −Γ_t, and the matrix exponential function exp(·) is defined by its series expansion,

R_t = exp(Γ_t) ≡ I + Γ_t + (1/2!) Γ_t² + (1/3!) Γ_t³ + ...    (12)

which is not to be confused with the componentwise exponential of Γ_t (the term Γ_t² in Eq. 12 is the matrix product of Γ_t with itself, not the componentwise square). Eq. 12 follows from the Lie group-Lie algebra correspondence for the Lie group SO(d) (Hall, 2015). Note that R_t is close to the identity I if the entries of Γ_t are small. To enforce skew symmetry of Γ_t, we parameterize it via the skew symmetric part of an unconstrained d × d matrix Γ̂_t, i.e.,

Γ_t = Γ̂_t − Γ̂_t^T.    (13)

We call the components of Γ ≡ {Γ_t}_{t=1:T} the gauge fields, invoking an analogy to gauge theory in physics.
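The following sketch (using scipy; the matrix entries and dimension are arbitrary, not from the paper) illustrates Eqs. 12-13: exponentiating the skew symmetric part of an unconstrained matrix yields a valid rotation matrix in SO(d).

import numpy as np
from scipy.linalg import expm

d = 3
rng = np.random.default_rng(4)
Gamma_hat = rng.normal(size=(d, d))      # unconstrained d x d matrix (Eq. 13)
Gamma = Gamma_hat - Gamma_hat.T          # skew symmetric: Gamma^T = -Gamma

R = expm(Gamma)                          # matrix exponential (Eq. 12), not a componentwise exp
print(np.allclose(R.T @ R, np.eye(d)),   # R is orthogonal ...
      np.isclose(np.linalg.det(R), 1.0)) # ... with determinant 1, i.e., a rotation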

Taylor Expansion in the Gauge Fields. Eqs. 12-13 parameterize a valid rotation matrix R_t in terms of an arbitrary d × d matrix Γ̂_t. This turns the constrained minimization of L′ into an unconstrained one. However, the matrix-exponential function in Eq. 12 is numerically expensive, and its derivative is complicated because the group SO(d) is non-abelian. We simplify the problem by introducing an approximation. As the model parameters Z approach the minimum of L, the optimal rotations R* that minimize L′ converge to the identity, and thus the gauge fields converge to zero. In this limit, the approximation becomes exact.

We approximate the auxiliary loss function L′ by a second order Taylor expansion L″. In detail, we truncate Eq. 12 after the term quadratic in Γ_t and insert the truncated series into Eq. 11. We multiply out the quadratic form in the prior ψ, Eq. 3, and neglect again all terms of higher than quadratic order in Γ. Using the skew symmetry of Γ_t and the symmetry of the Laplacian matrix L = D − A, we find

L″(Γ; M) = Σ_{t,t′} A_{tt′} Tr[(Γ_{t′} + ½ (Γ_{t′} − Γ_t) Γ_t) M_{tt′}]    (14)

where the trace runs over the embedding space, and for each t, t′ ∈ {1, ..., T}, we define the matrix M_{tt′} ∈ R^{d×d},

M_{tt′} ≡ Σ_{i=1}^N z_{t,i} z_{t′,i}^T.    (15)

We evaluate the matrices M_{tt′} on line 6 in Algorithm 1. Note that the adjacency matrix A is sparse, and that we only need to obtain those matrices M_{tt′} for which A_{tt′} ≠ 0.
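A minimal sketch of how the matrices M_{tt′} of Eq. 15 and the expanded loss L″ of Eq. 14 can be evaluated, visiting only pairs with A_{tt′} ≠ 0 (the sizes, the chain adjacency, and the random gauge fields are illustrative assumptions, not the paper's implementation).

import numpy as np

T, d, N = 5, 3, 4
rng = np.random.default_rng(5)
Z = rng.normal(size=(T, d, N))                        # Z[t, :, i] is the embedding vector z_{t,i}

# Skew symmetric gauge-field matrices, taken small here (near the optimum they go to zero).
Gamma_hat = 0.01 * rng.normal(size=(T, d, d))
Gamma = Gamma_hat - np.transpose(Gamma_hat, (0, 2, 1))

# Chain adjacency matrix A (nonzero only for neighboring time steps).
A = np.zeros((T, T))
for t in range(T - 1):
    A[t, t + 1] = A[t + 1, t] = 1.0

# Eq. 15: M_{tt'} = sum_i z_{t,i} z_{t',i}^T, needed only where A_{tt'} != 0.
M = {(t, s): Z[t] @ Z[s].T for t in range(T) for s in range(T) if A[t, s] != 0}

# Eq. 14: L''(Gamma; M) = sum_{t,t'} A_{tt'} Tr[(Gamma_{t'} + 1/2 (Gamma_{t'} - Gamma_t) Gamma_t) M_{tt'}].
L2 = sum(A[t, s] * np.trace((Gamma[s] + 0.5 * (Gamma[s] - Gamma[t]) @ Gamma[t]) @ M[(t, s)])
         for (t, s) in M)
print(L2)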

We describe the numerical minimization of L″ below. Once we obtain gauge fields Γ* that minimize L″, the optimal update step for the model parameters would be Z_t ← exp(Γ*_t − Γ*_t^T) Z_t. For efficiency, we truncate the matrix exponential function exp(·) after the linear term, resulting in line 9 of Algorithm 1. We do not reset the gauge fields Γ to zero after updating Z, so that the next minimization of L″ starts with preinitialized Γ. This turned out to speed up convergence in our experiments, possibly because Γ acts like a momentum in the symmetry subspace.

Natural Gradients. Lines 7-8 in Algorithm 1 minimize L″ over the gauge fields Γ using GD. We speed up convergence using the fact that L″ depends only on the prior ψ and not on ℓ. Since we know the Hessian of ψ, we can use natural gradients (Amari, 1998), resulting in the update step

Γ ← Γ − ρ′ L⁺ ∇_Γ L″(Γ; M)    (16)

where ρ′ is a constant learning rate and L⁺ is the pseudoinverse of the Laplacian matrix L. We obtain L⁺ by taking the eigendecomposition of L and inverting the eigenvalues, except for the single zero eigenvalue corresponding to (irrelevant) global rotations, which we leave at zero. L⁺ has to be obtained only once before entering the training loop.
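The pseudoinverse L⁺ used in Eq. 16 can be precomputed once along the following lines (a sketch; the chain Laplacian, λ, and the eigenvalue threshold are placeholders).

import numpy as np

T, lam = 30, 10.0
L = lam * (np.diag(np.r_[1.0, 2.0 * np.ones(T - 2), 1.0])
           - np.diag(np.ones(T - 1), 1) - np.diag(np.ones(T - 1), -1))

# Eigendecomposition of the symmetric Laplacian.
eigvals, eigvecs = np.linalg.eigh(L)

# Invert all eigenvalues except the (single) zero mode, which corresponds to
# irrelevant global rotations and is left at zero.
inv = np.zeros_like(eigvals)
nonzero = eigvals > 1e-10 * eigvals.max()
inv[nonzero] = 1.0 / eigvals[nonzero]
L_pinv = (eigvecs * inv) @ eigvecs.T

# Sanity check against numpy's built-in pseudoinverse.
print(np.allclose(L_pinv, np.linalg.pinv(L)))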

Learning Rate. We find that we can automatically set ρ′ in Eq. 16 to a value that leads to fast convergence,

ρ′ = 1 / (T N ⟨Z²⟩)   with   ⟨Z²⟩ ≡ (1/(T N d)) Σ_{t,i} ||z_{t,i}||_2^2.    (17)

We arrive at this choice of learning rate by estimating the Hessian of L″. The preconditioning with L⁺ in Eq. 16 takes into account the structure of the Hessian in t-space, which enters L″ in Eq. 14 via the adjacency matrix A. The remaining factor M_{tt′}, defined in Eq. 15, is quadratic in the components of Z and linear in N. This suggests a learning rate ρ′ ∝ 1/(N⟨Z²⟩). We find empirically for large models that the t-dependency of M_{tt′} leads to a small mismatch between L and the Hessian of L″. The more conservative choice of learning rate in Eq. 17 leads to fast convergence of the gauge fields in all our experiments.
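In code, the automatic learning rate of Eq. 17 is a one-liner (a sketch; Z is assumed to be stored as a T × d × N array as in the snippets above, with arbitrary toy values here).

import numpy as np

T, d, N = 30, 3, 10
Z = np.random.default_rng(6).normal(size=(T, d, N))

mean_sq = np.mean(Z ** 2)            # <Z^2> = (1/(T*N*d)) * sum_{t,i} ||z_{t,i}||^2
rho_prime = 1.0 / (T * N * mean_sq)  # Eq. 17
print(rho_prime)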

4.2. Hyperparameters

Goldstone-GD has two integer hyperparameters, k1 and k2, which control the frequency of execution of each operation. Table 1 lists the computational complexity of each operation, assuming that the sparse adjacency matrix A has O(T) nonzero entries, as is the case in Markovian time series models (Eq. 7). Note that the embedding dimension d is typically orders of magnitude smaller than the input dimension N. Therefore, update steps in the symmetry subspace (line 8) are cheap. In our experiments, we always set k1 and k2 such that the overhead from lines 6-9 is less than 10%.

Table 1. Runtimes of operations in Goldstone-GD (L = line in Algorithm 1; # = frequency of execution; T = no. of time steps; N = input dimension; d = embedding dimension; k1, k2 = hyperparameters).

L   OPERATION                               COMPLEXITY        #
5   gradient step in full param. space      model dependent   ×k1
6   transformation to symmetry space        O(TNd²)           ×1
8   nat. grad. step in symmetry space       O(Td³ + T²d²)     ×k2
9   transformation to full param. space     O(TNd²)           ×1

5. Experiments

We evaluate the proposed Goldstone-GD optimization algorithm on the three example models introduced in Section 3.2. We compare Goldstone-GD to standard GD, to AdaGrad (Duchi et al., 2011), and to Adam (Kingma & Ba, 2014). Goldstone-GD converges orders of magnitude faster and fits more interpretable word embeddings.

5.1. Visualizing Goldstone Modes With Artificial Data

Model and Data Preparation. We fit the dynamic Gaussian matrix factorization model defined in Eqs. 4-6 in Section 3.2 to small scale artificial data. In order to visualize Goldstone modes in the embedding space we choose an embedding dimension of d = 3 and, for this experiment only, we fit the model to time-independent data. This allows us to monitor convergence since we know that the matrices U*_t and V*_t that minimize the loss are also time-independent. We generate artificial data for the matrix X ∈ R^{10×10} by drawing the components of two matrices U, V ∈ R^{3×10} from a standard normal distribution, forming U^T V, and adding uncorrelated Gaussian noise with variance 10^{−3}. We use T = 30 time steps and a coupling strength of λ = 10.
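The artificial data can be generated along these lines (a sketch of the stated setup; the random seed and variable names are arbitrary choices).

import numpy as np

rng = np.random.default_rng(7)
d, n, T, lam = 3, 10, 30, 10.0
U = rng.normal(size=(d, n))                              # ground-truth embeddings, drawn once
V = rng.normal(size=(d, n))
X = U.T @ V + np.sqrt(1e-3) * rng.normal(size=(n, n))    # noisy observation, variance 1e-3

# The same X is used for every time step t = 1..T, so the optimal U_t*, V_t*
# are time-independent and convergence is easy to monitor.
data = [X for _ in range(T)]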

Hyperparameters. We train the model with standard GD (baseline) and with Goldstone-GD with k1 = 50 and k2 = 10. We find fastest convergence for the baseline method if we clip the gradients to an interval [−g, g] and use a decreasing learning rate ρ_s = ρ_0 (s̄/(s̄ + s))^{0.7} despite the noise-free gradient. Here, s is the training step. We optimize the hyperparameters for fastest convergence in the baseline and find g = 0.01, ρ_0 = 1, and s̄ = 100.

Results. Figure 2 compares convergence in the two algorithms. We discussed the figure at the end of Section 3.3. In summary, Goldstone-GD converges an order of magnitude faster even in this small scale setup that allows only for three different kinds of Goldstone modes (the skew symmetric gauge fields Γ_t have only d(d − 1)/2 = 3 independent parameters). Once the minimization finds minima of the local losses ℓ, differences in the total loss L between the two algorithms are small since Goldstone modes contribute only little to L (this is why they decay slowly in GD).

Figure 3. Training curves for MovieLens recommendations using sparse dynamic matrix factorization (top panel: total loss L in units of 10⁶; bottom panel: regularizer ψ in units of 10⁶; horizontal axis: training step; curves for the baseline and Goldstone-GD, proposed). Curves for three different random initializations are indistinguishable at this scale. The horizontal axis does not account for the 1% runtime overhead of Goldstone-GD.

5.2. MovieLens Recommendations

Model and Data Set. We fit the sparse dynamic Bernoulli factorization model defined in Eqs. 6-9 in Section 3.2 to the MovieLens 20M data set² (Harper & Konstan, 2016). We use embedding dimension d = 30, coupling strength λ = 10, and regularizer γ = 1. The data set consists of 20 million reviews of 27,000 movies by 138,000 users with time stamps from 1995 to 2015. We binarize the ratings by splitting at the median, discarding ratings at the median, and we slice the remaining 18 million data points into T = 100 time bins of equal duration. We split randomly across all bins into 50% training, 20% validation, and 30% test set.

Baseline and Hyperparameters. We compare the proposed Goldstone-GD algorithm to GD with AdaGrad (Duchi et al., 2011) with a learning rate prefactor of 1 obtained from cross-validation. Similar to Goldstone-GD, AdaGrad is designed to escape shallow valleys of the loss, but it uses only diagonal preconditioning. We compare to Goldstone-GD with k1 = 100 and k2 = 10, using the same AdaGrad optimizer for update steps in the full parameter space.

Results. The additional operations in Goldstone-GD lead to a 1% overhead in runtime. The upper panel in Figure 3 shows training curves for the loss L using the baseline (purple) and Goldstone-GD (green). The loss L drops faster in Goldstone-GD, but differences in terms of the full loss L are small because the local loss functions ℓ are much larger than the regularizer ψ in this experiment. The lower panel of Figure 3 shows only ψ. Both algorithms converge to the same value of ψ, but Goldstone-GD converges at least an order of magnitude faster. The difference in value is small because Goldstone modes contribute little to ψ. They can, however, have a large influence on the parameter values, as we show next in experiments with Dynamic Word Embeddings.

² https://grouplens.org/datasets/movielens/20m/

Table 2. Word aging: We translate modern words to the year 1800 using the shared representation space of Dynamic Word Embeddings.

QUERY          GOLDSTONE-GD                                             BASELINE
car            boat, saddle, canoe, wagon, box                          shell, roof, ceiling, choir, central
computer       perspective, telescope, needle, mathematical, camera     organism, disturbing, sexual, rendering, bad
DNA            potassium, chemical, sodium, molecules, displacement     operates, differs, sharing, takes, keeps
electricity    vapor, virus, friction, fluid, molecular                 exercising, inherent, seeks, takes, protect
tuberculosis   chronic, paralysis, irritation, disease, vomiting        trained, uniformly, extinguished, emerged, widely

5.3. Dynamic Word Embeddings

Model and Data Set. We perform variational inference (Ranganath et al., 2014) in Dynamic Word Embeddings (DWE), see Section 3.2. We fit the model to digitized books from the years 1800 to 2008 in the Google Books corpus³ (Michel et al., 2011) (approximately 10^10 words). We follow (Bamler & Mandt, 2017) for data preparation, resulting in a vocabulary size of 10,000, a training set of T = 188 time steps, and a test set of 21 time steps. The DWE paper proposes two inference algorithms: filtering and smoothing. We use the smoothing algorithm, which has better predictive performance than filtering but suffers from Goldstone modes. We set the embedding dimension to d = 100 due to hardware constraints and train for 10,000 steps using an Adam optimizer (Kingma & Ba, 2014) with a decaying prefactor of the adaptive learning rate, ρ_s = ρ_0 (s̄/(s̄ + s))^{0.7}, where s is the training step, ρ_0 = 0.1, and s̄ = 1000. We find that this leads to better convergence than a constant prefactor. All other hyperparameters are the same as in (Bamler & Mandt, 2017). We compare the baseline to Goldstone-GD using the same learning rate schedule and k1 = k2 = 10, which leads to an 8% runtime overhead.

Results. By eliminating Goldstone modes, Goldstone-GD makes word embeddings comparable across the time dimension of the model. We demonstrate this in Table 2, which shows the result of 'aging' modern words, i.e., translating them from modern English to the English language of the year 1800. For each query word i, we report the five words i′ whose embedding vectors u_{i′,1} at the first time step (year 1800) have largest overlap with the embedding vector u_{i,T} of the query word at the last time step (year 2008). Overlap is measured in cosine distance (normalized scalar product) between the means of u_{i,T} and u_{i′,1} under the variational distribution.
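The 'aging' query reduces to a nearest-neighbor search in cosine distance between the last and the first time step, along the lines of the sketch below (the vocabulary and the random embeddings are toy placeholders; in the experiment the posterior means under the variational distribution are used).

import numpy as np

rng = np.random.default_rng(8)
vocab = ["car", "boat", "wagon", "saddle", "roof", "choir"]   # toy vocabulary
V_size, d, T = len(vocab), 100, 188
U = rng.normal(size=(T, V_size, d))      # U[t, i] is the (mean) embedding of word i at time t

def age_word(query, k=5):
    # Cosine distance (normalized scalar product) between u_{i,T} and all u_{i',1}.
    u_query = U[-1, vocab.index(query)]
    u_query = u_query / np.linalg.norm(u_query)
    U_first = U[0] / np.linalg.norm(U[0], axis=1, keepdims=True)
    overlap = U_first @ u_query
    return [vocab[i] for i in np.argsort(-overlap)[:k]]

print(age_word("car"))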

Goldstone-GD finds words that are plausible for the year 1800 while still being related to the query (e.g., means of transportation in a query for 'car'). By contrast, the baseline method fails to find plausible results. Figure 4 provides more insight into the failure of the baseline method. It shows histograms of the cosine distance between word embeddings u_{i,1} and u_{i,T} for the same word i from the first to the last time step. In Goldstone-GD (green), most embeddings have a large overlap because the meaning of most words does not change drastically over time. By contrast, in the baseline (purple), no embeddings overlap by more than 60% between 1800 and 2008, and some embeddings even change their orientation (negative overlap). We explain this counterintuitive result with the presence of Goldstone modes, i.e., the entire embedding spaces are rotated against each other.

³ http://storage.googleapis.com/books/ngrams/books/datasetsv2.html

Figure 4. Cosine distance between word embeddings from the first and last year of the training data in Dynamic Word Embeddings (histograms of counts over cosine distance, shown for Goldstone-GD and the baseline).

For a quantitative comparison, we evaluate the predictive log-likelihood of the test set under the posterior mean, and find slightly better predictive performance with Goldstone-GD (−0.5317 vs. −0.5323 per test point). The improvement is small because the training set is so large that the influence of the symmetry breaking regularizer is dwarfed in all but the symmetry directions by the log-likelihood of the data. The main advantage of Goldstone-GD is the more interpretable embeddings, as demonstrated in Table 2.

6. Conclusions

We identified a slow convergence problem in representation learning models with a continuous symmetry and a Markovian time series prior, and we solved the problem with a new optimization algorithm, Goldstone-GD. The algorithm separates the minimization in the symmetry subspace from the remaining coordinate directions. Our experiments showed that Goldstone-GD converges orders of magnitude faster and fits more interpretable embedding vectors, which can be compared across the time dimension of a model. Since continuous symmetries are common in representation learning, we believe that gauge theories and, more broadly, the theory of Lie groups are more widely useful in machine learning.

Acknowledgements

We thank Ari Pakman for valuable and detailed feedback that greatly improved the manuscript.

References

Altland, A. and Simons, B. D. Condensed Matter Field Theory. Cambridge University Press, 2010.

Amari, S.-I. Natural gradient works efficiently in learning. Neural Computation, 10(2):251–276, 1998.

Badrinarayanan, V., Mishra, B., and Cipolla, R. Understanding symmetries in deep networks. arXiv preprint arXiv:1511.01029, 2015.

Bamler, R. and Mandt, S. Dynamic word embeddings. In Proceedings of the 34th International Conference on Machine Learning (ICML), pp. 380–389, 2017.

Barkan, O. Bayesian neural word embedding. In Proceedings of the Thirty-First AAAI Conference on Artificial Intelligence, 2017.

Bui, H. H., Huynh, T. N., and Riedel, S. Automorphism groups of graphical models and lifted variational inference. arXiv preprint arXiv:1207.4814, 2012.

Charlin, L., Ranganath, R., McInerney, J., and Blei, D. M. Dynamic Poisson factorization. In Proceedings of the 9th ACM Conference on Recommender Systems, pp. 155–162, 2015.

Choi, S., Amari, S., Cichocki, A., and Liu, R.-w. Natural gradient learning with a nonholonomic constraint for blind deconvolution of multiple channels. In First International Workshop on Independent Component Analysis and Signal Separation, pp. 371–376, 1999.

de Abreu, N. M. M. Old and new results on algebraic connectivity of graphs. Linear Algebra and its Applications, 423(1):53–73, 2007.

Duchi, J., Hazan, E., and Singer, Y. Adaptive subgradient methods for online learning and stochastic optimization. Journal of Machine Learning Research, 12:2121–2159, 2011.

Gens, R. and Domingos, P. M. Deep symmetry networks. In Advances in Neural Information Processing Systems 27, pp. 2537–2545, 2014.

Gopalan, P., Hofman, J. M., and Blei, D. M. Scalable recommendation with hierarchical Poisson factorization. In UAI, pp. 326–335, 2015.

Hall, B. Lie Groups, Lie Algebras, and Representations: An Elementary Introduction, volume 222. Springer, 2015.

Harper, F. M. and Konstan, J. A. The MovieLens datasets: History and context. ACM Transactions on Interactive Intelligent Systems (TiiS), 5(4):19, 2016.

Hoffman, M. D., Blei, D. M., Wang, C., and Paisley, J. W. Stochastic variational inference. Journal of Machine Learning Research, 14(1):1303–1347, 2013.

Kingma, D. and Ba, J. Adam: A method for stochastic optimization. In Proceedings of the 3rd International Conference for Learning Representations (ICLR), 2014.

Koren, Y. Collaborative filtering with temporal dynamics. Communications of the ACM, 53(4):89–97, 2010.

LeCun, Y., Bottou, L., Bengio, Y., and Haffner, P. Gradient-based learning applied to document recognition. Proceedings of the IEEE, 86(11):2278–2324, 1998.

Lu, Z., Agarwal, D., and Dhillon, I. S. A spatio-temporal approach to collaborative filtering. In ACM Conference on Recommender Systems (RecSys), 2009.

Martens, J. New insights and perspectives on the natural gradient method. arXiv preprint arXiv:1412.1193, 2014.

Michel, J.-B., Shen, Y. K., Aiden, A. P., Veres, A., Gray, M. K., Pickett, J. P., Hoiberg, D., Clancy, D., Norvig, P., Orwant, J., et al. Quantitative analysis of culture using millions of digitized books. Science, 331(6014):176–182, 2011.

Mikolov, T., Chen, K., Corrado, G., and Dean, J. Efficient estimation of word representations in vector space. arXiv preprint arXiv:1301.3781, 2013a.

Mikolov, T., Sutskever, I., Chen, K., Corrado, G. S., and Dean, J. Distributed representations of words and phrases and their compositionality. In Advances in Neural Information Processing Systems 26, pp. 3111–3119, 2013b.

Mnih, A. and Salakhutdinov, R. R. Probabilistic matrix factorization. In Advances in Neural Information Processing Systems, pp. 1257–1264, 2008.

Murphy, K. P. Machine Learning: A Probabilistic Perspective. MIT Press, 2012.

Noessner, J., Niepert, M., and Stuckenschmidt, H. RockIt: Exploiting parallelism and symmetry for MAP inference in statistical relational models. In AAAI Workshop: Statistical Relational Artificial Intelligence, 2013.

Pascanu, R. and Bengio, Y. Revisiting natural gradient for deep networks. arXiv preprint arXiv:1301.3584, 2013.

Pennington, J., Socher, R., and Manning, C. GloVe: Global vectors for word representation. In Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP), pp. 1532–1543, 2014.

Poignard, C., Pereira, T., and Pade, J. P. Spectra of Laplacian matrices of weighted graphs: structural genericity properties. SIAM Journal on Applied Mathematics, 78(1):372–394, 2018.

Ranganath, R., Gerrish, S., and Blei, D. Black box variational inference. In Artificial Intelligence and Statistics, pp. 814–822, 2014.

Rudolph, M. and Blei, D. Dynamic Bernoulli embeddings for language evolution. arXiv preprint arXiv:1703.08052, 2017.

Sun, F., Guo, J., Lan, Y., Xu, J., and Cheng, X. Sparse word embeddings using L1 regularized online learning. In Proceedings of the Twenty-Fifth International Joint Conference on Artificial Intelligence, pp. 2915–2921, 2016.

Sun, J. Z., Varshney, K. R., and Subbian, K. Dynamic matrix factorization: A state space approach. In 2012 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 1897–1900, 2012.

Vinyals, O. and Povey, D. Krylov subspace descent for deep learning. In Artificial Intelligence and Statistics, pp. 1261–1268, 2012.

Zaheer, M., Kottur, S., Ravanbakhsh, S., Poczos, B., Salakhutdinov, R. R., and Smola, A. J. Deep sets. In Advances in Neural Information Processing Systems, pp. 3394–3404, 2017.

Zeiler, M. D. ADADELTA: An adaptive learning rate method. arXiv preprint arXiv:1212.5701, 2012.

