Incremental gradient method for Karcher mean on symmetric cones
Sangho Kum and Sangwoon Yun
Department of Mathematics Education, Chungbuk National University
2016 MAO (Workshop on Matrices and Operators)
July 5, 2016
Suites Hotel, Jeju
1. Riemannian center of mass or Karcher mean
A brief history
Definition 1. (Riemannian center of mass)
(M, d): an n-dimensional complete Riemannian manifold with distance d induced by the Riemannian structure.
ν : a probability measure on M.
$$f_2(x) = \frac{1}{2}\int d^2(x, s)\, d\nu(s).$$
Any minimizer of $f_2$ is called a Riemannian $L^2$ center of mass with respect to ν.
E. Cartan: In the 1920s, the Riemannian $L^2$ center of mass in an Hadamard manifold (the first one in the context of Riemannian geometry)
Existence and Uniqueness.
Any compact subgroup of the isometry group of an Hadamard manifold has a fixed point.
H. Karcher: The Riemannian $L^2$ center of mass in general Riemannian manifolds, but for probability measures with support in small enough balls
He enlarged the domain of existence and uniqueness and considered new applications.
More recently, the Karcher mean has found applications in many applied fields:
computer vision, statistical analysis of shapes, medical imaging, sensor networks, data analysis applications, and so on.
Existing methodologies
In these applied settings, an important problem is to numerically approximate or compute the Karcher mean.
Gradient descent methods
Proximal point methods (or incremental proximal methods)
Newton methods
Points to consider
But even though numerical algorithms have been developed in general Riemannian settings or beyond, some of them are not numerically implementable in a practical sense.
Quite often, algorithms on Riemannian manifolds seem to be conceptual, considering that applications are mainly concentrated on matrix cases.
In comparison, developing algorithms is more tangible in symmetric cone settings, as in the case of the positive semidefinite cone.
This is a main reason why we work within the framework of symmetric cones.
Jordan algebra
A Jordan algebra V over $\mathbb{R}$ is a (non-associative) commutative algebra satisfying $x^2(xy) = x(x^2y)$ for all $x, y \in V$.
For $x \in V$, let $L(x)$ be the linear operator defined by $L(x)y = xy$, and let $P(x) = 2L(x)^2 - L(x^2)$. The map P is called the quadratic representation of V.
An element $x \in V$ is said to be invertible if there exists an element y (denoted by $y = x^{-1}$) in the subalgebra generated by x and e (the Jordan identity) such that $xy = e$.
Euclidean Jordan algebra (finite dim'l symmetric cone)
A finite dimensional Jordan algebra V is said to be Euclidean if there exists an inner product $\langle \cdot, \cdot \rangle$ such that
$$\langle xy, z\rangle = \langle y, xz\rangle \qquad (1.1)$$
for all $x, y, z \in V$.
An element $c \in V$ is called an idempotent if $c^2 = c$. We say that $c_1, \ldots, c_k$ is a complete system of orthogonal idempotents if $c_i^2 = c_i$, $c_i c_j = 0$ for $i \neq j$, and $c_1 + \cdots + c_k = e$. An idempotent is primitive if it is non-zero and cannot be written as the sum of two non-zero idempotents. A Jordan frame is a complete system of primitive idempotents.
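For concreteness in the matrix algebra $S^n$ introduced below, the rank-one eigenprojections of a symmetric matrix form a Jordan frame; a minimal numpy sketch of this standard fact (example data and names are ours):

import numpy as np
from scipy.linalg import eigh

# For a symmetric matrix with distinct eigenvalues, the eigenprojections
# c_i = u_i u_i^T satisfy c_i^2 = c_i, c_i c_j = 0 (i != j), and sum_i c_i = e,
# i.e., they form a Jordan frame in S^n.
X = np.array([[2.0, 1.0],
              [1.0, 3.0]])
_, U = eigh(X)
c = [np.outer(U[:, i], U[:, i]) for i in range(2)]
print(np.allclose(c[0] @ c[0], c[0]),                # idempotent
      np.allclose(c[0] @ c[1], np.zeros((2, 2))),    # orthogonal
      np.allclose(c[0] + c[1], np.eye(2)))           # complete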
Let Q be the set of all square elements of V. Then Q is a closed convex cone of V with $Q \cap -Q = \{0\}$, and is the set of elements $x \in V$ such that $L(x)$ is positive semidefinite.
It turns out that Q has non-empty interior Ω, and Ω is a symmetric cone; that is, the group
$$G(\Omega) = \{g \in GL(V) \mid g(\Omega) = \Omega\}$$
acts transitively on it, and Ω is a self-dual cone with respect to the inner product $\langle \cdot, \cdot \rangle$.
Furthermore, for any a in Ω, $P(a) \in G(\Omega)$ and is positive definite.
Two typical examples
The second-order cone (SOC) is the closed convex cone
$$K := \{(x_1, x_2) \in \mathbb{R} \times \mathbb{R}^{n-1} \mid \|x_2\| \le x_1\}.$$
The Euclidean space $\mathbb{R}^n$ with the Jordan product defined by
$$x \circ y = (\langle x, y\rangle, \; x_1 y_2 + y_1 x_2)$$
is a Euclidean Jordan algebra equipped with the standard inner product $\langle \cdot, \cdot \rangle$, where $x = (x_1, x_2), y = (y_1, y_2) \in \mathbb{R} \times \mathbb{R}^{n-1}$.
K is the corresponding symmetric cone of the Euclidean Jordan algebra $\mathbb{R}^n$.
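As a sanity check, the SOC Jordan product can be coded directly from the definition; a minimal numpy sketch (function names are ours), verifying that squares land in K:

import numpy as np

def soc_product(x, y):
    # x o y = (<x, y>, x1*y2 + y1*x2) on R x R^{n-1}
    return np.concatenate(([x @ y], x[0] * y[1:] + y[0] * x[1:]))

def in_soc(x):
    # membership in K = {(x1, x2) : ||x2|| <= x1}
    return np.linalg.norm(x[1:]) <= x[0]

x = np.array([0.5, 1.0, -2.0])               # x itself is not in K
print(in_soc(x), in_soc(soc_product(x, x)))  # False True: x o x is a square, hence in K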
Let $S^n$ be the algebra of $n \times n$ real symmetric matrices with the Jordan product defined by
$$X \circ Y = \frac{XY + YX}{2},$$
where XY is the usual matrix product of X and Y.
Then $S^n$ is a Euclidean Jordan algebra equipped with the trace inner product
$$\langle X, Y\rangle = \mathrm{tr}(XY), \qquad P(X)Y = XYX.$$
The cone PD of positive definite matrices is the corresponding symmetric cone of the Euclidean Jordan algebra $S^n$.
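One can check numerically that the quadratic representation $P(X) = 2L(X)^2 - L(X^2)$ from the general definition reduces to $P(X)Y = XYX$ on $S^n$; a minimal numpy sketch (names ours):

import numpy as np

def L(X):
    # the multiplication operator L(X)Y = X o Y = (XY + YX)/2
    return lambda Y: (X @ Y + Y @ X) / 2

rng = np.random.default_rng(1)
X = rng.standard_normal((3, 3)); X = (X + X.T) / 2
Y = rng.standard_normal((3, 3)); Y = (Y + Y.T) / 2

PXY = 2 * L(X)(L(X)(Y)) - L(X @ X)(Y)    # P(X)Y = 2 L(X)^2 Y - L(X^2) Y
print(np.allclose(PXY, X @ Y @ X))        # True: P(X)Y = XYX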
Symmetric cone setting
The symmetric cone Ω admits a Riemannian metric defined by
$$\langle u, v\rangle_x = \langle P(x)^{-1}u, v\rangle, \qquad x \in \Omega, \; u, v \in V.$$
The Riemannian distance δ(a, b) is given by
$$\delta(a, b) = \|\log P(a^{-1/2})b\| = \Big(\sum_{i=1}^{r} \log^2 \lambda_i\big(P(a^{-1/2})b\big)\Big)^{1/2}.$$
The unique geodesic curve joining a and b is
$$t \mapsto a \#_t b := P(a^{1/2})\big(P(a^{-1/2})b\big)^{t}.$$
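In the positive definite matrix case, where $P(a)b = aba$, both δ and the geodesic can be evaluated with standard matrix functions. A minimal scipy sketch under that assumption (helper names ours); it is reused in later snippets:

import numpy as np
from scipy.linalg import logm, fractional_matrix_power

def delta(a, b):
    # delta(a, b) = || log P(a^{-1/2}) b ||, with P(x)y = xyx on S^n
    amh = fractional_matrix_power(a, -0.5)
    return np.linalg.norm(logm(amh @ b @ amh), 'fro')

def geodesic(a, b, t):
    # a #_t b = P(a^{1/2}) (P(a^{-1/2}) b)^t
    ah = fractional_matrix_power(a, 0.5)
    amh = fractional_matrix_power(a, -0.5)
    return ah @ fractional_matrix_power(amh @ b @ amh, t) @ ah

a = np.array([[2.0, 0.5], [0.5, 1.0]])
b = np.array([[1.0, 0.2], [0.2, 3.0]])
m = geodesic(a, b, 0.5)                      # the geometric mean a # b
print(np.isclose(delta(a, m), delta(m, b)))  # True: the midpoint is equidistant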
An important property of the metric δ is the semiparallelogram law
$$\delta^2(z, x\#y) \le \frac{1}{2}\delta^2(z, x) + \frac{1}{2}\delta^2(z, y) - \frac{1}{4}\delta^2(x, y)$$
and its general form for any $t \in [0, 1]$:
$$\delta^2(z, x\#_t y) \le (1-t)\delta^2(z, x) + t\,\delta^2(z, y) - t(1-t)\,\delta^2(x, y).$$
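The law can be checked numerically on random positive definite matrices, reusing the hypothetical delta and geodesic helpers from the earlier sketch:

import numpy as np

rng = np.random.default_rng(0)
def rand_spd(n):
    A = rng.standard_normal((n, n))
    return A @ A.T + np.eye(n)        # random positive definite matrix

x, y, z = rand_spd(3), rand_spd(3), rand_spd(3)
lhs = delta(z, geodesic(x, y, 0.5)) ** 2
rhs = 0.5 * delta(z, x) ** 2 + 0.5 * delta(z, y) ** 2 - 0.25 * delta(x, y) ** 2
print(lhs <= rhs + 1e-9)              # True: semiparallelogram law holds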
The Riemannian manifold (Ω, δ) is an important example of an Hadamard manifold.
In these circumstances, the Karcher mean reduces to the following:
Definition 2. (Karcher mean in symmetric cones)
The Karcher mean of $a_1, \ldots, a_n \in \Omega$ is defined to be the unique minimizer of the sum of squares of the Riemannian distances to each of the $a_i$, i.e.,
$$\Lambda(a_1, \ldots, a_n) = \operatorname*{argmin}_{x \in \Omega} \; \frac{1}{2}\sum_{i=1}^{n} \delta^2(x, a_i).$$
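With the distance helper above, the objective that Λ minimizes is immediate to evaluate; a small sketch (names ours):

def karcher_objective(x, mats):
    # (1/2) sum_i delta^2(x, a_i), minimized over Omega by Lambda(a_1, ..., a_n)
    return 0.5 * sum(delta(x, a) ** 2 for a in mats)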
2. Motivation and Problem formulation
Motivations
The objective function of the aforementioned minimization problem is the sum of many functions, i.e., the squares of the Riemannian distance functions to the given data $a_1, \ldots, a_n \in \Omega$:
$$\min_{x \in \Omega} f(x) := \sum_{i=1}^{n} f_i(x), \qquad (2.1)$$
where $f_i(x) = \frac{1}{2}\delta^2(x, a_i)$ with the $a_i$'s and x in Ω.
It is observed that the solution of the problem belongs to a bounded set $D = \{x \in \Omega \mid \alpha e \le x \le \beta e\}$, where $0 < \alpha \le \beta$, that contains $a_1, \ldots, a_n$.
Problem formulation
Thus, we consider the following bound constrained minimization problem formulation of (2.1):
$$\min_{x \in D} f(x). \qquad (2.2)$$
This problem formulation motivates us to adapt an incrementally updated gradient (IUG) method to solve the problem.
To our knowledge, this IUG method has not yet been adopted to deal with the problem of finding the Karcher mean.
3. Incrementally updated gradient (IUG) method
Incremental gradient (IG) method
When the number of $f_i$'s constituting the objective function $f = \sum_{i=1}^{n} f_i$ is large, traditional gradient methods would be inefficient, since they require evaluating all the gradients of the $f_i$'s before updating the iterate.
Incremental gradient methods, in contrast, update the iterate after evaluating the gradients of only one or a few smooth functions.
Blatt et al. proposed a method that computes the gradient of a single component function at each iteration; but instead of updating the iterate using this gradient alone, it uses the sum of the n most recently computed gradients, for the unconstrained smooth minimization case.
Assuming the uniform boundedness and Lipschitz continuity of all the gradients of the $f_i$'s, as well as the uniqueness of a stationary point and positive definiteness of the Hessian of f at that stationary point, the global convergence of this method with a sufficiently small stepsize was shown.
Blatt's method may be viewed as belonging to a general class of gradient methods that update the gradients of only one or a few $f_i$'s at a time, which we call the incrementally updated gradient (IUG) method.
IUG method
Recently, Tseng and Yun proposed two IUG methods to solve the nonsmooth minimization problem whose objective is the sum of n smooth functions and a convex function. They showed global convergence of the IUG method using a constant stepsize, assuming only the Lipschitz continuity of the gradient of each of the n smooth functions.
Compared to Blatt's method, the IUG method is more general, solves a more general problem, and its global convergence is shown under much weaker assumptions.
The second IUG method uses adaptive stepsizes and is hence more practical; it has a global convergence property similar to that of the first IUG method.
They generalized the previous IG method to handle constraints and nonsmooth regularization, and proved global convergence under much weaker assumptions.
For this IUG method, in the present paper, we work within the standard framework of Euclidean spaces rather than a Riemannian one, from a theoretical viewpoint.
This is mainly due to the fact that adding vectors lying in different tangent spaces of a Riemannian manifold is not possible. Even if it were possible using parallel transport, it may have no practical meaning in computation.
At present, it seems difficult to devise an effective IG method in a fully Riemannian sense on a symmetric cone.
4. IUG method for Karcher mean
IUG method for Karcher mean
The IUG method due to Tseng and Yun fits exactly the Karcher mean problem (2.1) and (2.2), where the number of smooth functions $f_i = \frac{1}{2}\delta^2(x, a_i)$ is large.
The following fact plays a key role in the present work:
Proposition 1.
$$\|\nabla f_i(y) - \nabla f_i(z)\| \le L_i \|y - z\| \qquad \forall y, z \in D, \qquad (4.1)$$
for some $L_i \ge 0$, $i = 1, \ldots, n$. Let $L = \sum_{i=1}^{n} L_i$.
Moreover, we make the following assumption:
Assumption.
$\tau_i^k \ge k - K$ for all i and k, where $K \ge 0$ is an integer.
The Assumption ensures that the gradient of each $f_i$ is updated at least once in every $K + 1$ consecutive iterations.
Algorithms
Algorithm 1.
Choose $x^0, x^{-1}, \ldots \in D$ and $t \in \, ]0, 1]$. Initialize $k = 0$. Update $x^{k+1}$ from $x^k$ by the following template:
Step 1. Choose $0 \le \tau_i^k \le k$ for $i = 1, \ldots, n$.
Step 2. Update $g^k$ by
$$g^k = \sum_{i=1}^{n} \nabla f_i\big(x^{\tau_i^k}\big). \qquad (4.2)$$
Step 3. Find $d^k$ by solving
$$d^k = \operatorname*{argmin}_{d \in V, \; x^k + d \in D} \Big\{ \langle g^k, d\rangle + \frac{1}{2}\|d\|^2 \Big\}. \qquad (4.3)$$
Step 4. $x^{k+1} = x^k + t d^k$.
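To make the template concrete in the positive definite matrix case, here is a minimal numpy/scipy sketch of Algorithm 1 with a cyclic choice of $\tau_i^k$ (so the Assumption holds with $K = n - 1$). The Euclidean gradient formula $\nabla f_i(x) = x^{-1/2} \log(x^{1/2} a_i^{-1} x^{1/2}) x^{-1/2}$ and the eigenvalue-clipping form of the projection onto D are standard facts we assume here; all names are ours:

import numpy as np
from scipy.linalg import eigh, logm, fractional_matrix_power

def grad_fi(x, ai):
    # assumed Euclidean gradient of f_i(x) = (1/2) delta^2(x, a_i) on S^n:
    # grad f_i(x) = x^{-1/2} log(x^{1/2} a_i^{-1} x^{1/2}) x^{-1/2}
    xh = fractional_matrix_power(x, 0.5)
    xmh = fractional_matrix_power(x, -0.5)
    return xmh @ logm(xh @ np.linalg.inv(ai) @ xh) @ xmh

def project_D(x, alpha, beta):
    # Frobenius projection onto D = {x : alpha e <= x <= beta e}:
    # clip the eigenvalues of x to [alpha, beta]
    w, u = eigh(x)
    return u @ np.diag(np.clip(w, alpha, beta)) @ u.T

def iug_algorithm1(mats, alpha, beta, t=0.05, iters=300):
    n = len(mats)
    x = sum(mats) / n                          # x^0, chosen in D
    grads = [grad_fi(x, a) for a in mats]      # stored component gradients
    for k in range(iters):
        i = k % n                              # Step 1: cyclic tau_i^k
        grads[i] = grad_fi(x, mats[i])         # refresh one component gradient
        g = sum(grads)                         # Step 2: g^k as in (4.2)
        d = project_D(x - g, alpha, beta) - x  # Step 3: closed form of (4.3)
        x = x + t * d                          # Step 4
    return x

The closed form in Step 3 follows by completing the square in (4.3): the subproblem minimizes $\frac{1}{2}\|(x^k + d) - (x^k - g^k)\|^2$ over $x^k + d \in D$.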
The above framework is quite flexible and allows partially asynchronous updating of the component gradients.
In the following lemma, we give a descent property of the minimization subproblem (4.3) for finding a search direction.
Lemma 1.
For any $x \in D$ and $g \in V$, let $d_g$ be the solution of the problem
$$\min_{d \in V, \; x + d \in D} \Big\{ \langle g, d\rangle + \frac{1}{2}\|d\|^2 \Big\}.$$
Then
$$\langle g, d_g\rangle + \frac{1}{2}\|d_g\|^2 \le -\frac{1}{2}\|d_g\|^2, \quad \text{i.e.,} \quad \langle g, d_g\rangle \le -\|d_g\|^2. \qquad (4.4)$$
An $x \in V$ is a stationary point of f if $x \in D$ and $f'(x; d) \ge 0$ for all $d \in V$.
The following result characterizes stationarity in terms of $d_{\nabla f(x)}$.
Lemma 2.
An $x \in D$ is a stationary point of f if and only if $d_{\nabla f(x)} = 0$.
Now, we have the following global convergence result for the methodwith a sufficiently small constant stepsize.
Theorem 1. (Constant Stepsize Case)
Let $\{x^k\}$ and $\{d^k\}$ be sequences generated by Algorithm 1 under the Assumption, with $t < \frac{2}{L(2K+1)}$. Then $d^k \to 0$ and every cluster point of $\{x^k\}$ is a stationary point.
We describe the second IUG method, with adaptive stepsizes, below.
Algorithm 2.
Choose $x^0, x^{-1}, \ldots \in D$, $t \in \, ]0, 1]$, $\beta \in \, ]0, 1[$, and $\sigma > \frac{1}{2}$. Initialize $k = 0$. Update $x^{k+1}$ from $x^k$ by the following template:
Step 1. Choose $0 \le \tau_i^k \le k$ for $i = 1, \ldots, n$.
Step 2. Update $g^k$ by (4.2).
Step 3. Find $d^k$ by using (4.3).
Step 4. Choose $t_k^{\mathrm{init}} \in [t, 1]$ and let $t_k$ be the largest element of $\{t_k^{\mathrm{init}}\beta^j\}_{j=0,1,\ldots}$ satisfying
$$f(x^k + t_k d^k) - f(x^k) \le -\sigma K L \|t_k d^k\|^2 + \frac{L}{2} \sum_{j=(k-K)_+}^{k-1} \|t_j d^j\|^2. \qquad (4.5)$$
Step 5. $x^{k+1} = x^k + t_k d^k$.
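Step 4 can be sketched as a backtracking loop on top of the Algorithm 1 code, with f evaluated through the delta helper above; a hedged sketch (parameter values and names are ours):

import numpy as np

def adaptive_step(f, x, d, hist, L, K, t_init=1.0, beta=0.5, sigma=0.6):
    # largest t_k in {t_init * beta^j : j = 0, 1, ...} satisfying (4.5);
    # hist holds ||t_j d^j||^2 for iterations (k-K)_+, ..., k-1
    slack = (L / 2) * sum(hist[-K:]) if K > 0 else 0.0
    tk, fx = t_init, f(x)
    while f(x + tk * d) - fx > -sigma * K * L * np.linalg.norm(tk * d) ** 2 + slack:
        tk *= beta                  # decrease t_k until (4.5) holds
    return tk

By Theorem 2(a) below, the loop terminates once $t_k$ falls below the threshold $\bar{t}$.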
Unlike in the first IUG method, the stepsize $t_k$ is adaptively selected by decreasing $t_k$ whenever condition (4.5) is violated.
In practice, the Lipschitz constant L is not given a priori, but we can estimate it by starting from an arbitrary estimate of L and increasing it by a certain positive factor whenever condition (4.5) is not satisfied.
When $t_k$ is below the threshold $\bar{t}$ defined in Theorem 2 below, condition (4.5) is satisfied with some constant L; whether L is the constant defined in Proposition 1 is irrelevant.
Theorem 2. (Adaptive Stepsize Case)
Let $\{x^k\}$, $\{d^k\}$, $\{t_k\}$ be sequences generated by Algorithm 2 under the Assumption. Then the following results hold.
(a) For each $k \ge 0$, (4.5) holds whenever $t_k \le \bar{t}$, where $\bar{t} = \frac{2}{L(2\sigma K + K + 1)}$.
(b) We have $t_k \ge \min\{t, \beta \bar{t}\}$ for all k.
(c) $d^k \to 0$ and every cluster point of $\{x^k\}$ is a stationary point.
Conclusions
In this paper we consider the IUG method for the Karcher mean, motivated by the observations that implementable algorithms for finding the Karcher mean in general settings beyond the matrix case are not as many as expected, and that the objective function of the considered minimization problem is the sum of many smooth functions.
We have shown the global convergence of the proposed methods, exploiting the Lipschitz continuity of the gradient of the objective function.
Even though our method is faster than steepest descent (SD), we need to further accelerate the proposed method so that it becomes more attractive.
Two directions may be taken into account.
First, a tight bound for the Lipschitz constant, or a scheme for adjusting to a better stepsize without evaluating the objective value, is necessary.
Second, a fully Riemannian version of the proposed incremental gradient method can be a better alternative.
References
[1] Karcher, H.: Riemannian center of mass and mollifier smoothing. Comm. Pure Appl. Math. 30, 509–541 (1977)
[2] Afsari, B., Tron, R., Vidal, R.: On the convergence of gradient descent for finding the Riemannian center of mass. SIAM J. Control Optim. 51, 2230–2260 (2013)
[3] Tseng, P., Yun, S.: Incrementally updated gradient methods for constrained and regularized optimization. J. Optim. Theory Appl. 160, 832–853 (2014)
[4] Kum, S., Lee, H., Lim, Y.: No dice theorem on symmetric cones. Taiwanese J. Math. 17, 1967–1982 (2013)
[5] Holbrook, J.: No dice: a deterministic approach to the Cartan centroid. J. Ramanujan Math. Soc. 27, 509–521 (2012)