Journal of Statistical Planning and Inference 150 (2014) 66–83
Inference on the eigenvalues of the covariance matrix of a multivariate normal distribution—Geometrical view
Yo Sheena, Faculty of Economics, Shinshu University, Japan
Article info
Article history: Received 22 October 2013; Received in revised form 5 March 2014; Accepted 11 March 2014; Available online 20 March 2014
MSC: Primary 62H05; Secondary 62F12
Keywords: Curved exponential family; Information loss; Fisher information metric; Embedding curvature; Affine connection; Positive definite matrix
http://dx.doi.org/10.1016/j.jspi.2014.03.004
Abstract

We consider inference on the eigenvalues of the covariance matrix of a multivariate normal distribution. The family of multivariate normal distributions with a fixed mean is seen as a Riemannian manifold with the Fisher information metric. Two submanifolds naturally arise: one is given by fixing the eigenvectors of the covariance matrix; the other by fixing the eigenvalues. We analyze the geometrical structures of these manifolds, such as the metric and the embedding curvature under the e-connection or m-connection. Based on these results, we study (1) the bias of the sample eigenvalues, (2) the asymptotic variance of estimators, (3) the asymptotic information loss caused by neglecting the sample eigenvectors, and (4) the derivation of a new estimator that is natural from a geometrical point of view.

© 2014 Elsevier B.V. All rights reserved.
1. Introduction
Consider a normal distribution with zero mean and an unknown covariance matrix, $N(0,\Sigma)$. Denote the eigenvalues of $\Sigma$ by
$$\lambda = (\lambda_1,\ldots,\lambda_p), \qquad \lambda_1 > \cdots > \lambda_p,$$
and the eigenvector matrix by $\Gamma$; hence we have the spectral decomposition
$$\Sigma = \Gamma \Lambda \Gamma^t, \qquad \Lambda = \mathrm{diag}(\lambda), \tag{1}$$
where $\mathrm{diag}(\lambda)$ denotes the diagonal matrix whose $i$th diagonal element is $\lambda_i$. Needless to say, inference on $\Sigma$ is an important task in many practical situations across fields such as engineering, biology, chemistry, finance, and psychology. In particular, we often encounter cases where the property of interest depends on $\Sigma$ only through its eigenvalues $\lambda$. We treat an inference problem on the eigenvalues $\lambda$ from a geometrical point of view.
Treating the family of normal distributions $N(\mu,\Sigma)$ ($\mu$ not necessarily zero) as a Riemannian manifold has been done by several authors; see, for example, Fletcher and Joshi (2007), Lenglet et al. (2006), Skovgaard (1984), Smith (2005), and Yoshizawa and Tanabe (1999). When $\mu$ equals zero, the family of normal distributions $N(0,\Sigma)$ can be taken as a manifold (say $S$) with a single coordinate system $\Sigma$. Hence $S$ is identified with the space of symmetric positive definite matrices. Geometrically analyzing the space of symmetric positive definite matrices has been an interesting topic from a mathematical or engineering point of view. Refer to Moakher and Zéraï (2011), Ohara et al. (1996) and Zhang et al. (2009) as well as the above literature.
In this paper, we analyze $S$ from the standpoint of information geometry while focusing on inference on the eigenvalues of $\Sigma$. The paper aims to contribute in two regards: (1) the geometrical structure of $S$ is analyzed in view of the eigenvalues and eigenvectors of $\Sigma$; (2) some statistical problems on the inference for $\lambda$ are explained in geometrical terms.
We summarize the inference problem for $\lambda$. Based on $n$ independent samples $x_i = (x_{i1},\ldots,x_{ip})^t$, $i=1,\ldots,n$, from $N(0,\Sigma)$, we want to make inference on the unknown $\lambda$. We confine ourselves to the classical case where $n \ge p$. It is well known that the product-sum matrix
$$S = \sum_{i=1}^{n} x_i x_i^t$$
is a sufficient statistic for both the unknown $\lambda$ and $\Gamma$. The spectral decomposition of $S$ is given by
$$S = H L H^t, \qquad L = \mathrm{diag}(l),$$
where
$$l = (l_1,\ldots,l_p), \qquad l_1 > \cdots > l_p > 0 \quad \text{a.e.}$$
are the eigenvalues of $S$, and $H$ is the corresponding eigenvector matrix. This decomposition gives us two available statistics, namely the sample eigenvalues $l$ and the sample eigenvectors $H$. However, it is almost customary to use only the sample eigenvalues, discarding the information contained in $H$. In the past literature on inference for the population eigenvalues, every notable estimator is based simply on the sample eigenvalues. See Takemura (1984), Dey and Srinivasan (1985), Haff (1991), and Yang and Berger (1994) for orthogonally invariant estimators of $\Sigma$; Dey (1988), Hydorn and Muirhead (1999), Jin (1993), and Sheena and Takemura (2011) for direct estimators of $\lambda$. Since we do not have enough space to state the concrete form of each estimator, we just mention Stein's estimator as a pioneering "shrinkage" estimator of $\Sigma$. In general, an orthogonally invariant estimator of $\Sigma$ is given by
$$\hat\Sigma = H \Phi H^t, \qquad \Phi = \mathrm{diag}(\phi_1(l),\ldots,\phi_p(l)). \tag{2}$$
The estimator of $\lambda$ is given by the eigenvalues of $\hat\Sigma$, that is, $(\phi_1(l),\ldots,\phi_p(l))$. The sample covariance matrix (the M.L.E.) $\bar S := n^{-1}S$ gives the estimator of $\lambda$ as $\phi_i(l) = n^{-1} l_i$, $i=1,\ldots,p$, while Stein's "shrinkage" estimator gives
$$\phi_i(l) = l_i/(n+p+1-2i), \qquad i=1,\ldots,p. \tag{3}$$
Stein's estimator assigns a lighter (heavier) weight to the larger (smaller) sample eigenvalues; hence the dispersion of $l$ is shrunk. This estimator is quite simple and performs much better than the M.L.E. (see Dey and Srinivasan, 1985). Unlike Stein's estimator, many estimators in the above literature are not explicitly given, or are too complicated for immediate use. Nonetheless they all share one common feature: the derived estimators of $\lambda$ depend only on $l$.
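As an illustration, Stein's weights in (3) are straightforward to compute from data. The following is a minimal sketch (the function name and the simulated data are ours, not the paper's):

```python
import numpy as np

def stein_eigenvalue_estimates(X):
    """Eigenvalue estimates from Stein's orthogonally invariant
    estimator, Eq. (3): phi_i = l_i / (n + p + 1 - 2i)."""
    n, p = X.shape
    S = X.T @ X                               # product-sum matrix
    l = np.sort(np.linalg.eigvalsh(S))[::-1]  # l_1 > ... > l_p
    i = np.arange(1, p + 1)
    return l / (n + p + 1 - 2 * i)

rng = np.random.default_rng(0)
X = rng.multivariate_normal(np.zeros(2), np.diag([4.0, 1.0]), size=50)
print(stein_eigenvalue_estimates(X))
```

Because the denominator decreases with $i$, the ratio of the largest to the smallest estimate is strictly smaller than for the raw eigenvalues, which is exactly the shrinkage effect described above.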
In a sense it is natural to implicitly associate the sample eigenvalues with the population eigenvalues, and the sample eigenvectors with their population counterpart. However, the sample eigenvalues alone are not sufficient for the unknown population eigenvalues. Therefore it is important to evaluate how much information is lost by neglecting the sample eigenvectors. Following Amari (1982), we gain an understanding of the asymptotic information loss in geometric terms such as the Fisher information metric and embedding curvatures.
Another statistically interesting topic is the bias of $n^{-1}l$. It is well known that $n^{-1}l$ is heavily biased, and the estimators mentioned above are all modifications of $n^{-1}l$ that correct the bias, that is, "shrinkage estimators." We show that the bias is closely related to the embedding curvatures. Moreover, the geometric structure of $S$ naturally leads us to a new estimator, which is also a shrinkage estimator.
The organization of this paper is as follows. In the former part (Sections 2 and 3), we describe the geometrical structure of $S$ in view of the spectral decomposition (1). In Section 2, we observe $S$ as a Riemannian manifold endowed with the Fisher information metric. In Section 3, we treat two submanifolds of $S$: a submanifold given by the fixed eigenvectors and one given by the fixed eigenvalues. The embedding curvatures of these submanifolds are explicitly given. We show that the bias of $l$ is closely related to the curvatures. In the latter part (Sections 4 and 5), we consider the estimation problem of $\lambda$. In Section 4, we describe the asymptotic variance of estimators when $\Gamma$ is known (Section 4.1) and the asymptotic information loss caused by discarding the sample eigenvectors $H$ (Section 4.2). The asymptotic information loss can be measured by the difference in asymptotic variance between two certain estimators. In Section 5, for the case when $\Gamma$ is unknown, we propose a new estimator of $\lambda$, which is naturally derived from a geometric point of view. In the last section, some comments are made on further research. All proofs are collected in the Appendix.
Unfortunately we do not have enough space to explain the geometrical concepts used in this paper. Refer to Boothby (2002), Amari (1985), and Amari and Nagaoka (2000).
2. Riemannian manifold and metric
The density of the normal distribution $N(0,\Sigma)$ is given by
$$f_\Sigma(x) = (2\pi)^{-p/2} |\Sigma|^{-1/2} \exp\left(-\tfrac12 x^t \Sigma^{-1} x\right), \qquad x = (x_1,\ldots,x_p) \in \mathbb{R}^p.$$
If we let $\sigma_{ij}$ and $\sigma^{ij}$ denote the $(i,j)$ elements of $\Sigma$ and $\Sigma^{-1}$ respectively, then the log-likelihood equals
$$\log f_\Sigma(x) = \sum_i x_i^2 (-\sigma^{ii}/2) + \sum_{i<j} x_i x_j (-\sigma^{ij}) - (p/2)\log 2\pi - (1/2)\log|\Sigma| = \sum_i y_{ii}\theta^{ii} + \sum_{i<j} y_{ij}\theta^{ij} - \psi(\Theta) \quad (\text{say } l(y,\Theta)), \tag{4}$$
where $\Theta = (\theta^{ij})_{i\le j}$ and $y = (y_{ij})_{i\le j}$ are given by
$$\theta^{ii} = (-1/2)\sigma^{ii}, \quad i=1,\ldots,p; \qquad \theta^{ij} = -\sigma^{ij}, \quad 1\le i<j\le p;$$
$$y_{ii} = x_i^2, \quad i=1,\ldots,p; \qquad y_{ij} = x_i x_j, \quad 1\le i<j\le p, \tag{5}$$
and
$$\psi(\Theta) = (p/2)\log 2\pi + (1/2)\log|\Sigma(\Theta)|. \tag{6}$$
The summations $\sum_i$ and $\sum_{i<j}$ in Eq. (4) are abbreviations for $\sum_{i=1}^p$ and $\sum_{1\le i<j\le p}$ respectively, and we will use these kinds of notations implicitly hereafter.
The expression (4) gives the natural coordinate system $\Theta$ of the manifold $S$ as a full exponential family. Another coordinate system, the so-called expectation parameters, is also useful; it is defined as
$$\sigma_{ij} = E(y_{ij}), \qquad 1\le i\le j\le p. \tag{7}$$
For the analysis of the information carried by $l$ and $H$, we need to prepare another coordinate system. The matrix exponential expression of an orthogonal matrix $O$ is given by
$$O = \exp U = I_p + U + \tfrac{1}{2}U^2 + \tfrac{1}{3!}U^3 + \cdots, \tag{8}$$
where $I_p$ is the $p$-dimensional unit matrix and $U$ is a skew-symmetric matrix parametrized by $u = (u_{ij})_{1\le i<j\le p}$ as
$$(U)_{ij} = \begin{cases} u_{ij} & \text{if } 1\le i<j\le p,\\ -u_{ij} & \text{if } 1\le j<i\le p,\\ 0 & \text{if } 1\le i=j\le p. \end{cases}$$
The function $\exp U$ is diffeomorphic, and $u$ gives a "normal coordinate" for the group of orthogonal matrices (see Boothby, 2002, (6.7), or Muirhead, 1982, Theorem A9.11). We can use this coordinate as a local system around $I_p$ and construct an atlas for the entire space of $p$-dimensional orthogonal matrices (note that this space is compact); for each $\Gamma$, there exist an open neighborhood and some open ball $B$ in $\mathbb{R}^{p(p-1)/2}$ around the origin such that these spaces are diffeomorphic under the function $\Gamma \exp U(u)$ on $B$.
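The parametrization (8) is easy to realize numerically. Below is a minimal sketch assuming NumPy; `skew_from_u` and `orthogonal_from_u` are hypothetical helper names, and the exponential series is simply truncated:

```python
import numpy as np

def skew_from_u(u, p):
    """Skew-symmetric U of Eq. (8) built from coordinates u = (u_ij), i < j."""
    U = np.zeros((p, p))
    U[np.triu_indices(p, k=1)] = u
    return U - U.T

def orthogonal_from_u(u, p, terms=30):
    """exp U via the truncated series I_p + U + U^2/2! + ... of Eq. (8)."""
    U = skew_from_u(u, p)
    O = np.eye(p)
    term = np.eye(p)
    for k in range(1, terms):
        term = term @ U / k
        O = O + term
    return O
```

For $p=2$ the result is the plane rotation by the single coordinate $u_{(1,2)}$, and $O O^t = I_2$ up to truncation error, illustrating that $\exp U$ is orthogonal.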
We will use $(\lambda, u)$ as the third coordinate system of $S$ and call it the "spectral coordinate (system)". Notice that this coordinate system is associated with the following submanifolds of $S$. If we fix $\Gamma$ in (1), we get a submanifold $M(\Gamma)$ embedded in $S$ with coordinate system $\lambda$. This is a subfamily of $N(0,\Sigma)$, called a curved exponential family. Its log-likelihood is expressed, emphasizing it as a function of $\lambda$, as
$$l(y,\Theta(\lambda)) = \sum_i y_{ii}\theta^{ii}(\lambda) + \sum_{i<j} y_{ij}\theta^{ij}(\lambda) - \psi(\Theta(\lambda)). \tag{9}$$
Conversely, if we fix $\lambda$ in (1), we get another submanifold $A(\lambda)$ of $S$, whose coordinate system is given by $u$ in a neighborhood of each point of $A(\lambda)$. Its log-likelihood expression is given by
$$l(y,\Theta(u)) = \sum_i y_{ii}\theta^{ii}(u) + \sum_{i<j} y_{ij}\theta^{ij}(u) - \psi(\Theta(u)). \tag{10}$$
First we consider a metric, that is, a field of symmetric, positive definite, bilinear forms on $S$. The statistically most natural metric is the Fisher information metric. Suppose $\{f(x,\theta)\}$ is a parametric family of probability density functions whose coordinate as a manifold is given by $\theta = (\theta_1,\ldots,\theta_p)$. Then the $(i,j)$ component of the Fisher information metric with respect to $\theta$ is given by
$$E_\theta\left[\frac{\partial}{\partial\theta_i}\log f(x,\theta)\,\frac{\partial}{\partial\theta_j}\log f(x,\theta)\right].$$
For the multivariate normal distribution family $N(\mu,\Sigma)$ (with the mean parameter $\mu$ also included), Skovgaard (1984) gives a clear form of the Fisher information metric. The tangent vector space at a fixed point $\Sigma$ w.r.t. the $(\sigma_{ij})_{i\le j}$ coordinate can be identified with the space of symmetric matrices. For any symmetric matrices $A$, $B$, the metric with respect to the $\Sigma = (\sigma_{ij})$ coordinate system is given by
$$\tfrac12 \operatorname{tr}\left(\Sigma^{-1} A\, \Sigma^{-1} B\right). \tag{11}$$
We are interested in the Fisher information metric with respect to the spectral coordinate $(\lambda, u)$. Let $\partial_a, \partial_b, \ldots$ denote the tangent vectors w.r.t. the $\lambda$ coordinate, and $\partial_{(s,t)}, \partial_{(u,v)}, \ldots$ the tangent vectors w.r.t. the $u$ coordinate; namely
$$\partial_a := \frac{\partial}{\partial\lambda_a}, \qquad \partial_{(s,t)} := \frac{\partial}{\partial u_{st}}.$$
These tangent vectors (strictly speaking, vector fields) are invariant with respect to any orthogonal transformation of $\Sigma$. For an orthogonal matrix $O$, an orthogonal transformation $F$ of $S$ is defined as
$$F(\Sigma) = O \Sigma O^t. \tag{12}$$
For any $O$,
$$F_*(\partial_a) = \partial_a, \qquad 1\le a\le p, \tag{13}$$
$$F_*(\partial_{(s,t)}) = \partial_{(s,t)}, \qquad 1\le s<t\le p, \tag{14}$$
where $F_*$ is the derivative of $F$.
Proposition 1. Let $\langle\cdot,\cdot\rangle$ denote the Fisher information metric based on $x \sim N(0,\Sigma)$. Then the components of the metric with respect to $(\lambda,u)$ are given as follows:
$$g_{ab} := \langle\partial_a,\partial_b\rangle = (1/2)\lambda_a^{-2}\,\delta(a=b), \qquad 1\le a,b\le p,$$
$$g_{a(s,t)} := \langle\partial_a,\partial_{(s,t)}\rangle = 0, \qquad 1\le a\le p,\ 1\le s<t\le p,$$
$$g_{(s,t)(u,v)} := \langle\partial_{(s,t)},\partial_{(u,v)}\rangle = (\lambda_s-\lambda_t)^2 \lambda_s^{-1}\lambda_t^{-1}\,\delta((s,t)=(u,v)), \qquad 1\le s<t\le p,\ 1\le u<v\le p.$$
Here $\delta(\cdot)$ equals one if the statement inside the parentheses is true, and zero otherwise.
There are two remarkable properties of the metric in the spectral coordinate. First, since the matrix of metric components is diagonal, $(\lambda,u)$ is an orthogonal coordinate system; in particular, the submanifolds $M(\Gamma)$ and $A(\lambda)$ are orthogonal to each other for any $\lambda$ and $\Gamma$. Second, the metric is independent of $\Gamma$; hence it stays constant under the orthogonal transformation $F$ in (12) for any orthogonal matrix $O$. (The second property is instantly derived from the expression (11).)
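The components in Proposition 1 can be checked numerically from the matrix form (11) of the metric, using the derivatives of $\Sigma = (\Gamma \exp U)\,\Lambda\,(\Gamma \exp U)^t$ at $u=0$ with $\Gamma = I$ as tangent directions. A sketch for $p=2$ (function and variable names are ours):

```python
import numpy as np

def fisher_metric(Sigma, A, B):
    """Fisher information metric (11): (1/2) tr(Sigma^{-1} A Sigma^{-1} B)."""
    Si = np.linalg.inv(Sigma)
    return 0.5 * np.trace(Si @ A @ Si @ B)

lam = np.array([3.0, 1.0])
Sigma = np.diag(lam)                       # the point (lambda, Gamma) with Gamma = I

# Tangent vector d_1: derivative of Sigma w.r.t. lambda_1
dLam1 = np.diag([1.0, 0.0])
# Tangent vector d_(1,2): derivative of (exp U) Lambda (exp U)^t at u = 0,
# which equals U Lambda - Lambda U for U = [[0, 1], [-1, 0]]
U = np.array([[0.0, 1.0], [-1.0, 0.0]])
dU = U @ Sigma - Sigma @ U

g11 = fisher_metric(Sigma, dLam1, dLam1)   # Proposition 1: (1/2) lambda_1^{-2}
g_uu = fisher_metric(Sigma, dU, dU)        # (lambda_1 - lambda_2)^2 / (lambda_1 lambda_2)
g_cross = fisher_metric(Sigma, dLam1, dU)  # 0: the coordinates are orthogonal
```

The three computed values agree with the three formulas of Proposition 1 at this point.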
Theoretically, other metrics could naturally be implemented. Calvo and Oller (1990) introduced the Siegel metric. Lovrić et al. (2000) considered the natural invariant metric from the standpoint of Riemannian symmetric spaces. The concrete forms of both metrics are given by (3.4) and (3.2) in Lovrić et al. (2000). (The information metric (11) corresponds to (3.3) in Lovrić et al., 2000; see also Theorem 1 of Zhang et al., 2009.)
Once a metric is given on the manifold $S$, a connection is needed for further geometrical analysis. A connection is an important "rule" that defines how a tangent space is shifted under an infinitesimal move in a differentiable manifold. Although connections admit infinite variety, the most commonly used one is the Levi-Civita connection. It is characterized as the unique torsion-free, metric-preserving connection. This connection is essential for considering a distance function on the manifold. Skovgaard (1984), Calvo and Oller (1990), Fletcher and Joshi (2007), Lenglet et al. (2006), Lovrić et al. (2000), and Moakher and Zéraï (2011) analyze the manifold of normal distributions under the Levi-Civita connection.
On the other hand, Amari (1982) showed that the "α-connection" is suitable for statistical manifolds in general. He also found that the e-connection (α=1) and the m-connection (α=−1) are especially important for the asymptotic analysis of information loss in a curved exponential family. Amari and Kumon (1983), Kumon and Amari (1983) and Eguchi (1985) gave further developments along this line. Specifically in relation to the multivariate normal distribution or $S$, Ohara et al. (1996), Yoshizawa and Tanabe (1999) and Zhang et al. (2009) considered the dual geometry (α- and −α-connections) of the manifolds. Notice that the Levi-Civita connection is the 0-connection, the "mean" of the e-connection and m-connection. Therefore, using the results on the geometric properties of $S$ under the e-connection and m-connection, we can also derive those under the Levi-Civita connection.
Since this paper is aimed at statistical inference on $\Sigma$, we adopt α-connections, especially the e- and m-connections, hereafter. We conclude this section by mentioning the important fact that $S$ is e-flat and m-flat, with corresponding affine coordinates given respectively by $(\sigma^{ij})$ and $(\sigma_{ij})$.
3. Embedding curvatures
Curvature, an important property for geometrical analysis, is defined based on a given connection. A submanifold has both intrinsic and extrinsic curvatures. The latter describes how the submanifold is placed in the ambient manifold, and is called an embedding curvature or the second fundamental form. (The first fundamental form is the metric.)
In this section, we observe the embedding curvatures of $M$ and $A$ for the analysis of the distribution of $(l,H)$. Specifically, we consider the following embedding curvatures:
1. The embedding curvature of $M$ with respect to the e-connection or m-connection. Its components w.r.t. the spectral coordinate are given by
$$H^e_{ab(s,t)} := \langle \nabla^e_{\partial_a}\partial_b,\ \partial_{(s,t)}\rangle, \qquad H^m_{ab(s,t)} := \langle \nabla^m_{\partial_a}\partial_b,\ \partial_{(s,t)}\rangle, \tag{15}$$
where $\nabla^e_{\partial_a}\partial_b$ is the covariant derivative of $\partial_b$ in the direction of $\partial_a$ with respect to the e-connection; $\nabla^m_{\partial_a}\partial_b$ is similarly defined.

2. The embedding curvature of $A$ with respect to the m-connection. Its components w.r.t. the spectral coordinate are given by
$$H^m_{(s,t)(u,v)a} := \langle \nabla^m_{\partial_{(s,t)}}\partial_{(u,v)},\ \partial_a\rangle, \tag{16}$$
where $\nabla^m_{\partial_{(s,t)}}\partial_{(u,v)}$ is the covariant derivative of $\partial_{(u,v)}$ in the direction of $\partial_{(s,t)}$ with respect to the m-connection.
For these curvatures at the point $(\lambda,\Gamma)$, we have the following results.
Proposition 2. For $1\le a,b\le p$, $1\le s<t\le p$,
$$H^e_{ab(s,t)} = H^m_{ab(s,t)} = 0. \tag{17}$$
For $1\le a\le p$, $1\le s<t\le p$, $1\le u<v\le p$,
$$H^m_{(s,t)(u,v)a} = \begin{cases} \lambda_a^{-2}(\lambda_t-\lambda_a) & \text{if } s=u=a,\ t=v,\\ \lambda_a^{-2}(\lambda_s-\lambda_a) & \text{if } s=u,\ t=v=a,\\ 0 & \text{otherwise}. \end{cases} \tag{18}$$
Another expression of the embedding curvature of $A$ is given by
$$H^{am}_{(s,t)(u,v)} := \sum_b H^m_{(s,t)(u,v)b}\, g^{ba}. \tag{19}$$
With this notation, the orthogonal projection of the covariant derivative $\nabla^m_{\partial_{(s,t)}}\partial_{(u,v)}$ onto the tangent space of $M$ is given by
$$\sum_a H^{am}_{(s,t)(u,v)}\,\partial_a.$$
From Propositions 1 and 2, we have
$$H^{am}_{(s,t)(u,v)} = 2(\lambda_t-\lambda_a)\,\delta(s=u=a,\ t=v) + 2(\lambda_s-\lambda_a)\,\delta(s=u,\ t=v=a); \tag{20}$$
hence
$$\sum_a H^{am}_{(s,t)(u,v)}\,\partial_a = \begin{cases} 2(\lambda_t-\lambda_s)\partial_s + 2(\lambda_s-\lambda_t)\partial_t & \text{if } (s,t)=(u,v),\\ 0 & \text{otherwise}. \end{cases}$$
Similarly, another embedding curvature component $H^{(s,t)e}_{ab}$ is defined as
$$H^{(s,t)e}_{ab} = \sum_{u<v} H^e_{ab(u,v)}\, g^{(u,v)(s,t)}, \tag{21}$$
and it actually vanishes:
$$H^{(s,t)e}_{ab} = 0, \qquad 1\le a,b\le p,\ 1\le s<t\le p. \tag{22}$$
An embedding curvature has full information about the "extrinsic curvature" of the embedded submanifold in any direction. Sometimes it is convenient to compress it into a scalar measure of curvature. The "statistical curvature" of Efron (see Efron, 1975; Murray and Rice, 1993) is such a measure. For $A$, it is defined by (see Amari, 1985, p. 159)
$$\gamma(A) := \sum_{1\le a,b\le p}\ \sum_{s<t,\ u<v,\ o<p,\ q<r} H^m_{(s,t)(u,v)a}\, H^m_{(o,p)(q,r)b}\, g^{(s,t)(o,p)}\, g^{(u,v)(q,r)}\, g^{ab},$$
which attains the following value at the point $(\lambda,\Gamma)$.
Corollary 1.
$$\gamma(A) = 2 \sum_{a<b} \frac{\lambda_a^2 + \lambda_b^2}{(\lambda_a - \lambda_b)^2}.$$
From these results, we notice that if $S$ is endowed with the m-connection, then (1) the embedding curvatures and the statistical curvature of $A$ are independent of $\Gamma$; (2) any one-parameter curve $(\lambda,\Gamma(u))$ parametrized by $u_{(s,t)}$, $s<t$, with $\lambda$ and the other elements of $u$ fixed, is curved in the direction of $\partial_t - \partial_s$ and contained in a two-dimensional plane spanned by $\partial_{(s,t)}$ and $\partial_t - \partial_s$; (3) the statistical curvature of $A$ can be quite large when the $\lambda$'s are close to each other, while $M$ is flat everywhere.
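Corollary 1 is easy to evaluate numerically; the sketch below (our naming) shows how $\gamma(A)$ blows up as the eigenvalues approach each other:

```python
import numpy as np

def statistical_curvature(lam):
    """gamma(A) from Corollary 1: 2 * sum_{a<b} (lam_a^2 + lam_b^2)/(lam_a - lam_b)^2."""
    lam = np.asarray(lam, dtype=float)
    p = len(lam)
    return 2.0 * sum((lam[a] ** 2 + lam[b] ** 2) / (lam[a] - lam[b]) ** 2
                     for a in range(p) for b in range(a + 1, p))

# gamma(A) diverges as the eigenvalues approach one another
print(statistical_curvature([1.0, 0.5]), statistical_curvature([1.0, 0.9]))
```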
Here we introduce another submanifold $\tilde A$, contrasting with $A$ in the sense that $\tilde A$ is flat with respect to the m-connection. For a point $(\lambda,\Gamma)$, let
$$\tilde A(\lambda,\Gamma) := \{\Sigma \in S \mid (\Gamma^t \Sigma \Gamma)_{ii} = \lambda_i,\ 1\le \forall i\le p\}.$$
We easily notice that $\tilde A$ consists of the minimum-distance points with respect to the Kullback-Leibler divergence. That is,
$$\tilde A(\lambda,\Gamma) = \{\Sigma \in S \mid \arg\min_{\tilde\lambda} KL(\Sigma,\ \Gamma\,\mathrm{diag}(\tilde\lambda_1,\ldots,\tilde\lambda_p)\,\Gamma^t) = \lambda\},$$
where $KL(\Sigma,\tilde\Sigma)$ is the Kullback-Leibler divergence between $N(0,\Sigma)$ and $N(0,\tilde\Sigma)$, which is specifically given by
$$\operatorname{tr}(\Sigma\tilde\Sigma^{-1}) - \log|\Sigma\tilde\Sigma^{-1}| - p.$$
The minimum-distance points with respect to the Kullback-Leibler divergence consist of all points on the m-geodesics which pass through the point $(\lambda,\Gamma)$ and are orthogonal to $M(\Gamma)$ at that point. (See the Theorem in A2 of Amari, 1982.)
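The divergence above and the projection property defining $\tilde A$ can be checked numerically; a sketch (our naming, with $\Gamma = I$ for simplicity; the stated form is twice the usual Kullback-Leibler divergence between $N(0,\Sigma)$ and $N(0,\tilde\Sigma)$):

```python
import numpy as np

def kl_divergence(Sigma, Sigma_t):
    """The paper's divergence: tr(Sigma Sigma_t^{-1}) - log|Sigma Sigma_t^{-1}| - p."""
    p = Sigma.shape[0]
    M = Sigma @ np.linalg.inv(Sigma_t)
    return np.trace(M) - np.log(np.linalg.det(M)) - p

Sigma = np.array([[2.0, 0.5], [0.5, 1.0]])
# With Gamma = I, the minimizer over diag(lam_tilde) should be the diagonal of Sigma,
# i.e. lam_tilde_i = (Gamma^t Sigma Gamma)_ii, as in the definition of A-tilde.
lam_star = np.diag(Sigma).copy()
base = kl_divergence(Sigma, np.diag(lam_star))
```

Perturbing the candidate $\tilde\lambda$ away from the diagonal entries strictly increases the divergence, confirming the minimum-distance characterization.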
We can visualize the structure of $S$ endowed with the m-connection in the two-dimensional case. See Fig. 1, where $M_i := M(\Gamma_i)$, $i=1,\ldots,3$, $A_i := A(\lambda^i)$, $i=1,2$, and $\tilde A_1 := \tilde A(\lambda^1,\Gamma_1)$ are drawn. When $p=2$, $M$ is a two-dimensional autoparallel submanifold with the affine coordinate $(\lambda_1,\lambda_2)$, while $A$ is a one-dimensional submanifold with the coordinate $u_{(1,2)}$. As seen in Proposition 1, the tangent vectors $\partial_1 (:= \partial/\partial\lambda_1)$, $\partial_2 (:= \partial/\partial\lambda_2)$, $\partial_{(1,2)} (:= \partial/\partial u_{(1,2)})$ are all orthogonal to each other. $\tilde A$ is a "straight" line which is also orthogonal to $M$. The arrow on $M$ is the line $\{\lambda \mid \lambda_1+\lambda_2 \text{ is constant}\}$, and the arrowhead indicates the direction in which $c := \lambda_2/\lambda_1$ increases. The statistical curvature turns out to be an increasing function of $c$:
$$\gamma(A) = 2\,\frac{1+c^2}{(1-c)^2}.$$
We can analyze the bias of $\bar l_i := n^{-1}l_i$, $i=1,\ldots,p$, from the geometrical structure of $S$. It is well known that $E[\bar l_i]$ $(i=1,\ldots,p)$ majorizes $\lambda_i$ $(i=1,\ldots,p)$, that is,
$$\sum_{i=1}^{j} E[\bar l_i] \ge \sum_{i=1}^{j} \lambda_i, \quad 1\le \forall j\le p-1; \qquad \sum_{i=1}^{p} E[\bar l_i] = \sum_{i=1}^{p} \lambda_i. \tag{23}$$
The bias of $\bar l_i$ is quite large when $n$ is small and the $\lambda_i$'s are close to each other (see Lawley, 1956; Anderson, 1965). For the case $p=2$,
$$E[\bar l_1] \ge \lambda_1, \qquad E[\bar l_2] \le \lambda_2, \qquad E[\bar l_1] + E[\bar l_2] = \lambda_1 + \lambda_2. \tag{24}$$
Suppose a sample $\bar S := n^{-1}S$ takes a value at a point $s \in S$. Let $s_1$ denote the point on $M(\Gamma)$ designated by the eigenvalues of $\bar S$, namely $\bar l := (\bar l_1, \bar l_2)$. The curve $A(\bar l)$ connects $s$ and $s_1$. If we define $s_2$ as the point on $M(\Gamma)$ designated by $\bar\lambda := (\bar\lambda_1,\bar\lambda_2) := ((\Gamma^t \bar S\, \Gamma)_{11}, (\Gamma^t \bar S\, \Gamma)_{22})$, then $\tilde A(\bar\lambda,\Gamma)$ connects $s$ and $s_2$. The three points $s$, $s_1$ and $s_2$ are on the same plane, and if we move from $s_1$ in the direction of $s_2$, the statistical curvature of $A$ increases (see Fig. 2). If we estimate $(\lambda_1,\lambda_2)$ by $\bar l$, the estimate is the point $s_1$, while for the unbiased estimator $\bar\lambda$, the estimate is the point $s_2$. Since the $c$-coordinate of $s_1$ is always smaller than that of $s_2$, the estimator $(\bar l_1, \bar l_2)$ is likely to estimate $\lambda_1$ and $\lambda_2$ too far apart, which causes the bias (24). It is also seen that the bias gets larger as $c$ approaches one, that is, as $\lambda_1$ and $\lambda_2$ get closer to each other.

Fig. 1. Submanifolds of $S$ when $p=2$: $M$, $A$ and $\tilde A$.

Fig. 2. Horizontal perspective of $A$ and $\tilde A$ on the plane $M$ when $p=2$.
Though the exact magnitude of the bias $E(\bar l_a) - \lambda_a$ is hard to evaluate, the asymptotic bias can be evaluated. It can also be described with embedding curvatures (see Amari, 1985, (5.4)):
$$E[\bar l_a - \lambda_a] = -\frac{1}{2n}\, C_a + O(n^{-3/2}),$$
where
$$C_a = \sum_{c,d} \Gamma^{am}_{cd}\, g^{cd} + \sum_{s<t,\ u<v} H^{am}_{(s,t)(u,v)}\, g^{(s,t)(u,v)},$$
and $\Gamma^{am}_{cd}$ are the m-connection coefficients of $M$, defined by
$$\Gamma^{am}_{cd} = \Gamma^m_{cdb}\, g^{ba}, \qquad \Gamma^m_{cdb} := \langle \nabla^m_{\partial_c}\partial_d,\ \partial_b\rangle. \tag{25}$$
Since $M$ is autoparallel in the m-flat $S$,
$$\Gamma^{am}_{cd} = \Gamma^m_{cdb} = 0, \qquad 1\le a,b,c,d\le p. \tag{26}$$
Hence we have the following equation from Proposition 1 and (20):
$$C_a(\lambda) = \sum_{a<t} H^{am}_{(a,t)(a,t)}\, g^{(a,t)(a,t)} + \sum_{s<a} H^{am}_{(s,a)(s,a)}\, g^{(s,a)(s,a)} = 2 \sum_{t\ne a} \frac{\lambda_a \lambda_t}{\lambda_t - \lambda_a}. \tag{27}$$
This bias was originally derived by the perturbation method in Lawley (1956).
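The asymptotic bias formula (27) can be turned into a quick numerical check (our naming; the sign convention follows $E[\bar l_a] - \lambda_a \approx -C_a/(2n)$):

```python
import numpy as np

def asymptotic_bias_term(lam):
    """C_a(lambda) of Eq. (27): C_a = 2 * sum_{t != a} lam_a lam_t / (lam_t - lam_a)."""
    lam = np.asarray(lam, dtype=float)
    p = len(lam)
    C = np.zeros(p)
    for a in range(p):
        for t in range(p):
            if t != a:
                C[a] += 2.0 * lam[a] * lam[t] / (lam[t] - lam[a])
    return C

n = 50
approx_bias = -asymptotic_bias_term([2.0, 1.0]) / (2 * n)   # E[lbar_a] - lam_a
```

The largest eigenvalue is biased upward and the smallest downward, and the biases cancel in the sum, consistent with the trace equality in (23) and with (24).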
4. Estimation of λ when Γ is known
We consider the estimation problem when $\Gamma$ is known to be $\Gamma_0$. From a practical point of view, the case when $\Gamma$ is known is not of much interest compared to the general case where both $\Gamma$ and $\lambda$ are unknown. However, as we show in this section, the asymptotic information loss caused by discarding the sample eigenvectors (Section 4.2) is closely related to the difference in asymptotic variance between two certain estimators (Section 4.1). Both the asymptotic variance and the information loss are described in geometrical terms.
4.1. Asymptotic variance of the estimators of λ
In general terms, the subfamily (submanifold) $M(\Gamma_0)\ (:= \{\Sigma\in S \mid \Gamma(\Sigma)=\Gamma_0\})$ in $S$ is a "curved" exponential family, since it is a subfamily of the exponential family $S$. In the usual case a subfamily is not "flat," hence the term "curved." However, as can be seen from (17), $M(\Gamma_0)$ is autoparallel in the m(e)-flat $S$, and intrinsically m(e)-flat (see e.g. Amari and Nagaoka, 2000, Theorem 1.1).
We are to estimate the unknown coordinate $\lambda$ of $M(\Gamma_0)$ using an estimator $\hat\lambda = (\hat\lambda_1,\ldots,\hat\lambda_p)$ of some kind. An estimator $\hat\lambda(\bar S)$ is specified by its inverse image $\hat\lambda^{-1}(\lambda)$:
$$A(\lambda) := \hat\lambda^{-1}(\lambda) = \{\Sigma\in S \mid \hat\lambda(\Sigma)=\lambda\}. \tag{28}$$
This is another submanifold of $S$, on which we will use $u$ as a coordinate system. A consistent estimator $\hat\lambda$ is called first-order (Fisher) efficient if the first-order term (i.e. the $O(n^{-1})$ term) in the asymptotic expansion of the variance (covariance) in $n$ is minimized among all (regular) estimators. Correct the bias of the first-order efficient estimator $\hat\lambda$ up to the term of order $n^{-1}$, and denote the result by $\hat\lambda^* := (\hat\lambda^*_1,\ldots,\hat\lambda^*_p)$. Amari showed (see e.g. Amari and Nagaoka, 2000, Theorem 4.4) that its asymptotic variance can be described by geometrical properties such as the metric and the embedding curvatures of $M(\Gamma_0)$ and $A$. For $1\le a,b\le p$,
$$E[(\hat\lambda^*_a - \lambda_a)(\hat\lambda^*_b - \lambda_b)] = \frac{1}{n}\, g^{ab} + \frac{1}{2n^2}\left\{(\Gamma^m_M)^2_{ab} + 2(H^e_M)^2_{ab} + (H^m_A)^2_{ab}\right\} + O(n^{-3}), \tag{29}$$
where
$$(\Gamma^m_M)^2_{ab} = \sum_{c,d,e,f} \Gamma^{am}_{cd}\, \Gamma^{bm}_{ef}\, g^{ce} g^{df},$$
$$(H^e_M)^2_{ab} = \sum_{c,d,e,f,\ s<t,\ u<v} H^{(s,t)e}_{ce}\, H^{(u,v)e}_{df}\, g_{(s,t)(u,v)}\, g^{cd} g^{ea} g^{fb},$$
$$(H^m_A)^2_{ab} = \sum_{s<t,\ u<v,\ o<p,\ q<r} H^{am}_{(s,t)(u,v)}\, H^{bm}_{(o,p)(q,r)}\, g^{(s,t)(o,p)} g^{(u,v)(q,r)}.$$
$\Gamma^{am}_{cd}$ and $H^{(s,t)e}_{ce}$ were already defined in the previous section as the connection coefficients (see (25)) and the embedding curvature components (see (21)) of $M$; they are defined independently of the particular estimator. $H^{am}_{(s,t)(u,v)}$ are the components of the embedding m-curvature of $A$, which differ among estimators.
We apply this formula to the following two estimators: $l^* = (l^*_1,\ldots,l^*_p)$ and $\bar\lambda = (\bar\lambda_1,\ldots,\bar\lambda_p)$. The former is the bias-corrected sample eigenvalues, given, using (27), by
$$l^*_a = \bar l_a + \frac{1}{2n}\, C_a(\bar l) = \bar l_a + \frac{1}{n} \sum_{t\ne a} \frac{\bar l_a \bar l_t}{\bar l_t - \bar l_a}, \qquad a=1,\ldots,p, \tag{30}$$
and the latter is defined by
$$\bar\lambda_a = ((\Gamma_0)^t \bar S\, \Gamma_0)_{aa}, \qquad a=1,\ldots,p, \tag{31}$$
which is (exactly) unbiased. In fact $\bar\lambda$ is the maximum likelihood estimator for the case where $\Gamma$ is known. Notice that for $l^*$ the inverse image (28) is $A(\lambda)$, while for $\bar\lambda$ it is $\tilde A(\lambda,\Gamma_0)$. The first-order efficiency of both estimators is guaranteed by the orthogonality to $M(\Gamma_0)$ of $A(\lambda)$ and $\tilde A(\lambda,\Gamma_0)$.
The terms $(\Gamma^m_M)^2_{ab}$ and $(H^e_M)^2_{ab}$, which are related to the submanifold $M$ and hence common to both estimators, vanish because of (22) and (26). The term $(H^m_A)^2_{ab}$ differs between the two estimators. As we observed in the previous section, $A(\lambda)$ is not autoparallel in $S$ (see (18)). On the other hand, $\tilde A(\lambda,\Gamma_0)$ is autoparallel in $S$, hence $(H^m_A)^2_{ab}$ vanishes. Consequently the following results are obtained.
Proposition 3. For $1\le a,b\le p$,
$$E[(l^*_a-\lambda_a)(l^*_b-\lambda_b)]\ (:= V_{ab}(l^*)) = \begin{cases} \dfrac{2}{n}\lambda_a^2 + \dfrac{2}{n^2}\displaystyle\sum_{t\ne a} \dfrac{\lambda_a^2\lambda_t^2}{(\lambda_t-\lambda_a)^2} + O(n^{-3}) & \text{if } a=b,\\[2ex] -\dfrac{2}{n^2}\, \dfrac{\lambda_a^2\lambda_b^2}{(\lambda_a-\lambda_b)^2} + O(n^{-3}) & \text{if } a\ne b; \end{cases} \tag{32}$$
$$E[(\bar\lambda_a-\lambda_a)(\bar\lambda_b-\lambda_b)]\ (:= V_{ab}(\bar\lambda)) = \begin{cases} \dfrac{2}{n}\lambda_a^2 + O(n^{-5/2}) & \text{if } a=b,\\[1ex] O(n^{-5/2}) & \text{if } a\ne b. \end{cases} \tag{33}$$
This result says that $\bar\lambda$ is second-order efficient (among the bias-corrected first-order efficient estimators), but the bias-corrected sample eigenvalues are not. The difference in asymptotic performance between the two estimators is due to the fact that $l^*$ does not use the prior information $\Gamma=\Gamma_0$, while $\bar\lambda$ does. In contrast to $l^*$, which does not use $H$, $\bar\lambda$ incorporates the information of $H$ with the aid of the prior knowledge $\Gamma=\Gamma_0$. In fact, as we will see in the next subsection, the difference between (32) and (33) is closely related to the asymptotic information loss caused by discarding $H$.
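Both estimators of this subsection are simple to implement; below is a sketch for $p=2$ with simulated data (function names are ours, and taking $\Gamma_0 = I$ is an assumption of the example):

```python
import numpy as np

def bias_corrected_l(lbar, n):
    """l* of Eq. (30): lbar_a + (1/n) * sum_{t != a} lbar_a lbar_t / (lbar_t - lbar_a)."""
    lbar = np.asarray(lbar, dtype=float)
    out = lbar.copy()
    p = len(lbar)
    for a in range(p):
        for t in range(p):
            if t != a:
                out[a] += lbar[a] * lbar[t] / ((lbar[t] - lbar[a]) * n)
    return out

def known_gamma_mle(Sbar, Gamma0):
    """lambda-bar of Eq. (31): the exactly unbiased MLE when Gamma = Gamma0 is known."""
    return np.diag(Gamma0.T @ Sbar @ Gamma0).copy()

rng = np.random.default_rng(1)
n, lam_true = 100, np.array([3.0, 1.0])
X = rng.multivariate_normal(np.zeros(2), np.diag(lam_true), size=n)
Sbar = X.T @ X / n                        # sample covariance (zero-mean model)
lbar = np.sort(np.linalg.eigvalsh(Sbar))[::-1]
print(bias_corrected_l(lbar, n), known_gamma_mle(Sbar, np.eye(2)))
```

The bias correction pulls the sample eigenvalues toward each other while preserving their sum, in line with (23).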
4.2. Asymptotic information loss
In this subsection, we consider the asymptotic information loss caused by ignoring $H$ in the estimation of $\lambda$. The information loss matrix $(\Delta g_{ab}(l))$, $1\le a,b\le p$, at a fixed point $\Sigma=(\lambda,\Gamma)$ is given by
$$\Delta g_{ab}(l) := E[g_{ab}(S\mid l)] = g_{ab}(S) - g_{ab}(l),$$
where $g_{ab}(S)$, $g_{ab}(l)$, $g_{ab}(S\mid l)$ are the components of the metrics w.r.t. $\partial_a$ and $\partial_b$ based on, respectively, the distribution of $S$, that of $l$, and the conditional distribution of $S$ given $l$, all measured at the point $\Sigma=(\lambda,\Gamma)$.
Amari (1982) found that the asymptotic information loss can be expressed in terms of the metric and the embedding curvatures:
$$\Delta g_{ab}(l) = n \sum_{s<t,\ u<v} g_{a(s,t)}\, g_{b(u,v)}\, g^{(s,t)(u,v)} + \sum_{c,d,\ s<t,\ u<v} H^e_{ac(s,t)}\, H^e_{bd(u,v)}\, g^{cd} g^{(s,t)(u,v)} + (1/2) \sum_{s<t,\ u<v,\ o<p,\ q<r} H^m_{(s,t)(u,v)a}\, H^m_{(o,p)(q,r)b}\, g^{(s,t)(o,p)} g^{(u,v)(q,r)} + O(n^{-1}). \tag{34}$$
Straightforward calculation leads us to the following result.

Proposition 4.
$$\Delta g_{ab}(l) = B_{ab} + O(n^{-1}),$$
where
$$B_{ab} = \begin{cases} \dfrac{1}{2\lambda_a^2} \displaystyle\sum_{t\ne a} \dfrac{\lambda_t^2}{(\lambda_t-\lambda_a)^2} & \text{if } a=b,\\[2ex] -\dfrac{1}{2(\lambda_a-\lambda_b)^2} & \text{if } a\ne b. \end{cases}$$
$B_{ab}$ at the point $(\lambda,\Gamma)$ depends only on $\lambda$. When the information loss of a statistic has order $O(n^{-q+1})$, we call the statistic $q$th-order sufficient. Consequently, the statistic $l$ is first-order sufficient, but not second-order sufficient.
$B_{ab}$, the information loss in the second-order term ($O(1)$), can be quite large when the population eigenvalues are close to each other. Note that the information carried by $l$ is given by the formula
$$g_{ab}(l) = g_{ab}(S) - \Delta g_{ab}(l) = n\, g_{ab}(x) - \Delta g_{ab}(l) = (n/2)\lambda_a^{-2}\,\delta(a=b) - \Delta g_{ab}(l).$$
Since $(g_{ab}(l))$ is positive definite, $\mathrm{diag}(n2^{-1}\lambda_1^{-2},\ldots,n2^{-1}\lambda_p^{-2}) > (\Delta g_{ab})$. This holds true even in the neighborhood of a point with $\lambda_1=\cdots=\lambda_p$, where $B_{ab}$ diverges. This indicates that the $O(n^{-1})$ term in $\Delta g_{ab}(l)$ is also unbounded in such a neighborhood. Hence the expansion of the information loss with respect to $n$ is not useful when the population eigenvalues are close to each other.
Except for the case where the population eigenvalues are close to each other, Proposition 4 tells us approximately how much information is lost by ignoring the sample eigenvectors in inference on the population eigenvalues. If we contract $\Delta g_{ab}$, we get a scalar measure of the information loss:
$$IL := \sum_{a,b} g^{ab}\, \Delta g_{ab} = \sum_a 2\lambda_a^2 B_{aa} + O(n^{-1}) = \sum_{a<b} \frac{\lambda_a^2+\lambda_b^2}{(\lambda_a-\lambda_b)^2} + O(n^{-1}).$$
The asymptotic information loss is closely related to the asymptotic variance of the two estimators $l^*$ and $\bar\lambda$ of the previous subsection. Actually, if we contract the asymptotic performance difference between the two estimators, $V_{ab}(l^*) - V_{ab}(\bar\lambda)$, it equals $n^{-2}IL$; that is,
$$\sum_{a,b} \left(V_{ab}(l^*) - V_{ab}(\bar\lambda)\right) g_{ab} = 2^{-1} E\left[\sum_a (l^*_a/\lambda_a - 1)^2\right] - 2^{-1} E\left[\sum_a (\bar\lambda_a/\lambda_a - 1)^2\right] = n^{-2} \sum_{a<b} \frac{\lambda_a^2+\lambda_b^2}{(\lambda_a-\lambda_b)^2} + O(n^{-3}) = n^{-2} IL. \tag{35}$$
As a numerical example, we ran a simulation for the case $p=2$, $n=20$. Taking the relationship (35) into account, we can measure the information loss as the normalized quadratic risk difference between $l^*$ and $\bar\lambda$. We randomly generated a two-dimensional normal vector under the conditions $\Sigma=\mathrm{diag}(1.0,\, c)$, $c=0.2, 0.4, 0.6, 0.8, 1.0$, repeated this $10^8$ times, and took the average for each condition. Table 1 shows the result. (Note: (1) The risk of $\bar\lambda$ theoretically equals 0.4. (2) The simulated risk of $l^*$ is quite unstable, as its large s.d. shows.) We notice that the information loss is not negligible. The risk of $l^*$ is larger than that of $\bar\lambda$ by 24–111%. The risk difference is quite large especially when the population eigenvalues are close to each other.

Table 1. Simulated risk of $l^*$ when $p=2$ as $c$ varies.

  c: second eigenvalue                             1.0    0.8    0.6    0.4    0.2
  Simulated risk of $l^*$                          0.85   0.83   0.70   0.60   0.50
  Standard deviation                               0.24   0.48   0.15   0.09   0.22
  100 x (risk difference / risk of $\bar\lambda$)  111    107    75     49     24

Fig. 3. The shrinkage effect of the projection $(\hat\lambda,\hat\Gamma)$ onto $M_i$, $i=1,2$.
5. Estimation of λ when Γ is unknown
In this section, we consider the more practical case where $\Gamma$ is unknown. The derivation of a new estimator for this case is done in view of correcting the bias of $\bar l$. Indeed, almost all the literature on the estimation of $\lambda$ we mentioned in Section 1 modifies the bias of $\bar l$ by the so-called "shrinkage" method, that is, by decreasing the dispersion of $\bar l$. Though the concrete shrinkage methods differ for each estimator, they are proposed mainly from analytical motivations. Here we consider another shrinkage estimator, from a geometrical point of view.
Suppose that we have a sample $\bar S := n^{-1}S$ which takes the point $(\hat\lambda,\hat\Gamma)$ in $S$, that is, $\hat\lambda = \bar l$, $\hat\Gamma = H$. (See Fig. 3.) Take the orthogonal projection of this point onto the submanifold $M(\Gamma_i) := M_i$ ($i=1,2$), where the projected point $(\hat\lambda^i,\Gamma_i)$ is given by $\hat\lambda^i = ((\Gamma_i^t \bar S\, \Gamma_i)_{11},\ldots,(\Gamma_i^t \bar S\, \Gamma_i)_{pp})$. As we mentioned in Section 3, $(\hat\lambda^i,\Gamma_i)$ is the minimum-distance point on $M_i$ from $(\hat\lambda,\hat\Gamma)$ with respect to the Kullback-Leibler divergence. It is clearly understood that this projection has a shrinkage effect. If we have an appropriate probability measure for $\Gamma$ on the group of $p$-dimensional orthogonal matrices $O(p)$, the expectation of $(\Gamma^t \bar S\, \Gamma)_{ii}$, $i=1,\ldots,p$, under that measure would give birth to a natural shrinkage estimator.
We choose, as the probability measure on $O(p)$, the conditional distribution of $H$ given $l$. Since $S = HLH^t$ is distributed as a Wishart matrix $W_p(n,\Sigma)$, its density w.r.t. the uniform probability measure $d\mu(H)$ on $O(p)$ equals
$$f(H \mid l, \Sigma) = K(l,\Sigma)^{-1} \exp(-(1/2)\operatorname{tr} H L H^t \Sigma^{-1}), \tag{36}$$
where the normalizing constant $K(l,\Sigma)$ is given by
$$K(l,\Sigma) = \int_{O(p)} \exp(-(1/2)\operatorname{tr} H L H^t \Sigma^{-1})\, d\mu(H).$$
This conditional distribution depends on $\Sigma$. If we substitute an estimator $\hat\Sigma(\bar S)$ for $\Sigma$, it gives a distribution on $O(p)$ whose density with respect to $d\mu(\Gamma)$ is given by
$$f(\Gamma \mid l, \hat\Sigma) = K(l,\hat\Sigma)^{-1} \exp(-(1/2)\operatorname{tr}\Gamma L \Gamma^t \hat\Sigma^{-1}), \tag{37}$$
where
$$K(l,\hat\Sigma) = \int_{O(p)} \exp(-(1/2)\operatorname{tr}\Gamma L \Gamma^t \hat\Sigma^{-1})\, d\mu(\Gamma).$$
Taking the expectation of $(\Gamma^t S \Gamma)_{ii}$ w.r.t. the density (37), we have
$$\lambda^{*}_i \triangleq K(l, \hat{\Sigma})^{-1} \int_{O(p)} (\Gamma^t S \Gamma)_{ii} \exp\bigl(-(1/2)\,\mathrm{tr}\,\Gamma L \Gamma^t \hat{\Sigma}^{-1}\bigr)\,d\mu(\Gamma), \qquad i = 1, \ldots, p. \tag{38}$$
We propose $\lambda^{*} \triangleq (\lambda^{*}_1, \ldots, \lambda^{*}_p)$ as a new estimator of $\lambda$.
Fig. 4. Risks of the three estimators as c changes.
If $\hat{\Sigma}$ is given by an orthogonally invariant estimator (2), $\lambda^{*}_i$ can be described more specifically. Let $L$ denote $\mathrm{diag}(l)$. Because of the invariance of $d\mu$, it turns out that
$$\lambda^{*}_i = K(l)^{-1} \int_{O(p)} (\Gamma^t H L H^t \Gamma)_{ii} \exp\bigl(-(1/2)\,\mathrm{tr}\,L\Gamma^t H \Phi^{-1} H^t \Gamma\bigr)\,d\mu(\Gamma)$$
$$= K(l)^{-1} \int_{O(p)} (\Gamma^t L \Gamma)_{ii} \exp\bigl(-(1/2)\,\mathrm{tr}\,L\Gamma^t \Phi^{-1} \Gamma\bigr)\,d\mu(\Gamma), \tag{39}$$
where
$$K(l) \triangleq \int_{O(p)} \exp\bigl(-(1/2)\,\mathrm{tr}\,L\Gamma^t \Phi^{-1} \Gamma\bigr)\,d\mu(\Gamma). \tag{40}$$
The analytic evaluation of this estimator's performance seems difficult even in the large-sample case. Instead, we show numerical results comparing $l$, $\lambda^{*}$ and Stein's estimator (3). Our new estimator $\lambda^{*}_i$ is equipped with the same $\phi$'s as in (3). We simulated the risks of the three estimators for the case $p = 2$, $n = 10$ w.r.t. the K–L loss, which is given by
$$\sum_{i=1}^{p} \tilde{\lambda}_i \lambda_i^{-1} - \sum_{i=1}^{p} \log\bigl(\tilde{\lambda}_i \lambda_i^{-1}\bigr) - p,$$
where $\tilde{\lambda}_i = l_i,\ \lambda^{*}_i,\ \phi_i$, $i = 1, \ldots, p$. Since all the estimators are functions of $l$ and scale invariant, it is enough to measure the risks for $\Sigma = \mathrm{diag}(1, c)$, $0 < c \le 1$. We varied $c$ from 0.04 to 1.00 in increments of 0.04, and for each $c$ we repeated the risk evaluation $10^5$ times and took the average. For the integral calculation of (39) and (40), we picked 50 equidistant points from $O(2)$. Fig. 4 shows the result. The new estimator performs better than $l$, especially when the components of $\lambda$ are close to each other, though it seems that $\lambda^{*}$ does not dominate $l$ as Stein's estimator does. Unfortunately, we do not have any theoretical explanation of the risk behavior of the new estimator. We can only guess that the shrinkage effect works well when $c$ is close to one, while it is too strong elsewhere. We also simulated the risk of the new estimator equipped with the M.L.E. instead of Stein's estimator. Since its performance is almost the same as that of the above new estimator, we omit the result.
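For $p = 2$ the integrals in (39) and (40) reduce to a one-dimensional quadrature over rotation angles. The sketch below (the function name, the use of rotations only, and the plugged-in $\phi$ values are our own illustrative choices) approximates $\lambda^{*}$; note that $\lambda^{*}_1 + \lambda^{*}_2 = l_1 + l_2$ automatically, since $\mathrm{tr}(\Gamma^t L \Gamma) = \mathrm{tr}\,L$ for every $\Gamma$:

```python
import numpy as np

def lambda_star(l, phi, n_points=50):
    """Approximate the shrinkage estimator of Eqs. (39)-(40) for p = 2.

    l   : sample eigenvalues (l[0] >= l[1] > 0)
    phi : diagonal of the plugged-in orthogonally invariant estimator
          (e.g. Stein-type modified eigenvalues; illustrative here)
    O(2) is replaced by n_points equidistant rotation angles."""
    L = np.diag(l)
    Phi_inv = np.diag(1.0 / np.asarray(phi, dtype=float))
    num = np.zeros(2)
    den = 0.0
    for theta in np.linspace(0.0, 2.0 * np.pi, n_points, endpoint=False):
        c, s = np.cos(theta), np.sin(theta)
        G = np.array([[c, -s], [s, c]])                     # a rotation in O(2)
        w = np.exp(-0.5 * np.trace(L @ G.T @ Phi_inv @ G))  # weight from Eq. (40)
        num += w * np.diag(G.T @ L @ G)                     # integrand of Eq. (39)
        den += w
    return num / den
```

With $l = (2, 1)$ and $\phi = (2, 1)$, for instance, the estimate keeps $\lambda^{*}_1 + \lambda^{*}_2 = 3$ while pulling the two components toward each other, which is exactly the shrinkage effect discussed above.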
6. Remark
1. We treated the estimation problem of the eigenvalues $\lambda$ in the latter half of the paper. The estimation of the eigenvectors $\Gamma$ seems rather untouched in the classical situation $n \ge p$. Corollary 1 on the statistical curvature of $\mathcal{A}$, or (27) on the asymptotic bias, tells us that a point where $\lambda$ has some multiplicity is a statistically singular point. Around these points, inference on $\Gamma$ is considered to need subtle treatment. In particular, the eigenvectors are not well identified around a multiplicity point, hence the information contained in $H$ vanishes there (see $g_{(s,t)(u,v)}$ in Proposition 1). This indicates that inference using only $H$ is not appropriate.
2. We proposed a new estimator for $\lambda$ in Section 5. However, it belongs to the same category as most estimators in the past literature in that it uses the sample eigenvalues $l$ only. It is still unclear how we can use the sample eigenvectors $H$ for the inference of $\lambda$.
Acknowledgment
The author is grateful to Dr. M. Kumon for kindly answering his question on a basic fact about the information loss. He also expresses deep gratitude to the anonymous referees for their valuable comments, which improved the quality of the paper.
Appendix A
A.1. Proof of Proposition 1
As a basis for the vector space of real symmetric matrices, we consider $E_{ij}\ (1 \le i \le j \le p)$, the $p \times p$ matrix defined by
$$E_{ij} = \begin{cases} I_{ii} & \text{if } i = j,\\ I_{ij} + I_{ji} & \text{if } i < j, \end{cases}$$
where $I_{ij}\ (1 \le i, j \le p)$ is the $p \times p$ matrix whose $(i,j)$ element equals one and all of whose other elements are zero. The one-to-one correspondence
$$\partial_{(i,j)} \triangleq \frac{\partial}{\partial s_{ij}} \longleftrightarrow E_{ij}, \qquad 1 \le i \le j \le p,$$
gives the component expression of (11):
$$\langle \partial_{(i,j)}, \partial_{(k,l)} \rangle = \frac{1}{2}\,\mathrm{tr}\bigl(\Sigma^{-1} E_{ij} \Sigma^{-1} E_{kl}\bigr), \qquad 1 \le i \le j \le p,\ 1 \le k \le l \le p.$$
Since
$$\partial_a \triangleq \frac{\partial}{\partial \lambda_a} = \sum_{i \le j} \frac{\partial s_{ij}}{\partial \lambda_a}\,\frac{\partial}{\partial s_{ij}} = \sum_{i \le j} \frac{\partial s_{ij}}{\partial \lambda_a}\,\partial_{(i,j)}, \qquad 1 \le a \le p, \tag{41}$$
$$\partial_{(s,t)} \triangleq \frac{\partial}{\partial u_{st}} = \sum_{i \le j} \frac{\partial s_{ij}}{\partial u_{st}}\,\frac{\partial}{\partial s_{ij}} = \sum_{i \le j} \frac{\partial s_{ij}}{\partial u_{st}}\,\partial_{(i,j)}, \qquad 1 \le s < t \le p, \tag{42}$$
we have the following relations:
$$g_{ab} = \frac{1}{2}\,\mathrm{tr}\Biggl\{\Sigma^{-1}\Biggl(\sum_{i \le j} \frac{\partial s_{ij}}{\partial \lambda_a} E_{ij}\Biggr)\Sigma^{-1}\Biggl(\sum_{k \le l} \frac{\partial s_{kl}}{\partial \lambda_b} E_{kl}\Biggr)\Biggr\}, \tag{43}$$
$$g_{a(s,t)} = \frac{1}{2}\,\mathrm{tr}\Biggl\{\Sigma^{-1}\Biggl(\sum_{i \le j} \frac{\partial s_{ij}}{\partial \lambda_a} E_{ij}\Biggr)\Sigma^{-1}\Biggl(\sum_{k \le l} \frac{\partial s_{kl}}{\partial u_{st}} E_{kl}\Biggr)\Biggr\}, \tag{44}$$
$$g_{(s,t)(u,v)} = \frac{1}{2}\,\mathrm{tr}\Biggl\{\Sigma^{-1}\Biggl(\sum_{i \le j} \frac{\partial s_{ij}}{\partial u_{st}} E_{ij}\Biggr)\Sigma^{-1}\Biggl(\sum_{k \le l} \frac{\partial s_{kl}}{\partial u_{uv}} E_{kl}\Biggr)\Biggr\}, \tag{45}$$
where $1 \le a, b \le p$, $1 \le s < t \le p$, $1 \le u < v \le p$.

For the first-order derivative at $u = 0$, we only have to consider $\Sigma$ up to the first-power term w.r.t. $u$; hence we write $\Sigma(\lambda, u)$ as
$$\Sigma(\lambda, u) = \Gamma(I_p + U)\Lambda(I_p + U)^t\Gamma^t + O(\|u\|^2) = \Gamma\Lambda\Gamma^t + \Gamma\Lambda U^t\Gamma^t + \Gamma U\Lambda\Gamma^t + O(\|u\|^2). \tag{46}$$
Therefore we have
$$s_{ij} = \sum_k \gamma_{ik}\gamma_{jk}\lambda_k + \sum_{k,l} \gamma_{ik}\gamma_{jl}(\lambda_k u_{lk} + \lambda_l u_{kl}) + O(\|u\|^2), \qquad 1 \le i \le j \le p,$$
where $u_{ii} \triangleq 0\ (1 \le i \le p)$, $u_{ij} \triangleq -u_{ji}\ (1 \le j < i \le p)$, which leads to
$$\left.\frac{\partial s_{ij}}{\partial \lambda_a}\right|_{u=0} = \gamma_{ia}\gamma_{ja}, \tag{47}$$
and
$$\left.\frac{\partial s_{ij}}{\partial u_{st}}\right|_{u=0} = \lambda_t\gamma_{it}\gamma_{js} - \lambda_s\gamma_{is}\gamma_{jt} + \lambda_t\gamma_{is}\gamma_{jt} - \lambda_s\gamma_{it}\gamma_{js}. \tag{48}$$
From (47) and (48), we have the following results on tangent vectors:
$$\sum_{i \le j} \frac{\partial s_{ij}}{\partial \lambda_a} E_{ij} = \sum_{i \le j} \gamma_{ia}\gamma_{ja} E_{ij} = \gamma_a\gamma_a^t, \tag{49}$$
where $\gamma_a$ is the $a$th column of $\Gamma$, and
$$\sum_{i \le j} \frac{\partial s_{ij}}{\partial u_{st}} E_{ij} = \lambda_t\gamma_t\gamma_s^t - \lambda_s\gamma_s\gamma_t^t + \lambda_t\gamma_s\gamma_t^t - \lambda_s\gamma_t\gamma_s^t. \tag{50}$$
If we substitute (49) and (50) into (43), (44), and (45), we get the results as follows:
$$2g_{ab} = \mathrm{tr}\bigl(\Sigma^{-1}\gamma_a\gamma_a^t\,\Sigma^{-1}\gamma_b\gamma_b^t\bigr) = \mathrm{tr}\bigl\{(\gamma_b^t\Sigma^{-1}\gamma_a)(\gamma_a^t\Sigma^{-1}\gamma_b)\bigr\} = \lambda_a^{-1}\delta(a=b)\,\lambda_b^{-1}\delta(a=b) = \lambda_a^{-2}\,\delta(a=b);$$
$$2g_{a(s,t)} = \mathrm{tr}\bigl\{\Sigma^{-1}\gamma_a\gamma_a^t\,\Sigma^{-1}(\lambda_t\gamma_t\gamma_s^t - \lambda_s\gamma_s\gamma_t^t + \lambda_t\gamma_s\gamma_t^t - \lambda_s\gamma_t\gamma_s^t)\bigr\} = (\lambda_t\lambda_a^{-2} - \lambda_s\lambda_a^{-2} + \lambda_t\lambda_a^{-2} - \lambda_s\lambda_a^{-2})\,\delta(a=s=t) = 0,$$
since $\delta(a = s = t)$ never holds for $s < t$;
$$2g_{(s,t)(u,v)} = \mathrm{tr}\bigl\{\Sigma^{-1}(\lambda_t\gamma_t\gamma_s^t - \lambda_s\gamma_s\gamma_t^t + \lambda_t\gamma_s\gamma_t^t - \lambda_s\gamma_t\gamma_s^t)\,\Sigma^{-1}(\lambda_v\gamma_v\gamma_u^t - \lambda_u\gamma_u\gamma_v^t + \lambda_v\gamma_u\gamma_v^t - \lambda_u\gamma_v\gamma_u^t)\bigr\}.$$
Expanding the sixteen products and using $\mathrm{tr}(\Sigma^{-1}\gamma_i\gamma_j^t\,\Sigma^{-1}\gamma_k\gamma_l^t) = \lambda_i^{-1}\lambda_k^{-1}\delta(j=k)\,\delta(l=i)$, every surviving term carries the factor $\delta(s=u,\,t=v)$ (recall $s < t$ and $u < v$), and we obtain
$$\begin{aligned}
2g_{(s,t)(u,v)} &= \bigl(-1 + \lambda_t\lambda_s^{-1} - 1 + \lambda_s\lambda_t^{-1} + \lambda_t\lambda_s^{-1} - 1 + \lambda_s\lambda_t^{-1} - 1\bigr)\,\delta(s=u,\,t=v)\\
&= 2\bigl(\lambda_s^{-1}(\lambda_t - \lambda_s) + \lambda_t^{-1}(\lambda_s - \lambda_t)\bigr)\,\delta(s=u,\,t=v)\\
&= 2(\lambda_t - \lambda_s)(\lambda_s^{-1} - \lambda_t^{-1})\,\delta(s=u,\,t=v)\\
&= 2(\lambda_t - \lambda_s)^2(\lambda_s\lambda_t)^{-1}\,\delta(s=u,\,t=v).
\end{aligned}$$
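The closed forms just obtained can be checked numerically: plugging the tangent vectors (49) and (50) into the metric $\langle X, Y \rangle = \frac{1}{2}\mathrm{tr}(\Sigma^{-1}X\Sigma^{-1}Y)$ of (11) should reproduce $g_{aa} = \frac{1}{2}\lambda_a^{-2}$, $g_{a(s,t)} = 0$ and $g_{(s,t)(s,t)} = (\lambda_t - \lambda_s)^2/(\lambda_s\lambda_t)$. A sketch with arbitrary eigenvalues and a random orthogonal frame (the helper names are ours):

```python
import numpy as np

def metric(X, Y, Sigma_inv):
    # Fisher metric (11): <X, Y> = (1/2) tr(Sigma^{-1} X Sigma^{-1} Y)
    return 0.5 * np.trace(Sigma_inv @ X @ Sigma_inv @ Y)

lam = np.array([3.0, 1.5, 0.5])                        # distinct eigenvalues
rng = np.random.default_rng(0)
Gamma, _ = np.linalg.qr(rng.standard_normal((3, 3)))   # a random frame
Sigma_inv = Gamma @ np.diag(1.0 / lam) @ Gamma.T

def d_lam(a):
    # Eq. (49): tangent vector of d/d(lambda_a) is gamma_a gamma_a^t
    g = Gamma[:, [a]]
    return g @ g.T

def d_u(s, t):
    # Eq. (50): tangent vector of d/d(u_st)
    gs, gt = Gamma[:, [s]], Gamma[:, [t]]
    return (lam[t] * gt @ gs.T - lam[s] * gs @ gt.T
            + lam[t] * gs @ gt.T - lam[s] * gt @ gs.T)
```

For instance $\langle \partial_1, \partial_1 \rangle$ returns $\frac{1}{2}\lambda_1^{-2} = 1/18$ here, while all mixed components $g_{a(s,t)}$ vanish, matching Proposition 1.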
A.2. Proof of Proposition 2
Note that $\Sigma^{-1} = \Gamma\Lambda^{-1}\Gamma^t$; hence
$$\theta_{ij} = \begin{cases} -\sum_k \gamma_{ik}\gamma_{jk}\lambda_k^{-1} & \text{if } i < j,\\ -2^{-1}\sum_k \gamma_{ik}^2\lambda_k^{-1} & \text{if } i = j. \end{cases}$$
This means that $\mathcal{M}$ is an affine subspace of $\mathcal{S}$ w.r.t. $\Theta$, which is an affine coordinate system of $\mathcal{S}$ under the e-connection. Consequently $\mathcal{M}$ is e-flat, i.e. $H^e_{ab(s,t)} = 0$; $H^m_{ab(s,t)} = 0$ is proved similarly. See Theorem 1.1 in Amari and Nagaoka (2000).
Now we consider $H^m_{(s,t)(u,v)a}$. Using (4.14) in Amari (1985), it is calculated as
$$H^m_{(s,t)(u,v)a} = \sum_{i \le j} \left.\frac{\partial^2 s_{ij}}{\partial u_{st}\,\partial u_{uv}}\right|_{u=0}\left.\frac{\partial \theta_{ij}}{\partial \lambda_a}\right|_{u=0} = -2^{-1}\sum_{1 \le i, j \le p} \left.\frac{\partial^2 s_{ij}}{\partial u_{st}\,\partial u_{uv}}\right|_{u=0}\left.\frac{\partial s^{ij}}{\partial \lambda_a}\right|_{u=0} = -2^{-1}\,\mathrm{tr}(AB),$$
where $s^{ij}$ denotes the $(i,j)$ element of $\Sigma^{-1}$, and the $p \times p$ matrices $A$, $B$ are given by
$$(A)_{ij} \triangleq \left.\frac{\partial^2 s_{ij}}{\partial u_{st}\,\partial u_{uv}}\right|_{u=0}, \qquad (B)_{ij} \triangleq \left.\frac{\partial s^{ij}}{\partial \lambda_a}\right|_{u=0}, \qquad 1 \le i, j \le p.$$
In order to calculate $A$, we only have to consider $\Sigma$ up to the second-power terms w.r.t. $u$:
$$\Sigma = \Gamma(I_p + U + 2^{-1}U^2)\Lambda(I_p + U + 2^{-1}U^2)^t\Gamma^t + O(\|u\|^3) = \Gamma\Lambda\Gamma^t + \Gamma(U\Lambda + \Lambda U^t)\Gamma^t + 2^{-1}\Gamma\bigl(U^2\Lambda + \Lambda(U^2)^t\bigr)\Gamma^t + \Gamma U\Lambda U^t\Gamma^t + O(\|u\|^3).$$
Therefore $s_{ij}$ is expressed as
$$s_{ij} = 2^{-1}\sum_{k,l}\gamma_{ik}\gamma_{jl}\Bigl(\bigl(U^2\Lambda + \Lambda(U^2)^t\bigr)_{kl} + 2(U\Lambda U^t)_{kl}\Bigr) + R_{ij} + O(\|u\|^3), \tag{51}$$
where $R = \Gamma\Lambda\Gamma^t + \Gamma(U\Lambda + \Lambda U^t)\Gamma^t$. Since
$$\bigl(U^2\Lambda + \Lambda(U^2)^t\bigr)_{kl} = (U^2\Lambda)_{kl} + (U^2\Lambda)_{lk} = \sum_b u_{kb}u_{bl}\lambda_l + \sum_b u_{lb}u_{bk}\lambda_k, \qquad 2(U\Lambda U^t)_{kl} = 2\sum_b u_{kb}u_{lb}\lambda_b,$$
(51) turns out to be
$$s_{ij} = 2^{-1}\sum_{k,l,b}\gamma_{ik}\gamma_{jl}\bigl(u_{kb}u_{bl}\lambda_l + u_{lb}u_{bk}\lambda_k + 2u_{kb}u_{lb}\lambda_b\bigr) + R_{ij} + O(\|u\|^3). \tag{52}$$
From this we have
$$2\left.\frac{\partial^2 s_{ij}}{\partial u_{st}\,\partial u_{uv}}\right|_{u=0} = \sum_{k,l,b}\bigl(a^{(1)}_{ij} + a^{(1)}_{ji} + a^{(2)}_{ij} + a^{(2)}_{ji} + a^{(3)}_{ij} + a^{(4)}_{ij}\bigr), \tag{53}$$
where
$$\begin{aligned}
a^{(1)}_{ij} &= \gamma_{is}\gamma_{jv}\lambda_v\,\delta\{(k,b)=(s,t),\,(b,l)=(u,v),\,(s,t)\ne(u,v)\} - \gamma_{it}\gamma_{jv}\lambda_v\,\delta\{(k,b)=(t,s),\,(b,l)=(u,v),\,(s,t)\ne(u,v)\}\\
&\quad - \gamma_{is}\gamma_{ju}\lambda_u\,\delta\{(k,b)=(s,t),\,(b,l)=(v,u),\,(s,t)\ne(u,v)\} + \gamma_{it}\gamma_{ju}\lambda_u\,\delta\{(k,b)=(t,s),\,(b,l)=(v,u),\,(s,t)\ne(u,v)\}\\
&\quad + \gamma_{iu}\gamma_{jt}\lambda_t\,\delta\{(k,b)=(u,v),\,(b,l)=(s,t),\,(s,t)\ne(u,v)\} - \gamma_{iv}\gamma_{jt}\lambda_t\,\delta\{(k,b)=(v,u),\,(b,l)=(s,t),\,(s,t)\ne(u,v)\}\\
&\quad - \gamma_{iu}\gamma_{js}\lambda_s\,\delta\{(k,b)=(u,v),\,(b,l)=(t,s),\,(s,t)\ne(u,v)\} + \gamma_{iv}\gamma_{js}\lambda_s\,\delta\{(k,b)=(v,u),\,(b,l)=(t,s),\,(s,t)\ne(u,v)\},\\[1mm]
a^{(2)}_{ij} &= 2\gamma_{is}\gamma_{jt}\lambda_t\,\delta\{(k,b)=(s,t),\,(b,l)=(s,t),\,(s,t)=(u,v)\} + 2\gamma_{it}\gamma_{js}\lambda_s\,\delta\{(k,b)=(t,s),\,(b,l)=(t,s),\,(s,t)=(u,v)\}\\
&\quad - 2\gamma_{is}\gamma_{js}\lambda_s\,\delta\{(k,b)=(s,t),\,(b,l)=(t,s),\,(s,t)=(u,v)\} - 2\gamma_{it}\gamma_{jt}\lambda_t\,\delta\{(k,b)=(t,s),\,(b,l)=(s,t),\,(s,t)=(u,v)\}\\
&= -2\gamma_{is}\gamma_{js}\lambda_s\,\delta\{(k,b)=(s,t),\,(b,l)=(t,s),\,(s,t)=(u,v)\} - 2\gamma_{it}\gamma_{jt}\lambda_t\,\delta\{(k,b)=(t,s),\,(b,l)=(s,t),\,(s,t)=(u,v)\}\\
&\qquad\text{(the first two terms vanish since they require $s = t$),}\\[1mm]
a^{(3)}_{ij} &= 2\gamma_{is}\gamma_{ju}\lambda_t\,\delta\{(k,b)=(s,t),\,(l,b)=(u,v),\,(s,t)\ne(u,v)\} - 2\gamma_{it}\gamma_{ju}\lambda_s\,\delta\{(k,b)=(t,s),\,(l,b)=(u,v),\,(s,t)\ne(u,v)\}\\
&\quad - 2\gamma_{is}\gamma_{jv}\lambda_t\,\delta\{(k,b)=(s,t),\,(l,b)=(v,u),\,(s,t)\ne(u,v)\} + 2\gamma_{it}\gamma_{jv}\lambda_s\,\delta\{(k,b)=(t,s),\,(l,b)=(v,u),\,(s,t)\ne(u,v)\}\\
&\quad + 2\gamma_{iu}\gamma_{js}\lambda_t\,\delta\{(k,b)=(u,v),\,(l,b)=(s,t),\,(s,t)\ne(u,v)\} - 2\gamma_{iv}\gamma_{js}\lambda_t\,\delta\{(k,b)=(v,u),\,(l,b)=(s,t),\,(s,t)\ne(u,v)\}\\
&\quad - 2\gamma_{iu}\gamma_{jt}\lambda_s\,\delta\{(k,b)=(u,v),\,(l,b)=(t,s),\,(s,t)\ne(u,v)\} + 2\gamma_{iv}\gamma_{jt}\lambda_s\,\delta\{(k,b)=(v,u),\,(l,b)=(t,s),\,(s,t)\ne(u,v)\},\\[1mm]
a^{(4)}_{ij} &= 4\gamma_{is}\gamma_{js}\lambda_t\,\delta\{(k,b)=(l,b)=(s,t)=(u,v)\} + 4\gamma_{it}\gamma_{jt}\lambda_s\,\delta\{(k,b)=(l,b)=(t,s)=(v,u)\}\\
&\quad - 4\gamma_{is}\gamma_{jt}\lambda_t\,\delta\{(k,b)=(s,t),\,(l,b)=(t,s),\,(s,t)=(u,v)\} - 4\gamma_{it}\gamma_{js}\lambda_s\,\delta\{(k,b)=(t,s),\,(l,b)=(s,t),\,(s,t)=(u,v)\}\\
&= 4\gamma_{is}\gamma_{js}\lambda_t\,\delta\{(k,b)=(l,b)=(s,t)=(u,v)\} + 4\gamma_{it}\gamma_{jt}\lambda_s\,\delta\{(k,b)=(l,b)=(t,s)=(v,u)\}.
\end{aligned}$$
Furthermore we have
$$2A = A^{(1)} + (A^{(1)})^t + A^{(2)} + (A^{(2)})^t + A^{(3)} + A^{(4)}, \tag{54}$$
where
$$\begin{aligned}
A^{(1)} &= \gamma_s\gamma_v^t\lambda_v\,\delta(t=u) + \gamma_t\gamma_u^t\lambda_u\,\delta(s=v) + \gamma_u\gamma_t^t\lambda_t\,\delta(s=v) + \gamma_v\gamma_s^t\lambda_s\,\delta(u=t)\\
&\quad - \gamma_t\gamma_v^t\lambda_v\,\delta(s=u,\,t\ne v) - \gamma_s\gamma_u^t\lambda_u\,\delta(t=v,\,s\ne u) - \gamma_v\gamma_t^t\lambda_t\,\delta(u=s,\,t\ne v) - \gamma_u\gamma_s^t\lambda_s\,\delta(t=v,\,s\ne u),\\
A^{(2)} &= -2\bigl(\gamma_s\gamma_s^t\lambda_s + \gamma_t\gamma_t^t\lambda_t\bigr)\,\delta(s=u,\,t=v),\\
A^{(3)} &= 2\bigl(\gamma_s\gamma_u^t\lambda_t\,\delta(t=v,\,s\ne u) + \gamma_t\gamma_v^t\lambda_s\,\delta(s=u,\,t\ne v) + \gamma_u\gamma_s^t\lambda_t\,\delta(v=t,\,s\ne u) + \gamma_v\gamma_t^t\lambda_s\,\delta(u=s,\,t\ne v)\bigr)\\
&\quad - 2\bigl(\gamma_t\gamma_u^t\lambda_s\,\delta(s=v) + \gamma_s\gamma_v^t\lambda_t\,\delta(t=u) + \gamma_v\gamma_s^t\lambda_t\,\delta(u=t) + \gamma_u\gamma_t^t\lambda_s\,\delta(s=v)\bigr),\\
A^{(4)} &= 4\bigl(\gamma_s\gamma_s^t\lambda_t + \gamma_t\gamma_t^t\lambda_s\bigr)\,\delta(s=u,\,t=v).
\end{aligned}$$
Since
$$\left.\frac{\partial s^{ij}}{\partial \lambda_a}\right|_{u=0} = -\lambda_a^{-2}\gamma_{ia}\gamma_{ja},$$
we have
$$B = -\lambda_a^{-2}\gamma_a\gamma_a^t. \tag{55}$$
From (54) and (55), we have
$$H^m_{(s,t)(u,v)a} = -4^{-1}\,\mathrm{tr}(2AB) = -4^{-1}\,\mathrm{tr}\bigl\{\bigl(A^{(1)} + (A^{(1)})^t + A^{(2)} + (A^{(2)})^t + A^{(3)} + A^{(4)}\bigr)B\bigr\} = 4^{-1}\lambda_a^{-2}\,\mathrm{tr}\bigl\{\bigl(A^{(1)} + (A^{(1)})^t + A^{(2)} + (A^{(2)})^t + A^{(3)} + A^{(4)}\bigr)\gamma_a\gamma_a^t\bigr\}.$$
The following equalities hold:
$$\begin{aligned}
\mathrm{tr}(A^{(1)}\gamma_a\gamma_a^t) &= \lambda_a\delta(s=v=a,\,t=u) + \lambda_a\delta(t=u=a,\,s=v) + \lambda_a\delta(t=u=a,\,s=v) + \lambda_a\delta(s=v=a,\,t=u)\\
&\quad - \lambda_a\delta(t=v=a,\,s=u,\,t\ne v) - \lambda_a\delta(s=u=a,\,t=v,\,s\ne u)\\
&\quad - \lambda_a\delta(t=v=a,\,s=u,\,t\ne v) - \lambda_a\delta(s=u=a,\,t=v,\,s\ne u) = 0,\\
\mathrm{tr}\bigl((A^{(1)})^t\gamma_a\gamma_a^t\bigr) &= 0,\\
\mathrm{tr}(A^{(2)}\gamma_a\gamma_a^t) &= -2\bigl(\lambda_a\delta(s=u=a,\,t=v) + \lambda_a\delta(s=u,\,t=v=a)\bigr),\\
\mathrm{tr}\bigl((A^{(2)})^t\gamma_a\gamma_a^t\bigr) &= -2\bigl(\lambda_a\delta(s=u=a,\,t=v) + \lambda_a\delta(s=u,\,t=v=a)\bigr),\\
\mathrm{tr}(A^{(3)}\gamma_a\gamma_a^t) &= 2\bigl\{\lambda_t\delta(s=u=a)\delta(t=v,\,s\ne u) + \lambda_s\delta(t=v=a)\delta(s=u,\,t\ne v)\\
&\qquad + \lambda_t\delta(s=u=a)\delta(t=v,\,s\ne u) + \lambda_s\delta(t=v=a)\delta(s=u,\,t\ne v)\bigr\}\\
&\quad - 2\bigl\{\lambda_s\delta(t=u=a)\delta(s=v) + \lambda_t\delta(s=v=a)\delta(t=u)\\
&\qquad + \lambda_t\delta(s=v=a)\delta(t=u) + \lambda_s\delta(u=t=a)\delta(s=v)\bigr\} = 0,\\
\mathrm{tr}(A^{(4)}\gamma_a\gamma_a^t) &= 4\lambda_t\delta(s=u=a,\,t=v) + 4\lambda_s\delta(t=v=a,\,s=u).
\end{aligned}$$
Consequently
$$H^m_{(s,t)(u,v)a} = -\lambda_a^{-1}\delta(s=u=a,\,t=v) - \lambda_a^{-1}\delta(s=u,\,t=v=a) + \lambda_a^{-2}\lambda_t\,\delta(s=u=a,\,t=v) + \lambda_a^{-2}\lambda_s\,\delta(t=v=a,\,s=u)$$
$$= \begin{cases} \lambda_a^{-2}(\lambda_t - \lambda_a) & \text{if } s=u=a,\ t=v,\\ \lambda_a^{-2}(\lambda_s - \lambda_a) & \text{if } s=u,\ t=v=a,\\ 0 & \text{otherwise.} \end{cases}$$
A.3. Proof of Corollary 1
As we will see in the next subsection,
$$\sum_{s<t,\,u<v,\,o<p,\,q<r} H^m_{(s,t)(u,v)a} H^m_{(o,p)(q,r)b}\, g^{(s,t)(o,p)} g^{(u,v)(q,r)} = \begin{cases} \dfrac{1}{\lambda_a^2}\displaystyle\sum_{t \ne a} \dfrac{\lambda_t^2}{(\lambda_t - \lambda_a)^2} & \text{if } a = b,\\[3mm] -\dfrac{1}{(\lambda_a - \lambda_b)^2} & \text{if } a \ne b. \end{cases}$$
Combining this with Proposition 1, we have
$$\gamma(\mathcal{A}) = 2\sum_a \sum_{t \ne a} \frac{\lambda_t^2}{(\lambda_t - \lambda_a)^2} = 2\sum_{a<b} \frac{\lambda_a^2 + \lambda_b^2}{(\lambda_a - \lambda_b)^2}.$$
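The closed form makes the singular behaviour at eigenvalue multiplicities explicit: $\gamma(\mathcal{A})$ diverges as any two eigenvalues coalesce. A tiny numerical illustration (the function name is ours):

```python
from itertools import combinations

def stat_curvature(lams):
    # gamma(A) = 2 * sum_{a < b} (lambda_a^2 + lambda_b^2) / (lambda_a - lambda_b)^2
    return 2.0 * sum((la**2 + lb**2) / (la - lb)**2
                     for la, lb in combinations(lams, 2))
```

For $\lambda = (2, 1)$ this gives $2(4+1)/1 = 10$; shrinking the gap to $(1.01, 1)$ inflates the curvature by more than three orders of magnitude, the quantitative counterpart of Remark 1 in Section 6.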
A.4. Proof of Proposition 3
We calculate each term in (29). $g^{ab} = 2\lambda_a^2\,\delta(a=b)$ from Proposition 1. Because of (22) and (26),
$$(\Gamma^m_M)^2_{ab} = (H^e_M)^2_{ab} = 0.$$
We next calculate $(H^m_A)^2_{ab}$, with indices raised by $g^{ab}$, i.e. $H^{m\,a}_{(s,t)(u,v)} \triangleq g^{ab}H^m_{(s,t)(u,v)b}$:
$$\begin{aligned}
(H^m_A)^2_{ab} &= \sum_{s<t,\,u<v,\,o<p,\,q<r} H^{m\,a}_{(s,t)(u,v)} H^{m\,b}_{(o,p)(q,r)}\, g^{(s,t)(o,p)} g^{(u,v)(q,r)}\\
&= \sum_{t>a,\,p>b} H^{m\,a}_{(a,t)(a,t)} H^{m\,b}_{(b,p)(b,p)} \bigl(g^{(a,t)(b,p)}\bigr)^2 + \sum_{t>a,\,p<b} H^{m\,a}_{(a,t)(a,t)} H^{m\,b}_{(p,b)(p,b)} \bigl(g^{(a,t)(p,b)}\bigr)^2\\
&\quad + \sum_{t<a,\,p>b} H^{m\,a}_{(t,a)(t,a)} H^{m\,b}_{(b,p)(b,p)} \bigl(g^{(t,a)(b,p)}\bigr)^2 + \sum_{t<a,\,p<b} H^{m\,a}_{(t,a)(t,a)} H^{m\,b}_{(p,b)(p,b)} \bigl(g^{(t,a)(p,b)}\bigr)^2. \tag{56}
\end{aligned}$$
If $a = b$, then the r.h.s. of (56) equals
$$\sum_{t>a} \bigl(H^{m\,a}_{(a,t)(a,t)}\bigr)^2 \bigl(g^{(a,t)(a,t)}\bigr)^2 + \sum_{t<a} \bigl(H^{m\,a}_{(t,a)(t,a)}\bigr)^2 \bigl(g^{(t,a)(t,a)}\bigr)^2 = \sum_{t>a} \bigl(2(\lambda_t - \lambda_a)\bigr)^2 \left(\frac{\lambda_a\lambda_t}{(\lambda_a - \lambda_t)^2}\right)^2 + \sum_{t<a} \bigl(2(\lambda_t - \lambda_a)\bigr)^2 \left(\frac{\lambda_a\lambda_t}{(\lambda_a - \lambda_t)^2}\right)^2 = 4\sum_{t \ne a} \frac{\lambda_a^2\lambda_t^2}{(\lambda_a - \lambda_t)^2}.$$
If $a \ne b$ (say $a < b$), then the r.h.s. of (56) equals
$$H^{m\,a}_{(a,b)(a,b)} H^{m\,b}_{(a,b)(a,b)} \bigl(g^{(a,b)(a,b)}\bigr)^2 = 4(\lambda_b - \lambda_a)(\lambda_a - \lambda_b)\left(\frac{\lambda_a\lambda_b}{(\lambda_a - \lambda_b)^2}\right)^2 = -\frac{4\lambda_a^2\lambda_b^2}{(\lambda_a - \lambda_b)^2}.$$
A.5. Proof of Proposition 4
The term of order $n$ in (34) vanishes, since $g_{a(s,t)}$ equals zero for $1 \le a \le p$, $1 \le s < t \le p$. We consider the term of order $O(1)$. Since $H^e_{ac(s,t)}$ also vanishes for $1 \le a, c \le p$, $1 \le s < t \le p$, we only have to consider the term
$$(1/2)\sum_{s<t,\,u<v,\,o<p,\,q<r} H^m_{(s,t)(u,v)a} H^m_{(o,p)(q,r)b}\, g^{(s,t)(o,p)} g^{(u,v)(q,r)}.$$
Because of (18), the above term equals
$$\begin{aligned}
&2^{-1}\sum_{t>a,\,p>b} H^m_{(a,t)(a,t)a} H^m_{(b,p)(b,p)b} \bigl(g^{(a,t)(b,p)}\bigr)^2 + 2^{-1}\sum_{t>a,\,p<b} H^m_{(a,t)(a,t)a} H^m_{(p,b)(p,b)b} \bigl(g^{(a,t)(p,b)}\bigr)^2\\
&\quad + 2^{-1}\sum_{t<a,\,p>b} H^m_{(t,a)(t,a)a} H^m_{(b,p)(b,p)b} \bigl(g^{(t,a)(b,p)}\bigr)^2 + 2^{-1}\sum_{t<a,\,p<b} H^m_{(t,a)(t,a)a} H^m_{(p,b)(p,b)b} \bigl(g^{(t,a)(p,b)}\bigr)^2. \tag{57}
\end{aligned}$$
If $a = b$, then (57) equals
$$\begin{aligned}
&2^{-1}\sum_{t>a} \bigl(H^m_{(a,t)(a,t)a}\bigr)^2 \bigl(g^{(a,t)(a,t)}\bigr)^2 + 2^{-1}\sum_{t<a} \bigl(H^m_{(t,a)(t,a)a}\bigr)^2 \bigl(g^{(t,a)(t,a)}\bigr)^2\\
&= 2^{-1}\left\{\sum_{t>a} \bigl(\lambda_a^{-2}(\lambda_t - \lambda_a)\bigr)^2 \left(\frac{\lambda_a\lambda_t}{(\lambda_a - \lambda_t)^2}\right)^2 + \sum_{t<a} \bigl(\lambda_a^{-2}(\lambda_t - \lambda_a)\bigr)^2 \left(\frac{\lambda_a\lambda_t}{(\lambda_a - \lambda_t)^2}\right)^2\right\}\\
&= 2^{-1}\sum_{t \ne a} \frac{\lambda_t^2}{\lambda_a^2(\lambda_a - \lambda_t)^2}.
\end{aligned}$$
If $a < b$, then (57) equals
$$2^{-1} H^m_{(a,b)(a,b)a} H^m_{(a,b)(a,b)b} \bigl(g^{(a,b)(a,b)}\bigr)^2 = 2^{-1}\lambda_a^{-2}(\lambda_b - \lambda_a)\,\lambda_b^{-2}(\lambda_a - \lambda_b)\left(\frac{\lambda_a\lambda_b}{(\lambda_a - \lambda_b)^2}\right)^2 = -\frac{1}{2(\lambda_a - \lambda_b)^2}.$$
References
Amari, S., 1982. Differential geometry of curved exponential families—curvature and information loss. Ann. Statist. 10, 357–385.
Amari, S., 1985. Differential-Geometrical Methods in Statistics. Lecture Notes in Statistics, vol. 28. Springer-Verlag, Berlin, Heidelberg.
Amari, S., Nagaoka, H., 2000. Methods of Information Geometry. Translations of Mathematical Monographs, vol. 191. American Mathematical Society, Providence.
Amari, S., Kumon, M., 1983. Differential geometry of Edgeworth expansions in curved exponential family. Ann. Inst. Statist. Math. 35, 1–24.
Anderson, G.A., 1965. An asymptotic expansion for the distribution of the latent roots of the estimated covariance matrix. Ann. Math. Statist. 36, 1153–1173.
Boothby, W.M., 2002. An Introduction to Differentiable Manifolds and Riemannian Geometry, revised 2nd ed. Academic Press, San Diego.
Calvo, M., Oller, J.M., 1990. A distance between multivariate normal distributions based on an embedding into Siegel group. J. Multivariate Anal. 35, 223–242.
Dey, D.K., 1988. Simultaneous estimation of eigenvalues. Ann. Inst. Statist. Math. 40, 137–147.
Dey, D.K., Srinivasan, C., 1985. Estimation of a covariance matrix under Stein's loss. Ann. Statist. 13, 1581–1591.
Efron, B., 1975. Defining the curvature of a statistical problem (with application to second order efficiency) (with discussion). Ann. Statist. 3, 1189–1242.
Eguchi, S., 1985. A differential geometric approach to statistical inference on the basis of contrast functionals. Hiroshima Math. J. 15, 341–391.
Fletcher, P.T., Joshi, S., 2007. Riemannian geometry for the statistical analysis of diffusion tensor data. Signal Process. 87, 250–262.
Haff, L.R., 1991. The variational form of certain Bayes estimators. Ann. Statist. 19, 1163–1190.
Hydorn, D.L., Muirhead, R.J., 1999. Polynomial estimation of eigenvalues. Commun. Statist. Theory Methods 28, 581–596.
Jin, C., 1993. A note on simultaneous estimation of eigenvalues of a multivariate normal covariance matrix. Statist. Probab. Lett. 16, 197–203.
Kumon, M., Amari, S., 1983. Geometrical theory of higher-order asymptotics of test, interval estimator and conditional inference. Proc. Roy. Soc. London A 387, 429–458.
Lawley, D.N., 1956. Test of significance for the latent roots of covariance and correlation matrices. Biometrika 43, 128–136.
Lenglet, C., Rousson, M., Deriche, R., Faugeras, O., 2006. Statistics on the manifold of multivariate normal distributions: theory and application to diffusion tensor MRI processing. J. Math. Imaging Vis. 25, 423–444.
Lovrić, M., Min-Oo, M., Ruh, E.A., 2000. Multivariate normal distributions parametrized as a Riemannian symmetric space. J. Multivariate Anal. 74, 36–48.
Moakher, M., Zéraï, M., 2011. The Riemannian geometry of the space of positive-definite matrices and its application to the regularization of positive-definite matrix-valued data. J. Math. Imaging Vis. 40, 171–187.
Muirhead, R.J., 1982. Aspects of Multivariate Statistical Theory. Wiley, New York.
Murray, M.K., Rice, J.W., 1993. Differential Geometry and Statistics. Chapman & Hall/CRC, Boca Raton.
Ohara, A., Suda, N., Amari, S., 1996. Dualistic differential geometry of positive definite matrices and its applications to related problems. Linear Algebra Appl. 247, 31–53.
Sheena, Y., Takemura, A., 2011. Admissible estimator of the eigenvalues of the variance–covariance matrix for multivariate normal distributions. J. Multivariate Anal. 102, 801–815.
Skovgaard, L.T., 1984. A Riemannian geometry of the multivariate normal model. Scand. J. Statist. 11, 211–233.
Smith, S.T., 2005. Covariance, subspace, and intrinsic Cramér–Rao bounds. IEEE Trans. Signal Process. 53, 1610–1630.
Takemura, A., 1984. An orthogonally invariant minimax estimator of the covariance matrix of a multivariate normal population. Tsukuba J. Math. 8, 367–376.
Yang, R., Berger, J.O., 1994. Estimation of a covariance matrix using the reference prior. Ann. Statist. 22, 1195–1221.
Yoshizawa, S., Tanabe, K., 1999. Dual differential geometry associated with the Kullback–Leibler information on the Gaussian distributions and its 2-parameter deformations. SUT J. Math. 35, 113–137.
Zhang, S., Sun, H., Li, C., 2009. Information geometry of positive definite matrices. J. Beijing Inst. Technol. 18, 484–487.