
Scand J Statist 5: 81-91, 1978

Markov Regime Models for Mixed Distributions and Switching Regressions

GEORG LINDGREN

University of Umeå

Received May 1977, revised August 1977

ABSTRACT. The paper deals with estimation of unknown parameters in a finite mixture of distributions or populations. The basic assumption is the existence of unobservable regime variables X_t, t = 1, ..., T which select the distribution to be observed for each t. If the X-variables are independent this is the classical mixed distribution problem. Here we suppose that X_t, t = 1, ..., T is a stationary Markov chain, which means that successive observations either tend to come from different or from similar populations. Such a dependence is reasonable if the observations are obtained sequentially in time.

The ML-estimators of marginal parameters and transition probabilities are derived using a maximization technique due to Baum. In a simulation study the ML-estimators are compared to the estimators derived under the assumption that the observations are independent, and it is shown that when there actually is a dependence, then the estimators based on the full Markov model are superior.

Key words: Incomplete data, Markov chains, maximum likelihood, mixtures, partial observation, robustness, switching regression

1. Introduction

This paper deals with the statistical estimation of unknown parameters in a finite mixture

f(y) = \sum_{j=1}^{r} \pi_j f_j(y)   (1)

of distributions with densities f_j, j = 1, ..., r from a sequence of observations Y_t, t = 1, ..., T from the density (1). Such mixtures often appear as the result of a physical mixture of populations with different characteristics, e.g. different sexes or races in biology, different polymer molecules in chemistry, or different demand and supply markets in econometrics.

The basic assumption made here is the existence of regime variables X_t, t = 1, ..., T which for each t select one of the distributions f_j which is then observed, i.e. the conditional density of Y_t given that X_t = j is equal to f_j. If the X-variables are independent with P(X_t = j) = \pi_j we have the classical mixed distribution problem with independent observations, which has been tackled by several authors, starting with Karl Pearson (1894) and Charlier (1906). As is well known this is a very difficult estimation problem, which requires large samples and good initial guesses in order to be successfully solved.

The estimation procedures for mixtures of normal distributions are mainly of three types: moment estimators, such as those of Pearson, Charlier and, more recently, Cohen (1967); graphical methods, e.g. Hald (1952), Sect. 6.10, and Bhattacharya (1967); and, most promising, maximum likelihood estimators as developed by Rao (1948), Hasselblad (1966), Day (1969), and Wolfe (1970).

General principles for separation have been developed by Hasselblad (1969), Orchard & Woodbury (1972), and Sundberg (1974), who all employ a missing information technique in connection with exponential families. A recent reference for maximum likelihood estimation with missing information is Dempster et al. (1977). Behboodian (1975) contains some more references.

All the above-mentioned authors have been concerned with independent observations, i.e. the regime variables X_t are independent and unobservable.

Some effort has also been made to utilize additional information in order to improve the estimates. Hosmer (1973) and Dick & Bowden (1973) showed that if one has an additional small sample which is known to come from one specific component, then the estimates are considerably improved. Formulated in terms of the regime variables X_t this means that we know, for some t, the exact value of X_t, while for most t-values the regimes are unknown and still independent.

In this paper we shall use the regime variables more systematically in that we suppose that X_t, t = 1, ..., T is a realization of a stochastic process with known or partially known distribution. In other words, we postulate some type of dependence between X_s and X_t for s ≠ t, and then use this dependence in order to improve the estimates of parameters in the marginal distribution (1).


Such a dependence is often reasonable, especially if the observations Y_t, t = 1, ..., T are obtained sequentially in time. Then it can be natural to assume that adjacent observations will either tend to come from the same or tend to come from different populations.

Closely related to mixed distributions is what is sometimes called switching regressions, i.e. one has a sequence of observations Y_t, t = 1, ..., T where each Y_t is generated by one of r possible regression expressions,

Y_t^{(j)} = \alpha_j + \beta_j' u^{(t)} + \varepsilon_j^{(t)},

where the residuals \varepsilon_j^{(t)} are independent with E(\varepsilon_j^{(t)}) = 0, V(\varepsilon_j^{(t)}) = \sigma_j^2, j = 1, ..., r, t = 1, ..., T.

Formulated by the aid of the regimes X_t, t = 1, ..., T one has that

(Y_t | X_t = j) = \alpha_j + \beta_j' u^{(t)} + \varepsilon_j^{(t)}.

2. Markov regime models

Let \{X_t\}_{t=1}^\infty be a finite state Markov chain with state space \{1, ..., r\}, stationary transition probabilities

\pi_{jk} = P(X_{t+1} = k | X_t = j)

and stationary distribution

\pi_j = P(X_t = j).

Let f_j, j = 1, ..., r be probability density functions with joint continuous or discrete state space and let \{Y_t\}_{t=1}^T be a sequence of random variables such that, given that X_t = j, the conditional distribution of Y_t has the density f_j. We furthermore assume that the Y-variables are conditionally independent, given the X-variables, i.e. the conditional density of Y_1, ..., Y_T given X_1 = j_1, ..., X_T = j_T is

\prod_{t=1}^{T} f_{j_t}(y_t).   (2)

The marginal density of Y_t is then

f(y) = \sum_{j=1}^{r} \pi_j f_j(y),

so that we have a mixed distribution like (1). For regression models we will sometimes let the densities depend also on independent variables u^{(t)}, in which case we replace (2) by

\prod_{t=1}^{T} f_{j_t}(y_t; u^{(t)}).

In some literature a process \{Y_t\}_{t=1}^T of the specified type is called a sequence of probabilistic functions of the Markov chain \{X_t\}_{t=1}^\infty (Petrie, 1969). Most work has been done for functions with finite state space, since this easily reduces to the problem of partially observed finite Markov chains. Especially useful is the treatment by Baum et al. (1970) from which many of the ideas in this paper have been taken.
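To fix ideas, here is a minimal Python sketch (ours, not part of the paper) of how observations from such a probabilistic function of a Markov chain can be generated: a regime path is drawn from the chain, and each observation is then drawn from the density selected by the current regime. The name simulate_markov_mixture and the sampler argument are illustrative choices; the parameter values mirror the two-state normal mixture of the simulation study in Section 4.

```python
import numpy as np

rng = np.random.default_rng(0)

def simulate_markov_mixture(T, P, pi0, sample_y):
    """Draw a regime path x_1..x_T from the Markov chain (P, pi0) and
    conditionally independent observations y_t from the regime's density."""
    r = len(pi0)
    x = np.zeros(T, dtype=int)
    x[0] = rng.choice(r, p=pi0)
    for t in range(1, T):
        x[t] = rng.choice(r, p=P[x[t - 1]])
    y = np.array([sample_y(j) for j in x])
    return x, y

# Two-state normal mixture as in Section 4: mu = (0, 10), sigma = 2.5
P = np.array([[0.9, 0.1], [0.1, 0.9]])
pi0 = np.array([0.5, 0.5])
x, y = simulate_markov_mixture(200, P, pi0,
                               lambda j: rng.normal((0.0, 10.0)[j], 2.5))
```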

The assumption that the dependence between the regimes X_t is of Markov type is of course sometimes hard to justify on logical grounds. Since it is the simplest dependence structure which can be handled, it may nevertheless be worthwhile to use it, at least as a first approximation.

If the regimes X_1, ..., X_T which govern the distributions of the observations Y_1, ..., Y_T are completely known, the estimation of the parameters in the marginal distribution (1) is a simple statistical problem involving certain, but random, numbers of observations from each of the separate distributions f_j, j = 1, ..., r. In order to carry out the estimation when the regimes are unobservable we need some reconstruction technique based on what is actually observed. We will here employ a missing information principle which utilizes the posterior distribution of the regime X_t given the observations; cf. Orchard & Woodbury (1972) and Sundberg (1974). Reconstruction of X_t from observations y_1, ..., y_t is usually called a filtering problem, while reconstruction from y_1, ..., y_T is called an interpolation problem. All information about X_t is contained in the conditional probabilities

\pi_j(t|s) = P(X_t = j | y_1, ..., y_s),

of which we shall make frequent use of

\pi_j(t|t) = P(X_t = j | y_1, ..., y_t)

and

\pi_j(t|T) = P(X_t = j | y_1, ..., y_T).

The filtering probabilities \pi_j(t|t) can be calculated in real time, while the interpolation probabilities \pi_j(t|T) can be used retrospectively after the whole experiment has been completed.

We also need the conditional probability that a transition has taken place between any two states j and k,

\pi_{jk}(t|T) = P(X_t = j, X_{t+1} = k | y_1, ..., y_T),

and the conditional density for Y_{t+1}, ..., Y_T given X_t = j,

\beta_j(t) = f_{Y_{t+1}, ..., Y_T | X_t = j}(y_{t+1}, ..., y_T).


As starting values we define

\pi_j(1|0) = \pi_{0j} = P(X_1 = j),

\beta_j(T) = 1, j = 1, ..., r.

For notational convenience we introduce the normalizing operator N such that if \alpha_j, j = 1, ..., r are non-negative real numbers then

N\alpha_j = \alpha_j \Big/ \sum_{k=1}^{r} \alpha_k, j = 1, ..., r.

The following forward and backward recursion formulas follow from the Markov structure; see also Baum et al. (1970).

Lemma 2.1.

\pi_j(t|t-1) = \sum_{i=1}^{r} \pi_i(t-1|t-1) \pi_{ij},

\pi_j(t|t) = N \sum_{i=1}^{r} \pi_i(t-1|t-1) \pi_{ij} f_j(y_t) = N \pi_j(t|t-1) f_j(y_t),

\beta_i(t) = \sum_{j=1}^{r} \pi_{ij} f_j(y_{t+1}) \beta_j(t+1),   (4)

\pi_j(t|T) = N \pi_j(t|t) \beta_j(t),

\pi_{jk}(t|T) = \pi_j(t|T) \pi_{jk} f_k(y_{t+1}) \beta_k(t+1) / \beta_j(t).   (5)

Proof. We show only (4) and (5). The other formulas are even simpler. The Markov structure gives

\beta_i(t) = f_{Y_{t+1}, ..., Y_T | X_t = i}(y_{t+1}, ..., y_T)
= \sum_{j=1}^{r} P(X_{t+1} = j | X_t = i) f_{Y_{t+1} | X_{t+1} = j}(y_{t+1}) f_{Y_{t+2}, ..., Y_T | X_{t+1} = j}(y_{t+2}, ..., y_T)
= \sum_{j=1}^{r} \pi_{ij} f_j(y_{t+1}) \beta_j(t+1),

and

\pi_{jk}(t|T) = P(X_t = j, X_{t+1} = k | y_1, ..., y_T)
= \frac{f_{Y_1, ..., Y_t}(y_1, ..., y_t)}{f_{Y_1, ..., Y_T}(y_1, ..., y_T)} P(X_t = j | y_1, ..., y_t) \times P(X_{t+1} = k | X_t = j) f_{Y_{t+1} | X_{t+1} = k}(y_{t+1}) f_{Y_{t+2}, ..., Y_T | X_{t+1} = k}(y_{t+2}, ..., y_T)
= C_t \pi_j(t|t) \pi_{jk} f_k(y_{t+1}) \beta_k(t+1), say.

Since \sum_{k=1}^{r} \pi_{jk}(t|T) = \pi_j(t|T) and \sum_{k} \pi_{jk} f_k(y_{t+1}) \beta_k(t+1) = \beta_j(t), we find that C_t \pi_j(t|t) = \pi_j(t|T)/\beta_j(t), which implies (5). □
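The recursions of Lemma 2.1 translate directly into code. The following Python sketch (ours, using the lemma's notation) computes the filtering probabilities \pi_j(t|t), the interpolation probabilities \pi_j(t|T) and the transition posteriors \pi_{jk}(t|T). The backward factors \beta_j(t) are left unnormalized exactly as in the lemma; a production version would rescale them to avoid underflow on long series.

```python
import numpy as np

def forward_backward(y, P, pi0, densities):
    """Lemma 2.1: forward filtering, backward factors beta_j(t),
    interpolation probabilities and transition posteriors.

    densities[j](y) evaluates f_j(y); P[j, k] = pi_jk; pi0[j] = P(X_1 = j).
    """
    T, r = len(y), len(pi0)
    f = np.array([[densities[j](y[t]) for j in range(r)] for t in range(T)])

    pred = np.zeros((T, r))          # pi_j(t|t-1)
    filt = np.zeros((T, r))          # pi_j(t|t)
    pred[0] = pi0                    # pi_j(1|0) = pi_0j
    for t in range(T):
        if t > 0:
            pred[t] = filt[t - 1] @ P
        w = pred[t] * f[t]           # pi_j(t|t-1) f_j(y_t), then normalize
        filt[t] = w / w.sum()

    beta = np.ones((T, r))           # beta_j(T) = 1
    for t in range(T - 2, -1, -1):   # eq. (4), unnormalized
        beta[t] = P @ (f[t + 1] * beta[t + 1])

    smooth = filt * beta             # pi_j(t|T) = N pi_j(t|t) beta_j(t)
    smooth /= smooth.sum(axis=1, keepdims=True)

    trans = np.zeros((T - 1, r, r))  # eq. (5)
    for t in range(T - 1):
        trans[t] = (smooth[t] / beta[t])[:, None] * P \
                   * (f[t + 1] * beta[t + 1])[None, :]
    return filt, smooth, trans
```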

3. ML-estimation of marginal and transition parameters

Of all decomposition techniques for mixtures, maximum likelihood estimation has proved to be the most efficient, not only for large samples, but also for small ones. It is furthermore the only technique which can readily be generalized to dependent regimes.

From now on we let the densities f_j in the marginal distribution (1) depend on unknown, possibly multi-dimensional, marginal parameters \theta_j so that each component in the mixture has its own parameter, f_j(y) = f_j(y; \theta_j). The marginal density for each Y_t is then

f(y) = \sum_{j=1}^{r} \pi_j f_j(y; \theta_j).

To begin with we suppose that the proportions \pi_j and the transition probabilities \pi_{jk} are known, and concentrate on the estimation of \theta = (\theta_1, ..., \theta_r). We will return to the estimation of the Markov parameters in Section 3.3.

Maximum likelihood estimation of parameters in mixed distributions for independent variables is usually performed either by direct calculation of the likelihood function and its derivatives or via an iterative scheme where the observations y_t are weighted with the posterior probabilities P(X_t = j | y_t); see Hasselblad (1966) and Behboodian (1970). Both principles can be used when the regimes are Markov dependent.

3.1. The likelihood function

The likelihood function and its derivatives can be calculated recursively using Lemma 2.1 and the posterior probabilities \pi_j(t|t-1).

Theorem 3.1. The loglikelihood function for the parameter \theta = (\theta_1, ..., \theta_r) given the observations y_1, ..., y_T is

L(\theta) = \sum_{t=1}^{T} \log f(y_t | y_1, ..., y_{t-1})   (6)

where

f(y_t | y_1, ..., y_{t-1}) = \sum_{j=1}^{r} \pi_j(t|t-1) f_j(y_t; \theta_j)

is the posterior density for Y_t given y_1, ..., y_{t-1}. Writing \partial/\partial\theta_k for the vector of partial derivatives with respect to the components in \theta_k one also has


\frac{\partial L}{\partial \theta_k} = \sum_{t=1}^{T} f_t^{-1} \Big\{ \pi_k(t|t-1) \frac{\partial}{\partial \theta_k} f_k(y_t; \theta_k) + \sum_{j} \frac{\partial \pi_j(t|t-1)}{\partial \theta_k} f_j(y_t; \theta_j) \Big\},

where f_t = f(y_t | y_1, ..., y_{t-1}), and

\frac{\partial \pi_j(t|t-1)}{\partial \theta_k} = \sum_{i} \frac{\partial \pi_i(t-1|t-1)}{\partial \theta_k} \pi_{ij},

\frac{\partial \pi_j(t|t)}{\partial \theta_k} = f_t^{-1} \Big\{ \frac{\partial \pi_j(t|t-1)}{\partial \theta_k} f_j(y_t; \theta_j) + \delta_{jk} \pi_j(t|t-1) \frac{\partial}{\partial \theta_k} f_j(y_t; \theta_j) \Big\} - f_t^{-2} \pi_j(t|t-1) f_j(y_t; \theta_j) \frac{\partial f_t}{\partial \theta_k}.

The likelihood function and its derivatives can be calculated via the recursive formulas, and thus numerical maximization is possible. Even the second partial derivatives can be computed recursively, but the formulas are complicated and can give rise to doubts about the numerical accuracy.
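For concreteness, a short sketch (ours) of the evaluation of (6) by the forward pass alone; the resulting function of \theta can then be handed to a numerical optimizer, e.g. by passing its negative to scipy.optimize.minimize.

```python
import numpy as np

def loglikelihood(y, P, pi0, densities):
    """L(theta) via (6): f(y_t | y_1..y_{t-1}) = sum_j pi_j(t|t-1) f_j(y_t),
    with pi_j(t|t-1) updated by the forward recursion of Lemma 2.1."""
    pred = np.asarray(pi0, dtype=float)       # pi_j(1|0)
    L = 0.0
    for yt in y:
        f_t = np.array([f(yt) for f in densities])
        c = pred @ f_t                        # conditional density of y_t
        L += np.log(c)
        pred = (pred * f_t / c) @ P           # pi_j(t|t) -> pi_j(t+1|t)
    return L
```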

3.2. An iterative maximization scheme

The iterative scheme for the ML-estimates derived by Hasselblad is intuitively appealing in that it replaces the missing information on the regime X_t with the likelihood for the different outcomes. Orchard & Woodbury (1972) and Sundberg (1974) have given a systematic account of this missing information technique for independent variables.

Without coupling it to mixed distributions, Baum et al. (1970) have derived an iterative technique analogous to that of Hasselblad for estimation of parameters in Markov regime models; see also Baum (1972). We will here summarize Baum's results in Theorem 3.2.

The iteration starts with initial estimates \theta = (\theta_1, ..., \theta_r) of the marginal parameters, from which we can compute the posterior probabilities for interpolation

\pi_j^\theta(t|T) = P_\theta(X_t = j | y_1, ..., y_T)

by means of Lemma 2.1. The dependence on \theta is indicated by the index on P_\theta.

For each j separately one then maximizes the weighted loglikelihood function

\sum_{t=1}^{T} \pi_j^\theta(t|T) \log f_j(y_t; \theta_j')

as a function of \theta_j'. Let us for the moment use the notation \theta_j' also for the solution of this maximization problem. Then \theta' = (\theta_1', ..., \theta_r') is a new and, as will be seen, better set of estimates of the marginal parameters. The whole procedure can be repeated, and under mild restrictions the estimates will converge towards a local maximum of the likelihood function.

The transformation \tau(\theta) = \theta' is often easy to express explicitly, and the new estimates will have a simple form. The following theorem contains the results by Baum et al. (1970) concerning the transformation \tau.

Theorem 3.2. Let the transformation \tau be defined by \tau(\theta) = \theta', where \theta_j' maximizes

Q_j(\theta_j') = \sum_{t=1}^{T} \pi_j^\theta(t|T) \log f_j(y_t; \theta_j'), j = 1, ..., r.

Then \tau increases the likelihood function, i.e.

L(\tau(\theta)) \ge L(\theta),

with strict inequality unless \theta is a stationary point of L, in which case \tau(\theta) = \theta. □

As was shown by Baum et al., the transformation \tau is unique and continuous if the densities fulfil some mild regularity and extremal conditions, e.g. if \log f_j(y; \theta_j) is strictly concave for each y, or if \theta_j = (\mu_j, \sigma_j) is a location and scale parameter and

f_j(y; \theta_j) = \sigma_j^{-1} f((y - \mu_j)/\sigma_j), j = 1, ..., r,

where \log f(u) is strictly concave. Of course one also needs that for all j the posterior probability \pi_j(t|T) is strictly positive for at least one t.

Example 3.1. For a mixed Poisson distribution with

f_j(y; \theta_j) = e^{-\theta_j} \theta_j^y / y!,

\sum_{t=1}^{T} \pi_j(t|T) \log f_j(y_t; \theta_j') = \sum_{t=1}^{T} \pi_j(t|T) \{ -\theta_j' + y_t \log \theta_j' - \log y_t! \}

is maximized by

\theta_j' = \sum_{t=1}^{T} \pi_j(t|T) y_t \Big/ \sum_{t=1}^{T} \pi_j(t|T).
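In code, the update of Example 3.1 is simply a weighted mean of the observations, the weights being the interpolation probabilities (a one-line sketch of ours):

```python
import numpy as np

def poisson_update(y, w):
    """theta_j' = sum_t pi_j(t|T) y_t / sum_t pi_j(t|T); w[t] = pi_j(t|T)."""
    y, w = np.asarray(y, dtype=float), np.asarray(w, dtype=float)
    return (w * y).sum() / w.sum()
```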

Example 3.2. For a mixture of normal distributions

f(y) = \sum_{j=1}^{r} \pi_j \frac{1}{\sqrt{2\pi}\,\sigma_j} \exp\big( -(y - \mu_j)^2 / 2\sigma_j^2 \big)


with unknown location and scale parameters one gets the same iterative scheme as Hasselblad (1966) and Behboodian (1970), but with \pi_j(t|T) computed from the Markov model,

\mu_j' = \sum_{t=1}^{T} \pi_j(t|T) y_t \Big/ \sum_{t=1}^{T} \pi_j(t|T), \qquad (\sigma_j')^2 = \sum_{t=1}^{T} \pi_j(t|T)(y_t - \mu_j')^2 \Big/ \sum_{t=1}^{T} \pi_j(t|T).

Example 3.3. In a switching regression model with normal residuals,

(Y_t | X_t = j) = \alpha_j + \beta_j' u^{(t)} + \varepsilon_j^{(t)}, V(\varepsilon_j^{(t)}) = \sigma_j^2,

the parameters are \theta_j = (\alpha_j, \beta_j, \sigma_j^2), j = 1, ..., r. If we drop the primes from \alpha_j', \beta_j', (\sigma_j^2)', the maximization of Q_j(\theta_j') gives the equations

\frac{\partial Q_j}{\partial \alpha_j} = \sum_{t=1}^{T} \pi_j(t|T)(y_t - \alpha_j - \beta_j' u^{(t)}) / \sigma_j^2 = 0,

\frac{\partial Q_j}{\partial \beta_j} = \sum_{t=1}^{T} \pi_j(t|T) u^{(t)}(y_t - \alpha_j - \beta_j' u^{(t)}) / \sigma_j^2 = 0,

\frac{\partial Q_j}{\partial \sigma_j^2} = \sum_{t=1}^{T} \pi_j(t|T) \{ (y_t - \alpha_j - \beta_j' u^{(t)})^2 - \sigma_j^2 \} / 2\sigma_j^4 = 0,

which are only slight modifications of the familiar normal equations.
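These three equations amount to a weighted least squares fit per regime. A sketch (ours, for one regime j, with an intercept column standing in for \alpha_j):

```python
import numpy as np

def switching_regression_update(y, U, w):
    """Solve the weighted normal equations of Example 3.3 for one regime:
    (alpha_j, beta_j) by weighted least squares, then sigma_j^2 as the
    weighted mean squared residual. w[t] = pi_j(t|T), U[t] = u^(t)."""
    y, w = np.asarray(y, dtype=float), np.asarray(w, dtype=float)
    X = np.column_stack([np.ones(len(y)),
                         np.atleast_2d(U).reshape(len(y), -1)])
    sw = np.sqrt(w)
    coef, *_ = np.linalg.lstsq(sw[:, None] * X, sw * y, rcond=None)
    resid = y - X @ coef
    sigma2 = (w * resid ** 2).sum() / w.sum()
    return coef[0], coef[1:], sigma2
```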

3.3. Estimation of transition probabilities

In reality, the mixing proportions \pi_j are unknown, and so are of course the transition probabilities \pi_{jk} in the postulated Markov chain. The maximum likelihood estimates of the transition matrix \Pi = (\pi_{jk}), and consequently also of the stationary mixing proportions \pi = (\pi_1, ..., \pi_r), can be obtained by an iteration technique similar to the one that yields the estimates of the marginal parameters \theta = (\theta_1, ..., \theta_r). This is most clearly seen from an analogy with the situation when the Markov chain X_t, t = 1, ..., T can be observed. Then the maximum likelihood estimates of the transition probabilities \pi_{jk} are obtained by a maximization of

\log \pi_{0 x_1} + \sum_{t=1}^{T-1} \log \pi_{x_t x_{t+1}} = \log \pi_{0 x_1} + \sum_{j,k=1}^{r} n_{jk} \log \pi_{jk}   (7)

under the restrictions \pi_{jk} \ge 0, \sum_k \pi_{jk} = 1. The number of transitions j \to k in the sequence x = (x_1, ..., x_T) is denoted by n_{jk},

n_{jk} = \sum_{t=1}^{T-1} 1\{x_t = j, x_{t+1} = k\}.

Maximization of (7) can be performed for each j separately, and we see that \sum_k n_{jk} \log \pi_{jk} attains its maximum for

\pi_{jk}^* = n_{jk} \Big/ \sum_{l=1}^{r} n_{jl}.   (8)

In the present situation we have to replace the indicator 1\{x_t = j, x_{t+1} = k\} by the posterior probability of a transition j \to k,

\pi_{jk}(t|T) = P_{\theta, \Pi}(X_t = j, X_{t+1} = k | y_1, ..., y_T),

calculated as in Lemma 2.1. The iteration procedure then goes as follows. Take some initial estimates \theta = (\theta_1, ..., \theta_r) and \Pi = (\pi_{jk}) of the marginal and transition parameters, and let similarly \pi_0 = (\pi_{01}, ..., \pi_{0r}) be initial estimates of the starting distribution for X_1. Of course, estimation of \pi_{0j} = P(X_1 = j) from only one sample makes no sense if we do not assume it to be the stationary distribution corresponding to the transition matrix \Pi. In this context \pi_0 is introduced separately from \pi only as a computational aid.

Using \theta, \pi_0, and \Pi, compute the posterior probabilities for states and transitions, \pi_j(t|T) and \pi_{jk}(t|T), by Lemma 2.1, and finally take as new estimates

\pi_{0j}' = \pi_j(1|T)   (9)

and

\pi_{jk}' = \sum_{t=1}^{T-1} \pi_{jk}(t|T) \Big/ \sum_{l=1}^{r} \sum_{t=1}^{T-1} \pi_{jl}(t|T)   (10)

in analogy with (8). The analogy with the completely observable case goes even further, and as will be shown in the following theorem, \pi_{jk}' maximizes the analogue of (7),

\sum_{k=1}^{r} \Big\{ \sum_{t=1}^{T-1} \pi_{jk}(t|T) \Big\} \log \pi_{jk}.

The whole procedure can then be repeated as many times as necessary.
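As a sketch (ours), the updates (9) and (10) given the posterior quantities from Lemma 2.1, together with the natural estimate of the stationary proportions described next:

```python
import numpy as np

def markov_parameter_update(smooth, trans):
    """(9): pi'_0j = pi_j(1|T); (10): pi'_jk = sum_t pi_jk(t|T), row-normalized.

    smooth : (T, r) array of pi_j(t|T); trans : (T-1, r, r) of pi_jk(t|T).
    """
    pi0_new = smooth[0]                             # eq. (9)
    num = trans.sum(axis=0)                         # sum_t pi_jk(t|T)
    P_new = num / num.sum(axis=1, keepdims=True)    # eq. (10)
    pi_stationary = smooth.mean(axis=0)             # T^{-1} sum_t pi_j(t|T)
    return pi0_new, P_new, pi_stationary
```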

The stationary mixing proportions \pi_j are most naturally estimated by

\pi_j^* = T^{-1} \sum_{t=1}^{T} \pi_j(t|T),

even if this does not give the maximum likelihood estimator.


The following theorem was proved by Petrie (1969) for what he called a probabilistic function of a Markov chain. More details on this and on the proof of Theorem 3.2 can be found in Lindgren (1976).

Theorem 3.3. The transformations (\pi_0, \Pi) \to (\pi_0', \Pi') defined by (9) and (10) and \theta \to \theta' defined in Theorem 3.2 increase the likelihood function, i.e.

L(\theta', \pi_0', \Pi') \ge L(\theta, \pi_0, \Pi),

with strict inequality unless (\theta, \pi_0, \Pi) is a stationary point. □

The iterative scheme for ML-estimation of \theta, \pi_0 and \Pi has been coded by the author in a FORTRAN subroutine MARMIX, which has been implemented and tested on a CDC Cyber 172 at Umeå University Computing Center.
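MARMIX itself is FORTRAN; as an illustration only, here is a compact Python sketch of the same iteration for the normal mixture case (all function and variable names are ours, not the original routine's). It alternates the recursions of Lemma 2.1 with the updates of Example 3.2 and (9)-(10), and stops, as in Section 4, when successive approximations agree within a tolerance; the backward factors are left unscaled, which is adequate for short series only.

```python
import numpy as np

def marmix_normal(y, P, pi0, mu, sigma, n_iter=20, tol=2e-3):
    """EM-type iteration for a normal mixture with Markov regimes:
    E-step by Lemma 2.1, M-step by Example 3.2 and eqs. (9)-(10)."""
    y = np.asarray(y, dtype=float)
    T, r = len(y), len(pi0)
    P, pi0 = np.array(P, dtype=float), np.array(pi0, dtype=float)
    mu, sigma = np.array(mu, dtype=float), np.array(sigma, dtype=float)
    for _ in range(n_iter):
        old = np.concatenate([mu, sigma, P.ravel()])
        # E-step: forward-backward (beta unscaled; fine for small T)
        f = np.exp(-0.5 * ((y[:, None] - mu) / sigma) ** 2) \
            / (np.sqrt(2.0 * np.pi) * sigma)
        pred, filt = np.zeros((T, r)), np.zeros((T, r))
        pred[0] = pi0
        for t in range(T):
            if t > 0:
                pred[t] = filt[t - 1] @ P
            w = pred[t] * f[t]
            filt[t] = w / w.sum()
        beta = np.ones((T, r))
        for t in range(T - 2, -1, -1):
            beta[t] = P @ (f[t + 1] * beta[t + 1])
        smooth = filt * beta
        smooth /= smooth.sum(axis=1, keepdims=True)
        trans = np.array([(smooth[t] / beta[t])[:, None] * P
                          * (f[t + 1] * beta[t + 1])[None, :]
                          for t in range(T - 1)])
        # M-step: weighted means and standard deviations, then (9)-(10)
        wsum = smooth.sum(axis=0)
        mu = smooth.T @ y / wsum
        sigma = np.sqrt((smooth * (y[:, None] - mu) ** 2).sum(axis=0) / wsum)
        pi0 = smooth[0]
        num = trans.sum(axis=0)
        P = num / num.sum(axis=1, keepdims=True)
        if np.abs(np.concatenate([mu, sigma, P.ravel()]) - old).max() < tol:
            break
    return mu, sigma, P, pi0, smooth
```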

3.4. Robustness and consistency of ML-estimates

As many authors have noted, the ML-estimates of marginal parameters in mixed distributions are not necessarily consistent. In a location and scale parameter family such as in Example 3.2 the likelihood function is unbounded if \mu_j is equal to some y_t and \sigma_j is near zero. This means that the likelihood is unbounded near the boundary of the parameter space \Theta = \{(\mu_j, \sigma_j); \mu_j \in R, \sigma_j > 0\}. This can be remedied in different ways. Hasselblad (1966) makes a grouping of the observations which transforms the problem into an estimation problem in a multinomial distribution. Behboodian (1970) proposes that one take the greatest local maximum in the interior of \Theta as the estimator. Hasselblad's method is natural from a practical point of view but leads to some numerical complications. Better motivated by principles of statistical model building is the use of a compact parameter space, and for the location and scale mixture one can take

\Theta = \{(\mu_j, \sigma_j); |\mu_j| \le c, 0 < c_1 \le \sigma_j \le c_2, j = 1, ..., r\}.

Besides being mathematically attractive, this has the important statistical relevance that it will prevent the distributions from being too different and exclude aberrant forms from the class of permitted distributions. To be sure of this we of course also have to impose some continuity or differentiability restrictions. For example, if we fit a mixture of two normal distributions with means and standard deviations \mu_1, \sigma_1 and \mu_2, \sigma_2 respectively, we presumably have some prior reason for doing so, and then we also have some opinion of what reasonable values of \sigma_1 and \sigma_2 are. Permitting \sigma_1 to tend to zero introduces a one point distribution into the class of distributions, and if this is the intention, we should have done it from the beginning.

We will not embark upon a general treatment of asymptotic properties of the ML-estimates in Markov regime models. The consistency and asymptotic normality of the estimators seem to be established only for families of discrete distributions f_j with finite state space; that was done already by Baum & Petrie (1966) in an early paper on probabilistic functions of Markov chains.

In the rest of this section we will investigate the robustness against a dependence structure of the ML-estimates of marginal parameters derived under the hypothesis of independent regimes. In cases where the Markov dependence between regimes is unclear, or perhaps not even realized, one could be tempted to derive the ML-estimates of the parameters in the density

f(y) = \sum_{j=1}^{r} \pi_j f_j(y; \theta_j)

under the assumption that the regimes are independent. This means that one maximizes the loglikelihood function

l(\theta, \pi) = \sum_{t=1}^{T} \log f(y_t)   (11)

as a function of the marginal parameters \theta = (\theta_1, ..., \theta_r) and the stationary mixing proportions \pi = (\pi_1, ..., \pi_r). This is the standard mixing problem with independent observations. We call the solution an MLI-estimator (Maximum Likelihood under Independence assumption) and denote it (\tilde\theta, \tilde\pi). Robustness of MLI-estimators against different dependence structures has been studied by Ranneby (1975) for distributions with finite or countably infinite state space. The following is a version of his Theorem 3 for continuous state space.

Theorem 3.4. Suppose that the regime process \{X_t\}_{t=1}^\infty is an ergodic Markov chain, and that the parameter space \Theta for \theta is compact with the true parameter \theta_0 as an interior point. Also suppose that the mixture is identifiable in the sense that for each \delta > 0 there is a set of outcomes I_\delta for Y_t with P(Y_t \in I_\delta) > 0 such that

\inf_{|\theta - \theta'| > \delta} |f(y; \theta) - f(y; \theta')| > 0

for all y \in I_\delta. Furthermore suppose there is a neighbourhood S of \theta_0 such that

E\big( \sup_{\theta \in S} |\log f(Y_1; \theta)| \big) < \infty,

\int \sup_{\theta \in S} \Big| \frac{\partial f(y; \theta)}{\partial \theta_i} \Big| \, dy < \infty,

\int \sup_{\theta \in S} \Big| \frac{\partial^2 f(y; \theta)}{\partial \theta_i \partial \theta_j} \Big| \, dy < \infty,

E\Big( \sup_{\theta \in S} \Big| \frac{\partial^3 \log f(Y_1; \theta)}{\partial \theta_i \partial \theta_j \partial \theta_k} \Big| \Big) < \infty,

E\Big| \frac{\partial \log f(Y_1; \theta_0)}{\partial \theta_i} \Big|^{2+\delta} < \infty,

for some \delta > 0. Then the MLI-estimator \tilde\theta is consistent and asymptotically normal,

\sqrt{T}(\tilde\theta - \theta_0) \to_L N(0, \Sigma^{-1} C \Sigma^{-1})

where \Sigma = (\sigma_{ij}), C = (c_{ij}),

\sigma_{ij} = \mathrm{Cov}\Big\{ \frac{\partial \log f(Y_1)}{\partial \theta_i}, \frac{\partial \log f(Y_1)}{\partial \theta_j} \Big\},

c_{ij} = \lim_{T \to \infty} \mathrm{Cov}\Big\{ \frac{1}{\sqrt{T}} \sum_{t=1}^{T} \frac{\partial \log f(Y_t)}{\partial \theta_i}, \frac{1}{\sqrt{T}} \sum_{t=1}^{T} \frac{\partial \log f(Y_t)}{\partial \theta_j} \Big\}.

Similar results hold for \tilde\pi, with \partial\theta_i replaced by \partial\pi_j.

Proof. The proofs of consistency and asymptotic normality given by Ranneby (1975) and (1976) work also for a probabilistic function of a Markov chain. We only have to check that the mixing condition for asymptotic independence of the Y-variables is satisfied, i.e. that if

\sup_{A, B} |P(A \cap B) - P(A) P(B)| = \alpha_Y(k)   (12)

when A \in \sigma(Y_1, ..., Y_t), B \in \sigma(Y_{t+k+1}, ...), then

\sum_{k=1}^{\infty} \alpha_Y(k)^{\delta/(2+\delta)} < \infty (\delta > 0).   (13)

To see that (13) holds, let X^{(t)} = (X_1, ..., X_t), X^{(t+k+1)} = (X_{t+k+1}, ...) and define \alpha_X(k) by (12) with Y replaced by X. Then

\alpha_X(k) \le C \varrho^k (0 < \varrho < 1),

so that (13) is fulfilled with \alpha_Y replaced by \alpha_X. But it is easy to see that the dependence between Y_t-events can not be greater than that between X_t-events. In fact, due to the conditional independence between A and B given X^{(t)} and X^{(t+k+1)} we have

P(A \cap B) = E(P(A \cap B | X^{(t)}, X^{(t+k+1)})) = E(P(A | X^{(t)}) P(B | X^{(t+k+1)})).

Here P(A | X^{(t)}) and P(B | X^{(t+k+1)}) are bounded random variables, measurable with respect to \sigma(X^{(t)}) and \sigma(X^{(t+k+1)}) respectively, and it follows from Ibragimov & Linnik (1971), Theorem 17.2.1 that

\alpha_Y(k) \le 4 \alpha_X(k).

Hence (13) follows. The rest of the proof goes as in Ranneby (1975) and (1976). □

Due to the Markov dependence structure and the conditional independence of Y_t, t = 1, ..., T given X_t, t = 1, ..., T, the quantities \sigma_{ij} and c_{ij} can be calculated quite simply. Write

f_j'(y) = \frac{\partial f_j(y; \theta_j)}{\partial \theta_j},

u_j(t) = \frac{\partial \log f(y_t)}{\partial \theta_j} = \pi_j f_j'(y_t) / f(y_t),

and define

m_j(k) = E(u_j(t) | X_t = k) = \pi_j \int f_j'(y) f_k(y) / f(y) \, dy,

a_{ij}(k) = E(u_i(t) u_j(t) | X_t = k) = \pi_i \pi_j \int f_i'(y) f_j'(y) f_k(y) / f^2(y) \, dy.

Conditioning on the sequence X = (X_1, X_2, ...) of regimes we have

\mathrm{Cov}\Big\{ \frac{1}{\sqrt{T}} \sum_{s=1}^{T} u_i(s), \frac{1}{\sqrt{T}} \sum_{t=1}^{T} u_j(t) \Big\} = E_X\Big\{ \frac{1}{T} \sum_{s,t=1}^{T} E(u_i(s) u_j(t) | X) \Big\}.

To evaluate the conditional expectation we can use that Y_s and Y_t are conditionally independent given the sequence X and that the conditional distribution depends only on X_s and X_t. Thus

\sum_{s,t=1}^{T} E(u_i(s) u_j(t) | X) = \sum_{t} E(u_i(t) u_j(t) | X_t) + \sum_{s \ne t} E(u_i(s) u_j(t) | X_s, X_t)

= \sum_{k=1}^{r} \sum_{t: X_t = k} E(u_i(t) u_j(t) | X_t = k) + \sum_{k=1}^{r} \sum_{l=1}^{r} \sum_{s \ne t: X_s = k, X_t = l} E(u_i(s) | X_s = k) E(u_j(t) | X_t = l).

Using

n_k(T) = \#\{t; 0 \le t \le T, X_t = k\}

this can be written

\sum_{k=1}^{r} n_k(T) \{ a_{ij}(k) - m_i(k) m_j(k) \} + \sum_{k,l=1}^{r} n_k(T) n_l(T) m_i(k) m_j(l).

Taking expectations over X and noting that E(n_k(T)) = T\pi_k, we get the following expression for c_{ij}:

c_{ij} = \sum_{k=1}^{r} \pi_k (a_{ij}(k) - m_i(k) m_j(k)) + \lim_{T \to \infty} E\Big\{ \frac{1}{T} \sum_{k,l=1}^{r} n_k(T) m_i(k) n_l(T) m_j(l) \Big\}.

Since

E\Big\{ \frac{1}{T} \sum_{k=1}^{r} n_k(T) m_i(k) \Big\} = \sum_{k=1}^{r} \pi_k m_i(k) = E(u_i(t)) = 0,

the second term can be written

\lim_{T \to \infty} E\Big[ \sqrt{T} \sum_{k=1}^{r} (T^{-1} n_k(T) - \pi_k) m_i(k) \times \sqrt{T} \sum_{l=1}^{r} (T^{-1} n_l(T) - \pi_l) m_j(l) \Big].

Now it is well known that the r-variate variable

\sqrt{T}(T^{-1} n_k(T) - \pi_k), k = 1, ..., r,

is asymptotically normal with covariance matrix B = (\beta_{kl}),

\beta_{kl} = \delta_{kl} \pi_k - \pi_k \pi_l + \sum_{t=2}^{\infty} \{ P(X_1 = k, X_t = l) + P(X_1 = l, X_t = k) - 2\pi_k \pi_l \},   (14)

so we get the simple expression

c_{ij} = \sum_{k=1}^{r} \pi_k (a_{ij}(k) - m_i(k) m_j(k)) + \sum_{k,l=1}^{r} m_i(k) \beta_{kl} m_j(l).

For \sigma_{ij} we get the analogous formula

\sigma_{ij} = \sum_{k=1}^{r} \pi_k a_{ij}(k).

4. Small sample properties of ML-estimates

There are in essence two methods for obtaining the variance of the ML-estimates \theta^*, \pi^*, \Pi^* of \theta, \pi, and \Pi for finite T: inversion of the matrix of observed second order partial derivatives of the loglikelihood function, and simulation of known mixtures.

The asymptotic variance as T \to \infty can be obtained from the matrix of second order partial derivatives of the loglikelihood function at the point \theta^*, \pi^*, \Pi^*; for a recursive method of doing this, see Theorem 3.1. This is the technique used by Hasselblad (1966) for independent mixtures.

However, and this has been shown in several simulation studies with moderately large samples, the actual variances of the estimators are usually greater than those computed from the asymptotic expressions; see Dick & Bowden (1973).

In this section we will present the results of a simple simulation study, the aim of which is to show the effect of the dependence between regimes on the ML-estimates based on the full Markov model and on the MLI-estimates of the marginal parameters, based on the assumption that the regimes are independent. The mixture is composed of two normal distributions with means \mu_1 = 0 and \mu_2 = 10 and standard deviations \sigma_1 = \sigma_2 = \sigma, where \sigma = 2.5 and 5.0. The mixing proportions are held equal, \pi_1 = \pi_2 = 0.5, while the transition matrix

\begin{pmatrix} \pi_{11} & \pi_{12} \\ \pi_{21} & \pi_{22} \end{pmatrix}

is varied from a strong negative dependence with \pi_{11} = \pi_{22} = 0.1 to a strong positive dependence with \pi_{11} = \pi_{22} = 0.9.

The ML-estimates of \theta = (\mu_1, \mu_2, \sigma_1, \sigma_2), \pi = (\pi_1, \pi_2), and \Pi = (\pi_{11}, \pi_{12}, \pi_{21}, \pi_{22}) were obtained via the iterative scheme presented in Sections 3.2 and 3.3 by means of the FORTRAN subroutine MARMIX. The true parameter values were always used as the starting points, and the iterations carried on until no two successive approximations differed by more than \varepsilon = 0.002. In no case were more than 20 iterations permitted.

For comparison, the MLI-estimators of \theta and \pi were computed by a similar iteration scheme, also incorporated in the subroutine MARMIX.

The choice of the true parameters as starting points and the restriction of the number of iterations to 20 will increase the apparent accuracy of the estimators, but it will probably not affect the comparison between the ML- and the MLI-estimators. Further investigations are necessary to show how the choice of starting points affects the estimates.
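A sketch (ours) of one cell of this simulation design, assuming the marmix_normal function from the sketch in Section 3.3 is in scope; it reports the observed RMSE of the ML-estimates of the means, with the true values as starting points as in the study.

```python
import numpy as np

rng = np.random.default_rng(1)

def rmse_cell(p_stay, sigma, T=50, nsim=200):
    """Monte Carlo RMSE of the ML-estimates of (mu_1, mu_2) for one
    combination of pi_11 = pi_22 = p_stay and sigma, as in Section 4."""
    P = np.array([[p_stay, 1.0 - p_stay], [1.0 - p_stay, p_stay]])
    mu_true = np.array([0.0, 10.0])
    err = np.zeros((nsim, 2))
    for s in range(nsim):
        x = np.zeros(T, dtype=int)               # simulate the regime chain
        x[0] = rng.choice(2)
        for t in range(1, T):
            x[t] = rng.choice(2, p=P[x[t - 1]])
        y = rng.normal(mu_true[x], sigma)        # observations given regimes
        mu_hat, *_ = marmix_normal(y, P, np.array([0.5, 0.5]),
                                   mu_true.copy(), np.full(2, sigma))
        err[s] = mu_hat - mu_true
    return np.sqrt((err ** 2).mean(axis=0))      # RMSE per mean
```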


Fig. 1. Sample Root Mean Square Error (RMSE), E((\mu_i^* - \mu_i)^2)^{1/2}, for estimates of the means \mu_i in a symmetric mixture of normal distributions, \mu_1 = 0, \mu_2 = 10; number of observations T = 50; number of simulation replicates NSIM = 200; O--O, MLI-estimator based on assumption of independence; x-x, ML-estimator based on Markov chain assumption; true switching probabilities \pi_{12} = \pi_{21} = 1 - \pi_{11}.

Two sample sizes were considered, one very small with T = 50, and one moderately small with T = 200. The number of simulations was 200 and 100, respectively, for every combination. The results are presented below as the observed RMSE (root mean square error, E((\theta^* - \theta_0)^2)^{1/2}) for the different estimates.

Remark 4.1. The accuracy of the estimates \pi_j^* of the mixing proportions \pi_j needs special care. In fact, \pi_j is only the prior probability that observation t comes from regime j, and it equals the long run proportion of regimes equal to j. In the sample, however, the real proportion is

T^{-1} n_j(T) = p_j, say,

which is a random quantity. The total mean square error can be split according to

E((\pi_j^* - \pi_j)^2) = E((\pi_j^* - p_j)^2) + 2E((\pi_j^* - p_j)(p_j - \pi_j)) + E((p_j - \pi_j)^2),

Fig. 2. Sample RMSE, E((\sigma_i^* - \sigma_i)^2)^{1/2}, i = 1, 2; T = 50, NSIM = 200; O--O, MLI-estimator; x-x, ML-estimator.

Fig. 3. Sample RMSE, E((\pi_i^* - p_i)^2)^{1/2}, i = 1, 2, for estimates \pi_i^* of the sample proportion p_i; T = 50; NSIM = 200; O--O, MLI-estimator; x-x, ML-estimator; solid curve = E((p_i - \pi_i)^2)^{1/2}.

where

E((p_j - \pi_j)^2) \approx T^{-1} \beta_{jj}

from (14). It can be argued that \pi_j^* should be regarded as an estimate of p_j and that only E((\pi_j^* - p_j)^2) has any relevance for the validation of the estimation procedure. In this, the marginal distributions f_j are used to split the observations y_t according to the presumed value of X_t, which will give an estimate of p_j. On the other hand, inference about \pi_j from exact or inexact knowledge of p_j is a question of how much we trust the Markov regime model for the generation of regimes.

It seems as if previous authors have not made this distinction. Both Hasselblad (1966) and Dick & Bowden (1973) present E((\pi_j^* - \pi_j)^2) without comment.

In the Markov regime model the separation of E((\pi_j^* - \pi_j)^2) into its components is of special importance, since E((p_j - \pi_j)^2) is strongly dependent on the transition probabilities \pi_{jk} and increases drastically as the diagonal elements \pi_{jj} tend to one; see Figs. 3 and 7.

Fig. 4. Sample RMSE, E((\pi_{ij}^* - \pi_{ij})^2)^{1/2}, i, j = 1, 2; T = 50, NSIM = 200; x-x, ML-estimator.


Fig. 5. Sample RMSE, E((\mu_i^* - \mu_i)^2)^{1/2}, i = 1, 2; T = 200, NSIM = 100; O--O, MLI-estimator; x-x, ML-estimator.

As is seen from the simulations, it is only for a fairly strong positive or negative dependence that the full ML-estimator of \mu_1, \mu_2 and \sigma_1, \sigma_2 performs better than the simpler MLI-estimator. The addition of an extra dependence parameter obviously causes an overfit to data, at least for such a small sample size as T = 200.

It is also seen, as could be anticipated, that for a negative dependence, where the regimes alternate, the ML-estimates of \mu_1, \mu_2 and \sigma_1, \sigma_2 are almost as good as when the regimes are completely known.

One important fact is that the iterative estimation schemes very seldom converged within 20 iterations for the sample size T = 200 and \sigma_1 = \sigma_2 = 5. For \pi_{11} = \pi_{22} = 0.1 and 0.9 the ML-estimator converged in about 50% of the simulations. All other combinations gave convergence in less than about 10% of the simulations.

When \sigma_1 = \sigma_2 = 2.5 all procedures converged in at least 75%, and the ML-estimator in almost 100% of the simulations when there actually was a dependence.

Separate studies indicated that the convergence properties were considerably improved with further iterations, but that the observed variance of the estimator then also increased. Incidentally, this indicates that in cases of dependence the ML-estimator's advantage over the MLI-estimator is greater than the diagrams suggest.

Fig. 6. Sample RMSE, E((\sigma_i^* - \sigma_i)^2)^{1/2}, i = 1, 2; T = 200, NSIM = 100; O--O, MLI-estimator; x-x, ML-estimator.

Fig. 7. Sample RMSE, E((\pi_i^* - p_i)^2)^{1/2}, i = 1, 2, for estimates \pi_i^* of the sample proportion p_i; T = 200; NSIM = 100; O--O, MLI-estimator; x-x, ML-estimator; solid curve = E((p_i - \pi_i)^2)^{1/2}.

As a whole, the choice of the true parameter values as starting points, in combination with the poor convergence properties, will probably not affect the comparison between the two estimators when both behaved poorly. It never happened that the MLI-estimator converged considerably more often than the ML-estimator, while in those cases where the opposite happened, the ML-estimator was also better than the MLI-estimator.

5. Final remarks

It has been emphasized by several authors that a mixed model should be applied to empirical data only when theoretical considerations give a clear indication of its relevance. On the other hand, once one has such good reasons for a mixed model, not much more is needed in order to apply a model with dependent regimes, which either alternate or form long runs. The assumption of an exact Markov dependence is then not crucial, but is merely used as the simplest tool to transfer information between successive observations. The transition probabilities \pi_{ij} = P(X_{t+1} = j | X_t = i) are of course defined for every stationary regime process.

Fig. 8. Sample RMSE, E((\pi_{ij}^* - \pi_{ij})^2)^{1/2}, i, j = 1, 2; T = 200, NSIM = 100; x-x, ML-estimator.


Acknowledgement

I want to thank the referee for pointing out an error in the original manuscript and for other valuable advice.

References

Baum, L. E. (1972). An inequality and associated maximization technique in statistical estimation for probabilistic functions of Markov chains. Inequalities 3, 1-8.

Baum, L. E. & Petrie, T. (1966). Statistical inference for probabilistic functions of finite state Markov chains. Ann. Math. Statist. 37, 1554-1563.

Baum, L. E., Petrie, T., Soules, G. & Weiss, N. (1970). A maximization technique occurring in the statistical analysis of probabilistic functions of Markov chains. Ann. Math. Statist. 41, 164-171.

Behboodian, J. (1970). On a mixture of normal distributions. Biometrika 57, 215-217.

Behboodian, J. (1975). Structural properties and statistics of finite mixtures. Statistical distributions in scientific work (ed. G. P. Patil et al.). Reidel, Dordrecht.

Bhattacharya, C. G. (1967). A simple method of resolution of a distribution into Gaussian components. Biometrics 23, 115-135.

Charlier, C. V. L. (1906). Researches into the theory of probability. Lunds Universitets Årsskrift, new series, Afd. 2:1, No. 5.

Cohen, C. (1967). Estimation in mixtures of two normal distributions. Technometrics 9, 15-28.

Day, N. E. (1969). Estimating the components of a mixture of normal distributions. Biometrika 56, 463-474.

Dempster, A. P., Laird, N. M. & Rubin, D. B. (1977). Maximum likelihood from incomplete data via the EM algorithm (with Discussion). J. Roy. Statist. Soc. Ser. B 39, 1-38.

Dick, N. P. & Bowden, D. C. (1973). Maximum likelihood estimation for mixtures of two normal distributions. Biometrics 29, 781-790.

Hald, A. (1952). Statistical theory with engineering applica- tions. Wiley, New York.

Hasselblad, V. (1966). Estimation of parameters for a mixture of normal distributions. Technometrics 8, 431-444.

Hasselblad, V. (1969). Estimation of finite mixtures of distributions from the exponential family. J. Amer. Statist. Ass. 64, 1459-1471.

Hosmer, D. W. (1973). A comparison of iterative maximum likelihood estimates of the parameters of a mixture of two normal distributions under three different types of sample. Biometrics 29, 761-770.

Ibragimov, I. A. & Linnik, Yu. V. (1971). Independent and stationary sequences of random variables. Wolters-Noordhoff, Groningen.

Lindgren, G. (1976). Markov regime models for mixed distributions and switching regressions. Techn. Report 1976-11, Inst. of Mathematics and Statistics, University of Umeå.

Orchard, T. & Woodbury, M. A. (1972). A missing information principle: theory and applications. 6th Berkeley Symp. Math. Statist. Prob. 1, 697-715.

Pearson, K. (1894). Contributions to the mathematical theory of evolution. Phil. Trans. A 185, 71-110.

Petrie, T. (1969). Probabilistic functions of finite state Markov chains. Ann. Math. Statist. 40, 97-115.

Ranneby, B. (1975). Robustness against dependence of maximum likelihood estimates in i.i.d. models. Techn. Report 1975-9, Inst. of Mathematics and Statistics, University of Umeå.

Ranneby, B. (1976). Strong consistency of approximative maximum likelihood estimators. Techn. Report 1976-3, Inst. of Mathematics and Statistics, University of Umeå.

Rao, C. R. (1948). The utilization of multiple measurements in problems of biological classification. J. Roy. Statist. Soc. B, 10, 159-203.

Sundberg, R. (1974). Maximum likelihood theory for incomplete data from an exponential family. Scand. J. Statist. 1, 49-58.

Wolfe, J. H. (1970). Pattern clustering by multivariate mixture analysis. Mult. Beh. Res. 5, 329-350.

Georg Lindgren, Department of Statistics, University of Umeå, S-901 87 Umeå, Sweden.
