
Biometrika (1998), 85, 2, pp. 347-361. Printed in Great Britain

Analysis of multivariate probit models

BY SIDDHARTHA CHIB
John M. Olin School of Business, Washington University, One Brookings Drive, St. Louis, Missouri 63130, U.S.A.
[email protected]

AND EDWARD GREENBERG
Department of Economics, Washington University, One Brookings Drive, St. Louis, Missouri 63130, U.S.A.
[email protected]

SUMMARY

This paper provides a practical simulation-based Bayesian and non-Bayesian analysis of correlated binary data using the multivariate probit model. The posterior distribution is simulated by Markov chain Monte Carlo methods and maximum likelihood estimates are obtained by a Monte Carlo version of the EM algorithm. A practical approach for the computation of Bayes factors from the simulation output is also developed. The methods are applied to a dataset with a bivariate binary response, to a four-year longitudinal dataset from the Six Cities study of the health effects of air pollution and to a seven-variate binary response dataset on the labour supply of married women from the Panel Survey of Income Dynamics.

Some key words: Bayes factor; Correlated binary data; Gibbs sampling; Marginal likelihood; Markov chain Monte Carlo; Metropolis-Hastings algorithm.

1. INTRODUCTION

Correlated binary data arise in settings ranging from multivariate measurements on a random cross-section of subjects to repeated measurements on a sample of subjects across time. A central issue in the analysis of such data is model formulation. One strategy, outlined by Carey, Zeger & Diggle (1993) and Glonek & McCullagh (1995), relies on the generalisation of the binary logistic model to multivariate outcomes in conjunction with a particular parameterised representation for the correlations. Another strategy, discussed by Ashford & Sowden (1970) and Amemiya (1972), generalises the binary probit model. The resulting multivariate probit model is described in terms of a correlated Gaussian distribution for underlying latent variables that are manifested as discrete variables through a threshold specification. Despite this connection to the Gaussian distribution, which allows for flexible modelling of the correlation structure and straightforward interpretation of the parameters, the model is not commonly used, mainly because its likelihood function is difficult to evaluate except under simplifying assumptions (Ochi & Prentice, 1984). Thus, few applications of the model have appeared, and much of the potential of the model has not been realised.

The purpose of this paper is to provide a unified simulation-based inference methodology for overcoming the problems in fitting multivariate probit models. We discuss various aspects of the inference problem, including simulation of the posterior distribution, calculation of maximum likelihood estimates and the computation of Bayes factors from the simulation output. The approach makes extensive use of recent developments both in Markov chain Monte Carlo methods (Gelfand & Smith, 1990; Smith & Roberts, 1993; Tierney, 1994; Chib & Greenberg, 1995) and in the Bayesian analysis of binary and polytomous data (Albert & Chib, 1993). Two important technical advances are made. First, the paper provides an approach for sampling the posterior distribution of the correlation matrix. The same approach can be used in other problems with a restricted covariance matrix. Secondly, we extend Chib's (1995) marginal likelihood estimation procedure to a problem where some of the full conditional densities in the Markov chain Monte Carlo simulation do not have known normalising constants.

The paper proceeds as follows. In § 2 we summarise the model and in § 3 we consider the sampling of the posterior distribution and the computation of the marginal likelihood. The computation of maximum likelihood estimates is discussed in § 4. These estimates are obtained by utilising a Monte Carlo version of the EM algorithm (Wei & Tanner, 1990; Meng & Rubin, 1993). The E-step in this approach is implemented by Monte Carlo, while the M-step is conducted in two sub-steps; latent data are re-simulated after the first conditional maximisation. Section 5 presents three real data applications, and § 6 contains a brief discussion.

2. THE MULTIVARIATE PROBIT MODEL

Let Y_ij denote a binary 0/1 response on the ith observation unit and jth variable, and let Y_i = (Y_i1, ..., Y_iJ)' (1 ≤ i ≤ n) denote the collection of responses on all J variables. According to the multivariate probit model, the probability that Y_i = y_i, conditioned on parameters β, Σ and a set of covariates x_ij, is given by

$$\mathrm{pr}(Y_i = y_i \mid \beta, \Sigma) = \int_{A_{iJ}} \cdots \int_{A_{i1}} \phi_J(t \mid 0, \Sigma)\, dt, \qquad (1)$$

where φ_J(t | 0, Σ) is the density of a J-variate normal distribution with mean vector 0 and correlation matrix Σ = {σ_jk}, A_ij is the interval

$$A_{ij} = \begin{cases} (-\infty,\, x_{ij}'\beta_j] & \text{if } y_{ij} = 1,\\ (x_{ij}'\beta_j,\, \infty) & \text{if } y_{ij} = 0, \end{cases}$$

β_j ∈ R^{k_j} is an unknown parameter vector and β' = (β_1', ..., β_J') ∈ R^k, k = Σ_j k_j. We denote the p = J(J − 1)/2 free parameters of Σ by σ = (σ_12, σ_13, ..., σ_{J−1,J}).

It is important to note that Σ must be in correlation form for identifiability reasons. Suppose that (γ, Ω) is an alternative parameterisation, where γ is the regression parameter vector and Ω is the covariance matrix. Then it is easy to show that pr(y_i | γ, Ω) = pr(y_i | β, Σ), where β_j = ω_jj^{-1/2} γ_j, Σ = CΩC and C = diag{ω_11^{-1/2}, ..., ω_JJ^{-1/2}}. A parameterisation in terms of covariances is therefore not likelihood identified.

For our purposes a more convenient formulation of the multivariate probit model is in terms of Gaussian latent variables. Let Z_i = (z_i1, ..., z_iJ) denote a J-variate normal vector with distribution Z_i ~ N_J(X_iβ, Σ), where X_i = diag(x_i1', ..., x_iJ') is a J × k covariate matrix, and let Y_ij be 1 or 0 according to the sign of z_ij:

$$y_{ij} = I(z_{ij} > 0) \qquad (j = 1, \ldots, J), \qquad (2)$$


where I(A) is the indicator function of the event A. In this formulation the probability in (1) may be expressed as

$$\mathrm{pr}(Y_i = y_i \mid \beta, \Sigma) = \int_{B_{iJ}} \cdots \int_{B_{i1}} \phi_J(z \mid X_i\beta, \Sigma)\, dz, \qquad (3)$$

where B_ij is the interval (0, ∞) if y_ij = 1 and the interval (−∞, 0] if y_ij = 0. We let B_i = B_i1 × B_i2 × ⋯ × B_iJ and note that B_i is the set-valued inverse of the mapping in (2). It is important to bear in mind that B_i, unlike the A_ij, depends only on the value of y_i and not on the parameters. This latent variable representation and the inverse mapping from y_ij to Z_i form the basis of our posterior sampling method.
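As an illustration of this latent-variable construction (not part of the paper; the dimensions, covariate layout and parameter values below are invented for the example), the following Python sketch simulates correlated binary responses by drawing Z_i ~ N_J(X_iβ, Σ) and thresholding each component at zero.

```python
import numpy as np

def simulate_mvp(X, beta, Sigma, rng):
    """Simulate y_i = I(z_i > 0) with z_i ~ N_J(X_i beta, Sigma).

    X is an (n, J, k) array stacking the covariate matrices X_i = diag(x_i1', ..., x_iJ'),
    beta a length-k vector, Sigma a J x J correlation matrix."""
    n, J, _ = X.shape
    mean = X @ beta                                  # (n, J) array of X_i beta
    L = np.linalg.cholesky(Sigma)
    z = mean + rng.standard_normal((n, J)) @ L.T     # rows are N_J(X_i beta, Sigma)
    return (z > 0).astype(int)

# toy example: J = 3 margins, each with its own intercept and one slope (k_j = 2, k = 6)
rng = np.random.default_rng(0)
n, J = 200, 3
X = np.zeros((n, J, 6))
for j in range(J):
    X[:, j, 2 * j] = 1.0
    X[:, j, 2 * j + 1] = rng.standard_normal(n)
beta = np.array([0.5, 1.0, -0.3, 0.8, 0.2, -1.0])
Sigma = np.array([[1.0, 0.4, 0.2], [0.4, 1.0, 0.5], [0.2, 0.5, 1.0]])
y = simulate_mvp(X, beta, Sigma, rng)
```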

3. POSTERIOR ANALYSIS

3.1. Introduction

Given a random sample of nJ observations y = (y_1, ..., y_n) and a prior density π(β, σ) on the parameters of a multivariate probit model, the posterior density is

$$\pi(\beta, \sigma \mid y) \propto \pi(\beta, \sigma)\, \mathrm{pr}(y \mid \beta, \Sigma), \qquad \beta \in \mathbb{R}^k,\ \sigma \in C,$$

where pr(y | β, Σ) = Π_i pr(y_i | β, Σ) is the likelihood function, C is a convex solid body in the hypercube [−1, 1]^p that leads to a proper correlation matrix (see Rousseeuw & Molenberghs (1994) for more on the shape of correlation matrices), and pr(y_i | β, Σ) is the integral of the multivariate normal density given in (3). This form of the posterior density is not particularly useful for Bayesian estimation, because the evaluation of the likelihood function is computationally intensive.

In this paper we invoke an alternative procedure that is based on the framework developed by Albert & Chib (1993). The idea is to focus on the joint posterior distribution of the parameters and the latent data π(β, σ, Z_1, ..., Z_n | y). From Bayes' theorem,

$$\pi(\beta, \sigma, Z \mid y) \propto \pi(\beta, \sigma)\, f(Z \mid \beta, \Sigma)\, \mathrm{pr}(y \mid Z, \beta, \Sigma),$$

where we have let Z = (Z_1, ..., Z_n). Use the mapping in equation (2) to note that pr(y_i | Z_i, β, Σ) = I(Z_i ∈ B_i). Hence,

$$\pi(\beta, \sigma, Z \mid y) \propto \pi(\beta, \sigma) \prod_{i=1}^{n} f(Z_i \mid \beta, \Sigma)\, I(Z_i \in B_i), \qquad (4)$$

where

$$f(Z_i \mid \beta, \Sigma) = \phi_J(Z_i \mid X_i\beta, \Sigma),$$

independently across observations. Thus, the effect of y_i appears only through B_i, and likelihood evaluation is not required. The problem is now much simplified, and we can take a sampling-based approach, in conjunction with Markov chain Monte Carlo methods, to summarise (4). The Markov chain sampling scheme can be constructed from the distributions [Z_i | y_i, β, Σ] (i ≤ n), [β | y, Z, Σ] and [σ | y, Z, β]. As we show below, each of these distributions can be sampled either directly or by Markov chain methods.

In passing we mention that no other Markov chain sampling scheme appears viable for this model. For a number of reasons, we do not think that it is a good idea to sample the non-identified parameters, γ and Ω, as in McCulloch & Rossi (1994) in a different context. First, this approach requires that a prior be formulated on unidentified parameters. This prior cannot be too weak, because that would lead to a flat posterior and convergence problems, or too strong, because then it would determine the posterior. Secondly, this approach does not work when Σ is patterned or restricted, as in some of the examples below. Finally, it is difficult to compute Bayes factors using this approach. With weak priors, problems similar to those described by the Lindley paradox can arise, especially if the model dimensions are quite different.

3.2. Posterior simulations

We begin with the sampling of the latent data Z_i from [Z_i | y_i, β, Σ] (i ≤ n), given values of (β, Σ). From (4) it follows that

$$f(Z_i \mid y_i, \beta, \Sigma) \propto \phi_J(Z_i \mid X_i\beta, \Sigma) \prod_{j=1}^{J} \{ I(z_{ij} > 0)\, I(y_{ij} = 1) + I(z_{ij} \le 0)\, I(y_{ij} = 0) \}, \qquad (5)$$

where we have written out the set B_i in full. This is a multivariate normal density truncated to the region specified by B_i. For example, if J = 2 and y_i = (1, 1)', then the normal distribution is truncated to the positive orthant. To sample this distribution one can use the method of Geweke (1991) to compose a cycle of J Gibbs steps through the components of Z_i. In the jth step of this cycle, z_ij is simulated from z_ij | {y_ij, z_ik (k ≠ j), β, Σ}, which is a univariate normal distribution truncated to the region B_ij. The parameters of the untruncated normal distribution z_ij | {z_ik (k ≠ j), β, Σ} are obtained from the usual formulae, and the truncated version is simulated by the inverse distribution function method (Devroye, 1985, p. 38).
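A minimal sketch of this truncated-normal Gibbs cycle, assuming the usual conditional-normal formulae and the inverse-cdf method; the function and argument names are ours, not the authors'.

```python
import numpy as np
from scipy.stats import norm

def draw_latent(z, mean, Sigma, y, rng):
    """One Gibbs cycle through the components of Z_i: each z_ij is drawn from its
    univariate normal full conditional, truncated to (0, inf) if y_ij = 1 and
    to (-inf, 0] if y_ij = 0, via the inverse-cdf method."""
    J = len(z)
    for j in range(J):
        idx = [k for k in range(J) if k != j]
        S12 = Sigma[j, idx]
        S22_inv = np.linalg.inv(Sigma[np.ix_(idx, idx)])
        cond_mean = mean[j] + S12 @ S22_inv @ (z[idx] - mean[idx])
        cond_sd = np.sqrt(Sigma[j, j] - S12 @ S22_inv @ S12)
        lo, hi = (0.0, np.inf) if y[j] == 1 else (-np.inf, 0.0)
        a = norm.cdf((lo - cond_mean) / cond_sd)
        b = norm.cdf((hi - cond_mean) / cond_sd)
        z[j] = cond_mean + cond_sd * norm.ppf(rng.uniform(a, b))
    return z
```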

To describe the sampling of β, we assume prior independence between β and σ and let π(β) = φ_k(β | β_0, B_0^{-1}), where the location is controlled by the vector β_0 and its strength by the precision matrix B_0. By combining terms of β in (4) we obtain the standard linear models result

$$\beta \mid (Z, \Sigma) \sim N_k(\hat{\beta}, B^{-1}), \qquad (6)$$

where $\hat{\beta} = B^{-1}\bigl(B_0\beta_0 + \sum_{i=1}^{n} X_i'\Sigma^{-1}Z_i\bigr)$ and $B = B_0 + \sum_{i=1}^{n} X_i'\Sigma^{-1}X_i$. Thus, the simulation of β is straightforward.
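The draw in (6) is a standard multivariate normal simulation; a sketch under these formulas (helper names are ours) might look as follows.

```python
import numpy as np

def draw_beta(Z, X, Sigma, beta0, B0, rng):
    """Draw beta from N_k(beta_hat, B^{-1}) with
    beta_hat = B^{-1}(B0 beta0 + sum_i X_i' Sigma^{-1} Z_i) and
    B = B0 + sum_i X_i' Sigma^{-1} X_i."""
    Sinv = np.linalg.inv(Sigma)
    B = B0.copy()
    b = B0 @ beta0
    for Xi, Zi in zip(X, Z):            # Xi is (J, k), Zi is (J,)
        B += Xi.T @ Sinv @ Xi
        b += Xi.T @ Sinv @ Zi
    cov = np.linalg.inv(B)
    beta_hat = cov @ b
    return rng.multivariate_normal(beta_hat, cov)
```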

Finally, we consider the sampling of σ from π(σ | Z, β) given the prior

$$\pi(\sigma) \propto \phi_p(\sigma \mid \sigma_0, G_0^{-1}), \qquad \sigma \in C,$$

a normal distribution truncated to the region C. From (4), π(σ | Z, β) ∝ π(σ) f(Z | β, Σ), where

$$f(Z \mid \beta, \Sigma) = |\Sigma|^{-n/2} \exp\{-\tfrac{1}{2}\,\mathrm{tr}\,(Z^* - A)'\Sigma^{-1}(Z^* - A)\}\, I(\sigma \in C),$$

and Z* = (Z_1, Z_2, ..., Z_n) and A = (X_1β, ..., X_nβ) are J × n matrices. This distribution is nonstandard. In sampling this density we must pay attention to the facts that (Z, β) are refreshed across iterations, that suitable bounds and dominating functions are difficult to obtain, and that σ must lie in C. We deal with these problems by adopting the Metropolis-Hastings algorithm (Hastings, 1970; Chib & Greenberg, 1995). Let q(σ' | σ, Z, β) denote a density that supplies candidate values σ' given the current value σ. The choice of this density is explained below. Our algorithm is then formally described by the following two steps.


ALGORITHM

1. Sample a proposal value σ' given σ from the density q(σ' | σ, Z, β).
2. Move to σ' with probability

$$\alpha(\sigma, \sigma') = \min\left\{ \frac{\pi(\sigma')\, f(Z \mid \beta, \Sigma')}{\pi(\sigma)\, f(Z \mid \beta, \Sigma)}\, \frac{q(\sigma \mid \sigma', Z, \beta)}{q(\sigma' \mid \sigma, Z, \beta)},\ 1 \right\};$$

stay at σ with probability 1 − α(σ, σ').

We make two general remarks. First, the proposal density q is not truncated to C, because that constraint is part of the target density. Thus, if Σ' is not positive definite, the conditional posterior is zero, and the proposal value is rejected with certainty. By specifying q in this way we are able to simplify both the choice of q and the determination of the proposal density of the reverse move q(σ | σ', Z, β). Secondly, if the dimension of Σ is large, as in our third example in § 5 below, it is best to partition σ into blocks and to apply the Metropolis-Hastings algorithm in sequence through the various blocks.

We now discuss the choice of q. As in all Metropolis samplers, the general objectives are to traverse the parameter space and produce output that mixes well; see Chib & Greenberg (1995) for further details.

One simple way to generate proposal values is through the random walk chain σ' = σ + h, where σ' is the candidate value, σ is the current value, and h is a zero-mean increment vector. If h is assumed to follow a symmetric distribution such as the normal, the ratio of the q(·)'s is one, and the probability of move is determined by the ratio of the density ordinates. To generate suitable moves the variance of the increment may be set to a multiple of either 1/n, which is the large-sample variance of the marginal posterior of the correlation coefficient, or λ, the smallest characteristic root of Σ; see Marsaglia & Olkin (1984) for an explanation. This proposal density is usually effective for small p.
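A sketch of this random-walk chain for σ, assuming the unrestricted correlation structure and the normal prior φ_p(σ | σ_0, G_0^{-1}) described above; positive definiteness is checked on the proposal, so values outside C are rejected with certainty. The function names and the eigenvalue check are our choices, not the paper's code.

```python
import numpy as np
from scipy.stats import multivariate_normal

def corr_from_sigma(sig, J):
    """Assemble the J x J correlation matrix from the p = J(J-1)/2 free
    parameters, filled in row-wise order (sigma_12, sigma_13, ..., sigma_{J-1,J})."""
    S = np.eye(J)
    iu = np.triu_indices(J, 1)
    S[iu] = sig
    S[(iu[1], iu[0])] = sig
    return S

def log_target(sig, Z, means, J, sig0, G0cov):
    """log of pi(sigma) f(Z | beta, Sigma) I(sigma in C), up to an additive constant.
    Z and means are (n, J) arrays holding Z_i and X_i beta in their rows."""
    S = corr_from_sigma(sig, J)
    if np.any(np.linalg.eigvalsh(S) <= 0):          # sigma outside C: zero posterior mass
        return -np.inf
    diff = sig - sig0
    log_prior = -0.5 * diff @ np.linalg.solve(G0cov, diff)
    log_lik = multivariate_normal(cov=S).logpdf(Z - means).sum()
    return log_prior + log_lik

def rw_metropolis_step(sig, step_sd, Z, means, J, sig0, G0cov, rng):
    """One random-walk Metropolis step: the normal increment is symmetric, so the
    move probability reduces to the ratio of target ordinates."""
    prop = sig + step_sd * rng.standard_normal(len(sig))
    log_ratio = (log_target(prop, Z, means, J, sig0, G0cov)
                 - log_target(sig, Z, means, J, sig0, G0cov))
    return prop if np.log(rng.uniform()) < log_ratio else sig
```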

A more general procedure is based on a proposal density that is tailored to the unnormalised target density g(σ | Z, β) = π(σ) f(Z | β, Σ) I(σ ∈ C). The form of this proposal generating process is given by

$$\sigma' = \mu^* + h, \qquad \mu^* = \mu + P(\sigma - \mu), \qquad (7)$$

where μ is a vector, P is a p × p diagonal matrix and h is a multivariate-t vector with mean zero, dispersion τ²V and ν degrees of freedom. Of the five tuning parameters (μ, V, P, τ, ν) in this proposal density, two (μ, V) depend on the current values of β and Z and automatically adjust as the iterations proceed. The resulting proposal density is extremely flexible and leads to competitive proposal values. The parameter μ is taken to be the approximate mode of the function log{g(σ | Z, β)}, and V is taken to be the inverse of the negative Hessian matrix of log{g(σ | Z, β)} at the mode. These quantities are obtained from two or three Newton-Raphson steps, initialised at the mode from the previous round. Next, P is either equal to zero, leading to what may be called the tailored independence chain, or equal to minus the identity matrix of order p, giving the tailored reflection chain. Reflection is a useful device for generating large moves, from the current point to the other side of the mode where the ordinate is roughly similar, that have a high probability of being accepted. Finally, τ² is adjusted by experimentation, and ν is specified arbitrarily at 10, or some similar value.

One cycle of the resulting Markov chain Monte Carlo algorithm is completed by simulating all the distributions in fixed or random order. A full sample from the posterior distribution is generated by repeating this process a large number of times. All posterior inferences are based on the sample furnished by the Markov chain procedure.

3.3. Computation of marginal likelihood

We now consider the question of comparing alternative multivariate probit models. Typically, competing models arise from restrictions on the covariate or correlation structure. One example of this is the restriction that β_j = β across the J responses. Another is that Σ is in the equi-correlated form (1 − ρ)I_J + ρ 1_J 1_J', where |ρ| < 1 (Ochi & Prentice, 1984). In a panel data context, when the index j represents time, Σ may be specified to reflect the assumption of serially correlated errors or the assumption of 1-dependence. Finally, a benchmark restriction is that the responses are independent and Σ is diagonal (Kiefer, 1982).

These alternative models may be compared in a Bayesian context by Bayes factors, or ratios of model marginal likelihoods (Kass & Raftery, 1995). The marginal likelihood of model M_k (k = 1, 2, ..., K) is defined as

$$m(y \mid \mathcal{M}_k) = \int \mathrm{pr}(y \mid \mathcal{M}_k, \beta, \Sigma)\, \pi(\beta, \sigma \mid \mathcal{M}_k)\, d\beta\, d\sigma$$

and is a function of M_k. This function is sometimes estimated by the Laplace method or importance sampling. In the present case, however, these methods are not attractive, because they involve the repeated evaluation of the likelihood function pr(y | M_k, β, Σ). For this reason, we focus on an alternative method recently developed in Chib (1995). The Chib method relies on an identity in (β, Σ). This identity, which is obtained by rewriting the expression for the posterior density, states that

$$m(y \mid \mathcal{M}_k) = \frac{\mathrm{pr}(y \mid \mathcal{M}_k, \beta, \Sigma)\, \pi(\beta, \sigma \mid \mathcal{M}_k)}{\pi(\beta, \sigma \mid \mathcal{M}_k, y)},$$

where all integrating constants on the right-hand side are included. Since the left-hand side is free of the parameters, the marginal likelihood may be calculated by evaluating the terms on the right-hand side at any single point (β*, σ*). In practice, a high density point such as the posterior mean is used. Thus, on the log scale,

$$\log m(y \mid \mathcal{M}_k) = \log \mathrm{pr}(y \mid \mathcal{M}_k, \beta^*, \Sigma^*) + \log \pi(\beta^*, \sigma^* \mid \mathcal{M}_k) - \log \pi(\beta^* \mid \mathcal{M}_k, y, \sigma^*) - \log \pi(\sigma^* \mid \mathcal{M}_k, y), \qquad (8)$$

where we have expressed the posterior ordinate by a conditional/marginal decomposition. We now suppress the dependence on M_k and discuss the estimation of the terms in (8) that are not available through direct computation, starting with the prior ordinate

$$\pi(\sigma^*) = \frac{\phi_p(\sigma^* \mid \sigma_0, G_0^{-1})}{\mathrm{pr}(\sigma \in C)}, \qquad \sigma^* \in C.$$

The normalising constant of this density can be found by simulation: simply generate a large number of observations from φ_p(σ | σ_0, G_0^{-1}) and find the proportion that satisfy the positive definiteness constraint. This proportion is the Monte Carlo estimate of pr(σ ∈ C).
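A sketch of this Monte Carlo estimate of pr(σ ∈ C), assuming the unrestricted correlation structure; the function name is illustrative.

```python
import numpy as np

def prob_sigma_in_C(sig0, G0cov, J, n_draws, rng):
    """Estimate pr(sigma in C) as the proportion of draws from the untruncated
    normal prior N(sig0, G0cov) that yield a positive definite correlation matrix."""
    draws = rng.multivariate_normal(sig0, G0cov, size=n_draws)
    iu = np.triu_indices(J, 1)
    count = 0
    for sig in draws:
        S = np.eye(J)
        S[iu] = sig
        S[(iu[1], iu[0])] = sig
        if np.all(np.linalg.eigvalsh(S) > 0):
            count += 1
    return count / n_draws
```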

Next consider the estimation of the conditional posterior ordinate

$$\pi(\beta^* \mid y, \Sigma^*) = \int \pi(\beta^* \mid y, \Sigma^*, Z)\, p(Z \mid y, \Sigma^*)\, dZ,$$

where π(β* | y, Z, Σ*) is the multivariate normal density in (6) with β = β* and Σ = Σ*. A key point is that this integral can be estimated very accurately by drawing a large sample of Z values from the density p(Z | y, Σ*). As in Chib (1995), these Z variates are produced from a reduced Markov chain Monte Carlo run consisting of the distributions

$$[Z_1 \mid y_1, \beta, \Sigma^*] \times \cdots \times [Z_n \mid y_n, \beta, \Sigma^*], \qquad [\beta \mid Z_1, \ldots, Z_n, \Sigma^*],$$

which are the distributions discussed above but with Σ fixed at Σ*. Given a sample of G values on Z from this run, an estimate of π(β* | y, Σ*) is available as

$$\hat{\pi}(\beta^* \mid y, \Sigma^*) = G^{-1} \sum_{g=1}^{G} \phi_k(\beta^* \mid \hat{\beta}^{(g)}, B^{-1}), \qquad (9)$$

where $\hat{\beta}^{(g)} = B^{-1}\bigl(B_0\beta_0 + \sum_{i=1}^{n} X_i'\Sigma^{*-1}Z_i^{(g)}\bigr)$ and $B = B_0 + \sum_{i=1}^{n} X_i'\Sigma^{*-1}X_i$.

For the marginal ordinate π(σ* | y) = ∫ π(σ* | Z, β) p(Z, β | y) dβ dZ a different approach is

necessary, because the normalising constant of π(σ* | Z, β) is not known. Let K(x) denote a univariate kernel density, taken to be φ(x | 0, 1) in the examples, and s_j the standard deviation of {σ_j^{(g)}}. Then an estimate of π(σ* | y) is

$$\hat{\pi}(\sigma^* \mid y) = G^{-1} \sum_{g=1}^{G} \prod_{j=1}^{p} b_j^{-1} K\{ b_j^{-1}(\sigma_j^* - \sigma_j^{(g)}) \}, \qquad (10)$$

where the bandwidth b_j is s_j G^{-1/(p+4)} (Scott, 1992, p. 150). The sample {σ^{(g)}} can be thinned before smoothing is done.
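Equation (10) translates directly into a few lines; the sketch below assumes a Gaussian kernel and Scott's bandwidth rule, as in the paper's examples.

```python
import numpy as np
from scipy.stats import norm

def kernel_ordinate(draws, sigma_star):
    """Estimate pi(sigma* | y) as in (10): a product-Gaussian-kernel density
    estimate at sigma*, averaged over the G sampled vectors sigma^(g).

    draws is a (G, p) array of posterior draws of sigma."""
    G, p = draws.shape
    s = draws.std(axis=0, ddof=1)
    b = s * G ** (-1.0 / (p + 4))              # Scott's bandwidth rule
    u = (draws - sigma_star) / b               # standardised distances, (G, p)
    return np.mean(np.prod(norm.pdf(u) / b, axis=1))
```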

The kernel approach may have to be modified somewhat if σ is high-dimensional. Then, one should partition σ into several low-dimensional blocks and apply the reduced Markov chain Monte Carlo procedure to each of the blocks. This overcomes the well-known difficulties of kernel smoothing in high dimensions at the cost of additional computations. Note that in standard applications of kernel smoothing the sample is fixed and this refinement is not possible. We illustrate the idea with two blocks, σ = (σ_1, σ_2), where σ_t contains p_t components. The identity

$$\pi(\sigma^* \mid y) = \pi_1(\sigma_1^* \mid y)\, \pi_2(\sigma_2^* \mid y, \sigma_1^*)$$

permits us to estimate each of the two terms by kernel smoothing after samples are generated from the appropriate distributions. For the first term, the values of σ_1^{(g)} (g = 1, ..., G) are available from the full Markov chain Monte Carlo sampler. These values are smoothed by the kernel method as described above. To estimate the next term we run a reduced Markov chain Monte Carlo sampler consisting of

$$[Z_i \mid y_i, \beta, \sigma_1^*, \sigma_2]\ (i \le n), \qquad [\beta \mid y, Z, \sigma_1^*, \sigma_2], \qquad [\sigma_2 \mid y, Z, \beta, \sigma_1^*].$$

The values {σ_2^{(g)}} generated from this sampler are from [σ_2 | y, σ_1^*] and can be smoothed to estimate the ordinate π_2(σ_2^* | y, σ_1^*).

We mention that the numerical standard error of the resulting posterior density estimate log π̂(β*, σ* | y) can be estimated precisely as explained in Chib (1995), despite the fact that kernel smoothing is used to estimate π̂(σ* | y). The numerical standard error is based on the variance of $G^{-1}\sum_{g=1}^{G} h^{(g)}$, where h^{(g)} is a vector whose components are the density ordinates φ_k(β* | β̂^{(g)}, B^{-1}) from (9) and $\prod_{j=1}^{p} b_j^{-1} K\{b_j^{-1}(\sigma_j^* - \sigma_j^{(g)})\}$ from (10). The vector h^{(g)} is available as a by-product of the marginal likelihood calculation. Further details may be found in Chib (1995).

Finally, we consider the computation of log pr(y | β*, Σ*) = Σ_{i=1}^{n} log pr(y_i | β*, Σ*), where pr(y_i | β*, Σ*) is given in (3). One way to compute this integral is by the approach called the Geweke-Hajivassiliou-Keane method in the econometrics literature. See Greene (1997, p. 196) for further details. This method is based on writing Σ = LL', where L is the lower triangular Choleski factorisation, and making a change of variable from Z_i to e_i, where Z_i = X_iβ + L e_i. Then

$$\mathrm{pr}(y_i \mid \beta^*, \Sigma^*) = \int_{a_{i1}}^{b_{i1}} \cdots \int_{a_{iJ}}^{b_{iJ}} \phi_J(t \mid 0, I)\, dt, \qquad (11)$$

where the limits of the jth integral are

$$a_{ij} = l_{jj}^{-1}\Bigl(c_{ij} - x_{ij}'\beta_j - \sum_{k=1}^{j-1} l_{jk} t_k\Bigr), \qquad b_{ij} = l_{jj}^{-1}\Bigl(d_{ij} - x_{ij}'\beta_j - \sum_{k=1}^{j-1} l_{jk} t_k\Bigr),$$

and (c_ij, d_ij) denotes the lower and upper endpoints of B_ij (j ≤ J). This integral can now be calculated by recursive Monte Carlo simulations that are repeated a large number of times. In the examples we use 10 000 iterations.
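A sketch of this Geweke-Hajivassiliou-Keane recursion for (11), with the truncation region (0, ∞) or (−∞, 0] taken from y_ij; the function signature is illustrative.

```python
import numpy as np
from scipy.stats import norm

def ghk_prob(y, Xi, beta, Sigma, n_rep, rng):
    """GHK estimate of pr(y_i | beta, Sigma) in (3)/(11). Xi is the J x k
    covariate matrix X_i; the truncation region for z_ij is (0, inf) if
    y_ij = 1 and (-inf, 0] if y_ij = 0."""
    y = np.asarray(y)
    mu = Xi @ beta
    L = np.linalg.cholesky(Sigma)                 # Sigma = L L'
    lower = np.where(y == 1, 0.0, -np.inf)
    upper = np.where(y == 1, np.inf, 0.0)
    J = len(y)
    probs = np.ones(n_rep)
    e = np.zeros((n_rep, J))
    for j in range(J):
        shift = mu[j] + e[:, :j] @ L[j, :j]       # mean of z_ij given earlier draws
        a = norm.cdf((lower[j] - shift) / L[j, j])
        b = norm.cdf((upper[j] - shift) / L[j, j])
        probs *= b - a                             # probability of the j-th constraint
        u = rng.uniform(size=n_rep)
        e[:, j] = norm.ppf(a + u * (b - a))        # truncated draw feeds the recursion
    return probs.mean()
```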

4. MODAL ESTIMATION

A by-product of the simulation of the latent data is an approach that yields the maximum likelihood estimates for the multivariate probit model or the maximiser of the posterior without computation of the likelihood function. This can be done by an adaptation of the Monte Carlo EM algorithm (MCEM) that was proposed by Wei & Tanner (1990), which in turn is a stochastic modification of the original Dempster, Laird & Rubin (1977) EM algorithm.

Let θ = (β, σ) and consider the expectation or E-step of the EM algorithm, given the current value of the maximiser θ^(t):

$$Q(\theta;\, \theta^{(t)}) = \int_Z \log\{f(y, Z \mid \theta)\}\, d[Z \mid y, \theta^{(t)}] = \int_Z \log\{f(Z \mid \theta)\}\, d[Z \mid y, \theta^{(t)}],$$

where the integral is with respect to the truncated normal distribution of Z and the second equality is a consequence of (4). This calculation is analytically intractable. In the MCEM algorithm, the Q function is estimated by

$$\hat{Q}(\theta;\, \theta^{(t)}) = N^{-1} \sum_{g=1}^{N} \log\{f(Z^{(g)} \mid \theta)\},$$

where Z^{(g)} (g = 1, ..., N) are draws from (5). In the M-step of the algorithm, the Q function is maximised over θ to obtain the new parameter θ^{(t+1)}. The MCEM algorithm is terminated once the difference ‖θ^{(t+1)} − θ^{(t)}‖ is negligible.

We now follow Meng & Rubin (1993) and complete the M-step through a sequence of two conditional maximisations, the maximisation over β given Σ and the maximisation over Σ given β. This simplifies the update of θ and parallels the blocking adopted in the Bayesian simulation presented above; a similar conditional maximisation step is adopted in a different context in unpublished work by R. Natarajan, C. E. McCulloch and N. M. Kiefer, 'Maximum likelihood for the multinomial probit model'. Specifically, if we set the derivative with respect to β equal to zero, the update of β is given by

$$\beta^{(t+1)} = \Bigl(\sum_{i=1}^{n} X_i'\Sigma^{-1}X_i\Bigr)^{-1} \Bigl(\sum_{i=1}^{n} X_i'\Sigma^{-1}\bar{Z}_i\Bigr),$$

where Z̄_i = N^{-1} Σ_g Z_i^{(g)} is the average of Z_i over the N draws. The update of σ is obtained by replacing β by β^{(t+1)} in the Q function and maximising over σ with a Newton-Raphson type routine. Although not necessary, it is preferable for efficiency considerations to re-draw the Z values from the distribution [Z | y, β^{(t+1)}, Σ] and re-compute the Q function before the second maximisation is attempted. The update for σ is thus obtained by maximising the function

$$N^{-1} \sum_{g=1}^{N} \sum_{i=1}^{n} \log f(Z_i^{(g)} \mid \beta^{(t+1)}, \Sigma),$$

where Z_i^{(g)} are the newly drawn latent values.

As suggested by Wei & Tanner (1990), these iterations are started with a small value

of N that is increased as the maximiser is approached. Given the modal value θ̂, the standard errors of the estimate are obtained by the formula of Louis (1982). Specifically, the observed information matrix is given by

$$-E\left\{\frac{\partial^2 \log f(Z \mid \theta)}{\partial\theta\, \partial\theta'}\right\} - \mathrm{var}\left\{\frac{\partial \log f(Z \mid \theta)}{\partial\theta}\right\},$$

where the expectation and variance are with respect to the distribution Z | y, θ̂. Each of these terms is estimated by taking an additional M draws {Z^(1), ..., Z^(M)} from Z | y, θ̂. Standard errors are equal to the square roots of the diagonal elements of the inverse of the estimated information matrix.
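A minimal sketch of the β conditional-maximisation step derived above, taking as input the averaged latent draws Z̄_i; the helper name is ours, and the σ update (a small Newton-Raphson search) is not shown.

```python
import numpy as np

def mcem_beta_update(Zbar, X, Sigma):
    """Conditional-maximisation update for beta given the averaged latent draws:
    beta_new = (sum_i X_i' Sigma^{-1} X_i)^{-1} (sum_i X_i' Sigma^{-1} Zbar_i)."""
    Sinv = np.linalg.inv(Sigma)
    A = sum(Xi.T @ Sinv @ Xi for Xi in X)
    b = sum(Xi.T @ Sinv @ Zi for Xi, Zi in zip(X, Zbar))
    return np.linalg.solve(A, b)
```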

5. APPLICATIONS

5.1. Bivariate probit for voter behaviour

Our first application is to survey data of the voting behaviour of 95 residents of Troy, Michigan, in which the first decision, Y_i1, is whether or not to send at least one child to public school and the second, Y_i2, is whether or not to vote in favour of a school budget. The objective of the study is to model the two binary responses as a function of covariates, allowing for correlation in the responses. As in Greene (1997, p. 906), the covariates in x_i1 are a constant, the natural logarithm of annual household income in dollars INC and the natural logarithm of property taxes paid per year in dollars TAX; those in x_i2 are a constant, INC, TAX and the number of years YRS the resident has been living in Troy.

We fit two models to these data. The first, denoted by M_1, is the bivariate probit model in which the marginal probabilities for the ith subject are given by

$$\mathrm{pr}(Y_{ij} = 1 \mid \beta) = \Phi(x_{ij}'\beta_j) \qquad (j = 1, 2),$$

and the joint probabilities are given by the distribution function of the bivariate normal with correlation matrix

$$\Sigma = \begin{pmatrix} 1 & \sigma_{12} \\ \sigma_{12} & 1 \end{pmatrix}.$$

Model M_1 thus contains seven unknown regression parameters and one unknown correlation parameter. The second model, denoted by M_2, is an independent probit model in which σ_12 = 0.

The modal maximum likelihood estimates in this instance can be computed by directly maximising the likelihood function and the MCEM algorithm is not necessary. These estimates are reported in Table 1. For model M_1, the posterior distributions of the parameters are obtained by applying the Markov chain Monte Carlo algorithm described in § 3 for 6000 cycles beyond 500 burn-in iterations. The prior distribution of β is multivariate normal with a mean vector of 0 and a variance matrix of 100 times the identity matrix, and that of σ_12 is proportional to a univariate normal with a mean of 0 and variance of 0.5. In terms of the notation used above, these settings imply that β_0 = 0, B_0 = 10^{-2} I_7, σ_0 = 0 and G_0^{-1} = 0.5.

Table 1. Voting data: maximum likelihood and Bayes estimates. The Bayes estimates are reported along with the mean, the numerical standard error (NSE), the standard deviation (SD), the median (Med.) and the lower 2.5th and upper 97.5th percentiles

                           Prior               Posterior
Parameter        MLE      Mean    SD      Mean     NSE     SD      Med.      Lower     Upper
β_11 (const)   -4.764      0      10     -4.189    0.076   3.670   -4.193   -11.395    2.918
β_12 (INC)      0.1149     0      10      0.069    0.011   0.444    0.081    -0.820    0.911
β_13 (TAX)      0.6699     0      10      0.654    0.014   0.563    0.658    -0.472    1.775
β_21 (const)   -0.3066     0      10     -0.474    0.081   3.787   -0.426    -7.878    6.923
β_22 (INC)      0.9895     0      10      1.057    0.011   0.438    1.042     0.244    1.953
β_23 (TAX)     -1.308      0      10     -1.380    0.014   0.584   -1.349    -2.599   -0.313
β_24 (YRS)     -0.0176     0      10     -0.017    0.000   0.014   -0.017    -0.045    0.011
σ_12            0.317      0      0.707   0.258    0.009   0.178    0.264    -0.103    0.589

We use the random walk proposal density in the Metropolis-Hastings step and let the random increment be univariate normal with standard deviation equal to 4√(1/n). This results in an acceptance rate of about 0.5. Our results for the posterior distribution are summarised in Table 1; the results for the independent probit obtained via the algorithm of Albert & Chib (1993) are similar. Table 1 reports the maximum likelihood estimates, the prior moments, the posterior means and standard deviations, the numerical standard errors computed by the method of batch means and the 2.5th and 97.5th percentiles of the posterior distribution.

The posterior distribution of σ_12 is spread out, and its 95% credibility interval includes 0, which is evidence for the independent probit model. To assess the evidence more formally, we calculate the marginal likelihoods of M_1 and M_2, evaluating all the quantities in (8) at the posterior mean. Note that both the likelihood function and the normalising constant of the truncated normal prior of σ_12 are available without simulation. To obtain the conditional ordinate of β at β* we use a reduced run of 6000 iterations. The results, along with the posterior probabilities of the respective models under the assumption that the prior odds equal one, are reported in Table 2. The data thus do not provide support for the correlated probit model over the independent probit model.

Table 2. Voting data: marginal likelihood, posterior probabilities and Bayes factors for alternative models

Model                          log m(y | M_l)    pr(M_l | y)    Bayes factor
M_1: correlated probit            -126.31           0.495          0.980
M_2: independent probit           -126.30           0.505          1.020


5.2. Six Cities study

The second example is based on a subset of data from the Six Cities study, a longitudinal study of the health effects of air pollution, which has been analysed by Fitzmaurice & Laird (1993) and Glonek & McCullagh (1995) with a multivariate logit model. The data, reproduced in Table 3, contain repeated binary measures of the wheezing status (1 = yes, 0 = no) for each of 537 children from Steubenville, Ohio, at ages 7, 8, 9 and 10 years. The objective of the study is to model the probability of wheeze status over time as a function of a binary indicator variable representing the mother's smoking habit during the first year of the study and the age of the child.

Table 3. Six Cities dataset: child's wheeze frequency

     Age of child              Frequency
  7    8    9   10    No maternal smoking   Maternal smoking
  0    0    0    0           237                  118
  0    0    0    1            10                    6
  0    0    1    0            15                    8
  0    0    1    1             4                    2
  0    1    0    0            16                   11
  0    1    0    1             2                    1
  0    1    1    0             7                    6
  0    1    1    1             3                    4
  1    0    0    0            24                    7
  1    0    0    1             3                    3
  1    0    1    0             3                    3
  1    0    1    1             2                    1
  1    1    0    0             6                    4
  1    1    0    1             2                    2
  1    1    1    0             5                    4
  1    1    1    1            11                    7

Interpreting age as category j, we fit three models to these data: the full multivariate probit model, M_1, the equi-correlated model, M_2, and the independent probit model, M_3. In each model the marginal probability of response is specified as

$$\mathrm{pr}(Y_{ij} = 1 \mid \beta, \Sigma) = \Phi(\beta_0 + \beta_1 x_{i1} + \beta_2 x_{i2} + \beta_3 x_{i3}),$$

where x_i1 is the age of the child, centred at 9 years, x_i2 is a binary indicator representing the mother's smoking habit (1 = yes, 0 = no), and x_i3 is an interaction between smoking habit and age. Note that, in contrast to the previous example, the regression parameter is constrained to be constant across j. In addition to the four regression parameters in each model, there are six unknown correlation parameters in M_1, one unknown correlation parameter in M_2 and no unknown correlation parameter in M_3.

For all three models the prior for β is represented by the hyperparameters β_0 = 0 and B_0 = 10^{-1} I_4, and that for σ in M_1 and M_2 by σ_0 = 0, G_0^{-1} = 0.5 I_6 and σ_0 = 0, G_0^{-1} = 0.5, respectively, where, here and below, we use the same symbols for the hyperparameters across the different models. Posterior sampling of the correlation parameters of model M_1 is by the tailored independence method applied in one block to all six unknown parameters of Σ. The parameters μ and V are obtained by the Newton-Raphson method. The parameter τ is set equal to 1.5. The same approach is used to sample the single parameter of Σ

in model M_2 with τ = 40. The sampler was run for G = 10 000 cycles beyond a transient stage of 500 iterations. The Metropolis-Hastings acceptance rate was about 35% for the full model and about 40% for the equi-correlated model.

To compute the modal estimates for M_1 and M_2, the MCEM algorithm described above is tuned as follows: for the first ten updates of θ, the Q function is estimated from N = 10 samples of the latent data; for the final ten iterations N = 200. The algorithm was stopped at iteration 40 when convergence was achieved for each parameter up to at least the first two decimal places.

Results of the simulation are summarised in Table 4. First, note that the maximum likelihood and Bayes estimates of β are very insensitive to the specification of the covariance structure. Secondly, the maximum likelihood values differ slightly from the posterior means, an indication of some asymmetry in the posterior distributions. Thirdly, the estimated standard errors, SE, of the maximum likelihood estimates are generally smaller than the corresponding posterior standard deviations. Fourthly, there is little support for M_3, because the posterior distribution of Σ in both M_1 and M_2 is concentrated away from zero. The marginal likelihoods of the respective models support this conclusion. Some details of the implementation as they relate to M_1 and M_2 are as follows. A simulation of 10 000 draws from the untruncated normal prior on σ is used to estimate the normalising constant of the prior, as discussed in § 3. The likelihood contribution is computed using (11). The conditional posterior ordinate of β at β*, the posterior mean, is estimated from a reduced Markov chain Monte Carlo run of 10 000 iterations, while the marginal posterior density ordinate of σ at σ*, the posterior mean, is estimated in one block by kernel smoothing. Interestingly, the marginal posterior ordinate at σ*, in the case of M_1, changed only slightly when it was estimated as π(σ_1*, σ_2* | y) = π(σ_1* | y) π(σ_2* | y, σ_1*) with two blocks of size three. Thus, in this case, kernel smoothing is accurate for a six-dimensional density. We conjecture that this is because a large sample on σ is being used to estimate a single high density ordinate.

The results show that the marginal likelihood of the independence model M_3 is the smallest of the three models and that of M_2 the largest. The log marginal likelihood increased by about 0.30 when the prior on σ was specified with G_0^{-1} = 1. This is evidence that the results are not unduly sensitive to the prior specification. We computed standard errors for the marginal likelihood computation; these are 0.13 for M_1 and 0.04 for M_2 on the log scale and are negligible.

Table 4. Six Cities data: posterior results for M_1, the unrestricted multivariate probit model, M_2, the model with an equicorrelated correlation structure, and M_3, the independence model

                        M_1                            M_2                         M_3
Parameter     MLE (SE)         Mean    SD      MLE (SE)         Mean    SD      Mean    SD
β_0          -1.118 (0.065)  -1.127  0.061    -1.120 (0.043)  -1.121  0.062   -1.126  0.047
β_1          -0.079 (0.033)  -0.079  0.032    -0.079 (0.021)  -0.078  0.031   -0.076  0.037
β_2           0.152 (0.102)   0.160  0.099     0.172 (0.072)   0.160  0.099    0.168  0.076
β_3           0.039 (0.052)   0.040  0.053     0.041 (0.034)   0.038  0.049    0.035  0.060
σ_12          0.584 (0.068)   0.557  0.068     0.602 (0.025)   0.584  0.054      -      -
σ_13          0.521 (0.076)   0.497  0.073        -               -      -       -      -
σ_14          0.586 (0.095)   0.541  0.075        -               -      -       -      -
σ_23          0.688 (0.051)   0.656  0.058        -               -      -       -      -
σ_24          0.562 (0.077)   0.513  0.073        -               -      -       -      -
σ_34          0.631 (0.077)   0.601  0.065        -               -      -       -      -

log m(y | M_1) = -825.03    log m(y | M_2) = -816.80    log m(y | M_3) = -931.2

(For M_2 the single equicorrelation parameter is shown in the σ_12 row.)


The Bayes factors are as follows: B_12 = 2.76 × 10^{-4}, B_13 = 1.24 × 10^{46} and B_23 = 4.63 × 10^{49}. Unless the prior probabilities for the various models put virtually zero weight on M_1 and M_2, this is decisive evidence against independence in favour of either alternative and decisive evidence in favour of the equi-correlated model.

5.3. Labour force participation

The final illustration is a model of the labour force participation decision of married women in the age range 35-62. The data from the Panel Survey of Income Dynamics of the University of Michigan consist of a sample of 520 households over the seven-year span 1976-1982. As in Avery, Hansen & Hotz (1983), who analysed similar data by the method of moments, the covariates are (i) a constant, (ii) wife's education in number of grades completed and (iii) total family income excluding wife's earnings, in thousands of dollars.

We consider two multivariate probit models for this dataset. In M_1 the correlation matrix is fully unrestricted with 21 unknown parameters. In M_2 the correlation matrix is in equi-correlated form. In both models we let β_j be constant across j and represent our prior distribution through the hyperparameters β_0 = 0 and B_0 = 10^{-1} I_3. In the unrestricted model the prior on σ is represented by σ_0 = 0, a 21-vector of zeros, and G_0^{-1} = 0.5 I_21; for the restricted model it is represented by σ_0 = 0, a scalar, and G_0^{-1} = 0.5. The Markov chain Monte Carlo simulation algorithm is run for 10 000 cycles beyond a transient phase of 500 iterations. As a result of the large dimension of σ in model M_1, the Metropolis-Hastings step is applied to σ = (σ_1, σ_2, σ_3, σ_4) in four blocks, where σ_1, σ_2 and σ_3 each consist of six elements and σ_4 of three elements in a row-wise expansion of Σ. Thus, for example, σ_1 = (σ_12, σ_13, σ_14, σ_15, σ_16, σ_17) and σ_4 = (σ_56, σ_57, σ_67). Proposal values for the correlation parameters in the four Metropolis-Hastings steps, within each cycle, are generated by the tailored independence chain. The value of τ for the first three blocks of σ is 1.5 and that for the fourth block is 2.0. In the case of M_2, proposal values are also generated by the tailored independence chain, but with τ = 8.

For the marginal likelihood calculation in M_1, the posterior density π(σ* | y) at σ*, the posterior mean, is estimated from

$$\pi(\sigma_1^* \mid y)\, \pi(\sigma_2^* \mid y, \sigma_1^*)\, \pi(\sigma_3^* \mid y, \sigma_1^*, \sigma_2^*)\, \pi(\sigma_4^* \mid y, \sigma_1^*, \sigma_2^*, \sigma_3^*),$$

where each of the conditional ordinates is estimated by kernel smoothing of the simulations from 10 000 values of σ_t generated from π(σ_t | y, σ_1*, ..., σ_{t−1}*) in a reduced Markov chain Monte Carlo run. By breaking up σ in this manner we ensure that kernel smoothing remains accurate. Finally, we estimate the normalising constant of the prior density of σ at σ* from 10 000 draws, and the likelihood contribution using (11). The calculation of m(y | M_2) is similar, except that π(σ* | y) is estimated directly in one pass by kernel smoothing since only one parameter in σ is involved.

The results from the simulation show that the posterior distributions of β from the two models are virtually identical. The posterior means and standard deviations of β_0, β_1 and β_2 in model M_1 are found to be -0.620 (0.234), 0.090 (0.018) and -0.003 (0.001); the modal estimates are similar and are not reported. The marginal posterior distributions of β are fairly symmetric and concentrated about their means, and autocorrelations of the sampled draws drop off very quickly, being less than 0.2 by the tenth lag for all three β's. A boxplot summary of the marginal posterior distributions of the elements of Σ is presented in Fig. 1 using every tenth draw from the simulation. The correlations are all quite large


and precisely estimated, and most decline with an increase in the time lag. For M_2 the correlation parameter is estimated to be 0.739 with a posterior standard deviation of 0.057. From this evidence it would appear that the equi-correlated correlation structure is not appropriate for these data. This is confirmed from the marginal likelihood calculation, which yields log m(y | M_1) = -1561.66 and log m(y | M_2) = -1600.45. The evidence in favour of the unrestricted model is thus overwhelming, and, as above, the standard errors are negligible.

Fig. 1. Panel Survey of Income Dynamics data: marginal posterior boxplots for elements of Σ in model M_1. The columns correspond to a row-wise expansion of Σ, that is, column 1 refers to σ_12, column 2 to σ_13, etc.

6. DISCUSSION

Several interesting conclusions emerge from the empirical examples. First, the posterior means of β seem to be extremely robust to the specification of Σ. Secondly, although not reported above, we computed the loglikelihood contribution log pr(y_i | β*, Σ*) by a simple frequency approach, counting the number of times that draws from N(X_iβ*, Σ*) respect the constraints imposed by y_i, and obtained results almost identical to those in the paper. Thirdly, we achieved extremely high accuracy in computing the marginal likelihood by Chib's method, as evidenced by the very low numerical standard errors. Fourthly, in both the second and third examples, the Bayes factor decisively favoured one of the correlation specifications: an extremely large prior odds ratio would have been necessary to offset the empirical results.

ACKNOWLEDGEMENT

We acknowledge the helpful comments on the paper by the editor, associate editor and the referees.


REFERENCES

ALBERT, J. & CHIB, S. (1993). Bayesian analysis of binary and polychotomous response data. J. Am. Statist. Assoc. 88, 669-79.
AMEMIYA, T. (1972). Bivariate probit analysis: minimum chi-square methods. J. Am. Statist. Assoc. 69, 940-4.
ASHFORD, J. R. & SOWDEN, R. R. (1970). Multivariate probit analysis. Biometrics 26, 535-46.
AVERY, R. B., HANSEN, L. P. & HOTZ, V. J. (1983). Multiperiod probit models and orthogonality condition estimation. Int. Econ. Rev. 24, 21-35.
CAREY, V., ZEGER, S. L. & DIGGLE, P. (1993). Modelling multivariate binary data with alternating logistic regressions. Biometrika 80, 517-26.
CHIB, S. (1995). Marginal likelihood from the Gibbs output. J. Am. Statist. Assoc. 90, 1313-21.
CHIB, S. & GREENBERG, E. (1995). Understanding the Metropolis-Hastings algorithm. Am. Statistician 49, 327-35.
DEMPSTER, A. P., LAIRD, N. M. & RUBIN, D. B. (1977). Maximum likelihood from incomplete data via the EM algorithm (with Discussion). J. R. Statist. Soc. B 39, 1-38.
DEVROYE, L. (1985). Non-Uniform Random Variate Generation. New York: Springer Verlag.
FITZMAURICE, G. M. & LAIRD, N. M. (1993). A likelihood-based method for analysing longitudinal binary responses. Biometrika 80, 141-51.
GELFAND, A. E. & SMITH, A. F. M. (1990). Sampling-based approaches to calculating marginal densities. J. Am. Statist. Assoc. 85, 398-409.
GEWEKE, J. (1991). Efficient simulation from the multivariate normal and Student-t distributions subject to linear constraints. In Computing Science and Statistics: Proceedings of the 23rd Symposium on the Interface, Ed. E. Keramidas and S. Kaufman, pp. 571-8. Fairfax Station, VA: Interface Foundation of North America.
GLONEK, G. F. V. & MCCULLAGH, P. (1995). Multivariate logistic models. J. R. Statist. Soc. B 57, 533-46.
GREENE, W. (1997). Econometric Analysis, 3rd ed. Upper Saddle River, NJ: Prentice Hall.
HASTINGS, W. K. (1970). Monte Carlo sampling methods using Markov chains and their applications. Biometrika 57, 97-109.
KASS, R. E. & RAFTERY, A. E. (1995). Bayes factors. J. Am. Statist. Assoc. 90, 773-95.
KIEFER, N. M. (1982). Testing for dependence in multivariate probit models. Biometrika 69, 161-6.
LOUIS, T. A. (1982). Finding the observed information matrix when using the EM algorithm. J. R. Statist. Soc. B 44, 226-33.
MARSAGLIA, G. & OLKIN, I. (1984). Generating correlation matrices. SIAM J. Sci. Statist. Comp. 5, 470-5.
MCCULLOCH, R. E. & ROSSI, P. E. (1994). Exact likelihood analysis of the multinomial probit model. J. Economet. 64, 207-40.
MENG, X. & RUBIN, D. B. (1993). Maximum likelihood estimation via the ECM algorithm: a general framework. Biometrika 80, 267-78.
OCHI, Y. & PRENTICE, R. L. (1984). Likelihood inference in a correlated probit regression model. Biometrika 71, 531-43.
ROUSSEEUW, P. & MOLENBERGHS, G. (1994). The shape of correlation matrices. Am. Statistician 48, 276-9.
SCOTT, D. W. (1992). Multivariate Density Estimation. New York: John Wiley.
SMITH, A. F. M. & ROBERTS, G. O. (1993). Bayesian computation via the Gibbs sampler and related Markov chain Monte Carlo methods. J. R. Statist. Soc. B 55, 3-21.
TIERNEY, L. (1994). Markov chains for exploring posterior distributions (with Discussion). Ann. Statist. 22, 1701-62.
WEI, G. C. G. & TANNER, M. A. (1990). A Monte Carlo implementation of the EM algorithm and the poor man's data augmentation algorithms. J. Am. Statist. Assoc. 85, 699-704.

[Received September 1995. Revised May 1997]
