
BAYESIAN METHODS FOR ESTIMATING

THE SIZE OF A CLOSED POPULATION

by

Jeremy C. York, David Madigan

TECHNICAL REPORT No. 234

July 1992

Department of Statistics, GN-22

University of Washington

Seattle, Washington 98195 USA

Bayesian Methods for Estimating the Size of a Closed Population

Jeremy C. York, David Madigan

University of Washington *

July 9, 1992

Abstract

A Bayesian methodology for estimating the size of a closed population from multiple incomplete administrative lists is proposed. The approach allows for a variety of dependence structures between the lists, inclusion of covariates, and explicitly accounts for model uncertainty. Interval estimates from this approach are compared to frequentist and previously published Bayesian approaches, and found to be superior. Several examples are considered.

KEYWORDS: Bayesian graphical model; Capture-recapture; Closed population estimation; Chordal graph; Contingency table; Decomposable log-linear model; Markov distribution.

Contents

1 Introduction

2 Prior Distributions
  2.1 Hyper-Dirichlet priors
  2.2 Other models

3 Posterior
  3.1 Calculation and estimation
  3.2 Handling covariates
  3.3 Other models

4 Examples
  4.1 Prevalence of spina bifida in upstate New York
  4.2 Spina bifida analysis : independence versus model averaging
  4.3 Spina bifida with covariate
  4.4 Patients receiving methicillin
  4.5 Children in Massachusetts with a birth defect

5 Assessing Coverage

6 Discussion and Extensions

Bibliography

A Determining the Form of a Hyper-Dirichlet Distribution
  A.1 Construction of a hyper-Dirichlet
  A.2 Jeffreys versus hyper-Dirichlet distributions

B Is the Posterior of N Proper?


1 Introduction

One approach to estimating the size of a closed population is to use several methods to "capture" individuals in the population. Although these might be actual physical captures, here we will consider a capture to be the occurrence of an individual's name in an administrative list. If it is possible to uniquely identify the individuals or capture histories, then the data can be represented as a contingency table with one dimension for each capture method. The count for each cell gives the number of individuals with a particular capture history. Of course, the count for the cell in which individuals were not found in any of the lists is unknown; the goal of the analysis is to estimate the number of individuals in the population. It is assumed that the population is closed - there is no migration.
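As a minimal sketch of this data layout, the following Python tabulates capture histories from several lists. The function name `capture_table` and the identifiers in `records` are hypothetical and chosen only for illustration; the cell for individuals found in no list is stored as `None` because its count is unknown rather than zero.

```python
from collections import Counter
from itertools import product


def capture_table(lists):
    """Count individuals by capture history across several lists.

    `lists` maps a list name to the set of individual identifiers it contains.
    Returns a dict keyed by 0/1 tuples (one slot per list, in the given order).
    """
    names = list(lists)
    everyone = set().union(*lists.values())
    counts = Counter(
        tuple(int(person in lists[name]) for name in names) for person in everyone
    )
    # Observed cells with nobody in them get an explicit zero.
    table = {h: counts.get(h, 0) for h in product((0, 1), repeat=len(names))}
    # The all-zero cell (found in no list) is unknown, not zero.
    table[(0,) * len(names)] = None
    return table


# Hypothetical identifiers, for illustration only.
records = {
    "B": {"a", "b", "c", "d"},
    "D": {"b", "c", "e"},
    "R": {"c", "f"},
}
table = capture_table(records)
```

Here `table[(1, 0, 0)]` counts the individuals found only in the first list, and the total over the non-`None` cells is the number of distinct individuals observed.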

For example, consider the estimation of the rate at which the birth defect spina bifida occurs. Hook et al (1980) gathered records on persons born in upstate New York between 1969 and 1974 with this defect from birth certificates (B), death certificates (D), and medical rehabilitation records (R). These different records were compared, and each individual with the defect was classified as to whether or not they were found in each list. The data for those individuals born in 1969 is given in table 1, where a value of 0 indicates that an individual was not found in that list, and a 1 indicates that he or she was found. 136 individuals were found in the three record systems considered; the question is, how many more individuals were missed by all three?

                R = 0    R = 1
B = 0, D = 0
B = 0, D = 1
B = 1, D = 0
B = 1, D = 1

Table 1: Spina Bifida data 1969

This framework is known as a multiple records system (MRS). Although seemingly quite similar to experiments where animals are physically captured or (for a comparison, see El-Khorazaty et much the vast literature on capture-recapture

directly relevant to

for individuals that have been captured previously and another for those who have not yet been captured (Pollock and Otto, 1983, Otis et al, 1978). To use these methods on MRS data, one would have to impose an unnatural ordering on the record systems; even then, the kinds of dependencies that could be modeled are probably not complex enough to capture what is going on in the data. Another issue is that there is often covariate information available in MRS data. More sophisticated capture-recapture models where capture probabilities vary across individuals and over time, such as in Chao et al (1992), still do not allow for explicit dependencies between the capture methods, and allow no easy way to use covariates in adjusting for heterogeneity in the population. Also, much of the more complex capture-recapture work assumes that the number of individuals to be captured by each method is fixed in advance, yielding a hyper-geometric likelihood. Darroch (1958) has shown that the estimates for the population size are the same in the simplest models whether or not the sample sizes are fixed in advance. This result and the practical concerns in designing capture-recapture experiments have meant that random sample size approaches have received less attention, even though the equivalence does not hold in more complex models. For these reasons, capture-recapture approaches will not be as useful for analyzing MRS data.

One source of confusion has been that if there are only two captures or record systems, there are really only two possible models: dependence or independence. Capture-recapture methods are certainly flexible enough to allow this sort of dependence, regardless of whether there is really any time-ordering of the captures; thus, they can be directly applied to the dual record systems problem, as in Wolter (1986). In systems of three or more separate lists, however, things are not so straightforward; it is this situation that we consider here.

Early approaches to this problem assumed independence between lists in order to apply capture-recapture techniques; see Wittes (1970, 1974) and Wittes, Colton and Sidel (1974). More recently, Smith (1990) has applied a Bayesian capture-recapture technique to MRS data, again assuming independence between the lists. However, Wittes (1974) has shown that ignoring dependence between lists in a frequentist approach can introduce sizable bias into estimates of the population size; there is no reason to believe that this would not be true for a Bayesian approach as well. Dependence due to heterogeneity can be removed by stratification using covariates; if this is not effective, it has been suggested that dependent lists should be aggregated (Wittes, 1974; Smith, 1990). Stratification will not be fool-proof, however, and can not ... dependence between ... if that ...

are no published

seems to

any frequentist approach, all inference is conditional upon the model chosen. Regal and Hook (1991) have demonstrated that, in this context, the selection of a model conditional upon the data has an adverse effect on the coverage of the resulting confidence intervals for the population size. This point will be illustrated in greater detail in a later section; similar complaints about model selection with standard capture-recapture models can be found in Menkens and Anderson (1988).

The difficulty is that only sampling error is being considered in construction of these frequentist estimates, while potential error in selecting a model is an important aspect of uncertainty in predicting the population size. Hodges (1987) and Draper et al (1987) have argued that a Bayesian approach, in which the uncertainty about the correct model is explicitly included, allows one to make inferences and predictions that reflect one's actual uncertainty. In this approach, a prior is placed over a set of possible models, and over the parameters of each model; an unconditional posterior is formed by averaging the posteriors for each of the models. Applications of these ideas can be found in Moulton (1991), Raftery (1988b), Madigan and York (1992), York (1992), York and Madigan (1992), and York et al (1992). Madigan and Raftery (1991) take a slightly different approach when there are a great many models, in that they first identify a parsimonious class of models with high posterior probability and then average across only that class of models.

In order to compensate for model uncertainty by averaging over models, we present a Bayesian alternative to log-linear models. This approach makes use of a class of priors introduced by Dawid and Lauritzen (1989), the hyper-Dirichlet distributions. These priors can be constructed to yield the same dependence structures that one would find in decomposable log-linear models, they are conjugate with multinomial sampling, and their parameters are easy to interpret (as opposed to placing a distribution on the parameters of a log-linear model, as done by West, 1985 and Raftery, 1988b). Other applications of hyper-Dirichlet priors are presented in Madigan and York (1992).

In the remaining sections of this paper, we develop and assess the approach outlined above. Section 2 discusses the priors for the population size and for the probabilities of individuals being captured by the different lists; section 3 describes the calculations necessary to compute posterior distributions of the quantities of interest. Several examples, ranging in complexity from the spina bifida data with three lists to a data set with four lists and a covariate with several levels, are considered in section 4. The analysis for the spina bifida data is ... Coverage properties of various ...

III sec:tlc,nIS to

2 Prior Distributions

assumptions about the administrative lists. The prior $P(M)$ should be based on expert opinion; however, in the applications here, we assume $M$ is uniformly distributed over $\mathcal{M}$. The cell probabilities for each model $m \in \mathcal{M}$ are parameterized by some vector $\theta(m)$; priors for these parameters are described below.

We write $N$ for the total population size; we will assume a priori that $N$ is independent of the model $M$ and $\theta(M)$. A typical choice for the prior of $N$ when no information about it is available is the Jeffreys prior,

$$P(N) \propto \frac{1}{N}. \qquad (1)$$

When used in the Bayesian estimation of the Binomial $N$ parameter, this prior for $N$, in combination with certain priors for the success probability, can yield an improper posterior; see Kahn (1987). This is not an issue in the current context: even with Jeffreys priors on the probabilities, the posterior and its expectation will be well defined. See appendix B for more details.

An alternative is a non-informative prior for the integers proposed by Rissanen (1983),

$$P(N) \propto 2^{-\log^*(N)}, \qquad (2)$$

where $\log^*(N)$ is the sum of the positive terms in the sequence $\{\log_2(N), \log_2(\log_2(N)), \ldots\}$. This prior is proper and will yield well-behaved posteriors.
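The iterated logarithm in this prior can be computed directly. The sketch below (the function names `log_star` and `rissanen_weight` are ours, not from the paper) returns the unnormalized weight for a positive integer; the normalizing constant is omitted.

```python
import math


def log_star(n):
    """Sum of the positive terms of log2(n), log2(log2(n)), ..."""
    total = 0.0
    term = math.log2(n)
    while term > 0:
        total += term
        term = math.log2(term)  # term > 0 here, so the logarithm is defined
    return total


def rissanen_weight(n):
    """Unnormalized prior weight 2 ** (-log*(n)) for a positive integer n."""
    return 2.0 ** (-log_star(n))
```

For example, `log_star(16)` adds the terms 4, 2 and 1, so the weight decays only slightly faster than $1/N$.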

If information about $N$ is available, one alternative is to take $N$ to be Poisson with some mean $\lambda$; if knowledge of the mean is imprecise, $\lambda$ can be given a gamma distribution with mean determined from prior information and some prespecified variance; this is suggested by Smith (1991) in the capture-recapture context, and is similar to an approach taken by Raftery (1988a). If $\lambda$ is Gamma(a, b), it is straightforward to derive that $N$ has a negative binomial distribution,

(3)

number of successes and then being the number
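The Poisson-gamma mixture can be checked numerically. The sketch below assumes that b is a rate parameter, so that the gamma mean is a/b; the paper's own parameterization is not recoverable from the text here, and under a scale parameterization b would be replaced by 1/b. The function name `log_negbin_pmf` is ours.

```python
import math


def log_negbin_pmf(n, a, b):
    """log P(N = n) when N | lam ~ Poisson(lam) and lam ~ Gamma(shape a, rate b).

    Integrating lam out gives
    Gamma(n + a) / (n! Gamma(a)) * (b / (1 + b))**a * (1 / (1 + b))**n.
    """
    return (
        math.lgamma(n + a) - math.lgamma(a) - math.lgamma(n + 1)
        + a * math.log(b / (1.0 + b))
        - n * math.log(1.0 + b)
    )


# Illustrative values only: the prior mean of N is a / b = 8.
a, b = 4.0, 0.5
pmf = [math.exp(log_negbin_pmf(n, a, b)) for n in range(501)]
prior_mean = sum(n * p for n, p in enumerate(pmf))
```

The marginal mean equals the gamma mean a/b, while the marginal variance is larger than the Poisson variance, which is how imprecise knowledge of the mean is expressed.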

Figure 1: A simple graphical model

2.1 Hyper-Dirichlet priors

Before describing these priors, we must first introduce some notation. If there are 3 different variables (representing presence or absence in three different lists), we will denote particular cell probabilities and data counts by $\theta_{ijk}$ and $D_{ijk}$, where $i$ indexes the first variable, etc. The indices can take on values of 0, indicating absence from a list; 1, indicating presence in a list; or +, indicating that the variable has been marginalized out. When discussing general properties of the cell probabilities and their parameters, we will often drop the reference to a specific model $m$ from the notation.

We consider models for the variables for which the dependence relationships can be represented by an undirected, chordal graph² (see Lauritzen et al, 1984); graphical modeling techniques are nicely summarized in Whittaker (1990). Each variable is represented as a node in the graph, and links between variables represent explicit dependencies between them. If $S$, $C_1$, and $C_2$ represent sets of variables, and if $S$ separates $C_1$ from $C_2$ in the graph, then $C_1$ is independent of $C_2$ given $S$. Thus, for example, Figure 1 gives a graphical representation of the model where X and Z are conditionally independent given Y.

A log-linear model is hierarchical if the presence of an interaction between a set of variables implies that all interactions between subsets of those variables are present as well. This suggests that a simple way to specify the interactions in a log-linear model (Bishop et al, 1975) is to write the variables involved in each of the highest order interactions inside brackets. Thus the model [XZ] [YZ] contains ... of interactions between X, Y and Z, as well as a main effect for each. Graphs can also be used to represent such ...

Figure 2: The maximal model for 3 variables

A hierarchical log-linear model is decomposable iff it can be represented by a chordal graph; see Darroch et al (1980) and Kiiveri et al (1984). In order to form a prior for the cell probabilities of such a model, we would like to write a distribution for $\theta$ such that realizations of $\theta$ exhibit the independence relationships of the model almost surely. As part of a very general discussion of this problem, Dawid and Lauritzen (1989) introduce such a class of priors, which they call "hyper-Dirichlet" distributions.

The standard Dirichlet distribution has been widely used as a prior for multinomial sampling problems (see, for example, Good 1967, 1975, 1976, 1983). It is conjugate with the multinomial distribution, and its parameters are easily interpretable: for each possible outcome, there is a parameter that indicates that our prior knowledge is equivalent to having observed some number of hypothetical instances of that outcome; see Lindley (1964). A hyper-Dirichlet distribution is formed for a particular chordal graph by placing Dirichlet distributions on the probabilities for each clique of the graph, such that for any intersection between cliques, the parameters and actual realizations of Dirichlets are consistent between the cliques.

For example, if the variableswe a on

and

from either Dirichlet, the parameters of the two distributions must be constrained such that $\theta_{+1+}$ has the same distribution under both. This condition is satisfied if

$$\alpha_{+j\cdot} = \alpha_{\cdot j+}, \quad j = 0, 1.$$

Additionally, the actual realizations of the two Dirichlets must be constrained so that

with probability 1. Joint probabilities for X, Y, and Z conditional on $\theta$ are given by

$$\Pr(X = i, Y = j, Z = k \mid \theta) = \theta_{ijk} = \theta_{ij+} \left( \frac{\theta_{+jk}}{\theta_{+j+}} \right).$$

In general, one can construct a hyper-Dirichlet (see appendix A for details) by moving through the graph according to a "perfect ordering" of the cliques and placing a Dirichlet on the $\theta_C$ for each clique $C$ in turn, subject to the constraint that realizations and prior parameters must agree with what has been specified for previous cliques, as above. One of the important results of Dawid and Lauritzen (1989) is that a joint distribution for the elements of $\theta$ constructed in this way exists, is unique, and is conjugate with multinomial sampling.
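For the three-variable chain in which X and Z are conditionally independent given Y, one way to draw a realization is sketched below. It is not the construction of appendix A. It relies on a standard property of the Dirichlet: the shared Y margin has a beta distribution, and the within-clique conditionals are independent betas. The function name and the array layout are our own choices.

```python
import random


def sample_hyper_dirichlet_chain(alpha_xy, alpha_yz, rng):
    """Draw theta[i][j][k] for the chain X - Y - Z.

    alpha_xy[i][j] and alpha_yz[j][k] are the Dirichlet parameters of the two
    cliques; they must agree on the Y margin.
    """
    for j in (0, 1):
        y_from_xy = sum(alpha_xy[i][j] for i in (0, 1))
        assert abs(y_from_xy - sum(alpha_yz[j])) < 1e-9, "cliques disagree on Y"
    # Shared margin P(Y = 1), implied identically by either clique's Dirichlet.
    y1 = rng.betavariate(
        sum(alpha_xy[i][1] for i in (0, 1)), sum(alpha_xy[i][0] for i in (0, 1))
    )
    p_y = (1.0 - y1, y1)
    theta = [[[0.0, 0.0] for _ in range(2)] for _ in range(2)]
    for j in (0, 1):
        x1 = rng.betavariate(alpha_xy[1][j], alpha_xy[0][j])  # P(X = 1 | Y = j)
        z1 = rng.betavariate(alpha_yz[j][1], alpha_yz[j][0])  # P(Z = 1 | Y = j)
        for i in (0, 1):
            for k in (0, 1):
                px = x1 if i else 1.0 - x1
                pz = z1 if k else 1.0 - z1
                theta[i][j][k] = p_y[j] * px * pz
    return theta


ones = [[1.0, 1.0], [1.0, 1.0]]  # illustrative clique parameters
theta = sample_hyper_dirichlet_chain(ones, ones, random.Random(0))
```

Every draw satisfies the product formula for the joint probabilities given above, so the conditional independence holds with probability 1 rather than only on average.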

One issue that must be addressed is how to choose the parameters for the Dirichlet marginals that determine the prior distribution of $\theta$. In the absence of prior knowledge, we will place a symmetric distribution on $\theta_C$ for each clique $C$ of the graph. The parameters of the largest clique in each connected subgraph³ are set to be some constant $\delta$; all of the other parameters can then be found via (15) in the Appendix. Alternately, one could specify the parameters of any other given clique in the graph (perhaps the smallest clique, or the one judged a priori to be the most influential). Informal investigations have shown that unexpected and counter-intuitive posteriors can result when some cliques have small parameter values; thus, by specifying the prior via parameters for the largest clique, we can control this sort of behavior (the parameters for the largest clique will be the smallest of any of the parameters of the marginal Dirichlets). For conventional noninformative Dirichlet priors, $\delta = 1/2$ (the Jeffreys prior for multinomial sampling) or $\delta = 1$ (the uniform distribution) are often used. We will use values of $\delta$ ... non-informative distributions ...

applications. It should be ...

of hypothetical data. For example, if we have two independent variables, and the probability corresponding to each is given a beta(2,2) distribution, this would correspond to a 2 x 2 table with a total of 4 data points in it. A model with dependence between the two variables would place a Dirichlet over the 2 x 2 table; choosing that Dirichlet to have parameter 1 in each cell would yield a prior with the same amount of hypothetical data (4 points). This approach will also ensure that any marginal of the table of joint probabilities (such as the probability that a particular variable will take a particular value) will have the same distribution under all models considered. In general, one can look at the most general dependence model as giving prior counts for the full table; the parameters for any more constrained model can be found by marginalizing this table of parameters in the appropriate way.
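The marginalization of prior counts can be sketched in a few lines of Python. The helper `clique_parameters` is a hypothetical name, and the slot order B, D, R is an assumption made for the example; the constant placed in every cell of the full table is called `delta` here.

```python
from itertools import product


def clique_parameters(full_alpha, keep):
    """Sum a full table of Dirichlet parameters down to the variables in `keep`.

    full_alpha maps 0/1 tuples (one slot per variable) to prior counts;
    keep lists the positions of the variables that form the clique.
    """
    out = {}
    for cell, a in full_alpha.items():
        key = tuple(cell[p] for p in keep)
        out[key] = out.get(key, 0.0) + a
    return out


delta = 1.0
full = {cell: delta for cell in product((0, 1), repeat=3)}  # slots: B, D, R
bd = clique_parameters(full, keep=(0, 1))  # two-way margin: 2 * delta per cell
r = clique_parameters(full, keep=(2,))     # one-way margin: 4 * delta per cell
```

Every model built this way carries the same total of 8 * delta hypothetical observations, and any single-variable margin has the same beta prior whichever model is used.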

To illustrate how one would choose the parameters for multiple models, consider three binary variables. Table 2 describes all of the possible hyper-Dirichlet models for three binary variables B, D and R (these variable names come from the spina bifida data considered below), with parameters chosen so that each model has the same amount of hypothetical data, and such that margins of $\theta$ have the same distribution under all models.

2.2 Other models

The models described above are restricted to the class of decomposable models. For three variables, the only hierarchical log-linear model that is not decomposable is that of no third-order interaction; for larger problems, there will be numerous hierarchical log-linear models that are not decomposable.

To investigate whether this is a serious drawback, we have considered Bayesian versions of conventional hierarchical log-linear models as in West (1985) and Raftery (1988b). Here, the maximal model for a 3 list scheme is parameterized

$$\Pr(X = i, Y = j, Z = k) = \exp(u_{\emptyset} + u_{i\cdot\cdot} + u_{\cdot j\cdot} + u_{\cdot\cdot k} + u_{ij\cdot} + u_{i\cdot k} + u_{\cdot jk} + u_{ijk}),$$

where $u_{i\cdot\cdot}$ is a main effect for X being in state $i$, etc. Smaller models can be formed by setting some of the $u$ terms to zero, subject to the hierarchical constraint mentioned earlier; see Bishop et al (1975).

For a Bayesian version

placing Normal priors on ... place a Gamma

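Graph              Model       Assumptions                        Parameters

B   D   R          [B][D][R]   full independence                  $\theta_{1++}, \theta_{+1+}, \theta_{++1} \sim$ iid Beta$(4\delta, 4\delta)$

B - D   R          [BD][R]     $B \perp R$, $D \perp R$           $(\theta_{11+}, \theta_{10+}, \theta_{01+}) \sim$ Dirichlet$(2\delta, 2\delta, 2\delta, 2\delta)$
                                                                  $\theta_{++1} \sim$ Beta$(4\delta, 4\delta)$

B - R   D          [BR][D]     $B \perp D$, $D \perp R$           $(\theta_{1+1}, \theta_{1+0}, \theta_{0+1}) \sim$ Dirichlet$(2\delta, 2\delta, 2\delta, 2\delta)$
                                                                  $\theta_{+1+} \sim$ Beta$(4\delta, 4\delta)$

D - R   B          [DR][B]     $B \perp D$, $B \perp R$           $(\theta_{+11}, \theta_{+10}, \theta_{+01}) \sim$ Dirichlet$(2\delta, 2\delta, 2\delta, 2\delta)$
                                                                  $\theta_{1++} \sim$ Beta$(4\delta, 4\delta)$

B - D - R          [BD][DR]    $B \perp R \mid D$                 $\theta$ hyper-Dirichlet with:
                                                                  $(\theta_{11+}, \theta_{10+}, \theta_{01+}) \sim$ Dirichlet$(2\delta, 2\delta, 2\delta, 2\delta)$
                                                                  $(\theta_{+11}, \theta_{+10}, \theta_{+01}) \sim$ Dirichlet$(2\delta, 2\delta, 2\delta, 2\delta)$

D - B - R          [BD][BR]    $D \perp R \mid B$                 $\theta$ hyper-Dirichlet with:
                                                                  $(\theta_{11+}, \theta_{10+}, \theta_{01+}) \sim$ Dirichlet$(2\delta, 2\delta, 2\delta, 2\delta)$
                                                                  $(\theta_{1+1}, \theta_{1+0}, \theta_{0+1}) \sim$ Dirichlet$(2\delta, 2\delta, 2\delta, 2\delta)$

B - R - D          [BR][DR]    $B \perp D \mid R$                 $\theta$ hyper-Dirichlet with:
                                                                  $(\theta_{1+1}, \theta_{1+0}, \theta_{0+1}) \sim$ Dirichlet$(2\delta, 2\delta, 2\delta, 2\delta)$
                                                                  $(\theta_{+11}, \theta_{+10}, \theta_{+01}) \sim$ Dirichlet$(2\delta, 2\delta, 2\delta, 2\delta)$

complete graph     [BDR]       none                               $\theta \sim$ Dirichlet$(\delta, \delta, \delta, \delta, \delta, \delta, \delta, \delta)$

Table 2: Hyper-Dirichlet models for three binary variables B, D and R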

problems. We do not pursue these models further here, but we do note that their ability to reflect causal and sequential relationships makes them ideal for extensions of the Bayesian approaches to capture-recapture data taken in Smith (1988, 1991) Rodrigues et al (

3 Posterior

3.1 Calculation and estimation

Conditional upon $N$, $M$, and $\theta(M)$, $D$ has a multinomial distribution with $N$ trials and success probabilities given by $\theta(M)$. Recalling that we assume that $N$ is independent of $\theta$ and $M$ a priori, we use Bayes rule to combine the priors and the likelihood to arrive at the posterior distribution of $(N, \theta(M))$ conditional on the model $M$, as follows:

$$P(N, \theta(m) \mid D, M = m) = \frac{P(D \mid N, \theta(m), M = m)\, P(\theta(m) \mid M = m)\, P(N)}{P(D \mid M = m)}. \qquad (4)$$

However, making inference about $\theta(m)$ is not the primary objective, and so it will be more convenient to marginalize by integrating out $\theta(m)$. This yields

$$P(N \mid D, M = m) = \frac{P(D \mid N, M = m)\, P(N)}{P(D \mid M = m)}, \qquad (5)$$

where

$$P(D \mid N, M = m) = \int_{\Theta_m} P(D \mid N, \theta(m), M = m)\, P(\theta(m) \mid M = m)\, d\theta(m) \qquad (6)$$

$$= \binom{N}{D^*} \frac{\Psi_m(\alpha(m) + D^*)}{\Psi_m(\alpha(m))}. \qquad (7)$$

$\Psi_m(\cdot)$ is the normalizing constant for model $m$, as given in appendix A, and is composed of $\Gamma$-functions; $D^*$ is the "complete" data including the count for the unobserved cell, and can be written since we are conditioning upon $N$.
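As an illustration, consider only the saturated model, where the hyper-Dirichlet reduces to a single ordinary Dirichlet and the marginal likelihood is the standard Dirichlet-multinomial formula. The sketch below evaluates the posterior of N on a truncated grid. The truncation point `n_max`, the function names, and the counts in the usage line are assumptions for illustration; the general normalizing constant of appendix A is not reproduced here.

```python
import math


def log_marginal_saturated(observed, N, delta):
    """log P(D | N) under one Dirichlet(delta, ..., delta) over all cells.

    observed: counts for the observed cells (every cell except "in no list").
    """
    n = sum(observed)
    if N < n:
        return -math.inf
    cells = list(observed) + [N - n]  # complete data, including the unseen cell
    alpha0 = delta * len(cells)
    log_coef = math.lgamma(N + 1) - sum(math.lgamma(c + 1) for c in cells)
    log_ratio = (
        sum(math.lgamma(delta + c) - math.lgamma(delta) for c in cells)
        + math.lgamma(alpha0) - math.lgamma(alpha0 + N)
    )
    return log_coef + log_ratio


def posterior_N(observed, delta, n_max, log_prior=lambda N: -math.log(N)):
    """Posterior of N on the grid n..n_max; the default prior is P(N) = 1/N."""
    n = sum(observed)
    grid = range(max(n, 1), n_max + 1)
    logs = [log_marginal_saturated(observed, N, delta) + log_prior(N) for N in grid]
    top = max(logs)  # subtract the maximum before exponentiating, for stability
    weights = [math.exp(v - top) for v in logs]
    total = sum(weights)
    return {N: w / total for N, w in zip(grid, weights)}


# Hypothetical counts for the seven observed cells of a three-list table.
posterior = posterior_N([10, 8, 5, 20, 6, 4, 3], delta=1.0, n_max=2000)
```

Working on the log scale with `math.lgamma` avoids overflow in the factorials; the grid must extend far enough that the truncated tail carries negligible mass.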

Expression (5) gives the posterior for $N$ conditional on a particular model $M = m$; we may now combine models for an unconditional posterior as follows:

$$P(N \mid D) = \sum_{m} P(N \mid D, M = m)\, P(M = m \mid D). \qquad (8)$$
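The averaging step can be sketched as follows, assuming that the log marginal likelihood $\log P(D \mid M = m)$ of each model has already been computed (for instance by summing $P(D \mid N, M = m) P(N)$ over N). The function and argument names are hypothetical.

```python
import math


def average_over_models(log_marginals, posteriors, log_model_priors=None):
    """Combine per-model posteriors of N using posterior model probabilities.

    log_marginals[m] = log P(D | M = m); posteriors[m] = {N: P(N | D, M = m)}.
    """
    models = list(log_marginals)
    if log_model_priors is None:  # uniform prior over the models
        log_model_priors = {m: 0.0 for m in models}
    scores = {m: log_marginals[m] + log_model_priors[m] for m in models}
    top = max(scores.values())
    weights = {m: math.exp(s - top) for m, s in scores.items()}
    total = sum(weights.values())
    model_probs = {m: w / total for m, w in weights.items()}
    averaged = {}
    for m in models:
        for N, p in posteriors[m].items():
            averaged[N] = averaged.get(N, 0.0) + model_probs[m] * p
    return model_probs, averaged


# Toy inputs: model "b" is three times as well supported as model "a".
probs, mixed = average_over_models(
    {"a": 0.0, "b": math.log(3.0)},
    {"a": {10: 1.0}, "b": {10: 0.5, 11: 0.5}},
)
```

The averaged posterior is a mixture, so it can be wider than, or have a different shape from, any single model's posterior.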

Depending upon the application, summaries of the posterior may be desirable. If a point estimate is desired, we follow several authors in Bayesian estimation for capture-recapture (Freeman, 1973, Smith, 1988, 1991) and binomial $N$ estimation (Raftery, 1988a) who have suggested that the loss function to use is the relative squared error loss function,

$$L(N, \hat N) = \left( \frac{\hat N - N}{N} \right)^2.$$

It is relatively easy to work out that the Bayes estimate for this loss function is

$$\hat N = \frac{E(N^{-1} \mid D)}{E(N^{-2} \mid D)}. \qquad (9)$$
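To see why, expand the expected relative squared error of a candidate estimate a as $a^2 E(N^{-2} \mid D) - 2a E(N^{-1} \mid D) + 1$; this is quadratic in a and is minimized at the ratio of the two inverse moments. A short sketch, with the posterior supplied as a dictionary (the function name and this representation are our own choices):

```python
def relative_loss_estimate(posterior):
    """Bayes estimate of N under relative squared error loss.

    posterior maps each value of N to its posterior probability.
    """
    inv1 = sum(p / N for N, p in posterior.items())       # E(1 / N | D)
    inv2 = sum(p / N ** 2 for N, p in posterior.items())  # E(1 / N^2 | D)
    return inv1 / inv2


# Toy posterior with half its mass at 100 and half at 200.
estimate = relative_loss_estimate({100: 0.5, 200: 0.5})
```

For the toy posterior the estimate is 120, below the posterior mean of 150: the relative loss penalizes overestimating a small N more heavily than underestimating a large one.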

If interval estimates are desired, posterior percentiles or highest posterior density intervals can be used. This situation is somewhat similar to the estimation of the binomial $N$ parameter, where point estimates of $N$ can be very unstable (see Olkin et al, 1981). In such a context, we feel that interval estimates are preferable to point estimates because they explicitly reflect the uncertainty and instability in the estimate.

Finally, the efficiency of a list (the probability that it will capture an individual) can be judged by looking at the posterior of marginals of $\theta$; dependence relationships between lists can be judged by looking at the posterior of an odds ratio. Any such marginal $\Delta$ can be expressed as $f_m(\theta(m))$ for each model. The posterior distribution of $\Delta$ is then given by

$$P(\Delta \mid D) = \sum_{m} \sum_{N} P(f_m(\theta(m)) \mid D, N, M = m)\, P(N \mid D, M = m)\, P(M = m \mid D). \qquad (10)$$

Conditional upon $N$, the posterior of the efficiency of a given list is just a beta distribution marginalized from the hyper-Dirichlet posterior. Because of the way that the priors are chosen to be consistent across models, these posteriors will be identical across models, and are given by adding the margins of the full table of parameters to the number found and not found by a particular list. If we have $k$ lists, the maximal model is given a Dirichlet prior with parameter $\delta$ in each cell, and if $x$ and $N - x$ are the numbers found and not found by a particular list, then the posterior of the efficiency of that list conditional upon $N$ is a beta$(2^{k-1}\delta + x,\ 2^{k-1}\delta + N - x)$. The unconditional posterior can be found as in (10).
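A sketch of the resulting calculation for the posterior mean of a list's efficiency: the beta posterior conditional on N has prior margin $2^{k-1}$ times the per-cell prior count on each side, and its mean is averaged over a posterior for N. The function name and the dictionary representation of the posterior are assumptions for illustration.

```python
def efficiency_posterior_mean(x, k, delta, posterior_N):
    """Posterior mean of a list's capture probability, averaged over N.

    x: number found by the list; k: number of lists; delta: prior count per
    cell of the maximal model; posterior_N: {N: P(N | D)}.
    """
    a0 = 2 ** (k - 1) * delta  # prior margin for "found" and for "not found"
    # The mean of beta(a0 + x, a0 + N - x) is (a0 + x) / (2 * a0 + N).
    return sum(p * (a0 + x) / (2 * a0 + N) for N, p in posterior_N.items())


# Toy input: three lists, 50 individuals found, N known to be 100.
mean_eff = efficiency_posterior_mean(50, 3, 1.0, {100: 1.0})
```

Because the conditional posterior is the same beta distribution under every model, only the averaged posterior of N is needed here, not the individual model posteriors.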

3.2 Handling covariates

For illustration, we will consider a model with three lists and one covariate; the first slot in the subscript will refer to the covariate, the rest to the lists. If the covariate takes on values of 0 or 1, then both $D_{0000}$ and $D_{1000}$ are unknown. Let $D$ be the observed data, let $D^* = (D_{0000}, D_{1000}, D)$, and write $n$ for the total number of individuals observed.

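The posterior of $N$ is as given by (5) and (8), except that

$$P(D \mid N, M = i) = \sum_{D_{1000}=0}^{N-n} P(D^* \mid N, M = i)$$

$$= \sum_{D_{1000}=0}^{N-n} \int P(D^* \mid N, \theta(i), M = i)\, P(\theta(i) \mid M = i)\, d\theta(i)$$

$$= \sum_{D_{1000}=0}^{N-n} \binom{N}{D^*} \frac{\Psi_i(\alpha(i) + D^*)}{\Psi_i(\alpha(i))}. \qquad (11)$$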

If some covariate values are missing for observed individuals, then the sum in (11) should also be over all possible values for the cases with missing data. If the population size is potentially large, and/or if there is a great deal of missing data, this sum will be arduous to compute, and Monte Carlo approximations may be necessary.

3.3 Other models

For the Bayesian version of conventional log-linear models described in section 2.2, the posterior of $N$ is still calculated as in (5). However, the integral in (6) can not be evaluated easily because this prior is not conjugate. It can still be evaluated using the Laplace approximation, one version of which has been described by Tierney and Kadane (1986)⁴. This method approximates the log-integrand with a quadratic centered at the mode, making the problem equivalent to integrating a multivariate normal distribution. We integrate out both the $u$-terms and $\phi$, which is their variance; in order to do this, we first find the mode of the posterior, $(\hat u, \hat\phi)$. Since the distribution is concave, this can be done by gradient ascent or the slower but simpler to implement ICM method of Besag (1983). We then evaluate $\Sigma$, which is minus the inverse Hessian of the log integrand evaluated at $(\hat u, \hat\phi)$; if there are $k$ $u$-terms, $\Sigma$ is a $(k + 1) \times (k + 1)$ matrix. The likelihood is

P(D i N, At = P(D. I m

by noting that $(\hat u, \hat\phi)$ will be close to the maximum likelihood estimate for these parameters, and so $\Sigma$ can be approximated by using asymptotic results for the inverse of the Fisher information matrix. Also, Markov Chain Monte Carlo approaches would be able to circumvent these difficulties, perhaps using the adaptive rejection sampling proposed by Gilks and Wild (1992). We have not pursued these approaches further. This is partially because the priors for these models are much more difficult to choose and interpret - how should one specify the distribution of $\phi$? Should higher order interaction terms have a lower variance? Does it make sense to assume independence between the $u$ terms?

4 Examples

4.1 Prevalence of spina bifida in upstate New York

Hook et al (1980) have made use of an MRS to estimate the number of children born with spina bifida in New York state outside of New York city during the period from 1969 to 1974. There are three record systems involved: birth certificates (B), death certificates (D), and records on individuals who were served by a medical rehabilitation program (R). The data represent the number of non-anencephalic cases of spina bifida, together with the related birth defects meningomyelocele, cephalocele, and encephalocele, and are presented in table 3. A total of 626 individuals were identified in all of the lists, out of a total of 863,143 live births.

                R = 0    R = 1
B = 0, D = 0
B = 0, D = 1
B = 1, D = 0
B = 1, D = 1

Table 3: Spina Bifida data

We propose an informative ... Center for ... Control ... non-anencephalic Spina Bifida ... additionally,

Model              Posterior Prob.   $\hat N$   Mode   2.5%, 97.5%
[DR]               0.373             731        729    (701, 767)
[BD][DR]           0.301             756        752    (714, 811)
[BR][DR]           0.281             712        709    (681, 751)
[BDR]              0.036             697        626    (628, 934)
Model Averaging                      731        728    (682, 797)

Table 4: Summaries of the posteriors for the spina bifida data for all models with posterior probability greater than 0.01

However, these uncertainties are part of the reason for choosing such a large prior variance for $N$; a sensitivity analysis (see below) demonstrates that moderate shifts in the location of the prior do not have much impact upon the posterior. If we had greater confidence in our preliminary estimate of the expected number of cases, we would choose a smaller variance for this prior.

Using this informative prior for $N$, and using equation (8) to average across the 8 symmetric hyper-Dirichlet priors given in table 2 with the parameter $\delta = 1$, we obtain the posterior for $N$ shown in figure (3). In the figure, the value of

$$P(N \mid D, M = m)\, P(M = m \mid D)$$

is plotted for any model $m$ with non-negligible posterior probability; the curves show both the shape of the posterior for particular models, and, by the area beneath them, their relative contribution to the posterior. Summaries of the posteriors are given in table 4. ... over $\delta$ ... because

more complex modelsIS

[Figure 3: posterior for N by model, with curves for [DR], [BD][DR], [BR][DR] and [BDR]; horizontal axis N, from 650 to 900.]

P(N)           $\delta$   $\hat N$   Mode   2.5%, 97.5%   [DR]          [BD][DR]   [BR][DR]      [BDR]
Historical     1/2        732        728    687, 795      0.48          0.28       0.23          0.02
               1          731        728    682, 797      0.37          0.31       0.28          0.04
               2          730        726    678, 797      (illegible)   0.30       0.34          0.06
< Historical   1/2        730        727    685, 792      0.48          0.26       0.24          0.02
               1          729        727    679, 794      0.37          0.29       0.29          0.04
               2          728        725    675, 794      0.29          0.29       0.35          0.07
> Historical   1/2        734        730    689, 799      0.47          0.30       0.21          0.02
               1          734        729    685, 802      0.37          0.33       0.26          0.03
               2          732        728    682, 802      0.29          0.32       0.32          0.06
Jeffreys       1/2        732        728    686, 798      0.47          0.28       0.23          0.02
               1          732        728    681, 800      0.37          0.31       0.28          0.04
               2          730        726    678, 799      0.29          0.31       0.34          0.06
Rissanen       1/2        732        728    686, 797      0.47          0.28       0.23          0.02
               1          731        727    681, 800      0.37          0.31       0.28          0.04
               2          730        726    677, 799      0.29          0.30       (illegible)   0.07

Table 5: Sensitivity for spina bifida data. "Historical" refers to the informative prior for N described above; "< Historical" and "> Historical" refer to similar priors, but based on prevalence rates of 7.2 and 9.6 per 10,000, respectively. The Jeffreys prior is given in (1), and the Rissanen prior is given in (2). The features of the posterior given are: the Bayes estimate for the relative squared error loss function, given in (9); the mode; the 2.5th and 97.5th percentiles; and all model probabilities. Those models not listed above had model ... The ... that we chose are indicated in boldface.

holds may be found by summing the posterior probabilities for each model that uses that assumption, we have that the posterior probability of the [DR] interaction is very nearly 1. There is also some evidence in favor of the [BD] interaction (probability 0.31) and [BR] interaction (probability 0.25). Hook et al (1980) find [BD][DR] to be a close competitor to the simpler [DR] model; we find this model somewhat less likely. Another candidate model not considered in previously published analyses (which have used backwards deletion model selection techniques) of the data is [BR][DR], which is nearly as likely as [BD][DR]. We note that the posteriors for these three models are centered at different locations, and estimation of $N$ conditional upon one model would depend a great deal upon the particular model chosen. Averaging over models, on the other hand, gives us one posterior that accurately reflects our uncertainty about the correct model.

The posterior means and standard deviations for the probability that an individual will be found via any particular list (efficiency) are given in table 6. It is awkward to come up with efficiency estimates in the conventional log-linear modeling framework, and Hook et al (1980) avoid doing so. For reporting such efficiencies, they restrict themselves to two way comparisons, and capture-recapture models, so that all efficiencies are with respect to some other list or combination of lists. This makes interpretation somewhat confusing, because they are not able to directly estimate the true quantity of interest. In contrast, the methods described here directly and easily produce efficiency estimates for each list.

List                 Posterior Mean   Posterior Std. Dev
Birth Certificate    0.699            0.032
Death Certificate    0.284            0.020
Medical Records      0.258            0.019

Table 6: Posterior mean and standard deviation for the probability that a list will correctly identify an individual with spina bifida

If we use our estimate of $N$ to compute a prevalence rate for spina bifida for the population of all live births, we arrive at an estimate of 0.847 per 1000, with 2.5th and 97.5th posterior percentiles being 0.790 and 0.923. In comparison to the estimate of 0.725 per thousand if we assume that no cases were missed, more than one case per ten thousand is missed;

estimate of

[Figure: curves for [BR][DR], no 3rd-order and [BDR]; horizontal axis N, from 650 to 900.]

Figure 4: Posterior for the number of cases of spina bifida : the log-linear approach

... and hyper-Dirichlet approaches for models can ... for most are slightly higher for

hyper-Dirichlet ... a value of ... estimate for

Comparison between ... be dealt with in the log-linear parameterization.

= 697 for

4.2 Spina bifida analysis : independence versus model averaging

Smith reduces the dimensionality of the data since the exact capture histories are irrelevant under this model. Instead, he places an arbitrary ordering on the lists, and considers the data to be the number caught in the first list, the number caught in the second list that had and had not been captured in the first, etc. We write $D'$ for this reduced data set. One can construct the likelihood by noting that the number of individuals captured in each list is binomial with $N$ trials and success probabilities $\theta_{1++}$, $\theta_{+1+}$, and $\theta_{++1}$ respectively. He then multiplies by a hypergeometric term, to introduce the probability of those who were caught in the first list being caught in the second, and the probability of those caught in the first or second list being caught in the third list. With a little bit of simplification, this likelihood can be written as

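$$
P(D' \mid N, \theta) = \binom{N}{D_{1++}} \binom{D_{1++}}{D_{11+}} \binom{D_{0++}}{D_{01+}} \binom{D_{1++} + D_{01+}}{D_{1+1} + D_{011}} \binom{D_{00+}}{D_{001}}
\times \theta_{1++}^{D_{1++}} (1-\theta_{1++})^{N-D_{1++}} \, \theta_{+1+}^{D_{+1+}} (1-\theta_{+1+})^{N-D_{+1+}} \, \theta_{++1}^{D_{++1}} (1-\theta_{++1})^{N-D_{++1}}. \tag{12}
$$

The likelihood for the independence model that we consider is simply multinomial:

$$
P(D \mid \theta, N) = \binom{N}{D_{111}, \ldots, D_{000}} \, \theta_{1++}^{D_{1++}} (1-\theta_{1++})^{N-D_{1++}} \, \theta_{+1+}^{D_{+1+}} (1-\theta_{+1+})^{N-D_{+1+}} \, \theta_{++1}^{D_{++1}} (1-\theta_{++1})^{N-D_{++1}}. \tag{13}
$$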

Further simplification of the product of combinatorial terms in (12) reveals that, as a function of N, it is proportional to the combinatorial term in (13), so the two likelihoods are proportional. With the same priors as Smith (1990) (the Jeffreys prior given in (1) for N, and a Beta(1,1) for the probability of capture by each list), our analysis for this one model should be identical to his. Smith argues that his likelihood is appropriate because it does not depend upon the arbitrary ordering of the lists that he uses; this is indeed the case, but we feel that this is good reason not to go through the trouble of using the convoluted and confusing expression (12), when the likelihood in (13) produces identical results and is very transparent.
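The proportionality of the two combinatorial terms can be checked with exact integer arithmetic. The sketch below assumes the reading of (12) in which $D_{0++} = N - D_{1++}$ and $D_{00+} = N - D_{1++} - D_{01+}$, so that only three of the binomial coefficients involve N, and it takes the N-dependent part of the multinomial coefficient to be $N!/(N-n)!$, with n the number of individuals seen at least once. The function names and the counts are hypothetical, chosen purely for illustration.

```python
from fractions import Fraction
from math import comb, factorial, perm

def smith_terms_in_N(N, d1, d01, d001):
    """Product of the binomial coefficients of (12) that depend on N.

    d1   = D_{1++}  (caught in the first list)
    d01  = D_{01+}  (missed by the first list, caught in the second)
    d001 = D_{001}  (caught only in the third list)
    """
    return comb(N, d1) * comb(N - d1, d01) * comb(N - d1 - d01, d001)

def multinomial_term_in_N(N, n_seen):
    """N-dependent part of the multinomial coefficient: N! / (N - n)!."""
    return perm(N, n_seen)

def ratio(N, d1, d01, d001):
    """Exact ratio of the two combinatorial terms at a given N."""
    n_seen = d1 + d01 + d001
    return Fraction(smith_terms_in_N(N, d1, d01, d001),
                    multinomial_term_in_N(N, n_seen))

# Hypothetical counts, for illustration only.
D1, D01, D001 = 12, 5, 3
ratios = {ratio(N, D1, D01, D001) for N in range(20, 200)}
print(len(ratios))  # a single distinct value means the ratio is free of N
```

Because the ratio does not change with N, the two likelihoods differ only by a constant factor, and any posterior for N computed from one is the same as that computed from the other.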

There is an error in the expression of the posterior of in Smith (1990, 1991); he writes

---'-------'----"- x ---'-------'----"-

                     Independence model    All models
Mode                        165               158
95th percentiles         (153, 188)        (146, 188)
$\hat{N}$                   168               161

Table 7: Summaries for independence model versus the aggregated posterior, for the 1969 Spina Bifida data. $\hat{N}$ is the Bayes estimate for the relative squared error loss function.

given above). Instead, we compare the results for the independence model that we have calculated with the results of averaging across models in table 7.

We compare the performance of our models to Smith's by using Bayes factors, which are often used to select models and to compare their predictive ability (see, for example, Good, 1975). The Bayes factor between model $m_0$ and model $m_1$ is defined as

$$B_{10} = \frac{P(D \mid M = m_1)}{P(D \mid M = m_0)};$$

a large value of $B_{10}$ indicates evidence against $m_0$, in favor of $m_1$. Logarithms of the Bayes factors against $m_0$, the independence model of Smith, in favor of the other models listed in table 2, for both the 1969 and the aggregated data, are given in table 8. The final entry in the table is the Bayes factor for the average over models versus the choice of the independence model. This log Bayes factor is necessarily positive, because the independence model is a component of the average across models, and appears in both the numerator and denominator. The priors for N and $\theta$ were chosen to be consistent with the analysis in Smith (1990). One can see that with the 1969 data, some of the dependence assumptions are favored over that of independence; the evidence against the independence model is overwhelming when one considers the full data set.

Thus, we conclude that the assumption of independence made by Smith (1990) is not supported by the data, whether we are looking only at 1969 or at the full data set. Moreover, his method of likelihood only obfuscates the posterior as Smith (1990, 1991) is written incorrectly, 1 data given in Smith (1990) are not reproducible.
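As a numerical aside on the final row of table 8, the printed value for the average over models agrees, to the two decimals shown, with the natural logarithm of the sum of the individual Bayes factors once the independence model itself is included with a Bayes factor of 1. The sketch below is only a consistency check on the printed numbers; reading the entry as a sum rather than a weighted mean is an inference from the table, not a definition stated in the text, and the names used are illustrative.

```python
import math

# log(B10) values as printed in table 8, one list per data set.
# The independence model m0 is not a row of the table; its log Bayes
# factor against itself is 0 by definition and is added below.
LOG_BF_1969 = [0.03, 3.21, -1.06, 2.21, -1.59, 2.81, 1.08]
LOG_BF_1969_1973 = [3.62, 24.64, -1.83, 24.35, 5.29, 24.13, 22.08]

def log_sum_of_bayes_factors(log_bfs):
    """Log of (1 + sum of Bayes factors), computed stably in log space."""
    values = [0.0] + list(log_bfs)  # 0.0 is log(1) for m0 itself
    top = max(values)
    return top + math.log(sum(math.exp(v - top) for v in values))

all_models_1969 = log_sum_of_bayes_factors(LOG_BF_1969)
all_models_full = log_sum_of_bayes_factors(LOG_BF_1969_1973)
print(round(all_models_1969, 2), round(all_models_full, 2))
```

Since the sum always contains the term 1 contributed by the independence model, its logarithm cannot be negative, which matches the remark that this entry is necessarily positive.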

4.3 Spina bifida with covariate

                                  log(B10)
Alternate Model m1      1969 Data    1969-1973 Data
[BD]                       0.03           3.62
[DR]                       3.21          24.64
[BR]                      -1.06          -1.83
[BD][DR]                   2.21          24.35
[BD][BR]                  -1.59           5.29
[BR][DR]                   2.81          24.13
[BDR]                      1.08          22.08
All Models                 4.03          25.53

Table 8: Bayes factors against $m_0$, the independence model [B][D][R]

added those cases to the data for blacks. In addition to the data in the table, there are 5 individuals for whom we have no information on race (Hook, 1992). These five individuals had the following values for (B, D, R): (1,0,1), 3 x (1,0,0), and (0,1,0). We compute the posterior for N as in section 3.2, summing over all $2^5 = 32$ possible values of race for these 5 incomplete cases, as well as summing over the possible races of the unobserved individuals.
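The sum over race assignments for the five incomplete cases can be pictured with a short enumeration. This is a sketch of the bookkeeping only: it lists the completed (race, capture history) counts for each assignment and does not compute any posterior. The two race labels and the function name are illustrative choices, not notation from the text.

```python
from collections import Counter
from itertools import product

RACES = ("white", "other")
# (B, D, R) capture histories of the five individuals with unknown race.
INCOMPLETE = [(1, 0, 1), (1, 0, 0), (1, 0, 0), (1, 0, 0), (0, 1, 0)]

def completions(histories, races=RACES):
    """Yield one Counter of (race, history) counts per race assignment."""
    for assignment in product(races, repeat=len(histories)):
        yield Counter(zip(assignment, histories))

all_tables = list(completions(INCOMPLETE))
# Several assignments lead to the same completed table, because three of
# the individuals share the history (1, 0, 0).
distinct = Counter(frozenset(table.items()) for table in all_tables)
print(len(all_tables), len(distinct))
```

Grouping identical completed tables, with their multiplicities, is one way to cut the number of posterior evaluations, since the three individuals with history (1,0,0) are exchangeable.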

                      Whites              Blacks & Others
                  R = 0    R = 1          R = 0    R = 1
B = 0, D = 0        ?    [illegible]        ?        8
B = 0, D = 1       45        3              3        1
B = 1, D = 0      230      107             14        4
B = 1, D = 1      134       12              8        0

Table 9: Spina Bifida

The posterior for is displayed in figure 5; "'A.•.• UU.• '"u

make the greatest contribution are table 1posterior probability

by Race

models whichhad high

notable

[Figure 5. Legend: Full Posterior; I; II; III; IV; V and VII comb.; VI; VIII. Horizontal axis: N, 650 to 900.]

Efficiencies of the various lists, broken down by race, are given in table 10. As with other features of the posterior, including the ethnicity covariate has no impact upon the estimates for the overall population. The results do, however, indicate that birth certificates are considerably spina bifida cases

                          Whites                Non-whites
List                 Mean     Std. Dev       Mean     Std. Dev
Birth Certificate    0.710     0.033         0.565     0.107
Death Certificate    0.285     0.020         0.282     0.031
Medical Records      0.259     0.019         0.263     0.031

Table 10: Posterior mean and standard deviation for the probability that a given list will correctly identify an individual with spina bifida

                  Posterior                          Whites   Non-whites
Label   Model       Prob       N    (2.5, 97.5)        N          N
I       (graph)     0.223     731   (701, 767)        683         48
II      (graph)     0.185     756   (714, 811)        699         56
III     (graph)     0.168     712   (681, 751)        660         50
IV      (graph)     0.062     731   (701, 767)        683         48
V       (graph)     0.061     732   (702, 769)                    54
VI      (graph)     0.052     756                     706         49
VII                 0.050
VIII

[Figure: posterior for N. Legend: Hyper-Dirichlet; Log-Linear; H-D with covariate. Horizontal axis: N, 650 to 850.]

Table 12: Patients receiving methicillin in Massachusetts General Hospital, from Wittes (1974). I - intravenous nurses; F - floor nurses; D - pharmacy; R - medication records.

4.4 Patients receiving methicillin

Wittes (1974) presents a data set in which the population of interest is the number of patients in a hospital receiving the drug methicillin, a synthetic penicillin. Four methods were used to identify the patients: the intravenous nurses were to report each administration (I); all floor nurses were to report use of the drug (F); the hospital pharmacy staff were to report each case where the drug was ordered for a patient (D); and, medication records of the patients were at pharmacy are duplicates of the medication records scanned these two lists; could conceivably

patient was on

Figure 7: Posterior for the number of patients receiving Methicillin, from Wittes (1974). [Horizontal axis: N, 450 to 650.]

at as many values of as possible, fit a polynomial tofitted polynomial to interpolate value ofvalues that were calculated vA'CLvC, Y

would lead to a cluttered picture, because even the best posteriors are very short scaled by their posterior model probabilities. Four of the models have posterior probability higher than 0.05; one of them is the model of full independence, none of the four involve the covariate. The relative risk estimate for the full posterior is 2.5th and 97.5th percentiles are 480 and 568.

The high dispersion of the posterior model probabilities indicates that there is little information in the data about the correct model, which would result in severe instability of any approach that relied upon choosing one particular model. In contrast, the model uncertainty approach still gives us a posterior that can be used for decision making and inference; although it does not give us any real insight as to the mechanisms that produced the data, we would argue that this is because there are no such insights to be gleaned from the data.

The covariate here is being treated as having four completely different levels, when in fact the values are ordinal in nature. One would expect that the probability of identifying a member of the population would increase with the number of days on the drug. However, the functional form of such an increase is not at all apparent. Convenience, and the difficulty of parameterizing the problem in such a way as to capture this ordinal nature, are our justifications for treating this variable as categorical. One possible alternative would be to use the ordered Beta priors presented in York (1992); this would not be compatible with the hyper-Dirichlet approach taken here, however, and evaluation of features of the posterior would require a Monte Carlo approach to calculation, even if the combinatorial explosion mentioned above were not a problem.

4.5 Children in Massachusetts with a birth defect

Bishop et al (1975) and Wittes et al (1974) present and analyze a data set extracted from Wittes (1970). The data are counts of children in Massachusetts possessing a particular (unnamed) congenital birth defect, born between 1945 and 1949 inclusive, and still alive as of the end of 1956. There are a total of 5 record systems: obstetric (O); other hospital records (R); Massachusetts Department of Health records (D); Massachusetts Department of Mental Health records (H); and school records (S). The data are presented in table 13; a total of 537 individuals were identified.

\Vittes et al are unable to draw anyto assume H'l(l(~p(m(lertCe

Bishop et

                 O = 0               O = 1
(D,H,S)      R = 0   R = 1       R = 0   R = 1
(0,0,0)        ?      27          37      19
(0,0,1)        4       4           1       1
(0,1,0)       97      22          37      25
(0,1,1)        2       1           3       5
(1,0,0)       83      36          34      18
(1,0,1)        3       5           0       2
(1,1,0)       30       5          23       8
(1,1,1)        0       3           0       2

Table 13: Number of children identified as having a particular birth defect. O - obstetric records; R - other hospital records; D - Massachusetts Department of Health records; H - Massachusetts Department of Mental Health records; S - school records.

                                   Posterior
Label   Model                     Probability   $\hat{N}$   Mode   2.5%, 97.5%
I       [DH][HO][OR][RS]             0.217         627       624    (598, 663)
II      [DH][OR][RS]                 0.160                          (?91, 645)
III     [DH][HOR][RS]                0.082         613       610    (586, 648)
IV      [DH][HR][OR][RS]             0.065         610       608    (585, 640)
V       (graph)                      0.052         616       614    (591, 647)
VI      (graph)                      0.051         616       614    (591, 646)
-       Model Averaging                            617       614    (588, 655)

Table 14: Models with largest posterior probability in Bayesian hyper-Dirichlet approach. Labels correspond to those in figure 9

hyper-Dirichlet approach with $P(N) \propto 1/N$ and $\delta = 1/2$ are shown in table 14. Eleven other models have posterior model probabilities between 0.01 and 0.05, and 98% of the posterior probability is concentrated on fewer than 50 of the 822 possible five-variable models. Some of the relationships between variables are very strong; for example, the posterior probability of a link between obstetric records O and hospital records R is 0.87; between R and school records S is 0.99; and between Department of Health D and Department of Mental Health H is 0.99. We note that these relationships are rather more than model given in Bishop et al (1975), with its between and the Department of Health and between school records and of Health.

The 9.

[Figure 9. Legend: Full Posterior; I - [DH][HO][OR][RS]; II - [DH][OR][RS]; III - [DH][HOR][RS]; IV - [DH][HR][OR][RS]; V and VI combined. Horizontal axis: N, 550 to 750.]

5 Assessing Coverage

Regal and Hook (1991) criticized the model selection approach, showing that the process of model selection can destroy coverage probability. Reducing their argument to its essentials, they choose a "true" value for N that is not inconsistent with the Spina Bifida data for at least one possible log-linear model. This value is, however, fairly extreme in comparison with the fitted value. They then fix a simulation routine so that it will generate data for this value of the parameter and so that the simulated data will look much like the observed data. Since the simulated data is constructed to look much like the observed data, it would not be surprising if the value for N fitted to the artificial data resembled the value fitted to the observed data. Thus, unless the method of producing interval estimates is sensitive to very subtle features of the data, it should be no wonder that the resulting coverage probabilities are so poor. Carried to extreme, one could choose a huge N and fit a no-third-order interaction model that could produce data similar to that observed; the calculated confidence intervals will be similar to those calculated for the real data, and so the coverage probability will be abysmal.

A better approach for assessing coverage is to consider it for a range of values of N. We conducted several simulation experiments based on the spina bifida data in table 3, following a procedure similar to that in Regal and Hook (1991). First, we hypothesize a particular log-linear model for the "truth". For a fixed value of N, we fit the model to the full table composed of the observed data and the count for the unobserved cell corresponding to the hypothesized N. This yields a fitted value $\hat\theta$ of $\theta$; we then simulate data from the model with N and $\hat\theta$ as parameters. For each simulated data set, we apply several methods for finding interval estimates for N, and note whether or not each interval estimate contains the true value of N. We estimate the coverage for an interval estimate as the proportion of times that it contained the truth, for fixed values of N and $\theta$.
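The simulation loop described above can be sketched in a few lines for the simplest case. This is a minimal illustration under stated assumptions, not the procedure used for the reported results: the "true" model is taken to be full independence of three lists, the interval is the equal-tailed posterior interval for N under that same independence model with $P(N) \propto 1/N$ and a Beta(1,1) prior on each list's capture probability, and there is no averaging over models. The capture probabilities are borrowed from table 6 purely as plausible values, and all function names are hypothetical.

```python
import math
import random

def posterior_interval(n_seen, list_counts, level=0.95, grid_factor=4):
    """Equal-tailed posterior interval for N under the independence model.

    Uses P(N) proportional to 1/N and Beta(1,1) priors on each list's
    capture probability, evaluated on the grid n_seen .. grid_factor*n_seen.
    Assumes n_seen >= 1.
    """
    tail = (1.0 - level) / 2.0
    grid = range(n_seen, grid_factor * n_seen + 1)
    log_post = []
    for N in grid:
        # prior 1/N times the N-dependent multinomial term N!/(N-n)!
        lp = -math.log(N) + math.lgamma(N + 1) - math.lgamma(N - n_seen + 1)
        for d in list_counts:
            # Beta(1,1) integral of theta^d (1-theta)^(N-d), up to a constant
            lp += math.lgamma(N - d + 1) - math.lgamma(N + 2)
        log_post.append(lp)
    top = max(log_post)
    weights = [math.exp(lp - top) for lp in log_post]
    total = sum(weights)
    lower = upper = None
    cdf = 0.0
    for N, w in zip(grid, weights):
        cdf += w / total
        if lower is None and cdf >= tail:
            lower = N
        if cdf >= 1.0 - tail:
            upper = N
            break
    return lower, upper

def simulate_counts(N, thetas, rng):
    """Simulate independent captures; return (number seen, per-list counts)."""
    counts = [0] * len(thetas)
    n_seen = 0
    for _ in range(N):
        caught = [rng.random() < t for t in thetas]
        if any(caught):
            n_seen += 1
            for j, hit in enumerate(caught):
                counts[j] += hit
    return n_seen, counts

def estimate_coverage(N, thetas, reps, rng):
    """Proportion of simulated data sets whose interval contains the true N."""
    hits = 0
    for _ in range(reps):
        n_seen, counts = simulate_counts(N, thetas, rng)
        low, high = posterior_interval(n_seen, counts)
        hits += low <= N <= high
    return hits / reps

if __name__ == "__main__":
    rng = random.Random(1)
    print(estimate_coverage(700, [0.699, 0.284, 0.258], reps=200, rng=rng))
```

Repeating the call to `estimate_coverage` over a grid of N values gives one coverage curve; other interval methods, or a different true model, would be substituted for `posterior_interval` and `simulate_counts` respectively.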

The coverage probability has a different interpretation for the Bayesian models than for the frequentist models. In the Bayesian framework, it can be seen as a crude measure of how far wrong our posterior might be if the mechanism producing the data is considerably different from our prior. For the frequentist approaches, it shows what the true confidence level is for selecting a model and then calculating an interval estimate. Ideally, a frequentist routine would be calibrated to compensate for the model selection process, the confidence level would be It is not what the approach do ideally in this context; mechanism, we

and Jeffreys prior for N. The frequentist approaches use a backwards deletion approach to model selection, which begins with a no-third-order interaction model and deletes the least significant link one at a time until the likelihood ratio between the current model and the best smaller model is not significant at the level. A which compares all models does not provide a noticeable improvement in coverage.

The interval estimates were taken to be the 2.5th and the 97.5th percentiles of the posterior in the Bayesian approach, and 95% confidence intervals based upon asymptotic theory (as in Bishop et al, 1975) and upon a goodness-of-fit statistic (proposed in Regal and Hook, 1984). Figures 10 through 15 give a comparison of the coverage achieved by the Bayesian and frequentist approaches for various true models, for a range of possible values of N. 600 simulated data sets were generated for each combination of N and true model; each algorithm was given the same simulated data sets.

We note that the Bayesian model averaging approach consistently outperforms both of the frequentist approaches to constructing interval estimates. Neither approach performs adequately when the value of N is fairly extreme and the model is very flexible, but it would be surprising if any method could perform well under such circumstances.

In the frequentist context, abandoning a model selection procedure and going with a saturated model (here, the no-third-order interaction model) is not a useful procedure; the confidence intervals generated for this model are incredibly large, and grossly overshoot the desired coverage probability for reasonable values of the parameter.

A way of summarizing the information in figures 10 through 15 has been put forward by Rubin and Schenker (1986). They suggest selecting some distribution g(N) of "typical" values for the unknown parameter, and looking at the average coverage with respect to g:

$$\gamma_g = \sum_N C(N)\, g(N), \tag{14}$$

where C(N) is the coverage probability at parameter value N. This average coverage $\gamma_g$ would be of interest in both the Bayesian and frequentist frameworks and would provide a unified way of comparing the performance of the two.

A reasonable choice of g(·) would be the prior for N that was derived from historical information. We purposely did not use our historical prior in the estimation of the coverage probabilities above, so that we could use it to summarize coverage for the different techniques without biasing things in favor of the Bayesian approach. Values of $\gamma_g$ from this experiment are given in table 15.
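The average coverage in (14) is a weighted sum and is simple to compute once coverage has been estimated on a grid of N values. The sketch below assumes that C(N) and g(N) are supplied as dictionaries keyed by N and renormalizes g over the grid actually used; the numbers are hypothetical and are not taken from the experiments reported here.

```python
def average_coverage(coverage, g):
    """Weighted average of coverage[N] with weights g[N], as in (14).

    The weights are renormalized over the N values present in `coverage`,
    so g need not sum to one on the grid.
    """
    total_weight = sum(g[N] for N in coverage)
    return sum(coverage[N] * g[N] for N in coverage) / total_weight

# Hypothetical coverage estimates and weights on a coarse grid of N.
coverage_by_N = {600: 0.90, 800: 0.95, 1000: 0.80}
g_by_N = {600: 0.25, 800: 0.50, 1000: 0.25}
gamma_g = average_coverage(coverage_by_N, g_by_N)
print(gamma_g)
```

A method whose coverage collapses only where g puts little mass is penalized lightly by this summary, which is the sense in which g should reflect "typical" values of N.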

[Figures 10 through 15: coverage as a function of N for each true model. Legend in each panel: Bayesian model avg; frequentist asymptotic; frequentist G^2; desired 95% level. Horizontal axis: N, 600 to 1400.]

                   Bayesian          Frequentist
True model         model avg      Asym.      G^2
[BDR]                0.380        0.254     0.230
[BD][BR][DR]         0.524        0.412     0.272
[BR][DR]             0.945        0.765     0.750
[BD][BR]             0.805        0.638     0.627
[BD][DR]             0.938        0.780     0.756
[B][DR]              0.987        0.922     0.901
[B][D][R]            0.982        0.920     0.913

Table 15: Coverage probabilities for different methods of creating interval estimates, averaged with respect to the informative prior for N

6 Discussion and Extensions

We have presented a framework for estimating the size of a population by comparing multiple administrative lists or captures; we allow the lists to have a wide variety of dependence relationships, but avoid selecting a particular model by considering a prior which is a mixture over all models within a certain class.

This method is shown to be tractable for a reasonable range of problems (although a combination of missing data, many lists, and a covariate with several levels will render it infeasible). Computation times range from less than 30 seconds for the spina bifida example, to around an hour for the five-dimensional Massachusetts birth defect example on a Sun Sparc2. Performance was measured by coverage properties, and the method presented here was shown to have better coverage properties than any other published method.

The hyper-Dirichlet priors of Dawid and Lauritzen (1989) prove very useful in this context: they allow the same range of independence assumptions available in conventional decomposable log-linear models, and calculations of posteriors are straightforward because of the conjugacy of these priors with complete multinomial data. This ease of computing makes it feasible to perform the calculations necessary to average across models.

Numerous extensions and variations upon For ifthere is a great deal of prior not prove flexibleenough to r"fi""f

wouldwere k COlnponenl:s

conflict with this hypothesized ordering (as done in York and Madigan, 1992).

Regarding the coverage experiments, it would also be of interest to find out how model selection techniques such as AIC, BIC and ABIC perform against the mixture methodology presented here.

With regard to the applications, it would seem that there is substantial under-reporting in some of the record systems considered. Modern computer data storage methods have likely improved upon these shortcomings, but human error cannot be eliminated altogether. Present day data storage and manipulation should make comparison of multiple record systems easier; one has to wonder if it is not done in practice because the statistical literature has been unable to provide reliable methodologies which are able to cope with the uncertainty and subtle dependencies that this kind of data exhibits.

It would be very interesting to apply these methods to problems where more complicated, subtle, and common sampling schemes were used, such as stratified or clustered sampling.

Bibliography

Abramowitz, Milton and Stegun, Irene, eds. (1984) Pocketbook of Mathematical Functions. Frankfurt, Germany: Thun.

Berger, James O. and Bernardo, Jose M. (1992) Ordered group reference priors with application to the multinomial problem. Biometrika, 79, 25-37.

Bernardo, Jose M. (1979) Reference posterior distributions for Bayesian inference (with discussion). J. Roy. Statist. Soc. B, 41, 113-147.

Besag, Julian E. (1986) On the statistical analysis of dirty pictures (with discussion). Journal of the Royal Statistical Society Series B, 48, 259-302.

Bishop, Y.M.M., Fienberg, S.E. and Holland, P.W. (1975) Discrete Multivariate Analysis. Cambridge, Mass.: MIT Press.

Box, George E. P. and Tiao, George C. (1973) Bayesian Inference in Statistical Analysis. Addison-Wesley: Reading, MA.

Castledine, B. J. (1981) A Bayesian analysis of multiple-recapture sampling for a closed population. Biometrika, 68, 197-210.

Chao, Anne, Lee, S.M. and Jeng, S.L. (1992) Estimating population size for capture-recapture data when capture probabilities vary by time and individual animal. Biometrics, 48, 201-216.

Darroch, J.N. (1958) The multiple-recapture census I. Estimation of a closed population. Biometrika, 45, 343-359.

Darroch, J. N., Lauritzen, S.L. and Speed, T.P. (1980) Markov fields and log-linear interaction models for contingency tables. Ann. Stat., 8, 522-539.

Dawid, A.P. (1979) Conditional independence in statistical theory (with discussion). Journal of the Royal Statistical Society Series B, 41, 1-31.

Dawid, A.P. and Lauritzen, S.L. (1989) Markov distributions, hyper-Markov laws and meta-Markov models on decomposable graphs, with applications to Bayesian learning in expert systems. BAIES Report BR-10.

Draper, D., Hodges, J.S., Leamer, E.E., Morris, C.N. and Rubin, D.B. (1987) A research agenda for assessment and propagation of model uncertainty. Rand Note N-2683-RC, The RAND Corporation, Santa Monica, California.

El-Khorazaty, M. Nabil, Imrey, Peter B., Koch, G. and Bradley, H. (1977) Estimating the total number of events with data from multiple-record systems: a review of methodological strategies. International Statistical Review, 45, 129-157.

~lerll)er',g. S.E.tables.

A l1Url1erlcaJ con1panscm [)c,twe'::l1 sequcmha! Lag,J;!lIg,

W.R. and337-348.

--- (1975) The Bayes factor against equiprobability of a multinomial population assuming a symmetric Dirichlet prior. Ann. Stat., 3, 246-250.

--- (1976) On the application of symmetric Dirichlet distributions and their mixtures to contingency tables. Annals of Statistics, 4, 1159-1189.

--- (1983) The robustness of a hierarchical model for multinomials and contingency tables. In Scientific Inference, Data Analysis and Robustness, Box, G.E.P., Leonard, Tom and Wu, Chien-fu, eds. New York: Academic Press.

Hartigan, J. A. (1983) Bayes Theory. Springer-Verlag: New York.

Hodges, J.S. (1987) Uncertainty, policy analysis and statistics. Statistical Science, 2, 259-291.

Hook, Ernest B. (1992) Private communication.

Hook, Ernest B., Albright, Susan G. and Cross, Philip K. (1980) Use of Bernoulli census and log-linear methods for estimating the prevalence of Spina Bifida in livebirths and the completeness of vital record reports in New York state. American Journal of Epidemiology, 112, 750-758.

Human Services Research Institute (1985) Summary of Data on Handicapped Children and Youth. U.S. Government Printing Office, Washington D.C.

Jeffreys, H. (1946) An invariant form for the prior probability in estimation problems. Proc. R. Soc. London A, 186, 453-461.

Kahn, William D. (1987) A cautionary note for Bayesian estimation of the binomial parameter n. The American Statistician, 41, 38-40.

Kiiveri, Hani, Speed, T.P. and Carlin, J.B. (1984) Recursive causal models. J. Aust. Math. Soc. Ser. A, 36, 30-52.

Lauritzen, S.L., Speed, T.P. and Vijayan, K. (1984) Decomposable graphs and hypergraphs. J. Aust. Math. Soc. Ser. A, 36, 12-29.

Lindley, Dennis V. (1964) The Bayesian analysis of contingency tables. Ann. Math. Stat., 35, 1622-1643.

Madigan, D. and Raftery, A.E. (1991) Model selection and accounting for model uncertainty in graphical models using Occam's window. Technical Report, Department of Statistics, University of Washington.

Madigan, D. and York, J. (1992) Bayesian graphical models. Manuscript in preparation.

Menkens, George E. and Anderson, Stanley H. (1988) Estimation of small-mammal population size. Ecology, 69, 1952-1959.

Moulton, Brent R. (1989) A Bayesian approach to regression selection and estimation, with application to a price index for radio services. Working Bureau of Labor of Labor, Washington D.C.

A. John anddistribution. Journal the

D.L., K.P.on animal popnJiat.lons.

Some OhQel'V>lj',lOllS [nsf.

Department

Pollock, Kenneth H. and Otto, Mark C. (1983) Robust estimation of population size in closed animal populations from capture-recapture experiments. Biometrics, 39, 1035-1049.

Raftery, Adrian E. (1988a) Inference for the binomial N parameter: a hierarchical Bayes approach. Biometrika, 75, 223-228.

--- (1988b) Approximate Bayes factors for generalised linear models. Technical Report, Department of Statistics, University of Washington.

Regal, Ronald R. and Hook, Ernest B. (1984) Goodness-of-fit based confidence intervals for estimates of the size of a closed population. Statistics in Medicine, 3, 355-363.

--- (1991) The effects of model selection on confidence intervals for the size of a closed population. Statistics in Medicine, 10, 717-721.

Rissanen, Jorma (1983) A universal prior for integers and estimation by minimum description length. Annals of Statistics, 11, 416-431.

Rodrigues, J., Bolfarine, H. and Leite, J.G. (1988) A Bayesian analysis in closed animal populations from capture recapture experiments with trap response. Communications in Statistics Simulation, 17, 407-430.

Seber, G.A.F. (1982) The Estimation of Animal Abundance and Related Parameters (2nd ed.). London: Charles W. Griffin.

--- (1986) A review of estimating animal abundance. Biometrics, 42, 267-292.

Smith, A.F.M. and Roberts, G. O. (1993) Bayesian computation via the Gibbs sampler and related Markov chain Monte Carlo methods. J. R. Statist. Soc. B, to appear.

Smith, Phillip J. (1988) Bayesian methods for multiple capture-recapture surveys. Biometrics, 44, 1177-1189.

--- (1990) Bayesian methods for prevalence estimation from incomplete administrative lists. Statistics in Medicine, 10, 113-118.

--- (1991) Bayesian analyses for a multiple capture-recapture model. Biometrika, 78, 399-407.

Spiegelhalter, D.J. and Lauritzen, S.L. (1990) Sequential updating of conditional probabilities on directed graphical structures. Networks, 20, 579-605.

Tierney, Luke and Kadane, Joseph (1986) Accurate approximations for posterior moments and marginal densities. Journal of the American Statistical Association, 81, 82-86.

West, M. (1985) Generalized linear models: scale parameters, outlier accommodation and prior distributions. In Bayesian Statistics 2, Bernardo, J.M. et al, eds. North-Holland: Elsevier Science Publishers.

Whittaker, J. (1990) Graphical Models in Applied Multivariate Statistics. Chichester: Wiley.

Janet T. Estimation of populatIOn size::;tcttlSitlcs, Harvard Boston.

PhD (!Jssprb"tinn

Wolter, Kirk M. (1986) Some coverage error models for census data. JASA, 81, 338-346.

York, J. (1992) Bayesian Methods for the Analysis of Misclassified or Incomplete Multivariate Discrete Data (in preparation). Department of Statistics, University of Washington.

York, J. and Madigan, D. (1992) Bayesian methods for double sampling. Manuscript in preparation.

York, Jeremy C., Madigan, David, Heuch, Ivar and Lie, Rolv Terje (1992) Estimating a Proportion of Birth Defects by Double Sampling: A Bayesian Approach Incorporating Covariates and Model Uncertainty. Submitted for publication.

A Determining the Form of a Hyper-Dirichlet Distribution

In this section, we will describe how to write the hyper-Dirichlet distribution for a particular chordal graph, and compare these distributions to the Jeffreys priors. For more information and theoretical motivation, the reader is referred to section 7.2 of Dawid and Lauritzen (1989). The discussion here is challenging notationally, but not mathematically.

A.1 Construction of a hyper-Dirichlet

Without loss of generality, we assume that the graph is connected; that is, it is possible to move between any two variables by following edges in the graph. For a graph with two or more disjoint subgraphs, one can apply the discussion below to each subgraph to find its hyper-Dirichlet distribution; the $\theta$'s for the different subgraphs will be independent, so the joint density is just a product of the individual hyper-Dirichlet densities.

We assume that we have some ordering $(C_1, \ldots, C_k)$ of the cliques, where $C_i$ is the set of possible configurations of the variables in the ith clique. As a notational shortcut, $v(C_i)$ will be the list of variables in the ith clique.

For $2 \le i \le k$, we define the separator $S_i$ to be the set of configurations of the variables in $v(C_i) \cap v(\cup_{j=1}^{i-1} C_j)$; that is, the variables in $S_i$ are those variables which are in $C_i$ as well as in $C_j$ for some $j < i$. For some configuration $s_i$ of the variables in $S_i$, we write $C_i \wedge s_i$ for the elements of $C_i$ which have the same values as $s_i$ for those variables common to both $C_i$ and $S_i$.

The ordering of cliques $(C_1, \ldots, C_k)$ is said to be perfect if, for every $i \ge 2$, $v(S_i) \subset v(C_j)$ for some $j \in \{1, \ldots, i-1\}$. In other words, the ordering is perfect if the intersection between any clique and all the cliques preceding it in the ordering is contained within one of those preceding cliques. A perfect ordering for a graph exists iff it is chordal; see Golumbic (1980).
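The definitions of separator and perfect ordering translate directly into a short check. The sketch below works with the variable sets $v(C_i)$ only (not with configurations), represents each clique as a Python set of variable names, and uses hypothetical function names; the first example ordering is the chain of cliques DH, HO, OR, RS that appears among the models for the birth defect data.

```python
def separators(cliques):
    """Return the variable sets v(S_i), i = 2..k, for an ordered clique list."""
    seps = []
    seen = set(cliques[0])
    for clique in cliques[1:]:
        # variables of this clique that already occurred in an earlier clique
        seps.append(frozenset(clique) & frozenset(seen))
        seen |= set(clique)
    return seps

def is_perfect(cliques):
    """True if every separator lies inside a single preceding clique."""
    for i, sep in enumerate(separators(cliques), start=1):
        # cliques[:i] are exactly the cliques that precede clique number i+1
        if not any(sep <= frozenset(c) for c in cliques[:i]):
            return False
    return True

chain = [{"D", "H"}, {"H", "O"}, {"O", "R"}, {"R", "S"}]
reordered = [{"D", "H"}, {"O", "R"}, {"H", "O"}, {"R", "S"}]
print(is_perfect(chain), is_perfect(reordered))
```

In the reordered list the third clique {H, O} meets the earlier cliques in {H, O}, which is not contained in either {D, H} or {O, R}, so that ordering fails the condition even though the underlying graph is chordal.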

To construct a hyper-Dirichlet distribution, begin by placing a Dirichlet distribution with parameters $\alpha_{C_1}$ on $\theta_{C_1}$, the marginal table for clique $C_1$; we have

, for i >

-1

so that the marginal distribution of $\theta_{S_i}$ is the same whether it is derived from $\theta_{C_j}$ or $\theta_{C_i}$.

We next derive the distribution of $\theta_{C_i}$ given $\theta_{C_1}, \ldots, \theta_{C_{i-1}}$ for $i > 1$. The distribution of $\theta_{C_i}$ given all previous cliques depends only upon $\theta_{S_i}$ (this is due to the fact that $\theta$ is strong hyper Markov; for a definition and proof, see Dawid and Lauritzen, 1989). Conditioning upon $\theta_{S_i}$ imposes the following constraint on $\theta_{C_i}$:

$$\sum_{c_i \in C_i \wedge s_i} \theta_{c_i} = \theta_{s_i}. \tag{16}$$

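The density of the Dirichlet for $\theta_{C_i}$ conditional upon $\theta_{S_i}$ is easily derived, and is given by

(17)

for $\theta_{C_i}$ satisfying (16). Now, we have

$$f(\theta) = f(\theta_{C_1}) \prod_{i=2}^{k} f(\theta_{C_i} \mid \theta_{C_1}, \ldots, \theta_{C_{i-1}}) \tag{18}$$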

where

i=lm

\fJ(a) IT ITj=2 SJ

Tn

(19)

IT IT f( )

Comparing the prior in (19) with the likelihood function for data D,

(20)

A.2 Jeffreys versus hyper-Dirichlet distributions

Jeffreys (1946) proposed a procedure for generating priors for Bayesian analyses in which little or no prior information is present. The original argument for the procedure was that it produced a prior that was invariant under transformations of the parameter. At around the same time, Perks (1947) proposed the same prior for a success probability of binary data by requiring the density to be inversely proportional to the standard error of an efficient estimator. Box and Tiao (1973) justify the Jeffreys prior in a simple case by finding a transformation of the likelihood such that changes in the data translate the likelihood curve but do not change its shape. They argue that a non-informative prior should have a density that is uniformly distributed over the transformed parameter space. The reference prior approach of Bernardo (1979) (see also Berger and Bernardo, 1992), if applied to a problem without nuisance parameters, will produce a Jeffreys prior. Their approach is motivated by examining the expected increase in information that will be obtained by conducting an experiment $\varepsilon$ which produces data x when the prior for $\theta$ is $\Pr(\theta)$:

$$I^{\theta}(\varepsilon, \Pr(\theta)) = \int\!\!\int \log\left(\frac{\Pr(\theta \mid x)}{\Pr(\theta)}\right) d\Pr(\theta \mid x)\, d\Pr(x).$$

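The reference prior is chosen to maximize this value.

Regardless of the justifications, the form of the Jeffreys prior is

$$\Pr(\theta) \propto |I(\theta)|^{1/2},$$

where $I(\theta)$ is the Fisher information matrix,

$$I(\theta)_{ij} = E_{\theta}\left( \left(\frac{\partial}{\partial \theta_i} \log f(x, \theta)\right) \left(\frac{\partial}{\partial \theta_j} \log f(x, \theta)\right) \right).$$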

If a hyper-Dirichlet is constructed for a graph without intersecting cliques, then the density of $\theta$ will be a product of Dirichlet densities; it is straightforward to show that the Jeffreys prior in such a case is formed by taking the product of the Jeffreys prior for each marginal Dirichlet (the Jeffreys prior for a Dirichlet sets all parameters equal to 1/2). However, if there are intersections, matters are more complicated.

For illustration, we will consider a distribution on three binary variables whose two cliques intersect in a single variable, B.

Under this parameterization, we find that the Jeffreys density for $\theta$ is

( )

1/2

P1'(0) ex l1 Oi:~;~k

Although very similar to the form of the density of a hyper-Dirichlet, this Jeffreys prior is not a hyper-Dirichlet. To illustrate this, let us write the density for a symmetric hyper-Dirichlet with parameter 1/2 in each cell of each marginal Dirichlet:

(1 ) 1/2

Pr(0) ex Uk (Oij.O.jkt,),

It is apparent that, although we have matched the denominator of the Jeffreys prior, we cannot also match the numerator. The Jeffreys prior is like a symmetric hyper-Dirichlet with parameter 1/2, but the parameter of the intersection $(\alpha_{\cdot 0 \cdot}, \alpha_{\cdot 1 \cdot})$ is set to $(0, 0)$, and thus violates the constraint (15) given above.

This counter-example demonstrates that not all Jeffreys priors on multinomial probabilities will be hyper-Dirichlet.

B Is the Posterior of N Proper?

In this section, we will derive conditions for the priors on $N$ and the cell probabilities, such that the posterior for $N$ is a proper distribution and has certain finite moments. The discussion applies equally to estimation of $N$ in the binomial or multinomial case; the standard beta and Dirichlet priors are a subset of the hyper-Dirichlets considered here.

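We will make use of the following fact about infinite series: if $\{a_N\}$ and $\{b_N\}$ are sequences of positive terms such that $\lim a_N / b_N$ exists and is positive, then $\sum_{1}^{\infty} a_N$ and $\sum_{1}^{\infty} b_N$ either both converge or both diverge.

Additionally, we use the following fact about the $\Gamma$-function: for fixed $x$ and $y$,

$$\lim_{N \to \infty} \frac{\Gamma(N + x)}{\Gamma(N + y)\, N^{x - y}} = 1. \qquad (21)$$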

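To see this, take the limit of expression 6.1.47 in Abramowitz and Stegun (1984).

Kahn (1987) used a result similar to (21) to informally examine the behavior of the tail of the posterior distribution of $N$ in a binomial experiment. The posterior takes the form

$$P(N \mid D) \propto \frac{\Gamma(N + 1)}{\Gamma(N - n + 1)} \, \frac{\Gamma(N - n + \delta)}{\Gamma(N + k\delta)} \, P(N). \qquad (22)$$

To determine conditions under which (22) is proper and has finite moments, we look at sums of the form $\sum_{N=n}^{\infty} a_N$, where

$$a_N = \frac{\Gamma(N + 1)}{\Gamma(N - n + 1)} \, \frac{\Gamma(N - n + \delta)}{\Gamma(N + k\delta)} \, N^{r} P(N)$$

for some power $r$.

Lemma 1 If $s$ is such that $\sum N^{s} P(N) < \infty$, then the series $\sum a_N$ converges if $\delta > (r - s)/(k - 1)$.

Proof. Define

$$b_N = N^{r - (k-1)\delta} P(N),$$

and note that $\sum_{N=n}^{\infty} b_N$ exists if $\delta \ge (r - s)/(k - 1)$. Applying (21) to the sequence $a_N / b_N$, we find that

$$\lim_{N \to \infty} \frac{a_N}{b_N} = \lim_{N \to \infty} \frac{\Gamma(N + 1)\, N^{-n}}{\Gamma(N - n + 1)} \times \frac{\Gamma(N - n + \delta)\, N^{n + (k-1)\delta}}{\Gamma(N + k\delta)} = 1. \qquad (23)$$

Since this sequence converges, we have that $\sum_{N=n}^{\infty} a_N$ exists if $\sum_{N=n}^{\infty} b_N$ exists. □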
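The limit used in the proof of Lemma 1 can be checked numerically. The following is a rough sketch, assuming the sequences have the forms $a_N = \Gamma(N+1)\Gamma(N-n+\delta) N^r P(N) / (\Gamma(N-n+1)\Gamma(N+k\delta))$ and $b_N = N^{r-(k-1)\delta} P(N)$, with $\delta$ denoting the symmetric prior parameter. The prior factor $P(N)$ cancels in the ratio and is omitted; the function names and default values are hypothetical.

```python
import math

def log_a(N, n, k, delta, r):
    """Log of a_N without the prior factor P(N), via log-gamma for stability."""
    return (math.lgamma(N + 1) - math.lgamma(N - n + 1)
            + math.lgamma(N - n + delta) - math.lgamma(N + k * delta)
            + r * math.log(N))

def log_b(N, k, delta, r):
    """Log of b_N = N^(r - (k - 1) * delta) without the prior factor P(N)."""
    return (r - (k - 1) * delta) * math.log(N)

def ratio(N, n=20, k=2, delta=1.0, r=1):
    """a_N / b_N, which should approach 1 as N grows."""
    return math.exp(log_a(N, n, k, delta, r) - log_b(N, k, delta, r))

for N in (10 ** 2, 10 ** 4, 10 ** 6):
    print(N, ratio(N))
```

With the default values (`k=2`, `delta=1.0`) the ratio simplifies to $N/(N+1)$, which makes the convergence to 1 easy to confirm by hand.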

For various choices of the prior for N, application of Lemma 1 yields the following results.

Proposition 1 If $P(N) \propto 1$, then the posterior $P(N \mid D)$ is proper iff $\delta > 1/(k - 1)$, and the expectation $E(N^{r} \mid D)$ exists iff $\delta > (1 + r)/(k - 1)$.

This result is if and only if: "if" by application of Lemma 1 with any $s < -1$; "only if" by noting that $\sum b_N$ diverges for any $s \ge -1$.

Proposition 2 If $P(N) \propto 1/N$, then the posterior $P(N \mid D)$ is proper for any $\delta > 0$. The expectation $E(N^{r} \mid D)$ exists iff $\delta > r/(k - 1)$.

The justification is the same as for the previous proposition, with convergence for any $s < 0$ and divergence for any $s \ge 0$.
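The boundary case of Proposition 2 can be illustrated with partial sums. This is a minimal sketch, assuming the summand is $N^r$ times the unnormalized posterior, $\Gamma(N+1)\Gamma(N-n+\delta) N^r P(N) / (\Gamma(N-n+1)\Gamma(N+k\delta))$, with $P(N) \propto 1/N$, $k = 2$ and $\delta = 1$. The value `n=20` and the function names are hypothetical choices made for illustration.

```python
import math

def term(N, n, k, delta, r, prior_power):
    """N^r times the unnormalized posterior, with P(N) proportional to
    N^(-prior_power)."""
    log_t = (math.lgamma(N + 1) - math.lgamma(N - n + 1)
             + math.lgamma(N - n + delta) - math.lgamma(N + k * delta)
             + (r - prior_power) * math.log(N))
    return math.exp(log_t)

def partial_sum(upper, n=20, k=2, delta=1.0, r=0, prior_power=1):
    """Sum of the terms for N = n, ..., upper."""
    return sum(term(N, n, k, delta, r, prior_power)
               for N in range(n, upper + 1))

for upper in (10 ** 3, 10 ** 4, 10 ** 5):
    # r = 0: the sums settle down, so the posterior is proper.
    # r = 1: the sums keep growing like log(upper), so the first moment
    # does not exist, since delta = 1 is not greater than r / (k - 1) = 1.
    print(upper, partial_sum(upper, r=0), partial_sum(upper, r=1))
```

In this special case the terms reduce to $1/(N(N+1))$ for $r = 0$ and $1/(N+1)$ for $r = 1$, a telescoping series and a harmonic series respectively.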

prz01'

for $N > n_{\max} = \max(n_1, \ldots, n_l)$, where $n = \sum_i n_i$. If we write

$$c_N = \frac{\Gamma(lN - n + \delta)}{\Gamma(lN + k\delta)} \times \prod_{i=1}^{l} \frac{\Gamma(N + 1)}{\Gamma(N - n_i + 1)} \, N^{r} P(N),$$

then we have an analogous result to Lemma 1.

Lemma 2 If $s$ is such that $\sum N^{s} P(N) < \infty$, then the series $\sum c_N$ converges if $\delta > (r - s)/(k - 1)$.

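Proof. The sequence $b_N$ remains as before, and the series $\sum_{N=n_{\max}}^{\infty} b_N$ converges iff $\delta > (r - s)/(k - 1)$. Examining the sequence $c_N / b_N$, we have

$$\lim_{N \to \infty} \frac{c_N}{b_N} = \lim_{N \to \infty} \frac{\Gamma(lN - n + \delta)\, N^{n + (k-1)\delta}}{\Gamma(lN + k\delta)} \times \prod_{i=1}^{l} \frac{\Gamma(N + 1)\, N^{-n_i}}{\Gamma(N - n_i + 1)} = l^{-(n + (k-1)\delta)},$$

which is finite and positive.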

Thus, $\sum_{N=n_{\max}}^{\infty} c_N$ converges if $\sum_{N=n_{\max}}^{\infty} b_N$ converges. □

Application of this lemma demonstrates that the conditions derived above apply to the binomial case with multiple observations. In particular, Raftery (1988a) takes $P(N) \propto 1/N$ and $\delta = 1$; the posterior for this distribution is proper, but it does not have a finite first moment. However, the Bayes estimate for the relative squared error only involves $E(N^{-1} \mid D)$ and $E(N^{-2} \mid D)$, both of which are well defined for any positive $\delta$.
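The estimate mentioned above can be sketched numerically. Under relative squared error loss, the expected loss $E\{((\hat{N} - N)/N)^2 \mid D\}$ is minimized at $\hat{N} = E(N^{-1} \mid D) / E(N^{-2} \mid D)$. The code below is an illustration under stated assumptions, not the authors' implementation: the posterior weight of $N$ is assumed proportional to $\Gamma(lN - n + \delta)/\Gamma(lN + k\delta) \times \prod_i \Gamma(N+1)/\Gamma(N - n_i + 1) \times 1/N$, the infinite sums are truncated at a hypothetical cutoff `upper`, and the counts are made up.

```python
import math

def log_weight(N, counts, delta=1.0, k=2):
    """Log of the unnormalized posterior weight of N, assuming l binomial
    counts and the prior P(N) proportional to 1/N."""
    l = len(counts)
    n = sum(counts)
    value = math.lgamma(l * N - n + delta) - math.lgamma(l * N + k * delta)
    for n_i in counts:
        value += math.lgamma(N + 1) - math.lgamma(N - n_i + 1)
    return value - math.log(N)

def relative_squared_error_estimate(counts, delta=1.0, k=2, upper=20000):
    """Approximate E(1/N | D) / E(1/N^2 | D), truncating the sums at `upper`."""
    support = range(max(counts), upper + 1)
    logs = [log_weight(N, counts, delta, k) for N in support]
    shift = max(logs)  # subtract the largest log-weight to avoid underflow
    weights = [math.exp(v - shift) for v in logs]
    inv_first = sum(w / N for w, N in zip(weights, support))
    inv_second = sum(w / N ** 2 for w, N in zip(weights, support))
    return inv_first / inv_second

counts = [16, 18, 22, 25, 27]  # hypothetical counts, not data from the report
print(relative_squared_error_estimate(counts))
```

Both truncated sums have summands that decay at least as fast as $N^{-3}$ here, so the result is insensitive to the cutoff, in contrast to the posterior mean, which does not exist when $\delta = 1$.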

Next, let us consider the multinomial estimation problem with a hyper-Dirichlet prior on a graph consisting of disjoint cliques. Each clique will have a Dirichlet prior for the table of joint probabilities of the variables in that clique, and none of the machinery presented in the previous section is necessary in order to determine the nature of the posterior. If we choose a symmetric prior, such that the amount of hypothetical data for each clique is the same, we have

$$P(N \mid D) \propto \frac{\Gamma(N + 1)}{\Gamma(N - n + 1)} \times \prod_{i} \frac{\Gamma(N - n_i + k\delta/k_i)}{\Gamma(N + k\delta)} \times P(N),$$

where $k$ is the number of configurations of the largest clique, $k_i$ is the number of configurations of the $i$th clique, and $n_i$ is the number of successes observed in the $i$th clique. Also, write $n$ for the total number of successes. Note that

$$\sum_i n_i \ge n$$

when the numbers of successes in the individual cliques are added up; the two quantities are equal if every success was only a success on the variable(s) in one clique. Thus, we can write

$$r \le s + k\delta \sum_i \left(1 - \frac{1}{k_i}\right)$$

for the powers of $N$ that are guaranteed to have a proper posterior. For most real data, $\sum_i n_i - n$ will be fairly large, ensuring existence of considerably higher moments.

Finally, for a general hyper-Dirichlet, we have

$$P(N \mid D) \propto \frac{\Gamma(N + 1)}{\Gamma(N - n + 1)\, \Psi(\alpha + D)} \, P(N) \propto \frac{\Gamma(N + 1)}{\Gamma(N - n + 1)} \times \frac{\prod_{i} \Gamma(N - n_i + k\delta/k_i)}{\Gamma(N + k\delta) \prod_{j>1} \Gamma(N - m_j + k\delta/l_j)} \times P(N),$$

where $m_j$ is the number of successes observed in the $j$th clique intersection $S_j$, and $l_j$ is the number of possible configurations that the variables in that intersection can have. Note that the number of cliques is one more than the number of clique separators, so that there are equal numbers of $\Gamma$ terms in the numerator and denominator. Thus, we can construct a series as before, choosing $b_N$ to converge as the product of these ratios of $\Gamma$ functions.

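If we define

$$b_N = \exp\left( \log(N) \times \left( n + r - \sum_i n_i + \sum_{j>1} m_j + k\delta \left( \sum_i 1/k_i - \sum_{j>1} 1/l_j - 1 \right) \right) \right) P(N),$$

and again choose $s$ such that $E(N^{s})$ exists, then the sum $\sum_N b_N$ converges if

$$n + r - \sum_i n_i + \sum_{j>1} m_j + k\delta \left( \sum_i 1/k_i - \sum_{j>1} 1/l_j - 1 \right) \le s.$$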

This can be seen by the kind of argument given in Lemma 1. The total of the data-dependent terms here will be non-negative, guaranteeing existence of the $r$th posterior moment for any

$$r \le s + k\delta \left( \sum_{j>1} 1/l_j - \sum_i 1/k_i + 1 \right).$$

To see this, note that $n_1$ is the number of successes observed upon the first clique. $n_2 - m_2$ can be broken into two parts: individuals with a success on a variable in $C_2$ and failures on all variables in $C_1$ (new discoveries) [...] on a variable in [...]
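The moment bound for a general hyper-Dirichlet is simple arithmetic once the clique and separator sizes are known. The helper below is a hypothetical sketch that evaluates $s + k\delta(\sum_j 1/l_j - \sum_i 1/k_i + 1)$, taking the numbers of configurations of the cliques ($k_i$) and of the clique intersections ($l_j$) as inputs. It assumes that reading of the bound and drops the non-negative data-dependent terms, so it gives a guaranteed rather than a sharp value.

```python
def guaranteed_moment_bound(s, k, delta, clique_configs, separator_configs):
    """Largest power r with a guaranteed finite posterior moment:
    r <= s + k * delta * (sum_j 1/l_j - sum_i 1/k_i + 1).

    s                  power such that sum N^s P(N) is finite
    k                  number of configurations of the largest clique
    delta              symmetric prior parameter
    clique_configs     number of configurations k_i of each clique
    separator_configs  number of configurations l_j of each separator
    """
    separator_sum = sum(1.0 / l_j for l_j in separator_configs)
    clique_sum = sum(1.0 / k_i for k_i in clique_configs)
    return s + k * delta * (separator_sum - clique_sum + 1.0)

# Three binary variables with two cliques of two variables each (4
# configurations per clique) that share one binary variable (2 configurations).
print(guaranteed_moment_bound(0.0, 4, 0.25, [4, 4], [2]))

# Two disjoint binary cliques: the separator is empty and has one configuration.
print(guaranteed_moment_bound(0.0, 2, 1.0, [2, 2], [1]))
```

For disjoint cliques every separator is empty, so each $l_j = 1$ and the expression reduces to $s + k\delta \sum_i (1 - 1/k_i)$.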
