Modeling Concordance Correlation Coefficient for Longitudinal Study Data


PSYCHOMETRIKA, VOL. 75, NO. 1, 99–119, MARCH 2010. DOI: 10.1007/S11336-009-9142-Z

MODELING CONCORDANCE CORRELATION COEFFICIENT FOR LONGITUDINAL STUDY DATA

YAN MA

HOSPITAL FOR SPECIAL SURGERY–WEILL MEDICAL COLLEGE OF CORNELL UNIVERSITY

WAN TANG, QIN YU, AND X.M. TU

UNIVERSITY OF ROCHESTER

Measures of agreement are used in a wide range of behavioral, biomedical, psychosocial, and health-care related research to assess the reliability of diagnostic tests, psychometric properties of instruments, fidelity of psychosocial interventions, and accuracy of proxy outcomes. The concordance correlation coefficient (CCC) is a popular measure of agreement for continuous outcomes. In modern-day applications, data are often clustered, making inference difficult to perform using existing methods. In addition, as longitudinal study designs become increasingly popular, missing data have become a serious issue, and the lack of methods to systematically address this problem has hampered the progress of research in the aforementioned fields. In this paper, we develop a novel approach to tackle the complexities involved in addressing missing data and other related issues for performing CCC analysis within a longitudinal data setting. The approach is illustrated with both real and simulated data.

Key words: diagnostic test, inverse probability weighted estimates, missing data, monotone missing data pattern, U-statistics.

1. Introduction

Measures of agreement are widely used in biomedical and psychosocial research for disease diagnoses, assessment and validation of psychometric properties of instruments, and evaluation of treatment fidelity (Bauer & Kennedy, 1981; Chandler, Martin, Girman, Ross, Love-McClung, Lydick, & Yawn, 1998; Costa, Arnould, Cour, Boyer, Marrel, Jaudinot, & Solesse de Gendre, 2003; Westgard & Hunt, 1973). For example, in biomedical research, it is often of interest to compare a newly developed diagnostic method with a reference standard to determine its diagnostic accuracy. For instance, in cough frequency studies, digital audio recording is compared to manual cough counting to see if it is accurate enough to replace the latter standard practice, which is time-consuming and laborious (Paul, Wai, Jewell, Shaffer, & Varadan, 2006). In HIV prevention research, frequencies of risky behaviors such as unprotected vaginal sex reported over monthly or quarterly intervals are often compared to those from daily diary to help assess the degree of cognitive bias in such proxy outcomes (Morrison-Beedy, Carey, & Tu, 2006; Schroder, Carey, & Vanable, 2003). The concordance correlation coefficient (CCC) is a popular index for assessing agreement for continuous outcomes as well as those that can be so treated such as count response (Lin, 1989).

The CCC between two outcomes, $y_1$ and $y_2$, from two assessment methods or observers' ratings is defined as

$$
\rho = 1 - \frac{E(y_1 - y_2)^2}{E_{\mathrm{indep}}(y_1 - y_2)^2} = \frac{2\sigma_{12}}{\sigma_1^2 + \sigma_2^2 + (\mu_1 - \mu_2)^2}, \tag{1}
$$

Requests for reprints should be sent to Yan Ma, Department of Public Health, Hospital for Special Surgery–Weill Medical College of Cornell University, New York, NY 10021, USA. E-mail: yam2007@med.cornell.edu

© 2009 The Psychometric Society

where $\mu_k = E(y_k)$, $\sigma_k^2 = \mathrm{Var}(y_k)$, $\sigma_{12} = \mathrm{Cov}(y_1, y_2)$, and $E_{\mathrm{indep}}$ denotes expectation under independence between $y_k$ ($k = 1, 2$). The CCC $\rho$ in (1) is indeed a correlation coefficient since it ranges between $-1$ and $1$; $\rho = 1$ ($-1$) if the two raters completely agree (disagree) and $\rho = 0$ if $y_k$ are independent. Further, $\rho$ has a nice decomposition, $\rho = \rho_{PM} C_b$, where $\rho_{PM}$ is the product-moment correlation measuring precision, while $C_b$ ($0 \le C_b \le 1$) is a function of scale shift, $\sigma_1/\sigma_2$, and location shift relative to the scale, $(\mu_1 - \mu_2)/\sqrt{\sigma_1 \sigma_2}$, indicating accuracy (Lin, 1989). Unlike $\rho_{PM}$ and other popular association measures such as Spearman's rho, CCC is sensitive to systematic between-rater differences, making it a unique measure of agreement.
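As a rough numerical illustration of (1) and of the decomposition into precision and accuracy, the sketch below evaluates the CCC from a set of population moments and checks that it equals the product of the product-moment correlation and $C_b$. The function names and the moment values are hypothetical, chosen only for illustration; the closed form $C_b = 2/(v + 1/v + u^2)$, with $v = \sigma_1/\sigma_2$ and $u = (\mu_1 - \mu_2)/\sqrt{\sigma_1\sigma_2}$, is the expression given by Lin (1989).

```python
import math

def ccc_from_moments(mu1, mu2, var1, var2, cov12):
    """CCC as defined in (1): 2*cov / (var1 + var2 + (mu1 - mu2)^2)."""
    return 2.0 * cov12 / (var1 + var2 + (mu1 - mu2) ** 2)

def ccc_decomposition(mu1, mu2, var1, var2, cov12):
    """Return (rho_pm, c_b) such that CCC = rho_pm * c_b."""
    sd1, sd2 = math.sqrt(var1), math.sqrt(var2)
    rho_pm = cov12 / (sd1 * sd2)            # precision
    v = sd1 / sd2                           # scale shift
    u = (mu1 - mu2) / math.sqrt(sd1 * sd2)  # location shift relative to scale
    c_b = 2.0 / (v + 1.0 / v + u ** 2)      # accuracy
    return rho_pm, c_b

# Hypothetical moments (not taken from the paper's data).
rho = ccc_from_moments(0.0, 0.5, 1.0, 1.44, 0.9)
rho_pm, c_b = ccc_decomposition(0.0, 0.5, 1.0, 1.44, 0.9)
```

With these inputs the two raters are well correlated, yet the location and scale shifts pull the CCC below the product-moment correlation, which is the sensitivity to systematic differences described above.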

Since its introduction by Lin (1989), CCC has been generalized to address more general types of outcome such as categorical data and complex study designs involving multiple observers and repeated measures (Barnhart, Haber, & Song, 2002; Barnhart & Williamson, 2001; Chinchilli, Martel, Kumanyika, & Lloyd, 1996; King & Chinchilli, 2001; King, Chinchilli, & Carrasco, 2007). However, none of the existing methods has sufficiently addressed missing data within a longitudinal data setting.

In this paper, we propose an approach to extend existing methods for inference about CCC within a general multiobserver longitudinal data setting. We discuss distribution-free inference under complete and missing data, and study the performance of the proposed approach under different missing data assumptions for small and moderate sample sizes. We illustrate our methodology by applying it to real study data in biomedical research.

2. Modeling CCC for Longitudinal Data

2.1. CCC from Longitudinal Study with Multiple Raters

Consider a longitudinal study with $n$ subjects, $M$ observers (or methods), and $T$ assessments. Let

$$
\mathbf{y}_{it} = (y_{i1t}, y_{i2t}, \ldots, y_{iMt})^\top, \quad \mathbf{y}_i = (\mathbf{y}_{i1}^\top, \mathbf{y}_{i2}^\top, \ldots, \mathbf{y}_{iT}^\top)^\top, \quad i = 1, 2, \ldots, n, \quad t = 1, 2, \ldots, T.
$$

In the above, $y_{imt}$ represents the rating of the $m$th observer on the $i$th subject at time $t$, $\mathbf{y}_{it}$ the ratings on the $i$th subject from all the $M$ judges at time $t$, and $\mathbf{y}_i$ the collection of rating data for the $i$th subject across all judges and assessments. Note that for convenience, we use judges' or observers' ratings to refer to the multiple outcomes from different diagnostic or assessment methods under consideration throughout the rest of the discussion.

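Let

$$
\begin{aligned}
\rho_{mlt} &= \frac{2\sigma_{mlt}}{\sigma_{mt}^2 + \sigma_{lt}^2 + (\mu_{mt} - \mu_{lt})^2}, \quad \boldsymbol{\rho}_t = (\rho_{12t}, \rho_{13t}, \ldots, \rho_{(M-1)Mt})^\top, \\
\boldsymbol{\rho} &= (\boldsymbol{\rho}_1^\top, \ldots, \boldsymbol{\rho}_T^\top)^\top, \quad \boldsymbol{\mu}_t = E(\mathbf{y}_{it}) = (\mu_{1t}, \ldots, \mu_{Mt})^\top, \\
\boldsymbol{\mu} &= E(\mathbf{y}_i) = (\boldsymbol{\mu}_1^\top, \ldots, \boldsymbol{\mu}_T^\top)^\top, \\
\Sigma_{st} &= \mathrm{Cov}(\mathbf{y}_{is}, \mathbf{y}_{it}) =
\begin{pmatrix}
\sigma_{1st} & \cdots & \sigma_{1Mst} \\
\vdots & \ddots & \vdots \\
\sigma_{M1st} & \cdots & \sigma_{Mst}
\end{pmatrix}, \quad
\Sigma = \mathrm{Var}(\mathbf{y}_i) =
\begin{pmatrix}
\Sigma_{11} & \cdots & \Sigma_{1T} \\
\vdots & \ddots & \vdots \\
\Sigma_{T1} & \cdots & \Sigma_{TT}
\end{pmatrix}.
\end{aligned}
\tag{2}
$$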

In the above, $\rho_{mlt}$ denotes the CCC between raters $m$ and $l$ at time $t$, and $\boldsymbol{\rho}_t$ the vector containing all such pairwise CCCs at time $t$. As a special case, if $T = 1$, $\boldsymbol{\rho}$ in (2) represents all pairwise CCCs from $M$ raters. Further, if $M = 2$, inference about such a single CCC $\rho_{12}$ has been discussed in Lin (1989). By applying the generalized estimating equations (GEE) II (Prentice, 1988; Reboussin & Liang, 1998), Barnhart and Williamson (2001) and Barnhart et al. (2002) extended Lin's work to a multirater setting. More recently, by utilizing the theory of U-statistics, King et al. (2007) generalized Lin's approach to longitudinal data analysis by constructing an aggregated CCC index based on data over time. None of these methods sufficiently addresses missing data. We will discuss their specific limitations in Section 4.

Note that the general setup above also accommodates comparisons of agreement with respect to a reference standard. For example, by designating the first rater as the reference standard, we can compare agreement between each $m$th rater and the reference standard by examining differences among $\rho_{12t}, \rho_{13t}, \ldots, \rho_{1Mt}$.

2.2. Inference Under Missing Data

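Let

$$
\begin{aligned}
\theta_{mlt1} &= 2\sigma_{mlt}, \quad \theta_{mlt2} = (\mu_{mt} - \mu_{lt})^2 + (\sigma_{mt}^2 + \sigma_{lt}^2 - 2\sigma_{mlt}), \\
\boldsymbol{\theta}_{mlt} &= (\theta_{mlt1}, \theta_{mlt2})^\top, \quad \boldsymbol{\theta}_t = (\boldsymbol{\theta}_{12t}^\top, \boldsymbol{\theta}_{13t}^\top, \ldots, \boldsymbol{\theta}_{(M-1)Mt}^\top)^\top, \\
\boldsymbol{\theta} &= (\boldsymbol{\theta}_1^\top, \ldots, \boldsymbol{\theta}_T^\top)^\top, \\
f_{mlt}(\boldsymbol{\theta}_{mlt}) &= \frac{\theta_{mlt1}}{\theta_{mlt1} + \theta_{mlt2}}, \quad \mathbf{f} = (\mathbf{f}_1^\top, \mathbf{f}_2^\top, \ldots, \mathbf{f}_T^\top)^\top, \quad \mathbf{f}_t = (f_{12t}, f_{13t}, \ldots, f_{(M-1)Mt})^\top.
\end{aligned}
\tag{3}
$$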

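Then, $\boldsymbol{\rho} = \mathbf{f}(\boldsymbol{\theta})$. Let

$$
\begin{aligned}
h_{ijmlt1} &= (y_{imt} - y_{jmt})(y_{ilt} - y_{jlt}), \quad h_{ijmlt2} = \frac{1}{2}\left[(y_{imt} - y_{ilt})^2 + (y_{jmt} - y_{jlt})^2\right], \\
\mathbf{h}_{ijmlt} &= (h_{ijmlt1}, h_{ijmlt2})^\top, \quad \mathbf{h}_{ijt} = (\mathbf{h}_{ij12t}^\top, \ldots, \mathbf{h}_{ij(M-1)Mt}^\top)^\top, \\
\mathbf{h}_{ij} &= (\mathbf{h}_{ij1}^\top, \ldots, \mathbf{h}_{ijT}^\top)^\top, \quad (i, j) \in C_2^n, \quad (m, l) \in C_2^M, \quad 1 \le t \le T.
\end{aligned}
\tag{4}
$$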

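Then, it is readily checked that $\boldsymbol{\theta} = E(\mathbf{h}_{ij})$. Further, $\hat{\boldsymbol{\theta}} = \binom{n}{2}^{-1} \sum_{(i,j) \in C_2^n} \mathbf{h}_{ij}$ is a one-sample, vector-valued U-statistic. By applying the theory of multivariate U-statistics (Kowalski & Tu, 2007, Chapter 5), $\hat{\boldsymbol{\theta}}$ is unbiased, consistent, and asymptotically normal. Thus, by the Delta method, $\hat{\boldsymbol{\rho}} = \mathbf{f}(\hat{\boldsymbol{\theta}})$ is a consistent and asymptotically normal estimate of $\boldsymbol{\rho}$.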
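For a single rater pair at a single time point with complete data, the U-statistic estimate and the resulting CCC estimate can be sketched as follows. This is only a minimal illustration in Python, not the authors' R package; the function names and the toy ratings are hypothetical. The first component averages the kernel $h_{ijmlt1}$ over all pairs (which equals twice the usual sample covariance with divisor $n - 1$), and the second averages $h_{ijmlt2}$ (which equals the mean squared between-rater difference).

```python
from itertools import combinations

def theta_hat(y_m, y_l):
    """Complete-data U-statistic estimate of (theta_1, theta_2) from kernel (4)."""
    n = len(y_m)
    s1 = s2 = 0.0
    for i, j in combinations(range(n), 2):
        s1 += (y_m[i] - y_m[j]) * (y_l[i] - y_l[j])
        s2 += 0.5 * ((y_m[i] - y_l[i]) ** 2 + (y_m[j] - y_l[j]) ** 2)
    n_pairs = n * (n - 1) / 2
    return s1 / n_pairs, s2 / n_pairs

def ccc_hat(y_m, y_l):
    """CCC estimate f(theta_hat) = theta_1 / (theta_1 + theta_2), as in (3)."""
    t1, t2 = theta_hat(y_m, y_l)
    return t1 / (t1 + t2)

# Toy ratings from two raters on five subjects (hypothetical data).
y1 = [1.0, 2.0, 3.0, 4.0, 5.0]
y2 = [1.5, 1.8, 3.4, 3.9, 5.6]
t1, t2 = theta_hat(y1, y2)
rho_hat = ccc_hat(y1, y2)
```

The double loop costs $O(n^2)$ per rater pair and time point, which is unproblematic for the sample sizes considered here.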

In the presence of missing data, one approach is to apply $\hat{\boldsymbol{\rho}}$ above or rather $\hat{\boldsymbol{\theta}}$ to the subsample consisting of those subjects with complete data. However, such a complete-data approach not only reduces power but may also yield biased estimates. A better alternative is to include all available data and address the inherent missing data problem.

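To this end, define a vector of binary variables for indicating missing (or rather observed) response as follows:

$$
\begin{aligned}
r_{imt} &= \begin{cases} 1 & \text{if } y_{imt} \text{ is observed}, \\ 0 & \text{if } y_{imt} \text{ is unobserved}, \end{cases} \quad \mathbf{r}_{it} = (r_{i1t}, r_{i2t}, \ldots, r_{iMt})^\top, \\
\mathbf{r}_i &= (\mathbf{r}_{i1}^\top, \mathbf{r}_{i2}^\top, \ldots, \mathbf{r}_{iT}^\top)^\top, \quad 1 \le m \le M, \quad 1 \le t \le T.
\end{aligned}
\tag{5}
$$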

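Also, let

$$
\begin{aligned}
\Delta_{imt} &= \Pr[r_{imt} = 1 \mid \mathbf{y}_i], \quad \Delta_{imlt} = \Pr[r_{imt} r_{ilt} = 1 \mid \mathbf{y}_i], \\
\boldsymbol{\Delta}_{it} &= (\Delta_{i1t}, \Delta_{i2t}, \ldots, \Delta_{iMt}, \Delta_{i12t}, \Delta_{i13t}, \ldots, \Delta_{i(M-1)Mt})^\top, \quad \boldsymbol{\Delta}_i = (\boldsymbol{\Delta}_{i1}^\top, \ldots, \boldsymbol{\Delta}_{iT}^\top)^\top.
\end{aligned}
\tag{6}
$$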

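We define our estimate as $\hat{\boldsymbol{\rho}} = \mathbf{f}(\hat{\boldsymbol{\theta}})$ with $\hat{\boldsymbol{\theta}}$ given by

$$
\hat{\boldsymbol{\theta}}_{mlt} = \binom{n}{2}^{-1} \sum_{(i,j) \in C_2^n} \frac{r_{imt} r_{ilt} r_{jmt} r_{jlt}}{\Delta_{imlt} \Delta_{jmlt}} \mathbf{h}_{ijmlt}, \quad 1 \le m < l \le M, \quad 1 \le t \le T. \tag{7}
$$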

Note that although $\mathbf{h}_{ijmlt}$ in (4) is not defined if one of the $y_{imt}$, $y_{jmt}$, $y_{ilt}$, and $y_{jlt}$ is missing, $\hat{\boldsymbol{\theta}}_{mlt}$ above is well defined since $\mathbf{h}_{ijmlt}$ can be assigned any value in such cases without affecting $\hat{\boldsymbol{\theta}}_{mlt}$. The estimate $\hat{\boldsymbol{\theta}}_{mlt}$ in (7) may be viewed as a generalization of the classic inverse probability weighted (IPW) estimate used in nonparametric analysis to a U-statistics setting (e.g., Robins, Rotnitzky, & Zhao, 1995).
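The weighted estimate in (7) can be sketched for one rater pair at one time point as below, under the assumption that the joint observation probabilities are known and supplied by the caller. The function name, the argument layout, and the use of `None` for a missing rating are illustrative choices, not part of the paper. Pairs in which any of the four ratings is missing receive weight zero and are simply skipped, which is why the kernel never needs to be evaluated on missing values; note that the normalizing constant remains the number of all pairs, not only the observed ones.

```python
from itertools import combinations

def ipw_theta_hat(y_m, y_l, r_m, r_l, delta_ml):
    """IPW U-statistic (7) for one rater pair (m, l) at one time point.

    y_m, y_l : ratings, with None where missing
    r_m, r_l : 0/1 observation indicators
    delta_ml : delta_ml[i] = Pr(both ratings of subject i observed), assumed known
    """
    n = len(y_m)
    s1 = s2 = 0.0
    for i, j in combinations(range(n), 2):
        if not (r_m[i] and r_l[i] and r_m[j] and r_l[j]):
            continue  # weight is zero, kernel value is irrelevant
        w = 1.0 / (delta_ml[i] * delta_ml[j])
        s1 += w * (y_m[i] - y_m[j]) * (y_l[i] - y_l[j])
        s2 += w * 0.5 * ((y_m[i] - y_l[i]) ** 2 + (y_m[j] - y_l[j]) ** 2)
    n_pairs = n * (n - 1) / 2  # all pairs, observed or not
    return s1 / n_pairs, s2 / n_pairs

# Toy data (hypothetical): the last subject lacks the second rater's score.
ym = [1.0, 2.0, 3.0, 4.0, 5.0]
yl = [1.5, 1.8, 3.4, 3.9, None]
rm = [1, 1, 1, 1, 1]
rl = [1, 1, 1, 1, 0]
theta_ipw = ipw_theta_hat(ym, yl, rm, rl, [0.8] * 5)
```

When every rating is observed and all weights equal one, the function reduces to the complete-data U-statistic.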

First, assume that $\boldsymbol{\Delta}_i$ is known. It can be shown that $\hat{\boldsymbol{\theta}}$ is both consistent and asymptotically normal. Thus, by the Delta method, $\hat{\boldsymbol{\rho}} = \mathbf{f}(\hat{\boldsymbol{\theta}})$ is a consistent and asymptotically normal estimate of $\boldsymbol{\rho}$. We summarize these results in a theorem below, with a proof sketched in Appendix A.1.

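Theorem 1. Let

$$
\begin{aligned}
\mathbf{v}_{ijmlt} &= \frac{r_{imt} r_{ilt} r_{jmt} r_{jlt}}{\Delta_{imlt} \Delta_{jmlt}} (\mathbf{h}_{ijmlt} - \boldsymbol{\theta}_{mlt}), \quad \mathbf{v}_{ijt} = (\mathbf{v}_{ij12t}^\top, \ldots, \mathbf{v}_{ij(M-1)Mt}^\top)^\top, \\
\mathbf{v}_{ij} &= (\mathbf{v}_{ij1}^\top, \ldots, \mathbf{v}_{ijT}^\top)^\top, \quad \mathbf{v}_{imlt} = E(\mathbf{v}_{ijmlt} \mid \mathbf{y}_i, \mathbf{r}_i), \\
\mathbf{v}_{it} &= (\mathbf{v}_{i12t}^\top, \ldots, \mathbf{v}_{i(M-1)Mt}^\top)^\top, \quad \mathbf{v}_i = (\mathbf{v}_{i1}^\top, \ldots, \mathbf{v}_{iT}^\top)^\top, \quad \Sigma_\theta = 4 \mathrm{Var}(\mathbf{v}_i).
\end{aligned}
\tag{8}
$$

Then,

(a) $\hat{\boldsymbol{\rho}}$ is consistent and asymptotically normal,

$$
\hat{\boldsymbol{\rho}} \to_p \boldsymbol{\rho}, \quad \sqrt{n}(\hat{\boldsymbol{\rho}} - \boldsymbol{\rho}) \to_d N\left(\mathbf{0}, \; \Sigma_\rho = \frac{\partial^\top}{\partial \boldsymbol{\theta}} \mathbf{f}(\boldsymbol{\theta}) \, \Sigma_\theta \, \frac{\partial}{\partial \boldsymbol{\theta}} \mathbf{f}(\boldsymbol{\theta})\right), \tag{9}
$$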

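where $\to_p$ and $\to_d$ denote convergence in probability and distribution, respectively.

(b) The asymptotic variance $\Sigma_\rho$ can be estimated by the Delta method. The following is a consistent estimate $\hat{\Sigma}_\theta$:

$$
\begin{aligned}
\widehat{\mathrm{Cov}}(\mathbf{v}_{imlt}, \mathbf{v}_{im'l't'}) &= \frac{1}{n-1} \sum_{i=1}^n \frac{r_{imt} r_{ilt} r_{im't'} r_{il't'}}{\Delta_{imlt} \Delta_{im'l't'}} \mathbf{e}_{imlt} \mathbf{e}_{im'l't'}^\top, \\
\mathbf{e}_{imlt} &= (e_{imlt1}, e_{imlt2})^\top, \\
e_{imlt1} &= y_{imt} y_{ilt} - y_{ilt} \hat{\mu}_{mt} - y_{imt} \hat{\mu}_{lt} + \hat{\sigma}_{mlt} + \hat{\mu}_{mt} \hat{\mu}_{lt} - \hat{\theta}_{mlt1}, \\
e_{imlt2} &= \frac{1}{2}\left[(y_{imt} - y_{ilt})^2 + \hat{\sigma}_{mt}^2 + \hat{\sigma}_{lt}^2 - 2\hat{\sigma}_{mlt} + (\hat{\mu}_{mt} - \hat{\mu}_{lt})^2\right] - \hat{\theta}_{mlt2}, \\
\hat{\mu}_{mt} &= \frac{1}{n} \sum_{i=1}^n \frac{r_{imt}}{\Delta_{imt}} y_{imt}, \quad \hat{\sigma}_{mt}^2 = \frac{1}{n-1} \sum_{i=1}^n \frac{r_{imt}}{\Delta_{imt}} (y_{imt} - \hat{\mu}_{mt})^2, \\
\hat{\sigma}_{mlt} &= \frac{1}{n-1} \sum_{i=1}^n \frac{r_{imt} r_{ilt}}{\Delta_{imlt}} (y_{imt} - \hat{\mu}_{mt})(y_{ilt} - \hat{\mu}_{lt}),
\end{aligned}
\tag{10}
$$

for $1 \le m < l \le M$, $1 \le m' < l' \le M$, $1 \le t, t' \le T$.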
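To make the Delta-method step concrete, here is a rough sketch for the simplest case: one rater pair, one time point, and complete data (all indicators and weights equal to one). It is an illustration under those assumptions, not the authors' implementation, and the function name is hypothetical. The projection terms follow the pattern of Theorem 1(b); the second component is written with $-2\hat{\sigma}_{mlt}$, which is the form that centers it at zero given the definition of $\theta_{mlt2}$ in (3). The gradient of $f = \theta_1/(\theta_1 + \theta_2)$ is $(\theta_2, -\theta_1)/(\theta_1 + \theta_2)^2$.

```python
import math

def ccc_with_se(y_m, y_l):
    """CCC estimate and Delta-method standard error for complete data."""
    n = len(y_m)
    mu_m, mu_l = sum(y_m) / n, sum(y_l) / n
    var_m = sum((a - mu_m) ** 2 for a in y_m) / (n - 1)
    var_l = sum((b - mu_l) ** 2 for b in y_l) / (n - 1)
    cov = sum((a - mu_m) * (b - mu_l) for a, b in zip(y_m, y_l)) / (n - 1)

    t1 = 2.0 * cov                                       # U-statistic for theta_1
    t2 = sum((a - b) ** 2 for a, b in zip(y_m, y_l)) / n  # U-statistic for theta_2
    rho = t1 / (t1 + t2)

    # Projection terms e_i = (e_i1, e_i2) with all weights equal to one.
    plug = var_m + var_l - 2.0 * cov + (mu_m - mu_l) ** 2
    e = [((a - mu_m) * (b - mu_l) + cov - t1,
          0.5 * ((a - b) ** 2 + plug) - t2) for a, b in zip(y_m, y_l)]

    # Sigma_theta estimate: 4 * (1 / (n - 1)) * sum_i e_i e_i^T.
    s11 = 4.0 * sum(u * u for u, _ in e) / (n - 1)
    s12 = 4.0 * sum(u * v for u, v in e) / (n - 1)
    s22 = 4.0 * sum(v * v for _, v in e) / (n - 1)

    # Gradient of f(theta) = theta_1 / (theta_1 + theta_2).
    g1 = t2 / (t1 + t2) ** 2
    g2 = -t1 / (t1 + t2) ** 2
    var_rho = (g1 * g1 * s11 + 2.0 * g1 * g2 * s12 + g2 * g2 * s22) / n
    return rho, math.sqrt(var_rho)

rho_est, se_est = ccc_with_se([1.0, 2.0, 3.0, 4.0, 5.0],
                              [1.5, 1.8, 3.4, 3.9, 5.6])
```

With five subjects the standard error is of course not trustworthy; the toy data only show the mechanics.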

Based on Theorem 1, we can test linear contrasts of the form

$$
H_0: K\boldsymbol{\rho} = \mathbf{0} \quad \text{vs.} \quad H_a: K\boldsymbol{\rho} \ne \mathbf{0},
$$

using a Wald-type statistic, $W_n = n \hat{\boldsymbol{\rho}}^\top K^\top (K \hat{\Sigma}_\rho K^\top)^{-1} K \hat{\boldsymbol{\rho}}$, where $K$ is some $p \times \frac{M(M-1)T}{2}$ full-rank matrix of known constants. Under $H_0$, $W_n$ has an asymptotic central $\chi_p^2$ distribution with $p$ degrees of freedom. For small samples, $W_n$ often yields inflated type I errors, since it completely ignores the variability in the estimated $\hat{\Sigma}_\rho$ (Guo, Pan, Connett, Hannan, & French, 2005). A popular alternative for correcting this upward bias is the Hotelling's T-square statistic, $T_n^2 = \frac{n-p}{p(n-1)} W_n$, which follows approximately a central $F_{p,n-p}$ distribution with $p$ (numerator) and $n - p$ (denominator) degrees of freedom under $H_0$ (e.g., Seber, 1984).
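The two test statistics can be computed along the following lines. This is a standard-library sketch with a small Gaussian-elimination helper standing in for a matrix inverse; the function names and the numerical inputs are hypothetical and are not the values reported in the paper. The variance argument is taken to be the estimated asymptotic variance of $\sqrt{n}(\hat{\boldsymbol{\rho}} - \boldsymbol{\rho})$, matching the factor $n$ in $W_n$. Converting the statistics to p-values requires chi-square or F tail probabilities, which are left out here.

```python
def solve(a, b):
    """Solve a x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [b[i]] for i, row in enumerate(a)]
    for c in range(n):
        p = max(range(c, n), key=lambda r: abs(m[r][c]))
        m[c], m[p] = m[p], m[c]
        for r in range(c + 1, n):
            f = m[r][c] / m[c][c]
            for k in range(c, n + 1):
                m[r][k] -= f * m[c][k]
    x = [0.0] * n
    for r in range(n - 1, -1, -1):
        x[r] = (m[r][n] - sum(m[r][k] * x[k] for k in range(r + 1, n))) / m[r][r]
    return x

def wald_and_hotelling(rho_hat, sigma_rho_hat, K, n):
    """Return (W_n, T_n^2) for H0: K rho = 0."""
    p, q = len(K), len(rho_hat)
    k_rho = [sum(K[a][c] * rho_hat[c] for c in range(q)) for a in range(p)]
    k_sig = [[sum(K[a][c] * sigma_rho_hat[c][d] for c in range(q))
              for d in range(q)] for a in range(p)]
    k_sig_kt = [[sum(k_sig[a][d] * K[b][d] for d in range(q))
                 for b in range(p)] for a in range(p)]
    x = solve(k_sig_kt, k_rho)                 # (K Sigma K^T)^{-1} K rho
    w_n = n * sum(k_rho[a] * x[a] for a in range(p))
    t2 = (n - p) / (p * (n - 1)) * w_n
    return w_n, t2

# Hypothetical inputs: three CCCs, equality tested through two contrasts.
rho_hat = [0.70, 0.50, 0.45]
sigma_hat = [[0.2, 0.0, 0.0], [0.0, 0.3, 0.0], [0.0, 0.0, 0.3]]
K = [[1.0, -1.0, 0.0], [0.0, 1.0, -1.0]]
w_n, t2_n = wald_and_hotelling(rho_hat, sigma_hat, K, 89)
```

Because $(n - p)/(p(n - 1)) < 1/p$, the T-square statistic is always smaller than $W_n / p$, which is the source of its more conservative small-sample behavior.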

2.3. Estimation of Weight Function

In most applications, $\boldsymbol{\Delta}_i$ are unknown and must be estimated. Under the assumption of missing completely at random (MCAR), $\mathbf{r}_i$ are independent of $\mathbf{y}_i$. Consequently, $\Delta_{imt}$ and $\Delta_{imlt}$ are both functionally independent of $\mathbf{y}_i$ and are readily estimated by the respective sample moments: $\hat{\Delta}_{mt} = \frac{1}{n} \sum_{i=1}^n r_{imt}$ and $\hat{\Delta}_{mlt} = \frac{1}{n} \sum_{i=1}^n r_{imt} r_{ilt}$ ($1 \le m < l \le M$, $1 \le t \le T$).
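Under MCAR the weight estimates are just observed proportions, as the short sketch below shows for one rater pair at one time point. The function name and the indicator vectors are made up for illustration.

```python
def mcar_weights(r_m, r_l):
    """Sample-moment weight estimates under MCAR for one rater pair.

    Returns (delta_m, delta_l, delta_ml): the proportions of subjects with
    rater m observed, rater l observed, and both observed.
    """
    n = len(r_m)
    delta_m = sum(r_m) / n
    delta_l = sum(r_l) / n
    delta_ml = sum(a * b for a, b in zip(r_m, r_l)) / n
    return delta_m, delta_l, delta_ml

# Hypothetical observation indicators for four subjects.
d_m, d_l, d_ml = mcar_weights([1, 1, 1, 0], [1, 0, 1, 0])
```

Since the estimated weights are the same for every subject, they factor out of the weighted sums, and the weighted estimate is a rescaled version of the estimate computed from the observed pairs.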

When $\boldsymbol{\Delta}_i$ becomes dependent on $\mathbf{y}_i$, it is necessary to model $\boldsymbol{\Delta}_i$ as a function of $\mathbf{y}_i$. However, it is difficult to model such a relationship without imposing some additional assumptions on the relationship between the occurrence of missing data and outcomes (Little & Rubin, 1987). As in the literature, we focus on the missing at random (MAR) mechanism.

Consider first a cross-sectional study with $M$ observers and $n$ subjects. In this special case, missing data occur when observers' ratings are not available from some of the judges. Under MAR, the occurrence of missing data from such judges depends only on the observed responses from the other observers. Let $\mathbf{y}_i^{obs}$ denote the observed judges' ratings for the $i$th subject. Then, under MAR,

$$
\Delta_{iml} = \Pr(r_{im} r_{il} = 1 \mid \mathbf{y}_i) = \Pr(r_{im} r_{il} = 1 \mid \mathbf{y}_i^{obs}). \tag{11}
$$

As $\mathbf{y}_i^{obs}$ is the subvector of $\mathbf{y}_i$ corresponding to the judges with nonmissing ratings, $\Delta_{iml}$ above is essentially a function of the missing data patterns across the subjects. Since there are potentially $2^M$ different patterns, it is generally not feasible to model and estimate $\Delta_{iml}$ in most real studies unless there is a certain structure in the patterns. Fortunately, in most such applications, $\Delta_{iml}$ is unlikely to depend on the observed data $\mathbf{y}_i^{obs}$, and, as such, estimation of $\Delta_{iml}$ reverts to the MCAR condition discussed earlier.

For longitudinal trials, the situation is quite different. Missing data in such studies occur as the result of subject dropout due to deteriorated/improved health and other related conditions, exhibiting the so-called monotone missing data pattern (MMDP). In such cases, missing data is predicted by observed responses, and MAR arises naturally as a plausible model for modeling the missing data. Further, the structured patterns under MMDP make it possible to model such a dependence in most studies.

Consider a special case with only one rater, i.e., $M = 1$ and $\mathbf{y}_i = (y_{i1}, \ldots, y_{iT})^\top$. We assume no missing data at baseline $t = 1$ so that $r_{i1} = 1$. Under MMDP, if $y_{it}$ is observed at $t$, $y_{is}$ is then observed at all earlier times $s < t$. Let $\tilde{\mathbf{y}}_{it} = (y_{i1}, y_{i2}, \ldots, y_{i(t-1)})^\top$ denote the subvector of $\mathbf{y}_i$ containing responses up to and including time $t - 1$ ($2 \le t \le T$). Then, under MAR,

$$
\Delta_{it} = \Pr(r_{it} = 1 \mid \mathbf{y}_i) = \Pr(r_{it} = 1 \mid \mathbf{y}_i^{obs}) = \Pr(r_{it} = 1 \mid \tilde{\mathbf{y}}_{it}). \tag{12}
$$

In contrast to (11), there are only $T$ distinct missing data patterns to consider when modeling $\Delta_{it}$ in (12).

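We first model the one-step transition probability of the occurrence of missing data using logistic regression:

$$
\mathrm{logit}(p_{it}) = \mathrm{logit}\left(E(r_{it} = 1 \mid r_{i(t-1)} = 1, \tilde{\mathbf{y}}_{it})\right) = \alpha_t + \boldsymbol{\beta}_t^\top \tilde{\mathbf{y}}_{it}. \tag{13}
$$

By invoking MMDP and the assumption of no missing data at $t = 1$, we obtain:

$$
\Delta_{it} = \Pr(r_{it} = 1, r_{i(t-1)} = 1 \mid \tilde{\mathbf{y}}_{it}) = p_{it} \Pr(r_{i(t-1)} = 1 \mid \tilde{\mathbf{y}}_{i(t-1)}) = \prod_{s=2}^{t} p_{is}. \tag{14}
$$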
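The recursion in (14) is easy to trace numerically. The sketch below handles the single-rater case and, purely to keep the example short, assumes that the transition probability at time $t$ depends only on the response at $t - 1$ (a Markov simplification of the full history in (13)). The function names and the coefficient values are hypothetical; in practice the coefficients would come from fitted logistic regressions.

```python
import math

def expit(x):
    """Inverse logit."""
    return 1.0 / (1.0 + math.exp(-x))

def dropout_weights(y_obs, alpha, beta):
    """Cumulative observation probabilities Delta_t under MMDP.

    y_obs : responses y_1, ..., y_k actually observed for one subject (k >= 1)
    alpha, beta : dicts of hypothetical logistic coefficients keyed by t = 2, 3, ...
    Returns {t: Delta_t}, with Delta_1 = 1 (no missing data at baseline).
    """
    weights = {1: 1.0}
    for t in sorted(alpha):
        if t - 1 > len(y_obs):
            break  # y_{t-1} was never observed, so p_t cannot be evaluated
        p_t = expit(alpha[t] + beta[t] * y_obs[t - 2])  # one-step probability
        weights[t] = weights[t - 1] * p_t               # product in (14)
    return weights

# With all coefficients zero every transition probability is 0.5.
w_example = dropout_weights([1.0, 2.0], {2: 0.0, 3: 0.0}, {2: 0.0, 3: 0.0})
```

The weights decrease with $t$ because each additional visit multiplies in another probability below one, so late observations from subjects likely to drop out are weighted up the most.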

Although widely used to model MAR for regression analysis of longitudinal data (e.g., Robins et al., 1995), the above approach does not apply when there is more than one observer in our context. For notational brevity, we first focus on a two-rater setting and introduce an index to categorize the different MMDPs within such a setting (Tu, Feng, Kowalski, Tang, Wang, Wan, & Ma, 2007).

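For each subject $i$, let $l_i$ denote the absolute difference or lag time between the last observed responses from the two raters, and let $L = \max\{l_i; 1 \le i \le n\}$. If $L = 0$, $\mathbf{y}_{it} = (y_{i1t}, y_{i2t})^\top$ is observed or missing as a pair so that $r_{i1t} = r_{i2t} = r_{it}$. In this case, the modeling approach discussed above for the single-rater case applies. Let $\tilde{\mathbf{y}}_{it} = (\mathbf{y}_{i1}^\top, \ldots, \mathbf{y}_{i(t-1)}^\top)^\top$ denote the history of observed response pairs prior to time $t$. We can use the same logistic regression in (13) to model the one-step transition probability, except for using the predictor defined by the observed data from both raters, $\eta_{it} = \alpha_t + \boldsymbol{\beta}_t^\top \tilde{\mathbf{y}}_{it}$.

Now consider the case with $L = 1$. At each time $t$, the pair $\mathbf{y}_{it} = (y_{i1t}, y_{i2t})^\top$ may be missing one or both components, yielding four possible missing data patterns indexed by a four-level nominal variable $z_{it}$:

$$
\begin{aligned}
z_{it} = 1 &: \{r_{i1t} = 1, r_{i2t} = 1\}, \quad z_{it} = 2 : \{r_{i1t} = 0, r_{i2t} = 1\}, \\
z_{it} = 3 &: \{r_{i1t} = 1, r_{i2t} = 0\}, \quad z_{it} = 4 : \{r_{i1t} = 0, r_{i2t} = 0\}.
\end{aligned}
\tag{15}
$$

A popular choice for modeling $z_{it}$ is the generalized logit model (e.g., McCullagh & Nelder, 1989):

$$
\begin{aligned}
p_{itl} &= E(z_{it} = l \mid z_{i(t-1)} = 1, \tilde{\mathbf{y}}_{it}) = \frac{\eta_{itl}}{\sum_{m=1}^{4} \eta_{itm}}, \quad 1 \le l \le 4, \quad 2 \le t \le T, \\
\eta_{itl} &= \exp(\alpha_{tl} + \boldsymbol{\beta}_{tl}^\top \tilde{\mathbf{y}}_{it}), \quad \alpha_{t1} = 0, \quad \boldsymbol{\beta}_{t1} = \mathbf{0}, \quad \boldsymbol{\beta}_{tl} = (\beta_{tl1}, \ldots, \beta_{tl(t-1)})^\top, \\
\boldsymbol{\alpha}_t &= (\alpha_{t2}, \alpha_{t3}, \alpha_{t4})^\top, \quad \boldsymbol{\beta}_t = (\boldsymbol{\beta}_{t2}^\top, \boldsymbol{\beta}_{t3}^\top, \boldsymbol{\beta}_{t4}^\top)^\top, \\
\boldsymbol{\zeta}_t &= (\boldsymbol{\alpha}_t^\top, \boldsymbol{\beta}_t^\top)^\top, \quad \boldsymbol{\zeta} = (\boldsymbol{\zeta}_2^\top, \ldots, \boldsymbol{\zeta}_T^\top)^\top.
\end{aligned}
\tag{16}
$$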

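Under the assumption of no missing data at baseline, $p_{i1l} = 1$ if $l = 1$ and $0$ otherwise. Thus, we have

$$
\begin{aligned}
\Delta_{i12t} &= \Pr[z_{it} = 1 \mid \tilde{\mathbf{y}}_{it}] = p_{it1} \Pr[z_{i(t-1)} = 1 \mid \tilde{\mathbf{y}}_{i(t-1)}] = \prod_{s=2}^{t} p_{is1}, \\
\Delta_{i1t} &= (p_{it1} + p_{it3}) \prod_{s=2}^{t-1} p_{is1}, \quad \Delta_{i2t} = (p_{it1} + p_{it2}) \prod_{s=2}^{t-1} p_{is1}.
\end{aligned}
\tag{17}
$$

In many applications, $p_{itl}$ may depend only on the most recently observed response at $t - 1$, and the linear predictor in (16) under this Markov condition is simply $\log \eta_{itl} = \alpha_{tl} + \boldsymbol{\beta}_{tl}^\top \mathbf{y}_{i(t-1)}$.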
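A compact way to see how the pattern probabilities feed into the three weights in (17) is sketched below for two raters with lag $L = 1$. The function names and dictionary keys are hypothetical, and the linear predictors are passed in directly rather than computed from fitted coefficients. Pattern 1 serves as the reference category, so its linear predictor is zero. Rater 1 is observed under patterns 1 and 3, and rater 2 under patterns 1 and 2, following the coding in (15).

```python
import math

def pattern_probs(linear_predictors):
    """Generalized logit probabilities for patterns 1..4.

    linear_predictors : [0, a2, a3, a4], the log of eta for each pattern,
    with pattern 1 (both raters observed) as the reference category.
    """
    eta = [math.exp(v) for v in linear_predictors]
    total = sum(eta)
    return [e / total for e in eta]

def two_rater_weights(probs_by_time):
    """Weights from (17) for t = 2, ..., T.

    probs_by_time : list of [p_t1, p_t2, p_t3, p_t4], one entry per t >= 2.
    """
    out = []
    both_so_far = 1.0  # product of p_s1 for s = 2, ..., t - 1
    for p in probs_by_time:
        out.append({
            "delta_12": p[0] * both_so_far,          # both raters observed at t
            "delta_1": (p[0] + p[2]) * both_so_far,  # rater 1 observed at t
            "delta_2": (p[0] + p[1]) * both_so_far,  # rater 2 observed at t
        })
        both_so_far *= p[0]
    return out

# With all linear predictors zero, each pattern has probability 0.25.
probs = [pattern_probs([0.0, 0.0, 0.0, 0.0]) for _ in range(2)]
weights_17 = two_rater_weights(probs)
```

The running product only involves pattern 1 because, with $L \le 1$, any other pattern at an earlier visit rules out observing the pair at a later one.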

Note that we have assumed that $\mathbf{r}_{it}$ depend only on the past history $\tilde{\mathbf{y}}_{it}$. Although tempting, it is generally not possible to model the missingness of $y_{i1t}$ ($y_{i2t}$) as a function of $y_{i2t}$ ($y_{i1t}$) in addition to $\tilde{\mathbf{y}}_{it}$ (see Appendix B). Note also that for lag time $L \ge 2$, the number of potential missing data patterns at each time $t$ will increase at the rate of $(L + 1)^2$. Although possible, it is likely to be difficult to model and estimate parameters given insufficient replications of missing data patterns in most applications. On the other hand, large lag times are unlikely to occur under MAR in real studies. For example, within the context of the HIV prevention example, it is unlikely that one response (diary or retrospective) will be continuously observed, while the other is not. Thus, models with $L \le 1$ provide reasonable approximations for modeling missingness under MAR in most applications. Further, the total number of missing data patterns reduces from $T^2$ to $T$ for $L = 0$ and to $3T - 2$ for $L = 1$.
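The pattern counts quoted above can be checked by brute force under one natural reading: with both raters observed at baseline, a monotone pattern for two raters is determined by the pair of last observed times, each between 1 and $T$, and the lag restriction bounds their absolute difference by $L$. This reading, and the function name, are assumptions made for the sketch.

```python
def count_dropout_patterns(T, L=None):
    """Count pairs (a, b) of last observed times in 1..T with |a - b| <= L.

    L=None means no restriction on the lag between the two raters.
    """
    return sum(1
               for a in range(1, T + 1)
               for b in range(1, T + 1)
               if L is None or abs(a - b) <= L)

T = 4
n_all = count_dropout_patterns(T)       # unrestricted
n_lag0 = count_dropout_patterns(T, 0)   # raters always drop out together
n_lag1 = count_dropout_patterns(T, 1)   # last observed times differ by at most 1
```

For $T = 4$ this gives 16, 4, and 10 patterns, matching $T^2$, $T$, and $3T - 2$.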

The above procedure is readily generalized to more than two raters. For $M$ raters with $L = 1$, let $z_{it} = l$ index the potentially $2^M$ different missing data patterns. By extending the upper limit of $l$ from 4 to $2^M$ in (16), we can apply this same model to estimate the one-step transition probability $p_{itl}$ and use (17) to estimate the weight function $\Delta_{ilmt}$ ($1 \le l < m \le M$). As in the above two-rater case, the general case with $L \ge 2$ is similarly considered, but it may have limited applications in practice.

2.4. Accounting for Variation in Estimated Weight Function

When $\boldsymbol{\Delta}_{it}$ are estimated, such as by the generalized logit model in (16), $\boldsymbol{\Delta}_{it}(\hat{\boldsymbol{\zeta}})$ are subject to the sampling variability of $\hat{\boldsymbol{\zeta}}$. By treating $\hat{\boldsymbol{\zeta}}$ as a constant, the asymptotic variance in Theorem 1 generally underestimates the true variability of $\hat{\boldsymbol{\theta}}$. For correct inference, we must account for this extra variability.

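Suppose that $\boldsymbol{\zeta}$ in the generalized logit model in (16) is estimated by maximum likelihood. Then, $\hat{\boldsymbol{\zeta}}$ is the solution to the following score equations:

$$
\sum_{i=1}^{n} \mathbf{w}_i(\boldsymbol{\zeta}) = \sum_{i=1}^{n} (\mathbf{w}_{i2}^\top, \ldots, \mathbf{w}_{iT}^\top)^\top = \mathbf{0}, \quad \mathbf{w}_{it} = \frac{\partial}{\partial \boldsymbol{\zeta}_t}\left[\sum_{l=1}^{4} I_{itl} \log(p_{itl})\right], \tag{18}
$$

where $p_{it1} = 1 - \sum_{l=2}^{4} p_{itl}$, and $I_{itl}$ is a binary indicator with $I_{itl} = 1$ if $z_{it} = l$ and $0$ otherwise ($1 \le l \le 4$). When using the estimated $\hat{\boldsymbol{\zeta}}$, the estimate defined in (7) becomes $\hat{\boldsymbol{\theta}}_{mlt}(\hat{\boldsymbol{\zeta}}) = \binom{n}{2}^{-1} \sum_{(i,j) \in C_2^n} \frac{r_{imt} r_{ilt} r_{jmt} r_{jlt}}{\Delta_{imlt}(\hat{\boldsymbol{\zeta}}) \Delta_{jmlt}(\hat{\boldsymbol{\zeta}})} \mathbf{h}_{ijmlt}$. Thus, we must also include the variability of $\hat{\boldsymbol{\zeta}}$ in the asymptotic variance of $\hat{\boldsymbol{\theta}}(\hat{\boldsymbol{\zeta}})$. By utilizing the properties of score equations and the projection-based U-statistics asymptotic expansion, we can derive the asymptotic variance to take into account this extra variability. We summarize the results below with a justification given in Appendix A.2.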

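Theorem 2. Let

$$
C = 2E\left(\frac{\partial^\top}{\partial \boldsymbol{\zeta}} \mathbf{v}_i(\boldsymbol{\zeta})\right), \quad H = E\left(\frac{\partial^\top}{\partial \boldsymbol{\zeta}} \mathbf{w}_i(\boldsymbol{\zeta})\right). \tag{19}
$$

Then, under the assumptions of Theorem 1,

(a) $\hat{\boldsymbol{\rho}}$ is consistent and asymptotically normal, with the asymptotic variance

$$
\begin{aligned}
\Sigma_\rho^\Phi &= \frac{\partial^\top \mathbf{f}}{\partial \boldsymbol{\theta}} (\Sigma_\theta + \Phi) \frac{\partial \mathbf{f}}{\partial \boldsymbol{\theta}}, \\
\Phi &= -4\left[C H^{-1} C^\top + E(\mathbf{v}_i \mathbf{w}_i^\top H^{-1} C^\top) + \left(E(\mathbf{v}_i \mathbf{w}_i^\top H^{-1} C^\top)\right)^\top\right],
\end{aligned}
\tag{20}
$$

where $\Sigma_\theta$ is given in (8).

(b) A consistent estimate of $\Sigma_\rho^\Phi$ is $\hat{\Sigma}_\rho^\Phi = \frac{\partial^\top}{\partial \boldsymbol{\theta}} \mathbf{f}(\hat{\boldsymbol{\theta}}) (\hat{\Sigma}_\theta + \hat{\Phi}) \frac{\partial}{\partial \boldsymbol{\theta}} \mathbf{f}(\hat{\boldsymbol{\theta}})$, where $\hat{\Sigma}_\theta$ is given in (10), and $\hat{\Phi}$ is obtained by substituting moment estimates in place of $H$, $C$, and $E(\mathbf{v}_i \mathbf{w}_i^\top H^{-1} C^\top)$.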

It is seen from (20) that when the weight function is estimated, the asymptotic variance of $\hat{\boldsymbol{\rho}}$ has an additional component $\Phi$, which reflects the sampling variability in $\hat{\boldsymbol{\zeta}}$. Note that we do not have to estimate $\boldsymbol{\zeta}$ using maximum likelihood. For example, we may use the generalized estimating equations (GEE) to estimate $\boldsymbol{\zeta}$. The adjustment factor $\Phi$ has the same form as given in (20) under such alternative procedures. Again, as in Theorem 1, the Hotelling T-square statistic, $T_n^2 = \frac{n-p}{p(n-1)} W_n$, can be used to provide inference for linear contrasts for small sample sizes, which follows approximately a central $F_{p,n-p}$ distribution with $p$ and $n - p$ degrees of freedom under the null hypothesis.

3. Application

We illustrate the approach with both real and simulated data. We first present applications to data from two longitudinal studies and then follow up with investigations of the performance of the approach with small to moderate sample sizes by simulation. In all the examples, we set the statistical significance at α = 0.05. All analyses are carried out using a package we have developed for implementing the proposed approach using the R software platform (R Development Core Team (2009). R: A language and environment for statistical computing. R Foundation for Statistical Computing, Vienna, Austria. ISBN 3-900051-07-0, URL http://www.R-project.org).

3.1. Real Studies

Example 1. In the Penn State Young Women's Health Study (Lloyd, Chinchilli, Eggli, Rollings, & Kulin, 1998), the percentage of body fat was examined on a group of adolescent girls using skinfold calipers (SC) and dual-energy X-ray absorptiometry (DEXA). The measurements were initially taken at age 12 and then every 6 months thereafter. The goal was to compare the agreement between these two measures over time at ages 12.5, 13, 13.5, and 14. For illustration purposes, we focus on the 89 subjects with complete baseline data at age 12.5.

Let $\mathbf{y}_{it} = (y_{i1t}, y_{i2t})^\top$ denote the SC ($y_{i1t}$) and DEXA ($y_{i2t}$) measures of body fat at time $t$ with $t = 1$ denoting the baseline and $t = 2$, 3, and 4 corresponding to the three follow-up assessments. Shown in Table 1 are the estimates of the product moment correlation coefficient $\rho_{PM}$, location shift $|\mu_1 - \mu_2|/\sqrt{\sigma_1 \sigma_2}$, and scale shift $\sigma_1/\sigma_2$ across all assessments. It is seen that these two methods show a strong linear relationship at the first and last assessments, but a moderate linear relationship at assessments 2 and 3. Location shifts nearly doubled at times 2–4 compared to that at time 1, indicating consistent differences between the two methods at the follow-up times. Scale shifts are minor with the largest difference at time 4. The large location shifts at the follow-up assessments would discount the use of any association measures such as the product moment correlation as a valid measure of agreement between SC and DEXA in this study.

TABLE 1. Estimates of product-moment correlation, location and scale shift between SC and DEXA at each assessment for the Penn State Young Women's Health Study.

Assessment          Product-moment correlation   Location shift                Scale shift
                    ρPM                          |μSC − μDEXA|/√(σSC σDEXA)    σSC/σDEXA
Age 12.5 (t = 1)    0.804                        0.562                         0.812
Age 13 (t = 2)      0.622                        1.013                         0.859
Age 13.5 (t = 3)    0.623                        1.068                         0.867
Age 14 (t = 4)      0.918                        1.026                         0.733

TABLE 2. Estimates of parameters of logistic regression for modeling missingness under MMDP for bivariate outcomes for the Penn State Young Women's Health Study.

Assessment time     Predictors    Estimates   Standard errors   p-values
Age 13 (t = 2)      SC (y11)      −0.13       0.32              0.68
                    DEXA (y21)    0.28        0.25              0.26
Age 13.5 (t = 3)    SC (y12)      −0.08       0.25              0.74
                    DEXA (y22)    −0.03       0.20              0.87
Age 14 (t = 4)      SC (y13)      1.44        1.22              0.24
                    DEXA (y23)    −0.42       0.76              0.58

Since missing data were only present in the DEXA outcome and followed the univariate monotone missing data pattern (MMDP), the generalized logit model in (16) reduced to the logistic regression in (13), with the predictor defined as a function of the previously observed $\tilde{\mathbf{y}}_{it} = \mathbf{y}_{i(t-1)} = (y_{i1(t-1)}, y_{i2(t-1)})^\top$ under the Markov assumption:

$$
\mathrm{logit}\left(E(r_{it} = 1 \mid \tilde{\mathbf{y}}_{it})\right) = \alpha_t + \beta_{1t} y_{i1(t-1)} + \beta_{2t} y_{i2(t-1)}, \quad t = 2, 3, 4.
$$

Shown in Table 2 are the estimates of $\boldsymbol{\beta}_t = (\beta_{1t}, \beta_{2t})^\top$, their standard errors, and corresponding p-values ($2 \le t \le 4$). The large p-values indicate insufficient ground to reject MCAR. Nonetheless, for illustration purposes, we proceeded with subsequent analyses under both MCAR and MAR with inference based on Theorem 2.

We assessed the effect of age on agreement between the two methods by testing the null, $H_0: \rho_t = \rho$ ($1 \le t \le 4$), with $\rho_t$ denoting the CCC between SC and DEXA at time $t$. Shown in Table 3 are the estimated $\rho_t$ over time, test statistic, and associated p-value under $H_0$ computed under MCAR and MAR. To reduce bias in type I error, we reported the p-values based on the generalized Fieller's method with a Hotelling's T-square statistic (see Appendix C). The estimates of CCC are almost identical under MCAR and MAR. Although the test statistics and p-values are somewhat different, the conclusions for the hypotheses tested remain the same between the two missing data models.

The two assessment methods seemed to have the closest agreement at age 12.5 but with a steady decline in CCC over the follow-up assessments. The small p-value from the test of $H_0$ confirmed the discrepancies among the $\rho_t$ over time. Further, the large p-value shown in Table 3 from the null limited to the follow-up assessments delineates the discrepancies as the difference between the baseline and follow-up CCCs. These findings substantiate the speculation in King et al. (2007) that "There may be physiological explanations for this phenomenon due to the fact that the skinfold estimation is only capable of detecting subcutaneous fat, whereas breast, lower body and visceral fat are increasing over this age range due to the onset of menarche."

TABLE 3. Estimates of CCC over time, test statistics and p-values for testing the null of equal CCC over time for the Penn State Young Women's Health Study under MCAR and MAR.

MCAR
Estimates of CCC over time (asymptotic standard errors)
Age 12.5 (t = 1)   Age 13 (t = 2)   Age 13.5 (t = 3)   Age 14 (t = 4)
0.683 (0.047)      0.512 (0.056)    0.489 (0.062)      0.469 (0.059)

Hypothesis testing
Null hypothesis            T²n      p-value
H0: ρ1 = ρ2 = ρ3 = ρ4      22.639   <0.001
H0: ρ2 = ρ3 = ρ4           0.624    0.894

MAR
Estimates of CCC over time (asymptotic standard errors)
Age 12.5 (t = 1)   Age 13 (t = 2)   Age 13.5 (t = 3)   Age 14 (t = 4)
0.683 (0.047)      0.509 (0.055)    0.489 (0.063)      0.469 (0.083)

Hypothesis testing
Null hypothesis            T²n      p-value
H0: ρ1 = ρ2 = ρ3 = ρ4      20.720   <0.001
H0: ρ2 = ρ3 = ρ4           0.243    0.971

Example 2. A longitudinal study in sexual health research was recently conducted to examine the accuracy of retrospective recall of sexual behavior using daily diary as a reference standard (Morrison-Beedy et al., 2006). Although a daily diary may not be 100% accurate, this contemporaneous monitoring strategy addresses some of the key limitations of retrospective assessment such as recall bias (Schroder et al., 2003; Shrier, Shih, & Beardslee, 2005). A sample of 102 adolescent girls monitored their sexual behavior with a daily diary and returned for assessment monthly for three months. One of the primary interests was to assess recall bias due to assessment intervals and check whether the association between these two measurements changes over time. For illustration purposes, we focused on the reporting of unprotected vaginal sex. There was about 10% missing data in the outcomes across the follow-up assessments.

TABLE 4. Estimates of logistic regression for modeling occurrence of missing data at each monthly assessment.

Assessment time    Predictors              Estimates   Standard errors   p-values
Month 1 (t = 2)    Retrospective (yi11)    −0.0001     0.06              0.97
Month 2 (t = 3)    Retrospective (yi12)    −0.17       0.09              0.07
                   Daily diary (yi22)      0.24        0.18              0.18
Month 3 (t = 4)    Retrospective (yi13)    −0.08       0.21              0.71
                   Daily diary (yi23)      0.41        0.35              0.25

Let $y_{i1t}$ ($y_{i2t}$) denote the frequency of unprotected vaginal sex from retrospective recall (daily diary) ($1 \le t \le 4$). Since missing data were the result of missed visits in this study, the MMDP with $L = 0$ applied and the missingness was modeled using the logistic regression in (13), with the predictor defined as a function of both observed $y_{i1t}$ and $y_{i2t}$. As subjects were asked to recall their sexual behavior in the month prior to baseline, the predictor of the logistic model at $t = 2$ (first monthly visit) contained only retrospective recall $y_{i11}$ at baseline, rather than both retrospective and diary outcomes for the later visits at $t = 3, 4$. Given the low frequency of missing response, we again modeled the occurrence of missing data under the Markov condition with predictors, $\alpha_2 + \beta_{21} y_{i11}$ and $\alpha_t + \beta_{t1} y_{i1(t-1)} + \beta_{t2} y_{i2(t-1)}$ (for $t = 3, 4$), for the logistic model in (13). Shown in Table 4 are the estimates of $\boldsymbol{\beta}_t = (\beta_{t1}, \beta_{t2})$ ($\boldsymbol{\beta}_2 = \beta_{21}$) and their standard errors and corresponding p-values. Although there is insufficient ground to reject MCAR, we performed the analyses under both MCAR and MAR.

TABLE 5. Estimates of CCC over time, test statistics and p-values for testing the null of equal CCC over time for the Sexual Health Study under MCAR and MAR.

MCAR
Estimates of CCC ρt over time (asymptotic standard errors)
Month 1 (t = 2)   Month 2 (t = 3)   Month 3 (t = 4)   Average
0.783 (0.110)     0.825 (0.058)     0.602 (0.120)     0.737 (0.060)

Hypothesis testing H0: ρ1 = ρ2 = ρ3
T²n      p-value
2.444    0.508

MAR
Estimates of CCC ρt over time (asymptotic standard errors)
Month 1 (t = 2)   Month 2 (t = 3)   Month 3 (t = 4)   Average
0.783 (0.112)     0.825 (0.060)     0.601 (0.117)     0.736 (0.060)

Hypothesis testing H0: ρ1 = ρ2 = ρ3
T²n      p-value
2.415    0.513

We assessed the effect of reporting interval on accuracy by testing the null, $H_0: \rho_1 = \rho_2 = \rho_3$. Shown in Table 5 are the value of the statistic $T_n^2$ and associated p-value computed under both MCAR and MAR. Again, we reported the p-values based on the generalized Fieller's method with a Hotelling's T-square statistic. The results, which are quite similar between the two missing data models, indicate no evidence of degrading accuracy of retrospective reporting over time.

3.2. Simulation Study

We conducted a limited simulation study to examine the empirical type I error rate for testing the null of equal CCCs over time based on two raters (methods) under a longitudinal study design with three assessments ($T = 3$) for five sample sizes (30, 50, 100, 200, and 500) under complete data, and missing data with MCAR and MAR.

For each sample size, we generated pairs of observers' ratings over time $\mathbf{y}_{it} = (y_{i1t}, y_{i2t})^\top$ by simulating $\mathbf{y}_i = (\mathbf{y}_{i1}^\top, \mathbf{y}_{i2}^\top, \mathbf{y}_{i3}^\top)^\top$ from a six-variate normal with mean $\boldsymbol{\mu} = (0, \delta, 0, \delta, 0, \delta)^\top$ and variance $\Sigma = \Sigma_2 \otimes \Sigma_1$, where $\otimes$ denotes the Kronecker product of $\Sigma_2$ and $\Sigma_1$. In the above,

$$
\begin{aligned}
\delta &= E(y_{i1t}) - E(y_{i2t}), \quad \Sigma_1 = \begin{pmatrix} 1 & \rho_w \times \sqrt{\omega} \\ \rho_w \times \sqrt{\omega} & \omega \end{pmatrix}, \quad \Sigma_2 = C_3(\rho_b), \\
\omega &= \sqrt{\frac{\mathrm{Var}(y_{i1t})}{\mathrm{Var}(y_{i2t})}}, \quad \rho_w = \mathrm{Corr}(y_{i1t}, y_{i2t}), \quad \rho_b = \mathrm{Corr}(y_{iks}, y_{ikt}), \\
& 1 \le s < t \le 3, \quad 1 \le k \le 2,
\end{aligned}
$$

where $\delta$ denotes location shift, $\omega$ the ratio of the standard deviation of $y_{i1t}$ to that of $y_{i2t}$, $\rho_w$ the within-visit correlation, $\rho_b$ the between-visit, within-subject correlation, $\Sigma_1$ the within-visit covariance matrix, and $\Sigma_2$ the within-subject correlation matrix with $C_3(\rho_b)$ denoting the $3 \times 3$ compound symmetry correlation matrix with correlation $\rho_b$. For the simulation study, we set

$$
\delta = 0.5, \quad \omega = 1.25, \quad \rho_w = 0.9, \quad \rho_b = 0.5,
$$

so that the CCC $\rho_0 = 0.805$. For a given sample size, we assumed no missing data at baseline $t = 1$ and simulated the missing response according to each of the MCAR and MAR models. Appendix D provides details about simulating the missing response under the two missing data models.
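The simulation design can be reconstructed numerically as a quick sanity check. The sketch below builds the within-visit matrix exactly as it is printed above (diagonal entries $1$ and $\omega$, off-diagonal $\rho_w\sqrt{\omega}$), forms the $6 \times 6$ covariance by a Kronecker product, and evaluates the true CCC from (1). The helper names are hypothetical, and no random data are generated.

```python
import math

def kron(a, b):
    """Kronecker product of two matrices given as lists of lists."""
    return [[a[i][j] * b[k][l]
             for j in range(len(a[0])) for l in range(len(b[0]))]
            for i in range(len(a)) for k in range(len(b))]

def compound_symmetry(dim, rho):
    """Correlation matrix with ones on the diagonal and rho elsewhere."""
    return [[1.0 if i == j else rho for j in range(dim)] for i in range(dim)]

delta, omega, rho_w, rho_b = 0.5, 1.25, 0.9, 0.5

cov12 = rho_w * math.sqrt(omega)
sigma1 = [[1.0, cov12], [cov12, omega]]   # within-visit covariance, as printed
sigma2 = compound_symmetry(3, rho_b)      # between-visit correlation
sigma = kron(sigma2, sigma1)              # 6 x 6 covariance of y_i

# True CCC at each visit from (1), using the entries of sigma1.
rho_0 = 2.0 * cov12 / (sigma1[0][0] + sigma1[1][1] + delta ** 2)
```

With the stated parameter values this evaluates to about 0.805, in line with the value of $\rho_0$ quoted above.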

We estimated $\rho_t$ and its asymptotic variances without and with adjustment for sampling variability in the estimated weight function using Theorems 1 and 2, respectively. Shown in Table 6 are the averaged estimates based on 1,000 Monte Carlo (MC) replications. The empirical variance $\mathrm{Var}(\hat\rho_t)$ and type I error rate were obtained from the respective empirical distributions of $\hat\rho_t$, with the latter calculated according to

$$\alpha = \frac{1}{1000} \sum_{j=1}^{1000} I\{T^2_{nj} \ge \phi_{0.95}\} \quad \left( \alpha_\pi = \frac{1}{1000} \sum_{j=1}^{1000} I\{T^2_{\pi, nj} \ge \phi_{0.95}\} \right),$$

where $T^2_{nj}$ ($T^2_{\pi, nj}$) denotes the test statistic $T^2_n$ constructed in Appendix C based on Theorem 1 (Theorem 2) from the $j$th MC replication, and $\phi_{0.95}$ represents the 95th percentile of the $F_{3,n-3}$ distribution with degrees of freedom 3 (numerator) and n − 3 (denominator). Thus, in comparison to $\alpha$, $\alpha_\pi$ also reflects the additional variability in the estimated weight function.
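The empirical type I error rate described above is simply the fraction of Monte Carlo replications whose test statistic reaches the critical value. A minimal sketch follows; the function name and the toy statistics are hypothetical, and the critical value (the 95th percentile of the relevant F distribution) is assumed to be supplied by the caller rather than computed here.

```python
def empirical_type1_error(statistics, critical_value):
    """Fraction of replications whose statistic is at least the critical value."""
    rejections = sum(1 for t2 in statistics if t2 >= critical_value)
    return rejections / len(statistics)


# Toy illustration with four made-up statistics and a made-up critical value.
toy_statistics = [1.0, 2.0, 3.5, 0.2]
print(empirical_type1_error(toy_statistics, 3.0))
```

In the simulation study the list would hold the 1,000 values of the Hotelling's T-square statistic, one per replication, computed under the null.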

Under complete data, MCAR, and MAR, $\rho$ seems to be under-estimated in all cases. In terms of variance estimates, $\Sigma_\rho$, $\Sigma_\rho^\pi$, and $\mathrm{Var}(\hat\rho_t)$ are very close across all cases. The type I error rates show a converging trend toward 0.05 as the sample size increases under complete data, MCAR, and MAR. Overall, the smallest bias, variance, and type I error occurred under complete data, while the largest bias in the estimate and type I error occurred under MAR. $\alpha_\pi$ is consistently smaller than $\alpha$ across the board under MAR thanks to differences in covariances between the two asymptotic variances, $\Sigma_\rho$ and $\Sigma_\rho^\pi$, with the latter reflecting variation in estimated weights. As in the real data example in Section 3.1, $\alpha_\pi$ and $\alpha$ are virtually identical under MCAR.

4. Discussion

By generalizing the method of King et al. (2007) to integrate IPW estimates within the U-statistics setting, we developed an approach to address missing data when modeling multiobserver CCC for longitudinal study data and illustrated it with applications to data from two real studies. This approach performed well for moderate sample sizes, as evidenced by results from our simulation study. We have implemented the proposed approach in R, and the software is available from the authors upon request.

In this paper, we employed the theory of U-statistics to address the complexity of modeling second-order moments such as the CCC. An alternative is to use the weighted generalized estimating equations (WGEE) II to facilitate inference about such functions of second-order moments. For example, Barnhart and Williamson (2001) and Barnhart et al. (2002) developed GEE II-based methods to model CCC that involve multiple raters in a cross-sectional data setting. However, neither paper discussed how to estimate weight functions and account for their sampling variability when applying the WGEE to address missing data.

TABLE 6. Average $\hat\rho$, $\Sigma_\rho$, and $\Sigma_\rho^\pi$ over 1000 MC replications along with estimated empirical variance of $\hat\rho$, empirical type I error rate $\alpha$ and $\alpha_\pi$ under complete data, MCAR and MAR. Simulation parameters: $(\delta, \omega, \rho_w, \rho_b, \rho_0) = (0.5, 1.25, 0.9, 0.5, 0.805)$.

| Setting | Quantity | n=30, t=1 | n=30, t=2 | n=30, t=3 | n=50, t=1 | n=50, t=2 | n=50, t=3 | n=100, t=1 | n=100, t=2 | n=100, t=3 | n=200, t=1 | n=200, t=2 | n=200, t=3 | n=500, t=1 | n=500, t=2 | n=500, t=3 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Complete data | $\hat\rho$ | 0.801 | 0.795 | 0.799 | 0.798 | 0.799 | 0.798 | 0.803 | 0.800 | 0.801 | 0.804 | 0.804 | 0.803 | 0.804 | 0.804 | 0.804 |
| Complete data | $\mathrm{Var}(\hat\rho) \times 10^3$ | 2.993 | 3.719 | 3.240 | 1.887 | 1.985 | 2.128 | 0.954 | 0.983 | 0.951 | 0.466 | 0.446 | 0.418 | 0.173 | 0.187 | 0.167 |
| Complete data | $\Sigma_\rho \times 10^3$ | 2.980 | 3.036 | 2.945 | 1.845 | 1.822 | 1.847 | 0.898 | 0.926 | 0.912 | 0.454 | 0.453 | 0.459 | 0.182 | 0.183 | 0.18 |
| MCAR | $\hat\rho$ | 0.801 | 0.794 | 0.799 | 0.798 | 0.799 | 0.797 | 0.803 | 0.799 | 0.801 | 0.804 | 0.804 | 0.803 | 0.804 | 0.804 | 0.805 |
| MCAR | $\mathrm{Var}(\hat\rho) \times 10^3$ | 2.993 | 4.278 | 3.593 | 1.887 | 2.197 | 2.306 | 0.954 | 1.131 | 1.093 | 0.466 | 0.476 | 0.466 | 0.173 | 0.211 | 0.187 |
| MCAR | $\Sigma_\rho \times 10^3$ | 2.980 | 3.552 | 2.984 | 1.845 | 2.081 | 1.871 | 0.898 | 1.048 | 0.907 | 0.454 | 0.507 | 0.457 | 0.182 | 0.203 | 0.18 |
| MAR | $\hat\rho$ | 0.801 | 0.789 | 0.789 | 0.798 | 0.792 | 0.791 | 0.803 | 0.796 | 0.793 | 0.804 | 0.801 | 0.796 | 0.804 | 0.802 | 0.798 |
| MAR | $\mathrm{Var}(\hat\rho) \times 10^3$ | 2.993 | 4.953 | 4.482 | 1.887 | 2.781 | 2.929 | 0.954 | 1.417 | 1.433 | 0.466 | 0.826 | 0.690 | 0.173 | 0.511 | 0.269 |
| MAR | $\Sigma_\rho \times 10^3$ | 2.980 | 4.181 | 4.057 | 1.848 | 2.845 | 2.968 | 0.899 | 1.528 | 1.538 | 0.454 | 0.760 | 0.772 | 0.182 | 0.384 | 0.310 |
| MAR | $\Sigma_\rho^\pi \times 10^3$ | 2.981 | 4.533 | 4.317 | 1.849 | 2.962 | 3.056 | 0.899 | 1.589 | 1.598 | 0.456 | 0.787 | 0.802 | 0.182 | 0.403 | 0.322 |

| Setting | Type I error | n=30 | n=50 | n=100 | n=200 | n=500 |
|---|---|---|---|---|---|---|
| Complete data | $\alpha$ | 0.044 | 0.055 | 0.054 | 0.05 | 0.048 |
| MCAR | $\alpha$ | 0.057 | 0.056 | 0.056 | 0.05 | 0.05 |
| MAR | $\alpha$ | 0.062 | 0.061 | 0.058 | 0.056 | 0.055 |
| MAR | $\alpha_\pi$ | 0.060 | 0.059 | 0.058 | 0.053 | 0.052 |

TABLE 7. Average $\hat\rho$ and $\Sigma_\rho$ over 1000 MC replications along with estimated empirical variance of $\hat\rho$, empirical type I error rate $\alpha$ obtained based on incorrect MCAR assumption rather than true MAR model. Simulation parameters: $(\delta, \omega, \rho_w, \rho_b, \rho_0) = (0.5, 1.25, 0.9, 0.5, 0.805)$.

| Quantity | N=30, t=1 | N=30, t=2 | N=30, t=3 | N=50, t=1 | N=50, t=2 | N=50, t=3 | N=100, t=1 | N=100, t=2 | N=100, t=3 | N=200, t=1 | N=200, t=2 | N=200, t=3 | N=500, t=1 | N=500, t=2 | N=500, t=3 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| $\hat\rho$ | 0.801 | 0.788 | 0.787 | 0.798 | 0.791 | 0.788 | 0.803 | 0.793 | 0.790 | 0.804 | 0.798 | 0.793 | 0.804 | 0.797 | 0.794 |
| $\mathrm{Var}(\hat\rho) \times 10^3$ | 2.993 | 4.441 | 4.018 | 1.887 | 2.413 | 2.656 | 0.954 | 1.116 | 1.204 | 0.466 | 0.558 | 0.572 | 0.173 | 0.234 | 0.213 |
| $\Sigma_\rho \times 10^3$ | 2.980 | 3.937 | 4.018 | 1.845 | 2.317 | 2.394 | 0.898 | 1.149 | 1.185 | 0.454 | 0.549 | 0.591 | 0.182 | 0.221 | 0.23 |

| Type I error | N=30 | N=50 | N=100 | N=200 | N=500 |
|---|---|---|---|---|---|
| $\alpha$ | 0.053 | 0.059 | 0.074 | 0.071 | 0.099 |

We opted for the U-statistics based approach primarily for the following reasons. Unlike the (W)GEE II that models individually all the moments involved in the definition of CCC, the proposed alternative creates U-statistics to model functions of such moments, resulting in the reduction of parameters in expressing the CCC. For example, for a single, two-rater CCC, (W)GEE II requires estimating a vector of five parameters, $\mu_k$, $\sigma_k^2$, and $\sigma_{12}$ in (1) (k = 1, 2). The asymptotic variance of this estimate is a 5 × 5 matrix, yielding fifteen additional parameters. In contrast, the proposed approach integrates these five moments into two parameters, $\theta = (\theta_1, \theta_2)^\top$ (see Equation (3) in this special case). The estimate of $\theta$ is a 2 × 1 vector, and its asymptotic variance a 2 × 2 matrix with three parameters. This difference widens when modeling longitudinal study data. For example, in the Penn State Young Women's Health Study with four assessments, the (W)GEE II would estimate a 20 × 1 parameter vector and a 20 × 20 asymptotic variance. In comparison, $\theta$ in the proposed approach is an 8 × 1 vector with an 8 × 8 asymptotic variance matrix.
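The parameter counts quoted above follow from a simple rule: a p-dimensional estimate has a symmetric p × p asymptotic variance with p(p + 1)/2 distinct entries. The short sketch below applies that rule; the function name is hypothetical, and the counts of distinct variance entries for four assessments (210 and 36) are derived here from the same rule rather than stated in the text.

```python
def parameter_counts(n_assessments, moments_per_visit):
    """Return (length of parameter vector, distinct entries of its variance)."""
    p = n_assessments * moments_per_visit
    return p, p * (p + 1) // 2


# Five moments per visit for (W)GEE II, two per visit for the U-statistics approach.
for label, per_visit in (("(W)GEE II", 5), ("U-statistics", 2)):
    for n_assessments in (1, 4):
        length, variance_entries = parameter_counts(n_assessments, per_visit)
        print(label, n_assessments, length, variance_entries)
```

For a single assessment this reproduces the figures in the text (5 and 15 versus 2 and 3); for four assessments the vectors have lengths 20 and 8.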

We also compared the two approaches numerically by setting T = 1 in the simulation study in Section 3.2. The results (not shown) indicate that the CCC estimates based on the proposed approach had consistently smaller bias and asymptotic variance than their GEE II counterparts, even for the large sample size n = 500.

The proposed approach depends on the quality of the model for missing data. If this model is incorrect, the CCC estimate may be biased. To understand the effect of misspecification of the missing data model on inference, we carried out some additional simulations to study the performance of the CCC estimate when MAR missingness was erroneously modeled as MCAR. The results shown in Table 7 indicate biased CCC estimates, with type I error rates increasing as a function of sample size. As a future research goal, we will explore the possibility of applying the doubly robust concept, in which we will combine the proposed IPW estimate with other models that directly relate the (missing) response to the observed response so that the resulting estimate will be consistent if only one of these missing data models is correct (Robins et al., 1995; Tsiatis, 2006).

We will also investigate an extension of the proposed approach to a regression setting so that we can examine linear trend and include covariates. For example, Barnhart and Williamson (2001) considered the effect of continuous covariates for modeling the mean rater ratings. However, within a longitudinal setting, the variance of the rater ratings may also change over time. Thus, to generalize their approach, we will seek an extension that will permit modeling both the mean and variance as a function of time and other covariates. In addition to CCC, the proposed approach can be similarly applied to modeling other measures of agreement for continuous outcomes such as the intraclass correlation or ICC (e.g., McGraw & Wong, 1996; Shrout & Fleiss, 1979). Work is currently underway to generalize the proposed approach to this and other popular measures of agreement.

Acknowledgements

This research was supported in part by NIH grants R01-DA012249 and 1 UL1 RR024160-01. Dr. Ma was partially supported by the following grants: Center for Education and Research in Therapeutics (CERTs) (AHRQ RFA-HS-05-14) and Clinical Translational Science Center (CTSC) (UL1-RR024996). We are grateful to Prof. King at Penn State for graciously providing and helping interpret the Penn State Young Women's Health Study data for the illustration of the methodology. We sincerely thank Ms. Bliss-Clark at the University of Rochester, an Editor, and two anonymous reviewers, for their constructive comments, which have led to considerable improvements of this paper.

Appendix A. Proofs of Theorems

A.1. Proof of Theorem 1

We first establish a lemma.

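Lemma. Let

$$U_n = \binom{n}{2}^{-1} \sum_{(i,j) \in C_2^n} v_{ij} = (U_{1,n}^\top, \ldots, U_{T,n}^\top)^\top,$$

$$U_{t,n} = \binom{n}{2}^{-1} \sum_{(i,j) \in C_2^n} v_{ijt} = (U_{12t,n}^\top, \ldots, U_{(M-1)Mt,n}^\top)^\top, \tag{A.1}$$

$$U_{mlt,n} = \binom{n}{2}^{-1} \sum_{(i,j) \in C_2^n} v_{ijmlt}.$$

Then, $E(U_n) = 0$ and

$$\sqrt{n}\, U_n = \sqrt{n} \binom{n}{2}^{-1} \sum_{(i,j) \in C_2^n} v_{ij} \to_d N(0, \Sigma_U = 4\, \mathrm{Var}(v_i)). \tag{A.2}$$

Proof: It is readily checked by the iterated conditional expectation that (e.g., Kowalski & Tu, 2007, Chapter 1)

$$E(U_{mlt,n}) = E\left[ E\left( \frac{r_{imt} r_{ilt} r_{jmt} r_{jlt}}{\pi_{imlt} \pi_{jmlt}} (h_{ijmlt} - \theta_{mlt}) \,\Big|\, y_i, y_j \right) \right]$$

$$= E\left[ \pi_{imlt}^{-1} \pi_{jmlt}^{-1} (h_{ijmlt} - \theta_{mlt})\, E(r_{imt} r_{ilt} \mid y_i)\, E(r_{jmt} r_{jlt} \mid y_j) \right]$$

$$= 0. \tag{A.3}$$

It follows that $E(U_n) = 0$. Since $v_{kjmlt} = v_{jkmlt}$ ($j \ne k$), we have

$$v_{imlt} = E(v_{jkmlt} \mid y_i, r_i) = \begin{cases} 0 & \text{if } j \ne i,\ k \ne i, \\ E(v_{ikmlt} \mid y_i, r_i) & \text{if } j = i, \\ E(v_{ijmlt} \mid y_i, r_i) & \text{if } k = i. \end{cases}$$

Let $\Upsilon(i) = \{(j, k) \in C_2^n;\ j \ne i,\ k \ne i\}$. It then follows that

$$E(U_{mlt,n} \mid y_i, r_i) = \binom{n}{2}^{-1} \left[ \sum_{(j,k) \notin \Upsilon(i)} E(v_{jkmlt} \mid y_i, r_i) + \sum_{(j,k) \in \Upsilon(i)} E(v_{jkmlt} \mid y_i, r_i) \right]$$

$$= \binom{n}{2}^{-1} \left[ \sum_{(j,k) \notin \Upsilon(i)} E(v_{jkmlt} \mid y_i, r_i) \right]$$

$$= \binom{n}{2}^{-1} \left[ \sum_{j = i,\ (j,k) \notin \Upsilon(i)} E(v_{jkmlt} \mid y_i, r_i) + \sum_{k = i,\ (j,k) \notin \Upsilon(i)} E(v_{jkmlt} \mid y_i, r_i) \right]$$

$$= \binom{n}{2}^{-1} \left[ \sum_{k = i+1}^{n} E(v_{ikmlt} \mid y_i, r_i) + \sum_{j = 1}^{i-1} E(v_{ijmlt} \mid y_i, r_i) \right]$$

$$= \frac{2}{n} v_{imlt}.$$

Thus, the projection of $U_{mlt,n}$ is given by (e.g., Kowalski & Tu, 2007, Chapter 3; Serfling, 1980, Chapter 5)

$$\tilde{U}_{mlt,n} = \sum_{i=1}^{n} E(U_{mlt,n} \mid y_i, r_i) = \frac{1}{n} \sum_{i=1}^{n} 2 v_{imlt}.$$

Since $\tilde{U}_{mlt,n}$ is a sum of independently and identically distributed random variables, it follows from the central limit theorem (CLT) that

$$\sqrt{n}\, \tilde{U}_{mlt,n} = \frac{\sqrt{n}}{n} \sum_{i=1}^{n} 2 v_{imlt} \to_d N(0, \Sigma_{mlt} = 4\, \mathrm{Var}(v_{imlt})).$$

By the theory of U-statistics (e.g., Kowalski & Tu, 2007; Serfling, 1980, Chapter 5), $U_{mlt,n}$ and $\tilde{U}_{mlt,n}$ have the same asymptotic distribution, and thus

$$\sqrt{n}\, U_{mlt,n} \to_d N(0, \Sigma_{mlt} = 4\, \mathrm{Var}(v_{imlt})).$$

The lemma follows by applying a similar argument to the vector $U_n$ (e.g., Kowalski & Tu, 2007, Chapter 5). □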

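Proof of Theorem 1: Let $g_{ijmlt} = \frac{r_{imt} r_{ilt} r_{jmt} r_{jlt}}{\pi_{imlt} \pi_{jmlt}} h_{ijmlt}$. By an argument similar to (A.3), we have $E(g_{ijmlt}) = \theta_{mlt}$. It then follows from the theory of U-statistics that

$$\hat\theta_{mlt} = \binom{n}{2}^{-1} \sum_{(i,j) \in C_2^n} g_{ijmlt} \to_p \theta_{mlt}.$$

Thus, by Slutsky's theorem, $\hat\rho_{mlt} = f(\hat\theta_{mlt})$ is consistent. Further, by applying the lemma and Slutsky's theorem, we obtain the asymptotic distribution of $\hat\theta_{mlt}$:

$$\sqrt{n}(\hat\theta_{mlt} - \theta_{mlt}) = \sqrt{n} \binom{n}{2}^{-1} \sum_{(i,j) \in C_2^n} U_{mlt,n} \to_d N(0, 4\, \mathrm{Var}(v_{imlt})).$$

Similarly, by considering the vector $\hat\theta$, we obtain

$$\sqrt{n}(\hat\theta - \theta) \to_d N(0, \Sigma_\theta = 4\, \mathrm{Var}(v_i)).$$

Theorem 1 follows by applying the Delta method to $\hat\rho = f(\hat\theta)$.

To show (10), first note that

$$E(h_{ijmlt1} \mid y_i) = y_{imt} y_{ilt} - y_{ilt} \mu_{mt} - y_{imt} \mu_{lt} + \sigma_{mlt} + \mu_{mt} \mu_{lt},$$

$$E(h_{ijmlt2} \mid y_i) = \frac{1}{2} \left[ (y_{imt} - y_{ilt})^2 + \sigma_{mt}^2 + \sigma_{lt}^2 - 2 \sigma_{mlt} + (\mu_{mt} - \mu_{lt})^2 \right].$$

Further, we have

$$v_{imlt} = E(v_{ijmlt} \mid y_i, r_i) = E\left[ \frac{r_{imt} r_{ilt} r_{jmt} r_{jlt}}{\pi_{imlt} \pi_{jmlt}} (h_{ijmlt} - \theta_{mlt}) \,\Big|\, y_i, r_i \right]$$

$$= \frac{r_{imt} r_{ilt}}{\pi_{imlt}} E\left[ \frac{r_{jmt} r_{jlt}}{\pi_{jmlt}} (h_{ijmlt} - \theta_{mlt}) \,\Big|\, y_i, r_i \right]$$

$$= \frac{r_{imt} r_{ilt}}{\pi_{imlt}} E\left\{ E\left[ \frac{r_{jmt} r_{jlt}}{\pi_{jmlt}} (h_{ijmlt} - \theta_{mlt}) \,\Big|\, y_i, y_j, r_i \right] \Big|\, y_i, r_i \right\}$$

$$= \frac{r_{imt} r_{ilt}}{\pi_{imlt}} E\left\{ \pi_{jmlt}^{-1} (h_{ijmlt} - \theta_{mlt})\, E[r_{jmt} r_{jlt} \mid y_i, y_j, r_i] \,\Big|\, y_i, r_i \right\}$$

$$= \frac{r_{imt} r_{ilt}}{\pi_{imlt}} \left[ E(h_{ijmlt} \mid y_i) - \theta_{mlt} \right] = \frac{r_{imt} r_{ilt}}{\pi_{imlt}} e_{imlt}.$$

Thus, (10) follows by taking the covariance between $v_{imlt}$ and $v_{im'l't'}$ and substituting consistent estimates of $E(h_{ijmlt} \mid y_i)$ using the identities in (4) with consistent estimates of $\mu_{mt}$, $\sigma_{mt}^2$, and $\sigma_{mlt}$ given in (10). □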
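For intuition, the complete-data version of the estimator (all weights equal to one) can be sketched for a single visit and two raters. The kernels below are an assumption: they are one pair of symmetric kernels whose conditional expectations match the two expressions for $E(h_{ijmlt1} \mid y_i)$ and $E(h_{ijmlt2} \mid y_i)$ in the proof above, not a quotation of Equations (3) and (4). The function name is hypothetical, and the final step uses the relation $\rho = \phi / (1 + \phi)$ with $\phi = \theta_1 / \theta_2$ from Appendix C.

```python
from itertools import combinations


def ccc_ustat(rater1, rater2):
    """U-statistic estimates (theta1, theta2) and the implied CCC, complete data.

    Assumed kernels for a pair of subjects (i, j):
      h1 = (y_i1 - y_j1) * (y_i2 - y_j2)
      h2 = 0.5 * ((y_i1 - y_i2)**2 + (y_j1 - y_j2)**2)
    """
    pairs = list(combinations(range(len(rater1)), 2))
    theta1 = sum((rater1[i] - rater1[j]) * (rater2[i] - rater2[j])
                 for i, j in pairs) / len(pairs)
    theta2 = sum(0.5 * ((rater1[i] - rater2[i]) ** 2
                        + (rater1[j] - rater2[j]) ** 2)
                 for i, j in pairs) / len(pairs)
    # rho = phi / (1 + phi) with phi = theta1 / theta2.
    return theta1, theta2, theta1 / (theta1 + theta2)


ratings = [1.0, 2.0, 3.0, 4.0]
shifted = [y + 1.0 for y in ratings]      # second rater reads one unit higher
print(ccc_ustat(ratings, ratings)[2])     # perfect agreement
print(ccc_ustat(ratings, shifted)[2])     # agreement reduced by the shift
```

Under missing data, each pair's kernel would additionally be multiplied by the inverse probability weight appearing in $g_{ijmlt}$; that step is omitted here.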

A.2. Proof of Theorem 2

As noted in Section 2.4, $\hat\zeta$ is the solution to the score equations in (18). From the properties of score equations we have

$$\sqrt{n}(\hat\zeta - \zeta) = -H^{-1} \frac{\sqrt{n}}{n} \sum_{i=1}^{n} w_i + o_p(1), \tag{A.4}$$

where $H$ is given in (19), and $o_p(1)$ denotes the stochastic $o(1)$ (Kowalski & Tu, 2007, Chapter 1). It follows from (A.1), (A.4), and the projection-based U-statistics asymptotic expansion that (Kowalski & Tu, 2007, Chapter 5)

$$\sqrt{n}(\hat\theta(\hat\zeta) - \theta) = \sqrt{n}\left[ \hat\theta(\zeta) - \theta + \hat\theta(\hat\zeta) - \hat\theta(\zeta) \right]$$

$$= \sqrt{n}\, U_n(\zeta) + C \sqrt{n}(\hat\zeta - \zeta) + o_p(1)$$

$$= \sqrt{n}\, \frac{2}{n} \sum_{i=1}^{n} (v_i - C H^{-1} w_i) + o_p(1), \tag{A.5}$$

where $C$ is given in (19). It follows from (A.5) that $\hat\theta(\hat\zeta)$ is asymptotically normal with the asymptotic variance given by (20).

Appendix B. Modeling Bivariate Missing Data Under MAR

We show that it is not possible to model the missingness of $y_{i1t}$ as dependent on $y_{i2t}$ and vice versa in addition to $y_{it}$ under MAR. For convenience, we consider the case with t = 1 and denote $y_{i1t}$ ($y_{i2t}$) simply as $y_{i1}$ ($y_{i2}$). For notational brevity, we also suppress the dependence on $y_{it}$.

Suppose that on the contrary such a model exists. Then, we would have

$$\Pr[r_{i1} = 1 \mid r_{i2} = 1, y_{i1}, y_{i2}] = \Pr[r_{i1} = 1 \mid r_{i2} = 1, y_{i2}],$$

$$\Pr[r_{i2} = 1 \mid r_{i1} = 1, y_{i1}, y_{i2}] = \Pr[r_{i2} = 1 \mid r_{i1} = 1, y_{i1}]. \tag{B.1}$$

Under MAR, missingness depends only on observed data, and $\Pr[r_{i1} = 0, r_{i2} = 0 \mid y_{i1}, y_{i2}]$ is a constant $c$ ($0 < c < 1$). Thus, $\Pr[r_{i1} = 1, r_{i2} = 0 \mid y_{i1}, y_{i2}]$ is a function of $y_{i1}$ only, and $\Pr[r_{i1} = 0, r_{i2} = 1 \mid y_{i1}, y_{i2}]$ is a function of $y_{i2}$ only. Denote them as $f(y_{i1})$ and $g(y_{i2})$. Then, $\Pr[r_{i1} = 1, r_{i2} = 1 \mid y_{i1}, y_{i2}] = 1 - f(y_{i1}) - g(y_{i2}) - c$. It follows that

$$\Pr[r_{i1} = 1 \mid y_{i1}, y_{i2}] = f(y_{i1}) + 1 - f(y_{i1}) - g(y_{i2}) - c = 1 - g(y_{i2}) - c,$$

$$\Pr[r_{i2} = 1 \mid r_{i1} = 1, y_{i1}, y_{i2}] = \frac{1 - f(y_{i1}) - g(y_{i2}) - c}{1 - g(y_{i2}) - c}.$$

It follows from (B.1) that $\frac{1 - f(y_{i1}) - g(y_{i2}) - c}{1 - g(y_{i2}) - c}$ is a function of $y_{i1}$ only. Thus, $g(y_{i2})$ must be a constant. Likewise, $f(y_{i1})$ must be a constant. These contradict the MAR assumption.

Appendix C. Generalized Fieller’s Method for Vector of Ratio Statistics

Consider a longitudinal study with two raters and T assessments. The vector of CCC can be expressed as

$$\rho = (\rho_1, \ldots, \rho_T)^\top, \quad \theta = (\theta_1^\top, \ldots, \theta_T^\top)^\top, \quad \theta_t = (\theta_{1t}, \theta_{2t})^\top,$$

$$\rho_t = \frac{\phi_t}{1 + \phi_t}, \quad \phi_t = \frac{\theta_{1t}}{\theta_{2t}}, \quad 1 \le t \le T.$$

Hence, each single CCC is a function of the ratio $\phi_t$. We are interested in testing the null hypothesis,

$$H_0: \rho_t = \rho_0, \quad \text{or equivalently,} \quad H_0: \theta_{1t} - \phi_0 \theta_{2t} = 0, \quad 1 \le t \le T. \tag{C.1}$$

Regular approaches such as the Delta method for constructing the asymptotic variance of a ratio statistic often result in inflated type I error for small samples. A popular alternative for correcting the upward bias is the Fieller's method (Fieller, 1954). However, this approach only applies to a single time point. We now generalize this approach to a vector function within our context.

The hypothesis in (C.1) can be expressed as a linear contrast as

$$H_0: K\theta = 0, \quad K = \begin{pmatrix} 1 & -\phi_0 & 0 & 0 & \cdots & 0 & 0 \\ 0 & 0 & 1 & -\phi_0 & \cdots & 0 & 0 \\ \vdots & \vdots & \vdots & \vdots & \ddots & \vdots & \vdots \\ 0 & 0 & 0 & 0 & \cdots & 1 & -\phi_0 \end{pmatrix}_{T \times 2T}. \tag{C.2}$$

By Theorem 2, $\hat\theta$ has an asymptotic normal distribution, $\sqrt{n}(\hat\theta - \theta) \to_d N(0, \Sigma_\theta)$. Thus, under $H_0$, the Hotelling's T-square test statistic $T_n^2 = \frac{n - T}{T(n - 1)} W_n$ has approximately an $F_{T,n-T}$ distribution with degrees of freedom T (numerator) and n − T (denominator). As noted earlier in Section 2, $T_n^2$ yields improved type I error compared with the Wald statistic $W_n$, which has an asymptotic $\chi_T^2$ distribution.
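The two mechanical steps in this appendix, building the contrast matrix K and rescaling the Wald statistic, can be sketched as follows. The function names are hypothetical; the value of $\phi_0$ is obtained by inverting $\rho = \phi / (1 + \phi)$, which gives $\phi_0 = \rho_0 / (1 - \rho_0)$; and the Wald statistic $W_n$ is taken as an input because its construction is given in Section 2 rather than here.

```python
def contrast_matrix(n_times, rho0):
    """T x 2T matrix K whose t-th row encodes theta_1t - phi0 * theta_2t."""
    phi0 = rho0 / (1.0 - rho0)          # inverse of rho = phi / (1 + phi)
    rows = []
    for t in range(n_times):
        row = [0.0] * (2 * n_times)
        row[2 * t] = 1.0
        row[2 * t + 1] = -phi0
        rows.append(row)
    return rows


def hotelling_from_wald(wald, n, n_times):
    """Rescale a Wald statistic W_n to T_n^2 = (n - T) / (T (n - 1)) * W_n."""
    return (n - n_times) / (n_times * (n - 1)) * wald


def apply_contrast(k, theta):
    """Matrix-vector product K * theta."""
    return [sum(k_ij * th for k_ij, th in zip(row, theta)) for row in k]


K = contrast_matrix(3, 0.5)             # rho0 = 0.5 gives phi0 = 1
theta_null = [2.0, 2.0, 3.0, 3.0, 4.0, 4.0]
print(apply_contrast(K, theta_null))    # every component is zero under H0
print(hotelling_from_wald(9.0, 30, 3))
```

The rescaled statistic would then be compared with a percentile of the F distribution with T and n − T degrees of freedom.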

Appendix D. Simulation Procedures

For MAR, we considered the MMDP for bivariate outcomes with lag time L = 1 and simulated the missing data indicators for the two raters at each time t, $r_{it} = (r_{i1t}, r_{i2t})^\top$, according to a multinomial model with the cell-probability vector $p_{it} = (p_{it1}, p_{it2}, p_{it3}, p_{it4})^\top$, where the one-step transition probabilities $p_{itl}$ were specified according to the generalized logit model in (16) under the Markov assumption (t = 2, 3). To generate about 10% and 15% missing responses at times 2 and 3, we solved the following equations for $\beta_{2l}$ and $\beta_{3l}$:

$$0.9 n = \sum_{i=1}^{n} p_{i21}, \quad 0.85 n = \sum_{i=1}^{n} p_{i21} p_{i31},$$

$$p_{it1} = \frac{1}{1 + \sum_{l=2}^{4} \exp(\beta_{tl} + 2(y_{i1(t-1)} + y_{i2(t-1)}))}, \quad t = 2, 3. \tag{D.1}$$

To ensure that the missing data indicator $r_{it}$ follows the BMMDP model with L = 1, we further imposed the following restrictions:

$$r_{i13} = r_{i12} \times r_{i22} \times r_{i13}, \quad r_{i23} = r_{i12} \times r_{i22} \times r_{i23}.$$

For MCAR, the same approach above was used except that $p_{it1}$ were modeled independently of $y_{imt}$ with $p_{i21} = 0.90$ and $p_{i31} = \frac{0.85}{0.90}$ to produce about 10% and 15% missing responses at t = 2, 3, respectively.
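A simplified sketch of this procedure is given below. It is not the authors' implementation: it tracks only whether both raters are observed at each visit (cell 1 of the multinomial model) rather than all four cells, the function names and seed are hypothetical, and the MAR coefficients are left as inputs instead of being solved from (D.1).

```python
import math
import random


def prob_both_observed(betas, y1_prev, y2_prev):
    """p_it1 as in (D.1): probability of the cell in which both raters are observed."""
    linear = 2.0 * (y1_prev + y2_prev)
    return 1.0 / (1.0 + sum(math.exp(b + linear) for b in betas))


def simulate_mcar_both_observed(n, rng, p2=0.90, p3_given_2=0.85 / 0.90):
    """Monotone MCAR indicators for 'both raters observed' at t = 1, 2, 3."""
    rows = []
    for _ in range(n):
        s2 = rng.random() < p2
        # Monotone pattern: a pair missing at t = 2 stays missing at t = 3.
        s3 = s2 and rng.random() < p3_given_2
        rows.append((1, int(s2), int(s3)))   # no missing data at baseline
    return rows


rng = random.Random(7)                       # arbitrary seed
rows = simulate_mcar_both_observed(20000, rng)
frac2 = sum(r[1] for r in rows) / len(rows)
frac3 = sum(r[2] for r in rows) / len(rows)
print(round(frac2, 2), round(frac3, 2))
```

The observed fractions at t = 2 and t = 3 should be close to 0.90 and 0.85, matching the target missing rates of about 10% and 15%.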

References

Barnhart, H.X., & Williamson, J.M. (2001). Modelling concordance correlation via GEE to evaluate reproducibility. Biometrics, 57, 931–940.

Barnhart, H.X., Haber, M., & Song, J. (2002). Overall concordance correlation coefficient for evaluating agreement among multiple observers. Biometrics, 58, 1020–1027.

Bauer, S., & Kennedy, J.W. (1981). Applied statistics for the clinical laboratory: II. Within-run imprecision. The Journal of Clinical Laboratory Automation, 1, 197–201.

Chandler, J.M., Martin, A.R., Girman, C., Ross, P.D., Love-McClung, B., Lydick, E., & Yawn, B.P. (1998). Reliability of an osteoporosis-targeted quality of life survey instrument for use in the community: OPTQoL. Osteoporosis International, 8, 127–135.

Chinchilli, V.M., Martel, J.K., Kumanyika, S., & Lloyd, T. (1996). A weighted concordance correlation coefficient for repeated measures designs. Biometrics, 52, 341–353.

Costa, P., Arnould, B., Cour, F., Boyer, P., Marrel, A., Jaudinot, E.O., & Solesse de Gendre, A. (2003). Quality of Sexual Life Questionnaire (QVS): a reliable, sensitive and reproducible instrument to assess quality of life in subjects with erectile dysfunction. International Journal of Impotence Research, 15, 173–184.

Fieller, E.C. (1954). Some problems in interval estimation. Journal of the Royal Statistical Society B, 16, 175–185.

Guo, X., Pan, W., Connett, J.E., Hannan, P.J., & French, S.A. (2005). Small-sample performance of the robust score test and its modifications in generalized estimating equations. Statistics in Medicine, 24, 3479–3495.

King, T.S., & Chinchilli, V.M. (2001). A generalized concordance correlation coefficient for continuous and categorical data. Statistics in Medicine, 20, 2131–2147.

King, T.S., Chinchilli, V.M., & Carrasco, J.L. (2007). A repeated measures concordance correlation coefficient. Statistics in Medicine, 26, 3095–3113.

Kowalski, J., & Tu, X.M. (2007). Modern applied U-statistics. New York: Wiley.

Lin, L. (1989). A concordance correlation coefficient to evaluate reproducibility. Biometrics, 45, 255–268.

Little, R.J.A., & Rubin, D.B. (1987). Statistical analysis with missing data. New York: Wiley.

Lloyd, T., Chinchilli, V.M., Eggli, D.F., Rollings, N., & Kulin, H.E. (1998). Body composition development of adolescent white females. Archives of Pediatric Adolescence Medicine, 152, 998–1002.

McCullagh, P., & Nelder, J.A. (1989). Generalized linear models (2nd ed.). London: Chapman and Hall.

McGraw, K.O., & Wong, S.P. (1996). Forming inferences about some intraclass correlation coefficients. Psychological Methods, 1, 30–46.

Morrison-Beedy, D., Carey, M.P., & Tu, X.M. (2006). Accuracy of audio computer-assisted self-interviewing (ACASI) and self-administered questionnaires for the assessment of sexual behavior. AIDS and Behavior, 10, 541–552.

Paul, I.M., Wai, K., Jewell, S.J., Shaffer, M.L., & Varadan, V.V. (2006). Evaluation of a new self-contained, ambulatory, objective cough monitor. Cough, 2, 1–7.

Prentice, R.L. (1988). Correlated binary regression with covariates specific to each binary observation. Biometrics, 44, 321–327.

Reboussin, B.A., & Liang, K.Y. (1998). An estimating equations approach for the LISCOMP model. Psychometrika, 63, 165–182.

Robins, J.M., Rotnitzky, A., & Zhao, L.P. (1995). Analysis of semiparametric regression models for repeated outcomes in the presence of missing data. Journal of the American Statistical Association, 90, 106–121.

Schroder, K.E.E., Carey, M.P., & Vanable, P.A. (2003). Methodological challenges in research on sexual risk behavior: II. Accuracy of self-reports. Annals of Behavioral Medicine, 26, 104–123.

Seber, G.A.F. (1984). Multivariate observations. New York: Wiley.

Serfling, R.J. (1980). Approximation theorems of mathematical statistics. New York: Wiley.

Shrier, L.A., Shih, M., & Beardslee, W.R. (2005). Comparison of momentary sampling with diary and retrospective self-report methods of measurement. Pediatrics, 115, 573–581.

Shrout, P.E., & Fleiss, J.L. (1979). Intraclass correlations: uses in assessing rater reliability. Psychological Bulletin, 86, 420–428.

Tsiatis, A.A. (2006). Semiparametric Theory and Missing Data. New York: Springer.

Tu, X.M., Feng, C., Kowalski, J., Tang, W., Wang, H., Wan, C., & Ma, Y. (2007). Correlation analysis for longitudinal data: applications to HIV and psychosocial research. Statistics in Medicine, 26, 4116–4138.

Westgard, J.O., & Hunt, M.R. (1973). Use and interpretation of common statistical tests on method-comparison studies. Clinical Chemistry, 19, 49–57.

Manuscript Received: 14 DEC 2007
Final Version Received: 23 JUN 2009
Published Online Date: 13 NOV 2009