Post on 29-Jun-2020
transcript
Statistical Methods for Analysis with Missing Data
Lecture 3: naïve methods: complete-case analysis and imputation
Mauricio Sadinle
Department of Biostatistics
Previous Lecture

Universe of missing-data mechanisms (nested classes: MCAR ⊂ MAR ⊂ MNAR):

- MCAR: p(R = r | z) = p(R = r)
  - Unreasonable in most cases
- MAR: p(R = r | z) = p(R = r | z_(r))
  - Hard to digest, in general
  - R ⊥⊥ Z_1 | Z_2, if Z_2 fully observed
- MNAR: p(R = r | z) ≠ p(R = r | z_(r))
  - Most realistic, but hard to handle
Today's Lecture

Naïve or ad-hoc methods:

- Complete-case / available-case analyses
- Different types of (single) imputation

Reading: Ch. 2 of Davidian and Tsiatis
Naïve or Ad-Hoc Methods

- Motivation: we know how to run analyses with complete (rectangular) datasets
- Idea: somehow "fix" the dataset so that the analysis for complete data can be run
Outline

Complete-Case and Available-Case Analysis
- Complete-Case Analysis
- Available-Case Analysis

Imputation
- Mean Imputation
- Mode Imputation
- Regression Imputation
- Hot-Deck Imputation
- Last Observation Carried Forward

Summary
Complete-Case Analysis

- Idea: ignore observations with missingness, run the intended analysis with the remaining data
Assumption for Complete-Case Analysis

Complete-case analysis implicitly assumes

  p(z) = p(z | R = 1_K)    (1)

where 1_K represents a vector (1, 1, ..., 1) of length K.

- By Bayes' theorem

  p(z | R = 1_K) = p(R = 1_K | z) p(z) / p(R = 1_K)

- Therefore, (1) is equivalent to

  p(R = 1_K | z) = p(R = 1_K)

- This doesn't require any assumptions on p(R = r | z) for r ≠ 1_K
- MCAR (Z ⊥⊥ R) is a sufficient condition for (1)
Complete-Case Analysis is Wasteful/Inefficient

Clearly, there can be a huge waste of information:

- Observed data with response patterns r ≠ 1_K should be informative about the distribution of Z_(r), which is informative about the distribution of Z:

  p(z_(r)) = ∫ p(z) dz_(r̄),   r ∈ {0, 1}^K,

  where z_(r̄) denotes the components of z with r_j = 0

- We might end up with very little data
  - Say R_1, ..., R_K are i.i.d. Bernoulli(π)
  - Then p(R = 1_K) = π^K → 0 as K → ∞
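The shrinking probability π^K of a fully observed record can be made concrete with a tiny Python sketch (the function name is mine, purely illustrative):

```python
def complete_case_fraction(pi, K):
    """Expected fraction of complete cases when the K response
    indicators R_1, ..., R_K are i.i.d. Bernoulli(pi):
    P(R = 1_K) = pi^K."""
    return pi ** K

# Even with 90% response per variable, a 20-variable dataset
# keeps only about 12% of its rows under complete-case analysis.
fraction = complete_case_fraction(0.9, 20)
```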
Example: Estimating a Mean

We'll see an alternative presentation of Example 1 in Section 1.4 of Davidian and Tsiatis.

- {(Y_i, R_i)}_{i=1}^n i.i.d. ~ F
- Y_i: numeric variable for individual i
- R_i: indicator of Y_i being observed
- If Y_i were always observed, we could estimate the mean of Y, µ = E(Y), as

  µ̂_full = (1/n) ∑_{i=1}^n Y_i
Example: Estimating a Mean

With missing data, we could use the complete cases:

  µ̂_cc = ∑_{i=1}^n Y_i R_i / ∑_{i=1}^n R_i

Is this any good?

HW1: show that the following holds

  E(µ̂_cc) = E(Y | R = 1)

for all sample sizes, provided that at least one Y_i is observed.

Hint: write E(µ̂_cc) = E[ E( ∑_{i=1}^n Y_i R_i / ∑_{i=1}^n R_i | R_1, ..., R_n ) ]
Example: Estimating a Mean

  E(µ̂_cc) = E(Y | R = 1)

Therefore:

- The complete-case estimator of the mean requires assuming

  E(Y) = E(Y | R = 1)

- In particular, it is valid under MCAR
- Otherwise, µ̂_cc is not valid for µ, as it estimates the wrong quantity
- HW1: if p(R = 1 | y) is an increasing function of y, show that

  E(Y | R = 1) > E(Y)
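The bias in the HW1 exercise is easy to see by simulation. A minimal Python sketch (the logistic selection model and all names are illustrative choices, not from the lecture): when P(R = 1 | y) increases with y, the complete-case mean overshoots E(Y).

```python
import math
import random

def simulate_cc_mean(n=100_000, seed=1):
    """Draw Y ~ N(0, 1) and observe each Y_i with probability
    increasing in y (a logistic selection model), then return
    (complete-case mean, full-data mean)."""
    rng = random.Random(seed)
    ys, obs = [], []
    for _ in range(n):
        y = rng.gauss(0.0, 1.0)
        p_obs = 1.0 / (1.0 + math.exp(-y))   # P(R = 1 | y), increasing in y
        ys.append(y)
        obs.append(1 if rng.random() < p_obs else 0)
    cc_mean = sum(y for y, r in zip(ys, obs) if r) / sum(obs)
    full_mean = sum(ys) / n
    return cc_mean, full_mean
```

Running it, the complete-case mean lands well above the full-data mean (which is near 0), consistent with E(Y | R = 1) > E(Y).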
Available-Case Analysis

Sometimes what we need to estimate doesn't really require a "rectangular" dataset.

- If you can, just use whatever data are available for computing what you need
- Davidian and Tsiatis talk about generalized estimating equations (GEEs) and their Example 3 in Section 1.4 (we'll cover this when we get to Chapter 5)
- K normal random variables: under some missing-data assumption, it seems we could still obtain a good estimate of the distribution, as it only depends on univariate and bivariate quantities (means, variances, covariances)
Example of Available-Case Analysis

- Say the data are
  - Z_i = (Y_i1, ..., Y_iK)
  - R_i = (R_i1, ..., R_iK)
- Available-case estimators:

  µ̂_j^ac = ∑_{i=1}^n Y_ij R_ij / ∑_{i=1}^n R_ij,   j = 1, ..., K

  σ̂_jk^ac = ∑_{i=1}^n (Y_ij − µ̂_j^ac)(Y_ik − µ̂_k^ac) R_ij R_ik / (∑_{i=1}^n R_ij R_ik − 1),   j, k = 1, ..., K

- Better than complete-case analysis
- Valid under MCAR, but what are the minimal assumptions on the missing-data mechanism for this to be valid?
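The two estimators above translate directly to code. A minimal Python sketch (function names are mine): each mean uses only the entries observed for that variable, and each covariance uses only the pairwise-complete entries.

```python
def available_case_mean(y, r):
    """mu_hat_j^ac: mean of Y_j over the cases with R_ij = 1."""
    return sum(yi * ri for yi, ri in zip(y, r)) / sum(r)

def available_case_cov(yj, rj, yk, rk):
    """sigma_hat_jk^ac: covariance of Y_j and Y_k over the cases where
    both are observed, centred at the available-case means, with
    divisor (number of pairwise-complete cases - 1) as on the slide."""
    mj = available_case_mean(yj, rj)
    mk = available_case_mean(yk, rk)
    both = [a * b for a, b in zip(rj, rk)]   # R_ij * R_ik
    num = sum((yja - mj) * (yka - mk) * b
              for yja, yka, b in zip(yj, yk, both))
    return num / (sum(both) - 1)
```

Note a quirk of this construction: the means inside the covariance are computed from possibly different cases than the pairwise-complete set, which is one reason the resulting covariance matrix need not be positive semi-definite.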
Complete-Case and Available-Case Analysis

The moral:

- Complete-case analysis is wasteful and, most likely, invalid
- Available-case analysis is better, but still requires MCAR, or possibly a weaker assumption depending on what we need to compute
Imputation

- Idea: plug something "reasonable" into the holes of the dataset, then run the intended analysis with the completed data
Mean Imputation

- Numeric variables
- Impute the mean of the observed values
- Corresponds to imputing an estimate of E(Y_j | R_j = 1), j = 1, ..., K
- Leads to valid point estimates of means under MCAR
- Underestimates the true variance of estimators
Mean Imputation

Say the data are

- {(Z_i, R_i)}_{i=1}^n i.i.d. ~ F
- Z_i = (Y_i1, ..., Y_iK)
- R_i = (R_i1, ..., R_iK)

Mean imputation:

- Compute

  µ̂_j^1 = ∑_{i=1}^n Y_ij R_ij / ∑_{i=1}^n R_ij,   j = 1, ..., K

- Impute Y_ij with µ̂_j^1 whenever R_ij = 0
- Run your analysis as if your data were fully observed
Mean Imputation

  Before:               After mean imputation:

  Age   Income          Age        Income
  25    60,000          25         60,000
  ?     ?        =⇒     µ̂_Age^1   µ̂_Income^1
  51    ?               51         µ̂_Income^1
  ?     150,300         µ̂_Age^1   150,300
  ...   ...             ...        ...
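The table above can be produced by a few lines of code. A minimal Python sketch (names and the use of None for missing values are my choices), assuming every column has at least one observed value:

```python
def mean_impute(rows):
    """Column-wise mean imputation: replace each None with the mean
    of the observed values in that column (mu_hat_j^1)."""
    ncol = len(rows[0])
    means = []
    for j in range(ncol):
        obs = [row[j] for row in rows if row[j] is not None]
        means.append(sum(obs) / len(obs))   # assumes obs is non-empty
    return [[row[j] if row[j] is not None else means[j]
             for j in range(ncol)]
            for row in rows]

# The Age/Income example from the slide:
data = [[25, 60000], [None, None], [51, None], [None, 150300]]
completed = mean_impute(data)
```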
Example: Estimating a Mean

- Estimating a mean after mean imputation corresponds to using the estimator

  µ̂_j^mimp = (1/n) ∑_{i=1}^n [Y_ij R_ij + µ̂_j^1 (1 − R_ij)]

- µ̂_j^mimp is the mean of the imputed data, so its naïvely estimated variance is

  V̂_naïve(µ̂_j^mimp) = V̂_naïve(Y_j) / n,

  where

  V̂_naïve(Y_j) = (1/(n − 1)) ∑_{i=1}^n [R_ij (Y_ij − µ̂_j^mimp)² + (1 − R_ij)(µ̂_j^1 − µ̂_j^mimp)²]

- HW1: show that µ̂_j^mimp = µ̂_j^1
Example: Estimating a Mean

As a consequence, using the mean imputation method we:

- Underestimate the variance of each variable:

  V̂_naïve(Y_j) = (1/(n − 1)) ∑_{i=1}^n R_ij (Y_ij − µ̂_j^1)²

- Compare with an estimate based on the available cases:

  V̂^1(Y_j) = ∑_{i=1}^n R_ij (Y_ij − µ̂_j^1)² / (∑_{i=1}^n R_ij − 1)

- =⇒ V̂_naïve(Y_j) ≤ V̂^1(Y_j)
Example: Estimating a Mean

As a consequence, using the mean imputation method we:

- Underestimate the variance of µ̂_j^mimp:

  V̂_naïve(µ̂_j^mimp) = (1/(n(n − 1))) ∑_{i=1}^n R_ij (Y_ij − µ̂_j^1)²

- Compare with an estimate based on the available cases:

  V̂^1(µ̂_j^mimp) = ∑_{i=1}^n R_ij (Y_ij − µ̂_j^1)² / [(∑_{i=1}^n R_ij)(∑_{i=1}^n R_ij − 1)]

- =⇒ V̂_naïve(µ̂_j^mimp) ≤ V̂^1(µ̂_j^mimp)
- HW1: comment on the implications of mean imputation for the construction of confidence intervals
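The inequality V̂_naïve(Y_j) ≤ V̂^1(Y_j) is visible numerically: both estimates share the same sum of squared residuals over observed cases, but the naïve one divides by n − 1 instead of the number of observed cases minus 1. A Python sketch (the simulation setup and names are illustrative, assuming MCAR with about half the values missing):

```python
import random

def naive_vs_available_var(n=50_000, p_obs=0.5, seed=2):
    """Compare the naive post-mean-imputation variance estimate of Y_j
    with the available-case variance estimate, for Y ~ N(0, 1) with
    values missing completely at random."""
    rng = random.Random(seed)
    y = [rng.gauss(0.0, 1.0) for _ in range(n)]
    r = [1 if rng.random() < p_obs else 0 for _ in range(n)]
    m = sum(r)                                             # observed cases
    mu1 = sum(yi * ri for yi, ri in zip(y, r)) / m         # mu_hat_j^1
    ss = sum(ri * (yi - mu1) ** 2 for yi, ri in zip(y, r))
    v_naive = ss / (n - 1)   # treats the imputed values as real data
    v_avail = ss / (m - 1)   # uses only the observed cases
    return v_naive, v_avail
```

With roughly half the values imputed, the naïve estimate comes out near half the true variance, while the available-case estimate stays near 1; confidence intervals built from the naïve variance are therefore too short.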
Mode Imputation

- Categorical variables
- Impute the mode of the observed values
- Artificially inflates the frequency of the mode
- Leads to valid point estimates of marginal modes under MCAR
- Underestimates the true variance of estimators
Regression Imputation

- Regress one variable on the others based on observed data, then impute predicted values from the model
- Corresponds to imputing an estimate of E(Y_j | y_−j, R = 1_K), where y_−j = (y_1, ..., y_{j−1}, y_{j+1}, ..., y_K)
- Valid for means under MCAR
- Underestimates the true variance of estimators
- Validity depends on the model used for imputation
Example of Regression Imputation in Davidian and Tsiatis

- Z = (Y_1, Y_2), baseline and follow-up, Y_1 always observed
- R: indicator of response for Y_2
- Goal: to estimate µ_2 = E(Y_2)
- Say we posit a linear model E(Y_2 | y_1) = β_0 + β_1 y_1
- Impute Y_i2 with Ŷ_i2 = β̂_0 + β̂_1 Y_i1 when R_i = 0, with β̂_0 and β̂_1 obtained via least squares among the complete cases
- The regression imputation estimator for µ_2 is

  µ̂_2^rimp = (1/n) ∑_{i=1}^n [Y_i2 R_i + Ŷ_i2 (1 − R_i)]

- When is this valid? (When does µ̂_2^rimp → µ_2 as n → ∞?)
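The estimator µ̂_2^rimp can be sketched in a few lines of Python (names are mine; missing Y_2 entries are passed as placeholder values that the R_i flags mark as unobserved):

```python
def regression_impute_mean(y1, y2, r):
    """mu_hat_2^rimp: fit Y2 ~ Y1 by least squares on the complete
    cases (r[i] = 1), impute fitted values for the missing Y2, and
    average observed and imputed values together.
    Entries of y2 with r[i] = 0 are placeholders and are ignored."""
    cx = [x for x, ri in zip(y1, r) if ri]   # complete-case Y1
    cy = [y for y, ri in zip(y2, r) if ri]   # complete-case Y2
    m = len(cx)
    xbar, ybar = sum(cx) / m, sum(cy) / m
    b1 = sum((x - xbar) * (y - ybar) for x, y in zip(cx, cy)) \
         / sum((x - xbar) ** 2 for x in cx)
    b0 = ybar - b1 * xbar
    n = len(y1)
    total = sum(y2[i] if r[i] else b0 + b1 * y1[i] for i in range(n))
    return total / n
```

For instance, if the complete cases lie exactly on Y_2 = 1 + 2·Y_1, the imputed values extend that line and the estimator averages observed and fitted values.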
Example of Regression Imputation in Davidian and Tsiatis

Davidian and Tsiatis show that for µ̂_2^rimp → µ_2 as n → ∞ (µ̂_2^rimp →_p µ_2), we need these two requirements to hold simultaneously:

- E(Y_2 | y_1, R = 1) = E(Y_2 | y_1) (implied by MAR)
- E(Y_2 | y_1) is correctly specified, i.e., there really exist β_0* and β_1* such that E(Y_2 | y_1) = β_0* + β_1* y_1

However, even if these two conditions hold, single imputation leads to underestimation of variances, as seen with mean imputation.
Hot-Deck Imputation

- Replace missing values of a non-respondent (called the recipient) with observed values from a respondent (the donor)
- Recipient and donor need to be similar with respect to the variables observed for both cases
  - The donor can be selected randomly from a pool of potential donors
  - A single donor can be identified, e.g. a "nearest neighbour" based on some metric
- Andridge & Little (2010, Int. Stat. Rev.) reviewed this approach and concluded that
  - General patterns of missingness are difficult to deal with ("Swiss cheese" pattern)
  - There is a lack of theory to support this method
  - There is a lack of comparisons with other methods
  - Uncertainty from imputation is not taken into account (underestimation of variances)
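The nearest-neighbour variant mentioned above can be sketched as follows (a minimal Python illustration with names of my choosing; it assumes a single fully observed auxiliary variable and absolute distance as the metric):

```python
def hot_deck_impute(rows, target, aux):
    """Nearest-neighbour hot deck: for each recipient (a row whose
    `target` column is None), copy the target value from the donor
    whose `aux` value is closest. Assumes `aux` is fully observed
    and at least one donor exists."""
    donors = [row for row in rows if row[target] is not None]
    out = []
    for row in rows:
        if row[target] is None:
            donor = min(donors, key=lambda d: abs(d[aux] - row[aux]))
            row = list(row)            # copy, leave the input untouched
            row[target] = donor[target]
        out.append(row)
    return out

# Income (column 1) imputed from the donor with the closest age (column 0):
people = [[25, 60000], [51, None], [30, 70000]]
completed = hot_deck_impute(people, target=1, aux=0)
```

Note this single-imputation version shares the main criticism on the slide: treating the donated value as if it were observed ignores the imputation uncertainty.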
Last Observation Carried Forward

- Common in settings where a variable is measured repeatedly over time and there is dropout
- If there is dropout at time j, we don't observe Z_j, Z_{j+1}, ..., Z_T
- LOCF: replace all of Z_j, Z_{j+1}, ..., Z_T with Z_{j−1}
Last Observation Carried Forward

Example from Davidian and Tsiatis (figure): solid lines show observed data; dashed lines show data extrapolated with LOCF.
Last Observation Carried Forward

Attempts to justify LOCF:

- Interest in the last observed outcome measure (reasonable in some contexts?)
- Under some assumptions, it will lead to a conservative analysis
  - Say we have a clinical trial where the outcome under treatment is expected to improve over time
  - If the treatment is found to be superior even with LOCF, then the true effect should be even larger
  - This relies on the assumption of monotonic improvement over time!
Example of LOCF in Davidian and Tsiatis

Study participants' characteristic to be measured at T times:

- Y_j: measurement taken at time t_j
- D: participant dropout time
- Interest: µ_T = E(Y_T)
- The LOCF estimator of the mean is

  µ̂_T^LOCF = (1/n) ∑_{i=1}^n ∑_{j=1}^T I(D_i = j + 1) Y_ij

- The expected value of the LOCF estimator of the mean is

  E(µ̂_T^LOCF) = µ_T − ∑_{j=1}^{T−1} E[I(D = j + 1)(Y_T − Y_j)],

  so µ̂_T^LOCF is biased, in general
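The carry-forward mechanics and the estimator can be sketched in Python (names are mine; None marks measurements after dropout, and every participant is assumed to have at least one observation, in which case averaging the carried-forward final values matches the estimator above):

```python
def locf(series):
    """Carry the last observed value forward along one participant's
    series; None marks the times after dropout."""
    out, last = [], None
    for v in series:
        if v is not None:
            last = v
        out.append(last)
    return out

def locf_mean_at_T(all_series):
    """mu_hat_T^LOCF: average, over participants, of the carried-forward
    value at the final time T."""
    finals = [locf(s)[-1] for s in all_series]
    return sum(finals) / len(finals)
```

If outcomes tend to improve after dropout occurred, the carried-forward values sit below the unobserved Y_T, which is exactly the bias term ∑_j E[I(D = j + 1)(Y_T − Y_j)] on this slide.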
Summary

Main take-aways from today's lecture:

- Complete-case analyses are wasteful and potentially invalid unless MCAR holds
- Available-case analyses make better use of the available data, but still require MCAR (weaker assumptions may suffice, depending on the model/quantity being used/estimated)
- Imputation methods might be valid for some quantities under MCAR, but variances are underestimated =⇒ overconfidence in your results!

Next lecture:

- R session 1: imputation methods, some simulation studies
- Bring your laptops!