Propensity Score Methods for Longitudinal Data Analyses: General background, rationale and illustrations*
Bob Pruzek, University at Albany SUNY
Summary
Propensity Score Analysis (PSA) was introduced by Rosenbaum & Rubin (Biometrika, 1983). Since then PSA has become one of the most studied and frequently used new methods in statistics. Hundreds of papers have appeared covering philosophy, statistical theory and a wide variety of applications (especially in health science & medicine). I focus on background and logical foundations of PSA, then present & discuss graphics that illustrate various PSA methods; lastly, I describe with examples how conventional PSA methodology can be extended to accommodate longitudinal data analysis. It is noted that while longitudinal PSA often entails notable complications, special advantages can accrue to LDA-PSA if attention is given to certain aspects of observational study design.
*Talk for INTEGRATIVE ANALYSIS OF LONGITUDINAL STUDIES OF AGING Conference, Victoria, BC June 2010
PSA is based on the same logic that underpins analyses of true experiments. In true experiments, units are randomly allocated to (two) treatment groups at the outset of study, that is, before the treatments begin. Randomization, in the words of R. A. Fisher, provides the ‘reasoned basis for causal inference’ in experiments. Randomization ensures that units in the two treatment groups do not differ systematically on any covariate, which is why this operation supports causal interpretations: when one group scores notably higher than another on ultimate response variable(s), this can (with qualifications*) be attributed to treatment differences; random assignment tends to make alternative explanations implausible. *Three caveats, at least, are in order: 1. Randomization can go awry in practice, particularly when samples are not large; 2. Much depends on details of how experiments are run; & 3. To say that “treatments caused differences” is not to say that one knows what feature(s) of the treatments had the noted effects. Statisticians generally study ‘effects of causes,’ not ‘causes of effects’.
Observational studies entail comparison of groups not formed using randomization; units are said to “select their own treatments.” This means that observational studies give rise to a greater likelihood of Selection Bias (SB). SB refers to systematic covariate differences between groups, differences that can confound attempts to interpret response variable differences. SB is the central problem that propensity score analysis aims to reduce, if not eliminate (usually – but not always – in the context of observational studies). This tends to be facilitated if one conceptualizes each observational study as having arisen from a (complex) randomized experiment. Three people have written key articles and books that underpin propensity score methods: William Cochran; his student Donald Rubin; and then his student, Paul Rosenbaum. A review of one of Cochran’s studies, done 40 years ago, is worth brief examination.
Cochran (1968) compared death rates of smokers and non-smokers. It had been found, using unstratified data, that death rates for smokers and non-smokers were nearly identical (evidence that many smokers and manufacturers of tobacco products found greatly to their liking). Cochran decided to reanalyze the data, stratifying both smokers & non-smokers by age before computing death rates. After age-based stratification, he re-calculated death rates. This led to the finding that death rates among smokers were on average 40-50% higher than for non-smokers! Moreover, this was for very large samples. Results of this kind represent early versions of what now can be seen as propensity score analysis. The advent of modern PSA methods helps investigators adjust for multiple covariates, not just one as in Cochran’s case.
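Cochran-style age adjustment can be sketched with a small invented data set (the counts below are hypothetical, not Cochran’s): compute crude death rates, then recompute stratum-specific rates and average them over a common (standard) age distribution.

```python
# Hypothetical (deaths, n) per age stratum; smokers are concentrated in the
# younger strata, which masks their higher within-stratum death rates --
# the pattern that age-based stratification exposes.
smokers     = {"young": (15, 1000), "middle": (48, 800), "old": ( 60,  200)}
non_smokers = {"young": ( 2,  200), "middle": (32, 800), "old": (200, 1000)}

def crude_rate(groups):
    deaths = sum(d for d, n in groups.values())
    total = sum(n for d, n in groups.values())
    return deaths / total

def adjusted_rate(groups, weights):
    # Weight stratum-specific rates by a common (standard) age distribution.
    return sum(weights[s] * d / n for s, (d, n) in groups.items())

# Standard weights: the combined age distribution of the two groups.
total_n = {s: smokers[s][1] + non_smokers[s][1] for s in smokers}
grand = sum(total_n.values())
weights = {s: n / grand for s, n in total_n.items()}

print(crude_rate(smokers), crude_rate(non_smokers))
print(adjusted_rate(smokers, weights), adjusted_rate(non_smokers, weights))
```

With these invented numbers the crude comparison actually favors smokers, while the age-adjusted smoker rate is roughly 50% higher, echoing the direction of Cochran’s finding.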
When there are many potential confounding variables in an observational study, direct stratification is unwieldy because the number of ‘cells’ associated with the crossing of covariates is often huge; also, missing values will be found in many cells. Nevertheless, numerous covariates can be expected to confound interpretations. For many years analysts found it especially difficult to account for confounding effects. The key breakthrough came when Rosenbaum and Rubin (1983) showed how to produce a single variable, a propensity score, the use of which could greatly simplify treatment comparisons in observational studies. They noted that conditions may exist where treatment assignment Z (binary) is independent of potential outcomes* Y(0) & Y(1), conditional on observed baseline covariates, X. That is, (Y(1), Y(0)) ⫫ Z | X, if 0 < P(Z=1|X) < 1. This condition was defined as strong ignorability – which essentially means that all covariates that affect treatment assignment are included in X. *Reference to ‘potential outcomes’ invokes counterfactual logic.
These authors defined the propensity score e(X) (a scalar function of X) as the probability of treatment assignment, conditional on observed baseline covariates: e(X) = e_i = Pr(Z_i = 1 | X_i). They then demonstrated that the propensity score is a balancing score, meaning that, conditional on the propensity score, the distribution of measured baseline covariates is similar for the treated & untreated (or treatment and control) subjects. Therefore (Y(1), Y(0)) ⫫ Z | e(X), an analog of the preceding expression. In effect, e(X) summarizes the information in X. Rosenbaum and Rubin rely strongly on the assumption of strong ignorability. In practice, the preceding leads to an interest in estimating the (scalar) propensity score from the (vector) of (appropriately chosen) covariates, say X, so that comparisons of treatment and control response score distributions can be made conditional on an estimated propensity score. The most common method for estimating e(X) entails use of logistic regression (LR).
In practice, there are two main Phases of a propensity score analysis. In Phase I, pre-treatment covariates are used to construct a scalar variable, a propensity score, that summarizes key differences among units (or respondents) with respect to the two* treatments being compared. Generally the fitted values produced in logistic regression are taken as estimates of propensity scores, the e(X)’s: e(X) = 1/(1 + e^{-(linear function of covariates)}). These e(X)’s are then used in Phase II in either of two ways: units in the treatment and control groups are either matched or stratified (sorted); then the two groups are compared on one or more outcome measures, conditional on the matches or strata. For matching, an algorithm or rule is used to match individuals in the T & C groups whose P-scores are “reasonably close” to one another; numerous methods are available. With stratification, responses of units in the two groups are compared within propensity-based strata. Both methods are illustrated below. *Except for recent work, nearly all PSAs to date have focused on two groups. See my wiki: propensityscoreanalysis.pbworks.com
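The Phase II matching step can be sketched in a few lines. In R the e(X)’s would typically be fitted values from a logistic regression (e.g., via glm with a binomial family); here the scores, units, and caliper value are all hypothetical, and a simple greedy nearest-neighbor rule stands in for the many matching methods available.

```python
# Greedy 1:1 nearest-neighbor matching on estimated propensity scores.
# Scores below are hypothetical; in practice they are fitted values from a
# logistic regression of treatment indicator on baseline covariates.
treated = {"t1": 0.62, "t2": 0.35, "t3": 0.80}
control = {"c1": 0.60, "c2": 0.33, "c3": 0.50, "c4": 0.95}
CALIPER = 0.10  # maximum allowed |e_T - e_C| for an acceptable match

def greedy_match(treated, control, caliper):
    pairs, available = [], dict(control)
    # Process treated units in score order; each control is used at most once.
    for t, et in sorted(treated.items(), key=lambda kv: kv[1]):
        if not available:
            break
        c, ec = min(available.items(), key=lambda kv: abs(kv[1] - et))
        if abs(ec - et) <= caliper:
            pairs.append((t, c))
            del available[c]
    return pairs

print(greedy_match(treated, control, CALIPER))  # → [('t2', 'c2'), ('t1', 'c1')]
```

Note that t3 remains unmatched: its nearest available control differs by more than the caliper, illustrating how matching trades sample size for balance.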
The following slide exhibits a flow chart showing how propensity score analysis proceeds when comparing two groups (to be read counterclockwise from the NW corner).
• Covariate selection is central. Once the T & C groups have been defined, the key problem is to decide what covariates should be balanced re: the T & C comparison. Theory and prior evidence come into play. Use of all relevant covariates is advised; they should relate to the ultimate response variable, as well as the T vs. C distinction.
• Logistic regression modeling should consider main effects as well as interactions (based on substantive relevance, and empirics).
• Once propensity scores have been calculated, it is helpful to demonstrate overlap of P-score distributions for the T & C groups.
• Either or both matching and stratification are generally used for analyses; the estimands, however, differ in the two cases (ATT, ATE).
• Outcomes are readily compared across the range of P-scores; see the loess graphic that follows. For matched data, either dependent or independent sample statistical methods and graphics may be used.
The next slide illustrates a Phase II analysis, where loess regression was used to compare infant birth weights of mothers who smoked (treatment group) with mothers who did not. Birth weights (in lbs.) are plotted (vertical) against LR-derived propensity scores (horizontal) for n = 189 infants. Two loess regression lines (dashed and solid) are shown, for infants of smoking (darkened points) and non-smoking (open circles) mothers. Vertical dashed lines depict eight quantile-based strata; effects are assessed within strata (and then averaged). In this case, after adjusting for covariate effects using P-scores, it is seen that birth weights are notably lower for infants whose mothers smoked than for controls. (Notably, overlap of the two P-score distributions provided reasonable ‘support’ for the comparison, and all covariates were reasonably balanced across the eight P-score strata.) To complete the illustration, note that the Average Treatment Effect was 0.84 lbs., and the 95% CI yields the limits (0.30, 1.38) – failing to span zero. (The graphic is based on function loess.psa from the PSAgraphics package (R).)
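The “assess within strata, then average” step used by that display can be sketched as follows. The records and the number of strata are hypothetical (the slide’s analysis used eight quantile-based strata on the real birth-weight data); the sketch simply cuts sorted scores into equal-count strata and averages the within-stratum mean differences.

```python
from statistics import mean

# Hypothetical records: (propensity score, treated indicator, outcome).
data = [
    (0.15, 1, 6.1), (0.12, 0, 7.0), (0.18, 0, 6.9), (0.22, 1, 6.0),
    (0.45, 1, 6.4), (0.40, 0, 7.3), (0.52, 0, 7.1), (0.48, 1, 6.2),
    (0.75, 1, 6.6), (0.70, 0, 7.5), (0.82, 0, 7.6), (0.78, 1, 6.5),
]

def stratified_effect(data, n_strata=3):
    # Sort by propensity score and cut into equal-count (quantile-style) strata.
    data = sorted(data)
    size = len(data) // n_strata
    effects = []
    for i in range(n_strata):
        stratum = data[i * size:(i + 1) * size]
        t = [y for _, z, y in stratum if z == 1]
        c = [y for _, z, y in stratum if z == 0]
        if t and c:  # a stratum lacking one group contributes nothing
            effects.append(mean(c) - mean(t))  # control minus treated
    return mean(effects)  # simple average across strata (ATE-style estimate)

print(round(stratified_effect(data), 3))
```

A production analysis would also weight strata by size and check covariate balance within each stratum before trusting the average.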
The next slide illustrates matching* in an observational study by Morton et al. (1982, Amer. Jour. Epidemiology, p. 549 ff.) that entailed an especially simple form of propensity score analysis. Children of parents who had worked in a factory where lead was used in making batteries were matched by age and neighborhood with children whose parents did not work in lead-based industries. Whole blood was assessed for lead content to provide responses. Results shown compare blood of Exposed with that of Control children in what can be seen as a paired samples design. Conventional dependent sample analysis shows that the (95%) C.I. for the population mean difference is far from zero (see line segment, lower left). The mean difference score is 5.78; results support the conclusion that a parent’s lead-related occupation can ‘cause’ lead to be found in their children’s blood. *Using function granova.ds in package granova (R). The heavy black line on the diagonal corresponds to X = Y, so if X > Y its point lies below the identity line. Parallel projections to the lower left line segment show the distribution of difference scores corresponding to the pairs; the red dashed line shows the average difference score, and the green line segment shows the 95% C.I.
A graphic allows one to go beyond a numerical summary. In this case note the wide dispersion of lead measurements for exposed children in comparison with their control counterparts. A follow-up showed that parental hygiene differed greatly across the battery-factory parents, and the variation in hygiene accounted in large measure for the dispersion of their children’s lead measurements (a finding made possible because of Morton’s close attention to detail in initial data collection). Although Control & Exposed children may differ in other ways (than age and neighborhood of residence), these data seem persuasive in showing that lead-based battery factory work puts children at risk for high levels of blood lead -- except when the personal hygiene of the worker is effective. Rosenbaum (2002), who discusses this example in detail, uses a sensitivity analysis to show that hidden bias would have to be substantial to explain away a difference this large. Sensitivity analyses can be essential to the wrap-up of a PSA study. In summary, these observational data appear to provide valuable evidence to support causal conclusions re: the hypothesis.
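The dependent-sample interval behind a display like granova.ds reduces to a paired-t computation on the difference scores. The paired values below are invented for illustration (they are not Morton’s data), and the t critical value for the chosen n is hard-coded since the Python standard library has no t quantile function.

```python
from math import sqrt
from statistics import mean, stdev

# Hypothetical paired blood-lead values (exposed, matched control).
pairs = [(38, 16), (23, 18), (41, 18), (18, 24), (37, 19),
         (36, 11), (23, 10), (62, 15), (31, 16), (34, 18)]
diffs = [x - y for x, y in pairs]  # exposed minus control, per pair

n = len(diffs)
d_bar = mean(diffs)                  # average difference score
se = stdev(diffs) / sqrt(n)          # standard error of the mean difference
t_crit = 2.262                       # t_{0.975, df=9}; hard-coded constant
ci = (d_bar - t_crit * se, d_bar + t_crit * se)

print(d_bar, ci)  # an interval excluding zero supports a nonzero mean effect
```

With these invented numbers the interval lies entirely above zero, paralleling the slide’s conclusion that the 95% C.I. for the mean difference is far from zero.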
The basic ideas of PSA have been simplified in order to focus on key principles and methods central to modern-day propensity score applications. Recent PSA investigations have begun to move beyond comparison of two treatments, to compare three or more (however, most authors assume an underlying continuum, e.g., dose-response groups). Multilevel methods for PSA have begun to be published, as have methods for studying mediation; the role of stratification has also begun to see attention. A few studies have been aimed at missing data imputation methods, including multiple imputation. Pearl (2010), in particular, has formalized basic ideas to help bridge the gap between mainstream PSA methods & structural/graphical modeling. To date, only a handful of authors seem to have addressed the central issue of this conference, viz., analysis of longitudinal data -- in particular, P-score methodology to compare treated & control groups for observational data after adjusting for confounding covariates. In what follows, I use a basic illustration to show how the preceding methodology can be extended to deal with longitudinal data comparisons. The next slide sets the stage for how this might be done.
Longitudinal data are shown for four individuals over 5 time waves, where slopes & intercepts are readily discerned. For data like these, where T & C groups could start prior to time 1, and key covariate data are available for all units, P-scores could be generated to assess T vs. C effects. Statistics that describe profiles could be used as responses in PSA. A key is to find statistics sufficient to characterize trajectories (regardless of the # of data waves). In this way, LDA versions of PSA may be straightforwardly generalized, moving from univariate to multivariate PSA.
Panels distinguish 4 persons at 5 time points, w/ a common response for all.
Assuming straight-line regression for all panels, two statistics are sufficient regardless of the # of waves; moreover, many fitted (smoothed) curves might entail few (often no more than 3 or 4) statistics to characterize time trends for assessments of treatment effects. In such cases PSA can be generalized to (low-dimensional) multivariate analysis to support observational LD analyses. Smoothing is the key; let us consider this topic next.
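Reducing each straight-line profile to its two sufficient statistics is ordinary least squares per individual. The sketch below uses hypothetical 5-wave profiles; each collapses to an (intercept, slope) pair that could then serve as a bivariate response in a PSA.

```python
from statistics import mean

# Hypothetical profiles: response at waves t = 1..5 for four individuals.
profiles = {
    "A": [2.0, 2.5, 3.1, 3.4, 4.0],
    "B": [5.0, 4.6, 4.1, 3.5, 3.0],
    "C": [1.0, 2.0, 3.0, 4.0, 5.0],
    "D": [3.0, 3.0, 3.1, 2.9, 3.0],
}

def line_stats(y):
    # Ordinary least-squares intercept and slope for equally spaced waves.
    t = list(range(1, len(y) + 1))
    tb, yb = mean(t), mean(y)
    slope = sum((ti - tb) * (yi - yb) for ti, yi in zip(t, y)) / \
            sum((ti - tb) ** 2 for ti in t)
    return (yb - slope * tb, slope)  # (intercept, slope)

# Each 5-wave profile collapses to two statistics, usable as PSA responses.
summaries = {k: line_stats(v) for k, v in profiles.items()}
print(summaries["C"])  # → (0.0, 1.0)
```

The same reduction works for any number of waves; only the sufficiency of the two statistics depends on the straight-line assumption.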
Given the preceding approaches for extending PSA to longitudinal data for observational studies, consider several further points:
1. A wide variety of so-called growth models are available to characterize longitudinal profiles; much recent work in this field has aimed at developing generalizations that extend the reach of models;
2. Some authors have focused on smoothing profiles -- in two distinctive ways: a. smoothing individual profiles by taking advantage of dependencies among adjacent or closely related observations in profiles, and b. smoothing by capitalizing on similarities among profiles for individuals; such smoothing entails ‘borrowing strength’ from mutually related profiles. Double smoothing may also be employed.
3. Those who model as in 1. are often advised to smooth initially when individual observations are subject to ‘considerable noise’.
4. Model-based predictions (or fitted versions) of initial profiles, or at least their smoothed counterparts, are likely to be better targets for (PSA) studies than would be initial (raw-data) profiles.
5. As in the case of simpler forms of PSA, sensitivity analyses will generally be advisable. (Rosenbaum, in his two books, considers this topic closely.)
6. A great deal of work on PSA remains to be done for longitudinal problems, and there are many opportunities for analysis in this area.
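The within-profile smoothing of point 2a can be as simple as a running mean over adjacent waves. The sketch below (with an invented noisy profile) replaces each interior point by the average of itself and its two neighbors, leaving the endpoints alone; real applications would use loess or a model-based smoother.

```python
# Minimal within-profile smoother: a length-3 running mean on interior points.
def smooth(y):
    if len(y) < 3:
        return list(y)
    out = [y[0]]  # endpoints are kept as-is in this simple version
    for i in range(1, len(y) - 1):
        out.append((y[i - 1] + y[i] + y[i + 1]) / 3)
    out.append(y[-1])
    return out

noisy = [2.0, 3.5, 2.5, 4.0, 3.0, 4.5, 4.0]  # hypothetical noisy profile
print(smooth(noisy))
```

Smoothed profiles (or repeated applications of a smoother) can then be reduced to a few statistics, as above, before P-scores are brought in.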
These four panels illustrate possibilities. They correspond to subsets of profiles, each for five animals, as clustered (see below). In particular, three principal components were derived from initially smoothed profiles; these PCs were in turn used to get doubly smoothed profiles, computed as linear combinations of the PCs. Each profile can be fully described using 4 coefficients: an intercept & three PC regression coefficients. Clusters were based on these coefficients (in R).
Given appropriate covariate data for each animal, these might be used for an observational study comparison of animals whose diets differed from one another, i.e., using constructed P-scores. (All initial responses were measures of protein in milk over five weeks. These data are part of the Milk dataset in the nlme package; they are real.) As seen here, smoothing can work especially well for some data. Exploratory approaches (based on underlying components or latent variables) may often permit creation of smoothed versions of either original or pre-smoothed profiles. LDA versions of PSA may readily follow.
Although time may not permit discussion, it may be useful to make some further observations pertaining to mainstream IALSA interests. Consider an example of an observational comparison of two groups (as suggested by S. Hofer): “Does engagement in intellectually challenging tasks, exercise, [or] social networks help to maintain cognitive functioning in later life?” There have been a number of analyses of longitudinal observational studies and experimental studies of this topic. There is some evidence to suggest that physical activity enhances cognitive performance. Suppose we revisit such a question using a modern PSA approach. Let us limit attention to one measure of cognitive functioning and a clearly defined treatment (that could be a combination, but might be limited to one behavior, say engagement in exercise). Given a clear distinction between two groups, one of which will not have exercised (self-report?), and one of which will (at some defined level of rigor and regularity), we might aim to adjust for (all) relevant covariate differences. This is the hardest part, one that might ideally be done using a prospective approach, where one could have the luxury of naming in advance all covariates that seem likely to confound interpretations of T vs. C effects. Archival data might also be used, with the proviso that such data rarely contain all variables that ultimately matter (think of the critics). If cognitive functioning scores are available for individuals before and after commencement of exercise, initial covariate scores might be used in the construction of P-scores.
It is almost inevitable in practice that some covariate and longitudinal data will have gone missing. This is one key reason imputation methods have garnered so much interest. (But the underlying theory that supports PSA methods, à la Rubin, is strongly based on counterfactual logic, where one of two potential outcomes will always be missing.) A special advantage for some longitudinal data sets is that missing LD values can be more reliably estimated than counterparts outside the longitudinal framework. This means that multiple LD imputations, and many products of analysis, may vary less than is typical. An additional concern is that responses (such as cognitive function scores) are likely to have been obtained at different times for different individuals; i.e., different spacings and different numbers of times as well. Smoothing can often help with such problems, in which case the ultimate data used in the PSA can begin from statistics that characterize smoothed profiles, not original data. (This step may also help ameliorate problems induced by measurement errors.) Naturally, imputations can be done in different ways, perhaps using different imputation models; and the same goes for smoothing. When multiple analyses of the same data are employed, one will want to learn how much results vary across methods. Further value may come from use of mixed models in analysis. Ultimately, longitudinal PS analysis of such data may lead to fairly strong conclusions about “treatment effects”, conditional on the extent to which covariates are strongly ignorable, and the extent to which results do not depend heavily on the particular methods used for analysis. Design is likely to be central.