Matching Methods for Causal Inference

Gary King
Institute for Quantitative Social Science, Harvard University
GaryKing.org

Microsoft Research, 1/19/2018

1 / 25
3 Problems, 3 Solutions

1. The most popular method (propensity score matching, used in 93,700 articles!) sounds magical:
   "Why Propensity Scores Should Not Be Used for Matching" (Gary King and Richard Nielsen)
2. Do powerful methods have to be complicated?
   "Causal Inference Without Balance Checking: Coarsened Exact Matching" (Political Analysis, 2011; Stefano M. Iacus, Gary King, and Giuseppe Porro)
3. Matching methods optimize either imbalance (≈ bias) or # units pruned (≈ variance); users need both simultaneously:
   "The Balance-Sample Size Frontier in Matching Methods for Causal Inference" (in press, AJPS; Gary King, Christopher Lucas, and Richard Nielsen)

2 / 25
Matching to Reduce Model Dependence
(Ho, Imai, King, Stuart, 2007: fig. 1, Political Analysis)

[Figure: treated (T) and control (C) units plotted as Outcome against Education (years); pruning the poorly matched controls reduces model dependence.]

3 / 25
Imbalance → Model Dependence → Researcher Discretion → Bias

• Qualitative choice from unbiased estimates = biased estimator
  • e.g., choosing from the results of 50 randomized experiments
  • Choosing based on "plausibility" is probably worse
• Conscientious effort doesn't avoid biases (Banaji 2013)
• People do not have easy access to their own mental processes or to the feedback needed to avoid the problem (Wilson and Brekke 1994)
• Experts overestimate their ability to control personal biases more than nonexperts do, and more prominent experts are the most overconfident (Tetlock 2005)
• "Teaching psychology is mostly a waste of time" (Kahneman 2011)

4 / 25
• Yi: dependent variable; Ti: treatment (1 = treated, 0 = control); Xi: confounders
• Treatment effect for treated observation i:

  TEi = Yi(1) − Yi(0) = observed − unobserved

• Estimate Yi(0) with Yj from a matched (Xi ≈ Xj) control
• Quantities of interest:
  1. SATT, the Sample Average Treatment effect on the Treated:

     SATT = Mean_{i ∈ {Ti = 1}} (TEi)

• Big convenience: follow preprocessing with whatever statistical method you'd have used without matching
• Pruning nonmatches makes control variables matter less: it reduces imbalance, model dependence, researcher discretion, and bias

5 / 25
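To make the estimator concrete, here is a minimal Python sketch of the SATT computation under these definitions, assuming a toy outcome vector, treatment indicator, and an already-computed treated-to-control match mapping (all names and numbers are illustrative, not from the slides):

```python
import numpy as np

# Hypothetical toy data: y = outcomes, t = treatment indicators (1 = treated, 0 = control).
y = np.array([5.0, 3.0, 6.0, 2.0, 4.0, 1.0])
t = np.array([1,   0,   1,   0,   1,   0])

# Assume matching already paired each treated unit i with one control j (X_i ~ X_j).
matches = {0: 1, 2: 3, 4: 5}   # treated index -> matched control index (illustrative)

# TE_i = Y_i(1) - Y_i(0), with the unobserved Y_i(0) estimated by the matched control's outcome.
te = np.array([y[i] - y[j] for i, j in matches.items()])

# SATT = mean of TE_i over the treated units.
satt = te.mean()
print(f"SATT estimate: {satt:.2f}")
```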
Types of Experiments

                          Complete Randomization    Fully Blocked
  Observed covariates     On average                Exact
  Unobserved covariates   On average                On average

Fully blocked dominates complete randomization for: imbalance, model dependence, power, efficiency, bias, research costs, robustness. E.g., Imai, King, Nall 2009: SEs 600% smaller!

Goal of Each Matching Method (in Observational Data)
• PSM: complete randomization
(wait, it gets worse)

6 / 25
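As an illustration of the table's "on average" vs. "exact" contrast, here is a small simulation sketch, assuming a single observed binary covariate with even block sizes (everything below is illustrative, not from the talk): complete randomization balances the covariate only in expectation, while blocking balances it exactly in every draw.

```python
import numpy as np

rng = np.random.default_rng(0)
x = np.repeat([0, 1], 50)                  # one observed binary covariate, 50 units per level

def complete_randomization(x, rng):
    t = np.zeros(len(x), dtype=int)
    t[rng.choice(len(x), size=len(x) // 2, replace=False)] = 1
    return t

def fully_blocked(x, rng):
    t = np.zeros(len(x), dtype=int)
    for level in np.unique(x):             # randomize separately within each covariate block
        idx = np.flatnonzero(x == level)
        t[rng.choice(idx, size=len(idx) // 2, replace=False)] = 1
    return t

def imbalance(x, t):                       # absolute difference in covariate means
    return abs(x[t == 1].mean() - x[t == 0].mean())

comp = [imbalance(x, complete_randomization(x, rng)) for _ in range(1000)]
block = [imbalance(x, fully_blocked(x, rng)) for _ in range(1000)]
print(f"mean imbalance, complete randomization: {np.mean(comp):.3f}")   # > 0, balanced only on average
print(f"mean imbalance, fully blocked:          {np.mean(block):.3f}")  # exactly 0 in every draw
```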
Method 1: Mahalanobis Distance Matching (Approximates Fully Blocked Experiment)

1. Preprocess (Matching)
   • Distance(Xc, Xt) = √[(Xc − Xt)′ S⁻¹ (Xc − Xt)]
   • (Mahalanobis is for methodologists; in applications, use Euclidean!)
   • Match each treated unit to the nearest control unit
   • Control units: not reused; pruned if unused
   • Prune matches if Distance > caliper
   • (Many adjustments available to this basic method)
2. Estimation: difference in means or a model

7 / 25
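A greedy one-to-one sketch of this preprocessing step in Python, on simulated data (variable names, sample sizes, and the caliper value are all illustrative assumptions; in practice one would use dedicated matching software):

```python
import numpy as np

rng = np.random.default_rng(1)
X_t = rng.normal(size=(20, 2))             # covariates of treated units
X_c = rng.normal(size=(200, 2))            # covariates of control units

# Sample covariance of the pooled covariates, inverted for the Mahalanobis metric.
S_inv = np.linalg.inv(np.cov(np.vstack([X_t, X_c]).T))

def mahalanobis(a, b):
    d = a - b
    return np.sqrt(d @ S_inv @ d)

caliper = 1.0
pairs, used = [], set()
for i, xt in enumerate(X_t):               # match each treated unit to the nearest unused control
    dists = [(mahalanobis(xc, xt), j) for j, xc in enumerate(X_c) if j not in used]
    d, j = min(dists)
    if d <= caliper:                       # prune matches beyond the caliper
        pairs.append((i, j))
        used.add(j)                        # controls are not reused; unused controls are pruned

print(f"{len(pairs)} matched pairs kept out of {len(X_t)} treated units")
```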
[Figure: Mahalanobis Distance Matching example. Treated (T) and control (C) units plotted by Education (years) and Age; each treated unit is matched to its nearest control and the remaining controls are pruned.]

9 / 25
Method 2: Coarsened Exact Matching (Approximates Fully Blocked Experiment)

1. Preprocess (Matching)
   • Temporarily coarsen X as much as you're willing
     • e.g., Education (grade school, high school, college, graduate)
   • Apply exact matching to the coarsened X, C(X)
     • Sort observations into strata, each with unique values of C(X)
     • Prune any stratum with 0 treated or 0 control units
   • Pass on the original (uncoarsened) units, except those pruned
2. Estimation: difference in means or a model
   • Weight controls in each stratum to equal treateds

10 / 25
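A bare-bones sketch of this CEM procedure on made-up data (the coarsening cutpoints, variable names, and stratum weights below are illustrative assumptions, not the authors' implementation; a dedicated implementation would handle the details):

```python
import numpy as np
from collections import defaultdict

rng = np.random.default_rng(2)
n = 500
education = rng.integers(6, 22, size=n)       # years of education (toy data)
age = rng.integers(20, 81, size=n)
t = rng.integers(0, 2, size=n)                # treatment indicator

# 1. Temporarily coarsen X, e.g., education into grade school / high school / college / graduate.
edu_bins = np.digitize(education, bins=[9, 13, 17])      # coarsened education: 0..3
age_bins = np.digitize(age, bins=[30, 40, 50, 60, 70])   # coarsened age bins

# Exact matching on the coarsened values: sort units into strata of C(X).
strata = defaultdict(list)
for i in range(n):
    strata[(edu_bins[i], age_bins[i])].append(i)

# Prune any stratum with 0 treated or 0 control units; keep the original (uncoarsened) units otherwise.
weights = np.zeros(n)
for members in strata.values():
    members = np.array(members)
    treated = members[t[members] == 1]
    controls = members[t[members] == 0]
    if len(treated) == 0 or len(controls) == 0:
        continue                                          # pruned stratum
    weights[treated] = 1.0
    weights[controls] = len(treated) / len(controls)      # weight controls in the stratum to equal treateds

print(f"units retained: {(weights > 0).sum()} of {n}")
```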
[Figure: Coarsened Exact Matching example. Treated (T) and control (C) units plotted by Education and Age, with Age coarsened into labeled bins (e.g., "Drinking age", "The Big 40").]

11 / 25

[Figure: Coarsened Exact Matching example, continued. Strata without both treated (T) and control (C) units are pruned.]

12 / 25
Method 3: Propensity Score Matching (Approximates Completely Randomized Experiment)

1. Preprocess (Matching)
   • Reduce the k elements of Xi to a scalar propensity score: πi ≡ Pr(Ti = 1 | Xi) = 1 / (1 + e^(−Xiβ))
   • Distance(Xc, Xt) = |πc − πt|
   • Match each treated unit to the nearest control unit
   • Control units: not reused; pruned if unused
   • Prune matches if Distance > caliper
   • (Many adjustments available to this basic method)
2. Estimation: difference in means or a model

13 / 25
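A compact sketch of this PSM recipe in Python, using scikit-learn's logistic regression (assumed available) for the propensity score and greedy nearest-neighbor matching with a caliper; the data, names, and caliper value are illustrative assumptions:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(3)
n = 600
X = rng.normal(size=(n, 4))                      # k = 4 covariates (toy data)
t = rng.binomial(1, 1 / (1 + np.exp(-X[:, 0])))  # treatment assignment depends on X

# Reduce X to the scalar propensity score pi_i = Pr(T_i = 1 | X_i) via logistic regression.
pi = LogisticRegression().fit(X, t).predict_proba(X)[:, 1]

caliper = 0.05
treated, controls = np.flatnonzero(t == 1), np.flatnonzero(t == 0)
pairs, used = [], set()
for i in treated:                                # nearest-neighbor matching on |pi_c - pi_t|
    candidates = [(abs(pi[j] - pi[i]), j) for j in controls if j not in used]
    if not candidates:
        break                                    # no unused controls left
    d, j = min(candidates)
    if d <= caliper:                             # prune matches beyond the caliper
        pairs.append((i, j))
        used.add(j)                              # controls not reused; unused controls pruned

print(f"{len(pairs)} matched pairs from {len(treated)} treated units")
```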
[Figure: Propensity Score Matching example. Treated (T) and control (C) units plotted by Education (years) and Age are reduced to a one-dimensional propensity score (scale 0 to 1) and matched on that score.]

15 / 25
1. Low Standards: Sometimes helps, never optimizes • Efficient relative to complete randomization, but • Inefficient relative to (the more powerful) full blocking • Other methods usually dominate:
Xc = Xt =⇒ πc = πt but πc = πt 6=⇒ Xc = Xt
2. The PSM Paradox: When you do “better,” you do worse • Background: Random matching increases imbalance • When PSM approximates complete randomization (to begin
with or, after some pruning) all π ≈ 0.5 (or constant within strata) pruning at random Imbalance Inefficency Model dependence Bias
• If the data have no good matches, the paradox won’t be a problem but you’re cooked anyway.
• Doesn’t PSM solve the curse of dimensionality problem?
Nope. The PSM Paradox gets worse with more covariates
16 / 25
1. Low Standards: Sometimes helps, never optimizes • Efficient relative to complete randomization, but • Inefficient relative to (the more powerful) full blocking • Other methods usually dominate:
Xc = Xt =⇒ πc = πt but πc = πt 6=⇒ Xc = Xt
2. The PSM Paradox: When you do “better,” you do worse • Background: Random matching increases imbalance • When PSM approximates complete randomization (to begin
with or, after some pruning) all π ≈ 0.5 (or constant within strata) pruning at random Imbalance Inefficency Model dependence Bias
• If the data have no good matches, the paradox won’t be a problem but you’re cooked anyway.
• Doesn’t PSM solve the curse of dimensionality problem?
Nope. The PSM Paradox gets worse with more covariates
16 / 25
1. Low Standards: Sometimes helps, never optimizes • Efficient relative to complete randomization, but • Inefficient relative to (the more powerful) full blocking • Other methods usually dominate:
Xc = Xt =⇒ πc = πt but πc = πt 6=⇒ Xc = Xt
2. The PSM Paradox: When you do “better,” you do worse • Background: Random matching increases imbalance • When PSM approximates complete randomization (to begin
with or, after some pruning) all π ≈ 0.5 (or constant within strata) pruning at random Imbalance Inefficency Model dependence Bias
• If the data have no good matches, the paradox won’t be a problem but you’re cooked anyway.
• Doesn’t PSM solve the curse of dimensionality problem? Nope.
The PSM Paradox gets worse with more covariates
16 / 25
1. Low Standards: Sometimes helps, never optimizes • Efficient relative to complete randomization, but • Inefficient relative to (the more powerful) full blocking • Other methods usually dominate:
Xc = Xt =⇒ πc = πt but πc = πt 6=⇒ Xc = Xt
2. The PSM Paradox: When you do “better,” you do worse • Background: Random matching increases imbalance • When PSM approximates complete randomization (to begin
with or, after some pruning) all π ≈ 0.5 (or constant within strata) pruning at random Imbalance Inefficency Model dependence Bias
• If the data have no good matches, the paradox won’t be a problem but you’re cooked anyway.
• Doesn’t PSM solve the curse of dimensionality problem? Nope. The PSM Paradox gets worse with more covariates
16 / 25
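To make the one-directional implication concrete, here is a minimal, illustrative Python sketch (the simulated data and all variable names are my own, not from the talk): two units with very different covariates can receive essentially the same estimated propensity score, so matching on the score alone treats them as interchangeable.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)

# Simulated data: treatment assignment depends on X1 + X2, so the true
# propensity score is a function of the sum of the covariates only.
n = 2000
X = rng.normal(size=(n, 2))
T = rng.binomial(1, 1 / (1 + np.exp(-(X[:, 0] + X[:, 1]))))

# Estimated propensity score model.
model = LogisticRegression().fit(X, T)

# Two units with very different covariates but the same X1 + X2:
# their estimated propensity scores are (nearly) identical.
a = np.array([[ 2.0, -2.0]])
b = np.array([[-2.0,  2.0]])
print("propensity scores:", model.predict_proba(a)[0, 1], model.predict_proba(b)[0, 1])
print("covariate distance:", np.linalg.norm(a - b))  # large (about 5.7)
```

Matching directly on X (exact, coarsened exact, or Mahalanobis matching) would never pair units this far apart; matching on the estimated π alone will.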
[Figure slides: scatterplot panels with horizontal axis X1; the plotted points and remaining axis labels are not recoverable from the extracted text.]
18 / 25
[Figure: panels labeled “Model Dependence” and “Bias,” each comparing MDM and PSM; the curves themselves are not recoverable from the extracted text.]
19 / 25
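The figure above compared Mahalanobis distance matching (MDM) and propensity score matching (PSM). The sketch below is only a toy comparison on simulated data, not a reproduction of the slide’s simulation: it performs greedy 1:1 nearest-neighbor matching without replacement under each distance and reports the post-matching difference in covariate means. The data, function names, and imbalance summary are all illustrative assumptions.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(1)

# Illustrative data: treated units are shifted in both covariates.
n_t, n_c = 200, 1000
Xt = rng.normal(loc=0.5, size=(n_t, 2))
Xc = rng.normal(loc=0.0, size=(n_c, 2))
X = np.vstack([Xt, Xc])
T = np.r_[np.ones(n_t), np.zeros(n_c)]

# Propensity scores for PSM.
ps = LogisticRegression().fit(X, T).predict_proba(X)[:, 1]
ps_t, ps_c = ps[:n_t], ps[n_t:]

# Inverse covariance matrix for Mahalanobis distance (MDM).
VI = np.linalg.inv(np.cov(X, rowvar=False))

def match(dist_rows):
    """Greedy 1:1 nearest-neighbor matching without replacement."""
    used, pairs = set(), []
    for i, row in enumerate(dist_rows):
        j = next(j for j in np.argsort(row) if j not in used)
        used.add(j)
        pairs.append((i, j))
    return pairs

# Distance matrices: treated (rows) by control (columns).
d_mdm = np.array([[np.sqrt((xt - xc) @ VI @ (xt - xc)) for xc in Xc] for xt in Xt])
d_psm = np.abs(ps_t[:, None] - ps_c[None, :])

def imbalance(pairs):
    """Mean absolute difference in covariate means, treated vs. matched controls."""
    idx = [j for _, j in pairs]
    return np.abs(Xt.mean(axis=0) - Xc[idx].mean(axis=0)).mean()

print("MDM imbalance:", imbalance(match(d_mdm)))
print("PSM imbalance:", imbalance(match(d_psm)))
```

Which distance wins on a given draw depends on the data; the slide’s point concerns how model dependence and bias evolve as matched units are pruned, which this toy comparison does not attempt to reproduce.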
Finkel et al. (JOP, 2012)

[Figure: two panels; horizontal axes run 0–3000 and 0–2500, vertical axes 0–10 and 0–30; the axis labels and plotted curves are not recoverable from the extracted text.]
Similar pattern for > 20 other real data sets we checked
20 / 25
• Bias-variance trade-off, recast as an imbalance-n trade-off
• Simple to use
• All solutions are optimal
• No cherry picking possible; you see everything optimal
• Choose an imbalance metric, then run (one example metric is sketched below).

21 / 25
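“Choose an imbalance metric, then run”: one common, simple choice is the average absolute standardized difference in covariate means. The snippet below is my own illustrative implementation of that metric; I believe the frontier paper and its software also work with other measures (e.g., L1 on coarsened histograms, average Mahalanobis distance), but none of that is reproduced here.

```python
import numpy as np

def mean_std_diff(X_treated, X_control):
    """Average absolute standardized difference in covariate means.

    One simple imbalance metric among several used in the matching
    literature; lower values mean better balance.
    """
    mt, mc = X_treated.mean(axis=0), X_control.mean(axis=0)
    # Pooled standard deviation per covariate.
    sd = np.sqrt((X_treated.var(axis=0, ddof=1) + X_control.var(axis=0, ddof=1)) / 2)
    return np.mean(np.abs(mt - mc) / sd)

# Example usage with random data.
rng = np.random.default_rng(2)
Xt = rng.normal(0.3, 1, size=(100, 3))
Xc = rng.normal(0.0, 1, size=(400, 3))
print(mean_std_diff(Xt, Xc))
```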
How hard is the frontier to calculate?

• Consider 1 point on the SATT frontier:
  • Start with the matrix of N control units, X0
  • Calculate imbalance for all (N choose n) subsets of rows of X0
  • Choose the subset with the lowest imbalance
• Evaluations needed to compute the entire frontier:
  • (N choose n) evaluations for each sample size n = N, N − 1, ..., 1
  • Combined, that is the (gargantuan) “power set”
  • e.g., N > 300 requires more imbalance evaluations than there are elementary particles in the universe
  • It’s hard to calculate!
• We develop algorithms for the (optimal) frontier that:
  • run very fast
  • operate greedily, but we prove they are optimal
  • do not require evaluating every subset
  • work with very large data sets
  • return the exact frontier (no approximation or estimation)
  • It’s easy to calculate!
  (a brute-force vs. greedy sketch follows below)
22 / 25
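To see why the brute-force definition is intractable and why a sequential approach is so much cheaper, here is a rough Python sketch. The imbalance metric, the data, and the specific greedy rule (drop whichever control unit most reduces a means-based imbalance) are my simplifications; the paper’s actual algorithms, and the proof that its greedy procedures are optimal for particular metrics, are not reproduced here.

```python
import numpy as np
from itertools import combinations

def imbalance(Xt, Xc_subset):
    """Toy imbalance: distance between treated and control covariate means."""
    return np.linalg.norm(Xt.mean(axis=0) - Xc_subset.mean(axis=0))

def brute_force_point(Xt, Xc, n):
    """One frontier point by exhaustive search: infeasible beyond tiny N."""
    best = min(combinations(range(len(Xc)), n),
               key=lambda idx: imbalance(Xt, Xc[list(idx)]))
    return list(best), imbalance(Xt, Xc[list(best)])

def greedy_frontier(Xt, Xc):
    """Drop one control at a time, each time the one whose removal lowers
    imbalance the most. Returns (n, imbalance) pairs as controls are pruned."""
    keep = list(range(len(Xc)))
    points = [(len(keep), imbalance(Xt, Xc[keep]))]
    while len(keep) > 1:
        worst = min(keep, key=lambda j: imbalance(Xt, Xc[[k for k in keep if k != j]]))
        keep.remove(worst)
        points.append((len(keep), imbalance(Xt, Xc[keep])))
    return points

rng = np.random.default_rng(3)
Xt = rng.normal(0.5, 1, size=(20, 2))
Xc = rng.normal(0.0, 1, size=(30, 2))
print(greedy_frontier(Xt, Xc)[:5])        # imbalance as controls are pruned
print(brute_force_point(Xt, Xc, 28))      # exhaustive check for one small point
```

Even at N = 30, the exhaustive version is only feasible for subset sizes close to N; the point of the paper’s algorithms is that the exact frontier can be computed without any such enumeration.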