Josh Pasek, University of Michigan
Can Microtargeting Improve Survey Sampling?
An Assessment of Accuracy and Bias in Consumer File Marketing Data
Project in conjunction with S. Mo Jang, Curtiss Cobb, Charles DiSogra, & J. Michael Dennis
NSF / Stanford Conference: Future of Survey Research
Surveys in the 21st Century
Challenges and Opportunities
Challenges:
- Declining response rates
  [Chart: trend from 1997 to 2012, vertical axis 0 to 40. Source: Pew 2012. Also see: Curtin, Presser, & Singer, 2005]
- Increasing costs
  Dual-frame costs (cf. Kennedy, 2007)
  Increasing refusal as a cost (cf. Curtin, Presser, & Singer, 2005)
- Coverage challenges (for some modes)
  Hispanic & Young Americans (Abraham, Maitland, Bianchi, 2006)
  Telephone access (Blumberg & Luke, 2011)
Increasingly difficult to translate from respondents to the population
Opportunities:
- New modes of data collection
  Online surveys
  Behavioral tracking
- New forms of data
  Social media
  Mobile phone data
  Tracking data & paradata
  Marketing data
- More sophisticated analytical tools
  Better weighting techniques
  Matching / propensity scores
  Imputation
  Machine learning
The big question
Can the opportunities offset the challenges?
Can we use new methods for data collection and analysis to help us
understand the population?
The current exploration
Consumer file marketing data
Why consumer file marketing data?
Can be easily purchased
Readily matched to addresses
Provides a rich source of data for all individuals in the sample (not just respondents)
If the data are high quality:
- Can enable efficient targeted sampling of hard-to-reach groups
- Can provide information on systematic nonresponse
- Might allow corrections for nonresponse and sampling biases
Consumer file marketing data as a form of ancillary data
Long history in statistics and survey methodology of thinking about auxiliary sources of data that could translate between respondents and population
(e.g. Deville, Särndal, & Sautory, 1993; Holt & Smith, 1979; Jagers, Oden, & Trulsson)
Emerging literature on using individual-level non-survey data to correct for errors due to nonresponse
(e.g. Boehmke, 2003; Dixon & Tucker, 2010; Groves, 2006; Maitland, Casas-Cordero, & Kreuter, 2009; Kreuter & Olson, 2011; Little & Vartivarian, 2005; Peytchev, 2012; Smith, 2011)
Some key initial questions
1) What are we doing with the data? (supplement or source of inference)
2) How accurate are the data?
3) How complete are the data?
4) What model are we using to link the data with the world?
5) How does the model perform for different types of inference?
Evaluating consumer file marketing data
[Diagram labels: Sample; Ancillary Data; Respondents; Non-Respondents]
The current project
(1) Assess the correspondence of ancillary data and self-reports
(2) Evaluate the nature of missingness in ancillary data
(3) Explore whether correctives using ancillary data (i.e. multiple imputations) could produce results that
better reflect population parameters
Comparison data
25,000 households sampled by GfK from USPS Computerized Delivery Sequence File (>95% coverage)
- Address-based sample recruited via mail in January 2011; respondents were provided with Internet access
Self-report data from 4472 individuals in 2498 households recruited by GfK to KnowledgePanel®
- AAPOR RR1 = 10.0%
Consumer file data from Marketing Systems Group merged with all sampled households
- 100% matched; data originally from InfoUSA, Experian, and Acxiom
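As a rough check on the reported response rate, the household counts above can be plugged into a simple ratio. This is only a sketch: it assumes that all 25,000 sampled addresses belong in the denominator and that the 2498 responding households count as completes. The full AAPOR RR1 definition separates eligible, unknown-eligibility, and partial cases, which the slides do not break out.

```python
# Sketch only: treats every sampled address as eligible or of unknown eligibility.
sampled_households = 25_000      # addresses drawn from the USPS file
responding_households = 2_498    # households with self-report data

response_rate = responding_households / sampled_households
print(f"Approximate household response rate: {response_rate:.1%}")
```

Under these assumptions the ratio rounds to the 10.0% shown on the slide.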
Weights
4 sets of household weights:
- Pure household weight (1 / Rs in HH)
- Adult household weight (1 / Rs in HH over 18)
- Best ancillary match weight (1 / R(s) closest to ancillary age in HH)
- Best ancillary match weight, full HH only (1 / R(s) closest to ancillary age in HH, for HHs with all respondents present)
2 sets of adjustment targets:
- Respondents (assessments of correspondence and missingness)
- All sampled individuals (multiple imputations to match sampling frame)
Which weights we used did not matter for analyses
We always used the most contextually appropriate weights for the data presented
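The first two household weights can be sketched as follows. The record layout, identifiers, and the cutoff of age 18 or older for "adult" are assumptions made for illustration; the slides give only the formulas 1 / Rs in HH and 1 / Rs in HH over 18.

```python
from collections import defaultdict

# Hypothetical respondent records: (household_id, respondent_id, age)
respondents = [
    ("hh1", "r1", 45),
    ("hh1", "r2", 43),
    ("hh1", "r3", 16),
    ("hh2", "r4", 30),
]

def household_weights(records, adults_only=False, adult_age=18):
    """Weight each counted respondent by 1 / (counted respondents in the household)."""
    counts = defaultdict(int)
    for hh, _, age in records:
        if not adults_only or age >= adult_age:
            counts[hh] += 1
    weights = {}
    for hh, rid, age in records:
        if not adults_only or age >= adult_age:
            weights[rid] = 1 / counts[hh]
    return weights

pure = household_weights(respondents)                     # pure household weight
adult = household_weights(respondents, adults_only=True)  # adult household weight
```

With either weight, the respondents counted in a household sum to one, so each household contributes equally to household-level estimates.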
Measures
Household:
- Home ownership
- Household income
- Household size
Individual (“Head of Household”):
- Marital status
- Education
- Age
(1) Assess the correspondence of ancillary data and self-reported estimates
Basic strategy:
Assess the proportion of matches between ancillary data and self-reported data for each variable among respondents
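The matching strategy can be sketched as below for one ordinal variable. The category codes and records are hypothetical, the difference is taken as self-report minus ancillary, and pairs with a missing value are dropped; the slides do not say how the study handled any of these details.

```python
from collections import Counter

# Hypothetical paired records for respondents: (self_report, ancillary) category codes.
# None marks a missing ancillary value, which is excluded from the comparison.
pairs = [(3, 3), (4, 2), (1, 1), (5, 6), (2, None), (7, 7)]

def correspondence(pairs):
    """Return the share of exact matches and the distribution of differences."""
    observed = [(s, a) for s, a in pairs if s is not None and a is not None]
    diffs = Counter(s - a for s, a in observed)
    n = len(observed)
    agreement = diffs[0] / n
    distribution = {d: c / n for d, c in sorted(diffs.items())}
    return agreement, distribution

agreement, distribution = correspondence(pairs)
print(f"Agreement: {agreement:.1%}")
```

The `distribution` dictionary corresponds to the kind of difference histogram shown for income, household size, and education.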
Home Ownership
[Figure: proportion of households by self-reported status (Renter, Owner), with bars for Ancillary Renter and Ancillary Owner]
88.9% Agreement
Household Income
[Figure: proportion of households by difference between self-report and ancillary estimates, in categories from −7 to 7]
22.8% Agreement
44.1% Far Off
Household Size
[Figure: proportion of households by difference between self-report and ancillary estimates, from −4 to 4]
32.1% Agreement
32.3% Far Off
Marital Status
[Figure: proportion of households by self-reported status (Unmarried, Married), with bars for Ancillary Unmarried and Ancillary Married]
72.3% Agreement
Education
[Figure: proportion of households by difference between self-report and ancillary estimates, in categories from −4 to 4]
38.9% Agreement
22.4% Far Off
Age
[Figure: proportion of households by difference between ancillary and self-report age in years (−5+, −5 to −2, −1, Equal, 1, 2 to 5, 5+)]
(Biased toward match)
70.4% Within 1 year
18.6% > 5 years
(1) Assess the correspondence of ancillary data and self-reported estimates
Correspondence varies enormously across variables (23% to 89%)
Considerable discrepancies for all variables
(2) Evaluate the nature of missingness in ancillary data
Basic strategy:
See if missingness for ancillary measures differs by self-reports of the same variable
See how well missingness can be predicted
Missingness by Variable
Missing Ancillary Data by Variable (N = 2277)
Variable            Proportion Missing Ancillary Data (%)
Home Ownership      12.3
Household Income     6.1
Household Size       6.2
Marital Status      23.4
Education           19.9
Age                 28.3
Missingness by Respondent
Distribution of Missing Ancillary Data Across Respondents (N = 2277)
Number of Variables Missing      0     1     2     3    4    5    6
Proportion of Respondents (%)    45.7  35.7  10.4  1.7  0.4  3.9  2.2
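A count of missing ancillary variables per respondent, as summarized above, could be computed along these lines. The variable names and records are hypothetical and only illustrate the tally over the six measures.

```python
from collections import Counter

VARIABLES = ["home_ownership", "income", "household_size",
             "marital_status", "education", "age"]

# Hypothetical ancillary records; None marks a missing value.
ancillary = [
    {"home_ownership": "own", "income": 5, "household_size": 2,
     "marital_status": "married", "education": 3, "age": 52},
    {"home_ownership": None, "income": 3, "household_size": 1,
     "marital_status": None, "education": None, "age": None},
    {"home_ownership": "rent", "income": 2, "household_size": 3,
     "marital_status": "unmarried", "education": 2, "age": None},
]

def count_missing(record):
    """Number of the six ancillary variables that are missing (0-6)."""
    return sum(record.get(v) is None for v in VARIABLES)

missing_counts = [count_missing(r) for r in ancillary]
tally = Counter(missing_counts)
# Share of respondents at each possible count, including counts nobody has.
share = {k: tally[k] / len(ancillary) for k in range(len(VARIABLES) + 1)}
```

The `missing_counts` values are also the kind of 0-6 outcome that a regression on self-reported demographics would try to predict.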
Home Ownership
Distribution of Missing Home Ownership Data by Self−Reported Home Ownership Status
Proportion missing ancillary home ownership data (%):
Own     7.3
Rent   29.5
χ2(1, 2274) = 181.4, p<.001
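A Pearson chi-square test of whether missingness differs by self-reported category can be sketched as follows. The counts are invented for illustration and are not the study's data, and the function applies no continuity correction or survey weights, either of which the original analysis may have used.

```python
def chi_square(table):
    """Pearson chi-square statistic and degrees of freedom for a contingency table."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    n = sum(row_totals)
    stat = 0.0
    for i, row in enumerate(table):
        for j, observed in enumerate(row):
            expected = row_totals[i] * col_totals[j] / n
            stat += (observed - expected) ** 2 / expected
    dof = (len(table) - 1) * (len(table[0]) - 1)
    return stat, dof

# Hypothetical counts, not the study's data.
# Rows: self-reported owner, renter. Columns: ancillary value missing, present.
table = [[100, 900],
         [150, 350]]

stat, dof = chi_square(table)
print(f"chi-square({dof}) = {stat:.1f}")
```

The same function accepts tables with more rows, such as the eight income categories or seven age categories.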
Household Income
Distribution of Missing Ancillary Income Data by Self−Reported Income
Self−Reported Income Category in Thousands    Proportion Missing Ancillary Income Data (%)
0−15       10.0
15−25       9.0
25−35      12.7
35−50       5.1
50−75       4.6
75−100      3.3
100−150     2.0
150+        2.4
χ2(7, 1661) = 32.5, p<.001
Household Size
Distribution of Missing Household Size Data by Self−Reported Household Size
Self−Reported Household Size    Proportion Missing Ancillary Household Size Data (%)
1    9.1
2    5.4
3    5.6
4    5.8
5    2.4
χ2(4, 2274) = 15.15, p<.01
Marital Status
Distribution of Missing Marital Data by Self−Reported Marital Status
Proportion missing ancillary marital data (%):
Not Married   28.2
Married       13.5
χ2(1, 2274) = 59.8, p<.001
Education
Distribution of Missing Education Data by Self−Reported Education
Self−Reported Education Category    Proportion Missing Ancillary Education Data (%)
Less Than High School       20.5
High School Graduate        19.2
Some College                20.9
College Degree              19.2
Post−Graduate Education     27.3
χ2(4, 1414) = 5.8, p=.21
Age
Distribution of Missing Ancillary Age Data by Self−Reported Age
Self−Reported Age Category    Proportion Missing Ancillary Age Data (%)
18−24    40.0
25−34    49.4
35−44    35.7
45−54    23.0
55−64    14.0
65−74    14.9
75+      17.3
χ2(6, 2277) = 188.2, p<.001
Regressions predicting number of ancillary variables missing (range 0-6) using self-reported demographics
Predictor         Missing Ancillary Data
Home ownership    Non-owners***
Income            n.s.
Household size    Fewer persons***
Marital status    Unmarried*
Education         n.s.
Age               Younger*
Race/ethnicity    n.s.
R-squared         0.11
Missingness in ancillary data was not well accounted for
(2) Evaluate the nature of missingness in ancillary data
Missing ancillary data appears to be nonignorable and biased
(3) Explore whether correctives using ancillary data could produce results that
better reflect population parameters
Basic strategy:
Impute the distribution of self-reports for all sampled individuals based on the ancillary data
See if that represents a substantive improvement over the self-reports of respondents alone
(cf. Peytchev 2012)
Analytical strategy
- Impute self-reports for entire sample (not just respondents)
- Compare imputed, raw self-report, and ancillary values to CPS
Measures Used in Imputations
Home ownership
Presence of telephone
Number of persons in household
Household income
Marital status
Education of head of household
Age of head of household
Number of children in household
Hispanic status
Region
The imputations
100 imputed datasets were created using MICE (multiple imputation via chained equations)
Point estimates were generated for all imputed datasets as well as for raw self-reports, ancillary data, and CPS
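The general idea of imputing self-reports for nonrespondents from ancillary values can be illustrated with a much simpler technique than MICE: a random donor draw within ancillary classes, repeated to produce many completed datasets. This is a deliberately reduced sketch with one variable and invented records, not the chained-equations procedure used in the study.

```python
import random

# Hypothetical sample: the ancillary value is known for everyone,
# the self-report only for respondents (None for nonrespondents).
sample = [
    {"ancillary_owner": True,  "self_owner": True},
    {"ancillary_owner": True,  "self_owner": True},
    {"ancillary_owner": True,  "self_owner": False},
    {"ancillary_owner": False, "self_owner": False},
    {"ancillary_owner": False, "self_owner": True},
    {"ancillary_owner": True,  "self_owner": None},
    {"ancillary_owner": False, "self_owner": None},
    {"ancillary_owner": False, "self_owner": None},
]

def impute_once(sample, rng):
    """Fill each missing self-report with a random donor sharing the same ancillary value."""
    donors = {}
    for row in sample:
        if row["self_owner"] is not None:
            donors.setdefault(row["ancillary_owner"], []).append(row["self_owner"])
    completed = []
    for row in sample:
        value = row["self_owner"]
        if value is None:
            value = rng.choice(donors[row["ancillary_owner"]])
        completed.append(value)
    return completed

rng = random.Random(0)
estimates = []
for _ in range(100):  # 100 completed datasets, mirroring the count in the slides
    completed = impute_once(sample, rng)
    estimates.append(sum(completed) / len(completed))

imputed_mean = sum(estimates) / len(estimates)
reported = [r["self_owner"] for r in sample if r["self_owner"] is not None]
raw_estimate = sum(reported) / len(reported)
```

Comparing `imputed_mean` with `raw_estimate` against an external benchmark is the kind of comparison the slides make using the CPS.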
Home ownership
[Figure: Home Ownership Estimates Weighted By Household; vertical axis: proportion; series: 100 MIs, Raw GfK Estimate, CPS Estimate, Ancillary Estimate]
Household Income
[Figure: Income Category Estimates Weighted By Household; horizontal axis: household income category (Less than $15k to More than $150k); vertical axis: proportion; series: 100 MIs, Raw GfK Estimate, CPS Estimate, Ancillary Estimate]
Household size
[Figure: Household Size Estimates Weighted By Household; horizontal axis: persons in household (1 to 5 or more); vertical axis: proportion; series: 100 MIs, Raw GfK Estimate, CPS Estimate, Ancillary Estimate]
Marital Status
[Figure: Marital Status Estimates Weighted By Individual; vertical axis: proportion; series: 100 MIs, Raw GfK Estimate, CPS Estimate, Ancillary Estimate]
Education
[Figure: Education Level Estimates Weighted By Individual; horizontal axis: education level (Less than HS, HS Grad, Some College, College Grad, Grad School); vertical axis: proportion; series: 100 MIs, Raw GfK Estimate, CPS Estimate, Ancillary Estimate]
Age
[Figure: Age Category Estimates Weighted By Individual; horizontal axis: age category (18−24 to 75 and up); vertical axis: proportion; series: 100 MIs, Raw GfK Estimate, CPS Estimate, Ancillary Estimate]
Differences from CPS
[Figure: Average Absolute Difference From CPS By Method And Variable (Home Ownership, Income, Household Size, Marital Status, Education, Age, Average); series: Imputation Mean, Raw GfK Estimate, Ancillary Estimate]
Data               Household   Individual   Total
Imputed mean       2.7%        3.3%         3.0%
Raw self-report    4.0%        6.5%         5.3%
Ancillary          7.5%        12.7%        10.1%
Imputations were better than raw self-reports, but not by an enormous amount
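An average absolute difference from a benchmark can be sketched as below. The category proportions in the first part are invented, and the averaging over categories is an assumption about how the summary was built. The second part uses the figures from the table above and suggests that the Total column is consistent with a simple average of the Household and Individual columns, up to rounding.

```python
def mean_absolute_difference(estimate, benchmark):
    """Average absolute gap between estimated and benchmark category proportions."""
    return sum(abs(estimate[k] - benchmark[k]) for k in benchmark) / len(benchmark)

# Hypothetical category proportions, not the study's data.
benchmark = {"own": 0.70, "rent": 0.30}
estimate = {"own": 0.76, "rent": 0.24}
gap = mean_absolute_difference(estimate, benchmark)

# Values copied from the table above (percentage points).
table = {
    "Imputed mean": (2.7, 3.3, 3.0),
    "Raw self-report": (4.0, 6.5, 5.3),
    "Ancillary": (7.5, 12.7, 10.1),
}
for label, (household, individual, total) in table.items():
    midpoint = (household + individual) / 2
    print(f"{label}: midpoint {midpoint:.2f}, reported total {total}")
```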
Distilling these results
Estimates from the raw self-report data (unweighted) were not very far off
Imputations based on ancillary data eliminated a moderate portion of the error in the self-reports
Ancillary data themselves do not seem particularly accurate
Across all analyses
Ancillary data estimates frequently differ from self-reports
Missing ancillary data is systematic and appears to be non-ignorable
Standard Bayesian imputation algorithms do not fully correct biases
Using consumer file marketing data
Ancillary data may help identify members of hard-to-reach populations (possibly with bias)
Ancillary data do not seem to be particularly efficient when correcting for non-response
Unlikely that it would be possible to use these data to correct for a problematic sampling frame
What went wrong?
We can’t know . . .
The data are complete black boxes
Proprietary
Moving forward from here
Still lots of reasons to think that good ancillary data would substantively improve survey sampling
But the demographic ancillary data used in this study were not sufficient for many purposes
Moving forward with current data
How do these results compare with traditional survey weighting techniques?
Could a larger set of ancillary measures allow for better correctives?
Could linking other types of newly available data allow for better translations between respondents and society?
Ideally, we want data we can trust and evaluate
The ancillary data need to be more transparent
The process of linking sources of data to one another needs to be more systematically addressed
Is there an in-house option that could be used instead of purchasing data from corporations?
NSF can play a pivotal role in building such a dataset
1) What are we doing with the data? (supplement or source of inference)
2) How accurate are the data?
3) How complete are the data?
4) What model are we using to link the data with the world?
5) How does the model perform for different types of inference?
Need to consider these questions with additional sources of data