Can Microtargeting Improve Survey Sampling?

Josh Pasek, University of Michigan

An Assessment of Accuracy and Bias in Consumer File Marketing Data

Project in conjunction with S. Mo Jang, Curtiss Cobb, Charles DiSogra, & J. Michael Dennis

NSF / Stanford Conference: Future of Survey Research

[email protected]


Surveys in the 21st Century

Challenges and Opportunities



Declining response rates

Increasing costs

Coverage challenges (for some modes)





Increasing costs

[Figure: chart with vertical axis from 0 to 40 and horizontal axis from 1997 to 2012. Source: Pew 2012]

Also see: Curtin, Presser, & Singer, 2005




Increasing costs


Dual-frame costs (cf. Kennedy, 2007)

Increasing refusal as a cost (cf. Curtin, Presser, & Singer, 2005)




Increasing costs


Hispanic & Young Americans (Abraham, Maitland, Bianchi, 2006)

Telephone Access (Blumberg & Luke, 2011)




Increasing costs


Increasingly difficult to translate from respondents to the population



New modes of data collection

New forms of data

More sophisticated analytical tools




New forms of data


Social Media

Mobile phone data

Tracking data & Paradata

Marketing data




New forms of data


Online surveys

Behavioral tracking




New forms of data


Better weighting techniques

Matching / propensity scores

Imputation

Machine learning

The big question

Can the opportunities offset the challenges?

Can we use new methods for data collection and analysis to help us understand the population?

The current exploration

Consumer file marketing data

Why consumer file marketing data?


Can be easily purchased



Readily matched to addresses



Provides a rich source of data for all individuals in the sample (not just respondents)


If the data are high quality:



- Can enable efficient targeted sampling of hard-to-reach groups




- Can provide information on systematic nonresponse




- Might allow corrections for nonresponse and sampling biases

Consumer file marketing data as a form of ancillary data

Long history in statistics and survey methodology of thinking about auxiliary sources of data that could translate between respondents and population (e.g. Deville, Sarndal, & Sautory, 1993; Holt & Smith, 1979; Jagers, Oden, & Trulsson)

Emerging literature on using individual-level non-survey data to correct for errors due to nonresponse (e.g. Boehmke, 2003; Dixon & Tucker, 2010; Groves, 2006; Maitland, Casas-Cordero, & Kreuter, 2009; Kreuter & Olson, 2011; Little & Vartivarian, 2005; Peytchev, 2012; Smith, 2011)

Some key initial questions

1) What are we doing with the data? (supplement or source of inference)

2) How accurate are the data?

3) How complete are the data?

4) What model are we using to link the data with the world?

5) How does the model perform for different types of inference?


Evaluating consumer file marketing data

[Diagram: the sample, with ancillary data attached to the whole sample, divided into respondents and non-respondents]

The current project

(1) Assess the correspondence of ancillary data and self-reports

(2) Evaluate the nature of missingness in ancillary data

(3) Explore whether correctives using ancillary data (i.e. multiple imputations) could produce results that better reflect population parameters

Comparison data

25,000 households sampled by GfK from USPS Computerized Delivery Sequence File (>95% coverage)

Address-Based Sample recruited via mail in January 2011; respondents were provided with Internet access

Self-report data from 4472 individuals in 2498 households recruited by GfK to KnowledgePanel®

AAPOR RR1=10.0%

Consumer file data from Marketing Systems Group merged with all sampled households

100% matched, data originally from InfoUSA, Experian, and Acxiom
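The reported response rate can be roughly checked against the counts on this slide. The sketch below assumes, as a simplification not stated in the slides, that the 2,498 recruited households count as completes and that all 25,000 sampled households form the denominator. The full AAPOR RR1 formula distinguishes several disposition categories that are not reported here.

```python
# Rough check of the reported response rate under a simplifying assumption:
# completes = recruited households, denominator = all sampled households.
sampled_households = 25_000
recruited_households = 2_498


def simple_response_rate(completes: int, sampled: int) -> float:
    """Completes divided by all sampled units (a simplification of AAPOR RR1)."""
    return completes / sampled


rate = simple_response_rate(recruited_households, sampled_households)
print(f"{rate:.1%}")  # prints 10.0%
```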

Weights

4 sets of household weights:

Pure household weight (1 / Rs in HH)

Adult household weight (1 / Rs in HH over 18)

Best ancillary match weight (1 / R(s) closest to Ancillary age in HH)

Best ancillary match weight, full HH only (1 / R(s) closest to Ancillary age in HH for HHs with all respondents present)
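The first two household weights can be sketched as follows. This is a minimal illustration, not the project's actual weighting code: the record keys (`id`, `household`, `age`) are hypothetical, "over 18" is read here as age 18 or older, and giving non-adults a weight of 0 is an assumption.

```python
from collections import Counter


def pure_household_weights(respondents):
    """Weight each respondent by 1 / (number of respondents in the household)."""
    per_household = Counter(r["household"] for r in respondents)
    return {r["id"]: 1.0 / per_household[r["household"]] for r in respondents}


def adult_household_weights(respondents, adult_age=18):
    """Weight adult respondents by 1 / (number of adult respondents in the household).

    Assumption: "adult" means age >= adult_age; non-adults receive weight 0.
    """
    adults = Counter(r["household"] for r in respondents if r["age"] >= adult_age)
    return {
        r["id"]: (1.0 / adults[r["household"]] if r["age"] >= adult_age else 0.0)
        for r in respondents
    }


# Hypothetical panel: household 1 has one adult and one minor, household 2 one adult.
panel = [
    {"id": "a", "household": 1, "age": 44},
    {"id": "b", "household": 1, "age": 16},
    {"id": "c", "household": 2, "age": 30},
]
pure = pure_household_weights(panel)
adult = adult_household_weights(panel)
```

Under either scheme the weights within a household sum to 1, so each household contributes equally to household-level estimates.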

Weights

2 sets of adjustment targets:

Respondents: assessments of correspondence and missingness

All sampled individuals: multiple imputations to match sampling frame

Weights

Which weights we used did not matter for analyses

We always used the most contextually appropriate weights for the data presented

Measures

Household:

Home ownership

Household income

Household size

Individual (“Head of Household”):

Marital status

Education

Age

The current project

(1) Assess the correspondence of ancillary data and self-reported estimates


Basic strategy:

Assess the proportion of matches between ancillary data and self-reported data for each variable among respondents
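The matching step can be sketched as a weighted agreement rate. This is an illustrative sketch only: the tuple layout and the choice to skip records with a missing value on either side are assumptions, not details taken from the study.

```python
def weighted_agreement(pairs):
    """Weighted share of records where self-report and ancillary values match.

    pairs: iterable of (self_report, ancillary, weight).
    Assumption: records missing either value are excluded from the comparison.
    """
    matched = 0.0
    total = 0.0
    for self_report, ancillary, weight in pairs:
        if self_report is None or ancillary is None:
            continue
        total += weight
        if self_report == ancillary:
            matched += weight
    return matched / total if total else float("nan")


# Hypothetical home-ownership records: (self-report, ancillary, household weight)
records = [
    ("owner", "owner", 1.0),
    ("renter", "owner", 0.5),
    ("owner", "owner", 0.5),
    ("renter", None, 1.0),  # no ancillary value, so not compared
]
agreement = weighted_agreement(records)
```

For ordered variables such as income category or household size, the same loop could tabulate the difference between the two values instead of a simple match.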


Home Ownership

[Figure: proportion of households by self-reported status (Renter, Owner), split into Ancillary Renter and Ancillary Owner]

88.9% Agreement

Household Income

[Figure: proportion of households by difference between self-report and ancillary estimates, on a scale from -7 to 7]

22.8% Agreement

44.1% Far Off

Household Size

[Figure: proportion of households by difference between self-report and ancillary estimates, on a scale from -4 to 4]

32.1% Agreement

32.3% Far Off

Marital Status

[Figure: proportion of households by self-reported status (Unmarried, Married), split into Ancillary Unmarried and Ancillary Married]

72.3% Agreement

Education

[Figure: proportion of households, vertical axis from 0.0 to 0.3]

38.9% Agreement

22.4% Far Off

Age

[Figure: percentage of households by difference between ancillary and self-report age in years (-5+, -5 to -2, -1, Equal, 1, 2 to 5, 5+)]

(Biased toward match)

70.4% Within 1 year

18.6% > 5 years


Correspondence varies enormously across variables (23% - 89%)

Considerable discrepancies for all variables

The current project

(2) Evaluate the nature of missingness in ancillary data

Basic strategy:

See if missingness for ancillary measures differs by self-reports of the same variable

See how well missingness can be predicted
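One way to test whether missingness differs by self-report is a Pearson chi-square test on a table of counts. The sketch below uses made-up counts and assumes an unweighted test; the slides do not say how the reported statistics were computed. The closed-form p-value shown applies only to 1 degree of freedom.

```python
import math


def chi_square_statistic(table):
    """Pearson chi-square statistic and degrees of freedom for a table of counts."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    n = sum(row_totals)
    stat = 0.0
    for i, row in enumerate(table):
        for j, observed in enumerate(row):
            expected = row_totals[i] * col_totals[j] / n
            stat += (observed - expected) ** 2 / expected
    df = (len(row_totals) - 1) * (len(col_totals) - 1)
    return stat, df


def p_value_df1(stat):
    """Upper-tail p-value of a chi-square statistic with 1 degree of freedom."""
    return math.erfc(math.sqrt(stat / 2.0))


# Hypothetical counts. Rows: self-reported owner, renter.
# Columns: ancillary value present, ancillary value missing.
counts = [[10, 20], [20, 10]]
stat, df = chi_square_statistic(counts)
p = p_value_df1(stat)
```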


Missingness by Variable

Missing Ancillary Data by Variable (N = 2277)

Variable            Proportion Missing Ancillary Data (%)
Home Ownership      12.3
Household Income    6.1
Household Size      6.2
Marital Status      23.4
Education           19.9
Age                 28.3

Missingness by Respondent

Distribution of Missing Ancillary Data Across Respondents (N = 2277)

Number of Variables Missing   Proportion of Respondents (%)
0                             45.7
1                             35.7
2                             10.4
3                             1.7
4                             0.4
5                             3.9
6                             2.2
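The two missingness charts can be cross-checked against each other: the average number of missing variables per respondent should equal the sum of the per-variable missingness rates. The sketch below uses the percentages shown on the charts, reading the bar labels left to right (an assumption about the chart layout). Small differences are expected from rounding.

```python
# Percent of respondents missing 0..6 ancillary variables.
count_distribution = {0: 45.7, 1: 35.7, 2: 10.4, 3: 1.7, 4: 0.4, 5: 3.9, 6: 2.2}

# Percent of respondents missing each ancillary variable.
variable_missing = {
    "home_ownership": 12.3,
    "household_income": 6.1,
    "household_size": 6.2,
    "marital_status": 23.4,
    "education": 19.9,
    "age": 28.3,
}

# Mean number of missing variables per respondent, computed two ways.
mean_from_counts = sum(k * pct for k, pct in count_distribution.items()) / 100
mean_from_variables = sum(variable_missing.values()) / 100

print(round(mean_from_counts, 3), round(mean_from_variables, 3))
```

Both routes give just under one missing variable per respondent on average.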

Home Ownership

Distribution of Missing Home Ownership Data by Self-Reported Home Ownership Status

Self-Reported Home Ownership Status   Missing Ancillary Home Ownership Data (%)
Own                                   7.3
Rent                                  29.5

χ2(1, 2274) = 181.4, p<.001

Household Income

Distribution of Missing Ancillary Income Data by Self-Reported Income

Self-Reported Income Category in Thousands   Missing Ancillary Income Data (%)
0-15                                         10.0
15-25                                        9.0
25-35                                        12.7
35-50                                        5.1
50-75                                        4.6
75-100                                       3.3
100-150                                      2.0
150+                                         2.4

χ2(7, 1661) = 32.5, p<.001

Household Size

Distribution of Missing Household Size Data by Self-Reported Household Size

Self-Reported Household Size   Missing Ancillary Household Size Data (%)
1                              9.1
2                              5.4
3                              5.6
4                              5.8
5                              2.4

χ2(4, 2274) = 15.15, p<.01

Marital Status

Distribution of Missing Marital Data by Self-Reported Marital Status

Self-Reported Marital Status   Missing Ancillary Marital Data (%)
Not Married                    28.2
Married                        13.5

χ2(1, 2274) = 59.8, p<.001

Education

Distribution of Missing Education Data by Self-Reported Education

Self-Reported Education Category   Missing Ancillary Education Data (%)
Less Than High School              20.5
High School Graduate               19.2
Some College                       20.9
College Degree                     19.2
Post-Graduate Education            27.3

χ2(4, 1414) = 5.8, p=.21

Age

Distribution of Missing Ancillary Age Data by Self-Reported Age

Self-Reported Age Category   Missing Ancillary Age Data (%)
18-24                        40.0
25-34                        49.4
35-44                        35.7
45-54                        23.0
55-64                        14.0
65-74                        14.9
75+                          17.3

χ2(6, 2277) = 188.2, p<.001


Basic strategy:



Regressions predicting number of ancillary variables missing (range 0-6) using self-reported demographics

Predictor          Missing Ancillary Data
Home ownership     Non-owners***
Income             n.s.
Household size     Fewer persons***
Marital status     Unmarried*
Education          n.s.
Age                Younger*
Race/ethnicity     n.s.

R-squared          0.11

Missingness in ancillary data was not well accounted for


Missing ancillary data appears to be nonignorable and biased
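The regression approach can be illustrated with a deliberately simplified sketch: count the missing ancillary variables for each respondent, then regress that count on a single self-reported predictor and report R-squared. The study used several demographics at once; the one-predictor fit, the field names, and the toy data below are all assumptions made for illustration.

```python
from statistics import mean

# Hypothetical field names for the six ancillary variables.
VARIABLES = ["home_ownership", "income", "household_size",
             "marital_status", "education", "age"]


def count_missing(ancillary_record):
    """Number of ancillary variables (0-6) that are missing (None) for one respondent."""
    return sum(1 for name in VARIABLES if ancillary_record.get(name) is None)


def simple_ols(x, y):
    """Least-squares fit of y = intercept + slope * x; returns (intercept, slope, r_squared)."""
    mx, my = mean(x), mean(y)
    sxx = sum((xi - mx) ** 2 for xi in x)
    sxy = sum((xi - mx) * (yi - my) for xi, yi in zip(x, y))
    slope = sxy / sxx
    intercept = my - slope * mx
    ss_res = sum((yi - (intercept + slope * xi)) ** 2 for xi, yi in zip(x, y))
    ss_tot = sum((yi - my) ** 2 for yi in y)
    return intercept, slope, 1.0 - ss_res / ss_tot


# Toy data: x = 1 for self-reported renters, 0 for owners; y = variables missing.
renter = [0, 0, 0, 0, 1, 1, 1, 1]
n_missing = [0, 1, 0, 1, 1, 2, 3, 2]
intercept, slope, r_squared = simple_ols(renter, n_missing)
```

A low R-squared in a model like this means the self-reported demographics explain little of who has missing ancillary data, which is the pattern the slide reports.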

The current project

(3) Explore whether correctives using ancillary data could produce results that better reflect population parameters


Basic strategy:

Impute the distribution of self-reports for all sampled individuals based on the ancillary data

See if that represents a substantive improvement over the self-reports of respondents alone

(cf. Peytchev 2012)


Analytical strategy

- Impute self-reports for entire sample (not just respondents)

- Compare imputed, raw self-report, and ancillary values to CPS

Measures Used in Imputations

Home ownership

Presence of telephone

Number of persons in household

Household income

Marital status

Education of head of household

Age of head of household

Number of children in household

Hispanic status

Region

The imputations

100 imputed datasets were created using MICE (multiple imputation via chained equations)

Point estimates were generated for all imputed datasets as well as for raw self-reports, ancillary data, and CPS
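Once each imputed dataset has its own point estimate, the estimates are typically summarized by their mean, with the spread across datasets indicating how much the imputation model itself adds uncertainty. The sketch below shows only that pooling step with made-up values; it does not implement MICE, and the function name is hypothetical.

```python
from statistics import mean, variance


def pool_point_estimates(estimates):
    """Return (pooled estimate, between-imputation variance) for one quantity.

    The pooled estimate is the mean across imputed datasets; the
    between-imputation variance is the sample variance of those estimates.
    """
    pooled = mean(estimates)
    between = variance(estimates) if len(estimates) > 1 else 0.0
    return pooled, between


# Hypothetical home-ownership proportions from five imputed datasets.
imputed_proportions = [0.70, 0.72, 0.71, 0.69, 0.73]
pooled, between = pool_point_estimates(imputed_proportions)
```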

Home ownership

[Figure: Home Ownership Estimates Weighted By Household; proportion axis from 0.60 to 0.90; series: 100 MIs, Raw GfK Estimate, CPS Estimate, Ancillary Estimate]

Household Income

[Figure: Income Category Estimates Weighted By Household; proportion (0.00 to 0.25) by household income category (Less than $15k, $15k-25k, $25k-35k, $35k-50k, $50k-75k, $75k-100k, $100k-150k, More than $150k); series: 100 MIs, Raw GfK Estimate, CPS Estimate, Ancillary Estimate]

Household size

[Figure: Household Size Estimates Weighted By Household; proportion (0.0 to 0.4) by persons in household (1, 2, 3, 4, 5 or more)]

Marital Status

[Figure: Marital Status Estimates Weighted By Individual; proportion axis from 0.5 to 0.9; series: 100 MIs, Raw GfK Estimate, CPS Estimate, Ancillary Estimate]

Education

[Figure: Education Level Estimates Weighted By Individual; proportion (0.0 to 0.4) by education level (Less than HS, HS Grad, Some College, College Grad, Grad School)]

Age

[Figure: Age Category Estimates Weighted By Individual; proportion (0.0 to 0.4) by age category (18-24, 25-34, 35-44, 45-54, 55-64, 65-74, 75 and up)]

Differences from CPS

[Figure: Average Absolute Difference From CPS By Method And Variable; axis from 0.00 to 0.30; variables: Home Ownership, Income, Household Size, Marital Status, Education, Age, Average; series: Imputation Mean, Raw GfK Estimate, Ancillary Estimate]


Data              Household   Individual   Total
Imputed mean      2.7%        3.3%         3.0%
Raw self-report   4.0%        6.5%         5.3%
Ancillary         7.5%        12.7%        10.1%

Imputations were better than raw self-reports, but not by an enormous amount
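The comparison metric can be sketched as the mean absolute gap between an estimated distribution and the benchmark distribution, taken across categories. How the study averaged across categories and variables is not spelled out in the slides, so the function below is one plausible reading, and the proportions are invented for illustration.

```python
def average_absolute_difference(estimate, benchmark):
    """Mean absolute difference between two category distributions.

    Both arguments map category label -> proportion and must share categories.
    """
    categories = list(benchmark)
    return sum(abs(estimate[c] - benchmark[c]) for c in categories) / len(categories)


# Hypothetical three-category variable: benchmark vs. one estimation method.
benchmark = {"low": 0.40, "middle": 0.35, "high": 0.25}
estimate = {"low": 0.50, "middle": 0.30, "high": 0.20}
gap = average_absolute_difference(estimate, benchmark)
```

A smaller gap means the method's estimates sit closer to the benchmark, which is how the three rows in the table are ranked.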

Distilling these results


Estimates from the raw self-report data (unweighted) were not very far off


Imputations based on ancillary data eliminated a moderate portion of the error in the self-reports



Ancillary data themselves do not seem particularly accurate



Across all analyses

Ancillary data estimates frequently differ from self-reports

Missing ancillary data is systematic and appears to be non-ignorable

Standard Bayesian imputation algorithms do not fully correct biases

Using consumer file marketing data


Ancillary data may help identify members of hard-to-reach populations (possibly with bias)

Ancillary data do not seem to be particularly efficient when correcting for non-response



Unlikely that it would be possible to use these data to correct for a problematic sampling frame




What went wrong?

We can’t know . . .

The data are complete black boxes

Proprietary

Moving forward from here

Still lots of reasons to think that good ancillary data would substantively improve survey sampling

But the demographic ancillary data used in this study were not sufficient for many purposes

Moving forward with current data

How do these results compare with traditional survey weighting techniques?


Could a larger set of ancillary measures allow for better correctives?



Could linking other types of newly available data allow for better translations between respondents and society?




Ideally, we want data we can trust and evaluate


The ancillary data need to be more transparent



The process of linking sources of data to one another needs to be more systematically addressed




Is there an in-house option that could be used instead of purchasing data from corporations?





NSF can play a pivotal role in building such a dataset





5) How does the model perform for different types of inference?

Need to consider these questions with additional sources of data



