TEA for survey processing

Ben Klemens
Rolando Rodríguez

Center for Statistical Research and Methodology
U.S. Census Bureau

ben.klemens@census.gov

September 25, 2012


Disclaimer

This paper is released to inform interested parties of ongoing research and to encourage discussion. The views expressed are those of the authors and not necessarily those of the U.S. Census Bureau.

Klemens & Rodríguez TEA 2/22


Introduction

• Problem statement

• Background on modeling

• An overview of TEA

• A worked example


Problem statement

• Used for:
  ◦ U.S. Census, group quarters
  ◦ American Community Survey, group quarters
  ◦ American Community Survey, island areas

• We want a simple edit specification. A sociologist should be able to write it.

• The research division puts out new and better methods for imputation; they have to be as easy as possible for production to test out.

• Every imputation passes every edit.


The environment; technical dependencies

Don’t reinvent the wheel ⇒ depends on several existing systems

• R: relatively friendly interface, graphics

• SQLite: databases
  ◦ Easy subsetting and joining
  ◦ Handle large data sets
  ◦ SQLite is a lightweight C library, puts the database in a single file

• C libraries:
  ◦ GNU Scientific Library: matrix algebra, random number generation
  ◦ SQLite, as above
  ◦ Apophenia: a library of statistics functions; see below

• Tea is reduced to doing traffic control.


Social/political/business dependencies

• Licensing requirements
  ◦ None.

• API + front-end design
  ◦ If you can’t use R, build another front-end.

• Conforms to:
  ◦ IEEE Std 1003.1-2008 (POSIX)
  ◦ IEEE 754 (floating-point math)
  ◦ ISO/IEC 9899:2011 (C)
  ◦ ISO/IEC 10646:2011 (∼ Unicode 6.1.0)
  ◦ IETF RFC 3629 (UTF-8)


Models as black boxes


Apophenia

• Typical data manipulation facilities

• A standard object representing a model
  ◦ An object, with
    ◦ Parameters: Normal = (µ, σ); OLS = β; Poisson = λ; ...
    ◦ Estimation: fill in the parameters given the data
    ◦ Drawing random values: make fake data given the parameters
    ◦ Log likelihood
    ◦ Expected value
    ◦ ...
  ◦ Models include Bernoulli, Beta, Bi/multinomial, χ², Dirichlet, Exponential, F, Γ, Histogram/empirical PMF, (proper/improper) Uniform, Instrumental Variables, Kernel Density, Loess, (bi/multinomial) Logit, Lognormal, (uni/multivariate) Normal, (weighted) OLS, Poisson, (bi/multinomial) Probit, t, Waring, Wishart, Yule, Zipf; and transformations thereof.
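To make the idea of a standard model object concrete, here is a minimal sketch in Python. It is not Apophenia's C API: the class `NormalModel` and its method names are hypothetical, chosen only to mirror the list above (parameters, estimation, random draws, log likelihood, expected value). The data are made up.

```python
import math
import random
import statistics


class NormalModel:
    """Hypothetical stand-in for a model object; not Apophenia's API."""

    def __init__(self):
        self.mu = None      # parameters stay empty until estimate() is called
        self.sigma = None

    def estimate(self, data):
        """Fill in the parameters given the data."""
        self.mu = statistics.fmean(data)
        self.sigma = statistics.pstdev(data)   # maximum-likelihood (population) sd
        return self

    def draw(self, rng):
        """Make one fake data point given the parameters."""
        return rng.gauss(self.mu, self.sigma)

    def log_likelihood(self, data):
        """Sum of log densities of the data under the current parameters."""
        var = self.sigma ** 2
        return sum(
            -0.5 * math.log(2 * math.pi * var) - (x - self.mu) ** 2 / (2 * var)
            for x in data
        )

    def expected_value(self):
        return self.mu


observed = [2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0]   # made-up data
model = NormalModel().estimate(observed)
rng = random.Random(0)
fake = [model.draw(rng) for _ in range(3)]
```

Because every model exposes the same few operations, code that calls `estimate` and then `draw` does not need to know which model it is holding. That is what lets one model be swapped for another.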


Editing with off-the-shelf models

• Modifying every model to suit is a pain.

• Easier: separate the edit checks from the models entirely.

• Once edits and models have their own spaces, edits are easier to work with too.


The R script

library(tea)
readSpec("demo.spec")
doMImpute()

• Inputs: the spec file has all the details, so these function calls are brief.

• We’ll see the spec in a few slides.

• Outputs: these functions manipulate the database.


Key tables in the database

• ORIGdc: the original data

• dc: the data with recodes, maybe edits blanked out

• filled: the imputations


Interrogation in R

checkOutImpute("indata", "completed", imputation_number=1)
teaTable("completed", limit=20)
justmales <- teaTable("completed", where="sex=0")


Original/imputed ages

R is popular among dataviz geeks ⇒ R graphics are best in breed. See figure.


Original/imputed log wages

Lognormal model.


A specification file’s header and input.

database: afile.db
id: serialno

input {
    input file: survey.csv
    output table: indata
}


An edit spec

List edits using SQL syntax. The specification is pessimistic: each line gives a condition under which a record fails.

age <= 15 and status="married"   #Illegal in the USA
age < 0                          #OLS may give negative ages.
age > 100 => age = 100           #top code.

We call that last one a pre-edit. It is deterministic and requires no statistical model.
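Because each edit is just a SQL condition that describes a failing record, it can be dropped into a WHERE clause. The sketch below uses Python's standard `sqlite3` module to show the idea. It is not TEA's own processing code: the table name `indata` and key `serialno` are borrowed from the spec header, while the columns, the rows, and the `EDITS` / `PRE_EDITS` lists are made up for illustration.

```python
import sqlite3

# Consistency edits: each is a SQL condition describing a FAILING record.
# Single quotes are used for the string literal, as standard SQL expects.
EDITS = [
    "age <= 15 and status = 'married'",
    "age < 0",
]
# Pre-edit: a deterministic fix, written here as (condition, assignment).
PRE_EDITS = [("age > 100", "age = 100")]

con = sqlite3.connect(":memory:")
con.execute(
    "create table indata (serialno integer primary key, age integer, status text)"
)
con.executemany(
    "insert into indata values (?, ?, ?)",
    [(1, 34, "married"), (2, 14, "married"), (3, -2, "single"), (4, 107, "widowed")],
)

# Apply pre-edits first: no model is needed, just an UPDATE.
for condition, assignment in PRE_EDITS:
    con.execute(f"update indata set {assignment} where {condition}")

# A record fails if it matches any edit; collect the edits each record trips.
failing = {}
for edit in EDITS:
    for (serialno,) in con.execute(f"select serialno from indata where {edit}"):
        failing.setdefault(serialno, []).append(edit)

ages = dict(con.execute("select serialno, age from indata"))
```

In this toy table, records 2 and 3 are flagged for imputation, and record 4 is top-coded to 100 without any model being involved.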


An imputation spec

impute {
    min group size: 10

    categories {
        tract
        age < 18
        age >= 18 and age < 65
        age >= 65
    }
    include: models.spec
}

• For each tract × age category, estimate a new model.

• If a tract × age category has fewer than 10 complete data items, drop age.

• Models: see additional file; next slide.
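The grouping rule can be sketched as follows. This is one reading of the bullets above, not TEA's implementation: it assumes that "drop age" means falling back to all complete records in the same tract. The function names and the data are hypothetical.

```python
from collections import defaultdict

MIN_GROUP_SIZE = 10   # "min group size: 10" in the spec


def age_category(age):
    """The three age bands from the categories block."""
    if age < 18:
        return "age < 18"
    if age < 65:
        return "age >= 18 and age < 65"
    return "age >= 65"


def donor_pools(complete_records):
    """Index complete records by (tract, age band) and by tract alone."""
    by_cell = defaultdict(list)
    by_tract = defaultdict(list)
    for rec in complete_records:
        by_cell[(rec["tract"], age_category(rec["age"]))].append(rec)
        by_tract[rec["tract"]].append(rec)
    return by_cell, by_tract


def pool_for(tract, age, by_cell, by_tract):
    """Use the tract x age cell if it is big enough; otherwise drop age."""
    cell = by_cell.get((tract, age_category(age)), [])
    if len(cell) >= MIN_GROUP_SIZE:
        return cell
    return by_tract.get(tract, [])


# Made-up data: tract "A" has 12 working-age adults but only 2 children.
records = [{"tract": "A", "age": 30 + i} for i in range(12)]
records += [{"tract": "A", "age": 5}, {"tract": "A", "age": 9}]
by_cell, by_tract = donor_pools(records)

adult_pool = pool_for("A", 40, by_cell, by_tract)   # cell is large enough
child_pool = pool_for("A", 7, by_cell, by_tract)    # too small: whole tract is used
```

The model for a 40-year-old in tract A would be estimated from the 12 adults, while the model for a 7-year-old would be estimated from all 14 records in the tract.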


models.spec

draw count: 5
models {
    sex { method: hot deck }

    income { method: lognormal }

    status {
        method: logit
        vars: income | sex
    }
}

• Do five imputations.

• Just like Apophenia does it: estimate model parameters from the complete set; draw from the parameterised model.

• Changing models is easy.
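The estimate-then-draw pattern for the `income` variable might look like the sketch below, which uses only Python's standard library. The data and the `passes_edits` check are made up. Redrawing until the edits pass is one simple way to meet the goal that every imputation passes every edit; the slides do not say how TEA enforces it.

```python
import math
import random
import statistics

DRAW_COUNT = 5   # "draw count: 5" in the spec

# Made-up survey column; None marks a missing item.
incomes = [21000.0, 35000.0, None, 48000.0, 52000.0, None, 61000.0]


def passes_edits(income):
    """Hypothetical edit for this sketch: income must be positive and finite."""
    return math.isfinite(income) and income > 0


# Step 1: estimate the lognormal parameters from the complete items.
logs = [math.log(x) for x in incomes if x is not None]
mu = statistics.fmean(logs)
sigma = statistics.stdev(logs)

# Step 2: draw from the fitted model, once per missing item per imputation.
rng = random.Random(2012)
imputations = []
for _ in range(DRAW_COUNT):
    completed = []
    for x in incomes:
        if x is None:
            x = rng.lognormvariate(mu, sigma)
            while not passes_edits(x):        # redraw until every edit passes
                x = rng.lognormvariate(mu, sigma)
        completed.append(x)
    imputations.append(completed)
```

Swapping in a different model means replacing only the two estimation lines and the draw call. The loop over missing items and the edit check stay the same.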


What I didn’t show you

• Inference control: suppress cells with too few entries

• Recodes: generating variables such as age categories

• Just edits, without imputation

• Raking sparse data sets


Some characteristics to consider

• On-line editing?
  ◦ Near-time: R is quick enough to allow a modify/run/review loop

• Graphical editing?
  ◦ R graphics are awesome, and TEA’s API can be called by other systems. Who wants to write this?

• Variance due to imputation?
  ◦ Via multiple imputation
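The slides say only that imputation variance comes "via multiple imputation". The standard way to combine results across completed data sets is Rubin's rules, sketched below. Whether TEA reports exactly this quantity is an assumption, and the numbers are made up.

```python
import statistics


def combine(estimates, variances):
    """Combine m completed-data estimates using Rubin's rules."""
    m = len(estimates)
    q_bar = statistics.fmean(estimates)        # pooled point estimate
    within = statistics.fmean(variances)       # average within-imputation variance
    between = statistics.variance(estimates)   # between-imputation variance (m - 1 denominator)
    total = within + (1 + 1 / m) * between
    return q_bar, total


# Made-up numbers: a mean and its variance from each of five completed data sets.
q_bar, total = combine(
    [10.0, 11.0, 9.0, 10.5, 9.5],
    [0.40, 0.42, 0.38, 0.41, 0.39],
)
```

The between-imputation term is the part of the variance that a single imputation would hide, which is why several draws are needed.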


Tea makes it easy

• Databases make editing easy!
  ◦ Non-programmer analysts can write edits in SQL; a good SQL engine can process them in no time.

• A modeling framework makes imputation easy!
  ◦ Imputation is always a two-step process:
    ◦ Estimate using the good data
    ◦ Make random draws to fill in the missing data

• R makes viewing subsets and plotting data easy!
  ◦ Not only were those plots sexy, but they gave us a sense of the relative performance of the models.
