TEA for survey processing

Ben Klemens
Rolando Rodríguez

Center for Statistical Research and Methodology
U.S. Census Bureau

ben.klemens@census.gov

September 25, 2012

Disclaimer

This paper is released to inform interested parties of ongoing research and to encourage discussion. The views expressed are those of the authors and not necessarily those of the U.S. Census Bureau.

Klemens & Rodríguez TEA 2/22

Introduction

• Problem statement

• Background on modeling

• An overview of TEA

• A worked example


Problem statement

• TEA is used for:
  ◦ U.S. Census, group quarters
  ◦ American Community Survey, group quarters
  ◦ American Community Survey, island areas

• We want a simple edit specification. A sociologist should be able to write it.

• The research division puts out new and better methods for imputation; they have to be as easy as possible for production to test out.

• Every imputation passes every edit.


The environment; technical dependencies

Don’t reinvent the wheel ⇒ depends on several existing systems

• R: relatively friendly interface, graphics

• SQLite: databases
  ◦ Easy subsetting and joining
  ◦ Handle large data sets
  ◦ SQLite is a lightweight C library, puts the database in a single file

• C libraries:
  ◦ GNU Scientific Library: matrix algebra, random number generation
  ◦ SQLite, as above
  ◦ Apophenia: a library of statistics functions; see below

• TEA is reduced to doing traffic control.


Social/political/business dependencies

• Licensing requirements
  ◦ None.

• API + front-end design
  ◦ If you can’t use R, build another front-end.

• Conforms to:
  ◦ IEEE Std 1003.1-2008 (POSIX)
  ◦ IEEE 754 (floating-point math)
  ◦ ISO/IEC 9899:2011 (C)
  ◦ ISO/IEC 10646:2011 (∼ Unicode 6.1.0)
  ◦ IETF RFC 3629 (UTF-8)


Models as black boxes


Apophenia

• Typical data manipulation facilities

• A standard object representing a model
  ◦ An object, with
    ▪ Parameters: Normal = (µ, σ); OLS = β; Poisson = λ; ...
    ▪ Estimation: fill in the parameters given the data
    ▪ Drawing random values: make fake data given the parameters
    ▪ Log likelihood
    ▪ Expected value
    ▪ ...
  ◦ Bernoulli, Beta, Bi/multinomial, χ², Dirichlet, Exponential, F, Γ, Histogram/empirical PMF, (proper/improper) Uniform, Instrumental Variables, Kernel Density, Loess, (bi/multinomial) Logit, Lognormal, (uni/multivariate) Normal, (weighted) OLS, Poisson, (bi/multinomial) Probit, t, Waring, Wishart, Yule, Zipf; and transformations thereof.
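
The idea of a model as one object with parameters and a fixed set of methods can be sketched in a few lines. This is an illustration in Python, not Apophenia's C interface; the class name `NormalModel` and its method names are hypothetical, chosen only to mirror the items listed on this slide.

```python
import math
import random
import statistics


class NormalModel:
    """Hypothetical stand-in for a model object: parameters plus methods."""

    def __init__(self):
        self.mu = None      # parameters stay empty until estimation
        self.sigma = None

    def estimate(self, data):
        """Fill in the parameters given the data (maximum likelihood)."""
        self.mu = statistics.fmean(data)
        self.sigma = statistics.pstdev(data, mu=self.mu)
        return self

    def draw(self, rng):
        """Make one fake data point given the parameters."""
        return rng.gauss(self.mu, self.sigma)

    def log_likelihood(self, data):
        """Sum of Normal log densities at the current parameters."""
        var = self.sigma ** 2
        return sum(
            -0.5 * math.log(2 * math.pi * var) - (x - self.mu) ** 2 / (2 * var)
            for x in data
        )

    def expected_value(self):
        return self.mu


ages = [23.0, 31.0, 47.0, 52.0, 38.0]
model = NormalModel().estimate(ages)
fake_age = model.draw(random.Random(0))
```

Any other distribution that offers the same four methods could be swapped in without changing the calling code, which is the property the rest of the talk relies on.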


Editing with off-the-shelf models

• Modifying every model to suit is a pain.

• Easier: separate the edit checks from the models entirely.

• Once edits and models have their own spaces, edits are easier to work with too.
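
One way to picture this separation, as a sketch rather than a description of TEA's internals: the model only knows how to draw, the edits only know how to reject, and a small loop connects them. The names `passes_edits` and `draw_until_consistent`, the retry limit, and the toy model are all assumptions made for illustration.

```python
import random

# Edits are stated pessimistically: each function returns True when a
# record FAILS. These two mirror the kind of rule an analyst might write.
EDITS = [
    lambda rec: rec["age"] < 0,
    lambda rec: rec["age"] <= 15 and rec["status"] == "married",
]


def passes_edits(record):
    """A record is consistent when no edit flags it."""
    return not any(edit(record) for edit in EDITS)


def draw_until_consistent(record, field, draw, rng, max_tries=1000):
    """Fill record[field] from draw(rng), redrawing while any edit fails."""
    for _ in range(max_tries):
        candidate = dict(record, **{field: draw(rng)})
        if passes_edits(candidate):
            return candidate
    raise RuntimeError("no consistent draw found; revisit the model or the edits")


def toy_age_model(rng):
    # Stand-in for any off-the-shelf model. It knows nothing about the
    # edits and can return negative or too-young ages.
    return round(rng.gauss(20, 15))


rng = random.Random(1)
incomplete = {"age": None, "status": "married"}
completed = draw_until_consistent(incomplete, "age", toy_age_model, rng)
```

Swapping `toy_age_model` for a different model requires no change to the edits, and adding an edit requires no change to the model.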


The R script

library(tea)
readSpec("demo.spec")
doMImpute()

• Inputs: the spec file has all the details, so these function calls are brief.

• We’ll see the spec in a few slides.

• Outputs: these functions manipulate the database.


Key tables in the database

• ORIGdc: the original data

• dc: the data with recodes, maybe edits blanked out

• filled: the imputations


Interrogation in R

checkOutImpute("indata", "completed", imputation_number=1)
teaTable("completed", limit=20)
justmales <- teaTable("completed", where="sex=0")


Original/imputed ages

R is popular among dataviz geeks ⇒ R graphics are best in breed. See figure.


Original/imputed log wages

Lognormal model.


A specification file’s header and input.

database: afile.db
id: serialno

input {
    input file: survey.csv
    output table: indata
}


An edit spec

List edits using SQL syntax. The specification is pessimistic: each line describes a condition under which a record fails.

age <= 15 and status = "married"   # Illegal in the USA
age < 0                            # OLS may give negative ages.
age > 100 => age = 100             # Top code.

We call that last one a pre-edit. It is deterministic and requires no statistical model.
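
Because the edits are ordinary SQL conditions, a database engine can evaluate them directly. The following is a sketch using Python's built-in `sqlite3` module, not TEA's own processing; the table name `indata` and key `serialno` come from the spec file, while the sample rows and variable names are made up for illustration.

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("create table indata (serialno integer, age integer, status text)")
con.executemany(
    "insert into indata values (?, ?, ?)",
    [
        (1, 34, "married"),
        (2, 12, "married"),   # fails the first edit
        (3, -4, "single"),    # fails the second edit
        (4, 104, "widowed"),  # caught by the pre-edit
    ],
)

# Pre-edit: "condition => assignment" maps onto a deterministic UPDATE.
con.execute("update indata set age = 100 where age > 100")

# Pessimistic edits: each condition describes a failure, so a record
# fails if it matches any one of them.
edits = [
    "age <= 15 and status = 'married'",
    "age < 0",
]
where = " or ".join(f"({e})" for e in edits)
failing = [row[0] for row in con.execute(
    f"select serialno from indata where {where} order by serialno")]

top_coded_age = con.execute(
    "select age from indata where serialno = 4").fetchone()[0]
```

Here `failing` holds the serial numbers of the records that would need imputation, and the top-coded record is fixed without any model being involved.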


An imputation spec

impute {
    min group size: 10

    categories {
        tract
        age < 18
        age >= 18 and age < 65
        age >= 65
    }
    include: models.spec
}

• For each tract × age category, estimate a new model.

• If a tract × age category has fewer than 10 complete data items, drop age.

• Models: see additional file; next slide.
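
The grouping rule in these bullets can be sketched as follows. This is one reading of "drop age", namely falling back from tract × age category to tract alone; the function names and the sample records are assumptions, not TEA's code.

```python
from collections import Counter

MIN_GROUP_SIZE = 10


def age_category(age):
    """The three age bands listed in the categories block."""
    if age < 18:
        return "age < 18"
    if age < 65:
        return "18 <= age < 65"
    return "age >= 65"


def choose_group(record, complete_records, min_size=MIN_GROUP_SIZE):
    """Return the finest grouping key with at least min_size complete records."""
    counts = Counter(
        (r["tract"], age_category(r["age"])) for r in complete_records
    )
    full_key = (record["tract"], age_category(record["age"]))
    if counts[full_key] >= min_size:
        return full_key
    return (record["tract"],)  # too few complete records: drop age, keep tract


# Tract 1 has 12 working-age respondents but only 3 seniors.
complete = (
    [{"tract": 1, "age": 40} for _ in range(12)]
    + [{"tract": 1, "age": 70} for _ in range(3)]
)
fine_group = choose_group({"tract": 1, "age": 30}, complete)
coarse_group = choose_group({"tract": 1, "age": 80}, complete)
```

The working-age record keeps its tract × age group, while the senior's model would be estimated from everyone in the tract.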


models.spec

draw count: 5
models {
    sex { method: hot deck }

    income { method: lognormal }

    status {
        method: logit
        vars: income | sex
    }
}

• Do five imputations.

• Just like Apophenia does it: estimate model parameters from the complete set; draw from the parameterised model.

• Changing models is easy.
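
The estimate-then-draw cycle for the `income` line can be sketched as follows. This is a minimal illustration with Python's standard library, assuming a lognormal fitted from the mean and standard deviation of log income; the sample values and variable names are invented, and TEA itself delegates this work to Apophenia.

```python
import math
import random
import statistics

DRAW_COUNT = 5

incomes = [31000.0, None, 52000.0, 18000.0, None, 74000.0, 45000.0]

# Step 1: estimate the model parameters from the complete values only.
logs = [math.log(x) for x in incomes if x is not None]
mu = statistics.fmean(logs)
sigma = statistics.pstdev(logs, mu=mu)

# Step 2: draw from the parameterised model, once per imputation.
rng = random.Random(42)
imputations = []
for _ in range(DRAW_COUNT):
    completed = [
        x if x is not None else rng.lognormvariate(mu, sigma)
        for x in incomes
    ]
    imputations.append(completed)
```

Each of the five completed lists keeps the observed values untouched and differs only in the filled-in cells, which is what lets later analysis measure how much the imputation itself contributes to uncertainty.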


What I didn’t show you

• Inference control: suppress cells with too few entries

• Recodes: generating variables such as age categories

• Just edits, without imputation

• Raking sparse data sets


Some characteristics to consider

• On-line editing?
  ◦ Near-time: R is quick enough to allow a modify/run/review loop

• Graphical editing?
  ◦ R graphics are awesome, and TEA’s API can be called by other systems. Who wants to write this?

• Variance due to imputation?
  ◦ Via multiple imputation
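
A standard way to turn multiple imputations into a variance estimate is Rubin's combining rules: the total variance is the average within-imputation variance plus (1 + 1/m) times the between-imputation variance of the estimates. The slides do not say which formula TEA's users apply, so the sketch below illustrates that general technique only; the function name and the numbers are made up.

```python
import statistics


def combine(estimates, variances):
    """Rubin's rules for m completed-data estimates and their variances."""
    m = len(estimates)
    q_bar = statistics.fmean(estimates)        # combined point estimate
    within = statistics.fmean(variances)       # average sampling variance
    between = statistics.variance(estimates)   # across imputations, m - 1 divisor
    total = within + (1 + 1 / m) * between
    return q_bar, total


# Mean income estimated on each of five completed data sets.
estimates = [41200.0, 40850.0, 41900.0, 41400.0, 40650.0]
variances = [250000.0, 262000.0, 241000.0, 255000.0, 259000.0]
q_bar, total_variance = combine(estimates, variances)
```

The gap between `total_variance` and the average of `variances` is the part of the uncertainty attributable to imputation.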


Tea makes it easy

• Databases make editing easy!
  ◦ Non-programmer analysts can write edits in SQL; a good SQL engine can process them in no time.

• A modeling framework makes imputation easy!
  ◦ Imputation is always a two-step process:
    ▪ Estimate using the good data
    ▪ Make random draws to fill in the missing data

• R makes viewing subsets and plotting data easy!
  ◦ Not only were those plots sexy, but they gave us a sense of the relative performance of the models.
