TEA for survey processing

Ben Klemens, Rolando Rodríguez

Center for Statistical Research and Methodology, U.S. Census Bureau

ben.klemens@census.gov

September 25, 2012

Disclaimer

This paper is released to inform interested parties of ongoing research and to encourage discussion. The views expressed are those of the authors and not necessarily those of the U.S. Census Bureau.

Klemens & Rodrıguez TEA 2/22

Introduction

• Problem statement

• Background on modeling

• An overview of TEA

• A worked example

Problem statement

• Used for:
  - U.S. Census, group quarters
  - American Community Survey, group quarters
  - American Community Survey, island areas

• We want a simple edit specification. A sociologist should be able to write it.

• The research division puts out new and better methods for imputation; they have to be as easy as possible for production to test out.

• Every imputation passes every edit.

The environment; technical dependencies

Don’t reinvent the wheel ⇒ depends on several existing systems

• R: relatively friendly interface, graphics

• SQLite: databases
  - Easy subsetting and joining
  - Handle large data sets
  - SQLite is a lightweight C library that puts the database in a single file

• C libraries:
  - GNU Scientific Library: matrix algebra, random number generation
  - SQLite, as above
  - Apophenia: a library of statistics functions; see below

• Tea is reduced to doing traffic control.
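To make the SQLite points concrete, here is a minimal sketch using Python's standard `sqlite3` module rather than Tea itself. The file name, table names, and columns are hypothetical and chosen only for illustration; the point is that the whole database sits in one file, subsetting is a WHERE clause, and joining is a JOIN clause.

```python
import os
import sqlite3
import tempfile

# Hypothetical file and table names, chosen only for this illustration.
db_path = os.path.join(tempfile.mkdtemp(), "survey.db")

con = sqlite3.connect(db_path)  # the whole database lives in this one file
con.execute("create table person (serialno integer, age integer, tract text)")
con.execute("create table tract_info (tract text, region text)")
con.executemany(
    "insert into person values (?, ?, ?)",
    [(1, 34, "A"), (2, 71, "A"), (3, 12, "B")],
)
con.executemany(
    "insert into tract_info values (?, ?)",
    [("A", "north"), ("B", "south")],
)
con.commit()

# Subsetting is a WHERE clause; joining is a JOIN clause.
adults_with_region = con.execute(
    """
    select p.serialno, p.age, t.region
    from person p join tract_info t on p.tract = t.tract
    where p.age >= 18
    order by p.serialno
    """
).fetchall()
con.close()

print(adults_with_region)
```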

Social/political/business dependencies

• Licensing requirements
  - None.

• API + front-end design
  - If you can’t use R, build another front-end.

• Conforms to:
  - IEEE Std 1003.1-2008 (POSIX)
  - IEEE 754 (floating-point math)
  - ISO/IEC 9899:2011 (C)
  - ISO/IEC 10646:2011 (∼ Unicode 6.1.0)
  - IETF RFC 3629 (UTF-8)

Models as black boxes

Apophenia

• Typical data manipulation facilities

• A standard object representing a model: an object, with
  - Parameters: Normal = (µ, σ); OLS = β; Poisson = λ; ...
  - Estimation: fill in the parameters given the data
  - Drawing random values: make fake data given the parameters
  - Log likelihood
  - Expected value
  - Models: Bernoulli, Beta, Bi/multinomial, χ², Dirichlet, Exponential, F, Γ, Histogram/empirical PMF, (proper/improper) Uniform, Instrumental Variables, Kernel Density, Loess, (bi/multinomial) Logit, Lognormal, (uni/multivariate) Normal, (weighted) OLS, Poisson, (bi/multinomial) Probit, t, Waring, Wishart, Yule, Zipf; and transformations thereof.
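The idea of a model as a black box with a fixed set of operations can be sketched in a few lines. This is a toy Python stand-in, not Apophenia's C API: the class name and method names are assumptions made for illustration, and only the Normal model is shown.

```python
import math
import random
import statistics


class NormalModel:
    """Toy stand-in for a model object; not Apophenia's actual API."""

    def __init__(self):
        self.mu = None
        self.sigma = None

    def estimate(self, data):
        # Fill in the parameters given the data (maximum likelihood).
        self.mu = statistics.fmean(data)
        self.sigma = statistics.pstdev(data)
        return self

    def draw(self, rng):
        # Make fake data given the parameters.
        return rng.gauss(self.mu, self.sigma)

    def log_likelihood(self, data):
        # Sum of Normal log densities over the data.
        n = len(data)
        sse = sum((x - self.mu) ** 2 for x in data)
        return (-n * math.log(self.sigma * math.sqrt(2 * math.pi))
                - sse / (2 * self.sigma ** 2))

    def expected_value(self):
        return self.mu


data = [2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0]
model = NormalModel().estimate(data)
rng = random.Random(0)
fake = [model.draw(rng) for _ in range(3)]
```

Any other model that offers the same four operations could be swapped in without changing the calling code, which is what lets the caller treat models as interchangeable.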

Editing with off-the-shelf models

• Modifying every model to suit is a pain.

• Easier: separate the edit checks from the models entirely.

• Once edits and models have their own spaces, edits are easier to work with too.
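One way to picture the separation is a loop that asks any model for a draw and then checks the draw against an independent list of edits, redrawing on failure. This is a sketch under stated assumptions: the slides do not say that Tea uses a redraw loop, and the edit list, record layout, and function names below are hypothetical.

```python
import random

# Edits are pessimistic: each predicate returns True when a record FAILS.
# Both the edit list and the record layout are hypothetical.
EDITS = [
    lambda rec: rec["age"] < 0,
    lambda rec: rec["age"] <= 15 and rec["status"] == "married",
]


def fails_any_edit(rec):
    return any(edit(rec) for edit in EDITS)


def impute_age(rec, draw, max_tries=1000):
    """Fill rec['age'] from any model's draw(), retrying until edits pass."""
    for _ in range(max_tries):
        candidate = dict(rec, age=draw())
        if not fails_any_edit(candidate):
            return candidate
    raise RuntimeError("no draw passed the edits")


rng = random.Random(42)


def normal_age():
    # The model knows nothing about the edits; it only produces draws.
    return round(rng.gauss(20, 15))


filled = impute_age({"age": None, "status": "married"}, normal_age)
```

Because the model never sees the edits, replacing `normal_age` with a different draw function requires no change to the edit list.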

The R script

library(tea)
readSpec("demo.spec")
doMImpute()

• Inputs: the spec file has all the details, so these function calls are brief.

• We’ll see the spec in a few slides.

• Outputs: these functions manipulate the database.

Key tables in the database

• ORIGdc: the original data

• dc: the data with recodes applied, and possibly with edit-failing values blanked out

• filled: the imputations

Interrogation in R

checkOutImpute("indata", "completed", imputation_number=1)
teaTable("completed", limit=20)
justmales <- teaTable("completed", where="sex=0")

Original/imputed ages

R is popular among dataviz geeks ⇒ R graphics are best in breed. See figure.

Original/imputed log wages

Lognormal model.

A specification file’s header and input.

database: afile.db
id: serialno

input {
    input file: survey.csv
    output table: indata
}

An edit spec

List edits using SQL syntax. Specification is pessimistic (i.e., list when failure occurs).

age <= 15 and status = "married"   # Illegal in the USA
age < 0                            # OLS may give negative ages.
age > 100 => age = 100             # Top code.

We call that last one a pre-edit. It is deterministic and requires no statistical model.
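Because each edit is written as a failure condition in SQL syntax, it can be dropped directly into a WHERE clause. The sketch below uses Python's `sqlite3` module to show the idea; it is not Tea's implementation, the sample rows are made up, and the string literal uses single quotes, which is the standard SQL form.

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("create table indata (serialno integer, age integer, status text)")
con.executemany(
    "insert into indata values (?, ?, ?)",
    [(1, 14, "married"), (2, 40, "single"), (3, -3, "single"), (4, 104, "widowed")],
)

# Pre-edit: deterministic, so it is a plain UPDATE with no model involved.
con.execute("update indata set age = 100 where age > 100")

# Each edit describes a failure, so it drops straight into a WHERE clause.
edits = ["age <= 15 and status = 'married'", "age < 0"]
failures = {
    edit: [row[0] for row in con.execute(
        f"select serialno from indata where {edit} order by serialno")]
    for edit in edits
}
top_coded = con.execute("select age from indata where serialno = 4").fetchone()[0]
con.close()

print(failures)
```

Records listed under an edit are the ones whose offending fields would be blanked and handed to the imputation step.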

An imputation spec

impute {
    min group size: 10

    categories {
        tract
        age < 18
        age >= 18 and age < 65
        age >= 65
    }
    include: models.spec
}

• For each tract × age category, estimate a new model.

• If a tract × age category has fewer than 10 complete data items, drop age.

• Models: see additional file; next slide.
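The category fallback described above can be sketched as follows. This is an interpretation, not Tea's code: "drop age" is read here as falling back from the tract × age cell to the tract alone, what happens when the tract itself is too small is not stated on the slide, and the function and field names are hypothetical. The threshold is set to 3 rather than 10 only to keep the sample data short.

```python
from collections import defaultdict

MIN_GROUP_SIZE = 3  # the slide uses 10; smaller here to keep the data short


def age_band(age):
    # The three age categories from the spec.
    if age < 18:
        return "under 18"
    if age < 65:
        return "18 to 64"
    return "65 and over"


def choose_group(record, complete_records):
    """Return the estimation pool: tract x age band, or tract alone if too small."""
    by_cell = defaultdict(list)
    by_tract = defaultdict(list)
    for rec in complete_records:
        by_cell[(rec["tract"], age_band(rec["age"]))].append(rec)
        by_tract[rec["tract"]].append(rec)
    cell = by_cell[(record["tract"], age_band(record["age"]))]
    if len(cell) >= MIN_GROUP_SIZE:
        return cell
    return by_tract[record["tract"]]  # age category dropped


complete = [
    {"tract": "A", "age": 30}, {"tract": "A", "age": 40},
    {"tract": "A", "age": 50}, {"tract": "A", "age": 70},
    {"tract": "B", "age": 10},
]
big_enough = choose_group({"tract": "A", "age": 35}, complete)
fell_back = choose_group({"tract": "A", "age": 80}, complete)
```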

models.spec

draw count: 5
models {
    sex { method: hot deck }

    income { method: lognormal }

    status {
        method: logit
        vars: income | sex
    }
}

• Do five imputations.

• Just like Apophenia does it: estimate model parameters from complete set; draw from parameterised model.

• Changing models is easy.
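The estimate-then-draw pattern for the income variable might look like the sketch below, written with Python's standard library rather than Tea or Apophenia. The income values are made up, and the sketch is simplified: it reuses one set of estimated parameters for all five draws, whereas fuller multiple-imputation procedures typically also reflect uncertainty in the parameters.

```python
import math
import random
import statistics

DRAW_COUNT = 5  # "draw count: 5" in the spec

incomes = [30000.0, 45000.0, None, 52000.0, None, 38000.0]  # None = missing

# Step 1: estimate the lognormal parameters from the complete values.
logs = [math.log(x) for x in incomes if x is not None]
mu = statistics.fmean(logs)
sigma = statistics.stdev(logs)

# Step 2: draw from the parameterised model, once per imputation.
rng = random.Random(2012)
imputations = []
for _ in range(DRAW_COUNT):
    completed = [x if x is not None else rng.lognormvariate(mu, sigma)
                 for x in incomes]
    imputations.append(completed)
```

Swapping in a different model means replacing the two estimation lines and the draw call; the loop over imputations stays the same.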

What I didn’t show you

• Inference control: suppress cells with too few entries

• Recodes: generating variables such as age categories

• Just edits, without imputation

• Raking sparse data sets

Some characteristics to consider

• On-line editing?
  - Near-time: R is quick enough to allow a modify/run/review

• Graphical editing?
  - R graphics are awesome, and TEA’s API can be called by other systems. Who wants to write this?

• Variance due to imputation?
  - Via multiple imputation

Tea makes it easy

• Databases make editing easy!
  - Non-programmer analysts can write edits in SQL; a good SQL engine can process them in no time.

• A modeling framework makes imputation easy!
  - Imputation is always a two step process:
    - Estimate using the good data
    - Make random draws to fill in the missing data

• R makes viewing subsets and plotting data easy!
  - Not only were those plots sexy, but they gave us a sense of the relative performance of the models.
