Small Area Estimation: An Overview
Stas Kolenikov
Abt SRBI
AAPOR 2014
Stas Kolenikov (Abt SRBI) Small Area Estimation: An Overview AAPOR 2014 1 / 28
Motivation
In many social, behavioral or health studies, there may be interest in obtaining estimates for small subgroups of the population
National study → estimates for
  - states
  - counties
  - school districts
  - metro areas (NHIS phone use)
  - health service areas
Statewide study → estimates for counties or cities
City-wide study → estimates for neighborhoods
Detailed industry by size by region classifications (some BLS work)
US official statistics
Census Bureau: Small Area Income and Poverty Estimates
  - States, Counties, School districts
  - http://www.census.gov/did/www/saipe/
National Cancer Institute: Small Area Estimates for Cancer Risk Factors and Screening Behaviors
  - States, Counties
  - Combines BRFSS and NHIS (Raghunathan et al. 2007)
  - http://sae.cancer.gov/
National Center for Health Statistics: NHIS phone use data (Blumberg et al. 2013, Stephen Blumberg’s presentation)
Some work at Abt SRBI
2013 National Survey of American Jews for the Pew Research Center
  - County-level estimates of Jewish population
  - Coverage decisions (dial 90% of US population, 99% of Jewish population)
  - Stratification decisions
  - See also Abt SRBI news release, Pew Research Center Methods report, AAPOR 2013 presentations by Ben Phillips and Stas Kolenikov
Small area phone usage figures
  - ACS PUMA level estimates
  - Sample design for current and future surveys
  - Weighting targets
  - See AAPOR 2013 presentation by Kolenikov and ZuWallack
Ongoing work in health surveys conducted by Abt SRBI
  - Small area = community within a city, county within a state
Statement of the problem
Decent estimates are needed!
Given the low sample sizes for small areas (sometimes n = double digits, sometimes n = single digits, sometimes n = 0), can any reasonable estimates be obtained? If yes, can any reasonable measures of precision be attached to these estimates?
Since survey means/proportions (direct estimates in SAE jargon) may not be available, or are of insufficient accuracy, statistical models have to be used and woven into complex survey estimation.
Small area estimation approaches
Statistical approaches growing from traditional survey statistics:
Apply the ratios/proportions
Treat as mixed/multilevel models
Filter/reweight a “big” data set to look like a small area
Fit models on one data set, apply on another
Main reference: Rao (2003)
Reviews: Ghosh & Rao (1994), Rao (1999), Pfeffermann (2002), Datta (2009), Lehtonen & Veijanen (2009), Pfeffermann (2013)
Data requirements
A large data set that would allow precise estimation of the SAE model for an outcome y using variables x
Identifiers of the small areas (if a part of the large data set)
The same variables x for the small area as those used in the SAE model in the large data set
Groupwise ratio estimator
Applies means or proportions of the outcome in population groups
1 Split the large sample into groups (e.g., age-gender-education) g = 1, ..., G
2 Estimate the nationwide mean outcome in each group, $\bar y_g$
3 Apply the small area proportions $\gamma_g$ of the groups as weights to obtain the SAE estimate $\sum_g \gamma_g \bar y_g$
This produces a synthetic estimate, i.e., one based only on aggregate data that are reweighted to the small area's group composition; no outcome data from the small area itself are used.
Group-wise ratio estimator
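Education level      % smoking    Nationwide    Columbia, MO
Less than college    22.6%        69.2%         44.1%
Bachelor+            7.9%         30.8%         55.9%
Overall % smoking                 18.0%         14.4%
                                  (direct)      (synthetic)
Source               NHIS 2012    NHIS 2012     ACS 2012 (via FactFinder)
Note: age 25+. The Nationwide and Columbia, MO columns give the share of each education group in the respective population.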
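The synthetic figure in the table can be reproduced from the group-level smoking rates and the education shares for Columbia, MO. The following is a minimal Python sketch of that arithmetic; the function and variable names are illustrative and not part of the original analysis.

```python
# Synthetic (groupwise ratio) estimate: sum over groups of share * group mean.
# All figures are copied from the table; names are illustrative only.

def synthetic_estimate(group_means, area_shares):
    """Weight nationwide group means by the small area's group shares."""
    if abs(sum(area_shares.values()) - 1.0) > 1e-6:
        raise ValueError("area shares must sum to 1")
    return sum(area_shares[g] * group_means[g] for g in group_means)

# Nationwide smoking rates by education group (NHIS 2012, age 25+)
smoking_rate = {"less_than_college": 0.226, "bachelor_plus": 0.079}

# Education shares in Columbia, MO (ACS 2012)
columbia_shares = {"less_than_college": 0.441, "bachelor_plus": 0.559}

columbia_synthetic = synthetic_estimate(smoking_rate, columbia_shares)
print(f"Synthetic estimate for Columbia, MO: {columbia_synthetic:.1%}")
```

The lower synthetic rate for Columbia, MO comes entirely from its higher share of college graduates, since the group-level rates are the nationwide ones.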
Issues
Simplistic: assumes group structure explains all variations in outcome
Synthetic estimate only: even if there are data available for Columbia, MO, they are ignored
No telling how accurate the (implicit) model of ratios y ∝ x is
Area model: SAIPE history
Fay & Herriot (1979) analyzed per capita incomes for small places in the US with population less than 1,000:
$\theta_i = x_i'\beta + v_i$
$\hat\theta_i = \theta_i + e_i = x_i'\beta + v_i + e_i$ (1)
where
$\theta_i = \log \bar Y_i$ is the log of the mean per capita income
$x_i$ are demographic explanatory variables
$v_i$ is the model error
$\hat\theta_i$ is the direct estimator based only on sample data on $y_i$ (if $n_i > 0$)
$e_i$ is sampling error
Area model: composite estimate
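Fay-Herriot model (1) → composite estimate
$\tilde\theta_i = \gamma_i \hat\theta_i + (1 - \gamma_i) x_i'\hat\beta$ (2)
$\gamma_i = \frac{\sigma_v^2}{\sigma_v^2 + \psi_i}$ = contribution of the direct estimate, (3)
$\psi_i$ = sampling variance of $\hat\theta_i$,
$\hat\beta$ = estimate of $\beta$ from (1)
That is,
composite estimate = [precision weight $\gamma_i$] × direct estimate + [precision weight $1 - \gamma_i$] × synthetic estimate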
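To illustrate how the precision weight in (3) moves the composite estimate between the direct and synthetic values, here is a minimal Python sketch. It assumes that the regression coefficients and the model variance have already been estimated; every number below is made up for illustration.

```python
# Composite (shrinkage) estimate for the area-level model, equations (2)-(3).
# Assumes sigma2_v and beta_hat were estimated elsewhere; all numbers are
# hypothetical.

def shrinkage_weight(sigma2_v, psi_i):
    """gamma_i = sigma2_v / (sigma2_v + psi_i): weight on the direct estimate."""
    return sigma2_v / (sigma2_v + psi_i)

def composite_estimate(direct, psi_i, x_i, beta_hat, sigma2_v):
    """gamma_i * direct + (1 - gamma_i) * x_i' beta_hat."""
    synthetic = sum(x * b for x, b in zip(x_i, beta_hat))
    gamma_i = shrinkage_weight(sigma2_v, psi_i)
    return gamma_i * direct + (1.0 - gamma_i) * synthetic

beta_hat = [9.0, 0.5]     # hypothetical intercept and slope
sigma2_v = 0.04           # hypothetical model variance

# Two hypothetical areas with the same direct estimate and covariates,
# but very different sampling variances psi_i.
precise = composite_estimate(direct=10.0, psi_i=0.01, x_i=[1.0, 1.0],
                             beta_hat=beta_hat, sigma2_v=sigma2_v)
noisy = composite_estimate(direct=10.0, psi_i=0.36, x_i=[1.0, 1.0],
                           beta_hat=beta_hat, sigma2_v=sigma2_v)
print(precise, noisy)
```

Both areas share the synthetic value 9.5, but the area with the larger sampling variance gets a weight of only 0.1 on its direct estimate and is pulled much closer to 9.5 than the precisely measured area, whose weight is 0.8.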
Unit models
Unit-level data on respondents in their specific areas is available → the unit model for responses:
$y_{ij} = x_{ij}'\beta + v_i + \varepsilon_{ij}$ (4)
where
i still enumerates the areas
j enumerates individuals in areas
Unit model: composite estimate
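If sampling fractions within areas are small and the covariates are known for all units in the area, then the composite estimator is
$\hat\mu_i = \gamma_i [\bar y_i + (\bar X_i - \bar x_i)'\hat\beta] + (1 - \gamma_i) \bar X_i'\hat\beta$, (5)
$\gamma_i = \frac{\sigma_v^2}{\sigma_v^2 + \sigma_e^2 / n_i}$ (6)
where $\bar y_i$ and $\bar x_i$ are the sample means in area i, and $\bar X_i$ is the vector of population means of the covariates in area i.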
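As a worked illustration of (5) and (6), the sketch below computes the composite estimate for a single area. It assumes that the coefficients and both variance components were already estimated (for example, from a mixed model fit); the input values are hypothetical, and the handling of an area with no sample is a convention chosen here, not something stated in the slides.

```python
# Unit-level composite estimate, equations (5)-(6), for one area.
# Parameter values are hypothetical and assumed to be estimated already.

def dot(a, b):
    return sum(ai * bi for ai, bi in zip(a, b))

def unit_composite(y_bar, x_bar, X_bar, n_i, beta_hat, sigma2_v, sigma2_e):
    """Return (gamma_i, mu_hat_i) for one small area.

    y_bar, x_bar : sample mean of the outcome and of the covariates
    X_bar        : known population means of the covariates
    """
    synthetic = dot(X_bar, beta_hat)
    if n_i == 0:
        # Assumed convention: with no sample, use the synthetic part alone.
        return 0.0, synthetic
    gamma_i = sigma2_v / (sigma2_v + sigma2_e / n_i)
    # Sample mean adjusted for the gap between population and sample covariates
    diff = [X - x for X, x in zip(X_bar, x_bar)]
    adjusted_direct = y_bar + dot(diff, beta_hat)
    return gamma_i, gamma_i * adjusted_direct + (1.0 - gamma_i) * synthetic

gamma_i, mu_hat = unit_composite(
    y_bar=12.0, x_bar=[1.0, 3.0], X_bar=[1.0, 4.0], n_i=5,
    beta_hat=[2.0, 2.0], sigma2_v=1.0, sigma2_e=5.0,
)
print(gamma_i, mu_hat)
```

With five respondents and these variance components the weight is one half, so the estimate falls midway between the covariate-adjusted sample mean (14) and the synthetic value (10).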
Variance estimation
In the area model framework,
V[composite estimator $\tilde\theta_i$] =
  [$g_{1i}(\sigma_v^2, \psi_i)$ ∼ fraction of the direct estimator variance]
  + [$g_{2i}(\sigma_v^2, \psi_i, x_i)$ ∼ sampling uncertainty due to the fact that $\beta$ needs to be estimated]
  + [$g_{3i}(\sigma_v^2, \psi_i)$ ∼ sampling uncertainty due to the fact that $\sigma_v^2$ needs to be estimated]
Variance estimation, small print
What is being estimated is the mean squared error (MSE) rather than the variance of an estimate
Contribution of $v_i$ is treated as a bias contribution rather than variance contribution
Simply plugging in the parameter estimates underestimates MSE
This is a large sample expression applicable when there are many small areas
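For reference, in the basic area-level model the three MSE terms take the following standard forms, as given for example in Rao (2003). The symbol $\bar V(\hat\sigma_v^2)$ for the asymptotic variance of the variance component estimator is notation introduced here, not taken from the slides, and $\gamma_i = \sigma_v^2 / (\sigma_v^2 + \psi_i)$ as in the composite estimate.

```latex
\begin{align*}
g_{1i}(\sigma_v^2, \psi_i) &= \gamma_i \psi_i
  = \frac{\sigma_v^2 \psi_i}{\sigma_v^2 + \psi_i}, \\
g_{2i}(\sigma_v^2, \psi_i, x_i) &= (1 - \gamma_i)^2 \,
  x_i' \Bigl[ \sum_{j} \frac{x_j x_j'}{\sigma_v^2 + \psi_j} \Bigr]^{-1} x_i, \\
g_{3i}(\sigma_v^2, \psi_i) &= \frac{\psi_i^2}{(\sigma_v^2 + \psi_i)^3} \,
  \bar V(\hat\sigma_v^2).
\end{align*}
```

Because $\gamma_i \le 1$, the leading term is at most the direct estimator variance $\psi_i$. A commonly used MSE estimator evaluates $g_{1i} + g_{2i} + 2 g_{3i}$ at $\hat\sigma_v^2$; the doubled last term compensates for the downward bias of the plug-in $g_{1i}$.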
SAE Extensions
Multivariate models (for several outcomes at a time; cancer SAE)
Generalized linear models (e.g., logistic for binary outcome) (Jiang & Lahiri 2001)
Models for quantities other than means and proportions (e.g., quantiles) (Chambers & Tzavidis 2006)
Spatial covariance modeling for area effects
Parametric bootstrap to account for uncertainty in variance component estimation (Lahiri 2003)
Bayesian methods (Ghosh et al. 1998) (cancer SAE)
Simulation of non-sampled units
1 For each small area, prepare the set of calibration characteristics
2 Take a rich, large sample survey data set
3 Simulate the units in the small area by either...
  - ...reweighting the observations in the large data set, or...
  - ...finding a subset of observations from the large data set via combinatorial optimization...
  ...so that the resulting sample matches the calibration characteristics of the small area of interest
Microdata analysis can be performed on the resulting microsample that is made to resemble the characteristics of the small area of interest.
See Williamson et al. (1998)
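The reweighting variant can be sketched with raking (iterative proportional fitting). Raking is only one possible reweighting scheme and is assumed here for illustration; it is not the combinatorial optimization approach of Williamson et al. (1998). The records, variable names and age targets below are made up.

```python
# Reweight a large survey sample so that its weighted margins match a small
# area's calibration shares, using raking (iterative proportional fitting).
# Records and targets are hypothetical.

def rake(records, weights, targets, max_iter=100, tol=1e-10):
    """records: list of dicts; targets: {variable: {category: target share}}."""
    w = list(weights)
    for _ in range(max_iter):
        max_change = 0.0
        for var, shares in targets.items():
            total = sum(w)
            # Current weighted share of each category of this variable
            current = {c: 0.0 for c in shares}
            for rec, wi in zip(records, w):
                current[rec[var]] += wi / total
            # Scale weights so that this margin matches its target
            for k, rec in enumerate(records):
                factor = shares[rec[var]] / current[rec[var]]
                max_change = max(max_change, abs(factor - 1.0))
                w[k] *= factor
        if max_change < tol:
            break
    return w

def weighted_share(records, weights, var, category):
    total = sum(weights)
    return sum(w for r, w in zip(records, weights) if r[var] == category) / total

big_sample = [
    {"educ": "lt_college", "age": "25-44", "smoker": 1},
    {"educ": "lt_college", "age": "45+", "smoker": 0},
    {"educ": "bachelor_plus", "age": "25-44", "smoker": 0},
    {"educ": "bachelor_plus", "age": "45+", "smoker": 0},
]
area_targets = {
    "educ": {"lt_college": 0.441, "bachelor_plus": 0.559},
    "age": {"25-44": 0.6, "45+": 0.4},   # hypothetical age shares
}
area_weights = rake(big_sample, [1.0] * len(big_sample), area_targets)

# Any weighted analysis can now be run on the reweighted microsample.
print(weighted_share(big_sample, area_weights, "smoker", 1))
```

After raking, the weighted education and age margins of the large sample match the small area targets, and the weighted outcome is read as a simulated value for the area.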
Two micro data sets
Blumberg et al. (2009) estimated phone usage at state level:
1 Phone usage data is available in NHIS
2 Fit the multinomial logistic model using NHIS outcome $y_i$ and NHIS demographics $x_i$
3 Microdata with both relevant demographic variables and state IDs are available through CPS
4 Take NHIS coefficients, apply the model with CPS demographics, get predicted probabilities for everybody
5 Estimate prevalences as average probabilities by state
Likewise, combine NHIS and ACS data (with PUMA geographic IDs) to estimate phone usage: Battaglia et al. (2010) for NYC neighborhoods and Kolenikov & ZuWallack (2013) for states.
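Steps 4 and 5 can be sketched as follows. This is a minimal Python illustration rather than the code used in the cited reports: the outcome categories, the covariates, the coefficients and the records are all hypothetical, and weighting the average by survey weights is an assumption made here.

```python
import math
from collections import defaultdict

# Apply coefficients from a model fitted on one survey to the covariates in
# another, then average predicted probabilities by state.
# Categories, coefficients and records are all hypothetical.

CATEGORIES = ["landline_only", "dual", "wireless_only"]  # first is baseline

# Multinomial logit coefficients: (intercept, age_under_35, renter).
# The baseline category has all coefficients fixed at zero.
COEF = {
    "landline_only": [0.0, 0.0, 0.0],
    "dual":          [1.0, 0.2, -0.3],
    "wireless_only": [-0.5, 1.2, 0.8],
}

def predict_probs(x):
    """Softmax of the linear predictors; x excludes the intercept."""
    eta = {c: b[0] + sum(bj * xj for bj, xj in zip(b[1:], x))
           for c, b in COEF.items()}
    m = max(eta.values())                      # stabilise the exponentials
    expd = {c: math.exp(v - m) for c, v in eta.items()}
    denom = sum(expd.values())
    return {c: e / denom for c, e in expd.items()}

def state_prevalence(records):
    """Weighted average of predicted probabilities within each state."""
    num = defaultdict(lambda: defaultdict(float))
    den = defaultdict(float)
    for rec in records:
        p = predict_probs(rec["x"])
        den[rec["state"]] += rec["weight"]
        for c in CATEGORIES:
            num[rec["state"]][c] += rec["weight"] * p[c]
    return {s: {c: num[s][c] / den[s] for c in CATEGORIES} for s in den}

# Hypothetical records from the second survey: state ID, weight, covariates
cps_like = [
    {"state": "MO", "weight": 1500.0, "x": [1, 1]},
    {"state": "MO", "weight": 2200.0, "x": [0, 0]},
    {"state": "IL", "weight": 1800.0, "x": [1, 0]},
    {"state": "IL", "weight": 2000.0, "x": [0, 1]},
]

prevalence = state_prevalence(cps_like)
for state, probs in prevalence.items():
    print(state, {c: round(p, 3) for c, p in probs.items()})
```

The approach relies on the covariates being defined the same way in both data sets, since the coefficients are only meaningful for the variables they were fitted on.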
What I covered today
1 Motivation
  - Federal statistics
  - Work at Abt SRBI
2 Methods
  - Ratio calculations
  - Area models
  - Unit models
  - Variance estimation
  - Extensions
  - Reweighting simulation
  - Two micro data sets
3 Discussion
SAE is the synthesis of. . .
[Diagram: the following fields feed into SMALL AREA ESTIMATION]
  - Mixed models
  - GLM
  - Survey statistics
  - BLUP / MSE optimization
  - Bayesian methods
  - GIS
Challenges
Standard errors are always tough
Need to match the outcomes data set with the administrative data set
  - Definitions of explanatory variables may not match in different data sets
Very detailed levels of geography may be protected due to confidentiality constraints
  - ACS microdata only identify PUMAs (≈100,000 people)
Methodological challenges specific to the components of SAE
  - Non-response as a survey methodology issue
  - Use of weights in multilevel models as a multilevel modeling issue
  - Complicated custom code
The end
THANK YOU!
Questions, comments, requests: [email protected]
References I
Battaglia, M. P., Eisenhower, D., Immerwahr, S. & Konty, K. (2010), Dual-frame weighting of RDD and cell phone interviews at the local level, in ‘Proceedings of Survey Research Methods Section’, The American Statistical Association, Alexandria, VA.
Blumberg, S. J., Ganesh, N., Luke, J. V. & Gonzales, G. (2013), Wireless substitution: State-level estimates from the National Health Interview Survey, 2012, Technical Report 70, National Center for Health Statistics.
Blumberg, S. J., Luke, J. V., Davidson, G., Davern, M. E., Yu, T.-C. & Soderberg, K. (2009), Wireless substitution: State-level estimates from the National Health Interview Survey, January-December 2007, Technical Report 14, National Center for Health Statistics.
Chambers, R. & Tzavidis, N. (2006), ‘M-quantile models for small area estimation’, Biometrika 93(2), 255–268.
References II
Datta, G. S. (2009), Model-based approach to small area estimation, in D. Pfeffermann & C. R. Rao, eds, ‘Sample Surveys: Inference and Analysis’, Vol. 29B of Handbook of Statistics, North Holland, Amsterdam, pp. 251–288.
Fay, R. E. & Herriot, R. A. (1979), ‘Estimates of income for small places: An application of James-Stein procedures to census data’, Journal of the American Statistical Association 74(366), 269–277.
Ghosh, M., Natarajan, K., Stroud, T. W. F. & Carlin, B. P. (1998), ‘Generalized linear models for small-area estimation’, Journal of the American Statistical Association 93(441), 273–282.
Ghosh, M. & Rao, J. N. K. (1994), ‘Small area estimation: An appraisal’, Statistical Science 9(1), 55–76.
Jiang, J. & Lahiri, P. (2001), ‘Empirical best prediction for small area inference with binary data’, Annals of the Institute of Statistical Mathematics 53(2), 217–243.
References III
Lahiri, P. (2003), ‘On the impact of bootstrap in survey sampling and small-area estimation’, Statistical Science 18, 199–210.
Lehtonen, R. & Veijanen, A. (2009), Design-based methods for domains and small areas, in D. Pfeffermann & C. R. Rao, eds, ‘Sample Surveys: Inference and Analysis’, Vol. 29B of Handbook of Statistics, North Holland, Amsterdam, pp. 219–249.
Pfeffermann, D. (2002), ‘Small area estimation - new developments and directions’, International Statistical Review 70(1), 125–143.
Pfeffermann, D. (2013), ‘New important developments in small area estimation’, Statistical Science 28(1), 40–68.
Raghunathan, T. E., Xie, D., Schenker, N., Parsons, V. L., Davis, W. W., Dodd, K. W. & Feuer, E. J. (2007), ‘Combining information from two surveys to estimate county-level prevalence rates of cancer risk factors and screening’, Journal of the American Statistical Association 102(478), 474–486.
References IV
Rao, J. N. K. (1999), ‘Some recent advances in model-based small area estimation’, Survey Methodology 25(2), 175–186.
Rao, J. N. K. (2003), Small Area Estimation, Wiley series in survey methodology, John Wiley and Sons, New York.
Williamson, P., Birkin, M. & Rees, P. (1998), ‘The estimation of population microdata by using data from small area statistics and sample of anonymised records’, Environment and Planning A 30, 785–816.