Small area estimation I
Monica Pratesi and Caterina Giusti
Department of Economics and Management, University of Pisa
1st EMOS Spring School
Trier, Pisa, Manchester, Luxembourg, 23-27 March 2015
M. Pratesi, C. Giusti Small area estimation I
Structure of the Presentation
1 Rising interest in measuring progress at a local level
2 An overview of SAE methods
3 Expansion and GREG estimators
4 Empirical Best Linear Unbiased Predictor
5 M-Quantile Approach
Part I
A short introduction to the small area estimation problem
Demand for statistics at local level
During the last decade there has been a rising demand for (objective and subjective) progress and well-being indicators
These measures play a central role for policy makers, to plan and to verify the effectiveness of their policies
Some examples of indicators:
  Poverty indicators: At-risk-of-poverty rate, Household equivalised income, etc.
  Labour market indicators: Unemployment rate, Job satisfaction, etc.
  Health indicators: Average lifespan, Share of population with dangerous behaviors (obesity, smoking, etc.)
To be informative and effective, these indicators should be chosen at the appropriate level of disaggregation
Indicators can be disaggregated along various dimensions, including geographic areas, demographic groups, income/consumption groups and social groups
Demand for statistics at local level
To gather the data needed to compute the indicators of interest for a given population there are two main solutions:
  Censuses
  Sample Surveys
Sample surveys have been recognized as a cost-effective means of obtaining information on wide-ranging topics of interest at frequent intervals over time
Indeed, available data to measure progress and well-being in Europe come mainly from sample surveys, such as the Survey on Income and Living Conditions (EU-SILC)
Demand for statistics at local level
Data coming from sample surveys can be used to compute accurate direct estimates only for large domains
Direct estimation uses only domain-specific data, so it:
  1 suffers from low precision when the domain sample size is small
  2 is not applicable with zero sample sizes
For example, EU-SILC data can be used to produce accurate direct estimates only at the NUTS* 2 level (that is, regional level in Italy)
* The geocode NUTS refers to the EU Nomenclature of Territorial Units for Statistics
Introduction to Small Area Estimation
To satisfy the increasing demand for statistical estimates on progress and well-being referring to smaller domains (LAU 1 and LAU 2 levels, that is provinces and municipalities in Italy) using sample survey data, there is the need to resort to small area methodologies
Small area methodologies try to fill the gap between official statistics and the local demand for data
What is a small area? What is small area estimation?
Small Area Estimation (SAE) is concerned with the development of statistical procedures for producing efficient (precise) estimates for small areas, that is for domains with small or zero sample sizes. Domains are defined by the cross-classification of geographical districts by social/economic/demographic characteristics. The target is the estimation of a parameter (average/percentile/proportion/rate) and the estimation of the corresponding prediction error
Introduction to Small Area Estimation: Target parameters
The estimates of interest in the small areas are usually means or totals. Examples:
  Mean individual gross income in a set of areas
  Total crop production in a set of areas
However, other estimates can also be of interest. Examples:
  The quantiles of the household (equivalised) income
  The proportion of households with income below the poverty line
Introduction to Small Area Estimation: Example
Target population: households who live in an Italian Region
Variable of interest: Income or other poverty measures
Survey sample: EUSILC (European Union Statistics on Income and Living Conditions), designed to obtain reliable estimates at Regional level in Italy
  planned design domains: Regions
  unplanned design domains: e.g. Provinces, Municipalities
EUSILC sample size in Tuscany: 1751 households
  Pisa province: 158 households → need SAE
  Grosseto province: 70 households → need SAE
Introduction to Small Area Estimation: Example
US sample sizes with an equal probability of selection method sample of 10,000 persons
State        1994 Population (thousands)   Sample size
California   31,431                        1207
Texas        18,378                        706
New York     18,169                        698
...          ...                           ...
DC           570                           22
Wyoming      476                           18
Suppose we measure customer satisfaction for a government service:
  California: 24.86% → leads to a confidence interval of 22.4%-27.3% (reliable)
  Wyoming: 33.33% → leads to a confidence interval of 10.9%-55.7% (unreliable → need SAE)
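The two confidence intervals can be reproduced with a normal approximation. The sketch below is only illustrative: the counts of satisfied respondents (300 and 6) are hypothetical values chosen to match the quoted percentages, and the use of n - 1 in the standard error is an assumption that happens to reproduce the quoted limits.

```python
import math

def proportion_ci(successes, n, z=1.96):
    """Normal-approximation 95% interval for a sample proportion.

    Assumption: the standard error uses n - 1 in the denominator.
    """
    p = successes / n
    se = math.sqrt(p * (1 - p) / (n - 1))
    return p, p - z * se, p + z * se

# Hypothetical counts: 300 of 1207 (California) and 6 of 18 (Wyoming)
for state, successes, n in [("California", 300, 1207), ("Wyoming", 6, 18)]:
    p, lo, hi = proportion_ci(successes, n)
    print(f"{state}: {100 * p:.2f}% (CI {100 * lo:.1f}%-{100 * hi:.1f}%)")
```

The interval width is driven almost entirely by the sample size: with 18 respondents the half-width is above 22 percentage points.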
SAE: data requirements
Survey data: available for the target variable y and for the auxiliary variable x, related to y
Census/Administrative data: available for x but not for y
SAE in 3 steps
1 Use survey data to estimate models that link y to x
2 Combine the estimated model parameters with x for out-of-sample units, to form predictions
3 Use these predictions to estimate the target parameters
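The three steps can be sketched on simulated data. Everything below is hypothetical: a single area, one auxiliary variable, and an ordinary least squares working model are assumptions made only to keep the illustration short (in practice the model is fitted on survey data from all areas).

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical area of N units: x is known for all units, y only for the sample
N, n = 500, 20
x_pop = rng.uniform(0, 10, size=N)
y_pop = 2.0 + 1.5 * x_pop + rng.normal(0, 1, size=N)  # unknown in practice

sampled = np.zeros(N, dtype=bool)
sampled[rng.choice(N, size=n, replace=False)] = True

# Step 1: use survey data to estimate a model that links y to x
X_s = np.column_stack([np.ones(n), x_pop[sampled]])
beta_hat, *_ = np.linalg.lstsq(X_s, y_pop[sampled], rcond=None)

# Step 2: combine the estimated parameters with x for out-of-sample units
X_r = np.column_stack([np.ones(N - n), x_pop[~sampled]])
y_pred = X_r @ beta_hat

# Step 3: use observed and predicted values to estimate the target mean
mean_hat = (y_pop[sampled].sum() + y_pred.sum()) / N
print(mean_hat, y_pop.mean())
```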
SAE: classifications
There are several classifications of SAE methods
Some of these classifications cross each other
Direct vs Indirect methods
Direct methods use only domain-specific data
Indirect methods borrow information from all the data
SAE: classifications
Design-based vs Model-based methods
Design-based (Model-assisted) methods
  Direct estimation
  Can allow for the use of models (model-assisted)
  Inference is under the randomization distribution
Model-based methods
  Borrow strength by using a model
  Estimation using frequentist or Bayesian approaches
  Inference under the model is conditional on the selected sample
SAE: classifications
Area-level vs Unit-level methods
Area-level models relate small area direct estimators to area-specific covariates. Such models are necessary if unit (or element) level data are not available.
Unit-level models relate the unit values of a study variable to unit-specific covariates.
How a SAE Unit Level Model Works
A classification of the Small Area Estimation Methods
Focus of the lesson
We will focus on model-assisted and model-based methods
For the model-based estimators we will consider two alternative approaches to SAE:
  small area estimation using multilevel (mixed/random effects) models → unit and area-level approaches
  small area estimation based on M-quantile models → unit-level approach
We will present some examples based on real data
We will present the R code used to obtain the estimates in the examples
Part II
Classic and new approaches to model-based Small Area Estimation
Definitions
Design-based estimation: the main focus is on design unbiasedness. Estimators are unbiased with respect to the randomization that generates the survey data
Finite population $\Omega = \{1, \ldots, N\}$
$y$: variable of interest, with $y_i$ the value for the $i$-th unit of the finite population
Statistics of interest: e.g. the total, $Y = \sum_{\Omega} y_i$, or the mean, $\bar{Y} = Y/N$
Sample $s = \{1, \ldots, n\}$
$p(s)$: probability of selecting the sample $s$ from population $\Omega$. $p(s)$ depends on known design variables such as stratum indicators and size measures of clusters
Definitions
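Bias of an estimator $\hat{\theta}$ is defined as $E[\hat{\theta} - \theta]$
Variance of an estimator $\hat{\theta}$ is defined as $E[(\hat{\theta} - E[\hat{\theta}])^2]$
Mean Squared Error of an estimator $\hat{\theta}$ is: $E[(\hat{\theta} - \theta)^2] = V[\hat{\theta}] + B[\hat{\theta}]^2$
Design bias: $Bias(\hat{Y}) = E_p[\hat{Y}] - Y$
Design variance: $V(\hat{Y}) = E_p[(\hat{Y} - E_p[\hat{Y}])^2]$
Design-based properties
1 Design-unbiasedness: $E_p[\hat{Y}] = \sum_{s} p(s)\hat{Y}_s = Y$
2 Design-consistency: $\hat{Y} \to Y$ in probability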
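These design-based quantities can be checked by brute force on a tiny population, since the expectation $E_p$ is just a sum over all possible samples. The sketch below assumes simple random sampling without replacement and a hypothetical population of five values; it uses the expansion estimator of the total, $(N/n)\sum_{i \in s} y_i$.

```python
from itertools import combinations
from math import comb

y = [3.0, 7.0, 8.0, 12.0, 20.0]   # hypothetical finite population
N, n = len(y), 2
Y = sum(y)                        # true total

samples = list(combinations(range(N), n))
p_s = 1 / comb(N, n)              # SRS without replacement: every sample equally likely

# Expansion estimator of the total for each possible sample
Y_hat = [N / n * sum(y[i] for i in s) for s in samples]

expectation = sum(p_s * est for est in Y_hat)
design_bias = expectation - Y
design_var = sum(p_s * (est - expectation) ** 2 for est in Y_hat)
design_mse = sum(p_s * (est - Y) ** 2 for est in Y_hat)

print(design_bias, design_var, design_mse)
```

The bias is zero up to rounding, so the design MSE coincides with the design variance.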
Estimation of Means: Expansion Estimator
Data $\{y_i\}$, $i \in s$
Expansion estimator for the mean:
$\hat{\bar{Y}} = \frac{\sum_{i \in s} w_i y_i}{\sum_{i \in s} w_i}$
$w_i = \pi_i^{-1}$, the basic design weight
$\pi_i$ is the probability of selecting unit $i$ in sample $s$
Remark: the weights $w_i$ do not depend on $y_i$
Estimation of Means: GREG Estimator
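Data $\{y_i, \mathbf{x}_i\}$, $i \in s$
$\mathbf{X} = (X_1, \ldots, X_q)'$, population totals for $q$ auxiliary variables
Generalized REGression Estimator:
$\hat{Y}_{GR} = \hat{Y} + (\mathbf{X} - \hat{\mathbf{X}})'\hat{\boldsymbol{\beta}}$
$\hat{\boldsymbol{\beta}} = \left(\sum_{i \in s} w_i \mathbf{x}_i \mathbf{x}_i' / c_i\right)^{-1} \left(\sum_{i \in s} w_i \mathbf{x}_i y_i / c_i\right)$
$c_i$ is a specified positive constant
Note: $\hat{Y}_{GR} = \sum_{i \in s} w_i^* y_i$, where $w_i^* = w_i^*(s) = w_i(s) g_i(s)$
with $g_i(s) = 1 + (\mathbf{X} - \hat{\mathbf{X}})' \left(\sum_{i \in s} w_i \mathbf{x}_i \mathbf{x}_i' / c_i\right)^{-1} \mathbf{x}_i / c_i$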
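A small numerical sketch may help to see that the regression form and the weighted form of the GREG estimator coincide. The population, the simple random sample, and the choice $c_i = 1$ are all assumptions made for illustration; $\hat{Y}$ and $\hat{\mathbf{X}}$ are taken to be the expansion estimators $\sum_{i \in s} w_i y_i$ and $\sum_{i \in s} w_i \mathbf{x}_i$.

```python
import numpy as np

rng = np.random.default_rng(1)

# Hypothetical population: intercept plus one auxiliary variable
N, n = 1000, 50
x_pop = np.column_stack([np.ones(N), rng.uniform(1, 5, size=N)])
y_pop = 4.0 + 3.0 * x_pop[:, 1] + rng.normal(0, 1, size=N)
X_tot = x_pop.sum(axis=0)                  # known population totals X

s = rng.choice(N, size=n, replace=False)   # simple random sample
x, y = x_pop[s], y_pop[s]
w = np.full(n, N / n)                      # design weights w_i = 1 / pi_i
c = np.ones(n)                             # constants c_i (assumed equal to 1)

A = (x * (w / c)[:, None]).T @ x           # sum_i w_i x_i x_i' / c_i
beta_hat = np.linalg.solve(A, x.T @ (w * y / c))

Y_hat = w @ y                              # expansion estimator of the total of y
X_hat = w @ x                              # expansion estimator of X
Y_greg = Y_hat + (X_tot - X_hat) @ beta_hat

# Weighted form: w*_i = w_i * g_i
g = 1 + (x / c[:, None]) @ np.linalg.solve(A, X_tot - X_hat)
w_star = w * g
print(Y_greg, w_star @ y)
```

A side effect of the g-weights is calibration: applying $w_i^*$ to the auxiliary variables returns the known totals $\mathbf{X}$ exactly.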
Domain Estimation
Partition the population $\Omega$ into $m$ domains:
$\Omega = \cup_{i=1}^{m} \Omega_i$
$\Omega_i = \{1, \ldots, N_i\}$, population of domain $i$
$s_i = \{1, \ldots, n_i\}$, sample of domain $i$
Statistics of interest for the variable $y$:
$Y_i = \sum_{j \in \Omega_i} y_j$, the total of domain $i$
$\bar{Y}_i = Y_i / N_i$, the mean of domain $i$
Domain Estimation: Expansion Estimator
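Data $\{y_{ij}\}$, $j \in s_i$, $i = 1, \ldots, m$
Expansion estimator of the mean for domain $i$:
$\hat{\bar{Y}}_i = \frac{\sum_{j \in s_i} w_{ij} y_{ij}}{\sum_{j \in s_i} w_{ij}}$
where $y_{ij}$ is the observed value and $w_{ij}$ is the weight for unit $j$ in area $i$
The case of simple random sampling:
$\pi_{ij} = \frac{\binom{1}{1}\binom{N_i - 1}{n_i - 1}}{\binom{N_i}{n_i}} = \frac{n_i}{N_i} \;\rightarrow\; w_{ij} = \pi_{ij}^{-1} = \frac{N_i}{n_i}$
$\hat{\bar{Y}}_i = \frac{\sum_{j=1}^{n_i} w_{ij} y_{ij}}{\sum_{j=1}^{n_i} w_{ij}} = \frac{\sum_{j=1}^{n_i} \frac{N_i}{n_i} y_{ij}}{\sum_{j=1}^{n_i} \frac{N_i}{n_i}} = \frac{\frac{N_i}{n_i} \sum_{j=1}^{n_i} y_{ij}}{n_i \frac{N_i}{n_i}} = \frac{1}{n_i} \sum_{j=1}^{n_i} y_{ij}$ (that is, the sample mean)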
Domain Estimation: Expansion Estimator
$\hat{\bar{Y}}_i$ is design unbiased
$V(\hat{\bar{Y}}_i) = \left(1 - \frac{n_i}{N_i}\right) \frac{S_i^2}{n_i}$, where $S_i^2 = \frac{\sum_{j \in s_i} (y_{ij} - \bar{y}_i)^2}{n_i - 1}$ is the sample variance
The magnitude of the variance depends on 3 factors: $n_i/N_i$, $S_i^2$ and $n_i$
If $n_i$ is small the design variance is likely to be large
In such a situation, estimation of the variance is even more problematic
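The effect of the domain sample size on the variance can be seen in a short sketch. The income values and the domain size $N_i = 400$ are hypothetical; the function simply applies the two formulas above under simple random sampling.

```python
def direct_mean_srs(y_sample, N_i):
    """Expansion estimate of a domain mean under SRS and its variance estimate."""
    n_i = len(y_sample)
    mean = sum(y_sample) / n_i
    s2 = sum((v - mean) ** 2 for v in y_sample) / (n_i - 1)   # sample variance
    var = (1 - n_i / N_i) * s2 / n_i
    return mean, var

incomes = [12.0, 15.0, 9.0, 20.0, 14.0, 11.0, 17.0, 13.0]     # hypothetical sample
mean_8, var_8 = direct_mean_srs(incomes, N_i=400)
mean_4, var_4 = direct_mean_srs(incomes[:4], N_i=400)          # half the sample
print(mean_8, var_8)
print(mean_4, var_4)
```

With only four observations the variance estimate is several times larger, and it is itself based on just three degrees of freedom.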
Domain Estimation: GREG Estimator
Data $\{y_{ij}, \mathbf{x}_{ij}\}$, $j \in s_i$
$\mathbf{X}_i = (X_{i1}, \ldots, X_{iq})'$, population totals for $q$ auxiliary variables in domain $i$
Generalized REGression Estimator:
$\hat{Y}_{i,GR} = \hat{Y}_i + (\mathbf{X}_i - \hat{\mathbf{X}}_i)'\hat{\boldsymbol{\beta}}_i$
$\hat{\boldsymbol{\beta}}_i = \left(\sum_{j \in s_i} w_{ij} \mathbf{x}_{ij} \mathbf{x}_{ij}' / c_{ij}\right)^{-1} \left(\sum_{j \in s_i} w_{ij} \mathbf{x}_{ij} y_{ij} / c_{ij}\right)$
$c_{ij}$ is a specified positive constant
Remark: The GREG estimator attempts to improve on the expansion estimator by borrowing strength from relevant sources (e.g. covariates) through appropriate adjustment to the basic weights.
Remark: The GREG estimator remains approximately design-unbiased, but (should) improve on the design variance (and thereby the MSE)
GREG Estimator, Example
We present here an example of how to apply the Generalized Regression estimator to obtain small area estimates of the area means
The target parameter is the mean forest biomass per hectare (ha) within the 14 municipalities (small areas) of the forest in Vestfold County (Norway)
The data are from the Norwegian National Forest Inventory, which provides estimates of forest parameters on national and regional scales by means of a systematic network of permanent sample plots
The data set is public and detailed information on the data is available in Breidenbach and Astrup (2012)
GREG Estimator, Example
Data on forest biomass per hectare (Biomass/ha) are available for 145 sample plots
Auxiliary data on mean canopy height are also available from digital aerial images
The following table shows the first six lines (out of 145) of the sample plots of the Norwegian National Forest Inventory
Figure: Norwegian National Forest Inventory sample data
GREG Estimator, Example
The relationship between the target and the auxiliary data available from the sample is represented in the following Figure
Figure: Scatterplot of the Biomass/ha vs Mean canopy height sample data
GREG Estimator, Example
Data on the mean canopy height are also available for all the population elements; the population here is the forest covered by aerial images (for which the mean canopy height values are available)
Thus, the $N = \sum_{i=1}^{14} N_i = 5402465$ population elements are the 16 × 16 tiles within the forest for which auxiliary variables from the canopy height mean and image data were calculated
Figure: Population data of mean canopy height from digital aerial images
GREG Estimator, Example
Using the data reported in the two previous Tables we can obtain small area GREG estimates of the mean forest biomass/ha in the 14 municipalities
Figure: Point and MSE GREG estimates of the mean forest biomass per hectare for the 14 municipalities
GREG: advantages, disadvantages, extensions
EBLUP: Unit Level Approach
Unit level approach to small area estimation
$\mathbf{y}$ is the vector of the $y$ variable for the population $\Omega$
$\mathbf{y} = [\mathbf{y}_s', \mathbf{y}_r']'$, where $\mathbf{y}_s$ is the vector of the observed (sampled) units and $\mathbf{y}_r$ is the vector of the non-observed units ($N - n$ units, $r = 1, \ldots, N - n$)
$\mathbf{X}$ is the covariate matrix and is considered known for all the population units
Subscript $i$ refers to small areas (e.g. $\mathbf{y}_{si}$ is the vector of observed values in area $i$)
Model for the $y$ variable (known as the superpopulation model)
$\mathbf{y} = \mathbf{X}\boldsymbol{\beta} + \mathbf{Z}\mathbf{u} + \mathbf{e}$
which can alternatively be written as
$y_{ij} = \mathbf{x}_{ij}'\boldsymbol{\beta} + u_i + e_{ij}$
EBLUP: Unit Level Approach
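Given that
$\mathbf{u} \sim N(\mathbf{0}, \mathbf{G})$, $\mathbf{e} \sim N(\mathbf{0}, \mathbf{R})$ and $\mathbf{u} \perp \mathbf{e}$
$\mathbf{R} = \sigma_e^2 \boldsymbol{\Sigma}_e$, $\mathbf{G} = \sigma_u^2 \boldsymbol{\Sigma}_u$
$\mathbf{X}$ is a full rank matrix (say rank equal to $q$) and the rank of $(\mathbf{X} : \mathbf{Z}_i) > q$
$n \geq q + m + 1$
it can be shown that
$V(\mathbf{y}) = \mathbf{Z}\mathbf{G}\mathbf{Z}' + \mathbf{R} = \mathbf{V}$
$\hat{\boldsymbol{\beta}} = (\mathbf{X}'\mathbf{V}^{-1}\mathbf{X})^{-1}\mathbf{X}'\mathbf{V}^{-1}\mathbf{y}_s$ is the BLUE for $\boldsymbol{\beta}$
$\hat{\mathbf{u}} = \mathbf{G}\mathbf{Z}'\mathbf{V}^{-1}(\mathbf{y} - \mathbf{X}\hat{\boldsymbol{\beta}})$ is the BLUP for $\mathbf{u}$
$\mathbf{y} \sim N(\mathbf{X}\boldsymbol{\beta}, \mathbf{V})$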
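The BLUE and BLUP formulas can be evaluated directly on a small simulated sample. The sketch below assumes a random-intercept model with $\boldsymbol{\Sigma}_u$ and $\boldsymbol{\Sigma}_e$ equal to identity matrices and treats the variance components as known; the data and all numeric values are hypothetical. For this special case the BLUP of $u_i$ reduces to a shrunken mean residual, $\gamma_i(\bar{y}_i - \bar{\mathbf{x}}_i'\hat{\boldsymbol{\beta}})$ with $\gamma_i = \sigma_u^2 / (\sigma_u^2 + \sigma_e^2/n_i)$, which the code verifies.

```python
import numpy as np

rng = np.random.default_rng(2)

# Hypothetical sample from m areas; variance components assumed known
m, sigma2_u, sigma2_e = 5, 4.0, 1.0
n_i = np.array([3, 5, 2, 6, 4])
area = np.repeat(np.arange(m), n_i)
n = n_i.sum()

X = np.column_stack([np.ones(n), rng.uniform(0, 10, size=n)])
u = rng.normal(0, np.sqrt(sigma2_u), size=m)
y = X @ np.array([1.0, 2.0]) + u[area] + rng.normal(0, np.sqrt(sigma2_e), size=n)

Z = (area[:, None] == np.arange(m)).astype(float)   # area indicator matrix
G = sigma2_u * np.eye(m)
R = sigma2_e * np.eye(n)
V = Z @ G @ Z.T + R

V_inv = np.linalg.inv(V)
beta_hat = np.linalg.solve(X.T @ V_inv @ X, X.T @ V_inv @ y)   # BLUE of beta
u_hat = G @ Z.T @ V_inv @ (y - X @ beta_hat)                   # BLUP of u

# Equivalent shrinkage form for the random-intercept model
gamma = sigma2_u / (sigma2_u + sigma2_e / n_i)
resid_mean = (Z.T @ (y - X @ beta_hat)) / n_i
print(np.round(u_hat, 3))
print(np.round(gamma * resid_mean, 3))
```

Areas with fewer observations have a smaller $\gamma_i$, so their estimated effects are pulled more strongly towards zero.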
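EBLUP: Unit Level Approach
The estimator of the mean for the $i$-th area is obtained as a linear combination of observed and unobserved units, as follows
$\hat{\bar{Y}}_i = \frac{1}{N_i}\left[\sum_{j \in s_i} y_{ij} + \sum_{k \in r_i} (\mathbf{x}_{ik}'\hat{\boldsymbol{\beta}} + \hat{u}_i)\right]$
The next step is to derive an MSE estimator
$MSE(\hat{\theta}_i) \approx g_{1i}(\boldsymbol{\sigma}) + g_{2i}(\boldsymbol{\sigma}) + g_{3i}(\boldsymbol{\sigma}) + g_{4i}(\boldsymbol{\sigma})$
$g_{1i}(\boldsymbol{\sigma}) = \boldsymbol{\alpha}_r'\mathbf{Z}_r\mathbf{T}_s\mathbf{Z}_r'\boldsymbol{\alpha}_r$
$g_{2i}(\boldsymbol{\sigma}) = [\boldsymbol{\alpha}_r'\mathbf{X}_r - \boldsymbol{\alpha}_r'\mathbf{Z}_r\mathbf{T}_s\mathbf{Z}_s'\mathbf{R}_s^{-1}\mathbf{X}_s](\mathbf{X}_s'\mathbf{V}_s^{-1}\mathbf{X}_s)^{-1}[\mathbf{X}_r'\boldsymbol{\alpha}_r - \mathbf{X}_s'\mathbf{R}_s^{-1}\mathbf{Z}_s\mathbf{T}_s\mathbf{Z}_r'\boldsymbol{\alpha}_r]$
$g_{3i}(\boldsymbol{\sigma}) = \mathrm{tr}\left\{(\nabla(\boldsymbol{\alpha}_r'\mathbf{Z}_r\boldsymbol{\Sigma}_u\mathbf{Z}_s'\mathbf{V}_s^{-1})')\,\mathbf{V}_s\,(\nabla(\boldsymbol{\alpha}_r'\mathbf{Z}_r\boldsymbol{\Sigma}_u\mathbf{Z}_s'\mathbf{V}_s^{-1})')'\,E[(\hat{\boldsymbol{\sigma}} - \boldsymbol{\sigma})(\hat{\boldsymbol{\sigma}} - \boldsymbol{\sigma})']\right\}$
$g_{4i}(\boldsymbol{\sigma}) = \boldsymbol{\alpha}_r'\mathbf{R}_r\boldsymbol{\alpha}_r$
$\mathbf{T}_s = \boldsymbol{\Sigma}_u - \boldsymbol{\Sigma}_u\mathbf{Z}_s'(\boldsymbol{\Sigma}_{es} + \mathbf{Z}_s\boldsymbol{\Sigma}_u\mathbf{Z}_s')^{-1}\mathbf{Z}_s\boldsymbol{\Sigma}_u$
$\boldsymbol{\sigma} = (\sigma_e^2, \sigma_u^2/\sigma_e^2)'$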
EBLUP: Unit Level Approach
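Finally, the estimator for the MSE of $\hat{\theta}_i$ is
$mse(\hat{\theta}_i) = g_{1i}(\hat{\boldsymbol{\sigma}}) + g_{2i}(\hat{\boldsymbol{\sigma}}) + 2g_{3i}(\hat{\boldsymbol{\sigma}}) + g_{4i}(\hat{\boldsymbol{\sigma}})$
$\hat{\boldsymbol{\sigma}}$ is an unbiased estimator for $\boldsymbol{\sigma}$
Remark: it is possible to obtain an estimate of the MSE using alternative techniques, such as bootstrap and jackknife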
Example: Estimates of Mean Income in Tuscany Provinces
Data on the equivalised income in 2005 for 1525 households in the 10 Tuscany Provinces are available from the EUSILC survey 2006
A set of explanatory variables is available for each unit in the population from the Population Census 2001
We employ the EBLUP unit level model to estimate the mean of the household equivalised income
The Municipality of Florence, with 125 units out of 457 in the Province, is considered as a stand-alone area
Example: Target variable
Relative poverty measures are related to income or consumption
Our estimates are based on the household equivalised disposable income (target variable)
Averages are computed on the household equivalised disposable income
The disposable household equivalised income is computed as
([Disposable household income] · [Within-household non-response inflation factor]) / [Equivalised household size]
Example: Target variable
Disposable household income:
The sum for all household members of gross personal income components plus gross income components at household level minus employer’s social insurance contributions, interest paid on mortgage, regular taxes on wealth, regular inter-household cash transfers paid, tax on income and social insurance contributions
Within-household non-response inflation factor:
Factor by which it is necessary to multiply the total gross income, the total disposable income or the total disposable income before social transfers to compensate for non-response in individual questionnaires
Example: Target variable
Equivalised household size:
Let HM14+ be the number of household members aged 14 and over (at the end of the income reference period)
Let HM13− be the number of household members aged 13 or less (at the end of the income reference period)
Equivalised household size = 1 + 0.5 · (HM14+ − 1) + 0.3 · HM13−
Remark: in this way we take into account the economies of scale present in a household
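As a worked example of the two formulas, consider a hypothetical household with two adults and two children under 14, a disposable income of 42000 and an inflation factor of 1. The function names below are illustrative only.

```python
def equivalised_size(hm14_plus, hm13_minus):
    """Equivalised household size: 1 + 0.5 * (HM14+ - 1) + 0.3 * HM13-."""
    return 1 + 0.5 * (hm14_plus - 1) + 0.3 * hm13_minus

def equivalised_income(disposable_income, inflation_factor, hm14_plus, hm13_minus):
    """Disposable household income, inflated for non-response, per equivalent adult."""
    return disposable_income * inflation_factor / equivalised_size(hm14_plus, hm13_minus)

# Hypothetical household: 2 members aged 14+, 2 members aged 13 or less
size = equivalised_size(2, 2)
income = equivalised_income(42000.0, 1.0, 2, 2)
print(size, income)
```

The four-person household counts as 2.1 equivalent adults rather than 4, which is how the economies of scale enter the income measure.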
Example: auxiliary data
The selection of covariates to fit the small area model relies on prior studies on poverty assessment
The following covariates have been selected:
  household size
  ownership of dwelling (owner/tenant)
  age of the head of the household
  years of education of the head of the household
  working position of the head of the household (employed/unemployed in the previous week)
Design-based estimates of the mean income have also been computed in order to show the gain in efficiency of the EBLUP
Example: results
Table: Mean Income Estimates. Tuscany Provinces
Provinces     EBLUP   RMSE(EBLUP)   Design-Based   RMSE(DB)
Arezzo        17328   816.6         18455          1088.9
Florence M.   18139   806.4         21927          1188.8
Florence      16327   628.2         16347          490.1
Grosseto      16593   937.8         17811          1574.5
Livorno       17111   841.5         20257          3258.4
Lucca         15805   868.7         15780          783.5
Massa         15644   868.0         14814          909.0
Pisa          15950   826.5         16741          755.0
Pistoia       16467   850.1         16852          950.4
Prato         16964   824.9         16715          911.5
Siena         16660   852.0         16926          779.4
Example: results
Figure: Error bars for mean income estimates (x-axis: Provinces; y-axis: Equivalized Income). Black bars represent Design-Based estimates, red bars represent EBLUP estimates
Example: results
Figure: Design-Based Estimates and EBLUP Estimates (legend classes: 14813.00, 16019.47, 17323.03, 18732.66, 20258.00)
EBLUP at the unit level: advantages, disadvantages, extensions
EBLUP: Area Level Approach, the Fay-Herriot Model
The area level model includes random area-specific effects and area-specific covariates $\mathbf{x}_i$
$\theta_i = \mathbf{x}_i'\boldsymbol{\beta} + z_i u_i$, $i = 1, \ldots, m$
$\theta_i$ is the parameter of interest (e.g. totals, $Y_i$, or means, $\bar{Y}_i$)
$z_i$ are known positive constants
$u_i$ are independent and identically distributed random variables with mean 0 and variance $\sigma_u^2$ ($u_i \sim N(0, \sigma_u^2)$)
$\boldsymbol{\beta}$ is the vector of regression parameters
EBLUP: Area Level Approach, the Fay-Herriot Model
Assumption: $\hat{\theta}_i = \theta_i + e_i$
$\hat{\theta}_i$ is a direct design-unbiased estimator
$e_i$ are independent sampling errors with mean 0 and known variance $\sigma_e^2$
Fay-Herriot Model
$\hat{\theta}_i = \mathbf{x}_i'\boldsymbol{\beta} + z_i u_i + e_i$, $i = 1, \ldots, m$
Note: this is a special case of the general linear mixed model with diagonal covariance structure
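EBLUP: Area Level Approach, the Fay-Herriot Model
Under the above assumptions
$\hat{\theta}_i \sim N(\mathbf{x}_i'\boldsymbol{\beta}, z_i^2\sigma_u^2 + \sigma_e^2)$
In matrix notation
$\hat{\boldsymbol{\theta}} = \mathbf{X}\boldsymbol{\beta} + \mathbf{Z}\mathbf{u} + \mathbf{e}$
$\mathbf{u} \sim N(\mathbf{0}, \mathbf{G})$ and $\mathbf{e} \sim N(\mathbf{0}, \mathbf{R})$
$\hat{\boldsymbol{\theta}} \sim N(\mathbf{X}\boldsymbol{\beta}, \mathbf{V})$, with $\mathbf{V} = \mathbf{Z}\mathbf{G}\mathbf{Z}' + \mathbf{R}$
Given the estimates of $\boldsymbol{\beta}$ and $\mathbf{u}$ we obtain the Best Linear Unbiased Predictor (BLUP) for $\boldsymbol{\theta}$
$\hat{\boldsymbol{\beta}} = (\mathbf{X}'\mathbf{V}^{-1}\mathbf{X})^{-1}\mathbf{X}'\mathbf{V}^{-1}\mathbf{y}$, where $\mathbf{y}$ is the vector of the observations (here the direct estimates $\hat{\boldsymbol{\theta}}$)
$\hat{\mathbf{u}} = \mathbf{G}\mathbf{Z}'\mathbf{V}^{-1}(\mathbf{y} - \mathbf{X}\hat{\boldsymbol{\beta}})$
Note: estimates for $\boldsymbol{\beta}$ and $\mathbf{u}$ can be obtained by penalized maximum likelihood ($\mathbf{u}$ considered as fixed)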
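EBLUP: Area Level Approach, the Fay-Herriot Model
In the “real world” $\mathbf{G}$ (and $\mathbf{R}$) are unknown and must be estimated
Using the restricted likelihood, optimized with a scoring algorithm, we obtain an estimate of $\sigma_u^2$ (and hence of $\mathbf{G}$)
In the Fay-Herriot model $\sigma_e^2$ is considered as known (we use the sampling variance)
Plugging the estimated area-specific variance component $\hat{\sigma}_u^2$ into the estimators for $\boldsymbol{\beta}$ and $\mathbf{u}$ we obtain their estimates
$\hat{\boldsymbol{\beta}} = (\mathbf{X}'\hat{\mathbf{V}}^{-1}\mathbf{X})^{-1}\mathbf{X}'\hat{\mathbf{V}}^{-1}\mathbf{y}$
$\hat{\mathbf{u}} = \hat{\mathbf{G}}\mathbf{Z}'\hat{\mathbf{V}}^{-1}(\mathbf{y} - \mathbf{X}\hat{\boldsymbol{\beta}})$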
EBLUP: Area Level Approach, the Fay-Herriot Model
Finally, using the obtained estimates in the Fay-Herriot model we have the Empirical BLUP (EBLUP) for the parameter of interest $\theta_i$
$\hat{\theta}_i^{EBLUP} = \mathbf{x}_i'\hat{\boldsymbol{\beta}} + z_i\hat{u}_i = \phi_i\hat{\theta}_i + (1 - \phi_i)\mathbf{x}_i'\hat{\boldsymbol{\beta}}$
$\phi_i = \frac{z_i^2\hat{\sigma}_u^2}{z_i^2\hat{\sigma}_u^2 + \sigma_e^2}$ is the shrinkage factor
$\hat{\theta}_i$ is the direct design estimator for $\theta_i$
Note: with this procedure the estimate of $\sigma_u^2$ can turn out negative; in this case it must be truncated to 0
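The whole area-level procedure fits in a short sketch. The code below is an illustration under stated assumptions, not the implementation used for the examples: it takes $z_i = 1$, allows area-specific known sampling variances (called `psi` here), estimates $\sigma_u^2$ by REML with Fisher scoring and truncation at 0, and then forms the EBLUP as a shrinkage between the direct and the regression-synthetic estimates. The data are simulated and the function name is hypothetical.

```python
import numpy as np

def fay_herriot_eblup(y, X, psi, max_iter=100, tol=1e-8):
    """EBLUP under the Fay-Herriot model with z_i = 1 and known variances psi."""
    sigma2_u = float(np.median(psi))               # starting value (assumption)
    for _ in range(max_iter):
        v_inv = 1.0 / (sigma2_u + psi)
        XtVi = X.T * v_inv
        Q = np.linalg.inv(XtVi @ X)
        P = np.diag(v_inv) - XtVi.T @ Q @ XtVi     # REML projection matrix
        Py = P @ y
        score = -0.5 * np.trace(P) + 0.5 * Py @ Py
        info = 0.5 * np.trace(P @ P)
        new = max(sigma2_u + score / info, 0.0)    # truncate at 0
        converged = abs(new - sigma2_u) < tol
        sigma2_u = new
        if converged:
            break
    v_inv = 1.0 / (sigma2_u + psi)
    beta_hat = np.linalg.solve((X.T * v_inv) @ X, (X.T * v_inv) @ y)
    phi = sigma2_u / (sigma2_u + psi)              # shrinkage factors
    eblup = phi * y + (1 - phi) * (X @ beta_hat)
    return eblup, beta_hat, sigma2_u, phi

# Simulated direct estimates for m hypothetical areas
rng = np.random.default_rng(3)
m = 30
X = np.column_stack([np.ones(m), rng.uniform(0, 5, size=m)])
psi = rng.uniform(0.5, 2.0, size=m)
theta = X @ np.array([1.0, 2.0]) + rng.normal(0, 1.5, size=m)
y = theta + rng.normal(0, np.sqrt(psi))

eblup, beta_hat, sigma2_u, phi = fay_herriot_eblup(y, X, psi)
print(round(sigma2_u, 3), np.round(phi[:5], 3))
```

Areas with a large sampling variance get a small $\phi_i$ and are therefore pulled towards the synthetic estimate $\mathbf{x}_i'\hat{\boldsymbol{\beta}}$; if $\hat{\sigma}_u^2$ is truncated to 0 the EBLUP coincides with the synthetic estimate everywhere.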
EBLUP: Area Level Approach, the Fay-Herriot Model
The MSE of the Fay-Herriot small area estimator is
$MSE(\hat{\theta}_i) = g_{1i}(\sigma_u^2) + g_{2i}(\sigma_u^2) + g_{3i}(\sigma_u^2)$
$g_{1i}(\sigma_u^2)$ is due to random errors (order $O(1)$)
$g_{2i}(\sigma_u^2)$ is due to the estimation of $\boldsymbol{\beta}$ (order $O(m^{-1})$, given some conditions)
$g_{3i}(\sigma_u^2)$ is due to the estimation of $\sigma_u^2$
An approximately correct estimate of the MSE is
$mse(\hat{\theta}_i) = g_{1i}(\hat{\sigma}_u^2) + g_{2i}(\hat{\sigma}_u^2) + 2g_{3i}(\hat{\sigma}_u^2)$
EBLUP at the area level: example
In this example we use the Fay-Herriot model to estimate the mean agrarian surface area used for production of grape ($\theta_i$) in each municipality of the Tuscany region
Population data come from the Italian Agricultural Census of year 2000 for the region of Tuscany
The census collects information about farm land by type of cultivation, amount of breeding, kind of production, structure and amount of farm employment
We considered as small areas the 274 municipalities of Tuscany, with population sizes $N_i$, $i = 1, \ldots, m$ given by the census
We use the census data on the agrarian surface area used for production in hectares ($x_{1i}$) and on the average number of working days in the reference year ($x_{2i}$) as covariates in the model
EBLUP at the area level: example
Sample data are collected from a simple random sample of size $n_i$ from each area, with sampling fractions $n_i/N_i$ approximately constant and equal to 0.05
These data are used to compute, for each municipality $i$, the direct estimator of the mean agrarian surface area used for production of grape in hectares ($y_i$) and its sampling variance ($\psi_i$)
Figure: Data on grape production
EBLUP at the area level: example
The results obtained with the Fay-Herriot model for the first 10 Municipalities are shown in the following table
Figure: Estimates on grape production
EBLUP at the area level: example
As we can see from the second table, the MSEs of the FH-EBLUP estimates are lower than the variances of the direct estimates in the first table
Readers interested in the results for all the 247 municipalities can refer to the deliverables of the SAMPLE project (www.sample-project.eu)
EBLUP at the area level: advantages, disadvantages, extensions
Conclusions on EBLUPs
The EBLUPs are model-based estimators
EBLUPs can also be used when we know only the average of the auxiliary variables
In many applications the EBLUPs perform better than the design-based estimators in terms of relative root MSE (smaller confidence intervals)
EBLUPs are currently used as a standard technique to derive small area statistics
Conclusions on EBLUPs
Drawbacks
Assumption of normality is needed for area effects and individual effects (but sensitivity analysis shows that the model is robust against non-normality if symmetry of the distributions holds)
It is not design-unbiased, in the sense that under a complex survey design the estimates could be biased
Parameters of interest in out-of-sample areas (areas with 0 observations) cannot be estimated (EBLUP needs a minimum of two observations per area)
Extensions to the model are not easily implementable (complex derivation of the MSE estimator)
Conclusions on EBLUPs
Improvements for the EBLUP
  Spatial process (CAR and SAR models)
  Time process
  Spatio-temporal process
  Robust estimation
  Binary and count data models
Alternative approach
Quantile/M-Quantile approach
Quantiles
Given $p$, with $p \in (0, 1)$, the quantile $q_p$ of a random variable $X$ is defined as:
$\int_{-\infty}^{q_p} dF(x) = p$
Quantiles are points taken at regular intervals from the cumulative distribution function (CDF) of a random variable.
The $k$th $q$-quantile for a random variable is the value $x$ such that the probability that the random variable will be less than $x$ is at most $k/q$ and the probability that the random variable will be more than $x$ is at most $(q - k)/q = 1 - (k/q)$.
Example: the 4-quantiles are called quartiles and they are three values that divide the distribution of the values of $X$ into four parts. The first quartile is associated with $p = 0.25$, the second with $p = 0.5$ and the third with $p = 0.75$. The second quartile is the median.
M-quantiles
Given $p$, with $p \in (0, 1)$, the M-Quantile $\theta_p$ of a random variable $X$ is defined as:
$\int \psi_p(x - \theta_p)\,dF(x) = 0$
where
$\psi_p(u) = \begin{cases} (1 - p)\psi(u) & u < 0 \\ p\,\psi(u) & u \geq 0 \end{cases}$
and $\psi(u)$ is an appropriately chosen influence function
The M-Quantile is a generalization of the quantile concept (it includes the quantile as a particular case)
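A sample version of this definition replaces the integral with a sum over the observations and can be solved by iteratively reweighted averaging. The sketch below assumes a Huber influence function $\psi(u) = \max(-c, \min(c, u))$ with the tuning constant $c$ expressed in the units of the data (no scale estimate is used); the function name and the data are hypothetical.

```python
import numpy as np

def m_quantile(x, p, c=1.345, max_iter=500, tol=1e-10):
    """Sample M-quantile of order p: solves sum_i psi_p(x_i - theta) = 0."""
    x = np.asarray(x, dtype=float)
    theta = np.median(x)                                   # starting value
    for _ in range(max_iter):
        r = x - theta
        # Huber weight psi(r) / r, equal to 1 for small residuals
        huber_w = np.minimum(1.0, c / np.maximum(np.abs(r), 1e-12))
        w = huber_w * np.where(r >= 0, p, 1 - p)           # asymmetric weights
        theta_new = np.sum(w * x) / np.sum(w)
        if abs(theta_new - theta) < tol:
            return theta_new
        theta = theta_new
    return theta

data = [1.0, 2.0, 3.0, 4.0, 10.0]
# With a very large c, psi(u) = u and the M-quantile of order 0.5 is the mean
print(m_quantile(data, 0.5, c=1e9))
print(m_quantile(data, 0.25), m_quantile(data, 0.5), m_quantile(data, 0.75))
```

With a very large $c$ the solutions are expectiles; as $c$ shrinks towards 0 the influence function approaches a sign function and the solutions approach ordinary quantiles.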
M-quantile Regression
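M-quantile Linear Regression
Dependent variable $(y_1, \ldots, y_n)$
Auxiliary variables $(x_1, \ldots, x_k)$
$y_i = \alpha_p + \mathbf{x}_i'\boldsymbol{\beta}_p + \varepsilon_i$, with $\theta_p(\mathbf{x}_i) = \alpha_p + \mathbf{x}_i'\boldsymbol{\beta}_p$
The M-Quantile $\theta_p$ of order $p$, with $p \in (0, 1)$, of the conditional distribution $Y|X$ is defined as:
$\int \psi_p(y - \theta_p(\mathbf{x}))\,dF(y|\mathbf{x}) = 0$
with
$\psi_p(r) = \begin{cases} (1 - p)\psi(r) & r < 0 \\ p\,\psi(r) & r \geq 0 \end{cases}$
M-Quantile regression is a unified model that includes quantile regression and expectile regression as particular cases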
How to use the M-quantile model to obtain estimates for the random area effects
Small area model-based estimators borrow strength from all the sample to capture random area effects, given the hierarchical structure of the data. M-quantile regression does not depend on a hierarchical structure. We can characterise conditional variability across the population of interest by the M-quantile coefficients of the population units
The linear mixed effects model captures random area effects as differences in the conditional distribution of y given x between small areas
The M-Quantile model determines the area effect with the M-Quantile coefficients of the units belonging to the area
How to use M-quantile model to obtain estimates for the random area effects
Assume that we have individual level data on y and x
Each sample value of (x, y) will lie on one and only one M-Quantile line
We refer to the p-value of this line as the M-Quantile coefficient of the corresponding sample unit. So every sample unit will have an associated p-value
In order to estimate these unit-specific p-values, we define a fine grid of p-values (e.g. 0.001, ..., 0.999) that adequately covers the conditional distribution of y given x.
We fit an M-Quantile model for each p-value in the grid and use linear interpolation to estimate a unique p-value, pj, for each individual j in the sample
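The grid-and-interpolation procedure described above can be sketched as follows. This is only an illustration under stated assumptions: the Huber function with c = 1.345, a scale fixed from OLS residuals, a coarser grid than 0.001, ..., 0.999 to keep the run short, simulated data, and a monotonicity safeguard against crossing fitted lines. The names `mq_fit` and `unit_mq_coefficients` are hypothetical.

```python
import numpy as np

def mq_fit(X, y, p, scale, c=1.345, max_iter=200, tol=1e-8):
    """Linear M-quantile fit of order p by iteratively reweighted least squares."""
    beta = np.linalg.lstsq(X, y, rcond=None)[0]       # OLS starting values
    for _ in range(max_iter):
        u = (y - X @ beta) / scale                    # scaled residuals
        safe_u = np.where(u == 0, 1.0, u)
        huber_w = np.where(u == 0, 1.0, np.clip(u, -c, c) / safe_u)
        w = np.where(u < 0, 1 - p, p) * huber_w       # weights psi_p(u) / u
        XtW = X.T * w
        beta_new = np.linalg.solve(XtW @ X, XtW @ y)
        converged = np.max(np.abs(beta_new - beta)) < tol
        beta = beta_new
        if converged:
            break
    return beta

def unit_mq_coefficients(X, y, p_grid, c=1.345):
    """Estimate one p-value per unit by interpolating over the fitted grid lines."""
    ols_res = y - X @ np.linalg.lstsq(X, y, rcond=None)[0]
    # Assumption: scale fixed once from OLS residuals (MAD estimate).
    scale = np.median(np.abs(ols_res - np.median(ols_res))) / 0.6745
    fitted = np.array([X @ mq_fit(X, y, p, scale, c) for p in p_grid])  # (n_grid, n)
    # Safeguard (assumption): force fitted values to be non-decreasing in p.
    fitted = np.maximum.accumulate(fitted, axis=0)
    return np.array([np.interp(y[j], fitted[:, j], p_grid) for j in range(len(y))])

# Simulated data, for illustration only.
rng = np.random.default_rng(0)
n = 200
x = rng.uniform(0, 10, n)
y = 1.0 + 2.0 * x + rng.normal(0, 1, n)
X = np.column_stack([np.ones(n), x])
p_grid = np.linspace(0.01, 0.99, 99)
p_units = unit_mq_coefficients(X, y, p_grid)
```

Units lying far above the central line receive p-values close to the top of the grid, and units far below it receive p-values close to the bottom; values outside the fitted range are clipped to the grid ends by `np.interp`.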
How to use M-quantile model to obtain estimates for the random area effects
Figure: (a) Sample data, (b) M-quantile lines, (c) M-quantile lines associated to each unit, (d) M-quantile area lines
How to use M-quantile model to measure small area effects
Calculate an M-Quantile coefficient for each area by suitably averaging the q-values (the unit-level M-Quantile coefficients) of the sampled individuals in that area. Denote this area-specific q-value by θi
The M-Quantile small area model is
\[ y_{ij} = x_{ij}^T \beta_\psi(\theta_i) + \varepsilon_{ij} \]
βψ is the unknown regression vector
θi is the unknown area-specific coefficient
εij is an individual disturbance
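A minimal sketch of the averaging step, assuming that "suitably averaging" is taken to be the plain arithmetic mean of the unit-level q-values within each area (a median or a weighted mean would be alternative choices). The function name `area_mq_coefficients`, the integer area codes and the q-values are hypothetical.

```python
import numpy as np

def area_mq_coefficients(area_ids, unit_q):
    """Average the unit-level q-values within each area (areas coded 0..m-1)."""
    area_ids = np.asarray(area_ids)
    unit_q = np.asarray(unit_q, dtype=float)
    counts = np.bincount(area_ids)                  # sample size per area
    sums = np.bincount(area_ids, weights=unit_q)    # sum of q-values per area
    return sums / counts                            # assumes every area is sampled

# Hypothetical example: two areas with 2 and 3 sampled units.
area_ids = np.array([0, 0, 1, 1, 1])
unit_q = np.array([0.2, 0.4, 0.5, 0.6, 0.7])
theta = area_mq_coefficients(area_ids, unit_q)
print(theta)
```

An area whose units mostly lie above the central M-quantile line ends up with θi above 0.5, which is how the model expresses a positive area effect.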
The linear M-quantile small area model
Linear M-quantile small area model:
\[ y_{ij} = x_{ij}^T \beta_\psi(\theta_i) + \varepsilon_{ij} \]
βψ is estimated using iteratively reweighted least squares
θi is obtained by averaging the q-values of the sampled units belonging to area i
\[ \psi(u) = u\, I(|u| \le c) + \operatorname{sgn}(u)\, c\, I(|u| > c) \]
(Huber proposal 2 influence function)
εij has an unspecified distribution
The predictor for the target variable of the non-sampled unit k in area i is
\[ \hat{y}_{ki} = x_{ki}^T \hat{\beta}_\psi(\hat{\theta}_i) \]
Small Area Mean Estimate with M-Quantile Model
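Given estimates of β and θi we can obtain the small area mean estimator
Small Area Mean Estimator
\[ \hat{Y}_i = N_i^{-1} \Big( \sum_{j \in s_i} y_{ij} + \sum_{j \in r_i} x_{ij}' \hat{\beta}_\psi(\hat{\theta}_i) + \frac{N_i - n_i}{n_i} \sum_{j \in s_i} \big( y_{ij} - x_{ij}' \hat{\beta}_\psi(\hat{\theta}_i) \big) \Big) \]
and the MSE estimator
MSE estimator for the M-Quantile Small Area Mean Estimator
\[ V_i = V(\hat{y}_i - y_i) = N_i^{-1} \Big( \sum_{j \in s} u_{ij}^2 \big( y_{ij} - x_{ij}' \hat{\beta}_\psi(\hat{\theta}_i) \big)^2 + \frac{N_i - n_i}{n_i - 1} \sum_{j \in r_i} \big( y_{ij} - x_{ij}' \hat{\beta}_\psi(\hat{\theta}_i) \big)^2 \Big) \]
\[ B_i = N_i^{-1} \Big( \sum_{i=1}^{m} \sum_{j \in s_i} w_{ij}\, x_{ij}' \hat{\beta}_\psi(\hat{\theta}_i) - \sum_{j \in s_i} x_{ij}' \hat{\beta}_\psi(\hat{\theta}_i) \Big) \]
\[ MSE_i = V_i + B_i^2 \]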
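The small area mean estimator above can be sketched numerically as follows. The function takes the area-specific coefficient vector as given (it would come from an M-quantile fit evaluated at the estimated θi); the name `mq_small_area_mean` and all numbers are hypothetical, and unit-level covariates are assumed to be available for the non-sampled units of the area.

```python
import numpy as np

def mq_small_area_mean(y_s, X_s, X_r, beta_area):
    """Bias-adjusted M-quantile estimate of the mean of one small area.

    y_s, X_s  : response and covariates of the n_i sampled units
    X_r       : covariates of the N_i - n_i non-sampled units
    beta_area : coefficient vector evaluated at the area coefficient theta_i
    """
    y_s = np.asarray(y_s, dtype=float)
    X_s = np.asarray(X_s, dtype=float)
    X_r = np.asarray(X_r, dtype=float)
    beta_area = np.asarray(beta_area, dtype=float)
    n_i = len(y_s)
    N_i = n_i + len(X_r)
    resid_s = y_s - X_s @ beta_area                   # residuals of sampled units
    pred_r = X_r @ beta_area                          # predictions for non-sampled units
    bias_adj = (N_i - n_i) / n_i * resid_s.sum()      # bias adjustment term
    return (y_s.sum() + pred_r.sum() + bias_adj) / N_i

# Hypothetical area with 2 sampled and 3 non-sampled units (intercept + 1 covariate).
y_s = [10.0, 12.0]
X_s = [[1.0, 2.0], [1.0, 3.0]]
X_r = [[1.0, 1.0], [1.0, 4.0], [1.0, 5.0]]
beta_area = [4.0, 2.5]
estimate = mq_small_area_mean(y_s, X_s, X_r, beta_area)
print(estimate)
```

In this toy case the sampled values sum to 22, the predictions for the non-sampled units sum to 37, and the bias adjustment is (3/2) × 1.5 = 2.25, so the estimate is 61.25 / 5 = 12.25.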
Example: M-quantile estimates using EU-SILC data
Data on the equivalised income in 2007 are available from the EU-SILC survey 2008 for 1495 households in the 10 Tuscany Provinces, for 1286 households in the 5 Campania Provinces and for 2274 households in the 11 Lombardia Provinces
The target variable is the equivalised household income already defined for the unit-level EBLUP example
A set of explanatory variables is available for each unit in the population from the Population Census 2001
We employ an M-quantile model to estimate the mean of the household equivalised income at a LAU 1 level (Provinces)
Example: EU-SILC data
Remark 1: it is important to underline that EU-SILC data are confidential. These data were provided by ISTAT, the Italian National Institute of Statistics, to the researchers of the SAMPLE project and were analyzed respecting all confidentiality restrictions
Remark 2: We chose the Campania, Toscana and Lombardia regions because they are representative respectively of the South, Center and North of Italy
Remark 3: The choice of three representative regions for North, Center and South of Italy has been driven by the well-known North-South divide
Example: Target variable statistics
Figure: Boxplots of the disposable equivalised household income (Campania, Lombardia, Toscana)
Example: Target variable statistics
The boxplots show evidence of a skewed distribution of the household equivalised income, with a heavy right tail in all three regions
The boxplots also show evidence of outliers
Evidence of outliers also emerges from the summary statistics obtained using the cross-sectional EU-SILC household weights:
            Min.       1st Qu.   Median   Mean    3rd Qu.   Max.
Campania    −852.7     8073      11560    13550   17430     99400
Lombardia   −21550.0   12620     17670    20040   24000     209800
Toscana     −2849.0    12120     17230    19430   23570     107900
Table: Summary statistics for the household equivalised income
Example: Census 2001 data
The Italian Census 2001 was conducted by ISTAT
The Campania region accounts for 1,862,855 households
The Lombardia region accounts for 3,652,944 households
The Toscana region accounts for 1,388,252 households
Available variables: household size (integer value), ownership of dwelling (owner/tenant), age of the head of the household (integer value), years of education of the head of the household (integer value), working position of the head of the household (employed / unemployed in the previous week), gender of the head of the household, civil status of the head of the household
Example: Model Specification
We used the M-quantile linear model to compute the means at LAU 1 level (Provinces)
The selection of covariates to fit the small area models relies on prior studies of poverty assessment
The following covariates were selected:
household size (integer value)
ownership of dwelling (owner/tenant)
age of the head of the household (integer value)
years of education of the head of the household (integer value)
working position of the head of the household (employed / unemployed in the previous week)
gender of the head of the household
Example: Estimates of mean income for Lombardia Provinces
Figure: Mean of Households Equivalised Income, Lombardia Provinces
VARESE     21091.49  (1305.98)
COMO       18578.33  (1137.01)
SONDRIO    16307.16  (1668.92)
MILANO     20798.63  (497.68)
BERGAMO    18323.07  (820.61)
BRESCIA    16326.21  (581.47)
PAVIA      21081.25  (4080.17)
CREMONA    16774.18  (883.69)
MANTOVA    17774.9   (677.24)
LECCO      19497.61  (1131.62)
LODI       17052.58  (965.49)
Example: Estimates of mean income for Toscana Provinces
Figure: Mean of Households Equivalised Income, Toscana Provinces
MASSA CARRARA  14128.26  (664.84)
LUCCA          15867.69  (766.8)
PISTOIA        18980.76  (1119.33)
FIRENZE        19184.92  (498.35)
LIVORNO        17875.01  (919.41)
PISA           18550.16  (876.37)
AREZZO         18665.97  (1014.42)
SIENA          20228.98  (1113.91)
GROSSETO       16152.47  (1151.84)
PRATO          17702.87  (632.74)
Example: Estimates of mean income for Campania Provinces
Figure: Mean of Households Equivalised Income, Campania Provinces
CASERTA    11685.74  (574.89)
BENEVENTO  11312.89  (1033.79)
NAPOLI     12661.84  (291.73)
AVELLINO   12873.13  (979.46)
SALERNO    12715.91  (502.22)
MQ approach to SAE: advantages, disadvantages, extensions
Advantages and drawbacks of the M-Quantile approach with respect to the EBLUP approach
Main advantages
i. Distributional assumptions on parameters are not needed
ii. Assumptions on the hierarchical structure are not needed
iii. The M-Quantile model is robust against outliers
iv. It is easy to implement nonparametric M-Quantile approaches
v. The bootstrap approach to MSE estimation is faster than the bootstrap for EBLUP (mixed linear models require double bootstrap techniques)
Main drawbacks
i. There is no specification of the M-Quantile model for nominal response variables
ii. There is no specification if the response variable is multivariate
iii. Estimators can be design-biased (even if they are model-unbiased)
Essential bibliography
Breckling, J. and Chambers, R. (1988). M-quantiles. Biometrika, 75, 761–771.
Brunsdon, C., Fotheringham, A.S. and Charlton, M. (1996). Geographically weighted regression: a method for exploring spatial nonstationarity. Geographical Analysis, 28, 281–298.
Chambers, R. and Dunstan, R. (1986). Estimating distribution functions from survey data. Biometrika, 73, 597–604.
Chambers, R. and Tzavidis, N. (2006). M-quantile models for small area estimation. Biometrika, 93, 255–268.
Chambers, R.L., Chandra, H. and Tzavidis, N. (2011). On bias-robust mean squared error estimation for pseudo-linear small area estimators. Survey Methodology, 37, 153–170.
Cheli, B. and Lemmi, A. (1995). A Totally Fuzzy and Relative Approach to the Multidimensional Analysis of Poverty. Economic Notes, 24, 115–134.
Elbers, C., Lanjouw, J.O. and Lanjouw, P. (2003). Micro-level estimation of poverty and inequality. Econometrica, 71, 355–364.
Fotheringham, A.S., Brunsdon, C. and Charlton, M. (2002). Geographically Weighted Regression. John Wiley and Sons, West Sussex.
Foster, J., Greer, J. and Thorbecke, E. (1984). A class of decomposable poverty measures. Econometrica, 52, 761–766.
Hall, P. and Maiti, T. (2006). On parametric bootstrap methods for small area prediction. Journal of the Royal Statistical Society: Series B, 68, 2, 221–238.
Lombardia, M.J., Gonzalez-Manteiga, W. and Prada-Sanchez, J.M. (2003). Bootstrapping the Chambers-Dunstan estimate of finite population distribution function. Journal of Statistical Planning and Inference, 116, 367–388.
Marchetti, S., Tzavidis, N. and Pratesi, M. (2012). Nonparametric Bootstrap Mean Squared Error Estimation for M-quantile Estimators for Small Area Averages, Quantiles and Poverty Indicators. Computational Statistics and Data Analysis, 56, 2889–2902.
Royall, R. and Cumberland, W.G. (1978). Variance Estimation in Finite Population Sampling. Journal of the American Statistical Association, 73, 351–358.
Salvati, N., Tzavidis, N., Pratesi, M. and Chambers, R. (2010). Small Area Estimation Via M-quantile Geographically Weighted Regression. [Paper submitted for publication in TEST]
Salvati, N., Chandra, H., Ranalli, M.G. and Chambers, R. (2010). Small Area Estimation Using a Nonparametric Model Based Direct Estimator. Computational Statistics and Data Analysis, 54, 2159–2171.
Tzavidis, N., Marchetti, S. and Chambers, R. (2010). Robust estimation of small area means and quantiles. Australian and New Zealand Journal of Statistics, 52, 2, 167–186.
Tzavidis, N., Salvati, N., Pratesi, M. and Chambers, R. (2007). M-quantile models for poverty mapping. Statistical Methods & Applications, 17, 393–411.