
Small area estimation I

Monica Pratesi and Caterina Giusti

Department of Economics and Management, University of Pisa

1st EMOS Spring School, Trier, Pisa, Manchester, Luxembourg, 23-27 March 2015


Structure of the Presentation

1 Rising interest in measuring progress at a local level

2 An overview of SAE methods

3 Expansion and GREG estimators

4 Empirical Best Linear Unbiased Predictor

5 M-Quantile Approach



Part I

A short introduction to the small area estimation problem



Demand for statistics at local level

During the last decade there has been a rising demand for (objective and subjective) progress and well-being indicators

These measures play a central role for policy makers, who use them to plan their policies and to verify their effectiveness

Some examples of indicators:

Poverty indicators: At-risk-of-poverty rate, Household equivalised income, etc.
Labour market indicators: Unemployment rate, Satisfaction with the job, etc.
Health indicators: Average lifespan, Share of population with dangerous behaviors (obesity, smoking, etc.)

To be informative and effective, these indicators should be chosen at the appropriate level of disaggregation

Indicators can be disaggregated along various dimensions, including geographic areas, demographic groups, income/consumption groups and social groups




To gather data to compute the indicators of interest for a given population there are two main solutions:

Censuses
Sample Surveys

Sample surveys have been recognized as a cost-effective means of obtaining information on wide-ranging topics of interest at frequent intervals over time

Indeed, the data available to measure progress and well-being in Europe come mainly from sample surveys, such as the Survey on Income and Living Conditions (EU-SILC)




Data coming from sample surveys can be used to compute accurate direct estimates only for large domains

Direct estimation uses only domain-specific data:
1 it suffers from low precision when the domain sample size is small
2 it is not applicable with zero sample sizes

For example, EU-SILC data can be used to produce accurate direct estimates only at the NUTS* 2 level (that is, regional level in Italy)

* The geocode NUTS refers to the EU Nomenclature of Territorial Units for Statistics



Introduction to Small Area Estimation

To satisfy the increasing demand for statistical estimates on progress and well-being referring to smaller domains (LAU 1 and LAU 2 levels, that is provinces and municipalities in Italy) using sample survey data, there is the need to resort to small area methodologies

Small area methodologies try to fill the gap between official statistics and the local demand for data

What is a small area? What is small area estimation?

Small Area Estimation (SAE) is concerned with the development of statistical procedures for producing efficient (precise) estimates for small areas, that is for domains with small or zero sample sizes. Domains are defined by the cross-classification of geographical districts by social/economic/demographic characteristics. The target is the estimation of a parameter (average/percentile/proportion/rate) and the estimation of the corresponding prediction error



Introduction to Small Area Estimation: Target parameters

The estimates of interest in the small areas are usually means or totals. Examples:

Mean individual gross income in a set of areas
Total crop production in a set of areas

However, other estimates can also be of interest. Examples:

The quantiles of the household (equivalised) income
The proportion of households with income below the poverty line



Introduction to Small Area Estimation: Example

Target population: households who live in an Italian Region

Variable of interest: Income or other poverty measures

Survey sample: EU-SILC (European Union Statistics on Income and Living Conditions), designed to obtain reliable estimates at Regional level in Italy

planned design domains: Regions
unplanned design domains: e.g. Provinces, Municipalities

EU-SILC sample size in Tuscany: 1751 households

Pisa province: 158 households → need SAE
Grosseto province: 70 households → need SAE



Introduction to Small Area Estimation: Example

US sample sizes with an equal probability of selection method sample of 10,000 persons

State        1994 Population (thousands)   Sample size
California   31,431                        1207
Texas        18,378                        706
New York     18,169                        698
...          ...                           ...
DC           570                           22
Wyoming      476                           18

Suppose we want to measure customer satisfaction for a government service:

California: 24.86% → leads to a confidence interval of 22.4%-27.3% (reliable)
Wyoming: 33.33% → leads to a confidence interval of 10.9%-55.7% (unreliable → need SAE)
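The two intervals above can be reproduced with a normal-approximation interval for a proportion. The sketch below is only illustrative: the counts 300 and 6 are back-calculated from the reported percentages, and the variance estimator p(1 − p)/(n − 1) is an assumption, chosen because it matches the reported figures.

```python
import math

def proportion_ci(successes, n, z=1.96):
    """Normal-approximation confidence interval for a sample proportion.

    Assumption: the variance of the proportion is estimated by p(1-p)/(n-1).
    """
    p = successes / n
    se = math.sqrt(p * (1 - p) / (n - 1))
    return p, p - z * se, p + z * se

# Hypothetical counts consistent with 24.86% of 1207 and 33.33% of 18
for state, k, n in [("California", 300, 1207), ("Wyoming", 6, 18)]:
    p, lo, hi = proportion_ci(k, n)
    print(f"{state}: {100 * p:.2f}% ({100 * lo:.1f}%-{100 * hi:.1f}%)")
```

The point estimate is similar in nature for the two states, but with only 18 respondents the Wyoming interval is roughly nine times wider.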



SAE: data requirements

Survey data: available for the target variable y and for the auxiliary variable x, related to y

Census/Administrative data: available for x but not for y

SAE in 3 steps

1 Use survey data to estimate models that link y to x

2 Combine the estimated model parameters with x for out of sample units, to form predictions

3 Use these predictions to estimate the target parameters
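The three steps can be sketched on simulated data for a single area. Everything here is an assumption made for illustration: the population is artificial, the linking model is a plain linear regression without random area effects, and the names (`x_pop`, `y_pop`, `mean_hat`) are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)

# Artificial area: x is known for all N units (census), y only for the n sampled units
N, n = 200, 20
x_pop = rng.uniform(0, 10, size=N)
y_pop = 2.0 + 1.5 * x_pop + rng.normal(0, 1, size=N)  # treated as unknown outside the sample
in_sample = np.zeros(N, dtype=bool)
in_sample[rng.choice(N, size=n, replace=False)] = True

# Step 1: use the survey data to estimate a model linking y to x
X_s = np.column_stack([np.ones(n), x_pop[in_sample]])
beta_hat, *_ = np.linalg.lstsq(X_s, y_pop[in_sample], rcond=None)

# Step 2: combine the estimated parameters with x for the out-of-sample units
X_r = np.column_stack([np.ones(N - n), x_pop[~in_sample]])
y_pred = X_r @ beta_hat

# Step 3: use observed and predicted values to estimate the target parameter (area mean)
mean_hat = (y_pop[in_sample].sum() + y_pred.sum()) / N
print(mean_hat, y_pop.mean())
```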



SAE: classifications

There are several classifications of SAE methods

Some of these classifications cross each other

Direct vs Indirect methods

Direct methods use only domain-specific data

Indirect methods borrow information from all the data




Design-based vs Model-based methods

Design-based (Model-assisted) methods

Direct estimation
Can allow for the use of models (model-assisted)
Inference is under the randomization distribution

Model-based methods

Borrow strength by using a model
Estimation using frequentist or Bayesian approaches
Inference under the model is conditional on the selected sample




Area-level vs Unit level methods

Area-level models relate small area direct estimators to area-specific covariates. Such models are necessary if unit (or element) level data are not available.

Unit level models relate the unit values of a study variable to unit-specific covariates.



How a SAE Unit Level Model Works



A classification of the Small Area Estimation Methods



Focus of the lesson

We will focus on model-assisted and model-based methods

For the model-based estimators we will consider two alternative approaches to SAE:

small area estimation using multilevel (mixed/random effects) models → unit and area-level approaches
small area estimation based on M-quantile models → unit level approach

We will present some examples based on real data

We will present the R code used to obtain the estimates in the examples



Part II

Classic and new approaches to model-based Small Area Estimation




Definitions

Design-based estimation: the main focus is on design unbiasedness. Estimators are unbiased with respect to the randomization that generates the survey data

Finite population $\Omega = \{1, \ldots, N\}$

$y$: variable of interest, with $y_i$ the value of the $i$-th unit of the finite population

Statistics of interest: e.g. the total, $Y = \sum_{\Omega} y_i$, or the mean, $\bar{Y} = Y/N$

Sample $s = \{1, \ldots, n\}$

$p(s)$: probability of selecting the sample $s$ from population $\Omega$. $p(s)$ depends on known design variables such as stratum indicators and size measures of clusters




Definitions

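Bias of an estimator $\hat{\theta}$ is defined as $B[\hat{\theta}] = E[\hat{\theta} - \theta]$

Variance of an estimator $\hat{\theta}$ is defined as $V[\hat{\theta}] = E[(\hat{\theta} - E[\hat{\theta}])^2]$

Mean Squared Error of an estimator $\hat{\theta}$ is: $E[(\hat{\theta} - \theta)^2] = V[\hat{\theta}] + B[\hat{\theta}]^2$

Design bias: $Bias(\hat{Y}) = E_p[\hat{Y}] - Y$

Design variance: $V(\hat{Y}) = E_p[(\hat{Y} - E_p[\hat{Y}])^2]$

Design-based properties

1 Design-unbiasedness: $E_p[\hat{Y}] = \sum_s p(s)\hat{Y}_s = Y$

2 Design-consistency: $\hat{Y} \to Y$ in probability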




Estimation of Means: Expansion Estimator

Data $\{y_i\}$, $i \in s$

Expansion estimator for the mean:

$$\hat{\bar{Y}} = \frac{\sum_{i \in s} w_i y_i}{\sum_{i \in s} w_i}$$

$w_i = \pi_i^{-1}$, the basic design weight

$\pi_i$ is the probability of selecting unit $i$ in sample $s$

Remark: the weights $w_i$ do not depend on $y_i$
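As a minimal sketch, the expansion estimator is a weighted mean with weights equal to the inverse inclusion probabilities. The function name `expansion_mean` and the toy numbers are hypothetical.

```python
import numpy as np

def expansion_mean(y, pi):
    """Expansion estimator of the mean: sum(w*y)/sum(w) with w = 1/pi."""
    y = np.asarray(y, dtype=float)
    w = 1.0 / np.asarray(pi, dtype=float)  # basic design weights
    return np.sum(w * y) / np.sum(w)

# With equal inclusion probabilities the estimator is the plain sample mean
print(expansion_mean([10, 20, 30], [0.1, 0.1, 0.1]))
# With unequal probabilities, units that were less likely to be drawn count more
print(expansion_mean([10, 20], [0.5, 0.1]))
```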




Estimation of Means: GREG Estimator

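Data $\{y_i, x_i\}$, $i \in s$

$X = (X_1, \ldots, X_q)'$, population totals for $q$ auxiliary variables

Generalized REGression Estimator:

$$\hat{Y}_{GR} = \hat{Y} + (X - \hat{X})'\hat{\beta}$$

where $\hat{Y} = \sum_{i \in s} w_i y_i$ and $\hat{X} = \sum_{i \in s} w_i x_i$ are the expansion estimators of the totals, and

$$\hat{\beta} = \Big(\sum_{i \in s} w_i x_i x_i'/c_i\Big)^{-1}\Big(\sum_{i \in s} w_i x_i y_i/c_i\Big)$$

$c_i$ is a specified positive constant

Note: $\hat{Y}_{GR} = \sum_{i \in s} w_i^* y_i$, where $w_i^* = w_i^*(s) = w_i(s)\, g_i(s)$

with $g_i(s) = 1 + (X - \hat{X})'\Big(\sum_{k \in s} w_k x_k x_k'/c_k\Big)^{-1} x_i/c_i$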
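The GREG formulas translate directly into a few lines of linear algebra. The sketch below is an illustration under stated assumptions: the function name `greg_total` and the toy data are hypothetical, the estimator is written for totals, and $c_i = 1$ unless specified. It also returns the adjusted weights $w_i^* = w_i g_i$, which reproduce the known auxiliary totals exactly.

```python
import numpy as np

def greg_total(y, x, w, X_tot, c=None):
    """GREG estimator of a total and the g-adjusted weights w* = w * g."""
    y = np.asarray(y, dtype=float)
    x = np.asarray(x, dtype=float)            # n x q matrix of auxiliary variables
    w = np.asarray(w, dtype=float)            # basic design weights
    X_tot = np.asarray(X_tot, dtype=float)    # known population totals of x
    c = np.ones_like(w) if c is None else np.asarray(c, dtype=float)

    xw = x * (w / c)[:, None]
    A = xw.T @ x                              # sum_i w_i x_i x_i' / c_i
    beta = np.linalg.solve(A, xw.T @ y)       # weighted regression coefficients
    Y_exp = np.sum(w * y)                     # expansion estimator of the y total
    X_exp = x.T @ w                           # expansion estimator of the x totals
    greg = Y_exp + (X_tot - X_exp) @ beta
    g = 1.0 + (x / c[:, None]) @ np.linalg.solve(A, X_tot - X_exp)
    return greg, w * g

# Toy data: y is exactly 3 + 2*x, N = 20 units, known totals of (1, x) are (20, 110)
x = np.column_stack([np.ones(4), [2.0, 4.0, 6.0, 8.0]])
y = 3.0 + 2.0 * x[:, 1]
w = np.full(4, 5.0)
X_tot = np.array([20.0, 110.0])
greg, w_star = greg_total(y, x, w, X_tot)
print(greg, w_star)
```

Because the toy relationship is exactly linear, the GREG estimate equals 3·20 + 2·110, while the expansion estimate alone would miss the part of the x total not represented in the sample.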




Domain Estimation

Partition the population $\Omega$ into $m$ domains:

$\Omega = \cup_{i=1}^{m} \Omega_i$

$\Omega_i = \{1, \ldots, N_i\}$, population of domain $i$

$s_i = \{1, \ldots, n_i\}$, sample of domain $i$

Statistics of interest for the variable $y$:

$Y_i = \sum_{j \in \Omega_i} y_j$, the total of domain $i$

$\bar{Y}_i = Y_i/N_i$, the mean of domain $i$




Domain Estimation: Expansion Estimator

Data $\{y_{ij}\}$, $j \in s_i$, $i = 1, \ldots, m$

Expansion estimator of the mean for domain $i$:

$$\hat{\bar{Y}}_i = \frac{\sum_{j \in s_i} w_{ij} y_{ij}}{\sum_{j \in s_i} w_{ij}}$$

where $y_{ij}$ is the observed value and $w_{ij}$ is the weight for unit $j$ in area $i$

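The case of simple random sampling:

$$\pi_{ij} = \frac{\binom{1}{1}\binom{N_i - 1}{n_i - 1}}{\binom{N_i}{n_i}} = \frac{n_i}{N_i} \;\rightarrow\; w_{ij} = \pi_{ij}^{-1} = \frac{N_i}{n_i}$$

$$\hat{\bar{Y}}_i = \frac{\sum_{j=1}^{n_i} w_{ij} y_{ij}}{\sum_{j=1}^{n_i} w_{ij}} = \frac{\sum_{j=1}^{n_i} \frac{N_i}{n_i} y_{ij}}{\sum_{j=1}^{n_i} \frac{N_i}{n_i}} = \frac{\frac{N_i}{n_i}\sum_{j=1}^{n_i} y_{ij}}{n_i \frac{N_i}{n_i}} = \frac{1}{n_i}\sum_{j=1}^{n_i} y_{ij}$$

(that is, the sample mean)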




Domain Estimation: Expansion Estimator

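$\hat{\bar{Y}}_i$ is design unbiased

$V(\hat{\bar{Y}}_i) = \left(1 - \frac{n_i}{N_i}\right)\frac{S_i^2}{n_i}$, where $S_i^2 = \frac{\sum_{j \in s_i}(y_{ij} - \bar{y}_i)^2}{n_i - 1}$ is the sample variance

The magnitude of the variance depends on 3 factors: $n_i/N_i$, $S_i^2$ and $n_i$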

If ni is small the design variance is likely to be large

In such a situation, estimation of variance is even more problematic
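A small numerical sketch makes the role of $n_i$ concrete. The function name and the toy values are hypothetical; the formula is the one given above for simple random sampling within the domain.

```python
import numpy as np

def srs_domain_mean_variance(y, N):
    """Estimated design variance of the domain sample mean under SRS:
    (1 - n/N) * S^2 / n, with S^2 the sample variance (divisor n - 1)."""
    y = np.asarray(y, dtype=float)
    n = y.size
    s2 = np.sum((y - y.mean()) ** 2) / (n - 1)
    return (1.0 - n / N) * s2 / n

# Same spread of values, but a domain sample four times larger
small = srs_domain_mean_variance([1, 2, 3, 4], N=40)
large = srs_domain_mean_variance([1, 2, 3, 4] * 4, N=160)
print(small, large)
```

With the sampling fraction held at 10% in both cases, the estimated variance shrinks as the domain sample grows. With very few units $S_i^2$ is itself estimated from very little information, which is why variance estimation becomes problematic.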




Domain Estimation: GREG Estimator

Data $\{y_{ij}, x_{ij}\}$, $j \in s_i$

$X_i = (X_{i1}, \ldots, X_{iq})'$, population totals for $q$ auxiliary variables in domain $i$

Generalized REGression Estimator:

$$\hat{Y}_{i,GR} = \hat{Y}_i + (X_i - \hat{X}_i)'\hat{\beta}_i$$

$$\hat{\beta}_i = \Big(\sum_{j \in s_i} w_{ij} x_{ij} x_{ij}'/c_{ij}\Big)^{-1}\Big(\sum_{j \in s_i} w_{ij} x_{ij} y_{ij}/c_{ij}\Big)$$

$c_{ij}$ is a specified positive constant

Remark: The GREG estimator attempts to improve on the expansion estimator by borrowing strength from relevant sources (e.g. covariates) through appropriate adjustment to the basic weights.

Remark: The GREG estimator remains approximately design-unbiased, but (should) improve on the design variance (and thereby the MSE)




GREG Estimator, Example

We present here an example of how to use and apply the Generalized Regression estimator to obtain small area estimates of the area means

The target parameter is the mean forest biomass per hectare (ha) within the 14 municipalities (small areas) of the forest in Vestfold County (Norway)

The data are from the Norwegian National Forest Inventory, which provides estimates of forest parameters on national and regional scales by means of a systematic network of permanent sample plots

The data set is public and detailed information on the data is available in Breidenbach and Astrup (2012)





Data on forest biomass per hectare (Biomass/ha) are available for 145 sample plots

Auxiliary data on mean canopy height are also available from digital aerial images

The following table shows the first six lines (out of 145) of the sample plots of the Norwegian National Forest Inventory

Figure : Norwegian National Forest Inventory sample data




GREG Estimator, Example

The relationship between the target and the auxiliary data available from the sample is represented in the following Figure

Figure : Scatterplot of the Biomass/ha vs Mean canopy height sample data




GREG Estimator, Example

Data on the mean canopy height are also available for all the population elements; the population here is the forest covered by aerial images (for which the mean canopy values are available)

Thus, the $N = \sum_{i=1}^{14} N_i = 5402465$ population elements are the 16 · 16 tiles within the forest for which auxiliary variables from the canopy height mean and image data were calculated

Figure : Population data of mean canopy height from digital aerial images





Using the data reported in the two previous Tables we can obtain small area GREG estimates of the mean forest biomass/ha in the 14 municipalities

Figure : Point and MSE GREG estimates of the mean forest biomass per hectare for the 14 municipalities




GREG: advantages, disadvantages, extensions




EBLUP: Unit Level Approach

Unit level approach to small area estimation

$y$ is the vector of the $y$ variable for the population $\Omega$

$y = [y_s', y_r']'$, where $y_s$ is the vector of the observed units (the sampled ones) and $y_r$ is the vector of the non-observed units ($N - n$ units, $r = 1, \ldots, N - n$)

$X$ is the covariate matrix and is considered known for all the population units

Subscript $i$ refers to small areas (e.g. $y_{si}$ is the vector of observed values in area $i$)

Model for the $y$ variable (known as the superpopulation model)

$$y = X\beta + Zu + e$$

which can alternatively be written as

$$y_{ij} = x_{ij}'\beta + u_i + e_{ij}$$




EBLUP: Unit Level Approach

Given that

$u \sim N(0, G)$, $e \sim N(0, R)$ and $u \perp e$

$R = \sigma_e^2 \Sigma_e$, $G = \sigma_u^2 \Sigma_u$

$X$ is a full rank matrix (say rank equal to $q$) and the rank of $(X : Z_i) > q$

$n \geq q + m + 1$

it can be shown that

$V(y) = ZGZ' + R = V$

$\hat{\beta} = (X'V^{-1}X)^{-1}X'V^{-1}y_s$ is the BLUE for $\beta$

$\hat{u} = GZ'V^{-1}(y - X\hat{\beta})$ is the BLUP for $u$

$y \sim N(X\beta, V)$
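The BLUE and BLUP formulas can be checked numerically. The following is a minimal sketch under explicit assumptions: the variance components are treated as known, $\Sigma_u$ and $\Sigma_e$ are identity matrices (random area intercepts, independent unit errors), all vectors refer to the sampled units, and the function name and toy data are hypothetical.

```python
import numpy as np

def blue_blup(y, X, Z, sigma2_u, sigma2_e):
    """BLUE of beta and BLUP of u for y = X beta + Z u + e with known variances."""
    n, m = Z.shape
    G = sigma2_u * np.eye(m)          # assumption: Sigma_u = I
    R = sigma2_e * np.eye(n)          # assumption: Sigma_e = I
    V = Z @ G @ Z.T + R
    Vinv_X = np.linalg.solve(V, X)
    beta = np.linalg.solve(X.T @ Vinv_X, Vinv_X.T @ y)
    u = G @ Z.T @ np.linalg.solve(V, y - X @ beta)
    return beta, u

# Toy sample: 9 units in 3 areas, one covariate plus intercept
area = np.array([0, 0, 0, 1, 1, 2, 2, 2, 2])
x = np.array([1.0, 2.0, 3.0, 1.0, 4.0, 2.0, 3.0, 5.0, 6.0])
y = np.array([3.1, 4.9, 7.2, 2.0, 8.5, 6.0, 7.9, 12.1, 13.8])
X = np.column_stack([np.ones_like(x), x])
Z = (area[:, None] == np.arange(3)).astype(float)   # area membership indicators

beta_hat, u_hat = blue_blup(y, X, Z, sigma2_u=2.0, sigma2_e=1.0)
print(beta_hat, u_hat)
```

Under these assumptions each predicted area effect equals the mean residual of the area shrunk by the factor $\sigma_u^2/(\sigma_u^2 + \sigma_e^2/n_i)$, so areas with fewer sampled units are pulled more strongly towards zero.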




EBLUP: Unit Level Approach

The estimator of the mean for the $i$-th area is obtained as a linear combination of observed and unobserved units, as follows

$$\hat{\bar{Y}}_i = \frac{1}{N_i}\Big[\sum_{j \in s_i} y_{ij} + \sum_{k \in r_i} (x_{ik}'\hat{\beta} + \hat{u}_i)\Big]$$

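The next step is to derive an MSE estimator

$$MSE(\hat{\theta}_i) \approx g_{1i}(\sigma) + g_{2i}(\sigma) + g_{3i}(\sigma) + g_{4i}(\sigma)$$

$g_{1i}(\sigma) = \alpha_r' Z_r T_s Z_r' \alpha_r$

$g_{2i}(\sigma) = [\alpha_r' X_r - \alpha_r' Z_r T_s Z_s' R_s^{-1} X_s](X_s' V_s^{-1} X_s)^{-1}[X_r' \alpha_r - X_s' R_s^{-1} Z_s T_s Z_r' \alpha_r]$

$g_{3i}(\sigma) = tr\{(\nabla(\alpha_r' Z_r \Sigma_u Z_s' V_s^{-1})')\, V_s\, (\nabla(\alpha_r' Z_r \Sigma_u Z_s' V_s^{-1})')'\, E[(\hat{\sigma} - \sigma)(\hat{\sigma} - \sigma)']\}$

$g_{4i}(\sigma) = \alpha_r' R_r \alpha_r$

$T_s = \Sigma_u - \Sigma_u Z_s'(\Sigma_{es} + Z_s \Sigma_u Z_s')^{-1} Z_s \Sigma_u$

$\sigma = (\sigma_e^2, \sigma_u^2/\sigma_e^2)'$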




EBLUP: Unit Level Approach

Finally, the estimator of the MSE of $\hat{\theta}_i$ is

$$\widehat{MSE}(\hat{\theta}_i) = g_{1i}(\hat{\sigma}) + g_{2i}(\hat{\sigma}) + 2g_{3i}(\hat{\sigma}) + g_{4i}(\hat{\sigma})$$

$\hat{\sigma}$ is an unbiased estimator of $\sigma$

Remark: it is possible to obtain an estimate of the MSE using alternative techniques, such as bootstrap and jackknife




Example: Estimates of Mean Income in Tuscany Provinces

Data on the equivalised income in 2005 for 1525 households in the 10 Tuscany Provinces are available from the EU-SILC survey 2006

A set of explanatory variables is available for each unit in the population from the Population Census 2001

We employ the EBLUP unit level model to estimate the mean of the household equivalised income

The Municipality of Florence, with 125 units out of 457 in the Province, is considered as a stand-alone area




Example: Target variable

Relative poverty measures are related to income or consumption

Our estimates are based on the household equivalised disposable income (target variable)

Averages are computed on the household equivalised disposable income

The disposable household equivalised income is computed as

$$\frac{[\text{Disposable household income}] \cdot [\text{Within-household non-response inflation factor}]}{\text{Equivalised household size}}$$





Disposable household income:

The sum for all household members of gross personal income components plus gross income components at household level minus employer’s social insurance contributions, interest paid on mortgage, regular taxes on wealth, regular inter-household cash transfers paid, tax on income and social insurance contributions

Within-household non-response inflation factor:

Factor by which it is necessary to multiply the total gross income, the total disposable income or the total disposable income before social transfers to compensate for non-response in individual questionnaires





Equivalised household size:

Let HM14+ be the number of household members aged 14 and over (at the end of the income reference period)
Let HM13− be the number of household members aged 13 or less (at the end of the income reference period)

Equivalised household size = 1 + 0.5 · (HM14+ − 1) + 0.3 · HM13−

Remark: in this way we take into account the economies of scale present in a household
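The definition of the target variable can be written as two small functions. This is a sketch of the formulas above; the function and argument names, and the example household, are hypothetical.

```python
def equivalised_household_size(hm14_plus, hm13_minus):
    """1 + 0.5 * (members aged 14+ minus one) + 0.3 * (members aged 13 or less)."""
    return 1 + 0.5 * (hm14_plus - 1) + 0.3 * hm13_minus

def equivalised_income(disposable_income, inflation_factor, hm14_plus, hm13_minus):
    """Disposable household income, inflated for within-household non-response,
    divided by the equivalised household size."""
    size = equivalised_household_size(hm14_plus, hm13_minus)
    return disposable_income * inflation_factor / size

# Hypothetical household: two adults and two children under 14, income 21000, no non-response
print(equivalised_household_size(2, 2))
print(equivalised_income(21000, 1.0, 2, 2))
```

A household of four therefore counts as about 2.1 equivalent adults rather than 4, which is how the scale accounts for shared costs.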




Example: auxiliary data

The selection of covariates to fit the small area model relies on prior studies on poverty assessment

The following covariates have been selected:

household size
ownership of dwelling (owner/tenant)
age of the head of the household
years of education of the head of the household
working position of the head of the household (employed/unemployed in the previous week)

Design-based estimates of the mean income have been computed in order to show the gain in efficiency of the EBLUP




Example: results

Table : Mean Income Estimates. Tuscany Provinces

Provinces     EBLUP    RMSE EBLUP   Design-Based   RMSE DB
Arezzo        17328    816.6        18455          1088.9
Florence M.   18139    806.4        21927          1188.8
Florence      16327    628.2        16347          490.1
Grosseto      16593    937.8        17811          1574.5
Livorno       17111    841.5        20257          3258.4
Lucca         15805    868.7        15780          783.5
Massa         15644    868.0        14814          909.0
Pisa          15950    826.5        16741          755.0
Pistoia       16467    850.1        16852          950.4
Prato         16964    824.9        16715          911.5
Siena         16660    852.0        16926          779.4




Example: results

Figure : Error bar for mean income estimates (x-axis: Provinces, y-axis: Equivalized Income). Black bars represent Design-Based estimates, red bars represent EBLUP estimates




Example: results

Figure : Design-Based Estimates and EBLUP Estimates, shown on a common scale with class breaks 14813.00, 16019.47, 17323.03, 18732.66, 20258.00




EBLUP at the unit level: advantages, disadvantages, extensions




EBLUP: Area Level Approach, the Fay-Herriot Model

The area level model includes random area-specific effects and area-specific covariates $x_i$

$$\theta_i = x_i'\beta + z_i u_i, \quad i = 1, \ldots, m$$

$\theta_i$ is the parameter of interest (e.g. totals $Y_i$ or means $\bar{Y}_i$)

$z_i$ are known positive constants

$u_i$ are independent and identically distributed random variables with mean 0 and variance $\sigma_u^2$ ($u_i \sim N(0, \sigma_u^2)$)

$\beta$ is the vector of regression parameters





Assumption: $\hat{\theta}_i = \theta_i + e_i$

$\hat{\theta}_i$ is a direct design-unbiased estimator

$e_i$ are independent sampling errors with mean 0 and known variance $\sigma_e^2$

Fay-Herriot Model

$$\hat{\theta}_i = x_i'\beta + z_i u_i + e_i, \quad i = 1, \ldots, m$$

Note: this is a special case of the general linear mixed model with diagonal covariance structure





Under the above-mentioned assumptions

$\hat{\theta}_i \sim N(x_i'\beta, z_i^2\sigma_u^2 + \sigma_e^2)$

Let us introduce matrix notation

$\hat{\theta} = X\beta + Zu + e$

$u \sim N(0, G)$ and $e \sim N(0, R)$

$\hat{\theta} \sim N(X\beta, ZGZ' + R)$, let $V = ZGZ' + R$

Given the estimates of $\beta$ and $u$ we obtain the Best Linear Unbiased Predictor (BLUP) for $\theta$

$\tilde{\beta} = (X'V^{-1}X)^{-1}X'V^{-1}y$, where $y$ is the vector of the observations (here the direct estimates $\hat{\theta}_i$)

$\tilde{u} = GZ'V^{-1}(y - X\tilde{\beta})$

Note: estimates for $\beta$ and $u$ can be obtained by penalized maximum likelihood ($u$ considered as fixed)





In the “real world” $G$ (and $R$) are unknown and they must be estimated

Using restricted maximum likelihood, optimized with a scoring algorithm, we obtain an estimate of $\sigma_u^2$ (and hence of $G$)

In the Fay-Herriot model $\sigma_e^2$ is considered as known (we use the sampling variance)

Plugging the estimated area-specific variance component $\hat{\sigma}_u^2$ into the estimators of $\beta$ and $u$ we obtain their estimates

$\hat{\beta} = (X'\hat{V}^{-1}X)^{-1}X'\hat{V}^{-1}y$

$\hat{u} = \hat{G}Z'\hat{V}^{-1}(y - X\hat{\beta})$





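Finally, using the obtained estimates in the Fay-Herriot model we have the Empirical BLUP (EBLUP) for the parameter of interest $\theta$

$$\hat{\theta}_i^{EBLUP} = x_i'\hat{\beta} + z_i\hat{u}_i = \hat{\phi}_i\hat{\theta}_i + (1 - \hat{\phi}_i)\,x_i'\hat{\beta}$$

$\hat{\phi}_i = \frac{z_i^2\hat{\sigma}_u^2}{z_i^2\hat{\sigma}_u^2 + \sigma_e^2}$ is the shrinkage factor

$\hat{\theta}_i$ is the design (direct) estimator of $\theta_i$

Note: with this procedure the estimate of $\sigma_u^2$ may turn out to be negative; in this case it must be truncated to 0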
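The shrinkage form of the Fay-Herriot predictor can be sketched as follows. This is an illustration only: it takes the estimate of $\sigma_u^2$ as given rather than computing it by restricted maximum likelihood, it allows area-specific sampling variances (the argument `psi`, a hypothetical name), and the toy numbers are invented.

```python
import numpy as np

def fay_herriot_eblup(theta_dir, X, psi, sigma2_u, z=None):
    """Fay-Herriot predictor for a given estimate of sigma2_u.

    theta_dir: direct estimates; X: area-level covariates; psi: sampling variances.
    A negative sigma2_u is truncated to 0, as noted in the text.
    """
    theta_dir = np.asarray(theta_dir, dtype=float)
    X = np.asarray(X, dtype=float)
    psi = np.asarray(psi, dtype=float)
    z = np.ones_like(theta_dir) if z is None else np.asarray(z, dtype=float)
    sigma2_u = max(float(sigma2_u), 0.0)

    v = z**2 * sigma2_u + psi                 # diagonal of V = ZGZ' + R
    XtVinv = (X / v[:, None]).T
    beta = np.linalg.solve(XtVinv @ X, XtVinv @ theta_dir)
    phi = z**2 * sigma2_u / v                 # shrinkage factors
    eblup = phi * theta_dir + (1.0 - phi) * (X @ beta)
    return eblup, phi, beta

# Toy data for four areas
theta_dir = np.array([10.0, 12.0, 9.0, 15.0])
X = np.column_stack([np.ones(4), [1.0, 2.0, 1.5, 3.0]])
psi = np.array([1.0, 4.0, 0.5, 2.0])
eblup, phi, beta = fay_herriot_eblup(theta_dir, X, psi, sigma2_u=2.0)
print(phi, eblup)
```

Areas with a large sampling variance get a small shrinkage factor and are pulled towards the regression-synthetic value, while precisely estimated areas stay close to their direct estimate.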





The MSE of the Fay-Herriot small area estimator is

$$MSE(\hat{\theta}_i) = g_{1i}(\sigma_u^2) + g_{2i}(\sigma_u^2) + g_{3i}(\sigma_u^2)$$

$g_{1i}(\sigma_u^2)$ is due to random errors (order $O(1)$)

$g_{2i}(\sigma_u^2)$ is due to the estimation of $\beta$ (order $O(m^{-1})$, given some conditions)

$g_{3i}(\sigma_u^2)$ is due to the estimation of $\sigma_u^2$

An approximately correct estimate of the MSE is

$$\widehat{MSE}(\hat{\theta}_i) = g_{1i}(\hat{\sigma}_u^2) + g_{2i}(\hat{\sigma}_u^2) + 2g_{3i}(\hat{\sigma}_u^2)$$




EBLUP at the area level: example

In this example we use the Fay-Herriot model to estimate the mean agrarian surface area used for production of grape ($\theta_i$) in each municipality of the Tuscany region

Population data come from the Italian Agricultural Census of year 2000 for the region of Tuscany

The census collects information about farm land by type of cultivation, amount of breeding, kind of production, structure and amount of farm employment

We considered as small areas the 274 municipalities of Tuscany, with population sizes $N_i$, $i = 1, \ldots, m$ given by the census

We use the census data on the agrarian surface area used for production in hectares ($x_{1i}$) and on the average number of working days in the reference year ($x_{2i}$) as covariates in the model




EBLUP at the area level: example

Sample data are collected from a simple random sample of size $n_i$ from each area, with sampling fractions $n_i/N_i$ approximately constant and equal to 0.05

These data are used to compute, for each municipality $i$, the direct estimator of the mean agrarian surface area used for production of grape in hectares ($y_i$) and its sampling variance ($\psi_i$)

Figure : Data on grape production





The results obtained with the Fay-Herriot model for the first 10 Municipalities are shown in the following table

Figure : Estimates on grape production





As we can see from the second table, the MSEs of the FH-EBLUP estimates are lower than the variances of the direct estimates in the first table

Readers interested in the results for all the 247 municipalities can refer to the deliverables of the SAMPLE project (www.sample-project.eu)




EBLUP at the area level: advantages, disadvantages, extensions




Conclusions on EBLUPs

The EBLUPs are model-based estimators

EBLUPs can also be used when we know only the average of the auxiliary variables

In many applications the EBLUPs perform better than the design-based estimators in terms of relative root MSE (smaller confidence intervals)

Currently, EBLUPs are used as a standard technique to derive small area statistics





Drawbacks

Assumption of normality is needed for area effects and individual effects (but sensitivity analysis shows that the model is robust against non-normality if symmetry of the distributions holds)

It is not design-unbiased, in the sense that under a complex survey design the estimates could be biased

Parameters of interest in out-of-sample areas (areas with 0 observations) cannot be estimated (EBLUP needs a minimum of two observations per area)

Extensions to the model are not easily implementable (complex derivation of the MSE estimator)





Improvements for the EBLUP

Spatial process (CAR and SAR models)
Time process
Spatio-temporal process
Robust estimation
Binary and count data models

Alternative approach

Quantile/M-Quantile approach



M-Quantile

Quantiles

Given $p$, with $p \in (0, 1)$, the quantile $q_p$ of a random variable $X$ is defined by:

$$\int_{-\infty}^{q_p} dF(x) = p$$

Quantiles are points taken at regular intervals from the cumulative distribution function (CDF) of a random variable.

The $k$th $q$-quantile of a random variable is the value $x$ such that the probability that the random variable will be less than $x$ is at most $k/q$ and the probability that the random variable will be more than $x$ is at most $(q - k)/q = 1 - (k/q)$.

Example: the 4-quantiles are called quartiles and they are three values that divide the distribution of the values of $X$ into four parts. The first quartile is associated with $p = 0.25$, the second with $p = 0.5$ and the third with $p = 0.75$. The second quartile is the median.




M-quantiles

Given $p$, with $p \in (0, 1)$, the M-Quantile $\theta_p$ of a random variable $X$ is defined by:

$$\int \psi_p(x - \theta_p)\, dF(x) = 0$$

where

$$\psi_p(u) = \begin{cases} (1 - p)\,\psi(u) & u < 0 \\ p\,\psi(u) & u \geq 0 \end{cases}$$

and $\psi(u)$ is an appropriately chosen influence function

The M-Quantile is a generalization of the quantile concept (it includes the quantile as a particular case)
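For a finite sample the defining equation becomes $\sum_i \psi_p((x_i - \theta_p)/s) = 0$, which can be solved by iteratively reweighted averaging. The sketch below rests on assumptions not stated in the text: a Huber influence function with tuning constant `c`, residuals scaled by the rescaled median absolute deviation, and hypothetical function names.

```python
import numpy as np

def huber_psi(u, c=1.345):
    """Huber influence function: u inside [-c, c], clipped to +/-c outside."""
    return np.clip(u, -c, c)

def sample_m_quantile(x, p, c=1.345, tol=1e-10, max_iter=500):
    """Sample M-quantile of order p, solved by iteratively reweighted means."""
    x = np.asarray(x, dtype=float)
    s = np.median(np.abs(x - np.median(x))) / 0.6745   # robust scale (assumption)
    s = s if s > 0 else 1.0
    theta = np.median(x)
    for _ in range(max_iter):
        r = (x - theta) / s
        safe = np.where(r == 0, 1.0, r)
        base = np.where(r == 0, 1.0, huber_psi(r, c) / safe)   # psi(r)/r, limit 1 at r = 0
        w = np.where(r < 0, 1.0 - p, p) * base                 # asymmetric weights
        new_theta = np.sum(w * x) / np.sum(w)
        if abs(new_theta - theta) < tol:
            theta = new_theta
            break
        theta = new_theta
    return theta

x = np.array([1.0, 2.0, 3.0, 4.0, 10.0])
print([sample_m_quantile(x, p) for p in (0.25, 0.5, 0.75)])
```

With a very large `c` the influence function is the identity and the M-quantile of order 0.5 is the sample mean (an expectile in general); as `c` shrinks towards 0 the solution approaches the ordinary quantile.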




M-quantile Regression

M-quantile Linear Regression

Dependent variable (y1, . . . , yn)

Auxiliary variables (x1, . . . , xk)

$y_i = \alpha_p + x_i'\beta_p + \varepsilon_i$, so that $\theta_p(x_i) = \alpha_p + x_i'\beta_p$

The M-Quantile $\theta_p$ of order $p$, with $p \in (0, 1)$, of the conditional distribution $Y|X$ is defined by:

$$\int \psi_p(y - \theta_p(x))\, dF(y|x) = 0$$

with

$$\psi_p(r) = \begin{cases} (1 - p)\,\psi(r) & r < 0 \\ p\,\psi(r) & r \geq 0 \end{cases}$$

M-Quantile regression is a unified model that includes quantile regression and expectile regression as particular cases




How to use M-quantile model to obtain estimates for the random area effects

Small area model-based estimators borrow strength from all the sample to capture random area effects, given the hierarchical structure of the data. M-quantile regression does not depend on a hierarchical structure. We can characterise conditional variability across the population of interest by the M-quantile coefficients of the population units

The linear mixed effects model captures random area effects as differences in the conditional distribution of y given x between small areas

The M-Quantile model determines the area effect with the M-Quantile coefficients of the units belonging to the area





Assume that we have individual level data on y and x

Each sample value of (x, y) will lie on one and only one M-Quantile line

We refer to the p-value of this line as the M-Quantile coefficient of the corresponding sample unit. So every sample unit will have an associated p-value

In order to estimate these unit-specific p-values, we define a fine grid of p-values (e.g. 0.001, . . ., 0.999) that adequately covers the conditional distribution of y given x.

We fit an M-Quantile model for each p-value in the grid and use linear interpolation to estimate a unique p-value, $p_j$, for each individual j in the sample
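The grid-and-interpolation procedure can be sketched as follows. This is a rough illustration rather than the authors' implementation: the fitting routine is a basic iteratively reweighted least squares with a Huber influence function and a MAD scale, the grid is much coarser than the one suggested above, the data are simulated, and all names are hypothetical. A monotone envelope of the fitted values is taken before interpolating, as a safeguard against M-quantile lines that cross in finite samples.

```python
import numpy as np

def mq_fit(X, y, p, c=1.345, n_iter=200, tol=1e-8):
    """Linear M-quantile regression of order p by iteratively reweighted least squares."""
    beta = np.linalg.lstsq(X, y, rcond=None)[0]
    for _ in range(n_iter):
        res = y - X @ beta
        s = np.median(np.abs(res - np.median(res))) / 0.6745   # robust residual scale
        s = s if s > 0 else 1.0
        r = res / s
        safe = np.where(r == 0, 1.0, r)
        w = np.where(r < 0, 1.0 - p, p) * np.where(r == 0, 1.0, np.clip(r, -c, c) / safe)
        XtW = (X * w[:, None]).T
        new_beta = np.linalg.solve(XtW @ X, XtW @ y)
        converged = np.max(np.abs(new_beta - beta)) < tol
        beta = new_beta
        if converged:
            break
    return beta

def unit_q_values(X, y, grid):
    """For each unit, interpolate the order p whose fitted M-quantile line passes through y."""
    fits = np.column_stack([X @ mq_fit(X, y, p) for p in grid])  # one column per grid value
    fits = np.maximum.accumulate(fits, axis=1)                   # enforce monotonicity in p
    return np.array([np.interp(y[j], fits[j], grid) for j in range(len(y))])

# Simulated sample: y = 1 + 2x + noise
rng = np.random.default_rng(1)
x = np.linspace(0, 10, 50)
y = 1.0 + 2.0 * x + rng.normal(0, 1, 50)
X = np.column_stack([np.ones_like(x), x])
grid = np.linspace(0.05, 0.95, 19)
q = unit_q_values(X, y, grid)
print(q.round(2))
```

Units lying above the central line receive coefficients above 0.5 and units below it receive coefficients below 0.5; values outside the fitted range are clamped to the ends of the grid.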





Figure : (a) Sample data, (b) M-quantile lines, (c) M-quantile lines associated to each unit, (d) M-quantile area lines



How to use M-quantile model to measure small area effects

Calculate an M-Quantile coefficient for each area by suitably averaging the q-values (the unit-level M-quantile coefficients) of the sampled individuals in that area. Denote this area-specific q-value by $\theta_i$

The M-Quantile small area model is

$$y_{ij} = x_{ij}^T\beta_\psi(\theta_i) + \varepsilon_{ij}$$

$\beta_\psi$ is the unknown regression vector

$\theta_i$ is the unknown area-specific coefficient

εij is an individual disturbance




The linear M-quantile small area model

Linear M-quantile small area model: $y_{ij} = x_{ij}^T\beta_\psi(\theta_i) + \varepsilon_{ij}$

$\beta_\psi$ is estimated using iteratively weighted least squares

$\theta_i$ is obtained by averaging the q-values of the sampled units belonging to area $i$

$\psi(u) = u\, I(|u| \leq c) + sgn(u)\, c\, I(|u| > c)$ (Huber proposal 2 influence function)

$\varepsilon_{ij}$ has an unspecified distribution

The predictor for the target variable of the non-sampled unit $k$ in area $i$ is

$$\hat{y}_{ki} = x_{ki}^T\hat{\beta}_\psi(\hat{\theta}_i)$$




Small Area Mean Estimate with M-Quantile Model

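Given estimates of $\beta$ and $\theta_i$ we can obtain the small area mean estimator

Small Area Mean Estimator

$$\hat{\bar{Y}}_i = N_i^{-1}\Big(\sum_{j \in s_i} y_{ij} + \sum_{j \in r_i} x_{ij}'\hat{\beta}_\psi(\hat{\theta}_i) + \frac{N_i - n_i}{n_i}\sum_{j \in s_i}\big(y_{ij} - x_{ij}'\hat{\beta}_\psi(\hat{\theta}_i)\big)\Big)$$

and the MSE estimator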

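MSE estimator for the M-Quantile Small Area Mean Estimator

$$\hat{V}_i = \hat{V}(\hat{\bar{y}}_i - \bar{y}_i) = N_i^{-1}\Big(\sum_{j \in s} u_{ij}^2\big(y_{ij} - x_{ij}'\hat{\beta}_\psi(\hat{\theta}_i)\big)^2 + \frac{N_i - n_i}{n_i - 1}\sum_{j \in r_i}\big(y_{ij} - x_{ij}'\hat{\beta}_\psi(\hat{\theta}_i)\big)^2\Big)$$

$$\hat{B}_i = N_i^{-1}\Big(\sum_{i=1}^{m}\sum_{j \in s_i} w_{ij}\, x_{ij}'\hat{\beta}_\psi(\hat{\theta}_i) - \sum_{j \in s_i} x_{ij}'\hat{\beta}_\psi(\hat{\theta}_i)\Big)$$

$$\widehat{MSE}_i = \hat{V}_i + \hat{B}_i^2$$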
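The small area mean estimator given on this slide combines three pieces: the observed values, the predictions for the non-sampled units, and a bias adjustment based on the mean sample residual. A minimal sketch, assuming the area-specific coefficient vector $\hat{\beta}_\psi(\hat{\theta}_i)$ has already been estimated (here it is simply supplied as `beta_area`; all names and numbers are hypothetical):

```python
import numpy as np

def mq_area_mean(y_s, X_s, X_r, beta_area):
    """Bias-adjusted M-quantile estimator of an area mean.

    y_s, X_s: responses and covariates of the n_i sampled units of the area
    X_r: covariates of the N_i - n_i non-sampled units of the area
    beta_area: regression vector evaluated at the area M-quantile coefficient
    """
    y_s = np.asarray(y_s, dtype=float)
    X_s = np.asarray(X_s, dtype=float)
    X_r = np.asarray(X_r, dtype=float)
    n = y_s.size
    N = n + X_r.shape[0]
    resid = y_s - X_s @ beta_area                    # sample residuals
    total = y_s.sum() + (X_r @ beta_area).sum() + (N - n) / n * resid.sum()
    return total / N

# Toy area: 2 sampled and 3 non-sampled units, intercept plus one covariate
X_s = np.array([[1.0, 1.0], [1.0, 2.0]])
X_r = np.array([[1.0, 3.0], [1.0, 4.0], [1.0, 5.0]])
beta_area = np.array([8.0, 2.0])
print(mq_area_mean([10.0, 12.0], X_s, X_r, beta_area))   # residuals are zero
print(mq_area_mean([11.0, 13.0], X_s, X_r, beta_area))   # each residual equals 1
```

When the sampled units sit systematically above the fitted line, the adjustment shifts every non-sampled prediction by the mean residual, so in the second call the estimate rises by exactly 1.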




Example: M-quantile estimates using EU-SILC data

Data on the equivalised income in 2007 are available from the EU-SILC survey 2008 for 1495 households in the 10 Tuscany Provinces, for 1286 households in the 5 Campania Provinces and for 2274 households in the 11 Lombardia Provinces

The target variable is the equivalised household income already defined for the unit-level EBLUP example

A set of explanatory variables is available for each unit in the population from the Population Census 2001

We employ an M-quantile model to estimate the mean of the household equivalised income at LAU 1 level (Provinces)




Example: EU-SILC data

Remark 1: it is important to underline that EU-SILC data are confidential. These data were provided by ISTAT, the Italian National Institute of Statistics, to the researchers of the SAMPLE project and were analyzed respecting all confidentiality restrictions

Remark 2: We chose the Campania, Lombardia and Toscana regions because they are representative respectively of the South, North and Center of Italy

Remark 3: The choice of three representative regions for North, Center and South of Italy has been driven by the well known North-South divide




Example: Target variable statistics

Figure : Boxplots of the disposable equivalised household income (Campania, Lombardia, Toscana)




Example: Target variable statistics

The boxplots show evidence of a skewed distribution of the household equivalised income, with a heavy right tail, in all three regions

The boxplots show evidence of outliers

Evidence of outliers also emerges from the summary statistics obtained using the cross-sectional EU-SILC household weights:

            Min.       1st Qu.   Median   Mean    3rd Qu.   Max.
Campania    −852.7     8073      11560    13550   17430     99400
Lombardia   −21550.0   12620     17670    20040   24000     209800
Toscana     −2849.0    12120     17230    19430   23570     107900

Table : Summary statistics for the household equivalised income




Example: Census 2001 data

Italian Census 2001 was collected by ISTAT

Campania region accounts for 1,862,855 households

Lombardia region accounts for 3,652,944 households

Toscana region accounts for 1,388,252 households

Available variables: household size (integer value), ownership of dwelling (owner/tenant), age of the head of the household (integer value), years of education of the head of the household (integer value), working position of the head of the household (employed / unemployed in the previous week), gender of the head of the household, civil status of the head of the household




Example: Model Specification

We used the M-quantile linear model to compute the means at LAU 1 level (Provinces)

The selection of covariates to fit the small area models relies on prior studies of poverty assessment

The following covariates were selected:

household size (integer value)
ownership of dwelling (owner/tenant)
age of the head of the household (integer value)
years of education of the head of the household (integer value)
working position of the head of the household (employed / unemployed in the previous week)
gender of the head of the household




Example: Estimates of mean income for Lombardia Provinces

Figure : Mean of Households Equivalised Income, Lombardia Provinces (map labels; values in parentheses as reported on the map)

VARESE    21091.49 (1305.98)
COMO      18578.33 (1137.01)
SONDRIO   16307.16 (1668.92)
MILANO    20798.63 (497.68)
BERGAMO   18323.07 (820.61)
BRESCIA   16326.21 (581.47)
PAVIA     21081.25 (4080.17)
CREMONA   16774.18 (883.69)
MANTOVA   17774.9 (677.24)
LECCO     19497.61 (1131.62)
LODI      17052.58 (965.49)




Example: Estimates of mean income for Toscana Provinces

Figure : Mean of Households Equivalised Income, Toscana Provinces (map labels; values in parentheses as reported on the map)

MASSA CARRARA   14128.26 (664.84)
LUCCA           15867.69 (766.8)
PISTOIA         18980.76 (1119.33)
FIRENZE         19184.92 (498.35)
LIVORNO         17875.01 (919.41)
PISA            18550.16 (876.37)
AREZZO          18665.97 (1014.42)
SIENA           20228.98 (1113.91)
GROSSETO        16152.47 (1151.84)
PRATO           17702.87 (632.74)




Example: Estimates of mean income for Campania Provinces

Figure : Mean of Households Equivalised Income, Campania Provinces (map labels; values in parentheses as reported on the map)

CASERTA     11685.74 (574.89)
BENEVENTO   11312.89 (1033.79)
NAPOLI      12661.84 (291.73)
AVELLINO    12873.13 (979.46)
SALERNO     12715.91 (502.22)




MQ approach to SAE: advantages, disadvantages, extensions




Advantages and drawbacks of the M-Quantile approach with respect to the EBLUP approach

Main advantages

i. Distributional assumptions on parameters are not needed
ii. Assumptions on the hierarchical structure are not needed
iii. The M-Quantile model is robust against outliers
iv. It is easy to implement non-parametric M-Quantile approaches
v. The bootstrap approach to the estimation of the MSE is faster than the bootstrap for the EBLUP (mixed linear models require double bootstrap techniques)

Main drawbacks

i. There is no specification of the M-Quantile model for nominal response variables
ii. There is no specification if the response variable is multivariate
iii. Estimators can be design-biased (even if they are model-unbiased)




Essential bibliography

Breckling, J. and Chambers, R. (1988). M-quantiles. Biometrika, 75, 761–771.

Brunsdon, C., Fotheringham, A.S. and Charlton, M. (1996). Geographically weighted regression: a method for exploring spatial nonstationarity. Geographical Analysis, 28, 281–298.

Chambers, R. and Dunstan, R. (1986). Estimating distribution functions from survey data. Biometrika, 73, 597–604.

Chambers, R. and Tzavidis, N. (2006). M-quantile models for small area estimation. Biometrika, 93, 255–268.

Chambers, R.L., Chandra, H. and Tzavidis, N. (2011). On bias-robust mean squared error estimation for pseudo-linear small area estimators. Survey Methodology, 37, 153–170.

Cheli, B. and Lemmi, A. (1995). A Totally Fuzzy and Relative Approach to the Multidimensional Analysis of Poverty. Economic Notes, 24, 115–134.

Elbers, C., Lanjouw, J.O. and Lanjouw, P. (2003). Micro-level estimation of poverty and inequality. Econometrica, 71, 355–364.

Fotheringham, A.S., Brunsdon, C. and Charlton, M. (2002). Geographically Weighted Regression. John Wiley and Sons, West Sussex.

Foster, J., Greer, J. and Thorbecke, E. (1984). A class of decomposable poverty measures. Econometrica, 52, 761–766.

Hall, P. and Maiti, T. (2006). On parametric bootstrap methods for small area prediction. Journal of the Royal Statistical Society: Series B, 68, 2, 221–238.

Lombardia, M.J., Gonzalez-Manteiga, W. and Prada-Sanchez, J.M. (2003). Bootstrapping the Chambers-Dunstan estimate of finite population distribution function. Journal of Statistical Planning and Inference, 116, 367–388.

Marchetti, S., Tzavidis, N. and Pratesi, M. (2012). Nonparametric Bootstrap Mean Squared Error Estimation for M-quantile Estimators for Small Area Averages, Quantiles and Poverty Indicators. Computational Statistics and Data Analysis, 56, 2889–2902.

Royall, R. and Cumberland, W.G. (1978). Variance Estimation in Finite Population Sampling. Journal of the American Statistical Association, 73, 351–358.

Salvati, N., Tzavidis, N., Pratesi, M. and Chambers, R. (2010). Small Area Estimation Via M-quantile Geographically Weighted Regression. [Paper submitted for publication in TEST]

Salvati, N., Chandra, H., Ranalli, M.G. and Chambers, R. (2010). Small Area Estimation Using a Nonparametric Model Based Direct Estimator. Journal of Computational Statistics and Data Analysis, 54, 2159–2171.

Tzavidis, N., Marchetti, S. and Chambers, R. (2010). Robust estimation of small area means and quantiles. Australian and New Zealand Journal of Statistics, 52, 2, 167–186.

Tzavidis, N., Salvati, N., Pratesi, M. and Chambers, R. (2007). M-quantile models for poverty mapping. Statistical Methods & Applications, 17, 393–411.