
Sampling as a way to reduce risk and create a Public Use File maintaining weighted totals

Date post: 30-Dec-2015
Joint UNECE-Eurostat worksession on confidentiality, 2011, Tarragona Sampling as a way to reduce risk and create a Public Use File maintaining weighted totals Maria Cristina Casciano, Laura Corallo, Daniela Ichim
Transcript
Page 1: Sampling as a way to reduce risk and create a Public Use File maintaining weighted totals

Joint UNECE-Eurostat worksession on confidentiality, 2011, Tarragona

Maria Cristina Casciano, Laura Corallo, Daniela Ichim

Page 2: Outline

• Multiple releases: MFR and PUF
• Subsampling
  – allocation: reduce the risk of disclosure
  – selection: pre-defined quality standards
• Results
  – Career of Doctorate Holders Survey
• Further work

Page 3: Multiple countries, multiple surveys, multiple releases

[Diagram: for each country (MS1, MS2, …, MS27) and each survey (SURVEY1, SURVEY2, …, SURVEYX) there are multiple releases: TABLES, PUF, MFR, OTHER]

Page 4: Comparability

• ESSnet on SDC harmonisation and common tools
  – WP1: test the comparability concept
  – Istat, Destatis, Statistics Austria
  – multiple countries

1. Assessment of effects of different practices on predefined statistics
2. Definition of a threshold to define when action is needed
3. Setting a process for choosing acceptable practices

HOW

Page 5: Multiple releases

SURVEY1 TABLES1 PUF1 MFR1 OTHER1

• A particular harmonisation dimension
• Hierarchical structure
  – Utility
  – Risk of disclosure

Page 6: Multiple releases: hierarchical structure

• MFR: more restrictive license, less aggregated information
• PUF: less restrictive license, more aggregated information

UNIQUE PRODUCTION PROCESS!

Page 7: PUF-MFR

• MFR
  – definition of a disclosure scenario
  – risk assessment R1
  – risk limitation w.r.t.
    • adopted disclosure scenario
    • some data utility requirements
• PUF
  – harmonized with the MFR (e.g. weighted totals)
  – reduced risk of disclosure
  – random sample
  – internal consistency of records
  – some (other) data utility requirements (CV and weighted totals: precision and accuracy)

Page 8: Data description

Doctorate Holders: CDH 2009 Survey

[Timeline: Year t-5, Year t-3, Year t]

Estimates by PhD scientific area, by gender and by region

Focus on the characterisation of the occupational status of the PhD holders:
• labour market entry
• usefulness of the PhD for obtaining a job
• type of contract
• type of work
• earnings
• job satisfaction

Page 9: Data description

• 18500 PhD holders (census)
• 12964 respondents
• 72% response, 28% non-response

Adjustment for non-responses via calibration: weights obtained by constraining on known marginal distributions:
• Citizenship (2 categories)
• PhD Scientific Area (14 categories)
• Gender
• Region
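As an illustration of weights "obtained by constraining on known marginal distributions", here is a minimal raking (iterative proportional fitting) sketch. Raking is only one possible calibration method, and the slide does not say which one was used. The function name `rake`, the toy records and the target totals are all hypothetical.

```python
def rake(records, margins, n_iter=200, tol=1e-12):
    """Adjust unit weights so weighted counts match known marginal totals.

    records: list of dicts holding categorical values and a start weight "w"
    margins: {variable: {category: known population total}}
    (Hypothetical helper; the margins must share the same grand total.)
    """
    weights = [r["w"] for r in records]
    for _ in range(n_iter):
        max_change = 0.0
        for var, targets in margins.items():
            # Current weighted count in each category of this variable
            current = {cat: 0.0 for cat in targets}
            for r, w in zip(records, weights):
                current[r[var]] += w
            # Rescale every unit by target / current for its category
            for i, r in enumerate(records):
                factor = targets[r[var]] / current[r[var]]
                weights[i] *= factor
                max_change = max(max_change, abs(factor - 1.0))
        if max_change < tol:
            break
    return weights


# Toy respondents (invented): gender and citizenship, start weight 1
toy = [
    {"gender": "F", "cit": "IT", "w": 1.0},
    {"gender": "F", "cit": "IT", "w": 1.0},
    {"gender": "F", "cit": "other", "w": 1.0},
    {"gender": "M", "cit": "IT", "w": 1.0},
    {"gender": "M", "cit": "other", "w": 1.0},
    {"gender": "M", "cit": "other", "w": 1.0},
]
toy_margins = {
    "gender": {"F": 60.0, "M": 40.0},
    "cit": {"IT": 70.0, "other": 30.0},
}
toy_weights = rake(toy, toy_margins)
```

After raking, the weighted counts by gender and by citizenship both reproduce the assumed population totals, which is the property the calibrated survey weights are meant to have.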

Page 10: PUF-subsampling

Simple random sampling
• Utility: weighted totals may always be preserved by calibration
• Risk: how many units at risk are sampled?
• Example (MFR-CDH): 12964 units, 24.7% of units at risk
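The question "how many units at risk are sampled?" can be made concrete with a small sketch. Under simple random sampling without replacement the number of at-risk units in the subsample is hypergeometric, so its expectation is the sample size times the share at risk. The file size and the 24.7% share come from the slide; the subsample size of 5000 and the function names are assumptions for illustration.

```python
import random

N_UNITS = 12964                      # units in the MFR-CDH file (from the slide)
AT_RISK = round(0.247 * N_UNITS)     # roughly 24.7% of units at risk (from the slide)


def expected_at_risk(n_sample, n_units=N_UNITS, at_risk=AT_RISK):
    """Expected number of at-risk units in a simple random sample."""
    return n_sample * at_risk / n_units


def simulate_at_risk(n_sample, seed=0, n_units=N_UNITS, at_risk=AT_RISK):
    """Draw one simple random sample and count the at-risk units in it.

    Units 0 .. at_risk-1 are labelled as at risk (arbitrary labelling).
    """
    rng = random.Random(seed)
    sample = rng.sample(range(n_units), n_sample)
    return sum(1 for k in sample if k < at_risk)


n_puf = 5000  # hypothetical PUF size, not taken from the slides
print(expected_at_risk(n_puf), simulate_at_risk(n_puf))
```

Simple random sampling leaves the share of at-risk units unchanged on average, which is why the deck moves on to an allocation that takes risk into account.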

Page 11: Subsampling

[Keyword diagram: allocation, domains, utility, disclosure, sample size, stratification, dissemination, totals, scenario, calibration, key variables, quality, users, auxiliary]

Page 12: PUF-subsampling: proposal

1. Optimal allocation of units to be sampled in each domain according to Bethel’s approach (risk minimization)
2. Selection of a fixed-size balanced sample (CUBE method) (data utility maximization)

Page 13: 1. Bethel’s approach (1989)

● Cost function to minimize:

$C' = C_0 + \sum_{h=1}^{H} C_h n_h$

● Expected Coefficient of Variation (CV) of the estimates of the total of variable $p$ in domain $d_j$ equal to or lower than prefixed thresholds:

$CV_{d_j p} \le CV^{*}_{d_j p}$

$n_h$ and $C_h$ related to the risk to be reduced

Optimal allocation: $n_h^{*}$
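To show how a linear cost and a CV threshold interact, here is a minimal sketch of the classical single-constraint case: one variable, one domain, stratified simple random sampling. This is not Bethel's multivariate algorithm, which handles several CV constraints at once; it is the textbook closed form that Bethel's approach generalises. The strata sizes, standard deviations, costs and the total below are invented, and the function names are hypothetical.

```python
import math


def cost_optimal_allocation(N, S, C, total, cv_target):
    """Minimise sum(C_h * n_h) subject to CV(estimated total) <= cv_target.

    N, S, C: per-stratum population sizes, standard deviations, unit costs.
    Closed form for ONE constraint only (a simplification, not Bethel 1989).
    """
    v_target = (cv_target * total) ** 2
    a = v_target + sum(Nh * Sh ** 2 for Nh, Sh in zip(N, S))
    k = sum(Nh * Sh * math.sqrt(Ch) for Nh, Sh, Ch in zip(N, S, C)) / a
    n = [math.ceil(Nh * Sh / math.sqrt(Ch) * k) for Nh, Sh, Ch in zip(N, S, C)]
    for nh, Nh in zip(n, N):
        if nh > Nh:
            raise ValueError("allocation exceeds stratum size; closed form not valid")
    return n


def achieved_cv(N, S, n, total):
    """CV of the stratified estimator of the total for a given allocation."""
    var = sum(Nh ** 2 * Sh ** 2 * (1 / nh - 1 / Nh) for Nh, Sh, nh in zip(N, S, n))
    return math.sqrt(var) / total


# Invented strata; a higher cost stands for a stratum with more units at risk
N_h = [4000, 5000, 3964]
S_h = [0.5, 0.4, 0.3]
C_h = [1.0, 1.5, 3.0]
n_h = cost_optimal_allocation(N_h, S_h, C_h, total=6000, cv_target=0.05)
cv = achieved_cv(N_h, S_h, n_h, total=6000)
```

Because the optimal size is proportional to $N_h S_h / \sqrt{C_h}$, raising the cost of a stratum lowers the number of units drawn from it, which is one way a risk-related cost can steer the sample away from risky strata.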

Page 14: 2. Balanced sampling

A sampling design is said to be balanced on the auxiliary variables $x = (x_1, \ldots, x_j, \ldots, x_p)'$ if and only if the balancing equations given by:

$\hat{X}_\pi = X$

are satisfied, where $X$ is the vector of known population totals and $\hat{X}_\pi$ is the H.-T. estimator.

Exact estimates for pre-defined variables
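Two standard special cases may help when reading the balancing equations. The notation below ($s$ for the selected sample, $\pi_k$ for the inclusion probability of unit $k$, $n$ for the expected sample size) is assumed here and is not spelled out on the slide.

```latex
\hat{X}_\pi \;=\; \sum_{k \in s} \frac{x_k}{\pi_k} \;=\; \sum_{k=1}^{N} x_k \;=\; X

% Balancing on x_k = \pi_k fixes the sample size:
\sum_{k \in s} \frac{\pi_k}{\pi_k} \;=\; \#s \;=\; \sum_{k=1}^{N} \pi_k \;=\; n

% Balancing on x_k = 1 reproduces the population size:
\sum_{k \in s} \frac{1}{\pi_k} \;=\; N
```

So a fixed sample size and an exact estimate of the population size are themselves balancing constraints, obtained by choosing the auxiliary variable appropriately.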

Page 15: Balanced sampling: the CUBE method

Geometrically, each vertex of the hypercube is a sample: $s \in \{0,1\}^N$

[Figure: cube for N = 3 with vertices (000), (100), (010), (110), (101), (011), (111) and the constraint sub-space K]

The balancing equations define a sub-space of $\mathbb{R}^N$ named K. The problem is to choose a vertex (sample) of the N-cube that remains in the sub-space of constraints K.

Cube method (Deville & Tillé, 2004):

1. Flight phase: a random walk starting from the vector of inclusion probabilities $\pi$ and moving in the intersection of the cube C and K. It stops at a vertex of the intersection of C and K, if this vertex exists.

2. Landing phase: at the end of the flight phase, if a sample is not exactly determined in C∩K, a sample is selected as close as possible to the constraints space K.
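The geometric picture can be checked by brute force on a tiny population. The sketch below is not the cube method: it simply enumerates every vertex of the N-cube and keeps those that satisfy the balancing equations, which is only feasible for very small N. The population, the inclusion probabilities and the function name are invented for illustration.

```python
from itertools import product


def balanced_vertices(pi, x, tol=1e-9):
    """Return all samples s in {0,1}^N whose H.-T. estimates equal the totals.

    pi: inclusion probabilities, one per unit
    x:  one row of auxiliary values per unit
    (Exhaustive search over 2^N vertices; illustration only.)
    """
    n_units, n_vars = len(pi), len(x[0])
    totals = [sum(row[j] for row in x) for j in range(n_vars)]
    found = []
    for s in product((0, 1), repeat=n_units):
        # Horvitz-Thompson estimate of each auxiliary total for this vertex
        est = [sum(s[k] * x[k][j] / pi[k] for k in range(n_units))
               for j in range(n_vars)]
        if all(abs(e - t) <= tol for e, t in zip(est, totals)):
            found.append(s)
    return found


# Four units, each sampled with probability 0.5.
# First auxiliary variable equals pi (fixes the sample size at 2),
# second is an invented size measure with population total 6.
pi_toy = [0.5, 0.5, 0.5, 0.5]
x_toy = [[0.5, 1.0], [0.5, 2.0], [0.5, 2.0], [0.5, 1.0]]
vertices_in_K = balanced_vertices(pi_toy, x_toy)
```

Out of the 16 vertices of this 4-cube only a few lie in K. The cube method reaches such a vertex by a random walk rather than by enumeration, and falls back on the landing phase when no vertex satisfies all equations exactly.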

Page 16: Implementation

1. Determination of the optimal strata sizes in terms of reduction of the overall risk (cost function), keeping the CV level of the estimates below a 5% threshold for three combinations of the allocation and domain variables
   • Allocation variables: Occup, JobS, Contract, Work, Income
   • Domain variables: Gender, Region, Scientific Area, Year of Completion

2. Six possible settings, corresponding to different choices of the parameters:
   a. Risk R1 used as the minimization cost of the algorithm
   b. Risk R1 used as a stratification variable
   c. Include all units of the strata containing no units at risk
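A small sketch of how the experiment grid can be enumerated. Reading the Y/N flags in the allocation table, the six settings appear to be the combinations of the three options in which risk R1 is used in at least one way (as cost, as stratification variable, or both); that reading, the flag names reused from the table header, and the function name are assumptions.

```python
from itertools import product

FLAGS = ("Risk.cost", "Risk.strat", "Cens.no.risk")


def risk_settings():
    """All Y/N combinations of the three options that use R1 at least once."""
    return [
        dict(zip(FLAGS, combo))
        for combo in product("NY", repeat=len(FLAGS))
        if "Y" in combo[:2]  # R1 must enter as cost or as stratification variable
    ]


settings = risk_settings()
n_variable_combinations = 3  # "three combinations" stated on the slide
n_allocations = len(settings) * n_variable_combinations
```

Six settings times three combinations of allocation and domain variables gives the 18 allocations that the results are reported for.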

Page 17: Allocations (CV* = 5%)

| C.S | Risk.cost | Risk.strat | Cens.no.risk | # Strata | # Cens.strata | # Cens.units | Size Bethel | Size Prop. | Size Equal | Max.Bethel-Prop | Max.Bethel-Equal |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 1 | N | Y | N | 925 | 153 | 252 | 4933 | 5391 | 5550 | 459 | 618 |
| 2 | N | Y | Y | 925 | 214 | 704 | 5105 | 5547 | 5550 | 443 | 446 |
| 3 | Y | Y | N | 925 | 204 | 558 | 5239 | 5719 | 5550 | 480 | 311 |
| 4 | Y | Y | Y | 925 | 235 | 814 | 5330 | 5781 | 5550 | 451 | 220 |
| 5 | Y | N | N | 925 | 240 | 687 | 5555 | 5953 | 6475 | 399 | 921 |
| 6 | Y | N | Y | 925 | 269 | 983 | 5649 | 6094 | 6475 | 446 | 827 |
| 7 | N | Y | N | 925 | 306 | 1614 | 8725 | 9256 | 9250 | 530 | 524 |
| 8 | N | Y | Y | 925 | 352 | 1919 | 8827 | 9324 | 9250 | 498 | 424 |
| 9 | Y | Y | N | 925 | 416 | 3229 | 8955 | 9424 | 9250 | 468 | 294 |
| 10 | Y | Y | Y | 925 | 451 | 3398 | 9045 | 9511 | 9250 | 466 | 205 |
| 11 | Y | N | N | 925 | 426 | 3243 | 9151 | 9601 | 9250 | 451 | 100 |
| 12 | Y | N | Y | 925 | 457 | 3399 | 9222 | 9669 | 9250 | 446 | 84 |
| 13 | N | Y | N | 56 | 0 | 0 | 4745 | 4773 | 4760 | 138 | 132 |
| 14 | N | Y | Y | 56 | 28 | 9761 | 10320 | 10346 | 10360 | 166 | 630 |
| 15 | Y | Y | N | 56 | 21 | 5844 | 8812 | 8841 | 8848 | 189 | 389 |
| 16 | Y | Y | Y | 56 | 28 | 9761 | 10323 | 10349 | 10360 | 166 | 630 |
| 17 | Y | N | N | 28 | 0 | 0 | 4760 | 4774 | 4788 | 176 | 88 |
| 18 | Y | N | Y | 28 | 0 | 0 | 4759 | 4774 | 4788 | 176 | 88 |

Page 18: Allocations

[Figure: Bethel sample size (vertical axis, 0 to 12000) plotted against CV (horizontal axis, 0 to 0.25) for allocations 1 to 18]

Page 19: Balanced sample

Selection of samples of fixed size from the CDH survey:

Utility constraints on:
• the population size N
• the optimal sample size n
• the marginal frequency distributions by Gender, Year of Doctorate Completion and Scientific Area

18 equations

CUBE algorithm:
I. Input vector is the optimal one determined by Bethel
II. Flight phase ends with no exact solution
III. Landing phase starts: selection of a sample which ensures a low difference to the balance, according to the distance between to

Page 20: Results

Median of absolute relative errors
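For reference, a minimal sketch of the summary statistic named on this slide. Which pairs of figures the authors compared is not stated in the transcript; the sketch assumes subsample estimates are compared with the corresponding reference totals, and the numbers and function name are invented.

```python
from statistics import median


def median_abs_relative_error(estimates, references):
    """Median of |estimate - reference| / |reference| over paired values.

    Assumes every reference value is non-zero.
    """
    return median(abs(e - r) / abs(r) for e, r in zip(estimates, references))


# Invented example: three subsample estimates against their reference totals
mare = median_abs_relative_error([98.0, 210.0, 305.0], [100.0, 200.0, 300.0])
```

The median is less sensitive than the mean to a few cells with very large relative errors, which often occur for small totals.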

Page 21: Results

| C.S | Risk.cost | Risk.strat | Cens.no.risk | Risk | Occup | JobS | Contract | Work | Income |
|---|---|---|---|---|---|---|---|---|---|
| 1 | N | Y | N | 1366 | 0.88 | 0.97 | 0.97 | 0.99 | 0.99 |
| 2 | N | Y | Y | 1333 | 0.92 | 0.99 | 0.94 | 0.97 | 0.99 |
| 3 | Y | Y | N | 1335 | 0.92 | 0.98 | 0.95 | 0.99 | 0.99 |
| 4 | Y | Y | Y | 1354 | 0.87 | 0.99 | 0.95 | 0.97 | 0.99 |
| 5 | Y | N | N | 1490 | 0.86 | 0.98 | 0.97 | 0.98 | 0.98 |
| 6 | Y | N | Y | 1525 | 0.91 | 0.98 | 0.95 | 0.97 | 0.99 |
| 7 | N | Y | N | 2194 | 0.83 | 0.91 | 0.99 | 0.97 | 1.00 |
| 8 | N | Y | Y | 2177 | 0.56 | 0.81 | 0.99 | 0.94 | 0.99 |
| 9 | Y | Y | N | 2149 | 0.78 | 0.91 | 0.99 | 0.91 | 1.00 |
| 10 | Y | Y | Y | 2163 | 0.64 | 0.88 | 0.97 | 0.95 | 0.99 |
| 11 | Y | N | N | 2232 | 0.63 | 0.87 | 0.99 | 0.86 | 1.00 |
| 12 | Y | N | Y | 2233 | 0.55 | 0.78 | 0.96 | 0.94 | 0.99 |
| 13 | N | Y | N | 1272 | 0.96 | 0.99 | 0.92 | 0.96 | 0.98 |
| 14 | N | Y | Y | 559 | 0.52 | 0.79 | 0.41 | 0.83 | 0.98 |
| 15 | Y | Y | N | 564 | 0.77 | 0.94 | 0.93 | 0.97 | 0.99 |
| 16 | Y | Y | Y | 562 | 0.56* | 0.84 | 0.59 | 0.88 | 0.99 |
| 17 | Y | N | N | 1270 | 0.95 | 0.99 | 0.98 | 0.99 | 0.99 |
| 18 | Y | N | Y | 1247 | 0.91 | 0.99 | 0.98 | 0.99 | 0.98 |

Page 22: Further work

1. the relationship between coefficients of variation and disclosure risk, together with different options for including the risk of disclosure in the sampling design;

2. the introduction of a utility-priority approach into the way the balancing equations are dealt with;

3. the use of other data utility constraints, to be investigated.

