
Selective Data Editing

The Third Baltic-Nordic Conference on Survey Statistics – BaNoCoss.

13-17 June 2011 in the High Coast area, Sweden

Mr Anders Norberg, Statistics Sweden (SCB)

If …

• we only want information from businesses that we know they have,

• and we ask for that information in a way they understand,

• and we motivate them to deliver data of as good quality as possible,

• and we help them to avoid accidental errors in answering questionnaires,

• then editing would be a minor process!

Editing

Editing is an activity of detecting, resolving and understanding errors in data and produced statistics

Where errors are introduced

• Errors in raw data delivered by respondents to the statistical agency are typically non-response and measurement errors

• Errors in data transmissions

• The statistics production process is a mixture of many activities with risks of introducing errors

Editing activities

A. Respondent editing
B. Manual editing before data registration
C. Data registration editing
D. Production editing / micro editing
   1 “Traditional” editing
   2 Selective editing
E. Coherence analysis
F. Output editing / macro editing
G. Evaluation
H. Delivery control

Types of errors

Obvious errors / Fatal errors
• Item non-response
• Non-valid values
• Data structure or model errors, e.g. in the total sum of components
• Contradictions

Suspected data values

Deviation errors (Outliers)
• Suspiciously high/low values, data outside of predetermined limits

Definition errors (Inliers)
• Many respondents misunderstand a question in the same way
• Many respondents fetch data from info-systems with other definitions

Suspected data values

Deviation errors
• Manual follow-up takes time and is expensive
• Few deviation errors have impact on output statistics (low hit-rate, many changes in data have very little impact)

Editing must have impact on the output! Remember response burden!

Suspected data values

Definition errors (Inliers)
• Difficult to find
• Ways to find them:
  – Combined editing for several surveys
  – Deep interviews in focus groups
  – Use statistics from FAQ and from re-contacts with respondents
  – High proportions of item non-response
  – Graphical editing
  – Good examples

The new role of editing

• Quality Control of the measurement process
  – Find errors (use efficient controls)
  – Consider every identified error as a sign that the respondent had a problem delivering correct data through our collection instrument
  – Identify sources of error (process data)
  – Analyse process data – communicate with cognitive specialists

• Contribute to quality declaration

• Adjust (change/correct) significant errors

Granquist (1997). The New View on Editing. International Statistical Review

The Process Perspective

• Audit and improve data collection
  – measurement instrument
  – collection process

• and the editing process itself

Un-edited data must be saved in order to produce important process indicators, such as hit-rate and impact on output!

Process indicators

• Sources of errors (problem for the respondents)

• Prop. of flagged units and variables

• Prop. of manually and automatically reviewed units and variables

• Prop. of amended values and impact of the changes, per variable

• Hit-rate for edits
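
As a rough illustration of how such indicators could be computed once both un-edited and edited values are stored, here is a minimal Python sketch. The record layout (`flagged`, `unedited`, `edited`) is hypothetical, and the hit-rate is taken to mean the share of flagged values that were actually amended, which is a common reading but is not defined on the slides.

```python
def process_indicators(values):
    """Compute simple editing process indicators.

    `values` is a list of dicts with the hypothetical keys
    'flagged' (bool), 'unedited' and 'edited' (numbers).
    """
    n = len(values)
    flagged = [v for v in values if v["flagged"]]
    # A flagged value counts as a "hit" if the review changed it.
    amended = [v for v in flagged if v["edited"] != v["unedited"]]
    return {
        "prop_flagged": len(flagged) / n,
        "prop_amended": len(amended) / n,
        # Assumed definition: amended flagged values / all flagged values.
        "hit_rate": len(amended) / len(flagged) if flagged else 0.0,
    }


example = [
    {"flagged": True, "unedited": 900, "edited": 90},
    {"flagged": True, "unedited": 55, "edited": 55},
    {"flagged": False, "unedited": 40, "edited": 40},
    {"flagged": False, "unedited": 70, "edited": 70},
]
indicators = process_indicators(example)
```

The sketch only works because the un-edited value is kept next to the edited one; without it neither the hit-rate nor the impact of changes can be derived.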

“Traditional” data editing


An EDIT is a checking rule / edit rule: a logical condition or a restriction on the value of a data item or a data group which must be met if the data is to be considered correct.

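An EDIT has:
• Test-variable
• Edit group
• Acceptance region

/* Edit group: Occupation = 'Doctor'; test-variable: Salary_Month; acceptance region: between 2900 and 7100 */
if Occupation = 'Doctor' and not (2900 < Salary_Month < 7100)
  then Errcode_A01 = 'Flag';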
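
The three parts of an edit can also be made explicit as data. The following Python sketch mirrors the example rule above; the `Edit` class and its field names are hypothetical and are not part of any tool mentioned in the slides.

```python
from dataclasses import dataclass
from typing import Any, Callable, Mapping


@dataclass(frozen=True)
class Edit:
    """One edit rule: edit group + test-variable + acceptance region."""
    code: str
    in_group: Callable[[Mapping[str, Any]], bool]  # edit group membership
    test_variable: str
    lower: float  # acceptance region is the open interval (lower, upper)
    upper: float

    def flags(self, record: Mapping[str, Any]) -> bool:
        """Return True if the record belongs to the edit group and its
        test-variable falls outside the acceptance region."""
        if not self.in_group(record):
            return False
        value = record[self.test_variable]
        return not (self.lower < value < self.upper)


EDIT_A01 = Edit(
    code="A01",
    in_group=lambda r: r["Occupation"] == "Doctor",
    test_variable="Salary_Month",
    lower=2900,
    upper=7100,
)

records = [
    {"Occupation": "Doctor", "Salary_Month": 5000},
    {"Occupation": "Doctor", "Salary_Month": 9000},
    {"Occupation": "Nurse", "Salary_Month": 9000},
]
flags = [EDIT_A01.flags(r) for r in records]
```

Only the second record is flagged: the first lies inside the acceptance region and the third is outside the edit group.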

Suspicion / Traditional edits

Finding acceptance limits: Data from previous survey rounds

Hourly wage distributed by SNI code at one-digit level.
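
The slides do not say how the limits are derived from the previous rounds. One plausible approach, shown here purely as an assumption, is to compute quartile-based fences per group (for example per one-digit SNI code) from last round's edited data. The function name, the `width` parameter and the factor 1.5 are illustrative choices, not taken from the source.

```python
from collections import defaultdict
from statistics import quantiles


def acceptance_limits(previous_round, width=1.5):
    """Derive (lower, upper) acceptance limits per group.

    `previous_round` is an iterable of (group, value) pairs from an
    earlier survey round. Limits are Q1 - width*IQR and Q3 + width*IQR,
    which is an assumed rule, not one stated in the slides.
    """
    by_group = defaultdict(list)
    for group, value in previous_round:
        by_group[group].append(value)

    limits = {}
    for group, values in by_group.items():
        q1, _median, q3 = quantiles(values, n=4, method="inclusive")
        spread = q3 - q1
        limits[group] = (q1 - width * spread, q3 + width * spread)
    return limits


# Hypothetical hourly wages for one industry group "C".
history = [("C", 10), ("C", 12), ("C", 14), ("C", 16), ("C", 18)]
limits = acceptance_limits(history)
```

Quartiles are used rather than mean and standard deviation because they are less affected by the very outliers the edit is meant to catch.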

Selective data editing

[Figure: potential impact plotted against suspicion (axis from 0 to 1), with flagged observations marked.]

A procedure which targets only some of the micro data variables or records for review by prioritizing the manual work.

Selective data editing

Criteria for prioritizing variables and records for review:
• Limited bias
• Limited variance

… imagining that 100% review would yield the best quality

Hedlin, D. (2008). Local and global score functions in selective editing. Invited paper, UNECE Work Session on Statistical Data Editing, Wien, Austria, 21-23 April.

Selective data editing

Construct a score function for prioritizing variables and records:

• Potential impact on statistics for records flagged by traditional edits

• Expected impact on statistics for variable values flagged as suspect by edits

Norberg, A. et al. (2010): A General Methodology for Selective Data Editing. Statistics Sweden

Selective data editing

The purpose of selective data editing is to reduce cost for the statistical agency as well as for the respondents, without significant decrease of the quality of the output statistics.

Selective data editing

Latouche, M. and Berthelot, J.-M. (1992): Use of a score function to prioritize and limit re-contacts in business surveys. Journal of Official Statistics, Vol. 8, pp. 389-400

Lawrence, D. and McDavitt, C. (1994): Significance Editing in the Australian Survey of Average Weekly Earnings. Journal of Official Statistics, Vol. 10, pp. 437-447

Granquist, L. (1995): Improving the Traditional Editing Process. In Business Survey Methods, eds. Cox et al., Wiley

Granquist, L. (1997): The New View on Editing. International Statistical Review

Granquist, L. and Kovar, J. (1997): Editing of survey data: How much is enough? In Survey measurement and process quality, eds. Lyberg et al., Wiley, pp. 415-435

Hedlin, D. (2008): Local and global score functions in selective editing. Invited paper, UNECE Work Session on Statistical Data Editing, Wien, Austria, 21-23 April.

Norberg, A. et al. (2010): A General Methodology for Selective Data Editing. Statistics Sweden

Ilves, K. (2010): Probability Approach to Editing. Workshop on Survey Sampling Theory and Methodology, Vilnius, Lithuania, August 23-27, 2010

Selective data editing

Statistics Sweden has developed a generic IT-tool for selective editing, SELEKT 1.1

It is based on a documented methodology.

SELEKT 1.1 is flexible but requires your understanding of the methodology.

Norberg, A. et al. (2010): A General Methodology for Selective Data Editing. Statistics Sweden

Norberg, A. et al. (2011): User's Guide to SELEKT 1.1, A Generic Toolbox for Selective Data Editing. Statistics Sweden

The survey environment

[Figure: the survey environment as a flow Input, Throughput, Output, Use.
• Input: a respondent (u) has one or several sampled units (k); each sampled unit has observed units (l) with background variables (Industry, Gender, Occupation) and measurement variables (j), e.g. 2 = Wage. An observed value is y_jkl.
• Throughput: coding, editing, imputation, estimation.
• Output: tables such as sum of wages by industry (A, B, C, D, E, F - Z) and sum of wages by occupation (1, 2, 3, 4, sum) and gender (men, women, sum).
• Use: decision making, information.]


Suspicion

Predicted (expected) values

Data / predictor
• Time series
  – Previous value
  – Forecast
• Cross section
  – Mean/standard error
  – Median/quartile

Edit groups

[Figure: hierarchy of edit groups, starting from all data. Groupings shown: blue collar workers / white collars; monthly pay / weekly pay / payment by the hour; Profession = 1, 2, 3, 9; Profession = 3111, 3112, 3113, 3480; men / women.]

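Suspicion

Suspicion = R/(TAU+R), where

$$
R_{j,k,l} =
\begin{cases}
\dfrac{\tilde z_{j,k,l} - \mathrm{KAPPA}\,\bigl(\tilde z_{j,k,l} - \tilde z^{(L)}_{j,k,l}\bigr) - z_{j,k,l}}{\tilde z^{(U)}_{j,k,l} - \tilde z^{(L)}_{j,k,l}}
 & \text{if } z_{j,k,l} < \tilde z_{j,k,l} - \mathrm{KAPPA}\,\bigl(\tilde z_{j,k,l} - \tilde z^{(L)}_{j,k,l}\bigr) \\[2ex]
0
 & \text{if } \tilde z_{j,k,l} - \mathrm{KAPPA}\,\bigl(\tilde z_{j,k,l} - \tilde z^{(L)}_{j,k,l}\bigr) \le z_{j,k,l} \le \tilde z_{j,k,l} + \mathrm{KAPPA}\,\bigl(\tilde z^{(U)}_{j,k,l} - \tilde z_{j,k,l}\bigr) \\[2ex]
\dfrac{z_{j,k,l} - \tilde z_{j,k,l} - \mathrm{KAPPA}\,\bigl(\tilde z^{(U)}_{j,k,l} - \tilde z_{j,k,l}\bigr)}{\tilde z^{(U)}_{j,k,l} - \tilde z^{(L)}_{j,k,l}}
 & \text{if } z_{j,k,l} > \tilde z_{j,k,l} + \mathrm{KAPPA}\,\bigl(\tilde z^{(U)}_{j,k,l} - \tilde z_{j,k,l}\bigr)
\end{cases}
$$

Here $z$ is the observed value of the test variable, $\tilde z$ its predicted value (the centre), and $\tilde z^{(L)}$, $\tilde z^{(U)}$ the lower and upper limits of the dispersion range.

KAPPA = 0. The ratio R is the distance a between $z$ and the centre $\tilde z$ divided by the dispersion range $r = \tilde z^{(U)} - \tilde z^{(L)}$: R = a/r.

[Figure: distance a measured from the centre, range r.]

KAPPA = 1. The ratio R is the distance a from the nearest range limit divided by the range r. Hence R = a/r. For data between the lower and upper limits of the dispersion range the suspicion is zero.

[Figure: distance a measured from the nearest range limit, range r.]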
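
To make the two KAPPA cases concrete, here is a minimal Python sketch of the suspicion measure. It assumes the piecewise form implied by the KAPPA = 0 and KAPPA = 1 descriptions above, with intermediate KAPPA values interpolating between them. The function and argument names are illustrative and do not come from SELEKT.

```python
def ratio_r(z, z_pred, z_low, z_up, kappa):
    """Distance of z from the tolerated zone, in units of the range.

    kappa = 0: the zone shrinks to the centre z_pred.
    kappa = 1: the zone is the whole dispersion range [z_low, z_up].
    """
    spread = z_up - z_low
    lower_edge = z_pred - kappa * (z_pred - z_low)
    upper_edge = z_pred + kappa * (z_up - z_pred)
    if z < lower_edge:
        return (lower_edge - z) / spread
    if z > upper_edge:
        return (z - upper_edge) / spread
    return 0.0


def suspicion(z, z_pred, z_low, z_up, kappa, tau):
    """Suspicion = R / (TAU + R); lies in [0, 1) for tau > 0."""
    r = ratio_r(z, z_pred, z_low, z_up, kappa)
    return r / (tau + r)


# Predicted value 100 with dispersion range [90, 110]; observed value 130.
r_kappa0 = ratio_r(130, 100, 90, 110, kappa=0)   # distance from the centre
r_kappa1 = ratio_r(130, 100, 90, 110, kappa=1)   # distance from the upper limit
susp = suspicion(130, 100, 90, 110, kappa=1, tau=1)
```

With these numbers R is 30/20 for KAPPA = 0 and 20/20 for KAPPA = 1, and a larger TAU pulls the suspicion towards zero for the same R.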

Impact

• Actual impact = w (y_une − y_edi) for an observation is the impact on the estimated domain total of variable Y if the un-edited value y_une is kept instead of making a review to find the edited value y_edi

• Potential impact = w (y_une − y_pred) is a proxy for actual impact to be used in practice. y_pred is a prediction (expected value) for y_edi

• Expected impact (per domain, variable, observation) is the product of suspicion and potential impact
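
The three quantities above can be written out directly. This is a small Python sketch with hypothetical function names, where `weight` stands for w.

```python
def actual_impact(weight, y_unedited, y_edited):
    """Effect on the estimated total of keeping the un-edited value.
    Only known after a review has produced y_edited."""
    return weight * (y_unedited - y_edited)


def potential_impact(weight, y_unedited, y_predicted):
    """Proxy for the actual impact, using a prediction for y_edited."""
    return weight * (y_unedited - y_predicted)


def expected_impact(suspicion, weight, y_unedited, y_predicted):
    """Suspicion times potential impact."""
    return suspicion * potential_impact(weight, y_unedited, y_predicted)


# A unit with weight 10 reports 120 where 100 was predicted.
pot = potential_impact(10, 120, 100)
exp = expected_impact(0.5, 10, 120, 100)
```

The potential impact is available before any follow-up, which is what makes it usable for prioritizing the manual work.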

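Score function (1)

Local score nr 5, by domain d, variable j, observed unit k,l is the expected impact related to an appropriate measure of size for the domain/variable, say standard error of estimate.

VIOLIN_j = weight for variable j

CLARINET_c(d) = weight for classification (domains) c(d)

OBOE_j = adjustment for size of estimated total or its standard error for variable j

$$
\mathrm{Score5}_{d,j,k,l} = \mathrm{Suspicion}_{j,k,l} \times \mathrm{Potential\ impact}_{d,j,k,l} \times \mathrm{CELLO}_{c(d),j}
$$

$$
\mathrm{CELLO}_{c(d),j} = \frac{\mathrm{VIOLIN}_j \times \mathrm{CLARINET}_{c(d)}}{\Bigl(\max\bigl\{\mathrm{ALFA}_j \times \hat T_{d,j,t_0},\ SE(\hat T_{d,j,t_0})\bigr\}\Bigr)^{\mathrm{OBOE}_j}}
$$

where $\hat T_{d,j,t_0}$ is the estimated total and $SE(\cdot)$ its standard error.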
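
A minimal Python sketch of the local score follows, assuming CELLO is the product of the VIOLIN and CLARINET weights divided by the size measure max{ALFA × estimated total, standard error} raised to the power OBOE. Parameter names follow the slide; the function names are illustrative.

```python
def cello(violin, clarinet, alfa, total_estimate, standard_error, oboe):
    """Weight that relates an impact to the size of the domain estimate."""
    size = max(alfa * total_estimate, standard_error)
    return violin * clarinet / size ** oboe


def local_score(suspicion, potential_impact, cello_weight):
    """Local score = suspicion x potential impact x CELLO."""
    return suspicion * potential_impact * cello_weight


# Estimated total 10 000 with standard error 50; ALFA = 0.01 gives size 100.
weight = cello(violin=1.0, clarinet=1.0, alfa=0.01,
               total_estimate=10_000, standard_error=50, oboe=1)
score = local_score(suspicion=0.5, potential_impact=200, cello_weight=weight)
```

Because the potential impact on the slide is signed, a caller who wants to rank on magnitude can pass `abs(potential_impact)`; that choice is not stated in the source. Setting OBOE to 0 switches the size adjustment off, since any size raised to the power 0 is 1.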

Score function (2)

• Global scores are aggregated local scores by domain, variable, second stage units (opt.) to a score for the primary unit and finally to respondent unit (opt.)

• Methods: sum, sum of squares, maximum etc. (Minkowski's distance)

$$
\mathrm{Score2}_k = \sum_l \max\bigl\{0,\ \mathrm{Score3}_{k,l} - \mathrm{Threshold3}\bigr\}
$$

Hedlin, D. (2008): Local and global score functions in selective editing. Invited paper, UNECE Work Session on Statistical Data Editing, Wien, Austria, 21-23 April.
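
The aggregation methods listed above (sum, sum of squares, maximum) can be seen as one Minkowski-type family. The Python sketch below illustrates that idea. The `power` parameter is a name chosen here, and subtracting a threshold before aggregating follows the Score2 expression on the slide.

```python
def global_score(local_scores, threshold=0.0, power=1.0):
    """Aggregate local scores into one score for the unit above.

    power = 1            -> sum of the excesses over the threshold
    power = 2            -> root of the sum of squares
    power = float('inf') -> maximum
    """
    excess = [max(0.0, s - threshold) for s in local_scores]
    if power == float("inf"):
        return max(excess, default=0.0)
    return sum(e ** power for e in excess) ** (1.0 / power)


scores = [0.2, 0.5, 1.5]
by_sum = global_score(scores, threshold=0.5, power=1)
by_max = global_score([3.0, 4.0], power=float("inf"))
by_squares = global_score([3.0, 4.0], power=2)
```

A higher power makes the global score depend more on the single largest local score, so a unit with one large error is prioritized over a unit with many small ones.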

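Evaluation

Relative pseudo-bias is a measure of error in output due to incomplete data review

$$
RPB(q) = \frac{\hat T_q - \hat T_{100}}{SE(\hat T_{100})}
$$

where $\hat T_q$ is the estimate when only the q% of records with the highest scores are reviewed, and $\hat T_{100}$ is the estimate when all records are reviewed.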
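
Relative pseudo-bias can be computed from a fully edited historical data set in which both the un-edited and the edited values are kept. The following Python sketch assumes a weighted total as the estimator and takes the standard error as given; the record keys (`score`, `w`, `y_unedited`, `y_edited`) are hypothetical.

```python
def relative_pseudo_bias(records, n_reviewed, standard_error):
    """Pseudo-bias when only the n_reviewed highest-scoring records are
    reviewed, relative to the standard error of the fully edited estimate."""
    ranked = sorted(records, key=lambda r: r["score"], reverse=True)
    # Estimate with every record reviewed.
    t_full = sum(r["w"] * r["y_edited"] for r in ranked)
    # Estimate where only the top-ranked records received their edited value.
    t_partial = sum(
        r["w"] * (r["y_edited"] if i < n_reviewed else r["y_unedited"])
        for i, r in enumerate(ranked)
    )
    return (t_partial - t_full) / standard_error


data = [
    {"score": 5.0, "w": 1, "y_unedited": 1000, "y_edited": 100},
    {"score": 1.0, "w": 1, "y_unedited": 55, "y_edited": 50},
    {"score": 0.0, "w": 1, "y_unedited": 20, "y_edited": 20},
]
rpb_curve = [relative_pseudo_bias(data, m, standard_error=10.0) for m in range(4)]
```

In this toy example reviewing only the top-scored record removes almost all of the pseudo-bias, which is the pattern a good score function should produce.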

Evaluation

[Figure: Pseudo-bias for the PPI survey relative to the overall price index, with PSUs in descending order of score. Vertical axis from 0 to 0.2; horizontal axis: number of changes (Antal ändringar), from 1 to 821.]

Cut-off or probability sampling?

Say that 821 of the total sample (n = 4 000) have a score > 0.

There are two options for manual review:

– Cut-off sampling: Score2 > Threshold2, assuming the remaining bias is small

– Two-phase sampling: πps-sampling and design-based estimation of measurement errors to subtract from initial estimates

Ilves, K. (2010): Probability Approach to Editing. Workshop on Survey Sampling Theory andMethodology, Vilnius, Lithuania, August 23-27, 2010
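
The two options can be contrasted in a short Python sketch. This illustrates the general idea only and is not the method of the cited paper: Poisson sampling with inclusion probabilities proportional to the score is used as a simple stand-in for the second-phase design, and the error total is estimated with a Horvitz-Thompson type sum. All function names are hypothetical.

```python
import random


def cutoff_selection(scores, threshold):
    """Option 1: review every unit whose score exceeds the threshold."""
    return [i for i, s in enumerate(scores) if s > threshold]


def inclusion_probabilities(scores, expected_n):
    """Option 2, step 1: probabilities proportional to score.
    Capped at 1 in a single pass, so the realised expected sample size
    can be slightly below expected_n."""
    total = sum(scores)
    return [min(1.0, expected_n * s / total) for s in scores]


def poisson_sample(probabilities, rng):
    """Option 2, step 2: independent selection of each unit."""
    return [i for i, p in enumerate(probabilities) if rng.random() < p]


def estimated_error_total(sample, probabilities, weights, y_unedited, y_edited):
    """Option 2, step 3: design-based estimate of the total measurement
    error, to be subtracted from the initial (un-edited) estimate."""
    return sum(
        weights[i] * (y_unedited[i] - y_edited[i]) / probabilities[i]
        for i in sample
    )


scores = [4.0, 1.0, 1.0, 2.0]
probs = inclusion_probabilities(scores, expected_n=2)
sample = poisson_sample(probs, random.Random(2011))
# Deterministic illustration with a fixed sample of units 0 and 3.
error = estimated_error_total(
    [0, 3], probs,
    weights=[10, 10, 10, 10],
    y_unedited=[100, 50, 50, 80],
    y_edited=[60, 50, 50, 70],
)
```

Cut-off selection spends all follow-up on the top scores and accepts whatever bias remains below the threshold, whereas the sampling option gives every unit with a positive score a chance of review so that the remaining error can be estimated.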

SELEKT 1.1

[Figure: flow chart of SELEKT 1.1. Components shown:
• Cold path: raw + edited past (cold) survey data; survey specific cold adapter (SAS code), data preparation; SAS data set; PRE-SELEKT (parameter specifications, analysis of cold data); table of parameters; table of estimates.
• Hot path: input (hot) survey data; survey specific hot adapter (SAS code), data preparation; SAS data set; AUTOSELEKT (score calculation & record flagging); records to FOLLOW-UP; records to IMPUTATION; accepted records; process data and reports.
• Related components: edits; SNOWDON-X analysis of edits; CLAN estimation software.]

Editing – remaining methodology issues

• Confidence (respondents and clients)
• Do we make a difference between new and old respondents?
• Editing in earlier processes
  – Web-questionnaires
  – Scanned paper questionnaires
• Fatal errors
  – Classifying variables
  – Survey variables
• Data and methods for computing predicted values etc.
• Homogeneous edit groups
• How to decide threshold values
• Aggregating scores
• Sampling below threshold
  – Inference
  – Data for evaluation