Selective Data Editing
The Third Baltic-Nordic Conference on Survey Statistics – BaNoCoss.
13-17 June 2011 in the High Coast area, Sweden
Mr Anders Norberg, Statistics Sweden (SCB)
If …
• we only ask businesses for information that we know they have,
• and we ask for that information in a way they understand,
• and we motivate them to deliver data of as good quality as possible,
• and we help them to avoid accidental errors in answering questionnaires,
• then editing would be a minor process!
Editing
Editing is an activity of detecting, resolving and understanding errors in data and produced statistics
Where errors are introduced
• Errors in raw data delivered by respondents to the statistical agency are typically non-response and measurement errors
• Errors in data transmissions
• The statistics production process is a mixture of many activities with risks of introducing errors
Editing activities
A. Respondent editing
B. Manual editing before data registration
C. Data registration editing
D. Production editing / micro editing
   1 “Traditional” editing
   2 Selective editing
E. Coherence analysis
F. Output editing / macro editing
G. Evaluation
H. Delivery control
Types of errors
Obvious errors / Fatal errors
• Item non-response
• Non-valid values
• Data structure or model errors, e.g. a total that differs from the sum of its components
• Contradictions
Suspected data values
Deviation errors (Outliers)
• Suspiciously high/low values, data outside of predetermined limits
Definition errors (Inliers)
• Many respondents misunderstand a question in the same way
• Many respondents fetch data from info-systems with other definitions
Suspected data values
Deviation errors
• Manual follow-up takes time and is expensive
• Few deviation errors have impact on output statistics (low hit-rate, many changes in data have very little impact)
Editing must have impact on the output! Remember response burden!
Suspected data values
Definition errors (Inliers)
• Difficult to find
• Ways to find them:
  – Combined editing for several surveys
  – Deep interviews in focus groups
  – Use statistics from FAQ and from re-contacts with respondents
  – High proportions of item non-response
  – Graphical editing
  – Good examples
The new role of editing
• Quality Control of the measurement process
  – Find errors (use efficient controls)
  – Consider every identified error as a sign that the respondent had a problem delivering correct data through our collection instrument
  – Identify sources of error (process data)
  – Analyse process data and communicate with cognitive specialists
• Contribute to quality declaration
• Adjust (change/correct) significant errors
Granquist (1997). The New View on Editing. International Statistical Review
The Process Perspective
• Audit and improve data collection
  – measurement instrument
  – collection process
• and the editing process itself
Un-edited data must be saved in order to produce important process indicators, such as hit-rate and impact on output!
Process indicators
• Sources of errors (problem for the respondents)
• Prop. of flagged units and variables
• Prop. of manually and automatically reviewed units and variables
• Prop. of amended values and impact of the changes, per variable
• Hit-rate for edits
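A minimal sketch of how some of these indicators could be computed once both un-edited and reviewed values are kept. The record layout (`ValueLog`) and the definition of hit-rate as the share of flagged values that were actually amended are assumptions made for illustration, not definitions taken from the slides.

```python
from dataclasses import dataclass


@dataclass
class ValueLog:
    """One data value as logged by a hypothetical editing process."""
    flagged: bool   # True if at least one edit flagged the value
    raw: float      # un-edited value as delivered by the respondent
    final: float    # value after review (equal to raw if unchanged)
    weight: float   # design weight of the unit


def process_indicators(log):
    """Return proportion flagged, hit-rate and impact of changes on a total."""
    flagged = [v for v in log if v.flagged]
    amended = [v for v in flagged if v.raw != v.final]
    return {
        "prop_flagged": len(flagged) / len(log),
        # assumed definition: amended values among the flagged ones
        "hit_rate": len(amended) / len(flagged) if flagged else 0.0,
        # weighted change in the estimated total caused by the amendments
        "impact": sum(v.weight * (v.raw - v.final) for v in log),
    }


log = [
    ValueLog(True, 100.0, 10.0, 2.0),   # flagged and amended
    ValueLog(True, 5.0, 5.0, 1.0),      # flagged, found correct
    ValueLog(False, 7.0, 7.0, 1.0),
    ValueLog(False, 3.0, 3.0, 1.0),
]
indicators = process_indicators(log)
```

Without the saved un-edited values (`raw`), neither the hit-rate nor the impact could be computed afterwards.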
“Traditional” data editing
An EDIT is a checking rule / edit rule, a logical condition or a restriction to the value of a data item or a data group which must be met if the data is to be considered correct.
An EDIT has:
• Test-variable
• Edit group
• Acceptance region
if Occupation = 'Doctor' and
   not (2900 < Salary_Month < 7100)
then Errcode_A01 = 'Flag';
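The same edit can be sketched in Python to make its three parts explicit: the edit group (doctors), the test variable (monthly salary) and the acceptance region (the open interval between the limits). The function name and the dictionary record layout are hypothetical; treating a missing salary as flagged is an assumption.

```python
def edit_a01(record, lower=2900, upper=7100):
    """Return 'Flag' if the record violates edit A01, otherwise None."""
    # Edit group: the edit only applies to doctors.
    if record.get("Occupation") != "Doctor":
        return None
    # Test variable: monthly salary.
    salary = record.get("Salary_Month")
    # Acceptance region: strictly between the lower and upper limit.
    accepted = salary is not None and lower < salary < upper
    return None if accepted else "Flag"


flag_high = edit_a01({"Occupation": "Doctor", "Salary_Month": 8000})
flag_ok = edit_a01({"Occupation": "Doctor", "Salary_Month": 5000})
flag_other = edit_a01({"Occupation": "Nurse", "Salary_Month": 8000})
```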
Suspicion / Traditional edits
Finding acceptance limits: Data from previous survey rounds
Hourly wage distributed by SNI code at one-digit level.
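One possible way to derive acceptance limits from a previous survey round is sketched below: quartiles are computed per edit group and widened by a multiple of the interquartile range. The widening rule and the multiplier `c` are assumptions chosen for illustration; the slides do not prescribe a particular rule.

```python
from statistics import quantiles


def acceptance_limits(values_by_group, c=1.5):
    """Return {group: (lower, upper)} from previous-round values per group."""
    limits = {}
    for group, values in values_by_group.items():
        if len(values) < 2:
            continue  # too few observations to estimate quartiles
        q1, _median, q3 = quantiles(values, n=4)
        spread = q3 - q1
        limits[group] = (q1 - c * spread, q3 + c * spread)
    return limits


# Hypothetical hourly wages from the previous round, by one-digit industry code.
previous_round = {
    "C": [100, 110, 120, 130, 140],
    "F": [90],
}
limits = acceptance_limits(previous_round)
```

Groups with too few observations get no limits here; in practice they would have to be merged into a wider edit group.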
Selective data editing
[Figure: potential impact plotted against suspicion (scale 0 to 1), with the flagged region marked.]
A procedure which targets only some of the micro data variables or records for review by prioritizing the manual work.
Selective data editing
Criteria for prioritizing variables and records for review:
• Limited bias
• Limited variance
… imagining that 100% review would yield the best quality
Hedlin, D. (2008). Local and global score functions in selective editing. Invited paper, UNECE Work Session on Statistical Data Editing, Wien, Austria, 21-23 April.
Selective data editing
Construct a score function for prioritizing variables and records:
• Potential impact on statistics for records flagged by traditional edits
• Expected impact on statistics for variable values flagged to be suspected by edits
Norberg, A. et al. (2010): A General Methodology for Selective Data Editing. Statistics Sweden
Selective data editing
The purpose of selective data editing is to reduce cost for the statistical agency as well as for the respondents, without significant decrease of the quality of the output statistics.
Selective data editing
Latouche, M. and Berthelot, J.-M. (1992): Use of a score function to prioritize and limit re-contacts in business surveys. Journal of Official Statistics, Vol. 8, pp. 389-400
Lawrence, D. and McDavitt, C. (1994): Significance Editing in the Australian Survey of Average Weekly Earnings. Journal of Official Statistics, Vol. 10, pp. 437-447
Granquist, L. (1995): Improving the Traditional Editing Process. In Business Survey Methods, eds. Cox et al., Wiley
Granquist, L. (1997): The New View on Editing. International Statistical Review
Granquist, L. and Kovar, J. (1997): Editing of survey data: How much is enough? In Survey measurement and process quality (pp. 415-435), eds. Lyberg et al., Wiley
Hedlin, D. (2008): Local and global score functions in selective editing. Invited paper, UNECE Work Session on Statistical Data Editing, Wien, Austria, 21-23 April.
Norberg, A. et al. (2010): A General Methodology for Selective Data Editing. Statistics Sweden
Ilves, K. (2010): Probability Approach to Editing. Workshop on Survey Sampling Theory and Methodology, Vilnius, Lithuania, August 23-27, 2010
Selective data editing
Statistics Sweden has developed a generic IT-tool for selective editing, SELEKT 1.1
It is based on a documented methodology.
SELEKT 1.1 is flexible but requires your understanding of the methodology.
Norberg, A. et al. (2010): A General Methodology for Selective Data Editing. Statistics Sweden
Norberg, A. et al. (2011): User's Guide to SELEKT 1.1, A Generic Toolbox for Selective Data Editing. Statistics Sweden
The survey environment
[Figure: the survey environment in four stages: Input, Throughput, Output, Use.]
• Input: Respondent (u) has one or several sampled units. Each sampled unit (k) has observed units (l), described by background variables (Industry, Gender, Occupation) and measurement variables (j), e.g. 2 = Wage. The observed value is y for variable j, sampled unit k and observed unit l.
• Throughput: Coding, Editing, Imputation, Estimation
• Output: Sum of wages by Industry (A, B, …, Z); Sum of wages by Occupation (1-4) and Gender (Men, Women, Sum)
• Use: Decision making, Information
Suspicion
Predicted (expected) values
Data / predictor
• Time series
  – Previous value
  – Forecast
• Cross section
  – Mean/standard error
  – Median/quartile
Edit groups
[Figure: tree of nested edit groups, from all data down to narrower groups. Node labels: All data; Blue collar workers, White collars; Monthly pay, Weekly pay, Payment by the hour (under both branches); Profession = 1, 2, 3, 9; Profession = 3111, 3112, 3113, 3480; Men, Women.]
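Suspicion
For a test variable value z_{j,k,l} with predicted value \tilde{z}_{j,k,l} and lower and upper limits \tilde{z}^{L}_{j,k,l} and \tilde{z}^{U}_{j,k,l} of the dispersion range (subscripts j,k,l are omitted on the right-hand side):

$$
R_{j,k,l} =
\begin{cases}
\dfrac{\tilde{z} - \mathrm{KAPPA}\,(\tilde{z} - \tilde{z}^{L}) - z}{\tilde{z}^{U} - \tilde{z}^{L}} & \text{if } z < \tilde{z} - \mathrm{KAPPA}\,(\tilde{z} - \tilde{z}^{L}) \\[2ex]
0 & \text{if } \tilde{z} - \mathrm{KAPPA}\,(\tilde{z} - \tilde{z}^{L}) \le z \le \tilde{z} + \mathrm{KAPPA}\,(\tilde{z}^{U} - \tilde{z}) \\[2ex]
\dfrac{z - \tilde{z} - \mathrm{KAPPA}\,(\tilde{z}^{U} - \tilde{z})}{\tilde{z}^{U} - \tilde{z}^{L}} & \text{if } z > \tilde{z} + \mathrm{KAPPA}\,(\tilde{z}^{U} - \tilde{z})
\end{cases}
$$

$$
\mathrm{Suspicion}_{j,k,l} = \frac{R_{j,k,l}}{\mathrm{TAU} + R_{j,k,l}}
$$

KAPPA = 0. The ratio R is the distance a between z and the centre \tilde{z}, divided by the dispersion range r = \tilde{z}^{U} - \tilde{z}^{L}. Hence R = a/r.

KAPPA = 1. The ratio R is the distance a from the nearest range limit, divided by the range r. Hence R = a/r. For data between the lower and upper limits of the dispersion range the suspicion is zero.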
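The suspicion measure can be sketched as a small function. It assumes that KAPPA moves the zero-suspicion zone linearly from the centre (KAPPA = 0) out to the range limits (KAPPA = 1), which agrees with the two cases described above; the function name, argument names and the default TAU = 1 are illustrative choices, not SELEKT parameters.

```python
def suspicion(z, z_pred, z_low, z_up, kappa=1.0, tau=1.0):
    """Suspicion in [0, 1) for value z, given its prediction and range limits."""
    r = z_up - z_low                      # dispersion range
    if r <= 0:
        raise ValueError("upper limit must exceed lower limit")
    # Borders of the zone in which the value is not suspected at all.
    lower = z_pred - kappa * (z_pred - z_low)
    upper = z_pred + kappa * (z_up - z_pred)
    if z < lower:
        a = lower - z                     # distance below the zone
    elif z > upper:
        a = z - upper                     # distance above the zone
    else:
        a = 0.0
    ratio = a / r                         # R = a / r
    return ratio / (tau + ratio)          # Suspicion = R / (TAU + R)


# Prediction 100 with dispersion range 80 to 120.
s_outside = suspicion(150, 100, 80, 120, kappa=1.0)   # R = 30 / 40
s_inside = suspicion(110, 100, 80, 120, kappa=1.0)    # inside the range
s_centre = suspicion(110, 100, 80, 120, kappa=0.0)    # R = 10 / 40
```

Because R/(TAU + R) is increasing in R and bounded by 1, a larger TAU makes the suspicion grow more slowly with the deviation.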
Impact
• Actual impact = w (y_une − y_edi) for an observation is the impact on the estimated domain total of variable Y if the un-edited value y_une is kept instead of making a review to find the edited value y_edi
• Potential impact = w (y_une − y_pred) is a proxy for actual impact to be used in practice. y_pred is a prediction (expected value) for y_edi
• Expected impact (per domain, variable, observation) is the product of suspicion and potential impact
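The three impact measures translate directly into code. This is a minimal sketch; the function and argument names are chosen here for illustration, and the numbers in the example are made up.

```python
def actual_impact(w, y_une, y_edi):
    """Effect on the estimated total of keeping y_une instead of y_edi."""
    return w * (y_une - y_edi)


def potential_impact(w, y_une, y_pred):
    """Proxy for the actual impact, using a prediction in place of y_edi."""
    return w * (y_une - y_pred)


def expected_impact(suspicion, w, y_une, y_pred):
    """Product of suspicion and potential impact."""
    return suspicion * potential_impact(w, y_une, y_pred)


# Unit with design weight 10 reporting 500 where 300 was predicted.
pot = potential_impact(10, 500, 300)
exp = expected_impact(0.5, 10, 500, 300)
```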
Score function (1)
Local score nr 5, by domain d, variable j, observed unit k,l, is the expected impact related to an appropriate measure of size for the domain/variable, say the standard error of the estimate.

$$
\mathrm{Score5}_{d,j,k,l} = \mathrm{Suspicion}_{j,k,l} \times \mathrm{Potential\ impact}_{d,j,k,l} \times \mathrm{CELLO}_{c(d),j}
$$

$$
\mathrm{CELLO}_{c(d),j} = \frac{\mathrm{VIOLIN}_{j} \times \mathrm{CLARINET}_{c(d)}}{\left( \max\left\{ \mathrm{ALFA}_{j} \times \hat{T}_{d,j,t_0},\ \mathrm{SE}\!\left(\hat{T}_{d,j,t_0}\right) \right\} \right)^{\mathrm{OBOE}_{j}}}
$$

VIOLIN_j = weight for variable j
CLARINET_{c(d)} = weight for classification (domains) c(d)
OBOE_j = adjustment for size of estimated total or its standard error for variable j
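A hedged sketch of the local score follows. It reads the formula as CELLO = VIOLIN × CLARINET divided by max{ALFA × estimated total, standard error of the total} raised to the power OBOE, and it takes the absolute value of the potential impact so that scores are non-negative. Both points are assumptions of this sketch, and all function and argument names are illustrative.

```python
def cello(violin, clarinet, t_hat, se_t_hat, alfa, oboe):
    """Scaling factor for one domain and variable (assumed reading)."""
    size = max(alfa * t_hat, se_t_hat)    # size measure for the domain
    return violin * clarinet / size ** oboe


def local_score(suspicion, potential_impact, cello_value):
    """Expected impact scaled by importance weights and domain size."""
    return suspicion * abs(potential_impact) * cello_value


# Domain total 1000 with standard error 50; ALFA = 0.1 gives size max(100, 50).
scale = cello(violin=1.0, clarinet=1.0, t_hat=1000.0, se_t_hat=50.0,
              alfa=0.1, oboe=1.0)
score = local_score(suspicion=0.5, potential_impact=-2000.0, cello_value=scale)
```

With OBOE = 0 the size adjustment disappears and the score is the weighted expected impact in the variable's own units.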
Score function (2)
• Global scores are aggregated local scores by domain, variable, second stage units (opt.) to a score for the primary unit and finally to respondent unit (opt.)
• Methods: sum, sum of squares, maximum etc. (Minkowski distance)

Example (aggregation by sum over the second stage units l):

$$
\mathrm{Score2}_{k} = \sum_{l} \max\left\{ 0,\ \mathrm{Score3}_{k,l} - \mathrm{Threshold3} \right\}
$$

Hedlin, D. (2008): Local and global score functions in selective editing. Invited paper, UNECE Work Session on Statistical Data Editing, Wien, Austria, 21-23 April.
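The aggregation step can be sketched with a single Minkowski-type function: power 1 gives the sum, power 2 the root of the sum of squares, and an infinite power the maximum. Subtracting a threshold before aggregating follows the formula on the slide as far as it can be read; the function signature is an assumption of this sketch.

```python
def global_score(local_scores, threshold=0.0, power=1.0):
    """Aggregate local scores that exceed the threshold into one score."""
    excess = [max(0.0, s - threshold) for s in local_scores]
    if power == float("inf"):
        return max(excess, default=0.0)            # maximum
    return sum(e ** power for e in excess) ** (1.0 / power)


scores = [4.0, 5.0, 0.5]
by_sum = global_score(scores, threshold=1.0, power=1.0)            # 3 + 4 + 0
by_squares = global_score(scores, threshold=1.0, power=2.0)        # sqrt(9 + 16)
by_max = global_score(scores, threshold=1.0, power=float("inf"))   # 4
```

A higher power lets a single large local score dominate, while the plain sum also lets many moderate scores add up to a review.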
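Evaluation
Relative pseudo-bias is a measure of error in output due to incomplete data review

$$
\mathrm{RPB}(q) = \frac{\hat{T}_{q} - \hat{T}_{100}}{\mathrm{SE}\!\left(\hat{T}_{100}\right)}
$$

where \hat{T}_{q} is the estimate when only the q per cent highest-scored records are reviewed and \hat{T}_{100} is the estimate after complete review.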
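A pseudo-bias curve can be sketched from a fully edited evaluation data set. The sketch below assumes a simple weighted total as the estimator and reviews records one at a time in descending score order, so the curve is indexed by the number of reviewed records rather than by a percentage q; the tuple layout and the numbers are made up.

```python
def pseudo_bias_curve(records, se_t_100):
    """records: iterable of (score, weight, y_une, y_edi) tuples.

    Returns the relative pseudo-bias after reviewing 0, 1, 2, ... records.
    """
    ordered = sorted(records, key=lambda rec: rec[0], reverse=True)
    t_100 = sum(w * y_edi for _, w, _, y_edi in ordered)   # fully edited total
    t_q = sum(w * y_une for _, w, y_une, _ in ordered)     # nothing reviewed
    curve = [(t_q - t_100) / se_t_100]
    for _, w, y_une, y_edi in ordered:
        t_q += w * (y_edi - y_une)       # this record is now reviewed
        curve.append((t_q - t_100) / se_t_100)
    return curve


records = [
    (1.0, 2.0, 55.0, 50.0),
    (5.0, 1.0, 200.0, 100.0),    # highest score, reviewed first
    (0.0, 1.0, 10.0, 10.0),
]
curve = pseudo_bias_curve(records, se_t_100=10.0)
```

A good score function makes the curve fall towards zero after only a few reviewed records.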
Evaluation
Pseudo-bias for the PPI survey relative to the overall price index. PSUs ordered in descending order of score.
[Figure: pseudo-bias (vertical axis, 0 to 0.2) against number of changes (horizontal axis, 1 to 821).]
Cut-off or probability sampling?
Say that 821 of the total sample (n = 4 000) have a score > 0.
There are two options for manual review:
– Cut-off sampling: Score2 > Threshold2, assuming the remaining bias is small
– Two-phase sampling: ps-sampling and design-based estimation of measurement errors to subtract from initial estimates
Ilves, K. (2010): Probability Approach to Editing. Workshop on Survey Sampling Theory and Methodology, Vilnius, Lithuania, August 23-27, 2010
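The two options can be contrasted in a short sketch. The cut-off rule is taken from the slide; the second-phase estimator shown is a Horvitz-Thompson type estimate of the total measurement error, which is one design-based choice assumed here for illustration and not necessarily the estimator used in the cited paper. All names and numbers are hypothetical.

```python
def cutoff_selection(scores, threshold):
    """Return the units whose global score exceeds the threshold."""
    return [unit for unit, score in scores.items() if score > threshold]


def estimated_error_total(reviewed):
    """reviewed: (inclusion_prob, weight, y_une, y_edi) per sampled unit.

    Each reviewed unit's weighted error is inflated by the inverse of its
    second-phase inclusion probability.
    """
    return sum(w * (y_une - y_edi) / p for p, w, y_une, y_edi in reviewed)


scores = {"a": 0.9, "b": 0.2, "c": 0.0}
to_review = cutoff_selection(scores, threshold=0.5)

# One unit sampled for review with probability 0.5, design weight 2.
error = estimated_error_total([(0.5, 2.0, 110.0, 100.0)])
initial_estimate = 1000.0
corrected_estimate = initial_estimate - error
```

The cut-off approach leaves an unknown bias from the unreviewed units, while the probability approach yields an estimate of that error at the price of added variance.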
SELEKT 1.1
[Figure: SELEKT 1.1 process flow. Labels in the diagram:]
• Raw + edited past (cold) survey data
• Survey specific cold adapter (SAS code): data preparation, giving a SAS data set
• PRE-SELEKT: parameter specifications, analysis of cold data
• Edits; SNOWDON-X analysis of edits
• Table of parameters; Table of estimates; CLAN estimation software
• Input (hot) survey data
• Survey specific hot adapter (SAS code): data preparation, giving a SAS data set
• AUTOSELEKT: score calculation & record flagging
• Records to FOLLOW-UP; Records to IMPUTATION; Accepted records; Process data and reports
Editing – remaining methodology issues
• Confidence (respondents and clients)
• Do we make a difference between new and old respondents?
• Editing in earlier processes
  – Web-questionnaires
  – Scanned paper questionnaires
• Fatal errors
  – Classifying variables
  – Survey variables
• Data and methods for computing predicted values etc.
• Homogeneous edit groups
• How to decide threshold values
• Aggregating scores
• Sampling below threshold
  – Inference
  – Data for evaluation