Investigation of Macro Editing Techniques for Outlier Detection in
Survey Data
Katherine Jenny Thompson
Office of Statistical Methods and Research for Economic Programs
Simplified Survey Processing Cycle
• Data Collection/Analyst Review
• Micro-editing and Imputation (Individual Returns)
• Macro-editing (Tabulated Initial Estimates)
• Analyst Investigation and Correction
• Publication Estimates
Identifying Outlying Estimates
• Set of Estimates
– Unknown parametric distribution (robust)
– Contains outliers (resistant)
• Outlier-identification problems (Multiple Outliers)
– Masking: difficult to detect an individual outlier
– Swamping: too many false outliers flagged
Outlier Detection Approaches
• Sets of “bivariate” (Ratio) comparisons
– Same estimate from two consecutive collection periods (historic cell ratios)
– Different estimates in same collection period (current cell ratios)
• Multivariate comparisons
– Current period data
Method for Bivariate Comparisons
• Resistant Fences Methods
– Symmetrized Resistant Fences
– Asymmetric Fences
• Robust Regression
• Hidiroglou-Berthelot Edit
Bivariate Comparisons (Current Cell Ratios)
• Linear relationship between payroll and employment
• No intercept
[Figure: Paired Estimates; scatter plot of Annual Payroll versus Total Employment]
“Traditional” Ratio Edit (Current Cell Ratio)
[Figure: Annual Payroll versus Total Employment; paired estimates with Lower Tolerance and Upper Tolerance lines, an Acceptance Region between the lines, and an Outlier Region outside each line]
• “Cone-shaped” tolerances
• Goes through origin
• Strong statistical association
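The cone-shaped acceptance region can be sketched in a few lines. This is a minimal illustration only: the function name `ratio_edit`, the tolerance values, and the sample pairs are hypothetical and do not come from the presentation.

```python
def ratio_edit(pairs, lower, upper):
    """Return True for each (employment, payroll) pair outside the tolerances.

    Both tolerance lines pass through the origin, so the acceptance region is
    lower * employment <= payroll <= upper * employment.
    """
    flags = []
    for employment, payroll in pairs:
        inside = lower * employment <= payroll <= upper * employment
        flags.append(not inside)
    return flags


# Hypothetical paired estimates: (total employment, annual payroll).
pairs = [(10, 500), (20, 1100), (50, 2400), (40, 6000)]

# Hypothetical tolerances on the payroll-per-employee ratio.
flags = ratio_edit(pairs, lower=40.0, upper=60.0)
print(flags)
```

Writing the test as a comparison of payroll against a multiple of employment, rather than dividing, avoids a division by zero for cells with no employment.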
Resistant Fences Methods
• Lower fence: $q_{25} - 1.5H$; upper fence: $q_{75} + 1.5H$, where $H = q_{75} - q_{25}$ is the interquartile range
• Different numbers of interquartile ranges (1.5 = Inner, 3 = Outer)
• Implicitly assumes symmetry
• May want to “symmetrize”, apply rule, use inverse transformation
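A minimal sketch of the resistant fences rule follows. The quartile definition (Python's "inclusive" interpolation method) and the sample ratios are assumptions made for illustration; the presentation does not say how the quartiles were computed.

```python
from statistics import quantiles


def resistant_fences(values, k=1.5):
    """Return (lower, upper) fences: q25 - k*H and q75 + k*H, H = q75 - q25."""
    q25, _, q75 = quantiles(values, n=4, method="inclusive")
    h = q75 - q25
    return q25 - k * h, q75 + k * h


def flag_outliers(values, k=1.5):
    """Return the values that fall outside the fences."""
    lower, upper = resistant_fences(values, k)
    return [v for v in values if v < lower or v > upper]


# Hypothetical set of cell ratios with one unusually large value.
ratios = [10, 12, 14, 16, 18, 20, 22, 24, 60]

inner = resistant_fences(ratios, k=1.5)   # inner fences
outer = resistant_fences(ratios, k=3.0)   # outer fences
flagged = flag_outliers(ratios, k=1.5)
print(inner, outer, flagged)
```

To "symmetrize" first, one would apply a transformation (a logarithm, for example) to the ratios, compute the fences on the transformed scale, and map the fences back with the inverse transformation.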
Asymmetric Fences Methods
• Lower fence: $q_{25} - 3(m - q_{25})$; upper fence: $q_{75} + 3(q_{75} - m)$, where $m$ is the median
• Different multipliers of the quartile-to-median distances (3 = Inner, 6 = Outer)
• Incorporates skewness of distribution in outlier rule (“Fences”)
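The asymmetric rule can be sketched the same way. This assumes the lower fence subtracts a multiple of (m − q25) and the upper fence adds a multiple of (q75 − m), with m the median. The "inclusive" quartile method and the sample data are illustrative assumptions.

```python
from statistics import quantiles


def asymmetric_fences(values, k=3.0):
    """Return (lower, upper) fences built from quartile-to-median distances."""
    q25, m, q75 = quantiles(values, n=4, method="inclusive")
    lower = q25 - k * (m - q25)   # short lower tail gives a tight lower fence
    upper = q75 + k * (q75 - m)   # long upper tail gives a wide upper fence
    return lower, upper


# Hypothetical right-skewed set of ratios.
skewed = [10, 11, 12, 13, 14, 18, 22, 30, 70]

inner = asymmetric_fences(skewed, k=3.0)   # inner fences
outer = asymmetric_fences(skewed, k=6.0)   # outer fences
flagged = [v for v in skewed if v < inner[0] or v > inner[1]]
print(inner, outer, flagged)
```

With these data the inner fences flag the value 70, while the outer upper fence sits exactly at 70 and does not flag it, which shows how much the multiplier matters.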
Robust Regression
• Least Trimmed Squares Robust Regression
• Resistant (minimizes median residual)
• Outlier: |residual| > 3 × robust M.S.E.
[Figure: Annual Payroll versus Total Employment; paired estimates with the robust regression line]
Issue at Origin (Historic Cell Ratio)
[Figure: Current Month's Number of Employees versus Prior Month's Number of Employees]
Hidiroglou-Berthelot (HB) Edit
[Figure: HB "Effects" versus Employment, with Upper Bound and Lower Bound]
• Accounts for magnitude of unit (variability at origin)
Hidiroglou-Berthelot (HB) Edit
• Two-step transformation (Ei)
– Centering transformation on ratios
– Magnitude transformation that accounts for the relative importance of large cases
• Asymmetric Fences “Type” Outlier Rule
• Key Parameters
– U = magnitude transformation parameter (0 ≤ U ≤ 1)
– C = controls width of outlier region
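The two-step transformation and the outlier rule can be sketched as follows. The presentation does not spell out the formulas, so this follows the commonly cited form of the HB edit. The constant `a = 0.05` (which keeps the bounds from collapsing when the effects cluster tightly), the "inclusive" quartile method, and the sample data are all assumptions. All inputs must be positive.

```python
from statistics import median, quantiles


def hb_effects(prior, current, u=0.5):
    """Two-step HB transformation of the ratios current/prior."""
    ratios = [c / p for p, c in zip(prior, current)]
    r_med = median(ratios)
    effects = []
    for p, c, r in zip(prior, current, ratios):
        # Step 1: centering, symmetric around the median ratio.
        s = 1 - r_med / r if r < r_med else r / r_med - 1
        # Step 2: magnitude, larger units receive more weight (0 <= u <= 1).
        effects.append(s * max(p, c) ** u)
    return effects


def hb_bounds(effects, c=10.0, a=0.05):
    """Asymmetric-fences style bounds on the effects."""
    e_q1, e_med, e_q3 = quantiles(effects, n=4, method="inclusive")
    d_q1 = max(e_med - e_q1, abs(a * e_med))
    d_q3 = max(e_q3 - e_med, abs(a * e_med))
    return e_med - c * d_q1, e_med + c * d_q3


# Hypothetical prior-period and current-period estimates for nine cells.
prior = [10, 20, 30, 40, 50, 60, 70, 80, 90]
current = [9, 19, 29.4, 40, 50, 61.2, 73.5, 88, 450]

effects = hb_effects(prior, current, u=0.3)
lower, upper = hb_bounds(effects, c=10.0)
flags = [e < lower or e > upper for e in effects]
print(flags)
```

Only the last cell, whose estimate grew fivefold, falls outside the bounds here. Raising `c` widens the acceptance region, and raising `u` makes the rule stricter for large cells relative to small ones.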
Multivariate Methods: Mahalanobis Distance
• Multivariate normal $(\mu, \Sigma)$
– $T(X)$ estimates $\mu$
– $C(X)$ estimates $\Sigma$
– $p$ is the number of distinct variables (items)
• Prone to masking (difficult to detect individual outliers)
$$MD_i^2 = (x_i - T(X))\, C(X)^{-1}\, (x_i - T(X))^{\top} \sim \chi^2_p$$
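A minimal sketch of the classical (non-robust) distance for p = 2 items follows, taking T(X) as the sample mean and C(X) as the sample covariance matrix. Those choices, the 0.975 cutoff level, and the sample records are illustrative assumptions. For two degrees of freedom the chi-square quantile has the closed form −2 ln(α), which avoids any external library.

```python
import math


def mahalanobis_sq_2d(points):
    """Squared Mahalanobis distances for a list of (x, y) records."""
    n = len(points)
    mx = sum(x for x, _ in points) / n
    my = sum(y for _, y in points) / n
    # Sample covariance matrix entries (divisor n - 1).
    sxx = sum((x - mx) ** 2 for x, _ in points) / (n - 1)
    syy = sum((y - my) ** 2 for _, y in points) / (n - 1)
    sxy = sum((x - mx) * (y - my) for x, y in points) / (n - 1)
    det = sxx * syy - sxy ** 2
    distances = []
    for x, y in points:
        dx, dy = x - mx, y - my
        # (dx, dy) C^{-1} (dx, dy)^T using the explicit 2x2 inverse.
        distances.append((syy * dx * dx - 2 * sxy * dx * dy + sxx * dy * dy) / det)
    return distances


# Hypothetical records: nine lie near y = 2x, the last one does not.
records = [(1, 2.1), (2, 3.9), (3, 6.2), (4, 7.8), (5, 10.1),
           (6, 12.0), (7, 13.9), (8, 16.2), (9, 18.0), (5, 20.0)]

md_sq = mahalanobis_sq_2d(records)
cutoff = -2 * math.log(0.025)   # 0.975 quantile of chi-square with 2 d.f.
flags = [d > cutoff for d in md_sq]
print(flags)
```

Because the mean and covariance are themselves pulled toward the unusual record, the distances of all records are compressed. With several outliers this compression produces masking, which is the motivation for robust estimates of T(X) and C(X).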
Robust Alternatives
• M-estimation (not considered)
• “Production Method”
• Minimum Volume Ellipse (MVE)
– Resistant (50% breakdown) and robust
• Minimum Covariance Determinant (MCD)
– Resistant (50% breakdown) and robust
• Assumption of Normality
– Log-transformation
Evaluation: Classify Item Estimates
• Input Value: Reported
• Final Value: Tabulated
• Ratio: Input/Final
• Classification: Outlier, Potential Outlier, or Not an Outlier
[Figure: histogram of Frequency Counts by Ratio Values]
Evaluation: Classify Ratios (Bivariate)
• Conservative
– Ratio is “outlier” if numerator or denominator is an outlier
• Anti-Conservative
– Ratio is “outlier” if numerator or denominator is an outlier or a potential outlier
Evaluation: Classify Records (Multivariate)
• Conservative
– Record is “outlier” if at least one estimate is an outlier
• Anti-Conservative
– Record is “outlier” if at least one estimate is an outlier or a potential outlier
Evaluation Statistics: Bivariate Comparisons
• Individual Test Level
– Type I Error Rate: proportion of false rejects
– Type II Error Rate: proportion of false accepts
– Hit Rate: proportion of flagged estimates that are outliers
• All-Test Level
– All-item Type II error rate
Evaluation Statistics: Multivariate Comparisons
• Type I error rate: the proportion of non-outlier records that are flagged as outliers
• Type II error rate: the proportion of outlier records that are not flagged as outliers (missed “bad” values)
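The evaluation statistics defined above can be computed from two parallel lists of booleans: what a method flagged and what the evaluation classified as a true outlier. The function name `error_rates` and the sample lists are hypothetical.

```python
def error_rates(flagged, is_outlier):
    """Return Type I rate, Type II rate, and hit rate as a dict."""
    pairs = list(zip(flagged, is_outlier))
    false_rejects = sum(f and not o for f, o in pairs)   # flagged non-outliers
    false_accepts = sum(o and not f for f, o in pairs)   # missed outliers
    hits = sum(f and o for f, o in pairs)                # flagged outliers
    n_outliers = sum(is_outlier)
    n_non_outliers = len(pairs) - n_outliers
    n_flagged = sum(flagged)
    nan = float("nan")
    return {
        "type_i": false_rejects / n_non_outliers if n_non_outliers else nan,
        "type_ii": false_accepts / n_outliers if n_outliers else nan,
        "hit_rate": hits / n_flagged if n_flagged else nan,
    }


# Hypothetical results for eight estimates.
flagged = [True, True, False, False, False, True, False, False]
is_outlier = [True, False, True, False, False, True, False, False]

rates = error_rates(flagged, is_outlier)
print(rates)
```

In this example one of five non-outliers is flagged (Type I rate 0.2), one of three outliers is missed (Type II rate 1/3), and two of the three flagged estimates are outliers (hit rate 2/3).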
Annual Capital Expenditures Survey (ACES)
• Sample Survey (Stratified SRS-WOR)
– ACE-1: Employer companies
– ACE-2: Non-employer companies (not discussed)
• New sample selection each year
• Total and year-to-year change estimates
– Total Capital Expenditures
– Structures (New and Used)
– Equipment (New and Used)
Capital Expenditures Data
• Characterized by
– Low year-to-year correlation (same company)
– Weak association with available auxiliary data
• Editing procedures focus on additivity
• Outlier correction at micro-level
Bivariate Comparisons
• Methods
– Robust Regression
– Resistant Fences: (Symmetric or Asymmetric) (Inner or Outer)
– HB Edit: (U = 0.3 or 0.5) (C = 10 or 20)
• Ratio tests
– Structures/Total
– New Structures/Structures
– New Structures/Used Structures
– Equipment/Total
– New Equipment/Equipment
Results – Individual Tests
• Robust Regression prone to swamping
– High Type I error rate (false rejects)
• Comparable performance with Asymmetric Inner Fences and HB Edit (U = 0.3, C = 10)
– Low Type I error rates
– High Hit Rates
– High Type II error rates
• Other variations of Resistant Fences and HB Edit not as good
Results – All-Tests
• Very large Type II error rates (approx. 50%)
– Robust regression
– Symmetric resistant outer fences
– HB edit with C = 20
• Improved Type II error rates (30% - 40%)
– Asymmetric inner fences
– HB edit (U = 0.3, C = 10)
Multivariate Results
• Original data: considered methods ineffective
• Log-transformed data: improved performance (MCD and MVE)
– Reduced Type I error rates
– Comparable Type II error rates (to original-data MCD and MVE)
Conservative Results: 2002
[Figure: Type I and Type II error rates for Production-MD, MCD (original), MVE (original), MCD (log-transformed), and MVE (log-transformed)]
Multivariate Versus Bivariate: Different Outcomes (Conservative)
Combined HB edits flag more “outliers”:
– Higher Type I error rate
– Lower Type II error rates for the complete set of HB edits
[Figure: Counts of Flagged Non-Outliers, Type I Errors (False Rejects), HB versus MVE, 2002 and 2003; bar values: 8, 0, 11, 4]
[Figure: Counts of Missed Outliers, Type II Errors (False Accepts), HB versus MVE, 2002 and 2003; bar values: 13, 14, 0, 4]
Comments
• Economic data with inconsistent statistical association between items in each collection period
• Critical values must be determined by the data set at hand (no “hard-coding”)
• Dynamically
– Standardize the comparisons (HB edit, log transformation)
– Compute outlier limits
• Could try hybrid approach:
– Multivariate method plus a few current cell ratio tests with the HB edit
– Perform all bivariate tests, but unduplicate cells before analyst review
Final Thoughts/Next Steps
• Examined one set of economic data and considered only two separate collections from this program
• Extrapolation would be foolish
• My results need to be validated on other economic data sets
– a more typical periodic business survey and/or
– a well-constructed simulation study