Microdata Simulation for Confidentiality of Tax Returns Using Quantile Regression and Hot Deck
Jennifer Huckett
Iowa State University
June 20, 2007
Outline
• Motivation
• Disclosure Limitation Methods
• Risk Assessment
• Simulation Study
• Results & Conclusions
Motivation
• Iowa Department of Revenue (IDR)
– Collects and maintains individual tax return data
• Legislative Services Agency (LSA)
– Examines impact of tax law changes on liability
• Current system
– LSA submits requests to IDR
– IDR computes liability, reports to LSA
– Occurs several times each year
– Inefficient for both IDR and LSA
• Solutions
– Secure/remote access server
• Data are not released
• Some analyses suppressed
– Statistical disclosure limitation (SDL)
• Tabular
• Microdata
– enable IDR to provide LSA with data set
– allow LSA to compute liability with ease and accuracy
– MUST ENSURE CONFIDENTIALITY of RECORDS!
Establishment Connection
• Very skewed distributions, unusual associations among distributions
• Groups of variables are related to one another in unusual ways
• Similar to business tax data or business expenditure/revenue data
• Confidentiality is critical
Traditional Approaches
• Recoding (e.g. aggregation)
• Noise addition
• Data swapping
• Data suppression
• Imputation
• Combinations of these
Our Approach
• Synthetic microdata simulation
– Retain key demographic variables
– Simulate values for some variables
• Quantile regression conditional on key variables
• Compute fitted values at selected quantiles
– Impute values for remaining variables
• Hot deck + rank swap
• Hot deck based on simulated income variables
Quantile Regression
• $\min_{\beta} \sum_i \rho_\tau(y_i - \xi(x_i, \beta))$ (Quantile Regression, Koenker 2004)
– $\rho_\tau(y_i - \hat{y})$ = “tilted absolute value function” for the $\tau$th quantile
– $\xi(x_i, \beta)$ = linear function of predictors ($x_i$)
• Performed in R
– quantreg package
– rq function
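To make the “tilted absolute value function” concrete, here is a minimal sketch in Python (the talk itself used R's quantreg package). It assumes the standard form $\rho_\tau(u) = u(\tau - 1[u < 0])$ and fits only a constant, with no predictors. The function names and the toy data are hypothetical and are not taken from the IDR data.

```python
def check_loss(u, tau):
    """Tilted absolute value function: rho_tau(u) = u * (tau - 1[u < 0])."""
    return u * (tau - (1.0 if u < 0 else 0.0))


def total_loss(y, fitted, tau):
    """Sum of check losses of the observations y against one constant fit."""
    return sum(check_loss(yi - fitted, tau) for yi in y)


def best_constant(y, tau):
    """Observed value that minimizes the total check loss.

    With no predictors the fit is a single constant, and a minimizer can
    always be found among the observed values, so a direct search suffices.
    """
    return min(sorted(set(y)), key=lambda c: total_loss(y, c, tau))


# Hypothetical, deliberately skewed toy values (not real tax data).
toy_wages = [12, 15, 18, 22, 30, 41, 55, 90, 250]
median_fit = best_constant(toy_wages, 0.5)    # minimizer at tau = 0.5
upper_fit = best_constant(toy_wages, 0.75)    # minimizer at tau = 0.75
```

Positive residuals are weighted by $\tau$ and negative ones by $1 - \tau$, which is what pulls the fit toward the chosen quantile rather than the mean.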
Simulate via Quantile Regression
• Estimate $\xi(x_i, \beta)$ for quantiles from the set $\tau = \{0.01, 0.02, \ldots, 0.99\}$
• For each record on variable y
– Randomly select $\tau^*$ ~ Uniform(0,1)
– Compute fitted $\hat{y}^*$ given x at the $\tau$ above and below $\tau^*$
– Interpolate to obtain $y^{**}$ = simulated value
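The draw-and-interpolate step above can be sketched as follows. This is an illustration under stated assumptions, not the author's implementation: the fitted values per quantile level are taken as given (in the talk they come from rq fits in R), interpolation is linear, and draws that fall outside the grid are clamped to the end fits. The name `simulate_value` and its `tau_star` argument are hypothetical.

```python
import random
from bisect import bisect_right

# Grid of quantile levels named on the slide: 0.01, 0.02, ..., 0.99.
TAUS = [k / 100 for k in range(1, 100)]


def simulate_value(fitted, taus=TAUS, tau_star=None, rng=random):
    """Return one simulated value for a record.

    fitted[k] is the fitted value for this record's predictors at level
    taus[k]. tau_star may be passed in for reproducibility; otherwise it
    is drawn from Uniform(0, 1).
    """
    if tau_star is None:
        tau_star = rng.random()
    # Assumption: outside the grid, fall back to the nearest end fit.
    if tau_star <= taus[0]:
        return fitted[0]
    if tau_star >= taus[-1]:
        return fitted[-1]
    hi = bisect_right(taus, tau_star)   # first grid level above tau_star
    lo = hi - 1                         # grid level at or below tau_star
    weight = (tau_star - taus[lo]) / (taus[hi] - taus[lo])
    return fitted[lo] + weight * (fitted[hi] - fitted[lo])


# Hypothetical fitted values for one record: 100, 200, ..., 9900.
example_fits = [100 * k for k in range(1, 100)]
example_draw = simulate_value(example_fits, tau_star=0.255)
```

With `tau_star=0.255` the two neighbouring levels are 0.25 and 0.26, so the result lies halfway between their fitted values.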
IDR Application: Key Demographic Variables
• Number of dependents
– 0, 1, 2,…
– Categorized into
• 0
• 1
• ≥2
• County
– 1,…,99
– Categorized into 4 population size groups
• State filing status
1. single
2. married filing joint
3. married filing separate on combined return
4. married filing separate returns
5. head of household
6. widow(er) with dependent child
– Categorized into
• 1
• 2 and 3
• 4, 5, and 6
IDR Application: Quantile Regression for wages
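$$
\begin{aligned}
\text{wages} ={}& \beta_0 + \beta_1\,\text{age} + \beta_2\,\text{age}^2 + \beta_3\,\text{age}^3 + \beta_4\,\text{age}^4 \\
&+ \beta_5\, I[\#\text{dep} = 1] + \beta_6\, I[\#\text{dep} \ge 2] + \beta_7\, I[\text{sfs} = 2,3] \\
&+ \beta_8\, I[\text{sfs} = 4,5,6] + \beta_9\, I[\text{county} = 2] + \beta_{10}\, I[\text{county} = 3] + \beta_{11}\, I[\text{county} = 4]
\end{aligned}
$$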
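As an illustration of how the predictors in the wages model could be assembled, the sketch below builds one design row from the key demographic variables: an intercept, a quartic in age, and indicators for the dependent, filing-status, and county groups, with the first level of each group as the baseline. The function names are hypothetical, and the county argument is assumed to be the population-size group (1 to 4) rather than the raw county number.

```python
def dependents_group(n_dependents):
    """Collapse the dependent count into 0, 1, or 2 (meaning two or more)."""
    return min(n_dependents, 2)


def filing_group(sfs):
    """Collapse state filing status 1-6 into the groups 1, 2-3, and 4-6."""
    if sfs == 1:
        return "1"
    if sfs in (2, 3):
        return "2-3"
    if sfs in (4, 5, 6):
        return "4-6"
    raise ValueError("state filing status must be 1-6")


def design_row(age, n_dependents, sfs, county_group):
    """Predictor row with 12 entries, matching beta_0 through beta_11."""
    dep = dependents_group(n_dependents)
    fs = filing_group(sfs)
    return [
        1.0, age, age ** 2, age ** 3, age ** 4,       # intercept and age quartic
        float(dep == 1), float(dep == 2),             # dependents: 1, >= 2
        float(fs == "2-3"), float(fs == "4-6"),       # filing status groups
        float(county_group == 2), float(county_group == 3),
        float(county_group == 4),                     # county size groups
    ]


# Hypothetical record: age 40, three dependents, head of household, group 2.
example_row = design_row(40, 3, 5, 2)
```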
IDR Application: Hot Deck and Rank Swap for Federal Tax
• Hot Deck
– Mahalanobis distance $d(i,j) = (x_i - x_j)' S_{xx}^{-1} (x_i - x_j)$
– closest 20 records
• Rank Swap
– compute sample rank, r
– draw random rank, r*, from discrete Uniform[r-10, r+10]
– impute value from record with rank r*
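The two steps above can be sketched in plain Python. This is a simplified illustration, not the procedure as implemented for the IDR data: it assumes records with exactly two matching variables (so the covariance matrix can be inverted by hand), treats the two steps as separate helpers because the slides do not say how they are chained, and clamps r* to the valid rank range at the ends. All function names are hypothetical.

```python
import random


def inverse_covariance_2d(rows):
    """Inverse of the 2x2 sample covariance matrix of (x1, x2) rows."""
    n = len(rows)
    m1 = sum(r[0] for r in rows) / n
    m2 = sum(r[1] for r in rows) / n
    s11 = sum((r[0] - m1) ** 2 for r in rows) / (n - 1)
    s22 = sum((r[1] - m2) ** 2 for r in rows) / (n - 1)
    s12 = sum((r[0] - m1) * (r[1] - m2) for r in rows) / (n - 1)
    det = s11 * s22 - s12 * s12
    return [[s22 / det, -s12 / det], [-s12 / det, s11 / det]]


def mahalanobis(xi, xj, s_inv):
    """d(i, j) = (x_i - x_j)' S^-1 (x_i - x_j) for two-variable records."""
    d0, d1 = xi[0] - xj[0], xi[1] - xj[1]
    return (d0 * (s_inv[0][0] * d0 + s_inv[0][1] * d1)
            + d1 * (s_inv[1][0] * d0 + s_inv[1][1] * d1))


def nearest_donors(target, donors, s_inv, k=20):
    """Indices of the k donor records closest to the target."""
    order = sorted(range(len(donors)),
                   key=lambda j: mahalanobis(target, donors[j], s_inv))
    return order[:k]


def rank_swap(values, window=10, rng=random):
    """Replace each value by the value at a nearby random rank.

    For a value with rank r, draw r* from discrete Uniform[r - window,
    r + window] and impute the value held by the record with rank r*.
    """
    order = sorted(range(len(values)), key=lambda i: values[i])
    n = len(values)
    swapped = [None] * n
    for r, i in enumerate(order):
        r_star = rng.randint(r - window, r + window)
        r_star = max(0, min(n - 1, r_star))   # assumption: clamp at the ends
        swapped[i] = values[order[r_star]]
    return swapped


# Hypothetical two-variable donor records on a square grid.
toy_rows = [(0, 0), (2, 0), (0, 2), (2, 2)]
toy_s_inv = inverse_covariance_2d(toy_rows)
```

Scaling by $S_{xx}^{-1}$ keeps a variable with a large spread, such as income, from dominating the choice of donors.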
Disclosure Risk Measurement
• Using methods detailed in Reiter (2005) and Duncan and Lambert (1986, 1989)
• Examine specific records
– Original records
– Released records
– Model intruder behavior to assess disclosure risk
• Simulation Study
Original and Released Records
Intruder Behavior
• Target record, t
– Intruder has information on target
– Attempts to match t in released records
• Released records j = 1,…,r in Z
• Probability that record j belongs to target t is $\Pr(J = j \mid t, Z)$
• As – probability decreases
– disclosure risk decreases
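A very simplified intruder model gives a feel for how the match probability behaves. The sketch below assumes the intruder knows some variables of the target exactly and treats every released record that agrees on those variables as equally likely to be the target. That uniform-over-matches rule is an assumption made here for illustration; the methods cited in the talk also account for perturbed variables. The function name and the toy records are hypothetical.

```python
def match_probabilities(target, released, keys):
    """Pr(J = j | t, Z) under a uniform-over-matches intruder model.

    Returns a dict mapping the index of each released record that agrees
    with the target on every variable in keys to its match probability.
    """
    matches = [j for j, rec in enumerate(released)
               if all(rec[k] == target[k] for k in keys)]
    if not matches:
        return {}
    p = 1 / len(matches)
    return {j: p for j in matches}


# Hypothetical released records and target.
toy_released = [
    {"age": 34, "marital": "single"},
    {"age": 34, "marital": "single"},
    {"age": 34, "marital": "married"},
    {"age": 34, "marital": "single"},
]
toy_target = {"age": 34, "marital": "single"}
toy_probs = match_probabilities(toy_target, toy_released, ["age", "marital"])
```

In this model the probability attached to any one record falls as more released records match the target.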
Simulation Study
Schemes for SDL influence divisions of A into $A_p$ (available, perturbed) and $A_d$ (available, unperturbed).
SDL Schemes in Simulation Study
• No SDL
• Swap 30% marital status
• Swap 30% marital status and minority
• Recode age into 5 year intervals
• Recode age into 5 year intervals and swap 30% marital status and minority
• Simulation via quantile regression and hot deck
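Two of the traditional schemes in this list, recoding age and swapping a categorical variable, can be sketched as below. The slides do not say how the 30% of records were chosen or paired, so the sketch assumes a simple random sample of records swapped in pairs. The function names and the interval label format are hypothetical.

```python
import random


def recode_age(age, width=5):
    """Recode an exact age into a width-year interval label, e.g. 37 -> '35-39'."""
    low = (age // width) * width
    return f"{low}-{low + width - 1}"


def swap_fraction(records, field, fraction=0.30, rng=random):
    """Return copies of the records with field swapped between random pairs.

    About fraction of the records are selected (rounded down to an even
    count) and paired off; each pair exchanges its value of field.
    """
    out = [dict(r) for r in records]        # leave the originals untouched
    n_swap = int(len(out) * fraction)
    n_swap -= n_swap % 2                    # swaps happen in pairs
    chosen = rng.sample(range(len(out)), n_swap)
    for a, b in zip(chosen[0::2], chosen[1::2]):
        out[a][field], out[b][field] = out[b][field], out[a][field]
    return out
```

Swapping leaves the marginal distribution of the swapped variable unchanged, while recoding coarsens the variable for every record.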
Targets
• Intruder has information on target, t, and wants to match with released records
• Consider a few targets
– Unique record
– Rare record
– Common record
Results from Simulation Study
$\Pr(J = j \mid t, Z)$

| target | No SDL | Marital swap | Marital and minority swap | Age recode | Swaps and recode | Quantile regression and hot deck |
| --- | --- | --- | --- | --- | --- | --- |
| unique | 1 | 1 | 0.1046 | 1 | 0.0178 | 0.0895 |
| rare | 0.3333 | 0.1044 | 0.1304 | 0.0526 | 0.0225 | 0.0016 |
| common | 0.0385 | 0.0320 | 0.0320 | 0.0068 | 0.0055 | 0.0008 |
Conclusions & Future Work
• Risk behaves as we expect
– increased SDL
– decreased disclosure risk (except for unique!)
• Apply SDL techniques to American Community Survey data at US Census Bureau
• Compare traditional techniques to quantile regression and hot deck by computing risk
• Measure utility of released data
Acknowledgements
• Iowa Department of Revenue
• Iowa’s Legislative Services Agency
• National Institute of Statistical Sciences
• US Census Bureau Dissertation Fellowship Award
References
• Duncan, G.T. and Lambert, D. 1986. “Disclosure-Limited Data Dissemination,” Journal of the American Statistical Association, 81, 10-28.
• Duncan, G.T. and Lambert, D. 1989. “The Risk of Disclosure for Microdata,” Journal of Business and Economic Statistics, 7, 207-217.
• Koenker, R. 2005. “Introduction,” Quantile Regression, Econometric Society Monograph Series, Cambridge University Press.
• Reiter, J.P. 2005. “Estimating Risks of Identification Disclosure in Microdata,” Journal of the American Statistical Association, 100, 472, 1103-1113.