VAT Tax Gap prediction: a 2-steps
Gradient Boosting approach
Giovanna Tagliaferri
13 March 2019
Outline
1 Introduction
2 2-Steps Gradient Boosting
3 Application on results from fiscal audits
  - Dataset description
  - Selection bias correction
  - Potential tax base estimate
4 VAT base gap propensity analysis
5 Conclusions
VAT Tax Gap prediction: a 2-stepsGradient Boosting approach 2 of 16
Preamble
• Internship: work carried out at Sogei in collaboration with the Italian Revenue
Agency.
• What: produce an estimate of the Italian VAT Tax Gap for the year 2011.
- Main difficulty: selection bias ⇒ taxpayers are not randomly selected for audit.
• How: a fully nonparametric two-step approach based on Gradient Boosting,
able to provide estimates of the potential tax base (BIT) and of its
undeclared part (BIND), where BID denotes the declared tax base:
BIT = BID + BIND
Application Context
• The statistical unit is the individual firm, i.e. an individual who carries out
business activities or self-employment.
• The available information has been gathered from two sources:
- the register of Irpef, VAT and Irap declarations (available for the entire
population);
- the compliance control records (available for assessed taxpayers only).
• In general, only 2% of taxpayers are subject to tax assessment.
2-Steps Gradient Boosting
First step: Selection bias correction
Gradient Boosting classification model, aimed at estimating the probability that
unit i belongs to the set S of assessed units:
$\hat{\pi}_i = P(i \in S \mid X)$.
Target variable: presence of a compliance control.
Second step: Potential tax base estimate
Gradient Boosting regression model, fitted only on the assessed units, with weights:
$\nu_i \propto 1 / \hat{\pi}_i$.
Target variable: potential tax base.
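The two steps above can be sketched as follows. This is only a minimal illustration on synthetic data: scikit-learn is used for convenience (the slides do not state which software was used), and the names `assessed`, `bit`, the simulated covariates and the clipping of the estimated probabilities are all assumptions made here.

```python
import numpy as np
from sklearn.ensemble import GradientBoostingClassifier, GradientBoostingRegressor

rng = np.random.default_rng(0)

# Synthetic stand-in data (hypothetical): 2,000 taxpayers, 5 covariates.
n, p = 2000, 5
X = rng.normal(size=(n, p))

# Selection into audit depends on the covariates, so the assessed set S is not random.
p_select = 1.0 / (1.0 + np.exp(-(X[:, 0] - 1.0)))
assessed = rng.random(n) < p_select

# Potential tax base; in practice it is observed only for assessed units.
bit = 10.0 + 3.0 * X[:, 0] + X[:, 1] ** 2 + rng.normal(scale=0.5, size=n)

# Step 1: classifier estimating pi_i = P(i in S | X) on all units.
clf = GradientBoostingClassifier(learning_rate=0.1, n_estimators=100, random_state=0)
clf.fit(X, assessed)
# Clipping away from zero is an added safeguard (assumption), not part of the slides.
pi_hat = np.clip(clf.predict_proba(X)[:, 1], 1e-3, 1.0)

# Step 2: regression on the assessed units only, weighted by 1 / pi_hat.
weights = 1.0 / pi_hat[assessed]
reg = GradientBoostingRegressor(learning_rate=0.1, n_estimators=100, random_state=0)
reg.fit(X[assessed], bit[assessed], sample_weight=weights)

# Predict the potential tax base for every unit, assessed or not.
bit_hat = reg.predict(X)
```

Units that were unlikely to be audited given their covariates receive larger weights, so the regression is pulled towards the profile of the whole population rather than that of the audited subset.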
Data
A matrix of approximately 2.3 million taxpayers by 160 variables.
Problem: hardware limits
Solution: subsampling
                  Total population            Sample
Control type      Frequency    Percentage     Frequency    Percentage
Not Assessed      2′275′219    99.18%         45′489       70.85%
Assessed          18′718       0.82%          18′718       29.15%
Total             2′293′937    100%           64′207       100%
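A minimal sketch of a subsampling scheme that reproduces the counts in the table: all assessed units are kept and the not-assessed units are drawn at random. Simple random sampling without replacement is an assumption made here (the slides do not say how the subsample was drawn), and the indicator vector `assessed` is a hypothetical stand-in for the real register.

```python
import numpy as np

rng = np.random.default_rng(0)

# Counts taken from the table above.
n_not_assessed, n_assessed = 2_275_219, 18_718
n_keep = 45_489  # not-assessed units retained in the sample

# Hypothetical indicator vector for the whole population: True = assessed.
assessed = np.zeros(n_not_assessed + n_assessed, dtype=bool)
assessed[:n_assessed] = True

# Keep every assessed unit and draw the not-assessed ones without replacement.
idx_assessed = np.flatnonzero(assessed)
idx_not = rng.choice(np.flatnonzero(~assessed), size=n_keep, replace=False)
sample_idx = np.concatenate([idx_assessed, idx_not])

share_assessed = assessed[sample_idx].mean()
print(f"sample size: {sample_idx.size}, assessed share: {share_assessed:.2%}")
# sample size: 64207, assessed share: 29.15%
```

Because the assessed units are over-represented in the sample (29.15% against 0.82% in the population), selection probabilities estimated on the sample refer to the sample, not to the population.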
Selection bias correction
Hyperparameters were tuned via cross-validation, with the sample split
into train (70%) and test (30%) sets. The optimal choice was:
$\{\lambda_{\text{opt}} = 0.1,\ \text{n.iter}_{\text{opt}} = 998\} \rightarrow AUC = 0.79$
The most discriminating variables:
- region in which the firm operates
- branch of activity
- number of employees
- revenues
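The tuning procedure can be sketched as below, reading λ as the learning rate (shrinkage) and n.iter as the number of boosting iterations, which is an interpretation of the notation rather than something the slides spell out. The data are synthetic, the grid is deliberately tiny, and the stratified split and the 3-fold setting are assumptions.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import GridSearchCV, train_test_split

# Synthetic, imbalanced stand-in for the audit indicator (hypothetical data).
X, y = make_classification(n_samples=1500, n_features=10, n_informative=4,
                           weights=[0.7, 0.3], random_state=0)

# 70% train / 30% test; stratifying on the target is an assumption.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, stratify=y, random_state=0)

# Small illustrative grid over learning rate (lambda) and number of iterations.
param_grid = {"learning_rate": [0.05, 0.1], "n_estimators": [50, 100]}
search = GridSearchCV(GradientBoostingClassifier(random_state=0),
                      param_grid, scoring="roc_auc", cv=3)
search.fit(X_train, y_train)

# AUC of the selected model on the held-out 30%.
auc = roc_auc_score(y_test, search.predict_proba(X_test)[:, 1])
print(search.best_params_, round(auc, 3))
```

The cross-validation runs on the training part only, so the test AUC remains an out-of-sample figure.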
Potential tax base estimate
The regression model was estimated only on the 18′718 taxpayers subject to
tax assessment. Cross-validation was performed here as well.
$\{\lambda_{\text{opt}} = 0.1,\ \text{n.iter}_{\text{opt}} = 38\} \rightarrow R^2_{BIT} = 0.83$
Most important variables:
- region to which the firm belongs
- taxable amount of other purchases and imports
- total of operations that generate VAT
- operating costs and fiscal added value
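One possible way to pick the number of iterations and rank the covariates for the weighted regression is sketched below. A single validation split stands in for the cross-validation mentioned above, and the data, the weights `w` (playing the role of 1/π̂ from the first step) and all variable names are hypothetical.

```python
import numpy as np
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.metrics import r2_score
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)

# Hypothetical assessed units: covariates, target and first-step weights.
n = 1500
X = rng.normal(size=(n, 6))
y = 5.0 + 2.0 * X[:, 0] - X[:, 1] * X[:, 2] + rng.normal(scale=0.5, size=n)
w = 1.0 / rng.uniform(0.1, 0.9, size=n)

X_tr, X_val, y_tr, y_val, w_tr, w_val = train_test_split(
    X, y, w, test_size=0.3, random_state=0)

reg = GradientBoostingRegressor(learning_rate=0.1, n_estimators=200, random_state=0)
reg.fit(X_tr, y_tr, sample_weight=w_tr)

# Weighted validation R^2 after each boosting iteration.
r2_path = [r2_score(y_val, pred, sample_weight=w_val)
           for pred in reg.staged_predict(X_val)]
best_n_iter = int(np.argmax(r2_path)) + 1
print(best_n_iter, round(max(r2_path), 3))

# Covariate indices sorted by impurity-based importance, most important first.
ranking = np.argsort(reg.feature_importances_)[::-1]
```

The importance ranking is computed on the full 200-iteration model; refitting with `n_estimators=best_n_iter` before reading it off would be a natural refinement.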
Results
The Heckman model has been estimated on the same sample for comparative purposes.
                         Gradient Boosting         Heckman
                         Train        Test         Test
BIND_TOT                 0.727 bn     0.314 bn     0.314 bn
BIND_TOT (estimated)     0.693 bn     0.292 bn     0.290 bn
BIT_TOT                  3.194 bn     1.316 bn     1.316 bn
BIT_TOT (estimated)      3.159 bn     1.292 bn     1.231 bn
R²_BIT                   0.836        0.828        0.657
R²_adj,BIT               0.834        0.826        0.652
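To compare the two models on the aggregated totals, the signed relative error of each estimated total on the test set can be computed from the rounded figures in the table. The helper `relative_error` is introduced here only for illustration.

```python
# Test-set totals in billions, copied from the Results table.
observed = {"BIND": 0.314, "BIT": 1.316}
predicted = {
    "Gradient Boosting": {"BIND": 0.292, "BIT": 1.292},
    "Heckman": {"BIND": 0.290, "BIT": 1.231},
}

def relative_error(estimate: float, truth: float) -> float:
    """Signed relative error of an estimated total."""
    return (estimate - truth) / truth

for model, totals in predicted.items():
    for name, value in totals.items():
        err = relative_error(value, observed[name])
        print(f"{model:17s} {name:4s} {err:+.1%}")
```

With these rounded values both models underestimate the totals: Gradient Boosting by about 7.0% for BIND and 1.8% for BIT, Heckman by about 7.6% and 6.5% respectively.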
Tax evasion propensity analysis
The estimated model was used to obtain predictions for the non-assessed
taxpayers.
The tax evasion propensity was then studied over the whole sample:
$\mathrm{Prop} = \dfrac{\sum_{i=1}^{N} \widehat{BIND}_i}{\sum_{i=1}^{N} \widehat{BIT}_i}$
The lower the ratio, the better the compliance.
                Gradient Boosting    Heckman
Propensity      30.40%               29.77%
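The propensity index is a ratio of sums, not an average of unit-level ratios. A minimal sketch, with a hypothetical function name and made-up unit-level predictions:

```python
from typing import Sequence

def propensity(bind_hat: Sequence[float], bit_hat: Sequence[float]) -> float:
    """Ratio of total estimated undeclared base to total estimated potential base."""
    total_bit = sum(bit_hat)
    if total_bit <= 0:
        raise ValueError("total estimated BIT must be positive")
    return sum(bind_hat) / total_bit

# Hypothetical unit-level predictions for four taxpayers.
bit_hat = [100.0, 250.0, 80.0, 70.0]
bind_hat = [30.0, 50.0, 40.0, 30.0]
print(f"{propensity(bind_hat, bit_hat):.2%}")  # 150 / 500 -> 30.00%
```

Taxpayers with a larger estimated potential base therefore weigh more in the index than they would in a simple mean of individual ratios.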
Propensity by sex and age
                       Gradient Boosting                   Heckman
Sex          n         BIT (bn)   BIND (bn)   Prop         BIT (bn)   BIND (bn)   Prop
Female       16053     2.34       0.81        34.74%       2.22       0.70        31.39%
Male         48154     8.72       2.55        29.25%       8.73       2.56        29.35%
Total        64207     11.05      3.36        30.40%       10.95      3.26        29.77%
                       Gradient Boosting                   Heckman
Age          n         BIT (bn)   BIND (bn)   Prop         BIT (bn)   BIND (bn)   Prop
[18 − 25)    976       0.13       0.05        39.07%       0.13       0.04        36.71%
[25 − 45)    28250     4.26       1.45        34.10%       4.11       1.30        31.61%
[45 − 65)    30496     5.64       1.60        28.45%       5.64       1.60        28.37%
over 65      4485      1.02       0.25        24.65%       1.08       0.32        29.28%
Total        64207     11.05      3.36        30.40%       10.95      3.26        29.77%
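Group-level propensities such as those in the tables above can be obtained by summing the unit-level estimates within each group before taking the ratio. The records below are made up for illustration; they are not the data behind the tables.

```python
from collections import defaultdict

# Hypothetical unit-level records: (group label, estimated BIT, estimated BIND).
records = [
    ("Female", 120.0, 45.0),
    ("Male", 300.0, 80.0),
    ("Female", 80.0, 25.0),
    ("Male", 200.0, 70.0),
]

totals = defaultdict(lambda: [0.0, 0.0])  # group -> [sum BIT, sum BIND]
for group, bit_hat, bind_hat in records:
    totals[group][0] += bit_hat
    totals[group][1] += bind_hat

# Propensity per group: ratio of sums within the group.
prop_by_group = {g: bind / bit for g, (bit, bind) in totals.items()}

# The overall propensity is the ratio of the grand totals, which equals the
# BIT-weighted average of the group propensities.
grand_bit = sum(bit for bit, _ in totals.values())
grand_bind = sum(bind for _, bind in totals.values())
overall = grand_bind / grand_bit

for g, prop in prop_by_group.items():
    print(f"{g:6s} {prop:.2%}")
print(f"Total  {overall:.2%}")
```

Here the groups come out at 35.00% and 30.00% and the total at 31.43%, which lies between them and closer to the group with the larger estimated base.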
Propensity by geographic area
[Figure, two panels: a) Heckman, b) Gradient Boosting]
Conclusions
• Advantages:
- all data are processed
- distribution-free approach
- no variable transformation is required
- no problems with multicollinearity
• Further developments:
- extension of the analysis to the entire population
- greater robustness via an ensemble with other models (XGBoost and neural
networks)
Thanks for your attention!