Post on 12-Aug-2020

VAT Tax Gap prediction: a 2-steps Gradient Boosting approach

Giovanna Tagliaferri

13 March 2019

Outline

1 Introduction

2 2-Steps Gradient Boosting

3 Application on results from fiscal audits

Dataset description

Selection bias correction

Potential tax base estimate

4 VAT base gap propensity analysis

5 Conclusions

VAT Tax Gap prediction: a 2-stepsGradient Boosting approach 2 of 16

Preamble

• Internship: work carried out at Sogei in collaboration with the Italian Revenue Agency.

• What: produce an estimate of the Italian VAT Tax Gap for the year 2011.

- Major difficulty: selection bias ⇒ taxpayers are not randomly selected for assessment.

• How: a fully nonparametric 2-step approach, based on Gradient Boosting, that provides estimates of the potential tax base (BIT) and of its undeclared part (BIND).

BIT = BID + BIND

where BID is the declared tax base.

Application Context

• The statistical unit is the individual firm: an individual who carries out a business activity or is self-employed.

• The available information was gathered from two sources:

- the register of Irpef, VAT and Irap declarations (available for the entire population);

- the compliance control papers (available for assessed taxpayers only).

• Generally, only 2% of taxpayers are subject to tax assessment.

2-Steps Gradient Boosting

First step: Selection bias correction

Gradient Boosting classification model, aimed at estimating the probability that unit i belongs to the set S of assessed taxpayers:

$\hat{\pi}_i = P(i \in S \mid X)$

Target variable: presence of a compliance control.

Second step: Potential tax base estimate

Gradient Boosting regression model, fitted only on the assessed units, with weights:

$\nu_i \propto 1 / \hat{\pi}_i$

Target variable: potential tax base.
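
The two steps can be sketched in Python as follows. This is only a minimal illustration on synthetic data: the use of scikit-learn, the variable names, the data-generating process and the clipping of the estimated probabilities are all assumptions made for the example, since the slides do not state which software or settings were used.

```python
import numpy as np
from sklearn.ensemble import GradientBoostingClassifier, GradientBoostingRegressor

rng = np.random.default_rng(0)

# Synthetic stand-in for the taxpayer data (hypothetical, not the real dataset).
n, p = 2000, 5
X = rng.normal(size=(n, p))

# Selection into assessment depends on X, so assessed units are not a random sample.
p_select = 1.0 / (1.0 + np.exp(-(X[:, 0] - 1.0)))
assessed = rng.random(n) < p_select

# Potential tax base; in practice it is observed only for assessed units.
bit = 10.0 + 2.0 * X[:, 0] + X[:, 1] + rng.normal(scale=0.5, size=n)

# Step 1: classification model estimating pi_i = P(i in S | X) on all units.
clf = GradientBoostingClassifier(n_estimators=100, learning_rate=0.1, random_state=0)
clf.fit(X, assessed)
pi_hat = clf.predict_proba(X)[:, 1]
# Added safeguard (assumption): avoid exploding weights for tiny probabilities.
pi_hat = np.clip(pi_hat, 1e-3, 1.0)

# Step 2: regression on assessed units only, with weights nu_i proportional to 1 / pi_hat_i.
weights = 1.0 / pi_hat[assessed]
reg = GradientBoostingRegressor(n_estimators=100, learning_rate=0.1, random_state=0)
reg.fit(X[assessed], bit[assessed], sample_weight=weights)

# Predict the potential tax base for every unit, assessed or not.
bit_hat = reg.predict(X)
```

Units that were unlikely to be assessed receive larger weights, so the regression is pulled towards the profile of the whole population rather than that of the audited taxpayers.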

Data

Matrix of approximately 2.3 million taxpayers by 160 variables.

Problem: hardware limits

Solution: subsampling

                 Total population           Sample
Control type     Frequency    Percentage    Frequency    Percentage
Not Assessed     2′275′219    99.18%        45′489       70.85%
Assessed         18′718       0.82%         18′718       29.15%
Total            2′293′937    100%          64′207       100%
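
The counts in the table suggest that all assessed taxpayers were kept and only the not-assessed ones were subsampled. A possible sketch of such a subsample is given below; simple random sampling without replacement is an assumption, since the slides do not describe the sampling scheme, and the indicator vector is a hypothetical stand-in for the real data.

```python
import numpy as np

rng = np.random.default_rng(0)

# Counts taken from the table above.
n_not_assessed, n_assessed = 2_275_219, 18_718
n_not_assessed_sample = 45_489

# Hypothetical indicator vector for the population: 1 = assessed, 0 = not assessed.
assessed = np.zeros(n_not_assessed + n_assessed, dtype=np.int8)
assessed[:n_assessed] = 1

idx_assessed = np.flatnonzero(assessed == 1)
idx_not = np.flatnonzero(assessed == 0)

# Keep every assessed unit; draw the not-assessed ones at random without replacement.
idx_not_sample = rng.choice(idx_not, size=n_not_assessed_sample, replace=False)
sample_idx = np.concatenate([idx_assessed, idx_not_sample])

share_assessed = n_assessed / sample_idx.size
print(f"sample size: {sample_idx.size}, assessed share: {share_assessed:.2%}")
```

Because the assessed share rises from 0.82% to 29.15%, selection probabilities estimated on the sample are on a different scale from those in the population; the weights are defined only up to proportionality, but this is worth keeping in mind when extending the analysis to the entire population.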

Selection bias correction

Hyperparameters were tuned via cross-validation, with the sample split into training (70%) and test (30%) sets. The optimal choice was:

$\{\lambda_{opt} = 0.1,\ n.iter_{opt} = 998\} \rightarrow AUC = 0.79$

The most discriminating variables:

- region in which the firm operates

- branch

- number of employees

- revenues
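
A tuning procedure of this kind might look like the sketch below. It assumes that λ denotes the shrinkage (learning rate) and n.iter the number of boosting iterations, mapped here to scikit-learn's `learning_rate` and `n_estimators`; the grid values, the 3-fold cross-validation, the stratified split and the synthetic data are illustrative placeholders, not the settings used in the study.

```python
import numpy as np
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import GridSearchCV, train_test_split

rng = np.random.default_rng(0)

# Synthetic stand-in data (hypothetical): y = 1 if the unit was assessed.
X = rng.normal(size=(1500, 6))
logit = 1.5 * X[:, 0] - X[:, 1] - 1.0
y = (rng.random(1500) < 1.0 / (1.0 + np.exp(-logit))).astype(int)

# 70% / 30% split, as in the text (stratification is an added assumption).
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.30, stratify=y, random_state=0
)

# Small illustrative grid; placeholder values, not the grid used in the study.
param_grid = {"learning_rate": [0.01, 0.1], "n_estimators": [50, 200]}
search = GridSearchCV(
    GradientBoostingClassifier(random_state=0),
    param_grid,
    scoring="roc_auc",
    cv=3,
)
search.fit(X_train, y_train)

# AUC of the selected (refitted) model on the held-out 30%.
test_auc = roc_auc_score(y_test, search.predict_proba(X_test)[:, 1])
print(search.best_params_, round(test_auc, 3))
```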

Potential tax base estimate

The regression model was estimated only on the 18′718 taxpayers subject to tax assessment. Cross-validation was performed here as well.

$\{\lambda_{opt} = 0.1,\ n.iter_{opt} = 38\} \rightarrow R^2_{BIT} = 0.83$

Most important variables:

- region to which the firm belongs

- taxable amount for other purchases and imports

- set of operations that produce VAT

- operating costs and fiscal added value
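
A ranking of variables such as the one above can be read off a fitted boosting model. The sketch below uses scikit-learn's impurity-based `feature_importances_` on synthetic data; the importance measure actually used in the study is not stated, and the variable names are hypothetical labels attached to random numeric columns.

```python
import numpy as np
from sklearn.ensemble import GradientBoostingRegressor

rng = np.random.default_rng(0)

# Hypothetical variable names and synthetic data, for illustration only.
names = ["region", "purchases", "vat_operations", "operating_costs", "noise"]
X = rng.normal(size=(1000, len(names)))
y = (
    3.0 * X[:, 0]
    + 2.0 * X[:, 1]
    + 1.0 * X[:, 2]
    + 0.5 * X[:, 3]
    + rng.normal(scale=0.1, size=1000)
)

reg = GradientBoostingRegressor(n_estimators=100, learning_rate=0.1, random_state=0)
reg.fit(X, y)

# Impurity-based importances sum to 1; sort from most to least important.
order = np.argsort(reg.feature_importances_)[::-1]
ranking = [(names[i], float(reg.feature_importances_[i])) for i in order]
for name, score in ranking:
    print(f"{name:16s} {score:.3f}")
```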

Results

The Heckman model was estimated on the same sample for comparative purposes.

                        Gradient Boosting        Heckman
                        Train       Test         Test
BIND_TOT                0.727 bn    0.314 bn     0.314 bn
BIND_TOT (estimated)    0.693 bn    0.292 bn     0.290 bn
BIT_TOT                 3.194 bn    1.316 bn     1.316 bn
BIT_TOT (estimated)     3.159 bn    1.292 bn     1.231 bn
R² (BIT)                0.836       0.828        0.657
Adjusted R² (BIT)       0.834       0.826        0.652
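
For reference, R² and adjusted R² can be computed as in the sketch below. The adjusted form shown is the textbook formula with n observations and p predictors; whether the study used exactly this definition, and with which p, is not stated, so the numbers here are toy values only.

```python
import numpy as np

def r2_score_manual(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    ss_res = np.sum((y_true - y_pred) ** 2)
    ss_tot = np.sum((y_true - y_true.mean()) ** 2)
    return 1.0 - ss_res / ss_tot

def adjusted_r2(r2, n, p):
    """Textbook adjustment for n observations and p predictors."""
    return 1.0 - (1.0 - r2) * (n - 1) / (n - p - 1)

# Toy values (hypothetical), not results from the study.
y_true = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y_pred = np.array([1.1, 1.9, 3.2, 3.9, 5.1])

r2 = r2_score_manual(y_true, y_pred)
r2_adj = adjusted_r2(r2, n=len(y_true), p=1)
print(round(r2, 4), round(r2_adj, 4))
```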

Tax evasion propensity analysis

The estimated model was used to obtain predictions for the not-assessed taxpayers.

The tax evasion propensity was then studied over the whole sample:

$\mathrm{Prop} = \dfrac{\sum_{i=1}^{N} \widehat{BIND}_i}{\sum_{i=1}^{N} \widehat{BIT}_i}$

The lower the ratio, the better the compliance.

              Gradient Boosting    Heckman
Propensity    30.40%               29.77%
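
The propensity is a ratio of totals, not an average of individual ratios. A small sketch with hypothetical predictions for four taxpayers (arbitrary units) makes the difference visible:

```python
import numpy as np

def propensity(bind_hat, bit_hat):
    """Total predicted undeclared base divided by total predicted potential base."""
    bind_hat = np.asarray(bind_hat, dtype=float)
    bit_hat = np.asarray(bit_hat, dtype=float)
    return bind_hat.sum() / bit_hat.sum()

# Hypothetical predictions, not values from the study.
bit_hat = np.array([100.0, 250.0, 80.0, 70.0])
bind_hat = np.array([30.0, 50.0, 40.0, 30.0])

prop = propensity(bind_hat, bit_hat)          # ratio of sums
mean_of_ratios = np.mean(bind_hat / bit_hat)  # shown only for comparison
print(f"{prop:.2%} vs {mean_of_ratios:.2%}")
```

With the ratio of sums, taxpayers with a larger potential tax base weigh more in the aggregate figure.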

Propensity by sex and age

                      Gradient Boosting                 Heckman
Sex         n         BIT (bn)   BIND (bn)   Prop       BIT (bn)   BIND (bn)   Prop
Female      16053     2.34       0.81        34.74%     2.22       0.70        31.39%
Male        48154     8.72       2.55        29.25%     8.73       2.56        29.35%
Total       64207     11.05      3.36        30.40%     10.95      3.26        29.77%

                      Gradient Boosting                 Heckman
Age         n         BIT (bn)   BIND (bn)   Prop       BIT (bn)   BIND (bn)   Prop
[18-25)     976       0.13       0.05        39.07%     0.13       0.04        36.71%
[25-45)     28250     4.26       1.45        34.10%     4.11       1.30        31.61%
[45-65)     30496     5.64       1.60        28.45%     5.64       1.60        28.37%
over 65     4485      1.02       0.25        24.65%     1.08       0.32        29.28%
Total       64207     11.05      3.36        30.40%     10.95      3.26        29.77%
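
Breakdowns like these can be obtained by summing the predictions within each group and then taking the ratio. The sketch below uses pandas on a tiny hypothetical data frame; the column names are illustrative, and reading the last age class as 65 and over is an assumption.

```python
import numpy as np
import pandas as pd

# Hypothetical unit-level predictions; column names are illustrative.
df = pd.DataFrame({
    "sex": ["F", "M", "M", "F", "M", "F"],
    "age": [22, 30, 50, 47, 70, 25],
    "bit_hat": [10.0, 40.0, 60.0, 20.0, 30.0, 15.0],
    "bind_hat": [4.0, 12.0, 15.0, 6.0, 6.0, 5.0],
})

# Age classes as in the table: left-closed, right-open intervals.
bins = [18, 25, 45, 65, np.inf]
labels = ["[18-25)", "[25-45)", "[45-65)", "over 65"]
df["age_class"] = pd.cut(df["age"], bins=bins, labels=labels, right=False)

def propensity_by(frame, key):
    """Group totals of BIT and BIND predictions, group size, and their ratio."""
    grouped = frame.groupby(key, observed=True)
    totals = grouped[["bit_hat", "bind_hat"]].sum()
    totals["n"] = grouped.size()
    totals["prop"] = totals["bind_hat"] / totals["bit_hat"]
    return totals

by_sex = propensity_by(df, "sex")
by_age = propensity_by(df, "age_class")
print(by_sex)
print(by_age)
```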

Propensity by geographic area

[Figure: propensity by geographic area; panels a) Heckman, b) Gradient Boosting]

Conclusions

• Advantages:

- all data are processed

- distribution-free approach

- no variable transformation is required

- no problems with multicollinearity

• Further developments:

- extension of the analysis to the entire population

- robustification via an ensemble with other models (XGBoost and neural networks)

Thanks for your attention!
