Variable Selection and Optimization in Default Prediction
Dedy Dwi Prastyo
Wolfgang Karl Härdle
Ladislaus von Bortkiewicz Chair of Statistics
C.A.S.E. Center for Applied Statistics and Economics
Humboldt-Universität zu Berlin
http://lvb.wiwi.hu-berlin.de
http://www.case.hu-berlin.de
Introduction 1-1
Credit-Worthiness Jury
[Cartoon: jury members voting NO, NO, NO, NO, Yes, Maybe, and "I like her."]
Figure 1: Who is more precise?
Introduction 1-2
Credit rating
Score (S)
Quantitative indicator for customers w.r.t. their individual default risk
Probability of Default (PD)
One-to-one mapping of the score, S → PD(S)
Rating
Classification of customers (private, corporate, sovereign) into groups of equivalent default risk
Introduction 1-3
Default prediction
Time-series data (market data)
- Merton approach (stock price as an estimate for the market value), S = distance to default
Cross-sectional data (e.g. balance sheets)
- Discriminant analysis
- Categorical regression (logit, probit)
- Support Vector Machines (SVM)
Introduction 1-4
Research questions
Which variables (e.g. accounting ratios) contribute significantly to default?
How can the default prediction (classification) be optimized?
Outline
1. Introduction ✓
2. Variable selection: regularized GLM, elastic net
3. Evolutionary optimization
4. Genetic Algorithm SVM (GA-SVM)
Variable selection 2-1
Some problems
Number of predictors greater than the number of observations, p ≫ n
There are correlated variables
Sparsity (elements of predictor matrix X ≈ 0)
VLDS (very large data set)
Variable selection 2-2
Model with convex penalty
Apply a fast algorithm for estimation
Models
- Linear regression
- Two-class logistic regression
Penalties
- Lasso (ℓ1)
- Ridge regression (ℓ2)
- Elastic net (mixture of ℓ1 and ℓ2)
Variable selection 2-3
Linear regression
Suppose Y ∈ R and X ∈ R^p,
E(Y | X = x) = β0 + x⊤β
With θ = (β0, β), a penalty Pα(β) and multiplier λ,
θ̂ = argmin_{(β0,β)} (1/2n) ∑_{i=1}^n (yi − β0 − xi⊤β)² + λ Pα(β)    (1)
Variable selection 2-4
Elastic net
The penalty is a compromise between ridge and lasso
Pα(β) = (1/2)(1 − α) ‖β‖²_{ℓ2} + α ‖β‖_{ℓ1}
      = ∑_{j=1}^p { (1/2)(1 − α) βj² + α |βj| }    (2)
It becomes ridge regression if α = 0 and the lasso if α = 1.
Useful when p ≫ n or there are many correlated variables.
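As an illustration only (not the authors' implementation), the elastic-net estimator (1)-(2) can be fitted with scikit-learn, whose l1_ratio parameter plays the role of α and alpha the role of λ; the simulated p ≫ n data are an assumption:

```python
# A minimal sketch: elastic-net linear regression as in (1)-(2),
# using scikit-learn on simulated p >> n data with sparse coefficients.
import numpy as np
from sklearn.linear_model import ElasticNet

rng = np.random.default_rng(0)
n, p = 50, 200                       # p >> n
X = rng.standard_normal((n, p))
beta = np.zeros(p); beta[:5] = 2.0   # sparse true coefficients
y = X @ beta + 0.5 * rng.standard_normal(n)

# alpha is the multiplier lambda; l1_ratio is the mixing parameter alpha in (2)
fit = ElasticNet(alpha=0.1, l1_ratio=0.2).fit(X, y)
print("nonzero coefficients:", np.sum(fit.coef_ != 0))
```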
Variable selection 2-5
Figure 2: Profiles of estimated coefficients for the first 10 values of λ, elastic net (α = 0.2), on the leukemia data, n = 72 and p = 3571 (Friedman et al., 2010)
Variable selection 2-6
Binary logit
Suppose p(xi) = P(Y = 1 | xi), with Y ∈ {0, 1}. Then
P(Y = 1 | x) = {1 + e^{−(β0 + x⊤β)}}^{−1},    (3)
P(Y = 0 | x) = {1 + e^{β0 + x⊤β}}^{−1},
log [ P(Y = 1 | x) / P(Y = 0 | x) ] = β0 + x⊤β
Variable selection 2-7
Penalized log likelihood
max_{(β0,β)} ℓ(β0, β) − λ Pα(β)    (4)
where
ℓ(β0, β) = (1/n) ∑_{i=1}^n { yi (β0 + xi⊤β) − log(1 + e^{β0 + xi⊤β}) }    (5)
is a concave function of the parameters. ℓ(β0, β) is maximized by iteratively reweighted least squares (IRLS).
Variable selection 2-8
Newton algorithm
If the current parameter estimates are (β̃0, β̃), the quadratic approximation to ℓ(β0, β) is
ℓ_Q(β0, β) = −(1/2n) ∑_{i=1}^n wi (zi − β0 − xi⊤β)² + C(β̃0, β̃)    (6)
where the working response and weight are
zi = β̃0 + xi⊤β̃ + {yi − p(xi)} / {p(xi)(1 − p(xi))}
wi = p(xi)(1 − p(xi))
The Newton update is obtained by maximizing ℓ_Q(β0, β).
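As a sketch (mine, not from the slides), one IRLS/Newton step for the unpenalized binary logit, computing the working response and weights of (6):

```python
# A minimal sketch of one IRLS/Newton step for the binary logit:
# compute the working response z_i and weights w_i of (6), then solve
# the weighted least-squares problem (no penalty in this sketch).
import numpy as np

def irls_step(X, y, beta0, beta):
    eta = beta0 + X @ beta
    p = 1.0 / (1.0 + np.exp(-eta))           # p(x_i)
    w = p * (1.0 - p)                        # weights w_i
    z = eta + (y - p) / w                    # working response z_i
    Xd = np.column_stack([np.ones(len(y)), X])
    W = np.diag(w)
    theta = np.linalg.solve(Xd.T @ W @ Xd, Xd.T @ W @ z)
    return theta[0], theta[1:]               # updated (beta0, beta)
```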
Variable selection 2-9
Friedman approach
Coordinate descent is used to solve the penalized weighted least-squares (PWLS) problem
min_{(β0,β)} −ℓ_Q(β0, β) + λ Pα(β)
Sequence of nested loops:
Outer loop: decrement λ
Middle loop: update ℓ_Q using the current parameters (β0, β)
Inner loop: run the coordinate descent algorithm on the PWLS problem (sketched below)
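To make the inner loop concrete, here is a sketch (my own, simplified to the lasso case α = 1 with unit weights) of cyclic coordinate descent with the soft-thresholding operator:

```python
# A simplified sketch of the inner loop: cyclic coordinate descent for
# the lasso (alpha = 1, unit weights), using soft-thresholding.
import numpy as np

def soft_threshold(u, t):
    return np.sign(u) * np.maximum(np.abs(u) - t, 0.0)

def lasso_cd(X, y, lam, n_sweeps=100):
    n, p = X.shape
    beta = np.zeros(p)
    for _ in range(n_sweeps):
        for j in range(p):
            # partial residual excluding coordinate j
            r_j = y - X @ beta + X[:, j] * beta[j]
            beta[j] = soft_threshold(X[:, j] @ r_j / n, lam) / (X[:, j] @ X[:, j] / n)
    return beta
```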
Evolutionary optimization 3-1
Global optimum
Coordinate descent searches for a local minimum
How should α (and λ) be chosen?
This makes the problem more complicated
Evolutionary optimization 3-2
Evolutionary optimization
Genetic Algorithm (GA)
GA finds globally optimal solution parameters
Evolutionary optimization 3-3
What is a Genetic Algorithm?
A genetic algorithm is a search and optimization technique based on Darwin's principle of natural selection (Holland, 1975)
GA-SVM 4-1
Classifier
Figure 3: Linear classifier functions (1 and 2) and a non-linear one (3)
GA-SVM 4-2
SVM
Classification
Data Dn = {(x1, y1), . . . , (xn, yn)} : Ω → (X × Y)^n
X ⊆ R^d and Y = {−1, 1}
Goal: predict Y for a new observation x ∈ X based on the information in Dn
GA-SVM 4-3
Linearly (Non-)Separable Case
Figure 4: Hyperplane and its margin (d) in the linearly (non-)separable case
GA-SVM 4-4
Loss function

L(y, f(x))                              Loss type
{1 − yf(x)}²                            quadratic loss (ridge regression)
{1 − yf(x)}₊ = max{0, 1 − yf(x)}        hinge loss (SVM)
log{1 + exp(−yf(x))}                    log-loss (logistic regression)
sign{−yf(x)}                            0/1 loss
1/{1 + exp(yf(x))}                      sigmoidal loss

Table 1: Types of loss function, where f(x) = x⊤w + b is the score
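As a small illustration (my own sketch), the losses of Table 1 written as functions of the margin v = y f(x):

```python
# A small sketch computing the loss functions of Table 1 on the margin v = y*f(x).
import numpy as np

def losses(v):
    return {
        "quadratic": (1 - v) ** 2,
        "hinge":     np.maximum(0.0, 1 - v),
        "log":       np.log1p(np.exp(-v)),
        "zero_one":  (v <= 0).astype(float),   # 0/1 loss
        "sigmoid":   1.0 / (1.0 + np.exp(v)),
    }

print(losses(np.array([-1.0, 0.0, 1.0, 2.0])))
```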
GA-SVM 4-5
Figure 5: Plot of the loss functions (hinge, quadratic, log, 0/1, sigmoid) for y = 1, f(x) = 0.2 x1, x1 ∈ [−5, 10]. The plot for y = −1 is similar but runs in the opposite direction of the yf(x) axis
GA-SVM 4-6
SVM dual problem
max_α L_D(α) = max_α { ∑_{i=1}^n αi − (1/2) ∑_{i=1}^n ∑_{j=1}^n αi αj yi yj xi⊤xj },
s.t. 0 ≤ αi ≤ C, ∑_{i=1}^n αi yi = 0
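As a sketch (not from the slides), the dual variables of a fitted linear SVM can be inspected in scikit-learn, whose dual_coef_ attribute stores αi yi for the support vectors; the toy data are an assumption:

```python
# A minimal sketch: fit a linear SVM and inspect the dual variables.
# In scikit-learn, dual_coef_ stores alpha_i * y_i for the support vectors.
import numpy as np
from sklearn.svm import SVC

rng = np.random.default_rng(1)
X = np.vstack([rng.normal(-1, 1, (20, 2)), rng.normal(1, 1, (20, 2))])
y = np.array([-1] * 20 + [1] * 20)

svm = SVC(kernel="linear", C=1.0).fit(X, y)
alpha_y = svm.dual_coef_.ravel()            # alpha_i * y_i, with 0 <= alpha_i <= C
print("support vectors:", len(svm.support_))
print("sum alpha_i y_i = 0:", np.isclose(alpha_y.sum(), 0.0))
```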
GA-SVM 4-7
[Diagram: data space (left) mapped into feature space (right)]
Figure 6: Mapping the two-dimensional data space into a three-dimensional feature space, R² ↦ R³. The transformation Ψ(x1, x2) = (x1², √2 x1x2, x2²)⊤ corresponds to K(xi, xj) = (xi⊤xj)²
GA-SVM 4-8
Non-linear SVM
max_α L_D(α) = max_α { ∑_{i=1}^n αi − (1/2) ∑_{i=1}^n ∑_{j=1}^n αi αj yi yj K(xi, xj) }
s.t. 0 ≤ αi ≤ C, ∑_{i=1}^n αi yi = 0
Gaussian RBF kernel: K(xi, xj) = exp(−(1/σ) ‖xi − xj‖²)
Polynomial kernel: K(xi, xj) = (xi⊤xj + 1)^p
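A short sketch (mine) of the two kernels, checking the degree-2 polynomial kernel without offset against the explicit feature map Ψ of Figure 6:

```python
# A small sketch of the Gaussian RBF and polynomial kernels, and a check
# that the degree-2 polynomial kernel (without offset) matches the explicit
# map Psi(x1, x2) = (x1^2, sqrt(2) x1 x2, x2^2) from Figure 6.
import numpy as np

def rbf_kernel(xi, xj, sigma):
    return np.exp(-np.sum((xi - xj) ** 2) / sigma)

def poly_kernel(xi, xj, p, offset=1.0):
    return (xi @ xj + offset) ** p

def psi(x):
    return np.array([x[0] ** 2, np.sqrt(2) * x[0] * x[1], x[1] ** 2])

xi, xj = np.array([1.0, 2.0]), np.array([3.0, -1.0])
print(np.isclose(poly_kernel(xi, xj, p=2, offset=0.0), psi(xi) @ psi(xj)))  # True
```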
GA-SVM 4-9
Structural Risk Minimization (SRM)
Search for the model structure S_h,
S_{h1} ⊆ S_{h2} ⊆ . . . ⊆ S_{h*} ⊆ . . . ⊆ S_{hk} = F
such that f ∈ S_{h*} minimizes the expected risk bound, where F is the class of linear functions and h is the VC dimension, i.e.
SVM(h1) ⊆ . . . ⊆ SVM(h*) ⊆ . . . ⊆ SVM(hk) = F
with h corresponding to the value of the SVM (kernel) parameters
GA-SVM 4-10
GA-SVM
[Diagram: the population of candidate parameters feeds a mating pool; each solution is evaluated by fitness, i.e. an SVM model is learned on the data; selection, crossover and mutation generate the next population]
Figure 7: Iteration (generation) in GA-SVM
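To make Figure 7 concrete, here is a compact sketch (my illustration, not the authors' code) of the GA-SVM loop over the hyperparameters (C, σ), with cross-validated accuracy as fitness. Population size 20, crossover rate 0.5 and mutation rate 0.1 follow the application slide, but the simulated data, parameter ranges and real-valued (arithmetic) crossover are assumptions:

```python
# A compact sketch of the GA-SVM loop in Figure 7: evolve (C, sigma)
# with 5-fold cross-validated accuracy as fitness.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score
from sklearn.svm import SVC

rng = np.random.default_rng(42)
X, y = make_classification(n_samples=200, n_features=10, random_state=0)

def fitness(ind):
    C, sigma = ind
    svm = SVC(C=C, kernel="rbf", gamma=1.0 / sigma)   # K = exp(-||.||^2 / sigma)
    return cross_val_score(svm, X, y, cv=5).mean()

pop = np.column_stack([rng.uniform(0.1, 200.0, 20),   # C
                       rng.uniform(1.0, 200.0, 20)])  # sigma
for generation in range(10):                          # few generations for the sketch
    fit = np.array([fitness(ind) for ind in pop])
    p = fit / fit.sum()                               # relative fitness
    children = pop[rng.choice(20, size=20, p=p)]      # roulette wheel selection
    for i in range(0, 20, 2):                         # crossover with rate 0.5
        if rng.random() < 0.5:
            w = rng.random()
            children[i], children[i + 1] = (w * children[i] + (1 - w) * children[i + 1],
                                            (1 - w) * children[i] + w * children[i + 1])
    mask = rng.random(children.shape) < 0.1           # mutation with rate 0.1
    children[mask] *= rng.uniform(0.5, 2.0, mask.sum())
    children[0] = pop[np.argmax(fit)]                 # elitism: keep the best solution
    pop = children

best = pop[np.argmax([fitness(ind) for ind in pop])]
print("best C = %.2f, sigma = %.2f" % (best[0], best[1]))
```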
Application 5-1
Validation of scores
Discriminatory power (of the score)
- Cumulative Accuracy Profile (CAP) curve
- Receiver Operating Characteristic (ROC) curve
- Accuracy, specificity, sensitivity
Application 5-2
[Diagram, both panels: curves for the perfect model, the random model, a model with zero predictive power, and the model being evaluated (actual)]
Figure 8: CAP curve (left) and ROC curve (right)
Application 5-3
Discriminatory power
Cumulative Accuracy Profile (CAP) curve
- CAP/Power/Lorenz curve → Accuracy Ratio (AR)
- Total sample vs. default sample
Receiver Operating Characteristic (ROC) curve
- ROC curve → Area Under the Curve (AUC)
- Non-default sample vs. default sample
Relationship: AR = 2 AUC − 1
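A small sketch (mine) verifying the relationship AR = 2 AUC − 1 on simulated scores with scikit-learn:

```python
# A small sketch verifying AR = 2*AUC - 1 on simulated scores.
import numpy as np
from sklearn.metrics import roc_auc_score

rng = np.random.default_rng(7)
y = rng.integers(0, 2, 500)               # 1 = default, 0 = non-default
score = y + rng.normal(0, 1.5, 500)       # noisy score

auc = roc_auc_score(y, score)
ar = 2 * auc - 1                          # accuracy ratio
print(f"AUC = {auc:.3f}, AR = {ar:.3f}")
```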
Application 5-4
Discriminatory power (cont'd)
                  sample: default (1)    non-default (−1)
predicted (1)     True Positive (TP)     False Positive (FP)
predicted (−1)    False Negative (FN)    True Negative (TN)
total             P                      N

- Accuracy: P(Ŷ = Y) = (TP + TN)/(P + N)
- Specificity: P(Ŷ = −1 | Y = −1) = TN/N
- Sensitivity: P(Ŷ = 1 | Y = 1) = TP/P
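A minimal sketch (mine) computing these three measures from the confusion matrix, with default coded as 1 and non-default as −1:

```python
# A minimal sketch computing accuracy, specificity and sensitivity
# from the confusion matrix, with default = 1 and non-default = -1.
import numpy as np

def classification_metrics(y_true, y_pred):
    tp = np.sum((y_pred == 1) & (y_true == 1))
    tn = np.sum((y_pred == -1) & (y_true == -1))
    P = np.sum(y_true == 1)
    N = np.sum(y_true == -1)
    return {"accuracy": (tp + tn) / (P + N),
            "specificity": tn / N,
            "sensitivity": tp / P}

y_true = np.array([1, 1, -1, -1, -1, 1])
y_pred = np.array([1, -1, -1, -1, 1, 1])
print(classification_metrics(y_true, y_pred))
```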
Application 5-5
Credit reform data

type                  solvent (%)     insolvent (%)   total (%)
Manufacturing         27.37 (26.06)   25.70 (1.22)    27.29
Construction          13.88 (13.22)   39.70 (1.89)    15.11
Wholesale and retail  24.78 (23.60)   20.10 (0.96)    24.56
Real estate           17.28 (16.46)    9.40 (0.45)    16.90
total                 83.31 (79.34)   94.90 (4.52)    83.86
others                16.69 (15.90)    5.10 (0.24)    16.14
#                     20,000          1,000           21,000

Table 2: Credit reform data
Application 5-6
Pre-processing

year    solvent # (%)   insolvent # (%)   total # (%)
1997     872 ( 9.08)     86 (0.90)         958 ( 9.98)
1998     928 ( 9.66)     92 (0.96)        1020 (10.62)
1999    1005 (10.47)    112 (1.17)        1117 (11.63)
2000    1379 (14.36)    102 (1.06)        1481 (15.42)
2001    1989 (20.71)    111 (1.16)        2100 (21.87)
2002    2791 (29.07)    135 (1.41)        2926 (30.47)
total   8964 (93.36)    638 (6.64)        9602 (100)

Table 3: Pre-processed credit reform data
Application 5-7
Full model, X1, . . . , X28
Predictors: 28 financial ratio variables
Population (# solutions): 20
Evolutionary iterations (generations): 100
Elitism: 0.2 of the population
Crossover rate 0.5, mutation rate 0.1
Optimal SVM parameters: σ = 1/178.75 and C = 63.44
Application 5-8
Scenarios

scenario     training set   testing set
Scenario-1   1997           1998
Scenario-2   1997-1998      1999
Scenario-3   1997-1999      2000
Scenario-4   1997-2000      2001
Scenario-5   1997-2001      2002

Table 4: Training and testing data sets
Application 5-9
Quality of classification

training    TE (CV)     testing   TE (CV)
1997        0 (8.98)    1998      0    ( 9.02)
1997-1998   0 (8.99)    1999      0    (10.03)
1997-1999   0 (9.37)    2000      0    ( 6.89)
1997-2000   0 (8.57)    2001      5.43 ( 5.86)
1997-2001   0 (4.55)    2002      4.68 ( 4.61)

Table 5: Percentage of training error (TE) and 5-fold cross-validation error (CV)
Current findings 6-1
Current findings
SVM with optimal parameters is more robust to imbalanced data sets
Evolutionary optimization (using the Genetic Algorithm) can find a global solution for the SVM parameter optimization
More investigation of variable selection is needed
References 7-1
References
Chen, S., Härdle, W. and Moro, R.
Estimation of Default Probabilities with Support Vector Machines
Quantitative Finance, 2011, 11, 135-154

Efron, B., Hastie, T., Johnstone, I. and Tibshirani, R.
Least angle regression
The Annals of Statistics, 2004, 32(2), 407-449

Friedman, J., Hastie, T., Höfling, H. and Tibshirani, R.
Pathwise coordinate optimization
The Annals of Applied Statistics, 2007, 1(2), 302-332
References 7-2
References
Friedman, J., Hastie, T. and Tibshirani, R.
Regularization paths for generalized linear models via coordinate descent
Journal of Statistical Software, 2010, 33(1)

Härdle, W., Lee, Y-J., Schäfer, D. and Yeh, Y-R.
Variable Selection and Oversampling in the Use of Smooth Support Vector Machine for Predicting the Default Risk of Companies
Journal of Forecasting, 2009, 28, 512-534

Holland, J.H.
Adaptation in Natural and Artificial Systems
University of Michigan Press, 1975
References 7-3
References
Karatzoglou, A. and Meyer, D.
Support Vector Machines in R
Journal of Statistical Software, 2006, 15(9), 1-28

Rosset, S. and Zhu, J.
Piecewise linear regularized solution paths
The Annals of Statistics, 2007, 35(3), 1012-1030

Tibshirani, R.
Regression shrinkage and selection via the Lasso
Journal of the Royal Statistical Society B, 1996, 58, 267-288
References 7-4
References
Tseng, P.
Convergence of a block coordinate descent method for nondifferentiable minimization
Journal of Optimization Theory and Applications, 2001, 109, 475-494

Van der Kooij, A.
Prediction accuracy and stability of regression with optimal scaling transformation
Ph.D. thesis, Department of Data Theory, University of Leiden, 2001
Appendix Linearly Separable Case 8-1
Linearly Separable Case
Figure 9: Separating hyperplane and its margin (d) in the linearly separable case
Appendix Linearly Separable Case 8-2
Choose f ∈ F such that the margin (d− + d+) is maximal
Separation without error, if all i = 1, 2, . . . , n satisfy
xi⊤w + b ≥ +1 for yi = +1
xi⊤w + b ≤ −1 for yi = −1
Both constraints are combined into
yi(xi⊤w + b) − 1 ≥ 0, i = 1, 2, . . . , n
Appendix Linearly Separable Case 8-3
The distance between the margins and the separating hyperplane is d+ = d− = 1/‖w‖
Maximizing the margin, d+ + d− = 2/‖w‖, is attained by minimizing ‖w‖ or ‖w‖²
Lagrangian for the primal problem:
L_P(w, b) = (1/2)‖w‖² − ∑_{i=1}^n αi { yi(xi⊤w + b) − 1 }
Appendix Linearly Separable Case 8-4
Karush-Kuhn-Tucker (KKT) first-order optimality conditions:
∂L_P/∂wk = 0 : wk − ∑_{i=1}^n αi yi xik = 0, k = 1, . . . , d
∂L_P/∂b = 0 : ∑_{i=1}^n αi yi = 0
yi(xi⊤w + b) − 1 ≥ 0, i = 1, . . . , n
αi ≥ 0
αi { yi(xi⊤w + b) − 1 } = 0
Appendix Linearly Separable Case 8-5
Solution w = ∑_{i=1}^n αi yi xi, therefore
(1/2)‖w‖² = (1/2) ∑_{i=1}^n ∑_{j=1}^n αi αj yi yj xi⊤xj
−∑_{i=1}^n αi { yi(xi⊤w + b) − 1 } = −∑_{i=1}^n αi yi xi⊤ ∑_{j=1}^n αj yj xj + ∑_{i=1}^n αi
                                    = −∑_{i=1}^n ∑_{j=1}^n αi αj yi yj xi⊤xj + ∑_{i=1}^n αi
Lagrangian for the dual problem:
L_D(α) = ∑_{i=1}^n αi − (1/2) ∑_{i=1}^n ∑_{j=1}^n αi αj yi yj xi⊤xj
Appendix Linearly Separable Case 8-6
Primal and dual problems:
min_{w,b} L_P(w, b)
max_α L_D(α) s.t. αi ≥ 0, ∑_{i=1}^n αi yi = 0
The optimization problem is convex, therefore the dual and primal formulations give the same solution
A support vector is a point xi for which yi(xi⊤w + b) = 1 holds
Appendix Linearly Non-separable Case 9-1
Linearly Non-separable Case
Figure 10: Hyperplane and its margin in the linearly non-separable case
Appendix Linearly Non-separable Case 9-2
Slack variables ξi represent the violation of strict separation:
xi⊤w + b ≥ 1 − ξi for yi = 1,
xi⊤w + b ≤ −1 + ξi for yi = −1,
ξi ≥ 0
The constraints are combined into
yi(xi⊤w + b) ≥ 1 − ξi and ξi ≥ 0
Penalizing violations ξi > 0, the objective function becomes
(1/2)‖w‖² + C ∑_{i=1}^n ξi
Appendix Linearly Non-separable Case 9-3
Lagrange function for the primal problem:
L_P(w, b, ξ) = (1/2)‖w‖² + C ∑_{i=1}^n ξi − ∑_{i=1}^n αi { yi(xi⊤w + b) − 1 + ξi } − ∑_{i=1}^n μi ξi,
where αi ≥ 0 and μi ≥ 0 are Lagrange multipliers
Primal problem:
min_{w,b,ξ} L_P(w, b, ξ)
Appendix Linearly Non-separable Case 9-4
First-order conditions:
∂L_P/∂wk = 0 : wk − ∑_{i=1}^n αi yi xik = 0
∂L_P/∂b = 0 : ∑_{i=1}^n αi yi = 0
∂L_P/∂ξi = 0 : C − αi − μi = 0
s.t. αi ≥ 0, μi ≥ 0, μi ξi = 0,
αi { yi(xi⊤w + b) − 1 + ξi } = 0
Appendix Linearly Non-separable Case 9-5
Note that ∑_{i=1}^n αi yi b = 0. Translating the primal problem:
L_D(α) = ∑_{i=1}^n αi − (1/2) ∑_{i=1}^n ∑_{j=1}^n αi αj yi yj xi⊤xj + ∑_{i=1}^n ξi (C − αi − μi)
The last term is 0, therefore the dual problem is
max_α L_D(α) = max_α { ∑_{i=1}^n αi − (1/2) ∑_{i=1}^n ∑_{j=1}^n αi αj yi yj xi⊤xj },
s.t. 0 ≤ αi ≤ C, ∑_{i=1}^n αi yi = 0
Appendix Genetic Algorithm 10-1
GA Initialization
[Diagram: a population of binary-encoded chromosomes is decoded into real-valued solutions and evaluated on the objective function f_obj; the global maximum and global minimum are marked]
Figure 11: GA at the first generation
Appendix Genetic Algorithm 10-2
GA Convergence
Figure 12: Solutions at the 1st generation (left) and the r-th generation (right)
Appendix Genetic Algorithm 10-3
GA Decoding
Figure 13: Decoding
θ = θ_lower + (θ_upper − θ_lower) · (∑_{i=0}^{l−1} ai 2^i) / 2^l
where θ is the solution (i.e. a parameter) and the ai are the alleles (bits)
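A minimal sketch (mine) of this decoding from a binary chromosome to a real-valued parameter:

```python
# A minimal sketch of GA decoding: map a binary chromosome (alleles a_i)
# to a real parameter in [theta_lower, theta_upper], as in the formula above.
def decode(alleles, theta_lower, theta_upper):
    l = len(alleles)
    integer = sum(a * 2**i for i, a in enumerate(alleles))
    return theta_lower + (theta_upper - theta_lower) * integer / 2**l

print(decode([1, 0, 1, 1, 0, 1, 0, 1], theta_lower=0.0, theta_upper=200.0))
```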
Appendix Genetic Algorithm 10-4
GA Fitness evaluation
Calculate f(θi), i = 1, . . . , popsize
Evaluate fitness, f_dp(θi)
f_dp(θi): AR, AUC, accuracy, specificity, or sensitivity
Relative fitness: pi = f_dp(θi) / ∑_{k=1}^{popsize} f_dp(θk)
Figure 14: Roulette wheel selection: each solution's share of the wheel gives its proportion to be chosen in the next iteration (generation)
Appendix Genetic Algorithm 10-5
GA Roulette wheel
rand ∼ U(0, 1)
Select the k-th chromosome if ∑_{i=1}^{k−1} pi < rand ≤ ∑_{i=1}^{k} pi
Repeat popsize times to get popsize new chromosomes (see the sketch below)
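A minimal sketch (mine) of roulette wheel selection with the relative fitnesses pi:

```python
# A minimal sketch of roulette wheel selection: draw chromosomes with
# probability proportional to relative fitness p_i, popsize times.
import numpy as np

def roulette_wheel(population, fitness, rng):
    p = fitness / fitness.sum()           # relative fitness p_i
    cum = np.cumsum(p)
    picks = [np.searchsorted(cum, rng.random()) for _ in range(len(population))]
    return population[picks]

rng = np.random.default_rng(0)
pop = np.arange(6)                        # stand-in chromosomes
fit = np.array([1.0, 2.0, 3.0, 1.0, 0.5, 2.5])
print(roulette_wheel(pop, fit, rng))
```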
Appendix Genetic Algorithm 10-6
GA Crossover
Figure 15: Crossover in nature
Figure 16: Randomly chosen one-point crossover (top) and two-point (multi-point) crossover (bottom)
Appendix Genetic Algorithm 10-7
GA Reproductive operators
With probability pc (the crossover rate) two selected parents exchange the genes after a randomly chosen cut point in [1, l − 1], where l is the chromosome length (single-point crossover); otherwise the offspring are copies of the parents. Each gene of an offspring is then flipped (1 to 0 or 0 to 1) with probability pm (the mutation rate), the bit-flip mutation.
Figure 17: One-point crossover (top) and bit-flip mutation (bottom)
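A minimal sketch (mine) of single-point crossover with rate pc and bit-flip mutation with rate pm:

```python
# A minimal sketch of single-point crossover (rate pc) and bit-flip
# mutation (rate pm) for binary chromosomes.
import numpy as np

def crossover(parent1, parent2, pc, rng):
    if rng.random() < pc:
        point = rng.integers(1, len(parent1))        # cut point in [1, l-1]
        child1 = np.concatenate([parent1[:point], parent2[point:]])
        child2 = np.concatenate([parent2[:point], parent1[point:]])
        return child1, child2
    return parent1.copy(), parent2.copy()            # no crossover: copies

def mutate(chromosome, pm, rng):
    flip = rng.random(len(chromosome)) < pm          # each gene with prob. pm
    chromosome[flip] = 1 - chromosome[flip]          # bit-flip
    return chromosome

rng = np.random.default_rng(3)
p1, p2 = np.array([1, 1, 1, 1, 1, 1]), np.array([0, 0, 0, 0, 0, 0])
c1, c2 = crossover(p1, p2, pc=0.5, rng=rng)
print(mutate(c1, pm=0.1, rng=rng), c2)
```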
Appendix Genetic Algorithm 10-8
GA Elitism
The best solution of each iteration is maintained in a separate memory place
When the new population replaces the old one, check whether the best solution is in the population
If not, replace any one member of the population with the best solution
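A minimal sketch (mine) of this elitism rule:

```python
# A minimal sketch of elitism: if the stored best solution is not in the
# new population, overwrite an arbitrary member with it.
import numpy as np

def apply_elitism(population, best):
    if not any(np.array_equal(ind, best) for ind in population):
        population[0] = best          # replace any one member
    return population
```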
Appendix Genetic Algorithm 10-9
Nature to Computer Mapping

Nature                   GA-SVM
Population               Set of parameters
Individual (phenotype)   Parameters
Fitness                  Discriminatory power
Chromosome (genotype)    Encoding of the parameters
Gene                     Binary encoding
Reproduction             Crossover
Generation               Iteration

Table 6: Nature to GA-SVM mapping
Appendix Genetic Algorithm 10-10
Examples
Small sample: 100 solvent and insolvent companies
Credit reform data
X3: Operating Income / Total Assets
X24: Accounts Payable / Total Assets
Appendix Genetic Algorithm 10-11
Figure 18: SVM classification plots of x3 against x24: C = 1 and σ = 1/2, misclassification rate 0.19 (left); GA-SVM, C = 14.86 and σ = 1/121.61, misclassification rate 0 (right)
Appendix Genetic Algorithm 10-12
Figure 19: GA-SVM (C = 187.93 and σ = 1/195.16) classification plots of X3 against X24: training data, misclassification rate 2.38% (left), and testing data, misclassification rate 1.37% (right)
Appendix Financial Ratio Variables 11-1
FR: Profitability

Ratio No.  Definition     Ratio
x1         NI/TA          Return on assets (ROA)
x2         NI/Sales       Net profit margin
x3         OI/TA          Operating income / Total assets
x4         OI/Sales       Operating profit margin
x5         EBIT/TA        EBIT / Total assets
x6         (EBIT+AD)/TA   EBITDA
x7         EBIT/Sales     EBIT / Sales

Table 7: Definitions of financial ratios
Appendix Financial Ratio Variables 11-2
FR: Leverage

Ratio No.  Definition                        Ratio
x8         Equity/TA                         Own funds ratio (simple)
x9         (Equity-ITGA)/(TA-ITGA-Cash-LB)   Own funds ratio (adjusted)
x10        CL/TA                             Current liabilities / Total assets
x11        (CL-Cash)/TA                      Net indebtedness
x12        TL/TA                             Total liabilities / Total assets
x13        Debt/TA                           Debt ratio
x14        EBIT/Interest expenses            Interest coverage ratio

Table 8: Definitions of financial ratios
Appendix Financial Ratio Variables 11-3
FR: Liquidity

Ratio No.  Definition   Ratio
x15        Cash/TA      Cash / Total assets
x16        Cash/CL      Cash ratio
x17        QA/CL        Quick ratio
x18        CA/CL        Current ratio
x19        WC/TA        Working capital
x20        CL/TL        Current liabilities / Total liabilities

Table 9: Definitions of financial ratios
Appendix Financial Ratio Variables 11-4
FR: Activity

Ratio No.  Definition   Ratio
x21        TA/Sales     Asset turnover
x22        INV/Sales    Inventory turnover
x23        AR/Sales     Accounts receivable turnover
x24        AP/Sales     Accounts payable turnover
x25        Log(TA)      Log(Total assets)

Table 10: Definitions of financial ratios
Appendix Financial Ratio Variables 11-5
FR

Ratio No.  Definition
x26        increase (decrease) in inventories / inventories
x27        increase (decrease) in liabilities / total liabilities
x28        increase (decrease) in cash flows / cash and cash equivalents

Table 11: Definitions of financial ratios
Appendix Financial Ratio Variables 11-6
Findings
Härdle et al. (2009): Smooth SVM overall mean of correct predictions ranging from 70% to 78% (misclassification: 22% to 30%)
Chen, Härdle and Moro (2011):
- Most of the models tested: AR between 43.50% and 60.51%
- SVM (grid search optimization): percentage of correctly classified out-of-sample 71.85%
- Logit model: percentage of correctly classified out-of-sample 67.24%
Appendix Financial Ratio Variables 11-7
Findings
Zhang and Härdle (2010):

Performance measure      Logit (%)   CART (%)   BACT (%)
Overall misclass. rate   30.2        33.8       26.6
Type I misclass. rate    28.3        27.2       27.6
Type II misclass. rate   30.3        34.3       26.5
AR                       52.1        58.7       60.4

Table 12: Average value (over bootstrap samples) of performance measures