Variable Selection and Optimization in Default Prediction
Dedy Dwi Prastyo
Wolfgang Karl Härdle
Ladislaus von Bortkiewicz Chair of Statistics
C.A.S.E. Center for Applied Statistics and Economics
Humboldt-Universität zu Berlin
http://lvb.wiwi.hu-berlin.de
http://www.case.hu-berlin.de
Introduction 1-1
Credit-Worthiness Jury
[Cartoon: jury members voting NO, NO, NO, NO, Yes, Maybe, and "I like her."]
Figure 1: Who is more precise?
Introduction 1-2
Credit rating
Score (S)
Quantitative indicator for customers w.r.t. their individual default risk
Probability of Default (PD)
One-to-one mapping of the score, S → PD(S)
Rating
Classification of customers (private, corporate, sovereign) into groups of equivalent default risk
Introduction 1-3
Default prediction
Time-series data (market data)
- Merton approach (stock price as an estimate for the market value), S = distance to default
Cross-sectional data (e.g. balance sheets)
- Discriminant analysis
- Categorical regression (logit, probit)
- Support Vector Machines (SVM)
Introduction 1-4
Research questions
Which variables (e.g. accounting ratios) contribute significantly to default?
How can the default prediction (classification) be optimized?
Outline
1. Introduction ✓
2. Variable selection: regularized GLM, elastic net
3. Evolutionary optimization
4. Genetic Algorithm SVM (GA-SVM)
Variable selection 2-1
Some problems
Number of predictors greater than the number of observations, p ≫ n
There are correlated variables
Sparsity (elements of predictor matrix X ≈ 0)
VLDS (very large data set)
Variable selection 2-2
Model with convex penalty
Apply a fast algorithm for estimation
Models
- Linear regression
- Two-class logistic regression
Penalties
- Lasso (ℓ1)
- Ridge regression (ℓ2)
- Elastic net (mixture of ℓ1 and ℓ2)
Variable selection 2-3
Linear regression
Suppose Y ∈ R and X ∈ R^p,
E(Y | X = x) = β0 + x⊤β
With θ = (β0, β), a penalty Pα(β) and multiplier λ,
θ̂ = argmin_{(β0,β)} (1/2n) ∑_{i=1}^n (yi − β0 − xi⊤β)² + λ Pα(β)    (1)
Variable selection 2-4
Elastic net
The penalty is a compromise between ridge and lasso
Pα(β) = (1/2)(1 − α) ‖β‖²_{ℓ2} + α ‖β‖_{ℓ1}
      = ∑_{j=1}^p { (1/2)(1 − α) βj² + α |βj| }    (2)
It becomes ridge regression if α = 0 and the lasso if α = 1.
Useful when p ≫ n or there are many correlated variables.
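As an illustration only (not the authors' implementation), the elastic-net estimator (1)-(2) can be fitted with scikit-learn, whose l1_ratio parameter plays the role of α and alpha the role of λ; the simulated p ≫ n data are an assumption:

```python
# A minimal sketch: elastic-net linear regression as in (1)-(2),
# using scikit-learn on simulated p >> n data with sparse coefficients.
import numpy as np
from sklearn.linear_model import ElasticNet

rng = np.random.default_rng(0)
n, p = 50, 200                       # p >> n
X = rng.standard_normal((n, p))
beta = np.zeros(p); beta[:5] = 2.0   # sparse true coefficients
y = X @ beta + 0.5 * rng.standard_normal(n)

# alpha is the multiplier lambda; l1_ratio is the mixing parameter alpha in (2)
fit = ElasticNet(alpha=0.1, l1_ratio=0.2).fit(X, y)
print("nonzero coefficients:", np.sum(fit.coef_ != 0))
```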
Variable selection 2-5
Figure 2: Profiles of estimated coefficients for the first 10 values of λ, elastic net (α = 0.2), on the leukemia data, n = 72 and p = 3571 (Friedman et al., 2010)
Variable selection 2-6
Binary logit
Suppose p(xi) = P(Y = 1 | xi), with Y ∈ {0, 1}. Then
P(Y = 1 | x) = {1 + e^{−(β0 + x⊤β)}}^{−1},    (3)
P(Y = 0 | x) = {1 + e^{β0 + x⊤β}}^{−1},
log [ P(Y = 1 | x) / P(Y = 0 | x) ] = β0 + x⊤β
Variable selection 2-7
Penalized log likelihood
max_{(β0,β)} ℓ(β0, β) − λ Pα(β)    (4)
where
ℓ(β0, β) = (1/n) ∑_{i=1}^n { yi (β0 + xi⊤β) − log(1 + e^{β0 + xi⊤β}) }    (5)
is a concave function of the parameters. ℓ(β0, β) is maximized by iteratively reweighted least squares (IRLS).
Variable selection 2-8
Newton algorithm
If the current parameter estimates are (β̃0, β̃), the quadratic approximation to ℓ(β0, β) is
ℓ_Q(β0, β) = −(1/2n) ∑_{i=1}^n wi (zi − β0 − xi⊤β)² + C(β̃0, β̃)    (6)
where the working response and weight are
zi = β̃0 + xi⊤β̃ + {yi − p(xi)} / {p(xi)(1 − p(xi))}
wi = p(xi)(1 − p(xi))
The Newton update is obtained by maximizing ℓ_Q(β0, β).
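As a sketch (mine, not from the slides), one IRLS/Newton step for the unpenalized binary logit, computing the working response and weights of (6):

```python
# A minimal sketch of one IRLS/Newton step for the binary logit:
# compute the working response z_i and weights w_i of (6), then solve
# the weighted least-squares problem (no penalty in this sketch).
import numpy as np

def irls_step(X, y, beta0, beta):
    eta = beta0 + X @ beta
    p = 1.0 / (1.0 + np.exp(-eta))           # p(x_i)
    w = p * (1.0 - p)                        # weights w_i
    z = eta + (y - p) / w                    # working response z_i
    Xd = np.column_stack([np.ones(len(y)), X])
    W = np.diag(w)
    theta = np.linalg.solve(Xd.T @ W @ Xd, Xd.T @ W @ z)
    return theta[0], theta[1:]               # updated (beta0, beta)
```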
Variable selection 2-9
Friedman approach
Coordinate descent is used to solve the penalized weighted least-squares (PWLS) problem
min_{(β0,β)} −ℓ_Q(β0, β) + λ Pα(β)
Sequence of nested loops:
Outer loop: decrement λ
Middle loop: update ℓ_Q using the current parameters (β0, β)
Inner loop: run the coordinate descent algorithm on the PWLS problem (sketched below)
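To make the inner loop concrete, here is a sketch (my own, simplified to the lasso case α = 1 with unit weights) of cyclic coordinate descent with the soft-thresholding operator:

```python
# A simplified sketch of the inner loop: cyclic coordinate descent for
# the lasso (alpha = 1, unit weights), using soft-thresholding.
import numpy as np

def soft_threshold(u, t):
    return np.sign(u) * np.maximum(np.abs(u) - t, 0.0)

def lasso_cd(X, y, lam, n_sweeps=100):
    n, p = X.shape
    beta = np.zeros(p)
    for _ in range(n_sweeps):
        for j in range(p):
            # partial residual excluding coordinate j
            r_j = y - X @ beta + X[:, j] * beta[j]
            beta[j] = soft_threshold(X[:, j] @ r_j / n, lam) / (X[:, j] @ X[:, j] / n)
    return beta
```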
Evolutionary optimization 3-1
Global optimum
Coordinate descent searches for a local minimum
How should α (and λ) be chosen?
This makes the problem more complicated
Evolutionary optimization 3-2
Evolutionary optimization
Genetic Algorithm (GA)
GA finds globally optimal solution parameters
Evolutionary optimization 3-3
What is a Genetic Algorithm?
A genetic algorithm is a search and optimization technique based on Darwin's principle of natural selection (Holland, 1975)
GA-SVM 4-1
Classifier
Figure 3: Linear classifier functions (1 and 2) and a non-linear one (3)
GA-SVM 4-2
SVM
Classification
Data Dn = {(x1, y1), . . . , (xn, yn)} : Ω → (X × Y)^n
X ⊆ R^d and Y = {−1, 1}
Goal: predict Y for a new observation x ∈ X based on the information in Dn
GA-SVM 4-3
Linearly (Non-)Separable Case
Figure 4: Hyperplane and its margin (d) in the linearly (non-)separable case
GA-SVM 4-4
Loss function

L(y, f(x))                              Loss type
{1 − yf(x)}²                            quadratic loss (ridge regression)
{1 − yf(x)}₊ = max{0, 1 − yf(x)}        hinge loss (SVM)
log{1 + exp(−yf(x))}                    log-loss (logistic regression)
sign{−yf(x)}                            0/1 loss
1/{1 + exp(yf(x))}                      sigmoidal loss

Table 1: Types of loss function, where f(x) = x⊤w + b is the score
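As a small illustration (my own sketch), the losses of Table 1 written as functions of the margin v = y f(x):

```python
# A small sketch computing the loss functions of Table 1 on the margin v = y*f(x).
import numpy as np

def losses(v):
    return {
        "quadratic": (1 - v) ** 2,
        "hinge":     np.maximum(0.0, 1 - v),
        "log":       np.log1p(np.exp(-v)),
        "zero_one":  (v <= 0).astype(float),   # 0/1 loss
        "sigmoid":   1.0 / (1.0 + np.exp(v)),
    }

print(losses(np.array([-1.0, 0.0, 1.0, 2.0])))
```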
GA-SVM 4-5
Figure 5: Plot of the loss functions (hinge, quadratic, log, 0/1, sigmoid) for y = 1, f(x) = 0.2 x1, x1 ∈ [−5, 10]. The plot for y = −1 is similar but runs in the opposite direction of the yf(x) axis
GA-SVM 4-6
SVM dual problem
max_α L_D(α) = max_α { ∑_{i=1}^n αi − (1/2) ∑_{i=1}^n ∑_{j=1}^n αi αj yi yj xi⊤xj },
s.t. 0 ≤ αi ≤ C, ∑_{i=1}^n αi yi = 0
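As a sketch (not from the slides), the dual variables of a fitted linear SVM can be inspected in scikit-learn, whose dual_coef_ attribute stores αi yi for the support vectors; the toy data are an assumption:

```python
# A minimal sketch: fit a linear SVM and inspect the dual variables.
# In scikit-learn, dual_coef_ stores alpha_i * y_i for the support vectors.
import numpy as np
from sklearn.svm import SVC

rng = np.random.default_rng(1)
X = np.vstack([rng.normal(-1, 1, (20, 2)), rng.normal(1, 1, (20, 2))])
y = np.array([-1] * 20 + [1] * 20)

svm = SVC(kernel="linear", C=1.0).fit(X, y)
alpha_y = svm.dual_coef_.ravel()            # alpha_i * y_i, with 0 <= alpha_i <= C
print("support vectors:", len(svm.support_))
print("sum alpha_i y_i = 0:", np.isclose(alpha_y.sum(), 0.0))
```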
GA-SVM 4-7
[Diagram: data space (left) mapped into feature space (right)]
Figure 6: Mapping the two-dimensional data space into a three-dimensional feature space, R² ↦ R³. The transformation Ψ(x1, x2) = (x1², √2 x1x2, x2²)⊤ corresponds to K(xi, xj) = (xi⊤xj)²
GA-SVM 4-8
Non-linear SVM
max_α L_D(α) = max_α { ∑_{i=1}^n αi − (1/2) ∑_{i=1}^n ∑_{j=1}^n αi αj yi yj K(xi, xj) }
s.t. 0 ≤ αi ≤ C, ∑_{i=1}^n αi yi = 0
Gaussian RBF kernel: K(xi, xj) = exp(−(1/σ) ‖xi − xj‖²)
Polynomial kernel: K(xi, xj) = (xi⊤xj + 1)^p
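A short sketch (mine) of the two kernels, checking the degree-2 polynomial kernel without offset against the explicit feature map Ψ of Figure 6:

```python
# A small sketch of the Gaussian RBF and polynomial kernels, and a check
# that the degree-2 polynomial kernel (without offset) matches the explicit
# map Psi(x1, x2) = (x1^2, sqrt(2) x1 x2, x2^2) from Figure 6.
import numpy as np

def rbf_kernel(xi, xj, sigma):
    return np.exp(-np.sum((xi - xj) ** 2) / sigma)

def poly_kernel(xi, xj, p, offset=1.0):
    return (xi @ xj + offset) ** p

def psi(x):
    return np.array([x[0] ** 2, np.sqrt(2) * x[0] * x[1], x[1] ** 2])

xi, xj = np.array([1.0, 2.0]), np.array([3.0, -1.0])
print(np.isclose(poly_kernel(xi, xj, p=2, offset=0.0), psi(xi) @ psi(xj)))  # True
```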
GA-SVM 4-9
Structural Risk Minimization (SRM)
Search for the model structure S_h,
S_{h1} ⊆ S_{h2} ⊆ . . . ⊆ S_{h*} ⊆ . . . ⊆ S_{hk} = F
such that f ∈ S_{h*} minimizes the expected risk bound, where F is the class of linear functions and h is the VC dimension, i.e.
SVM(h1) ⊆ . . . ⊆ SVM(h*) ⊆ . . . ⊆ SVM(hk) = F
with h corresponding to the value of the SVM (kernel) parameters
GA-SVM 4-10
GA-SVM
[Diagram: the population of candidate parameters feeds a mating pool; each solution is evaluated by fitness, i.e. an SVM model is learned on the data; selection, crossover and mutation generate the next population]
Figure 7: Iteration (generation) in GA-SVM
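To make Figure 7 concrete, here is a compact sketch (my illustration, not the authors' code) of the GA-SVM loop over the hyperparameters (C, σ), with cross-validated accuracy as fitness. Population size 20, crossover rate 0.5 and mutation rate 0.1 follow the application slide, but the simulated data, parameter ranges and real-valued (arithmetic) crossover are assumptions:

```python
# A compact sketch of the GA-SVM loop in Figure 7: evolve (C, sigma)
# with 5-fold cross-validated accuracy as fitness.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score
from sklearn.svm import SVC

rng = np.random.default_rng(42)
X, y = make_classification(n_samples=200, n_features=10, random_state=0)

def fitness(ind):
    C, sigma = ind
    svm = SVC(C=C, kernel="rbf", gamma=1.0 / sigma)   # K = exp(-||.||^2 / sigma)
    return cross_val_score(svm, X, y, cv=5).mean()

pop = np.column_stack([rng.uniform(0.1, 200.0, 20),   # C
                       rng.uniform(1.0, 200.0, 20)])  # sigma
for generation in range(10):                          # few generations for the sketch
    fit = np.array([fitness(ind) for ind in pop])
    p = fit / fit.sum()                               # relative fitness
    children = pop[rng.choice(20, size=20, p=p)]      # roulette wheel selection
    for i in range(0, 20, 2):                         # crossover with rate 0.5
        if rng.random() < 0.5:
            w = rng.random()
            children[i], children[i + 1] = (w * children[i] + (1 - w) * children[i + 1],
                                            (1 - w) * children[i] + w * children[i + 1])
    mask = rng.random(children.shape) < 0.1           # mutation with rate 0.1
    children[mask] *= rng.uniform(0.5, 2.0, mask.sum())
    children[0] = pop[np.argmax(fit)]                 # elitism: keep the best solution
    pop = children

best = pop[np.argmax([fitness(ind) for ind in pop])]
print("best C = %.2f, sigma = %.2f" % (best[0], best[1]))
```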
Application 5-1
Validation of scores
Discriminatory power (of the score)
- Cumulative Accuracy Profile (CAP) curve
- Receiver Operating Characteristic (ROC) curve
- Accuracy, specificity, sensitivity
Application 5-2
[Diagram, both panels: curves for the perfect model, the random model, a model with zero predictive power, and the model being evaluated (actual)]
Figure 8: CAP curve (left) and ROC curve (right)
Application 5-3
Discriminatory power
Cumulative Accuracy Profile (CAP) curve
- CAP/Power/Lorenz curve → Accuracy Ratio (AR)
- Total sample vs. default sample
Receiver Operating Characteristic (ROC) curve
- ROC curve → Area Under the Curve (AUC)
- Non-default sample vs. default sample
Relationship: AR = 2 AUC − 1
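A small sketch (mine) verifying the relationship AR = 2 AUC − 1 on simulated scores with scikit-learn:

```python
# A small sketch verifying AR = 2*AUC - 1 on simulated scores.
import numpy as np
from sklearn.metrics import roc_auc_score

rng = np.random.default_rng(7)
y = rng.integers(0, 2, 500)               # 1 = default, 0 = non-default
score = y + rng.normal(0, 1.5, 500)       # noisy score

auc = roc_auc_score(y, score)
ar = 2 * auc - 1                          # accuracy ratio
print(f"AUC = {auc:.3f}, AR = {ar:.3f}")
```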
Application 5-4
Discriminatory power (cont'd)
                  sample: default (1)    non-default (−1)
predicted (1)     True Positive (TP)     False Positive (FP)
predicted (−1)    False Negative (FN)    True Negative (TN)
total             P                      N

- Accuracy: P(Ŷ = Y) = (TP + TN)/(P + N)
- Specificity: P(Ŷ = −1 | Y = −1) = TN/N
- Sensitivity: P(Ŷ = 1 | Y = 1) = TP/P
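A minimal sketch (mine) computing these three measures from the confusion matrix, with default coded as 1 and non-default as −1:

```python
# A minimal sketch computing accuracy, specificity and sensitivity
# from the confusion matrix, with default = 1 and non-default = -1.
import numpy as np

def classification_metrics(y_true, y_pred):
    tp = np.sum((y_pred == 1) & (y_true == 1))
    tn = np.sum((y_pred == -1) & (y_true == -1))
    P = np.sum(y_true == 1)
    N = np.sum(y_true == -1)
    return {"accuracy": (tp + tn) / (P + N),
            "specificity": tn / N,
            "sensitivity": tp / P}

y_true = np.array([1, 1, -1, -1, -1, 1])
y_pred = np.array([1, -1, -1, -1, 1, 1])
print(classification_metrics(y_true, y_pred))
```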
Application 5-5
Credit reform data

type                  solvent (%)     insolvent (%)   total (%)
Manufacturing         27.37 (26.06)   25.70 (1.22)    27.29
Construction          13.88 (13.22)   39.70 (1.89)    15.11
Wholesale and retail  24.78 (23.60)   20.10 (0.96)    24.56
Real estate           17.28 (16.46)    9.40 (0.45)    16.90
total                 83.31 (79.34)   94.90 (4.52)    83.86
others                16.69 (15.90)    5.10 (0.24)    16.14
#                     20,000          1,000           21,000

Table 2: Credit reform data
Application 5-6
Pre-processing

year    solvent # (%)   insolvent # (%)   total # (%)
1997     872 ( 9.08)     86 (0.90)         958 ( 9.98)
1998     928 ( 9.66)     92 (0.96)        1020 (10.62)
1999    1005 (10.47)    112 (1.17)        1117 (11.63)
2000    1379 (14.36)    102 (1.06)        1481 (15.42)
2001    1989 (20.71)    111 (1.16)        2100 (21.87)
2002    2791 (29.07)    135 (1.41)        2926 (30.47)
total   8964 (93.36)    638 (6.64)        9602 (100)

Table 3: Pre-processed credit reform data
Application 5-7
Full model, X1, . . . , X28
Predictors: 28 financial ratio variables
Population (# solutions): 20
Evolutionary iterations (generations): 100
Elitism: 0.2 of the population
Crossover rate 0.5, mutation rate 0.1
Optimal SVM parameters: σ = 1/178.75 and C = 63.44
Application 5-8
Scenarios

scenario     training set   testing set
Scenario-1   1997           1998
Scenario-2   1997-1998      1999
Scenario-3   1997-1999      2000
Scenario-4   1997-2000      2001
Scenario-5   1997-2001      2002

Table 4: Training and testing data sets
Application 5-9
Quality of classification

training    TE (CV)     testing   TE (CV)
1997        0 (8.98)    1998      0    ( 9.02)
1997-1998   0 (8.99)    1999      0    (10.03)
1997-1999   0 (9.37)    2000      0    ( 6.89)
1997-2000   0 (8.57)    2001      5.43 ( 5.86)
1997-2001   0 (4.55)    2002      4.68 ( 4.61)

Table 5: Percentage of training error (TE) and 5-fold cross-validation error (CV)
Current findings 6-1
Current findings
SVM with optimal parameters is more robust to imbalanced data sets
Evolutionary optimization (using the Genetic Algorithm) can find a global solution for the SVM parameter optimization
More investigation of variable selection is needed
References 7-1
References
Chen, S., Härdle, W. and Moro, R.
Estimation of Default Probabilities with Support Vector Machines
Quantitative Finance, 2011, 11, 135-154

Efron, B., Hastie, T., Johnstone, I. and Tibshirani, R.
Least angle regression
The Annals of Statistics, 2004, 32(2), 407-449

Friedman, J., Hastie, T., Höfling, H. and Tibshirani, R.
Pathwise coordinate optimization
The Annals of Applied Statistics, 2007, 1(2), 302-332
References 7-2
References
Friedman, J., Hastie, T. and Tibshirani, R.
Regularization paths for generalized linear models via coordinate descent
Journal of Statistical Software, 2010, 33(1)

Härdle, W., Lee, Y-J., Schäfer, D. and Yeh, Y-R.
Variable Selection and Oversampling in the Use of Smooth Support Vector Machine for Predicting the Default Risk of Companies
Journal of Forecasting, 2009, 28, 512-534

Holland, J.H.
Adaptation in Natural and Artificial Systems
University of Michigan Press, 1975
References 7-3
References
Karatzoglou, A. and Meyer, D.
Support Vector Machines in R
Journal of Statistical Software, 2006, 15(9), 1-28

Rosset, S. and Zhu, J.
Piecewise linear regularized solution paths
The Annals of Statistics, 2007, 35(3), 1012-1030

Tibshirani, R.
Regression shrinkage and selection via the Lasso
Journal of the Royal Statistical Society B, 1996, 58, 267-288
References 7-4
References
Tseng, P.
Convergence of a block coordinate descent method for nondifferentiable minimization
Journal of Optimization Theory and Applications, 2001, 109, 475-494

Van der Kooij, A.
Prediction accuracy and stability of regression with optimal scaling transformation
Ph.D. thesis, Department of Data Theory, University of Leiden, 2001
Appendix Linearly Separable Case 8-1
Linearly Separable Case
Figure 9: Separating hyperplane and its margin (d) in the linearly separable case
Appendix Linearly Separable Case 8-2
Choose f ∈ F such that the margin (d− + d+) is maximal
Separation without error, if all i = 1, 2, . . . , n satisfy
xi⊤w + b ≥ +1 for yi = +1
xi⊤w + b ≤ −1 for yi = −1
Both constraints are combined into
yi(xi⊤w + b) − 1 ≥ 0, i = 1, 2, . . . , n
Appendix Linearly Separable Case 8-3
The distance between the margins and the separating hyperplane is d+ = d− = 1/‖w‖
Maximizing the margin, d+ + d− = 2/‖w‖, is attained by minimizing ‖w‖ or ‖w‖²
Lagrangian for the primal problem:
L_P(w, b) = (1/2)‖w‖² − ∑_{i=1}^n αi { yi(xi⊤w + b) − 1 }
Appendix Linearly Separable Case 8-4
Karush-Kuhn-Tucker (KKT) first-order optimality conditions:
∂L_P/∂wk = 0 : wk − ∑_{i=1}^n αi yi xik = 0, k = 1, . . . , d
∂L_P/∂b = 0 : ∑_{i=1}^n αi yi = 0
yi(xi⊤w + b) − 1 ≥ 0, i = 1, . . . , n
αi ≥ 0
αi { yi(xi⊤w + b) − 1 } = 0
Appendix Linearly Separable Case 8-5
Solution w = ∑_{i=1}^n αi yi xi, therefore
(1/2)‖w‖² = (1/2) ∑_{i=1}^n ∑_{j=1}^n αi αj yi yj xi⊤xj
−∑_{i=1}^n αi { yi(xi⊤w + b) − 1 } = −∑_{i=1}^n αi yi xi⊤ ∑_{j=1}^n αj yj xj + ∑_{i=1}^n αi
                                    = −∑_{i=1}^n ∑_{j=1}^n αi αj yi yj xi⊤xj + ∑_{i=1}^n αi
Lagrangian for the dual problem:
L_D(α) = ∑_{i=1}^n αi − (1/2) ∑_{i=1}^n ∑_{j=1}^n αi αj yi yj xi⊤xj
Appendix Linearly Separable Case 8-6
Primal and dual problems:
min_{w,b} L_P(w, b)
max_α L_D(α) s.t. αi ≥ 0, ∑_{i=1}^n αi yi = 0
The optimization problem is convex, therefore the dual and primal formulations give the same solution
A support vector is a point xi for which yi(xi⊤w + b) = 1 holds
Appendix Linearly Non-separable Case 9-1
Linearly Non-separable Case
Figure 10: Hyperplane and its margin in the linearly non-separable case
Appendix Linearly Non-separable Case 9-2
Slack variables ξi represent the violation of strict separation:
xi⊤w + b ≥ 1 − ξi for yi = 1,
xi⊤w + b ≤ −1 + ξi for yi = −1,
ξi ≥ 0
The constraints are combined into
yi(xi⊤w + b) ≥ 1 − ξi and ξi ≥ 0
Penalizing violations ξi > 0, the objective function becomes
(1/2)‖w‖² + C ∑_{i=1}^n ξi
Appendix Linearly Non-separable Case 9-3
Lagrange function for the primal problem:
L_P(w, b, ξ) = (1/2)‖w‖² + C ∑_{i=1}^n ξi − ∑_{i=1}^n αi { yi(xi⊤w + b) − 1 + ξi } − ∑_{i=1}^n μi ξi,
where αi ≥ 0 and μi ≥ 0 are Lagrange multipliers
Primal problem:
min_{w,b,ξ} L_P(w, b, ξ)
Appendix Linearly Non-separable Case 9-4
First-order conditions:
∂L_P/∂wk = 0 : wk − ∑_{i=1}^n αi yi xik = 0
∂L_P/∂b = 0 : ∑_{i=1}^n αi yi = 0
∂L_P/∂ξi = 0 : C − αi − μi = 0
s.t. αi ≥ 0, μi ≥ 0, μi ξi = 0,
αi { yi(xi⊤w + b) − 1 + ξi } = 0
Appendix Linearly Non-separable Case 9-5
Note that ∑_{i=1}^n αi yi b = 0. Translating the primal problem:
L_D(α) = ∑_{i=1}^n αi − (1/2) ∑_{i=1}^n ∑_{j=1}^n αi αj yi yj xi⊤xj + ∑_{i=1}^n ξi (C − αi − μi)
The last term is 0, therefore the dual problem is
max_α L_D(α) = max_α { ∑_{i=1}^n αi − (1/2) ∑_{i=1}^n ∑_{j=1}^n αi αj yi yj xi⊤xj },
s.t. 0 ≤ αi ≤ C, ∑_{i=1}^n αi yi = 0
Appendix Genetic Algorithm 10-1
GA Initialization
[Diagram: a population of binary-encoded chromosomes is decoded into real-valued solutions and evaluated on the objective function f_obj; the global maximum and global minimum are marked]
Figure 11: GA at the first generation
Appendix Genetic Algorithm 10-2
GA Convergence
Figure 12: Solutions at the 1st generation (left) and the r-th generation (right)
Appendix Genetic Algorithm 10-3
GA Decoding
Figure 13: Decoding
θ = θ_lower + (θ_upper − θ_lower) · (∑_{i=0}^{l−1} ai 2^i) / 2^l
where θ is the solution (i.e. a parameter) and the ai are the alleles (bits)
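A minimal sketch (mine) of this decoding from a binary chromosome to a real-valued parameter:

```python
# A minimal sketch of GA decoding: map a binary chromosome (alleles a_i)
# to a real parameter in [theta_lower, theta_upper], as in the formula above.
def decode(alleles, theta_lower, theta_upper):
    l = len(alleles)
    integer = sum(a * 2**i for i, a in enumerate(alleles))
    return theta_lower + (theta_upper - theta_lower) * integer / 2**l

print(decode([1, 0, 1, 1, 0, 1, 0, 1], theta_lower=0.0, theta_upper=200.0))
```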
Appendix Genetic Algorithm 10-4
GA Fitness evaluation
Calculate f(θi), i = 1, . . . , popsize
Evaluate fitness, f_dp(θi)
f_dp(θi): AR, AUC, accuracy, specificity, or sensitivity
Relative fitness: pi = f_dp(θi) / ∑_{k=1}^{popsize} f_dp(θk)
Figure 14: Roulette wheel selection: each solution's share of the wheel gives its proportion to be chosen in the next iteration (generation)
Appendix Genetic Algorithm 10-5
GA Roulette wheel
rand ∼ U(0, 1)
Select the k-th chromosome if ∑_{i=1}^{k−1} pi < rand ≤ ∑_{i=1}^{k} pi
Repeat popsize times to get popsize new chromosomes (see the sketch below)
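A minimal sketch (mine) of roulette wheel selection with the relative fitnesses pi:

```python
# A minimal sketch of roulette wheel selection: draw chromosomes with
# probability proportional to relative fitness p_i, popsize times.
import numpy as np

def roulette_wheel(population, fitness, rng):
    p = fitness / fitness.sum()           # relative fitness p_i
    cum = np.cumsum(p)
    picks = [np.searchsorted(cum, rng.random()) for _ in range(len(population))]
    return population[picks]

rng = np.random.default_rng(0)
pop = np.arange(6)                        # stand-in chromosomes
fit = np.array([1.0, 2.0, 3.0, 1.0, 0.5, 2.5])
print(roulette_wheel(pop, fit, rng))
```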
Appendix Genetic Algorithm 10-6
GA Crossover
Figure 15: Crossover in nature
Figure 16: Randomly chosen one-point crossover (top) and two-point (multi-point) crossover (bottom)
Appendix Genetic Algorithm 10-7
GA Reproductive operators
With probability pc (the crossover rate) two selected parents exchange the genes after a randomly chosen cut point in [1, l − 1], where l is the chromosome length (single-point crossover); otherwise the offspring are copies of the parents. Each gene of an offspring is then flipped (1 to 0 or 0 to 1) with probability pm (the mutation rate), the bit-flip mutation.
Figure 17: One-point crossover (top) and bit-flip mutation (bottom)
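A minimal sketch (mine) of single-point crossover with rate pc and bit-flip mutation with rate pm:

```python
# A minimal sketch of single-point crossover (rate pc) and bit-flip
# mutation (rate pm) for binary chromosomes.
import numpy as np

def crossover(parent1, parent2, pc, rng):
    if rng.random() < pc:
        point = rng.integers(1, len(parent1))        # cut point in [1, l-1]
        child1 = np.concatenate([parent1[:point], parent2[point:]])
        child2 = np.concatenate([parent2[:point], parent1[point:]])
        return child1, child2
    return parent1.copy(), parent2.copy()            # no crossover: copies

def mutate(chromosome, pm, rng):
    flip = rng.random(len(chromosome)) < pm          # each gene with prob. pm
    chromosome[flip] = 1 - chromosome[flip]          # bit-flip
    return chromosome

rng = np.random.default_rng(3)
p1, p2 = np.array([1, 1, 1, 1, 1, 1]), np.array([0, 0, 0, 0, 0, 0])
c1, c2 = crossover(p1, p2, pc=0.5, rng=rng)
print(mutate(c1, pm=0.1, rng=rng), c2)
```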
Appendix Genetic Algorithm 10-8
GA Elitism
The best solution of each iteration is maintained in a separate memory place
When the new population replaces the old one, check whether the best solution is in the population
If not, replace any one member of the population with the best solution
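A minimal sketch (mine) of this elitism rule:

```python
# A minimal sketch of elitism: if the stored best solution is not in the
# new population, overwrite an arbitrary member with it.
import numpy as np

def apply_elitism(population, best):
    if not any(np.array_equal(ind, best) for ind in population):
        population[0] = best          # replace any one member
    return population
```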
Appendix Genetic Algorithm 10-9
Nature to Computer Mapping

Nature                   GA-SVM
Population               Set of parameters
Individual (phenotype)   Parameters
Fitness                  Discriminatory power
Chromosome (genotype)    Encoding of the parameters
Gene                     Binary encoding
Reproduction             Crossover
Generation               Iteration

Table 6: Nature to GA-SVM mapping
Appendix Genetic Algorithm 10-10
Examples
Small sample: 100 solvent and insolvent companies
Credit reform data
X3: Operating Income / Total Assets
X24: Accounts Payable / Total Assets
Appendix Genetic Algorithm 10-11
Figure 18: SVM classification plots of x3 against x24: C = 1 and σ = 1/2, misclassification rate 0.19 (left); GA-SVM, C = 14.86 and σ = 1/121.61, misclassification rate 0 (right)
Appendix Genetic Algorithm 10-12
Figure 19: GA-SVM (C = 187.93 and σ = 1/195.16) classification plots of X3 against X24: training data, misclassification rate 2.38% (left), and testing data, misclassification rate 1.37% (right)
Appendix Financial Ratio Variables 11-1
FR: Profitability

Ratio No.  Definition     Ratio
x1         NI/TA          Return on assets (ROA)
x2         NI/Sales       Net profit margin
x3         OI/TA          Operating income / Total assets
x4         OI/Sales       Operating profit margin
x5         EBIT/TA        EBIT / Total assets
x6         (EBIT+AD)/TA   EBITDA
x7         EBIT/Sales     EBIT / Sales

Table 7: Definitions of financial ratios
Appendix Financial Ratio Variables 11-2
FR: Leverage

Ratio No.  Definition                        Ratio
x8         Equity/TA                         Own funds ratio (simple)
x9         (Equity-ITGA)/(TA-ITGA-Cash-LB)   Own funds ratio (adjusted)
x10        CL/TA                             Current liabilities / Total assets
x11        (CL-Cash)/TA                      Net indebtedness
x12        TL/TA                             Total liabilities / Total assets
x13        Debt/TA                           Debt ratio
x14        EBIT/Interest expenses            Interest coverage ratio

Table 8: Definitions of financial ratios
Appendix Financial Ratio Variables 11-3
FR: Liquidity

Ratio No.  Definition   Ratio
x15        Cash/TA      Cash / Total assets
x16        Cash/CL      Cash ratio
x17        QA/CL        Quick ratio
x18        CA/CL        Current ratio
x19        WC/TA        Working capital
x20        CL/TL        Current liabilities / Total liabilities

Table 9: Definitions of financial ratios
Appendix Financial Ratio Variables 11-4
FR: Activity

Ratio No.  Definition   Ratio
x21        TA/Sales     Asset turnover
x22        INV/Sales    Inventory turnover
x23        AR/Sales     Accounts receivable turnover
x24        AP/Sales     Accounts payable turnover
x25        Log(TA)      Log(Total assets)

Table 10: Definitions of financial ratios
Appendix Financial Ratio Variables 11-5
FR

Ratio No.  Definition
x26        increase (decrease) in inventories / inventories
x27        increase (decrease) in liabilities / total liabilities
x28        increase (decrease) in cash flows / cash and cash equivalents

Table 11: Definitions of financial ratios
Appendix Financial Ratio Variables 11-6
Findings
Härdle et al. (2009): Smooth SVM overall mean of correct predictions ranging from 70% to 78% (misclassification: 22% to 30%)
Chen, Härdle and Moro (2011):
- Most of the models tested: AR between 43.50% and 60.51%
- SVM (grid search optimization): percentage of correctly classified out-of-sample 71.85%
- Logit model: percentage of correctly classified out-of-sample 67.24%
Appendix Financial Ratio Variables 11-7
Findings
Zhang and Härdle (2010):

Performance measure      Logit (%)   CART (%)   BACT (%)
Overall misclass. rate   30.2        33.8       26.6
Type I misclass. rate    28.3        27.2       27.6
Type II misclass. rate   30.3        34.3       26.5
AR                       52.1        58.7       60.4

Table 12: Average value (over bootstrap samples) of performance measures