
Acquisition Credit Scoring Model

Project Final Report

31st April, 2015
Great Lakes Institute of Management, Gurgaon
Subhasis Mishra

– Research Supervisor –

Mr. Manu Chandra (FNMathlogic)


Table of Contents

1 Introduction
2 Scope and Objectives
3 Data Sources
4 Analytical Approach
4.1 Data Collection
4.2 Data Preparation
4.3 Variable Reduction
4.4 Data Sampling
4.5 Model Development
4.6 Intercept not significant in the development sample
4.7 Assessment of sign of variables
4.8 Multicollinearity Test
4.9 Probability Prediction
4.10 Model Prediction (Goodness-of-fit)
4.11 Calculating top 3 variables affecting credit score function
4.12 Reject Inference
4.13 Model Performance
5 Decision Tree Approach
6 Tools and Techniques
7 Way Forward
8 Recommendations and Applications
9 References and Bibliography
10 Project Code

Project Final Report – Great Lakes PGPBA Program Page 2


1 Introduction

Credit scoring is perhaps one of the most classic applications of predictive modelling: predicting whether credit extended to an applicant is likely to result in profit or loss for the lending institution. There are many variations and complexities in how exactly credit is extended to individuals, businesses, and other organizations for various purposes (purchasing equipment, real estate, consumer items, and so on), and through various instruments (credit card, loan, delayed payment plan). But in all cases, a lender provides money to an individual or institution and expects to be paid back in time, with interest commensurate with the risk of default.

Credit scoring is the set of decision models and their underlying techniques that aid lenders in the granting of consumer credit. These techniques determine who will get credit, how much credit they should get, and what operational strategies will enhance the profitability of the borrowers to the lenders. Further, they help to assess the risk in lending. Credit scoring is a dependable assessment of a person’s credit worthiness since it is based on actual data.

A lender commonly makes two types of decisions: first, whether to grant credit to a new applicant, and second, how to deal with existing applicants, including whether to increase their credit limits. In both cases, whatever the techniques used, it is critical that there is a large sample of previous customers with their application details, behavioral patterns, and subsequent credit history available. Most of the techniques use this sample to identify the connection between the characteristics of the consumers (annual income, age, number of years in employment with their current employer, etc.) and their subsequent history.

Typical application areas in the consumer market include: credit cards, auto loans, home mortgages, home equity loans, mail catalog orders, and a wide variety of personal loan products.

2 Scope and Objectives

To evaluate the scope of applying logistic regression and various data mining techniques to credit scoring, facilitating better decision making and reducing risk.


The objective of this project is to build a credit scoring model that quantifies the potential risk involved, on the basis of which credit lending decisions can be made.

3 Data Sources

The applicant data used in this project, covering US customers, was provided by our mentor.

4 Analytical Approach

The steps followed in this project are described below:

4.1 Data Collection

Customer credit history, along with all supporting information, was provided by our mentor. The data is cross-sectional, i.e. it belongs to a single point in time.

4.2 Data Preparation

Listed below are some of the techniques used for data preparation :-

1. Initial variable selection based on judgement –

First, we identified key variables out of all the variables in our data set, purely based on judgement. Initial screening of variables is very important and requires deep understanding of the particular domain as well as experience. So, out of the 46 variables, we took some 20-odd variables into consideration for modelling. Below is the list of initial variables :-

Loan_amnt

Term

Annual_inc

Fico_range_high

Fico_range_low

Last_fico_range_high

Last_fico_range_low

Purpose

Home_ownership


Grade

Dti

Mths_since_last_delinq

Mths_since_last_record

Last_paymnt_amnt

Total_paymnt

Total_paymnt_inv

Pub_rec

Total_rec_int

Revol_bal
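The shortlist above can be applied in R by subsetting the data frame to the chosen columns. A minimal sketch, assuming the raw file has already been read into a data frame (the name raw_data and the toy values here are illustrative, not from the report's code):

```r
# Illustrative only: 'raw_data' and its toy contents are assumptions.
raw_data <- data.frame(loan_amnt = c(5000, 12000),
                       term = c("36 months", "60 months"),
                       annual_inc = c(45000, 80000),
                       member_id = c(1, 2))   # an example column to drop

keep_vars <- c("loan_amnt", "term", "annual_inc", "fico_range_high",
               "fico_range_low", "last_fico_range_high", "last_fico_range_low",
               "purpose", "home_ownership", "grade", "dti",
               "mths_since_last_delinq", "mths_since_last_record",
               "last_pymnt_amnt", "total_pymnt", "total_pymnt_inv",
               "pub_rec", "total_rec_int", "revol_bal")

# Keep only the shortlisted columns that actually exist in the data
model_data <- raw_data[, intersect(keep_vars, names(raw_data)), drop = FALSE]
names(model_data)
```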

2. Missing Value Treatment -

Normally there are two techniques for missing value treatment: removing the entire variable, if more than 80-90% of its observations are missing, or imputing a very high sentinel value like 9999999, if the variable is significant from a modelling perspective and hence cannot be dropped. In our case we went with the latter, as variables like Mths_since_last_delinq and Mths_since_last_record carry a lot of importance for modelling even though more than 90% of their observations were missing in the original data set. For the other, non-significant variables we simply imputed zero.

Below is the R code for the missing value treatment :-

# Work on a copy of the raw data
new_data <- as.data.frame(data, stringsAsFactors=FALSE)

# Sentinel-impute the two significant, mostly-missing variables
new_data$mths_since_last_delinq[is.na(new_data$mths_since_last_delinq)] <- 9999999

new_data$mths_since_last_record[is.na(new_data$mths_since_last_record)] <- 9999999

# Zero-impute all remaining missing values
new_data[is.na(new_data)] <- 0

str(new_data)


3. Identifying the dependent variable –

In any business problem, it is critical to first understand the problem statement and then approach it with an appropriate solution. Hence, our first priority was to identify the dependent variable correctly; getting this wrong would have sent the modelling completely astray. Our problem is concerned with whether a person is going to default, by calculating the probability of default among the consumers.

From our data set, we picked LOAN_STATUS as the dependent variable, which contains values like Current, Default, Charged Off, etc.

4. Data Transformation –

Data Transformation is one of the most significant steps before the actual modelling and goes a long way towards making the final model robust. It includes various steps like the introduction of dummy variables, conversion of integer variables to categorical variables, etc.

Introduction of dummy variables –

We introduced dummy variables for the character variables. As part of the dummy variable inclusion, we did the following :-

# Introduction of Dummy variables

# For Loan status

def <- ifelse(loan_status=="Default", 1, 0)

# purpose dummies

purpose_debt <- ifelse(purpose=='debt_consolidation', 1, 0)

purpose_car <- ifelse(purpose=='car', 1, 0)

purpose_credit <- ifelse(purpose=="credit_card", 1, 0)

purpose_home_imp <- ifelse(purpose=="home_improvement", 1, 0)

purpose_maj_purchase <- ifelse(purpose=="major_purchase", 1, 0)

purpose_ren_energy <- ifelse(purpose=="renewable_energy", 1, 0)

purpose_education <- ifelse(purpose=="educational", 1, 0)

purpose_house <- ifelse(purpose=="house", 1, 0)


purpose_medical <- ifelse(purpose=="medical", 1, 0)

purpose_moving <- ifelse(purpose=="moving", 1, 0)

purpose_small_business <- ifelse(purpose=="small_business", 1, 0)

purpose_vacation <- ifelse(purpose=="vacation", 1, 0)

purpose_wedding <- ifelse(purpose=="wedding", 1, 0)

# home ownership dummies

home_ownership_mort <- ifelse(home_ownership=="MORTGAGE", 1, 0)

home_ownership_own <- ifelse(home_ownership=="OWN", 1, 0)

home_ownership_rent <- ifelse(home_ownership=="RENT", 1, 0)

home_ownership_other <- ifelse(home_ownership=="OTHER", 1, 0)

# Grade

grade_a <- ifelse(grade=="A", 1, 0)

grade_b <- ifelse(grade=="B", 1, 0)

grade_c <- ifelse(grade=="C", 1, 0)

grade_d <- ifelse(grade=="D", 1, 0)

grade_e <- ifelse(grade=="E", 1, 0)

grade_f <- ifelse(grade=="F", 1, 0)

Conversion of continuous variables to categorical variables (fine classing) –

As better modelling practice, it is recommended that continuous variables be converted into categorical variables by introducing bins, a step also called fine classing. It generally yields better results during the Information Value (IV) calculation for the explanatory variables.

We followed the below approach for fine classing :-

gloan_amnt <- cut(new_data$loan_amnt, br=c(0, 5000, 15000, 35000), labels=c("Low","Medium","High"))


gloan_amnt_data <- data.frame(new_data, gloan_amnt)

summary(gloan_amnt_data$gloan_amnt)

gloan_amnt_data_final<-gloan_amnt_data[complete.cases(gloan_amnt_data),]

summary(gloan_amnt_data_final$gloan_amnt)

glast_fico_range_high <- cut(new_data$last_fico_range_high, br=c(0, 649, 699, 744, 850), labels=c("Low","Medium","High","Very High"))

last_fico_high_data <- data.frame(new_data, glast_fico_range_high)

summary(last_fico_high_data$glast_fico_range_high)

last_fico_high_data_final<-last_fico_high_data[complete.cases(last_fico_high_data),]

summary(last_fico_high_data_final$glast_fico_range_high)

glast_fico_range_low <- cut(new_data$last_fico_range_low, br=c(0, 645, 695,740, 845), labels=c("Low","Medium","High","Very High"))

last_fico_low_data <- data.frame(new_data, glast_fico_range_low)

summary(last_fico_low_data$glast_fico_range_low)

last_fico_low_data_final<- last_fico_low_data[complete.cases(last_fico_low_data),]

summary(last_fico_low_data_final$glast_fico_range_low)

gfico_range_high <- cut(new_data$fico_range_high, br=c(0, 650, 700, 745, 850), labels=c("Low","Medium","High","Very High"))

fico_high_data <- data.frame(new_data, gfico_range_high)

summary(fico_high_data$gfico_range_high)

fico_high_data_final<- fico_high_data[complete.cases(fico_high_data),]

summary(fico_high_data_final$gfico_range_high)


gfico_range_low <- cut(new_data$fico_range_low, br=c(0, 650, 700, 745, 850), labels=c("Low","Medium","High","Very High"))

fico_low_data <- data.frame(new_data, gfico_range_low)

summary(fico_low_data$gfico_range_low)

fico_low_data_final <- fico_low_data[complete.cases(fico_low_data),]

summary(fico_low_data_final$gfico_range_low)

gdti <- cut(new_data$dti, br=c(0, 8, 13, 19, 30), labels=c("Low","Medium","High","Very High"))

dti_data <- data.frame(new_data, gdti)

summary(dti_data$gdti)

dti_data_final <- dti_data[complete.cases(dti_data),]

summary(dti_data_final$gdti)

gmths_since_last_delinq<-cut(new_data$mths_since_last_delinq, br=c(0, 48, 6466000, 10000000), labels=c("Low","Medium","High"))

mths_since_last_delinq_data<-data.frame(new_data, gmths_since_last_delinq)

summary(mths_since_last_delinq_data$gmths_since_last_delinq)

mths_since_last_delinq_data_final<- mths_since_last_delinq_data[complete.cases(mths_since_last_delinq_data),]

summary(mths_since_last_delinq_data_final$gmths_since_last_delinq)

gmths_since_last_record <- cut(new_data$mths_since_last_record, br=c(0, 9299000, 10000000), labels=c("Low","High"))


mths_since_last_record_data <- data.frame(new_data, gmths_since_last_record)

summary(mths_since_last_record_data$gmths_since_last_record)

mths_since_last_record_data_final<- mths_since_last_record_data[complete.cases(mths_since_last_record_data),]

summary(mths_since_last_record_data_final$gmths_since_last_record)

gannual_inc <- cut(new_data$annual_inc, br=c(4000, 40500, 59000, 82340, 6000000), labels=c("Low","Medium","High","Very High"))

annual_inc_data <- data.frame(new_data, gannual_inc)

summary(annual_inc_data$gannual_inc)

annual_inc_data_final<- annual_inc_data[complete.cases(annual_inc_data),]

summary(annual_inc_data_final$gannual_inc)

glast_pymnt_amnt <- cut(new_data$last_pymnt_amnt, br=c(0, 195, 392, 1816, 36120), labels=c("Low","Medium","High","Very High"))

last_pymnt_amnt_data <- data.frame(new_data, glast_pymnt_amnt)

summary(last_pymnt_amnt_data$glast_pymnt_amnt)

last_pymnt_amnt_data_final<- last_pymnt_amnt_data[complete.cases(last_pymnt_amnt_data),]

summary(last_pymnt_amnt_data_final$glast_pymnt_amnt)

gtotal_pymnt <- cut(new_data$total_pymnt, br=c(0, 4707, 8089, 13090, 49490), labels=c("Low","Medium","High","Very High"))

total_pymnt_data <- data.frame(new_data, gtotal_pymnt)

summary(total_pymnt_data$gtotal_pymnt)

total_pymnt_data_final<- total_pymnt_data[complete.cases(total_pymnt_data),]

summary(total_pymnt_data_final$gtotal_pymnt)


gtotal_pymnt_inv <- cut(new_data$total_pymnt_inv, br=c(0, 4312, 7590, 12390, 49100), labels=c("Low","Medium","High","Very High"))

total_pymnt_inv_data <- data.frame(new_data, gtotal_pymnt_inv)

summary(total_pymnt_inv_data$gtotal_pymnt_inv)

total_pymnt_inv_data_final<- total_pymnt_inv_data[complete.cases(total_pymnt_inv_data),]

summary(total_pymnt_inv_data_final$gtotal_pymnt_inv)

dim(total_pymnt_inv_data_final)

glast_pymnt_amnt <- cut(new_data$last_pymnt_amnt, br=c(0, 195, 392, 1816, 16120), labels=c("Low","Medium","High","Very High"))

last_pymnt_amnt_data <- data.frame(new_data, glast_pymnt_amnt)

summary(last_pymnt_amnt_data$glast_pymnt_amnt)

last_pymnt_amnt_data_final<- last_pymnt_amnt_data[complete.cases(last_pymnt_amnt_data),]

summary(last_pymnt_amnt_data_final$glast_pymnt_amnt)

dim(last_pymnt_amnt_data_final)

gtotal_rec_int <- cut(new_data$total_rec_int, br=c(0, 640, 1290, 2618, 15300), labels=c("Low","Medium","High","Very High"))

total_rec_int_data <- data.frame(new_data, gtotal_rec_int)

summary(total_rec_int_data$gtotal_rec_int)

total_rec_int_data_final<- total_rec_int_data[complete.cases(total_rec_int_data),]

summary(total_rec_int_data_final$gtotal_rec_int)

dim(total_rec_int_data_final)

4.3 Variable Reduction

Once the strongest characteristics are grouped and ranked, variable selection is done. At the end of this step, the scorecard developer will have a set of strong, grouped characteristics, preferably representing independent information types, for use in the regression step. The strength of a characteristic is gauged using four main criteria :-


Predictive power of each attribute. The weight of evidence (WOE) measure is used for this purpose.

The range and trend of weight of evidence across grouped attributes within a characteristic.

Predictive power of the characteristic. The Information Value (IV) measure is used for this purpose.

Operational and business considerations

In our case, we have used the Information Value approach for variable reduction. Some analysts run other variable selection algorithms (e.g., those that rank predictive power using Chi-Square or R-Square) prior to grouping characteristics. This gives them an indication of characteristic strength using independent means, and also alerts them to cases where the Information Value figure is high or low compared to other measures. The initial characteristic analysis process can be interactive, and involvement from business users and operations staff should be encouraged. In particular, they may provide further insights into any unexpected or illogical behavior patterns and enhance the grouping of all variables. The first step in performing this analysis is to do an initial grouping of the variables and rank-order them by IV or some other strength measure. This can be done using a number of binning techniques.

If using other applications, a good way to start is to bin nominal variables into 50 or so equal groups, and to calculate the WOE and IV for the grouped attributes and characteristics. One can then use any spreadsheet software to fine tune the groupings for the stronger characteristics based on principles to be outlined in the next section. Similarly for categorical characteristics, the WOE for each unique attribute and the IV of each characteristic can be calculated. One can then spend time fine-tuning the grouping for those characteristics that surpass a minimum acceptable strength. Decision trees are also often used for grouping variables. Most users, however, use them to generate initial ideas, and then use alternate software applications to interactively fine-tune the groupings.

Information Value (IV) :-

Information Value provides a measure of how well a variable X is able to distinguish between the levels of a binary response (e.g. "good" vs. "bad") in some target variable Y. The idea is that if a variable X has a low Information Value, it may not do a sufficient job of classifying the target variable, and hence is removed as an explanatory variable.

To see how this works, let X be grouped into n bins, Xi, 1 ≤ i ≤ n. Each x ∈ X corresponds to a y ∈ Y that may take one of two values, say 0 or 1. Then

IV = Σi=1..n (gi − bi) · ln(gi / bi)

where bi is the proportion of 0's in bin i relative to all bins, gi is the proportion of 1's in bin i relative to all bins, and ln(gi / bi) is known as the weight of evidence of bin Xi. The cut-off value may vary, but in our case we have considered an IV cut-off of 0.1.
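As a quick numeric illustration of the formula above (the bin shares are invented for the example, not taken from our data):

```r
# Toy example: three bins of some characteristic X
g <- c(0.10, 0.30, 0.60)   # gi: share of 1's (defaulters) in each bin
b <- c(0.40, 0.35, 0.25)   # bi: share of 0's (non-defaulters) in each bin

woe <- log(g / b)          # weight of evidence per bin
iv  <- sum((g - b) * woe)  # Information Value of the characteristic
round(iv, 2)               # 0.73, well above the 0.1 cut-off
```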

Below is the R code sample, output and plot of IV calculation :-

# IV calculation

iv.mult(final_data1,"def",TRUE)

iv_data <- data.frame(def, annual_inc_data$gannual_inc, last_fico_high_data$glast_fico_range_high, mths_since_last_delinq_data$gmths_since_last_delinq, total_pymnt_data$gtotal_pymnt, total_pymnt_inv_data$gtotal_pymnt_inv, term, last_pymnt_amnt_data$glast_pymnt_amnt, total_rec_int_data$gtotal_rec_int)


iv.plot.summary(iv.mult(final_data1,"def",TRUE))

#taking IV cut off as 0.1

iv_data_final <- data.frame(def, last_fico_high_data$glast_fico_range_high, total_pymnt_data$gtotal_pymnt, total_pymnt_inv_data$gtotal_pymnt_inv, term, last_pymnt_amnt_data$glast_pymnt_amnt, total_rec_int_data$gtotal_rec_int)

iv.mult(iv_data_final,"def",TRUE)

iv.plot.summary(iv.mult(iv_data_final,"def",TRUE))


4.4 Data Sampling

Sampling has been done based on the 70-30 rule: the entire data set is split in a 70:30 ratio, to be used as development and validation data respectively.

Below is the R code :-


# Dividing data into train and test

totalrecords <- nrow(fin_data)

trainfraction = 0.7

trainrecords = as.integer(totalrecords * trainfraction)

allrows <- 1:totalrecords

trainrows <- sample(totalrecords,trainrecords)

testrows <- allrows[-trainrows]

train<-data.frame(fin_data[trainrows,])

test<-data.frame(fin_data[testrows,])

dim(train)

dim(test)

4.5 Model Development

When the training data set contains a binary indicator variable of "Paid back" vs. "Default" (or "Good Credit" vs. "Bad Credit"), logistic regression models are well suited for predictive modelling. Logistic regression yields prediction probabilities for whether or not a particular outcome (e.g., Bad Credit) will occur. Furthermore, logistic regression models are linear models, in that the logit-transformed prediction probability is a linear function of the predictor values. Thus a scorecard derived in this manner has the desirable quality that the final credit score (credit risk) is a linear function of the predictors and, with some additional transformations applied to the model parameters, a simple linear function of the scores associated with each predictor class value after coarse coding. The final credit score is then a simple sum of the individual score values taken from the scorecard.

In our model, we have used the train data set, which contains 70% of the original data. Below is the R code for logistic regression :-

fit_train <- glm(def~., data=train, family="binomial")
summary(fit_train)
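The "additional transformations" mentioned above are usually a linear rescaling of the model's log-odds into scorecard points. A hedged sketch with conventional but assumed constants (600 points at 50:1 good:bad odds, 20 points to double the odds); none of these numbers come from the report:

```r
pdo    <- 20                      # points to double the odds (assumed)
factor <- pdo / log(2)
offset <- 600 - factor * log(50)  # anchor: 600 points at 50:1 odds (assumed)

p_def <- 0.04                     # a model-predicted default probability
odds  <- (1 - p_def) / p_def      # odds of being "good"
score <- offset + factor * log(odds)
round(score)
```

By construction, odds of 50:1 map to exactly 600 points, and each doubling of the odds adds 20 points.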


4.6 Intercept not significant in the development sample

An intercept is almost always part of the model and is almost always significantly different from zero. The test of the intercept in the procedure output tests whether this parameter is equal to zero. If the intercept is zero (equivalent to having no intercept in the model), the resulting model implies that the response function must be exactly zero when all the predictors are set to zero. For a logistic model it means that the logit (or log odds) is zero, which implies that the event probability is 0.5. This is a very strong assumption that is sometimes reasonable, but more often is not. So, a highly significant intercept in your model is generally not a problem.

By the same token, if the intercept is not significant you usually would not want to remove it from the model, because doing so creates a model that says the response function must be zero when the predictors are all zero. If the nature of what you are modeling is such that you want to assume this, then you might remove the intercept. In our case, we obtained an intercept value of around 0.58 on the validation sample.
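The claim above, that a zero logit implies an event probability of 0.5, is easy to verify numerically:

```r
# Inverse-logit of 0: p = 1 / (1 + exp(0)) = 0.5
plogis(0)
1 / (1 + exp(-0))  # same thing written out
```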


4.7 Assessment of sign of variables

a. Last_fico_range_high - The negative sign of the coefficient signifies an inverse relationship with default, which makes sense in this context: the higher the FICO value, the lower the chance of default.

b. Total_pymnt- It also signifies an inverse relationship with default, which is also true in this context.

c. Last_pymnt_amnt- Negative sign of the coefficient shows that it exhibits an inverse relationship with default.

d. Total_rec_int- Positive sign of the coefficient signifies that it has a direct relationship with default.

4.8 Multicollinearity Test

Multicollinearity occurs when there are high correlations among predictor variables, leading to unreliable and unstable estimates of the regression coefficients. After model building, a multicollinearity check is normally performed using the VIF (Variance Inflation Factor) to ensure that the independent variables are not highly correlated. A common VIF cut-off is 5: variables with VIF above 5 should be considered collinear. For factor/categorical variables, however, the GVIF value is the baseline: variables that require more than one coefficient, and thus more than one degree of freedom, are typically evaluated using the GVIF. For one-coefficient terms, VIF equals GVIF.

There are four options for producing the VIF value :-

corvif() from the AED package
vif() from the car package
vif() from the rms package
vif() from the DAAG package

Of these, "AED" and "car" produce GVIF values, while the other two produce VIF values.

# Multi-colinearity check

library(car)
vif(fit_train)


Plotting the fitted logistic model object generates four diagnostic plots at a time.

4.9 Probability Prediction

From the above logistic output, we predicted default probabilities for individual customers. Please find the R code below :-

# Probability prediction

prob_glm <- predict.glm(fit_train, type="response", se.fit=FALSE)

write.csv(prob_glm, "probability.csv")

plot(prob_glm)


4.10 Model Prediction (Goodness-of-fit)

Goodness-of-fit assesses how well a model fits the data. It is usually applied after a final model has been selected; if we have multiple candidate models, goodness-of-fit measures are used to choose among them. Concordance, discordance, the ROC curve, and the KS statistic are used for this purpose.


Concordance :- In OLS regression, R-squared and its more refined variant, adjusted R-squared, would be the "one-stop" metric that immediately tells us whether the model is a good fit. And since this is a value between 0 and 1, it is easily converted to a percentage and presented as "model accuracy" for beginners and the not-so-math-oriented business audience. Unfortunately, adjusted R-squared is not meaningful for logistic regression, because we model the log-odds ratio and it becomes very difficult to explain. This is where concordance helps. Concordance measures the association between the actual values and the values fitted by the model, in percentage terms. It is defined as the ratio of the number of 1-0 pairs in which the 1 received a higher model score than the 0, to the total number of 1-0 pairs possible. A higher concordance (60-70%) means a better-fitted model. However, a very high concordance (85-95%) could also suggest that the model is over-fitted and needs to be re-aligned to explain the entire population.

We used the following custom R function, OptimisedConc(), to obtain the concordance value :-

OptimisedConc = function(model){
  Data  = cbind(model$y, model$fitted.values)
  ones  = Data[Data[,1] == 1,]
  zeros = Data[Data[,1] == 0,]
  conc = matrix(0, dim(zeros)[1], dim(ones)[1])
  disc = matrix(0, dim(zeros)[1], dim(ones)[1])
  ties = matrix(0, dim(zeros)[1], dim(ones)[1])
  for (j in 1:dim(zeros)[1]) {
    for (i in 1:dim(ones)[1]) {
      if (ones[i,2] > zeros[j,2])       {conc[j,i] = 1}
      else if (ones[i,2] < zeros[j,2])  {disc[j,i] = 1}
      else if (ones[i,2] == zeros[j,2]) {ties[j,i] = 1}
    }
  }
  Pairs = dim(zeros)[1] * dim(ones)[1]
  PercentConcordance = (sum(conc)/Pairs)*100
  PercentDiscordance = (sum(disc)/Pairs)*100
  PercentTied        = (sum(ties)/Pairs)*100
  return(list("Percent Concordance" = PercentConcordance,
              "Percent Discordance" = PercentDiscordance,
              "Percent Tied"        = PercentTied,
              "Pairs"               = Pairs))
}

OptimisedConc(fit_train)

In our model we are getting 91 percent concordance, which suggests the model may be over-fitted.
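The double for-loop above evaluates every 1-0 pair explicitly, which is slow for large samples. An equivalent vectorized sketch (our own rewrite, not from the report) using outer():

```r
# Same concordance/discordance/tie percentages, computed without loops
fast_conc <- function(ones_scores, zeros_scores) {
  higher <- outer(ones_scores, zeros_scores, ">")   # did the 1 outscore the 0?
  tied   <- outer(ones_scores, zeros_scores, "==")
  pairs  <- length(ones_scores) * length(zeros_scores)
  list(concordance = 100 * sum(higher) / pairs,
       discordance = 100 * sum(!higher & !tied) / pairs,
       tied        = 100 * sum(tied) / pairs)
}

# Toy check: 3 defaulters' scores vs. 3 non-defaulters' scores
fast_conc(c(0.9, 0.8, 0.4), c(0.3, 0.8, 0.1))
```

On the fitted model, fast_conc(fit_train$fitted.values[fit_train$y == 1], fit_train$fitted.values[fit_train$y == 0]) should reproduce OptimisedConc()'s percentages.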

ROC Curve and AUC :-

For train data :-

#Calculating ROC curve for model

library(ROCR)

#score train data set

train$score<-predict(fit_train,type='response',train)

pred_train<-prediction(train$score,train$def)

perf_train <- performance(pred_train,"tpr","fpr")

plot(perf_train)


# calculating AUC

auc_train <- performance(pred_train,"auc")

auc_train <- unlist(slot(auc_train, "y.values"))

# adding min and max ROC AUC to the center of the plot

minauc<-min(round(auc_train, digits = 2))

maxauc<-max(round(auc_train, digits = 2))

minauct <- paste(c("min(AUC) = "),minauc,sep="")

maxauct <- paste(c("max(AUC) = "),maxauc,sep="")

legend(0.3,0.6,c(minauct,maxauct,"\n"),border="white",cex=1.7,box.col = "white")


For the training data, we obtain an AUC of 0.91.
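It is no coincidence that the training AUC matches the 91 percent concordance: for a binary outcome, AUC equals the proportion of concordant 1-0 pairs plus half of the tied pairs. A small illustrative Python sketch (not part of the project's R code) makes the identity concrete:

```python
def auc_by_pairs(y, scores):
    """AUC = P(a random 1 outscores a random 0) + 0.5 * P(tie)."""
    pos = [s for yi, s in zip(y, scores) if yi == 1]
    neg = [s for yi, s in zip(y, scores) if yi == 0]
    # bools count as 0/1, so each pair contributes 1, 0.5 (tie) or 0
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

# 5 concordant pairs out of 6, no ties -> AUC = 5/6
print(auc_by_pairs([1, 1, 0, 0, 0], [0.9, 0.4, 0.5, 0.2, 0.2]))
```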

For test data:

#score test data set

test$score<-predict(fit_train,type='response',test)

pred_test<-prediction(test$score,test$def)

perf_test <- performance(pred_test,"tpr","fpr")

plot(perf_test)

# calculating AUC

auc_test <- performance(pred_test,"auc")

# now converting S4 class to vector

auc_test <- unlist(slot(auc_test, "y.values"))

# adding min and max ROC AUC to the center of the plot

minauc<-min(round(auc_test, digits = 2))

maxauc<-max(round(auc_test, digits = 2))

minauct <- paste(c("min(AUC) = "),minauc,sep="")

maxauct <- paste(c("max(AUC) = "),maxauc,sep="")

legend(0.3,0.6,c(minauct,maxauct,"\n"),border="white",cex=1.7,box.col = "white")


For the test data, we obtain an AUC of 0.84.

KS Statistic (for train data):

# Calculating KS statistic for train data

max(attr(perf_train,'y.values')[[1]]-attr(perf_train,'x.values')[[1]])

The KS statistic for the training data comes out to 0.706, indicating strong separation between defaulters and non-defaulters.
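The one-liner above works because, along the ROC curve, TPR and FPR at each threshold are exactly the cumulative capture rates of the 1s and 0s, so their maximum gap is the KS statistic. An illustrative Python sketch (toy data, not the project code) of the same computation:

```python
def ks_statistic(y, scores):
    """Largest gap between cumulative capture rates of 1s and 0s (max TPR - FPR)."""
    pos = [s for yi, s in zip(y, scores) if yi == 1]
    neg = [s for yi, s in zip(y, scores) if yi == 0]
    best = 0.0
    for t in sorted(set(scores)):          # every observed score is a candidate cutoff
        tpr = sum(s >= t for s in pos) / len(pos)
        fpr = sum(s >= t for s in neg) / len(neg)
        best = max(best, tpr - fpr)
    return best

print(ks_statistic([1, 1, 0, 0], [0.9, 0.4, 0.6, 0.2]))  # 0.5
```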


KS Statistic (for test data):

# Calculating KS statistic

max(attr(perf_test,'y.values')[[1]]-attr(perf_test,'x.values')[[1]])

4.11 Calculating Top 3 Variables Affecting the Credit Score Function

#Calculating top 3 variables affecting Credit Score Function

g<-predict(fit_train,type='terms',test)

#function to pick top 3 reasons

#works by sorting coefficient terms in equation and selecting top 3 in sort for each loan scored

ftopk<- function(x,top=3){

res=names(x)[order(x, decreasing = TRUE)][1:top]

paste(res,collapse=";",sep="")

}

# apply the function to each row of the term matrix to get the top 3 reasons per loan

topk=apply(g,1,ftopk,top=3)

#add reason list to scored test sample

test<-cbind(test, topk)

summary(test)
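Stripped of the R plumbing, ftopk() simply ranks the per-loan term contributions and keeps the largest k names. An illustrative Python sketch (the variable names in the example row are hypothetical, echoing the model's predictors):

```python
def top_k_reasons(term_contributions, k=3):
    """Names of the k largest term contributions for one loan, joined with ';'."""
    ranked = sorted(term_contributions, key=term_contributions.get, reverse=True)
    return ";".join(ranked[:k])

# hypothetical contributions for one scored loan (predict(..., type='terms') analogue)
row = {"fico": 1.2, "total_pymnt": -0.4, "last_pymnt_amnt": 0.9, "total_rec_int": 0.1}
print(top_k_reasons(row))  # fico;last_pymnt_amnt;total_rec_int
```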


4.12 Reject Inference

The term Reject Inference describes the problem of dealing with the inherent bias that arises when modeling is based only on those previous applicants whose actual performance (Good Credit vs. Bad Credit) was observed; there is usually also a significant number of previous applicants who were rejected, and whose final credit performance was therefore never observed. The question is how to include those applicants in the modeling, so that the predictive model becomes more accurate, more robust (less biased), and applicable to such individuals as well.

This is of particular importance when the criteria for deciding whether or not to extend credit need to be loosened, in order to attract and extend credit to more applicants. This can happen, for example, during a severe economic downturn that affects many people and places their overall financial well-being into a condition that would not qualify them as acceptable credit risks under the older criteria. In short, if nobody were to qualify for credit any more, the institutions extending credit would be out of business. So it is often critically important to make predictions for observations whose predictor values lie essentially outside the range of what would previously have been accepted, and which are consequently absent from the training data where actual outcomes are recorded.

A number of approaches have been suggested for including previously rejected applicants in the model-building step, in order to make the model more broadly applicable (to those applicants as well). In short, these methods come down to systematically extrapolating from the observed data, often by deliberately introducing assumptions about the loan outcome a rejected applicant would have had, had they been accepted for credit.
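One of the simpler extrapolation schemes, sometimes called parceling, can be sketched as: score the rejects with the accepted-only model, then assign each an inferred good/bad label in proportion to its predicted default probability. The Python sketch below is a conceptual illustration under that assumption, not a method used in this project:

```python
import random

def parcel_rejects(reject_pd, seed=7):
    """Parceling sketch: label each rejected applicant 'bad' (1) with probability
    equal to its model-estimated default probability, 'good' (0) otherwise."""
    rng = random.Random(seed)
    return [1 if rng.random() < p else 0 for p in reject_pd]

# roughly 80% of rejects with predicted PD 0.8 end up labelled bad
labels = parcel_rejects([0.8] * 10000)
print(sum(labels) / len(labels))
```

The inferred labels would then be appended to the accepted population and the scorecard refitted on the combined data.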

4.13 Model Performance

A specific performance window would be used to monitor the accuracy and stability of the model's predictions once it is deployed.

5 Decision Tree Approach

This is one of the classic data mining techniques for better decision making: the tree structure gives a clear view of the data and of the important variables, which in turn helps minimize the potential risk involved. However, individual customer probabilities cannot be obtained directly with this method, which is why logistic regression is generally preferred over it in industry.

# Decision Tree Implementation

#load tree package

library(rpart)

#build model using 90% 10% priors

#with smaller complexity parameter to allow more complex trees

model_dt <- rpart(def~.,data=train,parms=list(prior=c(.9,.1)),cp=.0002)

plot(model_dt)

text(model_dt)


6 Tools and Techniques

The following tools and techniques have been used:

Tool: RStudio (R). Techniques: Predictive Modeling, Logistic Regression, Data Mining Techniques.

7 Way Forward

Reject inference has not been implemented, so there is scope to add it. Data mining techniques such as Random Forests and Neural Networks can also be implemented, depending on the scope and ease of the project; so far only a Decision Tree has been implemented in addition to the logistic model.

8 Recommendations and Applications

Typical application areas in the consumer market include credit cards, auto loans, home mortgages, home equity loans, mail catalog orders, and a wide variety of personal loan products.

9 References and Bibliography

The following documents were referred to for this project:

Naeem Siddiqi, Credit Risk Scorecards: Developing and Implementing Intelligent Credit Scoring

Sharma, Credit Scoring

Brett Lantz, Machine Learning with R

10 Project Code

Below is the complete R code used for the project:

rm(list=ls())

data <- read.csv(file.choose(), header=T, stringsAsFactors=FALSE)

str(data)

# Replacing NA values

new_data <- as.data.frame(data, stringsAsFactors=FALSE)


new_data$mths_since_last_delinq[is.na(new_data$mths_since_last_delinq)] <- 9999999

new_data$mths_since_last_record[is.na(new_data$mths_since_last_record)] <- 9999999

new_data[is.na(new_data)] <- 0

str(new_data)

attach(new_data)

# Introduction of Dummy variables

# For Loan status

def <- ifelse(loan_status=="Default", 1, 0)

# purpose dummies

purpose_debt <- ifelse(purpose=='debt_consolidation', 1, 0)

purpose_car <- ifelse(purpose=='car', 1, 0)

purpose_credit <- ifelse(purpose=="credit_card", 1, 0)

purpose_home_imp <- ifelse(purpose=="home_improvement", 1, 0)

purpose_maj_purchase <- ifelse(purpose=="major_purchase", 1, 0)

purpose_ren_energy <- ifelse(purpose=="renewable_energy", 1, 0)

purpose_education <- ifelse(purpose=="educational", 1, 0)

purpose_house <- ifelse(purpose=="house", 1, 0)

purpose_medical <- ifelse(purpose=="medical", 1, 0)

purpose_moving <- ifelse(purpose=="moving", 1, 0)

purpose_small_business <- ifelse(purpose=="small_business", 1, 0)

purpose_vacation <- ifelse(purpose=="vacation", 1, 0)

purpose_wedding <- ifelse(purpose=="wedding", 1, 0)

# home ownership dummies

home_ownership_mort <- ifelse(home_ownership=="MORTGAGE", 1, 0)

home_ownership_own <- ifelse(home_ownership=="OWN", 1, 0)

home_ownership_rent <- ifelse(home_ownership=="RENT", 1, 0)

home_ownership_other <- ifelse(home_ownership=="OTHER", 1, 0)

# Grade


grade_a <- ifelse(grade=="A",1 ,0)

grade_b <- ifelse(grade=="B",1 ,0)

grade_c <- ifelse(grade=="C",1 ,0)

grade_d <- ifelse(grade=="D",1 ,0)

grade_e <- ifelse(grade=="E",1 ,0)

grade_f <- ifelse(grade=="F",1 ,0)

# Fine classing

gloan_amnt <- cut(new_data$loan_amnt, br=c(0, 5000, 15000, 35000), labels=c("Low","Medium","High"))

gloan_amnt_data <- data.frame(new_data, gloan_amnt)

summary(gloan_amnt_data$gloan_amnt)

gloan_amnt_data_final <- gloan_amnt_data[complete.cases(gloan_amnt_data),]

summary(gloan_amnt_data_final$gloan_amnt)
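R's cut() assigns each value to a right-closed interval (breaks[i], breaks[i+1]] and returns NA outside the overall range. An illustrative Python sketch of the same fine-classing logic, reusing the gloan_amnt breakpoints (not part of the project's R code):

```python
from bisect import bisect_left

def fine_class(value, breaks, labels):
    """Mimic R's cut(): right-closed bins (breaks[i], breaks[i+1]];
    values outside the overall range map to None (R would give NA)."""
    if value <= breaks[0] or value > breaks[-1]:
        return None
    return labels[bisect_left(breaks, value) - 1]

breaks = [0, 5000, 15000, 35000]      # same breakpoints as gloan_amnt
labels = ["Low", "Medium", "High"]
print(fine_class(5000, breaks, labels))   # Low  (5000 falls in (0, 5000])
print(fine_class(5001, breaks, labels))   # Medium
```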

glast_fico_range_high <- cut(new_data$last_fico_range_high, br=c(0, 649, 699, 744, 850), labels=c("Low","Medium","High","Very High"))

last_fico_high_data <- data.frame(new_data, glast_fico_range_high)

summary(last_fico_high_data$glast_fico_range_high)

last_fico_high_data_final <- last_fico_high_data[complete.cases(last_fico_high_data),]

summary(last_fico_high_data_final$glast_fico_range_high)

glast_fico_range_low <- cut(new_data$last_fico_range_low, br=c(0, 645, 695,740, 845), labels=c("Low","Medium","High","Very High"))

last_fico_low_data <- data.frame(new_data, glast_fico_range_low)

summary(last_fico_low_data$glast_fico_range_low)

last_fico_low_data_final <- last_fico_low_data[complete.cases(last_fico_low_data),]

summary(last_fico_low_data_final$glast_fico_range_low)


gfico_range_high <- cut(new_data$fico_range_high, br=c(0, 650, 700, 745, 850), labels=c("Low","Medium","High","Very High"))

fico_high_data <- data.frame(new_data, gfico_range_high)

summary(fico_high_data$gfico_range_high)

fico_high_data_final <- fico_high_data[complete.cases(fico_high_data),]

summary(fico_high_data_final$gfico_range_high)

gfico_range_low <- cut(new_data$fico_range_low, br=c(0, 650, 700, 745, 850), labels=c("Low","Medium","High","Very High"))

fico_low_data <- data.frame(new_data, gfico_range_low)

summary(fico_low_data$gfico_range_low)

fico_low_data_final <- fico_low_data[complete.cases(fico_low_data),]

summary(fico_low_data_final$gfico_range_low)

gdti <- cut(new_data$dti, br=c(0, 8, 13, 19, 30), labels=c("Low","Medium","High","Very High"))

dti_data <- data.frame(new_data, gdti)

summary(dti_data$gdti)

dti_data_final <- dti_data[complete.cases(dti_data),]

summary(dti_data_final$gdti)

gmths_since_last_delinq <- cut(new_data$mths_since_last_delinq, br=c(0, 48, 6466000, 10000000), labels=c("Low","Medium","High"))

mths_since_last_delinq_data <- data.frame(new_data, gmths_since_last_delinq)

summary(mths_since_last_delinq_data$gmths_since_last_delinq)

mths_since_last_delinq_data_final <- mths_since_last_delinq_data[complete.cases(mths_since_last_delinq_data),]


summary(mths_since_last_delinq_data_final$gmths_since_last_delinq)

gmths_since_last_record <- cut(new_data$mths_since_last_record, br=c(0, 9299000, 10000000), labels=c("Low","High"))

mths_since_last_record_data <- data.frame(new_data, gmths_since_last_record)

summary(mths_since_last_record_data$gmths_since_last_record)

mths_since_last_record_data_final <- mths_since_last_record_data[complete.cases(mths_since_last_record_data),]

summary(mths_since_last_record_data_final$gmths_since_last_record)

gannual_inc <- cut(new_data$annual_inc, br=c(4000, 40500, 59000, 82340, 6000000), labels=c("Low","Medium","High","Very High"))

annual_inc_data <- data.frame(new_data, gannual_inc)

summary(annual_inc_data$gannual_inc)

annual_inc_data_final <- annual_inc_data[complete.cases(annual_inc_data),]

summary(annual_inc_data_final$gannual_inc)

glast_pymnt_amnt <- cut(new_data$last_pymnt_amnt, br=c(0, 195, 392, 1816, 36120), labels=c("Low","Medium","High","Very High"))

last_pymnt_amnt_data <- data.frame(new_data, glast_pymnt_amnt)

summary(last_pymnt_amnt_data$glast_pymnt_amnt)

last_pymnt_amnt_data_final <- last_pymnt_amnt_data[complete.cases(last_pymnt_amnt_data),]

summary(last_pymnt_amnt_data_final$glast_pymnt_amnt)

gtotal_pymnt <- cut(new_data$total_pymnt, br=c(0, 4707, 8089, 13090, 49490), labels=c("Low","Medium","High","Very High"))

total_pymnt_data <- data.frame(new_data, gtotal_pymnt)

summary(total_pymnt_data$gtotal_pymnt)


total_pymnt_data_final <- total_pymnt_data[complete.cases(total_pymnt_data),]

summary(total_pymnt_data_final$gtotal_pymnt)

gtotal_pymnt_inv <- cut(new_data$total_pymnt_inv, br=c(0, 4312, 7590, 12390, 49100), labels=c("Low","Medium","High","Very High"))

total_pymnt_inv_data <- data.frame(new_data, gtotal_pymnt_inv)

summary(total_pymnt_inv_data$gtotal_pymnt_inv)

total_pymnt_inv_data_final <- total_pymnt_inv_data[complete.cases(total_pymnt_inv_data),]

summary(total_pymnt_inv_data_final$gtotal_pymnt_inv)

dim(total_pymnt_inv_data_final)

glast_pymnt_amnt <- cut(new_data$last_pymnt_amnt, br=c(0, 195, 392, 1816, 16120), labels=c("Low","Medium","High","Very High"))

last_pymnt_amnt_data <- data.frame(new_data, glast_pymnt_amnt)

summary(last_pymnt_amnt_data$glast_pymnt_amnt)

last_pymnt_amnt_data_final <- last_pymnt_amnt_data[complete.cases(last_pymnt_amnt_data),]

summary(last_pymnt_amnt_data_final$glast_pymnt_amnt)

dim(last_pymnt_amnt_data_final)

gtotal_rec_int <- cut(new_data$total_rec_int, br=c(0, 640, 1290, 2618, 15300), labels=c("Low","Medium","High","Very High"))

total_rec_int_data <- data.frame(new_data, gtotal_rec_int)

summary(total_rec_int_data$gtotal_rec_int)

total_rec_int_data_final <- total_rec_int_data[complete.cases(total_rec_int_data),]

summary(total_rec_int_data_final$gtotal_rec_int)

dim(total_rec_int_data_final)

# define global


purpose <- data.frame(purpose_debt ,purpose_car ,purpose_credit ,purpose_home_imp ,purpose_maj_purchase,purpose_ren_energy ,purpose_education ,purpose_house ,purpose_medical ,purpose_moving ,purpose_small_business ,purpose_vacation ,purpose_wedding)

home_ownership <- data.frame(home_ownership_mort, home_ownership_own, home_ownership_rent, home_ownership_other)

grade <- data.frame(grade_a,grade_b,grade_c,grade_d,grade_e,grade_f)

final_data <- data.frame(def, annual_inc_data$gannual_inc, last_fico_high_data$glast_fico_range_high, last_fico_low_data$glast_fico_range_low, fico_low_data$gfico_range_low, fico_high_data$gfico_range_high, mths_since_last_delinq_data$gmths_since_last_delinq, total_pymnt_data$gtotal_pymnt, total_pymnt_inv_data$gtotal_pymnt_inv, term, last_pymnt_amnt_data$glast_pymnt_amnt, total_rec_int_data$gtotal_rec_int, purpose, grade,home_ownership)

fit1 <- glm(def~.,data=final_data, family="binomial")

summary(fit1)

final_data1<- data.frame(def, annual_inc_data$gannual_inc, last_fico_high_data$glast_fico_range_high, mths_since_last_delinq_data$gmths_since_last_delinq, total_pymnt_data$gtotal_pymnt, total_pymnt_inv_data$gtotal_pymnt_inv, term, last_pymnt_amnt_data$glast_pymnt_amnt, total_rec_int_data$gtotal_rec_int)

fit2 <- glm(def~.,data=final_data1, family="binomial")

summary(fit2)

#IV calculation (iv.mult and iv.plot.summary come from the 'woe' package)

library(woe)

iv.mult(final_data1,"def",TRUE)

iv_data <- data.frame(def, annual_inc_data$gannual_inc, last_fico_high_data$glast_fico_range_high, mths_since_last_delinq_data$gmths_since_last_delinq, total_pymnt_data$gtotal_pymnt, total_pymnt_inv_data$gtotal_pymnt_inv, term, last_pymnt_amnt_data$glast_pymnt_amnt, total_rec_int_data$gtotal_rec_int)

iv.plot.summary(iv.mult(final_data1,"def",TRUE))


#taking IV cut off as 0.1

iv_data_final <- data.frame(def, last_fico_high_data$glast_fico_range_high, total_pymnt_data$gtotal_pymnt, total_pymnt_inv_data$gtotal_pymnt_inv, term, last_pymnt_amnt_data$glast_pymnt_amnt, total_rec_int_data$gtotal_rec_int)

iv.mult(iv_data_final,"def",TRUE)

iv.plot.summary(iv.mult(iv_data_final,"def",TRUE))

fin_data <- data.frame(def, last_fico_range_high, last_pymnt_amnt, total_pymnt, total_rec_int)

fit <- glm(def~., data=fin_data, family="binomial")

summary(fit)

library(car)   # vif() is provided by the car package

vif(fit)

# Dividing data into train and test

totalrecords <- nrow(fin_data)

trainfraction = 0.7

trainrecords = as.integer(totalrecords * trainfraction)

allrows <- 1:totalrecords

trainrows <- sample(totalrecords,trainrecords)

testrows <- allrows[-trainrows]

train<-data.frame(fin_data[trainrows,])

test<-data.frame(fin_data[testrows,])

dim(train)

dim(test)
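The split above draws 70% of the row indices at random for training and keeps the remainder for testing. An illustrative Python sketch of the same index bookkeeping (not part of the project's R code):

```python
import random

def split_rows(n_rows, train_fraction=0.7, seed=1):
    """Random row split, mirroring sample() plus negative indexing in R."""
    rng = random.Random(seed)
    train_rows = rng.sample(range(n_rows), int(n_rows * train_fraction))
    chosen = set(train_rows)
    test_rows = [r for r in range(n_rows) if r not in chosen]
    return train_rows, test_rows

tr, te = split_rows(10)
print(len(tr), len(te))  # 7 3
```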

fit_train <- glm(def~., data=train, family="binomial")

summary(fit_train)


# multicolinearity

vif(fit_train)

#Calculating ROC curve for model

library(ROCR)

#score train data set

train$score<-predict(fit_train,type='response',train)

pred_train<-prediction(train$score,train$def)

perf_train <- performance(pred_train,"tpr","fpr")

plot(perf_train)

# calculating AUC

auc_train <- performance(pred_train,"auc")

auc_train <- unlist(slot(auc_train, "y.values"))

# adding min and max ROC AUC to the center of the plot

minauc<-min(round(auc_train, digits = 2))

maxauc<-max(round(auc_train, digits = 2))

minauct <- paste(c("min(AUC) = "),minauc,sep="")

maxauct <- paste(c("max(AUC) = "),maxauc,sep="")

legend(0.3,0.6,c(minauct,maxauct,"\n"),border="white",cex=1.7,box.col = "white")

# Calculating KS statistic for train data

max(attr(perf_train,'y.values')[[1]]-attr(perf_train,'x.values')[[1]])

# Concordance

OptimisedConc=function(model)

{


Data = cbind(model$y, model$fitted.values)

ones = Data[Data[,1] == 1,]

zeros = Data[Data[,1] == 0,]

conc=matrix(0, dim(zeros)[1], dim(ones)[1])

disc=matrix(0, dim(zeros)[1], dim(ones)[1])

ties=matrix(0, dim(zeros)[1], dim(ones)[1])

for (j in 1:dim(zeros)[1])

{

for (i in 1:dim(ones)[1])

{

if (ones[i,2]>zeros[j,2])

{conc[j,i]=1}

else if (ones[i,2]<zeros[j,2])

{disc[j,i]=1}

else if (ones[i,2]==zeros[j,2])

{ties[j,i]=1}

}

}

Pairs=dim(zeros)[1]*dim(ones)[1]

PercentConcordance=(sum(conc)/Pairs)*100

PercentDiscordance=(sum(disc)/Pairs)*100

PercentTied=(sum(ties)/Pairs)*100

return(list("Percent Concordance"=PercentConcordance,"Percent Discordance"=PercentDiscordance,"Percent Tied"=PercentTied,"Pairs"=Pairs))

}

# concordance of train data

OptimisedConc(fit_train)

#score test data set


test$score<-predict(fit_train,type='response',test)

pred_test<-prediction(test$score,test$def)

perf_test <- performance(pred_test,"tpr","fpr")

plot(perf_test)

# calculating AUC

auc_test <- performance(pred_test,"auc")

auc_test <- unlist(slot(auc_test, "y.values"))

# adding min and max ROC AUC to the center of the plot

minauc<-min(round(auc_test, digits = 2))

maxauc<-max(round(auc_test, digits = 2))

minauct <- paste(c("min(AUC) = "),minauc,sep="")

maxauct <- paste(c("max(AUC) = "),maxauc,sep="")

legend(0.3,0.6,c(minauct,maxauct,"\n"),border="white",cex=1.7,box.col = "white")

# Calculating KS statistic

max(attr(perf_test,'y.values')[[1]]-attr(perf_test,'x.values')[[1]])

# Probabilities prediction

prob_glm <- predict.glm(fit_train, type="response",se.fit=FALSE)

write.csv(prob_glm, "probability.csv")

plot(prob_glm)

#Calculating top 3 variables affecting Credit Score Function

g<-predict(fit_train,type='terms',test)

#function to pick top 3 reasons

#works by sorting coefficient terms in equation and selecting top 3 in sort for each loan scored

ftopk<- function(x,top=3){


res=names(x)[order(x, decreasing = TRUE)][1:top]

paste(res,collapse=";",sep="")

}

# apply the function to each row of the term matrix to get the top 3 reasons per loan

topk=apply(g,1,ftopk,top=3)

#add reason list to scored test sample

test<-cbind(test, topk)

summary(test)

# Decision Tree Implementation

#load tree package

library(rpart)

#build model using 90% 10% priors

#with smaller complexity parameter to allow more complex trees

model_dt <- rpart(def~.,data=train,parms=list(prior=c(.9,.1)),cp=.0002)

plot(model_dt)

text(model_dt)

#score test data

test$tscore1<-predict(model_dt,type='prob',test)

pred5<-prediction(test$tscore1[,2],test$def)

perf5 <- performance(pred5,"tpr","fpr")

# Random Forest

library(randomForest)

# keep only rows with a non-missing target before fitting the forest
train <- train[complete.cases(train$def), ]


# a factor target makes randomForest fit a classification forest
arf <- randomForest(as.factor(def)~., data=train, importance=TRUE, proximity=TRUE, ntree=500, keep.forest=TRUE)

#plot variable importance

varImpPlot(arf)

train$p <- predict(model_dt,train,type="prob")

train$p

summary(train$p)

test$p <- predict(arf, test, type="prob")

summary(test$p)

----------------------- End Of Report --------------------------


