Estimation of the probability of default: Credit Risk

Estimating the probability of default: Credit Risk

Mohamed Arsalan Qadri, Sarvesh Saurabh, Mohit Ravi

Summary

• Credit risk – The probability of default
• Data Cleansing
• Logistic Regression
• Linear Discriminant Analysis
• Comparison of the LR and LDA
• Factor Analysis

Credit Risk

What is it?
• The risk of default on a debt that may arise from a borrower failing to make required payments.

Impact on the lender?
• Lost principal and interest, disruption to cash flows, and increased collection costs.

How to estimate it?
• Credit risk arises from the potential that a borrower or counterparty will fail to perform on an obligation.

Sources of risk?
• For most banks, loans are the largest and most obvious source of credit risk.

• There are other sources of credit risk both on and off the balance sheet, including letters of credit, unfunded loan commitments, and lines of credit.

• Other products, activities, and services that expose a bank to credit risk are credit derivatives, foreign exchange, and cash management services.


Credit Scoring vs Risk

Estimation of risk?
• The risk posed by a borrower is inversely related to the credit score: the higher the score, the lower the estimated risk.
• A credit score is a statistically derived numeric expression of a person's creditworthiness that lenders use to assess the likelihood that the person will repay his or her debts.
• A credit score (range 300-850) is based on, among other things, a person's past credit history.

Credit Scoring

• Consumers can typically keep their credit scores high by maintaining a long history of always paying their bills on time and not having too much debt.

• A FICO score is the most widely used credit scoring system.

• A credit score is primarily based on credit report information, typically sourced from credit bureaus.

Data Cleaning

• Serious Delinquency in two years (make a pie chart for this)

• Revolving Utilization Of Unsecured Lines

• Age

• Number Of Time 30-59 Days Past Due Not Worse

• Number Of Time 60-89 Days Past Due Not Worse

• Number Of Times 90 Days Late

• Monthly Income
• Missing values were replaced with the mean.
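As a minimal sketch of mean imputation for Monthly Income, the snippet below fills missing entries with the mean of the observed ones. The function name `impute_with_mean` and the sample incomes are hypothetical and are not taken from the slides.

```python
from statistics import mean


def impute_with_mean(values):
    """Replace None entries with the mean of the observed entries."""
    observed = [v for v in values if v is not None]
    fill = mean(observed)
    return [fill if v is None else v for v in values]


# Hypothetical monthly incomes; None marks a missing value.
monthly_income = [5000.0, None, 3200.0, 7400.0, None]
imputed_income = impute_with_mean(monthly_income)
print(imputed_income)
```

Mean imputation keeps the overall mean unchanged but shrinks the variance, which may be one reason to prefer a regression-based fill.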

Data Cleaning
• Monthly Income
• Ran Multiple Linear Regression to predict the missing values.

Data Cleaning
• Monthly Income
• The histogram after running Multiple Linear Regression on the missing values.
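Regression-based imputation can be sketched as follows. This is a simplification: the slides used multiple linear regression without naming the predictors, whereas this sketch fits ordinary least squares on a single, hypothetical predictor (`age`) so that it stays dependency-free. All names and sample values are assumptions.

```python
def fit_simple_ols(xs, ys):
    """Least-squares fit of y = intercept + slope * x."""
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    sxx = sum((x - x_bar) ** 2 for x in xs)
    slope = sxy / sxx
    return y_bar - slope * x_bar, slope


def impute_by_regression(predictor, target):
    """Fit on rows where target is known, then predict the missing rows."""
    known = [(x, y) for x, y in zip(predictor, target) if y is not None]
    intercept, slope = fit_simple_ols([x for x, _ in known],
                                      [y for _, y in known])
    return [intercept + slope * x if y is None else y
            for x, y in zip(predictor, target)]


# Hypothetical data: income is missing for the last two borrowers.
age = [30, 40, 50, 60, 45]
monthly_income = [3000.0, 4000.0, 5000.0, None, None]
filled_income = impute_by_regression(age, monthly_income)
print(filled_income)
```

With several predictors, the same idea applies, but the fit requires solving the normal equations or using a statistics package.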

Data Cleaning
• Debt Ratio
• We found that the Debt Ratio was extremely high in many cases.
• Upon closer inspection, we found that the high debt ratios occurred in records whose Monthly Income was unknown.
• From this we inferred that, for these records, the Debt Ratio field most probably holds the debt amount itself rather than a ratio.

Data Cleaning
• Debt Ratio
• We replaced the high debt ratio values by dividing them by the predicted monthly income.
• The new mean after replacement was 0.67.
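The debt ratio correction can be sketched as below. The slides do not state the exact rule used to flag a value as "high", so this sketch assumes that the correction applies to records whose Monthly Income was originally missing. The function name and the sample records are hypothetical.

```python
def correct_debt_ratio(debt_ratio, predicted_income, income_was_missing):
    """Rescale the debt ratio for a record whose income was originally missing.

    Assumption: for such records the stored value is a debt amount, so
    dividing by the predicted monthly income turns it back into a ratio.
    """
    if income_was_missing and predicted_income > 0:
        return debt_ratio / predicted_income
    return debt_ratio


# Hypothetical records: (stored debt ratio, predicted income, income missing?)
records = [
    (1500.0, 5000.0, True),   # looks like a debt amount, gets rescaled
    (0.35, 4200.0, False),    # already a ratio, left unchanged
]
corrected = [correct_debt_ratio(*record) for record in records]
print(corrected)
```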

Data Cleaning
• Number of Dependents

Data Modelling

• Split the dataset into Training data (70%) and Test data (30%).
• Computed the correlation matrix among the independent variables.
• The variables had very little correlation among themselves.
• Ran Logistic Regression using stepwise selection.
• Ran Linear Discriminant Analysis.
• Compared both models by measuring their prediction accuracy.
• Ran both models on the significant factors obtained from Factor Analysis.
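The first two steps above can be sketched as follows: a random 70/30 split and a Pearson correlation matrix. The function names, the seed, and the sample columns are hypothetical, and the slides do not say whether the split was random or stratified.

```python
import random


def train_test_split(rows, train_fraction=0.7, seed=0):
    """Shuffle a copy of the rows and cut it into training and test parts."""
    shuffled = rows[:]
    random.Random(seed).shuffle(shuffled)
    cut = int(round(len(shuffled) * train_fraction))
    return shuffled[:cut], shuffled[cut:]


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = sum((x - mx) ** 2 for x in xs) ** 0.5
    sy = sum((y - my) ** 2 for y in ys) ** 0.5
    return cov / (sx * sy)


def correlation_matrix(columns):
    """Pairwise correlations for a dict mapping column name to values."""
    names = list(columns)
    return {a: {b: pearson(columns[a], columns[b]) for b in names}
            for a in names}


rows = list(range(10))
train_rows, test_rows = train_test_split(rows)

# Hypothetical independent variables.
columns = {"age": [25, 35, 45, 55], "dependents": [0, 2, 1, 3]}
corr = correlation_matrix(columns)
print(len(train_rows), len(test_rows), corr["age"]["dependents"])
```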

Logistic Regression

Logistic Regression

• Ran Logistic Regression separately for each variable.
• Computed the ROC curve for each variable and compared the AUC values.
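A single-variable logistic regression can be sketched as below. The slides do not describe the fitting procedure or software, so this example uses plain batch gradient descent on the log-loss as a stand-in. The data, learning rate, and number of epochs are hypothetical.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def fit_logistic_1d(xs, ys, learning_rate=0.1, epochs=5000):
    """Fit P(y=1 | x) = sigmoid(b0 + b1 * x) by batch gradient descent."""
    b0 = b1 = 0.0
    n = len(xs)
    for _ in range(epochs):
        # Gradient of the mean log-loss: (predicted - actual) per record.
        errors = [sigmoid(b0 + b1 * x) - y for x, y in zip(xs, ys)]
        b0 -= learning_rate * sum(errors) / n
        b1 -= learning_rate * sum(e * x for e, x in zip(errors, xs)) / n
    return b0, b1


# Hypothetical predictor (e.g. number of late payments) and default flag.
times_late = [0, 0, 1, 1, 2, 2, 3, 3]
defaulted = [0, 0, 0, 1, 0, 1, 1, 1]
b0, b1 = fit_logistic_1d(times_late, defaulted)
p_low = sigmoid(b0 + b1 * 0)
p_high = sigmoid(b0 + b1 * 3)
print(b0, b1, p_low, p_high)
```

Repeating the fit once per variable and scoring each fit with its AUC gives a simple ranking of individual predictors.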

Stepwise Selection

• The overall model was significant.
• All the variables were included in the model.
• The model built on the Training data was tested on the Test data.
• A probability of default > 0.7 was coded as 1, and a probability of default < 0.7 was coded as 0.
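The coding rule can be expressed as a one-line function. The slides do not say how a probability of exactly 0.7 was handled; this sketch assumes it is coded as 0. The function name and the sample probabilities are hypothetical.

```python
def code_default(probability, cutoff=0.7):
    """Return 1 if the predicted probability exceeds the cutoff, else 0.

    Assumption: a probability exactly equal to the cutoff is coded as 0.
    """
    return 1 if probability > cutoff else 0


# Hypothetical predicted probabilities of default.
predicted = [code_default(p) for p in [0.05, 0.69, 0.71, 0.95]]
print(predicted)
```

A cutoff as high as 0.7 flags only the most confident predictions as defaults, which tends to lower the true positive rate and raise the true negative rate.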

Logistic Regression on Test Data

Overall Accuracy = (41374 + 291) / (41374 + 291 + 175 + 2661) = 93.6%

True Positive Rate = TP / (TP + FN) = 9.85%

True Negative Rate = TN / (TN + FP) = 99.5%

[Confusion matrix: predicted values vs. actual values]
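The three rates can be recomputed from the four counts in the accuracy formula. The assignment of 175 to false positives and 2661 to false negatives is inferred here from the reported true positive rate, since the confusion matrix itself did not survive extraction. The function name is hypothetical.

```python
def classification_rates(tp, tn, fp, fn):
    """Accuracy, true positive rate and true negative rate from counts."""
    total = tp + tn + fp + fn
    return {
        "accuracy": (tp + tn) / total,
        "true_positive_rate": tp / (tp + fn),
        "true_negative_rate": tn / (tn + fp),
    }


# Counts taken from the accuracy formula; the FP/FN split is inferred.
lr_rates = classification_rates(tp=291, tn=41374, fp=175, fn=2661)
for name, value in lr_rates.items():
    print(f"{name}: {value:.4f}")
```

The computed values agree with the figures quoted on the slide up to rounding in the last digit.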

ROC curve for Test Data

• AUC Value = 0.8557
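For reference, the AUC can be read as the probability that a randomly chosen defaulter receives a higher score than a randomly chosen non-defaulter. The sketch below computes it directly from that definition on a small hypothetical sample; it does not reproduce the 0.8557 reported above, which requires the original test data.

```python
def auc_score(labels, scores):
    """AUC as the share of (positive, negative) pairs ranked correctly.

    Ties between a positive and a negative score count as half a win.
    """
    positives = [s for y, s in zip(labels, scores) if y == 1]
    negatives = [s for y, s in zip(labels, scores) if y == 0]
    wins = 0.0
    for p in positives:
        for n in negatives:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(positives) * len(negatives))


# Hypothetical labels (1 = default) and predicted probabilities.
labels = [0, 0, 1, 1]
scores = [0.1, 0.4, 0.35, 0.8]
auc = auc_score(labels, scores)
print(auc)
```

The pairwise loop is quadratic in the sample size, so a rank-based formula would be a better fit for a test set of this size.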

Discriminant Analysis

Discriminant Analysis

Overall accuracy = (38134 + 1717) / Total = 89.5%

True Positive Rate = TP / (TP + FN) = 58%

True Negative Rate = TN / (TN + FP) = 91.7%

            Predicted 0   Predicted 1
Actual 0    38134         3415
Actual 1    1235          1717

Target variable: Serious Delinquency in two years
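To illustrate how linear discriminant analysis assigns a class, here is a minimal two-class sketch on a single feature, using class means, class priors, and a pooled variance. The slides applied LDA to all variables with unspecified software, so the one-dimensional setup, function names, and data here are assumptions.

```python
import math


def fit_lda_1d(xs, ys):
    """Estimate class means, class priors and the pooled variance."""
    groups = {k: [x for x, y in zip(xs, ys) if y == k] for k in (0, 1)}
    means = {k: sum(v) / len(v) for k, v in groups.items()}
    priors = {k: len(v) / len(xs) for k, v in groups.items()}
    pooled_var = sum((x - means[k]) ** 2
                     for k, v in groups.items() for x in v) / (len(xs) - 2)
    return means, priors, pooled_var


def predict_lda_1d(x, means, priors, pooled_var):
    """Pick the class with the larger linear discriminant score."""
    def score(k):
        return (x * means[k] / pooled_var
                - means[k] ** 2 / (2 * pooled_var)
                + math.log(priors[k]))
    return 1 if score(1) > score(0) else 0


# Hypothetical feature values and class labels (1 = default).
xs = [1.0, 2.0, 3.0, 6.0, 7.0, 8.0]
ys = [0, 0, 0, 1, 1, 1]
means, priors, pooled_var = fit_lda_1d(xs, ys)
predictions = [predict_lda_1d(x, means, priors, pooled_var) for x in (3.0, 6.0)]
print(means, priors, pooled_var, predictions)
```

LDA assumes that each class is normally distributed with a shared variance, which is why the normality of the variables matters for this model.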

Comparison of Models

Linear Discriminant Analysis

Overall accuracy = 89.5%

            Predicted 0   Predicted 1
Actual 0    38134         3415
Actual 1    1235          1717

Target variable: Serious Delinquency in two years

Logistic Regression

Overall Accuracy = 93.6 %

Normality of variables

Factor Analysis

Factor Analysis

Factor Pattern

Group      Variable                          Factor1   Factor2   Factor3   Factor4

Factor 1   NumberOfTimes90DaysLate           0.54684   0.28062   0.26286   -0.0429
Factor 1   NumberOfTime60_89DaysPastDueNot   0.50016   0.3943    0.37949   -0.0015
Factor 1   RevolvingUtilizationOfUnsecured   0.60945   0.24942   -0.1861   -0.0285

Factor 2   NumberOfOpenCreditLinesAndLoans   -0.5203   0.5275    0.1922    0.15051
Factor 2   NumberRealEstateLoansOrLines      -0.4698   0.61529   -0.0292   0.09694
Factor 2   NumberOfDependents_num            0.03058   0.46357   -0.6034   -0.008
Factor 2   Monthlyincome_debt                -0.4298   0.5044    -0.09     -0.1628
Factor 2   NumberOfTime30_59DaysPastDueNot   0.40861   0.49901   0.31943   0.05977

Factor 3   age                               -0.4301   -0.1476   0.65733   -0.0396

Factor 4   DebtRatio                         0.05584   -0.0712   -0.0331   0.97112
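The grouping of variables into factors can be sketched from the loadings above. The slides do not state the assignment rule; this sketch assumes each variable goes to the factor with its largest signed loading, which reproduces the four groups in the table. Using the largest absolute loading instead would move NumberOfDependents_num to Factor 3.

```python
# Loadings copied from the factor pattern table (Factor1..Factor4).
loadings = {
    "NumberOfTimes90DaysLate": [0.54684, 0.28062, 0.26286, -0.0429],
    "NumberOfTime60_89DaysPastDueNot": [0.50016, 0.3943, 0.37949, -0.0015],
    "RevolvingUtilizationOfUnsecured": [0.60945, 0.24942, -0.1861, -0.0285],
    "NumberOfOpenCreditLinesAndLoans": [-0.5203, 0.5275, 0.1922, 0.15051],
    "NumberRealEstateLoansOrLines": [-0.4698, 0.61529, -0.0292, 0.09694],
    "NumberOfDependents_num": [0.03058, 0.46357, -0.6034, -0.008],
    "Monthlyincome_debt": [-0.4298, 0.5044, -0.09, -0.1628],
    "NumberOfTime30_59DaysPastDueNot": [0.40861, 0.49901, 0.31943, 0.05977],
    "age": [-0.4301, -0.1476, 0.65733, -0.0396],
    "DebtRatio": [0.05584, -0.0712, -0.0331, 0.97112],
}


def dominant_factor(row):
    """1-based index of the factor with the largest signed loading."""
    return 1 + max(range(len(row)), key=lambda i: row[i])


groups = {}
for variable, row in loadings.items():
    groups.setdefault(f"Factor{dominant_factor(row)}", []).append(variable)

for factor in sorted(groups):
    print(factor, groups[factor])
```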

Conclusion

• 80% of the time was spent on data cleaning.

• Logistic Regression gives better results than LDA when the data are not normally distributed.

• Factors can be grouped for a logical understanding, with Debt Ratio and age explaining high variance.

Thank you

Date post:	15-Apr-2017
Category:	Data & Analytics
Upload:	arsalan-qadri
