Date post: | 15-Apr-2017 |
Category: | Data & Analytics |
Upload: | arsalan-qadri |
Estimating the probability of default: Credit Risk
Mohamed Arsalan Qadri, Sarvesh Saurabh, Mohit Ravi
Summary
• Credit risk – The probability of default
• Data Cleansing
• Logistic Regression
• Linear Discriminant Analysis
• Comparison of LR and LDA
• Factor Analysis
Credit Risk
What is it?
• The risk of default on a debt that may arise from a borrower failing to make required payments.
Impact on the lender?
• Lost principal and interest, disruption to cash flows, and increased collection costs.
How to estimate it?
• Credit risk arises from the potential that a borrower or counterparty will fail to perform on an obligation.
Sources of risk?
• For most banks, loans are the largest and most obvious source of credit risk.
• There are other sources of credit risk both on and off the balance sheet, including letters of credit, unfunded loan commitments, and lines of credit.
• Other products, activities, and services that expose a bank to credit risk are credit derivatives, foreign exchange, and cash management services.
Credit Scoring vs Risk
Estimation of risk?
• The risk posed by the borrower is inversely related to the credit score.
• A credit score is a statistically derived numeric expression of a person's creditworthiness that lenders use to assess the likelihood that the person will repay his or her debts.
• A credit score is based on, among other things, a person's past credit history, and ranges from 300 to 850.
Credit Scoring
• Consumers can typically keep their credit scores high by maintaining a long history of always paying their bills on time and not having too much debt.
• A FICO score is the most widely used credit scoring system.
• A credit score is primarily based on credit report information, typically sourced from credit bureaus.
Data Cleaning
• Serious Delinquency in two years (make a pie chart for this)
• Revolving Utilization Of Unsecured Lines
• Age
• Number Of Time 30-59 Days Past Due Not Worse
• Number Of Time 60-89 Days Past Due Not Worse
• Number Of Times 90 Days Late
• Monthly Income: missing values replaced with the mean
• Monthly Income: ran Multiple Linear Regression on the missing values
• Monthly Income: the histogram after running Multiple Linear Regression on the missing values
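A minimal sketch of regression-based imputation, assuming the records are plain dictionaries and that missing income is stored as None. The slides do not say which predictors were used, so the predictor list is passed in by the caller; the field name MonthlyIncome and the helper names below are assumptions for illustration.

```python
def solve(A, b):
    """Solve the linear system A x = b by Gauss-Jordan elimination."""
    n = len(A)
    M = [row[:] + [bi] for row, bi in zip(A, b)]
    for col in range(n):
        # Partial pivoting: move the largest entry of this column to the diagonal.
        pivot = max(range(col, n), key=lambda r: abs(M[r][col]))
        M[col], M[pivot] = M[pivot], M[col]
        for r in range(n):
            if r != col:
                f = M[r][col] / M[col][col]
                M[r] = [a - f * c for a, c in zip(M[r], M[col])]
    return [M[i][n] / M[i][i] for i in range(n)]


def fit_ols(X, y):
    """Ordinary least squares with an intercept, via the normal equations."""
    Z = [[1.0] + list(row) for row in X]
    k = len(Z[0])
    XtX = [[sum(z[i] * z[j] for z in Z) for j in range(k)] for i in range(k)]
    Xty = [sum(z[i] * t for z, t in zip(Z, y)) for i in range(k)]
    return solve(XtX, Xty)  # [intercept, coef_1, ..., coef_k]


def impute_income(records, predictors, target="MonthlyIncome"):
    """Return copies of the records with missing target values predicted."""
    known = [r for r in records if r[target] is not None]
    beta = fit_ols([[r[p] for p in predictors] for r in known],
                   [r[target] for r in known])
    filled = []
    for r in records:
        r = dict(r)  # do not modify the caller's record
        if r[target] is None:
            r[target] = beta[0] + sum(b * r[p] for b, p in zip(beta[1:], predictors))
        filled.append(r)
    return filled
```

The model is fitted only on records with a known income and then applied to the rest, so the observed values are left unchanged.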
• Debt Ratio
• We found that the Debt Ratio was extremely high in many cases.
• Upon closer inspection, we found that the high debt ratios occurred in records whose Monthly Income was unknown.
• From this we inferred that, for these records, the Debt Ratio field most probably holds the absolute debt.
• We replaced the high values of debt ratio by dividing them by the predicted values of the monthly income.
• The new mean after replacement was 0.67.
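The correction above can be sketched as follows. This assumes each record is a dictionary with DebtRatio and MonthlyIncome fields (None when income is unknown) and that predicted incomes come in a parallel list; these names and the record layout are illustrative, not taken from the slides.

```python
def correct_debt_ratio(records, predicted_income):
    """Rescale DebtRatio for records whose income was originally missing.

    For those records the field is assumed to hold the absolute debt,
    so it is divided by the predicted monthly income.
    """
    fixed = []
    for rec, income in zip(records, predicted_income):
        rec = dict(rec)  # keep the original record untouched
        if rec["MonthlyIncome"] is None and income > 0:
            rec["DebtRatio"] = rec["DebtRatio"] / income
        fixed.append(rec)
    return fixed


def mean_debt_ratio(records):
    """Mean of the DebtRatio field, for checking the effect of the fix."""
    return sum(r["DebtRatio"] for r in records) / len(records)
```

Records with a known income are passed through unchanged, so only the suspicious values are rescaled.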
• Number of Dependents
Data Modelling
• Split the dataset into training data (70%) and test data (30%).
• Computed the correlation matrix among the independent variables.
• The variables had very little correlation amongst themselves.
• Ran Logistic Regression using stepwise selection.
• Ran Linear Discriminant Analysis.
• Compared both models by measuring their prediction accuracy.
• Ran both models on the significant factors obtained from Factor Analysis.
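The first two steps can be sketched in plain Python. This is an illustration only: the slides do not say how the split was randomised, so a seeded shuffle is assumed, and the function names are hypothetical.

```python
import math
import random


def train_test_split(rows, test_fraction=0.3, seed=0):
    """Shuffle the rows and split them into (train, test)."""
    rows = list(rows)
    random.Random(seed).shuffle(rows)
    n_test = round(len(rows) * test_fraction)
    return rows[n_test:], rows[:n_test]


def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)


def correlation_matrix(columns):
    """columns: dict of name -> list of values. Returns nested dict of correlations."""
    return {a: {b: pearson(columns[a], columns[b]) for b in columns}
            for a in columns}
```

Off-diagonal entries close to zero correspond to the low correlation among the independent variables noted above.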
Logistic Regression
• Ran Logistic Regression separately for each variable.
• Computed the ROC curve for each variable and compared the AUC value.
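As an illustration of the per-variable comparison, AUC can be computed directly from scores as the probability that a randomly chosen defaulter is scored higher than a randomly chosen non-defaulter. The sketch below assumes the scores are either the fitted probabilities of a single-variable model or the raw variable itself; the function names are hypothetical, and the pairwise loop is quadratic, so it suits small samples only.

```python
def auc(scores, labels):
    """Area under the ROC curve via pairwise comparison (ties count 0.5)."""
    pos = [s for s, l in zip(scores, labels) if l == 1]
    neg = [s for s, l in zip(scores, labels) if l == 0]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0
               for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


def auc_per_variable(columns, labels):
    """columns: dict of variable name -> list of scores for that variable."""
    return {name: auc(values, labels) for name, values in columns.items()}
```

An AUC near 0.5 means the variable alone barely separates the classes; a value below 0.5 means higher values go with fewer defaults.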
Stepwise Selection
• Overall Model was Significant.
• All the variables were included in the model.
• The model built on the Training data was tested on the Test data.
• Probability of default > 0.7 was coded as 1, and Probability of default < 0.7 was coded as 0.
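A minimal sketch of fitting a logistic regression and applying the 0.7 cutoff. It uses plain batch gradient descent and does not perform stepwise selection, so it is not the procedure used in the slides; the learning rate, epoch count, and function names are assumptions.

```python
import math


def sigmoid(z):
    """Numerically stable logistic function."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


def fit_logistic(X, y, lr=0.1, epochs=2000):
    """Fit weights and intercept by batch gradient descent on the log-loss."""
    k, n = len(X[0]), len(X)
    w, b = [0.0] * k, 0.0
    for _ in range(epochs):
        gw, gb = [0.0] * k, 0.0
        for x, t in zip(X, y):
            err = sigmoid(b + sum(wi * xi for wi, xi in zip(w, x))) - t
            gb += err
            for j in range(k):
                gw[j] += err * x[j]
        b -= lr * gb / n
        w = [wi - lr * g / n for wi, g in zip(w, gw)]
    return w, b


def predict_proba(w, b, x):
    """Estimated probability of default for one record."""
    return sigmoid(b + sum(wi * xi for wi, xi in zip(w, x)))


def classify(w, b, x, threshold=0.7):
    """Code the record as 1 when the probability exceeds the cutoff, else 0."""
    return 1 if predict_proba(w, b, x) > threshold else 0
```

A cutoff above 0.5 makes the model more reluctant to flag a default, which lowers false positives at the cost of missing more true defaults.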
Logistic Regression on Test Data
Overall Accuracy = (41374+291)/(41374+291+175+2661) = 93.6%
True Positive Rate = TP / (TP+FN) = 9.85%
True Negative Rate = TN / (TN+FP) = 99.5%
Confusion Matrix (figure: Predicted Values vs. Actual Values)
ROC curve for Test Data
• AUC Value = 0.8557
Discriminant Analysis
Overall accuracy = (38134+1717)/Total = 89.5%
True Positive Rate = TP / (TP+FN) = 58%
True Negative Rate = TN / (TN+FP) = 91.7%
Confusion matrix for Serious Delinquency:
            Predicted 0    Predicted 1
Actual 0    38134          3415
Actual 1    1235           1717
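For reference, a two-class linear discriminant can be sketched in a few lines. This illustration is restricted to two features so that the pooled covariance matrix can be inverted by hand; the slides used more variables and presumably a statistics package, so the function names and data layout here are assumptions.

```python
import math


def fit_lda_2d(X, y):
    """Two-class LDA on 2-feature data. Returns (weights, intercept)."""
    groups = {c: [x for x, t in zip(X, y) if t == c] for c in (0, 1)}
    mu = {c: [sum(x[j] for x in g) / len(g) for j in range(2)]
          for c, g in groups.items()}
    # Pooled within-class covariance matrix.
    s = [[0.0, 0.0], [0.0, 0.0]]
    for c, g in groups.items():
        for x in g:
            d = [x[0] - mu[c][0], x[1] - mu[c][1]]
            for i in range(2):
                for j in range(2):
                    s[i][j] += d[i] * d[j]
    dof = len(X) - 2
    s = [[v / dof for v in row] for row in s]
    det = s[0][0] * s[1][1] - s[0][1] * s[1][0]
    inv = [[s[1][1] / det, -s[0][1] / det],
           [-s[1][0] / det, s[0][0] / det]]
    # Discriminant direction: inverse covariance times the mean difference.
    diff = [mu[1][0] - mu[0][0], mu[1][1] - mu[0][1]]
    w = [inv[0][0] * diff[0] + inv[0][1] * diff[1],
         inv[1][0] * diff[0] + inv[1][1] * diff[1]]
    mid = [(mu[0][j] + mu[1][j]) / 2 for j in range(2)]
    log_prior = math.log(len(groups[1]) / len(groups[0]))
    b = -(w[0] * mid[0] + w[1] * mid[1]) + log_prior
    return w, b


def predict_lda(w, b, x):
    """Assign class 1 when the discriminant score is positive."""
    return 1 if w[0] * x[0] + w[1] * x[1] + b > 0 else 0
```

LDA assumes both classes share one covariance matrix and are roughly normal, which is the assumption the later normality check relates to.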
Comparison of Models
Linear Discriminant Analysis
Overall accuracy = 89.5%
Confusion matrix for Serious Delinquency:
            Predicted 0    Predicted 1
Actual 0    38134          3415
Actual 1    1235           1717
Logistic Regression
Overall Accuracy = 93.6%
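The comparison can be reproduced from the confusion-matrix counts. The LDA counts are taken from its confusion matrix; the Logistic Regression counts (TN = 41374, FP = 175, FN = 2661, TP = 291) are inferred from the accuracy formula and the reported rates, so treat that assignment as an assumption.

```python
def rates(tn, fp, fn, tp):
    """Accuracy, true positive rate and true negative rate from raw counts."""
    total = tn + fp + fn + tp
    return {
        "accuracy": (tp + tn) / total,
        "tpr": tp / (tp + fn),
        "tnr": tn / (tn + fp),
    }


# Counts on the test data (Logistic Regression counts are inferred, see above).
lda = rates(tn=38134, fp=3415, fn=1235, tp=1717)
logistic = rates(tn=41374, fp=175, fn=2661, tp=291)

for name, m in (("LDA", lda), ("Logistic Regression", logistic)):
    print(f"{name}: accuracy={m['accuracy']:.3f} "
          f"TPR={m['tpr']:.3f} TNR={m['tnr']:.3f}")
```

Reporting all three rates side by side matters here because the two models differ far more in true positive rate (9.85% vs 58%) than in overall accuracy.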
Normality of variables
Factor Analysis
Factor Pattern
Group     Variable                         Factor1   Factor2   Factor3   Factor4
Factor 1  NumberOfTimes90DaysLate          0.54684   0.28062   0.26286   -0.0429
Factor 1  NumberOfTime60_89DaysPastDueNot  0.50016   0.3943    0.37949   -0.0015
Factor 1  RevolvingUtilizationOfUnsecured  0.60945   0.24942   -0.1861   -0.0285
Factor 2  NumberOfOpenCreditLinesAndLoans  -0.5203   0.5275    0.1922    0.15051
Factor 2  NumberRealEstateLoansOrLines     -0.4698   0.61529   -0.0292   0.09694
Factor 2  NumberOfDependents_num           0.03058   0.46357   -0.6034   -0.008
Factor 2  Monthlyincome_debt               -0.4298   0.5044    -0.09     -0.1628
Factor 2  NumberOfTime30_59DaysPastDueNot  0.40861   0.49901   0.31943   0.05977
Factor 3  age                              -0.4301   -0.1476   0.65733   -0.0396
Factor 4  DebtRatio                        0.05584   -0.0712   -0.0331   0.97112
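One way to read the grouping from the loadings, sketched under the assumption that each variable is assigned to the factor on which it has the largest signed loading. That rule reproduces the Factor 1 to Factor 4 groups in the table; using absolute loadings instead would move NumberOfDependents_num to Factor3 because of its -0.6034 loading. The function names are hypothetical.

```python
# Loadings copied from the Factor Pattern table (Factor1..Factor4).
LOADINGS = {
    "NumberOfTimes90DaysLate": [0.54684, 0.28062, 0.26286, -0.0429],
    "NumberOfTime60_89DaysPastDueNot": [0.50016, 0.3943, 0.37949, -0.0015],
    "RevolvingUtilizationOfUnsecured": [0.60945, 0.24942, -0.1861, -0.0285],
    "NumberOfOpenCreditLinesAndLoans": [-0.5203, 0.5275, 0.1922, 0.15051],
    "NumberRealEstateLoansOrLines": [-0.4698, 0.61529, -0.0292, 0.09694],
    "NumberOfDependents_num": [0.03058, 0.46357, -0.6034, -0.008],
    "Monthlyincome_debt": [-0.4298, 0.5044, -0.09, -0.1628],
    "NumberOfTime30_59DaysPastDueNot": [0.40861, 0.49901, 0.31943, 0.05977],
    "age": [-0.4301, -0.1476, 0.65733, -0.0396],
    "DebtRatio": [0.05584, -0.0712, -0.0331, 0.97112],
}


def group_by_largest_loading(loadings):
    """Assign each variable to the factor with its largest signed loading."""
    groups = {}
    for var, row in loadings.items():
        best = max(range(len(row)), key=lambda j: row[j])
        groups.setdefault(f"Factor{best + 1}", []).append(var)
    return groups


def communality(row):
    """Share of a variable's variance explained by the retained factors."""
    return sum(v * v for v in row)
```

Calling `group_by_largest_loading(LOADINGS)` returns the four groups, and `communality(LOADINGS["DebtRatio"])` shows how much of that variable the four factors capture.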
Conclusion
• 80% of the time was spent on data cleaning
• Logistic Regression gives better results than LDA when the data is not normally distributed
• Factors can be grouped for a logical understanding, with Debt Ratio and age explaining high variance.
Thank you