Predictive Analytics, Predicting Likely Donors and Donation Amounts


Predicting Likely Donors and Donation Amounts

Predictive Analytics

Michele Vincent

March 22, 2017

Objectives

Goals

• Predict likely donors using classification models

• Predict how much likely donors will give using regression models

• Validate the predictive models by measuring how effective they are

Data Used for Training and Testing of Classification Models

Training Data
• Filename: cup98lrn variable subset small.txt
• # Records: 50% of 95,412 (47,706)
• Target Variable: TARGET_B
– Classification (Y/N decision, a donor or not)
– 5% responders

Testing Data
• Filename: cup98lrn variable subset small.txt
• # Records: 50% of 95,412 (47,706)
• Target Variable: TARGET_B
– Classification (Y/N decision, a donor or not)
– 5% responders

TARGET_B is a binary indicator for response to the 97NK mailing.

Data Used for Validation of Classification Models

Validation Data
• Filename: cup98val_variable_subset_small.csv
• # Records: 96,367
• Target Variable: TARGET_B
– Classification (Y/N decision, a donor or not)
– 5% responders

TARGET_B is a binary indicator for response to the 97NK mailing.

Classification Model: Accuracy

• Best Performing Algorithm

Stratified and equal size sampling were used for all models tested below.

Based on AUC, the best model is Logistic Regression, which generates the highest AUC. It correctly classified 58.7% of the file, which is not the highest accuracy but is among the highest. Its precision is the 2nd highest. The lift it provides at 10% and at 70% of the file is among the highest of all models tested.

The accuracy rate of the Logistic Regression model is 58.7%. This means we correctly classified 58.7% of the file. Out of every 100 records, we correctly classified about 58 as donors or non-donors.

Precision is the share of predicted donors who are actual donors. If 100 donors were predicted, but only 7 of them are actual donors, then the precision is 7%.

Recall is the share of actual donors that we predicted correctly. If there were 100 actual donors, and we predicted 58 of them correctly, then the recall rate is 58%. Sensitivity is another name for recall.

A false alarm means we thought a person was a donor, but he wasn’t. If there were 100 non-donors and we claimed 41 of them to be donors, then our false alarm rate is 41%. In this example, we would correctly claim that the other 59 are not donors, which means that the specificity is 59%.
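The definitions above can be sketched from confusion-matrix counts. This is a minimal illustration, not the KNIME scorer used in the project; the function name and the counts are hypothetical and simply reuse the round numbers from the examples (100 actual donors, 100 actual non-donors).

```python
def classification_metrics(tp, fp, tn, fn):
    """Compute the slide's metrics from confusion-matrix counts.

    tp: donors predicted as donors       fp: non-donors predicted as donors
    tn: non-donors predicted as non-donors  fn: donors predicted as non-donors
    """
    total = tp + fp + tn + fn
    return {
        "accuracy": (tp + tn) / total,     # share of all records classified correctly
        "precision": tp / (tp + fp),       # actual donors among predicted donors
        "recall": tp / (tp + fn),          # predicted donors among actual donors
        "false_alarm": fp / (fp + tn),     # non-donors wrongly flagged as donors
        "specificity": tn / (fp + tn),     # non-donors correctly left alone
    }


# Hypothetical counts: 58 of 100 donors found, 41 of 100 non-donors flagged.
metrics = classification_metrics(tp=58, fp=41, tn=59, fn=42)
for name, value in metrics.items():
    print(f"{name}: {value:.3f}")
```

Note that false alarm rate and specificity always sum to 1, which is why 41% false alarms implies 59% specificity.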

Classification Model: Accuracy (Cont’d)

• Logistic Regression

• ROC Curve for the winning model (Logistic Regression):

The ROC curve shows an area under the curve of 0.6110, which is the largest among the models tested.

This curve is also useful for knowing what true alarm rate we can get for a given accepted false alarm rate.

If we are willing to accept a 37% false alarm rate, we can get a true alarm rate of 55% (dotted line on the graph). This means that if we are lenient and allow the model to wrongly classify 37 out of every 100 non-donors as donors, then the model can correctly identify 55 out of every 100 actual donors.
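To make the trade-off concrete, here is a small sketch of how one point on a ROC curve and the area under it can be computed from predicted probabilities. The scores and labels are made-up toy values, and the function names are illustrative; the project itself produced these figures in KNIME.

```python
def roc_point(scores, labels, threshold):
    """Return (false_alarm_rate, true_alarm_rate) at one probability cutoff."""
    tp = sum(1 for s, y in zip(scores, labels) if s >= threshold and y == 1)
    fp = sum(1 for s, y in zip(scores, labels) if s >= threshold and y == 0)
    positives = sum(labels)
    negatives = len(labels) - positives
    return fp / negatives, tp / positives


def auc(scores, labels):
    """Area under the ROC curve: the chance that a random donor is scored
    higher than a random non-donor (ties count as half)."""
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    wins = sum(
        1.0 if p > n else (0.5 if p == n else 0.0)
        for p in pos
        for n in neg
    )
    return wins / (len(pos) * len(neg))


# Toy data (hypothetical): 1 = donor, 0 = non-donor.
scores = [0.9, 0.8, 0.7, 0.6, 0.55, 0.4, 0.3, 0.2]
labels = [1, 0, 1, 0, 0, 1, 0, 0]

far, tar = roc_point(scores, labels, threshold=0.5)
area = auc(scores, labels)
print(f"false alarm rate={far:.2f}, true alarm rate={tar:.2f}, AUC={area:.3f}")
```

Lowering the threshold moves along the curve: both the true alarm rate and the false alarm rate go up.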


• Lift Chart for the winning model (Logistic Regression)

The lift at 10% of the file is 1.815; at 70%, the cumulative lift is 1.12.

This means that if we mailed to the top 10% of the file, which contains the predictions with the highest probabilities, we would get about 1.8 times as many donors as we would from mailing the same number of people without a model (a lift of 1.0).

On the chart, the y-axis shows the lift (1.815) and the x-axis shows the percentage of the file (10).
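As a rough sketch of what the lift chart measures, cumulative lift at a given depth can be computed by ranking records by predicted probability and comparing the response rate in the top slice with the overall response rate. The data below is a hypothetical toy sample, not the project data, and the function name is illustrative.

```python
def cumulative_lift(scores, labels, fraction):
    """Response rate in the top `fraction` of the ranked file,
    divided by the response rate of the whole file."""
    ranked = sorted(zip(scores, labels), key=lambda pair: pair[0], reverse=True)
    top_n = max(1, round(len(ranked) * fraction))
    top_rate = sum(y for _, y in ranked[:top_n]) / top_n
    base_rate = sum(labels) / len(labels)
    return top_rate / base_rate


# Toy file of 10 records with 2 donors (hypothetical values).
scores = [0.95, 0.9, 0.8, 0.7, 0.6, 0.5, 0.4, 0.3, 0.2, 0.1]
labels = [1, 0, 1, 0, 0, 0, 0, 0, 0, 0]

lift_top20 = cumulative_lift(scores, labels, 0.2)
lift_full = cumulative_lift(scores, labels, 1.0)
print(f"lift at 20%: {lift_top20:.2f}, lift at 100%: {lift_full:.2f}")
```

Lift at 100% of the file is always 1.0, since mailing everyone is the same as using no model.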

• Histogram for the winning model (Logistic Regression)

The histogram of the distribution of predictions shows:
• 921 records are predicted with a probability to respond between 0.507 and 0.608
• 429 records are predicted with a probability to respond between 0.608 and 0.709

This means that if we are comfortable mailing only to those with a predicted probability greater than 0.6, we can expect at least 429 prospects with that probability to respond.


• Classification Accuracy Using Validation Data

We validated the results of our best model using a different set of data. The results are very close to the results previously discussed.


Classification Model: Interpretation

• Best Predictors

RFA_2F* was chosen by five models to be their top predictor for a donor. If we were to differentiate between a non-donor and a donor, RFA_2F is the best variable to use.

E_RFA_2A was chosen by three models as one of their top three predictors.

FISTDATE and NGIFTALL were each chosen by two models as one of their top three predictors.

* Frequency code for donor’s RFA status as of 97NK promotion date


Classification Model: Interpretation (Cont’d)

• Details of Best Predictors for Some of the Models Used:

Three predictors were common to the top predictors from the Logistic Regression, Neural Network and kNN (k = 101) models:

RFA_2F E_RFA_2A FISTDATE


Classification Model: Interpretation (Cont’d)

• Relationship of Predictors to Target Variable:

LASTGIFT’s P-value is not significant, so it is not an influential variable. It does not matter how much a donor gave last time; the amount does not help in predicting whether he will be a donor again.

FISTDATE and DOMAIN3 have a negative relationship with the target variable. The smaller (or less recent) the FISTDATE is, the more likely the person is to be a responder. The likely donor is someone who has not given recently, and is not from the lowest socio-economic status.

RFA_2F, D_RFA_2A, E_RFA_2A and DOMAIN1 have a positive relationship with the target variable. The bigger these variables are, the more likely it is that the outcome Donor=Y is true. The likely donor is someone who is a frequent giver, and comes from the highest socio-economic status.

D_RFA_2A has a higher coefficient than the other predictors, which means it has a larger influence on our prediction that someone is a donor. So, the donor’s RFA status as of the 97NK promotion is more influential than his RFA status as of the 96NK or 95NK promotions.
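The way the sign of a logistic regression coefficient moves the predicted probability can be sketched as follows. The intercept and coefficient values here are hypothetical, chosen only to show direction; they are not the fitted values from the model in this project.

```python
import math


def donor_probability(intercept, coefficients, features):
    """Logistic regression: P(Donor=Y) = 1 / (1 + exp(-(b0 + sum(b_i * x_i))))."""
    log_odds = intercept + sum(
        coefficients[name] * value for name, value in features.items()
    )
    return 1.0 / (1.0 + math.exp(-log_odds))


# Hypothetical coefficients: positive for RFA_2F, negative for DOMAIN3.
coefficients = {"RFA_2F": 0.30, "DOMAIN3": -0.20}
intercept = -3.0

infrequent = donor_probability(intercept, coefficients, {"RFA_2F": 1, "DOMAIN3": 0})
frequent = donor_probability(intercept, coefficients, {"RFA_2F": 4, "DOMAIN3": 0})
frequent_low_ses = donor_probability(intercept, coefficients, {"RFA_2F": 4, "DOMAIN3": 1})

print(f"infrequent giver: {infrequent:.3f}")
print(f"frequent giver:   {frequent:.3f}")
print(f"frequent giver, DOMAIN3=1: {frequent_low_ses:.3f}")
```

A larger value of a positively weighted variable raises the probability, while setting a negatively weighted indicator lowers it; a larger absolute coefficient produces a bigger shift per unit.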


Classification Model: Conclusion

• The donor’s frequency of giving is the most influential variable for distinguishing a donor from a non-donor.

• The likely donor is someone who has not given recently but is a frequent giver, and comes from the highest socio-economic status.

• If the results of the Logistic Regression model are implemented in the next campaign, and the model gives a 1.2 cumulative lift at 70% of the file, we can expect 4,200 responders out of 70,000 mailings. That is a response rate of 6% instead of the 5% we would get without a model, and we also save the cost of 30,000 mailings.
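The campaign arithmetic in the last bullet can be checked with a short sketch. The file size of 100,000 is an assumption implied by 70,000 mailings at 70% of the file plus 30,000 mailings saved; the function name is illustrative.

```python
def expected_campaign(file_size, base_rate, mailed_fraction, cumulative_lift):
    """Return (pieces mailed, expected response rate, expected responders)."""
    mailed = file_size * mailed_fraction
    response_rate = base_rate * cumulative_lift  # lift scales the base response rate
    return mailed, response_rate, mailed * response_rate


# Assumed file of 100,000; 5% base response rate; mail the top 70% at lift 1.2.
mailed, rate, responders = expected_campaign(100_000, 0.05, 0.70, 1.2)
print(f"mailed: {mailed:,.0f}")
print(f"response rate: {rate:.1%}")
print(f"expected responders: {responders:,.0f}")
```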

Data Used for Training and Testing of Regression Models

Training Data
• Filename: cup98lrn variable subset small responders.txt
• # Records: 50% of 4,873
• Target Variable: TARGET_D
– Average Continuous Value (donation amount)
– 5% responders

Testing Data
• Filename: cup98lrn variable subset small responders.txt
• # Records: 50% of 4,873
• Target Variable: TARGET_D
– Average Continuous Value (donation amount)
– 5% responders

TARGET_D is the donation amount associated with the response to the 97NK mailing.

Data Used for Validation of Regression Models

Validation Data
• Filename: cup98val_variable_subset_small.csv
• # Records: 96,367
• Target Variable: TARGET_D
– Average Continuous Value (donation amount)
– 5% responders

TARGET_D is the donation amount associated with the response to the 97NK mailing.


Regression Model: Accuracy

Using the original target variable, SVM wins: it has the highest R-squared value and also a smaller mean absolute error, mean squared error and root mean squared deviation than Linear Regression and Neural Net.

Using the transformed target variable, LR wins based on R-squared, although SVM is close.
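For reference, the four accuracy measures compared above can be computed as in the sketch below. The actual and predicted donation amounts are hypothetical toy values, and the function is an illustration rather than the KNIME Numeric Scorer output used in the project.

```python
import math


def regression_metrics(actual, predicted):
    """Mean absolute error, mean squared error, root mean squared
    deviation and R-squared for a set of predictions."""
    n = len(actual)
    errors = [a - p for a, p in zip(actual, predicted)]
    mae = sum(abs(e) for e in errors) / n
    mse = sum(e * e for e in errors) / n
    mean_actual = sum(actual) / n
    ss_total = sum((a - mean_actual) ** 2 for a in actual)
    r_squared = 1.0 - (mse * n) / ss_total  # 1 - SS_residual / SS_total
    return {"mae": mae, "mse": mse, "rmse": math.sqrt(mse), "r_squared": r_squared}


# Toy donation amounts in dollars (hypothetical).
actual = [10.0, 20.0, 15.0, 25.0]
predicted = [12.0, 18.0, 15.0, 21.0]

scores = regression_metrics(actual, predicted)
for name, value in scores.items():
    print(f"{name}: {value:.3f}")
```

A better model has a higher R-squared and lower values on the three error measures, which is the comparison the slide makes.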


• Best Performing Algorithm


Regression Model: Accuracy (Cont’d)


While there is not much difference between the models, SVM provided the highest lift, 2.329 in the top decile, with an overall average donation of $16 and a total dollar amount of $8,921 using the transformed target variable.

• Best Performing Algorithm

Table callouts: Highest Lift, Highest Total Amount, Average Donation, Top Decile


Regression Model: Accuracy (Cont’d)

• Regression Accuracy (Results on Validation Data)

Table callouts: Highest Lift, Highest Total Amount, Average Donation, Top Decile

Using the validation data of 96k responders and non-responders with the target variable transformed, SVM provided the highest lift of the models tested. The lift is 1.692 in the top decile, with an overall average donation of $0.79 and a total dollar amount of $12,330.

This tells us that if we mailed to the top decile, we can expect $12,330.
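A top-decile summary like the one described can be sketched by ranking records on the predicted donation and totalling the actual donations in the top 10%. Defining lift as the top-decile average divided by the overall average is an assumption here; the source does not say how KNIME computed it. The data is a hypothetical toy sample.

```python
def top_decile_summary(predicted, actual):
    """Rank by predicted amount, then summarise actual donations in the top 10%."""
    ranked = sorted(zip(predicted, actual), key=lambda pair: pair[0], reverse=True)
    n_top = max(1, len(ranked) // 10)
    top_total = sum(amount for _, amount in ranked[:n_top])
    top_average = top_total / n_top
    overall_average = sum(actual) / len(actual)
    return {
        "n": n_top,
        "total": top_total,
        "average": top_average,
        "overall_average": overall_average,
        "lift": top_average / overall_average,  # assumed definition of lift
    }


# Toy file of 20 people; most gave nothing (hypothetical values).
predicted = [float(20 - i) for i in range(20)]
actual = [25.0, 15.0] + [0.0] * 17 + [10.0]

summary = top_decile_summary(predicted, actual)
print(summary)
```

Because non-donors contribute $0, the overall average is low when the file includes non-responders, which is why the validation data shows a much smaller overall average donation than the responders-only test data.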


Regression Model: Interpretation

• Best Predictors from SVM:

Shown above is the output from KNIME from Linear Correlation to find best predictors. The best predictors have the highest prediction values. The output is sorted by prediction values. The top two variables given the highest prediction values are LASTGIFT (transformed into log10) with prediction probability of 0.971, and AVGGIFT with prediction probability of 0.644.

It makes sense that, to predict how much will be donated, it is important to consider first how much the donor’s last donation was and how much the donor has given on average so far.


Regression Model: Interpretation (Cont’d)

• Best Predictors from SVM Model:

The scatter plot of one of the best predictors from SVM, AVGGIFT, and the target variable, TARGET_D is shown on the left. It shows that there is a linear relationship between AVGGIFT and TARGET_D.

It shows that as the average gift of a donor increases, the donation amount that he will give also increases.


• Best Predictors from Linear Regression:

The best predictors chosen by Linear Regression based on significant P-values are LASTGIFT, RFA_2F, F_RFA_2A, E_RFA_2A, G_RFA_2A.

LASTGIFT has a positive coefficient which means it is positively related to the target variable. The larger the last gift of a donor, the larger the probability that he will give again. If he gave a lot before, he is likely to give again.

The RFA_2F is negatively related to the target variable. This means that the more frequently a donor gave before, the smaller his donation.

F_RFA_2A, E_RFA_2A, G_RFA_2A are positively related to the target variable. They add 0.086, 052 and 0.13, respectively, to the donation.


• Best Predictors from Linear Regression (Cont’d):

The scatter plot of one of the best predictors from Linear Regression, LASTGIFT, and the target variable, TARGET_D is shown on the left. It shows that there is a linear relationship between LASTGIFT and TARGET_D.

This confirms what we found earlier, that the larger the last gift of a donor, the larger the probability that he will give again. The scatter plot also adds to it: as the last gift of a donor increases, the donation amount that he will give also increases.

The scatter plot shows the appropriateness of the linearity of the regression function.


Regression Model: Conclusion

• The donor’s last gift amount and average gift amount are the most influential variables for determining how much a donor will donate again.

• If the results of the SVM model are implemented in the next campaign to the list of responders only from the previous campaign, and we know the model gives a cumulative lift of 2.329 in the top decile with overall average donation of $16, we can expect to get a total donation of $8,921. Without a model, the total donation is $3,822.

• If the results of the SVM model are implemented in the next campaign to all responders and non-responders from the previous campaign, and we know the model gives a cumulative lift of 1.692 in the top decile with overall average donation of $0.79, we can expect to get a total donation of $12,330.


Conclusion

• My recommendation is to use the Logistic Regression model if all we want is to tell a donor from a non-donor. If we are also interested in the amount of donation, my recommendation is to use the SVM model, although the Linear Regression model will work just as well.

• The validation dataset will provide the best estimate of the money the best model will generate because it contains people who did not donate, which is realistic. The validation data gives an overall average donation of $1.28 per person (from $12,330 / 9,637 people in the 1st decile). However, if we just want the highest average donation, the test data gives $36 per person (from $8,921 / 244 people in the 1st decile).


Appendix

• Independent variables (predictor variables) used in the models
