Date post: | 13-Feb-2017 |
Category: | Data & Analytics |
Upload: | ritu-sarkar |
Tayko: Smart Marketing Using Analytics
Business Problem
Tayko is a software catalog firm that sells games and educational software.
It wants to market a new collection using e-mail marketing. As a member of an industry consortium, it can pull 200,000 e-mail
addresses from the consortium's central repository. To maximize the benefit, Tayko wants to pull the records with a high
probability of response and a higher value of sale.
Analytics Problem
1. Create a classification model to group customers as responders or purchasers (1) and non-responders or non-purchasers (0).
2. Create a prediction model to predict the value of sale for each responder (1).
Data Collection
Supervised learning techniques are applied because the desired output is already defined.
A sample of 2000 customers is drawn from the central repository and a test e-mail campaign is run.
The 2 target variables, Purchased and Spending, are recorded for the sample.
The result showed 1000 purchasers and 1000 non-purchasers.
Data partitioning
The data set is partitioned into three sets:
Training: 60% (1200 records)
Testing: 20% (400 records)
Validation: 20% (400 records)
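A minimal sketch of such a 60/20/20 split is shown here, assuming the records are shuffled at random before being cut into three parts. The deck does not say how the partition was drawn, so the function name `partition`, the seed and the use of plain record ids are illustrative assumptions.

```python
import random


def partition(records, train_frac=0.6, test_frac=0.2, seed=42):
    """Shuffle records and cut them into training, testing and validation sets."""
    shuffled = list(records)
    # Fixed seed (an arbitrary choice) so the split is reproducible.
    random.Random(seed).shuffle(shuffled)
    n_train = round(len(shuffled) * train_frac)
    n_test = round(len(shuffled) * test_frac)
    train = shuffled[:n_train]
    test = shuffled[n_train:n_train + n_test]
    # Whatever remains after training and testing becomes the validation set.
    validation = shuffled[n_train + n_test:]
    return train, test, validation


# Stand-in for the 2000 sampled customers: record ids 0..1999.
train, test, validation = partition(range(2000))
print(len(train), len(test), len(validation))
```

With 2000 records this yields the 1200 / 400 / 400 split described above.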
Initial Study
What kinds of variables are present?
Finding the variables with strong differentiation power – Nominal Variables
Catalogs A, T, U and P show a high percentage of customers making a purchase.
Catalogs O and H show a high percentage of customers not making a purchase.
However, only catalogs A and U were used for more than 100 customers, catalog H for more than 50, and the rest for fewer than 50 each. The distribution of catalogs was not even.
Other Nominal Variables
Of the other categorical variables, “Order Online” is the only one that shows some power to differentiate between customers who purchased and those who did not.
Ordinal Variables
Number of purchases last year shows a good trend.
People who made no purchase last year have not made any purchase with the new catalogs either.
People who made more than 3 purchases last year made a purchase this time as well.
Scale Variables
Of the 2 scale variables, “Last update to customer record” shows a significant difference in mean between purchasers and non-purchasers.
Target Variables
Purchasers and non-purchasers are equally distributed. However, the sales value (the amount spent by a customer) follows a
non-normal distribution.
Classification
Who will make a purchase?
Logistic Regression – Training
Final set of variables
1. Frequency : Number of transactions in last year at source catalog
2. Web Order : Customer placed at least 1 order via web
3. Address is Residence : Address is a residence
4. Source_a, h or u : Source Catalog is A, U or H
Logistic Regression – Testing & Validation
Test Over-all accuracy : 80%
Validation Over-all accuracy : 77%
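Overall accuracy is taken here to mean the share of records whose predicted class matches the actual class. The sketch below illustrates that calculation. The labels are hypothetical and the function name `overall_accuracy` is an assumption, since the deck reports only the final percentages.

```python
def overall_accuracy(actual, predicted):
    """Share of records where the predicted class equals the actual class."""
    matches = sum(1 for a, p in zip(actual, predicted) if a == p)
    return matches / len(actual)


# Hypothetical labels (1 = purchaser, 0 = non-purchaser), for illustration only.
actual_labels = [1, 1, 0, 0, 1]
predicted_labels = [1, 0, 0, 0, 1]
accuracy = overall_accuracy(actual_labels, predicted_labels)
print(f"{accuracy:.0%}")
```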
Decision Tree – Training
CHAID growing method gave the best results
Decision Tree – Test & Validate
Test Over-all accuracy : 76%
Validation Over-all accuracy : 74%
Result
Logistic regression gives a better result than the decision tree (test accuracy 80% vs 76%, validation accuracy 77% vs 74%).
Prediction
How much will a purchaser spend?
New Calculated Variables
• High correlation between “last_update_days_ago” and “1st_update_days_ago”
• New calculated variable DayDiff, which is the difference of the 2 variables
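The DayDiff calculation can be sketched as follows. The deck does not state the sign of the difference, so first minus last is an assumption here, chosen so that the value is the number of days between the first and the latest record update. The customer rows are invented for illustration.

```python
def day_diff(first_update_days_ago, last_update_days_ago):
    # Assumed sign convention: first minus last, i.e. the number of days
    # between the first and the most recent update of the customer record.
    return first_update_days_ago - last_update_days_ago


# Hypothetical customer rows (values invented for illustration only).
customers = [
    {"1st_update_days_ago": 3662, "last_update_days_ago": 3662},
    {"1st_update_days_ago": 2900, "last_update_days_ago": 2135},
]
for row in customers:
    row["DayDiff"] = day_diff(row["1st_update_days_ago"], row["last_update_days_ago"])
    print(row["DayDiff"])
```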
Multiple Linear Regression
Pre-processing
Univariate analysis and transformation of Target Variable “Spend”
Outlier removal, Filtering and Transformation
Model & Performance
4 models are generated, one for each combination of address type and order channel:
Case 1 : Non-Residence Address & Not a Web-Order (R-sqr : 0.569 & Adj R-sqr : 0.566)
Spending = -15.733 + 79.11 * No of transactions last year - 47.825 * Catalog D + 30.632 * Catalog U
Case 2 : Non-Residence Address & Web-Order (R-sqr : 0.62 & Adj R-sqr : 0.616)
Spending = -42.285 + 115.976 * No of transactions last year + 45.506 * Catalog U - 247.655 * Catalog H + 55.605 * Catalog R
Case 3 : Residence Address & Not a Web-Order (R-sqr : 0.516 & Adj R-sqr : 0.507)
Spending = -26.965 + 69.218 * No of transactions last year + 66.219 * Catalog U - 113.587 * Catalog H
Case 4 : Residence Address & Web-Order (R-sqr : 0.612 & Adj R-sqr : 0.592)
Spending = -4.616 + 65.114 * No of transactions last year - 111.934 * Catalog H - 81.28 * Catalog R - 129.754 * Catalog C + 66.242 * Catalog A
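To show how the four equations are applied, here is a small sketch that picks the equation by segment and evaluates it for one customer. The coefficients are copied from the cases above. The names `MODELS` and `predict_spending` are illustrative, and the sketch assumes each catalog term is a 0/1 dummy for the customer's single source catalog. The deck does not say whether the output is on the original or the transformed “Spend” scale, so the result should be read accordingly.

```python
# Coefficients copied from the four reported equations, keyed by
# (address_is_residence, web_order). "freq" multiplies the number of
# transactions last year; single letters are source-catalog dummies.
MODELS = {
    (False, False): {"intercept": -15.733, "freq": 79.11, "D": -47.825, "U": 30.632},
    (False, True): {"intercept": -42.285, "freq": 115.976, "U": 45.506,
                    "H": -247.655, "R": 55.605},
    (True, False): {"intercept": -26.965, "freq": 69.218, "U": 66.219, "H": -113.587},
    (True, True): {"intercept": -4.616, "freq": 65.114, "H": -111.934,
                   "R": -81.28, "C": -129.754, "A": 66.242},
}


def predict_spending(address_is_residence, web_order, freq, source_catalog):
    """Evaluate the segment's equation for one customer."""
    coef = MODELS[(address_is_residence, web_order)]
    # A catalog that does not appear in the segment's equation adds nothing.
    catalog_effect = coef.get(source_catalog.upper(), 0.0)
    return coef["intercept"] + coef["freq"] * freq + catalog_effect


# Case 3 customer: residence address, no web order, 2 transactions, catalog U.
example = predict_spending(True, False, freq=2, source_catalog="U")
print(round(example, 3))
```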
MAD & MAPE
Training MAD : 68.89 MAPE : 103%
Test MAD : 104.53 MAPE : 109%
Validation MAD : 104.03 MAPE : 101%
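For reference, the two error measures can be computed as in the sketch below, assuming the usual definitions: MAD as the mean absolute difference between actual and predicted spending, and MAPE as the mean absolute difference relative to the actual value, in percent. The deck does not spell out its formulas, and the numbers used here are invented.

```python
def mad(actual, predicted):
    """Mean absolute deviation between actual and predicted values."""
    return sum(abs(a - p) for a, p in zip(actual, predicted)) / len(actual)


def mape(actual, predicted):
    """Mean absolute percentage error, in percent.

    Records with an actual value of 0 are skipped (an assumption),
    because the percentage error is undefined for them.
    """
    pairs = [(a, p) for a, p in zip(actual, predicted) if a != 0]
    return 100.0 * sum(abs(a - p) / abs(a) for a, p in pairs) / len(pairs)


# Invented spending figures, for illustration only.
actual_spend = [100.0, 200.0, 50.0]
predicted_spend = [110.0, 180.0, 80.0]
print(mad(actual_spend, predicted_spend))
print(mape(actual_spend, predicted_spend))
```

A MAPE above 100%, as reported for the models here, means the typical error is larger than the amount being predicted.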
Regression Tree – Exhaustive CHAID
MAD & MAPE
Training MAD : 105.37 MAPE : 95%
Test MAD : 121.54 MAPE : 103%
Validation MAD : 121.31 MAPE : 113%
Decision
Both models are very weak at predicting the amount spent: the error indicators (MAD and MAPE) are high. One major reason may be the lack of scale variables and the high
correlation between the scale variables that are available. Since most variables are nominal, converting the prediction
problem into a classification problem might produce better results, but that was out of scope for the given problem.
Conclusion
The classification of customers into purchasers and non-purchasers shows good results, and the selected logistic regression model is expected to perform well in a live situation too.
However, the prediction models show weak performance, and a high degree of error is expected if they are used in their current state.