Date post: | 29-Dec-2015 |
Upload: | brendan-snow |
2
Assignment 7
• An acquisition campaign with no targeting was conducted in January. The available information is as follows:
– Mail files containing name and address
– Responder files containing name and address
– 2001 Stats Can Census data available at the enumeration-area level
– A conversion table which maps enumeration areas to postal codes
• How would you use the above information to better target prospects to become new customers?
• Describe how the analytical file would be created. (Answered in last class; see the work notes.)
3
Types of Predictive Models-Assignment 7
• You have been asked to create programs that better target existing customers for insurance products. You have the following info:
Customer File        Transaction File
Account ID           Account ID
Postal code          Amount
Start date           Date
Birth date           Type
Income
Household size
Behaviour score
• What would you do and how would you create the analytical file?
• Answered in last lecture's study notes
4
Types of Predictive Models
• You have been asked to target customers who will not only purchase insurance but will also purchase the largest premiums.
• What type of model would be built here?
• Answered in last lecture's study notes
5
• Geocoding is the process that assigns a latitude-longitude coordinate to an address. Once a latitude-longitude coordinate is assigned, the address can be displayed on a map or used in a spatial search.
• Data miners often use these coordinates to calculate such things as “distance to the nearest store”
Creating the Analytical File- Geo-Coding
6
Demographic Analysis
[Figure: GeoProfile diagram with panels Population Count, Age Distribution, Average Age, and Store Location]
7
Creating the Analytical File-What is Geocoding?
• Let's look at a sample of what some data might look like.
Postal Code   Latitude   Longitude
A1A5A2        5          10
B5V1A2        7          20
M6B2A2        10         30
T4B1A2        6          40
V4H2B5        11         50
• How do we use this data to create meaningful variables?
– Use the latitude and longitude values with the Pythagorean theorem to calculate the distance between two postal codes.
– Ex: distance between A1A5A2 and B5V1A2 = square root of [(7-5)**2 + (20-10)**2] = square root of 104 ≈ 10.20 degrees
– This number then has to be converted to kilometres or miles.
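The distance calculation above can be sketched in a few lines of Python. This is only an illustration: the coordinates are the toy values from the slide, and the `haversine_km` helper is an assumption about how the conversion to kilometres might be done (great-circle distance with an Earth radius of 6371 km), not something the slides prescribe.

```python
import math

# Toy coordinates from the slide (not real postal-code centroids).
coords = {
    "A1A5A2": (5, 10),
    "B5V1A2": (7, 20),
    "M6B2A2": (10, 30),
    "T4B1A2": (6, 40),
    "V4H2B5": (11, 50),
}

def degree_distance(code_a, code_b):
    """Pythagorean distance, in degrees, between two postal codes."""
    (lat1, lon1), (lat2, lon2) = coords[code_a], coords[code_b]
    return math.hypot(lat2 - lat1, lon2 - lon1)

def haversine_km(lat1, lon1, lat2, lon2, radius_km=6371.0):
    """Great-circle distance in km (assumed conversion method)."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlmb = math.radians(lon2 - lon1)
    h = math.sin(dphi / 2) ** 2 + math.cos(phi1) * math.cos(phi2) * math.sin(dlmb / 2) ** 2
    return 2 * radius_km * math.asin(math.sqrt(h))

print(degree_distance("A1A5A2", "B5V1A2"))           # distance in degrees
print(haversine_km(*coords["A1A5A2"], *coords["B5V1A2"]))  # distance in km
```

A variable such as "distance to the nearest store" would then be the minimum of this distance over all store locations for each customer.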
8
Creating the Analytical File-What is Geocoding
• Example:
– A retailer has the following information:
• Name and address of its customers
• Address of its stores
• Stats Can information
– As a marketer, how would you intelligently use this information?
9
Correlation Coefficient
The correlation coefficient measures the strength and direction of the linear relationship between a given variable and the response variable. Its square is the share of the variation in the response that is explained by that variable.
Gender vs. Response        Household Size vs. Response
Gender    Response         Household Size    Response
Male      1                1                 0
Female    0                2                 1
Male      1                3                 0
Female    0                4                 1
Male      1                5                 0
Female    0                6+                1
11
Correlation Analysis
• The male gender variable has a perfect correlation of +1.
• The female gender variable has a perfect correlation of -1.
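• Household size shows no consistent trend with response (response simply alternates between 0 and 1 as household size increases), so its correlation coefficient is weak: about 0.29 on these six rows if 6+ is coded as 6, far from +1 or -1.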
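As a rough check, the Pearson correlation for the two small tables can be computed directly. The sketch below uses the six rows shown on the Correlation Coefficient slide; coding "6+" as 6 and coding gender as 0/1 indicator variables are assumptions made for illustration.

```python
import math

def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length lists."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)

# Gender vs. response (Male, Female, Male, Female, Male, Female)
is_male = [1, 0, 1, 0, 1, 0]
is_female = [0, 1, 0, 1, 0, 1]
response_gender = [1, 0, 1, 0, 1, 0]

# Household size vs. response; "6+" coded as 6 (assumption)
household_size = [1, 2, 3, 4, 5, 6]
response_household = [0, 1, 0, 1, 0, 1]

print(pearson(is_male, response_gender))             # male indicator
print(pearson(is_female, response_gender))           # female indicator
print(pearson(household_size, response_household))   # household size
```

On this toy data the gender indicators are perfectly correlated with response, while household size gives only a weak coefficient (roughly 0.29) because response alternates rather than trending with size.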
12
Correlation Results
• Show the level of confidence that a given variable is related to the modelled behaviour, i.e. response
Example of Output:
                          Age       Tenure   # of Products Purchased
Correlation coefficient   -0.0673   0.055    0.045
Confidence level          99.5%     98%      97%
Example of Output:
                          # of Promotions Since Last Purchase   Income    Household Size
Correlation coefficient   -0.031                                -0.0045   0.001
Confidence level          96%                                   50%       20%
13
Examples-Correlation-Response Model
• Listed below is an example of a correlation matrix
Variable                           Correlation Coeff.   Stat. Sign.
Income                              0.50                99%
Age                                -0.45                95%
Product Spend in last 12 months    -0.48                97%
Live in Quebec                     -0.01                15%
Tenure                             -0.47                96%
# in household                      0.02                20%
# of months since last promoted    -0.55                99%
# of months since last purchase     0.02                20%
Pay with credit card                0.49                98%
Gender is male                     -0.46                95%
Answer the following:
• Is each variable relevant?
• What is the relationship or impact of each variable with response?
• What is the strongest variable and what is the weakest variable?
Answers:
– Income: relevant, positive impact
– Age: relevant, negative impact
– Product Spend in last 12 months: relevant, negative impact
– Live in Quebec: not relevant, negative impact
– Tenure: relevant, negative impact
– # in household: not relevant, positive impact
– # of months since last promoted: relevant, negative impact
– # of months since last purchase: not relevant, positive impact
– Pay with credit card: relevant, positive impact
– Gender is male: relevant, negative impact
Strongest variable: # of months since last promoted
Weakest variable: Live in Quebec
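The reading of the correlation matrix above can be expressed as a small script. This is a sketch only: the 95% cut-off for calling a variable "relevant" is an assumption inferred from the answers on the slide, and the strongest and weakest variables are taken to be those with the largest and smallest absolute coefficient.

```python
# (correlation coefficient, statistical significance in %) from the slide
matrix = {
    "Income": (0.50, 99),
    "Age": (-0.45, 95),
    "Product Spend in last 12 months": (-0.48, 97),
    "Live in Quebec": (-0.01, 15),
    "Tenure": (-0.47, 96),
    "# in household": (0.02, 20),
    "# of months since last promoted": (-0.55, 99),
    "# of months since last purchase": (0.02, 20),
    "Pay with credit card": (0.49, 98),
    "Gender is male": (-0.46, 95),
}

SIGNIFICANCE_CUTOFF = 95  # assumed threshold for "relevant"

def describe(coeff, significance):
    """Return (relevance, direction of impact) for one variable."""
    relevance = "relevant" if significance >= SIGNIFICANCE_CUTOFF else "not relevant"
    direction = "positive" if coeff > 0 else "negative"
    return relevance, direction

for name, (coeff, sig) in matrix.items():
    print(name, *describe(coeff, sig))

strongest = max(matrix, key=lambda v: abs(matrix[v][0]))
weakest = min(matrix, key=lambda v: abs(matrix[v][0]))
print("Strongest:", strongest)
print("Weakest:", weakest)
```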
14
Exploratory Data Analysis Reports(EDA)
• After looking at the correlation reports, we also need to create EDA reports which help to better understand the relationship of a given variable with the desired marketing behaviour.
• It helps the business people and marketers to get inside the so-called black box of modelling.
15
Exploratory Data Analysis Reports(EDA)
Income           # of Observations   Response Rate
0-20000          10000               1%
20000-40000      10000               1.50%
40000-60000      10000               2%
60000-80000      10000               3%
80000+           10000               4%
Total / Average  50000               2.30%

Age              # of Observations   Response Rate
0-20             10000               3.50%
20-40            10000               2.89%
40-60            10000               2.25%
60-70            10000               1.65%
70+              10000               1.25%
Average                              2.30%
16
Exploratory Data Analysis Reports(EDA)
• Let's take a look at an example of a binary variable
Male             # of Observations   Response Rate
Yes              50000               2.00%
No               50000               2.60%
Total / Average  100000              2.30%
On the next page are some examples of EDA reports of variables that are not statistically significant according to the correlation matrix.
17
Exploratory Data Analysis Reports(EDA)
• EDAs of variables that are not statistically significant
# of months since last purchase   # of Observations   Response Rate
0-2 months                        10000               2.15%
3-6 months                        10000               2.10%
6-12 months                       10000               2.45%
12-18 months                      10000               2.40%
18+ months                        10000               2.35%
Total / Average                   50000               2.30%

Live in Quebec                    # of Observations   Response Rate
Yes                               10000               2.02%
No                                40000               2.38%
Total / Average                   50000               2.30%
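When the groups in an EDA report are of unequal size, as in the Live in Quebec table, the average row has to be weighted by the number of observations. The sketch below illustrates this with the two rows from that table; the variable names are illustrative only.

```python
# (label, number of observations, response rate in %) from the Live in Quebec table
segments = [("yes", 10000, 2.02), ("no", 40000, 2.38)]

total_obs = sum(n for _, n, _ in segments)

# Weighted average: each group's rate counts in proportion to its size.
weighted_rate = sum(n * rate for _, n, rate in segments) / total_obs

# Simple average of the two rates, shown only for comparison.
simple_rate = sum(rate for _, _, rate in segments) / len(segments)

print(total_obs, weighted_rate, simple_rate)
```

The weighted figure is about 2.31%, in line with the roughly 2.30% average shown in the report, whereas the unweighted mean of the two rates would be 2.20%.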
18
More examples of correlation
• Previous analysis has indicated the following trends
Spending    # of Customers   Response Rate
0-100       1000             1%
100-200     1000             0.80%
200-300     1000             1.20%
300-400     1000             0.90%
400+        1000             0.95%

Tenure      # of Customers   Response Rate
< 1 year    1000             3%
1-2 yrs     1000             2.00%
2-3 yrs     1000             1.00%
3-4 yrs     1000             0.75%
4 yrs+      1000             0.30%
• Would the correlations be closer to 1, -1, or 0 here for both variables?
– Spending: closer to 0 here
– Tenure: closer to -1 here
19
More examples of correlation
Spending    # of Customers   Credit Risk
0-100       1000             1%
100-200     1000             2.00%
200-300     1000             3.00%
300-400     1000             4.00%
400+        1000             5.00%

Tenure      # of Customers   Credit Risk
< 1 year    1000             3%
1-2 yrs     1000             2.50%
2-3 yrs     1000             3.30%
3-4 yrs     1000             2.70%
• Would the correlations be closer to 1, -1, or 0 here for both variables?
• What is the learning here vs. the previous slide?
– Spending: closer to +1 here
– Tenure: closer to 0 here (credit risk moves up and down with no consistent trend)
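The direction of these correlations can be checked numerically from the banded tables on the two "More examples of correlation" slides. In the sketch below each band is replaced by its rank (1 for the lowest band, 2 for the next, and so on); that coding is an assumption made for illustration, since the slides give only ranges.

```python
import math

def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length lists."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)

def band_correlation(rates):
    """Correlate band rank (1, 2, 3, ...) with the rate observed in each band."""
    ranks = list(range(1, len(rates) + 1))
    return pearson(ranks, rates)

# Rates in % by band, in the order shown on the slides.
spending_response = [1.0, 0.8, 1.2, 0.9, 0.95]
tenure_response = [3.0, 2.0, 1.0, 0.75, 0.30]
spending_credit_risk = [1.0, 2.0, 3.0, 4.0, 5.0]
tenure_credit_risk = [3.0, 2.5, 3.3, 2.7]

for label, rates in [
    ("Spending vs. response", spending_response),
    ("Tenure vs. response", tenure_response),
    ("Spending vs. credit risk", spending_credit_risk),
    ("Tenure vs. credit risk", tenure_credit_risk),
]:
    print(label, round(band_correlation(rates), 2))
```

The same variable can therefore be a weak predictor of one behaviour (spending for response) and a strong predictor of another (spending for credit risk).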
20
Exploratory Data Analysis Reports
• Exploratory Data Analysis Reports:
Age             % of Customers   Response Rate   Response Index
under 25        25%              6%              171
25-35           25%              4.50%           128
35-50           25%              2.50%           71
50+             25%              1%              28
Average         100%             3.50%           100
What does this tell us?
Younger customers are more likely to respond.

Household Size  % of Customers   Response Rate   Response Index
1 person        25%              4%              114
2 persons       25%              3%              86
3 persons       25%              4%              114
4+ persons      25%              3%              86
Average         100%             3.50%           100
What does this tell us?
No trend exists here.
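The response index in these reports appears to be each group's response rate divided by the average response rate, times 100. The sketch below reproduces that calculation under this assumption; note that rounding gives 129 and 29 for two of the age bands, where the slide shows 128 and 28, so the slide values may have been truncated rather than rounded.

```python
def response_index(rate, average_rate):
    """Group response rate relative to the average, scaled so the average is 100."""
    return rate / average_rate * 100

AVERAGE_RATE = 3.5  # average response rate in %, from the reports

age_rates = {"under 25": 6.0, "25-35": 4.5, "35-50": 2.5, "50+": 1.0}
household_rates = {"1 person": 4.0, "2 persons": 3.0, "3 persons": 4.0, "4+ persons": 3.0}

for group, rate in {**age_rates, **household_rates}.items():
    print(group, round(response_index(rate, AVERAGE_RATE)))
```

An index above 100 means the group responds better than average; an index below 100 means it responds worse.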
21
Exploratory Data Analysis Reports
Income       % of Customers   Response Rate   Income >= 40K
under 20K    16%              1.50%           0
20-30K       16%              2.50%           0
30-40K       16%              2.00%           0
40-55K       16%              6%              1
55-80K       16%              5%              1
80K+         16%              4%              1
Average      100%             3.50%
What does this mean?
Clearly, there is more of a binary than a linear relationship here. Would create a binary variable on income >= 40K.

# of Months Since   % of Customers   Response Rate   Response Index   Months Since Last Promotion
Last Promotion                                                        (index variable)
1                   16%              2.50%           0.71             0.62
2                   16%              1.50%           0.43             0.62
3                   16%              3.75%           1.07             1.00
4                   16%              3.25%           0.93             1.00
5                   16%              6.00%           1.71             1.43
6                   16%              4.00%           1.14             1.43
Average             100%             3.50%           1.00
What does this mean?
Not quite binary, but not perfectly linear either. Would create index variables here.
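The two derived variables suggested by these reports can be sketched as follows. Both the pairing of months (1-2, 3-4, 5-6) and the use of a simple mean of the response indices within each pair are assumptions inferred from the three index values on the slide. Under those assumptions the first group comes out at 0.57 rather than the 0.62 shown, so the slide may have used a different grouping or weighting.

```python
def income_flag(income_dollars):
    """Binary variable: 1 if income is at least 40K, else 0."""
    return 1 if income_dollars >= 40_000 else 0

# Response index by number of months since last promotion (from the report).
promo_index = {1: 0.71, 2: 0.43, 3: 1.07, 4: 0.93, 5: 1.71, 6: 1.14}

# Assumed grouping of months into pairs.
groups = [(1, 2), (3, 4), (5, 6)]

# Index variable: mean response index within each group of months.
grouped_index = {g: sum(promo_index[m] for m in g) / len(g) for g in groups}

def promo_index_variable(months):
    """Look up the grouped index for a value of months since last promotion (1 to 6)."""
    for g in groups:
        if months in g:
            return grouped_index[g]
    raise ValueError("months outside the range covered by the report")

print(income_flag(35_000), income_flag(62_000))
print(grouped_index)
```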
22
Creating the Final Model
• Why couldn't we just use the results of the correlation analysis to create the model, creating index values for each significant variable?
– Age
– Tenure
– # of products purchased
– # of promotions since last purchase
• Think statistics here.
Independent (predictor) variables are often correlated with one another. This is known as multicollinearity, and it must be accounted for when building any model. The correlation among the model's independent variables affects each variable's actual weight or coefficient in any model equation that is parametric (i.e. there are weights or coefficients associated with each parameter).
23
Creating the Final Model
• Need to account for interaction here
                 Age     Education   Live in Quebec   Spend
Age              1       0.3         -0.3             -0.7
Education        0.3     1           -0.6             0.9
Live in Quebec   -0.3    -0.6        1                0.2
Spend            -0.7    0.9         0.2              1

                 Correlation Coefficient with Response   Confidence Level
Age              0.4                                     99%
Education        0.4                                     99%
Quebec           0.3                                     97%
Spend            0.2                                     95%
Let's take a look at some equations
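One simple way to screen for multicollinearity is to scan the predictor correlation matrix for pairs with a large absolute correlation. The sketch below does this for the matrix above; the cut-off of 0.7 is an arbitrary assumption for illustration, not a rule from the slides.

```python
names = ["Age", "Education", "Live in Quebec", "Spend"]

# Correlations among the independent variables, from the slide.
corr = [
    [1.0, 0.3, -0.3, -0.7],
    [0.3, 1.0, -0.6, 0.9],
    [-0.3, -0.6, 1.0, 0.2],
    [-0.7, 0.9, 0.2, 1.0],
]

CUTOFF = 0.7  # assumed threshold for "highly correlated"

# Only look above the diagonal so each pair is reported once.
flagged = [
    (names[i], names[j], corr[i][j])
    for i in range(len(names))
    for j in range(i + 1, len(names))
    if abs(corr[i][j]) >= CUTOFF
]

for a, b, r in flagged:
    print(f"{a} and {b}: {r}")
```

With this cut-off, Spend is flagged against both Age and Education, which suggests its coefficient in a joint model may behave differently from its simple correlation with response.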
24
The Data Mining Process: Application of Data Mining Techniques-Creating the Final Model
Problems with Multicollinearity
• Example: Years of Education and Income on Response Rate
Correlation with Response   Years of Education   Income
Correlation coefficient     0.11                 0.12
Confidence level            99%                  99.50%
• Regression equation is: Response = 0.50 + 0.00001*Income - 0.03*Yrs. of Education
• What is the problem here and what do you do?
Income and education are highly correlated, causing education to flip its sign within the model equation even though its correlation with response is positive. I would either replace it with some other variable that does not reduce the model power too much, or I would create an interaction variable between the two (i.e. Education X Income).
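The sign flip can be reproduced on synthetic data. The sketch below is purely illustrative: the data are simulated (standardised "income", an "education" variable that is income plus a little noise, and a response built from both), and the two-predictor regression is solved by hand from the covariance equations rather than with a statistics package.

```python
import random

rng = random.Random(0)
n = 2000

# Simulated, standardised predictors: education is almost a copy of income.
income = [rng.gauss(0, 1) for _ in range(n)]
education = [x + rng.gauss(0, 0.3) for x in income]

# Simulated response: positive weight on income, negative weight on education.
response = [1.0 * x - 0.5 * e for x, e in zip(income, education)]

def cov(a, b):
    """Population covariance of two equal-length lists."""
    ma, mb = sum(a) / len(a), sum(b) / len(b)
    return sum((x - ma) * (y - mb) for x, y in zip(a, b)) / len(a)

# Simple (one-variable-at-a-time) relationship with response.
cov_income_resp = cov(income, response)
cov_education_resp = cov(education, response)

# Two-predictor least squares via the normal equations (Cramer's rule).
s11, s22, s12 = cov(income, income), cov(education, education), cov(income, education)
det = s11 * s22 - s12 ** 2
b_income = (cov_income_resp * s22 - cov_education_resp * s12) / det
b_education = (cov_education_resp * s11 - cov_income_resp * s12) / det

print("Sign of simple relationship, education:", "+" if cov_education_resp > 0 else "-")
print("Regression coefficient, education:", round(b_education, 3))
```

Taken alone, education moves with response, but once income is in the equation its coefficient is negative, which is the pattern described on the slide.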
25
Continuing to build the model
• Multivariate analytical techniques such as multiple regression, logistic regression, etc. may be employed to produce the final model
• Final equation: Predicted Response Rate = A - B1*Age + B2*Tenure
• Corr. coeff. is +.5 for age and +.55 for tenure
• What is the problem here?
– Age has flipped: its coefficient is negative although its correlation with response is positive
• What other diagnostics would you undertake to better understand the situation?
– Examine the correlation coefficient between these two variables and compare this result to the other independent variable correlations. The magnitude of the age and tenure correlation should be much greater than the other independent variable correlations
26
Continuing to build the model
Variable           Correlation
Spend              0.6
Live in Ontario    0.5
Number in House    -0.3
Response = A + (.05 X Spend) - (.03 X Live in Ontario) - (.01 X Number in House)

Variable           Correlation
# of products      0.6
Credit Score       0.4
Tenure             -0.2
Response = A - (.03 X Number of Products) + (.08 X Credit Score) - (.01 X Tenure)
27
Continuing to build the model
• After observing correlation results and EDAs, what can we begin to do at this point?
– Derive new variables-EDAs
– Derive new variables-multicollinearity
– Derive new variables-Factor Analysis
– Derive new variables-CHAID (will explore later)
Reference Material:
Factor Analysis-look up in any statistics handbook
Regression-look up in textbook under Regression and Statistics Regression.
28
Continuing to build the model
• Running further statistical routines, we are able to develop a final model. The marketer or business person should receive a report that looks as follows:
Model Variable    Impact on Response   % Contribution to Model
Income            Positive             45%
Tenure            Negative             25%
Product Spend     Positive             20%
Gender is Male    Negative             10%
• For those of you that have statistics training, how is the % contribution to model derived?
Look at the partial R2 of each variable and calculate as follows: % contribution = partial R2 / total R2
29
Continuing to Build the Model
Variable     Parameter Estimate   F Value   Pr > F
Intercept    0.04331              69.06     <.0001
var 1        0.01321              18.16     <.0001
var 2        -0.00586             39.8      <.0001
var 3        -0.01181             12.07     0.0005
var 4        0.01584              13.97     0.0002
var 5        -0.01496             13.93     0.0002
var 6        -0.03684             4.31      0.038

Variable Entered   Partial R-Square   Model R-Square
var 4              0.0036             0.0036
var 3              0.0034             0.007
var 1              0.0016             0.0086
var 2              0.0007             0.0092
var 6              0.0009             0.0102
var 5              0.0003             0.0105
30
Continuing to Build the Model
Variable Impact Strength
var 4 + 34.29%
var 3 - 32.38%
var 1 + 15.24%
var 2 - 6.67%
var 6 - 8.57%
var 5 - 2.86%
• What would be the final equation in terms of the sign?
The equation should have the same signs as shown in the Impact column above.
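The Strength column can be reproduced from the stepwise output using the rule % contribution = partial R2 / total R2. The sketch below uses the partial R-square values and the final model R-square (0.0105) from the stepwise table, and takes the sign of each parameter estimate for the Impact column.

```python
# Partial R-square by variable, in order of entry (from the stepwise output).
partial_r2 = {
    "var 4": 0.0036,
    "var 3": 0.0034,
    "var 1": 0.0016,
    "var 2": 0.0007,
    "var 6": 0.0009,
    "var 5": 0.0003,
}

# Parameter estimates, used here only for their sign.
estimates = {
    "var 1": 0.01321,
    "var 2": -0.00586,
    "var 3": -0.01181,
    "var 4": 0.01584,
    "var 5": -0.01496,
    "var 6": -0.03684,
}

total_r2 = 0.0105  # model R-square after the last variable entered

contribution = {v: r2 / total_r2 * 100 for v, r2 in partial_r2.items()}
impact = {v: "+" if estimates[v] > 0 else "-" for v in partial_r2}

for v in partial_r2:
    print(v, impact[v], f"{contribution[v]:.2f}%")
```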
31
Continuing to build the model
• What would you do here?
Model Variable     Impact on Response   Contribution to Model
Live in Quebec     Positive             85%
Income             Negative             7%
Behaviour Score    Negative             5%
# of promotions    Negative             3%
I would conduct upfront segmentation, as Live in Quebec is overwhelmingly strong and essentially indicates that we have a one-variable model. Create two segments, Live in Quebec and Rest of Canada, and perhaps develop a model for each of these segments.
32
Continuing to build the model
Model Variable    Impact on Response   % Contribution to Model
Income            Positive             45%
Tenure            Negative             25%
Product Spend     Positive             20%
Gender is Male    Negative             10%
• Suppose we have the following equation:
Response = +.09
+.05 X Income
+.06 X Tenure
+.08 X Product Spend
-.04 X Male
• What is the problem here?
• The sign for tenure is inconsistent between the report (negative) and the actual equation (positive). Double-check the actual equation and coefficient signs.
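This kind of consistency check between the report and the equation is easy to automate. The sketch below compares the reported impact of each variable with the sign of its coefficient in the equation above; the dictionary layout is just one possible way to hold the two pieces of information.

```python
# Impact on response as stated in the report.
reported_impact = {
    "Income": "positive",
    "Tenure": "negative",
    "Product Spend": "positive",
    "Gender is Male": "negative",
}

# Coefficients from the model equation.
coefficients = {
    "Income": 0.05,
    "Tenure": 0.06,
    "Product Spend": 0.08,
    "Gender is Male": -0.04,
}

def sign_label(value):
    """Translate a coefficient into the wording used in the report."""
    return "positive" if value > 0 else "negative"

mismatches = [
    variable
    for variable, impact in reported_impact.items()
    if sign_label(coefficients[variable]) != impact
]

print("Variables whose sign disagrees with the report:", mismatches)
```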