Date posted: 21-Apr-2017
Category: Data & Analytics
Uploaded by: neeraj-tiwary
Event – Webinar Attendee Count Prediction v1.0 pilot
Neeraj Tiwary
Data Scientist
Problem Statement
Marketers need to know the prospective attendee count for in-person/online events conducted for any product or geographic location during the event setup and budget planning stage. Predicting attendee counts at planning time will help improve the overall success rate of the events conducted.
Business Cases
1. Predict event attendance from basic event attributes at the time the event is created.
Impact: This helps the event owner/marketer pre-plan the event.
2. Predict event attendance from registration counts and basic event attributes.
Impact: This helps the marketer/event owner improve the number of registrations between event setup and the start date.
Data
The training dataset contains 4000 records, whereas the test dataset contains 800 records.
Architecture
Architecture - Continued
For each use case, we created two separate models and then ensembled them into a wrapper model. The reasons for creating two separate models are:
• To simplify the problem space.
• The distribution of the response variable suggested that the data follow a gamma distribution.
• The gamma distribution does not handle zero-inflated problems well, although the Poisson / Negative Binomial distributions do.
• The requirement is to predict the number of attendees of an event. This is a count regression problem, so we could not use unconstrained regression algorithms such as linear regression or neural network regression: their predictions range from -infinity to +infinity, whereas a count variable must lie in the range 0 to infinity.
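The slides do not say how the wrapper model combines the two sub-models, so the following is only a sketch of two common options, under stated assumptions. The function name `wrapper_predict` and its parameters are hypothetical: `p_nonzero` stands for the logistic model's probability that an event has at least one attendee, and `mean_if_nonzero` for the gamma model's predicted count given that it does.

```python
def wrapper_predict(p_nonzero, mean_if_nonzero, threshold=None):
    """Combine the two sub-model outputs into one attendee-count prediction.

    Assumption (not stated in the slides): with threshold=None the result is
    the expected value p * mean; with a threshold, the classifier acts as a
    hard gate that returns either 0 or the gamma model's prediction.
    """
    if not 0.0 <= p_nonzero <= 1.0:
        raise ValueError("p_nonzero must be a probability in [0, 1]")
    if mean_if_nonzero < 0:
        raise ValueError("mean_if_nonzero must be non-negative")
    if threshold is None:
        # Expected attendee count under a two-part (hurdle-style) model.
        return p_nonzero * mean_if_nonzero
    # Hard gate: predict zero unless the classifier is confident enough.
    return mean_if_nonzero if p_nonzero >= threshold else 0.0


# Illustrative values only, not taken from the project data.
print(wrapper_predict(0.5, 40.0))                 # expected-value combination
print(wrapper_predict(0.2, 40.0, threshold=0.5))  # gated combination
```

Either variant keeps the prediction in the range 0 to infinity, which is the property the count problem requires.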
Data Cleansing
• Trimmed all the variables to remove white space
• Converted all categorical variable values to lower case
• Replaced all null values with “Not Assigned” for uniformity in the data
• Transformed the data so that some common categorical variables have proper values
• Removed low-frequency categorical data, as it was hurting the model
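The cleansing steps above can be sketched in a few lines. The original work was done in R; this Python version is only an illustration. The helper names (`clean_value`, `drop_rare_levels`), the `min_count` threshold, and the sample records are hypothetical, and the sketch assumes that "removing low-frequency data" means dropping the rows that carry a rare category level.

```python
from collections import Counter

NOT_ASSIGNED = "Not Assigned"


def clean_value(value):
    """Trim whitespace, lower-case, and map missing values to one label."""
    if value is None:
        return NOT_ASSIGNED
    text = str(value).strip().lower()
    return text if text else NOT_ASSIGNED


def drop_rare_levels(rows, column, min_count):
    """Keep only rows whose level in `column` occurs at least `min_count` times."""
    counts = Counter(row[column] for row in rows)
    return [row for row in rows if counts[row[column]] >= min_count]


# Made-up records for illustration only.
rows = [
    {"Product": "  Widget "},
    {"Product": "WIDGET"},
    {"Product": None},
    {"Product": "gadget"},
]
cleaned = [{key: clean_value(val) for key, val in row.items()} for row in rows]
frequent = drop_rare_levels(cleaned, "Product", min_count=2)
print(cleaned)
print(frequent)
```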
Missing value imputation
• Worked with the business to replace missing values with the actual values wherever possible
• For the remaining missing values, used “Multiple Imputation” methods, as most of the data were missing at random and belonged to categorical variables
Feature Engineering
This is the main step of any model development activity. We need to enhance our features to improve predictive power.
• Created dummy variables for categorical variables like “Product” and “TargetAudience” by using mtabulate in R
• Dropped unused levels for all categorical variables
• Created an “Hour of Day” attribute, which gives the hour at which the event starts
• Created a “Month of Year” attribute, which gives the month in which the event starts
• Created a “Duration” attribute, which gives the duration of the event
• Created a “DaysBetweenEventCreationAndStartDate” attribute, which gives the number of days between the event's creation date and its start date
• Initially all the data were available as text strings, which we parsed to extract the relevant information. We did this pre-processing / text parsing before loading the data into R for model development
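The derived attributes listed above can be illustrated as follows. This is a sketch in Python rather than the R code the project used. The timestamp format, the function names (`engineer_features`, `dummy_columns`), the output key names other than “DaysBetweenEventCreationAndStartDate”, and the sample values are all assumptions made for the example.

```python
from datetime import datetime

TIME_FORMAT = "%Y-%m-%d %H:%M"  # assumed text format of the raw timestamps


def engineer_features(created_text, start_text, end_text):
    """Parse raw timestamp strings and derive the time-based features."""
    created = datetime.strptime(created_text, TIME_FORMAT)
    start = datetime.strptime(start_text, TIME_FORMAT)
    end = datetime.strptime(end_text, TIME_FORMAT)
    return {
        "HourOfDay": start.hour,
        "Month": start.month,
        "DurationHours": (end - start).total_seconds() / 3600.0,
        "DaysBetweenEventCreationAndStartDate": (start - created).days,
    }


def dummy_columns(prefix, values, levels):
    """Return one 0/1 indicator per known level (a record may hold several values)."""
    present = set(values)
    return {f"{prefix}_{level}": int(level in present) for level in levels}


# Made-up event for illustration only.
features = engineer_features("2017-01-10 09:00", "2017-02-14 15:30", "2017-02-14 17:00")
features.update(dummy_columns("Product", ["widget"], ["gadget", "widget"]))
print(features)
```

Indicator columns are built from a fixed list of levels so that training and test records end up with the same set of columns.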
Descriptive Statistics – Attendee Count
Response Variable: Attendee Count of a randomly chosen in-person event for a future date
[Figures: Distribution (Log-Likelihood), Boxplot, Density Plot, Histogram]
Statistics:
Mean: 28.65435
Standard Deviation: 32.89823
Skewness: 2.267742
Kurtosis: 5.9273
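As a rough sanity check on the gamma assumption, the reported mean and standard deviation can be turned into method-of-moments gamma parameters, and the skewness those parameters imply can be compared with the reported skewness. This check is not part of the original analysis. It assumes the reported statistics describe the data a single gamma distribution would be fitted to; the slides do not say whether zero-attendee events are included. The function names are hypothetical.

```python
import math


def gamma_method_of_moments(mean, sd):
    """Method-of-moments estimates: mean = shape * scale, variance = shape * scale**2."""
    if mean <= 0 or sd <= 0:
        raise ValueError("mean and sd must be positive")
    shape = (mean / sd) ** 2
    scale = sd ** 2 / mean
    return shape, scale


def gamma_skewness(shape):
    """Theoretical skewness of a gamma distribution with the given shape."""
    return 2.0 / math.sqrt(shape)


# Values reported on the descriptive statistics slide.
reported_mean = 28.65435
reported_sd = 32.89823
reported_skewness = 2.267742

shape, scale = gamma_method_of_moments(reported_mean, reported_sd)
implied_skewness = gamma_skewness(shape)
print(f"shape={shape:.4f} scale={scale:.4f}")
print(f"implied skewness={implied_skewness:.4f} reported={reported_skewness}")
```

The implied skewness comes out at about 2.3, close to the reported 2.27, which is consistent with a gamma-shaped response, though it does not prove it.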
Response Variable - Distribution
• The response variable “AttendeeCount” follows the Gamma distribution
• Many events (~23%) had ZERO attendees
• Since the gamma model does not support a ZERO response value, we divided the problem into two parts
1. Zero attendee count problem
2. Non-Zero attendee count problem
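The split into the two sub-problems can be sketched as below: a binary target for the zero / non-zero classifier over all events, and the positive counts alone for the gamma regression. The function name `split_hurdle_targets` and the sample counts are hypothetical and only illustrate the idea.

```python
def split_hurdle_targets(attendee_counts):
    """Build the training targets for the two sub-models.

    Returns a 0/1 indicator for every event (input to the classifier) and a
    list of (row_index, count) pairs for events with at least one attendee
    (input to the gamma regression).
    """
    has_attendees = [1 if count > 0 else 0 for count in attendee_counts]
    positive = [(i, count) for i, count in enumerate(attendee_counts) if count > 0]
    return has_attendees, positive


# Made-up counts for illustration only.
counts = [0, 12, 0, 45, 7]
has_attendees, positive = split_hurdle_targets(counts)
zero_share = has_attendees.count(0) / len(counts)
print(has_attendees)
print(positive)
print(zero_share)
```

Keeping the row index alongside each positive count makes it possible to join the gamma model's training rows back to the event attributes.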
Exploratory Analysis - ProgramOwner
Model Output: Business Case 1
[Figure: Model1 (Logistic Regression): ROC Curve, AUC, Confusion Matrix]
[Figure: Model2 (Gamma Regression): Accuracy, Model Parameters]
Model Output: Business Case 2
[Figure: Model1 (Logistic Regression): ROC Curve, AUC, Confusion Matrix]
[Figure: Model2 (Gamma Regression): Accuracy, Model Parameters]
Model - Actual vs Predicted + Registration
Model - AzureML
We developed the same model in AzureML and deployed it as a web service.
[Figure: snapshot of the AzureML web service being used from Excel]
Methodology
Used -> Gamma regression, Logistic regression
Tried -> Poisson, Negative Binomial, Neural Network regression, etc.
Results
• The models developed with Gamma / Logistic regression gave better results than the other methods tried.
• A marketer can change any attribute, check the predicted attendee count through the AzureML model, and use that score to make a better-informed decision.
Conclusions and Next Steps
After a thorough understanding of the problem, these are my recommendations for proceeding:
• Explore Vowpal Wabbit in AzureML
• Embed the model in Power BI reporting