Uncover Customer Insights with Apache Spark and ML
Bo Zhang, IBM
Joint work with Chunhui Higgins, Chul Sung and Pu Yang
2016/09/23
http://dpda.mybluemix.net
About Me
• Data Science Driven Business at IBM
• Ph.D. from North Carolina State University in Statistics
• LinkedIn: https://www.linkedin.com/in/imbozhang
• Cognitive Business = Digital Business + Digital Intelligence (Data Science)
• Big Data
  • User online data
  • User “offline” data
  • Watson Internet of Things
  • …
• Good news: cost (5TB of disk: ~$109.99)
• Bad news: speed (time to read 5TB from disk: 15 hours)
Challenge: Big Data
Cloud platform for building, running, and managing apps and services
• One machine cannot process or even store all the data
• Solution is to distribute data over a large cluster of machines
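The speed problem and the benefit of distributing the data can be checked with a little arithmetic. The sketch below takes the quoted figures (5TB, 15 hours) as given and derives the disk throughput they imply; the decimal definition of a terabyte, the cluster sizes, and the assumption of a perfectly even split with no coordination overhead are illustrative choices, not numbers from the talk.

```python
# Figures quoted in the slide: reading 5 TB from one disk takes about 15 hours.
TB = 10**12                      # bytes; decimal terabyte (assumption)
data_bytes = 5 * TB
single_disk_hours = 15.0

# Sequential throughput implied by those two figures, in MB/s.
throughput_mb_s = data_bytes / (single_disk_hours * 3600) / 10**6


def read_hours(n_machines: int) -> float:
    """Idealized read time when the data is split evenly across machines
    that each read only their own share, in parallel, with no overhead."""
    return single_disk_hours / n_machines


print(f"implied throughput: {throughput_mb_s:.1f} MB/s")
for n in (1, 10, 100):           # hypothetical cluster sizes
    print(f"{n:>3} machines: {read_hours(n) * 60:.0f} minutes")
```

Real clusters fall short of this ideal because of scheduling, network transfer and stragglers, but the order of magnitude shows why the data is spread over many machines.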
Solution: Apache Spark
Spark DataFrame (DF)
Churn  Tenure  Cost  …  Type
1      13      44    …  0
1      11      33    …  1
0      68      52    …  1
1      33      33    …  0
1      23      30    …  0
0      41      39    …  0

Partition 1
Churn  Tenure  Cost  …  Type
1      13      44    …  0
1      11      33    …  1

Partition 2
Churn  Tenure  Cost  …  Type
0      68      52    …  1
1      33      33    …  0

Partition 3
Churn  Tenure  Cost  …  Type
1      23      30    …  0
0      41      39    …  0
Spark provides a programming abstraction and parallel runtime that hides the complexities of fault tolerance and slow machines.
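The partitioning idea can be mimicked in plain Python: split the rows into chunks, compute a partial result on each chunk independently, then combine the partial results. This is only an analogy under stated assumptions, not Spark code; the helper names `split` and `local_stats` are hypothetical, and only the four visible columns of the toy table are used.

```python
from typing import List, Tuple

# Toy rows from the slide: (churn, tenure, cost, type); elided columns are left out.
Row = Tuple[int, int, int, int]
rows: List[Row] = [
    (1, 13, 44, 0), (1, 11, 33, 1), (0, 68, 52, 1),
    (1, 33, 33, 0), (1, 23, 30, 0), (0, 41, 39, 0),
]


def split(data: List[Row], n_parts: int) -> List[List[Row]]:
    """Cut the rows into contiguous chunks, like the three partitions on the slide."""
    size = -(-len(data) // n_parts)  # ceiling division
    return [data[i:i + size] for i in range(0, len(data), size)]


def local_stats(part: List[Row]) -> Tuple[int, int]:
    """Work done independently on one partition: (churn count, row count)."""
    return sum(r[0] for r in part), len(part)


partitions = split(rows, 3)
partials = [local_stats(p) for p in partitions]   # would run in parallel on workers
churned = sum(c for c, _ in partials)
total = sum(n for _, n in partials)
churn_rate = churned / total
print(partitions, churn_rate)
```

Because the partial results are plain sums, it does not matter which worker handles which partition or in what order the results arrive.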
Apache Spark Components
• Apache Spark, Spark DF
• Spark SQL: Hive, JSON, …, JDBC to DB
• Spark Streaming
• ML (machine learning): ML Algorithms; Featurization, Pipelines, Persistence, Utilities
• GraphX (graph)
Spark Driver and Workers
• A Spark program is two programs:
  • a driver program (one machine)
  • a worker program (cluster nodes)
[Diagram: Driver Program (SparkContext, SQLContext) → Cluster Manager → Workers, each running a Spark Executor; the DF is distributed across workers]
Lifecycle
[Diagram: DashDB → Create → DF → Filter → Filtered DF → Show → Transformed DF → ML → Insights → Present]
Transformations
• Filter
• Select
• …
Actions
• Show
• Count
• …
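The split between transformations and actions matters because transformations are lazy: they only describe work, and nothing runs until an action asks for a result. The toy class below illustrates that behavior in plain Python. `LazyFrame` is a hypothetical stand-in written for this sketch and is not the Spark API; the rows reuse three visible columns of the toy table.

```python
class LazyFrame:
    """Toy stand-in for a DataFrame (hypothetical, not the Spark API):
    transformations record work, actions execute it."""

    def __init__(self, rows, steps=()):
        self._rows = rows
        self._steps = tuple(steps)

    # Transformations: return a new LazyFrame and compute nothing.
    def filter(self, predicate):
        return LazyFrame(self._rows, self._steps + (("filter", predicate),))

    def select(self, *columns):
        return LazyFrame(self._rows, self._steps + (("select", columns),))

    def _run(self):
        out = self._rows
        for kind, arg in self._steps:
            if kind == "filter":
                out = [r for r in out if arg(r)]
            else:  # "select"
                out = [{c: r[c] for c in arg} for r in out]
        return out

    # Actions: trigger evaluation of all recorded steps.
    def count(self):
        return len(self._run())

    def show(self):
        for r in self._run():
            print(r)


rows = [
    {"churn": 1, "tenure": 13, "cost": 44}, {"churn": 1, "tenure": 11, "cost": 33},
    {"churn": 0, "tenure": 68, "cost": 52}, {"churn": 1, "tenure": 33, "cost": 33},
    {"churn": 1, "tenure": 23, "cost": 30}, {"churn": 0, "tenure": 41, "cost": 39},
]
churned = LazyFrame(rows).filter(lambda r: r["churn"] == 1).select("tenure", "cost")
churned.show()            # only now is the filter actually applied
print(churned.count())
```

Deferring execution lets an engine see the whole chain of steps before running anything, which is what makes it possible to plan the work across partitions.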
Machine Learning
Settings | Common Methods | Use Cases
Regression | Linear Regression, Generalized Linear Regression, Survival Regression | Customer Churn Analysis
Classification | Logistic Regression, Decision Tree, Random Forest, Naïve Bayes | Sales Leads Prediction
Clustering | K-means, Gaussian Mixture | Customer Segmentation
Collaborative Filtering | User-Based Collaborative Filtering, Item-Based Collaborative Filtering, Alternating Least Squares | Service Recommendation
Text Mining | Sentiment Analysis, Topic Classification | NPS Survey Analysis
Typical Supervised Machine Learning Pipeline
• Obtain New Data: many sources, such as marketing data, user behavior data, social media data, call center data and survey data
• Feature Extraction: extract features to represent observations; unsupervised learning
• Supervised Learning: train models
• Evaluation: determine the quality of the model
• Predict: predict on future observations
Business Initiative - Customer Churn Reduction
[Figure: timelines for Customers A, B, C and D from register time to now, with three churn events marked]
[Figure: AARRR funnel: Acquisition, Activation, Retention, Revenue, Referral]
As part of the efforts to reduce customer churn, IBM is interested in modeling the "time to churn" in order to determine the factors associated with customers who left.
• Survival regression estimates the time to an event: death, equipment failure or customer churn
• We cannot show the real data here, so we use toy data with similar properties, which also makes the problem reproducible.
Customer Churn Prediction via Survival Regression
Tenure  Churn  Type  Analytics  Runtimes  Mobile  …  Watson  Boilerplate  Quantity  Cost
13      1      1     14         12        07      …  1       1            28        45
11      1      1     13         06        02      …  0       0            15        24
68      0      1     02         07        02      …  1       1            10        23
33      1      0     13         09        01      …  1       0            3         3
23      1      0     10         19        03      …  1       1            1         1
41      0      1     19         25        04      …  1       0            17        9
…       …      …     …          …         …       …  …       …            …         …
Survival Regression Model - Accelerated Failure Time
• T: the churn time for a customer, a random variable with probability density function (PDF): $f(t) = \lim_{\Delta t \to 0} \frac{P(t \le T < t + \Delta t)}{\Delta t}$
• Cumulative distribution function (CDF): $F(t) = P(T \le t)$
• Survival function: $S(t) = P(T \ge t) = 1 - F(t) = \int_t^{\infty} f(x)\,dx$
• Hazard function: $h(t) = \lim_{\Delta t \to 0} \frac{P(t \le T < t + \Delta t \mid T \ge t)}{\Delta t}$
• Accelerated Failure Time (AFT): $\log(T_i) = \beta_0 + \beta_1 z_{i1} + \dots + \beta_p z_{ip} + \sigma \varepsilon_i$
• $z_k = 0$ for trial accounts, $z_k = 1$ for paying accounts: $S_1(t) = S_2(ct)$, $c = e^{\beta_k}$
• The median survival time of paying accounts is $c$ times that of trial accounts
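The relation between the two survival functions follows directly from the AFT equation. A short derivation, assuming all covariates other than $z_k$ are held fixed and writing $T_1$, $T_2$ for the churn times of a trial and a paying account and $m_1$, $m_2$ for their medians (names introduced here for illustration):

```latex
\begin{aligned}
\log T_2 &= \log T_1 + \beta_k
  \;\Rightarrow\; T_2 = e^{\beta_k}\,T_1 = c\,T_1 \quad \text{(in distribution)} \\
S_2(ct) &= P(T_2 \ge ct) = P(c\,T_1 \ge ct) = P(T_1 \ge t) = S_1(t) \\
S_1(m_1) &= \tfrac{1}{2}
  \;\Rightarrow\; S_2(c\,m_1) = \tfrac{1}{2}
  \;\Rightarrow\; m_2 = c\,m_1
\end{aligned}
```

So a positive $\beta_k$ gives $c > 1$: the whole time scale of paying accounts is stretched by the factor $c$, not just the median.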
Maximum Likelihood Estimation of AFT
• Maximize the likelihood: $L(\beta, \sigma; x, \delta) = \prod_{i=1}^{n} [f(x_i; \beta, \sigma)]^{\delta_i} [S(x_i; \beta, \sigma)]^{1 - \delta_i}$
• Minimize the negative log-likelihood with a Weibull distribution: $-l(\beta, \sigma; x, \delta) = \sum_{i=1}^{n} \left( \delta_i \log\sigma - \delta_i \varepsilon_i + e^{\varepsilon_i} \right)$, where $\varepsilon_i = \frac{\log t_i - x_i'\beta}{\sigma}$
• A convex optimization problem: how to solve it efficiently in Spark?
• Quasi-Newton method (L-BFGS): approximate the objective function locally as a quadratic without evaluating the second partial derivatives: $B_{k+1} = B_k - \frac{(B_k s_k)(B_k s_k)^T}{s_k^T B_k s_k} + \frac{y_k y_k^T}{y_k^T s_k}$, with $s_k = \theta_{k+1} - \theta_k$, $y_k = g_{k+1} - g_k$
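The negative log-likelihood and the gradient a quasi-Newton solver needs can be written out in a few lines. The sketch below is a plain-Python illustration, not the talk's implementation. It assumes that δ_i is the Churn column (1 if churn was observed, 0 if the customer is still active), that t_i is Tenure, and that the only covariates are an intercept and Type; it optimizes over log σ rather than σ so that σ stays positive. The function name and the starting values are hypothetical.

```python
import math
from typing import List, Sequence, Tuple


def aft_neg_log_likelihood(
    beta: Sequence[float],
    log_sigma: float,
    times: Sequence[float],
    events: Sequence[int],
    features: Sequence[Sequence[float]],
) -> Tuple[float, List[float], float]:
    """Return (loss, d loss / d beta, d loss / d log_sigma) for
    loss = sum_i (delta_i * log(sigma) - delta_i * eps_i + exp(eps_i)),
    eps_i = (log t_i - x_i . beta) / sigma."""
    sigma = math.exp(log_sigma)
    loss = 0.0
    grad_beta = [0.0] * len(beta)
    grad_log_sigma = 0.0
    for t, delta, x in zip(times, events, features):
        eps = (math.log(t) - sum(b * xj for b, xj in zip(beta, x))) / sigma
        exp_eps = math.exp(eps)
        loss += delta * log_sigma - delta * eps + exp_eps
        for j, xj in enumerate(x):
            grad_beta[j] += (delta - exp_eps) * xj / sigma   # d eps / d beta_j = -x_j / sigma
        grad_log_sigma += delta + (delta - exp_eps) * eps     # d eps / d log_sigma = -eps
    return loss, grad_beta, grad_log_sigma


# Toy rows from the table: Tenure, Churn, and [intercept, Type]; other columns omitted.
times = [13, 11, 68, 33, 23, 41]
events = [1, 1, 0, 1, 1, 0]
features = [[1.0, 1.0], [1.0, 1.0], [1.0, 1.0], [1.0, 0.0], [1.0, 0.0], [1.0, 1.0]]

# Hypothetical starting point for an optimizer.
loss, grad_beta, grad_log_sigma = aft_neg_log_likelihood([3.0, 0.5], 0.0, times, events, features)
print(loss, grad_beta, grad_log_sigma)
```

Both the loss and the gradient are sums over customers, which is the property that lets the computation be spread across executors.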
Survival Regression on Spark
Initialize Weights
Broadcast Weights to Executors
Compute likelihood and gradient for each instance, sum them up locally
Compute likelihood and gradient for each instance, sum them up locally
Compute likelihood and gradient for each instance, sum them up locally
Reduce across executors to get the sum of likelihood and gradient
Use L-BFGS to find next step
Final Model Weights
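The loop above can be imitated on a single machine to show why it parallelizes: each executor sums the contributions of its own rows, and the driver only has to add up a handful of partial sums. The sketch below makes several simplifying assumptions that are not from the talk: σ is fixed at 1, plain gradient descent stands in for L-BFGS, the partitions are Python lists processed one after another, and all function names are hypothetical.

```python
import math
from functools import reduce


def instance_terms(weights, row):
    """Loss and gradient contribution of one customer (Weibull AFT, sigma fixed at 1)."""
    t, delta, x = row
    eps = math.log(t) - sum(w * xj for w, xj in zip(weights, x))
    e = math.exp(eps)
    return -delta * eps + e, [(delta - e) * xj for xj in x]


def add(a, b):
    """Combine two (loss, gradient) pairs; plain addition, so usable as a reduce step."""
    return a[0] + b[0], [u + v for u, v in zip(a[1], b[1])]


def executor_sum(weights, partition):
    """Work on one executor: sum the contributions of its local rows."""
    zero = (0.0, [0.0] * len(weights))
    return reduce(add, (instance_terms(weights, r) for r in partition), zero)


def driver_step(weights, partitions, step_size=0.01):
    """One iteration: 'broadcast' weights, collect partial sums, update the weights.
    Gradient descent is used here as a simple stand-in for the L-BFGS update."""
    partials = [executor_sum(weights, p) for p in partitions]  # parallel in a real cluster
    loss, grad = reduce(add, partials)
    new_weights = [w - step_size * g for w, g in zip(weights, grad)]
    return loss, grad, new_weights


# Toy rows (Tenure, Churn, [intercept, Type]) split into three partitions.
partitions = [
    [(13, 1, [1.0, 1.0]), (11, 1, [1.0, 1.0])],
    [(68, 0, [1.0, 1.0]), (33, 1, [1.0, 0.0])],
    [(23, 1, [1.0, 0.0]), (41, 0, [1.0, 1.0])],
]
weights = [3.0, 0.5]                      # hypothetical initial weights
for _ in range(3):
    loss, grad, weights = driver_step(weights, partitions)
    print(loss, weights)
```

Only the weights travel to the executors and only one loss value and one gradient vector travel back per partition, so the network cost per iteration does not grow with the number of customers.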
Demo: Customer Churn Analysis on Spark
• bluemix: https://console.ng.bluemix.net
• notebook: https://goo.gl/KgxZn5
• web application: http://dpda.mybluemix.net
Takeaways
• Apache Spark helped our team significantly reduce the time from prototype to production
• Survival analysis is useful for identifying not only who will churn but also when and why they churn
• Ranking customers by predicted survival probability in ascending order, the top 50% of customers capture 80% of churners.
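The capture figure in the last takeaway is a cumulative-gains style measure: sort customers from lowest to highest predicted survival probability, take the top fraction, and ask what share of all churners falls inside it. A minimal sketch of that computation follows. The function name is hypothetical, and the ten scores and outcomes are made-up values chosen for illustration, not results from the talk.

```python
from typing import Sequence


def capture_rate(survival_prob: Sequence[float], churned: Sequence[int],
                 top_fraction: float) -> float:
    """Share of all churners found among the top_fraction of customers
    with the lowest predicted survival probability."""
    order = sorted(range(len(survival_prob)), key=lambda i: survival_prob[i])  # ascending
    k = int(round(top_fraction * len(order)))
    captured = sum(churned[i] for i in order[:k])
    total = sum(churned)
    return captured / total if total else 0.0


# Hypothetical predictions and outcomes for ten customers (illustration only).
survival_prob = [0.05, 0.10, 0.20, 0.30, 0.40, 0.55, 0.60, 0.75, 0.85, 0.95]
churned = [1, 1, 1, 0, 1, 0, 1, 0, 0, 0]

print(capture_rate(survival_prob, churned, 0.5))   # 4 of the 5 churners are in the top half
```

A model with no predictive power would capture about 50% of churners in the top 50%, so the gap between that baseline and the observed capture rate is what makes the ranking useful for targeting retention efforts.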
Thank You