Date post: 24-Jul-2020
Uncover Customer Insights with Apache Spark and ML

Bo Zhang, IBM
Joint work with Chunhui Higgins, Chul Sung and Pu Yang
2016/09/23
http://dpda.mybluemix.net

About Me

•  Data Science Driven Business at IBM

•  Ph.D. from North Carolina State University in Statistics

•  LinkedIn: https://www.linkedin.com/in/imbozhang

Challenge: Big Data

•  Cognitive Business = Digital Business + Digital Intelligence (Data Science)
•  Big Data
   •  User online data
   •  User "offline" data
   •  Watson Internet of Things
   •  …
•  Good news: cost (5TB of disk: ~$109.99)
•  Bad news: speed (time to read 5TB from disk: 15 hours)
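The 15-hour figure can be turned into an implied read speed with a little arithmetic. The sketch below assumes decimal units (1 TB = 10^12 bytes) and, for the multi-machine case, ideal linear scaling with no coordination overhead. Both are assumptions for illustration, not figures from the slides.

```python
# Back-of-the-envelope check of the "5TB in 15 hours" figure.
TB = 10**12                      # assumption: decimal terabyte
data_bytes = 5 * TB
read_seconds = 15 * 3600         # 15 hours, as stated on the slide

# Sequential throughput implied by the slide, in MB/s.
throughput_mb_s = data_bytes / read_seconds / 10**6


def hours_with_machines(n_machines):
    """Read time if the data is split evenly over n machines.

    Assumes ideal linear scaling, which real clusters only approximate.
    """
    return 15 / n_machines


print(f"implied throughput: {throughput_mb_s:.1f} MB/s")
print(f"100 machines: {hours_with_machines(100) * 60:.0f} minutes")
```

Under these assumptions a single disk reads at a little over 90 MB/s, which is why spreading the read over many machines is attractive.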

Cloud platform for building, running, and managing apps and services

Solution: Apache Spark

•  One machine cannot process or even store all the data
•  Solution is to distribute data over a large cluster of machines

Spark DataFrame (DF)

Churn  Tenure  Cost  …  Type
1      13      44    …  0
1      11      33    …  1
0      68      52    …  1
1      33      33    …  0
1      23      30    …  0
0      41      39    …  0

Partition 1

Churn  Tenure  Cost  …  Type
1      13      44    …  0
1      11      33    …  1

Partition 2

Churn  Tenure  Cost  …  Type
0      68      52    …  1
1      33      33    …  0

Partition 3

Churn  Tenure  Cost  …  Type
1      23      30    …  0
0      41      39    …  0

Spark provides a programming abstraction and parallel runtime that hides the complexities of fault tolerance and slow machines.
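To make the partitioning idea concrete, here is a minimal plain-Python sketch, not Spark code. It splits the six rows shown on the slide into three partitions, computes partial statistics on each partition in a thread pool, and combines them. The function names are hypothetical, and only the four columns visible on the slide are used.

```python
from concurrent.futures import ThreadPoolExecutor

# Rows from the slide: (churn, tenure, cost, type). Elided columns are omitted.
ROWS = [(1, 13, 44, 0), (1, 11, 33, 1), (0, 68, 52, 1),
        (1, 33, 33, 0), (1, 23, 30, 0), (0, 41, 39, 0)]


def partition(rows, n_partitions):
    """Split rows into contiguous chunks, one per partition."""
    size = -(-len(rows) // n_partitions)  # ceiling division
    return [rows[i:i + size] for i in range(0, len(rows), size)]


def partial_stats(part):
    """Work done independently on one partition: row count and column sums."""
    return (len(part), sum(r[0] for r in part), sum(r[1] for r in part))


partitions = partition(ROWS, 3)

# Each partition is processed by its own worker thread, standing in for a cluster node.
with ThreadPoolExecutor(max_workers=3) as pool:
    partials = list(pool.map(partial_stats, partitions))

# Combine the small per-partition results into global statistics.
n_rows = sum(p[0] for p in partials)
churn_rate = sum(p[1] for p in partials) / n_rows
mean_tenure = sum(p[2] for p in partials) / n_rows
```

Only the small per-partition summaries travel back to the caller. The rows themselves stay where they are, which is the property that lets this pattern scale to data that does not fit on one machine.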

Apache Spark Components

Apache Spark

•  Spark SQL (Hive, JSON, JDBC to DB)
•  Spark Streaming
•  ML (machine learning): ML Algorithms, Featurization, Pipelines, Persistence, Utilities
•  GraphX (graph)

Spark DF

Spark Driver and Workers

•  A Spark program is two programs:
   •  a driver program (one machine)
   •  a worker program (cluster nodes)
•  SparkContext
•  SQLContext

[Diagram: Driver Program (SparkContext, SQLContext), Cluster Manager, two Workers each running a Spark Executor; DF distributed across workers]

Lifecycle

DashDB → Create → DF → Filter → Filtered DF → Show → Transformed DF → ML → Insights → Present

Transformations
•  Filter
•  Select
•  …

Actions
•  Show
•  Count
•  …
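In Spark, transformations such as filter and select are lazy: they only record what should be done, and nothing is computed until an action such as show or count is called. The sketch below imitates that behaviour in plain Python. `LazyFrame` is a hypothetical illustration class, not part of the Spark API, and the rows are the toy values from the earlier DataFrame slide.

```python
class LazyFrame:
    """Toy stand-in for a DataFrame: transformations are recorded, actions run them."""

    def __init__(self, rows, ops=()):
        self._rows = rows
        self._ops = tuple(ops)

    # Transformations: return a new LazyFrame and do no work yet.
    def filter(self, predicate):
        return LazyFrame(self._rows, self._ops + (("filter", predicate),))

    def select(self, *columns):
        return LazyFrame(self._rows, self._ops + (("select", columns),))

    def _run(self):
        """Apply the recorded operations, in order, to every row."""
        for row in self._rows:
            keep = True
            for kind, arg in self._ops:
                if kind == "filter":
                    if not arg(row):
                        keep = False
                        break
                else:  # "select"
                    row = {name: row[name] for name in arg}
            if keep:
                yield row

    # Actions: trigger the computation.
    def count(self):
        return sum(1 for _ in self._run())

    def show(self):
        for row in self._run():
            print(row)


ROWS = [
    {"churn": 1, "tenure": 13, "cost": 44, "type": 0},
    {"churn": 1, "tenure": 11, "cost": 33, "type": 1},
    {"churn": 0, "tenure": 68, "cost": 52, "type": 1},
    {"churn": 1, "tenure": 33, "cost": 33, "type": 0},
    {"churn": 1, "tenure": 23, "cost": 30, "type": 0},
    {"churn": 0, "tenure": 41, "cost": 39, "type": 0},
]

calls = []  # records every time the predicate is actually evaluated


def is_churned(row):
    calls.append(row["tenure"])
    return row["churn"] == 1


churned = LazyFrame(ROWS).filter(is_churned).select("tenure", "cost")
evaluations_before_action = len(calls)  # no action yet, so nothing has run
n_churned = churned.count()             # the action triggers the pipeline
```

Calling `churned.show()` would run the same recorded operations again and print the selected columns for the churned customers.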

Machine Learning

Settings | Common Methods | Use Cases
Regression | Linear Regression, Generalized Linear Regression, Survival Regression | Customer Churn Analysis
Classification | Logistic Regression, Decision Tree, Random Forest, Naïve Bayes | Sales Leads Prediction
Clustering | K-means, Gaussian Mixture | Customer Segmentation
Collaborative Filtering | User-Based Collaborative Filtering, Item-Based Collaborative Filtering, Alternating Least Squares | Service Recommendation
Text Mining | Sentiment Analysis, Topic Classification | NPS Survey Analysis

Typical Supervised Machine Learning Pipeline

Obtain New Data
•  Many sources: marketing data, user behavior data, social media data, call center data, survey data and so on

Feature Extraction
•  Extract features to represent observations
•  Unsupervised learning

Supervised Learning
•  Train models

Evaluation
•  Determine the quality of the model

Predict
•  Predict on future observations
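The stages above can be sketched end to end in a few lines of plain Python. This is an illustration under stated assumptions: the rows are hypothetical toy values, the feature scaling is arbitrary, and a nearest-centroid classifier stands in for whichever Spark ML model a real project would use.

```python
import math

# Obtain data: hypothetical toy rows of (tenure, cost, churn label).
TRAIN = [(13, 44, 1), (11, 33, 1), (23, 30, 1),
         (68, 52, 0), (41, 39, 0), (60, 48, 0)]
HELD_OUT = [(15, 35, 1), (55, 45, 0)]


def extract_features(row):
    """Feature extraction: map a raw row to a small numeric vector."""
    tenure, cost = row[0], row[1]
    return (tenure / 100.0, cost / 100.0)


def train(rows):
    """Supervised learning: one centroid per label (nearest-centroid model)."""
    sums = {}
    for row in rows:
        label = row[2]
        fx, fy = extract_features(row)
        sx, sy, n = sums.get(label, (0.0, 0.0, 0))
        sums[label] = (sx + fx, sy + fy, n + 1)
    return {label: (sx / n, sy / n) for label, (sx, sy, n) in sums.items()}


def predict(model, features):
    """Predict: return the label whose centroid is closest."""
    return min(model, key=lambda label: math.dist(features, model[label]))


def evaluate(model, rows):
    """Evaluation: accuracy on rows the model has not seen."""
    hits = sum(predict(model, extract_features(r)) == r[2] for r in rows)
    return hits / len(rows)


model = train(TRAIN)
accuracy = evaluate(model, HELD_OUT)
new_prediction = predict(model, extract_features((20, 40)))  # a future observation
```

Keeping each stage as its own function mirrors the pipeline on the slide: any stage can be swapped out without touching the others.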

Business Initiative - Customer Churn Reduction

As part of the efforts to reduce customer churn, IBM is interested in modeling the "time to churn" in order to determine the factors associated with customers who left.

[Figure: timelines for Customers A, B, C and D from register time to now, with churn events marked]

[Figure: AARRR funnel: Acquisition, Activation, Retention, Revenue, Referral]

Customer Churn Prediction via Survival Regression

•  Survival regression: estimates the time to an event such as death, equipment failure or customer churn
•  We cannot show the real data here, so we use toy data with similar properties, which also makes the problem reproducible.

Tenure  Churn  Type  Analytics  Runtimes  Mobile  …  Watson  Boilerplate  Quantity  Cost
13      1      1     14         12        07      …  1       1            28        45
11      1      1     13         06        02      …  0       0            15        24
68      0      1     02         07        02      …  1       1            10        23
33      1      0     13         09        01      …  1       0            3         3
23      1      0     10         19        03      …  1       1            1         1
41      0      1     19         25        04      …  1       0            17        9
…       …      …     …          …         …       …  …       …            …         …

Survival Regression Model - Accelerated Failure Time

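•  T: the churn time for a customer, a random variable with probability density function (PDF):

$$f(t) = \lim_{\Delta t \to 0} \frac{P(t \le T < t + \Delta t)}{\Delta t}$$

•  Cumulative distribution function (CDF):

$$F(t) = P(T \le t)$$

•  Survival function:

$$S(t) = P(T \ge t) = 1 - F(t) = \int_t^{\infty} f(x)\,dx$$

•  Hazard function:

$$h(t) = \lim_{\Delta t \to 0} \frac{P(t \le T < t + \Delta t \mid T \ge t)}{\Delta t}$$

•  Accelerated Failure Time (AFT):

$$\log(T_i) = \beta_0 + \beta_1 z_{i1} + \dots + \beta_p z_{ip} + \sigma \varepsilon_i$$

•  $z_k = 0$ for trial accounts, $z_k = 1$ for paying accounts:

$$S_1(t) = S_2(ct), \quad c = e^{\beta_k}$$

where $S_1$ and $S_2$ are the survival functions for $z_k = 0$ and $z_k = 1$ respectively.

The median survival time of paying accounts is c times that of trial accounts.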
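The scaling property can be checked numerically. The sketch below assumes a Weibull AFT model, in which the error term follows a standard minimum extreme value distribution, so that S(t) = exp(-exp((log t - mu) / sigma)). The coefficient values are hypothetical and chosen only for illustration.

```python
import math

# Hypothetical coefficients, not estimates from any real data.
BETA_0 = 3.0   # intercept
BETA_K = 0.7   # effect of being a paying account (z_k = 1)
SIGMA = 0.5    # scale parameter


def survival(t, z_k):
    """S(t) under a Weibull AFT model with a single binary covariate z_k."""
    mu = BETA_0 + BETA_K * z_k
    return math.exp(-math.exp((math.log(t) - mu) / SIGMA))


def median_survival_time(z_k):
    """Solve S(t) = 0.5 for t: log t = mu + sigma * log(log 2)."""
    mu = BETA_0 + BETA_K * z_k
    return math.exp(mu + SIGMA * math.log(math.log(2)))


c = math.exp(BETA_K)  # acceleration factor

trial_median = median_survival_time(0)
paying_median = median_survival_time(1)

print(f"c = {c:.3f}, ratio of medians = {paying_median / trial_median:.3f}")
print(f"S_1(10) = {survival(10, 0):.4f}, S_2(c * 10) = {survival(c * 10, 1):.4f}")
```

With a positive coefficient, c is greater than 1, so paying accounts reach any given survival probability c times later than trial accounts. The ratio holds for every quantile, not only the median.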

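Maximum Likelihood Estimation of AFT

•  Maximize the likelihood, where $\delta_i = 1$ if churn was observed for customer $i$ and $\delta_i = 0$ if the observation is censored:

$$L(\beta, \sigma; x, \delta) = \prod_{i=1}^{n} [f(x_i; \beta, \sigma)]^{\delta_i} [S(x_i; \beta, \sigma)]^{1 - \delta_i}$$

•  Minimize the negative log-likelihood with a Weibull distribution:

$$-l(\beta, \sigma; x, \delta) = \sum_{i=1}^{n} \left( \delta_i \log\sigma - \delta_i \varepsilon_i + e^{\varepsilon_i} \right), \quad \text{where } \varepsilon_i = \frac{\log t_i - x_i'\beta}{\sigma}$$

•  A convex optimization problem: how to solve it efficiently in Spark?
•  Quasi-Newton method (L-BFGS): approximate the objective function locally as a quadratic without evaluating the second partial derivatives.

$$B_{k+1} = B_k - \frac{(B_k s_k)(B_k s_k)^T}{s_k^T B_k s_k} + \frac{y_k y_k^T}{y_k^T s_k}, \quad s_k = \theta_{k+1} - \theta_k, \quad y_k = g_{k+1} - g_k$$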
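The update formula for B can be exercised directly. The sketch below implements the dense BFGS update in plain Python and checks the secant condition B_{k+1} s_k = y_k, the property that lets B track curvature using gradients only. The quadratic objective and the two iterates are hypothetical. Note that L-BFGS itself never stores B: it keeps only the most recent (s, y) pairs and applies the same update implicitly.

```python
def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def mat_vec(matrix, v):
    return [dot(row, v) for row in matrix]


def bfgs_update(B, s, y):
    """Return B_{k+1} from B_k, the step s_k and the gradient change y_k."""
    Bs = mat_vec(B, s)
    sBs = dot(s, Bs)
    ys = dot(y, s)  # must be positive for the update to stay positive definite
    n = len(s)
    return [[B[i][j] - Bs[i] * Bs[j] / sBs + y[i] * y[j] / ys
             for j in range(n)] for i in range(n)]


# Hypothetical quadratic objective f(theta) = 0.5 * theta' A theta, gradient A theta.
A = [[2.0, 0.0], [0.0, 10.0]]
theta_k = [1.0, 1.0]
theta_next = [0.5, 0.2]

s_k = [b - a for a, b in zip(theta_k, theta_next)]
y_k = [b - a for a, b in zip(mat_vec(A, theta_k), mat_vec(A, theta_next))]

B_k = [[1.0, 0.0], [0.0, 1.0]]         # start from the identity
B_next = bfgs_update(B_k, s_k, y_k)

secant_lhs = mat_vec(B_next, s_k)       # should reproduce y_k
```

Only gradient differences enter the update, so no second partial derivatives are ever evaluated, as the bullet above states.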

Survival Regression on Spark

•  Initialize weights
•  Broadcast weights to executors
•  On each executor: compute likelihood and gradient for each instance, sum them up locally
•  Reduce from executors to get the sum of likelihood and gradient
•  Use L-BFGS to find the next step
•  Final model weights
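The flow above can be sketched in plain Python, with lists standing in for executors. Several things here are assumptions for illustration: the three-partition split, the choice of features (an intercept and the account type), the parameterization by log sigma, and above all the optimizer. A gradient step with backtracking is swapped in for L-BFGS to keep the example short. The rows reuse the Tenure, Churn and Type columns of the toy table.

```python
import math
from functools import reduce

# (tenure, churn indicator delta, account type) from the toy table.
ROWS = [(13, 1, 1), (11, 1, 1), (68, 0, 1), (33, 1, 0), (23, 1, 0), (41, 0, 1)]
PARTITIONS = [ROWS[0:2], ROWS[2:4], ROWS[4:6]]  # one list per "executor"


def local_loss_and_grad(part, weights):
    """Executor side: sum the negative log-likelihood and gradient locally.

    weights = [beta_0, beta_1, log_sigma]; features are [1, type].
    """
    beta, log_sigma = weights[:2], weights[2]
    sigma = math.exp(log_sigma)
    loss, grad = 0.0, [0.0, 0.0, 0.0]
    for tenure, delta, acct_type in part:
        x = (1.0, float(acct_type))
        eps = (math.log(tenure) - sum(b * xi for b, xi in zip(beta, x))) / sigma
        loss += delta * log_sigma - delta * eps + math.exp(eps)
        common = delta - math.exp(eps)
        grad[0] += common * x[0] / sigma
        grad[1] += common * x[1] / sigma
        grad[2] += delta + common * eps   # derivative with respect to log_sigma
    return loss, grad


def combine(a, b):
    """Reduce step: add two (loss, gradient) pairs."""
    return a[0] + b[0], [ga + gb for ga, gb in zip(a[1], b[1])]


def distributed_loss_and_grad(weights):
    # "Broadcast" the weights to every partition, then reduce the partial sums.
    partials = [local_loss_and_grad(part, weights) for part in PARTITIONS]
    return reduce(combine, partials)


def step(weights, lr=1.0):
    """Driver side: gradient step with backtracking (stand-in for L-BFGS)."""
    loss, grad = distributed_loss_and_grad(weights)
    while lr > 1e-12:
        candidate = [w - lr * g for w, g in zip(weights, grad)]
        try:
            candidate_loss = distributed_loss_and_grad(candidate)[0]
        except OverflowError:             # step too large, exp() blew up
            candidate_loss = math.inf
        if candidate_loss < loss:
            return candidate
        lr /= 2
    return weights


weights = [math.log(30.0), 0.0, 0.0]      # initialize weights
initial_loss = distributed_loss_and_grad(weights)[0]
for _ in range(20):
    weights = step(weights)
final_loss = distributed_loss_and_grad(weights)[0]
```

Because the loss and gradient are plain sums over instances, each partition contributes a small pair of numbers regardless of how many rows it holds, so only the weights and these partial sums need to move between driver and executors.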

Demo: Customer Churn Analysis on Spark

•  bluemix: https://console.ng.bluemix.net
•  notebook: https://goo.gl/KgxZn5
•  web application: http://dpda.mybluemix.net

Takeaways

•  Apache Spark helped our team significantly reduce the time from prototype to production
•  Survival analysis is useful for identifying not only who will churn but also when and why they will churn
•  When customers are ranked by predicted survival probability in ascending order, the top 50% of customers capture 80% of churners.
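The capture figure in the last takeaway is computed by ranking customers and counting churners in the top slice. The sketch below shows the calculation on hypothetical scores and labels. These values are made up for illustration and are unrelated to the 80% result reported above.

```python
# Hypothetical (predicted survival probability, churned) pairs.
CUSTOMERS = [(0.10, 1), (0.20, 1), (0.30, 0), (0.35, 1),
             (0.50, 0), (0.60, 1), (0.80, 0), (0.90, 0)]


def capture_rate(customers, top_fraction):
    """Share of all churners found in the lowest-survival top_fraction of customers."""
    ranked = sorted(customers, key=lambda c: c[0])   # ascending survival probability
    cutoff = int(len(ranked) * top_fraction)
    captured = sum(churned for _, churned in ranked[:cutoff])
    total = sum(churned for _, churned in ranked)
    return captured / total


rate = capture_rate(CUSTOMERS, 0.5)
print(f"top 50% of customers capture {rate:.0%} of churners")
```

A model with no predictive power would capture about 50% of churners in the top 50%, so the gap above that baseline is what measures the ranking's usefulness.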

Thank You

