
Apache Spark Model Deployment

Page 1: Apache Spark Model Deployment

Apache Spark™ Model Deployment

Bay Area Spark Meetup – June 30, 2016
Richard Garris – Big Data Solution Architect Focused on Advanced Analytics

Page 2: Apache Spark Model Deployment

About Me

Richard L Garris • [email protected] • @rlgarris [Twitter]

Big Data Solutions Architect @ Databricks
12+ years designing Enterprise Data Solutions for everyone from startups to Global 2000
Prior Work Experience: PwC, Google, Skytree
Ohio State Buckeye and CMU Alumni

Page 3: Apache Spark Model Deployment

About Apache Spark MLlib

Started at Berkeley AMPLab (Apache Spark 0.8)

Now (Apache Spark 2.0)
• Contributions from 75+ orgs, ~250 individuals
• Development driven by Databricks: roadmap + 50% of PRs
• Growing coverage of distributed algorithms

[Diagram: Spark core with SparkSQL, Streaming, MLlib, and GraphFrames libraries]

Page 4: Apache Spark Model Deployment

MLlib Goals

General Machine Learning library for big data
• Scalable & robust
• Coverage of common algorithms
• Leverages Apache Spark

Tools for practical workflows

Integration with existing data science tools


Page 5: Apache Spark Model Deployment

Apache Spark MLlib
• spark.mllib
• The original API, predating Spark 1.4
• spark.mllib is a lower-level library that uses Spark RDDs
• Uses LabeledPoint, Vectors and Tuples
• In maintenance mode only, as of Spark 2.x

 

import org.apache.spark.mllib.linalg.Vectors
import org.apache.spark.mllib.regression.{LabeledPoint, LinearRegressionWithSGD}

// Load and parse the data
val data = sc.textFile("data/mllib/ridge-data/lpsa.data")
val parsedData = data.map { line =>
  val parts = line.split(',')
  LabeledPoint(parts(0).toDouble, Vectors.dense(parts(1).split(' ').map(_.toDouble)))
}.cache()

// Build the model
val numIterations = 100
val stepSize = 0.00000001
val model = LinearRegressionWithSGD.train(parsedData, numIterations, stepSize)

// Evaluate the model on training examples and compute the training error
val valuesAndPreds = parsedData.map { point =>
  val prediction = model.predict(point.features)
  (point.label, prediction)
}
val MSE = valuesAndPreds.map { case (v, p) => math.pow(v - p, 2) }.mean()

 

Page 6: Apache Spark Model Deployment

Apache Spark – ML Pipelines
• spark.ml
• Spark 1.4+
• spark.ml pipelines are able to create more complex models
• Integrated with DataFrames

// Let's initialize our linear regression learner
val lr = new LinearRegression()

// Now we set the parameters for the method
lr.setPredictionCol("Predicted_PE")
  .setLabelCol("PE")
  .setMaxIter(100)
  .setRegParam(0.1)

// We will use the new spark.ml pipeline API. If you have worked with
// scikit-learn this will be very familiar.
val lrPipeline = new Pipeline()
lrPipeline.setStages(Array(vectorizer, lr))

// Let's first train on the entire dataset to see what we get
val lrModel = lrPipeline.fit(trainingSet)
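The snippet above references a `vectorizer` and a `trainingSet` defined earlier in the notebook. A self-contained sketch of the same pattern, assuming a hypothetical power-plant DataFrame with feature columns `AT`, `V`, `AP`, `RH` and label column `PE` (requires a running Spark environment):

```scala
import org.apache.spark.ml.Pipeline
import org.apache.spark.ml.feature.VectorAssembler
import org.apache.spark.ml.regression.LinearRegression

// Assemble the raw numeric columns into a single features vector
val vectorizer = new VectorAssembler()
  .setInputCols(Array("AT", "V", "AP", "RH"))
  .setOutputCol("features")

val lr = new LinearRegression()
  .setFeaturesCol("features")
  .setLabelCol("PE")
  .setPredictionCol("Predicted_PE")
  .setMaxIter(100)
  .setRegParam(0.1)

// The fitted PipelineModel captures featurization + model together,
// which is what makes it a deployable unit.
val lrPipeline = new Pipeline().setStages(Array(vectorizer, lr))
val lrModel = lrPipeline.fit(trainingSet) // trainingSet: DataFrame
```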

Page 7: Apache Spark Model Deployment

The Agile Modeling Process

Set Business Goals → Understand Your Data → Create Hypothesis → Devise Experiment → Prepare Data → Train-Tune-Test Model → Deploy Model → Measure / Evaluate Results (and repeat)

Page 8: Apache Spark Model Deployment

The Agile Modeling Process

Set Business Goals → Understand Your Data → Create Hypothesis → Devise Experiment → Prepare Data → Train-Tune-Test Model → Deploy Model → Measure / Evaluate Results (and repeat)

Focus of this talk

Page 9: Apache Spark Model Deployment

What is a Model?


Page 10: Apache Spark Model Deployment

But What Really is a Model?
A model is a complex pipeline of components:
• Data Sources
• Joins
• Featurization Logic
• Algorithm(s)
• Transformers
• Estimators
• Tuning Parameters

Page 11: Apache Spark Model Deployment

ML Pipelines

A very simple pipeline: Load data → Extract features → Train model → Evaluate

Page 12: Apache Spark Model Deployment

ML Pipelines

A real pipeline:
• Datasource 1, Datasource 2 and Datasource 3 feed parallel Extract features steps
• Feature transform 1 → Feature transform 2 → Feature transform 3
• Train model 1 and Train model 2 → Ensemble → Evaluate

Page 13: Apache Spark Model Deployment

Why ML persistence?

Data Science: Prototype (Python/R) → Create model
Software Engineering: Re-implement model for production (Java) → Deploy model

Page 14: Apache Spark Model Deployment

Why ML persistence?

Data Science: Prototype (Python/R) → Create Pipeline
• Extract raw features
• Transform features
• Select key features
• Fit multiple models
• Combine results to make prediction

Software Engineering: Re-implement Pipeline for production (Java) → Deploy Pipeline
• Extra implementation work
• Different code paths
• Synchronization overhead

Page 15: Apache Spark Model Deployment

With ML persistence...

Data Science: Prototype (Python/R) → Create Pipeline

Persist the model or Pipeline: model.save("s3n://...")

Load the Pipeline (Scala/Java): Model.load("s3n://...") → Deploy in production
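In Spark 2.0 the save/load calls above are part of the ML persistence API; a sketch of the full round trip (the S3 path and variable names are illustrative):

```scala
import org.apache.spark.ml.PipelineModel

// Persist the fitted pipeline; Spark writes pipeline metadata as JSON
// and model data as Parquet under this path.
lrModel.write.overwrite().save("s3n://my-bucket/models/lr-pipeline")

// In the production job (Scala or Java), load it back and score new data.
// Because the pipeline includes featurization, newData only needs the
// raw input columns.
val loaded = PipelineModel.load("s3n://my-bucket/models/lr-pipeline")
val scored = loaded.transform(newData) // newData: DataFrame
```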

Page 16: Apache Spark Model Deployment

Demo

Model Serialization in Apache Spark 2.0 using Parquet

Page 17: Apache Spark Model Deployment

What are the Requirements for a Robust Model Deployment System?

Page 18: Apache Spark Model Deployment

Your Model Scoring Environment

Customer SLAs
• Response time
• Throughput (predictions per second)
• Uptime / Reliability

Tech Stack
• C / C++
• Legacy (mainframe)
• Java
• Docker

Page 19: Apache Spark Model Deployment

Model Scoring: Offline vs Online

Offline
• Internal use (batch)
• Emails, notifications (batch)
• Schedule-based or event-trigger based

Online
• Customer waiting on the response (human real-time)
• Super low-latency with a fixed response window (transactional fraud, ad bidding)

Page 20: Apache Spark Model Deployment

Not All Models Return a Yes / No

Model Scoring Considerations

Example: Login Bot Detector
Different behavior depending on probability score:
☞ 0.0–0.4: Allow login
☞ 0.4–0.6: Challenge question
☞ 0.6–0.75: Send SMS
☞ 0.75–0.9: Refer to agent
☞ 0.9–1.0: Block

Example: Item Recommendations
Output is a ranking of the top n items:
• API – send user ID + number of items
• Return sorted set of items to recommend
• Optional – pass context-sensitive information to tailor results
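The score bands above amount to a small decision table; a minimal pure-Scala sketch of such a scoring policy, using the thresholds from the slide (the names like `loginAction` are illustrative, not from the talk):

```scala
// Map a bot-probability score to a login action, using the
// thresholds from the slide.
sealed trait Action
case object AllowLogin extends Action
case object ChallengeQuestion extends Action
case object SendSms extends Action
case object ReferToAgent extends Action
case object Block extends Action

def loginAction(score: Double): Action = score match {
  case s if s < 0.4  => AllowLogin
  case s if s < 0.6  => ChallengeQuestion
  case s if s < 0.75 => SendSms
  case s if s < 0.9  => ReferToAgent
  case _             => Block
}
```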

Page 21: Apache Spark Model Deployment

Model Updates and Versioning

• Model update frequency (nightly, weekly, monthly, quarterly)
• Model version tracking
• Model release process: Dev ‣ Test ‣ Staging ‣ Production
• Model update process
  • Benchmark (or Shadow Models)
  • Phase-In (20% traffic)
  • Big Bang

Page 22: Apache Spark Model Deployment

Model Governance

• Models can have both reward and risk for the business
  – Well designed models prevent fraud, reduce churn, increase sales
  – Poorly designed models increase fraud, can damage the company's brand, and cause compliance violations or other risks
• Models should be governed by the company's policies and procedures, laws and regulations, and the organization's management goals

Considerations
• Models have to be transparent, explainable, traceable and interpretable for auditors / regulators
• Models may need reason codes for rejections (e.g. if I decline someone credit, why?)
• Models should have an approval and release process
• Models also cannot violate any discrimination laws or use features that could be traced to religion, gender, or ethnicity

Page 23: Apache Spark Model Deployment

Model A/B Testing

Set Business Goals → Understand Your Data → Create Hypothesis → Devise Experiment → Prepare Data → Train-Tune-Test Model → Deploy Model → Measure / Evaluate Results (and repeat)

• A/B testing – comparing two versions to see which performs better
• Historical data works for evaluating models in testing, but production experiments are required to validate the model hypothesis
• The A/B framework should support the steps of the model update process: Benchmark (or Shadow Models), Phase-In (20% traffic), Big Bang
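One common way to implement the "Phase-In (20% traffic)" step is deterministic routing: hash a stable user ID into buckets so each user consistently sees the same model variant during the experiment. A minimal sketch (the bucket scheme is an illustrative assumption, not from the talk):

```scala
// Route a fixed percentage of users to the challenger model.
// Hashing the stable user ID keeps the assignment consistent
// across requests, which A/B measurement depends on.
def useNewModel(userId: String, percentToNewModel: Int): Boolean = {
  val bucket = math.abs(userId.hashCode % 100) // bucket in 0..99
  bucket < percentToNewModel
}
```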

Page 24: Apache Spark Model Deployment

Model Monitoring

• Monitoring is the process of observing the model's performance, logging its behavior, and alerting when the model degrades
• Logging should capture exactly the data fed into the model at the time of scoring
• Model alerting is critical to detect unusual or unexpected behaviors
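The points above can be sketched as a thin wrapper around any scoring function: log exactly the features that were scored, and raise an alert when the recent score distribution drifts outside an expected band (the band, window, and class name are illustrative assumptions):

```scala
// Wrap a scoring function so every call logs the exact inputs it saw,
// and flag drift when the mean of recent scores leaves an expected band.
class MonitoredScorer(
    score: Vector[Double] => Double,
    expectedMeanLow: Double,
    expectedMeanHigh: Double) {

  private val log = scala.collection.mutable.ArrayBuffer.empty[(Vector[Double], Double)]

  def predict(features: Vector[Double]): Double = {
    val s = score(features)
    log += ((features, s)) // log exactly what was fed to the model
    s
  }

  // Alert if the mean of the last `window` scores has drifted
  // outside the expected band.
  def driftAlert(window: Int): Boolean = {
    val recent = log.takeRight(window).map(_._2)
    if (recent.isEmpty) false
    else {
      val mean = recent.sum / recent.size
      mean < expectedMeanLow || mean > expectedMeanHigh
    }
  }
}
```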

Page 25: Apache Spark Model Deployment

Open Loop vs Closed Loop
• Open loop – a human being is involved
• Closed loop – no human involved

Model Scoring – almost always closed loop; some models alert agents or customer service
Model Training – usually open loop, with a data scientist in the loop to update the model

Page 26: Apache Spark Model Deployment

Online Learning

• Closed-loop, entirely machine-driven modeling is risky
• Needs proper model monitoring and safeguards to prevent abuse / sensitivity to noise
• MLlib supports online learning through streaming models (k-means and logistic regression support online updates)
• Alternative – use a more complex model to better fit new data rather than using online learning
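MLlib's streaming models update their parameters as each micro-batch arrives; a sketch with StreamingKMeans (the DStreams are assumed to be built elsewhere, e.g. from a Kafka or socket source, and the parameter values are illustrative):

```scala
import org.apache.spark.mllib.clustering.StreamingKMeans

// trainingStream and testStream: DStream[Vector] parsed upstream
val model = new StreamingKMeans()
  .setK(3)                  // number of clusters
  .setDecayFactor(1.0)      // 1.0 weights all history equally; <1.0 forgets old data
  .setRandomCenters(4, 0.0) // 4-dimensional features, initial center weight 0.0

model.trainOn(trainingStream)       // cluster centers update per batch
model.predictOn(testStream).print() // score the incoming stream
```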

Page 27: Apache Spark Model Deployment

Model Deployment Architectures

Page 28: Apache Spark Model Deployment

Architecture #1: Offline Recommendations

Nightly batch: Train ALS Model → Ranked Offers → Save Offers to NoSQL
From the NoSQL store: Send Offers to Customers; Display Ranked Offers in Web / Mobile
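The nightly-batch step above can be sketched with MLlib's RDD-based ALS, the standard API at the time of the talk; the NoSQL write is indicated only as a comment, and the parameter values are illustrative:

```scala
import org.apache.spark.mllib.recommendation.{ALS, Rating}

// ratings: RDD[Rating] of (user, product, rating) built from the batch input
val model = ALS.train(ratings, rank = 10, iterations = 10, lambda = 0.01)

// Precompute the top-10 ranked offers for every user in the nightly job
val topOffers = model.recommendProductsForUsers(10)

// topOffers: RDD[(Int, Array[Rating])] — write each user's ranked list
// to the NoSQL store (e.g. Redis, as in the demo) for the web/mobile
// tier to read at display time.
```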

Page 29: Apache Spark Model Deployment

Architecture #2: Precomputed Features with Streaming

Web Logs → Spark Streaming → Pre-compute Features → Features → Kill User's Login Session

Page 30: Apache Spark Model Deployment

Architecture #3: Local Apache Spark™

Train Model in Spark → Save Model to S3 / HDFS → Copy Model to Production
New Data → Run Spark Local → Predictions

Page 31: Apache Spark Model Deployment

Demo

• Example of Offline Recommendations using ALS and Redis as a NoSQL Cache

Page 32: Apache Spark Model Deployment

Try Databricks Community Edition

Page 33: Apache Spark Model Deployment

2016 Apache Spark Survey


Page 34: Apache Spark Model Deployment

Spark Summit EU Brussels

October 25-27

The CFP closes at 11:59pm on July 1st

For more information and to submit:

https://spark-summit.org/eu-2016/


