Apache® Spark™ MLlib: From Quick Start to Scikit-Learn


Joseph K. Bradley
February 24th, 2016

About the speaker: Joseph Bradley

Joseph Bradley is a Software Engineer and Apache Spark Committer working on MLlib at Databricks. Previously, he was a postdoc at UC Berkeley after receiving his Ph.D. in Machine Learning from Carnegie Mellon U. in 2013. His research included probabilistic graphical models, parallel sparse regression, and aggregation mechanisms for peer grading in MOOCs.


About the moderator: Denny Lee

Denny Lee is a Technology Evangelist with Databricks; he is a hands-on data science engineer with more than 15 years of experience developing internet-scale infrastructure, data platforms, and distributed systems for both on-premises and cloud environments. Prior to joining Databricks, Denny worked as a Senior Director of Data Sciences Engineering at Concur and was part of the incubation team that built Hadoop on Windows and Azure (currently known as HDInsight).


We are Databricks, the company behind Apache Spark

Founded by the creators of Apache Spark in 2013

Share of Spark code contributed by Databricks in 2014: 75%



Created Databricks on top of Spark to make big data simple.

Apache Spark Engine: Spark Core, with Spark Streaming, Spark SQL, MLlib, and GraphX on top

• Unified engine across diverse workloads & environments
• Scale out, fault tolerant
• Python, Java, Scala, and R APIs
• Standard libraries

Notable users that presented at Spark Summit 2015 San Francisco

Source: Slide 5 of Spark Community Update

Machine Learning: What and Why?

What: ML uses data to identify patterns and make decisions.

Why: The core value of ML is automated decision making.
• Especially important when dealing with TB or PB of data

Many use cases, including:
• Marketing and advertising optimization
• Security monitoring / fraud detection
• Operational optimizations

Why Spark MLlib

Provide general purpose ML algorithms on top of Spark
• Hide complexity of distributing data & queries, and scaling
• Leverage Spark improvements (DataFrames, Tungsten, Datasets)

Advantages of MLlib's design:
• Simplicity
• Scalability
• Streamlined end-to-end
• Compatibility

Spark scales well

Largest cluster: 8,000 nodes (Tencent)

Largest single job: 1 PB (Alibaba, Databricks)

Top streaming intake: 1 TB/hour (HHMI Janelia Farm)

2014 on-disk sort record: fastest open source engine for sorting a PB

Machine Learning highlights

Source: Why you should use Spark for Machine Learning

Source: Toyota Customer 360 Insights on Apache Spark and MLlib

Performance
• Original batch job: 160 hours
• Same job re-written using Apache Spark: 4 hours

ML task
• Prioritize incoming social media in real time using Spark MLlib (differentiate campaign, feedback, product feedback, and noise)
• ML life cycle: extract features and train
  • V1: 56% accuracy -> V9: 82% accuracy
  • Remove false positives and semantic analysis (similarity between concepts)

Example analysis: Population vs. housing price

Links
• Simplifying Machine Learning with Databricks Blog Post
• Population vs. Price Multi-chart Spark SQL Notebook
• Population vs. Price Linear Regression Python Notebook
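The snippets on the next slides operate on a DataFrame named `data` that the deck never constructs. A minimal sketch of what it could look like (an assumption, using the Spark 1.6-era API with an assumed `sqlContext` and made-up placeholder values; spark.ml expects a vector-valued features column and a numeric label column):

from pyspark.mllib.linalg import Vectors  # Vector type used by spark.ml in 1.6

# Hypothetical (population, price) pairs standing in for the real dataset
raw = [(2.4e5, 1.2e5), (8.1e5, 2.3e5), (1.6e6, 3.1e6)]
data = sqlContext.createDataFrame(
    [(Vectors.dense([pop]), float(price)) for (pop, price) in raw],
    ["features", "label"])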

Scatterplot

import numpy as np
import matplotlib.pyplot as plt

x = data.map(lambda p: (p.features[0])).collect()
y = data.map(lambda p: (p.label)).collect()

from pandas import *
from ggplot import *

pydf = DataFrame({'pop': x, 'price': y})
p = ggplot(pydf, aes('pop', 'price')) + \
    geom_point(color='blue')
display(p)

Linear Regression with SGD: Define and Build Models

# Import LinearRegression class
from pyspark.ml.regression import LinearRegression

# Define LinearRegression model
lr = LinearRegression()

# Build two models with different regularization parameters
modelA = lr.fit(data, {lr.regParam: 0.0})
modelB = lr.fit(data, {lr.regParam: 100.0})
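Not shown in the deck, but a quick sanity check on the two fits is to print the fitted parameters (a sketch against the Spark 1.6 API, where `coefficients` replaced the older `weights`):

# Inspect intercept & coefficients of each fitted model
print("ModelA: intercept = %s, coefficients = %s"
      % (modelA.intercept, modelA.coefficients))
print("ModelB: intercept = %s, coefficients = %s"
      % (modelB.intercept, modelB.coefficients))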

Linear Regression with SGD: Make Predictions

# Make predictions
predictionsA = modelA.transform(data)
display(predictionsA)

Linear Regression with SGD: Evaluate the Models

from pyspark.ml.evaluation import RegressionEvaluator

evaluator = RegressionEvaluator(metricName="mse")
MSE = evaluator.evaluate(predictionsA)
print("ModelA: Mean Squared Error = " + str(MSE))

# The ModelB output below implies the same evaluation on modelB's predictions
predictionsB = modelB.transform(data)
print("ModelB: Mean Squared Error = " + str(evaluator.evaluate(predictionsB)))

ModelA: Mean Squared Error = 16538.4813081
ModelB: Mean Squared Error = 16769.2917636
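The plot on the next slide reads prediction columns 'predA' and 'predB' from the local pandas DataFrame `pydf`, which the deck never shows being filled. A minimal sketch (an assumption, reusing `pydf`, `modelA`, and `modelB` from above; row order is stable here because these jobs involve no shuffle):

# Pull each model's predictions back to the driver and attach them to pydf
pydf['predA'] = [row.prediction for row in
                 modelA.transform(data).select("prediction").collect()]
pydf['predB'] = [row.prediction for row in
                 modelB.transform(data).select("prediction").collect()]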

Scatterplot with the Regression Models plotted

p = ggplot(pydf, aes('pop', 'price')) + \
    geom_point(color='blue') + \
    geom_line(pydf, aes('pop', 'predA'), color='red') + \
    geom_line(pydf, aes('pop', 'predB'), color='green') + \
    scale_x_log10() + scale_y_log10()
display(p)

Learning more about MLlib

Guides & examples
• Example workflow using ML Pipelines (Python)
• Power plant data analysis workflow (Scala)
• The above 2 links are part of the Databricks Guide, which contains many more examples and references.

References
• Apache Spark MLlib User Guide
  • The MLlib User Guide contains code snippets for almost all algorithms, as well as links to API documentation.
• Meng et al. "MLlib: Machine Learning in Apache Spark." 2015. http://arxiv.org/abs/1505.06807 (academic paper)


Combining the Strengths of MLlib, scikit-learn, & R


Great libraries → business investment
• Education
• Tooling & workflows

Big Data


Scaling (trees)

• Topic model on 4.5 million Wikipedia articles
• Recommendation with 50 million users, 5 million songs, 50 billion ratings

Big Data & MLlib

• More data → higher accuracy
• Scale with business (# users, available data)
• Integrate with production systems


Bridging the gap

How do you get from a single-machine workload to a distributed one?


At school: Machine Learning with R on my laptop

The Goal: Machine Learning on a huge computing cluster

Wish list

• Run original code on a production environment
• Use distributed data sources
• Distribute ML workload piece by piece
• Use familiar algorithms & APIs


Our task


Sentiment analysis: given a review (text), predict the user's rating.

Data from https://snap.stanford.edu/data/web-Amazon.html

Our ML workflow


[Workflow diagram: Text ("This scarf I bought is very strange. When I ...") with Label (Rating = 3.0)
→ Tokenizer → Words [This, scarf, I, bought, ...]
→ Hashing Term-Freq → Features [2.0, 0.0, 3.0, ...]
→ Linear Regression → Prediction (Rating = 2.7)]
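A minimal sketch of this workflow as a spark.ml Pipeline (assuming a DataFrame `training` with `text` and `label` columns; stage names mirror the diagram):

from pyspark.ml import Pipeline
from pyspark.ml.feature import Tokenizer, HashingTF
from pyspark.ml.regression import LinearRegression

tokenizer = Tokenizer(inputCol="text", outputCol="words")
hashingTF = HashingTF(inputCol="words", outputCol="features")
lr = LinearRegression()
pipeline = Pipeline(stages=[tokenizer, hashingTF, lr])

model = pipeline.fit(training)           # fits all three stages in order
predictions = model.transform(training)  # adds a 'prediction' column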

Our ML workflow


[Diagram: Cross Validation wraps the Feature Extraction + Linear Regression pipeline, searching over the regularization parameter {0.0, 0.1, ...}]
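In spark.ml terms, the picture above is a CrossValidator wrapped around the pipeline. A sketch reusing the `pipeline`, `lr`, and `training` objects assumed earlier:

from pyspark.ml.tuning import CrossValidator, ParamGridBuilder
from pyspark.ml.evaluation import RegressionEvaluator

grid = ParamGridBuilder().addGrid(lr.regParam, [0.0, 0.1, 1.0]).build()
cv = CrossValidator(estimator=pipeline,
                    estimatorParamMaps=grid,
                    evaluator=RegressionEvaluator(metricName="rmse"),
                    numFolds=3)
cvModel = cv.fit(training)  # cvModel.bestModel holds the best pipeline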

Cross validation

31

Cross Validation

...

Best Linear Regression

Linear Regression #1

Linear Regression #2

Feature Extraction

Linear Regression #3

Cross validation

32

Cross Validation

...

Best Linear Regression

Linear Regression #1

Linear Regression #2

Feature Extraction

Linear Regression #3

Distribute cross validation

[Diagram: the same cross validation, with the candidate Linear Regression fits (#1, #2, #3, ...) trained in parallel across the cluster]
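For a single-machine scikit-learn workload, the spark-sklearn package used in this demo distributes the same idea: its GridSearchCV is a drop-in replacement whose candidate fits run as Spark tasks. A sketch under those assumptions (`sc` is a live SparkContext; the SGDRegressor and grid values are illustrative, not from the deck):

import numpy as np
from sklearn.linear_model import SGDRegressor
from spark_sklearn import GridSearchCV

X = np.random.rand(100, 3)  # placeholder local training data
y = np.random.rand(100)

gs = GridSearchCV(sc, SGDRegressor(),
                  param_grid={'alpha': [0.0001, 0.001, 0.01]})
gs.fit(X, y)  # each parameter setting is trained on a Spark worker
best = gs.best_estimator_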

Repeating this at home

This demo used:
• Spark 1.6
• spark-sklearn (on Spark Packages) (on PyPI)

The notebooks from the demo are available here:
• sklearn integration
• MLlib + sklearn: Distribute Everything!

The Amazon Reviews data20K and test4K datasets were created, and can be used within databricks-datasets, with permission from Professor Julian McAuley @ UCSD. Source: "Image-based recommendations on styles and substitutes." J. McAuley, C. Targett, J. Shi, A. van den Hengel. SIGIR, 2015.


Integrations we mentioned

Data sources
• Spark DataFrames: conversions between pandas (local data) & Spark (distributed data) (see the sketch after this list)
• MLlib: conversions between scipy & MLlib data types

Model selection / tuning
• spark-sklearn: automatically distribute cross-validation

Python API
• MLlib: distributed learning algorithms with familiar APIs
• spark-sklearn: conversions between scikit-learn & MLlib models
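A minimal sketch of the pandas / Spark DataFrame conversions listed above (Spark 1.6-era API, assuming a SQLContext `sqlContext`):

import pandas as pd

pdf = pd.DataFrame({'pop': [2.4e5, 8.1e5], 'price': [1.2e5, 2.3e5]})
spark_df = sqlContext.createDataFrame(pdf)  # local pandas -> distributed Spark
pdf_back = spark_df.toPandas()              # distributed Spark -> local pandas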


Integrations with R

DataFrames
• Conversions between R (local) & Spark (distributed)
• SQL queries from R


model <- glm(Sepal_Length ~ Sepal_Width + Species,
             data = df, family = "gaussian")

head(filter(df, df$waiting < 50))
##   eruptions waiting
## 1     1.750      47
## 2     1.750      47
## 3     1.867      48

API for calling MLlib algorithms from R
• Linear & logistic regression supported in Spark 1.6
• More algorithms in development

Learning more about integrations

Python, pandas & scikit-learn
• spark-sklearn documentation and blog post
• Spark DataFrame Python API & pandas conversions
• Databricks Guide on using scikit-learn and other libraries with Spark

R
• Spark R API User Guide (DataFrames & ML)
• Databricks Guide: Spark R overview + docs & examples for each function

TensorFlow on Apache Spark (Deep Learning in Python)
• Blog post explaining how to run TensorFlow on top of Spark, with example code


MLlib roadmap highlights

Workflow
• Simplify building and customizing ML Pipelines.

Key models
• Improve inspection for generalized linear models (linear & logistic regression).

Language APIs
• Support Pipeline persistence (saving & loading Pipelines and Models) in the Python API.

Spark 2.0 Roadmap JIRA: https://issues.apache.org/jira/browse/SPARK-12626

More resources

• Databricks Guide
• Apache Spark User Guide
• Databricks Community Forum
• Training courses: public classes, MOOCs, & private training
• Databricks Community Edition: free hosted Apache Spark. Join the waitlist for the beta release!


Thanks!