Joseph Bradley, Software Engineer, Databricks Inc. at MLconf SEA - 5/01/15

Date post: 15-Jul-2015
Spark DataFrames and ML Pipelines Joseph K. Bradley May 1, 2015 MLconf Seattle
Transcript
Page 1

Spark DataFrames and ML Pipelines

Joseph K. Bradley

May 1, 2015

MLconf Seattle

Page 2

Who am I?

Joseph K. Bradley

Ph.D. in ML from CMU, postdoc at Berkeley

Apache Spark committer

Software Engineer @ Databricks Inc.

Page 3

Databricks Inc.

Founded by the creators of Spark & driving its development

Databricks Cloud: the best place to run Spark

Guess what… we’re hiring!

databricks.com/company/careers

Page 4

Spark

Distributed computing engine

• Built for speed, ease of use, and sophisticated analytics

• Apache open source

SparkSQL, Streaming, MLlib, GraphX

Concise APIs in Python, Java, Scala… and R in Spark 1.4!

500+ enterprises using or planning to use Spark in production (blog)

Page 5

Beyond Hadoop

Early adopters (Data) Engineers

MapReduce & functional API

Data Scientists & Statisticians

Page 6

Spark for Data Science

DataFrames

Intuitive manipulation of distributed structured data

Machine Learning Pipelines

Simple construction and tuning of ML workflows

Page 7

[Figure: Google Trends for “dataframe”]

Page 8

DataFrames

Data grouped into named columns

dept   age   name
Bio    48    H Smith
CS     54    A Turing
Bio    43    B Jones
Chem   61    M Kennedy

RDD API

DataFrame API

Page 9

DataFrames

Data grouped into named columns

dept   age   name
Bio    48    H Smith
CS     54    A Turing
Bio    43    B Jones
Chem   61    M Kennedy

DSL for common tasks

• Project, filter, aggregate, join, …

• Metadata

• UDFs
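
The project, filter, and aggregate operations named above can be illustrated without Spark. What follows is a minimal plain-Python sketch over the four example rows, not the Spark DataFrame API; the helper names `project`, `where`, and `avg_by` are hypothetical and exist only for this illustration.

```python
from collections import defaultdict

# The four example rows from the slide, one dict per row.
rows = [
    {"dept": "Bio", "age": 48, "name": "H Smith"},
    {"dept": "CS", "age": 54, "name": "A Turing"},
    {"dept": "Bio", "age": 43, "name": "B Jones"},
    {"dept": "Chem", "age": 61, "name": "M Kennedy"},
]


def project(table, columns):
    """Project: keep only the named columns of each row."""
    return [{c: row[c] for c in columns} for row in table]


def where(table, predicate):
    """Filter: keep only the rows for which predicate(row) is true."""
    return [row for row in table if predicate(row)]


def avg_by(table, key, value):
    """Aggregate: group rows by `key` and average `value` per group."""
    groups = defaultdict(list)
    for row in table:
        groups[row[key]].append(row[value])
    return {k: sum(v) / len(v) for k, v in groups.items()}


# Names of everyone older than 45, then the average age per department.
names_over_45 = project(where(rows, lambda r: r["age"] > 45), ["name"])
avg_age = avg_by(rows, "dept", "age")
```

This sketch runs eagerly on a local list; a distributed DataFrame expresses the same steps by column name and leaves the execution strategy to the engine.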

Page 10

Spark DataFrames

API inspired by R and Python Pandas

• Python, Scala, Java (+ R in dev)

• Pandas integration

Distributed DataFrame

Highly optimized

Page 11

Spark DataFrames are fast

[Bar chart: runtime of aggregating 10 million int pairs (secs), axis 0 to 10, with bars for RDD Scala, RDD Python, Spark Scala DF, and Spark Python DF and an arrow marking the “better” direction]

Uses SparkSQL Catalyst optimizer

Page 12

Demo: DataFrames

Page 13

Spark for Data Science

DataFrames

• Structured data

• Familiar API based on R & Python Pandas

• Distributed, optimized implementation

Machine Learning Pipelines

Simple construction and tuning of ML workflows

Page 14

About Spark MLlib

Started @ Berkeley

• Spark 0.8

Now (Spark 1.3)

• Contributions from 50+ orgs, 100+ individuals

• Growing coverage of distributed algorithms

Spark

SparkSQL Streaming MLlib GraphX

Page 15

About Spark MLlib

Classification

• Logistic regression

• Naive Bayes

• Streaming logistic regression

• Linear SVMs

• Decision trees

• Random forests

• Gradient-boosted trees

Regression

• Ordinary least squares

• Ridge regression

• Lasso

• Isotonic regression

• Decision trees

• Random forests

• Gradient-boosted trees

• Streaming linear methods

Statistics

• Pearson correlation

• Spearman correlation

• Online summarization

• Chi-squared test

• Kernel density estimation

Linear algebra

• Local dense & sparse vectors & matrices

• Distributed matrices

• Block-partitioned matrix

• Row matrix

• Indexed row matrix

• Coordinate matrix

• Matrix decompositions

Frequent itemsets

• FP-growth

Model import/export

Clustering

• Gaussian mixture models

• K-Means

• Streaming K-Means

• Latent Dirichlet Allocation

• Power Iteration Clustering

Recommendation

• Alternating Least Squares

Feature extraction & selection

• Word2Vec

• Chi-Squared selection

• Hashing term frequency

• Inverse document frequency

• Normalizer

• Standard scaler

• Tokenizer

Page 16

ML Workflows are complex

Image classification pipeline*

* Evan Sparks. “ML Pipelines.” amplab.cs.berkeley.edu/ml-pipelines

• Specify pipeline

• Inspect & debug

• Re-run on new data

• Tune parameters

Page 17

Example: Text Classification

Goal: Given a text document, predict its topic.

Features

Subject: Re: Lexan Polish?

Suggest McQuires #1 plastic polish. It will help somewhat but nothing will remove deep scratches without making it worse than it already is. McQuires will do something...

Label

1: about science

0: not about science

Dataset: “20 Newsgroups”, from UCI KDD Archive

Page 18

ML Workflow

Load data → Extract features → Train model → Evaluate

Page 19

Load Data

Load data → Extract features → Train model → Evaluate

Data sources for DataFrames: built-in and external

{ JSON }, JDBC, and more …

Page 20

Load Data

Load data → Extract features → Train model → Evaluate

Current data schema

label: Int

text: String

Page 21

Extract Features

Load data → Extract features → Train model → Evaluate

Current data schema

label: Int

text: String

Page 22

Extract Features

Load data → Tokenizer → Hashed Term Freq. → Train model → Evaluate

Transformer: Tokenizer, Hashed Term Freq.

Current data schema

label: Int

text: String

words: Seq[String]

features: Vector
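
The two feature-extraction steps above (split text into words, then hash the words into a fixed-length count vector) can be sketched in plain Python. This is an illustration of the hashing-term-frequency idea under stated assumptions, not Spark's implementation: the function names are hypothetical, `zlib.crc32` is a stand-in hash chosen only because it is deterministic across runs, and the label value is illustrative.

```python
import zlib


def tokenize(text):
    """Lowercase the text and split it on whitespace."""
    return text.lower().split()


def hashed_term_freq(words, num_features):
    """Count words per hash bucket, giving a vector of length num_features."""
    vector = [0] * num_features
    for word in words:
        # Assumption: crc32 stands in for whatever hash a real system uses.
        index = zlib.crc32(word.encode("utf-8")) % num_features
        vector[index] += 1
    return vector


# Start from the (label, text) schema and append the two new columns.
row = {"label": 0, "text": "Suggest McQuires #1 plastic polish"}  # label is illustrative
row["words"] = tokenize(row["text"])
row["features"] = hashed_term_freq(row["words"], num_features=100)
```

Distinct words can collide in the same bucket, which is the price paid for a fixed vector length that needs no vocabulary pass over the data.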

Page 23

Train a Model

Load data → Tokenizer → Hashed Term Freq. → Logistic Regression → Evaluate

Transformer: Tokenizer, Hashed Term Freq.

Estimator: Logistic Regression

Current data schema

label: Int

text: String

words: Seq[String]

features: Vector

prediction: Int

Page 24

Evaluate the Model

Load data → Tokenizer → Hashed Term Freq. → Logistic Regression → Evaluate

Transformer: Tokenizer, Hashed Term Freq.

Estimator: Logistic Regression

Evaluator: Evaluate

Current data schema

label: Int

text: String

words: Seq[String]

features: Vector

prediction: Int

By default, always append new columns

Can go back & inspect intermediate results

Made efficient by DataFrame optimizations

Page 25

ML Pipelines

Load data → Tokenizer → Hashed Term Freq. → Logistic Regression → Evaluate

Pipeline

Test data → Tokenizer → Hashed Term Freq. → Logistic Regression → Evaluate

Re-run exactly the same way
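
The idea of fitting a chain of stages once and then re-running it unchanged on test data can be sketched as follows. This is a toy plain-Python model of the Transformer / Estimator / Pipeline roles, not Spark's API: `WordCount`, `ThresholdEstimator`, and `ThresholdModel` are hypothetical stand-ins for the feature extractor and logistic regression, and each stage appends a column rather than overwriting one.

```python
class Tokenizer:
    """Transformer: appends a `words` column."""

    def transform(self, rows):
        return [dict(row, words=row["text"].lower().split()) for row in rows]


class WordCount:
    """Transformer (toy feature extractor): appends a numeric `features` column."""

    def transform(self, rows):
        return [dict(row, features=len(row["words"])) for row in rows]


class ThresholdModel:
    """Fitted model, itself a Transformer: appends a `prediction` column."""

    def __init__(self, threshold):
        self.threshold = threshold

    def transform(self, rows):
        return [dict(row, prediction=int(row["features"] > self.threshold))
                for row in rows]


class ThresholdEstimator:
    """Estimator (toy stand-in for a learner): fit() returns a model."""

    def fit(self, rows):
        mean = sum(row["features"] for row in rows) / len(rows)
        return ThresholdModel(mean)


class PipelineModel:
    """A fitted pipeline: every stage is a Transformer."""

    def __init__(self, stages):
        self.stages = stages

    def transform(self, rows):
        for stage in self.stages:
            rows = stage.transform(rows)
        return rows


class Pipeline:
    """Fits estimators in order, feeding each stage the previous stage's output."""

    def __init__(self, stages):
        self.stages = stages

    def fit(self, rows):
        fitted = []
        for stage in self.stages:
            if hasattr(stage, "fit"):
                stage = stage.fit(rows)  # an Estimator becomes a model
            rows = stage.transform(rows)
            fitted.append(stage)
        return PipelineModel(fitted)


train = [{"label": 1, "text": "a b c d"}, {"label": 0, "text": "a"}]
test = [{"label": 1, "text": "x y z"}]

model = Pipeline([Tokenizer(), WordCount(), ThresholdEstimator()]).fit(train)
scored = model.transform(test)  # test data goes through exactly the same stages
```

Because each stage only appends columns, `scored` still carries `text`, `words`, and `features` next to `prediction`, so intermediate results remain available for inspection.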

Page 26

Parameter Tuning

Tokenizer → Hashed Term Freq. → Logistic Regression → Evaluate

hashingTF.numFeatures: {100, 1000, 10000}

lr.regParam: {0.01, 0.1, 0.5}

CrossValidator

Given:

• Estimator

• Parameter grid

• Evaluator

Find best parameters
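
The search described above can be sketched in plain Python: expand the grid into every parameter combination, score each combination on k train/validation splits, and keep the best mean score. This is an illustrative sketch, not Spark's CrossValidator; `expand_grid`, `k_fold_indices`, `cross_validate`, and especially `toy_fit_and_score` are hypothetical, and the toy scorer is a stand-in for "fit the estimator, then apply the evaluator".

```python
from itertools import product

# The grid from the slide: 3 x 3 = 9 candidate settings.
param_grid = {
    "hashingTF.numFeatures": [100, 1000, 10000],
    "lr.regParam": [0.01, 0.1, 0.5],
}


def expand_grid(grid):
    """Return one dict per combination of parameter values."""
    names = list(grid)
    return [dict(zip(names, values))
            for values in product(*(grid[name] for name in names))]


def k_fold_indices(n_rows, k):
    """Yield (train_indices, validation_indices) for each of k folds."""
    for fold in range(k):
        validation = [i for i in range(n_rows) if i % k == fold]
        train = [i for i in range(n_rows) if i % k != fold]
        yield train, validation


def cross_validate(rows, grid, k, fit_and_score):
    """Return the parameter dict with the highest mean validation score."""
    best_params, best_score = None, float("-inf")
    for params in expand_grid(grid):
        scores = []
        for train_idx, val_idx in k_fold_indices(len(rows), k):
            train = [rows[i] for i in train_idx]
            validation = [rows[i] for i in val_idx]
            scores.append(fit_and_score(params, train, validation))
        mean_score = sum(scores) / len(scores)
        if mean_score > best_score:
            best_params, best_score = params, mean_score
    return best_params, best_score


def toy_fit_and_score(params, train, validation):
    """Hypothetical scorer that ignores the data and peaks at (1000, 0.1)."""
    return (-abs(params["lr.regParam"] - 0.1)
            - abs(params["hashingTF.numFeatures"] - 1000) / 1e6)


candidates = expand_grid(param_grid)
best_params, best_score = cross_validate(list(range(6)), param_grid, 3,
                                         toy_fit_and_score)
```

The cost grows with the product of the grid sizes and the number of folds (here 9 × 3 = 27 fits), which is why the grid is usually kept small.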

Page 27

Demo: ML Pipelines

Page 28

Recap

DataFrames

• Structured data

• Familiar API based on R & Python Pandas

• Distributed, optimized implementation

Machine Learning Pipelines

• Integration with DataFrames

• Familiar API based on scikit-learn

• Simple parameter tuning

Composable & DAG Pipelines

Schema validation

User-defined Transformers & Estimators

Page 29

Looking Ahead

DataFrames

• Further optimization

• API for R

ML Pipelines

• More algorithms & pluggability

• API for R

Collaborations with UC Berkeley & others

• Auto-tuning models

Page 30

Thank you!

Spark documentation: spark.apache.org

Pipelines blog post: databricks.com/blog/2015/01/07

DataFrames blog post: databricks.com/blog/2015/02/17

Databricks Cloud Platform: databricks.com/product

Spark MOOCs on edX: Intro to Spark & ML with Spark

Spark Packages: spark-packages.org

