
Date post: 20-May-2020
Scalable Machine Learning with Dask Tom Augspurger Data Scientist
Transcript
Page 1: Scalable Machine Learning with Dask - Tom …Scalable Machine Learning Dask-ML Scaling Pains 15 Scaling Pains 16 Scaling Pains 17 Scaling Pains 18 Distributed Scikit-Learn Distributed

Scalable Machine Learning with Dask

Tom Augspurger

Data Scientist


© 2018 Anaconda, Inc. All Rights Reserved. 

Hi! I’m Tom

• I work at Anaconda on dask, pandas, & ML things


Machine Learning Workflow

It’s not just .fit(X, y)


Machine Learning Workflow

• Understanding the problem, objectives
• Reading from data sources
• Exploratory analysis
• Data cleaning
• Modeling
• Deployment and reporting


The Numeric Python Ecosystem

[Figure: diagram of the numeric Python ecosystem (including Bokeh), from Jake Vanderplas's PyCon 2017 keynote]


Machine Learning Workflow

A rich ecosystem of tools

But they don’t scale well


Dask

Parallelizing the Numeric Python Ecosystem


Dask

High Level: Parallel Pandas, NumPy, Scikit-Learn

Low Level: High performance task scheduling


Dask

High-Level: Scalable Pandas DataFrames

import dask.dataframe as dd

df = dd.read_csv('s3://abc/*.csv')
df.groupby(df.name).value.mean()


Dask

High-Level: Scalable NumPy Arrays

import dask.array as da

x = da.random.random(…)
y = x.dot(x.T) - x.mean(axis=0)
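A Dask array is a grid of NumPy chunks that are processed block by block. The sketch below runs the slide's expression on a small array and checks it against NumPy, assuming dask and numpy are installed. The shape and chunk sizes are arbitrary choices, since the slide elides them. A square array is used so that the column means broadcast against `x.dot(x.T)`.

```python
import numpy as np
import dask.array as da

# Arbitrary square input; the slide does not specify a shape.
rng = np.random.default_rng(0)
x_np = rng.random((400, 400))

# Wrap the NumPy array as a 4 x 4 grid of 100 x 100 chunks.
x = da.from_array(x_np, chunks=(100, 100))

# Same expression as on the slide; still lazy at this point.
y = x.dot(x.T) - x.mean(axis=0)

# Trigger the blocked computation and compare with plain NumPy.
y_np = y.compute()
expected = x_np.dot(x_np.T) - x_np.mean(axis=0)
print(np.allclose(y_np, expected))
```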


Dask


Low-Level: Scalable, Fine-Grained Task Scheduling
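One way to reach the task scheduler directly is `dask.delayed`, which turns ordinary function calls into nodes of a task graph. The slide does not name this interface, so treat the sketch below as one possible illustration. The functions `inc` and `add` are hypothetical.

```python
from dask import delayed


@delayed
def inc(x):
    # Hypothetical unit of work.
    return x + 1


@delayed
def add(x, y):
    # Hypothetical task that depends on two upstream tasks.
    return x + y


# Calling the functions records tasks instead of running them.
a = inc(1)
b = inc(2)
total = add(a, b)

# The scheduler runs inc(1) and inc(2), possibly in parallel, then add.
result = total.compute()
print(result)
```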


Demo


Dask

• Parallelizes libraries like NumPy, Pandas, and Scikit-Learn
• Adapts to custom algorithms with a flexible task scheduler
• Scales from a laptop to thousands of computers
• Integrates easily: pure Python, built from standard technology


Scalable Machine Learning

Dask-ML


Scaling Pains


Distributed Scikit-Learn


Distributed Scikit-Learn

Large models, smaller datasets

• Use dask to distribute computation on a cluster


Distributed Scikit-Learn


Single-Machine Parallelism with Scikit-Learn

from sklearn.ensemble import RandomForestClassifier

clf = RandomForestClassifier(n_estimators=200, n_jobs=-1)

clf.fit(X, y)


Distributed Scikit-Learn


Multi-Machine Parallelism with Dask

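from sklearn.ensemble import RandomForestClassifier
from sklearn.externals import joblib  # removed in scikit-learn 0.23; newer code uses `import joblib`
import dask_ml.joblib  # registers the "dask" joblib backend

clf = RandomForestClassifier(n_estimators=200, n_jobs=-1)

# Send X and y to the workers once, then fit the trees on the cluster
with joblib.parallel_backend("dask", scatter=[X, y]):
    clf.fit(X, y)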
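On recent library versions the same pattern is usually written with the standalone `joblib` package and a `dask.distributed` client. The sketch below assumes joblib, distributed, and scikit-learn are installed. The synthetic dataset and the in-process local cluster are stand-ins for real data and a real multi-machine cluster.

```python
import joblib
from dask.distributed import Client
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Assumption: a local, thread-based cluster stands in for a real one.
client = Client(processes=False)

# Synthetic data that comfortably fits in memory.
X, y = make_classification(n_samples=500, n_features=10, random_state=0)

clf = RandomForestClassifier(n_estimators=20, n_jobs=-1, random_state=0)

# Route scikit-learn's joblib calls to the Dask cluster; X and y are
# scattered to the workers up front instead of being sent with each task.
with joblib.parallel_backend("dask", scatter=[X, y]):
    clf.fit(X, y)

client.close()
print(len(clf.estimators_))
```

To target an existing cluster, pass the scheduler address to `Client(...)` instead of starting a local one.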


Distributed Scikit-Learn

Caveats

• Data has to fit in RAM
• Data shipped to each worker
• Each parallel task should be expensive
• There should be many parallel tasks


Scalable Algorithms

When your dataset is larger than RAM


First: Do you need all the data?

• Sampling may be OK
• "Plotting Learning Curves" from the scikit-learn docs
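A learning curve shows whether more data would actually help. If the validation score has flattened out, a sample may be enough. The sketch below uses scikit-learn's `learning_curve` on a synthetic dataset. The dataset, the estimator, and the training sizes are all illustrative choices.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import learning_curve

# Synthetic stand-in for a dataset that might be worth subsampling.
X, y = make_classification(n_samples=2000, n_features=20, random_state=0)

# Score the model at 5 training-set sizes using 5-fold cross-validation.
train_sizes, train_scores, valid_scores = learning_curve(
    LogisticRegression(max_iter=1000),
    X,
    y,
    train_sizes=np.linspace(0.1, 1.0, 5),
    cv=5,
)

# Average over folds; a plateau suggests extra data adds little.
mean_valid = valid_scores.mean(axis=1)
for n, score in zip(train_sizes, mean_valid):
    print(f"{n:5d} samples -> validation accuracy {score:.3f}")
```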


Second: Parallel Meta-estimators


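from dask_ml.wrappers import ParallelPostFit
from sklearn.svm import SVC
import dask.dataframe as dd

# Train on a subset that fits in memory
clf = ParallelPostFit(SVC())
clf.fit(X_small, y_small)

# Predict for the large dataset, block by block and in parallel
X_large = dd.read_parquet("s3://abc/*.parq")
y_large = clf.predict(X_large)

• Train on subset
• Predict for large dataset, in parallel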


Scalable Estimators

When the training dataset is larger than RAM

• Scikit-Learn wasn’t designed for distributed datasets
• Dask-ML implements scalable variants of some estimators
• Works well with Dask DataFrames & Arrays


Scalable, Parallel Algorithms


Spectral Clustering Comparison


Scalable, Parallel Algorithms

Some Notable Estimators

• Distributed GLM
  • LogisticRegression, LinearRegression, ...
• Clustering
  • KMeans(init='k-means||'), SpectralClustering, ...
• Preprocessing
  • QuantileTransformer, RobustScaler, ...
• Dimensionality Reduction
  • PCA, TruncatedSVD
• ...


Familiar

Works well with existing libraries


Familiar

• Dask-ML estimators are Scikit-Learn estimators
• Dask-ML pipelines are Scikit-Learn Pipelines


>>> from sklearn.pipeline import make_pipeline
>>> from sklearn.preprocessing import FunctionTransformer

>>> pipe = make_pipeline(
...     ColumnSelector(columns),
...     HourExtractor(['Trip_Pickup_DateTime']),
...     FunctionTransformer(payment_lowerer, validate=False),
...     Categorizer(categories),
...     DummyEncoder(),
...     StandardScaler(scale),
...     LogisticRegression(),
... )

Scikit-Learn objects

Custom transformers

Dask-ML estimators

Full Example: https://git.io/vAi7C
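Custom transformers such as `ColumnSelector` fit into a pipeline because they follow the scikit-learn estimator interface. The sketch below is a guess at what such a transformer could look like, not the code from the linked example. It uses plain pandas and scikit-learn, with made-up column names and data.

```python
import numpy as np
import pandas as pd
from sklearn.base import BaseEstimator, TransformerMixin
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler


class ColumnSelector(BaseEstimator, TransformerMixin):
    """Hypothetical transformer that keeps only the given columns."""

    def __init__(self, columns):
        self.columns = columns

    def fit(self, X, y=None):
        # Stateless: nothing to learn from the data.
        return self

    def transform(self, X):
        # Column indexing has the same spelling for pandas and Dask frames.
        return X[self.columns]


# Made-up data standing in for the taxi-trip columns.
rng = np.random.default_rng(0)
df = pd.DataFrame(
    {
        "distance": rng.random(200) * 10,
        "fare": rng.random(200) * 50,
        "vendor": ["x"] * 200,
    }
)
y = (df["fare"] > 25).astype(int)

pipe = make_pipeline(
    ColumnSelector(["distance", "fare"]),
    StandardScaler(),
    LogisticRegression(),
)
pipe.fit(df, y)
preds = pipe.predict(df)
print(preds.shape)
```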


Distributed Systems

Integrate with XGBoost and TensorFlow


Distributed System

Peer with systems like XGBoost or TensorFlow

>>> import dask.dataframe as dd
>>> import dask_ml.xgboost as xgb
>>> df = dd.read_csv("trips*.csv")
>>> y = df['Tip_Amt'] > 0
>>> X = df[columns]
>>> booster = xgb.train(
...     client, params, X, y
... )


Dask & Dask-ML

• Parallelizes libraries like NumPy, Pandas, and Scikit-Learn
• Scales from a laptop to thousands of computers
• Familiar API and in-memory computation

• https://dask.pydata.org


Questions?

