Scaling RAPIDS with Dask
Matthew Rocklin, Systems Software Manager
GTC San Jose 2019
Transcript
Page 1:

Scaling RAPIDS with Dask

Matthew Rocklin, Systems Software Manager
GTC San Jose 2019

Page 2:

PyData is Pragmatic, but Limited

The PyData Ecosystem
• NumPy: Arrays
• Pandas: DataFrames
• Scikit-Learn: Machine Learning
• Jupyter: Interaction
• … (many other projects)

Is well loved
• Easy to use
• Broadly taught
• Community governed

But sometimes slow
• Single CPU core
• In-memory data

How do we accelerate an existing software stack?

Page 3:

95% of the time, PyData is great

5% of the time, you want more performance
(and you can ignore the rest of this talk)

Page 4:

Scale up and out with RAPIDS and Dask

(Diagram with two axes: "Scale up / Accelerate" vertically, "Scale out / Parallelize" horizontally.)

PyData: NumPy, Pandas, Scikit-Learn, and many more
Single CPU core, in-memory data

RAPIDS and Others (accelerated on a single GPU):
NumPy -> CuPy/PyTorch/…
Pandas -> cuDF
Scikit-Learn -> cuML
Numba -> Numba

Dask (multi-core and distributed PyData):
NumPy -> Dask Array
Pandas -> Dask DataFrame
Scikit-Learn -> Dask-ML
… -> Dask Futures

Dask + RAPIDS: multi-GPU, on a single node (DGX) or across a cluster

Page 5:

Scale up and out with RAPIDS and Dask

(Same diagram as Page 4, highlighting the scale-up path: PyData accelerated on a single GPU by RAPIDS.)

Page 6:

RAPIDS: GPU variants of PyData libraries

• NumPy -> CuPy, PyTorch, TensorFlow
  • Array computing
  • Mature due to the deep learning boom
  • Also useful for other domains
  • An obvious fit for GPUs

• Pandas -> cuDF
  • Tabular computing
  • New development
  • Parsing, joins, groupbys
  • Not an obvious fit for GPUs

• Scikit-Learn -> cuML
  • Traditional machine learning
  • Somewhere in between

Pages 7-9:

(Same slide as Page 6, repeated.)

Page 10:

Scale up and out with RAPIDS and Dask

(Same diagram as Page 4.)

Page 11:

Scale up and out with RAPIDS and Dask

(Same diagram, highlighting the scale-out path: PyData parallelized across cores and clusters by Dask.)

Page 12:

Dask Parallelizes PyData Natively

• PyData native
  • Built on top of NumPy, Pandas, Scikit-Learn, … (easy to migrate)
  • With the same APIs (easy to train)
  • With the same developer community (well trusted)

• Scales
  • Scales out to thousand-node clusters
  • Easy to install and use on a laptop

• Popular
  • The most common parallelism framework today at PyData and SciPy conferences

• Deployable (see the sketch below)
  • HPC: SLURM, PBS, LSF, SGE
  • Cloud: Kubernetes
  • Hadoop/Spark: YARN
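For instance, the HPC route can look like the following minimal sketch using the dask-jobqueue project; the queue name and resource sizes are illustrative assumptions:

import dask.distributed
from dask_jobqueue import SLURMCluster   # PBS, LSF, and SGE variants also exist

# Describe what one SLURM-managed Dask worker looks like
cluster = SLURMCluster(cores=8, memory='32GB', queue='normal')
cluster.scale(10)                        # ask the job scheduler for ten such workers

client = dask.distributed.Client(cluster)   # Dask work now runs on the SLURM allocation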

Page 13:

Parallel NumPy
For imaging, simulation analysis, machine learning

● Same API as NumPy

● One Dask Array is built from many NumPy arrays
  Either lazily fetched from disk
  Or distributed throughout a cluster
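A minimal sketch of the idea (the shape and chunk size are arbitrary):

import dask.array as da

# One logical array, backed by many 1000x1000 NumPy chunks
x = da.random.random((20000, 20000), chunks=(1000, 1000))

y = x + x.T                 # familiar NumPy expressions, built lazily
y.mean(axis=0).compute()    # compute() triggers parallel execution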

Page 14:

Parallel Pandas
For ETL, time series, data munging

● Same API as Pandas

● One Dask DataFrame is built from many Pandas DataFrames
  Either lazily fetched from disk
  Or distributed throughout a cluster
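A minimal sketch of the same idea for DataFrames (the column names are made up):

import pandas as pd
import dask.dataframe as dd

pdf = pd.DataFrame({'x': range(1000), 'y': range(1000)})

# Split one Pandas DataFrame into four Pandas partitions managed by Dask
ddf = dd.from_pandas(pdf, npartitions=4)

ddf.y.mean().compute()    # same API as Pandas, executed in parallel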

Page 15:

Parallel Scikit-Learn
For Hyper-Parameter Optimization, Random Forests, ...

● Same API

● Same exact code, just wrapped in a context manager
  ● Replaces the default ThreadPool execution with Dask, allowing scaling onto clusters

● Available in most Scikit-Learn algorithms where joblib is used
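A minimal sketch of that pattern, using joblib's Dask backend (the estimator and parameter grid are placeholders):

import joblib
from dask.distributed import Client
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV

client = Client()    # here a local cluster; could be a distributed one

X, y = make_classification(n_samples=1000)
search = GridSearchCV(RandomForestClassifier(),
                      {'n_estimators': [50, 100, 200]},
                      n_jobs=-1)

# Route Scikit-Learn's internal joblib calls onto the Dask cluster
with joblib.parallel_backend('dask'):
    search.fit(X, y)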

Page 16:

(Same slide as Page 15.)

Page 17:

Parallel Python
For custom systems, ML algorithms, workflow engines

● Parallelize existing codebases
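A minimal sketch using dask.delayed (the functions are trivial stand-ins for real code):

import dask

@dask.delayed
def inc(x):
    return x + 1

@dask.delayed
def add(x, y):
    return x + y

# Ordinary-looking Python now builds a task graph instead of running eagerly
results = [add(inc(i), inc(i + 1)) for i in range(100)]
total = dask.delayed(sum)(results)

total.compute()    # execute the whole graph in parallel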

Page 18:

(Same slide as Page 17, adding an example application: M. Tepper, G. Sapiro, "Compressed nonnegative matrix factorization is fast and accurate", IEEE Transactions on Signal Processing, 2016.)

Page 19:

Dask Connects Python users to Hardware

(Diagram: a user's code on one side, executing on distributed hardware on the other.)

Page 20:

Dask Connects Python users to Hardware

The user writes high-level code (NumPy/Pandas/Scikit-Learn), Dask turns it into a task graph, and the graph executes on distributed hardware.
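A small sketch of that pipeline: the first two lines are the high-level code, visualize() renders the task graph Dask built (it requires graphviz), and compute() executes it:

import dask.array as da

x = da.ones((15, 15), chunks=(5, 5))   # high-level, NumPy-like code
y = (x + x.T).sum()                    # ...becomes a task graph

y.visualize(filename='graph.png')      # inspect the graph
y.compute()                            # execute on threads, processes, or a cluster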

Page 21:

Example: Dask + Pandas on NYC Taxi
We see how well New Yorkers tip

import dask.dataframe as dd

df = dd.read_csv('gcs://bucket-name/nyc-taxi-*.csv',
                 parse_dates=['pickup_datetime', 'dropoff_datetime'])

df2 = df[(df.tip_amount > 0) & (df.fare_amount > 0)]
df2['tip_fraction'] = df2.tip_amount / df2.fare_amount

hour = df2.groupby(df2.pickup_datetime.dt.hour).tip_fraction.mean()
hour.compute().plot(figsize=(10, 6), title='Tip Fraction by Hour')

Page 22:

Try live: examples.dask.org

Page 23:

Dask scales PyData libraries

(A good fit if you're building a new data science platform)

But it is agnostic to how those libraries compute

Page 24:

Scale up and out with RAPIDS and Dask

(Same diagram, showing the PyData, RAPIDS, and Dask quadrants.)

Page 25:

Scale up and out with RAPIDS and Dask

(Same diagram as Page 4, now adding the Dask + RAPIDS quadrant: multi-GPU, on a single node (DGX) or across a cluster.)

Page 26:

Combine Dask with cuDF
Many GPU DataFrames form a distributed DataFrame
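A minimal sketch via the dask_cudf module (the file path is a placeholder):

import dask_cudf

# One distributed DataFrame; each partition is a cuDF DataFrame on a GPU
gdf = dask_cudf.read_csv('nyc-taxi-*.csv')
gdf.head()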

Page 27:

(Same slide as Page 26; the diagram labels each partition as a cuDF DataFrame.)

Page 28:

Combine Dask with CuPy
Many GPU arrays form a distributed GPU array
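A minimal sketch of a CuPy-backed Dask array (shapes are arbitrary; asarray=False keeps the chunks as CuPy arrays instead of converting them to NumPy):

import cupy
import dask.array as da

x = cupy.random.random((4000, 4000))

# Chunk the GPU array into a 4x4 grid of CuPy blocks
dx = da.from_array(x, chunks=(1000, 1000), asarray=False)

(dx + dx.T).sum().compute()   # executes on the GPU, returns a CuPy scalar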

Page 29:

(Same slide as Page 28; the diagram labels each chunk as a GPU array.)

Page 30:

Experiments...

● SVD with Dask Array
● NYC Taxi with Dask DataFrame

Page 31:

So what works in DataFrames?

Lots!

● Read CSV
● Elementwise operations
● Reductions
● Groupby aggregations
● Joins (hash, sorted, large-to-small)

Leverages Dask DataFrame algorithms (been around for years)
API matches Pandas
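For example, the same Dask DataFrame expressions run whether the partitions are Pandas or cuDF; a sketch with made-up data (swap dd.from_pandas for a dask_cudf DataFrame to run it on GPUs):

import pandas as pd
import dask.dataframe as dd

df = dd.from_pandas(
    pd.DataFrame({'fare_amount': [5.0, 7.5, 0.0],
                  'tip_amount': [1.0, 2.0, 0.0],
                  'passenger_count': [1, 2, 1]}),
    npartitions=2)

df = df[df.fare_amount > 0]                                  # elementwise filter
total = df.fare_amount.sum()                                 # reduction
by_count = df.groupby('passenger_count').tip_amount.mean()   # groupby aggregation
print(total.compute(), by_count.compute())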

Page 32:

So what doesn't work?

● Read Parquet/ORC
● Reductions
● Groupby aggregations
● Rolling window operations

Page 33:

So what doesn't work?

API Alignment
• When cuDF and Pandas match, existing Dask algorithms work seamlessly
• But the APIs don't always match

Page 34:

(Same slide as Page 33.)

Page 35:

So what works in Arrays?

We genuinely don't know yet

• This work is much younger, but moving quickly

• CuPy has been around for a while and is fairly mature

• Most work today is happening upstream, in NumPy and Dask
  Thanks to Peter Entschev, Hameer Abbasi, Stephan Hoyer, Marten van Kerkwijk, Eric Wieser

• The ecosystem approach benefits other NumPy-like arrays as well: sparse arrays, Xarray, ...

Page 36:

So what's next?

• High-performance communication
  • Today Dask uses in-memory or TCP transports
  • For InfiniBand and NVLink, we are now integrating OpenUCX via ucx-py

• Spilling to main memory (see the sketch below)
  • Today Dask spills from memory to disk
  • For GPUs, we'd like to spill from device, to host, to disk

• Mixing CPU and GPU workloads
  • Today Dask runs one thread per core, or one thread per GPU
  • For mixed systems we need to auto-annotate GPU vs. CPU tasks

• Better recipes for deployment
  • Today Dask deploys on Kubernetes, HPC job schedulers, and YARN
  • These technologies also support GPU workloads
  • We need better examples using the two together

Lots of issues with Dask, too!
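One concrete piece of that roadmap is the dask-cuda project (assumed here): its LocalCUDACluster starts one Dask worker per local GPU, and its device_memory_limit option triggers spilling from device to host memory. A minimal sketch; the 4 GB figure is an illustrative assumption:

from dask.distributed import Client
from dask_cuda import LocalCUDACluster

# One Dask worker per local GPU; spill device memory to host past ~4 GB
cluster = LocalCUDACluster(device_memory_limit='4GB')
client = Client(cluster)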

Page 37:

Learn More

PyData: pydata.org
RAPIDS: rapids.ai
Dask: dask.org
Examples: examples.dask.org

Thank you for your time

