Spark and GPUs - Meetup (files.meetup.com/18712511/nvresearch-spark-20160407_final.pdf)

M. Naumov, J. Daw, V. Ditya, A. Fit-Florea and S. Migacz

Spark and GPUs

04/07/2016

2

Key Issues that Need to Be Addressed

Data: contiguous memory layout

Code: intercept compute-intensive calls; compile Java bytecode to PTX

Job Placement: awareness of nodes with and without GPUs; different GPU configurations

3

Data

Contiguous memory layout: Java Unsafe API, Java NIO buffers

Keep track of where the data is: reuse it on the CPU/GPU instead of always copying

And more: data layout in memory, UVM, …

4

Code

Intercept compute-intensive calls: wrap library calls (using JNI, jCUDA, SWIG, …)

Key question: which algorithms are important?

Compile Java bytecode to PTX: likely limits the functions you can write,

but maybe enough for the majority of users

5

Job Placement

Awareness of nodes with/without GPUs: by all schedulers, such as Mesos, YARN, …

Different configurations:

multiple processes per GPU

multiple GPUs per process

processes with memory requirements larger than the memory of the GPU(s)

6

Spark Language Interfaces: PyCUDA, SWIG, JNI, MLLib with NVBLAS

7

Python CUDA Bindings (PyCUDA)

# CUDA kernel
mod = SourceModule("""
__global__ void vector_add(float *a, float *b, float *c)
{
    int i = threadIdx.x;
    c[i] = a[i] + b[i];
}
""")

# CUDA run
vector_add = mod.get_function("vector_add")
vector_add(drv.In(a), drv.In(b), drv.Out(c), block=(2,1,1), grid=(3,1,1))

8

Caveat I:

Must be able to serialize/deserialize (Java) or pickle/unpickle (Python) the lambda/closure/function supplied to Spark operations, such as map.

In practice, this often means the function must be “self-contained”.

Caveat II:

Currently, there is a lot of overhead in PyCUDA, which seems to include compiling the CUDA kernel at Spark runtime.

Caveat III:

Currently, there is no way to leave and reuse data on the GPU

Nikolai Sakharnykh,

Spark - Python + PyCUDA

9

Python-C/C++ Interface Generation Tool (SWIG)

# Python
def test_add(n):
    x = [numpy.float64(i+1) for i in range(n)]
    y = [numpy.float64(10*(i+1)) for i in range(n)]
    e,r = mn.add(len(x),len(x),x,y)

/* C/C++ */
int add(int n, double *r, double *x, double *y)
{
    for(int i=0; i<n; i++){ r[i]=x[i]+y[i]; }
    return 1;
}

SWIG generates the Python object layer code: preamble, C/C++ function call, postamble.

10

Python-C/C++ Interface Generation Tool (SWIG)

Code Example (typemaps for variables):

%define tmp2c_v(type, name)

#define PyType_AsType PyType_AsType_##type

%typemap(in) (type name) {

$1 = PyType_AsType($input);

}

#undef PyType_AsType

%enddef

11

Similar to PyCUDA, but does not compile code on the fly.

Allows easier wrapping of CUDA library calls

Careful with data returned in arrays

Careful with names across multiple library calls

(they are all treated using the same rules)

SWIG can also generate interfaces to other languages (for example, Java using JNI).

Nikolai Sakharnykh,

Spark - Python + SWIG

12

Can be used for Scala-C/C++ Interface

Java Native Interface (JNI)

// Scala
class Binding {
  @native def iArrayMethod(a: Array[Int]): Int
}
object Test extends App {
  System.loadLibrary("Binding")
  val b = new Binding
  val sum = b.iArrayMethod(Array(1, 2, 3))
  …
}

// C/C++
JNIEXPORT jint JNICALL Java_Binding_iArrayMethod
(JNIEnv* env, jobject obj, jintArray array) {
  int sum = 0;
  jsize len = (*env)->GetArrayLength(env, array);
  jint* x = (*env)->GetIntArrayElements(env, array, 0);
  for (int i = 0; i < len; i++) { sum += x[i]; }
  (*env)->ReleaseIntArrayElements(env, array, x, 0);
  return sum;
}

13

Spark - Scala + JNI

Similar to SWIG, but using JNI instead of the Python object layer.

Allows easier wrapping of CUDA library calls

Careful with arrays (GetIntArrayElements might make extra copies)

We have integrated these bindings into the Spark Maven project, and they are accessible from any class.

14

MLLib: Spark Machine Learning Library

Allows the use of native BLAS libraries (such as Intel MKL)

NVBLAS: plug-and-play; intercepts host BLAS level-3 calls

Offloads computation to CUBLAS when beneficial

Supports multiple GPUs

Designed to support preloading (no need to even recompile the code)

Spark – MLLib + NVBLAS

15

Investigation of Spark Operators: Basics, Prefix Sum, All-to-All

16

Existing Operators

map, flatMap, mapPartitions[WithIndex], zip[WithIndex], union, intersection, filter, sortBy[Key], partitionBy, reduce, …

(these fall into transforms, actions, and shuffles)

Code Example:

>>> rdd = sc.parallelize([1, 2, 3, 4], 2)
>>> res = rdd.reduce(lambda x,y: x+y)
>>> print(res)
10

1 + 2 + 3 + 4 = 10

17

Motivation for New Operators

Many algorithms are not easily expressed with existing operators.

Consider sparse matrix-vector multiplication, y = A x (the matrix A in CSR format is represented by the arrays Ap, Ac and Av).

It is a standard benchmark for HPC. It is also used in the Power method to compute the PageRank of web pages.
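As a point of reference for what the operators below need to reproduce, here is a minimal serial CSR SpMV sketch in plain Python (0-based indexing here, whereas the slides use 1-based; the array names Ap, Ac, Av follow the slides):

```python
# Serial sparse matrix-vector multiply y = A*x with A in CSR format.
# Ap: row pointers, Ac: column indices, Av: values (0-based indexing).
def spmv_csr(Ap, Ac, Av, x):
    n = len(Ap) - 1
    y = [0.0] * n
    for i in range(n):
        for k in range(Ap[i], Ap[i + 1]):
            y[i] += Av[k] * x[Ac[k]]
    return y

# The 4x4 example matrix from the storage-format slide (shifted to 0-based):
Ap = [0, 1, 3, 4, 7]
Ac = [0, 0, 1, 2, 0, 2, 3]
Av = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0]
print(spmv_csr(Ap, Ac, Av, [1.0, 1.0, 1.0, 1.0]))  # [1.0, 5.0, 4.0, 18.0]
```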

18

Sparse Matrix Storage Formats

Dense 4x4 example (1-based row/column indices):

[ 1.0   0    0    0  ]
[ 2.0  3.0   0    0  ]
[  0    0   4.0   0  ]
[ 5.0   0   6.0  7.0 ]

Coordinate (COO), entries in row-major order:

Row Index: 1   2   2   3   4   4   4
Col Index: 1   1   2   3   1   3   4
Values:    1.0 2.0 3.0 4.0 5.0 6.0 7.0

Compressed Sparse Row (CSR):

Ap: 1   2   4   5   8
Ac: 1   1   2   3   1   3   4
Av: 1.0 2.0 3.0 4.0 5.0 6.0 7.0

Compressed Sparse Column (CSC), entries traversed in column-major order (note the values are reordered accordingly):

Ar: 1   2   4   2   3   4   4
Ap: 1   4   5   7   8
Av: 1.0 2.0 5.0 3.0 4.0 6.0 7.0
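As a quick sanity check of the formats above, the dense matrix can be rebuilt from the COO triplets in plain Python (a small sketch, 1-based indices as on the slide):

```python
# Rebuild the dense 4x4 example matrix from its COO triplets (1-based, as above).
rows = [1, 2, 2, 3, 4, 4, 4]
cols = [1, 1, 2, 3, 1, 3, 4]
vals = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0]
dense = [[0.0] * 4 for _ in range(4)]
for r, c, v in zip(rows, cols, vals):
    dense[r - 1][c - 1] = v   # shift to 0-based for Python lists
print(dense)
```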

19

Partitioning the matrix

Split A row-wise into blocks, so that y = A x becomes

y1 = A1 x   (A1 given by Ap1, Ac1, Av1)
y2 = A2 x   (A2 given by Ap2, Ac2, Av2)

• Partition Arrays
• Insert (at index)
• Compute prefix sum
• Broadcast/Collect
• Numeric Operations

20

numElements (per partition)

def getNumElements(self):
    return self.map(lambda x: 1).reduce(lambda x,y: x+y)

def getNumLocalElements(self):
    return self.mapPartitions(lambda p: [sum(1 for x in p)])

Code Example:

>>> rdd = sc.parallelize([1, 2, 3, 4], 2)
>>> ne = rdd.getNumElements()        # 4 (single number, same as count())
>>> nle = rdd.getNumLocalElements()  # [[2], [2]] (an RDD)

1 + 1 + 1 + 1 = 4
[1 + 1], [1 + 1] = [2], [2]
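The behavior of the two operators can be modeled in plain Python by treating an RDD as a list of partitions (a sketch; the function names are ours, not part of the Spark API):

```python
# Model an RDD as a list of partitions (plain Python lists).
def num_elements(parts):
    # global count: map each element to 1, then reduce with +
    return sum(1 for p in parts for _ in p)

def num_local_elements(parts):
    # per-partition count: one single-element list per partition
    return [[sum(1 for _ in p)] for p in parts]

parts = [[1, 2], [3, 4]]
print(num_elements(parts))        # 4
print(num_local_elements(parts))  # [[2], [2]]
```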

21

[find|insert|remove|swap][at]Index

def findIndex(self,e):
    res = self.zipWithIndex().filter(lambda (x,k): x == e)
    # check whether rdd is empty, if not then …
    return res.map(lambda (x,k): k).reduce(lambda k1,k2: min(k1,k2))

Code Example:

>>> res = sc.parallelize([1, 3, 3, 2], 2).findIndex(3)
>>> print(res)
1

(be careful with 0/1 based indexing)

values:  1 3 3 2
indices: 0 1 2 3
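The same filter-then-take-the-minimum logic can be sketched in plain Python over a flat list (the function name is ours):

```python
# Plain-Python model of findIndex: smallest (0-based) index whose element equals e.
def find_index(xs, e):
    matches = [k for k, x in enumerate(xs) if x == e]
    return min(matches) if matches else None  # None when no element matches

print(find_index([1, 3, 3, 2], 3))  # 1
```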

22

Also, need local versions

def findLocalIndex(self,e):
    res = self.zipWithLocalIndex().filter(lambda (x,k): x == e)
    # check whether rdd is empty, if not then …
    return res.mapPartitions(find_min_in_a_list)

Code Example:

>>> res = sc.parallelize([1, 3, 3, 2], 2).findLocalIndex(3)
>>> print(res.glom().collect())
[[1], [0]]

(be careful with 0/1 based indexing)

values:        1 3 | 3 2
local indices: 0 1 | 0 1

23

Prefix Sum (by Key)

keys:   1 2 2 3 4 4 4
values: 1 1 1 1 1 1 1

count →  keys: 1 2 3 4, counts: 1 2 1 3

add (prefix sum) →  keys: 1 2 3 4, sums: 1 3 4 7

This can be used to convert from COO to CSR format:

Ap: 1 2 4 5 8   (+1 shift, optional, based on 0/1 based indexing)
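The count-then-prefix-sum conversion above can be sketched in plain Python, matching the slide's numbers (1-based; the function name is ours):

```python
# Build the CSR row-pointer array Ap from COO row indices by counting
# the entries per row and taking a prefix sum (1-based, as on the slide).
def coo_rows_to_csr_ptr(rows, n_rows):
    counts = [0] * (n_rows + 1)
    for r in rows:
        counts[r] += 1          # counts[i] = nonzeros in row i (1-based)
    Ap = [1] * (n_rows + 1)
    for i in range(1, n_rows + 1):
        Ap[i] = Ap[i - 1] + counts[i]
    return Ap

print(coo_rows_to_csr_ptr([1, 2, 2, 3, 4, 4, 4], 4))  # [1, 2, 4, 5, 8]
```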

24

Prefix Sum

def prefixSum(self): # compute prefix sum by shifting and filtering keys
    rdd = self.map(lambda x: (x,1)).reduceByKey(lambda x,y: x+y).sortBy(lambda (k,x): k)
    n = rdd.getNumElements(); offset = next_pow2(n)
    …

input values:    1 1 1 1 1 1 1  (keys are colors)
per-key counts:  1 2 1 3
final result we expect: 1 3 4 7

25

Prefix Sum

def prefixSum(self): # compute prefix sum by shifting and filtering keys
    rdd = self.map(lambda x: (x,1)).reduceByKey(lambda x,y: x+y).sortBy(lambda (k,x): k)
    n = rdd.getNumElements(); offset = next_pow2(n)
    while offset > 0:
        set1 = rdd.map(lambda t: t)
        set2 = rdd.map(lambda (k,x): (k+offset,x)).filter(lambda (k,x): k<(n+1))
        rdd = set1.union(set2).reduceByKey(lambda x,y: x+y).sortBy(lambda (k,x): k)
        offset = int(offset/2)
    return rdd

offset=2:  1 2 1 3  →  1 2 2 5

26

Prefix Sum (continued)

offset=1:  1 2 2 5  →  1 3 4 7

27

Prefix Sum

Code Example (we can similarly have a local variant):

>>> rdd = sc.parallelize([1,2,3,4,4,2,4], 2)
>>> rdd.prefixSum()
[(1,1), (2,3), (3,4), (4,7)]
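The doubling steps can be simulated in plain Python with a dict standing in for the keyed RDD (a sketch of the same shift-filter-reduce pattern; the helper names are ours, and keys are assumed to be 1..n as on the slides):

```python
# Smallest power of two >= n.
def next_pow2(n):
    p = 1
    while p < n:
        p *= 2
    return p

# Simulate the RDD prefix sum: d maps key -> per-key count, keys 1..n.
def prefix_sum(d):
    n = len(d)
    offset = next_pow2(n)
    while offset > 0:
        # shift keys by offset and drop keys beyond n (the filter step)
        shifted = {k + offset: v for k, v in d.items() if k + offset < n + 1}
        for k, v in shifted.items():
            d[k] = d[k] + v          # union + reduceByKey with +
        offset //= 2
    return d

print(prefix_sum({1: 1, 2: 2, 3: 1, 4: 3}))  # {1: 1, 2: 3, 3: 4, 4: 7}
```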

28

numOps[Mixed]

def numOpsMixed(self, other, func): # ASSUMPTION: number of partitions is the same
    rdd = self.zipPartitions(other) # creates an rdd whose elements are partitions
    def apply_func((p,q)):
        for y in q:
            for x in p:
                yield func(x,y)
    res = rdd.flatMap(apply_func)
    return res

Example: partitions [1 2], [1 2] combined with [10], [20] using + give [11 12], [21 22]
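The per-partition pairing can be modeled in plain Python with both "RDDs" as lists of partitions (a sketch under the slide's same-partition-count assumption; the function name is ours):

```python
# Plain-Python sketch of numOpsMixed: zip the partitions pairwise and
# apply func to every (x, y) combination within each partition pair.
def num_ops_mixed(parts_a, parts_b, func):
    out = []
    for p, q in zip(parts_a, parts_b):                  # zipPartitions
        out.append([func(x, y) for y in q for x in p])  # apply_func
    return out

print(num_ops_mixed([[1, 2], [1, 2]], [[10], [20]], lambda x, y: x + y))
# [[11, 12], [21, 22]]
```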

29

AllToAll

def allToAll(self, np, partitionFunc):
    # define add_partition_index_to_each_element and use it below …
    rdd = self.mapPartitionsWithIndex(add_partition_index_to_each_element)
    def expand_p_index(x):
        for k in range(np):
            yield (k,x)
    res = rdd.flatMap(expand_p_index).partitionBy(np, partitionFunc).map(lambda (k,x): x)
    return res.sortLocalByKey().map(lambda (k,x): x)

(illustration: each element is replicated to every partition, tagged with its source partition index, routed by partitionFunc, and sorted locally by source index)
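A plain-Python sketch of the same idea, assuming a partitionFunc that sends key k to partition k so every destination receives a full copy (the function name is ours):

```python
# Plain-Python sketch of allToAll: every element is replicated to all
# partitions, tagged with its source partition index, and each destination
# keeps the copies ordered by source index (stable sort preserves local order).
def all_to_all(parts):
    tagged = [(i, x) for i, p in enumerate(parts) for x in p]
    tagged.sort(key=lambda t: t[0])          # sortLocalByKey on source index
    received = [x for _, x in tagged]
    return [list(received) for _ in range(len(parts))]

print(all_to_all([[1, 2], [3, 1]]))  # [[1, 2, 3, 1], [1, 2, 3, 1]]
```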

30

Partitioning the matrix (recap)

y1 = A1 x   (A1 given by Ap1, Ac1, Av1)
y2 = A2 x   (A2 given by Ap2, Ac2, Av2)

• Partition Arrays
• Insert (at index)
• Compute prefix sum
• Broadcast/Collect
• Numeric Operations

31

Discussion with Audience

32

Algorithms and Challenges

What algorithms would you like to implement? PCA (SVD), SVM, ALS, K-Means, …

Are you interested in machine learning (other than deep learning)?

How is Python/Scala/Java used? What code/problems are interesting?

What is your vision for how Spark should be aware of GPU resources, in conjunction with a resource manager (such as Mesos)?

What challenges do you face in using GPUs? Performance/Power/$, memory layout (JVM vs. C/C++), …

33

Backup Slides

34

PageRank (from Linear Algebra Perspective)

• Let C ∈ R^{n×n} be a scaled adjacency matrix (with row sums = 1), let the vector b ∈ {0,1}^n have 1s in the places of dangling nodes (indices of empty rows), and let u = (1/n)e, where e = [1,…,1]^T.

• Find the largest eigenpair (in which the eigenvector = PageRank) of

  Ax = λx, where A = α(C + b u^T) + (1−α)(u e^T)

• The simplest approach: the Power method

  key operation: sparse matrix-vector multiplication
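A minimal power-method sketch in plain Python (our own toy example; in the Spark setting the matvec is the distributed sparse matrix-vector product the slides discuss):

```python
# Power method: repeatedly apply A and renormalize; for a suitable PageRank
# matrix the dominant eigenvector is the PageRank vector.
def power_method(matvec, x, iters=100):
    for _ in range(iters):
        y = matvec(x)
        s = sum(abs(v) for v in y)   # L1 normalization
        x = [v / s for v in y]
    return x

# Toy 2-node example with a column-stochastic matrix (dense for brevity).
A = [[0.9, 0.2], [0.1, 0.8]]
x = power_method(lambda v: [sum(a * b for a, b in zip(row, v)) for row in A],
                 [0.5, 0.5])
print([round(v, 3) for v in x])  # [0.667, 0.333]
```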