
Making Hardware Accelerator Easier to Use

Transcript
  • Invited talk at 14th Asian Symposium on Programming Languages and Systems (APLAS 2016)

    Kazuaki Ishizaki

    IBM Research – Tokyo

    Making Hardware Accelerator Easier to Use


  • Hanoi in 1996

    My first visit to Hanoi

    I joined our research project on a Java just-in-time compiler in 1996

    I worked on a parallel Fortran compiler until 1995

  • Hanoi in 1996 and 2016

    Drastically changed over twenty years

    [Photos: Hanoi in 1996 and in 2016]

  • What Has Happened in the Computation-Intensive Area over 20 Years

    Programs are becoming simpler

    Hardware is becoming more complicated

    Hardware: fast scalar processors (1996) vs. commodity processors with hardware accelerators (2016)

    Applications: weather, wind, fluid, and physics simulations, chemical synthesis (1996) vs. machine learning and deep learning with big data (2016)

    Program: complicated and hardware-dependent code (1996) vs. simple and clean code, e.g. MapReduce (2016)

    Users: limited to programmers well-educated in HPC (1996) vs. data scientists unfamiliar with hardware (2016)

    Hardware examples: PowerPC (1996) vs. GPU (2016)

  • What Has Happened in the Computation-Intensive Area over 20 Years

    Programs are becoming simpler; hardware is becoming more complicated (the same 1996 vs. 2016 table as on the previous slide)

    Bad news: the gap between hardware and software is getting bigger

    Good news: programs can be easily analyzed

  • My Recent Interest

    How a system generates hardware accelerator code from a program with high-level abstraction

    Expected (practical) result: people execute programs without knowing how to use the hardware accelerator

    Challenge: how to optimize code for a certain hardware accelerator without hardware-specific information in the program

    On-going research: GPU exploitation from Java programs, and GPU exploitation in Apache Spark

    Work with Akihiro Hayashi *, Alon Shalev Housfater -, Hiroshi Inoue +, Madhusudanan Kandasamy, Gita Koblents -, Moriyoshi Ohara +, Vivek Sarkar *, and Jan Wroblewski (intern) +

    + IBM Research – Tokyo, - IBM Canada, IBM India, * Rice University

  • GPU Exploitation from Java Program

  • Why Java for GPU Programming?

    High productivity

    Safety and flexibility (compared to C/C++)

    Good program portability among different machines: write once, run anywhere

    One of the most popular programming languages

    Hard to use CUDA and OpenCL for non-expert programmers

    Many computation-intensive applications in non-HPC areas: data analytics and data science (Hadoop, Spark, etc.), security analysis, natural language processing

    (Photo: https://www.flickr.com/photos/dlato/5530553658)

    CUDA is a programming language for GPUs offered by NVIDIA

  • How We Write GPU Program

    Five steps

    1. Allocate GPU device memory

    2. Copy data on CPU main memory to GPU device memory

    3. Launch a GPU kernel to be executed in parallel on cores

    4. Copy back data on GPU device memory to CPU main memory

    5. Free GPU device memory

    [Diagram: CPU (dozens of cores per socket, main memory up to 1 TB/socket) connected to GPU (thousands of cores, device memory up to 16 GB) with data copy over PCIe or NVLink]

  • How We Optimize GPU Program

    Exploit faster memory: read-only cache (for read-only data) and shared memory (SMEM)

    Reduce data copy over PCIe or NVLink

    [Diagram: CPU (dozens of cores per socket, main memory up to 1 TB/socket) and GPU (thousands of cores, device memory up to 16 GB) connected over PCIe or NVLink, shown with the five steps from the previous slide; GPU memory hierarchy figure from a GTC presentation by NVIDIA]

  • Less Code Makes GPU Programming Easy

    The current programming model requires programmers to explicitly write operations for

    managing device memory

    copying data between CPU and GPU

    expressing parallelism

    exploiting faster memory

    Java 8 enables programmers to just focus on expressing parallelism

    // code for CPU
    void fooCUDA(float *A, float *B, int N) {
      int sizeN = N * sizeof(float);
      float *d_A, *d_B;
      cudaMalloc(&d_A, sizeN); cudaMalloc(&d_B, sizeN);
      cudaMemcpy(d_A, A, sizeN, cudaMemcpyHostToDevice);
      GPU<<<(N + 255) / 256, 256>>>(d_A, d_B, N);  // launch the GPU kernel
      cudaMemcpy(B, d_B, sizeN, cudaMemcpyDeviceToHost);
      cudaFree(d_B); cudaFree(d_A);
    }

    // code for GPU
    __global__ void GPU(float *d_A, float *d_B, int N) {
      int i = threadIdx.x + blockIdx.x * blockDim.x;
      if (i < N) {
        d_B[i] = d_A[i] * 2.0f;
      }
    }

  • Goal

    Build a Java just-in-time (JIT) compiler to generate high-performance GPU code from a parallel loop construct

    Accomplishments

    Implementing four performance optimizations

    Offering performance evaluations on POWER8 with a GPU

    Supporting Java language features (see [PACT2015])

    Predicting performance on CPU and GPU [PPPJ2015]

    Available in IBM Java 8 for ppc64le and x86_64: https://www.ibm.com/developerworks/java/jdk/java8/

  • Parallel Programming in Java 8

    Express parallelism by using the Parallel Stream API among iterations of a lambda expression (index variable: i)

    class Example {
      void foo(float[] a, float[] b, float[] c, int n) {
        java.util.stream.IntStream.range(0, n).parallel().forEach(i -> {
          b[i] = a[i] * 2.0f;
          c[i] = a[i] * 3.0f;
        });
      }
    }

    Note: The current version supports one-dimensional arrays with primitive types in a lambda expression

  • Overview of Our JIT Compiler

    A Java bytecode sequence is divided into two intermediate representation (IR) parts

    Lambda expression: generate GPU code using the NVIDIA tool chain (right-hand side of the figure)

    Others: generate CPU code using the conventional JIT compiler (left-hand side of the figure)

    [Figure: Java bytecode containing parallel stream code (IntStream.range(0, n).parallel().forEach(i -> { ... c[i] = a[i] ... })) is split after parallel stream API detection into IR for CPUs, which the conventional Java JIT compiler and CPU native code generator compile into a CPU binary for managing device memory, copying data, and launching the GPU binary, and IR for GPUs, which additional GPU optimization modules and the GPU native code generator (by NVIDIA) compile into an NVIDIA GPU binary for the lambda expression]

  • Optimizations for GPU in Our JIT Compiler

    Optimizing alignment of Java arrays on GPUs: reduce # of memory transactions to GPU global memory

    Using read-only cache: reduce # of memory transactions to GPU global memory

    Optimizing data copy between CPU and GPU: reduce amount of data copy

    Eliminating redundant exception checks: reduce # of instructions in GPU binary

  • Reducing # of Memory Transactions to GPU Global Memory

    Align the starting address of an array body in GPU global memory with a 128-byte memory transaction boundary

    IntStream.range(0, n).parallel().forEach(i -> {
      ... = a[i] ...;  // a[] : float
      ...;
    });

    [Figure: with a naive alignment strategy the object header is placed at a 128-byte memory transaction boundary, so a[0]-a[31] straddles the boundary and needs two memory transactions; with our alignment strategy the array body starts at the boundary, so a[0]-a[31] needs one memory transaction]
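    The arithmetic behind the two cases above can be checked with a short, self-contained Java sketch (illustration only; the method transactionsFor is made up for this example and is not part of the JIT compiler):

    // Count how many 128-byte transactions a warp's access to a[0..31] touches (float = 4 bytes)
    public class AlignmentSketch {
      static int transactionsFor(long startByte, int numFloats) {
        long endByte = startByte + numFloats * 4L - 1;        // last byte accessed
        return (int) (endByte / 128 - startByte / 128 + 1);   // number of 128-byte lines spanned
      }

      public static void main(String[] args) {
        System.out.println(transactionsFor(16, 32));   // body right after a 16-byte object header -> 2
        System.out.println(transactionsFor(128, 32));  // body aligned to the boundary -> 1
      }
    }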

  • Reducing # of Memory Transactions to GPU Global Memory

    Must keep only a read-only array in the read-only cache

    Lexically different variables (e.g. a[] and b[]) may point to the same array, which may be updated

    Perform alias analysis to identify a read-only array

    Static analysis in the JIT compiler identifies lexically read-only arrays and lexically written arrays

    Dynamic alias analysis in the generated code checks whether a lexically read-only array may alias with any lexically written array, and executes code with the read-only cache if not aliased (see the Java sketch after this slide)

    if (!(a[] aliases with b[]) && !(a[] aliases with c[])) {
      IntStream.range(0, n).parallel().forEach(i -> {
        b[i] = ROa[i] * 2.0;  // use RO cache for a[]
        c[i] = ROa[i] * 3.0;  // use RO cache for a[]
      });
    } else {
      // execute code w/o a read-only cache
    }

    IntStream.range(0, n).parallel().forEach(i -> {
      b[i] = a[i] * 2.0;
      c[i] = a[i] * 3.0;
    });
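    Because Java arrays alias only when they are the identical object, the dynamic part of this check reduces to reference comparisons. The following self-contained Java sketch illustrates the pattern (an illustration of the idea, not the generated code; the method name foo is made up):

    public class ReadOnlyCacheSketch {
      static void foo(float[] a, float[] b, float[] c, int n) {
        if (a != b && a != c) {
          // a[] cannot be written through b[] or c[] here, so a GPU version of this loop
          // could safely read a[] through the read-only cache
          java.util.stream.IntStream.range(0, n).parallel().forEach(i -> {
            b[i] = a[i] * 2.0f;
            c[i] = a[i] * 3.0f;
          });
        } else {
          // fall back to the version that does not use the read-only cache
          for (int i = 0; i < n; i++) {
            b[i] = a[i] * 2.0f;
            c[i] = a[i] * 3.0f;
          }
        }
      }
    }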

  • Reducing Amount of Data Copy between CPU and GPU

    Eliminate data copy from the GPU if an array (e.g. a[]) is not updated in the GPU binary [Jablin11][Pai12]

    Copy only a read or write set if an array index form is i + constant (the set is contiguous)

    sz = (n - 0) * sizeof(float);
    cudaMemcpy(d_a, &a[0], sz, H2D);  // copy only a read set
    cudaMemcpy(d_b, &b[0], sz, H2D);
    cudaMemcpy(d_c, &c[0], sz, H2D);
    IntStream.range(0, n).parallel().forEach(i -> {
      b[i] = a[i] ...;
      c[i] = a[i] ...;
    });
    cudaMemcpy(a, d_a, sz, D2H);
    cudaMemcpy(&b[0], d_b, sz, D2H);  // copy only a write set
    cudaMemcpy(&c[0], d_c, sz, D2H);  // copy only a write set

  • Eliminating Redundant Exception Checks

    Generate GPU code without exception checks by using loop versioning [Artigas00], which guarantees a safe region by using pre-condition checks on the CPU

    if (// check cond. for NullPointerException
        a != null && b != null && c != null &&
        // check cond. for ArrayIndexOutOfBoundsException
        a.length >= n && b.length >= n && c.length >= n) {
      // safe region: execute GPU code without exception checks
    } else {
      // execute code with exception checks
    }
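    The same pattern can be written as plain Java to see why the fast path no longer needs per-access checks (a minimal sketch of the loop-versioning idea, not the compiler's generated code; scale is a made-up method name):

    public class LoopVersioningSketch {
      static void scale(float[] a, float[] b, int n) {
        if (a != null && b != null && a.length >= n && b.length >= n) {
          // safe region: the pre-condition rules out NullPointerException and
          // ArrayIndexOutOfBoundsException, so this loop could run without checks (e.g. on a GPU)
          for (int i = 0; i < n; i++) {
            b[i] = a[i] * 2.0f;
          }
        } else {
          // fallback: original code with the implicit null and bounds checks
          for (int i = 0; i < n; i++) {
            b[i] = a[i] * 2.0f;
          }
        }
      }
    }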

  • Automatically Optimized for CPU and GPU

    CPU code

    handles GPU device memory management and data copying

    checks whether the optimized CPU or GPU code can be executed

    GPU code

    is optimized by using the read-only cache and eliminating exception checks

    // CPU
    if (a != null && b != null && c != null &&
        a.length >= n && b.length >= n && c.length >= n &&
        !(a[] aliases with b[]) && !(a[] aliases with c[])) {
      cudaMalloc(d_a, a.length*sizeof(float)+128);
      if (b != a) cudaMalloc(d_b, b.length*sizeof(float)+128);
      if (c != a && c != b) cudaMalloc(d_c, c.length*sizeof(float)+128);
      int sz = (n - 0) * sizeof(float), szh = sz + Jhdrsz;
      cudaMemcpy(d_a + align - Jhdrsz, a, szh, H2D);
      GPU(d_a, d_b, d_c, n);  // launch GPU
      cudaMemcpy(b + Jhdrsz, d_b + align, sz, D2H);
      cudaMemcpy(c + Jhdrsz, d_c + align, sz, D2H);
      cudaFree(d_a); if (b != a) cudaFree(d_b); if (c != a && c != b) cudaFree(d_c);
    } else {
      // execute CPU binary
    }

    // GPU
    __global__ void GPU(float *a, float *b, float *c, int n) {
      // no exception checks
      i = ...;
      b[i] = ROa[i] * 2.0;
      c[i] = ROa[i] * 3.0;
    }

  • Benchmark Programs

    Prepare sequential and parallel stream API versions in Java

    (Name: summary; data size; element type)

    Blackscholes: financial application that calculates the price of put and call options; 4,194,304 virtual options; double

    MM: a standard dense matrix multiplication C = A.B; 1,024 x 1,024; double

    Crypt: cryptographic application [Java Grande Benchmarks]; N = 50,000,000; byte

    Series: computes the first N Fourier coefficients of a function [Java Grande Benchmarks]; N = 1,000,000; double

    SpMM: sparse matrix multiplication [Java Grande Benchmarks]; N = 500,000; double

    MRIQ: 3D image benchmark for MRI [Parboil benchmarks]; 64x64x64; float

    Gemm: matrix multiplication C = α.A.B + β.C [PolyBench]; 1,024 x 1,024; int

    Gesummv: scalar, vector, and matrix multiplication [PolyBench]; 4,096 x 4,096; int

  • Performance Improvements of GPU Version over Sequential and Parallel CPU Versions

    Achieves 127.9x on geomean and 2067.7x for Series over 1 CPU thread

    Achieves 3.3x on geomean and 32.8x for Series over 160 CPU threads

    Degrades performance for SpMM and Gesummv against 160 CPU threads

    Environment: two 10-core 8-SMT IBM POWER8 CPUs at 3.69 GHz with 256 GB memory, one NVIDIA Kepler K40m GPU at 876 MHz with 12 GB global memory (ECC off), Ubuntu 14.10, CUDA 5.5, modified IBM Java 8 runtime for PowerPC

  • Performance Comparison with Hand-Coded CUDA

    Achieves 0.83x on geomean over CUDA

    Crypt, Gemm, and Gesummv: usage of a read-only cache

    BlackScholes: usage of larger CUDA threads per block (1024 vs. 128)

    SpMM: overhead of exception checks

    MRIQ: missing -use-fast-math compile option

    MM: lack of usage of shared memory with loop tiling

    [Bar chart: speedup relative to CUDA for BlackScholes, MM, Crypt, Series, SpMM, MRIQ, Gemm, and Gesummv; bar values as extracted: 0.85, 0.45, 1.51, 0.92, 0.74, 0.11, 1.19, 3.47; higher is better]

  • GPU Version Is Slower Than Parallel CPU Version

    Can we choose an appropriate device (CPU or GPU) to avoid performance degradation?

    Want to make sure we achieve equal or better performance

  • Machine-Learning-Based Performance Heuristics

    Construct a binary prediction model offline by supervised machine learning with support vector machines (SVMs)

    Features:

    Loop range

    Dynamic number of instructions (memory accesses, arithmetic operations, ...)

    Dynamic number of array accesses (a[i], a[i + c], a[c * i], a[idx[i]])

    Data transfer size (CPU to GPU, GPU to CPU)

    [Figure: the bytecode and data of each run (App A with data1 and data2, App B with data3) go through feature extraction; the feature vectors are fed to LIBSVM to train a prediction model, which the Java runtime uses to choose CPU or GPU]
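    To make the flow above concrete, here is a small self-contained Java sketch of such a device-selection heuristic; the feature layout, the DevicePredictor interface, and the toy model are invented for this illustration and are not the actual implementation or the LIBSVM API:

    public class DeviceSelectionSketch {
      // Stand-in for a model trained offline (e.g. with an SVM); true means the GPU is predicted faster
      interface DevicePredictor {
        boolean preferGpu(double[] features);
      }

      // Feature vector in the spirit of the slide: loop range, dynamic instruction counts,
      // dynamic array-access count, and data transfer sizes
      static double[] extractFeatures(long loopRange, long memoryAccesses, long arithmeticOps,
                                      long arrayAccesses, long bytesCpuToGpu, long bytesGpuToCpu) {
        return new double[] { loopRange, memoryAccesses, arithmeticOps,
                              arrayAccesses, bytesCpuToGpu, bytesGpuToCpu };
      }

      static String chooseDevice(double[] features, DevicePredictor model) {
        return model.preferGpu(features) ? "GPU" : "CPU";
      }

      public static void main(String[] args) {
        // Toy hand-written model: prefer the GPU only for large loops with modest data transfer
        DevicePredictor toyModel = f -> f[0] > 100_000 && (f[4] + f[5]) < 64L * 1024 * 1024;
        double[] features = extractFeatures(4_194_304, 10_000_000, 8_000_000,
                                            4_194_304, 1 << 24, 1 << 24);
        System.out.println(chooseDevice(features, toyModel));
      }
    }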

  • Most Predictions are Correct

    Used 291 cases to build the model

    Succeeded in predicting the cases of performance degradation on GPU

    Failed to predict BlackScholes

    [Chart: prediction results; annotations 1.8->1.0, 0.8->1.0, 0.4->1.0]

  • Related Work

    Our research enables memory and communication optimizations with machine-learning-based device selection

    (The original table compares language, exception support, JIT compiler, how to write the GPU kernel, data copy optimization, GPU memory optimization, and device selection)

    JCUDA (Java): kernel written in CUDA; manual data copy optimization; manual GPU memory optimization; GPU only

    JaBEE (Java): override run method; GPU only

    Aparapi (Java): override run method / lambda; static device selection

    Hadoop-CL (Java): override map/reduce method; static device selection

    Rootbeer (Java): override run method; optimizations not described

    [PPPJ09] (Java): Java for-loop; not described; dynamic device selection with regression

    HJ-OpenCL (Habanero-Java): forall constructs; static device selection

    Our work (Java): standard parallel stream API; read-only cache / alignment; dynamic device selection with machine learning

  • Future Work

    Exploiting shared memory (like private memory, shared by 64 - 192 cores)

    Supporting additional Java operations

  • GPU Exploitation in Apache Spark

  • What is Apache Spark?

    Framework for distributed computing that transforms distributed immutable in-memory structures using a set of parallel operations, e.g. map(), filter(), reduce(), ...

    Distributed immutable in-memory structures: RDD (Resilient Distributed Dataset), DataFrame, Dataset

    Scala is the primary language for programming on Spark

    Provides domain-specific libraries: Spark Streaming (real-time), GraphX (graph), SparkSQL (SQL), MLlib (machine learning)

    Open source: http://spark.apache.org/ (latest version is 2.0.3, released in 2016/11)

    [Diagram: the Spark runtime (written in Java and Scala) runs on the Java Virtual Machine; a driver sends tasks to executors, which hold data and return results; data comes from a data source such as HDFS, a database, or files]

  • How Program Works on Apache Spark

    Parallel operations can be executed among partitions

    In a partition, data can be processed sequentially

    case class Pt(x: Int, y: Int)
    val ds1: Dataset[Pt] = sc.parallelize(Seq(Pt(1, 5), Pt(2, 6), Pt(3, 7), Pt(4, 8)), 2).toDS
    val ds2: Dataset[Pt] = ds1.map(p => Pt(p.x+1, p.y*2))
    val cnt: Int = ds2.reduce((p1, p2) => p1.x + p2.x)

    [Diagram: ds1 has two partitions, {Pt(1,5), Pt(2,6)} and {Pt(3,7), Pt(4,8)}; map(p => Pt(p.x+1, p.y*2)) produces the ds2 partitions {Pt(2,10), Pt(3,12)} and {Pt(4,14), Pt(5,16)}; the reduce over p1.x + p2.x yields 5 and 9 per partition and cnt = 14]

  • How We Can Run Program Faster on GPU

    Assign many parallel computations to cores

    Make memory accesses coalesced: a column-oriented layout results in better performance

    [Che2011] reports about a 3x performance improvement of GPU kernel execution of kmeans with a column-oriented layout over a row-oriented layout

    Assumption: 4 consecutive data elements can be coalesced by the GPU hardware

    Example with Pt(x: Int, y: Int) and the data Pt(1,5), Pt(2,6), Pt(3,7), Pt(4,8): loading four Pt.x and then four Pt.y takes 2 memory accesses to GPU device memory with a column-oriented layout (x1 x2 x3 x4 | y1 y2 y3 y4) vs. 4 with a row-oriented layout (see the sketch after this slide)
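    A small Java sketch of the layout difference discussed above (illustration only; the Pt and PtColumns classes are made up for this example): converting an array of points into one array per field is what makes the four x values, and then the four y values, contiguous in memory.

    public class ColumnarLayoutSketch {
      static final class Pt {
        final int x, y;
        Pt(int x, int y) { this.x = x; this.y = y; }
      }

      // Column-oriented storage: one contiguous array per field
      static final class PtColumns {
        final int[] xs, ys;
        PtColumns(Pt[] pts) {               // pts is the row-oriented layout
          xs = new int[pts.length];
          ys = new int[pts.length];
          for (int i = 0; i < pts.length; i++) {
            xs[i] = pts[i].x;
            ys[i] = pts[i].y;
          }
        }
      }

      public static void main(String[] args) {
        Pt[] rows = { new Pt(1, 5), new Pt(2, 6), new Pt(3, 7), new Pt(4, 8) };
        PtColumns cols = new PtColumns(rows);
        // xs = [1, 2, 3, 4] and ys = [5, 6, 7, 8] can each be read with a single coalesced access
        System.out.println(java.util.Arrays.toString(cols.xs) + " " + java.util.Arrays.toString(cols.ys));
      }
    }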

  • Idea to Transparently Exploit GPUs on Apache Spark

    Generate GPU code from a set of parallel operations (already done in our other research)

    Physically put distributed immutable in-memory structures (e.g. Dataset) in a column-oriented representation

    Dataset is statically typed, but the physical layout is not specified in the program

  • Overview of GPU Exploitation on Apache Spark

    Users' Spark program:

    case class Pt(x: Int, y: Int)
    ds1 = sc.parallelize(Seq(Pt(1, 5), Pt(2, 6), Pt(3, 7), Pt(4, 8)), 2).toDS
    ds2 = ds1.map(p => Pt(p.x+1, p.y*2))
    cnt = ds2.reduce((p1, p2) => p1.x + p2.x)

    [Diagram: per partition, the x and y columns of ds1 (x = 1, 2, 3, 4; y = 5, 6, 7, 8) are transferred to the GPU, a GPU kernel runs x+1 and y*2 as native code, and the resulting columns of ds2 (x = 2, 3, 4, 5; y = 10, 12, 14, 16) are transferred back to the CPU]

  • Overview of GPU Exploitation on Apache Spark

    Efficient

    Reduce data copy overhead between CPU and GPU

    Make memory accesses efficient on GPU

    Transparent

    Map parallelism in the program into GPU native code

    Users' Spark program:

    case class Pt(x: Int, y: Int)
    ds1 = sc.parallelize(Seq(Pt(1, 4), Pt(2, 5), Pt(3, 6), Pt(4, 7)), 2).toDS
    ds2 = ds1.map(p => Pt(p.x+1, p.y*2))
    cnt = ds2.reduce((p1, p2) => p1.x + p2.x)

    GPU can exploit parallelism both among partitions in a Dataset and within a partition of a Dataset

    [Diagram: a GPU manager drives the GPU native code; columnar storage for ds1 and ds2 is laid out contiguously in memory, data transfer moves the x and y columns to the GPU, and the kernel computes x+1 and y*2 per partition]

  • How We Write Program and What is Executed

    Write an operation using a lambda expression for an RDD. Then the corresponding Java bytecode for the expression is executed.

    Write a program using a relational operation for a DataFrame or a lambda expression for a Dataset. Catalyst performs optimization and code generation for the program, and the Java bytecode for the generated Java code is executed.

    data = Seq(Pt(1, 5), Pt(2, 6))

    // RDD (v0.5-)
    rdd1 = sc.parallelize(data)
    rdd2 = rdd1.map(p => p.x+1)
    rdd2.reduce((a,b) => a+b)

    // DataFrame (v1.3-)
    df1 = data.toDF()
    df2 = df1.selectExpr("x+1")
    df2.agg(sum())

    // Dataset (v1.6-)
    ds1 = data.toDS()
    ds2 = ds1.map(p => p.x+1)
    ds2.reduce((a,b) => a+b)

    Backend computation: Scalac-generated Java bytecode for RDD; Catalyst-generated Java bytecode (from the generated Java code) for DataFrame and Dataset

    [Diagram: in both cases the data stays row-oriented on the Java heap]

  • Our Two Implementations for GPU Exploitation

    GPUEnabler is designed for writing domain-specific libraries by a Ninja programmer: transparent exploitation by calling a method in the library

    Enhanced Catalyst is designed for writing applications by a general programmer: transparent exploitation by automatic code generation

    Code and columnar storage for RDD: hand-written code (GPUEnabler)

    Code and columnar storage for DataFrame/Dataset: automatic code generation (enhanced Catalyst)

    [Diagram: a Spark user program runs on the Spark runtime with columnar storage; a GPU manager drives either pre-compiled GPU/SIMD code (GPUEnabler) or generated GPU/SIMD code (enhanced Catalyst)]

  • How Program is Executed on GPU

    For an RDD, a programmer provides pre-compiled GPU code. GPUEnabler handles data transfer between GPU and CPU and launches the GPU code on the GPU.

    For a DataFrame and a Dataset, enhanced Catalyst generates Java code optimized for GPU. A just-in-time compiler in the Java virtual machine can generate GPU code.

    data = Seq(Pt(1, 5), Pt(2, 6))

    // RDD (v0.5-), with GPUEnabler
    rdd1 = sc.parallelize(data)
    rdd2 = rdd1.map(p => p.x+1, gpu)
    rdd2.reduce((a,b) => a+b, gpu)

    // DataFrame (v1.3-)
    df1 = data.toDF()
    df2 = df1.selectExpr("x+1")
    df2.agg(sum())

    // Dataset (v1.6-)
    ds1 = data.toDS()
    ds2 = ds1.map(p => p.x+1)
    ds2.reduce((a,b) => a+b)

    Backend computation: pre-compiled GPU code for RDD (GPUEnabler); automatically generated GPU code from the optimized Java code for DataFrame and Dataset (enhanced Catalyst)

    [Diagram: in both cases the data is placed column-oriented in GPU device memory]

  • GPUEnabler (https://github.com/IBMSparkGPU/GPUEnabler)

    Uses columnar storage for RDD

    Supports map & reduce operations to drive GPU code

    GPU code provided by the programmer is passed as an argument of map()/reduce()

    Implemented as a plug-in

    # bin/spark-shell --class your.gpu.application yours.jar --packages com.ibm:gpu-enabler_2.10:1.0.0

    // Load a kernel function from the GPU kernel binary
    val ptxURL = SparkGPULR.getClass.getResource("/GpuEnablerExamples.ptx")
    val mapFunction = new CUDAFunction("multiply2", Array("this"), Array("this"), ptxURL)
    val reduceFunction = new CUDAFunction("sum", Array("this"), Array("this"), ptxURL)
    val rdd = sc.parallelize(1 to n)
    val output = rdd
      .mapExtFunc((x: Int) => x * 2, mapFunction)
      .reduceExtFunc((x: Int, y: Int) => x + y, reduceFunction)

    // GPU code
    __global__ void multiply2(int *inX, int *outX, long *size) {
      long ix = threadIdx.x + blockIdx.x * blockDim.x;
      if (*size > ix) {
        outX[ix] = inX[ix] * 2;
      }
    }

  • Pseudo Java Code by Current Catalyst

    Performs an optimization that merges multiple parallel operations (selectExpr() and agg(sum())) into one loop

    Generated code corresponds to selectExpr() and a local sum()

    // DataFrame program for Spark
    val df1 = (-1 to 1).toDF("x")
    val df2 = df1.selectExpr("x + 1")
    df2.agg(sum())

    // Generated pseudo Java code
    int sum = 0;
    while (rowIterator.hasNext()) {
      Row row = rowIterator.next();    // for df1
      int x = row.getInteger(0);
      // selectExpr(x + 1)
      int x_new = x + 1;               // for df2
      sum += x_new;
    }

    [Diagram: df1 holds x = -1, 0, 1 in row-oriented form and is read sequentially; x_new takes the values 0, 1, 2 and sum becomes 3]

  • Pseudo Java Code by Enhanced Catalyst

    Gets column 0 from the column-oriented storage

    The for-loop can be executed in a reduction manner

    // Generated pseudo Java code
    Column column0 = df1.getColumn(0);   // df1
    int sum = 0;
    for (int i = 0; i < column0.numRows; i++) {
      int x = column0.getInteger(i);
      // selectExpr(x + 1)
      int x_new = x + 1;                 // for df2
      sum += x_new;
    }

    [Diagram: df1 holds x = -1, 0, 1 in column-oriented form; x_new takes the values 0, 1, 2 and sum becomes 3]

  • Generate GPU Code Transparently from Spark Program

    Copy the column-oriented storage into the GPU

    Execute the add and the reduction in one GPU kernel

    val df1 = (-1 to 1).toDF("x")
    val df2 = df1.selectExpr("x + 1")
    df2.agg(sum())

    // Generated pseudo CPU code
    Column column0 = df1.getColumn(0);
    int nRows = column0.numRows;
    cudaMalloc(&d_c0, nRows*4);
    cudaMemcpy(d_c0, column0, nRows*4, H2D);
    int sum = 0;
    cudaMalloc(&d_sum, 4);
    cudaMemcpy(d_sum, &sum, 4, H2D);
    GPU(d_c0, d_sum, nRows);  // launch GPU
    cudaMemcpy(&sum, d_sum, 4, D2H);
    cudaFree(d_sum); cudaFree(d_c0);

    // GPU code
    __global__ void GPU(int *d_c0, int *d_sum, long size) {
      long ix = ...;  // 0, 1, 2
      if (ix < size) {
        // compute d_c0[ix] + 1 and accumulate the partial sums into *d_sum
      }
    }

  • Many Engineering Efforts are Required

    Make DataFrame and Dataset use column-oriented storage

    Generate simpler optimized code in the while-loop

  • Very Complicated Java Code by Current Catalyst

    Overhead exists in the Java code: data representation, data conversions, and complicated code

    // source program
    val x = Array(1.0, 2.0)
    val y = Array(3.0, 4.0)
    val ds = sparkContext.parallelize(Seq(x, y), 1).toDS
    ds.map(a => a)

    Data conversions in the generated code:

    a. sparse array to java.lang.Double[]

    b. java.lang.Double[] to double[]

    c. double[] to java.lang.Double[]

    d. java.lang.Double[] to sparse array
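    To give a feel for what conversions (b) and (c) cost, here is a tiny self-contained Java sketch of round-tripping between java.lang.Double[] and double[] (purely illustrative; it is not the Catalyst-generated code):

    public class ConversionSketch {
      static double[] unbox(Double[] boxed) {
        double[] primitive = new double[boxed.length];
        for (int i = 0; i < boxed.length; i++) {
          primitive[i] = boxed[i];        // unboxing: one object dereference per element
        }
        return primitive;
      }

      static Double[] box(double[] primitive) {
        Double[] boxed = new Double[primitive.length];
        for (int i = 0; i < primitive.length; i++) {
          boxed[i] = primitive[i];        // boxing: one object allocation per element
        }
        return boxed;
      }

      public static void main(String[] args) {
        double[] values = { 1.0, 2.0, 3.0, 4.0 };
        // One such round trip per row is pure overhead compared with keeping double[] throughout
        System.out.println(java.util.Arrays.toString(unbox(box(values))));
      }
    }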

  • Pretty Simple Java Code by Enhanced Catalyst

    Most of the data conversions are eliminated

    Uses data representations suitable for GPU

    Remaining conversions: dense array to double[], and double[] to dense array

    // source program
    val x = Array(1.0, 2.0)
    val y = Array(3.0, 4.0)
    val ds = sparkContext.parallelize(Seq(x, y), 1).toDS
    ds.map(a => a)

  • Related Work

    SparkJNI (https://github.com/tudorv91/SparkJNI)

    Calls a native method from map() or reduce(); very similar to GPUEnabler, but uses no columnar storage

    JavaRDD vectorsRdd = getSparkContext().parallelize(generateVectors(2, 4));
    JavaRDD mulResults = vectorsRdd.map(new VectorMulJni(libPath, "mapVectorMul"));
    VectorBean results = mulResults.reduce(new VectorAddJni(libPath, "reduceVectorAdd"));

    Spark With Accelerated Tasks [Grossman2016]

    Generates GPU code from a lambda function in map() on an RDD; very similar to enhanced Catalyst using columnar storage to transparently exploit GPUs, but works only for RDD with map()

    val inputRDD = cl(sc.objectFile[Int](hdfsPath))
    val doubledRDD = inputRDD.map(i => 2 * i)

    GPU Columnar (proposed by Kiran Lonikar)

    Generates GPU code from a program using the select() method on a DataFrame; very similar to enhanced Catalyst using columnar storage to transparently exploit GPUs

  • Conclusion

    We generated hardware accelerator code from programs with high-level abstraction

    It is not easy to do this in a systematic way: how can we easily generate optimized code from different types of domain-specific languages?

    Programs are cleaner and simpler than twenty years ago

    How can we integrate good results from theory into practical systems?

    Can we do similar things for deep learning? Current deep learning frameworks use the GPU by calling libraries (e.g. cuDNN)

    What are future programming models for deep learning?

