CUDA - Copperhead


Transcript

Copperhead: A Python-like Data Parallel Language & Compiler

Bryan Catanzaro, UC Berkeley
Michael Garland, NVIDIA Research

    Kurt Keutzer, UC Berkeley

Universal Parallel Computing Research Center
University of California, Berkeley


    Intro to CUDA

Overview: Multicore/Manycore, SIMD, and programming with millions of threads.


    The CUDA Programming Model

CUDA is a recent programming model, designed for manycore architectures, wide SIMD parallelism, and scalability.

CUDA provides a thread abstraction to deal with SIMD, plus synchronization & data sharing between small groups of threads.

CUDA programs are written in C + extensions.

OpenCL is inspired by CUDA, but is HW & SW vendor neutral; the programming model is essentially identical.


    Multicore and Manycore

Multicore: a yoke of oxen. Each core is optimized for executing a single thread.

Manycore: a flock of chickens. Cores are optimized for aggregate throughput, deemphasizing individual performance.



    Multicore & Manycore, cont.

Specifications            Core i7 960                    GTX285
Processing elements       4 cores, 4-way SIMD            30 cores, 8-way SIMD
                          @ 3.2 GHz                      @ 1.5 GHz
Resident threads (max)    4 cores × 2 threads ×          30 cores × 32 SIMD vectors ×
                          4-wide SIMD: 32 strands        32-wide SIMD: 30720 strands
SP GFLOP/s                102                            1080
Memory bandwidth          25.6 GB/s                      159 GB/s
Register file             -                              1.875 MB
Local store               -                              480 kB



    SIMD: Neglected Parallelism

It is difficult for a compiler to exploit SIMD: how do you deal with sparse data & branches? Many languages (like C) are difficult to vectorize; Fortran is somewhat better.

The most common solutions: either forget about SIMD and pray the autovectorizer likes you, or instantiate intrinsics (assembly language), which requires a new code version for every SIMD extension.


    What to do with SIMD?

Neglecting SIMD will be more expensive in the future: AVX is 8-way SIMD, Larrabee is 16-way SIMD. This problem composes with thread-level parallelism.

[Figure: 4-way SIMD vs. 16-way SIMD]


    CUDA

CUDA addresses this problem by abstracting both SIMD and task parallelism into threads. The programmer writes a serial, scalar thread with the intention of launching thousands of threads.

Being able to launch 1 million threads changes the parallelism problem: it's often easier to find 1 million threads than 32. Just look at your data & launch a thread per element.

CUDA is designed for data parallelism. Not coincidentally, data parallelism is the only way for most applications to scale to 1000(+)-way parallelism.


    Hello World


    CUDA Summary

CUDA is a programming model for manycore processors. It abstracts SIMD, making it easy to use wide SIMD vectors, and it provides good performance on today's GPUs. In the near future, CUDA-like approaches will map well to many processors & GPUs.

CUDA encourages SIMD-friendly, highly scalable algorithm design and implementation.


    A Parallel Scripting Language

What is a scripting language? There are lots of opinions on this; I'm using an informal definition: a language where performance is happily traded for productivity.

There is a weak performance requirement of scalability: my code should run faster tomorrow.

What is the analog of today's scripting languages for manycore?


    Data Parallelism

Assertion: scaling to 1000 cores requires data parallelism. Accordingly, manycore scripting languages will be data parallel. They should allow the programmer to express data parallelism naturally, and they should compose and transform the parallelism to fit target platforms.


    Warning: Evolving Project

Copperhead is still in embryo: we can compile a few small programs, and there is lots more work to be done in both language definition and code generation. Feedback is encouraged.


    Copperhead = Cu + python

Copperhead is a subset of Python, designed for data parallelism.

Why Python? It is an extant, well-accepted high-level scripting language. It comes with a free simulator(!!). It already understands things like map and reduce, and it comes with a parser & lexer.

The current Copperhead compiler takes a subset of Python and produces CUDA code. Copperhead is not CUDA specific, but the current compiler is.


    Copperhead is not Pure Python

Copperhead is not for arbitrary Python code; most features of Python are unsupported. Copperhead is compiled, not interpreted. Connecting Python code & Copperhead code will require binding the programs together, similar to Python-C interaction. Copperhead is statically typed.

[Figure: Copperhead as a subset of Python]


    Saxpy: Hello world

Some things to notice:

Types are implicit. The Copperhead compiler uses a Hindley-Milner type system with typeclasses, similar to Haskell; typeclasses are fully resolved in CUDA via C++ templates.

Functional programming: map and lambda (or the equivalent in list comprehensions); you can pass functions around to other functions.

Closure: the variable a is free in the lambda function, but bound to the a in its enclosing scope.

    def saxpy(a, x, y):

    return map(lambda xi, yi: a*xi + yi, x, y)
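Not from the slides, but worth seeing: because Copperhead is a subset of Python, this saxpy source runs unmodified under the ordinary Python interpreter, which is what the "free simulator" on an earlier slide refers to. In Python 3, map is lazy, so we wrap the result in list():

def saxpy(a, x, y):
    # a is free in the lambda, bound by the enclosing scope (a closure)
    return map(lambda xi, yi: a*xi + yi, x, y)

# 2.0 * [1, 2, 3] + [10, 20, 30], element by element
print(list(saxpy(2.0, [1.0, 2.0, 3.0], [10.0, 20.0, 30.0])))
# [12.0, 24.0, 36.0]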


    Type Inference, cont.

Copperhead includes function templates for intrinsics like add, subtract, map, scan, and gather. Expressions are matched against these templates. Every variable starts out with a unique generic type; types are then resolved by union-find on the abstract syntax tree. Tuple and function types are also inferred.

For example, in c = a + b, the variables start with fresh type variables (a145, a207, a52 in the figure). Matching against the template + : (Num0, Num0) → Num0 unifies them, so a, b, and c all resolve to the same type, Num52.
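A minimal sketch, in plain Python, of how union-find with path compression drives this kind of unification (my illustration; the TypeVar class and unify function are hypothetical, not the Copperhead compiler's actual code):

class TypeVar:
    """A type variable in a union-find forest; the root may carry a concrete type."""
    def __init__(self, name):
        self.name = name
        self.parent = self     # union-find parent pointer; roots point to themselves
        self.concrete = None   # e.g. "Num" once the variable is resolved

    def find(self):
        # Walk to the representative root, compressing the path as we go
        if self.parent is not self:
            self.parent = self.parent.find()
        return self.parent

def unify(t1, t2):
    # Merge the equivalence classes of two type variables
    r1, r2 = t1.find(), t2.find()
    if r1 is r2:
        return
    if r1.concrete and r2.concrete and r1.concrete != r2.concrete:
        raise TypeError("cannot unify %s with %s" % (r1.concrete, r2.concrete))
    r2.parent = r1
    r1.concrete = r1.concrete or r2.concrete

# c = a + b: each variable starts with a fresh generic type...
a, b, c = TypeVar("a145"), TypeVar("a207"), TypeVar("a52")
# ...and the template + : (Num0, Num0) -> Num0 forces all three together
unify(a, b)
unify(a, c)
a.find().concrete = "Num"
print([v.find().concrete for v in (a, b, c)])  # ['Num', 'Num', 'Num']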


    Data parallelism

Copperhead computations are organized around data parallel arrays. map performs a forall over each element in an array. Accesses must be local; accessing non-local elements is done explicitly, via shift, rotate, or gather. No side effects are allowed. The sketch below illustrates the distinction.
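A plain-Python sketch of local vs. explicit non-local access (my illustration; these gather and shift helpers are stand-ins for the Copperhead primitives, not their actual definitions):

def gather(src, indices):
    # Explicit non-local access: fetch src at arbitrary positions
    return [src[i] for i in indices]

def shift(src, amount, default=0):
    # Explicit uniform-stride access: slide the array, padding with a default
    n = len(src)
    return [src[i + amount] if 0 <= i + amount < n else default
            for i in range(n)]

x = [10, 20, 30, 40]

# Legal map: each lambda invocation touches only its own element xi
doubled = list(map(lambda xi: 2 * xi, x))            # [20, 40, 60, 80]

# Neighbor access must be made explicit first, then mapped locally
left = shift(x, -1)                                  # [0, 10, 20, 30]
diffs = list(map(lambda xi, li: xi - li, x, left))   # [10, 10, 10, 10]

print(gather(x, [3, 0, 2]))                          # [40, 10, 30]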


    Copperhead primitives

map

reduce

Scans: scan, rscan, segscan, rsegscan, exscan, exrscan, exsegscan, exrsegscan

Shuffles: shift, rotate, gather, scatter
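For reference, plain-Python semantics for a few of these primitives (my sketch, assuming the usual conventions: an r- prefix reverses direction, ex- makes the scan exclusive, seg- restarts at segment flags):

import operator

def scan(op, a):
    # Inclusive scan: out[i] = a[0] op ... op a[i]
    out, acc = [], None
    for x in a:
        acc = x if acc is None else op(acc, x)
        out.append(acc)
    return out

def exscan(op, identity, a):
    # Exclusive scan: out[i] = identity op a[0] op ... op a[i-1]
    out, acc = [], identity
    for x in a:
        out.append(acc)
        acc = op(acc, x)
    return out

def segscan(op, a, flags):
    # Segmented inclusive scan: a set flag restarts the running value
    out, acc = [], None
    for x, f in zip(a, flags):
        acc = x if (f or acc is None) else op(acc, x)
        out.append(acc)
    return out

print(scan(operator.add, [1, 2, 3, 4]))                   # [1, 3, 6, 10]
print(exscan(operator.add, 0, [1, 2, 3, 4]))              # [0, 1, 3, 6]
print(segscan(operator.add, [1, 2, 3, 4], [1, 0, 1, 0]))  # [1, 3, 3, 7]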


    Implementing Copperhead

The Copperhead compiler is written in Python. Python provides its own abstract syntax tree, and type inference, code generation, etc. are done by walking the AST. The AST for saxpy is shown below.

Module(None,
  Stmt(
    Function(None, 'saxpy', ['a', 'x', 'y'], 0, None,
      Stmt(
        Return(
          CallFunc(Name('map'),
            Lambda(['xi', 'yi'], 0,
              Add(Mul(Name('a'), Name('xi')),
                  Name('yi'))),
            Name('x'),
            Name('y'),
            None,
            None))))))

    def saxpy(a, x, y):

    return map(lambda xi, yi: a*xi + yi, x, y)
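The dump above matches the AST produced by the compiler module that shipped with Python 2. As an aside (not from the slides), the same experiment in modern Python uses the ast module:

import ast

src = """
def saxpy(a, x, y):
    return map(lambda xi, yi: a*xi + yi, x, y)
"""

# Parse to an AST and pretty-print it, analogous to the dump above
tree = ast.parse(src)
print(ast.dump(tree, indent=2))

# A compiler pass is then a walk over these nodes
for node in ast.walk(tree):
    if isinstance(node, ast.Lambda):
        print("lambda args:", [arg.arg for arg in node.args.args])  # ['xi', 'yi']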


    Compiling Copperhead to CUDA

Every Copperhead function creates at least one CUDA device function. Top-level Copperhead functions create a CUDA global function, which orchestrates the device function calls. The global function takes care of allocating shared memory and returning data (storing it to DRAM). Global synchronizations are implemented through multiple phases. All intermediate arrays & plumbing are handled by the Copperhead compiler.


    Saxpy Revisited

template<typename Num>
__device__ Num lambda0(Num xi, Num yi, Num a) {
    return ((a * xi) + yi);
}

template<typename Num>
__device__ void saxpy0Dev(Array<Num> x, Array<Num> y, Num a,
                          uint _globalIndex, Num& _returnValueReg) {
    Num _xReg, _yReg;
    if (_globalIndex < x.length) _xReg = x[_globalIndex];
    if (_globalIndex < y.length) _yReg = y[_globalIndex];
    if (_globalIndex < x.length) _returnValueReg = lambda0(_xReg, _yReg, a);
}

template<typename Num>
__global__ void saxpy0(Array<Num> x, Array<Num> y, Num a,
                       Array<Num> _returnValue) {
    uint _blockMin = IMUL(blockDim.x, blockIdx.x);
    uint _blockMax = _blockMin + blockDim.x;
    uint _globalIndex = _blockMin + threadIdx.x;
    Num _returnValueReg;
    saxpy0Dev(x, y, a, _globalIndex, _returnValueReg);
    if (_globalIndex < _returnValue.length)
        _returnValue[_globalIndex] = _returnValueReg;
}

    def saxpy(a, x, y):

    return map(lambda xi, yi: a*xi + yi, x, y)


    Phases

[Figure: a reduction executes in two phases (phase 0, phase 1); a scan executes in three (phase 0, phase 1, phase 2).]
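A plain-Python sketch of why a reduction needs two phases (my illustration: chunks stand in for thread blocks, and the phase boundary marks where a global synchronization, i.e. a separate kernel launch, would sit):

import operator
from functools import reduce

def phased_reduce(op, data, block_size=4):
    # Phase 0: each "thread block" reduces its own chunk independently
    partials = [reduce(op, data[i:i + block_size])
                for i in range(0, len(data), block_size)]
    # -- phase boundary: partials written to DRAM; a new launch provides
    #    the global synchronization --
    # Phase 1: reduce the per-block partial results
    return reduce(op, partials)

print(phased_reduce(operator.add, list(range(16))))  # 120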


    Copperhead to CUDA, cont.

The compiler schedules computations into phases. Right now, this composition is done greedily: the compiler tracks global and local availability of all variables and creates a phase boundary when necessary. Fusing work into phases is important for performance.

[Figure: B = reduce(map(A)) and D = reduce(map(C)), scheduled into phase 0 and phase 1.]
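A plain-Python sketch of what fusion buys (my illustration): unfused, the map materializes an intermediate array before the reduction reads it back; fused, each element flows straight into the running reduction in a single pass.

import operator
from functools import reduce

def square(x):
    return x * x

A = list(range(1000))

# Unfused: map writes out a full intermediate array, then reduce reads it back
tmp = list(map(square, A))
B_unfused = reduce(operator.add, tmp)

# Fused: each squared element feeds straight into the running reduction,
# so no intermediate array is ever materialized
B_fused = reduce(operator.add, (square(a) for a in A))

assert B_unfused == B_fused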


    Copperhead to CUDA, cont.

Shared memory is used only for communicating between threads: caching unpredictable accesses (gather), and accessing elements with a uniform stride (shift & rotate). Each device function returns its intermediate results through registers.


    Split

This code is decomposed into 3 phases. The Copperhead compiler takes care of the intermediate variables, and it uses shared memory for the temporaries used in the scans here; everything else is in registers.

def split(input, value):
    flags = map(lambda a: 1 if a < value else 0, input)
    ...
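A minimal sketch of the full scan-based split in plain Python (my reconstruction, assuming the standard flags + exclusive-scan formulation; the original slide's code may differ):

def exscan_add(a):
    # Exclusive +-scan: out[i] = a[0] + ... + a[i-1]
    out, acc = [], 0
    for x in a:
        out.append(acc)
        acc += x
    return out

def split(input, value):
    # Stable split: elements below `value` move to the front, order preserved
    flags = [1 if a < value else 0 for a in input]
    below = exscan_add(flags)                  # slot for each flagged element
    num_below = below[-1] + flags[-1] if flags else 0
    above = [num_below + s for s in exscan_add([1 - f for f in flags])]
    out = [None] * len(input)
    for a, f, lo, hi in zip(input, flags, below, above):
        out[lo if f else hi] = a               # scatter to the computed slot
    return out

print(split([5, 1, 7, 2, 9, 3], 4))  # [1, 2, 3, 5, 7, 9]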


    Interpreting to Copperhead

If the interpreter harvested dynamic type information, it could use the Copperhead compiler as a backend. Fun project: see what kinds of information could be gleaned from the Python interpreter at runtime to figure out what should be compiled via Copperhead to a manycore chip.


    Future Work

Finish support for the basics.

Compiler transformations: nested data parallelism flattening, segmented scans.

Retargetability: Thread Building Blocks / OpenMP / OpenCL.

Bridge Python and Copperhead.

Implement real algorithms with Copperhead: vision/machine learning, etc.
