Page 1

Alea Reactive Dataflow
GPU Parallelization Made Simple

Luc Bläser
IFS Institute for Software, HSR Rapperswil

D. Egloff, O. Knobel, P. Kramer, X. Zhang, D. Fabian
(Joint HSR and QuantAlea Zurich)

REBLS’14, 21 Oct 2014

Page 2

GPU Programming Today

Massively parallel power

□ Very specific pattern: vector-parallelism

High obstacles

□ Particular algorithms needed

□ Machine-centric programming models

□ Poor language and runtime integration

Good excuses against it - unfortunately

□ Too difficult, costly, error-prone, marginal benefit

(Illustration: a single GPU with 5760 cores)

Page 3

Our Goal

GPU parallel programming for (almost) everyone

Radical simplification

□ No GPU experience required

□ Fast development

□ High performance comes automatically

□ Guaranteed memory safety

Broad community

□ .NET in general: C#, F#, VB etc.

□ Based on Alea cuBase F# runtime

Page 4

Alea Dataflow Programming Model

Dataflow

□ Graph of operations

□ Data propagated through graph

Reactive

□ Feed input in arbitrary intervals

□ Listen for asynchronous output
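The model can be illustrated with a minimal push-based node in Python (a conceptual sketch with hypothetical names, not the Alea API):

```python
class Node:
    """A minimal push-based dataflow operation: one input, one output."""
    def __init__(self, fn):
        self.fn = fn          # computation applied to each arriving value
        self.listeners = []   # downstream callbacks to push results to

    def connect_to(self, other):
        # Route this node's output into another node's input.
        self.listeners.append(other.send)

    def on_receive(self, callback):
        # Subscribe a plain callback to this node's output.
        self.listeners.append(callback)

    def send(self, value):
        # Apply the operation and push the result downstream.
        result = self.fn(value)
        for listener in self.listeners:
            listener(result)

# Build a two-stage graph: double, then add one.
double = Node(lambda x: 2 * x)
plus_one = Node(lambda x: x + 1)
double.connect_to(plus_one)

results = []
plus_one.on_receive(results.append)
double.send(5)   # propagates 5 -> 10 -> 11
```

Data can be sent at arbitrary times; each send propagates through the whole graph.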

Page 5

The Descriptive Power

Program is purely descriptive

□ What, not how

Efficient execution behind the scenes

□ Vector-parallel operations

□ Stream operations on GPU

□ Minimize memory copying

□ Hybrid multi-platform scheduling

□ Tune degree of parallelization

□ …

Page 6

Operation

Unit of calculation (typically vector-parallel)

Input and output ports

Port = stream of typed data

Consumes input, produces output

Map
□ Input: T[]
□ Output: U[]

MatrixProduct
□ Left: T[,], Right: T[,]
□ Output: T[,]

Splitter
□ Input: Tuple<T, U>
□ First: T, Second: U
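As a sketch of an operation with multiple output ports, here is a hypothetical Python analogue of the Splitter (names and structure are illustrative, not the Alea API):

```python
class Splitter:
    """Splits each pair arriving on the input port into two output streams."""
    def __init__(self):
        self.first_listeners = []    # subscribers to the First port
        self.second_listeners = []   # subscribers to the Second port

    def on_first(self, callback):
        self.first_listeners.append(callback)

    def on_second(self, callback):
        self.second_listeners.append(callback)

    def send(self, pair):
        # Consume one input tuple, produce one value on each output port.
        left, right = pair
        for callback in self.first_listeners:
            callback(left)
        for callback in self.second_listeners:
            callback(right)

splitter = Splitter()
firsts, seconds = [], []
splitter.on_first(firsts.append)
splitter.on_second(seconds.append)
splitter.send(("A", 1))
splitter.send(("B", 2))
```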

Page 7

Graph

(Graph: Random → Pairing → Map → Average)

var randoms = new Random<float>(0, 1);
var coordinates = new Pairing<float>();
var inUnitCircle = new Map<Pair<float>, float>(
    p => p.Left * p.Left + p.Right * p.Right <= 1 ? 1f : 0f);
var average = new Average<float>();

randoms.Output.ConnectTo(coordinates.Input);
coordinates.Output.ConnectTo(inUnitCircle.Input);
inUnitCircle.Output.ConnectTo(average.Input);

(Edge types in the diagram: float[], int[], Pair<float>[])
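This graph estimates π by Monte Carlo sampling: the fraction of random points in the unit square that fall inside the unit circle approaches π/4. A sequential Python sketch of the same computation (not the Alea API):

```python
import random

def monte_carlo_pi(n, seed=42):
    """Sequential version of the Random -> Pairing -> Map -> Average graph."""
    rng = random.Random(seed)
    randoms = [rng.random() for _ in range(2 * n)]           # Random
    coordinates = list(zip(randoms[0::2], randoms[1::2]))    # Pairing
    in_circle = [1.0 if x * x + y * y <= 1 else 0.0
                 for (x, y) in coordinates]                  # Map
    average = sum(in_circle) / len(in_circle)                # Average
    return 4 * average                                       # 4 * (pi/4)

estimate = monte_carlo_pi(100_000)
```

In the dataflow version, each stage runs as a vector-parallel operation on the GPU instead of a sequential list comprehension.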

Page 8

Dataflow

Send data to input port

Receive from output port

All asynchronous

(Graph: Random → Pairing → Map → Average)

average.Output.OnReceive(x => Console.WriteLine(4 * x));

randoms.Input.Send(1000);
randoms.Input.Send(1000000);

(Timeline: Send(1000) and Send(1000000) each trigger an asynchronous Write(…) with the corresponding result)
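The asynchronous send/receive behaviour can be mimicked in Python with a worker thread and a queue (a conceptual sketch of the scheduling idea, not the Alea scheduler):

```python
import queue
import threading

def average_worker(inputs, on_receive):
    """Consumes batches from the input queue and pushes averages out,
    like an operation running asynchronously on a device."""
    while True:
        batch = inputs.get()
        if batch is None:          # sentinel: shut down
            break
        on_receive(sum(batch) / len(batch))

inputs = queue.Queue()
results = []
worker = threading.Thread(target=average_worker,
                          args=(inputs, results.append))
worker.start()

inputs.put([1.0, 2.0, 3.0])   # send: returns immediately
inputs.put([10.0, 20.0])      # second batch queued behind the first
inputs.put(None)              # no more input
worker.join()                 # wait for all asynchronous outputs
```

Sends return immediately; outputs arrive later through the registered callback, in input order.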

Page 9

Short Fluent Notation

var randoms = new Random<float>(0, 1);
randoms.Pairing()
    .Map(p => p.Left * p.Left + p.Right * p.Right <= 1 ? 1f : 0f)
    .Average()
    .OnReceive(x => Console.WriteLine(4 * x));

randoms.Send(100);
randoms.Send(100000);

Page 10

Algebraic Computation

A * B + C

(Graph: MatrixProduct → MatrixSum)

var product = new MatrixProduct<float>();
var sum = new MatrixSum<float>();

product.Output.ConnectTo(sum.Left);

sum.Output.OnReceive(Console.WriteLine);

product.Left.Send(A);
product.Right.Send(B);
sum.Right.Send(C);

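The A * B + C graph computes an ordinary matrix product followed by an elementwise matrix sum. A plain Python sketch of the underlying arithmetic (hypothetical helpers, no GPU):

```python
def mat_mul(A, B):
    """Matrix product of an n x k and a k x m matrix (nested lists)."""
    n, k, m = len(A), len(B), len(B[0])
    return [[sum(A[i][p] * B[p][j] for p in range(k)) for j in range(m)]
            for i in range(n)]

def mat_add(X, Y):
    """Elementwise sum of two equally sized matrices."""
    return [[x + y for x, y in zip(rx, ry)] for rx, ry in zip(X, Y)]

A = [[1, 2], [3, 4]]
B = [[5, 6], [7, 8]]
C = [[1, 0], [0, 1]]
result = mat_add(mat_mul(A, B), C)   # A * B + C
```

In the dataflow version, the product and sum are separate operations and the scheduler keeps the intermediate A * B on the GPU rather than copying it back to the host.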

Page 11

Iterative Computation

bᵢ₊₁ = A · bᵢ (until bᵢ₊₁ = bᵢ)

var source = new Splitter<double[,], double[]>();
var product = source.First.Multiply(source.Second);
var steady = product.Compare(source.Second,
    (x, y) => Math.Abs(x - y) < 1E-6);
var next = source.First.Merge(product);
var branch = steady.Switch(next);
branch.False.ConnectTo(source.Input);
branch.True.OnReceive(Console.WriteLine);
source.Send(new Tuple<double[,], double[]>(A, b0));

(Graph: the Splitter feeds A and bᵢ into the MatrixVectorProduct; the Comparison tests bᵢ₊₁ against bᵢ and drives the Switch with a bool; the Merger rebuilds (A, bᵢ₊₁), which the Switch routes back to the Splitter while false and to the output when true)
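The cycle in the graph expresses this fixed-point iteration. A sequential Python sketch of the same loop (hypothetical helper; the example matrix has a fixed point, so the iteration terminates):

```python
def iterate_until_steady(A, b, tol=1e-6, max_iter=1000):
    """Repeat b <- A . b until successive vectors differ by less than tol."""
    for _ in range(max_iter):
        nxt = [sum(a * x for a, x in zip(row, b)) for row in A]  # A . b
        if all(abs(x - y) < tol for x, y in zip(nxt, b)):        # Comparison
            return nxt                                           # Switch: true
        b = nxt                                                  # Switch: false
    raise RuntimeError("did not converge")

A = [[0.5, 0.5], [0.5, 0.5]]   # averages the components; fixed point reached fast
b = iterate_until_steady(A, [1.0, 0.0])
```

The reactive model handles such cycles naturally: the back edge is just another connection, and termination is data-driven via the Comparison and Switch operations.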

Page 12

Current Scheduler

Operations implement GPU and/or CPU versions

GPU operations combined to stream

Memory copy only when needed

Host scheduling with .NET TPL

Automatic free memory management

(Diagram: operations scheduled across GPU and CPU, with memory copies only at the transitions between the two)

Page 13

Operation Catalogue

Prefabricated generic operations
□ Switch, Merger, Splitter, Comparison

□ Map, Reduction, Average, Pairing

□ Random, MatrixProduct, MatrixSum, MatrixVectorProduct, VectorSum, ScalarProduct

□ More to come…

Custom operations can be added

Good performance
□ Nearly as fast as native C CUDA: overhead about 10%

• Performance depends on operation implementation

• Small overhead (cross-compilation, managed to unmanaged interop, scheduler)

Page 14

Related Works

Rx.NET / TPL Dataflow
□ Single input and output port
□ Not for GPU

Xcelerit
□ Not reactive: single flow per graph
□ No generic operations with functors

MSR PTasks / Dandelion
□ Synchronous receive, on C++, no generic operations
□ .NET LINQ integration (pull instead of push)

Fastflow
□ Not reactive (sync run of the graph)
□ More low-level C++ tasks, no functors

Nikola
□ Implicit dataflow described by functions
□ Limited set of operations

Page 15

Conclusions

Simple but powerful GPU parallelization in .NET

□ No low-level GPU artefacts

□ Fast and condensed problem formulation

□ Efficient and safe execution by the scheduler

The descriptive paradigm is the key

□ Reactive makes it very general: cycles, infinite streams, etc.

□ Practical suitability depends on operations

Future directions

□ Advanced schedulers: multi GPUs, cluster, optimizations

□ Larger operation catalogue, optimized operations

