Date post: | 29-Dec-2015 |
Category: |
Documents |
Upload: | priscilla-atkinson |
View: | 215 times |
Download: | 0 times |
K F U S I O N
Simple Annotations for Optimized Data FlowLiam Kiemele, Celina Berg, Aaron Gulliver, Yvonne Coady
University of Victoria
with thanks to Tim Mattson, Andrew Brownsword (Intel)
2
Road MapKFusion at work
Motivation
KFusion
Costs and benefitsannotations, lines of code
modularity, performance
Future work and conclusionexplicit composition of computation around data flow
IWOCL 2013 Kiemele
KFusion
Eval
Future!
4
Good News and Bad News…
ParallelismAdded complexity
OptimizationMemory and Bandwidth
Modularity: Let’s talk LibrariesDetails behind an API
Optimize data access (prefetching, caching…)
Better separation of concerns
IWOCL 2013 Kiemele
Background
5
OpenCL LibrariesOpenCL (Computing Language), for CPUs and GPUs
At the heart of any given library will be kernels
Suppose we build an OpenCL Linear Algebra Library
__kernel void add_vectors(__global float* sum, __global float* v1, __global float* v2) {
int i = get_global_id(0); sum[i] = v1[i] + v2[i];}
IWOCL 2013 Kiemele
KFusion
6
What you get…c = sqrt(add(square(x), square(y));
square
square
add
sqrt
IWOCL 2013 Kiemele
KFusion
7
What you get…c = sqrt(add(square(x), square(y));
IWOCL 2013 Kiemele
KFusion
Kernel Operation
Memory Access Cycles
square 1 load and store 804
square 1 load and store 804
add 1 2 loads and 1 store
804
sqrt 1 load and store 804
total 4 9 3216
9
What you WANT!c = sqrt(add(square(x), square(y));
IWOCL 2013 Kiemele
KFusion
Kernel Operation
Memory Access Cycles
square 1 load and store 804
square 1 load and store 804
add 1 2 loads and 1 store
804
sqrt 1 load and store 804
total 4 9 3216
Kernel Operation
Memory Access Cycles
fu
1 load 404
1 load 404
1 - 4
1 store 404
total 4 3 1216
10
Two Choices
IWOCL 2013 Kiemele
KFusion
Modular ImplementationReusable
Easy to maintain and develop
Individual Kernel optimization
Monolithic ImplementationPerformance
Allows for optimizations which will otherwise exist between modules
Can we do both?
11
Introducing KFusion
IWOCL 2013 Kiemele
square(…) kernel square
Application File
Library File Kernel File
float* square
square(…)
add(…)kernel add …
float* add …
sqrt(…)kernel sqrt …
float* sqrt …
Kernel Operation Memory Access Cycles
square 1 load and store 804
square 1 load and store 804
add 1 2 loads and 1 store 804
sqrt 1 load and store 804
total 4 9 3216
KFusion
12
After KFusion…
IWOCL 2013 Kiemele
square(…) kernel square
Application File
Library File Kernel File
void square …
square(…)
add(…)kernel add …
void add …
sqrt(…)kernel sqrt …
void sqrt …
New Call:c = fu(…);
New Function:
float* fu(…)
New Kernel:kernel fu(…)
Kernel Operation Memory Access Cycles
fu
1 load 404
1 load 404
1 - 4
1 store 404
total 4 3 1216
KFusion
13
It works!
1024 2048 4096 81920
0.01
0.02
0.03
0.04
0.05
0.06
0.07
0.08
Before KFusionAfter KFusion
vector size
tim
e (
ms)
IWOCL 2013 Kiemele
KFusion
14
Road MapKFusion at work
what and how
…why!
Costs and benefitsannotations, lines of code
modularity, performance
Future work and conclusionexplicit composition of computation around data flow
IWOCL 2013 Kiemele
KFusion
Eval
Future!
15
CostsAnnotations
application hints
library synchronization
kernel data flow for compositions
Preprocessorbuild dependency graph
source-to-source transformationloop fusion
deforestation
IWOCL 2013 Kiemele
KFusion
Eval
16
Annotations#pragma start fuse
square(x,x)
square(y,y)
add(c,x,y)
sqrt(c, c) c = sqrt(add(, square(y));#pragma end fuse
#pragma sync out
public void dot_product(double result, vector x);
#pragma sync in
public void matrix_vector_mult(vector b, Matrix A, vector x)
IWOCL 2013 Kiemele
applicatio
n
Library
KFusion
Eval
17
Annotations__kernel void add_vectors(__global float* sum, __global float* v1,
__global float* v2) {#pragma kload
{
int i = get_global_id(0); float arg1 = v1[i]; float arg2 = v2[i]; float s; } s = arg1 + arg2; #pragma kstore
{ sum[i] = s; }}
IWOCL 2013 Kiemele
kernel
KFusion
Eval
add
AOSD 2013 Kiemele 21
Annotations
Imag
e Man
ipul
atio
n
Line
ar A
lgeb
ra
Phys
ics En
gine
0
5
10
15
20
25
ApplicationKernel/Library
KFusion
Eval
22
Benefits
IWOCL 2013 Kiemele
KFusion
Eval
Imag
e Man
ipul
atio
n
Line
ar A
lgeb
ra
Phys
ics En
gine
0
200
400
600
800
1000
1200
Lines of Code Generated
LibraryKernelFused LibraryFused Kernel
23
Performance
IWOCL 2013 Kiemele
KFusion
Eval
512 1024 2048 40960.5
1
1.5
2
2.5
3
3.5
4
4.5
Speedup of Fused Kernels over Unfused Kernels vs Image Size
Low DependencyMedium DependencyHigh DependencyHigh Dependency Low Fusion
Image Size (pixels)
Rela
tive p
erf
orm
an
ce in
cre
ase
aft
er
fusi
on
24
Performance
IWOCL 2013 Kiemele
KFusion
Eval
512 1024 2048 40960
0.2
0.4
0.6
0.8
1
1.2
Speedup of Automatic Fusion over Manual Fusion vs Image Size
Low Dependency
Medium Dependency
High Dependency
High Dependency Low Fusion
Image Size (Pixels)
Rela
tive S
peed
up
AOSD 2013 Kiemele 25
Roofline Analysis of Performance
Peak Actual GFlops =minimum(Bandwidth x flops/byte, Peak Performance)
Three Linear Algebra Scenariosc = sqrt(a2 + b2)
d = sqrt( (x1 – x2)2 + (y1 – y2)2)
Start of conjugate gradientr = Ax – b
p = r
R2 = r*r
AOSD 2013 Kiemele 29
Road MapKFusion at work
what and how
…why!
Costs and benefitsannotations, lines of code
modularity, performance
Future work and conclusionexplicit composition of computation around data flow
KFusion
Eval
Future!
30
Future WorkTools
comprehension and visualization
emulation
performance testing
Combine with other approaches
Optimizing compiles
Code Generators
IWOCL 2013 Kiemele
KFusion
Eval
Future!
kfuse{ calls}
__kernel void k(…) { kload { … } computation kstore { … }}