Robert Strzodka, Stanford University, Max Planck Center
The Chances and Challenges of Parallelism
Comparison of Hardwired (GPU) and Reconfigurable (FPGA) Devices
[Plot: Normalized CPU (double) and CPU-GPU (mixed precision) execution time; seconds per grid node vs. domain size in grid nodes; curves: 1x1 CG: Opteron 250, 1x1 CG: GF7800GTX, 2x2 MG__MG: Opteron 250, 2x2 MG__MG: GF7800GTX]
[Plot: Area of s??e11 float kernels on the xc2v500/xc2v8000 (CG); number of slices vs. bits of mantissa; curves: Adder, Multiplier, CG kernel normalized (1/30)]
2
The Chances
• GPU: 249 GFLOPS single precision, 166 GB/s internal bandwidth, 51.2 GB/s external bandwidth
• FPGA: 192 mad25x18 at 550 MHz + logic, almost unrestricted internal bandwidth, 120.0 GB/s external bandwidth (for all IO pins)
• Clearspeed: 50 GFLOPS double precision, 200.0 GB/s internal bandwidth, 6.4 GB/s external bandwidth
• Cell BE: 230 GFLOPS single precision, 21 GFLOPS double precision, 204.8 GB/s internal bandwidth, 25.6 GB/s external bandwidth
3
The Challenges
• Computing Paradigms
• Parallel Programming
• Precision and Accuracy
• Algorithmic Optimization
• Large Range Scaling
4
Instruction-Stream-Based Processing
[Diagram: a processor reads instructions through a cache while data travels back and forth between memory and the processor]
Software | Hardware
5
Data-Stream-Based Processing
[Diagram: data streams from memory through a pipeline and back to memory; the pipeline is built from a configuration]
Flowware | Hardware/Morphware | Configware
Nomenclature from [Reiner Hartenstein. Data-stream-based computing: Models and architectural resources, MIDEM 2003]
6
The Challenges
• Computing Paradigms
• Parallel Programming
• Precision and Accuracy
• Algorithmic Optimization
• Large Range Scaling
7
PDE Example: The Poisson Problem

Given a function $b: \Omega \to \mathbb{R}$, find $u: \Omega \to \mathbb{R}$ such that
$-\Delta u = b$ inside the domain $\Omega$, $\quad u = 0$ on the boundary $\partial\Omega$.

In 2D the Laplace operator is given as
$\Delta u(x,y) := \dfrac{\partial^2 u}{\partial x^2} + \dfrac{\partial^2 u}{\partial y^2}$

[Figure: domain $\Omega$, boundary $\partial\Omega$, solution $u(x,y)$]
8
PDE Example: Discretization and Solvers

After discretization the Poisson problem $-\Delta u = b$ becomes a linear equation system $Au = b$.

For large systems, $Au = b$ is typically solved with iterative schemes:
$u^0 :=$ initial guess, $\quad u^{k+1} = u^k + G(u^k)$
We obtain a convergent series: $(u^k)_k = u^0, u^1, u^2, \ldots \longrightarrow u^{*}$ for $k \to \infty$.

For small systems, $Au = b$ is typically solved with an LU decomposition:
$PA = LU, \quad Ly = Pb, \quad Uu = y$
9
Matrix Vector Product as Stencil Operation

[Diagram: grid at step n and step n+1; the new value at node $\alpha$ depends only on nodes $\beta$ in a fixed neighborhood $|\alpha - \beta| \le C$]

$v^{n+1}_\alpha = F_h(\vec{v}^{\,n})_\alpha = \sum_{\beta:\, |\alpha-\beta| \le C} A_{\alpha,\beta}\, v^n_\beta$
10
Maths: Banded Matrix Vector Product r = Av

• Flowware
$\vec{v}, \vec{r} \in \mathbb{R}^{WIDTH \cdot HEIGHT}$, $\quad A \in \mathbb{R}^{(WIDTH \cdot HEIGHT) \times (WIDTH \cdot HEIGHT)}$, with $A_{\alpha,\beta} \ne 0$ only for the band offsets $\beta_0, \beta_1, \beta_2, \beta_3, \ldots$ of the $3 \times 3$ stencil

• Configware
$r_\alpha = F(A_{\alpha,.}, \vec{v})(\alpha) = (A\vec{v})_\alpha = \sum_\beta A_{\alpha,\beta}\, v_\beta$
$r_\alpha = A_{\alpha,\alpha+\beta_0}\, v_{\alpha+\beta_0} + A_{\alpha,\alpha+\beta_1}\, v_{\alpha+\beta_1} + A_{\alpha,\alpha+\beta_2}\, v_{\alpha+\beta_2} + \ldots$
The same kernel is applied at every node: $r_{\alpha_0} = F(A_{\alpha_0,.}, \vec{v})$, $\; r_{\alpha_1} = F(A_{\alpha_1,.}, \vec{v})$, $\; r_{\alpha_2} = F(A_{\alpha_2,.}, \vec{v})$, $\ldots$
11
CPU: Banded Matrix Vector Product r = Av

• Configware in C/C++

  float kernel( float v[HEIGHT][WIDTH], float A[HEIGHT][WIDTH][3][3],
                int x, int y ) {
    float r= 0;
    for( int yo= -1; yo <= 1; yo++ ) {
      for( int xo= -1; xo <= 1; xo++ ) {
        r+= A[y][x][yo+1][xo+1] * v[y+yo][x+xo];
      }
    }
    return r;
  }

• Flowware in C/C++

  extern float A[HEIGHT][WIDTH][3][3];
  extern float r[HEIGHT][WIDTH], v[HEIGHT][WIDTH];
  for( int y= 0; y < HEIGHT; y++ ) {
    for( int x= 0; x < WIDTH; x++ ) {
      r[y][x]= kernel( v, A, x, y );
    }
  }
12
GPU: Banded Matrix Vector Product r = Av

• Configware in Cg (high level language for GPUs)

  float kernel( array2d v, array2d Al, array2d Ac, array2d Au,
                float2 xy : WPOS ) : COLOR {
    float r= 0;
    array2d A[3]= { Al, Ac, Au };
    for( int yo= -1; yo <= 1; yo++ ) {
      for( int xo= -1; xo <= 1; xo++ ) {
        r+= arr2d(A[yo+1],xy)[xo+1] * arr2d(v,xy+float2(xo,yo));
      }
    }
    return r;
  }

• Flowware in C++

  // load configware to the GPU, define names for arrays, then initialize
  // enum EnumArr { ARR_r, ARR_v, ARR_Al, ARR_Ac, ARR_Au, ARR_NUM };
  for( int i= 0; i < ARR_NUM; i++ ) {
    GPUArr* arr= new GPUArr( "Array name", (i<=ARR_v)? 1 : 3 );
    arr->Initialize(WIDTH, HEIGHT);
    arrP.push_back(arr);
  }
  // ...
  SciGPU::op( ARR_r, VP_ID, FP_MAT_VEC, ARR_v, ARR_Al, ARR_Ac, ARR_Au );
13
FPGA: Banded Matrix Vector Product r = Av

• Configware in ASC (high level language for FPGAs)

  void kernel() {
    HWfloatFormat(32, 24, SIGNMAGNITUDE);
    Arch(OUT); IOtype<float> r_out;      Arch(TMP); HWfloat r;
    Arch(IN);  IOtype<float> v_in;       Arch(TMP); HWfloat v;
    Arch(IN);  IOtype<float> A_in[3][3]; Arch(TMP); HWfloat A[3][3];
    v= v_in; r= 0;
    UNROLL_LOOP( int yo= 0; yo < 3; yo++ ) {
      UNROLL_LOOP( int xo= 0; xo < 3; xo++ ) {
        A[yo][xo]= A_in[yo][xo];
        r+= A[yo][xo] * prev(v, yo*WIDTH+xo);
      }
    }
    r_out= r;
  }

• Flowware in C++
  – The FPGA will use the same framework as the GPU.
  – Object-orientation: One interface, different implementations.
  – In development.

[Oskar Mencer: ASC, A Stream Compiler for Computing with FPGAs, IEEE Trans. CAD, 2006]
14
GPU Programming

[Diagram of the software layers involved:]
• Application: e.g. in C/C++, Java, Fortran, Perl
• GPU library: hides the graphics-specific details
• Shader programs: e.g. in HLSL, GLSL, Cg
• Graphics API: e.g. OpenGL, DirectX
• Window manager: e.g. GLUT, Qt, Win32, Motif
• Operating system: e.g. Windows, Unix, Linux, MacOS
• Graphics hardware: e.g. Radeon (ATI), GeForce (NV)
FPGA Programming
slide courtesy of Oskar Mencer

ASC bridges the VLSI CAD Productivity Gap with a Software Approach to Hardware Generation
• The traditional hardware design process (System Level Model, Behavioral Synthesis, RTL / Libraries, Logic Synthesis) is vertically fragmented across many companies, file formats, etc. This is the major culprit for the productivity gap.
• ASC covers Architecture Generation, Module Generation and the Gate Level (PamDC), driven by a parallelizing compiler or manual optimization, with Flowware and Configware as inputs.
• Very high performance: the programmer has easy access to the design on all levels of abstraction.
• Easy to use: C++ syntax with custom types. E.g. the most comprehensive floating point library available today (>200 different units) was created in 2 months!
16
The Challenges
• Computing Paradigms
• Parallel Programming
• Precision and Accuracy
• Algorithmic Optimization
• Large Range Scaling
17
The Erratic Roundoff Error

[Plot: Roundoff error for $0 = f(a) := |(1+a)^3 - (1+3a^2) - (3a+a^3)|$; $y = \log_2(f(a))$ (0 mapped to $2^{-100}$) vs. $x = \log_2(1/a)$, $a = 1/2^x$; curves: single precision, double precision; smaller is better]
18
Precision and Accuracy
• There is no monotonic relation between the computational precision and the accuracy of the final result.
• Increasing precision can decrease accuracy!
• The increase or decrease of precision in different parts of a computation can have very different impact on the accuracy.
• The above can be exploited to significantly reduce the precision in parts of a computation without a loss in accuracy.
• We obtain a mixed precision method.
19
Resource Consumption for Integer Operations

  Operation                    Area      Latency
  min(r,0), max(r,0)           b+1       2
  add(r1,r2), sub(r1,r2)       2b        b
  add(r1,r2,r3) → add(r4,r5)   2b        1
  mult(r1,r2), sqr(r)          b(b-2)    b log(b)
  sqrt(r)                      2c(c-5)   c(c+3)

b: bitlength of argument, c: bitlength of result
20
Resource Consumption on a FPGA

[Plot: Area of s??e11 float kernels on the xc2v500/xc2v8000 (CG); number of slices vs. bits of mantissa; curves: Adder, Multiplier, CG kernel normalized (1/30); smaller is better]
21
Generalized Iterative Refinement

For a function $F: \mathbb{R}^N \to \mathbb{R}^N$ with parameters $Q_0 \in \mathbb{R}^M$, find $X \in \mathbb{R}^N$ with $F(X; Q_0) = 0$.

As we cannot solve exactly, we iterate, starting with some $X_0 \in \mathbb{R}^N$:
$Q_{k+1} := H(X_k, Q_0, \ldots, Q_k), \quad F(\tilde{X}_{k+1}; Q_{k+1}) = 0, \quad X_{k+1} := X_k + \tilde{X}_{k+1}$,
i.e. we repeatedly solve $F$ with different parameters $Q_k$.

Now we distinguish two cases:
1) We can find an approximate solution directly.
2) The approximate solution itself requires an iterative process.

This is typically used to solve a linear system of equations $AX = B$:
$B_{k+1} := B_k - A\tilde{X}_k, \quad A\tilde{X}_{k+1} - B_{k+1} = 0, \quad X_{k+1} := X_k + \tilde{X}_{k+1}$
22
CPU Results: LU Solver

[Chart courtesy of Jack Dongarra; larger is better]
[Langou et al. Exploiting the performance of 32 bit floating point arithmetic in obtaining 64 bit accuracy (revisiting iterative refinement for linear systems), SC 2006, to appear]
23
GPU Results: Conjugate Gradient (CG) and Multigrid (MG)

[Plot: Performance of double precision CPU and mixed precision CPU-GPU solvers; seconds per grid node vs. data level; curves: CG CPU, CG GPU, MG2+2 CPU, MG2+2 GPU; smaller is better]
24
FPGA Results: Conjugate Gradient with MUL18x18

[Plot: Area of Conjugate Gradient s??e11 float kernels on the xc2v8000; vs. bits of mantissa; curves: Number of Slices, Quadratic fit, Number of 4 input LUTs, Number of Slice Flip Flops, Number of MULT18X18s * 500; smaller is better]
25
FPGA Results: Conjugate Gradient with MUL18x18

[Plot: Frequency/IO of Conjugate Gradient s??e11 float kernels on the xc2v8000; vs. bits of mantissa; curves: Maximal Frequency in MHz, Number of bonded IOBs in 10s; larger is better]
26
The Challenges
• Computing Paradigms
• Parallel Programming
• Precision and Accuracy
• Algorithmic Optimization
• Large Range Scaling
27
Arithmetic Intensity in Matrix-Vector Products

• Analysis of banded MatVec r = Av, pre-assembled
  – Reads per component of r: 9 times into v, once into each band of A → 18 reads
  – Operations per component of r: 9 multiply-adds → 18 ops
  – Arithmetic intensity: 18/18 = 1
• Arithmetic intensity
  – Operations per memory access
  – Computation / bandwidth
• Rule of thumb for CPU/GPU
  – Arithmetic intensity on floats should be > 8
  – On doubles twice as high
28
Trading Computation for Bandwidth

• Three possibilities for a matrix vector product A·v if A depends on some data and must be computed itself
  – On-the-fly: compute entries of A for each A·v application
    • Lowest memory requirement
    • Good for simple entries or seldom use of A
  – Partial assembly: precompute only some intermediate results
    • Allows balancing computation and bandwidth requirements
    • A good choice of precomputed results requires little memory
  – Full assembly: precompute all entries of A, use these in A·v
    • Good if other computations hide the bandwidth problem in A·v
    • Otherwise try to use partial assembly
• For example, pre-compute only $G[U^k]$ when solving
  $U^{k+1} = A[U^k]\, U^k, \quad A[U] := 1 - \tau\, \mathrm{div}_h\!\left( G[U]\, \nabla_h \right)$
29
Standard Conjugate Gradient

[Diagram: data flow through the vectors $\vec{U}_k, \vec{R}_k, \vec{P}_k, \vec{Q}_k$, the matrix $A$ and the scalars $\alpha_k, \beta_k$]

Vector operations 1: $\vec{P}_k = \vec{R}_k + \beta_k \vec{P}_{k-1}, \quad \vec{Q}_k = A\vec{P}_k$
Dot product 1: $\vec{P}_k \cdot \vec{Q}_k$ (yields $\alpha_k$)
Vector operations 2: $\vec{U}_{k+1} = \vec{U}_k + \alpha_k \vec{P}_k, \quad \vec{R}_{k+1} = \vec{R}_k - \alpha_k \vec{Q}_k$
Dot product 2: $\vec{R}_{k+1} \cdot \vec{R}_{k+1}$ (yields $\beta_{k+1}$)
30
Pipelined Conjugate Gradient

[Diagram: the vectors $\vec{U}_k, \vec{R}_k, \vec{P}_k, \vec{Q}_k$, the matrix $A$ and the scalars $\alpha_k, \beta_k$]

Vector operations (one fused pass):
$\vec{U}_{k+1} = \vec{U}_k + \alpha_k \vec{P}_k$
$\vec{R}_{k+1} = \vec{R}_k - \alpha_k \vec{Q}_k$
$\vec{P}_{k+1} = \vec{R}_{k+1} + \beta_{k+1} \vec{P}_k$
$\vec{Q}_{k+1} = A\vec{P}_{k+1}$

Dot products (one fused pass): $\vec{R}_{k+1} \cdot \vec{R}_{k+1}, \quad \vec{R}_{k+1} \cdot \vec{Q}_{k+1}, \quad \vec{P}_{k+1} \cdot \vec{Q}_{k+1}$

Scalar operations: $\alpha_{k+1}, \beta_{k+1}$
31
The Challenges
• Computing Paradigms
• Parallel Programming
• Precision and Accuracy
• Algorithmic Optimization
• Large Range Scaling
32
Discretization Grids
• Equidistant grid
  – Easy to implement
  – One array holds all values
• Deformed tensor-product grid
  – Parallel dynamic updates
  – One array for values, a second for the deformation
33
Discretization Grids
• Unstructured grid
  – Good performance for static, poor for dynamic grid topology
  – An index array is needed
• Adaptive grid
  – Can handle coherently changing dynamic grid topology
  – A hash, tree or page table is needed
34
Glift: Generic, Efficient, Random-Access GPU Data Structures

STL-like abstraction of data containers from algorithms for GPUs

The Glift slides are based on Aaron Lefohn's presentation at the GPGPU Vis05 tutorial
35
Glift: Virtual Memory
• Virtual N-D address space
  – Defined by physical memory and address translator
  – Address translator can be a simple analytical or a complex mapping based on page table, tree or hash.
  – The same user interface irrespective of actual physical storage

[Diagram: abstraction as a virtual representation of memory (3D grid), with translation to 3D native memory, 2D slices, or a flat 3D array]
36
Glift Components

[Diagram: an Application uses Container Adaptors on top of VirtMem, which is implemented by PhysMem plus AddrTrans in C++ / Cg / OpenGL]

Algorithms based on VirtMem do not depend on the physical memory capabilities: data layout optimization, code reuse, portability
37
FEAST: Generalized Tensor-Product Grids
• Sufficient flexibility in domain discretization
  – Global unstructured macro mesh, domain decomposition
  – (An)isotropic refinement into local tensor-product grids
• Efficient computation
  – High data locality, large problems map well to clusters
  – Problem specific solvers depending on anisotropy level
  – Hardware accelerated solvers on regular sub-problems

[Stefan Turek et al. Hardware-oriented numerics and concepts for PDE software, 2006]
38
FEAST: Deformation Adaptivity
• This grid is a tensor-product grid!
• Easier to accelerate in hardware than resolution adaptive grids
• Anisotropy level determines the optimal solver
39
FEAST: Ad-hoc GPU Cluster Performance

[Plot: CPU, GPU Performance Study for 1x16p, 2x16p (Threshold=20K); seconds per macro grid node vs. level; curves: 1x16p CPU MGCPU2, 1x16p GPU FX1400, 2x16p CPU MGCPU2, 2x16p GPU FX1400; smaller is better]
40
Conclusions
• The flowware/configware distinction is important for efficiency; abstract interfaces facilitate programming
• Mixed precision methods often allow reducing the computational precision without a loss of final accuracy
• Balancing arithmetic intensity is more effective than one-sided bandwidth or computation optimizations
• Clever discretizations combine high flexibility with a very efficient parallel data layout for PDE solvers
41
Collaborators and Associated Projects
• FPGAs, ASC
  – Lee Howes, Oliver Pell, Oskar Mencer (Imperial College)
• Mixed Precision Methods, FEAST
  – Dominik Göddeke, Stefan Turek (University of Dortmund)
• Cluster Computing, Scout
  – Patrick McCormick, Advanced Computing Lab (LANL)
• Parallel Adaptive Grids, Glift
  – Aaron Lefohn (Neoptica), Joe Kniss (University of Utah), Shubhabrata Sengupta, John Owens (University of California, Davis)
• Application Integration, PhysBAM
  – Ron Fedkiw's group, physical simulation and computer graphics (Stanford University)