Benchmarks on BG/L: Parallel and Serial
John A. Gunnels, Mathematical Sciences Dept.
IBM T. J. Watson Research Center
Overview
- Single-node benchmarks: architecture, algorithms
- Linpack: dealing with a bottleneck; communication operations
- Benchmarks of the future
Compute Node: BG/L
- Dual core
- Dual FPU/SIMD: alignment issues
- Three-level cache: pre-fetching
- Non-coherent L1 caches: 32 KB, 64-way, round-robin
- L2 & L3 caches coherent
- Outstanding L1 misses (limited)
Programming Options: High to Low Level
- Compiler optimization to find SIMD parallelism
  - User input for specifying memory alignment and lack of aliasing: alignx assertion, disjoint pragma
- Dual FPU intrinsics ("built-ins")
  - Complex data type used to model a pair of double-precision numbers that occupy a (P, S) register pair
  - Compiler responsible for register allocation and scheduling
- In-line assembly
  - User responsible for instruction selection, register allocation, and scheduling
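As a sketch of the high-level option: the portable C99 `restrict` qualifier plays the role of XL's disjoint pragma (a no-aliasing promise), and the XL-only `__alignx` assertion is guarded so the file still builds with any C99 compiler:

```c
#include <assert.h>

/* "High-level" programming option sketch: hand the compiler the facts
 * it needs to generate dual-FPU (SIMD) code.  `restrict` promises no
 * aliasing (XL's #pragma disjoint makes the same promise); __alignx is
 * the XL-specific alignment assertion, guarded for other compilers. */
void daxpy_hinted(int n, double a,
                  const double * restrict x, double * restrict y)
{
#ifdef __IBMC__
    __alignx(16, x);   /* XL assertion: x is 16-byte aligned */
    __alignx(16, y);   /* needed for paired (quadword) loads/stores */
#endif
    for (int i = 0; i < n; i++)
        y[i] += a * x[i];
}
```

With these facts in hand the compiler can use paired loads into (P, S) register pairs instead of falling back to scalar code.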
BG/L Single-Node STREAM Performance (444 MHz), 28 July 2003
[Figure: bandwidth (MB/s, 0-3000) vs. vector size (8-byte elements, up to 2,000,000) for tuned and out-of-box (OOB) versions of copy, scale, add, and triad]
STREAM Performance
- Out-of-box performance is 50-65% of tuned performance
- Lessons learned in tuning will be transferred to the compiler where possible
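For reference, the kernels being timed are the four standard STREAM loops; a minimal portable sketch of triad, with the timing and bandwidth bookkeeping omitted:

```c
#include <assert.h>

/* The STREAM triad kernel: two 8-byte loads and one 8-byte store per
 * iteration, so reported bandwidth counts 24*n bytes per pass. */
void stream_triad(int n, double *a, const double *b, const double *c,
                  double scalar)
{
    for (int j = 0; j < n; j++)
        a[j] = b[j] + scalar * c[j];
}
```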
Comparison with commodity microprocessors is competitive

Machine        Frequency (MHz)   STREAM (MB/s)   FP peak (Mflop/s)   Balance (B/F)
Intel Xeon     3060              2900            6120                0.474
BG/L           444               2355            3552                0.663
BG/L           670               3579            5360                0.668
AMD Opteron    1800              3600            3600                1.000
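The Balance column is simply sustained STREAM bandwidth divided by peak floating-point rate; as a sanity check on the table:

```c
#include <assert.h>

/* Balance (bytes per flop): sustained memory bandwidth over peak
 * floating-point rate.  MB/s over Mflop/s, so the units cancel. */
double balance(double stream_mbs, double peak_mflops)
{
    return stream_mbs / peak_mflops;
}
```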
DAXPY Bandwidth Utilization
[Figure: memory bandwidth utilization vs. vector size for different implementations of DAXPY; bandwidth (bytes/cycle, 0-16) vs. vector size (bytes, up to 90,000) for intrinsics, assembly, and vanilla versions, with reference lines at 16 bytes/cycle (L1 bandwidth) and 5.3 bytes/cycle (L3 bandwidth)]
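Much of the gap between the vanilla curve and the intrinsic/assembly curves comes from pairing loads for the (P, S) register files; a portable sketch of that structure, with manual 2-way unrolling standing in for the paired quadword loads:

```c
#include <assert.h>

/* 2-way unrolled DAXPY.  On BG/L the (i, i+1) pair would be a single
 * quadword load filling a (P, S) register pair; here plain scalar code
 * mimics the shape portably. */
void daxpy_unrolled(int n, double a, const double *x, double *y)
{
    int i;
    for (i = 0; i + 1 < n; i += 2) {   /* process (primary, secondary) pairs */
        y[i]     += a * x[i];
        y[i + 1] += a * x[i + 1];
    }
    for (; i < n; i++)                 /* odd-length remainder */
        y[i] += a * x[i];
}
```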
Matrix Multiplication: Tiling for Registers (Analysis)
- Latency tolerance (not bandwidth) is the concern
- Take advantage of the register count
- Unroll by a factor of two: 24 register pairs, 32 cycles per unrolled iteration, 15-cycle load-to-use latency (L2 hit)
- Could go to a 3-way unroll if needed: 32 register pairs, 32 cycles per unrolled iteration, 31-cycle load-to-use latency
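A hedged illustration of the analysis above: the micro-kernel below keeps a small tile of C in locals while streaming through K, so the independent multiply-adds cover the load-to-use latency. The 2x2 tile is illustrative only; the actual BG/L kernel is sized to its 32 register pairs.

```c
#include <assert.h>

/* Register-blocked micro-kernel: a 2x2 tile of C lives in locals for
 * the whole K loop, so each loaded element of A and B is reused twice
 * and the accumulators give the scheduler independent work to hide
 * load latency.  Row-major storage with leading dimensions lda/ldb/ldc. */
void micro_kernel_2x2(int K,
                      const double *A, int lda,   /* A is 2 x K */
                      const double *B, int ldb,   /* B is K x 2 */
                      double *C, int ldc)         /* C is 2 x 2 */
{
    double c00 = 0, c01 = 0, c10 = 0, c11 = 0;
    for (int k = 0; k < K; k++) {
        double a0 = A[0 * lda + k], a1 = A[1 * lda + k];
        double b0 = B[k * ldb + 0], b1 = B[k * ldb + 1];
        c00 += a0 * b0;  c01 += a0 * b1;
        c10 += a1 * b0;  c11 += a1 * b1;
    }
    C[0 * ldc + 0] += c00;  C[0 * ldc + 1] += c01;
    C[1 * ldc + 0] += c10;  C[1 * ldc + 1] += c11;
}
```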
[Figure: register-block diagram (operands F1, F2, M1, M2; tile dimensions 8, 8, 16, 16)]
Recursive Data Format
- Mapping 2-D (matrix) to 1-D (RAM): C/Fortran orderings do not map well
- Space-filling-curve approximation: recursive tiling
- Enables streaming/pre-fetching and dual-core "scaling"
[Figure: recursive tiling hierarchy, from register-set blocks and dual-register blocks up through L1 cache blocks and L3 cache blocks]
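One common space-filling-curve approximation is Morton (Z-order) indexing, sketched below. This illustrates the general idea of keeping 2-D neighbors close in 1-D memory; it is not the deck's exact recursive format.

```c
#include <assert.h>
#include <stdint.h>

/* Morton (Z-order) index: interleave the bits of the block-row and
 * block-column indices.  Nearby (row, col) pairs land at nearby linear
 * addresses, unlike plain row- or column-major order, which helps
 * streaming and pre-fetching across cache levels. */
uint32_t morton_index(uint16_t row, uint16_t col)
{
    uint32_t z = 0;
    for (int b = 0; b < 16; b++) {
        z |= (uint32_t)((row >> b) & 1u) << (2 * b + 1);  /* odd bits  */
        z |= (uint32_t)((col >> b) & 1u) << (2 * b);      /* even bits */
    }
    return z;
}
```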
Dual Core
- Why? It's an effortless way to double your performance…

Dual Core
- Why? It exploits the architecture and may allow one to double the performance of one's code in some cases/regions
Single-Node DGEMM Performance at 92% of Peak
[Figure: single-node DGEMM at 444 MHz, 18 July 2003; performance (GFlop/s, 0-4) vs. matrix size N (0-250) for single-core and dual-core codes against their respective peaks; the dual-core code reaches 92.27% of peak]
- Near-perfect scalability (1.99x) going from single core to dual core
- Dual-core code delivers 92.27% of peak flops (8 flop/pclk)
- Performance (as a fraction of peak) is competitive with that of Power3 and Power4
Performance Scales Linearly with Clock Frequency
[Figure: speed test of STREAM COPY, 25 July 2003; bandwidth (MB/s) vs. frequency (400-640 MHz) and vector size N]
- Measured performance of DGEMM and STREAM scales linearly with frequency
- DGEMM at 650 MHz delivers 4.79 Gflop/s
- STREAM COPY at 670 MHz delivers 3579 MB/s
[Figure: speed test of DGEMM, 25 July 2003; performance (GFlop/s) vs. frequency (400-640 MHz) and matrix size N]
The Linpack Benchmark
LU Factorization: Brief Review
[Figure: one step of blocked LU; regions labeled "already factored", "current block", "pivot and scale columns", DTRSM (row panel), and DGEMM (trailing update)]
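The picture corresponds to the classic right-looking factorization. A minimal unblocked sketch follows; in the blocked code, the pivot-and-scale step becomes the panel factorization, the row solve becomes DTRSM, and the rank-1 update becomes a DGEMM on the trailing submatrix.

```c
#include <assert.h>
#include <math.h>

/* Unblocked right-looking LU with partial pivoting on an n x n
 * row-major matrix.  L (unit diagonal) and U overwrite A; piv records
 * the row swapped with row k at each step (the IDAMAX result). */
void lu_factor(int n, double *A, int *piv)
{
    for (int k = 0; k < n; k++) {
        int p = k;                               /* pivot search (IDAMAX) */
        for (int i = k + 1; i < n; i++)
            if (fabs(A[i * n + k]) > fabs(A[p * n + k])) p = i;
        piv[k] = p;
        if (p != k)                              /* swap rows k and p */
            for (int j = 0; j < n; j++) {
                double t = A[k * n + j];
                A[k * n + j] = A[p * n + j];
                A[p * n + j] = t;
            }
        for (int i = k + 1; i < n; i++) {
            A[i * n + k] /= A[k * n + k];        /* scale pivot column */
            for (int j = k + 1; j < n; j++)      /* rank-1 trailing update */
                A[i * n + j] -= A[i * n + k] * A[k * n + j];
        }
    }
}
```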
LINPACK Problem Mapping
[Figure: the order-N problem mapped onto the processor grid (n repetitions … 16n repetitions)]
Panel Factorization: Option #1
- Stagger the computations
- PF is distributed over relatively few processors: it may take as long as several DGEMM updates
- DGEMM load imbalance: block size trades balance for speed
- Use collective communication primitives: may require no "holes" in the communication fabric
Speed-up Option #2
- Change the data distribution
- Decrease the critical-path length
- Consider the communication abilities of the machine
- Complements Option #1:
  - Memory size (small favors #2; large favors #1)
  - Memory hierarchy (higher latency favors #1)
- The two options can be used in concert
Communication Routines
- Broadcasts precede the DGEMM update
- They need to be architecturally aware: multiple "pipes" connect processors, and the physical-to-logical mapping matters
- Careful orchestration is required to take advantage of the machine's considerable abilities
Row Broadcast: Mesh
[Animation: staged row broadcast on the mesh; one node must Recv 2/Send 4 (a hot spot), refined to Recv 2/Send 3]
Row Broadcast: Torus
[Animation: staged row broadcast on the torus; apologies for the "fruit salad" of colored routes]
Broadcast Bandwidth/Latency
- Bandwidth: 2 bytes/cycle per wire
- Latency: sqrt(p), pipelined (large messages); deposit bit: 3 hops
- Mesh: Recv 2/Send 3
- Torus: Recv 4/Send 4 (no "hot spot"); Recv 2/Send 2 (red-blue only; again, no bottleneck)
- Pipe: Recv/Send 1/1 on the mesh; 2/2 on the torus
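A toy model of why the torus wrap-around links matter for broadcast, assuming a broadcast from node 0 along one dimension of length d:

```c
#include <assert.h>

/* Worst-case hop count to reach the farthest node along one dimension
 * of length d.  On a mesh the message must travel the whole line; on a
 * torus the wrap-around link lets it go both ways, halving the
 * distance.  A simplified model, ignoring the deposit-bit mechanism. */
int bcast_hops_mesh(int d)  { return d - 1; }
int bcast_hops_torus(int d) { return d / 2; }
```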
What Else? It's a(n) …
- FPU test
- Memory test
- Power test
- Torus test
- Mode test
Conclusion
- 1.435 TF Linpack: #73 on the TOP500 list (11/2003)
- Limited machine-access time made analysis (prediction) more important
- 500 MHz chip: the 1.507 TF run at 525 MHz demonstrates scaling
- Would achieve >2 TF at 700 MHz, and 1 TF even if the machine were used in "true" heater mode
Conclusion
1.4 TF Linpack on BG/L Prototype: Components
[Pie chart: fraction of time spent in Scale, Rank1, Gemm, Trsm, BcastA, BcastD, Pack, Unpack, Idamax, pdgemm, FWDPiv, BackPiv, and Waiting; Gemm dominates at 81.35%]
Additional Conclusions: Models and Extrapolated Data
- Use models to the extent that the architecture and algorithm are understood
- Extrapolate from small processor sets
- Vary as many parameters as possible (yes) at the same time; consider how they interact and how they don't
- Remember that instrumentation affects timing; one can often compensate (otherwise, incorrect answers result)
- Utilize observed "eccentricities" with caution (MPI_Reduce)
Current Fronts
- HPC Challenge benchmark suite: STREAMS, HPL, etc.
- HPCS productivity benchmarks: math libraries
- Focused feedback to Toronto: PERCS compiler/persistent optimization
- Linpack algorithm on other machines
Thanks to …
- Leonardo Bachega: BLAS-1, performance results
- Sid Chatterjee, Xavier Martorell: coprocessor, BLAS-1
- Fred Gustavson, James Sexton: data-structure investigations, design, sanity tests
- Gheorghe Almasi, Phil Heidelberger & Nils Smeds: MPI/communications
- Vernon Austel: data copy routines
- Gerry Kopcsay & Jose Moreira: system & machine configuration
- Derek Lieber & Martin Ohmacht: refined memory settings
- Everyone else: system software, hardware, & machine time!