Thirteen modern ways to fool the masses with performance results on parallel computers

Georg Hager
Erlangen Regional Computing Center (RRZE)
University of Erlangen-Nuremberg

6th Erlangen International High End Computing Symposium
RRZE, 04.06.2010
1991
David H. Bailey, Supercomputing Review, August 1991, p. 54-55: “Twelve Ways to Fool the Masses When Giving Performance Results on Parallel Computers”
1. Quote only 32-bit performance results, not 64-bit results.
2. Present performance figures for an inner kernel, and then represent these figures as the performance of the entire application.
3. Quietly employ assembly code and other low-level language constructs.
4. Scale up the problem size with the number of processors, but omit any mention of this fact.
5. Quote performance results projected to a full system.
6. Compare your results against scalar, unoptimized code on Crays.
7. When direct run time comparisons are required, compare with an old code on an obsolete system.
8. If MFLOPS rates must be quoted, base the operation count on the parallel implementation, not on the best sequential implementation.
9. Quote performance in terms of processor utilization, parallel speedups or MFLOPS per dollar.
10. Mutilate the algorithm used in the parallel implementation to match the architecture.
11. Measure parallel run times on a dedicated system, but measure conventional run times in a busy environment.
12. If all else fails, show pretty pictures and animated videos, and don't talk about performance.
1991 …
If you were plowing a field, which would you rather use: two strong oxen or 1024 chickens?
(Attributed to Seymour Cray)
1991 …
Strong I/O facilities
32-bit vs. 64-bit FP arithmetic
SIMD/MIMD parallelism
Vectorization
System-specific optimizations
No parallelization standards
Today we have…
Multicore processors with shared/separate caches, shared data paths
Hybrid, hierarchical systems with multi-socket, multi-core, ccNUMA, heterogeneous networks
Today we have…
Ants all over the place: Cell, ClearSpeed, GPUs...
Today we have…
Commodity everywhere: x86-type processors, cost-effective interconnects, GNU/Linux
The landscape of High Performance Computing and the way we think about HPC have changed over the last 19 years, and we need an update!
Still, many of Bailey's points are valid without change.
Stunt 1
Report scalability, not absolute performance.
Speedup:   S(N) = (work/time with N workers) / (work/time with 1 worker)
“Good” scalability ↔ S(N) ≈ N, but there is no mention of how fast you can solve your problem!
Consequence: Comparing different systems is much easier when using scalability instead of work/time directly.
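A hypothetical illustration (all numbers invented for the example): machine A needs T(1) = 1000 s and T(64) = 25 s, so S(64) = 40; machine B needs T(1) = 200 s and T(64) = 10 s, so S(64) = 20. A “scales” twice as well, but B solves the problem 2.5 times faster on 64 workers. The speedup plot hides exactly the number that matters.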
Stunt 1: Scalability vs. performance
And… instant success!
[Figure: the same benchmark data plotted vs. # CPUs or nodes, once as performance (work/time) and once as speedup, for an NEC system and a cluster]
Stunt 2
Slow down code execution.
This is useful whenever there is some noticeable “non-execution” overhead
Parallel speedup with work ~ N^α (α=0: strong scaling, α=1: weak scaling):

    S(N) = [s + (1-s)N^α] / [s + (1-s)N^(α-1) + c(N)]

Now let's slow down execution by a factor of μ > 1 (for strong scaling, α=0):

    S_μ(N) = μ / [μ(s + (1-s)/N) + c(N)] = 1 / [s + (1-s)/N + c(N)/μ]

I.e., if there is overhead, the slow code/machine scales better:

    S_μ(N) > S_{μ=1}(N)   if   c(N) > 0
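A minimal C sketch of this effect (the serial fraction s, the communication model c(N) ~ N^(-2/3), and the slowdown factor μ below are invented example values, not measurements): it evaluates the two expressions above and shows that the slowed-down code reports the higher speedup whenever c(N) > 0.

#include <stdio.h>
#include <math.h>

/* Strong-scaling model from above (alpha = 0):
 *   S(N)    = 1 / (s + (1-s)/N + c(N))        -- "fast" machine
 *   S_mu(N) = 1 / (s + (1-s)/N + c(N)/mu)     -- execution slowed down by mu
 * s and c(N) are hypothetical example values. Compile with -lm. */
static double c(double N) { return 0.05 * pow(N, -2.0/3.0); }
static double S(double N, double s, double mu) { return 1.0 / (s + (1.0 - s)/N + c(N)/mu); }

int main(void) {
    const double s = 0.01, mu = 2.0;   /* 1% serial fraction, slowdown factor 2 */
    printf("%4s %10s %10s\n", "N", "S_fast", "S_slow");
    for (int N = 1; N <= 64; N *= 2)
        printf("%4d %10.2f %10.2f\n", N, S(N, s, 1.0), S(N, s, mu));
    return 0;   /* S_slow(N) > S_fast(N) for every N, because c(N) > 0 */
}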
Stunt 2: Slow computing
Corollaries:
1. Do not use high compiler optimization levels or the latest compiler versions.
2. If scalability is still bad, parallelize some short loops with OpenMP. That way you can get some extra bonus for a scalable hybrid code (a minimal sketch follows below).

If someone asks for time to solution, answer that if you had a bigger machine, you could get the solution as fast as you want. This is of course due to the superior scalability of your code.
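A (deliberately silly) sketch of corollary 2, with an invented tiny loop: the OpenMP fork/join and synchronization overhead dwarfs the actual work in such a short loop, so the code gets slower and the scalability curve gets prettier.

#include <stdio.h>

#define M 64   /* hypothetical, deliberately tiny trip count */

int main(void) {
    double a[M], b[M];
    for (int i = 0; i < M; ++i) { a[i] = (double)i; b[i] = 0.0; }

    /* Parallelizing a loop this short costs far more in thread management
     * than it can ever save -- which is exactly the point of this stunt. */
    #pragma omp parallel for
    for (int i = 0; i < M; ++i)
        b[i] = 2.0 * a[i];

    printf("b[M-1] = %f\n", b[M-1]);
    return 0;
}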
Stunt 2: Slow computing
“Slowness” has some surprises in store… Let's look at μ=2:

[Figure: runtime bars for the fast machine at N=4 and the slow machine at N=8, each split into calc and comm]

Is Ts < Tf? This happens if c'(N) < 0 (at s = 0).
Is 1/Tf < 2/Ts? This happens if c_μ(N) = μ·c(N) (at s = 0).

What's the catch here?
Stunt 2: Slow computing
Example for µ=4 and c(N) ~ N^(-2/3) at strong scaling:
[Figure: performance vs. # nodes for the fast and the slow machine]

The performance is better with µN slow CPUs than with N fast CPUs.
“Slow computing” can effectively lessen the impact of communication overhead.
We assume that the network is the same in both machines.
Stunt 3 (The power of obfuscation, part I)
If scalability doesn’t look good enough, use a logarithmic scale to drive your point home.
Everything looks OK if you plot it the right way!
1. Linear plot: bad scaling, strange things at N=32
2. Log-log plot: better scaling, but still the N=32 problem
3. Log-linear plot: N=32 problem gone
4. … and remove the ideal scaling line to make it perfect!

[Figure: the same speedup data in a linear, a log-log, and two log-linear plots, with and without the ideal scaling line]
Stunt 3: Log scale
[Figure: Top500 performance data on a logarithmic scale; © Top500 '08]
Stunt 4
If you must report actual performance, quietly employ weak scaling to show off
It's all in that bloody denominator…

    S(N) = [s + (1-s)N^α] / [s + (1-s)N^(α-1) + c(N)]

At α=1 the world looks so much nicer:

    S(N) = [s + (1-s)N] / [1 + c(N)]
… but keep in mind: Do not mention the term “weak scaling” or you will be asked nasty questions about parallel efficiency.
Stunt 4: Weak scaling
But weak scaling gives us much more than just a “straight” graph. It gives us perfect scaling if we choose the right metric to look at!
Assumption: Weak scaling with parallel efficiency ε = S(N)/N << 1 and no other overhead

    S(N) = s + (1-s)N     … has a small slope

But: If we choose a metric for work that is applicable to the parallel part alone, work/time scales linearly.

So all you need to do is plot Mflop/s, MLUP/s, or anything that doesn't happen in the serial part, and you can even show real performance numbers! See also stunt #10.
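Written out (a direct consequence of the model above; runtime normalized so that the single-worker runtime is 1, and W_0 denoting the parallel work done by one worker):

    T(N) = s + (1-s) = 1                         (constant under weak scaling)
    S(N) = s + (1-s)N,   ε(N) = S(N)/N → 1-s     (small if s is large)
    W_p(N)/T(N) = (1-s)·N·W_0 ∝ N                (any rate counted in the parallel part only)

So Mflop/s or MLUP/s of the parallel part grow perfectly linearly with N, no matter how poor the overall efficiency is.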
Stunt 5 (The power of obfuscation, part II)
Instead of performance, plot absolute runtime vs. CPU count
Very, very popular indeed!
Nobody will be able to tell whether your code actually scales.

[Figure: runtime per core vs. # CPUs]

Corollary:
CPU time per core is even better because it omits most overheads…
Stunt 6 (The power of obfuscation, part III)
Compare different systems by showing the log of parallel efficiency vs. CPU count
Unusual ways of putting data together surprise and confuse your audience.

Remember: Legends can be any size you like!

[Figure: parallel efficiency vs. # nodes/CPUs on a log scale, cluster vs. NEC]
Stunt 7
Emphasize the quality of your shiny accelerator code by comparing it with scalar, unoptimized code on a single core of an old standard CPU. And use GCC 2.7.2.
Anything else is a waste of time.
And besides, don’t the compiler guys always say that they’re “multi-core enabled”?
Corollary:
Use single precision on the GPU but double precision on the CPU. This will cut the effective bandwidth, cache size, and peak performance of the latter and let the former shine.
Stunt 8
Always quote GFlops, MIPS, Watts per Flop or any other irrelevant (but interesting-sounding) metric instead of inverse time to solution.
Flops are so cool it hurts:

/* straightforward version: 3 additions + 1 multiplication = 4 flops per point */
for(i=0; i<N; ++i)
  for(j=0; j<N; ++j)
    b[i][j] = 0.25*(a[i-1][j]+a[i+1][j]+a[i][j-1]+a[i][j+1]);

“Floptimization”: same result, but more flops per point and thus a higher Mflop/s rate:

/* 3 additions + 4 multiplications = 7 flops per point */
for(i=0; i<N; ++i)
  for(j=0; j<N; ++j)
    b[i][j] = 0.25*a[i-1][j]+0.25*a[i+1][j]+0.25*a[i][j-1]+0.25*a[i][j+1];
Watts/Flop are an ingenious fallback – who would dare question a truly “green” application/system? Except maybe some investors…
Stunt 9
Ignore affinity and topology issues. Real scientists are not bothered by such details.
Multi-core, cache groups, ccNUMA, SMT, network hierarchies etc. are just parts of a vicious plot to take the fun out of computing. Ignoring those issues will make them go away. If people ask specific questions about it, answer that it's the OS's or the compiler's job.
[Figure: two-socket multi-core node diagram, annotated with the issues being “ignored”:]
Shared cache re-use
OpenMP overhead
Bandwidth contention
OS buffer cache
Intra-node MPI
ccNUMA page placement (a minimal sketch follows below)
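To make the ccNUMA point concrete, a minimal OpenMP sketch (the array size is invented for illustration): with first-touch page placement, the thread that initializes a page determines its NUMA domain, so initializing in parallel with the same schedule as the compute loop keeps accesses local, while a serial initialization puts every page into one domain and saturates a single memory interface.

#include <stdio.h>
#include <stdlib.h>

#define N (1L << 25)   /* hypothetical array size */

int main(void) {
    double *a = malloc(N * sizeof *a);
    double sum = 0.0;

    /* First touch: pages end up in the NUMA domain of the thread that
     * writes them first, so this loop decides the data placement. */
    #pragma omp parallel for schedule(static)
    for (long i = 0; i < N; ++i)
        a[i] = 1.0;

    /* Same static schedule: each thread mostly reads from its local domain. */
    #pragma omp parallel for schedule(static) reduction(+:sum)
    for (long i = 0; i < N; ++i)
        sum += a[i];

    printf("sum = %.0f\n", sum);
    free(a);
    return 0;
}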
Stunt 9: Affinity issues
Re-using shared cache on multi-core CPUs? More cores mean more performance, do they not?
[Figure: core0 and core1 sharing data between the arrays x(:,:,:) and tmp(:,:,3) in memory, with the y- and z-directions of the domain indicated]
Stunt 9: Affinity issues
Memory bandwidth saturation? ccNUMA effects? Shouldn’t the OS put the threads and pages where they are supposed to be?
[Figure: parallel STREAM performance]
Stunt 9: Affinity issues
Intra-node MPI is infinitely fast! Look at those latencies!
[Figure: MPI intra-node and inter-node latencies on a Cray XT5 with two-socket nodes: inter-node ≈ 7.4 µs, inter-socket ≈ 0.63 µs, intra-socket ≈ 0.49 µs]
Stunt 9: Affinity issues
Intra-node MPI is infinitely fast! Low-level benchmarking is unreliable!
[Figure: low-level benchmark results between two cores of one socket (shared cache advantage), between two sockets of one node (cache effects eliminated), and between two nodes via the interconnect fabric]
Stunt 9: Affinity issues
Why should you reverse engineer the overcomplicated cache topology of those modern systems?
Xeon E5420, 2 threads        shared L2   same socket   different socket
pthreads_barrier_wait             5863         27032              27647
omp barrier (icc 11.0)             576           760               1269
Spin loop                          259           485              11602

Nehalem, 2 threads     shared SMT threads   shared L3   different socket
pthreads_barrier_wait               23352        4796              49237
omp barrier (icc 11.0)               2761         479               1206
Spin loop                           17388         267                787
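Numbers like these are straightforward to reproduce; a minimal sketch of how OpenMP barrier overhead might be measured (the iteration count is an arbitrary choice for this example, and the result is reported in nanoseconds rather than in whatever unit the table above uses):

#include <stdio.h>
#include <omp.h>

int main(void) {
    const int iters = 100000;   /* arbitrary example value */
    double t0 = 0.0, t1 = 0.0;

    #pragma omp parallel
    {
        #pragma omp barrier            /* line up all threads before timing */
        #pragma omp master
        t0 = omp_get_wtime();

        for (int i = 0; i < iters; ++i) {
            #pragma omp barrier        /* the construct being measured */
        }

        #pragma omp master
        t1 = omp_get_wtime();
    }
    printf("average barrier overhead: %.1f ns\n", (t1 - t0) / iters * 1e9);
    return 0;
}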
Stunt 9: Affinity – if you still insist…
Command line tools for Linux:
easy to install
works with standard Linux 2.6 kernel
simple and clear to use
support Intel and AMD CPUs

Current tools:
likwid-topology: print thread and cache topology
likwid-perfCtr: measure performance counters
likwid-features: view and enable/disable hardware prefetchers
likwid-pin: pin threaded application without touching code
Open source project (GPL v2): http://code.google.com/p/likwid/
Stunt 10
If you really can’t reduce communication overhead, argue in favor of “reliable inefficiency.”
Even if you spend 80% of your time communicating, that's ok if the ratio stays constant – it means you can scale to any size! And fill any machine.

[Figure: fraction of runtime vs. # nodes/CPUs, split into calculation and communication; the fractions become constant for large N]

Efficiency constant for large N
Stunt 11 (The power of obfuscation, part IV)
Performance modeling is for wimps. Show real data. Plenty. And then some.
Don’t try to make senseof o r data b fitting it
300
of your data by fitting itto a model. Instead, showat least 8 graphs per plot, 200
250
Machine 1
eat least 8 graphs per plot,all in bright pastel colors,with different symbols. 150
200Machine 2
Machine 3
Machine 4
Machine 5form
ance
If t ti
100
Machine 5
Machine 6
Machine 7
Machine 8
Perf
If nasty questions pop up,say your code is so complex that no model
0
50
0 100 200 300 400 500 600complex that no modelcan describe it.
0 100 200 300 400 500 600
# nodes/CPUs
Stunt 12
If they get you cornered, blame it all on OS jitter.
They will understand and nod knowingly.
Corollary:
Depending on the audience, TLB misses may work just as well.
Stunt 13
If all else fails, show pretty pictures and animated videos, and don’t talk about performance.
In four decades of supercomputing, this was always the best-selling plan, and it will stay that way forever.
THANK YOU