Experiences Programming the Cell Across a Diverse Set of Applications
Jeremy Meredith
Future Technologies Group
Outline
Overview of the application kernels
  – Scientific, imaging, cognitive algorithms
Optimization strategies
  – “Asymmetric-Thread Runtime Model”
  – Parallelism, overheads, latencies, etc.
Performance results
  – 2.4GHz Cell
  – 2.2GHz Opteron
Application Kernels
Monte Carlo Light Integration
Molecular Dynamics
Covariance Matrix Creation
Boolean Satisfiability Solver
Genetic Algorithms
Monte Carlo Light Propagation
Simulation of point-source heating in an infinite isotropic scattering medium
  – (from Oregon Medical Laser Center)
Fixed number of photons (outer loop)
Variable number of steps per photon (inner loop)
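The loop structure named on this slide can be sketched as follows. This is a minimal illustration, not the Oregon Medical Laser Center code: the coefficients `mu_a` and `mu_s`, the shell binning, and the roulette parameters `threshold` and `chance` are all assumed values chosen for the example. The point it shows is that the outer loop has a fixed trip count while the inner loop ends at a random step for each photon.

```python
import math
import random

def simulate_photons(n_photons, mu_a=1.0, mu_s=9.0, n_shells=50,
                     shell_width=0.05, threshold=0.01, chance=0.1, seed=0):
    """Random-walk photons from a point source.

    Returns (heat deposited per spherical shell, steps taken per photon).
    All default parameter values are illustrative assumptions.
    """
    rng = random.Random(seed)
    mu_t = mu_a + mu_s                  # total interaction coefficient
    albedo = mu_s / mu_t                # fraction of weight kept at each step
    heat = [0.0] * n_shells             # absorbed weight binned by radius
    steps_taken = []
    for _ in range(n_photons):          # outer loop: fixed trip count
        x = y = z = 0.0
        weight, steps = 1.0, 0
        while weight > 0.0:             # inner loop: trip count varies
            steps += 1
            # Pick an isotropic direction and an exponential step length.
            cos_t = 2.0 * rng.random() - 1.0
            sin_t = math.sqrt(1.0 - cos_t * cos_t)
            phi = 2.0 * math.pi * rng.random()
            s = -math.log(1.0 - rng.random()) / mu_t
            x += s * sin_t * math.cos(phi)
            y += s * sin_t * math.sin(phi)
            z += s * cos_t
            # Deposit the absorbed fraction in the shell at this radius.
            r = math.sqrt(x * x + y * y + z * z)
            shell = min(int(r / shell_width), n_shells - 1)
            heat[shell] += weight * (1.0 - albedo)
            weight *= albedo
            if weight < threshold:      # Russian roulette ends the walk
                if rng.random() < chance:
                    weight /= chance
                else:
                    weight = 0.0
        steps_taken.append(steps)
    return heat, steps_taken
```

Almost every statement in the inner loop depends on a fresh random number, which is why this kernel leans so heavily on the random number generator and offers little to vectorize.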
Molecular Dynamics
Force evaluation and integration between atoms
  – Lennard-Jones potential interaction model
  – Velocity Verlet integration algorithm
Number of interacting atoms changes over time
N^2 search over atom pairs for interacting atoms
  – Force evaluation over only those within cutoff limit
  – Search over atom pairs is the bottleneck
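A minimal sketch of the kernel described above, assuming reduced units (`eps`, `sigma`, and `mass` default to 1) and no periodic boundaries. The function names are hypothetical. It shows the all-pairs search, the cutoff test that limits force evaluation, and one velocity Verlet step.

```python
def lj_forces(pos, cutoff, eps=1.0, sigma=1.0):
    """Return (forces, n_pairs) for Lennard-Jones atoms closer than `cutoff`."""
    n = len(pos)
    forces = [[0.0, 0.0, 0.0] for _ in range(n)]
    cutoff2 = cutoff * cutoff
    n_pairs = 0
    for i in range(n):                      # N^2 search over atom pairs
        for j in range(i + 1, n):
            d = [pos[i][k] - pos[j][k] for k in range(3)]
            r2 = d[0] * d[0] + d[1] * d[1] + d[2] * d[2]
            if r2 >= cutoff2:               # outside the cutoff: no force
                continue
            n_pairs += 1
            s6 = (sigma * sigma / r2) ** 3  # (sigma / r)^6
            scale = 24.0 * eps * (2.0 * s6 * s6 - s6) / r2
            for k in range(3):
                forces[i][k] += scale * d[k]
                forces[j][k] -= scale * d[k]    # Newton's third law
    return forces, n_pairs

def verlet_step(pos, vel, forces, dt, cutoff, mass=1.0):
    """Advance positions and velocities in place; return the new forces."""
    for p, v, f in zip(pos, vel, forces):
        for k in range(3):
            v[k] += 0.5 * dt * f[k] / mass  # first half kick
            p[k] += dt * v[k]               # drift
    new_forces, _ = lj_forces(pos, cutoff)
    for v, f in zip(vel, new_forces):
        for k in range(3):
            v[k] += 0.5 * dt * f[k] / mass  # second half kick
    return new_forces
```

The pair search touches every pair even though only those inside the cutoff contribute, which matches the slide's observation that the search, not the force arithmetic, dominates.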
Covariance Matrix Creation
For each <a,b> entry in the L×L matrix Cov:
$$\mathrm{Cov}_{a,b} = \sum_{i=1}^{N} \sum_{j=1}^{M} \mathrm{input}_{i,j,a} \times \mathrm{input}_{i,j,b}$$
Applications include hyperspectral imaging
  – Can build concise model of background for subtraction from the HSI data cube
Known loop counts, heavy data streaming, straightforward computation
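The double sum for each Cov entry can be written out directly. This is a sketch under the assumption that the data cube is indexed as `cube[i][j][a]`, with `i` and `j` the two spatial indices and `a` the spectral band; the function name is hypothetical.

```python
def covariance_matrix(cube):
    """Return the L x L matrix of band-by-band products summed over all pixels.

    cube[i][j] is assumed to be the length-L spectrum of pixel (i, j).
    """
    n_bands = len(cube[0][0])
    cov = [[0.0] * n_bands for _ in range(n_bands)]
    for row in cube:                        # i = 1 .. N
        for pixel in row:                   # j = 1 .. M
            for a in range(n_bands):
                pa = pixel[a]
                for b in range(n_bands):
                    cov[a][b] += pa * pixel[b]
    return cov
```

Every loop bound is known up front and each pixel is read once and then discarded, so the work is a steady stream of multiply-adds over incoming data.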
Boolean Satisfiability (SAT) Solver
Is there an assignment to a set of variables in a Boolean expression that makes the entire expression true?
  – Many problems, like planning, can be reduced to SAT
Unit Propagation
  – Stochastic solvers repeatedly change the value of a variable, updating the scores of clauses which refer to that variable
  – Main loop in solvers like GSAT, WalkSAT, HSAT
  – Inefficient to check all clauses; instead, update only clauses containing that variable
Essentially no computation involved
  – lookup, read, modify, write of random memory locations
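The "update only clauses containing that variable" step can be sketched with per-variable occurrence lists. The data layout here is an assumption made for the example (clauses as lists of signed integers, variables numbered from 1, a `true_count` score per clause); it is not taken from GSAT, WalkSAT, or HSAT.

```python
def build_occurrences(clauses, n_vars):
    """For each variable, list the (clause index, polarity) pairs where it appears."""
    occ = [[] for _ in range(n_vars + 1)]   # index 0 is unused
    for c, clause in enumerate(clauses):
        for lit in clause:
            occ[abs(lit)].append((c, lit > 0))
    return occ

def count_true(clause, assign):
    """Number of literals in `clause` that `assign` satisfies."""
    return sum(1 for lit in clause if assign[abs(lit)] == (lit > 0))

def flip(var, assign, occ, true_count):
    """Flip one variable and adjust only the clauses that mention it."""
    assign[var] = not assign[var]
    for c, positive in occ[var]:            # scattered read-modify-write
        true_count[c] += 1 if assign[var] == positive else -1
```

Each flip is an index lookup followed by a handful of increments at unrelated addresses, with no arithmetic to hide the memory latency behind.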
Genetic Algorithm
Parallel optimization on large populations
Individuals selected for breeding by their fitness
  – replication, mutation, combination of “chromosomes”
Fitness evaluation is typically the bottleneck
Two functions from the GENEsYs package:
  – Ackley’s function (more computation)
  – Traveling salesman (more logic, with sorting)
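For reference, a minimal sketch of fitness evaluation with Ackley's function. The constants `a`, `b`, and `c` are the commonly quoted defaults and are assumed here; the GENEsYs package may parameterize the function differently. The helper names are hypothetical.

```python
import math

def ackley(x, a=20.0, b=0.2, c=2.0 * math.pi):
    """Ackley's function; its global minimum is 0 at the origin."""
    n = len(x)
    mean_sq = sum(v * v for v in x) / n
    mean_cos = sum(math.cos(c * v) for v in x) / n
    return -a * math.exp(-b * math.sqrt(mean_sq)) - math.exp(mean_cos) + a + math.e

def evaluate_population(population):
    """Fitness evaluation: one independent Ackley call per individual."""
    return [ackley(individual) for individual in population]

def select_fittest(population, k):
    """Keep the k individuals with the lowest Ackley value."""
    scores = evaluate_population(population)
    ranked = sorted(range(len(population)), key=scores.__getitem__)
    return [population[i] for i in ranked[:k]]
```

Each evaluation calls `cos`, `exp`, and `sqrt` and is independent of every other individual, so the cost is dominated by math-library calls that can be spread across workers.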
Optimization Strategies and Issues
Task-level parallelism
SPE thread launch overhead
SIMD optimizations
Concurrent DMA bandwidth
Overlapping communication and computation
Latency hiding and loop unrolling
SDK optimized math libraries
Double precision penalties
Monte Carlo Light Propagation: Task-level Parallelism
[Figure: bar chart of runtime (msec), axis 0 to 40000, for PPE, 1 SPE, 2 SPE, 4 SPE, 8 SPE, and Overlap]
Molecular Dynamics: Thread Launch Overhead
[Figure: bar chart of runtime (sec), axis 0.0 to 1.6, showing Total Runtime and SPE Launch Overhead for 1 SPE and 8 SPEs; one group respawns threads every time step, the other launches only on the first time step]
Molecular Dynamics: Thread Launch Overhead (cont.)
[Figure: same chart as the previous slide: Total Runtime and SPE Launch Overhead (sec) for 1 SPE and 8 SPEs, respawning every time step versus launching only on the first time step]
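The two launch policies compared in this chart can be illustrated with ordinary Python threads standing in for SPE threads. This is only an analogy under that assumption: the function names, the queue-based signalling, and the `work(worker_id, step)` callback are hypothetical and do not reflect the Cell SDK API.

```python
import queue
import threading

def run_respawn(n_steps, n_workers, work):
    """Start a fresh set of threads for every time step."""
    launches, total = 0, 0
    for step in range(n_steps):
        results = [0] * n_workers

        def body(wid, step=step, results=results):
            results[wid] = work(wid, step)

        threads = [threading.Thread(target=body, args=(wid,))
                   for wid in range(n_workers)]
        for t in threads:
            t.start()
            launches += 1
        for t in threads:
            t.join()
        total += sum(results)
    return total, launches

def run_persistent(n_steps, n_workers, work):
    """Launch the workers once, then signal them at each time step."""
    tasks = [queue.Queue() for _ in range(n_workers)]
    done = queue.Queue()

    def body(wid):
        while True:
            step = tasks[wid].get()         # wait for the "go" signal
            if step is None:                # shutdown signal
                return
            done.put(work(wid, step))

    threads = [threading.Thread(target=body, args=(wid,))
               for wid in range(n_workers)]
    for t in threads:
        t.start()
    total = 0
    for step in range(n_steps):
        for q in tasks:
            q.put(step)
        for _ in range(n_workers):
            total += done.get()
    for q in tasks:
        q.put(None)
    for t in threads:
        t.join()
    return total, len(threads)
```

Both versions compute the same result; the persistent version pays the launch cost `n_workers` times instead of `n_steps * n_workers` times, which is the amortization the chart measures.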
Molecular Dynamics: SIMD Optimization
[Figure: bar chart of runtime (sec), axis 0.00 to 0.20, for each optimization step: original; replace "if" with "copysign"; SIMD unit cell reflection; SIMD direction vector; SIMD length calculation; SIMD acceleration]
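As an illustration of the "replace 'if' with 'copysign'" step, here is one way a branch can be traded for arithmetic. The slides do not say which test was replaced, so the minimum-image wrap below is an assumed example, and both function names are hypothetical.

```python
import math

def wrap_with_if(d, box):
    """Wrap a coordinate difference into [-box/2, box/2] using branches."""
    half = 0.5 * box
    if d > half:
        d -= box
    elif d < -half:
        d += box
    return d

def wrap_with_copysign(d, box):
    """Same wrap without a branch on the sign of d.

    The comparison yields 0 or 1, and copysign supplies the direction
    of the correction.
    """
    outside = abs(d) > 0.5 * box
    return d - outside * math.copysign(box, d)
```

In Python this only demonstrates that the two forms agree. On hardware where branches are costly and SIMD lanes cannot diverge, the arithmetic form is the one that can be applied to several values at once.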
Covariance Matrix Creation: Concurrent Bandwidth
[Figure: runtime (sec), axis 0.0 to 7.0, versus number of SPE threads (1, 2, 4, 8); series: Full Execution, Launch+DMA, Thread Launch]
Covariance Matrix Creation: Asynchronous DMA Communication
[Figure: runtime using 1 SPE (seconds), axis 0.0 to 6.0, for No Computation, Some Computation, and All Computation; series: Synchronous DMA, Overlapping DMA]
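A back-of-the-envelope timing model shows why overlap helps only when there is computation to hide the transfer behind. It is an idealized sketch: it assumes equal-sized chunks, constant per-chunk DMA and compute times, and two buffers, none of which are measured values from this talk.

```python
def synchronous_time(n_chunks, t_dma, t_compute):
    """Each chunk is fetched and then processed; nothing overlaps."""
    return n_chunks * (t_dma + t_compute)

def double_buffered_time(n_chunks, t_dma, t_compute):
    """Fetch chunk k+1 while computing on chunk k.

    The first fetch and the last computation cannot be overlapped;
    every step in between costs whichever of the two is slower.
    """
    if n_chunks == 0:
        return 0.0
    return t_dma + (n_chunks - 1) * max(t_dma, t_compute) + t_compute
```

With `t_compute = 0` the two functions return the same value, so the second buffer buys nothing. The largest relative gain comes when transfer and computation take about the same time.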
Boolean Satisfiability (SAT) Solver: Latency Hiding and Loop Unrolling
[Figure: runtime (sec), axis 0.0 to 3.0, for original, -O3, simplify array indexing, loop unrolling, and instruction reordering; series: PPE, SPE]
Genetic Algorithm, Ackley’s Function: SDK Optimized Math Libraries
[Figure: SPE optimizations, runtime (sec) on a log scale: Original 0.681 s; Fast cosine 0.248 s; Fast exp/sqrt 0.064 s; SIMD 0.047 s]
Genetic Algorithm, Traveling Salesman: Double Precision Penalties
[Figure: runtime (sec), axis 0.0 to 2.5, for 1 SPE using 'if' test and 1 SPE using 'copysign'; series: Double Precision, Single Precision]
Performance Results
Monte Carlo Light Integration
Molecular Dynamics
Covariance Matrix Creation
Boolean Satisfiability Solver
Genetic Algorithms
Monte Carlo Light Propagation
# Photons         1000      10000      100000
Opteron           24 ms     232 ms     2384 ms
Cell, 8 SPEs      38 ms     357 ms     3112 ms
Cell, PPE only    288 ms    2843 ms    28384 ms
Good scaling across all 8 SPEs cannot make up for the variable-length, short inner loop, the near absence of SIMD, and the heavy reliance on random number generation.
Molecular Dynamics
# Atoms           512
Opteron           0.925 sec
Cell, 1 SPE       0.816 sec
Cell, 8 SPEs      0.181 sec
Cell, PPE only    4.701 sec
A fair amount of SIMDizable computation lets even a single SPE beat the Opteron.
All 8 SPEs are about 5x faster than the Opteron.
Covariance Matrix Creation
Data Set Size     256×65k
Opteron           12.308 sec
Cell, 8 SPEs      0.662 sec
Cell, PPE only    88.290 sec
High concurrent bandwidth and straightforward computation allow efficient use of all 8 SPEs.
8 SPEs are almost 20x faster than the Opteron.
Boolean Satisfiability (SAT) Solver
# Vars            800
# Flips           10M
Opteron           0.571 sec
Cell, 1 SPE       1.961 sec
Cell, PPE only    1.998 sec
With no computation and only indirect lookups, a single SPE barely beats even the PPE.
A single SPE is 3.4x slower than the Opteron.
  – However, multiple SPEs could search independent parts of the problem space
Genetic Algorithm
Compute-intensive Ackley’s function: one SPE is 4x faster than the Opteron; eight SPEs are 21x faster
Population Size   262k         1.05M
Opteron           0.645 sec    2.514 sec
Cell, 1 SPE       0.165 sec    0.637 sec
Cell, 8 SPEs      0.060 sec    0.119 sec
Cell, PPE only    2.797 sec    11.146 sec
Logic-intensive traveling salesman: one SPE is 4x slower than the Opteron; eight SPEs are 2x faster
Population Size   131k         524k
Opteron           0.466 sec    1.876 sec
Cell, 1 SPE       1.697 sec    6.761 sec
Cell, 8 SPEs      0.248 sec    0.884 sec
Cell, PPE only    3.802 sec    15.209 sec
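The speedup figures quoted on the results slides can be re-derived from the tabulated runtimes. The sketch below uses the largest problem size of each kernel. It assumes that the 262k/1.05M genetic algorithm table belongs to Ackley's function and the 131k/524k table to the traveling salesman problem, which is the pairing that matches the quoted ratios. The dictionary keys are labels invented for this example.

```python
def speedup(opteron_sec, cell_sec):
    """How many times faster the Cell run is (below 1 means slower)."""
    return opteron_sec / cell_sec

# (Opteron seconds, Cell seconds) copied from the result tables.
results = {
    "Monte Carlo, 100000 photons, 8 SPEs": (2.384, 3.112),
    "Molecular dynamics, 8 SPEs": (0.925, 0.181),
    "Covariance, 8 SPEs": (12.308, 0.662),
    "SAT, 1 SPE": (0.571, 1.961),
    "GA Ackley 1.05M, 1 SPE": (2.514, 0.637),
    "GA Ackley 1.05M, 8 SPEs": (2.514, 0.119),
    "GA traveling salesman 524k, 1 SPE": (1.876, 6.761),
    "GA traveling salesman 524k, 8 SPEs": (1.876, 0.884),
}

speedups = {name: speedup(o, c) for name, (o, c) in results.items()}
```

Printing `speedups` gives roughly 5.1 for molecular dynamics, 18.6 for covariance, 21.1 for Ackley on 8 SPEs, and about 0.29 for SAT on one SPE, that is, 3.4x slower.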
Optimization Strategy Summary
Parallelization across the SPEs is critical
Be aware of arithmetic costs
  – Use the optimized math libraries from the SDK if they help
  – Double precision requires different kinds of optimizations
The EIB has a very high bandwidth to the SPEs
  – Use asynchronous DMA to overlap communication and computation for apps with heavy bandwidth needs
  – But for many apps, it may simply waste space in the SPE LS
Amortize expensive SPE thread launch overheads
  – Launch once, and signal SPEs to start the next iteration
Use of SIMD intrinsics can result in large speedups
  – Manual loop unrolling and instruction reordering can help even if no other SIMDization is possible
Acknowledgements and More Info
This research was sponsored by the Office of Mathematical, Information, and Computational Sciences, Office of Science, U.S. Department of Energy under Contract No. DE-AC05-00OR22725 with UT-Battelle, LLC. Accordingly, the U.S. Government retains a non-exclusive, royalty-free license to publish or reproduce the published form of this contribution, or allow others to do so, for U.S. Government purposes.
http://www.csm.ornl.gov/ft
[email protected]