Date post: | 13-Apr-2017 |
Category: | Technology |
Upload: | david-lecomber |
Optimizing Energy for High Performance Applications
Discovering when to Compute Green
What is HPC? Welcome to our world
• Aerospace and Space
• Automotive
• Oil and Gas
• EDA
• Weather and climate
• Financial
• Defence
• Government Labs
• Life sciences
• Academic
Energy in HPC
The world’s top 500 supercomputers cost 400M€ annually in energy alone
If software reduces its energy footprint … payback could be enormous
Solution
Enable developers and users to improve application energy consumption
Our tools
Debug, Profile, Tune, Develop
Two Key Questions
• Can developers optimize code for energy?
• Can owners and users tune applications for energy?
What is energy?
Approximations for Energy
Heuristics
• Floating point, vector operations, memory access
• L1 or L2 misses vs main memory: orders of magnitude in energy
Low level measurement
• Real data from some processor, memory subsystems, accelerators
• Available in kernel - Intel RAPL
Server level monitoring
• PDU and server level readings
• Real data – real energy
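As a rough illustration of the low-level measurement layer, the sketch below turns two readings of a RAPL energy counter into an average power figure. It assumes the Linux powercap sysfs layout (`/sys/class/powercap/intel-rapl:0` with `energy_uj` and `max_energy_range_uj`), which may be missing or need elevated permissions on a given machine. The function names are illustrative and are not part of any Allinea tool.

```python
import time
from pathlib import Path

# Assumed location of the package-0 RAPL domain in the Linux powercap tree.
RAPL_DOMAIN = Path("/sys/class/powercap/intel-rapl:0")


def energy_delta_uj(start_uj: int, end_uj: int, max_range_uj: int) -> int:
    """Microjoules used between two readings, allowing for one counter wrap."""
    if end_uj >= start_uj:
        return end_uj - start_uj
    # The counter wrapped past its maximum between the two readings.
    return (max_range_uj - start_uj) + end_uj


def average_power_watts(start_uj: int, end_uj: int,
                        max_range_uj: int, seconds: float) -> float:
    """Average power over the interval: energy in joules divided by time."""
    if seconds <= 0:
        raise ValueError("seconds must be positive")
    return energy_delta_uj(start_uj, end_uj, max_range_uj) / 1e6 / seconds


def read_counter(domain: Path, name: str) -> int:
    """Read one integer-valued sysfs file from the RAPL domain directory."""
    return int((domain / name).read_text().strip())


def sample_package_power(interval_s: float = 1.0,
                         domain: Path = RAPL_DOMAIN) -> float:
    """Sample the counter twice and return the average package power in watts."""
    max_range = read_counter(domain, "max_energy_range_uj")
    start = read_counter(domain, "energy_uj")
    t0 = time.monotonic()
    time.sleep(interval_s)
    end = read_counter(domain, "energy_uj")
    elapsed = time.monotonic() - t0
    return average_power_watts(start, end, max_range, elapsed)
```

Calling `sample_package_power()` only makes sense on a host that exposes the counter; the two pure functions can be used with readings collected any other way.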
Optimizing Time
Capture performance
• Profiler creates application profile
• Allinea MAP records multiple processes
Find bottlenecks
• Source code viewer pinpoints key consumers
• Timelines find unusual patterns
Optimize
• Rewrite key loops
• Reorganize memory access patterns
• Change algorithms
CPU Package and System Metrics
Whole System Power Usage
CPU Package Power Usage
Coprocessor Metrics
• Coprocessors and accelerators
  – NVIDIA CUDA GPU
  – Intel Xeon Phi
• Devices provide kernel access to power
  – High power consumption when active
  – Low power consumption when idle
  – Very efficient in flops per watt
• System now has variable energy usage to consider
  – Optimization for time: is the GPU route quicker?
  – Optimization for energy: which is most efficient?
• (GPU + server power) × GPU run time
• Or server power × CPU run time?
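The comparison in the last two bullets is energy = average power × run time for each route. A minimal sketch follows, assuming the server draws the same average power in both runs and ignoring the idle draw of an unused GPU; the wattages and run times in the usage line are made-up placeholders, not measurements from the talk.

```python
def energy_joules(average_power_w: float, runtime_s: float) -> float:
    """Energy in joules for a run at a given average power."""
    return average_power_w * runtime_s


def compare_routes(server_power_w: float, gpu_power_w: float,
                   gpu_runtime_s: float, cpu_runtime_s: float):
    """Return the lower-energy route and the energy of both routes."""
    # GPU route: server and GPU both draw power for the (shorter) GPU run time.
    gpu_route = energy_joules(server_power_w + gpu_power_w, gpu_runtime_s)
    # CPU route: only the server draws power, for the (longer) CPU run time.
    cpu_route = energy_joules(server_power_w, cpu_runtime_s)
    winner = "gpu" if gpu_route < cpu_route else "cpu"
    return winner, gpu_route, cpu_route


# Hypothetical figures: 300 W server, 200 W GPU, 40 s on GPU, 100 s on CPU.
print(compare_routes(300.0, 200.0, 40.0, 100.0))
```

With these placeholder numbers the GPU route wins on both time and energy, but a GPU run that is only slightly faster can easily cost more energy than the CPU route.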
Two Key Questions
• Can developers optimize code for energy? YES
• Can owners and users tune applications for energy?
Tuning Time
No instrumentation needed
No source code needed
No recompilation needed
Less than 5% runtime overhead
Fully scalable
Explicit and usable output
Allinea Performance Reports: Example Report
Run details
Visual breakdown chart
Clear categorization
Explanation of figures and advice for follow-up
Breakdown of resource usage across CPU, MPI, I/O
Integrated Energy Information
Key Observation: In a Nutshell
• For many HPC workloads
  – The faster an application completes, the lower its energy consumption
  – Or … optimize for speed and you are (usually) already optimizing for energy
• But for some HPC and non-HPC cases
  – Frequency scaling saves energy
Two Key Questions
• Can developers optimize code for energy? YES
• Can owners and users tune applications for energy? YES
… But should they?
• Are we counting all energy?
• Are we considering all costs?
What is energy?
Approximations for Energy
Heuristics
• Floating point, vector operations, memory access
• L1 or L2 misses vs main memory: orders of magnitude in energy
Low level measurement
• Real data from some processor, memory subsystems
• Available in kernel - Intel RAPL
Server level monitoring
• PDU and server level readings
• Real data – real energy
Full system monitoring
• Air-con
• Servers, switches, storage …
Two Key Questions
• When should developers optimize code for energy?
• When should owners and users tune applications for energy?
Frequency Scaling
Some workloads have low compute requirement, but high data volume
Data crunching vs number crunching
Processor is over-powered for the speed of memory, disk or network
CPU frequency can be scaled down in software
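One way software can scale the clock down on Linux is through the cpufreq sysfs files, sketched below. This assumes a kernel that exposes `cpuinfo_min_freq`, `cpuinfo_max_freq` and `scaling_max_freq` (values in kHz) under `/sys/devices/system/cpu/cpuN/cpufreq`, and that the caller has permission to write there. The helper names are illustrative, and the talk does not say which mechanism was used in its experiment.

```python
from pathlib import Path

# Assumed root of the per-CPU sysfs tree on Linux.
CPU_ROOT = Path("/sys/devices/system/cpu")


def cpufreq_dir(cpu: int, root: Path = CPU_ROOT) -> Path:
    """Directory holding the cpufreq files for one logical CPU."""
    return root / f"cpu{cpu}" / "cpufreq"


def read_khz(cpu: int, name: str, root: Path = CPU_ROOT) -> int:
    """Read one frequency value (kHz) from a cpufreq file."""
    return int((cpufreq_dir(cpu, root) / name).read_text().strip())


def cap_max_frequency(cpu: int, target_khz: int, root: Path = CPU_ROOT) -> int:
    """Cap the CPU's maximum frequency, clamped to the hardware limits.

    Returns the value actually written, in kHz.
    """
    lowest = read_khz(cpu, "cpuinfo_min_freq", root)
    highest = read_khz(cpu, "cpuinfo_max_freq", root)
    clamped = max(lowest, min(highest, target_khz))
    (cpufreq_dir(cpu, root) / "scaling_max_freq").write_text(f"{clamped}\n")
    return clamped
```

For example, `cap_max_frequency(0, 1_700_000)` would ask for a 1.7 GHz cap on CPU 0. The `root` parameter exists so the logic can be exercised against a scratch directory instead of the live system.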
Providing information to developer, user and system owner
Allinea MAP
Allinea Performance Reports
A lot of codes are memory-bound
Multiple cores share bandwidth
[Diagram: Core 1, Core 2, Core 3, Core 4, … connected through "lots of clever technology" to main memory]
Can we tune them for energy efficiency?
[Diagram: Core 1, Core 2, Core 3, Core 4, … connected through "lots of clever technology" to main memory]
How can we improve energy efficiency?
Buy a new cluster with ambient warm water cooling and an integrated espresso machine
Reduce CPU frequency
Run on fewer cores per node
How can we improve energy efficiency?
Buy a new cluster with ambient warm water cooling and an integrated espresso machine
Reduce CPU frequency?
Run on fewer cores per node?
The Experiment
One simple code
A well-understood wave equation solver
One compute node
Minimize effect of MPI communications
Change CPU frequency and #cores
Measure the results with Allinea Performance Reports
4 PPN @ 2.1 GHz, 30 seconds
4 PPN @ 1.3 GHz, 34 seconds
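The two timings above can be put side by side with the clock reduction. The short calculation below uses only the figures on the slide (30 s at 2.1 GHz, 34 s at 1.3 GHz); the function name is just for illustration.

```python
def relative_change(baseline: float, candidate: float) -> float:
    """Fractional change of candidate relative to baseline."""
    return (candidate - baseline) / baseline


# Run time grows from 30 s to 34 s when the clock drops from 2.1 to 1.3 GHz.
slowdown = relative_change(30.0, 34.0)
clock_cut = -relative_change(2.1, 1.3)

print(f"slowdown: {slowdown:.1%}")          # slowdown: 13.3%
print(f"clock reduction: {clock_cut:.1%}")  # clock reduction: 38.1%
```

A clock reduction of about 38% costing only about 13% in run time is what one would expect from a code that spends much of its time waiting on memory rather than computing.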
[Chart: slowdown relative to 4 PPN @ 2.1 GHz, axis 0% to 70%, for 2, 4, 6 and 8 PPN at 1.3 GHz, 1.7 GHz and 2.1 GHz. Data gathered with Performance Reports’ CSV export.]
1.7 GHz run completes as quickly as at 2.1 GHz
[Chart: energy savings relative to 4 PPN @ 2.1 GHz, axis -10% to 20%, for 2, 4, 6 and 8 PPN at 1.3 GHz, 1.7 GHz and 2.1 GHz. Data gathered with Performance Reports’ CSV export.]
5-10% energy savings with zero performance impact
[Chart: slowdown relative to 4 PPN @ 2.1 GHz, axis 0% to 70%, for 2, 4, 6 and 8 PPN at 1.3 GHz, 1.7 GHz and 2.1 GHz. Data gathered with Performance Reports’ CSV export.]
15% energy savings with 20% performance impact
The Results
[Chart: axis -10% to 20%, for 2, 4, 6 and 8 PPN at 1.3 GHz, 1.7 GHz and 2.1 GHz]
2 PPN: 15% energy savings, 20% increased runtime
[Chart: axis 0% to 70%, for 2, 4, 6 and 8 PPN at 1.3 GHz, 1.7 GHz and 2.1 GHz]
1.7 GHz: 6% energy savings for free
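Because energy is average power multiplied by run time, each headline result also implies a change in average power draw. The sketch below derives it from the two quoted figures, assuming both percentages are relative to the same 4 PPN @ 2.1 GHz baseline; the function name is illustrative.

```python
def mean_power_ratio(energy_saving: float, slowdown: float) -> float:
    """Average power relative to baseline, from E = P * t.

    energy_saving: fractional energy reduction (0.15 means 15% less energy)
    slowdown:      fractional run-time increase (0.20 means 20% longer)
    """
    return (1.0 - energy_saving) / (1.0 + slowdown)


# 2 PPN: 15% energy savings with 20% increased runtime.
print(f"{mean_power_ratio(0.15, 0.20):.3f}")  # 0.708

# 1.7 GHz: 6% energy savings with no slowdown.
print(f"{mean_power_ratio(0.06, 0.00):.3f}")  # 0.940
```

In the 2 PPN case average power falls by roughly 29%, but the longer run time gives back about half of that saving in energy terms.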
So… should we run every job at a reduced clock speed?
Or only ever use half the cores on each node?
Improving energy efficiency
• Each application and system has different characteristics
  – Tools can show if the application wastes power unnecessarily
  – Developers can see where to optimize and change code
  – Users can improve efficiency without changing code
• Don’t forget the opportunity cost
  – In HPC, slowing down applications costs science
  – Machines and PhDs have finite lifetimes, and their cost dominates
• Time and energy are not the same
  – Optimize for time before optimizing for energy