Date posted: 05-Jan-2016
Optimal Power Allocation for Multiprogrammed Workloads on
Single-chip Heterogeneous Processors
Euijin Kwon(1,2), Jae Young Jang(2), Jae W. Lee(2), Nam Sung Kim(2,3)
Single-chip heterogeneous processors
• Compared to systems based on discrete components
  - Lower communication overhead
  - Lower power consumption
  - Lower cost (less silicon)
  - Emerging application friendly (sequential + parallel processing)
• Examples: AMD’s Llano, Intel’s Sandy Bridge, Samsung’s Exynos (sources: AMD, Intel, and Samsung)
Challenges
• SCHP’s performance: limited by power budget
  - Total chip power budget
  - CPU/GPU power budget
• Multiprogrammed workload
  - Workload-aware power allocation
  - Considering characteristics and metrics
How can we optimize overall performance within a limited power budget?
Outline
• Motivation
• Target platform: SCHP + MW
• Workload-aware power allocation
  - Characteristics of programs
  - Evaluation metrics
• Methodology
  - Power configuration
  - Benchmark programs
• Evaluation
• Algorithm
• Conclusion
Target platform: SCHP + MW
• 4-core CPU + 16-SM GPU
• Multiple V/F domains for DVFS
• 2 programs running
• Hardware resources evenly divided
[Figure: block diagram of the target platform. CPU Core0 to Core3 with per-core CPU V/F domains; GPU0 and GPU1, each with its own V/F domain; memory controllers (MCs) with their own V/F domain. The multiprogrammed workload consists of Program 1 and Program 2.]
Workload-aware power allocation
• Characteristics of programs
  - Non-uniform performance sensitivities
• Evaluation metrics
  - Throughput vs. energy efficiency
[Figure: normalized throughput (y-axis, 0.8 to 2.0) vs. power allocation using the same HW (x-axis: 28.6, 34.2, 39.8, 48.6, 59.0) for a compute-bound program (mri-q) and a memory-bound program (stream-copy). Annotation: allocating more power to mri-q.]
Methodology: shared power budget
• Can change the power budget for each group (CPU 1, CPU 2, GPU 1, GPU 2)
[Figure: power configuration. Selectable power levels per group:
  CPU 1, CPU 2: 17.4, 24.8, 34.2, 46.4, 62.8
  GPU 1, GPU 2: 11.2, 16.8, 22.4, 31.2, 41.6
Output: throughput and energy efficiency.]
• Total chip power budget = 100 W
• CPU power budget = 80 W
• GPU power budget = 64 W
• Baseline configuration
  - Evenly divided (25 W for each CPU/GPU group)
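The budget constraints on this slide can be made concrete with a short sketch. It enumerates every combination of the listed power levels for two programs, each holding one CPU group and one GPU group, and keeps the combinations that fit all three budgets. It assumes that the CPU and GPU budgets cap the summed power of both CPU groups and both GPU groups respectively; the slides do not spell this out, and all names below are illustrative.

```python
from itertools import product

# Power levels as listed in the power configuration figure.
CPU_LEVELS = [17.4, 24.8, 34.2, 46.4, 62.8]
GPU_LEVELS = [11.2, 16.8, 22.4, 31.2, 41.6]

# Budgets in watts, as stated on the slide.
TOTAL_BUDGET = 100.0
CPU_BUDGET = 80.0
GPU_BUDGET = 64.0

EPS = 1e-9  # tolerance for floating-point sums


def is_feasible(cpu1, gpu1, cpu2, gpu2):
    """Assumption: each budget caps the summed power of the groups it covers."""
    cpu_sum = cpu1 + cpu2
    gpu_sum = gpu1 + gpu2
    return (
        cpu_sum <= CPU_BUDGET + EPS
        and gpu_sum <= GPU_BUDGET + EPS
        and cpu_sum + gpu_sum <= TOTAL_BUDGET + EPS
    )


def feasible_configs():
    """Return all (cpu1, gpu1, cpu2, gpu2) tuples that satisfy every budget."""
    return [
        cfg
        for cfg in product(CPU_LEVELS, GPU_LEVELS, CPU_LEVELS, GPU_LEVELS)
        if is_feasible(*cfg)
    ]
```

Under this reading, a setting such as 17.4 W CPU and 41.6 W GPU for one program with 17.4 W CPU and 22.4 W GPU for the other is feasible, while giving both programs the top GPU level is not.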
Methodology: benchmark programs
• Used 6 benchmark programs
• Divided into 3 groups depending on characteristics
Benchmark                     Acronym  Source       Characteristics
Magnetic Resonance Imaging Q  MRQ      Parboil      Compute-bound
Stream Cluster                SCL      Rodinia      Compute-bound
Hotspot                       HOT      Rodinia      Neutral
Sum of Absolute Difference    SAD      Parboil      Neutral
Stencil                       STN      Parboil      Memory-bound
Stream Copy                   SCP      CS Virginia  Memory-bound
Evaluation: case study 1 (compute- vs. memory-bound)
• 19% throughput improvement; 32% energy efficiency improvement
• Allocating more power to the compute-bound program
• Optimal points vary depending on the metric
Evaluation: case study 2 (memory- vs. memory-bound)
• 10% throughput improvement; 32% energy efficiency improvement
• Equally allocated power
• Again, the optimal point depends on
  - Evaluation metric
  - Workload characteristics (compute- or memory-bound)
Evaluation: variation of optimal configuration
• Depending on programs’ characteristics and evaluation metrics
Optimal power configuration per workload pair (C = compute-bound, N = neutral, M = memory-bound):
                    Metric 1: throughput          Metric 2: energy efficiency
                    P1 (Watt)     P2 (Watt)       P1 (Watt)     P2 (Watt)
P1       P2         CPU    GPU    CPU    GPU      CPU    GPU    CPU    GPU
MRQ (C)  SCL (C)    17.4   31.2   17.4   31.2     17.4   16.8   17.4   16.8
SCP (M)  STN (M)    17.4   31.2   17.4   31.2     17.4   11.2   17.4   11.2
SAD (N)  HOT (N)    17.4   31.2   17.4   31.2     17.4   11.2   17.4   16.8
MRQ (C)  SCP (M)    17.4   41.6   17.4   22.4     17.4   22.4   17.4   16.8
SCL (C)  SCP (M)    17.4   41.6   17.4   22.4     17.4   11.2   17.4   11.2
HOT (N)  MRQ (C)    17.4   31.2   17.4   31.2     17.4   11.2   17.4   22.4
MRQ (C)  SAD (N)    17.4   31.2   17.4   31.2     17.4   16.8   17.4   22.4
SCL (C)  SAD (N)    17.4   31.2   17.4   31.2     17.4   16.8   17.4   11.2
HOT (N)  STN (M)    17.4   41.6   17.4   22.4     17.4   11.2   17.4   11.2
HOT (N)  SCP (M)    17.4   41.6   17.4   22.4     17.4   11.2   17.4   11.2
SAD (N)  SCP (M)    17.4   41.6   17.4   22.4     17.4   11.2   17.4   22.4
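As a small worked example, the sketch below sums the per-program CPU and GPU settings for three rows of the table. The row values are copied from the table; the dictionary layout and function name are illustrative. Comparing the totals with the 100 W chip budget is this example's own reading, not a claim made on the slide.

```python
# Optimal (CPU W, GPU W) settings for P1 and P2, copied from three table rows.
ROWS = {
    "MRQ (C) vs. SCL (C)": {
        "throughput": [(17.4, 31.2), (17.4, 31.2)],
        "energy_efficiency": [(17.4, 16.8), (17.4, 16.8)],
    },
    "SCP (M) vs. STN (M)": {
        "throughput": [(17.4, 31.2), (17.4, 31.2)],
        "energy_efficiency": [(17.4, 11.2), (17.4, 11.2)],
    },
    "MRQ (C) vs. SCP (M)": {
        "throughput": [(17.4, 41.6), (17.4, 22.4)],
        "energy_efficiency": [(17.4, 22.4), (17.4, 16.8)],
    },
}


def total_power(config):
    """Sum CPU and GPU power over both programs, rounded to 0.1 W."""
    return round(sum(cpu + gpu for cpu, gpu in config), 1)


for pair, optimum in ROWS.items():
    print(pair,
          "throughput:", total_power(optimum["throughput"]), "W;",
          "energy efficiency:", total_power(optimum["energy_efficiency"]), "W")
```

For these rows the throughput-optimal settings add up to 97.2 W or 98.8 W, close to the 100 W chip budget, while the energy-efficiency-optimal settings add up to between 57.2 W and 74.0 W.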
Evaluation: performance improvement from optimal power allocation
• Achieved significant improvement
  - 12% for throughput
  - 18% for energy efficiency
[Figure: normalized IPS and normalized IPS/W (y-axis ticks 0.9, 1.1, 1.3) for each workload pair: MRQ vs. SCL (CC), SCP vs. STN (MM), SAD vs. HOT (NN), MRQ vs. SCP (CM), SCL vs. SCP (CM), HOT vs. MRQ (NC), MRQ vs. SAD (CN), SCL vs. SAD (CN), HOT vs. STN (NM), HOT vs. SCP (NM), SAD vs. SCP (NM), and GEOMEAN.]
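The two metrics in the figure can be sketched as follows. This assumes that throughput is measured in instructions per second (IPS), that energy efficiency is IPS per watt, that both are normalized to the baseline configuration, and that GEOMEAN is the geometric mean over the workload pairs. The function names are illustrative, and no measured results are reproduced here.

```python
import math


def normalized_ips(ips, baseline_ips):
    """Throughput relative to the baseline configuration."""
    return ips / baseline_ips


def normalized_ips_per_watt(ips, power_w, baseline_ips, baseline_power_w):
    """Energy efficiency (IPS/W) relative to the baseline configuration."""
    return (ips / power_w) / (baseline_ips / baseline_power_w)


def geomean(values):
    """Geometric mean of positive values, computed in log space."""
    return math.exp(sum(math.log(v) for v in values) / len(values))
```

A geometric mean is the usual choice for averaging normalized ratios, since it treats a 2x gain and a 2x loss symmetrically.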
Algorithm for throughput maximization
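[Flowchart]
1. calculate(slope): obtain the slopes sp1 and sp2 of the two programs
2. abs(sp1-sp2) < threshold?
   - YES: alloc(equally)
   - NO: go to step 3
3. sp1 > sp2?
   - YES: alloc(p1_more)
   - NO: alloc(p2_more)
4. wait(regular_time), then return to step 1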
[Figure: normalized throughput (y-axis, 0.8 to 2.0) vs. power allocation (x-axis: 28.6, 34.2, 39.8, 48.6, 59.0) for a compute-bound program (mri-q) and a memory-bound program (stream-copy).]
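The flowchart on this slide could be sketched in Python roughly as follows. The slope is taken to be the change in throughput per additional watt, which is an assumption. The callbacks `measure_slopes` and `apply_allocation`, the `rounds` parameter, and the string return values are hypothetical stand-ins for platform hooks that the slides do not describe.

```python
import time


def estimate_slope(perf_low, perf_high, power_low, power_high):
    """Assumed definition of slope: throughput gained per extra watt."""
    return (perf_high - perf_low) / (power_high - power_low)


def choose_allocation(sp1, sp2, threshold):
    """One pass through the decision part of the flowchart."""
    if abs(sp1 - sp2) < threshold:
        return "equally"
    # The program whose throughput responds more strongly gets more power.
    return "p1_more" if sp1 > sp2 else "p2_more"


def run(measure_slopes, apply_allocation, threshold, regular_time, rounds):
    """Repeat the decision periodically (rounds is a hypothetical stop condition)."""
    for _ in range(rounds):
        sp1, sp2 = measure_slopes()
        apply_allocation(choose_allocation(sp1, sp2, threshold))
        time.sleep(regular_time)
```

With a compute-bound program as P1 and a memory-bound program as P2, sp1 would typically exceed sp2 by more than the threshold, so the sketch returns "p1_more", in line with case study 1.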
Algorithm for energy efficiency maximization
• Gradient search from the minimum power allocation
[Flowchart]
1. final = min_power
2. MAX = max( EE(final), EE(final, p1++), EE(final, p2++) )
3. EE(final) == MAX?
   - YES: exit
   - NO: go to step 4
4. EE(final, p1++) > EE(final, p2++)?
   - YES: final = (final, p1++)
   - NO: final = (final, p2++)
5. Return to step 2
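A minimal sketch of this gradient search is given below. It assumes that each program's power is chosen from a discrete list of levels, that p1++ and p2++ mean moving one program up by one level, and that `ee(i, j)` is a caller-supplied function returning the energy efficiency at level indices i and j. All of these names are illustrative rather than taken from the slides.

```python
def maximize_energy_efficiency(ee, levels_p1, levels_p2):
    """Greedy search upward from the minimum power allocation.

    ee(i, j) returns the energy efficiency when program 1 runs at
    levels_p1[i] and program 2 runs at levels_p2[j].
    """
    i, j = 0, 0  # final = min_power
    while True:
        current = ee(i, j)
        # Candidate moves: raise one program by one level, if a level remains.
        up1 = ee(i + 1, j) if i + 1 < len(levels_p1) else float("-inf")
        up2 = ee(i, j + 1) if j + 1 < len(levels_p2) else float("-inf")
        best = max(current, up1, up2)
        if current == best:
            # No neighbouring step improves energy efficiency: exit.
            return levels_p1[i], levels_p2[j]
        if up1 > up2:
            i += 1  # final = (final, p1++)
        else:
            j += 1  # final = (final, p2++)
```

Because the search only moves to a neighbouring level and stops when no single step helps, it finds a local optimum. That matches the "(near-)optimal" wording in the conclusion, but it is not guaranteed to be global for an arbitrary `ee`.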
Conclusion
• We propose a solution for optimal power allocation
  - Workload-aware power allocation
  - By using program characteristics and evaluation metrics
• Significant performance improvement achieved
  - 12% for throughput
  - 18% for energy efficiency
• Run-time algorithms effectively find (near-)optimal power allocation
Backup slides
Simulator
• Integrated CPU + GPU simulator
  - H. Wang, V. Sathish, R. Singh, M. Schulte and N. Kim, "Workload and Power Budget Partitioning for Single-Chip Heterogeneous Processors," in PACT, 2012.
  - http://cpu-gpu-sim.ece.wisc.edu/
  - gem5 + GPGPU-Sim
• Adaptive power allocation for multiprogrammed workload
  - Per-core V/F domains for CPU
  - 2 V/F domains for GPU