Optimal Power Allocation for Multiprogrammed Workloads on Single-chip Heterogeneous Processors

Euijin Kwon¹,², Jae Young Jang², Jae W. Lee², Nam Sung Kim²,³

Single-chip heterogeneous processors

• Compared to systems based on discrete components:
  - Lower communication overhead
  - Lower power consumption
  - Lower cost (less silicon)
  - Friendly to emerging applications (sequential + parallel processing)

Examples: AMD’s Llano, Intel’s Sandy Bridge, Samsung’s Exynos (sources: AMD, Intel, and Samsung)

Challenges

• Performance of a single-chip heterogeneous processor (SCHP) is limited by its power budget
  - Total chip power budget
  - CPU/GPU power budget
• Multiprogrammed workload (MW)
  - Workload-aware power allocation
  - Considering program characteristics and evaluation metrics

How can we optimize overall performance within a limited power budget?

Outline

• Motivation
• Target platform: SCHP + MW
• Workload-aware power allocation
  - Characteristics of programs
  - Evaluation metrics
• Methodology
  - Power configuration
  - Benchmark programs
• Evaluation
• Algorithm
• Conclusion

Target platform: SCHP + MW

• 4-core CPU + 16-SM GPU
• Multiple V/F domains (DVFS)
• 2 programs running
• Hardware resources evenly divided

[Figure: block diagram of the target platform. CPU Core0 to Core3 with per-core CPU V/F domains; GPU0 and GPU1, each with its own V/F domain; memory controllers (MCs) in a separate V/F domain. The multiprogrammed workload consists of Program 1 and Program 2.]

Workload-aware power allocation

• Characteristics of programs
  - Non-uniform performance sensitivities
• Evaluation metrics
  - Throughput vs. energy efficiency

[Figure: normalized throughput (y-axis, 0.8 to 2.0) vs. power allocation using the same HW (x-axis: 28.6, 34.2, 39.8, 48.6, 59.0) for a compute-bound program (mri-q) and a memory-bound program (stream-copy). Annotation: allocating more power to mri-q.]

Methodology: shared power budget

• The power budget of each CPU/GPU group can be changed across five levels

Power configuration (W):
  CPU 1: 17.4, 24.8, 34.2, 46.4, 62.8
  CPU 2: 17.4, 24.8, 34.2, 46.4, 62.8
  GPU 1: 11.2, 16.8, 22.4, 31.2, 41.6
  GPU 2: 11.2, 16.8, 22.4, 31.2, 41.6

Output: throughput, energy efficiency

• Total chip power budget = 100 W
• CPU power budget = 80 W
• GPU power budget = 64 W
• Baseline configuration
  - Evenly divided (25 W for each CPU/GPU group)
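The budget constraints on this slide can be sketched as a feasibility check over the discrete power levels. This is a minimal illustration, not the authors' tooling: it assumes that each budget is an upper bound on the summed power of the corresponding groups, and the names `is_feasible` and `feasible_configs` are hypothetical.

```python
from itertools import product

# Power levels (W) taken from the slide.
CPU_LEVELS_W = (17.4, 24.8, 34.2, 46.4, 62.8)
GPU_LEVELS_W = (11.2, 16.8, 22.4, 31.2, 41.6)

# Budgets (W) taken from the slide; treating them as upper bounds
# on summed group power is an assumption.
TOTAL_BUDGET_W = 100.0
CPU_BUDGET_W = 80.0
GPU_BUDGET_W = 64.0
EPS = 1e-9  # tolerance for floating-point sums


def is_feasible(cpu1, gpu1, cpu2, gpu2):
    """Return True if the configuration respects all three budgets."""
    cpu_total = cpu1 + cpu2
    gpu_total = gpu1 + gpu2
    return (
        cpu_total <= CPU_BUDGET_W + EPS
        and gpu_total <= GPU_BUDGET_W + EPS
        and cpu_total + gpu_total <= TOTAL_BUDGET_W + EPS
    )


def feasible_configs():
    """Enumerate every (cpu1, gpu1, cpu2, gpu2) tuple within the budgets."""
    all_configs = product(CPU_LEVELS_W, GPU_LEVELS_W, CPU_LEVELS_W, GPU_LEVELS_W)
    return [cfg for cfg in all_configs if is_feasible(*cfg)]
```

Under these assumptions, `is_feasible(17.4, 31.2, 17.4, 31.2)` holds (97.2 W in total), while giving both GPU groups 41.6 W would exceed the 64 W GPU budget.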

Methodology: benchmark programs

• Used 6 benchmark programs
• Divided into 3 groups depending on characteristics

Benchmark                     Acronym  Source       Characteristics
Magnetic Resonance Imaging Q  MRQ      Parboil      Compute-bound
Stream Cluster                SCL      Rodinia      Compute-bound
Hotspot                       HOT      Rodinia      Neutral
Sum of Absolute Difference    SAD      Parboil      Neutral
Stencil                       STN      Parboil      Memory-bound
Stream Copy                   SCP      CS Virginia  Memory-bound

Evaluation: case study 1 (compute- vs. memory-bound)

Results: 19% throughput improvement; 32% energy efficiency improvement

• Allocating more power to the compute-bound program
• Optimal points vary depending on the metric

Evaluation: case study 2 (memory- vs. memory-bound)

Results: 10% throughput improvement; 32% energy efficiency improvement

• Equally allocated power
• Again, the optimal point depends on
  - Evaluation metric
  - Workload characteristics (compute- or memory-bound)

Evaluation: variation of optimal configuration

• The optimal configuration varies depending on the programs’ characteristics and the evaluation metric

Optimal power per program (Watt). C = compute-bound, M = memory-bound, N = neutral.

                  Metric 1: throughput            Metric 2: energy efficiency
P1       P2       P1 CPU  P1 GPU  P2 CPU  P2 GPU  P1 CPU  P1 GPU  P2 CPU  P2 GPU
MRQ (C)  SCL (C)  17.4    31.2    17.4    31.2    17.4    16.8    17.4    16.8
SCP (M)  STN (M)  17.4    31.2    17.4    31.2    17.4    11.2    17.4    11.2
SAD (N)  HOT (N)  17.4    31.2    17.4    31.2    17.4    11.2    17.4    16.8
MRQ (C)  SCP (M)  17.4    41.6    17.4    22.4    17.4    22.4    17.4    16.8
SCL (C)  SCP (M)  17.4    41.6    17.4    22.4    17.4    11.2    17.4    11.2
HOT (N)  MRQ (C)  17.4    31.2    17.4    31.2    17.4    11.2    17.4    22.4
MRQ (C)  SAD (N)  17.4    31.2    17.4    31.2    17.4    16.8    17.4    22.4
SCL (C)  SAD (N)  17.4    31.2    17.4    31.2    17.4    16.8    17.4    11.2
HOT (N)  STN (M)  17.4    41.6    17.4    22.4    17.4    11.2    17.4    11.2
HOT (N)  SCP (M)  17.4    41.6    17.4    22.4    17.4    11.2    17.4    11.2
SAD (N)  SCP (M)  17.4    41.6    17.4    22.4    17.4    11.2    17.4    22.4

Evaluation: performance improvement from optimal power allocation

• Achieved significant improvement
  - 12% for throughput
  - 18% for energy efficiency

[Figure: normalized IPS and normalized IPS/W (y-axis, 0.9 to 1.3) for each workload pair: MRQ vs. SCL (CC), SCP vs. STN (MM), SAD vs. HOT (NN), MRQ vs. SCP (CM), SCL vs. SCP (CM), HOT vs. MRQ (NC), MRQ vs. SAD (CN), SCL vs. SAD (CN), HOT vs. STN (NM), HOT vs. SCP (NM), SAD vs. SCP (NM), and GEOMEAN.]
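The two metrics in the chart, normalized IPS and normalized IPS/W, together with the GEOMEAN bar, can be computed as in the following sketch. The slides do not state how the IPS of the two programs is combined or how the baseline is measured, so the function names and any inputs passed to them are assumptions for illustration only.

```python
import math


def energy_efficiency(ips, watts):
    """Energy efficiency as instructions per second per watt (IPS/W)."""
    if watts <= 0:
        raise ValueError("power must be positive")
    return ips / watts


def normalize(value, baseline):
    """Express a metric relative to the baseline configuration."""
    if baseline <= 0:
        raise ValueError("baseline must be positive")
    return value / baseline


def geomean(values):
    """Geometric mean of positive values, computed in log space."""
    values = list(values)
    if not values or any(v <= 0 for v in values):
        raise ValueError("geomean needs at least one value, all positive")
    return math.exp(sum(math.log(v) for v in values) / len(values))
```

A geometric mean is the usual way to summarize ratios such as normalized metrics, because it treats a 2x gain and a 2x loss symmetrically.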

Algorithm for throughput maximization

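• Periodic control loop
  - calculate(slope): obtain the slopes sp1 and sp2 of the two programs
  - If abs(sp1 - sp2) < threshold: alloc(equally)
  - Otherwise, compare sp1 > sp2 and apply alloc(p1_more) or alloc(p2_more) accordingly
  - wait(regular_time), then repeat

[Figure: normalized throughput (y-axis, 0.8 to 2.0) vs. power allocation (x-axis: 28.6, 34.2, 39.8, 48.6, 59.0) for a compute-bound program (mri-q) and a memory-bound program (stream-copy).]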
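The flowchart on this slide can be sketched as a small periodic controller. This is an illustrative reading rather than the authors' implementation: the slope is assumed to be the change in throughput per watt, the program with the larger slope is assumed to receive more power, and `measure_slopes` and `apply_allocation` are hypothetical callbacks standing in for the platform's monitoring and DVFS interfaces.

```python
import time


def slope(tput_low, tput_high, power_low, power_high):
    """Throughput gained per additional watt between two power levels."""
    return (tput_high - tput_low) / (power_high - power_low)


def choose_allocation(sp1, sp2, threshold):
    """Map the two slopes to one of the three flowchart actions."""
    if abs(sp1 - sp2) < threshold:
        return "equally"
    # Assumption: the more power-sensitive program gets more power.
    return "p1_more" if sp1 > sp2 else "p2_more"


def control_loop(measure_slopes, apply_allocation, threshold,
                 period_s, iterations, sleep=time.sleep):
    """Run the decide/apply/wait cycle a fixed number of times."""
    decisions = []
    for _ in range(iterations):
        sp1, sp2 = measure_slopes()      # calculate(slope)
        decision = choose_allocation(sp1, sp2, threshold)
        apply_allocation(decision)       # alloc(...)
        decisions.append(decision)
        sleep(period_s)                  # wait(regular_time)
    return decisions
```

The `sleep` parameter is injectable only so that the loop can be exercised without real delays; a deployed controller would run indefinitely rather than for a fixed iteration count.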

Algorithm for energy efficiency maximization

• Gradient search from the minimum power allocation

final = min_power
repeat:
  MAX = max( EE(final), EE(final, p1++), EE(final, p2++) )
  if EE(final) == MAX: exit
  if EE(final, p1++) > EE(final, p2++): final = (final, p1++)
  else: final = (final, p2++)
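The gradient search on this slide can be sketched as a greedy hill climb over discrete power levels. This is a sketch under stated assumptions, not the authors' code: each program's allocation is modeled as an index into its list of power levels, `p1++` and `p2++` are read as raising that index by one, and `ee` is a hypothetical callback that returns the measured energy efficiency for a pair of indices.

```python
import math


def gradient_search(ee, n_levels_p1, n_levels_p2):
    """Climb from the minimum allocation until no single step improves EE.

    ee(i, j) returns the energy efficiency when program 1 runs at level i
    and program 2 at level j. Returns the final (i, j) pair.
    """
    i, j = 0, 0  # final = min_power
    while True:
        ee_now = ee(i, j)
        # A step past the highest level is not available.
        ee_p1 = ee(i + 1, j) if i + 1 < n_levels_p1 else -math.inf
        ee_p2 = ee(i, j + 1) if j + 1 < n_levels_p2 else -math.inf
        if ee_now >= max(ee_p1, ee_p2):  # EE(final) == MAX
            return i, j
        if ee_p1 > ee_p2:
            i += 1                       # final = (final, p1++)
        else:
            j += 1                       # final = (final, p2++)
```

Each iteration raises one index, so the loop stops after at most `n_levels_p1 + n_levels_p2 - 2` steps. Like any greedy search, it can stop at a local optimum, which is consistent with the slides describing the result as (near-)optimal.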

Conclusion

• We propose a solution for optimal power allocation
  - Workload-aware power allocation
  - By using program characteristics and evaluation metrics
• Significant performance improvement achieved
  - 12% for throughput
  - 18% for energy efficiency
• Run-time algorithms effectively find (near-)optimal power allocation

Backup slides

Simulator

• Integrated CPU + GPU simulator
  - H. Wang, V. Sathish, R. Singh, M. Schulte, and N. Kim, "Workload and Power Budget Partitioning for Single-Chip Heterogeneous Processors," in PACT, 2012.
  - http://cpu-gpu-sim.ece.wisc.edu/
  - gem5 + GPGPU-Sim
• Adaptive power allocation for multiprogrammed workload
  - Per-core V/F domains for CPU
  - 2 V/F domains for GPU
