Computer Science

Adaptive, Transparent Frequency and Voltage Scaling of Communication Phases in MPI Programs

Min Yeol Lim, Computer Science Department

Sep. 8, 2006


Growing energy demand

• Energy efficiency is a big concern
  – Increased power density of microprocessors
  – Cooling cost for heat dissipation
  – Power and performance tradeoff
• Dynamic voltage and frequency scaling (DVFS)
  – Supported by newer microprocessors
  – Cubic drop in power consumption
    • Power ∝ frequency × voltage²
  – CPU is the major power consumer: 35~50% of total power
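The cubic drop can be checked with a little arithmetic. The sketch below assumes the proportionality stated on the slide, power ∝ frequency × voltage². The function name `relative_power` and the voltage values are hypothetical, chosen only for illustration, since the slides do not list the voltage of each p-state.

```python
def relative_power(freq_mhz, volts, base_freq_mhz, base_volts):
    """Power relative to a baseline, assuming P is proportional to f * V^2."""
    return (freq_mhz / base_freq_mhz) * (volts / base_volts) ** 2


# Hypothetical operating points (the voltages are NOT taken from the slides).
base = (2000, 1.5)   # (frequency in MHz, voltage in V)
half = (1000, 0.75)  # frequency and voltage both halved

# Halving both frequency and voltage gives 0.5 * 0.5**2 = 0.125 of baseline power.
print(relative_power(*half, *base))
```

When voltage can be lowered along with frequency, power falls with roughly the cube of the scaling factor, which is why DVFS is attractive whenever the lower frequency does not lengthen execution much.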


Power-performance tradeoff

• Cost vs. Benefit
  – Power vs. performance
  – Increasing execution time vs. decreasing power usage
  – CPU scaling is meaningful only if benefit > cost

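[Figure: power (y-axis) versus time (x-axis) for two executions, with energy E = P1 × T1 and E = P2 × T2. The drop in power from P1 to P2 is labeled Benefit; the increase in time from T1 to T2 is labeled Cost.]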
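A minimal sketch of the cost/benefit comparison, using E = P × T for both executions. It compares energy only, which is one possible reading of "benefit > cost". The function names and all numbers are hypothetical.

```python
def energy_joules(power_watts, time_seconds):
    """Energy of one execution: E = P * T."""
    return power_watts * time_seconds


def scaling_pays_off(p1, t1, p2, t2):
    """True if the scaled run (P2, T2) uses less energy than the baseline (P1, T1)."""
    return energy_joules(p2, t2) < energy_joules(p1, t1)


# Hypothetical case where the CPU is off the critical path: time grows only slightly.
print(scaling_pays_off(100.0, 10.0, 60.0, 12.0))  # 720 J vs. 1000 J

# Hypothetical CPU-bound case: time grows as much as the frequency drops.
print(scaling_pays_off(100.0, 10.0, 60.0, 20.0))  # 1200 J vs. 1000 J
```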


Power-performance tradeoff (cont’)

• Cost > Benefit
  – NPB EP benchmark
    • CPU-bound application
    • CPU is on critical path
• Benefit > Cost
  – NPB CG benchmark
    • Memory-bound application
    • CPU is NOT on critical path

[Figure: EP and CG benchmark plots; legend: 2.0, 1.8, 1.6, 1.4, 1.2, 1.0, 0.8 GHz]


Motivation 1

• Cost/benefit is code specific
  – Applications have different code regions
  – Most MPI communication is not CPU-critical
• P-state transition in each code region
  – High voltage and frequency in CPU-intensive regions
  – Low voltage and frequency in MPI communication regions


Time and energy performance of MPI calls

• MPI_Send

• MPI_Alltoall


Motivation 2

• Most MPI calls are too short
  – Changing the p-state on every call incurs scaling overhead
  – A p-state transition takes up to 700 microseconds
• Group adjacent calls into regions
  – Intervals between MPI calls are short
  – One p-state transition per region

[Figure: two panels. Fraction of calls versus call length (ms), and fraction of intervals versus interval between MPI calls (ms).]


Reducible regions

[Figure: timeline alternating between user code and MPI library calls A through J, grouped into regions R1, R2, and R3]


Reducible regions (cont’)

• Thresholds in time
  – close-enough (τ): time distance between adjacent calls
  – long-enough (λ): region execution time

[Figure: timeline of user code and MPI library calls A through J, annotated with δ < τ, δ > λ, and δ < τ]
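The two thresholds can be illustrated offline on a recorded trace. The sketch below is an assumption about how τ and λ combine, not the on-the-fly implementation: consecutive MPI calls whose gap is below τ are merged into one candidate region, and a candidate is kept only if it runs longer than λ. The function name and the trace values are hypothetical.

```python
from typing import List, Tuple

Span = Tuple[float, float]  # (start_ms, end_ms)


def find_reducible_regions(calls: List[Span], tau: float, lam: float) -> List[Span]:
    """Group time-ordered MPI calls into regions using the tau and lambda thresholds."""
    regions: List[Span] = []
    if not calls:
        return regions
    start, end = calls[0]
    for call_start, call_end in calls[1:]:
        if call_start - end < tau:
            # "Close enough": extend the current candidate region.
            end = call_end
        else:
            # Gap too large: close the candidate and keep it if "long enough".
            if end - start > lam:
                regions.append((start, end))
            start, end = call_start, call_end
    if end - start > lam:
        regions.append((start, end))
    return regions


# Hypothetical trace of five MPI calls (times in ms).
trace = [(0, 2), (3, 5), (6, 12), (40, 41), (80, 95)]
print(find_reducible_regions(trace, tau=10, lam=10))
```

With τ = 0 and λ = 0 no two calls are ever merged and every call of nonzero length is kept, so each MPI call becomes its own region.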


How to learn regions

• Region-finding algorithms
  – by-call
    • Reduce only in MPI code: τ=0, λ=0
    • Effective only if a single MPI call is long enough
  – simple
    • Adaptive 1-bit prediction by looking up its last behavior
    • 2 flags: begin and end
  – composite
    • Save patterns of MPI calls in each region
    • Memorize the begin/end MPI calls and # of calls
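One plausible reading of the simple algorithm's "1-bit prediction with begin and end flags" is sketched below: for each MPI call identifier, remember whether that call began or ended a reducible region the last time it ran, and predict the same behavior next time. The class and method names are hypothetical, and the slides do not give the actual data structure.

```python
from dataclasses import dataclass
from typing import Dict


@dataclass
class Flags:
    begin: bool = False  # last time, this call started a reducible region
    end: bool = False    # last time, this call finished a reducible region


class SimplePredictor:
    """Per-call 1-bit history: predict whatever was observed most recently."""

    def __init__(self) -> None:
        self._flags: Dict[int, Flags] = {}

    def predict(self, call_id: int) -> Flags:
        # A call seen for the first time predicts "no region",
        # so execution stays at the top p-state.
        return self._flags.get(call_id, Flags())

    def update(self, call_id: int, began: bool, ended: bool) -> None:
        # Overwrite the history with the latest observation.
        self._flags[call_id] = Flags(began, ended)


predictor = SimplePredictor()
print(predictor.predict(7).begin)   # first appearance: no reduction predicted
predictor.update(7, began=True, ended=False)
print(predictor.predict(7).begin)   # next appearance: reduction predicted
```

Under this reading, the first appearance of a region is always missed, and a call whose behavior changes between visits is mispredicted once.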


P-state transition errors

• False-positive (FP)
  – P-state is reduced in a region where the top p-state must be used
  – e.g. regions that terminate earlier than expected
• False-negative (FN)
  – Top p-state is used in a reducible region
  – e.g. regions on their first appearance


P-state transition errors (cont’)

[Figure: program execution alternating between user code and MPI library with call pattern A A A B B B; p-state traces (top vs. reduced) for the optimal transition, Simple, and Composite. Simple and Composite each show an FN.]


P-state transition errors (cont’)

[Figure: program execution alternating between user code and MPI library with call pattern A A A A A A; p-state traces (top vs. reduced) for the optimal transition, Composite, and Simple. Composite shows an FN; Simple shows an FN and three FPs.]


Selecting proper p-state

• automatic algorithm
  – Use composite algorithm to find regions
  – Use hardware performance counters
    • Evaluation of CPU dependency in reducible regions
    • A metric of CPU load: micro-operations/microsecond (OPS)
  – Specify p-state mapping table

OPS           Frequency
> 2000        2000 MHz
1000 ~ 2000   1800 MHz
400 ~ 1000    1600 MHz
200 ~ 400     1400 MHz
100 ~ 200     1200 MHz
< 100         800 MHz
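The mapping table translates directly into a lookup. In the sketch below the thresholds and frequencies come from the table; the function name is hypothetical, and the handling of values that fall exactly on a boundary (for example OPS = 1000) is an assumption, because the table does not say which side is inclusive.

```python
# (lower OPS bound, frequency in MHz), checked from highest to lowest.
OPS_TO_MHZ = [(1000, 1800), (400, 1600), (200, 1400), (100, 1200)]


def select_frequency_mhz(ops: float) -> int:
    """Pick a CPU frequency from the measured micro-operations per microsecond."""
    if ops > 2000:
        return 2000
    for lower_bound, mhz in OPS_TO_MHZ:
        # Assumption: the lower bound of each range is inclusive.
        if ops >= lower_bound:
            return mhz
    return 800


for ops in (2500, 1500, 500, 250, 150, 50):
    print(ops, select_frequency_mhz(ops))
```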


Implementation

• Use PMPI
  – MPI profiling interface
  – Intercept pre and post hooks of any MPI call transparently
• MPI call unique identifier
  – Use the hash value of all program counters in call history
  – Insert assembly code in C
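The idea of identifying an MPI call by hashing the program counters in its call history can be sketched as follows. The slides do not name the hash function, so 64-bit FNV-1a is used here purely as a stand-in. In the real system the program counters would be collected by assembly code inside the PMPI wrapper; here they are passed in as a plain list of integers, and the addresses are hypothetical.

```python
FNV_OFFSET_BASIS = 0xCBF29CE484222325  # 64-bit FNV-1a constants
FNV_PRIME = 0x100000001B3
MASK_64 = 0xFFFFFFFFFFFFFFFF


def call_id(program_counters):
    """Hash a call history (list of return addresses) into a 64-bit identifier."""
    h = FNV_OFFSET_BASIS
    for pc in program_counters:
        # Feed each 64-bit address into the hash one byte at a time.
        for byte in pc.to_bytes(8, "little"):
            h ^= byte
            h = (h * FNV_PRIME) & MASK_64
    return h


# The same MPI function reached through two different call paths gets two identifiers.
path_a = [0x400ABC, 0x400DEF]
path_b = [0x400ABC, 0x400DEE]
print(hex(call_id(path_a)), hex(call_id(path_b)))
```

Hashing the whole call history rather than only the innermost return address distinguishes calls to the same MPI function made from different places in the program.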


Results

• System environment
  – 8 or 9 nodes with AMD Athlon-64 system
  – 7 p-states are supported: 2000~800 MHz
• Benchmarks
  – NPB MPI benchmark suite
    • C class
    • 8 applications
  – ASCI Purple benchmark suite
    • Aztec
• Thresholds (τ, λ) set to 10 ms


Benchmark analysis

– Used composite for region information

Benchmark  MPI calls  Reducible regions  Calls per region  Avg. time per MPI call (ms)  Avg. time per region (ms)
EP         5          1                  4.0               68.7                         337.0
FT         46         45                 1.0               18400.0                      18810.0
IS         37         14                 2.5               3100.0                       8200.0
Aztec      20,767     301                68.9              2.0                          143.0
CG         41,953     1,977              21.2              6.9                          149.0
MG         10,002     158                63.3              3.8                          272.0
SP         19,671     8,424              3.2               20.6                         49.4
BT         108,706    797                136.7             8.4                          1145.0
LU         81,874     766                107.2             1.1                          356.0


Taxonomy

– Profile does not have FN or FP

Region finding   Single reduced p-state   Multiple reduced p-states
Naive            By-call
Adaptive         Simple
Adaptive         Composite                Automatic
Static           Profile


Overall Energy Delay Product (EDP)
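Energy-delay product is energy multiplied by execution time, so a lower value is better. As a small worked example, the sketch below compares the Base and Composite columns of the first time/energy pair in the Performance table later in the deck (984.85 s and 606.45 kJ versus 985.36 s and 487.41 kJ). The function name is hypothetical.

```python
def energy_delay_product(energy_kj: float, time_s: float) -> float:
    """EDP = energy * execution time (here in kJ * s)."""
    return energy_kj * time_s


base_edp = energy_delay_product(606.45, 984.85)
composite_edp = energy_delay_product(487.41, 985.36)

# Ratio below 1.0 means the composite run improves on the baseline.
print(round(composite_edp / base_edp, 3))
```

Because the execution time is almost unchanged in this pair, the EDP ratio is close to the normalized energy reported in the table.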


Comparison of p-state transition errors

• Breakdown of execution time

Simple Composite


τ evaluation

• SP benchmark


τ evaluation (cont’)

[Figure panels: MG, CG, BT, LU]


Conclusion

• Contributions
  – Design and implement an adaptive p-state transition system in MPI communication phases
    • Identify reducible regions on the fly
    • Determine proper p-state dynamically
  – Provide transparency to users
• Future work
  – Evaluate the performance with other applications
  – Experiments on the OPT cluster


State transition diagram

• Simple

[Figure: state diagram with states OUT and IN; transition labels: “close enough”, not “close enough”, begin == 1, end == 1, else]


State transition diagram (cont’)

• Composite

[Figure: state diagram with states OUT, REC, and IN; transition labels: “close enough”, pattern mismatch, end of region, operation begins reducible region, else]


Performance

Values in parentheses are normalized to Base.

Metric       Base      By-call           Simple            Composite
Time (s)     984.85    987.24 (1.002)    985.80 (1.001)    985.36 (1.001)
Energy (kJ)  606.45    479.24 (0.790)    486.21 (0.802)    487.41 (0.804)

Time (s)     910.85    944.86 (1.037)    938.16 (1.030)    938.38 (1.030)
Energy (kJ)  758.56    715.92 (0.944)    672.85 (0.887)    672.01 (0.886)

Time (s)     1027.2    1295.50 (1.261)   1061.30 (1.033)   1057.10 (1.029)
Energy (kJ)  646.01    774.96 (1.200)    590.38 (0.914)    592.93 (0.918)

Time (s)     378.5     414.87 (1.096)    402.81 (1.064)    396.74 (1.048)
Energy (kJ)  238.58    245.85 (1.031)    210.52 (0.882)    209.63 (0.879)

Time (s)     76.84     90.35 (1.176)     81.90 (1.066)     78.87 (1.027)
Energy (kJ)  55.173    58.63 (1.063)     49.98 (0.906)     49.71 (0.901)

Time (s)     628.74    841.69 (1.339)    662.94 (1.054)    654.67 (1.041)
Energy (kJ)  489.08    510.24 (1.043)    441.24 (0.902)    438.14 (0.896)

Benchmark labels: CG, FT, LU, MG, SP, BT


Benchmark analysis

– Region information from composite with τ = 10 ms

Benchmark  MPI calls  Reducible regions  Calls per region  Avg. time per MPI call (ms)  Avg. time per region (ms)  MPI time fraction  Region time fraction
EP         5          1                  4.0               68.7                         337.0                      0.005              0.005
FT         46         45                 1.0               18400.0                      18810.0                    0.849              0.860
IS         37         14                 2.5               3100.0                       8200.0                     0.871              0.871
Aztec      20,767     301                68.9              2.0                          143.0                      0.806              0.812
CG         41,953     1,977              21.2              6.9                          149.0                      0.753              0.768
MG         10,002     158                63.3              3.8                          272.0                      0.500              0.574
SP         19,671     8,424              3.2               20.6                         49.4                       0.441              0.453
BT         108,706    797                136.7             8.4                          1145.0                     0.865              0.891
LU         81,874     766                107.2             1.1                          356.0                      0.149              0.446
