Computer Science
Adaptive, Transparent Frequency and Voltage Scaling of Communication Phases in MPI Programs
Min Yeol Lim
Computer Science Department
Sep. 8, 2006
Growing energy demand
• Energy efficiency is a big concern
– Increased power density of microprocessors
– Cooling cost for heat dissipation
– Power and performance tradeoff
• Dynamic voltage and frequency scaling (DVFS)
– Supported by newer microprocessors
– Cubic drop in power consumption
• Power ∝ frequency × voltage²
– CPU is the major power consumer: 35~50% of total power
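The cubic drop in power noted on this slide follows when voltage is lowered in proportion to frequency. A minimal sketch, assuming the idealized dynamic-power model (power proportional to frequency × voltage²) and ignoring static power; the function name `relative_power` is illustrative and not from the slides.

```python
def relative_power(freq_ratio: float, volt_ratio: float) -> float:
    """Dynamic power relative to the top p-state, assuming P ∝ f × V²."""
    return freq_ratio * volt_ratio ** 2


# Assumption: voltage is scaled by the same factor as frequency.
half_speed = relative_power(0.5, 0.5)
print(half_speed)  # 0.125, i.e. (1/2)³ of the original dynamic power
```

In practice voltage cannot be lowered as far as frequency, so the real saving is smaller than this ideal cube.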
Power-performance tradeoff
• Cost vs. Benefit
– Power vs. performance
– Increasing execution time vs. decreasing power usage
– CPU scaling is meaningful only if benefit > cost
[Figure: power (y-axis) vs. time (x-axis) for two executions, with energy E = P1 × T1 and E = P2 × T2. Labels: P1, T1, P2, T2, Benefit, Cost.]
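The "benefit > cost" condition can be checked by comparing E = P × T for the two runs. The sketch below is illustrative only: the function names and all power and time values are hypothetical, chosen to show one case where scaling loses and one where it wins.

```python
def energy(power_watts: float, time_seconds: float) -> float:
    """Energy in joules for a run at constant average power: E = P * T."""
    return power_watts * time_seconds


def scaling_pays_off(p1: float, t1: float, p2: float, t2: float) -> bool:
    """True if the scaled run (P2, T2) uses less energy than the base run (P1, T1)."""
    return energy(p2, t2) < energy(p1, t1)


# Hypothetical CPU-bound code: halving the frequency doubles the run time.
cpu_bound = scaling_pays_off(p1=100.0, t1=10.0, p2=60.0, t2=20.0)

# Hypothetical memory-bound code: the same power drop costs only 10% more time.
memory_bound = scaling_pays_off(p1=100.0, t1=10.0, p2=60.0, t2=11.0)

print(cpu_bound, memory_bound)  # False True
```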
Power-performance tradeoff (cont’)
• Cost > Benefit
– NPB EP benchmark
• CPU-bound application
• CPU is on critical path
• Benefit > Cost
– NPB CG benchmark
• Memory-bound application
• CPU is NOT on critical path
[Figure: results at CPU frequencies of 2.0, 1.8, 1.6, 1.4, 1.2, 1.0, and 0.8 GHz.]
Motivation 1
• Cost/Benefit is code specific
– Applications have different code regions
– For most MPI communication, the CPU is not on the critical path
• P-state transition in each code region
– High voltage and frequency in CPU-intensive regions
– Low voltage and frequency in MPI communication regions
Time and energy performance of MPI calls
• MPI_Send
• MPI_Alltoall
Motivation 2
• Most MPI calls are too short
– Changing the p-state on every call incurs scaling overhead
– A p-state transition takes up to 700 microseconds
• Group adjacent calls into regions
– Intervals between MPI calls are small
– P-state transition occurs once per region
[Figure: two distributions. Left: fraction of calls vs. call length (ms). Right: fraction of intervals vs. interval between MPI calls (ms).]
Reducible regions
[Figure: timeline alternating between user code and the MPI library, with MPI calls A through J grouped into regions R1, R2, and R3.]
Reducible regions (cont’)
• Thresholds in time
– close-enough (τ): time distance between adjacent calls
– long-enough (λ): region execution time
[Figure: the same timeline of user code and MPI library calls A through J, annotated with δ < τ, δ > λ, and δ < τ.]
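The two thresholds can be illustrated with an offline pass over a list of call timestamps. This is a sketch under stated assumptions: the slides describe regions being learned at run time, whereas this version sees all calls at once, and the function name, the tuple layout, and the sample timings are hypothetical.

```python
from typing import List, Tuple

Interval = Tuple[float, float]  # (start_ms, end_ms)


def find_reducible_regions(calls: List[Interval], tau: float, lam: float) -> List[Interval]:
    """Merge calls whose gap is below tau; keep merged regions longer than lam."""
    regions: List[Interval] = []
    if not calls:
        return regions
    start, end = calls[0]
    for next_start, next_end in calls[1:]:
        if next_start - end < tau:   # close enough: extend the current region
            end = next_end
        else:                        # gap too large: close the current region
            if end - start > lam:    # long enough: worth a p-state change
                regions.append((start, end))
            start, end = next_start, next_end
    if end - start > lam:
        regions.append((start, end))
    return regions


# Hypothetical call timings in milliseconds, sorted by start time.
calls = [(0, 4), (5, 9), (10, 16), (60, 62), (100, 108), (110, 125)]
regions = find_reducible_regions(calls, tau=10, lam=10)
print(regions)  # [(0, 16), (100, 125)]
```

The isolated 2 ms call at 60 ms is dropped because it is neither close to a neighbor nor long enough on its own.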
How to learn regions
• Region-finding algorithms
– by-call
• Reduce only in MPI code: τ=0, λ=0
• Effective only if single MPI call is long enough
– simple
• Adaptive 1-bit prediction by looking up its last behavior
• 2 flags: begin and end
– composite
• Save patterns of MPI calls in each region
• Memorize the begin/end MPI calls and # of calls
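One plausible reading of the simple algorithm, sketched in Python. The class name, method names, and integer call identifiers are illustrative assumptions, not taken from the slides; the sketch only shows how two 1-bit flags per MPI call site can record the last observed behavior and drive the next prediction.

```python
class SimplePredictor:
    """Two 1-bit flags per MPI call site: begin and end (assumed layout)."""

    def __init__(self):
        self.begin = {}  # call_id -> did this call begin a reducible region last time?
        self.end = {}    # call_id -> did this call end a reducible region last time?

    def observe(self, call_id, began_region, ended_region):
        # 1-bit history: the newest observation overwrites the previous one.
        self.begin[call_id] = began_region
        self.end[call_id] = ended_region

    def predicts_begin(self, call_id):
        # Unseen call sites default to False, so the top p-state is kept.
        return self.begin.get(call_id, False)

    def predicts_end(self, call_id):
        return self.end.get(call_id, False)


predictor = SimplePredictor()
first_guess = predictor.predicts_begin(42)   # first appearance: no history yet
predictor.observe(42, began_region=True, ended_region=False)
second_guess = predictor.predicts_begin(42)  # now follows the last behavior
```

Under this reading, a region's first appearance is always missed, because the flags are only set after the region has been observed once.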
P-state transition errors
• False-positive (FP)
– P-state is reduced in a region where the top p-state must be used
– e.g. regions terminated earlier than expected
• False-negative (FN)
– Top p-state is used in a reducible region
– e.g. regions in first appearance
P-state transition errors (cont’)
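[Figure: program execution alternating between user code and the MPI library for the call pattern A A A B B B. Three traces switch between top p-state and reduced p-state: Optimal transition, Simple, and Composite. A false negative (FN) is marked on both the Simple and the Composite trace.]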
P-state transition errors (cont’)
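[Figure: program execution alternating between user code and the MPI library for the call pattern A A A A A A. Three traces switch between top p-state and reduced p-state: Optimal transition, Composite, and Simple. A false negative (FN) is marked on the Composite trace; one FN and three false positives (FP) are marked on the Simple trace.]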
Selecting proper p-state
• automatic algorithm
– Use composite algorithm to find regions
– Use hardware performance counters
• Evaluation of CPU dependency in reducible regions
• A metric of CPU load: micro-operations/microsecond (OPS)
– Specify p-state mapping table
OPS            Frequency
> 2000         2000 MHz
1000 ~ 2000    1800 MHz
400 ~ 1000     1600 MHz
200 ~ 400      1400 MHz
100 ~ 200      1200 MHz
< 100          800 MHz
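The mapping table can be expressed as a simple lookup. A minimal sketch: the function name is hypothetical, and because the slide does not say which side of each range a boundary value belongs to, the code assumes a value equal to a lower bound falls into the range above it (except 2000, which the table excludes with "> 2000").

```python
def select_frequency_mhz(ops: float) -> int:
    """Map micro-operations per microsecond (OPS) to a CPU frequency in MHz."""
    if ops > 2000:
        return 2000
    # (inclusive lower bound, frequency); boundary handling is an assumption.
    for lower_bound, mhz in ((1000, 1800), (400, 1600), (200, 1400), (100, 1200)):
        if ops >= lower_bound:
            return mhz
    return 800


for sample_ops in (2500, 1500, 500, 250, 150, 50):
    print(sample_ops, select_frequency_mhz(sample_ops))
```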
Implementation
• Use PMPI
– MPI profiling interface
– Transparently intercept any MPI call with pre and post hooks
• MPI call unique identifier
– Use the hash value of all program counters in call history
– Insert assembly code in C
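The call identifier idea can be sketched as a hash over the return addresses on the call stack. The slides obtain the program counters with inline assembly in C; this Python sketch simply takes them as a list of integers. The choice of FNV-1a as the hash, the function name, and the sample addresses are all assumptions for illustration.

```python
FNV_OFFSET_BASIS = 0xCBF29CE484222325  # 64-bit FNV-1a offset basis
FNV_PRIME = 0x100000001B3              # 64-bit FNV prime


def call_site_id(program_counters):
    """Hash a sequence of 64-bit program counters into one 64-bit identifier."""
    h = FNV_OFFSET_BASIS
    for pc in program_counters:
        for byte in pc.to_bytes(8, "little"):
            h ^= byte
            h = (h * FNV_PRIME) % 2**64
    return h


# Hypothetical stacks: the same MPI call reached through two different callers.
stack_a = [0x400A10, 0x400B20]
stack_b = [0x400A10, 0x400B21]
print(hex(call_site_id(stack_a)))
print(call_site_id(stack_a) == call_site_id(stack_b))  # False
```

Hashing the whole call history, rather than only the immediate caller, lets the same MPI function be treated as a different call site depending on how it was reached.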
Results
• System environment
– 8 or 9 nodes with AMD Athlon-64 system
– 7 p-states are supported: 2000~800 MHz
• Benchmarks
– NPB MPI benchmark suite
• C class
• 8 applications
– ASCI Purple benchmark suite
• Aztec
• Thresholds (τ, λ) of 10 ms
Benchmark analysis
– Used composite for region information
Benchmark  MPI calls  Reducible regions  Calls per region  Avg. time per MPI call (ms)  Avg. time per region (ms)
EP                 5                  1               4.0                         68.7                      337.0
FT                46                 45               1.0                      18400.0                    18810.0
IS                37                 14               2.5                       3100.0                     8200.0
Aztec         20,767                301              68.9                          2.0                      143.0
CG            41,953               1977              21.2                          6.9                      149.0
MG            10,002                158              63.3                          3.8                      272.0
SP            19,671               8424               3.2                         20.6                       49.4
BT           108,706                797             136.7                          8.4                     1145.0
LU            81,874                766             107.2                          1.1                      356.0
Taxonomy
– Profile does not have FN or FP
Region finding   Algorithm
Naive            By-call
Adaptive         Simple
Adaptive         Composite (single reduced p-state), Automatic (multiple reduced p-states)
Static           Profile
Overall Energy Delay Product (EDP)
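EDP is the product of energy and execution time, so it penalizes energy savings that come with a large slowdown. A minimal sketch; the function names are illustrative, and the sample numbers are the Base and Composite values of the first Time/Energy pair on the Performance slide of this deck.

```python
def energy_delay_product(energy_kj: float, time_s: float) -> float:
    """EDP = energy × delay (here in kJ·s)."""
    return energy_kj * time_s


def relative_edp(energy_kj, time_s, base_energy_kj, base_time_s):
    """EDP normalized to the base run; values below 1.0 are an improvement."""
    return energy_delay_product(energy_kj, time_s) / energy_delay_product(
        base_energy_kj, base_time_s
    )


# Base: 984.85 s, 606.45 kJ. Composite: 985.36 s, 487.41 kJ.
composite_vs_base = relative_edp(487.41, 985.36, 606.45, 984.85)
print(round(composite_vs_base, 3))
```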
Comparison of p-state transition errors
• Breakdown of execution time
Simple Composite
τ evaluation
• SP benchmark
τ evaluation (cont’)
[Figure panels: MG, CG, BT, LU]
Conclusion
• Contributions
– Design and implement an adaptive p-state transition system for MPI communication phases
• Identify reducible regions on the fly
• Determine proper p-state dynamically
– Provide transparency to users
• Future work
– Evaluate the performance with other applications
– Experiments on the OPT cluster
State transition diagram
• Simple
[Figure: state diagram with states OUT and IN. Transition labels: begin == 1, end == 1, “close enough”, not “close enough”, else.]
State transition diagram (cont’)
• Composite
[Figure: state diagram with states OUT, IN, and REC. Transition labels: “close enough”, not “close enough”, pattern mismatch, end of region, operation begins reducible region, else.]
Performance
Metric        Base      By-call   (norm.)   Simple    (norm.)   Composite  (norm.)
Time (s)      984.85    987.24    1.002     985.80    1.001     985.36     1.001
Energy (kJ)   606.45    479.24    0.790     486.21    0.802     487.41     0.804
Time (s)      910.85    944.86    1.037     938.16    1.030     938.38     1.030
Energy (kJ)   758.56    715.92    0.944     672.85    0.887     672.01     0.886
Time (s)      1027.2    1295.50   1.261     1061.30   1.033     1057.10    1.029
Energy (kJ)   646.01    774.96    1.200     590.38    0.914     592.93     0.918
Time (s)      378.5     414.87    1.096     402.81    1.064     396.74     1.048
Energy (kJ)   238.58    245.85    1.031     210.52    0.882     209.63     0.879
Time (s)      76.84     90.35     1.176     81.90     1.066     78.87      1.027
Energy (kJ)   55.173    58.63     1.063     49.98     0.906     49.71      0.901
Time (s)      628.74    841.69    1.339     662.94    1.054     654.67     1.041
Energy (kJ)   489.08    510.24    1.043     441.24    0.902     438.14     0.896
(norm.) = value divided by Base
Benchmark row labels as extracted (their order relative to the rows above is not preserved): CG, FT, LU, MG, SP, BT
Benchmark analysis
– Region information from composite with τ = 10 ms
Benchmark  MPI calls  Reducible regions  Calls per region  Avg. time per MPI call (ms)  Avg. time per region (ms)  Time fraction (MPI)  Time fraction (region)
EP                 5                  1               4.0                         68.7                      337.0                0.005                   0.005
FT                46                 45               1.0                      18400.0                    18810.0                0.849                   0.860
IS                37                 14               2.5                       3100.0                     8200.0                0.871                   0.871
Aztec         20,767                301              68.9                          2.0                      143.0                0.806                   0.812
CG            41,953               1977              21.2                          6.9                      149.0                0.753                   0.768
MG            10,002                158              63.3                          3.8                      272.0                0.500                   0.574
SP            19,671               8424               3.2                         20.6                       49.4                0.441                   0.453
BT           108,706                797             136.7                          8.4                     1145.0                0.865                   0.891
LU            81,874                766             107.2                          1.1                      356.0                0.149                   0.446