Allocation DAG-based Profile-based DVFS-based
Energy- and Thermal-aware
scheduling for datacenters
Georges Da Costa
Ljubljana WG meeting, 8th July, 2016
Action IC0804 www.cost804.org
[email protected] 1/34
Datacenters: a major ecological impact
Recent datacenter: 40,000 servers, 500,000 services (virtual machines)
Google, Facebook: more than 1 million servers
Power consumption is also large-scale
2000: 70 TWh
2007: 330 TWh, 2% of world CO2 production
2011: equivalent to the 6th country in terms of power consumption
2020: 1000 TWh
Increasing
2014: 90% of datacenter owners plan an update before the end of 2015
Sustainable datacenters
Multi-layer approach
Hardware: change servers and cooling system
If entropy is constant, theoretical consumption is 0
Applications: rewrite applications using an innovative paradigm∗ or an improved library
Middleware: manages the datacenters
Middleware: minimum cost, maximum impact
OpenStack: 30% market share in 2014
Open-source solutions: 43% (+72% in 2 years)
∗ Georges et al. Exascale machines require new programming paradigms and runtimes, SFI journal, 2015
Power and Energy are Unique
Temporal effects
Inertia linked to temperature
Switching on/off servers
Under- or over-reservation
Cycles can be relevant
Non-linear effects
Electrical power equations
Feedback loops
Cooling system
Violaine et al., Thermal-aware cloud middleware to reduce cooling needs, WETICE workshop, 2014
Simplest (?) tool: Experiments
Simple experiment: Fast Fourier Transform (NPB)
100 runs using the same hardware (Grid'5000)
Large differences
Time: 12s, 7% (Std. Dev. 3.2s)
Energy: 9.3kJ, 5.5% (3kJ)
For the same time, 167s, a difference of 4kJ
Time ≠ Energy
[Figure: Fourier transform runs; axes: time (s) and energy (kJ)]
Simulation
Large number of simulators: SimGrid, DCWorms, CloudSim, ...
Needed specifications:
Models of cloud (migration, over-allocation of resources, federation†)
DVFS
Power consumption
Temperature
An evolving field
DVFS and fine-grained cloud simulation in CloudSim
Thermal models in DCWorms∗
DVFS and energy in SimGrid
∗ Wojtek et al., Energy and thermal models for simulation of workload and resource management in computing systems, SMPT journal, 2015
† Thiam et al., Cooperative Scheduling Anti-load balancing Algorithm for Cloud, CCTS workshop, 2013
Example: adding DVFS in CloudSim
Originally a Grid simulator
Great stability over time
100% resource usage
DVFS requires moving events internally
Fine-grained temporal management (1/10 s)
Tom et al., Energy-aware simulation with DVFS, SMPT journal, 2013
Plan
1 Allocation
2 Using DAG for frequency scaling
3 Profile-based hardware reconfiguration
4 HPC-aware DVFS
Allocation using Genetic Algorithms
Chromosome = Allocation
First random population
For each iteration:
Mutation and recombination
Sort using the fitness function
Keep the best and iterate
Fitness depends on the metric functions
Performance, Energy, Resilience, Dynamism
Tom et al., Quality of Service Modeling for Green
Scheduling in Clouds, SUSCOM journal, 2014
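The loop described on this slide can be sketched as follows. This is a minimal illustration, not the implementation from the cited paper: the problem size, the power model (idle watts plus watts per hosted service), the capacity penalty and all function names are assumptions, and the fitness covers only an energy-like metric.

```python
import random

N_SERVICES, N_SERVERS, CAPACITY = 12, 6, 4   # hypothetical problem size
IDLE_W, PER_SERVICE_W = 100.0, 20.0          # hypothetical power model

def fitness(chromosome):
    """Lower is better: power of switched-on servers plus an overload penalty."""
    load = [0] * N_SERVERS
    for server in chromosome:
        load[server] += 1
    power = sum(IDLE_W + PER_SERVICE_W * l for l in load if l > 0)
    overload = sum(max(0, l - CAPACITY) for l in load)
    return power + 1000.0 * overload

def mutate(chromosome, rng):
    """Move one randomly chosen service to a random server."""
    child = list(chromosome)
    child[rng.randrange(len(child))] = rng.randrange(N_SERVERS)
    return child

def recombine(a, b, rng):
    """One-point crossover of two allocations."""
    cut = rng.randrange(1, len(a))
    return a[:cut] + b[cut:]

def genetic_allocation(generations=200, pop_size=30, seed=0):
    rng = random.Random(seed)
    # Chromosome = allocation: entry i is the server hosting service i.
    population = [[rng.randrange(N_SERVERS) for _ in range(N_SERVICES)]
                  for _ in range(pop_size)]
    for _ in range(generations):
        children = []
        for _ in range(pop_size):
            a, b = rng.sample(population, 2)
            children.append(mutate(recombine(a, b, rng), rng))
        # Sort using the fitness function and keep the best.
        population = sorted(population + children, key=fitness)[:pop_size]
    return population[0]

best = genetic_allocation()
```

Other metrics (performance, resilience, dynamism) would enter through additional terms in `fitness`; how they are weighted is a design choice this sketch does not address.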
Result for Genetic Algorithm
Each one is better in its domain (Energy)
GA_All: good overall
400 services on 110 servers (40s)
Taking a metric into account matters!
Fuzzy Greedy
Advantage of G.A.: Fitness function
Similar method for greedy algorithm:
Set of greedy algorithms
Keep the best
What is the best?
Multi-objective: Fuzzy∗
With thermal models of datacenters (D-Matrix)†
[Figure labels: optimum over relaxed E; family of greedy algorithms; new optimum]
∗ Hong Yang et al., Multi-Objective Scheduling for Heterogeneous Server Systems with Machine Placement, CCGRID conference, 2014
† Hong Yang et al., Energy-efficient and thermal-aware resource management for heterogeneous datacenters, SUSCOM journal
Using DAG for frequency scaling
Use external contextual information
Example: DAG of tasks
Coordination of frequency of servers
Generalization toward the critical path
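One way to see where frequency can be lowered is to compute the slack of each task relative to the critical path: tasks with zero slack determine the makespan, the others can be slowed down. The sketch below assumes task durations at maximum frequency and a predecessor map; the task names and the helper functions are hypothetical and are not taken from the slides.

```python
def topological_order(durations, preds):
    """Kahn-style ordering: a task appears after all of its predecessors."""
    remaining = {t: set(preds.get(t, ())) for t in durations}
    order = []
    while remaining:
        ready = sorted(t for t, p in remaining.items() if not p)
        if not ready:
            raise ValueError("dependency cycle")
        for t in ready:
            order.append(t)
            del remaining[t]
        for p in remaining.values():
            p.difference_update(ready)
    return order

def task_slack(durations, preds):
    """Return ({task: slack}, makespan) for a DAG of tasks."""
    order = topological_order(durations, preds)
    # Forward pass: earliest finish time of each task.
    earliest = {}
    for t in order:
        start = max((earliest[p] for p in preds.get(t, ())), default=0.0)
        earliest[t] = start + durations[t]
    makespan = max(earliest.values())
    # Backward pass: latest finish time that keeps the makespan unchanged.
    succs = {t: [] for t in durations}
    for t, ps in preds.items():
        for p in ps:
            succs[p].append(t)
    latest = {}
    for t in reversed(order):
        latest[t] = min((latest[s] - durations[s] for s in succs[t]),
                        default=makespan)
    return {t: latest[t] - earliest[t] for t in order}, makespan

# Hypothetical DAG: A precedes B and C, which both precede D.
durations = {"A": 2.0, "B": 4.0, "C": 1.0, "D": 3.0}
preds = {"B": {"A"}, "C": {"A"}, "D": {"B", "C"}}
slack, makespan = task_slack(durations, preds)
```

Here only task C is off the critical path (A, B, D), so the server running C is the candidate for a lower frequency. Slack is shared by tasks on the same path, so a real scheduler would have to distribute it rather than grant it to every task independently.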
Action at the node level
Next step:
Switching on/off servers
Take temperature into account
Profile-based hardware reconfiguration
Coarse-grained reaction at the level of a node
Change processor frequency
Change the hard-drive mode
Reconfigure network card
Detection of current phase∗
React according to the current phase
Low impact on the global infrastructure
∗ Landry et al., Application-Agnostic Framework for Improving the Energy Efficiency of Multiple HPC Subsystems, PDP Conference, 2015
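A simple way to picture phase detection is to compare each new vector of counter rates with a reference vector for every phase seen so far, and to open a new phase when none is close enough. The sketch below is an assumption made for illustration (relative Manhattan distance, fixed threshold, hypothetical function name and sample values); it is not claimed to be the method of the cited framework.

```python
def detect_phases(samples, threshold=0.5):
    """Assign a phase id to each sample (a tuple of counter access rates).

    A sample joins the closest known phase if its relative Manhattan
    distance to that phase's reference vector is within `threshold`;
    otherwise it starts a new phase and becomes its reference.
    """
    references = []   # one reference vector per known phase
    phase_ids = []
    for vec in samples:
        best_id, best_dist = None, None
        for pid, ref in enumerate(references):
            scale = sum(abs(x) for x in ref) or 1.0
            dist = sum(abs(a - b) for a, b in zip(vec, ref)) / scale
            if best_dist is None or dist < best_dist:
                best_id, best_dist = pid, dist
        if best_id is None or best_dist > threshold:
            references.append(list(vec))
            best_id = len(references) - 1
        phase_ids.append(best_id)
    return phase_ids

# Synthetic rates (branch misses, cache references, cache misses), not measurements.
quiet = (0.001, 0.002, 0.001)
busy = (0.010, 0.008, 0.004)
ids = detect_phases([quiet] * 3 + [busy] * 3 + [quiet] * 2)
```

Because references are kept, a phase that comes back later is recognised with its earlier id, which is what allows a reconfiguration decision to be reused.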
Resource consumption for phase detection
[Figure: counters access rate over time (s) for branch misses, cache references and cache misses; successive workloads: Idle, MG, BT, EP, IS, CG]
Phase detection
[Figure: phase id over time (s); successive workloads: Idle, MG, BT, EP, IS, CG]
Decision rules
Phase label: available reconfiguration rules
compute-intensive: switch off memory banks; put hard-drive in sleep mode; processor at maximum frequency; put network interface cards in sleep mode.
memory-intensive: slow down processor frequency; put hard-drive in sleep mode or reduce its speed; switch on all memory banks.
mixed: switch on all memory banks; increase processor frequency; put hard-drive in sleep mode; put network interface cards in sleep mode.
communication intensive: switch off memory banks; slow down processor frequency; switch on hard-drives.
IO-intensive: switch off memory banks; slow down processor frequency; put hard-drives in performance mode.
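The rules on this slide amount to a lookup from phase label to a list of actions. A minimal sketch of that lookup is given below; the dictionary and function names are hypothetical, the action strings are copied from the slide, and actually applying an action to hardware is left out.

```python
# Phase label -> reconfiguration actions, as listed on the slide.
RECONFIGURATION_RULES = {
    "compute-intensive": [
        "switch off memory banks",
        "put hard-drive in sleep mode",
        "processor at maximum frequency",
        "put network interface cards in sleep mode",
    ],
    "memory-intensive": [
        "slow down processor frequency",
        "put hard-drive in sleep mode or reduce its speed",
        "switch on all memory banks",
    ],
    "mixed": [
        "switch on all memory banks",
        "increase processor frequency",
        "put hard-drive in sleep mode",
        "put network interface cards in sleep mode",
    ],
    "communication intensive": [
        "switch off memory banks",
        "slow down processor frequency",
        "switch on hard-drives",
    ],
    "IO-intensive": [
        "switch off memory banks",
        "slow down processor frequency",
        "put hard-drives in performance mode",
    ],
}

def actions_for(phase_label):
    """Return the actions for a detected phase, or no action if unknown."""
    return RECONFIGURATION_RULES.get(phase_label, [])

compute_actions = actions_for("compute-intensive")
```

Returning an empty list for an unknown label is a conservative choice made here: an unrecognised phase leaves the node configuration untouched.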
Energy and performance
28 servers
[Figure: energy consumption and extra execution time (%) for CG, MG, POP X1, GeneHunter, WRF, MDS]
Landry et al., Exploiting performance counters to predict and improve energy performance of HPC systems, FGCS journal
External phase detection
Obtaining system values is intrusive
Reducing the number of monitored values reduces the overhead
Monitoring external values (power, network)
Use statistical tools
Evaluate the behavior over time
Georges et al., Characterizing applications from power consumption : A case
study for HPC benchmarks, ICT-GLOW Symposium, 2011
[Figure: number of bytes sent per second over time (s), benchmark CG (NPB)]
[Figure: number of bytes sent per second over time (s), benchmark SP (NPB)]
HPC-aware DVFS
Relative values between performance and ondemand DVFS
Benchmark            FT  SP  BT  EP  LU  IS  CG
Time increase (%)     0  -3  -1   1  -2   2   0
Energy increase (%)   0  -3  -1  -1  -2  -1  -1
HPC applications are never in Idle mode... Surprise!
MPI libraries do some polling
Classical HPC benchmarks from NPB (NAS Parallel Benchmarks)
DVFS using only processor load
[Figure: comparison of DVFS governors on the NPB benchmarks FT, SP, BT, EP, LU, IS, CG; vertical axis from 80 to 150. Governors shown: performance, powersave, ondemand, conservative, smart, smart2_0.01, smart2_0.05, smart2_0.1, smart2_0.2, smart2_0.5, smart2_1, smart3, meta_sched, meta_sched2_0.01, meta_sched2_0.05, meta_sched2_0.1, meta_sched2_0.2, meta_sched2_0.5, meta_sched2_1, meta_sched3]
Yet DVFS has potential
Relative values between performance and powersave
Benchmark            FT  SP   BT   EP  LU   IS  CG
Time increase (%)    36  69  110  159  96   35  83
Energy increase (%) -18   2   21   50  16  -19   7
Time increases, but energy consumption drops by up to 19%!
HPC hypotheses
State of applications at any time
Computing
Communications
Disk I/O
Idle
Decision
Energy for max frequency: $(\alpha + \beta) P_1$
Energy for min frequency: $(\lambda\alpha + \beta) P_2$
It is worth staying at max frequency if we consume less energy:
$(\alpha + \beta) P_1 < (\lambda\alpha + \beta) P_2$
Obtaining α and β
Difficult to measure them directly
We target runtime measurement, not code instrumentation
Easy to measure bandwidth (where $B_m$ is the maximum bandwidth)
$B_w = B_m \frac{\beta}{\alpha + \beta}$
Actually α and β are not important
$\alpha/\beta$ is, i.e. the ratio between time to compute and time to communicate
The great mix
Mix and serve
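$B_w < \frac{B_m}{\lambda - 1}\left(\lambda - \frac{P_1}{P_2}\right) = B_1$
$B_1$: bandwidth threshold at max frequency to change frequency
The other way around
$B_2 = \frac{B_m}{\lambda - 1}\left(\lambda \frac{P_2}{P_1} - 1\right)$
$B_2$: bandwidth threshold at min frequency to change frequency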
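Both thresholds can be recovered from the decision inequality $(\alpha + \beta) P_1 < (\lambda\alpha + \beta) P_2$ by dividing by the total time at the current frequency. The derivation below is a sketch under two assumptions not spelled out on the slides: $\lambda > 1$ is the slowdown of the compute part at min frequency, and the bandwidth measured at min frequency is $B_m \beta / (\lambda\alpha + \beta)$. The symbols $x$ and $y$ are introduced here for convenience.

```latex
% At max frequency: x = B_w / B_m = \beta / (\alpha + \beta),
% so \alpha / (\alpha + \beta) = 1 - x.
\begin{align*}
(\alpha+\beta)P_1 < (\lambda\alpha+\beta)P_2
  &\iff \frac{P_1}{P_2} < \lambda(1-x) + x = \lambda - (\lambda-1)\,x \\
  &\iff B_w < \frac{B_m}{\lambda-1}\left(\lambda - \frac{P_1}{P_2}\right) = B_1
\end{align*}

% At min frequency: y = B_w / B_m = \beta / (\lambda\alpha + \beta),
% so \alpha / (\lambda\alpha + \beta) = (1 - y) / \lambda.
\begin{align*}
(\alpha+\beta)P_1 < (\lambda\alpha+\beta)P_2
  &\iff \left(\frac{1-y}{\lambda} + y\right) P_1 < P_2 \\
  &\iff (\lambda-1)\,y < \lambda\frac{P_2}{P_1} - 1 \\
  &\iff B_w < \frac{B_m}{\lambda-1}\left(\lambda\frac{P_2}{P_1} - 1\right) = B_2
\end{align*}
```

In both cases, staying at (or returning to) max frequency is the energy-saving choice while the measured bandwidth is below the threshold for the current frequency.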
With a hysteresis for inertia
Algorithm NetSched
Every 0.1 second, do:
If Current_Frequency = Slowest frequency and IBR ≤ 0.9 B2
Change frequency to Fastest
If Current_Frequency = Fastest frequency and IBR ≥ 1.1 B1
Change frequency to Slowest
IBR: Incoming Byte Rate
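The decision step of this governor can be sketched as a small state machine. The sketch follows the threshold definitions given earlier (B1 applies at max frequency, B2 at min frequency) together with the 1.1 and 0.9 hysteresis factors. The function names are hypothetical, and reading the incoming byte rate or actually setting the CPU frequency is left out.

```python
def netsched_step(current, ibr, b1, b2):
    """Return the frequency state to use next.

    current: "fastest" or "slowest"
    ibr:     incoming byte rate (bytes/s) over the last period
    b1, b2:  bandwidth thresholds at max and min frequency
    """
    # At max frequency, high network activity means lowering the
    # frequency saves energy; the 1.1 factor adds hysteresis.
    if current == "fastest" and ibr >= 1.1 * b1:
        return "slowest"
    # At min frequency, low network activity means the CPU is the
    # bottleneck; the 0.9 factor adds hysteresis.
    if current == "slowest" and ibr <= 0.9 * b2:
        return "fastest"
    return current

def run_trace(ibr_trace, b1, b2, start="fastest"):
    """Apply the decision once per sample (one sample per 0.1 s period)."""
    state, history = start, []
    for ibr in ibr_trace:
        state = netsched_step(state, ibr, b1, b2)
        history.append(state)
    return history

# Illustrative thresholds and trace (bytes/s), not measured values.
history = run_trace([1e7, 7e7, 4e7, 2e7], b1=6e7, b2=3e7)
```

With these illustrative numbers, a rate of 4e7 leaves the state unchanged whichever frequency is active; this dead band between 0.9 B2 and 1.1 B1 is what prevents oscillation.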
Experimental environment
Servers (thanks Grid5000)
Processors: two dual-core AMD Opteron (2218)
Memory: 8GB
NIC: Gigabit Ethernet
Frequency: 2.6GHz and 1GHz
Electrical power: $P_1 = 280$W and $P_2 = 152$W
Benchmark
7 NAS Parallel Benchmarks (NPB)
Governors
Performance/Powersave/Ondemand
NetSched
$1.1 B_1 \simeq 7 \cdot 10^7$ and $0.9 B_2 \simeq 3 \cdot 10^7$
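The two threshold values quoted on this slide can be checked from the threshold formulas. The computation below rests on two assumptions that the slides do not state explicitly: the maximum bandwidth is that of a Gigabit Ethernet link (1 Gbit/s, i.e. 1.25e8 bytes/s), and the slowdown factor is the ratio of the two frequencies, 2.6. The function name is hypothetical.

```python
def thresholds(bm, lam, p1, p2):
    """Bandwidth thresholds B1 (at max frequency) and B2 (at min frequency)."""
    b1 = bm / (lam - 1.0) * (lam - p1 / p2)
    b2 = bm / (lam - 1.0) * (lam * p2 / p1 - 1.0)
    return b1, b2

BM = 1e9 / 8          # assumption: Gigabit Ethernet, in bytes per second
LAMBDA = 2.6 / 1.0    # assumption: slowdown = max frequency / min frequency
P1, P2 = 280.0, 152.0 # electrical power at max and min frequency (W)

b1, b2 = thresholds(BM, LAMBDA, P1, P2)
upper = 1.1 * b1      # switch to the slowest frequency at or above this rate
lower = 0.9 * b2      # switch to the fastest frequency at or below this rate
print(f"1.1*B1 = {upper:.2e} B/s, 0.9*B2 = {lower:.2e} B/s")
```

Under these assumptions the results are about 6.5e7 and 2.9e7 bytes/s, which is consistent with the rounded values given on the slide.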
Makespan and Energy-to-solution
[Figure: makespan (in % of performance) per benchmark (IS, FT, SP, CG, LU, BT, EP) for the performance, powersave, net_sched and ondemand governors]
[Figure: energy (in % of performance) per benchmark (IS, FT, SP, CG, LU, BT, EP) for the performance, powersave, net_sched and ondemand governors]
∗ Georges Da Costa et al., DVFS governor for HPC: Higher, Faster, Greener, Euromicro PDP conference, 2015
Conclusion
Allocation: Genetic Algorithm∗, Vector packing or Fuzzy
Up to 30% power consumption reduction
Using DAG for frequency scaling§
Up to 13% power consumption reduction
Profile-based hardware reconfiguration†
Up to 13% power consumption reduction, 3% makespan increase
HPC-aware DVFS‡
Up to 25% power consumption reduction, 1% makespan decrease!
∗ Tom et al., Quality of Service Modeling for Green Scheduling in Clouds, SUSCOM journal, 2014
§ Tom et al., Energy-aware simulation with DVFS, SMPT journal, 2013
† Landry et al., Exploiting performance counters to predict and improve energy performance of HPC systems, FGCS journal, 2014
‡ Georges et al., DVFS governor for HPC: Higher, Faster, Greener, PDP conference, 2015