Tuan V. Dinh, Lachlan Andrew and Yoni Nazarathy
Modelling a supercomputer with the $M_t^X/G/\infty$ model
Australia and New Zealand Applied Probability Workshop
http://caia.swin.edu.au/cv/tdinh 10 July 2013 Slide 2 Australia and New Zealand Applied Probability Workshop
Supercomputer clusters
large scale simulation: climate, genome, astronomy, etc.
foundation of cloud computing
BIG DATA, EXASCALE COMPUTING
MORE COMPUTING POWER DESIRED
Electricity bills
Heat – thermal management
Investment – cooling systems, hardware, etc.
Power proportionality
[Figure: power vs. load for a single server (1); ideal (power proportional to load) vs. reality (about 60% of peak power when idle)]
(1) Barroso and Hölzle, “The Case for Energy-Proportional Computing”, 2007.
idle server ~ 60% peak power
turn off idle servers
challenges: switching cost (setup, wear-and-tear), performance impacts?
Swinburne Supercomputer
An energy saving framework
CONTROL FRAMEWORK
system congestion model
number of active servers needed?
historical implications?
ongoing system states?
arrival characteristics?
job elapsed times?
Objective: min (energy + switching + performance penalty)
Congestion model
CONTROL FRAMEWORK
number of active servers needed?
historical implications?
ongoing system states?
arrival characteristics?
job elapsed times?
Objective: min (energy + switching + performance penalty)
Congestion model: $M_t^X/G/\infty$
[Figure: servers 1, 2, 3, …]
batch Poisson arrivals with a time-varying rate function
batch size X, with a given c.d.f.
i.i.d. service times G
WHY?
jobs arrive in a “batch” manner, i.e. within seconds of each other, from the same user
system mostly under-utilized, hence the infinite-server approximation
substantial daily variations
Discrete-time cost
[Timeline: t, t+k, T+t]
current running jobs at t+k:
X(k) = {jobs arriving in (t, t+k], still around at t+k}
U(k) = {jobs arriving before t, still around at t+k}
C(k) = C1(k) + C2(k) + C3(k)
C1(k): energy, a term in n(k)
C2(k): switching, a term in |n(k) - n(k-1)|
C3(k): performance penalty
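The three-part slot cost can be sketched in code. The weights and the form of the performance penalty did not survive on the slide, so everything named `energy_w`, `switch_w`, `penalty_w`, and the "unserved demand" penalty below are assumptions made purely for illustration.

```python
def slot_cost(n, n_prev, demand, energy_w, switch_w, penalty_w):
    """Cost of one slot k with n active servers (all weights are hypothetical)."""
    energy = energy_w * n                    # C1(k): grows with active servers n(k)
    switching = switch_w * abs(n - n_prev)   # C2(k): charged per server toggled
    # C3(k): ASSUMED form, penalising demand that exceeds the active servers.
    penalty = penalty_w * max(demand - n, 0)
    return energy + switching + penalty


# Example: 5 servers on, 3 were on before, 7 jobs want a server.
example_cost = slot_cost(5, 3, 7, energy_w=1.0, switch_w=2.0, penalty_w=3.0)
```

With these illustrative weights, the example slot costs 5 (energy) + 4 (switching) + 6 (penalty).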
Optimization formulation
C(k) = C1(k) + C2(k) + C3(k): energy (a term in n(k)), switching (a term in |n(k) - n(k-1)|), performance penalty
(*)
solving (*) requires load estimation far into the future.
the system can feed back the ACTUAL load U(s) for s < k
A Model Predictive Control framework
CONTROL FRAMEWORK
number of active servers needed?
historical implications?
ongoing system states?
arrival characteristics?
job elapsed times?
Objective: min (energy + switching + performance penalty)
MPC
Model Predictive Control execution
At time t, look ahead over a window of length T, up to T+t: solve (**), obtain {n*(0), n*(1), …}. ONLY “execute” n*(0).
At time t+1, shift the window, now up to T+t+1: solve (**) again, obtain {n*(0), n*(1), …}. ONLY “execute” n*(0).
(**): limited look-ahead
1. less sensitive to load estimation accuracy
2. uses “on-going” information: know how many jobs actually arrived in (t, t+1]
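The receding-horizon execution above can be sketched as a short loop. This is a minimal illustration, not the authors' controller: `forecast`, `plan`, and the placeholder planner `cover_forecast` are hypothetical names, and the planner simply provisions the forecast load instead of solving (**).

```python
import math


def run_mpc(num_slots, horizon, forecast, plan, n_initial=0):
    """Receding-horizon loop: re-plan every slot, execute only the first decision."""
    executed = []
    n_prev = n_initial
    for t in range(num_slots):
        # Re-forecast from what is known at slot t (the "on-going" information).
        predicted = [forecast(t, k) for k in range(horizon)]
        decisions = plan(predicted, n_prev)  # plays the role of {n*(0), n*(1), ...}
        n_prev = decisions[0]                # ONLY n*(0) is executed
        executed.append(n_prev)
    return executed


def cover_forecast(predicted, n_prev):
    """Placeholder planner (assumption): provision the forecast load, rounded up."""
    return [math.ceil(load) for load in predicted]


# Hypothetical forecast: load seen at slot t, drifting upward over the look-ahead.
schedule = run_mpc(4, 3, lambda t, k: 10 + t + 0.5 * k, cover_forecast)
```

Decisions n*(1), n*(2), … are discarded each slot and recomputed once the next slot's actual state is known, which is why later forecast errors matter less.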
Solving the optimization problem
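(***) $\min \sum_{k=0}^{K-1} \{\, n(k) + |u(k)| \,\}$ (energy and switching terms)
s.t.: one set of constraints for each k = 0, 1, …, K-1
Normal approximation
C(k) = C1(k) + C2(k) + C3(k): energy (a term in n(k)), switching (a term in |n(k) - n(k-1)|), performance penalty
solved numerically using LP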
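To make the planning step concrete, here is a small stand-in solver. It is not the LP the slide refers to: it is a dynamic program over candidate server levels, written with the standard library only. It assumes the constraints take the form n(k) >= theta[k] (a per-slot minimum server count), that u(k) = n(k) - n(k-1), and that the weights `energy_w` and `switch_w` are given; all of these are assumptions for illustration. Restricting n(k) to levels that appear in `theta` or equal the current level should lose nothing for this piecewise-linear cost, but treat that as an assumption too.

```python
def plan_servers(theta, n_prev, energy_w, switch_w):
    """Minimise sum_k energy_w*n(k) + switch_w*|n(k)-n(k-1)| s.t. n(k) >= theta[k]."""
    levels = sorted(set(theta) | {n_prev})
    inf = float("inf")
    # best[j]: cheapest cost so far with the latest slot at levels[j].
    best = [0.0 if v == n_prev else inf for v in levels]
    back_pointers = []
    for need in theta:
        new_best, back = [], []
        for v in levels:
            if v < need:  # violates the assumed constraint n(k) >= theta[k]
                new_best.append(inf)
                back.append(-1)
                continue
            i = min(range(len(levels)),
                    key=lambda i: best[i] + switch_w * abs(v - levels[i]))
            new_best.append(best[i] + switch_w * abs(v - levels[i]) + energy_w * v)
            back.append(i)
        best = new_best
        back_pointers.append(back)
    j = min(range(len(levels)), key=lambda j: best[j])
    total = best[j]
    plan = []
    for back in reversed(back_pointers):  # walk back from the last slot
        plan.append(levels[j])
        j = back[j]
    plan.reverse()
    return plan, total
```

For example, with `theta = [2, 5, 2]` and two servers currently on, cheap switching (`switch_w = 0.1`) tracks the demand exactly, while expensive switching (`switch_w = 10`) keeps five servers on after the peak rather than paying to switch three off.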
X(k): new arrivals
{jobs arriving in (t, t+k], still around at t+k}
[Carrillo, 89]: X(k) is a compound Poisson RV, with a batch rate that depends on s = (k+1/2)Δ; Δ: slot-time.
even if the arrival process is NOT Poisson [Whitt, 99].
N ~ Poisson( )
b_i: i.i.d. batch size, with given mean and variance
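Only the mean and variance of a compound Poisson variable are needed for a Normal approximation, and both follow from standard identities: for a sum of N i.i.d. terms with N ~ Poisson(Λ), the mean is Λ·E[b] and the variance is Λ·E[b²]. The sketch below applies those identities; the parameter names are hypothetical, and the slide's own expression for the Poisson parameter is not reproduced here.

```python
def compound_poisson_moments(poisson_mean, batch_mean, batch_var):
    """Mean and variance of b_1 + ... + b_N with N ~ Poisson(poisson_mean)."""
    mean = poisson_mean * batch_mean
    # E[b^2] = Var[b] + E[b]^2
    variance = poisson_mean * (batch_var + batch_mean ** 2)
    return mean, variance


# Example: on average 2 batches, each of mean size 3 and variance 4.
x_mean, x_var = compound_poisson_moments(2.0, 3.0, 4.0)
```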
U(k): existing jobs
{jobs arriving before t, still around at t+k}
[Carrillo, 91]: U(k) is a binomial RV, with parameters that depend on s = (k+1/2)Δ; Δ: slot-time.
one can use job elapsed runtimes to calculate the parameters [Whitt, 99]
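One way elapsed runtimes can enter the calculation is through conditional survival: a job that has already run for a seconds is still running s seconds later with probability P(G > a + s) / P(G > a). The sketch below sums independent per-job indicators, which is a variation on the slide's single binomial (each job gets its own probability). The exponential service-time law and the 100-second mean are assumptions for illustration only.

```python
import math


def remaining_job_moments(elapsed_times, lookahead, survival):
    """Mean and variance of the number of current jobs still running after `lookahead`."""
    mean = 0.0
    variance = 0.0
    for elapsed in elapsed_times:
        # P(G > elapsed + lookahead | G > elapsed); requires survival(elapsed) > 0.
        p = survival(elapsed + lookahead) / survival(elapsed)
        mean += p
        variance += p * (1.0 - p)  # independent Bernoulli per job
    return mean, variance


def exp_survival(x, mean_service=100.0):
    """ASSUMED service-time law: exponential with a hypothetical 100 s mean."""
    return math.exp(-x / mean_service)


u_mean, u_var = remaining_job_moments([10.0, 50.0], 100.0, exp_survival)
```

With an exponential law the elapsed time has no effect (memorylessness); a heavier-tailed law would make long-running jobs more likely to persist.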
Summary of analytical framework
CONTROL FRAMEWORK
number of active servers needed?
historical implications?
ongoing system states?
arrival characteristics?
job elapsed times?
Objective: min (energy + switching + performance penalty)
MPC
LP optimization
Normal approximation
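The Normal approximation step in this summary can be illustrated as follows: given the mean and variance of the load in a slot, pick the smallest server count that covers the load with a target probability. Treating that target as the service availability is an assumption here, as are the function and parameter names.

```python
import math
from statistics import NormalDist


def servers_needed(load_mean, load_var, availability):
    """Smallest n with P(load <= n) >= availability under a Normal approximation."""
    z = NormalDist().inv_cdf(availability)  # standard Normal quantile, 0 < availability < 1
    return math.ceil(load_mean + z * math.sqrt(load_var))


# Example: load with mean 10 and variance 4, covered 97.5% of the time.
n_required = servers_needed(10.0, 4.0, 0.975)
```

Only the first two moments of the load are used, so any batch-size or service-time distribution with the same mean and variance gives the same answer.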
Numerical evaluation
[Diagram: Swinburne supercomputer logs feed the supercomputer simulator; the simulator sends system states to the CONTROLLER, which returns a control decision; output: cost performance]
Scheme 1: All up (no turn off)
[Diagram: Swinburne supercomputer logs feed the supercomputer simulator; NO CONTROL in place of the controller; output: cost performance]
Scheme 2: t_wait heuristic
[Diagram: Swinburne supercomputer logs feed the supercomputer simulator; the simulator sends system states to the t_wait heuristic, which returns a control decision; output: cost performance]
Server idle for t_wait => turn OFF
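The timeout rule can be sketched for a single server as follows. The input is a hypothetical list of idle-gap lengths (seconds between consecutive jobs on that server); the function reports how long the server sits idle but powered on, and how often it is switched off.

```python
def twait_heuristic(idle_gaps, t_wait):
    """Apply 'idle for t_wait => turn OFF' to one server's idle gaps."""
    idle_on_time = 0.0
    switch_offs = 0
    for gap in idle_gaps:
        if gap > t_wait:
            idle_on_time += t_wait  # waits t_wait, then powers down
            switch_offs += 1
        else:
            idle_on_time += gap     # next job arrives before the timeout
    return idle_on_time, switch_offs


# Example: three idle gaps with a 20-second timeout.
result = twait_heuristic([5.0, 30.0, 100.0], 20.0)
```

A small t_wait saves idle energy but increases switching; a large one does the opposite.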
Scheme 3: predictive control
[Diagram: Swinburne supercomputer logs feed the supercomputer simulator; the simulator sends system states to the MPC controller, which returns a control decision; output: cost performance]
MPC inputs estimated from historical data
S.3: rate function
[Figure: arrival rate vs. time of day, for 2010 and 2011]
use daily periodic rates
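A daily periodic rate can be estimated by folding arrival timestamps onto a single day and averaging over the number of days observed. The sketch below assumes timestamps are in seconds from midnight of the first day and that each timestamp marks one batch arrival; both are assumptions about the log format, not details from the slides.

```python
def daily_rate_profile(arrival_times, num_days, bins_per_day=24):
    """Average arrival rate (per second) in each time-of-day bin."""
    seconds_per_day = 86400
    seconds_per_bin = seconds_per_day / bins_per_day
    counts = [0] * bins_per_day
    for t in arrival_times:
        # Fold the timestamp onto one day, then find its bin.
        counts[int((t % seconds_per_day) // seconds_per_bin)] += 1
    return [c / (num_days * seconds_per_bin) for c in counts]


# Example: four arrivals over two days, three of them in the first hour of a day.
profile = daily_rate_profile([0, 10, 86405, 7200], num_days=2)
```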
S.3: service time & batch size
[Lublin et al.,2003]: Hyper-Gamma, Log-uniform
[Li et al.,2005]: Log Normal, Weibull
[Figures: c.d.f. of service time G vs. time (sec), and c.d.f. of batch size X vs. size (CPU); legend: Empirical (2010), Gamma]
Our approximations only concern MEAN and VARIANCE of X
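Since only the mean and variance matter, a Gamma curve like the one in the legend can be obtained by moment matching. Whether the authors fitted it this way is not stated; the sketch below is one standard option, using the facts that a Gamma distribution with shape a and scale b has mean a·b and variance a·b².

```python
def gamma_from_moments(mean, variance):
    """Shape and scale of the Gamma distribution with the given mean and variance."""
    shape = mean ** 2 / variance
    scale = variance / mean
    return shape, scale


# Example: mean 6 and variance 12 correspond to shape 3 and scale 2.
shape, scale = gamma_from_moments(6.0, 12.0)
```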
S.3: cost performance
[Figure: normalised cost vs. ε]
ε ~ service availability
Cost 1 = total cost when there is NO CONTROL (energy only)
Simulation period: 1 year
Cost performance: all schemes
[Figure: cost of S.1, S.2 and S.3 (ε = 0.58), compared with the “offline” optimal cost [Lu et al., 12], which has no perf. penalty]
consider predictive settings (S.3) whose demand penalty cost is the same as that of the t_wait heuristic (S.2)
after all, the model is to estimate the θ(k)s.
still > 20% to gain
Remarks and considerations
1. Room for improvement: ~20% to gain!
2. Examining our estimations?
rate function not accurate
use job elapsed times
Normal approximation?
3. Fundamental bound on what can be achieved given uncertainty? [Dinh, Andrew and Branch, CCGrid13]
Thank you
CONTROL FRAMEWORK
number of active servers needed?
historical implications?
ongoing system states?
arrival characteristics?
job elapsed times?
Objective: min (energy + switching + performance penalty)
MPC
LP optimization
Normal approximation