Dynamic Voltage Frequency Scaling for Multi-tasking Systems
Using Online Learning
Gaurav Dhiman Tajana Simunic Rosing
Department of Computer Science and Engineering
University of California, San Diego
ISLPED 2007
Why Dynamic Voltage Frequency Scaling?
Power consumption is a critical issue in system design today:
• Mobile systems face battery life issues
• High-performance systems face heating issues
Dynamic Voltage Frequency Scaling (DVFS): Dynamically scale the supply voltage level of the CPU to provide “just enough” circuit speed to process the workload
An effective system level technique to reduce power consumption
Dynamic Power Management (DPM) is another popular system level technique. However, the focus of this work is on DVFS.
Previous Work
Based on task level knowledge: [Yao95],[Ishihara98],[Quan02]
Based on compiler/app. support: [Azevedo02],[Hsu02],[Chung02]
Based on micro-architecture level support: [Marculescu00],[Weissel02],[Choi04],[Choi05]
Workload Characterization and Voltage-Frequency Selection
No hard task deadlines in general purpose system.
Goal: Maximize energy savings while minimizing performance delay.
Key idea:
CPU-intensive tasks don’t benefit from scaling.
Memory-intensive tasks are energy efficient at low v-f settings.
Workload Characterization and Voltage-Frequency Selection (contd.)
[Figure: two plots against Frequency (MHz), 200 to 500, for burn_loop, mem, and combo. One plot shows Normalized Energy Consumption; the other shows Performance Improvement.]
• Three tasks, burn_loop (CPU-intensive), mem (memory-intensive), and combo (mix), are run with static scaling.
• burn_loop is energy efficient at all settings
• mem is energy efficient at the lowest v-f setting
Measure CPU-intensiveness (µ)
CPI stack: $CPI_{avg} = CPI_{base} + CPI_{cache} + CPI_{tlb} + CPI_{branch} + CPI_{stall}$
Use the Performance Monitoring Unit (PMU) of the PXA27x to estimate the CPI stack components.
$\mu = CPI_{base} / CPI_{avg}$
High µ indicates high CPU-intensiveness and vice versa
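The µ computation can be sketched as follows. This is a minimal illustration: the function name `cpu_intensiveness` and the component values are hypothetical, not measurements from the PXA27x PMU.

```python
def cpu_intensiveness(cpi_base, cpi_cache, cpi_tlb, cpi_branch, cpi_stall):
    """Return mu = CPI_base / CPI_avg for one measurement interval."""
    # CPI_avg is the sum of all CPI stack components.
    cpi_avg = cpi_base + cpi_cache + cpi_tlb + cpi_branch + cpi_stall
    if cpi_avg <= 0:
        raise ValueError("CPI_avg must be positive")
    return cpi_base / cpi_avg


# Hypothetical CPI stack values, chosen only to illustrate the two extremes.
mu_cpu_bound = cpu_intensiveness(1.0, 0.05, 0.0, 0.05, 0.0)  # mostly base CPI
mu_mem_bound = cpu_intensiveness(1.0, 2.5, 0.3, 0.1, 1.1)    # dominated by stalls
```

With these inputs the first task has µ close to 1 (CPU-intensive) and the second has µ close to 0.2 (memory-intensive).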
Dynamic Task Characterization
Dynamically estimate µ for every scheduler quantum and feed it to the online learning algorithm.
The algorithm models the CPU-intensiveness of the task and accordingly selects the best suited v-f setting.
Theoretical guarantee on converging to the best v-f setting available.
Online Learning for Horse Racing
Experts
Selects the best-performing expert for investing his money
Expert manages money for the race
Evaluates performance of all experts for that race
Online Learning for DVFS
DVFS Experts (Working Set): v-f setting 1, v-f setting 2, ..., v-f setting n
DVFS Controller:
Selects the best-performing expert
Selected expert applied to the CPU for the next scheduler quantum
Evaluates performance of all experts
Controller Algorithm
Parameters: $\beta \in [0,1]$
Initial weight vector for experts $w^1 \in [0,1]^N$ such that $\sum_{i=1}^{N} w_i^1 = 1$
Do for t = 1, 2, 3, ... (each time a sched. tick occurs):
1. Calculate µ.
2. Update weight vector of task: $w_i^{t+1} = w_i^t \cdot (1 - (1-\beta) \cdot l_i^t)$
3. Choose expert with highest probability factor in $r^t$: $r^t = \frac{w^t}{\sum_{i=1}^{N} w_i^t}$
4. Apply the v-f setting corresponding to the selected expert to the CPU.
5. Reset and restart the PMU.
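The weight update and expert selection steps of the controller can be sketched as below. This is an illustrative sketch, not the kernel module: the function names are hypothetical, losses are assumed to lie in [0, 1], and the loss values and β in the example are made up.

```python
def update_weights(weights, losses, beta):
    """Weight update: w_i^{t+1} = w_i^t * (1 - (1 - beta) * l_i^t)."""
    return [w * (1.0 - (1.0 - beta) * l) for w, l in zip(weights, losses)]


def probabilities(weights):
    """Probability vector: r^t = w^t / sum_i w_i^t (assumes a nonzero sum)."""
    total = sum(weights)
    return [w / total for w in weights]


def select_expert(weights):
    """Index of the expert with the highest probability factor in r^t."""
    r = probabilities(weights)
    return max(range(len(r)), key=lambda i: r[i])


# Example with N = 4 experts and uniform initial weights summing to 1.
weights = [0.25] * 4
beta = 0.75                      # hypothetical parameter value
losses = [0.8, 0.4, 0.0, 0.4]    # hypothetical losses for one quantum
weights = update_weights(weights, losses, beta)
chosen = select_expert(weights)  # expert with zero loss keeps the largest weight
```

A β closer to 1 shrinks the weights more slowly, so a single quantum with high loss has less influence on which expert is selected.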
Evaluation of experts (loss calculation)
[Figure: µ axis from 0 to 1.0 split into five intervals with boundaries at 0, 0.2, 0.4, 0.6, 0.8, 1.0. Expert1 through Expert5 have µmean values 0.1, 0.3, 0.5, 0.7, 0.9.]
Intuition: Best suited frequency scales linearly with µ.
Map task characteristics to the best suited frequency using the µ-mapper. E.g., Expert1-5 = {100, 200, 300, 400, 500} MHz
Evaluate experts against the best suited frequency.
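A minimal sketch of the µ-mapper and the expert evaluation, using the five-expert example above. The equal-width µ intervals follow the linear-scaling intuition; the loss function here (normalized distance from the best suited frequency) is an assumption made for illustration, since the slides do not give the exact loss formula. All function names are hypothetical.

```python
EXPERT_FREQS_MHZ = [100, 200, 300, 400, 500]  # example working set from the slide


def best_suited_expert(mu, n_experts):
    """mu-mapper: index of the equal-width interval of [0, 1] containing mu."""
    if not 0.0 <= mu <= 1.0:
        raise ValueError("mu must lie in [0, 1]")
    # mu = 1.0 falls in the last interval.
    return min(int(mu * n_experts), n_experts - 1)


def expert_losses(mu, freqs_mhz):
    """Assumed loss in [0, 1]: distance of each expert from the best suited frequency."""
    best = freqs_mhz[best_suited_expert(mu, len(freqs_mhz))]
    span = max(freqs_mhz) - min(freqs_mhz)
    return [abs(f - best) / span for f in freqs_mhz]


best_for_mem = best_suited_expert(0.25, len(EXPERT_FREQS_MHZ))  # low mu, low frequency
best_for_cpu = best_suited_expert(0.85, len(EXPERT_FREQS_MHZ))  # high mu, high frequency
losses = expert_losses(0.25, EXPERT_FREQS_MHZ)
```

With this assumed loss, the best suited expert incurs zero loss and experts further from it incur proportionally more.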
What about Multi-tasking systems?
Tasks with differing characteristics may execute together.
Weight vector ($w^t$) characterizes an executing task.
Need to personalize this information at task level for accurate characterization.
Solution: store weight vector as a task level structure
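The idea of a per-task weight vector can be sketched as follows. This is a user-space illustration only: the class `TaskWeightStore`, its method names, and the task identifiers are hypothetical and do not reflect the actual kernel data structures.

```python
class TaskWeightStore:
    """Keeps one expert weight vector per task (hypothetical helper)."""

    def __init__(self, n_experts):
        self.n_experts = n_experts
        self._weights = {}  # task id -> list of expert weights

    def on_task_create(self, task_id):
        # New tasks start with uniform weights that sum to 1.
        self._weights[task_id] = [1.0 / self.n_experts] * self.n_experts

    def on_task_exit(self, task_id):
        self._weights.pop(task_id, None)

    def weights(self, task_id):
        return self._weights[task_id]

    def on_sched_tick(self, task_id, losses, beta):
        # Only the task that ran during the quantum has its vector updated.
        w = self._weights[task_id]
        self._weights[task_id] = [
            wi * (1.0 - (1.0 - beta) * li) for wi, li in zip(w, losses)
        ]


store = TaskWeightStore(n_experts=4)
store.on_task_create("cpu_task")
store.on_task_create("mem_task")
# Hypothetical losses for a quantum in which only cpu_task ran.
store.on_sched_tick("cpu_task", losses=[0.8, 0.4, 0.0, 0.4], beta=0.75)
```

Because each task carries its own vector, what was learned about one task is not overwritten when a task with different characteristics is scheduled.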
Performance bound on Controller
Let N: experts in working set, T: total number of sched ticks
If $l_i^t$ is the loss incurred by expert $i$ for the scheduler quantum $t$, the loss of the controller for that quantum is $r^t \cdot l^t = \sum_{i=1}^{N} r_i^t l_i^t$
Goal to minimize net loss: $L_G - \min_i L_i$, where $L_G = \sum_{t=1}^{T} r^t \cdot l^t$ and $L_i = \sum_{t=1}^{T} l_i^t$
Net loss bounded by $O(\sqrt{T \ln N})$
Average net loss per period decreases at the rate of $O(\sqrt{(\ln N)/T})$
Performance of the scheme converges to that of the best-performing expert with successive sched ticks
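To get a feel for the convergence rate, the sketch below evaluates the average-net-loss rate for a working set of N = 4 experts. It assumes the rate has the form sqrt(ln N / T), which follows from dividing a net loss bound of order sqrt(T ln N) by T. The constant hidden by the O() notation is ignored, so the numbers show the trend only; the function name is hypothetical.

```python
import math


def average_net_loss_rate(n_experts, n_ticks):
    """sqrt(ln N / T): assumed rate of decrease, with the O() constant omitted."""
    return math.sqrt(math.log(n_experts) / n_ticks)


rate_100 = average_net_loss_rate(4, 100)        # after 100 sched ticks
rate_10000 = average_net_loss_rate(4, 10_000)   # after 10,000 sched ticks
```

Increasing the number of sched ticks by a factor of 100 lowers the rate by a factor of 10, which is why the scheme approaches the best-performing expert as ticks accumulate.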
Implementation
Testbed
Intel PXA27x Development Platform
Linux 2.6.9
Implemented as Loadable Kernel Module
[Figure: implementation architecture showing the user, the /proc file system, the Linux kernel with the Linux Process Manager (task creation, scheduler tick) and the DVFS LKM, and the Intel PXA27x (PMU, v-f setting).]
Experiments
Setup
1.25 samples/sec DAQ
Energy savings calculated using actual current measurements
Working set: 4 v-f setting experts
Workloads: qsort, djpeg, blowfish, dgzip

Freq (MHz)   Voltage (V)
208          1.2
312          1.3
416          1.4
520          1.5
Results: Single Task Environment
Bench.   Low perf delay -------> Higher energy savings
         %delay  %energy   %delay  %energy   %delay  %energy
qsort       6      17        16      32        25      41
djpeg       7      21        15      37        26      45
dgzip      15      30        21      42        27      49
bf          6      11        16      27        25      40

Bench.   208MHz/1.2V
         %delay  %energy
qsort      56      48
djpeg      34      54
dgzip      33      54
bf         40      51
Result: Frequency of Selection
[Figure: Frequency of Selection (axis 0 to 80) of each v-f setting (208MHz, 312MHz, 416MHz, 520MHz) for qsort, with series for low α, medium α, and high α. Annotations: "Higher energy savings", "Lower Perf Delay".]
Results: Multi Task Environment
Bench.        Low perf delay -------> Higher energy savings
              %delay  %energy   %delay  %energy   %delay  %energy
qsort+djpeg      6      17        15      33        25      41
djpeg+dgzip     13      24        19      39        27      48
qsort+djpeg      7      20        18      35        26      42
dgzip+bf        13      18        22      32        27      44
Advantages of the scheme
Online learning algorithm:
Provides theoretical guarantee on performance converging to that of the best performing expert.
Multi-Tasking systems: Works seamlessly across context switches.
User preference: Adapts energy savings/performance delay tradeoff with changes in user preference.
Overhead
Process creation: used lat_proc from lmbench. 0% overhead.
Context switch: used lat_ctx from lmbench. 3% overhead with 20 processes (max supported by lat_ctx).
[Choi05] causes 100% overhead in context switch times.
Extremely lightweight implementation.
Conclusion
Designed and implemented a DVFS technique for general purpose multi-tasking systems.
Based on online learning that provides theoretical guarantee on the convergence of overall performance to that of the best performing expert.
Provides user control over desired energy/performance tradeoff and is extremely lightweight.