Dynamic Voltage Frequency Scaling for Multi-tasking Systems Using Online Learning


Gaurav Dhiman, Tajana Simunic Rosing

Department of Computer Science and Engineering
University of California, San Diego

ISLPED 2007

Why Dynamic Voltage Frequency Scaling?

Power consumption is a critical issue in system design today:

• Mobile systems face battery life issues

• High performance systems face heating issues

Dynamic Voltage Frequency Scaling (DVFS): dynamically scale the supply voltage level of the CPU to provide “just enough” circuit speed to process the workload.

An effective system level technique to reduce power consumption.

Dynamic Power Management (DPM) is another popular system level technique. However, the focus of this work is on DVFS.

Previous Work

Based on task level knowledge: [Yao95],[Ishihara98],[Quan02]

Based on compiler/app. support: [Azevedo02],[Hsu02],[Chung02]

Based on micro-architecture level support: [Marculescu00],[Weissel02],[Choi04],[Choi05]

Workload Characterization and Voltage-Frequency Selection

No hard task deadlines in general purpose system.

Goal: Maximize energy savings while minimizing performance delay.

Key idea:

• CPU-intensive tasks don’t benefit from scaling

• Memory-intensive tasks are energy efficient at low v-f settings

Workload Characterization and Voltage-Frequency Selection (contd.)

[Figure: two plots against frequency (200 to 500 MHz) for the tasks burn_loop, mem and combo. First plot: normalized energy consumption. Second plot: performance improvement.]

• Three tasks, burn_loop (CPU-intensive), mem (memory-intensive) and combo (mix), are run with static scaling.

• burn_loop is energy efficient at all settings

• mem is energy efficient at the lowest v-f setting

Measure CPU-intensiveness (µ)

CPI stack: $CPI_{avg} = CPI_{base} + CPI_{cache} + CPI_{tlb} + CPI_{branch} + CPI_{stall}$

Use the Performance Monitoring Unit (PMU) of the PXA27x to estimate the CPI stack components.

$\mu = CPI_{base} / CPI_{avg}$

High µ indicates high CPU-intensiveness and vice versa
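As a rough illustration of the µ definition above, the sketch below computes µ from a CPI stack. The `CpiStack` type, the function name and the sample component values are hypothetical; the slides do not say how the PMU counters are converted into CPI components, so the inputs here are simply assumed to be available.

```python
from dataclasses import dataclass


@dataclass
class CpiStack:
    """CPI stack components (hypothetical container, values in cycles per instruction)."""
    base: float
    cache: float
    tlb: float
    branch: float
    stall: float

    def average(self) -> float:
        # CPI_avg is the sum of all CPI stack components.
        return self.base + self.cache + self.tlb + self.branch + self.stall


def cpu_intensiveness(stack: CpiStack) -> float:
    """Return mu = CPI_base / CPI_avg for one measurement window."""
    avg = stack.average()
    if avg <= 0:
        raise ValueError("CPI_avg must be positive")
    return stack.base / avg


# Made-up sample values, chosen only to show the two extremes.
cpu_bound = CpiStack(base=1.0, cache=0.05, tlb=0.0, branch=0.05, stall=0.15)
memory_bound = CpiStack(base=1.0, cache=2.5, tlb=0.5, branch=0.1, stall=0.9)

print(cpu_intensiveness(cpu_bound))      # close to 1: CPU-intensive
print(cpu_intensiveness(memory_bound))   # close to 0: memory-intensive
```

Since every component is non-negative, µ always lies between 0 and 1, which is what lets it be compared across tasks.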

Dynamic Task Characterization

Dynamically estimate µ for every scheduler quantum and feed it to the online learning algorithm.

The algorithm models the CPU-intensiveness of the task and accordingly selects the best suited v-f setting.

Theoretical guarantee on converging to the best v-f setting available.

Online Learning for Horse Racing

Experts

Selects the best performing expert for investing his money

Expert manages money for the race

Evaluates performance of all experts for that race

Online Learning for DVFS

DVFS Experts (Working Set): v-f setting 1, v-f setting 2, ..., v-f setting n

DVFS Controller:

• Selects the best performing expert

• Selected expert applied to CPU for next scheduler quantum

• Evaluates performance of all experts

Controller Algorithm

Parameters: $\beta \in [0,1]$

Initial weight vector for experts $w^1 \in [0,1]^N$ such that $\sum_{i=1}^{N} w_i^1 = 1$

Do for t = 1, 2, 3, ... (each time a sched. tick occurs):

1. Calculate µ.

2. Update weight vector of task: $w_i^{t+1} = w_i^t \cdot (1 - (1-\beta) \cdot l_i^t)$

3. Choose expert with highest probability factor in $r^t$: $r^t = \frac{w^t}{\sum_{i=1}^{N} w_i^t}$

4. Apply the v-f setting corresponding to the selected expert to the CPU.

5. Reset and restart the PMU.
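The weight update and expert selection steps of the controller can be sketched as follows. This is a minimal user-space illustration, not the kernel module itself: the function names are hypothetical, the losses are made-up constants in [0, 1], and applying the v-f setting and resetting the PMU are left out.

```python
from typing import List


def update_weights(weights: List[float], losses: List[float], beta: float) -> List[float]:
    """Weight update: w_i <- w_i * (1 - (1 - beta) * l_i), with losses in [0, 1]."""
    return [w * (1.0 - (1.0 - beta) * l) for w, l in zip(weights, losses)]


def probabilities(weights: List[float]) -> List[float]:
    """Probability vector r = w / sum(w)."""
    total = sum(weights)
    if total <= 0:
        raise ValueError("all expert weights have collapsed to zero")
    return [w / total for w in weights]


def select_expert(weights: List[float]) -> int:
    """Index of the expert with the highest probability factor."""
    r = probabilities(weights)
    return max(range(len(r)), key=r.__getitem__)


# Four experts with uniform initial weights that sum to 1.
weights = [0.25, 0.25, 0.25, 0.25]
beta = 0.5
# Made-up per-expert losses, held constant over three scheduler ticks.
losses = [0.9, 0.1, 0.5, 0.7]

for _ in range(3):
    weights = update_weights(weights, losses, beta)

chosen = select_expert(weights)
print(chosen, probabilities(weights))
```

A smaller `beta` penalizes lossy experts more aggressively, so the controller commits to one v-f setting sooner but also reacts more sharply to a single bad quantum.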

Evaluation of experts (loss calculation)

[Figure: the µ axis from 0 to 1.0, divided at 0.2, 0.4, 0.6 and 0.8 into five intervals; Expert1 to Expert5 each have a µmean at an interval midpoint (0.1, 0.3, 0.5, 0.7 and 0.9).]

Intuition: Best suited frequency scales linearly with µ.

Map task characteristics to the best suited frequency using the µ-mapper. E.g., Expert1-5 = {100, 200, 300, 400, 500} MHz

Evaluate experts against the best suited frequency.
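A possible shape for the µ-mapper is sketched below, using the five example frequencies from the slide and equal-width µ intervals. The slides do not give the loss formula, so `illustrative_loss` is an assumption made only for this example (distance between the measured µ and an expert's µmean) and should not be read as the authors' loss function. All function names are hypothetical.

```python
from typing import List

# Example expert frequencies from the slide, lowest to highest (MHz).
EXPERT_FREQS_MHZ: List[int] = [100, 200, 300, 400, 500]


def mu_mean(expert_index: int, n_experts: int) -> float:
    """Midpoint of the equal-width mu interval assigned to an expert."""
    return (expert_index + 0.5) / n_experts


def best_suited_expert(mu: float, n_experts: int) -> int:
    """Map mu in [0, 1] to the expert whose interval contains it."""
    if not 0.0 <= mu <= 1.0:
        raise ValueError("mu must lie in [0, 1]")
    # mu == 1.0 would index one past the end, so clamp to the last expert.
    return min(int(mu * n_experts), n_experts - 1)


def illustrative_loss(mu: float, expert_index: int, n_experts: int) -> float:
    """ASSUMPTION: loss grows with the distance between mu and the expert's mu_mean."""
    return abs(mu - mu_mean(expert_index, n_experts))


n = len(EXPERT_FREQS_MHZ)
measured_mu = 0.5
best = best_suited_expert(measured_mu, n)
losses = [illustrative_loss(measured_mu, i, n) for i in range(n)]
print(EXPERT_FREQS_MHZ[best], losses)
```

Under this assumed loss, the expert closest to the best suited frequency is penalized least, which matches the intent of evaluating every expert against that frequency.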

What about Multi-tasking systems?

Tasks with differing characteristics may execute together.

The weight vector ($w^t$) characterizes an executing task.

This information needs to be personalized at task level for accurate characterization.

Solution: store the weight vector as a task level structure.
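The idea of keeping one weight vector per task can be sketched as below. In the actual implementation this state would live in a kernel task structure; here a dictionary keyed by a hypothetical process id stands in for it, and the class name, method names and loss values are all assumptions made for illustration.

```python
from typing import Dict, List


class PerTaskWeights:
    """Keeps a separate expert weight vector for every task (illustrative only)."""

    def __init__(self, n_experts: int, beta: float) -> None:
        self.n_experts = n_experts
        self.beta = beta
        self._weights: Dict[int, List[float]] = {}

    def on_task_create(self, pid: int) -> None:
        # Each new task starts from uniform weights that sum to 1.
        self._weights[pid] = [1.0 / self.n_experts] * self.n_experts

    def on_sched_tick(self, pid: int, losses: List[float]) -> int:
        # Only the weights of the task that just ran are updated.
        old = self._weights[pid]
        new = [w * (1.0 - (1.0 - self.beta) * l) for w, l in zip(old, losses)]
        self._weights[pid] = new
        # Normalizing by a positive sum does not change which weight is largest.
        return max(range(self.n_experts), key=new.__getitem__)

    def weights_of(self, pid: int) -> List[float]:
        return list(self._weights[pid])

    def on_task_exit(self, pid: int) -> None:
        self._weights.pop(pid, None)


table = PerTaskWeights(n_experts=4, beta=0.5)
table.on_task_create(101)  # hypothetical CPU-intensive task
table.on_task_create(202)  # hypothetical memory-intensive task

for _ in range(5):
    # Made-up losses: task 101 favors the highest setting, task 202 the lowest.
    cpu_choice = table.on_sched_tick(101, [0.9, 0.6, 0.3, 0.0])
    mem_choice = table.on_sched_tick(202, [0.0, 0.3, 0.6, 0.9])

print(cpu_choice, mem_choice)
```

Because each task carries its own weights, interleaving the two tasks does not blur their characterizations: a context switch simply changes which vector is read and updated.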

Performance bound on Controller

Let N: experts in working set, T: total number of sched ticks

If $l_i^t$ is the loss incurred by expert i for the scheduler quantum t, the loss for that quantum is $r^t \cdot l^t = \sum_{i=1}^{N} r_i^t l_i^t$

Goal: minimize net loss $L_G - \min_i L_i$, where $L_G = \sum_{t=1}^{T} r^t \cdot l^t$ and $L_i = \sum_{t=1}^{T} l_i^t$

Net loss bounded by $O(\sqrt{T \ln N})$

Average net loss per period decreases at the rate of $O(\sqrt{(\ln N)/T})$

Performance of the scheme converges to that of the best performing expert with successive sched ticks
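To give a feel for how quickly the average net loss per period shrinks, the sketch below evaluates the growth term sqrt(ln N / T) for a few values of T. The constant factor hidden by the O-notation is not given in the slides and is omitted here, so the printed numbers show only the trend, not actual loss values. The function name is hypothetical.

```python
import math


def average_loss_rate(n_experts: int, n_ticks: int) -> float:
    """Growth term sqrt(ln N / T) of the average net loss bound (constants omitted)."""
    if n_experts < 1 or n_ticks < 1:
        raise ValueError("need at least one expert and one scheduler tick")
    return math.sqrt(math.log(n_experts) / n_ticks)


# N = 4 experts, as in the working set used in the experiments.
for ticks in (10, 100, 1000, 10000):
    print(ticks, average_loss_rate(4, ticks))
```

Multiplying the number of scheduler ticks by 100 divides this term by 10, which is the sense in which the scheme converges to the best performing expert.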

Implementation

Testbed:

• Intel PXA27x Development Platform

• Linux 2.6.9

• Implemented as Loadable Kernel Module (LKM)

[Diagram: block diagram with the components User, /proc file system, Linux Kernel, Linux Process Manager, DVFS LKM and Intel PXA27x; labelled connections: Task Creation, Scheduler Tick, PMU, v-f setting.]

Experiments

Setup:

• 1.25 samples/sec DAQ

• Energy savings calculated using actual current measurements

• Working set: 4 v-f setting experts

• Workloads: qsort, djpeg, blowfish, dgzip

Freq (MHz)   Voltage (V)
208          1.2
312          1.3
416          1.4
520          1.5
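For orientation, the sketch below applies the standard first-order CMOS dynamic power model, P proportional to V² · f, to the four v-f settings in the table. This model is not stated in the slides, and it ignores static power and the rest of the platform, so the ratios are only a rough guide and are not the measured savings reported in the results. The function name is hypothetical.

```python
from typing import List, Tuple

# (frequency in MHz, voltage in V) pairs from the table above.
VF_SETTINGS: List[Tuple[int, float]] = [(208, 1.2), (312, 1.3), (416, 1.4), (520, 1.5)]


def relative_dynamic_power(freq_mhz: float, volts: float,
                           ref_freq_mhz: float = 520, ref_volts: float = 1.5) -> float:
    """First-order estimate: dynamic power scales with V^2 * f (ASSUMED model)."""
    return (volts ** 2 * freq_mhz) / (ref_volts ** 2 * ref_freq_mhz)


for freq, volts in VF_SETTINGS:
    print(freq, volts, relative_dynamic_power(freq, volts))
```

Lower power does not by itself mean lower energy: a task that slows down at a low setting runs for longer, which is why the scheme weighs energy savings against performance delay.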

Results: Single Task Environment

Bench.   Low perf delay -------> Higher energy savings
         %delay  %energy   %delay  %energy   %delay  %energy
qsort    6       17        16      32        25      41
djpeg    7       21        15      37        26      45
dgzip    15      30        21      42        27      49
bf       6       11        16      27        25      40

Bench.   208MHz/1.2V
         %delay  %energy
qsort    56      48
djpeg    34      54
dgzip    33      54
bf       40      51

Result: Frequency of Selection

[Figure: bar chart for qsort showing the frequency of selection (y-axis, 0 to 80) of each v-f setting (208MHz, 312MHz, 416MHz, 520MHz) for low α, medium α and high α. Annotations: "Higher energy savings" and "Lower Perf Delay".]

Results: Multi Task Environment

Bench.        Low perf delay -------> Higher energy savings
              %delay  %energy   %delay  %energy   %delay  %energy
qsort+djpeg   6       17        15      33        25      41
djpeg+dgzip   13      24        19      39        27      48
qsort+djpeg   7       20        18      35        26      42
dgzip+bf      13      18        22      32        27      44

Advantages of the scheme

Online learning algorithm: provides a theoretical guarantee on performance converging to that of the best performing expert.

Multi-tasking systems: works seamlessly across context switches.

User preference: adapts the energy savings/performance delay tradeoff to changes in user preference.

Overhead

Process Creation: used lat_proc from lmbench.

• 0% overhead

Context Switch: used lat_ctx from lmbench.

• 3% overhead with 20 processes (max supported by lat_ctx)

• [Choi05] causes 100% overhead in context switch times

Extremely lightweight implementation.

Conclusion

Designed and implemented a DVFS technique for general purpose multi-tasking systems.

Based on online learning that provides theoretical guarantee on the convergence of overall performance to that of the best performing expert.

Provides user control over desired energy/performance tradeoff and is extremely lightweight.
