LCU14 406 A QUICK TAKE ON ENERGY-AWARE SCHEDULING

A QuIC Take on Energy-Aware Scheduling

Steve Muckle

Staff Engineer, Qualcomm Innovation Center, Inc. (QuIC)

Thursday, September 18th 2014

What’s the point of this presentation?

QuIC has developed an energy-aware scheduler

Can our work be used?

Describe
− our design
− problems we faced
− next steps

Caveats
− none of this code is upstreamable as-is
− design and implementation are a work in progress

overview

intro to Qualcomm Technologies big.LITTLE platforms, scene setting
load tracking
power model
hmp scheduling
scheduler-guided frequency
next steps

Qualcomm Technologies big.LITTLE Platforms: MSM8939

2 internally synchronous CPU clusters
independent CPU, cluster power gating

CPU    Core   Frequency range
CPU0   A53    200 MHz - 1536 MHz
CPU1   A53    200 MHz - 1536 MHz
CPU2   A53    200 MHz - 1536 MHz
CPU3   A53    200 MHz - 1536 MHz
CPU4   A53    200 MHz - 998 MHz
CPU5   A53    200 MHz - 998 MHz
CPU6   A53    200 MHz - 998 MHz
CPU7   A53    200 MHz - 998 MHz

MSM8939 is a product of Qualcomm Technologies, Inc.

Qualcomm Technologies big.LITTLE Platforms: MSM8994

2 internally synchronous CPU clusters
independent CPU, cluster power gating

CPU    Core   Frequency range
CPU0   A53    199 MHz - 940 MHz
CPU1   A53    199 MHz - 940 MHz
CPU2   A53    199 MHz - 940 MHz
CPU3   A53    199 MHz - 940 MHz
CPU4   A57    200 MHz - 921 MHz
CPU5   A57    200 MHz - 921 MHz
CPU6   A57    200 MHz - 921 MHz
CPU7   A57    200 MHz - 921 MHz

MSM8994 is a product of Qualcomm Technologies, Inc.

Why not GTS?

energy-aware scheduling

concern with clusters reversing roles
− overlap in mW/MIPS curves
− thermal limiting big cluster fmax at runtime

upstreaming

load tracking

per-task CPU utilization tracking is critical for EA scheduling
− cannot place a task intelligently without knowing its CPU demand accurately
− what does “accurately” mean?

most big.LITTLE and EA scheduling work today uses PELT
− per-entity load tracking
− added by Paul Turner @ Google

load tracking - PELT

tracks per-task load via a geometrically-weighted series

was not designed specifically for
− energy-aware task placement
− mobile workloads

with default mainline tuning
− cpu-bound task takes 75ms to ramp from 0% to 80%
− idle task decays from 100% to 10% in 100ms

speeding up the increase will also speed up the decrease

heavy task
− goes to sleep waiting for user input
− wakes up decayed, treated as low-demand task
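The ramp and decay figures above can be reproduced with a small floating-point model of the geometric series. This is only a sketch, not the kernel's fixed-point code: it assumes 1ms accounting periods and the mainline decay factor y with y^32 = 0.5, and the function names are made up for illustration.

```c
#include <stdbool.h>

/* Per-period decay factor, chosen so that y^32 = 0.5 (mainline default). */
#define PELT_Y 0.97857206

/* Advance a 0.0..1.0 load estimate by one ~1ms period. */
double pelt_step(double load, bool running)
{
	return load * PELT_Y + (running ? 1.0 - PELT_Y : 0.0);
}

/* Periods a CPU-bound task needs to ramp from 0 to at least 'target'. */
int pelt_periods_to_ramp(double target)
{
	double load = 0.0;
	int n = 0;

	while (load < target) {
		load = pelt_step(load, true);
		n++;
	}
	return n;
}

/* Periods a sleeping task needs to decay from 1.0 to at most 'target'. */
int pelt_periods_to_decay(double target)
{
	double load = 1.0;
	int n = 0;

	while (load > target) {
		load = pelt_step(load, false);
		n++;
	}
	return n;
}
```

With this model, ramping to 80% takes 75 periods and decaying to 10% takes 107 periods, roughly in line with the 75ms and 100ms quoted above. Because the same factor y drives both directions, a faster ramp necessarily means a faster decay.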

load tracking - PELT

20ms busy, 20ms idle (default tuning)

after two 20ms bursts of execution, the task is seen as less than 50% demand

Is this what we want?
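The 20ms busy / 20ms idle case can be checked with the same kind of floating-point model. Again a sketch under assumptions: 1ms periods, the mainline decay factor with y^32 = 0.5, a task starting from zero load, and a hypothetical helper name.

```c
/* Per-period decay factor, chosen so that y^32 = 0.5 (mainline default). */
#define PELT_Y 0.97857206

/*
 * Load estimate (0.0..1.0) after 'bursts' busy phases of busy_ms each,
 * separated by idle phases of idle_ms, starting from zero load.
 */
double pelt_load_after_bursts(int bursts, int busy_ms, int idle_ms)
{
	double load = 0.0;

	for (int b = 0; b < bursts; b++) {
		if (b > 0) {
			/* Idle phase between bursts: pure decay. */
			for (int i = 0; i < idle_ms; i++)
				load *= PELT_Y;
		}
		/* Busy phase: decay plus the contribution of the new period. */
		for (int i = 0; i < busy_ms; i++)
			load = load * PELT_Y + (1.0 - PELT_Y);
	}
	return load;
}
```

Calling `pelt_load_after_bursts(2, 20, 20)` yields a value just under 0.5, even though the task was fully CPU-bound whenever it ran.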

load tracking – window based

track task’s N most recent non-empty windows
− N configurable (assume 5)
− window size configurable (assume 20ms)

calculate task demand based on these samples
− different policy options such as avg, max, max(avg, recent)

policy             N previous non-empty windows   task demand
max                8 13 3 5 10                    13
avg                8 13 3 5 10                    7.8
max(recent, avg)   8 13 3 5 10                    10
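The three policies can be written down in a few lines. This is an illustrative sketch only: the enum, the function name and the oldest-first sample ordering (so that the most recent window is the last element) are assumptions, not the QuIC implementation.

```c
#include <stddef.h>

enum window_policy {
	POLICY_MAX,
	POLICY_AVG,
	POLICY_MAX_RECENT_AVG,
};

/*
 * samples[] holds the busy time of the N most recent non-empty windows,
 * oldest first, so samples[n - 1] is the most recent window.
 */
double task_demand(const double *samples, size_t n, enum window_policy policy)
{
	double sum = 0.0, max = 0.0, avg, recent;

	if (n == 0)
		return 0.0;

	for (size_t i = 0; i < n; i++) {
		sum += samples[i];
		if (samples[i] > max)
			max = samples[i];
	}
	avg = sum / (double)n;
	recent = samples[n - 1];

	switch (policy) {
	case POLICY_MAX:
		return max;
	case POLICY_AVG:
		return avg;
	case POLICY_MAX_RECENT_AVG:
		return recent > avg ? recent : avg;
	}
	return 0.0;
}
```

For the windows 8, 13, 3, 5, 10 this gives 13 for max, 7.8 for avg and 10 for max(recent, avg), matching the values on the slide.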

windowing is aligned across all tasks and CPUs
− helps with scheduler-guided frequency, discussed later
− assumes synchronized sched_clock()

load must be normalized
− normalize to both max freq and max IPC across whole topology
− (assume no thermal throttling)

normalized = exec_time * (f_cur / system fmax) * (cpu_ipc / system max IPC)


Example Topology:
− A57s w/2 GHz fmax, A53s w/1 GHz fmax
− A57 IPC is 2x that of A53

Task runs for 10ms on an A53 at 1 GHz

This would be recorded as
10ms * (1 GHz / 2 GHz system fmax) * (1 A53 IPC / 2 system max IPC) = 2.5ms
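The normalization can be expressed in integer arithmetic, as a scheduler would need. A minimal sketch, assuming microsecond times, kHz frequencies and IPC given as small relative integers; the function and parameter names are hypothetical, and all divisors are assumed non-zero.

```c
#include <stdint.h>

/*
 * Scale raw execution time to the fastest CPU in the system:
 * exec_time * (f_cur / system fmax) * (cpu_ipc / system max IPC).
 * Multiplications are done first to avoid losing precision.
 */
uint64_t normalize_exec_time(uint64_t exec_us,
			     uint32_t f_cur_khz, uint32_t sys_fmax_khz,
			     uint32_t cpu_ipc, uint32_t sys_max_ipc)
{
	uint64_t num = exec_us * f_cur_khz * cpu_ipc;
	uint64_t den = (uint64_t)sys_fmax_khz * sys_max_ipc;

	return num / den;
}
```

With the example topology, `normalize_exec_time(10000, 1000000, 2000000, 1, 2)` returns 2500, i.e. the 2.5ms from the slide.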


demand values normalized to
− max freq in system
− max IPC in system

load_scale_factor is used to scale demand back to a CPU

lsf = 1024 * (system fmax / CPU's current fmax) * (system max IPC / CPU IPC)
− CPU fmax may be reduced by thermal throttling


Example with previous topology:
− A57s w/2 GHz fmax, A53s w/1 GHz fmax
− A57 IPC is 2x that of A53

Translate 5ms scaled demand to A53, where the A53 is thermally throttled to 800 MHz
− A53's lsf = 1024 * (2 GHz / 800 MHz) * (2 / 1)
= 1024 * 2.5 * 2
= 5120
− A53 demand = 5ms * 5120 / 1024 = 25ms
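A short sketch of the load_scale_factor calculation, using the lsf formula given earlier. Units (kHz, microseconds) and the function names are assumptions for illustration; only the formula itself comes from the slides.

```c
#include <stdint.h>

/*
 * lsf = 1024 * (system fmax / CPU's current fmax) *
 *              (system max IPC / CPU IPC)
 * cpu_fmax_khz is the CPU's current fmax, which thermal throttling
 * may have lowered. All divisors are assumed non-zero.
 */
uint64_t load_scale_factor(uint32_t sys_fmax_khz, uint32_t cpu_fmax_khz,
			   uint32_t sys_max_ipc, uint32_t cpu_ipc)
{
	uint64_t num = 1024ULL * sys_fmax_khz * sys_max_ipc;
	uint64_t den = (uint64_t)cpu_fmax_khz * cpu_ipc;

	return num / den;
}

/* Translate a system-normalized demand into time on a specific CPU. */
uint64_t scale_demand_to_cpu(uint64_t demand_us, uint64_t lsf)
{
	return demand_us * lsf / 1024;
}
```

For an A53 throttled to 800 MHz in the example topology, the factor evaluates to 1024 * 2.5 * 2 = 5120, so a 5ms scaled demand corresponds to 25ms of execution on that CPU.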


power model

high level comparison with ARM’s EA

no wakeup rate measurement or wakeup energy cost
no tracking of the additional cost of other CPUs speeding up when placing a task
support for per-CPU power numbers changing at runtime

power model

interface

/* One frequency/power pair of a CPU. */
struct cpu_pstate_pwr {
	unsigned int freq;
	uint32_t power;
};

/* Power table of a single CPU. */
struct cpu_pwr_stats {
	int cpu;
	struct cpu_pstate_pwr *table;	/* array of 'len' entries */
	int len;
};

struct cpu_pwr_stats *get_cpu_pwr_stats(void);

(code and license available at https://www.codeaurora.org/cgit/quic/la/kernel/msm-3.10/log/?h=msm-3.10)

power model

CPU 0, CPU 1, CPU 2, CPU 3 (same table shown for each CPU)
CPU frequency   mW/MIPS
600 MHz         10
800 MHz         14
1.0 GHz         19
1.2 GHz         26
1.4 GHz         36
1.6 GHz         51

CPU 4, CPU 5, CPU 6, CPU 7 (same table shown for each CPU)
CPU frequency   mW/MIPS
1.3 GHz         35
1.5 GHz         42
1.7 GHz         52
1.9 GHz         68
2.0 GHz         85
2.1 GHz         106
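A consumer of such tables mostly needs one operation: given a frequency a CPU would have to run at, find the corresponding power value. The sketch below reuses the `cpu_pstate_pwr` layout from the interface slide; the kHz unit, the ascending sort order, the table variable names and the `power_cost()` helper are assumptions for illustration, not part of the QuIC code.

```c
#include <stdint.h>

struct cpu_pstate_pwr {
	unsigned int freq;	/* assumed to be in kHz */
	uint32_t power;		/* mW/MIPS column from the tables above */
};

/* Table shown for CPUs 0-3. */
const struct cpu_pstate_pwr little_table[] = {
	{  600000, 10 }, {  800000, 14 }, { 1000000, 19 },
	{ 1200000, 26 }, { 1400000, 36 }, { 1600000, 51 },
};

/* Table shown for CPUs 4-7. */
const struct cpu_pstate_pwr big_table[] = {
	{ 1300000, 35 }, { 1500000, 42 }, { 1700000, 52 },
	{ 1900000, 68 }, { 2000000, 85 }, { 2100000, 106 },
};

/*
 * Power value of the lowest table entry whose frequency is at least
 * min_freq. The table must be sorted by ascending frequency.
 * Returns UINT32_MAX if the CPU cannot reach min_freq.
 */
uint32_t power_cost(const struct cpu_pstate_pwr *table, int len,
		    unsigned int min_freq)
{
	for (int i = 0; i < len; i++) {
		if (table[i].freq >= min_freq)
			return table[i].power;
	}
	return UINT32_MAX;
}
```

A request that falls between two entries is rounded up to the next supported frequency, so `power_cost(little_table, 6, 1300000)` returns the 1.4 GHz value of 36.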

hmp scheduling

Information available:
− per-task CPU demand (PELT or window-based)
− mW/MIPS for freqs supported by each CPU
− f_cur, f_max, f_max_possible for each CPU
− other sched info such as nr_running

hmp scheduling
definitions

small task: task consumes < sched_small_task % of lowest capacity CPU

big task: task consumes > sched_upmigrate % of a CPU

mostly_idle: CPU is mostly idle if it
− does not have more than mostly_idle_nr_run tasks
− is not more than mostly_idle_load % busy

spill threshold: a CPU has crossed its spill threshold if it
− has more than spill_nr_run runnable tasks
− is more than spill_load % busy
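These definitions translate naturally into predicates. The sketch below is one reading of the slide, not the QuIC source: the struct and function names are hypothetical, mostly_idle is taken to require both conditions, and the spill threshold is taken to be crossed when either condition holds. The slide does not say how the two spill conditions combine, so that choice is an assumption.

```c
#include <stdbool.h>

/* Tunables named after the thresholds on the slide (all in % or counts). */
struct hmp_tunables {
	unsigned int small_task_pct;		/* sched_small_task */
	unsigned int upmigrate_pct;		/* sched_upmigrate */
	unsigned int mostly_idle_nr_run;
	unsigned int mostly_idle_load_pct;
	unsigned int spill_nr_run;
	unsigned int spill_load_pct;
};

/* demand_pct: task demand as % of the lowest-capacity CPU. */
bool is_small_task(unsigned int demand_pct, const struct hmp_tunables *t)
{
	return demand_pct < t->small_task_pct;
}

/* demand_pct: task demand as % of the CPU being considered. */
bool is_big_task(unsigned int demand_pct, const struct hmp_tunables *t)
{
	return demand_pct > t->upmigrate_pct;
}

/* Both conditions must hold (assumed reading of the slide). */
bool cpu_mostly_idle(unsigned int nr_running, unsigned int load_pct,
		     const struct hmp_tunables *t)
{
	return nr_running <= t->mostly_idle_nr_run &&
	       load_pct <= t->mostly_idle_load_pct;
}

/* Either condition is enough (ASSUMPTION, not stated on the slide). */
bool cpu_over_spill(unsigned int nr_running, unsigned int load_pct,
		    const struct hmp_tunables *t)
{
	return nr_running > t->spill_nr_run ||
	       load_pct > t->spill_load_pct;
}
```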

hmp scheduling
wakeup of non-small task

1. the least-loaded CPU
− in the smallest cluster where task will fit
− where placement will not cross spill level
− power cost breaks ties in load
2. the least-loaded mostly idle CPU
− where task will not fit
3. the CPU the task last ran on
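The preference order above could be sketched as follows. Everything here is a simplified illustration under stated assumptions: the `cpu_info` struct, the percentage-based notion of "fit" (`fit_pct`), the load-only spill check and the function name are all hypothetical, and the real selection code considers more state than this.

```c
#include <stdbool.h>
#include <stddef.h>

struct cpu_info {
	int cluster;		/* 0 = smallest cluster, higher = bigger */
	unsigned int load;	/* current load, % of this CPU */
	unsigned int power;	/* power cost of running the task here */
	bool mostly_idle;
};

/*
 * task_load_pct[i] is the load the task would add on CPU i, in % of
 * that CPU. The task "fits" on CPU i if this is at most fit_pct.
 */
int select_cpu_non_small(const struct cpu_info *cpus,
			 const unsigned int *task_load_pct, size_t n,
			 unsigned int fit_pct, unsigned int spill_load_pct,
			 int prev_cpu)
{
	int best = -1;

	/* 1. Least-loaded CPU in the smallest cluster where the task fits
	 *    and placement stays under the spill level; power breaks ties. */
	for (size_t i = 0; i < n; i++) {
		const struct cpu_info *c = &cpus[i];

		if (task_load_pct[i] > fit_pct)
			continue;
		if (c->load + task_load_pct[i] > spill_load_pct)
			continue;
		if (best < 0 || c->cluster < cpus[best].cluster ||
		    (c->cluster == cpus[best].cluster &&
		     (c->load < cpus[best].load ||
		      (c->load == cpus[best].load &&
		       c->power < cpus[best].power))))
			best = (int)i;
	}
	if (best >= 0)
		return best;

	/* 2. Least-loaded mostly idle CPU, even if the task does not fit. */
	for (size_t i = 0; i < n; i++) {
		if (cpus[i].mostly_idle &&
		    (best < 0 || cpus[i].load < cpus[best].load))
			best = (int)i;
	}
	if (best >= 0)
		return best;

	/* 3. Fall back to the CPU the task last ran on. */
	return prev_cpu;
}
```

In this reading, a smallest cluster that is entirely past its spill level is skipped in step 1 and the task goes to the next cluster up.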

hmp scheduling
wakeup of small task

1. the lowest-power CPU, if it is mostly idle but not in a low-power state
2. the first mostly idle CPU in the smallest cluster found that is not in a low-power state
3. the idle CPU in the smallest cluster in the least shallow C-state
4. least busy CPU in the smallest cluster where adding the task won't cross spill threshold
5. most power-efficient CPU outside smallest cluster (likely to be changed)

hmp scheduling
scheduler tick

concern over cpu-bound task placement
active migrate running task if it should be upmigrated
active migrate running non-small task if a lower power idle CPU is available

hmp scheduling
load balancer

in general, preserve policy from wakeups
− allow little->big cluster flow of tasks if little is beyond spill or tasks are big
− allow big->little cluster flow of tasks if big has more tasks than CPUs
− pull tasks when balancing CPU is more power efficient on intra-cluster balance

changes in most lb functions

very different policy than stock load balancer

try to avoid pulling small tasks

move tasks from CPUs w/1 task

scheduler-guided CPU frequency

task migrates in mid-window
each governor sees 50%
neither CPU likely responds correctly

[Figure: one 20ms window (t = 0ms to t = 20ms); the task runs 10ms on cpu0, then 10ms on cpu1]
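The 50%/50% problem, and one way to fix it up, can be shown with per-window busy counters. This is a sketch under assumptions: the slides do not spell out the fix-up mechanism, so moving the task's current-window contribution to the destination CPU on migration is a guess at the idea, and all names here are hypothetical.

```c
#include <stdint.h>

#define WINDOW_US	20000
#define NR_CPUS		2

/* Busy time accounted to each CPU in the current window. */
struct window_stats {
	uint32_t cpu_busy_us[NR_CPUS];
};

/* Plain accounting: time is charged to the CPU it was executed on. */
void account_exec(struct window_stats *w, int cpu, uint32_t us)
{
	w->cpu_busy_us[cpu] += us;
}

/*
 * Fix-up on migration: carry the task's contribution for the current
 * window over to the destination CPU, so that the destination's
 * governor sees the task's full demand at window end.
 */
void migrate_fixup(struct window_stats *w, int src, int dst,
		   uint32_t task_us_this_window)
{
	if (task_us_this_window > w->cpu_busy_us[src])
		task_us_this_window = w->cpu_busy_us[src];
	w->cpu_busy_us[src] -= task_us_this_window;
	w->cpu_busy_us[dst] += task_us_this_window;
}

/* Busy percentage a governor would see for this window. */
unsigned int busy_pct(const struct window_stats *w, int cpu)
{
	return w->cpu_busy_us[cpu] * 100 / WINDOW_US;
}
```

Without the fix-up, a task that runs 10ms on cpu0 and 10ms on cpu1 leaves both CPUs at 50%. With it, cpu1 ends the window at 100% and cpu0 at 0%.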

[Figure: the 20ms window (t = 0ms to t = 20ms) on cpu0 and cpu1 with the task's two 10ms segments, shown twice; the second copy has an additional 10ms block]

[Figure: two windows (t = 0ms, t = 20ms, t = 40ms) on cpu0 and cpu1, with blocks labelled 22ms and 16ms]
− cpu0 speeds up
− cpu1 slows down
− task migrates to cpu1
− cpu1 doesn't speed up until window end
− cpu0 needlessly running fast for entire window

[Figure: two windows (t = 0ms, t = 20ms, t = 40ms) on cpu0 and cpu1, with blocks labelled 22ms and 16ms]
− cpu0 speeds up
− cpu1 slows down
− task migrates to cpu1
− set_task_cpu() check – CPU is now very overprovisioned, notify governor
− set_task_cpu() check – CPU is now very underprovisioned, notify governor
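A check of this kind might compare a CPU's demand after the migration with what it can deliver at its current frequency. The sketch below is hypothetical throughout: the enum, the function name and the two percentage thresholds are invented for illustration, and it assumes demand is normalized to the CPU's fmax as described in the load tracking section.

```c
#include <stdint.h>

enum freq_hint {
	FREQ_OK,
	FREQ_UNDERPROVISIONED,	/* demand well above current capacity */
	FREQ_OVERPROVISIONED,	/* demand well below current capacity */
};

/*
 * demand_us: CPU demand for the window after the task was added or
 *            removed, normalized to the CPU's fmax.
 * hi_pct/lo_pct: HYPOTHETICAL thresholds, in % of current capacity.
 */
enum freq_hint check_after_migration(uint64_t demand_us, uint64_t window_us,
				     uint32_t f_cur_khz, uint32_t f_max_khz,
				     unsigned int hi_pct, unsigned int lo_pct)
{
	/* Work the CPU can complete in one window at its current speed. */
	uint64_t capacity_us = window_us * f_cur_khz / f_max_khz;
	uint64_t util_pct;

	if (capacity_us == 0)
		return FREQ_UNDERPROVISIONED;

	util_pct = demand_us * 100 / capacity_us;
	if (util_pct > hi_pct)
		return FREQ_UNDERPROVISIONED;
	if (util_pct < lo_pct)
		return FREQ_OVERPROVISIONED;
	return FREQ_OK;
}
```

Only the two non-OK results would trigger a governor notification. Choosing the thresholds is the hard part, since notifying too eagerly defeats the purpose of windowed sampling.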

implementation

retain cpufreq governor policy

minimize changes to governor

sched_set_window() API allows governor to set window alignment, size

sched_get_busy() API replaces interactive governor query of cpu idle time
− returns fixed-up CPU demand from last complete window

governor gets notified when scheduler sees big demand increase/decrease
− knowing when to not notify is not easy

next steps

1. load tracking
− need full power+perf PELT vs. window-based analysis
2. power model
− can we combine best of ARM and QuIC solutions?
3. scheduler, load balancer changes
− hard to get right
− likely to be extensive and controversial
− continue development
4. scheduler-guided frequency
− continue design discussions w/Linaro

©2013-2014 Qualcomm Incorporated and/or its subsidiaries.

Qualcomm is a trademark of Qualcomm Incorporated, registered in the United States and other countries. Other products and brand names may be trademarks or registered trademarks of their respective owners.

References to “Qualcomm” may mean Qualcomm Incorporated, or subsidiaries or business units within the Qualcomm corporate structure, as applicable.

Qualcomm Incorporated includes Qualcomm’s licensing business, QTL, and the vast majority of its patent portfolio. Qualcomm Technologies, Inc., a wholly-owned subsidiary of Qualcomm Incorporated, operates, along with its subsidiaries, substantially all of Qualcomm’s engineering, research and development functions, and substantially all of its product and services businesses, including its semiconductor business, QCT.

For more information on Qualcomm, visit us at: www.qualcomm.com & www.qualcomm.com/blog

Thank you

Date post:	18-Nov-2014
Category:	Software
Upload:	linaro