
LCU14 406 A QUICK TAKE ON ENERGY-AWARE SCHEDULING

Date post: 18-Nov-2014
Upload: linaro
Description:
LCU14 406 A QUICK TAKE ON ENERGY-AWARE SCHEDULING

Speaker: Stephen Muckle
Date: September 17, 2014

★ Session Summary ★
Task scheduling on big.LITTLE targets is a known challenge in the community. A commercial-grade solution exists with ARM's Global Task Scheduler, and ARM is developing a second solution to solve this problem in a more generic, upstream-friendly way. The HMP scheduler extensions developed at Qualcomm Innovation Center (QuIC) were created to achieve many of the same benefits being sought in the power-aware scheduler currently in development upstream, along with perhaps some additional ones. This presentation will cover the features, design and status of QuIC's HMP scheduler extensions.

Some areas of interest from the ARM-Linaro perspective:
- Some intro to their target arch/platform.
- Architecture of their software solution (sched, load tracking algorithm, power management from the scheduler, energy model description and use, DT, etc.).
- Pain points from an upstream integration PoV.
- Results, if any.

★ Resources ★
Zerista: http://lcu14.zerista.com/event/member/137774
Google Event: https://plus.google.com/u/0/events/c4b5hqb4jau4b3r79nca8hdlmds
Video: https://www.youtube.com/watch?v=2xb0vOV-E6E&list=UUIVqQKxCyQLJS6xvSmfndLA
Etherpad: http://pad.linaro.org/p/lcu14-406

★ Event Details ★
Linaro Connect USA - #LCU14
September 15-19th, 2014
Hyatt Regency San Francisco Airport

http://www.linaro.org
http://connect.linaro.org
Transcript
Page 1: LCU14 406 A QUICK TAKE ON ENERGY-AWARE SCHEDULING

A QuIC Take on Energy-Aware Scheduling

Steve Muckle

Staff Engineer, Qualcomm Innovation Center, Inc. (QuIC)

Thursday, September 18th 2014

Page 2: What’s the point of this presentation?

QuIC has developed an energy-aware scheduler

Can our work be used?

Describe
− our design
− problems we faced
− next steps

Caveats
− none of this code is upstreamable as-is
− design and implementation are a work in progress

Page 3: overview

intro to Qualcomm Technologies big.LITTLE platforms, scene setting
load tracking
power model
hmp scheduling
scheduler-guided frequency
next steps

Page 4: Qualcomm Technologies big.LITTLE Platforms: MSM8939

2 internally synchronous CPU clusters
independent CPU, cluster power gating

CPU    Core   Frequency range
CPU0   A53    200MHz - 1536MHz
CPU1   A53    200MHz - 1536MHz
CPU2   A53    200MHz - 1536MHz
CPU3   A53    200MHz - 1536MHz
CPU4   A53    200MHz - 998MHz
CPU5   A53    200MHz - 998MHz
CPU6   A53    200MHz - 998MHz
CPU7   A53    200MHz - 998MHz

MSM8939 is a product of Qualcomm Technologies, Inc.

Page 5: Qualcomm Technologies big.LITTLE Platforms: MSM8994

2 internally synchronous CPU clusters
independent CPU, cluster power gating

CPU    Core   Frequency range
CPU0   A53    199MHz - 940MHz
CPU1   A53    199MHz - 940MHz
CPU2   A53    199MHz - 940MHz
CPU3   A53    199MHz - 940MHz
CPU4   A57    200MHz - 921MHz
CPU5   A57    200MHz - 921MHz
CPU6   A57    200MHz - 921MHz
CPU7   A57    200MHz - 921MHz

MSM8994 is a product of Qualcomm Technologies, Inc.

Page 6: Why not GTS?

energy-aware scheduling

concern with clusters reversing roles
− overlap in mW/MIPS curves
− thermal limiting big cluster fmax at runtime

upstreaming

Page 7: load tracking

Page 8: load tracking

per-task CPU utilization tracking critical for EA scheduling
− cannot place a task intelligently without knowing its CPU demand accurately
− what does “accurately” mean

most big.LITTLE and EA scheduling work today uses PELT
− per-entity load tracking
− added by Paul Turner @ Google

Page 9: load tracking - PELT

tracks per-task load via a geometrically-weighted series

was not designed specifically for
− energy-aware task placement
− mobile workloads

with default mainline tuning
− cpu-bound task takes 75ms to ramp from 0% to 80%
− idle task decays from 100% to 10% in 100ms

speeding up the ramp also speeds up the decay

heavy task
− goes to sleep waiting for user input
− wakes up decayed, treated as low-demand task
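The ramp and decay figures above can be approximated with a small model. This is a minimal sketch, not the kernel's fixed-point implementation: it assumes mainline PELT's 32ms half-life (so the per-millisecond decay factor is y = 2^(-1/32)), a task that is either always running or always asleep, and 1ms update steps. The function names are hypothetical.

```c
/* Per-millisecond decay factor, assuming a 32ms half-life: y = 2^(-1/32). */
#define PELT_Y 0.978572062087700

/*
 * Milliseconds of continuous running needed for the tracked load to ramp
 * from 0 to `target` (a fraction below 1.0). Returns -1 if unreachable.
 */
int pelt_ramp_ms(double target)
{
    double load = 0.0;
    int ms = 0;

    if (target >= 1.0)
        return -1;
    while (load < target) {
        /* Decay the history, then add this millisecond's contribution. */
        load = load * PELT_Y + (1.0 - PELT_Y);
        ms++;
    }
    return ms;
}

/*
 * Milliseconds of continuous sleep needed for the tracked load to decay
 * from 1.0 to `target` (a fraction above 0.0). Returns -1 if unreachable.
 */
int pelt_decay_ms(double target)
{
    double load = 1.0;
    int ms = 0;

    if (target <= 0.0)
        return -1;
    while (load > target) {
        load *= PELT_Y;
        ms++;
    }
    return ms;
}
```

Under these assumptions, `pelt_ramp_ms(0.8)` and `pelt_decay_ms(0.1)` land close to the approximate figures quoted on the slide. Because the same factor y drives both loops, shortening the half-life shortens ramp and decay together.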

Page 10: load tracking - PELT

20ms busy, 20ms idle (default tuning)

after 2 20ms bursts of execution, seen as less than 50% demand

Is this what we want?
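The 20ms busy / 20ms idle pattern can be replayed through the same kind of simplified model. This sketch assumes a 32ms half-life, 1ms steps and a task starting from zero load; the function names are hypothetical and the arithmetic is floating point rather than the kernel's fixed point.

```c
/* Assumed per-millisecond decay factor for a 32ms half-life: 2^(-1/32). */
#define PELT_Y 0.978572062087700

/* Advance the tracked load by `ms` milliseconds, running or idle. */
double pelt_advance(double load, int ms, int running)
{
    for (int i = 0; i < ms; i++)
        load = load * PELT_Y + (running ? 1.0 - PELT_Y : 0.0);
    return load;
}

/*
 * Tracked load at the end of the last of `bursts` busy periods, each
 * `busy_ms` long and separated by idle gaps of `idle_ms`.
 */
double pelt_after_bursts(int bursts, int busy_ms, int idle_ms)
{
    double load = 0.0;

    for (int b = 0; b < bursts; b++) {
        if (b > 0)
            load = pelt_advance(load, idle_ms, 0);
        load = pelt_advance(load, busy_ms, 1);
    }
    return load;
}
```

In this model `pelt_after_bursts(2, 20, 20)` stays just under 0.5, even though the task is fully CPU-bound whenever it runs.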

Page 11: load tracking - window based

track task’s N most recent non-empty windows
− N configurable (assume 5)
− window size configurable (assume 20ms)

calculate task demand based on these samples
− different policy options such as avg, max, max(avg, recent)

[Diagram: N previous non-empty windows -> policy -> task demand]

Page 12: load tracking - window based

N previous non-empty windows   policy             task demand
8 13 3 5 10                    max                13
8 13 3 5 10                    avg                7.8
8 13 3 5 10                    max(recent, avg)   10
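The three policies can be sketched as one small function. This is an illustration only, not QuIC's implementation: the enum, the function name and the array layout (oldest window first, most recent last, which matches the example where the most recent sample is 10) are assumptions.

```c
#include <stddef.h>

/* Hypothetical names for the policies listed on the slide. */
enum demand_policy {
    POLICY_MAX,
    POLICY_AVG,
    POLICY_MAX_RECENT_AVG
};

/*
 * Derive task demand from the busy time recorded in the N most recent
 * non-empty windows. windows[0] is the oldest sample and windows[n - 1]
 * the most recent (assumed ordering).
 */
double task_demand(const unsigned int *windows, size_t n,
                   enum demand_policy policy)
{
    unsigned int max = 0, sum = 0;
    double avg, recent;

    if (n == 0)
        return 0.0;

    for (size_t i = 0; i < n; i++) {
        sum += windows[i];
        if (windows[i] > max)
            max = windows[i];
    }
    avg = (double)sum / (double)n;
    recent = (double)windows[n - 1];

    switch (policy) {
    case POLICY_MAX:
        return (double)max;
    case POLICY_AVG:
        return avg;
    case POLICY_MAX_RECENT_AVG:
        return recent > avg ? recent : avg;
    }
    return 0.0;
}
```

Unlike a geometric series, a sleeping task's samples do not decay here, since only non-empty windows are recorded.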

Page 13: load tracking - window based

windowing is aligned across all tasks and CPUs
− helps with scheduler-guided frequency, discussed later
− assumes synchronized sched_clock()

load must be normalized
− normalize to both max freq and max IPC across whole topology
− (assume no thermal throttling)

normalized = exec_time * (f_cur / system fmax) * (cpu_ipc / system max IPC)

Page 14: load tracking - window based

Example Topology:
− A57s w/2GHz fmax, A53s w/1GHz fmax
− A57 IPC is 2x that of A53

Task runs for 10ms on an A53 at 1GHz

This would be recorded as

10ms * (1GHz / 2GHz system fmax) * (1 A53 IPC / 2 system max IPC) = 2.5ms
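The normalization can be written as integer arithmetic. This is a minimal sketch under stated assumptions: the function name, the choice of nanoseconds and kHz, and the representation of IPC as a relative integer are all hypothetical, chosen only to keep the example self-contained.

```c
#include <stdint.h>

/*
 * Scale raw execution time to the fastest CPU in the system:
 *   normalized = exec_time * (f_cur / system fmax) * (cpu_ipc / system max IPC)
 * Frequencies are in kHz; IPC values are relative units (assumed).
 * Multiplications are done before divisions to limit rounding loss.
 */
uint64_t normalize_exec_time(uint64_t exec_ns,
                             uint32_t f_cur_khz, uint32_t sys_fmax_khz,
                             uint32_t cpu_ipc, uint32_t sys_max_ipc)
{
    uint64_t scaled = exec_ns * f_cur_khz / sys_fmax_khz;

    return scaled * cpu_ipc / sys_max_ipc;
}
```

With the example topology, `normalize_exec_time(10000000, 1000000, 2000000, 1, 2)` yields 2500000ns, the 2.5ms shown on the slide.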

Page 15: load tracking - window based

demand values normalized to
− max freq in system
− max IPC in system

load_scale_factor is used to scale demand back to a CPU

lsf = 1024 * (system fmax / CPU's current fmax) * (system max IPC / CPU IPC)

− CPU fmax may be reduced by thermal throttling

Page 16: load tracking - window based

Example with previous topology:
− A57s w/2GHz fmax, A53s w/1GHz fmax
− A57 IPC is 2x that of A53

Translate 5ms scaled demand to A53, where the A53 is thermally throttled to 800MHz
− A53's lsf = 1024 * (2GHz / 800MHz) * (2 / 1)
            = 1024 * 2.5 * 2
            = 5120
− A53 demand = 5ms * 5120 / 1024 = 25ms
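The scale-back step can be sketched the same way. `load_scale_factor` is the name used on the slides, but the signatures, units (kHz, nanoseconds) and the helper `scale_demand_to_cpu` below are assumptions made for illustration, not the actual kernel code.

```c
#include <stdint.h>

/*
 * lsf = 1024 * (system fmax / CPU's current fmax) * (system max IPC / CPU IPC)
 * cpu_fmax_khz is the CPU's current fmax, which thermal throttling may lower.
 */
uint32_t load_scale_factor(uint32_t sys_fmax_khz, uint32_t cpu_fmax_khz,
                           uint32_t sys_max_ipc, uint32_t cpu_ipc)
{
    uint64_t lsf = 1024ULL * sys_fmax_khz / cpu_fmax_khz;

    return (uint32_t)(lsf * sys_max_ipc / cpu_ipc);
}

/* Translate a normalized demand into time on the CPU described by lsf. */
uint64_t scale_demand_to_cpu(uint64_t demand_ns, uint32_t lsf)
{
    return demand_ns * lsf / 1024;
}
```

For the throttled A53 example, the inputs would be a 2000000kHz system fmax, an 800000kHz current fmax, and an IPC ratio of 2 to 1. A factor of 1024 means the CPU is as capable as the fastest CPU in the system; larger values mean the same demand occupies the CPU for longer.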

Page 17: power model

Page 18: power model - high level comparison with ARM’s EA

no wakeup rate measurement, wakeup energy cost
no tracking additional cost of other CPUs speeding up when placing a task
support for per-CPU power numbers changing at runtime

Page 19: power model - interface

/* One row of a CPU's power table: a frequency and its power cost. */
struct cpu_pstate_pwr {
        unsigned int freq;
        uint32_t power;
};

/* Power table for one CPU: `table` points to `len` rows. */
struct cpu_pwr_stats {
        int cpu;
        struct cpu_pstate_pwr *table;
        int len;
};

/* Accessor through which the power tables are obtained. */
struct cpu_pwr_stats *get_cpu_pwr_stats(void);

(code and license available at https://www.codeaurora.org/cgit/quic/la/kernel/msm-3.10/log/?h=msm-3.10)
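A consumer of this interface might look up the power cost for a target frequency roughly as follows. The two structs are copied from the slide; the helper `power_cost_at_freq` is hypothetical, and it assumes each table is sorted by ascending frequency and that `freq` uses the same unit as the table entries, neither of which the slides state.

```c
#include <stddef.h>
#include <stdint.h>

struct cpu_pstate_pwr {
    unsigned int freq;
    uint32_t power;
};

struct cpu_pwr_stats {
    int cpu;
    struct cpu_pstate_pwr *table;
    int len;
};

/*
 * Hypothetical helper: return the power cost of the lowest table entry
 * whose frequency is at least `freq`. If `freq` exceeds every entry, fall
 * back to the highest entry. Returns 0 for a missing or empty table.
 */
uint32_t power_cost_at_freq(const struct cpu_pwr_stats *stats,
                            unsigned int freq)
{
    if (stats == NULL || stats->table == NULL || stats->len <= 0)
        return 0;

    for (int i = 0; i < stats->len; i++) {
        if (stats->table[i].freq >= freq)
            return stats->table[i].power;
    }
    return stats->table[stats->len - 1].power;
}
```

Keeping the table per CPU, rather than per cluster, is what allows the power numbers of an individual CPU to change at runtime.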

Page 20: power model

CPU 0, CPU 1, CPU 2, CPU 3 (identical tables)

CPU frequency   mW/MIPS
600MHz          10
800MHz          14
1.0GHz          19
1.2GHz          26
1.4GHz          36
1.6GHz          51

CPU 4, CPU 5, CPU 6, CPU 7 (identical tables)

CPU frequency   mW/MIPS
1.3GHz          35
1.5GHz          42
1.7GHz          52
1.9GHz          68
2.0GHz          85
2.1GHz          106

Page 21: hmp scheduling

Page 22: hmp scheduling

Information available:
− per-task CPU demand (PELT or window-based)
− mW/MIPS for freqs supported by each CPU
− f_cur, f_max, f_max_possible for each CPU
− other sched info such as nr_running

Page 23: hmp scheduling - definitions

small task: task consumes < sched_small_task % of lowest capacity CPU

big task: task consumes > sched_upmigrate % of a CPU

mostly_idle: CPU is mostly idle if it
− does not have more than mostly_idle_nr_run tasks
− is not more than mostly_idle_load % busy

spill threshold: a CPU has crossed its spill threshold if it
− has more than spill_nr_run runnable tasks
− is more than spill_load % busy
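These definitions map naturally onto a handful of predicates. The tunable names come from the slide; the struct, function names and percentage units are hypothetical. The sketch also assumes that a CPU is mostly idle only when both conditions hold, and that the spill threshold is crossed when either condition holds, which the slide does not spell out.

```c
#include <stdbool.h>

/* Tunables named on the slide, gathered into a hypothetical struct. */
struct hmp_tunables {
    unsigned int sched_small_task;   /* % of the lowest-capacity CPU */
    unsigned int sched_upmigrate;    /* % of a CPU */
    unsigned int mostly_idle_nr_run; /* task count */
    unsigned int mostly_idle_load;   /* % busy */
    unsigned int spill_nr_run;       /* runnable task count */
    unsigned int spill_load;         /* % busy */
};

/* Task demand expressed as a percentage of the lowest-capacity CPU. */
bool is_small_task(const struct hmp_tunables *t, unsigned int pct_of_min_cpu)
{
    return pct_of_min_cpu < t->sched_small_task;
}

/* Task demand expressed as a percentage of the CPU being considered. */
bool is_big_task(const struct hmp_tunables *t, unsigned int pct_of_cpu)
{
    return pct_of_cpu > t->sched_upmigrate;
}

/* Assumed: both conditions must hold for a CPU to count as mostly idle. */
bool cpu_mostly_idle(const struct hmp_tunables *t,
                     unsigned int nr_running, unsigned int load_pct)
{
    return nr_running <= t->mostly_idle_nr_run &&
           load_pct <= t->mostly_idle_load;
}

/* Assumed: either condition alone means the spill threshold is crossed. */
bool spill_threshold_crossed(const struct hmp_tunables *t,
                             unsigned int nr_running, unsigned int load_pct)
{
    return nr_running > t->spill_nr_run || load_pct > t->spill_load;
}
```

Note that a task can be neither small nor big; such tasks take the non-small wakeup path without being forced to a bigger CPU.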

Page 24: hmp scheduling - wakeup of non-small task

1. the least-loaded CPU
− in the smallest cluster where task will fit
− where placement will not cross spill level
− power cost breaks ties in load

2. the least-loaded mostly idle CPU
− where task will not fit

3. the CPU the task last ran on
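Read as an ordered list of fallbacks, this placement could be sketched as below. Everything here is a simplification for illustration: the struct, the function names, the use of percentages, the cluster numbering (0 is the smallest) and the load-only spill check are assumptions, and the real scheduler considers more state than this.

```c
#include <stdbool.h>

/* Hypothetical per-CPU snapshot as seen by the waking task. */
struct cpu_state {
    int cluster;             /* 0 = smallest cluster (assumed ordering) */
    unsigned int load_pct;   /* current load, % of this CPU's capacity */
    unsigned int task_pct;   /* waking task's demand scaled to this CPU */
    unsigned int power_cost; /* power cost of running the task here */
    bool mostly_idle;
};

/* Prefer the smaller cluster, then lower load, then lower power cost. */
static bool better_candidate(const struct cpu_state *a,
                             const struct cpu_state *b)
{
    if (a->cluster != b->cluster)
        return a->cluster < b->cluster;
    if (a->load_pct != b->load_pct)
        return a->load_pct < b->load_pct;
    return a->power_cost < b->power_cost;
}

int select_cpu_non_small(const struct cpu_state *cpus, int n,
                         unsigned int fit_pct, unsigned int spill_pct,
                         int prev_cpu)
{
    int best = -1;

    /* 1. Least-loaded CPU in the smallest cluster where the task fits
     *    and placement does not cross the spill level. */
    for (int i = 0; i < n; i++) {
        const struct cpu_state *c = &cpus[i];

        if (c->task_pct > fit_pct)
            continue;
        if (c->load_pct + c->task_pct > spill_pct)
            continue;
        if (best < 0 || better_candidate(c, &cpus[best]))
            best = i;
    }
    if (best >= 0)
        return best;

    /* 2. Least-loaded mostly idle CPU, even though the task does not fit. */
    for (int i = 0; i < n; i++) {
        if (!cpus[i].mostly_idle)
            continue;
        if (best < 0 || cpus[i].load_pct < cpus[best].load_pct)
            best = i;
    }
    if (best >= 0)
        return best;

    /* 3. The CPU the task last ran on. */
    return prev_cpu;
}
```

The ordering in `better_candidate` reflects the slide: cluster size dominates, load decides within a cluster, and power cost only breaks ties in load.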

Page 25: hmp scheduling - wakeup of small task

1. the lowest-power CPU, if it is mostly idle but not in a low-power state
2. the first mostly idle CPU in the smallest cluster found that is not in a low-power state
3. the idle CPU in the smallest cluster in the shallowest C-state
4. least busy CPU in the smallest cluster where adding the task won't cross spill threshold
5. most power-efficient CPU outside smallest cluster (likely to be changed)

Page 26: hmp scheduling - scheduler tick

concern over cpu-bound task placement

active migrate running task if it should be upmigrated

active migrate running non-small task if a lower power idle CPU is available

Page 27: hmp scheduling - load balancer

in general, preserve policy from wakeups
− allow little->big cluster flow of tasks if little is beyond spill or tasks are big
− allow big->little cluster flow of tasks if big has more tasks than CPUs
− pull tasks when balancing CPU is more power efficient on intra-cluster balance

changes in most lb functions

very different policy than stock load balancer

try to avoid pulling small tasks

move tasks from CPUs w/1 task

Page 28: scheduler-guided CPU frequency

Page 29: scheduler-guided CPU frequency

task migrates in mid-window

each governor sees 50%

neither CPU likely responds correctly

[Diagram: one window from t = 0ms to t = 20ms; the task runs 10ms on cpu0, then 10ms on cpu1]

Page 30: scheduler-guided CPU frequency

[Diagrams: two timelines of cpu0 and cpu1 over one window, t = 0ms to t = 20ms, with 10ms busy segments]

Page 31: scheduler-guided CPU frequency

[Diagram: timelines of cpu0 and cpu1 over two windows, t = 0ms to t = 20ms and t = 20ms to t = 40ms, with busy segments labeled 22ms and 16ms]

cpu0 speeds up
cpu1 slows down
task migrates to cpu1
cpu1 doesn't speed up until window end
cpu0 needlessly running fast for entire window

Page 32: scheduler-guided CPU frequency

[Diagram: timelines of cpu0 and cpu1 over two windows, t = 0ms to t = 20ms and t = 20ms to t = 40ms, with busy segments labeled 22ms and 16ms]

cpu0 speeds up
cpu1 slows down
task migrates to cpu1
set_task_cpu() check: CPU is now very overprovisioned, notify governor
set_task_cpu() check: CPU is now very underprovisioned, notify governor
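The check made at migration time might look something like the following. This is purely a sketch: the enum, the function name and the two percentage thresholds are hypothetical, and the slides do not say how "very" over- or underprovisioned is decided.

```c
#include <stdint.h>

/* Hypothetical result of the provisioning check. */
enum freq_hint {
    FREQ_HINT_NONE,
    FREQ_HINT_RAISE, /* CPU is very underprovisioned */
    FREQ_HINT_LOWER  /* CPU is very overprovisioned */
};

/*
 * Compare a CPU's aggregate demand, after a task has been added or
 * removed, against the capacity it offers at its current frequency.
 * Both values use the same normalized units. The governor is only
 * notified when utilization leaves the [lower_pct, raise_pct] band.
 */
enum freq_hint migration_freq_hint(uint64_t demand, uint64_t capacity,
                                   unsigned int raise_pct,
                                   unsigned int lower_pct)
{
    uint64_t used_pct;

    if (capacity == 0)
        return FREQ_HINT_NONE;

    used_pct = demand * 100 / capacity;
    if (used_pct > raise_pct)
        return FREQ_HINT_RAISE;
    if (used_pct < lower_pct)
        return FREQ_HINT_LOWER;
    return FREQ_HINT_NONE;
}
```

The width of the band is the hard part: too narrow and the governor is notified on every migration, too wide and the CPUs keep running at the wrong frequency until the window ends.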

Page 33: scheduler-guided CPU frequency - implementation

retain cpufreq governor policy

minimize changes to governor

sched_set_window() API allows governor to set window alignment, size

sched_get_busy() API replaces interactive governor query of cpu idle time
− returns fixed-up CPU demand from last complete window

governor gets notified when scheduler sees big demand increase/decrease
− knowing when to not notify is not easy

Page 34: next steps

Page 35: next steps

1. load tracking
− need full power+perf PELT vs. window-based analysis

2. power model
− can we combine best of ARM and QuIC solutions?

3. scheduler, load balancer changes
− hard to get right
− likely to be extensive and controversial
− continue development

4. scheduler-guided frequency
− continue design discussions w/Linaro

Page 36: Thank you

For more information on Qualcomm, visit us at: www.qualcomm.com & www.qualcomm.com/blog

©2013-2014 Qualcomm Incorporated and/or its subsidiaries.

Qualcomm is a trademark of Qualcomm Incorporated, registered in the United States and other countries. Other products and brand names may be trademarks or registered trademarks of their respective owners.

References to “Qualcomm” may mean Qualcomm Incorporated, or subsidiaries or business units within the Qualcomm corporate structure, as applicable.

Qualcomm Incorporated includes Qualcomm’s licensing business, QTL, and the vast majority of its patent portfolio. Qualcomm Technologies, Inc., a wholly-owned subsidiary of Qualcomm Incorporated, operates, along with its subsidiaries, substantially all of Qualcomm’s engineering, research and development functions, and substantially all of its product and services businesses, including its semiconductor business, QCT.

