A QuIC Take on Energy-Aware Scheduling
Steve Muckle
Staff Engineer, Qualcomm Innovation Center, Inc. (QuIC)
Thursday, September 18th 2014
2
QuIC has developed an energy-aware scheduler
Can our work be used?
Describe
− our design
− problems we faced
− next steps
Caveats
− none of this code is upstreamable as-is
− design and implementation are a work in progress
What’s the point of this presentation?
3
intro to Qualcomm Technologies big.LITTLE platforms, scene setting
load tracking
power model
hmp scheduling
scheduler-guided frequency
next steps
overview
4
2 internally synchronous CPU clusters
independent CPU, cluster power gating
MSM8939 is a product of Qualcomm Technologies, Inc.
Qualcomm Technologies big.LITTLE Platforms: MSM8939
CPU    Core   Frequency range
CPU0   A53    200 MHz - 1536 MHz
CPU1   A53    200 MHz - 1536 MHz
CPU2   A53    200 MHz - 1536 MHz
CPU3   A53    200 MHz - 1536 MHz
CPU4   A53    200 MHz - 998 MHz
CPU5   A53    200 MHz - 998 MHz
CPU6   A53    200 MHz - 998 MHz
CPU7   A53    200 MHz - 998 MHz
5
2 internally synchronous CPU clusters
independent CPU, cluster power gating
MSM8994 is a product of Qualcomm Technologies, Inc.
Qualcomm Technologies big.LITTLE Platforms: MSM8994
CPU    Core   Frequency range
CPU0   A53    199 MHz - 940 MHz
CPU1   A53    199 MHz - 940 MHz
CPU2   A53    199 MHz - 940 MHz
CPU3   A53    199 MHz - 940 MHz
CPU4   A57    200 MHz - 921 MHz
CPU5   A57    200 MHz - 921 MHz
CPU6   A57    200 MHz - 921 MHz
CPU7   A57    200 MHz - 921 MHz
6
energy-aware scheduling
concern with clusters reversing roles
− overlap in mW/MIPS curves
− thermal limiting big cluster fmax at runtime
upstreaming
Why not GTS?
7
load tracking
8
per-task CPU utilization tracking critical for EA scheduling
− cannot place a task intelligently without knowing its CPU demand accurately
− what does “accurately” mean?
most big.LITTLE and EA scheduling work today uses PELT
− per-entity load tracking
− added by Paul Turner @ Google
load tracking
9
tracks per-task load via a geometrically-weighted series
was not designed specifically for
− energy-aware task placement
− mobile workloads
with default mainline tuning
− cpu-bound task takes 75ms to ramp from 0% to 80%
− idle task decays from 100% to 10% in 100ms
speeding up increase will speed up decrease
heavy task
− goes to sleep waiting for user input
− wakes up decayed, treated as low-demand task
load tracking - PELT
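The ramp and decay times above follow from the geometric series itself. The sketch below is a simplified floating-point model, not the kernel's integer implementation. It assumes the usual mainline parameters: periods of about 1ms and a decay factor y with y^32 = 0.5. The function names are illustrative only.

```c
#include <assert.h>

/* Per-period decay factor, chosen so that PELT_Y^32 = 0.5 (32-period half-life). */
#define PELT_Y 0.97857206

/* Periods (about 1ms each) for an always-running task to ramp from 0 to target. */
unsigned int periods_to_ramp(double target)
{
    double util = 0.0;
    unsigned int n = 0;

    assert(target > 0.0 && target < 1.0);
    while (util < target) {
        util = util * PELT_Y + (1.0 - PELT_Y); /* one fully busy period */
        n++;
    }
    return n;
}

/* Periods for a sleeping task's tracked load to decay from 1.0 down to target. */
unsigned int periods_to_decay(double target)
{
    double util = 1.0;
    unsigned int n = 0;

    assert(target > 0.0 && target < 1.0);
    while (util > target) {
        util *= PELT_Y; /* one fully idle period */
        n++;
    }
    return n;
}
```

Under these assumptions the model needs 75 periods to reach 80% and a little over 100 periods to fall to 10%. Both rates come from the same factor y, which is why speeding up the increase also speeds up the decrease.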
10
20ms busy, 20ms idle (default tuning)
after 2 20ms bursts of execution,
seen as less than 50% demand
Is this what we want?
load tracking - PELT
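The 20ms busy, 20ms idle pattern can be replayed with the same kind of simplified model. This is a sketch assuming 1ms periods and a decay factor y with y^32 = 0.5; it ignores the kernel's integer arithmetic and sub-period accounting, and the function names are illustrative.

```c
#include <assert.h>

#define PELT_Y 0.97857206 /* per-period decay factor, PELT_Y^32 = 0.5 */

/* Advance the tracked load by n periods that are all busy (1) or all idle (0). */
double pelt_advance(double util, unsigned int n, int busy)
{
    assert(util >= 0.0 && util <= 1.0);
    while (n--)
        util = util * PELT_Y + (busy ? 1.0 - PELT_Y : 0.0);
    return util;
}

/* Load seen right after the second burst of a 20-busy / 20-idle pattern. */
double load_after_two_bursts(void)
{
    double util = 0.0;

    util = pelt_advance(util, 20, 1); /* first 20ms burst */
    util = pelt_advance(util, 20, 0); /* 20ms asleep */
    util = pelt_advance(util, 20, 1); /* second 20ms burst */
    return util;
}
```

In this model the task is tracked at just under 50% at the end of its second burst, even though it ran flat out whenever it was runnable.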
11
track task’s N most recent non-empty windows
− N configurable (assume 5)
− window size configurable (assume 20ms)
calculate task demand based on these samples
− different policy options such as avg, max, max(avg, recent)
load tracking – window based
[Figure: N previous non-empty windows -> policy -> task demand]
12
load tracking – window based
N previous non-empty windows   policy             task demand
8 13 3 5 10                    max                13
8 13 3 5 10                    avg                7.8
8 13 3 5 10                    max(recent, avg)   10
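The three policies can be written as one small function. This is a sketch only: the enum, the function name and the oldest-first array layout are assumptions made for illustration, not the QuIC implementation.

```c
#include <assert.h>
#include <stddef.h>

enum demand_policy { POLICY_MAX, POLICY_AVG, POLICY_MAX_RECENT_AVG };

/*
 * windows[] holds the task's busy time in its n most recent non-empty
 * windows, oldest first, so windows[n - 1] is the most recent sample.
 */
double task_demand(const unsigned int *windows, size_t n, enum demand_policy policy)
{
    unsigned int max = 0, sum = 0;
    double avg, recent;
    size_t i;

    assert(windows != NULL && n > 0);
    for (i = 0; i < n; i++) {
        sum += windows[i];
        if (windows[i] > max)
            max = windows[i];
    }
    avg = (double)sum / (double)n;
    recent = (double)windows[n - 1];

    switch (policy) {
    case POLICY_MAX:
        return (double)max;
    case POLICY_AVG:
        return avg;
    case POLICY_MAX_RECENT_AVG:
        return recent > avg ? recent : avg;
    }
    return avg; /* not reached for valid policies */
}
```

With the samples 8, 13, 3, 5, 10 this gives 13 for max, 7.8 for avg and 10 for max(recent, avg).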
13
windowing is aligned across all tasks and CPUs
− helps with scheduler-guided frequency, discussed later
− assumes synchronized sched_clock()
load must be normalized
− normalize to both max freq and max IPC across whole topology
− (assume no thermal throttling)
normalized = exec_time * (f_cur / system fmax) * (cpu_ipc / system max IPC)
load tracking – window based
14
Example Topology:
− A57s w/2GHz fmax, A53s w/1GHz fmax
− A57 IPC is 2x that of A53
Task runs for 10ms on an A53 at 1GHz
This would be recorded as
10ms * (1GHz / 2GHz system fmax) * (1 A53 IPC / 2 system max IPC) = 2.5ms
load tracking – window based
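The normalization can be sketched in integer arithmetic. The function name, the kHz and microsecond units, and the use of small integers for relative IPC are assumptions made for illustration.

```c
#include <assert.h>
#include <stdint.h>

/*
 * Scale raw execution time to the fastest CPU in the system.
 * Frequencies are in kHz; IPC values are relative units (e.g. A53 = 1, A57 = 2).
 */
uint64_t normalize_exec_time(uint64_t exec_us,
                             uint32_t f_cur_khz, uint32_t sys_fmax_khz,
                             uint32_t cpu_ipc, uint32_t sys_max_ipc)
{
    assert(sys_fmax_khz > 0 && sys_max_ipc > 0);
    /* Multiply first so the integer division happens only once, at the end. */
    return exec_us * f_cur_khz * cpu_ipc / ((uint64_t)sys_fmax_khz * sys_max_ipc);
}
```

For the example topology, `normalize_exec_time(10000, 1000000, 2000000, 1, 2)` returns 2500, i.e. 2.5ms.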
15
demand values normalized to
− max freq in system
− max IPC in system
load_scale_factor is used to scale demand back to a CPU
lsf = 1024 * (system fmax / CPU's current fmax) * (system max IPC / CPU IPC)
− CPU fmax may be reduced by thermal throttling
load tracking – window based
16
Example with previous topology:
− A57s w/2GHz fmax, A53s w/1GHz fmax
− A57 IPC is 2x that of A53
Translate 5ms scaled demand to A53, where the A53 is thermally throttled to 800MHz
− A53's lsf = 1024 * (2GHz / 800MHz) * (2 / 1)
= 1024 * 2.5 * 2
= 5120
− A53 demand = 5ms * 5120 / 1024 = 25ms
load tracking – window based
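The load_scale_factor formula and the back-translation can be sketched the same way. The function names and the kHz and microsecond units are assumptions; only the formula comes from the slides.

```c
#include <assert.h>
#include <stdint.h>

/* load_scale_factor in 1/1024 units; frequencies in kHz, IPC in relative units. */
uint64_t load_scale_factor(uint32_t sys_fmax_khz, uint32_t cpu_fmax_khz,
                           uint32_t sys_max_ipc, uint32_t cpu_ipc)
{
    assert(cpu_fmax_khz > 0 && cpu_ipc > 0);
    return 1024ULL * sys_fmax_khz * sys_max_ipc / ((uint64_t)cpu_fmax_khz * cpu_ipc);
}

/* Translate a normalized demand into the time it would take on a given CPU. */
uint64_t scale_demand_to_cpu(uint64_t demand_us, uint64_t lsf)
{
    return demand_us * lsf / 1024;
}
```

For an A53 capped at 800MHz in a system with a 2GHz fmax and a 2:1 IPC ratio, the two ratios multiply to 5, so the factor is 5120 and a 5ms normalized demand becomes 25ms on that CPU. Without throttling the factor is 4096, which maps a 2.5ms normalized demand back to 10ms on an A53 at 1GHz.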
17
power model
18
no wakeup rate measurement, wakeup energy cost
no tracking additional cost of other CPUs speeding up when placing a task
support for per-CPU power numbers changing at runtime
power model
high level comparison with ARM’s EA
19
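struct cpu_pstate_pwr {
    unsigned int freq;              /* P-state frequency */
    uint32_t power;                 /* power cost at that frequency */
};

struct cpu_pwr_stats {
    int cpu;                        /* CPU this table describes */
    struct cpu_pstate_pwr *table;   /* one entry per P-state */
    int len;                        /* number of entries in table */
};

struct cpu_pwr_stats *get_cpu_pwr_stats(void);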
(code and license available at https://www.codeaurora.org/cgit/quic/la/kernel/msm-3.10/log/?h=msm-3.10)
power model
interface
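A consumer of this interface might look up the cost of running a CPU at a given frequency. The sketch below repeats the two structs so it stands alone. `power_cost_at_freq()` is a hypothetical helper, and it assumes the table is sorted by ascending frequency and that frequencies are in kHz; neither is stated on the slide.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

struct cpu_pstate_pwr {
    unsigned int freq;
    uint32_t power;
};

struct cpu_pwr_stats {
    int cpu;
    struct cpu_pstate_pwr *table;
    int len;
};

/*
 * Hypothetical helper: return the power value of the lowest P-state whose
 * frequency is at least freq, or of the highest P-state if freq exceeds
 * them all. Assumes table[] is sorted by ascending frequency.
 */
uint32_t power_cost_at_freq(const struct cpu_pwr_stats *stats, unsigned int freq)
{
    int i;

    assert(stats != NULL && stats->table != NULL && stats->len > 0);
    for (i = 0; i < stats->len; i++) {
        if (stats->table[i].freq >= freq)
            return stats->table[i].power;
    }
    return stats->table[stats->len - 1].power;
}
```

Rounding up to the next P-state reflects that a CPU cannot run below the frequency the load needs; a caller with different requirements could round differently.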
20
power model
CPU 0, CPU 1, CPU 2, CPU 3 (identical tables)
CPU frequency   mW/MIPS
600 MHz         10
800 MHz         14
1.0 GHz         19
1.2 GHz         26
1.4 GHz         36
1.6 GHz         51
CPU 4, CPU 5, CPU 6, CPU 7 (identical tables)
CPU frequency   mW/MIPS
1.3 GHz         35
1.5 GHz         42
1.7 GHz         52
1.9 GHz         68
2.0 GHz         85
2.1 GHz         106
21
hmp scheduling
22
Information available:
− per-task CPU demand (PELT or window-based)
− mW/MIPS for freqs supported by each CPU
− f_cur, f_max, f_max_possible for each CPU
− other sched info such as nr_running
hmp scheduling
23
small task: task consumes < sched_small_task % of lowest capacity CPU
big task: task consumes > sched_upmigrate % of a CPU
mostly_idle: CPU is mostly idle if it
− does not have more than mostly_idle_nr_run tasks
− is not more than mostly_idle_load % busy
spill threshold: a CPU has crossed its spill threshold if it
− has more than spill_nr_run runnable tasks
− is more than spill_load % busy
hmp scheduling
definitions
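These definitions map onto simple predicates. The sketch below uses the tunable names from the slide; the struct, the function names and the percentage inputs are illustrative. It reads the mostly_idle bullets as "both must hold" and the spill bullets as "either is enough", which is an assumption about how the lists combine.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Tunables named on the slide, grouped in a hypothetical struct. */
struct hmp_tunables {
    unsigned int sched_small_task;   /* % of lowest-capacity CPU */
    unsigned int sched_upmigrate;    /* % of a CPU */
    unsigned int mostly_idle_nr_run;
    unsigned int mostly_idle_load;   /* % busy */
    unsigned int spill_nr_run;
    unsigned int spill_load;         /* % busy */
};

bool is_small_task(const struct hmp_tunables *t, unsigned int pct_of_smallest_cpu)
{
    assert(t != NULL);
    return pct_of_smallest_cpu < t->sched_small_task;
}

bool is_big_task(const struct hmp_tunables *t, unsigned int pct_of_cpu)
{
    assert(t != NULL);
    return pct_of_cpu > t->sched_upmigrate;
}

/* Both conditions must hold for a CPU to count as mostly idle. */
bool is_mostly_idle(const struct hmp_tunables *t, unsigned int nr_running,
                    unsigned int load_pct)
{
    assert(t != NULL);
    return nr_running <= t->mostly_idle_nr_run && load_pct <= t->mostly_idle_load;
}

/* Assumption: either condition alone is enough to cross the spill threshold. */
bool over_spill_threshold(const struct hmp_tunables *t, unsigned int nr_running,
                          unsigned int load_pct)
{
    assert(t != NULL);
    return nr_running > t->spill_nr_run || load_pct > t->spill_load;
}
```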
24
1. the least-loaded CPU
− in the smallest cluster where task will fit
− where placement will not cross spill level
− power cost breaks ties in load
2. the least-loaded mostly idle CPU
− where task will not fit
3. the CPU the task last ran on
hmp scheduling
wakeup of non-small task
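The three-step preference order can be sketched as a pair of scans over a per-CPU snapshot. `struct cpu_view` and its fields are hypothetical stand-ins for runqueue state, and "fits" is modeled per CPU rather than per cluster for brevity.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical per-CPU snapshot; a real scheduler derives these from runqueue state. */
struct cpu_view {
    int cluster_rank;        /* 0 = smallest cluster, larger = bigger cluster */
    unsigned int load;       /* current load, any consistent unit */
    unsigned int power_cost; /* cost of running the task here */
    bool task_fits;
    bool would_spill;        /* placing the task would cross the spill level */
    bool mostly_idle;
};

/* True if CPU a is a better step-1 candidate than CPU b. */
static bool better_fit(const struct cpu_view *a, const struct cpu_view *b)
{
    if (a->cluster_rank != b->cluster_rank)
        return a->cluster_rank < b->cluster_rank;
    if (a->load != b->load)
        return a->load < b->load;
    return a->power_cost < b->power_cost; /* power cost breaks ties in load */
}

int select_cpu_non_small(const struct cpu_view *cpus, int nr_cpus, int prev_cpu)
{
    int i, best = -1;

    assert(cpus != NULL && nr_cpus > 0 && prev_cpu >= 0 && prev_cpu < nr_cpus);

    /* 1. Least-loaded CPU in the smallest cluster where the task fits without spilling. */
    for (i = 0; i < nr_cpus; i++) {
        if (!cpus[i].task_fits || cpus[i].would_spill)
            continue;
        if (best < 0 || better_fit(&cpus[i], &cpus[best]))
            best = i;
    }
    if (best >= 0)
        return best;

    /* 2. Least-loaded mostly idle CPU, even though the task does not fit there. */
    for (i = 0; i < nr_cpus; i++) {
        if (!cpus[i].mostly_idle)
            continue;
        if (best < 0 || cpus[i].load < cpus[best].load)
            best = i;
    }
    if (best >= 0)
        return best;

    /* 3. Fall back to the CPU the task last ran on. */
    return prev_cpu;
}
```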
25
1. the lowest-power CPU, if it is mostly idle but not in a low-power state
2. the first mostly idle CPU in the smallest cluster found that is not in a low-power state
3. the idle CPU in the smallest cluster in the shallowest C-state
4. least busy CPU in the smallest cluster where adding the task won't cross spill threshold
5. most power-efficient CPU outside smallest cluster (likely to be changed)
hmp scheduling
wakeup of small task
26
concern over cpu-bound task placement
active migrate running task if it should be upmigrated
active migrate running non-small task if a lower power idle CPU is available
hmp scheduling
scheduler tick
27
in general, preserve policy from wakeups
− allow little->big cluster flow of tasks if little is beyond spill or tasks are big
− allow big->little cluster flow of tasks if big has more tasks than CPUs
− pull tasks when balancing CPU is more power efficient on intra-cluster balance
changes in most lb functions
very different policy than stock load balancer
try to avoid pulling small tasks
move tasks from CPUs w/1 task
hmp scheduling
load balancer
28
scheduler-guided CPU frequency
29
task migrates in mid-window
each governor sees 50%
neither CPU likely responds correctly
scheduler-guided CPU frequency
[Figure: one 20ms window (t = 0ms to t = 20ms); the task runs 10ms on cpu0, then 10ms on cpu1]
30
scheduler-guided CPU frequency
[Figure: the same 20ms window (t = 0ms to t = 20ms) drawn twice for cpu0 and cpu1 with 10ms busy segments; the second drawing has an additional 10ms segment]
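One way to keep a mid-window migration from splitting the load 50/50 is to move the task's busy time for the current window along with the task. The sketch below shows that idea only. The struct, the function names and the two-CPU setup are hypothetical, and the deck does not spell out its exact fixup rules.

```c
#include <assert.h>
#include <stdint.h>

#define NR_CPUS 2

/* Hypothetical per-CPU busy time accumulated in the current window, in ms. */
struct window_stats {
    uint32_t busy_ms[NR_CPUS];
};

/* Account runtime on a CPU and in the task's own per-window counter. */
void account_runtime(struct window_stats *ws, uint32_t *task_window_ms,
                     int cpu, uint32_t ms)
{
    assert(cpu >= 0 && cpu < NR_CPUS);
    ws->busy_ms[cpu] += ms;
    *task_window_ms += ms;
}

/* On migration, carry the task's current-window busy time to the destination CPU. */
void fixup_on_migrate(struct window_stats *ws, uint32_t task_window_ms,
                      int src_cpu, int dst_cpu)
{
    assert(src_cpu >= 0 && src_cpu < NR_CPUS);
    assert(dst_cpu >= 0 && dst_cpu < NR_CPUS);
    assert(ws->busy_ms[src_cpu] >= task_window_ms);
    ws->busy_ms[src_cpu] -= task_window_ms;
    ws->busy_ms[dst_cpu] += task_window_ms;
}
```

In the 20ms example, a task that runs 10ms on cpu0, migrates, and runs 10ms on cpu1 then shows up as 20ms busy on cpu1 and 0ms on cpu0, so the governor for cpu1 sees the full demand.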
31
scheduler-guided CPU frequency
[Figure: cpu0 and cpu1 timelines with marks at t = 0ms, t = 20ms and t = 40ms; busy segments labeled 22ms and 16ms; annotations: cpu0 speeds up, cpu1 slows down, task migrates to cpu1]
cpu1 doesn't speed up until window end
cpu0 needlessly running fast for entire window
32
scheduler-guided CPU frequency
[Figure: cpu0 and cpu1 timelines with marks at t = 0ms, t = 20ms and t = 40ms; busy segments labeled 22ms and 16ms; annotations: cpu0 speeds up, cpu1 slows down, task migrates to cpu1]
set_task_cpu() check: CPU is now very overprovisioned, notify governor
set_task_cpu() check: CPU is now very underprovisioned, notify governor
33
retain cpufreq governor policy
minimize changes to governor
sched_set_window() API allows governor to set window alignment, size
sched_get_busy() API replaces interactive governor query of cpu idle time
− returns fixed-up CPU demand from last complete window
governor gets notified when scheduler sees big demand increase/decrease
− knowing when to not notify is not easy
scheduler-guided CPU frequency
implementation
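The notification decision could be sketched as a threshold test on how far a CPU's demand has moved from what its current frequency provides. Everything in this block is hypothetical, including the function name and the `notify_pct` threshold; the deck only says that large increases and decreases trigger a notification and that choosing when not to notify is hard.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/*
 * Hypothetical check run after a task migrates: compare the CPU's demand with
 * the capacity of its current frequency and decide whether to wake the governor.
 * demand and capacity_at_f_cur use the same normalized unit (e.g. ms per window).
 * notify_pct is an illustrative threshold, not a tunable from the slides.
 */
bool should_notify_governor(uint32_t demand, uint32_t capacity_at_f_cur,
                            uint32_t notify_pct)
{
    uint64_t margin;

    assert(capacity_at_f_cur > 0 && notify_pct <= 100);
    margin = (uint64_t)capacity_at_f_cur * notify_pct / 100;

    if (demand > capacity_at_f_cur + margin)
        return true;  /* very underprovisioned: CPU should speed up now */
    if (demand + margin < capacity_at_f_cur)
        return true;  /* very overprovisioned: CPU can slow down now */
    return false;     /* small change: wait for the window boundary */
}
```

A wider margin means fewer notifications and slower reaction to migrations; a narrower one reacts faster but risks frequent frequency changes.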
34
next steps
35
1. load tracking
− need full power+perf PELT vs. window-based analysis
2. power model
− can we combine best of ARM and QuIC solutions?
3. scheduler, load balancer changes
− hard to get right
− likely to be extensive and controversial
− continue development
4. scheduler-guided frequency
− continue design discussions w/Linaro
next steps
36
©2013-2014 Qualcomm Incorporated and/or its subsidiaries.
Qualcomm is a trademark of Qualcomm Incorporated, registered in the United States and other countries. Other products and brand names may be
trademarks or registered trademarks of their respective owners.
References to “Qualcomm” may mean Qualcomm Incorporated, or subsidiaries or business units within the Qualcomm corporate structure, as applicable.
Qualcomm Incorporated includes Qualcomm’s licensing business, QTL, and the vast majority of its patent portfolio. Qualcomm Technologies, Inc., a
wholly-owned subsidiary of Qualcomm Incorporated, operates, along with its subsidiaries, substantially all of Qualcomm’s engineering, research and
development functions, and substantially all of its product and services businesses, including its semiconductor business, QCT.
For more information on Qualcomm, visit us at:
www.qualcomm.com & www.qualcomm.com/blog
Thank you