HKG15-107 ACPI Power Management on ARM64 Servers
Presented by: Ashwin Chaugule, Linaro Enterprise Group
Date: 2/9/2015
Overview
● CPU Performance management
○ CPPC (Collaborative Processor Performance Control)
○ PCC (Platform Communication Channel)
○ State of patchwork
○ Next steps
● CPU idle management overview
● Device power management overview
Power Management overview
● Overall goal is to run the system as efficiently as possible, considering both power and performance
● Active power management
○ Minimize power when the system is active and running
● Idle power management
○ Go to the deepest possible idle state with the most power savings while respecting the workload's desired response time
● Limits management
○ Deliver the maximum possible performance within the system constraints
● Servers are plugged in and not backed by batteries
○ Cost of power is significant in TCO
● Server workloads typically have a high dynamic range of CPU utilization
● Bursts of activity depending on time zones, holiday sales, etc.
● Not always running at peak CPU utilization
● Need to be very efficient across the whole range
CPU Performance Management
● CPPC = Collaborative Processor Performance Control
● New method to manage CPU performance
● Defined in ACPI v5.0 and later
● Preferred method for ARM64 servers over PSS
● Richer interface supersedes ~12 ACPI objects and notifications
● Performance requests are made on an abstract, unitless, continuous scale
● Firmware on the remote processor is free to interpret values however it wants
○ Can choose to map the unit to CPU frequency, similar to “P-states”
○ Could be a combination of frequency and other architecture-specific performance knobs
● Handling in firmware avoids the risk of preempting frequency transitions in the kernel
● Also allows for much wider portability
● OS should not assume any specific meaning for the performance scale
● Per-CPU table (CPC) describes each CPU's performance capabilities and controls
● Contents of the table can be registers (h/w, memory mapped or PCC) or static integers
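As an illustration of the point that firmware may map the abstract scale to CPU frequency, here is a minimal sketch of one possible mapping. The function name, the linear interpolation, and the example ranges are all assumptions made for illustration; the spec leaves the interpretation to the platform, and a real platform could fold in other performance knobs.

```python
def perf_to_freq_khz(perf, lowest_perf, highest_perf, min_khz, max_khz):
    """Hypothetical firmware-side mapping from an abstract perf value to kHz.

    Assumes a simple linear relationship between the abstract scale and
    frequency; nothing in the slides requires that.
    """
    if highest_perf <= lowest_perf:
        raise ValueError("highest_perf must exceed lowest_perf")
    # Clamp the request into the range the platform advertises.
    perf = max(lowest_perf, min(perf, highest_perf))
    span = highest_perf - lowest_perf
    return min_khz + (perf - lowest_perf) * (max_khz - min_khz) // span
```

The OS never sees the resulting frequency; it only deals in abstract values, which is what makes the interface portable.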
Alternate method
● PSS = Performance Supported States
○ Discretized table of CPU frequencies
○ Assumes all CPUs have identical P-states
● Requires x86-like mechanisms to write to a register to change CPU frequency
● Processor Throttling Controls
○ PTC, TSS, TPC
○ Throttling states available to the CPU as a percentage of max
● Needs ARM-specific spec updates
CPPC high level flow
● Platform enumerates CPU performance range to the OS
● Highest Performance:
○ Highest performance capability of a CPU
● Nominal Performance:
○ Max sustained perf level
● Lowest Nonlinear Performance:
○ Lowest perf level at which non-linear power savings are achievable. Running below this level could be suboptimal
● Lowest Performance:
○ Lowest perf capability
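The four capability values above can be sketched as a small record with a sanity check. The class and method names are hypothetical, and the ordering Lowest <= Lowest Nonlinear <= Nominal <= Highest is an assumption inferred from the descriptions on this slide, not something the slide states.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class PerfCaps:
    """Hypothetical container for the per-CPU capability values."""
    highest: int
    nominal: int
    lowest_nonlinear: int
    lowest: int

    def is_consistent(self):
        # Assumed ordering of the four levels on the abstract scale.
        return (self.lowest <= self.lowest_nonlinear
                <= self.nominal <= self.highest)

    def in_efficient_range(self, perf):
        # Below lowest_nonlinear, further reductions could be suboptimal.
        return self.lowest_nonlinear <= perf <= self.highest
```

An OS-side consumer might run `is_consistent()` once when parsing the table and refuse to manage a CPU whose values fail the check.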
CPPC high level flow
● OS requests desired performance
● Maximum Performance:
○ Upper bound on desired performance
● Desired Performance:
○ Ideal desired perf level
● Performance Reduction Tolerance:
○ Deviation below Desired Performance at which the platform is allowed to run. If the OS requests Desired perf over a specific Time Window, then this is the average performance to be delivered over the Time Window. Time Window is specified in another register.
● Minimum Performance:
○ Lower bound on desired performance
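The relationship between the request values can be sketched as a clamping step performed before anything is written to the platform. This is only an illustration: the function name and the choice to clamp silently (rather than reject an out-of-range request) are assumptions.

```python
def clamp_request(desired, minimum, maximum, lowest, highest):
    """Hypothetical helper: force a request into a self-consistent shape.

    lowest/highest are the platform's advertised capability bounds;
    minimum/maximum are the OS-chosen bounds on desired performance.
    """
    # Keep the OS bounds inside the platform's advertised range.
    minimum = max(lowest, min(minimum, highest))
    maximum = max(minimum, min(maximum, highest))
    # Desired must sit between Minimum and Maximum.
    desired = max(minimum, min(desired, maximum))
    return minimum, desired, maximum
```

For example, a request above the platform's Highest Performance is pulled down to it, and a Desired value below Minimum is raised to Minimum.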
Other CPPC feedback regs
● Platform may be aware of power budgets and thermal constraints
● It can limit delivered performance by reading instantaneous values of specific sensors or counters
● Provides notification back to OS when limits change
● Reference Performance Counter:
○ Counts at a fixed rate when the processor is active
● Delivered Performance Counter:
○ Counts at the rate of the current performance level, taking Desired into account
● Guaranteed Performance:
○ Sustained performance level deliverable by the platform given current constraints
○ Raises a notification when this level changes
● Performance Limited Register:
○ Has 2 bits defined, set in the event of some constraint (e.g. thermal excursion). They indicate that the platform unexpectedly delivered less than Desired or less than Minimum performance.
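The two feedback counters are typically used together: the ratio of their deltas over an interval, scaled by the performance level at which the reference counter ticks, gives the average delivered performance. The sketch below assumes a `reference_perf` value is available (it is not shown on this slide) and handles counter wraparound for a hypothetical counter width.

```python
def delivered_perf(reference_perf, ref_prev, ref_now, del_prev, del_now,
                   counter_bits=64):
    """Average delivered performance between two counter samples.

    reference_perf is assumed to be the perf level corresponding to the
    fixed rate of the reference counter. Returns None if the reference
    counter did not advance (for example, the CPU was idle throughout).
    """
    mask = (1 << counter_bits) - 1
    # Masked subtraction tolerates a single wraparound of either counter.
    d_ref = (ref_now - ref_prev) & mask
    d_del = (del_now - del_prev) & mask
    if d_ref == 0:
        return None
    return reference_perf * d_del / d_ref
```

A governor could compare this value against its last Desired request to see whether the platform actually delivered what was asked for.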
Per CPU CPPC descriptor
● Each entry of the descriptor is either an integer or a register
● Register could be described as a hardware register, System I/O or PCC register
● PCC registers have the following format:
PCC: Platform Communication Channel
● ACPI v5.0+ defines a mailbox-like mechanism for the OS to communicate with a remote processor (e.g. a BMC) and back
● ACPI table for PCC (PCCT) defines a list of PCC subspaces/channels
● Each subspace entry defines:
○ Shared communication region address
○ Command and status fields for this region
○ Doorbell semantics for the channel
● PCC commands are client specific
○ Clients defined in the current ACPI v5.1 spec include
■ CPPC
■ MPST (Memory node power state table)
■ RAS
● Doorbell protocol defines exclusivity of access to the PCC channel between the OS and the remote processor
● Supports async mode of notification from remote via IRQ
PCC: High level flow
● PCC Reads:
○ Client acquires a PCC channel lock (client specific)
○ Rings doorbell with READ cmd
■ Client waits for command completion
○ Client reads data updated by remote processor in comm space
○ Client releases PCC channel lock
● PCC Writes:
○ Client acquires a PCC channel lock (client specific)
○ Client writes data to comm space
○ Rings doorbell with WRITE cmd
■ Client waits for command completion
○ Client releases PCC channel lock
● If command completion fails, client must retry or assume failure
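The read and write sequences above can be sketched against an in-memory stand-in for a PCC channel. Everything here is hypothetical: `FakePccChannel`, its fields, the command codes, and the retry count are invented for illustration and do not correspond to the Linux mailbox API or to any real firmware.

```python
import threading


class FakePccChannel:
    """In-memory stand-in for one PCC subspace; not a real driver."""
    CMD_READ, CMD_WRITE = 0, 1

    def __init__(self, fail_first=0):
        self.lock = threading.Lock()        # client-specific channel lock
        self.comm_space = {}                # shared communication region
        self._remote_regs = {"desired_perf": 0}
        self._fail_remaining = fail_first   # simulate failed completions

    def ring_doorbell(self, cmd):
        """Simulate the remote processor; return True on completion."""
        if self._fail_remaining > 0:
            self._fail_remaining -= 1
            return False
        if cmd == self.CMD_READ:
            self.comm_space = dict(self._remote_regs)
        else:
            self._remote_regs.update(self.comm_space)
        return True


def pcc_write(chan, name, value, retries=3):
    with chan.lock:                         # acquire channel lock
        chan.comm_space[name] = value       # write data to comm space
        for _ in range(retries):            # ring doorbell, wait, retry
            if chan.ring_doorbell(chan.CMD_WRITE):
                return True
        return False                        # assume failure


def pcc_read(chan, name, retries=3):
    with chan.lock:
        for _ in range(retries):
            if chan.ring_doorbell(chan.CMD_READ):
                return chan.comm_space[name]  # data updated by remote
        raise TimeoutError("PCC command did not complete")
```

The `with chan.lock` block releases the lock on every exit path, which mirrors the "client releases PCC channel lock" step even when the command fails.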
Linux support for CPPC + PCC
● PCC
○ Integrated as mailbox controller
○ Initial patchwork in upstream kernels today (3.19-rcX)
● CPPC
○ CPPC parsing methods abstracted into separate files
○ CPUFreq driver that plugs into existing governors (e.g. ondemand)
■ ondemand ignores CPU freq., which could lead to a suboptimal choice of next freq
■ Patchwork (v4) with CPUFreq integration under review
○ Investigating PID style governor
■ Early patchwork adapted governor from intel_pstate
■ Experiments on ARM64 led to extensive modifications in the way CPU busy is calculated
● Frequency weighted CPU busyness including idle time
● Move busyness math to workqueue
■ intel_pstate PID suboptimally raises next freq request if workload doesn’t cause timer to defer > 30ms
■ Need more experimentation on silicon
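One plausible reading of "frequency weighted CPU busyness including idle time" is sketched below: the busy fraction of a sampling window (with idle time counted in the denominator) is scaled by how fast the CPU was running relative to its maximum. This is a guess made for illustration, not the calculation used in the patchwork, and the function and parameter names are hypothetical.

```python
def weighted_busy_pct(busy_us, idle_us, cur_khz, max_khz):
    """Hypothetical frequency-weighted busyness, in percent.

    A CPU that is 50% busy at half its maximum frequency is reported as
    25% busy relative to what it could do at full speed.
    """
    total_us = busy_us + idle_us
    if total_us <= 0 or max_khz <= 0:
        return 0.0
    return 100.0 * (busy_us / total_us) * (cur_khz / max_khz)
```

A PID-style governor could feed a value like this into its error term instead of raw busy time, so the current frequency is no longer ignored when choosing the next request.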
CPPC + PCC
[Block diagram. Labels: LINUX; Remote Processor; CPUFreq governors; CPPC CPUFreq driver; CPPC driver with inbuilt governor; CPPC lib; PCC driver; CPPC tables; PCCT table; Hardware registers, System I/O; PCC firmware interface; CPU Performance handlers]
CPU idle management overview
● As of current spec (v5.1)
● C states defined for each processor
○ C0 - On
○ C1 - Cn -> ascending order of idleness
● C state object for each processor
● Each object defines attributes for that idle state
● _CSD object for each processor defines C state cross dependency
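To make the role of per-state attributes concrete, here is a minimal sketch of how an idle governor might pick the deepest C state that still meets a latency limit. The `CState` fields and the example numbers in the usage note are assumptions chosen for illustration; real attributes come from the platform's C state objects.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class CState:
    """Hypothetical subset of the attributes of one idle state."""
    name: str
    wake_latency_us: int
    power_mw: int


def pick_idle_state(states, latency_limit_us):
    """Return the deepest state whose wake latency fits the limit.

    states must be ordered from shallowest (C1) to deepest (Cn).
    Returns None if no idle state fits, i.e. stay in C0.
    """
    chosen = None
    for state in states:
        if state.wake_latency_us <= latency_limit_us:
            chosen = state
    return chosen
```

With three hypothetical states C1 (1 us), C2 (50 us), and C3 (500 us), a 100 us limit selects C2: it is the deepest state that can still wake in time.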
CPU idle management overview
● _CST and _CSD don’t scale well to heterogeneous architectures
● Assume same number of power states at each processor
● Can’t express device power state dependencies
● Can’t express power resource dependencies
● No notion of effect on caches at each level of hierarchy
● WIP to address shortcomings in the spec
● Plan to use existing governors + PSCI methods
Device PM overview
● Devices may define Dx states
○ D0 - ON
○ D3 - OFF
○ D1/D2 - possible intermediate states
○ D3hot - Off (like D3) but may remain enumerable with context preserved
● Platform specific details handled inside PSx control methods
○ Called as needed by OSPM as the device transitions through Dx states
● Power Resources handled in PR objects
○ Each PR supports: ON, OFF and STA (status) methods
○ Devices have PRx lists which reference power resources as needed in Dx states
● Two options for device PM:
○ Manage power resources inside PSx. Called on entry to Dx state
○ Declare PR separately with its own ON, OFF
■ Define device dependencies and let OSPM manage ON/OFF
● Should not have to rely on clk/reg framework in Linux
Device PM state transitions
● Device state transitions
1. Device wakeup (due to user request or interrupt)
a) If device depends on a power resource, must turn on all required power resources prior to enabling the device.
2. Keep alive if there are ongoing requests
3. Device inactive (no device requests for some time)
● Power Resources track all dependent devices (multiple devices may share the same power resource)
● Power Resource state transitions
A. All dependent devices are inactive (D3)
B. A dependent device is attempting wakeup
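The power resource transitions A and B amount to tracking which dependent devices are active. The sketch below models that with a set of active devices; the class, its method names, and the device names in the usage note are hypothetical and stand in for OSPM invoking the resource's ON, OFF and STA methods.

```python
class PowerResource:
    """Hypothetical model of a power resource shared by several devices."""

    def __init__(self, name):
        self.name = name
        self.is_on = False
        self._active_devices = set()

    def device_wakeup(self, device):
        # Transition B: a dependent device is attempting wakeup, so the
        # resource must be on before the device is enabled.
        if not self._active_devices:
            self.is_on = True               # stands in for the ON method
        self._active_devices.add(device)

    def device_inactive(self, device):
        self._active_devices.discard(device)
        # Transition A: all dependent devices are inactive (D3).
        if not self._active_devices:
            self.is_on = False              # stands in for the OFF method

    def sta(self):
        return self.is_on                   # stands in for the STA method
```

If two devices share one resource, it stays on after the first goes inactive and turns off only when the second one does.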
Device PM example