Measuring and Managing Power Consumption
Todd Rosedahl
IBM/POWER Firmware Development
• US datacenter energy consumption – 91 billion kWh
• Equivalent to 34 power plants of 500 MW each
• State of Minnesota – 68 billion kWh
• Power Usage Effectiveness
• Energy Star, SERT, Regulations
Motivation
2 11/3/2016 © 2015 OpenPOWER Foundation
• System Stack/Ecosystem
• On Chip Controller (OCC) Overview
• Hardware Block Diagram
• Functional Details
• Measurement
• Future
• References
Agenda
Cloud Software
Operating System / KVM
Standard Operating Environment (System Mgmt)
Power Open Source Software Stack Components
Existing Open Source Software Communities
Firmware
Hardware
New OSS Community
OpenPOWER Technology
OpenPOWER Firmware
July 2014: POWER8 open source firmware stack contributed through GitHub
Toolkits and resources for porting and optimizing, growing repository on website
AMESTER now Open Source
OpenBMC now Open Source
[Diagram labels: P8 HW, Host Boot, SBE, OCC, OPAL, KVM/Linux, OpenBMC, System, AMESTER (ETH)]
Ecosystem Enablement
http://openpowerfoundation.org/blogs/openpower-open-compute-rackspace-barreleye/
OpenBMC has been released: Rackspace Barreleye
• 2 Power CPU Sockets
• 192 HW Threads
• 32 DIMM Slots
• 32 TB DDR3
• 15 Hot swap drives, 3 PCIe slots
• Open Compute Compliant server
Next generation OpenBMC stack
• P9 POWER ready
• Google/Rackspace (Zaius)
• IBM (Witherspoon)
• Modern code stack: D-bus, journaling
• Error logs with part numbers, location codes
• Interface traces and simulation
• OpenPOWER toolkit enabled
• Extendable – add your own features/functions
What is OCC?
• Hardware/Firmware that controls system power, performance & thermals
• 405 processor with 512k dedicated RAM
• General Purpose Engines (GPE) to offload the 405
What does OCC do?
• Reads/controls system power
• Reads/controls chip temperatures
• Enables efficient fan control
• Provides over-temperature (OT) protection
• Power Capping
• Fault Tolerance
• Energy saving
• Performance boost
• Enables sophisticated measurement and profiling
On Chip Controller (OCC) Overview
[Block diagram, physical paths not shown: the OCC runs on the 405 and GPE engines. Labeled connections: register reads/writes to the processor; measures/actuates memory; power measurement and VRMs; fan control actuation; communicates with OPAL through system memory; communicates with the BMC; communicates with other OCCs; loaded and initialized by HostBoot.]
Main Memory
• Initial Load
• Temperature Sensing
• Utilization Measurement
• Bandwidth control
• OPAL communication
• OCC communication
Processor
• Temperature Sensing
• Core Frequency Control – PSTATES
• Chip Voltage Control – PSTATES
• Utilization Measurement
BMC
• Report power/temp
• Provide Power Cap
• DCMI compliance
• Fan Control
Hardware Block Diagram
Power/Thermal Management
Linux Governors (OCC measures/clips for capping/OT)
• Balanced
• Performance
• Power save
OCC Controls Maximum Frequency
• Power Capping
• Over Temperature Protection (Processors/Memory)
• Fault Tolerance (AC feed or power supply loss)
Processor Idle States
• Nap
• Sleep
• Winkle
Power Capping
• DCMI commands set the power limit
• BMC can set N and N+1 system power caps
OCC Voting Box
[Diagram summary]
• Frequency Voting Box (every 250 µs): takes the desired PSTATE from the o/s (Linux), the maximum PSTATE, the thermal control vote, and the power control vote. Lowest Pstate (freq) wins.
• The actual PSTATE leaves the OCC complex for the PMC (hardware that actuates V/F slew), which drives the VRM and the frequency setting.
• Thermal path: core DTS are read every 2 ms; all DTS are averaged to get the core temp, and the hottest core temp is taken. Every 16 ms the current 2 ms hottest core temp is compared to the limit to set the Thermal Control Vote.
• Power path: power is read from the power measurement hardware every 250 µs. Every 250 µs the current power reading is compared to the limits to set the Control Vote.
• BMC: gets temps from the OCC in the poll response for export and for fan control; gets power from the OCC in the poll response for export.
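The voting scheme on this slide can be sketched in a few lines of Python. This is a minimal illustration, not OCC firmware: votes are expressed as frequencies in MHz rather than Pstate numbers, and the names `Votes`, `voting_box`, and `hottest_core_temp` are hypothetical. Only the "lowest wins" rule and the "average the DTS per core, then take the hottest core" rule come from the slide.

```python
from dataclasses import dataclass


@dataclass
class Votes:
    """Inputs to the frequency voting box (hypothetical structure, values in MHz)."""
    os_desired_mhz: int  # desired Pstate from the o/s
    max_mhz: int         # maximum Pstate
    thermal_mhz: int     # thermal control vote (set every 16 ms on the slide)
    power_mhz: int       # power control vote (set every 250 us on the slide)


def voting_box(votes: Votes) -> int:
    """Return the actual frequency: the lowest vote wins."""
    return min(votes.os_desired_mhz, votes.max_mhz,
               votes.thermal_mhz, votes.power_mhz)


def hottest_core_temp(dts_by_core: list[list[float]]) -> float:
    """Average all DTS readings within each core, then take the hottest core."""
    return max(sum(dts) / len(dts) for dts in dts_by_core)


if __name__ == "__main__":
    votes = Votes(os_desired_mhz=3500, max_mhz=4000,
                  thermal_mhz=3800, power_mhz=3200)
    print(voting_box(votes))                          # power vote is the lowest
    print(hottest_core_temp([[60, 62], [70, 74]]))    # second core is hotter
```

Expressing every vote in the same unit keeps the arbitration a single `min`, so adding another voter only means adding another argument.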
Measurement – Out of Band
[Diagram: a Node containing the Host OS, CPU, OCC, power/thermal measurement hardware, and main memory, with the BMC on the out-of-band path]
IB pathway
• Tighter workload coupling
• Performance concerns
• Jitter concerns
OOB pathway
• Measure without affecting system
• Difficult to correlate to system jobs/events
IPMI Sensors
• Inlet Temp
• Proc Core Temps
• Mem Temps
• All Power Rails
DCMI Commands
• Node Power
AMESTER
• IPMI OCC pass-through
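As one way to consume the out-of-band node power reading, the sketch below parses text in the layout printed by `ipmitool dcmi power reading`. The sample text and its values are illustrative assumptions, not captured from a real system, and `parse_power_reading` is a hypothetical helper; check the output of your own BMC before relying on the field names.

```python
import re

# Illustrative sample only: layout modeled on `ipmitool dcmi power reading`,
# values invented for the example.
SAMPLE = """\
    Instantaneous power reading:                   297 Watts
    Minimum during sampling period:                  2 Watts
    Maximum during sampling period:                480 Watts
    Average power reading over sample period:      296 Watts
"""


def parse_power_reading(text: str) -> dict[str, int]:
    """Map each '<label>: <n> Watts' line to an integer number of watts."""
    readings = {}
    for line in text.splitlines():
        match = re.match(r"\s*(.+?):\s+(\d+)\s+Watts", line)
        if match:
            readings[match.group(1)] = int(match.group(2))
    return readings


if __name__ == "__main__":
    for label, watts in parse_power_reading(SAMPLE).items():
        print(f"{label}: {watts} W")
```

On a live system the text would come from running the command against the BMC, which keeps the measurement off the host and so avoids the in-band performance and jitter concerns listed above.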
[Diagram: AMESTER connects over the Internet to the service processor of a POWER8 system, which reaches the OCC and measurement hardware on the P8 chip]
AMESTER Overview
AMESTER – Automated MEasurement for System Temperature and Energy Reporting
Uses management network and service processors
• Does not interfere with operating system or workloads
• Runs without requiring any additional installation on the server
Sensor data collection: Power, Temperature, Performance measurement
• All Power Rails, Core temps, Utilization, IPS, Bandwidth, Frequency, etc.
Continuous Graphing Mode
Trace Buffer mode: output to file for import to Excel/Matlab
• 250 µs tracing into 8KB buffer: 1 sensor logged every 250 µs for 1 second
• 2 ms tracing into 8KB buffer: 1 sensor logged every 2 ms for 8 seconds
Parameter setting: Power/Thermal Trip points, Pin Frequency, etc.
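The trace buffer durations quoted above follow from simple arithmetic, sketched here. The 2-byte sample size is an assumption chosen because it reproduces the quoted figures (roughly 1 second at 250 µs and 8 seconds at 2 ms); the slide does not state the sample width.

```python
BUFFER_BYTES = 8 * 1024   # 8KB trace buffer, from the slide
BYTES_PER_SAMPLE = 2      # assumption: one 16-bit value per logged sample


def trace_duration_s(sample_period_s: float,
                     buffer_bytes: int = BUFFER_BYTES,
                     bytes_per_sample: int = BYTES_PER_SAMPLE) -> float:
    """Seconds of trace that fit in the buffer when one sensor is logged."""
    samples = buffer_bytes // bytes_per_sample
    return samples * sample_period_s


if __name__ == "__main__":
    print(trace_duration_s(250e-6))  # about 1 second
    print(trace_duration_s(2e-3))    # about 8 seconds
```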
Simple Example of Sensor Data
Sensor Description
Sample Sensor list
440 Total Sensors – Power, Thermal, Performance, Utilization
Insights
• Visualization is key to rapid prototyping and problem solving
• Correlation of power consumption with other metrics
• Time-alignment of sensor data is crucial for debugging firmware algorithms
• Examples
  • Measuring settling time of power capping controller after workload changes
  • Developing dynamic voltage/frequency scaling algorithms
  • Discovering small non-steady behavior when steady-state behavior was expected
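The settling-time example can be made concrete with a short sketch. It assumes a power trace sampled at a fixed period (for instance the 250 µs trace mode) and defines settling time as the point after which every sample stays within a tolerance band around the cap. The function name, the band definition, and the sample values are all assumptions for illustration.

```python
from typing import Optional, Sequence


def settling_time_s(samples_w: Sequence[float], target_w: float,
                    band_w: float, period_s: float) -> Optional[float]:
    """Time from the start of the trace until power stays within
    target_w +/- band_w for the rest of the trace.

    Returns None if the trace is empty or its last sample is outside the band.
    """
    last_outside = -1
    for index, power in enumerate(samples_w):
        if abs(power - target_w) > band_w:
            last_outside = index
    if last_outside == len(samples_w) - 1:
        return None
    return (last_outside + 1) * period_s


if __name__ == "__main__":
    # Invented trace: power falling toward a 400 W cap, sampled every 250 us.
    trace = [500, 460, 430, 410, 402, 399, 401]
    print(settling_time_s(trace, target_w=400, band_w=5, period_s=250e-6))
```

Scanning for the last out-of-band sample, rather than the first in-band one, avoids reporting an early crossing that is followed by overshoot.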
Future – P9
Hardware Block Diagram – P9
OCC (On Chip Controller)
• WOF (Workload Optimized Frequency)
Open BMC
• FSI to SBE, not I2C
• Move to RESTful interfaces
Main Memory
• Initial Load
• Temperature Sensing
• Utilization Measurement
• OPAL communication
New Sensors, In-band AMESTER
Processor
• Temperature Sensing
• Core Frequency Control
• Chip Voltage Control
• Utilization Measurement
• 24 cores (Quads)
• Instant on
P9 OCC Complex with Quads/PPEs
• Hypervisor makes V/F requests per core
• Frequency is per Quad (highest Core request)
• IVRMs allow per Quad Voltage
• External Voltage is per Chip
• Temp sensors
• 2 per core
• 1 per cache
• 3 per nest
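The per-Quad frequency rule above (each core requests, the Quad runs at the highest request) can be sketched as follows. The assumption that consecutive groups of four core indices form a Quad, and the name `quad_frequencies`, are illustrative and not taken from the slide.

```python
CORES_PER_QUAD = 4  # each Quad holds four SMT4 cores


def quad_frequencies(core_requests_mhz: list[int]) -> list[int]:
    """Resolve per-core frequency requests to one frequency per Quad.

    Assumes cores are ordered so that each consecutive group of four
    belongs to the same Quad; the highest request in the Quad wins.
    """
    if len(core_requests_mhz) % CORES_PER_QUAD != 0:
        raise ValueError("core count must be a multiple of 4")
    return [
        max(core_requests_mhz[start:start + CORES_PER_QUAD])
        for start in range(0, len(core_requests_mhz), CORES_PER_QUAD)
    ]


if __name__ == "__main__":
    requests = [2800, 3000, 2500, 2900, 3300, 2000, 2000, 2000]
    print(quad_frequencies(requests))  # one value per Quad
```

A consequence worth noting: a single busy core pulls its three Quad neighbors up to its requested frequency, which matters when correlating per-core power with per-core requests.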
[Diagram: P9 chip]
• OCC Complex: OCC (405), two GPEs, PGPE, SGPE, 768KB SRAM, AVSbus interfaces
• Six Quads, each with a DPLL and four SMT4 core chiplets; each pair of core chiplets has a CME, an NCU, an L2, and a 10MB L3
• Twelve 32KB SRAM blocks
• SBE (Self-Boot Engine): 96KB PIB MEM, FastI2C
• Three IO PPEs, each with 64KB SRAM
• PB PPE (Powerbus PPE): 16KB SRAM
Note: also used for NV-link management
Workload Optimized Frequency (WOF)
• Less active workloads can go to higher frequencies
• Lower active core counts also increase top frequency potential
• Actual PSTATE selected depends on Linux governor
• As with Turbo, higher ambients will affect part-to-part determinism
• Guaranteed no worse than existing turbo frequency
• New OCC/OPAL interface
• New Ultra Turbo Frequency point
• Part-to-part determinism achieved by factoring out leakage current
[Plot: axes Frequency and Power, marking Nominal Spec, Turbo Spec, Turbo Freq, and Ultra Turbo Freq]
Graphical Processing Units (GPUs)
What are these things (and why are they such a big deal)?
• Roots in image rendering (what color should that pixel be?)
• One operation on lots of data at the same time
• Think weather simulation
• Now used for AI/Machine learning
• Key for speech/image recognition
Challenges:
• High Power
• How to Maximize CPU/GPU performance
• Data collection
CPU/GPU Interface
Measurement – In Band vs Out of Band
[Diagram: a Node containing the Host OS, CPU, OCC, GPUs, and main memory, with the BMC/FSP on the out-of-band path]
IB pathway
• Tighter workload coupling
• Performance concerns
• Jitter concerns
New for P9
• Detailed power and performance data
• In-Band AMESTER
• GPU sensing and correlation
OOB pathway
• Measure without affecting system performance
• Difficult to correlate to system jobs/events
• Move to REST (Redfish)
• IPMI/DCMI: Inlet Temp, Proc Core Temps, Mem Temps, All Power Rails, Node Power
• CIM (POWERVM): buffers of power/thermal readings
• Pass through: get anything (AMESTER)
Measurement – In Band Sensors
[Diagram: a Node containing the Host OS, CPU, OCC, power/thermal measurement hardware, and main memory, with the BMC/FSP]
• 40+ different types of sensor readings
• Power, performance, utilization
• 400+ total sensors (24 cores scales this up)
• Timestamped with the system timestamp
• Accumulator and update tag support for energy calculations
• Min/max support – and clearing of min/max
• 4KB pushed up every 10ms
• All sensors updated every 100ms
• Read by the o/s, job profilers
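The accumulator and update tag bullets suggest how a profiler could derive energy from two sensor snapshots. The sketch below assumes the accumulator is a running sum of power samples in watts and the update tag counts how many samples were summed; those semantics, and the names `Snapshot`, `average_power_w`, and `energy_j`, are assumptions for illustration and should be checked against the OCC sensor documentation.

```python
from dataclasses import dataclass


@dataclass
class Snapshot:
    """One read of a power sensor (hypothetical layout)."""
    timestamp_s: float  # system timestamp of the read, in seconds
    accumulator: int    # assumed: running sum of power samples, in watts
    update_tag: int     # assumed: number of samples added to the accumulator


def average_power_w(first: Snapshot, second: Snapshot) -> float:
    """Average power between two snapshots: delta accumulator / delta tag."""
    samples = second.update_tag - first.update_tag
    if samples <= 0:
        raise ValueError("second snapshot must contain newer samples")
    return (second.accumulator - first.accumulator) / samples


def energy_j(first: Snapshot, second: Snapshot) -> float:
    """Energy in joules: average power times elapsed time."""
    return average_power_w(first, second) * (second.timestamp_s - first.timestamp_s)


if __name__ == "__main__":
    start = Snapshot(timestamp_s=0.0, accumulator=1_000, update_tag=10)
    end = Snapshot(timestamp_s=1.0, accumulator=401_000, update_tag=4_010)
    print(average_power_w(start, end), energy_j(start, end))
```

Working from deltas means a job profiler only needs to read the sensor at job start and job end, rather than polling every update.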
Summary
Server Power Consumption is a large part of TCO
• Data Center Power consumption is going up
• Power/Performance are linked
• GPUs provide amazing performance benefits, but also power challenges
OpenPOWER enables innovation and custom solutions
• Full Firmware stack and tools are open
• Host Boot, OPAL, OCC, BMC, AMESTER
• Detailed power/performance data available
Open Power Blog link:
http://openpowerfoundation.org/press-releases/occ-firmware-code-is-now-open-source/
GitHub pages:
OpenPOWER project: https://github.com/open-power
OCC: https://github.com/open-power/occ
Building OpenPOWER: https://github.com/open-power/op-build
Open BMC: https://github.com/openbmc/openbmc
AMESTER: https://github.com/open-power/amester
OCC Sensor collection document: https://github.com/open-power/docs/blob/master/occ/OCC_ipmitool_sensors.pdf
References