Empirically Derived Abstractions in Uncore Power Modeling for a Server-Class Processor Chip

International Symposium on Low Power Electronics and Design

Hans Jacobson, Arun Joseph*, Dharmesh Parikh*, Pradip Bose, Alper Buyuktosunoglu

IBM T. J. Watson Research; *IBM Systems & Technology Group


Uncore Power: Overview

• Pre-silicon power modeling has primarily focused on the processor cores. As designs evolved, attention has shifted to the “uncore”.

• Abstractions needed to characterize power-performance trade-offs.

• We examine the challenge of developing practical abstractions in uncore power modeling in an industrial setting.

• We report a systematic methodology for deriving modeling abstractions, with a focus on key uncore elements of the IBM POWER8 processor.

• We show that uncore elements can be modeled using a few activity markers and a small set of microbenchmark stress test cases.


Use-case: Digital Power Proxies

• Without an uncore proxy, the dynamic power management policy conservatively assumes a constant, high power for the uncore.
– This reduces the opportunity to maximize performance/watt at chip level.

• An uncore proxy for POWER8 improves the accuracy of the chip power estimate by 15%.
– Compared to core, L2 and L3 level proxies combined with uncore worst-case power.

• For scenarios where the uncore is largely idle, this could translate to an opportunity to boost the frequency by at least 5%.
– For a given chip power cap, assuming chip-wide DVFS.
– With per-core DVFS control and the ability to shift power across core domains, the boost in performance could be much higher.

• More use-cases: inductive noise trend analysis; correct decisions in the choice of early-stage micro-architectural parameters.
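The headroom argument above can be illustrated with a small sketch. It assumes the chip power estimate is simply the sum of the per-core (core, L2, L3) proxy readings plus an uncore term; all wattages, the power cap, and the function names are hypothetical placeholders, not POWER8 data.

```python
from typing import Sequence


def chip_power_estimate(core_l2_l3_proxies_w: Sequence[float], uncore_w: float) -> float:
    """Chip power estimate: per-core proxy readings plus one uncore term (watts)."""
    return sum(core_l2_l3_proxies_w) + uncore_w


def headroom_w(power_cap_w: float, estimate_w: float) -> float:
    """Power left under the cap; never negative."""
    return max(0.0, power_cap_w - estimate_w)


# Hypothetical numbers for illustration only.
core_l2_l3_proxies_w = [10.0] * 12   # one proxy reading per core
UNCORE_WORST_CASE_W = 40.0           # constant assumed when no uncore proxy exists
uncore_proxy_w = 25.0                # proxy reading when the uncore is largely idle
POWER_CAP_W = 170.0

conservative = headroom_w(
    POWER_CAP_W, chip_power_estimate(core_l2_l3_proxies_w, UNCORE_WORST_CASE_W)
)
with_proxy = headroom_w(
    POWER_CAP_W, chip_power_estimate(core_l2_l3_proxies_w, uncore_proxy_w)
)
print(f"headroom without uncore proxy: {conservative} W")
print(f"headroom with uncore proxy:    {with_proxy} W")
```

How much frequency boost a given headroom buys depends on the chip's DVFS power curve, which the slides do not give, so the sketch stops at watts.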


Reference Power Modeling

• The detailed reference chip power analysis tool chain used at IBM [1].

• Accuracy validated against POWER7+ hardware power.

• The workload-specific power and event counts form the data points we use in generating an uncore abstract power model.

Figure: Reference power modeling methodology (blocks shown: IP Blocks, IP Block Power Abstract Generation, IP Power Abstracts, Contributor Based Cell Power Model Generation, Standard Cell Library, Chip Netlist, Chip RTL, Workloads, RTL Simulator with Core Sim and Uncore Sim, Clock and Data Switching, Chip Level Power Analysis, Power).

[1] Dhanwada, N., et al. 2013. Efficient PVT independent abstraction of large IP blocks for hierarchical power analysis. ICCAD, Nov. 2013.


Reference Power Abstraction

• The reference methodology can be abstracted along several dimensions:

– The RTL simulator could be an early-stage microarchitecture-level pipeline timing model;
– The workload could be a suite of representative loop kernels;
– The switching statistics could be reduced to a smaller subset;
– The circuit-level detailed analysis could be approximated by area or gate count based analytical equations.

• Abstracted power models:
– RTL simulations provide the data: power and high-level event counts.
– Linear regression techniques are applied to the data from RTL simulations.
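The regression step can be sketched as follows. This is a minimal pure-Python ordinary least squares fit, assuming power is modeled as a constant plus a weighted sum of event rates. The function name `fit_linear_power_model` and the sample data are hypothetical; they are not taken from the IBM tool chain.

```python
from typing import List, Sequence


def fit_linear_power_model(event_counts: Sequence[Sequence[float]],
                           power: Sequence[float]) -> List[float]:
    """Least-squares fit of power ~ c0 + sum_i c_i * event_i.

    Returns [c0, c1, ..., ck], one row of event_counts per workload.
    """
    rows = [[1.0, *r] for r in event_counts]      # prepend the constant term
    n = len(rows[0])
    # Normal equations: (X^T X) c = X^T y
    a = [[sum(r[i] * r[j] for r in rows) for j in range(n)] for i in range(n)]
    b = [sum(r[i] * y for r, y in zip(rows, power)) for i in range(n)]
    # Gaussian elimination with partial pivoting
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(a[r][col]))
        if abs(a[pivot][col]) < 1e-12:
            raise ValueError("event counts are linearly dependent; add workloads")
        a[col], a[pivot] = a[pivot], a[col]
        b[col], b[pivot] = b[pivot], b[col]
        for r in range(col + 1, n):
            f = a[r][col] / a[col][col]
            for c in range(col, n):
                a[r][c] -= f * a[col][c]
            b[r] -= f * b[col]
    # Back substitution
    coeffs = [0.0] * n
    for i in reversed(range(n)):
        s = sum(a[i][j] * coeffs[j] for j in range(i + 1, n))
        coeffs[i] = (b[i] - s) / a[i][i]
    return coeffs


# Made-up data points: (reads, writes) per cycle and power in watts,
# generated from power = 2 + 3*reads + 5*writes.
events = [(0.0, 0.0), (1.0, 0.0), (0.0, 1.0), (1.0, 1.0), (0.5, 0.25)]
watts = [2.0, 5.0, 7.0, 10.0, 4.75]
coeffs = fit_linear_power_model(events, watts)
print(coeffs)
```

Each workload contributes one data point, so the number of microbenchmarks must be at least the number of coefficients, and their event mixes must differ enough for the fit to be well conditioned.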


Uncore Power Modeling

• IBM POWER7 L2 and L3 cache uncore elements constitute 20% of power for a TDP workload.
– The uncore further includes large macros that are shared by all chiplets.

• We focus on the path taken by memory requests.
– From the L3 to the chip memory I/O links.

• We propose a seemingly drastic abstraction.
– 4 activity markers: read, write, retry and snoop events.
– A small set of carefully crafted micro-benchmarks.
– Average error: 1.4-2.4%.
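One way to read this abstraction, assuming each unit's model is a linear fit over the four markers (consistent with the linear regression used for the abstracted power models), is the form below. The symbols are illustrative names, not taken from the slides.

```latex
P_{\text{unit}} \approx c_0
  + c_{\text{read}}\,N_{\text{read}}
  + c_{\text{write}}\,N_{\text{write}}
  + c_{\text{retry}}\,N_{\text{retry}}
  + c_{\text{snoop}}\,N_{\text{snoop}}
```

Here each $N$ is the rate of the corresponding event for a workload, $c_0$ captures the activity-independent power, and each remaining $c$ is the fitted power cost per event. A given unit may use only a subset of the markers.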


IBM POWER8

• 12 cores with 8-way simultaneous multi-threading (SMT) per core. Total on-chip L2+L3 cache capacity is 102 MB.

• Fabricated in a 22 nm CMOS SOI process. Die size of 649 mm². 4.2 billion transistors.

• Each core is an aggressive, wide-issue superscalar design, with 16 execution pipelines for massive data crunching.

• Uncore supports a massive 7.6 Tb/s off-chip bandwidth including memory and SMP links, PCIe links, an off-chip coherent accelerator interface, as well as on-chip bus-attached data accelerators.

Figure: POWER8™ chip photomicrograph with superimposed demarcations to indicate regions occupied by cores, L2/L3 caches, chip interconnect, memory controllers, etc.


IBM POWER8 Uncore

• 512 KB private L2 per core and an 8 MB L3 instance per core.

• Memory stack separated into four on-chip memory controller units (MCUs), each partitioned into two sub-controllers, for a total of 8 memory I/O links, with an off-chip Centaur L4 buffer chip per link.

• PowerBus (PB) provides coherent communication support across the cache-memory subsystem.

Figure: Block-diagram view of the chip, with clearly defined “uncore” elements (highlighted in blue). Labels in the diagram: CORE, L2, L3, EX, PBEX, PBI, PBEH, PBEN, PBES, MCU0 to MCU3, A0 to A2, F0 to F1, X0 to X3.


Simulation for Uncore

Table: Characteristics of workloads used for power abstraction

Figure: Uncore simulation environment


Power Bus Ramp

• The Power Bus Ramp (PBIEX)

– Interface between a chiplet and the Power Bus Unit.

– Buffers memory requests initiated by L3 (cache miss/flush) until the Power Bus is available.

• Sends and receives memory requests to and from the Power Bus Unit (PBEH).

– Activity is foremost determined by the read/write bandwidth to and from its associated chiplet.

• Handles coherence requests on the Power Bus.

– Activity is thus further determined by snooping activity resulting from reads/writes originating from other chiplets.

Figure: PBIEX regression statistics and bar graph showing error sources for predicted power for each workload.


Power Bus Unit

• Routing fabric between each chiplet and the memory/network controllers. Performs coherency checks on each memory request received from the chiplets.

• Each L3 cache miss results in the PBEH broadcasting a request to each chiplet to check whether some other L3 contains the requested cache line.

– If not, the request is forwarded to the correct memory controller.

• Communicates coherence requests and responses between chiplets, as well as memory requests to and from the MCU.

Figure: PBEH regression statistics and bar graph showing error sources for predicted power for each workload.


Memory Controller Unit

• Interface between the PBEH and the high speed serial I/O links going to and from memory.

• Each request is assembled into a transmission frame and sent over the link.

• Activity of the MCU is foremost determined by the read and write bandwidth of the PBEH.

• The MCU must also reject a request if it cannot handle more requests due to full buffers.

– Activity is therefore also dependent on the number of such retries.

Figure: MCU regression statistics and bar graph showing error sources for predicted power for each workload.
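The per-unit activity dependencies described for the PBIEX, PBEH and MCU can be sketched as a small table of linear models. The choice of markers per unit follows one reading of the slides (reads and writes everywhere, snoops for the Power Bus units, retries for the MCU); the coefficient values and all names below are hypothetical placeholders, not fitted POWER8 numbers.

```python
from typing import Dict

# Hypothetical coefficients: "idle" is watts at zero activity, the others are
# watts per event-per-cycle. Placeholder values for illustration only.
UNIT_MODELS: Dict[str, Dict[str, float]] = {
    "PBIEX": {"idle": 1.0, "reads": 0.5, "writes": 0.5, "snoops": 0.25},
    "PBEH": {"idle": 4.0, "reads": 1.0, "writes": 1.0, "snoops": 0.5},
    "MCU": {"idle": 2.0, "reads": 0.75, "writes": 0.75, "retries": 0.25},
}


def unit_power(unit: str, activity: Dict[str, float]) -> float:
    """Evaluate one unit's linear model; markers missing from activity count as 0."""
    model = UNIT_MODELS[unit]
    power = model["idle"]
    for marker, coeff in model.items():
        if marker != "idle":
            power += coeff * activity.get(marker, 0.0)
    return power


def uncore_power(activity: Dict[str, float]) -> float:
    """Sum the modeled units for one workload's activity rates."""
    return sum(unit_power(unit, activity) for unit in UNIT_MODELS)


# Made-up activity rates (events per cycle) for one workload.
activity = {"reads": 2.0, "writes": 1.0, "snoops": 4.0, "retries": 0.0}
for unit in UNIT_MODELS:
    print(unit, unit_power(unit, activity))
print("total", uncore_power(activity))
```

Keeping the marker set per unit in data rather than code makes it easy to drop or add a marker when the regression statistics show it does not help.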


Abstract Model Conclusions

• Uncore elements in modern-day processors can be modeled accurately even with seemingly drastic abstractions.

• The models are accurate for the abstraction level at which they are intended to be used.
– < 6% maximum error across the abstract uncore models.
– < 9% power difference observed between the minimum and maximum address and data switching workloads.

• Future work: model refinements to focus on DS events.
– Capture the degree of bit switching on addresses and data that move through the uncore units.
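Average and maximum error figures of this kind can be computed as sketched below. The sketch assumes error is measured as the absolute difference between predicted and reference power, relative to the reference, per workload; the slides do not state the exact definition, and the sample values are made up.

```python
from typing import List, Sequence, Tuple


def percent_errors(predicted: Sequence[float], reference: Sequence[float]) -> List[float]:
    """Absolute percent error of each prediction relative to its reference value."""
    if len(predicted) != len(reference):
        raise ValueError("predicted and reference must have the same length")
    if any(r == 0 for r in reference):
        raise ValueError("reference power must be non-zero")
    return [abs(p - r) / abs(r) * 100.0 for p, r in zip(predicted, reference)]


def summarize_errors(predicted: Sequence[float],
                     reference: Sequence[float]) -> Tuple[float, float]:
    """Return (average, maximum) absolute percent error across workloads."""
    errs = percent_errors(predicted, reference)
    return sum(errs) / len(errs), max(errs)


# Made-up per-workload power values in watts.
reference_w = [10.0, 20.0, 40.0]
predicted_w = [10.5, 19.0, 40.0]
avg_err, max_err = summarize_errors(predicted_w, reference_w)
print(f"average error: {avg_err:.2f}%  maximum error: {max_err:.2f}%")
```

Reporting both numbers matters here: the average describes typical accuracy, while the maximum bounds how far a single workload can be mispredicted.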


Summary & Conclusions

• Understanding uncore power and identifying power reduction opportunities are critical aspects of future power-efficient microprocessor design.

• We present a practical methodology for use in an industrial setting for deriving abstract analytical power models for selected key uncore elements.

• We show that even with very few power event markers and a small set of stress marks, it is possible to develop accurate power models for uncore elements of a modern day chip.

• We quantify the accuracy such models have in providing improved power proxies and predicting worst-case bounds on chip level inductive noise in future technologies.
