directCell
Cell/B.E. tightly coupled via PCI Express
Heiko J Schick – IBM Deutschland R&D GmbH
November 2010
Agenda
Section 1: directCell
Section 2: Building Blocks
Section 3: Summary
Section 4: PCI Express Gen 3
Section 1: directCell
Terminology
An inline accelerator is an accelerator that runs sequentially with the main compute engine.
A core accelerator is a mechanism that accelerates the performance of a single core. A core may run multiple hardware threads as in an SMT implementation.
A chip accelerator is an off-chip mechanism that boosts the performance of the primary compute chip. Graphics accelerators are typically of this type.
A system accelerator is a network-attached appliance that boosts the performance of a primary multinode system. Azul is an example of a system accelerator.
Remote Control
Our goal is to remotely control a chip accelerator via a device driver running on the primary compute chip. The chip accelerator does not run an operating system, but merely a firmware-based bare-metal support library that facilitates the host-based device driver.
Requirements
– Operation (e.g. start and stop acceleration; see the MMIO sketch below)
  • Memory-mapped I/O (e.g. Cell Broadband Engine Architecture)
  • Special instruction
– Interrupts
– Memory
– Compatibility
– Bus / interconnect (e.g. PCI Express, PCI Express endpoint)
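A minimal sketch of the operation requirement from user space, assuming Linux exposes the accelerator's BAR0 as a sysfs resource file; the device path, register offset, and command bits are hypothetical placeholders, and a real host driver would map the BAR in kernel space instead.

    /* Hypothetical host-side start/stop of the accelerator via MMIO. */
    #include <fcntl.h>
    #include <stdint.h>
    #include <stdio.h>
    #include <sys/mman.h>
    #include <unistd.h>

    #define CTRL_REG   0x0  /* hypothetical control register offset */
    #define CTRL_START 0x1  /* hypothetical "start acceleration" command */
    #define CTRL_STOP  0x2  /* hypothetical "stop acceleration" command */

    int main(void)
    {
        /* BAR0 of the accelerator as exposed by Linux. */
        int fd = open("/sys/bus/pci/devices/0000:01:00.0/resource0",
                      O_RDWR | O_SYNC);
        if (fd < 0) { perror("open"); return 1; }

        volatile uint32_t *regs = mmap(NULL, 4096, PROT_READ | PROT_WRITE,
                                       MAP_SHARED, fd, 0);
        if (regs == MAP_FAILED) { perror("mmap"); return 1; }

        regs[CTRL_REG / 4] = CTRL_START;  /* kick off acceleration */
        /* ... poll a status register or wait for an interrupt ... */
        regs[CTRL_REG / 4] = CTRL_STOP;   /* stop it again */

        munmap((void *)regs, 4096);
        close(fd);
        return 0;
    }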
What is tightly coupled?
Distributed systems are state of the art
Tightly coupled: usage as a device rather than a system
– Completely integrated into the host's global address space
– I/O attached
– Commonly referred to as a "hybrid"
– OS-less, controlled by the host

Driven by interactive workloads
– Example: a button is pressed, etc.
Pluggable into existing form factors
Why tightly coupled?
Customers want to purchase applied acceleration
The classic appliance box will be displaced by modular and hybrid approaches

Deployment and serviceability
– A system needs to be installed and administered

Nobody is happy with accelerators that have to be programmed
– Ship working appliance kernels
– Keep the required software involvement minimal
PCI Express Features
Computer expansion card interface format
Replacement for PCI, PCI-X and AGP as the industry standard for PCs (workstations and servers)

Serial interconnect
– Based on differential signals with 4 wires per lane
– Each lane transmits 250 MB/s per direction
– Up to 32 lanes per link provide 8 GB/s per direction (32 × 250 MB/s)
– Low latency
– Memory-mapped I/O (MMIO) and direct memory access (DMA) are key concepts
Cell/B.E. Accelerator via PCI Express
Connect a Cell/B.E. system as a PCI Express device to a host system

The operating system runs only on the host system (e.g. Linux, Windows)

The main application runs on the host system

Compute-intensive tasks run as threads on the SPEs
– Using the same Cell/B.E. programming models as for non-hybrid systems (see the libspe2 sketch below).
– Three-level memory hierarchy instead of two levels.

The Cell/B.E. processor does not run any operating system

MMIO and DMA are used as access methods in both directions
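As a point of reference, a minimal sketch of that unchanged programming model using standard libspe2 calls, which the prototype's host-side user libraries aim to remain compatible with; "spu_kernel.elf" is a hypothetical SPU binary.

    /* Launch one compute task on an SPE with libspe2. */
    #include <libspe2.h>
    #include <stdio.h>

    int main(void)
    {
        spe_program_handle_t *prog = spe_image_open("spu_kernel.elf");
        if (!prog) { perror("spe_image_open"); return 1; }

        spe_context_ptr_t ctx = spe_context_create(0, NULL);
        if (!ctx || spe_program_load(ctx, prog) != 0) return 1;

        /* Run the SPU thread to completion; argp/envp are unused here. */
        unsigned int entry = SPE_DEFAULT_ENTRY;
        if (spe_context_run(ctx, &entry, 0, NULL, NULL, NULL) < 0)
            perror("spe_context_run");

        spe_context_destroy(ctx);
        spe_image_close(prog);
        return 0;
    }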
PCI Express Cabling Products
Cell/B.E. Accelerator System
[Block diagram: standalone Cell/B.E. accelerator system. SPEs (SPU with execution units, MFC with DMA and MMIO registers, Local Store) and the PPE (core, L2) sit on the EIB together with the Cell/B.E. memory and a Southbridge with DMA engine; the operating system, the application main thread, and the SPU threads/tasks all run on the Cell/B.E. system.]
Cell/B.E. Accelerator System
[Block diagram: the same Cell/B.E. system attached through its Southbridge and a PCI Express link to a host processor (core, L2, host memory, Southbridge); an operating system and application main thread appear on the host side as well, while the SPU threads remain on the SPEs.]
Section 2: Building Blocks
Building Block #1: Interconnect
Currently PCI Express support is included in many front office systems, hence most accelerator innovation will take place via PCI Express.
Intel's QPI & PCI Express convergence (Core i5/i7) drives a strong movement to make I/O a native subset of the front-side bus.

PCI Express endpoint (EP) support in modern processors is the only real option for tightly coupled interconnects.
PCI Express has bifurcation support and hot plug support.
Current ECNs (ATS, TLP Hints, Atomic Ops) must be included in those designs!
Building Block #2: Addressing (1)
Integration on the Bus Level
– Host BIOS or firmware maps accelerators via PCI Express BARs:
  • Increase BAR size in EP designs
  • Resizable BAR ECN
– Bus-level integration scales well:
  • 2^64 bytes = 16 exabytes = 16 K petabytes
  • Entire SoC clusters can be mapped into the host (see the BAR inspection sketch below)
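A small sketch of how those firmware-assigned BAR windows look from the host, read via Linux sysfs; the device address is a hypothetical example, and each line of the resource file holds start, end, and flags for one BAR.

    /* Print the size and base of each BAR the accelerator requests. */
    #include <inttypes.h>
    #include <stdio.h>

    int main(void)
    {
        FILE *f = fopen("/sys/bus/pci/devices/0000:01:00.0/resource", "r");
        if (!f) { perror("fopen"); return 1; }

        uint64_t start, end, flags;
        int bar = 0;
        while (fscanf(f, "%" SCNx64 " %" SCNx64 " %" SCNx64,
                      &start, &end, &flags) == 3) {
            if (end != 0)  /* unused BAR entries read as all zeroes */
                printf("BAR%d: %" PRIu64 " KiB at 0x%" PRIx64 "\n",
                       bar, (end - start + 1) >> 10, start);
            bar++;
        }
        fclose(f);
        return 0;
    }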
Building Block #2: Addressing (2)
Inbound address translation
– PIM / POM, IOMMUs, etc.
– Switch-based
– PCIe ATS specification

PCIe Address Translation Services
– Allow EP virtual-to-real address translation for DMA:
  • Application provides a VA pointer to the EP.
  • Host uses the EP VA pointer to program it.

Userspace DMA problem
– Buffers on accelerator and host need to be pinned for async DMA transfers (see the pinned-buffer sketch below).
– Kernel involvement should be minimal.

Linux UIO framework
– HugeTLBfs is needed.

Windows UMDF
– Large pages are needed.
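A minimal sketch of the pinning half of that problem on Linux, assuming the administrator has reserved huge pages; the buffer size and its hand-off to the DMA engine are illustrative only.

    /* Allocate a pinned, huge-page-backed buffer for asynchronous DMA. */
    #define _GNU_SOURCE
    #include <stdio.h>
    #include <string.h>
    #include <sys/mman.h>

    #define BUF_SIZE (16UL << 20)  /* 16 MiB, a multiple of the huge page size */

    int main(void)
    {
        void *buf = mmap(NULL, BUF_SIZE, PROT_READ | PROT_WRITE,
                         MAP_PRIVATE | MAP_ANONYMOUS | MAP_HUGETLB, -1, 0);
        if (buf == MAP_FAILED) { perror("mmap"); return 1; }

        /* Keep the pages resident so the physical addresses handed to the
         * accelerator stay valid for the whole asynchronous transfer. */
        if (mlock(buf, BUF_SIZE) != 0) { perror("mlock"); return 1; }

        memset(buf, 0, BUF_SIZE);  /* touch every page up front */
        /* ... pass the buffer's addresses to the accelerator's DMA engine ... */

        munmap(buf, BUF_SIZE);
        return 0;
    }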
Building Block #3: Run-time Control
Minimal software on the accelerator

The device driver runs on the host system

Include DMA engine(s) on the accelerator

Control mechanisms
– MMIO
  • Can easily be mapped as a VFS -> UIO (see the UIO sketch below).
  • The PCIe core of the accelerator should be able to map the entire MMIO range.
– Special instructions
  • Clumsy to map as a virtual file system.
  • Expose to userspace as a system call or IOCTL.
  • A fixed-length parameter area must be made user accessible.
  • The PCI Express core of the accelerator should be able to dispatch special instructions to every unit in the accelerator.
Include helper registers, scratchpads, doorbells and ring buffers
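A sketch of that MMIO control path through the Linux UIO framework, mapping the accelerator's register range and blocking until it raises an interrupt; /dev/uio0 and the doorbell offset are hypothetical placeholders.

    /* Ring a doorbell over MMIO and wait for the completion interrupt. */
    #include <fcntl.h>
    #include <stdint.h>
    #include <stdio.h>
    #include <sys/mman.h>
    #include <unistd.h>

    #define DOORBELL 0x40  /* hypothetical doorbell register offset */

    int main(void)
    {
        int fd = open("/dev/uio0", O_RDWR);
        if (fd < 0) { perror("open"); return 1; }

        /* UIO exposes the device's memory map 0 at mmap offset 0. */
        volatile uint32_t *mmio = mmap(NULL, 4096, PROT_READ | PROT_WRITE,
                                       MAP_SHARED, fd, 0);
        if (mmio == MAP_FAILED) { perror("mmap"); return 1; }

        mmio[DOORBELL / 4] = 1;  /* notify the accelerator */

        uint32_t irq_count;
        /* A read() on a UIO device blocks until the next interrupt. */
        if (read(fd, &irq_count, sizeof irq_count) == sizeof irq_count)
            printf("interrupt #%u received\n", irq_count);

        munmap((void *)mmio, 4096);
        close(fd);
        return 0;
    }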
directCell Operation
[Block diagram: directCell operation, annotated with numbered steps 1-6. The operating system and the application main thread now run only on the host processor; SPU threads run on the SPEs of the Cell/B.E. system, which the host reaches across the PCI Express link via MMIO and DMA.]
Section 3: Summary
Prototype
Concept validation
– HS21 Intel Xeon blade connected to a QS2x Cell/B.E. blade via PCI Express x4.
– Special firmware on the QS2x Cell/B.E. blade to set the PCI connector as endpoint.
– Microsoft Windows as OS on the HS21 blade.
– Windows device driver, enabling user-space access to the QS2x.

Working and verified
– DMA transfer from and to Cell/B.E. memory from a Windows application.
– DMA transfer from and to Local Store from a Windows application.
– Access to Cell/B.E. MMIO registers.
– Start of SPE thread from Windows (thread context is not preserved).
– SPE DMA to host memory via PCI Express.
– Memory management code.
– User libs on Windows to abstract Cell/B.E. usage (compatible with libspe).
– SPE context save and restore (needed for proper multi-thread execution).
Project Review
Technology study proposed to target new application domains & markets
– Use Cell as an acceleration device.
– All system management done from the host system (GPGPU-like accelerator).

Enables Cell on Wintel platforms
– The Cell/B.E. system has no dependency on the OS.
– Compute-intensive tasks run as threads on SPEs.
– Use MMIO and DMA operations via PCI Express to reach any memory-mapped resources of the Cell/B.E. system from the host, and vice versa.

Exhibits a new runtime model for processors
– Shows that a processor designed for standalone operation can be fully integrated into another host system.
Section 4: PCI Express Gen 3
New Features
Atomic Operations
TLP Processing Hints
TLP Prefix
Resizable BAR
Dynamic Power Allocation
Latency Tolerance Reporting
Multicast
Internal Error Reporting
Alternative Routing-ID Interpretation
Extended Tag Enable Default
Single Root I/O Virtualization
Multi Root I/O Virtualization
Address Translation Services
Thank you very much for your attention.
Atomic Operations
This optional normative ECN defines 3 new PCIe transactions, each of which carries out a specific Atomic Operation (“AtomicOp”) on a target location in Memory Space.
The 3 AtomicOps are
– FetchAdd (Fetch and Add)
– Swap (Unconditional Swap)
– CAS (Compare and Swap)

Direct support for the 3 chosen AtomicOps over PCIe enables easier migration of existing high-performance SMP applications to systems that use PCIe as the interconnect to tightly-coupled accelerators, co-processors, or GP-GPUs.
Source: PCI-SIG, Atomic Operations ECN
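For intuition only: the semantics of the three AtomicOps, expressed here with GCC's __sync builtins on a local memory word. Over PCIe the same read-modify-write would travel as a TLP and complete atomically at the target location, e.g. in host memory.

    /* Local-memory analogues of the three PCIe AtomicOps. */
    #include <stdint.h>
    #include <stdio.h>

    int main(void)
    {
        uint64_t target = 40;

        /* FetchAdd: add a value, return the previous contents. */
        uint64_t before_add = __sync_fetch_and_add(&target, 2);        /* 40 */

        /* Swap: unconditionally exchange the contents. */
        uint64_t before_swap = __sync_lock_test_and_set(&target, 100); /* 42 */

        /* CAS: store only if the current value matches the comparand. */
        uint64_t before_cas =
            __sync_val_compare_and_swap(&target, 100, 7);              /* 100 */

        printf("%llu %llu %llu -> %llu\n",
               (unsigned long long)before_add,
               (unsigned long long)before_swap,
               (unsigned long long)before_cas,
               (unsigned long long)target);  /* prints: 40 42 100 -> 7 */
        return 0;
    }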
TLP Processing Hints
This optional normative ECR defines a mechanism by which a Requester can provide hints on a per transaction basis to facilitate optimized processing of transactions that target Memory Space.
The architected mechanisms may be used to enable association of system processing resources (e.g. caches) with the processing of Requests from specific Functions or enable optimized system specific (e.g. system interconnect and Memory) processing of Requests.
Providing such information enables the Root Complex and Endpoint to optimize handling of Requests by differentiating data likely to be reused soon from bulk flows that could monopolize system resources.
Source: PCI-SIG, Processing Hints ECN
TLP Prefix
Emerging usage model trends indicate a requirement to increase header sizes so they can carry more information than currently defined TLP header sizes accommodate. The TLP Prefix mechanism extends the header size by adding DWORDs to the front of headers that carry additional information.
The TLP Prefix mechanism provides architectural headroom for PCIe headers to grow in the future. Switches and Switch related software can be built that are transparent to the encoding of future End-End TLPs.
The End-End TLP Prefix mechanism defines rules for routing elements to route TLPs containing End-End TLP Prefixes without requiring the routing element logic to explicitly support any specific End-End TLP Prefix encoding(s).
Source: PCI-SIG, TLP Prefix ECN
Resizable BAR
This optional ECN adds a capability for Functions with BARs to report various options for sizes of their memory-mapped resources that will operate properly. It also adds the ability for software to program the size to which the BAR is configured.
The Resizable BAR Capability allows system software to allocate all resources in systems where the total amount of resources requesting allocation plus the amount of installed system memory is larger than the supported address space.
Source: PCI-SIG, Resizable BAR ECN
Dynamic Power Allocation
DPA (Dynamic Power Allocation) extends existing PCIe device power management to provide active (D0) device power management substates for appropriate devices, while comprehending existing PCIe PM Capabilities including PCI-PM and Power Budgeting.
Source: PCI-SIG, Dynamic Power Allocation ECN
Latency Tolerance Reporting
This ECR proposes to add a new mechanism for Endpoints to report their service latency requirements for Memory Reads and Writes to the Root Complex such that central platform resources (such as main memory, RC internal interconnects, snoop resources, and other resources associated with the RC) can be power managed without impacting Endpoint functionality and performance.
Current platform Power Management (PM) policies guesstimate when devices are idle (e.g. using inactivity timers). Guessing wrong can cause performance issues, or even hardware failures. In the worst case, users/admins will disable PM to allow functionality at the cost of increased platform power consumption.
This ECR impacts Endpoint devices, RCs and Switches that choose to implement the new optional feature.
Source: PCI-SIG, Latency Tolerance Reporting ECN
Multicast
This optional normative ECN adds Multicast functionality to PCI Express by means of an Extended Capability structure for applicable Functions in Root Complexes, Switches, and components with Endpoints.
The Capability structure defines how Multicast TLPs are identified and routed. It also provides means for checking and enforcing send permission with Function-level granularity. The ECN identifies Multicast errors and adds an MC Blocked TLP error to AER for reporting those errors.
Multicast allows a single Posted Request TLP sent from a source to be distributed to multiple recipients, resulting in a very high performance gain when applicable.
Source: PCI-SIG, Multicast ECN
Internal Error Reporting
PCI Express (PCIe) defines error signaling and logging mechanisms for errors that occur on a PCIe interface and for errors that occur on behalf of transactions initiated on PCIe. It does not define error signaling and logging mechanisms for errors that occur within a component or are unrelated to a particular PCIe transaction.
This ECN defines optional error signaling and logging mechanisms for all components except PCIe to PCI/PCI-X Bridges (i.e., Switches, Root Complexes, and Endpoints) to report internal errors that are associated with a PCI Express interface. Errors that occur within components but are not associated with PCI Express remain outside the scope of the specification.
Source: PCI-SIG, Internal Error Reporting ECN
Alternative Routing-ID Interpretation
For virtualized and non-virtualized environments, a number of PCI-SIG member companies have requested that the current constraints on number of Functions allowed per multi-Function Device be increased to accommodate the needs of next generation I/O implementations.
This ECR specifies a new method to interpret the Device Number and Function Number fields within Routing IDs, Requester IDs, and Completer IDs, thereby increasing the number of Functions that can be supported by a single Device.
Alternative Routing-ID Interpretation (ARI) enables next generation I/O implementations to support an increased number of concurrent users of a multi-Function device while providing the same level of isolation and controls found in existing implementations.
Source: PCI-SIG, Alternative Routing-ID Interpretation ECN
Extended Tag Enable Default
The change allows a Function to use Extended Tag fields (256 unique tag values) by default; this is done by allowing the Extended Tag Enable control field to be set by default.
The obligatory 32 tags provided by PCIe per Function are not sufficient to meet the throughput requirements of emerging applications. Extended tags allow up to 256 concurrent requests, but this capability is not enabled by default in PCIe.
Source: PCI-SIG, Extended Tag Enable Default ECN
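A sketch of how host software could flip that default, walking the device's configuration-space capability list to the PCI Express capability (ID 0x10) and setting bit 8, Extended Tag Field Enable, in the Device Control register; the device path is a hypothetical example and a little-endian host is assumed for the 16-bit access.

    /* Set Extended Tag Field Enable in the PCIe Device Control register. */
    #include <fcntl.h>
    #include <stdint.h>
    #include <stdio.h>
    #include <unistd.h>

    int main(void)
    {
        int fd = open("/sys/bus/pci/devices/0000:01:00.0/config", O_RDWR);
        if (fd < 0) { perror("open"); return 1; }

        uint8_t pos = 0;
        pread(fd, &pos, 1, 0x34);            /* capability list pointer */
        while (pos != 0) {
            uint8_t id = 0, next = 0;
            pread(fd, &id, 1, pos);
            pread(fd, &next, 1, pos + 1);
            if (id == 0x10) {                /* PCI Express capability */
                uint16_t devctl;
                pread(fd, &devctl, 2, pos + 8);   /* Device Control */
                devctl |= 1 << 8;                 /* Extended Tag Field Enable */
                pwrite(fd, &devctl, 2, pos + 8);
                break;
            }
            pos = next;
        }
        close(fd);
        return 0;
    }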
Single Root I/O Virtualization
The specification is focused on single root topologies; e.g., a single computer that supports virtualization technology.
Within the industry, significant effort has been expended to increase the effective hardware resource utilization (i.e., application execution) through the use of virtualization technology.
The Single Root I/O Virtualization and Sharing Specification (SR-IOV) defines extensions to the PCI Express (PCIe) specification suite to enable multiple System Images (SI) to share PCI hardware resources.
Source: PCI-SIG, Single Root I/O Virtualization Specification
Multi Root I/O Virtualization
The specification is focused on multi-root topologies; e.g., a server blade enclosure that uses a PCI Express® Switch-based topology to connect server blades to PCI Express Devices or PCI Express-to-PCI Bridges and enable the leaf Devices to be serially or simultaneously shared by one or more System Images (SI).
Unlike the Single Root IOV environment, independent SI may execute on disparate processing components such as independent server blades.
The Multi-Root I/O Virtualization (MR-IOV) specification defines extensions to the PCI Express (PCIe) specification suite to enable multiple non-coherent Root Complexes (RCs) to share PCI hardware resources.
Source: PCI-SIG, Multi Root I/O Virtualization Specification
Address Translation Services
This specification describes the extensions required to allow PCI Express Devices to interact with an address translation agent (TA) in or above a Root Complex (RC) to enable translations of DMA addresses to be cached in the Device.
The purpose of having an Address Translation Cache (ATC) in a Device is to minimize latency and to provide a scalable distributed caching solution that will improve I/O performance while alleviating TA resource pressure.
Source: PCI-SIG, Address Translation Services Specification
Disclaimer
IBM®, DB2®, MVS/ESA, AIX®, S/390®, AS/400®, OS/390®, OS/400®, iSeries, pSeries, xSeries, zSeries, z/OS, AFP, Intelligent Miner, WebSphere®, Netfinity®, Tivoli®, Informix and Informix® Dynamic Server™, IBM, BladeCenter and POWER and others are trademarks of the IBM Corporation in the US and/or other countries.
Cell Broadband Engine is a trademark of Sony Computer Entertainment, Inc. in the United States, other countries, or both and is used under license therefrom. Linux is a trademark of Linus Torvalds in the United States, other countries or both.
Other company, product, or service names may be trademarks or service marks of others. The information and materials are provided on an "as is" basis and are subject to change.