Join the Conversation #OpenPOWERSummit
Introduction to the OpenCAPI Interface
Brian Allison, STSM
OpenCAPI Technology and Enablement
Industry Collaboration and Innovation
Topics
OpenCAPI Protocol Stack
OpenCAPI Reference Design Overview
OpenCAPI TLx Reference Code
OpenCAPI TLx-AFU Interface and Snippets
OpenCAPI Reference Design Cards
OpenCAPI Reference AFUs
OpenCAPI Performance
OpenCAPI Roadmap
Transaction Layer (TL) specifies the control and response packets between a host and an endpoint OpenCAPI device
TL: On the host side, the transaction layer converts:
Host specific protocol requests into transaction layer defined commands (downbound)
TLx commands into host specific protocol requests (upbound)
Responses to Endpoint initiated commands
Data Link layer (DL on the host, DLX on the device) supports a 25.78125 Gbps serial data rate per lane connecting a processor to an accelerator device
TLx: On the endpoint OpenCAPI device, the transaction layer converts:
AFU protocol requests into transaction layer commands
TL commands into AFU protocol requests.
Responses to host initiated commands
[Figure: OpenCAPI Protocol Stack. Host processor side: host bus protocol layer, TL, TL Frame/Parser, DL, PHY. OpenCAPI device side: PHYX, DLX, TLX Frame/Parser, TLX, AFU protocol layer, AFU. Interface labels, from host to device: host fabric bus, host bus interface, OpenCAPI packets, DL packet (format), DL packet, serial link, DLX packet, DLX packet (format), AFU packets, AFU protocol stack interface.]
[Figure: OpenCAPI Protocol Stack. Host CPU, TL, DL, PHY, PHYX, DLX, TLX, AFU. config_write and config_read go to CFG; all other TL commands go to the AFU.]
OpenCAPI FPGA Reference Design Overview
Xilinx based Verilog design for Ultrascale+ FPGAs
Contains PHY configuration (PHYx), Transaction Layer (TLx), Data Link Layer (DLx) and Config Core (CFG)
25.78125 GT/s x8 link per PHYx/TLx/DLx/CFG using Xilinx GTY PHY
Tightly integrated with the DLx logic
400MHz @ 64B dataflow (P9 Nest runs 1600MHz @ 16B)
Vivado toolchain flow with tcl script project creation
Currently using internal GitHub for project repository
Made available to NDA customers
Will be available to OpenCAPI consortium members via consortium GitHub
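The dataflow and link figures on this slide can be cross-checked with a little arithmetic. The sketch below is only an illustration: the function names are made up here and are not part of the reference design, and the link figure is the raw signalling rate before any encoding or protocol overhead.

```python
# Raw-rate arithmetic for the numbers quoted on this slide.
# Function names are illustrative, not taken from the reference design.

def datapath_gbytes_per_s(clock_mhz: float, width_bytes: int) -> float:
    """Bytes moved per second by a parallel datapath, in GB/s (10^9 bytes)."""
    return clock_mhz * 1e6 * width_bytes / 1e9


def link_raw_gbytes_per_s(gbps_per_lane: float, lanes: int) -> float:
    """Raw serial link rate in GB/s, ignoring encoding and protocol overhead."""
    return gbps_per_lane * lanes / 8


fpga = datapath_gbytes_per_s(400, 64)      # FPGA dataflow: 400MHz @ 64B
nest = datapath_gbytes_per_s(1600, 16)     # P9 Nest: 1600MHz @ 16B
link = link_raw_gbytes_per_s(25.78125, 8)  # x8 link at 25.78125 Gbps per lane

print(f"FPGA dataflow: {fpga:.2f} GB/s")
print(f"P9 Nest:       {nest:.2f} GB/s")
print(f"x8 link (raw): {link:.2f} GB/s")
```

Under these assumptions the 64B FPGA dataflow and the 16B Nest dataflow carry the same number of bytes per second, and both sit close to the raw rate of the x8 link.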
OpenCAPI TLx Reference Code
Config accesses hidden from AFU and sent directly to CFG Core by the TLx
Don’t want the AFU to have the complexity nor the ability to “brick” a card
TLx interfaces to AFU via low level Transaction Layer Protocol (think parallel interface(s))
Interface specification defined in the TLX 3.0 Reference Design Specification
TLx Parser receives OpenCAPI Host initiated TL Architected packets and decodes them
Separate Command & Response Interface for the separate Virtual Channels
• Can send 1 command per cycle on each interface to the AFU
Separate Data interface – Commands not presented to AFU until data on the link is received for that command
TLx Framer receives AFU commands and responses and packetizes them into efficient OpenCAPI TLx Architected packets to send to the host
Separate Command & Response Interface for separate Virtual Channels
Can receive 1 command per cycle on each interface
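Two parser behaviours described above, config accesses going to the CFG core and commands being held until their data has arrived, can be modelled roughly as follows. This is a behavioural sketch in Python, not the reference Verilog. The class name, the use of opcode names as plain strings, strict in-order presentation, and matching each data beat to the oldest command still missing data are all assumptions made for illustration.

```python
from collections import deque

# Opcode names are used as plain strings for illustration only; the real
# encodings are defined in the OpenCAPI 3.0 TL Specification.
CONFIG_OPS = {"config_read", "config_write"}


class TlxParserModel:
    """Hypothetical behavioural model of the TLx parser's routing."""

    def __init__(self):
        self.to_cfg = []         # commands presented to the CFG core
        self.to_afu = []         # commands presented to the AFU
        self._pending = deque()  # entries of [opcode, data beats still missing]

    def receive_command(self, opcode, data_beats=0):
        self._pending.append([opcode, data_beats])
        self._drain()

    def receive_data_beat(self):
        # Assumption: a data beat belongs to the oldest command missing data.
        for entry in self._pending:
            if entry[1] > 0:
                entry[1] -= 1
                break
        self._drain()

    def _drain(self):
        # Present commands in arrival order, stopping at one still missing data.
        while self._pending and self._pending[0][1] == 0:
            opcode, _ = self._pending.popleft()
            target = self.to_cfg if opcode in CONFIG_OPS else self.to_afu
            target.append(opcode)


parser = TlxParserModel()
parser.receive_command("config_read")              # goes to CFG, never the AFU
parser.receive_command("write_mem", data_beats=2)  # held until data arrives
parser.receive_data_beat()
parser.receive_data_beat()                         # now presented to the AFU
print(parser.to_cfg, parser.to_afu)
```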
OpenCAPI TLx – AFU Interface
Individual TL packet contents are driven or received by the AFU
TLx only parses and packs the contents from/to the link packets into interface fields
No knowledge of location of fields within packets is necessary by the AFU
No knowledge of template usage is necessary by the AFU
TLx has no intelligent logic for architected sequences of flows –
• AFU must perform the proper sequences and follow the architecture
Credit based interface to the AFU
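A credit based interface generally works by letting the sender transmit only while it holds credits, and having the receiver return a credit each time it frees a buffer slot. The sketch below shows that general idea from the sender's side. The class and method names are hypothetical and the initial credit count is arbitrary; the actual credit signals are defined in the TLX 3.0 Reference Design Specification.

```python
class CreditedSender:
    """Sender-side view of a generic credit-based interface (illustrative)."""

    def __init__(self, initial_credits: int):
        self.credits = initial_credits

    def can_send(self) -> bool:
        return self.credits > 0

    def send(self) -> None:
        # Each command or response sent consumes one credit.
        if self.credits == 0:
            raise RuntimeError("no credits available; sender must wait")
        self.credits -= 1

    def credit_returned(self, count: int = 1) -> None:
        # The receiver hands credits back as it frees buffer space.
        self.credits += count


sender = CreditedSender(initial_credits=2)
sender.send()
sender.send()
print(sender.can_send())   # the sender has used both credits
sender.credit_returned()
print(sender.can_send())   # one credit is available again
```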
Host to Accelerator Command Snippet
Signal Name | Bits | Source | Description
tlx_afu_cmd_valid | 1 | TLX | Command Valid. The remaining signals in this table are valid coincident with the assertion of tlx_afu_cmd_valid.
tlx_afu_cmd_opcode | 8 | TLX | Command Opcode. Note: Please see OpenCAPI 3.0 TL Specification for valid opcodes
tlx_afu_cmd_capptag | 16 | TLX | Unique handle specifying the host CAPP and command instance. Provided by the CAPP requesting command services of the TL.
tlx_afu_cmd_pa | 64 | TLX | Physical Address
tlx_afu_cmd_dl | 2 | TLX | Command Data Length. Encodings: 2b’00 Reserved, 2b’01 64 Bytes, 2b’10 128 Bytes, 2b’11 256 Bytes
tlx_afu_cmd_pl | 3 | TLX | Partial Length. Encodings: 3b’000 1 Byte, 3b’001 2 Bytes, 3b’010 4 Bytes, 3b’011 8 Bytes, 3b’100 16 Bytes, 3b’101 32 Bytes, 3b’110-111 Reserved
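The two length encodings in the table above can be decoded with small lookup tables. The sketch below is a software illustration only; the function and table names are made up here, and the values are copied from the table.

```python
# tlx_afu_cmd_dl: 2-bit Command Data Length (2b'00 is reserved).
CMD_DL_BYTES = {0b01: 64, 0b10: 128, 0b11: 256}

# tlx_afu_cmd_pl: 3-bit Partial Length (3b'110 and 3b'111 are reserved).
CMD_PL_BYTES = {0b000: 1, 0b001: 2, 0b010: 4, 0b011: 8, 0b100: 16, 0b101: 32}


def decode_dl(dl: int) -> int:
    """Return the data length in bytes for a tlx_afu_cmd_dl value."""
    if dl not in CMD_DL_BYTES:
        raise ValueError(f"reserved or invalid dl encoding: {dl:#04b}")
    return CMD_DL_BYTES[dl]


def decode_pl(pl: int) -> int:
    """Return the partial length in bytes for a tlx_afu_cmd_pl value."""
    if pl not in CMD_PL_BYTES:
        raise ValueError(f"reserved or invalid pl encoding: {pl:#05b}")
    return CMD_PL_BYTES[pl]


print(decode_dl(0b10), decode_pl(0b011))
```

Every valid pl encoding is a power of two, so `1 << pl` gives the same result as the table for values 0 through 5.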
Host to AFU Data Snippet
Signal Name | Bits | Source | Description
tlx_afu_cmd_data_valid | 1 | TLX | Command Data Valid. Valid data is present.
tlx_afu_cmd_data_bus | 512 | TLX | Command Data Bus.
tlx_afu_cmd_data_bdi | 1 | TLX | Bad Data Indicator. If asserted indicates the data received during the same cycle has an error and cannot be trusted.
afu_tlx_cmd_rd_req | 1 | AFU | AFU requests host command data known to be available
afu_tlx_cmd_rd_cnt | 3 | AFU | AFU specifies the number of data packets it will accept. Encodings: 3b’000 512 Bytes, 3b’001 64 Bytes, 3b’010 128 Bytes, 3b’011 256 Bytes, 3b’100 192 Bytes, 3b’101 320 Bytes, 3b’110 384 Bytes, 3b’111 448 Bytes
Note: ‘001’, ‘010’, and ‘011’ were set to match the data length encoding.
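Because the rd_cnt encoding is not monotonic, a lookup table is the simplest way to handle it in test code. The sketch below assumes that one data packet corresponds to one 64-byte beat on the 512-bit data bus; that reading, and the function names, are assumptions made for illustration.

```python
BUS_BYTES = 512 // 8  # one beat of the 512-bit tlx_afu_cmd_data_bus

# afu_tlx_cmd_rd_cnt encodings, copied from the table above.
RD_CNT_BYTES = {
    0b000: 512, 0b001: 64,  0b010: 128, 0b011: 256,
    0b100: 192, 0b101: 320, 0b110: 384, 0b111: 448,
}


def rd_cnt_to_beats(rd_cnt: int) -> int:
    """Number of 64-byte bus beats requested by an rd_cnt value."""
    return RD_CNT_BYTES[rd_cnt] // BUS_BYTES


def bytes_to_rd_cnt(nbytes: int) -> int:
    """Find the rd_cnt encoding for a request size in bytes."""
    for code, size in RD_CNT_BYTES.items():
        if size == nbytes:
            return code
    raise ValueError(f"{nbytes} bytes has no rd_cnt encoding")


print(rd_cnt_to_beats(0b000), bytes_to_rd_cnt(192))
```

The eight encodings cover every multiple of 64 bytes from 64 to 512, which is why no value is left reserved.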
AFU Initiated Command and Data Snippet
Signal Name | Bits | Source | Description
afu_tlx_cmd_valid | 1 | AFU | Indicates that a valid AP command has arrived from the AFU to the TLX. Any command field that pertains to the arriving opcode should contain valid information at this time. Other command fields are undefined and may contain garbage.
afu_tlx_cmd_opcode | 8 | AFU | AP Command Opcode. (see TL Specification)
afu_tlx_cmd_actag | 12 | AFU | Address Context tag (see TL Specification)
afu_tlx_cmd_ea_or_obj | 68 | AFU | Effective Address/Object Handle. (see TL Specification)
afu_tlx_cmd_afutag | 16 | AFU | AFU Tag. (see TL Specification)
afu_tlx_cmd_be | 64 | AFU | Byte enable. (see TL Specification)
afu_tlx_cmd_bdf | 16 | AFU | Bus Device Function (see TL Specification)
afu_tlx_cmd_pasid | 20 | AFU | User process ID (see TL Specification)
afu_tlx_cdata_valid | 1 | AFU | AP Command Data Valid. Indicates that a valid packet of command immediate data has arrived at the TLX from the AFU. The data bus and the bdi bit contain valid information.
afu_tlx_cdata_bus | 512 | AFU | AP Command Data Bus.
afu_tlx_cdata_bdi | 1 | AFU | Bad Data Indicator. Indicates that the AP command data packet is bad.
OpenCAPI Reference Design Cards
Initial work done on Xilinx VU3P FPGA with Alpha Data 9V3 card
Currently using Vivado 2018.2, but floorplan snapshot below is from 2017.1
Images also created and tested on KU15P FPGA (Mellanox Innova-2)
Work is ongoing with Xilinx ZU19P FPGA
Next generation images to be created on
Nallatech 250SOC
Alpha Data 9H7 (VU37P) and 9H3 (VU33P)
VU3P Resources | CLB FlipFlops | LUT as Logic | LUT Memory | Block Ram Tile
DLx | 9392/788160 (1.19%) | 19026/394080 (4.82%) | 0/197280 (0%) | 7.5/720 (1.0%)
TLx | 13806/788160 (1.75%) | 8463/394080 (2.14%) | 2156/197280 (1.09%) | 0/720 (0%)
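The per-block figures in the resource table can be added up to see how much of the VU3P the DLx and TLx occupy together. The sketch below only re-does the arithmetic on the table's used/available counts; the dictionary layout and function name are illustrative.

```python
# (used, available) per resource, copied from the VU3P resource table.
VU3P = {
    "CLB FlipFlops":  {"DLx": (9392, 788160),  "TLx": (13806, 788160)},
    "LUT as Logic":   {"DLx": (19026, 394080), "TLx": (8463, 394080)},
    "LUT Memory":     {"DLx": (0, 197280),     "TLx": (2156, 197280)},
    "Block Ram Tile": {"DLx": (7.5, 720),      "TLx": (0, 720)},
}


def combined_percent(resource: str) -> float:
    """Percentage of a resource used by DLx and TLx together."""
    blocks = VU3P[resource]
    used = sum(u for u, _ in blocks.values())
    available = next(iter(blocks.values()))[1]
    return 100.0 * used / available


for name in VU3P:
    print(f"{name}: {combined_percent(name):.2f}% used by DLx + TLx")
```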
OpenCAPI 3.0 Reference AFUs
MemCopy
The MemCopy example is a data mover from source address -> destination address using Virtual Addressing and includes these features
• Work queue for each context which can be configured to do copy commands, interrupts, translation touch, wake host thread (all command types for host validation)
• Configuration and MMIO Register Space
• acTag Table used for Bus/Device/Function and Process ID identification
• 512 processes/contexts and configurable up to 32 engines supporting up to 2K transfers using 64B, 128B, or 256B operations
Memory Home Agent (LPC)
The Memory Home Agent example implements memory off the endpoint OpenCAPI accelerator to act as a coherent extension to the host processor memory
The Memory Home Agent example includes these features
• Configuration and MMIO Register Space
• Individual and pipelined operation for memory loads and stores
• Interrupts, with error details reported to software through MMIO registers
• Sparse Address Mapping feature to extend 1 MB of real space to 4 TB of address space
OpenCAPI 3.0 Reference AFUs
AFP
Main performance AFU
Single process programmed to do streaming reads, streaming writes or a mix
Data is not checked – purely for bandwidth and latency testing
Interrupt and Wake Host Thread latency counters
Ping-Pong latency test added (MMIO to AFP->DMA store to memory)
CAPI and OpenCAPI Performance
Generation | CAPI 1.0 | CAPI 2.0 | OpenCAPI 3.0
Link | PCIe Gen3 x8 | PCIe Gen4 x8 | 25 Gb/s x8
Measured Bandwidth | @8Gb/s | @16Gb/s | @25Gb/s
128B DMA Read | 3.81 GB/s | 12.57 GB/s | 22.1 GB/s
128B DMA Write | 4.16 GB/s | 11.85 GB/s | 21.6 GB/s
256B DMA Read | N/A | 13.94 GB/s | 22.1 GB/s
256B DMA Write | N/A | 14.04 GB/s | 22.0 GB/s
Notes | First Introduction in 2013 | 2nd Generation | Open Architecture with a Clean Slate Focused on Bandwidth and Latency
Power 8/9 CPU
Xilinx KU60/VU3P FPGA
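One way to compare the three generations is to divide each measured bandwidth by the raw rate of its x8 link. The sketch below does that using the nominal per-lane rates printed in the table (8, 16 and 25 Gb/s). Treating those as raw signalling rates and ignoring line encoding overhead is an assumption, and the names are illustrative.

```python
LANES = 8
# Nominal per-lane signalling rate in Gb/s, as labelled in the table.
LANE_GBPS = {"CAPI 1.0": 8, "CAPI 2.0": 16, "OpenCAPI 3.0": 25}
# Measured bandwidth in GB/s; None where the table says N/A.
MEASURED = {
    "128B DMA Read":  {"CAPI 1.0": 3.81, "CAPI 2.0": 12.57, "OpenCAPI 3.0": 22.1},
    "128B DMA Write": {"CAPI 1.0": 4.16, "CAPI 2.0": 11.85, "OpenCAPI 3.0": 21.6},
    "256B DMA Read":  {"CAPI 1.0": None, "CAPI 2.0": 13.94, "OpenCAPI 3.0": 22.1},
    "256B DMA Write": {"CAPI 1.0": None, "CAPI 2.0": 14.04, "OpenCAPI 3.0": 22.0},
}


def efficiency(test: str, generation: str):
    """Measured bandwidth as a fraction of the raw x8 link rate, or None."""
    measured = MEASURED[test][generation]
    if measured is None:
        return None
    raw_gbytes = LANE_GBPS[generation] * LANES / 8  # Gb/s per lane -> GB/s
    return measured / raw_gbytes


for test in MEASURED:
    cells = []
    for gen in LANE_GBPS:
        eff = efficiency(test, gen)
        cells.append(f"{gen}: " + ("N/A" if eff is None else f"{eff:.0%}"))
    print(test, "|", ", ".join(cells))
```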
Latency Ping-Pong Test
Simple workload created to simulate communication between system and attached FPGA
Bus traffic recorded with protocol analyzer and PowerBus traces
Response times and statistics calculated
OpenCAPI path (Host: TL, DL, PHY; FPGA: TLx, DLx, PHYx; connected by OpenCAPI Link)
Host Code
1. Copy 512B from cache to FPGA
2. Poll on incoming 128B cache injection
3. Reset poll location
4. Repeat
FPGA Code
1. Poll on 512B received from host
2. Reset poll location
3. DMA write 128B for cache injection
4. Repeat
PCIe path (Host: PCIe Stack; FPGA: FPGA PCIe HIP*; connected by PCIe Link)
Host Code
1. Copy 512B from cache to FPGA
2. Poll on incoming 128B cache injection
3. Reset poll location
4. Repeat
FPGA Code
1. Poll on 512B received from host
2. Reset poll location
3. DMA write 128B for cache injection
4. Repeat
* HIP refers to hardened IP
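The host and FPGA steps listed above form a simple handshake, which can be modelled in a few lines. The sketch below is a single-threaded Python model of the sequence only: it does not model timing, the bus, or real cache injection, and the class and method names are hypothetical.

```python
class PingPongModel:
    """Single-threaded model of the host/FPGA handshake; no timing is modeled."""

    def __init__(self):
        self.fpga_poll = None  # location the FPGA polls for the 512B message
        self.host_poll = None  # location the host polls for the 128B reply
        self.round_trips = 0

    def host_send(self):
        self.fpga_poll = bytes(512)   # host step 1: copy 512B to the FPGA

    def fpga_service(self) -> bool:
        if self.fpga_poll is None:    # FPGA step 1: nothing received yet
            return False
        self.fpga_poll = None         # FPGA step 2: reset poll location
        self.host_poll = bytes(128)   # FPGA step 3: DMA write 128B
        return True

    def host_receive(self) -> bool:
        if self.host_poll is None:    # host step 2: nothing received yet
            return False
        self.host_poll = None         # host step 3: reset poll location
        self.round_trips += 1
        return True


model = PingPongModel()
for _ in range(3):                    # step 4 on both sides: repeat
    model.host_send()
    model.fpga_service()
    model.host_receive()
print(model.round_trips)
```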
Latency Test Results
Link | Host | Host stack | Host latency | Jitter | FPGA | FPGA stack | Total Latency
OpenCAPI Link | P9 OpenCAPI, 3.9GHz Core, 2.4GHz Nest | TL, DL, PHY | 298ns‡ | 2ns | Xilinx FPGA VU3P | TLx, DLx, PHYx (80ns‖) | 378ns†
PCIe G4 Link | P9 PCIe Gen4 | PCIe Stack | est. <337ns | | Xilinx FPGA VU3P | Xilinx PCIe HIP (218ns¶) | est. <555ns§
PCIe G3 Link | P9 PCIe Gen3, 3.9GHz Core, 2.4GHz Nest | PCIe Stack | 337ns | 7ns | Altera FPGA Stratix V | Altera PCIe HIP (400ns¶) | 737ns§
PCIe G3 Link | Kaby Lake PCIe Gen3*, 3.9GHz Core, 2.4GHz Nest | PCIe Stack | 376ns | 31ns | Altera FPGA Stratix V | Altera PCIe HIP (400ns¶) | 776ns§
* Intel Core i7 7700 Quad-Core 3.6GHz (4.2GHz TurboBoost)
† Derived from round-trip time minus simulated FPGA app time
‡ Derived from round-trip time minus simulated FPGA app time and simulated FPGA TLx/DLx/PHYx time
§ Derived from measured CPU turnaround time plus vendor provided HIP latency
‖ Derived from simulation
¶ Vendor provided latency statistic
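Per the footnotes, each total latency is the host-side figure plus the FPGA-side stack or HIP figure. The sketch below re-does that sum for the four configurations and compares each total with the OpenCAPI result. The numbers are copied from the results; the tuple layout and names are illustrative, and the PCIe Gen4 row is an estimated upper bound rather than a measurement.

```python
# (configuration, host-side ns, FPGA-side ns, stated total ns)
RESULTS = [
    ("OpenCAPI, P9 + Xilinx VU3P",              298, 80,  378),
    ("PCIe Gen4, P9 + Xilinx VU3P (est. <)",    337, 218, 555),
    ("PCIe Gen3, P9 + Altera Stratix V",        337, 400, 737),
    ("PCIe Gen3, Kaby Lake + Altera Stratix V", 376, 400, 776),
]

OPENCAPI_TOTAL_NS = RESULTS[0][3]


def check_totals(results):
    """Return True when every stated total equals host + FPGA latency."""
    return all(host + fpga == total for _, host, fpga, total in results)


print("totals consistent:", check_totals(RESULTS))
for name, _, _, total in RESULTS:
    ratio = total / OPENCAPI_TOTAL_NS
    print(f"{name}: {total} ns ({ratio:.2f}x the OpenCAPI total)")
```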
Roadmap - OpenCAPI 4.0 (P9’/Axone)
Adds posted DMA Store operations with Address Translation Cache
New AFU validation/reference design needs
Address Translation Cache in MemCopy (in development)
Storage Class Memory Development
Technology previews being developed on OpenCAPI 3.0
Table of Enablement Deliveries
Item | Delivery Name | Where to Obtain | Available When
OpenCAPI 3.0 TLx and DLx Reference Xilinx FPGA Designs (RTL and Specifications) | <snapshot>.tar.gz | Enablement WG | Today
Xilinx Vivado Project Build with Memcopy Exerciser | Vivado Project Flow | Enablement WG | Today
Device Discovery and Configuration Specification and RTL | OpenCAPI 3.0 Configuration Sub-System Reference Design Specification | Enablement WG Causeway | Today
AFU Interface Specification | TLX 3.0 Reference Design.pdf | Enablement WG Causeway | Today
25Gbps PHY Signal Specification | OC PHY 25G Specification | PHY Signalling WG Causeway | Today
25Gbps PHY Mechanical Specification | 25Gbps Interface Mechanical Spec | PHY Mechanical WG Causeway | Today
OpenCAPI Simulation Environment (OCSE) | ocse-<version>.tar.gz, OpenCAPIDemokit.pdf | Enablement WG | Today, Today
Memcopy and Memory Home Agent Exercisers | MCP3 and LPC<snapshot>.tar.gz | Enablement WG | Today
Reference Driver Available | LIBOCXL | Ubuntu 18.04, GitHub | Today