© 2009 IBM Corporation
IEEE Hot Interconnects 20, Santa Clara, CA, Aug. 22-23, 2012
Rx Stack Accelerator for 10 GbE Integrated NIC
F. Abel, C. Hagleitner, IBM Research – Zurich, Switzerland
F. Verplanken, IBM Systems & Technology Group, La Gaude – France
Discrete vs Integrated Network Interface Controller
■ Discrete NIC (dNIC)
– Peripheral ASIC device
– Marketed as:
● Ethernet Controller
● Converged Controller
■ Integrated NIC (iNIC)
– Sun Niagara 2
– Freescale → QorIQ™ family
– IBM → PowerEN™
Converged Network Adapter (CNA)
Network Interface Card (NIC)
LAN On Motherboard (LOM)
[Figure: PowerEN chip floorplan showing the A2 cores, L2 caches, DDR3 memory controllers (MC), EI3 RX/TX links, PBus, PBEC and PBIC blocks, AT0 to AT3, the Crypto, Comp, XML, RegX and PCI/Ethernet blocks, and the HEA/PP block]
410 mm² (1.43 billion transistors), iNIC ~3%
Higher performance, lower latency
Lower power consumption
Significant cost reduction
Alters the general-purpose nature of the computer complex
Outline
■ Hardware context of this work
– PowerEN™ / Host Ethernet Adapter
■ Functional requirements
■ Architecture of the Rx Stack Accelerator
– Data path
– Packet parser
– Packet handler
■ Results
– Implementation
– Performance
■ Summary and conclusions
Context: PowerENTM / Host Ethernet Adapter (HEA)
[Figure: PowerEN block diagram. A2 cores grouped around four 2 MB L2 caches on the PBus; memory controllers (MC) with Mem PHYs; AT0 to AT3, Crypto, XML, Pattern Engine and Comp / Decomp blocks attached through PBIC (PBus Access Macro); PBus External Controller with EI3 links; PCI Exp Gen 2 Root/EP Engine and Root Engine with x8 and x4 PHYs; Misc I/O, Pervasive and PLLs]
Quad-port 10 GbE NIC, a.k.a. Host Ethernet Adapter (HEA): 4x 10GE MAC or 4x 1GE MAC
Overview of the Host Ethernet Adapter (HEA)
■ 4×10 GbE state-of-the-art Ethernet controller featuring:
– I/O virtualization support
● Through 128 queue pairs
● Internal layer-2 switch for partition-to-partition data traffic
– Flexible queue selection and scheduling assist
– Rx and Tx protocol acceleration
– Low latency through direct processor bus attachment and cache injection
– Multi-core scaling
– Interrupt and receive coalescing assist
– 9 KB Jumbo frame support
– Memory address protection
– . . .
■ Functionality and performance levels equivalent to modern discrete NICs
■ Small footprint (13 mm²) and low power (2.6 W)
[Figure: HEA block diagram. Four ports, each with an XGMAC, an XGPCS (or XGXS PCS), an Rx accelerator (RxAcc) and a Tx accelerator (TxAcc), connected over paths labeled 128 to the Host Interface on the SoC Bus]
Target Domain Requirements
■ PowerEN™ targets network-facing applications
– Must cope with a large spectrum of protocols (13+), traffic characteristics and policies
● E.g. routers, firewalls, intrusion-prevention systems and network analytics
– Must be able to adapt to changes
● Handle other existing or emerging protocols
– Virtualized I/Os → Must sustain 16 Gb/s (internal layer-2 switch)
■ Ethernet: A trucking service for application data
– Protocol layered architecture (i.e., encapsulation) → 2000+ permutations
[Figure: protocol encapsulation from L1.5 to L4 plus payload. Headers shown: ISL, DIX, VLAN, QinQ, SAP, SNAP, PPPoE, MPLS (x6), IPv4t, IPv6t, IPv4, IPv6, IPv6 Ext. Hdr. (x4), TCP, UDP, ICMPv6]
Distribution of the Protocol Stacks
[Chart: distribution of the protocol stacks. Data labels (header, count):]
DIX 1153, SAP 1079, SNAP 1078, Q 1183, QQ 362, PPP 972
M1 1788, M2 1510, M3 1231, M4 941, M5 665, M6 394
IP4 940, IP4t 197, IP6 1263, IP6t 186
FF 651, MF 331, LF 329, RH 186
L4 2118, M 85, L5 424, U 558, A 25
Highlighted: DIX 1153, VLAN 1183, MPLS2 1510, IPv6 1263, L4 2118
Offloaded Service Requirements
■ WHY
– TCP Rx+Tx processing → ~3000 instructions (assuming zero-copy and checksum offload)
– 10 GbE → a 64 B packet can arrive every 67.2 ns (14.8 Mfps)
– A generally accepted rule of thumb → 1 GHz of CPU per 1 Gb/s of network throughput
■ WHAT (business as usual)
– 'per-byte' operations
● check-summing
– 'per-packet' operations
● header processing: VLAN, MAC, flow identification, QoS determination, discard, errors, traffic steering
■ HOW (different)
– Flexible way → Programmable parsing and processing (rule-based)
– Meta-data descriptors → Save 300-400 instructions per frame
● E.g., Protocol stack signature (31b), Protocol stack offsets (64b)
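The 67.2 ns figure follows from the on-wire size of a minimum-size frame. A minimal sketch of that arithmetic, assuming the usual Ethernet per-frame overhead of an 8 B preamble and a 12 B inter-frame gap (the function names are illustrative, not from the talk):

```python
PREAMBLE_BYTES = 8      # preamble, including the start-of-frame delimiter
IFG_BYTES = 12          # minimum inter-frame gap
LINE_RATE_BPS = 10e9    # 10 GbE

def frame_time_ns(frame_bytes: int) -> float:
    """Time one frame occupies the wire, including preamble and gap."""
    bits_on_wire = (frame_bytes + PREAMBLE_BYTES + IFG_BYTES) * 8
    return bits_on_wire / LINE_RATE_BPS * 1e9

def frames_per_second(frame_bytes: int) -> float:
    """Back-to-back frame rate for a given frame length."""
    return 1e9 / frame_time_ns(frame_bytes)

print(f"{frame_time_ns(64):.1f} ns per 64 B frame")
print(f"{frames_per_second(64) / 1e6:.2f} Mfps")
```

At a 625 MHz clock, 67.2 ns corresponds to 42 clock cycles per minimum-size frame.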
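[Figure: meta-data descriptor layout]
Protocol stack signature (31 flags): DIX, SAP, SNAP, VLAN, QinQ, PPPoE, MPLS1, MPLS2, MPLS3, MPLS4, MPLS5, MPLS6, IPv4, IPv4t, IPv6, IPv6t, FFrg, MFrg, LFrg, RHdr, L4, L5, Nu, Nu, Nu, Nu, Nu, Incom, Malfd, Abort, Nu
Protocol stack offsets word: L4 Prot., L2.5 Pos., L3 Position, L4 Position, Payload Position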
Rx Stack Processing
■ Can be decomposed into three major tasks:
– (t1) Parsing
● Identify protocol stack + position of protocol fields
– (t2) Data extraction
● Locate and retrieve data to be processed
– (t3) Processing
● Execute instructions based on identified rules
– Filtering (MAC, VLAN), VLAN extraction,
– Rx queue assignment, checksum verification,
– flow determination, discard, counter increments, …
■ Main characteristics exhibited by protocol processing applications [Jantsch1998]
– (c1) Intensive use of pattern matching
● especially on headers
– (c2) Complex and control-dominated flow
● many nested if-then-else and case structures
– (c3) Intensive use of irregular memory accesses
● various sizes and patterns
RXACC Architecture
[Figure: RXACC block diagram. Data Path (octaword-in → octaword-out, 16 B), Packet Parser (programmable), Packet Handler (FU1 … FUN attached through sockets), Src and Dst transport buses with transport control, meta-data output (180 bits); clk = 625 MHz]
■ A Transport Triggered Architecture (TTA) w/ 5 transport buses (1 byte/bus)
■ Prog. FSM + Balanced Routing Table search algorithm (B-FSM)
→ efficient pattern-matching (c1)
→ efficient one-cycle multiway branching (c2)
■ ILP architecture → instruction-level parallelism
→ parallel, independent, and pipelined coprocessors (FUs)
→ performance scalability
→ hardware efficiency
→ hardware modularity
→ “ultimate” RISC architecture: only 1 instruction, “move data”
■ Cut-through architecture
→ low latency
→ efficient irregular mem. access (c3)
■ Frame “signature” → saves 300-400 instructions (avg.)
Multi-field Packet Inspection & Data Extraction
[Figure: a frame pointer (FP) and five offsets (Off.1 to Off.5) select the current fields of interest, then the FP advances and the offsets select the next fields of interest. Example frame: QQ/VLAN/SAP/SNAP/IPv6/UDP, annotated with header field widths in bits]
Data Path Architecture
[Figure: data path. 16 B words (D00 to D15 plus Status) flow from the XGMAC I/F through registers R L and R H to the HOST I/F; FrmPtr, BCNT and Off(1-5) drive the M0 selectors that place bytes B1-B5 on the transport I/F (Src, Op I/F). Legend: L2 (DIX: MAC DA, MAC SA, EtherType), L3 (IPv4: Tot.Len., Id., Prot., Hd.Csum, SA, DA)]
Extracted fields: FrmPtr = 16, Offset1 = 2, Offset2 = 3, Offset3 = 7, Offset4 = 8, Offset5 = 9
■ Cut-through architecture
– Relaxed buffering (2 × 16 B) + Low latency
– Flexible data extraction (any 5 bytes within the 32 B window)
– Entire packet is exposed to the parser → Can inspect and match any data of the frame
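The extraction step can be sketched in a few lines. This is an illustration, not the hardware design: it assumes offsets are byte offsets relative to the frame pointer and that the 32 B window starts at the 16 B-aligned octaword containing the frame pointer. The function name `extract` and the sample frame contents are hypothetical. With the slide's example values on an untagged DIX/IPv4 frame, the selected bytes are the IPv4 Identification, Protocol and Header Checksum fields.

```python
import struct

WINDOW_BYTES = 32  # two 16 B octawords, per the slide

def extract(frame: bytes, frame_ptr: int, offsets: list[int]) -> bytes:
    """Return the bytes at frame_ptr + offset for up to five offsets.

    Assumption: the 32 B window starts at the 16 B-aligned octaword that
    contains the frame pointer.
    """
    if len(offsets) > 5:
        raise ValueError("at most five bytes per extraction")
    window_start = frame_ptr - frame_ptr % 16
    out = bytearray()
    for off in offsets:
        pos = frame_ptr + off
        if not window_start <= pos < window_start + WINDOW_BYTES:
            raise ValueError(f"offset {off} falls outside the 32 B window")
        out.append(frame[pos])
    return bytes(out)

# Hypothetical DIX/IPv4 frame: 14 B Ethernet header + 20 B IPv4 header.
eth = bytes(6) + bytes(6) + b"\x08\x00"
ipv4 = struct.pack("!BBHHHBBH4s4s", 0x45, 0, 40, 0xBEEF, 0, 64, 6, 0x1234,
                   bytes([10, 0, 0, 1]), bytes([10, 0, 0, 2]))
frame = eth + ipv4

# Slide example: FrmPtr = 16, offsets 2, 3, 7, 8, 9.
fields = extract(frame, 16, [2, 3, 7, 8, 9])
print(fields.hex())  # Identification (2 B), Protocol (1 B), Header Checksum (2 B)
```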
Packet Parser
Design space:
– micro-coded → limited branch capabilities → multi-GHz operation → power
– finite state machine (FSM) → efficient but inflexible
– programmable finite state machine (pFSM) → high performance and energy efficient
State transition diagram (states shown): IPv4_SA (S22), IPv4_DA_Prot (S24), IPv4_UDP (S65), IPv4_TCP (S44), IPv4_UL4 (S36)
State transition rules ('X' symbol represents a “don't care” bit):
Rule  Current State  Input 1     Input 2     Input 3     Next State  Priority
...   ...            ...         ...         ...         ...         ...
R44   S22            XXXX_XXXXb  XXXX_XXXXb  XXXX_XXXXb  S24         0
R52   S24            XXXX_0101b  0001_0001b  X00X_XXXXb  S65         0
R53   S24            XXXX_XXXXb  0001_0001b  X00X_XXXXb  S82         1
R61   S24            XXXX_XXXXb  0010_1001b  X00X_XXXXb  S36         2
R112  S24            XXXX_0101b  0000_0110b  X00X_XXXXb  S44         3
HW Engine
– We use the B-FSM architecture [van Lunteren2006]
– 'B' stands for Balanced Routing Table search algorithm (BaRT)
– Originally designed for longest matching prefix searches (i.e. routing table lookups)
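The rule table can be exercised with a small software model. The sketch below uses a plain linear scan over ternary patterns, not the BaRT hash-based search of the B-FSM. It also assumes that the lowest priority number wins when several rules match, since otherwise R52 could never be selected over R53. The names `Rule`, `matches` and `step` are illustrative.

```python
from typing import NamedTuple, Optional

class Rule(NamedTuple):
    name: str
    state: str
    patterns: tuple[str, str, str]   # ternary patterns, 'X' = don't care
    next_state: str
    priority: int

ANY = "XXXX_XXXXb"
RULES = [  # the five rules shown on the slide
    Rule("R44", "S22", (ANY, ANY, ANY), "S24", 0),
    Rule("R52", "S24", ("XXXX_0101b", "0001_0001b", "X00X_XXXXb"), "S65", 0),
    Rule("R53", "S24", (ANY, "0001_0001b", "X00X_XXXXb"), "S82", 1),
    Rule("R61", "S24", (ANY, "0010_1001b", "X00X_XXXXb"), "S36", 2),
    Rule("R112", "S24", ("XXXX_0101b", "0000_0110b", "X00X_XXXXb"), "S44", 3),
]

def matches(pattern: str, byte: int) -> bool:
    """Compare one input byte against a ternary pattern such as 'XXXX_0101b'."""
    bits = pattern.rstrip("b").replace("_", "")
    mask = int("".join("0" if c == "X" else "1" for c in bits), 2)
    value = int("".join("0" if c == "X" else c for c in bits), 2)
    return (byte & mask) == value

def step(state: str, inputs: tuple[int, int, int]) -> Optional[str]:
    """Next state for the best matching rule, or None if no rule matches.

    Assumption: the lowest priority number wins among matching rules.
    """
    hits = [r for r in RULES if r.state == state
            and all(matches(p, b) for p, b in zip(r.patterns, inputs))]
    return min(hits, key=lambda r: r.priority).next_state if hits else None

print(step("S24", (0x45, 0x11, 0x00)))  # R52 and R53 both match
print(step("S24", (0x46, 0x11, 0x00)))  # only R53 matches
print(step("S24", (0x45, 0x06, 0x00)))  # R112 matches
```

The second input values 0x11 (17) and 0x06 (6) are the IANA protocol numbers for UDP and TCP, which is consistent with the state names IPv4_UDP and IPv4_TCP.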
Packet Parser Architecture
[Figure: B-FSM engine. The state register and the input (B1 to B3) feed an address generator into the Transition Rule Memory (F, F', G, G', H'); the Transition Rule Selector checks four candidate rules (R1 to R4, Match1 to Match4) and produces the next state and the output symbols for the data path controller (FPInc, Src1-Src5) and the packet handler interface (PH I/F). Also shown: LUT, range comparator, register file A,B, status register]
Rule structure: Test part (SC, input symbols) | Result part (SN, output symbols) | H part (H sym.)
Transition rule memory: 64 × 640 bits, 256 rules (1 rule = 160 bits)
Packet Handler Architecture
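[Figure: Packet Handler. Functional units FU1 … FUN attached through sockets to the Dst transport buses under transport control; meta-data output]
■ Functional Units (FU)
– Independent coprocessors (IP-blocks) → Concurrent operation → ILP → Performance
– Configurable through MMIO
– Implement TTA socket registers:
● O = operand register(s)
● T = trigger register
(note: result registers are not implemented)
[Figure: FU example, the HASHER. Operand registers O1, O2, O3 … ON and trigger register T are written from the transport buses via Dst decode and feed a bit mask, a symmetry stage and a hasher (XORs), with a label of 32; configured through MMIO]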
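The trigger semantics of a TTA socket can be illustrated with a toy model: writing an operand register only latches data, while writing the trigger register starts the operation. The class below is a sketch under assumptions. The XOR fold, the 8-bit mask and the `hash` attribute are stand-ins chosen for illustration; the slide does not say how the real HASHER combines its operands or where its output is stored.

```python
from functools import reduce

class HasherFU:
    """Toy TTA functional unit: operand registers O1..ON plus a trigger T."""

    def __init__(self, n_operands: int = 4, mask: int = 0xFF):
        self.operands = [0] * n_operands
        self.mask = mask   # stand-in for the MMIO-configured bit mask
        self.hash = 0      # hypothetical internal state, not a TTA result register

    def write(self, socket: str, byte: int) -> None:
        """Latch one byte arriving from a transport bus into a socket register."""
        if socket == "T":
            # Writing the trigger register starts the operation on all operands.
            data = self.operands + [byte]
            self.hash ^= reduce(lambda a, b: a ^ b, (d & self.mask for d in data))
        else:
            self.operands[int(socket[1:]) - 1] = byte  # "O1" -> index 0

fu = HasherFU(n_operands=2)
fu.write("O1", 0x12)   # operand write: nothing is computed yet
fu.write("O2", 0x34)
before = fu.hash
fu.write("T", 0x56)    # trigger write: the operation fires
print(hex(before), hex(fu.hash))
```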
Rule Format → 160 bits
[Figure: rule layout]
– Test Part (62 bits): CS, CompareMask, CompareValue
– Result Part (variable-length instruction, 87 bits): NS, frame pointer increment, Src1 to Src5, Dst1 to Dst5
– H Part (9 bits)
Legend:
– Frame pointer increment (Long/Short – Fixed/Variable)
– Source of the operands (i.e. offsets w.r.t. the frame pointer)
– Destination of the operands (i.e. functional unit registers)
■ Simplified Instruction Set
– Frame pointer instructions
● e.g. ffpinc(4); // fixed increment
● e.g. vfpinc(1); // variable increment
– Move instruction
● e.g.: move(DP(4), CSI.O1); // unicast destination
● e.g.: move(DP(12), HSH.O1 & CSI.O1 & CST.O9); // multicast destination (socket write sharing)
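The instruction examples above can be mimicked with a small software model. This is a sketch, not the RXACC encoding: `Datapath`, `dp` and the dictionary-based register files are hypothetical, the FU names CSI, HSH and CST are copied from the slide without further interpretation, and `vfpinc` is left out because the slide does not say where the variable increment comes from.

```python
class Datapath:
    """Frame bytes plus a frame pointer (FP)."""

    def __init__(self, frame: bytes):
        self.frame = frame
        self.fp = 0

    def ffpinc(self, n: int) -> None:
        """Fixed frame pointer increment."""
        self.fp += n

    def dp(self, offset: int) -> int:
        """DP(offset): the frame byte at FP + offset."""
        return self.frame[self.fp + offset]

def move(value: int, *destinations: tuple[dict, str]) -> None:
    """Drive one byte onto a bus; every listed socket register latches it."""
    for registers, name in destinations:
        registers[name] = value

# Hypothetical register files for three functional units.
CSI, HSH, CST = {}, {}, {}
path = Datapath(bytes(range(32)))   # dummy frame: byte i has value i

path.ffpinc(4)                                             # ffpinc(4)
move(path.dp(4), (CSI, "O1"))                              # unicast move
unicast_value = CSI["O1"]
move(path.dp(12), (HSH, "O1"), (CSI, "O1"), (CST, "O9"))   # multicast move
print(unicast_value, HSH["O1"], CSI["O1"], CST["O9"])
```

In the multicast case a single bus transfer updates three sockets, which is what makes socket write sharing cheaper than three separate moves.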
Implementation Results
■ Area (in 45 nm SOI technology)
– 1 RXACC = 0.7 mm²
– 4 RXACC = 21.5 % of HEA (entire HEA = 13 mm²)
■ Clock Frequency
– 625 MHz (27 % of core frequency)
■ Power (estimate)
– RXACC 0.15 W (entire HEA = 2.6 W)
■ Transition Rule Memory Utilization
– 256 rules → 64 x (4 x 160) bits = 40 kbits
● 68 states out of 128 (53 %)
● 221 rules out of 256 (86 %)
– Transition rule memory + ECC logic = 15 % of RXACC
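The figures above are mutually consistent, which a few lines of arithmetic confirm. The variable names are illustrative, and the implied core clock is derived from the 27 % ratio rather than stated on the slide.

```python
RULE_BITS = 160
ROWS, RULES_PER_ROW = 64, 4

total_rules = ROWS * RULES_PER_ROW                 # rules in the memory
memory_bits = ROWS * RULES_PER_ROW * RULE_BITS     # total memory size in bits

rule_utilization = 221 / total_rules               # rules in use
state_utilization = 68 / 128                       # states in use
area_share = 4 * 0.7 / 13                          # four RXACCs vs. whole HEA
implied_core_mhz = 625 / 0.27                      # derived, not from the slide

print(total_rules, "rules,", memory_bits // 1024, "kbit")
print(f"rules {rule_utilization:.0%}, states {state_utilization:.0%}")
print(f"area share {area_share:.1%}, core clock ~{implied_core_mhz:.0f} MHz")
```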
RXACC Bandwidth
[Chart: RXACC bandwidth in Gb/s (axis 0 to 20) versus frame length in bytes (64, 80, 128, 256, 512, 1024, 1518, 4096, 8192). Series:]
– DIX/IPv4/TCP
– DIX/IPv6/UDP
– DIX/QQ/VLAN/PPPoE/MPLS/MPLS/MPLS/MPLS/IPv4/TCP
– DIX/QQ/VLAN/IPv6-HBH-ROUT-FRAG/TCP
– Theoretical limit of 10 GbE media (inter-frame gap = 12 B, preamble = 8 B)
Missing bars: the protocol stack does not fit in a frame of that length
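The theoretical media limit depends on frame length because every frame carries a fixed 20 B of overhead on the wire. The sketch below shows one common way to compute it, counting only frame bits as delivered bandwidth; whether the chart uses exactly this definition is an assumption, and `media_limit_gbps` is an illustrative name.

```python
OVERHEAD_BYTES = 12 + 8      # inter-frame gap + preamble, per the chart legend
LINE_RATE_GBPS = 10.0

def media_limit_gbps(frame_bytes: int) -> float:
    """Frame bits delivered per second when frames are sent back to back."""
    return LINE_RATE_GBPS * frame_bytes / (frame_bytes + OVERHEAD_BYTES)

for length in (64, 80, 128, 256, 512, 1024, 1518, 4096, 8192):
    print(f"{length:5d} B  {media_limit_gbps(length):5.2f} Gb/s")
```

The overhead costs about a quarter of the line rate at 64 B and becomes negligible for jumbo frames.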
Summary & Conclusions
■ Processor compute complex + integrated network I/O complex
– IFF high-computation performance (1) + high-chip density (2) + low-power (3) + flexibility (4)
– RXACC delivers (1) + (2) + (3) + (4)
■ (1) Performance
– 15 Mfps, 20 Gb/s (at “relaxed” clock frequency → 625 MHz)
– Saves hundreds of CPU cycles per frame
■ (2-3) Area and power efficiency
– 0.7 mm² (45 nm SOI), 0.15 W
■ (4) Flexibility
– 2000+ protocol permutations
– Programmable parsing and processing → Rule-based
● Can parse new emerging standards (e.g. SDN)
● Can inspect header and payload
● Pragmatic approach w/ one code-set per application
■ RXACC = TTA + pFSM = novel Application Specific Processor
■ Key enabler for integrated NICs
■ The architecture has headroom to scale towards 40-100 GbE