QorIQ T4240 Communications
Processor Deep Dive
FTF-NET-F0031
APR. 2014
Sam Siu & Feras Hamdan
Agenda
• QorIQ T4240 Communications Processor Overview
• e6500 Core Enhancement
• Memory Subsystem and MMU Enhancement
• QorIQ Power Management features
• HiGig Interface
• Interlaken Interface
• PCI Express® Gen 3 Interfaces (SR-IOV)
• Serial RapidIO® Manager (RMAN)
• Data Path Acceleration Architecture Enhancements
− mEMAC
− Offline Ports and Use Case
− Storage Profiling
− Data Center Bridging (FMAN and QMAN)
− Accelerators: SEC, DCE, PME
• Debug
QorIQ T4240 Communications Processor
[Block diagram: T4240 SoC: three clusters of four Power e6500 cores (32 KB I/D L1 caches, two threads each) with a 2 MB banked L2 per cluster; three 64-bit DDR3/3L memory controllers, each paired with a 512 KB CoreNet platform cache; CoreNet coherency fabric with PAMUs (peripheral access management units); DPAA blocks (two FMans with parse/classify/distribute and 2x 1/10G + 6x 1G ports each, Queue Manager, Buffer Manager, SEC 5.0, Pattern Match Engine 2.0, DCE 1.0, RMan); two 16-lane 10 GHz SerDes banks serving PCIe, sRIO, SATA 2.0, Interlaken-LA, and Ethernet; security fuse processor and security monitor; power management; IFC, SD/MMC, 2x USB 2.0 w/PHY, 2x DUART, 2x I2C, SPI, GPIO; Aurora real-time debug with watchpoint cross trigger, performance monitor, and CoreNet trace]
Processor
• 12x e6500, 64-bit, up to 1.8 GHz
• Dual threaded, with 128-bit AltiVec engine
• Arranged as 3 clusters of 4 CPUs, with 2 MB L2 per cluster; 256 KB per thread
Memory SubSystem
• 1.5 MB CoreNet platform cache w/ECC
• 3x DDR3 controllers up to 1.87 GHz
• Each with up to 1 TB addressability (40-bit physical addressing)
CoreNet Switch Fabric
High-speed Serial IO
• 4 PCIe controllers, with Gen3
• SR-IOV support
• 2 sRIO controllers
• Type 9 and 11 messaging
• Interworking to DPAA via Rman
• 1 Interlaken Look-Aside at up to 10 GHz
• 2 SATA 2.0 3Gb/s
• 2 USB 2.0 with PHY
Network IO
• 2 Frame Managers, each with:
• Up to 25Gbps parse/classify/distribute
• 2x10GE, 6x1GE
• HiGig, Data Center Bridging Support
• SGMII, QSGMII, XAUI, XFI
• Device
− TSMC 28 HPM process
− 1932-pin BGA package
− 42.5x42.5 mm, 1.0 mm pitch
• Power targets
− ~54W thermal max at 1.8 GHz
− ~42W thermal max at 1.5 GHz
• Data Path Acceleration
− SEC: crypto acceleration, 40 Gbps
− PME: regex pattern matching, 10 Gbps
− DCE: data compression engine, 20 Gbps
e6500 Core Enhancement
e6500 Core Complex
High Performance
• 64-bit Power Architecture® technology
• Up to 1.8 GHz operation
• Two threads per core
• Dual load/store units, one per thread
• 40-bit real address
− 1 Terabyte physical address space
• Hardware table walk
• L2 in a cluster of 4 cores
− Supports sharing across the cluster
− Supports L2 memory allocation to a core or thread
Energy Efficient
• Power management
− Drowsy: core, cluster, AltiVec engine
− Wait-on-reservation instruction
− Traditional modes
• AltiVec SIMD unit (128b)
− 8-, 16-, 32-bit signed/unsigned integer
− 32-bit floating point, 173 GFLOPS (1.8 GHz)
− 8-, 16-, 32-bit Boolean
• Improved productivity with core virtualization
− Hypervisor
− Logical to Real Address Translation (LRAT) mechanism for improved hypervisor performance
[Diagram: e6500 cluster: four dual-threaded e6500 cores, each with 32 KB L1 caches, an AltiVec unit, and a performance monitor, sharing a 2 MB 16-way, 4-bank L2 cache; CoreNet interface with 40-bit address bus, 256-bit read and write data buses, and a double-data-rate processor port]
CoreMark             P4080 (1.5 GHz)   T4240 (1.8 GHz)   Improvement from P4080
Single thread        4708              7828              1.7x
Core (dual thread)   4708              15,656            3.3x
SoC                  37,654            187,873           5.0x
DMIPS/Watt (typ)     2.4               5.1               2.1x
General Core Enhancements
• Improved branch prediction and additional link stack entries
• Pipeline improvements:
− LR, CTR, mfocrf optimization (LR and CTR are renamed)
− 16-entry rename/completion buffer
• New debug features:
− Ability to allocate individual debug events between the internal and external debuggers
− More IAC events
• Performance monitor
− Many more events, six counters per thread
− Guest performance monitor interrupt
• Private vs. shared registers and other architected state
− Shared between threads: there is only one copy of the register or architected state; a change in one thread affects the other thread if the other thread reads it
− Private to the thread and replicated per thread: there is one copy per thread of the register or architected state; a change in one thread does not affect the other thread when it reads its private copy
CoreNet Enhancements in QorIQ T4240
• CoreNet Coherency Fabric
− 40-bit real address
− Higher address bandwidth and more active transactions: 1.2 Tbps read, 0.6 Tbps write
− 2x bandwidth increase for cores, MMU, and peripherals
− Improved configuration architecture
• Platform Cache
− Increased write bandwidth (>600 Gbps)
− Increased buffering for improved throughput
− Improved data ownership tracking for performance enhancement
• Data Prefetch
− Tracks CPC misses
− Prefetches from multiple memory regions with configurable sizes
− Selective tracking based on requesting device, transaction type, and data/instruction access
− Conservative prefetch requests to avoid overloading the system with prefetches
− "Confidence"-based algorithm with a feedback mechanism
− Performance monitor events to evaluate the performance of prefetch in the system
[Chart: IP Mark and TCP Mark results, 0-100% over an x-axis of 0-24]
Cache and Memory Subsystem
Enhancements
Shared L2 Cache
• Clusters of cores share a 2 MB, 4-bank, 16-way set-associative shared L2 cache.
• In addition, there is also a 1.5 MB CoreNet platform cache.
• Advantages
− L2 cache is shared among 4 cores allowing lines to be allocated among the 4 cores as required
Some cores will need more lines and some will need less depending on workloads
− Faster sharing among cores in the cluster (sharing a line between cores in the cluster does not require the data to travel on CoreNet)
− Flexible partitioning of the L2 cache based on application cluster group
• Trade Offs
− Longer latency to DRAM and other parts of the system outside the cluster
− Longer latency to L2 cache due to increased cache size and eLink overhead
Memory Subsystem Enhancements
• The e6500 core has a larger store queue than the e5500 core
• Additional registers are provided for L2 cache partitioning controls similar to how partitioning is done in the CPC
• Cache locking is supported, however, if a line is unable to be locked, that status is not posted. Cache lock query instructions are provided for determining whether a line is locked
• The load store unit contains store gather buffers to collect stores to cache lines before sending them on eLink to the L2 cache
• There are no more Line Fill Buffers (LFB) associated with the L1 data cache
− These are replaced with Load Miss Queue (LMQ) entries for each thread
− They function in a manner very similar to LFBs
• Note there are still LFBs for L1 instruction cache
MMU Enhancements
MMU – TLB Enhancements
• e6500 core implements MMU architecture version 2 (V2)
− MMU architecture V2 is denoted by bits in the MMUCFG register
• Translation Look-aside Buffer (TLB1)
− Variable size pages, supports power of two page sizes (previous cores used power of 4 page sizes)
− 4 KB to 1 TB page sizes
• Translation Look-aside Buffers (TLB0) increased to 1024 entries
− 8 way associativity (from 512, 4 way)
− Supports HES (hardware entry select) when written to with tlbwe
• PID register is increased to 14 bits (from 8 bits)
− Now the operating system can have 16K simultaneous contexts
• Real address increased to 40 bits (from 36 bits)
• In general, it is backward compatible with MMU operations from e5500 core, except:
− Some of the configuration registers have a different organization (TLBnCFG, for example)
− There are new config registers for TLB page size (TLBnPS) and LRAT page size (LRATPS)
− tlbwe can be executed by guest supervisor (but can be turned off with an EPCR bit)
[Diagram: MMU translation: a 64-bit effective address (effective page number, 0-52 bits, plus byte address, 12-32 bits) is combined with MSR[GS] (0 = hypervisor, 1 = guest), MSR[AS], the 14-bit LPID, and the 14-bit PID to produce a 40-bit real address (real page number, 0-28 bits, plus byte address, 12-40 bits)]
MMU – Virtualization Enhancements (LRAT)
• e6500 core contains an LRAT (logical to real address translation)
− The LRAT translates logical addresses (addresses the guest operating system believes are real) into true real addresses
− Translation occurs when the guest executes tlbwe and tries to write TLB0 or during hardware tablewalk for a guest translation
− Does not require hypervisor to intervene unless the LRAT incurs a miss (the hypervisor writes entries into the LRAT)
− 8 entry fully associative supporting variable size pages from 4 KB to 1 TB (in powers of two)
• Prior to the LRAT, the hypervisor had to intervene each time the guest tried to write a TLB entry
[Flow diagram: an application instruction takes an MMU page fault; the guest OS translates the VA to a guest RA and writes the TLB; a trap to the hypervisor then translates the guest RA to an RA and writes the TLB; with the LRAT, this last step is implemented in hardware]
QorIQ Power Management
Features
Dynamic T4 Family Energy/Power Total Cost of Ownership
[Chart: energy savings across workload levels (full, mid, light, standby, light-to-mid, full) for a cyclical valued workload, comparing today's always-on energy strategy with T4 advanced power management; techniques shown include core drowsy, cascaded cluster drowsy, dual cluster drowsy + Tj, dynamic clock gating, and SoC sleep]
Cascaded Power Management
• Today: all CPUs in the pool channel dequeue until all FQs are empty; a broadcast notification is sent when work arrives
• The DPAA uses task queue thresholds to inform CPUs they are not needed; CPUs are selectively awakened as needed
[Diagram: QMan task queue with thresholds 1 and 2 feeding two clusters of cores (C0-C3 with shared L2), with idle cores in drowsy state; chart of active CPUs vs. power/performance across day, night, and burst periods]
• CPUs run software that drops into a polling loop when the DPAA is not sending them work
• The polling loop should include a wait instruction so the core drops into the drowsy state, as in the sketch below
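The polling/drowsy interplay above can be shown as a tiny idle loop. This is only an illustrative sketch, assuming bare-metal code on an e6500 thread with drowsy mode already enabled; dpaa_poll_work() is a hypothetical stand-in for a QMan portal dequeue, not a documented API.

    #include <stdbool.h>

    /* Hypothetical stand-in for a QMan portal dequeue; returns true when a
     * frame was dequeued and processed. A real implementation would use the
     * QMan portal API. */
    static bool dpaa_poll_work(void)
    {
        return false;
    }

    static inline void core_wait(void)
    {
        /* Power ISA "wait": stall this thread until an interrupt/doorbell
         * arrives, letting the core drop into drowsy state when configured. */
        __asm__ volatile("wait" ::: "memory");
    }

    void worker_idle_loop(void)
    {
        for (;;) {
            if (!dpaa_poll_work())
                core_wait();   /* no work: sleep until the QMan signals again */
        }
    }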
e6500 Core Intelligent Power Management
Cluster State   PCL00     PCL00        PCL00            PCL00             PCL00             PCL10
Core State      PH00      PH10/PW10    PH15             PW20              PH20              PH20
Cluster Clock   On        On           On               On                On                Off
Core Clock      On        On           Off              Off               Off               Off
L2 Cache        -         -            -                -                 -                 SW Flushed
L1 Cache        -         -            SW Invalidated   HW Invalidated    SW Invalidated    SW Invalidated
Wakeup Time     Active    Immediate    < 30 ns          < 200 ns          < 600 ns          < 1 us
Power           Full On   Full On      Full On          Full On           Full On           Nap
Legacy mode     Run       Doze         Nap              Global clk stop   Nap (pwr gated)   Core glb clk stop
[Diagram: e6500 cluster with per-core L1 caches, AltiVec, performance monitor/control, and a 2048 KB banked L2]
• Run, Doze, Nap, Wait
• AltiVec drowsy: auto and SW controlled, state maintained
• Core drowsy: auto and SW controlled, state maintained
• Dynamic clock gating
• Dynamic Frequency Scaling (DFS) of the cluster (cores and L2)
• Cluster (cores) drowsy
• SoC sleep with state retention
• SoC sleep with RST
• Cascade power management
• Energy Efficient Ethernet (EEE)
HiGig Interface Support
HiGig™/HiGig+/HiGig2 Interface Support
• The 10 Gigabit HiGig™/HiGig+™/HiGig2™ MAC interface interconnects standard Ethernet devices to switch HiGig ports.
• Networking customers can add features like quality of service (QoS), port trunking, mirroring across devices, and link aggregation at the MAC layer.
• The physical signaling across the interface is XAUI: four differential pairs for receive and transmit (SerDes), each operating at 3.125 Gbit/s. HiGig+ is a higher-rate version of HiGig.
Regular Ethernet frame (bytes 1-32): Preamble | MAC_DA | MAC_SA | Type | Packet Data | FCS
Ethernet frame with HiGig+ header (bytes 1-34): Preamble | HiGig+ Module Hdr | MAC_DA | MAC_SA | Type | Packet Data | FCS*
Ethernet frame with HiGig2 header (bytes 1-38): Preamble | HiGig2 Module Hdr | MAC_DA | MAC_SA | Type | Packet Data | FCS*
QorIQ T4240 Processor HiGig Interface
• T4240 FMan Supports HiGig/HiGig+/HiGig2 protocols
• In the T4240 processor, the 10G mEMACs can be configured as a HiGig interface. In this configuration, two of the 1G mEMACs are used as the HiGig message interface
SERDES Configuration for HiGig Interface
• Networking protocols (SerDes 1 and SerDes 2)
• HiGig notation: HiGig[2]m.n means HiGig[2] (4 lanes @ 3.125 or 3.75 Gbps)
− "m" indicates which Frame Manager (FM1 or FM2)
− "n" indicates which MAC on the Frame Manager
− E.g. "HiGig[2]1.10" indicates HiGig[2] using FM1's MAC 10
• When a SerDes protocol is selected with dual HiGigs in one SerDes, both HiGigs must be configured with the same protocol (for example, both with 12 byte headers or both with 16 byte headers)
HiGig/HiGig2 Control and Configuration
Name Description
LLM_MODE Toggle between HiGig2 Link Level Messages physical link, OR HiGig2 link level
messages logical link (SAFC)
LLM_IGNORE Ignore HiGig2 link level message quanta
LLM_FWD Terminate/forward received HiGig2 link level message
IMG[0:7] Inter Message Gap - spacing between HiGig2 messages
NOPRMP Toggle preemptive transmission of HiGig2 messages
MCRC_FWD Strip/forward HiGig2 message CRC of received messages
FER Discard/forward HiGig2 receive message with CRC error
FIMT Forward OR Discard message with illegal MSG_TYP
IGNIMG Ignore IMG on receive path
TCM TC (traffic classes) mapping
HiGig/HiGig2 Control and Configuration Register (HG_CONFIG), bits 1-32: LLM | LLI | LLF | IMG | NOPRMP | MCRC | FER | FIMT | IGNIM | TCM
Interlaken Interface
Interlaken Look-Aside Interface
• Use Case: T4240 processor as a data path processor, requiring millions of look-ups per second. Expected requirement in edge routers.
• Interlaken Look-Aside is a new high-speed serial standard for connecting TCAMs ("network search engines", "knowledge-based processors") to host CPUs and NPUs. It replaces the Quad Data Rate (QDR) SRAM interface.
• Like Interlaken streaming interfaces (channelized SerDes links, replacing SPI 4.2), Interlaken Look-Aside supports a configurable number of SerDes lanes (1-32, with single-lane granularity) with linearly increasing bandwidth. Freescale supports x4 and x8, up to 10 GHz.
• For lowest latency, each vCPU (thread) in T4240 processor will have a portal into the Interlaken Controller, allowing multiple search requests and results to be returned concurrently.
• Interlaken Look Aside expected to gain traction as interface to other low latency/minimal data exchange co-processors, such as Traffic Managers. PCIe and sRIO better for higher latency/high bandwidth applications.
• Lane Striping
[Diagram: T4240 connected to a TCAM over a four-lane 10 G Interlaken Look-Aside link]
T4240 (LAC) Features:
• Supports Interlaken Look-Aside Protocol definition, rev. 1.1
• Supports 24 partitioned software portals
• Supports in-band per-channel flow control options, with simple xon/xoff semantics
• Supports a wide range of SerDes speeds (6.25 and 10.3125 Gbps)
• Ability to disable the connection to individual SerDes lanes
• A continuous Meta Frame of programmable frequency to guarantee lane alignment, synchronize the scrambler, perform clock compensation, and indicate lane health
• 64B/67B data encoding and scrambling
• Programmable BURSTSHORT parameter of 8 or 16 bytes
• Error detection for illegal burst sizes, bad 64B/67B word types, and CRC-24 errors
• Error detection on Transmit command programming error
• Built-in statistics counters and error counters
• Dynamic power down of each software portal
Look-Aside Controller Block Diagram
Modes of Operation
• The T4240 LA controller can operate in either stashing or non-stashing mode.
• The LAC programming model is big-endian, meaning byte 0 is the most significant byte.
• In non-stashing mode, software has to issue a dcbf each time it reads SWPnRSR and the RDY bit is not set (see the sketch below).
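A minimal sketch of that non-stashing polling rule follows, assuming a memory-mapped SWPnRSR; the RSR_RDY mask and the dcbf helper are illustrative placeholders, not values taken from the reference manual.

    #include <stdint.h>

    #define RSR_RDY  0x1u   /* assumption: illustrative bit position only */

    static inline void dcbf(const volatile void *addr)
    {
        /* Power ISA data cache block flush, so the next load refetches
         * fresh status from the LAC rather than a stale cached copy. */
        __asm__ volatile("dcbf 0, %0" : : "r"(addr) : "memory");
    }

    uint32_t lac_wait_ready(volatile uint32_t *swp_rsr)
    {
        uint32_t rsr = *swp_rsr;
        while (!(rsr & RSR_RDY)) {
            dcbf(swp_rsr);   /* required in non-stashing mode before re-reading */
            rsr = *swp_rsr;
        }
        return rsr;
    }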
Interlaken LA Controller Configuration Registers
• 4 KB hypervisor space, 0x0000-0x0FFF
• 4 KB managing-core space, 0x1000-0x1FFF
• In compliance with the trust architecture, LSRER, LBARE, LBAR, and LLIODNRn are accessed exclusively in hypervisor mode and are reserved in managing-core mode
• Statistics, lane mapping, interrupt, rate, metaframe, burst, FIFO, calendar, debug, pattern, error, and capture registers
• LAC software portal memory, n = 0, 1, 2, ..., 23
• SWPnTCR/SWPnRCR: software portal n transmit/receive command register
• SWPnTER/SWPnRER: software portal n transmit/receive error register
• SWPnTDR0-3/SWPnRDR0-3: software portal n transmit/receive data registers 0-3
• SWPnRSR: software portal n receive status register
TCAM Usage in Routing Example
Interlaken Look-Aside TCAM Board
[Board diagram: Renesas Interlaken-LA 5 Mb TCAM with I2C EEPROM; x4 IL-LA link; 156.25 MHz REFCLK; 125 MHz SYSCLK; SMBus; reset/JTAG and 3.3V/12V config; power rails VDDC 0.85 V @ 6 A, VDDA 0.85 V @ 2 A (filtered), VCC_1.8V @ 2 A, VDDHA 1.80 V 0.5 A, VDDO 1.80 V 1.0 A, VPLL 1.80 V 0.25 A]
PCI Express® Gen 3 Interfaces
PCI Express® Gen 3 Interfaces
• Two PCIe Gen 3 controllers can be run at the same time with the same SerDes reference clock source
• PCIe Gen 3 bit rates are supported
− When running more than one PCIe controller at Gen 3 rates, the associated SerDes reference clocks must be driven by the same source on the board
16-lane SerDes PCIe configurations (across PCIe1/PCIe2/PCIe3/PCIe4):
− x4 Gen3 + x4 Gen2 + x8 Gen2
− x8 Gen2 + x8 Gen2
− x4 Gen2 + x4 Gen2 + x4 Gen3 + x4 Gen2
[Diagram: four PCIe controllers attached to the OCN: two x4 Gen2/3 RC/EP controllers and two controllers configurable as x8 Gen2 or x4 Gen3 RC/EP, one of which supports SR-IOV EP with 2 PFs/64 VFs and 8x MSI-X per VF/PF; 16 lanes total]
Single Root I/O Virtualization (SR-IOV) End Point
• With SR-IOV supported in the EP, different devices or different software tasks can share I/O resources, such as Gigabit Ethernet controllers.
− T4240 supports the SR-IOV 1.1 spec with 2 PFs and 64 VFs per PF
− SR-IOV supports native IOV in existing single-root-complex PCI Express topologies
− Address translation services (ATS) support native IOV across PCI Express via address translation
− A single management physical or virtual machine on the host handles endpoint configuration
• E.g. the T4240 processor as a Converged Network Adapter: each virtual machine running on the host thinks it has a private version of the services card
[Diagram: host running VM 1..VM N, connected through a translation agent to the T4240 as an SR-IOV endpoint; the depicted configuration uses a single controller (up to x4 Gen 3) with 1 PF and 64 VFs]
PCI Express Configuration Address Register
• The PCI Express configuration address register contains address
information for accesses to PCI Express internal and external
configuration registers for End Point (EP) with SR-IOV
Bit fields (1-32): EN | Type | EXTREGN | VFN | PFN | REGN
PCI Express Address Offset Register
Name Description
EN Enable. Allows a PCI Express configuration access when PEX_CONFIG_DATA is accessed
TYPE 01, Configuration Register Accesses to PF registers for EP with SR-IOV
11, Configuration Register Accesses to VF registers for EP with SR-IOV
EXTREGN Extended register number. This field allows access to extended PCI Express configuration
space
VFN Virtual Function number minus 1. 64-255 is reserved.
PFN Physical Function number minus 1. 2-15 is reserved.
REGN Register number. 32-bit register to access within specified device
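As an illustration of how software might compose this register, the sketch below simply ORs the fields together; every shift and width shown is an assumption chosen for illustration only, not the documented bit layout.

    #include <stdint.h>

    #define CFG_EN       (1u << 31)   /* assumption: enable bit position */
    #define CFG_TYPE_PF  0x1u         /* 01 = accesses to PF registers */
    #define CFG_TYPE_VF  0x3u         /* 11 = accesses to VF registers */

    /* Field shifts below are illustrative placeholders. VFN and PFN are the
     * function numbers minus 1, as described in the table above. */
    static uint32_t pex_cfg_addr(uint32_t type, uint32_t extregn,
                                 uint32_t vfn, uint32_t pfn, uint32_t regn)
    {
        return CFG_EN |
               (type    << 29) |
               (extregn << 20) |
               (vfn     << 12) |
               (pfn     <<  8) |
               regn;              /* register number within the function */
    }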
Message Signaled Interrupts (MSI-X) Support
• MSI-X allows for EP device to send message interrupts to RC device independently for different Physical or Virtual functions as supported by EP SR-IOV.
• Each PF or VF will have eight MSI-X vectors allocated with a total of 256 total MSI-X vectors supported
− Supports MSI-X for PF/VF with 8 MSI-X vector per PF or VF
− Supports MSI-X trap operation
− To access an MSI-X PBA structure, the PF, VF, IDX, and EIDX fields are concatenated to form the 4-byte-aligned address of the register within the MSI-X PBA structure. That is, the register address is:
PF || VF || IDX || EIDX || 0b00
Bit fields (1-32): Type | PF | VF | IDX | EIDX | M
PCI Express Address Offset Register
Name Description
TYPE Access to PF or VF MSI-X vector table for EP with SR-IOV.
PF Physical Function
VF Virtual Function
IDX MSI-X Entry Index in each VF.
EIDX Extended index. Selects which 4-byte entity within the MSI-X PBA structure to access.
M Mode=11
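The concatenation rule above can be written out as a small helper; the field widths used here are illustrative assumptions only (they are not the documented sizes).

    #include <stdint.h>

    /* PF || VF || IDX || EIDX || 0b00, with assumed widths:
     * VF 6 bits (64 VFs), IDX 3 bits (8 vectors), EIDX 3 bits. */
    static uint32_t msix_pba_offset(uint32_t pf, uint32_t vf,
                                    uint32_t idx, uint32_t eidx)
    {
        return (((((pf << 6) | vf) << 3 | idx) << 3 | eidx) << 2);  /* trailing 0b00 */
    }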
Serial RapidIO® Manager (RMAN)
RapidIO Message Manager (RMan)
• RMan supports both inline switching and look-aside forwarding operation.
[Diagram: RMan datapath: inbound RapidIO traffic passes through classification units (inbound rule matching), reassembly contexts and reassembly units, and work queues (WQ0-WQ7) on hardware channels into QMan, toward the cores, the Frame Manager, SEC, and PME; outbound traffic is dequeued from QMan through disassembly contexts and segmentation units onto the RapidIO link. RapidIO PDU format: Ftype | Target ID | Src ID | Address | Packet Data Unit | CRC]
RMan: Greater Performance and Functionality
• Many queues allow multiple inbound/outbound queues per core
− Hardware queue management via QorIQ Data Path Architecture (DPAA)
• Supports all messaging-style transaction types
− Type 11 Messaging
− Type 10 Doorbells
− Type 9 Data Streaming
• Enables low overhead direct core-to-core communication
[Diagram: two QorIQ or DSP devices, each with four cores, linked by 10G sRIO; Type 9 user PDUs provide a channelized CPU-to-CPU transport, and MSG user PDUs provide a device-to-device transport]
Data Path Acceleration
Architecture (DPAA)
Data Path Acceleration Architecture (DPAA) Philosophy
• DPAA is designed to balance the performance of multiple CPUs and accelerators with seamless integration
− ANY packet to ANY core to ANY accelerator or network interface, efficiently, WITHOUT locks or semaphores
• “Infrastructure” components
− Queue Manager (QMan)
− Buffer Manager (BMan)
• “Accelerator” Components
− Cores
− Frame Manager (FMan)
− RapidIO Message Manager (RMan)
− Cryptographic accelerator (SEC)
− Pattern matching engine (PME)
− Decompression/Compression Engine (DCE)
− DCB (Data Center Bridging)
− RAID Engine (RE)
• CoreNet
− Provides the interconnect between the cores and the DPAA infrastructure as well as access to memory
[Diagram: DPAA in the P series (e500mc cores, SEC 4.x, PME 2, RMan, RE) versus the T series (e6500 cores, adding DCE and DCB); in both, the Queue Manager, Buffer Manager, and Frame Managers (parse/classify/distribute, 1G and 1/10G ports) sit on the CoreNet coherency fabric with the cores]
DPAA Building Block: Frame Descriptor (FD)
• Simple frame (Format = 000): the FD (DD, LIODN offset, BPID, ELIODN offset, address, offset, length, status/cmd) points directly at a single data buffer.
• Multi-buffer frame (scatter/gather, Format = 100): the FD points at an S/G list; each S/G entry carries an address, length, BPID, and offset and references one of the data buffers that make up the packet.
FD layout (four 32-bit words): DD | LIODN offset | BPID | ELIODN offset | addr (40-bit, spanning words 0-1) | Fmt | Offset | Length | STATUS/CMD
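For reference, a minimal C view of the 16-byte FD is sketched below, assuming the usual DPAA field widths (2-bit DD, 6-bit LIODN offset, 8-bit BPID, 4-bit ELIODN offset, 40-bit address, 3-bit format, 9-bit offset, 20-bit length, 32-bit status/command). The bitfield packing is illustrative; real drivers typically build the words with explicit shifts to stay endian-safe.

    #include <stdint.h>

    struct dpaa_fd {
        /* words 0-1: ownership / isolation / buffer pool / address */
        uint64_t dd:2;          /* debug/discard control */
        uint64_t liodn_off:6;   /* LIODN offset */
        uint64_t bpid:8;        /* buffer pool ID */
        uint64_t eliodn_off:4;  /* extended LIODN offset */
        uint64_t rsvd:4;
        uint64_t addr:40;       /* buffer (or S/G table) address */
        /* words 2-3: format / offset / length / status-command */
        uint32_t fmt:3;         /* 000 = simple frame, 100 = scatter/gather */
        uint32_t offset:9;      /* start of data within the buffer */
        uint32_t length:20;     /* frame length in bytes */
        uint32_t status_cmd;    /* command on enqueue, status on dequeue */
    };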
Frame Descriptor Status/Command Word (FMAN Status)
Bit fields (1-32): DCL4C | DME | MS | FPE | FSE | DIS | EOF | NSS | KSO | FCL | IPP | FLM | PTE | ISP | PHE | FRDR | BLE | L4CV (reserved bits omitted)
Name Description
DCL4C L4 (IP/TCP/UDP) Checksum validation Enable/Disable
DME DMA error
MS MACSEC Frame. This bit is valid on P1023
FPE Frame Physical Error
FSE Frame Size Error
DIS Discard. This bit is set only for frames that are supposed to be discarded, but are
enqueued in an error queue for debug purposes.
EOF Extract Out of Frame Error
NSS No Scheme Selection for KeyGen
KSO Key Size Overflow Error
FCL Frame color as determined by the Policer. 00=green, 01=yellow, 10=red, 11=no reject
IPP Illegal Policer Profile error
FLM Frame Length Mismatch
PTE Parser Time-out
ISP Invalid Soft Parser instruction Error
PHE Header Error
FRDR Frame Drop
BLE Block limit is exceeded
L4CV L4 Checksum Validation
DPAA: mEMAC Controller
Multirate Ethernet MAC (mEMAC) Controller
[Block diagram: mEMAC: Rx/Tx interfaces and reconciliation, Rx/Tx FIFOs, IEEE 1588 time stamping, Rx/Tx control, flow control, configuration control and statistics, MDIO master for PHY management, and the Frame Manager interface]
• The multirate Ethernet MAC (mEMAC) controller supports 100 Mbps/1G/2.5G/10G operation:
− Supports HiGig/HiGig+/HiGig2 protocols
− Dynamic configuration for NIC (Network Interface Card) applications or Switching/Bridging applications to support 10Gbps or below.
− Designed to comply with IEEE Std 802.3®, IEEE 802.3u, IEEE 802.3x, IEEE 802.3z, IEEE 802.3ac, IEEE 802.3ab, IEEE 1588 v2 (clock synchronization over Ethernet), IEEE 802.3az, and IEEE 802.1Qbb.
− RMON statistics
− CRC-32 generation and append on transmit or forwarding of user application provided FCS selectable on a per-frame basis.
− 8 MAC address comparison on receive and one MAC address overwrite on transmit for NIC applications.
− Selectable promiscuous frame receive mode and transparent MAC address forwarding on transmit
− Multicast address filtering with 64-bin hash code lookup table on receive reducing processing load on higher layers
− Support for VLAN tagged frames and double VLAN Tags (Stacked VLANs)
− Dynamic inter packet gap (IPG) calculation for WAN applications
[Diagram: the separate 10G MAC and dTSEC of the QorIQ P series are replaced by the mEMAC in the QorIQ T4240]
DPAA: FMAN
FMAN
[Diagram: FMan with parse/classify/distribute, muRAM, two 1/10G ports, and six 1G ports]
FMan Enhancements
• Storage Profile selection (up to 32 profiles per port) based on classification
− Up to four buffer pools per Storage Profile
• Customer Edge Egress Traffic Management (Egress Shaping)
• Data Center Bridging
− PFC and ETS
• IEEE802.3az (Energy Efficient Ethernet)
• IEEE802.3bf (Time sync)
• IP Frag & Re-assembly Offload
• HiGig, HiGig2
• TX confirmation/error queue enhancements
− Ability to configure separate FQIDs for normal confirmations vs. errors
− Separate FD status for Overflow and physical error
• Option to disable S/G on ingress
Offline Ports
FMan Port Types
• Ethernet receive (Rx) and transmit (Tx)
− 1 Gbps / 2.5 Gbps / 10 Gbps
− In FMan_v3, some ports can be configured as HiGig
− Jumbo frames of up to 9.6 KB (add the u-boot bootarg "fsl_fm_max_frm=9600")
• Offline (O/H)
− FMan_v3: 3.75 Mpps (vs. 1.5 Mpps in the P series)
− Supports the parse/classify/distribute (PCD) function on frame descriptors (FDs) dequeued from the QMan
− Supports copying or moving a frame from one storage profile to another
− Able to dequeue from and enqueue to a QMan queue: the FMan applies a PCD flow and (if configured to do so) enqueues the frame back to a QMan queue; in FMan_v3 the FMan can copy the frame into new buffers and enqueue it back to the QMan
− Use case: IP fragmentation and reassembly
• Host command
− Able to dequeue host commands from a QMan queue; the FMan executes the host command (such as a table update) and enqueues a response to the QMan. Host commands require a dedicated PortID (one of the O/H ports)
− The registers for offline and host commands are named O/H port registers
IP Reassembly T4240 Processor Flow
[Flow diagram: the BMI allocates a buffer and writes the frame and IC; the parser parses the frame and identifies fragments; KeyGen calculates a hash; the FMan controller performs coarse classification. Non-fragments are enqueued directly. For fragments, buffer allocation is done according to the fragment header only, and the FMan controller starts reassembly and links each fragment to the right reassembly context; incomplete reassemblies terminate at the BMI, while completed reassemblies go back through KeyGen and classification before the reassembled frame is enqueued and the BMI writes the IC. Regular frame: the storage profile is chosen according to frame header classification. Reassembled frame: the storage profile is chosen according to MAC and IP header classification only.]
IP Reassembly FMAN Memory Usage
• FMAN Memory: 386 KBytes
• Assumption: MTU = 1500 Bytes
• Port FMAN Memory consumption:
− Each 10G Port = 40 Kbytes
− Each 1G Port = 25 Kbytes
− Each Offline Port = 10 Kbytes
• Coarse Classification tables memory consumption:
− 100 Kbytes for all ports
• IP Reassembly:
− IP Reassembly overhead: 8 Kbytes
− Each flow: 10 Bytes
• Example:
− Usecase with: 2x10G ports + 2x1G port + 1xOffline Ports.
− Port configuration: 2x40 + 2x25 + 10 = 140 Kbytes
− Coarse Classification : 100 Kbytes
− IP reassembly 10K flows: 10K x 10B + 8KB = 108 Kbytes
− Total = 140KB + 108KB + 100KB = 348 KBytes
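The budget above is simple enough to fold into a small helper; this sketch just reproduces the slide's arithmetic so a different port mix can be checked quickly.

    #include <stdio.h>

    #define PORT_10G_KB        40
    #define PORT_1G_KB         25
    #define PORT_OFFLINE_KB    10
    #define COARSE_CLASS_KB   100
    #define IP_REASM_OVHD_KB    8
    #define BYTES_PER_FLOW     10

    static unsigned fman_mem_kb(unsigned n10g, unsigned n1g, unsigned noff,
                                unsigned flows)
    {
        unsigned ports = n10g * PORT_10G_KB + n1g * PORT_1G_KB
                       + noff * PORT_OFFLINE_KB;
        /* round the per-flow reassembly state up to whole KB */
        unsigned reasm = IP_REASM_OVHD_KB
                       + (flows * BYTES_PER_FLOW + 1023) / 1024;
        return ports + COARSE_CLASS_KB + reasm;
    }

    int main(void)
    {
        /* 2x10G + 2x1G + 1 offline port, 10K flows -> 348 KB as in the example */
        printf("%u KB\n", fman_mem_kb(2, 2, 1, 10 * 1024));
        return 0;
    }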
Storage Profile
Virtual Storage Profiling For Rx and Offline Ports
• Storage profiles enable each partition and virtual interface to have dedicated buffer pools.
• Storage profile selection happens after distribution function evaluation or after the custom classifier.
• The same Storage Profile ID (SPID) value from classification on different physical ports may yield different storage profile selections.
• Up to 64 storage profiles per port are supported.
− 32 storage profiles for FMan_v3L
• Storage profile contains
− LIODN offset
− Up to four buffer pools per Storage Profile
− Buffer Start margin/End margin configuration
− S/G disable
− Flow control configuration
Data Center Bridging
Policing and Shaping
• Policing puts a cap on network usage and guarantees bandwidth
• Shaping smooths out the egress traffic
− May require extra memory to store the shaped traffic
• DCB can be used in:
− Between data center network nodes
− LAN/network traffic
− Storage Area Network (SAN)
− IPC traffic (e.g. Infiniband (low latency))
Support Priority-based Flow Control (802.1Qbb)
• Enables lossless behavior for each class of service
• PAUSE sent per virtual lane when buffers limit exceeded
− FQ congestion groups state (on/off) from QMan
Priority vector (8 bits) is assigned to each FQ congestion group
FQ congestion group(s) are assigned to each port
Upon receipt of a congestion group state “on” message, for each Rx port associated with this congestion group, a PFC Pause frame is transmitted with priority level(s) configured for that group
− Buffer pool depletion
Priority level configured on per port (shared by all buffer pools used on that port)
− Near FMan Rx FIFO full
There is a single Rx FIFO per port for all priorities, the PFC Pause frame is sent on all priorities
• PFC Pause frame reception
− QMan provides the ability to flow control 8 different traffic classes; in CEETM each of the 16 class queues within a class queue channel can be mapped to one of the 8 traffic classes & this mapping applies to all channels assigned to the link
[Diagram: eight transmit queues (priorities zero through seven) mapped to virtual lanes on the Ethernet link and to per-priority receive buffers; a PFC PAUSE on one priority stops only that virtual lane]
Support Bandwidth Management 802.1Qaz
[Chart: offered traffic vs. realized 10 GE traffic utilization at times t1, t2, and t3 for HPC, storage, and LAN traffic classes, showing unused bandwidth from one class being reallocated to the others]
• Supports 32 channels available for allocation across a single FMan
− e.g. for two 10G links, 16 channels (virtual links) could be allocated per link
− Supports weighted bandwidth fairness amongst channels
− Shaping is supported on a per-channel basis
• Hierarchical port scheduling defines the class-of-service (CoS) properties of output queues, mapped to IEEE 802.1p priorities
• QMan CEETM enables Enhanced Transmission Selection (ETS, 802.1Qaz) with intelligent sharing of bandwidth between traffic classes
− Strict priority scheduling of the 8 independent classes; weighted bandwidth fairness within the 8 grouped classes
− The priority of the class group can be independently configured to be immediately below any of the independent classes
• Meets the performance requirement for ETS: bandwidth granularity of 1% and +/-10% accuracy
QMAN CEETM
CEETM Scheduling Hierarchy (QMAN 1.2)
• Logic (color coding in the diagram)
− Green denotes logic units and signal paths that relate to the request and fulfillment of Committed Rate (CR) packet transmission opportunities
− Yellow denotes the same for Excess Rate (ER)
− Black denotes logic units and signal paths that are used for unshaped opportunities or that operate consistently whether used for CR or ER opportunities
• Scheduler
− Channel scheduler: channels are selected to send frames from class queues
− Class scheduler: frames are selected from class queues; class 0 has the highest priority
• Algorithm
− Strict Priority (SP)
− Weighted Scheduling
− Shaped Aware Fair Scheduling (SAFS)
− Weighted Bandwidth Fair Scheduling (WBFS)
[Diagram: CEETM scheduling hierarchy: 16 class queues (CQ0-CQ15) per channel feed a class scheduler that applies strict priority to the independent classes and WBFS to the grouped classes (e.g. Ch6 unshaped with 8 independent + 8 grouped classes, Ch7 shaped with 3 independent + 7 grouped, Ch8 shaped with 2 independent + 8 grouped); a channel scheduler per LNI combines channels toward the network interface using shape-aware fair scheduling and weighted scheduling, with token bucket shapers for committed rate and excess rate]
Weighted Bandwidth Fair Scheduling (WBFS)
• Weighted Bandwidth Fair Scheduling (WBFS) is used to schedule packets from queues within a priority group such that each gets a “fair” amount of bandwidth made available to that priority group
• The premise of fairness for the algorithm is (a sketch of the redistribution follows the table below):
− available bandwidth is divided and offered equally to all classes
− bandwidth offered in excess of a class's demand is re-offered equally to classes with unmet demand
Round                               Initial distribution   First redistribution   Second redistribution
BW available                        10G                    1.5G                   0.2G (0G remaining after)
Number of classes with unmet demand 5                      3                      2
Bandwidth offered to each class     2G                     0.5G                   0.1G

          Demand   Offered & retained   Unmet demand   Offered & retained   Unmet demand   Offered & retained   Total BW attained
Class 0   0.5G     0.5G                 0              -                    -              -                    0.5G
Class 1   2G       2G                   0              -                    -              -                    2G
Class 2   2.3G     2G                   0.3G           0.3G                 0              -                    2.3G
Class 3   3G       2G                   1G             0.5G                 0.5G           0.1G                 2.6G
Class 4   4G       2G                   2G             0.5G                 1.5G           0.1G                 2.6G
Total     11.8G    8.5G                 -              1.3G                 -              0.2G                 10G
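A sketch of that redistribution premise, using the demands from the table above; this is an illustration of the fairness rule, not the QMan hardware algorithm.

    #include <stdio.h>

    #define NCLASSES 5

    /* Offer the available bandwidth equally to classes with unmet demand,
     * let each keep at most its remaining demand, and re-offer the leftover
     * until nothing remains or all demand is met. */
    static void wbfs_redistribute(const double demand[NCLASSES], double available,
                                  double granted[NCLASSES])
    {
        for (int i = 0; i < NCLASSES; i++)
            granted[i] = 0.0;

        while (available > 1e-9) {
            int unmet = 0;
            for (int i = 0; i < NCLASSES; i++)
                if (demand[i] - granted[i] > 1e-9)
                    unmet++;
            if (unmet == 0)
                break;
            double share = available / unmet;   /* equal offer per round */
            for (int i = 0; i < NCLASSES; i++) {
                double need = demand[i] - granted[i];
                double take = need < share ? need : share;
                if (take > 0.0) {
                    granted[i] += take;
                    available  -= take;
                }
            }
        }
    }

    int main(void)
    {
        double demand[NCLASSES] = { 0.5, 2.0, 2.3, 3.0, 4.0 };  /* Gbps, from the table */
        double granted[NCLASSES];
        wbfs_redistribute(demand, 10.0, granted);
        for (int i = 0; i < NCLASSES; i++)
            printf("class %d: %.1fG\n", i, granted[i]);  /* 0.5 2.0 2.3 2.6 2.6 */
        return 0;
    }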
DPAA: SEC Engine
Security Engine
• Black Keys − In addition to protecting against external bus snooping, Black Keys cryptographically
protect against key snooping between security domains
• Blobs − Blobs protect data confidentiality and integrity across power cycles, but do not protect
against unauthorized decapsulation or substitution of another user’s blobs
− In addition to protecting data confidentiality and integrity across power cycles, Blobs cryptographically protect against blob snooping/substitution between security domains
• Trusted Descriptors − Trusted Descriptors protect descriptor integrity, but do not distinguish between
Trusted Descriptors created by different users
− In addition to protecting Trusted Descriptor integrity, Trusted Descriptors now cryptographically distinguish between Trusted Descriptors created in different security domains
• DECO Request Source Register − Register added
QorIQ T4240 Processor SEC 5.0 Features
• Header and trailer off-load for the following security protocols:
− IPSec, SSL/TLS, 3G RLC, PDCP, SRTP, 802.11i, 802.16e, 802.1ae
(3) Public Key Hardware Accelerator (PKHA)
− RSA and Diffie-Hellman (to 4096b)
− Elliptic curve cryptography (1024b)
− Supports Run Time Equalization
(1) Random Number Generators (RNG4)
− NIST Certified
(4) Snow 3G Hardware Accelerators (STHA)
− Implements Snow 3.0
− Two for Encryption (F8), two for Integrity (F9)
(4) ZUC Hardware Accelerators (ZHA)
− Two for Encryption, two for Integrity
(2) ARC Four Hardware Accelerators (AFHA)
− Compatible with RC4 algorithm
(8) Kasumi F8/F9 Hardware Accelerators (KFHA)
− F8 , F9 as required for 3GPP
− A5/3 for GSM and EDGE
− GEA-3 for GPRS
(8) Message Digest Hardware Accelerators (MDHA)
− SHA-1, SHA-2 256,384,512-bit digests
− MD5 128-bit digest
− HMAC with all algorithms
(8) Advanced Encryption Standard Accelerators (AESA)
− Key lengths of 128-, 192-, and 256-bit
− ECB, CBC, CTR, CCM, GCM, CMAC, OFB, CFB, and XTS
(8) Data Encryption Standard Accelerators (DESA)
− DES, 3DES (2K, 3K)
− ECB, CBC, OFB modes
(8) CRC Unit
− CRC32, CRC32C, 802.16e OFDMA CRC
[Block diagram: SEC 5.0: the queue interface and job ring interface feed the job queue controller and descriptor controllers (DECOs), with DMA and RTIC; behind arbiters sit the crypto hardware accelerators (CHAs): DESA, AESA, MDHA, CRCA, KFHA, AFHA, PKHA, STHA (f8/f9), ZUC encryption/integrity, and RNG4]
Life of a Job Descriptor
• QI has room for more work, issues dequeue request for 1 or 3 FDs
• Qman selects FQ and provides 1 FD along with FQ Information
• QI creates [internal] Job Descriptor and if necessary, obtains output buffers
• QI transfers completed Job Descriptor into one of the Holding Tanks
• Job Queue Controller finds an available DECO, transfers JD1 to it
• DECO initiates DMA of Shared Descriptor from system memory, places it in Descriptor Buffer with JD from Holding Tank
• DECO executes descriptor commands, loading registers and FIFOs in its CCB
• CCB obtains and controls CHA(s) to process the data per DECO commands
• DECO commands DMA to store results and any updated context to system memory
• As input buffers are being emptied, DECO tells QI, which may release them back to BMan
• Upon completion of all processing through CCB, DECO resets CCB
• DECO informs QI that JD1 has completed with status code X, data of length Y has been written to address Z
• QI creates outbound FD, enqueues to Qman using FQID from Ctx B field
[Diagram: SEC job flow: queue interface job prep logic and holding tank pool, job queue controller with job rings JR0-JR3 and per-source status/FQID tracking, DECO pool (DECO 0-7, each with a descriptor buffer and CCB 0-7), and DMA to DDR/CoreNet for shared descriptors and frames, with Buffer Manager and Queue Manager interfaces]
DPAA: DCE
DPAA Interaction: Frame Descriptor Status/CMD
• The Status/Command word in the dequeued FD allows software to modify the processing of individual frames while retaining the performance advantages of enqueuing to a FQ for flow based processing
• The three most significant bits of the Command /Status field of the Frame Descriptor have the following meaning:
FD layout (four 32-bit words): DD | LIODN offset | BPID | ELIODN offset | addr | addr (cont) | Format | Offset | Length | Status/CMD
CMD Token: Pass through data that is echoed with the returned Frame.
Command encoding (3 MSBs):
000  Process command
001  Reserved
010  Reserved
011  Reserved
100  Context invalidate command token
101  Reserved
110  Reserved
111  NOP command token
[Register layout: DCE Frame Descriptor Status/Command word: a 3-bit CMD field followed by per-frame control flags; on the returned (output) frame the same word carries the completion status]
DCE Inputs
• SW enqueues work to DCE via Frame Queues. FQs define the flow for stateful processing
• FQ initialization creates a location for the DCE to use when storing flow stream context
• Each work item within the flow is defined by a Frame Descriptor, which includes length, pointer, offsets, and commands
• DCE has separate channels for compress and decompress
[Diagram: software enqueues FDs (address, offset, length, status/cmd, BPID) on compression and decompression command frame queues; each FQ's Context_A points to the flow stream context; frames flow through work queues (WQ0-WQ7) and a hardware channel into the DCE's DCP portal, with separate channels for compress and decompress]
DCE Outputs
• DCE enqueues results to SW via Frame Queues as defined by FQ Context_B field. When buffers obtained from Bman, buffer pool ID defined by Input FQ
• Each result is defined by a Frame Descriptor, which includes a Status field
• DCE updates flow stream context located at Context_A as needed
[Diagram: the DCE enqueues result FDs (including a status field) through its DCP portal onto the output frame queues defined by the input FQ's Context_B; data buffers may come from BMan pools, and the flow stream context at Context_A is updated as needed]
PME
Frame Descriptor: STATUS/CMD Treatment
• PME Frame Descriptor Commands
− b111 NOP NOP Command
− b101 FCR Flow Context Read Command
− b100 FCW Flow Context Write Command
− b001 PMTCC Table Configuration Command
− b000 SCAN Scan Command
FD layout (four 32-bit words): DD | LIODN offset | BPID | ELIODN offset | addr | addr (cont) | Format | Offset | Length | Status/CMD
Scan command (b000) Status/CMD fields: SRVM | F | S/R | E | SET | Subset
Life of a Packet inside Pattern Matching Engine
• Combined hash/NFA technology
• 9.6 Gbps raw performance
• Max 32K patterns of up to 128 B length
• Patterns
− Patt1 /free/ tag=0x0001
− Patt2 /freescale/ tag=0x0002
• KES
− Compares the hash value of incoming data (frames) against all patterns
• DXE
− Retrieves the pattern with a matched hash value for a final comparison
• SRE
− Optionally post-processes the match result before sending the report to the CPU
[Diagram: PME: the pattern matcher frame agent (PMFA) pulls frames from QMan/BMan over the on-chip system bus (CoreNet); the key element scanning engine (KES) checks incoming data against hash tables, the data examination engine (DXE) fetches candidate pattern descriptors from DDR memory for a final comparison, and the stateful rule engine (SRE) produces user-definable reports. Example: flow A (192.168.1.1:80 -> 10.10.10.100:16734) carries "I want to search free" in FD1 and "scale FTF 2014 event schedule" in FD2, matching Patt1 /free/ (tag 0x0001)]
Debug
Core Debug in Multi-Thread Environment
• Almost all resources are private. Internal debug works as if the threads were separate cores
• External debug is private per thread. An option exists to halt both threads when one thread halts
− While threads can be debug-halted individually, that is generally not very useful if the debug session cares about the contents of the MMU and caches
− Halting both threads prevents the other thread from continuing to compute and essentially cleaning the L1 caches and the MMU of the state of the thread that initiated the debug halt
DPAA Debug trace
• During packet processing, the FMan can trace the packet processing flow through each of the FMan modules and trap a packet.
Summary
QorIQ T4 Series Advanced Features Summary

Feature: High perf/watt
Benefit:
• 188k CoreMark in 55W = 3.4 CM/W
• Compare to Intel E5-2650: 146k CM in 95W = 1.5 CM/W
• Or Intel E5-2687W: 200k CM in 150W = 1.3 CM/W
• T4 is more than 2x better than E5
• 2x perf/watt compared to P4080, FSL's previous flagship

Feature: Highly integrated SoC
Benefit: Integration of 4x 10GE interfaces, local bus, Interlaken, and sRIO means fewer chips (it takes at least four chips with Intel) and higher performance density

Feature: Sophisticated PCIe capability
Benefit:
• SR-IOV for showing VMs a virtual NIC, 128 VFs (Virtual Functions)
• Four ports with the ability to be root complex or endpoint for flexible configurations

Feature: Advanced Ethernet
Benefit:
• Data Center Bridging for lossless Ethernet and QoS
• 10GBase-KR for backplane connections

Feature: Secure Boot
Benefit: Prevents code theft, system hacking, and reverse engineering

Feature: AltiVec
Benefit: On-board SIMD engine for sonar/radar and imaging

Feature: Power Management
Benefit:
• Thread, core, and cluster deep sleep modes
• Automatic deep sleep of unused resources

Feature: Advanced virtualization
Benefit:
• Hypervisor privilege level enables a safe guest OS at high performance
• IOMMU ensures memory accesses are restricted to the correct area
• Virtualization of I/O blocks

Feature: Hardware offload
Benefit:
• Packet handling to 50 Gb/s
• Security engine to 40 Gb/s
• Data compression and decompression to 20 Gb/s
• Pattern matching to 10 Gb/s

Feature: 3x Scalability
Benefit:
• 1-, 2-, and 3-cluster solutions span a 3x performance range over T4080 - T4240
• Enables customers to develop multiple SKUs from one PCB
Other Sessions And Useful Information
• FTF2014 Sessions for QorIQ T4 Devices
− FTF-NET-F0070_QorIQ Platforms Trust Arch Overview
− FTF-NET-F0139_AltiVec_Programming
− FTF-NET-F0146_Introduction_to_DPAA
− FTF-NET-F0147-DPAAusage
− FTF-NET-F0148_DPAA_Debug
− FTF-NET-F0157_QorIQ Platforms Trust Arch Demo & Deep Dive
• T4240 Product Website
− http://www.freescale.com/webapp/sps/site/prod_summary.jsp?code=T4240
• Online Training
− http://www.freescale.com/webapp/sps/site/prod_summary.jsp?code=T4240&tab=Design_Support_Tab
Introducing The
QorIQ LS2 Family
Breakthrough,
software-defined
approach to advance
the world’s new
virtualized networks
• New, high-performance architecture built with ease-of-use in mind: a groundbreaking, flexible architecture that abstracts hardware complexity and enables customers to focus their resources on innovation at the application level
• Optimized for software-defined networking applications: balanced integration of CPU performance with network I/O and C-programmable datapath acceleration that is right-sized (power/performance/cost) to deliver advanced SoC technology for the SDN era
• Extending the industry's broadest portfolio of 64-bit multicore SoCs: built on the ARM® Cortex®-A57 architecture with an integrated L2 switch enabling interconnect and peripherals to provide a complete system-on-chip solution
QorIQ LS2 Family Key Features
Unprecedented performance and
ease of use for smarter, more
capable networks
High performance cores with leading
interconnect and memory bandwidth
• 8x ARM Cortex-A57 cores, 2.0 GHz, 4 MB L2 cache, with Neon SIMD
• 1MB L3 platform cache w/ECC
• 2x 64b DDR4 up to 2.4GT/s
A high performance datapath designed
with software developers in mind
• New datapath hardware and abstracted
acceleration that is called via standard Linux
objects
• 40 Gbps Packet processing performance with
20Gbps acceleration (crypto, Pattern
Match/RegEx, Data Compression)
• Management complex provides all
init/setup/teardown tasks
Leading network I/O integration
• 8x1/10GbE + 8x1G, MACSec on up to 4x 1/10GbE
• Integrated L2 switching capability for cost savings
• 4 PCIe Gen3 controllers, 1 with SR-IOV support
• 2 x SATA 3.0, 2 x USB 3.0 with PHY
Applications: SDN/NFV, switching, data center, wireless access
See the LS2 Family First in the Tech Lab!
4 new demos built on QorIQ LS2 processors:
Performance Analysis Made Easy
Leave the Packet Processing To Us
Combining Ease of Use with Performance
Tools for Every Step of Your Design
© 2014 Freescale Semiconductor, Inc. | External Use
www.Freescale.com
QorIQ T4240 SerDes Options (total of four x8 banks)
High speed serial
• 2.5 , 5, 8 GHz for PCIe
• 2.5, 3.125, and 5 GHz for sRIO
• 3.125, 6.25, and 10.3125 GHz for
Interlaken
• 1.5, 3.0 GHz for SATA
• 1.25, 2.5, 3.125, and 5 GHz for
debug
Ethernet options:
• 10Gbps Ethernet MACs with XAUI
or XFI
• 1Gbps Ethernet MACs with SGMII
(1 lane at 1.25 GHz with 3.125
GHz option for 2.5Gbps Ethernet)
• 2 MACs can be used with
RGMII
• 4 x1Gbps Ethernet MACs can be
supported using a single lane at 5
GHz (QSGMII)
• HiGig is supported with 4 lanes at 3.125 GHz or 3.75 GHz (HiGig+)
Decompression Compression Engine
• Zlib: As specified in RFC1950
• Deflate: As specified in RFC1951
• GZIP: As specified in RFC1952
• Encoding
− supports Base 64 encoding and decoding (RFC4648)
• ZLIB, GZIP and DEFLATE header insertion
• ZLIB and GZIP CRC computation and insertion
• 4 modes of compression
− No compression (just add DEFLATE header)
− Encode only using static/dynamic Huffman codes
− Compress and encode using static OR dynamic Huffman codes
− at least 2.5:1 compression ratio on the Calgary Corpus
• All standard modes of decompression
− No compression
− Static Huffman codes
− Dynamic Huffman codes
• Provides option to return original compressed Frame along with the uncompressed Frame or release the buffers to BMAN
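For host-side interoperability testing it can help to see how the three framings differ. The sketch below uses the standard zlib library (not the DCE): the windowBits argument alone selects RFC1950 zlib, RFC1951 raw deflate, or RFC1952 gzip output.

    #include <zlib.h>
    #include <string.h>

    /* Compress one buffer with the requested framing:
     * window_bits = 15 -> zlib (RFC1950), -15 -> raw deflate (RFC1951),
     * 15 + 16 -> gzip (RFC1952). Returns the output length or -1 on error. */
    static int compress_with_framing(int window_bits,
                                     const unsigned char *in, unsigned in_len,
                                     unsigned char *out, unsigned out_len)
    {
        z_stream zs;
        memset(&zs, 0, sizeof(zs));
        if (deflateInit2(&zs, Z_DEFAULT_COMPRESSION, Z_DEFLATED,
                         window_bits, 8, Z_DEFAULT_STRATEGY) != Z_OK)
            return -1;
        zs.next_in   = (unsigned char *)in;
        zs.avail_in  = in_len;
        zs.next_out  = out;
        zs.avail_out = out_len;
        int rc = deflate(&zs, Z_FINISH);
        deflateEnd(&zs);
        return rc == Z_STREAM_END ? (int)zs.total_out : -1;
    }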
[Block diagram: DCE: frame agent with QMan and BMan interfaces and a bus interface to CoreNet; compressor and decompressor blocks with 32 KB and 4 KB history buffers]