QorIQ T4240 Communications
Processor Deep Dive
FTF-NET-F0031
APR. 2014
Sam Siu & Feras Hamdan
Agenda
• QorIQ T4240 Communications Processor Overview
• e6500 Core Enhancement
• Memory Subsystem and MMU Enhancement
• QorIQ Power Management features
• HiGig Interface
• Interlaken Interface
• PCI Express® Gen 3 Interfaces (SR-IOV)
• Serial RapidIO® Manager (RMAN)
• Data Path Acceleration Architecture Enhancements
− mEMAC
− Offline Ports and Use Case
− Storage Profiling
− Data Center Bridging (FMAN and QMAN)
− Accelerators: SEC, DCE, PME
• Debug
QorIQ T4240 Communications Processor
[Block diagram: T4240 SoC: three clusters of four Power e6500 cores (32 KB I/D L1 caches, two threads each) with a 2 MB banked L2 per cluster; three 64-bit DDR3/3L memory controllers, each paired with a 512 KB CoreNet platform cache; CoreNet coherency fabric with PAMUs (peripheral access management units); DPAA blocks (two FMans with parse/classify/distribute and 2x 1/10G + 6x 1G ports each, Queue Manager, Buffer Manager, SEC 5.0, Pattern Match Engine 2.0, DCE 1.0, RMan); two 16-lane 10 GHz SerDes banks serving PCIe, sRIO, SATA 2.0, Interlaken-LA, and Ethernet; security fuse processor and security monitor; power management; IFC, SD/MMC, 2x USB 2.0 w/PHY, 2x DUART, 2x I2C, SPI, GPIO; Aurora real-time debug with watchpoint cross trigger, performance monitor, and CoreNet trace]
Processor
• 12x e6500, 64-bit, up to 1.8 GHz
• Dual threaded, with 128-bit AltiVec engine
• Arranged as 3 clusters of 4 CPUs, with 2 MB L2 per cluster; 256 KB per thread
Memory SubSystem
• 1.5 MB CoreNet platform cache w/ECC
• 3x DDR3 controllers up to 1.87 GHz
• Each with up to 1 TB addressability (40-bit physical addressing)
CoreNet Switch Fabric
High-speed Serial IO
• 4 PCIe controllers, with Gen3
• SR-IOV support
• 2 sRIO controllers
• Type 9 and 11 messaging
• Interworking to DPAA via Rman
• 1 Interlaken Look-Aside at up to 10 GHz
• 2 SATA 2.0 3Gb/s
• 2 USB 2.0 with PHY
Network IO
• 2 Frame Managers, each with:
• Up to 25Gbps parse/classify/distribute
• 2x10GE, 6x1GE
• HiGig, Data Center Bridging Support
• SGMII, QSGMII, XAUI, XFI
• Device
− TSMC 28 HPM process
− 1932-pin BGA package
− 42.5x42.5 mm, 1.0 mm pitch
• Power targets
− ~54W thermal max at 1.8 GHz
− ~42W thermal max at 1.5 GHz
• Data Path Acceleration
− SEC: crypto acceleration, 40 Gbps
− PME: regex pattern matching, 10 Gbps
− DCE: data compression engine, 20 Gbps
e6500 Core Enhancement
e6500 Core Complex
High Performance
• 64-bit Power Architecture® technology
• Up to 1.8 GHz operation
• Two threads per core
• Dual load/store units, one per thread
• 40-bit real address
− 1 Terabyte physical address space
• Hardware table walk
• L2 in a cluster of 4 cores
− Supports sharing across the cluster
− Supports L2 memory allocation to a core or thread
Energy Efficient
• Power management
− Drowsy: core, cluster, AltiVec engine
− Wait-on-reservation instruction
− Traditional modes
• AltiVec SIMD unit (128b)
− 8-, 16-, 32-bit signed/unsigned integer
− 32-bit floating point, 173 GFLOPS (1.8 GHz)
− 8-, 16-, 32-bit Boolean
• Improved productivity with core virtualization
− Hypervisor
− Logical to Real Address Translation (LRAT) mechanism for improved hypervisor performance
[Diagram: e6500 cluster: four dual-threaded e6500 cores, each with 32 KB L1 caches, an AltiVec unit, and a performance monitor, sharing a 2 MB 16-way, 4-bank L2 cache; CoreNet interface with 40-bit address bus, 256-bit read and write data buses, and a double-data-rate processor port]
CoreMark             P4080 (1.5 GHz)   T4240 (1.8 GHz)   Improvement from P4080
Single thread        4708              7828              1.7x
Core (dual thread)   4708              15,656            3.3x
SoC                  37,654            187,873           5.0x
DMIPS/Watt (typ)     2.4               5.1               2.1x
General Core Enhancements
• Improved branch prediction and additional link stack entries
• Pipeline improvements:
− LR, CTR, mfocrf optimization (LR and CTR are renamed)
− 16-entry rename/completion buffer
• New debug features:
− Ability to allocate individual debug events between the internal and external debuggers
− More IAC events
• Performance monitor
− Many more events, six counters per thread
− Guest performance monitor interrupt
• Private vs. shared registers and other architected state
− Shared between threads: there is only one copy of the register or architected state; a change in one thread affects the other thread if the other thread reads it
− Private to the thread and replicated per thread: there is one copy per thread of the register or architected state; a change in one thread does not affect the other thread when it reads its private copy
CoreNet Enhancements in QorIQ T4240
• CoreNet Coherency Fabric
− 40-bit real address
− Higher address bandwidth and more active transactions: 1.2 Tbps read, 0.6 Tbps write
− 2x bandwidth increase for cores, MMU, and peripherals
− Improved configuration architecture
• Platform Cache
− Increased write bandwidth (>600 Gbps)
− Increased buffering for improved throughput
− Improved data ownership tracking for performance enhancement
• Data Prefetch
− Tracks CPC misses
− Prefetches from multiple memory regions with configurable sizes
− Selective tracking based on requesting device, transaction type, and data/instruction access
− Conservative prefetch requests to avoid overloading the system with prefetches
− "Confidence"-based algorithm with a feedback mechanism
− Performance monitor events to evaluate the performance of prefetch in the system
[Chart: IP Mark and TCP Mark results, 0-100% over an x-axis of 0-24]
Cache and Memory Subsystem
Enhancements
Shared L2 Cache
• Clusters of cores share a 2 MB, 4-bank, 16-way set-associative shared L2 cache.
• In addition, there is also a 1.5 MB CoreNet platform cache.
• Advantages
− L2 cache is shared among 4 cores allowing lines to be allocated among the 4 cores as required
Some cores will need more lines and some will need less depending on workloads
− Faster sharing among cores in the cluster (sharing a line between cores in the cluster does not require the data to travel on CoreNet)
− Flexible partitioning of the L2 cache based on application cluster group
• Trade Offs
− Longer latency to DRAM and other parts of the system outside the cluster
− Longer latency to L2 cache due to increased cache size and eLink overhead
Memory Subsystem Enhancements
• The e6500 core has a larger store queue than the e5500 core
• Additional registers are provided for L2 cache partitioning controls similar to how partitioning is done in the CPC
• Cache locking is supported, however, if a line is unable to be locked, that status is not posted. Cache lock query instructions are provided for determining whether a line is locked
• The load store unit contains store gather buffers to collect stores to cache lines before sending them on eLink to the L2 cache
• There are no more Line Fill Buffers (LFB) associated with the L1 data cache
− These are replaced with Load Miss Queue (LMQ) entries for each thread
− They function in a manner very similar to LFBs
• Note there are still LFBs for L1 instruction cache
MMU Enhancements
MMU – TLB Enhancements
• e6500 core implements MMU architecture version 2 (V2)
− MMU architecture V2 is denoted by bits in the MMUCFG register
• Translation Look-aside Buffer (TLB1)
− Variable size pages, supports power of two page sizes (previous cores used power of 4 page sizes)
− 4 KB to 1 TB page sizes
• Translation Look-aside Buffers (TLB0) increased to 1024 entries
− 8 way associativity (from 512, 4 way)
− Supports HES (hardware entry select) when written to with tlbwe
• PID register is increased to 14 bits (from 8 bits)
− Now the operating system can have 16K simultaneous contexts
• Real address increased to 40 bits (from 36 bits)
• In general, it is backward compatible with MMU operations from e5500 core, except:
− Some of the configuration registers have a different organization (TLBnCFG, for example)
− There are new config registers for TLB page size (TLBnPS) and LRAT page size (LRATPS)
− tlbwe can be executed by guest supervisor (but can be turned off with an EPCR bit)
[Diagram: MMU translation: a 64-bit effective address (effective page number, 0-52 bits, plus byte address, 12-32 bits) is combined with MSR[GS] (0 = hypervisor, 1 = guest), MSR[AS], the 14-bit LPID, and the 14-bit PID to produce a 40-bit real address (real page number, 0-28 bits, plus byte address, 12-40 bits)]
MMU – Virtualization Enhancements (LRAT)
• e6500 core contains an LRAT (logical to real address translation)
− The LRAT translates logical addresses (addresses the guest operating system believes are real) into true real addresses
− Translation occurs when the guest executes tlbwe and tries to write TLB0 or during hardware tablewalk for a guest translation
− Does not require hypervisor to intervene unless the LRAT incurs a miss (the hypervisor writes entries into the LRAT)
− 8 entry fully associative supporting variable size pages from 4 KB to 1 TB (in powers of two)
• Prior to the LRAT, the hypervisor had to intervene each time the guest tried to write a TLB entry
[Flow diagram: an application instruction takes an MMU page fault; the guest OS translates the VA to a guest RA and writes the TLB; a trap to the hypervisor then translates the guest RA to an RA and writes the TLB; with the LRAT, this last step is implemented in hardware]
QorIQ Power Management
Features
Dynamic T4 Family Energy/Power Total Cost of Ownership
[Chart: energy savings across workload levels (full, mid, light, standby, light-to-mid, full) for a cyclical valued workload, comparing today's always-on energy strategy with T4 advanced power management; techniques shown include core drowsy, cascaded cluster drowsy, dual cluster drowsy + Tj, dynamic clock gating, and SoC sleep]
Cascaded Power Management
• Today: all CPUs in the pool channel dequeue until all FQs are empty; a broadcast notification is sent when work arrives
• The DPAA uses task queue thresholds to inform CPUs they are not needed; CPUs are selectively awakened as needed
[Diagram: QMan task queue with thresholds 1 and 2 feeding two clusters of cores (C0-C3 with shared L2), with idle cores in drowsy state; chart of active CPUs vs. power/performance across day, night, and burst periods]
• CPUs run software that drops into a polling loop when the DPAA is not sending them work
• The polling loop should include a wait instruction so the core drops into the drowsy state, as in the sketch below
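The polling/drowsy interplay above can be shown as a tiny idle loop. This is only an illustrative sketch, assuming bare-metal code on an e6500 thread with drowsy mode already enabled; dpaa_poll_work() is a hypothetical stand-in for a QMan portal dequeue, not a documented API.

    #include <stdbool.h>

    /* Hypothetical stand-in for a QMan portal dequeue; returns true when a
     * frame was dequeued and processed. A real implementation would use the
     * QMan portal API. */
    static bool dpaa_poll_work(void)
    {
        return false;
    }

    static inline void core_wait(void)
    {
        /* Power ISA "wait": stall this thread until an interrupt/doorbell
         * arrives, letting the core drop into drowsy state when configured. */
        __asm__ volatile("wait" ::: "memory");
    }

    void worker_idle_loop(void)
    {
        for (;;) {
            if (!dpaa_poll_work())
                core_wait();   /* no work: sleep until the QMan signals again */
        }
    }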
e6500 Core Intelligent Power Management
Cluster State   PCL00     PCL00        PCL00            PCL00             PCL00             PCL10
Core State      PH00      PH10/PW10    PH15             PW20              PH20              PH20
Cluster Clock   On        On           On               On                On                Off
Core Clock      On        On           Off              Off               Off               Off
L2 Cache        -         -            -                -                 -                 SW Flushed
L1 Cache        -         -            SW Invalidated   HW Invalidated    SW Invalidated    SW Invalidated
Wakeup Time     Active    Immediate    < 30 ns          < 200 ns          < 600 ns          < 1 us
Power           Full On   Full On      Full On          Full On           Full On           Nap
Legacy mode     Run       Doze         Nap              Global clk stop   Nap (pwr gated)   Core glb clk stop
[Diagram: e6500 cluster with per-core L1 caches, AltiVec, performance monitor/control, and a 2048 KB banked L2]
• Run, Doze, Nap, Wait
• AltiVec drowsy: auto and SW controlled, state maintained
• Core drowsy: auto and SW controlled, state maintained
• Dynamic clock gating
• Dynamic Frequency Scaling (DFS) of the cluster (cores and L2)
• Cluster (cores) drowsy
• SoC sleep with state retention
• SoC sleep with RST
• Cascade power management
• Energy Efficient Ethernet (EEE)
HiGig Interface Support
HiGig™/HiGig+/HiGig2 Interface Support
• The 10 Gigabit HiGig™/HiGig+™/HiGig2™ MAC interface interconnects standard Ethernet devices to switch HiGig ports.
• Networking customers can add features like quality of service (QoS), port trunking, mirroring across devices, and link aggregation at the MAC layer.
• The physical signaling across the interface is XAUI: four differential pairs for receive and transmit (SerDes), each operating at 3.125 Gbit/s. HiGig+ is a higher-rate version of HiGig.
Regular Ethernet frame (bytes 1-32): Preamble | MAC_DA | MAC_SA | Type | Packet Data | FCS
Ethernet frame with HiGig+ header (bytes 1-34): Preamble | HiGig+ Module Hdr | MAC_DA | MAC_SA | Type | Packet Data | FCS*
Ethernet frame with HiGig2 header (bytes 1-38): Preamble | HiGig2 Module Hdr | MAC_DA | MAC_SA | Type | Packet Data | FCS*
QorIQ T4240 Processor HiGig Interface
• T4240 FMan Supports HiGig/HiGig+/HiGig2 protocols
• In the T4240 processor, the 10G mEMACs can be configured as a HiGig interface. In this configuration, two of the 1G mEMACs are used as the HiGig message interface
SERDES Configuration for HiGig Interface
• Networking protocols (SerDes 1 and SerDes 2)
• HiGig notation: HiGig[2]m.n means HiGig[2] (4 lanes @ 3.125 or 3.75 Gbps)
− "m" indicates which Frame Manager (FM1 or FM2)
− "n" indicates which MAC on the Frame Manager
− E.g. "HiGig[2]1.10" indicates HiGig[2] using FM1's MAC 10
• When a SerDes protocol is selected with dual HiGigs in one SerDes, both HiGigs must be configured with the same protocol (for example, both with 12 byte headers or both with 16 byte headers)
HiGig/HiGig2 Control and Configuration
Name Description
LLM_MODE Toggle between HiGig2 Link Level Messages physical link, OR HiGig2 link level
messages logical link (SAFC)
LLM_IGNORE Ignore HiGig2 link level message quanta
LLM_FWD Terminate/forward received HiGig2 link level message
IMG[0:7] Inter Message Gap - spacing between HiGig2 messages
NOPRMP Toggle preemptive transmission of HiGig2 messages
MCRC_FWD Strip/forward HiGig2 message CRC of received messages
FER Discard/forward HiGig2 receive message with CRC error
FIMT Forward OR Discard message with illegal MSG_TYP
IGNIMG Ignore IMG on receive path
TCM TC (traffic classes) mapping
HiGig/HiGig2 Control and Configuration Register (HG_CONFIG), bits 1-32: LLM | LLI | LLF | IMG | NOPRMP | MCRC | FER | FIMT | IGNIM | TCM
Interlaken Interface
Interlaken Look-Aside Interface
• Use Case: T4240 processor as a data path processor, requiring millions of look-ups per second. Expected requirement in edge routers.
• Interlaken Look-Aside is a new high-speed serial standard for connecting TCAMs ("network search engines", "knowledge-based processors") to host CPUs and NPUs. It replaces the Quad Data Rate (QDR) SRAM interface.
• Like Interlaken streaming interfaces (channelized SerDes links, replacing SPI 4.2), Interlaken Look-Aside supports a configurable number of SerDes lanes (1-32, with single-lane granularity) with linearly increasing bandwidth. Freescale supports x4 and x8, up to 10 GHz.
• For lowest latency, each vCPU (thread) in T4240 processor will have a portal into the Interlaken Controller, allowing multiple search requests and results to be returned concurrently.
• Interlaken Look Aside expected to gain traction as interface to other low latency/minimal data exchange co-processors, such as Traffic Managers. PCIe and sRIO better for higher latency/high bandwidth applications.
• Lane Striping
[Diagram: T4240 connected to a TCAM over a four-lane 10 G Interlaken Look-Aside link]
T4240 (LAC) Features:
• Supports Interlaken Look-Aside Protocol definition, rev. 1.1
• Supports 24 partitioned software portals
• Supports in-band per-channel flow control options, with simple xon/xoff semantics
• Supports a wide range of SerDes speeds (6.25 and 10.3125 Gbps)
• Ability to disable the connection to individual SerDes lanes
• A continuous Meta Frame of programmable frequency to guarantee lane alignment, synchronize the scrambler, perform clock compensation, and indicate lane health
• 64B/67B data encoding and scrambling
• Programmable BURSTSHORT parameter of 8 or 16 bytes
• Error detection for illegal burst sizes, bad 64B/67B word types, and CRC-24 errors
• Error detection on Transmit command programming error
• Built-in statistics counters and error counters
• Dynamic power down of each software portal
Look-Aside Controller Block Diagram
Modes of Operation
• The T4240 LA controller can operate in either stashing or non-stashing mode.
• The LAC programming model is big-endian, meaning byte 0 is the most significant byte.
• In non-stashing mode, software has to issue a dcbf each time it reads SWPnRSR and the RDY bit is not set (see the sketch below).
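A minimal sketch of that non-stashing polling rule follows, assuming a memory-mapped SWPnRSR; the RSR_RDY mask and the dcbf helper are illustrative placeholders, not values taken from the reference manual.

    #include <stdint.h>

    #define RSR_RDY  0x1u   /* assumption: illustrative bit position only */

    static inline void dcbf(const volatile void *addr)
    {
        /* Power ISA data cache block flush, so the next load refetches
         * fresh status from the LAC rather than a stale cached copy. */
        __asm__ volatile("dcbf 0, %0" : : "r"(addr) : "memory");
    }

    uint32_t lac_wait_ready(volatile uint32_t *swp_rsr)
    {
        uint32_t rsr = *swp_rsr;
        while (!(rsr & RSR_RDY)) {
            dcbf(swp_rsr);   /* required in non-stashing mode before re-reading */
            rsr = *swp_rsr;
        }
        return rsr;
    }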
Interlaken LA Controller Configuration Registers
• 4 KB hypervisor space, 0x0000-0x0FFF
• 4 KB managing-core space, 0x1000-0x1FFF
• In compliance with the trust architecture, LSRER, LBARE, LBAR, and LLIODNRn are accessed exclusively in hypervisor mode and are reserved in managing-core mode
• Statistics, lane mapping, interrupt, rate, metaframe, burst, FIFO, calendar, debug, pattern, error, and capture registers
• LAC software portal memory, n = 0, 1, 2, ..., 23
• SWPnTCR/SWPnRCR: software portal n transmit/receive command register
• SWPnTER/SWPnRER: software portal n transmit/receive error register
• SWPnTDR0-3/SWPnRDR0-3: software portal n transmit/receive data registers 0-3
• SWPnRSR: software portal n receive status register
TCAM Usage in Routing Example
Interlaken Look-Aside TCAM Board
[Board diagram: Renesas Interlaken-LA 5 Mb TCAM with I2C EEPROM; x4 IL-LA link; 156.25 MHz REFCLK; 125 MHz SYSCLK; SMBus; reset/JTAG and 3.3V/12V config; power rails VDDC 0.85 V @ 6 A, VDDA 0.85 V @ 2 A (filtered), VCC_1.8V @ 2 A, VDDHA 1.80 V 0.5 A, VDDO 1.80 V 1.0 A, VPLL 1.80 V 0.25 A]
PCI Express® Gen 3 Interfaces
PCI Express® Gen 3 Interfaces
• Two PCIe Gen 3 controllers can be run at the same time with the same SerDes reference clock source
• PCIe Gen 3 bit rates are supported
− When running more than one PCIe controller at Gen 3 rates, the associated SerDes reference clocks must be driven by the same source on the board
16-lane SerDes PCIe configurations (across PCIe1/PCIe2/PCIe3/PCIe4):
− x4 Gen3 + x4 Gen2 + x8 Gen2
− x8 Gen2 + x8 Gen2
− x4 Gen2 + x4 Gen2 + x4 Gen3 + x4 Gen2
[Diagram: four PCIe controllers attached to the OCN: two x4 Gen2/3 RC/EP controllers and two controllers configurable as x8 Gen2 or x4 Gen3 RC/EP, one of which supports SR-IOV EP with 2 PFs/64 VFs and 8x MSI-X per VF/PF; 16 lanes total]
Single Root I/O Virtualization (SR-IOV) End Point
• With SR-IOV supported in the EP, different devices or different software tasks can share I/O resources, such as Gigabit Ethernet controllers.
− T4240 supports the SR-IOV 1.1 spec with 2 PFs and 64 VFs per PF
− SR-IOV supports native IOV in existing single-root-complex PCI Express topologies
− Address translation services (ATS) support native IOV across PCI Express via address translation
− A single management physical or virtual machine on the host handles endpoint configuration
• E.g. the T4240 processor as a Converged Network Adapter: each virtual machine running on the host thinks it has a private version of the services card
[Diagram: host running VM 1..VM N, connected through a translation agent to the T4240 as an SR-IOV endpoint; the depicted configuration uses a single controller (up to x4 Gen 3) with 1 PF and 64 VFs]
PCI Express Configuration Address Register
• The PCI Express configuration address register contains address
information for accesses to PCI Express internal and external
configuration registers for End Point (EP) with SR-IOV
Bit fields (1-32): EN | Type | EXTREGN | VFN | PFN | REGN
PCI Express Address Offset Register
Name Description
EN Enable. Allows a PCI Express configuration access when PEX_CONFIG_DATA is accessed
TYPE 01, Configuration Register Accesses to PF registers for EP with SR-IOV
11, Configuration Register Accesses to VF registers for EP with SR-IOV
EXTREGN Extended register number. This field allows access to extended PCI Express configuration
space
VFN Virtual Function number minus 1. 64-255 is reserved.
PFN Physical Function number minus 1. 2-15 is reserved.
REGN Register number. 32-bit register to access within specified device
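As an illustration of how software might compose this register, the sketch below simply ORs the fields together; every shift and width shown is an assumption chosen for illustration only, not the documented bit layout.

    #include <stdint.h>

    #define CFG_EN       (1u << 31)   /* assumption: enable bit position */
    #define CFG_TYPE_PF  0x1u         /* 01 = accesses to PF registers */
    #define CFG_TYPE_VF  0x3u         /* 11 = accesses to VF registers */

    /* Field shifts below are illustrative placeholders. VFN and PFN are the
     * function numbers minus 1, as described in the table above. */
    static uint32_t pex_cfg_addr(uint32_t type, uint32_t extregn,
                                 uint32_t vfn, uint32_t pfn, uint32_t regn)
    {
        return CFG_EN |
               (type    << 29) |
               (extregn << 20) |
               (vfn     << 12) |
               (pfn     <<  8) |
               regn;              /* register number within the function */
    }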
Message Signaled Interrupts (MSI-X) Support
• MSI-X allows for EP device to send message interrupts to RC device independently for different Physical or Virtual functions as supported by EP SR-IOV.
• Each PF or VF will have eight MSI-X vectors allocated with a total of 256 total MSI-X vectors supported
− Supports MSI-X for PF/VF with 8 MSI-X vector per PF or VF
− Supports MSI-X trap operation
− To access an MSI-X PBA structure, the PF, VF, IDX, and EIDX fields are concatenated to form the 4-byte-aligned address of the register within the MSI-X PBA structure. That is, the register address is:
PF || VF || IDX || EIDX || 0b00
Bit fields (1-32): Type | PF | VF | IDX | EIDX | M
PCI Express Address Offset Register
Name Description
TYPE Access to PF or VF MSI-X vector table for EP with SR-IOV.
PF Physical Function
VF Virtual Function
IDX MSI-X Entry Index in each VF.
EIDX Extended index. Selects which 4-byte entity within the MSI-X PBA structure to access.
M Mode=11
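The concatenation rule above can be written out as a small helper; the field widths used here are illustrative assumptions only (they are not the documented sizes).

    #include <stdint.h>

    /* PF || VF || IDX || EIDX || 0b00, with assumed widths:
     * VF 6 bits (64 VFs), IDX 3 bits (8 vectors), EIDX 3 bits. */
    static uint32_t msix_pba_offset(uint32_t pf, uint32_t vf,
                                    uint32_t idx, uint32_t eidx)
    {
        return (((((pf << 6) | vf) << 3 | idx) << 3 | eidx) << 2);  /* trailing 0b00 */
    }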
Serial RapidIO® Manager (RMAN)
RapidIO Message Manager (RMan)
• RMan supports both inline switching and look-aside forwarding operation.
[Diagram: RMan datapath: inbound RapidIO traffic passes through classification units (inbound rule matching), reassembly contexts and reassembly units, and work queues (WQ0-WQ7) on hardware channels into QMan, toward the cores, the Frame Manager, SEC, and PME; outbound traffic is dequeued from QMan through disassembly contexts and segmentation units onto the RapidIO link. RapidIO PDU format: Ftype | Target ID | Src ID | Address | Packet Data Unit | CRC]
RMan: Greater Performance and Functionality
• Many queues allow multiple inbound/outbound queues per core
− Hardware queue management via QorIQ Data Path Architecture (DPAA)
• Supports all messaging-style transaction types
− Type 11 Messaging
− Type 10 Doorbells
− Type 9 Data Streaming
• Enables low overhead direct core-to-core communication
[Diagram: two QorIQ or DSP devices, each with four cores, linked by 10G sRIO; Type 9 user PDUs provide a channelized CPU-to-CPU transport, and MSG user PDUs provide a device-to-device transport]
Data Path Acceleration
Architecture (DPAA)
Data Path Acceleration Architecture (DPAA) Philosophy
• DPAA is designed to balance the performance of multiple CPUs and accelerators with seamless integration
− ANY packet to ANY core to ANY accelerator or network interface, efficiently, WITHOUT locks or semaphores
• “Infrastructure” components
− Queue Manager (QMan)
− Buffer Manager (BMan)
• “Accelerator” Components
− Cores
− Frame Manager (FMan)
− RapidIO Message Manager (RMan)
− Cryptographic accelerator (SEC)
− Pattern matching engine (PME)
− Decompression/Compression Engine (DCE)
− DCB (Data Center Bridging)
− RAID Engine (RE)
• CoreNet
− Provides the interconnect between the cores and the DPAA infrastructure as well as access to memory
[Diagram: DPAA in the P series (e500mc cores, SEC 4.x, PME 2, RMan, RE) versus the T series (e6500 cores, adding DCE and DCB); in both, the Queue Manager, Buffer Manager, and Frame Managers (parse/classify/distribute, 1G and 1/10G ports) sit on the CoreNet coherency fabric with the cores]
DPAA Building Block: Frame Descriptor (FD)
• Simple frame (Format = 000): the FD (DD, LIODN offset, BPID, ELIODN offset, address, offset, length, status/cmd) points directly at a single data buffer.
• Multi-buffer frame (scatter/gather, Format = 100): the FD points at an S/G list; each S/G entry carries an address, length, BPID, and offset and references one of the data buffers that make up the packet.
FD layout (four 32-bit words): DD | LIODN offset | BPID | ELIODN offset | addr (40-bit, spanning words 0-1) | Fmt | Offset | Length | STATUS/CMD
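For reference, a minimal C view of the 16-byte FD is sketched below, assuming the usual DPAA field widths (2-bit DD, 6-bit LIODN offset, 8-bit BPID, 4-bit ELIODN offset, 40-bit address, 3-bit format, 9-bit offset, 20-bit length, 32-bit status/command). The bitfield packing is illustrative; real drivers typically build the words with explicit shifts to stay endian-safe.

    #include <stdint.h>

    struct dpaa_fd {
        /* words 0-1: ownership / isolation / buffer pool / address */
        uint64_t dd:2;          /* debug/discard control */
        uint64_t liodn_off:6;   /* LIODN offset */
        uint64_t bpid:8;        /* buffer pool ID */
        uint64_t eliodn_off:4;  /* extended LIODN offset */
        uint64_t rsvd:4;
        uint64_t addr:40;       /* buffer (or S/G table) address */
        /* words 2-3: format / offset / length / status-command */
        uint32_t fmt:3;         /* 000 = simple frame, 100 = scatter/gather */
        uint32_t offset:9;      /* start of data within the buffer */
        uint32_t length:20;     /* frame length in bytes */
        uint32_t status_cmd;    /* command on enqueue, status on dequeue */
    };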
Frame Descriptor Status/Command Word (FMAN Status)
Bit fields (1-32): DCL4C | DME | MS | FPE | FSE | DIS | EOF | NSS | KSO | FCL | IPP | FLM | PTE | ISP | PHE | FRDR | BLE | L4CV (reserved bits omitted)
Name Description
DCL4C L4 (IP/TCP/UDP) Checksum validation Enable/Disable
DME DMA error
MS MACSEC Frame. This bit is valid on P1023
FPE Frame Physical Error
FSE Frame Size Error
DIS Discard. This bit is set only for frames that are supposed to be discarded, but are
enqueued in an error queue for debug purposes.
EOF Extract Out of Frame Error
NSS No Scheme Selection for KeyGen
KSO Key Size Overflow Error
FCL Frame color as determined by the Policer. 00=green, 01=yellow, 10=red, 11=no reject
IPP Illegal Policer Profile error
FLM Frame Length Mismatch
PTE Parser Time-out
ISP Invalid Soft Parser instruction Error
PHE Header Error
FRDR Frame Drop
BLE Block limit is exceeded
L4CV L4 Checksum Validation
DPAA: mEMAC Controller
Multirate Ethernet MAC (mEMAC) Controller
[Block diagram: mEMAC: Rx/Tx interfaces and reconciliation, Rx/Tx FIFOs, IEEE 1588 time stamping, Rx/Tx control, flow control, configuration control and statistics, MDIO master for PHY management, and the Frame Manager interface]
• The multirate Ethernet MAC (mEMAC) controller supports 100 Mbps/1G/2.5G/10G operation:
− Supports HiGig/HiGig+/HiGig2 protocols
− Dynamic configuration for NIC (Network Interface Card) applications or Switching/Bridging applications to support 10Gbps or below.
− Designed to comply with IEEE Std 802.3®, IEEE 802.3u, IEEE 802.3x, IEEE 802.3z, IEEE 802.3ac, IEEE 802.3ab, IEEE 1588 v2 (clock synchronization over Ethernet), IEEE 802.3az, and IEEE 802.1Qbb.
− RMON statistics
− CRC-32 generation and append on transmit or forwarding of user application provided FCS selectable on a per-frame basis.
− 8 MAC address comparison on receive and one MAC address overwrite on transmit for NIC applications.
− Selectable promiscuous frame receive mode and transparent MAC address forwarding on transmit
− Multicast address filtering with 64-bin hash code lookup table on receive reducing processing load on higher layers
− Support for VLAN tagged frames and double VLAN Tags (Stacked VLANs)
− Dynamic inter packet gap (IPG) calculation for WAN applications
[Diagram: the separate 10G MAC and dTSEC of the QorIQ P series are replaced by the mEMAC in the QorIQ T4240]
DPAA: FMAN
FMAN
[Diagram: FMan with parse/classify/distribute, muRAM, two 1/10G ports, and six 1G ports]
FMan Enhancements
• Storage Profile selection (up to 32 profiles per port) based on classification
− Up to four buffer pools per Storage Profile
• Customer Edge Egress Traffic Management (Egress Shaping)
• Data Center Bridging
− PFC and ETS
• IEEE802.3az (Energy Efficient Ethernet)
• IEEE802.3bf (Time sync)
• IP Frag & Re-assembly Offload
• HiGig, HiGig2
• TX confirmation/error queue enhancements
− Ability to configure separate FQIDs for normal confirmations vs. errors
− Separate FD status for Overflow and physical error
• Option to disable S/G on ingress
Offline Ports
FMan Port Types
• Ethernet receive (Rx) and transmit (Tx)
− 1 Gbps / 2.5 Gbps / 10 Gbps
− In FMan_v3, some ports can be configured as HiGig
− Jumbo frames of up to 9.6 KB (add the u-boot bootarg "fsl_fm_max_frm=9600")
• Offline (O/H)
− FMan_v3: 3.75 Mpps (vs. 1.5 Mpps in the P series)
− Supports the parse/classify/distribute (PCD) function on frame descriptors (FDs) dequeued from the QMan
− Supports copying or moving a frame from one storage profile to another
− Able to dequeue from and enqueue to a QMan queue: the FMan applies a PCD flow and (if configured to do so) enqueues the frame back to a QMan queue; in FMan_v3 the FMan can copy the frame into new buffers and enqueue it back to the QMan
− Use case: IP fragmentation and reassembly
• Host command
− Able to dequeue host commands from a QMan queue; the FMan executes the host command (such as a table update) and enqueues a response to the QMan. Host commands require a dedicated PortID (one of the O/H ports)
− The registers for offline and host commands are named O/H port registers
IP Reassembly T4240 Processor Flow
[Flow diagram: the BMI allocates a buffer and writes the frame and IC; the parser parses the frame and identifies fragments; KeyGen calculates a hash; the FMan controller performs coarse classification. Non-fragments are enqueued directly. For fragments, buffer allocation is done according to the fragment header only, and the FMan controller starts reassembly and links each fragment to the right reassembly context; incomplete reassemblies terminate at the BMI, while completed reassemblies go back through KeyGen and classification before the reassembled frame is enqueued and the BMI writes the IC. Regular frame: the storage profile is chosen according to frame header classification. Reassembled frame: the storage profile is chosen according to MAC and IP header classification only.]
IP Reassembly FMAN Memory Usage
• FMAN Memory: 386 KBytes
• Assumption: MTU = 1500 Bytes
• Port FMAN Memory consumption:
− Each 10G Port = 40 Kbytes
− Each 1G Port = 25 Kbytes
− Each Offline Port = 10 Kbytes
• Coarse Classification tables memory consumption:
− 100 Kbytes for all ports
• IP Reassembly:
− IP Reassembly overhead: 8 Kbytes
− Each flow: 10 Bytes
• Example:
− Usecase with: 2x10G ports + 2x1G port + 1xOffline Ports.
− Port configuration: 2x40 + 2x25 + 10 = 140 Kbytes
− Coarse Classification : 100 Kbytes
− IP reassembly 10K flows: 10K x 10B + 8KB = 108 Kbytes
− Total = 140KB + 108KB + 100KB = 348 KBytes
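The budget above is simple enough to fold into a small helper; this sketch just reproduces the slide's arithmetic so a different port mix can be checked quickly.

    #include <stdio.h>

    #define PORT_10G_KB        40
    #define PORT_1G_KB         25
    #define PORT_OFFLINE_KB    10
    #define COARSE_CLASS_KB   100
    #define IP_REASM_OVHD_KB    8
    #define BYTES_PER_FLOW     10

    static unsigned fman_mem_kb(unsigned n10g, unsigned n1g, unsigned noff,
                                unsigned flows)
    {
        unsigned ports = n10g * PORT_10G_KB + n1g * PORT_1G_KB
                       + noff * PORT_OFFLINE_KB;
        /* round the per-flow reassembly state up to whole KB */
        unsigned reasm = IP_REASM_OVHD_KB
                       + (flows * BYTES_PER_FLOW + 1023) / 1024;
        return ports + COARSE_CLASS_KB + reasm;
    }

    int main(void)
    {
        /* 2x10G + 2x1G + 1 offline port, 10K flows -> 348 KB as in the example */
        printf("%u KB\n", fman_mem_kb(2, 2, 1, 10 * 1024));
        return 0;
    }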
Storage Profile
Virtual Storage Profiling For Rx and Offline Ports
• Storage profiles enable each partition and virtual interface to have dedicated buffer pools.
• Storage profile selection happens after distribution function evaluation or after the custom classifier.
• The same Storage Profile ID (SPID) value from classification on different physical ports may yield different storage profile selections.
• Up to 64 storage profiles per port are supported.
− 32 storage profiles for FMan_v3L
• Storage profile contains
− LIODN offset
− Up to four buffer pools per Storage Profile
− Buffer Start margin/End margin configuration
− S/G disable
− Flow control configuration
Data Center Bridging
Policing and Shaping
• Policing puts a cap on network usage and guarantees bandwidth
• Shaping smooths out the egress traffic
− May require extra memory to store the shaped traffic
• DCB can be used in:
− Between data center network nodes
− LAN/network traffic
− Storage Area Network (SAN)
− IPC traffic (e.g. Infiniband (low latency))
Support Priority-based Flow Control (802.1Qbb)
• Enables lossless behavior for each class of service
• PAUSE sent per virtual lane when buffers limit exceeded
− FQ congestion groups state (on/off) from QMan
Priority vector (8 bits) is assigned to each FQ congestion group
FQ congestion group(s) are assigned to each port
Upon receipt of a congestion group state “on” message, for each Rx port associated with this congestion group, a PFC Pause frame is transmitted with priority level(s) configured for that group
− Buffer pool depletion
Priority level configured on per port (shared by all buffer pools used on that port)
− Near FMan Rx FIFO full
There is a single Rx FIFO per port for all priorities, the PFC Pause frame is sent on all priorities
• PFC Pause frame reception
− QMan provides the ability to flow control 8 different traffic classes; in CEETM each of the 16 class queues within a class queue channel can be mapped to one of the 8 traffic classes & this mapping applies to all channels assigned to the link
[Diagram: eight transmit queues (priorities zero through seven) mapped to virtual lanes on the Ethernet link and to per-priority receive buffers; a PFC PAUSE on one priority stops only that virtual lane]
Support Bandwidth Management 802.1Qaz
[Chart: offered traffic vs. realized 10 GE traffic utilization at times t1, t2, and t3 for HPC, storage, and LAN traffic classes, showing unused bandwidth from one class being reallocated to the others]
• Supports 32 channels available for allocation across a single FMan
− e.g. for two 10G links, 16 channels (virtual links) could be allocated per link
− Supports weighted bandwidth fairness amongst channels
− Shaping is supported on a per-channel basis
• Hierarchical port scheduling defines the class-of-service (CoS) properties of output queues, mapped to IEEE 802.1p priorities
• QMan CEETM enables Enhanced Transmission Selection (ETS, 802.1Qaz) with intelligent sharing of bandwidth between traffic classes
− Strict priority scheduling of the 8 independent classes; weighted bandwidth fairness within the 8 grouped classes
− The priority of the class group can be independently configured to be immediately below any of the independent classes
• Meets the performance requirement for ETS: bandwidth granularity of 1% and +/-10% accuracy
QMAN CEETM
CEETM Scheduling Hierarchy (QMAN 1.2)
• Logic (color coding in the diagram)
− Green denotes logic units and signal paths that relate to the request and fulfillment of Committed Rate (CR) packet transmission opportunities
− Yellow denotes the same for Excess Rate (ER)
− Black denotes logic units and signal paths that are used for unshaped opportunities or that operate consistently whether used for CR or ER opportunities
• Scheduler
− Channel scheduler: channels are selected to send frames from class queues
− Class scheduler: frames are selected from class queues; class 0 has the highest priority
• Algorithm
− Strict Priority (SP)
− Weighted Scheduling
− Shaped Aware Fair Scheduling (SAFS)
− Weighted Bandwidth Fair Scheduling (WBFS)
[Diagram: CEETM scheduling hierarchy: 16 class queues (CQ0-CQ15) per channel feed a class scheduler that applies strict priority to the independent classes and WBFS to the grouped classes (e.g. Ch6 unshaped with 8 independent + 8 grouped classes, Ch7 shaped with 3 independent + 7 grouped, Ch8 shaped with 2 independent + 8 grouped); a channel scheduler per LNI combines channels toward the network interface using shape-aware fair scheduling and weighted scheduling, with token bucket shapers for committed rate and excess rate]
Weighted Bandwidth Fair Scheduling (WBFS)
• Weighted Bandwidth Fair Scheduling (WBFS) is used to schedule packets from queues within a priority group such that each gets a “fair” amount of bandwidth made available to that priority group
• The premise of fairness for the algorithm is (a sketch of the redistribution follows the table below):
− available bandwidth is divided and offered equally to all classes
− bandwidth offered in excess of a class's demand is re-offered equally to classes with unmet demand
Round                               Initial distribution   First redistribution   Second redistribution
BW available                        10G                    1.5G                   0.2G (0G remaining after)
Number of classes with unmet demand 5                      3                      2
Bandwidth offered to each class     2G                     0.5G                   0.1G

          Demand   Offered & retained   Unmet demand   Offered & retained   Unmet demand   Offered & retained   Total BW attained
Class 0   0.5G     0.5G                 0              -                    -              -                    0.5G
Class 1   2G       2G                   0              -                    -              -                    2G
Class 2   2.3G     2G                   0.3G           0.3G                 0              -                    2.3G
Class 3   3G       2G                   1G             0.5G                 0.5G           0.1G                 2.6G
Class 4   4G       2G                   2G             0.5G                 1.5G           0.1G                 2.6G
Total     11.8G    8.5G                 -              1.3G                 -              0.2G                 10G
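A sketch of that redistribution premise, using the demands from the table above; this is an illustration of the fairness rule, not the QMan hardware algorithm.

    #include <stdio.h>

    #define NCLASSES 5

    /* Offer the available bandwidth equally to classes with unmet demand,
     * let each keep at most its remaining demand, and re-offer the leftover
     * until nothing remains or all demand is met. */
    static void wbfs_redistribute(const double demand[NCLASSES], double available,
                                  double granted[NCLASSES])
    {
        for (int i = 0; i < NCLASSES; i++)
            granted[i] = 0.0;

        while (available > 1e-9) {
            int unmet = 0;
            for (int i = 0; i < NCLASSES; i++)
                if (demand[i] - granted[i] > 1e-9)
                    unmet++;
            if (unmet == 0)
                break;
            double share = available / unmet;   /* equal offer per round */
            for (int i = 0; i < NCLASSES; i++) {
                double need = demand[i] - granted[i];
                double take = need < share ? need : share;
                if (take > 0.0) {
                    granted[i] += take;
                    available  -= take;
                }
            }
        }
    }

    int main(void)
    {
        double demand[NCLASSES] = { 0.5, 2.0, 2.3, 3.0, 4.0 };  /* Gbps, from the table */
        double granted[NCLASSES];
        wbfs_redistribute(demand, 10.0, granted);
        for (int i = 0; i < NCLASSES; i++)
            printf("class %d: %.1fG\n", i, granted[i]);  /* 0.5 2.0 2.3 2.6 2.6 */
        return 0;
    }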
DPAA: SEC Engine
Security Engine
• Black Keys − In addition to protecting against external bus snooping, Black Keys cryptographically
protect against key snooping between security domains
• Blobs − Blobs protect data confidentiality and integrity across power cycles, but do not protect
against unauthorized decapsulation or substitution of another user’s blobs
− In addition to protecting data confidentiality and integrity across power cycles, Blobs cryptographically protect against blob snooping/substitution between security domains
• Trusted Descriptors − Trusted Descriptors protect descriptor integrity, but do not distinguish between
Trusted Descriptors created by different users
− In addition to protecting Trusted Descriptor integrity, Trusted Descriptors now cryptographically distinguish between Trusted Descriptors created in different security domains
• DECO Request Source Register − Register added
QorIQ T4240 Processor SEC 5.0 Features
• Header and trailer off-load for the following security protocols:
− IPSec, SSL/TLS, 3G RLC, PDCP, SRTP, 802.11i, 802.16e, 802.1ae
(3) Public Key Hardware Accelerator (PKHA)
− RSA and Diffie-Hellman (to 4096b)
− Elliptic curve cryptography (1024b)
− Supports Run Time Equalization
(1) Random Number Generators (RNG4)
− NIST Certified
(4) Snow 3G Hardware Accelerators (STHA)
− Implements Snow 3.0
− Two for Encryption (F8), two for Integrity (F9)
(4) ZUC Hardware Accelerators (ZHA)
− Two for Encryption, two for Integrity
(2) ARC Four Hardware Accelerators (AFHA)
− Compatible with RC4 algorithm
(8) Kasumi F8/F9 Hardware Accelerators (KFHA)
− F8 , F9 as required for 3GPP
− A5/3 for GSM and EDGE
− GEA-3 for GPRS
(8) Message Digest Hardware Accelerators (MDHA)
− SHA-1, SHA-2 256,384,512-bit digests
− MD5 128-bit digest
− HMAC with all algorithms
(8) Advanced Encryption Standard Accelerators (AESA)
− Key lengths of 128-, 192-, and 256-bit
− ECB, CBC, CTR, CCM, GCM, CMAC, OFB, CFB, and XTS
(8) Data Encryption Standard Accelerators (DESA)
− DES, 3DES (2K, 3K)
− ECB, CBC, OFB modes
(8) CRC Unit
− CRC32, CRC32C, 802.16e OFDMA CRC
[Block diagram: SEC 5.0: the queue interface and job ring interface feed the job queue controller and descriptor controllers (DECOs), with DMA and RTIC; behind arbiters sit the crypto hardware accelerators (CHAs): DESA, AESA, MDHA, CRCA, KFHA, AFHA, PKHA, STHA (f8/f9), ZUC encryption/integrity, and RNG4]
Life of a Job Descriptor
• QI has room for more work, issues dequeue request for 1 or 3 FDs
• Qman selects FQ and provides 1 FD along with FQ Information
• QI creates [internal] Job Descriptor and if necessary, obtains output buffers
• QI transfers completed Job Descriptor into one of the Holding Tanks
• Job Queue Controller finds an available DECO, transfers JD1 to it
• DECO initiates DMA of Shared Descriptor from system memory, places it in Descriptor Buffer with JD from Holding Tank
• DECO executes descriptor commands, loading registers and FIFOs in its CCB
• CCB obtains and controls CHA(s) to process the data per DECO commands
• DECO commands DMA to store results and any updated context to system memory
• As input buffers are being emptied, DECO tells QI, which may release them back to BMan
• Upon completion of all processing through CCB, DECO resets CCB
• DECO informs QI that JD1 has completed with status code X, data of length Y has been written to address Z
• QI creates outbound FD, enqueues to Qman using FQID from Ctx B field
[Diagram: SEC job flow: queue interface job prep logic and holding tank pool, job queue controller with job rings JR0-JR3 and per-source status/FQID tracking, DECO pool (DECO 0-7, each with a descriptor buffer and CCB 0-7), and DMA to DDR/CoreNet for shared descriptors and frames, with Buffer Manager and Queue Manager interfaces]
DPAA: DCE
DPAA Interaction: Frame Descriptor Status/CMD
• The Status/Command word in the dequeued FD allows software to modify the processing of individual frames while retaining the performance advantages of enqueuing to a FQ for flow based processing
• The three most significant bits of the Command /Status field of the Frame Descriptor have the following meaning:
FD layout (four 32-bit words): DD | LIODN offset | BPID | ELIODN offset | addr | addr (cont) | Format | Offset | Length | Status/CMD
CMD Token: Pass through data that is echoed with the returned Frame.
Command encoding (3 MSBs):
000  Process command
001  Reserved
010  Reserved
011  Reserved
100  Context invalidate command token
101  Reserved
110  Reserved
111  NOP command token
[Register layout: DCE Frame Descriptor Status/Command word: a 3-bit CMD field followed by per-frame control flags; on the returned (output) frame the same word carries the completion status]
DCE Inputs
• SW enqueues work to DCE via Frame Queues. FQs define the flow for stateful processing
• FQ initialization creates a location for the DCE to use when storing flow stream context
• Each work item within the flow is defined by a Frame Descriptor, which includes length, pointer, offsets, and commands
• DCE has separate channels for compress and decompress
[Diagram: software enqueues FDs (address, offset, length, status/cmd, BPID) on compression and decompression command frame queues; each FQ's Context_A points to the flow stream context; frames flow through work queues (WQ0-WQ7) and a hardware channel into the DCE's DCP portal, with separate channels for compress and decompress]
DCE Outputs
• DCE enqueues results to SW via Frame Queues as defined by FQ Context_B field. When buffers obtained from Bman, buffer pool ID defined by Input FQ
• Each result is defined by a Frame Descriptor, which includes a Status field
• DCE updates flow stream context located at Context_A as needed
[Diagram: the DCE enqueues result FDs (including a status field) through its DCP portal onto the output frame queues defined by the input FQ's Context_B; data buffers may come from BMan pools, and the flow stream context at Context_A is updated as needed]
PME
Frame Descriptor: STATUS/CMD Treatment
• PME Frame Descriptor Commands
− b111 NOP NOP Command
− b101 FCR Flow Context Read Command
− b100 FCW Flow Context Write Command
− b001 PMTCC Table Configuration Command
− b000 SCAN Scan Command
FD layout (four 32-bit words): DD | LIODN offset | BPID | ELIODN offset | addr | addr (cont) | Format | Offset | Length | Status/CMD
Scan command (b000) Status/CMD fields: SRVM | F | S/R | E | SET | Subset
Life of a Packet inside Pattern Matching Engine
• Combined hash/NFA technology
• 9.6 Gbps raw performance
• Max 32K patterns of up to 128 B length
• Patterns
− Patt1 /free/ tag=0x0001
− Patt2 /freescale/ tag=0x0002
• KES
− Compares the hash value of incoming data (frames) against all patterns
• DXE
− Retrieves the pattern with a matched hash value for a final comparison
• SRE
− Optionally post-processes the match result before sending the report to the CPU
[Diagram: PME: the pattern matcher frame agent (PMFA) pulls frames from QMan/BMan over the on-chip system bus (CoreNet); the key element scanning engine (KES) checks incoming data against hash tables, the data examination engine (DXE) fetches candidate pattern descriptors from DDR memory for a final comparison, and the stateful rule engine (SRE) produces user-definable reports. Example: flow A (192.168.1.1:80 -> 10.10.10.100:16734) carries "I want to search free" in FD1 and "scale FTF 2014 event schedule" in FD2, matching Patt1 /free/ (tag 0x0001)]
Debug
Core Debug in Multi-Thread Environment
• Almost all resources are private. Internal debug works as if the threads were separate cores
• External debug is private per thread. An option exists to halt both threads when one thread halts
− While threads can be debug-halted individually, that is generally not very useful if the debug session cares about the contents of the MMU and caches
− Halting both threads prevents the other thread from continuing to compute and essentially cleaning the L1 caches and the MMU of the state of the thread that initiated the debug halt
DPAA Debug trace
• During packet processing, the FMan can trace the packet processing flow through each of the FMan modules and trap a packet.
Summary
QorIQ T4 Series Advanced Features Summary

Feature: High perf/watt
Benefit:
• 188k CoreMark in 55W = 3.4 CM/W
• Compare to Intel E5-2650: 146k CM in 95W = 1.5 CM/W
• Or Intel E5-2687W: 200k CM in 150W = 1.3 CM/W
• T4 is more than 2x better than E5
• 2x perf/watt compared to P4080, FSL's previous flagship

Feature: Highly integrated SoC
Benefit: Integration of 4x 10GE interfaces, local bus, Interlaken, and sRIO means fewer chips (it takes at least four chips with Intel) and higher performance density

Feature: Sophisticated PCIe capability
Benefit:
• SR-IOV for showing VMs a virtual NIC, 128 VFs (Virtual Functions)
• Four ports with the ability to be root complex or endpoint for flexible configurations

Feature: Advanced Ethernet
Benefit:
• Data Center Bridging for lossless Ethernet and QoS
• 10GBase-KR for backplane connections

Feature: Secure Boot
Benefit: Prevents code theft, system hacking, and reverse engineering

Feature: AltiVec
Benefit: On-board SIMD engine for sonar/radar and imaging

Feature: Power Management
Benefit:
• Thread, core, and cluster deep sleep modes
• Automatic deep sleep of unused resources

Feature: Advanced virtualization
Benefit:
• Hypervisor privilege level enables a safe guest OS at high performance
• IOMMU ensures memory accesses are restricted to the correct area
• Virtualization of I/O blocks

Feature: Hardware offload
Benefit:
• Packet handling to 50 Gb/s
• Security engine to 40 Gb/s
• Data compression and decompression to 20 Gb/s
• Pattern matching to 10 Gb/s

Feature: 3x Scalability
Benefit:
• 1-, 2-, and 3-cluster solutions span a 3x performance range over T4080 - T4240
• Enables customers to develop multiple SKUs from one PCB
Other Sessions And Useful Information
• FTF2014 Sessions for QorIQ T4 Devices
− FTF-NET-F0070_QorIQ Platforms Trust Arch Overview
− FTF-NET-F0139_AltiVec_Programming
− FTF-NET-F0146_Introduction_to_DPAA
− FTF-NET-F0147-DPAAusage
− FTF-NET-F0148_DPAA_Debug
− FTF-NET-F0157_QorIQ Platforms Trust Arch Demo & Deep Dive
• T4240 Product Website
− http://www.freescale.com/webapp/sps/site/prod_summary.jsp?code=T4240
• Online Training
− http://www.freescale.com/webapp/sps/site/prod_summary.jsp?code=T4240&tab=Design_Support_Tab
Introducing The
QorIQ LS2 Family
Breakthrough,
software-defined
approach to advance
the world’s new
virtualized networks
• New, high-performance architecture built with ease-of-use in mind: a groundbreaking, flexible architecture that abstracts hardware complexity and enables customers to focus their resources on innovation at the application level
• Optimized for software-defined networking applications: balanced integration of CPU performance with network I/O and C-programmable datapath acceleration that is right-sized (power/performance/cost) to deliver advanced SoC technology for the SDN era
• Extending the industry's broadest portfolio of 64-bit multicore SoCs: built on the ARM® Cortex®-A57 architecture with an integrated L2 switch enabling interconnect and peripherals to provide a complete system-on-chip solution
QorIQ LS2 Family Key Features
Unprecedented performance and
ease of use for smarter, more
capable networks
High performance cores with leading
interconnect and memory bandwidth
• 8x ARM Cortex-A57 cores, 2.0 GHz, 4 MB L2 cache, with Neon SIMD
• 1MB L3 platform cache w/ECC
• 2x 64b DDR4 up to 2.4GT/s
A high performance datapath designed
with software developers in mind
• New datapath hardware and abstracted
acceleration that is called via standard Linux
objects
• 40 Gbps Packet processing performance with
20Gbps acceleration (crypto, Pattern
Match/RegEx, Data Compression)
• Management complex provides all
init/setup/teardown tasks
Leading network I/O integration
• 8x1/10GbE + 8x1G, MACSec on up to 4x 1/10GbE
• Integrated L2 switching capability for cost savings
• 4 PCIe Gen3 controllers, 1 with SR-IOV support
• 2 x SATA 3.0, 2 x USB 3.0 with PHY
Applications: SDN/NFV, switching, data center, wireless access
See the LS2 Family First in the Tech Lab!
4 new demos built on QorIQ LS2 processors:
Performance Analysis Made Easy
Leave the Packet Processing To Us
Combining Ease of Use with Performance
Tools for Every Step of Your Design
© 2014 Freescale Semiconductor, Inc. | External Use
www.Freescale.com
QorIQ T4240 SerDes Options (total of four x8 banks)
High speed serial
• 2.5 , 5, 8 GHz for PCIe
• 2.5, 3.125, and 5 GHz for sRIO
• 3.125, 6.25, and 10.3125 GHz for
Interlaken
• 1.5, 3.0 GHz for SATA
• 1.25, 2.5, 3.125, and 5 GHz for
debug
Ethernet options:
• 10Gbps Ethernet MACs with XAUI
or XFI
• 1Gbps Ethernet MACs with SGMII
(1 lane at 1.25 GHz with 3.125
GHz option for 2.5Gbps Ethernet)
• 2 MACs can be used with
RGMII
• 4 x1Gbps Ethernet MACs can be
supported using a single lane at 5
GHz (QSGMII)
• HiGig is supported with 4 lanes at 3.125 GHz or 3.75 GHz (HiGig+)
Decompression Compression Engine
• Zlib: As specified in RFC1950
• Deflate: As specified in RFC1951
• GZIP: As specified in RFC1952
• Encoding
− supports Base 64 encoding and decoding (RFC4648)
• ZLIB, GZIP and DEFLATE header insertion
• ZLIB and GZIP CRC computation and insertion
• 4 modes of compression
− No compression (just add DEFLATE header)
− Encode only using static/dynamic Huffman codes
− Compress and encode using static OR dynamic Huffman codes
− at least 2.5:1 compression ratio on the Calgary Corpus
• All standard modes of decompression
− No compression
− Static Huffman codes
− Dynamic Huffman codes
• Provides option to return original compressed Frame along with the uncompressed Frame or release the buffers to BMAN
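For host-side interoperability testing it can help to see how the three framings differ. The sketch below uses the standard zlib library (not the DCE): the windowBits argument alone selects RFC1950 zlib, RFC1951 raw deflate, or RFC1952 gzip output.

    #include <zlib.h>
    #include <string.h>

    /* Compress one buffer with the requested framing:
     * window_bits = 15 -> zlib (RFC1950), -15 -> raw deflate (RFC1951),
     * 15 + 16 -> gzip (RFC1952). Returns the output length or -1 on error. */
    static int compress_with_framing(int window_bits,
                                     const unsigned char *in, unsigned in_len,
                                     unsigned char *out, unsigned out_len)
    {
        z_stream zs;
        memset(&zs, 0, sizeof(zs));
        if (deflateInit2(&zs, Z_DEFAULT_COMPRESSION, Z_DEFLATED,
                         window_bits, 8, Z_DEFAULT_STRATEGY) != Z_OK)
            return -1;
        zs.next_in   = (unsigned char *)in;
        zs.avail_in  = in_len;
        zs.next_out  = out;
        zs.avail_out = out_len;
        int rc = deflate(&zs, Z_FINISH);
        deflateEnd(&zs);
        return rc == Z_STREAM_END ? (int)zs.total_out : -1;
    }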
[Block diagram: DCE: frame agent with QMan and BMan interfaces and a bus interface to CoreNet; compressor and decompressor blocks with 32 KB and 4 KB history buffers]