© 2016 NETRONOME SYSTEMS, INC.
Ron Swartzentruber, Senior Principal Engineer, Silicon Development
9/8/2016
SoC Solutions Enabling Server-Based Networking
The Challenge
Demands on silicon have dramatically increased due to the rapid pace of innovation in the fields of software-defined networks and network functions virtualization
Current server-based solutions do not efficiently handle the applications they need to run
▶ Low throughput of the server-based networking datapath limits application performance
▶ High CPU load of server-based networking limits the compute available to applications
Economics of scale require that applications run on commercial-off-the-shelf hardware instead of traditional, expensive datacenter networking equipment
The continued need for higher overall network bandwidth outpaces Moore's law
Higher packet processing performance is now required to meet the classification, filtering and forwarding demands of the latest technologies
▶ Brought on by Open vSwitch, Contrail vRouter, OpenStack and P4 applications
The Solution
1. Develop the silicon and software together to form an efficient and cohesive solution
2. Lower cost of ownership by offloading datapath processing to efficient network flow processors connected to standard server platforms
▶ Improve the efficiency of server-based networking
3. Design a modular, chip-multithreaded, 200Gb/s Network Flow Processor
▶ Distribute datapath packet processing to large pools of processor engines
▶ Meet bandwidth needs with multiple high-speed I/O and large internal memories
▶ Programmable to allow new features to be deployed rapidly
4. Develop software that transparently offloads and accelerates networking data plane functions
5. Enable the open source community to easily and rapidly test and deploy next generation network technologies
Background: About Netronome
Inventor of the Network Flow Processor and pioneer of hardware-accelerated server-based networking
Provider of commercial-off-the-shelf intelligent server adapters for the data center
▶ Delivering significantly higher performance for x86 environments
▶ Production-ready software
▶ Programmable silicon
Solutions for software-defined networks that optimize security, load balancing and virtualization
Supporter of the academic and research community in open source projects through Open-NFP
What is Server-Based Networking?
Leverages open source networking software used in servers
Transparently offloads and accelerates networking data plane functions such as virtual switching, virtual routing, connection tracking and virtual network functions
Open vSwitch Example
[Diagram: a compute node running VMs. The OVS datapath (match tables, actions, tunnels) in the Linux kernel is transparently offloaded to the Agilio CX OVS datapath, which handles tunnels, statistics updates and delivery to the host, with SR-IOV connectivity to the VMs.]
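To make the offloaded datapath concrete, below is a minimal C sketch of the match/action/tunnel flow in the diagram. Every type and helper name here (flow_key, process_packet, the action stubs) is a hypothetical illustration, not Netronome's or OVS's actual code.

#include <stdint.h>
#include <stddef.h>
#include <string.h>

struct flow_key {                 /* fields parsed from packet headers */
    uint32_t src_ip, dst_ip;
    uint16_t src_port, dst_port;
    uint8_t  proto;
};

enum action { ACT_FORWARD, ACT_TUNNEL_ENCAP, ACT_DELIVER_TO_VM };

struct flow_entry {               /* one row of a match table */
    struct flow_key key;
    enum action     act;
    uint32_t        out_port;     /* egress port, tunnel id or VM id */
    uint64_t        pkts, bytes;  /* per-flow statistics */
};

#define TABLE_SIZE 1024
static struct flow_entry match_table[TABLE_SIZE];

/* Stub actions; a real datapath would transmit, encapsulate, or DMA. */
static int send_to_port(uint32_t port)      { (void)port; return 0; }
static int tunnel_encap_send(uint32_t tun)  { (void)tun;  return 0; }
static int deliver_to_vm_sriov(uint32_t vm) { (void)vm;   return 0; }
static int slow_path_to_host_ovs(void)      { return 0; }

/* Toy hash; the NFP would use a hardware hash accelerator instead. */
static size_t hash_key(const struct flow_key *k)
{
    return (k->src_ip ^ k->dst_ip ^ k->src_port ^ k->dst_port ^ k->proto)
           % TABLE_SIZE;
}

/* Hit: update statistics and apply the action. Miss: punt the packet
 * to the host OVS slow path, which is what keeps the offload
 * transparent to the software above it. (A real implementation would
 * compare key fields, not raw bytes.) */
int process_packet(const struct flow_key *key, size_t pkt_len)
{
    struct flow_entry *e = &match_table[hash_key(key)];

    if (memcmp(&e->key, key, sizeof *key) != 0)
        return slow_path_to_host_ovs();           /* miss */

    e->pkts++;                                    /* update statistics */
    e->bytes += pkt_len;

    switch (e->act) {
    case ACT_TUNNEL_ENCAP:  return tunnel_encap_send(e->out_port);
    case ACT_DELIVER_TO_VM: return deliver_to_vm_sriov(e->out_port);
    default:                return send_to_port(e->out_port);
    }
}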
Per Server CPU Core Efficiency
[Chart: throughput with a single server CPU core, in millions of packets per second]
• 50X Efficiency Gain vs. Kernel OVS
• 20X Efficiency Gain vs. User OVS
https://www.netronome.com/media/redactor_files/WP_OVS_Benchmarking.pdf
NFV Use Case: 2 Mpps (2,000 Kpps) per VNF or Application

OVS on Server with Traditional NIC (20 servers with 2x40GbE per rack)
▶ Server core allocation: OVS 16 cores, VMs 8 cores
▶ 9.6 Mpps of VXLAN processing per server
▶ 4 apps or VNFs per server at 2 Mpps
▶ Rack throughput: 168 Mpps; VNFs per rack: 80
▶ Racks needed to support 220 VNFs: 220 / 80 = 2.8

OVS on Server with Netronome Agilio Platform (20 servers with 2x40GbE per rack)
▶ Server core allocation: VMs 23 cores (OVS datapath offloaded to the Agilio adapter)
▶ 22 Mpps of VXLAN processing per server
▶ 11 apps or VNFs per server at 2 Mpps
▶ Rack throughput: 440 Mpps; VNFs per rack: 220
▶ Racks needed to support 220 VNFs: 220 / 220 = 1

Result: 2.8 racks vs. 1 rack for the same 220 VNFs, roughly a 3X lower TCO
The SoC Solution
The Network Flow Processor Architecture
• Hardware accelerators perform compute-intensive functions such as hashing, crypto, CAM lookups and atomic operations
• Delivers multi-terabit bidirectional bandwidth between processing elements
• Avoids bus contention and saturation issues
• Packets autonomously pushed to processing cores
• Pool of highly multi-threaded parallel processing cores
• Production-ready OVS and vRouter datapath code
• Datapath extensibility using P4 and C programming tools
• Multi-threaded memory engines and banks of SRAM tightly coupled with atomic and other hardware accelerator functions
Latency tolerant: multi-threading between processing cores, H/W accelerators and memory banks
Delivers the highest scale and best price-performance
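As a rough illustration of the "packets autonomously pushed to processing cores" model above, here is a small C sketch under stated assumptions: hardware deposits work descriptors on a shared ring, and a pool of worker threads pulls from it. The ring layout and all names are hypothetical, not the NFP's actual mechanism.

#include <stdint.h>
#include <stdatomic.h>

struct work { uint32_t pkt_handle; uint16_t len; uint16_t port; };

#define RING_SLOTS 256
static struct work ring[RING_SLOTS];
static _Atomic uint32_t head;   /* advanced by hardware (producer)   */
static _Atomic uint32_t tail;   /* shared by the pool of workers     */

/* Each of the many worker threads runs this loop: no core owns a NIC
 * queue; work is pulled from the shared pool, so load balances itself
 * across however many threads the pool contains. */
void worker_loop(void (*process)(struct work *))
{
    for (;;) {
        uint32_t t = atomic_load(&tail);
        if (t == atomic_load(&head))
            continue;                      /* ring empty: hardware will refill */
        /* Claim one descriptor. On real hardware this claim would be a
         * single atomic-engine operation rather than a CAS retry loop. */
        if (atomic_compare_exchange_weak(&tail, &t, t + 1))
            process(&ring[t % RING_SLOTS]);
    }
}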
The Flow Processing Core
Flow Processing Core
▶ The principal data processing element inside the NFP
▶ 8K-instruction control store, with the capability to share
▶ 40-bit address space
▶ Eight processing threads, each with unique wake-up control, state and PC
▶ Two-cycle switch between contexts
▶ 6-stage main pipeline
▶ 32-bit ALU with shift, multiply and CAM
▶ Easily programmable using assembly, C or P4
Latency Tolerant Processing
Multiple Parallel Processing Threads
Delays are incurred to/from hardware accelerators and memory
Threads can be de-scheduled or yielded while waiting
Result: latency is hidden from the software application
[Diagram: a Flow Processing Core (FPC) with 8 threads. While one FPC thread waits out the latency to accelerators (CRC, hash, LUT, XOR, prefix match) and internal/external memory, the other threads keep executing.]
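The pattern above can be sketched in C. The intrinsics below (mem_read_async, ctx_wait) are hypothetical stand-ins for the toolchain's real ones, stubbed in software so the sketch is self-contained and compilable.

#include <stdint.h>

typedef struct { int done; uint64_t data; } signal_t;

static uint64_t fake_external_memory[1024];   /* stand-in for DDR */

/* Hypothetical: start a long-latency external-memory read and return
 * immediately; hardware raises `sig` when the data arrives. */
static void mem_read_async(uint64_t index, signal_t *sig)
{
    sig->data = fake_external_memory[index];  /* instant in this stub */
    sig->done = 1;
}

/* Hypothetical: de-schedule this thread until `sig` fires. On the FPC
 * the two-cycle context switch lets one of the other seven threads run
 * for the entire memory round trip. */
static void ctx_wait(signal_t *sig)
{
    while (!sig->done)
        ;                                     /* real hardware yields here */
}

/* Read one 8-byte flow record without stalling the core: issue the
 * access, yield, resume when the data is back. */
uint64_t read_flow_record(uint32_t flow_id)
{
    signal_t sig = { 0, 0 };
    mem_read_async(flow_id % 1024, &sig);     /* issue */
    ctx_wait(&sig);                           /* yield instead of stall */
    return sig.data;
}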
Memory-Centric Processing
Switch fabric interface
▶ 2 billion commands per second
▶ 500Gb/s data bandwidth

Multi-bank SRAM
▶ Eight crossbar inputs
▶ Eight transactions per cycle
▶ 1 Tb/s bandwidth

Multiple processing engines
▶ No locking between engines
▶ Different engines in different processing memories in the device
▶ Different engines support different processing operations
▶ Highly threaded to maintain 100% throughput when required
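A brief sketch of what "no locking between engines" buys: a shared counter updated by a single engine-side atomic command instead of a lock-protected read-modify-write. The intrinsic name is a hypothetical stand-in, modeled here with a C11 atomic so the sketch compiles as plain C.

#include <stdint.h>
#include <stdatomic.h>

/* Hypothetical stand-in: on the NFP this would be one command sent to
 * the memory unit's atomic engine, which performs the add next to the
 * data; here a C11 atomic models the same effect. */
static inline void mem_atomic_add64(_Atomic uint64_t *addr, uint64_t val)
{
    atomic_fetch_add_explicit(addr, val, memory_order_relaxed);
}

_Atomic uint64_t rx_bytes;   /* shared counter in a processing memory */

void account_packet(uint32_t len)
{
    /* Many engines/threads can do this concurrently with no locking:
     * each add is a single transaction serialized by the memory bank. */
    mem_atomic_add64(&rx_bytes, len);
}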
Memory Hierarchy
Philosophy: processing in the optimal location; process data where the data resides

• External DDR memory units (EMU): locks, hash tables, microqueues, linked lists, rings, recursive lookups; >300 different processing operations; >200 threads per unit
• Internal memory units (IMU): locks, hash tables, microqueues, recursive lookups, statistics, load balancing; >300 different processing operations; >200 threads per unit
• Cluster Target Memory (CTM): locks, hash tables, microqueues; packet buffering, delivery, transmit offload; rings; >250 different processing operations; >100 threads per unit
• Cluster Local Scratch (CLS): locks, hash tables, microqueues; rings, stacks; regular expression NFA; >100 different processing operations
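To illustrate "process data where the data resides", the sketch below places each structure in the tier matching its size and access pattern. The region qualifiers are modeled loosely on micro-C-style placement attributes but are defined as no-op macros here so this stays plain, compilable C; treat the exact spellings as assumptions.

#include <stdint.h>

#define EMEM  /* external DDR (EMU): biggest capacity, highest latency */
#define IMEM  /* internal SRAM (IMU): statistics, load balancing       */
#define CTM   /* per-island memory: packets currently in flight        */
#define CLS   /* cluster local scratch: smallest, lowest latency       */

EMEM static uint64_t flow_table[1 << 20];   /* millions of flows: DDR  */
IMEM static uint64_t port_stats[64];        /* hot shared counters: IMU */
CTM  static uint8_t  pkt_buf[2048];         /* packet being processed  */
CLS  static uint32_t ring_head, ring_tail;  /* per-cluster ring state  */

The placement mirrors the hierarchy above: large, colder state in EMU, hot shared counters in IMU, in-flight packet data in CTM, and per-cluster scratch in CLS.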
Fabric Interconnect
Distributed switch fabric
▶ 6-way crossbar routing
▶ 768Gb/s bandwidth across each island

Island-based design methodology

Island interconnect at fixed pin locations, connected by abutment
▶ Fabric ports
▶ Register interface
▶ Interrupts and events
▶ Test logic
Island APR Block Topology
Modular
▶ Allows software to scale as processing requirements increase

Re-usable
▶ Blocks can be replaced and interchanged across the floorplan
[Diagram: island floorplan built from interchangeable block types]
Technology
Intel 22nm
▶ Intel 3D Tri-Gate transistors manufactured on a 22nm process
▶ 37% performance increase at low voltage (0.7V)
▶ 50% power reduction at typical performance vs. 32nm

Specifics
▶ Low-leakage SoC process
▶ Foundry support for industry-standard SoC development tools
SoC Verification
Today’s SoCs require co-verification of silicon and software
Simulation and emulation are both required to fully verify the design
Enable server-based networking software applications to run pre-silicon in order to prove out the design
Scalable test environment
▶ Python used to create the Verilog module and test bench
▶ Instantiated UVCs based on the I/Os of interest
[Diagram: island-level testbench with UVCs attached to the I/Os of interest]
Software Emulation
Run tests at 500 to 2,000X the speed of simulation by using emulation
Run real-world software applications to validate performance and find potential bottlenecks
Test many thousands of packets in a fraction of the time
Make/run environment that allows any SW engineer to test NFP application code pre-silicon
Treat the DUT as a "SmartNIC" connected to a VM via a PCIe Speedbridge interface and loaded via the external PCIe interface
[Diagram: "Host PCIe to Network with External Memory": a PCIe BFM drives the DUT's PCIe I/O, with Ethernet network interfaces and external DDR memory attached.]
Open-NFP www.open-nfp.org
Support and grow reusable research in accelerating dataplane network functions processing
Reduce/eliminate the cost and technology barriers to research in this space
• Technologies: P4, SDN, OpenFlow, Open vSwitch (OVS) offload
• Tools: Discounted hardware, development tools, software, cloud access
• Community: Website (www.open-nfp.org): learning & training materials, active Google group https://groups.google.com/d/forum/open-nfp, open project descriptions, code repository
• Learning/Education/Research support: summer seminar series, developer conferences, tutorials, and support for research proposals to the NSF and state agencies
Advanced Networking Seminars
Development Platforms Available
Conference Attendees/Open-NFP Projects*
[Logos: participating universities and companies]
*This does not imply that these organizations endorse Open-NFP or Netronome
Thank You