© 2016 NETRONOME SYSTEMS, INC.
Ron Swartzentruber, Senior Principal Engineer, Silicon Development
9/8/2016
SoC Solutions Enabling Server-Based Networking
The Challenge
Demands on silicon have dramatically increased due to the rapid pace of innovation in the fields of software-defined networks and network functions virtualization
Current server-based solutions do not efficiently handle the applications they need to run
▶ Low throughput of the server-based networking datapath limits application performance
▶ High CPU load of server-based networking limits the compute available to applications
Economics of scale require that applications run on commercial-off-the-shelf hardware instead of traditional, expensive datacenter networking equipment
The continued need for higher overall network bandwidth outpaces Moore's law
Higher packet processing performance is now required to meet the classification, filtering and forwarding demands of the latest technologies
▶ Brought on by Open vSwitch, Contrail vRouter, OpenStack and P4 applications
The Solution
1. Develop the silicon and software together to form an efficient and cohesive solution
2. Lower cost of ownership by offloading datapath processing to efficient network flow processors connected to standard server platforms
▶ Improve the efficiency of server-based networking
3. Design a modular, chip-multithreaded, 200Gb/s Network Flow Processor
▶ Distribute datapath packet processing to large pools of processor engines
▶ Meet bandwidth needs with multiple high-speed I/O and large internal memories
▶ Programmable to allow new features to be deployed rapidly
4. Develop software that transparently offloads and accelerates networking data plane functions
5. Enable the open source community to easily and rapidly test and deploy next generation network technologies
Background: About Netronome
Inventor of the Network Flow Processor and pioneer of hardware-accelerated server-based networking
Provider of commercial-off-the-shelf intelligent server adapters for the data center
▶ Delivering significantly higher performance for x86 environments
▶ Production-ready software
▶ Programmable silicon
Solutions for software-defined networks that optimize security, load balancing and virtualization
Supporter of the academic and research community in open source projects through Open-NFP
What is Server-Based Networking?
Leverages open source networking software used in servers
Transparently offloads and accelerates networking data plane functions such as virtual switching, virtual routing, connection tracking and virtual network functions
Open vSwitch Example
[Diagram: a compute node running VMs. The OVS datapath (match tables, actions, tunnels) in the Linux kernel is transparently offloaded to the Agilio CX OVS datapath, which handles tunnels, statistics updates and delivery to the host, with SR-IOV connectivity to the VMs.]
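To make the offloaded datapath concrete, below is a minimal C sketch of the match/action/tunnel flow in the diagram. Every type and helper name here (flow_key, process_packet, the action stubs) is a hypothetical illustration, not Netronome's or OVS's actual code.

#include <stdint.h>
#include <stddef.h>
#include <string.h>

struct flow_key {                 /* fields parsed from packet headers */
    uint32_t src_ip, dst_ip;
    uint16_t src_port, dst_port;
    uint8_t  proto;
};

enum action { ACT_FORWARD, ACT_TUNNEL_ENCAP, ACT_DELIVER_TO_VM };

struct flow_entry {               /* one row of a match table */
    struct flow_key key;
    enum action     act;
    uint32_t        out_port;     /* egress port, tunnel id or VM id */
    uint64_t        pkts, bytes;  /* per-flow statistics */
};

#define TABLE_SIZE 1024
static struct flow_entry match_table[TABLE_SIZE];

/* Stub actions; a real datapath would transmit, encapsulate, or DMA. */
static int send_to_port(uint32_t port)      { (void)port; return 0; }
static int tunnel_encap_send(uint32_t tun)  { (void)tun;  return 0; }
static int deliver_to_vm_sriov(uint32_t vm) { (void)vm;   return 0; }
static int slow_path_to_host_ovs(void)      { return 0; }

/* Toy hash; the NFP would use a hardware hash accelerator instead. */
static size_t hash_key(const struct flow_key *k)
{
    return (k->src_ip ^ k->dst_ip ^ k->src_port ^ k->dst_port ^ k->proto)
           % TABLE_SIZE;
}

/* Hit: update statistics and apply the action. Miss: punt the packet
 * to the host OVS slow path, which is what keeps the offload
 * transparent to the software above it. (A real implementation would
 * compare key fields, not raw bytes.) */
int process_packet(const struct flow_key *key, size_t pkt_len)
{
    struct flow_entry *e = &match_table[hash_key(key)];

    if (memcmp(&e->key, key, sizeof *key) != 0)
        return slow_path_to_host_ovs();           /* miss */

    e->pkts++;                                    /* update statistics */
    e->bytes += pkt_len;

    switch (e->act) {
    case ACT_TUNNEL_ENCAP:  return tunnel_encap_send(e->out_port);
    case ACT_DELIVER_TO_VM: return deliver_to_vm_sriov(e->out_port);
    default:                return send_to_port(e->out_port);
    }
}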
Per Server CPU Core Efficiency
[Chart: throughput with a single server CPU core, in millions of packets per second]
• 50X Efficiency Gain vs. Kernel OVS
• 20X Efficiency Gain vs. User OVS
https://www.netronome.com/media/redactor_files/WP_OVS_Benchmarking.pdf
NFV Use Case: 2 Mpps (2,000 Kpps) per VNF or Application

OVS on Server with Traditional NIC (20 servers with 2x40GbE per rack)
▶ Server core allocation: OVS 16 cores, VMs 8 cores
▶ 9.6 Mpps of VXLAN processing per server
▶ 4 apps or VNFs per server at 2 Mpps
▶ Rack throughput: 168 Mpps; VNFs per rack: 80
▶ Racks needed to support 220 VNFs: 220 / 80 = 2.8

OVS on Server with Netronome Agilio Platform (20 servers with 2x40GbE per rack)
▶ Server core allocation: VMs 23 cores (OVS datapath offloaded to the Agilio adapter)
▶ 22 Mpps of VXLAN processing per server
▶ 11 apps or VNFs per server at 2 Mpps
▶ Rack throughput: 440 Mpps; VNFs per rack: 220
▶ Racks needed to support 220 VNFs: 220 / 220 = 1

Result: 2.8 racks vs. 1 rack for the same 220 VNFs, roughly a 3X lower TCO
The SoC Solution
The Network Flow Processor Architecture
• Hardware accelerators perform compute-intensive functions such as hashing, crypto, CAM lookups and atomic operations
• Delivers multi-terabit bidirectional bandwidth between processing elements
• Avoids bus contention and saturation issues
• Packets autonomously pushed to processing cores
• Pool of highly multi-threaded parallel processing cores
• Production-ready OVS and vRouter datapath code
• Datapath extensibility using P4 and C programming tools
• Multi-threaded memory engines and banks of SRAM tightly coupled with atomic and other hardware accelerator functions
Latency tolerant: multi-threading between processing cores, H/W accelerators and memory banks
Delivers the highest scale and best price-performance
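As a rough illustration of the "packets autonomously pushed to processing cores" model above, here is a small C sketch under stated assumptions: hardware deposits work descriptors on a shared ring, and a pool of worker threads pulls from it. The ring layout and all names are hypothetical, not the NFP's actual mechanism.

#include <stdint.h>
#include <stdatomic.h>

struct work { uint32_t pkt_handle; uint16_t len; uint16_t port; };

#define RING_SLOTS 256
static struct work ring[RING_SLOTS];
static _Atomic uint32_t head;   /* advanced by hardware (producer)   */
static _Atomic uint32_t tail;   /* shared by the pool of workers     */

/* Each of the many worker threads runs this loop: no core owns a NIC
 * queue; work is pulled from the shared pool, so load balances itself
 * across however many threads the pool contains. */
void worker_loop(void (*process)(struct work *))
{
    for (;;) {
        uint32_t t = atomic_load(&tail);
        if (t == atomic_load(&head))
            continue;                      /* ring empty: hardware will refill */
        /* Claim one descriptor. On real hardware this claim would be a
         * single atomic-engine operation rather than a CAS retry loop. */
        if (atomic_compare_exchange_weak(&tail, &t, t + 1))
            process(&ring[t % RING_SLOTS]);
    }
}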
The Flow Processing Core
Flow Processing Core
▶ The principal data processing element inside the NFP
▶ 8K-instruction control store, with the capability to share
▶ 40-bit address space
▶ Eight processing threads, each with unique wake-up control, state and PC
▶ Two-cycle switch between contexts
▶ 6-stage main pipeline
▶ 32-bit ALU with shift, multiply and CAM
▶ Easily programmable using assembly, C or P4
Latency Tolerant Processing
Multiple Parallel Processing Threads
Delays are incurred to/from hardware accelerators and memory
Threads can be de-scheduled or yielded while waiting
Result: latency is hidden from the software application
[Diagram: a Flow Processing Core (FPC) with 8 threads. While one FPC thread waits out the latency to accelerators (CRC, hash, LUT, XOR, prefix match) and internal/external memory, the other threads keep executing.]
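The pattern above can be sketched in C. The intrinsics below (mem_read_async, ctx_wait) are hypothetical stand-ins for the toolchain's real ones, stubbed in software so the sketch is self-contained and compilable.

#include <stdint.h>

typedef struct { int done; uint64_t data; } signal_t;

static uint64_t fake_external_memory[1024];   /* stand-in for DDR */

/* Hypothetical: start a long-latency external-memory read and return
 * immediately; hardware raises `sig` when the data arrives. */
static void mem_read_async(uint64_t index, signal_t *sig)
{
    sig->data = fake_external_memory[index];  /* instant in this stub */
    sig->done = 1;
}

/* Hypothetical: de-schedule this thread until `sig` fires. On the FPC
 * the two-cycle context switch lets one of the other seven threads run
 * for the entire memory round trip. */
static void ctx_wait(signal_t *sig)
{
    while (!sig->done)
        ;                                     /* real hardware yields here */
}

/* Read one 8-byte flow record without stalling the core: issue the
 * access, yield, resume when the data is back. */
uint64_t read_flow_record(uint32_t flow_id)
{
    signal_t sig = { 0, 0 };
    mem_read_async(flow_id % 1024, &sig);     /* issue */
    ctx_wait(&sig);                           /* yield instead of stall */
    return sig.data;
}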
Memory-Centric Processing
Switch fabric interface
▶ 2 billion commands per second
▶ 500Gb/s data bandwidth

Multi-bank SRAM
▶ Eight crossbar inputs
▶ Eight transactions per cycle
▶ 1 Tb/s bandwidth

Multiple processing engines
▶ No locking between engines
▶ Different engines in different processing memories in the device
▶ Different engines support different processing operations
▶ Highly threaded to maintain 100% throughput when required
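A brief sketch of what "no locking between engines" buys: a shared counter updated by a single engine-side atomic command instead of a lock-protected read-modify-write. The intrinsic name is a hypothetical stand-in, modeled here with a C11 atomic so the sketch compiles as plain C.

#include <stdint.h>
#include <stdatomic.h>

/* Hypothetical stand-in: on the NFP this would be one command sent to
 * the memory unit's atomic engine, which performs the add next to the
 * data; here a C11 atomic models the same effect. */
static inline void mem_atomic_add64(_Atomic uint64_t *addr, uint64_t val)
{
    atomic_fetch_add_explicit(addr, val, memory_order_relaxed);
}

_Atomic uint64_t rx_bytes;   /* shared counter in a processing memory */

void account_packet(uint32_t len)
{
    /* Many engines/threads can do this concurrently with no locking:
     * each add is a single transaction serialized by the memory bank. */
    mem_atomic_add64(&rx_bytes, len);
}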
Memory Hierarchy
Philosophy: processing in the optimal location; process data where the data resides

• External DDR memory units (EMU): locks, hash tables, microqueues, linked lists, rings, recursive lookups; >300 different processing operations; >200 threads per unit
• Internal memory units (IMU): locks, hash tables, microqueues, recursive lookups, statistics, load balancing; >300 different processing operations; >200 threads per unit
• Cluster Target Memory (CTM): locks, hash tables, microqueues; packet buffering, delivery, transmit offload; rings; >250 different processing operations; >100 threads per unit
• Cluster Local Scratch (CLS): locks, hash tables, microqueues; rings, stacks; regular expression NFA; >100 different processing operations
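To illustrate "process data where the data resides", the sketch below places each structure in the tier matching its size and access pattern. The region qualifiers are modeled loosely on micro-C-style placement attributes but are defined as no-op macros here so this stays plain, compilable C; treat the exact spellings as assumptions.

#include <stdint.h>

#define EMEM  /* external DDR (EMU): biggest capacity, highest latency */
#define IMEM  /* internal SRAM (IMU): statistics, load balancing       */
#define CTM   /* per-island memory: packets currently in flight        */
#define CLS   /* cluster local scratch: smallest, lowest latency       */

EMEM static uint64_t flow_table[1 << 20];   /* millions of flows: DDR  */
IMEM static uint64_t port_stats[64];        /* hot shared counters: IMU */
CTM  static uint8_t  pkt_buf[2048];         /* packet being processed  */
CLS  static uint32_t ring_head, ring_tail;  /* per-cluster ring state  */

The placement mirrors the hierarchy above: large, colder state in EMU, hot shared counters in IMU, in-flight packet data in CTM, and per-cluster scratch in CLS.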
Fabric Interconnect
Distributed switch fabric
▶ 6-way crossbar routing
▶ 768Gb/s bandwidth across each island

Island-based design methodology

Island interconnect at fixed pin locations, connected by abutment
▶ Fabric ports
▶ Register interface
▶ Interrupts and events
▶ Test logic
Island APR Block Topology
Modular
▶ Allows software to scale as processing requirements increase

Re-usable
▶ Blocks can be replaced and interchanged across the floorplan
[Diagram: island floorplan built from interchangeable block types]
Technology
Intel 22nm
▶ Intel 3D Tri-Gate transistors manufactured on a 22nm process
▶ 37% performance increase at low voltage (0.7V)
▶ 50% power reduction at typical performance vs. 32nm

Specifics
▶ Low-leakage SoC process
▶ Foundry support for industry-standard SoC development tools
SoC Verification
Today’s SoCs require co-verification of silicon and software
Simulation and emulation are both required to fully verify the design
Enable server-based networking software applications to run pre-silicon in order to prove out the design
Scalable test environment
▶ Python used to create the Verilog module and test bench
▶ Instantiated UVCs based on the I/Os of interest
[Diagram: island-level testbench with UVCs attached to the I/Os of interest]
Software Emulation
Run tests at 500 to 2,000X the speed of simulation by using emulation
Run real-world software applications to validate performance and find potential bottlenecks
Test many thousands of packets in a fraction of the time
Make/run environment that allows any SW engineer to test NFP application code pre-silicon
Treat the DUT as a "SmartNIC" connected to a VM via a PCIe Speedbridge interface and loaded via the external PCIe interface
[Diagram: "Host PCIe to Network with External Memory": a PCIe BFM drives the DUT's PCIe I/O, with Ethernet network interfaces and external DDR memory attached.]
Open-NFP www.open-nfp.org
Support and grow reusable research in accelerating dataplane network functions processing
Reduce/eliminate the cost and technology barriers to research in this space
• Technologies: P4, SDN, OpenFlow, Open vSwitch (OVS) offload
• Tools: Discounted hardware, development tools, software, cloud access
• Community: Website (www.open-nfp.org): learning & training materials, active Google group https://groups.google.com/d/forum/open-nfp, open project descriptions, code repository
• Learning/Education/Research support: summer seminar series, developer conferences, tutorials, and support for research proposals to the NSF and state agencies
Advanced Networking Seminars
Development Platforms Available
Conference Attendees/Open-NFP Projects*
[Logos: participating universities and companies]
*This does not imply that these organizations endorse Open-NFP or Netronome
Thank You