
Copyright © 2020. All Rights Reserved. | www.fungible.com

There are few experiences as inspiring and fulfilling as creating a new market category, especially when it involves an era-defining technology. The idea of a programmable Data Processing Unit (DPU) is gaining rapid recognition as the critical missing element in data centers designed for the data-centric era—it has been termed the “third socket” inside data centers after CPUs and GPUs. Thus, it is timely to lay out the factors driving the need for this new category of microprocessor, to define its key attributes, and to describe the role it will play in data centers. We will do this from the perspective of the Fungible DPU™, which we believe is the first clean-sheet, fundamentals-based design to address the problems facing data centers.

The Problem

Developing technologies that advance the state of the art must begin with the “why”, not the “what”. Why are existing solutions inadequate? Why is a fundamentally new approach required? We begin by answering these basic questions by highlighting sweeping industry trends that have been gaining momentum for some time now, and today present a deepening crisis for scale-out data center infrastructure:

1. The pervasive use of cloud-native scale-out applications, driving an exponential increase in network traffic within data centers, between data centers, and between data centers and end users.

2. The exponential growth in the amount of data that needs to be stored, processed, and moved inside data centers.

3. The increase in the frequency and sophistication of cyber threats, resulting in an ever-growing cost of managing the security of data, both in motion and at rest.

4. The diversity of modern workloads, bringing about an unsustainable increase in the number of server variants, threatening the simplicity and agility that were the very basis of scale-out architectures.

Early point solutions used FPGAs, PCIe switches, and smartNICs: FPGAs to accelerate specific functions, PCIe switches for interconnection, and smartNICs to offload computations from general-purpose x86 cores to cheaper general-purpose ARM cores. We will examine why these early solutions failed to address the problems comprehensively.

The Status Quo

Modern cloud-native applications are written as microservices distributed across network-connected servers. As such, these applications place heavy demands on network resources, particularly the portions implemented in software on general-purpose processors. Many modern applications also need to process large amounts of data: data that cannot fit in a single server and therefore needs to be “sharded”, or spread across many servers. These application trends drive up the intensity of an increasingly important class of computations we call data-centric. These computations are characterized by the following attributes:

About the Author

Wael Noureddine, Chief Architect

Wael Noureddine is one of the founders of Fungible, and serves as Chief Architect responsible for the overall architecture of the Fungible DPU, encompassing silicon and software.

Wael has nearly two decades of experience in the design and implementation of high performance protocols and data processors, with emphasis on hardware-software co-design and application level quality of service. Wael specializes in data center networking, data security, computer architecture and high speed protocol processing. Before Fungible, Wael was Chief Architect and Vice President of Technology at Chelsio, Inc., where he led the architecture and implementation of advanced protocol processing engines, and drove the development of the world’s fastest TCP/IP stack.

Wael received Ph.D. and M.S. degrees in Electrical Engineering from Stanford University, and a Bachelor of Computer and Communications Engineering degree from the American University of Beirut. He holds more than 40 patents and patent applications.

Whitepaper

These trends have been visible for some time, but the problems stemming from them showed up first in hyperscale data centers. Early attempts to solve these problems were based on incremental point solutions using FPGAs, PCIe switches, and smartNICs.

1. A computation is divided into time-disjoint “tasklets” that modify state¹.

2. Tasklets can be as short as 100 “standard” microprocessor instructions.

3. Tasklets belonging to many computational contexts need to be processed concurrently.

4. The ratio of data bandwidth to arithmetic bandwidth is medium to high.

¹ Examples include termination of network transport, stateful firewalls, and storage processing.
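The four attributes above can be made concrete with a small sketch. The names below (`FlowState`, `count_packet`, the event-driven scheduler) are illustrative inventions, not Fungible APIs: they show how short, state-modifying tasklets from many concurrent contexts might be interleaved by a run-to-completion dispatcher.

```python
from collections import deque

class FlowState:
    """Per-context state modified by successive tasklets (attribute 1)."""
    def __init__(self):
        self.packets = 0
        self.bytes = 0

def count_packet(state, length):
    """A tiny tasklet: a handful of instructions that run to completion (attribute 2)."""
    state.packets += 1
    state.bytes += length

def run_scheduler(events):
    """Interleave tasklets from many flows (attribute 3).
    Each event is (flow_id, payload_length); tasklets are dispatched
    in arrival order and never block."""
    flows = {}
    queue = deque(events)
    while queue:
        flow_id, length = queue.popleft()
        state = flows.setdefault(flow_id, FlowState())
        count_packet(state, length)   # run to completion, no context switch
    return flows

# Packets from three flows arrive interleaved on the wire.
flows = run_scheduler([("a", 100), ("b", 60), ("a", 40), ("c", 1500), ("b", 60)])
```

Note that the work per event is dominated by moving and touching data rather than arithmetic, which is the high data-to-arithmetic bandwidth ratio of attribute 4.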

The Fungible DPU™: A New Category of Microprocessor


The proportion of data-centric computations compared to all other computations in a given data center has been growing steadily. An educated guess is that today well over 25% of the energy in scale-out data centers is spent on such computations, and this proportion is only expected to grow as scale-out architectures are adopted universally.

For the first decade and a half of scale-out architectures (2000-2015), data-centric computations were performed almost entirely by general-purpose CPUs. As inefficiencies were discovered in this approach, hyperscale companies adopted a patchwork of point solutions using smartNICs, PCIe switches and FPGAs. This patchwork approach did not address the growing problems comprehensively, but it did contribute to the increasing complexity of hyperscale data centers. To lay the groundwork for why a new category of microprocessor is needed, it is instructive to examine why these piecemeal approaches failed.

To begin with, we note that modern CPU architectures (x86, ARM) have been optimized relentlessly over a period of four decades to do one thing well: execute end user applications as fast as possible. This required them to focus on providing the fastest single-threaded performance possible with available technology. Unfortunately, this focus also makes them inefficient for data-centric computations.

For many years, raw performance improvements in CMOS technology, combined with the adoption of high-performance features invented in CPUs during the mainframe era, masked the problem of inefficient execution. As these “Moore’s Law” improvements have slowed, it has become increasingly and painfully clear that the solution to executing data-centric computations lies elsewhere. What delayed this realization is that general-purpose programmability is such an extraordinarily flexible and powerful tool that there is every incentive to solve all problems with this one solution.

One approach was to evolve NICs to become “smart” by adding general-purpose CPU cores and a few hardwired accelerators to the existing network datapath. Unfortunately, these smartNICs did little more than integrate new accelerators and CPU cores into an existing hardwired datapath in a loosely coupled manner. While smartNICs do offload the hardwired portions of computations efficiently, the loose coupling makes for a brittle design: as long as the computation to be performed can be handled by the hardwired datapath, performance is adequate; the moment flexibility is needed and the CPU cores are involved, performance falls off a cliff. SmartNICs are little more than an exercise in integrating off-the-shelf computational units onto a single piece of silicon. They fail to recognize that data-centric computations are a legitimate and important class that demands architectural innovation in silicon and software. Software provides flexibility and agility, while silicon provides speed. Regrettably, smartNICs offer one or the other, not both.

It is worth noting that smartNICs initially did allow hyperscalers to arbitrage the price differential between x86 cores and ARM cores, but this opportunity has diminished rapidly as competitive dynamics have taken hold.
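The “cliff” can be illustrated with simple arithmetic. The numbers below are hypothetical, not measurements of any particular smartNIC: assume the hardwired path handles a packet in 50 ns and the CPU exception path takes 2,000 ns, and compute the blended per-packet cost.

```python
def effective_ns_per_packet(fast_ns, slow_ns, slow_fraction):
    """Average per-packet cost when a fraction of traffic falls off the
    hardwired fast path onto the CPU exception path."""
    return (1 - slow_fraction) * fast_ns + slow_fraction * slow_ns

fast, slow = 50.0, 2000.0          # hypothetical path costs in nanoseconds
all_fast = effective_ns_per_packet(fast, slow, 0.0)    # every packet on the fast path
ten_pct  = effective_ns_per_packet(fast, slow, 0.10)   # 10% need the CPU path

# Even though 90% of packets still take the fast path, the average cost
# roughly quintuples, so sustained throughput drops by the same factor.
slowdown = ten_pct / all_fast
```

This is why a design that is only fast when the hardwired datapath suffices cannot claim to cover data-centric computations in general.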

Another approach has been to leverage FPGAs to offload specific functions from general-purpose CPUs, and even to “soften” the hardwired portions of smartNICs. While FPGAs are a legitimate way to combine flexibility and speed when prototyping and validating hardware designs, they have severe drawbacks when it comes to widespread deployment². The fact is that FPGA technology provides neither the flexibility of a fully programmable solution nor the performance of a hardwired solution, which is why it has typically been used either for prototyping or in areas where compromise is acceptable. FPGAs require hardware design expertise to express computational intent (think writing in Verilog) and detailed knowledge of critical timing paths to meet performance goals. This makes development time significantly longer than writing software. On the performance-per-dollar front, FPGAs start out at least a decimal order of magnitude³ behind optimized hardware designs, because the primary technique they use for flexibility is a programmable interconnect connecting low-level hardware building blocks.

A third approach was to connect storage and computational resources either directly to the PCIe pins of x86 CPUs or via PCIe switches. The storage resources are typically SSDs, and the computational engines are GPUs or TPUs that enhance the vector floating point capabilities used in graphics, analytics, and machine learning applications. This hyperconverged approach has multiple drawbacks. The first is that data movement between the compute and storage elements needs to go through the x86 CPUs, which is inefficient. The second is that the computational and storage resources are trapped behind the x86 CPUs and not readily usable by other hyperconverged CPUs. The third is that the storage required by many applications does not fit inside a single server. As a result, this hyperconverged approach is falling out of favor and being replaced by a disaggregated approach, in which the computational and storage resources are placed in different types of servers and made available as services over the network.

While the disaggregated approach promises efficiencies through pooling of resources across nodes in a data center, it runs headlong into the problem of inefficient disaggregation over the data center network. Despite many years of effort to tune the TCP/IP stack and alternative efforts to provide patchwork solutions to what are at heart fundamental network problems, efficient disaggregation is still

² The fact that they have been deployed in spite of these problems is a testament to the importance of data-centric computations.

³ This number varies by application, of course, but no one who has worked with both FPGAs and ASICs will dispute this claim.


not possible: it has been hindered by the lack of availability of a true data center fabric. The Fungible DPU addresses this problem once and for all by embedding inside it a scalable standards-based technology called TrueFabric™⁴.

The Fungible DPU aims to provide a comprehensive solution to these problems by using a fundamentals-based, clean-sheet design uncluttered by legacy considerations. It is critically important to point out that the two capabilities of the Fungible DPU (efficient execution of data-centric computations, and efficient disaggregation of resources via TrueFabric) are highly synergistic, and need to be implemented in a single piece of silicon for maximum benefit.

The Solution

Fungible was founded in 2015 with the mission to revolutionize the performance, economics, reliability and security of all scale-out data centers. To do this, we needed to invent a fundamentally new infrastructure building block: the Fungible Data Processing Unit and its associated software. We chose the term “Data Processing Unit” deliberately, to underscore the fact that this new element processes data in a programmable way, rather than just passing it along.

It is important for us to point out that the Fungible DPU is intended to complement rather than replace the CPU, which remains the primary engine for general-purpose application processing. Nor does it replace other application specific processors, such as GPUs and TPUs.

Inside an application server, the Fungible DPU sits between the application processors and the network, playing the dual role of offloading data-centric computations from the application processors and implementing the endpoint of TrueFabric. In a storage server, the Fungible DPU sits between the storage devices and the network, implementing the target-side functionality of a storage system as well as the endpoint of TrueFabric. In these roles, the Fungible DPU enables efficient disaggregation of both computational and storage resources across an entire data center, a concept we call hyperdisaggregation. The figure below illustrates this:

The Fungible DPU therefore allows operators to build general and powerful data centers using only a small number of server types: a few application servers and a few storage servers. Thanks to the disaggregation and composability enabled by the Fungible DPU, an operator will be able to construct a “bare metal data center” tailored to a particular workload on demand, within a few minutes, by assembling available compute and storage resources, independently of where they are physically located within the data center. This powerful paradigm is the holy grail of data center infrastructure and the key to achieving significantly better economics. The Fungible DPU provides the means to attain not only this important goal but also high performance, improved reliability and strong security.
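The composition step can be sketched as a simple allocator. Everything here (the resource records, the `compose` function, the server names) is a hypothetical illustration of the idea, not Fungible's orchestration API: a workload asks for compute cores and storage capacity, and the allocator assembles them from pooled servers regardless of physical location.

```python
def compose(workload, compute_pool, storage_pool):
    """Assemble a 'bare metal data center' from free pooled resources.
    workload: dict with required 'cores' and 'tb' (terabytes).
    Pools are lists of (server_id, free_amount) tuples."""
    cores_needed, tb_needed = workload["cores"], workload["tb"]
    picked_compute, picked_storage = [], []
    for server, cores in compute_pool:
        if cores_needed <= 0:
            break
        take = min(cores, cores_needed)     # take what this server can give
        picked_compute.append((server, take))
        cores_needed -= take
    for server, tb in storage_pool:
        if tb_needed <= 0:
            break
        take = min(tb, tb_needed)
        picked_storage.append((server, take))
        tb_needed -= take
    if cores_needed > 0 or tb_needed > 0:
        return None  # not enough free resources anywhere in the data center
    return {"compute": picked_compute, "storage": picked_storage}

# A workload needing 96 cores and 30 TB is composed from two application
# servers and two storage servers, wherever they happen to sit.
dc = compose({"cores": 96, "tb": 30},
             compute_pool=[("app-1", 64), ("app-2", 64)],
             storage_pool=[("stor-1", 20), ("stor-2", 20)])
```

The point of the sketch is that composition is a bookkeeping operation over network-visible pools; it is the fabric underneath, not the allocator, that makes the assembled resources perform as if they were local.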

The Third Socket

GPUs faced drawn-out resistance for over a decade before becoming an accepted second socket in servers. This time around, the industry has been quicker to recognize the DPU as a required third socket inside servers. However, to earn its place alongside the CPU and GPU, the DPU must likewise be programmable and flexible, as well as highly efficient at executing its targeted set of computations.

The x86 CPU has maintained its central position thanks to its flexibility, its rich development ecosystem and, until recently, predictable performance improvements that allowed it to stretch its reach beyond application processing.

On the other hand, the GPU did not transcend its roots as a high-performance graphics processing engine until it incorporated programmability to take full advantage of its SIMD architecture. Since then, the GPU has cemented its position as the choice for computations such as machine learning that make heavy use of vector floating point operations.

The Fungible DPU provides large gains over the CPU when processing data-centric computations, with comparable programmability and flexibility. While the DPU’s development ecosystem is nascent, the Fungible DPU presents an intuitive programming model that is naturally suited to data-centric processing, greatly easing its adoption. In addition, the Fungible DPU embeds TrueFabric, which facilitates highly efficient node-to-node interactions, the foundation for scale-out data centers.

The figure on the next page summarizes the characteristics of the three types of sockets:

⁴ Read the companion paper, TrueFabric™: A Fundamental Advance to the State of the Art in Data Center Networks.


Key Attributes of the Fungible DPU

At the highest level, the Fungible DPU is a system-on-a-chip that consists of a fully programmable control plane running a standard Linux operating system, a fully programmable datapath running on a bare metal run-to-completion operating system called FunOS™, and a full implementation of TrueFabric. These components were designed to provide the following key attributes:

A programmable control plane that can run standard Linux applications

• The Fungible DPU’s control plane must support standard Linux applications with minimal effort. This control plane must also interact efficiently with the embedded data plane, and flexibly with orchestration systems.

A programmable data plane that covers the full spectrum of data-centric computations

• In order to take on the responsibility for performing all data-centric computations, the data plane of the DPU must be programmable using a high-level language. Its programming model must allow data-centric computations to be expressed naturally and intuitively, and it must be easy to design, develop, and maintain programs. The Fungible DPU fully satisfies these requirements. In contrast, architectures that restrict datapath flexibility to low-level hardware configuration languages will never be able to cover the full spectrum of data-centric computations.

• The DPU must excel at both stateful and stateless computations. Specifically, stateful processing of highly multiplexed packet streams is an essential prerequisite for taking on the entirety of data-centric processing. This capability is difficult to shoehorn into an approach that uses a pipeline of hardware blocks configurable via P4 or a similar language, and impossible in a hardwired approach. The ability to implement arbitrary new stateful computations, where many different computations need to be executed concurrently, is where most architectures falter. The Fungible DPU excels at these types of computations because it was designed from inception with this goal in mind.

No compromise datapath performance to support modern cloud-native workloads

• Modern data center workloads are extremely diverse. As a result, the DPU must support a variety of data-centric computations with no compromise to performance. These computations include infrastructure services such as network, storage, security and virtualization, as well as primitives for applying a range of transformations to data in motion. The Fungible DPU is capable of performing the full spectrum of data-centric computations at high performance.

High performance data interchange between nodes at scales covering over three orders of magnitude

• A fundamental requirement for the network in a scale-out data center is the ability to support scalability across many orders of magnitude, full any-to-any cross-sectional bandwidth, low predictable latency, fairness, congestion avoidance, fault tolerance, and end-to-end software-defined security, all the while supporting industry standards. Surprisingly, the industry has no solution for this problem despite many decades of effort.

• The Fungible DPU implements TrueFabric, a new network technology that fully supports all of the attributes above. The DPU’s role as the access point at the edge of the data center network puts it in a unique position to implement the functionality needed. Indeed, the Fungible DPU transforms a standard IP-over-Ethernet network into a true fabric that acts like the backplane of a large extended computer.

⁵ See Google’s Profiling a Warehouse-Scale Computer for background.
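Stateful processing of a highly multiplexed packet stream can be sketched with a toy stateful-firewall rule, one of the examples this paper cites for data-centric work. The state machine and packet format are invented for illustration: a flow must be seen in an "open" event before its data is accepted, and per-flow state must survive across interleaved packets from many flows.

```python
def stateful_filter(packets):
    """Accept data packets only for flows that previously opened.
    Each packet is (flow_id, kind) with kind in {"open", "data", "close"}.
    Per-flow state is kept across interleaved packets from many flows."""
    state = {}       # flow_id -> "open"
    accepted = []
    for flow, kind in packets:
        if kind == "open":
            state[flow] = "open"
        elif kind == "data" and state.get(flow) == "open":
            accepted.append(flow)
        elif kind == "close":
            state.pop(flow, None)
        # data packets for unknown or closed flows are silently dropped
    return accepted

# Two flows interleaved on the wire: "y" sends data without opening,
# and "x" sends data after closing; both of those are dropped.
result = stateful_filter([("x", "open"), ("y", "data"), ("x", "data"),
                          ("x", "close"), ("x", "data")])
```

A stateless match-action pipeline can classify each packet in isolation, but the accept/drop decision here depends on what earlier packets of the same flow did, which is exactly the kind of computation the text argues a hardwired or P4-style pipeline struggles to express.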


The Fungible DPU Architecture

For most information technologies there is an unavoidable tradeoff between performance and flexibility. Increasing flexibility requires compromises in performance, and vice versa. The figure below shows this tradeoff in the context of a single computing device, with the technology in which the device is implemented held constant. At one end of the spectrum is a high performance but relatively inflexible ASIC implementation. At the other end is a very flexible general-purpose CPU, which can perform any computation, but not at the performance of the ASIC implementation. In between these two extremes lies an FPGA implementation.

What the figure also shows is that it is possible, in a given technology, to move the tradeoff curve up and to the right, but only if one is willing to restrict in some manner the set of computations to be performed. General-purpose CPUs have had such a massive effort applied over so long a period that it is difficult to imagine making large improvements without such a restriction. One notable example of moving the curve up and to the right by restricting the set of computations is the GPU. Although GPUs started out as hardwired graphics pipelines, they gradually morphed into programmable engines without giving up significant performance in exchange.

The Fungible DPU is another example of moving the tradeoff curve up and to the right, but here the set of computations is restricted not to vector floating point operations, as was the case with GPUs, but to data-centric computations.

We assert further that, given the flattening of the Moore’s Law curve, we expect to see a small number of new engines developed along these lines. These engines will be successful commercially only if the computations for which they are specialized are important enough and pervasive enough.

Goals

A fundamental design requirement of the Fungible DPU was to support industry standard interfaces, protocols, operating system drivers, and system APIs, in order to make it easy to insert into existing data center infrastructure.

The following table summarizes the main goals of the architecture, grouped under four themes: high integration, high performance, high flexibility, and high scalability:

• Plug into standard software and system APIs, including support for standard kernel drivers.

• Excel at performing stateful processing of highly multiplexed packetized streams.

• Support standard Ethernet and PCIe interfaces with state-of-the-art hardware flexibility.

• Modular architecture with interlocking but independent building blocks, connected with scalable on-chip fabrics.

• Deliver a comprehensive solution for storage, networking and security workloads, which fully offloads the related stacks from the CPU.

• Easily construct a wide range of device sizes, with capacity from 50G to 800G.

• Support in-line data services such as data security, data durability and data reduction.

• Excel at sub-microsecond granularity tasks⁶, in addition to bandwidth intensive data processing.

• Process data-centric computations an order of magnitude faster than general-purpose CPUs.

• Take on multiple different roles and personalities, in the interfaces it offers and the functions it executes simultaneously.

• Enable standalone as well as attached use cases.

• Provide a reliable, error- and congestion-free, end-to-end secured data center fabric.

• Natively support network virtualization services, and the NVMe and NVMe-oF storage protocols.

• Deliver high performance even when all data services are enabled.

• Support secure boot and a hardware-rooted chain of trust to authenticate all Fungible DPU software.

• Deliver sustained performance in scale-out deployments with tens of thousands of nodes, under real incast conditions.

• Provide a fabric that is able to scale to thousands of racks, while delivering low tail latency at high throughput.

• Enable disaggregation of attached devices and resources across the data center, creating a truly on-demand composable infrastructure.

• Secure data in motion and at rest, with deep packet inspection, dynamic firewalls and application-level filtering.

• Deliver near-local latencies for distributed NVMe storage.

• Easily programmable, using high-level languages (ANSI-C).

• Built upon an intuitive programming model for data-centric computations, enabling quick development and increased software robustness and reliability.

• Flexibility transcends layers, down to accelerators and other hardware blocks.

⁶ See Google’s Attack of the Killer Microseconds for background on data-centric processing demands.


Principles

The Fungible DPU architecture starts with a blank slate and builds a solution from first principles. While the architecture has the ostensibly divergent goals of allowing full software programmability and delivering high performance, Fungible’s approach has the advantage of being unencumbered by legacy hardware or software biases, or attachment to past investments. This fresh perspective was critical in meeting goals that legacy vendors considered unachievable.

The Fungible DPU has a vertically integrated architecture that naturally demands tight hardware and software co-design, along with an in-depth understanding of key use cases, such as networking, security and storage datapath processing. Indeed, Fungible’s solution benefits from numerous groundbreaking innovations in these areas, combining novel and unique architectural features, with hundreds of distinct hardware and software inventions. As discussed below, this was not an easy path to take but it was necessary to solve the overall problem. While the list of innovations in the Fungible DPU cannot be covered in detail in this paper, we list a few highlights of our approach.

Highlights

The Fungible DPU core is built around an industry standard general-purpose multi-threaded processor, combined with fully standard Ethernet and PCIe interfaces. The balance of the hardware consists of custom-designed hardware components, each of which provides the right combination of flexibility and performance. These include:

• Specialized, high performance on-chip fabrics

• Specialized memory systems

• A complete set of flexible data accelerators

• Programmable networking hardware pipelines

• Programmable PCIe hardware pipelines

software plays a critically important role with innovations as unique and powerful as those in the hardware. The hardware and software were co-designed from the start.

The Fungible DPU’s data plane runs on top of a custom designed data-centric operating system called FunOS that natively supports intuitive and efficient run-to-completion programming. Like the hardware on which it runs, FunOS is a clean sheet design tailored to run data-centric computations, modeled on closures and continuations, where all processing is fully asynchronous and lock-less. The model elegantly abstracts hardware and software steps as interchangeable call-continuations that can be seamlessly woven into malleable flows. The hardware-software combination is able to naturally intermix dozens of different computations running hundreds of different instances concurrently. Fungible has developed highly-efficient networking, security and storage stacks that fully leverage this architecture.

Fungible’s decision to develop in-house IP came from the clear, up-front realization that simply integrating an assemblage of off-the-shelf components into an SoC would fail to achieve our goals. SmartNICs are the most visible examples of this approach, and none of them compares with the Fungible DPU in either flexibility or performance. Only by applying the same fundamental principles to the design of every component, both hardware and software, could we build a uniform, cohesive solution that hits our performance and flexibility targets. Such a ground-up approach takes time and effort and is not for the faint of heart, but it has been fundamental to the differentiation provided by the Fungible DPU.

Differentiation

The figure below compares the Fungible DPU and smartNIC approaches in relation to the CPU. It shows how flexible data-centric accelerators are intimately embedded and tightly coupled in the Fungible DPU, while they are loosely attached in the smartNIC. It also shows the smartNIC’s separation between a fast, fixed ASIC path and a weak exception path. This stark contrast is one of the reasons the Fungible DPU can maintain high performance for all data-centric computations.

The data processing machinery translates into thousands of individual processor and accelerator threads, with the collection of threads orchestrated by a global on-chip work scheduler. To fully harness these hardware capabilities, the Fungible DPU’s software was designed hand in hand with this machinery.


CPU vs. SmartNIC vs. Fungible DPU

FUNCTION

• CPU: Acts as a PCIe root complex device only; cannot act as an endpoint device.

• SmartNIC: Offloads networking functions from application processors to an embedded hardware NIC. Does not displace CPUs in a storage server.

• Fungible DPU: Offloads full network, storage, and security stacks from CPUs and other application processors. Can act as a root complex or an endpoint device. Can replace CPUs in a storage server.

FLEXIBILITY (see CALLOUT 1)

• CPU: Highly flexible, fully programmable using high-level languages.

• SmartNIC: Essentially inflexible. Programmability is limited to the control plane; the data plane is not software programmable. Exceptions or unsupported features, including new protocols and stateful services, are handled by embedded cores at a significant cost to performance. Acceleration engines are not properly integrated and cannot be used flexibly within the datapath.

• Fungible DPU: Highly flexible, fully programmable using high-level languages. Both the data plane and the control plane are programmable; changes to protocols, such as adding security checks, encryption/decryption, or congestion and flow control, can be implemented quickly without performance impact. Flexible threading of hardware accelerators and software execution.

PERFORMANCE (see CALLOUT 2)

• CPU: Optimized for application processing, with emphasis on single-threaded performance. Lower IPC for data-centric computations. High context-switching costs limit performance for high-event-rate stream processing. Accelerators are accessed across the PCIe bus, resulting in inefficiencies, high latency, and high costs in memory bandwidth and processing cycles. Performance drops multiplicatively as data processing services are enabled.

• SmartNIC: Performance benefits are limited to network functions implemented in the hardcoded NIC portions of the SoC. Low IPC for data-centric computations in the programmable cores. Performance claims are valid only when the data set fits into on-chip caches; sharp performance cliffs appear when the workload exceeds the limited resources of the hardcoded NIC, when new functions must be implemented, or under real network conditions. Embedded CPU cores are adequate only for control path handling and exception processing. Single-threaded cores with high context-switch latency result in lower performance than the CPU. Performance degrades when multiple data services, such as encryption and compression, are enabled simultaneously.

• Fungible DPU: A run-to-completion OS, co-designed with the Fungible DPU hardware, delivers near-ASIC performance with software flexibility. High IPC for data-centric computations. Multi-threaded cores achieve maximum IPC even when context switching at 100-instruction tasklet granularity. Hardware accelerators are flexibly and seamlessly integrated into the software programming paradigm, resulting in high performance independent of the number of data services enabled. A flexible architecture that can meet and sustain high packet-rate stateful data plane processing.

NETWORK SCALABILITY (see CALLOUT 3)

• CPU: N/A.

• SmartNIC: Data center fabric technologies such as RoCE have well-known scalability limitations. Repeated failures to implement a working fabric protocol despite a decade of attempts. Constrained by the hardcoded implementation, the rate of protocol fixes is capped by the multi-year hardware cycle.

• Fungible DPU: Fungible’s TrueFabric technology delivers a full cross-sectional bandwidth, low-latency, low-jitter, reliable, and secure data center fabric that scales to hundreds of thousands of nodes. Fast iteration through flexible software protocol implementation.
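As noted above, performance on embedded smartNIC cores degrades as more data services are enabled. A back-of-envelope sketch makes the effect concrete (all numbers here are illustrative assumptions, not measurements): when each service adds a fixed per-packet software cost, the sustainable packet rate shrinks with every service turned on.

```python
# Illustrative cost model with assumed numbers, not measured data:
# each enabled service adds a per-packet cycle cost on an embedded
# core, so the sustainable packet rate falls with every service.

CORE_HZ = 1_000_000_000            # 1 GHz embedded core (assumption)
BASE_CYCLES = 500                  # baseline forwarding cost per packet (assumption)

def packet_rate(service_costs):
    """Packets per second given the per-packet cycle costs of enabled services."""
    total_cycles = BASE_CYCLES + sum(service_costs)
    return CORE_HZ / total_cycles

baseline = packet_rate([])                     # forwarding only: 2.0 Mpps
with_crypto = packet_rate([1500])              # + crypto in software: 0.5 Mpps
with_crypto_comp = packet_rate([1500, 2000])   # + compression too: 0.25 Mpps
```

The point of the sketch is qualitative: stacking software-handled services divides the packet rate again and again, whereas accelerators woven directly into the run-to-completion flow keep the per-packet software cost roughly constant.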


CALLOUT 1: CONFIGURABILITY VS. PROGRAMMABILITY

The need to provide true programmability in the Fungible DPU is well understood. As noted in the text, some solutions have gone to great lengths, and incurred high cost, to provide a modicum of programmability at speed. Most solutions, however, merely pay lip service to programmability, either by including a number of cores loosely attached to a hardware interface, or by claiming that hardware configuration languages can implement the complex processing pipelines needed by data-centric computations. These approaches are exemplified by the “RDMA pipeline plus embedded cores” and “P4 pipeline plus embedded cores” products. The P4-based approach may be more flexible than a hardwired RDMA implementation, but both have separate data paths that are essentially decoupled from the programmable cores. Neither can therefore provide true programmability, and neither can support stateful processing of data-centric flows with acceptable performance.

CALLOUT 2: MICROBENCHMARKS VS. REAL PERFORMANCE

Server networking performance has regrettably been reduced to a narrow focus on synthetic micro-benchmarks built around hero cases, such as throughput for 64-byte packets or latency claims measured with two devices attached by a direct wire. While this trend has recently started to wane, such artificial metrics are still publicized by vendors. In practice, real data center application performance is largely uncorrelated with them, because real problems show up only at medium to large scale, where no effective solution exists today. The Fungible DPU’s innovative architecture, combined with TrueFabric, provides the first high-performance, open-standards-based solution that promises to scale to the largest hyperscale deployments.

CALLOUT 3: SCALABILITY WOES

A full decade has elapsed since RoCE was standardized. Despite repeated attempts at addressing network congestion, and numerous research publications, the problem remains effectively unsolved at even modest scales. Today, the lack of a proper fabric solution still limits the scalability of RDMA installations. There are two reasons for this failure: the first has to do with implementation choices, the second is fundamental. First, RoCE implementations are hard-coded and have been driven primarily by the urge to demonstrate low latency in artificial back-to-back settings. As such, they do not scale well with the number of sessions, and in fact exhibit dramatically increased latency and jitter beyond a modest number of sessions. Second, RoCE, like other technologies7 that use hop-by-hop congestion control, is prone to head-of-line blocking and deadlock in large-scale networks. In contrast, Fungible’s solution layers RDMA on top of TrueFabric within a programmable DPU, delivering high scale in both the number of sessions and the number of nodes.
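The head-of-line blocking failure mode can be illustrated with a toy queue (a deliberately simplified illustration, not a model of RoCE or of any specific fabric): with a single shared FIFO and hop-by-hop backpressure, packets destined for a paused receiver stall unrelated packets queued behind them.

```python
from collections import deque

# Toy illustration of head-of-line blocking: one shared FIFO serves
# several destinations, and hop-by-hop backpressure pauses delivery
# to a congested destination. Packets behind a paused packet stall
# even though their own destination is ready.

def drain(fifo, paused):
    """Deliver packets from the shared FIFO, stopping at the first
    packet whose destination is backpressured."""
    delivered = []
    while fifo and fifo[0] not in paused:
        delivered.append(fifo.popleft())
    return delivered

fifo = deque(["A", "B", "A", "A"])   # destination of each queued packet
delivered = drain(fifo, paused={"B"})
# Only the first A-packet gets through; the two later A-packets are
# stuck behind B's packet even though destination A is not congested.
```

At scale, such dependency chains can also close into cycles, which is how hop-by-hop schemes end up deadlock-prone, as the callout notes.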

Fungible, Inc. | 3201 Scott Blvd. | Santa Clara, CA 95054 | 669-292-5522

WP0027.00.02020818

Conclusion

The trends that drove us to create the Fungible DPU show no sign of abating; on the contrary, they are accelerating. The Fungible DPU is well positioned to address the needs of the data-centric era because it solves the two fundamental problems that arise from these trends: inefficient execution of data-centric computations, and inefficient exchange of interactive information between nodes. Solving them promises to dramatically improve both the performance and the economics of scale-out data centers, while also making fundamental improvements to their reliability, scalability, and security.

The Fungible DPU is here today, positioned to occupy the third socket alongside the CPUs and GPUs, and ready to redefine the future of data center infrastructure.

7 This includes InfiniBand, Fibre Channel, and FCoE.

