
Overcoming Network Challenges when Pooling Disaggregated Resources at Scale

We often hear the term “web-scale” used in the context of IT transformation and cloud adoption. When IT professionals use the term, they usually mean they hope to attain the same scalability, agility and cost savings achieved by hyperscale cloud companies such as Amazon, Google, Microsoft, Alibaba and Facebook. All of these organizations use “scale-out” architectures in their cloud environments, allowing them to increase data center performance and capacity by adding resources horizontally, in identical units of infrastructure, rather than vertically by increasing the capability of individual nodes. Scale-out is now the model many smaller organizations hope to deploy on their journey toward better performance and greater efficiency.

The scale-out approach offers many advantages, and it began with the deployment of Hyper-Converged Infrastructure (HCI). By combining compute, storage, and networking components into a single server and using software to provision them, HCI allows data center operators to increase capacity based on the needs of each application.

In recent years, HCI has in many ways been supplanted by composable disaggregated infrastructure (CDI), which disaggregates compute, storage, and networking components into separate servers that can be pooled on demand. This allows for greater efficiency and higher utilization, avoids vendor lock-in, and enables faster, more agile deployment.

CDI takes scale-out infrastructure to the next level by exploiting unified APIs and automation, enabling separate components to be scaled independently across the entire data center. However, once these components are disaggregated, data that used to travel on the internal PCIe bus flows across the network instead, significantly increasing east-west traffic in data centers.

The network is therefore the key enabler for next-generation data centers. To allow the pooling of resources, data center components must be connected over a spine-leaf fabric that is non-blocking, high-bandwidth, and low-latency, and the network must meet stringent service level agreements.

This white paper has three main goals: to discuss common challenges in the network that must be addressed to enable large-scale pooling of disaggregated resources; to examine challenges specific to storage, GPU and memory pooling; and to explain what data-centric engines such as the Data Processing Unit (DPU) must offer to address these challenges.




But to take advantage of this new infrastructure model, modern networks must overcome several significant challenges outlined below.

Rethinking the challenges that networking needs to resolve

The growing scale of data centers and the diverse traffic patterns driven by cloud applications have disrupted long-standing network design principles. The new challenges posed by scale-out computing mean that data center operators often face trade-offs among reliability, agility, cost and performance.

A prime example is bandwidth utilization. To sustain high throughput and low latency while meeting horizontal scale-out requirements, networks are typically built as 2- or 3-tier non-blocking fabrics with full bisection bandwidth. While this design accommodates both north-south and east-west traffic, it can also suffer increased congestion as the network scales out, due to hot spots created by incast traffic.
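As a rough illustration, the sketch below (in Python, with hypothetical port counts and speeds) shows the arithmetic behind the term non-blocking: a leaf switch whose uplink capacity matches its server-facing downlink capacity has an oversubscription ratio of 1:1.

    # A minimal sketch, with hypothetical port counts and speeds, of how
    # leaf-switch oversubscription is computed. Non-blocking means the
    # uplink capacity matches the downlink capacity (ratio 1.0).
    SERVERS_PER_LEAF = 32      # 32 x 100 GbE server-facing downlinks
    DOWNLINK_GBPS = 100
    UPLINKS_PER_LEAF = 8       # 8 x 400 GbE spine-facing uplinks
    UPLINK_GBPS = 400

    downlink_capacity = SERVERS_PER_LEAF * DOWNLINK_GBPS   # 3200 Gbps
    uplink_capacity = UPLINKS_PER_LEAF * UPLINK_GBPS       # 3200 Gbps
    print(f"oversubscription {downlink_capacity / uplink_capacity:.1f}:1")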

To accommodate unexpected bursty flows and absorb congestion, most large-scale data centers try to keep bandwidth utilization low. This practice may allow data center operators to maintain performance, but it comes at a cost: over-provisioning and paying for bandwidth that sits under-utilized. Even then, there is no guarantee that congestion can be kept under control. Data center operators need to architect and invest in better solutions to address this limitation.

Let's examine some of the common challenges in more detail.

Network congestion and long tail latency. For most data centers, the truest measure of performance is ‘tail latency’ – the delay experienced by the slowest percentiles of traffic. One typical cause of congestion is incast, and the more congested the network, the harder tail latency is to predict. While there are remedies for congestion – such as Data Center TCP (DCTCP), Data Center Quantized Congestion Notification (DCQCN), and RDMA over Converged Ethernet (RoCEv2) – these solutions bring new challenges of their own, such as complex provisioning and PFC pause-frame storms, while bursty traffic still inevitably leads to congestion.
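To make tail latency concrete, here is a minimal sketch of how it is measured, using synthetic round-trip samples in place of real probe data:

    # A minimal sketch of tail-latency measurement. The RTT samples are
    # synthetic (log-normal) stand-ins for real probe data.
    import random

    random.seed(1)
    samples_us = [random.lognormvariate(3.0, 0.6) for _ in range(100_000)]

    def percentile(data, pct):
        s = sorted(data)
        return s[min(len(s) - 1, int(pct / 100.0 * len(s)))]

    for pct in (50, 99, 99.9):
        print(f"p{pct}: {percentile(samples_us, pct):.1f} us")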

Large failure domain. Hardware failures are inevitable, which is why network operators always try to minimize the impact (or blast radius) a single point of failure can have on the network. But traditional high-availability designs are too costly and complex to deploy at scale, so the industry is exploring alternatives, one of which is a fabric built from modular switches using single-chip packet processors, the approach known as “F16” implemented by Facebook. Even so, a large blast radius cannot be entirely avoided in certain layers, such as Top-of-Rack (ToR) switches, where massive redundancy is still necessary.

Hash collisions caused by ECMP. Equal-cost multi-path (ECMP) is a routing technique that hashes each flow onto one of several equal-cost paths through the fabric, keeping a flow’s packets on a single path while spreading load for better balancing and higher bandwidth utilization. ECMP becomes problematic when the dominant traffic pattern mixes “elephant” and “mice” flows. This is not uncommon in data centers, where workloads such as VM migration or MapReduce for Hadoop create large continuous flows of data that co-exist with short-lived flows having tighter latency constraints. Hash collisions between multiple “elephant” flows can create traffic imbalance and congestion on certain paths, leading to poor overall network utilization. Moreover, “mice” flows that share the congested paths suffer longer latencies and packet drops. Adding entropy fields to packet headers can help spread “elephant” flows more evenly, but even enhanced ECMP cannot guarantee consistent jitter across multiple paths. The result is more buffering on the receiving host; when buffers are depleted, packets are dropped and retransmitted, and network performance suffers.
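A minimal sketch of the mechanism (the CRC32 hash and key format are illustrative; real switches hash in hardware): every packet of a flow carries the same 5-tuple and therefore always selects the same uplink, so two elephant flows that happen to hash to the same index share, and can saturate, a single path.

    # A minimal sketch of ECMP path selection. The CRC32 hash and the
    # 5-tuple key format are illustrative; switches hash in hardware.
    import zlib

    N_UPLINKS = 4

    def ecmp_uplink(src_ip, dst_ip, src_port, dst_port, proto="tcp"):
        key = f"{proto}|{src_ip}:{src_port}|{dst_ip}:{dst_port}".encode()
        return zlib.crc32(key) % N_UPLINKS

    # Two long-lived "elephant" flows: if both map to the same uplink,
    # that path congests while the other uplinks sit under-utilized.
    for flow in [("10.0.0.1", "10.0.1.9", 49152, 4420),
                 ("10.0.0.2", "10.0.1.9", 49153, 4420)]:
        print(flow, "-> uplink", ecmp_uplink(*flow))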

Slow failure detection and recovery. Failure recovery in a large-scale data center is complicated and time-consuming. To minimize the impact on applications, most network operators choose to re-route traffic around the failure point as soon as it is detected, then work on analyzing the root cause. The time taken to detect and recover is typically in the range of tens of seconds.
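The arithmetic behind those numbers is simple; in the sketch below (timer values hypothetical), detection time is roughly the keepalive interval multiplied by the dead-interval multiplier, which is why software timers land in the tens of seconds.

    # A minimal sketch of failure-detection budgets; timer values are
    # hypothetical. Detection takes roughly interval x dead-multiplier.
    def detection_time_s(hello_interval_s, dead_multiplier):
        return hello_interval_s * dead_multiplier

    print(f"routing-protocol keepalives: {detection_time_s(10, 4):.2f} s")
    print(f"aggressive BFD-style hellos: {detection_time_s(0.05, 3):.2f} s")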

Unreliable retransmission. Networking solutions like RoCEv2 are designed on the assumption that the network is lossless. In practice, packet drops are often inevitable, so networking teams must invest a lot of effort in developing tools to handle them – particularly silent packet drops – to avoid slowdowns in application throughput and performance. Retransmission handled at the transport layer would perform significantly better than retransmission at higher layers. Unfortunately, UDP lacks a retransmission mechanism altogether, and while TCP has one, it is complex and its tail latency is hard to predict.


Enabling Effective Resource Pooling at Scale

Most large network operators are familiar with the advantages of scale-out infrastructure, whether HCI or CDI. Using software-defined technology to pool available physical resources allows administrators to provision resources on demand as workload needs change. This usually leads to greater efficiency and higher utilization of CPU, GPU, memory, storage, and other dispersed resources.

However, pooling these resources at scale comes with its own network challenges:

Storage pooling. The concept of pooling storage is not new. With Storage Area Networks (SAN), operators pooled hard disk drives, allowing applications to consume storage on demand as their needs dictated. But scale-out workloads and cloud-native applications require much faster solid-state drive (SSD) interconnect technology, such as non-volatile memory express (NVMe), and emerging technologies such as NVMe over Fabrics (NVMe-oF) enable NVMe SSDs to be disaggregated from compute nodes as pooled storage. NVMe eliminates the performance bottleneck of HDD controllers and provides much better throughput, but it also places new demands on the network, because users expect networked storage to behave like direct-attached storage. End-to-end latency, bandwidth utilization and congestion control become even more important.
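As a back-of-the-envelope illustration (all figures hypothetical), the sketch below shows the latency budget of a remote NVMe read and why the fabric’s contribution must stay small relative to the device itself:

    # A minimal sketch, with hypothetical figures, of the latency budget
    # for reading a pooled NVMe SSD over the fabric versus locally.
    local_read_us = 80.0                       # flash read via local PCIe

    fabric_overhead_us = {
        "NIC/DPU processing, both ends": 10.0,
        "two switch hops, cut-through":   2.0,
        "~100 m fiber propagation":       1.0,
    }
    remote_read_us = local_read_us + sum(fabric_overhead_us.values())
    print(f"local {local_read_us:.0f} us -> remote {remote_read_us:.0f} us "
          f"(+{remote_read_us / local_read_us - 1:.0%})")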

GPU pooling. The explosion of AI applications in video surveillance, facial recognition, and machine learning has driven up GPU usage. In current deployments, GPUs rely on a PCIe switch to connect to the CPU (where the CUDA application runs) and are connected to other GPUs in the same server using proprietary technology such as NVLink for internal pooling. Both PCIe and NVLink work only at short range, which limits GPU pooling to a single server or, at best, a rack. For large-scale parallel computations where GPUs need to be pooled across racks, operators can use GPUDirect RDMA to connect GPUs over the network fabric. However, GPUDirect RDMA is limited to GPU-to-GPU data transfer and does not truly support full disaggregation and composability of GPU resources, so the resource-efficiency benefit of GPU pooling still cannot be realized. Other approaches, such as remote CUDA, can virtualize GPU resources at the application level, but response times and software dependencies create issues and impose limitations.

Memory pooling. A relatively new use case is memory pooling, particularly when Storage Class Memory (SCM) is deployed. Pooling memory makes a lot of sense economically, but the latency requirement (1 us or less) poses an enormous challenge to network design. Research on memory pooling is ongoing and will likely uncover additional challenges.
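A quick budget calculation (per-hop costs hypothetical; propagation in fiber is roughly 5 ns per metre) shows how little room a 1 us target leaves – even a short round trip can consume the entire budget before the memory access itself is counted:

    # A minimal sketch of a ~1 us memory-pooling budget; per-hop costs
    # are hypothetical, propagation in fiber is ~5 ns per metre.
    BUDGET_NS = 1000
    switch_ns, nic_ns = 300, 250     # one cut-through switch, one NIC/DPU
    fiber_m = 30                     # rack-adjacent pool

    round_trip_ns = 2 * (switch_ns + nic_ns) + 2 * fiber_m * 5
    print(f"round trip ~{round_trip_ns} ns of a {BUDGET_NS} ns budget")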

From compute-centric to data-centric

Today’s networks suffer from several architectural limitations that inevitably introduce latency when moving data at scale. Perhaps the biggest bottleneck is that nearly every transaction must pass through general-purpose CPUs that are not optimized for data transfer.

Fungible believes that over the coming decade the paradigm for data centers will evolve from compute-centric to data-centric. That shift will require a new type of programmable processor specifically designed for handling data-centric workloads, including data transfer, reduction, security, durability, filtering and analytics. Deploying data-centric engines such as the DPU can address the limitations discussed here and also provide several broad advantages:

More predictable network performance. A programmable chip that lets data center operators control congestion end-to-end can significantly reduce tail latency. This technology should be applicable in an efficient and scalable way, from a few racks up to thousands of racks.
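One way to picture such end-to-end control is a receiver-driven, credit-based scheme – a sketch of the general idea only, not Fungible’s specific implementation: senders transmit only against credits granted by the receiver, so even heavy incast cannot oversubscribe the receiver’s line rate.

    # A minimal sketch of receiver-driven, credit-based congestion
    # control -- the general idea only, not any specific implementation.
    RECEIVER_LINE_RATE_GBPS = 100

    def grant_credits(senders):
        # Split the receiver's line rate evenly across active senders so
        # incast from many sources never oversubscribes the last hop.
        share = RECEIVER_LINE_RATE_GBPS / max(1, len(senders))
        return {sender: share for sender in senders}

    print(grant_credits(["h1", "h2", "h3", "h4"]))   # 25 Gbps each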



Enhanced reliability. With current software-based approaches, the network can take seconds to recover from a failure. Modern data centers require ultra-fast failure recovery mechanisms that bring recovery time down to the microsecond level. A hardware and software co-optimized solution is therefore needed to balance ultra-fast convergence with flexible provisioning and management.

Lower total cost of ownership. By employing a more granular approach that avoids the “elephant” and “mice” flow problems, data center operators could significantly improve network bandwidth utilization without a forklift upgrade. Flattening the network fabric architecture could further cut infrastructure cost by removing the ToR layer.

Greater scalability. Technologies like RoCEv2 can mitigate the impact of congestion on servers, but they rely on a coarse, priority-based mechanism and therefore work well only in relatively small networks. Improvements such as IRN (Improved RoCE NIC) have been proposed, but they come at a cost. Data center operators need a solution that is highly granular and can scale to networks of tens of thousands of servers while guaranteeing ultra-low latency and reliable transmission.

For cloud companies, large enterprises and communications service providers, the speed at which data moves through the network will make the difference between a reliable, flexible, and scalable infrastructure and a bloated, costly, inefficient one.

The flexibility and economic benefits of resource pooling are simply too compelling to ignore, but challenges remain. We believe Fungible’s data-centric approach to these challenges will enable resource pooling at a scale far beyond what is possible today.

About the Author

Bojian Wang, Principal Solution Architect

Bojian Wang is the Principal Solution Architect at Fungible, responsible for the go-to-market of technical solutions covering use cases in compute and virtualization, networking, disaggregated storage and GPU pooling. Prior to joining Fungible, Bojian was ranked in the top 10 percent of Cisco’s technical talent, where he served as the lead solution architect for large cloud service providers and web companies in the U.S. and China. Bojian specializes in cloud technology, networking disaggregation and virtualization. His industry expertise is often applied to optimizing business strategy and sales lifecycle management. Bojian received a bachelor’s degree in Computer Science and a master’s degree in Computer Science and Robotics Automation from Nankai University.

Fungible, Inc. | 3201 Scott Blvd. | Santa Clara, CA 95054 | 669-292-5522

Copyright © 2019 Fungible, Inc. All Rights Reserved. | www.fungible.com

WP0008.00.91020701

