High Performance Modular Packet Processing with Click and GPU

Weibin Sun Robert Ricci

{wbsun, ricci}@cs.utah.edu University of Utah, School of Computing

Software packet processing, such as the Click modular router, provides much more flexibility than hardware-based processing. Recent advances (Netmap, psio, PF_RING) in software packet I/O have enabled 10Gb/s throughput on 10Gb NICs, with the help of Receive Side Scaling (RSS) on multi-queue NICs and multi-core CPUs, as well as zero-copy buffer sharing between kernel and user modes. With these advances, it is possible to build software packet processing systems that run at line rate. Such systems include intrusion detection, firewalls, VPNs, OpenFlow, IP forwarding, deep packet inspection for ISPs to detect pirated content downloads, and many others.

However, the bottleneck that prevents the packet processing systems listed above from reaching line-rate throughput is now computation. For instance, the pattern matching algorithm for intrusion detection runs at only 2.15Gb/s on an Intel Xeon X5680 core, according to the Kargus research. It takes five such cores to saturate just one 10Gb port. This would require a significant number of dedicated CPU cores for packet processing, which may contend with packet I/O cores and hence lower overall system performance. Moreover, the hardware cost of those multiple CPUs is far higher than another feasible commodity hardware choice: the parallel GPU. Many research projects have shown large performance improvements for various computations on parallel GPUs compared with CPUs. For the pattern matching algorithm, recent work on Kargus shows 39.1Gb/s throughput on a $400 GTX580 GPU, 3x faster than the $1639 six-core Xeon X5680 above. As a result, to build a flexible, high-performance packet processing system that can run at 10Gb/s line rate, we take the modular Click router and integrate GPU computing into it for computation-intensive processing.

To use GPU computing libraries and other userspace libraries, we use usermode Click. Click was not originally designed for usermode zero-copy packet I/O or parallel GPU computing, so we adopt the following techniques to deal with obstacles in Click, multi-queue NICs, and multi-core CPUs: a) GPU Management: a new GPURuntime (specifically, CUDA runtime) Click information element manages GPU computing memory, kernels, and state. Both CUDA mapped, page-locked memory and non-mapped, page-locked memory are used for GPU computing; they have different characteristics that are worth investigating in our evaluation. b) Batching: we use batching to move beyond Click's single-packet processing style. A Batcher element batches packets and also prepares GPU memory; it also supports sliced copies that transfer only a specified range of each packet's data. c) Wider Interface: to perform push and pull on packet batches, we define BElement and BPort, which provide packet batch support. d) Hybrid CPU and GPU Processing: a CGBalancer element load-balances packets between the GPU and the CPU according to a specified policy and system state, providing a flexible GPU offloading mechanism. e) Zero-copy: we use Netmap for zero-copy packet capture and transmission, and the GPU-enabled Click has been modified so that Netmap uses CUDA memory that the GPU driver can DMA to and from; buffers are thus shared, zero-copy, among the GPU driver, the NIC driver, and Click. f) Multi-queue Support: Netmap provides multi-queue packet capture and transmission. To be NUMA-aware for buffers used on different CPU cores, we use CUDA's cudaHostRegister() to pin NUMA-aware allocated memory, so that it can be used for GPU DMA and as zero-copy packet buffers (see the sketch below).
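A minimal sketch of the pinning step in f), assuming libnuma for node-local allocation and the CUDA runtime; the buffer size and NUMA node id are illustrative, and the real integration happens inside Netmap's buffer management rather than in standalone code:

```cpp
// Sketch: allocate a packet buffer on a specific NUMA node, then pin and map
// it for GPU DMA with cudaHostRegister(). Assumes libnuma (-lnuma) and the
// CUDA runtime; buffer size and node id are illustrative only.
#include <numa.h>
#include <cuda_runtime.h>
#include <cstdio>

int main() {
    const size_t buf_size = 4096 * 2048;   // e.g. 4096 slots of 2KB each
    const int node = 0;                    // NUMA node of the core owning this RX queue

    cudaSetDeviceFlags(cudaDeviceMapHost); // allow mapped, page-locked host memory

    // Page-aligned allocation on the chosen NUMA node.
    void *buf = numa_alloc_onnode(buf_size, node);
    if (buf == nullptr) {
        fprintf(stderr, "numa_alloc_onnode failed\n");
        return 1;
    }

    // Pin the buffer and map it into the GPU's address space, so the device
    // can DMA packet data directly out of the zero-copy packet buffer.
    cudaError_t err = cudaHostRegister(buf, buf_size, cudaHostRegisterMapped);
    if (err != cudaSuccess) {
        fprintf(stderr, "cudaHostRegister: %s\n", cudaGetErrorString(err));
        numa_free(buf, buf_size);
        return 1;
    }

    void *dev_ptr = nullptr;
    cudaHostGetDevicePointer(&dev_ptr, buf, 0);  // device-side alias of buf

    // ... hand buf to the packet I/O layer and dev_ptr to GPU kernels ...

    cudaHostUnregister(buf);
    numa_free(buf, buf_size);
    return 0;
}
```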

We have implemented the Click-level improvements, including the GPU-related elements above, and the Netmap/CUDA/Click integration for zero-copy memory sharing. We are developing several computational packet processing elements for evaluation, such as a simple firewall, IPv6 routing, IPsec, VPN, and OpenFlow. Besides these functional applications, we also want to investigate the following questions to fully study and analyze our system: a) compare the costs of a CPU-only system and a GPU-enabled system, and build theoretical cost models for scaling each up; b) compare the different GPU runtime memory types under different workloads and other conditions; c) study the effect of the load balancer policy when running different workloads; d) answer the question: is multi-queue really needed? This question arises from two facts: the Netmap research found that a single CPU core can handle line-rate forwarding, and we now have GPUs dedicated to computation; e) study and measure the scalability of our GPU-enabled Click with respect to the number of NICs.

High Performance Modular Packet Processing with Click and GPU

Weibin Sun Robert Ricci

{wbsun, ricci}@cs.utah.edu School of Computing, University of Utah

Computation      GPU                    CPU                   Speedup
String Matching  39.1 Gbps [1]          2.15 Gbps [1]         18.2
RSA-1024         74,732 ops/s [2]       3,301 ops/s [2]       22.6
AES-CBC Dec      ~32 Gbps [3]           15 Gbps [2]           2.13
HMAC-SHA1        31 Gbps [2]            3.343 Gbps [2]        9.3
IPv6 Lookup      62.4x10^6 ops/s [4]    1.66x10^6 ops/s [4]   37.6

[1]: Kargus [CCS'12], [2]: SSLShader [NSDI'11], [3]: GPUstore [SYSTOR'12], [4]: PacketShader [SIGCOMM'10]

The Problem
CPU limits functionality in line-rate packet processing:

10Gb/s line-rate packet I/O is available [Netmap, psio]. Compute-intensive packet processing:

Intrusion detection
Firewall
Deep packet inspection
VPN, OpenFlow, etc.

Flexible platform needed.

Our Solution
Flexibility: Click modular router!
Compute-intensive processing: faster/more CPU cores?

NO! We have faster and cheaper GPUs.

How much faster can we run with a GPU?

Cheaper? Take 'String Matching' for example: in [1], a six-core Xeon X5680 CPU costs $1639 and a GTX 580 GPU costs $400. One GPU delivers the performance of about 18 CPU cores, i.e. three X5680s, which cost $4917: about 12 times more than the GPU.
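The arithmetic behind that estimate (throughput and price figures from [1]; rounding is ours):

```latex
\[
\frac{39.1\ \text{Gb/s (GTX 580)}}{2.15\ \text{Gb/s per X5680 core}} \approx 18.2\ \text{cores}
\;\Rightarrow\;
\left\lceil 18.2 / 6 \right\rceil = 3\ \text{Xeon X5680s}
\]
\[
3 \times \$1639 = \$4917,
\qquad
\$4917 / \$400 \approx 12.3\times \text{ the GPU's cost}
\]
```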

What existing technologies do we have now?
MultiQueue (RSS) NICs
MultiCore CPUs: 1 thread/queue (see the sketch below)
Pre-allocated zero-copy NUMA-aware packet buffers
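A minimal sketch of the 1-thread-per-queue receive path, assuming netmap's netmap_user.h helper API; the interface name, the number of rings, and the ring-to-core mapping are illustrative, not our actual configuration:

```cpp
// Sketch: one RX thread per NIC hardware ring, each pinned to its own core,
// receiving packets through netmap. "eth1", 4 rings, and the ring->core
// mapping are illustrative; error handling is minimal.
#define _GNU_SOURCE
#define NETMAP_WITH_LIBS
#include <net/netmap_user.h>
#include <pthread.h>
#include <sched.h>
#include <poll.h>
#include <stdio.h>

static void *rx_loop(void *arg) {
    long ring = (long)arg;

    // Open a single hardware ring ("netmap:eth1-<ring>").
    char name[64];
    snprintf(name, sizeof(name), "netmap:eth1-%ld", ring);
    struct nm_desc *d = nm_open(name, NULL, 0, NULL);
    if (d == NULL)
        return NULL;

    // Pin this thread to the core that owns the ring (ring i -> core i here).
    cpu_set_t set;
    CPU_ZERO(&set);
    CPU_SET((int)ring, &set);
    pthread_setaffinity_np(pthread_self(), sizeof(set), &set);

    struct pollfd pfd;
    pfd.fd = NETMAP_FD(d);
    pfd.events = POLLIN;

    struct nm_pkthdr hdr;
    for (;;) {
        poll(&pfd, 1, -1);
        const unsigned char *pkt;
        while ((pkt = nm_nextpkt(d, &hdr)) != NULL) {
            // ... hand (pkt, hdr.len) to the processing pipeline / Batcher ...
            (void)pkt;
        }
    }
    return NULL;   // not reached
}

int main(void) {
    enum { NRINGS = 4 };                 // e.g. 4 RSS queues on the NIC
    pthread_t t[NRINGS];
    for (long i = 0; i < NRINGS; i++)
        pthread_create(&t[i], NULL, rx_loop, (void *)i);
    for (int i = 0; i < NRINGS; i++)
        pthread_join(t[i], NULL);
    return 0;
}
```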

GPU and Click, How?
Manage GPU resources in Click, allow GPU Click elements.
Batching needed for the GPU.
Flexible CPU/GPU load balancing.
Efficient memory management for GPU, NIC, and Click.
Utilize what we already have for CPU-based packet processing.

Figure: multi-queue packet I/O layout. The RX/TX queues of each NIC (eth1(rx)..ethX(rx), eth1(tx)..ethX(tx)) are served by one process/thread per core (P1/T1 on Core1 through Py/Ty on CoreY) over shared zero-copy packet buffers (Multiple Queues, Zero-copy Packet Buffer, MultiCore).

Heterogeneous GPU Computing In Click
GPURuntime (CUDA) element for Click to manage GPU resources.
Batcher element for batching, slicing, and copying, with naturally no reordering problem.
BElement and BPort, with wider bpush/bpull for batched packets (sketched below).
CGBalancer element to load-balance between GPU and CPU.
GPUDirect for zero-copy among the GPU driver, the NIC driver, and Click:
✓ CUDA memory for Netmap, NIC driver, and Click.
MultiQueue support:
✓ NUMA-aware CUDA memory allocation.
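The poster names these elements but not their interfaces; here is a hypothetical sketch of what the wider, batched push path could look like, reusing the BElement/bpush/CGBalancer names from above (PacketBatch, the signatures, and the toy policy are our assumptions, not the authors' actual code):

```cpp
// Hypothetical sketch of a batched, Click-style interface. PacketBatch, the
// method signatures, and the offload policy are assumptions for illustration.
#include <cstddef>
#include <vector>

struct Packet {};                        // stand-in for Click's Packet

struct PacketBatch {
    std::vector<Packet *> pkts;          // the batched packets
    void *host_buf = nullptr;            // contiguous (possibly sliced) copy for the GPU
    void *dev_buf  = nullptr;            // device-side buffer or mapped alias
    std::size_t slice_off = 0, slice_len = 0;  // byte range copied from each packet
};

class BElement {
public:
    // Wider interface: push a whole batch downstream instead of one packet.
    virtual void bpush(int port, PacketBatch *batch) = 0;
    virtual ~BElement() {}
};

// A balancer in this style inspects policy/system state and forwards the
// batch to either a GPU-backed or a CPU-backed element.
class CGBalancerSketch : public BElement {
public:
    CGBalancerSketch(BElement *gpu, BElement *cpu) : gpu_(gpu), cpu_(cpu) {}
    void bpush(int port, PacketBatch *batch) override {
        if (prefer_gpu(batch))
            gpu_->bpush(port, batch);
        else
            cpu_->bpush(port, batch);
    }
private:
    bool prefer_gpu(const PacketBatch *b) const {
        return b->pkts.size() >= 64;     // toy policy: offload only large batches
    }
    BElement *gpu_, *cpu_;
};
```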

Figure: a Batcher element groups N packets into a batch and hands it to a CGBalancer, which dispatches the batch to either GPU-backed elements (GPUXXX) or CPU-backed elements (CPUXXX).

Problems To Investigate (and To Discuss)
How many 10Gb NICs can a single CPU core handle?
To know the #CPU core / #GPU ratio, the cost savings, and hence the overall system cost comparison.
Study mapped GPU memory vs. non-mapped memory under different workloads, batch sizes, slicing, and scattered packet buffers.
Effects of workload-specific balancer policies on hybrid CPU+GPU packet processing.
Is multi-queue really needed?
According to the Netmap work, a single CPU core can handle both RX and TX at line rate for forwarding.
Using the GPU as the main computing resource, the CPU can just do I/O and interrupt handling.
To what extent (#NICs) can this GPU-enabled Click scale up?

Current Progress and Todo
Done: infrastructure level, including Click GPU-related elements and the Netmap/CUDA integration.
Todo: computational Click packet processing elements on the GPU for evaluation:
Simple firewall: online packet inspection.
IP routing, IPsec, VPN
OpenFlow, ...

Figure: GPU-enabled Click software stack. The GPURuntime information element inside Click drives the CUDA Runtime and the GPU Driver, issuing MemcpyHtoD, LaunchGPUKernel, and MemcpyDtoH to move batches to the GPU, process them, and bring results back; as in the pipeline above, a Batcher (batch size N) feeds a CGBalancer, which dispatches to GPU elements (GPUXXX) or CPU elements (CPUXXX).
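A minimal sketch of the MemcpyHtoD → LaunchGPUKernel → MemcpyDtoH sequence shown in the figure, assuming a dedicated CUDA stream and a trivial placeholder kernel; the buffer size and the kernel body are illustrative, not the actual Click GPU elements:

```cuda
// Sketch: host-to-device copy, kernel launch, and device-to-host copy on a
// dedicated stream, so packet I/O can overlap with GPU work. The kernel is a
// placeholder for real per-packet processing.
#include <cuda_runtime.h>
#include <cstdio>

__global__ void process_batch(unsigned char *data, int nbytes) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < nbytes)
        data[i] ^= 0xFF;               // stand-in for real per-packet work
}

int main() {
    const int nbytes = 256 * 1024;     // e.g. one batch of sliced packet data

    unsigned char *host = nullptr, *dev = nullptr;
    cudaHostAlloc((void **)&host, nbytes, cudaHostAllocDefault);  // page-locked host buffer
    cudaMalloc((void **)&dev, nbytes);

    cudaStream_t stream;
    cudaStreamCreate(&stream);

    // MemcpyHtoD -> LaunchGPUKernel -> MemcpyDtoH, all asynchronous on one stream.
    cudaMemcpyAsync(dev, host, nbytes, cudaMemcpyHostToDevice, stream);
    int threads = 256, blocks = (nbytes + threads - 1) / threads;
    process_batch<<<blocks, threads, 0, stream>>>(dev, nbytes);
    cudaMemcpyAsync(host, dev, nbytes, cudaMemcpyDeviceToHost, stream);
    cudaStreamSynchronize(stream);     // results are now back in the batch buffer

    printf("processed %d bytes on the GPU\n", nbytes);

    cudaStreamDestroy(stream);
    cudaFree(dev);
    cudaFreeHost(host);
    return 0;
}
```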

