High Performance Modular Packet Processing with Click and GPU

Weibin Sun Robert Ricci

{wbsun, ricci}@cs.utah.edu University of Utah, School of Computing

Software packet processing, such as the Click modular router, provides much more flexibility than hardware-based processing. Recent advances (Netmap, psio, PF_RING) in software packet I/O have enabled 10Gb/s throughput on 10Gb NICs, with the help of Receive Side Scaling (RSS) on multi-queue NICs and multi-core CPUs, as well as zero-copy buffer sharing between kernel and user modes. With these advances, it is possible to build software packet processing systems that run at line rate. Such systems include intrusion detection, firewalls, VPNs, OpenFlow, IP forwarding, deep packet inspection for ISPs to detect pirated content downloads, and many others.

However, the bottleneck that prevents the packet processing systems listed above from reaching line-rate throughput is now computation. For instance, the pattern matching algorithm for intrusion detection runs at only 2.15Gb/s on an Intel Xeon X5680 core, according to the Kargus research. It takes five such cores to saturate just one 10Gb port. This would require a significant number of dedicated CPU cores for packet processing, which may contend with packet I/O cores and hence lower overall system performance. Moreover, the hardware cost of those multiple CPUs is far higher than another feasible commodity hardware choice: the parallel GPU. Many research projects have shown large performance improvements for various computations on parallel GPUs compared with CPUs. For the pattern matching algorithm, recent work on Kargus shows 39.1Gb/s throughput on a $400 GTX580 GPU, 3x faster than the $1639 six-core Xeon X5680 above. As a result, to build a flexible, high-performance packet processing system that can run at 10Gb/s line rate, we take the modular Click router and integrate GPU computing into it for computation-intensive processing.

To use GPU computing libraries and other userspace libraries, we use usermode Click. Click was not originally designed for usermode zero-copy packet I/O or parallel GPU computing, so we adopt the following techniques to deal with obstacles in Click, multi-queue NICs, and multi-core CPUs: a) GPU Management: a new GPURuntime (specifically, CUDA runtime) Click information element manages GPU computing memory, kernels, and state. Both CUDA mapped, page-locked memory and non-mapped, page-locked memory are used for GPU computing; they have different characteristics that are worth investigating in our evaluation. b) Batching: we use batching to move beyond Click's single-packet processing style. A Batcher element batches packets and also prepares GPU memory; it also supports sliced copies that transfer only a specified range of each packet's data. c) Wider Interface: to perform push and pull on packet batches, we define BElement and BPort, which provide packet batch support. d) Hybrid CPU and GPU Processing: a CGBalancer element load-balances packets between the GPU and the CPU according to a specified policy and system state, providing a flexible GPU offloading mechanism. e) Zero-copy: we use Netmap for zero-copy packet capture and transmission, and the GPU-enabled Click has been modified so that Netmap uses CUDA memory that the GPU driver can DMA to and from; buffers are thus shared, zero-copy, among the GPU driver, the NIC driver, and Click. f) Multi-queue Support: Netmap provides multi-queue packet capture and transmission. To be NUMA-aware for buffers used on different CPU cores, we use CUDA's cudaHostRegister() to pin NUMA-aware allocated memory, so that it can be used for GPU DMA and as zero-copy packet buffers (see the sketch below).
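A minimal sketch of the pinning step in f), assuming libnuma for node-local allocation and the CUDA runtime; the buffer size and NUMA node id are illustrative, and the real integration happens inside Netmap's buffer management rather than in standalone code:

```cpp
// Sketch: allocate a packet buffer on a specific NUMA node, then pin and map
// it for GPU DMA with cudaHostRegister(). Assumes libnuma (-lnuma) and the
// CUDA runtime; buffer size and node id are illustrative only.
#include <numa.h>
#include <cuda_runtime.h>
#include <cstdio>

int main() {
    const size_t buf_size = 4096 * 2048;   // e.g. 4096 slots of 2KB each
    const int node = 0;                    // NUMA node of the core owning this RX queue

    cudaSetDeviceFlags(cudaDeviceMapHost); // allow mapped, page-locked host memory

    // Page-aligned allocation on the chosen NUMA node.
    void *buf = numa_alloc_onnode(buf_size, node);
    if (buf == nullptr) {
        fprintf(stderr, "numa_alloc_onnode failed\n");
        return 1;
    }

    // Pin the buffer and map it into the GPU's address space, so the device
    // can DMA packet data directly out of the zero-copy packet buffer.
    cudaError_t err = cudaHostRegister(buf, buf_size, cudaHostRegisterMapped);
    if (err != cudaSuccess) {
        fprintf(stderr, "cudaHostRegister: %s\n", cudaGetErrorString(err));
        numa_free(buf, buf_size);
        return 1;
    }

    void *dev_ptr = nullptr;
    cudaHostGetDevicePointer(&dev_ptr, buf, 0);  // device-side alias of buf

    // ... hand buf to the packet I/O layer and dev_ptr to GPU kernels ...

    cudaHostUnregister(buf);
    numa_free(buf, buf_size);
    return 0;
}
```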

We have implemented the Click-level improvements, including the GPU-related elements above, and the Netmap/CUDA/Click integration for zero-copy memory sharing. We are developing several computational packet processing elements for evaluation, such as a simple firewall, IPv6 routing, IPsec, VPN, and OpenFlow. Besides these functional applications, we also want to investigate the following questions to fully study and analyze our system: a) compare the costs of a CPU-only system and a GPU-enabled system, and build theoretical cost models for scaling each up; b) compare the different GPU runtime memory types under different workloads and other conditions; c) study the effect of the load balancer policy when running different workloads; d) answer the question: is multi-queue really needed? This question arises from two facts: the Netmap research found that a single CPU core can handle line-rate forwarding, and we now have GPUs dedicated to computation; e) study and measure the scalability of our GPU-enabled Click with respect to the number of NICs.

High Performance Modular Packet Processing with Click and GPU

Weibin Sun Robert Ricci

{wbsun, ricci}@cs.utah.edu School of Computing, University of Utah

Computation      GPU                    CPU                   Speedup
String Matching  39.1 Gbps [1]          2.15 Gbps [1]         18.2
RSA-1024         74,732 ops/s [2]       3,301 ops/s [2]       22.6
AES-CBC Dec      ~32 Gbps [3]           15 Gbps [2]           2.13
HMAC-SHA1        31 Gbps [2]            3.343 Gbps [2]        9.3
IPv6 Lookup      62.4x10^6 ops/s [4]    1.66x10^6 ops/s [4]   37.6

[1]: Kargus [CCS'12], [2]: SSLShader [NSDI'11], [3]: GPUstore [SYSTOR'12], [4]: PacketShader [SIGCOMM'10]

The Problem
CPU limits functionality in line-rate packet processing:

10Gb/s line-rate packet I/O is available [Netmap, psio]. Compute-intensive packet processing:

Intrusion detection
Firewall
Deep packet inspection
VPN, OpenFlow, etc.

Flexible platform needed.

Our Solution
Flexibility: Click modular router!
Compute-intensive processing: faster/more CPU cores?

NO! We have faster and cheaper GPUs.

How much faster can we run with a GPU?

Cheaper? Take 'String Matching' for example: in [1], a six-core Xeon X5680 CPU costs $1639 and a GTX 580 GPU costs $400. One GPU delivers the performance of about 18 CPU cores, i.e. three X5680s, which cost $4917: about 12 times more than the GPU.
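The arithmetic behind that estimate (throughput and price figures from [1]; rounding is ours):

```latex
\[
\frac{39.1\ \text{Gb/s (GTX 580)}}{2.15\ \text{Gb/s per X5680 core}} \approx 18.2\ \text{cores}
\;\Rightarrow\;
\left\lceil 18.2 / 6 \right\rceil = 3\ \text{Xeon X5680s}
\]
\[
3 \times \$1639 = \$4917,
\qquad
\$4917 / \$400 \approx 12.3\times \text{ the GPU's cost}
\]
```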

What existing technologies do we have now?
MultiQueue (RSS) NICs
MultiCore CPUs: 1 thread/queue (see the sketch below)
Pre-allocated zero-copy NUMA-aware packet buffers
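A minimal sketch of the 1-thread-per-queue receive path, assuming netmap's netmap_user.h helper API; the interface name, the number of rings, and the ring-to-core mapping are illustrative, not our actual configuration:

```cpp
// Sketch: one RX thread per NIC hardware ring, each pinned to its own core,
// receiving packets through netmap. "eth1", 4 rings, and the ring->core
// mapping are illustrative; error handling is minimal.
#define _GNU_SOURCE
#define NETMAP_WITH_LIBS
#include <net/netmap_user.h>
#include <pthread.h>
#include <sched.h>
#include <poll.h>
#include <stdio.h>

static void *rx_loop(void *arg) {
    long ring = (long)arg;

    // Open a single hardware ring ("netmap:eth1-<ring>").
    char name[64];
    snprintf(name, sizeof(name), "netmap:eth1-%ld", ring);
    struct nm_desc *d = nm_open(name, NULL, 0, NULL);
    if (d == NULL)
        return NULL;

    // Pin this thread to the core that owns the ring (ring i -> core i here).
    cpu_set_t set;
    CPU_ZERO(&set);
    CPU_SET((int)ring, &set);
    pthread_setaffinity_np(pthread_self(), sizeof(set), &set);

    struct pollfd pfd;
    pfd.fd = NETMAP_FD(d);
    pfd.events = POLLIN;

    struct nm_pkthdr hdr;
    for (;;) {
        poll(&pfd, 1, -1);
        const unsigned char *pkt;
        while ((pkt = nm_nextpkt(d, &hdr)) != NULL) {
            // ... hand (pkt, hdr.len) to the processing pipeline / Batcher ...
            (void)pkt;
        }
    }
    return NULL;   // not reached
}

int main(void) {
    enum { NRINGS = 4 };                 // e.g. 4 RSS queues on the NIC
    pthread_t t[NRINGS];
    for (long i = 0; i < NRINGS; i++)
        pthread_create(&t[i], NULL, rx_loop, (void *)i);
    for (int i = 0; i < NRINGS; i++)
        pthread_join(t[i], NULL);
    return 0;
}
```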

GPU and Click, How?
Manage GPU resources in Click, allow GPU Click elements.
Batching needed for the GPU.
Flexible CPU/GPU load balancing.
Efficient memory management for GPU, NIC, and Click.
Utilize what we already have for CPU-based packet processing.

Figure: multi-queue packet I/O layout. The RX/TX queues of each NIC (eth1(rx)..ethX(rx), eth1(tx)..ethX(tx)) are served by one process/thread per core (P1/T1 on Core1 through Py/Ty on CoreY) over shared zero-copy packet buffers (Multiple Queues, Zero-copy Packet Buffer, MultiCore).

Heterogeneous GPU Computing In Click
GPURuntime (CUDA) element for Click to manage GPU resources.
Batcher element for batching, slicing, and copying, with naturally no reordering problem.
BElement and BPort, with wider bpush/bpull for batched packets (sketched below).
CGBalancer element to load-balance between GPU and CPU.
GPUDirect for zero-copy among the GPU driver, the NIC driver, and Click:
✓ CUDA memory for Netmap, NIC driver, and Click.
MultiQueue support:
✓ NUMA-aware CUDA memory allocation.
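The poster names these elements but not their interfaces; here is a hypothetical sketch of what the wider, batched push path could look like, reusing the BElement/bpush/CGBalancer names from above (PacketBatch, the signatures, and the toy policy are our assumptions, not the authors' actual code):

```cpp
// Hypothetical sketch of a batched, Click-style interface. PacketBatch, the
// method signatures, and the offload policy are assumptions for illustration.
#include <cstddef>
#include <vector>

struct Packet {};                        // stand-in for Click's Packet

struct PacketBatch {
    std::vector<Packet *> pkts;          // the batched packets
    void *host_buf = nullptr;            // contiguous (possibly sliced) copy for the GPU
    void *dev_buf  = nullptr;            // device-side buffer or mapped alias
    std::size_t slice_off = 0, slice_len = 0;  // byte range copied from each packet
};

class BElement {
public:
    // Wider interface: push a whole batch downstream instead of one packet.
    virtual void bpush(int port, PacketBatch *batch) = 0;
    virtual ~BElement() {}
};

// A balancer in this style inspects policy/system state and forwards the
// batch to either a GPU-backed or a CPU-backed element.
class CGBalancerSketch : public BElement {
public:
    CGBalancerSketch(BElement *gpu, BElement *cpu) : gpu_(gpu), cpu_(cpu) {}
    void bpush(int port, PacketBatch *batch) override {
        if (prefer_gpu(batch))
            gpu_->bpush(port, batch);
        else
            cpu_->bpush(port, batch);
    }
private:
    bool prefer_gpu(const PacketBatch *b) const {
        return b->pkts.size() >= 64;     // toy policy: offload only large batches
    }
    BElement *gpu_, *cpu_;
};
```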

Figure: a Batcher element groups N packets into a batch and hands it to a CGBalancer, which dispatches the batch to either GPU-backed elements (GPUXXX) or CPU-backed elements (CPUXXX).

Problems To Investigate (and To Discuss)
How many 10Gb NICs can a single CPU core handle?
To know the #CPU core / #GPU ratio, the cost savings, and hence the overall system cost comparison.
Study mapped GPU memory vs. non-mapped memory under different workloads, batch sizes, slicing, and scattered packet buffers.
Effects of workload-specific balancer policies on hybrid CPU+GPU packet processing.
Is multi-queue really needed?
According to the Netmap work, a single CPU core can handle both RX and TX at line rate for forwarding.
Using the GPU as the main computing resource, the CPU can just do I/O and interrupt handling.
To what extent (#NICs) can this GPU-enabled Click scale up?

Current Progress and Todo
Done: infrastructure level, including Click GPU-related elements and the Netmap/CUDA integration.
Todo: computational Click packet processing elements on the GPU for evaluation:
Simple firewall: online packet inspection.
IP routing, IPsec, VPN
OpenFlow, ...

Figure: GPU-enabled Click software stack. The GPURuntime information element inside Click drives the CUDA Runtime and the GPU Driver, issuing MemcpyHtoD, LaunchGPUKernel, and MemcpyDtoH to move batches to the GPU, process them, and bring results back; as in the pipeline above, a Batcher (batch size N) feeds a CGBalancer, which dispatches to GPU elements (GPUXXX) or CPU elements (CPUXXX).
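A minimal sketch of the MemcpyHtoD → LaunchGPUKernel → MemcpyDtoH sequence shown in the figure, assuming a dedicated CUDA stream and a trivial placeholder kernel; the buffer size and the kernel body are illustrative, not the actual Click GPU elements:

```cuda
// Sketch: host-to-device copy, kernel launch, and device-to-host copy on a
// dedicated stream, so packet I/O can overlap with GPU work. The kernel is a
// placeholder for real per-packet processing.
#include <cuda_runtime.h>
#include <cstdio>

__global__ void process_batch(unsigned char *data, int nbytes) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < nbytes)
        data[i] ^= 0xFF;               // stand-in for real per-packet work
}

int main() {
    const int nbytes = 256 * 1024;     // e.g. one batch of sliced packet data

    unsigned char *host = nullptr, *dev = nullptr;
    cudaHostAlloc((void **)&host, nbytes, cudaHostAllocDefault);  // page-locked host buffer
    cudaMalloc((void **)&dev, nbytes);

    cudaStream_t stream;
    cudaStreamCreate(&stream);

    // MemcpyHtoD -> LaunchGPUKernel -> MemcpyDtoH, all asynchronous on one stream.
    cudaMemcpyAsync(dev, host, nbytes, cudaMemcpyHostToDevice, stream);
    int threads = 256, blocks = (nbytes + threads - 1) / threads;
    process_batch<<<blocks, threads, 0, stream>>>(dev, nbytes);
    cudaMemcpyAsync(host, dev, nbytes, cudaMemcpyDeviceToHost, stream);
    cudaStreamSynchronize(stream);     // results are now back in the batch buffer

    printf("processed %d bytes on the GPU\n", nbytes);

    cudaStreamDestroy(stream);
    cudaFree(dev);
    cudaFreeHost(host);
    return 0;
}
```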

