Date post: 03-May-2018
PacketShaders, SSLShader

GPGPU introduction and network applications

© 2014 Mellanox Technologies 2

Agenda

GPGPU Introduction

• Computer graphics background

• GPGPUs – past, present and future

PacketShader – A GPU-Accelerated Software Router

SSLShader – A GPU-Accelerated SSL encryption/decryption proxy

GPGPU Intro

GPU = Graphics Processing Unit

The heart of graphics cards

Mainly used for real-time 3D game rendering

• Massively-parallel processing capacity

(Image: Ubisoft’s Avatar, from http://ubi.com)

GPU Fundamentals: The Graphics Pipeline

A simplified graphics pipeline

• Note that pipe widths vary

• Many caches, FIFOs, and so on not shown

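[Figure: simplified pipeline. Stages: Application (CPU) → Transform → Rasterizer → Shade → Video Memory (Textures), all but the first on the GPU. Data passed between stages: Vertices (3D) → Xformed, Lit Vertices (2D) → Fragments (pre-pixels) → Final pixels (Color, Depth). Also labeled: Graphics State, Render-to-texture.]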

GPU Pipeline: Transform

Vertex Processor (multiple operate in parallel)

• Transform from “world space” to “image space”

• Compute per-vertex lighting

GPU Pipeline: Rasterizer

Rasterizer

• Convert geometric rep. (vertex) to image rep. (fragment)

- Fragment = image fragment

Pixel + associated data: color, depth, stencil, etc.

• Interpolate per-vertex quantities across pixels

GPU Pipeline: Shade

Fragment Processors (multiple in parallel)

• Compute a color for each pixel

• Optionally read colors from textures (images)

GPU Fundamentals: The Modern Graphics Pipeline

Programmable vertex processor! Programmable pixel processor!

[Figure: the same pipeline with programmable stages. Stages: Application (CPU) → Vertex Processor → Rasterizer → Pixel (Fragment) Processor → Video Memory (Textures). Data passed between stages: Vertices (3D) → Xformed, Lit Vertices (2D) → Fragments (pre-pixels) → Final pixels (Color, Depth). Also labeled: Graphics State, Render-to-texture.]

nVidia G80 GPU Architecture Overview

• 16 Multiprocessor (MP) Blocks

• Each MP Block has:

- 8 Streaming Processors (IEEE 754 single-precision floating point compliant)

- 16 KB Shared Memory

- 64 KB Constant Cache

- 8 KB Texture Cache

• Each processor can access all of the memory at 86 GB/s, but with different latencies:

- Shared: 2 cycle latency

- Device: 300 cycle latency

Queueing

FIFO buffering (first-in, first-out) is provided between task stages

• Accommodates variation in execution time

• Provides elasticity to allow unified load balancing to work

FIFOs can also be unified

• Share a single large memory with multiple head-tail pairs

• Allocate as required

[Figure: FIFOs between the Application, Vertex assembly, Vertex operations, and Primitive assembly stages]

SIMT - Memory Access Latency Hiding

GPU can effectively hide memory latency

[Figure: timeline of one GPU core. On a cache miss the core switches to Thread 2; on the next cache miss it switches to Thread 3.]
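
A rough way to see why thread switching hides latency is a toy round-robin model. The sketch below assumes each thread computes for a fixed number of cycles and then stalls on memory for a fixed number of cycles, and that switching is free. The 300-cycle stall is taken from the device-memory latency quoted for the G80; the 10 compute cycles and the function names are illustrative assumptions, not figures from the talk.

```python
def core_utilization(threads: int, compute_cycles: int, stall_cycles: int) -> float:
    """Fraction of cycles spent on useful work in a toy round-robin model.

    Each thread alternates compute_cycles of work with stall_cycles of
    waiting on memory; while one thread waits, another may run.
    """
    period = compute_cycles + stall_cycles
    return min(1.0, threads * compute_cycles / period)


def threads_to_hide(compute_cycles: int, stall_cycles: int) -> int:
    """Smallest thread count that keeps the core busy in this model."""
    period = compute_cycles + stall_cycles
    return -(-period // compute_cycles)  # ceiling division


if __name__ == "__main__":
    for n in (1, 8, 31):
        print(n, "threads:", round(core_utilization(n, 10, 300), 3))
    print("threads needed:", threads_to_hide(10, 300))
```

In this model a single thread keeps the core busy only about 3% of the time, while a few dozen resident threads are enough to cover the stall completely. Real hardware has further limits (registers, shared memory, bandwidth) that the model ignores.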

Implementation vs. architecture model

[Figure: NVIDIA GeForce 8800 implementation next to the OpenGL pipeline model. Implementation blocks: Application, Data Assembler, Vtx Thread Issue, Prim Thread Issue, Frag Thread Issue, Setup / Rstr / ZCull, Thread Processor, eight clusters of SP pairs each with L1 and TF units, six L2 / FB partitions. Pipeline model stages: Application, Vertex assembly, Vertex operations, Primitive assembly, Primitive operations, Rasterization, Fragment operations, Framebuffer. Source: NVIDIA]

Correspondence (by color)

[Figure: the same NVIDIA GeForce 8800 diagram, color-coded to match the OpenGL pipeline stages (Application, Vertex assembly, Vertex operations, Primitive assembly, Primitive operations, Rasterization (fragment assembly), Fragment operations, Framebuffer). Legend: application-programmable parallel processor; fixed-function assembly processors; fixed-function framebuffer operations. An annotation reads “this was missing”.]

The nVidia G80 GPU

128 streaming floating-point processors @ 1.5 GHz

1.5 GB shared RAM with 86 GB/s bandwidth

500 GFLOPS on one chip (single precision)

Why are GPUs so fast?

The entertainment industry has driven the economics of these chips

• Males aged 15-35 buy $10B in video games per year

Moore’s Law ++

Simplified design (stream processing)

• Huge parallelism that maps well to hardware

• Latency hiding using the parallelism

Single-chip designs

“Silicon Budget” in CPU and GPU

Xeon X5550: 4 cores, 731M transistors

GTX480: 480 cores, 3,200M transistors

[Figure: silicon area of each chip, with the ALUs labeled]

Floorplans comparison

[Figure: die floorplans of a CPU (Core i7) and a GPU (NVIDIA Kepler)]

GPU - A Specialized Processor

Very Efficient For

• Fast Parallel Floating Point Processing

• Single Instruction Multiple Data Operations

• High Computation per Memory Access

Not As Efficient For

• Double Precision (the situation is improving)

• Logical Operations on Integer Data

• Branching-Intensive Operations

• Random Access, Memory-Intensive Operations

GPGPU!

Programmable stream processor

• Huge number of ALUs

• Huge memory bandwidth

Programming was painful

• GLSL (OpenGL Shading Language)

• Requires deep understanding of computer graphics

• Huge application speedups when done correctly

CUDA/OpenCL

• C-like code

• Massively multi-threaded

• Simple to port existing code (but not to get good performance)

CUDA – Single Instruction Multiple Threads

Achieving Performance in CUDA

Almost all C code will compile to be CUDA code

• But will run slower

• Single threaded operation - ~50x slower than CPU code

Must expose parallelism

Be careful with memory accesses

• Thread scheduling helps hide memory access latency

• But even this runs out

Moving target

• Performance optimizations are strongly HW and SW platform dependent

Optimization can make a huge difference

• 100x and even more

A GPU-Accelerated Software Router

PacketShader

High Performance Software Router

Work by Sangjin Han, Keon Jang, KyoungSoo Park and Sue Moon

• Advanced Networking Lab, CS, KAIST

• Networked and Distributed Computing Systems Lab, EE, KAIST

40 Gbps packet forwarding in a single box

• IPv4, 64B packets

• Bigger packet sizes – bounded by PCI-e bandwidth

20 Gbps IPsec tunneling

• For 1024B packets

• 10 Gbps for 64B packets

Software Router

Despite its name, not limited to IP routing

• You can implement whatever you want on it.

Driven by software

• Flexible

• Friendly development environments

Based on commodity hardware

• Cheap

• Fast evolution

The 10 Gigabit NIC is now a commodity

From $200 – $300 per port

• Great opportunity for software routers

Achilles’ Heel of Software Routers

Low performance

• Due to CPU bottleneck

Year  Ref.                            H/W                           IPv4 Throughput
2008  Egi et al.                      Two quad-core CPUs            3.5 Gbps
2008  “Enhanced SR”, Bolla et al.     Two quad-core CPUs            4.2 Gbps
2009  “RouteBricks”, Dobrescu et al.  Two quad-core CPUs (2.8 GHz)  8.7 Gbps

Not capable of supporting even a single 10G port

Per-Packet CPU Cycles for 10G

Cycles needed per packet:

IPv4: 1,200 (Packet I/O) + 600 (IPv4 lookup) = 1,800 cycles

IPv6: 1,200 (Packet I/O) + 1,600 (IPv6 lookup) = 2,800 cycles

IPsec: 1,200 (Packet I/O) + 5,400 … (Encryption and hashing) = 6,600 cycles

Your budget: 1,400 cycles

• 10G, min-sized packets, dual quad-core 2.66GHz CPUs

(in x86, cycle numbers are from RouteBricks [Dobrescu09] and PacketShader)
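
The 1,400-cycle budget can be reproduced with a short calculation. The sketch below assumes that a minimum-sized Ethernet frame is 64 bytes and that each frame also occupies 20 bytes of preamble, start delimiter, and inter-frame gap on the wire; the talk does not say how it counts bytes, so treat that overhead and the function names as assumptions.

```python
# Per-frame wire overhead: preamble (7) + start delimiter (1) + inter-frame gap (12).
ETHERNET_OVERHEAD_BYTES = 20


def packets_per_second(link_bps: float, frame_bytes: int) -> float:
    """Packets per second that saturate a link at the given frame size."""
    wire_bits = (frame_bytes + ETHERNET_OVERHEAD_BYTES) * 8
    return link_bps / wire_bits


def cycle_budget(cores: int, clock_hz: float, link_bps: float, frame_bytes: int) -> float:
    """CPU cycles available per packet when all cores share one link."""
    return cores * clock_hz / packets_per_second(link_bps, frame_bytes)


# Dual quad-core 2.66 GHz CPUs, one 10G port, 64-byte packets.
budget = cycle_budget(cores=8, clock_hz=2.66e9, link_bps=10e9, frame_bytes=64)

# Per-packet costs quoted on the slide.
needed = {"IPv4": 1800, "IPv6": 2800, "IPsec": 6600}
over_budget = {name: cycles > budget for name, cycles in needed.items()}

if __name__ == "__main__":
    print(round(budget))   # roughly 1,430, which the slide rounds to 1,400
    print(over_budget)
```

Under these assumptions all three workloads exceed the budget, which matches the slide's point that the CPU alone cannot keep up with a single 10G port.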

PacketShader Approach 1: I/O Optimization

Packet I/O: 1,200 reduced to 200 cycles per packet

Main ideas

• Huge packet buffer

• Batch processing

Allocating SKBs (Linux socket buffers): 50% of CPU time

[Figure: the per-packet cycle bars from the previous slide (IPv4 = 1,800; IPv6 = 2,800; IPsec = 6,600), each with its 1,200-cycle Packet I/O portion marked]

PacketShader Approach 2: GPU Offloading

GPU offloading for

• Memory-intensive or

• Compute-intensive operations

Main topic of this talk

[Figure: the per-packet cycle bars again, with the non-I/O portions marked: IPv4 lookup (600), IPv6 lookup (1,600), encryption and hashing (5,400 …)]

GPU FOR PACKET PROCESSING

Advantages of GPU for Packet Processing

1. Raw computation power

2. Memory access latency

3. Memory bandwidth

Comparison between

• Intel X5550 CPU

• NVIDIA GTX480 GPU

(1/3) Raw Computation Power

Compute-intensive operations in software routers

• Hashing, encryption, pattern matching, network coding, compression, etc.

• GPU can help!

Instructions/sec:

CPU: 43 × 10^9 = 2.66 (GHz) × 4 (# of cores) × 4 (4-way superscalar)

GPU: 672 × 10^9 = 1.4 (GHz) × 480 (# of cores)

CPU < GPU

(2/3) Memory Access Latency

A software router incurs lots of cache misses

• GPU can effectively hide memory latency

[Figure: timeline of one GPU core. On a cache miss the core switches to Thread 2; on the next cache miss it switches to Thread 3.]

(3/3) Memory Bandwidth

CPU’s memory bandwidth (theoretical): 32 GB/s

CPU’s memory bandwidth (empirical): < 25 GB/s

Packet I/O alone moves each packet through memory four times:

1. RX: NIC → RAM

2. RX: RAM → CPU

3. TX: CPU → RAM

4. TX: RAM → NIC

Your budget for packet processing can be less than 10 GB/s

GPU’s memory bandwidth: 174 GB/s

HOW TO USE GPU

Basic Idea

Offload core operations to GPU (e.g., forwarding table lookup)

Recap

GTX480: 480 cores

For GPU, more parallelism, more throughput

Parallelism in Packet Processing

The key insight

• Stateless packet processing = parallelizable

[Figure: packets from the RX queue are (1) batched and then (2) processed in parallel in the GPU]

Batching → Long Latency?

Fast link = enough # of packets in a small time window

10 GbE link

• up to 1,000 packets in only 67 μs

Much less time with 40 or 100 GbE
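
The 67 μs figure follows from the wire time of a minimum-sized packet. The sketch below assumes 64-byte frames plus 20 bytes of preamble and inter-frame gap per frame (an assumption about how the talk counts bytes) and a link running at full rate; the function name is illustrative.

```python
def batch_fill_time(packets: int, link_bps: float, frame_bytes: int = 64) -> float:
    """Seconds a saturated link needs to deliver `packets` frames.

    Assumes 20 bytes of per-frame wire overhead (preamble, start
    delimiter, inter-frame gap) on top of the frame itself.
    """
    wire_bits = (frame_bytes + 20) * 8
    return packets * wire_bits / link_bps


if __name__ == "__main__":
    for gbps in (10, 40, 100):
        micros = batch_fill_time(1000, gbps * 1e9) * 1e6
        print(f"{gbps} GbE: {micros:.2f} us for 1,000 packets")
```

At 10 GbE this gives about 67 μs for 1,000 packets, and the time shrinks in proportion to link speed, so the latency added by waiting for a batch is small on fast links.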

PACKETSHADER DESIGN

Basic Design

Three stages in a streamline

Pre-shader → Shader → Post-shader

Packet’s Journey (1/3)

IPv4 forwarding example

Pre-shader

• Checksum, TTL

• Format check

• …

• Collects dst. IP addrs for the shader

• Some packets go to slow-path

Packet’s Journey (2/3)

IPv4 forwarding example

Shader

1. IP addresses (input)

2. Forwarding table lookup

3. Next hops (output)

Packet’s Journey (3/3)

IPv4 forwarding example

Post-shader

• Update packets and transmit
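
The three stages of the IPv4 forwarding example can be sketched in plain Python. This is an illustration of the data flow only, not PacketShader's code: the `Packet` class, the three-entry `TABLE`, and the port names are hypothetical, the checksum and format checks are omitted, and the `shader` function is a CPU stand-in for what would be a GPU kernel doing one independent lookup per address.

```python
from dataclasses import dataclass
from ipaddress import ip_address, ip_network
from typing import List, Optional, Tuple


@dataclass
class Packet:
    dst: str                       # destination IPv4 address
    ttl: int
    next_hop: Optional[str] = None


# Hypothetical forwarding table: (prefix, next hop).
TABLE = [
    (ip_network("0.0.0.0/0"), "port0"),
    (ip_network("10.0.0.0/8"), "port1"),
    (ip_network("10.1.0.0/16"), "port2"),
]


def pre_shader(batch: List[Packet]) -> Tuple[List[Packet], List[str], List[Packet]]:
    """CPU side: filter the batch and collect destination addresses."""
    fast, slow = [], []
    for pkt in batch:
        # Packets whose TTL would expire take the slow path.
        (fast if pkt.ttl > 1 else slow).append(pkt)
    return fast, [pkt.dst for pkt in fast], slow


def shader(dst_addrs: List[str]) -> List[str]:
    """Stand-in for the GPU kernel: longest-prefix match per address."""
    def lookup(dst: str) -> str:
        addr = ip_address(dst)
        matches = [(net.prefixlen, hop) for net, hop in TABLE if addr in net]
        return max(matches)[1]     # the longest matching prefix wins
    return [lookup(dst) for dst in dst_addrs]


def post_shader(fast: List[Packet], next_hops: List[str]) -> List[Packet]:
    """CPU side: write results back into the packets before transmit."""
    for pkt, hop in zip(fast, next_hops):
        pkt.ttl -= 1
        pkt.next_hop = hop
    return fast


if __name__ == "__main__":
    batch = [Packet("10.1.2.3", 64), Packet("10.9.9.9", 64),
             Packet("8.8.8.8", 1), Packet("192.168.0.1", 5)]
    fast, dsts, slow = pre_shader(batch)
    for pkt in post_shader(fast, shader(dsts)):
        print(pkt)
    print("slow path:", slow)
```

The point of the split is that only a compact array of addresses crosses to the shader and only a compact array of next hops comes back, while per-packet bookkeeping stays on the CPU.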

Interfacing with NICs

Device driver (Packet RX) → Pre-shader → Shader → Post-shader → Device driver (Packet TX)

Scaling with a Multi-Core CPU

[Figure: the Device driver → Pre-shader → Shader → Post-shader → Device driver pipeline, divided between a master core and worker cores]

Scaling with Multiple Multi-Core CPUs

[Figure: the Device driver → Pre-shader → Shader → Post-shader → Device driver pipeline, with a second Shader shown]

EVALUATION

Hardware Setup

CPU: Quad-core, 2.66 GHz (total 8 CPU cores)

GPU: 480 cores, 1.4 GHz (total 960 cores)

NIC: Dual-port 10 GbE (total 80 Gbps)

Experimental Setup

[Figure: a packet generator and PacketShader connected by 8 × 10 GbE links (up to 80 Gbps). The generator sends input traffic; PacketShader returns processed packets.]

Results (w/ 64B packets)

Throughput (Gbps)  IPv4   IPv6   OpenFlow  IPsec
CPU-only           28.2   8      15.6      3
CPU+GPU            39.2   38.2   32        10.2
GPU speedup        1.4x   4.8x   2.1x      3.5x

Example 1: IPv6 forwarding

Longest prefix matching on 128-bit IPv6 addresses

Algorithm: binary search on hash tables [Waldvogel97]

• 7 hashings + 7 memory accesses

[Figure: one hash table per prefix length, 1 … 128, with probes shown at lengths 64, 96, and 80]
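
Binary search on prefix lengths can be sketched as follows. This is a simplified reading of the [Waldvogel97] idea, not the code used in PacketShader: Python dictionaries stand in for the per-length hash tables, marker entries carry the best matching next hop so a failed probe never needs backtracking, and the route set and next-hop names are hypothetical. With a plain midpoint search over lengths 1 to 128 this variant needs at most 8 probes, close to the 7 quoted on the slide; a default route (/0) is not handled.

```python
from ipaddress import IPv6Address, IPv6Network

W = 128  # IPv6 address width in bits


def top_bits(value: int, length: int) -> int:
    """The leading `length` bits of a 128-bit value."""
    return value >> (W - length)


def build(routes):
    """Build {prefix length: {leading bits: best next hop or None}}."""
    real = {}
    for prefix, hop in routes.items():
        net = IPv6Network(prefix)
        key = top_bits(int(net.network_address), net.prefixlen)
        real.setdefault(net.prefixlen, {})[key] = hop

    def best_match(value, max_len):
        # Longest real prefix of length <= max_len (brute force, build time only).
        for length in range(max_len, 0, -1):
            hop = real.get(length, {}).get(top_bits(value, length))
            if hop is not None:
                return hop
        return None

    table = {}
    for prefix in routes:
        net = IPv6Network(prefix)
        value, target = int(net.network_address), net.prefixlen
        lo, hi = 1, W
        while lo <= hi:                  # replay the search path to `target`
            mid = (lo + hi) // 2
            if mid > target:
                hi = mid - 1
                continue
            # mid == target: the real entry; mid < target: a marker that
            # tells later lookups to keep searching longer lengths.
            table.setdefault(mid, {})[top_bits(value, mid)] = best_match(value, mid)
            if mid == target:
                break
            lo = mid + 1
    return table


def lookup(table, address):
    """Return (next hop or None, number of hash probes)."""
    value = int(IPv6Address(address))
    best, probes = None, 0
    lo, hi = 1, W
    while lo <= hi:
        mid = (lo + hi) // 2
        probes += 1
        bucket = table.get(mid, {})
        key = top_bits(value, mid)
        if key in bucket:                # hit: remember it, try longer
            if bucket[key] is not None:
                best = bucket[key]
            lo = mid + 1
        else:                            # miss: try shorter
            hi = mid - 1
    return best, probes


if __name__ == "__main__":
    table = build({
        "2001:db8::/32": "A",
        "2001:db8:1::/48": "B",
        "2001:db8:1:2::/64": "C",
        "2001:db8:5:6:7:8::/96": "D",
    })
    for addr in ("2001:db8:1:2::5", "2001:db8:5:6:7:8:0:1", "2001:db8:5:6:9::1"):
        print(addr, lookup(table, addr))
```

Each probe is one hash computation plus one memory access, and every address in a batch is looked up independently, which is what makes this lookup a good fit for GPU offloading.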

Example 1: IPv6 forwarding

(Routing table was randomly generated with 200K entries)

[Figure: throughput (Gbps, axis 0 to 45) vs. packet size (64, 128, 256, 512, 1024, 1514 bytes) for CPU-only and CPU+GPU. Annotation: bounded by motherboard I/O capacity.]

Example 2: IPsec tunneling

ESP (Encapsulating Security Payload) Tunnel mode

• with AES-CTR (encryption) and SHA1 (authentication)

[Figure: building the IPsec packet from the original IP packet (IP header, IP payload). An ESP trailer is appended and the result is encrypted (1. AES); an ESP header is prepended and the result is authenticated (2. SHA1), which adds the ESP Auth. field; a new IP header is prepended. Final layout: New IP header | ESP header | IP header | IP payload | ESP trailer | ESP Auth.]

Example 2: IPsec tunneling

3.5x speedup

[Figure: throughput (Gbps, axis 0 to 24) and speedup (axis 1 to 4) vs. packet size (64, 128, 256, 512, 1024, 1514 bytes) for CPU-only and CPU+GPU]

Year  Ref.                            H/W                            IPv4 Throughput
2008  Egi et al.                      Two quad-core CPUs             3.5 Gbps
2008  “Enhanced SR”, Bolla et al.     Two quad-core CPUs             4.2 Gbps
2009  “RouteBricks”, Dobrescu et al.  Two quad-core CPUs (2.8 GHz)   8.7 Gbps
2010  PacketShader (CPU-only)         Two quad-core CPUs (2.66 GHz)  28.2 Gbps
2010  PacketShader (CPU+GPU)          Two quad-core CPUs + two GPUs  39.2 Gbps

[Figure labels beside the table: Kernel, User]

Conclusions

GPU

• A great opportunity for fast packet processing

PacketShader

• Optimized packet I/O + GPU acceleration

• Scalable with

- # of multi-core CPUs, GPUs, and high-speed NICs

Current Prototype

• Supports IPv4, IPv6, OpenFlow, and IPsec

• 40 Gbps performance on a single PC

Future Work

Control plane integration

• Dynamic routing protocols with Quagga or Xorp

Multi-functional, modular programming environment

• Integration with Click? [Kohler99]

Opportunistic offloading

• CPU at low load

• GPU at high load

Stateful packet processing

A GPU-Accelerated SSL encryption/decryption proxy

SSLShader

Thank You

