Implementing Ultra Low Latency Data Center Services with ...

Date post: 08-Dec-2016
Implementing Ultra Low Latency Data Center Services with Programmable Logic John W. Lockwood, CEO: Algo-Logic Systems, Inc. http://Algo-Logic.com [email protected] (408) 707-3740 2255-D Martin Ave., Santa Clara, CA 95050
Transcript
Page 1: Implementing Ultra Low Latency Data Center Services with ...

Implementing Ultra Low Latency Data Center Services with Programmable Logic

John W. Lockwood, CEO: Algo-Logic Systems, Inc.

http://Algo-Logic.com • [email protected] • (408) 707-3740 • 2255-D Martin Ave., Santa Clara, CA 95050

Page 2

© 2015 Algo-Logic Systems Inc., All rights reserved. 2

Why the Move to Programmable Logic?

• Driving Metrics in the Data Center

– Latency:

• Reduce delay

• Avoid jitter

– Throughput

• Processing packets at line rate

• Handle 10G, 25G, 40G, and 100G

– Power:

• Driving cost of OpEx

• Field Programmable Gate Array (FPGA) logic moves into the CPU

• Microsoft accelerates Bing search with FPGAs

• Intel acquires Altera

[Figure: Increasing silicon die area over time. 1971: sequential processing (CPU). 1985: micro-optimized sequential (CPU). 2006: multi-core (multiple CPUs). 2013: add vector processing (CPUs + GPU). 2016-2019: add logic processing (CPU + FPGA + GPU). Box size = silicon die area.]

“There are large challenges in scaling the performance of software now. The question is: ‘What’s next?’ We took a bet on programmable hardware.”

- Doug Burger, Microsoft

Page 3

Servers Accelerated with FPGA Gateware

• FPGA Augments Existing Servers

– Can run on an expansion card (same size as a GPU)

– Or may be integrated into the CPU socket

• GDN Applications run on FPGA

– Implements low-latency, low-power, high-throughput data processing

[Figure: Accelerated server (CPU, FPGA, RAM, with 10G, 40G, and 100GE ports) next to a rack server (CPU cores, RAM, HDD, SSD).]

Page 4

Example of Low Latency Service: Key/Value Store

• Key/Value Store (KVS)

– Simplifies implementation of large-scale distributed computation algorithms

– Data center servers exchange data over standard Ethernet

• Challenges

– Operating system delays packets and limits throughput

– Per-core processing is inefficient at high-speed packet processing

• Solutions

– Kernel bypass with DPDK

– Offload of packet processing to FPGA

Examples of key/value pairs:

Example | Key | Value | Sample key | Sample value
Directory | Company | Phone # | Algo-Logic | (408) 707-3740
Forwarding Tables | IP Address | Interface : MAC Address | 204.2.34.5 | Eth6 : 02:33:29:F2:AB:CC
Data De-duplication | Content Hash | Storage Block ID | XYZ | 948830038411
Stock Trading | Order ID | Symbol, Side, Price | ATY11217911101 | AAPL, B, 126.75
Graph Search | Vertex | Edge List | v140 | v201, v206, v225
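To make the key/value idea concrete, the sample rows from the slide can be held in ordinary dictionaries. This is only an illustrative sketch: the one-dictionary-per-application layout and the names `stores` and `lookup` are assumptions, not part of the Algo-Logic design.

```python
# Sample entries copied from the slide's examples table.
# Grouping them as one dict per application is an assumption for illustration.
stores = {
    "directory": {"Algo-Logic": "(408) 707-3740"},
    "forwarding": {"204.2.34.5": ("Eth6", "02:33:29:F2:AB:CC")},
    "dedup": {"XYZ": 948830038411},
    "stock_trading": {"ATY11217911101": ("AAPL", "B", 126.75)},
    "graph_search": {"v140": ["v201", "v206", "v225"]},
}


def lookup(application, key):
    """Exact-match lookup: return the stored value, or None if the key is absent."""
    return stores[application].get(key)


print(lookup("stock_trading", "ATY11217911101"))
print(lookup("graph_search", "v140"))
```

Every application in the table reduces to the same operation, an exact-match search on a key, which is why a single KVS engine can serve all of them.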

Page 5

Mobile Application Servers Need Fast Key/Value Stores

• Scalable backend services to share data

– Sensors { location, bio, movement, .. }

– Social { status, dating, updates, multi-player games .. }

– Media { video/security, audio/music, .. }

– Communication { network status, handoff, short messages .. }

– Database { users, providers, payments, travel, authentication, .. }

• Must be able to scale

– as the number of users grows

– Scale up to provide the best latency, throughput, and power

– Scale out to increase storage capacity, throughput, and redundancy

• Example:

– Mobile location sharing


Page 6

Case Study: Implementing Uber with KVS

• Uber in 2014

– 162,037 drivers in the US completed 4 or more trips

– New drivers doubled every 6 months for past 2 years

– Number of Uber Users = 8M

– Number of Cities = 290

– Total trips = 140M

– Daily Trips = 1M

• Analysis and Assumptions

– Assuming 25% of the drivers are active

– 25% of 160k drivers = 40k active cars (<48k)

– Drivers update position once per second = 40k IOW

• Implementation

– Uber on an Algo-Logic KVS card

Washington Post, Dec. 2014
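The sizing estimate on this slide can be reproduced with a few lines of arithmetic. The sketch below uses the exact driver count from the slide rather than the rounded 160k; the function name is hypothetical.

```python
def estimate_update_rate(drivers, active_fraction, updates_per_driver_per_sec):
    """Return (active drivers, position updates per second) for the given assumptions."""
    active = int(drivers * active_fraction)
    return active, active * updates_per_driver_per_sec


# Figures from the slide: 162,037 US drivers, 25% assumed active,
# each active driver updating its position once per second.
active, updates_per_sec = estimate_update_rate(162_037, 0.25, 1)
print(active, updates_per_sec)
```

Both values come out slightly above 40,000, matching the slide's rounded figure of 40k active cars and 40k updates per second.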

Page 7

Algo-Logic’s KVS solution for Mobile Applications

Scale UP for FASTER access to shared data

[Figure: Application server with KVS on FPGA.]

Page 8

Algo-Logic’s KVS solution for Mobile Applications

[Figure: App servers connected to multiple 2U systems, each running KVS on FPGA.]

And SCALE-OUT quickly to increase storage capacity and throughput
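The slides do not say how keys are spread across the 2U systems when scaling out. One common approach, shown here purely as an assumption, is hash partitioning: each app server hashes the key and sends the request to the KVS system that owns that hash bucket. The function name `pick_server` is hypothetical.

```python
import hashlib


def pick_server(key: bytes, num_servers: int) -> int:
    """Map a key to one of num_servers KVS systems using a stable hash."""
    digest = hashlib.sha256(key).digest()
    return int.from_bytes(digest[:8], "big") % num_servers


# Example: spread 1,000 hypothetical driver IDs over five KVS systems.
counts = [0] * 5
for i in range(1000):
    counts[pick_server(f"driver-{i}".encode(), 5)] += 1
print(counts)
```

With simple modulo partitioning, adding a system changes which server owns most keys; a production design would need a rebalancing or consistent-hashing scheme, which is outside what the slides describe.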

Page 9

Provisioning and Measurements with GDN-Switch

Page 10

Linux Software Socket KVS

[Figure: OCSM packet over 10G Ethernet → Intel 10G NIC → kernel driver → message processing. Legend: arrows denote data transfer. Algo-Logic software on Intel 82598 10GE NIC and Core i7-4770k CPU.]
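For readers unfamiliar with the socket datapath, here is a minimal sketch of a UDP key/value server using the standard socket API, where every request crosses the kernel network stack. The wire format is hypothetical: the slides do not define the OCSM layout, so this example invents a 1-byte opcode followed by a 96-bit key and a 96-bit value. It is not Algo-Logic's software.

```python
import socket
import struct

# Hypothetical wire format (NOT the OCSM format, which the slides do not define):
# 1-byte opcode, 12-byte key (96 bits), 12-byte value (96 bits).
MSG = struct.Struct("!B12s12s")
OP_SET, OP_GET = 1, 2


def handle_message(store, payload):
    """Apply one request to the dict-backed store and return the reply bytes."""
    op, key, value = MSG.unpack(payload)
    if op == OP_SET:
        store[key] = value
    elif op == OP_GET:
        value = store.get(key, bytes(12))  # all-zero value if the key is missing
    return MSG.pack(op, key, value)


def serve(host="127.0.0.1", port=9000):
    """Blocking receive/reply loop; each packet goes through the kernel stack."""
    store = {}
    with socket.socket(socket.AF_INET, socket.SOCK_DGRAM) as sock:
        sock.bind((host, port))
        while True:
            payload, peer = sock.recvfrom(2048)
            if len(payload) == MSG.size:
                sock.sendto(handle_message(store, payload), peer)
```

Calling `serve()` starts the loop. Each `recvfrom` and `sendto` is a system call, which is the per-packet operating system overhead the slides identify as the bottleneck of this implementation.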

Page 11

Algo-Logic KVS with DPDK: Bypass the Kernel

[Figure: OCSM packet over 10G Ethernet → Intel 82598 DPDK-supported NIC → receive queue (enqueue/dequeue) → message buffer → message processing with store → response generation → transmit queue (enqueue/dequeue). Legend: arrows denote data transfer and control handoff. Algo-Logic software on Intel 82598 10GE NIC and Core i7-4770k CPU.]

Note: Message read once into CPU cache
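The receive-queue, process, transmit-queue flow in the diagram can be modelled as a polling loop. The sketch below is a toy model in plain Python, not the DPDK API: the queues are `deque` objects, the request tuples are invented for illustration, and the burst size of 32 is an assumption.

```python
from collections import deque

BURST = 32  # assumed burst size; the slides do not state one


def poll_once(rx_queue, tx_queue, store):
    """Dequeue up to BURST requests, process each, and enqueue the responses."""
    handled = 0
    while rx_queue and handled < BURST:
        op, key, value = rx_queue.popleft()  # dequeue from the receive queue
        if op == "set":
            store[key] = value
        else:  # "get"
            value = store.get(key)
        tx_queue.append((op, key, value))  # enqueue on the transmit queue
        handled += 1
    return handled


rx = deque([("set", "a", 1), ("get", "a", None), ("get", "b", None)])
tx = deque()
print(poll_once(rx, tx, {}), list(tx))
```

The point of the model is that the application polls the queues directly in user space instead of making a system call per packet, which is how kernel bypass removes the operating system from the datapath.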

Page 12

Algo-Logic GDN-Search: KVS in FPGA

Algo-Logic's FPGA solution for key/value search, optimised for a large number of key/value search entries.

[Figure: Two packet handlers inside the FPGA, one per 10G Ethernet port (MAC0 and MAC1). Each takes OCSM packets from MAC RX into a request generator (packet parser, OCSM header identifier, key-value extractor), then an EMSE2 key-value search, then a response generator (key-value response decoder, OCSM header reconstruct, packet reconstruct) out to MAC TX. The MAC0 path is labelled 32B OCSM and the MAC1 path 64B OCSM. Key-value search widths shown: Key: 96b, Value: 96b and Key: 96b, Value: 352b. Storage shown: on-chip memory and four off-chip memories. Algo-Logic gateware on Nallatech P385 with Altera Stratix V A7 FPGA.]
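The message sizes and key/value widths in the diagram are consistent with each other under one reading. Assuming (the slides do not state this) that the 32B OCSM carries the 96-bit value, the 64B OCSM carries the 352-bit value, and key and value travel in the same message, the leftover space is the same in both cases. The function name is hypothetical.

```python
def spare_bytes(message_bytes, key_bits, value_bits):
    """Bytes left in a message after the key and value fields are packed in."""
    return message_bytes - (key_bits + value_bits) // 8


# Assumed pairing of message size with value width (not stated on the slide).
print(spare_bytes(32, 96, 96))
print(spare_bytes(64, 96, 352))
```

Both cases leave 8 bytes. What those bytes hold (for example a header or opcode) is not described in the slides.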

Page 13

Trends for Adding Storage around FPGAs

• Six banks of memory controllers

– QDR SRAM

– RLDRAM

– DDR3, DDR4

• 64 lanes of SERDES

– SATA disk and Flash

– Serial memories

• Hybrid Memory Cube (HMC)

• MoSys Bandwidth Engine (BE2, BE3)

• New Memories

– 3D XPoint with DDR4 Interface

– Potential for Terabytes of Memory on each card

Page 14

Implementation of KVS with Socket I/O, DPDK, and FPGA

• Benchmark same application

– Key/Value Store (KVS)

• Running on the same PC

– Intel i7-4770k CPU, 82598 NIC, and Altera Stratix V A7 FPGA

• With three different implementations

– Socket I/O, DPDK, FPGA

[Figure: Block diagrams of the three implementations.

Socket I/O: OCSM packet over 10G Ethernet → Intel 10G NIC → kernel driver → message processing. Algo-Logic software on Intel 82598 10GE NIC and Core i7-4770k CPU.

DPDK: OCSM packet over 10G Ethernet → Intel 82598 DPDK-supported NIC → receive queue → message buffer → message processing with store → response generation → transmit queue. Note: message read once into CPU cache. Algo-Logic software on Intel 82598 10GE NIC and Core i7-4770k CPU.

FPGA: OCSM packet over 10G Ethernet → request generator (packet parser, OCSM header identifier, key/value extractor) → Exact Match Search Engine (EMSE) key/value search → response generator (response decoder, OCSM header reconstruct, packet modifier). Algo-Logic gateware on Nallatech P385 with Altera Stratix V A7 FPGA.]

Page 15

KVS Hardware in Data Center Rack

[Figure: Data center rack containing a provision controller; an FPGA GDN-Classify switch (PHY, MAC, packet parser, key extractor, associative rule-match CAM, flow or ACL, target queues, PHYs, MACs); KVS in software, KVS in DPDK, and KVS in FPGA servers; UPS power; a rack of search servers; and additional KVS servers, connected by 40G and 10G links.]

Page 16

Load Testing of the KVS Implementations

[Figure: GEN traffic generator (Intel 82599 10G NIC and two Mellanox 10G NICs) driving the three KVS implementations: Intel 82598 10G NIC with kernel driver, Intel 82598 10G NIC with Intel DPDK EAL, and Nallatech P385 10G, each followed by message processing.]

Page 17

[Figure: Packet processing rate versus TX load, with annotations:]

• Full-rate packet processing (100% of TX load)

• Software drops packets

• DPDK drops packets

• Stratix V FPGA with EMSE-2 drops no packets

Page 18

Sockets Power Consumption Profile (10M Packets with 40 CSM Messages/Packet)

Page 19

DPDK Power Consumption Profile (10M Packets with 40 CSM Messages/Packet)

Page 20

Latency Measurement with GDN-Classify

• Round trip latency of GDN Switch is deterministic (constant)

• Round trip latency of KVS Server = Total round trip time – Round trip time with 10G loopback on GDN Switch

[Figure: GDN Switch (40G Rx, 40G Tx, 10G Tx, 10G Rx, loopback, time stamp T_out - T_in) connected to the KVS server. Measured time = switch latency (constant for GDN-Switch) + search latency (different for software, DPDK, and GDN-Search).]
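The subtraction described above can be sketched as follows. The timestamp values are hypothetical and chosen only to illustrate the calculation; they are not measurements from the slides.

```python
from statistics import mean


def server_latencies_ns(total_rtt_ns, loopback_rtt_ns):
    """Subtract the constant switch loopback time from each measured round trip."""
    return [t - loopback_rtt_ns for t in total_rtt_ns]


# Hypothetical T_out - T_in samples in nanoseconds, for illustration only.
samples = [1467, 1470, 1464]
loopback = 1000  # hypothetical round trip time with 10G loopback on the switch

latencies = server_latencies_ns(samples, loopback)
print(latencies, mean(latencies))
```

Because the switch contribution is constant, any spread left in `latencies` is jitter introduced by the KVS server itself.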

Page 21

KVS Latency in FPGA, DPDK, and Sockets

[Figure: Latency Comparison, 100k packets, 1 OCSM per packet, 1k pps. x-axis: Latency Distribution [µs], 0 to 44; y-axis: Percentage of Observed Packets, 0% to 50%. Series: RTL, Sockets, DPDK.]

Altera Stratix V RTL Average: 0.467µs

Sockets Average: 41.40µs

DPDK Average: 6.29µs

[Figure: Socket Implementation Latency Distribution with One OCSM/Packet. x-axis: Latency Distribution [µs], 38.0 to 48.0; y-axis: Percentage of Packets Observed [%], 0% to 0.70%. Series: Sockets.]

Intel i7 Average: 41.54µs

• KVS in Software: Worst Latency, Worst Jitter

• KVS in DPDK: Lowers Latency, Some Jitter

• KVS in FPGA: Best Latency, No Jitter

• Lower Latency = Faster Response

• Tighter Spread = Less Jitter

Page 22

Measured Latency, Throughput, and Power Results

All Datapaths Summary

Datapath | Latency (µseconds) | Tested Throughput (CSMs/sec) | Power (µJoules/CSM)
Sockets | 41.54 | 4.0 | 11
DPDK | 6.434 | 16 | 6.6
RTL | 0.467 | 15 | 0.52

All Datapaths Summary

Comparison | Latency (µseconds) | Maximum Throughput (CSMs/sec) | Power (µJoules/CSM)
GDN vs. Sockets | 88x less | 13x | 21x less
GDN vs. DPDK | 14x less | 3.2x | 13x less
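The latency and power factors in the comparison table can be checked against the measured values in the first table. The sketch below does only that arithmetic; the dictionary layout and function name are hypothetical. The throughput factors refer to maximum throughput and cannot be derived from the tested-throughput column, so they are left out.

```python
# Measured values copied from the first summary table.
results = {
    "Sockets": {"latency_us": 41.54, "uj_per_csm": 11},
    "DPDK": {"latency_us": 6.434, "uj_per_csm": 6.6},
    "RTL": {"latency_us": 0.467, "uj_per_csm": 0.52},
}


def ratio_vs_rtl(name, metric):
    """How many times larger the named datapath's metric is than the FPGA (RTL) one."""
    return results[name][metric] / results["RTL"][metric]


for name in ("Sockets", "DPDK"):
    print(name,
          f"latency {ratio_vs_rtl(name, 'latency_us'):.1f}x",
          f"power {ratio_vs_rtl(name, 'uj_per_csm'):.1f}x")
```

The computed ratios come out at roughly 89 and 13.8 for latency and roughly 21.2 and 12.7 for power, in line with the rounded 88x, 14x, 21x, and 13x factors reported on the slide.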

Page 23

Conclusions: Programmable Hardware in the Data Center

• Lowers Latency

– 88x faster than Linux networking sockets

– 14x faster than optimized DPDK (kernel bypass)

• Increases Throughput (IOPs)

– 3x to 13x improvement in throughput

– Lowers Capital Expenditures (CapEx)

• Reduces Power

– 13x to 21x reduction in power

– Reduces Operating Expenditures (OpEx)

