Network Requirements for Resource Disaggregation
Peter Gao (Berkeley), Akshay Narayan (MIT), Sagar Karandikar (Berkeley), Joao Carreira (Berkeley), Sangjin Han (Berkeley),
Rachit Agarwal (Cornell), Sylvia Ratnasamy (Berkeley), Scott Shenker (Berkeley/ICSI)
Disaggregated Datacenters
Current datacenter: server-centric. Future datacenter: disaggregated?
(Figure: a server-centric datacenter built from complete servers vs. a disaggregated datacenter in which CPU, GPU, memory, and storage blades attach directly to the datacenter network.)
Disaggregation efforts: HP (The Machine), Intel (RSD), Facebook, Huawei (NUWA), SeaMicro, Berkeley (FireBox).
Disaggregation Benefits (Architecture Community)
• Overcome the memory capacity wall
• Higher resource density
• Simplified hardware design
• Relaxed power and capacity scaling
Network is the Key
(Figure: in a server-centric datacenter, resources inside a server communicate over internal interconnects such as QPI, SMI, and PCIe; in a disaggregated datacenter, the datacenter network sits between CPUs, GPUs, memory, and storage.)
Existing prototypes use specialized hardware, such as silicon photonics or PCIe.
Do we need specialized hardware, or can the commodity datacenter network suffice?
• What end-to-end latency and bandwidth must the network provide for legacy apps?
• Do existing transport protocols meet these requirements?
• Do existing OS network stacks meet these requirements?
• Can commodity network hardware meet these requirements?
(Figure: the end-to-end path from Application through OS, Transport, NIC, and Switch to the Remote Resource's NIC, Transport, and OS.)
Answers, in brief: commodity hardware solutions may be sufficient ✔; current OS and network stacks are not, although feasible solutions exist ✘.
Throughout, we measure worst-case application performance degradation.
Assumptions
(Figure: CPU, memory, and storage blades attached to the datacenter network; cache coherence stays within a blade.)
• CPU: limited cache coherence domain; small amount of local cache (how much?)
• Memory: page-level remote memory access
• Storage: block-level distributed data placement
• Scale: rack-scale? datacenter-scale?
Methodology: Workload Driven
• 10 workloads on 8 applications
• ~ 125 GB input data
• 5 m3.2xlarge EC2 nodes
• Virtual Private Cloud enabled
Question: what end-to-end latency and bandwidth do these workloads require?
Workloads × applications:
• Batch processing — Wordcount, Sort, Pagerank, Collaborative Filtering: Spark, Hadoop, Timely Dataflow, Graphlab
• Interactive — key-value store, SQL, streaming: Memcached, HERD, Spark SQL, Spark Streaming
Disaggregated Datacenter Emulator
(Figure: the OS's memory is split into local RAM, accessed freely, and emulated remote RAM, reached via a special swap device that handles page faults.)
• Remote memory is backed by the machine's own RAM: memory is partitioned into a local part and an emulated remote part
• The special swap device injects latency and bandwidth constraints: delay = latency + request size / bandwidth
• This is akin to a dedicated link between CPU and remote memory (see the sketch below)
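A minimal back-of-the-envelope sketch of the delay model above (ours, not the authors' emulator code; the helper name and the 4KB page size are assumptions matching the page-level access model):

```python
# Sketch of the delay injected per remote-memory request:
#   delay = latency + request size / bandwidth

PAGE_SIZE_BYTES = 4096  # page-level remote memory access (assumption: 4KB pages)

def injected_delay_us(latency_us, bandwidth_gbps, request_bytes=PAGE_SIZE_BYTES):
    """Delay the emulator's swap device would charge for one request."""
    transmission_us = request_bytes * 8 / (bandwidth_gbps * 1e3)  # 1 Gbps = 1e3 bits/us
    return latency_us + transmission_us

# At the 3us / 40Gbps point found later in the talk, fetching one 4KB page:
# 3 + 32768/40000 = 3 + 0.82 ≈ 3.82us
print(f"{injected_delay_us(3.0, 40.0):.2f} us")
```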
Latency and Bandwidth Requirement
(Figure: per-workload performance degradation at 1us, 5us, and 10us network latency, each at 10Gbps, 40Gbps, and 100Gbps bandwidth; the 5% degradation level is marked.)
~3us latency / 40Gbps bandwidth is enough, ignoring queueing delay
Understanding Performance Degradation
(Figure: degradation vs. memory bandwidth for Spark Streaming Wordcount, Memcached YCSB, Graphlab CF, Hadoop Sort, Hadoop Wordcount, Timely Pagerank, HERD YCSB, Spark SQL BDB, Spark Sort, and Spark Wordcount.)
Performance degradation is correlated with application memory bandwidth
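A quick, hedged way to check this kind of correlation (the numbers below are placeholders for illustration, not the measured values from the figure):

```python
# Illustrative only: correlating worst-case degradation with the memory
# bandwidth each application consumes. The pairs below are PLACEHOLDERS,
# not the paper's measurements.
from statistics import correlation  # Pearson's r; Python 3.10+

mem_bw_gbps = [8, 15, 22, 30, 41]      # application memory bandwidth (placeholder)
degradation_pct = [2, 4, 7, 9, 13]     # worst-case slowdown (placeholder)

print(f"Pearson r = {correlation(mem_bw_gbps, degradation_pct):.2f}")
```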
(Figure: the real path from Application to Remote Resource runs through OS, transport, NIC, and switch; so far we emulated it as a direct link with 3us end-to-end latency and a 40Gbps dedicated link, i.e., no queueing delay.)
Transport Simulation Setting
(Figure: the special swap device is instrumented to record a flow trace; the trace is fed to a network simulator, which outputs the flow completion time distribution. A sketch of this pipeline follows.)
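A hedged sketch of the trace-replay pipeline (not the authors' tooling; the trace format and file name are hypothetical):

```python
# Replay a flow trace from the instrumented swap device and report the
# flow completion time (FCT) distribution. Hypothetical trace format:
# one flow per line, "start_time_us src dst size_bytes".

def read_trace(path):
    with open(path) as f:
        for line in f:
            start_us, src, dst, size = line.split()
            yield float(start_us), src, dst, int(size)

def fct_us(size_bytes, latency_us=3.0, bandwidth_gbps=40.0, queueing_us=0.0):
    # Idealized FCT: base latency + transmission + queueing. A real simulator
    # (e.g., one implementing pFabric or pHost) derives queueing delay from
    # competing flows rather than taking it as a constant.
    return latency_us + size_bytes * 8 / (bandwidth_gbps * 1e3) + queueing_us

fcts = sorted(fct_us(size) for _, _, _, size in read_trace("flows.trace"))
if fcts:
    print("median FCT:", fcts[len(fcts) // 2], "us")
    print("99th pct FCT:", fcts[min(len(fcts) - 1, int(len(fcts) * 0.99))], "us")
```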
Application Performance Degradation
(Figure: degradation on 40Gbps and 100Gbps networks, each shown with no queueing delay, with queueing delay at datacenter scale, and with queueing delay at rack scale; the ~5% degradation level is marked.)
• Queueing delay pushes existing transports past the target: we need new transport protocols
• Even on a 100Gbps network, datacenter scale suffices for some apps, while others need rack scale
(Figure: the end-to-end path again, annotated with the requirements so far: 3us end-to-end latency and 40Gbps dedicated-link behavior at the application, met by an efficient transport over a 100Gbps network.)
Is 100Gbps/3us achievable?
Feasibility of end-to-end latency within a rack
(Figure: latency breakdown between Application and Remote Resource, built up in stages; all numbers are optimistic estimates based on existing hardware.)
• Baseline components: propagation 0.32us, transmission 0.8us, switching 2us, data copying 2us, OS 1.9us — well over the 3us target
• A cut-through switch reduces switching delay from 2us to 0.48us
• CPU-NIC integration reduces data copying from 2us to 1us
• RDMA bypasses the 1.9us OS overhead
• With these, the 3us target is achievable within a rack (arithmetic below)
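The arithmetic behind the build above, as we read the slide's numbers (a sketch; treating RDMA's residual OS overhead as negligible is our assumption):

```python
# Latency components between application and remote resource (us),
# optimistic estimates from the slide.
baseline = {"propagation": 0.32, "transmission": 0.8,
            "switching": 2.0, "data_copying": 2.0, "os": 1.9}
print("baseline:", sum(baseline.values()), "us")  # 7.02us: misses the 3us target

optimized = dict(baseline,
                 switching=0.48,    # cut-through switch
                 data_copying=1.0,  # CPU-NIC integration
                 os=0.0)            # RDMA kernel bypass (assumed ~0 residual)
print("optimized:", round(sum(optimized.values()), 2), "us")  # 2.6us: under 3us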
Feasible to meet target across the datacenter?
(Figure: the end-to-end path annotated with how each requirement can be met.)
• 3us end-to-end latency, 40Gbps dedicated-link behavior
• Efficient transport: pFabric (SIGCOMM'13), pHost (CoNEXT'15)
• 100Gbps network: available
• Kernel bypassing: RDMA is common
• CPU-NIC integration: coming soon
• Cut-through switches: common(?)
• 100Gbps links: available
What’s next?
Please refer to our paper for evaluations of techniques that improve application performance in disaggregated datacenters.
Open directions:
• Application design
• Rethinking the OS stack: storage, network stack, failure models
• Network fabric design
Thank You!
Peter X. Gao, Akshay Narayan, Sagar Karandikar, Joao Carreira, Sangjin Han,
Rachit Agarwal, Sylvia Ratnasamy, Scott Shenker