Page 1: Hardware, Software, and Network Approaches towards ...alchem.usc.edu/ceng-seminar/slides/2018/USC-Yiying-10-05-18.pdf• Swap to SSD, and ramdisk • InfiniSwap [NSDI’17] 43 ExCache/Memory

Farewell to Servers: Hardware, Software, and Network Approaches towards Datacenter Resource Disaggregation

Yiying Zhang

Monolithic Computer

OS / Hypervisor

Can monolithic servers continue to meet datacenter needs?

Hardware

Application

Flexibility

Perf / $

Heterogeneity

FPGA, GPU, TPU, ASIC, HBM, NVM, NVMe, DNA Storage

Making new hardware work with existing servers is like fitting puzzle pieces

Poor Hardware Elasticity

• Hard to change hardware components

- Add (hotplug), remove, reconfigure, restart

• No fine-grained failure handling

- The failure of one device can crash a whole machine

Poor Resource Utilization

• Whole VM/container has to run on one physical machine

- Move current applications to make room for new ones

[Figure: cpu and mem on Server 1 and Server 2 (available space) versus Job 1 and Job 2 (required space); the leftover capacity is wasted.]
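
The packing problem on this slide can be made concrete with a small sketch. The capacities and the job size below are invented numbers, not values from the talk, and the two helper functions are hypothetical names. The point is only that a job can fail to fit on any single server while the same free resources, pooled, would be enough.

```python
from dataclasses import dataclass


@dataclass
class Resources:
    cpu: int  # cores
    mem: int  # GB


def fits_on_one_server(job: Resources, servers: list[Resources]) -> bool:
    """Monolithic rule: the whole job must fit on a single server."""
    return any(s.cpu >= job.cpu and s.mem >= job.mem for s in servers)


def fits_in_pool(job: Resources, servers: list[Resources]) -> bool:
    """Disaggregated rule: CPU and memory come from independent pools."""
    total_cpu = sum(s.cpu for s in servers)
    total_mem = sum(s.mem for s in servers)
    return total_cpu >= job.cpu and total_mem >= job.mem


# Hypothetical free capacity: Server 1 has spare CPU, Server 2 has spare memory.
free = [Resources(cpu=6, mem=2), Resources(cpu=1, mem=12)]
job = Resources(cpu=4, mem=8)

monolithic_ok = fits_on_one_server(job, free)
pooled_ok = fits_in_pool(job, free)
print(monolithic_ok, pooled_ok)
```

With these numbers the job is rejected under the single-server rule but accepted once CPU and memory are treated as separate pools.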

Resource Utilization in Production Clusters

Unused Resource + Waiting/Killed Jobs Because of Physical-Node Constraints

* Google Production Cluster Trace Data. https://github.com/google/cluster-data

* Alibaba Production Cluster Trace Data. https://github.com/alibaba/clusterdata

How to achieve better heterogeneity, flexibility, and perf/$?

Go beyond the physical node boundary

Resource Disaggregation:

Breaking monolithic servers into network-attached, independent hardware components

Network

Hardware

Application

Flexibility

Perf / $

Heterogeneity

Why Possible Now?

• Network is faster

- InfiniBand (200Gbps, 600ns)

- Optical Fabric (400Gbps, 100ns)

• More processing power at device

- SmartNIC, SmartSSD, PIM

• Network interface closer to device

- Omni-Path, Innova-2

Intel Rack-Scale System

Berkeley Firebox

IBM Composable System

HP The Machine
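
To get a feel for the network figures on this slide, here is a back-of-the-envelope sketch. It assumes a very simple cost model (fixed latency plus size divided by bandwidth) and a 4096-byte transfer unit; both are assumptions for illustration, and the model ignores protocol overhead and queuing.

```python
def transfer_time_ns(size_bytes: int, bandwidth_gbps: float, latency_ns: float) -> float:
    """Fixed latency plus serialization time for one transfer, in nanoseconds."""
    bits = size_bytes * 8
    # 1 Gbps is 1 bit per nanosecond, so bits / Gbps gives nanoseconds.
    return latency_ns + bits / bandwidth_gbps


PAGE = 4096  # bytes; an assumed transfer unit

# Bandwidth and latency figures are the ones quoted on the slide.
infiniband = transfer_time_ns(PAGE, bandwidth_gbps=200, latency_ns=600)
optical = transfer_time_ns(PAGE, bandwidth_gbps=400, latency_ns=100)
print(f"InfiniBand: {infiniband:.2f} ns, optical fabric: {optical:.2f} ns")
```

Under this model a 4 KB transfer takes roughly 764 ns over InfiniBand and roughly 182 ns over the optical fabric, so latency, not bandwidth, dominates at this size.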

Disaggregated Datacenter

Flexibility

$ Cost

Performance

Reliability

Heterogeneity

Hardware

Unmodified Application

Network

OS

Dist Sys

End-to-End Solution

Disaggregated Datacenter

Physically Disaggregated Resources

Networking for Disaggregated Resources

RDMA Network

Kernel-Level RDMA Virtualization (SOSP’17)

New Processor and Memory Architecture

End-to-End Solution

Disaggregated Operating System (OSDI’18)

Can Existing Kernels Fit?

[Figure: two existing kernel designs. Monolithic/Micro-kernel (e.g., Linux, L4): each server runs one monolithic kernel or microkernel over its CPU, mem, Disk, and NIC, with a network across servers. Multikernel (e.g., Barrelfish, Helios, fos): one kernel per Core, GPU, and P-NIC inside a monolithic server with shared main memory.]

Existing Kernels Don’t Fit

• Access remote resources

• Distributed resource mgmt

• Fine-grained failure handling

When hardware is disaggregated, the OS should be also

[Figure: the OS functions (Process Mgmt, Virtual Memory System, File & Storage System, Network) shown first inside one OS, then separated, each with its own Network part.]

The Splitkernel Architecture

• Split OS functions into monitors

• Run each monitor at h/w device

• Network messaging across non-coherent components

• Distributed resource mgmt and failure handling

[Figure: each hardware component runs its own monitor: Processor (CPU) with Process Monitor, Processor (GPU) with GPU Monitor, Memory with Memory Monitor, NVM with NVM Monitor, SSD with SSD Monitor, Hard Disk with HDD Monitor, and new h/w (XPU) with XPU Manager, connected by network messaging across non-coherent components.]
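
The splitkernel idea (one monitor per device, no shared state, only messages) can be sketched as a toy simulation. Everything below is hypothetical: the class names, the message format, and the in-process queue standing in for the network are illustration only, not LegoOS code.

```python
from collections import deque


class Network:
    """Toy stand-in for the network: a FIFO queue of (dst, message) pairs."""

    def __init__(self):
        self.monitors = {}
        self.queue = deque()

    def attach(self, monitor):
        self.monitors[monitor.name] = monitor
        monitor.net = self

    def send(self, dst, msg):
        self.queue.append((dst, msg))

    def run(self):
        # Deliver messages until none are left; handlers may send more.
        while self.queue:
            dst, msg = self.queue.popleft()
            self.monitors[dst].handle(msg)


class MemoryMonitor:
    """Manages only its own memory component; shares no state with others."""

    def __init__(self, name):
        self.name = name
        self.net = None
        self.store = {}

    def handle(self, msg):
        if msg["op"] == "write":
            self.store[msg["addr"]] = msg["value"]
        elif msg["op"] == "read":
            reply = {"op": "read_reply", "addr": msg["addr"],
                     "value": self.store.get(msg["addr"])}
            self.net.send(msg["src"], reply)


class ProcessMonitor:
    """Runs at the processor component and reaches memory only via messages."""

    def __init__(self, name):
        self.name = name
        self.net = None
        self.replies = []

    def handle(self, msg):
        if msg["op"] == "read_reply":
            self.replies.append(msg["value"])

    def write(self, mem, addr, value):
        self.net.send(mem, {"op": "write", "src": self.name,
                            "addr": addr, "value": value})

    def read(self, mem, addr):
        self.net.send(mem, {"op": "read", "src": self.name, "addr": addr})


net = Network()
proc, mem = ProcessMonitor("proc0"), MemoryMonitor("mem0")
net.attach(proc)
net.attach(mem)
proc.write("mem0", 0x1000, 42)
proc.read("mem0", 0x1000)
net.run()
print(proc.replies)
```

The process monitor never touches the memory monitor's dictionary directly, which mirrors the "non-coherent components" bullet: all coordination is explicit messaging.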

LegoOS: The First Disaggregated OS

Processor

Storage

Memory

NVM

How Should LegoOS Appear to Users?

• As a giant machine?

• As a set of hardware devices?

• Our answer: as a set of virtual Nodes (vNodes)

- Similar semantics to virtual machines

- Unique vID, vIP, storage mount point

- Can run on multiple processor, memory, and storage components

Abstraction - vNode

• One vNode can run on multiple hardware components

• One hardware component can run multiple vNodes

[Figure: the splitkernel diagram with vNode1 and vNode2 each spanning several hardware components and their monitors.]
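
The many-to-many relation between vNodes and hardware components can be written down as a small data structure. This is a sketch under assumptions: the `VNode` fields follow the slide (vID, vIP, storage mount point), while the component names, addresses, and the `vnodes_on` helper are invented for the example.

```python
from dataclasses import dataclass, field


@dataclass
class VNode:
    vid: int
    vip: str
    mount_point: str
    components: set[str] = field(default_factory=set)


def vnodes_on(component: str, vnodes: list[VNode]) -> list[int]:
    """Return the vIDs of all vNodes that use the given hardware component."""
    return sorted(v.vid for v in vnodes if component in v.components)


# Hypothetical placement: each vNode spans several components,
# and some components serve both vNodes.
vnode1 = VNode(1, "10.0.0.1", "/mnt/v1", {"cpu0", "mem0", "mem1", "ssd0"})
vnode2 = VNode(2, "10.0.0.2", "/mnt/v2", {"cpu0", "gpu0", "mem1", "hdd0"})

shared = vnode1.components & vnode2.components
print(sorted(shared))
print(vnodes_on("mem1", [vnode1, vnode2]))
```

Here `cpu0` and `mem1` are shared by both vNodes, while each vNode also has components of its own.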

Abstraction

• Appear as vNodes to users

• Linux ABI compatible

- Support unmodified Linux system call interface (common ones)

- A level of indirection to translate Linux interface to LegoOS interface

LegoOS Design

1. Clean separation of OS and hardware functionalities

2. Build monitor with hardware constraints

3. RDMA-based message passing for both kernel and applications

4. Two-level distributed resource management

5. Memory failure tolerance through replication

Separate Processor and Memory

[Figure: a processor (CPUs with caches, last-level cache, TLB, MMU) and DRAM with its page table (PT), first shown together and then split into a processor component and a memory component connected by a network. TLB, MMU, PT, and the virtual memory system move to the memory component, and every level on the processor side is labeled "Virtual Address".]

• Separate and move hardware units to memory component

• Separate and move virtual memory system to memory component

• Processor components only see virtual memory addresses

• Memory components manage virtual and physical memory

• All levels of cache are virtual cache
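
The split described here, where the processor only sees virtual addresses and the memory component owns translation, can be sketched as follows. This is a minimal illustration, not the LegoOS design: the class name, the per-process dictionary page table, the 4096-byte page size, and allocate-on-first-touch are all assumptions.

```python
PAGE_SIZE = 4096  # assumed page size for the sketch


class MemoryComponent:
    """Owns page tables and physical memory; the processor has neither."""

    def __init__(self, num_frames: int):
        self.frames = [bytearray(PAGE_SIZE) for _ in range(num_frames)]
        self.free = list(range(num_frames))
        self.page_tables = {}  # pid -> {virtual page number -> frame number}

    def _translate(self, pid: int, vaddr: int) -> tuple[int, int]:
        table = self.page_tables.setdefault(pid, {})
        vpn, offset = divmod(vaddr, PAGE_SIZE)
        if vpn not in table:
            # Allocate a physical frame on first touch.
            table[vpn] = self.free.pop()
        return table[vpn], offset

    def write(self, pid: int, vaddr: int, data: bytes) -> None:
        frame, offset = self._translate(pid, vaddr)
        assert offset + len(data) <= PAGE_SIZE, "sketch handles one page per access"
        self.frames[frame][offset:offset + len(data)] = data

    def read(self, pid: int, vaddr: int, length: int) -> bytes:
        frame, offset = self._translate(pid, vaddr)
        assert offset + length <= PAGE_SIZE, "sketch handles one page per access"
        return bytes(self.frames[frame][offset:offset + length])


# The "processor side" below only ever passes (pid, virtual address).
mem = MemoryComponent(num_frames=4)
mem.write(pid=1, vaddr=0x7000_0010, data=b"lego")
mem.write(pid=2, vaddr=0x7000_0010, data=b"os")
print(mem.read(1, 0x7000_0010, 4), mem.read(2, 0x7000_0010, 2))
```

Two processes use the same virtual address without conflict because the memory component maps each to a different physical frame; the caller never learns which one.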

Challenge: network is 2x-4x slower than memory bus

Add Extended Cache at Processor

[Figure: the processor component (CPUs, caches, last-level cache) gains a local DRAM next to the network link; TLB, MMU, PT, virtual memory, and the main DRAM stay at the memory component.]

• Add small DRAM/HBM at processor

• Use it as Extended Cache, or ExCache

• Software and hardware co-managed

• Inclusive

• Virtual cache
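
A rough sketch of how a virtually indexed ExCache behaves is shown below. The 4 KB line size follows the "Implementation and Emulation" slide; the LRU eviction policy, the class and function names, and the fake backing store are assumptions made for the example. Writes and write-back are left out to keep it short.

```python
from collections import OrderedDict

LINE_SIZE = 4096  # 4KB page as cache line


class ExCache:
    """Virtually indexed cache of remote memory; LRU eviction is an assumption."""

    def __init__(self, capacity_lines: int, fetch_remote):
        self.capacity = capacity_lines
        self.fetch_remote = fetch_remote  # stands in for a network request
        self.lines = OrderedDict()        # (pid, virtual line number) -> bytes
        self.hits = 0
        self.misses = 0

    def read(self, pid: int, vaddr: int) -> int:
        """Return the byte at a virtual address, fetching its line on a miss."""
        key = (pid, vaddr // LINE_SIZE)
        if key in self.lines:
            self.hits += 1                      # hit path: no network traffic
            self.lines.move_to_end(key)
        else:
            self.misses += 1                    # miss path: ask the memory component
            if len(self.lines) >= self.capacity:
                self.lines.popitem(last=False)  # evict the least recently used line
            self.lines[key] = self.fetch_remote(pid, key[1])
        return self.lines[key][vaddr % LINE_SIZE]


def fake_memory_component(pid: int, line_no: int) -> bytes:
    # Hypothetical backing store: every byte of a line equals line_no % 256.
    return bytes([line_no % 256]) * LINE_SIZE


cache = ExCache(capacity_lines=2, fetch_remote=fake_memory_component)
values = [cache.read(1, addr) for addr in (0, 8, 4096, 16, 8192, 4100)]
print(values, cache.hits, cache.misses)
```

The lookup key is the (process, virtual line) pair, so no address translation is needed on the processor side, which is what "virtual cache" implies.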

Distributed Resource Management

• Global Process Manager (GPM)

• Global Memory Manager (GMM)

• Global Storage Manager (GSM)

1. Coarse-grain allocation

2. Load-balancing

3. Failure handling

[Figure: the splitkernel diagram with global resource mgmt added alongside the per-component monitors.]
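
The two-level scheme (global managers for coarse-grained placement, per-component monitors for fine-grained allocation) might look like the sketch below for memory. The class names echo the slide, but the methods, the least-loaded placement policy, and the region sizes are assumptions; failure handling is not modeled.

```python
class MemoryMonitor:
    """Second level: fine-grained allocation inside regions it was handed."""

    def __init__(self, name: str, capacity: int):
        self.name = name
        self.capacity = capacity
        self.reserved = 0   # bytes handed out as coarse regions
        self.regions = {}   # vnode -> [region size, bytes used]

    def add_region(self, vnode: str, size: int) -> None:
        self.reserved += size
        region = self.regions.setdefault(vnode, [0, 0])
        region[0] += size

    def alloc(self, vnode: str, size: int) -> bool:
        region = self.regions[vnode]
        if region[1] + size > region[0]:
            return False    # region exhausted; caller asks the global manager
        region[1] += size
        return True


class GlobalMemoryManager:
    """First level: coarse-grained placement with simple load balancing."""

    def __init__(self, monitors):
        self.monitors = monitors

    def place(self, vnode: str, region_size: int) -> MemoryMonitor:
        candidates = [m for m in self.monitors
                      if m.capacity - m.reserved >= region_size]
        if not candidates:
            raise MemoryError("no memory component has room")
        # Pick the least-loaded component (an assumed policy).
        target = min(candidates, key=lambda m: m.reserved / m.capacity)
        target.add_region(vnode, region_size)
        return target


GB = 1 << 30
gmm = GlobalMemoryManager([MemoryMonitor("mem0", 8 * GB),
                           MemoryMonitor("mem1", 8 * GB)])
a = gmm.place("vnode1", 4 * GB)
b = gmm.place("vnode2", 2 * GB)
c = gmm.place("vnode3", 2 * GB)
print(a.name, b.name, c.name)
print(a.alloc("vnode1", 1 * GB), a.alloc("vnode1", 4 * GB))
```

The global manager is only consulted for large regions; small allocations are served locally by the monitor until its region runs out.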

Implementation and Emulation

• Processor

- Reserve DRAM as ExCache (4KB page as cache line)

- h/w only on hit path, s/w managed miss path

- Indirection layer to store states for 113 Linux syscalls

• Memory

- Limit number of cores, kernel-space only

• Storage/Global Resource Monitors

- Implemented as kernel module on Linux

• Network

- RDMA RPC stack based on LITE [SOSP’17]

[Figure: commodity servers (CPU, LLC, DRAM, Disk) connected by an RDMA network and emulating a Processor component (Process Monitor, ExCache), a Memory component (Memory Monitor), and a Storage component (Linux kernel module).]

Performance Evaluation

• Unmodified TensorFlow, running CIFAR-10

- Working set: 0.9G

- 4 threads

• Systems in comparison

- Baseline: Linux with unlimited memory

- Swap to SSD, and ramdisk

- InfiniSwap [NSDI’17]

[Figure: slowdown (y-axis, 1 to 7) versus ExCache/Memory Size in MB (128, 256, 512) for Linux-swap-SSD, Linux-swap-ramdisk, InfiniSwap, and LegoOS. LegoOS Config: 1P, 1M, 1S.]

Only 1.3x to 1.7x slowdown when disaggregating devices with LegoOS

To gain better resource packing, elasticity, and fault tolerance!

LegoOS Summary

• Resource disaggregation calls for new system

• LegoOS: a new OS designed and built from scratch for datacenter resource disaggregation

• Split OS into distributed micro-OS services, running at device

• Many challenges and many potentials

Disaggregated Datacenter

Physically Disaggregated Resources

Networking for Disaggregated Resources

RDMA Network

Kernel-Level RDMA Virtualization (SOSP’17)

New Processor and Memory Architecture

flexible, heterogeneous, elastic, perf/$, resilient, scalable, easy-to-use

Disaggregated Operating System (OSDI’18)

Network Requirements for Resource Disaggregation

• Low latency

• High bandwidth

• Scalability

• Reliability

RDMA

RDMA (Remote Direct Memory Access)

• Directly read/write remote memory

• Bypass kernel

• Memory zero copy

Benefits:

- Low latency

- High throughput

- Low CPU utilization

[Figure: the data path between two hosts (User, Kernel, CPU, Memory, NIC) for Socket over Ethernet compared with RDMA.]

Things have worked well in HPC

• Special hardware

• Few applications

• Cheaper developer

RDMA-Based Datacenter Applications

• [VLDB ’16] RSI

• [EuroSys ’16] DrTM+R

• [NSDI ’14] FaRM

• [SOSP ’15] FaRM+Xact

• [SIGCOMM ’14] HERD

• [ATC ’16] HERD-RPC

• [OSDI ’16] FaSST

• [ATC ’17] Octopus

• [ATC ’13] Pilaf

• [SoCC ’17] Hotpot

• [OSDI ’16] Wukong

• [SoCC ’17] APUS

• [SOSP ’15] DrTM

• [VLDB ’17] NAM-DB

• [ASPLOS ’15] Mojim

• [ATC ’16] Cell

What about datacenters?

• Commodity, cheaper hardware

• Many (changing) applications

• Resource sharing and isolation

Native RDMA

[Figure: kernel bypassing. In user space, the user-level RDMA app and its library handle ConnMgmt, MemMgmt, and send/recv, and track node, lkey, rkey, and addr for connections, queues, keys, and memory space. The OS in kernel space is bypassed. In hardware, the RNIC does the permission check and address mapping and holds lkey 1 … lkey n, rkey 1 … rkey n, and cached PTEs.]

Abstraction Mismatch

• Developers want (Socket): high-level, easy to use, resource sharing, isolation

• RDMA: low-level, difficult to use, difficult to share

• Fat applications, no resource sharing

Expensive, unscalable hardware

• On-NIC SRAM stores and caches metadata

[Figure: Requests/us (y-axis, 0 to 6) versus Total Size in MB (1, 4, 16, 64, 256, 1024) for Write-64B and Write-1K.]

Are we removing too much from kernel?

• Fat applications

• No resource sharing

• Expensive, unscalable hardware

Without Kernel: high-level abstraction, protection, resource sharing, performance isolation

LITE - Local Indirection TiEr: protection, performance isolation, resource sharing, high-level abstraction

LITE

[Figure: with LITE, user-level RDMA apps in user space call LITE APIs (Memory APIs, RPC/Msg APIs, Sync APIs). LITE in kernel space takes over ConnMgmt and MemMgmt (connections, queues, keys, memory space) together with the permission check and address mapping. In hardware, the RNIC keeps a global lkey and a global rkey in place of lkey 1 … lkey n, rkey 1 … rkey n, and cached PTEs.]

• Simpler applications

• Cheaper hardware

• Scalable performance
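
The idea of a local indirection tier can be sketched as a table of handles that is checked on the local node before any remote access. The code below is an illustration only: `IndirectionTier`, `register`, `write`, and `memset` are hypothetical names and not LITE's API, and a Python `bytearray` stands in for remote memory reached through RDMA.

```python
class Node:
    """A machine whose memory can be written remotely (stands in for RDMA)."""

    def __init__(self, size: int):
        self.memory = bytearray(size)


class IndirectionTier:
    """Kernel-level layer on the local node: maps handles to remote locations."""

    def __init__(self, nodes):
        self.nodes = nodes
        self.handles = {}    # handle -> (node id, base address, size, writable)
        self.next_handle = 1

    def register(self, node_id: int, base: int, size: int, writable: bool) -> int:
        handle = self.next_handle
        self.next_handle += 1
        self.handles[handle] = (node_id, base, size, writable)
        return handle

    def write(self, handle: int, offset: int, data: bytes) -> None:
        node_id, base, size, writable = self.handles[handle]
        # Permission and bounds checks happen locally, before any network use.
        if not writable:
            raise PermissionError("handle is read-only")
        if offset < 0 or offset + len(data) > size:
            raise ValueError("access outside the registered region")
        start = base + offset
        self.nodes[node_id].memory[start:start + len(data)] = data

    def memset(self, handle: int, value: int, length: int) -> None:
        """Remote memset expressed as a single write through the handle."""
        self.write(handle, 0, bytes([value]) * length)


nodes = {0: Node(64), 1: Node(64)}
lite = IndirectionTier(nodes)
h = lite.register(node_id=1, base=16, size=8, writable=True)
lite.memset(h, 0xAB, 8)
print(nodes[1].memory[14:26].hex())
```

The application holds only an opaque handle; node identity, addresses, and permissions live in the tier, which is the kind of state the native RDMA diagram places in the application library and on the RNIC.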

Implementing Remote memset

[Figure: remote memset implemented with native RDMA versus with LITE.]

Main Challenge: How to preserve the performance benefit of RDMA?

LITE Design Principles

1. Indirection only at local node

2. Avoid hardware-level indirection

3. Hide kernel-space crossing cost

Great Performance and Scalability

"… except for the problem of too many layers of indirection" (David Wheeler)

LITE RDMA: Size of MR Scalability

[Figure: Requests/us (y-axis, 0 to 6) versus Total Size in MB (1, 4, 16, 64, 256, 1024) for Write-64B, LITE_write-64B, Write-1K, and LITE_write-1K.]

LITE scales much better than native RDMA wrt MR size and numbers

LITE Application Effort

• Simple to use

• Needs no expert knowledge

• Flexible, powerful abstraction

• Easy to achieve optimized performance

Application        LOC     LOC using LITE   Student Days
LITE-Log           330     36               1
LITE-MapReduce     600*    49               4
LITE-Graph         1400    20               7
LITE-Kernel-DSM    3000    45               26

* LITE-MapReduce ports from the 3000-LOC Phoenix with 600 lines of change or addition

MapReduce Results

• LITE-MapReduce adapted from Phoenix [1]

[Figure: Runtime (sec) of Hadoop, Phoenix, and LITE for the configurations Phoenix, 2-node, 4-node, and 8-node.]

LITE-MapReduce outperforms Hadoop by 4.3x to 5.3x

[1]: Ranger et al., Evaluating MapReduce for Multi-core and Multiprocessor Systems. (HPCA 07)

LITE Summary

• Virtualizes RDMA into flexible, easy-to-use abstraction

• Preserves RDMA’s performance benefits

• Indirection does not always degrade performance!

• Division across user space, kernel, and hardware

Disaggregated Datacenter

Physically Disaggregated Resources

Networking for Disaggregated Resources

RDMA Network

Kernel-Level RDMA Virtualization (SOSP’17)

New Processor and Memory Architecture

flexible, heterogeneous, elastic, perf/$, resilient, scalable, easy-to-use

Disaggregated OS (OSDI’18)

Virtually Disaggregated Resources

Network-Attached NVM

Disaggregated Persistent Memory

Distributed Non-Volatile Memory

Distributed Shared Persistent Memory (SoCC ’17)

InfiniBand

New Network Topology, Routing, Congestion-Ctrl

Conclusion

• New hardware and software trends point to resource disaggregation

• My research pioneers an end-to-end solution for the disaggregated datacenter

• Opens up new research opportunities in hardware, software, networking, security, and programming languages

Thank you. Questions?

wuklab.io

