Farewell to Servers: Hardware, Software, and Network Approaches towards Datacenter Resource Disaggregation
Yiying Zhang
Monolithic Computer
[Diagram: applications run on an OS/hypervisor, which in turn runs on the hardware of a single server]
Can monolithic servers continue to meet datacenter needs?
• Heterogeneity
• Flexibility
• Perf / $
Heterogeneity: FPGA, GPU, TPU, ASIC, HBM, NVM, NVMe, DNA storage
Making new hardware work with existing servers is like fitting puzzle pieces together
Poor Hardware Elasticity
• Hard to change hardware components
- Add (hotplug), remove, reconfigure, restart
• No fine-grained failure handling
- The failure of one device can crash a whole machine
Poor Resource Utilization
• A whole VM/container has to run on one physical machine
  - Must move current applications to make room for new ones
[Diagram: Job 2's required CPU and memory fit within the cluster's total free space, but not on either Server 1 or Server 2 alone, leaving CPU and memory wasted on both servers]
Resource Utilization in Production Clusters
[Charts: unused resources, plus jobs waiting or killed because of physical-node constraints, in the Google* and Alibaba** production cluster traces]
* Google Production Cluster Trace Data. https://github.com/google/cluster-data
** Alibaba Production Cluster Trace Data. https://github.com/alibaba/clusterdata
How to achieve better heterogeneity, flexibility, and perf/$?
Go beyond the physical node boundary.

Resource Disaggregation:
Breaking monolithic servers into network-attached, independent hardware components
[Diagram: hardware components attach directly to the network; applications run across them, and heterogeneity, flexibility, and perf/$ all improve]
Why Possible Now?
• The network is faster
  - InfiniBand (200 Gbps, 600 ns)
  - Optical fabric (400 Gbps, 100 ns)
• More processing power at the device
  - SmartNIC, SmartSSD, PIM
• Network interfaces sit closer to the device
  - Omni-Path, Innova-2
Related efforts: Intel Rack-Scale System, Berkeley Firebox, IBM Composable System, HP The Machine
Disaggregated Datacenter
[Diagram: an end-to-end solution spanning hardware, network, OS, and distributed systems, running unmodified applications, and targeting flexibility, cost, performance, reliability, and heterogeneity]
Disaggregated Datacenter: An End-to-End Solution
• Physically Disaggregated Resources
  - New Processor and Memory Architecture
  - Disaggregated Operating System (OSDI '18)
• Networking for Disaggregated Resources
  - RDMA Network: Kernel-Level RDMA Virtualization (SOSP '17)
Can Existing Kernels Fit?
[Diagram: a monolithic kernel or microkernel (e.g., Linux, L4) runs one kernel per server, managing that server's CPU, memory, disk, and NIC, with the network only connecting servers; a multikernel (e.g., Barrelfish, Helios, fos) runs a kernel per core or device, but still relies on shared main memory within a single monolithic server]
A kernel for disaggregated hardware must instead:
• Access remote resources
• Manage resources in a distributed way
• Handle failures at a fine granularity
Existing Kernels Don’t Fit
When hardware is disaggregated, the OS should be too.
[Diagram: the OS is split into its pieces (process management, virtual memory system, file & storage system), each attached to its hardware resource (processor, memory, disk) and connected to the others over the network]
The Splitkernel Architecture
• Split OS functions into monitors
• Run each monitor at a hardware device
• Network messaging across non-coherent components (message format sketched below)
• Distributed resource management and failure handling
[Diagram: a process monitor runs on the CPU component, a GPU monitor on the GPU component, a memory monitor on the memory component, NVM/SSD/HDD monitors on the corresponding storage components, and an XPU manager on new hardware (XPU); all communicate by network messaging across non-coherent components]
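To make "network messaging across non-coherent components" concrete, here is a minimal C sketch of what a monitor-to-monitor message might carry; the names (monitor_id, lego_msg) are hypothetical illustrations, not LegoOS source:

```c
#include <stdint.h>

/* Hypothetical monitor identities and message format. In a splitkernel,
 * monitors share no memory, so all coordination happens through explicit
 * messages like this one, carried over the datacenter network. */
typedef enum { PROC_MON, MEM_MON, STORAGE_MON, GPU_MON } monitor_id;

struct lego_msg {
    monitor_id src, dst;   /* sending and receiving monitor         */
    uint32_t   opcode;     /* e.g., "fetch page", "forward syscall" */
    uint64_t   vaddr;      /* virtual address the request concerns  */
    uint32_t   len;        /* payload length in bytes               */
    uint8_t    payload[];  /* data follows the fixed header         */
};
```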
LegoOS: The First Disaggregated OS
[Diagram: LegoOS spans processor, memory, storage, and NVM components]
How Should LegoOS Appear to Users?
As a giant machine? As a set of hardware devices?
• Our answer: as a set of virtual nodes (vNodes)
  - Similar semantics to virtual machines
  - Unique vID, vIP, and storage mount point (see the descriptor sketch below)
  - Can run on multiple processor, memory, and storage components
Abstraction: vNode
• One vNode can run on multiple hardware components
• One hardware component can host multiple vNodes
[Diagram: two vNodes (vNode1, vNode2) are mapped onto the cluster's processor (CPU, GPU), memory, and storage (NVM, SSD, HDD) components; the vNodes overlap on shared components, and everything communicates by network messaging across non-coherent components]
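As a rough illustration of the vNode identity described above, a hypothetical descriptor might look like this (all field names are invented for exposition, not taken from LegoOS):

```c
#include <stdint.h>

#define MAX_COMPONENTS 8

/* Hypothetical vNode descriptor: a vNode looks like a VM to users
 * (one ID, one IP, one storage mount point) but is backed by a set
 * of disaggregated components rather than one physical machine. */
struct vnode {
    int      vid;                      /* unique vNode ID            */
    uint32_t vip;                      /* virtual IP address         */
    char     mount_point[64];          /* storage mount point        */
    int      pcomp[MAX_COMPONENTS];    /* processor components used  */
    int      mcomp[MAX_COMPONENTS];    /* memory components used     */
    int      scomp[MAX_COMPONENTS];    /* storage components used    */
};
```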
Abstraction
• Appears as vNodes to users
• Linux ABI compatible
  - Supports the unmodified Linux system call interface (the common calls)
  - A level of indirection translates the Linux interface to the LegoOS interface, as sketched below
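A minimal sketch of that indirection, assuming a hypothetical dispatch function and forwarding helpers (none of these names come from LegoOS):

```c
#include <asm/unistd.h>   /* Linux syscall numbers (__NR_*) */

/* Hypothetical forwarding helpers: each sends the syscall, over the
 * network, to the monitor that owns the state it touches. */
long forward_to_storage(long nr, long a0, long a1, long a2);
long forward_to_memory(long nr, long a0, long a1, long a2);
long handle_locally(long nr, long a0, long a1, long a2);

/* Dispatch a Linux syscall to the right LegoOS component. */
long lego_syscall(long nr, long a0, long a1, long a2)
{
    switch (nr) {
    case __NR_read:     /* file state lives at a storage component   */
    case __NR_write:
        return forward_to_storage(nr, a0, a1, a2);
    case __NR_brk:      /* address space lives at a memory component */
    case __NR_mmap:
        return forward_to_memory(nr, a0, a1, a2);
    default:            /* e.g., getpid() served by process monitor  */
        return handle_locally(nr, a0, a1, a2);
    }
}
```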
LegoOS Design
1. Clean separation of OS and hardware functionalities
2. Build monitors within hardware constraints
3. RDMA-based message passing for both the kernel and applications
4. Two-level distributed resource management
5. Memory failure tolerance through replication
Separate Processor and Memory
[Diagram: a processor component (CPUs with private and last-level caches) connected over the network to a memory component (DRAM with page tables)]
• Separate the hardware units for translation (TLB, MMU, page tables) and the virtual memory system from the processor, and move them to the memory component
• Processor components only see virtual memory addresses; all levels of processor cache are virtual caches
• Memory components manage both virtual and physical memory
Challenge: the network is 2x-4x slower than the memory bus
Add Extended Cache at Processor
[Diagram: a small DRAM sits next to the processor's last-level cache, in front of the network path to the memory component]
• Add a small DRAM/HBM at the processor
• Use it as an Extended Cache, or ExCache
  - Software and hardware co-managed (miss path sketched below)
  - Inclusive
  - Virtual cache
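A minimal sketch of the software-managed miss path, under assumed structures; excache_lookup, excache_evict_victim, net_fetch, and net_writeback are invented helper names, not LegoOS source:

```c
#include <stdint.h>
#include <stdbool.h>

#define EXCACHE_LINE 4096ULL          /* 4 KB page as the cache line */

struct exline {
    uint64_t tag;                     /* virtual page address        */
    bool     dirty;
    uint8_t  data[EXCACHE_LINE];
};

/* Hypothetical helpers: a hardware-assisted lookup on the hit path,
 * and network transfers to/from the memory component on misses. */
struct exline *excache_lookup(uint64_t vpage);
struct exline *excache_evict_victim(void);
void net_writeback(uint64_t vpage, const void *buf, uint64_t len);
void net_fetch(uint64_t vpage, void *buf, uint64_t len);

/* Hits are served by hardware; misses fall into this software path,
 * which evicts a victim line and fetches the page over the network. */
void *excache_access(uint64_t vaddr)
{
    uint64_t page = vaddr & ~(EXCACHE_LINE - 1);
    struct exline *line = excache_lookup(page);
    if (line)                          /* hit: hardware fast path    */
        return line->data + (vaddr - page);

    line = excache_evict_victim();     /* miss: software slow path   */
    if (line->dirty)
        net_writeback(line->tag, line->data, EXCACHE_LINE);
    net_fetch(page, line->data, EXCACHE_LINE);
    line->tag = page;
    line->dirty = false;
    return line->data + (vaddr - page);
}
```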
Distributed Resource Management
1. Coarse-grained allocation (placement sketch follows the diagram below)
2. Load balancing
3. Failure handling
[Diagram: a Global Process Manager (GPM), Global Memory Manager (GMM), and Global Storage Manager (GSM) sit above the per-component process, GPU, memory, NVM, SSD, and HDD monitors, which communicate by network messaging across non-coherent components]
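To illustrate the two-level split, here is a hedged C sketch in which the global memory manager makes only the coarse-grained, load-balancing placement decision; page-level allocation stays with the chosen component's memory monitor (all names are hypothetical):

```c
#include <stdint.h>

#define NR_MEM_COMPONENTS 16

/* Free bytes reported periodically by each memory monitor. */
static uint64_t mem_free[NR_MEM_COMPONENTS];

struct vregion {
    uint64_t start, len;   /* virtual address range       */
    int      home;         /* memory component backing it */
};

/* Coarse-grained, load-balancing placement: pick the component with
 * the most free memory. Fine-grained (page-level) allocation inside
 * the region is then handled locally by that component's monitor. */
int gmm_assign_vregion(struct vregion *vr)
{
    int best = 0;
    for (int m = 1; m < NR_MEM_COMPONENTS; m++)
        if (mem_free[m] > mem_free[best])
            best = m;
    vr->home = best;
    return best;
}
```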
Implementation and Emulation
• Processor
  - Reserve DRAM as ExCache (4 KB page as the cache line)
  - Hardware only on the hit path; software-managed miss path
  - Indirection layer stores state for 113 Linux syscalls
• Memory
  - Limited number of cores; kernel space only
• Storage / global resource monitors
  - Implemented as kernel modules on Linux
• Network
  - RDMA RPC stack based on LITE [SOSP '17]
[Diagram: the emulation runs the process monitor on one machine (CPUs, LLC, ExCache), the memory monitor on a second, and the storage monitor as a Linux kernel module on a third, all connected by an RDMA network]
Performance Evaluation
• Unmodified TensorFlow, running CIFAR-10
  - Working set: 0.9 GB
  - 4 threads
• Systems in comparison
  - Baseline: Linux with unlimited memory
  - Linux swapping to SSD, and to ramdisk
  - InfiniSwap [NSDI '17]
[Plot: slowdown vs. ExCache/memory size (128, 256, 512 MB) for Linux-swap-SSD, Linux-swap-ramdisk, InfiniSwap, and LegoOS; LegoOS config: 1P, 1M, 1S]
Only 1.3x to 1.7x slowdown when disaggregating devices with LegoOS, in exchange for better resource packing, elasticity, and fault tolerance!
LegoOS Summary
• Resource disaggregation calls for a new system
• LegoOS: a new OS designed and built from scratch for datacenter resource disaggregation
• Splits the OS into distributed micro-OS services, each running at a hardware device
• Many challenges, and much potential
Networking for Disaggregated Resources
RDMA Network: Kernel-Level RDMA Virtualization (SOSP '17)
Network Requirements for Resource Disaggregation
• Low latency
• High bandwidth
• Scalability
• Reliability
→ RDMA
RDMA (Remote Direct Memory Access)
• Directly read/write remote memory (minimal example after the diagram below)
• Bypass the kernel
• Zero memory copies
Benefits: low latency, high throughput, low CPU utilization
[Diagram: a socket send over Ethernet crosses user space, the kernel, and the NIC on both hosts; an RDMA operation goes from user space through the local NIC directly into the remote host's memory, bypassing both kernels]
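For readers who have not programmed RDMA, this is roughly what a one-sided write looks like with libibverbs, assuming a connected queue pair and a registered memory region already exist (all setup omitted); a sketch, not a complete program:

```c
#include <infiniband/verbs.h>
#include <stdint.h>

/* One-sided RDMA write: the remote CPU is never involved. Assumes a
 * connected QP, a registered local MR, and that the remote buffer's
 * address and rkey were already exchanged out of band. */
int rdma_write(struct ibv_qp *qp, struct ibv_mr *mr, void *buf,
               uint32_t len, uint64_t remote_addr, uint32_t rkey)
{
    struct ibv_sge sge = {
        .addr   = (uint64_t)(uintptr_t)buf,
        .length = len,
        .lkey   = mr->lkey,                  /* local protection key  */
    };
    struct ibv_send_wr wr = {0}, *bad_wr = NULL;

    wr.opcode              = IBV_WR_RDMA_WRITE;
    wr.sg_list             = &sge;
    wr.num_sge             = 1;
    wr.send_flags          = IBV_SEND_SIGNALED; /* request completion */
    wr.wr.rdma.remote_addr = remote_addr;
    wr.wr.rdma.rkey        = rkey;           /* remote protection key */

    return ibv_post_send(qp, &wr, &bad_wr);
    /* The caller then polls the completion queue (ibv_poll_cq). */
}
```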
Things have worked well in HPC
• Special hardware
• Few applications
• Cheaper developers
RDMA-Based Datacenter Applications
Pilaf [ATC '13], FaRM [NSDI '14], HERD [SIGCOMM '14], DrTM [SOSP '15], FaRM+Xact [SOSP '15], Mojim [ASPLOS '15], Cell [ATC '16], HERD-RPC [ATC '16], RSI [VLDB '16], DrTM+R [EuroSys '16], FaSST [OSDI '16], Wukong [OSDI '16], Octopus [ATC '17], APUS [SoCC '17], Hotpot [SoCC '17], NAM-DB [VLDB '17]
What about datacenters?
• Commodity, cheaper hardware
• Many (changing) applications
• Resource sharing and isolation
Native RDMA
[Diagram: a user-level RDMA application manages connections, queues, keys (lkeys/rkeys), and memory space itself, through a user-space library (connection management, memory management); the RNIC performs permission checks and address mapping, caching PTEs and per-region keys on the NIC; the kernel is bypassed entirely]
Abstraction Mismatch
• Developers want a high-level abstraction: easy to use, with resource sharing and isolation
• Native RDMA is low-level: difficult to use and difficult to share
• The result: fat applications and no resource sharing
Native RDMA: Scalability
[Plot: write requests/µs (0 to 6) vs. total registered memory size (1 MB to 1 GB), for 64 B and 1 KB writes; native RDMA throughput drops as the registered memory grows]
Expensive, unscalable hardware: on-NIC SRAM stores and caches metadata
Are we removing too much from the kernel?
• Fat applications
• No resource sharing
• Expensive, unscalable hardware
LITE: Local Indirection TiEr
Kernel bypassing gave up protection, performance isolation, resource sharing, and a high-level abstraction. LITE, a kernel-level indirection tier, brings all four back.
[Diagram: LITE inserts a kernel-space layer between user-level RDMA applications and the RNIC. Applications call LITE's memory, RPC/messaging, and synchronization APIs instead of managing connections, queues, keys, and memory registration themselves. LITE performs the permission checks and address mapping in the kernel, so the RNIC keeps only one global lkey/rkey instead of caching per-region metadata in SRAM.]
• Simpler applications
• Cheaper hardware, scalable performance
Implementing Remote memset: Native RDMA vs. LITE
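The original slide contrasts the two code paths side by side; here is a hedged reconstruction in C. The native side uses real libibverbs calls (heavily abridged); the LITE side uses a hypothetical lt_memset() standing in for LITE's memory API, since the exact call is not reproduced here:

```c
#include <infiniband/verbs.h>
#include <stddef.h>
#include <stdint.h>

/* --- Native RDMA: every step is the application's problem. -------- */
extern struct ibv_pd *pd;   /* protection domain from earlier setup   */

void native_remote_memset(struct ibv_qp *qp, uint64_t raddr,
                          uint32_t rkey, size_t len)
{
    static char zeros[4096];
    /* 1. Register a local buffer of zeros. */
    struct ibv_mr *mr = ibv_reg_mr(pd, zeros, sizeof(zeros),
                                   IBV_ACCESS_LOCAL_WRITE);
    (void)qp; (void)raddr; (void)rkey; (void)len;
    /* 2. (Earlier, out of band: exchange raddr and rkey.)            */
    /* 3. Post IBV_WR_RDMA_WRITE work requests on qp covering         */
    /*    [raddr, raddr+len), 4 KB at a time, polling the CQ.         */
    /* 4. Deregister when done. */
    ibv_dereg_mr(mr);
}

/* --- LITE: the kernel already holds mappings and permissions. ----- */
int lt_memset(int lt_handle, size_t offset, int c, size_t len); /* hypothetical */

void lite_remote_memset(int lt_handle, size_t len)
{
    lt_memset(lt_handle, 0, 0, len);   /* one call against a handle   */
}
```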
Main Challenge: How do we preserve the performance benefits of RDMA?

LITE Design Principles
1. Indirection only at the local node
2. Avoid hardware-level indirection (sketched below)
3. Hide the kernel-space crossing cost
→ Great performance and scalability

"…except for the problem of too many layers of indirection." – David Wheeler
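A hedged sketch of principles 1 and 2 together: keep the indirection in the local kernel, so the request posted to the RNIC already carries a resolved remote address and the single global lkey (lite_map_handle and lite_global_lkey are invented names):

```c
#include <infiniband/verbs.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical kernel-side state: one global lkey covers all of
 * kernel-mapped memory, so the RNIC needs no per-MR cache. */
extern uint32_t lite_global_lkey;

/* Hypothetical translation: handle + offset -> remote addr + rkey.
 * This is the only level of indirection, and it runs locally. */
int lite_map_handle(int lt_handle, size_t offset,
                    uint64_t *remote_addr, uint32_t *rkey);

int lite_post_write(struct ibv_qp *qp, void *kbuf, uint32_t len,
                    int lt_handle, size_t offset)
{
    uint64_t raddr;
    uint32_t rkey;
    if (lite_map_handle(lt_handle, offset, &raddr, &rkey))
        return -1;                       /* permission check failed   */

    struct ibv_sge sge = {
        .addr   = (uint64_t)(uintptr_t)kbuf,
        .length = len,
        .lkey   = lite_global_lkey,      /* one lkey for everything   */
    };
    struct ibv_send_wr wr = {0}, *bad_wr = NULL;
    wr.opcode              = IBV_WR_RDMA_WRITE;
    wr.sg_list             = &sge;
    wr.num_sge             = 1;
    wr.wr.rdma.remote_addr = raddr;      /* resolved before the NIC   */
    wr.wr.rdma.rkey        = rkey;
    return ibv_post_send(qp, &wr, &bad_wr);
}
```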
LITE RDMA: MR Size Scalability
[Plot: write requests/µs (0 to 6) vs. total registered memory size (1 MB to 1 GB): native RDMA write throughput drops as MR size grows, while LITE_write stays flat, for both 64 B and 1 KB writes]
LITE scales much better than native RDMA with respect to MR size and count
LITE Application Effort
• Simple to use
• Needs no expert knowledge
• Flexible, powerful abstraction
• Easy to achieve optimized performance

Application       LOC    LOC using LITE   Student Days
LITE-Log          330    36               1
LITE-MapReduce    600*   49               4
LITE-Graph        1400   20               7
LITE-Kernel-DSM   3000   45               26

* LITE-MapReduce ports the 3000-LOC Phoenix with 600 lines changed or added
MapReduce Results
• LITE-MapReduce adapted from Phoenix [1]
[Plot: runtime (sec) of Hadoop, Phoenix, and LITE-MapReduce, on a single node (Phoenix) and on 2, 4, and 8 nodes]
LITE-MapReduce outperforms Hadoop by 4.3x to 5.3x
[1] Ranger et al., "Evaluating MapReduce for Multi-core and Multiprocessor Systems," HPCA '07.
LITE Summary
• Virtualizes RDMA into a flexible, easy-to-use abstraction
• Divides work across user space, the kernel, and hardware
• Preserves RDMA's performance benefits
• Indirection does not always degrade performance!
Disaggregated Datacenter
(flexible, heterogeneous, elastic, perf/$, resilient, scalable, easy-to-use)
• Physically Disaggregated Resources
  - New Processor and Memory Architecture
  - Disaggregated OS (OSDI '18)
• Virtually Disaggregated Resources
  - Network-Attached NVM / Disaggregated Persistent Memory
  - Distributed Non-Volatile Memory: Distributed Shared Persistent Memory (SoCC '17)
• Networking for Disaggregated Resources
  - RDMA Network: Kernel-Level RDMA Virtualization (SOSP '17)
  - InfiniBand: New Network Topology, Routing, Congestion Control
Conclusion
• New hardware and software trends point to resource disaggregation
• My research pioneers an end-to-end solution for the disaggregated datacenter
• Disaggregation opens new research opportunities in hardware, software, networking, security, and programming languages