Microkernel Construction
Case Study: M3

Nils Asmussen
July 4th, 2019
Heterogeneous Systems
Why?

- memcached: an FPGA-based implementation is 16 times better in performance per watt than an Atom CPU [1]
- machine learning: a custom accelerator is 20% faster than a GPU and requires 128 times less energy [2]

[1] Thin Servers with Smart Pipes: Designing SoC Accelerators for Memcached, ISCA'13
[2] PuDianNao: A Polyvalent Machine Learning Accelerator, ASPLOS'15
Future Platforms: Problems for Operating Systems

[Figure: a platform combining general-purpose cores (ARM, x86) with accelerators (FFT, DSP, GPU, TPU); a kernel instance runs on each general-purpose core]
Related Work

- Isolation of components
  - DPU, NoC-MPU
  - IOMMUs
- First-class handling of one specific accelerator
  - GPUfs, GPUnet, PTask
  - ReconOS, BORPH
- OSes for heterogeneous systems
  - Barrelfish
  - Popcorn Linux, K2
  - Helios
What If We Could Change Hardware?

Can we design a system that integrates all types of untrusted compute units as first-class citizens?
Goals for First-class Citizens

- Prevent harm by untrusted compute units (CUs)
- Access to operating-system services for all CUs
- Direct communication between all CUs
- Context-switching support for all CUs
Outline
1 Overall System Architecture
2 Prototype Platforms
3 Isolation and Communication
4 Capabilities
5 OS Services and Accelerators
6 Virtual Memory
7 Context Switching
8 Evaluation
1 Overall System Architecture
Hardware/Operating System Co-Design

[Figure: every PE (processing element) pairs a CU (compute unit: ARM, x86, FFT, DSP, FPGA, TPU) with a DTU (data transfer unit); the kernel runs on a dedicated kernel PE, the apps run on user PEs]

Key ideas:
- Minimize changes to existing components
- Add a uniform interface
- Kernel controls user PEs remotely
- Direct communication
2 Prototype Platforms
Tomahawk

[Figure: Tomahawk SoC; PEs with Xtensa LX4 core, instruction SPM, data SPM, and DTU, connected by a router-based NoC to DRAM via a memory controller]

PEs have no OS support:
- No privileged mode
- No MMU, no caches, but SPM (scratchpad memory)
- T2: simple DTU; T4: most features
Linux

- M3 runs on Linux, using it like a virtual machine
- A process simulates a PE, with two threads (CPU + DTU)
- DTUs communicate over UNIX domain sockets
- Not accurate, because
  - programs are executed directly on the host
  - data transfers have huge overhead compared to hardware
- Very useful for debugging and early prototyping
gem5

- Modular platform for computer architecture research
- Supports various ISAs (x86, ARM, Alpha, RISC-V, ...)
- Provides detailed CPU and memory models
- Cycle-accurate simulation
- Added a DTU model to gem5
- Added hardware accelerators
gem5 – Example Configuration

[Figure: example configuration with x86 PEs (DTU plus L1/L2 caches or SPM), accelerator PEs (DTU plus SPM or L1/IO caches), and DRAM]
3 Isolation and Communication
Isolation

[Figure: one kernel PE and several user PEs, each with CU and DTU; the apps run on the user PEs]

DTU-based isolation:
- Additional protection layer
- Only the kernel PE can establish communication channels
- User PEs can only use established channels
Communication

[Figure: DTU endpoints in use: a memory endpoint (M) accessing DRAM, a send endpoint (S) targeting a receive endpoint (R) on another PE]

The DTU provides endpoints to:
- Access memory (contiguous range, byte-granular)
- Receive messages into a receive buffer
- Send messages to a receiving endpoint
- Send replies for RPC
OS Design

- M3: microkernel-based system for heterogeneous manycores (or L4 ± 1)
- Implemented from scratch
- Drivers, file systems, etc. are implemented on user PEs
- Kernel manages permissions, using capabilities
- DTU enforces permissions (communication, memory access)
- Kernel is independent of other PEs

[Figure: kernel PE plus user PEs running M3FS, pipes, and apps]
M3 System Call

[Figure: the app's send endpoint (S) targets the kernel's receive endpoint (R); the system call travels as a message to the kernel and the result comes back as a reply]
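A system call is thus an ordinary RPC over the DTU: the app sends a message through its send gate to the kernel's receive gate and blocks on the reply. Below is a minimal sketch in libm3 style; the includes, the send_receive_vmsg usage, and how the syscall gate is obtained are assumptions for illustration, not verbatim M3 code.

// Sketch of a system call as DTU message passing (libm3-style,
// signatures assumed). The kernel set up `syscall_gate` at VPE
// creation; the app itself cannot create such a channel.
#include <m3/com/SendGate.h>
#include <m3/com/GateStream.h>

using namespace m3;

int do_syscall(SendGate &syscall_gate, int opcode) {
    // Send the request to the kernel's receive EP and block until
    // the kernel replies through the reply EP.
    GateIStream reply = send_receive_vmsg(syscall_gate, opcode);
    int res;
    reply >> res;   // the kernel marshals the result into the reply
    return res;
}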
4 Capabilities
Overview

[Figure: per-VPE capability spaces maintained by the kernel; selectors (0, 1, 2, ...) in VPE 1 and VPE 2 map to kernel objects such as VPE, SGate, and RGate capabilities]
Capabilities

M3 has the following capabilities:
- Send: send messages to a receive EP
- Receive: receive messages from send EPs
- Memory: access remote memory via DTU
- Mapping: access remote memory via load/store
- Service: create sessions
- Session: exchange caps with a service
- Endpoint: configure EPs of own or foreign DTU
- VPE: use a PE
Capability Exchange

The kernel provides syscalls to create, exchange, and revoke caps. There are two ways to exchange caps:
1 Directly with another VPE (typically a child VPE)
2 Over a session with a service

The kernel offers two operations:
1 Delegate: send a capability to somebody else
2 Obtain: receive a capability from somebody else

Difference to L4:
  - Applications communicate directly, without involving the kernel
  - Capability exchange therefore cannot be done during IPC
  - Special communication channel between kernel and servers
  - Kernel uses this channel to send exchange requests to the server
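For illustration, both exchange operations might look as follows on the client side, in the style of the VPE example later in the talk. Session, obtain, and MemGate::bind are assumptions about the library interface; only delegate and CapRngDesc appear verbatim on the VPE slide.

// Sketch of the two exchange operations (libm3-style, signatures assumed).
#include <m3/session/Session.h>
#include <m3/com/MemGate.h>
#include <m3/VPE.h>

using namespace m3;

void exchange(VPE &child) {
    // Obtain: receive a capability from a service over a session,
    // e.g. a memory capability backing some resource.
    Session sess("myservice");                  // hypothetical service name
    CapRngDesc caps = sess.obtain(1);           // assumed signature
    MemGate mem = MemGate::bind(caps.start());  // assumed: bind the new cap

    // Delegate: pass the capability on to a child VPE (as on the
    // VPE slide); the kernel records the new entry in its cap space.
    child.delegate(CapRngDesc(mem.sel(), 1));
}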
Communication

The DTU adds configuration of endpoints to establish a channel.

[Figure: the kernel (PE0) configures the DTUs of sender (VPE2 on PE2, Send cap → SendGate) and receiver (VPE1 on PE1, Recv cap → RecvGate); the send EP stores target, label, and credits, the receive EP stores the receive buffer with occupied/unread state, and messages consist of header and data]
Virtual PEs

- The M3 kernel manages user PEs in terms of VPEs
- A VPE is the combination of a process and a thread
- VPE creation yields a VPE capability and a memory capability
- The library provides primitives like fork and exec
- VPEs are used for all PEs:
  - Accelerators are not handled differently by the kernel
  - All VPEs can perform system calls
  - All VPEs can have time slices and priorities
  - ...
VPEs – Examples

Executing ELF binaries:

VPE vpe("test");
char *args[] = {"/bin/hello", "foo", "bar"};
vpe.exec(3, args);                       // load and run the binary on the VPE's PE

Asynchronous lambdas:

VPE vpe("test");
MemGate mem = MemGate::create_global(0x1000, RW);
vpe.delegate(CapRngDesc(mem.sel(), 1));  // give the VPE access to the memory
vpe.run_async([&mem]() {
    mem.read(buf, sizeof(buf));          // executed on the other PE
});
5 OS Services and Accelerators
OS Service Access for all CUs

sh$ decode in.png | fft | mul | ifft > out.raw

[Figure: the shell starts a user program (decode) reading an input file, chained via pipes to hardware accelerators for image processing (fft, mul, ifft), with the output redirected to a file]

Challenges:
- OS must provide generic protocols
- Accelerators need support for these protocols
Generic Protocols

[Figure: client and server with DTUs; the client sends req(in/out) via its send EP (S) to the server's receive EP (R), gets resp(pos,len), and accesses the data in DRAM through a memory EP (M)]

File protocol:
- Data resides in memory
- RPC between client and server:
  - req(in/out) requests the next piece and implicitly commits the previous piece
  - commit(nbytes) commits nbytes of the previous piece
- Server configures the client's memory EP
- Client accesses the data via its DTU
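To make the request/commit sequence concrete, here is a tiny self-contained model of a client reading a file. FileSession and MemEP are made-up stand-ins for the session RPCs and the DTU memory endpoint, so the sketch runs anywhere; it is not the libm3 API.

#include <cstdio>
#include <cstring>
#include <cstddef>

// MemEP models a DTU memory endpoint; the server decides which region
// it points to. Here the "region" is just a pointer into a local buffer.
struct MemEP {
    const char *base;
    void read(void *buf, size_t pos, size_t len) { memcpy(buf, base + pos, len); }
};

// FileSession models the server side of the file protocol: req(in)
// hands out the next piece and implicitly commits the previous one.
struct FileSession {
    MemEP ep;
    size_t size;
    size_t off;
    bool req_in(size_t *pos, size_t *len) {
        const size_t PIECE = 4;          // server hands out 4-byte pieces
        if (off >= size) return false;   // EOF
        *pos = off;
        *len = (size - off < PIECE) ? size - off : PIECE;
        off += *len;
        return true;
    }
};

int main() {
    static const char data[] = "hello, file protocol";
    FileSession file{{data}, sizeof(data), 0};
    char buf[sizeof(data)] = {0};
    size_t total = 0, pos, len;
    while (file.req_in(&pos, &len)) {        // RPC to the server (modeled locally)
        file.ep.read(buf + total, pos, len); // data access via the memory EP
        total += len;
    }
    printf("%s\n", buf);
    return 0;
}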
Implementation: M3FS – Overview

- M3FS organizes a file's data in extents
- M3FS can be used with a memory or a disk backend:
  - With the memory backend, the FS image is a contiguous region in DRAM; clients get access to parts of the image
  - With the disk backend, M3FS uses a buffer cache in DRAM; clients get access to parts of the buffer cache
- Two types of sessions: metadata session and file session
- The metadata session is created first and allows stat, open, ...
- open creates a new file session
- Both sessions can be cloned to give other VPEs access
Implementation: M3FS – File Protocol

- The file session implements the file protocol (plus seeking)
- The file session holds the file position and advances it on read/write
- req(in/out) requests the next extent
- M3FS configures the client's EP for this extent
- Appending reserves new space, invisible to other clients
- commit(nbytes) commits a previous append
Implementation: Pipe – Overview

[Figure: writer and reader exchange data through shared memory; a pipeserv server coordinates them via message passing]
Implementation: Pipe

- Two types of sessions: pipe session and channel session
- The pipe session represents the whole pipe and allows creating channels
- The channel session implements the file protocol
- Channel sessions can be cloned
- The server configures the client's EP just once at the beginning
- req(in/out) requests access to the next data
- commit(nbytes) commits the previous request
File Multiplexing

- The file protocol maps directly to EPs, which are a limited resource
- The number of open files shouldn't be limited (that much)
- libm3 dedicates at most 4 EPs to files and multiplexes them
- Multiplexing requires (see the sketch below):
  1 commit(nbytes) to commit read/written data
  2 revocation of the EP capability (old server)
  3 delegation of the EP capability (new server)
  4 the next read/write will contact the server again
- Fortunately, file multiplexing almost never happens
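The switching itself is a small bookkeeping problem: pick a victim EP, detach its file, attach the new one. The sketch below models it with hypothetical File/EpManager types and an LRU victim choice; the real libm3 policy and interfaces may differ.

#include <array>
#include <cstdint>

// Hypothetical model of libm3's EP multiplexing for files. A File that
// wants to do I/O must own one of the (at most) 4 file EPs.
struct File {
    int ep = -1;                  // EP index, or -1 if currently detached
    void commit() {}              // commit(nbytes) to the server (stubbed)
    void revoke_ep() { ep = -1; } // revoke the EP cap at the old server (stubbed)
    void delegate_ep(int e) { ep = e; } // delegate the EP cap to the new server
};

class EpManager {
    static const int FILE_EPS = 4;
    std::array<File*, FILE_EPS> owner{};    // which file owns each EP
    std::array<uint64_t, FILE_EPS> used{};  // last-use timestamps for LRU
    uint64_t time = 0;

public:
    // Ensure `f` owns an EP before a read/write; steal the LRU EP if needed.
    void attach(File &f) {
        if (f.ep < 0) {
            int victim = 0;
            for (int i = 1; i < FILE_EPS; ++i)
                if (used[i] < used[victim]) victim = i;
            if (owner[victim]) {
                owner[victim]->commit();    // 1. commit read/written data
                owner[victim]->revoke_ep(); // 2. revoke EP cap (old server)
            }
            f.delegate_ep(victim);          // 3. delegate EP cap (new server)
            owner[victim] = &f;             // 4. next read/write re-contacts server
        }
        used[f.ep] = ++time;
    }
};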
Additions to Accelerator

[Figure: an off-the-shelf accelerator (CU with scratchpad memory) is extended with a DTU and an ASM; the ASM owns send and memory EPs for an IN and an OUT channel]

Accelerator Support Module (ASM):
- Interacts with DTU and accelerator
- Implements the file protocol for the input and the output channel
- The ASM assumes that endpoints are set up externally by software
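Functionally, the ASM is a loop over two file-protocol channels: fetch the next input piece into the SPM, run the accelerator, write the result to the output channel. The following C++ model shows that loop; Channel and accelerate are hypothetical (the real ASM is a hardware state machine, shown on a backup slide), and the hardware-facing functions are left as declarations.

#include <cstddef>
#include <cstdint>

// Hypothetical channel interface following the file protocol
// (see the earlier sketch); not the real ASM or libm3 API.
struct Channel {
    bool req_in(size_t *pos, size_t *len);   // next input piece
    void read(void *spm, size_t pos, size_t len);
    bool req_out(size_t *pos, size_t *len);  // space for the next output piece
    void write(const void *spm, size_t pos, size_t len);
    void commit(size_t nbytes);              // commit written bytes
};

// The accelerator computes in-place on a piece in the scratchpad.
void accelerate(uint8_t *spm, size_t len);

// Model of the ASM loop: IN channel -> SPM -> accelerator -> OUT channel.
void asm_loop(Channel &in, Channel &out, uint8_t *spm, size_t spm_size) {
    size_t ipos, ilen;
    while (in.req_in(&ipos, &ilen) && ilen > 0) {  // get the next input piece
        size_t len = ilen < spm_size ? ilen : spm_size;
        in.read(spm, ipos, len);                   // DTU: input -> SPM
        accelerate(spm, len);                      // run the CU
        size_t opos, olen;
        out.req_out(&opos, &olen);                 // reserve output space
        out.write(spm, opos, len);                 // DTU: SPM -> output
        out.commit(len);                           // commit the piece
    }
}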
Demo
Accelerator Chains: Assisted by OS

[Figure: FFT, MUL, and IFFT accelerators, each with SPM and DMA; an OS driver moves the data from the input through each accelerator's SPM to the output]

OS-assisted accelerator chains:
- OS drives copy-in/copy-out of accelerator SPMs
- Only simple DMA needed
- As in traditional systems, high CPU overhead for the OS
Accelerator Chains: Fully Autonomous

[Figure: FFT, MUL, and IFFT accelerators, each with SPM, DTU, and ASM; the shell only configures the endpoints, the ASMs move the data from input to output themselves]

Autonomous accelerator chains:
- Shell configures all endpoints
- The ASMs of the accelerators drive the DTUs to transfer data autonomously
- Fully offloaded, almost no CPU overhead for the OS
Accelerator Chains: Results

[Figure: two plots over 1 to 4 parallel chains, comparing assisted and autonomous operation; left: time (ms, 0 to 20), right: CPU load (0.0 to 1.0)]
Accelerator Chains: Results (PCIe-like Latency)

[Figure: same experiment with PCIe-like latency; time up to 80 ms and CPU load 0.0 to 1.0 over 1 to 4 parallel chains]
6 Virtual Memory
Virtual Memory – Overview

[Figure: three PE types: accelerator with DTU and SPM; accelerator with DTU-provided MMU and cache; x86 core with its own MMU and cache plus a VM helper]

Different PE types:
- No MMU, SPM instead of caches
- MMU and caches provided by the DTU
- Reuse the existing MMU and caches of the CU
Page Fault Handling

[Figure: page-fault handling flow: the faulting app's PF message is sent via the DTU to the pager (VMA) on another PE; the pager issues create_map and further kernel requests to the kernel, which updates the PTEs of the faulting PE (via the DTU for PE-type B, via an IRQ for PE-type C)]
7 Context Switching
Context Switching – Overview

[Figure: the kernel PE runs per-PE switchers; x86/ARM user PEs run a software Ctx helper next to their VPEs, accelerator PEs have the helper in hardware]

- The kernel handles the complex part:
  - Schedules and migrates VPEs
  - Initiates context switches
- A helper on the user PEs implements save/restore:
  - General-purpose PEs: software helper
  - Accelerator PEs: helper implemented in hardware as part of the ASM
Context Switching with Direct Communication

- How to determine whether the recipient is running?
  - The DTU knows the running VPE and the recipient of a communication
  - The DTU reports an error if the recipient is not running
- How to deliver the message if the recipient is not running?
  - The message is forwarded via the kernel
  - The kernel schedules the recipient and delivers the message
- How does the kernel know what VPEs are doing?
  - Activities send idle notifications
  - Only if a compatible VPE is ready
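From the sender's perspective this is a fast path with a kernel fallback. The sketch below models it with hypothetical Dtu and Kernel interfaces; the error names and signatures are made up for illustration.

#include <cstddef>

// Hypothetical interfaces modeling the behavior described above.
enum class Error { NONE, VPE_GONE };

struct Dtu {
    // Direct send; fails with VPE_GONE if the recipient VPE
    // is not currently running on its PE.
    Error send(int ep, const void *msg, size_t len);
};

struct Kernel {
    // Forwarding syscall: the kernel schedules the recipient
    // and delivers the message on the sender's behalf.
    Error forward_msg(int ep, const void *msg, size_t len);
};

// Send with kernel fallback.
Error send_msg(Dtu &dtu, Kernel &kernel, int ep, const void *msg, size_t len) {
    Error e = dtu.send(ep, msg, len);         // fast path: recipient is running
    if (e == Error::VPE_GONE)
        e = kernel.forward_msg(ep, msg, len); // slow path: forward via kernel
    return e;
}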
8 Evaluation
Experimental Setup

- Evaluation platform is gem5
- Each general-purpose PE has an out-of-order x86-64 core @ 3 GHz, 32+32 KiB L1 cache, 256 KiB L2 cache
- Accelerator PEs are clocked at 1 GHz
- DRAM clocked at 1 GHz
- Short-running, but representative benchmarks
Linux Application Workloads

[Figure: execution time (ms, 0 to 10) of tar, untar, shasum, sort, find, SQLite, and LevelDB on M3 vs. Linux, split into app, transfer, and OS time]

- M3 vs. Linux 4.10
- Traced on Linux, replayed on M3
- M3FS vs. Linux tmpfs
- M3: 1+3 cores (kernel, app, pager, M3FS); Linux: 1 core
PE Sharing

[Figure: relative time (0 to 4) of tar, untar, shasum, sort, find, SQLite, and LevelDB when M3 shares user PEs in different ways: (kernel, app, pager, M3FS), (kernel, app, pager+M3FS), (kernel, app+pager+M3FS), compared to Linux]

- M3 vs. Linux 4.10
- M3 shares user PEs in different ways
- Baseline is 1+3 PEs
Accelerator Sharing

[Figure: 1 to 4 chains of VPEs (input, FFT, MUL, IFFT, output) share the same three accelerators]
Accelerator Sharing

[Figure: relative time (0.98 to 1.02) over 1 to 4 accelerator chains, for time slices of 1 ms, 2 ms, and 4 ms]
Evaluation Summary

- Comparable application performance
- Superior performance for data-intensive applications
- Accelerators can run autonomously, causing almost no CPU load
- Accelerators can be shared with minimal overhead
Future Work

- Scaling to larger systems, pursued by Matthias Hille (runs 512 applications with a parallel efficiency of 75%, using 11% for the OS [1])
- Core-local context switching and IPC
- Other accelerators: FPGAs, GPUs, ...

[1] SemperOS: A Distributed Capability System, USENIX ATC'19
Conclusion

- M3 uses a hardware/operating-system co-design
- The DTU introduces a common interface for all CUs
- This allows integrating all (untrusted) CUs as first-class citizens
- Access to OS services for all CUs
- M3 uses the same concepts for all CUs
- This allows simple management of complex systems
More Information

M3: A Hardware/Operating-System Co-Design to Tame Heterogeneous Manycores
Nils Asmussen, Marcus Völp, Benedikt Nöthen, Hermann Härtig, and Gerhard Fettweis
ASPLOS 2016

M3X: Autonomous Accelerators via Context-Enabled Fast-Path Communication
Nils Asmussen, Michael Roitzsch, and Hermann Härtig
USENIX ATC 2019

SemperOS: A Distributed Capability System
Matthias Hille, Nils Asmussen, Pramod Bhatotia, and Hermann Härtig
USENIX ATC 2019
Backup Slides
Accelerator Sharing (PCIe)

[Figure: relative time (0.98 to 1.08) over 1 to 4 accelerator chains with PCIe-like latency, for time slices of 1 ms, 2 ms, and 4 ms]
DTU Power Consumption

[Figure: average power (mW, 0 to 14) of core, SPM, and DTU for compute times of 0.5K to 10K cycles]
DTU Size

Comparison:
- A single Xtensa core has ~50,000 gates
- A single x86 core (Haswell) has ~100 million gates
Software Complexity
Context Switching Microbenchmark

[Figure: time (µs, 0 to 11) split into wake, context switch, forward, and communication, for M3 variants A/B/C (local, remote-shared, remote-exclusive) and NOVA (local, remote)]
Scalability with Dedicated OS Service PEs

[Figure: parallel efficiency (%) over 0 to 32 applications for tar, untar, shasum & sort, find, SQLite, and LevelDB, with 1, 2, 4, or 8 service PEs]
Scalability with PE Sharing

[Figure: parallel efficiency (%) over 1 to 32 applications for tar, untar, find, SQLite, LevelDB, shasum, and sort]
Stream Processing ASM

[Figure: the ASM's state machine (states include RD, IN, C, OU, WR, W, E) around the accelerator logic, handling input/no input, output/no output, in/out replies, EOF, and ctxsw events; the DTU's send, receive, and memory EPs connect the IN and OUT channels to the SPM]