Microkernel Construction
Case Study: M3

Nils Asmussen
July 4th, 2019
Heterogeneous Systems
Why?

- memcached: an FPGA-based implementation is 16 times better in performance per watt than an Atom CPU [1]
- machine learning: a custom accelerator is 20% faster than a GPU and requires 128 times less energy [2]

[1] Thin Servers with Smart Pipes: Designing SoC Accelerators for Memcached, ISCA'13
[2] PuDianNao: A Polyvalent Machine Learning Accelerator, ASPLOS'15
Future Platforms: Problems for Operating Systems

[Figure: a platform combining general-purpose cores (ARM, x86) with accelerators (FFT, DSP, GPU, TPU); a kernel instance runs on each general-purpose core]
Related Work

- Isolation of components
  - DPU, NoC-MPU
  - IOMMUs
- First-class handling of one specific accelerator
  - GPUfs, GPUnet, PTask
  - ReconOS, BORPH
- OSes for heterogeneous systems
  - Barrelfish
  - Popcorn Linux, K2
  - Helios
What If We Could Change Hardware?

Can we design a system that integrates all types of untrusted compute units as first-class citizens?
Goals for First-class Citizens

- Prevent harm by untrusted compute units (CUs)
- Access to operating-system services for all CUs
- Direct communication between all CUs
- Context-switching support for all CUs
Outline
1 Overall System Architecture
2 Prototype Platforms
3 Isolation and Communication
4 Capabilities
5 OS Services and Accelerators
6 Virtual Memory
7 Context Switching
8 Evaluation
1 Overall System Architecture
Hardware/Operating System Co-Design

[Figure: every PE (processing element) pairs a CU (compute unit: ARM, x86, FFT, DSP, FPGA, TPU) with a DTU (data transfer unit); the kernel runs on a dedicated kernel PE, the apps run on user PEs]

Key ideas:
- Minimize changes to existing components
- Add a uniform interface
- Kernel controls user PEs remotely
- Direct communication
2 Prototype Platforms
Tomahawk

[Figure: Tomahawk SoC; PEs with Xtensa LX4 core, instruction SPM, data SPM, and DTU, connected by a router-based NoC to DRAM via a memory controller]

PEs have no OS support:
- No privileged mode
- No MMU, no caches, but SPM (scratchpad memory)
- T2: simple DTU; T4: most features
Linux

- M3 runs on Linux, using it like a virtual machine
- A process simulates a PE, with two threads (CPU + DTU)
- DTUs communicate over UNIX domain sockets
- Not accurate, because
  - programs are executed directly on the host
  - data transfers have huge overhead compared to hardware
- Very useful for debugging and early prototyping
gem5

- Modular platform for computer architecture research
- Supports various ISAs (x86, ARM, Alpha, RISC-V, ...)
- Provides detailed CPU and memory models
- Cycle-accurate simulation
- Added a DTU model to gem5
- Added hardware accelerators
gem5 – Example Configuration

[Figure: example configuration with x86 PEs (DTU plus L1/L2 caches or SPM), accelerator PEs (DTU plus SPM or L1/IO caches), and DRAM]
3 Isolation and Communication
Isolation

[Figure: one kernel PE and several user PEs, each with CU and DTU; the apps run on the user PEs]

DTU-based isolation:
- Additional protection layer
- Only the kernel PE can establish communication channels
- User PEs can only use established channels
Communication

[Figure: DTU endpoints in use: a memory endpoint (M) accessing DRAM, a send endpoint (S) targeting a receive endpoint (R) on another PE]

The DTU provides endpoints to:
- Access memory (contiguous range, byte-granular)
- Receive messages into a receive buffer
- Send messages to a receiving endpoint
- Send replies for RPC
OS Design

- M3: microkernel-based system for heterogeneous manycores (or L4 ± 1)
- Implemented from scratch
- Drivers, file systems, etc. are implemented on user PEs
- Kernel manages permissions, using capabilities
- DTU enforces permissions (communication, memory access)
- Kernel is independent of other PEs

[Figure: kernel PE plus user PEs running M3FS, pipes, and apps]
M3 System Call

[Figure: the app's send endpoint (S) targets the kernel's receive endpoint (R); the system call travels as a message to the kernel and the result comes back as a reply]
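A system call is thus an ordinary RPC over the DTU: the app sends a message through its send gate to the kernel's receive gate and blocks on the reply. Below is a minimal sketch in libm3 style; the includes, the send_receive_vmsg usage, and how the syscall gate is obtained are assumptions for illustration, not verbatim M3 code.

// Sketch of a system call as DTU message passing (libm3-style,
// signatures assumed). The kernel set up `syscall_gate` at VPE
// creation; the app itself cannot create such a channel.
#include <m3/com/SendGate.h>
#include <m3/com/GateStream.h>

using namespace m3;

int do_syscall(SendGate &syscall_gate, int opcode) {
    // Send the request to the kernel's receive EP and block until
    // the kernel replies through the reply EP.
    GateIStream reply = send_receive_vmsg(syscall_gate, opcode);
    int res;
    reply >> res;   // the kernel marshals the result into the reply
    return res;
}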
4 Capabilities
Overview

[Figure: per-VPE capability spaces maintained by the kernel; selectors (0, 1, 2, ...) in VPE 1 and VPE 2 map to kernel objects such as VPE, SGate, and RGate capabilities]
Capabilities

M3 has the following capabilities:
- Send: send messages to a receive EP
- Receive: receive messages from send EPs
- Memory: access remote memory via DTU
- Mapping: access remote memory via load/store
- Service: create sessions
- Session: exchange caps with a service
- Endpoint: configure EPs of own or foreign DTU
- VPE: use a PE
Capability Exchange

The kernel provides syscalls to create, exchange, and revoke caps. There are two ways to exchange caps:
1 Directly with another VPE (typically a child VPE)
2 Over a session with a service

The kernel offers two operations:
1 Delegate: send a capability to somebody else
2 Obtain: receive a capability from somebody else

Difference to L4:
  - Applications communicate directly, without involving the kernel
  - Capability exchange therefore cannot be done during IPC
  - Special communication channel between kernel and servers
  - Kernel uses this channel to send exchange requests to the server
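For illustration, both exchange operations might look as follows on the client side, in the style of the VPE example later in the talk. Session, obtain, and MemGate::bind are assumptions about the library interface; only delegate and CapRngDesc appear verbatim on the VPE slide.

// Sketch of the two exchange operations (libm3-style, signatures assumed).
#include <m3/session/Session.h>
#include <m3/com/MemGate.h>
#include <m3/VPE.h>

using namespace m3;

void exchange(VPE &child) {
    // Obtain: receive a capability from a service over a session,
    // e.g. a memory capability backing some resource.
    Session sess("myservice");                  // hypothetical service name
    CapRngDesc caps = sess.obtain(1);           // assumed signature
    MemGate mem = MemGate::bind(caps.start());  // assumed: bind the new cap

    // Delegate: pass the capability on to a child VPE (as on the
    // VPE slide); the kernel records the new entry in its cap space.
    child.delegate(CapRngDesc(mem.sel(), 1));
}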
Communication

The DTU adds configuration of endpoints to establish a channel.

[Figure: the kernel (PE0) configures the DTUs of sender (VPE2 on PE2, Send cap → SendGate) and receiver (VPE1 on PE1, Recv cap → RecvGate); the send EP stores target, label, and credits, the receive EP stores the receive buffer with occupied/unread state, and messages consist of header and data]
Virtual PEs

- The M3 kernel manages user PEs in terms of VPEs
- A VPE is the combination of a process and a thread
- VPE creation yields a VPE capability and a memory capability
- The library provides primitives like fork and exec
- VPEs are used for all PEs:
  - Accelerators are not handled differently by the kernel
  - All VPEs can perform system calls
  - All VPEs can have time slices and priorities
  - ...
VPEs – Examples

Executing ELF binaries:

VPE vpe("test");
char *args[] = {"/bin/hello", "foo", "bar"};
vpe.exec(3, args);                       // load and run the binary on the VPE's PE

Asynchronous lambdas:

VPE vpe("test");
MemGate mem = MemGate::create_global(0x1000, RW);
vpe.delegate(CapRngDesc(mem.sel(), 1));  // give the VPE access to the memory
vpe.run_async([&mem]() {
    mem.read(buf, sizeof(buf));          // executed on the other PE
});
5 OS Services and Accelerators
OS Service Access for all CUs

sh$ decode in.png | fft | mul | ifft > out.raw

[Figure: the shell starts a user program (decode) reading an input file, chained via pipes to hardware accelerators for image processing (fft, mul, ifft), with the output redirected to a file]

Challenges:
- OS must provide generic protocols
- Accelerators need support for these protocols
Generic Protocols

[Figure: client and server with DTUs; the client sends req(in/out) via its send EP (S) to the server's receive EP (R), gets resp(pos,len), and accesses the data in DRAM through a memory EP (M)]

File protocol:
- Data resides in memory
- RPC between client and server:
  - req(in/out) requests the next piece and implicitly commits the previous piece
  - commit(nbytes) commits nbytes of the previous piece
- Server configures the client's memory EP
- Client accesses the data via its DTU
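To make the request/commit sequence concrete, here is a tiny self-contained model of a client reading a file. FileSession and MemEP are made-up stand-ins for the session RPCs and the DTU memory endpoint, so the sketch runs anywhere; it is not the libm3 API.

#include <cstdio>
#include <cstring>
#include <cstddef>

// MemEP models a DTU memory endpoint; the server decides which region
// it points to. Here the "region" is just a pointer into a local buffer.
struct MemEP {
    const char *base;
    void read(void *buf, size_t pos, size_t len) { memcpy(buf, base + pos, len); }
};

// FileSession models the server side of the file protocol: req(in)
// hands out the next piece and implicitly commits the previous one.
struct FileSession {
    MemEP ep;
    size_t size;
    size_t off;
    bool req_in(size_t *pos, size_t *len) {
        const size_t PIECE = 4;          // server hands out 4-byte pieces
        if (off >= size) return false;   // EOF
        *pos = off;
        *len = (size - off < PIECE) ? size - off : PIECE;
        off += *len;
        return true;
    }
};

int main() {
    static const char data[] = "hello, file protocol";
    FileSession file{{data}, sizeof(data), 0};
    char buf[sizeof(data)] = {0};
    size_t total = 0, pos, len;
    while (file.req_in(&pos, &len)) {        // RPC to the server (modeled locally)
        file.ep.read(buf + total, pos, len); // data access via the memory EP
        total += len;
    }
    printf("%s\n", buf);
    return 0;
}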
Implementation: M3FS – Overview

- M3FS organizes a file's data in extents
- M3FS can be used with a memory or a disk backend:
  - With the memory backend, the FS image is a contiguous region in DRAM; clients get access to parts of the image
  - With the disk backend, M3FS uses a buffer cache in DRAM; clients get access to parts of the buffer cache
- Two types of sessions: metadata session and file session
- The metadata session is created first and allows stat, open, ...
- open creates a new file session
- Both sessions can be cloned to give other VPEs access
Implementation: M3FS – File Protocol

- The file session implements the file protocol (plus seeking)
- The file session holds the file position and advances it on read/write
- req(in/out) requests the next extent
- M3FS configures the client's EP for this extent
- Appending reserves new space, invisible to other clients
- commit(nbytes) commits a previous append
Implementation: Pipe – Overview

[Figure: writer and reader exchange data through shared memory; a pipeserv server coordinates them via message passing]
Implementation: Pipe

- Two types of sessions: pipe session and channel session
- The pipe session represents the whole pipe and allows creating channels
- The channel session implements the file protocol
- Channel sessions can be cloned
- The server configures the client's EP just once at the beginning
- req(in/out) requests access to the next data
- commit(nbytes) commits the previous request
File Multiplexing

- The file protocol maps directly to EPs, which are a limited resource
- The number of open files shouldn't be limited (that much)
- libm3 dedicates at most 4 EPs to files and multiplexes them
- Multiplexing requires (see the sketch below):
  1 commit(nbytes) to commit read/written data
  2 revocation of the EP capability (old server)
  3 delegation of the EP capability (new server)
  4 the next read/write will contact the server again
- Fortunately, file multiplexing almost never happens
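The switching itself is a small bookkeeping problem: pick a victim EP, detach its file, attach the new one. The sketch below models it with hypothetical File/EpManager types and an LRU victim choice; the real libm3 policy and interfaces may differ.

#include <array>
#include <cstdint>

// Hypothetical model of libm3's EP multiplexing for files. A File that
// wants to do I/O must own one of the (at most) 4 file EPs.
struct File {
    int ep = -1;                  // EP index, or -1 if currently detached
    void commit() {}              // commit(nbytes) to the server (stubbed)
    void revoke_ep() { ep = -1; } // revoke the EP cap at the old server (stubbed)
    void delegate_ep(int e) { ep = e; } // delegate the EP cap to the new server
};

class EpManager {
    static const int FILE_EPS = 4;
    std::array<File*, FILE_EPS> owner{};    // which file owns each EP
    std::array<uint64_t, FILE_EPS> used{};  // last-use timestamps for LRU
    uint64_t time = 0;

public:
    // Ensure `f` owns an EP before a read/write; steal the LRU EP if needed.
    void attach(File &f) {
        if (f.ep < 0) {
            int victim = 0;
            for (int i = 1; i < FILE_EPS; ++i)
                if (used[i] < used[victim]) victim = i;
            if (owner[victim]) {
                owner[victim]->commit();    // 1. commit read/written data
                owner[victim]->revoke_ep(); // 2. revoke EP cap (old server)
            }
            f.delegate_ep(victim);          // 3. delegate EP cap (new server)
            owner[victim] = &f;             // 4. next read/write re-contacts server
        }
        used[f.ep] = ++time;
    }
};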
Additions to Accelerator

[Figure: an off-the-shelf accelerator (CU with scratchpad memory) is extended with a DTU and an ASM; the ASM owns send and memory EPs for an IN and an OUT channel]

Accelerator Support Module (ASM):
- Interacts with DTU and accelerator
- Implements the file protocol for the input and the output channel
- The ASM assumes that endpoints are set up externally by software
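Functionally, the ASM is a loop over two file-protocol channels: fetch the next input piece into the SPM, run the accelerator, write the result to the output channel. The following C++ model shows that loop; Channel and accelerate are hypothetical (the real ASM is a hardware state machine, shown on a backup slide), and the hardware-facing functions are left as declarations.

#include <cstddef>
#include <cstdint>

// Hypothetical channel interface following the file protocol
// (see the earlier sketch); not the real ASM or libm3 API.
struct Channel {
    bool req_in(size_t *pos, size_t *len);   // next input piece
    void read(void *spm, size_t pos, size_t len);
    bool req_out(size_t *pos, size_t *len);  // space for the next output piece
    void write(const void *spm, size_t pos, size_t len);
    void commit(size_t nbytes);              // commit written bytes
};

// The accelerator computes in-place on a piece in the scratchpad.
void accelerate(uint8_t *spm, size_t len);

// Model of the ASM loop: IN channel -> SPM -> accelerator -> OUT channel.
void asm_loop(Channel &in, Channel &out, uint8_t *spm, size_t spm_size) {
    size_t ipos, ilen;
    while (in.req_in(&ipos, &ilen) && ilen > 0) {  // get the next input piece
        size_t len = ilen < spm_size ? ilen : spm_size;
        in.read(spm, ipos, len);                   // DTU: input -> SPM
        accelerate(spm, len);                      // run the CU
        size_t opos, olen;
        out.req_out(&opos, &olen);                 // reserve output space
        out.write(spm, opos, len);                 // DTU: SPM -> output
        out.commit(len);                           // commit the piece
    }
}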
Demo
Accelerator Chains: Assisted by OS

[Figure: FFT, MUL, and IFFT accelerators, each with SPM and DMA; an OS driver moves the data from the input through each accelerator's SPM to the output]

OS-assisted accelerator chains:
- OS drives copy-in/copy-out of accelerator SPMs
- Only simple DMA needed
- As in traditional systems, high CPU overhead for the OS
Accelerator Chains: Fully Autonomous

[Figure: FFT, MUL, and IFFT accelerators, each with SPM, DTU, and ASM; the shell only configures the endpoints, the ASMs move the data from input to output themselves]

Autonomous accelerator chains:
- Shell configures all endpoints
- The ASMs of the accelerators drive the DTUs to transfer data autonomously
- Fully offloaded, almost no CPU overhead for the OS
Accelerator Chains: Results

[Figure: two plots over 1 to 4 parallel chains, comparing assisted and autonomous operation; left: time (ms, 0 to 20), right: CPU load (0.0 to 1.0)]
Accelerator Chains: Results (PCIe-like Latency)

[Figure: same experiment with PCIe-like latency; time up to 80 ms and CPU load 0.0 to 1.0 over 1 to 4 parallel chains]
6 Virtual Memory
Virtual Memory – Overview

[Figure: three PE types: accelerator with DTU and SPM; accelerator with DTU-provided MMU and cache; x86 core with its own MMU and cache plus a VM helper]

Different PE types:
- No MMU, SPM instead of caches
- MMU and caches provided by the DTU
- Reuse the existing MMU and caches of the CU
Page Fault Handling

[Figure: page-fault handling flow: the faulting app's PF message is sent via the DTU to the pager (VMA) on another PE; the pager issues create_map and further kernel requests to the kernel, which updates the PTEs of the faulting PE (via the DTU for PE-type B, via an IRQ for PE-type C)]
7 Context Switching
Context Switching – Overview

[Figure: the kernel PE runs per-PE switchers; x86/ARM user PEs run a software Ctx helper next to their VPEs, accelerator PEs have the helper in hardware]

- The kernel handles the complex part:
  - Schedules and migrates VPEs
  - Initiates context switches
- A helper on the user PEs implements save/restore:
  - General-purpose PEs: software helper
  - Accelerator PEs: helper implemented in hardware as part of the ASM
Context Switching with Direct Communication

- How to determine whether the recipient is running?
  - The DTU knows the running VPE and the recipient of a communication
  - The DTU reports an error if the recipient is not running
- How to deliver the message if the recipient is not running?
  - The message is forwarded via the kernel
  - The kernel schedules the recipient and delivers the message
- How does the kernel know what VPEs are doing?
  - Activities send idle notifications
  - Only if a compatible VPE is ready
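From the sender's perspective this is a fast path with a kernel fallback. The sketch below models it with hypothetical Dtu and Kernel interfaces; the error names and signatures are made up for illustration.

#include <cstddef>

// Hypothetical interfaces modeling the behavior described above.
enum class Error { NONE, VPE_GONE };

struct Dtu {
    // Direct send; fails with VPE_GONE if the recipient VPE
    // is not currently running on its PE.
    Error send(int ep, const void *msg, size_t len);
};

struct Kernel {
    // Forwarding syscall: the kernel schedules the recipient
    // and delivers the message on the sender's behalf.
    Error forward_msg(int ep, const void *msg, size_t len);
};

// Send with kernel fallback.
Error send_msg(Dtu &dtu, Kernel &kernel, int ep, const void *msg, size_t len) {
    Error e = dtu.send(ep, msg, len);         // fast path: recipient is running
    if (e == Error::VPE_GONE)
        e = kernel.forward_msg(ep, msg, len); // slow path: forward via kernel
    return e;
}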
8 Evaluation
Experimental Setup

- Evaluation platform is gem5
- Each general-purpose PE has an out-of-order x86-64 core @ 3 GHz, 32+32 KiB L1 cache, 256 KiB L2 cache
- Accelerator PEs are clocked at 1 GHz
- DRAM clocked at 1 GHz
- Short-running, but representative benchmarks
Linux Application Workloads

[Figure: execution time (ms, 0 to 10) of tar, untar, shasum, sort, find, SQLite, and LevelDB on M3 vs. Linux, split into app, transfer, and OS time]

- M3 vs. Linux 4.10
- Traced on Linux, replayed on M3
- M3FS vs. Linux tmpfs
- M3: 1+3 cores (kernel, app, pager, M3FS); Linux: 1 core
PE Sharing

[Figure: relative time (0 to 4) of tar, untar, shasum, sort, find, SQLite, and LevelDB when M3 shares user PEs in different ways: (kernel, app, pager, M3FS), (kernel, app, pager+M3FS), (kernel, app+pager+M3FS), compared to Linux]

- M3 vs. Linux 4.10
- M3 shares user PEs in different ways
- Baseline is 1+3 PEs
Accelerator Sharing

[Figure: 1 to 4 chains of VPEs (input, FFT, MUL, IFFT, output) share the same three accelerators]
Accelerator Sharing

[Figure: relative time (0.98 to 1.02) over 1 to 4 accelerator chains, for time slices of 1 ms, 2 ms, and 4 ms]
Evaluation Summary

- Comparable application performance
- Superior performance for data-intensive applications
- Accelerators can run autonomously, causing almost no CPU load
- Accelerators can be shared with minimal overhead
Future Work

- Scaling to larger systems, pursued by Matthias Hille (runs 512 applications with a parallel efficiency of 75%, using 11% for the OS [1])
- Core-local context switching and IPC
- Other accelerators: FPGAs, GPUs, ...

[1] SemperOS: A Distributed Capability System, USENIX ATC'19
Conclusion

- M3 uses a hardware/operating-system co-design
- The DTU introduces a common interface for all CUs
- This allows integrating all (untrusted) CUs as first-class citizens
- Access to OS services for all CUs
- M3 uses the same concepts for all CUs
- This allows simple management of complex systems
More Information

M3: A Hardware/Operating-System Co-Design to Tame Heterogeneous Manycores
Nils Asmussen, Marcus Völp, Benedikt Nöthen, Hermann Härtig, and Gerhard Fettweis
ASPLOS 2016

M3X: Autonomous Accelerators via Context-Enabled Fast-Path Communication
Nils Asmussen, Michael Roitzsch, and Hermann Härtig
USENIX ATC 2019

SemperOS: A Distributed Capability System
Matthias Hille, Nils Asmussen, Pramod Bhatotia, and Hermann Härtig
USENIX ATC 2019
Backup Slides
Accelerator Sharing (PCIe)

[Figure: relative time (0.98 to 1.08) over 1 to 4 accelerator chains with PCIe-like latency, for time slices of 1 ms, 2 ms, and 4 ms]
DTU Power Consumption

[Figure: average power (mW, 0 to 14) of core, SPM, and DTU for compute times of 0.5K to 10K cycles]
DTU Size

Comparison:
- A single Xtensa core has ~50,000 gates
- A single x86 core (Haswell) has ~100 million gates
Software Complexity
Context Switching Microbenchmark

[Figure: time (µs, 0 to 11) split into wake, context switch, forward, and communication, for M3 variants A/B/C (local, remote-shared, remote-exclusive) and NOVA (local, remote)]
Scalability with Dedicated OS Service PEs

[Figure: parallel efficiency (%) over 0 to 32 applications for tar, untar, shasum & sort, find, SQLite, and LevelDB, with 1, 2, 4, or 8 service PEs]
Scalability with PE Sharing

[Figure: parallel efficiency (%) over 1 to 32 applications for tar, untar, find, SQLite, LevelDB, shasum, and sort]
Stream Processing ASM

[Figure: the ASM's state machine (states include RD, IN, C, OU, WR, W, E) around the accelerator logic, handling input/no input, output/no output, in/out replies, EOF, and ctxsw events; the DTU's send, receive, and memory EPs connect the IN and OUT channels to the SPM]