Microkernel Design - Computer Science and Engineeringcs9242/10/lectures/10-Microkernel... ·...

Post on 09-Sep-2018


Microkernel Design

A walk through selected aspects of kernel design and seL4

These slides are distributed under the Creative Commons Attribution 3.0 License,

unless otherwise noted on individual slides.

You are free:

to Share — to copy, distribute and transmit the work

to Remix — to adapt the work

Under the following conditions:

Attribution — You must attribute the work (but not in any way that suggests that the author

endorses you or your use of the work) as follows:

“Courtesy of Kevin Elphinstone, UNSW”

The complete license text can be found at http://creativecommons.org/licenses/by/3.0/legalcode


© Kevin Elphinstone. Distributed under Creative Commons

Attribution License

Formal Verification - Proof Architecture

[Figure: a proof connects the Specification to the C Code.]

NICTA Copyright 2010. From imagination to impact.

Proof Architecture

[Figure: layers shown are Access Control Spec (confinement), Specification, Design (Haskell prototype) and C Code.]

Verification Strategy

• An OS perspective

– simple is better

– complex system-wide invariants increase difficulty
– concurrency is very difficult to reason about
• must consider every possible interleaving of execution
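To give a feel for the last point: two sequential activities with a and b atomic steps can be interleaved in C(a+b, a) different ways. The sketch below computes that count. The function name `interleavings` is illustrative and not from the slides, and it assumes each step is atomic.

```c
#include <assert.h>
#include <stdint.h>

/* Number of distinct interleavings of two sequential activities with
 * `a` and `b` atomic steps: the binomial coefficient C(a + b, a).
 * Illustrative helper, not part of any kernel. */
uint64_t interleavings(unsigned a, unsigned b)
{
    /* Keeps every intermediate product within 64 bits. */
    assert(a + b <= 60);

    uint64_t result = 1;
    for (unsigned i = 1; i <= a; i++) {
        /* After this step result == C(b + i, i), so the division is exact. */
        result = result * (b + i) / i;
    }
    return result;
}
```

Two activities of only 10 steps each already give `interleavings(10, 10) == 184756` orderings, which is why sequential, event-based code is so much easier to verify.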


Fundamental Kernel Abstractions

• Execution
– support CPU running multiple activities
• Memory

– support (and protect) state associated with

an activity

Execution

• Two execution environments
– kernel level (in-kernel) and user-level (application execution)
• Covered execution models in detail earlier in the course

– Two common approaches

• Event-based

– smaller memory footprint, limited to smaller kernels

• Process-based

– larger memory footprint, programming model scales to

larger kernels, though synchronisation adds complexity

seL4 Kernel Execution?

• For verifiability

– Event-based

• sequential execution from kernel mode entry to exit

– Context switch at kernel exit

• current process/thread control block switch as late as possible
• kernel C code not re-entrant

– Interrupts disabled

• delivered on return to user-level, or

• polled during long running operations
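A minimal sketch of what polling during a long-running operation could look like, assuming the kernel runs with interrupts disabled and that an operation saves its progress so it can be restarted. All names here (`long_op`, `irq_pending`, `run_long_op`) are hypothetical and do not come from the seL4 sources.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical model of a long-running kernel operation. */
struct long_op {
    size_t next;   /* index of the next item to process */
    size_t total;  /* number of items in the operation */
};

/* Stand-in for querying the interrupt controller; a real kernel would
 * read hardware state here. */
bool irq_pending_flag = false;
bool irq_pending(void) { return irq_pending_flag; }

/* Process items sequentially with interrupts disabled. Every
 * `poll_every` items, poll for a pending interrupt; if one is found,
 * stop with progress saved so the operation can be restarted later.
 * Returns true once the operation has completed. */
bool run_long_op(struct long_op *op, size_t poll_every)
{
    assert(poll_every > 0);
    while (op->next < op->total) {
        /* ... work on item op->next would go here ... */
        op->next++;
        if (op->next % poll_every == 0 && op->next < op->total
            && irq_pending()) {
            return false; /* preempted: deliver the interrupt, then restart */
        }
    }
    return true;
}
```

The kernel code stays sequential between poll points, so there are only a few well-defined places where an operation can be interrupted.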


Application Execution

• From kernel perspective, commonly two

models

– single-threaded

• straightforward program execution

• potentially with another execution model layered on top

(e.g. user-level threads)

– multi-threaded

• potentially with another execution model or user-level

involvement

– m-n user-level threads

– scheduler activations

Virtualisation

• Introduces third application (guest OS) execution model – Virtual CPU

• Has close parallels to a thread

• We’ll distinguish them as follows

– Fixed set at “boot time”

• e.g. no create/delete CPUs by guest

– Hardware-like synchronisation

• no blocking synch primitives

– Hardware-like communication

• low-level notification (interrupts), no complex messaging

• handled via interrupt handler

Application Execution

• For verification

– single threaded

• execution still simplest – event-based sequential code

– multithreaded

• problematic due to concurrency

• good to overlap I/O (blocking) with execution, and to

utilise multiprocessors

– virtual CPU

• with interrupts disabled – event-based sequential code

• interrupts enabled: problematic due to the potential number of instruction interleavings
• obviously good for replicating the normal CPU execution model for a guest OS

seL4 Application Execution?

• Multithreaded
– verified applications would be limited to a single thread

• Alternatives

– VCPUs

• verified applications have interrupts disabled

Memory Management

• Page-based virtual memory ubiquitous

• Applications expect a specific memory model

– Text, data, bss, stack

– Memory mapped files

• shared libraries, shared memory

– External pagers of memory objects

• Mach

– External control of mappings

• Virtualisation (hypercalls, shadow page tables)

• L4

Text, data, …

[Figure: virtual address space containing Text, Data, BSS and Stack regions.]

• Implications for kernel
– knowledge of executable format
• limits alternatives – e.g. guest OS, guest application
– at minimum, ability to load application and set up mappings
• also implies allocation of page tables and memory frames
– implies some model for managing memory securely between applications
• also implies bookkeeping for de-allocation, i.e. resource attribution – e.g. processes

Memory Mapped Files/Objects

[Figure: virtual address space containing Text, Data, BSS, Stack, libc and File regions.]

• Implications for kernel
– similar to text, data, …
– additionally
• adds file-like store to name data and retrieve/store data
• adds mechanism for mapping VM region to file

External Pagers

[Figure: address space regions (Text, Data, BSS, Stack, libc, File) backed by a user-level file system server.]

• Page faults propagated to user-level servers
– they supply data for page, kernel still manages memory (frames, page tables, etc.)
• Implications for kernel
– adds complexity of
• VM-region-based fault forwarding
• data provision mechanism
– removes complexity of supplying/storing data from the kernel (not in Mach’s case)

Historical L4 Mapping Model

Address Spaces

• map
• unmap
• grant

© 2002 Kevin Elphinstone

Page Fault Handling

[Figure: the application's page fault is delivered to its pager as a "PF" msg (PF IPC); the pager answers with a map msg (res IPC).]

Address Spaces

[Figure (animation): address spaces built up step by step over Physical Memory: the Initial AS, then Pager 1 and Pager 2, then Pager 3 and Pager 4, then Applications, then a Driver.]

Historical L4 Mapping Model

• Kernel only provides
– relatively simple mechanisms
– physical memory can be directly managed at user-level
– page-tables still managed in kernel
• complexity of some memory management remains
– introduces complexity of tracking mapping relationships

Recursive mapping removed

• Single privileged syscall for Initial AS
• Pagers request mappings from Initial AS
• Removed need to track mapping relationships from kernel

[Figure: Physical Memory, Initial AS, Pager 1, Pager 2 and an Application AS.]

Initial task removed

• Mapping operates on pre-allocated physical memory partitions
• Removes need for user-level to proxy
• Adds partitioning policy in kernel, but not a significant source of complexity
• Page table management still in kernel
– some memory allocation remains

[Figure: Physical Memory partitions, two Pagers and an Application.]

Note parallels with Hypervisors

• Mapping operates on pre-allocated physical memory partitions
– hypercalls
• Page table management still in kernel
– some memory allocation remains
• Page table management becomes quite tricky when directly virtualising page tables without hardware assistance

[Figure: Physical Memory partitions, two Guest OSes and an Application.]

Kernel Design for Isolation and Assurance of Physical Memory

Dhammika Elkaduwe, Philip Derrin, Kevin Elphinstone

Embedded Systems

• Increasing functionality
• Increasing software complexity
– Millions of lines of code
– Mutually untrusted SW vendors
• Consolidate functionality
• Connectivity
– Attacks from outside
• No longer closed systems
– Download SW

IIES08/seL4 1

Embedded Systems

• Diverse applications

– Real-time vs. best effort
• Tight resource budgets
• Mission/life-critical applications
• Sensitive information

Reliability is paramount

Small Kernel Approach

[Figure: system stack with labels Legacy App. (untrusted), Sensitive App. (trusted), Linux Server, Device Driver, Trusted Service, Supervisor OS, Small kernel (e.g. Microkernel) and Hardware.]

• Smaller, more trustworthy foundation
– Hypervisor, microkernel, isolation kernel, …
• Facilitate controlled integration and isolation
– Isolate: fault isolation, diversity
– Integrate: performance
• Microkernel should:
– Provide sufficient API
– Correct realisation of API
– Adhere to isolation/integration requirements of the system

Issue

• Kernel consumes resources
– Machine cycles
– Physical memory (kernel metadata)

Example:
– threads – thread control block
– address space – page tables
– bookkeeping to reclaim memory

[Figure: the same system stack, with TCBs and page tables (PT) held inside the microkernel.]

Possible Approaches

How do we manage kernel metadata?

• Cache-like behaviour [EROS, Cache Kernel, HiStar, …]
– No predictability, limited RT applicability
• Static allocations
– Works for static systems
– Dynamic systems: overcommit or fail under heavy load
• Domain-specific kernel modifications?

Modified ≠ Verified

• L4.Verified project: formally verify the implementation correctness of the kernel
– Properties: isolation, information flow, ...
• Formal refinement
– Formally connect the properties with the kernel implementation
– Modifications invalidate refinement
– Verification is labour intensive
• 10K C lines = 200K proof lines
• Memory management is core functionality

[Figure: property-preserving refinement from the Abstract Model (mathematically proven properties) to C Code and HW.]

Approach in a nutshell

• No implicit allocations within the kernel
– No heap, no slab allocation, etc.
• All abstractions are provided by first-class kernel objects
– Threads – TCB object
– Address space – Page table objects
• All objects are created upon explicit user request

[Figure: supervisory OS, trusted OS server and legacy OS server above the seL4 microkernel; the diagram also shows a kernel heap.]

Memory Management Model

• No implicit allocations within the kernel
• Physical memory is divided into untyped objects
• Authority conferred via capabilities
• Untyped capability is sufficient authority to allocate kernel objects
• All abstractions are provided via first-class kernel objects
• Allocate on explicit user request
• Creator gets the full authority
• Distribute capabilities to allow others to access the service
• Kernel objects
– Untyped
– TCB (Thread Control Blocks)
– Capability tables (CT)
– Comm. ports
– ....
• Objects are managed by user-level

[Figure (animation): supervisory OS, trusted OS server and legacy OS server above the seL4 microkernel, whose memory holds the kernel code and untyped objects 1 to n; successive slides show TCBs and page tables (PT) allocated from the untyped objects.]

Memory Management Model ...

• Delegate authority
– Allow others to obtain services
– Delegate resource management
• Memory management policy is completely in user-space
• Isolation of physical memory = isolation of authority (capabilities)
– Capability dissemination is controlled by a “Take-Grant”-like protection model

Memory Management Model ...

• De-allocation upon explicit user request
– Call revoke on the Untyped capability
– Memory can be reused
• Kernel tracks capability derivations
– Recorded in capability derivation tree (CDT)
• Need bookkeeping
– Doubly-linked list through capabilities
– Space allocated with capability tables
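One way to picture a derivation tree kept as a doubly-linked list through the capabilities is sketched below. It assumes the tree is flattened so that a capability's descendants directly follow it and carry a greater depth. The explicit `depth` field, the struct layout and the function names are illustrative assumptions, not the actual seL4 encoding.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical capability slot. The list links live inside the slot,
 * so the bookkeeping space comes with the capability table itself. */
struct cap_slot {
    struct cap_slot *prev, *next;
    unsigned depth;   /* distance from the root of the derivation tree */
    int valid;        /* 0 once the capability has been deleted */
};

/* Record `child` as derived from `parent` by linking it directly
 * after the parent. */
void cdt_derive(struct cap_slot *parent, struct cap_slot *child)
{
    assert(parent != NULL && child != NULL);
    child->depth = parent->depth + 1;
    child->valid = 1;
    child->prev = parent;
    child->next = parent->next;
    if (parent->next != NULL) {
        parent->next->prev = child;
    }
    parent->next = child;
}

/* Revoke: delete every descendant of `cap`, leaving `cap` itself.
 * Returns the number of capabilities deleted. */
unsigned cdt_revoke(struct cap_slot *cap)
{
    unsigned deleted = 0;
    while (cap->next != NULL && cap->next->depth > cap->depth) {
        struct cap_slot *victim = cap->next;
        cap->next = victim->next;
        if (victim->next != NULL) {
            victim->next->prev = cap;
        }
        victim->prev = victim->next = NULL;
        victim->valid = 0;
        deleted++;
    }
    return deleted;
}
```

Revoking an untyped capability in this model removes every capability derived from it, which is the point at which its memory has no remaining references and can be reused.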

[Figure: seL4 microkernel memory (kernel code, TCBs, untyped objects 2 to n) with a CDT linking untyped cap 1 to the TCB caps and a TCB copy.]

Capability Derivation Tree

• For allocation:
– The untyped capability should not have any CDT children
• Guarantees that there are no previously allocated objects
– Size of the object(s) must be smaller than or equal to the untyped object
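The two allocation checks can be sketched as follows. This is a simplified model under stated assumptions: the `untyped` struct, its child counter and the function names are hypothetical stand-ins for the real capability and CDT machinery.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical model of an untyped memory object. */
struct untyped {
    size_t size_bytes;     /* size of the memory region */
    unsigned cdt_children; /* objects currently derived from it */
};

/* Retype the region into `count` objects of `obj_size` bytes each.
 * Succeeds only if nothing is currently allocated from the region and
 * the requested objects fit inside it. No other memory is used. */
bool untyped_retype(struct untyped *ut, size_t obj_size, unsigned count)
{
    assert(ut != NULL && obj_size > 0);
    if (ut->cdt_children != 0) {
        return false; /* previously allocated objects still exist */
    }
    if (count == 0 || count > ut->size_bytes / obj_size) {
        return false; /* objects would not fit in the region */
    }
    ut->cdt_children = count;
    return true;
}

/* Revoking the untyped capability destroys all derived objects, after
 * which the memory can be reused. */
void untyped_revoke(struct untyped *ut)
{
    assert(ut != NULL);
    ut->cdt_children = 0;
}
```

Because every allocation goes through a check like this, whoever holds the untyped capability bounds the kernel memory that can be consumed on its behalf.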

[Figure: CDT showing untyped cap 1, the TCB caps derived from it, and a TCB copy.]

Evaluation

• Formal properties:
– Formalised the protection model in Isabelle/HOL
– Machine-checked, abstract model of the kernel
– Formal, machine-checked proof that mechanisms are sufficient for enforcing spatial partitioning
– Proof also identifies the invariants the “supervisory OS” needs to enforce for isolation to hold
• Cannot share modifiable page/capability tables
• Cannot share thread control blocks
• Cannot have communication channels that allow capability propagation

Evaluation ...

• Performance
– Used paravirtualised Linux as an example
– Compared with L4/Wombat (Linux) for running LMBench

[Figure: Linux on the supervisory OS and seL4 microkernel with drivers, compared with Linux (Wombat) on Iguana and the L4 microkernel with drivers.]

Benchmark      L4 (s)   seL4 (s)   Gain (%)
fork           4570     3083       32.5
exec           5022     3440       31.5
shell          29729    19999      32.7
page faults    34       18.7       45.4
Null Syscall   3.4      2.9        11
ctx            10.7     9.3        7.6

Annotation on slide: Proxy via Iguana

Conclusion

• No implicit allocations within the kernel

– Users explicitly allocate kernel objects

– No heap, slab .. (no hidden bookkeeping)

– Authority confinement guarantees control of kernel memory

• All kernel memory management policy is outside the kernel

– Different isolation/integration configurations

– Support diverse, co-existing policies

– No modification to the kernel (remains verified)

• Hard guarantees on kernel memory consumption

– Facilitate formal reasoning of physical memory consumption

• Improve performance by controlled delegation

– Similar performance in other cases

Virtual Memory & seL4

• Implemented using 3 objects*

– Frames: An object corresponding to physical memory

– Page directory: An object corresponding to level 1 page

table of a two-level page table.

– Page table: An object corresponding to level 2 page table of

a two-level page table

– created from untyped memory (as directed by user-level)


* currently actually 4 – expect ASIDs will be removed

Virtual Memory & seL4

• Broadly similar model to previous L4 kernels

• VM faults are propagated as IPC

– Introduce new page fault type – missing page table

• To install a mapping, one needs:

– A cap to a page directory

– page table to be installed in page directory
• install requires cap to both PD and PT

– A cap to a frame of physical memory

• Thus, model allows creation of domain specific VM model

– using only authorised memory

• Revocation handled via CDT
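The mapping steps can be sketched with a toy two-level page table. This assumes 32-bit virtual addresses, 4 KiB pages and 1024-entry tables, and plain pointers stand in for capabilities. The type names and result codes are hypothetical.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define ENTRIES 1024u  /* assumes 32-bit addresses and 4 KiB pages */

struct frame { uint8_t data[4096]; };
struct page_table { struct frame *pte[ENTRIES]; };
struct page_directory { struct page_table *pde[ENTRIES]; };

enum map_result { MAP_OK, MAP_MISSING_PT, MAP_ALREADY_MAPPED };

/* Install a page table in the directory slot that covers vaddr.
 * Needs authority over both the PD and the PT. */
enum map_result install_pt(struct page_directory *pd,
                           struct page_table *pt, uint32_t vaddr)
{
    assert(pd != NULL && pt != NULL);
    uint32_t pd_index = vaddr >> 22;
    if (pd->pde[pd_index] != NULL) {
        return MAP_ALREADY_MAPPED;
    }
    pd->pde[pd_index] = pt;
    return MAP_OK;
}

/* Map a frame at vaddr. Fails with MAP_MISSING_PT if no page table has
 * been installed for that region; the kernel never allocates one
 * implicitly. */
enum map_result map_frame(struct page_directory *pd,
                          struct frame *f, uint32_t vaddr)
{
    assert(pd != NULL && f != NULL);
    struct page_table *pt = pd->pde[vaddr >> 22];
    uint32_t pt_index = (vaddr >> 12) & (ENTRIES - 1);
    if (pt == NULL) {
        return MAP_MISSING_PT;
    }
    if (pt->pte[pt_index] != NULL) {
        return MAP_ALREADY_MAPPED;
    }
    pt->pte[pt_index] = f;
    return MAP_OK;
}
```

A user-level pager that sees `MAP_MISSING_PT` would create a page table from its own untyped memory, install it, and retry the mapping.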


Verification Perspective

• Complexity of memory management policy, and VM

model pushed outside the kernel

– simple VM model implemented at user-level should also be

verifiable

– unverified complex models also supported

• e.g. para-virtualised guest OS’s

• CDT an additional complexity

– needed for revocation of caps anyway

– guarantees integrity (used to determine when memory has

no references)


Quick Summary

• Basic abstractions

– Execution

– Memory

• Many alternative models

– seL4 uses subset that:

• is amenable to verification in-kernel

• should be amenable to verification at user-level

Inter-process Communication

• Enables system construction

– alternative is a monolithic server

• Processes cooperate to provide services

• Enables extensibility of the system


IPC Semantics

• Blocking versus Non-blocking

• Buffered versus Unbuffered

• Fixed versus Variable-size

• Direct versus Indirect

Blocking versus Non-blocking

Blocking (termed synchronous)
• Send
– returns control only after message is sent
• Receive
– returns control only after message is received

Non-blocking (termed asynchronous)
• Send
– message always immediately copied or queued, and send returns
• Receive
– polls for new message

Issues:
• Needs buffering
– buffering bounded
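A small sketch of the non-blocking case and its buffering issue: send copies the message into a bounded queue and returns at once, and receive polls. The queue size, the `int` message type and the function names are illustrative assumptions.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define QUEUE_SLOTS 4u  /* illustrative bound on buffered messages */

struct msg_queue {
    int slots[QUEUE_SLOTS];
    size_t head;   /* index of the oldest message */
    size_t count;  /* number of buffered messages */
};

/* Non-blocking send: copy the message into the buffer and return
 * immediately. Returns false if the bounded buffer is full. */
bool nb_send(struct msg_queue *q, int msg)
{
    assert(q != NULL);
    if (q->count == QUEUE_SLOTS) {
        return false;
    }
    q->slots[(q->head + q->count) % QUEUE_SLOTS] = msg;
    q->count++;
    return true;
}

/* Non-blocking receive: poll for a message. Returns false if there is
 * none. */
bool nb_poll(struct msg_queue *q, int *msg_out)
{
    assert(q != NULL && msg_out != NULL);
    if (q->count == 0) {
        return false;
    }
    *msg_out = q->slots[q->head];
    q->head = (q->head + 1) % QUEUE_SLOTS;
    q->count--;
    return true;
}
```

Once the queue is full the sender has to fail, drop the message, or block, so a bounded buffer cannot make send unconditionally non-blocking.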


Buffered versus Unbuffered

Buffered
• Requires at least an extra copy to buffer
• Send may get ahead of receive
– matches differing processing rates
• Buffers are finite
– send eventually becomes blocking
– synchronisation and rendezvous occurs

Unbuffered
• Rendezvous always
• Potential to copy message directly
– performance

Fixed versus Variable Size

• Fixed size simplifies buffering and

marshalling

• Variable size needs receiver to wait on largest size message every time
– not really an issue except for large messages

Direct versus Indirect

Direct
• send(dest, message)
• receive(var, message)

[Figure: Source sends directly to Dest.]

Indirect
• send(mailbox, message)
• receive(var, message)
• Comms paths are first-class objects

[Figure: Source sends to Dest via a Mailbox.]

seL4 IPC model

• 6 system calls

– send, nbsend, call, wait, reply, replywait

• 2 communication objects
– EndPoint, AsyncEndPoint

Kernel Calls are IPC

• IPC specifies a capability as the

destination

• ‘call’-ing a cap invokes the kernel

– identifies the object

• TCB, PD, PT

– specifies the method and arguments of call


Communications Objects

• EndPoint (EP) and AsyncEndPoint (AEP)
– act as a mailbox (indirect comms)

– distinguished caps to EP and AEP have

badges

• a word of bits

• used to determine authority or identity of

sender


EndPoints

• Call

– sends message via EP

– unbuffered (at the moment)

– receiver receives

• message

• unforgeable badge

• a reply cap to sender

– allows caps to propagate in a usable way

– “reply” responds via reply cap

Call, EP, and extensible systems

• Call and EP enable kernel extensibility via user-level servers (Hydra)
• Calling a capability
– invokes a kernel implemented object

• TCB, PD, PT, etc.

– invokes a server implemented object

• Capability propagation is consistent for both kernel-

and user-level implemented objects

– authority confinement of kernel object applies to user-objects

as well


AEP

• Used for signalling – “nbsend”

• Badge is “or”-ed with word in AEP object

– can never block

• Receiving

– receives state of AEP word

– zeros word (atomically)

• Depending on encoding of badges,

notification of 32 source events

– used in conjunction with shared memory.


IPC Importance

General IPC Algorithm

• Validate parameters
• Locate target thread
– if unavailable, deal with it
• Transfer message
– short: data only
– long: outlined or cap transfer
• Schedule target thread
– switch address space as necessary
• Wait for IPC

IPC - Implementation

Short IPC

Short IPC (uniprocessor)

• system-call preamble (disable intr)
• identify dest thread or endpoint and check
– basically cap lookup
– ready-to-receive?
• analyze msg and transfer
– short: no action required
• switch to dest thread & address space
• system-call postamble

The critical path

[Figure (animation): thread states before and after the IPC for each variant.]
• “call”: caller goes from running to wait to receive; destination goes from wait to receive to running
• “send” (eagerly or lazily): sender stays running; destination goes from wait to receive to running
• Not a common operation if send is ‘signal’

[Figure (animation): x86 register contexts (general-purpose registers EBX, ESI, EDI, EBP, EAX, ECX, EDX and the saved ES, FS, GS, ESP, EFLAGS, EIP, CS, SS, DS) during an IPC. Note the “payload” from the green thread.]

Implementation Goal

• Most frequent kernel op: short IPC
– thousands of invocations per second
• Performance is critical:
– structure IPC for speed
– structure entire kernel to support fast IPC
• What affects performance?
– cache line misses
– TLB misses
– memory references
– pipe stalls and flushes
– instruction scheduling

Fast Path

• Optimize for common cases
– write in assembler
– non-critical paths written in C/C++
• but still as fast as possible
• Avoid high-level language overhead:
– function call state preservation
– poor code “optimizations”
• We want every cycle possible!

IPC Attributes for Fast Path

• short message
• single runnable thread after IPC
– must be valid IPC call
– switch threads, originator blocks
– send phase:
• the target is waiting
– receive phase:
• the sender is not ready to couple, causing us to block

Avoid Memory References!!!

• Memory references are slow
• Microkernel should minimize indirect costs
– cache pollution
– TLB pollution
– memory bus

Optimized Memory

[Figure: TCB layout with the stack and the TCB state (thread ID, cpu ID, UTCB, thread state) grouped by cache lines and covered by a single TLB entry.]

Also: hard-wire TLB entries for kernel code and data.

Branch Elimination

    slow = ~receiver->thread_state + (timeouts & 0xffff) + sender->resources + receiver->resources;
    if (slow) enter_slow_path();

Common case: receiver->thread_state is -1 (so its complement is 0) and each remaining term is 0.

• Reduces branch prediction footprint.
• Avoids mispredicts & stalls & flushes.
• Increases latency for slow path

TCB Resources

[Figure: resources bitfield in the TCB, with set bits for the debug registers and the copy area.]

• One bit per resource
• Fast path checks entire word
– if not 0, jump to resource handlers

Message Transfer

[Figure: message transfer on an IBM PowerPC 750 (500 MHz, 32 registers): up to 10 physical registers, then a virtual register copy loop.]

Many cycles wasted on pipe flushes for privileged instructions.

Slow Path vs. Fast Path

[Figure: L4Ka::Pistachio IPC performance on a Pentium 3. x-axis: number of message registers (0 to 60); y-axis: cycles (0 to 600). Series: Inter C-Path, Inter FastPath.]

Inter vs. Intra Address Space

[Figure: L4Ka::Pistachio IPC performance on a Pentium 3. x-axis: number of message registers (0 to 60); y-axis: cycles (0 to 600). Series: Intra FastPath, Inter FastPath.]

IPC - Implementation

Long IPC

Long IPC (uniprocessor)

• system-call pre (disable intr)
• identify dest and check
  – ready-to-receive?
• analyze msg and transfer
  – long/map:
    • lock both partners
    • enable intr
    • transfer message
    • disable intr
    • unlock both partners
• switch to dest thread & address space
• system-call post

During the message transfer:
• Preemptions possible! (end of timeslice, device interrupt…)
• Pagefaults possible! (in source and dest address space)

Long IPC (uniprocessor)

• system-call pre (disable intr)
• identify dest thread and check
  – same chief
  – ready-to-receive?
• analyze msg and transfer
  – long/map:
    • lock both partners
    • enable intr
    • transfer message
    • disable intr
    • unlock both partners
• switch to dest thread & address space
• system-call post

[Figure: thread states. Before locking: sender running, receiver wait (to receive). While locked: sender locked running, receiver locked wait.]
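The lock / enable / transfer / disable / unlock sequence can be simulated in user-level C as below. Everything here is a stand-in: `interrupts_enabled` models the CPU interrupt flag, `memcpy` models the message transfer, and the state names follow the figure labels. The states restored after unlocking are an assumption, since the thread switch that follows is not modelled.

```c
#include <assert.h>
#include <string.h>

typedef enum { RUNNING, WAIT, LOCKED_RUNNING, LOCKED_WAIT } state_t;

typedef struct { state_t state; } thread_t;

int interrupts_enabled = 1;  /* stand-in for the CPU interrupt flag */

/* Long-message transfer step; the caller has already disabled interrupts
 * (system-call pre) and checked that the receiver is ready. */
void long_ipc_transfer(thread_t *sender, thread_t *receiver,
                       void *dst, const void *src, size_t len)
{
    assert(!interrupts_enabled);
    assert(sender->state == RUNNING && receiver->state == WAIT);

    /* lock both partners so neither can start another IPC meanwhile */
    sender->state = LOCKED_RUNNING;
    receiver->state = LOCKED_WAIT;

    interrupts_enabled = 1;   /* enable intr: preemption and page faults are tolerated */
    memcpy(dst, src, len);    /* transfer message */
    interrupts_enabled = 0;   /* disable intr */

    /* unlock both partners (assumed to return to their previous states) */
    sender->state = RUNNING;
    receiver->state = WAIT;
}
```

The locked states are what make it safe to re-enable interrupts: a third thread that runs during the copy can see that both partners are busy and cannot pair with either of them.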

IPC - mem copy

• Why is it needed? Why not share?
  – Security
    • Need own copy
  – Granularity
    • Object smaller than a page or not aligned

copy in - copy out

• copy into kernel buffer
• switch spaces
• copy out of kernel buffer
• costs for n words
  – 2×2n r/w operations
  – 3×n/8 cache lines
  – 1×n/8 overhead cache misses (small n)
  – 4×n/8 cache misses (large n)
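The three steps can be sketched in plain C. This is a user-level illustration, not kernel code: the buffer size and function name are assumptions, and the address-space switch is only marked by a comment.

```c
#include <assert.h>
#include <string.h>

#define KBUF_WORDS 64  /* assumed size of the kernel buffer, in words */

unsigned long kernel_buffer[KBUF_WORDS];

/* Copies n words via the kernel buffer.
 * Returns the number of word reads and writes performed (2 x 2n),
 * or -1 if the message does not fit the buffer. */
long copy_in_copy_out(unsigned long *dst, const unsigned long *src, size_t n)
{
    assert(dst != NULL && src != NULL);
    if (n > KBUF_WORDS)
        return -1;

    memcpy(kernel_buffer, src, n * sizeof *src);  /* copy in: n reads + n writes */
    /* ... switch to the destination address space here ... */
    memcpy(dst, kernel_buffer, n * sizeof *dst);  /* copy out: n reads + n writes */

    return (long)(2 * 2 * n);
}
```

Every word is read twice and written twice, which is where the 2×2n figure comes from; the kernel buffer is also the third set of cache lines touched.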

temporary mapping

• select dest area (4+4 M)
• map into source AS (kernel)
• copy data
• switch to dest space

temporary mapping

• problems
  – multiple threads per AS
  – mappings might change while message is copied
• How long to keep PTE?
• What about TLB?

temporary mapping

• when leaving curr thread during ipc:
  – invalidate PTE
  – flush TLB

temporary mapping

• when returning to thread during ipc:
  – Reestablishing the temp mapping requires storing the partner id and dest area address in the sender’s TCB.
  – Note: receiver’s page mappings might have changed!

Cost estimates

                                Copy in - copy out   Temporary mapping
R/W operations                  2 × 2n               2n
Cache lines                     3 × n/8              2 × n/8
Small n overhead cache misses   n/8                  0
Large n cache misses            5 × n/8              3 × n/8
Overhead TLB misses             0                    n / words per page
Startup instructions            0                    50
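To make the estimates concrete, the table rows can be evaluated for a given n. The sketch below assumes 8 words per cache line (from the n/8 terms) and 1024 words per page (4 KiB pages with 4-byte words, an assumption); the struct and function names are illustrative only.

```c
#include <assert.h>

#define WORDS_PER_LINE 8     /* from the n/8 terms in the table */
#define WORDS_PER_PAGE 1024  /* assumed: 4 KiB pages, 4-byte words */

typedef struct {
    unsigned long rw_ops;
    unsigned long cache_lines;
    unsigned long large_n_cache_misses;
    unsigned long overhead_tlb_misses;
    unsigned long startup_instructions;
} ipc_cost_t;

/* Copy in - copy out column of the table. */
ipc_cost_t cost_copy_in_out(unsigned long n)
{
    ipc_cost_t c;
    assert(n % WORDS_PER_LINE == 0);  /* keep the integer divisions exact */
    c.rw_ops = 2 * 2 * n;
    c.cache_lines = 3 * (n / WORDS_PER_LINE);
    c.large_n_cache_misses = 5 * (n / WORDS_PER_LINE);
    c.overhead_tlb_misses = 0;
    c.startup_instructions = 0;
    return c;
}

/* Temporary mapping column of the table. */
ipc_cost_t cost_temp_mapping(unsigned long n)
{
    ipc_cost_t c;
    assert(n % WORDS_PER_LINE == 0);
    c.rw_ops = 2 * n;
    c.cache_lines = 2 * (n / WORDS_PER_LINE);
    c.large_n_cache_misses = 3 * (n / WORDS_PER_LINE);
    c.overhead_tlb_misses = n / WORDS_PER_PAGE;
    c.startup_instructions = 50;
    return c;
}
```

For n = 1024 words this gives 4096 versus 2048 r/w operations and 640 versus 384 large-n cache misses, at the price of one overhead TLB miss and 50 startup instructions for the temporary mapping.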

486 IPC costs

• Mach: copy in/out
• L4: temp mapping

[Figure: IPC cost in µs (0 to 400) against msg len (0 to 6000). Series: Mach, L4 + cache flush, L4, raw copy.]

Summary

• Small messages
  – buffering costs a little
  – mapping more so
  – ideally, direct copy between two pinned “message areas”
    • needs to be synchronous
• Large messages
  – mapping is more efficient
    • especially with outlined messages
    • startup costs high (cost of setup amortised)
    • implementation complexity high
• Shared memory and notification
  – similar to buffering in terms of performance
    • copy-in copy-out if mutually distrusting
  – implementation complexity out of kernel

seL4

• EndPoint
  – unbuffered, synchronous, small message to pre-allocated pinned buffer
  – used for “call”
• AsyncEndPoint
  – “or”-ed notification
  – used for notification (shared memory buffers)
• Expect long copied messages to be
  – avoided if possible
  – handled via shared memory
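The “or”-ed notification can be pictured as a single word of pending bits. The sketch below is a user-level model with hypothetical names; it is not the seL4 API, only an illustration of why repeated signals merge rather than queue.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical model of an asynchronous endpoint: one word of pending bits. */
typedef struct {
    uint32_t pending;
} async_endpoint_t;

/* Signal: OR the sender's bits into the word. Never blocks, never buffers. */
void async_signal(async_endpoint_t *ep, uint32_t bits)
{
    assert(ep != NULL);
    ep->pending |= bits;
}

/* Poll: return all bits accumulated so far and clear the word. */
uint32_t async_poll(async_endpoint_t *ep)
{
    uint32_t bits;
    assert(ep != NULL);
    bits = ep->pending;
    ep->pending = 0;
    return bits;
}
```

Signalling the same bit twice before the receiver polls is seen once, so the receiver learns which sources need attention and then reads the actual data from the shared memory buffers.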

FPU Context Switching

• Strict switching
    Thread switch:
        Store current thread’s FPU state
        Load new thread’s FPU state
  – Extremely expensive
    • IA-32’s full SSE2 state is 512 Bytes
    • IA-64’s floating point state is ~1.5KB
  – May not even be required
    • Threads do not always use FPU

Lazy FPU switching

• Lock FPU on thread switch
• Unlock at first use – exception handled by kernel
    Unlock FPU
    If fpu_owner != current
        Save current state to fpu_owner
        Load new state from current
        fpu_owner := current

[Figure: kernel with current and fpu_owner pointers, FPU marked locked; example thread code: finit, fld, fcos, fst; finit, fld; pacman()]
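The lazy scheme can be simulated in user-level C as below. The global `fpu` stands in for the hardware registers, `fpu_locked` for the hardware "FPU disabled" bit, and `fpu_use` for an FPU instruction that traps when the FPU is locked; all names other than `current` and `fpu_owner` are assumptions made for this sketch.

```c
#include <assert.h>
#include <stddef.h>

typedef struct { double regs[8]; } fpu_state_t;  /* stand-in for real FPU state */

typedef struct { fpu_state_t fpu; } thread_t;    /* saved FPU state per thread */

fpu_state_t fpu;              /* models the hardware FPU registers */
int         fpu_locked = 1;   /* models the "FPU disabled" bit */
thread_t   *current = NULL;
thread_t   *fpu_owner = NULL;
int         fpu_switches = 0; /* counts real save/restore operations */

/* Thread switch: no FPU state is moved, the FPU is only locked. */
void thread_switch(thread_t *next)
{
    assert(next != NULL);
    current = next;
    fpu_locked = 1;
}

/* Exception handler run at the first FPU use after a switch. */
void fpu_unavailable_trap(void)
{
    fpu_locked = 0;                      /* unlock FPU */
    if (fpu_owner != current) {
        if (fpu_owner != NULL)
            fpu_owner->fpu = fpu;        /* save current state to fpu_owner */
        fpu = current->fpu;              /* load new state from current */
        fpu_owner = current;
        fpu_switches++;
    }
}

/* Models an FPU instruction executed by the current thread. */
void fpu_use(double value)
{
    if (fpu_locked)
        fpu_unavailable_trap();
    fpu.regs[0] = value;
}
```

If thread A uses the FPU, a non-FPU thread runs, and A runs again, the state is loaded once and never saved: the intervening thread costs only a lock and a trap, not 512 bytes of copying in each direction.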