Download - Kernel-Kernel communication in a shared-memory multiprocessor By ELISEU M. CHAVES,* JR., PRAKASH CH. DAS,* THOMAS J. LEBLANC, BRIAN D. MARSH* AND MICHAEL.

Kernel-Kernel communication in a shared-

memory multiprocessor

ByELISEU M. CHAVES,* JR., PRAKASH CH. DAS,* THOMAS J.

LEBLANC, BRIAN D. MARSH* ANDMICHAEL L. SCOTT

• Computer architecture can influence multiprocessor operating system design– How to distribute kernel functionality across processors– Inter-Kernel Communication – Data Structure and Location– How to maintain Data Integrity and provide fast access

Introduction

• Computer architecture can influence multiprocessor operating system design– How to distribute kernel functionality across processors– Inter-Kernel Communication – Data Structure and Location– How to maintain Data Integrity and provide fast access

• Some Examples...

Introduction

• Bus-based shared memory multiprocessors– Kernel code shared for all processors– Typically these have uniform memory access– Use explicit synchronization to control access to data

structures (critical sections, locks, spin-locks, semaphores)

• Advantages– Easy to port an existing uniprocessor kernel and add

explicit synchronization– Decent performance– Simplified load balancing and global resource

management

• Disadvantages– Good for smaller machines but not very scalable– Increased contention leads to performance degradation

Introduction

• Distributed systems & Distributed memory computers– Kernel data distributed among processors– Each processor executes a copy of the kernel– Local data operated on directly, but remote data is accessed

via remote invocation– Uses implicit synchronization to control access to data– Hypercubes & Mesh-connected machines

• Advantages– Works well for machines with no shared memory– Modular design gives increased fault protection– Can use non-preemption for bulk of synchronization

• Disadvantages– Reliance on a communication network to send invocations – Security of data while in the network

Introduction

• Memory can be accessed directly, but at varying cost (NUMA)

• Some with coherent cache between processors– ( ccNUMA )

• Large scale shared memory multiprocessors have properties common to both– Bus-based systems

• Which use Remote access

– Distributed memory multi-computers• Which use Remote invocation

• Kernel Designers must choose between– Remote ACCESS &– Remote INVOCATION

• Tradeoff's between Remote Access/Invocation remain largely unknown (circa 1993)

Large Scale Shared Memory Multiprocessors

• General purpose parallel operating system– Shared memory access to all processors– Full range of kernel services– Reasonable use of all processors– Arrange system using collection of nodes– Possibly with caches?– NUMA architecture– No cache coherence

Target Parameters of this Research

• Remote Access – execution on node i, reading and writing memory on

node j’s memory when necessary

• Remote Invocation– Processor at node i sends a message to node j’s

processor requesting an operation on i’s behalf

• Bulk Data Transfer– Move data required from node j to node i

• Specific request to kernel• Re-Map memory from node j to node i • Data migration

– Cache coherence blurs distinction with remote access

Kernel-Kernel Communication Options

• Interrupt Level– Request executes in the interrupt handler “bottom

half”– Doesn’t execute in the context of a ‘process’

• Can’t sleep• Can’t call schedule• Can’t transfer data to/from user space

– Interrupts enabled during execution to allow top half to continue handling other interrupts

– Mutual exclusion achieved through ‘masking interrupts’• Course granularity can unnecessarily prohibit too many

other invocations

– Prohibits outgoing invocations when incoming invocations are disabled. (to avoid deadlock)

• Limiting for accessing data on two different processors

– Yes, lots of limitations, but very fast!

Types of Remote Invocation [1/2]

• Kernel Process Level– Normal “Top Half” activity– From user space - interrupt handler performs an

asynchronous trap & remote invocation executes– From kernel space – interrupt handler queues the

invocation for execution when kernel becomes idle, or when control returns to user space

– Has process context so requested operation free to block during its execution

– Process holding semaphores or scheduler-based locks can perform process-level RI’s

– Deadlock still possible, but can be avoided• Requires requesting process to block when making RI• Busy waiting would lock out other incoming process-level

RI’s

Types of Remote Invocation [2/2]

• Remote access and interrupt-level RI on the same data structure

Coexistence is possible

• Deadlock-avoidance rule: must always be possible to perform

incoming invocations wile waiting for outgoing invocation

• Direct costs of remote operations– Latency of the operation– (R-1)*n < C

• R: remote to local access time ratio• n: number of memory accesses• C: overhead of remote invocation

– If C is greater, implement Operation using Remote Access

– Notes:• In general, C is fixed so use RI for longer operations• Parameter passing not included in calculation

Trade-Offs

• Indirect costs for local operations – RI may need context of invoker as parameter– Operations arranged to increase node locality??

• Access between invoker/target processor not interleaved??

– Process level RI for all remote accesses may allow data structure implementation without explicit synchronization

• Uses lack of pre-emption within kernel to provide implicit synchronization

– Avoiding Explicit Synchronization • Can improve speed of remote and local operations because

of the high cost of explicit synchronization

– Remote Memory Access may require large portions of kernel data space on other processors must be mapped into each instance of the kernel

• May not scale well to large machines • Mapping on demand not a good option

Trade-Offs

• Competition for processor and memory cycles– When do operations serialize

• Remote Invocation – on processor that executes them– Serialize whether data is common or not

• Remote Memory Access – at the memory– more chance of parallel computation because operations do

more than just access data– If competing for locked data operations may serialize

– Tolerance for Contention• Remote Memory Access has slightly higher throughput

because of parallel operation possibility• However, this can introduce greater variance of

completion time

Trade-Offs

Trade-Offs

• Compatibility with conceptual kernel model– Procedure based kernel

• Gives uniform model for user and kernel processes• Closely mimics hardware of UMA multiprocessor• Minimizes context switching• Fits nicely with Remote Memory Access

– Message based kernel• Comparmentalization of kernel• Synchronization is subsumed by message passing• Closely mimics hardware of distributed-memory

multicomputer• Easier to debug• Remote Invocation more appropriate

• Psyche on the BBN butterfly– 12:1 Remote-to-Local memory access time ratio– 6.88µs Read 32-bit word remote memory location– 0.518µs Read 32-bit word local memory location– 4.27µs Write 32-bit word remote memory location– 0.398µs Write 32-bit word local memory location

– 56µs Average Latency Interrupt level Remote Invocation – 421µs Average latency Process level Remote Invocation

Case Study

Operation

Local Access Remote Access Remote Invocation

locking

On Off On Off On Off

Enqueue + dequeue

(µs)

42.4 21.6 247 154 197 174

Find last in list of 5 25.0 16.1 131 87.6 115 96.7

Find last in list of 10 40.6 30.5 211 169 125 105

Create segment

(ms)

6.20 5.69 14.8 13.1 7.42 6.88

Map segment 0.96 0.86 3.05 2.62 1.94 1.77

Create process 1.43 1.35 3.30 3.04 1.89 1.75

Latency of Kernel Operations

Interrupt level RI

Process level RI

Times in columns represent cost of operations:• col 1 & 3 measurements for original Psyche kernel• col 2 synchronization without context switching• col 4 subsumed in other operation with course grain locking• col 6 cost if data accessed were always via remote

invocation – no explicit synchronization required• col 5 cost of remote invocations in a hybrid kernel

% Latency from locking

• Cost of synchronization• Measure Locking OFF vs ON

Remote Access Penalty

• NL RA(off) - local(off)/ RA(off)• L RA(off) – local(off)/RA(on)• Overhead of explicit synchronization and remote access

is a function of complexity• Example

– Interrupt level RI can be justified to avoid 11 remote refs– 6µs memory access cost & 60µs overhead of RI

interrupt level

RI time as percentage of RA time

• RA: locking in the case of remote access– column 6 as % of column 3

• B: locking in the case of both– column 5 as % of column 3– Simulates Hybrid kernel

• N: locking in the case of neither– column 6 as % of column 4– Simulates desired operation being subsumed by other

operation with coarse-grained locking

Not so fast

• Not such a clear win for Remote Invocation– Experiments represent extreme cases

• Simple enough to perform via interrupt level RI• Complex enough to absorb overhead of process level RI

– Medium size operations might not accept limitations of interrupt level RI or the overhead of process level RI

• Throughput for tiny operations– Remote memory accesses steal busy cycles from

processor where memory is being used– Remote invocations steal the entire processor for the

entire operation– Experiments show Interrupt level RI negatively affects

throughput of the remote processor & remote access does not

Conclusions

• Good Kernel design must consider– Cost of remote invocation mechanism– Cost of atomic operations and synchronization– Ratio of remote to local access time– Breakeven points depend on the number of remote

references in any given Operation

• Interrupt level RI has small number of uses– TLB shootdown– Console I/O

• Results could apply for ccNUMA machines as well

• Risks of putting data structures in interrupt level of the kernel might be a reason to use remote access instead for small operations

REFERENCES• Samuel J. Leffler, Marshall Kirk McKusick, Michael J. Karels, and

John S. Quarterman “The Design and Implementation of the 4.3BSD UNIX Operating System” Addison Wesley Publishing Company

• Andrew S. Tannenbaum “Modern Operating Systems 2nd Edition” Prentice-Hall of India

• Alessandro Rubini, Jonathan Corbet, “Linux Device Drivers 2nd Edition” O’Reilly