Kernel-Kernel communication in a shared-memory multiprocessor
By ELISEU M. CHAVES,* JR., PRAKASH CH. DAS,* THOMAS J. LEBLANC, BRIAN D. MARSH* AND MICHAEL L. SCOTT
Introduction
• Computer architecture can influence multiprocessor operating system design
  – How to distribute kernel functionality across processors
  – Inter-kernel communication
  – Data structure and location
  – How to maintain data integrity and provide fast access
• Some Examples...
Introduction
• Bus-based shared memory multiprocessors
  – Kernel code shared by all processors
  – Typically these have uniform memory access
  – Use explicit synchronization to control access to data structures (critical sections, locks, spin-locks, semaphores)
• Advantages
  – Easy to port an existing uniprocessor kernel and add explicit synchronization
  – Decent performance
  – Simplified load balancing and global resource management
• Disadvantages
  – Good for smaller machines but not very scalable
  – Increased contention leads to performance degradation
Introduction
• Distributed systems & distributed memory computers
  – Kernel data distributed among processors
  – Each processor executes a copy of the kernel
  – Local data operated on directly, but remote data is accessed via remote invocation
  – Uses implicit synchronization to control access to data
  – Hypercubes & mesh-connected machines
• Advantages
  – Works well for machines with no shared memory
  – Modular design gives increased fault protection
  – Can use non-preemption for bulk of synchronization
• Disadvantages
  – Reliance on a communication network to send invocations
  – Security of data while in the network
Introduction
• Memory can be accessed directly, but at varying cost (NUMA)
• Some with coherent caches between processors (ccNUMA)
• Large scale shared memory multiprocessors have properties common to both
  – Bus-based systems
    • Which use remote access
  – Distributed memory multi-computers
    • Which use remote invocation
• Kernel designers must choose between
  – Remote ACCESS &
  – Remote INVOCATION
• Trade-offs between remote access and remote invocation remain largely unknown (circa 1993)
Large Scale Shared Memory Multiprocessors
• General purpose parallel operating system
  – Shared memory access to all processors
  – Full range of kernel services
  – Reasonable use of all processors
  – Arrange system using collection of nodes
  – Possibly with caches?
  – NUMA architecture
  – No cache coherence
Target Parameters of this Research
• Remote Access
  – Code executing on node i reads and writes node j’s memory when necessary
• Remote Invocation
  – Processor at node i sends a message to node j’s processor requesting an operation on i’s behalf
• Bulk Data Transfer
  – Move data required from node j to node i
    • Specific request to kernel
    • Re-map memory from node j to node i
    • Data migration
  – Cache coherence blurs distinction with remote access
Kernel-Kernel Communication Options
• Interrupt Level
  – Request executes in the interrupt handler “bottom half”
  – Doesn’t execute in the context of a ‘process’
    • Can’t sleep
    • Can’t call schedule
    • Can’t transfer data to/from user space
  – Interrupts enabled during execution to allow top half to continue handling other interrupts
  – Mutual exclusion achieved through ‘masking interrupts’
    • Coarse granularity can unnecessarily prohibit too many other invocations
  – Prohibits outgoing invocations when incoming invocations are disabled (to avoid deadlock)
    • Limiting for accessing data on two different processors
  – Yes, lots of limitations, but very fast!
Types of Remote Invocation [1/2]
• Kernel Process Level
  – Normal “Top Half” activity
  – From user space: interrupt handler performs an asynchronous trap & remote invocation executes
  – From kernel space: interrupt handler queues the invocation for execution when kernel becomes idle, or when control returns to user space
  – Has process context, so requested operation is free to block during its execution
  – Process holding semaphores or scheduler-based locks can perform process-level RIs
  – Deadlock still possible, but can be avoided
    • Requires requesting process to block when making RI
    • Busy waiting would lock out other incoming process-level RIs
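The deadlock point above can be made concrete with a toy model. The sketch below is a single-threaded simulation, not Psyche code: `Node`, `service_one` and `simulate` are hypothetical names, and each "invocation" is reduced to a request sitting in the target's inbox. Two nodes invoke each other at the same moment; the only thing that varies is whether a node that is waiting for its own reply still services incoming invocations.

```python
from collections import deque


class Node:
    """Hypothetical kernel instance with a queue of incoming invocations."""

    def __init__(self):
        self.inbox = deque()         # requesters whose invocations are pending here
        self.reply_received = False  # has our own outgoing invocation completed?

    def service_one(self):
        """Execute one queued incoming invocation and reply to its requester."""
        if self.inbox:
            requester = self.inbox.popleft()
            requester.reply_received = True


def simulate(service_while_waiting: bool, max_steps: int = 10) -> bool:
    """Two nodes invoke each other at once; return True if both complete."""
    a, b = Node(), Node()
    a.inbox.append(b)  # b's request arrives at a
    b.inbox.append(a)  # a's request arrives at b
    for _ in range(max_steps):
        for node in (a, b):
            # A node that is still waiting only services incoming work
            # when the policy lets it (blocking instead of busy waiting).
            if node.reply_received or service_while_waiting:
                node.service_one()
        if a.reply_received and b.reply_received:
            return True
    return False  # neither reply ever arrived: the two nodes are deadlocked


if __name__ == "__main__":
    print("blocking requester completes:", simulate(True))
    print("busy-waiting requester completes:", simulate(False))
```

In this model the blocking policy lets both invocations finish in the first step, while the busy-waiting policy never makes progress, which is the situation the rule about blocking is meant to exclude.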
Types of Remote Invocation [2/2]
• Remote access and interrupt-level RI on the same data structure
  – Coexistence is possible
• Deadlock-avoidance rule: must always be possible to perform incoming invocations while waiting for outgoing invocation
• Direct costs of remote operations
  – Latency of the operation
  – (R-1)*n < C
    • R: remote to local access time ratio
    • n: number of memory accesses
    • C: overhead of remote invocation
  – If C is greater, implement operation using remote access
  – Notes:
    • In general, C is fixed so use RI for longer operations
    • Parameter passing not included in calculation
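As a rough illustration of the inequality above, the following sketch turns it into a decision function. It assumes that C is expressed in units of one local memory access, which the slide does not state, and the function names and the sample values (R = 4, C = 30) are hypothetical, chosen only to make the arithmetic easy to follow.

```python
import math


def prefer_remote_access(r: float, n: int, c: float) -> bool:
    """Return True when (R - 1) * n < C, i.e. remote access is cheaper.

    r: ratio of remote to local memory access time (R)
    n: number of memory accesses the operation performs
    c: fixed overhead of one remote invocation, in units of one local
       memory access (assumption: the slide gives no units for C)
    """
    return (r - 1) * n < c


def break_even_accesses(r: float, c: float) -> int:
    """Smallest n for which remote access is no longer the cheaper choice."""
    return math.ceil(c / (r - 1))


if __name__ == "__main__":
    r, c = 4, 30  # hypothetical machine parameters
    for n in (5, 9, 10, 20):
        choice = "remote access" if prefer_remote_access(r, n, c) else "remote invocation"
        print(f"n={n:2d}: {choice}")
    print("break-even n:", break_even_accesses(r, c))
```

Because C is fixed while the left-hand side grows with n, the choice flips exactly once as operations get longer. Parameter passing is left out here, as it is in the slide's calculation.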
Trade-Offs
• Indirect costs for local operations
  – RI may need context of invoker as parameter
  – Operations arranged to increase node locality??
    • Access between invoker/target processor not interleaved??
  – Process level RI for all remote accesses may allow data structure implementation without explicit synchronization
    • Uses lack of pre-emption within kernel to provide implicit synchronization
  – Avoiding explicit synchronization
    • Can improve speed of remote and local operations because of the high cost of explicit synchronization
  – Remote memory access may require that large portions of kernel data space on other processors be mapped into each instance of the kernel
    • May not scale well to large machines
    • Mapping on demand not a good option
Trade-Offs
• Competition for processor and memory cycles
  – When do operations serialize?
    • Remote Invocation: on the processor that executes them
      – Serialize whether data is common or not
    • Remote Memory Access: at the memory
      – More chance of parallel computation because operations do more than just access data
      – If competing for locked data, operations may serialize
  – Tolerance for contention
    • Remote memory access has slightly higher throughput because of parallel operation possibility
    • However, this can introduce greater variance of completion time
Trade-Offs
Trade-Offs
• Compatibility with conceptual kernel model
  – Procedure based kernel
    • Gives uniform model for user and kernel processes
    • Closely mimics hardware of UMA multiprocessor
    • Minimizes context switching
    • Fits nicely with remote memory access
  – Message based kernel
    • Compartmentalization of kernel
    • Synchronization is subsumed by message passing
    • Closely mimics hardware of distributed-memory multicomputer
    • Easier to debug
    • Remote invocation more appropriate
• Psyche on the BBN Butterfly
  – 12:1 remote-to-local memory access time ratio
  – 6.88 µs: read 32-bit word from remote memory location
  – 0.518 µs: read 32-bit word from local memory location
  – 4.27 µs: write 32-bit word to remote memory location
  – 0.398 µs: write 32-bit word to local memory location
  – 56 µs: average latency of interrupt level remote invocation
  – 421 µs: average latency of process level remote invocation
Case Study
Operation                   Local Access      Remote Access     Remote Invocation
(locking)                   On       Off      On       Off      On       Off
Times in µs
Enqueue + dequeue           42.4     21.6     247      154      197      174
Find last in list of 5      25.0     16.1     131      87.6     115      96.7
Find last in list of 10     40.6     30.5     211      169      125      105
Times in ms
Create segment              6.20     5.69     14.8     13.1     7.42     6.88
Map segment                 0.96     0.86     3.05     2.62     1.94     1.77
Create process              1.43     1.35     3.30     3.04     1.89     1.75
Latency of Kernel Operations
• Interrupt level RI: used for the operations timed in µs
• Process level RI: used for the operations timed in ms
Times in columns represent cost of operations:
• col 1 & 3: measurements for original Psyche kernel
• col 2: synchronization without context switching
• col 4: operation subsumed in another operation with coarse-grained locking
• col 5: cost of remote invocations in a hybrid kernel
• col 6: cost if data were always accessed via remote invocation (no explicit synchronization required)
% Latency from locking
• Cost of synchronization
• Measure locking OFF vs ON
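One plausible way to turn the ON/OFF comparison into a percentage is sketched below, using the figures from the "Latency of Kernel Operations" table. The formula (on - off) / on is an assumption about how the share of latency due to locking is defined, since the slide only says to compare the two; `locking_share` and `locking_report` are hypothetical helper names.

```python
# Latencies from the "Latency of Kernel Operations" table.
# Column order: local on, local off, remote access on, remote access off,
# remote invocation on, remote invocation off ("on"/"off" refers to locking).
# The first three rows are in microseconds, the last three in milliseconds.
LATENCY = {
    "Enqueue + dequeue":       (42.4, 21.6, 247.0, 154.0, 197.0, 174.0),
    "Find last in list of 5":  (25.0, 16.1, 131.0, 87.6, 115.0, 96.7),
    "Find last in list of 10": (40.6, 30.5, 211.0, 169.0, 125.0, 105.0),
    "Create segment":          (6.20, 5.69, 14.8, 13.1, 7.42, 6.88),
    "Map segment":             (0.96, 0.86, 3.05, 2.62, 1.94, 1.77),
    "Create process":          (1.43, 1.35, 3.30, 3.04, 1.89, 1.75),
}


def locking_share(on: float, off: float) -> float:
    """Percentage of the locking-on latency that disappears with locking off."""
    return 100.0 * (on - off) / on


def locking_report(table):
    """Map each operation to its locking share for (local, RA, RI)."""
    return {
        op: tuple(locking_share(row[i], row[i + 1]) for i in (0, 2, 4))
        for op, row in table.items()
    }


if __name__ == "__main__":
    print(f"{'Operation':26s}{'Local':>8s}{'RA':>8s}{'RI':>8s}")
    for op, shares in locking_report(LATENCY).items():
        print(f"{op:26s}" + "".join(f"{s:7.1f}%" for s in shares))
```

Under this definition the short queue operations lose a much larger fraction of their time to locking than the long segment and process operations do.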
Remote Access Penalty
• NL: (RA(off) - local(off)) / RA(off)
• L: (RA(off) - local(off)) / RA(on)
• Overhead of explicit synchronization and remote access is a function of complexity
• Example
  – Interrupt level RI can be justified to avoid 11 remote refs
  – 6 µs memory access cost & 60 µs overhead of interrupt level RI
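The figure of 11 remote references follows from the two rounded costs in the example. The worked step below assumes the comparison is simply total remote access time against the fixed invocation overhead; the symbols $t_{\text{remote}}$ and $C_{\text{RI}}$ are introduced here for illustration and do not appear on the slide.

```latex
\[
  n \cdot t_{\text{remote}} > C_{\text{RI}}
  \;\Longrightarrow\;
  n > \frac{60\,\mu\text{s}}{6\,\mu\text{s}} = 10
  \;\Longrightarrow\;
  n \ge 11
\]
```

With 10 references the two costs are equal, so 11 is the smallest count at which interrupt level RI comes out ahead.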
RI time as percentage of RA time
• RA: locking in the case of remote access
  – column 6 as % of column 3
• B: locking in the case of both
  – column 5 as % of column 3
  – Simulates hybrid kernel
• N: locking in the case of neither
  – column 6 as % of column 4
  – Simulates desired operation being subsumed by other operation with coarse-grained locking
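The three ratios defined above can be computed directly from the "Latency of Kernel Operations" table. The sketch below does that with the column pairings exactly as listed; `ri_vs_ra` is a hypothetical helper name, and the tuple layout is just the table's six columns in order.

```python
# Latencies from the "Latency of Kernel Operations" table, columns 1..6:
# local on, local off, remote access on, remote access off,
# remote invocation on, remote invocation off ("on"/"off" refers to locking).
LATENCY = {
    "Enqueue + dequeue":       (42.4, 21.6, 247.0, 154.0, 197.0, 174.0),
    "Find last in list of 5":  (25.0, 16.1, 131.0, 87.6, 115.0, 96.7),
    "Find last in list of 10": (40.6, 30.5, 211.0, 169.0, 125.0, 105.0),
    "Create segment":          (6.20, 5.69, 14.8, 13.1, 7.42, 6.88),
    "Map segment":             (0.96, 0.86, 3.05, 2.62, 1.94, 1.77),
    "Create process":          (1.43, 1.35, 3.30, 3.04, 1.89, 1.75),
}


def ri_vs_ra(row):
    """Remote invocation time as a percentage of remote access time."""
    _local_on, _local_off, ra_on, ra_off, ri_on, ri_off = row
    return {
        "RA": 100.0 * ri_off / ra_on,  # column 6 as % of column 3
        "B": 100.0 * ri_on / ra_on,    # column 5 as % of column 3
        "N": 100.0 * ri_off / ra_off,  # column 6 as % of column 4
    }


if __name__ == "__main__":
    print(f"{'Operation':26s}{'RA':>8s}{'B':>8s}{'N':>8s}")
    for op, row in LATENCY.items():
        pct = ri_vs_ra(row)
        print(f"{op:26s}" + "".join(f"{pct[k]:7.1f}%" for k in ("RA", "B", "N")))
```

A value under 100% means remote invocation is faster than remote access for that operation in that scenario; a value over 100% means it is slower.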
Not so fast
• Not such a clear win for remote invocation
  – Experiments represent extreme cases
    • Simple enough to perform via interrupt level RI
    • Complex enough to absorb overhead of process level RI
  – Medium size operations might not accept limitations of interrupt level RI or the overhead of process level RI
• Throughput for tiny operations
  – Remote memory accesses steal busy cycles from processor where memory is being used
  – Remote invocations steal the entire processor for the entire operation
  – Experiments show interrupt level RI negatively affects throughput of the remote processor & remote access does not
Conclusions
• Good kernel design must consider
  – Cost of remote invocation mechanism
  – Cost of atomic operations and synchronization
  – Ratio of remote to local access time
  – Breakeven points depend on the number of remote references in any given operation
• Interrupt level RI has small number of uses
  – TLB shootdown
  – Console I/O
• Results could apply for ccNUMA machines as well
• Risks of putting data structures in interrupt level of the kernel might be a reason to use remote access instead for small operations