8/12/2019 Solaris Internals(Ch17 Locking)
17.2. Parallel Systems Architectures
Multiprocessor (MP) systems from Sun (SPARC-processor-based), as well as
several x86/x64-based MP platforms, are implemented as symmetric
multiprocessor (SMP) systems. Symmetric multiprocessor describes a system in
which a peer-to-peer relationship exists among all the processors (CPUs) on
the system. A master processor, defined as the only CPU on the system that
can execute operating system code and field interrupts, does not exist. All
processors are equal. The SMP acronym can also be extended to mean Shared
Memory Multiprocessor, which defines an architecture in which all the
processors in the system share a uniform view of the system's physical
address space and the operating system's virtual address space. That is, all
processors share a single image of the operating system kernel. Sun's
multiprocessor systems meet the criteria for both definitions.
Alternative MP architectures alter the kernel's view of addressable memory in
different ways. Massively parallel processor (MPP) systems are built on nodes
that contain a relatively small number of processors, some local memory, and
I/O. Each node contains its own copy of the operating system; thus, each node
addresses its own physical and virtual address space. The address space of
one node is not visible to the other nodes on the system. The nodes are
connected by a high-speed, low-latency interconnect, and node-to-node
communication is done through an optimized message passing interface. MPP
architectures require a new programming model to achieve parallelism across
nodes.
The shared memory model does not work since the system's total address
space is not visible across nodes, so memory pages cannot be shared by
threads running on different nodes. Thus, an API that provides an interface
into the message passing path in the kernel must be used by code that needs
to scale across the various nodes in the system.
Other issues arise from the nonuniform nature of the architecture with
respect to I/O processing since the I/O controllers on each node are not easily
made visible to all the nodes on the system. Some MPP platforms attempt to
provide the illusion of a uniform I/O space across all the nodes by using kernel
software, but the nonuniformity of the access times to nonlocal I/O devices
still exists.
NUMA and ccNUMA (nonuniform memory access and cache coherent NUMA)
architectures attempt to address the programming model issue inherent in MPP
systems. From a hardware architecture point of view, NUMA systems resemble
MPPs: small nodes with few processors, a node-to-node interconnect, local
memory, and I/O on each node. Note: It is not required that NUMA/ccNUMA or
MPP systems implement small nodes (nodes with four or fewer processors).
Many implementations are built that way, but there is no architectural
restriction on the node size.
On NUMA/ccNUMA systems, the operating system software provides a single
system image, where each node has a view of the entire system's memory
address space. In this way, the shared memory model is preserved. However,
the nonuniform nature of the speed of memory access (latency) is a factor in
the performance and potential scalability of the platform. When a thread
executing on a processor node on a NUMA or ccNUMA system incurs a page
fault (references an unmapped memory address), the latency involved in
resolving the page fault varies according to whether the physical memory page
is on the same node as the executing thread or on a node somewhere across
the interconnect. The latency variance can be substantial. As the level of
memory page sharing increases across threads executing on different nodes, a
potentially higher volume of page faults needs to be resolved from a nonlocal
memory segment. This problem adversely affects performance and scalability.
The three different parallel architectures can be summarized as follows:

SMP. Symmetric multiprocessor with a shared memory model; single kernel image.

MPP. Message-based model; multiple kernel images.

NUMA/ccNUMA. Shared memory model; single kernel image.
Figure 17.1 illustrates the different architectures.
Figure 17.1. Parallel Systems Architectures
The challenge in building an operating system that provides scalable
performance when multiple processors are sharing a single image of the kernel
and when every processor can run kernel code, handle interrupts, etc., is to
synchronize access to critical data and state information. Scalable
performance, or scalability, generally refers to accomplishment of an increasing
amount of work as more hardware resources are added to the system. If
more processors are added to a multiprocessor system, an incremental
increase in work is expected, assuming sufficient resources in other areas of
the system (memory, I/O, network).
To achieve scalable performance, the system must be able to concurrently
support multiple processors executing operating system code. Whether that
execution is in device drivers, interrupt handlers, the threads dispatcher, file
system code, virtual memory code, etc., is, to a degree, load dependent.
Concurrency is key to scalability.
The preceding discussion on parallel architectures only scratched the surface
of a very complex topic. Entire texts discuss parallel architectures exclusively;
you should refer to them for additional information. See, for example, [1].
being 1's (lock value 0xFF). A lock that is available (not being held) is the
same byte with all 0's (lock value 0x00). This explanation may seem quite
rudimentary, but is crucial to understanding the text that follows.
Most modern processors shipping today provide some form of byte-level
test-and-set instruction that is guaranteed to be atomic in nature. The
instruction sequence is often described as read-modify-write; that is, the
referenced memory location (the memory address of the lock) is read,
modified, and written back in one atomic operation. In RISC processors (such
as the UltraSPARC T1 processor), reads are load operations and writes are
store operations. An atomic operation is required for consistency. An
instruction that has atomic properties means that no other store operation is
allowed between the load and store of the executing instruction. Mutex and
RW lock operations must be atomic, such that when the instruction execution
to get the lock is complete, we either have the lock or have the information
we need to determine that the lock is already being held.
Consider what could happen without an instruction that has atomic properties.
A thread executing on one processor could issue a load (read) of the lock, and
while it is doing a test operation to determine if the lock is held or not,
another thread executing on another processor issues a lock call to get the
same lock at the same time. If the lock is not held, both threads would assume
the lock is available and would issue a store to hold the lock. Obviously, more
than one thread cannot own the same lock at the same time, but that would
be the result of such a sequence of events. Atomic instructions prevent such
things from happening.
SPARC processors implement memory access instructions that provide atomic
test-and-set semantics for mutual exclusion primitives, as well as instructions
that can force a particular ordering of memory operations (more on the latter
feature in a moment). UltraSPARC processors (the SPARC V9 instruction set)
provide three memory access instructions that guarantee atomic behavior:
ldstub (load and store unsigned byte), cas (compare and swap), and swap
(swap byte locations). These instructions differ slightly in their behavior and
the size of the datum they operate on.
Figure 17.2 illustrates the ldstub and cas instructions. The swap instruction
(not shown) simply swaps a 32-bit value between a hardware register and a
memory location, similar to what cas does if the compare phase of the
instruction sequence is equal.
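The compare-and-swap semantics just described can be modeled in portable C. The sketch below is an illustrative model only; the function name cas32 and the use of C11 atomics are our stand-ins, not SPARC microcode or Solaris source:

```c
#include <stdint.h>
#include <stdatomic.h>

/* Model of cas semantics: compare the value at *addr with 'expected';
 * if they are equal, store 'newval'. The value that was in memory
 * before the operation is returned in either case, atomically. */
static uint32_t
cas32(_Atomic uint32_t *addr, uint32_t expected, uint32_t newval)
{
    /* On failure, atomic_compare_exchange_strong writes the observed
     * memory value back into 'expected', so returning it yields the
     * old memory value on both the success and failure paths. */
    atomic_compare_exchange_strong(addr, &expected, newval);
    return expected;
}
```

Locking code built on cas tests the returned value: if it matches the "free" pattern, the caller won the lock; otherwise the returned value tells it the lock was already held.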
Figure 17.2. Atomic Instructions for Locks on SPARC Systems
The implementation of locking code with the assembly language test-and-set
style of instructions requires a subsequent test instruction on the lock value,
which is retrieved with either a cas or ldstub instruction.
For example, the ldstub instruction retrieves the byte value (the lock) from
memory and stores it in the specified hardware register. Locking code must
test the value of the register to determine if the lock was held or available
when the ldstub executed. If the register value is all 1's, the lock was held, so
the code must branch off and deal with that condition. If the register value is
all 0's, the lock was not held and the code can progress as being the current
lock holder. Note that in both cases, the lock value in memory is set to all 1's,
by virtue of the behavior of the ldstub instruction (store 0xFF at the designated
address). If the lock was already held, the value simply didn't change. If the
lock was 0 (available), it will now reflect that the lock is held (all 1's). The
code that releases a lock sets the lock value to all 0's, indicating the lock is
no longer being held.
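The ldstub-style acquire/release protocol described above can be sketched in C, with a C11 atomic exchange standing in for the actual instruction. This is a minimal model under that assumption; the type and function names are illustrative, not Solaris's:

```c
#include <stdint.h>
#include <stdatomic.h>

typedef _Atomic uint8_t byte_lock_t;   /* lock byte: 0x00 free, 0xFF held */

/* Model of an ldstub-style acquire attempt: atomically store 0xFF and
 * fetch the previous byte. If the old value was 0x00 the caller now
 * owns the lock; if it was 0xFF the lock was already held, and the
 * store of 0xFF changed nothing in effect. Returns 1 on success. */
static int
byte_lock_try(byte_lock_t *lk)
{
    return atomic_exchange(lk, 0xFF) == 0x00;
}

/* Release: set the lock byte back to all 0's. */
static void
byte_lock_release(byte_lock_t *lk)
{
    atomic_store(lk, 0x00);
}
```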
The Solaris lock code uses assembly language instructions when the lock code
is entered. The basic design is such that the entry point to acquire a lock
enters an assembly language routine, which uses either ldstub or cas to grab
the lock. The assembly code is designed to deal with the simple case, meaning
that the desired lock is available. If the lock is being held, a C language code
path is entered to deal with this situation. We describe what happens in detail
in the next few sections that discuss specific lock types.
The second hardware consideration referred to earlier has to do with the
visibility of the lock state to the running processors when the lock value is
changed. It is critically important on multiprocessor systems that all processors
have a consistent view of data in memory, especially in the implementation of
synchronization primitives: mutex locks and reader/writer (RW) locks. In other
words, if a thread acquires a lock, any processor that executes a load
instruction (read) of that memory location must retrieve the data following the
last store (write) that was issued. The most recent state of the lock must be
globally visible to all processors on the system.
Modern processors implement hardware buffering to provide optimal
performance. In addition to the hardware caches, processors also use load and
store buffers to hold data being read from (load) or written to (store)
memory in order to keep the instruction pipeline running and not have the
processor stall waiting for data or a data write-to-memory cycle. The data
hierarchy is illustrated in Figure 17.3.
Figure 17.3. Hardware Data Hierarchy
The illustration in Figure 17.3 does not depict a specific processor; it is a
generic representation of the various levels of data flow in a typical modern
high-end microprocessor. It shows the flow of data to and from physical
memory from a processor's main execution units (integer units, floating point
units, etc.).
The sizes of the load/store buffers vary across processor implementations,
but they are typically several words in size. The load and store buffers on
each processor are visible only to the processor they reside on, so a load
issued by a processor that issued the store fetches the data from the store
buffer if it is still there. However, it is theoretically possible for other
processors that issue a load for that data to read their hardware cache or
main memory before the store buffer in the store-issuing processor was
flushed. Note that the store buffer we are referring to here is not the same
thing as a level 1 or level 2 hardware instruction and data cache. Caches are
beyond the store buffer; the store buffer is closer to the execution units of
the processor. Physical memory and hardware caches are kept consistent on
SMP platforms by a hardware bus protocol. Also, many caches are
implemented as write-through caches (as is the case with the level 1 cache in
Sun UltraSPARC), so data written to cache causes memory to be updated.
The implementation of a store buffer is part of the memory model implemented
by the hardware. The memory model defines the constraints that can be
imposed on the order of memory operations (loads and stores) by the system.
Many processors implement a sequential consistency model, where loads and
stores to memory are executed in the same order in which they were issued
by the processor. This model has advantages in terms of memory consistency,
but there are performance trade-offs with such a model because the
hardware cannot optimize cache and memory operations for speed. The SPARC
architecture specification [7] provides for building SPARC-based processors
that support multiple memory models, the choice being left up to the
implementors as to which memory models they wish to support. All current
SPARC processors implement a Total Store Ordering (TSO) model, which
requires compliance with the following rules for loads and stores:
Loads (reads from memory) are blocking and are ordered with respect
to other loads.

Stores (writes to memory) are ordered with respect to other stores.

Stores cannot bypass earlier loads.

Atomic load-stores (ldstub and cas instructions) are ordered with
respect to loads.
The TSO model is not quite as strict as the sequential consistency model but
not as relaxed as two additional memory models defined by the SPARC
architecture. SPARC-based processors also support Relaxed Memory Order
(RMO) and Partial Store Order (PSO), but these are not currently supported
by the kernel and not implemented by any Sun systems shipping today.
A final consideration in data visibility applies also to the memory model and
concerns instruction ordering. The execution unit in modern processors can
reorder the incoming instruction stream for processing through the execution
units. The goals again are performance and creation of a sequence of
instructions that will keep the processor pipeline full.
The hardware considerations described in this section are summarized in Table
17.1, along with the solution or implementation detail that applies to the
particular issue.
Table 17.1. Hardware Considerations and Solutions for Locks

Consideration: Need for an atomic test-and-set instruction for locking
primitives.
Solution: Use of native machine instructions: ldstub and cas on SPARC,
cmpxchgl (compare/exchange long) on x86.

Consideration: Data global visibility issue because of the use of hardware
load and store buffers and instruction reordering, as defined by the memory
model.
Solution: Use of memory barrier instructions.
The issues of consistent memory views in the face of a processor's load and
store buffers, relaxed memory models, and atomic test-and-set capability for
locks are addressed at the processor instruction-set level. The mutex lock
and RW lock primitives implemented in the Solaris kernel use the ldstub and cas
instructions for lock testing and acquisition on UltraSPARC-based systems and
use the cmpxchgl (compare/exchange long) instruction on x86. The lock
primitive routines are part of the architecture-dependent segment of the
kernel code.
SPARC processors provide various forms of memory barrier (membar)
instructions, which, depending on options that are set in the instruction, impose
specific constraints on the ordering of memory access operations (loads and
stores) relative to the sequence with which they were issued. To ensure a
consistent memory view when a mutex or RW lock operation has been issued,
the Solaris kernel issues the appropriate membar instruction after the lock bits
have changed.
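The ordering role such a barrier plays can be illustrated with a C11 release fence as a rough analogue of a membar. This is a sketch of the idea only; the variable names, the value stored, and the use of C11 fences in place of the actual SPARC instruction are our assumptions:

```c
#include <stdint.h>
#include <stdatomic.h>

static uint64_t protected_data;      /* data guarded by the lock */
static _Atomic uint8_t lock_byte;    /* 0x00 free, 0xFF held */

/* Rough analogue of issuing a membar after changing the lock bits:
 * the release fence orders the data store before the store that
 * clears the lock byte, so any processor that subsequently acquires
 * the lock also observes the data written while it was held. */
static void
unlock_with_barrier(void)
{
    protected_data = 42;                         /* store in the critical section */
    atomic_thread_fence(memory_order_release);   /* membar-like ordering point */
    atomic_store_explicit(&lock_byte, 0x00, memory_order_relaxed);
}
```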
As we move from the strongest consistency model (sequential consistency) to
the weakest model (RMO), we can build a system with potentially better
performance. We can optimize memory operations by playing with the ordering
of memory access instructions in ways that enable designers to minimize access
latency and to maximize interconnect bandwidth. The trade-off is consistency,
since the more relaxed models provide fewer and fewer constraints on the
system to issue memory access operations in the same order in which the
instruction stream issued them. So, processor architectures provide memory
barrier controls that kernel developers can use to address the consistency
issues as necessary, with some level of control on which consistency level is
required to meet the system requirements. The types of membar instructions
available, the options they support, and how they fit into the different
memory models described would make for a highly technical and lengthy chapter
on its own. Readers interested in this topic should read [7] and [27].
are mounted, files are created and opened, network connections are made,
etc. Many of the locks are embedded in the kernel data structures that
provide the abstractions (processes, files) provided by the kernel, and thus
the number of kernel locks will scale up linearly as resources are created
dynamically.
This design speaks to one of the core strengths of the Solaris kernel:
scalability, and scaling synchronization primitives dynamically with the size of
the kernel. Dynamic lock creation has several advantages over static
allocation. First, the kernel is not wasting time and space managing a large
pool of unused locks when running on a smaller system, such as a desktop or
workgroup server. On a large system, a sufficient number of locks is available
to sustain concurrency for scalable performance. It is possible to have literally
thousands of locks in existence on a large, busy system.
17.4.1. Synchronization Process
When an executing kernel thread attempts to acquire a lock, it will encounter
one of two possible lock states: free (available) or not free (owned, held). A
requesting thread gets ownership of an available lock when the lock-specific
get lock function is invoked. If the lock is not available, the thread most likely
needs to block and wait for it to become available, although, as we will see
shortly, the code does not always block (sleep), waiting for a lock. For those
situations in which a thread will sleep while waiting for a lock, the kernel
implements a sleep queue facility, known as turnstiles, for managing threads
blocking on locks.
When a kernel thread has completed the operation on the shared data
protected by the lock, it must release the lock. When a thread releases a
lock, the code must deal with one of two possible conditions: threads are
waiting for the lock (such threads are termed waiters), or there are no
waiters. With no waiters, the lock can simply be released. With waiters, the
code has several options. It can release the lock and wake up the blocking
threads. In that case, the first thread to execute acquires the lock.
Alternatively, the code could select a thread from the turnstile (sleep queue),
based on priority or sleep time, and wake up only that thread. Finally, the
code could select which thread should get the lock next, and the lock owner
could hand the lock off to the selected thread. As we will see in the following
sections, no one solution is suitable for all situations, and the Solaris kernel
uses all three methods, depending on the lock type. Figure 17.4 provides the
big picture.
Figure 17.4. Solaris Locks: The Big Picture
Figure 17.4 provides a generic representation of the execution flow. Later we
will see the results of a considerable amount of engineering effort that has
gone into the lock code: improved efficiency and speed with short code paths,
optimizations for the hot path (frequently hit code path) with well-tuned
assembly code, and the best algorithms for lock release as determined by
extensive analysis.
17.4.2. Synchronization Object Operations Vector
Each of the synchronization objects discussed in this section (mutex locks,
reader/writer locks, and semaphores) defines an operations vector that is
linked to kernel threads that are blocking on the object. Specifically, the
object's operations vector is a data structure that exports a subset of
object functions required for kthreads sleeping on the lock. The generic
structure is defined as follows:
/*
 * The following data structure is used to map
 * synchronization object type numbers to the
 * synchronization object's sleep queue number
 * or the synch. object's owner function.
 */
typedef struct _sobj_ops {
        int             sobj_type;
        kthread_t       *(*sobj_owner)();
        void            (*sobj_unsleep)(kthread_t *);
        void            (*sobj_change_pri)(kthread_t *, pri_t, pri_t *);
} sobj_ops_t;
See sys/sobject.h
The structure shown above provides for the object type declaration. For each
synchronization object type, a type-specific structure is defined:
mutex_sobj_ops for mutex locks, rw_sobj_ops for reader/writer locks, and
sema_sobj_ops for semaphores.
The structure also provides three functions that may be called on behalf of a
kthread sleeping on a synchronization object:

An owner function, which returns the ID of the kernel thread that owns
the object.

An unsleep function, which transitions a kernel thread from a sleep state.

A change_pri function, which changes the priority of a kernel thread,
used for priority inheritance. (See Section 17.7.)
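The way such an operations vector is populated and dispatched through can be illustrated with toy types standing in for the kernel's. Everything here (the toy_mutex type, the callback bodies, the type number) is ours for illustration, not Solaris source:

```c
/* Simplified stand-ins for kernel types; illustrative only. */
typedef struct kthread { int t_tid; } kthread_t;
typedef int pri_t;

typedef struct sobj_ops {
    int        sobj_type;
    kthread_t *(*sobj_owner)(void *);
    void       (*sobj_unsleep)(kthread_t *);
    void       (*sobj_change_pri)(kthread_t *, pri_t, pri_t *);
} sobj_ops_t;

/* A toy mutex with an owner field, plus its type-specific callbacks. */
typedef struct toy_mutex { kthread_t *m_owner; } toy_mutex_t;

static kthread_t *
toy_mutex_owner(void *lock)
{
    return ((toy_mutex_t *)lock)->m_owner;   /* who holds this lock? */
}

static void
toy_unsleep(kthread_t *t)
{
    (void)t;   /* would remove the thread from its sleep queue */
}

static void
toy_change_pri(kthread_t *t, pri_t newpri, pri_t *oldp)
{
    (void)t; (void)newpri; (void)oldp;   /* would adjust thread priority */
}

/* One vector per object type; sleep-queue code dispatches through it. */
static sobj_ops_t toy_mutex_sobj_ops = {
    1, toy_mutex_owner, toy_unsleep, toy_change_pri
};
```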
We will see how references to the lock's operations structure are implemented
as we move through specifics on lock implementations in the following sections.
It is useful to note at this point that our examination of Solaris kernel locks
offers a good example of some of the design trade-offs involved in kernel
software engineering. Building the various software components that make up
the Solaris kernel is a series of design decisions, in which performance needs are
measured against complexity. In areas of the kernel where optimal performance
is a top priority, simplicity might be sacrificed in favor of performance. The
locking facilities in the Solaris kernel are an area where such trade-offs are
made: much of the lock code is written in assembly language, for speed, rather
than in the C language; the latter is easier to code with and maintain but is
potentially slower. In some cases, when the code path is not performance
critical, a simpler design will be favored over cryptic assembly code or
complexity in the algorithms. The behavior of a particular design is examined
through exhaustive testing, to ensure that the best possible design decisions
were made.
17.5. Mutex Locks
Mutual exclusion, or mutex, locks are the most common type of synchronization
primitive used in the kernel. Mutex locks serialize access to critical data: a
kernel thread must acquire the mutex specific to the data region being
protected before it can read or write the data. The thread is the lock owner
while it is holding the lock, and the thread must release the lock when it has
finished working in the protected region so other threads can acquire the lock
for access to the protected data.
17.5.1. Overview
If a thread attempts to acquire a mutex lock that is being held, it can
basically do one of two things: it can spin or it can block. Spinning means the
thread enters a tight loop, attempting to acquire the lock in each pass through
the loop. The term spin lock is often used to describe this type of mutex.
Blocking means the thread is placed on a sleep queue while the lock is being
held and the kernel sends a wakeup to the thread when the lock is released.
There are pros and cons to both approaches.
The spin approach has the benefit of not incurring the overhead of context
switching, required when a thread is put to sleep, and also has the advantage
of a relatively fast acquisition when the lock is released, since there is no
context-switch operation. It has the downside of consuming CPU cycles while
the thread is in the spin loop: the CPU is executing a kernel thread (the thread
in the spin loop) but not really doing any useful work.
The blocking approach has the advantage of freeing the processor to execute
other threads while the lock is being held; it has the disadvantage of requiring
context switching to get the waiting thread off the processor and a new
runnable thread onto the processor. There's also a little more lock acquisition
latency, since a wakeup and context switch are required before the blocking
thread can become the owner of the lock it was waiting for.
In addition to the issue of what to do if a requested lock is being held, the
question of lock granularity needs to be resolved. Let's take a simple example.
The kernel maintains a process table, which is a linked list of process
structures, one for each of the processes running on the system. A simple
table-level mutex could be implemented, such that if a thread needs to
manipulate a process structure, it must first acquire the process table mutex.
This level of locking is very coarse. It has the advantages of simplicity and
minimal lock overhead. It has the obvious disadvantage of potentially poor
scalability, since only one thread at a time can manipulate objects on the
process table. Such a lock is likely to have a great deal of contention (become
a hot lock).
The alternative is to implement a finer level of granularity: a lock per
process table entry versus one table-level lock. With a lock on each process
table entry, multiple threads can be manipulating different process structures
at the same time, providing concurrency. The disadvantages are that such an
implementation is more complex, increases the chances of deadlock situations,
and necessitates more overhead because there are more locks to manage.
In general, the Solaris kernel implements relatively fine-grained locking
whenever possible, largely due to the dynamic nature of scaling locks with
kernel structures as needed.
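The per-entry locking scheme described for the process table can be sketched in user-space C, with a POSIX mutex standing in for a kernel mutex. The structure, field names, and function are illustrative only, not the kernel's process table:

```c
#include <pthread.h>

/* Fine-grained locking: each table entry embeds its own mutex, so
 * threads working on different entries do not contend with each
 * other, unlike a single table-level lock. */
typedef struct proc_entry {
    pthread_mutex_t p_lock;   /* one lock per entry */
    int             p_pid;
    int             p_state;
} proc_entry_t;

static void
proc_set_state(proc_entry_t *p, int state)
{
    pthread_mutex_lock(&p->p_lock);
    p->p_state = state;       /* critical section kept short by design */
    pthread_mutex_unlock(&p->p_lock);
}
```

The trade-off noted above applies directly: one mutex per entry costs memory and bookkeeping, but two threads updating different entries never serialize on each other.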
The kernel implements two types of mutex locks: spin locks and adaptive locks.
Spin locks, as we discussed, spin in a tight loop if a desired lock is being held
when a thread attempts to acquire the lock. Adaptive locks are the most
common type of lock used and are designed to dynamically either spin or block
when a lock is being held, depending on the state of the holder. We already
discussed the trade-offs of spinning versus blocking. Implementing a locking
scheme that only does one or the other can severely impact scalability and
performance. It is much better to use an adaptive locking scheme, which is
precisely what we do.
The mechanics of adaptive locks are straightforward. When a thread attempts
to acquire a lock and the lock is being held, the kernel examines the state of
the thread that is holding the lock. If the lock holder (owner) is running on a
processor, the thread attempting to get the lock will spin. If the thread holding
the lock is not running, the thread attempting to get the lock will block. This
policy works quite well because the code is such that mutex hold times are
very short (by design, the goal is to minimize the amount of code to be
executed while a lock is held). So, if a thread is holding a lock and running, the
lock will likely be released very soon, probably in less time than it takes to
context-switch off and on again, so it's worth spinning.
On the other hand, if a lock holder is not running, then we know that minimally
one context switch is involved before the holder will release the lock (getting
the holder back on a processor to run), and it makes sense to simply block and
free up the processor to do something else. The kernel will place the blocking
thread on a turnstile (sleep queue) designed specifically for synchronization
primitives and will wake the thread when the lock is released by the holder.
(See Section 17.7.)
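The spin-or-block decision just described can be expressed as a small policy function. This is an illustrative skeleton of the logic, not the kernel's implementation; the types and names are ours:

```c
#include <stdbool.h>

/* Hypothetical owner state: is the lock holder currently on a CPU? */
typedef struct owner_state { bool running_on_cpu; } owner_state_t;

typedef enum { ACTION_SPIN, ACTION_BLOCK } lock_action_t;

/* Adaptive policy: if the owner is running, it should release the lock
 * soon (hold times are short by design), so spinning is cheaper than a
 * context switch. If the owner is off-processor, at least one context
 * switch must happen before release, so block on a turnstile instead. */
static lock_action_t
adaptive_policy(const owner_state_t *owner)
{
    return owner->running_on_cpu ? ACTION_SPIN : ACTION_BLOCK;
}
```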
The other distinction between adaptive locks and spin locks has to do with
interrupts, the dispatcher, and context switching. The kernel dispatcher is the
code that selects threads for scheduling and does context switches. It runs at
an elevated Priority Interrupt Level (PIL) to block interrupts (the dispatcher
runs at priority level 11 on SPARC systems). High-level interrupts (interrupt
levels 11-15 on SPARC systems) can interrupt the dispatcher. High-level
interrupt handlers are not allowed to do anything that could require a context
switch or to enter the dispatcher (we discuss this further in Section 3).
Adaptive locks can block, and blocking means context switching, so only spin
locks can be used in high-level interrupt handlers. Also, spin locks can raise
the interrupt level of the processor when the lock is acquired.
struct kernel_data {
        kmutex_t        klock;
        char            *forw_ptr;
        char            *back_ptr;
        uint64_t        data1;
        uint64_t        data2;
} kdata;

void function()
{
        .
        mutex_init(&kdata.klock);
        .
        mutex_enter(&kdata.klock);
        kdata.data1 = 1;
        mutex_exit(&kdata.klock);
}
The preceding block of pseudo-code illustrates the general mechanics of
mutex locks. A lock is declared in the code; in this case, it is embedded in the
data structure that it is designed to protect. Once declared, the lock is
initialized with the kernel mutex_init() function. Any subsequent reference to
the kdata structure requires that the klock mutex be acquired with
mutex_enter(). Once the work is done, the lock is released with mutex_exit().
The lock type, spin or adaptive, is determined in the mutex_init() code by the
kernel. Assuming an adaptive mutex in this example, any kernel threads that
make a mutex_enter() call on klock will either block or spin, depending on the
state of the kernel thread that owns klock when the mutex_enter() is called.
17.5.2. Solaris Mutex Lock Implementation
The kernel defines different data structures for the two types of mutex
locks, adaptive and spin, as shown below.
/*
 * Public interface to mutual exclusion locks. See mutex(9F) for details.
 *
 * The basic mutex type is MUTEX_ADAPTIVE, which is expected to be used
 * in almost all of the kernel. MUTEX_SPIN provides interrupt blocking
 * and must be used in interrupt handlers above LOCK_LEVEL. The iblock
 * cookie argument to mutex_init() encodes the interrupt level to block.
 * The iblock cookie must be NULL for adaptive locks.
 *
 * MUTEX_DEFAULT is the type usually specified (except in drivers) to
 * mutex_init(). It is identical to MUTEX_ADAPTIVE.
 *
 * MUTEX_DRIVER is always used by drivers. mutex_init() converts this to
 * either MUTEX_ADAPTIVE or MUTEX_SPIN depending on the iblock cookie.
 *
 * Mutex statistics can be gathered on the fly, without rebooting or
 * recompiling the kernel, via the lockstat driver (lockstat(7D)).
 */
typedef enum {
        MUTEX_ADAPTIVE = 0,     /* spin if owner is running, otherwise block */
        MUTEX_SPIN = 1,         /* block interrupts and spin */
        MUTEX_DRIVER = 4,       /* driver (DDI) mutex */
        MUTEX_DEFAULT = 6       /* kernel default mutex */
} kmutex_type_t;

typedef struct mutex {
#ifdef _LP64
        void    *_opaque[1];
#else
        void    *_opaque[2];
#endif
} kmutex_t;
The 64-bit mutex object is used for each type of lock, as shown in Figure
17.5.
Figure 17.5. Solaris 10 Adaptive and Spin Mutex
In Figure 17.5, the m_owner field in the adaptive lock, which holds the address
of the kernel thread that owns the lock (the kthread pointer), plays a double
role, in that it also serves as the actual lock; successful lock acquisition for a
thread means it has its kthread pointer set in the m_owner field of the target
lock. If threads attempt to get the lock while it is held (waiters), the low-
order bit (bit 0) of m_owner is set to reflect that case. Because kthread
pointer values are always word aligned, they do not require bit 0, allowing
this to work.
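The pointer-plus-flag encoding described above can be sketched in a few lines of C. The macro and helper names here are illustrative, not the kernel's actual definitions:

```c
#include <stdint.h>

/* Because kthread pointers are word aligned, bit 0 of the owner word
 * is always zero and can be borrowed to record "waiters exist". */
#define MUTEX_WAITERS 0x1UL

static uintptr_t
owner_encode(uintptr_t kthread_ptr, int waiters)
{
    return kthread_ptr | (waiters ? MUTEX_WAITERS : 0);
}

static uintptr_t
owner_decode(uintptr_t m_owner)
{
    return m_owner & ~MUTEX_WAITERS;   /* strip the waiters bit */
}

static int
has_waiters(uintptr_t m_owner)
{
    return (m_owner & MUTEX_WAITERS) != 0;
}
```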
/*
 * mutex_enter() assumes that the mutex is adaptive and tries to grab the
 * lock by doing an atomic compare and exchange on the first word of the mutex.
 * If the compare and exchange fails, it means that either (1) the lock is a
 * spin lock, or (2) the lock is adaptive but already held.
 * mutex_vector_enter() distinguishes these cases by looking at the mutex
 * type, which is encoded in the low-order bits of the owner field.
 */
typedef union mutex_impl {
        /*
         * Adaptive mutex.
         */
        struct adaptive_mutex {
                uintptr_t _m_owner;     /* 0-3/0-7 owner and waiters bit */
#ifndef _LP64
                uintptr_t _m_filler;    /* 4-7 unused */
#endif
        } m_adaptive;

        /*
         * Spin Mutex.
         */
        struct spin_mutex {
                lock_t  m_dummylock;    /* 0 dummy lock (always set) */
                lock_t  m_spinlock;     /* 1 real lock */
                ushort_t m_filler;      /* 2-3 unused */
                ushort_t m_oldspl;      /* 4-5 old pil value */
                ushort_t m_minspl;      /* 6-7 min pil val if lock held */
        } m_spin;
} mutex_impl_t;

See sys/mutex_impl.h
The spin mutex, as we pointed out earlier, is used at high interrupt levels,
where context switching is not allowed. Spin locks block interrupts while in the
spin loop, so the kernel needs to maintain the priority level the processor was
running at before entering the spin loop, which raises the processor's priority
level. (Elevating the priority level is how interrupts are blocked.) The m_minspl
field stores the priority level of the interrupt handler when the lock is
initialized, and m_oldspl is set to the priority level the processor was running
at when the lock code is called. The m_spinlock field holds the actual mutex
lock bits.
Each kernel module and subsystem implementing one or more mutex locks calls
into a common set of mutex functions. All locks must first be initialized by the
mutex_init() function, whereby the lock type is determined on the basis of an
argument passed in the mutex_init() call. The most common type passed into
mutex_init() is MUTEX_DEFAULT, which results in the init code determining
what type of lock, adaptive or spin, should be used. It is possible for a caller
of mutex_init() to be specific about a lock type (for example, MUTEX_SPIN).
If the init code is called from a device driver or any kernel module that
registers and generates interrupts, then an interrupt block cookie is added to
the argument list. An interrupt block cookie is an abstraction used by device
drivers when they set their interrupt vector and parameters. The mutex_init()
code checks the argument list for an interrupt block cookie. If mutex_init() is
being called from a device driver to initialize a mutex to be used in a high-
level interrupt handler, the lock type is set to spin. Otherwise, an adaptive
lock is initialized. The test is the interrupt level in the passed interrupt block
cookie; levels above LOCK_LEVEL (10 on SPARC systems) are considered high-level
interrupts and thus require spin locks. The init code clears most of the fields
in the mutex lock structure as appropriate for the lock type. The
m_dummylock field in spin locks is set to all 1's (0xFF). We'll see why in a
minute.
The primary mutex functions called, aside from mutex_init() (which is only
called once for each lock at initialization time), are mutex_enter() to get a
lock and mutex_exit() to release it. mutex_enter() assumes an available,
adaptive lock. If the lock is held or is a spin lock, mutex_vector_enter() is
entered to reconcile what should happen. This is a performance optimization.
mutex_enter() is implemented in assembly code, and because the entry point is
designed for the simple case (adaptive lock, not held), the amount of code
that gets executed to acquire a lock when those conditions are true is minimal.
Also, there are significantly more adaptive mutex locks than spin locks in the
kernel, making the quick test case effective most of the time. The test for a
lock held or spin lock is very fast. Here is where the m_dummylock field comes
into play: mutex_enter() executes a compare-and-swap instruction on the
first byte of the mutex, testing for a zero value. On a spin lock, the
m_dummylock field is tested because of its positioning in the data structure
and the endianness of SPARC processors. Since m_dummylock is always set (it
is set to all 1's in mutex_init()), the test will fail for spin locks. The test will
also fail for a held adaptive lock, since such a lock will have a nonzero value in
the byte field being tested. That is, the m_owner field will have a kthread
pointer value for a held, adaptive lock.
If the lock is an adaptive mutex and is not being held, the caller of
mutex_enter() gets ownership of the lock. If the two conditions are not true,
that is, either the lock is held or the lock is a spin lock, the code enters the
mutex_vector_enter() function to sort things out. The mutex_vector_enter()
code first tests the lock type. For spin locks, the m_oldspl field is set, based
on the current Priority Interrupt Level (PIL) of the processor, and the lock is
tested. If it's not being held, the lock is set (m_spinlock) and the code returns
to the caller. A held lock forces the caller into a spin loop, where a loop
counter is incremented (for statistical purposes; the lockstat(1M) data), and
the code checks whether the lock is still held in each pass through the loop.
Once the lock is released, the code breaks out of the loop, grabs the lock,
and returns to the caller.
Adaptive locks require a little more work. When the code enters the adaptive
code path (in mutex_vector_enter()), it increments the
cpu_sysinfo.mutex_adenters (adaptive lock enters) field, as is reflected in the
smtx column in mpstat(1M). mutex_vector_enter() then tests again to
determine if the lock is owned (held), since the lock may have been released in
the time interval between the call to mutex_enter() and the current point in
the mutex_vector_enter() code. If the adaptive lock is not being held,
mutex_vector_enter() attempts to acquire the lock. If successful, the code
returns.
If the lock is held, mutex_vector_enter() determines whether or not the lock
owner is running by looping through the CPU structures and testing the lock's
m_owner against the cpu_thread field of the CPU structure. (cpu_thread
contains the kernel thread address of the thread currently executing on the
CPU.) A match indicates the holder is running, which means the adaptive lock
will spin. No match means the owner is not running, in which case the caller
must block. In the blocking case, the kernel turnstile code is entered to locate
or acquire a turnstile, in preparation for placement of the kernel thread on a
sleep queue associated with the turnstile.
The turnstile placement happens in two phases. After mutex_vector_enter()
determines that the lock holder is not running, it makes a turnstile call to look
up the turnstile, sets the waiters bit in the lock, and retests to see if the
owner is running. If yes, the code releases the turnstile and enters the
adaptive lock spin loop, which attempts to acquire the lock. Otherwise, the
code places the kernel thread on a turnstile (sleep queue) and changes the
thread's state to sleep. That effectively concludes the sequence of events in
mutex_vector_enter().
Dropping out of mutex_vector_enter(), either the caller ended up with the
lock it was attempting to acquire or the calling thread is on a turnstile sleep
queue associated with the lock. In either case, the lockstat(1M) data is
updated, reflecting the lock type, spin time, or sleep time as the last bit of
work done in mutex_vector_enter().

lockstat(1M) is a kernel lock statistics command that was introduced in Solaris
2.6. It provides detailed information on kernel mutex and reader/writer locks.

The algorithm described in the previous paragraphs is summarized in
pseudocode below.
mutex_vector_enter()
        if (lock is a spin lock)
                lock_set_spl() /* enter spin-lock specific code path */
        increment cpu_sysinfo.mutex_adenters
spin_loop:
        if (lock is not owned)
                mutex_trylock() /* try to acquire the lock */
                if (lock acquired)
                        goto bottom
                else
                        continue /* lock being held */
        if (lock owner is running on a processor)
                goto spin_loop
        else
                lookup turnstile for the lock
                set waiters bit
                if (lock owner is running on a processor)
                        drop turnstile
                        goto spin_loop
                else
                        block /* on the sleep queue associated with the turnstile */
bottom:
        update lockstat statistics
When a thread has finished working in a lock-protected data area, it calls the
mutex_exit() code to release the lock. The entry point is implemented in
assembly language and handles the simple case of freeing an adaptive lock with
no waiters. With no threads waiting for the lock, it's a simple matter of
clearing the lock fields (m_owner) and returning. The C language function
mutex_vector_exit() is entered from mutex_exit() for anything but the simple
case.

In the case of a spin lock, the lock field is cleared and the processor is
returned to the PIL level it was running at before entering the lock code. For
adaptive locks, a waiter must be selected from the turnstile (if there is more
than one waiter), have its state changed from sleeping to runnable, and be
placed on a dispatch queue so it can execute and get the lock. If the thread
releasing the lock was the beneficiary of priority inheritance, meaning that it
had its priority improved when a calling thread with a better priority was not
able to get the lock, then the thread releasing the lock will have its priority
reset to what it was before the inheritance. Priority inheritance is discussed
in Section 17.7.
When an adaptive lock is released, the code clears the waiters bit in m_owner
and calls the turnstile function to wake up all the waiters. Readers familiar
with sleep/wakeup mechanisms of operating systems have likely heard of a
particular behavior known as the "thundering herd problem," a situation in
which many threads that have been blocking for the same resource are all
woken up at the same time and make a mad dash for the resource (a mutex in
this case), like a herd of large, four-legged beasts running toward the same
object. System behavior tends to go from a relatively small run queue to a
large run queue (all the threads have been woken up and made runnable) and
high CPU utilization until a thread gets the resource, at which point a bunch of
threads are sleeping again, the run queue normalizes, and CPU utilization
flattens out. This is a generic behavior that can occur on any operating
system.
The wakeup mechanism used when mutex_vector_exit() is called may seem like
an open invitation to thundering herds, but in practice it turns out not to be a
problem. The main reason is that the blocking case for threads waiting for a
mutex is rare; most of the time the threads will spin. If a blocking situation
does arise, it typically does not reach a point where very many threads are
blocked on the mutex; one of the characteristics of the thundering herd problem
is resource contention resulting in a lot of sleeping threads. The kernel code
segments that implement mutex locks are, by design, short and fast, so locks
are not held for long. Code that requires longer lock-hold times uses a
reader/writer write lock, which provides mutual exclusion semantics with a
selective wakeup algorithm. There are, of course, other reasons for choosing
reader/writer locks over mutex locks, the most obvious being to allow multiple
readers to see the protected data.
17.6. Reader/Writer Locks
Reader/writer (RW) locks provide mutual exclusion semantics on write locks.
Only one thread at a time is allowed to own the write lock, but there is
concurrent access for readers. These locks are designed for scenarios in
which it is acceptable to have multiple threads reading the data at the same
time, but only one writer. While a writer is holding the lock, no readers are
allowed. Also, because of the wakeup mechanism, a writer lock is a better
solution for kernel code segments that require relatively long hold times, as we
will see shortly.
The basic mechanics of RW locks are similar to mutexes, in that RW locks
have an initialization function (rw_init()), an entry function to acquire the lock
(rw_enter()), and an exit function to release the lock (rw_exit()). The entry
and exit points are optimized in assembly code to deal with the simple cases,
and they call into C language functions if anything beyond the simplest case
must be dealt with. As with mutex locks, the simple case is that the requested
lock is available on an entry (acquire) call and no threads are waiting for the
lock on the exit (release) call.
17.6.1. Solaris Reader/Writer Locks
Reader/writer locks are implemented as a single-word data structure in the
kernel, either 32 bits or 64 bits wide, depending on the data model of the
running kernel, as depicted in Figure 17.6.

Figure 17.6. Reader/Writer Lock
typedef struct rwlock_impl {
        uintptr_t       rw_wwwh;        /* waiters, write wanted, hold count */
} rwlock_impl_t;
#endif  /* _ASM */

#define RW_HAS_WAITERS          1
#define RW_WRITE_WANTED         2
#define RW_WRITE_LOCKED         4
#define RW_READ_LOCK            8
#define RW_WRITE_LOCK(thread)   ((uintptr_t)(thread) | RW_WRITE_LOCKED)
#define RW_HOLD_COUNT           (-RW_READ_LOCK)
#define RW_HOLD_COUNT_SHIFT     3       /* log2(RW_READ_LOCK) */
#define RW_READ_COUNT           RW_HOLD_COUNT
#define RW_OWNER                RW_HOLD_COUNT
#define RW_LOCKED               RW_HOLD_COUNT
#define RW_WRITE_CLAIMED        (RW_WRITE_LOCKED | RW_WRITE_WANTED)
#define RW_DOUBLE_LOCK          (RW_WRITE_LOCK(0) | RW_READ_LOCK)

See sys/rwlock.h
There are two states for the reader/writer lock, depending on whether the
lock is held by a writer, as indicated by bit 2, wrlock. Bit 2 is the actual
write lock, and it determines the meaning of the high-order bits. If the
write lock is held (bit 2 set), then the upper bits contain a pointer to the
kernel thread holding the write lock. If bit 2 is clear, then the upper bits
contain a count of the number of threads holding the lock as a read lock.

The Solaris 10 RW lock defines bit 0, the wait bit, set to signify that threads
are waiting for the lock. The wrwant bit (write wanted, bit 1) indicates that
at least one thread is waiting for a write lock. The simple cases for lock
acquisition through rw_enter() are the circumstances listed below:

The write lock is wanted and is available.

The read lock is wanted, the write lock is not held, and no threads are
waiting for the write lock (wrwant is clear).
The acquisition of the write lock results in bit 2 getting set and the kernel
thread pointer getting loaded in the upper bits. For a reader, the hold count
(upper bits) is incremented. Conditions where the write lock is being held,
causing a lock request to fail, or where a thread is waiting for a write lock,
causing a read lock request to fail, result in a call to the rw_enter_sleep()
function.
Important to note is that the rw_enter() code sets a flag in the kernel thread
used by the dispatcher code when establishing a kernel thread's priority
before preemption or changing state to sleep. We cover this in more detail in
the paragraph beginning "It is in the dispatcher queue insertion code" on page
262. Briefly, the kernel thread structure contains a t_kpri_req (kernel priority
request) field that is checked in the dispatcher code when a thread is about
to be preempted (forced off the processor on which it is executing because a
higher-priority thread becomes runnable) or when the thread is about to have
its state changed to sleep. If the t_kpri_req flag is set, the dispatcher
assigns a kernel priority to the thread, such that when the thread resumes
execution, it will run before threads in scheduling classes of lower priority
(timeshare and interactive class threads). More succinctly, the priority of a
thread holding a write lock is set to a better priority to minimize the hold time
of the lock.
Getting back to the rw_enter() flow: If the code falls through the simple
case, we need to set up the kernel thread requesting the RW lock to block.

1. rw_enter_sleep() establishes whether the calling thread is requesting a
read or write lock and does another test to see if the lock is available.
If it is, the caller gets the lock, the lockstat(1M) statistics are updated,
and the code returns. If the lock is not available, then the turnstile code
is called to look up a turnstile in preparation for putting the calling
thread to sleep.

2. With a turnstile now available, another test is made on the lock
availability. (On today's fast processors, and especially multiprocessor
systems, it's quite possible that the thread holding the lock finished what
it was doing and the lock became available.) Assuming the lock is still
held, the thread is set to a sleep state and placed on a turnstile.

3. The RW lock structure will have the wait bit set for a reader waiting
(forced to block because a writer has the lock) or the wrwant bit set
to signify that a thread wanting the write lock is blocking.

4. The cpu_sysinfo structure for the processor maintains two counters for
failures to get a read lock or write lock on the first pass: rw_rdfails
and rw_wrfails. The appropriate counter is incremented just prior to the
turnstile call; this action places the thread on a turnstile sleep queue.
The mpstat(1M) command sums the counters and displays the fails-per-
second in the srw column of its output.
The acquisition of a RW lock and subsequent behavior if the lock is held are
straightforward and similar in many ways to what happens in the mutex case.
Things get interesting when a thread calls rw_exit() to release a lock it is
holding; there are several potential solutions to the problem of determining
which thread gets the lock next. For mutexes, a wakeup is issued on all
threads that are sleeping, waiting for the mutex, and we know from empirical
data that this solution works well for reasons previously discussed. With RW
locks, we're dealing with potentially longer hold times, which could result in more
sleepers, a desire to give writers priority over readers (it's typically best to
not have a reader read data that's about to be changed by a pending writer),
and the potential for the priority inversion problem described in Section 17.7.
For rw_exit(), which is called by the lock holder when it is ready to release
the lock, the simple case is that there are no waiters. In this case, the wrlock
bit is cleared if the holder was a writer, or the hold count field is
decremented to reflect one less reader. The more complex case of the system
having waiters when the lock is released is dealt with in the following manner:

1. The kernel does a direct transfer of ownership of the lock to one or
more of the threads waiting for the lock when the lock is released,
either to the next writer or to a group of readers if more than one
reader is blocking and no writers are blocking.

This situation is very different from the case of the mutex
implementation, for which the wakeup is issued and a thread must obtain
lock ownership in the usual fashion. Here, a thread or threads wake up
owning the lock they were blocking on.

The algorithm used to figure out who gets the lock next addresses
several requirements that provide for generally balanced system
performance. The kernel needs to minimize the possibility of starvation (a
thread never getting the resource it needs to continue executing) while
allowing writers to take precedence whenever possible.

2. rw_exit_wakeup() retests for the simple case and drops the lock if
there are no waiters (clear wrlock or decrement the hold count).

3. When waiters are present, the code grabs the turnstile (sleep queue)
associated with the lock and saves the pointer to the kernel thread of
the next write waiter that was on the turnstile's sleep queue (if one
exists).

The turnstile sleep queues are organized as a FIFO (first in, first out)
queue, so the queue management (turnstile code) makes sure that the
thread that was waiting the longest (the first in) is the thread that is
selected as the next writer (first out). Thus, part of the fairness policy
we want to enforce is covered.
The remaining bits of the algorithm go as follows:

4. If a writer is releasing the write lock and there are waiting readers and
writers, readers of the same or higher priority than the highest-priority
blocked writer are granted the read lock.

5. The readers are handed ownership, and then woken up by the
turnstile_wakeup() kernel function.

These readers also inherit the priority of the writer that released the
lock if the reader thread is of a lower priority (inheritance is done on a
per-reader thread basis when more than one thread is being woken up).
Lock ownership handoff is a relatively simple operation. For read locks,
there is no notion of a lock owner, so it's a matter of setting the hold
count in the lock to reflect the number of readers coming off the
turnstile, then issuing the wakeup of each reader.

6. An exiting reader always grants the lock to a waiting writer, even if
there are higher-priority readers blocked.

7. It is possible for a reader freeing the lock to have waiting readers,
although it may not be intuitive, given the multiple reader design of the
lock. If a reader is holding the lock and a writer comes along, the
wrwant bit is set to signify that a writer is waiting for the lock. With
wrwant set, subsequent readers cannot get the lock; we want the holding
readers to finish so the writer can get the lock. Therefore, it is
possible for a reader to execute rw_exit_wakeup() with waiting writers
and readers.

The "let's favor writers but be fair to readers" policy described above was
first implemented in Solaris 2.6.
17.7. Turnstiles and Priority Inheritance
A turnstile is a data abstraction that encapsulates sleep queues and priority
inheritance information associated with mutex locks and reader/writer locks.
The mutex and RW lock code use a turnstile when a kernel thread needs to
block on a requested lock. The sleep queues implemented for other resource
waits do not provide an elegant method of dealing with the priority inversion
problem through priority inheritance. Turnstiles were created to address that
problem.

Priority inversion describes a scenario in which a higher-priority thread is
unable to run because a lower-priority thread is holding a resource it needs,
such as a lock. The Solaris kernel addresses the priority inversion problem in
its turnstile implementation, providing a priority inheritance mechanism, where
the higher-priority thread can will its priority to the lower-priority thread
holding the resource it requires. The beneficiary of the inheritance, the thread
holding the resource, will now have a higher scheduling priority and thus get
scheduled to run sooner so it can finish its work and release the resource, at
which point the original priority is returned to the thread.
In this section, we assume you have some level of knowledge of kernel thread
priorities, which are covered in Section 3.7. Because turnstiles and priority
inheritance are an integral part of the implementation of mutex and RW locks,
we thought it best to discuss them here rather than later. For this discussion,
it is important to be aware of these points:

The Solaris kernel assigns a global priority to kernel threads, based on
the scheduling class they belong to.

Kernel threads in the timeshare and interactive scheduling classes will
have their priorities adjusted over time, based on three things: the
amount of time the threads spend running on a processor, sleep time
(blocking), and the case when they are preempted. Threads in the real-
time class are fixed priority; the priorities are never changed regardless
of runtime or sleep time unless explicitly changed through programming
interfaces or commands.

The Solaris kernel implements sleep queues for the placement of kernel threads
blocking on (waiting for) a resource or event. For most resource waits, such
as those for a disk or network I/O, sleep queues, in conjunction with condition
variables, manage the systemwide queue of sleeping threads. These sleep
queues are covered in Section 3.10. This set of sleep queues is separate and
distinct from turnstile sleep queues.
17.7.1. Turnstiles Implementation
Figure 17.7 illustrates the Solaris 10 turnstiles. Turnstiles are maintained in a
systemwide hash table, turnstile_table[].

Figure 17.7. Turnstiles
typedef struct turnstile_chain {
        turnstile_t     *tc_first;      /* first turnstile on hash chain */
        disp_lock_t     tc_lock;        /* lock for this hash chain */
} turnstile_chain_t;

turnstile_chain_t       turnstile_table[2 * TURNSTILE_HASH_SIZE];

#define TS_NUM_Q 2      /* number of sleep queues per turnstile */

typedef struct turnstile turnstile_t;
struct _sobj_ops;

struct turnstile {
        turnstile_t     *ts_next;       /* next on hash chain */
        turnstile_t     *ts_free;       /* next on freelist */
        void            *ts_sobj;       /* s-object threads are blocking on */
        int             ts_waiters;     /* number of blocked threads */
        pri_t           ts_epri;        /* max priority of blocked threads */
        struct _kthread *ts_inheritor;  /* thread inheriting priority */
        turnstile_t     *ts_prioinv;    /* next in inheritor's t_prioinv list */
        sleepq_t        ts_sleepq[TS_NUM_Q];    /* read/write sleep queues */
};
The lock now has a turnstile, so subsequent threads that block on the same
lock will donate their turnstiles to the free list on the chain (the ts_free link
off the active turnstile).
In turnstile_block(), the pointers are set up as determined by the return from
turnstile_lookup(). If the turnstile pointer is null, we link up to the turnstile
pointed to by the kernel thread's t_ts pointer. If the pointer returned from
the lookup is not null, there's already at least one kthread waiting on the lock,
so the code sets up the pointer links appropriately and places the kthread's
turnstile on the free list.

The thread is then put into a sleep state through the scheduling-class-
specific sleep routine (for example, ts_sleep()). The ts_waiters field in the
turnstile is incremented, the thread's t_wchan is set to the address of the
lock, and t_sobj_ops in the thread is set to the address of the lock's
operations vectors: the owner, unsleep, and change_priority functions. The
kernel sleepq_insert() function actually places the thread on the sleep queue
associated with the turnstile.
The code does the priority inversion check (now called out of the
turnstile_block() code), builds the priority inversion links, and applies the
necessary priority changes. The priority inheritance rules apply; that is, if the
priority of the lock holder is less (worse) than the priority of the requesting
thread, the requesting thread's priority is "willed" to the holder. The holder's
t_epri field is set to the new priority, and the inheritor pointer in the turnstile
is linked to the kernel thread. All the threads on the blocking chain are
potential inheritors, based on their priority relative to the calling thread.

At this point, the dispatcher is entered through a call to swtch(), and another
kernel thread is removed from a dispatch queue and context-switched onto a
processor.
The wakeup mechanics are initiated as previously described, where a call to
the lock exit routine results in a turnstile_wakeup() call if threads are blocking
on the lock. turnstile_wakeup() does essentially the reverse of
turnstile_block(); threads that inherited a better priority have that priority
waived, and the thread is removed from the sleep queue and given a turnstile
from the chain's free list. Recall that a thread donated its turnstile to the
free list if it was not the first thread placed on the blocking chain for the
lock; coming off the turnstile, threads get a turnstile back. Once the thread is
unlinked from the sleep queue, the scheduling class wakeup code is entered,
and the thread is put back on a processor's dispatch queue.
17.8. Kernel Semaphores
Semaphores provide a method of synchronizing access to a sharable resource
by multiple processes or threads. A semaphore can be used as a binary lock
for exclusive access or as a counter, allowing for concurrent access by
multiple threads to a finite number of shared resources.

In the counter implementation, the semaphore value is initialized to the number
of shared resources (these semaphores are sometimes referred to as counting
semaphores). Each time a process needs a resource, the semaphore value is
decremented to indicate there is one less of the resource. When the process
is finished with the resource, the semaphore value is incremented. A 0
semaphore value tells the calling process that no resources are currently
available, and the calling process blocks until another process finishes using the
resource and frees it. These functions are historically referred to as
semaphore P and V operations: the P operation attempts to acquire the
semaphore, and the V operation releases it.
The Solaris kernel uses semaphores where appropriate, when the constraints
for atomicity on lock acquisition are not as stringent as they are in the areas
where mutex and RW locks are used. Also, the counting functionality that
semaphores provide makes them a good fit for things like the allocation and
deallocation of a fixed amount of a resource.

The kernel semaphore structure maintains a sleep queue for the semaphore and
a count field that reflects the value of the semaphore, shown in Figure 17.8.
The figure illustrates the look of a kernel semaphore for all Solaris releases
covered in this book.

Figure 17.8. Kernel Semaphore
Kernel functions for semaphores include an initialization routine (sema_init()), a
destroy function (sema_destroy()), the traditional P and V operations
(sema_p() and sema_v()), and a test function (test for semaphore held,
sema_held()). There are a few other support functions, as well as some
variations on the sema_p() function, which we discuss later.

The init function simply sets the count value in the semaphore, based on the
value passed as an argument to the sema_init() routine. The s_slpq pointer is
set to NULL, and the semaphore is initialized. The sema_destroy() function is
used when the semaphore is an integral part of a resource that is dynamically
created and destroyed as the resource gets used and subsequently released.
For example, the bio (block I/O) subsystem in the kernel, which manages buf
structures for page I/O support through the file system, uses semaphores on a
per-buf structure basis. Each buffer has two semaphores, which are initialized
when a buffer is allocated by sema_init(). Once the I/O is completed and the
buffer is released, sema_destroy() is called as part of the buffer release
code. (sema_destroy() just nulls the s_slpq pointer.)
Kernel threads that must access a resource controlled by a semaphore call the
sema_p() function, which requires that the semaphore count value be greater
than 0 in order to return success. If the count is 0, then the semaphore is not
available and the calling thread must block. If the count is greater than 0, then
the count is decremented in the semaphore and the code returns to the caller.
Otherwise, a sleep queue is located from the systemwide array of sleep
queues, the thread state is changed to sleep, and the thread is placed on the
sleep queue. Note that turnstiles are not used for semaphores; turnstiles are an
implementation of sleep queues specifically for mutex and RW locks. Kernel
threads blocked on anything other than mutexes and RW locks are placed on
sleep queues.
Sleep queues are discussed in more detail in Section 3.10. Briefly though, sleep
queues are organized as a linked list of kernel threads, and each linked list is
rooted in an array referenced through a sleepq_head kernel pointer. Figure
17.9 illustrates how sleep queues are organized.

Figure 17.9. Sleep Queues
A hashing function indexes the sleepq_head array, hashing on the address of
the object. A singly linked list that establishes the beginning of the doubly
linked sublists of kthreads at the same priority is in ascending order based on
priority. The sublist is implemented with a t_priforw (forward pointer) and
t_priback (previous pointer) in the kernel thread. Also, a t_sleepq pointer
points back to the array entry in sleepq_head, identifying which sleep queue
the thread is on and providing a quick method to determine if a thread is on a
sleep queue at all; if the thread's t_sleepq pointer is NULL, then the thread is
not on a sleep queue.
Inside the sema_p() function, if we have a semaphore count value of 0, the
semaphore is not available and the calling kernel thread needs to be placed on
a sleep queue. A sleep queue is located through a hash function into the
sleepq_head array, which hashes on the address of the object the thread is
blocking on, in this case, the address of the semaphore. The code also grabs the
sleep queue lock, sq_lock (see Figure 17.6), to block any further inserts or
removals from the sleep queue until the insertion of the current kernel thread
has been completed (that's what locks are for).
The scheduling-class-specific sleep function is called to set the thread
wakeup priority and to change the thread state from ONPROC (running on a
processor) to SLEEP. The kernel thread's t_wchan (wait channel) pointer is
set to the address of the semaphore it's blocking on, and the thread's
t_sobj_ops pointer is set to reference the sema_sobj_ops structure. The
thread is now in a sleep state on a sleep queue.
A semaphore is released by the sema_v() function, which has the exact
opposite effect of sema_p() and behaves very much like the lock release
functions we've examined up to this point. The semaphore value is incremented,
and if any threads are sleeping on the semaphore, the one that has been
sitting on the sleep queue longest will be woken up. Semaphore wakeups always
involve waking one waiter at a time.
Semaphores are used in relatively few areas of the operating system: the
buffer I/O (bio) module, the dynamically loadable kernel module code, and a
couple of device drivers.
17.6. DTrace Lockstat Provider
The lockstat provider makes available probes that can be used to discern lock
contention statistics or to understand virtually any aspect of locking behavior.
The lockstat(1M) command is actually a DTrace consumer that uses the
lockstat provider to gather its raw data.
17.6.1. Overview
The lockstat provider makes available two kinds of probes: contention-event
probes and hold-event probes.
Contention-event probes correspond to contention on a synchronization
primitive; they fire when a thread is forced to wait for a resource to become
available. Solaris is generally optimized for the noncontention case, so
prolonged contention is not expected. These probes should be used to
understand those cases where contention does arise. Because contention is
relatively rare, enabling contention-event probes generally doesn't substantially
affect performance.
Hold-event probes correspond to acquiring, releasing, or otherwise
manipulating a synchronization primitive. These probes can be used to answer
arbitrary questions about the way synchronization primitives are manipulated.
Because Solaris acquires and releases synchronization primitives very often (on
the order of millions of times per second per CPU on a busy system), enabling
hold-event probes has a much higher probe effect than does enabling
contention-event probes. While the probe effect induced by enabling them can
be substantial, it is not pathological; they may still be enabled with confidence
on production systems.
The lockstat provider makes available probes that correspond to the different
synchronization primitives in Solaris; these primitives and the probes that
correspond to them are discussed in the remainder of this chapter.
17.6.2. Adaptive Lock Probes
Adaptive locks enforce mutual exclusion to a critical section and can be
acquired in most contexts in the kernel. Because adaptive locks have few
context restrictions, they comprise the vast majority of synchronization
primitives in the Solaris kernel. These locks are adaptive in their behavior with
respect to contention. When a thread attempts to acquire a held adaptive
lock, it will determine if the owning thread is currently running on a CPU. If the
owner is running on another CPU, the acquiring thread will spin. If the owner is
not running, the acquiring thread will block.
The four lockstat probes pertaining to adaptive locks are in Table 17.2. For
each probe, arg0 contains a pointer to the kmutex_t structure that
represents the adaptive lock.
Table 17.2. Adaptive Lock Probes

Probe Name        Description

adaptive-acquire  Hold-event probe that fires immediately after an adaptive
                  lock is acquired.

adaptive-block    Contention-event probe that fires after a thread that has
                  blocked on a held adaptive mutex has reawakened and has
                  acquired the mutex. If both probes are enabled,
                  adaptive-block fires before adaptive-acquire. At most one
                  of adaptive-block and adaptive-spin fire for a single lock
                  acquisition. arg1 for adaptive-block contains the sleep
                  time in nanoseconds.

adaptive-spin     Contention-event probe that fires after a thread that has
                  spun on a held adaptive mutex has successfully acquired the
                  mutex. If both are enabled, adaptive-spin fires before
                  adaptive-acquire. At most one of adaptive-spin and
                  adaptive-block fire for a single lock acquisition. arg1 for
                  adaptive-spin contains the spin count: the number of
                  iterations that were taken through the spin loop before the
                  lock was acquired. The spin count has little meaning on its
                  own but can be used to compare spin times.

adaptive-release  Hold-event probe that fires immediately after an adaptive
                  lock is released.
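For example, the sleep time reported in arg1 by adaptive-block can be summed per kernel stack to see where threads lose time to mutex contention. The following D sketch is illustrative (it is not from the book, and the stack depth of 5 is an arbitrary choice):

```d
lockstat:::adaptive-block
{
        @[stack(5)] = sum(arg1);   /* total sleep time, in nanoseconds */
}
```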
17.6.3. Spin Lock Probes

Threads cannot block in some contexts in the kernel, such as high-level
interrupt context and any context manipulating dispatcher state. In these
contexts, this restriction prevents the use of adaptive locks. Spin locks are
instead used to effect mutual exclusion to critical sections in these contexts.
As the name implies, the behavior of these locks in the presence of contention
is to spin until the lock is released by the owning thread. The three probes
pertaining to spin locks are in Table 17.3.
Table 17.3. Spin Lock Probes

Probe Name    Description

spin-acquire  Hold-event probe that fires immediately after a spin lock is
              acquired.

spin-spin     Contention-event probe that fires after a thread that has spun
              on a held spin lock has successfully acquired the spin lock. If
              both are enabled, spin-spin fires before spin-acquire. arg1 for
              spin-spin contains the spin count: the number of iterations
              that were taken through the spin loop before the lock was
              acquired. The spin count has little meaning on its own but can
              be used to compare spin times.

spin-release  Hold-event probe that fires immediately after a spin lock is
              released.
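Spin counts from spin-spin lend themselves to the same kind of aggregation; as an illustrative sketch (not from the book), the distribution of spin counts can be quantized per lock, keyed by the lock's address in arg0:

```d
lockstat:::spin-spin
{
        @spins[arg0] = quantize(arg1);   /* distribution of spin counts */
}
```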
Adaptive locks are much more common than spin locks. The following script
displays totals for both lock types to provide data to support this
observation.
lockstat:::adaptive-acquire
/execname == "date"/
{
        @locks["adaptive"] = count();
}

lockstat:::spin-acquire
/execname == "date"/
{
        @locks["spin"] = count();
}
Run this script in one window, and a date(1) command in another. When you
terminate the DTrace script, you will see output similar to the following
example.
# dtrace -s ./whatlock.d
dtrace: script './whatlock.d' matched 5 probes
^C

  spin                                                             28
  adaptive                                                       2981
As this output indicates, over 99 percent of the locks acquired in running the
date command are adaptive locks. It may be surprising that so many locks are
acquired in doing something as simple as a date. The large number of locks is a
natural artifact of the fine-grained locking required of an extremely scalable
system like the Solaris kernel.
17.6.4. Thread Locks
A thread lock is a special kind of spin lock that locks a thread for purposes
of changing thread state. Thread lock hold events are available as spin lock
hold-event probes (that is, spin-acquire and spin-release), but contention
events have their own probe specific to thread locks. The thread lock
contention-event probe is described in Table 17.4.
Table 17.4. Thread Lock Probes

Probe Name   Description

thread-spin  Contention-event probe that fires after a thread has spun on a
             thread lock. Like other contention-event probes, if both the
             contention-event probe and the hold-event probe are enabled,
             thread-spin fires before spin-acquire. Unlike other
             contention-event probes, however, thread-spin fires before the
             lock is actually acquired. As a result, multiple thread-spin
             probe firings may correspond to a single spin-acquire probe
             firing.
17.6.5. Readers/Writer Lock Probes

Readers/writer locks enforce a policy of allowing multiple readers or a single
writer, but not both, to be in a critical section. These locks are typically used
for structures that are searched more frequently than they are modified and
for which there is substantial time in the critical section. If critical section
times are short, readers/writer locks will implicitly serialize over the shared
memory used to implement the lock, giving them no advantage over adaptive
locks. See rwlock(9F) for more details on readers/writer locks.
The probes pertaining to readers/writer locks are in Table 17.5. For each
probe, arg0 contains a pointer to the krwlock_t structure that represents the
readers/writer lock.
Table 17.5. Readers/Writer Lock Probes

Probe Name    Description

rw-acquire    Hold-event probe that fires immediately after a readers/writer
              lock is acquired. arg1 contains the constant RW_READER if the
              lock was acquired as a reader, and RW_WRITER if the lock was
              acquired as a writer.

rw-block      Contention-event probe that fires after a thread that has
              blocked on a held readers/writer lock has reawakened and has
              acquired the lock. arg1 contains the length of time (in
              nanoseconds) that the current thread had to sleep to acquire
              the lock. arg2 contains the constant RW_READER if the lock was
              acquired as a reader, and RW_WRITER if the lock was acquired
              as a writer. arg3 and arg4 contain more information on the
              reason for blocking. arg3 is nonzero if and only if the lock
              was held as a writer when the current thread blocked. arg4
              contains the readers count when the current thread blocked. If
              both the rw-block and rw-acquire probes are enabled, rw-block
              fires before rw-acquire.

rw-upgrade    Hold-event probe that fires after a thread has successfully
              upgraded a readers/writer lock from a reader to a writer.
              Upgrades do not have an associated contention event because
              they are only possible through a nonblocking interface,
              rw_tryupgrade(9F).

rw-downgrade  Hold-event probe that fires after a thread has downgraded its
              ownership of a readers/writer lock from writer to reader.
              Downgrades do not have an associated contention event because
              they always succeed without contention.

rw-release    Hold-event probe that fires immediately after a readers/writer
              lock is released. arg1 contains the constant RW_READER if the
              released lock was held as a reader, and RW_WRITER if the
              released lock was held as a writer. Due to upgrades and
              downgrades, the lock may not have been released as it was
              acquired.
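The rw-block arguments lend themselves to a quick look at why threads block: arg3 distinguishes a writer-held lock from one held by readers, and arg1 carries the sleep time. An illustrative D sketch (not from the book):

```d
lockstat:::rw-block
{
        /* total nanoseconds slept, split by the reason for blocking */
        @[arg3 ? "lock held by a writer" : "lock held by readers"] = sum(arg1);
}
```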