8/12/2019 Solaris Internals(Ch17 Locking)
17.2. Parallel Systems Architectures
Multiprocessor (MP) systems from Sun (SPARC-processor-based), as well as
several x86/x64-based MP platforms, are implemented as symmetric
multiprocessor (SMP) systems. Symmetric multiprocessor describes a system in
which a peer-to-peer relationship exists among all the processors (CPUs) on
the system. A master processor, defined as the only CPU on the system that
can execute operating system code and field interrupts, does not exist. All
processors are equal. The SMP acronym can also be extended to mean Shared
Memory Multiprocessor, which defines an architecture in which all the
processors in the system share a uniform view of the system's physical
address space and the operating system's virtual address space. That is, all
processors share a single image of the operating system kernel. Sun's
multiprocessor systems meet the criteria for both definitions.
Alternative MP architectures alter the kernel's view of addressable memory in
different ways. Massively parallel processor (MPP) systems are built on nodes
that contain a relatively small number of processors, some local memory, and
I/O. Each node contains its own copy of the operating system; thus, each node
addresses its own physical and virtual address space. The address space of
one node is not visible to the other nodes on the system. The nodes are
connected by a high-speed, low-latency interconnect, and node-to-node
communication is done through an optimized message passing interface. MPP
architectures require a new programming model to achieve parallelism across
nodes.
The shared memory model does not work since the system's total address
space is not visible across nodes, so memory pages cannot be shared by
threads running on different nodes. Thus, an API that provides an interface
into the message passing path in the kernel must be used by code that needs
to scale across the various nodes in the system.
Other issues arise from the nonuniform nature of the architecture with
respect to I/O processing since the I/O controllers on each node are not easily
made visible to all the nodes on the system. Some MPP platforms attempt to
provide the illusion of a uniform I/O space across all the nodes by using kernel
software, but the nonuniformity of the access times to nonlocal I/O devices
still exists.
NUMA and ccNUMA (nonuniform memory access and cache coherent NUMA)
architectures attempt to address the programming model issue inherent in MPP
systems. From a hardware architecture point of view, NUMA systems resemble
MPPs: small nodes with few processors, a node-to-node interconnect, local
memory, and I/O on each node. Note: It is not required that NUMA/ccNUMA or
MPP systems implement small nodes (nodes with four or fewer processors).
Many implementations are built that way, but there is no architectural
restriction on the node size.
On NUMA/ccNUMA systems, the operating system software provides a single
system image, where each node has a view of the entire system's memory
address space. In this way, the shared memory model is preserved. However,
the nonuniform nature of the speed of memory access (latency) is a factor in
the performance and potential scalability of the platform. When a thread
executing on a processor node on a NUMA or ccNUMA system incurs a page
fault (references an unmapped memory address), the latency involved in
resolving the page fault varies according to whether the physical memory page
is on the same node as the executing thread or on a node somewhere across
the interconnect. The latency variance can be substantial. As the level of
memory page sharing increases across threads executing on different nodes, a
potentially higher volume of page faults needs to be resolved from a nonlocal
memory segment. This problem adversely affects performance and scalability.
The three different parallel architectures can be summarized as follows:

SMP. Symmetric multiprocessor with a shared memory model; single kernel image.

MPP. Message-based model; multiple kernel images.

NUMA/ccNUMA. Shared memory model; single kernel image.
Figure 17.1 illustrates the different architectures.
Figure 17.1. Parallel Systems Architectures
The challenge in building an operating system that provides scalable
performance when multiple processors are sharing a single image of the kernel
and when every processor can run kernel code, handle interrupts, etc., is to
synchronize access to critical data and state information. Scalable
performance, or scalability, generally refers to accomplishment of an increasing
amount of work as more hardware resources are added to the system. If
more processors are added to a multiprocessor system, an incremental
increase in work is expected, assuming sufficient resources in other areas of
the system (memory, I/O, network).
To achieve scalable performance, the system must be able to concurrently
support multiple processors executing operating system code. Whether that
execution is in device drivers, interrupt handlers, the threads dispatcher, file
system code, virtual memory code, etc., is, to a degree, load dependent.
Concurrency is key to scalability.
The preceding discussion on parallel architectures only scratched the surface
of a very complex topic. Entire texts discuss parallel architectures exclusively;
you should refer to them for additional information. See, for example, [1].
being 1's (lock value 0xFF). A lock that is available (not being held) is the
same byte with all 0's (lock value 0x00). This explanation may seem quite
rudimentary, but is crucial to understanding the text that follows.
Most modern processors shipping today provide some form of byte-level
test-and-set instruction that is guaranteed to be atomic in nature. The
instruction sequence is often described as read-modify-write; that is, the
referenced memory location (the memory address of the lock) is read,
modified, and written back in one atomic operation. In RISC processors (such
as the UltraSPARC T1 processor), reads are load operations and writes are
store operations. An atomic operation is required for consistency. An
instruction that has atomic properties means that no other store operation is
allowed between the load and store of the executing instruction. Mutex and
RW lock operations must be atomic, such that when the instruction execution
to get the lock is complete, we either have the lock or have the information
we need to determine that the lock is already being held.
Consider what could happen without an instruction that has atomic properties.
A thread executing on one processor could issue a load (read) of the lock, and
while it is doing a test operation to determine if the lock is held or not,
another thread executing on another processor issues a lock call to get the
same lock at the same time. If the lock is not held, both threads would assume
the lock is available and would issue a store to hold the lock. Obviously, more
than one thread cannot own the same lock at the same time, but that would
be the result of such a sequence of events. Atomic instructions prevent such
things from happening.
SPARC processors implement memory access instructions that provide atomic
test-and-set semantics for mutual exclusion primitives, as well as instructions
that can force a particular ordering of memory operations (more on the latter
feature in a moment). UltraSPARC processors (the SPARC V9 instruction set)
provide three memory access instructions that guarantee atomic behavior:
ldstub (load and store unsigned byte), cas (compare and swap), and swap
(swap byte locations). These instructions differ slightly in their behavior and
the size of the datum they operate on.
Figure 17.2 illustrates the ldstub and cas instructions. The swap instruction
(not shown) simply swaps a 32-bit value between a hardware register and a
memory location, similar to what cas does if the compare phase of the
instruction sequence is equal.
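The compare-and-swap semantics just described can be modeled in portable C. The sketch below is an illustrative model only; the function name cas32 and the use of C11 atomics are our stand-ins, not SPARC microcode or Solaris source:

```c
#include <stdint.h>
#include <stdatomic.h>

/* Model of cas semantics: compare the value at *addr with 'expected';
 * if they are equal, store 'newval'. The value that was in memory
 * before the operation is returned in either case, atomically. */
static uint32_t
cas32(_Atomic uint32_t *addr, uint32_t expected, uint32_t newval)
{
    /* On failure, atomic_compare_exchange_strong writes the observed
     * memory value back into 'expected', so returning it yields the
     * old memory value on both the success and failure paths. */
    atomic_compare_exchange_strong(addr, &expected, newval);
    return expected;
}
```

Locking code built on cas tests the returned value: if it matches the "free" pattern, the caller won the lock; otherwise the returned value tells it the lock was already held.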
Figure 17.2. Atomic Instructions for Locks on SPARC Systems
The implementation of locking code with the assembly language test-and-set
style of instructions requires a subsequent test instruction on the lock value,
which is retrieved with either a cas or ldstub instruction.
For example, the ldstub instruction retrieves the byte value (the lock) from
memory and stores it in the specified hardware register. Locking code must
test the value of the register to determine if the lock was held or available
when the ldstub executed. If the register value is all 1's, the lock was held, so
the code must branch off and deal with that condition. If the register value is
all 0's, the lock was not held and the code can progress as being the current
lock holder. Note that in both cases, the lock value in memory is set to all 1's,
by virtue of the behavior of the ldstub instruction (store 0xFF at the designated
address). If the lock was already held, the value simply didn't change. If the
lock was 0 (available), it will now reflect that the lock is held (all 1's). The
code that releases a lock sets the lock value to all 0's, indicating the lock is
no longer being held.
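The ldstub-style acquire/release protocol described above can be sketched in C, with a C11 atomic exchange standing in for the actual instruction. This is a minimal model under that assumption; the type and function names are illustrative, not Solaris's:

```c
#include <stdint.h>
#include <stdatomic.h>

typedef _Atomic uint8_t byte_lock_t;   /* lock byte: 0x00 free, 0xFF held */

/* Model of an ldstub-style acquire attempt: atomically store 0xFF and
 * fetch the previous byte. If the old value was 0x00 the caller now
 * owns the lock; if it was 0xFF the lock was already held, and the
 * store of 0xFF changed nothing in effect. Returns 1 on success. */
static int
byte_lock_try(byte_lock_t *lk)
{
    return atomic_exchange(lk, 0xFF) == 0x00;
}

/* Release: set the lock byte back to all 0's. */
static void
byte_lock_release(byte_lock_t *lk)
{
    atomic_store(lk, 0x00);
}
```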
The Solaris lock code uses assembly language instructions when the lock code
is entered. The basic design is such that the entry point to acquire a lock
enters an assembly language routine, which uses either ldstub or cas to grab
the lock. The assembly code is designed to deal with the simple case, meaning
that the desired lock is available. If the lock is being held, a C language code
path is entered to deal with this situation. We describe what happens in detail
in the next few sections that discuss specific lock types.
The second hardware consideration referred to earlier has to do with the
visibility of the lock state to the running processors when the lock value is
changed. It is critically important on multiprocessor systems that all processors
have a consistent view of data in memory, especially in the implementation of
synchronization primitives: mutex locks and reader/writer (RW) locks. In other
words, if a thread acquires a lock, any processor that executes a load
instruction (read) of that memory location must retrieve the data following the
last store (write) that was issued. The most recent state of the lock must be
globally visible to all processors on the system.
Modern processors implement hardware buffering to provide optimal
performance. In addition to the hardware caches, processors also use load and
store buffers to hold data being read from (load) or written to (store)
memory in order to keep the instruction pipeline running and not have the
processor stall waiting for data or a data write-to-memory cycle. The data
hierarchy is illustrated in Figure 17.3.
Figure 17.3. Hardware Data Hierarchy
The illustration in Figure 17.3 does not depict a specific processor; it is a
generic representation of the various levels of data flow in a typical modern
high-end microprocessor. It shows the flow of data to and from physical
memory from a processor's main execution units (integer units, floating point
units, etc.).
The sizes of the load/store buffers vary across processor implementations,
but they are typically several words in size. The load and store buffers on
each processor are visible only to the processor they reside on, so a load
issued by a processor that issued the store fetches the data from the store
buffer if it is still there. However, it is theoretically possible for other
processors that issue a load for that data to read their hardware cache or
main memory before the store buffer in the store-issuing processor was
flushed. Note that the store buffer we are referring to here is not the same
thing as a level 1 or level 2 hardware instruction and data cache. Caches are
beyond the store buffer; the store buffer is closer to the execution units of
the processor. Physical memory and hardware caches are kept consistent on
SMP platforms by a hardware bus protocol. Also, many caches are
implemented as write-through caches (as is the case with the level 1 cache in
Sun UltraSPARC), so data written to cache causes memory to be updated.
The implementation of a store buffer is part of the memory model implemented
by the hardware. The memory model defines the constraints that can be
imposed on the order of memory operations (loads and stores) by the system.
Many processors implement a sequential consistency model, where loads and
stores to memory are executed in the same order in which they were issued
by the processor. This model has advantages in terms of memory consistency,
but there are performance trade-offs with such a model because the
hardware cannot optimize cache and memory operations for speed. The SPARC
architecture specification [7] provides for building SPARC-based processors
that support multiple memory models, the choice being left up to the
implementors as to which memory models they wish to support. All current
SPARC processors implement a Total Store Ordering (TSO) model, which
requires compliance with the following rules for loads and stores:
Loads (reads from memory) are blocking and are ordered with respect
to other loads.

Stores (writes to memory) are ordered with respect to other stores.

Stores cannot bypass earlier loads.

Atomic load-stores (ldstub and cas instructions) are ordered with
respect to loads.
The TSO model is not quite as strict as the sequential consistency model but
not as relaxed as two additional memory models defined by the SPARC
architecture. SPARC-based processors also support Relaxed Memory Order
(RMO) and Partial Store Order (PSO), but these are not currently supported
by the kernel and not implemented by any Sun systems shipping today.
A final consideration in data visibility applies also to the memory model and
concerns instruction ordering. The execution unit in modern processors can
reorder the incoming instruction stream for processing through the execution
units. The goals again are performance and creation of a sequence of
instructions that will keep the processor pipeline full.
The hardware considerations described in this section are summarized in Table
17.1, along with the solution or implementation detail that applies to the
particular issue.
Table 17.1. Hardware Considerations and Solutions for Locks

Consideration: Need for an atomic test-and-set instruction for locking
primitives.
Solution: Use of native machine instructions: ldstub and cas on SPARC,
cmpxchgl (compare/exchange long) on x86.

Consideration: Data global visibility issue because of the use of hardware
load and store buffers and instruction reordering, as defined by the memory
model.
Solution: Use of memory barrier instructions.
The issues of consistent memory views in the face of a processor's load and
store buffers, relaxed memory models, and atomic test-and-set capability for
locks are addressed at the processor instruction-set level. The mutex lock
and RW lock primitives implemented in the Solaris kernel use the ldstub and cas
instructions for lock testing and acquisition on UltraSPARC-based systems and
use the cmpxchgl (compare/exchange long) instruction on x86. The lock
primitive routines are part of the architecture-dependent segment of the
kernel code.
SPARC processors provide various forms of memory barrier (membar)
instructions, which, depending on options that are set in the instruction, impose
specific constraints on the ordering of memory access operations (loads and
stores) relative to the sequence with which they were issued. To ensure a
consistent memory view when a mutex or RW lock operation has been issued,
the Solaris kernel issues the appropriate membar instruction after the lock bits
have changed.
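The ordering role such a barrier plays can be illustrated with a C11 release fence as a rough analogue of a membar. This is a sketch of the idea only; the variable names, the value stored, and the use of C11 fences in place of the actual SPARC instruction are our assumptions:

```c
#include <stdint.h>
#include <stdatomic.h>

static uint64_t protected_data;      /* data guarded by the lock */
static _Atomic uint8_t lock_byte;    /* 0x00 free, 0xFF held */

/* Rough analogue of issuing a membar after changing the lock bits:
 * the release fence orders the data store before the store that
 * clears the lock byte, so any processor that subsequently acquires
 * the lock also observes the data written while it was held. */
static void
unlock_with_barrier(void)
{
    protected_data = 42;                         /* store in the critical section */
    atomic_thread_fence(memory_order_release);   /* membar-like ordering point */
    atomic_store_explicit(&lock_byte, 0x00, memory_order_relaxed);
}
```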
As we move from the strongest consistency model (sequential consistency) to
the weakest model (RMO), we can build a system with potentially better
performance. We can optimize memory operations by playing with the ordering
of memory access instructions in ways that enable designers to minimize access
latency and to maximize interconnect bandwidth. The trade-off is consistency,
since the more relaxed models provide fewer and fewer constraints on the
system to issue memory access operations in the same order in which the
instruction stream issued them. So, processor architectures provide memory
barrier controls that kernel developers can use to address the consistency
issues as necessary, with some level of control on which consistency level is
required to meet the system requirements. The types of membar instructions
available, the options they support, and how they fit into the different
memory models described would make for a highly technical and lengthy chapter
on its own. Readers interested in this topic should read [7] and [27].
are mounted, files are created and opened, network connections are made,
etc. Many of the locks are embedded in the kernel data structures that
provide the abstractions (processes, files) provided by the kernel, and thus
the number of kernel locks will scale up linearly as resources are created
dynamically.
This design speaks to one of the core strengths of the Solaris kernel:
scalability, and scaling synchronization primitives dynamically with the size of
the kernel. Dynamic lock creation has several advantages over static
allocation. First, the kernel is not wasting time and space managing a large
pool of unused locks when running on a smaller system, such as a desktop or
workgroup server. On a large system, a sufficient number of locks is available
to sustain concurrency for scalable performance. It is possible to have literally
thousands of locks in existence on a large, busy system.
17.4.1. Synchronization Process
When an executing kernel thread attempts to acquire a lock, it will encounter
one of two possible lock states: free (available) or not free (owned, held). A
requesting thread gets ownership of an available lock when the lock-specific
get lock function is invoked. If the lock is not available, the thread most likely
needs to block and wait for it to become available, although, as we will see
shortly, the code does not always block (sleep), waiting for a lock. For those
situations in which a thread will sleep while waiting for a lock, the kernel
implements a sleep queue facility, known as turnstiles, for managing threads
blocking on locks.
When a kernel thread has completed the operation on the shared data
protected by the lock, it must release the lock. When a thread releases a
lock, the code must deal with one of two possible conditions: threads are
waiting for the lock (such threads are termed waiters), or there are no
waiters. With no waiters, the lock can simply be released. With waiters, the
code has several options. It can release the lock and wake up the blocking
threads. In that case, the first thread to execute acquires the lock.
Alternatively, the code could select a thread from the turnstile (sleep queue),
based on priority or sleep time, and wake up only that thread. Finally, the
code could select which thread should get the lock next, and the lock owner
could hand the lock off to the selected thread. As we will see in the following
sections, no one solution is suitable for all situations, and the Solaris kernel
uses all three methods, depending on the lock type. Figure 17.4 provides the
big picture.
Figure 17.4. Solaris Locks: The Big Picture
Figure 17.4 provides a generic representation of the execution flow. Later we
will see the results of a considerable amount of engineering effort that has
gone into the lock code: improved efficiency and speed with short code paths,
optimizations for the hot path (frequently hit code path) with well-tuned
assembly code, and the best algorithms for lock release as determined by
extensive analysis.
17.4.2. Synchronization Object Operations Vector
Each of the synchronization objects discussed in this section (mutex locks,
reader/writer locks, and semaphores) defines an operations vector that is
linked to kernel threads that are blocking on the object. Specifically, the
object's operations vector is a data structure that exports a subset of
object functions required for kthreads sleeping on the lock. The generic
structure is defined as follows:
/*
 * The following data structure is used to map
 * synchronization object type numbers to the
 * synchronization object's sleep queue number
 * or the synch. object's owner function.
 */
typedef struct _sobj_ops {
        int             sobj_type;
        kthread_t       *(*sobj_owner)();
        void            (*sobj_unsleep)(kthread_t *);
        void            (*sobj_change_pri)(kthread_t *, pri_t, pri_t *);
} sobj_ops_t;
See sys/sobject.h
The structure shown above provides for the object type declaration. For each
synchronization object type, a type-specific structure is defined:
mutex_sobj_ops for mutex locks, rw_sobj_ops for reader/writer locks, and
sema_sobj_ops for semaphores.
The structure also provides three functions that may be called on behalf of a
kthread sleeping on a synchronization object:

An owner function, which returns the ID of the kernel thread that owns
the object.

An unsleep function, which transitions a kernel thread from a sleep state.

A change_pri function, which changes the priority of a kernel thread,
used for priority inheritance. (See Section 17.7.)
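The way such an operations vector is populated and dispatched through can be illustrated with toy types standing in for the kernel's. Everything here (the toy_mutex type, the callback bodies, the type number) is ours for illustration, not Solaris source:

```c
/* Simplified stand-ins for kernel types; illustrative only. */
typedef struct kthread { int t_tid; } kthread_t;
typedef int pri_t;

typedef struct sobj_ops {
    int        sobj_type;
    kthread_t *(*sobj_owner)(void *);
    void       (*sobj_unsleep)(kthread_t *);
    void       (*sobj_change_pri)(kthread_t *, pri_t, pri_t *);
} sobj_ops_t;

/* A toy mutex with an owner field, plus its type-specific callbacks. */
typedef struct toy_mutex { kthread_t *m_owner; } toy_mutex_t;

static kthread_t *
toy_mutex_owner(void *lock)
{
    return ((toy_mutex_t *)lock)->m_owner;   /* who holds this lock? */
}

static void
toy_unsleep(kthread_t *t)
{
    (void)t;   /* would remove the thread from its sleep queue */
}

static void
toy_change_pri(kthread_t *t, pri_t newpri, pri_t *oldp)
{
    (void)t; (void)newpri; (void)oldp;   /* would adjust thread priority */
}

/* One vector per object type; sleep-queue code dispatches through it. */
static sobj_ops_t toy_mutex_sobj_ops = {
    1, toy_mutex_owner, toy_unsleep, toy_change_pri
};
```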
We will see how references to the lock's operations structure are implemented
as we move through specifics on lock implementations in the following sections.
It is useful to note at this point that our examination of Solaris kernel locks
offers a good example of some of the design trade-offs involved in kernel
software engineering. Building the various software components that make up
the Solaris kernel is a series of design decisions, in which performance needs are
measured against complexity. In areas of the kernel where optimal performance
is a top priority, simplicity might be sacrificed in favor of performance. The
locking facilities in the Solaris kernel are an area where such trade-offs are
made: much of the lock code is written in assembly language, for speed, rather
than in the C language; the latter is easier to code with and maintain but is
potentially slower. In some cases, when the code path is not performance
critical, a simpler design will be favored over cryptic assembly code or
complexity in the algorithms. The behavior of a particular design is examined
through exhaustive testing, to ensure that the best possible design decisions
were made.
17.5. Mutex Locks
Mutual exclusion, or mutex, locks are the most common type of synchronization
primitive used in the kernel. Mutex locks serialize access to critical data: a
kernel thread must acquire the mutex specific to the data region being
protected before it can read or write the data. The thread is the lock owner
while it is holding the lock, and the thread must release the lock when it has
finished working in the protected region so other threads can acquire the lock
for access to the protected data.
17.5.1. Overview
If a thread attempts to acquire a mutex lock that is being held, it can
basically do one of two things: it can spin or it can block. Spinning means the
thread enters a tight loop, attempting to acquire the lock in each pass through
the loop. The term spin lock is often used to describe this type of mutex.
Blocking means the thread is placed on a sleep queue while the lock is being
held and the kernel sends a wakeup to the thread when the lock is released.
There are pros and cons to both approaches.
The spin approach has the benefit of not incurring the overhead of context
switching, required when a thread is put to sleep, and also has the advantage
of a relatively fast acquisition when the lock is released, since there is no
context-switch operation. It has the downside of consuming CPU cycles while
the thread is in the spin loop: the CPU is executing a kernel thread (the thread
in the spin loop) but not really doing any useful work.
The blocking approach has the advantage of freeing the processor to execute
other threads while the lock is being held; it has the disadvantage of requiring
context switching to get the waiting thread off the processor and a new
runnable thread onto the processor. There's also a little more lock acquisition
latency, since a wakeup and context switch are required before the blocking
thread can become the owner of the lock it was waiting for.
In addition to the issue of what to do if a requested lock is being held, the
question of lock granularity needs to be resolved. Let's take a simple example.
The kernel maintains a process table, which is a linked list of process
structures, one for each of the processes running on the system. A simple
table-level mutex could be implemented, such that if a thread needs to
manipulate a process structure, it must first acquire the process table mutex.
This level of locking is very coarse. It has the advantages of simplicity and
minimal lock overhead. It has the obvious disadvantage of potentially poor
scalability, since only one thread at a time can manipulate objects on the
process table. Such a lock is likely to have a great deal of contention (become
a hot lock).
The alternative is to implement a finer level of granularity: a lock per
process table entry versus one table-level lock. With a lock on each process
table entry, multiple threads can be manipulating different process structures
at the same time, providing concurrency. The disadvantages are that such an
implementation is more complex, increases the chances of deadlock situations,
and necessitates more overhead because there are more locks to manage.
In general, the Solaris kernel implements relatively fine-grained locking
whenever possible, largely due to the dynamic nature of scaling locks with
kernel structures as needed.
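The per-entry locking scheme described for the process table can be sketched in user-space C, with a POSIX mutex standing in for a kernel mutex. The structure, field names, and function are illustrative only, not the kernel's process table:

```c
#include <pthread.h>

/* Fine-grained locking: each table entry embeds its own mutex, so
 * threads working on different entries do not contend with each
 * other, unlike a single table-level lock. */
typedef struct proc_entry {
    pthread_mutex_t p_lock;   /* one lock per entry */
    int             p_pid;
    int             p_state;
} proc_entry_t;

static void
proc_set_state(proc_entry_t *p, int state)
{
    pthread_mutex_lock(&p->p_lock);
    p->p_state = state;       /* critical section kept short by design */
    pthread_mutex_unlock(&p->p_lock);
}
```

The trade-off noted above applies directly: one mutex per entry costs memory and bookkeeping, but two threads updating different entries never serialize on each other.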
The kernel implements two types of mutex locks: spin locks and adaptive locks.
Spin locks, as we discussed, spin in a tight loop if a desired lock is being held
when a thread attempts to acquire the lock. Adaptive locks are the most
common type of lock used and are designed to dynamically either spin or block
when a lock is being held, depending on the state of the holder. We already
discussed the trade-offs of spinning versus blocking. Implementing a locking
scheme that only does one or the other can severely impact scalability and
performance. It is much better to use an adaptive locking scheme, which is
precisely what we do.
The mechanics of adaptive locks are straightforward. When a thread attempts
to acquire a lock and the lock is being held, the kernel examines the state of
the thread that is holding the lock. If the lock holder (owner) is running on a
processor, the thread attempting to get the lock will spin. If the thread holding
the lock is not running, the thread attempting to get the lock will block. This
policy works quite well because the code is such that mutex hold times are
very short (by design, the goal is to minimize the amount of code to be
executed while a lock is held). So, if a thread is holding a lock and running, the
lock will likely be released very soon, probably in less time than it takes to
context-switch off and on again, so it's worth spinning.
On the other hand, if a lock holder is not running, then we know that minimally
one context switch is involved before the holder will release the lock (getting
the holder back on a processor to run), and it makes sense to simply block and
free up the processor to do something else. The kernel will place the blocking
thread on a turnstile (sleep queue) designed specifically for synchronization
primitives and will wake the thread when the lock is released by the holder.
(See Section 17.7.)
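The spin-or-block decision just described can be expressed as a small policy function. This is an illustrative skeleton of the logic, not the kernel's implementation; the types and names are ours:

```c
#include <stdbool.h>

/* Hypothetical owner state: is the lock holder currently on a CPU? */
typedef struct owner_state { bool running_on_cpu; } owner_state_t;

typedef enum { ACTION_SPIN, ACTION_BLOCK } lock_action_t;

/* Adaptive policy: if the owner is running, it should release the lock
 * soon (hold times are short by design), so spinning is cheaper than a
 * context switch. If the owner is off-processor, at least one context
 * switch must happen before release, so block on a turnstile instead. */
static lock_action_t
adaptive_policy(const owner_state_t *owner)
{
    return owner->running_on_cpu ? ACTION_SPIN : ACTION_BLOCK;
}
```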
The other distinction between adaptive locks and spin locks has to do with
interrupts, the dispatcher, and context switching. The kernel dispatcher is the
code that selects threads for scheduling and does context switches. It runs at
an elevated Priority Interrupt Level (PIL) to block interrupts (the dispatcher
runs at priority level 11 on SPARC systems). High-level interrupts (interrupt
levels 11-15 on SPARC systems) can interrupt the dispatcher. High-level
interrupt handlers are not allowed to do anything that could require a context
switch or to enter the dispatcher (we discuss this further in Section 3).
Adaptive locks can block, and blocking means context switching, so only spin
locks can be used in high-level interrupt handlers. Also, spin locks can raise
the interrupt level of the processor when the lock is acquired.
struct kernel_data {
        kmutex_t        klock;
        char            *forw_ptr;
        char            *back_ptr;
        uint64_t        data1;
        uint64_t        data2;
} kdata;

void function()
{
        .
        mutex_init(&kdata.klock);
        .
        mutex_enter(&kdata.klock);
        kdata.data1 = 1;
        mutex_exit(&kdata.klock);
}
The preceding block of pseudo-code illustrates the general mechanics of
mutex locks. A lock is declared in the code; in this case, it is embedded in the
data structure that it is designed to protect. Once declared, the lock is
initialized with the kernel mutex_init() function. Any subsequent reference to
the kdata structure requires that the klock mutex be acquired with
mutex_enter(). Once the work is done, the lock is released with mutex_exit().
The lock type, spin or adaptive, is determined in the mutex_init() code by the
kernel. Assuming an adaptive mutex in this example, any kernel threads that
make a mutex_enter() call on klock will either block or spin, depending on the
state of the kernel thread that owns klock when the mutex_enter() is called.
17.5.2. Solaris Mutex Lock Implementation
The kernel defines different data structures for the two types of mutex
locks, adaptive and spin, as shown below.
/*
 * Public interface to mutual exclusion locks. See mutex(9F) for details.
 *
 * The basic mutex type is MUTEX_ADAPTIVE, which is expected to be used
 * in almost all of the kernel. MUTEX_SPIN provides interrupt blocking
 * and must be used in interrupt handlers above LOCK_LEVEL. The iblock
 * cookie argument to mutex_init() encodes the interrupt level to block.
 * The iblock cookie must be NULL for adaptive locks.
 *
 * MUTEX_DEFAULT is the type usually specified (except in drivers) to
 * mutex_init(). It is identical to MUTEX_ADAPTIVE.
 *
 * MUTEX_DRIVER is always used by drivers. mutex_init() converts this to
 * either MUTEX_ADAPTIVE or MUTEX_SPIN depending on the iblock cookie.
 *
 * Mutex statistics can be gathered on the fly, without rebooting or
 * recompiling the kernel, via the lockstat driver (lockstat(7D)).
 */
typedef enum {
        MUTEX_ADAPTIVE = 0,     /* spin if owner is running, otherwise block */
        MUTEX_SPIN = 1,         /* block interrupts and spin */
        MUTEX_DRIVER = 4,       /* driver (DDI) mutex */
        MUTEX_DEFAULT = 6       /* kernel default mutex */
} kmutex_type_t;

typedef struct mutex {
#ifdef _LP64
        void    *_opaque[1];
#else
        void    *_opaque[2];
#endif
} kmutex_t;
The 64-bit mutex object is used for each type of lock, as shown in Figure
17.5.
Figure 17.5. Solaris 10 Adaptive and Spin Mutex
In Figure 17.5, the m_owner field in the adaptive lock, which holds the address
of the kernel thread that owns the lock (the kthread pointer), plays a double
role, in that it also serves as the actual lock; successful lock acquisition for a
thread means it has its kthread pointer set in the m_owner field of the target
lock. If threads attempt to get the lock while it is held (waiters), the low-
order bit (bit 0) of m_owner is set to reflect that case. Because kthread
pointer values are always word aligned, they do not require bit 0, allowing
this to work.
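The pointer-plus-flag encoding described above can be sketched in a few lines of C. The macro and helper names here are illustrative, not the kernel's actual definitions:

```c
#include <stdint.h>

/* Because kthread pointers are word aligned, bit 0 of the owner word
 * is always zero and can be borrowed to record "waiters exist". */
#define MUTEX_WAITERS 0x1UL

static uintptr_t
owner_encode(uintptr_t kthread_ptr, int waiters)
{
    return kthread_ptr | (waiters ? MUTEX_WAITERS : 0);
}

static uintptr_t
owner_decode(uintptr_t m_owner)
{
    return m_owner & ~MUTEX_WAITERS;   /* strip the waiters bit */
}

static int
has_waiters(uintptr_t m_owner)
{
    return (m_owner & MUTEX_WAITERS) != 0;
}
```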
/*
 * mutex_enter() assumes that the mutex is adaptive and tries to grab the
 * lock by doing an atomic compare and exchange on the first word of the mutex.
 * If the compare and exchange fails, it means that either (1) the lock is a
 * spin lock, or (2) the lock is adaptive but already held.
 * mutex_vector_enter() distinguishes these cases by looking at the mutex
 * type, which is encoded in the low-order bits of the owner field.
 */
typedef union mutex_impl {
        /*
         * Adaptive mutex.
         */
        struct adaptive_mutex {
                uintptr_t _m_owner;     /* 0-3/0-7 owner and waiters bit */
#ifndef _LP64
                uintptr_t _m_filler;    /* 4-7 unused */
#endif
        } m_adaptive;

        /*
         * Spin Mutex.
         */
        struct spin_mutex {
                lock_t  m_dummylock;    /* 0 dummy lock (always set) */
                lock_t  m_spinlock;     /* 1 real lock */
                ushort_t m_filler;      /* 2-3 unused */
                ushort_t m_oldspl;      /* 4-5 old pil value */
                ushort_t m_minspl;      /* 6-7 min pil val if lock held */
        } m_spin;
} mutex_impl_t;

See sys/mutex_impl.h
The spin mutex, as we pointed out earlier, is used at high interrupt levels,
where context switching is not allowed. Spin locks block interrupts while in the
spin loop, so the kernel needs to maintain the priority level the processor was
running at before entering the spin loop, which raises the processor's priority
level. (Elevating the priority level is how interrupts are blocked.) The m_minspl
field stores the priority level of the interrupt handler when the lock is
initialized, and m_oldspl is set to the priority level the processor was running
at when the lock code is called. The m_spinlock field holds the actual mutex
lock bits.
Each kernel module and subsystem implementing one or more mutex locks calls
into a common set of mutex functions. All locks must first be initialized by the
mutex_init() function, whereby the lock type is determined on the basis of an
argument passed in the mutex_init() call. The most common type passed into
mutex_init() is MUTEX_DEFAULT, which results in the init code determining
what type of lock, adaptive or spin, should be used. It is possible for a caller
of mutex_init() to be specific about a lock type (for example, MUTEX_SPIN).
If the init code is called from a device driver or any kernel module that
registers and generates interrupts, then an interrupt block cookie is added to
the argument list. An interrupt block cookie is an abstraction used by device
drivers when they set their interrupt vector and parameters. The mutex_init()
code checks the argument list for an interrupt block cookie. If mutex_init() is
being called from a device driver to initialize a mutex to be used in a high-
level interrupt handler, the lock type is set to spin. Otherwise, an adaptive
lock is initialized. The test is the interrupt level in the passed interrupt block
cookie; levels above LOCK_LEVEL (10 on SPARC systems) are considered high-level
interrupts and thus require spin locks. The init code clears most of the fields
in the mutex lock structure as appropriate for the lock type. The
m_dummylock field in spin locks is set to all 1's (0xFF). We'll see why in a
minute.
The primary mutex functions called, aside from mutex_init() (which is only
called once for each lock at initialization time), are mutex_enter() to get a
lock and mutex_exit() to release it. mutex_enter() assumes an available,
adaptive lock. If the lock is held or is a spin lock, mutex_vector_enter() is
entered to reconcile what should happen. This is a performance optimization.
mutex_enter() is implemented in assembly code, and because the entry point is
designed for the simple case (adaptive lock, not held), the amount of code
that gets executed to acquire a lock when those conditions are true is minimal.
Also, there are significantly more adaptive mutex locks than spin locks in the
kernel, making the quick test case effective most of the time. The test for a
lock held or spin lock is very fast. Here is where the m_dummylock field comes
into play: mutex_enter() executes a compare-and-swap instruction on the
first byte of the mutex, testing for a zero value. On a spin lock, the
m_dummylock field is tested because of its positioning in the data structure
and the endianness of SPARC processors. Since m_dummylock is always set (it
is set to all 1's in mutex_init()), the test will fail for spin locks. The test will
also fail for a held adaptive lock, since such a lock will have a nonzero value in
the byte field being tested. That is, the m_owner field will have a kthread
pointer value for a held, adaptive lock.
If the lock is an adaptive mutex and is not being held, the caller of
mutex_enter() gets ownership of the lock. If the two conditions are not true,
that is, either the lock is held or the lock is a spin lock, the code enters the
mutex_vector_enter() function to sort things out. The mutex_vector_enter()
code first tests the lock type. For spin locks, the m_oldspl field is set, based
on the current Priority Interrupt Level (PIL) of the processor, and the lock is
tested. If it's not being held, the lock is set (m_spinlock) and the code returns
to the caller. A held lock forces the caller into a spin loop, where a loop
counter is incremented (for statistical purposes; the lockstat(1M) data), and
the code checks whether the lock is still held in each pass through the loop.
Once the lock is released, the code breaks out of the loop, grabs the lock,
and returns to the caller.
Adaptive locks require a little more work. When the code enters the adaptive
code path (in mutex_vector_enter()), it increments the
cpu_sysinfo.mutex_adenters (adaptive lock enters) field, as is reflected in the
smtx column in mpstat(1M). mutex_vector_enter() then tests again to
determine if the lock is owned (held), since the lock may have been released in
the time interval between the call to mutex_enter() and the current point in
the mutex_vector_enter() code. If the adaptive lock is not being held,
mutex_vector_enter() attempts to acquire the lock. If successful, the code
returns.
If the lock is held, mutex_vector_enter() determines whether or not the lock
owner is running by looping through the CPU structures and testing the lock's
m_owner against the cpu_thread field of the CPU structure. (cpu_thread
contains the kernel thread address of the thread currently executing on the
CPU.) A match indicates the holder is running, which means the adaptive lock
will spin. No match means the owner is not running, in which case the caller
must block. In the blocking case, the kernel turnstile code is entered to locate
or acquire a turnstile, in preparation for placement of the kernel thread on a
sleep queue associated with the turnstile.
The turnstile placement happens in two phases. After mutex_vector_enter()
determines that the lock holder is not running, it makes a turnstile call to look
up the turnstile, sets the waiters bit in the lock, and retests to see if the
owner is running. If yes, the code releases the turnstile and enters the
adaptive lock spin loop, which attempts to acquire the lock. Otherwise, the
code places the kernel thread on a turnstile (sleep queue) and changes the
thread's state to sleep. That effectively concludes the sequence of events in
mutex_vector_enter().
Dropping out of mutex_vector_enter(), either the caller ended up with the
lock it was attempting to acquire or the calling thread is on a turnstile sleep
queue associated with the lock. In either case, the lockstat(1M) data is
updated, reflecting the lock type, spin time, or sleep time as the last bit of
work done in mutex_vector_enter().

lockstat(1M) is a kernel lock statistics command that was introduced in Solaris
2.6. It provides detailed information on kernel mutex and reader/writer locks.

The algorithm described in the previous paragraphs is summarized in
pseudocode below.
mutex_vector_enter()
        if (lock is a spin lock)
                lock_set_spl() /* enter spin-lock specific code path */
        increment cpu_sysinfo.mutex_adenters
spin_loop:
        if (lock is not owned)
                mutex_trylock() /* try to acquire the lock */
                if (lock acquired)
                        goto bottom
                else
                        continue /* lock being held */
        if (lock owner is running on a processor)
                goto spin_loop
        else
                lookup turnstile for the lock
                set waiters bit
                if (lock owner is running on a processor)
                        drop turnstile
                        goto spin_loop
                else
                        block /* on the sleep queue associated with the turnstile */
bottom:
        update lockstat statistics
When a thread has finished working in a lock-protected data area, it calls the
mutex_exit() code to release the lock. The entry point is implemented in
assembly language and handles the simple case of freeing an adaptive lock with
no waiters. With no threads waiting for the lock, it's a simple matter of
clearing the lock fields (m_owner) and returning. The C language function
mutex_vector_exit() is entered from mutex_exit() for anything but the simple
case.

In the case of a spin lock, the lock field is cleared and the processor is
returned to the PIL level it was running at before entering the lock code. For
adaptive locks, a waiter must be selected from the turnstile (if there is more
than one waiter), have its state changed from sleeping to runnable, and be
placed on a dispatch queue so it can execute and get the lock. If the thread
releasing the lock was the beneficiary of priority inheritance, meaning that it
had its priority improved when a calling thread with a better priority was not
able to get the lock, then the thread releasing the lock will have its priority
reset to what it was before the inheritance. Priority inheritance is discussed
in Section 17.7.
When an adaptive lock is released, the code clears the waiters bit in m_owner
and calls the turnstile function to wake up all the waiters. Readers familiar
with sleep/wakeup mechanisms of operating systems have likely heard of a
particular behavior known as the "thundering herd problem," a situation in
which many threads that have been blocking for the same resource are all
woken up at the same time and make a mad dash for the resource (a mutex in
this case), like a herd of large, four-legged beasts running toward the same
object. System behavior tends to go from a relatively small run queue to a
large run queue (all the threads have been woken up and made runnable) and
high CPU utilization until a thread gets the resource, at which point a bunch of
threads are sleeping again, the run queue normalizes, and CPU utilization
flattens out. This is a generic behavior that can occur on any operating
system.
The wakeup mechanism used when mutex_vector_exit() is called may seem like
an open invitation to thundering herds, but in practice it turns out not to be a
problem. The main reason is that the blocking case for threads waiting for a
mutex is rare; most of the time the threads will spin. If a blocking situation
does arise, it typically does not reach a point where very many threads are
blocked on the mutex; one of the characteristics of the thundering herd problem
is resource contention resulting in a lot of sleeping threads. The kernel code
segments that implement mutex locks are, by design, short and fast, so locks
are not held for long. Code that requires longer lock-hold times uses a
reader/writer write lock, which provides mutual exclusion semantics with a
selective wakeup algorithm. There are, of course, other reasons for choosing
reader/writer locks over mutex locks, the most obvious being to allow multiple
readers to see the protected data.
17.6. Reader/Writer Locks
Reader/writer (RW) locks provide mutual exclusion semantics on write locks.
Only one thread at a time is allowed to own the write lock, but there is
concurrent access for readers. These locks are designed for scenarios in
which it is acceptable to have multiple threads reading the data at the same
time, but only one writer. While a writer is holding the lock, no readers are
allowed. Also, because of the wakeup mechanism, a writer lock is a better
solution for kernel code segments that require relatively long hold times, as we
will see shortly.
The basic mechanics of RW locks are similar to mutexes, in that RW locks
have an initialization function (rw_init()), an entry function to acquire the lock
(rw_enter()), and an exit function to release the lock (rw_exit()). The entry
and exit points are optimized in assembly code to deal with the simple cases,
and they call into C language functions if anything beyond the simplest case
must be dealt with. As with mutex locks, the simple case is that the requested
lock is available on an entry (acquire) call and no threads are waiting for the
lock on the exit (release) call.
17.6.1. Solaris Reader/Writer Locks
Reader/writer locks are implemented as a single-word data structure in the
kernel, either 32 bits or 64 bits wide, depending on the data model of the
running kernel, as depicted in Figure 17.6.

Figure 17.6. Reader/Writer Lock
typedef struct rwlock_impl {
        uintptr_t       rw_wwwh;        /* waiters, write wanted, hold count */
} rwlock_impl_t;
#endif  /* _ASM */

#define RW_HAS_WAITERS          1
#define RW_WRITE_WANTED         2
#define RW_WRITE_LOCKED         4
#define RW_READ_LOCK            8
#define RW_WRITE_LOCK(thread)   ((uintptr_t)(thread) | RW_WRITE_LOCKED)
#define RW_HOLD_COUNT           (-RW_READ_LOCK)
#define RW_HOLD_COUNT_SHIFT     3       /* log2(RW_READ_LOCK) */
#define RW_READ_COUNT           RW_HOLD_COUNT
#define RW_OWNER                RW_HOLD_COUNT
#define RW_LOCKED               RW_HOLD_COUNT
#define RW_WRITE_CLAIMED        (RW_WRITE_LOCKED | RW_WRITE_WANTED)
#define RW_DOUBLE_LOCK          (RW_WRITE_LOCK(0) | RW_READ_LOCK)

See sys/rwlock.h
There are two states for the reader/writer lock, depending on whether the
lock is held by a writer, as indicated by bit 2, wrlock. Bit 2 is the actual
write lock, and it determines the meaning of the high-order bits. If the
write lock is held (bit 2 set), then the upper bits contain a pointer to the
kernel thread holding the write lock. If bit 2 is clear, then the upper bits
contain a count of the number of threads holding the lock as a read lock.

The Solaris 10 RW lock defines bit 0, the wait bit, set to signify that threads
are waiting for the lock. The wrwant bit (write wanted, bit 1) indicates that
at least one thread is waiting for a write lock. The simple cases for lock
acquisition through rw_enter() are the circumstances listed below:

The write lock is wanted and is available.

The read lock is wanted, the write lock is not held, and no threads are
waiting for the write lock (wrwant is clear).
The acquisition of the write lock results in bit 2 getting set and the kernel
thread pointer getting loaded in the upper bits. For a reader, the hold count
(upper bits) is incremented. Conditions where the write lock is being held,
causing a lock request to fail, or where a thread is waiting for a write lock,
causing a read lock request to fail, result in a call to the rw_enter_sleep()
function.
Important to note is that the rw_enter() code sets a flag in the kernel thread
used by the dispatcher code when establishing a kernel thread's priority
before preemption or changing state to sleep. We cover this in more detail in
the paragraph beginning "It is in the dispatcher queue insertion code" on page
262. Briefly, the kernel thread structure contains a t_kpri_req (kernel priority
request) field that is checked in the dispatcher code when a thread is about
to be preempted (forced off the processor on which it is executing because a
higher-priority thread becomes runnable) or when the thread is about to have
its state changed to sleep. If the t_kpri_req flag is set, the dispatcher
assigns a kernel priority to the thread, such that when the thread resumes
execution, it will run before threads in scheduling classes of lower priority
(timeshare and interactive class threads). More succinctly, the priority of a
thread holding a write lock is set to a better priority to minimize the hold time
of the lock.
Getting back to the rw_enter() flow: If the code falls through the simple
case, we need to set up the kernel thread requesting the RW lock to block.

1. rw_enter_sleep() establishes whether the calling thread is requesting a
read or write lock and does another test to see if the lock is available.
If it is, the caller gets the lock, the lockstat(1M) statistics are updated,
and the code returns. If the lock is not available, then the turnstile code
is called to look up a turnstile in preparation for putting the calling
thread to sleep.

2. With a turnstile now available, another test is made on the lock
availability. (On today's fast processors, and especially multiprocessor
systems, it's quite possible that the thread holding the lock finished what
it was doing and the lock became available.) Assuming the lock is still
held, the thread is set to a sleep state and placed on a turnstile.

3. The RW lock structure will have the wait bit set for a reader waiting
(forced to block because a writer has the lock) or the wrwant bit set
to signify that a thread wanting the write lock is blocking.

4. The cpu_sysinfo structure for the processor maintains two counters for
failures to get a read lock or write lock on the first pass: rw_rdfails
and rw_wrfails. The appropriate counter is incremented just prior to the
turnstile call; this action places the thread on a turnstile sleep queue.
The mpstat(1M) command sums the counters and displays the fails-per-
second in the srw column of its output.
The acquisition of a RW lock and subsequent behavior if the lock is held are
straightforward and similar in many ways to what happens in the mutex case.
Things get interesting when a thread calls rw_exit() to release a lock it is
holding; there are several potential solutions to the problem of determining
which thread gets the lock next. For mutexes, a wakeup is issued on all
threads that are sleeping, waiting for the mutex, and we know from empirical
data that this solution works well for reasons previously discussed. With RW
locks, we're dealing with potentially longer hold times, which could result in more
sleepers, a desire to give writers priority over readers (it's typically best to
not have a reader read data that's about to be changed by a pending writer),
and the potential for the priority inversion problem described in Section 17.7.
For rw_exit(), which is called by the lock holder when it is ready to release
the lock, the simple case is that there are no waiters. In this case, the wrlock
bit is cleared if the holder was a writer, or the hold count field is
decremented to reflect one less reader. The more complex case of the system
having waiters when the lock is released is dealt with in the following manner:

1. The kernel does a direct transfer of ownership of the lock to one or
more of the threads waiting for the lock when the lock is released,
either to the next writer or to a group of readers if more than one
reader is blocking and no writers are blocking.

This situation is very different from the case of the mutex
implementation, for which the wakeup is issued and a thread must obtain
lock ownership in the usual fashion. Here, a thread or threads wake up
owning the lock they were blocking on.

The algorithm used to figure out who gets the lock next addresses
several requirements that provide for generally balanced system
performance. The kernel needs to minimize the possibility of starvation (a
thread never getting the resource it needs to continue executing) while
allowing writers to take precedence whenever possible.

2. rw_exit_wakeup() retests for the simple case and drops the lock if
there are no waiters (clear wrlock or decrement the hold count).

3. When waiters are present, the code grabs the turnstile (sleep queue)
associated with the lock and saves the pointer to the kernel thread of
the next write waiter that was on the turnstile's sleep queue (if one
exists).

The turnstile sleep queues are organized as a FIFO (first in, first out)
queue, so the queue management (turnstile code) makes sure that the
thread that was waiting the longest (the first in) is the thread that is
selected as the next writer (first out). Thus, part of the fairness policy
we want to enforce is covered.
The remaining bits of the algorithm go as follows:

4. If a writer is releasing the write lock and there are waiting readers and
writers, readers of the same or higher priority than the highest-priority
blocked writer are granted the read lock.

5. The readers are handed ownership, and then woken up by the
turnstile_wakeup() kernel function.

These readers also inherit the priority of the writer that released the
lock if the reader thread is of a lower priority (inheritance is done on a
per-reader thread basis when more than one thread is being woken up).
Lock ownership handoff is a relatively simple operation. For read locks,
there is no notion of a lock owner, so it's a matter of setting the hold
count in the lock to reflect the number of readers coming off the
turnstile, then issuing the wakeup of each reader.

6. An exiting reader always grants the lock to a waiting writer, even if
there are higher-priority readers blocked.

7. It is possible for a reader freeing the lock to have waiting readers,
although it may not be intuitive, given the multiple reader design of the
lock. If a reader is holding the lock and a writer comes along, the
wrwant bit is set to signify that a writer is waiting for the lock. With
wrwant set, subsequent readers cannot get the lock; we want the holding
readers to finish so the writer can get the lock. Therefore, it is
possible for a reader to execute rw_exit_wakeup() with waiting writers
and readers.

The "let's favor writers but be fair to readers" policy described above was
first implemented in Solaris 2.6.
17.7. Turnstiles and Priority Inheritance
A turnstile is a data abstraction that encapsulates sleep queues and priority
inheritance information associated with mutex locks and reader/writer locks.
The mutex and RW lock code use a turnstile when a kernel thread needs to
block on a requested lock. The sleep queues implemented for other resource
waits do not provide an elegant method of dealing with the priority inversion
problem through priority inheritance. Turnstiles were created to address that
problem.

Priority inversion describes a scenario in which a higher-priority thread is
unable to run because a lower-priority thread is holding a resource it needs,
such as a lock. The Solaris kernel addresses the priority inversion problem in
its turnstile implementation, providing a priority inheritance mechanism, where
the higher-priority thread can will its priority to the lower-priority thread
holding the resource it requires. The beneficiary of the inheritance, the thread
holding the resource, will now have a higher scheduling priority and thus get
scheduled to run sooner so it can finish its work and release the resource, at
which point the original priority is returned to the thread.
In this section, we assume you have some level of knowledge of kernel thread
priorities, which are covered in Section 3.7. Because turnstiles and priority
inheritance are an integral part of the implementation of mutex and RW locks,
we thought it best to discuss them here rather than later. For this discussion,
it is important to be aware of these points:

The Solaris kernel assigns a global priority to kernel threads, based on
the scheduling class they belong to.

Kernel threads in the timeshare and interactive scheduling classes will
have their priorities adjusted over time, based on three things: the
amount of time the threads spend running on a processor, sleep time
(blocking), and the case when they are preempted. Threads in the real-
time class are fixed priority; the priorities are never changed regardless
of runtime or sleep time unless explicitly changed through programming
interfaces or commands.

The Solaris kernel implements sleep queues for the placement of kernel threads
blocking on (waiting for) a resource or event. For most resource waits, such
as those for a disk or network I/O, sleep queues, in conjunction with condition
variables, manage the systemwide queue of sleeping threads. These sleep
queues are covered in Section 3.10. This set of sleep queues is separate and
distinct from turnstile sleep queues.
17.7.1. Turnstiles Implementation
Figure 17.7 illustrates the Solaris 10 turnstiles. Turnstiles are maintained in a
systemwide hash table, turnstile_table[].

Figure 17.7. Turnstiles
typedef struct turnstile_chain {
        turnstile_t     *tc_first;      /* first turnstile on hash chain */
        disp_lock_t     tc_lock;        /* lock for this hash chain */
} turnstile_chain_t;

turnstile_chain_t       turnstile_table[2 * TURNSTILE_HASH_SIZE];

#define TS_NUM_Q 2      /* number of sleep queues per turnstile */

typedef struct turnstile turnstile_t;
struct _sobj_ops;

struct turnstile {
        turnstile_t     *ts_next;       /* next on hash chain */
        turnstile_t     *ts_free;       /* next on freelist */
        void            *ts_sobj;       /* s-object threads are blocking on */
        int             ts_waiters;     /* number of blocked threads */
        pri_t           ts_epri;        /* max priority of blocked threads */
        struct _kthread *ts_inheritor;  /* thread inheriting priority */
        turnstile_t     *ts_prioinv;    /* next in inheritor's t_prioinv list */
        sleepq_t        ts_sleepq[TS_NUM_Q];    /* read/write sleep queues */
};
The lock now has a turnstile, so subsequent threads that block on the same
lock will donate their turnstiles to the free list on the chain (the ts_free link
off the active turnstile).
In turnstile_block(), the pointers are set up as determined by the return from
turnstile_lookup(). If the turnstile pointer is null, we link up to the turnstile
pointed to by the kernel thread's t_ts pointer. If the pointer returned from
the lookup is not null, there's already at least one kthread waiting on the lock,
so the code sets up the pointer links appropriately and places the kthread's
turnstile on the free list.

The thread is then put into a sleep state through the scheduling-class-
specific sleep routine (for example, ts_sleep()). The ts_waiters field in the
turnstile is incremented, the thread's t_wchan is set to the address of the
lock, and t_sobj_ops in the thread is set to the address of the lock's
operations vectors: the owner, unsleep, and change_priority functions. The
kernel sleepq_insert() function actually places the thread on the sleep queue
associated with the turnstile.
The code does the priority inversion check (now called out of the
turnstile_block() code), builds the priority inversion links, and applies the
necessary priority changes. The priority inheritance rules apply; that is, if the
priority of the lock holder is less (worse) than the priority of the requesting
thread, the requesting thread's priority is "willed" to the holder. The holder's
t_epri field is set to the new priority, and the inheritor pointer in the turnstile
is linked to the kernel thread. All the threads on the blocking chain are
potential inheritors, based on their priority relative to the calling thread.

At this point, the dispatcher is entered through a call to swtch(), and another
kernel thread is removed from a dispatch queue and context-switched onto a
processor.
The wakeup mechanics are initiated as previously described, where a call to
the lock exit routine results in a turnstile_wakeup() call if threads are blocking
on the lock. turnstile_wakeup() does essentially the reverse of
turnstile_block(); threads that inherited a better priority have that priority
waived, and the thread is removed from the sleep queue and given a turnstile
from the chain's free list. Recall that a thread donated its turnstile to the
free list if it was not the first thread placed on the blocking chain for the
lock; coming off the turnstile, threads get a turnstile back. Once the thread is
unlinked from the sleep queue, the scheduling class wakeup code is entered,
and the thread is put back on a processor's dispatch queue.
17.8. Kernel Semaphores
Semaphores provide a method of synchronizing access to a sharable resource
by multiple processes or threads. A semaphore can be used as a binary lock
for exclusive access or as a counter, allowing for concurrent access by
multiple threads to a finite number of shared resources.

In the counter implementation, the semaphore value is initialized to the number
of shared resources (these semaphores are sometimes referred to as counting
semaphores). Each time a process needs a resource, the semaphore value is
decremented to indicate there is one less of the resource. When the process
is finished with the resource, the semaphore value is incremented. A 0
semaphore value tells the calling process that no resources are currently
available, and the calling process blocks until another process finishes using the
resource and frees it. These functions are historically referred to as
semaphore P and V operations: the P operation attempts to acquire the
semaphore, and the V operation releases it.
The Solaris kernel uses semaphores where appropriate, when the constraints
for atomicity on lock acquisition are not as stringent as they are in the areas
where mutex and RW locks are used. Also, the counting functionality that
semaphores provide makes them a good fit for things like the allocation and
deallocation of a fixed amount of a resource.

The kernel semaphore structure maintains a sleep queue for the semaphore and
a count field that reflects the value of the semaphore, shown in Figure 17.8.
The figure illustrates the look of a kernel semaphore for all Solaris releases
covered in this book.

Figure 17.8. Kernel Semaphore
Kernel functions for semaphores include an initialization routine (sema_init()), a
destroy function (sema_destroy()), the traditional P and V operations
(sema_p() and sema_v()), and a test function (test for semaphore held,
sema_held()). There are a few other support functions, as well as some
variations on the sema_p() function, which we discuss later.

The init function simply sets the count value in the semaphore, based on the
value passed as an argument to the sema_init() routine. The s_slpq pointer is
set to NULL, and the semaphore is initialized. The sema_destroy() function is
used when the semaphore is an integral part of a resource that is dynamically
created and destroyed as the resource gets used and subsequently released.
For example, the bio (block I/O) subsystem in the kernel, which manages buf
structures for page I/O support through the file system, uses semaphores on a
per-buf structure basis. Each buffer has two semaphores, which are initialized
when a buffer is allocated by sema_init(). Once the I/O is completed and the
buffer is released, sema_destroy() is called as part of the buffer release
code. (sema_destroy() just nulls the s_slpq pointer.)
Kernel threads that must access a resource controlled by a semaphore call the
sema_p() function, which requires that the semaphore count value be greater
than 0 in order to return success. If the count is 0, then the semaphore is not
available and the calling thread must block. If the count is greater than 0, then
the count is decremented in the semaphore and the code returns to the caller.
Otherwise, a sleep queue is located from the systemwide array of sleep
queues, the thread state is changed to sleep, and the thread is placed on the
sleep queue. Note that turnstiles are not used for semaphores; turnstiles are an
implementation of sleep queues specifically for mutex and RW locks. Kernel
threads blocked on anything other than mutexes and RW locks are placed on
sleep queues.
Sleep queues are discussed in more detail in Section 3.10. Briefly though, sleep
queues are organized as a linked list of kernel threads, and each linked list is
rooted in an array referenced through a sleepq_head kernel pointer. Figure
17.9 illustrates how sleep queues are organized.

Figure 17.9. Sleep Queues
A hashing function indexes the sleepq_head array, hashing on the address of
the object. A singly linked list that establishes the beginning of the doubly
linked sublists of kthreads at the same priority is in ascending order based on
priority. The sublist is implemented with a t_priforw (forward pointer) and
t_priback (previous pointer) in the kernel thread. Also, a t_sleepq pointer
points back to the array entry in sleepq_head, identifying which sleep queue
the thread is on and providing a quick method to determine if a thread is on a
sleep queue at all; if the thread's t_sleepq pointer is NULL, then the thread is
not on a sleep queue.
Inside the sema_p() function, if we have a semaphore count value of 0, the
semaphore is not available and the calling kernel thread needs to be placed on
a sleep queue. A sleep queue is located through a hash function into the
sleepq_head array, which hashes on the address of the object the thread is
blocking on, in this case, the address of the semaphore. The code also grabs the
sleep queue lock, sq_lock (see Figure 17.6), to block any further inserts or
removals from the sleep queue until the insertion of the current kernel thread
has been completed (that's what locks are for).
The scheduling-class-specific sleep function is called to set the thread
wakeup priority and to change the thread state from ONPROC (running on a
processor) to SLEEP. The kernel thread's t_wchan (wait channel) pointer is
set to the address of the semaphore it's blocking on, and the thread's
t_sobj_ops pointer is set to reference the sema_sobj_ops structure. The
thread is now in a sleep state on a sleep queue.
A semaphore is released by the sema_v() function, which has the exact
opposite effect of sema_p() and behaves very much like the lock release
functions we've examined up to this point. The semaphore value is incremented,
and if any threads are sleeping on the semaphore, the one that has been
sitting on the sleep queue longest will be woken up. Semaphore wakeups always
involve waking one waiter at a time.
Semaphores are used in relatively few areas of the operating system: the
buffer I/O (bio) module, the dynamically loadable kernel module code, and a
couple of device drivers.
17.6. DTrace Lockstat Provider
The lockstat provider makes available probes that can be used to discern lock
contention statistics or to understand virtually any aspect of locking behavior.
The lockstat(1M) command is actually a DTrace consumer that uses the
lockstat provider to gather its raw data.
17.6.1. Overview
The lockstat provider makes available two kinds of probes: contention-event
probes and hold-event probes.
Contention-event probes correspond to contention on a synchronization
primitive; they fire when a thread is forced to wait for a resource to become
available. Solaris is generally optimized for the noncontention case, so
prolonged contention is not expected. These probes should be used to
understand those cases where contention does arise. Because contention is
relatively rare, enabling contention-event probes generally doesn't substantially
affect performance.
Hold-event probes correspond to acquiring, releasing, or otherwise
manipulating a synchronization primitive. These probes can be used to answer
arbitrary questions about the way synchronization primitives are manipulated.
Because Solaris acquires and releases synchronization primitives very often (on
the order of millions of times per second per CPU on a busy system), enabling
hold-event probes has a much higher probe effect than does enabling
contention-event probes. While the probe effect induced by enabling them can
be substantial, it is not pathological; they may still be enabled with confidence
on production systems.
The lockstat provider makes available probes that correspond to the different
synchronization primitives in Solaris; these primitives and the probes that
correspond to them are discussed in the remainder of this chapter.
17.6.2. Adaptive Lock Probes
Adaptive locks enforce mutual exclusion to a critical section and can be
acquired in most contexts in the kernel. Because adaptive locks have few
context restrictions, they comprise the vast majority of synchronization
primitives in the Solaris kernel. These locks are adaptive in their behavior with
respect to contention. When a thread attempts to acquire a held adaptive
lock, it will determine if the owning thread is currently running on a CPU. If the
owner is running on another CPU, the acquiring thread will spin. If the owner is
not running, the acquiring thread will block.
The four lockstat probes pertaining to adaptive locks are in Table 17.2. For
each probe, arg0 contains a pointer to the kmutex_t structure that
represents the adaptive lock.
Table 17.2. Adaptive Lock Probes

Probe Name        Description

adaptive-acquire  Hold-event probe that fires immediately after an adaptive
                  lock is acquired.

adaptive-block    Contention-event probe that fires after a thread that has
                  blocked on a held adaptive mutex has reawakened and has
                  acquired the mutex. If both probes are enabled,
                  adaptive-block fires before adaptive-acquire. At most one
                  of adaptive-block and adaptive-spin fire for a single lock
                  acquisition. arg1 for adaptive-block contains the sleep
                  time in nanoseconds.

adaptive-spin     Contention-event probe that fires after a thread that has
                  spun on a held adaptive mutex has successfully acquired the
                  mutex. If both are enabled, adaptive-spin fires before
                  adaptive-acquire. At most one of adaptive-spin and
                  adaptive-block fire for a single lock acquisition. arg1 for
                  adaptive-spin contains the spin count: the number of
                  iterations that were taken through the spin loop before the
                  lock was acquired. The spin count has little meaning on its
                  own but can be used to compare spin times.

adaptive-release  Hold-event probe that fires immediately after an adaptive
                  lock is released.
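For example, the sleep time reported in arg1 by adaptive-block can be summed per kernel stack to see where threads lose time to mutex contention. The following D sketch is illustrative (it is not from the book, and the stack depth of 5 is an arbitrary choice):

```d
lockstat:::adaptive-block
{
        @[stack(5)] = sum(arg1);   /* total sleep time, in nanoseconds */
}
```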
17.6.3. Spin Lock Probes

Threads cannot block in some contexts in the kernel, such as high-level
interrupt context and any context manipulating dispatcher state. In these
contexts, this restriction prevents the use of adaptive locks. Spin locks are
instead used to effect mutual exclusion to critical sections in these contexts.
As the name implies, the behavior of these locks in the presence of contention
is to spin until the lock is released by the owning thread. The three probes
pertaining to spin locks are in Table 17.3.
Table 17.3. Spin Lock Probes

Probe Name    Description

spin-acquire  Hold-event probe that fires immediately after a spin lock is
              acquired.

spin-spin     Contention-event probe that fires after a thread that has spun
              on a held spin lock has successfully acquired the spin lock. If
              both are enabled, spin-spin fires before spin-acquire. arg1 for
              spin-spin contains the spin count: the number of iterations
              that were taken through the spin loop before the lock was
              acquired. The spin count has little meaning on its own but can
              be used to compare spin times.

spin-release  Hold-event probe that fires immediately after a spin lock is
              released.
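Spin counts from spin-spin lend themselves to the same kind of aggregation; as an illustrative sketch (not from the book), the distribution of spin counts can be quantized per lock, keyed by the lock's address in arg0:

```d
lockstat:::spin-spin
{
        @spins[arg0] = quantize(arg1);   /* distribution of spin counts */
}
```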
Adaptive locks are much more common than spin locks. The following script
displays totals for both lock types to provide data to support this
observation.
lockstat:::adaptive-acquire
/execname == "date"/
{
        @locks["adaptive"] = count();
}

lockstat:::spin-acquire
/execname == "date"/
{
        @locks["spin"] = count();
}
Run this script in one window, and a date(1) command in another. When you
terminate the DTrace script, you will see output similar to the following
example.
# dtrace -s ./whatlock.d
dtrace: script './whatlock.d' matched 5 probes
^C

  spin                                                             28
  adaptive                                                       2981
As this output indicates, over 99 percent of the locks acquired in running the
date command are adaptive locks. It may be surprising that so many locks are
acquired in doing something as simple as a date. The large number of locks is a
natural artifact of the fine-grained locking required of an extremely scalable
system like the Solaris kernel.
17.6.4. Thread Locks
A thread lock is a special kind of spin lock that locks a thread for purposes
of changing thread state. Thread lock hold events are available as spin lock
hold-event probes (that is, spin-acquire and spin-release), but contention
events have their own probe specific to thread locks. The thread lock
contention-event probe is described in Table 17.4.
Table 17.4. Thread Lock Probes

Probe Name   Description

thread-spin  Contention-event probe that fires after a thread has spun on a
             thread lock. Like other contention-event probes, if both the
             contention-event probe and the hold-event probe are enabled,
             thread-spin fires before spin-acquire. Unlike other
             contention-event probes, however, thread-spin fires before the
             lock is actually acquired. As a result, multiple thread-spin
             probe firings may correspond to a single spin-acquire probe
             firing.
17.6.5. Readers/Writer Lock Probes

Readers/writer locks enforce a policy of allowing multiple readers or a single
writer, but not both, to be in a critical section. These locks are typically used
for structures that are searched more frequently than they are modified and
for which there is substantial time in the critical section. If critical section
times are short, readers/writer locks will implicitly serialize over the shared
memory used to implement the lock, giving them no advantage over adaptive
locks. See rwlock(9F) for more details on readers/writer locks.
The probes pertaining to readers/writer locks are in Table 17.5. For each
probe, arg0 contains a pointer to the krwlock_t structure that represents the
readers/writer lock.
Table 17.5. Readers/Writer Lock Probes

Probe Name    Description

rw-acquire    Hold-event probe that fires immediately after a readers/writer
              lock is acquired. arg1 contains the constant RW_READER if the
              lock was acquired as a reader, and RW_WRITER if the lock was
              acquired as a writer.

rw-block      Contention-event probe that fires after a thread that has
              blocked on a held readers/writer lock has reawakened and has
              acquired the lock. arg1 contains the length of time (in
              nanoseconds) that the current thread had to sleep to acquire
              the lock. arg2 contains the constant RW_READER if the lock was
              acquired as a reader, and RW_WRITER if the lock was acquired
              as a writer. arg3 and arg4 contain more information on the
              reason for blocking. arg3 is nonzero if and only if the lock
              was held as a writer when the current thread blocked. arg4
              contains the readers count when the current thread blocked. If
              both the rw-block and rw-acquire probes are enabled, rw-block
              fires before rw-acquire.

rw-upgrade    Hold-event probe that fires after a thread has successfully
              upgraded a readers/writer lock from a reader to a writer.
              Upgrades do not have an associated contention event because
              they are only possible through a nonblocking interface,
              rw_tryupgrade(9F).

rw-downgrade  Hold-event probe that fires after a thread has downgraded its
              ownership of a readers/writer lock from writer to reader.
              Downgrades do not have an associated contention event because
              they always succeed without contention.

rw-release    Hold-event probe that fires immediately after a readers/writer
              lock is released. arg1 contains the constant RW_READER if the
              released lock was held as a reader, and RW_WRITER if the
              released lock was held as a writer. Due to upgrades and
              downgrades, the lock may not have been released as it was
              acquired.
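The rw-block arguments lend themselves to a quick look at why threads block: arg3 distinguishes a writer-held lock from one held by readers, and arg1 carries the sleep time. An illustrative D sketch (not from the book):

```d
lockstat:::rw-block
{
        /* total nanoseconds slept, split by the reason for blocking */
        @[arg3 ? "lock held by a writer" : "lock held by readers"] = sum(arg1);
}
```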