Solaris Internals (Ch. 17: Locking)

    17.2. Parallel Systems Architectures

Multiprocessor (MP) systems from Sun (SPARC-processor-based), as well as several x86/x64-based MP platforms, are implemented as symmetric multiprocessor (SMP) systems. Symmetric multiprocessor describes a system in which a peer-to-peer relationship exists among all the processors (CPUs) on the system. A master processor, defined as the only CPU on the system that can execute operating system code and field interrupts, does not exist. All processors are equal. The SMP acronym can also be extended to mean Shared Memory Multiprocessor, which defines an architecture in which all the processors in the system share a uniform view of the system's physical address space and the operating system's virtual address space. That is, all processors share a single image of the operating system kernel. Sun's multiprocessor systems meet the criteria for both definitions.

Alternative MP architectures alter the kernel's view of addressable memory in different ways. Massively parallel processor (MPP) systems are built on nodes that contain a relatively small number of processors, some local memory, and I/O. Each node contains its own copy of the operating system; thus, each node addresses its own physical and virtual address space. The address space of one node is not visible to the other nodes on the system. The nodes are connected by a high-speed, low-latency interconnect, and node-to-node communication is done through an optimized message passing interface. MPP architectures require a new programming model to achieve parallelism across nodes.

The shared memory model does not work since the system's total address space is not visible across nodes, so memory pages cannot be shared by threads running on different nodes. Thus, an API that provides an interface into the message passing path in the kernel must be used by code that needs to scale across the various nodes in the system.

Other issues arise from the nonuniform nature of the architecture with respect to I/O processing since the I/O controllers on each node are not easily made visible to all the nodes on the system. Some MPP platforms attempt to provide the illusion of a uniform I/O space across all the nodes by using kernel software, but the nonuniformity of the access times to nonlocal I/O devices still exists.

NUMA and ccNUMA (nonuniform memory access and cache coherent NUMA) architectures attempt to address the programming model issue inherent in MPP systems. From a hardware architecture point of view, NUMA systems resemble MPPs: small nodes with few processors, a node-to-node interconnect, local memory, and I/O on each node. Note: It is not required that NUMA/ccNUMA or MPP systems implement small nodes (nodes with four or fewer processors). Many implementations are built that way, but there is no architectural restriction on the node size.

On NUMA/ccNUMA systems, the operating system software provides a single system image, where each node has a view of the entire system's memory address space. In this way, the shared memory model is preserved. However, the nonuniform nature of the speed of memory access (latency) is a factor in the performance and potential scalability of the platform. When a thread executing on a processor node on a NUMA or ccNUMA system incurs a page fault (references an unmapped memory address), the latency involved in resolving the page fault varies according to whether the physical memory page is on the same node as the executing thread or on a node somewhere across the interconnect. The latency variance can be substantial. As the level of memory page sharing increases across threads executing on different nodes, a potentially higher volume of page faults needs to be resolved from a nonlocal memory segment. This problem adversely affects performance and scalability.

The three different parallel architectures can be summarized as follows:

SMP. Symmetric multiprocessor with a shared memory model; single kernel image

MPP. Message-based model; multiple kernel images

NUMA/ccNUMA. Shared memory model; single kernel image

Figure 17.1 illustrates the different architectures.


    Figure 17.1. Parallel Systems Architectures

The challenge in building an operating system that provides scalable performance when multiple processors are sharing a single image of the kernel and when every processor can run kernel code, handle interrupts, etc., is to synchronize access to critical data and state information. Scalable performance, or scalability, generally refers to accomplishment of an increasing amount of work as more hardware resources are added to the system. If more processors are added to a multiprocessor system, an incremental increase in work is expected, assuming sufficient resources in other areas of the system (memory, I/O, network).

To achieve scalable performance, the system must be able to concurrently support multiple processors executing operating system code. Whether that execution is in device drivers, interrupt handlers, the threads dispatcher, file system code, virtual memory code, etc., is, to a degree, load dependent. Concurrency is key to scalability.

The preceding discussion on parallel architectures only scratched the surface of a very complex topic. Entire texts discuss parallel architectures exclusively; you should refer to them for additional information. See, for example, [13].


[...] a held lock is represented by the lock byte with all bits being 1's (lock value 0xFF). A lock that is available (not being held) is the same byte with all 0's (lock value 0x00). This explanation may seem quite rudimentary, but it is crucial to understanding the text that follows.

Most modern processors shipping today provide some form of byte-level test-and-set instruction that is guaranteed to be atomic in nature. The instruction sequence is often described as read-modify-write; that is, the referenced memory location (the memory address of the lock) is read, modified, and written back in one atomic operation. In RISC processors (such as the UltraSPARC T1 processor), reads are load operations and writes are store operations. An atomic operation is required for consistency. An instruction that has atomic properties means that no other store operation is allowed between the load and store of the executing instruction. Mutex and RW lock operations must be atomic, such that when the instruction execution to get the lock is complete, we either have the lock or have the information we need to determine that the lock is already being held.

Consider what could happen without an instruction that has atomic properties. A thread executing on one processor could issue a load (read) of the lock, and while it is doing a test operation to determine if the lock is held or not, another thread executing on another processor issues a lock call to get the same lock at the same time. If the lock is not held, both threads would assume the lock is available and would issue a store to hold the lock. Obviously, more than one thread cannot own the same lock at the same time, but that would be the result of such a sequence of events. Atomic instructions prevent such things from happening.

SPARC processors implement memory access instructions that provide atomic test-and-set semantics for mutual exclusion primitives, as well as instructions that can force a particular ordering of memory operations (more on the latter feature in a moment). UltraSPARC processors (the SPARC V9 instruction set) provide three memory access instructions that guarantee atomic behavior: ldstub (load and store unsigned byte), cas (compare and swap), and swap (swap byte locations). These instructions differ slightly in their behavior and the size of the datum they operate on.

Figure 17.2 illustrates the ldstub and cas instructions. The swap instruction (not shown) simply swaps a 32-bit value between a hardware register and a memory location, similar to what cas does if the compare phase of the instruction sequence is equal.


Figure 17.2. Atomic Instructions for Locks on SPARC Systems

The implementation of locking code with the assembly language test-and-set style of instructions requires a subsequent test instruction on the lock value, which is retrieved with either a cas or ldstub instruction.

For example, the ldstub instruction retrieves the byte value (the lock) from memory and stores it in the specified hardware register. Locking code must test the value of the register to determine if the lock was held or available when the ldstub executed. If the register value is all 1's, the lock was held, so the code must branch off and deal with that condition. If the register value is all 0's, the lock was not held and the code can progress as being the current lock holder. Note that in both cases, the lock value in memory is set to all 1's, by virtue of the behavior of the ldstub instruction (store 0xFF at designated address). If the lock was already held, the value simply didn't change. If the lock was 0 (available), it will now reflect that the lock is held (all 1's). The code that releases a lock sets the lock value to all 0's, indicating the lock is no longer being held.
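To make the ldstub sequence concrete, below is a minimal C sketch of a byte-sized test-and-set lock. It is illustrative only: it uses the GCC/Clang __atomic builtins in place of the actual ldstub instruction, and the names are invented (the kernel's real lock entry points are hand-tuned assembly, as described next). Acquire atomically stores 0xFF and examines the old value; release stores 0x00.

#include <stdint.h>

typedef volatile uint8_t simple_lock_t; /* one byte: 0x00 free, 0xFF held */

static void
simple_lock_enter(simple_lock_t *lp)
{
        /* ldstub-like: atomically store 0xFF, returning the old value. */
        while (__atomic_exchange_n(lp, 0xFF, __ATOMIC_ACQUIRE) != 0)
                ;       /* old value was 0xFF: lock was held, keep spinning */
}

static void
simple_lock_exit(simple_lock_t *lp)
{
        /* Store all 0's to indicate the lock is no longer held. */
        __atomic_store_n(lp, 0x00, __ATOMIC_RELEASE);
}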

The Solaris lock code uses assembly language instructions when the lock code is entered. The basic design is such that the entry point to acquire a lock enters an assembly language routine, which uses either ldstub or cas to grab the lock. The assembly code is designed to deal with the simple case, meaning that the desired lock is available. If the lock is being held, a C language code path is entered to deal with this situation. We describe what happens in detail in the next few sections that discuss specific lock types.

The second hardware consideration referred to earlier has to do with the visibility of the lock state to the running processors when the lock value is changed. It is critically important on multiprocessor systems that all processors have a consistent view of data in memory, especially in the implementation of synchronization primitives: mutex locks and reader/writer (RW) locks. In other words, if a thread acquires a lock, any processor that executes a load instruction (read) of that memory location must retrieve the data following the last store (write) that was issued. The most recent state of the lock must be globally visible to all processors on the system.

Modern processors implement hardware buffering to provide optimal performance. In addition to the hardware caches, processors also use load and store buffers to hold data being read from (load) or written to (store) memory in order to keep the instruction pipeline running and not have the processor stall waiting for data or a data write-to-memory cycle. The data hierarchy is illustrated in Figure 17.3.

Figure 17.3. Hardware Data Hierarchy

The illustration in Figure 17.3 does not depict a specific processor; it is a generic representation of the various levels of data flow in a typical modern high-end microprocessor. It shows the flow of data to and from physical memory from a processor's main execution units (integer units, floating point units, etc.).


The sizes of the load/store buffers vary across processor implementations, but they are typically several words in size. The load and store buffers on each processor are visible only to the processor they reside on, so a load issued by a processor that issued the store fetches the data from the store buffer if it is still there. However, it is theoretically possible for other processors that issue a load for that data to read their hardware cache or main memory before the store buffer in the store-issuing processor was flushed. Note that the store buffer we are referring to here is not the same thing as a level 1 or level 2 hardware instruction and data cache. Caches are beyond the store buffer; the store buffer is closer to the execution units of the processor. Physical memory and hardware caches are kept consistent on SMP platforms by a hardware bus protocol. Also, many caches are implemented as write-through caches (as is the case with the level 1 cache in Sun UltraSPARC), so data written to cache causes memory to be updated.

The implementation of a store buffer is part of the memory model implemented by the hardware. The memory model defines the constraints that can be imposed on the order of memory operations (loads and stores) by the system. Many processors implement a sequential consistency model, where loads and stores to memory are executed in the same order in which they were issued by the processor. This model has advantages in terms of memory consistency, but there are performance trade-offs with such a model because the hardware cannot optimize cache and memory operations for speed. The SPARC architecture specification [7] provides for building SPARC-based processors that support multiple memory models, the choice being left up to the implementors as to which memory models they wish to support. All current SPARC processors implement a Total Store Ordering (TSO) model, which requires compliance with the following rules for loads and stores:

Loads (reads from memory) are blocking and are ordered with respect to other loads.

Stores (writes to memory) are ordered with respect to other stores.

Stores cannot bypass earlier loads.

Atomic load-stores (ldstub and cas instructions) are ordered with respect to loads.

The TSO model is not quite as strict as the sequential consistency model but not as relaxed as two additional memory models defined by the SPARC architecture. SPARC-based processors also support Relaxed Memory Order (RMO) and Partial Store Order (PSO), but these are not currently supported by the kernel and not implemented by any Sun systems shipping today.

A final consideration in data visibility applies also to the memory model and concerns instruction ordering. The execution unit in modern processors can reorder the incoming instruction stream for processing through the execution units. The goals again are performance and creation of a sequence of instructions that will keep the processor pipeline full.

The hardware considerations described in this section are summarized in Table 17.1, along with the solution or implementation detail that applies to the particular issue.

Table 17.1. Hardware Considerations and Solutions for Locks

Consideration: Need for an atomic test-and-set instruction for locking primitives.
Solution: Use of native machine instructions: ldstub and cas on SPARC, cmpxchgl (compare/exchange long) on x86.

Consideration: Data global visibility issue because of the use of hardware load and store buffers and instruction reordering, as defined by the memory model.
Solution: Use of memory barrier instructions.

The issues of consistent memory views in the face of a processor's load and store buffers, relaxed memory models, and atomic test-and-set capability for locks are addressed at the processor instruction-set level. The mutex lock and RW lock primitives implemented in the Solaris kernel use the ldstub and cas instructions for lock testing and acquisition on UltraSPARC-based systems and use the cmpxchgl (compare/exchange long) instruction on x86. The lock primitive routines are part of the architecture-dependent segment of the kernel code.

SPARC processors provide various forms of memory barrier (membar) instructions, which, depending on options that are set in the instruction, impose specific constraints on the ordering of memory access operations (loads and stores) relative to the sequence with which they were issued. To ensure a consistent memory view when a mutex or RW lock operation has been issued, the Solaris kernel issues the appropriate membar instruction after the lock bits have changed.
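As a rough sketch of where such a barrier sits (illustrative pseudo-kernel code, not the actual Solaris routines), a lock release on SPARC could place a membar ahead of the store that clears the lock byte, guaranteeing that all loads and stores performed inside the critical section are globally visible before the lock appears free:

#include <stdint.h>

static void
lock_clear_with_membar(volatile uint8_t *lp)
{
        /* Order prior loads and stores before the releasing store. */
        __asm__ __volatile__("membar #LoadStore|#StoreStore" ::: "memory");
        *lp = 0x00;     /* drop the lock */
}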

As we move from the strongest consistency model (sequential consistency) to the weakest model (RMO), we can build a system with potentially better performance. We can optimize memory operations by playing with the ordering of memory access instructions in ways that enable designers to minimize access latency and to maximize interconnect bandwidth. The trade-off is consistency, since the more relaxed models provide fewer and fewer constraints on the system to issue memory access operations in the same order in which the instruction stream issued them. So, processor architectures provide memory barrier controls that kernel developers can use to address the consistency issues as necessary, with some level of control on which consistency level is required to meet the system requirements. The types of membar instructions available, the options they support, and how they fit into the different memory models described would make for a highly technical and lengthy chapter on its own. Readers interested in this topic should read [...] and [27].


[...] as file systems are mounted, files are created and opened, network connections are made, etc. Many of the locks are embedded in the kernel data structures that provide the abstractions (processes, files) provided by the kernel, and thus the number of kernel locks will scale up linearly as resources are created dynamically.

This design speaks to one of the core strengths of the Solaris kernel: scalability and scaling synchronization primitives dynamically with the size of the kernel. Dynamic lock creation has several advantages over static allocations. First, the kernel is not wasting time and space managing a large pool of unused locks when running on a smaller system, such as a desktop or workgroup server. On a large system, a sufficient number of locks is available to sustain concurrency for scalable performance. It is possible to have literally thousands of locks in existence on a large, busy system.

17.4.1. Synchronization Process

When an executing kernel thread attempts to acquire a lock, it will encounter one of two possible lock states: free (available) or not free (owned, held). A requesting thread gets ownership of an available lock when the lock-specific get lock function is invoked. If the lock is not available, the thread most likely needs to block and wait for it to come available, although, as we will see shortly, the code does not always block (sleep), waiting for a lock. For those situations in which a thread will sleep while waiting for a lock, the kernel implements a sleep queue facility, known as turnstiles, for managing threads blocking on locks.

When a kernel thread has completed the operation on the shared data protected by the lock, it must release the lock. When a thread releases a lock, the code must deal with one of two possible conditions: threads are waiting for the lock (such threads are termed waiters), or there are no waiters. With no waiters, the lock can simply be released. With waiters, the code has several options. It can release the lock and wake up the blocking threads. In that case, the first thread to execute acquires the lock. Alternatively, the code could select a thread from the turnstile (sleep queue), based on priority or sleep time, and wake up only that thread. Finally, the code could select which thread should get the lock next, and the lock owner could hand the lock off to the selected thread. As we will see in the following sections, no one solution is suitable for all situations, and the Solaris kernel uses all three methods, depending on the lock type. Figure 17.4 provides the big picture.


Figure 17.4. Solaris Locks: The Big Picture

Figure 17.4 provides a generic representation of the execution flow. Later we will see the results of a considerable amount of engineering effort that has gone into the lock code: improved efficiency and speed with short code paths, optimizations for the hot path (frequently hit code path) with well-tuned assembly code, and the best algorithms for lock release as determined by extensive analysis.

17.4.2. Synchronization Object Operations Vector

Each of the synchronization objects discussed in this section (mutex locks, reader/writer locks, and semaphores) defines an operations vector that is linked to kernel threads that are blocking on the object. Specifically, the object's operations vector is a data structure that exports a subset of object functions required for kthreads sleeping on the lock. The generic structure is defined as follows:

/*
 * The following data structure is used to map
 * synchronization object type numbers to the
 * synchronization object's sleep queue number
 * or the synch. object's owner function.
 */
typedef struct _sobj_ops {
        int             sobj_type;
        kthread_t       *(*sobj_owner)();
        void            (*sobj_unsleep)(kthread_t *);
        void            (*sobj_change_pri)(kthread_t *, pri_t, pri_t *);
} sobj_ops_t;

See sys/sobject.h

The structure shown above provides for the object type declaration. For each synchronization object type, a type-specific structure is defined: mutex_sobj_ops for mutex locks, rw_sobj_ops for reader/writer locks, and sema_sobj_ops for semaphores.

The structure also provides three functions that may be called on behalf of a kthread sleeping on a synchronization object (a sketch of such a vector follows this list):

An owner function, which returns the ID of the kernel thread that owns the object.

An unsleep function, which transitions a kernel thread from a sleep state.

A change_pri function, which changes the priority of a kernel thread, used for priority inheritance. (See Section 17.7.)
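For illustration, a type-specific vector for mutexes might be initialized as shown below. This is a sketch: the function and constant names are assumptions chosen to match the roles described above, not a quote of the kernel source.

/* Hypothetical sketch of a mutex operations vector. */
static sobj_ops_t mutex_sobj_ops = {
        SOBJ_MUTEX,             /* sobj_type */
        mutex_owner,            /* returns kthread_t * of the lock owner */
        turnstile_stay_asleep,  /* sobj_unsleep */
        turnstile_change_pri    /* sobj_change_pri, for priority inheritance */
};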

We will see how references to the lock's operations structure are implemented as we move through specifics on lock implementations in the following sections.

It is useful to note at this point that our examination of Solaris kernel locks offers a good example of some of the design trade-offs involved in kernel software engineering. Building the various software components that make up the Solaris kernel is a series of design decisions, when performance needs are measured against complexity. In areas of the kernel where optimal performance is a top priority, simplicity might be sacrificed in favor of performance. The locking facilities in the Solaris kernel are an area where such trade-offs are made: much of the lock code is written in assembly language, for speed, rather than in the C language; the latter is easier to code with and maintain but is potentially slower. In some cases, when the code path is not performance critical, a simpler design will be favored over cryptic assembly code or complexity in the algorithms. The behavior of a particular design is examined through exhaustive testing, to ensure that the best possible design decisions were made.


17.5. Mutex Locks

Mutual exclusion, or mutex, locks are the most common type of synchronization primitive used in the kernel. Mutex locks serialize access to critical data: a kernel thread must acquire the mutex specific to the data region being protected before it can read or write the data. The thread is the lock owner while it is holding the lock, and the thread must release the lock when it has finished working in the protected region so other threads can acquire the lock for access to the protected data.

17.5.1. Overview

If a thread attempts to acquire a mutex lock that is being held, it can basically do one of two things: it can spin or it can block. Spinning means the thread enters a tight loop, attempting to acquire the lock in each pass through the loop. The term spin lock is often used to describe this type of mutex. Blocking means the thread is placed on a sleep queue while the lock is being held and the kernel sends a wakeup to the thread when the lock is released. There are pros and cons to both approaches.

The spin approach has the benefit of not incurring the overhead of context switching, required when a thread is put to sleep, and also has the advantage of a relatively fast acquisition when the lock is released, since there is no context-switch operation. It has the downside of consuming CPU cycles while the thread is in the spin loop: the CPU is executing a kernel thread (the thread in the spin loop) but not really doing any useful work.

The blocking approach has the advantage of freeing the processor to execute other threads while the lock is being held; it has the disadvantage of requiring context switching to get the waiting thread off the processor and a new runnable thread onto the processor. There's also a little more lock acquisition latency, since a wakeup and context switch are required before the blocking thread can become the owner of the lock it was waiting for.

In addition to the issue of what to do if a requested lock is being held, the question of lock granularity needs to be resolved. Let's take a simple example. The kernel maintains a process table, which is a linked list of process structures, one for each of the processes running on the system. A simple table-level mutex could be implemented, such that if a thread needs to manipulate a process structure, it must first acquire the process table mutex. This level of locking is very coarse. It has the advantages of simplicity and minimal lock overhead. It has the obvious disadvantage of potentially poor scalability, since only one thread at a time can manipulate objects on the process table. Such a lock is likely to have a great deal of contention (become a hot lock).

The alternative is to implement a finer level of granularity: a lock per process table entry versus one table-level lock. With a lock on each process table entry, multiple threads can be manipulating different process structures at the same time, providing concurrency. The disadvantages are that such an implementation is more complex, increases the chances of deadlock situations, and necessitates more overhead because there are more locks to manage. The two approaches are sketched below.
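A minimal sketch of the two granularities, using invented structure names purely for illustration:

/* Coarse-grained: one mutex serializes all access to the table. */
kmutex_t                proc_table_lock;
struct proc_entry       *proc_table;

/* Fine-grained: each table entry carries its own mutex. */
struct proc_entry {
        kmutex_t        p_lock;         /* protects only this entry */
        struct proc_entry *p_next;
        /* ... per-process data ... */
};

With the fine-grained form, a thread takes p_lock on just the entry it is changing, so threads working on different processes never contend with one another.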

In general, the Solaris kernel implements relatively fine-grained locking whenever possible, largely due to the dynamic nature of scaling locks with kernel structures as needed.

The kernel implements two types of mutex locks: spin locks and adaptive locks. Spin locks, as we discussed, spin in a tight loop if a desired lock is being held when a thread attempts to acquire the lock. Adaptive locks are the most common type of lock used and are designed to dynamically either spin or block when a lock is being held, depending on the state of the holder. We already discussed the trade-offs of spinning versus blocking. Implementing a locking scheme that only does one or the other can severely impact scalability and performance. It is much better to use an adaptive locking scheme, which is precisely what we do.

The mechanics of adaptive locks are straightforward. When a thread attempts to acquire a lock and the lock is being held, the kernel examines the state of the thread that is holding the lock. If the lock holder (owner) is running on a processor, the thread attempting to get the lock will spin. If the thread holding the lock is not running, the thread attempting to get the lock will block. This policy works quite well because the code is such that mutex hold times are very short (by design, the goal is to minimize the amount of code to be executed while a lock is held). So, if a thread is holding a lock and running, the lock will likely be released very soon, probably in less time than it takes to context-switch off and on again, so it's worth spinning.

On the other hand, if a lock holder is not running, then we know that minimally one context switch is involved before the holder will release the lock (getting the holder back on a processor to run), and it makes sense to simply block and free up the processor to do something else. The kernel will place the blocking thread on a turnstile (sleep queue) designed specifically for synchronization primitives and will wake the thread when the lock is released by the holder. (See Section 17.7.)

The other distinction between adaptive locks and spin locks has to do with interrupts, the dispatcher, and context switching. The kernel dispatcher is the code that selects threads for scheduling and does context switches. It runs at an elevated Priority Interrupt Level (PIL) to block interrupts (the dispatcher runs at priority level 11 on SPARC systems). High-level interrupts (interrupt levels 11 through 15 on SPARC systems) can interrupt the dispatcher. High-level interrupt handlers are not allowed to do anything that could require a context switch or to enter the dispatcher (we discuss this further in Section 3). Adaptive locks can block, and blocking means context switching, so only spin locks can be used in high-level interrupt handlers. Also, spin locks can raise the interrupt level of the processor when the lock is acquired.

struct kernel_data {
        kmutex_t        klock;
        char            *forw_ptr;
        char            *back_ptr;
        uint64_t        data1;
        uint64_t        data2;
} kdata;

void
function()
{
        .
        mutex_init(&kdata.klock, NULL, MUTEX_DEFAULT, NULL);
        .
        mutex_enter(&kdata.klock);
        kdata.data1 = 1;
        mutex_exit(&kdata.klock);
}

The preceding block of pseudo-code illustrates the general mechanics of mutex locks. A lock is declared in the code; in this case, it is embedded in the data structure that it is designed to protect. Once declared, the lock is initialized with the kernel mutex_init() function. Any subsequent reference to the kdata structure requires that the klock mutex be acquired with mutex_enter(). Once the work is done, the lock is released with mutex_exit(). The lock type, spin or adaptive, is determined in the mutex_init() code by the kernel. Assuming an adaptive mutex in this example, any kernel threads that make a mutex_enter() call on klock will either block or spin, depending on the state of the kernel thread that owns klock when the mutex_enter() is called.


17.5.2. Solaris Mutex Lock Implementation

The kernel defines different data structures for the two types of mutex locks, adaptive and spin, as shown below.

/*
 * Public interface to mutual exclusion locks.  See mutex(9F) for details.
 *
 * The basic mutex type is MUTEX_ADAPTIVE, which is expected to be used
 * in almost all of the kernel.  MUTEX_SPIN provides interrupt blocking
 * and must be used in interrupt handlers above LOCK_LEVEL.  The iblock
 * cookie argument to mutex_init() encodes the interrupt level to block.
 * The iblock cookie must be NULL for adaptive locks.
 *
 * MUTEX_DEFAULT is the type usually specified (except in drivers) to
 * mutex_init().  It is identical to MUTEX_ADAPTIVE.
 *
 * MUTEX_DRIVER is always used by drivers.  mutex_init() converts this to
 * either MUTEX_ADAPTIVE or MUTEX_SPIN depending on the iblock cookie.
 *
 * Mutex statistics can be gathered on the fly, without rebooting or
 * recompiling the kernel, via the lockstat driver (lockstat(7D)).
 */
typedef enum {
        MUTEX_ADAPTIVE = 0,     /* spin if owner is running, otherwise block */
        MUTEX_SPIN = 1,         /* block interrupts and spin */
        MUTEX_DRIVER = 4,       /* driver (DDI) mutex */
        MUTEX_DEFAULT = 6       /* kernel default mutex */
} kmutex_type_t;

typedef struct mutex {
#ifdef _LP64
        void    *_opaque[1];
#else
        void    *_opaque[2];
#endif
} kmutex_t;


The same 64-bit mutex object is used for each type of lock, as shown in Figure 17.5.

Figure 17.5. Solaris 10 Adaptive and Spin Mutex

In Figure 17.5, the m_owner field in the adaptive lock, which holds the address of the kernel thread that owns the lock (the kthread pointer), plays a double role, in that it also serves as the actual lock; successful lock acquisition for a thread means it has its kthread pointer set in the m_owner field of the target lock. If threads attempt to get the lock while it is held (waiters), the low-order bit (bit 0) of m_owner is set to reflect that case. Because kthread pointer values are always word aligned, they do not require bit 0, allowing this scheme to work.
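The encoding can be pictured with a pair of illustrative macros (invented names, not the kernel's):

#define MUTEX_WAITERS_BIT       0x1UL

/* The owner word is the kthread pointer plus the waiters bit. */
#define MUTEX_OWNER(word)       ((kthread_t *)((word) & ~MUTEX_WAITERS_BIT))
#define MUTEX_HAS_WAITERS(word) (((word) & MUTEX_WAITERS_BIT) != 0)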

/*
 * mutex_enter() assumes that the mutex is adaptive and tries to grab the
 * lock by doing an atomic compare and exchange on the first word of
 * the mutex.  If the compare and exchange fails, it means that either
 * (1) the lock is a spin lock, or (2) the lock is adaptive but already
 * held.  mutex_vector_enter() distinguishes these cases by looking at
 * the mutex type, which is encoded in the low-order bits of the owner
 * field.
 */
typedef union mutex_impl {
        /*
         * Adaptive mutex.
         */
        struct adaptive_mutex {
                uintptr_t _m_owner;     /* 0-3/0-7 owner and waiters bit */
#ifndef _LP64
                uintptr_t _m_filler;    /* 4-7 unused */
#endif
        } m_adaptive;

        /*
         * Spin Mutex.
         */
        struct spin_mutex {
                lock_t  m_dummylock;    /* 0 dummy lock (always set) */
                lock_t  m_spinlock;     /* 1 real lock */
                ushort_t m_filler;      /* 2-3 unused */
                ushort_t m_oldspl;      /* 4-5 old pil value */
                ushort_t m_minspl;      /* 6-7 min pil val if lock held */
        } m_spin;
} mutex_impl_t;

See sys/mutex_impl.h

The spin mutex, as we pointed out earlier, is used at high interrupt levels, where context switching is not allowed. Spin locks block interrupts while in the spin loop, so the kernel needs to maintain the priority level the processor was running at before entering the spin loop, which raises the processor's priority level. (Elevating the priority level is how interrupts are blocked.) The m_minspl field stores the priority level of the interrupt handler when the lock is initialized, and m_oldspl is set to the priority level the processor was running at when the lock code is called. The m_spinlock field holds the actual mutex lock bits.

Each kernel module and subsystem implementing one or more mutex locks calls into a common set of mutex functions. All locks must first be initialized by the mutex_init() function, whereby the lock type is determined on the basis of an argument passed in the mutex_init() call. The most common type passed into mutex_init() is MUTEX_DEFAULT, which results in the init code determining what type of lock, adaptive or spin, should be used. It is possible for a caller of mutex_init() to be specific about a lock type (for example, MUTEX_SPIN). If the init code is called from a device driver or any kernel module that registers and generates interrupts, then an interrupt block cookie is added to the argument list. An interrupt block cookie is an abstraction used by device drivers when they set their interrupt vector and parameters. The mutex_init() code checks the argument list for an interrupt block cookie. If mutex_init() is being called from a device driver to initialize a mutex to be used in a high-level interrupt handler, the lock type is set to spin. Otherwise, an adaptive lock is initialized. The test is the interrupt level in the passed interrupt block cookie; levels above LOCK_LEVEL (10 on SPARC systems) are considered high-level interrupts and thus require spin locks. The init code clears most of the fields in the mutex lock structure as appropriate for the lock type. The m_dummylock field in spin locks is set to all 1's (0xFF). We'll see why in a minute.
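In rough pseudo-C, the type-selection logic looks like the sketch below. INTR_LEVEL() is a placeholder for extracting the interrupt level from the interrupt block cookie; the real mutex_init() source differs in detail.

/* Sketch: how mutex_init() chooses the lock type. */
if (type == MUTEX_SPIN ||
    (type == MUTEX_DRIVER && ibc != NULL && INTR_LEVEL(ibc) > LOCK_LEVEL)) {
        lp->m_spin.m_dummylock = 0xFF;          /* always set; see below */
        lp->m_spin.m_minspl = INTR_LEVEL(ibc);  /* level to block at */
} else {
        lp->m_adaptive._m_owner = 0;            /* free adaptive lock */
}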

The primary mutex functions called, aside from mutex_init() (which is only called once for each lock at initialization time), are mutex_enter() to get a lock and mutex_exit() to release it. mutex_enter() assumes an available, adaptive lock. If the lock is held or is a spin lock, mutex_vector_enter() is entered to reconcile what should happen. This is a performance optimization. mutex_enter() is implemented in assembly code, and because the entry point is designed for the simple case (adaptive lock, not held), the amount of code that gets executed to acquire a lock when those conditions are true is minimal. Also, there are significantly more adaptive mutex locks than spin locks in the kernel, making the quick test case effective most of the time. The test for a lock held or spin lock is very fast. Here is where the m_dummylock field comes into play: mutex_enter() executes a compare-and-swap instruction on the first byte of the mutex, testing for a zero value. On a spin lock, the m_dummylock field is tested because of its positioning in the data structure and the endianness of SPARC processors. Since m_dummylock is always set (it is set to all 1's in mutex_init()), the test will fail for spin locks. The test will also fail for a held adaptive lock since such a lock will have a nonzero value in the byte field being tested. That is, the m_owner field will have a kthread pointer value for a held, adaptive lock.
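The fast path can thus be pictured as a single compare-and-swap on the owner word. The sketch below is C for readability (the real entry point is assembly); casptr() here stands in for an atomic compare-and-swap-of-pointer primitive that returns the old value.

extern uintptr_t casptr(uintptr_t *addr, uintptr_t cmp, uintptr_t newval);

void
mutex_enter_sketch(mutex_impl_t *lp)
{
        /* If the owner word is 0 (a free adaptive lock), claim it. */
        if (casptr(&lp->m_adaptive._m_owner, 0, (uintptr_t)curthread) != 0) {
                /* Nonzero: a spin lock, or an adaptive lock already held. */
                mutex_vector_enter(lp);
        }
}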

If the lock is an adaptive mutex and is not being held, the caller of mutex_enter() gets ownership of the lock. If the two conditions are not true, that is, either the lock is held or the lock is a spin lock, the code enters the mutex_vector_enter() function to sort things out. The mutex_vector_enter() code first tests the lock type. For spin locks, the m_oldspl field is set, based on the current Priority Interrupt Level (PIL) of the processor, and the lock is tested. If it's not being held, the lock is set (m_spinlock) and the code returns to the caller. A held lock forces the caller into a spin loop, where a loop counter is incremented (for statistical purposes; the lockstat(1M) data), and the code checks whether the lock is still held in each pass through the loop. Once the lock is released, the code breaks out of the loop, grabs the lock, and returns to the caller.

Adaptive locks require a little more work. When the code enters the adaptive code path (in mutex_vector_enter()), it increments the cpu_sysinfo.mutex_adenters (adaptive lock enters) field, as is reflected in the smtx column in mpstat(1M). mutex_vector_enter() then tests again to determine if the lock is owned (held), since the lock may have been released in the time interval between the call to mutex_enter() and the current point in the mutex_vector_enter() code. If the adaptive lock is not being held, mutex_vector_enter() attempts to acquire the lock. If successful, the code returns.

If the lock is held, mutex_vector_enter() determines whether or not the lock owner is running by looping through the CPU structures and testing the lock m_owner against the cpu_thread field of the CPU structure. (cpu_thread contains the kernel thread address of the thread currently executing on the CPU.) A match indicates the holder is running, which means the adaptive lock will spin. No match means the owner is not running, in which case the caller must block. In the blocking case, the kernel turnstile code is entered to locate or acquire a turnstile, in preparation for placement of the kernel thread on a sleep queue associated with the turnstile.
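A sketch of that owner-running test, assuming the kernel's circular list of cpu_t structures linked through cpu_next:

/* Sketch: is the thread owning the lock currently on a processor? */
static int
mutex_owner_running_sketch(kthread_t *owner)
{
        cpu_t *cp = cpu_list;

        do {
                if (cp->cpu_thread == owner)
                        return (1);     /* owner is running: spin */
                cp = cp->cpu_next;
        } while (cp != cpu_list);
        return (0);                     /* owner not running: block */
}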

The turnstile placement happens in two phases. After mutex_vector_enter() determines that the lock holder is not running, it makes a turnstile call to look up the turnstile, sets the waiters bit in the lock, and retests to see if the owner is running. If yes, the code releases the turnstile and enters the adaptive lock spin loop, which attempts to acquire the lock. Otherwise, the code places the kernel thread on a turnstile (sleep queue) and changes the thread's state to sleep. That effectively concludes the sequence of events in mutex_vector_enter().

Dropping out of mutex_vector_enter(), either the caller ended up with the lock it was attempting to acquire or the calling thread is on a turnstile sleep queue associated with the lock. In either case, the lockstat(1M) data is updated, reflecting the lock type, spin time, or sleep time as the last bit of work done in mutex_vector_enter().

lockstat(1M) is a kernel lock statistics command that was introduced in Solaris 2.6. It provides detailed information on kernel mutex and reader/writer locks.

The algorithm described in the previous paragraphs is summarized in pseudocode below.

mutex_vector_enter()
        if (lock is a spin lock)
                lock_set_spl() /* enter spin-lock specific code path */
        increment cpu_sysinfo.mutex_adenters
spin_loop:
        if (lock is not owned)
                mutex_trylock() /* try to acquire the lock */
                if (lock acquired)
                        goto bottom
                else
                        continue /* lock being held */
        if (lock owner is running on a processor)
                goto spin_loop
        else
                lookup turnstile for the lock
                set waiters bit
                if (lock owner is running on a processor)
                        drop turnstile
                        goto spin_loop
                else
                        block /* on the sleep queue associated with the turnstile */
bottom:
        update lockstat statistics

When a thread has finished working in a lock-protected data area, it calls the mutex_exit() code to release the lock. The entry point is implemented in assembly language and handles the simple case of freeing an adaptive lock with no waiters. With no threads waiting for the lock, it's a simple matter of clearing the lock fields (m_owner) and returning. The C language function mutex_vector_exit() is entered from mutex_exit() for anything but the simple case.

In the case of a spin lock, the lock field is cleared and the processor is returned to the PIL level it was running at before entering the lock code. For adaptive locks, a waiter must be selected from the turnstile (if there is more than one waiter), have its state changed from sleeping to runnable, and be placed on a dispatch queue so it can execute and get the lock. If the thread releasing the lock was the beneficiary of priority inheritance, meaning that it had its priority improved when a calling thread with a better priority was not able to get the lock, then the thread releasing the lock will have its priority reset to what it was before the inheritance. Priority inheritance is discussed in Section 17.7.

When an adaptive lock is released, the code clears the waiters bit in m_owner and calls the turnstile function to wake up all the waiters. Readers familiar with sleep/wakeup mechanisms of operating systems have likely heard of a particular behavior known as the "thundering herd problem," a situation in which many threads that have been blocking for the same resource are all woken up at the same time and make a mad dash for the resource (a mutex in this case), like a herd of large, four-legged beasts running toward the same object. System behavior tends to go from a relatively small run queue to a large run queue (all the threads have been woken up and made runnable) and high CPU utilization until a thread gets the resource, at which point a bunch of threads are sleeping again, the run queue normalizes, and CPU utilization flattens out. This is a generic behavior that can occur on any operating system.

The wakeup mechanism used when mutex_vector_exit() is called may seem like an open invitation to thundering herds, but in practice it turns out not to be a problem. The main reason is that the blocking case for threads waiting for a mutex is rare; most of the time the threads will spin. If a blocking situation does arise, it typically does not reach a point where very many threads are blocked on the mutex; one of the characteristics of the thundering herd problem is resource contention resulting in a lot of sleeping threads. The kernel code segments that implement mutex locks are, by design, short and fast, so locks are not held for long. Code that requires longer lock-hold times uses a reader/writer write lock, which provides mutual exclusion semantics with a selective wakeup algorithm. There are, of course, other reasons for choosing reader/writer locks over mutex locks, the most obvious being to allow multiple readers to see the protected data.

17.6. Reader/Writer Locks

Reader/writer (RW) locks provide mutual exclusion semantics on write locks. Only one thread at a time is allowed to own the write lock, but there is concurrent access for readers. These locks are designed for scenarios in which it is acceptable to have multiple threads reading the data at the same time, but only one writer. While a writer is holding the lock, no readers are allowed. Also, because of the wakeup mechanism, a writer lock is a better solution for kernel code segments that require relatively long hold times, as we will see shortly.


The basic mechanics of RW locks are similar to mutexes, in that RW locks have an initialization function (rw_init()), an entry function to acquire the lock (rw_enter()), and an exit function to release the lock (rw_exit()). The entry and exit points are optimized in assembly code to deal with the simple cases, and they call into C language functions if anything beyond the simplest case must be dealt with. As with mutex locks, the simple case is that the requested lock is available on an entry (acquire) call and no threads are waiting for the lock on the exit (release) call.
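In outline, kernel code uses the interface as sketched below, in the style of the earlier mutex example (the structure and field names are invented; rw_init() would be called once before first use):

struct shared_obj {
        krwlock_t       obj_rwlock;     /* protects the fields below */
        int             obj_value;
} sobj;

void
reader_function(void)
{
        int v;

        rw_enter(&sobj.obj_rwlock, RW_READER); /* many readers may hold */
        v = sobj.obj_value;                    /* read-only access */
        rw_exit(&sobj.obj_rwlock);
}

void
writer_function(void)
{
        rw_enter(&sobj.obj_rwlock, RW_WRITER); /* exclusive access */
        sobj.obj_value++;
        rw_exit(&sobj.obj_rwlock);
}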

17.6.1. Solaris Reader/Writer Locks

Reader/writer locks are implemented as a single-word data structure in the kernel, either 32 bits or 64 bits wide, depending on the data model of the running kernel, as depicted in Figure 17.6.

Figure 17.6. Reader/Writer Lock

typedef struct rwlock_impl {
        uintptr_t       rw_wwwh;        /* waiters, write wanted, hold count */
} rwlock_impl_t;

#endif  /* _ASM */

#define RW_HAS_WAITERS          1
#define RW_WRITE_WANTED         2
#define RW_WRITE_LOCKED         4
#define RW_READ_LOCK            8
#define RW_WRITE_LOCK(thread)   ((uintptr_t)(thread) | RW_WRITE_LOCKED)
#define RW_HOLD_COUNT           (-RW_READ_LOCK)
#define RW_HOLD_COUNT_SHIFT     3       /* log2(RW_READ_LOCK) */
#define RW_READ_COUNT           RW_HOLD_COUNT
#define RW_OWNER                RW_HOLD_COUNT
#define RW_LOCKED               RW_HOLD_COUNT
#define RW_WRITE_CLAIMED        (RW_WRITE_LOCKED | RW_WRITE_WANTED)
#define RW_DOUBLE_LOCK          (RW_WRITE_LOCK(0) | RW_READ_LOCK)

See sys/rwlock.h

There are two states for the reader/writer lock, depending on whether the lock is held by a writer, as indicated by bit 2, wrlock. Bit 2, wrlock, is the actual write lock, and it determines the meaning of the high-order bits. If the write lock is held (bit 2 set), then the upper bits contain a pointer to the kernel thread holding the write lock. If bit 2 is clear, then the upper bits contain a count of the number of threads holding the lock as a read lock.
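Using the macros shown above, the two states of the lock word can be decoded roughly as follows (an illustrative fragment):

/* Sketch: decoding the rw_wwwh word of an rwlock_impl_t. */
uintptr_t w = lp->rw_wwwh;

if (w & RW_WRITE_LOCKED) {
        /* Upper bits hold the owning writer's kthread pointer. */
        kthread_t *owner = (kthread_t *)(w & RW_OWNER);
} else {
        /* Upper bits hold the count of threads with read locks. */
        uintptr_t readers = (w & RW_HOLD_COUNT) >> RW_HOLD_COUNT_SHIFT;
}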

The Solaris 10 RW lock defines bit 0, the wait bit, set to signify that threads are waiting for the lock. The wrwant bit (write wanted, bit 1) indicates that at least one thread is waiting for a write lock. The simple cases for lock acquisition through rw_enter() are the circumstances listed below:

The write lock is wanted and is available.

The read lock is wanted, the write lock is not held, and no threads are waiting for the write lock (wrwant is clear).

The acquisition of the write lock results in bit 2 getting set and the kernel thread pointer getting loaded in the upper bits. For a reader, the hold count (upper bits) is incremented. Conditions where the write lock is being held, causing a lock request to fail, or where a thread is waiting for a write lock, causing a read lock request to fail, result in a call to the rw_enter_sleep() function.

Important to note is that the rw_enter() code sets a flag in the kernel thread used by the dispatcher code when establishing a kernel thread's priority before preemption or changing state to sleep. We cover this in more detail in the paragraph beginning "It is in the dispatcher queue insertion code" on page 262. Briefly, the kernel thread structure contains a t_kpri_req (kernel priority request) field that is checked in the dispatcher code when a thread is about to be preempted (forced off the processor on which it is executing because a higher-priority thread becomes runnable) or when the thread is about to have its state changed to sleep. If the t_kpri_req flag is set, the dispatcher assigns a kernel priority to the thread, such that when the thread resumes execution, it will run before threads in scheduling classes of lower priority (timeshare and interactive class threads). More succinctly, the priority of a thread holding a write lock is set to a better priority to minimize the hold time of the lock.

Getting back to the rw_enter() flow: If the code falls through the simple case, we need to set up the kernel thread requesting the RW lock to block. The sequence is summarized in pseudocode after this list.

1. rw_enter_sleep() establishes whether the calling thread is requesting a read or write lock and does another test to see if the lock is available. If it is, the caller gets the lock, the lockstat(1M) statistics are updated, and the code returns. If the lock is not available, then the turnstile code is called to look up a turnstile in preparation for putting the calling thread to sleep.

2. With a turnstile now available, another test is made on the lock availability. (On today's fast processors, and especially multiprocessor systems, it's quite possible that the thread holding the lock finished what it was doing and the lock became available.) Assuming the lock is still held, the thread is set to a sleep state and placed on a turnstile.

3. The RW lock structure will have the wait bit set for a reader waiting (forced to block because a writer has the lock) or the wrwant bit set to signify that a thread wanting the write lock is blocking.

4. The cpu_sysinfo structure for the processor maintains two counters for failures to get a read lock or write lock on the first pass: rw_rdfails and rw_wrfails. The appropriate counter is incremented just prior to the turnstile call; this action places the thread on a turnstile sleep queue. The mpstat(1M) command sums the counters and displays the fails-per-second in the srw column of its output.
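In the same pseudocode style used earlier for mutex_vector_enter(), the sequence just described can be summarized like this (a sketch, not the literal source):

rw_enter_sleep()
        determine whether a read or a write lock is requested
        if (lock is available)
                acquire the lock; update lockstat statistics; return
        lookup turnstile for the lock
        if (lock has become available)
                drop turnstile; acquire the lock; return
        set wait bit (blocked reader) or wrwant bit (blocked writer)
        increment cpu_sysinfo.rw_rdfails or cpu_sysinfo.rw_wrfails
        block /* place the thread on the turnstile sleep queue */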

The acquisition of a RW lock and subsequent behavior if the lock is held are straightforward and similar in many ways to what happens in the mutex case. Things get interesting when a thread calls rw_exit() to release a lock it is holding; there are several potential solutions to the problem of determining which thread gets the lock next. For mutexes, a wakeup is issued on all threads that are sleeping, waiting for the mutex, and we know from empirical data that this solution works well for reasons previously discussed. With RW locks, we're dealing with potentially longer hold times, which could result in more sleepers; a desire to give writers priority over readers (it's typically best to not have a reader read data that's about to be changed by a pending writer); and the potential for the priority inversion problem described in Section 17.7.


For rw_exit(), which is called by the lock holder when it is ready to release the lock, the simple case is that there are no waiters. In this case, the wrlock bit is cleared if the holder was a writer, or the hold count field is decremented to reflect one less reader. The more complex case of the system having waiters when the lock is released is dealt with in the following manner (and summarized in pseudocode after the list):

1. The kernel does a direct transfer of ownership of the lock to one or more of the threads waiting for the lock when the lock is released, either to the next writer or to a group of readers if more than one reader is blocking and no writers are blocking.

This situation is very different from the case of the mutex implementation, for which the wakeup is issued and a thread must obtain lock ownership in the usual fashion. Here, a thread or threads wake up owning the lock they were blocking on.

The algorithm used to figure out who gets the lock next addresses several requirements that provide for generally balanced system performance. The kernel needs to minimize the possibility of starvation (a thread never getting the resource it needs to continue executing) while allowing writers to take precedence whenever possible.

2. rw_exit_wakeup() retests for the simple case and drops the lock if there are no waiters (clear wrlock or decrement the hold count).

3. When waiters are present, the code grabs the turnstile (sleep queue) associated with the lock and saves the pointer to the kernel thread of the next write waiter that was on the turnstile's sleep queue (if one exists).

The turnstile sleep queues are organized as a FIFO (first in, first out) queue, so the queue management (turnstile code) makes sure that the thread that was waiting the longest (the first in) is the thread that is selected as the next writer (first out). Thus, part of the fairness policy we want to enforce is covered.

The remaining bits of the algorithm go as follows:

4. If a writer is releasing the write lock and there are waiting readers and writers, readers of the same or higher priority than the highest-priority blocked writer are granted the read lock.

5. The readers are handed ownership and then woken up by the turnstile_wakeup() kernel function.

These readers also inherit the priority of the writer that released the lock if the reader thread is of a lower priority (inheritance is done on a per-reader thread basis when more than one thread is being woken up).

Lock ownership handoff is a relatively simple operation. For read locks, there is no notion of a lock owner, so it's a matter of setting the hold count in the lock to reflect the number of readers coming off the turnstile, then issuing the wakeup of each reader.

6. An exiting reader always grants the lock to a waiting writer, even if there are higher-priority readers blocked.

7. It is possible for a reader freeing the lock to have waiting readers, although it may not be intuitive, given the multiple-reader design of the lock. If a reader is holding the lock and a writer comes along, the wrwant bit is set to signify that a writer is waiting for the lock. With wrwant set, subsequent readers cannot get the lock; we want the holding readers to finish so the writer can get the lock. Therefore, it is possible for a reader to execute rw_exit_wakeup() with waiting writers and readers.

    'he Nlet3s fa"or writers ut e fair to readersN policy descried ao"e was

    first implemented in Solaris =./.
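The policy reduces to a few branches. The following self-contained C model illustrates steps 1, 4, and 6 above; it is a sketch for clarity, not the kernel's rw_exit_wakeup() implementation, and the waiter_t type and release_policy() function are invented for the example.

#include <stdio.h>
#include <stddef.h>

typedef struct waiter {
        int pri;                /* scheduling priority; larger is better */
        int is_writer;          /* 1 = waiting writer, 0 = waiting reader */
} waiter_t;

static void
release_policy(waiter_t *wq, size_t n, int writer_exiting)
{
        waiter_t *best_writer = NULL;
        size_t i, woken = 0;

        for (i = 0; i < n; i++)         /* highest-priority waiting writer */
                if (wq[i].is_writer &&
                    (best_writer == NULL || wq[i].pri > best_writer->pri))
                        best_writer = &wq[i];

        if (!writer_exiting && best_writer != NULL) {
                /* step 6: an exiting reader always favors a waiting writer */
                printf("handoff to writer, pri %d\n", best_writer->pri);
                return;
        }
        for (i = 0; i < n; i++)         /* step 4: qualified readers win */
                if (!wq[i].is_writer &&
                    (best_writer == NULL || wq[i].pri >= best_writer->pri)) {
                        printf("wake reader, pri %d (owns read lock)\n",
                            wq[i].pri);
                        woken++;
                }
        if (woken == 0 && best_writer != NULL)  /* step 1: direct handoff */
                printf("handoff to writer, pri %d\n", best_writer->pri);
}

int
main(void)
{
        waiter_t wq[] = { { 60, 0 }, { 50, 1 }, { 55, 0 }, { 40, 0 } };

        /* a writer exits: readers at pri 60 and 55 beat the pri-50 writer */
        release_policy(wq, sizeof (wq) / sizeof (wq[0]), 1);
        return (0);
}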

17.7. Turnstiles and Priority Inheritance

A turnstile is a data abstraction that encapsulates sleep queues and priority inheritance information associated with mutex locks and reader/writer locks. The mutex and RW lock code use a turnstile when a kernel thread needs to block on a requested lock. The sleep queues implemented for other resource waits do not provide an elegant method of dealing with the priority inversion problem through priority inheritance. Turnstiles were created to address that problem.

Priority inversion describes a scenario in which a higher-priority thread is unable to run because a lower-priority thread is holding a resource it needs, such as a lock. The Solaris kernel addresses the priority inversion problem in its turnstile implementation, providing a priority inheritance mechanism, whereby the higher-priority thread can will its priority to the lower-priority thread holding the resource it requires. The beneficiary of the inheritance, the thread holding the resource, now has a higher scheduling priority and thus gets scheduled to run sooner, so it can finish its work and release the resource, at which point the original priority is returned to the thread.
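To make the mechanics concrete, here is a minimal, self-contained C model of willing and then waiving priority; the thread_model_t type and function names are illustrative, not kernel interfaces.

#include <stdio.h>

typedef struct thread_model {
        int pri;        /* current scheduling priority */
        int base_pri;   /* priority before any inheritance */
} thread_model_t;

static void
will_priority(thread_model_t *holder, const thread_model_t *blocker)
{
        if (blocker->pri > holder->pri)         /* larger value = higher pri */
                holder->pri = blocker->pri;     /* holder inherits */
}

static void
release_resource(thread_model_t *holder)
{
        holder->pri = holder->base_pri;         /* inheritance is waived */
}

int
main(void)
{
        thread_model_t holder = { 29, 29 }, blocker = { 100, 100 };

        will_priority(&holder, &blocker);       /* holder now runs at 100 */
        printf("holder pri while holding: %d\n", holder.pri);
        release_resource(&holder);              /* back to 29 on release */
        printf("holder pri after release: %d\n", holder.pri);
        return (0);
}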


In this section, we assume you have some level of knowledge of kernel thread priorities, which are covered in Section 3.7. Because turnstiles and priority inheritance are an integral part of the implementation of mutex and RW locks, we thought it best to discuss them here rather than later. For this discussion, it is important to be aware of these points:

- The Solaris kernel assigns a global priority to kernel threads, based on the scheduling class they belong to.

- Kernel threads in the timeshare and interactive scheduling classes have their priorities adjusted over time, based on three things: the amount of time the threads spend running on a processor, sleep time (blocking), and the case when they are preempted. Threads in the real-time class are fixed priority; the priorities are never changed, regardless of run time or sleep time, unless explicitly changed through programming interfaces or commands.

The Solaris kernel implements sleep queues for the placement of kernel threads blocking on (waiting for) a resource or event. For most resource waits, such as those for a disk or network I/O, sleep queues, in conjunction with condition variables, manage the systemwide queue of sleeping threads. These sleep queues are covered in Section 3.10. This set of sleep queues is separate and distinct from turnstile sleep queues.

17.7.1. Turnstiles Implementation

Figure 17.7 illustrates the Solaris 10 turnstiles. Turnstiles are maintained in a systemwide hash table, turnstile_table:


Figure 17.7. Turnstiles

typedef struct turnstile_chain {
        turnstile_t     *tc_first;      /* first turnstile on hash chain */
        disp_lock_t     tc_lock;        /* lock for this hash chain */
} turnstile_chain_t;

turnstile_chain_t       turnstile_table[2 * TURNSTILE_HASH_SIZE];


#define TS_NUM_Q        2               /* number of sleep queues per turnstile */

typedef struct turnstile turnstile_t;
struct _sobj_ops;

struct turnstile {
        turnstile_t     *ts_next;       /* next on hash chain */
        turnstile_t     *ts_free;       /* next on freelist */
        void            *ts_sobj;       /* s-object threads are blocking on */
        int             ts_waiters;     /* number of blocked threads */
        pri_t           ts_epri;        /* max priority of blocked threads */
        struct _kthread *ts_inheritor;  /* thread inheriting priority */
        turnstile_t     *ts_prioinv;    /* next in inheritor's t_prioinv list */
        sleepq_t        ts_sleepq[TS_NUM_Q];    /* reader and writer sleep queues */
};
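Lookup into turnstile_table is by a hash on the address of the synchronization object. A minimal sketch of such an index computation follows; the bit-folding shown is hypothetical, not the kernel's actual macro, and assumes a power-of-two table size.

#include <stdint.h>
#include <stdio.h>

#define TURNSTILE_HASH_SIZE     128     /* assumed power-of-two table size */

static unsigned int
turnstile_hash(const void *sobj)
{
        uintptr_t a = (uintptr_t)sobj;

        /* fold high and low address bits so nearby locks spread out */
        return (unsigned int)(((a >> 2) ^ (a >> 9)) &
            (TURNSTILE_HASH_SIZE - 1));
}

int
main(void)
{
        static int some_lock;   /* stands in for a kmutex_t or krwlock_t */

        printf("chain index: %u\n", turnstile_hash(&some_lock));
        return (0);
}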


The first thread to block on a lock donates its turnstile to the lock, which now has a turnstile, so subsequent threads that block on the same lock will donate their turnstiles to the free list on the chain (the ts_free link off the active turnstile).

In turnstile_block(), the pointers are set up as determined by the return from turnstile_lookup(). If the turnstile pointer is NULL, we link up to the turnstile pointed to by the kernel thread's t_ts pointer. If the pointer returned from the lookup is not NULL, there's already at least one kthread waiting on the lock, so the code sets up the pointer links appropriately and places the kthread's turnstile on the free list.

The thread is then put into a sleep state through the scheduling-class-specific sleep routine (for example, ts_sleep()). The ts_waiters field in the turnstile is incremented, the thread's t_wchan is set to the address of the lock, and t_sobj_ops in the thread is set to the address of the lock's operations vectors: the owner, unsleep, and change_priority functions. The kernel sleepq_insert() function actually places the thread on the sleep queue associated with the turnstile.

The code does the priority inversion check (now called out of the turnstile_block() code), builds the priority inversion links, and applies the necessary priority changes. The priority inheritance rules apply; that is, if the priority of the lock holder is less (worse) than the priority of the requesting thread, the requesting thread's priority is "willed" to the holder. The holder's t_epri field is set to the new priority, and the inheritor pointer in the turnstile is linked to the kernel thread. All the threads on the blocking chain are potential inheritors, based on their priority relative to the calling thread.

At this point, the dispatcher is entered through a call to swtch(), and another kernel thread is removed from a dispatch queue and context-switched onto a processor.
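The propagation along the blocking chain can be modeled in a few lines of self-contained C; the kthread_model_t type and propagate_pri() function are invented for the example and simplify what turnstile_block() actually does.

#include <stdio.h>
#include <stddef.h>

typedef struct kthread_model {
        int pri;                                /* scheduling priority */
        struct kthread_model *blocked_on_owner; /* holder of the wanted lock */
} kthread_model_t;

static void
propagate_pri(kthread_model_t *waiter)
{
        kthread_model_t *t;

        /* every holder down the blocking chain is a potential inheritor */
        for (t = waiter->blocked_on_owner; t != NULL; t = t->blocked_on_owner)
                if (t->pri < waiter->pri)
                        t->pri = waiter->pri;   /* will the waiter's priority */
}

int
main(void)
{
        kthread_model_t t3 = { 10, NULL };      /* holds lock B */
        kthread_model_t t2 = { 20, &t3 };       /* holds lock A, blocked on B */
        kthread_model_t t1 = { 60, &t2 };       /* blocked on A */

        propagate_pri(&t1);
        printf("t2 pri %d, t3 pri %d\n", t2.pri, t3.pri); /* both now 60 */
        return (0);
}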

The wakeup mechanics are initiated as previously described, where a call to the lock exit routine results in a turnstile_wakeup() call if threads are blocking on the lock. turnstile_wakeup() does essentially the reverse of turnstile_block(): threads that inherited a better priority have that priority waived, and the thread is removed from the sleep queue and given a turnstile from the chain's free list. Recall that a thread donated its turnstile to the free list if it was not the first thread placed on the blocking chain for the lock; coming off the turnstile, threads get a turnstile back. Once the thread is


unlinked from the sleep queue, the scheduling-class wakeup code is entered, and the thread is put back on a processor's dispatch queue.
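The donate-and-reclaim behavior can be modeled with a simple free list. This is an illustrative sketch with hypothetical types, not the kernel's turnstile code: the first blocker's turnstile becomes the lock's active turnstile, later blockers push theirs onto the free list, and each woken thread takes one back.

#include <assert.h>
#include <stdio.h>
#include <stddef.h>

typedef struct ts_model {
        struct ts_model *ts_free;       /* next on free list */
} ts_model_t;

typedef struct lock_model {
        ts_model_t *active;             /* active turnstile, if any */
} lock_model_t;

static void
donate(lock_model_t *lp, ts_model_t *ts)
{
        if (lp->active == NULL) {
                lp->active = ts;        /* first blocker: becomes active */
        } else {
                ts->ts_free = lp->active->ts_free;      /* push on free list */
                lp->active->ts_free = ts;
        }
}

static ts_model_t *
reclaim(lock_model_t *lp)
{
        ts_model_t *ts;

        assert(lp->active != NULL);
        if (lp->active->ts_free != NULL) {      /* pop from free list */
                ts = lp->active->ts_free;
                lp->active->ts_free = ts->ts_free;
        } else {
                ts = lp->active;        /* last waiter takes the active one */
                lp->active = NULL;
        }
        return (ts);
}

int
main(void)
{
        ts_model_t t1 = { NULL }, t2 = { NULL };
        lock_model_t lk = { NULL };

        donate(&lk, &t1);       /* first blocker donates the active turnstile */
        donate(&lk, &t2);       /* second blocker donates to the free list */
        printf("first woken thread gets %p\n", (void *)reclaim(&lk));
        printf("last woken thread gets %p\n", (void *)reclaim(&lk));
        return (0);
}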

17.8. Kernel Semaphores

Semaphores provide a method of synchronizing access to a sharable resource by multiple processes or threads. A semaphore can be used as a binary lock for exclusive access or as a counter, allowing for concurrent access by multiple threads to a finite number of shared resources.

In the counter implementation, the semaphore value is initialized to the number of shared resources (these semaphores are sometimes referred to as counting semaphores). Each time a process needs a resource, the semaphore value is decremented to indicate there is one less of the resource. When the process is finished with the resource, the semaphore value is incremented. A semaphore value of 0 tells the calling process that no resources are currently available, and the calling process blocks until another process finishes using the resource and frees it. These functions are historically referred to as semaphore P and V operations: the P operation attempts to acquire the semaphore, and the V operation releases it.
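As a concrete example, a counting semaphore guarding a pool of eight buffers might be used as follows. This sketch is in the style of the kernel's semaphore interfaces (cf. semaphore(9F)); the pool name, size, and surrounding functions are invented for the example.

#include <sys/ksynch.h>

#define NBUFS   8               /* number of shared resources in the pool */

static ksema_t pool_sema;

void
pool_setup(void)
{
        /* the semaphore value starts at the number of available resources */
        sema_init(&pool_sema, NBUFS, NULL, SEMA_DRIVER, NULL);
}

void
pool_use(void)
{
        sema_p(&pool_sema);     /* P: decrements; blocks when the count is 0 */
        /* ... use one resource ... */
        sema_v(&pool_sema);     /* V: increments; wakes a sleeper if any */
}

void
pool_teardown(void)
{
        sema_destroy(&pool_sema);
}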

The Solaris kernel uses semaphores where appropriate, when the constraints for atomicity on lock acquisition are not as stringent as they are in the areas where mutex and RW locks are used. Also, the counting functionality that semaphores provide makes them a good fit for things like the allocation and deallocation of a fixed amount of a resource.

The kernel semaphore structure maintains a sleep queue for the semaphore and a count field that reflects the value of the semaphore, as shown in Figure 17.8. The figure illustrates the look of a kernel semaphore for all Solaris releases covered in this book.

Figure 17.8. Kernel Semaphore

Kernel functions for semaphores include an initialization routine (sema_init()), a destroy function (sema_destroy()), the traditional P and V operations (sema_p() and sema_v()), and a test function (test for semaphore held,


sema_held()). There are a few other support functions, as well as some variations on the sema_p() function, which we discuss later.

The init function simply sets the count value in the semaphore, based on the value passed as an argument to the sema_init() routine. The s_slpq pointer is set to NULL, and the semaphore is initialized. The sema_destroy() function is used when the semaphore is an integral part of a resource that is dynamically created and destroyed as the resource gets used and subsequently released. For example, the bio (block I/O) subsystem in the kernel, which manages buf structures for page I/O support through the file system, uses semaphores on a per-buf-structure basis. Each buffer has two semaphores, which are initialized when a buffer is allocated by sema_init(). Once the I/O is completed and the buffer is released, sema_destroy() is called as part of the buffer release code. (sema_destroy() just nulls the s_slpq pointer.)
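A per-resource pattern along the lines of the bio example might look like the following sketch; the mybuf_t structure and its field names are illustrative, not the kernel's struct buf, and the initial count values are assumptions for the example.

#include <sys/ksynch.h>

typedef struct mybuf {
        ksema_t b_sem;          /* exclusive access to the buffer */
        ksema_t b_io;           /* signaled when the I/O completes */
} mybuf_t;

void
mybuf_alloc_init(mybuf_t *bp)
{
        /* assumed: 1 = buffer free; 0 = nothing to wait on yet */
        sema_init(&bp->b_sem, 1, NULL, SEMA_DEFAULT, NULL);
        sema_init(&bp->b_io, 0, NULL, SEMA_DEFAULT, NULL);
}

void
mybuf_release(mybuf_t *bp)
{
        /* semaphores are destroyed when the buffer is freed */
        sema_destroy(&bp->b_sem);
        sema_destroy(&bp->b_io);
}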

Kernel threads that must access a resource controlled by a semaphore call the sema_p() function, which requires that the semaphore count value be greater than 0 in order to return success. If the count is greater than 0, the count is decremented in the semaphore and the code returns to the caller. If the count is 0, then the semaphore is not available and the calling thread must block: a sleep queue is located from the systemwide array of sleep queues, the thread state is changed to sleep, and the thread is placed on the sleep queue. Note that turnstiles are not used for semaphores; turnstiles are an implementation of sleep queues specifically for mutex and RW locks. Kernel threads blocked on anything other than mutexes and RW locks are placed on sleep queues.

Sleep queues are discussed in more detail in Section 3.10. Briefly, though, sleep queues are organized as a linked list of kernel threads, with each linked list rooted in an array referenced through a sleepq_head kernel pointer. Figure 17.9 illustrates how sleep queues are organized.

Figure 17.9. Sleep Queues


A hashing function indexes the sleepq_head array, hashing on the address of the object. A singly linked list establishes the beginning of the doubly linked sublists of kthreads at the same priority, in ascending order based on priority. The sublist is implemented with a t_priforw (forward pointer) and t_priback (previous pointer) in the kernel thread. Also, a t_sleepq pointer points back to the array entry in sleepq_head, identifying which sleep queue the thread is on and providing a quick method to determine whether a thread is on a sleep queue at all; if the thread's t_sleepq pointer is NULL, then the thread is not on a sleep queue.
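The priority-ordered insertion can be modeled with a simple singly linked list. This self-contained C sketch (with invented type and function names) keeps ascending priority order and FIFO ordering among threads of equal priority, simplifying the kernel's doubly linked sublists.

#include <stdio.h>
#include <stddef.h>

typedef struct kt_model {
        int t_pri;              /* queue is kept in ascending priority order */
        char name;
        struct kt_model *t_link;
} kt_model_t;

static void
sleepq_insert_model(kt_model_t **head, kt_model_t *t)
{
        kt_model_t **pp = head;

        /* walk to the insertion point; '<=' keeps FIFO order within a pri */
        while (*pp != NULL && (*pp)->t_pri <= t->t_pri)
                pp = &(*pp)->t_link;
        t->t_link = *pp;
        *pp = t;
}

int
main(void)
{
        kt_model_t a = { 30, 'a', NULL }, b = { 10, 'b', NULL };
        kt_model_t c = { 30, 'c', NULL }, *t, *q = NULL;

        sleepq_insert_model(&q, &a);
        sleepq_insert_model(&q, &b);
        sleepq_insert_model(&q, &c);    /* lands after 'a' at equal priority */
        for (t = q; t != NULL; t = t->t_link)
                printf("pri %d (%c)\n", t->t_pri, t->name);
        return (0);
}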

Inside the sema_p() function, if we have a semaphore count value of 0, the semaphore is not available and the calling kernel thread needs to be placed on a sleep queue. A sleep queue is located through a hash function into the sleepq_head array, which hashes on the address of the object the thread is blocking on, in this case, the address of the semaphore. The code also grabs the sleep queue lock, sq_lock (see Figure 17.9), to block any further inserts or removals from the sleep queue until the insertion of the current kernel thread has been completed (that's what locks are for).

The scheduling-class-specific sleep function is called to set the thread wakeup priority and to change the thread state from ONPROC (running on a processor) to SLEEP. The kernel thread's t_wchan (wait channel) pointer is set to the address of the semaphore it's blocking on, and the thread's t_sobj_ops pointer is set to reference the sema_sobj_ops structure. The thread is now in a sleep state on a sleep queue.

A semaphore is released by the sema_v() function, which has the exact opposite effect of sema_p() and behaves very much like the lock release functions we've examined up to this point. The semaphore value is incremented, and if any threads are sleeping on the semaphore, the one that has been sitting on the sleep queue longest is woken up. Semaphore wakeups always involve waking one waiter at a time.

Semaphores are used in relatively few areas of the operating system: the buffer I/O (bio) module, the dynamically loadable kernel module code, and a couple of device drivers.

17.9. DTrace Lockstat Provider

The lockstat provider makes available probes that can be used to discern lock contention statistics or to understand virtually any aspect of locking behavior.


The lockstat(1M) command is actually a DTrace consumer that uses the lockstat provider to gather its raw data.

17.9.1. Overview

The lockstat provider makes available two kinds of probes: contention-event probes and hold-event probes.

Contention-event probes correspond to contention on a synchronization primitive; they fire when a thread is forced to wait for a resource to become available. Solaris is generally optimized for the noncontention case, so prolonged contention is not expected. These probes should be used to understand those cases where contention does arise. Because contention is relatively rare, enabling contention-event probes generally doesn't substantially affect performance.

Hold-event probes correspond to acquiring, releasing, or otherwise manipulating a synchronization primitive. These probes can be used to answer arbitrary questions about the way synchronization primitives are manipulated. Because Solaris acquires and releases synchronization primitives very often (on the order of millions of times per second per CPU on a busy system), enabling hold-event probes has a much higher probe effect than does enabling contention-event probes. While the probe effect induced by enabling them can be substantial, it is not pathological; they may still be enabled with confidence on production systems.

The lockstat provider makes available probes that correspond to the different synchronization primitives in Solaris; these primitives and the probes that correspond to them are discussed in the remainder of this chapter.

17.9.2. Adaptive Lock Probes

Adaptive locks enforce mutual exclusion to a critical section and can be acquired in most contexts in the kernel. Because adaptive locks have few context restrictions, they comprise the vast majority of synchronization primitives in the Solaris kernel. These locks are adaptive in their behavior with respect to contention: when a thread attempts to acquire a held adaptive lock, it determines whether the owning thread is currently running on a CPU. If the owner is running on another CPU, the acquiring thread spins; if the owner is not running, the acquiring thread blocks.
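The spin-or-block decision reduces to a single test, modeled below in self-contained C; the type and function are invented for illustration and omit the retry loop and state checks of the real acquisition path.

#include <stdio.h>

typedef struct owner_model {
        int on_cpu;             /* 1 if currently running on a processor */
} owner_model_t;

static const char *
adaptive_decision(const owner_model_t *owner)
{
        /* spin while the owner runs (it should release soon); else block */
        return (owner->on_cpu ? "spin" : "block");
}

int
main(void)
{
        owner_model_t running = { 1 }, sleeping = { 0 };

        printf("owner running: %s\n", adaptive_decision(&running));
        printf("owner not running: %s\n", adaptive_decision(&sleeping));
        return (0);
}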


The four lockstat probes pertaining to adaptive locks are listed in Table 17.2. For each probe, arg0 contains a pointer to the kmutex_t structure that represents the adaptive lock.

Table 17.2. Adaptive Lock Probes

adaptive-acquire: Hold-event probe that fires immediately after an adaptive lock is acquired.

adaptive-block: Contention-event probe that fires after a thread that has blocked on a held adaptive mutex has reawakened and has acquired the mutex. If both probes are enabled, adaptive-block fires before adaptive-acquire. At most one of adaptive-block and adaptive-spin fires for a single lock acquisition. arg1 for adaptive-block contains the sleep time in nanoseconds.

adaptive-spin: Contention-event probe that fires after a thread that has spun on a held adaptive mutex has successfully acquired the mutex. If both are enabled, adaptive-spin fires before adaptive-acquire. At most one of adaptive-spin and adaptive-block fires for a single lock acquisition. arg1 for adaptive-spin contains the spin count: the number of iterations that were taken through the spin loop before the lock was acquired. The spin count has little meaning on its own but can be used to compare spin times.

adaptive-release: Hold-event probe that fires immediately after an adaptive lock is released.

17.9.3. Spin Lock Probes

Threads cannot block in some contexts in the kernel, such as high-level interrupt context and any context manipulating dispatcher state. In these contexts, this restriction prevents the use of adaptive locks. Spin locks are instead used to effect mutual exclusion to critical sections in these contexts. As the name implies, the behavior of these locks in the presence of contention


is to spin until the lock is released by the owning thread. The three probes pertaining to spin locks are listed in Table 17.3.

Table 17.3. Spin Lock Probes

spin-acquire: Hold-event probe that fires immediately after a spin lock is acquired.

spin-spin: Contention-event probe that fires after a thread that has spun on a held spin lock has successfully acquired the spin lock. If both are enabled, spin-spin fires before spin-acquire. arg1 for spin-spin contains the spin count: the number of iterations that were taken through the spin loop before the lock was acquired. The spin count has little meaning on its own but can be used to compare spin times.

spin-release: Hold-event probe that fires immediately after a spin lock is released.

    $dapti"e locks are much more common than spin locks. 'he following script

    displays totals for oth lock types to pro"ide data to support this

    oser"ation.

    lockstatadapti"e&ac+uire

    0execname KK NdateN0

    G

    Tlocks:Nadapti"eN< K count(*6H

    lockstatspin&ac+uire

    0execname KK NdateN0

    G

    Tlocks:NspinN< K count(*6

    H


Run this script in one window, and a date(1) command in another. When you terminate the DTrace script, you will see output similar to the following example.

# dtrace -s ./whatlock.d
dtrace: script './whatlock.d' matched 5 probes
^C
  spin                                                         26
  adaptive                                                   2981

As this output indicates, over 99 percent of the locks acquired in running the date command are adaptive locks. It may be surprising that so many locks are acquired in doing something as simple as running date. The large number of locks is a natural artifact of the fine-grained locking required of an extremely scalable system like the Solaris kernel.

17.9.4. Thread Locks

A thread lock is a special kind of spin lock that locks a thread for purposes of changing thread state. Thread lock hold events are available as spin lock hold-event probes (that is, spin-acquire and spin-release), but contention events have their own probe specific to thread locks. The thread lock contention-event probe is described in Table 17.4.

Table 17.4. Thread Lock Probes

thread-spin: Contention-event probe that fires after a thread has spun on a thread lock. Like other contention-event probes, if both the contention-event probe and the hold-event probe are enabled, thread-spin fires before spin-acquire. Unlike other contention-event probes, however, thread-spin fires before the lock is actually acquired. As a result, multiple thread-spin probe firings may correspond to a single spin-acquire probe firing.


17.9.5. Readers/Writer Lock Probes

Readers/writer locks enforce a policy of allowing multiple readers or a single writer, but not both, to be in a critical section. These locks are typically used for structures that are searched more frequently than they are modified and for which there is substantial time in the critical section. If critical section times are short, readers/writer locks will implicitly serialize over the shared memory used to implement the lock, giving them no advantage over adaptive locks. See rwlock(9F) for more details on readers/writer locks.

The probes pertaining to readers/writer locks are listed in Table 17.5. For each probe, arg0 contains a pointer to the krwlock_t structure that represents the lock.

Table 17.5. Readers/Writer Lock Probes

rw-acquire: Hold-event probe that fires immediately after a readers/writer lock is acquired. arg1 contains the constant RW_READER if the lock was acquired as a reader, and RW_WRITER if the lock was acquired as a writer.

rw-block: Contention-event probe that fires after a thread that has blocked on a held readers/writer lock has reawakened and has acquired the lock. arg1 contains the length of time (in nanoseconds) that the current thread had to sleep to acquire the lock. arg2 contains the constant RW_READER if the lock was acquired as a reader, and RW_WRITER if the lock was acquired as a writer. arg3 and arg4 contain more information on the reason for blocking: arg3 is nonzero if and only if the lock was held as a writer when the current thread blocked, and arg4 contains the readers count when the current thread blocked. If both the rw-block and rw-acquire probes are enabled, rw-block fires before rw-acquire.

rw-upgrade: Hold-event probe that fires after a thread has successfully upgraded a readers/writer lock from a reader to a writer. Upgrades do not have an associated contention event because they are only possible through a non-blocking interface,


rw_tryupgrade(9F).

rw-downgrade: Hold-event probe that fires after a thread has downgraded its ownership of a readers/writer lock from writer to reader. Downgrades do not have an associated contention event because they always succeed without contention.

rw-release: Hold-event probe that fires immediately after a readers/writer lock is released. arg1 contains the constant RW_READER if the released lock was held as a reader, and RW_WRITER if the released lock was held as a writer. Due to upgrades and downgrades, the lock may not have been released as it was acquired.

