Kernel Internals


7/30/2019 Kernel Internals


Santosh Sam Koshy

santoshk@cdac.in

Centre for Development of Advanced Computing, Hyderabad





Agenda

IOCTL

Kernel Synchronization Techniques

Wait Queues

Time Delays

Deferred Executions



14/11/2012


IOCTL

Most drivers need, in addition to the ability to read and write the device, the ability to perform various types of hardware control via the device driver. These operations are normally supported via the ioctl method.

In user space, the ioctl system call has the following prototype:

int ioctl(int fd, unsigned long cmd, ...);

The ioctl driver method has the prototype

int (*ioctl) (struct inode *inode, struct file *filp, unsigned int cmd, unsigned long arg);



Magic Numbers

Magic numbers are a mechanism for identifying the commands of a particular device. They must be unique across the system.

Each command number is encoded in four bit-fields:

type: This is the magic number, chosen after consulting the file ioctl-number.txt. It is 8 bits wide.

number: The ordinal (sequential) number. It is also 8 bits wide.

direction: The direction of data transfer. It is 2 bits wide.

size: The size of the user data involved. The width of this field is architecture dependent, and is generally 13 or 14 bits.
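The four fields can be pictured as one packed 32-bit value. The sketch below is a user-space model, not the kernel's own macros: the toy_ names are hypothetical, and the shifts assume the common generic layout (8-bit number, 8-bit type, 14-bit size, 2-bit direction). Architectures that use a 13-bit size field differ.

```c
#include <assert.h>
#include <stdint.h>

/* Field widths and shifts, assuming the generic layout described above. */
enum {
    TOY_NRBITS = 8, TOY_TYPEBITS = 8, TOY_SIZEBITS = 14, TOY_DIRBITS = 2,
    TOY_NRSHIFT = 0,
    TOY_TYPESHIFT = TOY_NRSHIFT + TOY_NRBITS,     /* 8  */
    TOY_SIZESHIFT = TOY_TYPESHIFT + TOY_TYPEBITS, /* 16 */
    TOY_DIRSHIFT = TOY_SIZESHIFT + TOY_SIZEBITS   /* 30 */
};

/* Direction of transfer, seen from user space. */
enum { TOY_DIR_NONE = 0, TOY_DIR_WRITE = 1, TOY_DIR_READ = 2 };

/* Pack the four fields into one command number. */
static uint32_t toy_ioc(uint32_t dir, uint32_t type, uint32_t nr, uint32_t size)
{
    return (dir << TOY_DIRSHIFT) | (size << TOY_SIZESHIFT) |
           (type << TOY_TYPESHIFT) | (nr << TOY_NRSHIFT);
}

/* Extract a field: shift it down, then mask to its width. */
static uint32_t toy_field(uint32_t cmd, int shift, int bits)
{
    return (cmd >> shift) & ((1u << bits) - 1u);
}

static uint32_t toy_ioc_dir(uint32_t cmd)  { return toy_field(cmd, TOY_DIRSHIFT, TOY_DIRBITS); }
static uint32_t toy_ioc_type(uint32_t cmd) { return toy_field(cmd, TOY_TYPESHIFT, TOY_TYPEBITS); }
static uint32_t toy_ioc_nr(uint32_t cmd)   { return toy_field(cmd, TOY_NRSHIFT, TOY_NRBITS); }
static uint32_t toy_ioc_size(uint32_t cmd) { return toy_field(cmd, TOY_SIZESHIFT, TOY_SIZEBITS); }
```

For example, toy_ioc(TOY_DIR_READ, 'k', 1, sizeof(int)) describes command number 1 of a device with magic 'k' that returns an int to user space. A driver would normally build such numbers with the kernel's own helper macros rather than by hand.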




Kernel Synchronization




Agenda

Sources of Concurrency in the Kernel

Mechanisms to manage concurrency

Semaphores

RW Semaphores

Spinlocks

RW Spinlocks

Completions

Atomic Variables

Sequential Locks

RCU




What is Synchronization

In the kernel there can be many tasks that execute pseudo-concurrently. This may lead to data inconsistencies when they access a common resource.

Well-defined coordination between tasks in accessing shared data is a must, and this coordination is what is meant by synchronization between tasks.




Sources of Concurrency

In a Linux system, numerous processes may be executing in user space and making system calls into the kernel.

On SMP systems, several processors can execute your code concurrently.

Kernel code is preemptible.

Interrupts are asynchronous events that can cause concurrent execution.

The kernel provides delayed code execution mechanisms, which run code independently of the process that requested it.

Hot-pluggable devices can disappear suddenly while the code is still working with them.




Mechanisms to manage concurrency

Any time a hardware or software resource is shared beyond a single thread of execution, there is a possibility that one thread gets an inconsistent view of that resource.

This calls for some resource access management, which is brought about by mechanisms called locking or mutual exclusion: making sure that only one thread of execution can manipulate a shared resource at any one time.




Semaphores

At its core, a semaphore is a single integer value combined with a pair of functions that are typically called up and down.

To use semaphores, the code must include asm/semaphore.h.

The semaphore implementation in the kernel is just a structure, struct semaphore:

struct semaphore {
    atomic_t count;
    int sleepers;
    wait_queue_head_t wait;
};




Semaphores

There are two ways of creating a semaphore. The dynamic way uses the function

void sema_init(struct semaphore *sem, int val);

Statically, semaphores may be created by the macro

static __DECLARE_SEMAPHORE_GENERIC(name, count);

The count or val in both cases specifies the initial value of the semaphore. Setting it to 1 creates a binary semaphore, or mutex (mutual exclusion semaphore).




Semaphores

Semaphores may also be declared in mutex mode with the following macros

DECLARE_MUTEX(name);

DECLARE_MUTEX_LOCKED(name);

They may be initialized at runtime by the following

void init_MUTEX(struct semaphore *sem);

void init_MUTEX_LOCKED(struct semaphore *sem);




Semaphores

Semaphores may be accessed by calling one of the following

functions

void down(struct semaphore *sem);

int down_interruptible(struct semaphore *sem);

int down_trylock(struct semaphore *sem);

Once access to the critical section is completed, the

semaphore may be released by the function

void up(struct semaphore *sem)
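The counting behaviour behind down and up can be modelled in user space. The sketch below is not kernel code: the toy_ names are hypothetical, it uses C11 atomics, and it models only the non-sleeping down_trylock variant, including its convention of returning 0 when the semaphore was acquired.

```c
#include <assert.h>
#include <stdatomic.h>

struct toy_sem {
    atomic_int count; /* number of holders still allowed in */
};

static void toy_sema_init(struct toy_sem *s, int val)
{
    atomic_init(&s->count, val);
}

/* Try to take the semaphore without sleeping.
 * Returns 0 on success and 1 if it was not available. */
static int toy_down_trylock(struct toy_sem *s)
{
    int c = atomic_load(&s->count);

    while (c > 0) {
        /* On failure, c is reloaded with the current count and we retry. */
        if (atomic_compare_exchange_weak(&s->count, &c, c - 1))
            return 0;
    }
    return 1;
}

/* Release the semaphore. */
static void toy_up(struct toy_sem *s)
{
    atomic_fetch_add(&s->count, 1);
}
```

Initialised to 1, the model behaves as a mutex: the first toy_down_trylock succeeds and the second fails until toy_up is called. The real down and down_interruptible put the caller to sleep instead of failing, which this model leaves out.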




Reader/Writer Semaphores

Code using rwsems must include linux/rwsem.h. The relevant data type for rwsem is struct rw_semaphore. An rwsem must be explicitly initialized at run time using

void init_rwsem(struct rw_semaphore *sem);

For read-only access,

void down_read(struct rw_semaphore *sem);

int down_read_trylock(struct rw_semaphore *sem);

void up_read(struct rw_semaphore *sem);

For write access,

void down_write(struct rw_semaphore *sem);

int down_write_trylock(struct rw_semaphore *sem);

void up_write(struct rw_semaphore *sem);

For cases where a quick write is followed by a long read, a held write lock can be converted into a read lock with

void downgrade_write(struct rw_semaphore *sem);




Spinlocks

A spinlock is a mutual exclusion device that can have only two values: locked and unlocked. It is usually implemented as a single bit in an integer value. Code wishing to take out a particular lock tests the relevant bit.

Unlike semaphores, spinlocks may be used in code that cannot sleep.

If the lock is taken by somebody else, the code goes into a tight loop where it repeatedly checks the lock until it becomes available.
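The test-and-spin idea can be shown with a C11 atomic_flag. This is a user-space sketch under stated assumptions, not the kernel's spinlock_t: the toy_ names are hypothetical, and the model ignores preemption and interrupts, which the kernel implementation has to handle.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>

struct toy_spinlock {
    atomic_flag locked; /* set = locked, clear = unlocked */
};

static void toy_spin_init(struct toy_spinlock *l)
{
    atomic_flag_clear(&l->locked);
}

/* One attempt: test_and_set returns the previous state,
 * so "false" means the lock was free and is now ours. */
static bool toy_spin_trylock(struct toy_spinlock *l)
{
    return !atomic_flag_test_and_set_explicit(&l->locked, memory_order_acquire);
}

/* Tight loop: keep testing until the lock becomes available. */
static void toy_spin_lock(struct toy_spinlock *l)
{
    while (!toy_spin_trylock(l))
        ; /* spin */
}

static void toy_spin_unlock(struct toy_spinlock *l)
{
    atomic_flag_clear_explicit(&l->locked, memory_order_release);
}
```

Because the waiter never gives up the CPU, the critical section guarded by such a lock has to be short.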




Spinlocks

Spinlocks are intended for use on multiprocessor systems, although a uniprocessor workstation running a preemptive kernel behaves like SMP as far as concurrency is concerned.

If a non-preemptive uniprocessor system ever went into a spin on a lock, it would spin forever; no other thread would ever be able to obtain the CPU to release the lock.

For this reason, on uniprocessor kernels built without preemption, the Linux spinlock operations are compiled down to nothing.




Spinlocks

The required include file for spinlock primitives is linux/spinlock.h. A spinlock has the type spinlock_t and has to be initialized before it is used.

The static initialization for a spinlock is done by

spinlock_t my_lock = SPIN_LOCK_UNLOCKED;

or at runtime as

void spin_lock_init(spinlock_t *lock);

A spinlock is obtained and released by

void spin_lock(spinlock_t *lock);

void spin_unlock(spinlock_t *lock);




Spinlocks and Interrupts

Spinlocks can be used in interrupt handlers, whereas semaphores may not be used there since they can sleep.

If a lock is shared with an interrupt handler, local interrupts must be disabled before acquiring the lock.

The kernel provides a separate interface for this, which disables local interrupts on acquiring the spinlock and restores their previous state on releasing it:

void spin_lock_irqsave(spinlock_t *lock, unsigned long flags);

void spin_unlock_irqrestore(spinlock_t *lock, unsigned long flags);




Reader/Writer Spinlocks

Using read/write spinlocks is similar to using rwsems.

A reader/writer spinlock is initialized by

rwlock_t my_rwlock = RW_LOCK_UNLOCKED; //static way

rwlock_t my_rwlock;

rwlock_init(&my_rwlock); //Dynamic way

Reader and writer locks may be acquired and released by

void read_lock(rwlock_t *lock);

void read_unlock(rwlock_t *lock);

void write_lock(rwlock_t *lock);

void write_unlock(rwlock_t *lock);




Semaphores vs Spinlocks

Requirement                            Recommended Lock
Low overhead locking                   Spinlock
Short lock hold time                   Spinlock
Long lock hold time                    Semaphore
Need to lock from interrupt context    Spinlock
Need to sleep while holding lock       Semaphore




Completions

A common pattern in kernel programming is the initiation of some activity outside the current execution flow, followed by a wait for that activity to complete.

Consider the following code snippet, which uses a semaphore for this purpose

struct semaphore sem;

init_MUTEX_LOCKED(&sem);

start_external_task(&sem);

down(&sem);




Completions

Completions are a simple, lightweight mechanism with one task: allowing one thread to tell another that the job is done.

A completion can be created with

DECLARE_COMPLETION(my_completion);

Waiting for the completion is done by simply calling

void wait_for_completion(struct completion *c);

The actual completion event is signaled by

void complete(struct completion *c);

void complete_all(struct completion *c);




Atomic Variables

Atomic variables are special data types, provided by the kernel, for performing simple operations in an atomic manner.

The kernel provides an atomic integer type called atomic_t and a set of functions that have to be used to perform operations on the atomic variables.

The operations are very fast, because they compile to a single machine instruction whenever possible.




Atomic Integer Operations

Some important integer operations are

void atomic_set(atomic_t *v, int i);

int atomic_read(atomic_t *v);

void atomic_add(int i, atomic_t *v);

void atomic_sub(int i, atomic_t *v);

void atomic_inc(atomic_t *v);

void atomic_dec(atomic_t *v);

int atomic_inc_and_test(atomic_t *v);
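A typical use of such operations is reference counting. The sketch below is a user-space analogy built on C11 atomics rather than atomic_t; the toy_ names are hypothetical, and toy_put is modelled on the kernel's atomic_dec_and_test(), a relative of the functions listed above that decrements and reports whether the result reached zero.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>

struct toy_obj {
    atomic_int refs; /* number of users currently holding the object */
};

/* Take one more reference. */
static void toy_get(struct toy_obj *o)
{
    atomic_fetch_add(&o->refs, 1);
}

/* Drop a reference. Returns true only for the caller that dropped
 * the last one: fetch_sub returns the value before the decrement. */
static bool toy_put(struct toy_obj *o)
{
    return atomic_fetch_sub(&o->refs, 1) == 1;
}
```

Because the decrement and the test are one indivisible step, exactly one caller sees true and can safely free the object, with no separate lock.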




Atomic Bit Operations

The atomic_t data type is good for working with integers. It is less well suited to manipulating individual bits. For that purpose the kernel provides functions that act atomically on single bits. These are declared in asm/bitops.h.

The available bit operations are:

void set_bit(nr, void *addr);

void clear_bit(nr, void *addr);

void change_bit(nr, void *addr);
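What these operations do can be sketched in user space with C11 atomics. The toy_ functions below are hypothetical stand-ins, not the kernel's implementation, which is architecture specific; the bit numbering (bit nr lives in word nr / bits-per-long) is an assumption of this model.

```c
#include <assert.h>
#include <limits.h>
#include <stdatomic.h>
#include <stdbool.h>

#define TOY_BITS_PER_LONG (sizeof(unsigned long) * CHAR_BIT)

/* Locate the word that holds bit nr, and the mask for it within that word. */
static atomic_ulong *toy_word(unsigned int nr, atomic_ulong *addr)
{
    return &addr[nr / TOY_BITS_PER_LONG];
}

static unsigned long toy_mask(unsigned int nr)
{
    return 1UL << (nr % TOY_BITS_PER_LONG);
}

/* Each operation is a single atomic read-modify-write on one word. */
static void toy_set_bit(unsigned int nr, atomic_ulong *addr)
{
    atomic_fetch_or(toy_word(nr, addr), toy_mask(nr));
}

static void toy_clear_bit(unsigned int nr, atomic_ulong *addr)
{
    atomic_fetch_and(toy_word(nr, addr), ~toy_mask(nr));
}

static void toy_change_bit(unsigned int nr, atomic_ulong *addr)
{
    atomic_fetch_xor(toy_word(nr, addr), toy_mask(nr));
}

static bool toy_test_bit(unsigned int nr, atomic_ulong *addr)
{
    return (atomic_load(toy_word(nr, addr)) & toy_mask(nr)) != 0;
}
```

Two threads setting different bits of the same word therefore cannot overwrite each other's update, which a plain "word |= mask" would not guarantee.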




seqlocks

Seqlocks are a feature added in the 2.6 kernel that is intended to provide fast, lockless access to a shared resource.

Seqlocks work in situations where write access is rare but must be fast.

They work by allowing readers free access to the resource, but requiring those readers to check for collisions with writers and, when a collision occurs, retry their access.

They cannot be used to protect data structures involving pointers, because the reader may follow a pointer that is invalid while the writer is changing the data structure.



seqlocks

Seqlocks are defined in linux/seqlock.h. A seqlock may be initialized by

seqlock_t lock1 = SEQLOCK_UNLOCKED;

The write path is protected by

void write_seqlock(seqlock_t *lock);

/* write lock is obtained....make changes */

void write_sequnlock(seqlock_t *lock);

Readers may function in this pattern

unsigned int seq;

do {
    seq = read_seqbegin(&lock1);
    /* read the data here */
} while (read_seqretry(&lock1, seq));
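The retry check rests on a sequence counter that is odd while a write is in progress. The following is a minimal user-space model with C11 atomics; the toy_ names are hypothetical, it is not the kernel's seqlock_t, and it leaves out the spinlock that serialises writers against each other.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>

struct toy_seqcount {
    atomic_uint seq; /* even: no writer active, odd: write in progress */
};

/* Writer side: bump the counter before and after the update. */
static void toy_write_begin(struct toy_seqcount *s)
{
    atomic_fetch_add(&s->seq, 1); /* counter becomes odd */
}

static void toy_write_end(struct toy_seqcount *s)
{
    atomic_fetch_add(&s->seq, 1); /* counter becomes even again */
}

/* Reader side: wait for an even value and remember it. */
static unsigned int toy_read_begin(struct toy_seqcount *s)
{
    unsigned int v;

    while ((v = atomic_load(&s->seq)) & 1u)
        ; /* a writer is active, look again */
    return v;
}

/* True if any write started or finished since toy_read_begin(). */
static bool toy_read_retry(struct toy_seqcount *s, unsigned int start)
{
    return atomic_load(&s->seq) != start;
}
```

A reader wraps its accesses in the same do/while shape as above, calling toy_read_begin before reading and looping while toy_read_retry returns true. In a real concurrent program the protected data itself also needs suitable atomic or barrier treatment, which this model does not show.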




Read Copy Update

[Figure: two panels. In each, a pointer refers either to the shared resource or to a copy of the shared resource, and readers 1 to 5 access the data through that pointer. The second panel adds the writer.]




Wait Queues, Delays and Deferred Execution



Agenda

Wait Queues

HZ

Jiffies

Long Delays

Kernel Timers

Tasklets

Work Queues



Wait Queues

Wait queues are a mechanism for putting a user space process to sleep whenever the kernel driver is not able to satisfy the user process's request immediately.

When a process is put to sleep, it is marked as being in a special state and removed from the scheduler's run queue. The process will not be scheduled again until some event wakes it up.

The Linux scheduler maintains two special states that represent a wait state. They are defined as TASK_INTERRUPTIBLE and TASK_UNINTERRUPTIBLE.



Declaration of Wait Queues

A wait queue may also be defined as a list of processes waiting for a specific event.

Wait queues are managed by means of a wait queue head, a structure of type wait_queue_head_t. It may be defined and initialized as

DECLARE_WAIT_QUEUE_HEAD(name); //static declaration

Or

wait_queue_head_t my_queue; //Dynamically

init_waitqueue_head(&my_queue);

This only creates an empty wait queue, to which sleeping tasks are appended later.



Using Wait Queues

When a process sleeps, it does so in expectation that some condition will become true in the future. The simplest way of sleeping is to call the macro wait_event or one of its variants

wait_event(queue, condition);

wait_event_interruptible(queue, condition);

wait_event_timeout(queue, condition, timeout);

wait_event_interruptible_timeout(queue, condition, timeout);

The waker is either another process or an interrupt handler. It makes the condition true and then calls the appropriate wake-up function

void wake_up(wait_queue_head_t *queue);

void wake_up_interruptible(wait_queue_head_t *queue);



When to use Wait Queues

There are two behaviors that warrant the use of wait queues:

If a process calls read but no data is available, the process must block. The process is awakened as soon as some data arrives, and that data is returned to the caller, even if there is less data than the amount requested in the count argument to the method.

If a process calls write and there is no space in the buffer, the process must block, and it must be on a different wait queue from the one used for reading. When some data has been written to the hardware device and space becomes free in the output buffer, the process is awakened and the write call succeeds, although the data may be only partially written if there isn't room in the buffer for the count bytes that were requested.



Exclusive Waits

Thundering Herd

In using wait queues, we may encounter a situation wherein there are many processes waiting for the occurrence of an event.

During the wakeup, all processes waiting for the event are made ready to execute. This causes a herd of processes to thunder in together to gain exclusive access to the shared resource.

Only one of these processes obtains the resource and the rest have to go back to sleep. This thundering of processes for CPU access may deteriorate the overall system performance if it happens frequently.

This problem is known as the Thundering Herd Problem and is solved using exclusive wait mechanisms.




Exclusive Waits

In response to the thundering herd problem, kernel developers have added an exclusive wait option to the kernel. There are two important differences in using exclusive waits:

When a wait queue entry has the WQ_FLAG_EXCLUSIVE flag set, it is added to the end of the wait queue. Entries without that flag are added to the beginning.

When wake_up is called on a wait queue, it stops after waking the first process that has the WQ_FLAG_EXCLUSIVE flag set.

Putting a process into an exclusive wait is a simple matter of calling

void prepare_to_wait_exclusive(wait_queue_head_t *queue, wait_queue_t *wait, int state);




HZ

The kernel keeps track of the flow of time by means of timer interrupts. These are generated by the system's timing hardware at regular intervals.

This interval is programmed at system boot by the kernel according to the value of HZ, which is an architecture-dependent constant. The default values range from 50 to 1200, and HZ is typically set to 100 or 1000 on x86 machines.

A new value of HZ takes effect only after the kernel has been recompiled with it.



Jiffies

Every time a timer interrupt occurs, the value of an internal

kernel counter is incremented. The counter is initialized to 0 on

system boot and therefore represents the number of timer ticks

since last boot.

The counter is a 64 bit variable and is called jiffies_64.

However, driver writers normally access the jiffies variable, an unsigned long that is the same as either jiffies_64 or its least significant bits.



Using the Jiffies Counter

The jiffies counter can be used to read the present time and thereby calculate a future timestamp. This may be illustrated as follows

j = jiffies;

stamp_1 = j + HZ; /* one second in the future */

stamp_2 = j + HZ / 2; /* half a second in the future */



Delaying Execution

Long Delays:

Occasionally a driver needs to delay execution for relatively long periods, that is, more than one clock tick. There are a few ways of implementing this.

Busy Waiting

unsigned long delay = jiffies + 5 * HZ; /* 5 seconds from now */

while (time_before(jiffies, delay)) {
    /* do nothing */
}
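A comparison helper is used here instead of a plain "<" because the tick counter eventually wraps around. The user-space sketch below shows the signed-difference trick that such a helper relies on; toy_time_before is a hypothetical stand-in written for this illustration, and it assumes the usual two's-complement conversion from unsigned long to long.

```c
#include <assert.h>
#include <limits.h>
#include <stdbool.h>

/* True if tick value a lies before tick value b, even when the
 * counter has wrapped between them. The unsigned subtraction wraps
 * modulo 2^N; reading the result as signed gives the short distance
 * between the two values, valid while they are less than half the
 * counter range apart. */
static bool toy_time_before(unsigned long a, unsigned long b)
{
    return (long)(a - b) < 0;
}
```

With now = ULONG_MAX - 2 and a deadline five ticks later, the deadline wraps to 2, so a naive "now < deadline" test is false even though the deadline is still in the future. toy_time_before(now, deadline) gives the intended answer.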

Delaying Execution



This method causes busy looping in the while statement, which hogs the CPU for no productive outcome.

Yielding the Processor

unsigned long delay = jiffies + 5 * HZ; /* 5 seconds from now */

while (time_before(jiffies, delay)) {
    schedule(); /* yield the CPU */
}

The advantage of this method is that another process may get access to the CPU. The requested delay is guaranteed as a minimum, but the process may not be scheduled exactly when the delay expires.

Delaying Execution



Short Delays:

The kernel implements functions that provide delays shorter than can be achieved with the jiffies counter. These delays are implemented as busy loops whose details depend on the architecture.

void ndelay(unsigned long nsecs);

void udelay(unsigned long usecs);

void mdelay(unsigned long msecs);



Kernel Timers

Kernel timers are used to schedule execution of a function at a later instant of time, based on the clock tick.

A kernel timer is a data structure that instructs the kernel to execute a user-defined function with a user-defined argument at a user-defined time.

The declaration can be found in linux/timer.h and the source code may be found in kernel/timer.c.

Kernel Timers




The function scheduled to run may not run while the process that registered it is executing. It is run asynchronously, as in an interrupt context.

Kernel timers may be considered software interrupt handlers and have certain constraints associated with their implementation.

Primarily, they have to be atomic, and there are additional constraints because of execution in the interrupt context.

A timer function runs on the same CPU that registered it.



The Timer API

The kernel provides drivers with a number of functions to declare, register and remove kernel timers.

struct timer_list {
    /* ... */
    unsigned long expires;
    void (*function)(unsigned long);
    unsigned long data;
};

The member expires holds the jiffies value at which the timer is expected to run. The member function points to a user-defined function, and the third member, data, is passed to that function as its argument.



The Timer API

The timer may be initialized by using the function

void init_timer(struct timer_list *timer);

The public fields of the structure may be initialized after returning from the function.

The timer may be added to and deleted from the kernel's timer list using the functions

void add_timer(struct timer_list *timer);

int del_timer(struct timer_list *timer);

The timer is a one-shot execution and is taken off the list before it is run.

Other Timer APIs




int mod_timer(struct timer_list *timer, unsigned long expires);

Updates the expiration time of a timer.

int timer_pending(const struct timer_list *timer);

Returns true or false to indicate whether the timer is currently scheduled to run, by reading one of the opaque fields of the structure.

Applications of Kernel Timers




They find various applications, such as polling a device by checking its status registers at regular intervals when the hardware cannot fire interrupts.

Other applications include turning off the floppy motor, shutting down the processor fan on system shutdown, etc.

Tasklets




Tasklets are another kernel facility that allows the execution of a function to be deferred to a later time.

They are similar to kernel timers in that they run at interrupt time, they always run on the same CPU that schedules them, and they receive an unsigned long argument.

They differ from kernel timers in that they are not scheduled for a particular time. They are run by the system at some later time of its own choosing.

The Tasklet Data Structure




A tasklet exists as a data structure that must be initialized before use.

struct tasklet_struct {
    /* ... */
    void (*func)(unsigned long);
    unsigned long data;
};

Initialization is done by the function

void tasklet_init(struct tasklet_struct *t, void (*func)(unsigned long), unsigned long data);

Tasklets




A tasklet can be disabled and re-enabled later; it won't be executed until it has been enabled as many times as it has been disabled.

A tasklet can re-register itself.

A tasklet can be scheduled to execute at normal priority or high priority.

Tasklets may be run immediately if the system is not under heavy load, but never later than the next timer tick.
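The first property, that disables nest, can be pictured with a small counter model. This is a single-threaded user-space illustration with hypothetical toy_ names; it is not how the kernel implements tasklets, and it leaves out the atomicity the kernel version needs.

```c
#include <assert.h>
#include <stdbool.h>

struct toy_tasklet {
    int disable_count;            /* > 0 means "do not run" */
    bool scheduled;               /* a run has been requested */
    void (*func)(unsigned long);
    unsigned long data;
};

static void toy_disable(struct toy_tasklet *t)  { t->disable_count++; }
static void toy_enable(struct toy_tasklet *t)   { t->disable_count--; }
static void toy_schedule(struct toy_tasklet *t) { t->scheduled = true; }

/* Called by an imaginary dispatch loop. Runs func once if a run was
 * requested and every disable has been matched by an enable.
 * Returns true if func ran. */
static bool toy_run_pending(struct toy_tasklet *t)
{
    if (!t->scheduled || t->disable_count > 0)
        return false;
    t->scheduled = false;
    t->func(t->data);
    return true;
}

/* Example handler: accumulates its argument in a global. */
static unsigned long toy_hits;

static void toy_count_hit(unsigned long n)
{
    toy_hits += n;
}
```

After two calls to toy_disable, a scheduled tasklet stays pending through the first toy_enable and runs only after the second.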

Tasklet APIs




void tasklet_disable(struct tasklet_struct *t);

void tasklet_enable(struct tasklet_struct *t);

void tasklet_schedule(struct tasklet_struct *t);

void tasklet_hi_schedule(struct tasklet_struct *t);

void tasklet_kill(struct tasklet_struct *t);

Work Queues




Work queues allow kernel code to request that a function be called at some future time. They differ from tasklets as follows:

Work queue functions run in the context of a special kernel process.

These functions can sleep.

Kernel code can request that the execution of work queue functions be delayed for an explicit interval.

The key difference between tasklets and work queues is that tasklets execute quickly, for a short period of time, and in atomic mode. The same does not hold for work queues.

