Linux Review (2 nd Half) COMS W4118 Spring 2008. Interrupts in Linux COMS W4118 Spring 2008.

Linux Review (2nd Half)

COMS W4118

Spring 2008

Interruptsin Linux

COMS W4118

Spring 2008

Interrupts

Forcibly change normal flow of control Similar to context switch (but lighter weight)

Hardware saves some context on stack; Includes interrupted instruction if restart needed

Enters kernel at a specific point; kernel then figures out which interrupt handler should run

Execution resumes with special “iret” instruction Many different types of interrupts

Types of Interrupts

Asynchronous From external source, such as I/O device Not related to instruction being executed

Synchronous (also called exceptions) Processor-detected exceptions:

Faults — correctable; offending instruction is retried Traps — often for debugging; instruction is not retried Aborts — major error (hardware failure)

Programmed exceptions: Requests for kernel intervention (software intr/syscalls)

CPU’s ‘fetch-execute’ cycle

Fetch instruction at IP

Advance IP to next instruction

Decode the fetched instruction

Execute the decoded instruction

IRQ?

no

Save context

Get INTR ID

Lookup ISR

Execute ISR

yes IRET

User Program

IP

ld

add

st

mul

ld

sub

bne

add

jmp

…

Faults

Instruction would be illegal to execute Examples:

Writing to a memory segment marked ‘read-only’ Reading from an unavailable memory segment (on disk) Executing a ‘privileged’ instruction

Detected before incrementing the IP The causes of ‘faults’ can often be ‘fixed’ If a ‘problem’ can be remedied, then the CPU can

just resume its execution-cycle

Traps

A CPU might have been programmed to automatically switch control to a ‘debugger’ program after it has executed an instruction

That type of situation is known as a ‘trap’ It is activated after incrementing the IP

Error Exceptions Most error exceptions — divide by zero, invalid

operation, illegal memory reference, etc. — translate directly into signals

This isn’t a coincidence. . . The kernel’s job is fairly simple: send the

appropriate signal to the current process force_sig(sig_number, current);

That will probably kill the process, but that’s not the concern of the exception handler

One important exception: page fault An exception can (infrequently) happen in the kernel

die(); // kernel oops

Interrupt Hardware

x86CPU

MasterPIC

(8259)

SlavePIC

(8259)INTR

Programmable Interval-TimerKeyboard Controller

Real-Time Clock

Legacy PC Design (for single-proc

systems)

SCSI Disk

Ethernet

I/O devices have (unique or shared) Interrupt Request Lines (IRQs)

IRQs are mapped by special hardware to interrupt vectors, and passed to the CPU

This hardware is called a Programmable Interrupt Controller (PIC)

IRQs

The `Interrupt Controller’ Responsible for telling the CPU when a specific external

device wishes to ‘interrupt’ Needs to tell the CPU which one among several devices is

the one needing service PIC translates IRQ to vector

Raises interrupt to CPU Vector available in register Waits for ack from CPU

Interrupts can have varying priorities PIC also needs to prioritize multiple requests

Possible to “mask” (disable) interrupts at PIC or CPU Early systems cascaded two 8 input chips (8259A)

APIC, IO-APIC, LAPIC Advanced PIC (APIC) for SMP systems

Used in all modern systems Interrupts “routed” to CPU over system bus IPI: inter-processor interrupt

Local APIC (LAPIC) versus “frontend” IO-APIC Devices connect to front-end IO-APIC IO-APIC communicates (over bus) with Local APIC

Interrupt routing Allows broadcast or selective routing of interrupts Ability to distribute interrupt handling load Routes to lowest priority process

Special register: Task Priority Register (TPR) Arbitrates (round-robin) if equal priority

Assigning IRQs to Devices IRQ assignment is hardware-dependent Sometimes it’s hardwired, sometimes it’s set physically,

sometimes it’s programmable PCI bus usually assigns IRQs at boot Some IRQs are fixed by the architecture

IRQ0: Interval timer IRQ2: Cascade pin for 8259A

Linux device drivers request IRQs when the device is opened Note: especially useful for dynamically-loaded drivers, such as

for USB or PCMCIA devices Two devices that aren’t used at the same time can share an IRQ,

even if the hardware doesn’t support simultaneous sharing

Assigning Vectors to IRQs Vector: index (0-255) into interrupt descriptor table Vectors usually IRQ# + 32

Below 32 reserved for non-maskable intr & exceptions Maskable interrupts can be assigned as needed Vector 128 used for syscall Vectors 251-255 used for IPI

Interrupt Descriptor Table The ‘entry-point’ to the interrupt-handler is located

via the Interrupt Descriptor Table (IDT) IDT: “gate descriptors”

Segment selector + offset for handler Descriptor Privilege Level (DPL) Gates (slightly different ways of entering kernel)

Task gate: includes TSS to transfer to (not used by Linux)

Interrupt gate: disables further interrupts Trap gate: further interrupts still allowed

Interrupt Masking

Two different types: global and per-IRQ Global — delays all interrupts Selective — individual IRQs can be masked

selectively Selective masking is usually what’s needed

— interference most common from two interrupts of the same type

Putting It All Together

PIC CPU

Memory Bus

INTR

0

N

IRQs

IDT0

255

handler

idtr

Mask points

vector

Dispatching Interrupts Each interrupt has to be handled by a special

device- or trap-specific routine Interrupt Descriptor Table (IDT) has gate descriptors

for each interrupt vector Hardware locates the proper gate descriptor for this

interrupt vector, and locates the new context A new stack pointer, program counter, CPU and

memory state, etc., are loaded Global interrupt mask set The old program counter, stack pointer, CPU and

memory state, etc., are saved on the new stack The specific handler is invoked

Handling Nested Interrupts

As soon as possible, unmask the global interrupt

As soon as reasonable, re-enable interrupts from that IRQ

But that isn’t always a great idea, since it could cause re-entry to the same handler

IRQ-specific mask is not enabled during interrupt-handling

Nested Execution Interrupts can be interrupted

By different interrupts; handlers need not be reentrant No notion of priority in Linux Small portions execute with interrupts disabled Interrupts remain pending until acked by CPU

Exceptions can be interrupted By interrupts (devices needing service)

Exceptions can nest two levels deep Exceptions indicate coding error Exception code (kernel code) shouldn’t have bugs Page fault is possible (trying to touch user data)

Interrupt Handling Philosophy

Do as little as possible in the interrupt handler Defer non-critical actions till later Structure: top and bottom halves

Top-half: do minimum work and return (ISR) Bottom-half: deferred processing (softirqs,

tasklets) Again — want to do as little as possible with

IRQ interrupts masked No process context available

No Process Context

Interrupts (as opposed to exceptions) are not associated with particular instructions

They’re also not associated with a given process

The currently-running process, at the time of the interrupt, as no relationship whatsoever to that interrupt

Interrupt handlers cannot refer to current Interrupt handlers cannot sleep!

Interrupt Stack

When an interrupt occurs, what stack is used? Exceptions: The kernel stack of the current

process, whatever it is, is used (There’s always some process running — the “idle” process, if nothing else)

Interrupts: hard IRQ stack (1 per processor) SoftIRQs: soft IRQ stack (1 per processor)

These stacks are configured in the IDT and TSS at boot time by the kernel

First-Level Interrupt Handler

Often in assembler Perform minimal, common functions: save

registers, unmask other interrupts Eventually, undoes that: restores registers,

returns to previous context Most important: call proper second-level

interrupt handler (C program)

Finding the Proper Handler

On modern hardware, multiple I/O devices can share a single IRQ and hence interrupt vector

First differentiator is the interrupt vector Multiple interrupt service routines (ISR) can

be associated with a vector Each device’s ISR for that IRQ is called; the

determination of whether or not that device has interrupted is device-dependent

Deferrable Work

We don’t want to do too much in regular interrupt handlers: Interrupts are masked We don’t want the kernel stack to grow too much

Instead, interrupt handlers schedule work to be performed later

Three deferred work mechanisms: softirqs, tasklets, and work queues

Tasklets are built on top of softirqs For all of these, requests are queued

Softirqs

Statically allocated: specified at kernel compile time Limited number:

Priority Type

0 High-priority tasklets

1 Timer interrupts

2 Network transmission

3 Network reception

4 SCSI disks

5 Regular tasklets

Running Softirqs

Run at various points by the kernel System calls, exceptions

Most important: after handling IRQs and after timer interrupts

Softirq routines can be executed simultaneously on multiple CPUs: Code must be re-entrant Code must do its own locking as needed

Hardware interrupts always enabled when softirqs are running.

Rescheduling Softirqs

A softirq routine can reschedule itself This could starve user-level processes Softirq scheduler only runs a limited number

of requests at a time The rest are executed by a kernel thread, ksoftirqd, which competes with user processes for CPU time

Tasklets

Similar to softirqs Created and destroyed dynamically Individual tasklets are locked during

execution; no problem about re-entrancy, and no need for locking by the code

Only one instance of tasklet can run, even with multiple CPUs

Preferred mechanism for most deferred activity

Work Queues

Always run by kernel threads Softirqs and tasklets run in an interrupt

context; work queues have a process context Because they have a process context, they

can sleep However, they’re kernel-only; there is no user

mode associated with it

Synchronizationin Linux

COMS W4118

Spring 2008

Linux Synch Primitives

Memory barriers avoids compiler, cpu instruction re-ordering

Atomic operations memory bus lock, read-modify-write ops

Interrupt/softirq disabling/enabling Local, global

Spin locks general, read/write, big reader

Semaphores general, read/write

Choosing Synch Primitives

Avoid synch if possible! (clever instruction ordering) Example: inserting in linked list (needs barrier still)

Use atomics or rw spinlocks if possible Use semaphores if you need to sleep

Can’t sleep in interrupt context Don’t sleep holding a spinlock!

Complicated matrix of choices for protecting data structures accessed by deferred functions

The implementation of the synchronization primitives is extremely architecture dependent.

This is because only the hardware can guarantee atomicity of an operation.

Each architecture must provide a mechanism for doing an operation that can examine and modify a storage location atomically.

Some architectures do not guarantee atomicity, but inform whether the operation attempted was atomic.

Architectural Dependence

Barriers: Definition Barriers are used to prevent a processor and/or the compiler

from reordering instruction execution and memory modification. Barriers are instructions to hardware and/or compiler to complete

all pending accesses before issuing any more read memory barrier – acts on read requests write memory barrier – acts on write requests

Intel – certain instructions act as barriers: lock, iret, control regs rmb – asm volatile("lock;addl $0,0(%%esp)":::"memory")

add 0 to top of stack with lock prefix wmb – Intel never re-orders writes, just for compiler

Atomic Operations Many instructions not atomic in hardware (smp)

Read-modify-write instructions: inc, test-and-set, swap unaligned memory access

Compiler may not generate atomic code even i++ is not necessarily atomic!

If the data that must be protected is a single word, atomic operations can be used. These functions examine and modify the word atomically.

The atomic data type is atomic_t. Intel implementation

lock prefix byte 0xf0 – locks memory bus

Serializing with Interrupts

Basic primitive in original UNIX Doesn’t protect against other CPUs Intel: “interrupts enabled bit”

cli to clear (disable), sti to set (enable) Enabling is often wrong; need to restore

local_irq_save() local_irq_restore()

Services used to serialize with interrupts are:local_irq_disable - disables interrupts on the current

CPUlocal_irq_enable - enable interrupts on the current

CPUlocal_save_flags - return the interrupt state of the

processorlocal_restore_flags - restore the interrupt state of the

processor Dealing with the full interrupt state of the

system is officially discouraged. Locks should be used.

Interrupt Operations

A spin lock is a data structure (spinlock_t ) that is used to synchronize access to critical sections.

Only one thread can be holding a spin lock at any moment. All other threads trying to get the lock will “spin” (loop while checking the lock status).

Spin locks should not be held for long periods because waiting tasks on other CPUs are spinning, and thus wasting CPU execution time.

Spin Locks

__raw_spin_lock_string

1: lock; decb %0 # atomically decrement jns 3f # if clear sign bit jump forward to 32: rep; nop # wait cmpb $0, %0 # spin – compare to 0 jle 2b # go back to 2 if <= 0 (locked) jmp 1b # unlocked; go back to 1 to try again 3: # we have acquired the lock …

From linux/include/asm-i386/spinlock.h

spin_unlock merely writes 1 into the lock field.

Functions used to work with spin locks:spin_lock_init – initialize a spin lock before

using it for the first timespin_lock – acquire a spin lock, spin waiting

if it is not availablespin_unlock – release a spin lockspin_unlock_wait – spin waiting for spin

lock to become available, but don't acquire it

spin_trylock – acquire a spin lock if it is currently free, otherwise return error

spin_is_locked – return spin lock state

Spin Lock Operations

The spin lock services also provide interfaces that serialize with interrupts (on the current processor):spin_lock_irq - acquire spin lock and

disable interruptsspin_unlock_irq - release spin lock and

reenablespin_lock_irqsave - acquire spin lock, save

interrupt state, and disablespin_unlock_irqrestore - release spin lock

and restore interrupt state

Spin Locks & Interrupts

The read/write lock services also provide interfaces that serialize with interrupts (on the current processor):read_lock_irq - acquire lock for read and

disable interruptsread_unlock_irq - release read lock and

reenableread_lock_irqsave - acquire lock for read,

save interrupt state, and disableread_unlock_irqrestore - release read lock

and restore interrupt stateCorresponding functions for write exist

as well (e.g., write_lock_irqsave).

RW Spin Locks & Interrupts

The Big Reader Lock

Reader optimized RW spinlock RW spinlock suffers cache contention

On lock and unlock because of write to rwlock_t Per-CPU, cache-aligned lock arrays

One for reader portion, another for writer portion To read: set bit in reader array, spin on writer

Acquire when writer lock free; very fast! To write: set bit and scan ALL reader bits

Acquire when reader bits all free; very slow!

A semaphore is a data structure that is used to synchronize access to critical sections or other resources.

A semaphore allows a fixed number of tasks (generally one for critical sections) to "hold" the semaphore at one time. Any more tasks requesting to hold the semaphore are blocked (put to sleep).

A semaphore can be used for serialization only in code that is allowed to block.

Semaphores

Semaphore Structure Struct semaphore

count (atomic_t): > 0: free; = 0: in use, no waiters; < 0: in use, waiters

wait: wait queue sleepers:

0 (none), 1 (some), occasionally 2

wait: wait queue implementation requires lower-level

synch atomic updates, spinlock, interrupt

disabling

atomic_t count

int sleepers

wait_queue_head_t wait

lock next

struct semaphore

prev

Semaphores

optimized assembly code for normal case (down()) C code for slower “contended” case (__down())

up() is easy atomically increment; wake_up() if necessary

uncontended down() is easy atomically decrement; continue

contended down() is really complex! basically increment sleepers and sleep loop because of potentially concurrent ups/downs

still in down() path when lock is acquired

A rw_semaphore is a semaphore that allows either one writer or any number of readers (but not both at the same time) to hold it.

Any writer requesting to hold the rw_semaphore is blocked when there are readers holding it.

A rw_semaphore can be used for serialization only in code that is allowed to block. Both types of semaphores are the only synchronization objects that should be held when blocking.

Writers will not starve: once a writer arrives, readers queue behind it

Increases concurrency; introduced in 2.4

RW Semaphores

A mutex is a data structure that is also used to synchronize access to critical sections or other resources, introduced in 2.6.16.

Why? (Documentation/mutex-design.txt) simpler (lighter weight) tighter code slightly faster, better scalability no fastpath tradeoffs debug support – strict checking of

adhering to semantics

Mutexes

Linux Virtual FileSystem (VFS) and Ext2

COMS W4118

Spring 2008

Linux File System Model

basically UNIX file semantics File systems are mounted at various points Files identified by device inode numbers

VFS layer just dispatches to fs-specific functions libc read() -> sys_read()

what type of filesystem does this file belong to? call filesystem (fs) specific read function maintained in open file object (file)

example: file->f_op->read(…) similar to device abstraction model in UNIX

VFS System Calls

fundamental UNIX abstractions files (everything is a file)

ex: /dev/ttyS0 – device as a file ex: /proc/123 – process as a file

processes users

lots of syscalls related to files! (~100) most dispatch to filesystem-specific calls some require no filesystem action

example: lseek(pos) – change position in file others have default VFS implementations

VFS System Calls (cont.) filesystem ops – mounting, info, flushing, chroot, pivot_root directory ops – chdir, getcwd, link, unlink, rename, symlink file ops – open/close, (p)read(v)/(p)write(v), seek, truncate, dup

fcntl, creat, inode ops – stat, permissions, chmod, chown memory mapping files – mmap, munmap, madvise, mlock wait for input – poll, select flushing – sync, fsync, msync, fdatasync file locking – flock

Big Four Data Structures struct file

information about an open file includes current position (file pointer)

struct dentry information about a directory entry includes name + inode#

struct inode unique descriptor of a file or directory contains permissions, timestamps, block map (data) inode#: integer (unique per mounted filesystem)

struct superblock descriptor of a mounted filesystem

Two More Data Structures

struct file_system_type name of file system pointer to implementing module including how to read a superblock On module load, you call register_file_system and pass a pointer

to this structure struct vfsmount

Represents a mounted instance of a particular file system One super block can be mounted in two places, with different

covering sub mounts Thus lookup requires parent dentry and a vfsmount

task_struct

openfile

object

inode

superblock

disk

fds

openfile

object

openfile

object

inodeinode

dentrydentry

dentrydentry

dentry

inode cache

dentry cache

Data Structure Relationships

calling dup() – shares open file objects example: 2>&1

opening the same file twice – shares dentries

opening same file via different hard links – shares inodes

mounting same filesystem on different dirs – shares superblocks

Sharing Data Structures

Superblock mounted filesystem descriptor

usually first block on disk (after boot block) copied into (similar) memory structure on mount

distinction: disk superblock vs memory superblock dirty bit (s_dirt), copied to disk frequently

important fields s_dev, s_bdev – device, device-driver s_blocksize, s_maxbytes, s_type s_flags, s_magic, s_count, s_root, s_dquot s_dirty – dirty inodes for this filesystem s_op – superblock operations u – filesystem specific data

Inode

"index" node – unique file or directory descriptor meta-data: permissions, owner, timestamps, size, link

count data: pointers to disk blocks containing actual data

data pointers are "indices" into file contents (hence "inode")

inode # - unique integer (per-mounted filesystem) what about names and paths?

high-level fluff on top of a "flat-filesystem" implemented by directory files (directories) directory contents: name + inode

UNIX link semantics hard links – multiple dir entries with same inode #

equal status; first is not "real" entry file deleted when link count goes to 0 restrictions

can't hard link to directories (avoids cycles) or across filesystems

soft (symbolic) links – little files with pathnames just aliases for another pathname no restrictions, cycles possible, dangling links possible

File Links

(Open) File Object struct file (usual variable name - filp)

association between file and process no disk representation created for each open (multiple possible, even same file) most important info: file pointer

file descriptor (small ints) index into array of pointers to open file objects

file object states unused (memory cache + root reserve (10))

get_empty_filp() inuse (per-superblock lists)

system-wide max on open file objects (~8K) /proc/sys/fs/file-max

abstraction of directory entry ex: line from ls -l either files (hard links) or soft links or subdirectories every dentry has a parent dentry (except root) sibling dentries – other entries in the same directory

directory api: dentry iterators posix: opendir(), readdir(), scandir(), seekdir(), rewinddir() syscall: getdents()

why an abstraction? UNIX: directories are really files with directory "records" MSDOS, etc.: directory is just a big table on disk (FAT)

no such thing as subdirectories! just fields in table (file->parentdir), (dir->parentdir)

Dentry

Dentry Cache

very important cache for filesystem performance every file access causes multiple dentry accesses! example: /tmp/foo

dentries for "/", "/tmp", "/tmp/foo" (path components)

dentry cache "controls" inode cache inodes released only when dentry is released

dentry cache accessed via hash table hash(dir, filename) -> dentry

dentry states free (not valid; maintained by slab cache) in-use (associated with valid open inode) unused (valid but not being used; LRU list) negative (file that does not exist)

dentry ops just a few, mostly default actions ex: d_compare(dir, name1, name2)

case-insensitive for MSDOS

Dentry Cache (continued)

Process-related Files

current->fs (fs_struct) root (for chroot jails) pwd umask (default file permissions)

current->files (files_struct) fd[] (file descriptor array – pointers to file objects)

0, 1, 2 – stdin, stdout, stderr originally 32, growable to 1,024 (RLIMIT_NOFILE)

complex structure for growing … see book close_on_exec memory (bitmap)

open files normally inherited across exec

Filesystem Types

Linux must "know about" filesystem before mount multiple (mounted) instances of each type possible

special (virtual) filesystems (like /proc) structuring technique to touch kernel data examples:

/proc, /dev (devfs) sockfs, pipefs, tmpfs, rootfs, shmfs

associated with fictitious block device (major# 0) minor# distinguishes special filesystem types

Registering a Filesystem Type

must register before mount static (compile-time) or dynamic (modules)

register_filesystem() / unregister_filesystem adds file_system_type object to linked-list

file_systems (head; kernel global variable) file_systems_lock (rw spinlock to protect list)

file_system_type descriptor name, flags, pointer to implementing module list of superblocks (mounted instances) read_super() – pointer to method for reading superblock

most important thing! filesystem specific

openfile

object

superblock

openfile

object

openfile

object

inode

inode

dentry

dentry dentry dentry

inode ┴

f_dentry

d_subdirs

d_inode

d_subdirs

d_parent

i_sb

d_sb

i_dentries

Data Structure Relationships (2)

Ext2

“Standard” Linux File System Previously it was the most commonly used Serves as a basis for Ext3 which adds journaling

Uses FFS like layout Each file system is composed of identical block groups Allocation is designed to improve locality

Inodes contain pointers (32 bits) to blocks Direct, Indirect, Double Indirect, Triple Indirect Maximum file size: 4.1TB (4K Blocks) Maximum file system size: 16TB (4K Blocks)

Files in the same directory are stored in the same block group Files in different directories are spread among the block groups

Picture from Tanenbaum, Modern Operating Systems 3 e, (c) 2008 Prentice-Hall, Inc. All rights reserved. 0-13-6006639

Ext2 Disk Layout

Block Addressing in Ext2


Inode

Indirect Blocks

Indirect Blocks

Indirect Blocks

DoubleIndirect

Indirect Blocks

Indirect Blocks

DoubleIndirect

TripleIndirect

DataBlockData

BlockDataBlock

DataBlockData

BlockDataBlock

Twelve “direct” blocks

BLKSIZE/4

(BLKSIZE/4)2

(BLKSIZE/4)3

DataBlockData

BlockDataBlock

DataBlockData

BlockDataBlock

DataBlockData

BlockDataBlock

DataBlockData

BlockDataBlock

(a) A Linux directory with three files. (b) The same directory after the file voluminous has been removed.

Ext2 Directory Structure


Date post:	29-Dec-2015
Category:	Documents
Upload:	shavonne-cain
View:	224 times
Download:	3 times

Linux Review (2 nd Half) COMS W4118 Spring 2008. Interrupts in Linux COMS W4118 Spring 2008.

Documents