Linux Review (2nd Half)
COMS W4118
Spring 2008
Interrupts in Linux
Interrupts
Forcibly change the normal flow of control
Similar to a context switch (but lighter weight)
Hardware saves some context on the stack
  Includes the interrupted instruction if a restart is needed
Enters the kernel at a specific point; the kernel then figures out which interrupt handler should run
Execution resumes with the special “iret” instruction
Many different types of interrupts
Types of Interrupts
Asynchronous
  From an external source, such as an I/O device
  Not related to the instruction being executed
Synchronous (also called exceptions)
  Processor-detected exceptions:
    Faults: correctable; offending instruction is retried
    Traps: often for debugging; instruction is not retried
    Aborts: major error (hardware failure)
  Programmed exceptions:
    Requests for kernel intervention (software interrupts/syscalls)
CPU’s ‘fetch-execute’ cycle
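[Figure: flowchart of the cycle. Fetch instruction at IP, advance IP to next instruction, decode the fetched instruction, execute the decoded instruction, then check “IRQ?”. If no, return to fetch. If yes: save context, get INTR ID, look up ISR, execute ISR, IRET. Alongside, a user program (ld, add, st, mul, ld, sub, bne, add, jmp, …) with IP pointing into it.]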
Faults
Instruction would be illegal to execute
Examples:
  Writing to a memory segment marked ‘read-only’
  Reading from an unavailable memory segment (on disk)
  Executing a ‘privileged’ instruction
Detected before incrementing the IP
The causes of ‘faults’ can often be ‘fixed’
If a ‘problem’ can be remedied, then the CPU can just resume its execution cycle
Traps
A CPU might have been programmed to automatically switch control to a ‘debugger’ program after it has executed an instruction
That type of situation is known as a ‘trap’
It is activated after incrementing the IP
Error Exceptions
Most error exceptions (divide by zero, invalid operation, illegal memory reference, etc.) translate directly into signals
  This isn’t a coincidence...
The kernel’s job is fairly simple: send the appropriate signal to the current process
  force_sig(sig_number, current);
That will probably kill the process, but that’s not the concern of the exception handler
One important exception: page fault
An exception can (infrequently) happen in the kernel
  die(); // kernel oops
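The exception-to-signal translation described above can be sketched in user space. This is only an illustration under stated assumptions: the vector numbers are the standard x86 ones, but `exception_to_signal` is a hypothetical helper and not a kernel function, and the sketch covers just three vectors.

```c
#include <assert.h>
#include <signal.h>

/* Standard x86 exception vector numbers (small subset). */
enum x86_vector {
    VEC_DIVIDE_ERROR   = 0,
    VEC_INVALID_OPCODE = 6,
    VEC_GENERAL_PROT   = 13
};

/*
 * Hypothetical helper (not a kernel API): return the signal a process
 * typically receives for an exception vector, or 0 if this sketch
 * does not cover the vector.
 */
int exception_to_signal(int vector)
{
    assert(vector >= 0 && vector < 32);   /* exceptions use vectors 0-31 */
    switch (vector) {
    case VEC_DIVIDE_ERROR:   return SIGFPE;   /* divide by zero */
    case VEC_INVALID_OPCODE: return SIGILL;   /* invalid operation */
    case VEC_GENERAL_PROT:   return SIGSEGV;  /* illegal memory reference */
    default:                 return 0;
    }
}
```

In the kernel, the value looked up this way would be what is passed as sig_number to force_sig(); the page fault is deliberately left out because it is usually resolved rather than turned into a signal.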
Interrupt Hardware
[Figure: legacy PC design (for single-processor systems). An x86 CPU receives INTR from a master 8259 PIC with a slave 8259 PIC cascaded behind it. Devices shown: programmable interval timer, keyboard controller, real-time clock, SCSI disk, Ethernet.]
I/O devices have (unique or shared) Interrupt Request Lines (IRQs)
IRQs are mapped by special hardware to interrupt vectors, and passed to the CPU
This hardware is called a Programmable Interrupt Controller (PIC)
IRQs
The ‘Interrupt Controller’
Responsible for telling the CPU when a specific external device wishes to ‘interrupt’
  Needs to tell the CPU which one among several devices is the one needing service
PIC translates IRQ to vector
  Raises interrupt to CPU
  Vector available in register
  Waits for ack from CPU
Interrupts can have varying priorities
  PIC also needs to prioritize multiple requests
Possible to “mask” (disable) interrupts at PIC or CPU
Early systems cascaded two 8-input chips (8259A)
APIC, IO-APIC, LAPIC
Advanced PIC (APIC) for SMP systems
  Used in all modern systems
  Interrupts “routed” to CPU over system bus
  IPI: inter-processor interrupt
Local APIC (LAPIC) versus “frontend” IO-APIC
  Devices connect to front-end IO-APIC
  IO-APIC communicates (over bus) with Local APIC
Interrupt routing
  Allows broadcast or selective routing of interrupts
  Ability to distribute interrupt handling load
  Routes to the processor running the lowest-priority process
    Special register: Task Priority Register (TPR)
  Arbitrates (round-robin) if equal priority
Assigning IRQs to Devices
IRQ assignment is hardware-dependent
Sometimes it’s hardwired, sometimes it’s set physically, sometimes it’s programmable
PCI bus usually assigns IRQs at boot
Some IRQs are fixed by the architecture
  IRQ0: Interval timer
  IRQ2: Cascade pin for 8259A
Linux device drivers request IRQs when the device is opened
Note: especially useful for dynamically-loaded drivers, such as for USB or PCMCIA devices
Two devices that aren’t used at the same time can share an IRQ, even if the hardware doesn’t support simultaneous sharing
Assigning Vectors to IRQs
Vector: index (0-255) into interrupt descriptor table
Vectors usually IRQ# + 32
  Below 32 reserved for non-maskable interrupts & exceptions
  Maskable interrupts can be assigned as needed
  Vector 128 used for syscall
  Vectors 251-255 used for IPI
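The usual IRQ-to-vector arithmetic can be written down as a small sketch. The constants follow the numbers on the slide; the macros are defined locally for illustration, and `irq_to_vector` and `vector_is_reserved` are hypothetical helper names.

```c
#include <assert.h>

#define FIRST_EXTERNAL_VECTOR 32    /* vectors 0-31: NMI and exceptions */
#define SYSCALL_VECTOR        128   /* 0x80, used for system calls */
#define NR_VECTORS            256   /* the IDT has entries 0-255 */

/* Hypothetical helper: usual mapping of a hardware IRQ number to its vector. */
int irq_to_vector(int irq)
{
    assert(irq >= 0 && irq + FIRST_EXTERNAL_VECTOR < NR_VECTORS);
    return irq + FIRST_EXTERNAL_VECTOR;
}

/* Returns 1 if the vector lies in the range reserved for NMI and exceptions. */
int vector_is_reserved(int vector)
{
    return vector >= 0 && vector < FIRST_EXTERNAL_VECTOR;
}
```

With this mapping the interval timer on IRQ0 arrives on vector 32, while a page fault (vector 14) falls in the reserved range.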
Interrupt Descriptor Table
The ‘entry-point’ to the interrupt handler is located via the Interrupt Descriptor Table (IDT)
IDT: “gate descriptors”
  Segment selector + offset for handler
  Descriptor Privilege Level (DPL)
  Gates (slightly different ways of entering kernel)
    Task gate: includes TSS to transfer to (not used by Linux)
    Interrupt gate: disables further interrupts
    Trap gate: further interrupts still allowed
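To make the gate descriptor fields concrete, here is a minimal sketch of how one 8-byte entry is packed on 32-bit x86. The field layout and the type codes 0xE and 0xF come from the architecture; the struct and the `make_gate` helper are illustrative names, not the kernel's own definitions.

```c
#include <assert.h>
#include <stdint.h>

#define GATE_INTERRUPT 0xE   /* 32-bit interrupt gate: interrupts disabled on entry */
#define GATE_TRAP      0xF   /* 32-bit trap gate: interrupts left as they were */

/* Layout of one 8-byte IDT entry on 32-bit x86. */
struct gate_desc {
    uint16_t offset_low;   /* handler offset, bits 0-15 */
    uint16_t selector;     /* code segment selector */
    uint8_t  reserved;     /* zero for interrupt and trap gates */
    uint8_t  type_attr;    /* present (bit 7), DPL (bits 5-6), type (bits 0-3) */
    uint16_t offset_high;  /* handler offset, bits 16-31 */
};

/* Hypothetical helper: build a present gate for the given handler address. */
struct gate_desc make_gate(uint32_t handler, uint16_t selector,
                           uint8_t type, uint8_t dpl)
{
    struct gate_desc g;

    assert(dpl <= 3);
    g.offset_low  = (uint16_t)(handler & 0xFFFF);
    g.selector    = selector;
    g.reserved    = 0;
    g.type_attr   = (uint8_t)(0x80 | (dpl << 5) | (type & 0x0F));
    g.offset_high = (uint16_t)(handler >> 16);
    return g;
}
```

A DPL of 3 is what lets user code invoke a gate with a software interrupt, which is why the system-call gate differs from device gates in that field.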
Interrupt Masking
Two different types: global and per-IRQ
Global: delays all interrupts
Selective: individual IRQs can be masked selectively
Selective masking is usually what’s needed: interference is most common from two interrupts of the same type
Putting It All Together
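[Figure: IRQs 0 through N enter the PIC, which raises INTR to the CPU and supplies a vector. The CPU's idtr register locates the IDT (entries 0 through 255) in memory, and the vector selects the entry that points to the handler. Mask points are marked in the diagram.]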
Dispatching Interrupts
Each interrupt has to be handled by a special device- or trap-specific routine
Interrupt Descriptor Table (IDT) has gate descriptors for each interrupt vector
Hardware locates the proper gate descriptor for this interrupt vector, and locates the new context
A new stack pointer, program counter, CPU and memory state, etc., are loaded
Global interrupt mask set
The old program counter, stack pointer, CPU and memory state, etc., are saved on the new stack
The specific handler is invoked
Handling Nested Interrupts
As soon as possible, unmask the global interrupt
As soon as reasonable, re-enable interrupts from that IRQ
But that isn’t always a great idea, since it could cause re-entry to the same handler
IRQ-specific mask is not enabled during interrupt-handling
Nested Execution
Interrupts can be interrupted
  By different interrupts; handlers need not be reentrant
  No notion of priority in Linux
  Small portions execute with interrupts disabled
  Interrupts remain pending until acked by CPU
Exceptions can be interrupted
  By interrupts (devices needing service)
Exceptions can nest two levels deep
  Exceptions indicate coding error
  Exception code (kernel code) shouldn’t have bugs
  Page fault is possible (trying to touch user data)
Interrupt Handling Philosophy
Do as little as possible in the interrupt handler
Defer non-critical actions till later
Structure: top and bottom halves
  Top-half: do minimum work and return (ISR)
  Bottom-half: deferred processing (softirqs, tasklets)
Again: want to do as little as possible with IRQ interrupts masked
No process context available
No Process Context
Interrupts (as opposed to exceptions) are not associated with particular instructions
They’re also not associated with a given process
The currently-running process, at the time of the interrupt, has no relationship whatsoever to that interrupt
Interrupt handlers cannot refer to current
Interrupt handlers cannot sleep!
Interrupt Stack
When an interrupt occurs, what stack is used?
Exceptions: The kernel stack of the current process, whatever it is, is used (There’s always some process running: the “idle” process, if nothing else)
Interrupts: hard IRQ stack (1 per processor)
SoftIRQs: soft IRQ stack (1 per processor)
These stacks are configured in the IDT and TSS at boot time by the kernel
First-Level Interrupt Handler
Often in assembler
Perform minimal, common functions: save registers, unmask other interrupts
Eventually, undoes that: restores registers, returns to previous context
Most important: call proper second-level interrupt handler (C program)
Finding the Proper Handler
On modern hardware, multiple I/O devices can share a single IRQ and hence interrupt vector
First differentiator is the interrupt vector
Multiple interrupt service routines (ISR) can be associated with a vector
Each device’s ISR for that IRQ is called; the determination of whether or not that device has interrupted is device-dependent
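The idea of calling every ISR registered on a shared IRQ can be modelled in user space as follows. This is a toy sketch, not kernel code: `irq_line`, `irq_register`, `irq_dispatch`, and the `toy_dev` device are all hypothetical, and a plain `pending` flag stands in for a device status register.

```c
#include <assert.h>

#define MAX_SHARED 4

/* An ISR returns 1 if its device raised the interrupt, 0 otherwise. */
typedef int (*isr_fn)(void *dev);

struct irq_line {
    isr_fn handlers[MAX_SHARED];
    void  *devs[MAX_SHARED];
    int    count;
};

/* Register one more ISR on a shared line; returns 0 on success, -1 if full. */
int irq_register(struct irq_line *line, isr_fn fn, void *dev)
{
    if (line->count >= MAX_SHARED)
        return -1;
    line->handlers[line->count] = fn;
    line->devs[line->count] = dev;
    line->count++;
    return 0;
}

/* Call every ISR on the line; return how many devices claimed the interrupt. */
int irq_dispatch(struct irq_line *line)
{
    int handled = 0;

    for (int i = 0; i < line->count; i++)
        handled += line->handlers[i](line->devs[i]);
    return handled;
}

/* Toy device: "pending" stands in for a device status register. */
struct toy_dev { int pending; int serviced; };

int toy_isr(void *dev)
{
    struct toy_dev *d = dev;

    if (!d->pending)
        return 0;            /* this device did not interrupt */
    d->pending = 0;
    d->serviced++;
    return 1;
}

/* Demo: two devices share a line, only the second has an interrupt pending. */
int shared_irq_demo(void)
{
    struct irq_line line = { .count = 0 };
    struct toy_dev a = { 0, 0 }, b = { 1, 0 };

    irq_register(&line, toy_isr, &a);
    irq_register(&line, toy_isr, &b);
    int handled = irq_dispatch(&line);
    assert(a.serviced == 0 && b.serviced == 1);
    return handled;
}
```

Each ISR checks its own device and declines quietly when the interrupt was not its own, which is the device-dependent determination mentioned above.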
Deferrable Work
We don’t want to do too much in regular interrupt handlers:
  Interrupts are masked
  We don’t want the kernel stack to grow too much
Instead, interrupt handlers schedule work to be performed later
Three deferred work mechanisms: softirqs, tasklets, and work queues
Tasklets are built on top of softirqs
For all of these, requests are queued
Softirqs
Statically allocated: specified at kernel compile time
Limited number:

Priority  Type
0         High-priority tasklets
1         Timer interrupts
2         Network transmission
3         Network reception
4         SCSI disks
5         Regular tasklets
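The priority table above can be illustrated with a small user-space model of a pending bitmask that is scanned from index 0 upward. The enum mirrors the table; the functions with the `_model` suffix are hypothetical stand-ins and make no claim about the kernel's actual implementation.

```c
#include <assert.h>

/* Indices follow the table: a lower number is served first. */
enum {
    HI_SOFTIRQ, TIMER_SOFTIRQ, NET_TX_SOFTIRQ,
    NET_RX_SOFTIRQ, SCSI_SOFTIRQ, TASKLET_SOFTIRQ,
    NR_SOFTIRQS
};

static unsigned int pending;              /* bit n set means softirq n was raised */
static int run_order[NR_SOFTIRQS];        /* order in which the last pass ran them */
static int run_count;

/* Mark a softirq as pending (model only). */
void raise_softirq_model(int nr)
{
    assert(nr >= 0 && nr < NR_SOFTIRQS);
    pending |= 1u << nr;
}

/* Run each pending softirq once, lowest index (highest priority) first. */
void do_softirq_model(void)
{
    unsigned int todo = pending;

    pending = 0;                          /* new requests wait for a later pass */
    run_count = 0;
    for (int nr = 0; nr < NR_SOFTIRQS; nr++)
        if (todo & (1u << nr))
            run_order[run_count++] = nr;  /* a real handler would run here */
}

/* Returns 1 if anything is still pending. */
int softirq_pending_model(void)
{
    return pending != 0;
}

/* Demo: raise three softirqs out of order and check the service order. */
int softirq_order_demo(void)
{
    raise_softirq_model(NET_RX_SOFTIRQ);
    raise_softirq_model(TASKLET_SOFTIRQ);
    raise_softirq_model(TIMER_SOFTIRQ);
    do_softirq_model();
    return run_count == 3 &&
           run_order[0] == TIMER_SOFTIRQ &&
           run_order[1] == NET_RX_SOFTIRQ &&
           run_order[2] == TASKLET_SOFTIRQ;
}
```

Because the table is a fixed-size array indexed by an enum, adding a softirq means changing the source and recompiling, which matches the "statically allocated" point.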
Running Softirqs
Run at various points by the kernel
  System calls, exceptions
Most important: after handling IRQs and after timer interrupts
Softirq routines can be executed simultaneously on multiple CPUs:
  Code must be re-entrant
  Code must do its own locking as needed
Hardware interrupts always enabled when softirqs are running.
Rescheduling Softirqs
A softirq routine can reschedule itself
This could starve user-level processes
Softirq scheduler only runs a limited number of requests at a time
The rest are executed by a kernel thread, ksoftirqd, which competes with user processes for CPU time
Tasklets
Similar to softirqs
Created and destroyed dynamically
Individual tasklets are locked during execution; no problem about re-entrancy, and no need for locking by the code
Only one instance of tasklet can run, even with multiple CPUs
Preferred mechanism for most deferred activity
Work Queues
Always run by kernel threads
Softirqs and tasklets run in an interrupt context; work queues have a process context
Because they have a process context, they can sleep
However, they’re kernel-only; there is no user mode associated with them
Synchronization in Linux
Linux Synch Primitives
Memory barriers
  avoid compiler, CPU instruction re-ordering
Atomic operations
  memory bus lock, read-modify-write ops
Interrupt/softirq disabling/enabling
  local, global
Spin locks
  general, read/write, big reader
Semaphores
  general, read/write
Choosing Synch Primitives
Avoid synch if possible! (clever instruction ordering)
  Example: inserting in linked list (needs barrier still)
Use atomics or rw spinlocks if possible
Use semaphores if you need to sleep
  Can’t sleep in interrupt context
  Don’t sleep holding a spinlock!
Complicated matrix of choices for protecting data structures accessed by deferred functions
Architectural Dependence
The implementation of the synchronization primitives is extremely architecture dependent.
This is because only the hardware can guarantee atomicity of an operation.
Each architecture must provide a mechanism for doing an operation that can examine and modify a storage location atomically.
Some architectures do not guarantee atomicity, but inform whether the operation attempted was atomic.
Barriers: Definition
Barriers are used to prevent a processor and/or the compiler from reordering instruction execution and memory modification.
Barriers are instructions to hardware and/or compiler to complete all pending accesses before issuing any more
  read memory barrier – acts on read requests
  write memory barrier – acts on write requests
Intel
  certain instructions act as barriers: lock, iret, control regs
  rmb – asm volatile("lock;addl $0,0(%%esp)":::"memory")
    add 0 to top of stack with lock prefix
  wmb – Intel never re-orders writes, just for compiler
Atomic Operations
Many instructions not atomic in hardware (SMP)
  Read-modify-write instructions: inc, test-and-set, swap
  unaligned memory access
Compiler may not generate atomic code
  even i++ is not necessarily atomic!
If the data that must be protected is a single word, atomic operations can be used. These functions examine and modify the word atomically.
The atomic data type is atomic_t.
Intel implementation
  lock prefix byte 0xf0 – locks memory bus
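As a user-space analogue of single-word atomic operations, the sketch below uses C11 `<stdatomic.h>` rather than the kernel's atomic_t interface, so it shows the idea and not the kernel API. The reference-count scenario and the names `ref_get` and `ref_put` are assumptions made for illustration; on x86 a compiler typically implements these calls with a lock-prefixed instruction.

```c
#include <assert.h>
#include <stdatomic.h>

/* One shared word, playing the role that atomic_t plays in the kernel. */
static atomic_int refcount = 1;

/* Atomically add one reference; returns the new count. */
int ref_get(void)
{
    return atomic_fetch_add(&refcount, 1) + 1;
}

/* Atomically drop one reference; returns 1 if this was the last one. */
int ref_put(void)
{
    int old = atomic_fetch_sub(&refcount, 1);

    assert(old > 0);          /* dropping below zero would be a bug */
    return old == 1;
}
```

A plain `refcount++` would compile to a separate load, add, and store, so two CPUs could both read the same old value; the atomic read-modify-write closes that window without a lock.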
Serializing with Interrupts
Basic primitive in original UNIX
Doesn’t protect against other CPUs
Intel: “interrupts enabled bit”
  cli to clear (disable), sti to set (enable)
Enabling is often wrong; need to restore
  local_irq_save()
  local_irq_restore()
Interrupt Operations
Services used to serialize with interrupts are:
  local_irq_disable - disables interrupts on the current CPU
  local_irq_enable - enables interrupts on the current CPU
  local_save_flags - returns the interrupt state of the processor
  local_restore_flags - restores the interrupt state of the processor
Dealing with the full interrupt state of the system is officially discouraged. Locks should be used.
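Why restoring the saved state is safer than unconditionally enabling can be shown with a small user-space model. Nothing here touches real interrupt hardware: `irqs_enabled` is an ordinary variable standing in for the CPU's interrupt-enable bit, and the `_model` functions are hypothetical stand-ins for the save and restore services.

```c
#include <assert.h>

static int irqs_enabled = 1;    /* models the CPU "interrupts enabled" bit */

/* Model of save-and-disable: return the previous state, then disable. */
int irq_save_model(void)
{
    int flags = irqs_enabled;

    irqs_enabled = 0;
    return flags;
}

/* Model of restore: put back whatever state the caller saved. */
void irq_restore_model(int flags)
{
    irqs_enabled = flags;
}

/* Inner critical section; callers may arrive with interrupts on or off. */
void inner_section(void)
{
    int flags = irq_save_model();

    /* ... touch data shared with an interrupt handler ... */
    irq_restore_model(flags);   /* restore, not an unconditional enable */
}

/* Outer section: interrupts must still be off after inner_section() returns. */
int nested_demo(void)
{
    int flags = irq_save_model();

    inner_section();
    int still_disabled = (irqs_enabled == 0);
    irq_restore_model(flags);
    return still_disabled && irqs_enabled == 1;
}
```

If inner_section() ended with an unconditional enable, the outer critical section would silently run with interrupts on, which is the failure the slide warns about.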
Spin Locks
A spin lock is a data structure (spinlock_t) that is used to synchronize access to critical sections.
Only one thread can be holding a spin lock at any moment. All other threads trying to get the lock will “spin” (loop while checking the lock status).
Spin locks should not be held for long periods because waiting tasks on other CPUs are spinning, and thus wasting CPU execution time.
__raw_spin_lock_string
    1:  lock; decb %0    # atomically decrement
        jns 3f           # if clear sign bit jump forward to 3
    2:  rep; nop         # wait
        cmpb $0, %0      # spin – compare to 0
        jle 2b           # go back to 2 if <= 0 (locked)
        jmp 1b           # unlocked; go back to 1 to try again
    3:                   # we have acquired the lock …
From linux/include/asm-i386/spinlock.h
spin_unlock merely writes 1 into the lock field.
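The same decrement-and-test scheme can be sketched in portable C11, as a user-space illustration only. It uses `<stdatomic.h>` and an int-sized lock word instead of the byte-wide inline assembly, and every `toy_` name is hypothetical.

```c
#include <assert.h>
#include <stdatomic.h>

/* Lock word: 1 means unlocked, 0 or negative means locked. */
typedef struct { atomic_int slock; } toy_spinlock_t;

void toy_spin_init(toy_spinlock_t *l)
{
    atomic_init(&l->slock, 1);
}

void toy_spin_lock(toy_spinlock_t *l)
{
    /* Like "lock; decb": whoever moves the word from 1 to 0 owns the lock. */
    while (atomic_fetch_sub(&l->slock, 1) != 1) {
        /* Like "cmpb $0 / jle 2b": wait with plain reads until it looks free. */
        while (atomic_load(&l->slock) <= 0)
            ;
    }
}

/* Take the lock only if it is free right now; returns 1 on success. */
int toy_spin_trylock(toy_spinlock_t *l)
{
    int expected = 1;

    return atomic_compare_exchange_strong(&l->slock, &expected, 0);
}

/* As in the original, unlocking just writes 1 into the lock word. */
void toy_spin_unlock(toy_spinlock_t *l)
{
    atomic_store(&l->slock, 1);
}

/* Single-threaded demo of the state transitions. */
int toy_spin_demo(void)
{
    toy_spinlock_t l;

    toy_spin_init(&l);
    toy_spin_lock(&l);                      /* uncontended: returns at once */
    int busy = !toy_spin_trylock(&l);       /* second attempt must fail */
    toy_spin_unlock(&l);
    int free_again = toy_spin_trylock(&l);  /* succeeds after unlock */
    toy_spin_unlock(&l);
    return busy && free_again;
}
```

The inner loop only reads the lock word, so waiting CPUs do not keep issuing locked bus operations; they retry the atomic decrement only once the word looks unlocked.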
Spin Lock Operations
Functions used to work with spin locks:
  spin_lock_init – initialize a spin lock before using it for the first time
  spin_lock – acquire a spin lock, spin waiting if it is not available
  spin_unlock – release a spin lock
  spin_unlock_wait – spin waiting for spin lock to become available, but don't acquire it
  spin_trylock – acquire a spin lock if it is currently free, otherwise return error
  spin_is_locked – return spin lock state
Spin Locks & Interrupts
The spin lock services also provide interfaces that serialize with interrupts (on the current processor):
  spin_lock_irq - acquire spin lock and disable interrupts
  spin_unlock_irq - release spin lock and reenable
  spin_lock_irqsave - acquire spin lock, save interrupt state, and disable
  spin_unlock_irqrestore - release spin lock and restore interrupt state
RW Spin Locks & Interrupts
The read/write lock services also provide interfaces that serialize with interrupts (on the current processor):
  read_lock_irq - acquire lock for read and disable interrupts
  read_unlock_irq - release read lock and reenable
  read_lock_irqsave - acquire lock for read, save interrupt state, and disable
  read_unlock_irqrestore - release read lock and restore interrupt state
Corresponding functions for write exist as well (e.g., write_lock_irqsave).
The Big Reader Lock
Reader optimized RW spinlock
  RW spinlock suffers cache contention
    On lock and unlock because of write to rwlock_t
  Per-CPU, cache-aligned lock arrays
    One for reader portion, another for writer portion
To read: set bit in reader array, spin on writer
  Acquire when writer lock free; very fast!
To write: set bit and scan ALL reader bits
  Acquire when reader bits all free; very slow!
Semaphores
A semaphore is a data structure that is used to synchronize access to critical sections or other resources.
A semaphore allows a fixed number of tasks (generally one for critical sections) to "hold" the semaphore at one time. Any more tasks requesting to hold the semaphore are blocked (put to sleep).
A semaphore can be used for serialization only in code that is allowed to block.
Semaphore Structure
struct semaphore
  count (atomic_t):
    > 0: free; = 0: in use, no waiters; < 0: in use, waiters
  wait: wait queue
  sleepers: 0 (none), 1 (some), occasionally 2
wait: wait queue implementation requires lower-level synch
  atomic updates, spinlock, interrupt disabling
[Figure: struct semaphore with fields atomic_t count, int sleepers, and wait_queue_head_t wait; the wait queue head holds a lock and next/prev pointers.]
Semaphores
optimized assembly code for normal case (down())
  C code for slower “contended” case (__down())
up() is easy
  atomically increment; wake_up() if necessary
uncontended down() is easy
  atomically decrement; continue
contended down() is really complex!
  basically increment sleepers and sleep
  loop because of potentially concurrent ups/downs
still in down() path when lock is acquired
RW Semaphores
A rw_semaphore is a semaphore that allows either one writer or any number of readers (but not both at the same time) to hold it.
Any writer requesting to hold the rw_semaphore is blocked when there are readers holding it.
A rw_semaphore can be used for serialization only in code that is allowed to block. Both types of semaphores are the only synchronization objects that should be held when blocking.
Writers will not starve: once a writer arrives, readers queue behind it
Increases concurrency; introduced in 2.4
Mutexes
A mutex is a data structure that is also used to synchronize access to critical sections or other resources, introduced in 2.6.16.
Why? (Documentation/mutex-design.txt)
  simpler (lighter weight)
  tighter code
  slightly faster, better scalability
  no fastpath tradeoffs
  debug support – strict checking of adhering to semantics
Linux Virtual FileSystem (VFS) and Ext2
Linux File System Model
basically UNIX file semantics
  File systems are mounted at various points
  Files identified by device inode numbers
VFS layer just dispatches to fs-specific functions
  libc read() -> sys_read()
    what type of filesystem does this file belong to?
    call filesystem (fs) specific read function
      maintained in open file object (file)
      example: file->f_op->read(…)
similar to device abstraction model in UNIX
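The dispatch through file->f_op->read can be modelled with ordinary function pointers. The sketch below is a user-space toy: the two structs carry only the fields needed for the example, and `memfs_read`, `vfs_read_model`, and `vfs_demo` are hypothetical names rather than kernel functions.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

struct file;

/* Per-filesystem method table (only read is modelled here). */
struct file_operations {
    long (*read)(struct file *filp, char *buf, size_t count);
};

struct file {
    const struct file_operations *f_op;  /* set when the file is opened */
    size_t f_pos;                        /* current position (file pointer) */
    const char *data;                    /* toy backing store */
    size_t size;
};

/* "Filesystem-specific" read for a toy in-memory filesystem. */
static long memfs_read(struct file *filp, char *buf, size_t count)
{
    assert(filp->f_pos <= filp->size);
    size_t left = filp->size - filp->f_pos;

    if (count > left)
        count = left;                    /* never read past end of file */
    memcpy(buf, filp->data + filp->f_pos, count);
    filp->f_pos += count;
    return (long)count;
}

static const struct file_operations memfs_fops = { .read = memfs_read };

/* Generic layer: knows nothing about the filesystem, it only dispatches. */
long vfs_read_model(struct file *filp, char *buf, size_t count)
{
    if (!filp->f_op || !filp->f_op->read)
        return -1;
    return filp->f_op->read(filp, buf, count);
}

/* Demo: two reads through the generic layer advance the shared position. */
int vfs_demo(void)
{
    struct file f = { &memfs_fops, 0, "hello", 5 };
    char buf[8] = { 0 };
    long n1 = vfs_read_model(&f, buf, 3);      /* reads "hel" */
    long n2 = vfs_read_model(&f, buf + 3, 8);  /* reads the remaining "lo" */

    return n1 == 3 && n2 == 2 && strcmp(buf, "hello") == 0;
}
```

Supporting another filesystem in this model means supplying another `file_operations` table; the generic read path stays unchanged.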
VFS System Calls
fundamental UNIX abstractions
  files (everything is a file)
    ex: /dev/ttyS0 – device as a file
    ex: /proc/123 – process as a file
  processes
  users
lots of syscalls related to files! (~100)
  most dispatch to filesystem-specific calls
  some require no filesystem action
    example: lseek(pos) – change position in file
  others have default VFS implementations
VFS System Calls (cont.)
filesystem ops – mounting, info, flushing, chroot, pivot_root
directory ops – chdir, getcwd, link, unlink, rename, symlink
file ops – open/close, (p)read(v)/(p)write(v), seek, truncate, dup, fcntl, creat
inode ops – stat, permissions, chmod, chown
memory mapping files – mmap, munmap, madvise, mlock
wait for input – poll, select
flushing – sync, fsync, msync, fdatasync
file locking – flock
Big Four Data Structures
struct file
  information about an open file
  includes current position (file pointer)
struct dentry
  information about a directory entry
  includes name + inode#
struct inode
  unique descriptor of a file or directory
  contains permissions, timestamps, block map (data)
  inode#: integer (unique per mounted filesystem)
struct superblock
  descriptor of a mounted filesystem
Two More Data Structures
struct file_system_type
  name of file system
  pointer to implementing module
    including how to read a superblock
  On module load, you call register_filesystem and pass a pointer to this structure
struct vfsmount
  Represents a mounted instance of a particular file system
  One super block can be mounted in two places, with different covering sub mounts
  Thus lookup requires parent dentry and a vfsmount
Data Structure Relationships
[Figure: a task_struct's fds refer to open file objects; open file objects refer to dentries in the dentry cache; dentries refer to inodes in the inode cache; inodes lead to the superblock and the disk.]
Sharing Data Structures
calling dup() – shares open file objects
  example: 2>&1
opening the same file twice – shares dentries
opening same file via different hard links – shares inodes
mounting same filesystem on different dirs – shares superblocks
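That dup() shares the open file object, and with it the file position, can be checked from user space with standard POSIX calls. The sketch assumes a POSIX system with a writable temporary directory; `dup_shares_offset` is a name made up for this example.

```c
#define _POSIX_C_SOURCE 200809L   /* for fileno() under strict compiler modes */
#include <assert.h>
#include <stdio.h>
#include <sys/types.h>
#include <unistd.h>

/*
 * Returns 1 if a dup()ed descriptor observes a seek made through the
 * original descriptor, -1 if the temporary file could not be set up.
 */
int dup_shares_offset(void)
{
    FILE *tmp = tmpfile();            /* anonymous temporary file */

    if (!tmp)
        return -1;
    int fd1 = fileno(tmp);
    assert(fd1 >= 0);
    int fd2 = dup(fd1);               /* second descriptor, same open file object */
    if (fd2 < 0) {
        fclose(tmp);
        return -1;
    }
    off_t moved = lseek(fd1, 100, SEEK_SET);  /* move the position via fd1 ... */
    off_t seen  = lseek(fd2, 0, SEEK_CUR);    /* ... and query it via fd2 */

    close(fd2);
    fclose(tmp);
    return moved == 100 && seen == 100;
}
```

Opening the same path twice would instead create two open file objects with independent positions, even though they share a dentry and an inode.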
Superblock
mounted filesystem descriptor
  usually first block on disk (after boot block)
  copied into (similar) memory structure on mount
    distinction: disk superblock vs memory superblock
    dirty bit (s_dirt), copied to disk frequently
important fields
  s_dev, s_bdev – device, device-driver
  s_blocksize, s_maxbytes, s_type
  s_flags, s_magic, s_count, s_root, s_dquot
  s_dirty – dirty inodes for this filesystem
  s_op – superblock operations
  u – filesystem specific data
Inode
"index" node – unique file or directory descriptor
  meta-data: permissions, owner, timestamps, size, link count
  data: pointers to disk blocks containing actual data
    data pointers are "indices" into file contents (hence "inode")
inode # - unique integer (per-mounted filesystem)
what about names and paths?
  high-level fluff on top of a "flat-filesystem"
  implemented by directory files (directories)
  directory contents: name + inode
File Links
UNIX link semantics
hard links – multiple dir entries with same inode #
  equal status; first is not "real" entry
  file deleted when link count goes to 0
  restrictions
    can't hard link to directories (avoids cycles) or across filesystems
soft (symbolic) links – little files with pathnames
  just aliases for another pathname
  no restrictions, cycles possible, dangling links possible
(Open) File Object
struct file (usual variable name - filp)
  association between file and process
  no disk representation
  created for each open (multiple possible, even same file)
  most important info: file pointer
file descriptor (small ints)
  index into array of pointers to open file objects
file object states
  unused (memory cache + root reserve (10))
    get_empty_filp()
  inuse (per-superblock lists)
system-wide max on open file objects (~8K)
  /proc/sys/fs/file-max
Dentry
abstraction of directory entry
  ex: line from ls -l
  either files (hard links) or soft links or subdirectories
  every dentry has a parent dentry (except root)
  sibling dentries – other entries in the same directory
directory api: dentry iterators
  posix: opendir(), readdir(), scandir(), seekdir(), rewinddir()
  syscall: getdents()
why an abstraction?
  UNIX: directories are really files with directory "records"
  MSDOS, etc.: directory is just a big table on disk (FAT)
    no such thing as subdirectories!
    just fields in table (file->parentdir), (dir->parentdir)
Dentry Cache
very important cache for filesystem performance
  every file access causes multiple dentry accesses!
  example: /tmp/foo
    dentries for "/", "/tmp", "/tmp/foo" (path components)
dentry cache "controls" inode cache
  inodes released only when dentry is released
dentry cache accessed via hash table
  hash(dir, filename) -> dentry
Dentry Cache (continued)
dentry states
  free (not valid; maintained by slab cache)
  in-use (associated with valid open inode)
  unused (valid but not being used; LRU list)
  negative (file that does not exist)
dentry ops
  just a few, mostly default actions
  ex: d_compare(dir, name1, name2)
    case-insensitive for MSDOS
Process-related Files
current->fs (fs_struct)
  root (for chroot jails)
  pwd
  umask (default file permissions)
current->files (files_struct)
  fd[] (file descriptor array – pointers to file objects)
    0, 1, 2 – stdin, stdout, stderr
    originally 32, growable to 1,024 (RLIMIT_NOFILE)
    complex structure for growing … see book
  close_on_exec memory (bitmap)
    open files normally inherited across exec
Filesystem Types
Linux must "know about" filesystem before mount
multiple (mounted) instances of each type possible
special (virtual) filesystems (like /proc)
  structuring technique to touch kernel data
  examples:
    /proc, /dev (devfs)
    sockfs, pipefs, tmpfs, rootfs, shmfs
  associated with fictitious block device (major# 0)
  minor# distinguishes special filesystem types
Registering a Filesystem Type
must register before mount
  static (compile-time) or dynamic (modules)
register_filesystem() / unregister_filesystem()
  adds file_system_type object to linked-list
    file_systems (head; kernel global variable)
    file_systems_lock (rw spinlock to protect list)
file_system_type descriptor
  name, flags, pointer to implementing module
  list of superblocks (mounted instances)
  read_super() – pointer to method for reading superblock
    most important thing! filesystem specific
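Data Structure Relationships (2)
[Figure: open file objects, dentries, inodes, and the superblock, linked by pointer fields: f_dentry (file to dentry), d_inode (dentry to inode), d_parent and d_subdirs (between dentries), d_sb (dentry to superblock), i_sb (inode to superblock), and i_dentries (inode to its dentries).]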
Ext2
“Standard” Linux File System
  Previously it was the most commonly used
  Serves as a basis for Ext3 which adds journaling
Uses FFS like layout
  Each file system is composed of identical block groups
  Allocation is designed to improve locality
Inodes contain pointers (32 bits) to blocks
  Direct, Indirect, Double Indirect, Triple Indirect
  Maximum file size: 4.1TB (4K Blocks)
  Maximum file system size: 16TB (4K Blocks)
Ext2 Disk Layout
Files in the same directory are stored in the same block group
Files in different directories are spread among the block groups
Picture from Tanenbaum, Modern Operating Systems 3 e, (c) 2008 Prentice-Hall, Inc. All rights reserved. 0-13-6006639
Block Addressing in Ext2
Picture from Tanenbaum, Modern Operating Systems 3 e, (c) 2008 Prentice-Hall, Inc. All rights reserved. 0-13-6006639
[Figure: an inode holds twelve “direct” block pointers plus single-, double-, and triple-indirect pointers, which reach BLKSIZE/4, (BLKSIZE/4)^2, and (BLKSIZE/4)^3 data blocks respectively.]
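The block-addressing scheme turns into simple arithmetic: twelve direct blocks plus BLKSIZE/4 pointers per indirect block at each level. The sketch below computes only the limit implied by the block map and by 32-bit block numbers; other limits in a real implementation may be lower, and the function names are hypothetical.

```c
#include <assert.h>
#include <stdint.h>

#define EXT2_NDIR_BLOCKS 12   /* direct pointers held in the inode */

/* Number of data blocks reachable through the inode's block map. */
uint64_t ext2_max_file_blocks(uint32_t blksize)
{
    assert(blksize >= 1024 && blksize % 4 == 0);
    uint64_t n = blksize / 4;   /* 32-bit pointers per indirect block */

    return EXT2_NDIR_BLOCKS + n + n * n + n * n * n;
}

/* Upper bound on file size in bytes implied by the block map alone. */
uint64_t ext2_max_file_bytes(uint32_t blksize)
{
    return ext2_max_file_blocks(blksize) * blksize;
}

/* Upper bound on file system size: 2^32 addressable blocks. */
uint64_t ext2_max_fs_bytes(uint32_t blksize)
{
    return ((uint64_t)1 << 32) * blksize;
}
```

For 4K blocks this gives 1,074,791,436 addressable blocks per file, roughly 4 TiB, and 2^44 bytes (16 TiB) for the whole file system.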
Ext2 Directory Structure
[Figure: (a) A Linux directory with three files. (b) The same directory after the file voluminous has been removed.]
Picture from Tanenbaum, Modern Operating Systems 3 e, (c) 2008 Prentice-Hall, Inc. All rights reserved. 0-13-6006639