“Everything’s a file”
I /home/josh/doc/presentations/lpc-2015/fd/fd.pdf
I /etc/hostname
I /dev/null
I /dev/zero
I /dev/ttyS0
I /dev/dri/card0
I /dev/cpu/0/cpuid
I /tmp/.X11-unix/X0
I /proc/1/environ
I /proc/cmdline
I /sys/class/block/sda/queue/rotational
I /sys/firmware/acpi/tables/DSDT
Everything has a filename?
I Pipes
I Sockets
I epoll
I memfd
I KVM virtual machines and CPUs
I . . .
Everything’s a file descriptor
I What is a file descriptor, really?
I What can you do with a file descriptor?
I What interesting file descriptors exist?
I How do you build a new type of file descriptor?
I What interesting file descriptors don’t exist yet?
What is a file descriptor, really?
I struct fd, struct fdtable
I struct file
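struct fd versus struct file
testfile contains “0123456789”
#include <fcntl.h>
#include <stdio.h>
#include <unistd.h>

int main(void)
{
    char c;
    int x = open("testfile", O_RDONLY);
    int xdup = dup(x);                  /* shares x's struct file */
    int y = open("testfile", O_RDONLY); /* gets its own struct file */
    read(x, &c, 1);
    putchar(c); /* Prints '0' */
    read(xdup, &c, 1);
    putchar(c); /* Prints '1' */
    read(y, &c, 1);
    putchar(c); /* Prints '0' */
    return 0;
}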
[Diagram: x, xdup, and y are userspace ints (struct fd). x and xdup refer to one struct file, a kernel object with f_pos: 2 and f_count: 2 after the reads above. y refers to a second struct file with f_pos: 1 and f_count: 1. Both struct files refer to the same driver-specific object, testfile (bytes 0, 1, 2, 3, ...).]
File descriptor: Userspace reference to kernel object
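The same split between the userspace int and the kernel object shows up in descriptor flags. The following is a minimal sketch, assuming Linux and a hypothetical helper name (dup_shares_file_state, not from the talk): file status flags such as O_NONBLOCK live in the shared struct file, while FD_CLOEXEC belongs to the individual descriptor.

```c
#include <fcntl.h>
#include <stdbool.h>
#include <unistd.h>

/* Hypothetical helper: returns true if O_NONBLOCK set through one
 * descriptor is visible through its dup()ed twin, while FD_CLOEXEC
 * set on one descriptor is not visible on the other. */
bool dup_shares_file_state(void)
{
    int p[2];
    bool ok = false;

    if (pipe(p) < 0)
        return false;

    int twin = dup(p[0]);
    if (twin >= 0) {
        /* File status flags are stored in the shared struct file. */
        int flags = fcntl(p[0], F_GETFL);
        fcntl(p[0], F_SETFL, flags | O_NONBLOCK);
        bool status_shared = (fcntl(twin, F_GETFL) & O_NONBLOCK) != 0;

        /* FD_CLOEXEC is stored per descriptor, not per struct file. */
        fcntl(p[0], F_SETFD, FD_CLOEXEC);
        bool cloexec_private = (fcntl(twin, F_GETFD) & FD_CLOEXEC) == 0;

        ok = status_shared && cloexec_private;
        close(twin);
    }
    close(p[0]);
    close(p[1]);
    return ok;
}
```

Calling dup_shares_file_state() from any program should return true on Linux; the pipe is used only because it needs no file on disk.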
What can you do with a file descriptor?
I read, write
I seek
I preadv, pwritev
I stat
I Blocking or non-blocking
I poll, select, epoll
I dup, dup2
I Send over a UNIX socket via SCM_RIGHTS
I Inherited over exec
I mmap
I sendfile, splice, tee
I openat
I . . .
I ioctl
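Of the operations listed above, sending a descriptor over a UNIX socket is the least obvious to write. The following is a rough sketch, assuming Linux; send_fd, recv_fd, and fd_passing_roundtrip are illustrative names, not an existing API. It passes the write end of a pipe across a socketpair within one process and writes through the received copy.

```c
#include <string.h>
#include <sys/socket.h>
#include <sys/uio.h>
#include <unistd.h>

/* Send descriptor fd over the UNIX socket sock; returns 0 on success. */
int send_fd(int sock, int fd)
{
    char byte = 'F'; /* carry one byte of ordinary data with the fd */
    struct iovec iov = { .iov_base = &byte, .iov_len = 1 };
    union {
        char buf[CMSG_SPACE(sizeof(int))];
        struct cmsghdr align; /* forces correct alignment of buf */
    } u;
    struct msghdr msg = {0};

    memset(&u, 0, sizeof(u));
    msg.msg_iov = &iov;
    msg.msg_iovlen = 1;
    msg.msg_control = u.buf;
    msg.msg_controllen = sizeof(u.buf);

    struct cmsghdr *cmsg = CMSG_FIRSTHDR(&msg);
    cmsg->cmsg_level = SOL_SOCKET;
    cmsg->cmsg_type = SCM_RIGHTS;
    cmsg->cmsg_len = CMSG_LEN(sizeof(int));
    memcpy(CMSG_DATA(cmsg), &fd, sizeof(int));

    return sendmsg(sock, &msg, 0) == 1 ? 0 : -1;
}

/* Receive one descriptor from sock; returns it, or -1 on failure. */
int recv_fd(int sock)
{
    char byte;
    int fd;
    struct iovec iov = { .iov_base = &byte, .iov_len = 1 };
    union {
        char buf[CMSG_SPACE(sizeof(int))];
        struct cmsghdr align;
    } u;
    struct msghdr msg = {0};

    msg.msg_iov = &iov;
    msg.msg_iovlen = 1;
    msg.msg_control = u.buf;
    msg.msg_controllen = sizeof(u.buf);

    if (recvmsg(sock, &msg, 0) != 1)
        return -1;

    struct cmsghdr *cmsg = CMSG_FIRSTHDR(&msg);
    if (cmsg == NULL || cmsg->cmsg_level != SOL_SOCKET ||
        cmsg->cmsg_type != SCM_RIGHTS)
        return -1;

    memcpy(&fd, CMSG_DATA(cmsg), sizeof(int));
    return fd; /* a new descriptor number for the same struct file */
}

/* Demo: returns the byte read back through the pipe, or -1 on failure. */
int fd_passing_roundtrip(void)
{
    int sv[2], p[2], received, result = -1;
    char c = 0;

    if (socketpair(AF_UNIX, SOCK_STREAM, 0, sv) < 0)
        return -1;
    if (pipe(p) < 0) {
        close(sv[0]);
        close(sv[1]);
        return -1;
    }
    if (send_fd(sv[0], p[1]) == 0 && (received = recv_fd(sv[1])) >= 0) {
        if (write(received, "x", 1) == 1 && read(p[0], &c, 1) == 1)
            result = c;
        close(received);
    }
    close(p[0]);
    close(p[1]);
    close(sv[0]);
    close(sv[1]);
    return result;
}
```

In real use the two socket ends would belong to different processes; the single-process round trip only keeps the sketch short.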
Use file descriptors!
What interesting file descriptors exist?
eventfd
I 64-bit counter used as an event queue
I write: Add value to counter
I read: Block until non-zero; read value and reset to 0
I “Semaphore mode”: Read 1 and decrement by 1
I poll: Ready for reading if non-zero
I Several drivers use eventfd to signal events between kernel and userspace
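The counter behavior described above can be tried from userspace. This is a small sketch, assuming Linux; the function names eventfd_sum_demo and eventfd_semaphore_demo are made up for illustration.

```c
#include <stdint.h>
#include <sys/eventfd.h>
#include <unistd.h>

/* Write 3 and then 4 to a fresh eventfd; return what a single read
 * yields (the accumulated counter), or 0 on error. */
uint64_t eventfd_sum_demo(void)
{
    uint64_t three = 3, four = 4, out = 0;
    const ssize_t n = sizeof(uint64_t);
    int efd = eventfd(0, 0);

    if (efd < 0)
        return 0;
    if (write(efd, &three, sizeof(three)) != n ||
        write(efd, &four, sizeof(four)) != n ||
        read(efd, &out, sizeof(out)) != n) /* read resets counter to 0 */
        out = 0;
    close(efd);
    return out;
}

/* Start a semaphore-mode eventfd at 2 and count how many non-blocking
 * reads succeed; each successful read returns 1 and decrements by 1. */
int eventfd_semaphore_demo(void)
{
    uint64_t out;
    int reads = 0;
    int efd = eventfd(2, EFD_SEMAPHORE | EFD_NONBLOCK);

    if (efd < 0)
        return -1;
    while (read(efd, &out, sizeof(out)) == (ssize_t)sizeof(out) && out == 1)
        reads++; /* loop ends when read fails with EAGAIN at zero */
    close(efd);
    return reads;
}
```

The first function should return 7 and the second 2. EFD_NONBLOCK is used in the semaphore demo only so that the final read fails instead of blocking.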
timerfd
I Allows handling timers as file descriptors
I Throw them in the poll loop with everything else
I Create with specified timeout
I read: Block until timeout; return number of times expired
I poll: Ready for reading if timeout passed
Signals
I Receive asynchronous events in a process
I Suspend execution, save registers, move execution to handler
I Restore registers and resume execution when handler done
I Assume a userspace stack to push and pop state
I sigaltstack sets an alternate stack to switch to
I Set up stack to return into call to sigreturn for cleanup
I Can receive signals while in a kernel syscall
I Some syscalls restart afterward
I Syscalls with timeouts adjust them (restart_syscall)
I Other syscalls return EINTR
I Can mask signals to avoid interruption
I Special syscalls that also set signal mask (ppoll, pselect, KVM_SET_SIGNAL_MASK ioctl)
I “async-signal-safe” library functions
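To make the EINTR case above concrete, here is a hedged sketch, assuming Linux and a handler installed without SA_RESTART; read_interrupted_errno and on_alarm are illustrative names. A blocking read on an empty pipe is interrupted by SIGALRM and fails with EINTR.

```c
#define _GNU_SOURCE
#include <errno.h>
#include <signal.h>
#include <string.h>
#include <sys/time.h>
#include <unistd.h>

static volatile sig_atomic_t got_alarm;

static void on_alarm(int signo)
{
    (void)signo;
    got_alarm = 1; /* only async-signal-safe work belongs here */
}

/* Block in read() on an empty pipe until SIGALRM interrupts it.
 * Returns the errno of the failed read, or -1 on setup failure. */
int read_interrupted_errno(void)
{
    int p[2], err = -1;
    char c;
    struct sigaction sa, old;
    /* Repeating timer: if one tick lands before read() starts,
     * the next one still interrupts it. */
    struct itimerval tick = {
        .it_interval = { .tv_sec = 0, .tv_usec = 10000 },
        .it_value = { .tv_sec = 0, .tv_usec = 10000 },
    };
    struct itimerval off = { { 0, 0 }, { 0, 0 } };

    if (pipe(p) < 0)
        return -1;

    memset(&sa, 0, sizeof(sa));
    sa.sa_handler = on_alarm;
    sigemptyset(&sa.sa_mask);
    sa.sa_flags = 0; /* no SA_RESTART: the syscall is not restarted */

    if (sigaction(SIGALRM, &sa, &old) == 0) {
        got_alarm = 0;
        setitimer(ITIMER_REAL, &tick, NULL);
        if (read(p[0], &c, 1) < 0)
            err = errno;
        setitimer(ITIMER_REAL, &off, NULL); /* disarm before restoring */
        sigaction(SIGALRM, &old, NULL);
    }
    close(p[0]);
    close(p[1]);
    return err;
}
```

With SA_RESTART set in sa_flags, the same read would be restarted after the handler returns and the call would keep blocking, which is the "some syscalls restart afterward" case.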
Signed-off-by: <(;,;)@r’lyeh>
signalfd
I File descriptor to receive a given set of signals
I Block “normal” signal delivery; receive via signalfd instead
I read: Block until signal, return struct signalfd_siginfo
I poll: Readable when signal received
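A short sketch of the signalfd flow above, assuming Linux and a single-threaded caller; signalfd_roundtrip is a hypothetical name. It blocks normal delivery of SIGUSR1, raises it, and reads the resulting struct signalfd_siginfo.

```c
#define _GNU_SOURCE
#include <signal.h>
#include <sys/signalfd.h>
#include <unistd.h>

/* Route SIGUSR1 to a signalfd, raise it, and return the signal number
 * read back from the descriptor, or -1 on error. */
int signalfd_roundtrip(void)
{
    sigset_t mask, old;
    struct signalfd_siginfo info;
    int result = -1;

    sigemptyset(&mask);
    sigaddset(&mask, SIGUSR1);

    /* Block "normal" delivery so the signal stays pending for the fd. */
    if (sigprocmask(SIG_BLOCK, &mask, &old) < 0)
        return -1;

    int sfd = signalfd(-1, &mask, SFD_CLOEXEC);
    if (sfd >= 0) {
        raise(SIGUSR1);
        /* Reading consumes the pending signal. */
        if (read(sfd, &info, sizeof(info)) == (ssize_t)sizeof(info))
            result = (int)info.ssi_signo;
        close(sfd);
    }
    sigprocmask(SIG_SETMASK, &old, NULL);
    return result;
}
```

In a real event loop the descriptor would be added to poll or epoll rather than read immediately.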
How do you build a new type of file descriptor?
Semantics
I read and write
  I Nothing
  I Raw data
  I Specific data structure
I poll/select/epoll
  I Must match read/write blocking behavior if any
  I Can have pollable fd even if read/write do nothing
I seek and file position
I mmap
I What happens with multiple processes, or dup?
I For everything else: ioctl
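The rule that poll must match read/write blocking behavior can be observed on an existing descriptor type. This is an illustrative userspace check, assuming Linux and using an eventfd as the example; poll_matches_read and readable_now are made-up names.

```c
#include <errno.h>
#include <poll.h>
#include <stdbool.h>
#include <stdint.h>
#include <sys/eventfd.h>
#include <unistd.h>

/* True if poll() reports fd as readable right now (zero timeout). */
static bool readable_now(int fd)
{
    struct pollfd pfd = { .fd = fd, .events = POLLIN };
    return poll(&pfd, 1, 0) == 1 && (pfd.revents & POLLIN);
}

/* Check both directions of the contract on a non-blocking eventfd:
 * not readable means read fails with EAGAIN; readable means read
 * succeeds. */
bool poll_matches_read(void)
{
    uint64_t v = 0;
    const ssize_t n = sizeof(uint64_t);
    int efd = eventfd(0, EFD_NONBLOCK);

    if (efd < 0)
        return false;

    /* Counter is zero: poll says "not ready", read would block. */
    bool empty_ok = !readable_now(efd) &&
                    read(efd, &v, sizeof(v)) < 0 && errno == EAGAIN;

    /* Counter is non-zero: poll says "ready", read returns data. */
    v = 1;
    bool full_ok = write(efd, &v, sizeof(v)) == n &&
                   readable_now(efd) &&
                   read(efd, &v, sizeof(v)) == n;

    close(efd);
    return empty_ok && full_ok;
}
```

A new descriptor type would be expected to pass the same kind of check for whatever data its read returns.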
Implementation
I anon_inode_getfd
  I Doesn’t need a backing inode or filesystem
  I Provide an ops structure and private data pointer
  I Private data points to your kernel object
I simple_read_from_buffer, simple_write_to_buffer
I no_llseek, fixed_size_llseek
I Check file->f_flags & O_NONBLOCK
  I Blocking: wait_queue_head
  I Non-blocking: return -EAGAIN
What interesting file descriptors don’t exist yet?
Child processes
I fork/clone
I Parent process gets the child PID
I Parent uses dedicated syscalls (waitpid) to wait for child exit
I When child exits, parent gets SIGCHLD signal
I Parent makes waitpid call to get exit status
Problems:
I Waiting not integrated with poll loops
Signals
I Process-global; libraries can’t manage only their own processes
Alternatives
I Set SIGCHLD handler, write to pipe or eventfd
  I Still process-global; gets all child exit notifications
  I Requires coordinating global signal handling between libraries
Signals
I signalfd for SIGCHLD
  I Still process-global; gets all child exit notifications
  I Requires coordinating global signal handling between libraries
  I Must block SIGCHLD; breaks code expecting SIGCHLD
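The first alternative above, a SIGCHLD handler that writes to a pipe, can be sketched roughly as follows. This assumes Linux; poll_for_child_exit, on_sigchld, and wake_fd are illustrative names. Note that the handler and the pipe are process-global state, which is exactly the drawback listed.

```c
#define _GNU_SOURCE
#include <errno.h>
#include <fcntl.h>
#include <poll.h>
#include <signal.h>
#include <string.h>
#include <sys/wait.h>
#include <unistd.h>

static int wake_fd = -1; /* write end of the pipe, used by the handler */

static void on_sigchld(int signo)
{
    int saved = errno;
    char b = 1;

    (void)signo;
    if (write(wake_fd, &b, 1) < 0) {
        /* pipe full: a wakeup is already pending, nothing to do */
    }
    errno = saved;
}

/* Fork a child that exits with code, wait for it inside poll(), and
 * return its exit status, or -1 on failure. */
int poll_for_child_exit(int code)
{
    int p[2], status = 0, result = -1;
    struct sigaction sa, old;

    if (pipe2(p, O_NONBLOCK | O_CLOEXEC) < 0)
        return -1;
    wake_fd = p[1];

    memset(&sa, 0, sizeof(sa));
    sa.sa_handler = on_sigchld;
    sa.sa_flags = SA_RESTART | SA_NOCLDSTOP;
    sigemptyset(&sa.sa_mask);
    sigaction(SIGCHLD, &sa, &old); /* affects the whole process */

    pid_t pid = fork();
    if (pid == 0)
        _exit(code);
    if (pid > 0) {
        struct pollfd pfd = { .fd = p[0], .events = POLLIN };

        /* poll() is never restarted after a handler runs, so retry. */
        while (poll(&pfd, 1, -1) < 0 && errno == EINTR)
            ;
        if (waitpid(pid, &status, 0) == pid && WIFEXITED(status))
            result = WEXITSTATUS(status);
    }

    sigaction(SIGCHLD, &old, NULL);
    close(p[0]);
    close(p[1]);
    wake_fd = -1;
    return result;
}
```

The pipe only says that some child changed state; the code still needs waitpid to learn which one and why, and any other library that installs its own SIGCHLD handler would silently break this one.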
clonefd
I New flag for clone
I Return a file descriptor for the child process
I read: block until child exits, return exit information
I poll: becomes readable when child exits
I Maintains a reference to the child’s task_struct
I Relatively simple, except. . .
Complications
I Need a new clone system call for the fd out parameter
I clone syscall parameters vary by architecture
  I Avoided in the new syscall
I clone is out of parameters (6) on some architectures
  I Pass parameters via a struct and size
I Low-level copy_thread function grabbed tls parameter directly from syscall register arguments; couldn’t move it
  I Pass parameter normally via C, fix assembly syscall entry
  I Fixed with copy_thread_tls (merged in 4.2)
I ptrace and reparenting
  I Work in progress
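The "pass parameters via a struct and size" idea can be sketched in plain C. Everything here is hypothetical: struct demo_args, struct demo_args_v1, and copy_args are invented for illustration and are not the layout or code of the proposed syscall. The size argument lets old and new callers share one entry point.

```c
#include <errno.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical first version of the argument block. */
struct demo_args_v1 {
    uint64_t flags;
};

/* Hypothetical current version: one field was appended later. */
struct demo_args {
    uint64_t flags;
    uint64_t fd_out_ptr;
};

/* Copy a caller-supplied block of size bytes into *dst.
 * Shorter (older) blocks: missing fields are zero-filled.
 * Longer (newer) blocks: accepted only if every byte this callee
 * does not understand is zero. */
int copy_args(struct demo_args *dst, const void *src, size_t size)
{
    const unsigned char *bytes = src;
    const size_t known = sizeof(*dst);

    if (size < sizeof(struct demo_args_v1))
        return -EINVAL;

    for (size_t i = known; i < size; i++)
        if (bytes[i] != 0)
            return -E2BIG;

    memset(dst, 0, known);
    memcpy(dst, src, size < known ? size : known);
    return 0;
}
```

A caller built against the older struct passes sizeof(struct demo_args_v1) and gets fd_out_ptr == 0, so new fields need a zero value that means "not requested".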
History and status
I Thiago Macieira originally proposed forkfd to simplify Qt
I Josh and Thiago started on clonefd earlier this year
I Some infrastructure merged into 4.2
I Syscall aimed for future kernel after resolving ptrace issues
File descriptor: Userspace reference to kernel object
What else can we do with a reference to task_struct?
Process IDs
I Small integers used to reference processes
I Used pervasively in process syscalls
I Enumerated as directories in /proc
I Unique within root container
I Container PID namespaces map a subset of these
I PIDs do not hold a reference; can be reused
I Race condition if used from non-parent process
clonefd as process identifier
I Unique across the entire system
I Holds a reference to the process
I Race-free
I Can pass via exec, UNIX sockets
I Allows non-parent processes to obtain exit information
Next steps
I Merge clonefd
I For each PID syscall, add an fd variant
I Add ioctls to obtain process information
I Add process enumeration (next, child, root)
Other future file descriptors
Warning: wild speculation and conjecture ahead
User and group IDs
I Suppose users and groups were unique kernel objects?
I Unique across container user namespaces
I “Get unused user/group”
I Set up arbitrary mappings when mounting a filesystem
I Allow a process to hold multiple credentials (like setgroups)
Filesystem mounts
I Suppose mount returned a directory file descriptor
I openat relative to the filesystem
I Separate call to bind into the filesystem namespace
I Bind existing dirfd for bind mounts
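The openat half of this idea already works today with any directory descriptor. The sketch below uses one opened from an ordinary path as a stand-in for the speculative "descriptor returned by mount". The helper names write_relative and read_relative are invented for the example.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <fcntl.h>
#include <string.h>
#include <sys/types.h>
#include <unistd.h>

/* Create or truncate `name` relative to dirfd and write len bytes to it. */
int write_relative(int dirfd, const char *name, const char *data, size_t len)
{
    assert(name != NULL);

    int fd = openat(dirfd, name, O_WRONLY | O_CREAT | O_TRUNC, 0600);
    if (fd < 0)
        return -1;

    ssize_t n = write(fd, data, len);
    close(fd);
    return n == (ssize_t)len ? 0 : -1;
}

/* Read up to cap bytes from `name` relative to dirfd. Returns bytes read or -1. */
ssize_t read_relative(int dirfd, const char *name, char *buf, size_t cap)
{
    assert(name != NULL);

    int fd = openat(dirfd, name, O_RDONLY);
    if (fd < 0)
        return -1;

    ssize_t n = read(fd, buf, cap);
    close(fd);
    return n;
}
```

Lookups here resolve against the directory the descriptor refers to, not against a path string. If the directory is renamed or the mount namespace changes after the descriptor is opened, the descriptor keeps pointing at the same directory.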
Summary
I File descriptor: Userspace reference to kernel object
I Reference-counted, race-free, unambiguous ID
I Well-defined semantics
I Extensive operations
I poll and blocking
I Use file descriptors in new APIs
I Don’t invent new identifier namespaces