Page 1: An Introduction to the Linux Kernel Block I/O Stack ...

An Introduction to the Linux Kernel Block I/O Stack
Based on Linux 5.11

Benjamin Block ‹[email protected]›, March 14th, 2021

IBM Deutschland Research & Development GmbH

Page 2: An Introduction to the Linux Kernel Block I/O Stack ...

Trademark Attribution Statement

The following are trademarks of the International Business Machines Corporation in the United States, other countries, or both.

Not all common law marks used by IBM are listed on this page. Because of the large number of products marketed by IBM, IBM's practice is to list only the most important of its common law marks. Failure of a mark to appear on this page does not mean that IBM does not use the mark, nor does it mean that the product is not actively marketed or is not significant within its relevant market.

A current list of IBM trademarks is available on the Web at “Copyright and trademark information”: https://www.ibm.com/legal/copytrade.

IBM®, the IBM® logo, ibm.com®, AIX®, CICS®, Db2®, DB2®, developerWorks®, DS8000®, eServer™, Fiberlink®, FICON®, FlashCopy®, GDPS®, HyperSwap®, IBM Elastic Storage®, IBM FlashCore®, IBM FlashSystem®, IBM Plex®, IBM Spectrum®, IBM Z®, IBM z Systems®, IBM z13®, IBM z13s®, IBM z14®, OS/390®, Parallel Sysplex®, Power®, POWER®, POWER8®, POWER9™, Power Architecture®, PowerVM®, RACF®, RED BOOK®, Redbooks®, S390-Tools®, S/390®, Storwize®, System z®, System z9®, System z10®, System/390®, WebSphere®, XIV®, z Systems®, z9®, z13®, z13s®, z15™, z/Architecture®, z/OS®, z/VM®, z/VSE®, and zPDT® are trademarks or registered trademarks of International Business Machines Corp., registered in many jurisdictions worldwide. Other product and service names might be trademarks of IBM or other companies.

The following are trademarks or registered trademarks of other companies.

UNIX is a registered trademark of The Open Group in the United States and other countries. The registered trademark Linux® is used pursuant to a sublicense from the Linux Foundation, the exclusive licensee of Linus Torvalds, owner of the mark on a worldwide basis. Red Hat®, JBoss®, OpenShift®, Fedora®, Hibernate®, Ansible®, CloudForms®, RHCA®, RHCE®, RHCSA®, Ceph®, and Gluster® are trademarks or registered trademarks of Red Hat, Inc. or its subsidiaries in the United States and other countries.

All other products may be trademarks or registered trademarks of their respective companies.

Note:

Performance is in Internal Throughput Rate (ITR) ratio based on measurements and projections using standard IBM benchmarks in a controlled environment. The actual throughput that any user will experience will vary depending upon considerations such as the amount of multiprogramming in the user's job stream, the I/O configuration, the storage configuration, and the workload processed. Therefore, no assurance can be given that an individual user will achieve throughput improvements equivalent to the performance ratios stated here.

IBM hardware products are manufactured from new parts, or new and serviceable used parts. Regardless, our warranty terms apply.

All customer examples cited or described in this presentation are presented as illustrations of the manner in which some customers have used IBM products and the results they may have achieved. Actual environmental costs and performance characteristics will vary depending on individual customer configurations and conditions.

All statements regarding IBM's future direction and intent are subject to change or withdrawal without notice, and represent goals and objectives only.

Information about non-IBM products is obtained from the manufacturers of those products or their published announcements. IBM has not tested those products and cannot confirm the performance, compatibility, or any other claims related to non-IBM products. Questions on the capabilities of non-IBM products should be addressed to the suppliers of those products.

Page 3: An Introduction to the Linux Kernel Block I/O Stack ...

Outline

What is a Block Device?

Anatomy of a Block Device

I/O Flow in the Block Layer

Page 4: An Introduction to the Linux Kernel Block I/O Stack ...

What is a Block Device?

Page 5: An Introduction to the Linux Kernel Block I/O Stack ...

A Try at a Definition

In Linux, a Block Device is a hardware abstraction. It represents hardware whose data is stored and accessed in fixed-size blocks of n bytes (e.g. 512, 2048, or 4096 bytes) [18].

In contrast to Character Devices, blocks on block devices can be accessed in a random-access pattern, whereas the former only allow a sequential access pattern [19].

Typically, and for this talk, block devices represent persistent mass storage hardware.

But not all block devices in Linux are backed by persistent storage (e.g. RAM disks whose data is stored in memory), nor must all of them organize their data in fixed blocks (e.g. ECKD-formatted DASDs whose data is stored in variable-length records). Even so, they can be represented as such in Linux, because of the abstraction provided by the kernel.

© Copyright IBM Corp. 2021 2

Page 6: An Introduction to the Linux Kernel Block I/O Stack ...

What is a ‘Block’ Anyway?

A Block is a fixed number of bytes that is used in the communication with a block device and the associated hardware. But different layers in the software stack differ in the exact meaning and size:

• Userspace Software: application-specific meaning; usually how much data is read from/written to files via a single system call.

• VFS: unit of bytes in which I/O is done by file systems in Linux. Between 512 bytes and PAGE_SIZE (e.g. 4 KiB for x86 and s390x; may be as big as 1 MiB).

• Hardware: also referred to as Sector.
  • Logical: smallest unit in bytes that is addressable on the device.
  • Physical: smallest unit in bytes that the device can operate on without resorting to read-modify-write. Physical may be bigger than the Logical block size (see the sketch below).
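To make the Logical/Physical distinction concrete, a userspace program can query both values of an open block device with the BLKSSZGET and BLKPBSZGET ioctls. A minimal sketch; the device path is just an example:

#include <fcntl.h>
#include <linux/fs.h>      /* BLKSSZGET, BLKPBSZGET */
#include <stdio.h>
#include <sys/ioctl.h>
#include <unistd.h>

int main(void)
{
    int fd = open("/dev/sda", O_RDONLY);    /* example device */
    if (fd < 0) {
        perror("open");
        return 1;
    }

    int logical = 0, physical = 0;
    ioctl(fd, BLKSSZGET, &logical);         /* logical sector size in bytes */
    ioctl(fd, BLKPBSZGET, &physical);       /* physical sector size in bytes */
    printf("logical: %d, physical: %d\n", logical, physical);

    close(fd);
    return 0;
}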

© Copyright IBM Corp. 2021 3

Page 7: An Introduction to the Linux Kernel Block I/O Stack ...

Using Block Devices in Linux (Examples)

Listing available block devices:
# ls /sys/class/block
dasda  dasda1  dasda2  dm-0  dm-1  dm-2  dm-3  scma  sda  sdb

Reading from a block device:
# dd if=/dev/sdf of=/dev/null bs=2MiB
10737418240 bytes (11 GB, 10 GiB) copied

Listing the topology of a stacked block device:
# lsblk -s /dev/mapper/rhel_t3545003-root
NAME               MAJ:MIN RM SIZE RO TYPE  MOUNTPOINT
rhel_t3545003-root 253:11   0   9G  0 lvm   /
`-mpathf2          253:8    0   9G  0 part
  `-mpathf         253:5    0  10G  0 mpath
    |-sdf            8:80   0  10G  0 disk
    `-sdam          66:96   0  10G  0 disk

© Copyright IBM Corp. 2021 4

Page 8: An Introduction to the Linux Kernel Block I/O Stack ...

Some More Examples for Block Devices

[Figure: four example setups. A local NVMe disk appears as /dev/nvme0n1 via the kernel's nvme driver; a remote iSCSI disk appears as /dev/sda via the sd and iscsi_tcp drivers; a virtualized disk appears in a guest kernel as /dev/vda via virtio_blk; a RAM disk appears as /dev/ram0 via the brd driver in the host kernel. Backends shown include RAM and RAID.]

Figure 1: Examples for block device setups with different hardware backends [14].

© Copyright IBM Corp. 2021 5

Page 9: An Introduction to the Linux Kernel Block I/O Stack ...

Anatomy of a Block Device

Page 10: An Introduction to the Linux Kernel Block I/O Stack ...

Structure of a Block Device: User Facing

block_device Userspace interface; represents the special file in /dev, and links to other kernel objects for the block device [4, 16]; partitions point to the same disk and queue as the whole device.

inode Each block device gets a virtual inode assigned, so it can be used in the VFS.

file and address_space Userspace processes open the special file in /dev; in the kernel it is represented as a file with an assigned address_space that points to the block device inode.

[Figure: the special files /dev/sda and /dev/sda1 are each represented by a block_device (bd_partno = 0/1, bd_dev = devt, bd_start_sect = 0/32) with a virtual inode (i_mode = S_IFBLK, i_rdev = devt, i_size); both point via bd_disk and bd_queue to the same gendisk and request_queue. A userspace file with its address_space references the block device inode via i_bdev.]
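Seen from userspace, the entry point is exactly that special file. A small hedged sketch (the device path is an example) that opens it and inspects the S_IFBLK mode bit and device number of the underlying inode:

#include <fcntl.h>
#include <stdio.h>
#include <sys/stat.h>
#include <sys/sysmacros.h>   /* major(), minor() */
#include <unistd.h>

int main(void)
{
    int fd = open("/dev/sda", O_RDONLY);   /* special file in /dev */
    if (fd < 0) {
        perror("open");
        return 1;
    }

    struct stat st;
    fstat(fd, &st);
    if (S_ISBLK(st.st_mode))               /* i_mode = S_IFBLK */
        printf("block device %u:%u\n",
               major(st.st_rdev), minor(st.st_rdev));   /* i_rdev = devt */

    close(fd);
    return 0;
}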

© Copyright IBM Corp. 2021 6

Page 11: An Introduction to the Linux Kernel Block I/O Stack ...

Structure of a Block Device: Hardware Facing

[Figure: the gendisk holds the disk_part_tbl (len = N, part[len] pointing to N block_devices, plus part0), the fops pointer to the block_device_operations (*submit_bio(*bio), *open(*bdev, mode)), and the queue pointer to the request_queue (with tag_set, queue_limits, queue_ctx[C], queue_hw_ctx[N]); queuedata and parent link to the scsi_disk and scsi_device driver structures.]

gendisk and request_queue Central part of any block device; abstract hardware details for higher layers; the gendisk represents the whole addressable space; the request_queue describes how requests can be served.

disk_part_tbl Points to partitions — represented as block_devices — backed by the gendisk.

scsi_device and scsi_disk Device drivers; provide the common/mid layer for all SCSI-like hardware (incl. Serial ATA, SAS, iSCSI, FCP, …).

© Copyright IBM Corp. 2021 7

Page 12: An Introduction to the Linux Kernel Block I/O Stack ...

Queue Limits

• Attached to the Request Queue structure of a block device
• Abstract hardware, firmware, and device driver properties that influence how requests must be laid out
• Very important for stacked block devices
• For example (see the sketch after this list):

logical_block_size Smallest possible unit in bytes that is addressable in a request.
physical_block_size Smallest unit in bytes handled without read-modify-write.
max_hw_sectors Number of sectors (512 bytes) that a device can handle per request.
io_opt Preferred size in bytes for requests to the device.
max_sectors Soft limit used by the VFS for buffered I/O (can be changed).
max_segment_size Maximum size a segment in a request's scatter/gather list can have.
max_segments Maximum number of scatter/gather elements in a request.
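For a sense of how these limits get onto the queue: device drivers set them during probing with helper functions from the kernel's block/blk-settings.c. A hedged sketch; the helpers are real kernel API, while the function and the values are invented for illustration:

#include <linux/blkdev.h>

/* Hypothetical driver probe path filling in the queue limits. */
static void example_set_limits(struct request_queue *q)
{
    blk_queue_logical_block_size(q, 512);    /* smallest addressable unit */
    blk_queue_physical_block_size(q, 4096);  /* no read-modify-write below this */
    blk_queue_max_hw_sectors(q, 2048);       /* 2048 * 512 B = 1 MiB per request */
    blk_queue_max_segments(q, 128);          /* scatter/gather elements per request */
    blk_queue_max_segment_size(q, 65536);    /* bytes per scatter/gather element */
    blk_queue_io_opt(q, 1048576);            /* preferred request size */
}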

© Copyright IBM Corp. 2021 8

Page 13: An Introduction to the Linux Kernel Block I/O Stack ...

Multi-Queue Request Queues

• In the past, request queues in Linux worked single-threaded and without associating requests with particular processors

• Couldn't properly exploit many-core systems and new storage hardware with more than one command queue (e.g. NVMe); lots of cache thrashing

• Explicit Multi-Queue (MQ) support [3] was added with Linux 3.13: blk-mq

• I/O requests are scheduled on a hardware queue assigned to the I/O-generating processor; responses are meant to be received on the same processor

• Structures necessary for I/O submission and response-handling are kept per processor; as little shared state as possible

• With Linux 5.0 the old single-threaded queue implementation was removed

© Copyright IBM Corp. 2021 9

Page 14: An Introduction to the Linux Kernel Block I/O Stack ...

Block MQ Tag Set: Hardware Resource Allocation

• Per-hardware-queue resource allocation and management

• Requests (Reqs) are pre-allocated per HW queue (blk_mq_tags)

• Tags are an index into the request array per queue, or per tag set (new in 5.10 [17])

• Allocation of tags is handled via a special data structure: sbitmap [20]

• The tag set also provides the mapping between CPU and hardware queue; the objective is either a 1:1 mapping, or cache proximity (see the sketch after the figure)

[Figure: a blk_mq_tag_set whose tags[N] array points to one blk_mq_tags per HW queue (N = 4), each holding Tags[M] and Reqs[M] arrays (M = 8); its map assigns the C = 8 CPU IDs 0 ... 7 to the HW queues 0 ... 3, two CPUs per queue, down to the controller (Ctrl).]
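A hedged sketch of how a blk-mq driver would allocate the tag set shown above; struct blk_mq_tag_set and blk_mq_alloc_tag_set() are real kernel API, while the driver names and the sizes (matching the figure's N = 4, M = 8) are invented:

#include <linux/blk-mq.h>

static struct blk_mq_tag_set example_set;     /* hypothetical driver's tag set */

static int example_init_tag_set(const struct blk_mq_ops *ops)
{
    example_set.ops = ops;                 /* ->queue_rq(), ->complete(), ... */
    example_set.nr_hw_queues = 4;          /* N: one per HW queue */
    example_set.queue_depth = 8;           /* M: tags/requests per queue */
    example_set.cmd_size = 0;              /* extra per-request driver payload */
    example_set.numa_node = NUMA_NO_NODE;

    /* Pre-allocates the blk_mq_tags (sbitmap plus request arrays) per
     * HW queue and builds the CPU-to-queue map. */
    return blk_mq_alloc_tag_set(&example_set);
}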

© Copyright IBM Corp. 2021 10

Page 15: An Introduction to the Linux Kernel Block I/O Stack ...

Block MQ Soft- and Hardware-Context

• For a request queue (see the pointers on page 7)

• A Hardware context (blk_mq_hw_ctx = hctx) exists per hardware queue; hosts a work item (kblockd work queue) scheduled on a matching CPU; pulls requests out of the associated ctx and submits them to the hardware

• A Software context (blk_mq_ctx = ctx) exists per CPU; queues requests in a simple FIFO in the absence of an Elevator; associated with its assigned HCTX as per the tag set mapping

[Figure: eight CPU cores (IDs 0 ... 7), grouped pairwise into NUMA domains, each with their own software context (ctx) holding a list of queued requests (rqL); each pair of ctxs is assigned to one of four hardware contexts (blk_mq_hw_ctx/hctx), each with its *work() item and its tags (blk_mq_tags); the work item is scheduled on a CPU belonging to its hctx.]

© Copyright IBM Corp. 2021 11

Page 16: An Introduction to the Linux Kernel Block I/O Stack ...

Block MQ Elevator / Scheduler

• Elevator = I/O Scheduler
• Can be set optionally per request queue (/sys/class/block/<name>/queue/scheduler)

mq-deadline Forward port of the old deadline scheduler; doesn't handle MQ context affinities; default for devices with 1 hardware queue; limits the wait-time for requests to prevent starvation (500 ms for reads, 5 s for writes) [10, 18]

kyber Only MQ-native scheduler [10]; aims to meet certain latency targets (2 ms for reads, 10 ms for writes) by limiting the queue depth dynamically

bfq Only non-trivial I/O scheduler [6, 8, 9, 11] (replaces the old CFQ scheduler); doesn't handle MQ context affinities; aims at providing fairness between I/O-issuing processes

none Default for devices with more than 1 hardware queue; simple FIFO via the MQ software contexts

© Copyright IBM Corp. 2021 12

Page 17: An Introduction to the Linux Kernel Block I/O Stack ...

What About Stacked Block Devices?

• Device-Mapper (dm) and Raid (md) use a virtual/stacked block device on top of existing hardware-backed block devices ([15, 21])

• Examples: RAID, LVM2, Multipathing

• Same structure as shown on pages 6 and 7, but without the hardware-specific structures; stacked on other block_devices

• BIO based: have no Elevator, no own tag set, nor any soft- or hardware-contexts; modify I/O (BIO) after submission and immediately pass it on

• Request based: have the full set of infrastructure (only dm-multipath atm.); can queue requests; bypass lower-level queueing

• queue_limits of the lower-level devices are aggregated into the "greatest common divisor", so that requests can be scheduled on any of them (see the sketch after this list)

• holders/slaves directories in sysfs show the relationship
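A hedged illustration of that aggregation; the real implementation is blk_stack_limits() in block/blk-settings.c, and this sketch only mimics the flavor of its combination rules for a few fields:

#include <linux/blkdev.h>
#include <linux/lcm.h>

/* Simplified: fold the limits of a lower device (b) into the limits
 * of the stacked top-level device (t). */
static void example_stack_limits(struct queue_limits *t,
                                 const struct queue_limits *b)
{
    t->logical_block_size  = max(t->logical_block_size, b->logical_block_size);
    t->physical_block_size = max(t->physical_block_size, b->physical_block_size);
    t->max_sectors         = min(t->max_sectors, b->max_sectors);
    t->max_segments        = min(t->max_segments, b->max_segments);
    t->io_opt              = lcm(t->io_opt, b->io_opt);  /* preferred I/O size */
}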

© Copyright IBM Corp. 2021 13

Page 18: An Introduction to the Linux Kernel Block I/O Stack ...

I/O Flow in the Block Layer

Page 19: An Introduction to the Linux Kernel Block I/O Stack ...

Submission of I/O Requests from Userspace (Simplified)

• I/O submission is mainly categorized into 2 disciplines:

Buffered I/O Requests are served via the Page Cache; writes are cached and eventually — usually asynchronously — written to disk via Writeback; reads are served directly if fresh, otherwise read from disk synchronously

Direct I/O Requests are served directly by the backing disk; alignment and possibly size requirements; DMA directly into/from user memory is possible (see the sketch after the figure)

• For synchronous I/O system calls, tasks wait in state TASK_UNINTERRUPTIBLE

[Figure: userspace reads and writes enter the kernel through the VFS and the file systems. Buffered I/O goes through the Page Cache, with Writeback and Read(-ahead) performing the actual disk accesses; Direct I/O bypasses the Page Cache. Both paths end in submit_bio(bio), going through the block_device and gendisk fops to the request_queue of the block layer.]
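To make the Direct I/O alignment requirement concrete, a hedged userspace sketch (device path and sizes are examples); buffer address, length, and file offset are aligned to the block size, as O_DIRECT demands:

#define _GNU_SOURCE            /* O_DIRECT */
#include <fcntl.h>
#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>

int main(void)
{
    int fd = open("/dev/sdf", O_RDONLY | O_DIRECT);   /* example device */
    if (fd < 0) {
        perror("open");
        return 1;
    }

    void *buf;
    /* 4096 covers most logical block sizes; query BLKSSZGET to be exact. */
    if (posix_memalign(&buf, 4096, 4096) != 0)
        return 1;

    ssize_t n = pread(fd, buf, 4096, 0);   /* DMA directly into buf */
    printf("read %zd bytes\n", n);

    free(buf);
    close(fd);
    return 0;
}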

© Copyright IBM Corp. 2021 14

Page 20: An Introduction to the Linux Kernel Block I/O Stack ...

A New Asynchronous I/O Interface: io_uring

• With Linux 5.1 a new I/O submission API has been added: io_uring [12, 2]

• A new set of system calls creates a set of ring structures: Submission Queue (SQ), Completion Queue (CQ), and a Submission Queue Entries (SQE) array

• The structures are shared between kernel and user via mmap(2)

• Submission and completion work asynchronously

• Utilizes the standard syscall backends for calls like readv(2), writev(2), or fsync(2); with the same categories as on page 14 (see the sketch after the figure)

[Figure: the SQ, CQ, and SQE array are mmap'ed into userspace. 1. Userspace fills SQEs and 2. advances the SQ tail; the kernel consumes entries from the SQ head and feeds them into the existing syscall paths; completions are posted as CQEs at the CQ tail, matched per id in the CQE, and 3. userspace advances the CQ head.]
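A minimal userspace sketch of the flow above, using the liburing helper library (not part of the slide; file name and sizes are examples), submitting one read and waiting for its completion:

#include <fcntl.h>
#include <liburing.h>
#include <stdio.h>
#include <unistd.h>

int main(void)
{
    struct io_uring ring;
    io_uring_queue_init(8, &ring, 0);            /* mmap'ed SQ/CQ, 8 entries */

    int fd = open("/etc/hostname", O_RDONLY);    /* example file */
    char buf[256];

    struct io_uring_sqe *sqe = io_uring_get_sqe(&ring);
    io_uring_prep_read(sqe, fd, buf, sizeof(buf), 0);   /* 1. fill SQE */
    io_uring_submit(&ring);                             /* 2. advance SQ tail */

    struct io_uring_cqe *cqe;
    io_uring_wait_cqe(&ring, &cqe);              /* completion arrives as CQE */
    printf("read returned %d\n", cqe->res);
    io_uring_cqe_seen(&ring, cqe);               /* 3. advance CQ head */

    close(fd);
    io_uring_queue_exit(&ring);
    return 0;
}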

© Copyright IBM Corp. 2021 15

Page 21: An Introduction to the Linux Kernel Block I/O Stack ...

The I/O Unit of the Block Layer: BIO

[Figure: a bio with bi_opf : req_opf, bi_next, bi_disk, bi_partno = N, *bi_end_io(), bi_iter (a bvec_iter with bi_sector, bi_size, bi_idx, bi_bvec_done), and bi_io_vec[] pointing to an array of 1 ... 256 bio_vecs; each bio_vec (bv_page, bv_len, bv_offset) references one of the pinned data pages. The pages need not be physically contiguous, but the data they carry maps to a contiguous LBA range.]

• BIOs represent in-flight I/O

• Application data is kept separate; an array of bio_vecs holds pointers to the pages with the application data (scatter/gather list)

• Position and progress are managed in the bvec_iter:
bi_sector start sector
bi_size size in bytes
bi_idx current bvec index
bi_bvec_done finished work in bytes

• BIOs might be split when queue limits (see page 8) are exceeded; or cloned when the same data goes to different places

• Data of a single BIO is limited to 4 GiB

© Copyright IBM Corp. 2021 16
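For reference, a heavily abridged sketch of the structures named above, following their layout in Linux 5.11 (include/linux/blk_types.h); most fields are omitted:

/* Abridged; most fields omitted. */
struct bio_vec {
    struct page      *bv_page;      /* pinned data page */
    unsigned int      bv_len;       /* bytes used in this page */
    unsigned int      bv_offset;    /* start offset within the page */
};

struct bvec_iter {
    sector_t          bi_sector;    /* start sector (512 B units) */
    unsigned int      bi_size;      /* remaining I/O size in bytes */
    unsigned int      bi_idx;       /* current index into bi_io_vec[] */
    unsigned int      bi_bvec_done; /* finished bytes in the current bvec */
};

struct bio {
    struct bio       *bi_next;      /* chained BIOs, e.g. after merging */
    struct gendisk   *bi_disk;      /* target disk */
    unsigned int      bi_opf;       /* req_opf and flags */
    struct bvec_iter  bi_iter;      /* position and progress */
    bio_end_io_t     *bi_end_io;    /* completion callback */
    struct bio_vec   *bi_io_vec;    /* scatter/gather list */
    /* ... */
};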

Page 22: An Introduction to the Linux Kernel Block I/O Stack ...

Plugging and Merging

Plugging:

• When the VFS layer generates I/O requests and submits them for processing, it plugs the request queue of the target block device ([1]; see the sketch after this list)

• Requests generated while the plug is active are not immediately submitted, but saved until unplugging

• Unplugging happens either explicitly, or during scheduled context switches

Merging:

• The block layer tries to merge BIOs and requests with already queued or plugged requests
Back-Merging: the new data fits to the end of an existing request
Front-Merging: the new data fits to the beginning of an existing request

• Merging is done by concatenating BIOs via bi_next

• Merges must not exceed the queue limits
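In-kernel submitters drive plugging with the blk_start_plug()/blk_finish_plug() pair; these are real kernel APIs, while the surrounding function here is a hedged, invented example:

#include <linux/blkdev.h>

/* Hypothetical submission path batching several BIOs under one plug. */
static void example_submit_batch(struct bio **bios, int count)
{
    struct blk_plug plug;
    int i;

    blk_start_plug(&plug);         /* hold back requests; allow merging */
    for (i = 0; i < count; i++)
        submit_bio(bios[i]);       /* collects on the plug list */
    blk_finish_plug(&plug);        /* unplug: flush the batch downwards */
}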

© Copyright IBM Corp. 2021 17

Page 23: An Introduction to the Linux Kernel Block I/O Stack ...

Entry Function into the Block Layer

submit_bio(bio) {
    // current.bios is thread-local

    if (exists(current.bios)) {
        // in recursion: queue the BIO for the loop below and unwind
        current.bios += bio
        return
    }

    On Stack: bios = List(), old_bios
    current.bios ← bios

    do {
        On Stack: q = queue(bio)
        On Stack: same = List(), lower = List()

        old_bios ← current.bios        // save BIOs not yet processed
        current.bios = List()
        disk(bio)->submit_bio(bio)     // may submit new BIOs into current.bios

        while (bio ← pop(current.bios)) {
            if (q == queue(bio))
                same += bio
            else
                lower += bio
        }

        current.bios += lower
        current.bios += same
        current.bios += old_bios
    } while (bio = pop(current.bios))

    current.bios ← Null
    return
}

thread local

• When I/O is necessary, BIOs are generated and submitted into the block layer via submit_bio() ([4]).

• Doesn't guarantee synchronous processing; callback via bio->bi_end_io(bio).

• One submitted BIO can turn into several more (block queue limits, stacked devices, …); each is also submitted via submit_bio().

• Especially for stacked devices this could exhaust the kernel stack space → turn the recursion into iteration (approx.: depth-first search with a stack)

(The listing above is a pseudo-code representation of this functionality.)

© Copyright IBM Corp. 2021 18

Page 24: An Introduction to the Linux Kernel Block I/O Stack ...

Request Submission and Dispatch

• Once a BIO reaches a request queue (see pages 12 and 11), the submitting task tries to get a Tag from the associated HCTX

• Creates back pressure if that is not possible

• The BIO is added to the associated Request; via linking, this could be multiple BIOs per request

• The Request is inserted into the software context FIFO queue, or the Elevator (if enabled); the HCTX work-item is queued into kblockd

• The associated CPU executes the HCTX work-item

• The work-item pulls queued requests out of the associated software contexts or Elevators and hands them to the HW device driver

[Figure: preemptable kernel threads insert requests into the CTXs of different request queues; a kblockd work item, running on the matching CPU (e.g. CPU 0), pulls them out of the CTXs via the HCTXs and pushes them through the HW device driver into the HW queue of the controller (Ctrl).]

© Copyright IBM Corp. 2021 19

Page 25: An Introduction to the Linux Kernel Block I/O Stack ...

Request Completion

[Figure: the controller (Ctrl) raises an IRQ on a CPU (e.g. CPU 0); the device driver's IRQ handler finds the request and calls blk_mq_complete_request(rq), redirecting the completion with an IPI/SoftIRQ to the submitting CPU if necessary; there q→mq_ops→complete(rq) runs, bio→bi_end_io(bio) is called for each associated BIO, and the waiter is notified.]

• Once a request is processed, device drivers usually get notified via an interrupt

• To keep data CPU-local, interrupts should be bound to the CPU of the associated MQ software context (platform-/controller-/driver-dependent)

• blk-mq resorts to an IPI or SoftIRQ otherwise

• The device driver is responsible for determining the corresponding block layer request for the signaled completion

• Progress is measured by how many bytes were successfully completed; might cause a re-schedule

• The process of completing a request is a bunch of callbacks to notify the waiting user-/kernel-thread

© Copyright IBM Corp. 2021 20

Page 26: An Introduction to the Linux Kernel Block I/O Stack ...

Block Layer Polling

• Similar to high-speed networking, with high-speed storage targets it can be beneficial to use Polling instead of Waiting for Interrupts to handle request completion

• Decreases response times and reduces the overhead produced by interrupts on fast devices

• Available in Linux since 4.10 [7]; only supported by NVMe at this point (support for SCSI merged for 5.13 [13], support for dm in work [26])

• Enable per request queue: echo 1 > /sys/class/block/<name>/queue/io_poll
• Enable for NVMe with the module parameter: nvme.poll_queues=N

• The device driver creates separate HW queues that have interrupts disabled

• Whether polling is used is controlled by applications (only with Direct I/O currently):
  • Pass RWF_HIPRI to readv(2)/writev(2) (see the sketch below)
  • Pass IORING_SETUP_IOPOLL to io_uring_setup(2) for io_uring

• When used, application threads that issued I/O, or io_uring worker threads, actively poll the HW queues to check whether the issued request has been completed
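A hedged sketch of the RWF_HIPRI path (example device; assumes a polling-capable queue, and Direct I/O with its alignment rules as on page 14):

#define _GNU_SOURCE             /* preadv2(), RWF_HIPRI, O_DIRECT */
#include <fcntl.h>
#include <stdio.h>
#include <stdlib.h>
#include <sys/uio.h>

int main(void)
{
    int fd = open("/dev/nvme0n1", O_RDONLY | O_DIRECT);   /* example device */
    if (fd < 0) {
        perror("open");
        return 1;
    }

    void *buf;
    if (posix_memalign(&buf, 4096, 4096) != 0)   /* O_DIRECT alignment */
        return 1;

    struct iovec iov = { .iov_base = buf, .iov_len = 4096 };
    /* RWF_HIPRI: the calling thread polls the HW queue for completion. */
    ssize_t n = preadv2(fd, &iov, 1, 0, RWF_HIPRI);
    printf("read %zd bytes\n", n);

    free(buf);
    return 0;
}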

© Copyright IBM Corp. 2021 21

Page 27: An Introduction to the Linux Kernel Block I/O Stack ...

Closing

Page 28: An Introduction to the Linux Kernel Block I/O Stack ...

Summary

• Block devices are a hardware abstraction to allow uniform access to a range of diverse hardware

• The entry into the block layer is provided by block_device device-nodes, which are backed by a gendisk — representing the block-addressable storage space — and a request_queue — providing a generic way to queue requests against the hardware

• Special care is taken to allow processor-local processing of I/O requests and responses

• Userspace requests are mainly categorized into Buffered and Direct I/O

• The central structure for transporting information about in-flight I/O is the BIO; it allows for cloning, splitting, and merging without copying the payload

• Processing of I/O is fundamentally asynchronous in the kernel; requests happen in a different context than responses, and are only synchronized via wait/notify mechanisms

© Copyright IBM Corp. 2021 22

Page 29: An Introduction to the Linux Kernel Block I/O Stack ...

Made with LaTeX and Inkscape

Questions?

Page 30: An Introduction to the Linux Kernel Block I/O Stack ...

IBM Deutschland Research & Development

Headquarters Böblingen

• Big parts of the support for Linux on IBM Z — kernel and userland — are done at the IBM laboratory in Böblingen

• We follow a strict upstream policy and do not — with scarce exceptions — ship code that is not accepted in the respective upstream project

• Parts of the hard- and firmware for the IBM mainframes are also done in Böblingen

• https://www.ibm.com/de-de/marketing/entwicklung/about.html

© Copyright IBM Corp. 2021 23

Page 31: An Introduction to the Linux Kernel Block I/O Stack ...

Backup Slides

Page 32: An Introduction to the Linux Kernel Block I/O Stack ...

But What About Zoned Storage?

• Introduced with the advent of SMR disks, but now also in the NVMe specification

• Tracks on the disk overlap; overwriting a single track means overwriting a bunch of other tracks as well [24]

→ Data is organized in "bands" of tracks: zones. Each zone is only written sequentially; out-of-order writes are only possible after a reset of the whole zone. Random read access remains possible.

• Supported by the Linux block layer, but breaks with the previous definition

• Direct use only with the aid of special ioctls [23], via a special device-mapper target [22], or via a special file system [25]

• We will ignore this for the rest of the talk

© Copyright IBM Corp. 2021

Page 33: An Introduction to the Linux Kernel Block I/O Stack ...

Structure of a Block Device: Whole Picture

[Figure: the whole picture across the (V)FS, block, and SCSI layers. A bdev_inode couples a vfs_inode (i_rdev = devt, i_mode = S_IFBLK) with its block_device (bd_dev = devt, bd_partno = N, bd_start_sect = M, referenced via i_bdev, fops → block_device_operations with *submit_bio(*bio) and *open(*bdev, mode)); the block_device points via bd_disk to the gendisk (part_tbl → disk_part_tbl with len = O and part[len], plus part0) and via bd_queue to the request_queue (elevator → elevator_queue (0 ... 1), backing_dev_info, limits → queue_limits); the request_queue is served by the scsi_disk/scsi_device driver structures.]

© Copyright IBM Corp. 2021

Page 34: An Introduction to the Linux Kernel Block I/O Stack ...

Structure of a MQ Request Queue: Whole Picture

[Figure: the whole MQ picture. The request_queue points to its queue_limits (max_hw_sectors, max_sectors, max_segment_size, physical_block_size, logical_block_size, io_opt, max_segments), its backing_dev_info (ra_pages, min_ratio, max_ratio, wb : bdi_writeback), an optional elevator_queue (type, elevator_data; e.g. mq_deadline ops), its mq_ops (blk_mq_ops with *queue_rq(), *complete(), *timeout()), and its tag_set. The blk_mq_tag_set holds map[MAX_TYPES] (blk_mq_queue_map with mq_map[nr_cpu_ids]; mq_map[x] = nr-hw-queue), queue_depth, and tags[nr_hw_queues] pointing to one blk_mq_tags per HW queue (nr_tags = queue_depth, bitmap_tags : sbitmap, rqs[nr_tags], static_rqs[nr_tags], pages : list of pre-allocated request pages). queue_ctx[] points to a per-CPU blk_mq_ctx (rq_lists[MAX_TYPES] : list, cpu, hctx[MAX_TYPES], queue; reached via per_cpu_ptr), and queue_hw_ctx[] to 1 ... nr_hw_queues blk_mq_hw_ctx (dispatch : list, cpumask, nr_ctxs, ctx[nr_ctx], ctx_map : sbitmap, tags, queue).]

© Copyright IBM Corp. 2021

Page 35: An Introduction to the Linux Kernel Block I/O Stack ...

Glossary i

BIO BIO: represents metadata and data for I/O in the Linux block layer; no hardware specific information.

CFQ Completely Fair Queuing: deprecated I/O scheduler for single queue block layer.

DASD Direct-Access Storage Device: disk storage type used by IBM Z via FICON.

dm Device-Mapper: low-level volume manager; allows specifying mappings for ranges of logical sectors; higher-level volume managers such as LVM2 use this driver.

DMA Direct Memory Access: hardware components can access main memory without CPU involvement.

ECKD Extended Count Key Data: a recording format of data stored on DASDs.

Elevator Synonym for “I/O Scheduler” in the Linux Kernel.

FCP Fibre Channel Protocol: transport for the SCSI command set over Fibre Channel networks.

FIFO First in, First out.

HCTX Hardware Context of a request queue.

ioctl input/output control: system call that allows querying device-specific information, or executing device-specific operations.

© Copyright IBM Corp. 2021

Page 36: An Introduction to the Linux Kernel Block I/O Stack ...

Glossary ii

IPI Inter-Processor Interrupt: interrupt another processor to communicate some required action.

iSCSI Internet SCSI: transport for the SCSI command set over TCP/IP.

LVM2 Logical Volume Manager: flexible methods of allocating (non-linear) space on mass-storage devices.

md Multiple devices support: supports multiple physical block devices through a single logical device; required for RAID and logical volume management.

MQ Short for: Multi-Queue.

Multipathing Accessing one storage target via multiple independent paths with the purposes of redundancy and load-balancing.

NVMe Non-Volatile Memory Express: interface for accessing persistent storage device over PCI Express.

RAID Redundant Array of Inexpensive Disks: combines multiple physical disks into a logical one with the purposes of data redundancy and load-balancing.

RAM Random-Access Memory: a form of information storage, random-accessible, and normally volatile.

SAS Serial Attached SCSI: transport for the SCSI command set over a serial point-to-point bus.

© Copyright IBM Corp. 2021

Page 37: An Introduction to the Linux Kernel Block I/O Stack ...

Glossary iii

SCSI Small Computer System Interface: set of standards for commands, protocols, and physical interfaces to connect computers with peripheral devices.

Serial ATA Serial AT Attachment: serial bus that connects host bus adapters with mass storage devices.

SMR Shingled Magnetic Recording: a magnetic storage data recording technology used to provide increased areal density.

VFS Virtual File System: abstraction layer in Linux that provides a common interface to file systems and devices for software.

© Copyright IBM Corp. 2021

Page 38: An Introduction to the Linux Kernel Block I/O Stack ...

References i

[1] J. Axboe. Explicit block device plugging, Apr. 2011. https://lwn.net/Articles/438256/.

[2] J. Axboe. Efficient IO with io_uring, Oct. 2019. https://kernel.dk/io_uring.pdf.

[3] M. Bjørling, J. Axboe, D. Nellans, and P. Bonnet. Linux block IO: introducing multi-queue SSD access on multi-core systems. In Proceedings of the 6th International Systems and Storage Conference, SYSTOR '13, pages 22:1–22:10, New York, NY, USA, 2013. ACM.

[4] N. Brown. A block layer introduction part 1: the bio layer, Oct. 2017. https://lwn.net/Articles/736534/.

[5] N. Brown. Block layer introduction part 2: the request layer, Nov. 2017. https://lwn.net/Articles/738449/.

[6] J. Corbet. The BFQ I/O scheduler, June 2014. https://lwn.net/Articles/601799/.

© Copyright IBM Corp. 2021

Page 39: An Introduction to the Linux Kernel Block I/O Stack ...

References ii

[7] J. Corbet. Block-layer I/O polling, Nov. 2015. https://lwn.net/Articles/663879/.

[8] J. Corbet. The return of the BFQ I/O scheduler, Feb. 2016. https://lwn.net/Articles/674308/.

[9] J. Corbet. A way forward for BFQ, Dec. 2016. https://lwn.net/Articles/709202/.

[10] J. Corbet. Two new block I/O schedulers for 4.12, Apr. 2017. https://lwn.net/Articles/720675/.

[11] J. Corbet. I/O scheduling for single-queue devices, Oct. 2018. https://lwn.net/Articles/767987/.

[12] J. Corbet. Ringing in a new asynchronous I/O API, Jan. 2019. https://lwn.net/Articles/776703/.

© Copyright IBM Corp. 2021

Page 40: An Introduction to the Linux Kernel Block I/O Stack ...

References iii

[13] K. Desai. io_uring iopoll in scsi layer, Feb. 2021. https://lore.kernel.org/linux-scsi/[email protected]/T/#.

[14] W. Fischer and G. Schönberger. Linux Storage Stack Diagramm, Mar. 2017. https://www.thomas-krenn.com/de/wiki/Linux_Storage_Stack_Diagramm.

[15] E. Goggin, A. Kergon, C. Varoqui, and D. Olien. Linux multipathing. In Proceedings of the Linux Symposium, volume 1, pages 155–176, July 2005. https://www.kernel.org/doc/ols/2005/ols2005v1-pages-155-176.pdf.

[16] The kernel development community. Block documentation. https://www.kernel.org/doc/html/latest/block/index.html.

[17] M. Lei, H. Reinecke, J. Garry, and K. Desai. blk-mq/scsi: Provide hostwide shared tags for scsi hbas, Aug. 2020. https://lore.kernel.org/linux-scsi/[email protected]/T/#.

[18] R. Love. Linux Kernel Development. Addison-Wesley Professional, 3rd edition, June 2010.

© Copyright IBM Corp. 2021

Page 41: An Introduction to the Linux Kernel Block I/O Stack ...

References iv

[19] O. Purdila, R. Chitu, and R. Chitu. Linux kernel labs: Block device drivers, May 2019. https://linux-kernel-labs.github.io/refs/heads/master/labs/block_device_drivers.html.

[20] O. Sandoval. blk-mq: abstract tag allocation out into sbitmap library, Sept. 2016. https://lore.kernel.org/linux-block/[email protected]/T/#.

[21] K. Ueda, J. Nomura, and M. Christie. Request-based device-mapper multipath and dynamic load balancing. In Proceedings of the Linux Symposium, volume 2, pages 235–244, June 2007. https://www.kernel.org/doc/ols/2007/ols2007v2-pages-235-244.pdf.

[22] Western Digital Corporation. dm-zoned. https://www.zonedstorage.io/linux/dm/#dm-zoned.

[23] Western Digital Corporation. Zoned block device user interface. https://www.zonedstorage.io/linux/zbd-api/.

[24] Western Digital Corporation. Zoned storage overview. https://www.zonedstorage.io/introduction/zoned-storage/.

© Copyright IBM Corp. 2021

Page 42: An Introduction to the Linux Kernel Block I/O Stack ...

References v

[25] Western Digital Corporation. zonefs. https://www.zonedstorage.io/linux/fs/#zonefs.

[26] J. Xu. dm: support polling, Mar. 2021. https://lore.kernel.org/linux-block/[email protected]/T/#.

© Copyright IBM Corp. 2021

