ORNL is managed by UT-Battelle for the US Department of Energy
Improving Block Level Efficiency with scsi-mq
Blake Caldwell NCCS/ORNL March 4th, 2015
Block Layer Problems
• Global lock on the request queue of each block device
• Cache coherency traffic
– If parts of a request are serviced on multiple cores, the lock must be acquired on the new core and invalidated on the old one
• Interrupt locality
– The hardware interrupt may arrive on the wrong core, requiring a soft-interrupt to be sent to the proper core
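The interrupt-locality issue can be observed on a live system by reading the IRQ-to-core routing from procfs. A minimal inspection sketch in Python (the match string mlx4 for an IB HCA's interrupt vectors is an assumption):

#!/usr/bin/env python
"""Sketch: show which cores each of a device's MSI-X interrupt
vectors is allowed to fire on. 'mlx4' is only an example match
string for an IB HCA's interrupt names."""

def irq_affinity(match="mlx4"):
    with open("/proc/interrupts") as f:
        next(f)  # skip the per-CPU header row
        for line in f:
            if match not in line:
                continue
            irq = line.split(":", 1)[0].strip()
            # smp_affinity_list holds the cores this vector may target;
            # a completion needed on any other core costs a redirect
            with open("/proc/irq/%s/smp_affinity_list" % irq) as aff:
                print("IRQ %s -> cores %s" % (irq, aff.read().strip()))

if __name__ == "__main__":
    irq_affinity()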
Linux Block Layer
• Designed with rotational media in mind
– Time spent in the queue allows sequential request reordering – a very good thing
– Completion latencies of 10ms to 100ms
• Single request queue
– Staging area for merging, reordering, scheduling
• Drivers are presented with the same interface for each block device
blk-mq (multi-queue)
• Rewrite of the Linux block layer (mainline since kernel 3.13)
• Two levels of queues
1) Per-core software submission queues
2) One or more hardware dispatch queues with affinity to NUMA nodes/CPUs (device-driver specific)
• IO scheduling within software queues
– Requests are inserted in FIFO order, then interleaved to hardware queues
• IOs are tagged on submission, and the tags are reused for lookup on completion
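The two-level structure is visible from userspace: a blk-mq managed device exposes its hardware dispatch queues and the CPUs whose software queues feed them under /sys/block/<dev>/mq. A minimal inspection sketch (the device name nullb0 is only an example):

#!/usr/bin/env python
"""Sketch: list a blk-mq device's hardware dispatch queues and which
CPUs' software submission queues map to each. Assumes a blk-mq
managed device; 'nullb0' is an example name."""
import os

def show_mq_layout(dev="nullb0"):
    mq_dir = "/sys/block/%s/mq" % dev
    if not os.path.isdir(mq_dir):
        raise SystemExit("%s is not managed by blk-mq" % dev)
    for hctx in sorted((d for d in os.listdir(mq_dir) if d.isdigit()),
                       key=int):
        # cpu_list names the CPUs whose software queues feed this
        # hardware context (dispatch queue)
        with open(os.path.join(mq_dir, hctx, "cpu_list")) as f:
            cpus = f.read().strip()
        print("hardware queue %s <- software queues of CPUs %s"
              % (hctx, cpus))

if __name__ == "__main__":
    show_mq_layout()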
blk-mq
[Figure: IO path with a single request queue vs. blk-mq. In both cases an application on cores 0-2 submits IO through ost_io/libaio and the page cache to a block device (SRP target). With the single queue, cores 0-3 contend for one queue in the block device driver (SRP initiator) in front of the disk/SSD; with blk-mq, per-core software submission queues feed hardware dispatch queues, exposing the hardware parallelism.]
IO Device Affinity
[Figure: OSS node topology. Two CPU sockets (NUMA node 0 and NUMA node 1) are linked by QPI; each node has a PCIe-attached HCA (HCA 1 on node 0, HCA 2 on node 1). One HCA leads to the storage controller and the other carries LNET traffic, so IO that crosses QPI pays a locality penalty.]
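The NUMA node that a PCIe device hangs off can be read directly from sysfs, which is what the affinity bindings used in the evaluation rely on. A minimal sketch (the HCA name mlx4_0 is an assumption):

#!/usr/bin/env python
"""Sketch: look up which NUMA node a PCIe device is attached to, so
submitting threads can be bound to the same node. The HCA name
mlx4_0 is only an example."""

def numa_node_of(sysfs_class_dev):
    # the class device's 'device' link points at the PCI device,
    # which reports its NUMA node (-1 if the platform doesn't say)
    with open(sysfs_class_dev + "/device/numa_node") as f:
        return int(f.read().strip())

if __name__ == "__main__":
    node = numa_node_of("/sys/class/infiniband/mlx4_0")
    print("HCA mlx4_0 is attached to NUMA node %d" % node)

Submitters can then be pinned to the reported node, e.g. with fio's cpus_allowed/numa_cpu_nodes options or with numactl.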
Controller Caching (direct)
[Figure: direct path. The OSS HCA addresses each LUN (0, 1, 3, 4) through a port (RP0/RP1) on the DDN controller that owns it (ddn1 or ddn2), so writes land directly in the owning controller's cache.]
Controller Caching (indirect)
[Figure: indirect path. IO for a LUN arrives at a port (RP0/RP1) on the non-owning controller (ddn1 or ddn2) and must be forwarded to the owner before it can be cached and completed.]
Evaluation Setup
• Linux 3.18
– blk-mq (3.13)
– scsi-mq (3.17)
– ib-srp multichannel (3.18)
– dm-multipath support (4.0)
• Lustre 2.7.0 rc1
– ldiskfs patches rebased to 3.18
• OSS
– Dual Ivy Bridge E5-2650 (2 NUMA nodes)
– 64GB
– Dual-port QDR IB to array
• Storage Array
– Dual DDN 10k controllers
– 8GB write-back cache
– RAID 6 (8+2) LUNs
– SATA disks
Evaluation Goals
• Block level testing: adapt tests done with the null-blk device to a real storage device with scsi-mq
– Does NUMA affinity awareness lead to efficiency gains?
• Increased bandwidth
• Decreased request latency
• Multipath performance
• Explore benefits for filesystems
– Ready for use with fabric-attached storage?
– Are there performance benefits?
Evaluation Parameters
• Affinity
– IB MSI-X interrupts spread among cores on NUMA node 1
– fio threads bound to NUMA node 1 (closest to the HCA)
– Block device rq_affinity=2: completion happens on the submitting core
• Block tuning (applied in the sketch below)
– noop scheduler
– max_sectors_kb = max_hw_sectors_kb
• ib_srp module
– 16 channels
– max_sect=8192
– max_cmd_per_lun=62
– queue_size=127 (fixed by hardware)
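A minimal sketch of applying the block-tuning knobs above through sysfs (must run as root; the device name sdb is an assumption):

#!/usr/bin/env python
"""Sketch: apply the block-layer tuning from the evaluation via
sysfs. Run as root; 'sdb' is only an example device name."""

def tune(dev="sdb"):
    q = "/sys/block/%s/queue/" % dev

    def write(attr, value):
        with open(q + attr, "w") as f:
            f.write(str(value))

    def read(attr):
        with open(q + attr) as f:
            return f.read().strip()

    # noop applies to the single-queue path; blk-mq devices in this
    # era report 'none' and accept no elevator at all
    write("scheduler", "noop")
    # rq_affinity=2: run the completion on the exact core that
    # submitted the request, not just the same package
    write("rq_affinity", 2)
    # let requests grow to the largest size the hardware accepts
    write("max_sectors_kb", read("max_hw_sectors_kb"))

if __name__ == "__main__":
    tune()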
Throughput (direct path) thread/LUN
[Chart: write IOPS vs. number of LUNs (1, 2, 4, 6), direct path, one thread per LUN, for 2.6.32-431, 3.18 mq, and 3.18 mq+mc; y-axis 0 to 350,000 write IOPS.]
Throughput (indirect path) thread/LUN
[Chart: write IOPS vs. number of LUNs (1, 2, 4, 6), indirect path, one thread per LUN, for the same three kernels; y-axis 0 to 350,000 write IOPS.]
CPU Usage (direct path) thread/LUN
[Chart: user and system CPU% vs. number of LUNs (1, 2, 4, 6, 8), direct path, one thread per LUN, with usr/sys series for 2.6.32-431, 3.18 mq, and 3.18 mq+mc; y-axis 0 to 90%.]
Throughput (direct path) 1 LUN
[Chart: write IOPS vs. thread count (1, 2, 4, 8) on a single LUN, direct path, for 2.6.32-431, 3.18 mq, and 3.18 mq+mc; y-axis 0 to 600,000 write IOPS.]
Read Throughput (direct path) 1 LUN
[Chart: read IOPS vs. thread count (1 to 9) on a single LUN, direct path, for 2.6 sq, 3.18 mq, and 3.18 mq+mc; y-axis 0 to 350,000 read IOPS.]
dm-multipath support
• Evaluated on Linux 4.0rc1 with patches for dm blk-mq support
– 4k IO, libaio, iodepth=127, SRP multi-channel support enabled
Path                 Throughput   IOPs
Multipath            393.2 MB/s   100658
Direct (no IO FWD)   849.5 MB/s   217468
Indirect (IO FWD)    854.8 MB/s   218829
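A sketch of the fio invocation behind these numbers, driven from Python. Block size, engine, and queue depth are from the slide above; the sequential-write pattern and the multipath device path are assumptions:

#!/usr/bin/env python
"""Sketch: run the 4k libaio workload from the multipath test.
The target path /dev/mapper/mpatha is only an example."""
import subprocess

def run_fio(target="/dev/mapper/mpatha"):
    cmd = [
        "fio",
        "--name=mq-write",
        "--filename=%s" % target,
        "--rw=write",           # sequential writes (assumed)
        "--bs=4k",              # 4k IO as in the evaluation
        "--ioengine=libaio",
        "--direct=1",           # bypass the page cache
        "--iodepth=127",        # matches the hardware queue size
        "--runtime=60",
        "--time_based=1",
    ]
    subprocess.check_call(cmd)

if __name__ == "__main__":
    run_fio()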
Request latency
• Average over 100 4k IOs, measured with fio's synchronous IO engine
Kernel            Multipath   Direct (no IO FWD)   Indirect (IO FWD)
2.6.32-431.17.1   130.50      119.92               169.22
4.0rc1 (blk-mq)   130.56      103.58               142.84
Improvement       -0.04%      13.6%                15.6%
Lustre Profiling
• FlameGraph of kernel code using Perf, 100Hz, Linux 3.18, Lustre 2.7.0rc1, 1MB writes to a single OST
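A sketch of how such a profile can be captured and rendered (assumes Brendan Gregg's FlameGraph scripts, stackcollapse-perf.pl and flamegraph.pl, are on PATH; the 30-second capture window is an arbitrary choice):

#!/usr/bin/env python
"""Sketch: sample kernel stacks system-wide at 100Hz with perf and
render a FlameGraph. Assumes the FlameGraph repo's scripts are on
PATH; the capture window is arbitrary."""
import subprocess

def flamegraph(out="lustre-write.svg", seconds=30):
    # sample all CPUs at 100Hz with call graphs while the workload runs
    subprocess.check_call(["perf", "record", "-F", "100", "-a", "-g",
                           "--", "sleep", str(seconds)])
    # perf script dumps the stacks; the two Perl scripts fold and render
    script = subprocess.Popen(["perf", "script"], stdout=subprocess.PIPE)
    fold = subprocess.Popen(["stackcollapse-perf.pl"],
                            stdin=script.stdout, stdout=subprocess.PIPE)
    with open(out, "wb") as svg:
        subprocess.check_call(["flamegraph.pl"],
                              stdin=fold.stdout, stdout=svg)

if __name__ == "__main__":
    flamegraph()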
Lustre Applications
• Metadata IO
– Improve single-request latency
– Is bandwidth necessary while flushing metadata to the MDT?
• Object IO
– Scheduling
• Request size
• Request tagging
Future Directions
• Lustre with Linux 4.0
• Testing with hardware capable of 600k+ 4k IOPs
– Random write performance with multiple threads per LUN
• Evaluate sequential writes with multiple threads per LUN
• Read and random tests need further investigation
Conclusion
• scsi-mq has the potential to lower CPU usage even with rotational media
• scsi-mq lowers IO completion latency
• Further evaluation is needed of device drivers that support multiple hardware dispatch queues