ORNL is managed by UT-Battelle for the US Department of Energy

Improving Block Level Efficiency with scsi-mq

Blake Caldwell NCCS/ORNL March 4th, 2015

Block Layer Problems

• Global lock of the request queue per block device
• Cache coherency traffic
  – If part of a request is serviced on multiple cores, the lock must be obtained on the new core and invalidated on the old core
• Interrupt locality
  – The hardware interrupt may occur on the wrong core, requiring a soft interrupt to be sent to the proper core

Linux Block Layer

• Designed with rotational media in mind
  – Time spent in the queue allows sequential request reordering, a very good thing
  – Completion latencies of 10 ms to 100 ms
• Single request queue
  – Staging area for merging, reordering, and scheduling
• Drivers are presented with the same interface for each block device

blk-mq (multi-queue)

• Rewrite of the Linux block layer (since kernel 3.13)
• Two levels of queues
  1) Per-core submission queues
  2) One or more hardware dispatch queues with affinity to NUMA nodes/CPUs (device-driver specific)
• IO scheduling within software queues
  – Requests are inserted in FIFO order, then interleaved to the hardware queues
• IOs are tagged, and the tags are reused for lookup on completion
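
The two queue levels and the tag reuse described on this slide can be illustrated with a toy model. This is only a sketch under stated assumptions, not kernel code: the class name `BlkMqModel`, the modulo mapping of cores to hardware queues, and the round-robin interleaving are all hypothetical choices made for illustration.

```python
from collections import deque


class BlkMqModel:
    """Toy model of the two blk-mq queue levels. Not kernel code."""

    def __init__(self, n_cores, n_hw_queues, queue_depth):
        # Level 1: one FIFO software submission queue per core.
        self.sw_queues = [deque() for _ in range(n_cores)]
        # Level 2: hardware dispatch queues. The modulo mapping is an
        # assumption; real drivers choose their own core-to-queue mapping.
        self.core_to_hw = [core % n_hw_queues for core in range(n_cores)]
        # Each hardware queue owns a fixed pool of tags (its queue depth).
        self.free_tags = [list(range(queue_depth)) for _ in range(n_hw_queues)]
        # Tag -> request table, used for lookup on completion.
        self.in_flight = [{} for _ in range(n_hw_queues)]

    def submit(self, core, request):
        """Insert a request at the tail of the submitting core's queue."""
        self.sw_queues[core].append(request)

    def dispatch(self):
        """Interleave software queues into hardware queues while tags last.

        Returns the (hw_queue, tag) pairs that were dispatched.
        """
        dispatched = []
        progress = True
        while progress:
            progress = False
            for core, sw_queue in enumerate(self.sw_queues):
                hw = self.core_to_hw[core]
                if sw_queue and self.free_tags[hw]:
                    tag = self.free_tags[hw].pop()
                    self.in_flight[hw][tag] = sw_queue.popleft()
                    dispatched.append((hw, tag))
                    progress = True
        return dispatched

    def complete(self, hw, tag):
        """Look the request up by tag, then return the tag to the pool."""
        request = self.in_flight[hw].pop(tag)
        self.free_tags[hw].append(tag)
        return request
```

In this model a request that finds no free tag simply stays in its software queue until a completion returns a tag, which is how the queue depth of the hardware queue bounds the number of IOs in flight.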

blk-mq vs. single queue

[Figure: side-by-side diagrams of the single-queue and blk-mq IO paths. Labels in both panels: Application (Core 0, 1, 2) with ost_io and libaio, Page Cache, Core 0 to Core 3, Block Device Driver (SRP Initiator), Block Device (SRP Target), Disk/SSD. Additional labels: Software Submission Queues, Hardware Dispatch Queues, Hardware Parallelism.]

IO Device Affinity

[Figure: diagram of an OSS node. Labels: NUMA node 0 and NUMA node 1, CPU 0 and CPU 1, QPI, HCA 1 and HCA 2 (each on PCIe), LNET, To Storage Controller.]

Controller Caching (direct)

[Figure: diagram of the direct path. Labels: OSS, HCA, controllers ddn1 and ddn2 (each with ports RP0 and RP1), Lun 0, Lun 1, Lun 3, Lun 4.]

Controller Caching (indirect)

[Figure: diagram of the indirect path. Labels: OSS, HCA, controllers ddn1 and ddn2 (each with ports RP0 and RP1), Lun 0, Lun 1, Lun 3, Lun 4.]

Evaluation Setup

• Linux 3.18
  – blk-mq (3.13)
  – scsi-mq (3.17)
  – ib-srp multichannel (3.18)
  – dm-multipath support (4.0)
• Lustre 2.7.0 rc1
  – ldiskfs patches rebased to 3.18
• OSS
  – Dual Ivy Bridge E5-2650 (2 NUMA nodes)
  – 64 GB
  – Dual-port QDR IB to array
• Storage Array
  – Dual DDN 10k controllers
  – 8 GB write-back cache
  – RAID 6 (8+2) LUNs
  – SATA disk

Evaluation Goals

• Block level testing: adapt tests done with the null-blk device to a real storage device with scsi-mq
  – Does NUMA affinity awareness lead to efficiency?
    • Increased bandwidth
    • Decreased request latency
• Multipath performance
• Explore benefits for filesystems
  – Ready for use with fabric-attached storage?
  – Are there performance benefits?

Evaluation Parameters

• Affinity
  – IB MSI-X interrupts spread among cores on NUMA node 1
  – Fio threads bound to NUMA node 1 (closest to HCA)
  – Block device rq_affinity=2 (completion happens on the submitting core)
• Block tuning
  – Noop scheduler
  – max_sectors_kb = max_hw_sectors_kb
• IB-srp module
  – 16 channels
  – max_sect 8192
  – max_cmd_per_lun 62
  – Queue size 127 (fixed by hardware)
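
The block-device settings listed above (rq_affinity, scheduler, max_sectors_kb) live under the device's `queue` directory in sysfs. The following is a minimal sketch of applying them, not the script used for the evaluation: the function name `tune_block_queue`, the `sysfs_root` parameter (added so the function can be pointed at a scratch directory), and the device name in the usage note are assumptions. Writing to the real sysfs requires root.

```python
from pathlib import Path


def tune_block_queue(device, sysfs_root="/sys/block", scheduler="noop"):
    """Apply the slide's block tuning to one device and return what was set.

    device     -- block device name, e.g. "sdb" (hypothetical)
    sysfs_root -- assumption: overridable so the sketch can be tried
                  against a scratch directory instead of the live system
    scheduler  -- pass None to leave the scheduler untouched
    """
    queue = Path(sysfs_root) / device / "queue"

    # rq_affinity=2: run the completion on the core that submitted the IO.
    (queue / "rq_affinity").write_text("2\n")

    if scheduler is not None:
        (queue / "scheduler").write_text(scheduler + "\n")

    # Raise the per-request size limit to the hardware maximum.
    hw_max = (queue / "max_hw_sectors_kb").read_text().strip()
    (queue / "max_sectors_kb").write_text(hw_max + "\n")

    return {
        "rq_affinity": "2",
        "scheduler": scheduler,
        "max_sectors_kb": hw_max,
    }
```

A call such as `tune_block_queue("sdb")` would apply all three settings to a device named sdb, which is a placeholder here. The interrupt spreading and fio thread binding from the Affinity bullet are separate steps that this sketch does not cover.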

Throughput (direct path) thread/LUN

[Figure: write IOPS (y-axis, 0 to 350000) versus number of LUNs (x-axis: 1, 2, 4, 6). Series: 2.6.32-431, 3.18 mq, 3.18 mq+mc.]

Throughput (indirect path) thread/LUN

[Figure: write IOPS (y-axis, 0 to 350000) versus number of LUNs (x-axis: 1, 2, 4, 6). Series: 2.6.32-431, 3.18 mq, 3.18 mq+mc.]

CPU Usage (direct path) thread/LUN

[Figure: CPU % (y-axis, 0 to 90) versus number of LUNs (x-axis: 1, 2, 4, 6, 8). Series: usr and sys for each of 2.6.32-431, 3.18 mq, and 3.18 mq+mc.]

Throughput (direct path) 1 LUN

[Figure: write IOPS (y-axis, 0 to 600000) versus number of threads (x-axis: 1, 2, 4, 8). Series: 2.6.32-431, 3.18 mq, 3.18 mq+mc.]

Read Throughput (direct path) 1 LUN

[Figure: read IOPS (y-axis, 0 to 350000) versus number of threads (x-axis, 0 to 9). Series: 2.6 sq, 3.18 mq, 3.18 mq+mc.]

dm-multipath support

• Evaluated on Linux 4.0rc1 with patches for dm blk-mq support
  – 4k IO, libaio, iodepth=127, SRP multi-channel support enabled

Multipath       Direct (no IO FWD)    Indirect (IO FWD)
393.2 MB/s      849.5 MB/s            854.8 MB/s
100658 IOPS     217468 IOPS           218829 IOPS
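
The bandwidth and IOPS rows above are two views of the same measurement. As a worked check, the sketch below converts IOPS to bandwidth under two assumptions that the slide does not state explicitly: that "4k" means 4096 bytes per IO, and that "MB" means 2^20 bytes. With those assumptions the converted values round to the bandwidth figures reported in the table.

```python
def iops_to_mb_per_s(iops, block_bytes=4096):
    """Convert IOPS to bandwidth, taking 1 MB as 2**20 bytes (assumption)."""
    return iops * block_bytes / 2**20


# IOPS figures copied from the dm-multipath table.
reported_iops = {
    "Multipath": 100658,
    "Direct (no IO FWD)": 217468,
    "Indirect (IO FWD)": 218829,
}

for path, iops in reported_iops.items():
    print(f"{path}: {iops_to_mb_per_s(iops):.1f} MB/s")
```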

Request latency

• Average over 100 4k IOs, fio with the synchronous IO engine

Kernel             Multipath    Direct (no IO FWD)    Indirect (IO FWD)
2.6.32-431.17.1    130.50       119.92                169.22
4.0rc1 (blk-mq)    130.56       103.58                142.84
Improvement        -0.04%       13.6%                 15.6%
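
The improvement row follows from the two latency rows as the relative reduction against the 2.6.32 baseline. A small sketch of that arithmetic, with the hypothetical helper name `improvement_pct`:

```python
def improvement_pct(baseline, new):
    """Relative latency reduction in percent; negative means slower."""
    return (baseline - new) / baseline * 100.0


# (baseline 2.6.32-431.17.1, 4.0rc1 blk-mq) latencies from the table.
latencies = {
    "Multipath": (130.50, 130.56),
    "Direct (no IO FWD)": (119.92, 103.58),
    "Indirect (IO FWD)": (169.22, 142.84),
}

for path, (old, new) in latencies.items():
    print(f"{path}: {improvement_pct(old, new):.2f}%")
```

The direct and indirect columns round to the 13.6% and 15.6% shown in the table. The multipath column works out to about -0.046%, which the table reports as -0.04%.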

Lustre Profiling

• Flame graph of kernel code sampled with perf at 100 Hz: Linux 3.18, Lustre 2.7.0rc1, 1 MB writes to a single OST

Lustre Applications

• Metadata IO
  – Improve single request latency
  – Is bandwidth necessary when flushing metadata to the MDT?
• Object IO
  – Scheduling
    • Request size
    • Request tagging

Future Directions

• Lustre with Linux 4.0
• Testing with hardware capable of 600k+ 4k IOPS
  – Random write performance for multiple threads/LUN
• Evaluate sequential writes with multiple threads/LUN
• Read and random tests need further investigation

Conclusion

• scsi-mq has the potential to lower CPU usage even with rotational media
• scsi-mq has lower IO completion latency
• Further evaluation is needed of device drivers that support multiple hardware dispatch queues

Thank You

Blake Caldwell <[email protected]>

