
Exploring System Challenges of Ultra-Low Latency Solid State Drives

Sungjoon Koh

Changrim Lee, Miryeong Kwon, and Myoungsoo Jung

Computer Architecture and Memory systems Lab

Executive Summary

Motivation. Ultra-low latency (ULL) SSDs are emerging, but they have not been characterized so far.

Contributions.

- Characterizing the performance behaviors of ULL SSD.

- Studying several system-level challenges of the current storage stack.

Key Observations.

- ULL SSD minimizes I/O interference between interleaved reads and writes.

- NVMe queue mechanisms are required to be optimized for ULL SSDs.

- Polling-based I/O completion routine isn’t effective for current NVMe SSDs.

Architectural Change of SSD

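[Figure: With a SATA SSD, storage traffic passes from the CPU through the MCH (North Bridge) and the ICH (South Bridge) over SATA. An NVMe SSD attaches to the CPU over PCI Express, which gives direct access and high bandwidth.]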

Evolution of SSDs

NVMe SSD

- Read: 2.4 GB/s
- Write: 1.2 GB/s

SATA SSD

- Read: 0.5 GB/s
- Write: 0.5 GB/s

Changes

- Bandwidth almost reaches the maximum performance.
- Still, long latency (far from DRAM).
- New flash memory, called “Z-NAND”.

New Flash Memory

Existing 3D NAND

- Read: 45-120 μs
- Write: 660-5000 μs

Z-NAND [1]

- Read: 3 μs (15~20x shorter)
- Write: 100 μs (6~7x shorter)

Z-NAND [1] specifications

- Technology: SLC-based 3D NAND, 48 stacked word-line layers
- Capacity: 64Gb
- Page size: 2kB/page

The Z-NAND-based SSD is called “Z-SSD”.

Characterization Categories

Performance Analysis.

- Average latency.

- Long-tail latency.

- Bandwidth.

- I/O interference impact.

Polling vs. Interrupt

- Overall latency comparison.

- CPU utilization analysis.

- Memory requirement.

- Five-nines latency.

Evaluation Settings

Benchmark: Flexible I/O Tester (FIO v2.99)

OS: Linux 4.14.10

CPU: Intel® Core™ i7-4790K (4-core, 4.00GHz)

Memory: DDR4 DRAM (16GB)

SSD

- ULL SSD: Z-SSD Prototype (800GB)

- NVMe SSD: Intel® SSD 750 Series (400GB)

[Figure: Our testbed w/ Z-SSDs, showing the Z-SSD prototype.]
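The slides name FIO as the benchmark but do not list the job parameters. As a rough sketch of how one measurement point could be described, the helper below builds an fio argument list. The function name `fio_command`, the job name, the 30-second runtime, and the device path in the usage line are assumptions for illustration, not the authors' settings. The fio options themselves (`--rw`, `--bs`, `--iodepth`, `--rwmixwrite`, `--ioengine=pvsync2` with `--hipri`) are standard fio options.

```python
def fio_command(device, rw, iodepth, block_size="4k", write_pct=None, polling=False):
    """Build an fio argument list for one measurement point (illustrative only)."""
    args = [
        "fio",
        "--name=ull-characterization",  # hypothetical job name
        f"--filename={device}",
        "--direct=1",                   # bypass the page cache
        f"--rw={rw}",                   # read, write, randread, randwrite, randrw, ...
        f"--bs={block_size}",
        "--time_based",
        "--runtime=30",                 # assumed runtime in seconds
    ]
    if polling:
        # pvsync2 uses preadv2/pwritev2; --hipri requests polled completion.
        args += ["--ioengine=pvsync2", "--hipri"]
    else:
        # Asynchronous engine so that the queue depth can be varied.
        args += ["--ioengine=libaio", f"--iodepth={iodepth}"]
    if write_pct is not None:
        # Share of writes in a mixed workload (use with rw=rw or rw=randrw).
        args.append(f"--rwmixwrite={write_pct}")
    return args


if __name__ == "__main__":
    # /dev/nvme0n1 is a placeholder device path.
    print(" ".join(fio_command("/dev/nvme0n1", "randread", iodepth=16)))
```

Sweeping `iodepth` would cover the latency and bandwidth measurements, `write_pct` the intermixed workload, and `polling=True` the polling vs. interrupt comparison.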

Performance Analysis

Overview

[Figure: On the host, 4KB requests pass through the request queue and the NVMe driver to the NVMe controller of the SSD. The queue depth is increased, and reads (Rd) and writes (Wr) are intermixed.]

① Average latency & long-tail latency

② Bandwidth

③ Read latency under read & write intermixed workload

Average Latency of ULL SSD

[Figure: Average latency (μsec) vs. I/O depth (2 to 16). Panels: Sequential Read and Sequential Write, each comparing NVMe SSD and ULL SSD (plot annotations: 5.1x, 1.8x), plus a panel with SeqRd, RndRd, SeqWr, and RndWr (latency axis up to 40 μsec).]

“Split-DMA & Super-Channel”

4KB DMA = 8 μs, tR = 3 μs

tR + tDMA = 11 μs

Split-DMA & Super-Channel

[Figure: Inside the Z-SSD, the split DMA engine splits a 4KB request into two 2KB transfers. Channels 0 to 5 are grouped into super-channels, so the two halves move over two channels in parallel.]

tDMA = 4 μs

Reference [1]: Cheong, Woosung et al., “A flash memory controller for 15μs ultra-low-latency SSD using high-speed 3D NAND flash with 3μs read time”, ISSCC, 2018
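The numbers on these slides fit a simple sum: a 3 μs Z-NAND read plus an 8 μs 4KB DMA gives 11 μs, and splitting the transfer over two channels brings the DMA down to 4 μs. The sketch below reproduces that arithmetic. The function name and the assumption that DMA time scales linearly with the bytes moved per channel are illustrative, not taken from the source.

```python
T_R_US = 3.0        # Z-NAND read time quoted in the slides
T_DMA_4KB_US = 8.0  # 4KB DMA over a single channel, quoted in the slides


def read_service_time_us(request_kb=4, split_ways=1):
    """tR + tDMA, assuming DMA time is proportional to bytes per channel."""
    t_dma = T_DMA_4KB_US * (request_kb / 4) / split_ways
    return T_R_US + t_dma


if __name__ == "__main__":
    print(read_service_time_us())              # 11.0 (no split)
    print(read_service_time_us(split_ways=2))  # 7.0 (two 2KB halves in parallel)
```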

Long-tail Latency of ULL SSD

[Figure: 99.999th-percentile latency (msec) vs. I/O depth (2 to 16) for ULL SSD and NVMe SSD; series: SeqRd, RndRd, SeqWr, RndWr.]

- ULL SSD: “Split DMA” & “Suspend/Resume”.
- NVMe SSD: resource conflict, insufficient internal buffer, internal tasks.
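Long-tail latency here means the 99.999th percentile (five nines) of per-request latency. As a minimal sketch of how such a figure can be computed from raw samples, the function below uses the nearest-rank definition of a percentile. The sample values are invented for illustration, and exact fractions are used so that the rank is not disturbed by floating-point rounding.

```python
import math
from fractions import Fraction


def percentile_nearest_rank(samples, pct):
    """Return the nearest-rank percentile (0 < pct <= 100) of samples."""
    if not samples:
        raise ValueError("samples must not be empty")
    ordered = sorted(samples)
    # rank = ceil(pct/100 * N), computed exactly to avoid float rounding.
    rank = math.ceil(Fraction(str(pct)) / 100 * len(ordered))
    return ordered[max(rank, 1) - 1]


if __name__ == "__main__":
    # Invented data: 99,998 fast requests and two slow outliers (in microseconds).
    latencies_us = [10.0] * 99_998 + [4000.0, 5000.0]
    print(sum(latencies_us) / len(latencies_us))          # the mean hides the tail
    print(percentile_nearest_rank(latencies_us, 99.999))  # the tail shows up here
```

The average stays close to the common case even when a few requests take milliseconds, which is why the slides report average and five-nines latency separately.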

Suspend/Resume DMA Technique

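[Figure: Channel timing for two ways. Without suspend/resume, a read on Way 2 (CMD, tR, Data Out) has to wait while the DMA for a write request on Way 1 occupies the channel. With suspend/resume [1], the write DMA is suspended, the read is served, and the write DMA then resumes.]

Suspend/Resume [1]: reduces read latency and increases QoS.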

Reference [1]: Cheong, Woosung et al., “A flash memory controller for 15μs ultra-low-latency SSD using high-speed 3D NAND flash with 3μs read time”, ISSCC, 2018
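The effect of suspend/resume on a read that arrives behind a write DMA can be illustrated with a toy timing model. Only the 3 μs read time comes from the slides. The command time, the data-out time, and the remaining write DMA time are hypothetical values, and the model ignores the extra delay that the suspended write itself sees.

```python
T_R_US = 3.0  # Z-NAND read time quoted in the slides


def read_latency_us(write_dma_remaining_us, suspend_resume,
                    t_cmd_us=0.5, t_data_out_us=4.0):
    """Latency of a read that arrives while a write DMA holds the channel.

    t_cmd_us and t_data_out_us are hypothetical illustration values.
    """
    # Without suspend/resume, the read waits for the rest of the write DMA.
    wait_us = 0.0 if suspend_resume else write_dma_remaining_us
    return wait_us + t_cmd_us + T_R_US + t_data_out_us


if __name__ == "__main__":
    print(read_latency_us(20.0, suspend_resume=False))  # read waits for the write DMA
    print(read_latency_us(20.0, suspend_resume=True))   # write DMA is suspended
```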

I/O Interference

Flush operations and metadata writes in the file system are intermixed with user requests.

[Figure: Average read latency (μsec) vs. write fraction (0 to 80%) for NVMe SSD and ULL SSD. Data labels on the ULL SSD curve: 27, 32, 31, 34, 37.]

- NVMe SSD: significant performance degradation in intermixed workloads. This is a great performance bottleneck of conventional SSDs.
- How about ULL SSD? Read latency remains almost constant (“suspend/resume”, … [1]).
- ULL SSD can be applied to a real-life storage stack w/o performance degradation.

Queue Analysis

[Figure: Normalized bandwidth vs. I/O depth for NVMe SSD and ULL SSD (I/O depth up to 250, with a zoomed panel for I/O depths 4 to 20); series: SeqRd, RndRd, SeqWr, RndWr.]

NVMe SSD

- Requires more than 100 entries.
- Only 50% of max BW.
- Too long write latency.
- Light queue mechanisms (e.g., NCQ) are not sufficient.
- Requires a rich queue mechanism.

ULL SSD

- Only 6 entries required.
- Almost max BW.
- Short write latency.
- Well-aligned with light queue mechanisms (e.g., NCQ).
- NVMe needs to be lightened.

I/O request rescheduling within queue.
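One way to see why a lower-latency device needs fewer queue entries is Little's law: the number of outstanding requests equals throughput times latency. The sketch below applies it to 4KB requests. The bandwidth and latency values in the usage lines are illustrative assumptions, not measurements from the slides.

```python
def required_queue_depth(target_bandwidth_gb_s, latency_us, request_kb=4):
    """Outstanding requests needed to sustain a bandwidth (Little's law).

    outstanding = throughput (requests/s) * latency (s)
    """
    requests_per_sec = target_bandwidth_gb_s * 1e9 / (request_kb * 1024)
    return requests_per_sec * latency_us * 1e-6


if __name__ == "__main__":
    # Hypothetical devices sustaining the same bandwidth at different latencies.
    print(required_queue_depth(2.4, latency_us=100.0))  # higher-latency device
    print(required_queue_depth(2.4, latency_us=10.0))   # lower-latency device
```

For a fixed target bandwidth, the required depth shrinks in proportion to the device latency.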

Polling vs. Interrupt

Two different I/O completion methods: interrupt and polling.

Systems with short waiting times adopt a polling-based waiting strategy, even though it incurs considerable overhead.

For example, “spin lock” and “network message passing” apply a polling-based waiting strategy.

Polling is currently implemented in the NVMe storage stack.

Is it really needed for current NVMe SSDs?

Interrupt / Polling

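Interrupt.

Submit request, context switch (CS), sleep during command execution, ISR, CS, complete request.

① SSD finishes, ② NVMe controller raises IRQ, ③ Wake

Polling.

Submit request, polling (“Done??”) during command execution, complete request.

Gain: polling removes the context switches (CS) and the ISR. The shorter the command execution (low latency), the larger the portion of the total latency this gain represents.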
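The relation between device latency and the benefit of polling can be sketched with a toy model. It assumes that the only software cost on the critical path of the interrupt method is the wake-up path (the ISR plus one context switch), and the overhead values are hypothetical, not measured.

```python
def completion_latency_us(device_us, mode, isr_us=1.0, cs_us=2.0):
    """Toy end-to-end latency; isr_us and cs_us are hypothetical overheads."""
    if mode == "polling":
        return device_us                   # the core spins, so there is no wake-up cost
    if mode == "interrupt":
        return device_us + isr_us + cs_us  # ISR plus the context switch back
    raise ValueError(f"unknown mode: {mode}")


def polling_gain_pct(device_us, **overheads):
    """Latency reduction of polling relative to interrupt, in percent."""
    irq = completion_latency_us(device_us, "interrupt", **overheads)
    poll = completion_latency_us(device_us, "polling", **overheads)
    return 100.0 * (irq - poll) / irq


if __name__ == "__main__":
    for device_us in (100.0, 10.0):
        print(device_us, round(polling_gain_pct(device_us), 1))
```

With a fixed software overhead, the relative gain grows as the device latency shrinks.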

Overall Performance

[Figure: Average latency (μsec) of interrupt vs. polling for 4KB, 8KB, 16KB, and 32KB requests; read and write panels for NVMe SSD and ULL SSD.]

NVMe SSD

- Latency decreases by only 0.9% (read) and 8.2% (write).
- Polling-based I/O services are not effective for current NVMe SSDs.

Does polling-based I/O work on ULL SSD?

- Latency decreases by 7.5% (read) and 13.2% (write).
- Future lower-latency SSDs can achieve remarkable performance improvement with a polling-based I/O completion routine.

System Challenges

Polling-based I/O services incur significant system-level overheads that need to be addressed: high CPU utilization and frequent memory access.

<CPU Utilization>

[Figure: CPU utilization (%) over time for interrupt and polling.]

Polling does not release the CPU; the core is always working.

<Memory Bound>

[Figure: Memory bound (%) for 4KB, 8KB, 16KB, and 32KB requests under interrupt and polling.]

[Figure: Host CPU cores (Core 0, Core 1, ..., Core n) check for CQ updates. Labels: NVMe controller with SQ tail doorbell and CQ head doorbell; memory space holding the SQ and CQ with their head and tail pointers; spin lock for head/tail pointer synchronization.]

Memory bound = fraction of slots where the pipeline could be stalled due to load/store.

High memory bound = frequent memory access.

[Figure: 99.999% latency (msec) of ULL writes, interrupt vs. polling, for 4KB, 8KB, 16KB, and 32KB requests.]
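To make the "check CQ update" loop concrete, here is a small simulation of a completion queue that is polled through a phase tag, the mechanism the NVMe specification uses to mark newly posted entries. The class and method names are hypothetical, the device side is simulated in the same process, and locking, doorbell writes, and overflow checks are left out. The `polls` counter only illustrates that every check is a memory read.

```python
from dataclasses import dataclass


@dataclass
class CqEntry:
    command_id: int = 0
    phase: int = 0  # phase tag written by the (simulated) device


class CompletionQueue:
    """Simulated CQ ring; names are illustrative, not a driver API."""

    def __init__(self, depth):
        self.entries = [CqEntry() for _ in range(depth)]
        self.head = 0        # host-side consumer index
        self.tail = 0        # device-side producer index
        self.host_phase = 1  # phase value the host expects for a new entry
        self.dev_phase = 1   # phase value the device writes in the current pass
        self.polls = 0       # number of CQ reads issued by the host

    def device_post(self, command_id):
        """Device side: post a completion and flip the phase on wrap-around."""
        self.entries[self.tail] = CqEntry(command_id, self.dev_phase)
        self.tail += 1
        if self.tail == len(self.entries):
            self.tail = 0
            self.dev_phase ^= 1

    def poll(self):
        """Host side: return a completed command id, or None if nothing is new."""
        self.polls += 1
        entry = self.entries[self.head]
        if entry.phase != self.host_phase:
            return None  # stale entry from the previous pass
        self.head += 1
        if self.head == len(self.entries):
            self.head = 0
            self.host_phase ^= 1
        return entry.command_id


if __name__ == "__main__":
    cq = CompletionQueue(depth=4)
    for _ in range(1000):  # the core spins while the device is still busy
        cq.poll()
    cq.device_post(42)
    print(cq.poll(), cq.polls)
```

Every empty poll is a load that returns nothing useful, which is one way to picture the high memory-bound fraction and the fully occupied core.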

Conclusion

Motivation. Ultra-low latency (ULL) SSDs are emerging, but they have not been characterized so far.

Contributions.

- Characterizing the performance behaviors of ULL SSD.

- Studying several system-level challenges of the current storage stack.

Key Insights.

- ULL SSDs can be effectively applied to a real-life storage stack (read/write mixed).

- NVMe queue mechanisms are required to be optimized for ULL SSDs.

- Polling-based I/O completion routine isn’t effective for current NVMe SSDs.

Thank you

Q&A

Date post:	20-Apr-2020