Exploring System Challenges of Ultra-Low Latency Solid State Drives
Sungjoon Koh
Changrim Lee, Miryeong Kwon, and Myoungsoo Jung
Computer Architecture and Memory systems Lab
Executive Summary
Motivation. Ultra-low latency (ULL) SSDs are emerging, but they have not been characterized so far.
Contributions.
- Characterizing the performance behaviors of ULL SSD.
- Studying several system-level challenges of the current storage stack.
Key Observations.
- ULL SSD minimizes the I/O interferences (interleaving reads and writes).
- NVMe queue mechanisms are required to be optimized for ULL SSDs.
- Polling-based I/O completion routine isn’t effective for current NVMe SSDs.
Architectural Change of SSD
[Figure: SATA SSD vs. NVMe SSD attachment. Labels: CPU, DRAM, MCH (North Bridge), ICH (South Bridge), PCI Express, SATA. The NVMe SSD is annotated with "Direct Access" and "High bandwidth".]
Evolution of SSDs
NVMe SSD
- Read: 2.4 GB/s
- Write: 1.2 GB/s
SATA SSD
- Read: 0.5 GB/s
- Write: 0.5 GB/s
Changes
- Bandwidth almost reaches the maximum performance.
- Still, long latency (far from DRAM).
New flash memory, called “Z-NAND”
New Flash Memory
Existing 3D NAND
- Read: 45-120 μs
- Write: 660-5000 μs
Z-NAND [1]
- Read: 3 μs (15~20x)
- Write: 100 μs (6~7x)
Z-NAND [1] specification
- Technology: SLC-based 3D NAND, 48 stacked word-line layers
- Capacity: 64Gb
- Page size: 2KB/page
Z-NAND-based SSD: “Z-SSD”
Characterization Categories
Performance Analysis.
- Average latency.
- Long-tail latency.
- Bandwidth.
- I/O interference impact.
Polling vs. Interrupt
- Overall latency comparison.
- CPU utilization analysis.
- Memory requirement.
- Five-nines latency.
Evaluation Settings
Benchmark: Flexible I/O Tester (FIO v2.99)
OS: Linux 4.14.10
CPU: Intel® Core™ i7-4790K (4-core, 4.00GHz)
Memory: DDR4 DRAM (16GB)
SSD
- ULL SSD: Z-SSD Prototype (800GB)
- NVMe SSD: Intel® SSD 750 Series (400GB)
[Figure: our testbed with the Z-SSD prototype.]
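The slides name the benchmark (FIO) but not the job parameters, so the following is only a sketch of how an I/O-depth sweep over the four access patterns might be scripted. The device path, runtime, and choice of I/O engines are assumptions for illustration, not the authors' settings; the helper names `fio_cmd` and `depth_sweep` are hypothetical.

```python
from typing import Iterable, List

PATTERNS = ("read", "randread", "write", "randwrite")  # SeqRd, RndRd, SeqWr, RndWr


def fio_cmd(device: str, pattern: str, block_size: str = "4k",
            iodepth: int = 1, polled: bool = False,
            runtime_s: int = 30) -> List[str]:
    """Build an fio argument list for one measurement point.

    WARNING: write patterns overwrite data on `device`.
    All default values here are assumptions, not the authors' settings.
    """
    if pattern not in PATTERNS:
        raise ValueError(f"unknown pattern: {pattern}")
    cmd = [
        "fio",
        f"--name={pattern}-qd{iodepth}",
        f"--filename={device}",
        f"--rw={pattern}",
        f"--bs={block_size}",
        "--direct=1",                # bypass the page cache
        f"--runtime={runtime_s}",
        "--time_based",
        "--output-format=json",
    ]
    if polled:
        # Synchronous engine with high-priority (polled) completions;
        # the queue depth is effectively 1 in this mode.
        cmd += ["--ioengine=pvsync2", "--hipri"]
    else:
        cmd += ["--ioengine=libaio", f"--iodepth={iodepth}"]
    return cmd


def depth_sweep(device: str,
                depths: Iterable[int] = range(2, 17, 2)) -> List[List[str]]:
    """One interrupt-driven fio command per (pattern, I/O depth) pair."""
    depths = list(depths)
    return [fio_cmd(device, p, iodepth=d) for p in PATTERNS for d in depths]
```

Each returned list can be passed to `subprocess.run`; building the commands separately from running them keeps the sweep easy to inspect before it touches a device.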
Performance Analysis
Overview
[Figure: the host (request queue, NVMe driver) issues 4KB requests to the SSD (NVMe controller) with increasing queue depth, including interleaved reads (Rd) and writes (Wr).]
① Average latency & long-tail latency
② Bandwidth
③ Read latency under a read & write intermixed workload
Average Latency of ULL SSD
[Figure: average latency (μsec) vs. I/O depth (2-16), NVMe vs. ULL. Panels: Sequential Read, Sequential Write. Annotations: 5.1x, 1.8x.]
[Figure: average latency (μsec) of the ULL SSD vs. I/O depth (2-16). Series: SeqRd, RndRd, SeqWr, RndWr.]
“Split-DMA & Super-Channel”
4KB DMA = 8 μs (tR = 3 μs)
tR + tDMA = 11 μs
Split-DMA & Super-Channel
[Figure: the split DMA engine in the Z-SSD splits a 4KB request into two 2KB halves and transfers them over two channels paired as a super channel (channels 0-5 shown).]
tDMA = 4 μs
Reference [1]: Cheong, Woosung et al., “A flash memory controller for 15μs ultra-low-latency SSD using high-speed 3D NAND flash with 3μs read time”, ISSCC, 2018
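The figures on these slides (tR = 3 μs, 8 μs for a 4KB DMA on one channel, 4 μs with split DMA) can be tied together with a small arithmetic sketch. It assumes an idealized model in which the split halves transfer fully in parallel and nothing else adds latency; the function name is hypothetical.

```python
T_R_US = 3.0        # Z-NAND read time (from the slides)
T_DMA_4KB_US = 8.0  # 4KB DMA over a single channel (from the slides)


def flash_read_latency_us(channels_per_request: int = 1) -> float:
    """Idealized flash-level read latency: tR + tDMA / channels.

    Assumes the request is split evenly across the channels of a super
    channel and that the sub-transfers overlap completely.
    """
    if channels_per_request < 1:
        raise ValueError("need at least one channel")
    return T_R_US + T_DMA_4KB_US / channels_per_request


single = flash_read_latency_us(1)  # one channel: 3 + 8 = 11 us
split = flash_read_latency_us(2)   # two 2KB halves: 3 + 4 = 7 us
```

Under this model the DMA term, not tR, dominates a single-channel 4KB read, which is why halving the transfer time matters for a 3 μs flash.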
Long-tail Latency of ULL SSD
[Figure: 99.999th-percentile latency (msec) vs. I/O depth (2-16). Panels: ULL, NVMe. Series: SeqRd, RndRd, SeqWr, RndWr.]
- ULL: “Split DMA” & “Suspend/Resume”
- NVMe: resource conflicts, insufficient internal buffer, internal tasks
Suspend/Resume DMA Technique
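[Figure: timing on Way 1 and Way 2. Without suspend/resume, the read on Way 2 (CMD, tR, data out) waits until the DMA for the write request on Way 1 finishes. With suspend/resume [1], the write DMA is suspended, the read is served, and the write DMA is then resumed.]
Effect: reduces read latency and increases QoS.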
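A minimal sketch of the timing argument behind suspend/resume follows. It assumes a simplified model in which the write DMA is paused for the whole duration of the read and suspension itself costs nothing; all function names and the example durations are hypothetical, not measurements from the talk.

```python
def read_latency_wait_us(arrival_us: float, write_dma_us: float,
                         t_r_us: float, data_out_us: float) -> float:
    """Read latency when the read must wait for the in-flight write DMA."""
    remaining_write = max(0.0, write_dma_us - arrival_us)
    return remaining_write + t_r_us + data_out_us


def read_latency_suspend_us(t_r_us: float, data_out_us: float) -> float:
    """Read latency when the write DMA is suspended as soon as the read arrives."""
    return t_r_us + data_out_us


def write_finish_suspend_us(arrival_us: float, write_dma_us: float,
                            t_r_us: float, data_out_us: float) -> float:
    """Completion time of the write DMA when it is paused for the read."""
    if arrival_us >= write_dma_us:
        return write_dma_us  # the write had already finished
    return write_dma_us + t_r_us + data_out_us


# Hypothetical example: a read arrives 10 us into a 100 us write DMA.
waited = read_latency_wait_us(10.0, 100.0, 3.0, 4.0)
suspended = read_latency_suspend_us(3.0, 4.0)
```

The read no longer inherits the remaining write time, at the price of the write finishing later, which is the trade the slide summarizes as lower read latency and better QoS.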
I/O Interference
Flush operations and metadata writes in the file system are intermixed with user requests.
[Figure: average read latency (μsec) vs. write fraction (0-80%), NVMe SSD vs. ULL SSD. Data labels on the plot: 27, 32, 31, 34, 37.]
- NVMe SSD: significant performance degradation in intermixed workloads, a major performance bottleneck of conventional SSDs.
- How about the ULL SSD? Read latency remains almost constant (“Suspend/resume”, … [1]).
- The ULL SSD can be applied to a real-life storage stack without performance degradation.
Queue Analysis
[Figure: normalized bandwidth vs. I/O depth (x-axis ranges 4-20 and 50-250). Panels: NVMe SSD, ULL SSD. Series: SeqRd, RndRd, SeqWr, RndWr. Plot annotations: "Short write latency", "Too long write latency", "Only 50% of Max BW", "Almost Max BW", "I/O request rescheduling within queue".]
NVMe SSD
- Requires more than 100 entries.
- Light queue mechanisms (e.g., NCQ) are not sufficient.
- Requires a rich queue mechanism.
ULL SSD
- Only 6 entries required.
- Well aligned with light queue mechanisms (e.g., NCQ).
- NVMe needs to be lightened.
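One way to reason about why a lower-latency device saturates with fewer queue entries is Little's law (outstanding requests = throughput × latency). The sketch below applies it with integer arithmetic; the bandwidth is the NVMe read figure quoted earlier in the deck, while both latency values are hypothetical and chosen only to show the trend, not taken from the measurements.

```python
def required_queue_depth(bandwidth_bytes_per_s: int, latency_us: int,
                         request_bytes: int = 4096) -> int:
    """Outstanding requests needed to sustain a bandwidth (Little's law).

    depth = ceil(IOPS * latency), with IOPS = bandwidth / request size.
    Integer arithmetic avoids floating-point rounding at the boundary.
    """
    if min(bandwidth_bytes_per_s, latency_us, request_bytes) <= 0:
        raise ValueError("all arguments must be positive")
    numerator = bandwidth_bytes_per_s * latency_us
    denominator = request_bytes * 1_000_000
    return -(-numerator // denominator)  # ceiling division


# Hypothetical per-request latencies for a fixed 2.4 GB/s target.
slow_device_depth = required_queue_depth(2_400_000_000, latency_us=100)
fast_device_depth = required_queue_depth(2_400_000_000, latency_us=10)
```

The required depth scales linearly with latency, so a tenfold latency reduction cuts the entries needed by about the same factor in this model.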
Polling vs. Interrupt
Two different I/O completion methods: interrupt and polling.
- Systems with short waiting times adopt a polling-based waiting strategy, even though it incurs significant overhead.
- For example, spin locks and network message passing apply a polling-based waiting strategy.
- Polling is currently implemented in the NVMe storage stack.
- Is it really needed for current NVMe SSDs?
Interrupt / Polling
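Interrupt.
- Submit request, CS (context switch), sleep during command execution.
- ① SSD finishes, ② NVMe controller raises IRQ, ③ host wakes.
- ISR, CS, complete request.
Polling.
- Submit request, poll the SSD ("Done??") during command execution, complete request.
- Gain: no CS or ISR on the completion path.
- The shorter the device latency, the larger the portion of total latency this gain represents.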
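The relationship between device latency and the benefit of polling can be sketched with a simple additive model. It assumes the interrupt path pays two context switches plus the ISR on top of the device time and that polling pays none of them; the function names and the overhead values in the example are hypothetical.

```python
def interrupt_latency_us(device_us: float, context_switch_us: float,
                         isr_us: float) -> float:
    """Submit -> CS (sleep) -> device executes -> IRQ -> ISR -> CS (wake)."""
    return device_us + 2 * context_switch_us + isr_us


def polling_latency_us(device_us: float) -> float:
    """Submit -> spin until the device is done; no CS or ISR in this model."""
    return device_us


def polling_gain_fraction(device_us: float, context_switch_us: float,
                          isr_us: float) -> float:
    """Fraction of the interrupt-path latency that polling removes."""
    irq = interrupt_latency_us(device_us, context_switch_us, isr_us)
    return (irq - polling_latency_us(device_us)) / irq


# Hypothetical overheads: 2 us per context switch, 1 us for the ISR.
gain_fast = polling_gain_fraction(10.0, 2.0, 1.0)   # low-latency device
gain_slow = polling_gain_fraction(100.0, 2.0, 1.0)  # slower device
```

With a fixed software overhead, the removed fraction grows as device time shrinks, which matches the "shorter latency, larger portion" annotation.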
Overall Performance
[Figure: average latency (μsec) vs. request size (4KB, 8KB, 16KB, 32KB), Interrupt vs. Polling. Panels: NVMe SSD read, NVMe SSD write, ULL SSD read, ULL SSD write.]
NVMe SSD
- Polling decreases latency by only 0.9% (read) and 8.2% (write).
- Polling-based I/O services are not effective for current NVMe SSDs.
ULL SSD
- Does polling-based I/O work on the ULL SSD?
- Polling decreases latency by 7.5% (read) and 13.2% (write).
- Future lower-latency SSDs can achieve remarkable performance improvement with a polling-based I/O completion routine.
System Challenges
[Figure: memory bound (%) vs. request size (4KB, 8KB, 16KB, 32KB). Panels: Interrupt, Polling.]
[Figure: CPU utilization (%) over time. Panels: Interrupt, Polling. With polling, the core is always working.]
[Figure: 99.999% latency (msec) of ULL writes vs. request size (4KB, 8KB, 16KB, 32KB), Interrupt vs. Polling.]
[Figure: polling path on the host. Labels: CPU Core 0 to Core n, SQ and CQ (head/tail), NVMe controller memory space (SQ tail doorbell, CQ head doorbell), "Check CQ update", "Spin lock for head/tail pointer synchronization".]
Polling-based I/O services incur significant system-level overheads, which need to be addressed.
- High CPU utilization: polling does not release the CPU.
- Frequent memory access: high memory bound.
- Memory bound = fraction of slots where the pipeline could be stalled due to load/store.
- High memory bound = frequent memory access.
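The memory-access overhead of polling can be illustrated with a toy, deterministic simulation. It is not the kernel's NVMe driver: the `SimulatedDevice` class, its tick-based timing, and the idea of counting one completion-queue load per poll iteration are all assumptions made for illustration.

```python
class SimulatedDevice:
    """Toy device that completes one command after `busy_ticks` ticks."""

    def __init__(self, busy_ticks: int) -> None:
        if busy_ticks < 0:
            raise ValueError("busy_ticks must be non-negative")
        self.remaining = busy_ticks
        self.cq_done = busy_ticks == 0  # stands in for a posted CQ entry

    def tick(self) -> None:
        """Advance simulated time by one step."""
        if self.remaining > 0:
            self.remaining -= 1
            if self.remaining == 0:
                self.cq_done = True


def poll_for_completion(dev: SimulatedDevice) -> int:
    """Spin on the completion flag; return the number of CQ loads issued."""
    loads = 0
    while True:
        loads += 1          # one load of the CQ entry per iteration
        if dev.cq_done:
            return loads
        dev.tick()          # the core stays busy while time passes


def interrupt_for_completion(dev: SimulatedDevice) -> int:
    """Sleep until the device finishes; return the number of CQ loads issued."""
    while not dev.cq_done:
        dev.tick()          # the core is released; no loads are counted
    return 1                # a single CQ load in the interrupt handler


polled_loads = poll_for_completion(SimulatedDevice(100))
interrupt_loads = interrupt_for_completion(SimulatedDevice(100))
```

In this model the number of loads under polling grows with the device's busy time, while the interrupt path issues a single load, mirroring the high CPU utilization and memory-bound behavior reported above.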
Conclusion
Motivation. Ultra-low latency (ULL) SSDs are emerging, but they have not been characterized so far.
Contributions.
- Characterizing the performance behaviors of ULL SSD.
- Studying several system-level challenges of the current storage stack.
Key Insights.
- ULL SSDs can be effectively applied to real-life storage stack. (RW mixed)
- NVMe queue mechanisms are required to be optimized for ULL SSDs.
- Polling-based I/O completion routine isn’t effective for current NVMe SSDs.
Thank you
Q&A