Architecture and Code Optimization (ARC) Laboratory @ SNU
USENIX ATC 2019, Renton, WA, USA
July 12th, 2019
Practical Erase Suspension for Modern Low-latency SSDs
Shine Kim†§ Jonghyun Bae† Hakbeom Jang* Wenjing Jin† Jeonghun Gong†
Seoungyeon Lee§ Tae Jun Ham† Jae W. Lee†
†Seoul National University §Samsung Electronics *Sungkyunkwan University
Today’s NAND flash-based SSDs in datacenters
• NAND flash-based SSDs have become a de facto standard in datacenters
− Superior throughput, low average latency, and relatively low price
o PCIe Gen 3 x8 lane NVMe SSD [1]: Seq. Read → 6300MB/s
o Low Latency SSD Controller with LL-NAND [2]: 4KB Random Read QD1 → 15µs
o 3D NAND & QLC-based SSD: → $0.1/GB [3]
[Figure: cell states by bit density: SLC (0, 1), MLC (00 to 11), TLC (000 to 111), QLC (0000 to 1111); planar NAND vs. 3D NAND]
[1] https://www.samsung.com/semiconductor/ssd/enterprise-ssd/
[2] W. Cheong et al., A flash memory controller for 15us ULL-SSD using high-speed 3D NAND flash with 3us read time, IEEE ISSCC 2018
[3] www.amazon.com: SAMSUNG 860QVO 1TB
Read tail behavior of NAND flash-based SSD
• Challenge: Despite low average response time, read tail latency can be very long
[Figure: Read latency distribution of a PCIe 3 x4 NVMe low-latency SSD, 4KB, Queue Depth 16, 70% reads and 30% writes; y-axis: Latency (μs), log scale from 10 to 10000]
Percentile     99%    99.9%   99.99%   99.999%   Maximum
Latency (μs)   433    4948    5932     10814     11600
Average: 160µs
Maximum read tail latency: AS-IS 11ms, TO-BE sub-200µs (competitive with emerging NVM-based SSDs)
Motivation: Two major sources of long read tail latency
• Garbage collection (GC) (e.g., 100ms → 10ms)
− GC-induced read tail latency has been optimized by sophisticated GC schemes
• Block erase operation (e.g., 10ms/block)
− Has become the dominant source of read tail latency
[Timeline: read request arrives → Read (30µs); read latency: 30µs]
[Timeline: read request arrives during an Erase (10,000µs) → the Read (30µs) starts only after the erase completes; read latency: remaining erase time + 30µs]
− Erase suspension [1] can effectively reduce the read latency caused by an ongoing block erase
[Timeline: read request arrives during an Erase → Erase suspend (100µs) → Read (30µs) → Erase resume → Erase continues; read latency: 130µs]
[1] Wu et al., Reducing SSD Read Latency via NAND Flash Program and Erase Suspension, USENIX FAST 2012
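The read-latency figures on these timelines reduce to simple arithmetic. The sketch below is illustrative only: the constants are the example values shown on the slides (30µs read, 10,000µs block erase, 100µs suspend), and the function names are hypothetical.

```python
# Example values from the slides; all names here are hypothetical.
READ_US = 30          # NAND read time
ERASE_US = 10_000     # full block erase time
SUSPEND_US = 100      # time needed to suspend an ongoing erase


def read_latency_no_suspension(elapsed_erase_us: int) -> int:
    """The read waits for the rest of the erase, then is served."""
    remaining = max(ERASE_US - elapsed_erase_us, 0)
    return remaining + READ_US


def read_latency_with_suspension() -> int:
    """The read waits only for the suspend operation, then is served."""
    return SUSPEND_US + READ_US


print(read_latency_no_suspension(0))   # 10030: read arrives as the erase starts
print(read_latency_with_suspension())  # 130
```

The worst case without suspension is a read that arrives just as the erase begins; with suspension the latency no longer depends on the arrival time.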
However, existing erase suspension can cause write starvation and NAND reliability problems!
Our contributions: Practical erase suspension
• Observation
− Modern SSDs perform an erase operation with multiple discrete pulses, which provide well-aligned safe points for suspending an ongoing erase
[Figure: voltage vs. time (ms) for steps 1 to 5, showing erase pulses and verify pulses]
• We propose three practical erase suspension schemes
− Immediate erase suspension (I-ES): Aborts the erase immediately and restarts from the previous safe point
− Deferred erase suspension (D-ES): Waits until the current erase pulse is finished
− Timeout-based erase suspension (T-ES): Adaptively switches between I-ES and D-ES
[Figure: voltage vs. time (ms) for steps 1 to 5, with a read request arriving in the middle of an erase pulse]
Prior work: Problems with existing erase suspension [1] (1)
• Problem #1: Write starvation
− With bursty reads
[Timeline: Erase (1,000µs) → Read #1 arrived → Erase suspend (100µs) → Read #1 (30µs) → Erase resume → Read #2 arrived → Erase suspend (100µs) → Read #2 (30µs) → … ∞]
1) The remaining erase pulses (9ms) may fail to make progress because of incoming reads
Erase (and write) starvation!
[1] Wu et al., Reducing SSD Read Latency via NAND Flash Program and Erase Suspension, USENIX FAST 2012
Prior work: Problems with existing erase suspension [1] (2)
• Problem #2: Endurance degradation
− With bursty reads
[Timeline: Erase (1,000µs) → Read #1 arrived → Erase suspend (100µs) → Read #1 (30µs) → Erase resume → Read #2 arrived → Erase suspend (100µs) → Read #2 (30µs) → … ∞]
2) Erase suspension/resumption causes additional stress to NAND
Over-erased NAND blocks → increased uncorrectable bit error rate (UBER)
Endurance degradation of SSD!
[1] Wu et al., Reducing SSD Read Latency via NAND Flash Program and Erase Suspension, USENIX FAST 2012
Practical erase suspension: Background
• NAND erase operation
− Pulls electrons out of the floating gate by applying a very high voltage
• Incremental Step Pulse Erasing (ISPE)
− Standard technique to minimize damage to NAND cells
− Applies several discrete pulses (of ~1ms) with increasingly higher nominal voltages
[Figure: ERS voltage vs. time (ms) for steps 1 to N; erase pulse: erases cells in a NAND block; verify pulse: senses which cells are erased]
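To make the notion of safe points concrete, here is a toy model of an ISPE schedule. It is a sketch under stated assumptions, not the device's firmware: it uses a fixed number of loops (5 loops of 1ms, the values from the evaluation setup), whereas a real erase ends once the verify pulse reports success. The voltage is modeled only as a relative level, and all names are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ErsLoop:
    index: int          # 1-based loop number
    start_us: int       # time at which the erase pulse begins
    end_us: int         # time at which the verify pulse ends (a safe point)
    voltage_level: int  # relative level only; rises by one per loop


def ispe_schedule(loops: int = 5, loop_us: int = 1000) -> list[ErsLoop]:
    """Build a fixed-length schedule of erase+verify loops (toy model)."""
    return [
        ErsLoop(index=i + 1, start_us=i * loop_us,
                end_us=(i + 1) * loop_us, voltage_level=i + 1)
        for i in range(loops)
    ]


def safe_points(schedule: list[ErsLoop]) -> list[int]:
    """Loop boundaries, where an erase can be suspended without losing work."""
    return [loop.end_us for loop in schedule]


print(safe_points(ispe_schedule()))  # [1000, 2000, 3000, 4000, 5000]
```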
Practical erase suspension: Immediate erase suspension (I-ES)
• I-ES operations
− Suspend: Immediately terminates the ongoing erase step (taking ~100µs)
− Resume: Restarts the suspended erase pulse from the beginning
[Figure: ERS voltage vs. time (ms) under I-ES: (1) a read request arrives during the Nth ERS loop, (2) erase suspend, (3) READ, (4) erase resume; the Nth ERS loop is restarted from the beginning and is followed by the (N+1)th ERS loop]
− Does not guarantee forward progress of the erase operation → write starvation problem!
[Figure: Write tail latency, FIO Thread #1: 128KB Read QD1, Thread #2: 128KB Write QD1; Baseline: 1ms, Original ES: >10s, I-ES: >10s]
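The starvation behavior of I-ES can be reproduced with a very small timing model. This is a hedged sketch, not MQSim: it assumes reads arrive at a fixed interval, uses the slide values (1ms per erase loop, 5 loops, 100µs suspend, 30µs read), and the function name and the simulation horizon are hypothetical.

```python
def ies_erase_completion_us(read_interval_us, loops=5, loop_us=1000,
                            suspend_us=100, read_us=30,
                            horizon_us=1_000_000):
    """Return when the erase finishes under I-ES, or None if it does not
    finish within the simulated horizon (i.e., it is starved)."""
    now = 0
    done_loops = 0
    next_read = read_interval_us  # arrival time of the next read
    while done_loops < loops:
        if now > horizon_us:
            return None
        loop_end = now + loop_us
        if next_read >= loop_end:
            # No read interrupts this loop, so it completes.
            now = loop_end
            done_loops += 1
        else:
            # A read arrives mid-loop: abort the pulse, serve the read,
            # and restart the same loop from the beginning.
            now = max(now, next_read) + suspend_us + read_us
            next_read += read_interval_us
    return now


print(ies_erase_completion_us(2000))  # sparse reads: the erase still finishes
print(ies_erase_completion_us(500))   # reads every 0.5ms: None (starved)
```

In this model the erase makes progress only when the gap between two reads is at least one full loop, which is why a steady stream of reads closer together than ~1ms starves the erase and the writes waiting on it.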
Practical erase suspension: Deferred erase suspension (D-ES)
• D-ES operations
− Suspend: Waits until the current erase step is finished (erase and verify pulse)
− Resume: Starts the next erase pulse
− No erase or write starvation problem, but a longer read tail (i.e., the length of a single step, ~1ms)
[Figure: ERS voltage vs. time (ms) under D-ES: (1) a read request arrives during the Nth ERS loop, (2) erase suspend after the loop's erase and verify pulses finish, (3) READ, (4) erase resume with the (N+1)th ERS loop]
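The D-ES read penalty is the time left in the current erase step. The sketch below works that out under simplifying assumptions: erase steps are exactly 1ms, the read takes 30µs (the value from the motivation timeline), and no extra suspend overhead is charged at a step boundary. The function name is hypothetical.

```python
def des_read_latency_us(arrival_us: int, loop_us: int = 1000,
                        read_us: int = 30) -> int:
    """Read latency under D-ES for a read arriving arrival_us after the
    erase started (assumed to fall within the erase operation)."""
    into_loop = arrival_us % loop_us
    # Time until the current step ends; zero if already on a boundary.
    wait = (loop_us - into_loop) % loop_us
    return wait + read_us


print(des_read_latency_us(1))     # 1029: just after a step started (worst case)
print(des_read_latency_us(999))   # 31: just before the step ends
```

The erase always completes the step it is in, so forward progress is guaranteed; the cost is a read tail bounded by roughly one step length.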
Practical erase suspension: Timeout-based erase suspension (T-ES)
• T-ES operations
1. Performs I-ES until erase operation is suspended for a timeout period (N ms)
2. If a timeout happens, switches to D-ES to avoid erase and write starvation
• Choice of erase timeout period (N)
− Provides an effective control knob for read/write latency
− Trades maximum write tail latency for reduced read latency
Maximum write latency ≤ 100ms
Ex) N = 64ms, and GC write latency = 35ms
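A minimal sketch of the T-ES switching rule follows. It rests on two assumptions that the slides do not spell out: the timeout is compared against the accumulated time an erase has spent suspended, and the example is read as "timeout plus GC write latency must fit within the maximum write latency" (64ms + 35ms = 99ms ≤ 100ms). The class and function names are hypothetical.

```python
class TimeoutErasePolicy:
    """Chooses between I-ES and D-ES for a single erase operation."""

    def __init__(self, timeout_ms: float = 64.0):
        self.timeout_ms = timeout_ms
        self.suspended_ms = 0.0  # total time this erase has been suspended

    def mode(self) -> str:
        # Favor reads (I-ES) until the timeout is used up, then guarantee
        # forward progress for the erase (D-ES).
        return "I-ES" if self.suspended_ms < self.timeout_ms else "D-ES"

    def record_suspension(self, duration_ms: float) -> None:
        self.suspended_ms += duration_ms


def within_write_budget(timeout_ms: float, gc_write_ms: float,
                        budget_ms: float = 100.0) -> bool:
    """Assumed reading of the example: N + GC write latency <= budget."""
    return timeout_ms + gc_write_ms <= budget_ms


policy = TimeoutErasePolicy(timeout_ms=64)
policy.record_suspension(60)
print(policy.mode())                 # I-ES
policy.record_suspension(4)
print(policy.mode())                 # D-ES
print(within_write_budget(64, 35))   # True
```

A larger N keeps the device in I-ES longer, which lowers read latency at the cost of a higher worst-case write latency.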
Evaluation: Methodology
• NVMe SSD simulator: MQSim [1]
• Benchmarks: Flexible I/O Tester (FIO), Aerospike Certification Tool (ACT), and TPC-C
• Comparison of six designs:
− Baseline (no suspension) and Ideal-ES (erase suspension with zero penalty)
− Erase suspension (ES) [2]
− Immediate-ES (I-ES), Deferred-ES (D-ES), and Timeout-based-ES (T-ES)
PCIe Gen 3 x4 lane, 240GB, NVMe SSD device
NAND configurations   4 channels, 4 chips/channel, 1 die/chip
FTL schemes           Page mapping, preemptible GC
NAND latency          Read: 3μs, Program: 100μs, Block erase: 1ms per step (5 steps), Erase suspension penalty: 100μs, T-ES timeout: 64ms
[1] Tavakkol et al., MQSim: A framework for enabling realistic studies of modern multi-queue SSD devices, USENIX FAST 2018
[2] Wu et al., Reducing SSD Read Latency via NAND Flash Program and Erase Suspension, USENIX FAST 2012
Evaluation: Flexible I/O Tester (FIO)
• FIO random test
− Read 70%, Write 30%, 4KB, QD 16
[Figure: (a) Read tail latency and (b) Write tail latency; y-axis: Latency (μs), log scale; x-axis: 99.9%, 99.99%, 99.999%, Max.; series: Baseline, ES, I-ES, D-ES, T-ES, Ideal-ES]
o Baseline → ~5ms (entire erase operation)
o D-ES → ~1ms (single erase pulse)
o ES, I-ES, T-ES → ~100µs (suspension latency)
o I-ES, T-ES → long write latency due to repeated erase suspension
Evaluation: Aerospike Certification Tool (ACT)
• ACT: Database benchmark
− Consists of three threads, and gradually increases the I/O rate in integer multiples
ACT workload (base rate):
o T1: 2K small (1.5KB, QD1) reads/s
o T2: 24 large (128KB, QD1) reads/s
o T3: 24 large (128KB, QD1) writes/s
ACT workload x 4:
o T1: 8K small reads/s
o T2: 96 large reads/s
o T3: 96 large writes/s
Test Item          Evaluation Criteria                SSD #1   SSD #2
Performance Test   i) 95% of I/O < 1ms                10X      8X
                   ii) 99% of I/O < 8ms
                   iii) 99.9% of I/O < 64ms
Stress Test        iv) I/O latency < request period   2X       10X
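The workload scaling and the performance-test criteria above can be written down directly. This is an illustrative sketch, not part of ACT itself: it takes "2K" to mean 2000, reads each criterion as a bound on the corresponding latency percentile, and uses hypothetical names throughout.

```python
# Base (1x) ACT rates from the slide, in operations per second.
BASE_RATES = {
    "T1 small reads/s": 2000,
    "T2 large reads/s": 24,
    "T3 large writes/s": 24,
}


def scaled_workload(multiplier: int) -> dict[str, int]:
    """Scale every thread's rate by an integer workload multiplier."""
    return {name: rate * multiplier for name, rate in BASE_RATES.items()}


def passes_performance_test(p95_ms: float, p99_ms: float,
                            p999_ms: float) -> bool:
    """Criteria i) to iii): 95% < 1ms, 99% < 8ms, 99.9% < 64ms."""
    return p95_ms < 1 and p99_ms < 8 and p999_ms < 64


print(scaled_workload(4))
# {'T1 small reads/s': 8000, 'T2 large reads/s': 96, 'T3 large writes/s': 96}
```

The rating reported for a device (for example 10X) would then be the largest multiplier at which the criteria still hold.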
Evaluation: Aerospike Certification Tool (ACT)
• ACT test results
− Baseline shows a poor performance test result (14x) due to the long tail latency of read requests
− ES and I-ES suffer from the write starvation problem (22x)
− D-ES and T-ES demonstrate good results (30x) for both stress and performance tests
[Figure: (a) Read tail latency and (b) Write tail latency at a 30x workload multiplier; y-axis: Latency (μs), log scale; x-axis: 95%, 99%, 99.9%; series: Baseline, ES, I-ES, D-ES, T-ES, Ideal-ES; some entries are marked "Did not finish"]
Evaluation: Transaction processing benchmark (TPC-C)
• TPC-C from SNIA
[Figure: (a) Read tail latency and (b) Write tail latency; y-axis: Latency (μs), log scale; x-axis: 99.9%, 99.99%, 99.999%, Max.; series: Baseline, ES, I-ES, D-ES, T-ES, Ideal-ES; some entries are marked "Did not finish"]
o Baseline → ~5ms (entire operation)
o D-ES, T-ES → ~1ms (single erase pulse)
o ES, I-ES → failure by write command timeout
o T-ES → Timeout (64ms) + GC latency (24ms)
Conclusion
• Practical erase suspension harnesses the full potential of NAND flash-based SSDs
− Minimizes the impact of erase operation on read tail latency
− Achieves very low read tail latency without write starvation and endurance degradation
[Figure: Read tail latency (μs), Baseline vs. Practical Erase Suspension: FIO Random Test 28X, Aerospike Database 10X, TPC-C 5X]
Thank You!
Our simulator is available at https://github.com/SNU-ARC/MQSim-Practical-ERS-SUS