Enabling Cost-Effective Flash-Based Caching with an Array of Commodity SSDs
Yongseok Oh* (SK Telecom), Eunjae Lee, Choulseung Hyun (University of Seoul),
Jongmoo Choi (Dankook University), Donghee Lee (University of Seoul),
Sam H. Noh (UNIST)    *Work done while a student at the University of Seoul
Unreliable SSD-Based Cache

[Figure: datacenter environment; an SSD cache (1 GB/s) in front of backend storage (8 MB/s)]

• Cache plays a CRITICAL ROLE in enhancing performance
• However, Flash is still UNRELIABLE media [Liu FAST'12][Jeong FAST'14][Cai DSN'15]
  – Lifetime, read disturb, retention, and so on
• Write-back caching incurs a RISKY scenario [Qin ATC'14]
• Re-warming up takes HOURS to DAYS [Zhang FAST'13]
Our Idea: Take Advantage of RAID

[Figure: datacenter environment; an SSD RAID used as a cache (1 GB/s) in front of backend storage (8 MB/s), with failure recovery via parity]

• SSD RAID as a Cache provides:
  – High performance
  – High reliability
  – Online replacement
  – Performance scaling
Our Contributions

• This is the first study to exploit SSD RAID as a cache
• We build two fast prototypes in Linux
  – Bcache and Flashcache configured with Linux RAID
• We propose a new solution, namely SRC (SSD RAID as a Cache)
  – Borrow LFS and RAID techniques
  – Propose optimization schemes for performance and reliability
• We evaluate SRC against other solutions
  – Cost-effective analysis
  – SATA SSDs vs NVMe SSD
Outline of Contents

• Introduction
• Existing Solutions (Bcache and Flashcache)
• SRC (SSD RAID as a Cache)
• Performance Evaluation
• Cost-Effective Analysis (SATA vs NVMe SSD)
• Conclusion
Open Source Solutions

• Flashcache
  – Read-optimized layout (e.g., hash bucket)
• Bcache
  – Write-optimized layout (e.g., log-based B+-tree)
• They provide several options
  – Write-through and write-back policies
  – FIFO and LRU policies
• They don't provide the RAID feature
  – Only a single SSD can be employed (as of 2014)
SSD Caching on RAID (Bcache / Flashcache)

[Figure: testbed; benchmark workers connect over iSCSI 1 Gbps to an Ubuntu 13.10 server (kernel 3.11.7, Intel Xeon E5-2640, 32 GB RAM); a Linux MD RAID volume of 4 x Samsung 840 Pro SSDs is used as the caching space; backend storage is 8 TB MD RAID10 (8 x 7.2K 2 TB disks)]
RAID is Beneficial for SSD Caching… NOT?

• FIO benchmark (4 KB random)
  – Single SSD caching: SATA SSD 128 GB
  – RAID-4/-5 caching: 4 x SATA SSD 128 GB
• Single SSD caching is comparable or faster
  – RAID caching reduces bandwidth by 80% (Bcache) and 17% (FlashCache)

[Figure: bandwidth (MB/s, 0 to 120) of Single SSD vs RAID-4 (4 SSDs) and RAID-5 (4 SSDs) caching under Bcache and FlashCache]
Analysis of I/O Caching Path

[Figure: a 4 KB random write at the cache layer triggers both data and metadata updates; RAID-4/-5 reads the old data D and parity P and writes the new D' and P' to the SSDs, whose FTLs later garbage-collect the invalidated blocks]
A single write request requires 8 I/Os, plus possibly numerous FTL GCs.
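The eight I/Os come from the RAID small-write penalty: an in-place 4 KB update must read the old data and parity before writing the new versions, and the cache layer pays this cost twice, once for the cached data and once for its on-SSD metadata. A minimal sketch of the parity update (our own illustration, not the paper's code):

```python
# Illustrative model of the RAID-4/-5 small-write penalty (not SRC code).

def xor_blocks(a: bytes, b: bytes) -> bytes:
    """Byte-wise XOR of two equally sized blocks."""
    return bytes(x ^ y for x, y in zip(a, b))

def small_write(new_data: bytes, old_data: bytes, old_parity: bytes):
    """In-place RAID-5 update of one block: P' = P xor D xor D'.

    Costs 2 reads (old D, old P) + 2 writes (new D', new P').
    """
    new_parity = xor_blocks(xor_blocks(old_parity, old_data), new_data)
    return new_parity, 4  # I/O count for this one block update

D, D2 = bytes([0x0F]) * 4096, bytes([0xF0]) * 4096
P = bytes([0xFF]) * 4096
P2, ios = small_write(D2, D, P)
assert P2 == xor_blocks(xor_blocks(P, D), D2)

# The cache layer performs this update once for the data block AND once
# for its metadata, so a single 4 KB cached write costs 2 * 4 = 8 I/Os.
total_ios = ios * 2
print(total_ios)  # 8
```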
Our Approach

1. Log-structured layout
   – We pack caching data, metadata, and parity together into a segment, like LFS
   Ø Reduces I/O amplification
2. Large Writes
   – We make large writes to SSDs (e.g., 256 MB)
   Ø Remedies FTL GC cost in SSDs
Large Write Size for SSD Caching

• Large Writes reduce internal GCs [Min FAST'12][Li ATC'13][Tang FAST'14]
  – e.g., due to the erase-before-write property of NAND Flash
• 256 MB writes achieve maximum performance in our case
  – Regardless of the over-provisioned space (OPS) used by GC
• The Large Write is also used as the cache replacement unit

[Figure: bandwidth (MB/s, 0 to 400) vs. write size (1 MB to 1024 MB) for the Company X SSD with 20% and 40% OPS; bandwidth saturates at 256 MB]

Model (Year) | Large Write
Company X ('13) | 256 MB
Company Y ('14) | 512 MB
Company Z ('15) | 1024 MB
Ø The size continues to increase
Outline of Contents

• Introduction
• Existing Solutions (Bcache and Flashcache)
• SRC (SSD RAID as a Cache)
• Performance Evaluation
• Cost-Effective Analysis (SATA vs NVMe SSD)
• Conclusion
SRC Architecture

[Figure: the SRC layer sits between I/O requests and the devices; a Cache Manager with dirty and clean segment buffers serves hits from the SSDs, a RAID Manager stripes across the SSDs, and misses go to primary storage (NAS/SAN disks)]

Key features (detailed in the paper):
• Segment Group layout
• LFS layout
• Selective GC
• No parity for clean data
• Erasure coding
  – e.g., RAID-4, -5, -6; RAID-5 is the default
• Replacement
  – FIFO, Greedy
Large Write Aware Segment Group Layout

[Figure: a log of Segment Groups striped across SSD0 to SSD3; segment I/O is directed at the active SG]

• SG (Segment Group) = Large Write x # of SSDs
  – 1 GB = 256 MB x 4; the Large Write is the unit written to each SSD
• Segment = Max Transfer Size x # of SSDs
  – 2 MB = 512 KB x 4
Ø With Large Writes, sustained performance is satisfactory
Caching Process Based on the Log-Structured Layout

[Figure: the SRC layer assembles segments in its buffers and appends them to the active SG across SSD0 to SSD3]

• Dirty (Write) Caching
  1. Collect dirty data in the dirty buffer
  2. Embed metadata (LBA, generation number, checksum)
  3. Calculate parity (XOR)
  4. Submit segment I/O with a flush
• Clean (Read) Caching
  – Same path through the clean buffer, but no parity for clean data
Ø Multiple random writes are aggregated into one large write
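The four steps above can be sketched as follows. This is our own simplified illustration; the block size, metadata encoding, and names are assumptions, not SRC's actual on-SSD format:

```python
# Sketch of assembling one SRC-style segment: data + metadata + parity
# packed together and appended to the log (illustrative, not SRC code).
import json
import zlib
from functools import reduce

BLOCK = 4096  # assumed cached-block size

def xor(a: bytes, b: bytes) -> bytes:
    return bytes(x ^ y for x, y in zip(a, b))

def build_segment(dirty: list, generation: int):
    """Steps 1-3: collect dirty (lba, block) pairs, embed a metadata
    block (LBAs, generation number, per-block checksums), and compute
    an XOR parity unit over data + metadata."""
    blocks = [blk for _, blk in dirty]
    meta = {
        "lbas": [lba for lba, _ in dirty],
        "gen": generation,
        "checksums": [zlib.crc32(blk) for blk in blocks],
    }
    meta_block = json.dumps(meta).encode().ljust(BLOCK, b"\0")
    units = blocks + [meta_block]
    parity = reduce(xor, units)
    # Step 4 would submit units + parity as one segment I/O with a flush.
    return units, parity

units, parity = build_segment([(10, bytes([1]) * BLOCK),
                               (99, bytes([2]) * BLOCK)], generation=7)
# Any single lost unit is recoverable: XOR the parity with the survivors.
assert reduce(xor, units[1:], parity) == units[0]
```

Because data, metadata, and parity travel together in one log append, the cache avoids the read-modify-write parity update analyzed earlier.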
Free Space Reclamation: Selective GC

[Figure: victim SGs holding valid and invalid cached blocks across SSD0 to SSD3]

• S2S (SSD-to-SSD) GC, if current utilization < uMAX (a pre-defined value)
  ✓ Re-insert valid cached blocks into less-utilized space (keeps HOT data)
• S2D (SSD-to-Disk) GC, if current utilization >= uMAX
  ✓ Destage valid cached blocks from more-utilized space to disk (evicts COLD data)
Ø A larger SG may be under-utilized, resulting in a low hit ratio
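A minimal sketch of the selection logic (our own simplification; SRC's victim selection and bookkeeping are richer than this):

```python
# Selective GC decision: S2S vs S2D, keyed on cache utilization.
U_MAX = 0.90  # pre-defined threshold; 90% worked best in the paper's tests

def gc_path(cache_utilization: float) -> str:
    """Choose where a victim SG's valid blocks go."""
    if cache_utilization < U_MAX:
        # S2S: the cache still has room, so re-insert valid (likely HOT)
        # blocks back into the log on the SSDs.
        return "S2S"
    # S2D: the cache is nearly full, so destage valid (likely COLD)
    # blocks to the backend disks.
    return "S2D"

assert gc_path(0.50) == "S2S"
assert gc_path(0.95) == "S2D"
```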
Outline of Contents

• Introduction
• Existing Solutions (Bcache and Flashcache)
• SRC (SSD RAID as a Cache)
• Performance Evaluation
• Cost-Effective Analysis (SATA vs NVMe SSD)
• Conclusion
Evaluation Setup (SRC / Bcache / Flashcache)

[Figure: trace-replay workers connect over iSCSI 1 Gbps to an Ubuntu 13.10 server (kernel 3.11.7, Intel Xeon E5-2640, 32 GB RAM); 4 x Samsung 840 Pro SSDs form the SSD RAID cache; backend storage is 8 TB MD RAID10 (8 x 7.2K 2 TB disks)]

• We developed SRC based on DM-Writeboost
• We implemented a trace replay tool to mimic VM-like workloads
Realistic Workloads

• Several block-level traces
  – From the SNIA and UMass repositories
• Read group (5 traces): read intensive
  – ts0, usr0, …, msn0
• Write group (10 traces): write intensive
  – prxy0, exch9, …, src22
• Mixed group (7 traces): read and write mixed
  – rsrch0, hm0, …, prn0
Comparison with Bcache and Flashcache

[Figure: throughput (MB/s, 0 to 800) for the Write, Mix, and Read groups with SRC, Bcache, and FlashCache, each configured with RAID-5; SRC leads by 2.3X to 2.7X across the groups]

• SRC outperforms Bcache and Flashcache
  – By up to 2.7X
• Flashcache is better than Bcache because it issues no flushes
  – Bcache issues flushes for metadata durability
Interesting Results in the Paper

• SEL-GC outperforms traditional destaging by 60%
  – Hot data are re-inserted back into the SSDs
• SEL-GC performs best at uMAX = 90%
• No parity for clean data brings a 17% improvement
• FIFO is better for the Write-intensive workloads
  – Greedy is better for the Read-intensive workloads
• RAID-5 performance degrades by 27%
  – Compared to that of RAID-0
• Please refer to our paper for more information
Outline of Contents

• Introduction
• Existing Solutions (Bcache and Flashcache)
• SRC (SSD RAID as a Cache)
• Performance Evaluation
• Cost-Effective Analysis (SATA vs NVMe SSD)
• Conclusion
SATA and NVMe Based SRCs

• MLC- or TLC-based SATA SSDs: 4 x 128 GB (including the parity SSD)
• vs. an MLC-based NVMe card: 1 x 400 GB (no parity SSD)

Vendor | NAND | Interface | Capacity | Cost
Company A | MLC | SATA 3.0 | 4 x 128 GB | $418
Company A | TLC | SATA 3.0 | 4 x 120 GB | $272
Company B | MLC | SATA 3.0 | 4 x 128 GB | $374
Company B | TLC | SATA 3.0 | 4 x 128 GB | $222
Company C | MLC | NVMe 1.0 | 1 x 400 GB | $496

Ø The SATA arrays are cheaper; the NVMe card is the most expensive
Estimation of Cost-Effectiveness

• Performance per dollar ($)
  = (MB/s) / Total price ($)
• Lifetime (days) per dollar ($)
  = Expected days to live / Total price ($)
  – Expected days to live = (Capacity x P/E cycles) / (WAF x Wday) [Jeong FAST'14]
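For concreteness, the two metrics can be computed as below. The throughput, P/E cycles, WAF, and daily write volume are made-up placeholder numbers; only the $222 total price comes from the vendor table:

```python
# The two cost-effectiveness metrics, with placeholder inputs.

def perf_per_dollar(mb_per_s: float, total_price: float) -> float:
    """(MB/s) / $"""
    return mb_per_s / total_price

def lifetime_per_dollar(capacity_gb: float, pe_cycles: float,
                        waf: float, wday_gb: float,
                        total_price: float) -> float:
    """Lifetime (days) / $, where expected days to live is estimated as
    (capacity * P/E cycles) / (WAF * writes per day) [Jeong FAST'14]."""
    days_to_live = (capacity_gb * pe_cycles) / (waf * wday_gb)
    return days_to_live / total_price

# Hypothetical SATA TLC array: 4 x 128 GB, 1000 P/E cycles, WAF of 1.5,
# 200 GB written per day, $222 total price (from the table).
print(round(perf_per_dollar(300, 222), 3))                       # 1.351
print(round(lifetime_per_dollar(512, 1000, 1.5, 200, 222), 1))   # 7.7
```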
Analysis of Cost-Effectiveness

• The B TLC (SATA) SRC is better in terms of (MB/s)/$
• The B MLC (SATA) SRC is better in terms of Lifetime (days)/$
• SATA SSD based SRCs are still superior to the NVMe based SRC

[Figure: two charts over the Write, Mix, and Read groups; (MB/s)/$ ranges from 0 to 2.5 and Lifetime (days)/$ from 0 to 16]
Conclusions

• Bcache and FlashCache are NOT TRULY optimized for SSDs and RAID
  – Up to 80% performance reduction
• We proposed SRC (SSD RAID as a Cache)
  – Segment Groups (SGs) aligned to Large Writes
  – Log-based segments that pack data, metadata, and parity
  – SEL-GC to reclaim less-utilized SGs
• Experimental results
  – 2.7X better than Bcache and Flashcache
  – SATA SRC is better than NVMe SRC
    • Performance by 20%
    • Lifetime by 60%