Designing High-Performance, Resilient and Heterogeneity-Aware Key-Value Storage for Modern HPC Clusters
Dipti Shankar (shankar.50@osu.edu)
Advisor: Dhabaleswar K. (DK) Panda (panda.2@osu.edu), Co-Advisor: Dr. Xiaoyi Lu (lu.932@osu.edu)
Network-Based Computing Laboratory, http://nowlab.cse.ohio-state.edu
Acknowledgements
Introduction
Research Framework
Co-Designing Key-Value Store-based Burst Buffer over PFS
Conclusion and Future Work
High-Performance Non-Blocking API Semantics
Exploring Opportunities with NVRAM and RDMA
References
[1] D. Shankar, X. Lu, and D. K. Panda, “High-Performance and Resilient Key-Value Store with Online Erasure Coding for Big Data Workloads”, 37th International Conference on Distributed Computing Systems (ICDCS 2017)
[2] D. Shankar, X. Lu, and D. K. Panda, “Boldio: A Hybrid and Resilient Burst-Buffer Over Lustre for Accelerating Big Data I/O”, 2016 IEEE International Conference on Big Data (IEEE BigData 2016) [Short Paper]
[3] D. Shankar, X. Lu, N. Islam, M. W. Rahman, and D. K. Panda, “High-Performance Hybrid Key-Value Store on Modern Clusters with RDMA Interconnects and SSDs: Non-blocking Extensions, Designs, and Benefits”, 30th IEEE International Parallel & Distributed Processing Symposium (IPDPS 2016)
[4] D. Shankar, X. Lu, M. W. Rahman, N. Islam, and D. K. Panda, “Benchmarking Key-Value Stores on High-Performance Storage and Interconnects for Web-Scale Workloads”, 2015 IEEE International Conference on Big Data (IEEE BigData ’15) [Short Paper]
[5] D. Shankar, X. Lu, J. Jose, M. W. Rahman, N. Islam, and D. K. Panda, “Can RDMA Benefit On-Line Data Processing Workloads with Memcached and MySQL”, 2015 IEEE International Symposium on Performance Analysis of Systems and Software (ISPASS 2015) [Poster]
❖ The proposed framework enables key-value storage systems to exploit the capabilities of HPC clusters to maximize performance and scalability while ensuring data resilience/availability.
❖ Provides non-blocking API semantics for building efficient read/write pipelines, with resilience via RDMA-aware asynchronous replication and fast online erasure coding.
❖ Future work for this thesis (works-in-progress):
• Explore opportunities for exploiting SIMD compute capabilities (e.g., GPU, AVX); end-to-end SIMD-aware key-value storage system designs
• Co-design memory-centric data-intensive applications over key-value stores: (1) read-intensive graph workloads (e.g., LinkBench, RedisGraph); (2) a key-value store engine for Parameter Server frameworks for ML workloads
(Offline Data Analytics: Burst-Buffer and Persistent Store)
❖ Key-Value Stores (e.g., Memcached) serve as the heart of many production-scale distributed systems and databases
❖ Accelerating Online and Offline Analytics in High-Performance Computing (HPC) environments
❖ Our Basis: High-performance and hybrid key-value storage
❖ Remote Direct Memory Access (RDMA) over high-performance network interconnects (e.g., InfiniBand, RoCE)
❖ ‘DRAM+NVMe/NVRAM’ hybrid memory designs
❖ Research Focus: Designing a high-performance key-value storage system that can leverage: (1) RDMA-capable networks, (2) heterogeneous I/O, and (3) compute capabilities on HPC clusters
❖ Goals: (1) End-to-end performance (2) Scalability (3) Resilience / High Availability
Software Distribution
❖ The RDMA-based Memcached and Non-Blocking API designs (RDMA-Memcached) proposed in this research are available to the community as a part of the HiBD project: http://hibd.cse.ohio-state.edu/#memcached
❖ Micro-benchmarks and a YCSB plugin for RDMA-Memcached are available as a part of the OSU HiBD Micro-benchmark Suite (OHB): http://hibd.cse.ohio-state.edu/#microbenchmarks
This research is supported in part by National Science Foundation grants #CNS-1513120, #IIS-1636846, and #CCF-1822987
(Online Data Processing: High-Performance Cache)
❖ Motivation: Hybrid ‘DRAM+PCIe/NVMe-SSD’ Key-Value Stores
• Higher data retention; fast random reads
• Performance limited by blocking API semantics
❖ Goals: Achieve near in-memory speeds while being able to exploit hybrid memory
Fast Online Erasure Coding with RDMA
❖ Erasure Coding (EC): Storage-efficient alternative to replication for resilience
❖ Goal: Making online EC viable for key-value stores
❖ Bottlenecks: (1) encode/decode computation, (2) scattering/gathering the data/parity chunks (see the encode sketch below)
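To make the encode step concrete, the sketch below performs a Reed-Solomon RS(3,2) encode of one value into 3 data chunks and 2 parity chunks. The poster does not name the EC library it uses, so Intel ISA-L here is an assumption for illustration, not necessarily the implementation in this work.

```c
/* Minimal RS(3,2) encode sketch using Intel ISA-L (assumed library choice;
 * the poster does not specify which EC library is used).
 * Build: gcc ec_sketch.c -lisal */
#include <stdint.h>
#include <stdlib.h>
#include <string.h>
#include <isa-l/erasure_code.h>

#define K 3                 /* data chunks   */
#define P 2                 /* parity chunks */
#define CHUNK (64 * 1024)   /* chunk size    */

int main(void) {
    uint8_t *frag[K + P];
    for (int i = 0; i < K + P; i++)
        frag[i] = malloc(CHUNK);

    /* Split the value across the K data chunks (stub data here). */
    for (int i = 0; i < K; i++)
        memset(frag[i], 'A' + i, CHUNK);

    /* Generate the (K+P) x K encode matrix and expand the lookup tables
     * for the P coding rows. */
    uint8_t encode_matrix[(K + P) * K], g_tbls[K * P * 32];
    gf_gen_rs_matrix(encode_matrix, K + P, K);
    ec_init_tables(K, P, &encode_matrix[K * K], g_tbls);

    /* Compute the P parity chunks from the K data chunks. */
    ec_encode_data(CHUNK, K, P, g_tbls, frag, &frag[K]);

    /* Each of the K+P chunks would then be set to a different KV server
     * (the scatter step), e.g., via non-blocking Set calls. */
    for (int i = 0; i < K + P; i++)
        free(frag[i]);
    return 0;
}
```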
❖ Emerging non-volatile memory technologies (NVRAM)
❖ Potential: Byte-addressable and persistent; capable of RDMA
❖ Observations: RDMA writes into NVRAM need to guarantee remote durability
❖ Opportunities: RDMA-based persistence protocols for NVRAM systems (see the sketch below)
(Architecture of NVRAM-based System)
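As one example of such a protocol, the sketch below issues an RDMA write followed by a small RDMA read from the same region, a commonly discussed way to flush the written data out of the remote HCA toward the target's persistence (ADR) domain. This is an illustrative sketch over libibverbs, not necessarily the protocol proposed in this work; queue-pair and memory-registration setup is omitted.

```c
/* Sketch of a "write + read-after-write" remote-durability protocol over
 * libibverbs. One commonly discussed approach, shown here for illustration
 * only; qp/mr/remote_addr/rkey setup is omitted. */
#include <stdint.h>
#include <stddef.h>
#include <infiniband/verbs.h>

int rdma_write_durable(struct ibv_qp *qp, struct ibv_mr *mr,
                       void *local_buf, size_t len,
                       uint64_t remote_addr, uint32_t rkey)
{
    struct ibv_send_wr *bad, write_wr = {0}, flush_wr = {0};
    struct ibv_sge sge = {
        .addr   = (uintptr_t)local_buf,
        .length = (uint32_t)len,
        .lkey   = mr->lkey
    };

    /* 1. RDMA write the payload into the remote NVRAM region. */
    write_wr.opcode              = IBV_WR_RDMA_WRITE;
    write_wr.sg_list             = &sge;
    write_wr.num_sge             = 1;
    write_wr.wr.rdma.remote_addr = remote_addr;
    write_wr.wr.rdma.rkey        = rkey;

    /* 2. Zero-byte RDMA read from the same region: per-QP ordering means
     * its completion implies the prior write has been executed at the
     * responder; durability additionally assumes the memory controller
     * lies inside the platform's persistence (ADR) domain. */
    flush_wr.opcode              = IBV_WR_RDMA_READ;
    flush_wr.send_flags          = IBV_SEND_SIGNALED;
    flush_wr.wr.rdma.remote_addr = remote_addr;
    flush_wr.wr.rdma.rkey        = rkey;

    write_wr.next = &flush_wr;
    if (ibv_post_send(qp, &write_wr, &bad))
        return -1;
    /* Caller polls the CQ for flush_wr's completion before treating the
     * data as remotely durable. */
    return 0;
}
```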
(Research Framework: High-Performance, Hybrid, and Resilient Key-Value Storage)
• Application workloads: Online (e.g., SQL/NoSQL query cache, LinkBench and TAO graph KV workloads) and Offline (e.g., Burst-Buffer over PFS for Hadoop MapReduce/Spark)
• Storage-engine layer: Non-Blocking RDMA-aware API Extensions; Fast Online Erasure Coding (Reliability); NVRAM-aware Communication Protocols (Persistence); Accelerations on SIMD-based (e.g., GPU) Architectures (Scalability); Heterogeneity-Aware (DRAM/NVRAM/NVMe/PFS) Key-Value Storage Engine
• Modern HPC system architecture: Volatile/Non-Volatile Memory Technologies (DRAM, NVRAM); Multi-Core Nodes (w/ SIMD units); GPUs; RDMA-Capable Networks (InfiniBand, 40GbE RoCE); Parallel File System (Lustre); Local Storage (PCIe/NVMe SSD)
High-Performance Big Data (HiBD): http://hibd.cse.ohio-state.edu
(Typical Memcached deployment: Internet-facing Web Frontend Servers (Memcached Clients) connected over High-Performance Networks to Memcached Servers and Database Servers)
(Figure: Blocking vs. Non-Blocking API flows: the client's Libmemcached library and the server's Hybrid Slab Manager (RAM+SSD) communicate through RDMA-Enhanced Communication Libraries; the non-blocking flow splits each operation into a Non-Blocking API request and a separate Non-Blocking API reply)
shared storage such as Lustre. It acts as an interface between the application and the burst-buffer layer by transparently mapping Hadoop I/O streams to key/value store semantics.
(2) Boldio Burst-Buffer Server, referred to as BBS in Figure 2, running on each burst-buffer node, acts as a high-performance distributed staging layer for Hadoop I/O. It leverages ‘RAM+SSD’ available on the burst-buffer nodes in a high-performance and high-retention manner.
(3) Boldio Burst-Buffer Persistence Manager, referred to as BBP in Figure 2, is located at each BBS node. It persists the output files buffered in the Memcached slabs to Lustre asynchronously in a resilient fashion.
Based on the overall design in Figure 2, we highlight the details of the key technologies involved in designing Boldio in the following sections.
III. BOLDIO CLIENT AND SERVER DESIGNS
In this section, we present the internal designs of the Boldio client and server.
A. Mapping Hadoop File Stream to Memcached Semantics
To take advantage of a key/value store based burst-buffer system that is efficient and reliable, the files accessed via Hadoop I/O streams need to be mapped onto data blocks that are represented by key/value pairs. In addition to this, we need to enable resilience that is critical to Big Data applications. We employ a client-initiated replication technique through which the Boldio clients guarantee that redundant copies of I/O data exist on the Boldio servers before the application completes. To enable this mapping and to identify replicated data blocks on the Boldio servers, we consider the following two aspects:
(1) Data Mapping: The file stream is divided into chunks that can fit in Memcached's slabs (default size <1 MB). Each data chunk is identified by a key/value pair, i.e., (File_Id + File_Offset, File_Chunk). A metadata key/value pair, i.e., (File_Id, File_Meta), is also stored in the BBS cluster for every file output, to provide access to file information for future accesses. To identify replicated key/value pairs, we assign a replica identifier, i.e., Rep_Id, to each of the data and metadata chunks. This identifier ranges from 0 to N-1 for a replication factor N, with 0 as the primary copy.
(2) Data Distribution Schemes: To achieve efficient data placement, Boldio provides two data distribution schemes:
(a) 1F_MS: In this scheme, data chunks belonging to a file are scattered across all Boldio servers using default consistent hashing in Libmemcached. Irrespective of whether the key/value pairs represent data or metadata, the server is identified as: server_key ← consistent_hash(key) + replica_id. We refer to this scheme as 1-File Many-Server, i.e., 1F_MS. Figure 3(a) illustrates how data is distributed by three tasks consistently among two Boldio servers.
(b) 1F_1S: In this scheme, all data chunks are co-located with their metadata key. Boldio enables this 1-File 1-Server, i.e., 1F_1S, scheme by identifying the server using the File_Id portion of the key as follows: server_key ← consistent_hash(prefix(key)) + replica_id. Figure 3(b) illustrates how the output files from different tasks are chunked and distributed in the BBS cluster with localization with respect to a given file.
Figure 3. Data Distribution Schemes in Boldio Client: (a) 1F_MS, (b) 1F_1S
The 1F_MS scheme enables a load-balanced distribution of the data blocks, thus exploiting the available network bandwidth and distributed memory to maximize performance. On the other hand, though the 1F_1S scheme may not provide a balanced distribution, it simplifies the data persistence and recovery processes by co-locating all data blocks of a replica of each buffered file.
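To make the two selection rules concrete, the sketch below constructs chunk keys and applies the 1F_MS and 1F_1S rules. All helper names are illustrative assumptions, not Boldio's actual code, and a simple modular hash stands in for Libmemcached's consistent hashing.

```c
/* Illustrative sketch of Boldio's chunk-key mapping and the 1F_MS / 1F_1S
 * server-selection rules. hash_fn() stands in for Libmemcached's consistent
 * hashing; all names here are assumptions for illustration. */
#include <stdio.h>
#include <string.h>

#define NUM_SERVERS 4

/* Stand-in for consistent hashing over the server ring. */
static unsigned hash_fn(const char *key, size_t len) {
    unsigned h = 5381;                      /* djb2, illustrative only */
    while (len--) h = h * 33 + (unsigned char)*key++;
    return h;
}

/* Data chunks are keyed as (File_Id + File_Offset); Rep_Id in 0..N-1 tags
 * each replica, with 0 as the primary copy. */
static void make_chunk_key(char *out, size_t outlen,
                           const char *file_id, long file_offset) {
    snprintf(out, outlen, "%s:%ld", file_id, file_offset);
}

/* 1F_MS: hash the full key, so chunks of one file scatter over all servers. */
static int server_1f_ms(const char *key, int replica_id) {
    return (hash_fn(key, strlen(key)) + replica_id) % NUM_SERVERS;
}

/* 1F_1S: hash only the File_Id prefix, so all chunks of a file (and its
 * metadata key) land on the same server for each replica. */
static int server_1f_1s(const char *key, size_t prefix_len, int replica_id) {
    return (hash_fn(key, prefix_len) + replica_id) % NUM_SERVERS;
}

int main(void) {
    char key[64];
    make_chunk_key(key, sizeof(key), "job42-part0", 1048576L);
    printf("1F_MS -> server %d\n", server_1f_ms(key, 0));
    printf("1F_1S -> server %d\n",
           server_1f_1s(key, strlen("job42-part0"), 0));
    return 0;
}
```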
B. Enabling Non-blocking I/O Requests
Using default key/value store semantics, each I/O request to the burst-buffer layer is issued in a blocking manner. Due to the limit on the key/value pair size in most key/value stores (i.e., max. 1 MB), a large number of requests need to be issued to read/write a single file for Big Data workloads. We introduced novel and high-performance non-blocking key/value store API semantics for RDMA-enhanced Memcached in [17]. As shown in Figure 4, these non-blocking API extensions allow the user to separate the request issue and completion phases to overlap concurrent data requests, reducing total execution time.
Figure 4. Non-blocking Memcached API Semantics (request/response timeline over ~50 us: blocking SET/GET serializes the request and response of KV1 and KV2, while non-blocking SET/GET of KV1 and KV2 issues both requests back-to-back and overlaps their responses)
In the Boldio client, we employ the proposed non-blocking APIs, i.e., memcached_iset and memcached_iget, to issue a bulk of Set/Get requests for the required data chunks and return to the Hadoop application as soon as the underlying RDMA communication engine has communicated the request to the Boldio servers, without blocking on the response. The memcached_wait and memcached_test APIs are then used to check the progress of each operation in either a blocking or non-blocking fashion, respectively.
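The sketch below illustrates this issue/wait pipeline. Only the API names (memcached_iset, memcached_iget, memcached_wait, memcached_test) come from the text; their exact signatures are not given there, so the request-handle type and argument layout below are assumptions modeled on Libmemcached's blocking memcached_set.

```c
/* Illustrative issue/wait pipeline with the proposed non-blocking APIs
 * (available in RDMA-Memcached's extended Libmemcached). Signatures and the
 * memcached_request_st handle type are ASSUMED for illustration; only the
 * API names come from the paper. */
#include <string.h>
#include <libmemcached/memcached.h>

#define NCHUNKS 8
#define CHUNK   (1024 * 1024)

void write_file_chunks(memcached_st *memc, char keys[][64],
                       char (*chunks)[CHUNK])
{
    memcached_request_st reqs[NCHUNKS];   /* assumed handle type */

    /* Issue phase: post all Set requests; each call returns as soon as the
     * RDMA engine has handed the request to the Boldio servers. */
    for (int i = 0; i < NCHUNKS; i++)
        memcached_iset(memc, keys[i], strlen(keys[i]),
                       chunks[i], CHUNK, /*expiration=*/0, /*flags=*/0,
                       &reqs[i]);

    /* Completion phase: the overlap is harvested here; memcached_wait
     * blocks per request, while memcached_test would poll instead. */
    for (int i = 0; i < NCHUNKS; i++)
        memcached_wait(memc, &reqs[i]);
}
```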
(Figure: Aggregate throughput (ops/sec) for Read-Only (100% GET) and Write-Heavy (50% GET : 50% SET) workloads, comparing H-RDMA-Blocking, H-RDMA-NonB-iget, and H-RDMA-NonB-bget)
(Figure: Average Set/Get latency (us) for IPoIB-Mem, RDMA-Mem, H-RDMA-Def, H-RDMA-Opt-Block, H-RDMA-Opt-NonB-i, and H-RDMA-Opt-NonB-b, broken down into miss penalty (backend DB access overhead), client wait, server response, cache update, cache check+load (memory and/or SSD read), and slab allocation (w/ SSD write on out-of-memory); overlapping SSD I/O access yields near in-memory latency)
(EC performance trade-off: replication is high-performance but incurs memory overhead; the ideal scenario is both memory-efficient and high-performance without giving up fault tolerance)
(Figure: Total Set latency (us) vs. value size (512 B to 1 MB) for Sync-Rep=3, Async-Rep=3, Era(3,2)-CE-CD, Era(3,2)-SE-SD, and Era(3,2)-SE-CD; gains of ~1.6x and ~2.8x)
Boldio: A Hybrid and Resilient Burst-Buffer Over Lustre for Accelerating Big Data I/O
Dipti Shankar, Xiaoyi Lu, Dhabaleswar K. (DK) Panda
Department of Computer Science and Engineering, The Ohio State University
Email: {shankard, luxi, panda}@cse.ohio-state.edu
Abstract—The limitation of local storage space in HPC environments has placed an unprecedented demand on the performance of the underlying shared parallel file systems. This has necessitated a scalable solution for running Big Data middleware (e.g., Hadoop) on HPC clusters. In this paper, we propose Boldio, a hybrid and resilient key-value store-based Burst-Buffer system Over Lustre for accelerating I/O-intensive Big Data workloads, which can leverage RDMA on high-performance interconnects and storage technologies such as PCIe-/NVMe-SSDs. We demonstrate that Boldio can improve the performance of the I/O phase of Hadoop workloads running on HPC clusters, serving as a light-weight, high-performance, and resilient remote I/O staging layer between the application and Lustre. Performance evaluations show that Boldio can improve TestDFSIO write performance over Lustre by up to 3x and TestDFSIO read performance by 7x, while reducing the execution time of the Hadoop Sort benchmark by up to 30%. We demonstrate that we can significantly improve Hadoop I/O throughput over popular in-memory distributed storage systems such as Alluxio (formerly Tachyon) when high-speed local storage is limited.
Keywords-Burst-Buffer; Hadoop; Memcached; RDMA; Non-Blocking API;
I. INTRODUCTION
Traditionally, the Hadoop Distributed File System or HDFS [6] has been employed as the storage layer for Big Data analytics on commodity clusters. Though the nodes on modern HPC clusters [16] are equipped with heterogeneous storage devices (RAMDisk or SATA-/PCIe-/NVMe-SSDs), the capacities of local storage on the compute nodes are essentially very limited [4], due to the Beowulf architectural model [18] employed. The I/O data nodes constitute a dedicated and shared storage cluster that hosts a high-performance parallel file system (e.g., Lustre [22], GPFS [7]), and are connected to the compute nodes via high-performance interconnects such as InfiniBand [8]. Consequently, Big Data users on HPC clusters need to rely on parallel storage systems such as Lustre for their I/O needs. Since Lustre is typically deployed on a separate sub-cluster, all I/O operations performed are remote operations. This default approach is represented in Figure 1(a). With most Big Data workloads being predominantly I/O-intensive, parallel storage systems tend to become a major performance bottleneck.
*This research is supported in part by National Science Foundation grants #CNS-1419123, #IIS-1447804, #CNS-1513120, #CCF-1565414, and #IIS-1636846. It used the Extreme Science and Engineering Discovery Environment (XSEDE), which is supported by National Science Foundation grant number OCI-1053575.
A. Motivation and Related Work
While Big Data workloads such as Hadoop MapReduce have been leveraging Lustre and Remote Direct Memory Access (RDMA) to run on HPC clusters [11, 15], several recent works have focused on accelerating Big Data I/O in these environments. Enhanced HDFS designs were proposed [9, 10] to exploit the heterogeneous storage architecture in the HPC environment while employing advanced features such as RDMA, thus enabling the use of Lustre within HDFS. The recent emergence of in-memory computing has directed the spotlight towards memory-centric distributed file systems such as Alluxio (formerly Tachyon) [2, 12] that enable users to unify heterogeneous storage across nodes for high throughput and scalability. Alluxio workers can be deployed locally on the compute nodes (Local) or remotely on a separate sub-cluster of allocated nodes (Remote), as represented in Figure 1(b). While this approach is being explored to develop a two-level approach for Big Data on HPC clusters [21], it was not designed specifically for traditional HPC infrastructure and incurs network access overhead for storing and retrieving remote I/O blocks due to the limitations on local storage.
Figure 1. Existing and Proposed Big Data I/O Sub-systems in HPC: (a) Direct over Lustre, (b) Local/Remote In-Memory Alluxio, (c) Burst-Buffer over Lustre (this paper)
On the other hand, recent research works [19, 20] have focused on leveraging key-value store based burst-buffer systems for managing the intense, bursty I/O traffic generated by checkpointing in HPC applications. This approach, represented in Figure 1(c), enables data to be temporarily buffered via a high-performance and resilient key-value based storage layer (e.g., Memcached [1]) deployed on a separate set of compute/storage/large-memory nodes before persisting it to the underlying parallel file system, i.e., Lustre. Similarly, hardware-based burst-buffer approaches such as DDN IME [3] are also being actively explored.
• TestDFSIO on SDSC Gordon Cluster (16-core Intel Sandy Bridge + IB QDR)
• 16-node Hadoop Cluster + 4-node Boldio Cluster
• Performance gains over designs like Alluxio (Tachyon) over PFS
(Figure: Architecture of Boldio: Hadoop I/O applications (MapReduce, Spark) run over the co-designed BoldioFileSystem via Hadoop's FileSystem class abstraction (LocalFileSystem); the Burst-Buffer Libmemcached client, with Non-Blocking APIs and ARPE (CE/CD/Rep), talks through an RDMA-enhanced communication engine to a Memcached server cluster; each server combines a Hybrid-Memory manager (RAM/SSD), ARPE (SE/SD), and a Persistence Manager that stages data to the Lustre parallel file system (MDS/MDTs, OSSs/OSTs))
(Figure: NVRAM-based system architecture: two nodes (Node0/Node1, each an initiator/target) with multi-core CPUs (per-core L1/L2, shared L3), a memory controller fronting DRAM and NVRAM, and an HCA behind the I/O controller on the PCIe bus; the durability path for 'RDMA into NVRAM' and 'RDMA from NVRAM' ends at the local durability point)
(Benchmark setups: SDSC Comet cluster, 150 YCSB clients over 10 nodes against a 5-node Memcached server cluster; Intel Westmere + QDR-IB cluster, Rep=3 vs. RS(3,2); SDSC Comet, 1 server + 100 clients over 32 nodes; a ~2.5x gain is annotated. Replication with Rep=3 incurs 200% storage overhead vs. 66% for erasure coding with RS(3,2). Encode/decode placements: CE-CD (Client-Encode/Client-Decode), SE-CD (Server-Encode/Client-Decode), SE-SD (Server-Encode/Server-Decode), CE-SD (Client-Encode/Server-Decode))
(Figure: Online EC request flows for the client/server encode/decode placements: a value D is encoded into data chunks D1, D2, ... and parity chunks P1, P2 that are scattered to, and gathered from, the KV server cluster; the timelines mark Tcomm and the asynchronous Tasync_set/Tasync_get phases that the non-blocking pipeline overlaps)
❖ Big Data I/O infrastructure (e.g., HDFS, Alluxio) vs. HPC storage (e.g., GPFS, Lustre); limited ‘data locality’ in HPC
❖ Bottleneck: Heavy reliance on the PFS (limited I/O bandwidth) hurts data-intensive Big Data applications
❖ Approach: High-Performance, Hybrid, and Resilient Key-Value Store-Based Burst-Buffer Over Lustre (Boldio)
• Hadoop I/O over Lustre; transparent FileSystem plugin for Hadoop MapReduce/Spark
• No dependence on local storage at compute nodes
• Resilience via Async. Replication or Online EC
• RDMA-Memcached as Burst-Buffer servers + non-blocking client APIs for efficient I/O pipelines
(Figure: Aggregate throughput (Kops/sec) for YCSB-A and YCSB-B with 4K/16K/32K values, comparing Memc-RDMA-NoRep, Async-Rep=3, Era(3,2)-CE-CD, and Era(3,2)-SE-CD; gains of ~1.5x and ~1.34x)
(Figures: TestDFSIO aggregate throughput (MBps) for 60 GB and 100 GB Write/Read, Lustre-Direct vs. Boldio, with up to 3x (write) and 6.7x (read) gains; and results for WordCount, InvIndx, CloudBurst, and Spark TeraGen comparing Lustre-Direct, Alluxio-Remote, and Boldio, with gains up to 21%. Setup: 8-core Intel Westmere + IB QDR, 8-node Hadoop cluster, 4-node Boldio cluster over Lustre)
(Figure: TestDFSIO aggregate Write/Read throughput (MBps) for Lustre-Direct, Alluxio-Remote, Boldio_Async-Rep=3, and Boldio_Online-EC on a 5-node Boldio cluster over Lustre; online EC shows no overhead vs. Rep=3 (RS(3,2)), with 80-90% overlap)
❖ Approach: Novel Non-blocking API Semantics
• Extensions for the RDMA-based Libmemcached library
• memcached_(iset/bset/bget) for SET/GET operations
• memcached_(test/wait) for progressing communication
• Ability to overlap request and response phases; hide SSD I/O overheads
• Up to 8x gain in overall latency vs. blocking API semantics
(Architecture of Boldio)
❖ Approach: Non-blocking RDMA-aware semantics to enable compute/communication overlap
❖ Encode/decode offloading integrated into the Memcached client (CE/CD) and server (SE/SD)
❖ Experiments with the Yahoo! Cloud Serving Benchmark (YCSB), Online EC vs. Async. Rep: (1) update-heavy: CE-CD outperforms, SE-CD is on par; (2) read-heavy: CE-CD and SE-CD are on par