Designing High-Performance Non-Volatile Memory-aware RDMA Communication Protocols for Big Data Processing
Talk at Storage Developer Conference (SDC) | SNIA 2018

by

Xiaoyi Lu, The Ohio State University
E-mail: [email protected]
http://www.cse.ohio-state.edu/~luxi

Dhabaleswar K. (DK) Panda, The Ohio State University
E-mail: [email protected]
http://www.cse.ohio-state.edu/~panda
SDC | SNIA ‘18 | Network Based Computing Laboratory
Big Data Management and Processing on Modern Clusters

• Substantial impact on designing and utilizing data management and processing systems in multiple tiers
  – Front-end data accessing and serving (Online)
    • Memcached + DB (e.g., MySQL), HBase
  – Back-end data analytics (Offline)
    • HDFS, MapReduce, Spark

[Figure: Internet-facing front-end tier (Web Servers, Memcached + DB (MySQL), NoSQL DB (HBase)) for data accessing and serving; back-end tier (HDFS, MapReduce, Spark) for data analytics apps/jobs]
Big Data Processing with Apache Big Data Analytics Stacks

• Major components included:
  – MapReduce (Batch)
  – Spark (Iterative and Interactive)
  – HBase (Query)
  – HDFS (Storage)
  – RPC (Inter-process communication)
• Underlying Hadoop Distributed File System (HDFS) used by MapReduce, Spark, HBase, and many others
• Model scales, but the high amount of communication and I/O can be further optimized!

[Figure: Apache Big Data Analytics Stacks: User Applications over MapReduce, Spark, and HBase; Hadoop Common (RPC); HDFS as the storage layer]
Drivers of Modern HPC Cluster and Data Center Architecture

• Multi-core/many-core technologies
• Remote Direct Memory Access (RDMA)-enabled networking (InfiniBand and RoCE)
  – Single Root I/O Virtualization (SR-IOV)
• NVM and NVMe-SSD
• Accelerators (NVIDIA GPGPUs and FPGAs)

[Figure: High-performance interconnects (InfiniBand with SR-IOV; <1 usec latency, 200 Gbps bandwidth); multi-/many-core processors; accelerators/coprocessors (high compute density, high performance/watt, >1 TFlop DP on a chip); SSD, NVMe-SSD, NVRAM; cloud systems such as SDSC Comet and TACC Stampede]
The High-Performance Big Data (HiBD) Project

• RDMA for Apache Spark
• RDMA for Apache Hadoop 2.x (RDMA-Hadoop-2.x)
  – Plugins for Apache, Hortonworks (HDP), and Cloudera (CDH) Hadoop distributions
• RDMA for Apache HBase
• RDMA for Memcached (RDMA-Memcached)
• RDMA for Apache Hadoop 1.x (RDMA-Hadoop)
• OSU HiBD-Benchmarks (OHB)
  – HDFS, Memcached, HBase, and Spark Micro-benchmarks
• http://hibd.cse.ohio-state.edu
• Users Base: 290 organizations from 34 countries
• More than 27,800 downloads from the project site
• Available for InfiniBand and RoCE; available for x86 and OpenPOWER

Significant performance improvement with ‘RDMA+DRAM’ compared to default Sockets-based designs; how about RDMA+NVRAM?
Non-Volatile Memory (NVM) and NVMe-SSD

• Non-Volatile Memory (NVM) provides byte-addressability with persistence
• The huge explosion of data in diverse fields requires fast analysis and storage
• NVMs provide the opportunity to build high-throughput storage systems for data-intensive applications
• Storage technology is moving rapidly towards NVM

[Figure: 3D XPoint from Intel & Micron; Samsung NVMe SSD; performance of PMC Flashtec NVRAM [*]]

[*] http://www.enterprisetech.com/2014/08/06/flashtec-nvram-15-million-iops-sub-microsecond-latency/
NVRAM Emulation based on DRAM

• Popular methods employed by recent works to emulate an NVRAM performance model over DRAM
• Two ways:
  – Emulate byte-addressable NVRAM over DRAM
  – Emulate a block-based NVM device over DRAM

[Figure: Two emulation stacks. Byte-addressable: Application -> Persistent Memory Library (pmem_memcpy_persist, load/store) -> clflush + delay -> DRAM, with mmap/memcpy/msync (DAX). Block-based: Application -> open/read/write/close -> Virtual File System -> block device "PCM Disk" (RAM-Disk + delay) -> DRAM]
Presentation Outline

• NRCIO: NVM-aware RDMA-based Communication and I/O Schemes
• NRCIO for Big Data Analytics
• NVMe-SSD based Big Data Analytics
• Conclusion and Q&A
Design Scope (NVM for RDMA)

• D-to-D over RDMA: Communication buffers for client and server are allocated in DRAM (common case)
• D-to-N over RDMA: Communication buffers for the client are allocated in DRAM; the server uses NVM
• N-to-D over RDMA: Communication buffers for the client are allocated in NVM; the server uses DRAM
• N-to-N over RDMA: Communication buffers for client and server are allocated in NVM

[Figure: Three client/server configurations of HDFS-RDMA (RDMADFSClient to RDMADFSServer), each with client CPU and server CPU connected through PCIe and NICs, and buffers placed in DRAM or NVM according to the scheme above]
NVRAM-aware RDMA-based Communication in NRCIO

[Figure: Protocol diagrams for NRCIO RDMA Write over NVRAM and NRCIO RDMA Read over NVRAM]
DRAM-to-NVRAM RDMA-Aware Communication with NRCIO

• Comparison of communication latency using NRCIO RDMA read and write communication protocols over an InfiniBand EDR HCA, with DRAM as source and NVRAM as destination
• {NxDRAM} NVRAM emulation mode = Nx NVRAM write slowdown vs. DRAM, with clflushopt (emulated) + sfence
• Smaller impact of time-for-persistence on the end-to-end latencies for small messages vs. large messages, since larger messages have a larger number of cache lines to flush

[Figure: Latency (us) of NRCIO-RW and NRCIO-RR for 256 B to 16 KB messages, and latency (ms) for 256 KB to 4 MB messages, under the 1xDRAM, 2xDRAM, and 5xDRAM emulation modes]
NVRAM-to-NVRAM RDMA-Aware Communication with NRCIO

• Comparison of communication latency using NRCIO RDMA read and write communication protocols over an InfiniBand EDR HCA vs. DRAM
• {Ax, By} NVRAM emulation mode = Ax NVRAM read slowdown and By NVRAM write slowdown vs. DRAM
• High end-to-end latencies due to slower writes to non-volatile persistent memory
  – E.g., 3.9x for {1x, 2x} and 8x for {2x, 5x}

[Figure: Latency (us) of NRCIO-RW and NRCIO-RR for 64 B to 16 KB messages, and latency (ms) for 256 KB to 4 MB messages, for No Persist (D2D), {1x,2x}, and {2x,5x} emulation modes]
Presentation Outline

• NRCIO: NVM-aware RDMA-based Communication and I/O Schemes
• NRCIO for Big Data Analytics
• NVMe-SSD based Big Data Analytics
• Conclusion and Q&A
Opportunities of Using NVRAM+RDMA in HDFS

• Files are divided into fixed-sized blocks
  – Blocks are divided into packets
• NameNode: stores the file system namespace
• DataNode: stores data blocks in local storage devices
• Uses block replication for fault tolerance
  – Replication enhances data-locality and read throughput
• Communication and I/O intensive
• Java Sockets based communication
• Data needs to be persistent, typically on SSD/HDD

[Figure: HDFS architecture with a Client, a NameNode, and multiple DataNodes]
Design Overview of NVM and RDMA-aware HDFS (NVFS)

• Design Features
  – RDMA over NVM
  – HDFS I/O with NVM: block access and memory access
  – Hybrid design: NVM with SSD as hybrid storage for HDFS I/O
  – Co-design with Spark and HBase: cost-effectiveness, use-case

[Figure: NVM and RDMA-aware HDFS (NVFS) architecture. Applications and benchmarks (Hadoop MapReduce, Spark, HBase) sit over a co-designed DataNode (RDMASender, RDMAReceiver, RDMAReplicator, DFSClient-RDMA), with Writer/Reader paths going through NVFS-BlkIO and NVFS-MemIO to NVM, alongside SSDs]

N. S. Islam, M. W. Rahman, X. Lu, and D. K. Panda, "High Performance Design for HDFS with Byte-Addressability of NVM and RDMA," 24th International Conference on Supercomputing (ICS), June 2016.
Evaluation with Hadoop MapReduce

• TestDFSIO on SDSC Comet (32 nodes)
  – Write: NVFS-MemIO gains 4x over HDFS
  – Read: NVFS-MemIO gains 1.2x over HDFS
• TestDFSIO on OSU Nowlab (4 nodes)
  – Write: NVFS-MemIO gains 4x over HDFS
  – Read: NVFS-MemIO gains 2x over HDFS

[Figure: TestDFSIO average throughput (MBps) for Write and Read, comparing HDFS (56 Gbps), NVFS-BlkIO (56 Gbps), and NVFS-MemIO (56 Gbps) on SDSC Comet (32 nodes) and OSU Nowlab (4 nodes)]
Evaluation with HBase

• YCSB 100% Insert on SDSC Comet (32 nodes)
  – NVFS-BlkIO gains 21% by storing only WALs to NVM
• YCSB 50% Read, 50% Update on SDSC Comet (32 nodes)
  – NVFS-BlkIO gains 20% by storing only WALs to NVM

[Figure: HBase throughput (ops/s) vs. cluster size : number of records (8:800K, 16:1600K, 32:3200K) for HDFS (56 Gbps) and NVFS (56 Gbps), for the 100% insert and 50% read / 50% update workloads]
Opportunities to Use NVRAM+RDMA in MapReduce

• Map and Reduce Tasks carry out the total job execution
  – Map tasks read from HDFS, operate on the data, and write the intermediate data to local disk (persistent)
  – Reduce tasks get these data via shuffle from NodeManagers, operate on the data, and write to HDFS (persistent)
• Communication and I/O intensive; the Shuffle phase uses HTTP over Java Sockets; I/O operations typically take place on SSD/HDD

[Figure: MapReduce data flow highlighting disk operations and bulk data transfer]
Opportunities to Use NVRAM in MapReduce-RDMA Design

[Figure: RDMA-based MapReduce pipeline. Map tasks: Read -> Map -> Spill -> Merge over input files and intermediate data; Reduce tasks: Shuffle (over RDMA) -> In-Mem Merge -> Reduce to output files]

• All operations are in-memory; opportunities exist to improve the performance with NVRAM
NVRAM-Assisted Map Spilling in MapReduce-RDMA

[Figure: The same RDMA-based MapReduce pipeline, with the Map-side Spill phase redirected to NVRAM]

• Minimizes the disk operations in the Spill phase

M. W. Rahman, N. S. Islam, X. Lu, and D. K. Panda, "Can Non-Volatile Memory Benefit MapReduce Applications on HPC Clusters?," PDSW-DISCS, held with SC 2016.
M. W. Rahman, N. S. Islam, X. Lu, and D. K. Panda, "NVMD: Non-Volatile Memory Assisted Design for Accelerating MapReduce and DAG Execution Frameworks on HPC Systems," IEEE BigData 2017.
Comparison with Sort and TeraSort

• RMR-NVM achieves 2.37x benefit for the Map phase compared to RMR and MR-IPoIB; overall benefit is 55% compared to MR-IPoIB and 28% compared to RMR
• RMR-NVM achieves 2.48x benefit for the Map phase compared to RMR and MR-IPoIB; overall benefit is 51% compared to MR-IPoIB and 31% compared to RMR

[Figure: Execution time breakdowns for Sort and TeraSort, annotated with the 2.37x/55% and 2.48x/51% improvements]
Evaluation of Intel HiBench Workloads

• We evaluate different HiBench workloads with Huge data sets on 8 nodes
• Performance benefits for Shuffle-intensive workloads compared to MR-IPoIB:
  – Sort: 42% (25 GB)
  – TeraSort: 39% (32 GB)
  – PageRank: 21% (5 million pages)
• Other workloads:
  – WordCount: 18% (25 GB)
  – KMeans: 11% (100 million samples)
Evaluation of PUMA Workloads

• We evaluate different PUMA workloads on 8 nodes with a 30 GB data size
• Performance benefits for Shuffle-intensive workloads compared to MR-IPoIB:
  – AdjList: 39%
  – SelfJoin: 58%
  – RankedInvIndex: 39%
• Other workloads:
  – SeqCount: 32%
  – InvIndex: 18%
Presentation Outline

• NRCIO: NVM-aware RDMA-based Communication and I/O Schemes
• NRCIO for Big Data Analytics
• NVMe-SSD based Big Data Analytics
• Conclusion and Q&A
Overview of NVMe Standard

• NVMe is the standardized interface for PCIe SSDs
• Built on ‘RDMA’ principles
  – Submission and completion I/O queues
  – Similar semantics to RDMA send/recv queues
  – Asynchronous command processing
• Up to 64K I/O queues, with up to 64K commands per queue
• Efficient small random I/O operation
• MSI/MSI-X and interrupt aggregation

[Figure: NVMe command processing. Source: NVMExpress.org]
Overview of NVMe-over-Fabrics

• Remote access to flash with NVMe over the network
• The RDMA fabric is of most importance
  – Low latency makes remote access feasible
  – 1-to-1 mapping of NVMe I/O queues to RDMA send/recv queues
• Low latency overhead compared to local I/O

[Figure: NVMf architecture. I/O submission and completion queues mapped over the RDMA fabric (SQ/RQ) to the NVMe device]
Design Challenges with NVMe-SSD

• QoS
  – Hardware-assisted QoS
• Persistence
  – Flushing buffered data
• Performance
  – Consider flash-related design aspects
  – Read/write performance skew
  – Garbage collection
• Virtualization
  – SR-IOV hardware support
  – Namespace isolation
• New software systems
  – Disaggregated storage with NVMf
  – Persistent caches
• Co-design across all of the above
Evaluation with RocksDB (Latency)

• 20%, 33%, and 61% improvement for Insert, Write Sync, and Read Write
• Overwrite: compaction and flushing happen in the background
  – Low potential for improvement
• Read: performance is much worse; additional tuning/optimization required

[Figure: Latency (us) of POSIX vs. SPDK backends for Insert, Overwrite, Random Read, Write Sync, and Read Write workloads]
Evaluation with RocksDB (Throughput)

• 25%, 50%, and 160% improvement for Insert, Write Sync, and Read Write
• Overwrite: compaction and flushing happen in the background
  – Low potential for improvement
• Read: performance is much worse; additional tuning/optimization required

[Figure: Throughput (ops/sec) of POSIX vs. SPDK backends for Insert, Overwrite, Random Read, Write Sync, and Read Write workloads]
QoS-aware SPDK Design

• Synthetic application scenarios with different QoS requirements
  – Comparison using SPDK with Weighted Round Robin (WRR) NVMe arbitration
• Near-desired job bandwidth ratios
• Stable and consistent bandwidth

[Figure: Bandwidth (MB/s) over time for high- and medium-priority jobs under WRR vs. OSU-Design (Scenario 1), and job bandwidth ratios for Scenarios 2-5 comparing SPDK-WRR, OSU-Design, and the desired ratio]

S. Gugnani, X. Lu, and D. K. Panda, "Analyzing, Modeling, and Provisioning QoS for NVMe SSDs," (under review).
Conclusion and Future Work

• Big Data analytics needs high-performance NVM-aware RDMA-based communication and I/O schemes
• Proposed a new library, NRCIO (work-in-progress)
  – Re-designed the HDFS storage architecture with NVRAM
  – Re-designed RDMA-MapReduce with NVRAM
  – Designed Big Data analytics stacks with NVMe and NVMf protocols
  – Results are promising
• Future work
  – Further optimizations in NRCIO
  – Co-design with more Big Data analytics frameworks: TensorFlow, Object Storage, Database, etc.
Thank You!

http://www.cse.ohio-state.edu/~luxi

Network-Based Computing Laboratory
http://nowlab.cse.ohio-state.edu/

The High-Performance Big Data Project
http://hibd.cse.ohio-state.edu/