
High Performance File System and I/O Middleware Design for Big Data on HPC Clusters

Nusrat Sharmin Islam
Advisor: Dhabaleswar K. (DK) Panda
Network-Based Computing Laboratory, http://nowlab.cse.ohio-state.edu
The Ohio State University, College of Engineering

Poster sections: Introduction, Research Framework, HDFS over RDMA, Hybrid HDFS with Heterogeneous Storage, NVM-based HDFS, Conclusion and Future Work, Publications

Acknowledgements: High-Performance Big Data (HiBD) project, http://hibd.cse.ohio-state.edu

HDFS over RDMA

[Architecture figure: Applications -> HDFS; the Write path goes through the Java Native Interface (JNI) to the OSU Design and Verbs onto RDMA-capable networks (IB, 10GigE/iWARP, RoCE, ...), while other operations use the Java Socket Interface over the 1/10 GigE or IPoIB network.]

Design features:
• RDMA-based HDFS write and replication
• A JNI layer bridges Java-based HDFS with the native communication library
• On-demand connection setup
• Java direct buffers ensure zero-copy data transfer (see the sketch below)

Enables high-performance RDMA communication while supporting the traditional socket interface [1].

[Result callout: reduced by 30%.]
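The zero-copy hand-off named above can be pictured with a small sketch. This is not the OSU implementation: the native library name and the rdmaSend method below are hypothetical placeholders standing in for the JNI bridge and RDMA verbs layer described on the poster.

```java
// Hypothetical sketch of the JNI / direct-buffer idea; it requires a native
// library ("rdmacomm") that exists only for this illustration.
import java.nio.ByteBuffer;

public class RdmaWriteSketch {
    static {
        System.loadLibrary("rdmacomm");   // placeholder native communication library
    }

    // Implemented in C via JNI. The native side can call GetDirectBufferAddress()
    // on 'payload', register that memory region, and post an RDMA send without
    // copying the bytes out of the JVM.
    private static native int rdmaSend(ByteBuffer payload, int length);

    public static void sendPacket(byte[] packet) {
        // Direct buffers live outside the Java heap, which is what makes the
        // zero-copy hand-off to the native layer possible.
        ByteBuffer buf = ByteBuffer.allocateDirect(packet.length);
        buf.put(packet).flip();
        rdmaSend(buf, buf.remaining());   // connections are set up on demand by the native layer
    }
}
```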

[Figures: "Sequential Data Processing in OBOT Architecture" (default HDFS) vs. "Overlapping in SEDA-based Approach": packets Pkt1,1 ... PktK,N of tasks Task1 ... TaskK pass through the Read, Packet Processing, Replication, and I/O stages, serialized per block in default HDFS (OBOT) and overlapped across packets in the SEDA-based design.]

SEDA-HDFS

• Default HDFS adopts a One-Block-One-Thread (OBOT) architecture that processes data sequentially; a good trade-off between simplicity and performance.
• In the OBOT architecture, incoming data packets must wait for the I/O stage of the previous packet to complete before they are read and processed, which keeps HDFS from fully utilizing the hardware capabilities.
• In RDMA-enhanced HDFS, the bottleneck moves from data transmission to data persistence.
• Staged Event-Driven Architecture (SEDA) is a high-throughput design approach for Internet services that decomposes complex processing logic into a set of stages connected by queues.

SEDA-based approach for HDFS write [2]:
• Re-designs the internal software architecture from OBOT to SEDA (see the sketch below)
• Maximizes overlapping among the different stages
• Incorporates RDMA-based communication
• Uses four stages: (1) Read, (2) Packet Processing, (3) Replication, (4) I/O
• The number of threads in each stage is tuned so that RDMA-enhanced HDFS makes maximum use of the system resources
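As a rough illustration of the SEDA pattern (stages connected by queues, each drained by its own tunable thread pool), the following self-contained Java sketch wires up the four stages named above. The stage names follow the poster; the queue size, thread counts, and the trivial per-stage work are invented for the example and are not SOR-HDFS parameters.

```java
import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.function.UnaryOperator;

public class SedaSketch {

    // One SEDA stage: a bounded event queue drained by its own thread pool,
    // handing each result to the next stage's queue (or dropping it if last).
    static BlockingQueue<byte[]> stage(int threads, BlockingQueue<byte[]> next,
                                       UnaryOperator<byte[]> work) {
        BlockingQueue<byte[]> in = new ArrayBlockingQueue<>(1024);
        Runnable worker = () -> {
            try {
                while (true) {
                    byte[] packet = in.take();           // wait for an event
                    byte[] out = work.apply(packet);     // process it
                    if (next != null) next.put(out);     // hand off to the next stage
                }
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();      // workers run until interrupted
            }
        };
        ExecutorService pool = Executors.newFixedThreadPool(threads);
        for (int i = 0; i < threads; i++) pool.submit(worker);
        return in;
    }

    public static void main(String[] args) throws InterruptedException {
        // Four stages as on the poster: Read -> Packet Processing -> Replication -> I/O.
        // Thread counts per stage are arbitrary here; the design tunes them per system.
        BlockingQueue<byte[]> io   = stage(2, null, p -> p /* write to local storage */);
        BlockingQueue<byte[]> repl = stage(2, io,   p -> p /* forward to the next DataNode */);
        BlockingQueue<byte[]> proc = stage(2, repl, p -> p /* verify checksum, etc. */);
        BlockingQueue<byte[]> read = stage(1, proc, p -> p);

        read.put("packet-1".getBytes());  // packets from different blocks now overlap across stages
    }
}
```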

Triple-H

[Figure: Triple-H architecture: Applications on top of heterogeneous storage (RAM Disk, SSD, HDD, and Lustre), combined with hybrid replication, data placement policies, and eviction/promotion.]

Publications

[1] N. S. Islam, M. W. Rahman, J. Jose, R. Rajachandrasekar, H. Wang, H. Subramoni, C. Murthy, and D. K. Panda. High-Performance RDMA-based Design of HDFS over InfiniBand. In Proceedings of SC, November 2012.
[2] N. S. Islam, X. Lu, M. W. Rahman, and D. K. Panda. SOR-HDFS: A SEDA-based Approach to Maximize Overlapping in RDMA-Enhanced HDFS. HPDC '14, Short Paper, June 2014.
[3] N. S. Islam, X. Lu, M. W. Rahman, D. Shankar, and D. K. Panda. Triple-H: A Hybrid Approach to Accelerate HDFS on HPC Clusters with Heterogeneous Storage Architecture. CCGrid '15, May 2015.
[4] N. S. Islam, M. W. Rahman, X. Lu, D. Shankar, and D. K. Panda. Performance Characterization and Acceleration of In-Memory File Systems for Hadoop and Spark Applications on HPC Clusters. IEEE BigData '15, October 2015.
[5] N. S. Islam, D. Shankar, X. Lu, M. W. Rahman, and D. K. Panda. Accelerating I/O Performance of Big Data Analytics on HPC Clusters through RDMA-based Key-Value Store. ICPP '15, September 2015.
[6] N. S. Islam, M. W. Rahman, X. Lu, and D. K. Panda. High Performance Design for HDFS with Byte-Addressability of NVM and RDMA. ICS '16, June 2016.

[Figure residue: legend labels "Hierarchical" and "Flat"; chart of block write time (ms) on 1GigE vs. IPoIB (QDR); callouts: 448 ms, 59%.]

Conclusion and Future Work

• The proposed framework improves the communication and I/O performance of HDFS and leads to maximized overlapping and storage-space savings on modern clusters.
• In the future, we would like to propose efficient data access strategies for Hadoop and Spark applications on HPC systems.
• We will also evaluate the proposed designs with different Big Data applications.

Batch Applications

• CloudBurst: HDFS 60.24 s vs. HHH 48.3 s
• MR-MS Polygraph: Lustre 939 s vs. HHH-L 196 s

Introduction

[Figure: HDFS architecture, showing the Client.]

• HDFS is the underlying storage for many Big Data processing frameworks such as Hadoop MapReduce, HBase, Hive, and Spark.
• HDFS has been widely adopted by reputed organizations such as Facebook and Yahoo to store petabytes of data.
• Replication is the primary fault-tolerance mechanism in HDFS; replication enhances data locality and read throughput and makes the recovery process faster in case of failure.
• HDFS uses sockets for communication and cannot fully leverage the benefits of high-performance interconnects such as InfiniBand.
• I/O performance in HDFS, as well as the large amount of local disk space required by the tri-replicated data blocks, is a major concern for HDFS deployments.
• HDFS cannot take advantage of the heterogeneous storage devices available on HPC clusters.

[Figure callouts (TestDFSIO): 31%, 24%, 53%, 64%.]

Software Distribution

• The designs proposed in this research are available to the community in the RDMA for Apache Hadoop package from the High-Performance Big Data (HiBD) project (http://hibd.cse.ohio-state.edu).
• As of September 2016, more than 18,200 downloads (by more than 190 organizations in 26 countries) have taken place from the project's website.

Hybrid HDFS with Heterogeneous Storage (Triple-H)

• Two modes: (1) Default (HHH) and (2) Lustre-Integrated (HHH-L)
• RAM Disk- and SSD-based buffer-cache
• Placement policies to efficiently utilize the heterogeneous storage devices, with hybrid replication
• Eviction/promotion based on data usage patterns (see the sketch below)
• Lustre-Integrated mode: Lustre-based fault tolerance
• Hierarchical caching, hybrid placement
• Key-value store-based burst buffer
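The placement plus eviction/promotion idea can be pictured with a small sketch: frequently accessed data is kept in the fast tiers, cold data is demoted toward HDD or Lustre. This is only an illustration of the policy style, not the Triple-H implementation; the access counter and the thresholds below are invented for the example.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

public class TieredPlacementSketch {

    // Storage tiers used by Triple-H, fastest to slowest.
    enum Tier { RAM_DISK, SSD, HDD, LUSTRE }

    private final Map<String, Integer> accessCount = new ConcurrentHashMap<>();

    // Record an access to a block; the observed usage pattern drives promotion/eviction.
    void touch(String blockId) {
        accessCount.merge(blockId, 1, Integer::sum);
    }

    // Pick a tier for a block from its usage. The thresholds (8 and 3) are
    // arbitrary example values, not Triple-H's actual policy parameters.
    Tier placement(String blockId, boolean lustreIntegratedMode) {
        int hits = accessCount.getOrDefault(blockId, 0);
        if (hits >= 8) return Tier.RAM_DISK;   // hot: promote into the buffer-cache
        if (hits >= 3) return Tier.SSD;        // warm
        // Cold data is evicted toward local HDD, or to Lustre in HHH-L mode,
        // where the parallel file system also provides fault tolerance.
        return lustreIntegratedMode ? Tier.LUSTRE : Tier.HDD;
    }

    public static void main(String[] args) {
        TieredPlacementSketch policy = new TieredPlacementSketch();
        for (int i = 0; i < 10; i++) policy.touch("blk_0001");
        System.out.println(policy.placement("blk_0001", false));  // RAM_DISK
        System.out.println(policy.placement("blk_0002", true));   // LUSTRE (cold block, HHH-L mode)
    }
}
```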

[Figures: Triple-H storage hierarchies: data staged through RAM / RAM Disk, SSD, and HDD, backed by the global file system (Lustre).]

HDFS with heterogeneous storage:
• HDFS cannot efficiently utilize the heterogeneous storage devices available on HPC clusters.
• The limitation comes from the existing placement policies and their ignorance of data usage patterns.

Triple-H proposes a hybrid approach to utilize the heterogeneous storage devices efficiently [3, 4, 5].

Storage used (GB): HDFS 360, Lustre 120, HHH-L 240. [Callout: reduced by 54%.]

Iterative Application (KMeans): HDFS 708 s, Tachyon 672 s, HHH 618 s.
Selective caching: cache the output of intermediate iterations.

[Figure label: Sort on SDSC Gordon (Spark).]

Research Framework

[Framework diagram:]
• Big Data applications, workloads, and benchmarks (Hadoop MapReduce, HBase)
• High-performance file system and I/O middleware:
  – RDMA-Enhanced HDFS (maximized stage overlapping)
  – Hybrid HDFS with in-memory and heterogeneous storage (advanced data placement, selective caching for iterative applications)
  – KV-Store (Memcached)-based burst buffer
  – Leveraging NVM for Big Data I/O
  – Enhanced support for fast analytics
• Networking technologies/protocols (InfiniBand, 10/40/100 GigE, RDMA)
• Storage technologies (HDD, SSD, RAM Disk, and NVM) and parallel file system (Lustre)

NVM-based HDFS

• For in-memory storage in HDFS, persistence is challenging; computation and I/O compete for memory.
• NVM (byte-addressable and non-volatile) is emerging in HPC systems; it is critical to rethink the HDFS architecture to exploit NVM along with RDMA.

Design:
• RDMA over NVM: D (DRAM)-to-N (NVM), N-to-D, and N-to-N
• NVM-based I/O (block access and memory access)
• Hybrid design (NVM with SSD); see the sketch below
• Co-design (cost-effectiveness, use cases)

Use cases:
• Store only Write-Ahead Logs (WALs) in NVM
• Store only the job output of Spark in NVM
• NVFS as a burst buffer for Spark over Lustre
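One way to read the hybrid NVM-with-SSD design and its use cases is as a routing decision over data categories: only selected data (for example WALs or Spark job output) is placed on NVM, while bulk block data stays on SSD. The sketch below is a hypothetical illustration of that idea, not the NVFS policy; the category names and the rule are invented for the example.

```java
public class HybridNvmSsdSketch {

    // Illustrative data categories drawn from the poster's use cases.
    enum DataKind { WRITE_AHEAD_LOG, SPARK_JOB_OUTPUT, BURST_BUFFER_STAGING, BLOCK_DATA }

    enum Device { NVM, SSD }

    // Hypothetical routing rule for the hybrid design: only the selected
    // categories land on NVM; everything else goes to SSD.
    static Device route(DataKind kind) {
        switch (kind) {
            case WRITE_AHEAD_LOG:       // store only WALs in NVM
            case SPARK_JOB_OUTPUT:      // store only Spark job output in NVM
            case BURST_BUFFER_STAGING:  // NVFS used as a burst buffer over Lustre
                return Device.NVM;
            default:
                return Device.SSD;      // bulk block data stays on SSD
        }
    }

    public static void main(String[] args) {
        System.out.println(route(DataKind.WRITE_AHEAD_LOG));  // NVM
        System.out.println(route(DataKind.BLOCK_DATA));       // SSD
    }
}
```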

Applications and Benchmarks (co-design): MapReduce, Spark, HBase.

[Figure: NVM- and RDMA-aware HDFS (NVFS): the DFSClient with an RDMA Sender; the DataNode with an RDMA Receiver and RDMA Replicator, and a Reader/Writer that reaches NVM through the NVFS-BlkIO and NVFS-MemIO paths alongside SSDs.]

[Figure: TestDFSIO on SDSC Comet: average throughput (MBps) for write and read, HDFS (56 Gbps) vs. NVFS (56 Gbps); callout: 1.2x.]

[Figure: Spark PageRank on SDSC Comet: I/O time (s) vs. number of pages (50, 5000, 500000), Lustre (56 Gbps) vs. NVFS (56 Gbps); callout: 24%.]

[Figure: HBase 100% insert on SDSC Comet: throughput (ops/s) vs. cluster size : number of records (8:800K, 16:1600K, 32:3200K), HDFS (56 Gbps) vs. NVFS (56 Gbps); callout: 21%.]

• Exploit the byte-addressability of NVM for HDFS communication and I/O [6]
• Re-design the HDFS storage architecture with memory semantics (see the sketch below)
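A minimal sketch of what memory semantics means for the storage path, assuming the NVM device is exposed as a memory-mappable (DAX-style) file: data is updated with byte-granular operations on a mapped region instead of block-oriented stream writes. This only illustrates the access model; it is not the NVFS-MemIO implementation, and the file path handling is a placeholder.

```java
import java.io.IOException;
import java.nio.ByteBuffer;
import java.nio.MappedByteBuffer;
import java.nio.channels.FileChannel;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.nio.file.StandardOpenOption;

public class MemorySemanticsSketch {
    public static void main(String[] args) throws IOException {
        // Placeholder location; on a real system this would be a file on a
        // DAX-mounted NVM device (e.g., under /mnt/pmem).
        Path nvmFile = Paths.get(args.length > 0 ? args[0] : "nvm-block-sketch.bin");

        try (FileChannel ch = FileChannel.open(nvmFile,
                StandardOpenOption.CREATE, StandardOpenOption.READ, StandardOpenOption.WRITE)) {

            // Memory-access path: map the region once, then update it with
            // byte-granular puts instead of issuing whole block writes.
            MappedByteBuffer region = ch.map(FileChannel.MapMode.READ_WRITE, 0, 4096);
            region.put("packet header".getBytes());
            region.putLong(64, System.nanoTime());   // byte-addressable update at offset 64
            region.force();                          // flush the mapped region to the device

            // Block-access path, for contrast, goes through the channel:
            ch.write(ByteBuffer.wrap("bulk block data".getBytes()), 4096);
        }
    }
}
```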

[Figure callout: reduced by 2.4x; Spark TeraGen on SDSC Gordon.]

Gain for CloudBurst: 19% over HDFS. Gain for MR-MS Polygraph: 79% over Lustre. HHH-L offers better data locality than Lustre.
Gain over HDFS: 13%; gain over Tachyon: 9%.

[Figure: total throughput (MBps) for write and read, HDFS (FDR) vs. HHH (FDR); callouts: increased by 7x, increased by 2x.]

HHH-L reduces the local storage requirement.
As I/O bottlenecks are reduced, RDMA offers larger benefits than IPoIB and 10GigE.

[Figure: communication time (s) vs. file size (2-10 GB) for 10GigE, IPoIB (QDR), and RDMA (QDR).]

[Figure: Enhanced DFSIO on TACC Stampede: aggregated throughput (MBps) vs. cluster size :: data size (GB) (16::64, 32::128, 64::256) for IPoIB (FDR), RDMA (FDR), and RDMA-SEDA (FDR).]

[Figure: HBase on TACC Stampede: throughput (Kops/s) vs. number of records (8, 16, 32 million) for IPoIB (FDR), RDMA (FDR), and RDMA-SEDA (FDR).]

[Figure: TestDFSIO on TACC Stampede: execution time (s) vs. data size (20, 40, 60 GB) for HDFS, Lustre, and HHH-L.]

[Figure: total throughput (MBps) vs. data size (5-20 GB) for 10GigE, IPoIB (QDR), and RDMA (QDR), each with 1 HDD and 2 HDDs.]

[Figure: execution time (s) vs. cluster size : data size (8:50, 16:100, 32:200 GB) for HDFS, Tachyon, and HHH.]

This research is supported in part by National Science Foundation grant #IIS-1447804.
