High Performance File System and I/O Middleware Design for Big Data on HPC Clusters
Nusrat Sharmin Islam
Advisor: Dhabaleswar K. (DK) Panda
Network-Based Computing Laboratory, http://nowlab.cse.ohio-state.edu
Acknowledgements
High-Performance Big Data (HiBD): http://hibd.cse.ohio-state.edu
Introduction
Research Framework
HDFS over RDMA
Hybrid HDFS with Heterogeneous Storage
NVM-based HDFS
Publications
Conclusion and Future Work
The Ohio State University, College of Engineering
(Figure: RDMA-enhanced HDFS architecture — Applications → HDFS → Java Socket Interface over the 1/10 GigE or IPoIB network, or Java Native Interface (JNI) → Verbs over RDMA-capable networks (IB, 10GE/iWARP, RoCE, ...); Write operations go through the OSU design, others use the socket path)
Design features:
– RDMA-based HDFS write and replication
– JNI layer bridges Java-based HDFS with the native communication library
– On-demand connection setup
– Java DirectBuffer ensures zero-copy data transfer
Enables high-performance RDMA communication while supporting the traditional socket interface [1]
(Chart callout: reduced by 30%)
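The on-demand connection setup listed among the design features above can be sketched as a lazy connection cache. All names here (`DataNodeConnection`, `OnDemandConnectionCache`) are illustrative stand-ins, not the actual HDFS/RDMA library API; the point is only that the expensive setup (an RDMA queue-pair exchange in the real design) happens once per peer, on first use.

```python
# Hypothetical sketch: connections are created lazily and cached per DataNode.
class DataNodeConnection:
    def __init__(self, address):
        self.address = address
        self.sent = []            # packets "transmitted" over this connection

    def send(self, packet):
        self.sent.append(packet)

class OnDemandConnectionCache:
    def __init__(self):
        self._conns = {}
        self.setup_count = 0      # counts how many expensive setups happened

    def get(self, address):
        if address not in self._conns:          # connect on first use only
            self._conns[address] = DataNodeConnection(address)
            self.setup_count += 1
        return self._conns[address]

cache = OnDemandConnectionCache()
for pkt in range(5):
    cache.get("datanode-1:50010").send(pkt)     # one setup, five sends
cache.get("datanode-2:50010").send(0)           # second peer, second setup
```

Repeated writes to the same DataNode reuse the cached connection, so setup cost is paid once per peer rather than once per block.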
(Figure: sequential data processing in the OBOT architecture vs. overlapping in the SEDA-based approach — in default HDFS (OBOT), packets Pkt1,1 ... PktK,N of blocks 1..K pass through the Read, Packet Processing, Replication, and I/O stages one after another; the SEDA-based approach overlaps these stages across packets)
• Default HDFS adopts a One-Block-One-Thread (OBOT) architecture to process data sequentially; a good trade-off between simplicity and performance
• In the OBOT architecture, incoming data packets must wait for the I/O stage of the previous packet to complete before they are read and processed
  – Limits HDFS from fully utilizing the hardware capabilities
• In RDMA-enhanced HDFS, the bottleneck moves from data transmission to data persistence
• Staged Event-Driven Architecture (SEDA): a high-throughput design approach for Internet services
  – Decomposes complex processing logic into a set of stages connected by queues
• SEDA-based approach for HDFS Write [2]
  – Re-designs the internal software architecture from OBOT to SEDA
  – Maximizes overlapping among the different stages
  – Incorporates RDMA-based communication
• Four stages: (1) Read, (2) Packet Processing, (3) Replication, (4) I/O
• The number of threads in each stage is tuned so that RDMA-enhanced HDFS makes maximum use of system resources
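The SEDA idea described above can be sketched as a toy pipeline: each stage owns a queue and a small thread pool, so packets of different blocks can occupy different stages concurrently instead of being serialized OBOT-style. Stage names mirror the poster (Read, Packet Processing, Replication, I/O); the "work" here is just string tagging, a stand-in for the real stage logic.

```python
# Minimal SEDA-style pipeline: stages connected by queues, each with its own
# thread pool, illustrating how packets overlap across stages.
import queue
import threading

class Stage:
    def __init__(self, work_fn, out_queue=None, num_threads=2):
        self.work_fn = work_fn
        self.in_queue = queue.Queue()
        self.out_queue = out_queue            # successor stage's queue
        for _ in range(num_threads):
            threading.Thread(target=self._run, daemon=True).start()

    def _run(self):
        while True:
            pkt = self.in_queue.get()
            result = self.work_fn(pkt)
            if self.out_queue is not None:    # hand off to the next stage
                self.out_queue.put(result)
            self.in_queue.task_done()

results = []
# Wire the pipeline back to front so each stage knows its successor's queue.
io_stage = Stage(lambda p: results.append(p + ":io"))
replication = Stage(lambda p: p + ":repl", io_stage.in_queue)
processing = Stage(lambda p: p + ":proc", replication.in_queue)
read = Stage(lambda p: p + ":read", processing.in_queue)

for blk in range(3):                          # 3 blocks x 4 packets each
    for pkt in range(4):
        read.in_queue.put(f"blk{blk}.pkt{pkt}")

# Draining the queues in pipeline order guarantees every packet has passed
# through all four stages (each stage enqueues downstream before task_done()).
for stage in (read, processing, replication, io_stage):
    stage.in_queue.join()
```

Tuning `num_threads` per stage corresponds to the thread-count tuning the poster mentions: I/O-heavy stages can get more threads than CPU-light ones.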
SEDA-HDFS
(Figure: Triple-H architecture — Applications on top of Triple-H, which combines heterogeneous storage, hybrid replication, data placement policies, and eviction/promotion over RAM Disk, SSD, HDD, and Lustre)
[1] N. S. Islam, M. W. Rahman, J. Jose, R. Rajachandrasekar, H. Wang, H. Subramoni, C. Murthy, and D. K. Panda, High-Performance RDMA-based Design of HDFS over InfiniBand, SC, November 2012
[2] N. S. Islam, X. Lu, M. W. Rahman, and D. K. Panda, SOR-HDFS: A SEDA-based Approach to Maximize Overlapping in RDMA-Enhanced HDFS, HPDC '14, Short Paper, June 2014
[3] N. S. Islam, X. Lu, M. W. Rahman, D. Shankar, and D. K. Panda, Triple-H: A Hybrid Approach to Accelerate HDFS on HPC Clusters with Heterogeneous Storage Architecture, CCGrid '15, May 2015
[4] N. S. Islam, M. W. Rahman, X. Lu, D. Shankar, and D. K. Panda, Performance Characterization and Acceleration of In-Memory File Systems for Hadoop and Spark Applications on HPC Clusters, IEEE BigData '15, October 2015
[5] N. S. Islam, D. Shankar, X. Lu, M. W. Rahman, and D. K. Panda, Accelerating I/O Performance of Big Data Analytics on HPC Clusters through RDMA-based Key-Value Store, ICPP '15, September 2015
[6] N. S. Islam, M. W. Rahman, X. Lu, and D. K. Panda, High Performance Design for HDFS with Byte-Addressability of NVM and RDMA, ICS '16, June 2016
(Chart: Block Write Time (ms) for 1GigE and IPoIB (QDR); callouts: 448 ms, 59%)
• The proposed framework improves the communication and I/O performance of HDFS and leads to maximized overlapping and storage-space savings on modern clusters
• In the future, we would like to propose efficient data-access strategies for Hadoop and Spark applications on HPC systems
• We will also evaluate our proposed designs using different Big Data applications
Batch Applications:
– CloudBurst: HDFS 60.24 s, HHH 48.3 s
– MR-MS Polygraph: Lustre 939 s, HHH-L 196 s
(Figure: HDFS architecture, with Client)
• HDFS is the underlying storage for many Big Data processing frameworks, such as Hadoop MapReduce, HBase, Hive, and Spark
• HDFS has been widely adopted by organizations like Facebook and Yahoo! to store petabytes of data
• Replication is the primary fault-tolerance mechanism in HDFS; replication enhances data locality and read throughput and makes recovery faster in case of failure
• HDFS uses sockets for communication and cannot fully leverage the benefits of high-performance interconnects like InfiniBand
• I/O performance, as well as the large amount of local disk space required by tri-replicated data blocks, is a major concern for HDFS deployments
• HDFS cannot take advantage of the heterogeneous storage devices available on HPC clusters
(Chart callouts: 31%, 24%, 53%, 64% — TestDFSIO)

Software Distribution:
• The designs proposed in this research are available to the community in the RDMA for Apache Hadoop package from the High-Performance Big Data (HiBD) project (http://hibd.cse.ohio-state.edu)
• As of September '16, more than 18,200 downloads (by more than 190 organizations in 26 countries) have taken place from the project's website
• Two modes: (1) Default (HHH), (2) Lustre-Integrated (HHH-L)
• RAM Disk- and SSD-based buffer-cache
• Placement policies to efficiently utilize the heterogeneous storage devices; hybrid replication
• Eviction/promotion based on data usage pattern
• Lustre-Integrated mode: Lustre-based fault tolerance
• Caching: hierarchical; placement: hybrid
• Key-Value Store-based burst buffer
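The placement and eviction/promotion policies can be sketched as follows. This is an illustrative toy, not the actual Triple-H code: a block lands on the fastest tier with free space (RAM Disk → SSD → HDD), and when a fast tier fills up, its least-accessed block is demoted one tier down; capacities and access counts are made-up.

```python
# Greedy hybrid placement over a tiered storage hierarchy, with usage-based
# demotion of cold blocks from fast tiers.
class Tier:
    def __init__(self, name, capacity_blocks):
        self.name = name
        self.capacity = capacity_blocks
        self.blocks = {}                       # block_id -> access count

    def has_room(self):
        return len(self.blocks) < self.capacity

tiers = [Tier("ramdisk", 2), Tier("ssd", 4), Tier("hdd", 100)]

def place(block_id):
    for tier in tiers:                         # fastest tier first
        if tier.has_room():
            tier.blocks[block_id] = 0
            return tier.name
    raise RuntimeError("all tiers full")

def demote_coldest():
    # Eviction based on usage pattern: push the least-accessed block of a
    # full tier down to the next (larger, slower) tier.
    for upper, lower in zip(tiers, tiers[1:]):
        if not upper.has_room() and lower.has_room():
            coldest = min(upper.blocks, key=upper.blocks.get)
            lower.blocks[coldest] = upper.blocks.pop(coldest)

placed = [place(b) for b in ("b1", "b2", "b3")]  # b1, b2 fill the RAM disk
tiers[0].blocks["b2"] = 7                        # b2 is hot, b1 stays cold
demote_coldest()                                 # cold b1 drops to SSD
```

Promotion would be the mirror image: a block in a lower tier whose access count crosses a threshold is moved up when room becomes available.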
(Figure: storage hierarchy — RAM, RAM Disk, SSD, HDD, and the global file system (Lustre))

HDFS with Heterogeneous Storage:
• HDFS cannot efficiently utilize the heterogeneous storage devices available on HPC clusters
• The limitation comes from the existing placement policies, which ignore data usage patterns
Triple-H proposes a hybrid approach to utilize the heterogeneous storage devices efficiently [3, 4, 5]
Storage used (GB): HDFS 360, Lustre 120, HHH-L 240 — reduced by 54%
Iterative Application (KMeans): HDFS 708 s, Tachyon 672 s, HHH 618 s
Selective caching: cache the output of intermediate iterations
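The selective-caching idea can be sketched under one assumption: iteration i+1 of an iterative job such as KMeans reads only the output of iteration i, so older intermediate outputs can be dropped from the in-memory layer as soon as they are superseded. The class below is a made-up illustration, not the actual design.

```python
# Selective caching for iterative jobs: keep only the latest intermediate
# output in memory, evicting the stale predecessor on every store.
class SelectiveCache:
    def __init__(self):
        self.memory = {}                        # iteration -> cached output

    def store_iteration(self, iteration, output):
        self.memory.pop(iteration - 1, None)    # evict the superseded output
        self.memory[iteration] = output

cache = SelectiveCache()
data = [0, 1, 2, 3]
for i in range(1, 6):                 # five KMeans-like iterations
    data = [x + 1 for x in data]      # stand-in for one iteration of work
    cache.store_iteration(i, data)
```

The memory footprint stays at one iteration's worth of output regardless of how many iterations run, which is what lets the in-memory layer serve iterative applications without exhausting RAM.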
(Chart: Sort with Spark on SDSC Gordon)
(Figure: Research framework — Big Data applications, workloads, and benchmarks (Hadoop MapReduce, HBase, Spark) run on the proposed High-Performance File System and I/O Middleware: RDMA-enhanced HDFS with maximized stage overlapping; hybrid HDFS with in-memory and heterogeneous storage, advanced data placement, selective caching for iterative applications, a KV-Store (Memcached)-based burst buffer, and enhanced support for fast analytics; and leveraging NVM for Big Data I/O. The middleware builds on networking technologies/protocols (InfiniBand, 10/40/100 GigE, RDMA), storage technologies (HDD, SSD, RAM Disk, and NVM), and a parallel file system (Lustre).)
• For in-memory storage in HDFS, persistence is challenging; computation and I/O compete for memory
• NVM (byte-addressable and non-volatile) is emerging in HPC systems; it is critical to rethink the HDFS architecture to exploit NVM along with RDMA
• RDMA over NVM: D(DRAM)-to-N(NVM), N-to-D, N-to-N
• NVM-based I/O (block access, memory access)
• Hybrid design (NVM with SSD)
• Co-design (cost-effectiveness, use cases)
• Stores only Write-Ahead Logs (WALs) to NVM
• Stores only the job output of Spark to NVM
• NVFS as a burst buffer for Spark over Lustre
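The hybrid NVM-with-SSD routing can be sketched as a simple dispatcher (illustrative only, not the NVFS implementation): latency-critical categories such as HBase write-ahead logs and Spark job output go to the small byte-addressable NVM tier with memory semantics, while bulk data goes to SSD via block I/O. The category names are assumptions for the example.

```python
# Hypothetical routing table for the hybrid NVM + SSD design: a few
# latency-critical data categories get NVM with memory access, the rest
# falls back to SSD with block access.
NVM_CATEGORIES = {"hbase-wal", "spark-job-output"}

def route(category):
    if category in NVM_CATEGORIES:
        return ("nvm", "memory-access")
    return ("ssd", "block-access")

routes = {c: route(c) for c in ("hbase-wal", "spark-job-output", "hdfs-block")}
```

Keeping the NVM set small reflects the cost-effectiveness point in the co-design bullet: NVM capacity is scarce and expensive, so only data that benefits most from byte-addressable persistence is placed there.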
Applications and Benchmarks: MapReduce, Spark, HBase (co-design)

(Figure: NVM- and RDMA-aware HDFS (NVFS) — the DFS client's RDMA sender talks to the DataNode's RDMA receiver and replicator; the DataNode's reader/writer offers NVFS-BlkIO and NVFS-MemIO paths over NVM and SSDs)
(Chart: TestDFSIO on SDSC Comet — average throughput (MBps) for Write and Read; NVFS (56 Gbps) vs. HDFS (56 Gbps), callout: 1.2x)
(Chart: Spark PageRank on SDSC Comet — I/O time (s) vs. number of pages (50; 5,000; 500,000) for Lustre (56 Gbps) and NVFS (56 Gbps); callout: 24%)
(Chart: HBase 100% insert on SDSC Comet — throughput (ops/s) vs. cluster size:no. of records (8:800K, 16:1600K, 32:3200K) for HDFS (56 Gbps) and NVFS (56 Gbps); callout: 21%)
• Exploits the byte-addressability of NVM for HDFS communication and I/O [6]
• Re-designs the HDFS storage architecture with memory semantics
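The memory-semantics access model can be illustrated with an anonymous mmap standing in for a byte-addressable NVM region (a real deployment would map a DAX/persistent-memory device; this sketch only shows the access model, not persistence).

```python
# Memory-semantics I/O sketch: data is read and written at arbitrary byte
# offsets via loads/stores into a mapped region, instead of going through
# block-granular read()/write() calls.
import mmap

region = mmap.mmap(-1, 4096)          # pretend: a 4 KB NVM region

def mem_write(offset, payload):
    # Byte-addressable store: no block alignment, no read-modify-write cycle.
    region[offset:offset + len(payload)] = payload

def mem_read(offset, length):
    return bytes(region[offset:offset + length])

mem_write(100, b"hdfs-packet")
```

This is the access model the NVFS-MemIO path exploits; the NVFS-BlkIO path keeps conventional block-granular I/O for devices and data that do not benefit from byte addressing.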
(Chart: Spark TeraGen on SDSC Gordon — reduced by 2.4x)
Gain for CloudBurst: 19% over HDFS; gain for MR-MS Polygraph: 79% over Lustre. HHH-L offers better data locality than Lustre.
Gain over HDFS: 13%; gain over Tachyon: 9%
(Chart: total throughput (MBps) for Write and Read — HDFS (FDR) vs. HHH (FDR); callouts: increased by 7x, increased by 2x)
HHH-L reduces the local storage requirement.
As I/O bottlenecks are reduced, RDMA offers larger benefits than IPoIB and 10GigE.
(Chart: communication time (s) vs. file size (GB: 2–10) for 10GigE, IPoIB (QDR), and RDMA (QDR))
(Chart: Enhanced DFSIO on TACC Stampede — aggregated throughput (MBps) vs. cluster size::data size (GB) (16::64, 32::128, 64::256) for IPoIB (FDR), RDMA (FDR), and RDMA-SEDA (FDR))
(Chart: HBase on TACC Stampede — throughput (K ops/s) vs. number of records in millions (8, 16, 32) for IPoIB (FDR), RDMA (FDR), and RDMA-SEDA (FDR))
(Chart: TestDFSIO on TACC Stampede — execution time (s) vs. data size (GB: 20, 40, 60) for HDFS, Lustre, and HHH-L)
(Chart: total throughput (MBps) vs. data size (GB: 5, 10, 15, 20) for 10GigE, IPoIB (QDR), and RDMA (QDR), each with 1 or 2 HDDs)
(Chart: execution time (s) vs. cluster size:data size (GB) (8:50, 16:100, 32:200) for HDFS, Tachyon, and HHH)
This research is supported in part by National Science Foundation grant #IIS-1447804.