
High Performance File System and I/O Middleware Design for Big Data on HPC Clusters

Nusrat Sharmin Islam

Advisor: Dhabaleswar K. (DK) Panda

Network-Based Computing Laboratory, http://nowlab.cse.ohio-state.edu

Acknowledgements

High-Performance Big Data (HiBD), http://hibd.cse.ohio-state.edu

Introduction

Research Framework

Hybrid HDFS with Heterogeneous Storage

Conclusion and Future Work

HDFS over RDMA

NVM-based HDFS

Publications

The Ohio State University, College of Engineering

(Figure: HDFS-over-RDMA architecture — Applications; HDFS; Java Socket Interface and Java Native Interface (JNI); 1/10 GigE, IPoIB network vs. Verbs over RDMA-capable networks (IB, 10GE/iWARP, RoCE); the OSU design handles Write, Others go through sockets)

Design Features:
–  RDMA-based HDFS write and replication
–  JNI layer bridges Java-based HDFS with the native communication library
–  On-demand connection setup
–  Java Direct Buffer ensures zero-copy data transfer (see the sketch below)

Enables high performance RDMA communication while supporting the traditional socket interface [1]
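To illustrate the JNI bridging and zero-copy direct-buffer pattern described above, here is a minimal, hypothetical Java sketch; the class RdmaBridge, the native library name rdmacomm, and the rdmaSend signature are illustrative assumptions, not the actual OSU implementation.

```java
// Hypothetical sketch of the JNI bridging pattern: a Java-side shim hands a
// direct ByteBuffer to a native RDMA library so the payload can be sent
// without an extra Java-heap-to-native copy. All names are illustrative.
import java.nio.ByteBuffer;

public class RdmaBridge {
    static {
        System.loadLibrary("rdmacomm"); // hypothetical native communication library
    }

    // The native side obtains the buffer's memory via GetDirectBufferAddress()
    // and posts it to the RDMA send queue, avoiding a copy.
    private native int rdmaSend(ByteBuffer buf, int length, long connHandle);

    public void sendPacket(byte[] packet, long connHandle) {
        // Direct buffers live outside the Java heap, which is what makes the
        // zero-copy hand-off to the native layer possible.
        ByteBuffer direct = ByteBuffer.allocateDirect(packet.length);
        direct.put(packet);
        direct.flip();
        int rc = rdmaSend(direct, packet.length, connHandle);
        if (rc != 0) {
            throw new RuntimeException("RDMA send failed with code " + rc);
        }
    }
}
```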

(Figure: sequential data processing in the OBOT architecture vs. overlapping in the SEDA-based approach — packets Pkt1,1 … PktK,N of Task1 … TaskK flow through the Read, Packet Processing, Replication, and I/O stages; Default HDFS (OBOT) shown for comparison; annotation: reduced by 30%)

v  SEDA-based approach for HDFS Write [2]
Ø  Re-designs the internal software architecture from OBOT to SEDA
Ø  Maximizes overlapping among the different stages
Ø  Incorporates RDMA-based communication

v  Default HDFS adopts the One-Block-One-Thread (OBOT) architecture to process data sequentially; a good trade-off between simplicity and performance
v  In the OBOT architecture, incoming data packets must wait for the I/O stage of the previous packet to complete before they are read and processed
Ø  Limits HDFS from fully utilizing the hardware capabilities

v  In RDMA-enhanced HDFS, the bottleneck moves from data transmission to data persistence
v  Staged Event-Driven Architecture (SEDA): a high-throughput design approach for Internet services
Ø  Decomposes complex processing logic into a set of stages connected by queues

v  Four stages: (1) Read, (2) Packet Processing, (3) Replication, (4) I/O; a minimal sketch of this staging pattern follows below
v  The number of threads in each stage is tuned so that RDMA-enhanced HDFS makes maximum use of the system resources
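A minimal, self-contained sketch of the SEDA staging pattern described above, assuming packets are plain byte arrays; the class, queue names, and per-stage thread counts are illustrative assumptions, not the actual SOR-HDFS code or tuning values.

```java
// Sketch of SEDA: stages connected by queues, each drained by its own
// fixed-size thread pool, so the Read, Packet Processing, Replication, and
// I/O phases of different packets overlap instead of running sequentially.
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.LinkedBlockingQueue;

public class SedaPipelineSketch {
    // Start `threads` workers that take packets from `in`, do this stage's
    // work, and hand each packet to the next stage via `out`.
    static void stage(String name, int threads,
                      BlockingQueue<byte[]> in, BlockingQueue<byte[]> out) {
        ExecutorService pool = Executors.newFixedThreadPool(threads);
        for (int t = 0; t < threads; t++) {
            pool.submit(() -> {
                try {
                    while (true) {
                        byte[] pkt = in.take();          // next packet for this stage
                        // ... per-stage work for `name` would happen here ...
                        if (out != null) out.put(pkt);   // hand off; stages overlap
                    }
                } catch (InterruptedException e) {
                    Thread.currentThread().interrupt();  // allow clean shutdown
                }
            });
        }
    }

    public static void main(String[] args) throws InterruptedException {
        BlockingQueue<byte[]> readQ = new LinkedBlockingQueue<>();
        BlockingQueue<byte[]> procQ = new LinkedBlockingQueue<>();
        BlockingQueue<byte[]> replQ = new LinkedBlockingQueue<>();
        BlockingQueue<byte[]> ioQ   = new LinkedBlockingQueue<>();
        // The per-stage thread counts are the tuning knobs mentioned above.
        stage("Read",             1, readQ, procQ);
        stage("PacketProcessing", 2, procQ, replQ);
        stage("Replication",      2, replQ, ioQ);
        stage("I/O",              4, ioQ,   null);
        readQ.put(new byte[64 * 1024]); // inject one 64 KB packet into the pipeline
    }
}
```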

SEDA-HDFS

(Figure: Triple-H architecture — heterogeneous storage with hybrid replication, data placement policies, and eviction/promotion across RAM Disk, SSD, HDD, and Lustre, serving Applications)

[1] N. S. Islam, M. W. Rahman, J. Jose, R. Rajachandrasekar, H. Wang, H. Subramoni, C. Murthy, and D. K. Panda. High-Performance RDMA-based Design of HDFS over InfiniBand. In Proceedings of SC, November 2012.
[2] N. S. Islam, X. Lu, M. W. Rahman, and D. K. Panda. SOR-HDFS: A SEDA-based Approach to Maximize Overlapping in RDMA-Enhanced HDFS. HPDC '14, Short Paper, June 2014.
[3] N. S. Islam, X. Lu, M. W. Rahman, D. Shankar, and D. K. Panda. Triple-H: A Hybrid Approach to Accelerate HDFS on HPC Clusters with Heterogeneous Storage Architecture. CCGrid '15, May 2015.
[4] N. S. Islam, M. W. Rahman, X. Lu, D. Shankar, and D. K. Panda. Performance Characterization and Acceleration of In-Memory File Systems for Hadoop and Spark Applications on HPC Clusters. IEEE BigData '15, October 2015.
[5] N. S. Islam, D. Shankar, X. Lu, M. W. Rahman, and D. K. Panda. Accelerating I/O Performance of Big Data Analytics on HPC Clusters through RDMA-based Key-Value Store. ICPP '15, September 2015.
[6] N. S. Islam, M. W. Rahman, X. Lu, and D. K. Panda. High Performance Design for HDFS with Byte-Addressability of NVM and RDMA. ICS '16, June 2016.

(Figure: hierarchical vs. flat data placement)

(Figure: HDFS block write time (ms) over 1GigE and IPoIB (QDR); annotations: 448 ms, 59%)

v  The proposed framework improves the communication and I/O performance of HDFS and leads to maximized overlapping and storage space savings on modern clusters

v  In the future, we would like to propose efficient data access strategies for Hadoop and Spark applications on HPC systems

v  We will also evaluate our proposed designs using different Big Data applications

Batch Applications

Application       Setting   Time
CloudBurst        HDFS      60.24 s
CloudBurst        HHH       48.3 s
MR-MSPolygraph    Lustre    939 s
MR-MSPolygraph    HHH-L     196 s

(Figure: HDFS architecture, showing the Client)

v  HDFS is the underlying storage for many Big Data processing frameworks such as Hadoop MapReduce, HBase, Hive, and Spark

v  HDFS has been widely adopted by reputed organizations like Facebook and Yahoo to store petabytes of data

v  Replication is the primary fault tolerance mechanism in HDFS; replication enhances data locality and read throughput and makes the recovery process faster in case of failure

v  HDFS uses sockets for communication and cannot fully leverage the benefits of high performance interconnects like InfiniBand

v  I/O performance in HDFS, as well as the large amount of local disk space required by the tri-replicated data blocks, is a major concern for HDFS deployment

v  HDFS cannot take advantage of the heterogeneous storage devices available on HPC clusters

(Figure: TestDFSIO — annotations: 31%, 24%, 53%, 64%)

Software Distribution
v  The designs proposed in this research are available to the community in the RDMA for Apache Hadoop package from the High-Performance Big Data (HiBD) project (http://hibd.cse.ohio-state.edu)
v  As of September '16, more than 18,200 downloads (from more than 190 organizations in 26 countries) have taken place from the project's website

•  Two modes: (1) Default (HHH), (2) Lustre-Integrated (HHH-L)
•  RAM Disk- and SSD-based buffer-cache
•  Placement policies to efficiently utilize the heterogeneous storage devices, Hybrid Replication
•  Eviction/Promotion based on data usage pattern (see the sketch below)
•  Lustre-Integrated mode: Lustre-based fault-tolerance
•  Caching hierarchical, placement hybrid
•  Key-Value Store-based burst buffer
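A hedged sketch of a tier-selection and eviction/promotion policy in the spirit of the bullets above; the thresholds, the Tier enum, and the access-frequency metric are illustrative assumptions, not Triple-H's actual policy.

```java
// Illustrative tiered-placement logic: hot blocks go to the fastest tier with
// free space, cold blocks fall through to HDD (or Lustre in HHH-L mode), and
// blocks migrate up or down as their observed usage pattern changes.
public class TieredPlacementSketch {
    enum Tier { RAM_DISK, SSD, HDD, LUSTRE }

    // Choose a tier for a new block. `accessFrequency` is an assumed metric
    // in [0, 1] summarizing how hot the data is expected to be.
    static Tier placeBlock(double accessFrequency, long freeRamDiskBytes,
                           long blockSize, boolean lustreIntegrated) {
        if (accessFrequency > 0.8 && freeRamDiskBytes >= blockSize) {
            return Tier.RAM_DISK;                 // hottest data, buffer-cache tier
        }
        if (accessFrequency > 0.3) {
            return Tier.SSD;                      // warm data
        }
        return lustreIntegrated ? Tier.LUSTRE : Tier.HDD; // cold data
    }

    // Eviction/promotion based on usage: cold blocks move down the hierarchy,
    // recently hot blocks move up.
    static Tier rebalance(Tier current, double recentAccessRate) {
        if (recentAccessRate < 0.1 && current == Tier.RAM_DISK) return Tier.SSD;
        if (recentAccessRate > 0.7 && current != Tier.RAM_DISK) return Tier.RAM_DISK;
        return current;
    }
}
```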

(Figure: Triple-H storage organization — data flows through RAM, RAM Disk, SSD, and HDD to the global file system (Lustre), shown in hierarchical and flat arrangements)

HDFS with Heterogeneous Storage:
v  HDFS cannot efficiently utilize the heterogeneous storage devices available on HPC clusters
v  The limitation comes from the existing placement policies and their ignorance of data usage patterns

Triple-H proposes a hybrid approach to utilize the heterogeneous storage devices efficiently [3, 4, 5]

Storage Used (GB)
HDFS     360
Lustre   120
HHH-L    240

Reduced by 54%

Iterative Application (K-Means)

HDFS      708 s
Tachyon   672 s
HHH       618 s

Selective caching: cache the output of intermediate iterations

Sort on SDSC Gordon

(Figure: Research framework — Big Data applications, workloads, and benchmarks (Hadoop MapReduce, Spark, HBase) on top of the High Performance File System and I/O Middleware, comprising: RDMA-Enhanced HDFS with Maximized Stage Overlapping; Hybrid HDFS with In-Memory and Heterogeneous Storage, with Advanced Data Placement and Selective Caching for Iterative Applications; KV-Store (Memcached)-based Burst Buffer; Leveraging NVM for Big Data I/O; and Enhanced Support for Fast Analytics — built over the parallel file system (Lustre), storage technologies (HDD, SSD, RAM Disk, and NVM), and networking technologies/protocols (InfiniBand, 10/40/100 GigE, RDMA))

•  RDMA over NVM: D(DRAM)-to-N(NVM), N-to-D, N-to-N
•  NVM-based I/O (block access, memory access)
•  Hybrid design (NVM with SSD); see the sketch below
•  Co-design (cost-effectiveness, use-case)

•  Stores only Write-Ahead Logs (WALs) to NVM
•  Stores only the job output of Spark to NVM
•  NVFS as a burst buffer for Spark over Lustre

•  For in-memory storage in HDFS, persistence is challenging; computation and I/O compete for memory

•  NVM (byte-addressable and non-volatile) is emerging in HPC systems; it is critical to rethink the HDFS architecture to exploit NVM along with RDMA
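A small, hypothetical sketch of the hybrid NVM-with-SSD routing idea above: only latency-critical files (e.g., WALs or Spark job output) are steered to the NVM device, while bulk block data stays on SSD. The mount points and the latency-critical predicate are assumptions for illustration only.

```java
// Illustrative hybrid NVM+SSD router: NVM capacity is scarce, so only data
// the workload marks as latency-critical is placed there; everything else
// goes to the (larger, cheaper) SSD. Paths are assumed mount points.
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;

public class HybridNvmSsdRouter {
    private static final Path NVM_DIR = Paths.get("/mnt/pmem/nvfs"); // assumed NVM mount
    private static final Path SSD_DIR = Paths.get("/mnt/ssd/nvfs");  // assumed SSD mount

    // Pick a target directory based on how latency-sensitive the file is.
    static Path chooseTarget(String fileName, boolean latencyCritical) {
        Path dir = latencyCritical ? NVM_DIR : SSD_DIR;
        return dir.resolve(fileName);
    }

    static void write(String fileName, byte[] data, boolean latencyCritical)
            throws IOException {
        Path target = chooseTarget(fileName, latencyCritical);
        Files.createDirectories(target.getParent()); // ensure the tier directory exists
        Files.write(target, data);                   // persist to the chosen tier
    }
}
```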


(Figure: NVM- and RDMA-aware HDFS (NVFS) DataNode — applications and benchmarks (MapReduce, Spark, HBase) co-designed with a DFSClient RDMA Sender, DataNode RDMA Receivers and Replicator, and a Reader/Writer that accesses NVM through NVFS-BlkIO and NVFS-MemIO, with SSDs alongside)

(Figure: TestDFSIO on SDSC Comet — average throughput (MBps) for Write and Read, HDFS (56 Gbps) vs. NVFS (56 Gbps); 1.2x gain)

(Figure: Spark PageRank on SDSC Comet — I/O time (s) vs. number of pages (50, 5000, 500000), Lustre (56 Gbps) vs. NVFS (56 Gbps); 24% gain)

(Figure: HBase 100% insert on SDSC Comet — throughput (ops/s) vs. cluster size : number of records (8:800K, 16:1600K, 32:3200K), HDFS (56 Gbps) vs. NVFS (56 Gbps); 21% gain)

•  Exploit the byte-addressability of NVM for HDFS communication and I/O [6]

•  Re-design the HDFS storage architecture with memory semantics (see the sketch below)
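A minimal sketch of memory-semantics access to byte-addressable NVM, assuming the device is exposed through a DAX-mounted file system at /mnt/pmem; this illustrates the idea behind NVFS-MemIO, and is not the NVFS code itself.

```java
// With byte-addressable NVM mapped into the address space, packet writes
// become plain memory stores into the mapping (memory semantics), rather
// than write() system calls through the block I/O path (block semantics).
import java.io.IOException;
import java.nio.MappedByteBuffer;
import java.nio.channels.FileChannel;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.nio.file.StandardOpenOption;

public class NvmMemIoSketch {
    public static void main(String[] args) throws IOException {
        Path blockFile = Paths.get("/mnt/pmem/blk_0001"); // assumed DAX-mounted NVM
        try (FileChannel ch = FileChannel.open(blockFile,
                StandardOpenOption.CREATE, StandardOpenOption.READ,
                StandardOpenOption.WRITE)) {
            // Map the HDFS block once; subsequent packets are stored directly
            // into the mapping, exploiting byte-addressability.
            MappedByteBuffer block =
                ch.map(FileChannel.MapMode.READ_WRITE, 0, 128L << 20); // 128 MB block
            byte[] packet = new byte[64 * 1024];
            block.put(packet);  // byte-addressable store of one 64 KB packet
            block.force();      // flush the stores to the persistence domain
        }
    }
}
```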

Reduced by 2.4x

Spark TeraGen on SDSC Gordon

Gain for CloudBurst: 19% over HDFS; gain for MR-MSPolygraph: 79% over Lustre. HHH-L offers better data locality than Lustre.

Gain over HDFS: 13%; gain over Tachyon: 9%

(Figure: total throughput (MBps) for Write and Read — HDFS (FDR) vs. HHH (FDR); annotations: increased by 7x, increased by 2x)

HHH-L reduces the local storage requirement

As I/O bottlenecks are reduced, RDMA offers larger benefits than IPoIB and 10GigE

(Figure: communication time (s) vs. file size (2, 4, 6, 8, 10 GB) for 10GigE, IPoIB (QDR), and RDMA (QDR))

(Figure: Enhanced DFSIO on TACC Stampede — aggregated throughput (MBps) vs. cluster size :: data size (GB) (16::64, 32::128, 64::256) for IPoIB (FDR), RDMA (FDR), and RDMA-SEDA (FDR))

(Figure: HBase on TACC Stampede — throughput (K ops/s) vs. number of records (8, 16, 32 million) for IPoIB (FDR), RDMA (FDR), and RDMA-SEDA (FDR))

(Figure: TestDFSIO on TACC Stampede — execution time (s) vs. data size (20, 40, 60 GB) for HDFS, Lustre, and HHH-L)

(Figure: TestDFSIO — total throughput (MBps) vs. data size (5, 10, 15, 20 GB) for 10GigE-1HDD, 10GigE-2HDD, IPoIB-1HDD (QDR), IPoIB-2HDD (QDR), RDMA-1HDD (QDR), and RDMA-2HDD (QDR))

(Figure: execution time (s) vs. cluster size : data size (GB) (8:50, 16:100, 32:200) for HDFS, Tachyon, and HHH)

This research is supported in part by National Science Foundation grant #IIS-1447804.
