rmalloc() and rpipe() – a uGNI-based Distributed Remote Memory Allocator and Access Library for One-sided Messaging
Udayanga Wickramasinghe, Indiana University
Andrew Lumsdaine, Pacific Northwest National Laboratory
Overview
§ Motivation
§ Design/System Implementation
§ Evaluation
§ Future Work
RDMA Network Communication
Network Op: Kernel + CPU direct
RDMA: Kernel + CPU bypass, Zero Copy
Designed for one-sided communication!!
One-sided Communication
Advantages
§ Great for random access + irregular data patterns
§ Less overhead / high performance
Disadvantages
§ Explicit synchronization – separate from the data path!!
RDMA Challenges – Communication
[Diagram: sender and receiver pin/register their buffers with the NIC, exchange registration info, then communicate]
§ Buffer Pin/Registration
§ Rendezvous
§ Model-imposed overheads
RDMA Challenges – Synchronization
[Diagram: exposure epoch on the target and access epoch on the origin, each bounded by Barrier/Fence, with communication in between]
How to make reads and updates visible? "in-use" / "re-use"
RDMA Challenges – Dynamic Memory Management
Cluster-wide allocations → costly in a dynamic context, i.e. PGAS
RDMA Challenges – Programming
[Diagram: overlapping RDMA PUTs to 0x1F0000 while the target Loads and Increments the same address → Data Race!!! Delivery completion vs buffer re-use]
§ Enforcing "in-use"/"re-use" semantics
  – Flow control – credit based, counter based, polling (CQ based)
§ Enforcing completion semantics
  – MPI 3.0 Active/Passive – barriers, fence, lock, unlock, flush
  – GAS/PGAS based (SHMEM, X10, Titanium) – futures, barriers, locks, actions
  – GASNet-like (RDMA) libraries – user has to implement
§ Explicit and complex to implement for applications!!
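The credit-based flow control listed above can be sketched in a few lines. This is a hypothetical illustration (the `sender_t` type and function names are not part of rpipe): the sender spends one credit per in-flight RDMA write, and regains it when the corresponding completion event is polled from the CQ.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical credit-based flow control sketch (not the rpipe API). */
typedef struct { int credits; } sender_t;

bool try_send(sender_t *s) {
    if (s->credits == 0) return false;  /* all buffers still "in-use" */
    s->credits--;                       /* one more write in flight */
    return true;
}

void on_completion(sender_t *s) {
    s->credits++;                       /* a buffer is "re-use" again */
}
```

With two credits, a third send blocks until a completion event restores a credit.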
Challenges – Programming
§ Low-overhead, high-throughput communication?
  – Eliminate unnecessary overheads.
§ Dynamic on-demand RDMA memory?
  – Allocate/de-allocate with heuristics support.
  – Less coherence traffic and maybe better utilization.
§ Scalable synchronization?
  – Completion and buffer in-use/re-use.
§ RDMA programming abstractions for applications?
  – No explicit synchronization – let middleware transparently handle it.
  – Expose light-weight RDMA-ready memory and operations.
Challenges – Summary
How do rmalloc()/rpipe() meet these challenges?
Problem: Low communication overhead
Key Idea: Fast path (MMIO vs doorbell) network operation (in uGNI) with synchronized updates.

Problem: Dynamic RDMA memory management
Key Idea: Per-endpoint RDMA dynamic heap → heuristics + asymmetric allocation.

Problem: Synchronization
Key Idea: Notification Flags with Polling (NFP).

Problem: Programmability
Key Idea: A familiar two-level abstraction → allocator (rmalloc) + stream-like channel (rpipe) → no explicit synchronization.
Overview
§ Motivation
§ Design/System Implementation
§ Evaluation
§ Future Work
System Overview
High Performance RDMA Channel
§ Expose zero-copy RDMA ops
§ Interface(s)
  • rread()
  • rwrite()
Enable Implicit Synchronization
§ NFP (Notification Flags with Polling)
Allocates RDMA Memory
§ Returns network-compatible memory
§ Dynamic asymmetric heap for RDMA
§ Interface(s)
  • rmalloc()
Allocation policies
§ Next-fit, First-fit
Network Backend
§ Cray specific – uGNI
§ MPI 3.0 based (portability layer)
Cray uGNI
§ FMA/BTE support
§ Memory registration
§ CQ handling
“rmalloc”
Asymmetric heaps across the cluster – 0 or more for each endpoint pair – dynamically created
“rmalloc” Allocation
Next-fit heuristic – return the next available RDMA heap segment
rmalloc instance: L - local heap, S - shadow heap, R - remote heap [legend: unused / used]
Synchronization → a special bootstrap rpipe
“rmalloc” Allocation
Best-fit heuristic – find the smallest possible RDMA heap segment
rmalloc instance: L - local heap, S - shadow heap, R - remote heap [legend: unused / used]
“rmalloc” Allocation
Worst-fit heuristic – find the largest possible RDMA heap segment
rmalloc instance: L - local heap, S - shadow heap, R - remote heap [legend: unused / used]
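The three placement heuristics above can be sketched over a toy free list. This is an illustrative simplification (the `seg_t` layout and function names are hypothetical, not the rmalloc internals): next-fit resumes scanning from the last allocation point, best-fit picks the smallest segment that fits, worst-fit the largest.

```c
#include <assert.h>
#include <stddef.h>

/* Toy free-list segment: offset, length, in-use bit. */
typedef struct { size_t off, len; int used; } seg_t;

/* next-fit: resume scanning from where the last allocation stopped */
int next_fit(seg_t *h, int n, int *cursor, size_t sz) {
    for (int k = 0; k < n; k++) {
        int i = (*cursor + k) % n;
        if (!h[i].used && h[i].len >= sz) { *cursor = i; return i; }
    }
    return -1;
}

/* best-fit: smallest free segment that still fits */
int best_fit(seg_t *h, int n, size_t sz) {
    int best = -1;
    for (int i = 0; i < n; i++)
        if (!h[i].used && h[i].len >= sz &&
            (best < 0 || h[i].len < h[best].len)) best = i;
    return best;
}

/* worst-fit: largest free segment that fits */
int worst_fit(seg_t *h, int n, size_t sz) {
    int worst = -1;
    for (int i = 0; i < n; i++)
        if (!h[i].used && h[i].len >= sz &&
            (worst < 0 || h[i].len > h[worst].len)) worst = i;
    return worst;
}
```

Next-fit is the cheapest per allocation (no full scan on average), which is consistent with the evaluation later in the deck showing it outperforming best/worst-fit.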
“rmalloc” Implementation
rmalloc_descriptor → manages local and remote virtual memory
rfree()/rmalloc() synchronization
§ When to synchronize? Buffer "in-use"/"re-use"
  – Two options; use both for different allocation modes
  • At allocation time → latency (i.e. rmalloc())
  • At de-allocation time → throughput (i.e. rfree())
§ Deferred synchronization by rfree() → next-fit
  – Coalesce tags from a sorted free list
  – rmalloc updates state by RDMA into the coalesced tag list on the remote side
§ Immediate synchronization by rmalloc() → best-fit OR worst-fit
  – Using a special bootstrap rpipe to synchronize on each allocation
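The deferred rfree() coalescing can be sketched as a single pass over an offset-sorted tag list. This is a hypothetical illustration (the `tag_t` layout is assumed, not the actual descriptor format): touching free tags are merged, so the next rmalloc can push one compact tag list to the remote side instead of many small updates.

```c
#include <assert.h>
#include <stddef.h>

/* A free tag: offset and length of a freed region (offset-sorted). */
typedef struct { size_t off, len; } tag_t;

/* Merge adjacent tags in place; returns the new tag count. */
int coalesce(tag_t *t, int n) {
    if (n == 0) return 0;
    int w = 0;
    for (int i = 1; i < n; i++) {
        if (t[w].off + t[w].len == t[i].off)
            t[w].len += t[i].len;   /* touching: extend current tag */
        else
            t[++w] = t[i];          /* gap: keep as a separate tag */
    }
    return w + 1;
}
```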
"rpipe" – rwrite()
§ Completion Queue (CQ): light-weight events by the NIC/HCA
1. Initiate RDMA Write. Source buffer → "in-use".
2. Probe Local CQ for completion. Zero-copy source data to the target.
3. Write to the flag just after the data.
4. Probe Local CQ success. Source buffer → "re-use".
5. Probe flag success. Target buffer is ready for loads/ops.
6. Remote host consumes the data (Load 0x1F0000). Source yet to know buffer → rfree().
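The NFP steps above can be sketched on a shared buffer. This is a simulation only (the RDMA transfer is replaced by memcpy and all names are hypothetical): because the flag is placed just after the payload and written after it, seeing the flag set implies the data before it has already landed.

```c
#include <assert.h>
#include <string.h>

enum { EMPTY = 0, READY = 1 };

typedef struct {
    char data[64];
    volatile int flag;   /* notification flag, written after the data */
} nfp_buf_t;

/* Writer side: payload first (step 2), then the flag (step 3). */
void nfp_write(nfp_buf_t *dst, const char *src, size_t n) {
    memcpy(dst->data, src, n);  /* simulated zero-copy RDMA write */
    dst->flag = READY;          /* flag write follows the data */
}

/* Target side: poll the flag (step 5) instead of a separate sync op. */
int nfp_poll(nfp_buf_t *dst) {
    return dst->flag == READY;
}
```

On a real NIC the data-before-flag ordering must be guaranteed by the fabric or enforced explicitly (e.g. the PUT_W_SYNC variant measured later); this sketch omits that detail.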
"rpipe" – rread()
1. Store data into the target (Store 0x1F0000, val). Target buffer → "in-use".
2. Write to the source flag. Data is now ready for rread()!!
3. RDMA zero-copy to the source.
4. Write to the flag just after the data.
5. Probe Local CQ for completion.
Implementing rpipe(), rwrite() and rread()
§ A rpipe is created between two endpoints.
  – A uGNI-based Control Message (FMA Cmsg) network to lazily initialize the rpipe, i.e. GNI_CqCreate, GNI_EpCreate, GNI_EpBind
§ Implements rwrite(), rread() in uGNI
  – Small/medium messages – FMA (Fast Memory Access)
  – Large messages – BTE (Byte Transfer Engine)
§ MPI portability layer
  – rpipe with MPI-3.0 windows + passive RMA
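The FMA/BTE split above amounts to a size-based dispatch. A minimal sketch, assuming a 4K crossover (the threshold the rwrite() evaluation later reports; the names here are illustrative, not real uGNI calls):

```c
#include <assert.h>
#include <stddef.h>

typedef enum { PATH_FMA, PATH_BTE } rdma_path_t;

/* Small/medium messages use FMA (CPU-driven stores to the NIC);
 * large messages use the BTE DMA engine. 4096 is an assumed
 * crossover, matching the FMA-to-BTE switch observed at ~4K. */
rdma_path_t pick_path(size_t msg_size) {
    return msg_size < 4096 ? PATH_FMA : PATH_BTE;
}
```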
Overview
§ Motivation
§ Design/System Implementation
§ Evaluation
§ Future Work
rpipe programming

    #define PIPE_WIDTH 8
    int main() {
        rpipe_t rp;
        rinit(&rank, NULL);
        // create a Half-Duplex RMA pipe
        rpipe(rp, peer, iswriter, PIPE_WIDTH, HD_PIPE);
        raddr_t addr;
        int *ptr;
        if (iswriter) {
            addr = rmalloc(rp, sizeof(int));  // remote allocate
            ptr = rmem(rp, addr);
            *ptr = SEND_VAL;
            rwrite(rp, addr);                 // rpipe ops
        } else {
            rread(rp, addr, sizeof(int));     // rpipe ops
            ptr = rmem(rp, addr);
            rfree(addr);  // free remote memory – release immediately after use!!
        }
    }
Experimentation Setup
Cray XC30 [Aries] / DragonFly
Big Red II+ – 550 nodes / Rpeak 280 Tflops – 10 GB/s uni-directional, 15 GB/s bi-directional BW
Perf baseline → MPI / OSU Benchmark
Small/Medium Message Latency Comparison
[Plot: Latency/operation (us) vs Message Size (1–8192 bytes) for MPI_RMA_FENCE, MPI_RMA_PASSIVE (lock_once), MPI_RMA_PSCW, MPI_SEND, RMA_PIPE_WRITE (uGNI_FMA_2PUTS), RMA_PIPE_WRITE (uGNI_FMA_PUT_W_SYNC)]
§ Default alloc = Next-fit
§ FMA_PUT_W_SYNC – up to 6X speedup over MPI RMA
§ rpipe PUT_W_SYNC(s) < rpipe 2PUT(s)
Large Message Latency Comparison – rwrite()
[Plot: Latency/operation (us) vs Message Size (1 B – 4 MB) for MPI_RMA_PASSIVE, MPI_RMA_PSCW, RMA_PIPE_WRITE]
§ rpipe uGNI(s) ≈ rpipe MPI(s) when s > 4K
  – s ≥ 4K → FMA to BTE switch
§ small/medium ≈ 0.65 us
Large Message Latency Comparison – rread()
[Plot: Latency/operation (us) vs Message Size (1 B – 4 MB) for MPI_RMA_PASSIVE, MPI_RMA_PSCW, RMA_PIPE_READ]
§ rpipe uGNI(s) ≈ rpipe MPI(s) when s > 1K
  – s < 4 B → FMA_FETCH atomic (AMO)
  – s < 1K → FMA_FETCH + PSYNC
  – s ≥ 1K → FMA to BTE switch (BTE_FETCH + FMA_PSYNC)
§ small/medium ≈ 2.14 us
Rpipe Scales ...
[Plot: Latency/operation (us) vs Nodes (2–32) for RPIPE_WRITE at message sizes 8 B – 8 KB with heap bounds 64, 4K, and unbounded]
§ "unbounded" → allocator has the full rpipe available for all zero-copy operations
§ Scaling up to 32 nodes – randomized rwrite() – 0.65–3.8 us avg latency
Allocation Algorithms
[Plots: Latency/operation (us) vs Message Size (1–512 bytes), one panel each for Next-fit, Best-fit, and Worst-fit; series MPI3.0_RMA, RPIPE_WRITE(1K), RPIPE_WRITE(256K), RPIPE_WRITE(unbounded)]
§ Zero-copy write vs heuristics
  – Next-fit allocator has better performance
  – 1X – 3.5X slowdown for Best/Worst-fit
§ L = Latency: L[Next-fit] < L[MPI] < L[Worst-fit]
Overview
§ Motivation
§ Design/System Implementation
§ Evaluation
§ Future Work
Future Work
§ Platform support / automated synchronization
§ High performance RMA kernels
  – Active messages / neighbor / collective communication
§ Aggregated rpipes
  – Leverage zero copy / eliminate hidden buffers
  • i.e. collectives
  • Possible throughput and memory utilization gains
§ Irregular RMA and memory disaggregation
Questions?
Thank You!