Date post: 11-Oct-2020
rmalloc() and rpipe() – a uGNI-based Distributed Remote Memory Allocator and Access Library for One-sided Messaging
Udayanga Wickramasinghe, Indiana University
Andrew Lumsdaine, Pacific Northwest National Laboratory
Transcript
Page 1: rmalloc() and rpipe() – a uGNI-based Distributed Remote Memory … · 2018. 6. 14. · – Upto 6X speedup MPI RMA § rpipe PUT_W_sync(s) < rpipe 2PUT (s) 39 Large Message Latency

rmalloc() and rpipe() – a uGNI-based Distributed Remote Memory Allocator and Access Library for One-sided Messaging

Udayanga Wickramasinghe
Indiana University

Andrew Lumsdaine
Pacific Northwest National Laboratory

Overview

§ Motivation

§ Design/System Implementation

§ Evaluation

§ Future Work

RDMA Network Communication

Network Op: Kernel + CPU direct

RDMA: Kernel + CPU bypass, Zero Copy

Designed for one-sided communication!!

One-sided Communication

Advantages

§ Great for Random Access + Irregular Data patterns

§ Less Overhead/High Performance

Disadvantages

§ Explicit Synchronization – separate from data-path!!

RDMA Challenges – Communication

[Figure labels: Send, Recv, Pin, NIC, register/match, exchange, comm]

§ Buffer Pin/Registration

§ Rendezvous

§ Model imposed overheads

RDMA Challenges – Synchronization

[Figure labels: register/match, Exposure Epoch, Access Epoch, Barrier/Fence, comm]

How to make reads and updates visible? “in-use”/“re-use”

RDMA Challenges – Dynamic Memory Management

Cluster-wide allocations → costly in a dynamic context, i.e. PGAS

RDMA Challenges – Programming

[Figure labels: register/match, exchange, RDMA PUT 0x1F0000, Load 0x1F0000, Inc 0x1F0000, 1, Data Race!!!]

Delivery completion

Buffer re-use

Challenges – Programming

§ Enforcing “in-use”/“re-use” semantics
  – Flow Control – Credit based, Counter based, polling (CQ based)

§ Enforcing Completion semantics
  – MPI 3.0 Active/Passive – barriers, fence, lock, unlock, flush
  – GAS/PGAS based (SHMEM, X10, Titanium) – futures, barriers, locks, actions
  – GASNet like (RDMA) Libraries – user has to implement

§ Explicit and Complex to implement for applications!!
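To make the "credit based" flow-control option on this slide concrete, here is a minimal single-process sketch in C. It is not taken from the library: the `credit_channel` type and the `credit_*` functions are hypothetical names, and the sketch assumes one credit corresponds to one free buffer slot at the receiver.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical credit counter: one credit per free remote buffer slot. */
typedef struct {
    size_t credits;     /* slots the sender may still use */
    size_t max_credits; /* total slots owned by the receiver */
} credit_channel;

static void credit_init(credit_channel *ch, size_t slots) {
    ch->credits = slots;
    ch->max_credits = slots;
}

/* Sender side: take a credit before issuing a put.
   Returns false when no slot is free, i.e. the sender must back off. */
static bool credit_try_send(credit_channel *ch) {
    if (ch->credits == 0) {
        return false;
    }
    ch->credits--;
    return true;
}

/* Called when the receiver reports that n slots were consumed
   and can be re-used. */
static void credit_return(credit_channel *ch, size_t n) {
    assert(ch->credits + n <= ch->max_credits);
    ch->credits += n;
}
```

The point of the sketch is that a buffer stays "in-use" from `credit_try_send` until the matching `credit_return`, which is exactly the bookkeeping the slide says applications otherwise have to write themselves.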

Challenges – Summary

§ Low overhead, high-throughput communication?
  – Eliminate unnecessary overheads.

§ Dynamic On-demand RDMA Memory?
  – Allocate/de-allocate with heuristics support.
  – Less coherence traffic and maybe better utilization

§ Scalable Synchronization?
  – Completion and Buffer in-use/re-use.

§ RDMA Programming abstractions for applications?
  – No explicit synchronization – let middleware transparently handle it.
  – Expose light-weight RDMA-ready memory and operations.

How rmalloc()/rpipe() meets these Challenges?

Problem | Key Idea
Low Communication Overhead | Fast Path (MMIO vs Doorbell) Network Operation (in uGNI) with synchronized updates.
Dynamic RDMA Memory Mgmt | Per-endpoint RDMA Dynamic Heap → Heuristics + Asymmetric Allocation
Synchronization | Notification Flags with Polling (NFP)
Programmability | A familiar Two-level Abstraction → allocator (rmalloc) + stream-like channel (rpipe) → No explicit synchronization

Overview

§ Motivation

§ Design/System Implementation

§ Evaluation

§ Future Work

System Overview

High Performance RDMA Channel

§ Expose Zero-copy RDMA ops

§ Interface/s
  • rread()
  • rwrite()

Enable Implicit Synchronization

§ NFP (Notified Flags with Polling)

Allocates RDMA memory

§ Returns Network Compatible Memory

§ Dynamic Asymmetric Heap for RDMA

§ Interface/s
  • rmalloc()

Allocation policies

§ Next-fit, First-fit

Network Backend

§ Cray specific – uGNI

§ MPI 3.0 based (portability layer)

Cray uGNI

§ FMA/BTE Support

§ Memory Registration

§ CQ handling

“rmalloc”

Asymmetric heaps across cluster – 0 or more for each endpoint pair – dynamically created

“rmalloc” Allocation

Next-fit heuristic – return next available RDMA heap segment

[Figure: rmalloc instance. L – local heap, S – shadow heap, R – remote heap; segments marked unused or used]

Synchronization → a special bootstrap rpipe

“rmalloc” Allocation

Best-fit heuristic – find smallest possible RDMA heap segment

[Figure: rmalloc instance. L – local heap, S – shadow heap, R – remote heap; segments marked unused or used]

“rmalloc” Allocation

Worst-fit heuristic – find largest possible RDMA heap segment

[Figure: rmalloc instance. L – local heap, S – shadow heap, R – remote heap; segments marked unused or used]
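The three heuristics on these slides (next-fit, best-fit, worst-fit) can be sketched over a flat array of heap segments. This is only an illustration under stated assumptions: the `segment` struct, the `fit_policy` enum, and `pick_segment` are hypothetical and do not reflect the library's internal heap layout, and the sketch ignores splitting and the local/shadow/remote heap distinction.

```c
#include <stddef.h>

/* Hypothetical heap segment descriptor (not the library's real layout). */
typedef struct {
    size_t size; /* segment size in bytes */
    int used;    /* nonzero if currently allocated */
} segment;

typedef enum { NEXT_FIT, BEST_FIT, WORST_FIT } fit_policy;

/* Returns the index of a free segment with size >= want, or -1 if none.
   *cursor is the roving position used by next-fit; it is advanced past
   the chosen segment on success and ignored by the other policies. */
static int pick_segment(const segment *seg, size_t n, size_t want,
                        fit_policy policy, size_t *cursor) {
    if (policy == NEXT_FIT) {
        for (size_t k = 0; k < n; k++) {
            size_t i = (*cursor + k) % n;
            if (!seg[i].used && seg[i].size >= want) {
                *cursor = (i + 1) % n;
                return (int)i;
            }
        }
        return -1;
    }

    int chosen = -1;
    for (size_t i = 0; i < n; i++) {
        if (seg[i].used || seg[i].size < want) {
            continue;
        }
        if (chosen < 0 ||
            (policy == BEST_FIT && seg[i].size < seg[chosen].size) ||
            (policy == WORST_FIT && seg[i].size > seg[chosen].size)) {
            chosen = (int)i;
        }
    }
    return chosen;
}
```

Next-fit stops at the first match after the cursor, while best-fit and worst-fit always scan every segment; that difference in search cost is one plausible reason for the latency gap between the policies reported in the evaluation slides.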

“rmalloc” Implementation

rmalloc_descriptor → manages local and remote virtual memory

rfree()/rmalloc() synchronization

§ When to synchronize? Buffer “in-use/re-use”
  – Two options, use both for different allocation modes
    • At allocation time –> latency (i.e. rmalloc())
    • At de-allocation time –> throughput (i.e. rfree())

§ Deferred synchronization by rfree() → next-fit
  – Coalesce tags from a sorted free list
  – rmalloc updates state by RDMA into coalesced tag list in the remote

§ Immediate synchronization by rmalloc() → best-fit OR worst-fit
  – Using a special bootstrap rpipe to synchronize at each allocated memory
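The "coalesce tags from a sorted free list" step can be illustrated with a small in-place merge of adjacent ranges. This is a sketch, not the library's code: `free_range` and `coalesce` are hypothetical names, and the sketch assumes each tag describes a contiguous byte range and that the list is already sorted by offset with no overlaps.

```c
#include <stddef.h>

/* Hypothetical free-list entry: the byte range [offset, offset + len). */
typedef struct {
    size_t offset;
    size_t len;
} free_range;

/* Merge adjacent ranges in place. The input must be sorted by offset
   and non-overlapping. Returns the number of ranges after merging. */
static size_t coalesce(free_range *r, size_t n) {
    if (n == 0) {
        return 0;
    }
    size_t out = 0;
    for (size_t i = 1; i < n; i++) {
        if (r[out].offset + r[out].len == r[i].offset) {
            r[out].len += r[i].len; /* contiguous: extend current range */
        } else {
            r[++out] = r[i];        /* gap: start a new range */
        }
    }
    return out + 1;
}
```

Under this reading, deferring synchronization to rfree() pays off because several freed buffers collapse into fewer entries before the state is pushed to the remote side in one update.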

“rpipe” – rwrite()

§ Completion Queue (CQ) (light-weight events by NIC/HCA)

1. Initiate RDMA Write. Source buffer → “in-use”

2. Probe Local CQ for completion. Zero-copy source data to target.

3. Write to flag just after data.

4. Probe Local CQ success. Source buffer → “re-use”

5. Probe flag success. Target buffer is ready to load/ops.

6. Remote host consumes data (Load 0x1F0000). Source yet to know buffer → rfree()
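The data-then-flag ordering in the rwrite() steps can be mimicked in shared memory with C11 atomics. This is an analogy only: it uses no uGNI calls, the `nfp_slot` type and `nfp_*` functions are hypothetical, and a `memcpy` plus a release store stand in for the RDMA write and the flag write that the NIC would perform.

```c
#include <stdatomic.h>
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

#define SLOT_BYTES 64

/* Hypothetical target-side slot: payload followed by a notification flag. */
typedef struct {
    unsigned char data[SLOT_BYTES];
    atomic_int flag; /* 0 = empty, 1 = data ready */
} nfp_slot;

static void nfp_init(nfp_slot *slot) {
    memset(slot->data, 0, sizeof slot->data);
    atomic_init(&slot->flag, 0);
}

/* Writer: place the payload first, then publish the flag. The release
   store ensures the payload is visible before the flag reads as 1. */
static bool nfp_write(nfp_slot *slot, const void *src, size_t len) {
    if (len > SLOT_BYTES) {
        return false;
    }
    memcpy(slot->data, src, len);
    atomic_store_explicit(&slot->flag, 1, memory_order_release);
    return true;
}

/* Reader: a single polling attempt. If the flag is set, copy the payload
   out and clear the flag so the slot can be re-used. */
static bool nfp_poll(nfp_slot *slot, void *dst, size_t len) {
    if (len > SLOT_BYTES) {
        return false;
    }
    if (atomic_load_explicit(&slot->flag, memory_order_acquire) != 1) {
        return false;
    }
    memcpy(dst, slot->data, len);
    atomic_store_explicit(&slot->flag, 0, memory_order_release);
    return true;
}
```

A caller would loop on `nfp_poll` until it returns true. The reader never needs a barrier or fence call of its own, which is the sense in which the flag makes synchronization implicit.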

“rpipe” – rread()

1. Store data into target (Store 0x1F0000, val). Target buffer → “in-use”.

2. Write to source flag. Data is now ready for rread()!!

[Figure labels: Store 0x1F0000, val2; rfree()]

3. RDMA Zero-Copy to source.

4. Write to flag just after data.

5. Probe Local CQ for completion.

Implementing rpipe(), rwrite() and rread()

§ A rpipe is created between two endpoints.
  – A uGNI based Control Message (FMA Cmsg) network to lazily initialize the rpipe, i.e. GNI_CqCreate, GNI_EpCreate, GNI_EpBind

§ Implements rwrite(), rread() in uGNI
  – Small/medium messages – FMA (Fast Memory Access)
  – Large messages – BTE (Block Transfer Engine)

§ MPI portability Layer
  – rpipe with MPI-3.0 windows + passive RMA

Overview

§ Motivation

§ Design/System Implementation

§ Evaluation

§ Future Work

rpipe programming

    #define PIPE_WIDTH 8

    int main() {
        rpipe_t rp;
        rinit(&rank, NULL);
        // create a Half Duplex RMA pipe
        rpipe(rp, peer, iswriter, PIPE_WIDTH, HD_PIPE);
        raddr_t addr;
        int *ptr;
        if (iswriter) {
            addr = rmalloc(rp, sizeof(int));  // remote allocate
            ptr = rmem(rp, addr);
            *ptr = SEND_VAL;
            rwrite(rp, addr);                 // rpipe op
        } else {
            rread(rp, addr, sizeof(int));     // rpipe op
            ptr = rmem(rp, addr);
            rfree(addr);                      // free remote memory: release immediately after use!!
        }
    }

Experimentation Setup

§ Cray XC30 [Aries]/DragonFly

§ Big Red II+: 550 nodes / Rpeak 280 Tflops

§ 10 GB/s uni-directional, 15 GB/s bi-directional BW

§ Perf baseline → MPI/OSU Benchmark

Small/Medium Message Latency Comparison

[Figure: latency/operation (us) vs. message size (bytes), 1 to 8192 bytes. Series: MPI_RMA_FENCE, MPI_RMA_PASSIVE (lock_once), MPI_RMA_PSCW, MPI_SEND, RMA_PIPE_WRITE (uGNI_FMA_2PUTS), RMA_PIPE_WRITE (uGNI_FMA_PUTW_SYNC)]

§ Default Alloc = Next-Fit

§ FMA_PUT_W_SYNC
  – Up to 6X speedup over MPI RMA

§ rpipe PUT_W_sync(s) < rpipe 2PUT(s)

Large Message Latency Comparison – rwrite()

[Figure: latency/operation (us) vs. message size (bytes), 1 byte to 4M. Series: MPI_RMA_PASSIVE, MPI_RMA_PSCW, RMA_PIPE_WRITE. Annotation: small/medium 0.65us]

§ rpipe uGNI(s) ≈ rpipe MPI(s) when s > 4K
  – s ≥ 4K → FMA to BTE switch

Large Message Latency Comparison – rread()

[Figure: latency/operation (us) vs. message size (bytes), 1 byte to 4M. Series: MPI_RMA_PASSIVE, MPI_RMA_PSCW, RMA_PIPE_READ. Annotation: small/medium 2.14us]

§ rpipe uGNI(s) ≈ rpipe MPI(s) when s > 1K
  – s < 4b → FMA_FETCH Atomic (AMO)
  – s < 1K → FMA_FETCH + PSYNC
  – s ≥ 1K → FMA to BTE switch (BTE_FETCH + FMA_PSYNC)
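The size thresholds on the rwrite() and rread() latency slides amount to a small dispatch on message size. The sketch below encodes them as two pure functions. The enum and function names are hypothetical, "4b" is read here as 4 bytes and "1K"/"4K" as 1024 and 4096 bytes, which are assumptions, and the real library may pick its paths differently.

```c
#include <stddef.h>

/* Hypothetical labels for the transfer paths named on the slides. */
typedef enum { WRITE_FMA, WRITE_BTE } write_path;
typedef enum {
    READ_AMO_FETCH,           /* FMA_FETCH atomic (AMO) */
    READ_FMA_FETCH_PSYNC,     /* FMA_FETCH + PSYNC */
    READ_BTE_FETCH_FMA_PSYNC  /* BTE_FETCH + FMA_PSYNC */
} read_path;

/* Thresholds as read from the slides (units assumed to be bytes). */
#define WRITE_BTE_THRESHOLD 4096u
#define READ_AMO_LIMIT      4u
#define READ_BTE_THRESHOLD  1024u

/* rwrite(): s >= 4K switches from FMA to BTE. */
static write_path choose_write_path(size_t s) {
    return s >= WRITE_BTE_THRESHOLD ? WRITE_BTE : WRITE_FMA;
}

/* rread(): s < 4 uses an atomic fetch, s < 1K uses FMA_FETCH + PSYNC,
   and anything larger switches to BTE. */
static read_path choose_read_path(size_t s) {
    if (s < READ_AMO_LIMIT) {
        return READ_AMO_FETCH;
    }
    if (s < READ_BTE_THRESHOLD) {
        return READ_FMA_FETCH_PSYNC;
    }
    return READ_BTE_FETCH_FMA_PSYNC;
}
```

Keeping the thresholds as named constants makes it easy to re-tune the switch points for a different interconnect.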

Rpipe Scales ...

[Figure: latency/operation (us) vs. nodes (N), 2 to 32. Series: RPIPE_WRITE(1K)(unbounded), RPIPE_WRITE(64)(unbounded), RPIPE_WRITE(8)(4K), RPIPE_WRITE(8)(64), RPIPE_WRITE(8)(unbounded), RPIPE_WRITE(8K)(unbounded)]

§ “unbounded” → allocator has full rpipe available for all Zero-copy operations

§ Scaling up to 32 nodes – randomized rwrite()
  – 0.65 – 3.8us avg latency

Allocation Algorithms

[Figure, three panels (Next-fit, Best-fit, Worst-fit): latency/operation (us) vs. message size (bytes), 1 to 512 bytes. Series in each panel: MPI3.0_RMA, RPIPE_WRITE(1K), RPIPE_WRITE(256K), RPIPE_WRITE(unbounded)]

§ Zero-copy write vs Heuristics
  – Next-fit allocator has better performance
  – 1X – 3.5X slowdown for Best/Worst-fit

L = Latency

L[Next-fit] < L[MPI] < L[Worst-fit]

Overview

§ Motivation

§ Design/System Implementation

§ Evaluation

§ Future Work

Future Work

§ Platform Support/Automated synchronization

§ High performance RMA Kernels
  – Active messages/Neighbor/collective communication

§ Aggregated rpipes
  – Leverage Zero copy/Eliminate hidden buffers
    • i.e. Collectives
    • Possible throughput, memory utilization gains

§ Irregular RMA and memory disaggregation


Questions?


Thank You!

