rmalloc() and rpipe() – a uGNI-based Distributed Remote Memory Allocator and Access Library for One-sided Messaging
Udayanga Wickramasinghe, Indiana University
Andrew Lumsdaine, Pacific Northwest National Laboratory
Overview
§ Motivation
§ Design/System Implementation
§ Evaluation
§ Future Work
RDMA Network Communication
Network Op: Kernel + CPU direct
RDMA: Kernel + CPU bypass, Zero Copy
Designed for one-sided communication!!
One-sided Communication
Advantages
§ Great for random access + irregular data patterns
§ Less overhead / high performance
Disadvantages
§ Explicit synchronization – separate from the data path!!
RDMA Challenges – Communication
[Diagram: sender and receiver pin/register their buffers with the NIC, exchange registration info, then communicate]
§ Buffer Pin/Registration
§ Rendezvous
§ Model-imposed overheads
RDMA Challenges – Synchronization
[Diagram: exposure epoch on the target and access epoch on the origin, each bounded by Barrier/Fence, with communication in between]
How to make reads and updates visible? "in-use" / "re-use"
RDMA Challenges – Dynamic Memory Management
Cluster-wide allocations → costly in a dynamic context, i.e. PGAS
RDMA Challenges – Programming
[Diagram: overlapping RDMA PUTs to 0x1F0000 while the target Loads and Increments the same address → Data Race!!! Delivery completion vs buffer re-use]
§ Enforcing "in-use"/"re-use" semantics
  – Flow control – credit based, counter based, polling (CQ based)
§ Enforcing completion semantics
  – MPI 3.0 Active/Passive – barriers, fence, lock, unlock, flush
  – GAS/PGAS based (SHMEM, X10, Titanium) – futures, barriers, locks, actions
  – GASNet-like (RDMA) libraries – user has to implement
§ Explicit and complex to implement for applications!!
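The credit-based flow control listed above can be sketched in a few lines. This is a hypothetical illustration (the `sender_t` type and function names are not part of rpipe): the sender spends one credit per in-flight RDMA write, and regains it when the corresponding completion event is polled from the CQ.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical credit-based flow control sketch (not the rpipe API). */
typedef struct { int credits; } sender_t;

bool try_send(sender_t *s) {
    if (s->credits == 0) return false;  /* all buffers still "in-use" */
    s->credits--;                       /* one more write in flight */
    return true;
}

void on_completion(sender_t *s) {
    s->credits++;                       /* a buffer is "re-use" again */
}
```

With two credits, a third send blocks until a completion event restores a credit.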
Challenges – Programming
§ Low-overhead, high-throughput communication?
  – Eliminate unnecessary overheads.
§ Dynamic on-demand RDMA memory?
  – Allocate/de-allocate with heuristics support.
  – Less coherence traffic and maybe better utilization.
§ Scalable synchronization?
  – Completion and buffer in-use/re-use.
§ RDMA programming abstractions for applications?
  – No explicit synchronization – let middleware transparently handle it.
  – Expose light-weight RDMA-ready memory and operations.
Challenges – Summary
How do rmalloc()/rpipe() meet these challenges?
Problem: Low communication overhead
Key Idea: Fast path (MMIO vs doorbell) network operation (in uGNI) with synchronized updates.

Problem: Dynamic RDMA memory management
Key Idea: Per-endpoint RDMA dynamic heap → heuristics + asymmetric allocation.

Problem: Synchronization
Key Idea: Notification Flags with Polling (NFP).

Problem: Programmability
Key Idea: A familiar two-level abstraction → allocator (rmalloc) + stream-like channel (rpipe) → no explicit synchronization.
Overview
§ Motivation
§ Design/System Implementation
§ Evaluation
§ Future Work
System Overview
High Performance RDMA Channel
§ Expose zero-copy RDMA ops
§ Interface(s)
  • rread()
  • rwrite()
Enable Implicit Synchronization
§ NFP (Notification Flags with Polling)
Allocates RDMA Memory
§ Returns network-compatible memory
§ Dynamic asymmetric heap for RDMA
§ Interface(s)
  • rmalloc()
Allocation policies
§ Next-fit, First-fit
Network Backend
§ Cray specific – uGNI
§ MPI 3.0 based (portability layer)
Cray uGNI
§ FMA/BTE support
§ Memory registration
§ CQ handling
“rmalloc”
Asymmetric heaps across the cluster – 0 or more for each endpoint pair – dynamically created
“rmalloc” Allocation
Next-fit heuristic – return the next available RDMA heap segment
rmalloc instance: L - local heap, S - shadow heap, R - remote heap [legend: unused / used]
Synchronization → a special bootstrap rpipe
“rmalloc” Allocation
Best-fit heuristic – find the smallest possible RDMA heap segment
rmalloc instance: L - local heap, S - shadow heap, R - remote heap [legend: unused / used]
“rmalloc” Allocation
Worst-fit heuristic – find the largest possible RDMA heap segment
rmalloc instance: L - local heap, S - shadow heap, R - remote heap [legend: unused / used]
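The three placement heuristics above can be sketched over a toy free list. This is an illustrative simplification (the `seg_t` layout and function names are hypothetical, not the rmalloc internals): next-fit resumes scanning from the last allocation point, best-fit picks the smallest segment that fits, worst-fit the largest.

```c
#include <assert.h>
#include <stddef.h>

/* Toy free-list segment: offset, length, in-use bit. */
typedef struct { size_t off, len; int used; } seg_t;

/* next-fit: resume scanning from where the last allocation stopped */
int next_fit(seg_t *h, int n, int *cursor, size_t sz) {
    for (int k = 0; k < n; k++) {
        int i = (*cursor + k) % n;
        if (!h[i].used && h[i].len >= sz) { *cursor = i; return i; }
    }
    return -1;
}

/* best-fit: smallest free segment that still fits */
int best_fit(seg_t *h, int n, size_t sz) {
    int best = -1;
    for (int i = 0; i < n; i++)
        if (!h[i].used && h[i].len >= sz &&
            (best < 0 || h[i].len < h[best].len)) best = i;
    return best;
}

/* worst-fit: largest free segment that fits */
int worst_fit(seg_t *h, int n, size_t sz) {
    int worst = -1;
    for (int i = 0; i < n; i++)
        if (!h[i].used && h[i].len >= sz &&
            (worst < 0 || h[i].len > h[worst].len)) worst = i;
    return worst;
}
```

Next-fit is the cheapest per allocation (no full scan on average), which is consistent with the evaluation later in the deck showing it outperforming best/worst-fit.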
“rmalloc” Implementation
rmalloc_descriptor → manages local and remote virtual memory
rfree()/rmalloc() synchronization
§ When to synchronize? Buffer "in-use"/"re-use"
  – Two options; use both for different allocation modes
  • At allocation time → latency (i.e. rmalloc())
  • At de-allocation time → throughput (i.e. rfree())
§ Deferred synchronization by rfree() → next-fit
  – Coalesce tags from a sorted free list
  – rmalloc updates state by RDMA into the coalesced tag list on the remote side
§ Immediate synchronization by rmalloc() → best-fit OR worst-fit
  – Using a special bootstrap rpipe to synchronize on each allocation
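The deferred rfree() coalescing can be sketched as a single pass over an offset-sorted tag list. This is a hypothetical illustration (the `tag_t` layout is assumed, not the actual descriptor format): touching free tags are merged, so the next rmalloc can push one compact tag list to the remote side instead of many small updates.

```c
#include <assert.h>
#include <stddef.h>

/* A free tag: offset and length of a freed region (offset-sorted). */
typedef struct { size_t off, len; } tag_t;

/* Merge adjacent tags in place; returns the new tag count. */
int coalesce(tag_t *t, int n) {
    if (n == 0) return 0;
    int w = 0;
    for (int i = 1; i < n; i++) {
        if (t[w].off + t[w].len == t[i].off)
            t[w].len += t[i].len;   /* touching: extend current tag */
        else
            t[++w] = t[i];          /* gap: keep as a separate tag */
    }
    return w + 1;
}
```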
"rpipe" – rwrite()
§ Completion Queue (CQ): light-weight events by the NIC/HCA
1. Initiate RDMA Write. Source buffer → "in-use".
2. Probe Local CQ for completion. Zero-copy source data to the target.
3. Write to the flag just after the data.
4. Probe Local CQ success. Source buffer → "re-use".
5. Probe flag success. Target buffer is ready for loads/ops.
6. Remote host consumes the data (Load 0x1F0000). Source yet to know buffer → rfree().
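The NFP steps above can be sketched on a shared buffer. This is a simulation only (the RDMA transfer is replaced by memcpy and all names are hypothetical): because the flag is placed just after the payload and written after it, seeing the flag set implies the data before it has already landed.

```c
#include <assert.h>
#include <string.h>

enum { EMPTY = 0, READY = 1 };

typedef struct {
    char data[64];
    volatile int flag;   /* notification flag, written after the data */
} nfp_buf_t;

/* Writer side: payload first (step 2), then the flag (step 3). */
void nfp_write(nfp_buf_t *dst, const char *src, size_t n) {
    memcpy(dst->data, src, n);  /* simulated zero-copy RDMA write */
    dst->flag = READY;          /* flag write follows the data */
}

/* Target side: poll the flag (step 5) instead of a separate sync op. */
int nfp_poll(nfp_buf_t *dst) {
    return dst->flag == READY;
}
```

On a real NIC the data-before-flag ordering must be guaranteed by the fabric or enforced explicitly (e.g. the PUT_W_SYNC variant measured later); this sketch omits that detail.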
"rpipe" – rread()
1. Store data into the target (Store 0x1F0000, val). Target buffer → "in-use".
2. Write to the source flag. Data is now ready for rread()!!
3. RDMA zero-copy to the source.
4. Write to the flag just after the data.
5. Probe Local CQ for completion.
Implementing rpipe(), rwrite() and rread()
§ A rpipe is created between two endpoints.
  – A uGNI-based Control Message (FMA Cmsg) network to lazily initialize the rpipe, i.e. GNI_CqCreate, GNI_EpCreate, GNI_EpBind
§ Implements rwrite(), rread() in uGNI
  – Small/medium messages – FMA (Fast Memory Access)
  – Large messages – BTE (Byte Transfer Engine)
§ MPI portability layer
  – rpipe with MPI-3.0 windows + passive RMA
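The FMA/BTE split above amounts to a size-based dispatch. A minimal sketch, assuming a 4K crossover (the threshold the rwrite() evaluation later reports; the names here are illustrative, not real uGNI calls):

```c
#include <assert.h>
#include <stddef.h>

typedef enum { PATH_FMA, PATH_BTE } rdma_path_t;

/* Small/medium messages use FMA (CPU-driven stores to the NIC);
 * large messages use the BTE DMA engine. 4096 is an assumed
 * crossover, matching the FMA-to-BTE switch observed at ~4K. */
rdma_path_t pick_path(size_t msg_size) {
    return msg_size < 4096 ? PATH_FMA : PATH_BTE;
}
```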
Overview
§ Motivation
§ Design/System Implementation
§ Evaluation
§ Future Work
rpipe programming

    #define PIPE_WIDTH 8
    int main() {
        rpipe_t rp;
        rinit(&rank, NULL);
        // create a Half-Duplex RMA pipe
        rpipe(rp, peer, iswriter, PIPE_WIDTH, HD_PIPE);
        raddr_t addr;
        int *ptr;
        if (iswriter) {
            addr = rmalloc(rp, sizeof(int));  // remote allocate
            ptr = rmem(rp, addr);
            *ptr = SEND_VAL;
            rwrite(rp, addr);                 // rpipe ops
        } else {
            rread(rp, addr, sizeof(int));     // rpipe ops
            ptr = rmem(rp, addr);
            rfree(addr);  // free remote memory – release immediately after use!!
        }
    }
Experimentation Setup
Cray XC30 [Aries] / DragonFly
Big Red II+ – 550 nodes / Rpeak 280 Tflops – 10 GB/s uni-directional, 15 GB/s bi-directional BW
Perf baseline → MPI / OSU Benchmark
Small/Medium Message Latency Comparison
[Plot: Latency/operation (us) vs Message Size (1–8192 bytes) for MPI_RMA_FENCE, MPI_RMA_PASSIVE (lock_once), MPI_RMA_PSCW, MPI_SEND, RMA_PIPE_WRITE (uGNI_FMA_2PUTS), RMA_PIPE_WRITE (uGNI_FMA_PUT_W_SYNC)]
§ Default alloc = Next-fit
§ FMA_PUT_W_SYNC – up to 6X speedup over MPI RMA
§ rpipe PUT_W_SYNC(s) < rpipe 2PUT(s)
Large Message Latency Comparison – rwrite()
[Plot: Latency/operation (us) vs Message Size (1 B – 4 MB) for MPI_RMA_PASSIVE, MPI_RMA_PSCW, RMA_PIPE_WRITE]
§ rpipe uGNI(s) ≈ rpipe MPI(s) when s > 4K
  – s ≥ 4K → FMA to BTE switch
§ small/medium ≈ 0.65 us
Large Message Latency Comparison – rread()
[Plot: Latency/operation (us) vs Message Size (1 B – 4 MB) for MPI_RMA_PASSIVE, MPI_RMA_PSCW, RMA_PIPE_READ]
§ rpipe uGNI(s) ≈ rpipe MPI(s) when s > 1K
  – s < 4 B → FMA_FETCH atomic (AMO)
  – s < 1K → FMA_FETCH + PSYNC
  – s ≥ 1K → FMA to BTE switch (BTE_FETCH + FMA_PSYNC)
§ small/medium ≈ 2.14 us
Rpipe Scales ...
[Plot: Latency/operation (us) vs Nodes (2–32) for RPIPE_WRITE at message sizes 8 B – 8 KB with heap bounds 64, 4K, and unbounded]
§ "unbounded" → allocator has the full rpipe available for all zero-copy operations
§ Scaling up to 32 nodes – randomized rwrite() – 0.65–3.8 us avg latency
Allocation Algorithms
[Plots: Latency/operation (us) vs Message Size (1–512 bytes), one panel each for Next-fit, Best-fit, and Worst-fit; series MPI3.0_RMA, RPIPE_WRITE(1K), RPIPE_WRITE(256K), RPIPE_WRITE(unbounded)]
§ Zero-copy write vs heuristics
  – Next-fit allocator has better performance
  – 1X – 3.5X slowdown for Best/Worst-fit
§ L = Latency: L[Next-fit] < L[MPI] < L[Worst-fit]
Overview
§ Motivation
§ Design/System Implementation
§ Evaluation
§ Future Work
Future Work
§ Platform support / automated synchronization
§ High performance RMA kernels
  – Active messages / neighbor / collective communication
§ Aggregated rpipes
  – Leverage zero copy / eliminate hidden buffers
  • i.e. collectives
  • Possible throughput and memory utilization gains
§ Irregular RMA and memory disaggregation
Questions?
Thank You!