
Pushing Bytes: Cloud Scale Big-Data Replication with RDMC

Jonathan Behrens, Ken Birman, Sagar Jha, Edward Tremel
Department of Computer Science, Cornell University

Submission Type: Research

Abstract

Cloud computing frameworks replicate large objects for diverse reasons, often under time-pressure. RDMC (Reliable DMA Multicast) is a reliable data replication protocol that runs at exceptionally high speeds and low latencies, implementing multicast as a pattern of RDMA unicast operations using a novel approach that maximizes concurrency. Users with knowledge of datacenter topology can configure RDMC to use a pattern of data flow matched to the network. In networks with full bisection bandwidth our fastest protocol creates a collection of binomial trees and embeds them on a hypercube overlay, setting speed records while also achieving exceptionally low delivery-time skew.

1 Introduction

Large-scale computing systems often must replicate content among groups of nodes. The need is widespread in cloud computing infrastructures, which copy huge amounts of data when reprovisioning nodes and must update VM and container images each time software is patched or upgraded. Replication patterns are also seen when a task is spread over a set of nodes in parallel machine-learning applications, and when a file or database is accessed by first-tier nodes charged with rapidly responding to client requests. With the growing embrace of IoT, immersive VR and multi-player gaming, replication must satisfy rigid time constraints.

Yet we lack effective general-purpose solutions. Today, cloud infrastructures typically push new images using specialized file copying tools. Content sharing is often handled through some kind of intermediary caching or key-value layer. Even if a scheduler like Hadoop can anticipate that a collection of tasks will read the same file, and tries to launch jobs where the data is cached, if those files are not present in the cache the applications just pull data on a one-by-one basis.

Cloud systems could derive substantial efficiencies by recognizing such interactions as instances of a common pattern, and our hope is that with a standard solution such as RDMC in hand, developers will choose to do so. However, the goal of high-speed replication can lead in two distinct directions. For the present paper, our focus is on reliably pushing bytes: given an initiator with a source object, (1) making large numbers of replicas as rapidly as possible while leveraging RDMA, (2) optimizing for large objects and time-critical data transfer situations, and (3) ensuring that if failure occurs, the application will be notified and can retry the transfer.

The other direction includes offering stronger properties, perhaps packaged as a reliable multicast with strong group membership semantics [11] or an implementation of Paxos [21]. Such systems go beyond RDMC by imposing an ordering on concurrent updates that conflict. They may also persist data into append-only logs [6] or introduce data semantics, for example by focusing on support for key-value stores [17, 20, 23], or on transactions [7]. We believe that RDMC could be useful in systems with such goals (and indeed are exploring that option in our own continuing work), but stronger semantics require synchronization protocols, which can conflict with the goal of achieving the highest possible data rates.

RDMC is an open-source project, written in C++ and accessed through a simple library API. The basic functionality is that of a zero-copy multicast, moving data from a memory region in the source to memory regions designated by the receivers. If an application maps files into memory, RDMC can transfer persistent data from the sender, and similarly one can persist received data by mapping a file and calling fsync. As we look to the future, RDMC should eventually be able to integrate directly with video capture devices, and with new kinds of persistent storage technologies like 3D-XPoint.
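To make that persistence path concrete, here is a minimal sketch of the mmap/fsync pattern just described (our own illustration using standard POSIX calls, with error handling omitted; RDMC itself only sees the mapped region):

    #include <cstddef>
    #include <fcntl.h>
    #include <sys/mman.h>
    #include <unistd.h>

    // Map a file so its pages can serve as an RDMC receive region.
    char* map_receive_file(const char* path, size_t size, int* fd_out) {
        int fd = open(path, O_RDWR | O_CREAT, 0644);
        ftruncate(fd, static_cast<off_t>(size));   // size the backing file
        void* p = mmap(nullptr, size, PROT_READ | PROT_WRITE,
                       MAP_SHARED, fd, 0);
        *fd_out = fd;
        return static_cast<char*>(p);
    }

    // After the receipt upcall, force the received bytes to stable storage.
    void persist_after_receipt(char* region, size_t size, int fd) {
        msync(region, size, MS_SYNC);   // flush the mapped pages
        fsync(fd);                      // then flush remaining file state
    }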

We’ve tested RDMC on a variety of platforms. RDMAhardware is increasingly common in data center settings,and is widely supported by high speed switches and NICs.The feature was first offered on Infiniband networks, butis now available on fast Ethernet (RoCE), and a softwareemulation (SoftRoCE) is available for portability to set-tings where RDMA hardware is lacking. User-level ac-cess to RDMA is possible in non-virtualized Linux andWindows systems, and Microsoft’s Azure platform nowsupports RDMA in its containerized (Mesos) mode.

We start in Section 2 with a precise statement of goals.


In Section 3 we describe the RDMC data dissemination options currently supported. By supporting multiple protocols, RDMC facilitates side-by-side comparisons; in our experimental cluster, a new protocol we call the binomial pipeline dominates.

2 Goals

RDMA (remote direct-memory access) is a zero-copy communication standard supported on a wide range of hardware (as well as by the SoftRoCE software). RDMA is a user-space solution, accessed by creating what are called queue-pairs: lock-free data structures shared between user logic and the network controller (NIC), consisting of a send queue and a receive queue. A send is issued by posting a memory region to the send queue, and receive destinations are indicated similarly. If the receiver hasn’t posted a receive buffer by the time a send is attempted, the sender either retries later or issues a failure. A queue pair also includes a completion queue which is used by the NIC to report the successful completion of transfers, as well as any detected faults.
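As a hedged illustration of that flow using standard libibverbs calls (queue pair, protection domain, and memory registration setup are assumed to already exist):

    #include <infiniband/verbs.h>
    #include <cstdint>

    // Hand a receive buffer to the NIC by posting it to the receive queue.
    void post_receive(ibv_qp* qp, ibv_mr* mr, char* buf, uint32_t len) {
        ibv_sge sge{};
        sge.addr = reinterpret_cast<uintptr_t>(buf);
        sge.length = len;
        sge.lkey = mr->lkey;

        ibv_recv_wr wr{};
        ibv_recv_wr* bad = nullptr;
        wr.sg_list = &sge;
        wr.num_sge = 1;
        ibv_post_recv(qp, &wr, &bad);
    }

    // The completion queue reports finished transfers and detected faults.
    void drain_completions(ibv_cq* cq) {
        ibv_wc wc;
        while (ibv_poll_cq(cq, 1, &wc) > 0) {
            // wc.status is IBV_WC_SUCCESS on success; wc.opcode says
            // whether this completion was for a send or a receive.
        }
    }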

RDMA supports several modes of operation. RDMC makes use of two-sided RDMA operations, which behave similarly to TCP: the sender and receiver bind their respective queue pairs together, creating a session fully implemented by the NIC endpoints. In this mode, once a send and the matching receive are posted, a zero-copy transfer occurs from the sender memory to the receiver’s designated location, reliably and at the full rate the hardware can support. This can be remarkably fast. For example, in our lab, two-sided RDMA rates can approach the full 100Gb/s of the optical layer, far faster than any IP protocol can achieve, and indeed more than 3x what memcpy can achieve for memory-to-memory copying internal to the nodes of our compute cluster. Moreover, speeds up to 1Tb/s are widely expected within a decade.

Although not used in RDMC, RDMA also supports one-sided RDMA reads and writes where one endpoint grants the other permission to perform one-sided reads or writes into a pre-prepared memory region. The initiator of a read or write will see a completion event, but the target isn’t notified at all. Finally, RDMA also supports some unreliable modes of operation, including a mode that resembles IP multicast. However this multicast mechanism leaves it to the user to deal with message loss or reordering, and would require frequent retransmissions and copying.

In the two-sided mode used by RDMC, if nothing fails, data will be moved reliably from user-mode source memory to user-mode receiver memory, and the send-order is preserved. If a hardware fault or an endpoint crash occurs, the hardware reports the failure and breaks the two-sided session (RDMA is not tolerant of Byzantine behavior). Such a situation should be extremely rare: RDMA reliability is similar to that of a memory bus.

void create_group(group_number, root, members, notifications_callback)

void destroy_group(group_number)

void send_message(group_number, data, size)

void post_receive_buffer(group_number, data, size)

Figure 1: RDMC library interface

We set out to implement an RDMA multicast that extends the basic RDMA semantics to multiple receivers:

1. The sender and receivers first create a multi-way binding: an RDMC group. This occurs out of band, using TCP as a bootstrapping protocol. To avoid incurring delay on the critical path, applications that will do repeated transfers should set groups up ahead of time.

2. The sender can enqueue a sequence of asynchronous send requests and can continue to do so for as long as the group is needed.

3. On the receive side, RDMC will notify the user application of an incoming message, at which point it must post a buffer of the correct size to receive it into.

4. Sends complete in the order they were initiated. Any messages that arrive will not be corrupt, out of order, or duplicated, and if no failures occur, all messages will arrive.

5. Any failures sensed by RDMA are reported to the application, but no automated recovery actions occur.

3 System Design

Figure 1 shows the RDMC interface, omitting configuration parameters like block size. To set up a group, the application starts by deciding group membership using an out of band mechanism. For example, a file copying program might launch on a set of nodes with the sender and receiver IP addresses supplied as a parameter. Next, all instances start up, pre-allocate and pin memory, and call create_group, passing in the identical membership information.


[Figure 2 diagram: on the sender, the application registers a memory segment and calls send(); the RDMC library drives the NIC. On each receiver, the application registers a memory segment, the NIC delivers the data into it, and RDMC issues an upcall to the application.]

Figure 2: Overall system design of RDMC.

Within a group, only one node (designated as the “root”) is allowed to send data. However, an application is free to create multiple groups with identical membership but different senders; we’ve tested this case and found that when groups are concurrently active with overlapping membership, bandwidth splits between them. We do not provide a way to change a group’s membership or root, as this can easily be accomplished by creating a new group and then destroying the old one.

The create_group function is inexpensive: each member must simply exchange connection information with its neighbors and initialize a few data structures. RDMC issues callbacks to an application notification handler to report completion events and failures.

Sending a multicast is straightforward: a user-level process simply calls send_message with the group number and the memory segment it wishes to transfer, and gets an upcall when the sending action is locally complete. On the receivers, a notification upcall requests memory for each incoming message; a second upcall signals receipt.
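To illustrate that flow end to end, here is a hedged sketch against the Figure 1 interface; the argument types, the callback signature, and the event codes are our own assumptions, standing in for RDMC's real header (which may differ):

    #include <cstddef>
    #include <cstdio>

    // Hypothetical typed declarations for the Figure 1 calls.
    using notification_callback = void (*)(int group, int event, size_t size);
    void create_group(int group_number, int root, const int* members,
                      size_t n_members, notification_callback cb);
    void destroy_group(int group_number);
    void send_message(int group_number, void* data, size_t size);
    void post_receive_buffer(int group_number, void* data, size_t size);

    static char object[64 << 20];   // 64 MB source / receive region

    // Illustrative event codes; RDMC's real notification values may differ.
    enum { INCOMING = 0, COMPLETE = 1 };

    void on_event(int group, int event, size_t size) {
        if (event == INCOMING)     // receiver: post a buffer of the
            post_receive_buffer(group, object, size);  // announced size
        else if (event == COMPLETE)
            printf("group %d: %zu-byte message delivered\n", group, size);
    }

    int main() {
        const int members[] = {0, 1, 2, 3};   // agreed out of band
        create_group(/*group=*/7, /*root=*/0, members, 4, on_event);
        send_message(7, object, sizeof object);   // only legal on the root
        // ... wait for completion upcalls before tearing down ...
        destroy_group(7);
    }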

Notice that neither the sender nor the receiver is notified when all receivers have successfully finalized reception. We recognize that a definitive outcome notification would often be useful, and our ongoing work includes ways of implementing a notification with strong semantics. However, the solution transforms RDMC into a full-fledged atomic broadcast, imposing semantics on the application that go beyond the least common denominator, while bringing non-trivial overheads.

By opting for the weakest semantics, we enable such applications to use RDMC without incurring any unnecessary overheads.

In summary, RDMC messages will not be corrupted or duplicated, and will arrive in order. RDMC will also not drop messages simply because of high load or insufficient buffer space. If no nodes fail, then all messages are guaranteed to arrive. Any failures that do occur are reported on a best-effort basis.

The transfer of messages is achieved by a series of RDMA reliable unicasts between the user-level memory at the sender and the recipients. As noted earlier, the RDMC library breaks each message into fixed-size blocks and relays the data, block by block, to intermediate recipients. When the multicast is locally complete, an upcall from the RDMC library on each member notifies them that data has arrived in their application’s memory.

Given this high-level design, the most obvious and important question is what algorithm to use for constructing a multicast out of a series of point-to-point unicasts. As noted in the introduction, RDMC actually supports multiple algorithms. We’ll describe them in order of increasing effectiveness.

The very simplest solution is the “sequential send”: it implements the naive approach of transmitting the entire message from the sender one by one to each recipient in turn. This approach does not scale well because at any point only one node is using any of its incoming or outgoing bandwidth.

“Chain send” implements a bucket brigade. After breaking a message into blocks, each inner receiver in the brigade relays blocks as it receives them. Relayers use their full bandwidth, but sit idle until they get their first block, so worst-case latency is high.

A “binomial tree send” can be seen in Figure 3 (left). Here, sender 0 starts by sending the entire message to receiver 1. Then in parallel, 0 sends to 2 while 1 sends to 3, and then in the final step 0 sends to 4, 1 sends to 5, 2 sends to 6 and 3 sends to 7. The resulting pattern of sends traces out a binary tree, hence latency will be better than for the sequential send, but notice that the inner transfers can’t start until the higher-level ones finish. Thus many nodes are basically idle most of the time, wasting the unused bandwidth of their incoming and outgoing links.
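The round structure just described follows a simple doubling rule. This minimal sketch (our own, for the eight-node case in Figure 3) prints the schedule:

    #include <cstdio>

    int main() {
        const int n = 8;   // group size, a power of two
        // In round r, every node that already has the message
        // (ids 0 .. 2^r - 1) forwards it to node id + 2^r.
        for (int r = 0; (1 << r) < n; r++)
            for (int src = 0; src < (1 << r); src++)
                printf("round %d: node %d -> node %d\n",
                       r, src, src + (1 << r));
    }

Running it reproduces the pattern in the text: round 0 sends 0 to 1; round 1 sends 0 to 2 and 1 to 3; round 2 sends 0 to 4, 1 to 5, 2 to 6 and 3 to 7.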

Significantly lower latency is possible if we use a binomial tree to transmit blocks instead of entire messages. This observation was first made by Ganesan and Seshadri [18], who proposed an algorithm that combines the bucket brigade and binomial trees to provide the benefits of each. They do so by creating a hypercube overlay of dimension d, within which d blocks will be concurrently relayed (Figure 3, middle, where the blocks are represented by the colors red, green and blue). Each node repeatedly performs one send operation and one receive operation until, on the last step, they all simultaneously receive their last block (if the number of nodes isn’t a power of 2, the final receipt spreads over two asynchronous steps). As detailed in Appendix A, we modified Ganesan and Seshadri’s synchronous solution into an asynchronous protocol, and included several other small changes to better match it to our setting.


Figure 3: (Left) A standard binomial tree multicast, with the entire data object sent in each transfer. (Center) A binomial pipeline multicast, with the data object broken into three blocks, showing the first three rounds of the protocol. In this phase, the sender sends a different block in each round, and receivers forward the blocks they have to their neighbors. (Right) The final two rounds of the binomial pipeline multicast, with the earlier sends drawn as dotted lines. In this phase, the sender keeps sending the last block, while receivers exchange their highest-numbered block with their neighbors.

3.1 Hybrid Algorithms

Although RDMC supports multiple algorithms, the binomial pipeline normally offers the best mix of latency and performance. Nonetheless, there may be situations in which other options are preferable.

For example, many of today’s data centers have full bisection bandwidth on a rack-by-rack basis, but use some form of oversubscribed top of rack (TOR) network. If we were to use a binomial pipeline multicast without attention to the topology, half the data would traverse the TOR network (this is because if we build the overlay using random pairs of nodes, the average link would connect nodes that reside in different racks). The resulting data transfer pattern would impose a heavy load at that level.

In contrast, suppose that we were to use chain replication in the top of rack layer, designating one node per rack as the leader for its rack. This would require some care: in our experiments chain replication was highly sensitive to network topology and data pattern. When transferring 256 MB objects, chain replication performed well, but when the same data was transferred using 64 MB multicasts, chain replication lagged in groups with as few as 16 nodes and by 256 nodes, was achieving just 40% of the throughput of the binomial pipeline. Chain replication is also the worst case for delivery skew, and is very sensitive to slow links. However, a properly configured TOR chain would minimize load on the top of rack switching network: any given block would traverse each TOR switch exactly once. Then we could use the binomial pipeline within each rack.

Even more interesting would be to use two separate instances of the binomial pipeline, one in the TOR layer, and a second one within the rack. By doing so we could seed each rack leader with a copy in a way that creates a burst of higher load, but is highly efficient and achieves the lowest possible latency and skew. Then we repeat the dissemination within the rack, and again maximize bandwidth while minimizing delay and skew.
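As a concrete sketch of that two-level scheme (our own illustration built on the Figure 1 call; the group numbering, the leader choice, and the callback argument are all assumptions):

    #include <cstddef>
    #include <vector>

    // Per Figure 1; types assumed as in the earlier sketch.
    void create_group(int group_number, int root, const int* members,
                      size_t n_members, void (*cb)(int, int, size_t));

    // Build one binomial pipeline group over the rack leaders, then one
    // group per rack, rooted at that rack's leader.
    void build_two_level(const std::vector<std::vector<int>>& racks,
                         void (*cb)(int, int, size_t)) {
        std::vector<int> leaders;
        for (const auto& rack : racks)
            leaders.push_back(rack[0]);       // first node leads its rack

        // TOR-layer group: the global sender seeds every rack leader.
        create_group(/*group=*/0, /*root=*/leaders[0],
                     leaders.data(), leaders.size(), cb);

        // Intra-rack groups: each leader re-disseminates within its rack.
        for (size_t r = 0; r < racks.size(); r++)
            create_group(/*group=*/static_cast<int>(r) + 1,
                         /*root=*/racks[r][0],
                         racks[r].data(), racks[r].size(), cb);
    }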

3.2 Architectural Details

We implemented RDMC as a userspace library that runs on top of the IB Verbs library. Although we have tested only in user mode, we believe that RDMC could also run within the kernel or hypervisor, be dropped into the controller of an NVM storage unit, or be embedded directly into the network hardware, all of which would drastically reduce scheduling latency and decrease overhead.

Initialization Before an application can participate in RDMC transfers, it must go through a setup process. During this stage, RDMC exchanges connection information with all other nodes that may participate, prepares its internal data structures, and posts receive buffers for all possible control messages. Finally, we start a polling thread that monitors RDMA completion queues for notifications about incoming and outgoing messages.

If several RDMC transfers are underway concurrently, each has its own block-transmission sequence to follow, but separate transfers can share the same completion polling thread, which reduces overheads. Even so, an issue arises of CPU load: while at least one transfer is active it makes sense to poll continuously, pinning one core, but clearly when a system is idle, this overhead would be objectionable. Accordingly, between transfers the polling interval is lengthened (using a thread-level sleep) to reduce CPU load at the expense of more latency in reacting to subsequent messages.


In future work we may change this to switch between polling and interrupt notifications, which would eliminate the delay that can otherwise arise if an event suddenly occurs after a period of idleness.
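A hedged sketch of such an adaptive polling loop (our own simplification; RDMC's actual thread logic may differ, and handle_completion stands in for the real dispatch):

    #include <infiniband/verbs.h>
    #include <atomic>
    #include <chrono>
    #include <thread>

    void polling_loop(ibv_cq* cq, std::atomic<int>& active_transfers,
                      std::atomic<bool>& shutting_down) {
        auto idle_sleep = std::chrono::microseconds(1);
        while (!shutting_down.load()) {
            ibv_wc wc;
            if (ibv_poll_cq(cq, 1, &wc) > 0) {   // non-blocking poll
                // handle_completion(wc);        // dispatch to its transfer
                idle_sleep = std::chrono::microseconds(1);
            } else if (active_transfers.load() == 0) {
                // Idle: lengthen the polling interval to free the core,
                // trading CPU load for reaction latency.
                std::this_thread::sleep_for(idle_sleep);
                if (idle_sleep < std::chrono::milliseconds(1))
                    idle_sleep *= 2;
            }
        }
    }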

Data Transfer Schedules An RDMC sending algorithm must be deterministic, and if a sender sends multiple messages, must deliver them in sequential order. As summarized earlier, when a sender initiates a large transfer, our first step is to tell the receivers how big each incoming message will be, since any single RDMC group can transport messages of various sizes. Here, we take advantage of an RDMA feature that allows a message to carry an integer “immediate” value. Every block in a message will be sent with an immediate value indicating the total size of the message it is part of. Accordingly, when an RDMC group is set up, the receiver posts a receive for an initial block of known size. When this block arrives, the immediate value allows us to determine the full transfer size and, if more blocks will be sent, the receiver can post asynchronous receives as needed.
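A hedged sketch of that tagging using standard libibverbs calls (queue pair and memory registration are assumed to exist; note the immediate field is 32 bits, so this particular encoding caps message sizes at 4 GB):

    #include <infiniband/verbs.h>
    #include <arpa/inet.h>
    #include <cstdint>

    // Send one block, tagged with the total size of the enclosing message.
    void send_block_with_size(ibv_qp* qp, ibv_mr* mr, char* block,
                              uint32_t block_len, uint32_t total_msg_size) {
        ibv_sge sge{};
        sge.addr = reinterpret_cast<uintptr_t>(block);
        sge.length = block_len;
        sge.lkey = mr->lkey;

        ibv_send_wr wr{};
        ibv_send_wr* bad = nullptr;
        wr.opcode = IBV_WR_SEND_WITH_IMM;      // two-sided send + immediate
        wr.imm_data = htonl(total_msg_size);   // receiver learns full size
        wr.sg_list = &sge;
        wr.num_sge = 1;
        wr.send_flags = IBV_SEND_SIGNALED;
        ibv_post_send(qp, &wr, &bad);
    }

    // Receive side: the completion entry carries the immediate value.
    uint32_t message_size_from(const ibv_wc& wc) {
        return (wc.wc_flags & IBV_WC_WITH_IMM) ? ntohl(wc.imm_data) : 0;
    }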

The sender and each receiver can now treat the schedule as a series of asynchronous steps. In each step every participant either sits idle or does some combination of sending a block and receiving a block. (The most efficient schedules are those that make sure all the nodes spend as much time as possible concurrently sending and receiving.) Given the asynchronous step number, it is possible to determine precisely which blocks these will be. Accordingly, as each receiver posts memory for the next blocks, it can determine precisely which block will be arriving and select the correct offset into the receive memory region. Similarly, at each step the sender knows which block to send next, and to whom.

Our design generally avoids any form of out-of-band signalling or other protocol messages, with one exception: to prevent blocks from being sent prematurely, each node will wait to receive a “ready for block” message from its target so that it knows they are ready. By doing so we also sharply reduce the amount of NIC resources used by any one multicast: today’s NICs exhibit degraded performance if the number of concurrently active receive buffers exceeds NIC caching capacity. RDMC uses just a few receive queues per group, and since we do not anticipate having huge numbers of concurrently active groups, this form of resource exhaustion is avoided.

4 Experiments

4.1 Experimental Setup

We conducted experiments on several clusters, beginning with the Sierra cluster at Lawrence Livermore National Laboratory. The cluster consists of 1,944 nodes, of which 1,856 are designated as batch compute nodes. Each is equipped with two 6-core Intel Xeon EP X5660 processors and 24GB memory. The clock speed is 2.8GHz; while system memory bandwidth is 256Gb/s, memcpy achieves just 30Gb/s. They are all connected by an Infiniband fabric which is structured as a two-stage, federated, bidirectional fat-tree. The NICs are 4x QDR QLogic adapters, each operating at a 40 Gb/s line rate (20 Gb/s each direction). The Sierra cluster runs TOSS 2.2, which is a modified version of Red Hat Linux.

The cluster employs batch scheduling for jobs, and this creates an issue that should be noted: nodes within the cluster but not used by our experiment will be processing real workloads and generating unrelated network traffic. The reason this is a problem is that although we do have exclusive access to the nodes we are assigned (any cores not used will be idle) and the cluster uses a fat-tree network, observed bandwidths are far below full bisection bandwidth. We speculate that this is caused by link congestion resulting from suboptimal routing. As a result, our experiments compete with other network traffic, particularly at large scale. We have no control over this phenomenon, although we can estimate the degree to which it is occurring. MPI, which is popular at LLNL, has an advantage in this sense: the LLNL scheduler is optimized for MPI jobs and selects node layouts that work especially well for it, particularly at very large scale.

We also conducted tests on two other clusters. The U. Texas Stampede cluster contains 6400 C8220 compute nodes, each with 56 Gb/s FDR Mellanox NICs. Like Sierra, it is batch scheduled with little control over node placement. We measured unicast speeds of up to 40 Gb/s, about double what was observed on the other two systems.

The Fractus cluster located at Cornell University contains 8 RDMA-enabled nodes very similar to the ones on Sierra, each equipped with a 4x QDR Mellanox NIC and 94 GB of DDR3 memory, running Ubuntu 12.04, and connected by a 40Gbps Mellanox IB switch (20Gbps each way). All nodes have one-hop paths to one another, hence latency and bandwidth numbers are consistent between runs.

In work currently underway at the time of submission, we upgraded Fractus with dual-capable 100Gbps Mellanox NICs that support both IB and RoCE, increased the number of RDMA-capable nodes to 19, and installed two 100Gbps Mellanox switches, one for IB and the other for RoCE (here the full two-way performance limit for each NIC seems to be around 125Gbps). Thus, we are now in a position to explore faster hardware, to microbenchmark with larger groups, and to compare the IB and RoCE cases. Here we include preliminary results for experiments run on 12 of the upgraded nodes using the IB switch. We’ve run some of the same experiments on RoCE and obtained similar findings.


Our experiments thus include cases that closely replicate the RDMA deployments seen in today’s cloud platforms: for example, Microsoft Azure offers both RDMA over IB and RDMA over RoCE. They also include scenarios seen on today’s large HPC clusters.

Not included are experiments with any form of virtualization. Although virtualized platforms are popular in cloud settings, it is not obvious how they could expose RDMA queue pairs to applications: multi-level page tables are in potential conflict with the zero-copy RDMA model, and there are evident issues of security. NIC hardware advances could perhaps address these concerns, but until that happens, we doubt that RDMA can be offered in fully virtualized clouds. Container models represent an appealing compromise, and in fact the Mesos OS, supporting Docker containers, underlies the Microsoft Azure cloud RDMA option. We believe that this model could become a de-facto standard for applications that need RDMA support, enabling the use of RDMC by cloud infrastructure developers, container application developers, and of course also by HPC solution developers.

With the exception of the concurrent sends experiment, we always select the lowest numbered node in our job to be the sender. The sender generates a random block of data, and we measure the time from when the send is submitted to the library to when all clients get an upcall indicating that the multicast has completed. Bandwidth is computed as the message size divided by the total time spent, regardless of the number of receivers. Thus, when we report a 6 Gb/s throughput for a group of 512 members in Figure 7, we mean that all 511 receivers get identical replicas of the transmitted 256 MB object about a third of a second after the send was initiated.
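To make that metric concrete, here is the arithmetic behind the example (our own worked check, using only the figures quoted in the preceding sentence):

    t = \frac{256\,\mathrm{MB} \times 8\,\mathrm{bits/byte}}{6\,\mathrm{Gb/s}}
      \approx \frac{2.15\,\mathrm{Gb}}{6\,\mathrm{Gb/s}}
      \approx 0.36\,\mathrm{s}

That is the “third of a second” quoted above, during which the group as a whole materializes 511 x 256 MB (roughly 128 GB) of replicas.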

4.2 Results

In Figure 4 we break down the time for a single 256 MB transfer with 1 MB blocks and a group size of 4, conducted on Stampede. All values are in microseconds, and measurements were taken on the node farthest from the root. Accordingly, the Remote Setup and Remote Block Transfers entries reflect the sum of the times taken by the root to send and by the first receiver to relay. Roughly 99% of the total time is spent in the Remote Block Transfers or Block Transfers states (in which the network is being fully utilized), meaning that overheads from RDMC account for only around 1% of the time taken by the transfer.

Figure 5 depicts the same send but shows the time usage for each step of the transfer for both the relayer (whose times are reported in the table) and the root sender. Towards the end of the message transfer we see an anomalously long wait time on both instrumented nodes. As it turns out, this demonstrates how RDMC can be vulnerable to delays on individual nodes. In this instance, a roughly 100 µs delay on the relayer (likely caused by the OS picking an inopportune time to preempt our process) forced the sender to delay on the following step, when it discovered that the target for its next block wasn’t ready yet.

In Figures 6a and 6b we examine the impact of block size on bandwidth for a range of message sizes. Notice that increasing the block size initially improves performance, but then a peak is reached. This result is to be expected, as there are two competing factors. All block transfers involve a certain amount of latency, so increasing the block size increases the rate at which information moves across links (with diminishing returns as the block size grows larger). However, the overhead associated with the binomial pipeline algorithm is proportional to the amount of time spent transferring an individual block. There is also additional overhead incurred when there are not enough blocks in the message for all nodes to contribute meaningfully to the transfer.
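A rough first-order model (our own simplification, not a result from these measurements) captures both competing factors. With message size m, block size b, per-block latency \lambda, per-link bandwidth B, and n = 2^d nodes, the binomial pipeline needs about m/b + d block-steps, each costing about b/B + \lambda:

    T(b) \approx \left(\frac{m}{b} + d\right)\left(\frac{b}{B} + \lambda\right)
         = \frac{m}{B} + \frac{m\lambda}{b} + \frac{d\,b}{B} + d\lambda

The m\lambda/b term penalizes blocks that are too small, while the db/B term penalizes blocks that are too large; minimizing over b places the optimum near b^{*} = \sqrt{m\lambda B/d}, which grows with the message size, consistent with the ideal block size depending on message size, as the Figure 6 caption notes.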

Figure 7, Figure 8 and Figure 9 show the bandwidths for various sizes of multicasts across a range of group sizes running on LLNL (Sierra), Stampede, and our 100Gbps Fractus IB configuration. In these experiments we fixed the block size at 1MB.

Although Sierra was by far the largest machine available to us, these experiments posed a challenge on that platform, where we often ended up with our experiment spread widely within the cluster. Further, the Sierra TOR switches exhibit surprising and very large load-dependent performance variations (a problem we did not see on Stampede). For example, we measured latencies as high as 20 microseconds and inconsistent bandwidths, which in the worst case were as low as 3 Gb/s. Despite these issues, our asynchronous implementation of the binomial pipeline proves to be surprisingly robust to delay and scheduling phenomena. For example, with 512 nodes we observed speeds of 6 Gb/s for 256 MB messages. It takes just 4x as long to make 511 replicas of a large object as to make 1.

Message size has an important impact on overall bandwidth. As we saw earlier, when selecting a block size we must balance the number of blocks with the bandwidth possible when sending a block.

Remote Setup                11
Remote Block Transfers     461
Local Setup                  4
Block Transfers          60944
Waiting                    449
Copy Time                  215
Total                    62084

Figure 4: Times in microseconds for various steps of the transfer.


Figure 5: Breakdown of transfer time and wait time of two nodes taking part in the 256 MB transfer from the same experiment as Figure 4. Notice that the relaying node spends hardly any time waiting, while the sender transmits each block slightly faster (since it isn’t receiving at the same time) and then must wait for the other nodes to catch up.

(a) Measurements from Fractus. (b) Measurements from Stampede.

Figure 6: Multicast bandwidth across a range of block sizes for message sizes between 4 and 256 megabytes with a group size of 4. The ideal block size depends on both the message size and the unique characteristics of the network.

For messages under roughly 10MB, these two factors directly conflict, and we are unable to pick any size that will achieve extremely high overall bandwidth.

Finally, notice that in all three of these cases, achieved bandwidth is highest in the unicast case, where the hardware often outperforms its specification, and remains high but fairly stable for replication groups of up to 8 nodes (a 1-level hypercube). At present we have only tested very large groups on Sierra, and although performance tails off at the largest sizes, RDMC even then achieves a high percentage of the possible unicast speed. In our current experiments, the limiting factor is almost surely topology, and could be avoided by minimizing the load placed on shared, higher-latency links. We believe this is best done in the job scheduler that selects nodes on which the application should run, as seems to occur when the LLNL scheduler launches MPI jobs.

Figure 10 compares the performance of RDMC binomial pipeline multicasts to that of several alternatives, including MPI on Infiniband (shown as MPI Bcast) and several other algorithms we implemented within the RDMC framework. Sequential send is the naive algorithm introduced earlier, where none of the receivers help with relaying, while binomial tree is the slightly better one where receivers begin relaying once they have the whole message. The chain replication scheme uses the method described in [27], in which blocks are relayed along a chain.

We should note that of these cases, one was not run as a protocol within the RDMC framework: the MPI Bcast performance was measured using the OSU Micro-Benchmarks software package with MVAPICH 1.2, a version optimized for QLogic hardware.

Our algorithm outperforms the others for large transfers in small to medium sized groups, achieving higher bandwidth and lower latency.


Figure 7: Bandwidth of a multicast for several message sizes on Sierra.

Figure 8: Bandwidth of a multicast for several message sizes on Stampede.

However, once the group size becomes large (128 replicas or more, depending on the message size) MPI has better performance. An obvious question arises of why MPI Bcast is so slow for smaller groups, yet experiences almost no performance degradation at larger scales. Unfortunately, we are not able to answer this because the algorithm used for MPI Bcast is not well documented. In particular, although Sanders describes a 2-Tree approach in [25], our understanding is that the 2-Tree algorithm would not give such flat scaling across the full range of sizes seen here.

Next, we undertook a similar performance comparison on Fractus at 100Gbps, but now omitting MPI. As seen in Figure 11, chain send and binomial pipeline do extremely well here, while the sequential send and tree send, which transfer full objects before relaying can occur, degrade very quickly. When examining these results it is important to keep in mind that while the simple chain-replication scheme can achieve high throughput, it has terrible delivery skew and is very sensitive to slow links. With the binomial pipeline algorithm, we are able to transform these fast point-to-point send speeds into very efficient multicasts for large objects.

Figure 9: Bandwidth of a multicast for several message sizes on 100 Gbps Fractus.

Figure 10: Bandwidth of various algorithms across a range of group sizes for 256 MB multicasts (Sierra).

The algorithm is able to take advantage of both the incoming and outgoing bandwidth of all the nodes in the group, not just some of them. As a result, when the block size is small compared to the message size, the theoretical (and observed!) time for the transfer is only slightly more than the time it would take to send the entire message between two nodes. In fact, for small groups, overheads can be so low that the total time taken to replicate an object can be less than it would take to perform a unicast transfer over the slowest link. We have seen this happen even when that link is only slower by a few tenths of a Gb/s.

Figure 12 compares the scalability of the binomial pipeline with that of sending sequentially, conducted on Sierra. The data shown for sequential send is an extrapolation from the data for 4-node groups.

Figure 13 is an experiment carried out on 40Gbps Fractus in which we created several groups using the same processes, and then initiated concurrent multicasts from different roots. For comparison we include data for single-sender transfers as well.


(a) 256 MB multicasts (b) 8 MB multicasts

Figure 11: Bandwidth of various algorithms across a range of group sizes on 100 Gbps Fractus.

Figure 12: Comparison of the latency for sending messages using the binomial pipeline and sequential send.

Finally, Figure 14 looks at rate variability when all members send in groups of various sizes on 100Gbps Fractus, plotting mean and standard deviation, and Figure 15 measures the number of 1-byte messages per second that RDMC can send as a function of group size, again on 100Gbps Fractus. RDMC has not been optimized for this use case, although we’ve observed that performance depends heavily on the RDMA configuration parameters in use while being unchanged for all message sizes below the network MTU.

In the introduction, we noted that in ongoing work, we are integrating RDMC into versions of atomic broadcast and Paxos, but that concerns about the cost of synchronization argue for keeping RDMC itself as simple and free of delays as possible. In support of that point, it would be tempting to explore a head-to-head comparison with user-space multicast libraries such as the various Paxos libraries [1], the Isis2 (recently renamed Vsync) group communication system [10], or the Orchestra Cornet library [14].

Figure 13: Average bandwidth for concurrent multicastsby distinct senders (Fractus).

However, we concluded that such comparisons would simply not be fair. All of these systems run over the IP network stack, and the Paxos libraries additionally log messages to nonvolatile storage in support of the Paxos durability property, hence any such experiment would be heavily biased in favor of RDMC, which offers orders of magnitude speedup relative to any system of this kind.

4.3 Discussion

When Ganesan and Seshadri considered tree and chain topologies for performing multicast in [18], they thought them to be unfeasibly slow over TCP/IP. This is an interesting question for us, because RDMA can be understood as a hardware implementation of a TCP-like protocol, and indeed Intel’s iWarp product line promotes precisely this analogy. In their discussion, Ganesan and Seshadri predicted suboptimal performance, attributing this to a concern that highly structured topologies can allow a single lagging node to slow down the entire send for everyone.


Figure 14: Average bandwidth for concurrent multicastsby distinct senders (Fractus 100 Gbps).

The binomial pipeline algorithm (which they recognized as theoretically optimal) is more susceptible to this phenomenon, because each node is responsible for the transfer to all of its neighbors in the hypercube.

As we have seen, in our asynchronous implementation of their scheme, slowdown proves to be much less of an issue for RDMA than for TCP/IP over Ethernet.¹ With true hardware-supported RDMA we are able to achieve low-latency, zero-copy, reliable transfers directly into user-space memory on the receiver, which is important because memcpy peaks at 30 Gb/s and is not likely to scale up as quickly as optical network speeds will. By contrast, thanks to hardware support for reliable sends we are able to consistently get nearly line rates across a range of systems using reliable point-to-point sends, and this should track the evolution of optical network speeds. Thus the opportunity for application-induced scheduling delays is much reduced, and the size of such delays is also much smaller than in their analysis.

But there is a second and deeper factor at work that may ultimately dominate at very large scale. Here we point to a queuing theory analysis reported by Basin et al. in [8], where the cumulative effect of small delays in multicast overlays constructed from TCP links was explored. The analysis is somewhat TCP-specific and is carried out in a simpler binary-tree topology, but it predicts that above some threshold size, any overlay structure at risk of link-level forwarding delays would be expected to exhibit performance that degrades in the group size. In effect, as the number of nodes below a given sender increases, the probability rises that a relaying delay will occur somewhere in the forwarding tree and back up to cause a delay at the sender. We run on a binomial tree, but an analogous result probably applies.

¹ We have not experimented with SoftRoCE, but because SoftRoCE maps RDMA to TCP, when running on non-RDMA platforms RDMC may have to be adjusted accordingly.

Figure 15: 1 byte messages/sec. (100Gbps, Fractus)

5 Related Work

Reliable multicast is an area rich in software libraries and systems. We’ve mentioned reliable multicast (primarily to emphasize that RDMC is not intended to offer the associated strong group semantics and multicast atomicity). Good examples of systems in this space include Isis2/Vsync, Spread, Totem, Horus, Transis and the Isis Toolkit [4, 5, 10, 12, 16, 26].

Paxos is the most famous of the persistent replication solutions, and again, RDMC is not intended as a competitor. Examples of systems in this category include Paxos, Chubby, Rambo, Zookeeper and Corfu [1, 3, 6, 13, 19, 21, 22].

We are not the first to ask how RDMA should be exploited in the operating system. The early RDMA concept itself dates to a classic paper by von Eicken and Vogels [29], which introduced the zero-copy option and reprogrammed a network interface to demonstrate its benefits. VIA, the Virtual Interface Architecture, then emerged; its “Verbs” API extended the U-Net idea to support hardware from Infiniband, Myrinet, QLogic and other vendors. Verbs, though, is awkward, and this has spawned a number of other options: the QLogic PSM subset of RDMA; Intel’s iWarp, which offers RDMA through a TCP-like abstraction implemented in the hardware (RDMA over Converged Ethernet, or RoCE); socket-level offerings such as the Chelsio WD-UDP [2] embedding; etc.

Despite the huge number of products, it seems reasonable to assert that the biggest success to date has been the MPI platform integration with Infiniband RDMA, which has become the mainstay of High Performance Computing (HPC) communications. MPI only uses a subset of RDMA functionality, hence a modern RDMA implementation will often have a stripped-down core (PSM or a similar library), on which Verbs and the full RDMA stack are implemented in software. UDP and UDP multicast on RDMA are also supported on such platforms.


Indeed, RDMC is best viewed as a bulk data copying solution: an OS-layer primitive capable of playing a role that IP multicast (IPMC) was once expected to play [15], but in which it was never successful. We are not aware of any bulk-data replication solution that successfully used IPMC at scale, although many research efforts attempted to do so. Instead, researchers soon learned that IPMC was hard to use because of its unreliability, and was capable of destabilizing data center switches and NICs by provoking broadcast storms [28]. Indeed, IPMC is notorious for enthusiastically discarding data (which can occur on the sender even immediately after a “successful” multicast Send operation), reordering data, and delivering duplicates. Many of today’s cloud platforms either prohibit the use of IPMC or emulate it by tunneling over TCP (as in Amazon AWS). Yet it would be hard to say that IPMC was unsuccessful. Rather, it was very much a product of the time period within which it was offered, a period when reliability and flow-control were perceived as a problem to be addressed purely by the endpoints, and when it was believed that the network would be the main source of any packet loss (in the cases mentioned above, IPMC itself overloads the switches and routers and causes the loss).

Today IPMC and UDP have been mostly displaced by TCP, with built-in reliability and flow-control. RDMA, which embeds the properties of TCP into the hardware (indeed, many RDMA networks map TCP to RDMA), is a very natural fit to this new environment, and when seen in this light, we believe that RDMC is an appealing extension of RDMA to the multicast case. RDMC fully leverages the zero-copy reliability of the hardware, supports an intuitive API, and performs very well for its intended use cases. While the RDMC failure model is weak (more or less the equivalent of having N one-to-one TCP connections side by side, which are unlikely to fail, but could break during a transfer if the hardware fails or a node crashes), we’ve argued that this behavior is still a good match to many intended use cases, and in our own future work, have been able to extend RDMC into a full-fledged reliable multicast with strong semantics and to use it in a Paxos protocol. We feel comfortable making the case that while RDMC’s model is weak, it is still “strong enough.”

Although our focus is on bulk data movement, the core argument here is perhaps closest to ones made in recent operating systems papers, such as FaRM [17], Arrakis [24] and IX [9]. In these works, the operating system is increasingly viewed as a control plane, with the RDMA network treated as an out-of-band technology for the data plane that works best when minimally disrupted. RDMC is of course far less ambitious than these operating systems, offering just a bare-bones reliable multicast abstraction and focusing on achieving the absolute lowest overheads we can.

However, because RDMC is a software library and highly portable, it could easily be used in a wide range of settings, and would integrate easily into the systems just listed. Further, by leveraging RDMC in file systems and memory sharing, its ultimate impact could be very broad.

6 Conclusion

Our paper introduces RDMC: a new reliable memory-to-memory replication tool implemented over RDMA unicast. Performance is very high when compared with the most widely used general-purpose options, and the protocol scales to large numbers of replicas. At smaller scale one can literally have 4 or 8 replicas for nearly the same price as 1; with really large numbers of replicas, it takes just a few times as long to make hundreds of replicas as it takes to make 1. We believe this to be a really striking finding, and of very broad potential applicability. Further, because RDMC delivery is nearly simultaneous even within large groups of receivers, applications that need to initiate parallel computation will experience minimal skew in their task start times. Because our solution takes the form of a library it can run in user space, but could also be dropped into the kernel. We believe that it could dramatically accelerate and yet also simplify a wide range of important applications, and also improve utilization of datacenter computing infrastructures. We are making the RDMC code base available for free, open-source download at http://rdmc.codeplex.com.

Appendix A: Binomial Pipeline

The binomial pipeline is a scheme for distributing a collection of blocks originating at a single host to some number of remote hosts, and was first described in [18]. The algorithm assigns each node to a single vertex on a hypercube. When the group size is a power of two, each node is assigned to its own vertex. Otherwise, some vertices are assigned two nodes. A vertex behaves externally as a single node: at any point it is sending and receiving at most one block from another vertex. However, as we will discuss later, nodes occupying the same vertex exchange blocks among themselves to ensure that they all receive the full message.

The binomial pipeline proceeds in three stages, each of which is further divided into steps. During every step, all vertices with at least one block have one of their members send across parallel edges of the hypercube. At the start of the first stage the sender transfers one block of the segment to a receiver. In the next step of the first stage, the sender transfers a different block to a receiver in another vertex, while the first receiver simultaneously sends its block on to a third vertex.


This pattern continues until all vertices have a single block.

Now that all nodes have a block, the second stage can be much more efficient. Previously we were wasting most of the network capacity because at each step every node was either a sender or a receiver, but not both. In this stage, the sender continues to sequentially send blocks while all other vertices trade their highest-numbered blocks.

Once the sender runs out of blocks, the algorithm enters the final stage. The sender repeatedly sends the last block, while the rest of the vertices continue to trade blocks in every step.

The progression of the binomial pipeline for a group of 8 nodes is illustrated in Figure 3, and contrasted with a more traditional binomial tree broadcast. It is worth noting that if the binomial pipeline is run with only a single block, it will produce a binomial tree.
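The three stages condense into one rule per step, which the following self-contained sketch simulates for the power-of-two case (our own reconstruction of the schedule as described above, not RDMC's actual scheduling code): at step s, every pair of vertices across dimension s mod d exchanges blocks, the sender pushing blocks in order and every other vertex forwarding the highest-numbered block it holds.

    #include <algorithm>
    #include <cstdio>
    #include <set>
    #include <vector>

    int main() {
        const int d = 3, n = 1 << d;   // 8 nodes; node 0 is the sender
        const int k = 5;               // blocks in the message
        std::vector<std::set<int>> have(n);
        for (int b = 0; b < k; b++) have[0].insert(b);

        auto done = [&] {
            for (const auto& s : have)
                if ((int)s.size() < k) return false;
            return true;
        };

        int step = 0;
        while (!done() && step < 4 * (k + d)) {   // safety cap
            int dim = step % d;                   // pair across this dimension
            std::vector<std::set<int>> next = have;
            for (int v = 0; v < n; v++) {
                if (have[v].empty()) continue;    // nothing to relay yet
                int peer = v ^ (1 << dim);
                // Sender pushes blocks in order, then repeats the last one;
                // everyone else forwards its highest-numbered block.
                int blk = (v == 0) ? std::min(step, k - 1)
                                   : *have[v].rbegin();
                next[peer].insert(blk);
            }
            have = next;
            step++;
        }
        printf("all %d nodes hold all %d blocks after %d steps\n", n, k, step);
    }

For these parameters the loop finishes in 7 steps, with the final round trading the last missing blocks exactly as Figure 3 (right) depicts.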

Now all that is left is to discuss the interactions within vertices containing two nodes. Whenever the vertex is responsible for sending a block, exactly one of the nodes within it will have that block. During that step, the other node will send a block that only it has to its partner, and receive the incoming block (if any) for the vertex. And once all vertices have all the blocks, the nodes within them trade the final block they are missing, thereby completing the send.

Our implementation of the binomial pipeline in RDMC is the first adaptation of this technique to an RDMA environment (the work described in [18] was evaluated purely in simulation). This entailed several small extensions: (1) Our implementation doesn’t need to know global state or to compute the whole schedule; instead it just computes the parts relevant to each individual node. Further, whereas the original version has a stage at which nodes gossip about which nodes have which blocks, we were able to eliminate that step entirely. (2) RDMC adjusts the algorithm to allow some nodes to run slightly ahead of others. The resulting small degree of asynchrony eliminated stalls that were otherwise seen in the original, fully synchronized protocol. (3) To minimize RDMA connection setup overhead, we adjusted the schedule to ensure that the first block each node receives always comes from the same relayer.

Acknowledgements

We are grateful to Greg Bronevetsky and Martin Schultz at LLNL for generously providing access to their large computer clusters, and to the U. Texas Stampede XSEDE computing center for providing access to that system. Support for this work was provided, in part, by DARPA under its MRC program and by NSF under its Computing in the Clouds program. An experimental grant from Microsoft is being used to evaluate RDMC on Azure (we hope to include those results in the future). Mellanox provided access to their high speed RDMA hardware, and AFOSR supports the cluster on which the majority of our experiments were performed.

References

[1] LibPaxos: Open-source Paxos. http://libpaxos.sourceforge.net/. Accessed: 24 Mar 2015.

[2] Low latency UDP Offload solutions | Chelsio Communications. http://www.chelsio.com/nic/udp-offload/. Accessed: 24 Mar 2015.

[3] ABRAHAM, I., CHOCKLER, G. V., KEIDAR, I., AND MALKHI, D. Byzantine Disk Paxos: Optimal Resilience with Byzantine Shared Memory. In Proceedings of the Twenty-third Annual ACM Symposium on Principles of Distributed Computing (New York, NY, USA, 2004), PODC '04, ACM, pp. 226–235.

[4] AGARWAL, D. A., MOSER, L. E., MELLIAR-SMITH, P. M., AND BUDHIA, R. K. The Totem Multiple-ring Ordering and Topology Maintenance Protocol. ACM Trans. Comput. Syst. 16, 2 (May 1998), 93–132.

[5] AMIR, Y., DANILOV, C., MISKIN-AMIR, M., SCHULTZ, J., AND STANTON, J. The Spread Toolkit: Architecture and Performance. Johns Hopkins University, Center for Networking and Distributed Systems (CNDS) Technical Report CNDS-2004-1 (Oct. 2004).

[6] BALAKRISHNAN, M., MALKHI, D., DAVIS, J. D., PRABHAKARAN, V., WEI, M., AND WOBBER, T. CORFU: A Distributed Shared Log. ACM Trans. Comput. Syst. 31, 4 (Dec. 2013), 10:1–10:24.

[7] BALAKRISHNAN, M., MALKHI, D., WOBBER, T., WU, M., PRABHAKARAN, V., WEI, M., DAVIS, J. D., RAO, S., ZOU, T., AND ZUCK, A. Tango: Distributed Data Structures over a Shared Log. In Proceedings of the Twenty-Fourth ACM Symposium on Operating Systems Principles (New York, NY, USA, 2013), SOSP '13, ACM, pp. 325–340.

[8] BASIN, D., BIRMAN, K., KEIDAR, I., AND VIGFUSSON, Y. Sources of Instability in Data Center Multicast. In Proceedings of the 4th International Workshop on Large Scale Distributed Systems and Middleware (New York, NY, USA, 2010), LADIS '10, ACM, pp. 32–37.


[9] BELAY, A., PREKAS, G., KLIMOVIC, A., GROSSMAN, S., KOZYRAKIS, C., AND BUGNION, E. IX: A Protected Dataplane Operating System for High Throughput and Low Latency. In Proceedings of the 11th USENIX Symposium on Operating Systems Design and Implementation (OSDI 14) (Broomfield, CO, Oct. 2014), USENIX Association, pp. 49–65.

[10] BIRMAN, K. Isis2 Cloud Computing Library. https://isis2.codeplex.com/, 2010.

[11] BIRMAN, K. Guide to Reliable Distributed Systems. No. XXII in Texts in Computer Science. Springer-Verlag, London, 2012.

[12] BIRMAN, K. P., AND JOSEPH, T. A. Exploiting Virtual Synchrony in Distributed Systems. In Proceedings of the Eleventh ACM Symposium on Operating Systems Principles (New York, NY, USA, 1987), SOSP '87, ACM, pp. 123–138.

[13] BURROWS, M. The Chubby Lock Service for Loosely-coupled Distributed Systems. In Proceedings of the 7th Symposium on Operating Systems Design and Implementation (Berkeley, CA, USA, 2006), OSDI '06, USENIX Association, pp. 335–350.

[14] CHOWDHURY, M., ZAHARIA, M., MA, J., JORDAN, M. I., AND STOICA, I. Managing Data Transfers in Computer Clusters with Orchestra. In Proceedings of the ACM SIGCOMM 2011 Conference (New York, NY, USA, 2011), SIGCOMM '11, ACM, pp. 98–109.

[15] DEERING, S. E., AND CHERITON, D. R. Multicast Routing in Datagram Internetworks and Extended LANs. ACM Trans. Comput. Syst. 8, 2 (May 1990), 85–110.

[16] DOLEV, D., AND MALKI, D. The Transis Approach to High Availability Cluster Communication. Commun. ACM 39, 4 (Apr. 1996), 64–70.

[17] DRAGOJEVIC, A., NARAYANAN, D., CASTRO, M., AND HODSON, O. FaRM: Fast Remote Memory. In Proceedings of the 11th USENIX Symposium on Networked Systems Design and Implementation (NSDI 14) (Seattle, WA, 2014), USENIX Association, pp. 401–414.

[18] GANESAN, P., AND SESHADRI, M. On Cooperative Content Distribution and the Price of Barter. In 25th IEEE International Conference on Distributed Computing Systems (ICDCS 2005) Proceedings (June 2005), pp. 81–90.

[19] JUNQUEIRA, F. P., AND REED, B. C. The Life and Times of a Zookeeper. In Proceedings of the Twenty-first Annual Symposium on Parallelism in Algorithms and Architectures (New York, NY, USA, 2009), SPAA '09, ACM, pp. 46–46.

[20] KALIA, A., KAMINSKY, M., AND ANDERSEN, D. G. Using RDMA Efficiently for Key-value Services. In Proceedings of the 2014 ACM Conference on SIGCOMM (New York, NY, USA, 2014), SIGCOMM '14, ACM, pp. 295–306.

[21] LAMPORT, L. The Part-time Parliament. ACM Trans. Comput. Syst. 16, 2 (May 1998), 133–169.

[22] LAMPORT, L., MALKHI, D., AND ZHOU, L. Reconfiguring a State Machine. SIGACT News 41, 1 (Mar. 2010), 63–73.

[23] MITCHELL, C., GENG, Y., AND LI, J. Using One-sided RDMA Reads to Build a Fast, CPU-efficient Key-value Store. In Proceedings of the 2013 USENIX Conference on Annual Technical Conference (Berkeley, CA, USA, 2013), USENIX ATC '13, USENIX Association, pp. 103–114.

[24] PETER, S., LI, J., ZHANG, I., PORTS, D. R. K., WOOS, D., KRISHNAMURTHY, A., ANDERSON, T., AND ROSCOE, T. Arrakis: The Operating System is the Control Plane. In Proceedings of the 11th USENIX Symposium on Operating Systems Design and Implementation (OSDI 14) (Broomfield, CO, Oct. 2014), USENIX Association, pp. 1–16.

[25] SANDERS, P., SPECK, J., AND TRÄFF, J. L. Two-tree Algorithms for Full Bandwidth Broadcast, Reduction and Scan. Parallel Computing 35, 12 (2009), 581–594. Selected papers from the 14th European PVM/MPI Users Group Meeting.

[26] VAN RENESSE, R., BIRMAN, K. P., AND MAFFEIS, S. Horus: A Flexible Group Communication System. Commun. ACM 39, 4 (Apr. 1996), 76–83.

[27] VAN RENESSE, R., AND SCHNEIDER, F. B. Chain Replication for Supporting High Throughput and Availability. In Proceedings of the 6th Conference on Symposium on Operating Systems Design & Implementation - Volume 6 (Berkeley, CA, USA, 2004), OSDI '04, USENIX Association, pp. 7–7.

[28] VIGFUSSON, Y., ABU-LIBDEH, H., BALAKRISHNAN, M., BIRMAN, K., BURGESS, R., CHOCKLER, G., LI, H., AND TOCK, Y. Dr. Multicast: Rx for Data Center Communication Scalability. In Proceedings of the 5th European Conference on Computer Systems (New York, NY, USA, 2010), EuroSys '10, ACM, pp. 349–362.


[29] VON EICKEN, T., BASU, A., BUCH, V., AND VOGELS, W. U-Net: A User-level Network Interface for Parallel and Distributed Computing. In Proceedings of the Fifteenth ACM Symposium on Operating Systems Principles (New York, NY, USA, 1995), SOSP '95, ACM, pp. 40–53.
