arXiv:1806.05300v1 [cs.DC] 13 Jun 2018

A Graphical Interactive Debugger for Distributed Systems

Doug Woos, Zachary Tatlock, Michael D. Ernst, and Thomas E. Anderson

University of Washington

Submission Type: Research

Abstract

Designing and debugging distributed systems is notoriously difficult. The correctness of a distributed system is largely determined by its handling of failure scenarios. The sequence of events leading to a bug can be long and complex, and it is likely to include message reorderings and failures. On single-node systems, interactive debuggers enable stepping through an execution of the program, but they lack the ability to easily simulate failure scenarios and control the order in which messages are delivered.

Oddity is a graphical, interactive debugger for distributed systems. It brings the power of traditional step-through debugging—fine-grained control and observation of a program as it executes—to distributed systems. It also enables exploratory testing, in which an engineer examines and perturbs the behavior of a system in order to better understand it, perhaps without a specific bug in mind. A programmer can directly control message and failure interleaving. Oddity supports time travel, allowing a developer to explore multiple branching executions of a system within a single debugging session. Above all, Oddity encourages distributed systems thinking: rather than assuming the normal case and attaching failure handling as an afterthought, distributed systems should be developed around the certainty of message loss and node failure.

Graduate and undergraduate students used Oddity in two distributed systems classes. Usage tracking and qualitative surveys showed that students found Oddity useful for both debugging and exploratory testing.

1 Introduction

Developing correct distributed systems is difficult. Such systems are inherently nondeterministic. Messages can be dropped or arbitrarily delayed, and nodes can fail and restart. In the “normal” case where messages are delivered in order and nodes remain up and responsive, understanding the behavior of the code, as well as testing and debugging, are all relatively simple. However, bugs are more likely to hide in the unusual failure cases. For example, a version of the widely-used Raft consensus algorithm [25] was discovered to have a bug in the code to handle changes in the participants to the protocol, depending on the interleaving of reconfiguration requests and a leader failover.

For single-node systems, engineers have step-through debuggers. A debugger helps an engineer reproduce and understand bugs by observing how their system’s state evolves in both normal and buggy executions. However, traditional interactive debuggers are of limited utility in debugging distributed systems: they do not allow programmers to easily control which messages will be delivered and in what order. Since the behavior of a distributed system is determined by the order in which events happen, engineers cannot debug their systems without this control. Even the simple sanity checks that engineers can do with a traditional debugger in order to ensure that they understand how their system operates (e.g., for a given input, how many times is the inner loop executed?) are out of reach in a distributed system.

To address this, we present Oddity, a graphical, interactive debugger for distributed systems.1 It enables engineers to explore and control the execution of their system, including both normal operation and edge cases—message drops, node failures, and delays. Oddity displays the messages and timeouts that are waiting to be delivered and allows the engineer to specify their order. Oddity supports time travel, allowing the engineer to navigate a branching history of possible executions. This enables users to backtrack and make different choices about the order in which messages and timeouts are delivered, allowing the exploration of many different cases—for instance, all of the possible orderings of a few messages—in a single debugging session. By enabling programmers to easily explore both normal cases and edge cases—indeed, Oddity makes no distinction between these cases—Oddity encourages distributed systems thinking. Rather than assuming the normal case and attaching failure handling as an afterthought, systems should be developed around the possibility of failure and then optimized for performance.

Oddity supports a general execution model: event handlers run in response to received messages or timeouts.

1 Oddity is open source and available at http://oddity.uwplse.org.


A handler can modify local state, set timeouts, and send messages to other nodes. Handlers can be written in any programming language, needing only to support a simple shim API for interaction with the debugger (sending and receiving messages, setting timeouts, and updating the node’s state).
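To make this model concrete, the sketch below shows what a node might look like under it. The shim interface used here (a shim object with send and set_timeout methods, and on_start/on_timeout/on_message handler names) is an illustrative assumption, not the actual Oddity shim API; the real wire protocol is summarized in Table 1.

    # Minimal sketch of a node under the event-handler model. The shim
    # object and its method names (send, set_timeout) are assumptions.
    class PingNode:
        def __init__(self, shim, peers):
            self.shim = shim        # the shim performs all network I/O
            self.peers = peers
            self.pings_seen = 0     # local state, inspectable in the debugger

        def on_start(self):
            # Schedule a timeout; the debugger decides when (or if) it fires.
            self.shim.set_timeout(type="ping", body={})

        def on_timeout(self, type, body):
            if type == "ping":
                for peer in self.peers:
                    self.shim.send(to=peer, type="ping", body={})
                self.shim.set_timeout(type="ping", body={})

        def on_message(self, sender, type, body):
            if type == "ping":
                self.pings_seen += 1
                self.shim.send(to=sender, type="pong", body={})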

Oddity differs from previous work in several ways. Unlike previous distributed systems visualization tools, it can be used to visualize and control the network behavior of a real system, developed in any programming language. Other systems only visualize the operation of a model [24] or logs of a particular execution [2]. Similarly, previous debugging systems for distributed systems [1, 7, 12, 13, 14, 34, 19, 2] focused on ex post facto debugging and diagnosis, while Oddity is geared toward interactive exploration of executions. Oddity is extensible and supports multiple representations for viewing or interacting with a distributed system. Oddity supports two visual representations of a system execution. One, shown in Section 2, emphasizes the current state of the nodes and the network, including in-flight messages and timeouts, and also enables navigation through an execution. To demonstrate Oddity’s extensibility, we added a traditional (non-interactive) space-time diagram representation in under 150 lines of code.

This research makes the following high-level contributions:

• Interactive debugging of distributed systems. Oddity is the first system that allows users to interactively control the order of messages and timeouts that are delivered to each node in a distributed system, enabling both debugging and exploratory testing. Oddity is designed to encourage and enable users to reason about the correctness of their systems by exploring edge cases as well as normal cases. Developers can use Oddity to visualize execution traces mined from logs or obtained from a model-checker as a counterexample to a desired invariant.

• A conceptual model for distributed systems development. In Oddity, all messages and timeouts for a given node are grouped together in “inboxes,” indicating that any event in any inbox can occur at its node at any time, and that systems cannot assume a “normal” ordering. Oddity models event history as a tree of possible executions, allowing a programmer to explore the consequences of various event orderings by navigating multiple executions of their system.

• A novel graphical interface. Oddity includes a new graphical representation of the partial execution of a distributed system, designed to encourage users to think carefully about the correctness of their systems. This interface allows users to inspect a single state of the system in detail, while also enabling navigation through an execution of the system.

• A study of student usage of Oddity. Students used Oddity in lab assignments for two distributed systems classes. We studied students’ experiences with Oddity with opt-in usage tracking and an optional survey. In addition to providing evidence that Oddity is useful for distributed systems development, our classroom experiment provides the first insights into student behavior with an interactive debugger for distributed systems.

The rest of the paper is organized as follows. Section 2 presents Oddity from a user’s perspective via a running example: diagnosing a bug in the Raft consensus protocol. Section 3 discusses Oddity’s graphical interface in more detail. Section 4 discusses Oddity’s architecture. Section 5 discusses our prototype implementation of Oddity. Section 6 discusses our deployment of Oddity to two university distributed systems classes; students were able to use Oddity for both debugging and exploratory testing. Section 7 discusses possible extensions to Oddity. Section 8 discusses related work, and Section 9 concludes and presents some potential avenues for future work.

2 Example usage

We introduce Oddity’s core ideas and interface via a running example: an implementation of Raft [25]. Raft is a consensus protocol, a key component in the construction of strongly-consistent distributed services. A consensus protocol enables a cluster of nodes to agree on a sequence of values to apply to a state machine, despite node failures and arbitrary message delays. To support changes in the nodes participating in the state machine consensus, Raft includes a reconfiguration protocol in which both the new and old sets of nodes must agree on any new configuration. The reconfiguration protocol could be triggered manually by a system administrator or automatically by a cluster management system. In part because it was well-documented and included source code, Raft has become widely deployed in industry.

Ongaro’s dissertation [23] includes a simplified reconfiguration protocol designed for single-node changes. Several years after publication, researchers discovered a bug in this simplified protocol: in a cluster with an even number of members, if two competing reconfiguration requests occur with a leader election in between, the cluster can lose data. A simple fix, proposed when Ongaro publicly announced the bug, is to require that new leaders commit an entry to the log in the old configuration before committing a new configuration. Several months passed between the bug being identified and the fix being announced.

Figure 1: A space-time diagram, generated with Oddity and lightly edited for clarity, illustrating a Raft execution leading to the reconfiguration bug. Each vertical line represents a node in the system, and arrows between them represent messages. The circles represent messages received from clients of the system (elided for presentation).

For explanatory purposes, we imagine a Raft maintainer has been informed of the existence of the buggy execution; using Oddity, they are trying to determine why it happens and how it can be fixed.

Figure 1 shows an execution leading to the Raft bug. First, a leader (S1) is elected in a 4-node cluster by a majority including itself, S2, and S3. Then, S1 starts to replicate a new configuration that adds a fifth node, S5. In the single-server reconfiguration protocol, each server uses whichever configuration is latest in its log (regardless of whether it is committed). The leader sends this new configuration to S5 (shown on the left of Figure 1) as well as the rest of the cluster (assumed to be delayed or dropped in Figure 1). After this configuration is replicated to S5, S2 starts an election and is elected with votes from S3 and S4. This might occur, for example, if the reconfiguration messages from S1 are delayed to those nodes, e.g., due to a temporary network outage. (Consensus should work even when nodes incorrectly judge that other nodes have failed.) Now that S2 is leader, it starts to replicate a new configuration that removes S1 from the cluster, leaving the three nodes S2, S3, and S4 (since the configuration with S5 was never replicated to S2). It successfully replicates this configuration to S3, at which point it can commit the configuration since it is on a majority of nodes in the new cluster of three nodes. Now S1 starts another election, and becomes leader with votes from S4 and S5. It can now finish replicating its configuration adding S5 to the whole cluster, which overwrites S2’s committed configuration. This is a violation of a crucial Raft safety property: once an entry is committed, it should never be overwritten.

Without Oddity, the Raft engineer has several options to reproduce and diagnose this failure. They could examine the code and try to imagine an execution that would trigger the bug, but this is both time-consuming and error-prone. They could design an automated test to find the issue, but testing distributed systems is notoriously difficult [20]. Since the issue depends on a failover, the test environment would need to simulate a temporary network partition. The test environment would also need to ensure that messages are delivered in a specific order with respect to other messages and the network outage. The engineer would also need to write an oracle that determines whether the bug has in fact been triggered (i.e., whether data are lost). Finally, the engineer could run their code in a traditional debugger and attempt to trigger the issue. Doing so, however, would still require simulation of failures and control over the order in which messages are delivered.

The rest of this section shows how Oddity makes the Raft engineer’s task easier.2 This illustrates Oddity’s functionality via one important use case: reproducing and diagnosing a bug in a distributed system. Oddity can also be used for open exploration of a distributed system’s behavior, or for visualizing a counterexample produced from a model checker.

2.1 Initialization

Oddity assumes that the system being debugged is implemented as a set of event handlers: deterministic functions that can read and write the node’s state, send messages, and set timeouts through the Oddity shim API (detailed in Section 4). Oddity communicates with the shim, which calls event handlers.

2 A screen-cast version can be found at http://oddity.uwplse.org.

Figure 2: The initial state of the Raft system in Oddity. Each node has a timeout in its inbox.

In order to use Oddity, the engineer makes a few changes to the system. The engineer first connects the system’s handlers to Oddity by routing communication through the Oddity shim instead of the standard networking library (this can typically be achieved by a small macro or command line flag). Next, the engineer creates a node to represent the system’s client. This node can set timeouts that cause communication with the rest of the system. In response to a timeout, the Raft client should send the reconfiguration commands from the counterexample in Figure 1. The client can be developed using any language for which an implementation of the Oddity shim is available.
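For illustration, a minimal client node for this scenario might look like the sketch below, reusing the hypothetical shim interface from the earlier sketch (handler and method names are assumptions; the hard-coded recipients S1 and S2 simply follow the counterexample):

    # Hypothetical client node that drives the reconfiguration counterexample.
    # Shim method names are the same assumptions as in the earlier sketch.
    class ReconfigClient:
        def __init__(self, shim):
            self.shim = shim

        def on_start(self):
            # One timeout per reconfiguration command in the counterexample.
            self.shim.set_timeout(type="add-s5", body={})
            self.shim.set_timeout(type="remove-s1", body={})

        def on_timeout(self, type, body):
            if type == "add-s5":
                # Sent to S1, the leader at this point in the counterexample.
                self.shim.send(to="S1", type="reconfig", body={"add": "S5"})
            elif type == "remove-s1":
                # Sent to S2, the leader after the intervening election.
                self.shim.send(to="S2", type="reconfig", body={"remove": "S1"})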

Having linked the system and the client with the shim,the engineer can run the system under Oddity.

2.2 Finding a buggy execution

When the engineer first starts Oddity on their system, they will see a screen similar to Figure 2. Each node has an “inbox” next to it, which contains both messages sent by other nodes and also timeouts the node has set itself. At the beginning of time, no messages have been sent, so each node’s inbox contains only the timeouts waiting at that node (including S5, which has not yet been added to the cluster). Timeouts are used to cause events to fire without messages being delivered. For instance, the election timeouts in Figure 2 are fired when a node has not received a message from a leader for sufficient time, and cause the node to start an election.

The engineer will first need to get S1 elected leader. They can click on the E (election) timeout in S1’s inbox to deliver it, causing S1 to send RV (Request Vote) messages to the other nodes in the initial configuration (excluding S5, which has not yet been added). The resulting state of the system, with a RV message in each node’s inbox, is shown in Figure 3. These messages are now waiting to be delivered.

Figure 3: The state of the Raft system in Oddity after S1 starts an election. S2, S3, and S4 have RV messages in their inboxes. The messages have the same color as their sender (S1).

Figure 4: The state of the Raft system in Oddity after S2, S3, and S4 respond to S1’s vote request. The votes from those nodes are in S1’s inbox, with colors corresponding to the sending node. The engineer has clicked on S2 to expand its state, showing that it has recorded a vote for S1 (the Raft protocol requires that nodes track which node they voted for in the current term).

The engineer can then click on each RV message to deliver them, causing them to respond to S1 with their V (Vote) messages. In Figure 4, these messages have been sent and S2’s state is expanded, showing that it voted for S1. Since Raft requires a quorum to elect a leader, once two of these votes are delivered S1 considers itself elected.

Now that S1 is the leader, the engineer can investigate the reconfiguration bug. They can make the client (not shown) send a reconfiguration request to add S5 by delivering a timeout. They can inspect the request by clicking on it, as shown in Figure 5. Once the request is delivered, the leader will try to commit this new configuration to a majority of the new configuration per the single-node reconfiguration protocol.

Once the new configuration has been replicated to S5, the engineer needs to trigger a new leader election in order to continue following the counterexample. They can do so by delivering the E timeout to S2. This timeout is reset every time a heartbeat comes in from the leader, and if it ever arrives the follower decides the leader has failed and makes itself a candidate.


Figure 5: The state of the Raft system in Oddity after the client sends its reconfiguration request. The request is open for inspection. As shown, the engineer can choose to duplicate it or drop it rather than delivering it.

The rest of the leader election is elided for brevity. The engineer now clicks on the client’s other timeout, causing it to send a second reconfiguration request dropping S1. It is delivered at S2 and S2 replicates the new configuration to S3; this configuration is now committed, having been replicated to a majority of the new cluster.

The counterexample now calls for S1 to start a new election, which it can do after receiving S2’s RV message and then its E timeout. After getting elected, it replicates the configuration with S5 to the rest of the cluster. Crucially, S1 replicates the updated configuration to S2, overwriting a previously-committed entry and demonstrating that the engineer’s Raft implementation is buggy.

2.3 Backtracking

Investigating the Raft reconfiguration bug in Oddity involves a number of steps. The engineer might mistakenly click on the wrong message (for instance, delivering the node removal request to S2 before the leader election happens). The engineer might also want to explore alternative executions. The reconfiguration bug can be fixed by requiring that a leader replicate an entry (which could be a no-op) in its term before attempting to reconfigure the system. In order to test this potential bug fix, the engineer could explore an execution in which S2 does this, attempting to replicate a no-op entry in its old configuration before it receives the request to reconfigure the system. It would be time-consuming and frustrating to start over from the initial state of the system in order to answer such questions.

Fortunately, Oddity provides an alternative: the engineer can click on any previous state in the history in order to reset the system to that state. They can then explore other executions starting from that state. Using Oddity’s execution history view, the engineer can go back to the point just before S2 started to replicate the command removing S1 and instead deliver a heartbeat timeout to S2, causing it to attempt to replicate a no-op entry in the old configuration. In order to proceed, S2 must replicate the no-op entry to at least three nodes (e.g., S2, S3, and S4) before it can attempt to remove S1 from the replica set. At that point the pending reconfiguration with S5 will not succeed, since S1 will not be able to be elected leader until its log is up to date with the rest of the cluster. The engineer now has some evidence that the Raft reconfiguration bug can be fixed by requiring that a leader replicate an entry in its term before attempting to reconfigure the cluster.

Figure 6: The debugger window. Each node (A) is displayed, along with an inbox (B) of messages and timeouts waiting to be delivered at that node. The user can control delivery by clicking on timeouts and messages, and can also inspect the contents of any message or timeout or the state at any node. Using the branching history view (C), the user can navigate the states of the system they have explored. The user can reset the debugger to a previous state by clicking on it; this resets the system to that state so that the user can explore further from there.

3 Frontend

Oddity’s graphical interface, shown in Figure 6, is designed to enable engineers to easily explore executions of distributed systems, including failure cases. The graphical interface was designed with several requirements in mind:

1. It should be application- and implementation-language-agnostic. A user should be able to graphically debug their system without developing a system-specific visualization.

2. It should neither depend on nor suggest to the user any notion of real time. Messages can be arbitrarily delayed and reordered, and timeouts can be delivered even if no failures occur.

3. It should enable detailed inspection of a single global state of the system (including the contents of all messages and the local state at every node), control over which event should be executed next, and navigation between system states for the purpose of time travel.

In this section, we discuss how Oddity’s frontend addresses each of these requirements.

3.1 Application-agnostic

Distributed systems are designed to provide reliable service in diverse contexts and this is reflected in their structure and operation: Chord [28] and Dynamo [5] maintain a ring structure via pointers at each node, Raft has a single leader who communicates with a number of followers, the DNS system has a loose tree structure, etc. Rather than forcing system developers to develop visualizations for the structure of each system, Oddity’s interface displays the components all distributed systems have in common: a set of nodes communicating via a network. When the user starts Oddity, it displays each system node in a circle. The user can then reposition the nodes as they desire by clicking and dragging.

Oddity does not yet support application-specific extensions to the visualization (for instance, to display a star next to Raft leaders or arrows describing Chord’s ring structure), but we anticipate that these will be easy to add. Thus far, we have focused on making Oddity useful even for developers who are not willing to develop such visualizations.

3.2 No real time

The unreliability of physical clocks due to clock skew is a fundamental problem in distributed systems [15]. Engineers cannot rely on measurements of time being consistent across nodes except within very loose bounds. As a result, most distributed systems are designed around the possibility that messages can be arbitrarily delayed by the network. While messages are often thought of as moving through the network over time to their destination, Oddity does not represent them this way; doing so would imply a semantic meaning to real time. Instead, messages are immediately transferred to the receiver’s inbox and can then be delayed for an arbitrary amount of time (or dropped), under user control. Oddity’s display encourages users to ignore wall-clock time in thinking about distributed systems correctness, and instead think about correctness in the face of all possible event orders.
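The network model this implies is small enough to sketch directly: each node has an inbox of pending events, a send places the message in the destination’s inbox immediately, and the user (not a clock) decides which pending event is delivered or dropped next. A minimal sketch of that reading (the names are ours, not Oddity’s):

    # Minimal sketch of the "inbox" network model: no clock, just a list of
    # pending events per node; the user picks what is delivered, and when.
    inboxes = {"S1": [], "S2": [], "S3": []}

    def send(src, dst, msg):
        # Messages land in the receiver's inbox immediately; "in flight"
        # time is modeled by how long an event sits there undelivered.
        inboxes[dst].append((src, msg))

    def deliver(dst, index):
        # The user may deliver any pending event, in any order...
        return inboxes[dst].pop(index)

    def drop(dst, index):
        # ...or drop it, simulating message loss.
        inboxes[dst].pop(index)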

Figure 7: The architecture of the debugger implementation. The debugger backend communicates with individual nodes (each of which has a stub implementing the Oddity API). The debugger frontend, running in the browser, communicates with the backend. Most of the logic runs in the browser, allowing the backend to serve as a thin communication layer.

Figure 8: An example of Oddity components communicating to deliver a message. When Node 1 receives a timeout, either from log replay or user action, it produces a message for Node 2. This message is sent to the debugger backend, which tells the frontend to display it in Node 2’s inbox. When the user clicks the message (which they could do immediately or after delivering other messages and timeouts) the frontend notifies the backend, which sends the message to Node 2.

3.3 State inspection and history navigation

Oddity’s graphical interface is geared towards representing a single state of the system—including in-flight messages and timeouts—in detail, while also enabling users to navigate a branching execution history. Users can click to inspect server state or the contents of messages and timeouts. Enabling detailed inspection is crucial for a debugging interface, since engineers use this information to decide which message or timeout should be delivered next. Oddity supports time travel debugging, allowing engineers to navigate to any previously explored state and explore a branching execution history without starting from scratch.


Table 1: The Oddity API. In order to use Oddity, users must implement a simple, JSON-based message API. Once a system node registers with the server, it responds to each message (including the start message, which is sent at the beginning of a debugging session and after a reset) with its updated state, sent messages, and set and cleared timeouts.

Message                                        Description            From/To
register(name)                                 Register a node        Node to server
start                                          Start the node         Server to node
timeout(type, body)                            Deliver a timeout      Server to node
message(from, type, body)                      Deliver a message      Server to node
response(state, messages, timeouts, cleared)   Response to any event  Node to server
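As a rough illustration of what a shim built on this API might look like, the Python sketch below registers a node and answers each event with a response message. The JSON envelope (a "msg" field naming the message type), the newline-delimited framing, the port, and the node-object methods are all assumptions; Table 1 specifies only the message names, parameters, and directions.

    # Sketch of a shim event loop speaking the Table 1 API. The JSON shape,
    # framing, and port are assumptions, as are the node-object methods.
    import json
    import socket

    def run_shim(node, name, host="localhost", port=4343):  # port invented
        f = socket.create_connection((host, port)).makefile("rw")

        def emit(msg):
            f.write(json.dumps(msg) + "\n")
            f.flush()

        emit({"msg": "register", "name": name})
        for line in f:
            event = json.loads(line)
            if event["msg"] == "start":
                node.on_start()
            elif event["msg"] == "timeout":
                node.on_timeout(event["type"], event["body"])
            elif event["msg"] == "message":
                node.on_message(event["from"], event["type"], event["body"])
            # Per Table 1, every event is answered with the updated state and
            # whatever the handler produced (collection methods are assumed).
            emit({"msg": "response",
                  "state": node.state(),
                  "messages": node.drain_messages(),
                  "timeouts": node.drain_set_timeouts(),
                  "cleared": node.drain_cleared_timeouts()})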

4 Architecture

Figure 7 shows the architecture of the Oddity debugger. As shown, Oddity consists of several cooperating components: a browser-based frontend that displays the interface discussed in the previous section; a backend, split between the browser and the server, that tracks the system’s state, implements time travel, and communicates with nodes in the system; and a shim that runs at each system node, communicates with the server, and calls the node’s handlers. Oddity’s graphical interface is independent of its backend; either could be replaced without changing the other.

4.1 Debugger backend

The debugger backend tracks the state of the system, including the local state at each node and in-flight messages and timeouts. It also tracks the event history. Each event is either a message delivery or a timeout delivery; a special “start” event represents the beginning of time. When the user clicks on a message or a timeout, the backend records this event and then sends the message or timeout to the Oddity shim instance running on the appropriate node. When it receives the response, it updates the display to reflect the new messages and timeouts and the modified local state.

When the user clicks on a state in the history display, the debugger backend resets the system to that point by replaying all of the events that led to that state (including the special “start” event). The user can then explore alternative executions starting from that state. This technique will not work if the system has non-deterministic handlers or accesses persistent state outside the debugger’s control. Oddity could be extended to support such systems by recording a snapshot of each node’s state after every event. We leave such an extension for future work.
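A sketch of this replay-based reset (the event-tree helper path_to, the restart method, and deliver are assumed names):

    # Sketch of replay-based time travel: restart every node, then re-deliver
    # the recorded events leading to the chosen state. path_to, restart, and
    # deliver are assumed helpers; only determinism makes this correct.
    def reset_to(state_id, history, nodes):
        path = history.path_to(state_id)   # events since the "start" event
        for node in nodes.values():
            node.restart()                  # back to its start state
        for event in path:
            deliver(nodes[event.target], event)  # replay message or timeout

Because handlers are assumed deterministic, replaying the same events reproduces the same states; the nondeterministic case is discussed in Section 7.3.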

4.2 Oddity shim

The Oddity shim is a library written in the user’s implementation language that implements communication with the Oddity server and is responsible for tracking the local state of the system. It must implement the API shown in Table 1 for communication with the Oddity server, and must call the user’s event handlers for start, timeout, and message calls from the server. It must provide user code with some mechanism for updating the local state of the node, as well as sending messages and setting timeouts.

The Oddity shim currently assumes that the user’s code is written as a set of event handlers. Oddity cannot currently be used to debug systems developed against a different programming model, such as multi-threaded servers using blocking RPC calls. This limitation is not fundamental, and Oddity’s architecture and visualizations can be extended to support such systems. We discuss this potential future extension in more detail in Section 7.

4.3 Trace display

Model checkers have been applied to distributed systems to great effect [32, 11]. The counterexamples they produce, however, can be long and complex. Rather than starting Oddity on a system and debugging from the beginning of time, a user can start Oddity with a trace of that system’s execution, produced by a model-checker as a counterexample to a desired property. A common use pattern is for the engineer to use time travel to explore the sequence of events that led to the invariant violation. The user can also use Oddity to investigate other executions branching off of the trace. The same facility can also be used to investigate traces mined from system logs by tools such as DEMi [27] to investigate bugs encountered during testing or production use. Oddity enhances techniques such as model-checking and log analysis by providing the engineer with tools to explore the context of the bug as well as alternative executions that might or might not trigger the same problem.

5 Implementation

Our research prototype of the Oddity debugger is implemented in approximately 1400 lines of Clojurescript (for the browser-based frontend) and 500 lines of Clojure (for the backend). Its interface uses SVG, and we have not found rendering performance to be an issue even with large systems and long execution traces. The current user interface is more limited—it is intended to be used with roughly 10 or fewer nodes for regular debugging and exploration tasks. The frontend uses a Websocket to communicate with the backend, which communicates with the shim over TCP.

We have implemented the Oddity shim for Python (121 lines of code) and Java (293 lines of code), and we have developed several systems against each implementation. It is easy to develop a new shim in any language with libraries for JSON serialization and network communication. The course labs discussed in the following section use the Java implementation of the shim. Table 2 shows all of the systems that have been debugged using Oddity, which version of the shim they used, and approximate lines of code for each system. The largest system we have debugged using Oddity is a sharded linearizable key-value store using Multi-Paxos replication to provide exactly-once, highly available operations on each key. The shard allocations are dynamic, and the system also supports simple multi-key transactions.

6 Evaluation

We have deployed Oddity to two classes: a 40-student graduate-level distributed systems class, which served as a pilot, and a 180-student undergraduate-level distributed systems class. In both cases, Oddity was given to students as part of the lab framework they used to do their homework assignments. The labs come with extensive test suites, and include a model-checker. Students can run Oddity in two modes: they can start their system and explore from the beginning of time, or run Oddity on a counterexample trace produced by the model-checker when an invariant is violated.

We had two goals in studying students’ experience with Oddity. One was to determine whether Oddity’s features are useful. The other was to examine student behavior when given access to an interactive debugger for distributed systems, in line with previous work that examines student usage of traditional step-through debuggers [22] and developer usage of trace visualization tools [4]. We hope that our experiences can inform future work in the same area.

We studied student experiences with Oddity in several ways. We instrumented the Oddity interface in order to track users’ clicks on various interface elements in order to see how they interacted with the system (this feature was only enabled if students opted into it). We sent out an optional survey to students after they completed the first major lab assignment, a primary-backup-based key-value store (as of this writing, other labs are ongoing). The survey is shown in Figure 9. We also informally discussed Oddity with students, recording anecdotes about their usage of the system on the primary-backup lab as well as the next lab, a Paxos-based key-value store [16].

We have defined several research questions, each based on a different use-case for Oddity: exploration of a system from the beginning of time in order to understand its operation and to find possible bugs; diagnosing and understanding a particular known bug; and replaying a trace from a model-checking counterexample. The research questions are as follows:

RQ1: Do developers explore their systems starting from the beginning of time? When doing so, do developers use Oddity to test their systems’ edge case behavior?

RQ2: Do developers find the debugger useful in diagnosing and repairing bugs? When doing so, do developers explore multiple branches?

RQ3: Is the ability to explore alternative executions starting from a model-checking trace useful for understanding why the bug occurred?

We discuss each of these questions in detail below. The two modes in which a student can start the Oddity debugger are (1) to start it on their system and explore from the system’s start state and (2) to start it on a trace generated by the model-checker when it finds a counterexample to a desired invariant. We use these as rough proxies for (A) exploratory testing, in which developers explore systems in order to understand them and find bugs and (B) diagnosing a specific bug, respectively; this is imperfect, since students may start their system from the beginning of time but with a specific bug in mind.

RQ1. Do developers explore their systems starting from the beginning of time? When doing so, do developers use Oddity to test their systems’ edge case behavior? We found that 74.5% of Oddity runs started from the beginning of time, as opposed to from a model-checking trace. In these runs, users explored an average of 37.3 states per run, with a median of 23 states per run. From this we can conclude that students did use the debugger for exploratory testing. We received some survey data to suggest that students were able to explore edge cases using this mode. One student said that “It was useful for one bug where I found out there was unexpected behavior from the [view server] when both the primary and backup timed out at the same time.” This indicates that students used Oddity to explore edge case behavior. We also received an anecdote from a student about a bug in which their Paxos implementation sent redundant messages under certain conditions (specifically: when a proposer received more than a majority of replies to its “prepare” messages, it ended up sending extra “accept” messages). The student did not suspect the existence of this bug before noticing it in Oddity, and believed they would not have found the bug at all without Oddity (the provided test suite did not test for the presence of these extra messages). Without an interactive debugger that can control message and timeout ordering, exploratory testing of distributed systems is tedious, and the usage tracking and survey results indicate that students find this feature very useful.

Table 2: Systems that have been debugged using Oddity, the version of the shim they used, and lines of code for the system implementation.

System                                   Shim Language   SLOC
Lamport mutual exclusion                 Python            73
Raft (with reconfiguration)              Python           240
At-most-once RPC                         Java             280
Primary-backup replication               Java             380
Paxos replication                        Java             550
Sharded transactional key-value store    Java            1390

A number of students said that they did not explore their systems starting from the beginning because they only debugged their system when a test case from the provided test suite failed. Our results may be biased as a result of our setting: with an extensive test suite, some students may not have felt a need to understand their system behavior independently of its behavior on the tests. It is possible that without such an extensive test suite, students would have found it more useful to start their systems in the debugger. On the other hand, our evaluation is of students and not professional developers. The students were learning about the protocols at the same time as they were implementing and debugging them, so it is possible that exploratory testing was a more compelling option for students than it would be for more experienced developers.

RQ2. Do developers find the debugger useful in diagnosing and repairing bugs? When doing so, do developers explore multiple branches? We found that 25.5% of Oddity runs started from a model-checking trace. In response to survey question 2, students reported that Oddity “helped [them] diagnose [their] handling of state transfer and state transfer acknowledgements” and that they were able to use it to diagnose a bug in which they “had some delayed messages arriving and causing problems.” A student reported successfully using Oddity to diagnose a bug in which the system had stopped making any progress after their latest change, which implemented deduplication of redundant client requests. They stepped through a simple test case and found that servers were never actually responding to clients; they were then able to fix the issue.

1. Did the debugger help you to discover any bugs in your system? Describe one.

2. Did the debugger help you to diagnose any bugs you were already aware of? Describe one.

3. When using the debugger to view a search-test counterexample, did you also explore other executions? Did this help you to understand the counterexamples? Describe an instance of this being useful.

4. Were there any bugs you think you would have found earlier if you had used the debugger? If not, how could the debugger have been more useful to you?

5. Do you have any other feedback about the debugger?

Figure 9: The optional survey sent to students after completing a homework assignment. We referred to tests that called the model checker as “search tests.”

RQ3. Is the ability to explore alternative executions starting from a model-checking trace useful for understanding why the bug occurred? When students started their systems from a model-checking trace, 23.6% of those executions explored multiple branches. In those cases, those state graphs branched an average of 1.5 times. From this we can conclude that at least some students explored alternative executions when viewing a model-checking counterexample. In response to survey question 3, some students did report that exploring alternative executions was useful. One student said that

from the bug where our servers were advancing themselves based on outdated/future view numbers, instead of just from the view server, it helped us see a situation where we could get stuck more frequently waiting for the server to ack a state transfer.

Another reported that the ability to explore alternative executions “distinctly helped understand what was happening.” We can conclude that the ability to explore alternative executions starting from a model-checking counterexample was useful for some students.


7 Discussion

Oddity provides an extensible platform for investigating many aspects of distributed systems beyond the examples illustrated in earlier sections. Below we describe how Oddity could facilitate new tools and methodologies for debugging, implementing, and designing distributed systems. We have left these features for future work.

7.1 Interactive space-time diagrams

In addition to the primary “nodes and inboxes” visualization, Oddity supports visualizing a system’s execution as a space-time diagram (e.g., Figure 1). Space-time diagrams are useful for viewing a summary of an entire execution trace at once. Because of Oddity’s extensible design, adding a traditional (non-interactive) version of space-time diagrams required less than 150 lines of code: Oddity simply pipes a formatted version of the system trace through GraphViz [8], and displays the resulting SVG image in the browser.
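As a rough illustration of this pipeline (not Oddity’s actual Clojure code), a trace can be rendered as a space-time diagram by emitting GraphViz dot and invoking the dot tool; the trace representation here is an assumption:

    # Illustrative only: render a trace as a space-time diagram via GraphViz.
    # The (sender, send_step, receiver, recv_step, label) format is assumed.
    import subprocess

    def trace_to_svg(nodes, messages, steps):
        lines = ["digraph spacetime {", "  rankdir=TB;"]
        for n in nodes:
            # A chain of undirected edges forms each node's vertical line.
            chain = " -> ".join(f'"{n}_{t}"' for t in range(steps))
            lines.append(f"  {chain} [arrowhead=none];")
            lines.append(f'  "{n}_0" [shape=plaintext, label="{n}"];')
            for t in range(1, steps):
                lines.append(f'  "{n}_{t}" [shape=point];')
        for src, t1, dst, t2, label in messages:
            lines.append(f'  "{src}_{t1}" -> "{dst}_{t2}" [label="{label}"];')
        lines.append("}")
        dot = "\n".join(lines)
        return subprocess.run(["dot", "-Tsvg"], input=dot.encode(),
                              capture_output=True, check=True).stdout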

In Oddity, space-time diagrams could be enriched with more detailed information about the currently executing system, e.g., by adding JavaScript hooks so that when a user hovers over a node in the space-time SVG, the state of that individual node at that point in history is displayed. Such an enriched space-time diagram would provide a bridge between the “nodes and inboxes” and branching trace history visualizations. Adding these additional features introduces new design and user interaction challenges:

• How should in-flight messages and timeouts be represented?

• How should users interactively control system execution or time travel from such a diagram?

• In what scenarios is one visualization simpler or more effective than another?

Oddity is well-suited for exploring these challenges: the Oddity API abstracts away many of the tedious details for modeling the network, controlling implementations of nodes, and interacting with different programming languages.

7.2 System-specific interface components

The Oddity frontend is built around a generic SVG-based canvas which makes integrating other visualization tools straightforward (e.g., for space-time diagrams as discussed above). In particular, Oddity could easily support systems which control aspects of their own visual representation by providing a mechanism to add additional shapes to the frontend SVG. This could be as simple as nodes (optionally) providing a special field in their local state with literal SVG objects to add to the visualization relative to the node’s own position. These extensions would enable generic system-specific visualization. For instance, the developer of a state-machine replication system might want to display the log of commands seen at each node as an array of boxes colored by term, while the developer of a ring maintenance system such as Chord might want to display the successor and predecessor of each node as arrows to other nodes. This would involve adding an API for systems to write elements to Oddity’s SVG-based interface, and perhaps developing a library of commonly-useful components (such as the arrows mentioned above). In general, Oddity’s architecture makes integrating new visualizations easy. Oddity’s browser-based frontend also simplifies building on recent advances in data visualization libraries such as D3 [3].

7.3 System model

Initially, we have focused on using Oddity to debug and explore distributed systems where node implementations behave as deterministic, single-threaded event handling loops. This class of systems corresponds to the example protocols used in lectures and exercises for the courses where Oddity has been used to date (Section 6). It also corresponds to the model used by recent projects formally verifying implementations of distributed systems [31, 9, 18]. However, many distributed systems rely on some combination of nondeterministic choice, blocking RPC-based communication, and local multi-threading at nodes.

For basic control and visualization, Oddity already supports nondeterministic distributed systems. However, Oddity’s replay-based approach to time travel will not correctly restore local node states with nondeterministic handlers. To correctly provide time travel in the face of nondeterminism, Oddity could require that each node send a snapshot of its current state after each event, or we could extend the Oddity shim to fork a child process after each event to effectively save a “paused” instance of a node’s process. Equipped with such an extension to the backend (and without any changes to the frontend), Oddity could support time travel for nondeterministic systems by restoring arbitrary previous states from snapshots.
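The fork-based variant could, in a POSIX setting, look something like the sketch below; this is only a sketch of the idea, eliding how a resumed copy would take over the node’s connection to the debugger:

    # Sketch of fork-based snapshots: after each event the shim forks; the
    # child blocks as a frozen copy of the node and can later be woken to
    # serve as the restored state. Coordination with the parent is elided.
    import os
    import signal

    signal.signal(signal.SIGUSR1, lambda signum, frame: None)  # wake-up only
    snapshots = []  # pid of one paused copy per executed event

    def snapshot():
        pid = os.fork()
        if pid == 0:
            signal.pause()   # child: sleep until (maybe never) restored
            return True      # woken by restore(): act as the live node now
        snapshots.append(pid)
        return False         # parent: continue executing events

    def restore(index):
        os.kill(snapshots[index], signal.SIGUSR1)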

Many implementations of distributed systems allow for the concurrent execution of event handlers for higher performance; for example, updates to different keys in a key-value store can be safely handled in parallel on different processors of a multi-core server, e.g., using locks to arbitrate access to shared data structures. Thus, the system behavior may depend on the thread scheduling decisions made on the local node. If the bug being diagnosed is invariant to the local thread schedule, it may suffice to simply enforce a single canonical thread execution order, such as with deterministic multi-threading [6]. If the bug manifests due to the interaction of the local schedule and distributed event delivery, then we would need to extend the visualization model to allow the programmer to control both.

Supporting RPC-based systems adds another layer of complexity to the interface between the debugger and the system. Oddity’s system model assumes that nodes atomically send messages in response to receiving a message or timeout and then immediately return to the top of the event loop, ready to handle the next (arbitrary) input event. With RPC systems, the event loop is inverted. Although the code performs the same computation in the same order as in an event system, a message arrival is “handled” only when a thread retrieves it, e.g., in response to a previous send.

7.4 Deeper model-checker integration

Oddity is designed to use event traces as a common representation for communication between the frontend and backend and between the backend and node shims. This design choice made integration with a model checker straightforward: the backend can simply walk a counterexample produced from the model checker as a trace and use each step to replay events on the nodes as in time travel debugging. As discussed in Section 6, this straightforward technique for integrating model checkers has already proved valuable for Oddity users. To further integrate model-checker functionality, Oddity could highlight particular components of a system state that violate the desired invariant.

More ambitiously, a model-checker could also be used to provide Oddity with “breakpoints.” Many systems do some initial bootstrapping and setup that may be tedious to simulate manually in Oddity (for instance, electing an initial leader). Instead, a developer could specify that they want to debug the system starting in some state meeting a global property (such as after a successful election). Oddity could ask the model-checker to find such a state, and then allow the user to explore the system’s execution starting from the state returned by the model-checker.

8 Related Work

Oddity builds on previous work in several areas, including distributed systems correctness, distributed systems log exploration, and distributed systems visualization. We discuss each of these areas in detail below. In addition, Oddity depends on a long line of work on single-node program debugging; many aspects of Oddity’s design were inspired by graphical step-through debuggers such as DDD [33].

Distributed Systems Correctness Systems such as TLA+ [17], Alloy [10], and Ivy [26] have been used for bug-finding and verification of high-level, abstract models of distributed systems (Ivy has recently been extended to support extraction to runnable code [30]). All three share Oddity’s goal of enabling users to understand the behavior of their systems and encouraging correct distributed systems thinking. TLA+ and Alloy use bounded model-checking; Oddity could be used in conjunction with these systems to display model-checking counterexamples. Ivy enables automatic verification of distributed systems specified in a carefully-crafted subset of first-order logic; it includes a graphical representation of counterexamples. Oddity complements these systems by enabling users to interactively debug both models and working implementations of distributed systems during development and after deployment.

Distributed systems log analysis There has been a large amount of work on collecting and analyzing logs of distributed systems [1, 7, 12, 13, 14, 34, 19, 2] and, relatedly, datacenter networks [21, 29]. Many of these systems, such as ShiViz [2], provide a graphical interface, allowing users to interactively explore the logs produced by their system. ShiViz’s visualization is based on space-time diagrams; it allows users to explore large and complex executions by querying the log and collapsing and expanding events. Like Oddity, these tools are designed for debugging and understanding distributed systems implementations. Unlike Oddity, these tools are for ex post facto debugging of system logs, rather than interactive debugging of a system’s behavior; log analysis systems do not enable exploratory testing. These two use cases complement each other: having obtained and examined a system log using ShiViz, an engineer can replay the log locally using Oddity in order to understand it and diagnose any problems that were encountered. Using Oddity, a user can explore alternative message orderings to determine whether they also produce bugs; existing log analysis systems do not support this.

Distributed systems animations Runway [24] is a system for visualizing models of distributed systems. It consists of a domain-specific high-level modeling language based on TLA-like actions, an interpreter for this language written in Javascript, and an API for extracting values from the interpreter for visualization. Several models and animations have been developed using Runway, including a visualization of the Raft consensus protocol.

Oddity’s visualization was inspired by those created for Runway. Oddity is much more general, however. Users of Runway must create protocol-specific visualizations; e.g., there is a visualization for Raft that would not apply even to related protocols such as Multi-Paxos or primary-backup. The students to whom we gave Oddity would have had to write their own Runway visualizations, since they were instructed to implement against a specification but were not forced to use any particular algorithm. Additionally, Oddity can be used to debug systems written in any language, while Runway requires users to model their systems in its domain-specific language.

9 Conclusion

Oddity is the first interactive graphical debugger for distributed systems. It allows users to control the delivery of events to the distributed system and observe the resulting execution. Oddity provides time travel to enable debugging multiple executions in order to explore normal- and edge-case behavior. Oddity introduces a new visualization and interaction mode that encourages distributed systems thinking: rather than assuming the normal case and attaching failure handling as an afterthought, users are shown the vast range of possible behaviors and provided with the tools needed to effectively explore. Dozens of student users learning about distributed systems across graduate and undergraduate courses have reported the value of such exploration when debugging and learning what makes systems (not) work. Finally, Oddity provides an extensible platform that can support research investigating new questions about how best to visualize and explore a greater range of systems.

References

[1] P. Bates and J. C. Wileden. An approach to high-level debugging of distributed systems. ACM SIGSOFT/SIGPLAN Software Engineering Symposium on High-level Debugging '83.

[2] I. Beschastnikh, P. Wang, Y. Brun, and M. D. Ernst. Debugging distributed systems. Commun. ACM, 59(8), July 2016.

[3] M. Bostock, V. Ogievetsky, and J. Heer. D3: Data-driven documents. IEEE Transactions on Visualization and Computer Graphics, 17(12), Dec. 2011.

[4] B. Cornelissen, A. Zaidman, and A. van Deursen. A controlled experiment for program comprehension through trace visualization. IEEE Transactions on Software Engineering, 37(3), May 2011.

[5] G. DeCandia, D. Hastorun, M. Jampani, G. Kakulapati, A. Lakshman, A. Pilchin, S. Sivasubramanian, P. Vosshall, and W. Vogels. Dynamo: Amazon's highly available key-value store. SOSP '07.

[6] J. Devietti, B. Lucia, L. Ceze, and M. Oskin. DMP: Deterministic shared memory multiprocessing. ASPLOS '09.

[7] S. G. Eick and A. Wards. An interactive visualization for message sequence charts. WPC '96.

[8] J. Ellson, E. Gansner, L. Koutsofios, S. North, and G. Woodhull. Graphviz: Open source graph drawing tools. Lecture Notes in Computer Science, pages 483–484, 2001.

[9] C. Hawblitzel, J. Howell, M. Kapritsos, J. R. Lorch, B. Parno, M. L. Roberts, S. Setty, and B. Zill. IronFleet: Proving practical distributed systems correct. SOSP '15.

[10] D. Jackson. Alloy: A lightweight object modelling notation. ACM Trans. Softw. Eng. Methodol., 11(2), Apr. 2002.

[11] C. Killian, J. W. Anderson, R. Jhala, and A. Vahdat. Life, death, and the critical transition: Finding liveness bugs in systems code. NSDI '07.

[12] D. Kranzlmuller, S. Grabner, and J. Volkert. Event graph visualization for debugging large applications. SPDT '96.

[13] J. Kundu and J. E. Cuny. A scalable, visual interface for debugging with event-based behavioral abstraction. FRONTIERS '99.

[14] T. Kunz, D. J. Taylor, and J. P. Black. Poet: Target-system-independent visualizations of complex distributed executions. The Computer Journal, 40(8), 1997.

[15] L. Lamport. Time, clocks, and the ordering of events in a distributed system. Commun. ACM, 21(7), July 1978.

[16] L. Lamport. The part-time parliament. ACM Trans. Comput. Syst., 16(2), May 1998.

[17] L. Lamport, J. Matthews, M. Tuttle, and Y. Yu. Specifying and verifying systems with TLA+. EW 10.

[18] M. Lesani, C. J. Bell, and A. Chlipala. Chapar: Certified causally consistent distributed key-value stores. POPL '16.

[19] X. Liu, Z. Guo, X. Wang, F. Chen, X. Lian, J. Tang, M. Wu, M. F. Kaashoek, and Z. Zhang. D3S: Debugging deployed distributed systems. NSDI '08.

[20] P. Maddox. Testing a distributed system. ACM Queue, 13(7), July 2015.

[21] H. Mai, A. Khurshid, R. Agarwal, M. Caesar, P. B. Godfrey, and S. T. King. Debugging the data plane with Anteater. SIGCOMM Comput. Commun. Rev., 41(4), Aug. 2011.

[22] R. McCauley, S. Fitzgerald, G. Lewandowski, L. Murphy, B. Simon, L. Thomas, and C. Zander. Debugging: A review of the literature from an educational perspective. Computer Science Education, 18(2):67–92, 2008.

[23] D. Ongaro. Consensus: Bridging Theory and Practice. PhD thesis, Stanford University, Aug. 2014.

[24] D. Ongaro. Runway: A new tool for distributed systems design. ;login:, 41(3), 2016.

[25] D. Ongaro and J. K. Ousterhout. In search of an understandable consensus algorithm. USENIX ATC '14, 2014.

[26] O. Padon, K. L. McMillan, A. Panda, M. Sagiv, and S. Shoham. Ivy: Safety verification by interactive generalization. PLDI '16.

[27] C. Scott, A. Panda, V. Brajkovic, G. Necula, A. Krishnamurthy, and S. Shenker. Minimizing faulty executions of distributed systems. NSDI '16.

[28] I. Stoica, R. Morris, D. Karger, M. F. Kaashoek, and H. Balakrishnan. Chord: A scalable peer-to-peer lookup service for internet applications. SIGCOMM '01.

[29] P. Tammana, R. Agarwal, and M. Lee. Simplifying datacenter network debugging with PathDump. OSDI '16.

[30] M. Taube, G. Losa, K. McMillan, O. Padon, M. Sagiv, S. Shoham, J. R. Wilcox, and D. Woos. Modularity for decidability of deductive verification with applications to distributed systems. PLDI '18.

[31] J. R. Wilcox, D. Woos, P. Panchekha, Z. Tatlock, X. Wang, M. D. Ernst, and T. Anderson. Verdi: A framework for implementing and formally verifying distributed systems. PLDI '15.

[32] J. Yang, T. Chen, M. Wu, Z. Xu, X. Liu, H. Lin, M. Yang, F. Long, L. Zhang, and L. Zhou. MODIST: Transparent model checking of unmodified distributed systems. NSDI '09.

[33] A. Zeller and D. Lutkehaus. DDD—a free graphical front-end for UNIX debuggers. SIGPLAN Not., 31(1), Jan. 1996.

[34] D. Zernick, M. Snir, and D. Malki. Using visualization tools to understand concurrency. IEEE Softw., 9(3), May 1992.
