Proceedings of the 9th ACM SIGCOMM Internet Measurement Conference (IMC '09), Chicago, Illinois, USA, November 4–6, 2009.

The Nature of Datacenter Traffic: Measurements & Analysis

Srikanth Kandula, Sudipta Sengupta, Albert Greenberg, Parveen Patel, Ronnie Chaiken
Microsoft Research

ABSTRACT
We explore the nature of traffic in data centers designed to support the mining of massive data sets. We instrument the servers to collect socket-level logs, with negligible performance impact. In a 1500-server operational cluster, we thus amass roughly a petabyte of measurements over two months, from which we obtain and report detailed views of traffic and congestion conditions and patterns. We further consider whether traffic matrices in the cluster might be obtained instead via tomographic inference from coarser-grained counter data.

Categories and Subject Descriptors
C.2.4 [Distributed Systems]: Distributed applications; C.4 [Performance of Systems]: Performance attributes

General Terms
Design, experimentation, measurement, performance

Keywords
Data center traffic, characterization, models, tomography

1. INTRODUCTION
Analysis of massive data sets is a major driver for today's data centers []. For example, web search relies on continuously collecting and analyzing billions of web pages, both to build fresh indexes and to mine click-stream data that improves search quality. As a result, distributed infrastructures that support query processing over petabytes of data on commodity servers are increasingly prevalent (e.g., GFS, BigTable [, ], Yahoo!'s Hadoop and PIG [, ], and Microsoft's Cosmos and Scope [, ]). Beyond search providers, the economics and performance of these clusters appeal to commercial cloud computing providers, who offer fee-based access to such infrastructures [, , ].

To the best of our knowledge, this paper provides the first description of the characteristics of traffic arising in an operational distributed query processing cluster that supports diverse workloads created in the course of solving business and engineering problems. Our measurements collected network-related events from each of the 1500 servers, which represent a logical cluster in an operational data center housing tens of thousands of servers, for over two months. Our contributions are as follows:
Measurement Instrumentation. We describe a lightweight, extensible instrumentation and analysis methodology that measures traffic on data center servers, rather than switches, providing socket-level logs. This server-centric approach, we believe, provides an advantageous tradeoff for monitoring traffic in data centers. Server overhead (CPU, memory, storage) is relatively small, though the traffic volumes generated in total are large – over GB per server per day. Further, such server instrumentation enables linking up network traffic to the applications that generate or depend on it, letting us understand the causes (and impact) of network incidents.

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. To copy otherwise, to republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee.
IMC'09, November 4–6, 2009, Chicago, Illinois, USA.
Copyright 2009 ACM 978-1-60558-770-7/09/11 ...$10.00.

[Figure 1: Sketch of a typical cluster (servers, top-of-rack switches, aggregation switches, VLANs, and an IP router). Tens of servers per rack are connected via inexpensive top-of-rack switches that in turn connect to high-degree aggregation switches. VLANs are set up between small numbers of racks to keep broadcast domains small. We collect traces from all (1500) nodes in a production cluster.]

Traffic Characteristics. Much of the traffic volume can be explained by two clearly visible patterns, which we call Work-Seeks-Bandwidth and Scatter-Gather. Using socket-level logs, we investigate the nature of the traffic within these patterns: flow characteristics, congestion, and the rate of change of the traffic mix.
Tomography Inference Accuracy. Will the familiar inference methods used to obtain traffic matrices in Internet Service Provider (ISP) networks extend to data centers [, , , ]? If they do, the barrier to understanding the traffic characteristics of datacenters would be lowered, from the detailed instrumentation we have done here to analyzing the more easily available SNMP link counters. Our evaluation shows that tomography performs poorly for data center traffic, and we postulate some reasons why.

A consistent theme that runs through our investigation is that both the methodology that works in the data center and the results seen there differ from their counterparts in ISP or even enterprise networks. The opportunities and “sweet spots” for instrumentation are different. The characteristics of the traffic are different, as are the challenges of the associated inference problems. Simple intuitive explanations arise from engineering considerations: applications' use of network, computing, and storage resources is more tightly coupled than in other settings.

2. DATA & METHODOLOGY
We briefly present our instrumentation methodology. Measurements in ISPs and enterprises concentrate on instrumenting the network devices, with the following choices:
SNMP counters, which support packet and byte counts across individual switch interfaces and related metrics, are ubiquitously available on network devices. However, logistical concerns over how often routers can be polled limit availability to coarse time-scales, typically once every five minutes, and by itself SNMP provides little insight into flow-level or even host-level behavior.
Sampled flow or sampled packet header data [, , , ] can provide flow-level insight, at the cost of keeping a higher volume of data for analysis and of assuring that samples are representative []. While not yet ubiquitous, these capabilities are becoming more available, especially on newer platforms [].
Deep packet inspection: Much research mitigates the costs of packet inspection at high speed [, ], but few commercial devices support it across production switch and router interfaces.


In this context, how do we design data-center measurements that achieve accurate and useful data while keeping costs manageable? What drives cost is detailed measurement at very high speed. To achieve speed, the computations have to be implemented in firmware, and, more importantly, the high-speed memory or storage required to keep track of details is expensive, so little of it is available on board the switch or router. Datacenters provide a unique choice: rather than collecting data on network devices with limited capabilities for measurement, we can obtain measurements at the servers, even commodity versions of which have multiple cores, GBs of memory, and 100s of GBs or more of local storage. When divided across servers, the per-server monitoring task is a surprisingly small fraction of what a network device might incur. Further, modern data centers have a common management framework spanning their entire environment (servers, storage, and network), simplifying the task of managing measurements and storing the produced data. Finally, instrumentation at the servers allows us to link the network traffic with application-level logs (e.g., at the level of individual processes), which is otherwise impossible to do with reasonable accuracy. This lets us understand not only the origins of network traffic but also the impact of network incidents (such as congestion or incast) on applications.

The idea of using servers to ease operations is not novel: network exception handlers leverage end hosts to enforce access policies [], and some prior work adapts PCA-based anomaly detectors to work well even when data is distributed across many servers []. Yet performing cluster-wide instrumentation of servers to obtain detailed measurements is a novel aspect of this work.

We use the ETW (Event Tracing for Windows []) framework to collect socket-level events at each server and parse the information locally. Periodically, the measured data is stowed away using the APIs of the underlying distributed file system, which we also use for analyzing the data.

In our cluster, the cost of turning on ETW was a median increase of . in CPU utilization, an increase of . in disk utilization, more CPU cycles per byte of network traffic, and fewer than a Mbps drop in network throughput, even when the server was using the NIC at capacity (i.e., at Gbps). This overhead is low primarily due to the efficient tracing framework [] underlying ETW, but also because, unlike packet capture, which involves an interrupt from the kernel's network stack for each packet, we use ETW to obtain socket-level events, one per application read or write, each of which aggregates over several packets and skips network chatter. To keep the cumulative data upload rate manageable, we compress the logs prior to uploading. Compression reduces the network bandwidth used by the measurement infrastructure by at least x.

In addition to network-level events, we collect and use application logs (job queues, process error codes, completion times, etc.) to see which applications generate what network traffic, as well as how network artifacts (congestion, etc.) impact applications.

Over a month, our instrumentation collected nearly a petabyte of uncompressed data. We believe that deep packet inspection is infeasible in production clusters of this scale; it would be hard to justify the associated cost, and the spikes in CPU usage associated with packet capture and parsing on the server interfaces are a concern for production cluster managers. The socket-level detail we collect is both doable and useful, since, as we will show next, it lets us answer questions that SNMP feeds cannot.

3. APPLICATION WORKLOAD
Before we delve into measurement results, we briefly sketch the nature of the application that drives traffic on the instrumented cluster. At a high level, the cluster is a set of commodity servers that supports map-reduce style jobs as well as a distributed, replicated block store layer for persistent storage.

[Figure 2: The Work-Seeks-Bandwidth and Scatter-Gather patterns in datacenter traffic, as seen in a matrix of log_e(Bytes) exchanged between server pairs (from-server vs. to-server) in a representative s period. (See §4.1.)]

Programmers write jobs in a high-level SQL-like language called Scope []. The Scope compiler transforms the job into a workflow (similar to that of Dryad []) consisting of phases of different types. Some of the common phase types are Extract, which looks at the raw data and generates a stream of relevant records; Partition, which divides a stream into a set number of buckets; Aggregate, which is the Dryad equivalent of reduce; and Combine, which implements joins. Each phase consists of one or more vertices that run in parallel and perform the same computation on different parts of the input stream. Input data may need to be read off the network if it is not available on the same machine, but outputs are always written to the local disk for simplicity. Some phases can function as a pipeline; for example, Partition may start dividing the data generated by Extract into separate hash bins as soon as an extract vertex finishes. Other phases may not be pipeline-able; for example, an Aggregate phase that computes the median sale price of different textbooks would need to look at every sales record for a textbook before it can compute the median price. Hence, in this case the aggregate can run only after every partition vertex that may output sales records for this book name completes. All the inputs and the eventual outputs of jobs are stored in a reliable, replicated block storage mechanism called Cosmos that is implemented on the same commodity servers that do the computation. Finally, jobs range over a broad spectrum, from short interactive programs written to quickly evaluate a new algorithm to long-running, highly optimized production jobs that build indexes.
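To make the pipelining distinction concrete, the following minimal Python sketch (illustrative only; it is not the Scope/Dryad implementation, and all names are invented for this example) contrasts a streamable hash Partition with a blocking median Aggregate:

```python
from collections import defaultdict
from statistics import median

def partition(records, num_buckets):
    """Pipeline-able: each record can be routed to its hash bucket
    as soon as it is produced by the upstream Extract phase."""
    for key, value in records:
        yield hash(key) % num_buckets, (key, value)

def aggregate_median(records):
    """Not pipeline-able: the median of a key's values cannot be
    emitted until every record for that key has been seen."""
    buffer = defaultdict(list)
    for key, value in records:          # must consume the whole input
        buffer[key].append(value)
    for key, values in buffer.items():  # only now can output begin
        yield key, median(values)

# Example: sales records flowing through both phase types.
sales = [("textbook-A", 30.0), ("textbook-B", 55.0), ("textbook-A", 42.0)]
print(list(partition(iter(sales), num_buckets=4)))
print(list(aggregate_median(iter(sales))))
```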

4. TRAFFIC CHARACTERISTICS
Context: The datacenter we collect traffic from has the typical structure sketched in Figure 1. Virtualization is not used in this cluster, so each IP address corresponds to a distinct machine, which we will refer to as a server. A matrix representing how much traffic is exchanged from the server denoted by the row to the server denoted by the column will be referred to as a traffic matrix (TM). We compute TMs at multiple time-scales (1s, 10s and 100s), and between both servers and top-of-rack (ToR) switches. The latter ToR-to-ToR TM has zero entries on the diagonal, i.e., unlike the server-to-server TM, only traffic that flows across racks is included. By flow, we mean the canonical five-tuple (source IP, source port, destination IP, destination port, and protocol). When explicit begins and ends of a flow are not available, similar to much prior work [, ], we use a long inactivity timeout (default s) to determine when a flow ends (or a new one begins). Finally, clocks across the various servers are not synchronized, but neither are they skewed enough to affect the subsequent analysis.
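As a concrete illustration of the two derived objects used throughout this section, here is a hedged Python sketch (the record layout, field names, and the 60-second timeout are assumptions for illustration; the paper's default timeout value is not legible in this transcript) that groups socket-level events into flows by five-tuple with an inactivity timeout, and bins bytes into a server-to-server TM:

```python
from collections import defaultdict

def flows_from_events(events, timeout=60.0):
    """Group socket-level events into flows keyed by the canonical
    five-tuple (src IP, src port, dst IP, dst port, protocol).
    Each event is (time_s, five_tuple, nbytes). An inactivity gap
    longer than `timeout` ends a flow, and the next event on the
    same five-tuple starts a new one (60s is a placeholder value)."""
    open_flows = {}   # five_tuple -> [start, last_seen, total_bytes]
    finished = []
    for t, ft, nbytes in sorted(events):
        state = open_flows.get(ft)
        if state is not None and t - state[1] > timeout:
            finished.append((ft, *state))   # close the idle flow
            state = None
        if state is None:
            open_flows[ft] = [t, t, nbytes]
        else:
            state[1] = t
            state[2] += nbytes
    finished.extend((ft, *state) for ft, state in open_flows.items())
    return finished   # (five_tuple, start, last_seen, total_bytes)

def traffic_matrix(events, t0, window):
    """Server-to-server TM: bytes sent from src to dst during
    [t0, t0 + window); the paper computes these at 1s, 10s and 100s."""
    tm = defaultdict(int)
    for t, (src, sport, dst, dport, proto), nbytes in events:
        if t0 <= t < t0 + window:
            tm[src, dst] += nbytes
    return tm
```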

4.1 Patterns
Two pronounced patterns together comprise a large chunk of the traffic in the data center. We call these the work-seeks-bandwidth pattern and the scatter-gather pattern, after their respective causes.


[Figure 3: How much traffic is exchanged between server pairs (non-zero entries)? Density of log_e(Bytes) exchanged, within rack and across racks.]

[Figure 4: How many other servers does a server correspond with? Frequency of the fraction of correspondents, within rack and across racks. (Rack = 20 servers, Cluster ∼ 1500 servers)]

Figure 2 plots the log_e(Bytes) exchanged between server pairs in a s period. We order the servers such that those within a rack are adjacent to each other on the axes. The small squares around the diagonal represent a large chunk of the traffic and correspond to exchanges among servers within a rack. At first blush, this figure resembles the CPU and memory layouts on ASIC chips that are common in the architecture community. Indeed, the resemblance extends to the underlying reasons. While chip designers prefer placing components that interact often (e.g., CPU and L caches, multiple CPU cores) close by, to get high-bandwidth interconnections on the cheap, writers of data center applications prefer placing jobs that rely on heavy traffic exchanges with each other in areas where high network bandwidth is available. In topologies such as the one in Figure 1, this translates to the engineering decision of placing jobs within the same server, within servers on the same rack, or within servers in the same VLAN, and so on in decreasing order of preference; hence the work-seeks-bandwidth pattern. Further, the horizontal and vertical lines represent instances wherein one server pushes (or pulls) data to many servers across the cluster. This is indicative of the map and reduce primitives underlying distributed query processing infrastructures, wherein data is partitioned into small chunks, each of which is worked on by a different server, and the resulting answers are later aggregated. Hence, we call this the scatter-gather pattern. Finally, we note that the dense diagonal does not extend all the way to the top right corner. This is because the area on the far right (and far top) corresponds to servers that are external to the cluster, which upload new data into the cluster or pull out results from it.

We now characterize these patterns with a bit more precision. Figure 3 plots the log-distribution of the non-zero entries of the TM. At first, both distributions appear similar: non-zero entries are somewhat heavy-tailed, ranging over [e^4, e^20], with server pairs within the same rack more likely to exchange more bytes. Yet the true distributions are quite different, owing to the numbers of zero entries: the probability of exchanging no traffic is for server pairs that belong to the same rack and . for pairs that are in different racks. Finally, Figure 4 shows the distributions of how many correspondents a server talks with. A server either talks to almost all the other servers within its rack (the bump near 1 in Fig. 4, left) or it talks to fewer than of the servers within its rack. Further, a server either does not talk to servers outside its rack (the spike at zero in Fig. 4, right) or it talks to about - of the outside servers. The median numbers of correspondents for a server are two (other) servers within its rack and four servers outside the rack.

[Figure 5: When and where does congestion happen in the datacenter? Links from ToR to Core vs. time (hours); markers show 100s and 10s congestion events.]

[Figure 6: Length of congestion events: probability and cumulative distributions of congestion duration (s), over 665 unique episodes.]

We believe that Figures 2 to 4 together form the first characterization of datacenter traffic at a macroscopic level and comprise a model that can be used in simulating such traffic.

4.2 Congestion Within the Datacenter
Next, we shift focus to hot-spots in the network, i.e., links whose average utilization is above some constant C. Results in this section use a value of C = 70%, but choosing a threshold of 90% or 95% yields qualitatively similar results. Ideally, one would like to drive the network at as high a utilization as possible without adversely affecting throughput. Pronounced periods of low network utilization likely indicate (a) that the application by nature demands more of other resources, such as CPU and disk, than the network, or (b) that the applications can be re-written to make better use of available network bandwidth.
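A minimal sketch of this hot-spot definition (assuming per-second average utilization samples in [0, 1] for a single link; the data layout is illustrative, not the authors' pipeline):

```python
def congestion_episodes(utilization, threshold=0.7):
    """Given per-second average utilization samples for one link,
    return the lengths (in seconds) of maximal runs above the
    threshold (C = 70% in the text)."""
    episodes, run = [], 0
    for u in utilization:
        if u > threshold:
            run += 1
        elif run:
            episodes.append(run)
            run = 0
    if run:
        episodes.append(run)
    return episodes

# Example: a link that is hot for 3s, cools off, then is hot for 2s.
print(congestion_episodes([0.9, 0.8, 0.75, 0.2, 0.3, 0.95, 0.85]))  # [3, 2]
```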

Figure 5 illustrates when and where links within the monitored network are highly utilized. Highly utilized links are common! Among the inter-switch links that carry the traffic of the monitored machines, of the links observe congestion lasting at least seconds and observe congestion lasting at least seconds. Short congestion periods (blue circles, 10s of high utilization) are highly correlated across many tens of links and are due to brief spurts of high demand from the application. Long-lasting congestion periods tend to be more localized to a small set of links. Figure 6 shows that most periods of congestion tend to be short-lived. Of all congestion events that are more than one second long, over are no longer than seconds, but long epochs of congestion exist: in one day's worth of data, there were unique episodes of congestion that each lasted more than s, a few epochs lasted several hundreds of seconds, and the longest lasted seconds.

When congestion happens, is there collateral damage to victim flows that happen to be using the congested links? Figure 7 compares the rates of flows that overlap high-utilization periods with the rates of all flows. From an initial inspection, it appears that the rates do not change appreciably (see the CDF below). Errors such as flow timeouts or failures to start may not be visible in flow rates, hence we correlate high-utilization epochs directly with application-level logs. Figure 8 shows that jobs experience a median increase of .x in their probability of failing to read input(s) if they have flows traversing high-utilization links. Note that while outputs are always written to the local disk, the next phase of the job that uses this data may have to read it over the network if necessary.


[Figure 7: Comparing the rates of flows that overlap congestion with the rates of all flows: fraction and CDF vs. flow rate (Mbps).]

[Figure 8: Increase in the probability that a job cannot read input(s) when it overlaps a high-utilization link, per day over 5-Jan to 12-Jan; daily values range from -90.20% to 2427.38%. The likelihood that a job fails because it is unable to read requisite data over the network increases by .x (median) during high-utilization epochs.]

When a job is unable to find its input data, is unable to connect to the machine that has the input data, or is stuck, i.e., does not make steady progress in reading more of its input, the job is killed and logged as a read failure. We note upfront that not all read failures are due to the network; besides congestion, they could be caused by an unresponsive machine, bad software, or bad disk sectors. However, we observe a high correlation between network congestion and read failures, leading us to believe that a sizable chunk of the observed read failures are due to congestion. Over a one-week period, we see that the inability to read input(s) increases when the network is highly utilized. Further, the more prevalent the congestion (on th, th Jan, for example), the larger the increase; in particular, the days with little increase (th, th Jan) correspond to a lightly loaded weekend.

When high-utilization epochs happen, we would like to know the causes behind the high volumes of traffic. Operators would like to know whether these high volumes are normal. Developers can better engineer job placement if they know which applications send how much traffic, and network designers can better evaluate architectural choices by knowing what drives the traffic. To attribute network traffic to the applications that generate it, we merge the network event logs with application-level logs that describe which job and phase (e.g., map, reduce) were active at that time. Our results show that, as expected, jobs in the reduce phase are responsible for a fair amount of the network traffic. Note that in the reduce phase of a map-reduce job, data in each partition that is present at multiple servers in the cluster (e.g., all personnel records that start with 'A') has to be pulled to the server that handles the reduce for that partition (e.g., to count the number of records that begin with 'A') [, ].

However, unexpectedly, the extract phase also contributed a fair number of the flows on high-utilization links. In Dryad [], extract is an early phase in the workflow that parses the data blocks. Hence, it looks at by far the largest amount of data, and the job manager attempts to keep the computation as close to the data as possible. It turns out that a small fraction of all extract instances read data off the network, when all of the cores on the machine that has the data are busy at the time. Yet another unexpected cause of highly utilized links was evacuation events. When a server repeatedly experiences problems, the automated management system in our cluster evacuates all the usable blocks on that server prior to alerting a human that the server is ready to be re-imaged (or reclaimed).

[Figure 9: Probability and cumulative distributions of flow duration (seconds), in flows and in bytes. More than of the flows last less than ten seconds, fewer than . last longer than s, and more than of the bytes are in flows lasting less than s.]

The latter two unexpected sources of congestion helped developers re-engineer their applications based on these measurements.

To sum up, high-utilization epochs are common, appear to be caused by application demand, and have a moderate negative impact on job performance.

4.3 Flow Characteristics
Figure 9 shows that the traffic mix changes frequently. The figure plots the durations of a day's worth of flows ( million flows) in the cluster. Most flows come and go ( last less than s) and there are few long-running flows (fewer than . last longer than s). This has interesting implications for traffic engineering. Centralized decision making, in terms of deciding which path a certain flow should take, is quite challenging: not only would the central scheduler have to deal with a rather high volume of scheduling decisions, but it would also have to make the decisions very quickly to avoid visible lag in flows. One might wonder whether most of the bytes are contained in the long-running flows. If this were true, scheduling just the few long-running flows would be enough. Unfortunately, this turns out not to be the case in DC traffic; more than half the bytes are in flows that last no longer than s.

Figure 10 shows how the traffic changes over time within the data center. The figure on the top shows the aggregate traffic rate over all server pairs for a ten-hour period. Traffic changes quite quickly; some spikes are transient, but others last for a while. Interestingly, the top of the spikes is more than half the full-duplex bisection bandwidth of the network. Communication patterns that are full duplex are rare because, typically, at any time the producers and consumers of data are fixed. Hence, this means that at several times during a typical day, all the used network links run close to capacity.

Another dimension of traffic change is the flux in participants: even when the net traffic rate remains the same, the servers that exchange those bytes may change. Figure 10 (bottom) quantifies the absolute change in the traffic matrix from one instant to another, normalized by the total traffic. More precisely, if M(t) and M(t + τ) are the traffic matrices at times t and t + τ, we plot

    Normalized Change = |M(t + τ) − M(t)| / |M(t)|,

where the numerator is the absolute sum of the entry-wise differences of the two matrices and the denominator is the absolute sum of the entries of M(t). We plot changes for both τ = 100s and τ = 10s. At both of these time-scales, the median change in traffic is roughly and the th and th percentiles are and respectively. This means that even when the total traffic in the matrix remains the same (flat regions in the top graph), the server pairs that are involved in these traffic exchanges change appreciably.


[Figure 10: Traffic in the data-center changes in both the magnitude (top: TM magnitude, GB/s, vs. time in seconds) and the participants (bottom: Norm1 change over 100s and over 10s).]

There are instances of both leading and lagging change: short bursts cause spikes at the shorter time-scale (dashed line) that smooth out at the longer time-scale (solid line), whereas gradual changes appear conversely, smoothed out at the shorter time-scale yet pronounced at the longer one. Significant variability appears to be a key aspect of data center traffic.
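The flux-in-participants metric above is easy to restate in code; a sketch with TMs represented as dicts from (src, dst) to bytes (a representation assumed for illustration):

```python
def normalized_change(tm_now, tm_later):
    """|M(t + tau) - M(t)|_1 / |M(t)|_1, where each TM is a dict
    mapping (src, dst) pairs to byte counts and |.|_1 is the
    absolute sum over entries (missing entries count as zero)."""
    keys = set(tm_now) | set(tm_later)
    num = sum(abs(tm_later.get(k, 0) - tm_now.get(k, 0)) for k in keys)
    den = sum(abs(v) for v in tm_now.values())
    return num / den if den else float("inf")

# Same total volume, different participants => large normalized change.
m_t     = {("s1", "s2"): 100}
m_t_tau = {("s3", "s4"): 100}
print(normalized_change(m_t, m_t_tau))  # 2.0
```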

Figure 11 portrays the distribution of inter-arrival times between flows as seen at hosts in the datacenter. How long after a flow arrives should one expect another flow to arrive? If flow arrivals were a Poisson process, network designers could safely design for the average case. Yet we see evidence of periodic short-term bursts and long tails. The inter-arrivals at both servers and top-of-rack switches have pronounced periodic modes spaced apart by roughly ms. We believe this is likely due to the stop-and-go behavior of the application, which rate-limits the creation of new flows. The tails of these two distributions are quite long as well; servers may see flows spaced apart by up to s. Finally, the median arrival rate of all flows in the cluster is 10^5 flows per second, or 100 flows in every millisecond. Centralized schedulers that decide which path to pin a flow on may be hard pressed to keep up. Scheduling application units (jobs, etc.) rather than the flows caused by these units is likely to be more feasible, as would distributed schedulers that engineer flows by making simple random choices [, ].

4.4 On Incast
We do not see direct evidence of the incast problem [, ], perhaps because we do not have detailed TCP-level statistics for flows in the datacenter. However, we comment on how often the several preconditions for incast hold in the examined datacenter. First, due to the low round-trip times in datacenters, the bandwidth-delay product is small, which, when divided over the many contending flows on a link, results in a small congestion window for each flow. Second, when the interface's queue is full, multiple flows see their packets dropped. Due to their small congestion windows, these flows cannot recover via TCP fast retransmit, are stuck until a TCP timeout, and have poor throughput. Third, for the throughput of the network to also go down, synchronization must happen such that no other flow is able to pick up the slack when some flows are in TCP timeout. Finally, an application is impacted more if it cannot make forward progress until all of its network flows finish.

MapReduce, or at least the implementation in our datacenter, exhibits very few scenarios wherein a job phase cannot make incremental progress with the data it receives from the network. Further, two engineering decisions explicitly limit the number of mutually contending flows: first, applications limit their simultaneously open connections to a small number (default ), and second, computation is placed such that, with high probability, network exchanges are local (i.e., within a rack, within a VLAN, etc.; see Figure 2).

[Figure 11: Probability and cumulative fraction of inter-arrival times (ms) of the flows seen in the entire cluster, at top-of-rack switches (averaged), and at servers (averaged).]

This local nature of flows (most are either within the same rack or VLAN) implicitly isolates flows from other flows elsewhere in the network and reduces the likelihood that a bottlenecked switch will carry the large number of flows needed to trigger incast. Finally, several jobs run on the cluster at any time. Though one or a few flows may suffer timeouts, this multiplexing allows other flows to use the bandwidth that becomes free, reducing the likelihood of wholesale throughput collapse.

We do believe that TCP's inability to recover from even a few packet drops without resorting to timeouts, in low bandwidth-delay product settings, is a fundamental problem that needs to be solved. However, on the observed practical workloads, which are perhaps typical of a wide set of datacenter workloads, we see little evidence of throughput collapse due to this weakness in TCP.

5. TOMOGRAPHY IN THE DATA CENTER
Socket-level instrumentation, which we used to drive the results presented so far in the paper, is unavailable in most datacenters, but link counters at routers (e.g., SNMP byte counts) are widely available. It is natural to ask: in the absence of more detailed instrumentation, to what approximation can we achieve similar value from link counters? In this section, we primarily focus on network tomography methods that infer traffic matrices (origin-destination flow volumes) from link-level SNMP measurements [, ]. If these techniques are as applicable in datacenters as they are in ISP networks, they would help us unravel the nature of traffic in many more datacenters without the overhead of detailed measurement.

There are several challenges for tomography methods to extend to data centers. Tomography is inherently an under-constrained problem: while the number of origin-destination flow volumes to be estimated is quadratic (n(n − 1)), the number of link measurements available (i.e., constraints) is much smaller, often a small constant times the number of nodes. Further, the typical datacenter topology (Fig. 1) represents a worst-case scenario for tomography. As many ToR switches connect to one or a few high-degree aggregation switches, the number of link measurements available is small (typically 2n). To combat this under-constrained nature, tomography methods model the traffic seen in practice and use these models as a priori estimates of the traffic matrix, thereby narrowing the space of TMs that are possible given the link data.

A second difficulty stems from the fact that many of the priors known to be effective make simplifying assumptions. For example, the gravity model assumes that the amount of traffic a node (origin) sends to another node (destination) is proportional to the traffic volume received by the destination. Though this prior has been shown to be a good predictor in ISP networks [, ], the pronounced patterns in traffic that we observe are quite far from the simple spread that the gravity prior would generate. A final difficulty is one of scale. While most existing methods compute traffic matrices between a few participants (e.g., the POPs in an ISP), even a reasonable cluster has several thousand servers.


[Figure 12: CDF (percentile rank) of estimation error, for 75% volume, for TMs estimated by (i) tomogravity, (ii) tomogravity augmented with job information, and (iii) sparsity maximization.]

[Figure 13: The fraction of entries that comprise 75% of the traffic in the ground-truth TM correlates well (negatively) with the estimation error of tomogravity.]

Methodology: We compute link counts from the ground-truth TM and measure how well the TM estimated by tomography from these link counts approximates the true TM. Our error function avoids penalizing mis-estimates of matrix entries that have small values []. Specifically, we choose a threshold T such that entries larger than T make up about 75% of the traffic volume, and then compute the Root Mean Square Relative Error,

    RMSRE = sqrt( (1/N_T) * Σ_{x_ij^true ≥ T} ( (x_ij^est − x_ij^true) / x_ij^true )^2 ),

where x_ij^true and x_ij^est are the true and estimated entries respectively, and N_T is the number of entries above the threshold. This evaluation sidesteps the issue of scale by attempting to obtain traffic matrices at the ToR level. We report aggregate results over ToR-level TMs, which is about a day's worth of min average TMs.
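A sketch of this error metric as reconstructed above, with TMs as dicts (the 1/N normalization is implied by the "root mean square" name rather than legible in this transcript):

```python
import math

def volume_threshold(tm_true, volume_frac=0.75):
    """Smallest T such that entries >= T account for at least
    `volume_frac` of the total traffic volume."""
    vals = sorted(tm_true.values(), reverse=True)
    target, acc = volume_frac * sum(vals), 0.0
    for v in vals:
        acc += v
        if acc >= target:
            return v
    return 0.0

def rmsre(tm_true, tm_est, volume_frac=0.75):
    """Root Mean Square Relative Error over the large entries only,
    so mis-estimates of tiny entries are not penalized."""
    t = volume_threshold(tm_true, volume_frac)
    errs = [(tm_est.get(k, 0.0) - v) / v for k, v in tm_true.items() if v >= t]
    return math.sqrt(sum(e * e for e in errs) / len(errs))
```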

5.1 Tomogravity
Tomogravity-based tomography methods [] use the gravity traffic model to estimate, a priori, the traffic between a pair of nodes. In Figure 12, we plot the CDF of tomogravity estimation errors for min TMs taken over an entire day. Tomogravity results in fairly inaccurate inferences, with estimation errors ranging from 35% to 184% and a median of 60%. We observed that the gravity prior used in estimation tends to spread traffic around, whereas the ground-truth TMs are sparse. An explanation for this is that communication is more likely between nodes that are assigned to the same job than between all nodes, whereas the gravity model, not being aware of these job-clusters, introduces traffic across clusters, resulting in many non-zero TM entries. To verify this conjecture, we show in Figure 13 that the estimation error of tomogravity is correlated with the sparsity of the ground-truth TM: the fewer the number of entries in the ground-truth TM, the larger the estimation error. (A logarithmic best-fit curve is shown in black.)
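For concreteness, here is a sketch of the gravity prior that tomogravity starts from (the normalization and the exclusion of self-pairs are assumptions of this sketch; the subsequent least-squares adjustment to match link counts is not shown):

```python
def gravity_prior(out_bytes, in_bytes):
    """Gravity-model prior: traffic from origin i to destination j is
    proportional to (total sent by i) * (total received by j).
    Tomogravity then adjusts this prior to be consistent with the
    observed SNMP link counts (adjustment not shown)."""
    total = sum(in_bytes.values())
    return {
        (i, j): out_bytes[i] * in_bytes[j] / total
        for i in out_bytes for j in in_bytes if i != j
    }

# The prior is dense and smooth, which is exactly why it misses the
# sparse, job-clustered structure of datacenter TMs.
print(gravity_prior({"A": 90, "B": 10}, {"A": 50, "B": 50}))
```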

5.2 Sparsity Maximization
Given the sparse nature of datacenter TMs, we consider an estimation method that favors sparser TMs among the many possible. Specifically, we formulated a mixed integer linear program (MILP) that generates the sparsest TM subject to the link traffic constraints. Sparsity maximization has been used earlier to isolate anomalous traffic []. However, we find that the sparsest TMs are much sparser than the ground-truth TMs (see Figure 14) and hence yield a worse estimate than tomogravity (see Figure 12).
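A hedged sketch of such a sparsity-maximizing MILP, using PuLP with the CBC solver as an assumed tooling choice (the paper does not name a solver, and the single-path routing indicators are an illustration of the link constraints):

```python
import pulp

def sparsest_tm(pairs, links, routes, link_counts, big_m):
    """Minimize the number of non-zero TM entries subject to the link
    constraints. `routes[(pair, link)]` is 1 if the pair's traffic
    crosses the link (single-path routing assumed for illustration)."""
    prob = pulp.LpProblem("sparsest_tm", pulp.LpMinimize)
    x = {p: pulp.LpVariable(f"x_{i}", lowBound=0) for i, p in enumerate(pairs)}
    z = {p: pulp.LpVariable(f"z_{i}", cat="Binary") for i, p in enumerate(pairs)}
    prob += pulp.lpSum(z.values())          # sparsity objective
    for p in pairs:
        prob += x[p] <= big_m * z[p]        # x > 0 forces its indicator z to 1
    for l in links:                         # estimated TM must explain link counts
        prob += (
            pulp.lpSum(routes.get((p, l), 0) * x[p] for p in pairs)
            == link_counts[l]
        )
    prob.solve(pulp.PULP_CBC_CMD(msg=False))
    return {p: x[p].value() for p in pairs if z[p].value() > 0.5}
```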

[Figure 14: Comparing the TMs estimated by the various tomography methods with the ground truth, in terms of the fraction of TM entries that account for 75% of the total traffic (percentile rank). Ground-truth TMs are sparser than tomogravity-estimated TMs, and denser than sparsity-maximized TMs.]

The TMs estimated via sparsity maximization typically contain 150 non-zero entries, which is about 3% of the total TM entries. Further, these non-zero entries do not correspond to the heavy hitters in the ground-truth TMs; only a handful (5-20) of these entries correspond to entries in the ground-truth TM with value greater than the 97th percentile. Sparsity maximization appears overly aggressive: datacenter traffic seems to lie somewhere between the dense nature of tomogravity-estimated TMs and the sparse nature of sparsity-maximized TMs.

5.3 Prior based on application metadata
Can we leverage application logs to supplement the shortcomings of tomogravity? Specifically, we use metadata on which jobs ran when, and on which machines were running instances of the same job. We extend the gravity model to include an additional multiplier for the traffic between two given nodes (ToRs) i and j that is larger if the nodes share more jobs and smaller otherwise: the product of the number of instances of a job k running on servers under ToRs i and j, summed over all jobs k. In practice, however, this extension does not improve on vanilla tomogravity by much; the estimation errors are only marginally better (Figure 12), though the TMs estimated by this method are closer to the ground truth in terms of sparsity (Figure 14). We believe this is because nodes in a job assume different roles over time, and the traffic patterns vary with the respective roles. As future work, we plan to incorporate further information on the roles of nodes assigned to a job.

6. RELATED WORK
Data center networking has recently emerged as a topic of interest, yet there is little published work on the measurement, analysis, and characterization of datacenter traffic. Greenberg et al. [] report datacenter traffic characteristics (variability at small timescales, statistics on flow sizes and concurrent flows) and use these to guide network design. Benson et al. [] perform a complementary study of traffic at the edges of a datacenter by examining SNMP traces from routers and identify ON-OFF characteristics, whereas this paper examines novel aspects of traffic within a data center in detail.

Traffic measurement in enterprises is better studied, with papers that compare enterprise traffic to wide-area traffic [], study the health of an enterprise network based on the fraction of successful flows generated by end-hosts [], and use traffic measurement on end-hosts for fine-grained access control [].

7. DISCUSSION
We believe that our results would extend to other mining data centers that employ some flavor of map-reduce style workflow computation on top of a distributed block store. For example, several companies, including Yahoo! and Facebook, have clusters running Hadoop, an open-source implementation of map-reduce, and Google has clusters that run map-reduce.


In contrast, web or cloud data centers that primarily deal with generating responses to web requests (e.g., mail, messenger) are likely to have different characteristics. Our results are primarily dictated by how the applications have been engineered and are likely to hold even as specifics such as network topology and over-subscription ratios change. However, we note that, pending future data center measurements, based perhaps on instrumentation similar to that described here, these beliefs remain conjectures.

One implication of our measurements is worth calling out. By partitioning the measurement problem, which in the past was done at switches or routers, across many commodity servers, we relax many of the typical constraints (memory, cycles) on measurement. Clever counters and data structures for performing measurement at line speed under constrained memory are no longer as crucial, though they remain useful in keeping overheads small. Conversely, however, handling scenarios where multiple independent parties each measure a small piece of the puzzle gains new weight.

8. CONCLUSIONS
In spite of widespread interest in datacenter networks, little has been published that reveals the nature of their traffic, or the problems that arise in practice. This paper is a first attempt to capture both the macroscopic patterns (which servers talk to which others, when, and for what reasons) as well as the microscopic characteristics (flow durations, inter-arrival times, and like statistics) that should provide a useful guide for datacenter network designers. These statistics appear more regular and better behaved than their counterparts from ISP networks (e.g., “elephant” flows last only about one second). This, we believe, is the natural outcome of the tighter coupling between network, computing, and storage in datacenter applications. We did not see evidence of super-large flows (flow sizes being determined largely by chunking considerations, optimizing for storage latencies), of TCP incast problems (the preconditions apparently not arising consistently), or of sustained overloads (owing to the near-ubiquitous use of TCP). However, episodes of congestion and negative application impact do occur, highlighting the significant promise of improvement through a better understanding of traffic and through mechanisms that steer demand.

Acknowledgments: We are grateful to Igor Belianski and Aparna Rajaraman for invaluable help in deploying the measurement infrastructure, and to Bikas Saha, Jay Finger, Mosha Pasumansky and Jingren Zhou for helping us interpret results.

9. REFERENCES
[] Amazon Web Services. http://aws.amazon.com.
[] Event Tracing for Windows. http://msdn.microsoft.com/en-us/library/ms.aspx.
[] Google App Engine. http://code.google.com/appengine/.
[] Hadoop Distributed Filesystem. http://hadoop.apache.org.
[] Windows Azure. http://www.microsoft.com/azure/.
[] L. A. Barroso and U. Hölzle. The Datacenter as a Computer: An Introduction to the Design of Warehouse-Scale Machines. Synthesis Lectures on Computer Architecture.
[] T. Benson, A. Anand, A. Akella, and M. Zhang. Understanding Datacenter Traffic Characteristics. In SIGCOMM WREN Workshop.
[] R. Chaiken, B. Jenkins, P. Åke Larson, B. Ramsey, D. Shakib, S. Weaver, and J. Zhou. SCOPE: Easy and Efficient Parallel Processing of Massive Data Sets. In VLDB.
[] F. Chang, J. Dean, S. Ghemawat, W. Hsieh, D. A. Wallach, M. Burrows, T. Chandra, A. Fikes, and R. E. Gruber. Bigtable: A Distributed Storage System for Structured Data. In OSDI.
[] Y. Chen, R. Griffith, J. Liu, R. H. Katz, and A. D. Joseph. Understanding TCP Incast Throughput Collapse in Datacenter Networks. In SIGCOMM WREN Workshop.
[] Cisco Guard DDoS Mitigation Appliance. http://www.cisco.com/en/US/products/ps/.
[] Cisco Nexus Series Switches. http://www.cisco.com/en/US/products/ps/.
[] C. Cranor, T. Johnson, O. Spataschek, and V. Shkapenyuk. Gigascope: A Stream Database for Network Applications. In SIGMOD.
[] J. Dean and S. Ghemawat. MapReduce: Simplified Data Processing on Large Clusters. In OSDI.
[] N. Duffield, C. Lund, and M. Thorup. Estimating Flow Distributions from Sampled Flow Statistics. In SIGCOMM.
[] C. Estan, K. Keys, D. Moore, and G. Varghese. Building a Better NetFlow. In SIGCOMM.
[] S. Ghemawat, H. Gobioff, and S.-T. Leung. The Google File System. In SOSP.
[] A. Greenberg, N. Jain, S. Kandula, C. Kim, P. Lahiri, D. Maltz, P. Patel, and S. Sengupta. VL2: A Scalable and Flexible Data Center Network. In ACM SIGCOMM.
[] S. Guha, J. Chandrashekar, N. Taft, and K. Papagiannaki. How Healthy are Today's Enterprise Networks? In IMC.
[] A. Gunnar, M. Johansson, and T. Telkamp. Traffic Matrix Estimation on a Large IP Backbone - A Comparison on Real Data. In IMC.
[] L. Huang, X. Nguyen, M. Garofalakis, J. Hellerstein, M. Jordan, M. Joseph, and N. Taft. Communication-Efficient Online Detection of Network-Wide Anomalies. In INFOCOM.
[] IETF Working Group on IP Flow Information Export (ipfix). http://www.ietf.org/html.charters/ipfix-charter.html.
[] M. Isard, M. Budiu, Y. Yu, A. Birrell, and D. Fetterly. Dryad: Distributed Data-Parallel Programs from Sequential Building Blocks. In EuroSys.
[] T. Karagiannis, R. Mortier, and A. Rowstron. Network Exception Handlers: Host-Network Control in Enterprise Networks. In SIGCOMM.
[] M. Kodialam, T. V. Lakshman, and S. Sengupta. Efficient and Robust Routing of Highly Variable Traffic. In HotNets.
[] R. Kompella and C. Estan. The Power of Slicing in Internet Flow Measurement. In IMC.
[] C. Olston, B. Reed, U. Srivastava, R. Kumar, and A. Tomkins. Pig Latin: A Not-So-Foreign Language for Data Processing. In SIGMOD.
[] R. Pang, M. Allman, M. Bennett, J. Lee, V. Paxson, and B. Tierney. A First Look at Modern Enterprise Traffic. In IMC.
[] IETF Packet Sampling Working Group (psamp). http://tools.ietf.org/wg/psamp/.
[] S. Kandula, D. Katabi, S. Sinha, and A. Berger. Dynamic Load Balancing Without Packet Reordering. In CCR.
[] sFlow.org. Making the Network Visible. http://www.sflow.org.
[] A. Soule, A. Lakhina, N. Taft, K. Papagiannaki, K. Salamatian, A. Nucci, M. Crovella, and C. Diot. Traffic Matrices: Balancing Measurements, Inference and Modeling. In ACM SIGMETRICS.
[] V. Vasudevan, A. Phanishayee, H. Shah, E. Krevat, D. Andersen, G. Ganger, G. Gibson, and B. Mueller. Safe and Effective Fine-grained TCP Retransmissions for Datacenter Communication. In SIGCOMM.
[] Y. Zhang, Z. Ge, A. Greenberg, and M. Roughan. Network Anomography. In IMC.
[] Y. Zhang, M. Roughan, N. C. Duffield, and A. Greenberg. Fast Accurate Computation of Large-Scale IP Traffic Matrices from Link Loads. In ACM SIGMETRICS.
[] R. Zhang-Shen and N. McKeown. Designing a Predictable Internet Backbone Network. In HotNets.
