
Sketch Guided Sampling: Using On-Line Estimates of Flow Size for Adaptive Data Collection

Abhishek Kumar, Jun (Jim) Xu

College of Computing, Georgia Institute of Technology

{akumar,jx}@cc.gatech.edu

Abstract: Monitoring the traffic in high-speed networks is a data-intensive problem. Uniform packet sampling is the most popular technique for reducing the amount of data the network monitoring hardware/software has to process. However, uniform sampling captures far less information than can potentially be obtained with the same overall sampling rate. This is because uniform sampling (unnecessarily) draws the vast majority of samples from large flows, and very few from small and medium flows. This information loss on small and medium flows significantly affects the accuracy of the estimation of various network statistics.

In this work, we develop a new packet sampling methodology called "sketch-guided sampling" (SGS), which offers better statistics than obtainable from uniform sampling, given the same number of raw samples gathered. Its main idea is to make the probability with which an incoming packet is sampled a decreasing sampling function $f$ of the size of the flow the packet belongs to. This way our scheme is able to significantly increase the packet sampling rate of the small and medium flows at slight expense of the large flows, resulting in much more accurate estimations of various network statistics. However, the exact sizes of all flows are available only if we keep per-flow information for every flow, which is prohibitively expensive for high-speed links. Our SGS scheme solves this problem by using a small (lossy) synopsis data structure called a counting sketch to encode the approximate sizes of all flows. Our evaluation on real-world Internet traffic traces shows that our sampling theory based on the approximate flow size estimates from the counting sketch works almost as well as if we knew the exact sizes of the flows.

I. INTRODUCTION

Accurate measurement and monitoring of data traffic traversing network links or routers is critical for network management and operation. For example, per-flow traffic measurement [1] is critical for network usage accounting, resource planning, traffic engineering, pricing and billing, and anomaly detection; flow size distribution [2] information can help detect an event (e.g., link failure) that causes a transition of the global network dynamics (e.g., traffic re-routing). With the rapid growth of the Internet, network link speeds have become faster every year to accommodate more Internet users. Accurate monitoring of the traffic on such high-speed links is a challenging problem. For example, on a 40 Gbps (OC-768) link, the line card has a total of 25 ns to forward a packet¹, and per-packet monitoring functions have to finish well within this time frame. Traditional implementations of such monitoring functions involve a search and update to a per-flow state (typically organized as a hash table), and this state has to be stored in fast (yet expensive) SRAM to keep up with the link speed. However, since this state can be very large (say hundreds of megabytes) at high link speeds, storing it in SRAM can be prohibitively expensive.

Packet sampling is the de facto technology for reducing the amount of data that has to be processed when monitoring high-speed links and routers. Since its initial use at the monitoring infrastructure of NSFNET's T1 backbone in September 1991 [3], packet sampling has found increasing acceptance in network monitoring applications, and is currently supported by equipment vendors (e.g., Cisco's NetFlow [4]). By far its simplest form is uniform packet sampling, in which each packet is sampled independently with a fixed probability $p$, and only sampled packets trigger an update to the flow records. Deployed products, such as Cisco's NetFlow, actually use a slight variant, namely periodic packet sampling (sampling every $1/p$-th packet), for ease of implementation. An advantage of uniform or periodic packet sampling is that it guarantees a reduction of the traffic by a factor of $1/p$ at both long and short time scales, which is very important when the router CPU that processes sampled traffic (e.g., the hash table of flow records in NetFlow) operates under hard resource constraints².

¹ We assume a conservative packet size of 1000 bits here.

² Flow sampling, to be discussed in Sec. II, is a more attractive alternative to packet sampling for certain applications, but it does not guarantee the traffic reduction ratio in the way packet sampling does.


While uniform (or periodic) sampling is very straightforward to implement, we find that it captures far less information than can potentially be obtained with the same overall sampling rate $p$. This is because uniform sampling devotes resources in proportion to the size of a flow. However, the information gained from sample collection in this manner does not grow in proportion to the flow size. For example, sampling 1,000 packets from a flow of size 10,000 clearly yields far less information than sampling one packet each from 1,000 flows of size 1, while consuming the same amount of sampling resource (processing 1,000 sampled packets). This problem is exacerbated in Internet monitoring by the Zipfian nature of Internet traffic: there are very few large flows, but they contain the vast majority of the packets and therefore consume most of the sampling resource. Since sampled packets will be processed by a flow table, which usually resides in relatively slower DRAM, to generate flow records as in Cisco's NetFlow, only a very low packet sampling rate (say 1/500) can be supported on high-speed links. Such a low sampling rate implies that the sampled flow records capture most of the large flows, but miss a majority of small and medium flows. This information loss on small and medium flows significantly affects the accuracy of the estimation of various network statistics, such as flow size distribution and per-flow traffic, over the sampled traffic. With uniform packet sampling, it is hard to improve this situation, since even when we increase the sampling rate, most of the additional updates hit the large flows, doing little to increase the number of small and medium-sized flows sampled.
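A minimal sketch of why low-rate uniform sampling misses small flows: if every packet is kept independently with probability p, a flow of n packets leaves at least one sample with probability 1 - (1 - p)^n. The function name and the flow sizes below are illustrative assumptions; the rate 1/500 is the one quoted in the text.

```python
def detection_probability(flow_size: int, p: float) -> float:
    """Probability that at least one packet of a flow is sampled when
    every packet is kept independently with probability p."""
    return 1.0 - (1.0 - p) ** flow_size


if __name__ == "__main__":
    p = 1 / 500  # sampling rate quoted in the text
    for n in (1, 10, 100, 1000, 10000):  # illustrative flow sizes
        prob = detection_probability(n, p)
        print(f"flow size {n:>6}: seen with probability {prob:.4f}")
```

Under these assumptions a flow needs several hundred packets before it is more likely than not to appear in the sample at all.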

We realize that if we are able to sample packets in small and medium flows at a much higher rate than those in large flows, we will be able to gather richer and more useful information than with uniform sampling. This is not hard to do if we somehow know, when a packet arrives, the current size $n$ of the flow it belongs to; we can simply decrease the sampling rate of a flow's packets as the flow grows larger. In other words, we make the sampling rate of a packet a decreasing function (denoted as $f$) of its current flow size $n$. This way, we can significantly increase the packet sampling rate of the small and medium flows at slight expense of the large flows³, again thanks to the Zipfian nature of the Internet traffic. For example, suppose the largest 1% of the flows contain 90% of the Internet traffic. Then reducing the packet sampling rates of the top 1% flows by 12.5% will save us enough sampling resource to increase the sampling rate of the rest of the flows by 100%. One key contribution of this work is to develop a sampling theory as to how to determine the right sampling function $f$ so that the resulting sample allows for accurate estimation of various network statistics under a fixed sampling resource constraint.⁴

³ Note that the relative error of our estimates of the large flows won't suffer significantly since the denominator is large.
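The arithmetic behind the 1% / 90% example can be checked directly. As a sketch, let $N$ be the total number of packets and $p$ the uniform sampling rate, so the sampling budget is $pN$ expected samples; $N$ and $p$ are symbols used here only for illustration.

```latex
\begin{align*}
\text{samples freed from the top 1\% of flows} &= 0.125 \times p \times 0.9N = 0.1125\,pN,\\
\text{extra samples for the remaining flows} &= (2p - p) \times 0.1N = 0.1\,pN,\\
\text{new budget} &= 0.875\,p \times 0.9N + 2p \times 0.1N = 0.9875\,pN \le pN.
\end{align*}
```

The samples freed from the large flows (0.1125 pN) exceed the extra samples needed to double the rate on everything else (0.1 pN), so the overall budget is not exceeded.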

Now the question is how we know, when a packet arrives, the exact size of the flow that the packet belongs to. Such exact information is available only if we keep per-flow information for every flow and update the corresponding flow entry for every incoming packet, but as we mentioned before, this is not practical on high-speed links. We develop a novel packet sampling methodology called "sketch-guided sampling" that circumvents this problem. Our main idea is to use a small (lossy) synopsis data structure (called a counting sketch) to encode the approximate sizes of all flows. This data structure is small enough to fit in fast SRAM, so it is able to process all the packets (without any sampling). For each incoming packet, we first look up the data structure to determine the approximate size $\hat{n}$ of the flow it belongs to, and then sample it with probability $f(\hat{n})$. The estimation error of $\hat{n}$ may impact the precision of estimates based on our sampling theory, but the estimates remain unbiased. Our evaluation shows that our revised sampling theory based on $\hat{n}$ works almost as well as the perfect case based on $n$.

Our SGS methodology will allow us to achieve more accurate estimations of many different network statistics than uniform sampling. In this work, we focus on three of them: (a) per-flow usage accounting, (b) identification of the set of medium to large flows and estimation of their size, and (c) estimation of the distribution of flow sizes in arbitrary subpopulations of the traffic [5]. Our choice is based on the observation that uniform sampling leads to poor estimation of these network statistics. The problem and applications of estimating each of these three statistics have been studied independently, and solutions based on sampling and data streaming techniques have been proposed. In this work, our focus is on demonstrating the utility of the SGS methodology and evaluating its impact on the accuracy of existing techniques.

The rest of this paper is organized as follows. Section II provides a brief survey of the work on sampling techniques with application to network monitoring, and identifies our contributions in relation to this work. Section III gives a high-level description of the proposed mechanism and develops the mathematical machinery used in our design of various sampling functions, followed by a description of the counting sketch (streaming data structure) we propose to use for maintaining approximate per-flow state. Section IV develops three specific applications based on the proposed framework and demonstrates the resulting improvements in estimation accuracy. We conclude in Section V.

⁴ Note that we are not the first to propose the concept of size-dependent sampling. Our work differs from existing work in that it is the only work that allows the sampling decision to be made online; all other sampling work makes the decision off-line on flow records that have already been collected, with or without sampling, and therefore has more information available and is able to achieve stronger properties.

II. BACKGROUND AND RELATED WORK

Solutions for network monitoring using data from a sample of packets typically use the following architecture. A monitoring point tapping into the live packet stream uses a selection process to pick a small sample of these packets and hands them off to a reporting process that aggregates them, typically into flow records, and exports these records using an information export protocol. An independent monitoring station collects this exported information and makes it available to various applications that use this sampled data to infer specific characteristics of the monitored traffic.

The Packet Sampling (PSAMP) working group at IETF is chartered to define a standard set of capabilities for network elements to sample subsets of packets by statistical and other methods [6], [7], [8]. Similarly, the IP Flow Information Export (IPFIX) working group has the specific goals of defining the notion of a "standard IP flow", devising data encodings that support analysis at various levels of aggregation, and considering the notion of IP flow information export based upon packet sampling [9], [10], [11]. The selection process in this architecture, as defined by the PSAMP working group, provides a detailed description of various sampling and filtering techniques that can be used to decide which packets are selected for reporting. The IPFIX working group specifies a detailed architecture for the reporting process that contains rules for encoding the observed packets into Flow Records and packetizing the selected flow records into IPFIX messages that are then exported to a Collector.

In their work on Building a Better NetFlow, Estan et al. [12] identify the major shortcomings of NetFlow, which include the large variability in the number of flows, and the mismatch between flow termination heuristics and the analysis. They propose the use of epoch-based termination of flows and adaptive sampling to overcome these problems. Their work, however, does not fundamentally change the nature of the uniform (or near-uniform) packet sampling in NetFlow. Hohn and Veitch [13] discuss the inaccuracy of estimating flow distribution from sampled traffic, when the sampling is performed at the packet level, and show that sampling at the flow level leads to more accurate estimations.

The idea of size-dependent flow sampling has been proposed and studied from various perspectives by Duffield, Lund and Thorup [14], [15], [16]. In general they solve the following problem. Suppose the monitoring device has collected a set of NetFlow-like flow records, and after a certain period of time, due to the storage constraint, only a small percentage of such records can be kept. Then which records should be kept and which should be discarded? The solutions proposed in [14], [15], [16] are to sample (and keep) flows of different sizes with different probabilities. Clearly, different variants of solutions are called for under different resource constraints and operating conditions. In [14], they study how to sample from (unsampled) NetFlow records so that estimations made on the sampled flow records are unbiased and have small MSE, under soft and hard resource constraints respectively. This is extended in [15] to the case of sampling from NetFlow records constructed from uniformly sampled packets. Our work is fundamentally different from all their approaches in that we need to make an online decision on whether or not to sample a packet, while their approach makes an off-line decision on whether or not to sample a flow record that has already been collected and stored. In addition, since we are performing packet sampling and their works are performing flow sampling, our sampling objectives and corresponding mathematical analysis are fundamentally different from theirs.

Flow sampling is used for sampling the same set of flows (i.e., consistently) across multiple network elements [13]. In flow sampling, a flow is chosen uniformly at random with a probability $p$. All packets in the sampled flow will be sampled. Flow sampling can be implemented using a hash function by sampling all flows whose flow label (source and destination IP addresses and port numbers) is hashed to a particular set of values. Flow sampling is shown to be a better alternative to packet sampling for estimating certain network statistics such as flow size distribution [13]. However, sampling a flow with probability $p$ in general cannot guarantee a reduction by a factor of $1/p$ at any time scale, because at the large time scale elephant flows may be disproportionately sampled, and at the small time scale bursty flows that consume the majority of the link rate may be disproportionately sampled. This makes flow sampling unsuitable for applications that operate under a hard resource constraint (as in [16]).
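The hash-based implementation of flow sampling described above can be sketched as follows. This is an illustration under stated assumptions, not the mechanism of any particular router: SHA-256 stands in for whatever fast hash a real device would use, and the function name and the tuple layout of the flow label are hypothetical.

```python
import hashlib


def flow_is_sampled(flow_label: tuple, p: float) -> bool:
    """Hash-based flow sampling: keep a flow iff its label hashes into the
    lowest fraction p of the 64-bit hash range. Every packet of a flow
    carries the same label, so a flow is kept or dropped as a whole, and
    every monitor using the same hash makes the same choice."""
    digest = hashlib.sha256(repr(flow_label).encode()).digest()
    value = int.from_bytes(digest[:8], "big")  # first 64 bits of the hash
    return value < p * 2**64


if __name__ == "__main__":
    # Hypothetical label: (source IP, destination IP, source port, destination port).
    label = ("10.0.0.1", "10.0.0.2", 1234, 80)
    print(flow_is_sampled(label, 0.1))
```

Because the decision depends only on the label, a single large flow that happens to hash into the kept set contributes all of its packets, which is why the traffic reduction factor is not guaranteed.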

Several mechanisms that use sketches to monitor specific properties of the network traffic have been proposed recently. Sketches are extremely simple data structures that maintain approximate information and can be updated at line speeds. Examples where sketches have been used to guide the sampling decision for individual packets include a multistage filter to detect large flows, designed by Estan and Varghese [17], and an automated worm fingerprinting mechanism that classifies content segments that are seen to be coming from multiple sources and going to multiple destinations as possible worm signatures [18]. Such solutions typically use a sketch as a high-speed counting device, and select packets with dominant counts in such sketches for further processing. These form a starting point for the streaming guided sampling techniques that we explore in this paper.

[Figure 1: block diagram. Per-packet operations: the packet stream feeds the Counting Sketch, which passes the estimated flow size to the Sampling Process; sampled packets update the Flow Table, whose flow records feed the applications (Usage Accounting, Elephant Detection, SubFSD Estimation, Other Applications).]

Fig. 1. Architecture for using sketches to guide packet sampling.

III. DESIGN OF SKETCH-GUIDED SAMPLING

As we described before, uniform sampling is neither accurate nor efficient for the estimation of various network statistics because it collects the vast majority of samples from the few large flows while skipping most of the small and medium flows. In this section, we describe our sketch-guided sampling (SGS) scheme that aims at improving the sampling rate of small and medium flows at slight expense of large flows. The basic idea of SGS at the high level is quite simple and its overall architecture is shown in Figure 1. Upon the arrival of each packet, the SGS scheme tries to determine the current size of the flow that the packet belongs to, from the counting sketch module. The estimated flow size $\hat{n}$ is fed to the sampling process, which will sample the packet with probability $f(\hat{n})$. The sampled packets will be used to update the flow table. Here $f$ is the aforementioned (size-dependent) sampling function. The (sampled) flow records and sometimes the counting sketch will be used for estimating various network statistics such as per-flow traffic, identification of large flows, and the subpopulation flow size distribution (SubFSD).

As we discussed earlier, intuitively, $f$ should be a decreasing function of $n$ so that higher sampling rates can be applied to small and medium flows, but what kind of function should $f$ be exactly? In Sec. III-A, we develop a sampling theory that yields a family of functions $f$ that can be useful for the accurate estimation of various network statistics, assuming perfect estimation of the current flow size $n$ by the counting sketch. We also study how to design a robust $f$ such that the estimation error on the flow size $n$ will not significantly impact the statistics estimated from the sampled packets. Then in Sec. III-B, we describe the design of our counting sketch, which can operate at full line speed and provide fairly accurate flow size estimation to the sampling module.

A. Sampling theory behind the SGS design

In this section, we derive a family of sampling functions, based on which the sampling module can collect packet samples that possess certain desirable statistical properties. We motivate this study by first examining the statistical properties of the packet samples collected by uniform sampling. Suppose in uniform sampling, each packet in a flow of size $n$ is sampled with probability $p$. Then each sampled packet represents $1/p$ packets in the original stream, on average. Thus, if the number of sampled packets is $c$, an unbiased estimator for the overall flow size is $\hat{n} = c/p$. The variance of this estimator can be shown to be $n(1-p)/p$, which is equal to its mean square error (MSE)⁵ since $\hat{n}$ is unbiased. The root mean square error (RMSE) of this estimator is $\sqrt{n(1-p)/p}$, which clearly grows much more slowly than the flow size $n$. In other words, the relative error in estimates decreases at rate $O(1/\sqrt{n})$ as the actual size of the flow $n$ increases. While this might be desirable for certain applications, one can envisage others that would desire different growth functions of RMSE.
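The variance quoted above follows from a short calculation. As a sketch, write $c$ for the number of packets sampled from a flow of $n$ packets, each kept independently with probability $p$, so that $c$ is binomially distributed with parameters $n$ and $p$:

```latex
\begin{align*}
E[c/p] &= \frac{np}{p} = n, \\
\mathrm{Var}[c/p] &= \frac{np(1-p)}{p^2} = \frac{n(1-p)}{p}, \\
\frac{\sqrt{\mathrm{Var}[c/p]}}{n} &= \sqrt{\frac{1-p}{np}} = O\!\left(1/\sqrt{n}\right).
\end{align*}
```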

1) The constant relative error case: In many applications such as per-flow traffic accounting, we often would like the aforementioned RMSE to grow linearly with the actual flow size. In other words, we would like the average estimation error to be $\varepsilon n$ (or $O(n)$) for a flow of size $n$, which corresponds exactly to constant relative error. Constant relative error is desirable in traffic accounting because the average percentage of overcharge/undercharge remains the same irrespective of actual usage. We show next that this can be achieved by choosing a proper sampling function $f$.

⁵ The MSE of the estimator $\hat{\theta}$ of a parameter $\theta$ is defined as $E[(\hat{\theta} - \theta)^2]$.

We temporarily assume that our counting sketch provides perfect estimation of the flow sizes. Now given a flow of size $n$, recall that in our SGS scheme, its first, second, ..., and $n$-th packets will be sampled with probability $f(1)$, $f(2)$, ..., and $f(n)$ respectively. As discussed before, $f$ is in general a decreasing function, i.e., $f(1) \ge f(2) \ge \cdots \ge f(n)$. Suppose $c$ packets are sampled from a flow with probability $p_1$, $p_2$, ..., $p_c$ respectively. Then the total number of packets in this flow can be estimated as $\hat{n} = \sum_{i=1}^{c} 1/p_i$. The RHS of the estimator is equivalent to $\sum_{i=1}^{n} X_i / f(i)$, where the $X_i$ are mutually independent indicator random variables taking the value 1 if packet $i$ is sampled and zero otherwise. The variance of the estimator $\hat{n}$ is given by the sum of the variances of the individual terms in the summation, due to their independence. Thus, $\mathrm{Var}[\hat{n}] = \sum_{i=1}^{n} \mathrm{Var}[X_i / f(i)]$. Here, $\mathrm{Var}[X_i / f(i)] = (1 - f(i))/f(i)$, which gives:

$$\mathrm{Var}[\hat{n}] = \sum_{i=1}^{n} \frac{1 - f(i)}{f(i)} \qquad (1)$$

Now, using the property that $f(n) \le f(i)$ for all $i \le n$, we can obtain a bound on the variance of the estimator as:

$$\mathrm{Var}[\hat{n}] \le \frac{n\,(1 - f(n))}{f(n)} \qquad (2)$$

To make sure that RMSE grows at most linearly (i.e., relative error no more than a constant $\varepsilon$), we can simply make the RHS of Eqn. 2 equal to $(\varepsilon n)^2$, which gives $f(n) = 1/(1 + \varepsilon^2 n)$.
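The relation between the exact variance and its bound can be checked numerically. The following is a minimal sketch, assuming the sampling function f(n) = 1/(1 + eps^2 n), where eps is the target relative error; the function names and the value eps = 0.1 are illustrative assumptions, not taken from the paper.

```python
def f(n: int, eps: float) -> float:
    """Sampling function for constant relative error: 1 / (1 + eps^2 * n)."""
    return 1.0 / (1.0 + eps * eps * n)


def exact_variance(n: int, eps: float) -> float:
    """Exact variance: sum over packets i = 1..n of (1 - f(i)) / f(i)."""
    return sum((1.0 - f(i, eps)) / f(i, eps) for i in range(1, n + 1))


def variance_bound(n: int, eps: float) -> float:
    """Upper bound: n * (1 - f(n)) / f(n), which equals (eps * n)^2 for this f."""
    return n * (1.0 - f(n, eps)) / f(n, eps)


if __name__ == "__main__":
    eps = 0.1  # illustrative target relative error
    for n in (1, 10, 1000):
        print(n, exact_variance(n, eps), variance_bound(n, eps))
```

For every flow size the exact variance stays at or below the bound, so a sampling function designed against the bound never exceeds the target relative error.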

Readers may feel that the relaxation in equation 2 is unnecessary. We can make the variance (MSE) of $\hat{n}$, which is $\sum_{i=1}^{n} (1 - g(i))/g(i)$ when using $g$ as the sampling function, exactly equal to $(\varepsilon n)^2$. One can easily verify that this can be achieved by using the following sampling function: $g(i) = 1/(1 + \varepsilon^2 (2i - 1))$, for $i = 1, 2, \ldots, n$. This optimization however is not recommended for two reasons. First, here we are assuming the perfect estimation of flow size by the counting sketch, which in reality will have estimation error. We find that its impact on the sampling operation using $g$ is much higher than that using $f$, because the aforementioned relaxation in $f$ provides robustness against this estimation error. Second, suppose that the probabilities with which the $c$ sampled packets were sampled are $p_1 \ge p_2 \ge \cdots \ge p_c$. Some applications (e.g., SubFSD, to be discussed in Sec. IV-C) can only make accurate inferences on a packet sample in which each packet is sampled with equal probability, which can be achieved by resampling these packets. In this case, the maximum sampling rate we can achieve after resampling is $p_c$. Using sampling function $f$ guarantees that the packet sample after resampling will not have RMSE larger than $\varepsilon n$, but using $g$ cannot guarantee that. Therefore, although for the purpose of per-flow traffic accounting it makes sense to use $g$, we adopt function $f$ because it is applicable to a much wider spectrum of applications.
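The claim that the tighter sampling function meets the variance target exactly can be verified in one line. The sketch below assumes the closed form $g(i) = 1/(1+\varepsilon^2(2i-1))$, where $\varepsilon$ is the target relative error and $i$ indexes the packets of a flow of size $n$; it uses the identity $\sum_{i=1}^{n}(2i-1) = n^2$.

```latex
\[
\sum_{i=1}^{n} \frac{1-g(i)}{g(i)}
  = \sum_{i=1}^{n} \varepsilon^2 (2i-1)
  = \varepsilon^2 n^2
  = (\varepsilon n)^2 .
\]
```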

2) Generalizing to $O(n^\alpha)$ RMSE: We have just developed the sampling theory for achieving linear RMSE (i.e., $O(n)$ RMSE). In some other applications, however, one may want to have RMSE of the form $O(n^\alpha)$, where $1/2 \le \alpha \le 1$ is a tunable constant. Note that the aforementioned uniform sampling and constant relative error correspond to the cases $\alpha = 1/2$ and $\alpha = 1$ respectively. To make RMSE grow at rate $O(n^\alpha)$, we let $\sqrt{n(1 - f(n))/f(n)}$ be equal to $\varepsilon n^\alpha$ for $n = 1, 2, \ldots$, which gives $f(n) = 1/(1 + \varepsilon^2 n^{2\alpha - 1})$. For $\alpha = 1/2$, we have $f(n) = 1/(\varepsilon^2 + 1)$, which is a constant as expected, since this is the case of uniform sampling.
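The generalized family can be sketched in a few lines. The code assumes the form f(n) = 1/(1 + eps^2 n^(2 alpha - 1)), with eps the constant in the RMSE target and alpha the growth exponent; the function names and the values of eps and alpha below are illustrative assumptions.

```python
import math


def f_alpha(n: int, eps: float, alpha: float) -> float:
    """Sampling probability f(n) = 1 / (1 + eps^2 * n^(2*alpha - 1))."""
    return 1.0 / (1.0 + eps * eps * n ** (2.0 * alpha - 1.0))


def relative_rmse_bound(n: int, eps: float, alpha: float) -> float:
    """Bound on relative RMSE, sqrt(n * (1 - f) / f) / n = eps * n^(alpha - 1)."""
    f = f_alpha(n, eps, alpha)
    return math.sqrt(n * (1.0 - f) / f) / n


if __name__ == "__main__":
    eps = 0.1  # illustrative value
    for alpha in (0.5, 0.75, 1.0):
        row = [round(relative_rmse_bound(n, eps, alpha), 4) for n in (1, 100, 10000)]
        print(f"alpha={alpha}: {row}")
```

With alpha = 1/2 the probability no longer depends on n (uniform sampling), and with alpha = 1 the bound on relative RMSE is flat at eps.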

The bound in equation 2 is tight for $\alpha = 1/2$ (the case of uniform sampling). For $\alpha = 1$, which corresponds to the case of constant relative error, the upper bound given by equation 2 is not tight for all values of $n$. Figure 2 shows the variance and standard error of guided sampling with $\alpha = 1$ and a fixed $\varepsilon$. The upper bound and the exact values are computed using equations 2 and 1, respectively. Figure 2(b) shows that the exact value of the standard error, calculated from equation 1, is about 30% below the upper bound given by equation 2. This slack is due to the approximation used to replace the summation of equation 1 by the closed form in equation 2.
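The size of this slack can be worked out in closed form. As a sketch, take the constant-relative-error sampling function $f(i) = 1/(1+\varepsilon^2 i)$, so that each term of equation 1 is $(1-f(i))/f(i) = \varepsilon^2 i$:

```latex
\begin{align*}
\text{exact (eqn. 1):}\quad \mathrm{Var}[\hat{n}] &= \sum_{i=1}^{n} \varepsilon^2 i = \frac{\varepsilon^2 n(n+1)}{2},\\
\text{bound (eqn. 2):}\quad \mathrm{Var}[\hat{n}] &\le \varepsilon^2 n^2,\\
\frac{\text{exact standard error}}{\text{bound on standard error}} &= \sqrt{\frac{n+1}{2n}} \;\longrightarrow\; \frac{1}{\sqrt{2}} \approx 0.71 \quad (n \to \infty).
\end{align*}
```

For $n = 1$ the two coincide; for large flows the exact standard error settles roughly 30% below the bound.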

B. Implementation using Sketches

Our sampling theory establishes a family of guided sampling algorithms where the probability with which a packet is sampled depends on the size of the corresponding flow at that point in time. However, maintaining exact information about every flow seen at a monitoring point is prohibitively expensive at high speed. We would like to relax this requirement of knowing the exact flow size, trading it for an approximate estimate that is relatively easier to obtain. We now present such a solution, which uses a synopsis data structure from a family known as counting sketches to maintain approximate estimates of per-flow size.

Since this counting sketch needs to process each and every packet, for it to operate at link speeds such as OC-192 and OC-768, each operation has to be performed within tens of nanoseconds. After evaluating a number of possible candidates, we choose a very simple synopsis data structure, an array of counters indexed by a hash function, to track per-flow sizes approximately. The primary reasons for picking this solution over more sophisticated alternatives such as the Space-Code Bloom Filter [19] were its extreme simplicity and efficiency, which allow us to support very high link speeds [20], [5].

[Figure 2: two log-log plots against flow size (1 to 1e+06), each with an "upper bound" curve and an "exact" curve. (a) Variance of estimates, y-axis Var(n). (b) Standard error.]

Fig. 2. Variance and standard error of guided sampling, with $\alpha = 1$ and a fixed $\varepsilon$.

Every counter in this array is initialized to 0 at the beginning of a measurement epoch. The update and estimation operations at each packet arrival are performed as follows. Upon arrival of a packet at the measurement point, its flow label⁶ is hashed to generate an index into this array, and the counter at this index is incremented by 1. The updated value of the counter is passed on to the selection process as the estimated size of the corresponding flow. Collisions due to hashing might cause two or more flow labels to be hashed to the same index. The value of the counter at such indices would represent the total number of packets belonging to all of the flows hashing to the index. We do not have any explicit mechanisms to handle collisions, as any such mechanism would impose additional processing and storage overheads that are unsustainable at high speeds. This makes the encoding process very simple and fast, but introduces some estimation errors as discussed in the next section. Efficient implementations of hash functions [21] allow the online streaming module to operate at speeds as high as OC-768 without missing any packets.
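The update-and-estimate step, combined with the sampling decision it feeds, can be sketched in software as follows. This is an illustration under stated assumptions rather than the hardware design: CRC32 from Python's zlib stands in for the hash function, a dictionary stands in for the flow table, the table accumulates 1/p per sampled packet so that it directly holds the size estimate, and the sampling function is taken to be 1/(1 + eps^2 * size). All names are hypothetical.

```python
import random
import zlib
from collections import defaultdict


class CountingSketch:
    """Array of counters indexed by a hash of the flow label (collisions unhandled)."""

    def __init__(self, num_counters: int):
        self.counters = [0] * num_counters

    def update(self, flow_label: bytes) -> int:
        """Increment the flow's counter and return it as the flow-size estimate."""
        index = zlib.crc32(flow_label) % len(self.counters)
        self.counters[index] += 1
        return self.counters[index]


def sampling_probability(size_estimate: int, eps: float) -> float:
    """Assumed sampling function: 1 / (1 + eps^2 * size_estimate)."""
    return 1.0 / (1.0 + eps * eps * size_estimate)


def sketch_guided_sample(packets, num_counters: int, eps: float, rng: random.Random) -> dict:
    """Process a stream of flow labels; return per-flow size estimates
    built only from the packets that were sampled."""
    sketch = CountingSketch(num_counters)
    estimates = defaultdict(float)  # stands in for the flow table
    for flow_label in packets:
        p = sampling_probability(sketch.update(flow_label), eps)
        if rng.random() < p:
            estimates[flow_label] += 1.0 / p  # a sample taken with prob. p counts for 1/p packets
    return dict(estimates)


if __name__ == "__main__":
    stream = [b"flow-A"] * 5000 + [b"flow-B"] * 50 + [b"flow-C"] * 5
    print(sketch_guided_sample(stream, num_counters=1024, eps=0.1, rng=random.Random(1)))
```

Each packet costs one hash, one counter read and write, and one random draw; only sampled packets touch the (slower) flow table.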

⁶ Our design does not place any constraints on the definition of flow label. It can be any combination of fields from the packet header.

This sketch data structure can be implemented using fast memory (SRAM) to keep up with high packet arrival rates. A naive implementation would simply place an array of 32-bit counters in SRAM. This should be practical for arrays of up to a million counters using 4 MB of commodity SRAM. Larger arrays can still be implemented using the techniques for efficient implementation of an array of counters proposed by Ramabhadran and Varghese [22]. The computational cost of each update is one hash function computation followed by one read and one write to SRAM. The estimate of flow size is nothing but the updated counter value, available at no extra computational cost. We refer the reader to [20] for a more detailed analysis of the computational and storage complexities of this sketch.

IV. APPLICATIONS

In this section, we apply our SGS methodology to the estimation of three types of network traffic statistics. They are: per-flow usage accounting, described in section IV-A; identification of the set of medium to large flows and estimation of their size, described in section IV-B; and estimation of the distribution of flow sizes in arbitrary subpopulations of the monitored traffic, described in section IV-C.

A. Usage accounting

Accounting for bandwidth usage is perhaps the most basic application of sampling in network monitoring. Per-flow traffic accounting has applications in usage-based charging/pricing, network anomaly detection, security, network planning, peering policy, customer acquisition and traffic engineering [15], [17]. Previous work on this application has focussed on the use of sampling at the collector of flow records. Specifically, previous work by Duffield et al. [14] assumes that the initial selection and reporting processes can process each packet, exporting flow records to a collector which samples and archives only a subset of these records to save storage space. Duffield et al. show that uniform sampling of flow records is not the best strategy, due to the heavy-tailed distribution of flow sizes, and propose a size-dependent sampling scheme that gives priority to larger flows. In subsequent work [15], Duffield et al. quantify the impact of uniform packet sampling on the accuracy of usage estimates in an operational measurement infrastructure. They show that if packets are sampled uniformly with probability $p$, then $\hat{n} = c/p$ is an unbiased estimator of the total number of packets in a flow, where $c$ is the number of samples from the flow. They also show that the standard error⁷, or relative RMSE, defined as $\sqrt{\mathrm{Var}[\hat{n}]}/n$, for this estimator is $\sqrt{(1-p)/(np)}$. Thus, the standard error of estimates based on uniform sampling reduces as the flow size $n$ increases. This is shown in figure 2(b) in the curve for $\alpha = 1/2$.

[Figure 3: three log-log scatter plots of estimated flow size (packets) against actual flow size (packets), each with the reference line y=x. (a) Uniform sampling. (b) Guided sampling, exact information. (c) Sketch Guided sampling, approximate information.]

Fig. 3. Actual and estimated flow sizes using different sampling techniques.

A property of uniform sampling is that most of the samples are collected from a small number of large flows, with increasing estimation accuracy for such flows. However, because the sampling "budget" is limited due to system resource constraints, the additional accuracy for large flows may be a waste of this budget. In other words, the downward-sloping shape of the curve for standard error (figure 2(b)) is fixed in uniform sampling, while applications might prefer other shapes such as a flat curve. The use of SGS provides this flexibility to network monitoring applications. In particular, guided sampling with $\alpha = 1$ provides flat standard error in estimation irrespective of the original flow size, as shown in figure 2(b).

⁷ We use the terms standard error and relative RMSE interchangeably.

Figure 3 shows the usage estimates generated from samples collected via various techniques from the same trace. The trace used in this and all the subsequent experiments was collected at the 1 Gbps access link connecting the UNC campus to the rest of the Internet on Thursday, April 24, 2003 at 11:00 am. This trace contains 27,507,496 packet headers from 1,133,150 flows constituting 500 seconds of traffic on the link. We obtained similar experimental results over publicly available traces from NLANR and USC that we omit in the interest of brevity.

As is common with Internet traffic, the flow size distribution in the trace used in Figure 3 is heavy-tailed, with a large number of small flows and a few large flows. Figure 3(a) shows estimates of flow size derived from uniform packet sampling with probability 1/10. The estimation accuracy is very low for small flows, and is low for medium-sized flows with 10 to 1000 packets too, but improves significantly for the very large flows. On the other hand, guided sampling with exact flow-size information provides a constant relative accuracy, as demonstrated by the clustering of all points in a narrow band along the $y = x$ line in figure 3(b). Due to the use of log-scale on both axes, this narrow band parallel to the $y = x$ line represents a region linear in size with respect to $x$, the value of the actual flow size plotted on the x-axis. The slight broadening of this band near $x = 1$ is due to the presence of an extremely large number of small flows in the trace, which ensures that even the extremely rare estimation errors are represented by a point.

Fig. 4. Experimentally determined standard error and analytical values using different sampling techniques. (a) Uniform sampling, with the analytical curve sqrt((1-p)/np). (b) Guided sampling, exact information, with the analytical curves “upper bound (eqn 1)” and “exact (eqn 2)”. (c) Sketch guided sampling, approximate information. All panels plot standard error on log-scale axes.

Replacing the assumption of knowledge of exact flow size in guided sampling with sketch-based approximate estimates in SGS results in some loss of accuracy, as shown in figure 3(c). The additional inaccuracy is mostly limited to small flows. It comes from hash collisions in the counting sketch, which cause the sketch to generate inaccurate flow-size estimates. Notice that hash collisions can only cause over-estimation of the sizes of the colliding flows. However, since SGS uses this estimate only to determine the sampling rate, the estimation through SGS remains unbiased. The over-estimation of flow sizes by the sketch does force SGS to pick a lower sampling rate, resulting in higher standard error. This impact is muted for all flows of medium and large sizes because SGS uses low sampling rates for such flows anyway, and the additional reduction in this sampling rate due to over-estimation of flow sizes by the sketch has negligible impact.
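The claim that over-estimation by the sketch costs variance but not bias can be checked with a small simulation. The following is only a sketch under stated assumptions: the sampling function is taken to be p(f) = 1/(1 + eps^2 f), each sampled packet is assumed to contribute 1/p to the estimate, and collisions are modeled crudely as a fixed number of `background` packets already present in the flow's counter. All function names are hypothetical.

```python
import random
import statistics


def sampling_prob(counter_value, eps=0.1):
    # Assumed decreasing sampling function: p(f) = 1 / (1 + eps^2 * f).
    return 1.0 / (1.0 + eps * eps * counter_value)


def sgs_estimate(flow_size, background, rng, eps=0.1):
    """Estimate one flow's size from guided samples.

    `background` models packets of colliding flows already counted in the
    same sketch counter, so the counter over-estimates this flow's size.
    """
    counter = background
    estimate = 0.0
    for _ in range(flow_size):
        p = sampling_prob(counter, eps)  # rate chosen from the inflated counter
        if rng.random() < p:
            estimate += 1.0 / p          # assumed estimator: a sample stands for 1/p packets
        counter += 1                     # the sketch counts every packet
    return estimate


def run_trials(flow_size, background, trials=2000, seed=1):
    # Repeat the experiment and report the mean and spread of the estimates.
    rng = random.Random(seed)
    estimates = [sgs_estimate(flow_size, background, rng) for _ in range(trials)]
    return statistics.mean(estimates), statistics.pstdev(estimates)


if __name__ == "__main__":
    for background in (0, 500):
        mean, std = run_trials(flow_size=50, background=background)
        print(f"background={background:4d}  mean={mean:6.2f}  std={std:5.2f}")
```

Under these assumptions, both runs should average close to the true size of 50, while the run with collisions shows a clearly larger spread.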

The standard error of estimation using uniform sampling, guided sampling with exact information, and sketch guided sampling using approximate information is shown in figure 4. To generate a sufficient number of points, the estimation experiments were repeated 100 times with different seeds. The experimental values of standard error for uniform sampling and guided sampling with exact information faithfully follow the respective analytical curves. On the other hand, the experimentally determined points for sketch guided sampling lie above the analytical upper bound for guided sampling with exact flow-size information (figure 4(c)). This increase in error is due to inaccuracies in the flow-size estimates provided by the counting sketch. However, these inaccuracies affect only the estimation of very small flows, and even in that region, the overall accuracy is much better than that provided by estimation based on uniform sampling.
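The analytical curve for uniform sampling, labeled sqrt((1-p)/np) in figure 4(a), follows from a standard binomial argument. As a sketch, assume a flow of $n$ packets, each sampled independently with probability $p$, and the usual scaled estimator $\hat{n} = X/p$, where $X$ is the number of sampled packets. The estimator and notation are assumptions made here for illustration.

```latex
\begin{align*}
X &\sim \mathrm{Binomial}(n, p), \qquad \hat{n} = \frac{X}{p}, \\
\mathbb{E}[\hat{n}] &= n, \qquad
\mathrm{Var}[\hat{n}] = \frac{n\,p\,(1-p)}{p^{2}} = \frac{n\,(1-p)}{p}, \\
\frac{\sqrt{\mathrm{Var}[\hat{n}]}}{n} &= \sqrt{\frac{1-p}{n\,p}}.
\end{align*}
```

For example, with $p = 1/10$ this evaluates to $3/\sqrt{n}$: a relative error of 3 for single-packet flows and 0.03 for flows of $10^4$ packets, which is the downward-sloping shape discussed above.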

B. Detecting Elephants

The previous example application, per-flow usage monitoring, is closely related to our second example application, using SGS to detect large flows (a.k.a. heavy hitters or elephants). The ability to detect large flows and estimate their sizes has applications similar to those of usage monitoring, such as anomaly detection, verification of compliance with service level agreements, and traffic engineering. In monitoring very high speed links, it is often necessary to ignore the small flows (or mice) and focus on the medium to large flows, since doing so significantly reduces the number of flows that need to be tracked, freeing up resources for tracking the larger flows. Due to the Zipfian nature of the distribution of flow sizes in Internet traffic, ignoring small flows in fact allows us to track all the medium and large flows with much higher accuracy than is attainable while tracking all flows.

Fig. 5. Sampling probabilities as a function of flow size for two different applications. (a) Usage accounting. (b) Elephant detection. Both panels plot sampling probability (0 to 1) against flow size on a log scale (1 to 1e+06).

Our SGS-based solution for estimating the sizes of medium and large flows (called “elephant detection”) is an adaptation of the solution for per-flow usage accounting. The crucial difference is that for elephant detection, we can ignore the small flows. This is implemented by setting a threshold and modifying the sampling function used in usage accounting, $p(f) = 1/(1 + \epsilon^2 f)$, in such a way that the sampling probability becomes zero for flows smaller than the threshold, but remains the same (as in usage accounting) for flows at or above the threshold. The flow size estimator is modified accordingly by incrementing the non-zero estimates of flow size by the threshold.
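The thresholded sampling function and the adjusted estimator described above might look roughly as follows. This is a minimal sketch, assuming exact flow sizes are available, a usage-accounting sampling function of the form p(f) = 1/(1 + eps^2 f), and an estimator in which each sampled packet contributes 1/p. The function names and the default eps are hypothetical.

```python
import random


def usage_prob(size_so_far, eps=0.1):
    # Assumed usage-accounting sampling function: p(f) = 1 / (1 + eps^2 * f).
    return 1.0 / (1.0 + eps * eps * size_so_far)


def elephant_prob(size_so_far, threshold, eps=0.1):
    # Zero below the threshold, unchanged at or above it.
    if size_so_far < threshold:
        return 0.0
    return usage_prob(size_so_far, eps)


def estimate_elephant(flow_size, threshold, rng, eps=0.1):
    """Simulate one flow and return its estimated size."""
    estimate = 0.0
    for size_so_far in range(flow_size):
        p = elephant_prob(size_so_far, threshold, eps)
        if p > 0.0 and rng.random() < p:
            estimate += 1.0 / p          # assumed estimator: a sample stands for 1/p packets
    if estimate > 0.0:
        estimate += threshold            # compensate for the ignored first packets
    return estimate


if __name__ == "__main__":
    rng = random.Random(7)
    print(estimate_elephant(5, threshold=10, rng=rng))     # flow below the threshold
    print(estimate_elephant(1000, threshold=10, rng=rng))  # flow well above it
```

A flow that never reaches the threshold is never sampled and keeps an estimate of zero, while for larger flows the added threshold offsets the packets that were skipped.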

Figure 5 shows the sampling rates as a function of the flow size for the two applications of usage accounting and elephant detection, respectively. The parameters used in both figures are �F�{�y 2$�©�lI, and the threshold is set to 10 packets for the curve in figure 5(b). The area below the curves in figure 5 indicates the number of samples collected for the corresponding applications. The total number of samples collected is the product of this value and the actual flow size distribution (histogram of the flow sizes), which is known to be heavy-tailed in the case of Internet traffic. By excluding the flows whose sizes are below the threshold, our solution collects a smaller number of samples from a much smaller number of flows. If desired, the saved resources can be used to increase the sampling rates for the medium and large flows, resulting in much higher estimation accuracy.

Although the additional inaccuracy introduced by ignoring the first few packets of a flow while its size is below the threshold can be substantial for flows whose sizes are just above the threshold, this effect diminishes rapidly and is negligible for flows well above the threshold. This is best demonstrated through the experimental results shown in figure 6. The analytical curves (“upper bound” and “exact”) are from equations 1 and 2 in the previous section (with no cutoff before the threshold). The errors in estimating the flow sizes are captured in the “black dust” curves. The high errors for flows smaller than the threshold can be attributed to the fact that our elephant detection mechanism is designed to avoid collecting samples from such flows. In practice, some of these flows are still sampled due to over-estimation of their sizes by the counting sketch (the case of hash collisions). There is a perceptible downward jump around the threshold as flows suddenly begin to be sampled by the elephant detection mechanism. The accuracy of the size estimates improves rapidly from this point on, until it approaches the analytical accuracy of usage accounting using guided sampling with exact flow size information (i.e., the ideal case).

Previous work by Estan and Varghese [17] for elephant detection used two alternative techniques for identifying potential elephants. Once a flow is identified as a potential elephant, an entry is created for it in fast memory (SRAM). Every packet that belongs to one of the identified potential elephants causes an update to the corresponding entry in fast memory. The two alternative techniques are sample and hold, where packets are sampled at a low uniform probability and an entry is created if one doesn’t already exist for the corresponding flow, and multistage filter, where multiple parallel count sketches with independent hash functions are used, and a flow is identified as a potential elephant if all the corresponding counters in the parallel sketches exceed some predetermined threshold value.
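The two detection techniques summarized above can be sketched roughly as follows. This is an illustrative simplification rather than the implementation from [17]: the class and function names, the use of salted BLAKE2b as a stand-in for independent hash functions, and the parameter values are all assumptions made here, and refinements described in [17] are omitted.

```python
import hashlib
import random


def stage_hash(flow_id, stage, buckets):
    # One hash function per stage, derived by salting BLAKE2b with the stage index.
    digest = hashlib.blake2b(flow_id.encode(), digest_size=8,
                             salt=stage.to_bytes(2, "big")).digest()
    return int.from_bytes(digest, "big") % buckets


class MultistageFilter:
    """Parallel counter arrays; a flow is flagged when all its counters exceed the threshold."""

    def __init__(self, stages, buckets, threshold):
        self.buckets = buckets
        self.threshold = threshold
        self.counters = [[0] * buckets for _ in range(stages)]

    def add(self, flow_id):
        """Count one packet and report whether the flow is now a potential elephant."""
        flagged = True
        for stage, row in enumerate(self.counters):
            index = stage_hash(flow_id, stage, self.buckets)
            row[index] += 1
            if row[index] <= self.threshold:
                flagged = False
        return flagged


def sample_and_hold(packets, prob, rng):
    """Return a flow table built by sampling packets and then holding their flows."""
    table = {}
    for flow_id in packets:
        if flow_id in table:
            table[flow_id] += 1          # held flow: every later packet updates the entry
        elif rng.random() < prob:
            table[flow_id] = 1           # first sampled packet creates the entry
    return table


if __name__ == "__main__":
    mf = MultistageFilter(stages=4, buckets=1024, threshold=100)
    flags = [mf.add("big-flow") for _ in range(150)]
    print(flags.index(True) + 1)         # packet at which the flow is first flagged
    print(sample_and_hold(["a", "b", "a", "a"], prob=0.5, rng=random.Random(3)))
```

Both variants update their table or counters on every packet of a tracked flow, which is the per-packet cost that the comparison with SGS turns on.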

The main difference between our scheme and that of Estan and Varghese [17] lies in our use of guided sampling even for the flows whose sizes are above the threshold, versus the deterministic per-packet update used by the latter. Our use of guided sampling provides constant relative error, with the possibility of picking other functions from the family of sampling functions introduced in section III. The use of per-packet updates in [17] means that the only errors in estimation are due to the initial detection algorithm, resulting in slightly better accuracy than our scheme. However, in our scheme, the number of updates to the (elephant) flow table stored in fast memory can be much smaller because of the sampling operation. For example, for the largest flow in the UNC trace, which has 83,097 packets, per-packet update would require slightly less than 83,097 updates to the fast memory. Our SGS solution, on the other hand, needs only 670 updates, with $\epsilon = 0.1$. This corresponds to two orders of magnitude fewer updates to the corresponding flow entry in the fast memory. This reduction allows our scheme to scale to much higher link speeds (or allows the use of slower and less expensive DRAM) for the following reason. In both schemes in [17], the flow entries have to be organized as a hash table in fast memory to allow for per-packet updates. The cost of storing these flow entries and updating them is much higher than that of the counting sketch used in our scheme. The flow table in our SGS scheme, on the other hand, can be stored in inexpensive DRAM because the number of updates it needs to process is two orders of magnitude smaller. Therefore, our SGS scheme offers an attractive and less costly alternative to [17] for tracking medium to large flows, at a slight cost in estimation accuracy.
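The figure of roughly 670 updates for the 83,097-packet flow can be sanity-checked with a short calculation. The sketch below assumes a sampling function of the form p(f) = 1/(1 + eps^2 f) with eps = 0.1 and exact flow-size information. Both the functional form and the helper names are assumptions for illustration.

```python
import math


def expected_updates(flow_size, eps):
    # Expected number of sampled packets: sum of p(i) over the packets of the flow,
    # where i is the flow size seen so far and p(i) = 1 / (1 + eps^2 * i) is assumed.
    return sum(1.0 / (1.0 + eps * eps * i) for i in range(flow_size))


def expected_updates_approx(flow_size, eps):
    # Closed-form approximation obtained by replacing the sum with an integral.
    return math.log(1.0 + eps * eps * flow_size) / (eps * eps)


if __name__ == "__main__":
    size = 83_097
    exact = expected_updates(size, eps=0.1)
    approx = expected_updates_approx(size, eps=0.1)
    print(f"expected updates: {exact:.0f} (integral approximation: {approx:.0f})")
    print(f"reduction factor versus per-packet updates: {size / exact:.0f}x")
```

Under these assumptions the expected number of updates grows only logarithmically with the flow size, which is consistent with the two-orders-of-magnitude reduction quoted above.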

C. Estimating the flow-size distribution

Fig. 6. Experimentally determined standard error for elephant detection using Sketch Guided sampling with various parameters. (a) threshold = 10. (b) threshold = 100. Both panels plot standard error on log-scale axes.

The final application of our SGS scheme is more accurate estimation of the Flow Size Distribution (FSD) of Internet traffic. The FSD is by far the most fundamental statistic, since its accurate estimation can enable the inference of many other statistics, such as the total number of flows and the average flow size. Furthermore, there are a number of applications where it would be useful to measure the flow size distribution of a particular subpopulation, i.e., a subset of the complete flow population. Such subpopulations may be defined by a traffic subtype (e.g., web or peer-to-peer traffic), by a source or destination IP address/prefix, by protocol (e.g., TCP or UDP), or some combination thereof. For example, to investigate a sudden slowdown in DNS lookups, a network operator may specify all flows with source or destination port 53 as the subpopulation of interest and query the data collected in the preceding hour.

Accurate solutions for usage accounting on a per-flow basis provide a naive solution to the problem of estimating the FSD: simply sort the estimated sizes of all flows to obtain a distribution. However, there are two reasons not to take this approach. First, one may wish to solve the easier problem of estimating the FSD without being forced to solve the harder problem of identifying all flows and estimating their sizes accurately. Second, by designing an independent solution for estimating the FSD, one might hope to achieve better estimation accuracy than that provided by the aforementioned naive solution. Indeed, previous work on estimating the FSD ([23], [13], [20], [5]) has shown that data collected through sampling or sketching can be used to estimate the FSD much more accurately than it can be used to estimate the size of individual flows. In particular, we have presented in [5] the design of an efficient mechanism to provide accurate estimates of the FSD for arbitrary subpopulations specified after the act of data collection. That solution uses two sources of data: a counting sketch, which is the same as in section III-B, and a variant of sampled NetFlow [4] that implements uniform packet sampling. It then uses an estimation algorithm to statistically correlate and decode both sources of data to accurately infer the flow size distribution for any subpopulation.
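For contrast with the approach of [5], the naive solution mentioned above, together with a subpopulation filter such as the DNS example, can be sketched in a few lines. The record layout, field names and predicate are hypothetical and chosen here only for illustration.

```python
from collections import Counter
from typing import NamedTuple


class FlowRecord(NamedTuple):
    # Hypothetical per-flow record produced by a usage-accounting solution.
    src_port: int
    dst_port: int
    est_size: int


def is_dns(flow):
    # Subpopulation predicate: source or destination port 53.
    return flow.src_port == 53 or flow.dst_port == 53


def naive_fsd(flows, in_subpop=lambda flow: True):
    """Histogram of estimated flow sizes over the selected subpopulation."""
    return Counter(flow.est_size for flow in flows if in_subpop(flow))


if __name__ == "__main__":
    flows = [
        FlowRecord(53, 40000, 2),
        FlowRecord(40001, 53, 2),
        FlowRecord(80, 40002, 7),
        FlowRecord(40003, 53, 5),
    ]
    for size, count in sorted(naive_fsd(flows, is_dns).items()):
        print(f"size {size}: {count} flow(s)")
```

This approach needs a size estimate for every individual flow, which is exactly the harder problem the dedicated FSD estimators try to avoid.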

Fig. 7. Estimates of FSD of http flows using various data sources. (a) Complete distribution. (b) Zoom in to show impact on small flows. Both panels plot frequency against flow size on log-scale axes; the curves are Actual distribution, Uniform Sampling, Sketch + Uniform Sampling, and Sketch + SGS.

Fig. 8. Estimates of FSD of https flows using various data sources. (a) Complete distribution. (b) Zoom in to show impact on small flows. Both panels plot frequency against flow size on log-scale axes.

In this section we replace the data gathered from uniform sampling, as used in our previous solution [5], with data gathered by SGS, to further improve the estimation accuracy of that solution. The estimation algorithm in [5] is an iterative Expectation Maximization (EM) algorithm that computes the Maximum Likelihood Estimate (MLE) of the distribution that would cause the observations collected at the monitoring point. Here, the observations are the values in the count sketch (array of counters) and the samples collected in the sampling module. Without repeating the details of this algorithm, we would like to point out that it was originally designed to work with uniform sampling. Since the packets in a flow are not uniformly sampled in our SGS scheme, we need to resample them into a locally uniform (defined next) sample to work with this algorithm. The resampling algorithm is quite simple: among all packets that were sampled and hashed to counter $i$, pick the sampling rate $q$ seen by the last sampled packet as the target sampling probability for counter $i$. Note that because the sampling probabilities are monotonically non-increasing, $q$ is the smallest sampling probability seen by any of the sampled packets hashed to counter $i$. Suppose the packets (in their arrival order) are sampled with probabilities $p_1$, $p_2$, ..., and $p_k = q$. Then our resampling process is simply to sample them with probabilities $q/p_1$, $q/p_2$, ..., $q/p_{k-1}$, and 1, respectively. The net effect is that the resulting packet sample is statistically equivalent to uniform sampling of all packets hashed to counter $i$ with probability $q$. Therefore, we refer to the resulting samples as locally uniform, where the locality of interest is the set of flows hashed to the same counter. With this resampling, the data collected by SGS can be readily used by the aforementioned EM algorithm in [5].
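The resampling step for a single counter can be sketched as follows. This is a minimal illustration, assuming each sample is stored together with the probability at which it was taken and that samples are kept in arrival order. The function name and the data layout are hypothetical.

```python
import random


def resample_locally_uniform(sampled, rng):
    """Thin the samples of one counter down to a common sampling probability.

    `sampled` is a list of (packet, p) pairs in arrival order, where p is the
    probability with which that packet was originally sampled.
    Returns the retained packets and the target probability q.
    """
    if not sampled:
        return [], 0.0
    q = sampled[-1][1]                   # rate seen by the last sampled packet
    kept = []
    for packet, p in sampled:
        # Keep with probability q / p, so the overall probability becomes q.
        # For the last packet q / p equals 1, so it is always retained.
        if rng.random() < q / p:
            kept.append(packet)
    return kept, q


if __name__ == "__main__":
    rng = random.Random(11)
    samples = [("pkt1", 1.0), ("pkt2", 0.5), ("pkt3", 0.25)]
    kept, q = resample_locally_uniform(samples, rng)
    print(q, kept)
```

Because the stored probabilities never increase over time, the ratio q / p never exceeds 1 and is therefore always a valid probability.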

Our SGS scheme significantly improves the estimation accuracy of subpopulation FSD for small flow sizes with fewer packet samples collected from the traffic. Figure 7 shows the estimates of the FSD of all flows from or to port 80, which corresponds to HTTP traffic, in the trace from UNC, computed using various methods. The actual distribution has a peak at 7 packets, which is completely missed by estimation using only the data from uniform sampling with the method of Duffield et al. [23]. Our previously proposed algorithm [5] detects this peak using the count sketch and uniform sampling simultaneously. However, this estimate is also somewhat inaccurate for the small flows, as can be seen from figure 7(b). Replacing the uniform sampling data with data generated through SGS provides the most accurate estimation for small flows without compromising the accuracy in any other region. Figure 8 shows similar improvements in accuracy for flows to or from port 443, which corresponds to HTTPS traffic. In both figures, the sampling rate for uniform sampling was �x�°�lI . The parameters for SGS used in this experiment were �x�±�y 2²��lI�³ . With these parameters, the number of packets sampled in SGS, before the resampling described above, was roughly the same as the number sampled using uniform sampling. After resampling, the total number of sampled packets in the SGS dataset was actually 30% less than in the dataset for uniform sampling.

V. CONCLUSIONS

Packet sampling is a basic ingredient of the architecture of network monitoring solutions. Recognizing that uniform packet sampling might not provide the best utilization of the scarce resources available for data collection at high speed network devices, we have developed a new sampling methodology that uses approximate information from a counting sketch to guide the sampling decision. This SGS methodology allows a designer to tailor the sampling rate as a function of flow size, providing greater flexibility in the allocation of resources. Through the development of three applications based on SGS data, we have demonstrated the advantages of using SGS in specific scenarios. In usage accounting, SGS allows for the design of estimation mechanisms that have constant accuracy irrespective of flow size. For elephant detection, it reduces the number of updates to high speed memory by up to two orders of magnitude by allowing the sampling rate to change with flow size, achieving the targeted accuracy while minimizing resource usage. Similarly, for the estimation of the flow size distribution of arbitrary subpopulations of traffic, we have demonstrated that the use of SGS improves the accuracy of our previously proposed algorithm [5] while collecting the same number of samples as uniform sampling. We have also discussed the implications of changing the size of the array of counters used in the count sketch and of changing the sampling function or its parameters in SGS. We believe that these applications of SGS are the first step in an exploration of the wide range of possibilities offered by the ability to guide the sampling decision on a per-packet basis.

REFERENCES

[1] A. Kumar, J. Xu, J. Wang, O. Spatscheck, and L. Li, “Space-Code Bloom Filter for Efficient per-flow Traffic Measurement,” in Proc. IEEE INFOCOM, Mar. 2004, extended abstract appeared in Proc. ACM IMC 2003.

[2] N. Duffield, C. Lund, and M. Thorup, “Estimating flow distributions from sampled flow statistics,” in Proc. ACM SIGCOMM, Aug. 2003.

[3] K. C. Claffy, G. C. Polyzos, and H.-W. Braun, “Application of sampling methodologies to network traffic characterization,” in SIGCOMM ’93: Conference proceedings on Communications architectures, protocols and applications. ACM Press, 1993, pp. 194–203.

[4] “Cisco NetFlow,” http://www.cisco.com/warp/public/732/Tech/netflow.

[5] A. Kumar, M. Sung, J. Xu, and E. W. Zegura, “A data streaming algorithm for estimating subpopulation flow size distribution,” in Proc. ACM SIGMETRICS, June 2005.

[6] “http://www.ietf.org/html.charters/psamp-charter.html.”

[7] D. Chiou, B. Claise, N. Duffield, A. Greenberg, M. Grossglauser, P. Marimuthu, and J. Rexford, “A framework for packet selection and reporting,” draft-ietf-psamp-framework-10.txt.

[8] T. Zseby, M. Molina, N. Duffield, S. Niccolini, and F. Raspall, “Sampling and filtering techniques for IP packet selection,” draft-ietf-psamp-sample-tech-06.txt.

[9] “http://www.ietf.org/html.charters/ipfix-charter.html.”

[10] G. Sadasivan, N. Brownlee, B. Claise, and J. Quittek, “Architecture for IP flow information export,” draft-ietf-ipfix-architecture-06.

[11] J. Quittek, S. Bryant, and J. Meyer, “Information model for IP flow information export,” draft-ietf-ipfix-info-06.

[12] C. Estan, K. Keyes, D. Moore, and G. Varghese, “Building a better NetFlow,” in Proc. ACM SIGCOMM, Aug. 2004.

[13] N. Hohn and D. Veitch, “Inverting sampled traffic,” in Proc. ACM SIGCOMM Internet Measurement Conference, Oct. 2003.

[14] N. Duffield, C. Lund, and M. Thorup, “Charging from sampled network usage,” in Proc. ACM SIGCOMM Internet Measurement Workshop, Nov. 2001.

[15] N. Duffield and C. Lund, “Predicting resource usage and estimation accuracy in an IP flow measurement collection infrastructure,” in Proc. ACM Internet Measurement Conference, Miami Beach, FL, October 27-29 2003, pp. 179–191.

[16] N. Duffield, C. Lund, and M. Thorup, “Flow sampling under hard resource constraints,” in SIGMETRICS 2004/PERFORMANCE 2004: Proceedings of the joint international conference on Measurement and modeling of computer systems. New York, NY, USA: ACM Press, 2004, pp. 85–96.

[17] C. Estan and G. Varghese, “New directions in traffic measurement and accounting: Focusing on the elephants, ignoring the mice,” ACM Trans. Comput. Syst., vol. 21, no. 3, pp. 270–313, 2003.

[18] S. Singh, C. Estan, G. Varghese, and S. Savage, “Automated worm fingerprinting,” in Proc. of USENIX OSDI, 2004.

[19] A. Kumar, J. Xu, J. Wang, O. Spatscheck, and L. Li, “Space-Code Bloom Filter for Efficient per-flow Traffic Measurement,” in Proc. IEEE INFOCOM, Mar. 2004.

[20] A. Kumar, M. Sung, J. Xu, and J. Wang, “Data streaming algorithms for efficient and accurate estimation of flow size distribution,” in Proc. ACM SIGMETRICS, June 2004.

[21] M. Ramakrishna, E. Fu, and E. Bahcekapili, “Efficient hardware hashing functions for high performance computers,” IEEE Trans. on Computers, vol. 46, no. 12, pp. 1378–1381, Dec. 1997.

[22] S. Ramabhadran and G. Varghese, “Efficient implementation of a statistics counter architecture,” in Proc. ACM SIGMETRICS, 2003.

[23] N. Duffield, C. Lund, and M. Thorup, “Estimating flow distributions from sampled flow statistics,” in Proc. ACM SIGCOMM, Aug. 2003.
