IP traffic classification via blind source separation based on Jacobi algorithm*

Walid Saddi, Nadia Ben Azzouna, and Fabrice Guillemin

France Telecom, Division R&D, 2 Avenue Pierre Marzin, F-22300 Lannion, France

Abstract. By distinguishing long and short TCP flows, we address in this paper the problem of efficiently computing the characteristics of long flows. Instead of using time-consuming off-line flow classification procedures, we investigate how flow characteristics can be inferred directly from traffic measurements by means of digital signal processing techniques. The proposed approach consists of classifying packets on the fly according to their size in order to construct two signals, one associated with short flows and the other with long flows. Since these two signals have intertwined spectral characteristics, we use a blind source separation technique to reconstruct the original spectral densities of the short and long flow sources. The method is applied to a real traffic trace captured on a link of the France Telecom IP backbone network and proves efficient at recovering the characteristics of long and short flows.

1 Introduction

A common approach to modeling Internet traffic relies on the concept of flow (see for instance [1–3]). Short and long flows are frequently referred to in the technical literature as mice and elephants, respectively, even though no clear definition exists (see [4] for a detailed discussion of the difficulty of defining mice and elephants). From a practical point of view, the major issue when using the concept of flow is that a complete traffic trace has to be captured and then analyzed off-line in order to properly classify the different flows. To know whether a flow is indeed a mouse or an elephant, the complete history of the flow has to be known and flows have to be sorted according to some criteria.

Because of the very high transmission rate of network links, off-line analysis requires huge storage capacities, and sorting flows is highly CPU consuming. Moreover, an exhaustive description of flows is not always needed, since only some metrics of interest have to be estimated. For instance, in the case of elephants, their bit rate and their duration give some insight into the way end users utilize the network.

Extracting metrics directly from traffic traces can be done, in principle, by using digital signal processing techniques. The major problem, however, is that the spectral characteristics of the different types of flows may be intertwined. Hence, computing metrics for a given flow type first requires the isolation of the spectral characteristics of the component under consideration.

* This work has been partially supported by the RNRT project Metropolis.

In this paper, we propose a method that processes packets instead of flows in order to construct two signals, one for mice and the other for elephants, by using a very simple definition for these two types of flows. Packets are classified on the basis of their size, using the fact that large packets have a good chance of belonging to elephants, while small packets are more or less evenly distributed between mice and elephants. This classification yields two processes, which can reasonably be assumed Gaussian. We then employ a blind source separation method for recovering elephant and mouse traffic. This method is based on minimizing the mutual information between the two Gaussian processes under consideration, provided that the elephant and mouse processes are to a large extent independent. The method was designed by Pham [5, 6], and we illustrate its use on ADSL traffic observed on an IP backbone link of the France Telecom network.

The organization of this paper is as follows: In Section 2, we present the experimental setting and recall some modeling results established in an earlier study [7]. The blind source separation method is presented in Section 3 and is applied in Section 4 to real traffic traces, where some elements for inferring traffic parameters are presented, especially for elephants. Some concluding remarks are presented in Section 5.

2 Experimental setting and preliminary results

2.1 Experimental setting

Throughout this paper, we consider measurements from a 1 Gbit/s link of the France Telecom IP backbone network. We observe TCP traffic flowing from the backbone network toward several ADSL areas. Traffic is mainly due to ADSL customers and is thus quite different from the LAN or Tier-1 traffic usually analyzed in the technical literature [1, 8–10]. The total load of the link (including TCP and UDP traffic) is about 43.5%.

Traffic is observed by means of a measurement device, which copies the headers of both TCP segments and IP datagrams. We are thus able to identify those packets with the same 5-tuple composed of the source IP address, the destination IP address, the source port, the destination port and the protocol type (namely TCP). Packets with the same 5-tuple are said to belong to the same flow (a.k.a. micro-flow in the technical literature [2]).

Measurements were performed in October 2003 between 9:00 pm and 11:00 pm, which usually corresponds to the period of highest daily activity of ADSL customers. In the following, we evaluate the "instantaneous" bit rate by computing the number of bits arriving in time intervals of length $\Delta = 100$ ms. Let $X_n$ denote the bit rate evaluated over the $n$th time interval. We are interested in the properties of the process $\{X_n\}$, which can also be seen as a time series, assumed to be stationary in the wide sense and characterized by its mean and its spectral density function $\psi(x)$, defined by

$$\mathrm{cov}(X_n, X_{n+k}) = \int_{-\pi}^{\pi} e^{ikx}\,\psi(x)\,dx,$$

where $\mathrm{cov}(X_n, X_{n+k})$ is the covariance of the random variables $X_n$ and $X_{n+k}$. (This covariance depends only upon $k$ because of wide sense stationarity.)
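The construction of the series $\{X_n\}$ and of its empirical covariance can be sketched as follows. This is only an illustration under stated assumptions: the function names `bit_rate_series` and `empirical_autocovariance` are hypothetical, packets are assumed to be given as arrays of arrival times (in seconds) and sizes (in bytes), and the bit count of each interval is divided by $\Delta$ to obtain a rate.

```python
import numpy as np

def bit_rate_series(timestamps, sizes_bytes, delta=0.1):
    """Bit rate X_n over consecutive intervals of length delta (seconds).

    Assumption: timestamps are packet arrival times in seconds and
    sizes_bytes are packet sizes in bytes (hypothetical input format).
    """
    timestamps = np.asarray(timestamps, dtype=float)
    sizes_bytes = np.asarray(sizes_bytes, dtype=float)
    # Index of the interval each packet falls into, counted from the first packet.
    idx = np.floor((timestamps - timestamps.min()) / delta).astype(int)
    # Total number of bits per interval (empty intervals count as zero).
    bits = np.bincount(idx, weights=8.0 * sizes_bytes)
    return bits / delta

def empirical_autocovariance(x, max_lag):
    """Biased estimate of cov(X_n, X_{n+k}) for k = 0, ..., max_lag."""
    x = np.asarray(x, dtype=float)
    xc = x - x.mean()
    n = len(xc)
    return np.array([np.dot(xc[:n - k], xc[k:]) / n for k in range(max_lag + 1)])
```

The biased estimator (division by the sample size rather than by the number of overlapping terms) is one common choice; the paper does not say which estimator was used.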

2.2 Modeling ADSL traffic

A mathematical model for the global bit rate process has been established in [11, 7] by using a flow-based approach and, in particular, the mouse/elephant dichotomy with the following definitions.

Definition 1 (Mouse). A mouse is a flow comprising fewer than 20 packets; it is terminated when no packet of the flow has been observed for a time period of 5 seconds.

The 5 second timer is introduced in order to concentrate on the transmission phase of the mouse. This prevents mice from being unduly stretched because some segments (typically SYN or FIN segments) arrive too far from other segments.

Definition 2 (Elephant). An elephant is a flow with more than 20 packets.

The above definitions may at first glance appear very crude, but they are actually sufficient to describe the global bit rate. This is in particular due to the fact that the majority of traffic comes from peer-to-peer (p2p) applications, which generate 80% of global traffic. Elephants related to large data transfers tend to be very long, which makes for a clear difference between long and short flows, while p2p mice due to signalling are very numerous and of small size.

Mice, which represent 95% of the flows, contribute only 6% to the global load. To describe the bit rate of mice, we are led to aggregate them according to some criteria. In particular, mice related to p2p protocols are not aggregated in the same way as other mice associated with classical applications (web, ftp, nntp, etc.). By aggregating mice, it is possible to introduce new entities (referred to as macro-mice), which have the key property of arriving according to a Poisson process. Moreover, their duration $D$ can be well described by a two-parameter Weibull distribution, i.e.,

$$P(D > x) = \exp\left(-(x/\eta)^{\beta}\right), \tag{1}$$

where $\eta$ and $\beta$ are the scale and shape parameters, respectively. Finally, their fluid bit rate $Y$, defined as the total amount of bits of the macro-mouse divided by its duration, depends only weakly on the duration, so that the conditional expectation $E[Y^2 \mid D]$ can be approximated by a constant $\kappa$.
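As a quick check on the role of the two parameters in (1), the mean duration follows by integrating the survival function. This is a standard property of the Weibull distribution, not a result stated in the paper:

```latex
E[D] = \int_0^{\infty} P(D > x)\,dx
     = \int_0^{\infty} \exp\left(-(x/\eta)^{\beta}\right) dx
     = \eta\,\Gamma\!\left(1 + \frac{1}{\beta}\right).
```

For $\beta = 1$ the distribution is exponential with mean $\eta$; for $\beta < 1$ the tail is heavier than exponential and the mean exceeds $\eta$ (for instance $E[D] = 2\eta$ when $\beta = 1/2$).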

Concerning elephants, we first note that some elephants are mainly composed of ACK segments. This corresponds to the fact that the terminals of some ADSL customers play the role of servers for p2p applications, revealing a symmetric usage of the Internet under the impetus given by the massive deployment of p2p services. The corresponding elephants are referred to as ACK elephants and give rise to a bit rate that can be assimilated to white noise. Other elephants are referred to as data elephants.

Data elephants do not transmit smoothly but consist of bursts separated by time periods of weak activity. To model their bit rate, we decompose elephants into mini-elephants and elephant mice. These two new objects enjoy the same properties as the macro-mice introduced above. In particular, they arrive according to Poisson processes, their durations can be well approximated by two-parameter Weibull distributions, and the conditional moment of the squared fluid bit rate can be taken equal to a constant (denoted by $\kappa_e$ for mini-elephants).

In the following, we assume that the processes describing the bit rates of the different components are Gaussian (see [7] for details). Moreover, the spectral density of the global bit rate is dominated by that of elephants at low frequencies and by that of non-p2p mice at high frequencies.

2.3 Blind source separation method

Blind source separation consists of recovering components from a set of observed mixtures. In our case, the components are the sources generating mice and elephants. The goal of source separation is to obtain two distinct signals, one for mice and the other for elephants, so that the characteristics of the two signals can be analyzed separately.

Assuming that elephants and mice are linearly superposed, blind source separation can be formalized as follows: The observed process is modeled as

$$X(t) = A\,S(t),$$

where $X(t) = (X_1(t), \ldots, X_K(t))$ is the vector of observations, $S(t) = (S_1(t), \ldots, S_K(t))$ is the vector of original sources, and $A$ is a $K \times K$ nonsingular matrix. In the case considered in this paper, $K = 2$.

The first step is to construct the two observed processes $X_1(t)$ and $X_2(t)$. For this purpose, we adopt the following strategy based on the packet size: We fix a threshold $T$ (expressed in bytes) and two probabilities $p_1$ and $p_2$. If a packet has a size smaller than $T$, then the packet is assigned to an elephant with probability $p_1$ and to a mouse with probability $1 - p_1$. Similarly, if the size of the packet is greater than the threshold $T$, the packet is assigned to an elephant with probability $p_2$ and to a mouse with probability $1 - p_2$. With this stochastic packet classification, packets of mice may be counted as packets of elephants and vice versa, giving rise to a classification error.

Instead of dealing with continuous time signals, we fix in practice an integration interval of length $\Delta$ (in the following $\Delta = 100$ milliseconds), and the quantities $X_1(n)$ and $X_2(n)$ represent the total amount of bytes of the packets classified in the $n$th interval as belonging to mice and elephants, respectively. Moreover, in the experiments reported below, the probabilities have been taken equal to $p_1 = 0.5$ and $p_2 = 0.9$.
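The stochastic classification and the aggregation over intervals of length $\Delta$ can be sketched as below. This is a minimal illustration, not the measurement code used in the study: the function name `classify_and_aggregate` and the input format are hypothetical, the default threshold of 200 bytes is the value quoted in Section 4.1, and packets whose size is exactly $T$ (a case the text leaves open) are treated here as large packets.

```python
import numpy as np

def classify_and_aggregate(timestamps, sizes_bytes, threshold=200,
                           p1=0.5, p2=0.9, delta=0.1, rng=None):
    """Return (x_mice, x_elephants): bytes per interval of length delta.

    Assumptions: timestamps in seconds, sizes in bytes; a packet of size
    exactly equal to the threshold is handled like a large packet.
    """
    rng = np.random.default_rng() if rng is None else rng
    timestamps = np.asarray(timestamps, dtype=float)
    sizes = np.asarray(sizes_bytes, dtype=float)

    # Probability of assigning each packet to the elephant signal.
    p_elephant = np.where(sizes < threshold, p1, p2)
    is_elephant = rng.random(len(sizes)) < p_elephant

    # Interval index of each packet, counted from the first packet.
    idx = np.floor((timestamps - timestamps.min()) / delta).astype(int)
    n_bins = idx.max() + 1

    x_mice = np.bincount(idx[~is_elephant], weights=sizes[~is_elephant],
                         minlength=n_bins)
    x_elephants = np.bincount(idx[is_elephant], weights=sizes[is_elephant],
                              minlength=n_bins)
    return x_mice, x_elephants
```

Each packet is drawn independently, so the two output signals always sum to the total byte count per interval; only the split between them is random.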

The signals under consideration are altered by noise caused by different phenomena (e.g., discrete packet arrivals, ACK elephants, etc.), and the magnitude of this noise is very large. In the following, we eliminate noise by filtering the observed signals by means of a wavelet filter (see [12] for details in the case of a $1/f$ signal). Moreover, we assume that the processes are centered. The two resulting signals are denoted by $\{X_e(n)\}$ and $\{X_s(n)\}$, corresponding to elephants and mice, respectively. With a small abuse of notation, the filtered mouse and elephant sources are still denoted by $\{S_1(n)\}$ and $\{S_2(n)\}$.

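With the above assumptions, we have the system

$$\begin{pmatrix} X_s \\ X_e \end{pmatrix} = \begin{pmatrix} a_{11} & a_{12} \\ a_{21} & a_{22} \end{pmatrix} \begin{pmatrix} S_1 \\ S_2 \end{pmatrix}, \tag{2}$$

where the $a_{ij}$ are unknown mixing coefficients, which quantify the classification error and which entail that the signals $\{X_e(n)\}$ and $\{X_s(n)\}$ are not exactly equal to the sources $\{S_1(n)\}$ and $\{S_2(n)\}$.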

3 Blind source separation based on Gaussian mutual information

In this section, we review the method proposed by Pham [5], which is based on the minimization of Gaussian mutual information. This method seems to be well adapted to the problem considered in this paper since we deal with Gaussian processes, associated with mice and elephants.

The mutual information between Gaussian processes is defined by [5, 6, 13]

$$I_g[Y_1, \ldots, Y_K] = \frac{1}{4\pi} \int_{-\pi}^{\pi} \left\{ \log\det\left[\mathrm{diag}[f_Y(\lambda)]\right] - \log\det[f_Y(\lambda)] \right\} d\lambda, \tag{3}$$

where, for a matrix $A$, $\mathrm{diag}(A)$ denotes the diagonal matrix with the same diagonal elements as $A$, and $f_Y(\lambda)$ is the spectral density matrix of the process $(Y_1, \ldots, Y_K)$, the $Y_i$ being real Gaussian processes.

The mixing model is given by equation (2). The goal of the source separation method is to find a matrix $B$ minimizing $I_g[Y_1, Y_2]$, where

$$Y = \begin{pmatrix} Y_1 \\ Y_2 \end{pmatrix} = B \begin{pmatrix} X_s \\ X_e \end{pmatrix}.$$

$Y_1$ and $Y_2$ represent the reconstructed sources (denoted by $S_1$ and $S_2$ in the following). It is worth noting that the separation is achievable if and only if the sources have spectral densities that are not proportional.

Ideally, the matrix $B$ minimizing the mutual information should be equal to the inverse of $A$, so that $BA = I$, where $I$ is the identity matrix. Denoting by $B^t$ the transpose of the matrix $B$, the mutual information (3) is then equal to

$$I_g[Y_1, \ldots, Y_K] = \frac{1}{4\pi} \int_{-\pi}^{\pi} \left\{ \log\det\left[\mathrm{diag}[B f_X(\lambda) B^t]\right] - \log\det[B f_X(\lambda) B^t] \right\} d\lambda.$$

Minimizing the mutual information can be viewed as reducing the deviation from diagonality of the matrices $B f_X(\lambda) B^t$ for $\lambda \in [-\pi, \pi]$. In practice, the matrices $B f_X(\lambda) B^t$ are evaluated at points $\lambda_l$, $l = 1, \ldots, L$, which are distributed in the interval $[0, 2\pi]$. The mutual information then reads

$$\frac{1}{2L} \sum_{l=1}^{L} \left\{ \log\det\left[\mathrm{diag}[B f_X(\lambda_l) B^t]\right] - \log\det[B f_X(\lambda_l) B^t] \right\}. \tag{4}$$
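The discretized criterion (4) is straightforward to evaluate numerically. The sketch below assumes that the spectral density matrices are available as a NumPy array of shape $(L, K, K)$ of Hermitian positive definite matrices and that $B$ is real; the function name `gaussian_mutual_information` is hypothetical.

```python
import numpy as np

def gaussian_mutual_information(B, f_X):
    """Evaluate criterion (4) for a real K x K matrix B.

    f_X: array of shape (L, K, K) holding the spectral density matrices
    f_X(lambda_l), assumed Hermitian positive definite.
    """
    B = np.asarray(B, dtype=float)
    f_X = np.asarray(f_X)
    # C[l] = B f_X(lambda_l) B^t for every frequency point (broadcast matmul).
    C = B @ f_X @ B.T
    # log det of the diagonal part: sum of the logs of the diagonal entries.
    diag = np.real(np.einsum('lii->li', C))
    log_det_diag = np.sum(np.log(diag), axis=1)
    # log |det C[l]|, computed stably for the whole stack at once.
    _, log_det = np.linalg.slogdet(C)
    return np.sum(log_det_diag - log_det) / (2 * len(f_X))
```

By Hadamard's inequality each term of the sum is non-negative and vanishes only when the corresponding matrix is diagonal, so the criterion is zero exactly when $B$ diagonalizes all the matrices simultaneously.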

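The algorithm proposed by Pham [6] consists of jointly diagonalizing the matrices $B f_X(\lambda_l) B^t$ for $l = 1, \ldots, L$ and relies on the Jacobi diagonalization method. Let

$$B = \begin{pmatrix} b_{11} & b_{12} \\ b_{21} & b_{22} \end{pmatrix} = \begin{pmatrix} B_1 \\ B_2 \end{pmatrix}$$

and define recursively

$$\begin{pmatrix} B_1 \\ B_2 \end{pmatrix}_{k+1} = \begin{pmatrix} B_1 \\ B_2 \end{pmatrix}_{k} - \alpha_k,$$

where

$$\alpha_k = \frac{2}{1 + \sqrt{1 - 4 h_{12} h_{21}}} \begin{pmatrix} 0 & h_{12} \\ h_{21} & 0 \end{pmatrix} \begin{pmatrix} B_1 \\ B_2 \end{pmatrix}_{k}, \qquad \begin{pmatrix} h_{12} \\ h_{21} \end{pmatrix} = \begin{pmatrix} w_{12} & 1 \\ 1 & w_{21} \end{pmatrix}^{-1} \begin{pmatrix} g_{12} \\ g_{21} \end{pmatrix},$$

with

$$g_{ij} = \frac{1}{L} \sum_{l=1}^{L} \frac{[B_k f_X(\lambda_l) B_k^t]_{ij}}{[B_k f_X(\lambda_l) B_k^t]_{ii}} \quad \text{and} \quad w_{ij} = \frac{1}{L} \sum_{l=1}^{L} \frac{[B_k f_X(\lambda_l) B_k^t]_{jj}}{[B_k f_X(\lambda_l) B_k^t]_{ii}}.$$

The above operations are iterated until the quantity $c = \sqrt{g_{12}^2 + g_{21}^2}$ is less than a given threshold $\varepsilon \ll 1$.

The algorithm described above actually achieves source separation only up to a scaling matrix and a permutation. In other words, the output matrix $B$ is not exactly the inverse of the matrix $A$ but is such that

$$BA = PD, \tag{5}$$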

where $P$ is a permutation matrix and $D$ is a diagonal matrix.

In addition, the matrix $f_X(\lambda)$ appearing in the above algorithm is the theoretical spectral density matrix. In reality, this matrix has to be evaluated by using an empirical set of samples with limited size. In the following, the spectral density matrix is evaluated by using periodograms or the discrete Fourier transform of autocorrelation functions [14, 15]. Finally, to obtain a smooth version of a spectral density function $S_X$ associated with a process $X$, we can take the convolution of the function $S_X$ with a Parzen kernel $K_M$, i.e., a $2\pi$-periodic function depending upon a parameter $M$ and converging to the Dirac comb when $M \to \infty$. The Parzen kernel used in the experimental results reported below is the kernel $K_M$ given in [16], namely

$$K_M(u) = \begin{cases} 1 - 6\left(\dfrac{u}{M}\right)^2 \left(1 - \left|\dfrac{u}{M}\right|\right) & \text{if } 0 \le |u| < \dfrac{M}{2}, \\[2mm] 2\left(1 - \left|\dfrac{u}{M}\right|\right)^3 & \text{if } \dfrac{M}{2} \le |u| < M, \\[2mm] 0 & \text{otherwise.} \end{cases}$$
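The piecewise formula for $K_M$ translates directly into code. The sketch below only evaluates the kernel; the function name `parzen_window` is hypothetical, and how exactly the kernel is combined with the raw spectral estimate (as weights on the correlation lags or as a smoothing window in frequency) is not spelled out in the text and is left out here.

```python
import numpy as np

def parzen_window(u, M):
    """Evaluate the Parzen kernel K_M at u (scalar or array)."""
    r = np.abs(np.asarray(u, dtype=float)) / M
    inner = 1.0 - 6.0 * r**2 * (1.0 - r)   # branch 0 <= |u| < M/2
    outer = 2.0 * (1.0 - r)**3             # branch M/2 <= |u| < M
    return np.where(r < 0.5, inner, np.where(r < 1.0, outer, 0.0))
```

The two branches agree at $|u| = M/2$, where both give $1/4$, so the kernel is continuous and decreases from 1 at $u = 0$ to 0 at $|u| = M$.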

4 Application of the source separation algorithm and inference of traffic parameters

4.1 Blind source separation of mice and elephants

We apply the blind source separation algorithm presented in the previous section to the processes $\{X_e(t)\}$ and $\{X_s(t)\}$ associated with elephants and mice, respectively (see Section 2.3). In the experimental data reported below, we have used the following parameter values: The threshold $T$ for discriminating packets has been set equal to 200 bytes, and the mixing probabilities to $p_1 = 0.5$ and $p_2 = 0.9$. The autocorrelation and cross-correlation functions are evaluated with a sample size $N = 45056$ over 4096 points (4096 values of the lag $k$). Since we have real processes, the autocorrelation and cross-correlation functions are even. Therefore, we can extend the evaluation to 8192 points. We have then applied the discrete Fourier transform and obtained 8192 samples of the power spectral density. The smoothed version of the spectral density is obtained with $M = 16$ for the Parzen kernel.
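The estimation steps just described (correlation functions over a fixed number of lags, even extension, discrete Fourier transform) might be sketched as follows. Several details are assumptions, since the text does not give them: the function names are hypothetical, the biased correlation estimator is used, the cross-correlation is symmetrized to enforce evenness, the even extension is padded with a single zero to reach exactly twice the number of lags, and the Parzen smoothing step is omitted.

```python
import numpy as np

def cross_covariance(x, y, n_lags):
    """Biased estimate of cov(x_n, y_{n+k}) for k = 0, ..., n_lags - 1."""
    x = np.asarray(x, dtype=float) - np.mean(x)
    y = np.asarray(y, dtype=float) - np.mean(y)
    n = len(x)
    return np.array([np.dot(x[:n - k], y[k:]) / n for k in range(n_lags)])

def spectral_density_matrices(xs, xe, n_lags=4096):
    """Raw (unsmoothed) 2 x 2 spectral density matrices on 2 * n_lags points."""
    signals = (xs, xe)
    f = np.empty((2 * n_lags, 2, 2))
    for i in range(2):
        for j in range(2):
            r = cross_covariance(signals[i], signals[j], n_lags)
            if i != j:
                # Assumption: symmetrize so that the cross-correlation is even.
                r = 0.5 * (r + cross_covariance(signals[j], signals[i], n_lags))
            # Even extension: lags 0..n_lags-1, one zero, then lags n_lags-1..1.
            r_even = np.concatenate([r, [0.0], r[:0:-1]])
            # The DFT of a real even sequence is real.
            f[:, i, j] = np.real(np.fft.fft(r_even)) / (2.0 * np.pi)
    return f
```

With `n_lags=4096` this yields 8192 matrices, matching the count quoted above. The raw estimate is not guaranteed to be positive definite at every frequency, which is one reason a smoothing step is needed before running the separation algorithm.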

Starting with the identity matrix as the initial value of the matrix $B$, the Jacobi algorithm is run in order to jointly diagonalize $L = 8192$ matrices. The algorithm converges in a few iterations.
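For $K = 2$, the iteration of Section 3 started from the identity matrix can be sketched as below. This is an illustrative reading of the recursion, not the authors' implementation: the function name `jacobi_separation` is hypothetical, the spectral density matrices are assumed to be real symmetric positive definite (their real part is taken), and a guard on the square root and a cap on the number of iterations are added for safety.

```python
import numpy as np

def jacobi_separation(f_X, eps=1e-8, max_iter=100):
    """Jointly diagonalize the 2 x 2 matrices f_X[l]; return the matrix B.

    f_X: array of shape (L, 2, 2), assumed symmetric positive definite.
    """
    f_X = np.real(np.asarray(f_X))
    B = np.eye(2)  # initial value, as in the experiments
    for _ in range(max_iter):
        C = B @ f_X @ B.T  # C[l] = B f_X(lambda_l) B^t
        g12 = np.mean(C[:, 0, 1] / C[:, 0, 0])
        g21 = np.mean(C[:, 1, 0] / C[:, 1, 1])
        if np.hypot(g12, g21) < eps:  # stopping criterion c < eps
            break
        w12 = np.mean(C[:, 1, 1] / C[:, 0, 0])
        w21 = np.mean(C[:, 0, 0] / C[:, 1, 1])
        # Solve [[w12, 1], [1, w21]] (h12, h21)^t = (g12, g21)^t.
        h12, h21 = np.linalg.solve([[w12, 1.0], [1.0, w21]], [g12, g21])
        # Guard (assumption): clip the radicand at zero.
        scale = 2.0 / (1.0 + np.sqrt(max(1.0 - 4.0 * h12 * h21, 0.0)))
        H = np.array([[0.0, h12], [h21, 0.0]])
        B = B - scale * (H @ B)
    return B
```

The linear system becomes singular when $w_{12} w_{21} = 1$, which happens when the two diagonal spectra are proportional; this is the numerical counterpart of the separability condition stated in Section 3.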

The normalized spectral densities of the observed processes are displayed in Figure 1(a). We note that they are intertwined and very close to each other. The former property also holds for the actual spectral densities of mice and elephants, as illustrated in Figure 1(b). The spectral density of mice is of the same order of magnitude as that of elephants at high frequencies, while the latter dominates at low frequencies.

[Figure 1: log-log plots of normalized spectral densities versus pulsation. Panel (a), "Xe and Xs (2003)": observed processes Xe and Xs. Panel (b): original processes, elephants and mice.]

Fig. 1. Spectral densities of observed processes and real sources.

When comparing the spectral densities of the observed processes with those of the real processes, we note that the observed elephant process is close to the real one. This is due to the fact that the probability of classifying a large packet as belonging to an elephant is quite large, equal to $p_2 = 0.9$. Hence, the correlation structure of elephants, which is due to large packets arriving close to each other, is preserved in spite of the random drawing of packets. This property is not enjoyed by mice: these flows have small durations, and their structure is altered by the random packet drawing.

The spectral densities of the reconstructed signals are displayed and compared with those of the real sources in Figures 2(a) and 2(b). It clearly appears that the blind source separation algorithm is very efficient at recovering the normalized spectral densities of the real sources from the observed signals. This can be used to infer the traffic characteristics of the different components, especially those of elephants, which represent almost the totality of traffic. This point is addressed in the next section.

[Figure 2: log-log plots of normalized spectral densities versus pulsation. Panel (a): mice and reconstructed Source 1. Panel (b), "Elephants and Source 2": elephants and reconstructed Source 2.]

Fig. 2. Comparison between the spectral densities of the reconstructed and real sources.

4.2 Inference of traffic parameters

The normalized spectral densities can be used to estimate the mean duration of mice and elephants. Indeed, for a given type of flow, the normalized spectral density has the property of being close to $1/(\pi E[S] x^2)$ for a large range of frequencies $x$ (see [7] for details). It can be checked that the spectral densities of the reconstructed sources indeed decay as $c/x^2$ at high frequencies. By estimating the coefficient $c$ for each curve, we can deduce the mean duration of mice and mini-elephants. Note that this yields a criterion for determining which output signal of the source separation algorithm corresponds to mice or elephants, since the mean duration of mini-elephants is of course greater than that of mice. This allows us to fix the permutation matrix appearing in equation (5).
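Estimating the coefficient $c$ amounts to a least-squares fit in log-log coordinates with the slope fixed to $-2$, after which the mean duration is $1/(\pi c)$. The sketch below is one possible way to do this; the function name `mean_duration_from_spectrum` and the choice of the fitting band `[x_min, x_max]` are assumptions, since the paper does not describe the fitting procedure.

```python
import numpy as np

def mean_duration_from_spectrum(x, psd, x_min, x_max):
    """Estimate E[S] from a normalized spectral density behaving as c / x^2.

    x: frequencies (pulsations), psd: normalized spectral density values,
    [x_min, x_max]: band over which the c / x^2 behaviour is assumed to hold.
    """
    x = np.asarray(x, dtype=float)
    psd = np.asarray(psd, dtype=float)
    mask = (x >= x_min) & (x <= x_max)
    # With the slope fixed to -2, log psd = log c - 2 log x, so the
    # least-squares estimate of log c is the mean of log psd + 2 log x.
    log_c = np.mean(np.log(psd[mask]) + 2.0 * np.log(x[mask]))
    c = np.exp(log_c)
    return 1.0 / (np.pi * c)
```

Applying the function to both reconstructed sources and comparing the two estimates gives the ordering used to resolve the permutation: the source with the larger estimated duration is taken to be the elephant source.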

To simplify the notation, assume that $S_1 = S_s$ and $S_2 = S_e$. Moreover, let $\alpha_1$ and $\alpha_2$ denote the diagonal elements of the diagonal matrix $D$ appearing in equation (5). Then, we have

$$\alpha_1 S_1 = b_{11} X_s + b_{12} X_e, \tag{6}$$

$$\alpha_2 S_2 = b_{21} X_s + b_{22} X_e. \tag{7}$$

Recall that all the random variables in the above system are centered.

If we increase the integration interval $\Delta$ up to 4 seconds, then mice appear more or less as noise for elephants, which represent the dominant part of traffic. Using the general relations established in [7],

$$\mathrm{var}[X] \approx \lambda_e \kappa_e E[D_e] \quad \text{and} \quad E[X] \approx \lambda_e \sqrt{\kappa_e}\, E[D_e],$$

where $D_e$ is the duration of elephants, we can deduce the quantity $\kappa_e$ (the mean squared fluid bit rate of mini-elephants) and then the mini-elephant arrival rate $\lambda_e$. By using equation (7), which involves the elephant source $S_2$, we can compute $\alpha_2$.

Concerning mice, by using the process $X$ evaluated over time intervals of 100 ms, we have $\mathrm{var}[X] = \lambda_s \kappa_s E[D_s] + \lambda_e \kappa_e E[D_e]$, which allows us to deduce the quantity $\lambda_s \kappa_s$. Equation (6), which involves the mouse source $S_1$, can be used to compute the diagonal element $\alpha_1$.
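The step from the two moment relations to the parameters is elementary algebra, spelled out here for convenience. It assumes that $E[D_e]$ and $E[D_s]$ have already been estimated from the spectral densities, and that $\mathrm{var}[X]$ and $E[X]$ in the first two lines are measured with $\Delta = 4$ s, while the variance in the last line is measured with $\Delta = 100$ ms:

```latex
\sqrt{\kappa_e} \approx \frac{\mathrm{var}[X]}{E[X]}, \qquad
\kappa_e \approx \left(\frac{\mathrm{var}[X]}{E[X]}\right)^{2}, \qquad
\lambda_e \approx \frac{E[X]}{\sqrt{\kappa_e}\,E[D_e]}
          = \frac{E[X]^{2}}{\mathrm{var}[X]\,E[D_e]},

\lambda_s \kappa_s = \frac{\mathrm{var}[X] - \lambda_e \kappa_e E[D_e]}{E[D_s]}.
```

The first line is obtained by dividing the variance relation by the mean relation; substituting back into the mean relation gives $\lambda_e$.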

The numerical values obtained by using the above heuristic parameter inference method are given in Table 1 and compared against the real parameters. Estimated values for elephants are obtained by approximating the behavior of the spectral density of the reconstructed process at low frequencies. The mean duration of macro-mice is computed by using the reconstructed mouse source. The estimated values are of the same order of magnitude as the real ones and can be used to qualitatively estimate the behavior of mini-elephants.

Table 1. Estimated values of the traffic parameters compared against real values.

parameter      estimated value    real value
$\lambda_e$    35.0               40.01
$\kappa_e$     9.2e8              1e9
$E[D_e]$       164.6              192.95
$E[D_s]$       3.05               3.249

5 Conclusion

We have investigated in this paper a method for rapidly inferring the spectral characteristics of the short and long flows giving rise to global traffic on a link connecting several ADSL areas to an IP backbone network. The analyzed traffic has the particularity that it can be described by simple M/G/∞ models with Weibull service times under heavy traffic conditions, so that all processes appearing in the analysis can be assumed to be Gaussian. The proposed method relies on the classification of packets according to their size and on minimizing the mutual information between Gaussian processes by using the algorithm designed by Pham [5, 6].

This method turns out to be very efficient at recovering the spectral characteristics of long and short flows from observed data. The approach followed in this paper opens the door to on-line estimation of traffic parameters, instead of storing huge amounts of data and performing tedious off-line analysis. The design of on-line parameter estimation methods will be addressed in further studies.

References

1. Barakat, C., Thiran, P., Iannaccone, G., Diot, C., Owezarski, P.: A flow-based model for Internet backbone traffic. In: Proc. ACM SIGCOMM Internet Measurement Workshop, Marseille (2002)

2. Fredj, S.B., Bonald, T., Proutiere, A., Regnie, G., Roberts, J.: Statistical bandwidth sharing: A study of congestion at flow level. In: Proc. ACM SIGCOMM (2001)

3. Olivier, P., Benameur, N.: Flow level IP traffic characterization. In: Proc. ITC'17, Salvador de Bahia, Brazil (2001)

4. Papagiannaki, K., Taft, N., Bhattacharyya, S., Thiran, P., Salamatian, K., Diot, C.: On the feasibility of identifying elephants in Internet backbone traffic. Technical Report TR01-ATL-110918, Sprint Labs, Sprint ATL (2001)

5. Pham, D.: Blind separation of instantaneous mixture of sources via the Gaussian mutual information criterion. Technical report, Laboratory of Modeling and Computation, IMAG, C.N.R.S, France (2001)

6. Pham, D.: Mutual information approach to blind separation for stationary sources. In: Proc. ICA'99 Conference, Aussois (1999) 215–220

7. Ben-Azzouna, N., Clerot, F., Fricker, C., Guillemin, F.: A flow-based approach to modeling ADSL traffic on an IP backbone link. Annals of Telecommunications (2004)

8. Claffy, K., Miller, G., Thompson, K.: The nature of the beast: Recent traffic measurement from an Internet backbone. In: Proc. of Inet (1998)

9. Leland, W., Taqqu, M., Willinger, W., Wilson, D.: On the self-similar nature of Ethernet traffic. IEEE/ACM Trans. Net. (1994) 1–15

10. Zhang, Z.L., Ribeiro, V., Moon, S., Diot, C.: Small time scaling behavior of Internet backbone traffic: An empirical study. In: Proc. Infocom 2003 (2003)

11. Ben-Azzouna, N., Guillemin, F.: Analysis of ADSL traffic on an IP backbone link. In: Proc. Globecom 2003, San Francisco (CA) (2003)

12. Wornell, G.: Signal processing with fractals. A wavelet-based approach. Prentice Hall Signal Processing Series (1995)

13. Cover, T., Thomas, J.: Elements of Information Theory. Wiley-Interscience Publication (1991)

14. Castanie, F.F., ed.: Analyse spectrale. Hermes Science Publications (2003)

15. Priestley, M.: Spectral Analysis and Time Series. Academic Press, London (1981)

16. Kunt, M.: Traitement Numerique des Signaux. Presses Polytechniques Romandes, Lausanne (1984)

