    DIFFERENTIATING INTERNET APPLICATIONS USING PRINCIPAL COMPONENT ANALYSIS

    Roberto Nogueira*, Antonio Nogueira**, Paulo Salvador**, Rui Valadas**

    *Portugal Telecom Inovacao, Aveiro, Portugal; e-mail: [email protected]

    **University of Aveiro/Instituto de Telecomunicacoes, Aveiro, Portugal; e-mail: {nogueira, salvador, rv}@ua.pt

    ABSTRACT

    The number and variety of IP applications have increased tremendously in the last few years. In addition, the Internet applications used by end users are changing with the spread of high-performance PCs connected through broadband links. An accurate mapping of traffic to applications is important for a wide range of network management tasks, such as traffic engineering, service differentiation, performance/failure monitoring and security. Since traditional mapping approaches have become increasingly inaccurate, this paper presents a new approach, based on Principal Component Analysis, that is able to identify differentiating characteristics of different Internet applications, including several P2P file sharing protocols. The accuracy of the proposed approach was evaluated through a set of intensive tests, and the results show that it constitutes a valuable tool for identifying peculiar characteristics of Internet applications while being, at the same time, immune to the most important disadvantages of other identification methods. We believe this methodology can form the basis for the development of an efficient application identification tool.

    Keywords: flow identification, peer-to-peer, principal component analysis.

    1 INTRODUCTION

    The introduction of Peer-to-Peer (P2P) file sharing applications triggered a paradigm shift in Internet data exchange. Since the emergence of Napster, the first popular P2P application, a number of new P2P-based multimedia file sharing systems have been developed (FastTrack, eDonkey, Gnutella, Direct Connect, etc.). The traffic generated by these applications consumes the major portion of the bandwidth in campus networks, largely overtaking the traffic share of the WWW [1, 2]. However, P2P applications can harm network traffic generated by businesses, governments, education and the Internet infrastructure itself, preventing mission-critical applications from accessing the network. These applications can also introduce serious security vulnerabilities into systems and networks, since hackers can exploit them to access and attack campus networks. Finally, P2P applications also pose serious legal issues, as users can download copyrighted material, thus placing access providers in a difficult legal situation. In short, these applications create logistic, security and legal troubles for network administrators on high-speed networks.

    The ability to accurately identify Internet applications, and particularly P2P ones, can be crucial for several network management and measurement tasks, including traffic engineering, service differentiation, performance/failure monitoring and security. Once applications are correctly identified, the network manager can take the most appropriate action regarding each application that is running in a particular scenario.

    The identification of IP applications has traditionally been based on different techniques, each having its own advantages but also important drawbacks that limit or discourage its use in certain identification scenarios: (i) port-based analysis has some obvious limitations, since most applications allow users to change the default port numbers by manually selecting whatever port(s) they like; many newer applications are more inclined to use random ports, thus making ports unpredictable, and there is also a trend for applications to masquerade their function ports within well-known application ports;

    Ubiquitous Computing and Communication Journal 1


    (ii) protocol analysis is ineffective since IP applications are continuously evolving and therefore their signatures can change; application developers can encrypt traffic, making protocol analysis more difficult; signature-based identification can affect network stability because it has to read and process all network traffic and, finally, protocol analysis is not able to deal with confidentiality requirements; (iii) syntactic and semantic analysis of the data flows can be a burden to network stability due to its high processing requirements and is not appropriate when dealing with confidentiality requirements because, in these situations, it is not possible to have access to the packet contents. Section 2 briefly describes the most important related work on this subject, pointing out the main advantages and disadvantages of the different proposed methodologies.

    This paper proposes an approach, based on Principal Component Analysis (PCA), to identify the most differentiating characteristics of Internet applications. PCA involves a mathematical procedure that transforms a number of (possibly) correlated variables into a (smaller) number of uncorrelated variables called principal components, thus reducing the dimensionality of the data set by ignoring the dimensions that contribute less to the data variability. Several applications are used to test the proposed methodology: some of the most popular P2P applications (Gnutella, BitTorrent and eMule), Skype, YouTube, web-based file sharing and Internet browsing. The results obtained show that the proposed methodology can achieve very good results, has low computational requirements and, when used in an application identification framework, makes it possible to avoid some of the most important drawbacks of existing identification approaches.

    This paper is an extended version of the work published in [3]. It now includes an extended related work section, a description of the different modules of an integrated identification framework that will be based on the PCA identification methodology, and a more detailed explanation of the different steps of the proposed PCA identification methodology, with more graphical information to support the explanation of the proposed approach.

    The rest of the paper is organized as follows: section 2 describes some related work on methodologies for the identification of IP applications; section 3 presents an overview of the application identification framework we are planning to propose, where the PCA-based identification module is one of the main building blocks; section 4 gives an overview of the applications that were selected for this study and the measurements that were made; section 5 gives a brief review of the most important topics on PCA; section 6 describes the methodology that is proposed to identify the differentiating characteristics of Internet applications and discusses the main results obtained and, finally, section 7 presents the main conclusions and some topics for future research.

    2 RELATED WORK

    Traditionally, the identification of IP applications has been based on different techniques. Port-based identification was first suggested by [4, 5] and is the most basic and straightforward method to detect applications and users based on network traffic. It is based on the simple concept that many applications have default ports on which they function. When these applications are run, they use these ports to communicate with the outside. To perform port-based analysis, administrators just need to observe network traffic and check whether there are connection records using these ports. If a match is found, it may indicate a particular application activity. Port matching is very simple in practice, but its limitations are obvious. Most applications allow users to change the default port numbers by manually selecting whatever port(s) they like. Additionally, many newer applications, like WinMX [6] and Winny [7], are more inclined to use random ports, thus making ports unpredictable. Besides, since the closure of Napster, more and more P2P applications have begun to masquerade their function ports within well-known application ports [2, 8, 4, 9, 10]. In [11], the authors have shown that this technique achieves an accuracy no better than 50 to 70% using the official IANA list.
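As a rough illustration of the port matching described above, the following Python sketch looks up the source and destination ports of a connection in a small table. The table entries are commonly cited default ports chosen here for illustration only, and the function name `classify_by_port` is hypothetical; none of this comes from the cited works.

```python
# Hypothetical port table for illustration; a real tool would load a full
# registry (for example the IANA list) instead of this hand-written subset.
DEFAULT_PORTS = {
    21: "FTP",
    25: "SMTP",
    80: "HTTP",
    443: "HTTPS",
    4662: "eMule",
    6346: "Gnutella",
    6881: "BitTorrent",
}


def classify_by_port(src_port, dst_port):
    """Return the application mapped to either port, or 'unknown'."""
    for port in (dst_port, src_port):
        app = DEFAULT_PORTS.get(port)
        if app is not None:
            return app
    return "unknown"


print(classify_by_port(51234, 6881))   # default BitTorrent port
print(classify_by_port(51234, 80))     # labelled HTTP whatever actually runs on port 80
print(classify_by_port(51234, 49152))  # random port: no match
```

The second call shows the weakness discussed in the text: a P2P client that masquerades on port 80 would be labelled as HTTP, and a client on a random port is not recognised at all.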

    Another identification approach is payload or protocol analysis: in this case, traffic is monitored and the data payload of the packets is inspected according to some previously defined application signatures [12, 13, 14, 8, 11, 15, 9]. This traffic identification method is widely applied in Intrusion Detection Systems (IDS) to manage traffic [16, 17]. Application-layer analysis of packet contents is also employed by some commercial bandwidth management tools [18, 19]. This approach has been shown to work very well for Internet traffic including P2P applications. However, this technique also has some drawbacks: first, payload analysis poses privacy and security concerns; second, the technique typically requires increased processing and storage capacity [20, 5, 21, 22]; third, it is unable to cope with encrypted transmissions and, finally, this approach only identifies traffic for which signatures are available and is unable to classify previously unknown traffic.


    Syntactic and semantic analysis of the data flow avoids some of the disadvantages of port-based analysis and protocol analysis. This approach can perform protocol recognition regardless of any encapsulation and is able to extract data specific to each protocol, involving stateful reconstruction of session and application information from the packet content [23]. This technique provides very accurate and reliable application identification, but imposes significant complexity and processing load on the traffic identification device. It must be kept up-to-date with extensive knowledge of application semantics and network-level syntax, and must be powerful enough to perform concurrent analysis of a potentially large number of flows.

    Karagiannis et al. [8] proposed a new algorithm based on the behavioral characteristics of the transport layer: using only a small amount of information from transport layer packets, this method can accurately identify 99% of the P2P traffic, but the algorithm can only be used offline. In [24] a new identification method is proposed, relying on patterns of host behavior at the transport layer. Instead of studying TCP (or UDP) flows individually, this scheme pays attention to all flows generated by specific hosts and can accurately associate each host with the services it provides or uses (application server, web client, etc.). However, this method has to gather information from several flows of each host before it can decide on the host role, which makes it very time-consuming.

    The diminished effectiveness of the aforementioned techniques motivated the use of flow statistics for classifying network traffic. There are at least three reasons why this approach is recommended: first, different applications manifest dissimilar behaviors and thus exhibit different flow statistics (for instance, a large file transfer using FTP would have a higher average packet size and a smaller mean packet interarrival time than an instant messaging client sending short, occasional messages to other clients); second, although obfuscation of flow statistics is also possible, it is generally much harder to implement; third, classification based on flow statistics can benefit from the large body of work on scalable flow sampling/estimation techniques [25, 26, 27, 28, 29].

    Several methods have been proposed to classify traffic based on summarized flow information such as duration, number of packets and mean inter-arrival time [30, 31], but they are all offline algorithms and are not sufficiently mature. In [32], the authors use some fundamental characteristics of P2P protocols to identify P2P applications, such as the huge network diameter and the presence of many hosts acting as both servers and clients. This method uses only the transport layer header of every packet and can identify unknown P2P protocols. However, it is also time-consuming. In [33, 34] the authors proposed a technique that relies on the observation of the first five packets of a TCP connection to identify the application. In [35] an algorithm is proposed to identify P2P traffic based on machine learning techniques: by investigating the ratio between the upload and download traffic volume of several P2P applications, a characteristic library is constructed; then, the unknown network traffic can be recognized online using this library. In references [36, 37] the authors also use machine learning techniques for traffic classification, and [38] proposes a traffic classifier using supervised machine learning based on a Bayesian trained neural network. In reference [39] a back propagation neural network model is used to distinguish between P2P and non-P2P applications; finally, in [40] neural networks were successfully used to identify several Internet applications, although none of them was of the P2P type.

    3 IDENTIFICATION FRAMEWORK BASED ON PCA

    The proposed identification methodology will constitute the basis for an integrated identification framework, whose functioning principles are depicted in Figure 1. The central element of the identification tool is the PCA-based identification methodology: first, the predominance areas corresponding to each IP application are calculated using a set of known traffic values associated with each application; after this pre-identification phase, the PCA methodology can be used to identify IP applications based on new traffic values that are presented as inputs. Obviously, the pre-calculation of the predominance areas relies on a pre-classification of the various IP applications that is based on offline measurements that were previously made and stored. The pre-classification can rely on conventional application mapping approaches or can derive from known traffic generated and measured in a controlled environment. So, although this training phase can be computationally demanding, it is an offline phase.

    The online classification phase relies on online measurements that are continuously made on the network infrastructure and are also stored to become essential historic data for further refinements of the predominance areas pre-calculation phase (training phase). The result of the online classification phase, that is, the correct identification of the different applications, must finally be validated. This process involves human intervention and must also take into account the pre-classification of the different applications. The cycle that constitutes this framework is a continuously evolving structure, so the different blocks of the flowchart are continuously updated in such a way that they reflect the best possible classification of the different Internet applications.

    [Figure 1 block labels: PCA-based pre-calculation of the predominance areas for each IP application; Offline measurements; Online measurements; Pre-identification; PCA-based identification; Application identification; Validation.]

    Figure 1: PCA-based integrated framework for identifying IP applications.

    4 SELECTED APPLICATIONS AND MEASURED TRAFFIC TRACES

    In order to cover the most popular P2P protocols, the following applications were selected for this study: Shareaza 2.2.5, which was used to connect to the Gnutella (version 1) network (although it allows connection to several P2P networks), eMule 0.48A to connect to the eMule network and BitTorrent 5.0.5 to connect to the BitTorrent network. Besides these P2P applications, we have also included other applications that are important in terms of their contribution to current Internet usage and to the amount of exchanged traffic: Skype 3.5.0, which enables connection to the Skype network, YouTube as an example of a centralized file sharing application, web-based file sharing and web browsing.

    Skype is the most popular Voice over IP (VoIP) and instant messaging application. The Skype protocol defines the direct exchange of packets between peers: whenever direct exchange is not possible, Skype relies on routing mechanisms that use other peers of the Skype network. The decentralized Skype infrastructure makes it scalable without implying additional costs. YouTube is a centralized file sharing service that enables the upload and download of video files to/from network servers. Although this application is not P2P, it is included in this study due to its current popularity/importance. Web-based file sharing and web browsing are well known and very common Internet applications.

    Our study uses data traces that were measured from March to July 2007 on a 4 Mbps ADSL access link. In order to keep the same hardware and software configurations and the same application settings, the following parameterizations were adopted: (i) maximum download rate of 96 Kbps; (ii) maximum upload rate of 10 Kbps and (iii) maximum number of connections equal to 100. In order to compare the behavior of similar applications, the same configurations, capture durations, queries and transferred files were considered for all measurement sessions. Each measurement session registered the packet headers of all packets flowing in both directions (upload/download); no packet drops were reported. The traffic analyzer was an 800 MHz Intel Celeron laptop with 256 Mbytes of SDRAM, running WinDump.

    As an illustrative example, the bitTorrent session that was planned in order to capture this type of traffic consisted of the following sequence of actions:

    1. start of the traffic capture (windump -i 2 -C 700 -w bittorrentCapture.dmp);

    2. connection to the mininova.org website:

       (a) Search for Vista Transformation Pack;

       (b) Order the transfer of the Vista Transformation Pack 6.0.exe.torrent (9.8KB) torrent;

    3. start the bitTorrent application;

    4. wait for approximately 2 minutes such that the host connected to the bitTorrent network can stabilize (flow of the application initialization messages);

    5. start the transfer of the Vista Transformation Pack 6.0.exe (30.27MB) file, using the torrent;

    6. two minutes later, restore the connection to the mininova.org website:

       (a) Search for Skype;

       (b) Order the transfer of the Skype V.3.1.0.152 Final.exe.torrent (8.4KB) torrent;

    7. start the transfer of the Skype V. 3.1.0.152 Final.exe (19.97MB) file, using the torrent;

    8. close the application, after 10 minutes of inactivity;

    9. open the application; wait for approximately 2 minutes such that the host connected to the bitTorrent network can stabilize;

    10. restore the connection to the mininova.org website:

       (a) Search for AVG;

       (b) Order the transfer of the AVG Professional Internet Security Suite.rar.torrent (10.6KB) torrent;

    11. start the transfer of the AVG Professional Internet Security Suite.rar (39.23MB) file, using the torrent;

    12. after a five-minute period, restore the connection to the mininova.org website:

       (a) Search for Ubuntu;

       (b) Order the transfer of the Ubuntu-6.10-Desktop-i386.iso.torrent (27.5KB) torrent;

    13. start the transfer of the Ubuntu-6.10-Desktop-i386.iso (698.36MB) file, using the torrent;

    14. close the application, after one hour of activity;

    15. end of the capture.

    Other utilization sessions were also defined for the measurement scenarios corresponding to the remaining selected applications.

    5 PRINCIPAL COMPONENT ANALYSIS

    Principal component analysis involves a mathematical procedure that transforms a number of (possibly) correlated variables into a (smaller) number of uncorrelated variables called principal components. The first principal component accounts for as much of the variability in the data as possible, and each succeeding component accounts for as much of the remaining variability as possible.

    Given the random variables $X_1, X_2, \ldots, X_p$, the k-th principal component (PC k) is defined as the linear combination

    $$Z_k = \alpha_{k1} X_1 + \alpha_{k2} X_2 + \ldots + \alpha_{kp} X_p \qquad (1)$$

    such that the loadings of $Z_k$, $\alpha_k = (\alpha_{k1}, \alpha_{k2}, \ldots, \alpha_{kp})^t$, have unitary Euclidean norm, $Z_k$ has maximum variance and PC k, $k \geq 2$, is uncorrelated with the previous PCs, which in fact means that $\alpha_k^t \alpha_j = 0$, $j = 1, \ldots, k-1$, and $\alpha_k^t \alpha_k = 1$. Thus, the first principal component is the linear combination of the observed variables with maximum variance. The second principal component satisfies a similar optimality criterion and is uncorrelated with PC 1, and so on. As a result, the principal components are indexed by decreasing variance, i.e., $\lambda_1 \geq \lambda_2 \geq \ldots \geq \lambda_p$, where $\lambda_r$ denotes the variance of PC r and p is the maximum number of PCs ($n > p$, with n the number of observations).

    It can be proved [41] that the vector of loadings of the k-th principal component, $\alpha_k$, is the eigenvector associated with the k-th highest eigenvalue, $\lambda_k$, of the covariance matrix of the observed variables. Therefore, the k-th highest eigenvalue of the covariance matrix is the variance of PC k, i.e. $\lambda_k = \mathrm{Var}(Z_k)$.

    The proportion of the total variance explained by the first r principal components is

    $$\frac{\lambda_1 + \ldots + \lambda_r}{\lambda_1 + \ldots + \lambda_p}. \qquad (2)$$

    If this proportion is close to one, then there is almost as much information in the first r principal components as in the original p variables. In practice, the number r of considered principal components should be chosen as small as possible, taking into account that the proportion of the explained variance, equation (2), should be large enough.

    Once the loadings of the principal components are obtained, the score of object i on PC j is given by

    $$z_{ij} = \alpha_{j1} x_{i1} + \alpha_{j2} x_{i2} + \ldots + \alpha_{jp} x_{ip} \qquad (3)$$

    where $x_i = (x_{i1}, \ldots, x_{ip})^t$ is the data corresponding to object i.
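The definitions of this section can be sketched numerically with NumPy: the loadings are taken as the eigenvectors of the sample covariance matrix, the eigenvalues are sorted in decreasing order, the proportion of explained variance follows equation (2) and the scores follow equation (3). This is a minimal sketch on synthetic data, not the implementation used in the study; the function names `pca` and `explained_proportion` and the generated data are assumptions made for illustration.

```python
import numpy as np


def pca(X):
    """Return (eigenvalues, loadings, scores) for an n x p data matrix X."""
    X = np.asarray(X, dtype=float)
    cov = np.cov(X, rowvar=False)            # p x p sample covariance matrix
    eigvals, eigvecs = np.linalg.eigh(cov)   # eigh returns eigenvalues in ascending order
    order = np.argsort(eigvals)[::-1]        # re-index by decreasing variance
    eigvals = eigvals[order]
    loadings = eigvecs[:, order]             # column k-1 holds the loadings of PC k
    scores = X @ loadings                    # equation (3), one row per object
    return eigvals, loadings, scores


def explained_proportion(eigvals, r):
    """Equation (2): share of the total variance carried by the first r PCs."""
    return eigvals[:r].sum() / eigvals.sum()


# Synthetic data for illustration only: 200 objects, 3 strongly correlated variables.
rng = np.random.default_rng(0)
base = rng.normal(size=(200, 1))
X = np.hstack([base, 2.0 * base, -base]) + 0.1 * rng.normal(size=(200, 3))

eigvals, loadings, scores = pca(X)
print(eigvals)
print(explained_proportion(eigvals, 1))
```

Because the three synthetic variables are driven by one common factor, almost all of the variance is carried by the first principal component, so a single component would be enough here.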

    6 IDENTIFICATION METHODOLOGY AND RESULTS

    This section describes in detail the proposed identification methodology and the main results obtained from its application to the measurement traces presented in section 4.

    Each capture file was processed using the TSTAT application [42], which correlates forward and backward packet streams in order to obtain detailed statistical information about each packet flow. This information is used by the proposed identification methodology to identify the characteristic traffic patterns associated with each IP application. Table 1 presents the most relevant upload and download statistics that were output by TSTAT for each one of the measured traces: in fact, the table only presents a small set of all the parameters that are output


    Table 1: Some of the upload and download statistics outputted by TSTAT.

    Application  #Flows  Packets  RST  ACK     Pure ACK  KBytes      Data packets  KBytes (w/ retrans)  rexmit packets
    Shareaza     1075    58893    65   56077   53225     145,072     2761          157,292              1847
    eMule        998     116631   16   115376  21930     107780,784  92451         108433,256           1394
    bitTorrent   1166    205082   174  203373  85392     124903,684  117242        126209,951           2675
    HTTP         697     79923    281  79192   76818     1120,075    1723          1126,331             42
    Browsing     824     11358    317  10450   7364      1225,208    2351          1296,95              143
    Skype        51      318      3    226     59        4,491       150           4,499                46
    youTube      278     40670    170  40383   38206     1369,82     1912          1401,508             61

    Application  rexmit KBytes  SYN count  FIN count  SACK sent  rtx RTO  rtx FR  reordering  unknown  unnece rtx RTO
    Shareaza     13,965         2815       27         5463       1824     14      0           21       0
    eMule        652,833        1251       1002       1918       1146     21      1694        1563     128
    bitTorrent   1306,845       1679       646        4925       1503     117     3735        2159     861
    HTTP         6,29           730        371        383        42       0       0           0        0
    Browsing     71,826         908        418        249        133      1       0           9        0
    Skype        0,052          92         14         7          43       0       0           3        0
    youTube      31,698         287        95         1153       33       0       0           28       0

    Application  #Flows  Packets  RST  ACK     Pure ACK  KBytes      Data packets  KBytes (w/ retrans)  rexmit packets
    Shareaza     1075    93273    17   93269   2068      115106,694  91029         115244,894           257
    eMule        998     91451    21   91446   55717     34366,424   33826         35017,843            954
    bitTorrent   1166    216601   270  216574  80998     129338,573  133337        131278,927           3164
    HTTP         697     151435   8    151430  1315      205121,848  149113        205184,553           65
    Browsing     824     14133    22   14126   1443      11368,405   11538         11377,214            39
    Skype        51      343      0    343     138       37,696      184           37,696               1
    youTube      278     72746    7    72740   445       96745,717   71934         97004,513            194

    Application  rexmit KBytes  SYN count  FIN count  SACK sent  rtx RTO  rtx FR  reordering  unknown  unnece rtx RTO
    Shareaza     138,212        115        54         4          86       4       858         440      99
    eMule        651,495        959        935        4087       660      15      187         978      224
    bitTorrent   1940,692       1138       931        6717       2416     213     1478        2616     514
    HTTP         62,717         674        573        0          55       32      176         34       6
    Browsing     8,838          787        577        1          68       32      136         44       1
    Skype        0,001          11         11         0          1        0       9           1        0
    youTube      258,8          272        163        0          30       30      776         123      4

    SERVER

    CLIENT

    Table 2: Average session durations per application.

                         Shareaza  eMule  bitTorrent  HTTP   Browsing  Skype  youTube
    Mean duration (sec)  31.83     79.25  108.84      24.65  28.25     70.89  67.11


    by TSTAT, namely those that were chosen for their relevance to the purposes of the current study; note that the meaning of each column can be found in the Appendix. Table 2 presents the average session durations per application. As can be seen, file sharing applications are predominant in terms of generated traffic. The upload traffic corresponding to Shareaza is slightly lower than the upload traffic of other P2P applications (and HTTP file sharing), since Gnutella only shares complete files. Regarding download, eMule is the file sharing application with the worst performance, while Shareaza and web-based file sharing are the most efficient applications.

    Based on the collected statistics, a series of bidimensional plots were made representing the relationships between the parameters output by TSTAT (each plot relates to a pair of parameters). As an example, Figure 2 presents different plots, one per application, of the number of downloaded packets as a function of the flow duration, while Figure 3 presents the plots corresponding to the number of uploaded bytes as a function of the number of valid round trip times for each application. This kind of plot provides a qualitative overview of the relative importance of each parameter to the traffic identification objective. Looking at Figure 2, we can see that for all applications there are basically two kinds of flows: flows related to TCP connection establishment and termination phases and flows related to data transfer. For each one of these types there is a roughly linear growth in the number of packets as a function of the flow duration. A quite clear linear growth in the number of uploaded bytes as a function of the number of valid round trip time values can also be observed in Figure 3.

    The final set of parameters selected for traffic characterization are the ones that present the most significant (graphical) differences between protocols. For the conducted study, the following parameters were identified as good candidates for this set: flow duration, total number of packets, total number of ACK messages, total number of payload bytes, total number of SYN messages, valid round trip time and average round trip time.

    Having agreed on a relevant set of parameters (that can be somewhat extensive) selected from all parameters output by TSTAT, the next step is to apply PCA in order to obtain a


    [Figure 2 plots: one panel per application; x-axis: completion time [sec]; y-axis: download packets [number].]

    Figure 2: Number of downloaded packets versus completion time for (from left to right and top to bottom): bitTorrent, Browsing, eMule, HTTP, Shareaza, Skype and youTube.

    quantitative feedback on the most important parameters that have to be considered for identification. Principal components are calculated for each possible combination of parameters (taken from the selected set), allowing for the identification of the combinations whose first two principal components account for more than a certain threshold (empirically set to 75%) of the data variability. In this way, ignoring the other principal components does not lead to a significant loss of relevant information. The combination with the largest number of parameters that was able to fulfil the imposed requirement, for both upload and download, is shown in Table 3.
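The search over parameter combinations can be sketched as follows. This is only an illustration under stated assumptions: it handles a single data matrix (the study applies the requirement per application and for both upload and download), it uses the covariance matrix of the raw parameters as in section 5, it skips combinations of fewer than three parameters (where two components trivially explain everything), and the function names, parameter labels and synthetic data are hypothetical.

```python
from itertools import combinations

import numpy as np


def first_two_pc_share(X):
    """Proportion of the total variance explained by the first two PCs of X."""
    eigvals = np.linalg.eigvalsh(np.cov(X, rowvar=False))
    eigvals = np.sort(eigvals)[::-1]
    return eigvals[:2].sum() / eigvals.sum()


def largest_valid_combination(data, names, threshold=0.75, min_size=3):
    """Return (share, parameter names) for the largest combination whose first
    two PCs explain more than `threshold` of the variance, or None."""
    for size in range(len(names), min_size - 1, -1):   # try the largest sets first
        valid = []
        for idx in combinations(range(len(names)), size):
            share = first_two_pc_share(data[:, list(idx)])
            if share > threshold:
                valid.append((share, [names[i] for i in idx]))
        if valid:
            return max(valid)                          # best share among equal-sized sets
    return None


# Synthetic flows for illustration: five parameters driven by two hidden factors.
names = ["duration", "packets", "ack", "payload_bytes", "syn"]
rng = np.random.default_rng(1)
latent = rng.normal(size=(200, 2))
mix = np.array([[1.0, 0.8, 0.5, 0.2, 0.9],
                [0.1, 0.6, 1.0, 0.9, 0.3]])
data = latent @ mix + 0.05 * rng.normal(size=(200, 5))

share, combo = largest_valid_combination(data, names)
print(combo, round(float(share), 4))
```

Searching from the largest combination downwards mirrors the selection criterion in the text, which keeps the combination with the most parameters among those that meet the threshold.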

    The next step of the identification methodology is the bidimensional identification of characteristic traffic patterns. In order to accomplish this, the following steps are executed:

    1. calculation of the minimum and maximum values of the first two principal components, considering all applications (Figure 4);

    2. division of the bidimensional space into a pre-defined number of rectangular areas (this is an input parameter of the algorithm; integer values between 2 and 5 were considered, because a number of areas higher than 5 leads to very small areas that obviously contain too few points);

    3. for each area, verification of whether the number of points corresponding to a certain application (the one that is being analyzed) is significantly higher than the number of points corresponding to the other applications. This operation is performed for each elementary area:

       (a) areas where the number of points corresponding to the analyzed application is not significant (lower than a pre-established threshold, which was taken as 1.5%) are automatically discarded;

       (b) areas where the number of points corresponding to the analyzed application is significant but at least one of the other applications has a non-negligible number of points (higher than another pre-established threshold, which was taken as 1%) are further subdivided into a number of areas equal to the pre-defined number of areas, and step 3 is repeated.
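The steps above can be sketched in Python as a recursive subdivision of the plane spanned by the first two principal components. Several details are assumptions made for this sketch and are not stated in the text: the space is split into k divisions per axis, both thresholds are read as fractions of each application's total number of points, points on a shared edge are counted in both neighbouring areas, and a `max_depth` limit is added so that the recursion always stops. The function names and the sample points are hypothetical.

```python
def edges(lo, hi, k):
    """k + 1 evenly spaced edges from lo to hi; the last edge is exactly hi."""
    step = (hi - lo) / k
    return [lo + i * step for i in range(k)] + [hi]


def predominance_areas(points, target, k=3, min_share=0.015,
                       other_share=0.01, max_depth=4):
    """points: list of (pc1, pc2, app). Returns rectangles (x0, x1, y0, y1)
    where `target` is significant and every other application is negligible."""
    totals = {}
    for _, _, app in points:
        totals[app] = totals.get(app, 0) + 1
    if target not in totals:
        return []
    found = []

    def examine(cell, depth):
        x0, x1, y0, y1 = cell
        counts = {}
        for x, y, app in points:
            if x0 <= x <= x1 and y0 <= y <= y1:
                counts[app] = counts.get(app, 0) + 1
        if counts.get(target, 0) / totals[target] < min_share:
            return                                   # step 3(a): discard the area
        contested = any(n / totals[app] > other_share
                        for app, n in counts.items() if app != target)
        if not contested:
            found.append(cell)                       # area dominated by the target
        elif depth < max_depth:
            split(cell, depth + 1)                   # step 3(b): subdivide and repeat

    def split(box, depth):
        xe, ye = edges(box[0], box[1], k), edges(box[2], box[3], k)
        for i in range(k):
            for j in range(k):
                examine((xe[i], xe[i + 1], ye[j], ye[j + 1]), depth)

    xs = [p[0] for p in points]
    ys = [p[1] for p in points]
    split((min(xs), max(xs), min(ys), max(ys)), 0)   # steps 1 and 2
    return found


# Made-up scores: two well separated clusters in the PC1/PC2 plane.
points = [(0.1 * i, 0.1 * j, "skype") for i in range(5) for j in range(5)]
points += [(9 + 0.1 * i, 9 + 0.1 * j, "web") for i in range(5) for j in range(5)]

print(predominance_areas(points, "skype"))
```

With these well separated sample clusters a single area per application is returned at the first level; overlapping clusters would trigger the subdivision of step 3(b) until the areas are clean or the depth limit is reached.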


    Table 3: Combination with the largest number of parameters that was able to fullfil the imposed re-quirement.

    Components: Completion time, rtt count, data bytes, ACK sent, packets

    Application   Upload    Download
    bitTorrent    93.04%    96.57%
    browsing      75.18%    96.94%
    eMule         99.29%    90.44%
    HTTP          97.07%    99.99%
    Shareaza      91.78%    99.59%
    Skype         99.77%    99.93%
    youTube       89.86%    99.99%

    Figure 4: Calculation of the minimum and maximum values, for all applications.

    7 CONCLUSIONS AND FURTHER RESEARCH

    As the number and diversity of IP applications increase, it becomes more and more important to accurately map Internet traffic to its corresponding applications. Network management and measurement tasks like traffic engineering, service differentiation, performance/failure monitoring, and security can greatly benefit from this mapping ability. Since traditional mapping approaches have important limitations when applied to some specific identification scenarios, this paper proposed a methodology, based on Principal Component Analysis, to identify the differentiating characteristics of Internet applications. The results obtained by applying the proposed methodology to several IP applications, including P2P applications, show that this method can be efficiently used to identify characteristic traffic patterns of IP applications and can constitute the basis for an efficient traffic identification tool.

    APPENDIX

    The columns of Table 1 have the following meaning: (i) #Flows - number of identified flows; (ii) packets - number of packets sent; (iii) RST - number of sessions where an RST message was sent; (iv) ACK - total number of ACK messages sent; (v) Pure ACK - total number of ACK messages sent without payload; (vi) KBytes - total number of KBytes sent in message payloads; (vii) Data packets - total number of messages sent with payload; (viii) KBytes (w/ retrans) - total number of KBytes sent in message payloads, including retransmissions; (ix and x) rexmit packets/KBytes - total number of messages/KBytes retransmitted; (xi and xii) SYN/FYN count - total number of SYN/FIN messages sent; (xiii) SACK sent - total number of SACK messages sent; (xiv and xv) rtx RTO/FR - total number of messages retransmitted due to Timeout/Fast Retransmit; (xvi) reordering - total number of messages for sequence reordering; (xvii) unknown - total number of messages out of sequence or duplicated, without classification; (xviii) unnece rtx RTO - total number of messages unnecessarily transmitted due to timeout.

    ACKNOWLEDGEMENTS

    This work was done under the scope of the Euro-FGI and Euro-NF Networks of Excellence, funded by the European Union.


    Figure 5: Algorithm for the bidimensional identification of characteristic traffic patterns.

    Table 4: Total number of points contained in the defined areas for download traffic.

    References

    [1] K. Gummadi, R. Dunn, S. Saroiu, S. Gribble, H. Levy, and J. Zahorjan, "Measurement, modeling, and analysis of a peer-to-peer file-sharing workload," in Proceedings of the 19th ACM Symposium on Operating Systems Principles 2003, 2003.

    [2] T. Karagiannis, A. Broido, N. Brownlee, K. C. Claffy, and M. Faloutsos, "Is P2P dying or just hiding?," in IEEE Global Telecommunications Conference 2004, 2004.

    [3] R. Nogueira, A. Nogueira, P. Salvador, and R. Valadas, "Identifying differentiating characteristics of internet applications using principal component analysis," in Proceedings of the 6th Symposium on Communication Systems, Networks and Digital Signal Processing (CSNDSP08), 2008.

    [4] S. Sen and J. Wang, "Analyzing peer-to-peer traffic across large networks," IEEE/ACM Transactions on Networking, vol. 12, no. 2, pp. 219-232, 2004.

    [5] D. Moore, K. Keys, R. Koga, E. Lagache, and K. C. Claffy, "The coralreef software suite as a tool for system and network administrators," in Proceedings of the 15th USENIX conference on Systems Administration (LISA01), 2001, pp. 133-144.

    [6] http://www.winmx.com, Winmx.

    [7] http://www.nynode.info, Winny.

    [8] T. Karagiannis, A. Broido, M. Faloutsos, and K. Claffy, "Transport layer identification of P2P traffic," in Proceedings of the ACM SIGCOMM Internet Measurement Conference, 2004, pp. 121-134.

    Table 5: Results obtained for the training and the testing sets.

    Components: Completion time, rtt count, data bytes, ACK sent, packets

    Application   Upload    Download
    bitTorrent    98.10%    96.30%
    browsing      96.90%    98.44%
    eMule         93.36%    90.27%
    HTTP          99.99%    99.99%
    Shareaza      95.45%    99.61%
    Skype         97.52%    100.00%
    youTube       99.99%    99.99%

    Figure 6: PCA for bitTorrent traffic: (top left) upload training set; (top right) upload testing set; (bottom left) download training set; (bottom right) download testing set. Each panel plots the 2nd Principal Component against the 1st Principal Component.

    [9] S. Sen, O. Spatscheck, and D. Wang, "Accurate, scalable in-network identification of p2p traffic using application signatures," in Proceedings of the WWW Conference, 2004.

    [10] A. Madhukar and C. Williamson, "A longitudinal study of p2p traffic classification," in Proceedings of the MASCOTS Conference, 2006.

    [11] A. W. Moore and D. Papagiannaki, "Toward the accurate identification of network applications," in Proceedings of the 6th Passive Active Measurements Workshop, 2005, vol. 3431, p. 41.

    [12] T. Choi, C. Kim, S. Yoon, J. Park, H. Kim, H. Chung, and T. Jesong, "Content-aware internet application traffic measurement and analysis," in Proceedings of the IEEE/IFIP NOMS Conference, 2004.

    [13] M. Roesch, "Snort: Lightweight intrusion detection for networks," in Proceedings of the 13th USENIX Conference on Systems Administration (LISA99), 1999, pp. 229-238.

    [14] V. Paxson, "Bro: a system for detecting network intruders in real-time," Computer Networks, no. 31.

    [15] P. Haffner, S. Sen, O. Spatscheck, and D. Wang, "Acas: Automated construction of application signatures," in Proceedings of the SIGCOMM05 Workshops, 2005.

    [16] http://www.snort.org, Snort.

    [17] P. Barford, J. Kline, D. Plonka, and A. Ron, "A signal analysis of network traffic anomalies," in Proceedings of the ACM IMW Conference, 2002.

    Figure 7: PCA for youTube traffic: (top left) upload training set; (top right) upload testing set; (bottom left) download training set; (bottom right) download testing set. Each panel plots the 2nd Principal Component against the 1st Principal Component.

    [18] http://www.cachelogic.com, Cache logic.

    [19] http://www.packeteer.com, Packeteer.

    [20] S. Dharmapurikar, P. Krishnamurthy, T.S. Sproull, and J.W. Lockwood, "Deep packet inspection using parallel bloom filters," IEEE Micro, vol. 24, no. 1.

    [21] T. Kocak and I. Kaya, "Low-power bloom filter architecture for deep packet inspection," IEEE Communications Letters, vol. 10, no. 3.

    [22] A. Broder and M. Mitzenmacher, "Network applications of bloom filters: a survey," Internet Mathematics, vol. 1, no. 4.

    [23] Cisco IOS Documentation, "Network-based application recognition and distributed network-based application recognition," 2006.

    [24] T. Karagiannis, K. Papagiannaki, and M. Faloutsos, "BLINC: multilevel traffic classification in the dark," in Proceedings of the Conference on Applications, Technologies, Architectures, and Protocols for Computer Communications, 2005.

    [25] http://www.cisco.com, Cisco netflow.

    [26] N. Duffield, C. Lund, and M. Thorup, "Properties and prediction of flow statistics from sampled packet streams," in Proceedings of the IMW Conference, 2002.

    [27] N. Duffield, C. Lund, and M. Thorup, "Flow sampling under hard resource constraints," in Proceedings of the SIGMETRICS Conference, 2004.

    [28] C. Estan, K. Keys, D. Moore, and G. Varghese, "Building a better netflow," in Proceedings of the SIGCOMM Conference, 2004.

    [29] R. Kompella and C. Estan, "The power of slicing in internet flow measurement," in Proceedings of the IMC Conference, 2005.

    [30] M. Roughan, S. Sen, O. Spatscheck, and N. Duffield, "Class-of-service mapping for QoS: A statistical signature-based approach to IP traffic classification," in Proceedings of the ACM SIGCOMM Internet Measurement Conference, 2004, pp. 135-148.

    [31] A. Moore and D. Zuev, "Internet traffic classification using bayesian analysis," in Proceedings of International Conference on Measurement and Modeling of Computer Systems, 2005, pp. 50-60.

    [32] F. Constantinou and P. Mavrommatis, "Identifying known and unknown peer-to-peer traffic," in Proceedings of Fifth IEEE International Symposium on Network Computing and Applications, 2006, pp. 93-102.


    [33] L. Bernaille, R. Teixeira, and I. Akodkenou, "Traffic classification on the fly," Computer Communication Review, vol. 36, no. 2, pp. 23-26, 2006.

    [34] L. Bernaille and R. Teixeira, "Early recognition of encrypted applications," in Proceedings of the 8th Passive and Active Measurement Conference (PAM 2007), 2007.

    [35] H. Liu, W. Feng, Y. Huang, and X. Li, "A peer-to-peer traffic identification method using machine learning," in Proceedings of the International Conference on Networking, Architecture and Storage, 2007.

    [36] Z. Li, R. Yuan, and X. Guan, "Accurate classification of the internet traffic based on the SVM method," in Proceedings of the 42nd IEEE International Conference on Communications (ICC 2007), 2007.

    [37] N. Williams, S. Zander, and G. Armitage, "A preliminary performance comparison of five machine learning algorithms for practical IP traffic flow classification," ACM SIGCOMM Computer Communication Review, vol. 36, no. 5.

    [38] T. Auld, A. W. Moore, and S. F. Gull, "Bayesian neural networks for internet traffic classification," IEEE Transactions on Neural Networks, vol. 18, no. 1, pp. 223-239, 2007.

    [39] F. Shen, C. Pan, and X. Ren, "Research of P2P traffic identification based on BP neural network," in Proceedings of the Third International Conference on Intelligent Information Hiding and Multimedia Signal Processing (IIH-MSP 2007), 2007.

    [40] A. Ali and R. Tervo, "Traffic identification using artificial neural network," in Canadian Conference on Electrical and Computer Engineering, vol. 1, pp. 667-672.

    [41] I. T. Jolliffe, Principal Component Analysis, Springer-Verlag, 1986.

    [42] http://tstat.tlc.polito.it/index.shtml, Tstat - tcp statistic and analysis tool.
