This item is the archived peer-reviewed author-version of: Imbalanced classification in sparse and large behaviour datasets. Reference: Vanhoeyveld, Jellis; Martens, David. Imbalanced classification in sparse and large behaviour datasets. Data Mining and Knowledge Discovery, ISSN 1384-5810, 32:1 (2018), p. 25-82. Full text (Publisher's DOI): https://doi.org/10.1007/S10618-017-0517-Y. To cite this reference: http://hdl.handle.net/10067/1444660151162165141. Institutional repository IRUA.

Author version. Manuscript No. (will be inserted by the editor)

Imbalanced classification in sparse and large behaviour datasets

Jellis Vanhoeyveld · David Martens

Received: date / Accepted: date

Abstract Recent years have witnessed a growing number of publications dealing with the imbalanced learning issue. While a plethora of techniques have been investigated on traditional low-dimensional data, little is known on the effect thereof on behaviour data. This kind of data reflects fine-grained behaviours of individuals or organisations and is characterized by sparseness and very large dimensions. In this article, we investigate the effects of several over- and undersampling, cost-sensitive learning and boosting techniques on the problem of learning from imbalanced behaviour data. Oversampling techniques show a good overall performance and do not seem to suffer from overfitting as traditional studies report. A variety of undersampling approaches are investigated as well and show the performance-degrading effect of instances showing odd behaviour. Furthermore, the boosting process indicates that the regularization parameter in the SVM formulation acts as a weakness indicator and that a combination of weak learners can often achieve better generalization than a single strong learner. Finally, the EasyEnsemble technique is presented as the method outperforming all others. By randomly sampling several balanced subsets, feeding them to a boosting process and subsequently combining their hypotheses, a classifier is obtained that achieves noise/outlier reduction effects and simultaneously explores the majority class space efficiently. Furthermore, the method is very fast, since it is parallelizable and each subset is only twice as large as the minority class size.

Keywords Imbalanced learning · Behaviour data · Over- and undersampling · Cost-sensitive learning · Support Vector Machine (SVM) · On-line repository

J. Vanhoeyveld
Department of Engineering Management, Prinsstraat 13, B-2000 Antwerp, Belgium
Tel.: 032654393
E-mail: jellis.vanhoeyveld@uantwerpen.be; vanhoeyveldjellis@gmail.com

D. Martens
Department of Engineering Management, Prinsstraat 13, B-2000 Antwerp, Belgium


1 Introduction

1.1 Literature overview

Learning from imbalanced data is an important concept that gained recent attention in the research community, showing its first publications around 1990 (He and Garcia 2009; Chawla et al 2002). The fundamental issue lies in the fact that many learners are designed with a lack of consideration for the underlying data distribution. Rule- and tree-based techniques and methods aiming to minimize overall training set error, such as support vector machines (SVMs) and neural networks, are often cited examples of inducers that suffer from this issue (Liu et al 2010; Chawla et al 2002; Mazurowski et al 2008; Akbani et al 2004; Tang et al 2009). The resulting classifier will emphasize the majority class instances at the expense of neglecting minority class examples, while the latter is usually the phenomenon of interest. Besides this issue of between-class imbalance, several other sources of data complexity are known to hinder learning: within-class imbalance, overlapping classes, outliers and noise (He and Garcia 2009; Sobhani et al 2015). In the following paragraphs, we will give an overview of the two classes of methods that are most widely used to cope with this issue (Liu et al 2009): sampling and cost-sensitive learning. A more thorough overview of the imbalanced learning issue, including a summary of less common active learning methods and kernel-based approaches, can be found in Chawla (2005); He and Garcia (2009).

Sampling methods work at a data level and attempt to provide a more balanced data distribution to the underlying base learner. These techniques generally consist of oversampling the minority class, undersampling the majority class, or a combination thereof. A simple oversampling technique will replicate minority class instances. More advanced techniques will introduce synthetic samples, such as SMOTE (Chawla et al 2002), Borderline-SMOTE (Han et al 2005) and ADASYN (He et al 2008). A detailed discussion of these methods is provided in Section 3.1. Cluster-based oversampling techniques have also been investigated (e.g. see the related discussion on the CBO algorithm upcoming in this overview). A more recent approach called MWMOTE (Barua et al 2014) clusters the minority class instances first, after which the SMOTE synthetic sampling procedure is applied to generate new synthetic minority class instances within each cluster. A general downside of oversampling is the increased training time of the subsequently employed learner.
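To make the simple replication idea concrete, the following is a minimal sketch of oversampling with replacement in Python/NumPy. The function name, labels and toy data are ours, not from the paper; minority instances are assumed to carry label 1.

```python
import numpy as np

def oversample_with_replacement(X, y, minority_label=1, seed=0):
    """Replicate minority-class rows (drawn with replacement) until both
    classes are equally represented."""
    rng = np.random.default_rng(seed)
    min_idx = np.flatnonzero(y == minority_label)
    maj_idx = np.flatnonzero(y != minority_label)
    extra = rng.choice(min_idx, size=len(maj_idx) - len(min_idx), replace=True)
    keep = np.concatenate([np.arange(len(y)), extra])
    return X[keep], y[keep]

X = np.arange(20, dtype=float).reshape(10, 2)
y = np.array([0] * 8 + [1] * 2)            # 8 majority vs. 2 minority rows
X_bal, y_bal = oversample_with_replacement(X, y)
```

Note that every added row is an exact copy of an existing minority instance, which is precisely why traditional studies warn about overfitting.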

The simplest undersampling method is called random undersampling, where majority class instances are randomly discarded from the training data. More sophisticated techniques operate in an informed fashion, where majority class instances are dropped according to their anticipated importance. One could for instance use the K-nearest neighbour (Knn) classifier to do so (Zhang and Mani 2003; Chyi 2003). Other informed undersampling approaches rely on the results of a preliminary clustering procedure (see the upcoming discussion on within-class imbalance). Yet other sampling techniques employ a combination of oversampling and undersampling. Chawla et al (2002) apply the SMOTE procedure in a first step and use random undersampling in a second step, though many combination schemes are possible and can be combined with a data cleaning technique. The latter type of methods aim at removing noise and the overlap that is introduced by the sampling schemes (He and Garcia 2009). Representative work in this area includes the one-sided selection (OSS) method (Kubat and Matwin 1997) and the techniques discussed in Batista et al (2004) (e.g. SMOTE with Tomek links). The literature on sampling techniques does not end there. Inspired by the success of boosting algorithms (and ensemble learning in general), sampling techniques have been integrated into this process. Examples of these methods are SMOTEBoost (Chawla et al 2003), JOUS-Boost (Mease et al 2007), DataBoost-IM (Guo and Viktor 2004), BalanceCascade (Liu et al 2009) and EasyEnsemble (Liu et al 2009). EasyEnsemble is briefly described in Section 3.3.3.
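The EasyEnsemble scheme can be sketched in a few lines. This is an illustrative assumption-laden version: it uses scikit-learn's AdaBoost (with its default tree stumps) as the boosting process, whereas the paper employs SVM-based learners; labels are assumed to be 1 for the minority class, and all parameter names are ours.

```python
import numpy as np
from sklearn.ensemble import AdaBoostClassifier

def easy_ensemble_scores(X, y, X_test, S=4, T=10, seed=0):
    """EasyEnsemble, sketched: draw S balanced subsets (all minority
    instances plus an equal-sized random majority sample), boost each
    for T rounds, and average the resulting decision scores."""
    rng = np.random.default_rng(seed)
    min_idx = np.flatnonzero(y == 1)
    maj_idx = np.flatnonzero(y == 0)
    scores = np.zeros(len(X_test))
    for s in range(S):
        maj_s = rng.choice(maj_idx, size=len(min_idx), replace=False)
        subset = np.concatenate([min_idx, maj_s])
        clf = AdaBoostClassifier(n_estimators=T, random_state=s)
        clf.fit(X[subset], y[subset])
        scores += clf.decision_function(X_test)
    return scores / S

rng = np.random.default_rng(1)
X = np.vstack([rng.normal(0, 1, (40, 2)), rng.normal(4, 1, (8, 2))])
y = np.array([0] * 40 + [1] * 8)
s = easy_ensemble_scores(X, y, np.array([[0.0, 0.0], [4.0, 4.0]]))
```

Because each subset is independent, the S fits can run in parallel, which is the source of the speed advantage noted in the abstract.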

Cost-sensitive learning makes use of cost matrices or misclassification costs to describe the penalty of misclassifying a certain instance. By incorporating misclassification costs, one can put a larger emphasis on the minority class. Indeed, misclassification costs of minority class examples are usually much larger than misclassification costs of majority class instances. The overall goal of cost-sensitive learning is "to develop a hypothesis that minimizes the overall cost on the training data set" (He and Garcia 2009) and can be accomplished in two ways. The first set of techniques uses cost-sensitive boosting variants, which basically incorporate misclassification costs in the weight updating scheme of the AdaBoost (Schapire 1999) algorithm. Popular techniques include AdaC1, AdaC2, AdaC3 (Sun et al 2007) and AdaCost (Fan et al 1999). A more detailed discussion on AdaBoost and AdaCost is presented in Sections 3.3.1 and 3.3.2, respectively. The second class of methods integrates misclassification costs directly into the underlying classifier and is specifically tailored to the type of inducer that is being used to construct the hypothesis. This basically means that the learning algorithm is made cost-sensitive. Barandela et al (2003) present a cost-sensitive Knn classifier by using a weighted distance function that emphasizes the minority class. A cost-sensitive support vector machine (SVM) is another example of this category of techniques; see the related discussion in Section 3.3.2 for more details. Also worth mentioning are methods that use a combination of cost-sensitive learning with one of the sampling techniques previously described. Akbani et al (2004) first apply SMOTE to generate synthetic minority class instances (yet the dataset is still imbalanced after applying SMOTE), after which a cost-sensitive SVM is used.
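A common way to make an SVM cost-sensitive is to give the minority class a larger per-class penalty, as in the sketch below. The toy data and the 10:1 cost ratio are illustrative assumptions, not values from the paper.

```python
import numpy as np
from sklearn.svm import LinearSVC

# Cost-sensitive linear SVM, sketched via per-class penalties: the
# minority class (label 1) gets a 10x larger misclassification cost.
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0, 1, (90, 2)), rng.normal(2, 1, (10, 2))])
y = np.array([0] * 90 + [1] * 10)

plain = LinearSVC(C=1.0, max_iter=10000).fit(X, y)
weighted = LinearSVC(C=1.0, class_weight={0: 1.0, 1: 10.0},
                     max_iter=10000).fit(X, y)

# Minority recall should not decrease when its errors cost more.
recall_plain = (plain.predict(X)[y == 1] == 1).mean()
recall_weighted = (weighted.predict(X)[y == 1] == 1).mean()
```

The per-class weight simply rescales C for the minority-class slack variables, so the boundary is pushed back toward the majority class.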

The previously outlined sampling and cost-sensitive learning procedures are geared toward solving the between-class imbalance problem, where the number of majority class instances dominates the number of minority class examples. The vast majority of research focuses on this issue (Ali et al 2015; Guo et al 2008) and ignores the possible occurrence of within-class imbalance. This type of imbalance occurs when different subconcepts (clusters) within a single class have an entirely different amount of representatives. Jo and Japkowicz (2004) were the first to note that small disjuncts (clusters containing a limited amount of instances) can be responsible for performance degradation, since learning algorithms have difficulties in picking up on these concepts. They proposed a cluster-based oversampling method (CBO), where majority class instances and minority class instances are clustered separately. Afterwards, all clusters are oversampled in such a way that there is no more between-class and within-class imbalance. Cluster-based undersampling approaches have also been proposed. The undersampling based on clustering (SBC) technique (Yen and Lee 2009) clusters all training data from both classes simultaneously and subsequently selects majority class instances from each cluster in accordance with the ratio of the number of majority class samples to the number of minority class samples within the cluster under consideration. Sobhani et al (2015) solve the between-class and within-class imbalance problem by clustering the majority class instances and selecting an equal amount of majority representatives from each cluster, such that the total number of selected majority instances equals the total amount of minority class instances.
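The cluster-then-select idea attributed above to Sobhani et al (2015) can be sketched as follows. This is our own simplified reading, assuming K-means as the clustering procedure; the helper and parameter names are hypothetical.

```python
import numpy as np
from sklearn.cluster import KMeans

def cluster_undersample(X, y, k=3, seed=0):
    """Cluster the majority class and draw a (near-)equal number of
    representatives per cluster, so that the selected majority instances
    total roughly the minority class size."""
    rng = np.random.default_rng(seed)
    min_idx = np.flatnonzero(y == 1)
    maj_idx = np.flatnonzero(y == 0)
    labels = KMeans(n_clusters=k, n_init=10,
                    random_state=seed).fit_predict(X[maj_idx])
    # Split the minority count into k (near-)equal per-cluster quotas.
    quota = [len(chunk) for chunk in np.array_split(np.arange(len(min_idx)), k)]
    chosen = []
    for c in range(k):
        members = maj_idx[labels == c]
        take = min(quota[c], len(members))
        chosen.append(rng.choice(members, size=take, replace=False))
    keep = np.concatenate([min_idx] + chosen)
    return X[keep], y[keep]

rng = np.random.default_rng(2)
X = np.vstack([rng.normal(0, 1, (60, 2)), rng.normal(3, 1, (12, 2))])
y = np.array([0] * 60 + [1] * 12)
X_u, y_u = cluster_undersample(X, y)
```

Sampling per cluster, rather than globally, keeps representatives of every majority subconcept in the reduced training set.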

The aforementioned techniques have all been investigated on traditional low-dimensional data, usually on datasets from the UCI repository. However, little is known on the effect of imbalance on the performance of classification algorithms when dealing with imbalanced behaviour data. An extensive discussion on behaviour data and its distinction from 'traditional' data is provided in Section 2.1. The classification techniques that are used for behaviour data can roughly be characterized as heuristic and regularization-based approaches, as discussed in more detail in Stankova et al (2015). Note that many classification techniques designed for traditional data cannot be directly applied to behaviour type of data. This is because tailored techniques need to be developed that have a low computational complexity and take the sparseness structure into account. Frasca et al (2013) present a cost-sensitive neural network algorithm designed to solve the imbalanced semi-supervised unigraph classification problem in the area of gene function prediction (GFP). They note that many algorithms designed for graph learning show a decay in the quality of the solutions when the input data are imbalanced.

1.2 Goals and contributions

The main goal of this article is to investigate the effect of a large variety of over- and undersampling methods, boosting variants and cost-sensitive learning techniques (which are traditionally applied to low-dimensional, dense data) on the problem of learning from imbalanced behaviour datasets. This is our first contribution, in the sense that we are the first to perform such an investigation. The base learner we will consider is a SVM,1 for reasons indicated in Section 2.2. This learner will be applied on a large diversity of behaviour data gathered from several distinct domains. As will be discussed in Section 2.1, behaviour data have a different structure and show distinct characteristics compared to traditional data. As a direct consequence, the conclusions and results drawn from previous studies dealing with imbalanced learning cannot be generalized to this specific type of data. By conducting targeted experiments, we gain insights on the possible occurrence of overfitting during oversampling, dwell upon the presence of noise/outliers and its influence on the performance of undersampling methods and boosting variants, and finally we also gain knowledge on the effect of weak/strong learners in applying boosting algorithms. This investigation allows us to confirm or contradict these findings with the results obtained from studies dealing with 'traditional data'. Throughout the text, we will highlight the main differences.

The subgoal of this paper is to provide a benchmark for future studies in this domain. As we conduct a comprehensive comparison of the aforementioned techniques

1 In Section 5 we will investigate the effect of several base learners


in terms of predictive performance and timings, researchers in this field can easily assess and compare their proposed techniques with the methods we have explored in this article. To enable this process, we provide implementations, datasets and results in our on-line repository: http://www.applieddatamining.com/cms?q=software

A third contribution lies in the fact that the applied sampling methods need to be adapted to cope with behaviour data. In Section 3.1, we provide specific implementations of the SMOTE and ADASYN algorithms and highlight the differences with their original formulations. The informed undersampling techniques presented in Section 3.2 rely on nearest neighbour computations, which require a different similarity measure as opposed to 'traditional data'.

Traditional studies integrating boosting with SVM-based component classifiers typically use an RBF kernel (Wickramaratna et al 2001; García and Lozano 2007; Li et al 2008). The specific boosting algorithm we propose in Section 3.3.1 combines a linear SVM with a logistic regression (LR) to form a single confidence-rated base learner. This specific kind of weak learner has, to our knowledge, never been proposed in earlier studies, yet proves to be very valuable in the setting of behaviour data.

Our fifth contribution lies in the exploration of a larger part of the parameter space. Chawla et al (2002) and He et al (2008) both employ K = 5 nearest neighbours in their experiments. Zhang and Mani (2003) present, among others, the Near-miss1 undersampling method, where majority class instances are selected based on their average distance to the three closest minority examples. Liu et al (2009) investigate their proposed EasyEnsemble algorithm (see Section 3.3.3 for a short description) using S = 4 subsets and T = 10 rounds of boosting. All of the aforementioned parameter settings are chosen without a clear motivation. In our experiments, we will consider a larger proportion of the parameter space by varying, for instance, the number of nearest neighbours used in oversampling and undersampling methods, the number of subsets and boosting rounds in EasyEnsemble, etc. In doing this, we can more accurately compare distinct methods dealing with imbalanced learning. Parameter settings for each of the methods proposed in this study are shown in Section 4.2. Furthermore, we also study the effect of the number of subsets used in EasyEnsemble; as mentioned before, the authors only use 4 subsets.

The studies mentioned in the literature overview of Section 1.1 usually postulate a new method and compare it to one or more closely related variants by performing experiments over several datasets and reporting performance outcomes, without any further statistical grounding or interpretation. In this article, we provide statistical evidence by conducting hypothesis tests, which enable a more meaningful comparison.

2 Preliminaries

2.1 Behaviour data

The last decades have witnessed an explosion in data collection, data storage and data processing capabilities, leading to significant advances across various domains (Chen et al 2014). With the current technology, it is now possible to capture specific conducts of persons or organisations. Following the definition of Shmueli (2017), behaviour data models fine-grained actions and/or interactions of entities such as persons or objects. To distinguish 'traditional' data from behaviour data, it is appropriate to introduce the framework proposed in Junqué de Fortuny et al (2014a): traditional predictive analytics describes instances by using a small number of features (dozens up to hundreds). These datasets are usually dense in nature, meaning that each instance has a non-trivial value for each of its features (or at least for most of them). Recent years have witnessed a growing number of applications making use of behaviour data, which reflect specific behaviours of individuals or organisations. Think for example about users liking certain pages on Facebook, visiting certain websites or making transactions with specific merchants, and organisations interacting with suppliers/clients. Such data are high-dimensional, containing thousands or even millions of features. Indeed, the amount of websites a user can visit, the number of Facebook pages a user can like or the number of unique merchants a user can transact with are enormous, ranging up to millions. A key characteristic is the sparse nature of behaviour data, resulting in a sparse matrix representation (see further in this section). The majority of attributes have a trivial value of "zero" (or not present) or, as Junqué de Fortuny et al (2014a) formulate it, "people only have a limited amount of behavioural capital". To summarize: behaviour data are very high-dimensional (10^4 to 10^8) and sparse, and are mostly originating from capturing the fine-grained behaviours of persons or companies.2

Besides the differences in structure/representation, behaviour data also show distinct properties in comparison to traditional data. Junqué de Fortuny et al (2014a) proved empirically that larger behaviour datasets, in terms of the number of instances or features,3 result in significant performance gains. This is in sharp contrast to the literature on sampling (reducing the number of instances) or feature selection that is commonly applied to traditional types of data, where usually there are a large number of irrelevant features that increase the variance and the opportunity to overfit (Provost and Fawcett 2013). Behaviour types of data show a different "relevance structure", in the sense that most of the features provide a small, though relevant, amount of additional information about the target prediction (Junqué de Fortuny et al 2014a). Furthermore, the instances and features show a power-law distribution (Stankova et al 2015): the vast amount of instances have a low number of non-zero (active) features and, conversely, the majority of features are only present for a few instances. In her doctoral dissertation, Stankova (2016) showed that applying non-linear classifiers to behaviour types of data does not improve predictive performance in comparison to the plain application of linear techniques. This is definitely a major contrast to prior literature dealing with dense datasets, where linear methods generally have a lower predictive performance compared to highly non-linear techniques (Baesens et al 2003).

2 The last requirement is not strictly necessary when we talk about behaviour data in our study (the sparseness and high-dimensionality properties are sufficient).

3 In this sparse setting, instance removal or feature selection are in a certain sense equivalent to one another.

Imbalanced behaviour data occur naturally across a wide range of applications; some examples include companies employing data on the specific websites a user visits for targeted online advertising (Provost et al 2009), where usually only a relatively low amount of people respond positively to the advertisement. Data on the individual merchants with whom one transacts can be used to detect credit card fraud. The study of Bhattacharyya et al (2011), which uses real-life credit card transactional data, mentions that of the 49,858,600 transactions considered, only 2,420 were fraudulent (0.005%). Besides these marketing and fraud domains, other areas can be considered, such as churn prediction (Verbeke et al 2012), default prediction (Tobback et al 2016) and predictive policing (Bachner 2013). Despite the abundant application domains, little is known on the effect of imbalance on the performance of classification algorithms when dealing with this kind of data. Needless to say, empirical and theoretical developments in this field can give rise to major benefits for academia, companies and governments.

Behaviour data can be represented as a sparse matrix or, equivalently, as a bipartite graph. In the matrix representation, each instance i corresponds to a single row and columns correspond to specific features. Let's take the example of users rating films, where we wish to predict the gender of each user. In that case, each user corresponds to a row in the matrix and each specific film corresponds to a single column. Alternatively, behaviour data can also be represented as a bipartite graph (Stankova et al 2015). Consider a set of bottom nodes N_B and a set of top nodes N_T. A bipartite graph is a network that contains edges only between nodes of a different type: each bottom node n_b ∈ N_B can only be connected to nodes n_t ∈ N_T. Returning to our example, each user i corresponds with a bottom node n_{b,i} and each film j corresponds with a top node n_{t,j}.
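The users-rating-films example above maps directly onto a compressed sparse row matrix. A minimal sketch with SciPy; the ratings list is made up for illustration.

```python
import numpy as np
from scipy.sparse import csr_matrix

# Rows are users (bottom nodes), columns are films (top nodes); a 1 at
# (i, j) records that user i rated film j.
ratings = [(0, 0), (0, 2), (1, 1), (2, 0), (2, 1), (2, 3)]
rows, cols = zip(*ratings)
X = csr_matrix((np.ones(len(ratings)), (rows, cols)), shape=(3, 4))

# Only the non-zero entries are stored; most positions stay trivially 0.
density = X.nnz / (X.shape[0] * X.shape[1])
```

Each stored entry corresponds to one edge of the bipartite graph, so the matrix and graph views carry exactly the same information.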

Imbalanced behaviour data occur when the collection of instances contains substantially more examples of one class as compared to the other. In our running example of the previous paragraph, this means that there might be far more males rating films than females. In the datasets presented in Section 4.1, each of the instances contains a label indicating the particular class the example belongs to. In practice, instances do not necessarily have a label associated with them, and the goal is to infer a label/score for these unknown instances based on the examples with known labels. Note that we focus on the two-class classification problem, where labels are limited to two types. A multi-class classifier can be obtained by solving a sequence of binary classification problems. This can for instance be accomplished with one-versus-one or one-versus-all setups (Hsu and Lin 2002).

2.2 Support vector machines (SVMs)

The support vector machine (SVM) is a "state-of-the-art" classification technique that has been applied in a wide variety of domains (Suykens et al 2002). The training data consist of a set of d-dimensional input vectors x_i with corresponding labels y_i ∈ {−1, 1}, i = 1, ..., m. A linear SVM constructs a linear classifier (hyperplane) in the input space. The hyperplane is constructed in such a way that as few points as possible are wrongfully classified, while simultaneously maximizing the margin between the two classes. This trade-off between minimizing the model complexity and reducing misclassification is governed by the regularization parameter C. The linear SVM is the solution to the following quadratic optimization problem:

\min_{w, b, \xi_i} \quad \frac{w^T w}{2} + C \sum_{i=1}^{m} \xi_i

\text{s.t.} \quad y_i \left( w^T x_i + b \right) \ge 1 - \xi_i, \quad i = 1, \dots, m

\qquad \;\; \xi_i \ge 0, \quad i = 1, \dots, m \qquad (1)

where b represents the bias and \xi_i are the slack variables measuring classification errors. The classifier is given by y(x) = sign(w^T x + b), where w^T x + b represents the output score. It should be noted that we usually solve the dual form of equation (1), with solution y(x) = sign[\sum_{i=1}^{m} \alpha_i y_i x^T x_i + b] and dual variables4 \alpha_i. An overview of the issues regarding SVMs and imbalanced classification for low-dimensional data can be found in Akbani et al (2004). Briefly: because of the imbalance, the majority of the slack variables \xi_i represent errors with respect to the majority class. This means that the minority class is under-represented and has a minor contribution to the goal function.

In this study, we have opted for a linear SVM (using the LIBLINEAR (Fan et al 2008) package) as the base learner to classify behaviour data.5 The feature vector will have a sparse and high-dimensional representation with d = |N_T|. Note that the SVM formulation does not change in this setting. The solution vector w will become high-dimensional. The kernel matrix contains inner products of sparse and high-dimensional vectors, resulting in a (possibly) sparse matrix.
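As a sketch of this setup, scikit-learn's LinearSVC (which is backed by LIBLINEAR) accepts CSR matrices directly, so the sparse high-dimensional feature vectors are never densified. The synthetic data below, generated from a hidden linear concept, are an illustrative assumption.

```python
import numpy as np
import scipy.sparse as sp
from sklearn.svm import LinearSVC

rng = np.random.default_rng(0)
# 200 instances with 5000 features, ~1% of which are non-zero per row.
X = sp.random(200, 5000, density=0.01, format="csr", random_state=0)
w_true = rng.normal(size=5000)                 # hidden linear concept
y = (X @ w_true > 0).astype(int)

clf = LinearSVC(C=1.0, max_iter=10000).fit(X, y)  # LIBLINEAR backend
train_acc = clf.score(X, y)
```

With far more features than instances, the training data are (almost surely) linearly separable, which is one reason linear methods work so well on behaviour data.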

The reasons we have chosen an SVM are twofold. First of all, it is a very popular technique for dealing with traditional and behaviour data and has been applied in many diverse domains (Suykens et al 2002). As we have noted in Section 1.1, tailored techniques need to be applied to behaviour datasets; these roughly fall into two categories: regularization-based techniques and heuristic approaches. The latter type of techniques is not suitable for dealing with traditional data. The remaining regularization-based techniques have formed the subject of traditional studies dealing with imbalanced data. We have opted for this type of techniques since we can then easily compare our results to the conclusions obtained from previous studies. Another reason we have opted for an SVM is the fact that many of our proposed techniques (see Section 3 for details) rely on a boosting process. Wickramaratna et al (2001); García and Lozano (2007) noted that using a strong learner can result in performance degradation during boosting. Regularization-based techniques offer an added element of flexibility in the sense that the strength of the learner can be controlled by varying the regularization parameter. Heuristic approaches today do not offer this attractive feature.

4 Also called support values.
5 In Section 5, we will consider different types of base learners.


2.3 Evaluation metrics

The performance measures that are typically used in traditional studies dealing with imbalanced data are accuracy, sensitivity, specificity, precision, F-measure and G-mean; see for instance Bhattacharyya et al (2011); Han et al (2005); González and Velásquez (2013). These measures are derived from the confusion matrix and are based on a certain threshold applied on the output scores of the classifier, where the threshold is usually contained within the classification algorithm. There are two main issues with this approach. First of all, the built-in threshold that is applied on the scores might not be suitable. This might cause low performance values with respect to these criteria, yet if we were to simply adapt the threshold, the same performance criteria might show excellent results. The second issue lies in the fact that the chosen threshold could be irrelevant with respect to the available capacity. We address this issue with a simple example. If a targeted advertisement company were to apply this classifier (with built-in threshold) on new customers, the classifier might choose to predict 5% of all possible customers to target as positive. Yet the company only has a marketing budget that allows targeting 0.1% of all possible customers. It is clear that the chosen threshold is inappropriate here. For these reasons, we have chosen to opt for the area under the ROC curve (AUC) instead.6 Chawla et al (2004) note that ROC curves (and cost curves) should be preferred over these traditional measures. The AUC, which measures the area under the ROC curve, is more appropriate since it is independent of class skew and measures the ranking abilities of the classifier (Fawcett 2006). It answers the question: if we were to rank all instances according to output scores, is the classifier able to place the positive instances near the top of the list and the negative instances near the bottom? Because the method scans over all possible thresholds, it is independent of a specific cut-off value. Another reason we have chosen AUC is the fact that many boosting and cost-sensitive learning techniques evaluate the performance using accuracy or misclassification cost, which suffer from the same issues as previously mentioned. Hence, we are one of the first to evaluate these techniques with respect to AUC. Note that AUC is also the preferred metric in the assessment of unsupervised anomaly detection techniques (Goldstein and Uchida 2016). Outliers/anomalies are strongly related to our field, as they signify rare events that are of specific importance to the analyst.
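The threshold-independence of AUC can be seen in a two-line example: AUC depends only on the ranking induced by the output scores, so score vectors on completely different scales, both of which rank every positive above every negative, attain the same perfect value. The toy labels and scores are ours.

```python
from sklearn.metrics import roc_auc_score

y_true = [0, 0, 0, 1, 1]
# SVM-style raw scores and probability-style scores with the same ranking.
auc_raw = roc_auc_score(y_true, [-2.0, -1.5, -0.3, 0.1, 2.0])
auc_prob = roc_auc_score(y_true, [0.01, 0.02, 0.03, 0.90, 0.95])
```

Any fixed cut-off (say, 0) would classify these two score vectors very differently, yet their ranking quality is identical.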

Though we focus on AUC, other measures are suitable for performance assessment of imbalanced data. The weighted area under the ROC curve (wAUC) (Li and Fine 2010) is able to emphasize certain regions of ROC space. Lift curves (Bekkar et al 2013), popular in the marketing domain, are ideal to assess the prevalence of positive cases among instances ranked highly by the classifier. These curves can be evaluated according to the available capacity. When cost information is available, this should be integrated in the performance assessment criterion. Indeed, the cost of a false negative is usually much larger than the cost of a false positive. We could then opt for cost curves (Whitrow et al 2009; Bhattacharyya et al 2011), which can also be evaluated with consideration of the available capacity. Few studies make use of costs (Ngai et al 2011), mainly because these are difficult to determine, uncertain, and have a temporal characteristic. To conclude this section, we note that each of the performance assessment criteria mentioned in this paragraph requires some additional information, in the form of a weight function (wAUC), capacity requirements and/or costs (lift/cost curves). These measures are application specific and this is the main reason we excluded them from our study.

6 We provide traditional measures (sensitivity, specificity, G-mean, F-measure) in our online repository: http://www.applieddatamining.com/cms?q=software

3 Methods

3.1 Oversampling

Over the years, the data mining community has investigated several techniques that balance the data distribution by oversampling the minority class. In this section, we investigate the basic oversampling with replacement (OSR) approach in conjunction with synthetic sample generation procedures. The first technique simply duplicates minority class instances by a certain amount. Several references dealing with traditional low-dimensional data note that this technique may make the decision regions of the learner smaller and too specific on the replicated instances, causing the learner to overfit (Chawla et al 2002; Han et al 2005; Liu et al 2009; He and Garcia 2009). The synthetic approaches are designed to overcome this overfitting behaviour by generating new, non-overlapping instances in the minority class space. The techniques we investigate are SMOTE (Chawla et al 2002) and ADASYN (He et al 2008).

Consider a certain minority class instance xi. In the traditional setting, SMOTE and ADASYN will generate a new synthetic instance xsyn by choosing a random point on the line segment between the point xi and one of its K minority class nearest neighbours (computed according to Euclidean distance). In SMOTE, each original minority class instance xi generates the same number of synthetic instances, whereas the ADASYN algorithm generates a variable number of synthetic instances by putting a larger weight on the harder to learn minority instances. This way, the learner is more focused toward difficult instances.

These techniques need to be adapted when dealing with binary behaviour data, where each instance is represented by a large and sparse binary vector. The main differences with the original versions of SMOTE and ADASYN are indicated in Table 1 and are explained in more detail in the following paragraphs. The first difference lies in the generation of the new synthetic instances. As before, a synthetic sample is constructed based on two original minority instances. When both instances have a 0 or 1 in their corresponding column, the synthetic sample will also show a 0 or 1, respectively, at the considered position. When only one of the two minority instances shows a 1, the decision will be made according to a user-specified parameter prioropt. This parameter can be one of the following three options:

– "FlipCoin", where there is a 50% probability that the synthetic instance will show a 1 at the considered position.

– "Prior", where the value of the synthetic sample is determined by the prior within the minority class in the corresponding column. One generates a random number u in the interval [0,1] and puts a 1 in the corresponding position if u is smaller than the prior.

– "Reverse Prior", where one generates a random number u in the interval [0,1] and puts a 1 in the corresponding position if u is larger than the prior within the minority class for this column/feature.
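The three prioropt options above can be sketched in plain Python. This is our own minimal illustration, not the authors' code; the function name and the list-of-0/1 representation are assumptions made for clarity.

```python
import random

def synth_binary(x1, x2, priors, prior_opt="FlipCoin", rng=random.Random(0)):
    """Generate one synthetic binary instance from two minority instances.

    x1, x2    : lists of 0/1 values (two original minority instances)
    priors    : per-column prior of a 1 within the minority class
    prior_opt : 'FlipCoin', 'Prior' or 'Reverse Prior' (see Section 3.1)
    """
    syn = []
    for a, b, prior in zip(x1, x2, priors):
        if a == b:                       # 0-0 or 1-1 match: copy the shared value
            syn.append(a)
        else:                            # 0-1 mismatch: decide via prior_opt
            u = rng.random()
            if prior_opt == "FlipCoin":
                syn.append(1 if u < 0.5 else 0)
            elif prior_opt == "Prior":
                syn.append(1 if u < prior else 0)
            else:                        # 'Reverse Prior'
                syn.append(1 if u > prior else 0)
    return syn
```

Positions where both parents agree are copied unchanged, so the synthetic point stays inside the minority region spanned by its parents.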

The second difference lies in the way the nearest neighbours are determined. It would be unwise to consider Euclidean distance in that respect, because it treats a 0−0 match in the same way as a 1−1 match. Since we are working with dichotomous variables (i.e. present or absent), a 1−1 match at a certain position is far more informative than a 0−0 match. For instance, two users visiting the same web page contains more information than two users who didn't visit that specific page. The similarity between two instances is defined by a user-specified parameter simmeasure. We have limited ourselves to two popular choices: "Jaccard" uses the Jaccard similarity measure (Finch 2005) and "Cosine" uses the cosine similarity measure (Huang 2008). In principle, one could apply any of the metrics summarized in Stankova et al (2015).
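For binary instances both measures reduce to simple set operations on the active feature indices. A minimal sketch (our own helper functions, assuming instances are represented as sets of active features):

```python
import math

def jaccard(a, b):
    """Jaccard similarity between two binary instances given as sets of
    active feature indices: |a & b| / |a | b|."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0

def cosine(a, b):
    """Cosine similarity for binary vectors: |a & b| / sqrt(|a| * |b|)."""
    if not a or not b:
        return 0.0
    return len(a & b) / math.sqrt(len(a) * len(b))
```

Because only active features enter the computation, 0−0 matches contribute nothing, which is exactly the behaviour argued for above.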

Table 1 Differences between the original SMOTE and ADASYN implementations and the versions SMOTEbeh and ADASYNbeh tailored for behaviour data (parameter explanations are provided in Section 3.1)

                              SMOTE            SMOTEbeh        ADASYN           ADASYNbeh
Amount of oversampling        N                β               β                β
Synthetic sample generation   random point on  prioropt        random point on  prioropt
                              line segment                     line segment
Similarity measure            Euclidean        Jaccard/Cosine  Euclidean        Jaccard/Cosine
Number of nearest neighbours  K                K, K̄           K                K, K̄

A detailed pseudo-code implementation of our versions of SMOTE and ADASYN, called SMOTEbeh and ADASYNbeh, is shown in Algorithm 1. Also note that we introduced an extra parameter K̄ to decouple the determination of the number of synthetic instances that need to be generated for a certain minority instance from the number of nearest neighbours it uses to obtain the synthetic instances.

The experimental set-up adopted by Chawla et al (2002); He et al (2008) considers K = 5 as the number of nearest neighbours used, without a detailed motivation. Furthermore, the latter paper compares SMOTE with an oversampling percentage N of 200% (meaning that the size of the newly created synthetic data instances is twice as large as the size of the original minority training data) with a completely balanced dataset (β = 1 in Algorithm 1) in ADASYN. In our experiments, we will consider a variety of possible K-values and compare SMOTE and ADASYN with identical oversampling rates, controlled by a single parameter β.


Algorithm 1 SMOTEbeh and ADASYNbeh pseudo-code implementation for binary behaviour data

Input: Xmin, Xmaj, β, prioropt, simmeasure, K, K̄
a) determine the total amount of synthetic minority instances that need to be generated (β ∈ [0,1] is a parameter that controls the amount of oversampling; β = 1 means a fully balanced dataset will be created):

G = (|Xmaj| − |Xmin|) × β

b) determine the number of synthetic samples gi that need to be generated for each minority class instance xi:
if SMOTE then
  gi ← ⌈G / |Xmin|⌉
else if ADASYN then
  calculate the K nearest neighbours (with simmeasure option) for instance xi from the set (Xmin \ {xi}) ∪ Xmaj and determine Δi, the number of majority class nearest neighbours. Next, calculate ri = Δi/K and normalize these values: r̂i = ri / Σ_{j=1}^{|Xmin|} rj
  gi ← ⌈r̂i × G⌉
end if
c) generate gi synthetic samples for minority instance xi:
calculate the K̄ nearest neighbours (with simmeasure option) for instance xi from the set Xmin \ {xi}. Additionally, remove those nearest neighbours that have a similarity of 0 with xi. The remaining nearest neighbours form the set Kused. If this set turns out to be empty, set Kused = {xi}.
for iter = 1 → gi do
  randomly choose 1 nearest neighbour from the set Kused
  generate a synthetic minority sample from xi and the chosen nearest neighbour (according to prioropt)
end for
d) because Σ gi ≥ G, randomly remove synthetic points until the total number of synthetic samples equals G

3.2 Undersampling

In this section, we will compare the simple random undersampling technique (RUS) with informed undersampling approaches. The first method randomly discards majority class training instances. While this technique can achieve fast training performance of the underlying base learner, an obvious disadvantage is the fact that it might discard potentially useful majority class instances. The informed approaches try to intelligently retain the most informative majority class instances in the hope to increase predictive performance, while at the same time keeping the fast training speed of the underlying classifier.

The first set of informed undersampling techniques are based on the methods proposed by Zhang and Mani (2003); Chyi (2003). The K-nearest neighbour 'classifier' is used to determine the importance of each majority class training instance by calculating the total similarity with the K closest minority class training set examples. Regarding similarity computations for binary behaviour data, we refer to the related discussion in Section 3.1. The first technique, called "Closest Knn", retains majority class examples that are closest to the minority class instances. These instances are the most difficult to classify and we would expect them to be most informative. The second method, called "Closest tot sim", is similar to the previously described technique. The difference is that it no longer computes similarities with the K closest minority neighbours; instead, it calculates the total similarity with all minority instances in determining the importance. The main reason we included this technique is computational speed7. The last techniques, called "Farthest Knn" and "Farthest tot sim", are included for comparison with the previously mentioned techniques. Their implementation is identical to the previously described techniques, however they retain majority class examples that are farthest from the minority class instances. For each of the proposed methods in this paragraph, the amount of undersampling is controlled by a user-specified parameter βu, according to the following formula:

Nr_rem = ⌊(|Xmaj| − |Xmin|) × βu⌋    (2)

where Nr_rem represents the amount of majority class instances to be discarded; βu = 1 means a completely balanced dataset is obtained.
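The "Closest Knn" selection can be sketched as follows. This is our own illustration (the function name and set-based representation are assumptions): majority instances are ranked by their total similarity with their K most similar minority instances, and Equation (2) determines how many are retained.

```python
import math

def closest_knn_retain(majority, minority, K, beta_u, sim):
    """'Closest Knn' informed undersampling sketch: keep the majority
    instances with the highest total similarity to their K most similar
    minority instances; the amount removed follows Equation (2)."""
    nr_rem = int(math.floor((len(majority) - len(minority)) * beta_u))
    nr_retain = len(majority) - nr_rem

    def importance(x):
        # total similarity with the K closest minority instances
        sims = sorted((sim(x, m) for m in minority), reverse=True)
        return sum(sims[:K])

    ranked = sorted(majority, key=importance, reverse=True)
    return ranked[:nr_retain]
```

Replacing `sum(sims[:K])` with `sum(sims)` gives "Closest tot sim", and reversing the final sort gives the "Farthest" variants.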

The second set of informed undersampling techniques aim at targeting the within-class imbalance problem and are based on the approach proposed in Sobhani et al (2015). They postulate that this within-class imbalance problem is more pronounced in the case of undersampling methods. If we were to randomly select majority class instances, then the probability of drawing an instance from small disjuncts within the majority class would be very low. These regions might therefore contain no representatives and remain unlearned. The authors chose to address this issue by clustering the majority class instances in a first step and subsequently selecting an equal number of representatives from each cluster. The reported results show their approach to outperform the CBO-algorithm (see Section 1.1). In the following paragraphs, we will dig deeper into the subject of clustering behaviour data. We already refer to Algorithm 2 for an overview of our cluster-based undersampling method (CBU).

As we noted in Section 2.1, behaviour data can be represented as a bipartite graph. The clustering of behaviour data8 aims at finding groups of nodes (communities) that connect more to each other than to other nodes in the network. This subject is currently an active area of research, with a rapid evolution of a vast number of cluster detection techniques (Zha et al 2001; Dhillon 2001; Larremore et al 2014; Beckett 2016). We refer to Porter et al (2009); Fortunato (2010); Alzahrani and Horadam (2016) for detailed surveys on the problem. It should be noted that the vast majority of publications deal with the subject of clustering unigraphs (networks with only one type of nodes). It is only fairly recently that interest grew in the clustering of bigraphs. In our implementations, we have chosen for the popular modularity-based approaches9 for clustering bigraphs, and these fall into two directions: the modularity function that is used for unigraphs is adapted to be suitable for bigraphs, see for instance the work of Barber (2007). The other direction, which we adopt in our study, projects the bigraph to a unigraph of bottom nodes and performs community detection on the projection using traditional modularity definitions. Note that Guimera et al (2007) observed no difference in the obtained communities using either direction.

7 For each majority class instance, we no longer need to sort the similarities with all minority instances in determining the K largest values.

8 This subject is more commonly known as community detection in bipartite graphs.
9 Modularity-based approaches attempt to optimize a quality function known as modularity for finding community structures in networks and rely on the use of heuristics due to the complexity of the problem.


In this article, we adopt the methodology proposed in Alzahrani and Horadam (2016), which consists of projecting the bigraph to a unigraph of bottom nodes and applying10 the Louvain algorithm (Blondel et al 2008) on the projection. Lancichinetti and Fortunato (2009) performed a comparative study regarding the performance of 12 community detection algorithms and concluded the Louvain method to be the best modularity-based algorithm and second best among all algorithms. The heuristic is very fast, with a O(m) complexity, with m the number of edges in the unigraph. We have chosen for the Louvain algorithm because of its speed performance and the availability of a toolbox (Jutla et al 2011-2016) that is directly compatible with our implementations. The toolbox provides a generalized implementation of the Louvain algorithm in the sense that multiple definitions of modularity are possible. The quality function we chose is the popular Newman-Girvan modularity (Newman and Girvan 2004).

With respect to the projection, Alzahrani and Horadam (2016) connect two bottom nodes if they have at least one top node in common. The connection weight between two bottom nodes in the projection is set to the number of shared top nodes. In our implementation, we adapted the connection weights in accordance with Stankova et al (2015) as follows: first of all, we assign weights to the top nodes, corresponding to the hyperbolic tangent applied to the inverse degree of the top node. Next, the connection weight between two bottom nodes in the projection corresponds with the total weight of the shared top nodes. Top nodes having low degrees therefore obtain a higher contribution in the projection (e.g. two users making a transaction to a local book store are assumed to be more closely connected to each other than two users making a transaction to a large retail store).
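The weighted projection step can be sketched as follows; a minimal illustration in plain Python (the function name and dict-of-sets representation are our own assumptions, not the authors' implementation):

```python
import math
from collections import defaultdict
from itertools import combinations

def project_bigraph(adjacency):
    """Project a bigraph (bottom node -> set of top nodes) onto a weighted
    unigraph of bottom nodes. Each top node gets weight tanh(1/degree);
    the edge weight between two bottom nodes is the total weight of the
    top nodes they share (cf. Stankova et al 2015)."""
    # degree of each top node
    degree = defaultdict(int)
    for tops in adjacency.values():
        for t in tops:
            degree[t] += 1
    top_weight = {t: math.tanh(1.0 / d) for t, d in degree.items()}

    # pairwise edge weights over shared top nodes
    edges = {}
    for i, j in combinations(sorted(adjacency), 2):
        w = sum(top_weight[t] for t in adjacency[i] & adjacency[j])
        if w > 0:
            edges[(i, j)] = w
    return edges
```

Because tanh(1/d) shrinks toward 0 as the degree d grows, a hub top node (e.g. a large retail store) contributes far less to the connection weight than a rarely shared top node.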

In our CBU-algorithm (see Algorithm 2), after clustering the bigraph containing exclusively majority class instances, we randomly select an equal amount of majority instances from each community to target the within-class imbalance problem. In the rare situation that the number of obtained clusters exceeds the required amount of majority instances (Nr_retain), we sort the communities according to a user-specified parameter Clustopt and randomly select 1 instance from the first Nr_retain clusters. The parameter Clustopt can take the following values:

– C_Smallest, where we sort clusters in ascending order of their size
– C_Largest, where we sort clusters in descending order of their size

Note that we randomly select majority class instances from each cluster: Yen and Lee (2009) found a random selection strategy after clustering to be superior to informed approaches based on distance.

10 Note that they also made use of the flow-based algorithm Infomap (Rosvall and Bergstrom 2008), which shows excellent results on the LFR-benchmark.


Algorithm 2 CBU pseudo-code implementation for behaviour data
Input: Xmin, Xmaj, βu, Clustopt
a) Cluster the majority class instances Xmaj:

– Assign weights to each top node, corresponding with the hyperbolic tangent applied to the inverse of the node's degree.
– Project the bigraph Xmaj to a weighted unigraph consisting of bottom node majority class instances. The weight wij between majority class instances i and j corresponds with the total weight of the shared top nodes.
– Apply the Louvain algorithm (Blondel et al 2008) on the projected unigraph to partition the majority class instances into clusters.

b) Select majority class instances:
Nr_rem ← ⌊(|Xmaj| − |Xmin|) × βu⌋ (see Equation (2))
Nr_retain ← |Xmaj| − Nr_rem
if Nr_retain < |Clust| then
  – Sort clusters according to Clustopt
  – Randomly select 1 instance from the first Nr_retain clusters
else
  – Randomly select ⌈Nr_retain/|Clust|⌉ majority class instances from each cluster
  – Randomly discard instances from the previous step until its size corresponds with Nr_retain
end if
c) Return the new training set consisting of Xmin and the selected majority class instances from step b.

3.3 Boosting, cost-sensitive learning and EasyEnsemble

3.3.1 AdaBoost

The AdaBoost (Schapire and Singer 1999; Schapire 1999) algorithm has been designed from a perspective of improving the performance of a weak learner so that it achieves accuracies that are comparable with a strong learning algorithm. Fundamental to the idea of boosting is to maintain a weight distribution over the training set. In each boosting iteration, the weights of wrongly classified instances are increased, so that the underlying weak learner puts more emphasis on these hard examples. In our implementation, we will consider using a SVM as base learner. SVMs are generally regarded as strong learners. The studies of Wickramaratna et al (2001); García and Lozano (2007) note that using a strong learner usually results in performance degradation during the boosting process. The RBF-kernel SVM classifier (Wickramaratna et al 2001; Li et al 2008) is used as the underlying classifier to prove their point. In our study, we will employ a linear SVM, which can be considered as a weaker version compared to the RBF-kernel. Furthermore, the regularization parameter C can be viewed as a "weakness" indicator11. Lowering the C-value results in weaker learners, as can be seen from the goal function of the SVM optimization problem (see Equation (1)). We will come back to this point in Section 4.5.

11 The distinction between weak/strong learners is loosely 'defined' in Schapire (1999). A weak learner corresponds with a hypothesis that performs just slightly better than random guessing. A strong learner is able to generate a hypothesis with an arbitrarily low error rate, given enough data. We adopt these definitions, but consider the distinction between weak/strong based on training set error. In a SVM context, it is quite typical that error levels on training data drop with increasing C-values (Suykens et al 2002). A learner that is 'too strong' means that even though its performance on training data is very high, it fails to generalize well and the test set error increases due to overfitting.


The boosting algorithm with underlying SVM is presented in Algorithm 3. This algorithm closely follows the original boosting implementation presented in Schapire and Singer (1999) and requires each learner to output confidence-rated predictions in the interval [−1,1]. Since the SVM outputs real-valued scores, we apply the procedure of Platt (1999) to transform these scores into probability estimates (which can easily be translated to form confidence-rated predictions). A logistic regression (LR) model (Ng and Jordan 2002) is trained using as input the SVM-scores and the corresponding labels as output. Note that the same data that are used to construct the linear SVM can be used to estimate the LR-model, see Platt (1999) for a motivation.
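The SVM-score-to-confidence mapping can be sketched as follows. This is a simple pure-Python stand-in for Platt (1999) scaling, fitted with plain gradient descent rather than the original algorithm; function names and hyperparameters are our own assumptions.

```python
import math

def fit_platt(scores, labels, lr=0.1, iters=2000):
    """Fit P(y=1|s) = 1 / (1 + exp(-(A*s + B))) on SVM decision scores
    via gradient descent on the logistic loss (labels in {0, 1})."""
    A, B = 0.0, 0.0
    m = len(scores)
    for _ in range(iters):
        gA = gB = 0.0
        for s, y in zip(scores, labels):
            p = 1.0 / (1.0 + math.exp(-(A * s + B)))
            gA += (p - y) * s
            gB += (p - y)
        A -= lr * gA / m
        B -= lr * gB / m
    return A, B

def confidence(score, A, B):
    """Map a probability estimate to a confidence-rated prediction in [-1, 1]."""
    p = 1.0 / (1.0 + math.exp(-(A * score + B)))
    return 2.0 * p - 1.0
```

The affine rescaling 2p − 1 turns the probability into the [−1,1] confidence that Algorithm 3 expects from its weak learners.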

The boosting algorithm requires the weak learner to be trained using a distribution Dt. One could sample from this distribution to generate (unweighted) examples and train a SVM next. We have chosen to include weights in the SVM goal function formulation (this requires an extension of the LIBLINEAR package):

min_{w,b,ξi}  (wᵀw)/2 + C Σ_{i=1}^{m} weight_i ξ_i    (3)

The weights weight_i are set according to the weight distribution Dt(i) in each round of boosting. Note that the C-value that is used to train the SVM model will be divided by mean(weight_i). This normalization allows for a fair comparison between the weighted and unweighted SVM versions (e.g. in the first round of boosting, weight_i = 1/m for all instances; by multiplying the C-value with m, this corresponds to solving the unweighted problem with the same C-value).
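The normalization can be made concrete with a small helper (our own illustration; the function name is an assumption):

```python
def effective_costs(C, weights):
    """Per-instance slack penalties for the weighted SVM of Equation (3),
    with the C-value divided by mean(weight_i) as described above."""
    mean_w = sum(weights) / len(weights)
    C_norm = C / mean_w
    return [C_norm * w for w in weights]
```

With the uniform first-round weights 1/m, every instance receives exactly the original penalty C, so the weighted problem coincides with the unweighted one.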

We introduce an additional parameter μ, called the weight percentage, with values in [0,100], in the boosting algorithm of Algorithm 3. This parameter controls the amount of training data that is used to construct the SVM model and subsequent LR model. We sort the original training data in descending order according to the distribution Dt. Next, we form a new training set of minimal cardinality by including points from the original sorted training data until the total weight is higher than the weight-percentage μ/100. This way, the newly formed training data will contain only the part of the original training data that has the most weight. This partial set will then be used to construct a weighted SVM-model (according to Equation (3), with updated distribution for this set) and a subsequent LR-model. The idea of using a partial dataset to construct the base learner not only reduces training times but also weakens the learner (García and Lozano 2007).

In Algorithm 3, we have an explicit check to verify if rAB = 1. In this case, the SVM model outputs scores that allow for a perfect classification on the training set. The subsequent LR model will find a threshold and output a value of −1 if the SVM score is lower than this threshold (a value of +1 if the score is higher than this threshold). In this situation, the training data would be perfectly classified. In our implementation, we attempt to avoid this kind of behaviour because it can lead to overfitting, meaning it might pinpoint the wrong threshold and make too drastic decisions (we lose the meaning of confidence). The check if rAB ≤ 0 verifies if the currently boosted model is performing worse than random (this model would have a rAB-value of 0). Obviously, if the model performs worse than random, we quit the boosting process. During the first round of boosting, we perform similar checks that are not explicitly indicated in Algorithm 3. In the case where rAB = 1, we output the SVM scores instead of the LR binary values. When rAB ≤ 0, we quit the boosting process and output the LR scores.

Algorithm 3 AdaBoost with a SVM-LR combination as a base learner
Input: (X,Y) = {(x1,y1), …, (xm,ym)}, C, T, μ
Initialize distribution D1(i) = 1/m
for t = 1 to T do

– train weak learner using distribution Dt. Weak learner consists of weighted linear SVM and LR model, trained with weight-percentage μ of Dt:

  ht ← Train_WeakLearner(X, Y, Dt, C, μ)

– compute the weighted confidence rAB on the training data:

  rAB ← Σ_{i=1}^{m} Dt(i) yi ht(xi)

  If (rAB = 1 or rAB ≤ 0) then αt ← 0 and stop the boosting process
– choose αt ∈ ℝ:

  αt ← (1/2) log((1 + rAB) / (1 − rAB))

– update distribution:

  Dt+1(i) ← Dt(i) exp(−αt yi ht(xi)) / Zt

  where Zt is a normalization factor (chosen so that Dt+1 will be a distribution)
end for
Output the final hypothesis (the output score is the term contained in the sign function):

H(x) = sign(Σ_{t=1}^{T} α̃t ht(x)) with α̃t = αt / Σ_{i=1}^{T} αi
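One round of this confidence-rated update can be sketched numerically (our own minimal illustration; the function name is an assumption):

```python
import math

def boost_round(D, preds, labels):
    """One confidence-rated AdaBoost round (Schapire and Singer 1999):
    compute r_AB, alpha_t and the renormalized next distribution.
    preds are the weak learner's confidence-rated outputs in [-1, 1]."""
    r = sum(d * y * h for d, y, h in zip(D, labels, preds))
    if r >= 1.0 or r <= 0.0:
        return None, None, D          # stop the boosting process
    alpha = 0.5 * math.log((1 + r) / (1 - r))
    D_next = [d * math.exp(-alpha * y * h) for d, y, h in zip(D, labels, preds)]
    Z = sum(D_next)                   # normalization factor Z_t
    return r, alpha, [d / Z for d in D_next]
```

Instances with y·h < 0 (misclassified) get the factor exp(+α·|h|) > 1, so their weight grows relative to correctly classified instances.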

3.3.2 AdaCost

The AdaCost algorithm (Fan et al 1999) is a variant of cost-sensitive learning where misclassification costs are introduced in the weight-update formula of AdaBoost. The cost-sensitive update rule will increase the weights of costly misclassified instances more aggressively and decrease the weights of costly correct classifications more conservatively. Each instance is given a misclassification cost ci, where we chose to put ci = 1 for positive (minority) instances and ci = 1/R for negative (majority) instances. R is a user-defined value that allows one to put more emphasis on the minority class. The implementation of AdaCost is similar to Algorithm 3, yet there are a few differences (Fan et al 1999): firstly, the initial distribution is chosen as D1(i) = ci / Σ_{j=1}^{m} cj; secondly, the weight update rule is given by Dt+1(i) = Dt(i) exp(−αt yi ht(xi) β(i)) / Zt, where β(i) = −0.5 sign(yi ht(xi)) ci + 0.5 is a cost-adjustment function; finally, the choice of αt is given by αt = (1/2) log((1 + rAC) / (1 − rAC)), where rAC = Σ_{i=1}^{m} Dt(i) yi ht(xi) β(i). Note that the checks to stop the boosting process prematurely (see the second bullet in Algorithm 3) are still based on the r-value obtained from AdaBoost (rAB). This is because β ∈ [0,1].
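The cost-adjustment function can be written out directly (our own illustration; the function name is an assumption):

```python
def beta_adacost(y, h, c):
    """Cost-adjustment function of Fan et al (1999):
    beta(i) = -0.5 * sign(y_i * h(x_i)) * c_i + 0.5, with c_i in [0, 1]."""
    sign = 1.0 if y * h > 0 else (-1.0 if y * h < 0 else 0.0)
    return -0.5 * sign * c + 0.5
```

For a correct classification beta is 0.5 − 0.5c (a conservative weight decrease for costly instances), while for a misclassification it is 0.5 + 0.5c (an aggressive weight increase), which is exactly the asymmetry described above.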

In the first boosting round of AdaCost, the weighted SVM formulation (3) is equivalent to solving a SVM formulation with the following goal function:

min_{w,b,ξi}  (wᵀw)/2 + C⁺ Σ_{i|yi=1} ξ_i + C⁻ Σ_{i|yi=−1} ξ_i    (4)

where C⁺/C⁻ = R. This can be seen as a cost-sensitive version of a SVM, and this idea has initially been proposed by Veropoulos et al (1999).

3.3.3 EasyEnsemble

One of the disadvantages of the random undersampling method is the fact that we are discarding potentially valuable information. EasyEnsemble (Liu et al 2009) is a method that combines several balanced subsets (S in total) containing randomly selected majority class instances together with all minority examples. Each subset contains the same amount of instances from both classes and is fed to the boosting algorithm presented in Algorithm 3. Afterwards, the weak learners h_{s,t} of each subset s are simply combined to form the final ensemble:

H(x) = sign( Σ_{s=1}^{S} Σ_{t=1}^{T} α_{s,t} h_{s,t}(x) )  with s = 1, …, S; t = 1, …, T    (5)

It is clear that this technique benefits from a combination of bagging and boosting (Liu et al 2009). Note that we apply the same boosting algorithm as previously described to each of the balanced subsets. However, when rAB = 1 in the first round of boosting, we quit the boosting process, put α1 = 1 and continue to use the trained LR-model in the final ensemble. It was previously noted that this can cause overfitting, in the sense that it can pinpoint a wrong threshold. However, the LR-model still contains information and the combination thereof with the models obtained from the other subsets remains valuable (as we have noted from initial experiments, by comparing the situation where we include or reject those subsets).

There are a few subtle though important differences with respect to the experiments performed by Liu et al (2009). First of all, the authors use a CART-algorithm (Breiman et al 1984) as base learner. We employ a linear SVM with subsequent LR as a weak learner. To our knowledge, this combination has not been proposed elsewhere, yet proves to be very efficient in this setting. Secondly, their methodology employs a discrete version of AdaBoost, where each weak learner outputs binary values in {−1,1}. As already stated, we make use of an improved version of AdaBoost that relies on confidence-rated predictions (Schapire and Singer 1999). Finally, the authors reported performances with fixed levels of S = 4 and T = 10. We will investigate the effect of varying S and T levels.
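The overall EasyEnsemble construction of Equation (5) can be sketched as follows; a minimal illustration under our own assumptions (the `train_boosted` callback stands in for Algorithm 3 and must return (alpha, hypothesis) pairs):

```python
import random

def easy_ensemble(majority, minority, S, train_boosted, rng=random.Random(0)):
    """EasyEnsemble sketch (cf. Liu et al 2009): draw S random balanced
    subsets of the majority class, boost a learner on each subset together
    with all minority instances, and pool every weak hypothesis into one
    final ensemble H(x)."""
    ensemble = []
    for _ in range(S):
        subset = rng.sample(majority, len(minority))   # balanced subset
        ensemble.extend(train_boosted(subset, minority))

    def H(x):
        score = sum(alpha * h(x) for alpha, h in ensemble)
        return (1 if score >= 0 else -1), score
    return H
```

Each subset is only twice the minority class size, and the S boosting runs are independent, which is what makes the method fast and parallelizable.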


4 Results and discussion

4.1 Datasets

Stankova et al (2015) provide the first large collection of benchmark behaviour datasets for classification. In our experiments, we make use of these data sources and extend this repository with two additional datasets. Each of these datasets shows a bipartite structure with a clear target variable to predict. We refer to this study and the next paragraph for a short description of the available data resources. In this paragraph, we indicate why we have chosen to include or reject certain data sources from the aforementioned study. The available datasets can be divided into small, medium and large datasets, based on the number of instances and the number of features present. The Norwegian companies and Reality Mining datasets comprise the small datasets, since they contain only a few hundreds of instances or features and are therefore regarded as impractical for our purposes. The MovieLens, Yahoo, TaFeng, Book-Crossing and LibimSeTi datasets belong to the medium-sized datasets, each containing a few thousands up to a few hundreds of thousands of instances and features. All of these datasets are included in our study. The large datasets, containing hundreds of thousands up to millions of instances and features, are the Flickr and Kdd databases12. Other large proprietary data sources, not included in Stankova et al (2015), are the corporate residence fraud (CRF) and banking (Bank) datasets, which arise from real-life application domains with intrinsic imbalance.

To summarize, we have gathered datasets containing such fine-grained behaviour data from a wide variety of application domains. The MovieLens datasets, for which we are predicting the gender13 (Mov G) or the genre thriller14 (Mov Th), provide data on which films each user has rated. The Yahoo movies15 dataset has a similar structure, where the age of each user, Yahoo A (above or below average), or the gender, Yahoo G, is being predicted. The TaFeng dataset16 contains data on shopping behaviour, where age (below or above average) is being predicted based upon which products are purchased. In the Book-Crossing (Book) dataset (Ziegler et al 2005), users rate books and the age of the user (above or below average) is being predicted. LibimSeTi (LST) contains data from a dating site (Brozovsky and Petricek 2007), where users rate each other's profiles and gender is being predicted. In the advertisement (Adver) (Lichman 2013) dataset, we try to predict if a url is an advertisement based on a large variety of binary features of the url. Note that this dataset does not arise from the behaviour of entities, yet it still has a high-dimensional and sparse representation. The Flickr dataset (Cha et al 2009) contains pictures being marked by users as favorite, and we predict the number of comments on each picture (below

12 Flickr and KDD will be excluded in the comparative study of Section 4.6. This is because some methods are too computationally intensive, especially in combination with the large number of possible parameter combinations, to be applied on these very large data sources. Furthermore, our statistical evidence is already sufficiently strong to conclude significance without these datasets. Having said this, these data sources will be included in the analysis of Section 5.

13 MovieLens 1M dataset from http://grouplens.org/datasets/movielens
14 MovieLens 10M dataset from http://grouplens.org/datasets/movielens
15 https://webscope.sandbox.yahoo.com
16 http://www.bigdatalab.ac.cn/benchmark/bm/dd?data=Ta-Feng


or above average). In the Kdd cup data, the performance of a student on a test is being predicted, based on artificially created binary features (Yu et al 2010). The corporate residence fraud dataset (CRF) (Junque de Fortuny et al 2014b) contains data on foreign companies making transactions with specific Belgian companies, where we try to predict whether the foreign company commits residence fraud (a type of fiscal fraud). Finally, the banking dataset (Bank) (Martens et al 2016) contains detailed behaviour on consumers making transactions with merchants or other persons, to predict interest in a pension fund product. Some characteristics of these datasets can be found in Table 2. The features column only shows the number of active features17.

Table 2 Behaviour data characteristics. The final column shows the imbalance ratio p, defined as the ratio of the number of minority class instances to the amount of majority class instances in the training set, expressed as a percentage. See Section 4.2 for details regarding p.

Name     |Xmaj|    |Xmin|    Features   p = 100 × |Xmin|train / |Xmaj|train
Mov G    4331      1709      3706       1 & 25
Mov Th   10546     131       69878      1.24 (p = [])
Yahoo A  6030      1612      11915      1 & 25
Yahoo G  5436      2206      11915      1 & 25
TaFeng   17330     14310     23719      1 & 25
Book     42900     18858     282973     1 & 25
LST      59702     60145     166353     1
Adver    2792      457       1555       16.38 (p = []) & 1
CRF      869071    62        108753     0.0072 (p = [])
Bank     1193619   11107     3139570    0.93 (p = [])
Flickr   8166814   3028330   497472     0.1
Kdd      7171885   1235867   19306083   0.5

4.2 Methodology

Regarding the experiments performed in the upcoming sections, we applied a tenfold cross-validation procedure. Each of the folds contains 80% training data, 10% validation data and 10% test data. Note that these percentages are valid for both the majority class and the minority class (stratified sampling). As can be seen from Table 2, some datasets are balanced in nature. We created artificial imbalance for these datasets by removing minority class instances from the initial training set according to a user-defined parameter p. We ensured that the amount of minority training instances corresponds to p percent of the majority class training size: |Xmin|train = (p/100) × |Xmaj|train. As an example, say that we are using the Book dataset with p = 25. In that case, we know that the majority class contains 34320 training instances (80% of |Xmaj| = 42900). The minority training data would contain 8580 instances (25% of 34320). When the dataset is already imbalanced, we define p = [], which means that no downsampling

17 Active features represent features that are present for at least one instance in the dataset. A non-active feature corresponds with a column of zeros in the matrix representation and would not contribute to the model.


of the minority class training data is performed. Note that the validation and test data are left untouched18.
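The Book example above can be reproduced with a two-line helper (our own illustration; the function name is an assumption):

```python
def minority_train_size(n_maj, n_min, train_frac, p):
    """Artificial imbalance: number of minority training instances given the
    imbalance ratio p (in percent), so that
    |Xmin|_train = (p / 100) * |Xmaj|_train."""
    maj_train = int(n_maj * train_frac)        # e.g. 80% of the majority class
    return int(round(p / 100.0 * maj_train))
```

For the Book dataset with p = 25 this yields 8580 minority training instances, matching the worked example in the text.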

The methods detailed in Section 3 are applied on the training data. For both the under- and oversampling approaches, a linear SVM is trained on the newly created balanced training data, with regularization parameter C taking values

C = [10^-7, 10^-5, 10^-3, 10^-1, 10^0]

The validation data are used for parameter tuning purposes. The test data allow us to obtain the generalization performance. The results reported show the average over ten folds. In the remaining paragraphs of this section, we will describe the various parameter settings used in our experiments and give a brief overview of the parameters occurring in each method.

Considering the oversampling techniques, the parameter settings are as follows:

β = [0, 1/3, 2/3, 1]
prior_opt = {FlipCoin, Reverse Prior}
sim_measure = {Cosine, Jaccard}
K = [10^0, 10^1, 10^2, |X_min|_train]

We didn't include the "Prior" option due to initial experiments showing a lower performance in comparison with the other options. This can be explained by the low priors occurring in each column, resulting in synthetic samples that mainly show zeros in 0-1 match situations. The oversampling with replacement method (OSR) only uses the β parameter, SMOTE uses all but the K parameter, and ADASYN uses all of the parameters listed above.
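A sketch of OSR under the assumption that β is the desired ratio of minority to majority instances after resampling (β = 1 means fully balanced; names are illustrative):

```python
import random

def oversample_with_replacement(X_maj, X_min, beta, seed=0):
    """OSR sketch: duplicate randomly chosen minority instances until
    |X_min| = beta * |X_maj|; beta = 0 leaves the set untouched."""
    target = int(round(beta * len(X_maj)))
    if target <= len(X_min):
        return list(X_min)
    rng = random.Random(seed)
    extra = [rng.choice(list(X_min)) for _ in range(target - len(X_min))]
    return list(X_min) + extra

maj, mino = list(range(100)), list(range(10))
print(len(oversample_with_replacement(maj, mino, 1.0)))    # 100
print(len(oversample_with_replacement(maj, mino, 1 / 3)))  # 33
```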

With respect to the undersampling techniques, the following parameter settings are used:

βu = [0, 1/4, 1/2, 3/4, 1]
sim_measure = {Cosine, Jaccard}
K = [10^0, 10^1, 10^2, |X_min|_train]
Clust_opt = {C_Smallest, C_Largest}

The random undersampling (RUS) technique only uses the βu parameter. The second set of methods, "Closest tot sim" and "Farthest tot sim", uses βu and sim_measure. The third set of techniques, "Closest Knn" and "Farthest Knn", makes use of all the parameters listed above except for Clust_opt. The final approach, CBU, employs βu and Clust_opt.

The boosting variants (AdaBoost, AdaCost and EasyEnsemble) presented in Section 3.3 make use of the following settings:

T = 30
μ = [100, 75]
C = [10^-7, 10^-5, 10^-3, 10^-1]

18 This means that if we start from a balanced set, only the training data will show artificial imbalance according to the imbalance ratio p. The validation and test data would remain balanced. Since AUC (and some other metrics) is independent of class skew, it would be unwise to make these sets imbalanced as well, because that would lead to discarding minority class instances that are relevant for performance assessment.

22 Jellis Vanhoeyveld David Martens

R = [2, 8, R_L], where R_L = |X_maj|_train / |X_min|_train
S = [5, 10, 15]

The AdaBoost algorithm includes the T, μ and C parameters. AdaCost additionally uses cost-ratios R. We have chosen a range of values because misclassification costs are unknown for many business applications (He and Garcia 2009; Fan et al 1999; Sun et al 2007). The final value, R_L, seems to be a popular choice (Akbani et al 2004; Luts et al 2010) because the total weight on the majority class balances with the total weight on the minority class. The final method, EasyEnsemble, uses S subsets in addition to the parameters previously mentioned for AdaBoost. Note that we consider the boosting iteration round t ∈ [1, T] as a tunable parameter.19
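For illustration, a schematic AdaBoost loop on one-dimensional toy data, with a weighted decision stump standing in for the linear SVM base learner used in the paper (AdaCost would additionally rescale the weight updates by a cost factor derived from R):

```python
import math

def train_stump(X, y, w):
    """Weighted decision stump on 1-D inputs: the threshold/polarity pair
    with the lowest weighted error (a toy stand-in for a linear SVM)."""
    best = None
    for thr in sorted(set(X)):
        for pol in (1, -1):
            pred = [pol if x >= thr else -pol for x in X]
            err = sum(wi for wi, p, yi in zip(w, pred, y) if p != yi)
            if best is None or err < best[0]:
                best = (err, thr, pol)
    return best

def adaboost(X, y, T=5):
    n = len(X)
    w = [1.0 / n] * n
    ensemble = []
    for _ in range(T):
        err, thr, pol = train_stump(X, y, w)
        err = max(err, 1e-10)
        if err >= 0.5:
            break
        alpha = 0.5 * math.log((1 - err) / err)
        pred = [pol if x >= thr else -pol for x in X]
        # increase the weight of misclassified instances, then renormalize
        w = [wi * math.exp(-alpha * yi * p) for wi, yi, p in zip(w, y, pred)]
        s = sum(w)
        w = [wi / s for wi in w]
        ensemble.append((alpha, thr, pol))
    return ensemble

def predict(ensemble, x):
    score = sum(a * (pol if x >= thr else -pol) for a, thr, pol in ensemble)
    return 1 if score >= 0 else -1

X = [1, 2, 3, 4, 5, 6, 7, 8]
y = [-1, -1, 1, -1, 1, 1, 1, 1]   # not separable by any single stump
model = adaboost(X, y, T=5)
print(sum(predict(model, x) == yi for x, yi in zip(X, y)))
```

The weight update is the standard AdaBoost.M1 rule; a combination of such weak learners can fit patterns no single stump can.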

4.3 Oversampling

The oversampling techniques presented in Section 3.1 are applied to each of the datasets from Table 2 with varying imbalance levels (by varying the p-parameter). The experiments we conducted use the following methodology: for each dataset, we apply the oversampling techniques with all possible parameter combinations on the training data to create newly balanced datasets, after which linear SVMs are trained. The optimal parameter combination with respect to each imbalance ratio β is selected based on validation set AUC-performance. With these parameters, the AUC on the test data is obtained. Results show the average over ten folds.

The results on four arbitrarily selected datasets are shown in Table 3. Full results on each of the data sources can be found in Appendix A, Table 11. From these tables, we can conclude that performance generally increases with growing β-values. It is more correct to say that performance keeps improving with growing β-levels until an optimal point is reached; increasing the balance level after this optimal value will cause only small fluctuations with respect to the optimal performance. Traditional studies dealing with dense, low-dimensional data note that the OSR method can suffer from overfitting, as already mentioned in Section 3.1. It is interesting to see that in this sparse, high-dimensional setting this effect doesn't seem to occur.20 Furthermore, the computationally expensive synthetic sampling approaches do not seem to improve performance over the simpler OSR method. This can be explained by the fact that many instances have a very limited amount of non-zero elements (Junque de Fortuny et al 2014a; Stankova et al 2015), causing the synthetic sampling procedures to be limited in the amount of unique new samples they can produce. Note also that synthetic sampling procedures were historically designed to overcome the overfitting behaviour of OSR, which no longer seems to apply here.

19 We sometimes make use of T as a symbol to indicate the boosting iteration number instead of t.
20 The SVM regularization parameter C controls for overfitting. Additionally, the influence of majority class noise/outliers (see Section 3.2) on the learned hyperplane decreases as we oversample the minority class. This hyperplane is more sensitive towards minority instances, and a change in its direction/orientation towards these majority class noise/outliers will be more heavily penalized.


Table 3 Oversampling experiments. Results show the average tenfold AUC test set performance with respect to increasing β-values (β = [β1, β2, β3, β4] = [0, 1/3, 2/3, 1]). Additionally, standard deviations are shown between brackets. Optimal performances for each method are highlighted in boldface. See Section 4.2 for parameter settings.

Mov Th (p = [])
         β1            β2            β3            β4
OSR      79.77 (5.33)  85.3 (4.66)   83.16 (4.5)   84.59 (5.69)
SMOTE    79.77 (5.33)  84.18 (6.51)  85.58 (5.97)  84.33 (5.75)
ADASYN   79.77 (5.33)  84.11 (6.77)  85.86 (6.06)  85.36 (5.13)

Yahoo G (p = 25)
         β1            β2            β3            β4
OSR      78.82 (1.39)  78.78 (2.02)  78.74 (1.86)  78.35 (1.47)
SMOTE    78.82 (1.39)  79.23 (1.57)  79.1 (1.2)    79.03 (1.89)
ADASYN   78.82 (1.39)  79.12 (1.43)  79.23 (1.37)  79.51 (2.01)

TaFeng (p = 25)
         β1            β2            β3            β4
OSR      66.94 (1.34)  68.84 (1.21)  66.99 (1.42)  67.7 (1.41)
SMOTE    66.94 (1.34)  68.47 (1.5)   67.07 (1.15)  66.65 (0.81)
ADASYN   66.94 (1.34)  68.62 (1.38)  67.85 (1.6)   66.91 (1.39)

Book (p = 25)
         β1            β2            β3            β4
OSR      60.08 (0.71)  61.05 (0.94)  61.82 (1.12)  64.62 (0.57)
SMOTE    60.08 (0.71)  62.6 (0.73)   60.95 (0.68)  63 (0.8)
ADASYN   60.08 (0.71)  62.33 (0.73)  60.77 (0.85)  63.04 (0.58)

4.4 Undersampling

Regarding the undersampling techniques, we employ a similar experimental set-up as in the previous section: firstly, train SVMs on the undersampled training data with all possible parameter combinations; secondly, choose a suitable parameter combination based on validation set AUC-performance; and finally, obtain the AUC-performance on the test set. The results are averaged across ten folds.

Before going into detail on the experimental findings, we give a short note on the effect of noise/outliers on SVM performance. Consider the following imaginary example: say that a majority class female is rating a lot of action films and thrillers. We can consider this an outlier, since most of the females are rating romantic or drama films. This specific female has far more in common with the minority class males, who also rate action films and thrillers. Outliers are therefore instances showing odd behaviour; noise refers to wrongfully labelled instances. The effect of noise/outliers on SVM performance can be severe. Indeed, many of the instances in the dataset have a support value (dual variable αi) of 0. Instances contributing to the model (a non-zero support value) are examples close to the borderline (Suykens et al 2002) and noise/outliers (as can be derived from the KKT dual-complementarity conditions21).

21 If αi = 0, then yi(w^T xi + b) ≥ 1. For noise/outliers, the term yi(w^T xi + b) is negative, hence αi ≠ 0.


With respect to noise/outliers, the support value can reach a maximal value of C, the chosen regularization parameter in the SVM formulation. It should be clear that lowering the C-value leads to models that are less sensitive to noise/outliers.

The experimental outcomes of the various undersampling processes are shown in Table 4 for the same four previously selected datasets. We refer to Appendix B, Table 12 for results on the entire data repository. When removing only a limited amount of majority class instances (βu = βu2 = 1/4), we observed that in 12 out of 16 datasets the "Farthest Knn" method outperforms the "Closest Knn" technique; the four remaining datasets show comparable performances.22 This finding shows that the "Farthest" method is very suitable for removing majority class noise/outliers and empirically demonstrates their performance degrading effect. With higher undersampling rates (βu = 1), the "Closest Knn" method in general achieves higher performances than the "Farthest Knn" technique, though the results are less clear. We observed that only 9 out of 16 datasets indicate "Closest" as a clear winner, with 3 tie situations and 4 situations where "Farthest" shows the best results. Intuitively, we would expect the "Closest" method to perform better, since we are keeping the most informative instances (examples close to the minority class). However, this technique will also emphasize noise/outliers, and this is the main reason why the results are less significant. One can also see this from the fact that the RUS method shows far better results in comparison to the aforementioned techniques when βu = 1, an observation that is valid for all datasets. The RUS method is efficient in realizing noise/outlier reduction effects, since each SVM is trained with only a fraction of these types of instances. It is remarkable to see that in 9 out of 16 datasets, the RUS method with βu = 1 outperforms the baseline model (βu = 0); in 4 datasets we observed equal performance, and 3 losses with respect to the baseline. RUS shows two attractive features: majority class noise/outlier removal and its ability to put more emphasis on the minority class instances by undersampling the majority class. A general downside of the approach is the information loss, and this effect causes the 3 losses to the baseline. Usually, though, the attractive features of RUS dominate the latter effect, and this indicates that it is definitely not necessary to include all majority class training instances in constructing the predictive model. Apparently, there is a high level of redundancy which can be exploited to construct efficient hypotheses.
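A sketch of RUS, assuming βu interpolates linearly between no undersampling (βu = 0) and a fully balanced set (βu = 1); the paper's exact mapping from βu to the retained majority size may differ:

```python
import random

def random_undersample(X_maj, X_min, beta_u, seed=0):
    """RUS sketch: discard majority instances at random so the retained
    majority size shrinks linearly from |X_maj| down to |X_min|."""
    n_keep = round(len(X_maj) - beta_u * (len(X_maj) - len(X_min)))
    return random.Random(seed).sample(list(X_maj), n_keep)

maj, mino = list(range(1000)), list(range(100))
print(len(random_undersample(maj, mino, 1.0)))   # 100 (balanced)
print(len(random_undersample(maj, mino, 0.25)))  # 775
```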

In the next paragraph, we compare the cluster-based undersampling (CBU) approach to the random undersampling (RUS) technique. Before proceeding, we should note that Table 12 does not show results on the LST, CRF and Bank datasets, due to the fact that the projected unigraph was too large to fit in memory (larger than 16 GB). A limited amount of features (top nodes) does seem to be active for a relatively large number of instances (bottom nodes). All these instances will be connected in the unigraph projection, resulting in relatively high degrees for the bottom nodes. A possible solution to circumvent this problem might be to cut the weakest edges in the projected unigraph.

If we apply a limited amount of undersampling (βu = βu2 = 1/4), the CBU technique outperforms RUS in 8 out of 13 datasets (4 losses and 1 tie with respect to RUS). On the highly imbalanced datasets (p = 1 or p = []), the CBU method wins

22 A tie occurs in the situation where the absolute difference in AUC is smaller than or equal to 0.5.


in 8 out of 8 cases. We can therefore conclude that CBU outperforms RUS at low undersampling rates. This indeed shows that handling the within-class imbalance can be beneficial in this environment. If we make a comparison at the highest undersampling rate, βu = βu5 = 1, the results are completely opposite: CBU wins in only 1 case, with 9 losses with respect to RUS and 3 ties. On the highly imbalanced datasets, CBU outperforms RUS for 1 dataset (with 4 losses and 3 ties). It is clear that RUS is the preferred technique in case of high undersampling rates. The reason is that CBU emphasizes the small disjuncts and basically fails to adequately pick up on the common, larger communities. Furthermore, these small disjuncts might actually correspond with majority class noise/outliers, and CBU is focusing on these types of instances. On the overall level, where we consider all undersampling rates (except βu = 0), CBU wins in 5 out of 13 datasets, RUS wins in 4 cases, and there are 4 tie situations (yet RUS did outperform CBU in all 4 cases). Both methods seem to be competitive with one another across a wide range of imbalance levels. If we pay particular attention to the highly imbalanced sets, CBU wins in 5 cases and there are 3 tie situations on the remaining sets (yet RUS did outperform CBU in all 3 cases). CBU seems to be the method of choice (with respect to RUS) when the datasets show high imbalance levels.

Table 4 Undersampling experiments. Results show the average tenfold AUC test set performance with respect to increasing βu-values (βu = [βu1, βu2, βu3, βu4, βu5] = [0, 1/4, 1/2, 3/4, 1]). Optimal performances (boldface) are highlighted for each imbalance ratio. Additionally, standard deviations are shown between brackets. Cl K represents the "Closest Knn" technique, Cl T the "Closest tot sim" technique (similar for Far K and Far T, see Section 3.2). Parameter settings are given in Section 4.2.

Mov Th (p = [])
        βu1          βu2          βu3          βu4          βu5
RUS     79.77 (5.3)  80.32 (5.8)  81.57 (5.5)  81.86 (6.6)  81.26 (6.2)
Cl K    79.77 (5.3)  79.25 (4.5)  78.07 (5)    76.25 (6.5)  62.46 (8.5)
Cl T    79.77 (5.3)  78.4 (4.4)   72.41 (3.5)  64.66 (4.5)  60.37 (7.3)
Far K   79.77 (5.3)  84.54 (5)    83.64 (6.4)  80.02 (7.3)  56.82 (10.3)
Far T   79.77 (5.3)  85.03 (5.7)  82.68 (6.8)  75.61 (9.2)  56.77 (10.9)
CBU     80.11 (5.8)  81.17 (6)    81.08 (6.5)  84.17 (5.1)  80.96 (6.9)

Yahoo G (p = 25)
        βu1          βu2          βu3          βu4          βu5
RUS     78.82 (1.4)  78.91 (1.6)  78.97 (1.6)  78.61 (1.6)  77.82 (2.1)
Cl K    78.82 (1.4)  77.26 (1.5)  72.52 (1.5)  67.86 (2)    65.07 (2.7)
Cl T    78.82 (1.4)  76.83 (1)    71.99 (1.8)  67.15 (2.3)  61.1 (2.7)
Far K   78.82 (1.4)  78.26 (2.2)  74.69 (2.7)  67.22 (2.1)  60.72 (2.3)
Far T   78.82 (1.4)  77.68 (2.6)  72.44 (3)    64.94 (2.4)  59.6 (2)
CBU     75.25 (3.2)  75.22 (2.4)  74.69 (2.3)  73.07 (2.4)  70.69 (2.4)

TaFeng (p = 25)
        βu1          βu2          βu3          βu4          βu5
RUS     66.94 (1.3)  67.44 (1.3)  68.1 (1.4)   68.27 (1.4)  66.13 (1.2)
Cl K    66.94 (1.3)  66.13 (1.4)  63.39 (1.2)  59.83 (1.3)  56.94 (0.7)
Cl T    66.94 (1.3)  66.38 (1.5)  62.89 (1.6)  57.46 (1.3)  54.56 (1.3)
Far K   66.94 (1.3)  68.06 (1.4)  66.43 (1.6)  64.46 (1.5)  63.35 (1.3)
Far T   66.94 (1.3)  64.31 (1.1)  62.69 (1)    61.27 (1.1)  59.03 (1)
CBU     64.81 (1.2)  64.15 (1.1)  64.13 (1.2)  63.88 (0.8)  63.46 (0.8)

Book (p = 25)
        βu1          βu2          βu3          βu4          βu5
RUS     60.08 (0.7)  60.13 (0.6)  60.4 (0.8)   60.33 (0.8)  63.28 (0.8)
Cl K    60.08 (0.7)  59.96 (0.7)  60.13 (0.8)  59.96 (1)    59.28 (0.7)
Cl T    60.08 (0.7)  59.96 (0.7)  60.13 (0.8)  60.29 (0.4)  54.5 (0.9)
Far K   60.08 (0.7)  63.29 (1)    64.19 (0.8)  57.3 (1.1)   55.66 (1.1)
Far T   60.08 (0.7)  62.14 (0.5)  58.27 (0.6)  56.37 (1)    55.66 (1.1)
CBU     54.82 (0.9)  54.67 (0.9)  54.71 (0.9)  54.66 (1)    54.78 (0.9)

4.5 Boosting variants

In this section, the plain AdaBoost (AB) method is compared with the cost-sensitive AdaCost (AC) algorithm, the EasyEnsemble (EE) technique and the baseline (BL) method (where we train a single SVM on the imbalanced training data). AB, AC and EE have an underlying boosting process in common. The AUC-values are stored during each boosting iteration, which allows us to show the performance graphically as a function of the number of iterations t. In EasyEnsemble (EE), we combine the weak learners of each subset by summing their individual contributions: for example, when t = 2 boosting iterations are performed, we evaluate the performance of the combined learner Σ_{s=1}^{S} Σ_{t=1}^{2} α_st h_st(x). Figures 1, 2, 3 and 4 show the average tenfold AUC-performance (with μ = 100) on the test data. The left-side figures show the AB, AC (with varying R-values) and EE (with varying S-values) performance with respect to the number of boosting iterations; the C-value is tuned according to highest validation set AUC-performance (over all possible boosting iterations). The right-side figures investigate the influence of the C-parameter for AB and EE (with S = 15) and allow us to gain insight into the effect of weak and strong learners on the generalization performance. Results on the remaining datasets are given in Appendix C, Figures 6-17. Note that we only indicate results with weight-percentage μ = 100 (use all instances in the training process); previous experiments (with μ = 75) showed inferior results. This suggests that it is indeed important to include all minority instances during the training of each weak learner, an observation also made by Liu et al (2009).

When investigating the influence of the regularization parameter C in AdaBoost, we can graphically verify that it is a "weakness" indicator, in the sense that lower C-values correspond to weaker learners. While these low C-value learners have a low performance in the first rounds of boosting, the possible boost in performance is usually much higher compared to higher C-value inducers. Strong learners can already make a suitable distinction between minority and majority class behaviours in the first boosting round. The boosting process will emphasize the hard to learn instances, among which are noise/outliers. Therefore, too strong learners are not suitable for use in a boosting process, as already noted by Wickramaratna et al (2001). In Figures 2(b) and 3(b), we observe that the lowest C-values (C = 10^-7, 10^-5) can outperform higher C-values (C = 10^-3, 10^-1). In many cases, the AB-process outperforms the baseline (BL); see also the results in Table 5. A combination of several weak learners can often achieve better generalization than a single strong learner.

Inspired by the success of the RUS method with βu = 1 (for reasons indicated in Section 4.4), the EE-technique tries to overcome an obvious downside of this approach: we are removing a lot of majority class instances that can be very informative. Hence, we try to make up for this loss by combining several of these balanced training sets, in the hope that the majority class is more adequately explored. This results in a very effective learner that seems to outperform the AB, AC and BL methods in the majority of cases, as can be seen from these figures. It is remarkable that we don't need to use that many subsets S to achieve high performances: increasing the S-value from 5 to 15 doesn't seem to cause major increases in AUC (see also Table 5). The influence of the C-parameter is similar as described in the previous paragraph.
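The sampling-and-combining core of EasyEnsemble can be sketched as follows; a toy nearest-class-mean scorer on 1-D data stands in for the boosted SVM trained on each balanced subset (in the paper, every subset is additionally run through AdaBoost):

```python
import random

def centroid_scorer(pos, neg):
    """Toy per-subset learner: score is positive when x lies closer to the
    minority (pos) mean than to the sampled majority (neg) mean."""
    mp = sum(pos) / len(pos)
    mn = sum(neg) / len(neg)
    return lambda x: abs(x - mn) - abs(x - mp)

def easy_ensemble(X_min, X_maj, S=10, seed=0):
    rng = random.Random(seed)
    learners = []
    for _ in range(S):
        subset = rng.sample(X_maj, len(X_min))   # balanced random subset
        learners.append(centroid_scorer(X_min, subset))
    # combined hypothesis: sum of the per-subset scores
    return lambda x: sum(h(x) for h in learners)

minority = [8.0, 9.0, 10.0]
majority = [float(i) for i in range(100)]
score = easy_ensemble(minority, majority, S=10)
print(score(9.2) > score(50.0))  # minority-like points score higher: True
```

Each subset is only as large as the minority class, which is why the method remains fast and parallelizable.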

Fig. 1 Mov G (p = 25) dataset results showing average tenfold AUC-performance on test data (with μ = 100) for (a) AB, AC and EE with C chosen according to highest validation set AUC-performance (over all possible T); (b) AB and EE (S = 15) with varying C-levels.


Fig. 2 Book (p = 25) dataset.

Fig. 3 TaFeng (p = 25) dataset.

Fig. 4 Bank (p = []) dataset.


4.6 Final comparison

4.6.1 Performance-wise comparison

In this section, we wish to determine which of the aforementioned methods shows superior performance in terms of AUC on test data. The experimental set-up is described in Section 4.2. Note that we exclude β = 0 and βu = 0 in determining the performance of the oversampling, respectively undersampling, techniques, to be able to compare them with the baseline (BL) approach.23 The results for AB, AC and EE are shown for μ = 100. The number of boosting iterations, t ∈ [0, T], is also considered as a tunable parameter. Table 5 shows the final comparison between all methods. Additionally, we added an average rank column indicating the mean rank of each algorithm across all datasets. The rank of an algorithm for a specific dataset corresponds to its position in the sorted AUC-performance spectrum, where the best performing algorithm gets the rank of 1, the second best rank 2, etc. In case two or more algorithms perform equally well, average ranks are assigned (if two algorithms have ranks of 3 and 4 respectively and show equal performance, then both methods get assigned a rank of 3.5). Note that the LST dataset is excluded from the calculation of average ranks, since we consider this a dataset not affected by imbalance and showing equal performances.
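The tie-handling rank assignment can be sketched as (illustrative helper, not the paper's code):

```python
def average_ranks(aucs):
    """Rank algorithms on one dataset: the best AUC gets rank 1; tied
    algorithms all receive the average of the ranks they would occupy."""
    order = sorted(range(len(aucs)), key=lambda i: -aucs[i])
    ranks = [0.0] * len(aucs)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and aucs[order[j + 1]] == aucs[order[i]]:
            j += 1
        avg = ((i + 1) + (j + 1)) / 2   # mean of positions i+1 .. j+1
        for idx in order[i:j + 1]:
            ranks[idx] = avg
        i = j + 1
    return ranks

# Two algorithms tied at positions 3 and 4 both receive rank 3.5:
print(average_ranks([80.0, 75.0, 70.0, 70.0, 60.0]))  # [1.0, 2.0, 3.5, 3.5, 5.0]
```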

As a side note, we noticed that increasing the number of minority training set instances can significantly improve performances. We can see this clearly by comparing the p = 25 outcomes with the results of p = 1 on the same dataset. Junque de Fortuny et al (2014a) already showed empirically that larger behaviour datasets (in terms of the number of instances or features) lead to higher AUC-performances. This situation is confirmed in our experiments, though we want to add that this is only true when the data size is increased in a balanced fashion.

The results in Table 5 confirm our previous analysis. Each of the oversampling, undersampling (except Cl Knn) and boosting variants seems to improve upon the BL performance. Sobhani et al (2015) note that the direct application of ensemble techniques on imbalanced data does not circumvent the imbalanced learning issue; in our experiments on behaviour data, we do find a simple boosting procedure applied to the imbalanced data to improve upon the predictive performance. The oversampling techniques OSR, SMOTE and ADASYN show similar outcomes and are competitive methods, seeming to outperform the undersampling techniques and the plain boosting (AB) and cost-sensitive learning (AC) approaches. When dealing with traditional data, the performances of oversampling techniques are generally worse than those of undersampling approaches (Drummond and Holte 2003). It is interesting to observe that for these large and sparse behaviour datasets, the opposite is true. This is heavily related to the intrinsic properties of the data under consideration. Undersampling (or feature selection, for that matter) is not recommended because, as we have noted in Section 2.1, each feature provides a small additional contribution to the predictive performance. With respect to AC, the results indicate that R = R_L is a better choice compared to more random cost ratios R = 2, 8. The EE-technique has the

23 The BL technique trains single SVMs on the imbalanced training data.


lowest average rank and is our best performing method. The combination of several balanced subsets with filtered noise/outlier contributions allows one to explore the majority class space efficiently. To strengthen our previous statements, we will in the next paragraphs conduct statistical significance tests in accordance with the procedure outlined in Demšar (2006). The latter paper presents a framework to perform statistical comparisons of classifiers over multiple datasets.

The first null-hypothesis we try to reject postulates that all algorithms perform equally well or, equivalently, that their average ranks R_j (see Table 5) are indifferent. Under the null-hypothesis, the Friedman statistic (Friedman 1937)

χ²_F = [12N / (k(k+1))] [ Σ_{j=1}^{k} R_j² - k(k+1)²/4 ],   (6)

with N the number of datasets (N = 15) and k the number of algorithms (k = 13), can be used to compute the statistic (Iman and Davenport 1980)

F_F = (N-1) χ²_F / (N(k-1) - χ²_F).   (7)

The latter is distributed according to the F-distribution with k-1 and (k-1)(N-1) degrees of freedom. Evaluating this measure with the average ranks shown in Table 5 results in F_F = 22.98, which is a lot higher than the critical value of 1.81 at the α = 0.05 significance level. We can therefore reject the null-hypothesis and proceed with post-hoc tests.
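Equations (6) and (7) can be verified numerically from the average ranks in Table 5 (the tiny deviation from the reported 22.98 comes from the ranks being rounded to three decimals):

```python
# Average ranks from Table 5; N = 15 datasets, k = 13 algorithms.
R = [11.600, 5.000, 4.533, 4.800, 8.167, 12.467, 8.133,
     8.567, 8.267, 8.467, 5.400, 3.267, 2.333]
N, k = 15, 13

chi2_F = 12 * N / (k * (k + 1)) * (sum(r * r for r in R) - k * (k + 1) ** 2 / 4)
F_F = (N - 1) * chi2_F / (N * (k - 1) - chi2_F)
print(round(F_F, 2))  # 22.99, matching the reported 22.98 up to rounding
```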

The Nemenyi test (Nemenyi 1963) allows one to compare each classifier to one another and adjusts the critical value to compensate for making k(k-1)/2 comparisons.24 "The performance of two classifiers is significantly different if the corresponding average ranks differ by at least the critical difference (CD)" (Demšar 2006). We refer to the aforementioned paper for the calculation of the CD. The result of these comparisons at the α = 0.05 significance level is shown in Table 6. This test is already quite revealing, in the sense that, for instance, the oversampling methods (OSR, SMOTE and ADASYN), AC (R = R_L) and EE-techniques are significantly different from the BL. Yet it does not indicate that undersampling or applying plain boosting (AB) results in significantly different performances compared to the BL. The Nemenyi test is still conservative, and we therefore compare all classifiers to a single control classifier in the next paragraphs. Note that we only make k-1 comparisons in that situation, resulting in stronger results compared to the Nemenyi test.

When comparing a single classifier i with a control classifier c, the null-hypothesis states that the two classifiers i and c perform equally well. Under this hypothesis, the test statistic (based on the average ranks R_i shown in Table 5)

z = (R_i - R_c) / sqrt(k(k+1)/(6N))   (8)

is distributed according to a standard normal distribution. This z-value is used to find the corresponding probability (p-value) p from the cumulative standard normal

24 The larger the amount of comparisons, the higher the proportion of null-hypotheses that are wrongfully rejected due to random chance.


Table 5 Final comparison of all methods in terms of AUC test set performance; see Section 4.2 for details on the experimental set-up. β ≠ 0 and βu ≠ 0 for the oversampling, respectively undersampling, techniques; μ = 100 for each of the boosting variants. Standard deviations are given between brackets and the performance difference with the BL is shown in square brackets.

              Mov G (p = 1)       Mov G (p = 25)      Mov Th (p = [])     Yahoo A (p = 1)
BL            71.6 (2.62) [0]     81.41 (1.32) [0]    79.77 (5.33) [0]    55.92 (2.97) [0]
OSR           75.35 (2.27) [3.8]  83.76 (2.09) [2.3]  85.13 (6.1) [5.4]   60.05 (2.71) [4.1]
SMOTE         76.16 (2.27) [4.6]  83.7 (2.1) [2.3]    85.67 (4.98) [5.9]  60.1 (3) [4.2]
ADASYN        76.07 (2.26) [4.5]  83.63 (2.04) [2.2]  85.65 (5.6) [5.9]   59.9 (2.99) [4]
RUS           72.88 (2.73) [1.3]  81.52 (2.15) [0.1]  82.91 (7.19) [3.1]  57.04 (1.77) [1.1]
Cl Knn        71.43 (1.36) [-0.2] 80.88 (1.19) [-0.5] 78.87 (4.71) [-0.9] 55.78 (2.71) [-0.1]
Far Knn       71.9 (2.95) [0.3]   80.9 (1.48) [-0.5]  84.07 (4.64) [4.3]  57.2 (1.33) [1.3]
CBU           74.17 (2.36) [2.6]  81.51 (1.04) [0.1]  82.76 (7.22) [3]    58.77 (3.43) [2.8]
AB            71.65 (1.73) [0.1]  84.52 (1.89) [3.1]  82.43 (5.18) [2.7]  58.35 (2.62) [2.4]
AC (R = 2,8)  71.61 (2.46) [0]    83.46 (1.82) [2]    83.27 (5.6) [3.5]   57.72 (2.47) [1.8]
AC (R = R_L)  74.65 (2.7) [3.1]   83.35 (2.09) [1.9]  85.41 (4.49) [5.6]  59.47 (2.33) [3.5]
EE (S = 10)   76.04 (2.66) [4.4]  85.05 (1.85) [3.6]  86.1 (5.78) [6.3]   59.66 (3.13) [3.7]
EE (S = 15)   76.12 (2.88) [4.5]  85.14 (1.86) [3.7]  86.42 (5.86) [6.7]  59.76 (2.93) [3.8]

              Yahoo A (p = 25)    Yahoo G (p = 1)     Yahoo G (p = 25)    TaFeng (p = 1)
BL            61.68 (2.42) [0]    66.84 (3.66) [0]    78.82 (1.39) [0]    55.75 (1.6) [0]
OSR           64.59 (3.12) [2.9]  73.08 (2.96) [6.2]  78.52 (2.01) [-0.3] 61.21 (2.24) [5.5]
SMOTE         65.56 (3.33) [3.9]  73.11 (3.12) [6.3]  79.01 (1.21) [0.2]  61.72 (1.81) [6]
ADASYN        65.13 (3.38) [3.4]  73.22 (3.17) [6.4]  79.74 (1.68) [0.9]  61.68 (1.86) [5.9]
RUS           64.11 (2.8) [2.4]   70.65 (3.39) [3.8]  78.91 (1.55) [0.1]  59.25 (2.18) [3.5]
Cl Knn        61.14 (2.13) [-0.5] 66.34 (3.54) [-0.5] 77.26 (1.46) [-1.6] 55.77 (1.28) [0]
Far Knn       63.96 (3.03) [2.3]  66.97 (3.54) [0.1]  78.26 (2.2) [-0.6]  59.98 (1.26) [4.2]
CBU           62.27 (1.79) [0.6]  71.27 (2.89) [4.4]  75.22 (2.42) [-3.6] 58.4 (1.57) [2.6]
AB            63.88 (2.67) [2.2]  68.9 (2.03) [2.1]   79.01 (1.66) [0.2]  56.21 (1.79) [0.5]
AC (R = 2,8)  64.32 (3.56) [2.6]  68.89 (3.11) [2]    78.99 (1.89) [0.2]  56.33 (1.83) [0.6]
AC (R = R_L)  64.31 (3.03) [2.6]  73.13 (2.8) [6.3]   78.41 (2) [-0.4]    61.6 (2.26) [5.9]
EE (S = 10)   66.51 (3.24) [4.8]  72.61 (3.15) [5.8]  80.52 (1.6) [1.7]   61.2 (1.82) [5.4]
EE (S = 15)   66.36 (3.18) [4.7]  73.48 (2.32) [6.6]  80.54 (1.56) [1.7]  61.13 (1.83) [5.4]

              TaFeng (p = 25)     Book (p = 1)        Book (p = 25)       LST (p = 1)
BL            66.94 (1.34) [0]    52.6 (1.29) [0]     60.08 (0.71) [0]    99.99 (0.01) [0]
OSR           68.77 (1.23) [1.8]  55.87 (1.42) [3.3]  64.62 (0.57) [4.5]  99.99 (0.01) [0]
SMOTE         68.47 (1.5) [1.5]   55.07 (0.88) [2.5]  62.96 (0.82) [2.9]  99.99 (0.01) [0]
ADASYN        68.48 (1.47) [1.5]  55.04 (0.91) [2.4]  63.02 (0.57) [2.9]  99.99 (0.01) [0]
RUS           68.28 (1.39) [1.3]  54.26 (0.92) [1.7]  63.28 (0.8) [3.2]   99.98 (0.01) [0]
Cl Knn        66.13 (1.43) [-0.8] 52.69 (1.3) [0.1]   60.02 (0.79) [-0.1] 99.99 (0.01) [0]
Far Knn       68.06 (1.41) [1.1]  56.25 (1.52) [3.7]  64.15 (1.12) [4.1]  99.98 (0.01) [0]
CBU           63.84 (1.07) [-3.1] 53.75 (1.01) [1.2]  54.68 (0.88) [-5.4] []
AB            67.65 (1.55) [0.7]  54.27 (1.95) [1.7]  65 (0.67) [4.9]     99.99 (0.01) [0]
AC (R = 2,8)  69.31 (1.23) [2.4]  53.72 (1) [1.1]     61.24 (0.8) [1.2]   99.98 (0.01) [0]
AC (R = R_L)  67.15 (1.51) [0.2]  55.73 (1.22) [3.1]  64.6 (0.64) [4.5]   99.99 (0.01) [0]
EE (S = 10)   70.3 (1.35) [3.4]   55.09 (1.29) [2.5]  65.37 (0.61) [5.3]  99.98 (0.01) [0]
EE (S = 15)   70.4 (1.3) [3.5]    55.35 (1.26) [2.8]  65.4 (0.51) [5.3]   99.98 (0.01) [0]


Table 5 Continued. Additionally, an average rank column is added, showing the mean rank of each algorithm across all datasets. Note that we excluded the LST dataset for this purpose.

              Adver (p = [])      Adver (p = 1)       CRF (p = [])         Bank (p = [])
BL            96.61 (1.82) [0]    90.93 (3.02) [0]    64.06 (16.43) [0]    66.82 (0.88) [0]
OSR           96.93 (1.91) [0.3]  93.3 (2.02) [2.4]   80.74 (12.93) [16.7] 71.39 (0.79) [4.6]
SMOTE         97.05 (1.66) [0.4]  93.35 (2.01) [2.4]  78.7 (16.56) [14.6]  []
ADASYN        96.91 (1.95) [0.3]  93.46 (2.21) [2.5]  78.87 (16.71) [14.8] []
RUS           96.81 (1.87) [0.2]  92.38 (2.51) [1.5]  83.98 (5.99) [19.9]  69.41 (1.19) [2.6]
Cl Knn        96.4 (1.48) [-0.2]  89.73 (3.42) [-1.2] 76.63 (16.19) [12.6] 66.17 (0.72) [-0.6]
Far Knn       95.77 (1.81) [-0.8] 93.88 (1.78) [3]    83.75 (13.11) [19.7] 66.95 (0.56) [0.1]
CBU           97.15 (1.88) [0.5]  94.18 (2.3) [3.3]   []                   []
AB            97.34 (2.18) [0.7]  91.39 (3.23) [0.5]  77.62 (15.15) [13.6] 66.82 (0.88) [0]
AC (R = 2,8)  97.44 (1.93) [0.8]  91 (3.35) [0.1]     68.31 (14.93) [4.2]  67.67 (0.71) [0.9]
AC (R = R_L)  97.46 (1.71) [0.8]  93.51 (2.17) [2.6]  85.08 (9.77) [21]    70.7 (0.8) [3.9]
EE (S = 10)   97.64 (1.35) [1]    92.97 (2.75) [2]    86.18 (10.17) [22.1] 71.46 (0.81) [4.6]
EE (S = 15)   97.63 (1.35) [1]    93.3 (2.14) [2.4]   86.35 (9.99) [22.3]  71.54 (0.76) [4.7]

              Average Rank
BL            11.600
OSR           5.000
SMOTE         4.533
ADASYN        4.800
RUS           8.167
Cl Knn        12.467
Far Knn       8.133
CBU           8.567
AB            8.267
AC (R = 2,8)  8.467
AC (R = R_L)  5.400
EE (S = 10)   3.267
EE (S = 15)   2.333

Table 6 Nemenyi test outcomes at the α = 0.05 significance level. A value of 1 indicates the null-hypothesis is rejected and thus finds the two algorithms significantly different (a value of 0 means we accept the null-hypothesis). Methods are listed in the same order as Table 5 and are more concisely represented.

BL RO SM AD RU Cl Fa CBU AB AC1 AC2 EE1 EE2

BL 0 1 1 1 0 0 0 0 0 0 1 1 1RO 1 0 0 0 0 1 0 0 0 0 0 0 0SM 1 0 0 0 0 1 0 0 0 0 0 0 0AD 1 0 0 0 0 1 0 0 0 0 0 0 0RU 0 0 0 0 0 0 0 0 0 0 0 1 1Cl 0 1 1 1 0 0 0 0 0 0 1 1 1Fa 0 0 0 0 0 0 0 0 0 0 0 1 1

CBU 0 0 0 0 0 0 0 0 0 0 0 1 1AB 0 0 0 0 0 0 0 0 0 0 0 1 1AC1 0 0 0 0 0 0 0 0 0 0 0 1 1AC2 1 0 0 0 0 1 0 0 0 0 0 0 0EE1 1 0 0 0 1 1 1 1 1 1 0 0 0EE2 1 0 0 0 1 1 1 1 1 1 0 0 0


distribution function: p = 2 × min(Φ(z), 1 - Φ(z)). Holm's method (Holm 1979) compares k-1 classifiers to a single control classifier and sorts the corresponding p-values in ascending order, so that p1 ≤ p2 ≤ ... ≤ p(k-1). Each pi is subsequently compared to its associated confidence level25 αcomp = α/(k-i). Holm starts with performing the check p1 < α/(k-1) and rejects the null-hypothesis if the test passes. It then proceeds with a similar check for i = 2 and continues until a certain null-hypothesis cannot be rejected. The remaining hypotheses are retained.
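A numerical sketch of this procedure (the `two_sided_p` and `holm` helpers are illustrative); the z-value for EE(S = 15) against the BL control reproduces the first row of Table 7 up to rank rounding:

```python
import math

def two_sided_p(z):
    """p = 2 * min(Phi(z), 1 - Phi(z)), via the complementary error function."""
    return math.erfc(abs(z) / math.sqrt(2))

def holm(sorted_p, alpha=0.05, k=13):
    """Holm step-down: compare the i-th smallest p-value against
    alpha / (k - i) and stop at the first non-rejection."""
    rejected = []
    for i, p in enumerate(sorted_p, start=1):
        if p < alpha / (k - i):
            rejected.append(p)
        else:
            break
    return rejected

# z-statistic for EE(S = 15) (rank 2.333) vs the BL control (rank 11.600):
N, k = 15, 13
z = (2.333 - 11.600) / math.sqrt(k * (k + 1) / (6 * N))
print(round(z, 3), two_sided_p(z))  # -6.517 and ~7.2E-11 (cf. Table 7)
```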

Table 7 shows the result of the Holm test at the α = 0.05 significance level, with the BL algorithm as control classifier. The table lists each method according to sorted p-values and shows the z-, p- and αcomp-values. The significance column indicates whether the proposed method is significantly different from the BL and coincidentally matches the result of the Nemenyi test. The p-values of the (Far Knn, RUS, CBU) undersampling methods, AB and AC (R = 2, 8) are only slightly higher than the level required to reject the null-hypothesis. We included a critical significance level αcrit, corresponding to the lowest possible significance level at which the method would be considered significantly different from the BL (if α = αcrit, then p = αcomp). In the case of AB, for example, if we were to choose a significance level α = 0.0764 (a confidence level of 92.36%), the associated p-value would be smaller than αcomp and we would proceed to conclude26 that AB performs significantly different from the BL. To summarize: the null-hypothesis of equal performance between the BL and an alternative method proposed in Section 3 (except Cl Knn) is rejected with a confidence level of more than 91% according to Holm's test. Hence, the performance of the BL approach is significantly worse than that of each of the proposed over- and undersampling methods (except Cl Knn), cost-sensitive learning techniques (AC) and boosting variants (AB, EE).

Similar to the previous paragraph, Table 8 presents the results of Holm's test at the α = 0.05 significance level with EE(S = 15) as control classifier. The results coincidentally correspond with the result of the Nemenyi test and indicate that the oversampling (OSR, SMOTE and ADASYN), AC(R = RL) and EE(S = 10) methods are not statistically different from the EE(S = 15) technique at the 0.05 significance level. However, if we were to adapt the significance level to approximately α = 0.25 (a 75% confidence level), the results would conclude significance (except for EE(S = 10)). To summarize: the null-hypothesis of equal performance between EE(S = 15) and an alternative method proposed in Section 3 (except EE(S = 10)) is rejected with a confidence level of more than 75% according to Holm's test. Hence, the performance of the oversampling, undersampling, AB and AC approaches is significantly worse than that of EE(S = 15). We can therefore safely conclude EE to be the preferred method in terms of AUC performance when dealing with imbalanced behaviour data.

²⁵ αcomp adjusts the value of α to compensate for multiple comparisons.
²⁶ This is not entirely true, since Holm's method requires Far Knn to pass the test first before proceeding, which only occurs at the 0.088662 significance level.


Table 7 Holm test at the α = 0.05 significance level with BL as reference. The table shows the z test statistic with associated p-value, and αcomp = α/(k − i), with k the number of algorithms (k = 13) and i corresponding to the position of the method in the sorted p-value vector. Since we already ranked the p-values in ascending order, i takes on the row number (e.g. OSR has i = 5). The column "significant" denotes if we can reject the null-hypothesis (significant if p < αcomp). αcrit corresponds with the smallest possible significance level where we would decide to reject the null-hypothesis (αcrit = α · p/αcomp).

             z          p          αcomp      significant  αcrit

EE(S = 15)   -6.51642   7.2E-11    0.004167   1            8.64E-10
EE(S = 10)   -5.86009   4.63E-09   0.004545   1            5.09E-08
SMOTE        -4.96936   6.72E-07   0.005      1            6.72E-06
ADASYN       -4.78183   1.74E-06   0.005556   1            1.56E-05
OSR          -4.64119   3.46E-06   0.00625    1            2.77E-05
AC(R = RL)   -4.35991   1.3E-05    0.007143   1            9.11E-05
Far Knn      -2.4378    0.014777   0.008333   0            0.088662
RUS          -2.41436   0.015763   0.01       0            0.078815
AB           -2.34404   0.019076   0.0125     0            0.076305
AC(R = 28)   -2.20339   0.027567   0.016667   0            0.082701
CBU          -2.13307   0.032919   0.025      0            0.065837
Cl Knn        0.609449  0.542227   0.05       0            0.542227

Table 8 Holm test at the α = 0.05 significance level with EE(S = 15) as reference

             z          p          αcomp      significant  αcrit

Cl Knn       7.12587    1.03E-12   0.004167   1            1.24E-11
BL           6.516421   7.2E-11    0.004545   1            7.92E-10
CBU          4.383348   1.17E-05   0.005      1            0.000117
AC(R = 28)   4.313027   1.61E-05   0.005556   1            0.000145
AB           4.172384   3.01E-05   0.00625    1            0.000241
RUS          4.102063   4.09E-05   0.007143   1            0.000287
Far Knn      4.078623   4.53E-05   0.008333   1            0.000272
AC(R = RL)   2.156513   0.031044   0.01       0            0.155218
OSR          1.875229   0.060761   0.0125     0            0.243045
ADASYN       1.734587   0.082814   0.016667   0            0.248442
SMOTE        1.547064   0.121848   0.025      0            0.243696
EE(S = 10)   0.65633    0.511612   0.05       0            0.511612

4.6.2 Timing-wise comparison

The performance of each of the proposed algorithms in terms of training times depends highly upon the chosen parameters. The β and βu parameters control the number of training instances and are an important contribution to the subsequent SVM learning times for each of the over- and undersampling techniques. The number of boosting iterations t ∈ [0, T] is also a major factor in each of the boosting variants proposed in Section 3.3. Furthermore, the chosen regularization parameter C highly influences the training times of the SVM inducer: higher C-values result in larger training times (Fan et al 2008). Not only do the chosen hyperparameters affect learning times; the data characteristics, such as the number of minority instances, the imbalance level, the number of features, the amount of overlapping etc., have a major effect.


In comparing each of the methods outlined in Section 3, we make use of a similar methodology as previously presented (see Section 4.6.1). For each of the ten folds, a suitable parameter combination is selected based on optimal validation set AUC-performance. These chosen parameters are applied in determining the timings of each method. The final results indicate the average timings across ten folds. Hence, we show the required amount of time for each method when using the parameter combination resulting in the best performance. Note that this might, for instance, result in using β = 1/3 in case of OSR, or using only t = 5 boosting iterations in EE. Computational timings are shown in Table 9.
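The per-fold protocol above can be sketched as follows. The `train_fn`/`auc` interface is hypothetical (a stand-in for the paper's LIBLINEAR-based training routines); only the select-then-time logic is the point.

```python
import time

def timed_best_config(train_fn, configs, folds):
    """Per fold: pick the configuration with the best validation-set AUC,
    then record the training time of that winning configuration; report
    the average over folds. `train_fn(train_data, config)` must return a
    model exposing an `auc(validation_data)` method (assumed interface)."""
    timings = []
    for train_data, val_data in folds:
        # model selection on the validation set (not timed)
        best = max(configs, key=lambda c: train_fn(train_data, c).auc(val_data))
        # timed run with the chosen parameter combination
        start = time.perf_counter()
        train_fn(train_data, best)
        timings.append(time.perf_counter() - start)
    return sum(timings) / len(timings)
```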

As one can observe from the aforementioned table, CBU is the slowest method; it relies on computationally intensive clustering procedures. The synthetic SMOTE and ADASYN oversampling approaches are also very time-consuming methods: they both rely on computationally expensive nearest neighbour computations, in conjunction with a synthetic sample generation procedure that requires a single pass over all minority training instances.

The RUS method shows, as expected, superior results in terms of computational timings. Other undersampling methods perform quite well, though they are slower than BL and OSR due to their inherent nearest neighbour computations.

The boosting variants seem to perform worse than the BL, OSR and each of the undersampling methods. Since EE was the best method in terms of AUC-performance (see Section 4.6.1), we focus on this learner in this paragraph. For the medium sized datasets (all but CRF and Bank), OSR outperforms EE(S = 15), yet for the large datasets (CRF and Bank) the opposite is true. A SVM is known to have a quadratic computational complexity in the number of training instances (Suykens et al 2002; García and Lozano 2007). For this reason, we consider OSR (or any other oversampling technique) as inappropriate for large datasets. Since with EE each subset has a size limited to twice the minority class training size, we expect increasing benefits of the EE-technique for larger and more imbalanced behaviour data. As Junqué de Fortuny et al (2014a) have observed, larger behaviour data, in terms of the number of instances or features, also contributes to an increase in predictive power. Hence, we expect to see this type of big behaviour data more often in real-life applications, where computational resources can become a major constraint. Because the EE-technique is easily parallelizable, by simultaneously performing the boosting process for each of the independent subsets, we added an extra row in Table 9: EE par corresponds with the time needed for EE(S = 15) divided by 15, indicating the time that would be required for EE if each subset were trained in parallel. We can see that EE par can outperform OSR in 8 out of 16 cases, precisely for the datasets that show a high imbalance level (e.g. p = 1 or p = []).
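The subset construction that makes EE parallelizable can be sketched as follows. The returned dict is a stub for the confidence-rated SVM/LR booster trained on each subset in the paper; the key point is that the S members are mutually independent, so a process pool can train them concurrently (the EE par row).

```python
import random

def train_member(task):
    """One EasyEnsemble member: all minority indices plus an equally sized
    random draw from the majority class (subset size = 2 x |minority|).
    The returned dict is a stand-in for the boosted learner."""
    y, seed = task
    rng = random.Random(seed)
    minority = [i for i, label in enumerate(y) if label == 1]
    majority = rng.sample([i for i, label in enumerate(y) if label == 0], len(minority))
    return {"seed": seed, "subset": minority + majority}

def easy_ensemble(y, S=15, pool=None):
    """The S subsets are independent, so passing e.g. a multiprocessing.Pool
    as `pool` trains them in parallel; with pool=None they run sequentially."""
    tasks = [(y, s) for s in range(S)]
    mapper = pool.map if pool is not None else map
    return list(mapper(train_member, tasks))
```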

We conclude our comparison of the proposed methods in Figure 5, which shows the average ranks in terms of computational timings (see Table 9) and AUC performance for each of the available data sources. Regarding the ranks with respect to the latter, we added an extra row EE par, having the same AUC-performance as EE(S = 15) for each dataset in Table 5. Ranks are subsequently determined according to the procedure outlined in Section 4.6.1. Additionally, Figure 18 in Appendix D shows the result for the large and highly imbalanced CRF and Bank datasets. We expect the latter type of data to be of major interest, for reasons already indicated.


Table 9 Computational timings (in seconds) averaged across ten folds, using the parameter combination resulting in the highest validation set AUC-performance. First and second best performances are emphasized in boldface (third and fourth best are underlined). EE par is the time required for EE(S = 15) divided by 15.

             Mov G(p = 1)  Mov G(p = 25)  Mov Th(p = [])  Yahoo A(p = 1)

BL           0.032889      0.056697       0.558563        0.026922
OSR          0.055043      0.062802       0.99009         0.044421
SMOTE        0.218821      0.937057       3.841482        0.057726
ADASYN       0.284688      1.802399       5.191265        0.087694
RUS          0.011431      0.025383       0.155224        0.007991
CL Knn       0.046599      0.599846       0.989914        0.037182
Far Knn      0.039887      0.80072        0.683023        0.027788
CBU          1.034111      10.60173       6.822839        1.692477
AB           0.169792      0.841443       3.460246        0.139251
AC(R = 28)   0.471994      2.996585       1.086907        0.366555
AC(R = RL)   0.53376       1.179542       6.065177        0.209015
EE(S = 10)   0.117226      6.065145       1.17995         0.148973
EE(S = 15)   0.20474       7.173737       2.119991        0.180365
EE par       0.013649      0.478249       0.141333        0.012024

             Yahoo A(p = 25)  Yahoo G(p = 1)  Yahoo G(p = 25)  TaFeng(p = 1)

BL           0.092954         0.011915        0.044164         0.026728
OSR          0.027887         0.013241        0.047206         0.040919
SMOTE        1.062686         0.056153        0.883698         0.219553
ADASYN       2.050993         0.079073        1.733367         0.306618
RUS          0.048471         0.003234        0.033423         0.002916
CL Knn       0.84391          0.025404        0.502515         0.092167
Far Knn      0.664124         0.026576        0.500206         0.080159
CBU          15.69442         1.287221        13.55035         2.467279
AB           0.445546         0.078777        0.169977         0.114619
AC(R = 28)   1.034044         0.321723        0.515953         0.926178
AC(R = RL)   0.706215         0.226741        0.112949         0.610233
EE(S = 10)   1.026577         0.100331        1.527146         0.058052
EE(S = 15)   1.607596         0.077483        2.472582         0.10538
EE par       0.107173         0.005166        0.164839         0.007025

             TaFeng(p = 25)  Book(p = 1)  Book(p = 25)  LST(p = 1)

BL           0.032033        0.080035     0.318093      0.652045
OSR          0.032414        0.132927     0.092757      0.87152
SMOTE        5.089283        3.409418     11.43444      4.987705
ADASYN       8.148419        3.689661     12.25441      6.840083
RUS          0.020457        0.022713     0.031972      0.432839
CL Knn       1.713731        0.400873     3.711648      2.508374
Far Knn      1.539437        0.379086     3.988552      2.511037
CBU          26.42686        41.98663     46.31987      []
AB           0.713265        0.61719      1.238585      2.466151
AC(R = 28)   1.234647        1.666131     2.330635      1.451671
AC(R = RL)   0.279047        0.860346     0.197053      1.23763
EE(S = 10)   2.484502        2.145747     7.177484      0.524066
EE(S = 15)   3.363971        2.480066     11.21945      0.784111
EE par       0.224265        0.165338     0.747963      0.052274


Table 9 Continued. Additionally, an extra column indicating average ranks is included (lower ranks are preferred).

             Adver(p = [])  Adver(p = 1)  CRF(p = [])  Bank(p = [])

BL           0.010953       0.002796      0.725911     70.89334
OSR          0.012178       0.006166      3.685813     179.7481
SMOTE        0.123112       0.017764      5.633862     []
ADASYN       0.183767       0.021728      5.768669     []
RUS          0.012115       0.00204       0.147392     5.247441
CL Knn       0.061324       0.005568      1.106755     73.73282
Far Knn      0.079078       0.007069      1.110379     97.59619
CBU          3.378235       3.236754      []           []
AB           0.069199       0.103518      1.153196     83.08618
AC(R = 28)   0.193092       0.068905      2.047434     71.70548
AC(R = RL)   0.107652       0.037963      1.387174     106.3466
EE(S = 10)   0.138485       0.085686      0.198656     24.95117
EE(S = 15)   0.185136       0.139121      0.285345     36.40107
EE par       0.012342       0.009275      0.019023     2.426738

             Average Rank [pos]

BL           2.94   [2]
OSR          4.19   [4]
SMOTE        9.59   [11]
ADASYN       10.91  [13]
RUS          1.38   [1]
CL Knn       6.5    [5]
Far Knn      6.56   [6]
CBU          14     [14]
AB           8.06   [7]
AC(R = 28)   10.81  [12]
AC(R = RL)   9.25   [9]
EE(S = 10)   8.25   [8]
EE(S = 15)   9.56   [10]
EE par       3      [3]

Timings are a substantial concern with respect to these large datasets. As can be observed, the EE-technique thrives for these big and highly imbalanced data, even in its non-parallel form.


[Figure 5: scatter plot omitted. Axes: average rank AUC (horizontal) versus average rank Time (vertical). Legend: BL, OSR, SMOTE, ADASYN, RUS, Cl Knn, Far Knn, CBU, AB, AC(R = 28), AC(R = RL), EE(S = 10), EE(S = 15), EE par.]

Fig. 5 Average rank AUC versus average rank Time (see Table 9) across all datasets from Table 2 (excluding Flickr and Kdd). Regarding the AUC rank, an extra row EE par is added, having the same AUC-value as EE(S = 15). Ranks for each dataset are subsequently obtained via the procedure outlined in Section 4.6.1. Points occurring in the upper-right region are preferred.

5 The effect of the chosen base learner

In this short section, we investigate the influence of the chosen base learner. More specifically, we ask ourselves whether some base inducers can more easily cope with the imbalanced learning issue, and in doing so we verify the conclusion that EE is a suitable technique to elevate upon the baseline performance. We rely on several popular classification methods for behaviour data: two regularization based techniques (SVM, LR) and two heuristic types of approaches (Naive Bayes (NB) and Behavioural Similarity (BeSim)). The next paragraphs briefly introduce these methods. Note that source codes and dataset results are provided in our on-line repository.

LR constructs a linear model that maximizes the likelihood of the observed data by imposing a logistic model for the parametric probability model P(Y|X). Ng and Jordan (2002) provide a thorough introduction on the topic²⁷ and note that in its plain form, LR can suffer from overfitting, especially when the input data is very high dimensional. Therefore, we resort to regularized logistic regression. We opt for L2-regularized LR, since L1-regularization corresponds with a natural way of imposing feature selection (Ng 2004). The LIBLINEAR toolbox (Fan et al 2008) is used to obtain these L2-regularized LR models.
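As a toy illustration of the objective such a solver minimizes (0.5·‖w‖² plus C times the summed logistic loss), the sketch below fits it with plain gradient descent on dense lists. This is purely educational: LIBLINEAR uses far faster solvers and sparse data structures, and the learning rate and epoch count here are arbitrary choices.

```python
import math

def train_l2_logreg(X, y, C=1.0, lr=0.3, epochs=500):
    """Minimal L2-regularized logistic regression via gradient descent.
    Objective: 0.5 * ||w||^2 + C * sum_i log-loss(w.x_i + b, y_i)."""
    n_feat, m = len(X[0]), len(X)
    w, b = [0.0] * n_feat, 0.0
    for _ in range(epochs):
        gw = list(w)                   # gradient of the 0.5*||w||^2 term
        gb = 0.0
        for xi, yi in zip(X, y):
            z = sum(wj * xj for wj, xj in zip(w, xi)) + b
            p = 1.0 / (1.0 + math.exp(-z))
            err = C * (p - yi)         # gradient of the C-weighted log-loss
            for j, xj in enumerate(xi):
                gw[j] += err * xj
            gb += err
        # scale the step by 1/m to keep updates stable on small toy data
        w = [wj - lr * gj / m for wj, gj in zip(w, gw)]
        b -= lr * gb / m
    return w, b

def predict_proba(w, b, x):
    return 1.0 / (1.0 + math.exp(-(sum(wj * xj for wj, xj in zip(w, x)) + b)))
```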

²⁷ For an accessible introduction, see the chapter on "Generative and Discriminative Classifiers: Naive Bayes and Logistic Regression" provided at the following link: http://www.cs.cmu.edu/~tom/NewChapters.html


NB (Ng and Jordan 2002) relies on the use of the Bayes rule, which expresses the posterior P(Y|X) as the product of the likelihood P(X|Y) and the prior P(Y), divided by the evidence P(X). The naivety of the approach originates from assuming conditional independence of the features. We impose a multivariate event model regarding the distribution P(X|Y). Junqué de Fortuny et al (2014a) provide an efficient NB implementation for dealing with large and sparse binary behaviour data.
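A minimal sketch of such a multivariate (Bernoulli) event model on sparse binary data follows. Instances are represented by the set of their active feature indices, so counting stays linear in the number of non-zeros; the scoring trick of starting from the all-zeros log-likelihood and correcting only the active features is a standard sparse-data optimization, not necessarily the cited implementation's exact scheme.

```python
import math
from collections import defaultdict

def train_bernoulli_nb(rows, labels, n_features, alpha=1.0):
    """Multivariate Bernoulli NB; `rows` holds the set of active (value 1)
    feature indices per instance, labels are 0/1."""
    counts = {0: defaultdict(int), 1: defaultdict(int)}
    n = {0: 0, 1: 0}
    for active, y in zip(rows, labels):
        n[y] += 1
        for f in active:
            counts[y][f] += 1
    logp = {}
    for y in (0, 1):
        # Laplace-smoothed P(x_f = 1 | y); smoothing keeps both logs finite
        p1 = [(counts[y][f] + alpha) / (n[y] + 2 * alpha) for f in range(n_features)]
        logp[y] = ([math.log(p) for p in p1], [math.log(1 - p) for p in p1])
    prior = {y: math.log(n[y] / len(labels)) for y in (0, 1)}
    return logp, prior

def nb_score(model, active):
    """Per-class joint log-likelihood of a sparse instance."""
    logp, prior = model
    scores = {}
    for y in (0, 1):
        lp1, lp0 = logp[y]
        s = prior[y] + sum(lp0)        # as if every feature were 0 ...
        for f in active:               # ... then correct only the non-zeros
            s += lp1[f] - lp0[f]
        scores[y] = s
    return scores
```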

The BeSim approach conceptually corresponds to projecting the bigraph to a unigraph, using the procedure outlined in Section 3.2. Next, the weighted-vote relational neighbour classifier (Macskassy and Provost 2007) is used to infer unknown labels/scores through a weighted probability estimation using the known labels of neighbouring nodes. A very fast and scalable implementation of this approach (the SW-transformation) is presented in detail in Stankova et al (2015).
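The idea can be sketched as follows, with plain co-occurrence counts standing in for the actual SW-transformation weights (an assumption for illustration): an unlabelled top node is scored by the label-weighted vote of every labelled top node it shares bottom nodes with, normalised by the total vote weight.

```python
from collections import defaultdict

def besim_scores(adj, labels):
    """Weighted-vote relational neighbour scoring on the implicitly
    projected unigraph. `adj` maps each top node (e.g. a user) to the set
    of bottom nodes (e.g. merchants) it is linked with; `labels` maps the
    labelled top nodes to 0/1. Overlap counts replace the SW weights."""
    # invert the bigraph once: bottom node -> labelled top nodes touching it
    touched = defaultdict(list)
    for u, items in adj.items():
        if u in labels:
            for it in items:
                touched[it].append(u)
    scores = {}
    for u, items in adj.items():
        if u in labels:
            continue
        votes = weight = 0.0
        for it in items:
            for v in touched[it]:      # every shared bottom node casts a vote
                votes += labels[v]
                weight += 1.0
        scores[u] = votes / weight if weight else 0.0
    return scores
```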

In Table 10, we compare the performances of the baseline (BL) application of the methods with their EE-version (with S = 15, T = 20). It is quite striking to see the EE-version dominating the baseline version when looking at each method individually. This suggests that the imbalanced learning issue has an impact on each of the aforementioned techniques. Focussing on the regularization based approaches, we observe the baseline LR to be less influenced by the imbalance than the SVM method. Their EE-versions (EE SVM and EE LR) can boost the baseline performance in almost all cases and have the highest average ranks amongst all the proposed methods. As already emphasized, the regularization based approaches offer an added element of flexibility, in the sense that the strength of the underlying base learner can be controlled.

The heuristic approaches NB and BeSim do not offer this flexibility. Furthermore, it is not directly possible to integrate instance weights during the training phase of the base learner. During the boosting process, we therefore sample from the distribution Dt to generate (unweighted) examples and train a BeSim/NB learner next. The results in Table 10 indicate BL NB to be the worst performing algorithm amongst all methods tried. Due to the conditional independence assumption, one could regard this as a weak learner, which is therefore suitable to be used in a boosting process. Indeed, the EE NB approach is able to significantly improve upon its performance. We found this to be the slowest EE-version²⁸ (e.g. taking more than 24 hours to process a single fold of the Bank dataset, versus several minutes for the other versions). The average rank estimate of EE NB is rather pessimistic, since we assigned the lowest performance across all methods in case of missing results. BeSim can be considered as a stronger learner compared to NB; its possible boost in performance is typically lower than that of NB. Having said this, we found EE BeSim to improve upon BL BeSim for all datasets (except Adver(p = 1)).
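The weight-to-sample workaround mentioned above is a one-liner in spirit: each boosting round draws an unweighted training sample of the original size from the current distribution Dt. A minimal sketch:

```python
import random

def resample_from_distribution(X, y, weights, rng=None):
    """Weight-insensitive base learners (NB, BeSim) cannot consume boosting
    weights directly, so we draw an unweighted sample of size len(X) from
    the current boosting distribution D_t instead."""
    rng = rng or random.Random(0)
    idx = rng.choices(range(len(X)), weights=weights, k=len(X))
    return [X[i] for i in idx], [y[i] for i in idx]
```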

²⁸ This statement only relates to the EE-version. In the baseline case, the timings of the NB-version are roughly of the same order of magnitude as SVM and LR (though this heavily depends on the chosen regularization parameter C). With EE, each subset has a size limited to twice the minority class training size (a relatively low number compared to the size of the majority class). This works in favour of SVM/LR (e.g. SVM has a computational complexity of O(m²), with m the number of training instances). Also note that the NB-version is indeed optimized for sparse data. Internally, the NB-implementation writes a fairly large model file to a specified directory (sizes of 1 GB can be encountered there). This writing operation slows down the training stage (especially in case of EE, where S × T = 15 × 20 = 300 model files are constructed).


Table 10 Average tenfold AUC test set performance of the BL compared to its EE-version (S = 15, T = 20) for SVM, LR, BeSim and NB. In case the EE NB implementation crashed, we put a × in the table. If it is too slow (> 24 hours per fold), we place a []. Additionally, average ranks are assigned to each method. In case EE NB has missing results for a particular dataset, we ensure it has the lowest rank across all other algorithms.

           Mov G(p = 1)  Mov G(p = 25)  Mov Th(p = [])  Yahoo A(p = 1)

BL SVM     71.6 (2.62)   81.41 (1.32)   79.77 (5.33)    56.49 (3.37)
EE SVM     76.12 (2.88)  85.13 (1.86)   86.43 (5.86)    59.74 (2.96)
BL LR      71.02 (2.09)  84.39 (1.84)   83.14 (4.17)    57.84 (2.39)
EE LR      76.69 (2.92)  85.03 (1.98)   86.3 (5.37)     59.79 (2.62)
BL BeSim   76.1 (3.58)   81.3 (2.92)    82.81 (6.6)     56.27 (2.73)
EE BeSim   76.31 (3.71)  81.37 (2.9)    85.02 (6.28)    57.7 (1.71)
BL NB      70.26 (5.84)  77.01 (2.54)   70.48 (10.14)   52.56 (2.09)
EE NB      75.93 (2.83)  85.56 (2.01)   86.91 (4.15)    57.55 (2.73)

           Yahoo A(p = 25)  Yahoo G(p = 1)  Yahoo G(p = 25)  TaFeng(p = 1)

BL SVM     61.61 (2.48)     66.84 (3.66)    78.82 (1.39)     55.75 (1.6)
EE SVM     66.38 (3.16)     73.48 (2.32)    80.55 (1.55)     61.13 (1.83)
BL LR      66.27 (2.96)     69.82 (1.93)    80.45 (1.59)     58.91 (2.31)
EE LR      66.22 (3.28)     73.08 (2.14)    80.53 (1.56)     61.43 (2.32)
BL BeSim   64.54 (2.02)     68.89 (2.49)    79.55 (1.96)     57.89 (1.18)
EE BeSim   65.25 (2.23)     71.18 (2.91)    80.04 (1.85)     59.36 (1.47)
BL NB      65 (1.65)        63.33 (2.56)    78.89 (1.64)     54.61 (1.2)
EE NB      66.6 (2.79)      70.99 (2.88)    81.01 (1.3)      59.01 (1.84)

           TaFeng(p = 25)  Book(p = 1)   Book(p = 25)  LST(p = 1)

BL SVM     66.94 (1.34)    52.6 (1.29)   60.08 (0.71)  99.99 (0.01)
EE SVM     70.4 (1.3)      55.34 (1.28)  65.4 (0.51)   99.98 (0.01)
BL LR      69.24 (1.3)     55.34 (1.27)  63.84 (0.75)  99.99 (0.01)
EE LR      70.28 (1.28)    55.49 (1.49)  65.41 (0.63)  99.97 (0.02)
BL BeSim   67.49 (1.23)    55.19 (1.27)  63.7 (0.63)   99.99 (0.01)
EE BeSim   68 (1.21)       55.21 (1.15)  64.38 (0.42)  99.99 (0)
BL NB      65.21 (1.64)    52.93 (0.9)   59.75 (0.47)  98.69 (0.3)
EE NB      70.72 (1.15)    ×             63.46 (0.61)  99.92 (0.04)

           Adver(p = [])  Adver(p = 1)  CRF(p = [])    Bank(p = [])

BL SVM     96.37 (1.94)   91.18 (2.97)  64.36 (18.97)  66.82 (0.88)
EE SVM     97.63 (1.35)   93.3 (2.14)   86.35 (9.99)   71.54 (0.76)
BL LR      97.19 (1.44)   88.51 (1.93)  81.87 (19.63)  71.43 (0.72)
EE LR      97.57 (0.96)   93.02 (2.06)  86.84 (9.62)   71.77 (0.62)
BL BeSim   97.26 (1.12)   95.38 (1.35)  86.91 (9.36)   67.85 (0.67)
EE BeSim   97.38 (1.04)   93.83 (1.35)  87.02 (10.43)  70.41 (0.55)
BL NB      93.75 (1.9)    93.37 (1.9)   87.24 (9.38)   67.83 (0.63)
EE NB      94.04 (1.75)   ×             ×              []

           Flickr(p = 01)  Kdd(p = 05)   Average Rank

BL SVM     74.92 (0.17)    74.53 (0.05)  6.44 [7]
EE SVM     79.86 (0.13)    80.98 (0.05)  2.39 [1]
BL LR      79.03 (0.11)    81.29 (0.04)  4.28 [4]
EE LR      79.85 (0.13)    80.75 (0.05)  2.61 [2]
BL BeSim   74.62 (0.13)    74.95 (0)     5.11 [6]
EE BeSim   76.4 (0.13)     77.55 (0.03)  3.61 [3]
BL NB      81.36 (0.1)     74.29 (0.05)  6.5 [8]
EE NB      []              []            5.06 [5]


6 Conclusions and future research directions

We investigated the effects of sampling and cost-sensitive learning techniques on the problem of learning from imbalanced behaviour data. This type of big data has fundamentally different properties and internal structure than traditional data and is characterized by sparseness and very large dimensions. This is a new research topic, which enables benefits across a wide variety of application domains such as fraud detection, targeted advertising, predictive policing, churn prediction, default prediction etc. To enable future developments in this field, we provide an on-line repository containing implementations, datasets and results, available at the following link: http://www.applieddatamining.com/cms/?q=software

Oversampling techniques need to be adapted in order to cope with this kind of data. We proposed new versions of the traditional SMOTE and ADASYN techniques, which differ in the way synthetic samples are generated and similarity computations are performed. The plain random oversampling (OSR) and the synthetic approaches are competitive methods in terms of AUC and do not seem to suffer from overfitting, as traditional studies report. Though the synthetic approaches have comparable AUC-performances in comparison to OSR, they are useless in practice due to their computationally expensive procedures. The OSR method is very fast for small and medium sized datasets, yet becomes inappropriate for larger datasets, because the training times of the underlying base learner increase drastically.

With respect to undersampling methodologies, we employed a random undersampling technique (RUS) and Knn-based informed approaches. The latter type selects majority class instances that are closest (Cl Knn) or farthest (Far Knn) from the minority class examples and allows one to investigate the effect of noise/outliers on SVM performance. We observed that these odd instances have a severe performance degrading effect. RUS and Far Knn approaches can significantly improve performance with respect to the baseline, thanks to their abilities to reduce noise/outlier levels and to put more emphasis on the minority class. Having said this, they usually perform worse than any of the oversampling approaches. This is in sheer contrast to the results obtained from traditional studies and is heavily related to the intrinsic nature of behaviour data. Timing-wise, RUS is the fastest method. The informed approaches are slower than RUS due to the required nearest neighbour computations. Additionally, a cluster-based undersampling (CBU) technique was proposed to alleviate the within-class imbalance. For low undersampling rates, the performance of CBU is superior to plain RUS, whereas for high undersampling rates the opposite is true. Overall, no substantial gain in performance is obtained in comparison with a simple RUS approach. Furthermore, the method is the slowest among all techniques considered and relies on computationally expensive clustering heuristics.

We studied the effects of boosting, where we made use of a very specific linear SVM and logistic regression combination that uses confidence rated predictions, instead of a more popular plain "discrete" boosting algorithm (with weak hypotheses having values in {−1, 1}). Our experiments clearly indicated the regularization constant C in the SVM formulation to act as a "weakness" indicator. Indeed, higher C-values cause stronger learners and shouldn't be used in the boosting process, which matches with the analysis of studies dealing with traditional data. Strong learners can


already quite accurately distinguish majority class from minority class behaviours. The boosting process will focus on the hard to learn instances, among which are noise/outliers, and this can result in overfitting behaviour. We also observed that boosting can usually outperform the plain baseline, indicating that a combination of weak learners can achieve higher AUC-performance than a single strong learner.

EasyEnsemble (EE) integrates random undersampling with a boosting process. By sampling several balanced subsets from the training data and combining these, we can emphasize the minority class, achieve noise/outlier reduction effects and simultaneously explore the majority class space efficiently. Our version differs from the original formulation, because we employ a different type of underlying weak learner (an SVM and LR combination) that is fed to a confidence rated boosting algorithm (instead of the discrete version). We observed superior AUC-performance with respect to all of the aforementioned techniques. Furthermore, the method is very fast, since each subset can be trained in parallel and has a size that is twice as large as the minority class training set. We studied the effect of the number of subsets and noted that a limited amount of these (O(10)) is already sufficient to obtain satisfactory performances; increasing this number further causes only minor benefits.

Additionally, a statistical comparison of each of the proposed methods was conducted across a wide variety of datasets. Holm's significance test indicates that each of the proposed techniques (oversampling, undersampling (except Cl Knn), cost sensitive learning and boosting variants) can significantly outperform the plain baseline performance (each of the null-hypotheses of equal performance gets rejected with a confidence level of at least 91%). Comparison of the techniques with EE showed that the latter can significantly outperform all of them (each of the null-hypotheses of equal performance gets rejected with a confidence level of at least 75%). A timing-wise comparison showed EE to be the preferred method (after RUS) for large and highly imbalanced datasets. A parallel implementation would make EE the fastest method, even for medium sized datasets that show a high level of imbalance.

The influence of the chosen base learner was investigated by comparing the baseline to its EE-version for the SVM, LR, NB and BeSim inducers. Each of the EE-versions is able to improve upon the baseline performance for nearly all datasets, where base learners of the regularization type are superior because of their abilities to control the strength of the learner.

Though we conducted an extensive benchmark study, there are still plenty of future research directions. First, and most important, there is an urgent need to develop a more general theoretical framework to completely understand the effect of imbalance and why certain methods outperform others. Secondly, other types of mechanisms could be investigated in this setting: advanced sampling (Guo et al 2008) set-ups perform sampling based on preliminary classifications; one-class learning is a recognition based methodology, where a discriminative boundary is learned around the training examples of a single class alone, and Raskutti and Kowalczyk (2004) demonstrated the superiority of the recognition based one-class SVMs over the discriminative two-class SVMs; other types of ensemble learning strategies (e.g. bagging and stacking) could be investigated; Batista et al (2004) note that combining over- and undersampling can be beneficial when data are highly imbalanced or show very few minority class instances; the ideas behind Extended Nearest Neighbour (Tang and


He 2015) could guide us in the sampling process. As opposed to Knn, "it considers not only who are the nearest neighbours of a certain sample, but also who consider the particular sample as their nearest neighbour" (Tang and He 2015). The method assigns a class membership to an unknown sample that maximizes the intra-class coherence. Finally, more advanced techniques do exist that can provide a K (the number of nearest neighbours) faster or with a (slightly) better performance.

Another issue relates to the lack of comprehensibility of the EasyEnsemble technique in its present form. Because we currently integrate Logistic Regression in the boosting process, the final ensemble classifier is non-linear and therefore incomprehensible. It might be interesting to investigate the numerical boosting technique proposed in Schapire and Singer (1999), where each weak learner is allowed to output real values (instead of values in [−1, 1]). In that case, we would be able to use a plain linear SVM as a weak learner, without the Logistic Regression component. The final ensemble would be a linear model, which is surely easier to comprehend. Also, the instance-based approach of Martens and Provost (2014) could be used to explain individual predictions of the non-linear EE-technique.
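The comprehensibility argument above rests on a simple algebraic fact: a weighted sum of linear weak learners, f_t(x) = w_t·x + b_t, is itself linear, so such an ensemble collapses to a single (w, b) pair. A minimal sketch (the (w, b)-tuple representation of a model is our own convention):

```python
def collapse_linear_ensemble(models, alphas):
    """Collapse sum_t alpha_t * (w_t . x + b_t) into one linear model.
    `models` is a list of (w, b) pairs, `alphas` the boosting weights."""
    n = len(models[0][0])
    w = [sum(a * wt[j] for (wt, _), a in zip(models, alphas)) for j in range(n)]
    b = sum(a * bt for (_, bt), a in zip(models, alphas))
    return w, b
```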

A Oversampling experiments

Table 11 Oversampling experiments. Results show the average tenfold AUC test set performance with respect to increasing β-values (β = [β1, β2, β3, β4] = [0, 1/3, 2/3, 1]). Additionally, standard deviations are shown between brackets. Optimal performances for each method are highlighted in boldface. See Section 4.2 for parameter settings.

MOV G(p = 1)
         β1             β2             β3             β4
OSR      71.6 (2.62)    74.37 (2.04)   73.6 (1.84)    74.73 (2.45)
SMOTE    71.6 (2.62)    75.08 (2.18)   76.02 (2.14)   76.48 (2.3)
ADASYN   71.6 (2.62)    75.16 (1.92)   75.93 (2.08)   76.47 (2.29)

MOV G(p = 25)
         β1             β2             β3             β4
OSR      81.41 (1.32)   83.49 (1.81)   83.84 (1.96)   83.91 (2.04)
SMOTE    81.41 (1.32)   83.32 (1.97)   83.59 (2.04)   83.76 (2.11)
ADASYN   81.41 (1.32)   83.61 (1.82)   84.02 (1.97)   83.69 (1.96)

Mov Th(p = [])
         β1             β2             β3             β4
OSR      79.77 (5.33)   85.3 (4.66)    83.16 (4.5)    84.59 (5.69)
SMOTE    79.77 (5.33)   84.18 (6.51)   85.58 (5.97)   84.33 (5.75)
ADASYN   79.77 (5.33)   84.11 (6.77)   85.86 (6.06)   85.36 (5.13)

Yahoo A(p = 1)
         β1             β2             β3             β4
OSR      55.92 (2.97)   58.66 (3.27)   59.99 (2.28)   59.74 (1.78)
SMOTE    55.92 (2.97)   59.76 (2.62)   59.74 (2.67)   59.43 (2.4)
ADASYN   55.92 (2.97)   59.54 (2.53)   59.55 (2.94)   59.56 (2.22)

Yahoo A(p = 25)
         β1             β2             β3             β4
OSR      61.68 (2.42)   64.19 (3.17)   65.08 (3.26)   64.67 (2.1)
SMOTE    61.68 (2.42)   65.46 (3.63)   65.33 (3.23)   64.52 (2.98)
ADASYN   61.68 (2.42)   65.04 (3.74)   65.41 (3.47)   64.4 (2.21)

Continues on next page


Table 11 continued
Yahoo G(p = 1)
         β1             β2             β3             β4
OSR      66.84 (3.66)   72.18 (2.36)   73.11 (2.7)    72.49 (3.41)
SMOTE    66.84 (3.66)   72.65 (2.85)   73.27 (3.36)   73.37 (3.56)
ADASYN   66.84 (3.66)   72.87 (2.83)   73.18 (3.2)    73.39 (3.59)

Yahoo G(p = 25)
         β1             β2             β3             β4
OSR      78.82 (1.39)   78.78 (2.02)   78.74 (1.86)   78.35 (1.47)
SMOTE    78.82 (1.39)   79.23 (1.57)   79.1 (1.2)     79.03 (1.89)
ADASYN   78.82 (1.39)   79.12 (1.43)   79.23 (1.37)   79.51 (2.01)

TaFeng(p = 1)
         β1             β2             β3             β4
OSR      55.75 (1.6)    59.23 (1.96)   60 (1.68)      61.04 (2.36)
SMOTE    55.75 (1.6)    60.26 (1.95)   61.49 (1.8)    61.13 (1.52)
ADASYN   55.75 (1.6)    60.26 (1.9)    61.44 (1.85)   61.16 (1.5)

TaFeng(p = 25)
         β1             β2             β3             β4
OSR      66.94 (1.34)   68.84 (1.21)   66.99 (1.42)   67.7 (1.41)
SMOTE    66.94 (1.34)   68.47 (1.5)    67.07 (1.15)   66.65 (0.81)
ADASYN   66.94 (1.34)   68.62 (1.38)   67.85 (1.6)    66.91 (1.39)

Book(p = 1)
         β1             β2             β3             β4
OSR      52.6 (1.29)    53.61 (0.94)   55.41 (1.75)   55.87 (1.44)
SMOTE    52.6 (1.29)    54.77 (0.99)   54.91 (0.8)    54.36 (0.98)
ADASYN   52.6 (1.29)    54.86 (1.13)   55.06 (0.73)   54.54 (0.92)

Book(p = 25)
         β1             β2             β3             β4
OSR      60.08 (0.71)   61.05 (0.94)   61.82 (1.12)   64.62 (0.57)
SMOTE    60.08 (0.71)   62.6 (0.73)    60.95 (0.68)   63 (0.8)
ADASYN   60.08 (0.71)   62.33 (0.73)   60.77 (0.85)   63.04 (0.58)

LST(p = 1)
         β1             β2             β3             β4
OSR      99.99 (0.01)   99.99 (0.01)   99.99 (0.01)   99.99 (0.01)
SMOTE    99.99 (0.01)   99.99 (0.01)   99.99 (0.01)   99.99 (0.01)
ADASYN   99.99 (0.01)   99.99 (0.01)   99.99 (0.01)   99.99 (0.01)

Adver(p = [])
         β1             β2             β3             β4
OSR      96.61 (1.82)   97.31 (1.65)   97.07 (1.84)   97.07 (1.79)
SMOTE    96.61 (1.82)   96.91 (1.66)   97.19 (1.65)   97.07 (1.91)
ADASYN   96.61 (1.82)   97.1 (1.7)     97.08 (1.87)   97.07 (1.88)

Adver(p = 1)
         β1             β2             β3             β4
OSR      90.93 (3.02)   91.27 (3.03)   92.66 (2.82)   93.29 (1.97)
SMOTE    90.93 (3.02)   92.51 (2.03)   92.96 (2.14)   93.53 (1.81)
ADASYN   90.93 (3.02)   92.22 (2.33)   92.7 (2.36)    93.88 (1.73)

CRF(p = [])
         β1              β2              β3              β4
OSR      64.06 (16.43)   80.82 (12.94)   81.28 (12.27)   81.91 (11.28)
SMOTE    64.06 (16.43)   78.64 (16.86)   82.52 (13.74)   79.32 (16.26)
ADASYN   64.06 (16.43)   78.95 (16.72)   81.19 (16.32)   79.31 (16.19)

Bank(p = [])
         β1             β2             β3             β4
OSR      66.82 (0.88)   70.1 (0.74)    71.39 (0.8)    71.47 (0.8)
SMOTE
ADASYN


B Undersampling experiments

Table 12 Undersampling experiments. Results show the average tenfold AUC test set performance with respect to increasing βu-values (βu = [βu1, βu2, βu3, βu4, βu5] = [0, 1/4, 1/2, 3/4, 1]). Optimal performances (boldface) are highlighted for each imbalance ratio. Additionally, standard deviations are shown between brackets. CL K represents the "Closest Knn" technique; CL T represents the "Closest tot sim" technique (similar for Far K and Far T, see Section 3.2). Parameter settings are given in Section 4.2.

Mov G(p = 1)
        βu1           βu2           βu3           βu4           βu5
RUS     71.6 (2.6)    71.83 (2.6)   72.54 (2.5)   72.39 (3.1)   70.61 (3.5)
Cl K    71.6 (2.6)    71.4 (2)      70.96 (1.9)   70.43 (2.4)   69.05 (4.1)
CL T    71.6 (2.6)    70.28 (2.5)   66.74 (2)     66.8 (2.1)    68.18 (3.6)
Far K   71.6 (2.6)    72.36 (2.7)   71.26 (3.4)   66.57 (5.2)   53.5 (3.5)
Far T   71.6 (2.6)    72.22 (2.8)   71.63 (3.6)   64.28 (5.3)   50.88 (4.4)
CBU     72.55 (2.6)   73.28 (2.6)   73.12 (2.6)   73.84 (2.5)   73 (3.1)

Mov G(p = 25)
        βu1           βu2           βu3           βu4           βu5
RUS     81.41 (1.3)   81.36 (1.3)   81.78 (1.7)   82.05 (1.7)   81.6 (2.1)
Cl K    81.41 (1.3)   80.86 (1.2)   80.95 (1.6)   79.73 (2.3)   77.95 (2.3)
CL T    81.41 (1.3)   79.9 (1.2)    78.21 (1.4)   77.87 (1.5)   77.76 (2.3)
Far K   81.41 (1.3)   80.9 (1.5)    78.17 (1.8)   74.25 (2.4)   69.79 (3.2)
Far T   81.41 (1.3)   80.86 (1.5)   77.2 (2.4)    71.16 (2.7)   62.4 (2.8)
CBU     81.53 (1.4)   81.64 (1.3)   81.29 (1.6)   81.28 (2.1)   80.34 (2.7)

Mov Th(p = [])
        βu1           βu2           βu3           βu4           βu5
RUS     79.77 (5.3)   80.32 (5.8)   81.57 (5.5)   81.86 (6.6)   81.26 (6.2)
Cl K    79.77 (5.3)   79.25 (4.5)   78.07 (5)     76.25 (6.5)   62.46 (8.5)
CL T    79.77 (5.3)   78.4 (4.4)    72.41 (3.5)   64.66 (4.5)   60.37 (7.3)
Far K   79.77 (5.3)   84.54 (5)     83.64 (6.4)   80.02 (7.3)   56.82 (10.3)
Far T   79.77 (5.3)   85.03 (5.7)   82.68 (6.8)   75.61 (9.2)   56.77 (10.9)
CBU     80.11 (5.8)   81.17 (6)     81.08 (6.5)   84.17 (5.1)   80.96 (6.9)

Yahoo A(p = 1)
        βu1           βu2           βu3           βu4           βu5
RUS     55.92 (3)     55.57 (3.4)   56.44 (3)     55.83 (3.4)   56.37 (3.3)
Cl K    55.92 (3)     55.67 (2.4)   53.12 (2)     50.57 (1.8)   53.79 (3.5)
CL T    55.92 (3)     55.69 (2.1)   53.35 (2.2)   50.31 (2.2)   52.35 (3.3)
Far K   55.92 (3)     57.35 (2.2)   56.92 (1.1)   56.95 (2.3)   51.18 (2)
Far T   55.92 (3)     56.93 (2.4)   54.74 (1.9)   57.01 (1.8)   51.18 (2)
CBU     58.21 (2.6)   58.45 (3.3)   58.31 (3.5)   58.39 (3.5)   56.09 (2.6)

Yahoo A(p = 25)
        βu1           βu2           βu3           βu4           βu5
RUS     61.68 (2.4)   62.9 (2.9)    63.62 (3.6)   63.75 (3.1)   63.19 (1.9)
Cl K    61.68 (2.4)   61.14 (2.1)   57.62 (1.6)   54.02 (1.8)   51.48 (1.4)
CL T    61.68 (2.4)   60.89 (2.8)   58.11 (1.4)   54.4 (2.1)    51.76 (1.4)
Far K   61.68 (2.4)   63.96 (3)     62.62 (2.2)   59.61 (1.5)   56.25 (1.6)
Far T   61.68 (2.4)   63.71 (2.4)   59.72 (1.6)   57.27 (1.1)   54.47 (1.1)
CBU     62.46 (2.6)   61.85 (1.4)   61.78 (2.2)   59.94 (3)     60.1 (4)

Continues on next page

46 Jellis Vanhoeyveld David Martens

Table 12 continued

Yahoo G (p = 1)
        βu1            βu2            βu3            βu4            βu5
RUS     66.84 (3.7)    67.85 (3.2)    68.36 (3.2)    68.23 (4)      69.9 (4.2)
Cl K    66.84 (3.7)    66.71 (2.8)    64.3 (3.6)     61.98 (3.9)    61.15 (1.9)
CL T    66.84 (3.7)    65.79 (2.7)    63.55 (3.3)    59.21 (3.5)    61.08 (2.4)
Far K   66.84 (3.7)    66.76 (4.1)    63.84 (3.4)    65.16 (2)      48.5 (2.9)
Far T   66.84 (3.7)    66.95 (4.1)    63.48 (2.9)    65.16 (2)      48.48 (2.9)
CBU     69.68 (4.1)    70.59 (3.2)    70.64 (3.7)    70.2 (2.9)     63.35 (3.6)

Yahoo G (p = 25)
        βu1            βu2            βu3            βu4            βu5
RUS     78.82 (1.4)    78.91 (1.6)    78.97 (1.6)    78.61 (1.6)    77.82 (2.1)
Cl K    78.82 (1.4)    77.26 (1.5)    72.52 (1.5)    67.86 (2)      65.07 (2.7)
CL T    78.82 (1.4)    76.83 (1)      71.99 (1.8)    67.15 (2.3)    61.1 (2.7)
Far K   78.82 (1.4)    78.26 (2.2)    74.69 (2.7)    67.22 (2.1)    60.72 (2.3)
Far T   78.82 (1.4)    77.68 (2.6)    72.44 (3)      64.94 (2.4)    59.6 (2)
CBU     75.25 (3.2)    75.22 (2.4)    74.69 (2.3)    73.07 (2.4)    70.69 (2.4)

TaFeng (p = 1)
        βu1            βu2            βu3            βu4            βu5
RUS     55.75 (1.6)    56.1 (1.6)     56.26 (1.7)    57.23 (1.7)    59.25 (2.2)
Cl K    55.75 (1.6)    55.68 (1.6)    55.58 (1.5)    55.08 (1.1)    51.05 (1.5)
CL T    55.75 (1.6)    55.67 (1.6)    54.47 (1.6)    47.53 (1.6)    49.3 (1.1)
Far K   55.75 (1.6)    58.99 (1.2)    59.47 (1.1)    60.04 (1.2)    56.31 (1)
Far T   55.75 (1.6)    58.92 (1.3)    59.25 (1.3)    58.58 (1.1)    56.31 (1)
CBU     57.8 (1)       58.47 (1.1)    58.15 (0.9)    58.87 (1.4)    57.65 (1.6)

TaFeng (p = 25)
        βu1            βu2            βu3            βu4            βu5
RUS     66.94 (1.3)    67.44 (1.3)    68.1 (1.4)     68.27 (1.4)    66.13 (1.2)
Cl K    66.94 (1.3)    66.13 (1.4)    63.39 (1.2)    59.83 (1.3)    56.94 (0.7)
CL T    66.94 (1.3)    66.38 (1.5)    62.89 (1.6)    57.46 (1.3)    54.56 (1.3)
Far K   66.94 (1.3)    68.06 (1.4)    66.43 (1.6)    64.46 (1.5)    63.35 (1.3)
Far T   66.94 (1.3)    64.31 (1.1)    62.69 (1)      61.27 (1.1)    59.03 (1)
CBU     64.81 (1.2)    64.15 (1.1)    64.13 (1.2)    63.88 (0.8)    63.46 (0.8)

Book (p = 1)
        βu1            βu2            βu3            βu4            βu5
RUS     52.6 (1.3)     52.79 (0.9)    53.46 (0.8)    53.89 (0.9)    54.05 (0.9)
Cl K    52.6 (1.3)     52.56 (1.2)    52.52 (1.3)    52.39 (1.1)    53.09 (1.1)
CL T    52.6 (1.3)     52.56 (1.2)    52.52 (1.3)    52.39 (1.1)    53.05 (0.7)
Far K   52.6 (1.3)     55.21 (1.2)    56.21 (1.8)    56.14 (1.2)    53.06 (1)
Far T   52.6 (1.3)     55.21 (1.2)    56.21 (1.8)    56.14 (1.2)    53.06 (1)
CBU     54.28 (0.9)    53.77 (1)      53.33 (1.1)    53.34 (0.9)    52.84 (0.8)

Book (p = 25)
        βu1            βu2            βu3            βu4            βu5
RUS     60.08 (0.7)    60.13 (0.6)    60.4 (0.8)     60.33 (0.8)    63.28 (0.8)
Cl K    60.08 (0.7)    59.96 (0.7)    60.13 (0.8)    59.96 (1)      59.28 (0.7)
CL T    60.08 (0.7)    59.96 (0.7)    60.13 (0.8)    60.29 (0.4)    54.5 (0.9)
Far K   60.08 (0.7)    63.29 (1)      64.19 (0.8)    57.3 (1.1)     55.66 (1.1)
Far T   60.08 (0.7)    62.14 (0.5)    58.27 (0.6)    56.37 (1)      55.66 (1.1)
CBU     54.82 (0.9)    54.67 (0.9)    54.71 (0.9)    54.66 (1)      54.78 (0.9)


Table 12 continued

LST (p = 1)
        βu1          βu2          βu3          βu4          βu5
RUS     99.99 (0)    99.99 (0)    99.99 (0)    99.98 (0)    99.99 (0)
Cl K    99.99 (0)    99.99 (0)    99.99 (0)    99.99 (0)    99.99 (0)
CL T    99.99 (0)    99.99 (0)    99.99 (0)    99.99 (0)    99.98 (0)
Far K   99.99 (0)    99.98 (0)    99.98 (0)    99.98 (0)    99.98 (0)
Far T   99.99 (0)    99.98 (0)    99.98 (0)    99.98 (0)    99.98 (0)
CBU     []           []           []           []           []

Adver (p = [])
        βu1            βu2            βu3            βu4            βu5
RUS     96.61 (1.8)    96.32 (1.8)    96.63 (1.4)    97.12 (2.1)    96.22 (1.6)
Cl K    96.61 (1.8)    96.44 (1.5)    96.14 (1.5)    96.04 (2)      94.8 (2.5)
CL T    96.61 (1.8)    95.87 (2.1)    94.32 (1.9)    93.01 (2.2)    90.72 (2.3)
Far K   96.61 (1.8)    96.53 (1.4)    95.76 (2)      94.39 (1.8)    90.49 (3.1)
Far T   96.61 (1.8)    96.54 (1.5)    95.67 (1.9)    94.54 (1.8)    89.3 (2.8)
CBU     96.85 (2.3)    96.85 (2.3)    97.05 (1.5)    96.6 (1.6)     96.06 (2.1)

Adver (p = 1)
        βu1            βu2            βu3            βu4            βu5
RUS     90.93 (3)      91.53 (3.1)    92.37 (3.4)    91.9 (2.9)     91.93 (2.2)
Cl K    90.93 (3)      90.64 (3)      89.87 (3.9)    90.21 (3.6)    89.18 (2)
CL T    90.93 (3)      89.7 (3.5)     88.55 (3.4)    85.76 (3.3)    88.2 (2.3)
Far K   90.93 (3)      93.8 (2.3)     92.4 (2.6)     88.73 (3.4)    85.51 (4)
Far T   90.93 (3)      93.62 (2.4)    93.2 (2.2)     88.41 (3.6)    85.51 (4)
CBU     93.22 (2.4)    93.76 (2.5)    93.89 (2.6)    93.52 (2.7)    91.27 (2)

CRF (p = [])
        βu1             βu2             βu3             βu4             βu5
RUS     64.06 (16.4)    63.28 (15.9)    67.98 (17.4)    66.95 (21.9)    87.73 (8.8)
Cl K    64.06 (16.4)    62.44 (16.6)    62.34 (16.9)    71.37 (13.8)    78.22 (17.7)
CL T    64.06 (16.4)    62.44 (16.6)    62.34 (16.9)    71.37 (13.8)    62.67 (22.9)
Far K   64.06 (16.4)    83.8 (14.2)     83.93 (14.8)    84.49 (13.7)    86.11 (9.7)
Far T   64.06 (16.4)    83.8 (14.2)     83.93 (14.8)    84.49 (13.7)    86.11 (9.7)
CBU     []              []              []              []              []

Bank (p = [])
        βu1            βu2            βu3            βu4            βu5
RUS     66.82 (0.9)    67.02 (0.9)    67.37 (0.8)    67.99 (0.6)    69.5 (1)
Cl K    66.82 (0.9)    66.17 (0.7)    65.24 (0.6)    64.86 (0.6)    58.53 (1.1)
CL T    66.82 (0.9)    64.92 (1.1)    60.69 (0.9)    56.33 (0.8)    52.87 (0.7)
Far K   66.82 (0.9)    66.95 (0.6)    66.19 (0.6)    64.42 (0.6)    58.25 (1.1)
Far T   66.82 (0.9)    67.16 (0.6)    64.2 (0.8)     59.67 (1)      58.25 (1.1)
CBU     []             []             []             []             []


C Boosting experiments

This section presents the results of each of the boosting experiments from Section 4.5 on the remaining data sources. Each of the figures below depicts the average tenfold AUC-performance on test data (with µ = 100) with respect to the number of boosting iterations T for (left) AdaBoost (AB), AdaCost (AC) and EasyEnsemble (EE), with C chosen according to the highest validation set AUC-performance (over all possible boosting rounds), and (right) AB and EE (with S = 15) at varying C-levels.
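As a concrete illustration of the EasyEnsemble scheme these experiments evaluate (sample several balanced subsets, boost a learner on each, average the resulting hypotheses), the skeleton below sketches the idea in Python. It is a simplified sketch, not the paper's implementation: `train_boosted` stands in for the AdaBoost-over-SVMs routine used in the experiments, and `easy_ensemble`/`demo_learner` are illustrative names.

```python
import random

def easy_ensemble(pos, neg, train_boosted, n_subsets=10, seed=0):
    """Combine boosted learners trained on several balanced subsets.

    Each subset keeps ALL minority (pos) instances plus an equally sized
    random sample of the majority (neg) class, so it is only twice as
    large as the minority class; subsets can be trained in parallel.
    """
    rng = random.Random(seed)
    members = []
    for _ in range(n_subsets):
        subset_neg = rng.sample(neg, len(pos))   # balanced majority sample
        members.append(train_boosted(pos, subset_neg))
    # final hypothesis: average of the members' real-valued scores
    return lambda x: sum(f(x) for f in members) / len(members)

def demo_learner(pos, neg):
    """Toy stand-in for the boosted SVM: a threshold on the first feature
    (assumes positive instances lie above negative ones)."""
    mid = (sum(v[0] for v in pos) / len(pos) +
           sum(v[0] for v in neg) / len(neg)) / 2.0
    return lambda x: 1.0 if x[0] >= mid else -1.0
```

With minority points clustered around 1 and majority points around 0, the averaged score cleanly separates the two groups while each member has seen only a small, balanced slice of the majority class.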

Fig 6 Mov G(p = 1) dataset [panels (a) and (b): average tenfold AUC test performance versus the number of boosting rounds T; curve legends as described at the start of this appendix]

Fig 7 Mov Th(p = []) dataset [panels (a) and (b): average tenfold AUC test performance versus the number of boosting rounds T; curve legends as described at the start of this appendix]


Fig 8 Yahoo A(p = 1) dataset [panels (a) and (b): average tenfold AUC test performance versus the number of boosting rounds T; curve legends as described at the start of this appendix]

Fig 9 Yahoo A(p = 25) dataset [panels (a) and (b): average tenfold AUC test performance versus the number of boosting rounds T; curve legends as described at the start of this appendix]

Fig 10 Yahoo G(p = 1) dataset [panels (a) and (b): average tenfold AUC test performance versus the number of boosting rounds T; curve legends as described at the start of this appendix]


Fig 11 Yahoo G(p = 25) dataset [panels (a) and (b): average tenfold AUC test performance versus the number of boosting rounds T; curve legends as described at the start of this appendix]

Fig 12 TaFeng(p = 1) dataset [panels (a) and (b): average tenfold AUC test performance versus the number of boosting rounds T; curve legends as described at the start of this appendix]

Fig 13 Book(p = 1) dataset [panels (a) and (b): average tenfold AUC test performance versus the number of boosting rounds T; curve legends as described at the start of this appendix]


Fig 14 LST(p = 1) dataset [panels (a) and (b): average tenfold AUC test performance versus the number of boosting rounds T; curve legends as described at the start of this appendix]

Fig 15 Adver(p = []) dataset [panels (a) and (b): average tenfold AUC test performance versus the number of boosting rounds T; curve legends as described at the start of this appendix]

Fig 16 Adver(p = 1) dataset [panels (a) and (b): average tenfold AUC test performance versus the number of boosting rounds T; curve legends as described at the start of this appendix]


Fig 17 CRF(p = []) dataset [panels (a) and (b): average tenfold AUC test performance versus the number of boosting rounds T; curve legends as described at the start of this appendix]

D Final Comparison

Fig 18 Average rank AUC versus average rank Time (see Table 9) across the large datasets from Table 2 (CRF and Bank). Regarding the former, an extra row EE par is added, having the same AUC-value as EE(S = 15). Ranks for each dataset are subsequently obtained via the procedure outlined in Section 4.6.1. Points occurring in the upper-right region are preferred. [Scatter plot; methods shown: BL, OSR, SMOTE, ADASYN, RUS, Cl Knn, Far Knn, CBU, AB, AC(R=28), AC(R=RL), EE(S=10), EE(S=15), EE par]


References

Akbani R, Kwek S, Japkowicz N (2004) Applying support vector machines to imbalanced datasets. In: Machine Learning: ECML 2004, 15th European Conference on Machine Learning, Pisa, Italy, September 20-24, 2004, Proceedings. Springer Berlin Heidelberg, Berlin, Heidelberg, pp 39–50. DOI 10.1007/978-3-540-30115-8_7

Ali A, Shamsuddin SM, Ralescu AL (2015) Classification with class imbalance problem: A review. International Journal of Advances in Soft Computing and its Applications 7(3):176–204

Alzahrani T, Horadam KJ (2016) Community detection in bipartite networks: Algorithms and case studies. In: Complex Systems and Networks: Dynamics, Controls and Applications. Springer Berlin Heidelberg, Berlin, Heidelberg, pp 25–50. DOI 10.1007/978-3-662-47824-0_2

Bachner J (2013) Predictive policing: Preventing crime with data and analytics. IBM Center for the Business of Government

Baesens B, Van Gestel T, Viaene S, Stepanova M, Suykens J, Vanthienen J (2003) Benchmarking state-of-the-art classification algorithms for credit scoring. Journal of the Operational Research Society 54(6):627–635. DOI 10.1057/palgrave.jors.2601545

Barandela R, Sánchez JS, García V, Rangel E (2003) Strategies for learning in class imbalance problems. Pattern Recognition 36(3):849–851. DOI 10.1016/S0031-3203(02)00257-1

Barber MJ (2007) Modularity and community detection in bipartite networks. Physical Review E 76:066102. DOI 10.1103/PhysRevE.76.066102

Barua S, Islam MM, Yao X, Murase K (2014) MWMOTE – majority weighted minority oversampling technique for imbalanced data set learning. IEEE Transactions on Knowledge and Data Engineering 26(2):405–425. DOI 10.1109/TKDE.2012.232

Batista GEAPA, Prati RC, Monard MC (2004) A study of the behavior of several methods for balancing machine learning training data. SIGKDD Explor Newsl 6(1):20–29. DOI 10.1145/1007730.1007735

Beckett SJ (2016) Improved community detection in weighted bipartite networks. Royal Society Open Science 3(1). DOI 10.1098/rsos.140536

Bekkar M, Djemaa HK, Alitouche TA (2013) Evaluation measures for models assessment over imbalanced data sets. Journal of Information Engineering and Applications 3(10):27–38

Bhattacharyya S, Jha S, Tharakunnel K, Westland JC (2011) Data mining for credit card fraud: A comparative study. Decision Support Systems 50(3):602–613. DOI 10.1016/j.dss.2010.08.008

Blondel VD, Guillaume JL, Lambiotte R, Lefebvre E (2008) Fast unfolding of communities in large networks. Journal of Statistical Mechanics: Theory and Experiment 2008(10):P10008

Breiman L, Friedman J, Stone CJ, Olshen RA (1984) Classification and regression trees. Taylor & Francis

Brozovsky L, Petricek V (2007) Recommender system for online dating service. In: Proceedings of Znalosti 2007 Conference, VSB Ostrava

Cha M, Mislove A, Gummadi KP (2009) A measurement-driven analysis of information propagation in the Flickr social network. In: Proceedings of the 18th International Conference on World Wide Web, ACM, New York, NY, USA, WWW '09, pp 721–730. DOI 10.1145/1526709.1526806

Chawla NV (2005) Data mining for imbalanced datasets: An overview. In: Data mining and knowledge discovery handbook. Springer US, Boston, MA, pp 853–867

Chawla NV, Bowyer KW, Hall LO, Kegelmeyer WP (2002) SMOTE: synthetic minority over-sampling technique. Journal of Artificial Intelligence Research 16:321–357

Chawla NV, Lazarevic A, Hall LO, Bowyer KW (2003) SMOTEBoost: Improving prediction of the minority class in boosting. In: Knowledge Discovery in Databases: PKDD 2003. Springer Berlin Heidelberg, Berlin, Heidelberg, pp 107–119

Chawla NV, Japkowicz N, Kotcz A (2004) Editorial: Special issue on learning from imbalanced data sets. SIGKDD Explor Newsl 6(1):1–6. DOI 10.1145/1007730.1007733

Chen M, Mao S, Liu Y (2014) Big data: A survey. Mobile Networks and Applications 19(2):171–209. DOI 10.1007/s11036-013-0489-0

Chyi YM (2003) Classification analysis techniques for skewed class distribution problems. Master thesis, Department of Information Management, National Sun Yat-Sen University

Demšar J (2006) Statistical comparisons of classifiers over multiple data sets. Journal of Machine Learning Research 7(Jan):1–30

Dhillon IS (2001) Co-clustering documents and words using bipartite spectral graph partitioning. In: Proceedings of the Seventh ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, ACM, New York, NY, USA, KDD '01, pp 269–274. DOI 10.1145/502512.502550


Drummond C, Holte RC (2003) C4.5, class imbalance, and cost sensitivity: why under-sampling beats over-sampling. In: Proceedings of the ICML '03 Workshop on Learning from Imbalanced Datasets

Fan RE, Chang KW, Hsieh CJ, Wang XR, Lin CJ (2008) LIBLINEAR: A library for large linear classification. Journal of Machine Learning Research 9:1871–1874

Fan W, Stolfo SJ, Zhang J, Chan PK (1999) AdaCost: Misclassification cost-sensitive boosting. In: Proceedings of the Sixteenth International Conference on Machine Learning, Morgan Kaufmann Publishers Inc., San Francisco, CA, USA, ICML '99, pp 97–105

Fawcett T (2006) An introduction to ROC analysis. Pattern Recognition Letters 27(8):861–874. DOI 10.1016/j.patrec.2005.10.010

Finch H (2005) Comparison of distance measures in cluster analysis with dichotomous data. Journal of Data Science 3(1):85–100

Fortunato S (2010) Community detection in graphs. Physics Reports 486(3-5):75–174. DOI 10.1016/j.physrep.2009.11.002

Junqué de Fortuny E, Martens D, Provost F (2014a) Predictive modeling with big data: is bigger really better? Big Data 1(4):215–226. DOI 10.1089/big.2013.0037

Junqué de Fortuny E, Stankova M, Moeyersoms J, Minnaert B, Provost F, Martens D (2014b) Corporate residence fraud detection. In: Proceedings of the 20th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, ACM, New York, NY, USA, KDD '14, pp 1650–1659. DOI 10.1145/2623330.2623333

Frasca M, Bertoni A, Re M, Valentini G (2013) A neural network algorithm for semi-supervised node label learning from unbalanced data. Neural Networks 43:84–98. DOI 10.1016/j.neunet.2013.01.021

Friedman M (1937) The use of ranks to avoid the assumption of normality implicit in the analysis of variance. Journal of the American Statistical Association 32(200):675–701

García E, Lozano F (2007) Boosting support vector machines. In: Machine Learning and Data Mining in Pattern Recognition, 5th International Conference, MLDM 2007, Leipzig, Germany, July 18-20, Post Proceedings. IBaI publishing, pp 153–167

Goldstein M, Uchida S (2016) A comparative evaluation of unsupervised anomaly detection algorithms for multivariate data. PLOS ONE 11(4):1–31. DOI 10.1371/journal.pone.0152173

González PC, Velásquez JD (2013) Characterization and detection of taxpayers with false invoices using data mining techniques. Expert Systems with Applications 40(5):1427–1436. DOI 10.1016/j.eswa.2012.08.051

Guimerà R, Sales-Pardo M, Amaral LAN (2007) Module identification in bipartite and directed networks. Physical Review E 76:036102. DOI 10.1103/PhysRevE.76.036102

Guo H, Viktor HL (2004) Learning from imbalanced data sets with boosting and data generation: The DataBoost-IM approach. SIGKDD Explor Newsl 6(1):30–39. DOI 10.1145/1007730.1007736

Guo X, Yin Y, Dong C, Yang G, Zhou G (2008) On the class imbalance problem. In: 2008 Fourth International Conference on Natural Computation, IEEE, vol 4, pp 192–201. DOI 10.1109/ICNC.2008.871

Han H, Wang WY, Mao BH (2005) Borderline-SMOTE: A new over-sampling method in imbalanced data sets learning. In: Advances in Intelligent Computing. Springer Berlin Heidelberg, Berlin, Heidelberg, pp 878–887

He H, Garcia EA (2009) Learning from imbalanced data. IEEE Transactions on Knowledge and Data Engineering 21(9):1263–1284. DOI 10.1109/TKDE.2008.239

He H, Bai Y, Garcia EA, Li S (2008) ADASYN: Adaptive synthetic sampling approach for imbalanced learning. In: 2008 IEEE International Joint Conference on Neural Networks (IEEE World Congress on Computational Intelligence), IEEE, pp 1322–1328. DOI 10.1109/IJCNN.2008.4633969

Holm S (1979) A simple sequentially rejective multiple test procedure. Scandinavian Journal of Statistics 6(2):65–70

Hsu CW, Lin CJ (2002) A comparison of methods for multiclass support vector machines. IEEE Transactions on Neural Networks 13(2):415–425. DOI 10.1109/72.991427

Huang A (2008) Similarity measures for text document clustering. In: Proceedings of the Sixth New Zealand Computer Science Research Student Conference (NZCSRSC2008), Christchurch, New Zealand, pp 49–56

Iman RL, Davenport JM (1980) Approximations of the critical region of the Friedman statistic. Communications in Statistics - Theory and Methods 9(6):571–595

Jo T, Japkowicz N (2004) Class imbalances versus small disjuncts. ACM SIGKDD Explor Newsl 6(1):40–49. DOI 10.1145/1007730.1007737


Jutla IS, Jeub LG, Mucha PJ (2011-2016) A generalized Louvain method for community detection implemented in MATLAB. URL http://netwiki.amath.unc.edu/GenLouvain

Kubat M, Matwin S (1997) Addressing the curse of imbalanced training sets: One-sided selection. In: Proceedings of the Fourteenth International Conference on Machine Learning, Morgan Kaufmann Publishers Inc., San Francisco, CA, USA, pp 179–186

Lancichinetti A, Fortunato S (2009) Community detection algorithms: A comparative analysis. Physical Review E 80:056117. DOI 10.1103/PhysRevE.80.056117

Larremore DB, Clauset A, Jacobs AZ (2014) Efficiently inferring community structure in bipartite networks. Physical Review E: Statistical, Nonlinear, and Soft Matter Physics 90:012805. DOI 10.1103/PhysRevE.90.012805

Li J, Fine JP (2010) Weighted area under the receiver operating characteristic curve and its application to gene selection. Journal of the Royal Statistical Society: Series C (Applied Statistics) 59(4):673–692. DOI 10.1111/j.1467-9876.2010.00713.x

Li X, Wang L, Sung E (2008) AdaBoost with SVM-based component classifiers. Engineering Applications of Artificial Intelligence 21(5):785–795. DOI 10.1016/j.engappai.2007.07.001

Lichman M (2013) UCI machine learning repository. URL http://archive.ics.uci.edu/ml

Liu W, Chawla S, Cieslak DA, Chawla NV (2010) A robust decision tree algorithm for imbalanced data sets. In: Proceedings of the Tenth SIAM International Conference on Data Mining, SIAM, Philadelphia, vol 10, pp 766–777

Liu XY, Wu J, Zhou ZH (2009) Exploratory undersampling for class-imbalance learning. IEEE Transactions on Systems, Man, and Cybernetics, Part B (Cybernetics) 39(2):539–550. DOI 10.1109/TSMCB.2008.2007853

Luts J, Ojeda F, Van de Plas R, De Moor B, Van Huffel S, Suykens JA (2010) A tutorial on support vector machine-based methods for classification problems in chemometrics. Analytica Chimica Acta 665(2):129–145. DOI 10.1016/j.aca.2010.03.030

Macskassy SA, Provost F (2007) Classification in networked data: A toolkit and a univariate case study. Journal of Machine Learning Research 8(May):935–983

Martens D, Provost F (2014) Explaining data-driven document classifications. MIS Quarterly 38(1):73–100

Martens D, Provost F, Clark J, Junqué de Fortuny E (2016) Mining massive fine-grained behavior data to improve predictive analytics. MIS Quarterly 40(4):869–888

Mazurowski MA, Habas PA, Zurada JM, Lo JY, Baker JA, Tourassi GD (2008) Training neural network classifiers for medical decision making: The effects of imbalanced datasets on classification performance. Neural Networks 21(2-3):427–436. DOI 10.1016/j.neunet.2007.12.031

Mease D, Wyner AJ, Buja A (2007) Boosted classification trees and class probability/quantile estimation. Journal of Machine Learning Research 8:409–439

Nemenyi P (1963) Distribution-free Multiple Comparisons. Dissertation, Princeton University

Newman MEJ, Girvan M (2004) Finding and evaluating community structure in networks. Physical Review E 69:026113. DOI 10.1103/PhysRevE.69.026113

Ng AY (2004) Feature selection, L1 vs. L2 regularization, and rotational invariance. In: Proceedings of the Twenty-first International Conference on Machine Learning, ACM, New York, NY, USA, ICML '04, pp 78–. DOI 10.1145/1015330.1015435

Ng AY, Jordan MI (2002) On discriminative vs. generative classifiers: A comparison of logistic regression and naive Bayes. In: Advances in Neural Information Processing Systems 14, MIT Press, pp 841–848

Ngai E, Hu Y, Wong Y, Chen Y, Sun X (2011) The application of data mining techniques in financial fraud detection: A classification framework and an academic review of literature. Decision Support Systems 50(3):559–569. DOI 10.1016/j.dss.2010.08.006

Platt JC (1999) Probabilistic outputs for support vector machines and comparisons to regularized likelihood methods. In: Advances in Large-Margin Classifiers, MIT Press, pp 61–74

Porter MA, Onnela JP, Mucha PJ (2009) Communities in networks. Notices of the American Mathematical Society 56(9):1082–1097

Provost F, Fawcett T (2013) Data Science for Business: What you need to know about data mining and data-analytic thinking. O'Reilly Media, Inc.

Provost F, Dalessandro B, Hook R, Zhang X, Murray A (2009) Audience selection for on-line brand advertising: Privacy-friendly social network targeting. In: Proceedings of the 15th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, ACM, New York, NY, USA, KDD '09, pp 707–716. DOI 10.1145/1557019.1557098


Raskutti B, Kowalczyk A (2004) Extreme re-balancing for SVMs: A case study. SIGKDD Explor Newsl 6(1):60–69. DOI 10.1145/1007730.1007739

Rosvall M, Bergstrom CT (2008) Maps of random walks on complex networks reveal community structure. Proceedings of the National Academy of Sciences 105(4):1118–1123. DOI 10.1073/pnas.0706851105

Schapire RE (1999) A brief introduction to boosting. In: Proceedings of the 16th International Joint Conference on Artificial Intelligence - Volume 2, Morgan Kaufmann Publishers Inc., San Francisco, CA, USA, IJCAI '99, pp 1401–1406

Schapire RE, Singer Y (1999) Improved boosting algorithms using confidence-rated predictions. Machine Learning 37(3):297–336. DOI 10.1023/A:1007614523901

Shmueli G (2017) Analyzing behavioral big data: Methodological, practical, ethical, and moral issues. Quality Engineering 29(1):57–74. DOI 10.1080/08982112.2016.1210979

Sobhani P, Viktor H, Matwin S (2015) Learning from imbalanced data using ensemble methods and cluster-based undersampling. In: New Frontiers in Mining Complex Patterns: Third International Workshop, NFMCP 2014, Held in Conjunction with ECML-PKDD 2014, Nancy, France, September 19, 2014, Revised Selected Papers. Springer International Publishing, Cham, pp 69–83. DOI 10.1007/978-3-319-17876-9_5

Stankova M (2016) Classification within network data with a bipartite structure. Dissertation, University of Antwerp

Stankova M, Martens D, Provost F (2015) Classification over bipartite graphs through projection. Working Papers 2015001, University of Antwerp, Faculty of Applied Economics

Sun Y, Kamel MS, Wong AK, Wang Y (2007) Cost-sensitive boosting for classification of imbalanced data. Pattern Recognition 40(12):3358–3378. DOI 10.1016/j.patcog.2007.04.009

Suykens JA, Van Gestel T, De Brabanter J, De Moor B, Vandewalle J (2002) Least squares support vector machines. World Scientific, Singapore

Tang B, He H (2015) ENN: Extended nearest neighbor method for pattern recognition [research frontier]. IEEE Computational Intelligence Magazine 10(3):52–60. DOI 10.1109/MCI.2015.2437512

Tang Y, Zhang YQ, Chawla NV, Krasser S (2009) SVMs modeling for highly imbalanced classification. IEEE Transactions on Systems, Man, and Cybernetics, Part B (Cybernetics) 39(1):281–288. DOI 10.1109/TSMCB.2008.2002909

Tobback E, Moeyersoms J, Stankova M, Martens D (2016) Bankruptcy prediction for SMEs using relational data. Working Paper 2016004, University of Antwerp, Faculty of Applied Economics

Verbeke W, Dejaeger K, Martens D, Hur J, Baesens B (2012) New insights into churn prediction in the telecommunication sector: A profit driven data mining approach. European Journal of Operational Research 218(1):211–229. DOI 10.1016/j.ejor.2011.09.031

Veropoulos K, Campbell I, Cristianini N (1999) Controlling the sensitivity of support vector machines. In: Proceedings of the International Joint Conference on Artificial Intelligence, Stockholm, Sweden (IJCAI99), pp 55–60

Whitrow C, Hand DJ, Juszczak P, Weston D, Adams NM (2009) Transaction aggregation as a strategy for credit card fraud detection. Data Mining and Knowledge Discovery 18(1):30–55. DOI 10.1007/s10618-008-0116-z

Wickramaratna J, Holden SB, Buxton BF (2001) Performance degradation in boosting. In: Proceedings of the Second International Workshop on Multiple Classifier Systems, Springer, London, UK, MCS '01, pp 11–21

Yen SJ, Lee YS (2009) Cluster-based under-sampling approaches for imbalanced data distributions. Expert Systems with Applications 36(3, Part 1):5718–5727. DOI 10.1016/j.eswa.2008.06.108

Yu HF, Lo HY, Hsieh HP, Lou JK, McKenzie TG, Chou JW, Chung PH, Ho CH, Chang CF, Wei YH, et al (2010) Feature engineering and classifier ensemble for KDD Cup 2010. In: Proceedings of the KDD Cup 2010 Workshop, pp 1–16

Zha H, He X, Ding C, Simon H, Gu M (2001) Bipartite graph partitioning and data clustering. In: Proceedings of the Tenth International Conference on Information and Knowledge Management, ACM, New York, NY, USA, CIKM '01, pp 25–32. DOI 10.1145/502585.502591

Zhang J, Mani I (2003) KNN approach to unbalanced data distributions: A case study involving information extraction. In: Proceedings of the ICML 2003 Workshop on Learning from Imbalanced Datasets, Washington, DC

Ziegler CN, McNee SM, Konstan JA, Lausen G (2005) Improving recommendation lists through topic diversification. In: Proceedings of the 14th International Conference on World Wide Web, ACM, New York, NY, USA, WWW '05, pp 22–32. DOI 10.1145/1060745.1060754


Imbalanced classification in sparse and large behaviour datasets

Jellis Vanhoeyveld · David Martens

Received: date / Accepted: date

Abstract Recent years have witnessed a growing number of publications dealing with the imbalanced learning issue. While a plethora of techniques have been investigated on traditional low-dimensional data, little is known on the effect thereof on behaviour data. This kind of data reflects fine-grained behaviours of individuals or organisations and is characterized by sparseness and very large dimensions. In this article, we investigate the effects of several over- and undersampling, cost-sensitive learning and boosting techniques on the problem of learning from imbalanced behaviour data. Oversampling techniques show a good overall performance and do not seem to suffer from overfitting as traditional studies report. A variety of undersampling approaches are investigated as well and show the performance degrading effect of instances showing odd behaviour. Furthermore, the boosting process indicates that the regularization parameter in the SVM formulation acts as a weakness indicator and that a combination of weak learners can often achieve better generalization than a single strong learner. Finally, the EasyEnsemble technique is presented as the method outperforming all others. By randomly sampling several balanced subsets, feeding them to a boosting process and subsequently combining their hypotheses, a classifier is obtained that achieves noise/outlier reduction effects and simultaneously explores the majority class space efficiently. Furthermore, the method is very fast since it is parallelizable and each subset is only twice as large as the minority class size.

Keywords Imbalanced learning · Behaviour data · Over- and undersampling · Cost-sensitive learning · Support Vector Machine (SVM) · On-line repository

J. Vanhoeyveld
Department of Engineering Management, Prinsstraat 13, B-2000 Antwerp, Belgium
Tel.: 032654393
E-mail: jellisvanhoeyveld@uantwerpen.be; vanhoeyveldjellis@gmail.com

D. Martens
Department of Engineering Management, Prinsstraat 13, B-2000 Antwerp, Belgium


1 Introduction

1.1 Literature overview

Learning from imbalanced data is an important concept that gained recent attention in the research community, showing its first publications around 1990 (He and Garcia 2009; Chawla et al 2002). The fundamental issue lies in the fact that many learners are designed with a lack of consideration for the underlying data distribution. Rule- and tree-based techniques and methods aiming to minimize overall training set error, such as support vector machines (SVMs) and neural networks, are often cited examples of inducers that suffer from this issue (Liu et al 2010; Chawla et al 2002; Mazurowski et al 2008; Akbani et al 2004; Tang et al 2009). The resulting classifier will emphasize the majority class instances at the expense of neglecting minority class examples, while the latter is usually the phenomenon of interest. Besides this issue of between-class imbalance, several other sources of data complexity are known to hinder learning: within-class imbalance, overlapping classes, outliers and noise (He and Garcia 2009; Sobhani et al 2015). In the following paragraphs, we will give an overview of the two classes of methods that are most widely used to cope with this issue (Liu et al 2009): sampling and cost-sensitive learning. A more thorough overview of the imbalanced learning issue, including a summary of less common active learning methods and kernel-based approaches, can be found in Chawla (2005) and He and Garcia (2009).

Sampling methods work at a data level and attempt to provide a more balanced data distribution to the underlying base learner. These techniques generally consist of oversampling the minority class, undersampling the majority class, or a combination thereof. A simple oversampling technique will replicate minority class instances. More advanced techniques will introduce synthetic samples, such as SMOTE (Chawla et al 2002), Borderline-SMOTE (Han et al 2005) and ADASYN (He et al 2008). A detailed discussion of these methods is provided in Section 3.1. Cluster-based oversampling techniques have also been investigated (e.g. see the related discussion on the CBO-algorithm upcoming in this overview). A more recent approach called MWMOTE (Barua et al 2014) clusters the minority class instances first, after which the SMOTE synthetic sampling procedure is applied to generate new synthetic minority class instances within each cluster. A general downside of oversampling is the increased training time of the subsequently employed learner.
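To make the SMOTE idea concrete, the sketch below generates synthetic samples by interpolating between a minority instance and one of its k nearest minority-class neighbours. This is a minimal sketch of the procedure of Chawla et al (2002), not the reference implementation; the function name and the brute-force neighbour search are illustrative.

```python
import random

def smote(minority, k=5, n_synthetic=100, seed=0):
    """Create synthetic minority samples: x_new = x + gap * (neighbour - x)."""
    rng = random.Random(seed)

    def dist2(a, b):  # squared Euclidean distance
        return sum((ai - bi) ** 2 for ai, bi in zip(a, b))

    synthetic = []
    for _ in range(n_synthetic):
        x = rng.choice(minority)
        # k nearest minority-class neighbours of x (excluding x itself)
        neighbours = sorted((m for m in minority if m is not x),
                            key=lambda m: dist2(x, m))[:k]
        nb = rng.choice(neighbours)
        gap = rng.random()  # uniform in [0, 1): position on the segment x -> nb
        synthetic.append([xi + gap * (ni - xi) for xi, ni in zip(x, nb)])
    return synthetic
```

Because every synthetic point lies on a segment between two existing minority points, the minority region is enlarged without exact duplication (in contrast to simple replication); ADASYN additionally biases how many synthetic samples each minority instance receives.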

The simplest undersampling method is called random undersampling, where majority class instances are randomly discarded from the training data. More sophisticated techniques operate in an informed fashion, where majority class instances are dropped according to their anticipated importance. One could, for instance, use the K-nearest neighbour (Knn) classifier to do so (Zhang and Mani 2003; Chyi 2003). Other informed undersampling approaches rely on the results of a preliminary clustering procedure (see the upcoming discussion on within-class imbalance). Yet other sampling techniques employ a combination of oversampling and undersampling. Chawla et al (2002) apply the SMOTE procedure in a first step and use random undersampling in a second step, though many combination schemes are possible and can be combined with a data cleaning technique. The latter type of methods aim at removing noise and the overlap that is introduced from the sampling schemes (He and Garcia 2009). Representative work in this area includes the one-sided selection (OSS) method (Kubat and Matwin 1997) and the techniques discussed in Batista et al (2004) (e.g. SMOTE with Tomek links). The literature on sampling techniques does not end there. Inspired by the success of boosting algorithms (and ensemble learning in general), sampling techniques have been integrated into this process. Examples of these methods are SMOTE-BOOST (Chawla et al 2003), JOUS-Boost (Mease et al 2007), DataBoost-IM (Guo and Viktor 2004), balanced cascade (Liu et al 2009) and EasyEnsemble (Liu et al 2009). EasyEnsemble is briefly described in Section 3.3.3.

Cost-sensitive learning makes use of cost-matrices or misclassification costs to describe the penalty of misclassifying a certain instance. By incorporating misclassification costs, one can put a larger emphasis on the minority class. Indeed, misclassification costs of minority class examples are usually much larger than misclassification costs of majority class instances. The overall goal of cost-sensitive learning is "to develop a hypothesis that minimizes the overall cost on the training data set" (He and Garcia 2009) and can be accomplished in two ways. The first set of techniques uses cost-sensitive boosting variants, which basically incorporate misclassification costs in the weight updating scheme of the AdaBoost (Schapire 1999) algorithm. Popular techniques include AdaC1, AdaC2, AdaC3 (Sun et al 2007) and AdaCost (Fan et al 1999). A more detailed discussion on AdaBoost and AdaCost is presented in Sections 3.3.1 and 3.3.2, respectively. The second class of methods integrates misclassification costs directly into the underlying classifier and is specifically tailored to the type of inducer that is being used to construct the hypothesis. This basically means that the learning algorithm is made cost-sensitive. Barandela et al (2003) present a cost-sensitive Knn-classifier by using a weighted distance function that emphasizes the minority class. A cost-sensitive support vector machine (SVM) is another example of this category of techniques; see the related discussion in Section 3.3.2 for more details. Also worth mentioning are methods that use a combination of cost-sensitive learning with one of the sampling techniques previously described. Akbani et al (2004) first apply SMOTE to generate synthetic minority class instances (yet the dataset is still imbalanced after applying SMOTE), after which a cost-sensitive SVM is used.

The previously outlined sampling and cost-sensitive learning procedures are geared toward solving the between-class imbalance problem, where the number of majority class instances dominates the number of minority class examples. The vast majority of research focuses on this issue (Ali et al 2015; Guo et al 2008) and ignores the possible occurrence of within-class imbalance. This type of imbalance occurs when different subconcepts (clusters) within a single class have an entirely different amount of representatives. Jo and Japkowicz (2004) were the first to note that small disjuncts - clusters containing a limited amount of instances - can be responsible for performance degradation, since learning algorithms have difficulties in picking up on these concepts. They proposed a cluster-based oversampling method (CBO), where majority class instances and minority class instances are clustered separately. Afterwards, all clusters are oversampled in such a way that there is no more between-class and within-class imbalance. Cluster-based undersampling approaches have also been proposed. The undersampling based on clustering (SBC) technique (Yen and Lee 2009) clusters all training data from both classes simultaneously and subsequently selects majority class instances from each cluster, in accordance with the ratio of the number of majority class samples to the number of minority class samples within the cluster under consideration. Sobhani et al (2015) solve the between-class and within-class imbalance problem by clustering the majority class instances and selecting an equal amount of majority representatives from each cluster, such that the total number of selected majority instances equals the total number of minority class instances.

The aforementioned techniques have all been investigated on traditional low-dimensional data, usually on datasets from the UCI-repository. However, little is known on the effect of imbalance on the performance of classification algorithms when dealing with imbalanced behaviour data. An extensive discussion on behaviour data and its distinction with 'traditional' data is provided in Section 2.1. The classification techniques that are used for behaviour data can roughly be characterized as heuristic and regularization-based approaches, as discussed in more detail in Stankova et al (2015). Note that many classification techniques designed for traditional data cannot be directly applied to behaviour type of data. This is because tailored techniques need to be developed that have a low computational complexity and take the sparseness structure into account. Frasca et al (2013) present a cost-sensitive neural network algorithm designed to solve the imbalanced semi-supervised unigraph classification problem in the area of gene function prediction (GFP). They note that many algorithms designed for graph learning show a decay in the quality of the solutions when the input data are imbalanced.

1.2 Goals and contributions

The main goal of this article is to investigate the effect of a large variety of over- and undersampling methods, boosting variants and cost-sensitive learning techniques - which are traditionally applied to low-dimensional, dense data - on the problem of learning from imbalanced behaviour datasets. This is our first contribution, in the sense that we are the first to perform such an investigation. The base learner we will consider is an SVM1, for reasons indicated in Section 2.2. This learner will be applied on a large diversity of behaviour data gathered from several distinct domains. As will be discussed in Section 2.1, behaviour data have a different structure and show distinct characteristics compared to traditional data. As a direct consequence, the conclusions and results drawn from previous studies dealing with imbalanced learning cannot be generalized to this specific type of data. By conducting targeted experiments, we gain insights on the possible occurrence of overfitting during oversampling, dwell upon the presence of noise/outliers and its influence on the performance of undersampling methods and boosting variants, and finally we also gain knowledge on the effect of weak/strong learners in applying boosting algorithms. This investigation allows us to confirm or contradict these findings with the results obtained from studies dealing with 'traditional data'. Throughout the text, we will highlight the main differences.

The subgoal of this paper is to provide a benchmark for future studies in this domain. As we conduct a comprehensive comparison of the aforementioned techniques in terms of predictive performance and timings, researchers in this field can easily assess and compare their proposed techniques with the methods we have explored in this article. To enable this process, we provide implementations, datasets and results in our on-line repository: http://www.applieddatamining.com/cms/?q=software

1 In Section 5 we will investigate the effect of several base learners.

A third contribution lies in the fact that the applied sampling methods need to be adapted to cope with behaviour data. In Section 3.1 we provide specific implementations of the SMOTE and ADASYN algorithms and highlight the differences with their original formulations. The informed undersampling techniques presented in Section 3.2 rely on nearest neighbour computations, which require a different similarity measure as opposed to 'traditional data'.

Traditional studies integrating boosting with SVM-based component classifiers typically use an RBF-Kernel (Wickramaratna et al 2001; García and Lozano 2007; Li et al 2008). The specific boosting algorithm we propose in Section 3.3.1 combines a linear SVM with a logistic regression (LR) to form a single confidence-rated base learner. This specific kind of weak learner has, to our knowledge, never been proposed in earlier studies, yet proves to be very valuable in the setting of behaviour data.

Our fifth contribution lies in the exploration of a larger part of the parameter space. Chawla et al (2002) and He et al (2008) both employ K = 5 nearest neighbours in their experiments. Zhang and Mani (2003) presents, among others, the Near-miss 1 undersampling method, where majority class instances are selected based on their average distance to the three closest minority examples. Liu et al (2009) investigate their proposed EasyEnsemble algorithm (see Section 3.3.3 for a short description) using S = 4 subsets and T = 10 rounds of boosting. All of the aforementioned parameter settings are chosen without a clear motivation. In our experiments, we will consider a larger proportion of parameter space by varying, for instance, the number of nearest neighbours used in oversampling and undersampling methods, the number of subsets and boosting rounds in EasyEnsemble, etc. In doing this, we can more accurately compare distinct methods dealing with imbalanced learning. Parameter settings for each of the methods proposed in this study are shown in Section 4.2. Furthermore, we also study the effect of the number of subsets used in EasyEnsemble. As mentioned before, the authors only use 4 subsets.

The studies mentioned in the literature overview of Section 1.1 usually postulate a new method and compare it to some or more closely related variants by performing experiments over several datasets and reporting performance outcomes, without any further statistical grounding or interpretation. In this article, we provide statistical evidence by conducting hypothesis tests, which enable a more meaningful comparison.

2 Preliminaries

2.1 Behaviour data

The last decades have witnessed an explosion in data collection, data storage and data processing capabilities, leading to significant advances across various domains (Chen et al 2014). With the current technology, it is now possible to capture specific conducts of persons or organisations. Following the definition of Shmueli (2017), behaviour data models fine-grained actions and/or interactions of entities, such as persons or objects. To distinguish 'traditional' data from behaviour data, it is appropriate to introduce the framework proposed in Junque de Fortuny et al (2014a): traditional predictive analytics describes instances by using a small number of features (dozens up to hundreds). These datasets are usually dense in nature, meaning that each instance has a non-trivial value for each of its features (or at least for most of them). Recent years have witnessed a growing number of applications making use of behaviour data, which reflect specific behaviours of individuals or organisations. Think for example about users liking certain pages on Facebook, visiting certain websites or making transactions with specific merchants, and organisations interacting with suppliers/clients. Such data are high-dimensional, containing thousands or even millions of features. Indeed, the amount of websites a user can visit, the number of Facebook pages a user can like or the number of unique merchants a user can transact with are enormous, ranging up to millions. A key characteristic is the sparse nature of behaviour data, resulting in a sparse matrix representation (see further in this section). The majority of attributes have a trivial value of "zero" (or not present) or, as Junque de Fortuny et al (2014a) formulate it, "people only have a limited amount of behavioural capital". To summarize: behaviour data are very high-dimensional (10^4 - 10^8) and sparse, and are mostly originating from capturing the fine-grained behaviours of persons or companies2.

Besides the differences in structure/representation, behaviour data also show distinct properties in comparison to traditional data. Junque de Fortuny et al (2014a) proved empirically that larger behaviour datasets, in terms of the number of instances or features3, result in significant performance gains. This is in sheer contrast to the literature on sampling (reducing the number of instances) or feature selection that is commonly applied to traditional type of data, where usually there are a large number of irrelevant features that increase the variance and the opportunity to overfit (Provost and Fawcett 2013). Behaviour type of data show a different "relevance structure", in the sense that most of the features provide a small, though relevant, amount of additional information about the target prediction (Junque de Fortuny et al 2014a). Furthermore, the instances and features show a power-law distribution (Stankova et al 2015): the vast amount of instances have a low number of non-zero (active) features and, conversely, the majority of features are only present for a few instances. In her doctoral dissertation, Stankova (2016) showed that applying non-linear classifiers to behaviour type of data does not improve predictive performance in comparison to the plain application of linear techniques. This is definitely a major contrast to prior literature dealing with dense datasets, where linear methods generally have a lower predictive performance compared to highly non-linear techniques (Baesens et al 2003).

Imbalanced behaviour data occur naturally across a wide range of applications; some examples include companies employing data on the specific websites a user visits for targeted online advertising (Provost et al 2009), where usually only a relatively low amount of people respond positively on the advertisement. Data on the individual merchants with whom one transacts can be used to detect credit card fraud. The study of Bhattacharyya et al (2011), which uses real-life credit card transactional data, mentions that from the 49,858,600 transactions considered, only 2,420 were fraudulent (0.005%). Besides these marketing and fraud domains, other areas can be considered, such as churn prediction (Verbeke et al 2012), default prediction (Tobback et al 2016) and predictive policing (Bachner 2013). Despite the abundant application domains, little is known on the effect of imbalance on the performance of classification algorithms when dealing with this kind of data. Needless to say that empirical and theoretical developments in this field can give rise to major benefits for academia, companies and governments.

2 The last requirement is not strictly necessary when we talk about behaviour data in our study (the sparseness and high-dimensionality properties are sufficient).

3 In this sparse setting, instance removal or feature selection are in a certain sense equivalent to one another.

Behaviour data can be represented as a sparse matrix or, equivalently, as a bipartite graph. In the matrix representation, each instance i corresponds to a single row and columns correspond to specific features. Let's take the example of users rating films, where we wish to predict the gender of each user. In that case, each user corresponds to a row in the matrix and each specific film corresponds to a single column. Alternatively, behaviour data can also be represented as a bipartite graph (Stankova et al 2015). Consider a set of bottom nodes N_B and a set of top nodes N_T. A bipartite graph is a network that contains edges only between nodes of a different type: each bottom node n_b ∈ N_B can only be connected to nodes n_t ∈ N_T. Returning to our example, each user i corresponds with a bottom node n_b,i and each film j corresponds with a top node n_t,j.
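To make the two representations concrete, the following sketch (plain Python, with hypothetical user and film names chosen for illustration only) builds the same toy behaviour dataset as a sparse row encoding and as a bipartite edge list:

```python
# Toy behaviour dataset: users (bottom nodes) rating films (top nodes).
# The user/film names are hypothetical stand-ins.
ratings = {
    "user_a": {"film_1", "film_3"},
    "user_b": {"film_2"},
    "user_c": {"film_1", "film_2", "film_4"},
}

# Sparse-matrix view: map each film to a column index and store, per row,
# only the indices of the non-zero entries (a CSR-style encoding).
films = sorted({f for fs in ratings.values() for f in fs})
col = {f: j for j, f in enumerate(films)}
rows = {u: sorted(col[f] for f in fs) for u, fs in ratings.items()}

# Bipartite-graph view: an edge (user, film) exists exactly where the
# matrix holds a 1.
edges = [(u, f) for u, fs in ratings.items() for f in fs]

# Fraction of non-zero cells; real behaviour data would be far sparser.
density = sum(len(fs) for fs in ratings.values()) / (len(ratings) * len(films))
```

Storing only the active column indices per row is what keeps such data tractable even when the number of columns runs into the millions.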

Imbalanced behaviour data occur when the collection of instances contains substantially more examples of one class as compared to the other. In our running example of the previous paragraph, this means that there might be far more males rating films than females. In the datasets presented in Section 4.1, each of the instances contains a label indicating the particular class the example belongs to. In practice, instances do not necessarily have a label associated and the goal is to infer a label/score for these unknown instances based on the examples with known labels. Note that we focus on the two-class classification problem, where labels are limited to two types. A multi-class classifier can be obtained by solving a sequence of binary classification problems. This can, for instance, be accomplished with one-versus-one or one-versus-all setups (Hsu and Lin 2002).
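As a minimal illustration of the one-versus-all reduction, the sketch below predicts the class whose binary scorer gives the highest output score; the stand-in scoring functions are hypothetical, and in the setting of this paper each would be a linear SVM trained on one class versus the rest:

```python
# One-versus-all sketch: one binary scorer per class; predict the class
# whose scorer assigns the highest score to the instance.
def one_vs_all_predict(x, scorers):
    # scorers: dict mapping class label -> scoring function f(x)
    return max(scorers, key=lambda label: scorers[label](x))

# Toy 1-D example with hypothetical per-class score functions.
scorers = {
    "A": lambda x: -abs(x - 0.0),   # scores peak near 0
    "B": lambda x: -abs(x - 5.0),   # scores peak near 5
}
```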

2.2 Support vector machines (SVMs)

The support vector machine (SVM) is a "state-of-the-art" classification technique that has been applied in a wide variety of domains (Suykens et al 2002). The training data consist of a set of d-dimensional input vectors x_i with corresponding labels y_i ∈ {−1, 1}, i = 1, ..., m. A linear SVM constructs a linear classifier (hyperplane) in the input space. The hyperplane is constructed in such a way that as few points as possible are wrongfully classified, while simultaneously maximizing the margin between the two classes. This trade-off between minimizing the model complexity and reducing misclassification is governed by the regularization parameter C. The linear SVM is the solution to the following quadratic optimization problem:

\min_{w,b,\xi_i} \quad \frac{w^T w}{2} + C \sum_{i=1}^{m} \xi_i

\text{s.t.} \quad y_i \left( w^T x_i + b \right) \ge 1 - \xi_i, \quad i = 1, \ldots, m

\qquad \quad \; \xi_i \ge 0, \quad i = 1, \ldots, m \qquad (1)

where b represents the bias and ξ_i are the slack variables measuring classification errors. The classifier is given by y(x) = sign(w^T x + b), where w^T x + b represents the output score. It should be noted that we usually solve the dual form of equation (1), with solution y(x) = sign[ \sum_{i=1}^{m} (\alpha_i y_i x^T x_i) + b ] and dual variables4 α_i. An overview on the issues regarding SVMs and imbalanced classification for low-dimensional data can be found in Akbani et al (2004). Briefly: because of the imbalance, the majority of the slack variables ξ_i represent errors with respect to the majority class. This means that the minority class is under-represented and has a minor contribution to the goal function.

In this study, we have opted for a linear SVM (using the LIBLINEAR (Fan et al 2008) package) as the base learner to classify behaviour data5. The feature vector will have a sparse and high-dimensional representation with d = |N_T|. Note that the SVM formulation does not change in this setting. The solution vector w will become high-dimensional. The kernel matrix contains inner products of sparse and high-dimensional vectors, resulting in a (possibly) sparse matrix.
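Sparsity is also what keeps scoring cheap in this setting: for a binary instance, w^T x reduces to summing the weights of the instance's active features. A minimal sketch (the weight vector and bias below are hypothetical stand-ins for LIBLINEAR output, not actual trained values):

```python
# Scoring a linear SVM on a sparse binary instance: since x_j is 1 only
# for the active features, w^T x is the sum of the weights at those
# positions; the zero entries never need to be touched.
def svm_score(active_features, w, b):
    # active_features: indices j where x_j = 1; w: dense weight list
    return sum(w[j] for j in active_features) + b

def svm_predict(active_features, w, b):
    return 1 if svm_score(active_features, w, b) >= 0 else -1

# Toy weights/bias (hypothetical, as if returned by a linear SVM solver).
w = [0.5, -0.2, 0.0, 0.8]
b = -0.3
```

The cost per prediction is proportional to the number of active features, not to the full dimension d.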

The reasons we have chosen an SVM are twofold. First of all, it is a very popular technique for dealing with traditional and behaviour data and has been applied in many diverse domains (Suykens et al 2002). As we have noted in Section 1.1, tailored techniques need to be applied to behaviour datasets and roughly fall into two categories: regularization-based techniques and heuristic approaches. The latter type of techniques are not suitable for dealing with traditional data. The remaining regularization-based techniques have formed the subject of traditional studies dealing with imbalanced data. We have opted for this type of techniques since we can then easily compare our results to the conclusions obtained from previous studies. Another reason we have opted for an SVM is the fact that many of our proposed techniques (see Section 3 for details) rely on a boosting process. Wickramaratna et al (2001); García and Lozano (2007) noted that using a strong learner can result in performance degradation during boosting. Regularization-based techniques offer an added element of flexibility, in the sense that the strength of the learner can be controlled by varying the regularization parameter. Heuristic approaches today do not offer this attractive feature.

4 Also called support values.

5 In Section 5 we will consider different types of base learners.


2.3 Evaluation metrics

The performance measures that are typically used in traditional studies dealing with imbalanced data are accuracy, sensitivity, specificity, precision, F-measure and G-mean; see for instance Bhattacharyya et al (2011); Han et al (2005); González and Velásquez (2013). These measures are derived from the confusion matrix and are based on a certain threshold applied on the output scores of the classifier, where the threshold is usually contained within the classification algorithm. There are two main issues with this approach. First of all, the built-in threshold that is applied on the scores might not be suitable. This might cause low performance values with respect to these criteria, yet if we were to simply adapt the threshold, the same performance criteria might show excellent results. The second issue lies in the fact that the chosen threshold could be irrelevant with respect to the available capacity. We address this issue with a simple example. If a targeted advertisement company would apply this classifier (with built-in threshold) on new customers, the classifier might choose to predict 5% of all possible customers to target as positive. Yet the company only has a marketing budget that allows targeting 0.1% of all possible customers. It is clear that the chosen threshold is inappropriate here. For these reasons, we have chosen to opt for area under the ROC-curve (AUC) instead6. Chawla et al (2004) note that ROC-curves (and cost-curves) should be preferred over these traditional measures. The AUC, which measures the area under the ROC-curve, is more appropriate since it is independent of class skew and measures the ranking abilities of the classifier (Fawcett 2006). It answers the question: if we were to rank all instances according to output scores, is the classifier able to place the positive instances near the top of the list and the negative instances near the bottom? Because the method scans over all possible thresholds, it is independent of a specific cut-off value. Another reason we have chosen for AUC is the fact that many boosting and cost-sensitive learning techniques evaluate the performance using accuracy or misclassification cost, which suffer from the same issues as previously mentioned. Hence, we are one of the first to evaluate these techniques with respect to AUC. Note that AUC is also the preferred metric in the assessment of unsupervised anomaly detection techniques (Goldstein and Uchida 2016). Outliers/anomalies are strongly related to our field, as they signify rare events that are of specific importance to the analyst.
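The ranking interpretation of AUC can be made explicit: it equals the probability that a randomly chosen positive instance receives a higher output score than a randomly chosen negative one, with ties counting one half. A minimal sketch of this pairwise formulation, which involves no threshold at all:

```python
# AUC as a rank statistic: fraction of (positive, negative) pairs in
# which the positive instance outscores the negative one (ties = 1/2).
def auc(scores, labels):
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == -1]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0
               for p in pos for n in neg)
    return wins / (len(pos) * len(neg))
```

Because only the ordering of the scores matters, any monotone rescaling of the classifier output leaves the AUC unchanged, which is exactly why it sidesteps the threshold issues discussed above. (The quadratic pairwise loop is for clarity; a sort-based implementation is used in practice.)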

Though we focus on AUC, other measures are suitable for performance assessment of imbalanced data. Weighted area under the ROC-curve (wAUC) (Li and Fine 2010) is able to emphasize certain regions of ROC-space. Lift-curves (Bekkar et al 2013), popular in the marketing domain, are ideal to assess the prevalence of positive cases among instances ranked highly by the classifier. These curves can be evaluated according to the available capacity. When cost information is available, this should be integrated in the performance assessment criterion. Indeed, the cost of a false negative is usually much larger than the cost of a false positive. We could then opt for cost-curves (Whitrow et al 2009; Bhattacharyya et al 2011), which can also be evaluated with consideration of the available capacity. Few studies make use of costs (Ngai et al 2011), mainly because these are difficult to determine, uncertain and have a temporal characteristic. To conclude this section, we note that each of the performance assessment criteria mentioned in this paragraph requires some additional information, in the form of a weight function (wAUC), capacity requirements and/or costs (lift/cost-curves). These measures are application specific and this is the main reason we excluded them from our study.

6 We provide traditional measures (sensitivity, specificity, G-means, F-measure) in our online repository: http://www.applieddatamining.com/cms/?q=software

3 Methods

3.1 Oversampling

Over the years, the data mining community investigated several techniques that balance the data distribution by oversampling the minority class. In this section, we investigate the basic oversampling with replacement (OSR) approach in conjunction with synthetic sample generation procedures. The first technique simply duplicates minority class instances by a certain amount. Several references dealing with traditional low-dimensional data note that this technique may make the decision regions of the learner smaller and too specific on the replicated instances, causing the learner to overfit (Chawla et al 2002; Han et al 2005; Liu et al 2009; He and Garcia 2009). The synthetic approaches are designed to overcome this overfitting behaviour by generating new, non-overlapping instances in the minority class space. The techniques we investigate are SMOTE (Chawla et al 2002) and ADASYN (He et al 2008).

Consider a certain minority class instance x_i. In the traditional setting, SMOTE and ADASYN will generate a new synthetic instance x_syn by choosing a random point on the line segment between the point x_i and one of its K minority class nearest neighbours (computed according to Euclidean distance). In SMOTE, each original minority class instance x_i generates the same number of synthetic instances, whereas the ADASYN algorithm generates a variable number of synthetic instances by putting a larger weight on the harder to learn minority instances. This way, the learner is more focussed toward difficult instances.
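The traditional interpolation step can be sketched in a few lines; x_i and x_nn stand for a minority instance and one of its minority-class nearest neighbours:

```python
import random

# Traditional SMOTE/ADASYN step: a synthetic point is drawn uniformly on
# the line segment between a minority instance x_i and one of its
# minority-class nearest neighbours x_nn.
def smote_interpolate(x_i, x_nn, rng=random):
    u = rng.random()  # u in [0, 1)
    return [a + u * (b - a) for a, b in zip(x_i, x_nn)]
```

For binary behaviour data this convex combination would produce fractional values, which is one reason the adapted generation rule below is needed.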

These techniques need to be adapted when dealing with binary behaviour data, where each instance is represented by a large and sparse binary vector. The main differences with the original versions of SMOTE and ADASYN are indicated in Table 1 and are explained in more detail in the following paragraphs. The first difference lies in the generation of the new synthetic instances. As before, a synthetic sample is constructed based on two original minority instances. When both instances have a 0 or 1 in their corresponding column, the synthetic sample will also show a 0 or 1, respectively, at the considered position. When only one of the two minority instances shows a 1, the decision will be made according to a user-specified parameter prior_opt. This parameter can be one of the following three options:

– "FlipCoin", where there is a 50% probability that the synthetic instance will show a 1 at the considered position.

– "Prior", where the value of the synthetic sample is determined by the prior within the minority class in the corresponding column. One generates a random number u in the interval [0,1] and puts a 1 in the corresponding position if u is smaller than the prior.

– "Reverse Prior", where one generates a random number u in the interval [0,1] and puts a 1 in the corresponding position if u is larger than the prior within the minority class for this column/feature.

The second difference lies in the way the nearest neighbours are determined. It would be unwise to consider Euclidean distance in that respect, because it treats a 0-0 match in the same way as a 1-1 match. Since we are working with dichotomous variables (i.e. present or absent), a 1-1 match at a certain position is far more informative than a 0-0 match. For instance, two users visiting the same web page contains more information than two users who didn't visit that specific page. The similarity between two instances is defined by a user-specified parameter sim_measure. We have limited ourselves to two popular choices: "Jaccard" uses the Jaccard similarity measure (Finch 2005) and "Cosine" uses the cosine similarity measure (Huang 2008). In principle, one could apply any of the metrics summarized in Stankova et al (2015).
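The agreement/disagreement rule and the Jaccard measure described above can be sketched as follows (a simplified illustration of the rules in this section, not the full Algorithm 1):

```python
import random

# Synthetic-sample rule for binary behaviour data: positions where both
# parent instances agree are copied; disagreeing positions are resolved
# by the prior_opt strategy. 'priors' holds, per column, the prior
# probability of a 1 within the minority class.
def make_synthetic(x_i, x_nn, prior_opt, priors, rng=random):
    syn = []
    for j, (a, b) in enumerate(zip(x_i, x_nn)):
        if a == b:
            syn.append(a)                                   # agreement: copy
        elif prior_opt == "FlipCoin":
            syn.append(1 if rng.random() < 0.5 else 0)
        elif prior_opt == "Prior":
            syn.append(1 if rng.random() < priors[j] else 0)
        else:  # "Reverse Prior"
            syn.append(1 if rng.random() > priors[j] else 0)
    return syn

# Jaccard similarity on binary vectors: 1-1 matches are informative,
# 0-0 matches are deliberately ignored.
def jaccard(x, y):
    inter = sum(1 for a, b in zip(x, y) if a == 1 and b == 1)
    union = sum(1 for a, b in zip(x, y) if a == 1 or b == 1)
    return inter / union if union else 0.0
```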

Table 1 Differences between the original SMOTE and ADASYN implementations and the versions SMOTEbeh and ADASYNbeh tailored for behaviour data (parameter explanations are provided in Section 3.1)

|                              | SMOTE                        | SMOTEbeh       | ADASYN                       | ADASYNbeh      |
| Amount of oversampling       | N                            | β              | β                            | β              |
| Synthetic sample generation  | random point on line segment | prior_opt      | random point on line segment | prior_opt      |
| Similarity measure           | Euclidean                    | Jaccard/Cosine | Euclidean                    | Jaccard/Cosine |
| Number of nearest neighbours | K                            | K′             | K                            | K′, K          |

A detailed pseudo-code implementation of our versions of SMOTE and ADASYN, called SMOTEbeh and ADASYNbeh, is shown in Algorithm 1. Also note that we introduced an extra parameter K′ to decouple the determination of the number of synthetic instances that need to be generated for a certain minority instance from the number of nearest neighbours it uses to obtain the synthetic instances.

The experimental set-up adopted by Chawla et al (2002); He et al (2008) considers K = 5 as the number of nearest neighbours used, without a detailed motivation. Furthermore, the latter paper compares SMOTE with an oversampling percentage N of 200% (meaning that the size of the newly created synthetic data instances is twice as large as the size of the original minority training data) with a completely balanced dataset (β = 1 in Algorithm 1) in ADASYN. In our experiments, we will consider a variety of possible K-values and compare SMOTE and ADASYN with identical oversampling rates, controlled by a single parameter β.


Algorithm 1 SMOTEbeh and ADASYNbeh pseudo-code implementation for binary behaviour data

Input: X_min, X_maj, β, prior_opt, sim_measure, K′, K

a) Determine the total amount of synthetic minority instances that need to be generated (β ∈ [0,1] is a parameter that controls the amount of oversampling; β = 1 means a fully balanced dataset will be created):

   G = (|X_maj| − |X_min|) × β

b) Determine the number of synthetic samples g_i that need to be generated for each minority class instance x_i:
   if SMOTE then
      g_i ← ⌈G / |X_min|⌉
   else if ADASYN then
      calculate the K nearest neighbours (with the sim_measure option) for instance x_i from the set {X_min \ x_i, X_maj} and determine Δ_i, the number of majority class nearest neighbours.
      Next, calculate r_i = Δ_i / K and normalize these values: r̂_i = r_i / Σ_{j=1}^{|X_min|} r_j
      g_i ← ⌈r̂_i × G⌉
   end if

c) Generate g_i synthetic samples for minority instance x_i:
   calculate the K′ nearest neighbours (with the sim_measure option) for instance x_i from the set X_min \ x_i. Additionally, remove those nearest neighbours that have a similarity of 0 with x_i. The remaining nearest neighbours form the set K_used. If this set turns out to be empty, set K_used = {x_i}.
   for iter = 1 → g_i do
      randomly choose 1 nearest neighbour from the set K_used
      generate a synthetic minority sample from x_i and the chosen nearest neighbour (according to prior_opt)
   end for

d) Because Σ g_i ≥ G, randomly remove synthetic points until the total number of synthetic samples equals G.
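As an illustration of steps (a) and (b) for the ADASYN variant, the sketch below spreads the oversampling budget G over the minority instances in proportion to their fraction Δ_i/K of majority-class nearest neighbours (the Δ_i values are assumed to have been computed already by a nearest-neighbour search):

```python
import math

# Steps (a)-(b) of Algorithm 1, ADASYN branch: harder minority instances
# (those with more majority-class neighbours) receive a larger share of
# the synthetic-sample budget G.
def adasyn_allocation(n_min, n_maj, beta, deltas, K):
    G = (n_maj - n_min) * beta              # step (a): total budget
    r = [d / K for d in deltas]             # hardness of each minority point
    total = sum(r)
    r_hat = [ri / total for ri in r]        # normalized weights
    return [math.ceil(ri * G) for ri in r_hat]  # g_i per minority instance
```

Because of the ceiling, the allocations sum to at least G, which is exactly why step (d) of Algorithm 1 randomly trims the surplus afterwards.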

3.2 Undersampling

In this section, we will compare the simple random undersampling technique (RUS) with informed undersampling approaches. The first method randomly discards majority class training instances. While this technique can achieve fast training performance of the underlying base learner, an obvious disadvantage is the fact that it might discard potentially useful majority class instances. The informed approaches try to intelligently retain the most informative majority class instances, in the hope to increase predictive performance while at the same time keeping the fast training speed of the underlying classifier.

The first set of informed undersampling techniques is based on the methods proposed by Zhang and Mani (2003); Chyi (2003). The K-nearest neighbour 'classifier' is used to determine the importance of each majority class training instance by calculating the total similarity with the K closest minority class training set examples. Regarding similarity computations for binary behaviour data, we refer to the related discussion in Section 3.1. The first technique, called "Closest Knn", retains majority class examples that are closest to the minority class instances. These instances are the most difficult to classify and we would expect them to be most informative. The second method, called "Closest tot sim", is similar to the previously described technique. The difference is that it no longer computes similarities with the K closest minority neighbours; instead, it calculates the total similarity with all minority instances in determining the importance. The main reason we included this technique is computational speed7. The last techniques, called "Farthest Knn" and "Farthest tot sim", are included for comparison with the previously mentioned techniques. Their implementation is identical to the previously described techniques; however, they retain majority class examples that are farthest from the minority class instances. For each of the proposed methods in this paragraph, the amount of undersampling is controlled by a user-specified parameter β_u according to the following formula:

Nr_rem = ⌊(|X_maj| − |X_min|) × β_u⌋    (2)

where Nr_rem represents the amount of majority class instances to be discarded; β_u = 1 means a completely balanced dataset is obtained.

The second set of informed undersampling techniques aims at targeting the within-class imbalance problem and is based on the approach proposed in Sobhani et al (2015). They postulate that this within-class imbalance problem is more pronounced in the case of undersampling methods. If we were to randomly select majority class instances, then the probability of drawing an instance from small disjuncts within the majority class would be very low. These regions might therefore contain no representatives and remain unlearned. The authors chose to address this issue by clustering the majority class instances in a first step and subsequently selecting an equal number of representatives from each cluster. The reported results show their approach to outperform the CBO-algorithm (see Section 1.1). In the following paragraphs we dig deeper into the subject of clustering behaviour data; we refer ahead to Algorithm 2 for an overview of our cluster-based undersampling method (CBU).

As we noted in Section 2.1, behaviour data can be represented as a bipartite graph. The clustering of behaviour data8 aims at finding groups of nodes (communities) that connect more to each other than to other nodes in the network. This subject is currently an active area of research, with a rapid evolution of a vast number of cluster detection techniques (Zha et al 2001; Dhillon 2001; Larremore et al 2014; Beckett 2016). We refer to Porter et al (2009); Fortunato (2010); Alzahrani and Horadam (2016) for detailed surveys on the problem. It should be noted that the vast majority of publications deal with the subject of clustering unigraphs (networks with only one type of nodes); it is only fairly recently that interest grew in the clustering of bigraphs. In our implementations we have chosen the popular modularity-based approaches9 for clustering bigraphs, and these fall into two directions. In the first, the modularity function that is used for unigraphs is adapted to be suitable for bigraphs; see for instance the work of Barber (2007). The other direction, which we adopt in our study, projects the bigraph to a unigraph of bottom nodes and performs community detection on the projection using traditional modularity definitions. Note that Guimera et al (2007) observed no difference in the obtained communities using either direction.

7 For each majority class instance, we no longer need to sort the similarities with all minority instances in determining the K largest values.

8 This subject is more commonly known as community detection in bipartite graphs.
9 Modularity-based approaches attempt to optimize a quality function known as modularity for finding community structures in networks and rely on the use of heuristics due to the complexity of the problem.


In this article we adopt the methodology proposed in Alzahrani and Horadam (2016), which consists of projecting the bigraph to a unigraph of bottom nodes and applying10 the Louvain algorithm (Blondel et al 2008) on the projection. Lancichinetti and Fortunato (2009) performed a comparative study regarding the performance of 12 community detection algorithms and concluded the Louvain method to be the best modularity-based algorithm and second best among all algorithms. The heuristic is very fast, with an O(m) complexity, where m is the number of edges in the unigraph. We have chosen the Louvain algorithm because of its speed performance and the availability of a toolbox (Jutla et al 2011-2016) that is directly compatible with our implementations. The toolbox provides a generalized implementation of the Louvain algorithm in the sense that multiple definitions of modularity are possible. The quality function we chose is the popular Newman-Girvan modularity (Newman and Girvan 2004).

With respect to the projection, Alzahrani and Horadam (2016) connect two bottom nodes if they have at least one top node in common. The connection weight between two bottom nodes in the projection is set to the number of shared top nodes. In our implementation we adapted the connection weights in accordance with Stankova et al (2015) as follows: first of all, we assign weights to the top nodes corresponding to the hyperbolic tangent applied to the inverse degree of the top node. Next, the connection weight between two bottom nodes in the projection corresponds with the total weight of the shared top nodes. Top nodes having low degrees therefore obtain a higher contribution in the projection (e.g. two users making a transaction to a local book store are assumed to be more closely connected to each other than two users making a transaction to a large retail store).
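This weighting scheme can be sketched as follows (a minimal sketch; we assume the bigraph is given as a dict mapping each top node to the set of bottom nodes it connects):

```python
import math
from collections import defaultdict
from itertools import combinations

def project_bigraph(top_to_bottom):
    """Project a bigraph onto its bottom nodes; each top node contributes
    tanh(1/degree) to every pair of bottom nodes it connects, as in
    Stankova et al (2015): low-degree top nodes contribute more."""
    edge_weight = defaultdict(float)
    for top, bottoms in top_to_bottom.items():
        w = math.tanh(1.0 / len(bottoms))        # weight of this top node
        for i, j in combinations(sorted(bottoms), 2):
            edge_weight[(i, j)] += w             # sum over shared top nodes
    return dict(edge_weight)

# a low-degree "book store" versus a high-degree "retail store"
g = project_bigraph({"bookstore": {"u1", "u2"},
                     "retailer": {"u1", "u2", "u3", "u4"}})
# u1 and u2 share both top nodes; the bookstore dominates their weight
```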

In our CBU-algorithm (see Algorithm 2), after clustering the bigraph containing exclusively majority class instances, we randomly select an equal amount of majority instances from each community to target the within-class imbalance problem. In the rare situation that the number of obtained clusters exceeds the required amount of majority instances (Nr_retain), we sort the communities according to a user-specified parameter Clust_opt and randomly select 1 instance from the first Nr_retain clusters. The parameter Clust_opt can take the following values:

– C_Smallest, where we sort clusters in ascending order of their size
– C_Largest, where we sort clusters in descending order of their size

Note that we randomly select majority class instances from each cluster; Yen and Lee (2009) found a random selection strategy after clustering to be superior to informed approaches based on distance.

10 Note that they also made use of the flow-based algorithm Infomap (Rosvall and Bergstrom 2008), which shows excellent results on the LFR-benchmark.


Algorithm 2 CBU pseudo-code implementation for behaviour data

Input: X_min, X_maj, β_u, Clust_opt
a) Cluster the majority class instances X_maj:
– Assign weights to each top node, corresponding with the hyperbolic tangent applied to the inverse of the node's degree.
– Project the bigraph X_maj to a weighted unigraph consisting of bottom node majority class instances. The weight w_ij between majority class instances i and j corresponds with the total weight of the shared top nodes.
– Apply the Louvain algorithm (Blondel et al 2008) on the projected unigraph to partition the majority class instances into clusters.
b) Select majority class instances:
Nr_rem ← ⌊(|X_maj| − |X_min|) × β_u⌋ (see Equation (2))
Nr_retain ← |X_maj| − Nr_rem
if Nr_retain < |Clust| then
– Sort clusters according to Clust_opt
– Randomly select 1 instance from the first Nr_retain clusters
else
– Randomly select ⌈Nr_retain / |Clust|⌉ majority class instances from each cluster
– Randomly discard instances from the previous step until its size corresponds with Nr_retain
end if
c) Return the new training set consisting of X_min and the selected majority class instances from step b)
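Step b) of Algorithm 2 can be sketched as follows (a minimal sketch; `clusters` is a hypothetical list of index lists, e.g. as produced by a Louvain implementation; the min() guard for clusters smaller than the per-cluster quota is our own addition):

```python
import random

def cbu_select(clusters, n_min, beta_u=1.0, smallest_first=True, seed=0):
    """Step b) of Algorithm 2: retain roughly equally many majority
    instances per cluster to counter within-class imbalance."""
    rng = random.Random(seed)
    n_maj = sum(len(c) for c in clusters)
    n_retain = n_maj - int((n_maj - n_min) * beta_u)   # Equation (2)
    if n_retain < len(clusters):
        # more clusters than instances to retain: sort by size (Clust_opt)
        # and take one random instance from each of the first n_retain clusters
        order = sorted(clusters, key=len, reverse=not smallest_first)
        return [rng.choice(c) for c in order[:n_retain]]
    per_cluster = -(-n_retain // len(clusters))        # ceiling division
    picked = [x for c in clusters
              for x in rng.sample(c, min(per_cluster, len(c)))]
    # randomly discard the surplus until exactly n_retain remain
    return rng.sample(picked, n_retain)
```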

3.3 Boosting, cost-sensitive learning and EasyEnsemble

3.3.1 AdaBoost

The AdaBoost (Schapire and Singer 1999; Schapire 1999) algorithm has been designed from the perspective of improving the performance of a weak learner so that it achieves accuracies that are comparable with a strong learning algorithm. Fundamental to the idea of boosting is to maintain a weight distribution over the training set. In each boosting iteration, the weights of wrongly classified instances are increased so that the underlying weak learner puts more emphasis on these hard examples. In our implementation we consider using a SVM as base learner. SVMs are generally regarded as strong learners. The studies of Wickramaratna et al (2001); Garcia and Lozano (2007) note that using a strong learner usually results in performance degradation during the boosting process; the RBF-kernel SVM classifier (Wickramaratna et al 2001; Li et al 2008) is used as the underlying classifier to prove their point. In our study we employ a linear SVM, which can be considered a weaker version compared to the RBF-kernel. Furthermore, the regularization parameter C can be viewed as a "weakness" indicator11: lowering the C-value results in weaker learners, as can be seen from the goal function of the SVM optimization problem (see equation (1)). We will come back to this point in Section 4.5.

11 The distinction between weak/strong learners is loosely 'defined' in Schapire (1999). A weak learner corresponds with a hypothesis that performs just slightly better than random guessing; a strong learner is able to generate a hypothesis with an arbitrarily low error rate, given enough data. We adopt these definitions but consider the distinction between weak/strong based on training set error. In a SVM context it is quite typical that error levels on training data drop with increasing C-values (Suykens et al 2002). A learner that is 'too strong' means that even though its performance on training data is very high, it fails to generalize well and the test set error increases due to overfitting.


The boosting algorithm with underlying SVM is presented in Algorithm 3. This algorithm closely follows the original boosting implementation presented in Schapire and Singer (1999) and requires each learner to output confidence-rated predictions in the interval [−1,1]. Since the SVM outputs real-valued scores, we apply the procedure of Platt (1999) to transform these scores into probability estimates (which can easily be translated to form confidence-rated predictions). A logistic regression (LR) model (Ng and Jordan 2002) is trained using as input the SVM-scores and the corresponding labels as output. Note that the same data that are used to construct the linear SVM can be used to estimate the LR-model; see Platt (1999) for a motivation.
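The score-to-confidence step can be sketched with scikit-learn (a minimal sketch on synthetic data, not the authors' extended-LIBLINEAR implementation; mapping the Platt probability p to the confidence 2p − 1 ∈ [−1,1] is one simple way to obtain confidence-rated predictions):

```python
import numpy as np
from sklearn.svm import LinearSVC
from sklearn.linear_model import LogisticRegression

rng = np.random.RandomState(0)
X = rng.randn(200, 5)
y = np.where(X[:, 0] + 0.5 * rng.randn(200) > 0, 1, -1)

svm = LinearSVC(C=0.1).fit(X, y)                 # linear SVM base learner
scores = svm.decision_function(X).reshape(-1, 1)

# Platt (1999): fit a logistic regression on the SVM scores
platt = LogisticRegression().fit(scores, y)
p_pos = platt.predict_proba(scores)[:, 1]
confidence = 2.0 * p_pos - 1.0                   # confidence-rated output in [-1, 1]
```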

The boosting algorithm requires the weak learner to be trained using a distribution D_t. One could sample from this distribution to generate (unweighted) examples and train a SVM next. We have chosen to include weights in the SVM goal function formulation (this requires an extension of the LIBLINEAR package):

min_{w,b,ξ_i}  w^T w / 2 + C Σ_{i=1}^{m} weight_i ξ_i    (3)

The weights weight_i are set according to the weight distribution D_t(i) in each round of boosting. Note that the C-value that is used to train the SVM model will be divided by mean(weight_i). This normalization allows for a fair comparison between the weighted and unweighted SVM versions (e.g. in the first round of boosting, weight_i = 1/m for all instances; by multiplying the C-value with m, this corresponds to solving the unweighted problem with the same C-value).

We introduce an additional parameter µ, called the weight percentage, with values in [0,100], in the boosting algorithm of Algorithm 3. This parameter controls the amount of training data that is used to construct the SVM model and subsequent LR model. We sort the original training data in descending order according to the distribution D_t. Next, we form a new training set of minimal cardinality by including points from the original sorted training data until the total weight is higher than the weight-percentage µ/100. This way, the newly formed training data will contain only the part of the original training data that has the most weight. This partial set will then be used to construct a weighted SVM-model (according to equation (3), with updated distribution for this set) and a subsequent LR-model. The idea of using a partial dataset to construct the base learner not only reduces training times but also weakens the learner (Garcia and Lozano 2007).
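The subset-selection rule can be sketched as follows (a minimal sketch):

```python
import numpy as np

def top_weight_subset(D, mu):
    """Minimal-cardinality index set whose total weight exceeds mu/100,
    filling in the heaviest instances of the distribution D first."""
    order = np.argsort(D)[::-1]                    # sort by descending weight
    cum = np.cumsum(D[order])
    # first position where the cumulative weight strictly exceeds mu/100
    k = int(np.searchsorted(cum, mu / 100.0, side="right")) + 1
    return order[:k]

D = np.array([0.4, 0.1, 0.3, 0.2])
idx = top_weight_subset(D, 75)
# the two heaviest instances give 0.4 + 0.3 <= 0.75, so a third is needed
```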

In Algorithm 3 we have an explicit check to verify if r_AB = 1. In this case the SVM model outputs scores that allow for a perfect classification on the training set. The subsequent LR model will find a threshold and output a value of −1 if the SVM score is lower than this threshold (a value of +1 if the score is higher). In this situation the training data would be perfectly classified. In our implementation we attempt to avoid this kind of behaviour because it can lead to overfitting, meaning the model might pinpoint the wrong threshold and make too drastic decisions (we lose the meaning of confidence). The check if r_AB ≤ 0 verifies whether the currently boosted model is performing worse than random (a random model would have an r_AB-value of 0); obviously, if the model performs worse than random, we quit the boosting process. During the first round of boosting we perform similar checks that are not explicitly indicated in Algorithm 3: in the case where r_AB = 1, we output the SVM scores instead of the LR binary values; when r_AB ≤ 0, we quit the boosting process and output the LR scores.

Algorithm 3 AdaBoost with a SVM-LR combination as a base learner

Input: (X,Y) = (x_1,y_1),...,(x_m,y_m), C, T, µ
Initialize the distribution: D_1(i) = 1/m
for t = 1 to T do
– train the weak learner using distribution D_t; the weak learner consists of a weighted linear SVM and a LR model, trained with weight-percentage µ of D_t:
    h_t ← Train_WeakLearner(X, Y, D_t, C, µ)
– compute the weighted confidence r_AB on the training data:
    r_AB ← Σ_{i=1}^{m} D_t(i) y_i h_t(x_i)
  If (r_AB = 1 or r_AB ≤ 0) then α_t ← 0 and stop the boosting process
– choose α_t ∈ R:
    α_t ← (1/2) log((1 + r_AB) / (1 − r_AB))
– update the distribution:
    D_{t+1}(i) ← D_t(i) exp(−α_t y_i h_t(x_i)) / Z_t
  where Z_t is a normalization factor (chosen so that D_{t+1} will be a distribution)
end for
Output the final hypothesis (the output score is the term contained in the sign function):
    H(x) = sign(Σ_{t=1}^{T} ᾱ_t h_t(x))  with  ᾱ_t = α_t / Σ_{i=1}^{T} α_i
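The confidence-rated update loop of Algorithm 3 can be sketched as follows (a minimal sketch, with a toy decision stump standing in for the weighted SVM-LR weak learner; the first-round special cases discussed above are omitted for brevity):

```python
import numpy as np

def adaboost(X, y, train_weak, T=30):
    """Confidence-rated AdaBoost (Schapire and Singer 1999): each weak
    hypothesis h maps instances to [-1, 1]."""
    m = len(y)
    D = np.full(m, 1.0 / m)                     # D_1(i) = 1/m
    hyps, alphas = [], []
    for _ in range(T):
        h = train_weak(X, y, D)
        r = float(np.sum(D * y * h(X)))         # weighted confidence r_AB
        if r >= 1.0 or r <= 0.0:                # perfect fit, or worse than random
            break
        alpha = 0.5 * np.log((1.0 + r) / (1.0 - r))
        D = D * np.exp(-alpha * y * h(X))
        D /= D.sum()                            # normalization factor Z_t
        hyps.append(h)
        alphas.append(alpha)
    a = np.array(alphas) / np.sum(alphas)       # normalized alpha_t
    return lambda X_: np.sign(sum(ai * hi(X_) for ai, hi in zip(a, hyps)))

def stump(X, y, D):
    """Toy weak learner: weighted sign of the single best feature."""
    corr = (D * y) @ X                          # weighted correlation per feature
    j = int(np.argmax(np.abs(corr)))
    s = np.sign(corr[j])
    return lambda Z: s * np.sign(Z[:, j])
```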

3.3.2 AdaCost

The AdaCost algorithm (Fan et al 1999) is a variant of cost-sensitive learning where misclassification costs are introduced in the weight-update formula of AdaBoost. The cost-sensitive update rule increases the weights of costly misclassified instances more aggressively and decreases the weights of costly correct classifications more conservatively. Each instance is given a misclassification cost c_i, where we chose to put c_i = 1 for positive (minority) instances and c_i = 1/R for negative (majority) instances; R is a user-defined value that allows one to put more emphasis on the minority class. The implementation of AdaCost is similar to Algorithm 3, yet there are a few differences (Fan et al 1999): firstly, the initial distribution is chosen as D_1(i) = c_i / Σ_{j=1}^{m} c_j; secondly, the weight update rule is given by D_{t+1}(i) = D_t(i) exp(−α_t y_i h_t(x_i) β(i)) / Z_t, where β(i) = −0.5 sign(y_i h_t(x_i)) c_i + 0.5 is a cost-adjustment function; finally, the choice of α_t is given by α_t = (1/2) log((1 + r_AC) / (1 − r_AC)), where r_AC = Σ_{i=1}^{m} D_t(i) y_i h_t(x_i) β(i). Note that the checks to stop the boosting process prematurely (see the second bullet in Algorithm 3) are still based on the r-value obtained from AdaBoost (r_AB). This is because β(i) ∈ [0,1].
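The behaviour of the cost-adjustment function is easy to verify (a minimal sketch): for a misclassified instance (sign(y_i h_t(x_i)) = −1) it returns 0.5 + 0.5c_i, amplifying the weight increase, while for a correctly classified one it returns 0.5 − 0.5c_i, damping the weight decrease.

```python
def beta_adjust(y, h, c):
    """AdaCost cost-adjustment: beta(i) = -0.5 * sign(y_i * h(x_i)) * c_i + 0.5."""
    sign = 1.0 if y * h > 0 else -1.0
    return -0.5 * sign * c + 0.5

# a costly (c_i = 1) minority instance
assert beta_adjust(+1, -0.8, 1.0) == 1.0    # misclassified: maximal boost
assert beta_adjust(+1, +0.8, 1.0) == 0.0    # correct: maximal damping
# a cheap (c_i = 1/R with R = 4) majority instance, correctly classified
assert beta_adjust(-1, -0.5, 0.25) == 0.375
```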

In the first boosting round of AdaCost, the weighted SVM formulation (3) is equivalent to solving a SVM formulation with the following goal function:

min_{w,b,ξ_i}  w^T w / 2 + C^+ Σ_{i|y_i=1} ξ_i + C^− Σ_{i|y_i=−1} ξ_i    (4)

where C^+ / C^− = R. This can be seen as a cost-sensitive version of a SVM, an idea that was initially proposed by Veropoulos et al (1999).
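In scikit-learn terms, the C+/C− ratio corresponds to a per-class `class_weight` (a minimal sketch with synthetic data, not the authors' extended LIBLINEAR formulation):

```python
import numpy as np
from sklearn.svm import LinearSVC

rng = np.random.RandomState(0)
X = rng.randn(500, 4)
y = np.where(X[:, 0] + rng.randn(500) > 1.5, 1, -1)   # rare positive class

R = 9.0  # ratio C+/C-: slack on the minority class is penalized R times more
svm = LinearSVC(C=0.01, class_weight={1: R, -1: 1.0}).fit(X, y)
```

Setting R to the class-size ratio balances the total weight placed on each class, the popular R_L choice mentioned in Section 4.2.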

3.3.3 EasyEnsemble

One of the disadvantages of the random undersampling method is the fact that we are discarding potentially valuable information. EasyEnsemble (Liu et al 2009) is a method that combines several balanced subsets (S in total) containing randomly selected majority class instances together with all minority examples. Each subset contains the same amount of instances from both classes and is fed to the boosting algorithm presented in Algorithm 3. Afterwards, the weak learners h_{s,t} of each subset s are simply combined to form the final ensemble:

H(x) = sign(Σ_{s=1}^{S} Σ_{t=1}^{T} ᾱ_{s,t} h_{s,t}(x))  with s = 1,...,S; t = 1,...,T    (5)

It is clear that this technique benefits from a combination of bagging and boosting (Liu et al 2009). Note that we apply the same boosting algorithm as previously described to each of the balanced subsets. However, when r_AB = 1 in the first round of boosting, we quit the boosting process, put α_1 = 1 and continue to use the trained LR-model in the final ensemble. It was previously noted that this can cause overfitting, in the sense that it can pinpoint a wrong threshold. However, the LR-model still contains information, and the combination thereof with the models obtained from the other subsets remains valuable (as we noted from initial experiments by comparing the situation where we include or reject those subsets).
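The ensemble construction can be sketched as follows (a minimal sketch; `boost` is a hypothetical callable implementing Algorithm 3 and returning a (hypotheses, normalized alphas) pair per subset):

```python
import numpy as np

def easy_ensemble(X_maj, X_min, boost, S=10, seed=0):
    """Combine S boosted learners, each trained on all minority instances
    plus an equally large random majority subset (Liu et al 2009)."""
    rng = np.random.RandomState(seed)
    members = []
    for _ in range(S):
        idx = rng.choice(len(X_maj), size=len(X_min), replace=False)
        X = np.vstack([X_maj[idx], X_min])
        y = np.r_[-np.ones(len(idx)), np.ones(len(X_min))]
        members.append(boost(X, y))          # (hypotheses, alphas) per subset
    def predict(X_):
        # Equation (5): sign of the weighted sum over all subsets and rounds
        score = sum(a * h(X_) for hyps, alphas in members
                    for h, a in zip(hyps, alphas))
        return np.sign(score)
    return predict
```

Each subset is only twice the minority class size, and the S boosting runs are independent, which is why the method parallelizes so well.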

There are a few subtle though important differences with respect to the experiments performed by Liu et al (2009). First of all, the authors use a CART-algorithm (Breiman et al 1984) as base learner; we employ a linear SVM with subsequent LR as a weak learner. To our knowledge this combination has not been proposed elsewhere, yet it proves to be very efficient in this setting. Secondly, their methodology employs a discrete version of AdaBoost where each weak learner outputs binary values in {−1,1}; as already stated, we make use of an improved version of AdaBoost that relies on confidence-rated predictions (Schapire and Singer 1999). Finally, the authors reported performances with fixed levels of S = 4 and T = 10; we will investigate the effect of varying S and T levels.


4 Results and discussion

4.1 Datasets

Stankova et al (2015) provide the first large collection of benchmark behaviour datasets for classification. In our experiments we make use of these data sources and extend this repository with two additional datasets. Each of these datasets shows a bipartite structure with a clear target variable to predict. We refer to this study and the next paragraph for a short description of the available data resources. In this paragraph we indicate why we have chosen to include or reject certain data sources from the aforementioned study. The available datasets can be divided into small, medium and large datasets, based on the number of instances and the number of features present. The Norwegian companies and Reality Mining datasets comprise the small datasets, since they contain only a few hundreds of instances or features and are therefore regarded as impractical for our purposes. The MovieLens, Yahoo, TaFeng, Book-Crossing and LibimSeTi datasets belong to the medium-sized datasets, each containing a few thousands up to a few hundreds of thousands of instances and features. All of these datasets are included in our study. The large datasets, containing hundreds of thousands up to millions of instances and features, are the Flickr and Kdd databases12. Other large proprietary data sources, not included in Stankova et al (2015), are the corporate residence fraud (CRF) and banking (Bank) datasets, which arise from real-life application domains with intrinsic imbalance.

To summarize, we have gathered datasets containing such fine-grained behaviour data from a wide variety of application domains. The MovieLens datasets, for which we are predicting the gender13 (Mov G) or the genre thriller14 (Mov Th), provide data on which films each user has rated. The Yahoo movies15 dataset has a similar structure, where the age of each user, Yahoo A (above or below average), or the gender, Yahoo G, is being predicted. The TaFeng dataset16 contains data on shopping behaviour, where age (below or above average) is being predicted based upon which products are purchased. In the book-crossing (Book) dataset (Ziegler et al 2005), users rate books and the age of the user (above or below average) is being predicted. LibimSeTi (LST) contains data from a dating site (Brozovsky and Petricek 2007), where users rate each other's profiles and gender is being predicted. In the advertisement (Adver) (Lichman 2013) dataset, we try to predict if a url is an advertisement based on a large variety of binary features of the url. Note that this dataset does not arise from the behaviour of entities, yet it still has a high-dimensional and sparse representation. The Flickr dataset (Cha et al 2009) contains pictures being marked by users as favorite, and we predict the number of comments on each picture (below

12 Flickr and KDD will be excluded in the comparative study of Section 4.6. This is because some methods are too computationally intensive - especially in combination with the large number of possible parameter combinations - to be applied on these very large data sources. Furthermore, our statistical evidence is already sufficiently strong to conclude significance without these datasets. Having said this, these data sources will be included in the analysis of Section 5.

13 MovieLens 1M dataset from http://grouplens.org/datasets/movielens
14 MovieLens 10M dataset from http://grouplens.org/datasets/movielens
15 https://webscope.sandbox.yahoo.com
16 http://www.bigdatalab.ac.cn/benchmark/bm/dd?data=Ta-Feng


or above average). In the Kdd cup data, the performance of a student on a test is being predicted, based on artificially created binary features (Yu et al 2010). The corporate residence fraud dataset (CRF) (Junque de Fortuny et al 2014b) contains data on foreign companies making transactions with specific Belgian companies, where we try to predict whether the foreign company commits residence fraud (a type of fiscal fraud). Finally, the banking dataset (Bank) (Martens et al 2016) contains detailed behaviour on consumers making transactions with merchants or other persons, to predict interest in a pension fund product. Some characteristics of these datasets can be found in Table 2; the features column only shows the number of active features17.

Table 2 Behaviour data characteristics. The final column shows the imbalance ratio p, defined as the ratio of the number of minority class instances to the amount of majority class instances in the training set, expressed as a percentage: p = 100 × |X_min|_train / |X_maj|_train. See Section 4.2 for details regarding p.

Name       |X_maj|   |X_min|   Features    p
Mov G         4331      1709       3706    1 & 25
Mov Th       10546       131      69878    1.24 (p = [])
Yahoo A       6030      1612      11915    1 & 25
Yahoo G       5436      2206      11915    1 & 25
TaFeng       17330     14310      23719    1 & 25
Book         42900     18858     282973    1 & 25
LST          59702     60145     166353    1
Adver         2792       457       1555    16.38 (p = []) & 1
CRF         869071        62     108753    0.0072 (p = [])
Bank       1193619     11107    3139570    0.93 (p = [])
Flickr     8166814   3028330     497472    0.1
Kdd        7171885   1235867   19306083    0.5

4.2 Methodology

Regarding the experiments performed in the upcoming sections, we applied a tenfold cross-validation procedure. Each of the folds contains 80% training data, 10% validation data and 10% test data. Note that these percentages are valid for both the majority class and the minority class (stratified sampling). As can be seen from Table 2, some datasets are balanced in nature. We created artificial imbalance for these datasets by removing minority class instances from the initial training set according to a user-defined parameter p. We ensured that the amount of minority training instances corresponds to p percent of the majority class training size: |X_min|_train = (p/100) × |X_maj|_train. As an example, say that we are using the Book dataset with p = 25. In that case we know that the majority class contains 34320 training instances (80% of |X_maj| = 42900). The minority training data would contain 8580 instances (25% of 34320). When the dataset is already imbalanced, we define p = [], which means that no downsampling

17 Active features represent features that are present for at least one instance in the dataset. A non-active feature corresponds with a column of zeros in the matrix representation and would not contribute to the model.


of the minority class training data is performed. Note that the validation and test data are left untouched18.
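The Book example above works out as follows (a quick arithmetic check):

```python
n_maj = 42900                                  # |X_maj| for the Book dataset
n_maj_train = int(round(0.8 * n_maj))          # 80% training split -> 34320
n_min_train = int(round(0.25 * n_maj_train))   # p = 25 -> 8580 minority instances
```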

The methods detailed in Section 3 are applied on the training data. For both the under- and oversampling approaches, a linear SVM is trained on the newly created balanced training data, with regularization parameter C having values

C = [10^-7, 10^-5, 10^-3, 10^-1, 10^0]

The validation data are used for parameter tuning purposes; the test data allow us to obtain the generalization performance. The results reported show the average over ten folds. In the remaining paragraphs of this section we describe the various parameter settings used in our experiments and give a brief overview of the parameters occurring in each method.

Considering the oversampling techniques, the parameter settings are as follows:

β = [0, 1/3, 2/3, 1]
prior_opt = {FlipCoin, Reverse Prior}
sim_measure = {Cosine, Jaccard}
K = K̄ = [10^0, 10^1, 10^2, |X_min|_train]

We didn't include the "Prior" option, due to initial experiments showing a lower performance in comparison with the other options. This can be explained by the low priors occurring in each column, resulting in synthetic samples that mainly show zeros in 0-1 match situations. The oversampling with replacement method (OSR) only uses the β parameter; SMOTE uses all but the K parameter, and ADASYN uses all of the parameters listed above.

With respect to the undersampling techniques, the following parameter settings are used:

β_u = [0, 1/4, 1/2, 3/4, 1]
sim_measure = {Cosine, Jaccard}
K = [10^0, 10^1, 10^2, |X_min|_train]
Clust_opt = {C_Smallest, C_Largest}

The random undersampling (RUS) technique only uses the β_u parameter. The second set of methods, "Closest tot sim" and "Farthest tot sim", uses β_u and sim_measure. The third set of techniques, "Closest Knn" and "Farthest Knn", makes use of all the parameters listed above, except for Clust_opt. The final approach, CBU, employs β_u and Clust_opt.

The boosting variants (AdaBoost, AdaCost and EasyEnsemble) presented in Section 3.3 make use of the following settings:

T = 30
µ = [100, 75]
C = [10^-7, 10^-5, 10^-3, 10^-1]
R = [2, 8, R_L], where R_L = |X_maj|_train / |X_min|_train
S = [5, 10, 15]

The AdaBoost algorithm includes the T, µ and C parameters. AdaCost additionally uses cost-ratios R; we have chosen a range of values because misclassification costs are unknown for many business applications (He and Garcia 2009; Fan et al 1999; Sun et al 2007). The final value R_L seems to be a popular choice (Akbani et al 2004; Luts et al 2010), because the total weight on the majority class balances with the total weight on the minority class. The final method, EasyEnsemble, uses S subsets in addition to the parameters previously mentioned for AdaBoost. Note that we consider the boosting iteration round t ∈ [1,T] as a tunable parameter19.

18 This means that if we start from a balanced set, only the training data will show artificial imbalance according to the imbalance ratio p; the validation and test data would remain balanced. Since AUC (and some other metrics) is independent of class skew, it would be unwise to make these sets imbalanced as well, because that would lead to discarding minority class instances that are relevant for performance assessment.

4.3 Oversampling

The oversampling techniques presented in Section 3.1 are applied to each of the datasets from Table 2 with varying imbalance levels (by varying the p-parameter). The experiments we conducted use the following methodology: for each dataset, we apply the oversampling techniques with all possible parameter combinations on the training data to create newly balanced datasets, after which linear SVMs are trained. The optimal parameter combination with respect to each imbalance ratio β is selected based on validation set AUC-performance. With these parameters, the AUC on the test data is obtained. Results show the average over ten folds.

The results on four arbitrarily selected datasets are shown in Table 3; full results on each of the data sources can be found in Appendix A, Table 11. From these tables we can conclude that performance generally increases with growing β-values. It is more correct to say that performance keeps improving with growing β-levels until an optimal point β is reached; increasing the balance level after this optimal value causes only small fluctuations with respect to the optimal performance. Traditional studies dealing with dense, low-dimensional data note that the OSR method can suffer from overfitting, as already mentioned in Section 3.1. It is interesting to see that in this sparse, high-dimensional setting this effect doesn't seem to occur20. Furthermore, the computationally expensive synthetic sampling approaches do not seem to improve performance over the more simple OSR method. This can be explained by the fact that many instances have a very limited amount of non-zero elements (Junque de Fortuny et al 2014a; Stankova et al 2015), causing the synthetic sampling procedures to be limited in the amount of unique new samples they can produce. Note also that synthetic sampling procedures have historically been designed to overcome the overfitting behaviour of OSR, which no longer seems to apply here.

19 We sometimes make use of T as a symbol to indicate the boosting iteration number instead of t.
20 The SVM regularization parameter C controls for overfitting. Additionally, the influence of majority class noise/outliers (see Section 3.2) on the learned hyperplane decreases as we oversample the minority class: the hyperplane is more sensitive towards minority instances, and a change in its direction/orientation towards these majority class noise/outliers will be more heavily penalized.


Table 3 Oversampling experiments. Results show the average tenfold AUC test set performance with respect to increasing β-values (β = [β1, β2, β3, β4] = [0, 1/3, 2/3, 1]). Additionally, standard deviations are shown between brackets. Optimal performances for each method are highlighted in boldface. See Section 4.2 for parameter settings.

Mov Th (p = [])
            β1             β2             β3             β4
OSR         79.77 (5.33)   85.3 (4.66)    83.16 (4.5)    84.59 (5.69)
SMOTE       79.77 (5.33)   84.18 (6.51)   85.58 (5.97)   84.33 (5.75)
ADASYN      79.77 (5.33)   84.11 (6.77)   85.86 (6.06)   85.36 (5.13)

Yahoo G (p = 25)
            β1             β2             β3             β4
OSR         78.82 (1.39)   78.78 (2.02)   78.74 (1.86)   78.35 (1.47)
SMOTE       78.82 (1.39)   79.23 (1.57)   79.1 (1.2)     79.03 (1.89)
ADASYN      78.82 (1.39)   79.12 (1.43)   79.23 (1.37)   79.51 (2.01)

TaFeng (p = 25)
            β1             β2             β3             β4
OSR         66.94 (1.34)   68.84 (1.21)   66.99 (1.42)   67.7 (1.41)
SMOTE       66.94 (1.34)   68.47 (1.5)    67.07 (1.15)   66.65 (0.81)
ADASYN      66.94 (1.34)   68.62 (1.38)   67.85 (1.6)    66.91 (1.39)

Book (p = 25)
            β1             β2             β3             β4
OSR         60.08 (0.71)   61.05 (0.94)   61.82 (1.12)   64.62 (0.57)
SMOTE       60.08 (0.71)   62.6 (0.73)    60.95 (0.68)   63 (0.8)
ADASYN      60.08 (0.71)   62.33 (0.73)   60.77 (0.85)   63.04 (0.58)

4.4 Undersampling

Regarding the undersampling techniques, we employ a similar experimental set-up as in the previous section: first, train SVMs on the undersampled training data with all possible parameter combinations; second, choose a suitable parameter combination based on validation set AUC-performance; and finally, obtain the AUC-performance on the test set. The results are averaged across ten folds.

Before going into detail on the experimental findings, we give a short note on the effect of noise/outliers on SVM performance. Consider the following imaginary example: say that a majority class female is rating a lot of action films and thrillers. We can consider this an outlier, since most of the females are rating romantic or drama films. This specific female has far more in common with the minority class males, who also rate action films and thrillers. Outliers are therefore instances showing odd behaviour; noise refers to wrongfully labelled instances. The effect of noise/outliers on SVM performance can be severe. Indeed, many of the instances in the dataset have a support value (dual variable αi) of 0. Instances contributing to the model (a non-zero support value) are examples close to the borderline (Suykens et al 2002) and noise/outliers (as can be derived from the KKT dual-complementarity conditions).

21 If αi = 0, then yi(wT xi + b) ≥ 1. For noise/outliers the term yi(wT xi + b) is negative, hence αi ≠ 0.

24 Jellis Vanhoeyveld David Martens

With respect to noise/outliers, the support value can reach a maximal value of C, the chosen regularization parameter in the SVM formulation. It should be clear that lowering the C-value leads to models that are less sensitive to noise/outliers.

The experimental outcomes of the various undersampling processes are shown in Table 4 for the same four previously selected datasets. We refer to Appendix B, Table 12 for results on the entire data repository. When removing only a limited number of majority class instances (βu = βu2 = 1/4), we observed that in 12 out of 16 datasets the "Farthest Knn" method outperforms the "Closest Knn" technique; the four remaining datasets show comparable performances. This finding shows that the "Farthest" method is very suitable for removing majority class noise/outliers and empirically demonstrates their performance degrading effect. With higher undersampling rates (βu = 1), the "Closest Knn" method generally achieves higher performances than the "Farthest Knn" technique, though the results are less clear. We observed that only 9 out of 16 datasets indicate "Closest" as a clear winner, with 3 tie situations and 4 situations where "Farthest" shows the best results. Intuitively, we would expect the "Closest" method to perform better, since we are keeping the most informative instances (examples close to the minority class). However, this technique will also emphasize noise/outliers, and this is the main reason why the results are less significant. One can also see this from the fact that the RUS method shows far better results than the aforementioned techniques when βu = 1, an observation that is valid for all datasets. The RUS method is efficient in realizing noise/outlier reduction effects, since each SVM is trained with only a fraction of these types of instances. It is remarkable that in 9 out of 16 datasets the RUS method with βu = 1 outperforms the baseline model (βu = 0); in 4 datasets we observed equal performance, and in 3 datasets losses with respect to the baseline. RUS shows two attractive features: majority class noise/outlier removal and its ability to put more emphasis on the minority class instances by undersampling the majority class. A general downside of the approach is the information loss, and this effect causes the 3 losses to the baseline. Usually, though, the attractive features of RUS dominate the latter effect, and this indicates that it is definitely not necessary to include all majority class training instances in constructing the predictive model. Apparently there is a high level of redundancy, which can be exploited to construct efficient hypotheses.
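The RUS step itself is simple; a minimal sketch follows, keeping every minority instance and a random subset of the majority class. The linear mapping from the βu level to the retained majority size is our own illustrative assumption, not necessarily the exact scheme used in the experiments:

```python
import random

def random_undersample(X_maj, X_min, beta_u, rng):
    """RUS sketch: keep every minority instance and a random subset of
    the majority class. beta_u = 0 keeps all majority instances,
    beta_u = 1 yields a fully balanced set (linear in between)."""
    n_keep = round(len(X_maj) - beta_u * (len(X_maj) - len(X_min)))
    return rng.sample(X_maj, n_keep), list(X_min)

rng = random.Random(1)
X_maj = list(range(1000))  # stand-ins for majority-class rows
X_min = list(range(50))    # minority-class rows
kept_maj, kept_min = random_undersample(X_maj, X_min, beta_u=1.0, rng=rng)
print(len(kept_maj), len(kept_min))  # 50 50 -> balanced training set
```

Because each trained SVM sees only a random fraction of the majority class, it also sees only a fraction of the majority noise/outliers, which is the reduction effect described above.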

In the next paragraph we compare the cluster-based undersampling (CBU) approach to the random undersampling (RUS) technique. Before proceeding, we should note that Table 12 does not show results on the LST, CRF and Bank datasets, because the projected unigraph was too large to fit in memory (larger than 16 GB). A limited number of features (top nodes) does seem to be active for a relatively large number of instances (bottom nodes). All these instances will be connected in the unigraph projection, resulting in relatively high degrees for the bottom nodes. A possible solution to circumvent this problem might be to cut the weakest edges in the projected unigraph.

If we apply a limited amount of undersampling (βu = βu2 = 1/4), the CBU technique outperforms RUS in 8 out of 13 datasets (4 losses and 1 tie with respect to RUS). On the highly imbalanced datasets (p = 1 or p = []), the CBU method wins

22 A tie occurs in the situation where the absolute difference in AUC is smaller than or equal to 0.5.


in 8 out of 8 cases. We can therefore conclude that CBU outperforms RUS at low undersampling rates. This shows that handling the within-class imbalance can indeed be beneficial in this environment. If we make a comparison at the highest undersampling rate, βu = βu5 = 1, the results are completely opposite: CBU wins in only 1 case, with 9 losses with respect to RUS and 3 ties. On the highly imbalanced datasets, CBU outperforms RUS for 1 dataset (with 4 losses and 3 ties). It is clear that RUS is the preferred technique in case of high undersampling rates. The reason is that CBU emphasizes the small disjuncts and basically fails to adequately pick up on the common, larger communities. Furthermore, these small disjuncts might actually correspond with majority class noise/outliers, and CBU is focusing on these types of instances. On the overall level, where we consider all undersampling rates (except βu = 0), CBU wins in 5 out of 13 datasets, RUS wins in 4 cases, and there are 4 tie situations (yet RUS did outperform CBU in all 4 cases). Both methods seem to be competitive with one another across a wide range of imbalance levels. If we pay particular attention to the highly imbalanced sets, CBU wins in 5 cases, and there are 3 tie situations on the remaining sets (yet RUS did outperform CBU in all 3 cases). CBU seems to be the method of choice (with respect to RUS) when the datasets show high imbalance levels.
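The CBU intuition — sample from every majority cluster so that small disjuncts stay represented next to the large communities — can be sketched as follows. The clustering itself is abstracted away (cluster labels are assumed given), and the even per-cluster allocation is an illustrative assumption rather than the exact algorithm used in the experiments:

```python
import random
from collections import defaultdict

def cluster_based_undersample(X_maj, cluster_ids, n_target, rng):
    """Draw roughly n_target/num_clusters instances from every majority
    cluster, so small 'disjuncts' stay represented next to the large
    communities (the within-class-imbalance idea behind CBU)."""
    clusters = defaultdict(list)
    for x, c in zip(X_maj, cluster_ids):
        clusters[c].append(x)
    per_cluster = max(1, n_target // len(clusters))
    sampled = []
    for members in clusters.values():
        rng.shuffle(members)
        sampled.extend(members[:per_cluster])
    return sampled

rng = random.Random(7)
X_maj = list(range(100))        # 90 instances in a big community,
ids = [0] * 90 + [1] * 10       # 10 in a tiny one (hypothetical labels)
sample = cluster_based_undersample(X_maj, ids, n_target=20, rng=rng)
print(len(sample))  # 20, with both communities contributing 10 each
```

This also makes the failure mode at high undersampling rates visible: the tiny cluster receives as many slots as the large community, so any noise/outlier disjuncts are over-represented.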

Table 4 Undersampling experiments. Results show the average tenfold AUC test set performance with respect to increasing βu-values (βu = [βu1, βu2, βu3, βu4, βu5] = [0, 1/4, 1/2, 3/4, 1]). Optimal performances for each imbalance ratio are marked with an asterisk (*). Additionally, standard deviations are shown between brackets. Cl K represents the "Closest Knn" technique, Cl T the "Closest tot sim" technique (similarly for Far K and Far T, see Section 3.2). Parameter settings are given in Section 4.2.

Mov Th(p = [])
        βu1           βu2            βu3           βu4           βu5
RUS     79.77 (5.3)   80.32 (5.8)    81.57 (5.5)   81.86 (6.6)   81.26 (6.2)
Cl K    79.77 (5.3)   79.25 (4.5)    78.07 (5)     76.25 (6.5)   62.46 (8.5)
Cl T    79.77 (5.3)   78.4 (4.4)     72.41 (3.5)   64.66 (4.5)   60.37 (7.3)
Far K   79.77 (5.3)   84.54 (5)      83.64 (6.4)   80.02 (7.3)   56.82 (10.3)
Far T   79.77 (5.3)   85.03 (5.7)*   82.68 (6.8)   75.61 (9.2)   56.77 (10.9)
CBU     80.11 (5.8)   81.17 (6)      81.08 (6.5)   84.17 (5.1)   80.96 (6.9)

Yahoo G(p = 25)
        βu1           βu2            βu3           βu4           βu5
RUS     78.82 (1.4)   78.91 (1.6)    78.97 (1.6)*  78.61 (1.6)   77.82 (2.1)
Cl K    78.82 (1.4)   77.26 (1.5)    72.52 (1.5)   67.86 (2)     65.07 (2.7)
Cl T    78.82 (1.4)   76.83 (1)      71.99 (1.8)   67.15 (2.3)   61.1 (2.7)
Far K   78.82 (1.4)   78.26 (2.2)    74.69 (2.7)   67.22 (2.1)   60.72 (2.3)
Far T   78.82 (1.4)   77.68 (2.6)    72.44 (3)     64.94 (2.4)   59.6 (2)
CBU     75.25 (3.2)   75.22 (2.4)    74.69 (2.3)   73.07 (2.4)   70.69 (2.4)

TaFeng(p = 25)
        βu1           βu2            βu3           βu4           βu5
RUS     66.94 (1.3)   67.44 (1.3)    68.1 (1.4)    68.27 (1.4)*  66.13 (1.2)
Cl K    66.94 (1.3)   66.13 (1.4)    63.39 (1.2)   59.83 (1.3)   56.94 (0.7)
Cl T    66.94 (1.3)   66.38 (1.5)    62.89 (1.6)   57.46 (1.3)   54.56 (1.3)
Far K   66.94 (1.3)   68.06 (1.4)    66.43 (1.6)   64.46 (1.5)   63.35 (1.3)
Far T   66.94 (1.3)   64.31 (1.1)    62.69 (1)     61.27 (1.1)   59.03 (1)
CBU     64.81 (1.2)   64.15 (1.1)    64.13 (1.2)   63.88 (0.8)   63.46 (0.8)

Book(p = 25)
        βu1           βu2            βu3           βu4           βu5
RUS     60.08 (0.7)   60.13 (0.6)    60.4 (0.8)    60.33 (0.8)   63.28 (0.8)
Cl K    60.08 (0.7)   59.96 (0.7)    60.13 (0.8)   59.96 (1)     59.28 (0.7)
Cl T    60.08 (0.7)   59.96 (0.7)    60.13 (0.8)   60.29 (0.4)   54.5 (0.9)
Far K   60.08 (0.7)   63.29 (1)      64.19 (0.8)*  57.3 (1.1)    55.66 (1.1)
Far T   60.08 (0.7)   62.14 (0.5)    58.27 (0.6)   56.37 (1)     55.66 (1.1)
CBU     54.82 (0.9)   54.67 (0.9)    54.71 (0.9)   54.66 (1)     54.78 (0.9)

4.5 Boosting variants

In this section, the plain AdaBoost (AB) method is compared with the cost-sensitive AdaCost (AC) algorithm, the EasyEnsemble (EE) technique and the baseline (BL) method (where we train a single SVM on the imbalanced training data). AB, AC and EE have an underlying boosting process in common. The AUC-values are stored during each boosting iteration, which allows us to show the performance graphically as a function of the number of iterations t. In EasyEnsemble (EE) we combine the weak learners of each subset by summing their individual contributions. For example, when t = 2 boosting iterations are performed, we evaluate the performance of the combined learner $\sum_{s=1}^{S}\sum_{t=1}^{2}\alpha_{st}h_{st}(x)$. Figures 1, 2, 3 and 4 show the average tenfold AUC-performance (with μ = 100%) on the test data. The left-side figures show the AB, AC (with varying R-values) and EE (with varying S-values) performance with respect to the number of boosting iterations; the C-value is tuned according to highest validation set AUC-performance (over all possible boosting iterations). The right-side figures investigate the influence of the C-parameter for AB and EE (with S = 15) and allow us to gain insight into the effect of weak and strong learners on the generalization performance. Results on the remaining datasets are given in Appendix C, Figures 6-17. Note that we only indicate results with weight-percentage μ = 100% (use all instances in the training process); previous experiments (with μ = 75%) showed inferior results. This suggests that it is indeed important to include all minority instances during the training of each weak learner, an observation also made by Liu et al (2009).
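The combined EasyEnsemble learner described above follows directly from the formula; a minimal sketch, where the decision stumps are toy stand-ins for the boosted SVM weak learners:

```python
def easyensemble_score(x, ensembles, t_cut):
    """Combined EasyEnsemble decision value after t_cut boosting rounds:
    sum over subsets s and rounds t of alpha_st * h_st(x), where
    ensembles[s] is the (alpha, h) list produced by boosting subset s."""
    return sum(alpha * h(x)
               for subset in ensembles
               for alpha, h in subset[:t_cut])

# toy decision stumps standing in for the boosted SVM weak learners
h_pos = lambda x: 1 if x > 0 else -1
h_neg = lambda x: -1
ensembles = [[(0.8, h_pos), (0.3, h_neg)],   # subset s = 1
             [(0.5, h_pos), (0.2, h_pos)]]   # subset s = 2
print(easyensemble_score(2.0, ensembles, t_cut=2))  # ~1.2 (0.5 + 0.7)
```

Truncating each subset's list at t_cut is what allows the AUC to be evaluated at every boosting iteration without retraining.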

When investigating the influence of the regularization parameter C in AdaBoost, we can graphically verify that it is a "weakness" indicator, in the sense that lower C-values correspond to weaker learners. While these low C-value learners have a low performance in the first rounds of boosting, the possible boost in performance is usually much higher compared to higher C-value inducers. Strong learners can already make a suitable distinction between minority and majority class behaviours in the first boosting round. The boosting process will emphasize the hard to learn instances, among which are noise/outliers. Therefore, too strong learners are not suitable for use in a boosting process, as already noted by Wickramaratna et al (2001). In Figures 2(b) and 3(b) we observe that the lowest C-values (C = 10^-7, 10^-5) can outperform higher C-values (C = 10^-3, 10^-1). In many cases the AB-process outperforms the baseline (BL); see also the results in Table 5. A combination of several weak learners can often achieve better generalization than a single strong learner.

Inspired by the success of the RUS method with βu = 1 (for reasons indicated in Section 4.4), the EE-technique tries to overcome an obvious downside of this approach: we are removing a lot of majority class instances that can be very informative. Hence, we try to make up for this loss by combining several of these balanced training sets, in the hope that the majority class is more adequately explored. This results in a very effective learner that seems to outperform the AB, AC and BL methods in the majority of cases, as can be seen from these figures. It is remarkable that we do not need to use that many subsets S to achieve high performances: increasing the S-value from 5 to 15 does not seem to cause major increases in AUC (see also Table 5). The influence of the C-parameter is similar as described in the previous paragraph.

Fig. 1 Mov G(p = 25) dataset results showing average tenfold AUC-performance on test data (with μ = 100%) for (a) AB, AC and EE with C chosen according to highest validation set AUC-performance (over all possible T); (b) AB and EE(S = 15) with varying C-levels. [Plot data not reproduced; both panels show the test-set AUC versus the boosting iteration T, with legend entries AB, AC(R2), AC(R8), AC(RD), EE(S5), EE(S10), EE(S15) and BL in panel (a), and AB/EE at C = 1e-07, 1e-05, 0.001, 0.1 plus BL in panel (b).]


Fig. 2 Book(p = 25) dataset. [Plot data not reproduced; same layout and legends as Fig. 1.]

Fig. 3 TaFeng(p = 25) dataset. [Plot data not reproduced; same layout and legends as Fig. 1.]

Fig. 4 Bank(p = []) dataset. [Plot data not reproduced; same layout and legends as Fig. 1.]


4.6 Final comparison

4.6.1 Performance-wise comparison

In this section we wish to determine which of the aforementioned methods shows superior performance in terms of AUC on test data. The experimental set-up is described in Section 4.2. Note that we exclude β = 0 and βu = 0 in determining the performance of the oversampling respectively undersampling techniques, to be able to compare them with the baseline (BL) approach. The results for AB, AC and EE are shown for μ = 100%. The number of boosting iterations t ∈ [0, T] is also considered a tunable parameter. Table 5 shows the final comparison between all methods. Additionally, we added an average rank column, indicating the mean rank of each algorithm across all datasets. The rank of an algorithm for a specific dataset corresponds to its position in the sorted AUC-performance spectrum, where the best performing algorithm gets the rank of 1, the second best rank 2, etc. In case two or more algorithms perform equally well, average ranks are assigned (if two algorithms have ranks of 3 and 4, respectively, and show equal performance, then both methods get assigned a rank of 3.5). Note that the LST dataset is excluded from the calculation of average ranks, since we consider this a dataset not being affected by imbalance and showing equal performances.
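The ranking scheme with tie handling can be sketched as follows (the AUC values are hypothetical):

```python
def average_ranks(aucs):
    """Rank algorithms on one dataset: best AUC gets rank 1, and equal
    performances share the average of the ranks they would occupy."""
    order = sorted(range(len(aucs)), key=lambda i: -aucs[i])
    ranks = [0.0] * len(aucs)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and aucs[order[j + 1]] == aucs[order[i]]:
            j += 1
        shared = (i + 1 + j + 1) / 2        # mean of positions i+1 .. j+1
        for idx in order[i:j + 1]:
            ranks[idx] = shared
        i = j + 1
    return ranks

# the 3rd and 4th best perform equally well, so both get rank 3.5
print(average_ranks([86.4, 81.5, 79.8, 79.8, 71.6]))  # [1.0, 2.0, 3.5, 3.5, 5.0]
```

Averaging these per-dataset ranks across all datasets yields the average rank column of Table 5.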

As a side note, we noticed that increasing the number of minority training set instances can significantly improve performances. We can see this clearly by comparing the p = 25 outcomes with the results of p = 1 on the same dataset. Junque de Fortuny et al (2014a) already showed empirically that larger behaviour datasets (in terms of the number of instances or features) lead to higher AUC-performances. This is confirmed in our experiments, though we want to add that it only holds when the data size is increased in a balanced fashion.

The results in Table 5 confirm our previous analysis. Each of the oversampling, undersampling (except Cl Knn) and boosting variants seems to improve upon the BL performance. Sobhani et al (2015) note that the direct application of ensemble techniques on imbalanced data does not circumvent the imbalanced learning issue; in our experiments on behaviour data, we do find that a simple boosting procedure applied to the imbalanced data improves upon the predictive performance. The oversampling techniques OSR, SMOTE and ADASYN show similar outcomes and are competitive methods, seeming to outperform the undersampling techniques and the plain boosting (AB) and cost-sensitive learning (AC) approaches. When dealing with traditional data, the performances of oversampling techniques are generally worse than those of undersampling approaches (Drummond and Holte 2003). It is interesting to observe that for these large and sparse behaviour datasets the opposite is true. This is heavily related to the intrinsic properties of the data under consideration: undersampling (or feature selection, for that matter) is not recommended because, as we have noted in Section 2.1, each feature provides a small additional contribution to the predictive performance. With respect to AC, the results indicate that R = R_L is a better choice compared to more random cost ratios R = 28. The EE-technique has the

23 The BL technique trains single SVMs on the imbalanced training data


lowest average rank and is our best performing method. The combination of several balanced subsets with filtered noise/outlier contributions allows one to explore the majority class space efficiently. To strengthen our previous statements, we will in the next paragraphs conduct statistical significance tests in accordance with the procedure outlined in Demsar (2006); the latter paper presents a framework to perform statistical comparisons of classifiers over multiple datasets.

The first null-hypothesis we try to reject postulates that all of the algorithms perform equally well, or equivalently, that their average ranks R_j (see Table 5) are indifferent. Under the null-hypothesis, the Friedman statistic (Friedman 1937)

$$\chi_F^2 = \frac{12N}{k(k+1)}\left[\sum_{j=1}^{k} R_j^2 - \frac{k(k+1)^2}{4}\right] \qquad (6)$$

with N the number of datasets (N = 15) and k the number of algorithms (k = 13), can be used to compute the statistic (Iman and Davenport 1980)

$$F_F = \frac{(N-1)\,\chi_F^2}{N(k-1)-\chi_F^2} \qquad (7)$$

The latter is distributed according to the F-distribution with k − 1 and (k − 1)(N − 1) degrees of freedom. Evaluating this measure with the average ranks shown in Table 5 results in F_F = 22.98, which is a lot higher than the critical value of 1.81 at the α = 0.05 significance level. We can therefore reject the null-hypothesis and proceed with post-hoc tests.
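Equations (6) and (7) can be checked directly against the average ranks reported in Table 5; small deviations arise because those ranks are rounded to three decimals:

```python
def friedman_stats(avg_ranks, n_datasets):
    """Friedman chi-square (Eq. 6) and Iman-Davenport F statistic
    (Eq. 7) computed from average ranks over n_datasets datasets."""
    k, n = len(avg_ranks), n_datasets
    chi2 = 12 * n / (k * (k + 1)) * (sum(r * r for r in avg_ranks)
                                     - k * (k + 1) ** 2 / 4)
    f_f = (n - 1) * chi2 / (n * (k - 1) - chi2)
    return chi2, f_f

# average ranks from Table 5 (k = 13 algorithms, N = 15 datasets)
ranks = [11.600, 5.000, 4.533, 4.800, 8.167, 12.467, 8.133,
         8.567, 8.267, 8.467, 5.400, 3.267, 2.333]
chi2, f_f = friedman_stats(ranks, n_datasets=15)
print(round(f_f, 1))  # close to the F_F = 22.98 reported in the text
```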

The Nemenyi test (Nemenyi 1963) allows one to compare each classifier to one another and adjusts the critical value to compensate for making k(k − 1)/2 comparisons. "The performance of two classifiers is significantly different if the corresponding average ranks differ by at least the critical difference (CD)" (Demsar 2006); we refer to the aforementioned paper for the calculation of the CD. The result of these comparisons at the α = 0.05 significance level is shown in Table 6. This test is already quite revealing in the sense that, for instance, the oversampling methods (OSR, SMOTE and ADASYN), AC (R = R_L) and the EE-techniques are significantly different from the BL. Yet it does not indicate that undersampling or applying plain boosting (AB) results in significantly different performance compared to the BL. The Nemenyi test is still conservative, and we therefore compare all classifiers to a single control classifier in the next paragraphs. Note that we make only k − 1 comparisons in that situation, resulting in stronger results compared to the Nemenyi test.

When comparing a single classifier i with a control classifier c, the null-hypothesis states that the two classifiers i and c perform equally well. Under this hypothesis, the test statistic (based on the average ranks R_i shown in Table 5)

$$z = \frac{R_i - R_c}{\sqrt{\frac{k(k+1)}{6N}}} \qquad (8)$$

is distributed according to a standard normal distribution. This z-value is used to find the corresponding probability (p-value) p from the cumulative standard normal

24 The larger the number of comparisons, the higher the proportion of null-hypotheses that are wrongfully rejected due to random chance.


Table 5 Final comparison of all methods in terms of AUC test set performance; see Section 4.2 for details on the experimental set-up. β ≠ 0 and βu ≠ 0 for the oversampling respectively undersampling techniques; μ = 100% for each of the boosting variants. Standard deviations are given between brackets, and the performance difference with the BL is shown in square brackets.

              Mov G(p = 1)         Mov G(p = 25)        Mov Th(p = [])       Yahoo A(p = 1)
BL            71.6 (2.62) [0]      81.41 (1.32) [0]     79.77 (5.33) [0]     55.92 (2.97) [0]
OSR           75.35 (2.27) [3.8]   83.76 (2.09) [2.3]   85.13 (6.1) [5.4]    60.05 (2.71) [4.1]
SMOTE         76.16 (2.27) [4.6]   83.7 (2.1) [2.3]     85.67 (4.98) [5.9]   60.1 (3) [4.2]
ADASYN        76.07 (2.26) [4.5]   83.63 (2.04) [2.2]   85.65 (5.6) [5.9]    59.9 (2.99) [4]
RUS           72.88 (2.73) [1.3]   81.52 (2.15) [0.1]   82.91 (7.19) [3.1]   57.04 (1.77) [1.1]
Cl Knn        71.43 (1.36) [-0.2]  80.88 (1.19) [-0.5]  78.87 (4.71) [-0.9]  55.78 (2.71) [-0.1]
Far Knn       71.9 (2.95) [0.3]    80.9 (1.48) [-0.5]   84.07 (4.64) [4.3]   57.2 (1.33) [1.3]
CBU           74.17 (2.36) [2.6]   81.51 (1.04) [0.1]   82.76 (7.22) [3]     58.77 (3.43) [2.8]
AB            71.65 (1.73) [0.1]   84.52 (1.89) [3.1]   82.43 (5.18) [2.7]   58.35 (2.62) [2.4]
AC(R = 28)    71.61 (2.46) [0]     83.46 (1.82) [2]     83.27 (5.6) [3.5]    57.72 (2.47) [1.8]
AC(R = R_L)   74.65 (2.7) [3.1]    83.35 (2.09) [1.9]   85.41 (4.49) [5.6]   59.47 (2.33) [3.5]
EE(S = 10)    76.04 (2.66) [4.4]   85.05 (1.85) [3.6]   86.1 (5.78) [6.3]    59.66 (3.13) [3.7]
EE(S = 15)    76.12 (2.88) [4.5]   85.14 (1.86) [3.7]   86.42 (5.86) [6.7]   59.76 (2.93) [3.8]

              Yahoo A(p = 25)      Yahoo G(p = 1)       Yahoo G(p = 25)      TaFeng(p = 1)
BL            61.68 (2.42) [0]     66.84 (3.66) [0]     78.82 (1.39) [0]     55.75 (1.6) [0]
OSR           64.59 (3.12) [2.9]   73.08 (2.96) [6.2]   78.52 (2.01) [-0.3]  61.21 (2.24) [5.5]
SMOTE         65.56 (3.33) [3.9]   73.11 (3.12) [6.3]   79.01 (1.21) [0.2]   61.72 (1.81) [6]
ADASYN        65.13 (3.38) [3.4]   73.22 (3.17) [6.4]   79.74 (1.68) [0.9]   61.68 (1.86) [5.9]
RUS           64.11 (2.8) [2.4]    70.65 (3.39) [3.8]   78.91 (1.55) [0.1]   59.25 (2.18) [3.5]
Cl Knn        61.14 (2.13) [-0.5]  66.34 (3.54) [-0.5]  77.26 (1.46) [-1.6]  55.77 (1.28) [0]
Far Knn       63.96 (3.03) [2.3]   66.97 (3.54) [0.1]   78.26 (2.2) [-0.6]   59.98 (1.26) [4.2]
CBU           62.27 (1.79) [0.6]   71.27 (2.89) [4.4]   75.22 (2.42) [-3.6]  58.4 (1.57) [2.6]
AB            63.88 (2.67) [2.2]   68.9 (2.03) [2.1]    79.01 (1.66) [0.2]   56.21 (1.79) [0.5]
AC(R = 28)    64.32 (3.56) [2.6]   68.89 (3.11) [2]     78.99 (1.89) [0.2]   56.33 (1.83) [0.6]
AC(R = R_L)   64.31 (3.03) [2.6]   73.13 (2.8) [6.3]    78.41 (2) [-0.4]     61.6 (2.26) [5.9]
EE(S = 10)    66.51 (3.24) [4.8]   72.61 (3.15) [5.8]   80.52 (1.6) [1.7]    61.2 (1.82) [5.4]
EE(S = 15)    66.36 (3.18) [4.7]   73.48 (2.32) [6.6]   80.54 (1.56) [1.7]   61.13 (1.83) [5.4]

              TaFeng(p = 25)       Book(p = 1)          Book(p = 25)         LST(p = 1)
BL            66.94 (1.34) [0]     52.6 (1.29) [0]      60.08 (0.71) [0]     99.99 (0.01) [0]
OSR           68.77 (1.23) [1.8]   55.87 (1.42) [3.3]   64.62 (0.57) [4.5]   99.99 (0.01) [0]
SMOTE         68.47 (1.5) [1.5]    55.07 (0.88) [2.5]   62.96 (0.82) [2.9]   99.99 (0.01) [0]
ADASYN        68.48 (1.47) [1.5]   55.04 (0.91) [2.4]   63.02 (0.57) [2.9]   99.99 (0.01) [0]
RUS           68.28 (1.39) [1.3]   54.26 (0.92) [1.7]   63.28 (0.8) [3.2]    99.98 (0.01) [0]
Cl Knn        66.13 (1.43) [-0.8]  52.69 (1.3) [0.1]    60.02 (0.79) [-0.1]  99.99 (0.01) [0]
Far Knn       68.06 (1.41) [1.1]   56.25 (1.52) [3.7]   64.15 (1.12) [4.1]   99.98 (0.01) [0]
CBU           63.84 (1.07) [-3.1]  53.75 (1.01) [1.2]   54.68 (0.88) [-5.4]  []
AB            67.65 (1.55) [0.7]   54.27 (1.95) [1.7]   65 (0.67) [4.9]      99.99 (0.01) [0]
AC(R = 28)    69.31 (1.23) [2.4]   53.72 (1) [1.1]      61.24 (0.8) [1.2]    99.98 (0.01) [0]
AC(R = R_L)   67.15 (1.51) [0.2]   55.73 (1.22) [3.1]   64.6 (0.64) [4.5]    99.99 (0.01) [0]
EE(S = 10)    70.3 (1.35) [3.4]    55.09 (1.29) [2.5]   65.37 (0.61) [5.3]   99.98 (0.01) [0]
EE(S = 15)    70.4 (1.3) [3.5]     55.35 (1.26) [2.8]   65.4 (0.51) [5.3]    99.98 (0.01) [0]


Table 5 Continued. Additionally, an average rank column is added, showing the mean rank of each algorithm across all datasets. Note that we excluded the LST dataset for this purpose.

              Adver(p = [])        Adver(p = 1)         CRF(p = [])           Bank(p = [])
BL            96.61 (1.82) [0]     90.93 (3.02) [0]     64.06 (16.43) [0]     66.82 (0.88) [0]
OSR           96.93 (1.91) [0.3]   93.3 (2.02) [2.4]    80.74 (12.93) [16.7]  71.39 (0.79) [4.6]
SMOTE         97.05 (1.66) [0.4]   93.35 (2.01) [2.4]   78.7 (16.56) [14.6]   []
ADASYN        96.91 (1.95) [0.3]   93.46 (2.21) [2.5]   78.87 (16.71) [14.8]  []
RUS           96.81 (1.87) [0.2]   92.38 (2.51) [1.5]   83.98 (5.99) [19.9]   69.41 (1.19) [2.6]
Cl Knn        96.4 (1.48) [-0.2]   89.73 (3.42) [-1.2]  76.63 (16.19) [12.6]  66.17 (0.72) [-0.6]
Far Knn       95.77 (1.81) [-0.8]  93.88 (1.78) [3]     83.75 (13.11) [19.7]  66.95 (0.56) [0.1]
CBU           97.15 (1.88) [0.5]   94.18 (2.3) [3.3]    []                    []
AB            97.34 (2.18) [0.7]   91.39 (3.23) [0.5]   77.62 (15.15) [13.6]  66.82 (0.88) [0]
AC(R = 28)    97.44 (1.93) [0.8]   91 (3.35) [0.1]      68.31 (14.93) [4.2]   67.67 (0.71) [0.9]
AC(R = R_L)   97.46 (1.71) [0.8]   93.51 (2.17) [2.6]   85.08 (9.77) [21]     70.7 (0.8) [3.9]
EE(S = 10)    97.64 (1.35) [1]     92.97 (2.75) [2]     86.18 (10.17) [22.1]  71.46 (0.81) [4.6]
EE(S = 15)    97.63 (1.35) [1]     93.3 (2.14) [2.4]    86.35 (9.99) [22.3]   71.54 (0.76) [4.7]

              Average Rank
BL            11.600
OSR           5.000
SMOTE         4.533
ADASYN        4.800
RUS           8.167
Cl Knn        12.467
Far Knn       8.133
CBU           8.567
AB            8.267
AC(R = 28)    8.467
AC(R = R_L)   5.400
EE(S = 10)    3.267
EE(S = 15)    2.333

Table 6 Nemenyi test outcomes at the α = 0.05 significance level. A value of 1 indicates that the null-hypothesis is rejected, and thus that the two algorithms are significantly different (a value of 0 means we accept the null-hypothesis). Methods are listed in the same order as in Table 5 and are represented more concisely.

      BL  RO  SM  AD  RU  Cl  Fa  CBU  AB  AC1  AC2  EE1  EE2
BL     0   1   1   1   0   0   0   0    0   0    1    1    1
RO     1   0   0   0   0   1   0   0    0   0    0    0    0
SM     1   0   0   0   0   1   0   0    0   0    0    0    0
AD     1   0   0   0   0   1   0   0    0   0    0    0    0
RU     0   0   0   0   0   0   0   0    0   0    0    1    1
Cl     0   1   1   1   0   0   0   0    0   0    1    1    1
Fa     0   0   0   0   0   0   0   0    0   0    0    1    1
CBU    0   0   0   0   0   0   0   0    0   0    0    1    1
AB     0   0   0   0   0   0   0   0    0   0    0    1    1
AC1    0   0   0   0   0   0   0   0    0   0    0    1    1
AC2    1   0   0   0   0   1   0   0    0   0    0    0    0
EE1    1   0   0   0   1   1   1   1    1   1    0    0    0
EE2    1   0   0   0   1   1   1   1    1   1    0    0    0


distribution function: p = 2 × min(Φ(z), 1 − Φ(z)). Holm's method (Holm 1979) compares k − 1 classifiers to a single control classifier and sorts the corresponding p-values in ascending order, so that p_1 ≤ p_2 ≤ ... ≤ p_{k−1}. Each p_i is subsequently compared to its associated confidence level α_comp = α/(k − i). Holm starts with performing the check p_1 < α/(k − 1) and rejects the null-hypothesis if the test passes. It then proceeds with a similar check for i = 2, and continues until a certain null-hypothesis cannot be rejected. The remaining hypotheses are retained.
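The z statistic of Eq. (8) and the two-sided p-value can be sketched as follows; plugging in the Table 5 average ranks for EE(S = 15) and Cl Knn against the BL control reproduces the corresponding entries of Table 7 up to rank rounding:

```python
import math

def holm_vs_control(avg_ranks, control, k, n):
    """z statistic of Eq. (8) and two-sided p-value for each classifier
    versus a control; returned sorted by ascending p, the order Holm's
    step-down procedure uses. k and n are the full numbers of
    algorithms and datasets (the dict may hold only a subset)."""
    se = math.sqrt(k * (k + 1) / (6 * n))
    rows = []
    for name, r in avg_ranks.items():
        if name == control:
            continue
        z = (r - avg_ranks[control]) / se
        phi = 0.5 * (1 + math.erf(z / math.sqrt(2)))  # standard normal CDF
        rows.append((name, z, 2 * min(phi, 1 - phi)))
    return sorted(rows, key=lambda t: t[2])

# two Table 5 average ranks against the BL control (k = 13, N = 15)
ranks = {"BL": 11.600, "EE(S=15)": 2.333, "Cl Knn": 12.467}
for name, z, p in holm_vs_control(ranks, "BL", k=13, n=15):
    print(f"{name}: z = {z:.3f}, p = {p:.2e}")
```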

Table 7 shows the result of the Holm test at the α = 0.05 significance level, with the BL algorithm as control classifier. The table lists each method according to sorted p-values and shows the z-, p- and α_comp-values. The significance column indicates whether the proposed method is significantly different from the BL and coincidentally matches the result of the Nemenyi test. The p-values of the (Far Knn, RUS, CBU) undersampling methods, AB and AC (R = 28) are only slightly higher than the level required to reject the null-hypothesis. We included a critical significance level α_crit, corresponding to the lowest possible significance level at which the method would be considered significantly different from the BL (if α = α_crit, then p = α_comp). In the case of AB, for example, if we were to choose a significance level α = 0.0764 (a confidence level of 92.36%), the associated p-value would be smaller than α_comp and we would proceed to conclude that AB performs significantly different from the BL. To summarize: the null-hypothesis of equal performance between the BL and an alternative method proposed in Section 3 (except Cl Knn) is rejected with a confidence level of more than 91% according to Holm's test. Hence, the performance of the BL approach is significantly worse than that of each of the proposed over- and undersampling methods (except Cl Knn), cost-sensitive learning techniques (AC) and boosting variants (AB, EE).

Similar to the previous paragraph, Table 8 presents the results of Holm's test at the α = 0.05 significance level, with EE(S = 15) as control classifier. The results coincidentally correspond with the result of the Nemenyi test and indicate that the oversampling techniques (OSR, SMOTE and ADASYN), AC(R = R_L) and EE(S = 10) are not statistically different from the EE(S = 15) technique at the 0.05 significance level. However, if we were to adapt the significance level to approximately α = 0.25 (a 75% confidence level), the results would conclude significance (except for EE(S = 10)). To summarize: the null-hypothesis of equal performance between EE(S = 15) and an alternative method proposed in Section 3 (except EE(S = 10)) is rejected with a confidence level of more than 75% according to Holm's test. Hence, the performance of the oversampling, undersampling, AB and AC approaches is significantly worse than that of EE(S = 15). We can therefore safely conclude EE to be the preferred method, in terms of AUC performance, when dealing with imbalanced behaviour data.

25 α_comp adjusts the value of α to compensate for multiple comparisons.

26 This is not entirely true, since Holm's method requires Far Knn to pass the test first before proceeding, which only occurs at the 0.088662 significance level.


Table 7 Holm test at the α = 0.05 significance level with BL as reference. The table shows the z test statistic with associated p-value, and α_comp = α/(k − i), with k the number of algorithms (k = 13) and i corresponding to the position of the method in the sorted p-value vector. Since we already ranked the p-values in ascending order, i takes on the row number (e.g. OSR has i = 5). The column "significant" denotes whether we can reject the null-hypothesis (significant: p < α_comp). α_crit corresponds with the smallest possible significance level at which we would decide to reject the null-hypothesis (α_crit = α·p/α_comp).

              z          p          αcomp      significant  αcrit
EE(S = 15)    -6.51642   7.2E-11    0.004167   1            8.64E-10
EE(S = 10)    -5.86009   4.63E-09   0.004545   1            5.09E-08
SMOTE         -4.96936   6.72E-07   0.005      1            6.72E-06
ADASYN        -4.78183   1.74E-06   0.005556   1            1.56E-05
OSR           -4.64119   3.46E-06   0.00625    1            2.77E-05
AC(R = R_L)   -4.35991   1.3E-05    0.007143   1            9.11E-05
Far Knn       -2.4378    0.014777   0.008333   0            0.088662
RUS           -2.41436   0.015763   0.01       0            0.078815
AB            -2.34404   0.019076   0.0125     0            0.076305
AC(R = 28)    -2.20339   0.027567   0.016667   0            0.082701
CBU           -2.13307   0.032919   0.025      0            0.065837
Cl Knn        0.609449   0.542227   0.05       0            0.542227

Table 8 Holm test at the α = 0.05 significance level with EE(S = 15) as reference.

              z          p          αcomp      significant  αcrit
Cl Knn        7.12587    1.03E-12   0.004167   1            1.24E-11
BL            6.516421   7.2E-11    0.004545   1            7.92E-10
CBU           4.383348   1.17E-05   0.005      1            0.000117
AC(R = 28)    4.313027   1.61E-05   0.005556   1            0.000145
AB            4.172384   3.01E-05   0.00625    1            0.000241
RUS           4.102063   4.09E-05   0.007143   1            0.000287
Far Knn       4.078623   4.53E-05   0.008333   1            0.000272
AC(R = R_L)   2.156513   0.031044   0.01       0            0.155218
OSR           1.875229   0.060761   0.0125     0            0.243045
ADASYN        1.734587   0.082814   0.016667   0            0.248442
SMOTE         1.547064   0.121848   0.025      0            0.243696
EE(S = 10)    0.65633    0.511612   0.05       0            0.511612

4.6.2 Timing-wise comparison

The performance of each of the proposed algorithms in terms of training times depends highly upon the chosen parameters. The β and βu parameters control the number of training instances and are an important contribution to the subsequent SVM learning times for each of the over- and undersampling techniques. The number of boosting iterations t ∈ [0, T] is also a major factor in each of the boosting variants proposed in Section 3.3. Furthermore, the chosen regularization parameter C highly influences the training times of the SVM inducer: higher C-values result in larger training times (Fan et al 2008). Not only do the chosen hyperparameters affect learning times; data characteristics such as the number of minority instances, the imbalance level, the number of features, the amount of overlapping, etc. have a major effect as well.

Imbalanced classification in sparse and large behaviour datasets 35

In comparing each of the methods outlined in Section 3, we make use of a similar methodology as previously presented (see Section 4.6.1). For each of the ten folds, a suitable parameter combination is selected based on optimal validation set AUC-performance. These chosen parameters are applied in determining the timings of each method. The final results indicate the average timings across ten folds. Hence, we show the required amount of time for each method when using the parameter combination resulting in the best performance. Note that this might, for instance, result in using β = 1/3 in case of OSR, or using only t = 5 boosting iterations in EE. Computational timings are shown in Table 9.

As one can observe from the aforementioned table, CBU is the slowest method, as it relies on computationally intensive clustering procedures. The synthetic SMOTE and ADASYN oversampling approaches are also very time consuming: they both rely on computationally expensive nearest neighbour computations, in conjunction with a synthetic sample generation procedure that requires a single pass over all minority training instances.

The RUS method shows, as expected, superior results in terms of computational timings. The other undersampling methods perform quite well, though they are slower than BL and OSR due to their inherent nearest neighbour computations.

The boosting variants seem to perform worse than BL, OSR and each of the undersampling methods. Since EE was the best method in terms of AUC-performance (see Section 4.6.1), we focus on this learner in this paragraph. For the medium sized datasets (all but CRF and Bank), OSR outperforms EE(S = 15), yet for the large datasets (CRF and Bank) the opposite is true. An SVM is known to have a quadratic computational complexity in the number of training instances (Suykens et al 2002; García and Lozano 2007). For this reason, we consider OSR (or any other oversampling technique) inappropriate for large datasets. Since with EE each subset has a size limited to twice the minority class training size, we expect increasing benefits of the EE-technique for larger and more imbalanced behaviour data. As Junqué de Fortuny et al (2014a) have observed, larger behaviour data, in terms of the number of instances or features, also contributes to an increase in predictive power. Hence, we expect to see this type of big behaviour data more often in real-life applications, where computational resources can become a major constraint. Because the EE-technique is easily parallelizable, by simultaneously performing the boosting process for each of the independent subsets, we added an extra row in Table 9: EE par corresponds with the time needed for EE(S = 15) divided by 15, indicating the time that would be required for EE if each subset were trained in parallel. We can see that EE par can outperform OSR in 8 out of 16 cases, precisely for the datasets that show a high imbalance level (e.g. p = 1 or p = []).
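The subset construction that keeps EE cheap and parallelizable can be sketched as follows; `fit_base` is a hypothetical stand-in for the boosted SVM/LR weak learner used in the paper (a minimal sketch under that assumption, not the authors' implementation):

```python
import random

def easy_ensemble_fit(majority, minority, fit_base, S=15, seed=0):
    """Each of the S subsets combines ALL minority instances with an equally
    sized random sample of the majority class, so every subset has roughly
    twice the minority class size. The S fits are independent of each other,
    hence trivially parallelizable."""
    rng = random.Random(seed)
    models = []
    for _ in range(S):
        maj_sample = rng.sample(majority, len(minority))
        models.append(fit_base(minority + maj_sample))
    return models

def easy_ensemble_score(models, x):
    # final score = average of the subset hypotheses' scores
    return sum(m(x) for m in models) / len(models)
```

Because each fit only ever sees about twice the minority class size, the quadratic SVM training cost is bounded regardless of how large the majority class grows.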

We conclude our comparison of the proposed methods in Figure 5, which shows the average ranks in terms of computational timings (see Table 9) and AUC-performance for each of the available data sources. Regarding the ranks with respect to the latter, we added an extra row EE par, having the same AUC-performance as EE(S = 15) for each dataset in Table 5. Ranks are subsequently determined according to the procedure outlined in Section 4.6.1. Additionally, Figure 18 in Appendix D shows the result for the large and highly imbalanced CRF and Bank datasets. We expect the latter type of data to be of major interest for reasons already indicated.

36 Jellis Vanhoeyveld David Martens

Table 9 Computational timings (in seconds) averaged across ten folds, using the parameter combination resulting in the highest validation set AUC-performance. First and second best performances are emphasized in boldface (third and fourth best are underlined). EE par is the time required for EE(S = 15) divided by 15.

Mov G(p = 1)  Mov G(p = 25)  Mov Th(p = [])  Yahoo A(p = 1)

BL 0.032889 0.056697 0.558563 0.026922
OSR 0.055043 0.062802 0.99009 0.044421
SMOTE 0.218821 0.937057 3.841482 0.057726
ADASYN 0.284688 1.802399 5.191265 0.087694
RUS 0.011431 0.025383 0.155224 0.007991
CL Knn 0.046599 0.599846 0.989914 0.037182
Far Knn 0.039887 0.80072 0.683023 0.027788
CBU 1.034111 1.060173 6.822839 1.692477
AB 0.169792 0.841443 3.460246 0.139251
AC(R = 28) 0.471994 2.996585 10.86907 0.366555
AC(R = RL) 0.53376 1.179542 6.065177 0.209015
EE(S = 10) 0.117226 6.065145 1.17995 0.148973
EE(S = 15) 0.20474 7.173737 2.119991 0.180365
EE par 0.013649 0.478249 0.141333 0.012024

Yahoo A(p = 25)  Yahoo G(p = 1)  Yahoo G(p = 25)  TaFeng(p = 1)

BL 0.092954 0.011915 0.044164 0.026728
OSR 0.027887 0.013241 0.047206 0.040919
SMOTE 1.062686 0.056153 0.883698 0.219553
ADASYN 2.050993 0.079073 1.733367 0.306618
RUS 0.048471 0.003234 0.033423 0.002916
CL Knn 0.84391 0.025404 0.502515 0.092167
Far Knn 0.664124 0.026576 0.500206 0.080159
CBU 1.569442 1.287221 1.355035 2.467279
AB 0.445546 0.078777 0.169977 0.114619
AC(R = 28) 1.034044 0.321723 0.515953 0.926178
AC(R = RL) 0.706215 0.226741 0.112949 0.610233
EE(S = 10) 1.026577 0.100331 1.527146 0.058052
EE(S = 15) 1.607596 0.077483 2.472582 0.10538
EE par 0.107173 0.005166 0.164839 0.007025

TaFeng(p = 25)  Book(p = 1)  Book(p = 25)  LST(p = 1)

BL 0.032033 0.080035 0.318093 0.652045
OSR 0.032414 0.132927 0.092757 0.87152
SMOTE 5.089283 3.409418 11.43444 4.987705
ADASYN 8.148419 3.689661 12.25441 6.840083
RUS 0.020457 0.022713 0.031972 0.432839
CL Knn 1.713731 0.400873 3.711648 2.508374
Far Knn 1.539437 0.379086 3.988552 2.511037
CBU 2.642686 4.198663 4.631987 []
AB 0.713265 0.61719 1.238585 2.466151
AC(R = 28) 1.234647 1.666131 2.330635 14.51671
AC(R = RL) 0.279047 0.860346 0.197053 1.23763
EE(S = 10) 2.484502 2.145747 7.177484 0.524066
EE(S = 15) 3.363971 2.480066 11.21945 0.784111
EE par 0.224265 0.165338 0.747963 0.052274


Table 9 Continued. Additionally, an extra column indicating average ranks is included (lower ranks are preferred).

Adver(p = [])  Adver(p = 1)  CRF(p = [])  Bank(p = [])

BL 0.010953 0.002796 0.725911 7.089334
OSR 0.012178 0.006166 3.685813 179.7481
SMOTE 0.123112 0.017764 5.633862 []
ADASYN 0.183767 0.021728 5.768669 []
RUS 0.012115 0.00204 0.147392 5.247441
CL Knn 0.061324 0.005568 1.106755 7.373282
Far Knn 0.079078 0.007069 1.110379 9.759619
CBU 3.378235 3.236754 [] []
AB 0.069199 0.103518 1.153196 8.308618
AC(R = 28) 0.193092 0.068905 2.047434 7.170548
AC(R = RL) 0.107652 0.037963 1.387174 1.063466
EE(S = 10) 0.138485 0.085686 0.198656 24.95117
EE(S = 15) 0.185136 0.139121 0.285345 36.40107
EE par 0.012342 0.009275 0.019023 2.426738

Average Rank [pos]

BL 2.94 [2]
OSR 4.19 [4]
SMOTE 9.59 [11]
ADASYN 10.91 [13]
RUS 1.38 [1]
CL Knn 6.5 [5]
Far Knn 6.56 [6]
CBU 14 [14]
AB 8.06 [7]
AC(R = 28) 10.81 [12]
AC(R = RL) 9.25 [9]
EE(S = 10) 8.25 [8]
EE(S = 15) 9.56 [10]
EE par 3 [3]

Timings are a substantial concern with respect to these large datasets. As can be observed, the EE-technique thrives for these big and highly imbalanced data, even in its non-parallel form.



Fig 5 Average rank AUC versus average rank Time (see Table 9) across all datasets from Table 2 (excluding Flickr and Kdd). Regarding the AUC rank, an extra row EE par is added, having the same AUC-value as EE(S = 15). Ranks for each dataset are subsequently obtained via the procedure outlined in Section 4.6.1. Points occurring in the upper-right region are preferred.

5 The effect of the chosen base learner

In this short section we investigate the influence of the chosen base learner. More specifically, we ask ourselves whether some base inducers can more easily cope with the imbalanced learning issue, and in doing so we verify the conclusion that EE is a suitable technique to elevate upon the baseline performance. We rely on several popular classification methods for behaviour data: two regularization based techniques (SVM, LR) and two heuristic types of approaches (Naive Bayes (NB), Behavioural Similarity (BeSim)). The next paragraphs briefly introduce these methods. Note that source codes and dataset results are provided in our on-line repository.

LR constructs a linear model that maximizes the likelihood of the observed data by imposing a logistic model for the parametric probability model P(Y|X). Ng and Jordan (2002) provide a thorough introduction to the topic27 and note that, in its plain form, LR can suffer from overfitting, especially when the input data is very high-dimensional. Therefore, we resort to regularized logistic regression. We opt for L2-regularized LR, since L1-regularization corresponds with a natural way of imposing feature selection (Ng 2004). The LIBLINEAR toolbox (Fan et al 2008) is used to obtain these L2-regularized LR models.
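To make the role of the regularization term concrete, the following is a didactic gradient-descent sketch of the L2-regularized LR objective that LIBLINEAR minimizes, 0.5·||w||² + C·Σ log(1 + exp(−y·(w·x + b))); it is an illustration, not the LIBLINEAR solver itself:

```python
import math

def fit_l2_logreg(X, y, C=1.0, lr=0.1, epochs=200):
    """Minimal L2-regularized logistic regression via full-batch gradient
    descent. Labels y are in {-1, +1}; larger C weakens the regularization."""
    n_feat = len(X[0])
    w = [0.0] * n_feat
    b = 0.0
    for _ in range(epochs):
        gw, gb = list(w), 0.0              # gradient of 0.5*||w||^2 is w
        for xi, yi in zip(X, y):
            z = sum(wj * xj for wj, xj in zip(w, xi)) + b
            p = 1.0 / (1.0 + math.exp(-yi * z))
            coef = -C * yi * (1.0 - p)     # derivative of C*log(1+exp(-y z))
            for j, xj in enumerate(xi):
                gw[j] += coef * xj
            gb += coef
        w = [wj - lr * gj for wj, gj in zip(w, gw)]
        b -= lr * gb
    return w, b
```

On a toy one-dimensional problem the fitted model separates the classes with a small-norm weight vector; with high-dimensional behaviour data, the 0.5·||w||² term is what keeps plain LR from overfitting.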

27 For an accessible introduction, see the chapter on "Generative and Discriminative Classifiers: Naive Bayes and Logistic Regression" provided at the following link: http://www.cs.cmu.edu/~tom/NewChapters.html


NB (Ng and Jordan 2002) relies on the use of Bayes' rule, which expresses the posterior P(Y|X) as the product of the likelihood P(X|Y) and the prior P(Y), divided by the evidence P(X). The naivety of the approach originates from assuming conditional independence of the features. We impose a multivariate event model regarding the distribution P(X|Y). Junqué de Fortuny et al (2014a) provide an efficient NB implementation for dealing with large and sparse binary behaviour data.
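A minimal sketch of such a multivariate (Bernoulli) event model for sparse binary behaviour data, where each instance is represented by the list of its active feature indices; this is an illustration of the model, not the optimized implementation referenced above:

```python
import math
from collections import defaultdict

def fit_bernoulli_nb(docs, labels, n_features, alpha=1.0):
    """docs: list of active-feature-index lists (sparse binary rows).
    Estimates, per class, the prior P(Y=c) and Laplace-smoothed
    Bernoulli parameters P(X_f = 1 | Y = c)."""
    counts = {c: defaultdict(int) for c in set(labels)}
    n_c = defaultdict(int)
    for feats, c in zip(docs, labels):
        n_c[c] += 1
        for f in feats:
            counts[c][f] += 1
    model = {}
    for c in n_c:
        prior = math.log(n_c[c] / len(labels))
        p_on = [(counts[c][f] + alpha) / (n_c[c] + 2 * alpha)
                for f in range(n_features)]
        model[c] = (prior, p_on)
    return model

def nb_log_posterior(model, feats, n_features):
    """Unnormalized log-posterior per class; inactive features contribute
    log(1 - p), which is what makes the model 'multivariate' Bernoulli."""
    feats = set(feats)
    scores = {}
    for c, (prior, p_on) in model.items():
        s = prior
        for f in range(n_features):
            p = p_on[f]
            s += math.log(p if f in feats else 1.0 - p)
        scores[c] = s
    return scores
```

Working in log-space avoids underflow when the feature dimension is very large, which is the regime behaviour data lives in.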

The BeSim approach conceptually corresponds to projecting the bigraph to a unigraph using the procedure outlined in Section 3.2. Next, the weighted-vote relational neighbour classifier (Macskassy and Provost 2007) is used to infer unknown labels/scores through a weighted probability estimation using the known labels of neighbouring nodes. A very fast and scalable implementation of this approach (the SW-transformation) is presented in detail in Stankova et al (2015).
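Conceptually, the two steps can be sketched as follows: project the bigraph onto the instances, then let each unlabelled node take a weighted vote over its labelled neighbours. The shared-item edge weight is a simplifying assumption of this sketch, not necessarily the weighting used by the SW-transformation:

```python
def project_bigraph(memberships):
    """Project a bipartite behaviour graph (instance -> set of items) onto
    the instances: two instances are linked with weight = number of shared
    items (a simple illustrative weighting)."""
    adj = {u: {} for u in memberships}
    nodes = list(memberships)
    for i, u in enumerate(nodes):
        for v in nodes[i + 1:]:
            w = len(memberships[u] & memberships[v])
            if w:
                adj[u][v] = w
                adj[v][u] = w
    return adj

def wvrn_score(adj, labels, node):
    """Weighted-vote relational neighbour: score of an unlabelled node is
    the weighted fraction of positively labelled (0/1) neighbours."""
    num = den = 0.0
    for nb, w in adj[node].items():
        if nb in labels:
            num += w * labels[nb]
            den += w
    return num / den if den else 0.5   # prior fallback: no labelled neighbours
```

The materialized projection shown here is quadratic in the number of instances; the point of the SW-transformation is precisely to obtain these scores without building the projected unigraph explicitly.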

In Table 10 we compare the performances of the baseline (BL) application of the methods with their EE-version (with S = 15, T = 20). It is quite striking to see the EE-version dominating the baseline version when looking at each method individually. This suggests that the imbalanced learning issue has an impact on each of the aforementioned techniques. Focussing on the regularization based approaches, we observe the baseline LR to be less influenced by the imbalance than the SVM method. Their EE-versions (EE SVM and EE LR) can boost the baseline performance in almost all cases and have the highest average ranks amongst all the proposed methods. As already emphasized, the regularization based approaches offer an added element of flexibility, in the sense that the strength of the underlying base learner can be controlled.

The heuristic approaches NB and BeSim do not offer this flexibility. Furthermore, it is not directly possible to integrate instance weights during the training phase of the base learner. During the boosting process, we therefore sample from the distribution Dt to generate (unweighted) examples and train a BeSim/NB learner next. The results in Table 10 indicate BL NB to be the worst performing algorithm amongst all methods tried. Due to the conditional independence assumption, one could regard it as a weak learner, which makes it suitable to be used in a boosting process. Indeed, the EE NB approach is able to significantly improve upon its performance. We found this to be the slowest EE-version28 (e.g. taking more than 24 hours to process a single fold of the Bank dataset, versus several minutes for the other versions). The average rank estimate of EE NB is rather pessimistic, since we assigned the lowest performance across all methods in case of missing results. BeSim can be considered a stronger learner compared to NB; its possible boost in performance is typically lower than that of NB. Having said this, we found EE BeSim to improve upon BL BeSim for all datasets (except Adver(p = 1)).
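Sampling unweighted examples from the boosting distribution Dt can be done with cumulative-sum inversion, sketched below (a generic illustration of the resampling step, not the paper's code):

```python
import random

def resample_from_dist(instances, D, rng):
    """For base learners that cannot handle instance weights (e.g. NB,
    BeSim), draw an unweighted sample of the same size from the boosting
    distribution D_t via inversion of the cumulative sum."""
    cum, total = [], 0.0
    for d in D:
        total += d
        cum.append(total)
    sample = []
    for _ in range(len(instances)):
        r = rng.random() * total
        lo, hi = 0, len(cum) - 1
        while lo < hi:                 # binary search for the hit segment
            mid = (lo + hi) // 2
            if cum[mid] < r:
                lo = mid + 1
            else:
                hi = mid
        sample.append(instances[lo])
    return sample
```

Heavily weighted (hard) instances are then simply duplicated in the resampled training set, which approximates weighted training for weight-free learners.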

28 This statement only relates to the EE-version. In the baseline case, the timings of the NB-version are roughly of the same order of magnitude as SVM and LR (though this heavily depends on the chosen regularization parameter C). With EE, each subset has a size limited to twice the minority class training size (a relatively low number compared to the size of the majority class). This works in favour of SVM/LR (e.g. SVM has a computational complexity of O(m²), with m the number of training instances). Also note that the NB-version is indeed optimized for sparse data. Internally, the NB-implementation writes a fairly large model file to a specified directory (sizes of 1 GB can be encountered there). This writing operation slows down the training stage (especially in case of EE, where S × T = 15 × 20 = 300 model files are constructed).


Table 10 Average tenfold AUC test set performance of the BL compared to its EE-version (S = 15, T = 20) for SVM, LR, BeSim and NB. In case the EE NB implementation crashed, we put a × in the table. If it is too slow (> 24 hours per fold), we place a []. Additionally, average ranks are assigned to each method. In case EE NB has missing results for a particular dataset, we ensure it has the lowest rank across all other algorithms.

Mov G(p = 1)  Mov G(p = 25)  Mov Th(p = [])  Yahoo A(p = 1)

BL SVM 71.6 (2.62) 81.41 (1.32) 79.77 (5.33) 56.49 (3.37)
EE SVM 76.12 (2.88) 85.13 (1.86) 86.43 (5.86) 59.74 (2.96)
BL LR 71.02 (2.09) 84.39 (1.84) 83.14 (4.17) 57.84 (2.39)
EE LR 76.69 (2.92) 85.03 (1.98) 86.3 (5.37) 59.79 (2.62)
BL BeSim 76.1 (3.58) 81.3 (2.92) 82.81 (6.6) 56.27 (2.73)
EE BeSim 76.31 (3.71) 81.37 (2.9) 85.02 (6.28) 57.7 (1.71)
BL NB 70.26 (5.84) 77.01 (2.54) 70.48 (10.14) 52.56 (2.09)
EE NB 75.93 (2.83) 85.56 (2.01) 86.91 (4.15) 57.55 (2.73)

Yahoo A(p = 25)  Yahoo G(p = 1)  Yahoo G(p = 25)  TaFeng(p = 1)

BL SVM 61.61 (2.48) 66.84 (3.66) 78.82 (1.39) 55.75 (1.6)
EE SVM 66.38 (3.16) 73.48 (2.32) 80.55 (1.55) 61.13 (1.83)
BL LR 66.27 (2.96) 69.82 (1.93) 80.45 (1.59) 58.91 (2.31)
EE LR 66.22 (3.28) 73.08 (2.14) 80.53 (1.56) 61.43 (2.32)
BL BeSim 64.54 (2.02) 68.89 (2.49) 79.55 (1.96) 57.89 (1.18)
EE BeSim 65.25 (2.23) 71.18 (2.91) 80.04 (1.85) 59.36 (1.47)
BL NB 65 (1.65) 63.33 (2.56) 78.89 (1.64) 54.61 (1.2)
EE NB 66.6 (2.79) 70.99 (2.88) 81.01 (1.3) 59.01 (1.84)

TaFeng(p = 25)  Book(p = 1)  Book(p = 25)  LST(p = 1)

BL SVM 66.94 (1.34) 52.6 (1.29) 60.08 (0.71) 99.99 (0.01)
EE SVM 70.4 (1.3) 55.34 (1.28) 65.4 (0.51) 99.98 (0.01)
BL LR 69.24 (1.3) 55.34 (1.27) 63.84 (0.75) 99.99 (0.01)
EE LR 70.28 (1.28) 55.49 (1.49) 65.41 (0.63) 99.97 (0.02)
BL BeSim 67.49 (1.23) 55.19 (1.27) 63.7 (0.63) 99.99 (0.01)
EE BeSim 68 (1.21) 55.21 (1.15) 64.38 (0.42) 99.99 (0)
BL NB 65.21 (1.64) 52.93 (0.9) 59.75 (0.47) 98.69 (0.3)
EE NB 70.72 (1.15) × 63.46 (0.61) 99.92 (0.04)

Adver(p = [])  Adver(p = 1)  CRF(p = [])  Bank(p = [])

BL SVM 96.37 (1.94) 91.18 (2.97) 64.36 (18.97) 66.82 (0.88)
EE SVM 97.63 (1.35) 93.3 (2.14) 86.35 (9.99) 71.54 (0.76)
BL LR 97.19 (1.44) 88.51 (1.93) 81.87 (19.63) 71.43 (0.72)
EE LR 97.57 (0.96) 93.02 (2.06) 86.84 (9.62) 71.77 (0.62)
BL BeSim 97.26 (1.12) 95.38 (1.35) 86.91 (9.36) 67.85 (0.67)
EE BeSim 97.38 (1.04) 93.83 (1.35) 87.02 (10.43) 70.41 (0.55)
BL NB 93.75 (1.9) 93.37 (1.9) 87.24 (9.38) 67.83 (0.63)
EE NB 94.04 (1.75) × × []

Flickr(p = 01)  Kdd(p = 05)  Average Rank

BL SVM 74.92 (0.17) 74.53 (0.05) 6.44 [7]
EE SVM 79.86 (0.13) 80.98 (0.05) 2.39 [1]
BL LR 79.03 (0.11) 81.29 (0.04) 4.28 [4]
EE LR 79.85 (0.13) 80.75 (0.05) 2.61 [2]
BL BeSim 74.62 (0.13) 74.95 (0) 5.11 [6]
EE BeSim 76.4 (0.13) 77.55 (0.03) 3.61 [3]
BL NB 81.36 (0.1) 74.29 (0.05) 6.5 [8]
EE NB [] [] 5.06 [5]


6 Conclusions and future research directions

We investigated the effects of sampling and cost-sensitive learning techniques on the problem of learning from imbalanced behaviour data. This type of big data has fundamentally different properties and internal structure than traditional data and is characterized by sparseness and very large dimensions. This is a new research topic, which enables benefits across a wide variety of application domains, such as fraud detection, targeted advertising, predictive policing, churn prediction, default prediction, etc. To enable future developments in this field, we provide an on-line repository containing implementations, datasets and results, available at the following link: http://www.applieddatamining.com/cms?q=software

Oversampling techniques need to be adapted in order to cope with this kind of data. We proposed new versions of the traditional SMOTE and ADASYN techniques, which differ in the way synthetic samples are generated and similarity computations are performed. The plain random oversampling (OSR) and the synthetic approaches are competitive methods in terms of AUC and do not seem to suffer from overfitting, as traditional studies report. Though the synthetic approaches have AUC-performances comparable to OSR, they are useless in practice due to their computationally expensive procedures. The OSR method is very fast for small and medium sized datasets, yet it becomes inappropriate for larger datasets because the training times of the underlying base learner increase drastically.

With respect to undersampling methodologies, we employed a random undersampling technique (RUS) and Knn-based informed approaches. The latter type selects majority class instances that are closest (Cl Knn) or farthest (Far Knn) from the minority class examples and allows one to investigate the effect of noise/outliers on SVM performance. We observed that these odd instances have a severe performance degrading effect. RUS and Far Knn approaches can significantly improve performance with respect to the baseline, thanks to their abilities to reduce noise/outlier levels and to put more emphasis on the minority class. Having said this, they usually perform worse than any of the oversampling approaches. This is in sheer contrast to the results obtained from traditional studies and is heavily related to the intrinsic nature of behaviour data. Timing-wise, RUS is the fastest method; the informed approaches are slower than RUS due to the required nearest neighbour computations. Additionally, a cluster-based undersampling (CBU) technique was proposed to alleviate the within-class imbalance. For low undersampling rates, the performance of CBU is superior to plain RUS, whereas for high undersampling rates the opposite is true. Overall, no substantial gain in performance is obtained in comparison with a simple RUS approach. Furthermore, the method is the slowest among all techniques considered and relies on computationally expensive clustering heuristics.

We studied the effects of boosting, where we made use of a very specific linear SVM and logistic regression combination that uses confidence rated predictions, instead of a more popular plain "discrete" boosting algorithm (with weak hypotheses having values in {−1,1}). Our experiments clearly indicated the regularization constant C in the SVM formulation to act as a "weakness" indicator. Indeed, higher C-values cause stronger learners and shouldn't be used in the boosting process, which matches the analysis of studies dealing with traditional data. Strong learners can


already quite accurately distinguish majority class from minority class behaviours. The boosting process will focus on the hard to learn instances, among which are noise/outliers, and this can result in overfitting behaviour. We also observed that boosting can usually outperform the plain baseline, indicating that a combination of weak learners can achieve higher AUC-performance than a single strong learner.

EasyEnsemble (EE) integrates random undersampling with a boosting process. By sampling several balanced subsets from the training data and combining these, we can emphasize the minority class, achieve noise/outlier reduction effects and simultaneously explore the majority class space efficiently. Our version differs from the original formulation because we employ a different type of underlying weak learner (an SVM and LR combination) that is fed to a confidence rated boosting algorithm (instead of the discrete version). We observed superior AUC-performance with respect to all of the aforementioned techniques. Furthermore, the method is very fast, since each subset can be trained in parallel and has a size that is only twice as large as the minority class training set. We studied the effect of the number of subsets and noted that a limited amount of these (O(10)) is already sufficient to obtain satisfactory performances; increasing this number further causes only minor benefits.

Additionally, a statistical comparison of each of the proposed methods was conducted across a wide variety of datasets. Holm's significance test indicates that each of the proposed techniques (oversampling, undersampling (except Cl Knn), cost-sensitive learning and boosting variants) can significantly outperform the plain baseline performance (each of the null-hypotheses of equal performance gets rejected with a confidence level of at least 91%). Comparison of the techniques with EE showed that the latter can significantly outperform all of them (each of the null-hypotheses of equal performance gets rejected with a confidence level of at least 75%). A timing-wise comparison showed EE to be the preferred method (after RUS) for large and highly imbalanced datasets. A parallel implementation would make EE the fastest method, even for medium sized datasets that show a high level of imbalance.

The influence of the chosen base learner was investigated by comparing the baseline to its EE-version for the SVM, LR, NB and BeSim inducers. Each of the EE-versions is able to improve upon the baseline performance for nearly all datasets, where base learners of the regularization type are superior because of their ability to control the strength of the learner.

Though we conducted an extensive benchmark study, there are still plenty of future research directions. First, and most important, there is an urgent need to develop a more general theoretical framework to completely understand the effect of imbalance and why certain methods outperform others. Secondly, other types of mechanisms could be investigated in this setting: advanced sampling (Guo et al 2008) set-ups perform sampling based on preliminary classifications; one-class learning is a recognition based methodology where a discriminative boundary is learned around the training examples of a single class alone, and Raskutti and Kowalczyk (2004) demonstrated the superiority of the recognition based one-class SVMs over the discriminative two-class SVMs; other types of ensemble learning strategies (e.g. bagging and stacking) could be investigated; Batista et al (2004) note that combining over- and undersampling can be beneficial when data are highly imbalanced or show very few minority class instances; the ideas behind Extended Nearest Neighbour (Tang and


He 2015) could guide us in the sampling process. As opposed to Knn, "it considers not only who are the nearest neighbours of a certain sample, but also who consider the particular sample as their nearest neighbour" (Tang and He 2015). The method assigns a class membership to an unknown sample that maximizes the intra-class coherence. Finally, more advanced techniques exist that can provide the K nearest neighbours faster or with (slightly) better performance.

Another issue relates to the lack of comprehensibility of the EasyEnsemble technique in its present form. Because we currently integrate Logistic Regression in the boosting process, the final ensemble classifier is non-linear and therefore incomprehensible. It might be interesting to investigate the numerical boosting technique proposed in Schapire and Singer (1999), where each weak learner is allowed to output real values (instead of values in [−1,1]). In that case, we would be able to use a plain linear SVM as a weak learner, without the Logistic Regression component. The final ensemble would be a linear model, which is surely easier to comprehend. Also, the instance-based approach of Martens and Provost (2014) could be used to explain individual predictions of the non-linear EE-technique.

A Oversampling experiments

Table 11 Oversampling experiments. Results show the average tenfold AUC test set performance with respect to increasing β-values (β = [β1, β2, β3, β4] = [0, 1/3, 2/3, 1]). Additionally, standard deviations are shown between brackets. Optimal performances for each method are highlighted in boldface. See Section 4.2 for parameter settings.

Mov G(p = 1)
β1 β2 β3 β4
OSR 71.6 (2.62) 74.37 (2.04) 73.6 (1.84) 74.73 (2.45)
SMOTE 71.6 (2.62) 75.08 (2.18) 76.02 (2.14) 76.48 (2.3)
ADASYN 71.6 (2.62) 75.16 (1.92) 75.93 (2.08) 76.47 (2.29)

Mov G(p = 25)
β1 β2 β3 β4
OSR 81.41 (1.32) 83.49 (1.81) 83.84 (1.96) 83.91 (2.04)
SMOTE 81.41 (1.32) 83.32 (1.97) 83.59 (2.04) 83.76 (2.11)
ADASYN 81.41 (1.32) 83.61 (1.82) 84.02 (1.97) 83.69 (1.96)

Mov Th(p = [])
β1 β2 β3 β4
OSR 79.77 (5.33) 85.3 (4.66) 83.16 (4.5) 84.59 (5.69)
SMOTE 79.77 (5.33) 84.18 (6.51) 85.58 (5.97) 84.33 (5.75)
ADASYN 79.77 (5.33) 84.11 (6.77) 85.86 (6.06) 85.36 (5.13)

Yahoo A(p = 1)
β1 β2 β3 β4
OSR 55.92 (2.97) 58.66 (3.27) 59.99 (2.28) 59.74 (1.78)
SMOTE 55.92 (2.97) 59.76 (2.62) 59.74 (2.67) 59.43 (2.4)
ADASYN 55.92 (2.97) 59.54 (2.53) 59.55 (2.94) 59.56 (2.22)

Yahoo A(p = 25)
β1 β2 β3 β4
OSR 61.68 (2.42) 64.19 (3.17) 65.08 (3.26) 64.67 (2.1)
SMOTE 61.68 (2.42) 65.46 (3.63) 65.33 (3.23) 64.52 (2.98)
ADASYN 61.68 (2.42) 65.04 (3.74) 65.41 (3.47) 64.4 (2.21)

Yahoo G(p = 1)
β1 β2 β3 β4
OSR 66.84 (3.66) 72.18 (2.36) 73.11 (2.7) 72.49 (3.41)
SMOTE 66.84 (3.66) 72.65 (2.85) 73.27 (3.36) 73.37 (3.56)
ADASYN 66.84 (3.66) 72.87 (2.83) 73.18 (3.2) 73.39 (3.59)

Yahoo G(p = 25)
β1 β2 β3 β4
OSR 78.82 (1.39) 78.78 (2.02) 78.74 (1.86) 78.35 (1.47)
SMOTE 78.82 (1.39) 79.23 (1.57) 79.1 (1.2) 79.03 (1.89)
ADASYN 78.82 (1.39) 79.12 (1.43) 79.23 (1.37) 79.51 (2.01)

TaFeng(p = 1)
β1 β2 β3 β4
OSR 55.75 (1.6) 59.23 (1.96) 60 (1.68) 61.04 (2.36)
SMOTE 55.75 (1.6) 60.26 (1.95) 61.49 (1.8) 61.13 (1.52)
ADASYN 55.75 (1.6) 60.26 (1.9) 61.44 (1.85) 61.16 (1.5)

TaFeng(p = 25)
β1 β2 β3 β4
OSR 66.94 (1.34) 68.84 (1.21) 66.99 (1.42) 67.7 (1.41)
SMOTE 66.94 (1.34) 68.47 (1.5) 67.07 (1.15) 66.65 (0.81)
ADASYN 66.94 (1.34) 68.62 (1.38) 67.85 (1.6) 66.91 (1.39)

Book(p = 1)
β1 β2 β3 β4
OSR 52.6 (1.29) 53.61 (0.94) 55.41 (1.75) 55.87 (1.44)
SMOTE 52.6 (1.29) 54.77 (0.99) 54.91 (0.8) 54.36 (0.98)
ADASYN 52.6 (1.29) 54.86 (1.13) 55.06 (0.73) 54.54 (0.92)

Book(p = 25)
β1 β2 β3 β4
OSR 60.08 (0.71) 61.05 (0.94) 61.82 (1.12) 64.62 (0.57)
SMOTE 60.08 (0.71) 62.6 (0.73) 60.95 (0.68) 63 (0.8)
ADASYN 60.08 (0.71) 62.33 (0.73) 60.77 (0.85) 63.04 (0.58)

LST(p = 1)
β1 β2 β3 β4
OSR 99.99 (0.01) 99.99 (0.01) 99.99 (0.01) 99.99 (0.01)
SMOTE 99.99 (0.01) 99.99 (0.01) 99.99 (0.01) 99.99 (0.01)
ADASYN 99.99 (0.01) 99.99 (0.01) 99.99 (0.01) 99.99 (0.01)

Adver(p = [])
β1 β2 β3 β4
OSR 96.61 (1.82) 97.31 (1.65) 97.07 (1.84) 97.07 (1.79)
SMOTE 96.61 (1.82) 96.91 (1.66) 97.19 (1.65) 97.07 (1.91)
ADASYN 96.61 (1.82) 97.1 (1.7) 97.08 (1.87) 97.07 (1.88)

Adver(p = 1)
β1 β2 β3 β4
OSR 90.93 (3.02) 91.27 (3.03) 92.66 (2.82) 93.29 (1.97)
SMOTE 90.93 (3.02) 92.51 (2.03) 92.96 (2.14) 93.53 (1.81)
ADASYN 90.93 (3.02) 92.22 (2.33) 92.7 (2.36) 93.88 (1.73)

CRF(p = [])
β1 β2 β3 β4
OSR 64.06 (16.43) 80.82 (12.94) 81.28 (12.27) 81.91 (11.28)
SMOTE 64.06 (16.43) 78.64 (16.86) 82.52 (13.74) 79.32 (16.26)
ADASYN 64.06 (16.43) 78.95 (16.72) 81.19 (16.32) 79.31 (16.19)

Bank(p = [])
β1 β2 β3 β4
OSR 66.82 (0.88) 70.1 (0.74) 71.39 (0.8) 71.47 (0.8)
SMOTE
ADASYN


B Undersampling experiments

Table 12 Undersampling experiments. Results show the average tenfold AUC test set performance with respect to increasing βu-values (βu = [βu1, βu2, βu3, βu4, βu5] = [0, 1/4, 1/2, 3/4, 1]). Optimal performances (boldface) are highlighted for each imbalance ratio. Additionally, standard deviations are shown between brackets. CL K represents the "Closest Knn" technique; CL T represents the "Closest tot sim" technique (similar for Far K and Far T, see Section 3.2). Parameter settings are given in Section 4.2.

Mov G(p = 1)
βu1 βu2 βu3 βu4 βu5
RUS 71.6 (2.6) 71.83 (2.6) 72.54 (2.5) 72.39 (3.1) 70.61 (3.5)
Cl K 71.6 (2.6) 71.4 (2) 70.96 (1.9) 70.43 (2.4) 69.05 (4.1)
CL T 71.6 (2.6) 70.28 (2.5) 66.74 (2) 66.8 (2.1) 68.18 (3.6)
Far K 71.6 (2.6) 72.36 (2.7) 71.26 (3.4) 66.57 (5.2) 53.5 (3.5)
Far T 71.6 (2.6) 72.22 (2.8) 71.63 (3.6) 64.28 (5.3) 50.88 (4.4)
CBU 72.55 (2.6) 73.28 (2.6) 73.12 (2.6) 73.84 (2.5) 73 (3.1)

Mov G(p = 25)
βu1 βu2 βu3 βu4 βu5
RUS 81.41 (1.3) 81.36 (1.3) 81.78 (1.7) 82.05 (1.7) 81.6 (2.1)
Cl K 81.41 (1.3) 80.86 (1.2) 80.95 (1.6) 79.73 (2.3) 77.95 (2.3)
CL T 81.41 (1.3) 79.9 (1.2) 78.21 (1.4) 77.87 (1.5) 77.76 (2.3)
Far K 81.41 (1.3) 80.9 (1.5) 78.17 (1.8) 74.25 (2.4) 69.79 (3.2)
Far T 81.41 (1.3) 80.86 (1.5) 77.2 (2.4) 71.16 (2.7) 62.4 (2.8)
CBU 81.53 (1.4) 81.64 (1.3) 81.29 (1.6) 81.28 (2.1) 80.34 (2.7)

Mov Th(p = [])
βu1 βu2 βu3 βu4 βu5
RUS 79.77 (5.3) 80.32 (5.8) 81.57 (5.5) 81.86 (6.6) 81.26 (6.2)
Cl K 79.77 (5.3) 79.25 (4.5) 78.07 (5) 76.25 (6.5) 62.46 (8.5)
CL T 79.77 (5.3) 78.4 (4.4) 72.41 (3.5) 64.66 (4.5) 60.37 (7.3)
Far K 79.77 (5.3) 84.54 (5) 83.64 (6.4) 80.02 (7.3) 56.82 (10.3)
Far T 79.77 (5.3) 85.03 (5.7) 82.68 (6.8) 75.61 (9.2) 56.77 (10.9)
CBU 80.11 (5.8) 81.17 (6) 81.08 (6.5) 84.17 (5.1) 80.96 (6.9)

Yahoo A(p = 1)
βu1 βu2 βu3 βu4 βu5
RUS 55.92 (3) 55.57 (3.4) 56.44 (3) 55.83 (3.4) 56.37 (3.3)
Cl K 55.92 (3) 55.67 (2.4) 53.12 (2) 50.57 (1.8) 53.79 (3.5)
CL T 55.92 (3) 55.69 (2.1) 53.35 (2.2) 50.31 (2.2) 52.35 (3.3)
Far K 55.92 (3) 57.35 (2.2) 56.92 (1.1) 56.95 (2.3) 51.18 (2)
Far T 55.92 (3) 56.93 (2.4) 54.74 (1.9) 57.01 (1.8) 51.18 (2)
CBU 58.21 (2.6) 58.45 (3.3) 58.31 (3.5) 58.39 (3.5) 56.09 (2.6)

Yahoo A(p = 25)
βu1 βu2 βu3 βu4 βu5
RUS 61.68 (2.4) 62.9 (2.9) 63.62 (3.6) 63.75 (3.1) 63.19 (1.9)
Cl K 61.68 (2.4) 61.14 (2.1) 57.62 (1.6) 54.02 (1.8) 51.48 (1.4)
CL T 61.68 (2.4) 60.89 (2.8) 58.11 (1.4) 54.4 (2.1) 51.76 (1.4)
Far K 61.68 (2.4) 63.96 (3) 62.62 (2.2) 59.61 (1.5) 56.25 (1.6)
Far T 61.68 (2.4) 63.71 (2.4) 59.72 (1.6) 57.27 (1.1) 54.47 (1.1)
CBU 62.46 (2.6) 61.85 (1.4) 61.78 (2.2) 59.94 (3) 60.1 (4)

Yahoo G(p = 1)
βu1 βu2 βu3 βu4 βu5
RUS 66.84 (3.7) 67.85 (3.2) 68.36 (3.2) 68.23 (4) 69.9 (4.2)
Cl K 66.84 (3.7) 66.71 (2.8) 64.3 (3.6) 61.98 (3.9) 61.15 (1.9)
CL T 66.84 (3.7) 65.79 (2.7) 63.55 (3.3) 59.21 (3.5) 61.08 (2.4)
Far K 66.84 (3.7) 66.76 (4.1) 63.84 (3.4) 65.16 (2) 48.5 (2.9)
Far T 66.84 (3.7) 66.95 (4.1) 63.48 (2.9) 65.16 (2) 48.48 (2.9)
CBU 69.68 (4.1) 70.59 (3.2) 70.64 (3.7) 70.2 (2.9) 63.35 (3.6)

Yahoo G(p = 25)
βu1 βu2 βu3 βu4 βu5
RUS 78.82 (1.4) 78.91 (1.6) 78.97 (1.6) 78.61 (1.6) 77.82 (2.1)
Cl K 78.82 (1.4) 77.26 (1.5) 72.52 (1.5) 67.86 (2) 65.07 (2.7)
CL T 78.82 (1.4) 76.83 (1) 71.99 (1.8) 67.15 (2.3) 61.1 (2.7)
Far K 78.82 (1.4) 78.26 (2.2) 74.69 (2.7) 67.22 (2.1) 60.72 (2.3)
Far T 78.82 (1.4) 77.68 (2.6) 72.44 (3) 64.94 (2.4) 59.6 (2)
CBU 75.25 (3.2) 75.22 (2.4) 74.69 (2.3) 73.07 (2.4) 70.69 (2.4)

TaFeng(p = 1)
βu1 βu2 βu3 βu4 βu5
RUS 55.75 (1.6) 56.1 (1.6) 56.26 (1.7) 57.23 (1.7) 59.25 (2.2)
Cl K 55.75 (1.6) 55.68 (1.6) 55.58 (1.5) 55.08 (1.1) 51.05 (1.5)
CL T 55.75 (1.6) 55.67 (1.6) 54.47 (1.6) 47.53 (1.6) 49.3 (1.1)
Far K 55.75 (1.6) 58.99 (1.2) 59.47 (1.1) 60.04 (1.2) 56.31 (1)
Far T 55.75 (1.6) 58.92 (1.3) 59.25 (1.3) 58.58 (1.1) 56.31 (1)
CBU 57.8 (1) 58.47 (1.1) 58.15 (0.9) 58.87 (1.4) 57.65 (1.6)

TaFeng(p = 25)
βu1 βu2 βu3 βu4 βu5
RUS 66.94 (1.3) 67.44 (1.3) 68.1 (1.4) 68.27 (1.4) 66.13 (1.2)
Cl K 66.94 (1.3) 66.13 (1.4) 63.39 (1.2) 59.83 (1.3) 56.94 (0.7)
CL T 66.94 (1.3) 66.38 (1.5) 62.89 (1.6) 57.46 (1.3) 54.56 (1.3)
Far K 66.94 (1.3) 68.06 (1.4) 66.43 (1.6) 64.46 (1.5) 63.35 (1.3)
Far T 66.94 (1.3) 64.31 (1.1) 62.69 (1) 61.27 (1.1) 59.03 (1)
CBU 64.81 (1.2) 64.15 (1.1) 64.13 (1.2) 63.88 (0.8) 63.46 (0.8)

Book(p = 1)
βu1 βu2 βu3 βu4 βu5
RUS 52.6 (1.3) 52.79 (0.9) 53.46 (0.8) 53.89 (0.9) 54.05 (0.9)
Cl K 52.6 (1.3) 52.56 (1.2) 52.52 (1.3) 52.39 (1.1) 53.09 (1.1)
CL T 52.6 (1.3) 52.56 (1.2) 52.52 (1.3) 52.39 (1.1) 53.05 (0.7)
Far K 52.6 (1.3) 55.21 (1.2) 56.21 (1.8) 56.14 (1.2) 53.06 (1)
Far T 52.6 (1.3) 55.21 (1.2) 56.21 (1.8) 56.14 (1.2) 53.06 (1)
CBU 54.28 (0.9) 53.77 (1) 53.33 (1.1) 53.34 (0.9) 52.84 (0.8)

Book(p = 25)
βu1 βu2 βu3 βu4 βu5
RUS 60.08 (0.7) 60.13 (0.6) 60.4 (0.8) 60.33 (0.8) 63.28 (0.8)
Cl K 60.08 (0.7) 59.96 (0.7) 60.13 (0.8) 59.96 (1) 59.28 (0.7)
CL T 60.08 (0.7) 59.96 (0.7) 60.13 (0.8) 60.29 (0.4) 54.5 (0.9)
Far K 60.08 (0.7) 63.29 (1) 64.19 (0.8) 57.3 (1.1) 55.66 (1.1)
Far T 60.08 (0.7) 62.14 (0.5) 58.27 (0.6) 56.37 (1) 55.66 (1.1)
CBU 54.82 (0.9) 54.67 (0.9) 54.71 (0.9) 54.66 (1) 54.78 (0.9)

LST(p = 1)
βu1 βu2 βu3 βu4 βu5
RUS 99.99 (0) 99.99 (0) 99.99 (0) 99.98 (0) 99.99 (0)
Cl K 99.99 (0) 99.99 (0) 99.99 (0) 99.99 (0) 99.99 (0)
CL T 99.99 (0) 99.99 (0) 99.99 (0) 99.99 (0) 99.98 (0)
Far K 99.99 (0) 99.98 (0) 99.98 (0) 99.98 (0) 99.98 (0)
Far T 99.99 (0) 99.98 (0) 99.98 (0) 99.98 (0) 99.98 (0)
CBU [] [] [] [] []

Adver(p = [])
βu1 βu2 βu3 βu4 βu5
RUS 96.61 (1.8) 96.32 (1.8) 96.63 (1.4) 97.12 (2.1) 96.22 (1.6)
Cl K 96.61 (1.8) 96.44 (1.5) 96.14 (1.5) 96.04 (2) 94.8 (2.5)
CL T 96.61 (1.8) 95.87 (2.1) 94.32 (1.9) 93.01 (2.2) 90.72 (2.3)
Far K 96.61 (1.8) 96.53 (1.4) 95.76 (2) 94.39 (1.8) 90.49 (3.1)
Far T 96.61 (1.8) 96.54 (1.5) 95.67 (1.9) 94.54 (1.8) 89.3 (2.8)
CBU 96.85 (2.3) 96.85 (2.3) 97.05 (1.5) 96.6 (1.6) 96.06 (2.1)

Adver(p = 1)
βu1 βu2 βu3 βu4 βu5
RUS 90.93 (3) 91.53 (3.1) 92.37 (3.4) 91.9 (2.9) 91.93 (2.2)
Cl K 90.93 (3) 90.64 (3) 89.87 (3.9) 90.21 (3.6) 89.18 (2)
CL T 90.93 (3) 89.7 (3.5) 88.55 (3.4) 85.76 (3.3) 88.2 (2.3)
Far K 90.93 (3) 93.8 (2.3) 92.4 (2.6) 88.73 (3.4) 85.51 (4)
Far T 90.93 (3) 93.62 (2.4) 93.2 (2.2) 88.41 (3.6) 85.51 (4)
CBU 93.22 (2.4) 93.76 (2.5) 93.89 (2.6) 93.52 (2.7) 91.27 (2)

CRF(p = [])
βu1 βu2 βu3 βu4 βu5
RUS 64.06 (16.4) 63.28 (15.9) 67.98 (17.4) 66.95 (21.9) 87.73 (8.8)
Cl K 64.06 (16.4) 62.44 (16.6) 62.34 (16.9) 71.37 (13.8) 78.22 (17.7)
CL T 64.06 (16.4) 62.44 (16.6) 62.34 (16.9) 71.37 (13.8) 62.67 (22.9)
Far K 64.06 (16.4) 83.8 (14.2) 83.93 (14.8) 84.49 (13.7) 86.11 (9.7)
Far T 64.06 (16.4) 83.8 (14.2) 83.93 (14.8) 84.49 (13.7) 86.11 (9.7)
CBU [] [] [] [] []

Bank(p = [])
βu1 βu2 βu3 βu4 βu5
RUS 66.82 (0.9) 67.02 (0.9) 67.37 (0.8) 67.99 (0.6) 69.5 (1)
Cl K 66.82 (0.9) 66.17 (0.7) 65.24 (0.6) 64.86 (0.6) 58.53 (1.1)
CL T 66.82 (0.9) 64.92 (1.1) 60.69 (0.9) 56.33 (0.8) 52.87 (0.7)
Far K 66.82 (0.9) 66.95 (0.6) 66.19 (0.6) 64.42 (0.6) 58.25 (1.1)
Far T 66.82 (0.9) 67.16 (0.6) 64.2 (0.8) 59.67 (1) 58.25 (1.1)
CBU [] [] [] [] []


C Boosting experiments

This section presents the results of each of the boosting experiments from Section 4.5 on the remaining data sources. Each of the figures below depicts the average tenfold AUC-performance on test data (with µ = 100) with respect to the number of boosting iterations for (left) Adaboost (AB), AdaCost (AC) and EasyEnsemble (EE), with C chosen according to the highest validation set AUC-performance (over all possible boosting rounds), and (right) AB and EE (with S = 15) with varying C-levels.

[Figure: average tenfold test AUC versus boosting round T (0 to 30). Panel (a): AB, AC(R2), AC(R8), AC(RD), EE(S=5), EE(S=10), EE(S=15) and the baseline BL; AUC axis 64 to 78. Panel (b): AB and EE at C in {1e-07, 1e-05, 0.001, 0.1}, plus BL; AUC axis 56 to 76.]
Fig 6 Mov G(p = 1) dataset

[Figure: average tenfold test AUC versus boosting round T (0 to 30). Panel (a): AB, AC(R2), AC(R8), AC(RD), EE(S=5), EE(S=10), EE(S=15) and BL; AUC axis 65 to 90. Panel (b): AB and EE at C in {1e-07, 1e-05, 0.001, 0.1}, plus BL; AUC axis 45 to 90.]
Fig 7 Mov Th(p = []) dataset


[Figure: average tenfold test AUC versus boosting round T (0 to 30). Panel (a): AB, AC(R2), AC(R8), AC(RD), EE(S=5), EE(S=10), EE(S=15) and BL; AUC axis 55 to 60. Panel (b): AB and EE at C in {1e-07, 1e-05, 0.001, 0.1}, plus BL; AUC axis 48 to 60.]
Fig 8 Yahoo A(p = 1) dataset

[Figure: average tenfold test AUC versus boosting round T (0 to 30). Panel (a): AB, AC(R2), AC(R8), AC(RD), EE(S=5), EE(S=10), EE(S=15) and BL; AUC axis 61 to 67. Panel (b): AB and EE at C in {1e-07, 1e-05, 0.001, 0.1}, plus BL; AUC axis 50 to 68.]
Fig 9 Yahoo A(p = 25) dataset

[Figure: average tenfold test AUC versus boosting round T (0 to 30). Panel (a): AB, AC(R2), AC(R8), AC(RD), EE(S=5), EE(S=10), EE(S=15) and BL; AUC axis 62 to 74. Panel (b): AB and EE at C in {1e-07, 1e-05, 0.001, 0.1}, plus BL; AUC axis 58 to 74.]
Fig 10 Yahoo G(p = 1) dataset


[Figure: average tenfold test AUC versus boosting round T (0 to 30). Panel (a): AB, AC(R2), AC(R8), AC(RD), EE(S=5), EE(S=10), EE(S=15) and BL; AUC axis 75.5 to 80.5. Panel (b): AB and EE at C in {1e-07, 1e-05, 0.001, 0.1}, plus BL; AUC axis 60 to 85.]
Fig 11 Yahoo G(p = 25) dataset

[Figure: average tenfold test AUC versus boosting round T (0 to 30). Panel (a): AB, AC(R2), AC(R8), AC(RD), EE(S=5), EE(S=10), EE(S=15) and BL; AUC axis 53 to 62. Panel (b): AB and EE at C in {1e-07, 1e-05, 0.001, 0.1}, plus BL; AUC axis 51 to 61.]
Fig 12 TaFeng(p = 1) dataset

[Figure: average tenfold test AUC versus boosting round T (0 to 30). Panel (a): AB, AC(R2), AC(R8), AC(RD), EE(S=5), EE(S=10), EE(S=15) and BL; AUC axis 50 to 56. Panel (b): AB and EE at C in {1e-07, 1e-05, 0.001, 0.1}, plus BL; AUC axis 50 to 56.]
Fig 13 Book(p = 1) dataset


[Figure: average tenfold test AUC versus boosting round T (0 to 30). Panel (a): AB, AC(R2), AC(R8), AC(RD), EE(S=5), EE(S=10), EE(S=15) and BL; AUC axis 99.89 to 99.99. Panel (b): AB and EE at C in {1e-07, 1e-05, 0.001, 0.1}, plus BL; AUC axis 99.7 to 100.]
Fig 14 LST(p = 1) dataset

[Figure: average tenfold test AUC versus boosting round T (0 to 30). Panel (a): AB, AC(R2), AC(R8), AC(RD), EE(S=5), EE(S=10), EE(S=15) and BL; AUC axis 95.5 to 98. Panel (b): AB and EE at C in {1e-07, 1e-05, 0.001, 0.1}, plus BL; AUC axis 75 to 100.]
Fig 15 Adver(p = []) dataset

[Figure: average tenfold test AUC versus boosting round T (0 to 30). Panel (a): AB, AC(R2), AC(R8), AC(RD), EE(S=5), EE(S=10), EE(S=15) and BL; AUC axis 80 to 94. Panel (b): AB and EE at C in {1e-07, 1e-05, 0.001, 0.1}, plus BL; AUC axis 55 to 95.]
Fig 16 Adver(p = 1) dataset


[Figure: average tenfold test AUC versus boosting round T (0 to 30). Panel (a): AB, AC(R2), AC(R8), AC(RD), EE(S=5), EE(S=10), EE(S=15) and BL; AUC axis 45 to 90. Panel (b): AB and EE at C in {1e-07, 1e-05, 0.001, 0.1}, plus BL; AUC axis 45 to 90.]
Fig 17 CRF(p = []) dataset

D Final Comparison

[Figure: scatter of average rank AUC (horizontal axis, 0 to 14) versus average rank Time (vertical axis, 0 to 18) for BL, OS, RS, SMOTE, ADASYN, RUS, Cl Knn, Far Knn, CBU, AB, AC(R=2,8), AC(R=RL), EE(S=10), EE(S=15) and EE par.]

Fig 18 Average rank AUC versus average rank Time (see Table 9) across the large datasets from Table 2 (CRF and Bank). Regarding the former, an extra row EE par is added, having the same AUC-value as EE(S = 15). Ranks for each dataset are subsequently obtained via the procedure outlined in Section 4.6.1. Points occurring in the upper-right region are preferred.


References

Akbani R, Kwek S, Japkowicz N (2004) Applying support vector machines to imbalanced datasets. In: Machine Learning: ECML 2004, 15th European Conference on Machine Learning, Pisa, Italy, September 20-24, 2004, Proceedings. Springer Berlin Heidelberg, pp 39–50. DOI 10.1007/978-3-540-30115-8_7

Ali A Shamsuddin SM Ralescu AL (2015) Classification with class imbalance problem A review Inter-national Journal of Advances in Soft Computing and its Applications 7(3)176ndash204

Alzahrani T Horadam KJ (2016) Community detection in bipartite networks Algorithms and case studiesIn Complex Systems and Networks Dynamics Controls and Applications Springer Berlin Heidel-berg Berlin Heidelberg pp 25ndash50 DOI 101007978-3-662-47824-0 2

Bachner J (2013) Predictive policing Preventing crime with data and analytics IBM Center for the Busi-ness of Government

Baesens B, Van Gestel T, Viaene S, Stepanova M, Suykens J, Vanthienen J (2003) Benchmarking state-of-the-art classification algorithms for credit scoring. Journal of the Operational Research Society 54(6):627–635. DOI 10.1057/palgrave.jors.2601545

Barandela R, Sánchez J, García V, Rangel E (2003) Strategies for learning in class imbalance problems. Pattern Recognition 36(3):849–851. DOI https://doi.org/10.1016/S0031-3203(02)00257-1

Barber MJ (2007) Modularity and community detection in bipartite networks Physical Review E76066102 DOI 101103PhysRevE76066102

Barua S, Islam MM, Yao X, Murase K (2014) MWMOTE – majority weighted minority oversampling technique for imbalanced data set learning. IEEE Transactions on Knowledge and Data Engineering 26(2):405–425. DOI 10.1109/TKDE.2012.232

Batista GEAPA Prati RC Monard MC (2004) A study of the behavior of several methods for balancingmachine learning training data SIGKDD Explor Newsl 6(1)20ndash29 DOI 10114510077301007735

Beckett SJ (2016) Improved community detection in weighted bipartite networks Royal Society OpenScience 3(1) DOI 101098rsos140536

Bekkar M Djemaa HK Alitouche TA (2013) Evaluation measures for models assessment over imbalanceddata sets Journal of Information Engineering and Applications 3(10)27ndash38

Bhattacharyya S Jha S Tharakunnel K Westland JC (2011) Data mining for credit card fraud A compara-tive study Decision Support Systems 50(3)602 ndash 613 DOI httpsdoiorg101016jdss201008008

Blondel VD Guillaume JL Lambiotte R Lefebvre E (2008) Fast unfolding of communities in large net-works Journal of Statistical Mechanics Theory and Experiment 2008(10)P10008

Breiman L, Friedman J, Stone CJ, Olshen RA (1984) Classification and regression trees. Taylor & Francis

Brozovsky L, Petricek V (2007) Recommender system for online dating service. In: Proceedings of Znalosti 2007 Conference, VSB Ostrava

Cha M, Mislove A, Gummadi KP (2009) A measurement-driven analysis of information propagation in the Flickr social network. In: Proceedings of the 18th International Conference on World Wide Web, ACM, New York, NY, USA, WWW '09, pp 721–730. DOI 10.1145/1526709.1526806

Chawla NV (2005) Data mining for imbalanced datasets An overview In Data mining and knowledgediscovery handbook Springer US Boston MA pp 853ndash867

Chawla NV, Bowyer KW, Hall LO, Kegelmeyer WP (2002) SMOTE: synthetic minority over-sampling technique. Journal of Artificial Intelligence Research 16:321–357

Chawla NV Lazarevic A Hall LO Bowyer KW (2003) Smoteboost Improving prediction of the minorityclass in boosting In Knowledge Discovery in Databases PKDD 2003 Springer Berlin HeidelbergBerlin Heidelberg pp 107ndash119

Chawla NV Japkowicz N Kotcz A (2004) Editorial Special issue on learning from imbalanced data setsSIGKDD Explor Newsl 6(1)1ndash6 DOI 10114510077301007733

Chen M Mao S Liu Y (2014) Big data A survey Mobile Networks and Applications 19(2)171ndash209DOI 101007s11036-013-0489-0

Chyi YM (2003) Classification analysis techniques for skewed class distribution problems Master thesisDepartment of Information management National Sun Yat-Sen University

Demšar J (2006) Statistical comparisons of classifiers over multiple data sets. Journal of Machine Learning Research 7(Jan):1–30

Dhillon IS (2001) Co-clustering documents and words using bipartite spectral graph partitioning In Pro-ceedings of the Seventh ACM SIGKDD International Conference on Knowledge Discovery and DataMining ACM New York NY USA KDD rsquo01 pp 269ndash274 DOI 101145502512502550


Drummond C, Holte RC (2003) C4.5, class imbalance, and cost sensitivity: why under-sampling beats over-sampling. In: Proceedings of the ICML '03 Workshop on Learning from Imbalanced Datasets

Fan RE, Chang KW, Hsieh CJ, Wang XR, Lin CJ (2008) LIBLINEAR: a library for large linear classification. Journal of Machine Learning Research 9:1871–1874

Fan W Stolfo SJ Zhang J Chan PK (1999) AdaCost Misclassification cost-sensitive boosting In Pro-ceedings of the Sixteenth International Conference on Machine Learning Morgan Kaufmann Pub-lishers Inc San Francisco CA USA ICML rsquo99 pp 97ndash105

Fawcett T (2006) An introduction to ROC analysis Pattern Recognition Letters 27(8)861 ndash 874 DOIhttpsdoiorg101016jpatrec200510010

Finch H (2005) Comparison of distance measures in cluster analysis with dichotomous data Journal ofData Science 3(1)85ndash100

Fortunato S (2010) Community detection in graphs Physics Reports 486(3 5)75 ndash 174 DOI httpsdoiorg101016jphysrep200911002

Junque de Fortuny E, Martens D, Provost F (2014a) Predictive modeling with big data: is bigger really better? Big Data 1(4):215–226. DOI 10.1089/big.2013.0037

Junque de Fortuny E Stankova M Moeyersoms J Minnaert B Provost F Martens D (2014b) Corporateresidence fraud detection In Proceedings of the 20th ACM SIGKDD International Conference onKnowledge Discovery and Data Mining ACM New York NY USA KDD rsquo14 pp 1650ndash1659 DOI10114526233302623333

Frasca M Bertoni A Re M Valentini G (2013) A neural network algorithm for semi-supervised node labellearning from unbalanced data Neural Networks 4384 ndash 98 DOI httpsdoiorg101016jneunet201301021

Friedman M (1937) The use of ranks to avoid the assumption of normality implicit in the analysis ofvariance Journal of the american statistical association 32(200)675ndash701

García E, Lozano F (2007) Boosting support vector machines. In: Machine Learning and Data Mining in Pattern Recognition, 5th International Conference, MLDM 2007, Leipzig, Germany, July 18-20, Post Proceedings. IBaI publishing, pp 153–167

Goldstein M Uchida S (2016) A comparative evaluation of unsupervised anomaly detection algorithmsfor multivariate data PLOS ONE 11(4)1ndash31 DOI 101371journalpone0152173

González PC, Velásquez JD (2013) Characterization and detection of taxpayers with false invoices using data mining techniques. Expert Systems with Applications 40(5):1427–1436. DOI https://doi.org/10.1016/j.eswa.2012.08.051

Guimera R Sales-Pardo M Amaral LAN (2007) Module identification in bipartite and directed networksPhysical Review E 76036102 DOI 101103PhysRevE76036102

Guo H Viktor HL (2004) Learning from imbalanced data sets with boosting and data generation TheDataBoost-IM approach SIGKDD Explor Newsl 6(1)30ndash39 DOI 10114510077301007736

Guo X Yin Y Dong C Yang G Zhou G (2008) On the class imbalance problem In 2008 Fourth Interna-tional Conference on Natural Computation IEEE vol 4 pp 192ndash201 DOI 101109ICNC2008871

Han H Wang WY Mao BH (2005) Borderline-SMOTE A new over-sampling method in imbalanced datasets learning In Advances in Intelligent Computing Springer Berlin Heidelberg Berlin Heidelbergpp 878ndash887

He H Garcia EA (2009) Learning from imbalanced data IEEE Transactions on Knowledge and DataEngineering 21(9)1263ndash1284 DOI 101109TKDE2008239

He H Bai Y Garcia EA Li S (2008) ADASYN Adaptive synthetic sampling approach for imbalancedlearning In 2008 IEEE International Joint Conference on Neural Networks (IEEE World Congresson Computational Intelligence) IEEE pp 1322ndash1328 DOI 101109IJCNN20084633969

Holm S (1979) A simple sequentially rejective multiple test procedure Scandinavian journal of statistics6(2)65ndash70

Hsu CW Lin CJ (2002) A comparison of methods for multiclass support vector machines IEEE Transac-tions on Neural Networks 13(2)415ndash425 DOI 10110972991427

Huang A (2008) Similarity measures for text document clustering In Proceedings of the sixth new zealandcomputer science research student conference (NZCSRSC2008) Christchurch New Zealand pp 49ndash56

Iman RL Davenport JM (1980) Approximations of the critical region of the Friedman statistic Commu-nications in Statistics-Theory and Methods 9(6)571ndash595

Jo T Japkowicz N (2004) Class imbalances versus small disjuncts ACM SIGKDD Explor Newsl 6(1)40ndash49 DOI 10114510077301007737


Jutla IS, Jeub LG, Mucha PJ (2011-2016) A generalized Louvain method for community detection implemented in MATLAB. URL http://netwiki.amath.unc.edu/GenLouvain

Kubat M Matwin S (1997) Addressing the curse of imbalanced training sets One-sided selection InProceedings of the Fourteenth International Conference on Machine Learning Morgan KaufmannPublishers Inc San Francisco CA USA pp 179ndash186

Lancichinetti A Fortunato S (2009) Community detection algorithms A comparative analysis PhysicalReview E 80056117 DOI 101103PhysRevE80056117

Larremore DB Clauset A Jacobs AZ (2014) Efficiently inferring community structure in bipartitenetworks Physical Review E Statistical Nonlinear and Soft Matter Physics 90012805 DOI101103PhysRevE90012805

Li J Fine JP (2010) Weighted area under the receiver operating characteristic curve and its application togene selection Journal of the Royal Statistical Society Series C (Applied Statistics) 59(4)673ndash692DOI 101111j1467-9876201000713x

Li X Wang L Sung E (2008) AdaBoost with SVM-based component classifiers Engineering Applicationsof Artificial Intelligence 21(5)785ndash795 DOI httpsdoiorg101016jengappai200707001

Lichman M (2013) UCI machine learning repository. URL http://archive.ics.uci.edu/ml

Liu W, Chawla S, Cieslak DA, Chawla NV (2010) A robust decision tree algorithm for imbalanced data sets. In: Proceedings of the tenth SIAM international conference on data mining, SIAM, Philadelphia, vol 10, pp 766–777

Liu XY Wu J Zhou ZH (2009) Exploratory undersampling for class-imbalance learning IEEE Transac-tions on Systems Man and Cybernetics Part B (Cybernetics) 39(2)539ndash550 DOI 101109TSMCB20082007853

Luts J Ojeda F Van de Plas R De Moor B Van Huffel S Suykens JA (2010) A tutorial on supportvector machine-based methods for classification problems in chemometrics Analytica Chimica Acta665(2)129ndash145 DOI httpsdoiorg101016jaca201003030

Macskassy SA Provost F (2007) Classification in networked data A toolkit and a univariate case studyJournal of Machine Learning Research 8(May)935ndash983

Martens D Provost F (2014) Explaining data-driven document classifications MIS Quarterly 38(1)73ndash100

Martens D Provost F Clark J Junque de Fortuny E (2016) Mining massive fine-grained behavior data toimprove predictive analytics MIS Quarterly 40(4)869ndash888

Mazurowski MA Habas PA Zurada JM Lo JY Baker JA Tourassi GD (2008) Training neural networkclassifiers for medical decision making The effects of imbalanced datasets on classification perfor-mance Neural Networks 21(23)427 ndash 436 DOI httpsdoiorg101016jneunet200712031

Mease D Wyner AJ Buja A (2007) Boosted classification trees and class probabilityquantile estimationJournal of Machine Learning Research 8409ndash439

Nemenyi P (1963) Distribution-free Multiple Comparisons. Dissertation, Princeton University

Newman MEJ, Girvan M (2004) Finding and evaluating community structure in networks. Physical Review E 69:026113. DOI 10.1103/PhysRevE.69.026113

Ng AY (2004) Feature selection, L1 vs. L2 regularization, and rotational invariance. In: Proceedings of the Twenty-first International Conference on Machine Learning, ACM, New York, NY, USA, ICML '04, pp 78. DOI 10.1145/1015330.1015435

Ng AY Jordan MI (2002) On discriminative vs generative classifiers A comparison of logistic regressionand naive bayes In Advances in Neural Information Processing Systems 14 MIT Press pp 841ndash848

Ngai E Hu Y Wong Y Chen Y Sun X (2011) The application of data mining techniques in financial frauddetection A classification framework and an academic review of literature Decision Support Systems50(3)559 ndash 569 DOI httpsdoiorg101016jdss201008006

Platt JC (1999) Probabilistic outputs for support vector machines and comparisons to regularized likeli-hood methods In Advances in Large-Margin Classifiers MIT Press pp 61ndash74

Porter MA Onnela JP Mucha PJ (2009) Communities in networks Notices of the American MathematicalSociety 56(9)1082ndash1097

Provost F Fawcett T (2013) Data Science for Business What you need to know about data mining anddata-analytic thinking OrsquoReilly Media Inc

Provost F Dalessandro B Hook R Zhang X Murray A (2009) Audience selection for on-line brandadvertising Privacy-friendly social network targeting In Proceedings of the 15th ACM SIGKDDInternational Conference on Knowledge Discovery and Data Mining ACM New York NY USAKDD rsquo09 pp 707ndash716 DOI 10114515570191557098


Raskutti B Kowalczyk A (2004) Extreme re-balancing for SVMs A case study SIGKDD Explor Newsl6(1)60ndash69 DOI 10114510077301007739

Rosvall M Bergstrom CT (2008) Maps of random walks on complex networks reveal community structureProceedings of the National Academy of Sciences 105(4)1118ndash1123 DOI 101073pnas0706851105

Schapire RE (1999) A brief introduction to boosting In Proceedings of the 16th International Joint Con-ference on Artificial Intelligence - Volume 2 Morgan Kaufmann Publishers Inc San Francisco CAUSA IJCAIrsquo99 pp 1401ndash1406

Schapire RE Singer Y (1999) Improved boosting algorithms using confidence-rated predictions Machinelearning 37(3)297ndash336 DOI 101023A1007614523901

Shmueli G (2017) Analyzing behavioral big data Methodological practical ethical and moral issuesQuality Engineering 29(1)57ndash74 DOI 1010800898211220161210979

Sobhani P Viktor H Matwin S (2015) Learning from imbalanced data using ensemble methods andcluster-based undersampling In New Frontiers in Mining Complex Patterns Third InternationalWorkshop NFMCP 2014 Held in Conjunction with ECML-PKDD 2014 Nancy France Septem-ber 19 2014 Revised Selected Papers Springer International Publishing Cham pp 69ndash83 DOI101007978-3-319-17876-9 5

Stankova M (2016) Classification within network data with a bipartite structure Dissertation Universityof Antwerp

Stankova M Martens D Provost F (2015) Classification over bipartite graphs through projection WorkingPapers 2015001 University of Antwerp Faculty of Applied Economics

Sun Y Kamel MS Wong AK Wang Y (2007) Cost-sensitive boosting for classification of imbalanceddata Pattern Recognition 40(12)3358 ndash 3378 DOI httpsdoiorg101016jpatcog200704009

Suykens JA Van Gestel T De Brabanter J De Moor B Vandewalle J Suykens J Van Gestel T (2002)Least squares support vector machines World Scientific Singapore

Tang B He H (2015) Enn Extended nearest neighbor method for pattern recognition [research frontier]IEEE Computational Intelligence Magazine 10(3)52ndash60 DOI 101109MCI20152437512

Tang Y Zhang YQ Chawla NV Krasser S (2009) SVMs modeling for highly imbalanced classificationIEEE Transactions on Systems Man and Cybernetics Part B (Cybernetics) 39(1)281ndash288 DOI101109TSMCB20082002909

Tobback E Moeyersoms J Stankova M Martens D (2016) Bankruptcy prediction for SMEs using rela-tional data Working Paper 2016004 University of Antwerp Faculty of Applied Economics

Verbeke W Dejaeger K Martens D Hur J Baesens B (2012) New insights into churn prediction in thetelecommunication sector A profit driven data mining approach European Journal of OperationalResearch 218(1)211 ndash 229 DOI httpsdoiorg101016jejor201109031

Veropoulos K Campbell I Cristianini N (1999) Controlling the sensitivity of support vector machinesIn Proceedings of the International Joint Conference on Artificial Intelligence Stockholm Sweden(IJCAI99) pp 55 ndash 60

Whitrow C Hand DJ Juszczak P Weston D Adams NM (2009) Transaction aggregation as a strategyfor credit card fraud detection Data Mining and Knowledge Discovery 18(1)30ndash55 DOI 101007s10618-008-0116-z

Wickramaratna J Holden SB Buxton BF (2001) Performance degradation in boosting In Proceedings ofthe Second International Workshop on Multiple Classifier Systems Springer London UK MCS rsquo01pp 11ndash21

Yen SJ Lee YS (2009) Cluster-based under-sampling approaches for imbalanced data distributions ExpertSystems with Applications 36(3 Part 1)5718 ndash 5727 DOI httpsdoiorg101016jeswa200806108

Yu HF Lo HY Hsieh HP Lou JK McKenzie TG Chou JW Chung PH Ho CH Chang CF Wei YH et al(2010) Feature engineering and classifier ensemble for kdd cup 2010 In Proceedings of the KDDCup 2010 Workshop pp 1ndash16

Zha H He X Ding C Simon H Gu M (2001) Bipartite graph partitioning and data clustering In Pro-ceedings of the Tenth International Conference on Information and Knowledge Management ACMNew York NY USA CIKM rsquo01 pp 25ndash32 DOI 101145502585502591

Zhang J Mani I (2003) Knn approach to unbalanced data distributions A case study involving informationextraction In Proceedings of the ICMLrsquo2003 Workshop on Learning from Imbalanced DatasetsWashington DC

Ziegler CN McNee SM Konstan JA Lausen G (2005) Improving recommendation lists through topicdiversification In Proceedings of the 14th International Conference on World Wide Web ACM NewYork NY USA WWW rsquo05 pp 22ndash32 DOI 10114510607451060754



1 Introduction

1.1 Literature overview

Learning from imbalanced data is an important concept that gained recent attention in the research community, showing its first publications around 1990 (He and Garcia 2009; Chawla et al 2002). The fundamental issue lies in the fact that many learners are designed with a lack of consideration for the underlying data distribution. Rule- and tree-based techniques, and methods aiming to minimize overall training set error such as support vector machines (SVMs) and neural networks, are often cited examples of inducers that suffer from this issue (Liu et al 2010; Chawla et al 2002; Mazurowski et al 2008; Akbani et al 2004; Tang et al 2009). The resulting classifier will emphasize the majority class instances at the expense of neglecting minority class examples, while the latter is usually the phenomenon of interest. Besides this issue of between-class imbalance, several other sources of data complexity are known to hinder learning: within-class imbalance, overlapping classes, outliers and noise (He and Garcia 2009; Sobhani et al 2015). In the following paragraphs, we give an overview of the two classes of methods that are most widely used to cope with this issue (Liu et al 2009): sampling and cost-sensitive learning. A more thorough overview of the imbalanced learning issue, including a summary of less common active learning methods and kernel-based approaches, can be found in Chawla (2005); He and Garcia (2009).

Sampling methods work at a data level and attempt to provide a more balanced data distribution to the underlying base learner. These techniques generally consist of oversampling the minority class, undersampling the majority class, or a combination thereof. A simple oversampling technique will replicate minority class instances. More advanced techniques will introduce synthetic samples, such as SMOTE (Chawla et al 2002), Borderline-SMOTE (Han et al 2005) and ADASYN (He et al 2008). A detailed discussion of these methods is provided in Section 3.1. Cluster-based oversampling techniques have also been investigated (e.g. see the related discussion on the CBO algorithm upcoming in this overview). A more recent approach called MWMOTE (Barua et al 2014) clusters the minority class instances first, after which the SMOTE synthetic sampling procedure is applied to generate new synthetic minority class instances within each cluster. A general downside of oversampling is the increased training time of the subsequently employed learner.
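The interpolation idea these synthetic oversamplers share can be sketched as follows. This is an illustrative plain-Python sketch of basic SMOTE, not the paper's implementation; the function and parameter names are our own, and the paper's adaptation for behaviour data is described in Section 3.1.

```python
import random

def smote(minority, n_new, k=5, seed=0):
    """Generate n_new synthetic minority samples by interpolating between a
    minority instance and one of its k nearest neighbours (plain SMOTE;
    'minority' is a list of numeric feature vectors)."""
    rng = random.Random(seed)

    def dist(a, b):
        # squared Euclidean distance between two dense vectors
        return sum((u - v) ** 2 for u, v in zip(a, b))

    synthetic = []
    for _ in range(n_new):
        x = rng.choice(minority)
        # k nearest neighbours of x among the other minority points
        neigh = sorted((p for p in minority if p is not x),
                       key=lambda p: dist(x, p))[:k]
        z = rng.choice(neigh)
        gap = rng.random()  # interpolation factor in [0, 1)
        synthetic.append([xi + gap * (zi - xi) for xi, zi in zip(x, z)])
    return synthetic
```

Each synthetic point lies on the line segment between two real minority instances, which is why oversampled regions stay inside the minority class's convex neighbourhoods.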

The simplest undersampling method is called random undersampling, where majority class instances are randomly discarded from the training data. More sophisticated techniques operate in an informed fashion, where majority class instances are dropped according to their anticipated importance. One could for instance use the K-nearest neighbour (Knn) classifier to do so (Zhang and Mani 2003; Chyi 2003). Other informed undersampling approaches rely on the results of a preliminary clustering procedure (see the upcoming discussion on within-class imbalance). Yet other sampling techniques employ a combination of oversampling and undersampling. Chawla et al (2002) apply the SMOTE procedure in a first step and use random undersampling in a second step, though many combination schemes are possible and can be combined with a data cleaning technique. The latter type of methods aim at removing noise and the overlap that is introduced by the sampling schemes (He and Garcia 2009). Representative work in this area includes the one-sided selection (OSS) method (Kubat and Matwin 1997) and the techniques discussed in Batista et al (2004) (e.g. SMOTE with Tomek links). The literature on sampling techniques does not end there. Inspired by the success of boosting algorithms (and ensemble learning in general), sampling techniques have been integrated into this process. Examples of these methods are SMOTE-BOOST (Chawla et al 2003), JOUS-Boost (Mease et al 2007), DataBoost-IM (Guo and Viktor 2004), balanced cascade (Liu et al 2009) and EasyEnsemble (Liu et al 2009). EasyEnsemble is briefly described in Section 3.3.3.
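EasyEnsemble's sampling scheme, independently drawn balanced subsets whose learners are aggregated, can be sketched as follows. This is illustrative only: the paper boosts each subset, whereas here any `train_scorer` function stands in for that learner, and all names are our own.

```python
import random

def easy_ensemble_scores(majority, minority, test, train_scorer, S=4, seed=0):
    """EasyEnsemble-style sketch: draw S random majority subsets of the
    minority-class size, train a scorer on each balanced subset, and
    average the resulting scores over the test points."""
    rng = random.Random(seed)
    scorers = []
    for _ in range(S):
        sub = rng.sample(majority, len(minority))  # balanced subset
        scorers.append(train_scorer(minority, sub))
    return [sum(f(x) for f in scorers) / S for x in test]

def centroid_scorer(pos, neg):
    """Toy 1-D base learner: signed distance to the midpoint of the
    two class means (a stand-in for the boosted learner in the paper)."""
    mid = (sum(pos) / len(pos) + sum(neg) / len(neg)) / 2
    return lambda x: x - mid
```

Because every majority instance has a chance to appear in some subset, the ensemble explores the majority class space while each individual learner still trains on balanced data.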

Cost-sensitive learning makes use of cost matrices or misclassification costs to describe the penalty of misclassifying a certain instance. By incorporating misclassification costs, one can put a larger emphasis on the minority class. Indeed, misclassification costs of minority class examples are usually much larger than misclassification costs of majority class instances. The overall goal of cost-sensitive learning is "to develop a hypothesis that minimizes the overall cost on the training data set" (He and Garcia 2009) and can be accomplished in two ways. The first set of techniques uses cost-sensitive boosting variants, which basically incorporate misclassification costs in the weight updating scheme of the AdaBoost (Schapire 1999) algorithm. Popular techniques include AdaC1, AdaC2, AdaC3 (Sun et al 2007) and AdaCost (Fan et al 1999). A more detailed discussion on AdaBoost and AdaCost is presented in Sections 3.3.1 and 3.3.2, respectively. The second class of methods integrates misclassification costs directly into the underlying classifier and is specifically tailored to the type of inducer that is being used to construct the hypothesis. This basically means that the learning algorithm is made cost-sensitive. Barandela et al (2003) present a cost-sensitive Knn classifier by using a weighted distance function that emphasizes the minority class. A cost-sensitive support vector machine (SVM) is another example of this category of techniques; see the related discussion in Section 3.3.2 for more details. Also worth mentioning are methods that use a combination of cost-sensitive learning with one of the sampling techniques previously described. Akbani et al (2004) first apply SMOTE to generate synthetic minority class instances (yet the dataset is still imbalanced after applying SMOTE), after which a cost-sensitive SVM is used.
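To illustrate how class-dependent misclassification costs enter an SVM-style learner, the sketch below runs subgradient descent on a class-weighted hinge loss. It is a toy stand-in for a cost-sensitive SVM, not the paper's formulation: regularization is omitted for brevity and all names are our own.

```python
def train_weighted_hinge(X, y, cost, lr=0.1, epochs=100):
    """Cost-sensitive linear classifier sketch: subgradient descent on a
    class-weighted hinge loss, mimicking the asymmetric C+ / C-
    misclassification costs of a cost-sensitive SVM.
    y contains labels in {-1, +1}; cost maps a label to its weight."""
    w = [0.0] * len(X[0])
    b = 0.0
    for _ in range(epochs):
        for x, t in zip(X, y):
            margin = t * (sum(wi * xi for wi, xi in zip(w, x)) + b)
            if margin < 1:  # inside the margin: take a cost-weighted step
                c = cost[t]
                w = [wi + lr * c * t * xi for wi, xi in zip(w, x)]
                b += lr * c * t
    return w, b
```

Raising the minority-class cost makes margin violations on minority instances more expensive, so the decision boundary is pushed toward the majority class.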

The previously outlined sampling and cost-sensitive learning procedures are geared toward solving the between-class imbalance problem, where the number of majority class instances dominates the number of minority class examples. The vast majority of research focuses on this issue (Ali et al 2015; Guo et al 2008) and ignores the possible occurrence of within-class imbalance. This type of imbalance occurs when different subconcepts (clusters) within a single class have an entirely different amount of representatives. Jo and Japkowicz (2004) were the first to note that small disjuncts - clusters containing a limited amount of instances - can be responsible for performance degradation, since learning algorithms have difficulties in picking up on these concepts. They proposed a cluster-based oversampling method (CBO), where majority class instances and minority class instances are clustered separately. Afterwards, all clusters are oversampled in such a way that there is no more between-class and within-class imbalance. Cluster-based undersampling approaches have also been proposed. The undersampling based on clustering (SBC) technique (Yen and Lee 2009) clusters all training data from both classes simultaneously and subsequently selects majority class instances from each cluster in accordance with the ratio of the number of majority class samples to the number of minority class samples within the cluster under consideration. Sobhani et al (2015) solve the between-class and within-class imbalance problem by clustering the majority class instances and selecting an equal amount of majority representatives from each cluster, such that the total number of selected majority instances equals the total amount of minority class instances.
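The per-cluster selection step that these cluster-based undersamplers share can be sketched as follows. This is illustrative only: cluster labels are assumed to come from a separate clustering procedure, and the function and parameter names are our own.

```python
import random

def cluster_undersample(majority, clusters, n_keep, seed=0):
    """Cluster-based undersampling sketch: given a cluster label for every
    majority instance, draw roughly n_keep / n_clusters instances from each
    cluster, so the kept majority set can match the minority class size
    while still covering all majority subconcepts."""
    rng = random.Random(seed)
    groups = {}
    for inst, c in zip(majority, clusters):
        groups.setdefault(c, []).append(inst)
    per_cluster = max(1, n_keep // len(groups))
    kept = []
    for members in groups.values():
        rng.shuffle(members)          # random picks within each cluster
        kept.extend(members[:per_cluster])
    return kept[:n_keep]
```

Sampling an equal number per cluster (rather than uniformly over the whole majority class) prevents small majority subconcepts from being wiped out by the undersampling step.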

The aforementioned techniques have all been investigated on traditional low-dimensional data, usually on datasets from the UCI repository. However, little is known on the effect of imbalance on the performance of classification algorithms when dealing with imbalanced behaviour data. An extensive discussion on behaviour data and its distinction with 'traditional' data is provided in Section 2.1. The classification techniques that are used for behaviour data can roughly be characterized as heuristic and regularization-based approaches, as discussed in more detail in Stankova et al (2015). Note that many classification techniques designed for traditional data cannot be directly applied to behaviour type of data. This is because tailored techniques need to be developed that have a low computational complexity and take the sparseness structure into account. Frasca et al (2013) present a cost-sensitive neural network algorithm designed to solve the imbalanced semi-supervised unigraph classification problem in the area of gene function prediction (GFP). They note that many algorithms designed for graph learning show a decay in the quality of the solutions when the input data are imbalanced.

1.2 Goals and contributions

The main goal of this article is to investigate the effect of a large variety of over- and undersampling methods, boosting variants and cost-sensitive learning techniques - which are traditionally applied to low-dimensional, dense data - on the problem of learning from imbalanced behaviour datasets. This is our first contribution, in the sense that we are the first to perform such an investigation. The base learner we will consider is an SVM1, for reasons indicated in Section 2.2. This learner will be applied on a large diversity of behaviour data gathered from several distinct domains. As will be discussed in Section 2.1, behaviour data have a different structure and show distinct characteristics compared to traditional data. As a direct consequence, the conclusions and results drawn from previous studies dealing with imbalanced learning cannot be generalized to this specific type of data. By conducting targeted experiments, we gain insights on the possible occurrence of overfitting during oversampling, dwell upon the presence of noise/outliers and its influence on the performance of undersampling methods and boosting variants, and finally gain knowledge on the effect of weak/strong learners in applying boosting algorithms. This investigation allows us to confirm or contradict these findings with the results obtained from studies dealing with 'traditional data'. Throughout the text, we will highlight the main differences.

A second goal of this paper is to provide a benchmark for future studies in this domain. As we conduct a comprehensive comparison of the aforementioned techniques

1 In Section 5 we will investigate the effect of several base learners

Imbalanced classification in sparse and large behaviour datasets 5

in terms of predictive performance and timings, researchers in this field can easily assess and compare their proposed techniques with the methods we have explored in this article. To enable this process, we provide implementations, datasets and results in our on-line repository: http://www.applieddatamining.com/cms/?q=software.

A third contribution lies in the fact that the applied sampling methods need to be adapted to cope with behaviour data. In Section 3.1, we provide specific implementations of the SMOTE and ADASYN algorithms and highlight the differences with their original formulations. The informed undersampling techniques presented in Section 3.2 rely on nearest neighbour computations, which require a different similarity measure as opposed to 'traditional' data.

Traditional studies integrating boosting with SVM-based component classifiers typically use an RBF kernel (Wickramaratna et al 2001; García and Lozano 2007; Li et al 2008). The specific boosting algorithm we propose in Section 3.3.1 combines a linear SVM with a logistic regression (LR) to form a single confidence-rated base learner. This specific kind of weak learner has, to our knowledge, never been proposed in earlier studies, yet proves to be very valuable in the setting of behaviour data.

Our fifth contribution lies in the exploration of a larger part of the parameter space. Chawla et al (2002) and He et al (2008) both employ K = 5 nearest neighbours in their experiments. Zhang and Mani (2003) present, among others, the Near-miss 1 undersampling method, where majority class instances are selected based on their average distance to the three closest minority examples. Liu et al (2009) investigate their proposed EasyEnsemble algorithm (see Section 3.3.3 for a short description) using S = 4 subsets and T = 10 rounds of boosting. All of the aforementioned parameter settings are chosen without a clear motivation. In our experiments, we will consider a larger proportion of the parameter space by varying, for instance, the number of nearest neighbours used in oversampling and undersampling methods, the number of subsets and boosting rounds in EasyEnsemble, etc. In doing this, we can more accurately compare distinct methods dealing with imbalanced learning. Parameter settings for each of the methods proposed in this study are shown in Section 4.2. Furthermore, we also study the effect of the number of subsets used in EasyEnsemble; as mentioned before, the authors only use 4 subsets.

The studies mentioned in the literature overview of Section 1.1 usually postulate a new method and compare it to one or more closely related variants by performing experiments over several datasets and reporting performance outcomes, without any further statistical grounding or interpretation. In this article, we provide statistical evidence by conducting hypothesis tests, which enable a more meaningful comparison.

2 Preliminaries

2.1 Behaviour data

The last decades have witnessed an explosion in data collection data storage anddata processing capabilities leading to significant advances across various domains(Chen et al 2014) With the current technology it is now possible to capture specific


conducts of persons or organisations. Following the definition of Shmueli (2017), behaviour data model fine-grained actions and/or interactions of entities, such as persons or objects. To distinguish 'traditional' data from behaviour data, it is appropriate to introduce the framework proposed in Junque de Fortuny et al (2014a): traditional predictive analytics describes instances by using a small number of features (dozens up to hundreds). These datasets are usually dense in nature, meaning that each instance has a non-trivial value for each of its features (or at least for most of them). Recent years have witnessed a growing number of applications making use of behaviour data, which reflect specific behaviours of individuals or organisations. Think for example about users liking certain pages on Facebook, visiting certain websites or making transactions with specific merchants, and organisations interacting with suppliers/clients. Such data are high-dimensional, containing thousands or even millions of features. Indeed, the number of websites a user can visit, the number of Facebook pages a user can like or the number of unique merchants a user can transact with are enormous, ranging up to millions. A key characteristic is the sparse nature of behaviour data, resulting in a sparse matrix representation (see further in this section). The majority of attributes have a trivial value of "zero" (or not present) or, as Junque de Fortuny et al (2014a) formulate it, "people only have a limited amount of behavioural capital". To summarize: behaviour data are very high-dimensional (10^4 - 10^8) and sparse, and mostly originate from capturing the fine-grained behaviours of persons or companies2.

Besides the differences in structure/representation, behaviour data also show distinct properties in comparison to traditional data. Junque de Fortuny et al (2014a) proved empirically that larger behaviour datasets, in terms of the number of instances or features3, result in significant performance gains. This is in sheer contrast to the literature on sampling (reducing the number of instances) or feature selection that is commonly applied to traditional data, where usually there are a large number of irrelevant features that increase the variance and the opportunity to overfit (Provost and Fawcett 2013). Behaviour data show a different "relevance structure", in the sense that most of the features provide a small though relevant amount of additional information about the target prediction (Junque de Fortuny et al 2014a). Furthermore, the instances and features show a power-law distribution (Stankova et al 2015): the vast amount of instances have a low number of non-zero (active) features and, conversely, the majority of features are only present for a few instances. In her doctoral dissertation, Stankova (2016) showed that applying non-linear classifiers to behaviour data does not improve predictive performance in comparison to the plain application of linear techniques. This is definitely a major contrast to prior literature dealing with dense datasets, where linear methods generally have a lower predictive performance compared to highly non-linear techniques (Baesens et al 2003).

Imbalanced behaviour data occur naturally across a wide range of applications. Some examples include companies employing data on the specific websites a user visits for targeted online advertising (Provost et al 2009), where usually only a relatively low amount of people respond positively to the advertisement. Data on the individual merchants with whom one transacts can be used to detect credit card fraud: the study of Bhattacharyya et al (2011), which uses real-life credit card transactional data, mentions that of the 49,858,600 transactions considered, only 2,420 were fraudulent (0.005%). Besides these marketing and fraud domains, other areas can be considered, such as churn prediction (Verbeke et al 2012), default prediction (Tobback et al 2016) and predictive policing (Bachner 2013). Despite the abundant application domains, little is known on the effect of imbalance on the performance of classification algorithms when dealing with this kind of data. Needless to say, empirical and theoretical developments in this field can give rise to major benefits for academia, companies and governments.

2 The last requirement is not strictly necessary when we talk about behaviour data in our study (the sparseness and high-dimensionality properties are sufficient)

3 In this sparse setting, instance removal and feature selection are in a certain sense equivalent to one another

Behaviour data can be represented as a sparse matrix or, equivalently, as a bipartite graph. In the matrix representation, each instance i corresponds to a single row and columns correspond to specific features. Let's take the example of users rating films, where we wish to predict the gender of each user. In that case, each user corresponds to a row in the matrix and each specific film corresponds to a single column. Alternatively, behaviour data can also be represented as a bipartite graph (Stankova et al 2015). Consider a set of bottom nodes N_B and a set of top nodes N_T. A bipartite graph is a network that contains edges only between nodes of a different type: each bottom node n_b ∈ N_B can only be connected to nodes n_t ∈ N_T. Returning to our example, each user i corresponds with a bottom node n_bi and each film j corresponds with a top node n_tj.
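The matrix representation of the user-film example can be sketched as follows (an illustrative toy snippet; the user-film pairs are invented):

```python
import numpy as np
from scipy.sparse import csr_matrix

# Each (user, film) rating pair becomes a non-zero entry in a sparse binary
# matrix: users are rows (bottom nodes), films are columns (top nodes).
pairs = [(0, 0), (0, 2), (1, 2), (2, 1), (2, 3), (3, 2)]
rows, cols = zip(*pairs)
X = csr_matrix((np.ones(len(pairs)), (rows, cols)), shape=(4, 5))

# Row i lists the films user i rated; column j lists the users who rated
# film j. Only the 6 non-zero entries are stored.
print(X.toarray().astype(int))
```

The same pairs can be read as the edge list of the bipartite graph, so switching between the two representations is a matter of bookkeeping, not of information content.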

Imbalanced behaviour data occur when the collection of instances contains substantially more examples of one class than of the other. In our running example of the previous paragraph, this means that there might be far more males rating films than females. In the datasets presented in Section 4.1, each of the instances contains a label indicating the particular class the example belongs to. In practice, instances do not necessarily have a label associated with them, and the goal is to infer a label/score for these unknown instances based on the examples with known labels. Note that we focus on the two-class classification problem, where labels are limited to two types. A multi-class classifier can be obtained by solving a sequence of binary classification problems; this can for instance be accomplished with one-versus-one or one-versus-all setups (Hsu and Lin 2002).

2.2 Support vector machines (SVMs)

The support vector machine (SVM) is a "state-of-the-art" classification technique that has been applied in a wide variety of domains (Suykens et al 2002). The training data consist of a set of d-dimensional input vectors x_i with corresponding labels y_i ∈ {−1, 1}, i = 1, ..., m. A linear SVM constructs a linear classifier (hyperplane) in the input space. The hyperplane is constructed in such a way that as few points as possible are wrongfully classified, while simultaneously maximizing the margin between the two classes. This trade-off between minimizing the model complexity and reducing misclassification is governed by the regularization parameter C. The linear SVM is


the solution to the following quadratic optimization problem:

\min_{w,b,\xi_i} \quad \frac{w^T w}{2} + C \sum_{i=1}^{m} \xi_i

\text{s.t.} \quad y_i \left( w^T x_i + b \right) \ge 1 - \xi_i, \quad i = 1, \dots, m

\qquad \quad \;\; \xi_i \ge 0, \quad i = 1, \dots, m \qquad (1)

where b represents the bias and ξ_i are the slack variables measuring classification errors. The classifier is given by y(x) = sign(w^T x + b), where w^T x + b represents the output score. It should be noted that we usually solve the dual form of equation (1), with solution y(x) = sign[∑_{i=1}^{m} (α_i y_i x^T x_i) + b] and dual variables4 α_i. An overview of the issues regarding SVMs and imbalanced classification for low-dimensional data can be found in Akbani et al (2004). Briefly: because of the imbalance, the majority of the slack variables ξ_i represent errors with respect to the majority class. This means that the minority class is under-represented and has a minor contribution to the goal function.

In this study we have opted for a linear SVM (using the LIBLINEAR (Fan et al 2008) package) as the base learner to classify behaviour data5. The feature vector will have a sparse and high-dimensional representation with d = |N_T|. Note that the SVM formulation does not change in this setting; the solution vector w will become high-dimensional. The kernel matrix contains inner products of sparse and high-dimensional vectors, resulting in a (possibly) sparse matrix.
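To make the primal problem in equation (1) concrete, the sketch below solves it with plain subgradient descent on a toy dataset. This is only an illustration of the objective; the paper itself uses the LIBLINEAR solver, and the learning rate, epoch count and data here are our own invented choices:

```python
import numpy as np

def linear_svm_sgd(X, y, C=1.0, epochs=300, lr=0.01, seed=0):
    """Minimal subgradient solver for the primal SVM of equation (1).
    X: (n, d) array of inputs x_i; y: labels in {-1, +1}.
    Returns (w, b) so that the output score of x is w @ x + b."""
    rng = np.random.default_rng(seed)
    n, d = X.shape
    w, b = np.zeros(d), 0.0
    for _ in range(epochs):
        for i in rng.permutation(n):
            margin = y[i] * (X[i] @ w + b)
            if margin < 1:
                # Subgradient of (1/2) w'w + C * hinge loss at a violator
                w -= lr * (w - C * y[i] * X[i])
                b += lr * C * y[i]
            else:
                # Only the regularization term contributes
                w -= lr * w
    return w, b

# Toy linearly separable data: two positives, two negatives
X = np.array([[2.0, 1.0], [1.5, 2.0], [-1.0, -1.5], [-2.0, -0.5]])
y = np.array([1, 1, -1, -1])
w, b = linear_svm_sgd(X, y)
pred = np.sign(X @ w + b)
```

Lowering C in this sketch visibly shrinks the hinge-loss updates relative to the shrinkage step, which is exactly the "weakness" effect exploited later in Section 3.3.1.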

The reasons we have chosen an SVM are twofold. First of all, it is a very popular technique for dealing with traditional and behaviour data and has been applied in many diverse domains (Suykens et al 2002). As we have noted in Section 1.1, tailored techniques need to be applied to behaviour datasets and roughly fall into two categories: regularization-based techniques and heuristic approaches. The latter type of techniques are not suitable for dealing with traditional data. The remaining regularization-based techniques have formed the subject of traditional studies dealing with imbalanced data; we have opted for this type of technique since we can then easily compare our results to the conclusions obtained from previous studies. Another reason we have opted for an SVM is the fact that many of our proposed techniques (see Section 3 for details) rely on a boosting process. Wickramaratna et al (2001) and García and Lozano (2007) noted that using a strong learner can result in performance degradation during boosting. Regularization-based techniques offer an added element of flexibility, in the sense that the strength of the learner can be controlled by varying the regularization parameter. Heuristic approaches today do not offer this attractive feature.

4 Also called support values
5 In Section 5 we will consider different types of base learners


2.3 Evaluation metrics

The performance measures that are typically used in traditional studies dealing with imbalanced data are accuracy, sensitivity, specificity, precision, F-measure and G-mean; see for instance Bhattacharyya et al (2011), Han et al (2005), González and Velásquez (2013). These measures are derived from the confusion matrix and are based on a certain threshold applied on the output scores of the classifier, where the threshold is usually contained within the classification algorithm. There are two main issues with this approach. First of all, the built-in threshold that is applied on the scores might not be suitable. This might cause low performance values with respect to these criteria, yet if we were to simply adapt the threshold, the same performance criteria might show excellent results. The second issue lies in the fact that the chosen threshold could be irrelevant with respect to the available capacity. We address this issue with a simple example. If a targeted advertising company were to apply this classifier (with built-in threshold) on new customers, the classifier might choose to predict 5% of all possible customers to target as positive; yet the company only has a marketing budget that allows targeting 0.1% of all possible customers. It is clear that the chosen threshold is inappropriate here. For these reasons, we have chosen to opt for the area under the ROC-curve (AUC) instead6. Chawla et al (2004) note that ROC-curves (and cost-curves) should be preferred over these traditional measures. The AUC, which measures the area under the ROC-curve, is more appropriate since it is independent of class skew and measures the ranking abilities of the classifier (Fawcett 2006). It answers the question: if we were to rank all instances according to output scores, is the classifier able to place the positive instances near the top of the list and the negative instances near the bottom? Because the method scans over all possible thresholds, it is independent of a specific cut-off value. Another reason we have chosen AUC is the fact that many boosting and cost-sensitive learning techniques evaluate the performance using accuracy or misclassification cost, which suffer from the same issues as previously mentioned. Hence, we are one of the first to evaluate these techniques with respect to AUC. Note that AUC is also the preferred metric in the assessment of unsupervised anomaly detection techniques (Goldstein and Uchida 2016). Outliers/anomalies are strongly related to our field, as they signify rare events that are of specific importance to the analyst.
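The ranking interpretation of AUC can be made explicit: it equals the probability that a randomly drawn positive is scored above a randomly drawn negative, which is how the sketch below computes it (an illustration with invented scores, not part of the paper's experiments):

```python
import numpy as np

def auc(scores, labels):
    """AUC via the Mann-Whitney U statistic: the fraction of
    (positive, negative) pairs in which the positive is scored higher.
    Ties count for 1/2."""
    scores = np.asarray(scores, float)
    labels = np.asarray(labels)
    pos = scores[labels == 1]
    neg = scores[labels == 0]
    diff = pos[:, None] - neg[None, :]          # all positive-negative pairs
    return ((diff > 0).sum() + 0.5 * (diff == 0).sum()) / (len(pos) * len(neg))

s = [0.9, 0.8, 0.3, 0.2, 0.1]
y = [1, 1, 0, 1, 0]
print(auc(s, y))   # 5 of the 6 positive-negative pairs are ordered correctly: 5/6
```

Because only the ordering of the scores matters, any monotone rescaling of the classifier outputs leaves the AUC unchanged, which is precisely its threshold-independence.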

Though we focus on AUC, other measures are suitable for performance assessment of imbalanced data. The weighted area under the ROC-curve (wAUC) (Li and Fine 2010) is able to emphasize certain regions of ROC-space. Lift-curves (Bekkar et al 2013), popular in the marketing domain, are ideal to assess the prevalence of positive cases among instances ranked highly by the classifier. These curves can be evaluated according to the available capacity. When cost information is available, this should be integrated in the performance assessment criterion. Indeed, the cost of a false negative is usually much larger than the cost of a false positive. We could then opt for cost-curves (Whitrow et al 2009; Bhattacharyya et al 2011), which can also be evaluated with consideration of the available capacity. Few studies make use of costs

6 We provide traditional measures (sensitivity, specificity, G-means, F-measure) in our online repository: http://www.applieddatamining.com/cms/?q=software


(Ngai et al 2011), mainly because these are difficult to determine, uncertain, and have a temporal characteristic. To conclude this section, we note that each of the performance assessment criteria mentioned in this paragraph requires some additional information, in the form of a weight function (wAUC), capacity requirements and/or costs (lift/cost-curves). These measures are application specific, and this is the main reason we excluded them from our study.

3 Methods

3.1 Oversampling

Over the years, the data mining community investigated several techniques that balance the data distribution by oversampling the minority class. In this section, we investigate the basic oversampling with replacement (OSR) approach in conjunction with synthetic sample generation procedures. The first technique simply duplicates minority class instances by a certain amount. Several references dealing with traditional low-dimensional data note that this technique may make the decision regions of the learner smaller and too specific on the replicated instances, causing the learner to overfit (Chawla et al 2002; Han et al 2005; Liu et al 2009; He and Garcia 2009). The synthetic approaches are designed to overcome this overfitting behaviour by generating new, non-overlapping instances in the minority class space. The techniques we investigate are SMOTE (Chawla et al 2002) and ADASYN (He et al 2008).

Consider a certain minority class instance x_i. In the traditional setting, SMOTE and ADASYN will generate a new synthetic instance x_syn by choosing a random point on the line segment between the point x_i and one of its K minority class nearest neighbours (computed according to Euclidean distance). In SMOTE, each original minority class instance x_i generates the same number of synthetic instances, whereas the ADASYN algorithm generates a variable number of synthetic instances by putting a larger weight on the harder to learn minority instances. This way, the learner is more focused on difficult instances.

These techniques need to be adapted when dealing with binary behaviour data, where each instance is represented by a large and sparse binary vector. The main differences with the original versions of SMOTE and ADASYN are indicated in Table 1 and are explained in more detail in the following paragraphs. The first difference lies in the generation of the new synthetic instances. As before, a synthetic sample is constructed based on two original minority instances. When both instances have a 0 or 1 in their corresponding column, the synthetic sample will also show a 0 or 1, respectively, at the considered position. When only one of the two minority instances shows a 1, the decision will be made according to a user-specified parameter prior_opt. This parameter can be one of the following three options:

– "FlipCoin", where there is a 50% probability that the synthetic instance will show a 1 at the considered position.

– "Prior", where the value of the synthetic sample is determined by the prior within the minority class in the corresponding column. One generates a random number u in the interval [0,1] and puts a 1 in the corresponding position if u is smaller than the prior.

– "Reverse Prior", where one generates a random number u in the interval [0,1] and puts a 1 in the corresponding position if u is larger than the prior within the minority class for this column/feature.

The second difference lies in the way the nearest neighbours are determined. It would be unwise to consider Euclidean distance in that respect, because it treats a 0-0 match in the same way as a 1-1 match. Since we are working with dichotomous variables (i.e. present or absent), a 1-1 match at a certain position is far more informative than a 0-0 match. For instance, two users visiting the same web page contains more information than two users who didn't visit that specific page. The similarity between two instances is defined by a user-specified parameter sim_measure. We have limited ourselves to two popular choices: "Jaccard" uses the Jaccard similarity measure (Finch 2005) and "Cosine" uses the cosine similarity measure (Huang 2008). In principle, one could apply any of the metrics summarized in Stankova et al (2015).
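The prior_opt options and the Jaccard similarity can be sketched as follows (an illustrative reading of the description above; the function names and the dense-vector representation are our own simplifications):

```python
import numpy as np

def make_synthetic(xi, xnn, minority_priors, prior_opt, rng):
    """Generate one synthetic binary instance from minority instance xi and a
    chosen nearest neighbour xnn. minority_priors[j] is the fraction of
    minority instances with a 1 in column j."""
    syn = np.zeros_like(xi)
    agree = xi == xnn
    syn[agree] = xi[agree]          # 0-0 and 1-1 matches are copied as-is
    for j in np.where(~agree)[0]:   # columns where exactly one parent shows a 1
        u = rng.random()
        if prior_opt == "FlipCoin":
            syn[j] = u < 0.5
        elif prior_opt == "Prior":
            syn[j] = u < minority_priors[j]
        elif prior_opt == "ReversePrior":
            syn[j] = u > minority_priors[j]
    return syn

def jaccard(a, b):
    """Jaccard similarity for binary vectors: 1-1 matches divided by the
    number of positions where at least one vector shows a 1 (0-0 matches
    carry no information)."""
    both = np.logical_and(a, b).sum()
    either = np.logical_or(a, b).sum()
    return both / either if either else 0.0
```

Note how the synthetic instance always inherits the behavioural capital both parents share, while the disputed positions are resolved stochastically according to prior_opt.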

Table 1 Differences between the original SMOTE and ADASYN implementations and the versions SMOTE_beh and ADASYN_beh tailored for behaviour data (parameter explanations are provided in Section 3.1)

                               SMOTE                          SMOTE_beh        ADASYN                         ADASYN_beh
Amount of oversampling         N                              β                β                              β
Synthetic sample generation    random point on line segment   prior_opt        random point on line segment   prior_opt
Similarity measure             Euclidean                      Jaccard/Cosine   Euclidean                      Jaccard/Cosine
Number of nearest neighbours   K                              K, K̂            K                              K, K̂

A detailed pseudo-code implementation of our versions of SMOTE and ADASYN, called SMOTE_beh and ADASYN_beh, is shown in Algorithm 1. Also note that we introduced an extra parameter K̂ to decouple the determination of the number of synthetic instances that need to be generated for a certain minority instance from the number of nearest neighbours it uses to obtain the synthetic instances.

The experimental set-up adopted by Chawla et al (2002) and He et al (2008) considers K = 5 as the number of nearest neighbours used, without a detailed motivation. Furthermore, the latter paper compares SMOTE with an oversampling percentage N of 200% (meaning that the size of the newly created synthetic data instances is twice as large as the size of the original minority training data) with a completely balanced dataset (β = 1 in Algorithm 1) in ADASYN. In our experiments, we will consider a variety of possible K-values and compare SMOTE and ADASYN with identical oversampling rates, controlled by a single parameter β.


Algorithm 1 SMOTE_beh and ADASYN_beh pseudo-code implementation for binary behaviour data

Input: X_min, X_maj, β, prior_opt, sim_measure, K, K̂

a) Determine the total amount of synthetic minority instances that need to be generated (β ∈ [0,1] is a parameter that controls the amount of oversampling; β = 1 means a fully balanced dataset will be created):
    G = (|X_maj| − |X_min|) × β

b) Determine the number of synthetic samples g_i that need to be generated for each minority class instance x_i:
    if SMOTE then
        g_i ← ⌈G / |X_min|⌉
    else if ADASYN then
        calculate the K nearest neighbours (with the sim_measure option) of instance x_i from the set {X_min \ x_i, X_maj} and determine Δ_i, the number of majority class nearest neighbours. Next, calculate r_i = Δ_i / K and normalize these values: r̂_i = r_i / Σ_{j=1}^{|X_min|} r_j
        g_i ← ⌈r̂_i × G⌉
    end if

c) Generate g_i synthetic samples for minority instance x_i:
    calculate the K̂ nearest neighbours (with the sim_measure option) of instance x_i from the set X_min \ x_i. Additionally, remove those nearest neighbours that have a similarity of 0 with x_i. The remaining nearest neighbours form the set K_used. If this set turns out to be empty, set K_used = {x_i}.
    for iter = 1 → g_i do
        randomly choose 1 nearest neighbour from the set K_used
        generate a synthetic minority sample from x_i and the chosen nearest neighbour (according to prior_opt)
    end for

d) Because Σ g_i ≥ G, randomly remove synthetic points until the total number of synthetic samples equals G.
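Step b for ADASYN_beh amounts to the following weighting (a small numeric sketch with invented Δ values, not taken from the paper's experiments):

```python
import numpy as np

def adasyn_counts(delta, K, G):
    """Step b of Algorithm 1 for ADASYN_beh. delta[i] is the number of
    majority class instances among the K nearest neighbours of minority
    instance x_i; G is the total number of synthetics to generate.
    Returns g, the per-instance synthetic sample counts."""
    r = np.asarray(delta, float) / K      # hardness ratio r_i = delta_i / K
    r_hat = r / r.sum()                   # normalized weights, sum to 1
    return np.ceil(r_hat * G).astype(int) # g_i = ceil(r_hat_i * G)

# A minority instance surrounded by 4 majority neighbours out of K = 5
# receives far more synthetics than one surrounded by none.
g = adasyn_counts(delta=[4, 1, 0, 3], K=5, G=20)
print(g)  # [10  3  0  8]
```

Because of the ceilings, the counts sum to 21 here, one more than G = 20, which is exactly why step d randomly trims the surplus.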

3.2 Undersampling

In this section, we will compare the simple random undersampling technique (RUS) with informed undersampling approaches. The first method randomly discards majority class training instances. While this technique can achieve fast training performance of the underlying base learner, an obvious disadvantage is the fact that it might discard potentially useful majority class instances. The informed approaches try to intelligently retain the most informative majority class instances, in the hope of increasing predictive performance while at the same time keeping the fast training speed of the underlying classifier.

The first set of informed undersampling techniques are based on the methods proposed by Zhang and Mani (2003) and Chyi (2003). The K-nearest neighbour 'classifier' is used to determine the importance of each majority class training instance by calculating the total similarity with the K closest minority class training set examples. Regarding similarity computations for binary behaviour data, we refer to the related discussion in Section 3.1. The first technique, called "Closest Knn", retains majority class examples that are closest to the minority class instances. These instances are the most difficult to classify and we would expect them to be most informative. The second method, called "Closest tot sim", is similar to the previously described technique. The difference is that it no longer computes similarities with the K closest minority neighbours; instead, it calculates the total similarity with all minority instances in determining the importance. The main reason we included this technique is computational speed7. The last techniques, called "Farthest Knn" and "Farthest tot sim", are included for comparison with the previously mentioned techniques. Their implementation is identical to the previously described techniques; however, they retain majority class examples that are farthest from the minority class instances. For each of the proposed methods in this paragraph, the amount of undersampling is controlled by a user-specified parameter β_u according to the following formula:

Nr_rem = ⌊(|X_maj| − |X_min|) × β_u⌋    (2)

where Nr_rem represents the amount of majority class instances to be discarded; β_u = 1 means a completely balanced dataset is obtained.
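A dense-matrix sketch of the "Closest Knn" variant with cosine similarity is given below (illustrative only: it assumes every row has at least one active feature, and a real implementation on behaviour data would use sparse matrix operations throughout):

```python
import numpy as np

def closest_knn_undersample(X_maj, X_min, K, beta_u):
    """Score each majority instance by its total cosine similarity with its
    K most similar minority instances, then discard the Nr_rem lowest-scoring
    ones, with Nr_rem given by equation (2)."""
    # Cosine similarity between all majority and minority rows
    sim = (X_maj / np.linalg.norm(X_maj, axis=1, keepdims=True)) \
          @ (X_min / np.linalg.norm(X_min, axis=1, keepdims=True)).T
    top_k = np.sort(sim, axis=1)[:, -K:]          # K largest similarities per row
    score = top_k.sum(axis=1)
    nr_rem = int((len(X_maj) - len(X_min)) * beta_u)   # equation (2)
    keep = np.argsort(score)[nr_rem:]             # drop the least similar ones
    return X_maj[keep]
```

Swapping `[nr_rem:]` for `[:len(X_maj) - nr_rem]` on the reversed ordering would turn this into the "Farthest Knn" variant; summing `sim` over all columns instead of the top K yields "Closest tot sim".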

The second set of informed undersampling techniques aim at targeting the within-class imbalance problem and are based on the approach proposed in Sobhani et al (2015). They postulate that this within-class imbalance problem is more pronounced in the case of undersampling methods: if we were to randomly select majority class instances, then the probability of drawing an instance from small disjuncts within the majority class would be very low. These regions might therefore contain no representatives and remain unlearned. The authors chose to address this issue by clustering the majority class instances in a first step and subsequently selecting an equal number of representatives from each cluster. The reported results show their approach to outperform the CBO algorithm (see Section 1.1). In the following paragraphs, we will dig deeper into the subject of clustering behaviour data. We already refer to Algorithm 2 for an overview of our cluster-based undersampling method (CBU).

As we noted in Section 2.1, behaviour data can be represented as a bipartite graph. The clustering of behaviour data8 aims at finding groups of nodes (communities) that connect more to each other than to other nodes in the network. This subject is currently an active area of research, with a rapid evolution of a vast number of cluster detection techniques (Zha et al 2001; Dhillon 2001; Larremore et al 2014; Beckett 2016). We refer to Porter et al (2009), Fortunato (2010) and Alzahrani and Horadam (2016) for detailed surveys on the problem. It should be noted that the vast majority of publications deal with the subject of clustering unigraphs (networks with only one type of nodes); it is only fairly recently that interest grew in the clustering of bigraphs. In our implementations, we have chosen the popular modularity-based approaches9 for clustering bigraphs, and these fall into two directions. In the first, the modularity function that is used for unigraphs is adapted to be suitable for bigraphs; see for instance the work of Barber (2007). The other direction, which we adopt in our study, projects the bigraph to a unigraph of bottom nodes and performs community detection on the projection using traditional modularity definitions. Note that Guimera et al (2007) observed no difference in the obtained communities using either direction.

7 For each majority class instance we no longer need to sort the similarities with all minority instancesin determining the K largest values

8 This subject is more commonly known as community detection in bipartite graphs
9 Modularity-based approaches attempt to optimize a quality function, known as modularity, for finding community structures in networks; they rely on the use of heuristics due to the complexity of the problem


In this article, we adopt the methodology proposed in Alzahrani and Horadam (2016), which consists of projecting the bigraph to a unigraph of bottom nodes and applying10 the Louvain algorithm (Blondel et al 2008) on the projection. Lancichinetti and Fortunato (2009) performed a comparative study regarding the performance of 12 community detection algorithms and concluded the Louvain method to be the best modularity-based algorithm and the second best among all algorithms. The heuristic is very fast, with an O(m) complexity, where m is the number of edges in the unigraph. We have chosen the Louvain algorithm because of its speed performance and the availability of a toolbox (Jutla et al 2011-2016) that is directly compatible with our implementations. The toolbox provides a generalized implementation of the Louvain algorithm, in the sense that multiple definitions of modularity are possible. The quality function we chose is the popular Newman-Girvan modularity (Newman and Girvan 2004).

With respect to the projection, Alzahrani and Horadam (2016) connect two bottom nodes if they have at least one top node in common; the connection weight between two bottom nodes in the projection is set to the number of shared top nodes. In our implementation, we adapted the connection weights in accordance with Stankova et al (2015) as follows. First, we assign weights to the top nodes, corresponding to the hyperbolic tangent applied to the inverse degree of the top node. Next, the connection weight between two bottom nodes in the projection corresponds with the total weight of the shared top nodes. Top nodes having low degrees therefore obtain a higher contribution in the projection (e.g. two users making a transaction to a local book store are assumed to be more closely connected to each other than two users making a transaction to a large retail store).
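The weighted projection just described can be sketched compactly for a dense binary adjacency matrix (an illustration of the weighting scheme; a production implementation would work on sparse matrices):

```python
import numpy as np

def project_bigraph(X):
    """Project a binary bottom-by-top adjacency matrix X to a weighted
    unigraph of bottom nodes: each top node k gets weight tanh(1/degree_k),
    and the edge weight between bottom nodes i and j is the total weight of
    their shared top nodes."""
    X = np.asarray(X, float)
    top_degree = X.sum(axis=0)
    w_top = np.tanh(np.divide(1.0, top_degree,
                              out=np.zeros_like(top_degree),
                              where=top_degree > 0))
    W = (X * w_top) @ X.T         # W[i, j] = sum of weights of shared top nodes
    np.fill_diagonal(W, 0.0)      # no self-loops in the projection
    return W
```

With this weighting, two bottom nodes sharing a degree-1 top node are connected with weight tanh(1) ≈ 0.76, whereas sharing a very popular top node contributes almost nothing, matching the book-store versus retail-store intuition above.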

In our CBU algorithm (see Algorithm 2), after clustering the bigraph containing exclusively majority class instances, we randomly select an equal amount of majority instances from each community to target the within-class imbalance problem. In the rare situation that the number of obtained clusters exceeds the required amount of majority instances (Nr_retain), we sort the communities according to a user-specified parameter Clust_opt and randomly select 1 instance from the first Nr_retain clusters. The parameter Clust_opt can take the following values:

– C_Smallest, where we sort clusters in ascending order of their size
– C_Largest, where we sort clusters in descending order of their size

Note that we randomly select majority class instances from each cluster Yen and Lee(2009) found a random selection strategy after clustering to be superior to informedapproaches based on distance

10 Note that they also made use of the flow-based algorithm Infomap (Rosvall and Bergstrom 2008) thatshows excellent results on the LFR-benchmark

Imbalanced classification in sparse and large behaviour datasets 15

Algorithm 2 CBU pseudo-code implementation for behaviour data

Input: X_min, X_maj, β_u, Clust_opt

a) Cluster the majority class instances X_maj:
   – Assign weights to each top node, corresponding with the hyperbolic tangent applied to the inverse of the node's degree.
   – Project the bigraph X_maj to a weighted unigraph consisting of bottom-node majority class instances. The weight w_ij between majority class instances i and j corresponds with the total weight of the shared top nodes.
   – Apply the Louvain algorithm (Blondel et al 2008) on the projected unigraph to partition the majority class instances into clusters.

b) Select majority class instances:
   Nr_rem ← ⌊(|X_maj| − |X_min|) × β_u⌋ (see Equation (2))
   Nr_retain ← |X_maj| − Nr_rem
   if Nr_retain < |Clust| then
      – Sort clusters according to Clust_opt
      – Randomly select 1 instance from the first Nr_retain clusters
   else
      – Randomly select ⌈Nr_retain / |Clust|⌉ majority class instances from each cluster
      – Randomly discard instances from the previous step until its size corresponds with Nr_retain
   end if

c) Return the new training set consisting of X_min and the selected majority class instances from step b.
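Step b) of Algorithm 2 can be sketched as follows (a minimal Python sketch with our own names; the cluster labels are assumed to come from a prior Louvain run, e.g. via an external community-detection library):

```python
import math
import random
from collections import defaultdict

def cbu_select(maj_ids, clusters, n_min, beta_u, largest_first=False, rng=random):
    """Selection step of CBU: pick majority instances evenly across clusters.

    maj_ids  : list of majority-class instance ids
    clusters : dict id -> cluster label (e.g. from a Louvain partition)
    n_min    : number of minority training instances
    beta_u   : undersampling level in [0, 1]
    """
    n_rem = math.floor((len(maj_ids) - n_min) * beta_u)
    n_retain = len(maj_ids) - n_rem
    groups = defaultdict(list)
    for i in maj_ids:
        groups[clusters[i]].append(i)
    if n_retain < len(groups):
        # More clusters than slots: sort clusters by size (Clust_opt)
        # and take one random instance from the first n_retain clusters.
        ordered = sorted(groups.values(), key=len, reverse=largest_first)
        return [rng.choice(g) for g in ordered[:n_retain]]
    per_cluster = math.ceil(n_retain / len(groups))
    picked = [i for g in groups.values()
              for i in rng.sample(g, min(per_cluster, len(g)))]
    rng.shuffle(picked)
    return picked[:n_retain]   # discard the surplus at random
```

Note that `min(per_cluster, len(g))` guards against clusters smaller than the per-cluster quota, a corner case Algorithm 2 leaves implicit.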

3.3 Boosting, cost-sensitive learning and EasyEnsemble

3.3.1 AdaBoost

The AdaBoost (Schapire and Singer 1999; Schapire 1999) algorithm has been designed from the perspective of improving the performance of a weak learner so that it achieves accuracies that are comparable with a strong learning algorithm. Fundamental to the idea of boosting is to maintain a weight distribution over the training set. In each boosting iteration the weights of wrongly classified instances are increased, so that the underlying weak learner puts more emphasis on these hard examples. In our implementation we will consider using a SVM as base learner. SVMs are generally regarded as strong learners. The studies of Wickramaratna et al (2001) and García and Lozano (2007) note that using a strong learner usually results in performance degradation during the boosting process. The RBF-kernel SVM classifier (Wickramaratna et al 2001; Li et al 2008) is used as the underlying classifier to prove their point. In our study we will employ a linear SVM, which can be considered a weaker version compared to the RBF-kernel. Furthermore, the regularization parameter C can be viewed as a "weakness" indicator.11 Lowering the C-value results in weaker learners, as can be seen from the goal function of the SVM optimization problem (see Equation (1)). We will come back to this point in Section 4.5.

11 The distinction between weak/strong learners is loosely 'defined' in Schapire (1999). A weak learner corresponds with a hypothesis that performs just slightly better than random guessing. A strong learner is able to generate a hypothesis with an arbitrarily low error rate, given enough data. We adopt these definitions but consider the distinction between weak/strong based on training set error. In a SVM context it is quite typical that error levels on training data drop with increasing C-values (Suykens et al 2002). A learner that is 'too strong' means that even though its performance on training data is very high, it fails to generalize well and the test set error increases due to overfitting.

16 Jellis Vanhoeyveld David Martens

The boosting algorithm with underlying SVM is presented in Algorithm 3. This algorithm closely follows the original boosting implementation presented in Schapire and Singer (1999) and requires each learner to output confidence-rated predictions in the interval [−1, 1]. Since the SVM outputs real-valued scores, we apply the procedure of Platt (1999) to transform these scores into probability estimates (which can easily be translated to form confidence-rated predictions). A logistic regression (LR) model (Ng and Jordan 2002) is trained using the SVM scores as input and the corresponding labels as output. Note that the same data that are used to construct the linear SVM can be used to estimate the LR model; see Platt (1999) for a motivation.
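The score-to-probability step can be sketched as follows (a bare-bones NumPy stand-in for Platt's procedure; the original uses a Newton-type optimizer with smoothed target values, whereas this sketch uses plain gradient descent, and all names are ours):

```python
import numpy as np

def fit_platt(scores, labels, lr=0.1, iters=2000):
    """Fit P(y=1 | s) = sigmoid(a*s + b) on SVM decision values.

    labels are in {-1, +1}; minimizes the logistic negative log-likelihood.
    """
    t = (np.asarray(labels) + 1) / 2.0            # map labels to {0, 1}
    s = np.asarray(scores, dtype=float)
    a, b = 1.0, 0.0
    for _ in range(iters):
        p = 1.0 / (1.0 + np.exp(-(a * s + b)))
        a -= lr * np.mean((p - t) * s)            # dNLL/da
        b -= lr * np.mean(p - t)                  # dNLL/db
    return a, b

def confidence(score, a, b):
    """Map the calibrated probability to a confidence in [-1, 1]."""
    p = 1.0 / (1.0 + np.exp(-(a * score + b)))
    return 2.0 * p - 1.0
```

The `confidence` mapping 2p − 1 is one simple way to turn the probability estimate into the confidence-rated prediction Algorithm 3 expects.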

The boosting algorithm requires the weak learner to be trained using a distribution D_t. One could sample from this distribution to generate (unweighted) examples and train a SVM next. We have chosen to include weights in the SVM goal function formulation (this requires an extension of the LIBLINEAR package):

  min_{w,b,ξ_i}  wᵀw/2 + C ∑_{i=1}^{m} weight_i ξ_i        (3)

The weights weight_i are set according to the weight distribution D_t(i) in each round of boosting. Note that the C-value that is used to train the SVM model will be divided by mean(weight_i). This normalization allows for a fair comparison between the weighted and unweighted SVM versions (e.g. in the first round of boosting, weight_i = 1/m for all instances; by multiplying the C-value with m, this corresponds to solving the unweighted problem with the same C-value).

We introduce an additional parameter µ, called the weight percentage, with values in [0, 100], in the boosting algorithm of Algorithm 3. This parameter controls the amount of training data that is used to construct the SVM model and subsequent LR model. We sort the original training data in descending order according to the distribution D_t. Next, we form a new training set of minimal cardinality by including points from the original sorted training data until the total weight is higher than the weight percentage µ/100. This way, the newly formed training data will contain only the part of the original training data that has the most weight. This partial set will then be used to construct a weighted SVM model (according to Equation (3), with updated distribution for this set) and a subsequent LR model. The idea of using a partial dataset to construct the base learner not only reduces training times but also weakens the learner (García and Lozano 2007).
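The weight-percentage selection described above can be sketched in a few lines (a minimal sketch; `weights` plays the role of D_t, and the function name is our own):

```python
def top_weight_subset(indices, weights, mu):
    """Return the smallest prefix of instances, sorted by descending
    weight D_t(i), whose total weight exceeds mu/100."""
    order = sorted(indices, key=lambda i: weights[i], reverse=True)
    subset, total = [], 0.0
    for i in order:
        subset.append(i)
        total += weights[i]
        if total > mu / 100.0:
            break
    return subset
```

For µ = 100 the threshold is never strictly exceeded (the weights sum to 1), so the full training set is used, matching the "use all instances" setting in the experiments.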

In Algorithm 3 we have an explicit check to verify if r_AB = 1. In this case the SVM model outputs scores that allow for a perfect classification on the training set. The subsequent LR model will find a threshold and output a value of −1 if the SVM score is lower than this threshold (a value of +1 if the score is higher than this threshold). In this situation the training data would be perfectly classified. In our implementation we attempt to avoid this kind of behaviour because it can lead to overfitting, meaning it might pinpoint the wrong threshold and make too drastic decisions (we lose the meaning of confidence). The check if r_AB ≤ 0 verifies if the currently boosted model is performing worse than random (this model would have an r_AB-value of 0). Obviously, if the model performs worse than random, we quit the boosting process. During the first round of boosting we perform similar checks that are not explicitly

indicated in Algorithm 3. In the case where r_AB = 1 we output the SVM scores instead of the LR binary values. When r_AB ≤ 0 we quit the boosting process and output the LR scores.

Algorithm 3 AdaBoost with a SVM-LR combination as a base learner

Input: (X, Y) = (x_1, y_1), …, (x_m, y_m); C; T; µ
Initialize distribution D_1(i) = 1/m
for t = 1 to T do
   – train weak learner using distribution D_t. The weak learner consists of a weighted linear SVM and LR model, trained with weight percentage µ of D_t:
        h_t ← Train_WeakLearner(X, Y, D_t, C, µ)
   – compute the weighted confidence r_AB on the training data:
        r_AB ← ∑_{i=1}^{m} D_t(i) y_i h_t(x_i)
     If (r_AB = 1 or r_AB ≤ 0) then α_t ← 0 and stop the boosting process
   – choose α_t ∈ ℝ:
        α_t ← (1/2) log( (1 + r_AB) / (1 − r_AB) )
   – update distribution:
        D_{t+1}(i) ← D_t(i) exp(−α_t y_i h_t(x_i)) / Z_t
     where Z_t is a normalization factor (chosen so that D_{t+1} will be a distribution)
end for
Output the final hypothesis (the output score is the term contained in the sign function):
   H(x) = sign( ∑_{t=1}^{T} α̃_t h_t(x) )  with  α̃_t = α_t / ∑_{i=1}^{T} α_i
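The boosting loop of Algorithm 3 can be sketched in Python with the weak learner abstracted away (the paper's weak learner is the weighted SVM + LR pair; the decision stump in the usage example below is purely illustrative):

```python
import math

def adaboost(X, y, train_weak, T):
    """Confidence-rated AdaBoost (skeleton of Algorithm 3).

    train_weak(X, y, D) must return a hypothesis h with h(x) in [-1, 1].
    """
    m = len(X)
    D = [1.0 / m] * m
    hyps, alphas = [], []
    for _ in range(T):
        h = train_weak(X, y, D)
        r = sum(D[i] * y[i] * h(X[i]) for i in range(m))      # r_AB
        if r >= 1.0 or r <= 0.0:
            break                     # perfect fit, or worse than random
        a = 0.5 * math.log((1.0 + r) / (1.0 - r))
        hyps.append(h)
        alphas.append(a)
        D = [D[i] * math.exp(-a * y[i] * h(X[i])) for i in range(m)]
        Z = sum(D)                    # normalize so D stays a distribution
        D = [d / Z for d in D]
    total = sum(alphas) or 1.0
    return lambda x: sum((a / total) * h(x) for a, h in zip(alphas, hyps))

# Illustrative weak learner: a 1-D threshold stump with confidence 0.5.
X = [0.0, 1.0, 2.0, 3.0]
y = [-1, -1, 1, 1]
def stump(Xs, ys, D):
    best = min([0.5, 1.5, 2.5],
               key=lambda thr: sum(D[i] for i in range(len(Xs))
                                   if (1 if Xs[i] > thr else -1) != ys[i]))
    return lambda v, thr=best: 0.5 if v > thr else -0.5

H = adaboost(X, y, stump, T=5)
```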

3.3.2 AdaCost

The AdaCost algorithm (Fan et al 1999) is a variant of cost-sensitive learning where misclassification costs are introduced in the weight-update formula of AdaBoost. The cost-sensitive update rule will increase the weights of costly misclassified instances more aggressively and decrease the weights of costly correct classifications more conservatively. Each instance is given a misclassification cost c_i, where we chose to put c_i = 1 for positive (minority) instances and c_i = 1/R for negative (majority) instances. R is a user-defined value that allows one to put more emphasis on the minority class. The implementation of AdaCost is similar to Algorithm 3, yet there are a few differences (Fan et al 1999): firstly, the initial distribution is chosen as D_1(i) = c_i / ∑_{j=1}^{m} c_j; secondly, the weight-update rule is given by D_{t+1}(i) = D_t(i) exp(−α_t y_i h_t(x_i) β(i)) / Z_t, where β(i) = −0.5 sign(y_i h_t(x_i)) c_i + 0.5 is a cost-adjustment function; finally, the choice of α_t is given by α_t = (1/2) log( (1 + r_AC) / (1 − r_AC) ), where r_AC = ∑_{i=1}^{m} D_t(i) y_i h_t(x_i) β(i). Note that the checks to stop the boosting process prematurely (see the second bullet in Algorithm 3) are still based on the r-value obtained from AdaBoost (r_AB). This is because β ∈ [0, 1].
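One AdaCost round can be sketched as follows (a minimal sketch with our own names; `margins[i]` stands for y_i h_t(x_i)):

```python
import math

def adacost_round(D, margins, costs):
    """One AdaCost weight update.

    D        : current distribution D_t
    margins  : margins[i] = y_i * h_t(x_i), in [-1, 1]
    costs    : costs[i] = c_i (1 for minority, 1/R for majority)
    """
    sign = lambda v: (v > 0) - (v < 0)
    # Cost-adjustment function: beta = -0.5 * sign(y*h) * c + 0.5
    beta = [-0.5 * sign(margins[i]) * costs[i] + 0.5 for i in range(len(D))]
    r_ac = sum(D[i] * margins[i] * beta[i] for i in range(len(D)))
    alpha = 0.5 * math.log((1.0 + r_ac) / (1.0 - r_ac))
    new_D = [D[i] * math.exp(-alpha * margins[i] * beta[i]) for i in range(len(D))]
    Z = sum(new_D)                       # normalization factor
    return alpha, [d / Z for d in new_D]
```

Because β(i) is larger for misclassified instances (negative margin) than for correctly classified ones, costly mistakes gain weight faster than in plain AdaBoost.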

In the first boosting round of AdaCost, the weighted SVM formulation (3) is equivalent to solving a SVM formulation with the following goal function:

  min_{w,b,ξ_i}  wᵀw/2 + C₊ ∑_{i|y_i=+1}^{m₊} ξ_i + C₋ ∑_{i|y_i=−1}^{m₋} ξ_i        (4)

where C₊/C₋ = R. This can be seen as a cost-sensitive version of a SVM, and this idea has initially been proposed by Veropoulos et al (1999).

3.3.3 EasyEnsemble

One of the disadvantages of the random undersampling method is the fact that we are discarding potentially valuable information. EasyEnsemble (Liu et al 2009) is a method that combines several balanced subsets (S in total) containing randomly selected majority class instances together with all minority examples. Each subset contains the same amount of instances from both classes and is fed to the boosting algorithm presented in Algorithm 3. Afterwards, the weak learners h_{s,t} of each subset s are simply combined to form the final ensemble:

  H(x) = sign( ∑_{s=1}^{S} ∑_{t=1}^{T} α̃_{s,t} h_{s,t}(x) )  with s = 1, …, S and t = 1, …, T        (5)

It is clear that this technique benefits from a combination of bagging and boosting (Liu et al 2009). Note that we apply the same boosting algorithm as previously described to each of the balanced subsets. However, when r_AB = 1 in the first round of boosting, we quit the boosting process, put α_1 = 1 and continue to use the trained LR model in the final ensemble. It was previously noted that this can cause overfitting, in the sense that it can pinpoint a wrong threshold. However, the LR model still contains information, and the combination thereof with the models obtained from the other subsets remains valuable (as we have noted from initial experiments by comparing the situation where we include or reject those subsets).

There are a few subtle though important differences with respect to the experiments performed by Liu et al (2009). First of all, the authors use a CART algorithm (Breiman et al 1984) as base learner; we employ a linear SVM with subsequent LR as a weak learner. To our knowledge this combination has not been proposed elsewhere, yet proves to be very efficient in this setting. Secondly, their methodology employs a discrete version of AdaBoost where each weak learner outputs binary values in {−1, 1}. As already stated, we make use of an improved version of AdaBoost that relies on confidence-rated predictions (Schapire and Singer 1999). Finally, the authors reported performances with fixed levels of S = 4 and T = 10; we will investigate the effect of varying S and T levels.
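The subset-and-combine scheme of Equation (5) can be sketched as follows (a skeleton under our own naming; `train_boosted` stands for the boosting procedure of Algorithm 3 and is stubbed out in the usage example):

```python
import random

def easy_ensemble(X_min, X_maj, train_boosted, S, rng=random):
    """EasyEnsemble skeleton: S balanced random subsets, each boosted,
    with all weak learners pooled into one ensemble.

    train_boosted(pos, neg) -> list of (alpha, h) pairs.
    """
    models = []
    for _ in range(S):
        subset = rng.sample(X_maj, len(X_min))   # balanced: |subset| == |X_min|
        models.extend(train_boosted(X_min, subset))
    def score(x):                                # sign(score) gives the label
        return sum(a * h(x) for a, h in models)
    return score

# Stub boosted learner for illustration only: one constant-confidence rule.
rng = random.Random(0)
X_min = [1, 2]
X_maj = list(range(10, 30))
def train_boosted(pos, neg):
    return [(1.0, lambda x: 1.0 if x < 5 else -1.0)]
score = easy_ensemble(X_min, X_maj, train_boosted, S=3, rng=rng)
```

Note that each subset is only twice the minority class size, which is what makes the method cheap and trivially parallelizable across the S subsets.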

4 Results and discussion

4.1 Datasets

Stankova et al (2015) provide the first large collection of benchmark behaviour datasets for classification. In our experiments we make use of these data sources and extend this repository with two additional datasets. Each of these datasets shows a bipartite structure with a clear target variable to predict. We refer to this study and the next paragraph for a short description of the available data resources. In this paragraph we indicate why we have chosen to include or reject certain data sources from the aforementioned study. The available datasets can be divided into small, medium and large datasets, based on the number of instances and the number of features present. The Norwegian companies and Reality Mining datasets comprise the small datasets: since they contain only a few hundreds of instances or features, they are regarded as impractical for our purposes. The MovieLens, Yahoo, TaFeng, Book-Crossing and LibimSeTi datasets belong to the medium-sized datasets, each containing a few thousands up to a few hundreds of thousands of instances and features. All of these datasets are included in our study. The large datasets, containing hundreds of thousands up to millions of instances and features, are the Flickr and Kdd databases.12 Other large proprietary data sources, not included in Stankova et al (2015), are the corporate residence fraud (CRF) and banking (Bank) datasets, which arise from real-life application domains with intrinsic imbalance.

To summarize, we have gathered datasets containing such fine-grained behaviour data from a wide variety of application domains. The MovieLens datasets, for which we are predicting the gender13 (Mov G) or the genre thriller14 (Mov Th), provide data on which films each user has rated. The Yahoo movies15 dataset has a similar structure, where the age of each user, Yahoo A (above or below average), or the gender, Yahoo G, is being predicted. The TaFeng dataset16 contains data on shopping behaviour, where age (below or above average) is being predicted based upon which products are purchased. In the Book-Crossing (Book) dataset (Ziegler et al 2005) users rate books and the age of the user (above or below average) is being predicted. LibimSeTi (LST) contains data from a dating site (Brozovsky and Petricek 2007) where users rate each other's profiles and gender is being predicted. In the advertisement (Adver) dataset (Lichman 2013) we try to predict if a URL is an advertisement, based on a large variety of binary features of the URL. Note that this dataset does not arise from the behaviour of entities, yet it still has a high-dimensional and sparse representation. The Flickr dataset (Cha et al 2009) contains pictures being marked by users as favorite, and we predict the number of comments on each picture (below

12 Flickr and KDD will be excluded in the comparative study of Section 4.6. This is because some methods are too computationally intensive - especially in combination with the large number of possible parameter combinations - to be applied on these very large data sources. Furthermore, our statistical evidence is already sufficiently strong to conclude significance without these datasets. Having said this, these data sources will be included in the analysis of Section 5.

13 MovieLens 1M dataset from http://grouplens.org/datasets/movielens
14 MovieLens 10M dataset from http://grouplens.org/datasets/movielens
15 https://webscope.sandbox.yahoo.com
16 http://www.bigdatalab.ac.cn/benchmark/bm/dd?data=Ta-Feng

or above average). In the Kdd cup data, the performance of a student on a test is being predicted based on artificially created binary features (Yu et al 2010). The corporate residence fraud dataset (CRF) (Junque de Fortuny et al 2014b) contains data on foreign companies making transactions with specific Belgian companies, where we try to predict whether the foreign company commits residence fraud (a type of fiscal fraud). Finally, the banking dataset (Bank) (Martens et al 2016) contains detailed behaviour on consumers making transactions with merchants or other persons, to predict interest in a pension fund product. Some characteristics of these datasets can be found in Table 2. The features column only shows the number of active features.17

Table 2 Behaviour data characteristics. The final column shows the imbalance ratio p, defined as the ratio of the number of minority class instances to the amount of majority class instances in the training set, expressed as a percentage. See Section 4.2 for details regarding p.

Name      |X_maj|    |X_min|    Features    p = 100 × |X_min|_train / |X_maj|_train
Mov G        4331       1709        3706    1 & 25
Mov Th      10546        131       69878    1.24 (p = [])
Yahoo A      6030       1612       11915    1 & 25
Yahoo G      5436       2206       11915    1 & 25
TaFeng      17330      14310       23719    1 & 25
Book        42900      18858      282973    1 & 25
LST         59702      60145      166353    1
Adver        2792        457        1555    16.38 (p = []) & 1
CRF        869071         62      108753    0.0072 (p = [])
Bank      1193619      11107     3139570    0.93 (p = [])
Flickr    8166814    3028330      497472    0.1
Kdd       7171885    1235867    19306083    0.5

4.2 Methodology

Regarding the experiments performed in the upcoming sections, we applied a tenfold cross-validation procedure. Each of the folds contains 80% training data, 10% validation data and 10% test data. Note that these percentages are valid for both the majority class and the minority class (stratified sampling). As can be seen from Table 2, some datasets are balanced in nature. We created artificial imbalance for these datasets by removing minority class instances from the initial training set according to a user-defined parameter p. We ensured that the amount of minority training instances corresponds to p percent of the majority class training size: |X_min|_train = (p/100) × |X_maj|_train. As an example, say that we are using the Book dataset with p = 25. In that case we know that the majority class contains 34320 training instances (80% of |X_maj| = 42900). The minority training data would contain 8580 instances (25% of 34320). When the dataset is already imbalanced, we define p = [], which means that no downsampling

17 Active features represent features that are present for at least one instance in the dataset. A non-active feature corresponds with a column of zeros in the matrix representation and would not contribute to the model.

of the minority class training data is performed. Note that the validation and test data are left untouched.18

The methods detailed in Section 3 are applied on the training data. For both the under- and oversampling approaches, a linear SVM is trained on the newly created balanced training data, with regularization parameter C having values

  C = [10⁻⁷, 10⁻⁵, 10⁻³, 10⁻¹, 10⁰]

The validation data are used for parameter tuning purposes. The test data allow us to obtain the generalization performance. The results reported show the average over ten folds. In the remaining paragraphs of this section we will describe the various parameter settings used in our experiments and give a brief overview of the parameters occurring in each method.

Considering the oversampling techniques, the parameter settings are as follows:

  β = [0, 1/3, 2/3, 1]
  prior_opt = {FlipCoin, Reverse Priors}
  sim_measure = {Cosine, Jaccard}
  K = [10⁰, 10¹, 10², |X_min|_train]

We didn't include the "Prior" option due to initial experiments showing a lower performance in comparison with the other options. This can be explained by the low priors occurring in each column, resulting in synthetic samples that mainly show zeros in 0-1 match situations. The oversampling with replacement method (OSR) only uses the β parameter, SMOTE uses all but the K parameter, and ADASYN uses all of the parameters listed above.
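OSR itself can be sketched in a few lines (a minimal sketch; the interpretation of β as the fraction of the class gap that is closed mirrors the undersampling Equation (2) and is our assumption, with β = 1 yielding a fully balanced training set):

```python
import math
import random

def osr(X_min, X_maj, beta, rng=random):
    """Oversampling with replacement (OSR): duplicate random minority
    instances until the class-size gap is reduced by a fraction beta."""
    n_add = math.ceil((len(X_maj) - len(X_min)) * beta)
    extra = [rng.choice(X_min) for _ in range(n_add)]   # sample with replacement
    return X_min + extra
```

Since only exact duplicates are produced, OSR cannot create the unseen feature combinations that SMOTE or ADASYN aim for; in this sparse setting that difference turns out to matter little (see Section 4.3).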

With respect to the undersampling techniques, the following parameter settings are used:

  βu = [0, 1/4, 1/2, 3/4, 1]
  sim_measure = {Cosine, Jaccard}
  K = [10⁰, 10¹, 10², |X_min|_train]
  Clust_opt = {C_Smallest, C_Largest}

The random undersampling (RUS) technique only uses the βu parameter. The second set of methods, "Closest tot sim" and "Farthest tot sim", uses βu and sim_measure. The third set of techniques, "Closest Knn" and "Farthest Knn", makes use of all the parameters listed above except for Clust_opt. The final approach, CBU, employs βu and Clust_opt.

The boosting variants (AdaBoost, AdaCost and EasyEnsemble) presented in Section 3.3 make use of the following settings:

  T = 30
  µ = [100, 75]
  C = [10⁻⁷, 10⁻⁵, 10⁻³, 10⁻¹]

18 This means that if we start from a balanced set, only the training data will show artificial imbalance according to the imbalance ratio p; the validation and test data would remain balanced. Since AUC (and some other metrics) is independent of class skew, it would be unwise to make these sets imbalanced as well, because that would lead to discarding minority class instances that are relevant for performance assessment.

  R = [2, 8, R_L], where R_L = |X_maj|_train / |X_min|_train
  S = [5, 10, 15]

The AdaBoost algorithm includes the T, µ and C parameters. AdaCost additionally uses cost ratios R. We have chosen a range of values because misclassification costs are unknown for many business applications (He and Garcia 2009; Fan et al 1999; Sun et al 2007). The final value R_L seems to be a popular choice (Akbani et al 2004; Luts et al 2010) because the total weight on the majority class balances with the total weight on the minority class. The final method, EasyEnsemble, uses S subsets in addition to the parameters previously mentioned for AdaBoost. Note that we consider the boosting iteration round t ∈ [1, T] as a tunable parameter.19

4.3 Oversampling

The oversampling techniques presented in Section 3.1 are applied to each of the datasets from Table 2 with varying imbalance levels (by varying the p-parameter). The experiments we conducted use the following methodology: for each dataset, we apply the oversampling techniques with all possible parameter combinations on the training data to create newly balanced datasets, after which linear SVMs are trained. The optimal parameter combination with respect to each imbalance ratio β is selected based on validation set AUC-performance. With these parameters, the AUC on the test data is obtained. Results show the average over ten folds.

The results on four arbitrarily selected datasets are shown in Table 3. Full results on each of the data sources can be found in Appendix A, Table 11. From these tables we can conclude that performance generally increases with growing β-values. It is more correct to say that performance keeps improving with growing β-levels until an optimal point β* is reached. Increasing the balance level after this optimal value will cause only small fluctuations with respect to the optimal performance. Traditional studies dealing with dense, low-dimensional data note that the OSR method can suffer from overfitting, as already mentioned in Section 3.1. It is interesting to see that in this sparse, high-dimensional setting this effect doesn't seem to occur.20 Furthermore, the computationally expensive synthetic sampling approaches do not seem to improve performance over the simpler OSR method. This can be explained by the fact that many instances have a very limited amount of non-zero elements (Junque de Fortuny et al 2014a; Stankova et al 2015), causing the synthetic sampling procedures to be limited in the amount of unique new samples they can produce. Note also that synthetic sampling procedures have historically been designed to overcome the overfitting behaviour of OSR, which no longer seems to apply here.

19 We sometimes make use of T as a symbol to indicate the boosting iteration number instead of t.
20 The SVM regularization parameter C controls for overfitting. Additionally, the influence of majority class noise/outliers (see Section 3.2) on the learned hyperplane decreases as we oversample the minority class. This hyperplane is more sensitive towards minority instances, and a change in its direction/orientation towards these majority class noise/outliers will be more heavily penalized.

Table 3 Oversampling experiments. Results show the average tenfold AUC test set performance with respect to increasing β-values (β = [β1, β2, β3, β4] = [0, 1/3, 2/3, 1]). Additionally, standard deviations are shown between brackets. Optimal performances for each method are highlighted in boldface. See Section 4.2 for parameter settings.

Mov Th (p = [])
              β1             β2             β3             β4
OSR      79.77 (5.33)   85.3 (4.66)    83.16 (4.5)    84.59 (5.69)
SMOTE    79.77 (5.33)   84.18 (6.51)   85.58 (5.97)   84.33 (5.75)
ADASYN   79.77 (5.33)   84.11 (6.77)   85.86 (6.06)   85.36 (5.13)

Yahoo G (p = 25)
              β1             β2             β3             β4
OSR      78.82 (1.39)   78.78 (2.02)   78.74 (1.86)   78.35 (1.47)
SMOTE    78.82 (1.39)   79.23 (1.57)   79.1 (1.2)     79.03 (1.89)
ADASYN   78.82 (1.39)   79.12 (1.43)   79.23 (1.37)   79.51 (2.01)

TaFeng (p = 25)
              β1             β2             β3             β4
OSR      66.94 (1.34)   68.84 (1.21)   66.99 (1.42)   67.7 (1.41)
SMOTE    66.94 (1.34)   68.47 (1.5)    67.07 (1.15)   66.65 (0.81)
ADASYN   66.94 (1.34)   68.62 (1.38)   67.85 (1.6)    66.91 (1.39)

Book (p = 25)
              β1             β2             β3             β4
OSR      60.08 (0.71)   61.05 (0.94)   61.82 (1.12)   64.62 (0.57)
SMOTE    60.08 (0.71)   62.6 (0.73)    60.95 (0.68)   63 (0.8)
ADASYN   60.08 (0.71)   62.33 (0.73)   60.77 (0.85)   63.04 (0.58)

4.4 Undersampling

Regarding the undersampling techniques, we employ a similar experimental set-up as in the previous section: firstly, train SVMs on the undersampled training data with all possible parameter combinations; secondly, choose a suitable parameter combination based on validation set AUC-performance; and finally, obtain the AUC-performance on the test set. The results are averaged across ten folds.

Before going into detail on the experimental findings, we give a short note on the effect of noise/outliers on SVM performance. Consider the following imaginary example: say that a majority class female is rating a lot of action films and thrillers. We can consider this as an outlier, since most of the females are rating romantic or drama films. This specific female has far more in common with the minority class males who also rate action films and thrillers. Outliers are therefore instances showing odd behaviour; noise refers to wrongfully labelled instances. The effect of noise/outliers on SVM performance can be severe. Indeed, many of the instances in the dataset have a support value (dual variable α_i) of 0. Instances contributing to the model (a non-zero support value) are examples close to the borderline (Suykens et al 2002) and noise/outliers (as can be derived from the KKT dual-complementarity conditions21).

21 If α_i = 0, then y_i(wᵀx_i + b) ≥ 1. For noise/outliers the term y_i(wᵀx_i + b) is negative, hence α_i ≠ 0.

With respect to noise/outliers, the support value can reach a maximal value of C, the chosen regularization parameter in the SVM formulation. It should be clear that lowering the C-value leads to models that are less sensitive to noise/outliers.

The experimental outcomes of the various undersampling processes are shown in Table 4 for the same four previously selected datasets. We refer to Appendix B, Table 12 for results on the entire data repository. When removing only a limited amount of majority class instances (βu = βu2 = 1/4), we observed that in 12 out of 16 datasets the "Farthest Knn" method outperforms the "Closest Knn" technique. The four remaining datasets show comparable performances.22 This finding shows that the "Farthest" method is very suitable in removing majority class noise/outliers and empirically shows their performance-degrading effect. With higher undersampling rates (βu = 1), the "Closest Knn" method achieves higher performances than the "Farthest Knn" technique in general, though the results are less clear. We observed that only 9 out of 16 datasets indicate "Closest" as a clear winner, with 3 tie situations and 4 situations where "Farthest" shows the best results. Intuitively, we would expect the "Closest" method to perform better, since we are keeping the most informative instances (examples close to the minority class). However, this technique will also emphasize noise/outliers, and this is the main reason why the results are less significant. One can also see this from the fact that the RUS method indicates far better results in comparison to the aforementioned techniques when βu = 1, an observation that is valid for all datasets. The RUS method is efficient in realizing noise/outlier reduction effects, since each SVM is trained with only a fraction of these types of instances. It is remarkable to see that in 9 out of 16 datasets the RUS method with βu = 1 outperforms the baseline model (βu = 0). In 4 datasets we observed equal performance, and 3 losses with respect to the baseline. RUS shows two attractive features: majority class noise/outlier removal and its ability to put more emphasis on the minority class instances by undersampling the majority class. A general downside of the approach is the information loss, and this effect causes the 3 losses to the baseline. Usually, though, the attractive features of RUS dominate the latter effect, and this indicates that it is definitely not necessary to include all majority class training instances in constructing the predictive model. Apparently there is a high level of redundancy, which can be exploited to construct efficient hypotheses.

In the next paragraph we compare the cluster-based undersampling (CBU) approach to the random undersampling (RUS) technique. Before proceeding, we should note that Table 12 does not show results on the LST, CRF and Bank datasets, due to the fact that the projected unigraph was too large to fit in memory (larger than 16 GB). A limited amount of features (top nodes) does seem to be active for a relatively large number of instances (bottom nodes). All these instances will be connected in the unigraph projection, resulting in relatively high degrees for the bottom nodes. A possible solution to circumvent this problem might be to cut the weakest edges in the projected unigraph.

If we apply a limited amount of undersampling (βu = βu2 = 1/4), the CBU technique outperforms RUS in 8 out of 13 datasets (4 losses and 1 tie with respect to RUS). On the highly imbalanced datasets (p = 1 or p = []), the CBU method wins

22 A tie occurs in the situation where the absolute difference in AUC is smaller than or equal to 0.5.

in 8 out of 8 cases. We can therefore conclude CBU to outperform RUS with low undersampling rates. This indeed shows that handling the within-class imbalance can be beneficial in this environment. If we make a comparison at the highest undersampling rate, βu = βu5 = 1, the results are completely opposite: CBU wins in only 1 case, with 9 losses with respect to RUS and 3 ties. On the highly imbalanced datasets, CBU outperforms RUS for 1 dataset (with 4 losses and 3 ties). It is clear that RUS is the preferred technique in case of high undersampling rates. The reason is that CBU emphasizes the small disjuncts and basically fails to adequately pick up on the common, larger communities. Furthermore, these small disjuncts might actually correspond with majority class noise/outliers, and CBU is focusing on these types of instances. On the overall level, where we consider all undersampling rates (except βu = 0), CBU wins in 5 out of 13 datasets, RUS wins in 4 cases, and there are 4 tie situations (yet RUS did outperform CBU in all 4 cases). Both methods seem to be competitive with one another across a wide range of imbalance levels. If we pay particular attention to the highly imbalanced sets, CBU wins in 5 cases and there are 3 tie situations on the remaining sets (yet RUS did outperform CBU on all 3 cases). CBU seems to be the method of choice (with respect to RUS) when the datasets show high imbalance levels.

Table 4 Undersampling experiments. Results show the average tenfold AUC test set performance with respect to increasing βu-values (βu = [βu1, βu2, βu3, βu4, βu5] = [0, 1/4, 1/2, 3/4, 1]). Optimal performances (boldface) are highlighted for each imbalance ratio. Additionally, standard deviations are shown between brackets. CL K represents the "Closest Knn" technique; CL T represents the "Closest tot sim" technique (similar for Far K and Far T; see Section 3.2). Parameter settings are given in Section 4.2.

Mov Th (p = [])

          βu1          βu2          βu3          βu4          βu5
RUS       79.77 (5.3)  80.32 (5.8)  81.57 (5.5)  81.86 (6.6)  81.26 (6.2)
Cl K      79.77 (5.3)  79.25 (4.5)  78.07 (5.0)  76.25 (6.5)  62.46 (8.5)
Cl T      79.77 (5.3)  78.4 (4.4)   72.41 (3.5)  64.66 (4.5)  60.37 (7.3)
Far K     79.77 (5.3)  84.54 (5.0)  83.64 (6.4)  80.02 (7.3)  56.82 (10.3)
Far T     79.77 (5.3)  85.03 (5.7)  82.68 (6.8)  75.61 (9.2)  56.77 (10.9)
CBU       80.11 (5.8)  81.17 (6.0)  81.08 (6.5)  84.17 (5.1)  80.96 (6.9)

Yahoo G (p = 25)

          βu1          βu2          βu3          βu4          βu5
RUS       78.82 (1.4)  78.91 (1.6)  78.97 (1.6)  78.61 (1.6)  77.82 (2.1)
Cl K      78.82 (1.4)  77.26 (1.5)  72.52 (1.5)  67.86 (2.0)  65.07 (2.7)
Cl T      78.82 (1.4)  76.83 (1.0)  71.99 (1.8)  67.15 (2.3)  61.1 (2.7)
Far K     78.82 (1.4)  78.26 (2.2)  74.69 (2.7)  67.22 (2.1)  60.72 (2.3)
Far T     78.82 (1.4)  77.68 (2.6)  72.44 (3.0)  64.94 (2.4)  59.6 (2.0)
CBU       75.25 (3.2)  75.22 (2.4)  74.69 (2.3)  73.07 (2.4)  70.69 (2.4)

26 Jellis Vanhoeyveld David Martens

Table 4 continued

TaFeng (p = 25)

          βu1          βu2          βu3          βu4          βu5
RUS       66.94 (1.3)  67.44 (1.3)  68.1 (1.4)   68.27 (1.4)  66.13 (1.2)
Cl K      66.94 (1.3)  66.13 (1.4)  63.39 (1.2)  59.83 (1.3)  56.94 (0.7)
Cl T      66.94 (1.3)  66.38 (1.5)  62.89 (1.6)  57.46 (1.3)  54.56 (1.3)
Far K     66.94 (1.3)  68.06 (1.4)  66.43 (1.6)  64.46 (1.5)  63.35 (1.3)
Far T     66.94 (1.3)  64.31 (1.1)  62.69 (1.0)  61.27 (1.1)  59.03 (1.0)
CBU       64.81 (1.2)  64.15 (1.1)  64.13 (1.2)  63.88 (0.8)  63.46 (0.8)

Book (p = 25)

          βu1          βu2          βu3          βu4          βu5
RUS       60.08 (0.7)  60.13 (0.6)  60.4 (0.8)   60.33 (0.8)  63.28 (0.8)
Cl K      60.08 (0.7)  59.96 (0.7)  60.13 (0.8)  59.96 (1.0)  59.28 (0.7)
Cl T      60.08 (0.7)  59.96 (0.7)  60.13 (0.8)  60.29 (0.4)  54.5 (0.9)
Far K     60.08 (0.7)  63.29 (1.0)  64.19 (0.8)  57.3 (1.1)   55.66 (1.1)
Far T     60.08 (0.7)  62.14 (0.5)  58.27 (0.6)  56.37 (1.0)  55.66 (1.1)
CBU       54.82 (0.9)  54.67 (0.9)  54.71 (0.9)  54.66 (1.0)  54.78 (0.9)

4.5 Boosting variants

In this section the plain AdaBoost (AB) method is compared with the cost-sensitive AdaCost (AC) algorithm, the EasyEnsemble (EE) technique and the baseline (BL) method (where we train a single SVM on the imbalanced training data). AB, AC and EE have an underlying boosting process in common. The AUC-values are stored during each boosting iteration, which allows us to show the performance graphically as a function of the number of iterations t. In EasyEnsemble (EE) we combine the weak learners of each subset by summing their individual contributions. For example, when t = 2 boosting iterations are performed, we evaluate the performance of the combined learner ∑_{s=1}^{S} ∑_{t=1}^{2} α_{s,t} h_{s,t}(x). Figures 1, 2, 3 and 4 show the average tenfold AUC-performance (with μ = 100%) on the test data. The left-side figures show the AB, AC (with varying R-values) and EE (with varying S-values) performance with respect to the number of boosting iterations. The C-value is tuned according to highest validation set AUC-performance (over all possible boosting iterations). The right-side figures investigate the influence of the C-parameter for AB and EE (with S = 15) and allow us to gain insight on the effect of weak and strong learners on the generalization performance. Results on the remaining datasets are given in Appendix C, Figures 6-17. Note that we only indicate results with weight-percentage μ = 100% (use all instances in the training process). Previous experiments (with μ = 75%) showed inferior results. This suggests that it is indeed important to include all minority instances during the training of each weak learner, an observation also made by Liu et al (2009).

When investigating the influence of the regularization parameter C in AdaBoost, we can graphically verify that it is a "weakness" indicator, in the sense that lower C-values correspond to weaker learners. While these low C-value learners have a low performance in the first rounds of boosting, the possible boost in performance is usually much higher compared to higher C-value inducers. Strong learners can already make a suitable distinction between minority and majority class behaviours in the first boosting round. The boosting process will emphasize the hard to learn instances, among which are noise/outliers. Therefore, too strong learners are not suitable to be used in a boosting process, as already noted by Wickramaratna et al (2001). In Figures 2(b) and 3(b) we observe that the lowest C-values (C = 10^-7, 10^-5) can outperform higher C-values (C = 10^-3, 10^-1). In many cases the AB-process outperforms the baseline (BL), see also the results in Table 5. A combination of several weak learners can often achieve better generalization than a single strong learner.

Inspired by the success of the RUS method with βu = 1 (for reasons indicated in Section 4.4), the EE-technique tries to overcome an obvious downside of this approach: we are removing a lot of majority class instances that can be very informative. Hence, we try to make up for this loss by combining several of these balanced training sets, in the hope that the majority class is more adequately explored. This results in a very effective learner that seems to outperform the AB, AC and BL methods in the majority of cases, as can be seen from these figures. It is remarkable that we don't need to use that many subsets S to achieve high performances: increasing the S-value from 5 to 15 doesn't seem to cause major increases in AUC (see also Table 5). The influence of the C-parameter is similar as described in the previous paragraph.
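A minimal sketch of the EasyEnsemble procedure follows; a 1-D threshold "stump" stands in for the regularized SVM base learner used in the paper, and `train_weak`, the toy data and all parameter defaults are illustrative assumptions only.

```python
import math
import random

# Hedged EasyEnsemble sketch: draw S balanced subsets (all minority instances
# plus an equally sized random majority sample), run T AdaBoost rounds on
# each, and score by summing every weak learner's weighted vote, i.e.
# sum_s sum_t alpha_{s,t} * h_{s,t}(x).
def easy_ensemble(minority, majority, train_weak, S=15, T=20, seed=0):
    rng = random.Random(seed)
    ensemble = []                                  # flat list of (alpha, h)
    for _ in range(S):
        data = ([(x, 1) for x in minority] +
                [(x, -1) for x in rng.sample(majority, len(minority))])
        w = [1.0 / len(data)] * len(data)          # AdaBoost instance weights
        for _ in range(T):
            h = train_weak(data, w)
            err = sum(wi for wi, (x, y) in zip(w, data) if h(x) != y)
            if err <= 0.0 or err >= 0.5:           # perfect or too weak: stop
                ensemble.append((1.0 if err <= 0.0 else 0.0, h))
                break
            alpha = 0.5 * math.log((1.0 - err) / err)
            w = [wi * math.exp(-alpha * y * h(x)) for wi, (x, y) in zip(w, data)]
            total = sum(w)
            w = [wi / total for wi in w]           # renormalize weights
            ensemble.append((alpha, h))
    return lambda x: sum(a * h(x) for a, h in ensemble)  # ranking score

# toy weak learner: threshold halfway between the weighted class means
def train_weak(data, w):
    wp = sum(wi for wi, (x, y) in zip(w, data) if y == 1)
    wn = sum(wi for wi, (x, y) in zip(w, data) if y == -1)
    mp = sum(wi * x for wi, (x, y) in zip(w, data) if y == 1) / wp
    mn = sum(wi * x for wi, (x, y) in zip(w, data) if y == -1) / wn
    thr = 0.5 * (mp + mn)
    return lambda x, t=thr: 1 if x >= t else -1

score = easy_ensemble([2.0, 2.5, 3.0], [0.0, 0.3, 0.5, 0.8, 1.0, 1.2],
                      train_weak, S=3, T=5)
print(score(2.8) > score(0.2))  # True: minority-like points rank higher
```

Each subset sees only 2 × (minority size) instances, which is what makes the method cheap and parallelizable.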

Fig. 1 Mov G (p = 25) dataset results showing average tenfold AUC-performance on test data (with μ = 100%) for (a) AB, AC and EE, with C chosen according to highest validation set AUC-performance (over all possible T); (b) AB and EE (S = 15) with varying C-levels


Fig. 2 Book (p = 25) dataset

Fig. 3 TaFeng (p = 25) dataset

Fig. 4 Bank (p = []) dataset


4.6 Final comparison

4.6.1 Performance-wise comparison

In this section we wish to determine which of the aforementioned methods shows superior performance in terms of AUC on test data. The experimental set-up is described in Section 4.2. Note that we exclude β = 0 and βu = 0 in determining the performance of the oversampling and undersampling techniques respectively, to be able to compare them with the baseline (BL) approach.²³ The results for AB, AC and EE are shown for μ = 100%. The number of boosting iterations t ∈ [0,T] is also considered as a tunable parameter. Table 5 shows the final comparison between all methods. Additionally, we added an average rank column indicating the mean rank of each algorithm across all datasets. The rank of an algorithm for a specific dataset corresponds to its position in the sorted AUC-performance spectrum, where the best performing algorithm gets the rank of 1, the second best rank 2, etc. In case two or more algorithms perform equally well, average ranks are assigned (if two algorithms have ranks of 3 and 4 respectively and show equal performance, then both methods get assigned a rank of 3.5). Note that the LST dataset is excluded in the calculation of average ranks, since we consider this as a dataset not being affected by imbalance and showing equal performances.
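The tie-handling rule above can be sketched as follows (an illustrative helper, not the authors' code):

```python
# Hedged sketch of the average-rank computation behind Table 5: rank 1 goes
# to the best AUC per dataset, and tied methods share the mean of their
# positions (midranks).
def average_ranks(auc_by_dataset):
    """auc_by_dataset: list of {method: AUC} dicts -> {method: mean rank}."""
    methods = list(auc_by_dataset[0])
    totals = {m: 0.0 for m in methods}
    for scores in auc_by_dataset:
        ordered = sorted(methods, key=lambda m: -scores[m])  # best first
        i = 0
        while i < len(ordered):
            j = i
            while j + 1 < len(ordered) and scores[ordered[j + 1]] == scores[ordered[i]]:
                j += 1
            midrank = (i + 1 + j + 1) / 2.0      # mean of the tied positions
            for m in ordered[i:j + 1]:
                totals[m] += midrank
            i = j + 1
    return {m: totals[m] / len(auc_by_dataset) for m in methods}

# two methods tied for second place share ranks 2 and 3 -> both get 2.5
print(average_ranks([{"EE": 0.85, "RUS": 0.80, "BL": 0.80}]))
```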

As a side note, we noticed that increasing the number of minority training set instances can significantly improve performances. We can see this clearly by comparing the p = 25 outcomes with the results of p = 1 on the same dataset. Junque de Fortuny et al (2014a) already proved empirically that larger behaviour datasets (in terms of the number of instances or features) lead to higher AUC-performances. This situation is confirmed in our experiments, though we want to add that this is only true when the data size is increased in a balanced fashion.

The results in Table 5 confirm our previous analysis. Each of the oversampling, undersampling (except Cl Knn) and boosting variants seems to improve upon the BL performance. Sobhani et al (2015) note that the direct application of ensemble techniques on imbalanced data does not circumvent the imbalanced learning issue. In our experiments on behaviour data, we do find that a simple boosting procedure applied to the imbalanced data improves upon the predictive performance. The oversampling techniques OSR, SMOTE and ADASYN show similar outcomes and are competitive methods, seeming to outperform the undersampling techniques and the plain boosting (AB) and cost-sensitive learning (AC) approaches. When dealing with traditional data, the performances of oversampling techniques are generally worse than those of undersampling approaches (Drummond and Holte 2003). It is interesting to observe that for these large and sparse behaviour datasets the opposite is true. This is heavily related to the intrinsic properties of the data under consideration. Undersampling (or feature selection, for that matter) is not recommended because, as we have noted in Section 2.1, each feature provides a small additional contribution to the predictive performance. With respect to AC, the results indicate that R = RL is a better choice compared to more random cost ratios R = 28. The EE-technique has the lowest average rank and is our best performing method. The combination of several balanced subsets with filtered noise/outlier contributions allows one to explore the majority class space efficiently. To strengthen our previous statements, we will in the next paragraphs conduct statistical significance tests in accordance with the procedure outlined in Demšar (2006). The latter paper presents a framework to perform statistical comparisons of classifiers over multiple datasets.

23. The BL technique trains single SVMs on the imbalanced training data.

The first null-hypothesis we try to reject postulates that each of the algorithms performs equally well or, equivalently, that their average ranks R_j (see Table 5) are indifferent. Under the null-hypothesis the Friedman statistic (Friedman 1937)

    χ²_F = (12N / (k(k+1))) [ ∑_{j=1}^{k} R_j² − k(k+1)²/4 ],    (6)

with N the number of datasets (N = 15) and k the number of algorithms (k = 13), can be used to compute the statistic (Iman and Davenport 1980)

    F_F = ((N−1) χ²_F) / (N(k−1) − χ²_F).    (7)

The latter is distributed according to the F-distribution with k−1 and (k−1)(N−1) degrees of freedom. Evaluating this measure with the average ranks shown in Table 5 results in F_F = 22.98, which is a lot higher than the critical value of 1.81 at the α = 0.05 significance level. We can therefore reject the null-hypothesis and proceed with post-hoc tests.
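As a quick check, Eqs. (6)-(7) can be evaluated directly on the average ranks of Table 5 (a sketch; the rank values are the rounded ones reported in the table):

```python
# Friedman chi-square (Eq. 6) and the Iman-Davenport correction (Eq. 7),
# evaluated on the Table 5 average ranks (N = 15 datasets, k = 13 algorithms).
def friedman_stats(ranks, n_datasets):
    k = len(ranks)
    chi2 = (12.0 * n_datasets / (k * (k + 1))) * (
        sum(r * r for r in ranks) - k * (k + 1) ** 2 / 4.0)
    ff = (n_datasets - 1) * chi2 / (n_datasets * (k - 1) - chi2)
    return chi2, ff

avg_ranks = [11.600, 5.000, 4.533, 4.800, 8.167, 12.467, 8.133,
             8.567, 8.267, 8.467, 5.400, 3.267, 2.333]   # Table 5, BL..EE(S=15)
chi2, ff = friedman_stats(avg_ranks, n_datasets=15)
print(round(ff, 2))  # close to the reported F_F = 22.98
```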

The Nemenyi test (Nemenyi 1963) allows one to compare each classifier to one another and adjusts the critical value to compensate for making k(k−1)/2 comparisons.²⁴ "The performance of two classifiers is significantly different if the corresponding average ranks differ by at least the critical difference (CD)" (Demšar 2006). We refer to the aforementioned paper to calculate the CD. The result of these comparisons at the α = 0.05 significance level is shown in Table 6. This test is already quite revealing in the sense that, for instance, the oversampling methods (OSR, SMOTE and ADASYN), AC (R = RL) and EE-techniques are significantly different from the BL. Yet it does not indicate that undersampling or applying plain boosting (AB) results in significantly different performances compared to the BL. The Nemenyi test is still conservative and we therefore compare all classifiers to a single control classifier in the next paragraphs. Note that we only make k−1 comparisons in that situation, resulting in stronger results compared to the Nemenyi test.

When comparing a single classifier i with a control classifier c, the null-hypothesis states that the two classifiers i and c perform equally well. Under this hypothesis, the test statistic (based on the average ranks R_i shown in Table 5)

    z = (R_i − R_c) / √(k(k+1)/(6N))    (8)

is distributed according to a standard normal distribution. This z-value is used to find the corresponding probability (p-value) p from the cumulative standard normal

24. The larger the number of comparisons, the higher the proportion of null-hypotheses that are wrongfully rejected due to random chance.


Table 5 Final comparison of all methods in terms of AUC test set performance; see Section 4.2 for details on the experimental set-up. β ≠ 0 and βu ≠ 0 for the oversampling and undersampling techniques respectively; μ = 100% for each of the boosting variants. Standard deviations are given between brackets and the performance difference with the BL is shown in square brackets.

              Mov G (p = 1)       Mov G (p = 25)      Mov Th (p = [])     Yahoo A (p = 1)
BL            71.6 (2.62) [0]     81.41 (1.32) [0]    79.77 (5.33) [0]    55.92 (2.97) [0]
OSR           75.35 (2.27) [3.8]  83.76 (2.09) [2.3]  85.13 (6.1) [5.4]   60.05 (2.71) [4.1]
SMOTE         76.16 (2.27) [4.6]  83.7 (2.1) [2.3]    85.67 (4.98) [5.9]  60.1 (3.0) [4.2]
ADASYN        76.07 (2.26) [4.5]  83.63 (2.04) [2.2]  85.65 (5.6) [5.9]   59.9 (2.99) [4]
RUS           72.88 (2.73) [1.3]  81.52 (2.15) [0.1]  82.91 (7.19) [3.1]  57.04 (1.77) [1.1]
Cl Knn        71.43 (1.36) [-0.2] 80.88 (1.19) [-0.5] 78.87 (4.71) [-0.9] 55.78 (2.71) [-0.1]
Far Knn       71.9 (2.95) [0.3]   80.9 (1.48) [-0.5]  84.07 (4.64) [4.3]  57.2 (1.33) [1.3]
CBU           74.17 (2.36) [2.6]  81.51 (1.04) [0.1]  82.76 (7.22) [3]    58.77 (3.43) [2.8]
AB            71.65 (1.73) [0.1]  84.52 (1.89) [3.1]  82.43 (5.18) [2.7]  58.35 (2.62) [2.4]
AC(R = 28)    71.61 (2.46) [0]    83.46 (1.82) [2]    83.27 (5.6) [3.5]   57.72 (2.47) [1.8]
AC(R = RL)    74.65 (2.7) [3.1]   83.35 (2.09) [1.9]  85.41 (4.49) [5.6]  59.47 (2.33) [3.5]
EE(S = 10)    76.04 (2.66) [4.4]  85.05 (1.85) [3.6]  86.1 (5.78) [6.3]   59.66 (3.13) [3.7]
EE(S = 15)    76.12 (2.88) [4.5]  85.14 (1.86) [3.7]  86.42 (5.86) [6.7]  59.76 (2.93) [3.8]

              Yahoo A (p = 25)    Yahoo G (p = 1)     Yahoo G (p = 25)    TaFeng (p = 1)
BL            61.68 (2.42) [0]    66.84 (3.66) [0]    78.82 (1.39) [0]    55.75 (1.6) [0]
OSR           64.59 (3.12) [2.9]  73.08 (2.96) [6.2]  78.52 (2.01) [-0.3] 61.21 (2.24) [5.5]
SMOTE         65.56 (3.33) [3.9]  73.11 (3.12) [6.3]  79.01 (1.21) [0.2]  61.72 (1.81) [6]
ADASYN        65.13 (3.38) [3.4]  73.22 (3.17) [6.4]  79.74 (1.68) [0.9]  61.68 (1.86) [5.9]
RUS           64.11 (2.8) [2.4]   70.65 (3.39) [3.8]  78.91 (1.55) [0.1]  59.25 (2.18) [3.5]
Cl Knn        61.14 (2.13) [-0.5] 66.34 (3.54) [-0.5] 77.26 (1.46) [-1.6] 55.77 (1.28) [0]
Far Knn       63.96 (3.03) [2.3]  66.97 (3.54) [0.1]  78.26 (2.2) [-0.6]  59.98 (1.26) [4.2]
CBU           62.27 (1.79) [0.6]  71.27 (2.89) [4.4]  75.22 (2.42) [-3.6] 58.4 (1.57) [2.6]
AB            63.88 (2.67) [2.2]  68.9 (2.03) [2.1]   79.01 (1.66) [0.2]  56.21 (1.79) [0.5]
AC(R = 28)    64.32 (3.56) [2.6]  68.89 (3.11) [2]    78.99 (1.89) [0.2]  56.33 (1.83) [0.6]
AC(R = RL)    64.31 (3.03) [2.6]  73.13 (2.8) [6.3]   78.41 (2.0) [-0.4]  61.6 (2.26) [5.9]
EE(S = 10)    66.51 (3.24) [4.8]  72.61 (3.15) [5.8]  80.52 (1.6) [1.7]   61.2 (1.82) [5.4]
EE(S = 15)    66.36 (3.18) [4.7]  73.48 (2.32) [6.6]  80.54 (1.56) [1.7]  61.13 (1.83) [5.4]

              TaFeng (p = 25)     Book (p = 1)        Book (p = 25)       LST (p = 1)
BL            66.94 (1.34) [0]    52.6 (1.29) [0]     60.08 (0.71) [0]    99.99 (0.01) [0]
OSR           68.77 (1.23) [1.8]  55.87 (1.42) [3.3]  64.62 (0.57) [4.5]  99.99 (0.01) [0]
SMOTE         68.47 (1.5) [1.5]   55.07 (0.88) [2.5]  62.96 (0.82) [2.9]  99.99 (0.01) [0]
ADASYN        68.48 (1.47) [1.5]  55.04 (0.91) [2.4]  63.02 (0.57) [2.9]  99.99 (0.01) [0]
RUS           68.28 (1.39) [1.3]  54.26 (0.92) [1.7]  63.28 (0.8) [3.2]   99.98 (0.01) [0]
Cl Knn        66.13 (1.43) [-0.8] 52.69 (1.3) [0.1]   60.02 (0.79) [-0.1] 99.99 (0.01) [0]
Far Knn       68.06 (1.41) [1.1]  56.25 (1.52) [3.7]  64.15 (1.12) [4.1]  99.98 (0.01) [0]
CBU           63.84 (1.07) [-3.1] 53.75 (1.01) [1.2]  54.68 (0.88) [-5.4] []
AB            67.65 (1.55) [0.7]  54.27 (1.95) [1.7]  65.0 (0.67) [4.9]   99.99 (0.01) [0]
AC(R = 28)    69.31 (1.23) [2.4]  53.72 (1.0) [1.1]   61.24 (0.8) [1.2]   99.98 (0.01) [0]
AC(R = RL)    67.15 (1.51) [0.2]  55.73 (1.22) [3.1]  64.6 (0.64) [4.5]   99.99 (0.01) [0]
EE(S = 10)    70.3 (1.35) [3.4]   55.09 (1.29) [2.5]  65.37 (0.61) [5.3]  99.98 (0.01) [0]
EE(S = 15)    70.4 (1.3) [3.5]    55.35 (1.26) [2.8]  65.4 (0.51) [5.3]   99.98 (0.01) [0]


Table 5 Continued. Additionally, an average rank column is added, showing the mean rank of each algorithm across all datasets. Note that we excluded the LST dataset for this purpose.

              Adver (p = [])      Adver (p = 1)       CRF (p = [])         Bank (p = [])
BL            96.61 (1.82) [0]    90.93 (3.02) [0]    64.06 (16.43) [0]    66.82 (0.88) [0]
OSR           96.93 (1.91) [0.3]  93.3 (2.02) [2.4]   80.74 (12.93) [16.7] 71.39 (0.79) [4.6]
SMOTE         97.05 (1.66) [0.4]  93.35 (2.01) [2.4]  78.7 (16.56) [14.6]  []
ADASYN        96.91 (1.95) [0.3]  93.46 (2.21) [2.5]  78.87 (16.71) [14.8] []
RUS           96.81 (1.87) [0.2]  92.38 (2.51) [1.5]  83.98 (5.99) [19.9]  69.41 (1.19) [2.6]
Cl Knn        96.4 (1.48) [-0.2]  89.73 (3.42) [-1.2] 76.63 (16.19) [12.6] 66.17 (0.72) [-0.6]
Far Knn       95.77 (1.81) [-0.8] 93.88 (1.78) [3]    83.75 (13.11) [19.7] 66.95 (0.56) [0.1]
CBU           97.15 (1.88) [0.5]  94.18 (2.3) [3.3]   []                   []
AB            97.34 (2.18) [0.7]  91.39 (3.23) [0.5]  77.62 (15.15) [13.6] 66.82 (0.88) [0]
AC(R = 28)    97.44 (1.93) [0.8]  91.0 (3.35) [0.1]   68.31 (14.93) [4.2]  67.67 (0.71) [0.9]
AC(R = RL)    97.46 (1.71) [0.8]  93.51 (2.17) [2.6]  85.08 (9.77) [21]    70.7 (0.8) [3.9]
EE(S = 10)    97.64 (1.35) [1]    92.97 (2.75) [2]    86.18 (10.17) [22.1] 71.46 (0.81) [4.6]
EE(S = 15)    97.63 (1.35) [1]    93.3 (2.14) [2.4]   86.35 (9.99) [22.3]  71.54 (0.76) [4.7]

              Average Rank
BL            11.600
OSR           5.000
SMOTE         4.533
ADASYN        4.800
RUS           8.167
Cl Knn        12.467
Far Knn       8.133
CBU           8.567
AB            8.267
AC(R = 28)    8.467
AC(R = RL)    5.400
EE(S = 10)    3.267
EE(S = 15)    2.333

Table 6 Nemenyi test outcomes at the α = 0.05 significance level. A value of 1 indicates that the null-hypothesis is rejected, and thus finds the two algorithms to be significantly different (a value of 0 means we accept the null-hypothesis). Methods are listed in the same order as Table 5 and are represented more concisely.

      BL  RO  SM  AD  RU  Cl  Fa  CBU AB  AC1 AC2 EE1 EE2
BL    0   1   1   1   0   0   0   0   0   0   1   1   1
RO    1   0   0   0   0   1   0   0   0   0   0   0   0
SM    1   0   0   0   0   1   0   0   0   0   0   0   0
AD    1   0   0   0   0   1   0   0   0   0   0   0   0
RU    0   0   0   0   0   0   0   0   0   0   0   1   1
Cl    0   1   1   1   0   0   0   0   0   0   1   1   1
Fa    0   0   0   0   0   0   0   0   0   0   0   1   1
CBU   0   0   0   0   0   0   0   0   0   0   0   1   1
AB    0   0   0   0   0   0   0   0   0   0   0   1   1
AC1   0   0   0   0   0   0   0   0   0   0   0   1   1
AC2   1   0   0   0   0   1   0   0   0   0   0   0   0
EE1   1   0   0   0   1   1   1   1   1   1   0   0   0
EE2   1   0   0   0   1   1   1   1   1   1   0   0   0


distribution function, p = 2 × min(Φ(z), 1−Φ(z)). Holm's method (Holm 1979) compares k−1 classifiers to a single control classifier and sorts the corresponding p-values in ascending order, so that p_1 ≤ p_2 ≤ ... ≤ p_{k−1}. Each p_i is subsequently compared to its associated confidence level,²⁵ α_comp = α/(k−i). Holm starts with performing the check p_1 < α/(k−1) and rejects the null-hypothesis if the test passes. It then proceeds with a similar check for i = 2, and continues until a certain null-hypothesis cannot be rejected. The remaining hypotheses are retained.
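Eq. (8) and this step-down rule can be reproduced on the Table 5 average ranks with BL as control (a sketch; k = 13, N = 15, method labels as in Table 7):

```python
import math

# Hedged reconstruction of Eq. (8) and Holm's step-down test behind Table 7:
# each method is compared against the BL control using the Table 5 ranks.
K, N, ALPHA, R_BL = 13, 15, 0.05, 11.600

def z_stat(r_i, r_c):
    return (r_i - r_c) / math.sqrt(K * (K + 1) / (6.0 * N))   # Eq. (8)

def p_value(z):
    phi = 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))          # normal CDF
    return 2.0 * min(phi, 1.0 - phi)

ranks = {"EE(S=15)": 2.333, "EE(S=10)": 3.267, "SMOTE": 4.533,
         "ADASYN": 4.800, "OSR": 5.000, "AC(R=RL)": 5.400,
         "Far Knn": 8.133, "RUS": 8.167, "AB": 8.267,
         "AC(R=28)": 8.467, "CBU": 8.567, "Cl Knn": 12.467}

pvals = sorted((p_value(z_stat(r, R_BL)), m) for m, r in ranks.items())
rejected = []
for i, (p, m) in enumerate(pvals, start=1):
    if p >= ALPHA / (K - i):          # Holm stops at the first retained one
        break
    rejected.append(m)
print(rejected)  # the six methods Table 7 flags as significant
```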

Table 7 shows the result of the Holm test at the α = 0.05 significance level with the BL algorithm as control classifier. The table lists each method according to sorted p-values and shows the z-, p- and α_comp-values. The significance column indicates whether the proposed method is significantly different from the BL, and coincidentally matches the result of the Nemenyi test. The p-values of the (Far Knn, RUS, CBU) undersampling methods, AB and AC (R = 28) are only slightly higher than the level required to reject the null-hypothesis. We included a critical significance level α_crit, corresponding to the lowest possible significance level upon which the method would be considered as significantly different from the BL (if α = α_crit then p = α_comp). In the case of AB, for example, if we were to choose a significance level α = 0.0764 (a confidence level of 92.36%), the associated p-value would be smaller than α_comp and we would proceed to conclude²⁶ that AB performs significantly different from the BL. To summarize: the null-hypothesis of equal performance between the BL and an alternative method proposed in Section 3 (except Cl Knn) is rejected with a confidence level of more than 91% according to Holm's test. Hence, the performance of the BL approach is significantly worse than that of each of the proposed over- and undersampling methods (except Cl Knn), cost-sensitive learning techniques (AC) and boosting variants (AB, EE).

Similar to the previous paragraph, Table 8 presents the results of Holm's test at the α = 0.05 significance level with EE(S = 15) as control classifier. The results coincidentally correspond with the result of the Nemenyi test and indicate that the oversampling (OSR, SMOTE and ADASYN), AC(R = RL) and EE(S = 10) methods are not statistically different from the EE(S = 15) technique at the 0.05 significance level. However, if we were to adapt the significance level to approximately α = 0.25 (a 75% confidence level), the results would conclude significance (except for EE(S = 10)). To summarize: the null-hypothesis of equal performance between EE(S = 15) and an alternative method proposed in Section 3 (except EE(S = 10)) is rejected with a confidence level of more than 75% according to Holm's test. Hence, the performance of the oversampling, undersampling, AB and AC approaches is significantly worse than that of EE(S = 15). We can therefore safely conclude EE to be the preferred method in terms of AUC performance when dealing with imbalanced behaviour data.

25. α_comp adjusts the value of α to compensate for multiple comparisons.
26. This is not entirely true, since Holm's method requires Far Knn to pass the test first before proceeding, which only occurs at the 0.088662 significance level.


Table 7 Holm test at the α = 0.05 significance level with BL as reference. The table shows the z test statistic with associated p-value; α_comp = α/(k−i), with k the number of algorithms (k = 13) and i corresponding to the position of the method in the sorted p-value vector. Since we already ranked the p-values in ascending order, i takes on the row number (e.g. OSR has i = 5). The column "significant" denotes if we can reject the null-hypothesis (significant: p < α_comp). α_crit corresponds with the smallest possible significance level where we would decide to reject the null-hypothesis (α_crit = α · p/α_comp).

              z          p         α_comp    significant  α_crit
EE(S = 15)    -6.51642   7.2E-11   0.004167  1            8.64E-10
EE(S = 10)    -5.86009   4.63E-09  0.004545  1            5.09E-08
SMOTE         -4.96936   6.72E-07  0.005     1            6.72E-06
ADASYN        -4.78183   1.74E-06  0.005556  1            1.56E-05
OSR           -4.64119   3.46E-06  0.00625   1            2.77E-05
AC(R = RL)    -4.35991   1.3E-05   0.007143  1            9.11E-05
Far Knn       -2.4378    0.014777  0.008333  0            0.088662
RUS           -2.41436   0.015763  0.01      0            0.078815
AB            -2.34404   0.019076  0.0125    0            0.076305
AC(R = 28)    -2.20339   0.027567  0.016667  0            0.082701
CBU           -2.13307   0.032919  0.025     0            0.065837
Cl Knn        0.609449   0.542227  0.05      0            0.542227

Table 8 Holm test at the α = 0.05 significance level with EE(S = 15) as reference

z p αcomp significant αcrit

Cl Knn 712587 103E-12 0004167 1 124E-11BL 6516421 72E-11 0004545 1 792E-10

CBU 4383348 117E-05 0005 1 0000117AC(R = 28) 4313027 161E-05 0005556 1 0000145

AB 4172384 301E-05 000625 1 0000241RUS 4102063 409E-05 0007143 1 0000287

Far Knn 4078623 453E-05 0008333 1 0000272AC(R = RL) 2156513 0031044 001 0 0155218

OSR 1875229 0060761 00125 0 0243045ADASYN 1734587 0082814 0016667 0 0248442SMOTE 1547064 0121848 0025 0 0243696

EE(S = 10) 065633 0511612 005 0 0511612

4.6.2 Timing-wise comparison

The performance of each of the proposed algorithms in terms of training times depends highly upon the chosen parameters. The β and βu parameters control the number of training instances and are an important contribution to the subsequent SVM learning times for each of the over- and undersampling techniques. The number of boosting iterations t ∈ [0,T] is also a major factor in each of the boosting variants proposed in Section 3.3. Furthermore, the chosen regularization parameter C highly influences the training times of the SVM inducer: higher C-values result in larger training times (Fan et al 2008). Not only do the chosen hyperparameters affect learning times; data characteristics such as the number of minority instances, the imbalance level, the number of features, the amount of overlapping, etc., have a major effect as well.


In comparing each of the methods outlined in Section 3, we make use of a similar methodology as previously presented (see Section 4.6.1). For each of the ten folds, a suitable parameter combination is selected based on optimal validation set AUC-performance. These chosen parameters are applied in determining the timings of each method. The final results indicate the average timings across ten folds. Hence, we show the required amount of time for each method when using the parameter combination resulting in the best performance. Note that this might, for instance, result in using β = 1/3 in case of OSR, or using only t = 5 boosting iterations in EE. Computational timings are shown in Table 9.

As one can observe from the aforementioned table, CBU is the slowest method, as it relies on computationally intensive clustering procedures. The synthetic SMOTE and ADASYN oversampling approaches are very time-consuming methods: they both rely on computationally expensive nearest neighbour computations, in conjunction with a synthetic sample generation procedure that requires a single pass over all minority training instances.

The RUS method shows, as expected, superior results in terms of computational timings. Other undersampling methods perform quite well, though they are slower than BL and OSR due to their inherent nearest neighbour computations.

The boosting variants seem to perform worse than the BL, OSR and each of the undersampling methods. Since EE was the best method in terms of AUC-performance (see Section 4.6.1), we focus on this learner in this paragraph. For the medium sized datasets (all but CRF and Bank), OSR outperforms EE(S = 15). Yet for the large datasets (CRF and Bank) the opposite is true. An SVM is known to have a quadratic computational complexity in the number of training instances (Suykens et al 2002; García and Lozano 2007). For this reason, we consider OSR (or any other oversampling technique) as inappropriate for large datasets. Since with EE each subset has a size limited to twice the minority class training size, we expect increasing benefits of the EE-technique for larger and more imbalanced behaviour data. As Junque de Fortuny et al (2014a) have observed, larger behaviour data in terms of the number of instances or features also contributes to an increase in predictive power. Hence, we expect to see this type of big behaviour data more often in real-life applications, where computational resources can become a major constraint. Because the EE-technique is easily parallelizable by simultaneously performing the boosting process for each of the independent subsets, we added an extra row in Table 9: EE par corresponds with the time needed for EE(S = 15) divided by 15, indicating the time that would be required for EE if each subset were trained in parallel. We can see that EE par can outperform OSR in 8 out of 16 cases, precisely for the datasets that show a high imbalance level (e.g. p = 1 or p = []).

We conclude our comparison of the proposed methods in Figure 5, which shows the average ranks in terms of computational timings (see Table 9) and AUC performance for each of the available data sources. Regarding the ranks with respect to the latter, we added an extra row EE par, having the same AUC-performance as EE(S = 15) for each dataset in Table 5. Ranks are subsequently determined according to the procedure outlined in Section 4.6.1. Additionally, Figure 18 in Appendix D shows the result for the large and highly imbalanced CRF and Bank datasets. We expect the latter type of data to be of major interest for reasons already indicated.


Table 9 Computational timings (in seconds), averaged across ten folds, using the parameter combination resulting in the highest validation set AUC-performance. First and second best performances are emphasized in boldface (third and fourth best are underlined). EE par is the time required for EE(S = 15) divided by 15.

              Mov G (p = 1)  Mov G (p = 25)  Mov Th (p = [])  Yahoo A (p = 1)
BL            0.032889       0.056697        0.558563         0.026922
OSR           0.055043       0.062802        0.99009          0.044421
SMOTE         0.218821       0.937057        3.841482         0.057726
ADASYN        0.284688       1.802399        5.191265         0.087694
RUS           0.011431       0.025383        0.155224         0.007991
Cl Knn        0.046599       0.599846        0.989914         0.037182
Far Knn       0.039887       0.80072         0.683023         0.027788
CBU           10.34111       10.60173        6.822839         16.92477
AB            0.169792       0.841443        3.460246         0.139251
AC(R = 28)    0.471994       2.996585        1.086907         0.366555
AC(R = RL)    0.53376        1.179542        6.065177         0.209015
EE(S = 10)    0.117226       6.065145        1.17995          0.148973
EE(S = 15)    0.20474        7.173737        2.119991         0.180365
EE par        0.013649       0.478249        0.141333         0.012024

              Yahoo A (p = 25)  Yahoo G (p = 1)  Yahoo G (p = 25)  TaFeng (p = 1)
BL            0.092954          0.011915         0.044164          0.026728
OSR           0.027887          0.013241         0.047206          0.040919
SMOTE         1.062686          0.056153         0.883698          0.219553
ADASYN        2.050993          0.079073         1.733367          0.306618
RUS           0.048471          0.003234         0.033423          0.002916
Cl Knn        0.84391           0.025404         0.502515          0.092167
Far Knn       0.664124          0.026576         0.500206          0.080159
CBU           15.69442          12.87221         13.55035          24.67279
AB            0.445546          0.078777         0.169977          0.114619
AC(R = 28)    1.034044          0.321723         0.515953          0.926178
AC(R = RL)    0.706215          0.226741         0.112949          0.610233
EE(S = 10)    1.026577          0.100331         1.527146          0.058052
EE(S = 15)    1.607596          0.077483         2.472582          0.10538
EE par        0.107173          0.005166         0.164839          0.007025

              TaFeng (p = 25)  Book (p = 1)  Book (p = 25)  LST (p = 1)
BL            0.032033         0.080035      0.318093       0.652045
OSR           0.032414         0.132927      0.092757       0.87152
SMOTE         5.089283         3.409418      11.43444       4.987705
ADASYN        8.148419         3.689661      12.25441       6.840083
RUS           0.020457         0.022713      0.031972       0.432839
Cl Knn        1.713731         0.400873      3.711648       2.508374
Far Knn       1.539437         0.379086      3.988552       2.511037
CBU           26.42686         41.98663      46.31987       []
AB            0.713265         0.61719       1.238585       2.466151
AC(R = 28)    1.234647         1.666131      2.330635       1.451671
AC(R = RL)    0.279047         0.860346      0.197053       1.23763
EE(S = 10)    2.484502         2.145747      7.177484       0.524066
EE(S = 15)    3.363971         2.480066      11.21945       0.784111
EE par        0.224265         0.165338      0.747963       0.052274


Table 9 Continued. Additionally, an extra column indicating average ranks is included (lower ranks are preferred).

              Adver (p = [])  Adver (p = 1)  CRF (p = [])  Bank (p = [])
BL            0.010953        0.002796       0.725911      70.89334
OSR           0.012178        0.006166       3.685813      179.7481
SMOTE         0.123112        0.017764       5.633862      []
ADASYN        0.183767        0.021728       5.768669      []
RUS           0.012115        0.00204        0.147392      52.47441
Cl Knn        0.061324        0.005568       1.106755      73.73282
Far Knn       0.079078        0.007069       1.110379      97.59619
CBU           3.378235        3.236754       []            []
AB            0.069199        0.103518       1.153196      83.08618
AC(R = 28)    0.193092        0.068905       2.047434      71.70548
AC(R = RL)    0.107652        0.037963       1.387174      106.3466
EE(S = 10)    0.138485        0.085686       0.198656      24.95117
EE(S = 15)    0.185136        0.139121       0.285345      36.40107
EE par        0.012342        0.009275       0.019023      2.426738

              Average Rank [pos]
BL            2.94 [2]
OSR           4.19 [4]
SMOTE         9.59 [11]
ADASYN        10.91 [13]
RUS           1.38 [1]
Cl Knn        6.5 [5]
Far Knn       6.56 [6]
CBU           14 [14]
AB            8.06 [7]
AC(R = 28)    10.81 [12]
AC(R = RL)    9.25 [9]
EE(S = 10)    8.25 [8]
EE(S = 15)    9.56 [10]
EE par        3 [3]

Timings are a substantial concern with respect to these large datasets. As can be observed, the EE-technique thrives for these big and highly imbalanced data, even in its non-parallel form.


Fig. 5 Average rank AUC versus average rank Time (see Table 9) across all datasets from Table 2 (excluding Flickr and Kdd). Regarding the AUC rank, an extra row EE par is added, having the same AUC-value as EE(S = 15). Ranks for each dataset are subsequently obtained via the procedure outlined in Section 4.6.1. Points occurring in the upper-right region are preferred

5 The effect of the chosen base learner

In this short section we investigate the influence of the chosen base learner. More specifically, we ask ourselves whether some base inducers can more easily cope with the imbalanced learning issue, and in doing so we verify the conclusion that EE is a suitable technique to elevate upon the baseline performance. We rely on several popular classification methods for behaviour data: two regularization-based techniques (SVM, LR) and two heuristic types of approaches (Naive Bayes (NB), Behavioural Similarity (BeSim)). The next paragraphs briefly introduce these methods. Note that source codes and dataset results are provided in our on-line repository.

LR constructs a linear model that maximizes the likelihood of the observed data by imposing a logistic model for the parametric probability model P(Y|X). Ng and Jordan (2002) provide a thorough introduction to the topic²⁷ and note that in its plain form LR can suffer from overfitting, especially when the input data is very high-dimensional. Therefore we resort to regularized logistic regression. We opt for L2-regularized LR, since L1-regularization corresponds with a natural way of imposing feature selection (Ng 2004). The LIBLINEAR toolbox (Fan et al 2008) is used to obtain these L2-regularized LR models.
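As an illustration of the kind of model this produces, here is a minimal L2-regularized LR fitted by plain gradient descent; the toy data, learning rate and penalty value are assumptions, and a real run would use LIBLINEAR as in the paper.

```python
import math

# Minimal L2-regularized logistic regression -- a toy stand-in for the
# LIBLINEAR solver (binary features, labels in {0, 1}; lam is the L2
# penalty, playing the role of the inverse of LIBLINEAR's C parameter).
def fit_l2_lr(X, y, lam=0.1, lr=0.5, epochs=500):
    w = [0.0] * len(X[0])
    n = len(X)
    for _ in range(epochs):
        grad = [lam * wj for wj in w]              # d/dw of (lam/2)*||w||^2
        for xi, yi in zip(X, y):
            p = 1.0 / (1.0 + math.exp(-sum(wj * xj for wj, xj in zip(w, xi))))
            for j, xj in enumerate(xi):
                grad[j] += (p - yi) * xj / n       # logistic-loss gradient
        w = [wj - lr * gj for wj, gj in zip(w, grad)]
    return w

def predict(w, xi):
    return 1.0 / (1.0 + math.exp(-sum(wj * xj for wj, xj in zip(w, xi))))

# toy sparse behaviour rows: user-by-item incidence
X = [[1, 0, 1, 0], [0, 1, 0, 1], [1, 1, 0, 0], [0, 0, 1, 1]]
y = [1, 0, 1, 0]
w = fit_l2_lr(X, y)
print([round(predict(w, xi), 2) for xi in X])  # scores usable for AUC ranking
```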

27 For an accessible introduction, see the chapter on "Generative and Discriminative Classifiers: Naive Bayes and Logistic Regression" provided at the following link: http://www.cs.cmu.edu/~tom/NewChapters.html


NB (Ng and Jordan 2002) relies on the use of Bayes' rule, which expresses the posterior P(Y|X) as the product of the likelihood P(X|Y) and the prior P(Y), divided by the evidence P(X). The naivety of the approach originates from assuming conditional independence of the features. We impose a multivariate event model regarding the distribution P(X|Y). Junqué de Fortuny et al (2014a) provide an efficient NB implementation for dealing with large and sparse binary behaviour data.
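A minimal sketch of NB with a Bernoulli-type event model on sparse binary data follows; scikit-learn's `BernoulliNB` is used here as a generic stand-in for the dedicated large-scale implementation of Junqué de Fortuny et al (2014a), and the data and smoothing constant are illustrative.

```python
import numpy as np
from scipy.sparse import csr_matrix
from sklearn.naive_bayes import BernoulliNB

# Toy sparse binary matrix: each row is a behaviour vector (action taken or not).
X = csr_matrix(np.array([
    [1, 0, 1, 0],
    [1, 1, 0, 0],
    [0, 0, 1, 1],
    [0, 1, 0, 1],
]))
y = np.array([1, 1, 0, 0])

# P(X|Y) treats every feature as a present/absent coin flip per class;
# alpha is the Laplace smoothing constant (illustrative value).
nb = BernoulliNB(alpha=1.0).fit(X, y)
posteriors = nb.predict_proba(X)[:, 1]
```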

The BeSim approach conceptually corresponds to projecting the bigraph to a unigraph using the procedure outlined in Section 3.2. Next, the weighted-vote relational neighbour classifier (Macskassy and Provost 2007) is used to infer unknown labels/scores through a weighted probability estimation using the known labels of neighbouring nodes. A very fast and scalable implementation of this approach (the SW-transformation) is presented in detail in Stankova et al (2015).
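The projection-plus-weighted-vote idea can be sketched without ever materializing the dense unigraph, by chaining two sparse matrix-vector products. This is an illustrative simplification (self-votes included, co-occurrence counts as edge weights), not the SW-transformation of Stankova et al (2015); the toy bigraph and labels are hypothetical.

```python
import numpy as np
from scipy.sparse import csr_matrix

# Bipartite adjacency: rows = top nodes (e.g. users), cols = bottom nodes (e.g. items).
A = csr_matrix(np.array([
    [1, 1, 0, 0],
    [1, 0, 1, 0],
    [0, 0, 1, 1],
    [0, 0, 0, 1],
], dtype=float))
y_known = np.array([1.0, 0.0, 0.0, 0.0])  # known labels of the top nodes

# Implicit unigraph projection: similarity between two users = number of shared items.
# Weighted vote of neighbours' labels, computed as A (A^T y) instead of (A A^T) y.
votes = A @ (A.T @ y_known)
degree = A @ (A.T @ np.ones(len(y_known)))   # total vote weight per node
scores = votes / np.maximum(degree, 1e-12)   # normalized label scores
```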

In Table 10 we compare the performances of the baseline (BL) application of the methods with their EE-version (with S = 15, T = 20). It is quite striking to see the EE-version dominate the baseline version when looking at each method individually. This suggests that the imbalanced learning issue has an impact for each of the aforementioned techniques. Focusing on the regularization-based approaches, we observe the baseline LR to be less influenced by the imbalance than the SVM method. Their EE-versions (EE SVM and EE LR) can boost the baseline performance in almost all cases and have the highest average ranks amongst all the proposed methods. As already emphasized, the regularization-based approaches offer an added element of flexibility in the sense that the strength of the underlying base learner can be controlled.

The heuristic approaches NB and BeSim do not offer this flexibility. Furthermore, it is not directly possible to integrate instance weights during the training phase of the base learner. During the boosting process, we therefore sample from the distribution Dt to generate (unweighted) examples and train a BeSim/NB learner next. The results in Table 10 indicate BL NB to be the worst performing algorithm amongst all methods tried. Due to the conditional independence assumption, one could regard this as a weak learner, which is therefore suitable to be used in a boosting process. Indeed, the EE NB approach is able to significantly improve upon its performance. We found this to be the slowest EE-version28 (e.g. taking more than 24 hours to process a single fold of the Bank dataset, versus several minutes for the other versions). The average rank estimate of EE NB is rather pessimistic, since we assigned the lowest performance across all methods in case of missing results. BeSim can be considered a stronger learner compared to NB. Its possible boost in performance is typically lower than that of NB. Having said this, we found EE BeSim to improve upon BL BeSim for all datasets (except Adver(p = 1)).
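Sampling unweighted examples according to the boosting distribution can be sketched as a weighted bootstrap; the function name, toy data, and sample size below are illustrative, not taken from the paper's implementation.

```python
import numpy as np

rng = np.random.default_rng(0)

def resample_from_distribution(X, y, D, size=None):
    """Draw an unweighted bootstrap sample of indices according to the
    boosting distribution D (non-negative weights summing to one)."""
    size = len(y) if size is None else size
    idx = rng.choice(len(y), size=size, replace=True, p=D)
    return X[idx], y[idx]

# Toy data and a uniform starting distribution D_1, as at the first boosting round.
X = np.arange(10).reshape(-1, 1)
y = np.array([0] * 5 + [1] * 5)
D = np.ones(10) / 10.0
Xs, ys = resample_from_distribution(X, y, D)
```

A learner without native instance-weight support (here NB or BeSim) is then trained on `(Xs, ys)` as if the sample were unweighted.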

28 This statement only relates to the EE-version. In the baseline case, the timings of the NB-version are roughly of the same order of magnitude as SVM and LR (though this heavily depends on the chosen regularization parameter C). With EE, each subset has a size limited to twice the minority class training size (a relatively low number compared to the size of the majority class). This works in favour of SVM/LR (e.g. SVM has a computational complexity of O(m^2), with m the number of training instances). Also note that the NB-version is indeed optimized for sparse data. Internally, the NB-implementation writes a fairly large model file to a specified directory (sizes of 1 GB can be encountered there). This writing operation slows down the training stage (especially in case of EE, where S × T = 15 × 20 = 300 model files are constructed).


Table 10 Average tenfold AUC test set performance of the BL compared to its EE-version (S = 15, T = 20) for SVM, LR, BeSim and NB. In case the EE NB implementation crashed, we put a × in the table. If it is too slow (> 24 hours per fold), we place a []. Additionally, average ranks are assigned to each method. In case EE NB has missing results for a particular dataset, we ensure it has the lowest rank across all other algorithms.

          Mov G(p = 1)   Mov G(p = 2.5)  Mov Th(p = [])  Yahoo A(p = 1)
BL SVM    71.6 (2.62)    81.41 (1.32)    79.77 (5.33)    56.49 (3.37)
EE SVM    76.12 (2.88)   85.13 (1.86)    86.43 (5.86)    59.74 (2.96)
BL LR     71.02 (2.09)   84.39 (1.84)    83.14 (4.17)    57.84 (2.39)
EE LR     76.69 (2.92)   85.03 (1.98)    86.3 (5.37)     59.79 (2.62)
BL BeSim  76.1 (3.58)    81.3 (2.92)     82.81 (6.6)     56.27 (2.73)
EE BeSim  76.31 (3.71)   81.37 (2.9)     85.02 (6.28)    57.7 (1.71)
BL NB     70.26 (5.84)   77.01 (2.54)    70.48 (10.14)   52.56 (2.09)
EE NB     75.93 (2.83)   85.56 (2.01)    86.91 (4.15)    57.55 (2.73)

          Yahoo A(p = 2.5)  Yahoo G(p = 1)  Yahoo G(p = 2.5)  TaFeng(p = 1)
BL SVM    61.61 (2.48)      66.84 (3.66)    78.82 (1.39)      55.75 (1.6)
EE SVM    66.38 (3.16)      73.48 (2.32)    80.55 (1.55)      61.13 (1.83)
BL LR     66.27 (2.96)      69.82 (1.93)    80.45 (1.59)      58.91 (2.31)
EE LR     66.22 (3.28)      73.08 (2.14)    80.53 (1.56)      61.43 (2.32)
BL BeSim  64.54 (2.02)      68.89 (2.49)    79.55 (1.96)      57.89 (1.18)
EE BeSim  65.25 (2.23)      71.18 (2.91)    80.04 (1.85)      59.36 (1.47)
BL NB     65 (1.65)         63.33 (2.56)    78.89 (1.64)      54.61 (1.2)
EE NB     66.6 (2.79)       70.99 (2.88)    81.01 (1.3)       59.01 (1.84)

          TaFeng(p = 2.5)  Book(p = 1)   Book(p = 2.5)  LST(p = 1)
BL SVM    66.94 (1.34)     52.6 (1.29)   60.08 (0.71)   99.99 (0.01)
EE SVM    70.4 (1.3)       55.34 (1.28)  65.4 (0.51)    99.98 (0.01)
BL LR     69.24 (1.3)      55.34 (1.27)  63.84 (0.75)   99.99 (0.01)
EE LR     70.28 (1.28)     55.49 (1.49)  65.41 (0.63)   99.97 (0.02)
BL BeSim  67.49 (1.23)     55.19 (1.27)  63.7 (0.63)    99.99 (0.01)
EE BeSim  68 (1.21)        55.21 (1.15)  64.38 (0.42)   99.99 (0)
BL NB     65.21 (1.64)     52.93 (0.9)   59.75 (0.47)   98.69 (0.3)
EE NB     70.72 (1.15)     ×             63.46 (0.61)   99.92 (0.04)

          Adver(p = [])  Adver(p = 1)  CRF(p = [])    Bank(p = [])
BL SVM    96.37 (1.94)   91.18 (2.97)  64.36 (18.97)  66.82 (0.88)
EE SVM    97.63 (1.35)   93.3 (2.14)   86.35 (9.99)   71.54 (0.76)
BL LR     97.19 (1.44)   88.51 (1.93)  81.87 (19.63)  71.43 (0.72)
EE LR     97.57 (0.96)   93.02 (2.06)  86.84 (9.62)   71.77 (0.62)
BL BeSim  97.26 (1.12)   95.38 (1.35)  86.91 (9.36)   67.85 (0.67)
EE BeSim  97.38 (1.04)   93.83 (1.35)  87.02 (10.43)  70.41 (0.55)
BL NB     93.75 (1.9)    93.37 (1.9)   87.24 (9.38)   67.83 (0.63)
EE NB     94.04 (1.75)   ×             ×              []

          Flickr(p = 0.1)  Kdd(p = 0.5)  Average Rank
BL SVM    74.92 (0.17)     74.53 (0.05)  6.44 [7]
EE SVM    79.86 (0.13)     80.98 (0.05)  2.39 [1]
BL LR     79.03 (0.11)     81.29 (0.04)  4.28 [4]
EE LR     79.85 (0.13)     80.75 (0.05)  2.61 [2]
BL BeSim  74.62 (0.13)     74.95 (0)     5.11 [6]
EE BeSim  76.4 (0.13)      77.55 (0.03)  3.61 [3]
BL NB     81.36 (0.1)      74.29 (0.05)  6.5 [8]
EE NB     []               []            5.06 [5]


6 Conclusions and future research directions

We investigated the effects of sampling and cost-sensitive learning techniques on the problem of learning from imbalanced behaviour data. This type of big data has fundamentally different properties and internal structure than traditional data and is characterized by sparseness and very large dimensions. This is a new research topic which enables benefits across a wide variety of application domains such as fraud detection, targeted advertising, predictive policing, churn prediction, default prediction, etc. To enable future developments in this field, we provide an on-line repository containing implementations, datasets and results, available at the following link: http://www.applieddatamining.com/cms?q=software

Oversampling techniques need to be adapted in order to cope with this kind of data. We proposed new versions of the traditional SMOTE and ADASYN techniques, which differ in the way synthetic samples are generated and similarity computations are performed. The plain random oversampling (OSR) and the synthetic approaches are competitive methods in terms of AUC and do not seem to suffer from overfitting, as traditional studies report. Though the synthetic approaches have comparable AUC-performances in comparison to OSR, they are useless in practice due to their computationally expensive procedures. The OSR method is very fast for small and medium sized datasets, yet it becomes inappropriate for larger datasets because the training times of the underlying base learner increase drastically.
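Plain random oversampling can be sketched as duplicating randomly chosen minority indices; the `beta` parameterization of the oversampling degree below is illustrative and not necessarily the paper's exact definition.

```python
import numpy as np

rng = np.random.default_rng(7)

def random_oversample(y, beta=1.0):
    """OSR sketch: append beta * (majority - minority) randomly duplicated
    minority indices, so beta = 1 yields a fully balanced training set."""
    min_idx = np.flatnonzero(y == 1)
    maj_idx = np.flatnonzero(y == 0)
    n_new = int(beta * (len(maj_idx) - len(min_idx)))
    extra = rng.choice(min_idx, size=n_new, replace=True)
    return np.concatenate([np.arange(len(y)), extra])

y = np.array([0] * 90 + [1] * 10)
idx = random_oversample(y, beta=1.0)
```

For sparse behaviour data the same index vector can be applied to a CSR matrix (`X[idx]`), so no dense copies are created; the cost shows up later, in the base learner's training time on the enlarged set.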

With respect to undersampling methodologies, we employed a random undersampling technique (RUS) and Knn-based informed approaches. The latter type selects majority class instances that are closest (Cl Knn) or farthest (Far Knn) from the minority class examples and allows one to investigate the effect of noise/outliers on SVM performance. We observed that these odd instances have a severe performance degrading effect. RUS and Far Knn approaches can significantly improve performance with respect to the baseline, thanks to their abilities to reduce noise/outlier levels and to put more emphasis on the minority class. Having said this, they usually perform worse than any of the oversampling approaches. This is in sheer contrast to the results obtained from traditional studies and is heavily related to the intrinsic nature of behaviour data. Timing-wise, RUS is the fastest method. The informed approaches are slower than RUS due to the required nearest neighbour computations. Additionally, a cluster-based undersampling (CBU) technique was proposed to alleviate the within-class imbalance. For low undersampling rates the performance of CBU is superior to plain RUS, whereas for high undersampling rates the opposite is true. Overall, no substantial gain in performance is obtained in comparison with a simple RUS approach. Furthermore, the method is the slowest among all techniques considered and relies on computationally expensive clustering heuristics.
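RUS itself reduces to keeping all minority indices plus a random fraction of the majority; the `beta_u` parameterization below is an illustrative stand-in for the undersampling rate used in the paper.

```python
import numpy as np

rng = np.random.default_rng(42)

def random_undersample(y, beta_u):
    """RUS sketch: keep every minority instance and a random subset of the
    majority, of size beta_u * (original majority size)."""
    min_idx = np.flatnonzero(y == 1)
    maj_idx = np.flatnonzero(y == 0)
    keep = rng.choice(maj_idx, size=max(1, int(beta_u * len(maj_idx))), replace=False)
    return np.sort(np.concatenate([min_idx, keep]))

y = np.array([0] * 90 + [1] * 10)
idx = random_undersample(y, 0.2)
```

The informed Cl/Far Knn variants would replace `rng.choice` with a selection of the majority instances ranked by their distance to the minority class, which is where the extra nearest-neighbour cost comes from.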

We studied the effects of boosting, where we made use of a very specific linear SVM and logistic regression combination that uses confidence rated predictions, instead of a more popular plain "discrete" boosting algorithm (with weak hypotheses having values in {−1, 1}). Our experiments clearly indicated the regularization constant C in the SVM formulation to act as a "weakness" indicator. Indeed, higher C-values cause stronger learners and shouldn't be used in the boosting process, which matches with the analysis of studies dealing with traditional data. Strong learners can


already quite accurately distinguish majority class from minority class behaviours. The boosting process will focus on the hard to learn instances, among which are noise/outliers, and this can result in overfitting behaviour. We also observed that boosting can usually outperform the plain baseline, indicating that a combination of weak learners can achieve higher AUC-performance than a single strong learner.

EasyEnsemble (EE) integrates random undersampling with a boosting process. By sampling several balanced subsets from the training data and combining these, we can emphasize the minority class, achieve noise/outlier reduction effects and simultaneously explore the majority class space efficiently. Our version differs from the original formulation because we employ a different type of underlying weak learner (an SVM and LR combination) that is fed to a confidence rated boosting algorithm (instead of the discrete version). We observed superior AUC-performance with respect to all of the aforementioned techniques. Furthermore, the method is very fast, since each subset can be trained in parallel and has a size that is twice as large as the minority class training set. We studied the effect of the number of subsets and noted that a limited amount of these (O(10)) is already sufficient to obtain satisfactory performances; increasing this number further causes only minor benefits.
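The subset-sample-train-combine skeleton can be sketched as follows. To keep the sketch short, a single (unboosted) L2-regularized LR is trained per subset instead of the paper's T boosting rounds of the SVM/LR combination, and the data, S, and C values are illustrative.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)

def easy_ensemble_scores(X, y, X_test, S=10):
    """EasyEnsemble-style sketch: S balanced subsets (all minority instances
    plus an equally sized random majority sample), one linear learner per
    subset, averaged decision scores."""
    min_idx = np.flatnonzero(y == 1)
    maj_idx = np.flatnonzero(y == 0)
    scores = np.zeros(len(X_test))
    for _ in range(S):
        sub = rng.choice(maj_idx, size=len(min_idx), replace=False)
        idx = np.concatenate([min_idx, sub])
        clf = LogisticRegression(penalty="l2", C=1.0, solver="liblinear").fit(X[idx], y[idx])
        scores += clf.decision_function(X_test)
    return scores / S

# Hypothetical imbalanced data (roughly 4:1) just to exercise the function.
X = rng.normal(size=(200, 5))
w = np.array([1.0, -1.0, 0.5, 0.0, 0.0])
y = (X @ w + rng.normal(scale=0.5, size=200) > 1.2).astype(int)
s = easy_ensemble_scores(X, y, X, S=5)
```

Because the S subset models are independent, the loop is trivially parallelizable, which is what makes the parallel EE variant the fastest option on large data.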

Additionally, a statistical comparison of each of the proposed methods was conducted across a wide variety of datasets. Holm's significance test indicates that each of the proposed techniques (oversampling, undersampling (except Cl Knn), cost-sensitive learning and boosting variants) can significantly outperform the plain baseline performance (each of the null-hypotheses of equal performance gets rejected with a confidence level of at least 91%). Comparison of the techniques with EE showed that the latter can significantly outperform all of them (each of the null-hypotheses of equal performance gets rejected with a confidence level of at least 75%). A timing-wise comparison showed EE to be the preferred method (after RUS) for large and highly imbalanced datasets. A parallel implementation would make EE the fastest method even for medium sized datasets that show a high level of imbalance.
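Holm's step-down procedure itself is a small computation; the sketch below shows the generic adjustment applied to a vector of p-values (the p-values are hypothetical, not the paper's test results).

```python
def holm_adjust(pvals):
    """Holm's step-down adjustment: the i-th smallest p-value is multiplied
    by (m - i), capped at 1, and monotonicity is enforced so that adjusted
    p-values never decrease along the ordering."""
    m = len(pvals)
    order = sorted(range(m), key=lambda i: pvals[i])
    adjusted = [0.0] * m
    running_max = 0.0
    for rank, i in enumerate(order):
        running_max = max(running_max, min(1.0, (m - rank) * pvals[i]))
        adjusted[i] = running_max
    return adjusted

p = [0.01, 0.04, 0.03]      # hypothetical per-comparison p-values
adj = holm_adjust(p)        # adjusted values to compare against alpha
```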

The influence of the chosen base learner was investigated by comparing the baseline to its EE-version for the SVM, LR, NB and BeSim inducers. Each of the EE-versions is able to improve upon the baseline performance for nearly all datasets, where base learners of the regularization type are superior because of their abilities to control the strength of the learner.

Though we conducted an extensive benchmark study, there are still plenty of future research directions. First, and most important, there is an urgent need to develop a more general theoretical framework to completely understand the effect of imbalance and why certain methods outperform others. Secondly, other types of mechanisms could be investigated in this setting: advanced sampling (Guo et al 2008) set-ups perform sampling based on preliminary classifications; one-class learning is a recognition based methodology where a discriminative boundary is learned around the training examples of a single class alone, and Raskutti and Kowalczyk (2004) demonstrated the superiority of the recognition based one-class SVMs over the discriminative two-class SVMs; other types of ensemble learning strategies (e.g. bagging and stacking) could be investigated; Batista et al (2004) note that combining over- and undersampling can be beneficial when data are highly imbalanced or show very few minority class instances; the ideas behind Extended Nearest Neighbour (Tang and


He 2015) could guide us in the sampling process. As opposed to Knn, "it considers not only who are the nearest neighbours of a certain sample, but also who consider the particular sample as their nearest neighbour" (Tang and He 2015). The method assigns a class membership to an unknown sample that maximizes the intra-class coherence. Finally, more advanced techniques do exist that can provide a K (the number of nearest neighbours) faster or with a (slightly) better performance.

Another issue relates to the lack of comprehensibility of the EasyEnsemble technique in its present form. Because we currently integrate Logistic Regression in the boosting process, the final ensemble classifier is non-linear and therefore incomprehensible. It might be interesting to investigate the numerical boosting technique proposed in Schapire and Singer (1999), where each weak learner is allowed to output real values (instead of values in [−1, 1]). In that case, we would be able to use a plain linear SVM as a weak learner without the Logistic Regression component. The final ensemble would be a linear model, which is surely easier to comprehend. Also, the instance-based approach of Martens and Provost (2014) could be used to explain individual predictions of the non-linear EE-technique.
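The comprehensibility argument rests on a small piece of algebra: if every weak learner is linear, h_t(x) = w_t · x + b_t, then the boosted ensemble H(x) = Σ_t α_t h_t(x) collapses to a single linear model. The sketch below checks this with hypothetical weights and mixing coefficients.

```python
import numpy as np

# Hypothetical weak learners (one row of W and one entry of b per learner)
# and boosting coefficients alpha; values are purely illustrative.
W = np.array([[0.5, -1.0],
              [1.5,  0.2]])
b = np.array([0.1, -0.3])
alpha = np.array([0.7, 0.3])

# Collapsed single linear model: w_ens = sum_t alpha_t w_t, b_ens = sum_t alpha_t b_t.
w_ens = alpha @ W
b_ens = alpha @ b

x = np.array([1.0, 2.0])
direct = sum(a * (w @ x + bb) for a, w, bb in zip(alpha, W, b))
collapsed = w_ens @ x + b_ens
```

Both evaluations agree, so an ensemble of linear SVMs combined with real-valued boosting weights would remain a single, interpretable linear model.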

A Oversampling experiments

Table 11 Oversampling experiments. Results show the average tenfold AUC test set performance with respect to increasing β-values (β = [β1, β2, β3, β4] = [0, 1/3, 2/3, 1]). Additionally, standard deviations are shown between brackets. Optimal performances for each method are highlighted in boldface. See Section 4.2 for parameter settings.

Mov G(p = 1)
         β1             β2             β3             β4
OSR      71.6 (2.62)    74.37 (2.04)   73.6 (1.84)    74.73 (2.45)
SMOTE    71.6 (2.62)    75.08 (2.18)   76.02 (2.14)   76.48 (2.3)
ADASYN   71.6 (2.62)    75.16 (1.92)   75.93 (2.08)   76.47 (2.29)

Mov G(p = 2.5)
         β1             β2             β3             β4
OSR      81.41 (1.32)   83.49 (1.81)   83.84 (1.96)   83.91 (2.04)
SMOTE    81.41 (1.32)   83.32 (1.97)   83.59 (2.04)   83.76 (2.11)
ADASYN   81.41 (1.32)   83.61 (1.82)   84.02 (1.97)   83.69 (1.96)

Mov Th(p = [])
         β1             β2             β3             β4
OSR      79.77 (5.33)   85.3 (4.66)    83.16 (4.5)    84.59 (5.69)
SMOTE    79.77 (5.33)   84.18 (6.51)   85.58 (5.97)   84.33 (5.75)
ADASYN   79.77 (5.33)   84.11 (6.77)   85.86 (6.06)   85.36 (5.13)

Yahoo A(p = 1)
         β1             β2             β3             β4
OSR      55.92 (2.97)   58.66 (3.27)   59.99 (2.28)   59.74 (1.78)
SMOTE    55.92 (2.97)   59.76 (2.62)   59.74 (2.67)   59.43 (2.4)
ADASYN   55.92 (2.97)   59.54 (2.53)   59.55 (2.94)   59.56 (2.22)

Yahoo A(p = 2.5)
         β1             β2             β3             β4
OSR      61.68 (2.42)   64.19 (3.17)   65.08 (3.26)   64.67 (2.1)
SMOTE    61.68 (2.42)   65.46 (3.63)   65.33 (3.23)   64.52 (2.98)
ADASYN   61.68 (2.42)   65.04 (3.74)   65.41 (3.47)   64.4 (2.21)

Yahoo G(p = 1)
         β1             β2             β3             β4
OSR      66.84 (3.66)   72.18 (2.36)   73.11 (2.7)    72.49 (3.41)
SMOTE    66.84 (3.66)   72.65 (2.85)   73.27 (3.36)   73.37 (3.56)
ADASYN   66.84 (3.66)   72.87 (2.83)   73.18 (3.2)    73.39 (3.59)

Yahoo G(p = 2.5)
         β1             β2             β3             β4
OSR      78.82 (1.39)   78.78 (2.02)   78.74 (1.86)   78.35 (1.47)
SMOTE    78.82 (1.39)   79.23 (1.57)   79.1 (1.2)     79.03 (1.89)
ADASYN   78.82 (1.39)   79.12 (1.43)   79.23 (1.37)   79.51 (2.01)

TaFeng(p = 1)
         β1             β2             β3             β4
OSR      55.75 (1.6)    59.23 (1.96)   60 (1.68)      61.04 (2.36)
SMOTE    55.75 (1.6)    60.26 (1.95)   61.49 (1.8)    61.13 (1.52)
ADASYN   55.75 (1.6)    60.26 (1.9)    61.44 (1.85)   61.16 (1.5)

TaFeng(p = 2.5)
         β1             β2             β3             β4
OSR      66.94 (1.34)   68.84 (1.21)   66.99 (1.42)   67.7 (1.41)
SMOTE    66.94 (1.34)   68.47 (1.5)    67.07 (1.15)   66.65 (0.81)
ADASYN   66.94 (1.34)   68.62 (1.38)   67.85 (1.6)    66.91 (1.39)

Book(p = 1)
         β1             β2             β3             β4
OSR      52.6 (1.29)    53.61 (0.94)   55.41 (1.75)   55.87 (1.44)
SMOTE    52.6 (1.29)    54.77 (0.99)   54.91 (0.8)    54.36 (0.98)
ADASYN   52.6 (1.29)    54.86 (1.13)   55.06 (0.73)   54.54 (0.92)

Book(p = 2.5)
         β1             β2             β3             β4
OSR      60.08 (0.71)   61.05 (0.94)   61.82 (1.12)   64.62 (0.57)
SMOTE    60.08 (0.71)   62.6 (0.73)    60.95 (0.68)   63 (0.8)
ADASYN   60.08 (0.71)   62.33 (0.73)   60.77 (0.85)   63.04 (0.58)

LST(p = 1)
         β1             β2             β3             β4
OSR      99.99 (0.01)   99.99 (0.01)   99.99 (0.01)   99.99 (0.01)
SMOTE    99.99 (0.01)   99.99 (0.01)   99.99 (0.01)   99.99 (0.01)
ADASYN   99.99 (0.01)   99.99 (0.01)   99.99 (0.01)   99.99 (0.01)

Adver(p = [])
         β1             β2             β3             β4
OSR      96.61 (1.82)   97.31 (1.65)   97.07 (1.84)   97.07 (1.79)
SMOTE    96.61 (1.82)   96.91 (1.66)   97.19 (1.65)   97.07 (1.91)
ADASYN   96.61 (1.82)   97.1 (1.7)     97.08 (1.87)   97.07 (1.88)

Adver(p = 1)
         β1             β2             β3             β4
OSR      90.93 (3.02)   91.27 (3.03)   92.66 (2.82)   93.29 (1.97)
SMOTE    90.93 (3.02)   92.51 (2.03)   92.96 (2.14)   93.53 (1.81)
ADASYN   90.93 (3.02)   92.22 (2.33)   92.7 (2.36)    93.88 (1.73)

CRF(p = [])
         β1             β2             β3             β4
OSR      64.06 (16.43)  80.82 (12.94)  81.28 (12.27)  81.91 (11.28)
SMOTE    64.06 (16.43)  78.64 (16.86)  82.52 (13.74)  79.32 (16.26)
ADASYN   64.06 (16.43)  78.95 (16.72)  81.19 (16.32)  79.31 (16.19)

Bank(p = [])
         β1             β2             β3             β4
OSR      66.82 (0.88)   70.1 (0.74)    71.39 (0.8)    71.47 (0.8)
SMOTE
ADASYN


B Undersampling experiments

Table 12 Undersampling experiments. Results show the average tenfold AUC test set performance with respect to increasing βu-values (βu = [βu1, βu2, βu3, βu4, βu5] = [0, 1/4, 1/2, 3/4, 1]). Optimal performances (boldface) are highlighted for each imbalance ratio. Additionally, standard deviations are shown between brackets. Cl K represents the "Closest Knn" technique; Cl T represents the "Closest tot sim" technique (similar for Far K and Far T; see Section 3.2). Parameter settings are given in Section 4.2.

Mov G(p = 1)
       βu1           βu2           βu3           βu4           βu5
RUS    71.6 (2.6)    71.83 (2.6)   72.54 (2.5)   72.39 (3.1)   70.61 (3.5)
Cl K   71.6 (2.6)    71.4 (2)      70.96 (1.9)   70.43 (2.4)   69.05 (4.1)
Cl T   71.6 (2.6)    70.28 (2.5)   66.74 (2)     66.8 (2.1)    68.18 (3.6)
Far K  71.6 (2.6)    72.36 (2.7)   71.26 (3.4)   66.57 (5.2)   53.5 (3.5)
Far T  71.6 (2.6)    72.22 (2.8)   71.63 (3.6)   64.28 (5.3)   50.88 (4.4)
CBU    72.55 (2.6)   73.28 (2.6)   73.12 (2.6)   73.84 (2.5)   73 (3.1)

Mov G(p = 2.5)
       βu1           βu2           βu3           βu4           βu5
RUS    81.41 (1.3)   81.36 (1.3)   81.78 (1.7)   82.05 (1.7)   81.6 (2.1)
Cl K   81.41 (1.3)   80.86 (1.2)   80.95 (1.6)   79.73 (2.3)   77.95 (2.3)
Cl T   81.41 (1.3)   79.9 (1.2)    78.21 (1.4)   77.87 (1.5)   77.76 (2.3)
Far K  81.41 (1.3)   80.9 (1.5)    78.17 (1.8)   74.25 (2.4)   69.79 (3.2)
Far T  81.41 (1.3)   80.86 (1.5)   77.2 (2.4)    71.16 (2.7)   62.4 (2.8)
CBU    81.53 (1.4)   81.64 (1.3)   81.29 (1.6)   81.28 (2.1)   80.34 (2.7)

Mov Th(p = [])
       βu1           βu2           βu3           βu4           βu5
RUS    79.77 (5.3)   80.32 (5.8)   81.57 (5.5)   81.86 (6.6)   81.26 (6.2)
Cl K   79.77 (5.3)   79.25 (4.5)   78.07 (5)     76.25 (6.5)   62.46 (8.5)
Cl T   79.77 (5.3)   78.4 (4.4)    72.41 (3.5)   64.66 (4.5)   60.37 (7.3)
Far K  79.77 (5.3)   84.54 (5)     83.64 (6.4)   80.02 (7.3)   56.82 (10.3)
Far T  79.77 (5.3)   85.03 (5.7)   82.68 (6.8)   75.61 (9.2)   56.77 (10.9)
CBU    80.11 (5.8)   81.17 (6)     81.08 (6.5)   84.17 (5.1)   80.96 (6.9)

Yahoo A(p = 1)
       βu1           βu2           βu3           βu4           βu5
RUS    55.92 (3)     55.57 (3.4)   56.44 (3)     55.83 (3.4)   56.37 (3.3)
Cl K   55.92 (3)     55.67 (2.4)   53.12 (2)     50.57 (1.8)   53.79 (3.5)
Cl T   55.92 (3)     55.69 (2.1)   53.35 (2.2)   50.31 (2.2)   52.35 (3.3)
Far K  55.92 (3)     57.35 (2.2)   56.92 (1.1)   56.95 (2.3)   51.18 (2)
Far T  55.92 (3)     56.93 (2.4)   54.74 (1.9)   57.01 (1.8)   51.18 (2)
CBU    58.21 (2.6)   58.45 (3.3)   58.31 (3.5)   58.39 (3.5)   56.09 (2.6)

Yahoo A(p = 2.5)
       βu1           βu2           βu3           βu4           βu5
RUS    61.68 (2.4)   62.9 (2.9)    63.62 (3.6)   63.75 (3.1)   63.19 (1.9)
Cl K   61.68 (2.4)   61.14 (2.1)   57.62 (1.6)   54.02 (1.8)   51.48 (1.4)
Cl T   61.68 (2.4)   60.89 (2.8)   58.11 (1.4)   54.4 (2.1)    51.76 (1.4)
Far K  61.68 (2.4)   63.96 (3)     62.62 (2.2)   59.61 (1.5)   56.25 (1.6)
Far T  61.68 (2.4)   63.71 (2.4)   59.72 (1.6)   57.27 (1.1)   54.47 (1.1)
CBU    62.46 (2.6)   61.85 (1.4)   61.78 (2.2)   59.94 (3)     60.1 (4)

Yahoo G(p = 1)
       βu1           βu2           βu3           βu4           βu5
RUS    66.84 (3.7)   67.85 (3.2)   68.36 (3.2)   68.23 (4)     69.9 (4.2)
Cl K   66.84 (3.7)   66.71 (2.8)   64.3 (3.6)    61.98 (3.9)   61.15 (1.9)
Cl T   66.84 (3.7)   65.79 (2.7)   63.55 (3.3)   59.21 (3.5)   61.08 (2.4)
Far K  66.84 (3.7)   66.76 (4.1)   63.84 (3.4)   65.16 (2)     48.5 (2.9)
Far T  66.84 (3.7)   66.95 (4.1)   63.48 (2.9)   65.16 (2)     48.48 (2.9)
CBU    69.68 (4.1)   70.59 (3.2)   70.64 (3.7)   70.2 (2.9)    63.35 (3.6)

Yahoo G(p = 2.5)
       βu1           βu2           βu3           βu4           βu5
RUS    78.82 (1.4)   78.91 (1.6)   78.97 (1.6)   78.61 (1.6)   77.82 (2.1)
Cl K   78.82 (1.4)   77.26 (1.5)   72.52 (1.5)   67.86 (2)     65.07 (2.7)
Cl T   78.82 (1.4)   76.83 (1)     71.99 (1.8)   67.15 (2.3)   61.1 (2.7)
Far K  78.82 (1.4)   78.26 (2.2)   74.69 (2.7)   67.22 (2.1)   60.72 (2.3)
Far T  78.82 (1.4)   77.68 (2.6)   72.44 (3)     64.94 (2.4)   59.6 (2)
CBU    75.25 (3.2)   75.22 (2.4)   74.69 (2.3)   73.07 (2.4)   70.69 (2.4)

TaFeng(p = 1)
       βu1           βu2           βu3           βu4           βu5
RUS    55.75 (1.6)   56.1 (1.6)    56.26 (1.7)   57.23 (1.7)   59.25 (2.2)
Cl K   55.75 (1.6)   55.68 (1.6)   55.58 (1.5)   55.08 (1.1)   51.05 (1.5)
Cl T   55.75 (1.6)   55.67 (1.6)   54.47 (1.6)   47.53 (1.6)   49.3 (1.1)
Far K  55.75 (1.6)   58.99 (1.2)   59.47 (1.1)   60.04 (1.2)   56.31 (1)
Far T  55.75 (1.6)   58.92 (1.3)   59.25 (1.3)   58.58 (1.1)   56.31 (1)
CBU    57.8 (1)      58.47 (1.1)   58.15 (0.9)   58.87 (1.4)   57.65 (1.6)

TaFeng(p = 2.5)
       βu1           βu2           βu3           βu4           βu5
RUS    66.94 (1.3)   67.44 (1.3)   68.1 (1.4)    68.27 (1.4)   66.13 (1.2)
Cl K   66.94 (1.3)   66.13 (1.4)   63.39 (1.2)   59.83 (1.3)   56.94 (0.7)
Cl T   66.94 (1.3)   66.38 (1.5)   62.89 (1.6)   57.46 (1.3)   54.56 (1.3)
Far K  66.94 (1.3)   68.06 (1.4)   66.43 (1.6)   64.46 (1.5)   63.35 (1.3)
Far T  66.94 (1.3)   64.31 (1.1)   62.69 (1)     61.27 (1.1)   59.03 (1)
CBU    64.81 (1.2)   64.15 (1.1)   64.13 (1.2)   63.88 (0.8)   63.46 (0.8)

Book(p = 1)
       βu1           βu2           βu3           βu4           βu5
RUS    52.6 (1.3)    52.79 (0.9)   53.46 (0.8)   53.89 (0.9)   54.05 (0.9)
Cl K   52.6 (1.3)    52.56 (1.2)   52.52 (1.3)   52.39 (1.1)   53.09 (1.1)
Cl T   52.6 (1.3)    52.56 (1.2)   52.52 (1.3)   52.39 (1.1)   53.05 (0.7)
Far K  52.6 (1.3)    55.21 (1.2)   56.21 (1.8)   56.14 (1.2)   53.06 (1)
Far T  52.6 (1.3)    55.21 (1.2)   56.21 (1.8)   56.14 (1.2)   53.06 (1)
CBU    54.28 (0.9)   53.77 (1)     53.33 (1.1)   53.34 (0.9)   52.84 (0.8)

Book(p = 2.5)
       βu1           βu2           βu3           βu4           βu5
RUS    60.08 (0.7)   60.13 (0.6)   60.4 (0.8)    60.33 (0.8)   63.28 (0.8)
Cl K   60.08 (0.7)   59.96 (0.7)   60.13 (0.8)   59.96 (1)     59.28 (0.7)
Cl T   60.08 (0.7)   59.96 (0.7)   60.13 (0.8)   60.29 (0.4)   54.5 (0.9)
Far K  60.08 (0.7)   63.29 (1)     64.19 (0.8)   57.3 (1.1)    55.66 (1.1)
Far T  60.08 (0.7)   62.14 (0.5)   58.27 (0.6)   56.37 (1)     55.66 (1.1)
CBU    54.82 (0.9)   54.67 (0.9)   54.71 (0.9)   54.66 (1)     54.78 (0.9)

LST(p = 1)
       βu1           βu2           βu3           βu4           βu5
RUS    99.99 (0)     99.99 (0)     99.99 (0)     99.98 (0)     99.99 (0)
Cl K   99.99 (0)     99.99 (0)     99.99 (0)     99.99 (0)     99.99 (0)
Cl T   99.99 (0)     99.99 (0)     99.99 (0)     99.99 (0)     99.98 (0)
Far K  99.99 (0)     99.98 (0)     99.98 (0)     99.98 (0)     99.98 (0)
Far T  99.99 (0)     99.98 (0)     99.98 (0)     99.98 (0)     99.98 (0)
CBU    []            []            []            []            []

Adver(p = [])
       βu1           βu2           βu3           βu4           βu5
RUS    96.61 (1.8)   96.32 (1.8)   96.63 (1.4)   97.12 (2.1)   96.22 (1.6)
Cl K   96.61 (1.8)   96.44 (1.5)   96.14 (1.5)   96.04 (2)     94.8 (2.5)
Cl T   96.61 (1.8)   95.87 (2.1)   94.32 (1.9)   93.01 (2.2)   90.72 (2.3)
Far K  96.61 (1.8)   96.53 (1.4)   95.76 (2)     94.39 (1.8)   90.49 (3.1)
Far T  96.61 (1.8)   96.54 (1.5)   95.67 (1.9)   94.54 (1.8)   89.3 (2.8)
CBU    96.85 (2.3)   96.85 (2.3)   97.05 (1.5)   96.6 (1.6)    96.06 (2.1)

Adver(p = 1)
       βu1           βu2           βu3           βu4           βu5
RUS    90.93 (3)     91.53 (3.1)   92.37 (3.4)   91.9 (2.9)    91.93 (2.2)
Cl K   90.93 (3)     90.64 (3)     89.87 (3.9)   90.21 (3.6)   89.18 (2)
Cl T   90.93 (3)     89.7 (3.5)    88.55 (3.4)   85.76 (3.3)   88.2 (2.3)
Far K  90.93 (3)     93.8 (2.3)    92.4 (2.6)    88.73 (3.4)   85.51 (4)
Far T  90.93 (3)     93.62 (2.4)   93.2 (2.2)    88.41 (3.6)   85.51 (4)
CBU    93.22 (2.4)   93.76 (2.5)   93.89 (2.6)   93.52 (2.7)   91.27 (2)

CRF(p = [])
       βu1            βu2            βu3            βu4            βu5
RUS    64.06 (16.4)   63.28 (15.9)   67.98 (17.4)   66.95 (21.9)   87.73 (8.8)
Cl K   64.06 (16.4)   62.44 (16.6)   62.34 (16.9)   71.37 (13.8)   78.22 (17.7)
Cl T   64.06 (16.4)   62.44 (16.6)   62.34 (16.9)   71.37 (13.8)   62.67 (22.9)
Far K  64.06 (16.4)   83.8 (14.2)    83.93 (14.8)   84.49 (13.7)   86.11 (9.7)
Far T  64.06 (16.4)   83.8 (14.2)    83.93 (14.8)   84.49 (13.7)   86.11 (9.7)
CBU    []             []             []             []             []

Bank(p = [])
       βu1           βu2           βu3           βu4           βu5
RUS    66.82 (0.9)   67.02 (0.9)   67.37 (0.8)   67.99 (0.6)   69.5 (1)
Cl K   66.82 (0.9)   66.17 (0.7)   65.24 (0.6)   64.86 (0.6)   58.53 (1.1)
Cl T   66.82 (0.9)   64.92 (1.1)   60.69 (0.9)   56.33 (0.8)   52.87 (0.7)
Far K  66.82 (0.9)   66.95 (0.6)   66.19 (0.6)   64.42 (0.6)   58.25 (1.1)
Far T  66.82 (0.9)   67.16 (0.6)   64.2 (0.8)    59.67 (1)     58.25 (1.1)
CBU    []            []            []            []            []


C Boosting experiments

This section presents the results of each of the boosting experiments from Section 4.5 on the remaining data sources. Each of the figures below depicts the average tenfold AUC-performance on test data (with µ = 100) with respect to the number of boosting iterations for (left) Adaboost (AB), AdaCost (AC) and EasyEnsemble (EE), with C chosen according to the highest validation set AUC-performance (over all possible boosting rounds), and (right) AB and EE (with S = 15) at varying C-levels.

Fig. 6 Mov G(p = 1) dataset

Fig. 7 Mov Th(p = []) dataset


Fig. 8 Yahoo A(p = 1) dataset

Fig. 9 Yahoo A(p = 2.5) dataset

Fig. 10 Yahoo G(p = 1) dataset


Fig. 11 Yahoo G(p = 2.5) dataset

Fig. 12 TaFeng(p = 1) dataset

Fig. 13 Book(p = 1) dataset


Fig. 14 LST(p = 1) dataset

Fig. 15 Adver(p = []) dataset

Fig. 16 Adver(p = 1) dataset


Fig. 17 CRF(p = []) dataset

D Final Comparison

Fig. 18 Average rank AUC versus average rank Time (see Table 9) across the large datasets from Table 2 (CRF and Bank). Regarding the former, an extra row EE par is added, having the same AUC-value as EE(S = 15). Ranks for each dataset are subsequently obtained via the procedure outlined in Section 4.6.1. Points occurring in the upper-right region are preferred.


References

Akbani R Kwek S Japkowicz N (2004) Applying support vector machines to imbalanced datasetsIn Machine Learning ECML 2004 15th European Conference on Machine Learning Pisa ItalySeptember 20-24 2004 Proceedings Springer Berlin Heidelberg Berlin Heidelberg pp 39ndash50 DOI101007978-3-540-30115-8 7

Ali A Shamsuddin SM Ralescu AL (2015) Classification with class imbalance problem A review Inter-national Journal of Advances in Soft Computing and its Applications 7(3)176ndash204

Alzahrani T Horadam KJ (2016) Community detection in bipartite networks Algorithms and case studiesIn Complex Systems and Networks Dynamics Controls and Applications Springer Berlin Heidel-berg Berlin Heidelberg pp 25ndash50 DOI 101007978-3-662-47824-0 2

Bachner J (2013) Predictive policing Preventing crime with data and analytics IBM Center for the Busi-ness of Government

Baesens B Van Gestel T Viaene S Stepanova M Suykens J Vanthienen J (2003) Benchmarking state-of-the-art classification algorithms for credit scoring Journal of the Operational Research Society54(6)627ndash635 DOI 101057palgravejors2601545

Barandela R Snchez J Garca V Rangel E (2003) Strategies for learning in class imbalance problemsPattern Recognition 36(3)849 ndash 851 DOI httpsdoiorg101016S0031-3203(02)00257-1

Barber MJ (2007) Modularity and community detection in bipartite networks Physical Review E76066102 DOI 101103PhysRevE76066102

Barua S Islam MM Yao X Murase K (2014) MWMOTEndashmajority weighted minority oversamplingtechnique for imbalanced data set learning IEEE Transactions on Knowledge and Data Engineer-ing 26(2)405ndash425 DOI 101109TKDE2012232

Batista GEAPA Prati RC Monard MC (2004) A study of the behavior of several methods for balancingmachine learning training data SIGKDD Explor Newsl 6(1)20ndash29 DOI 10114510077301007735

Beckett SJ (2016) Improved community detection in weighted bipartite networks Royal Society OpenScience 3(1) DOI 101098rsos140536

Bekkar M Djemaa HK Alitouche TA (2013) Evaluation measures for models assessment over imbalanceddata sets Journal of Information Engineering and Applications 3(10)27ndash38

Bhattacharyya S Jha S Tharakunnel K Westland JC (2011) Data mining for credit card fraud A compara-tive study Decision Support Systems 50(3)602 ndash 613 DOI httpsdoiorg101016jdss201008008

Blondel VD Guillaume JL Lambiotte R Lefebvre E (2008) Fast unfolding of communities in large net-works Journal of Statistical Mechanics Theory and Experiment 2008(10)P10008

Breiman L, Friedman J, Stone CJ, Olshen RA (1984) Classification and regression trees. Taylor & Francis

Brozovsky L, Petricek V (2007) Recommender system for online dating service. In: Proceedings of Znalosti 2007 Conference, VSB Ostrava

Cha M, Mislove A, Gummadi KP (2009) A measurement-driven analysis of information propagation in the Flickr social network. In: Proceedings of the 18th International Conference on World Wide Web, ACM, New York, NY, USA, WWW '09, pp 721–730. DOI 10.1145/1526709.1526806

Chawla NV (2005) Data mining for imbalanced datasets: An overview. In: Data mining and knowledge discovery handbook. Springer US, Boston, MA, pp 853–867

Chawla NV, Bowyer KW, Hall LO, Kegelmeyer WP (2002) SMOTE: Synthetic minority over-sampling technique. Journal of Artificial Intelligence Research 16:321–357

Chawla NV, Lazarevic A, Hall LO, Bowyer KW (2003) SMOTEBoost: Improving prediction of the minority class in boosting. In: Knowledge Discovery in Databases: PKDD 2003. Springer Berlin Heidelberg, pp 107–119

Chawla NV, Japkowicz N, Kotcz A (2004) Editorial: Special issue on learning from imbalanced data sets. SIGKDD Explor Newsl 6(1):1–6. DOI 10.1145/1007730.1007733

Chen M, Mao S, Liu Y (2014) Big data: A survey. Mobile Networks and Applications 19(2):171–209. DOI 10.1007/s11036-013-0489-0

Chyi YM (2003) Classification analysis techniques for skewed class distribution problems. Master thesis, Department of Information Management, National Sun Yat-Sen University

Demšar J (2006) Statistical comparisons of classifiers over multiple data sets. Journal of Machine Learning Research 7(Jan):1–30

Dhillon IS (2001) Co-clustering documents and words using bipartite spectral graph partitioning. In: Proceedings of the Seventh ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, ACM, New York, NY, USA, KDD '01, pp 269–274. DOI 10.1145/502512.502550


Drummond C, Holte RC (2003) C4.5, class imbalance, and cost sensitivity: Why under-sampling beats over-sampling. In: Proceedings of the ICML '03 Workshop on Learning from Imbalanced Datasets

Fan RE, Chang KW, Hsieh CJ, Wang XR, Lin CJ (2008) LIBLINEAR: A library for large linear classification. Journal of Machine Learning Research 9:1871–1874

Fan W, Stolfo SJ, Zhang J, Chan PK (1999) AdaCost: Misclassification cost-sensitive boosting. In: Proceedings of the Sixteenth International Conference on Machine Learning, Morgan Kaufmann Publishers Inc, San Francisco, CA, USA, ICML '99, pp 97–105

Fawcett T (2006) An introduction to ROC analysis. Pattern Recognition Letters 27(8):861–874. DOI 10.1016/j.patrec.2005.10.010

Finch H (2005) Comparison of distance measures in cluster analysis with dichotomous data. Journal of Data Science 3(1):85–100

Fortunato S (2010) Community detection in graphs. Physics Reports 486(3–5):75–174. DOI 10.1016/j.physrep.2009.11.002

Junque de Fortuny E, Martens D, Provost F (2014a) Predictive modeling with big data: Is bigger really better? Big Data 1(4):215–226. DOI 10.1089/big.2013.0037

Junque de Fortuny E, Stankova M, Moeyersoms J, Minnaert B, Provost F, Martens D (2014b) Corporate residence fraud detection. In: Proceedings of the 20th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, ACM, New York, NY, USA, KDD '14, pp 1650–1659. DOI 10.1145/2623330.2623333

Frasca M, Bertoni A, Re M, Valentini G (2013) A neural network algorithm for semi-supervised node label learning from unbalanced data. Neural Networks 43:84–98. DOI 10.1016/j.neunet.2013.01.021

Friedman M (1937) The use of ranks to avoid the assumption of normality implicit in the analysis of variance. Journal of the American Statistical Association 32(200):675–701

García E, Lozano F (2007) Boosting support vector machines. In: Machine Learning and Data Mining in Pattern Recognition: 5th International Conference, MLDM 2007, Leipzig, Germany, July 18-20, Post Proceedings. IBaI publishing, pp 153–167

Goldstein M, Uchida S (2016) A comparative evaluation of unsupervised anomaly detection algorithms for multivariate data. PLOS ONE 11(4):1–31. DOI 10.1371/journal.pone.0152173

González PC, Velásquez JD (2013) Characterization and detection of taxpayers with false invoices using data mining techniques. Expert Systems with Applications 40(5):1427–1436. DOI 10.1016/j.eswa.2012.08.051

Guimerà R, Sales-Pardo M, Amaral LAN (2007) Module identification in bipartite and directed networks. Physical Review E 76:036102. DOI 10.1103/PhysRevE.76.036102

Guo H, Viktor HL (2004) Learning from imbalanced data sets with boosting and data generation: The DataBoost-IM approach. SIGKDD Explor Newsl 6(1):30–39. DOI 10.1145/1007730.1007736

Guo X, Yin Y, Dong C, Yang G, Zhou G (2008) On the class imbalance problem. In: 2008 Fourth International Conference on Natural Computation, IEEE, vol 4, pp 192–201. DOI 10.1109/ICNC.2008.871

Han H, Wang WY, Mao BH (2005) Borderline-SMOTE: A new over-sampling method in imbalanced data sets learning. In: Advances in Intelligent Computing. Springer Berlin Heidelberg, pp 878–887

He H, Garcia EA (2009) Learning from imbalanced data. IEEE Transactions on Knowledge and Data Engineering 21(9):1263–1284. DOI 10.1109/TKDE.2008.239

He H, Bai Y, Garcia EA, Li S (2008) ADASYN: Adaptive synthetic sampling approach for imbalanced learning. In: 2008 IEEE International Joint Conference on Neural Networks (IEEE World Congress on Computational Intelligence), IEEE, pp 1322–1328. DOI 10.1109/IJCNN.2008.4633969

Holm S (1979) A simple sequentially rejective multiple test procedure. Scandinavian Journal of Statistics 6(2):65–70

Hsu CW, Lin CJ (2002) A comparison of methods for multiclass support vector machines. IEEE Transactions on Neural Networks 13(2):415–425. DOI 10.1109/72.991427

Huang A (2008) Similarity measures for text document clustering. In: Proceedings of the Sixth New Zealand Computer Science Research Student Conference (NZCSRSC2008), Christchurch, New Zealand, pp 49–56

Iman RL, Davenport JM (1980) Approximations of the critical region of the Friedman statistic. Communications in Statistics – Theory and Methods 9(6):571–595

Jo T, Japkowicz N (2004) Class imbalances versus small disjuncts. ACM SIGKDD Explor Newsl 6(1):40–49. DOI 10.1145/1007730.1007737


Jutla IS, Jeub LG, Mucha PJ (2011–2016) A generalized Louvain method for community detection implemented in MATLAB. URL http://netwiki.amath.unc.edu/GenLouvain

Kubat M, Matwin S (1997) Addressing the curse of imbalanced training sets: One-sided selection. In: Proceedings of the Fourteenth International Conference on Machine Learning, Morgan Kaufmann Publishers Inc, San Francisco, CA, USA, pp 179–186

Lancichinetti A, Fortunato S (2009) Community detection algorithms: A comparative analysis. Physical Review E 80:056117. DOI 10.1103/PhysRevE.80.056117

Larremore DB, Clauset A, Jacobs AZ (2014) Efficiently inferring community structure in bipartite networks. Physical Review E, Statistical, Nonlinear, and Soft Matter Physics 90:012805. DOI 10.1103/PhysRevE.90.012805

Li J, Fine JP (2010) Weighted area under the receiver operating characteristic curve and its application to gene selection. Journal of the Royal Statistical Society: Series C (Applied Statistics) 59(4):673–692. DOI 10.1111/j.1467-9876.2010.00713.x

Li X, Wang L, Sung E (2008) AdaBoost with SVM-based component classifiers. Engineering Applications of Artificial Intelligence 21(5):785–795. DOI 10.1016/j.engappai.2007.07.001

Lichman M (2013) UCI machine learning repository. URL http://archive.ics.uci.edu/ml

Liu W, Chawla S, Cieslak DA, Chawla NV (2010) A robust decision tree algorithm for imbalanced data sets. In: Proceedings of the Tenth SIAM International Conference on Data Mining, SIAM, Philadelphia, vol 10, pp 766–777

Liu XY, Wu J, Zhou ZH (2009) Exploratory undersampling for class-imbalance learning. IEEE Transactions on Systems, Man, and Cybernetics, Part B (Cybernetics) 39(2):539–550. DOI 10.1109/TSMCB.2008.2007853

Luts J, Ojeda F, Van de Plas R, De Moor B, Van Huffel S, Suykens JA (2010) A tutorial on support vector machine-based methods for classification problems in chemometrics. Analytica Chimica Acta 665(2):129–145. DOI 10.1016/j.aca.2010.03.030

Macskassy SA, Provost F (2007) Classification in networked data: A toolkit and a univariate case study. Journal of Machine Learning Research 8(May):935–983

Martens D, Provost F (2014) Explaining data-driven document classifications. MIS Quarterly 38(1):73–100

Martens D, Provost F, Clark J, Junque de Fortuny E (2016) Mining massive fine-grained behavior data to improve predictive analytics. MIS Quarterly 40(4):869–888

Mazurowski MA, Habas PA, Zurada JM, Lo JY, Baker JA, Tourassi GD (2008) Training neural network classifiers for medical decision making: The effects of imbalanced datasets on classification performance. Neural Networks 21(2–3):427–436. DOI 10.1016/j.neunet.2007.12.031

Mease D, Wyner AJ, Buja A (2007) Boosted classification trees and class probability/quantile estimation. Journal of Machine Learning Research 8:409–439

Nemenyi P (1963) Distribution-free multiple comparisons. Dissertation, Princeton University

Newman MEJ, Girvan M (2004) Finding and evaluating community structure in networks. Physical Review E 69:026113. DOI 10.1103/PhysRevE.69.026113

Ng AY (2004) Feature selection, L1 vs L2 regularization, and rotational invariance. In: Proceedings of the Twenty-first International Conference on Machine Learning, ACM, New York, NY, USA, ICML '04, pp 78–. DOI 10.1145/1015330.1015435

Ng AY, Jordan MI (2002) On discriminative vs generative classifiers: A comparison of logistic regression and naive Bayes. In: Advances in Neural Information Processing Systems 14, MIT Press, pp 841–848

Ngai E, Hu Y, Wong Y, Chen Y, Sun X (2011) The application of data mining techniques in financial fraud detection: A classification framework and an academic review of literature. Decision Support Systems 50(3):559–569. DOI 10.1016/j.dss.2010.08.006

Platt JC (1999) Probabilistic outputs for support vector machines and comparisons to regularized likelihood methods. In: Advances in Large-Margin Classifiers, MIT Press, pp 61–74

Porter MA, Onnela JP, Mucha PJ (2009) Communities in networks. Notices of the American Mathematical Society 56(9):1082–1097

Provost F, Fawcett T (2013) Data Science for Business: What you need to know about data mining and data-analytic thinking. O'Reilly Media, Inc

Provost F, Dalessandro B, Hook R, Zhang X, Murray A (2009) Audience selection for on-line brand advertising: Privacy-friendly social network targeting. In: Proceedings of the 15th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, ACM, New York, NY, USA, KDD '09, pp 707–716. DOI 10.1145/1557019.1557098


Raskutti B, Kowalczyk A (2004) Extreme re-balancing for SVMs: A case study. SIGKDD Explor Newsl 6(1):60–69. DOI 10.1145/1007730.1007739

Rosvall M, Bergstrom CT (2008) Maps of random walks on complex networks reveal community structure. Proceedings of the National Academy of Sciences 105(4):1118–1123. DOI 10.1073/pnas.0706851105

Schapire RE (1999) A brief introduction to boosting. In: Proceedings of the 16th International Joint Conference on Artificial Intelligence – Volume 2, Morgan Kaufmann Publishers Inc, San Francisco, CA, USA, IJCAI '99, pp 1401–1406

Schapire RE, Singer Y (1999) Improved boosting algorithms using confidence-rated predictions. Machine Learning 37(3):297–336. DOI 10.1023/A:1007614523901

Shmueli G (2017) Analyzing behavioral big data: Methodological, practical, ethical, and moral issues. Quality Engineering 29(1):57–74. DOI 10.1080/08982112.2016.1210979

Sobhani P, Viktor H, Matwin S (2015) Learning from imbalanced data using ensemble methods and cluster-based undersampling. In: New Frontiers in Mining Complex Patterns: Third International Workshop, NFMCP 2014, Held in Conjunction with ECML-PKDD 2014, Nancy, France, September 19, 2014, Revised Selected Papers. Springer International Publishing, Cham, pp 69–83. DOI 10.1007/978-3-319-17876-9_5

Stankova M (2016) Classification within network data with a bipartite structure. Dissertation, University of Antwerp

Stankova M, Martens D, Provost F (2015) Classification over bipartite graphs through projection. Working Papers 2015001, University of Antwerp, Faculty of Applied Economics

Sun Y, Kamel MS, Wong AK, Wang Y (2007) Cost-sensitive boosting for classification of imbalanced data. Pattern Recognition 40(12):3358–3378. DOI 10.1016/j.patcog.2007.04.009

Suykens JA, Van Gestel T, De Brabanter J, De Moor B, Vandewalle J (2002) Least squares support vector machines. World Scientific, Singapore

Tang B, He H (2015) ENN: Extended nearest neighbor method for pattern recognition [research frontier]. IEEE Computational Intelligence Magazine 10(3):52–60. DOI 10.1109/MCI.2015.2437512

Tang Y, Zhang YQ, Chawla NV, Krasser S (2009) SVMs modeling for highly imbalanced classification. IEEE Transactions on Systems, Man, and Cybernetics, Part B (Cybernetics) 39(1):281–288. DOI 10.1109/TSMCB.2008.2002909

Tobback E, Moeyersoms J, Stankova M, Martens D (2016) Bankruptcy prediction for SMEs using relational data. Working Paper 2016004, University of Antwerp, Faculty of Applied Economics

Verbeke W, Dejaeger K, Martens D, Hur J, Baesens B (2012) New insights into churn prediction in the telecommunication sector: A profit driven data mining approach. European Journal of Operational Research 218(1):211–229. DOI 10.1016/j.ejor.2011.09.031

Veropoulos K, Campbell I, Cristianini N (1999) Controlling the sensitivity of support vector machines. In: Proceedings of the International Joint Conference on Artificial Intelligence, Stockholm, Sweden (IJCAI99), pp 55–60

Whitrow C, Hand DJ, Juszczak P, Weston D, Adams NM (2009) Transaction aggregation as a strategy for credit card fraud detection. Data Mining and Knowledge Discovery 18(1):30–55. DOI 10.1007/s10618-008-0116-z

Wickramaratna J, Holden SB, Buxton BF (2001) Performance degradation in boosting. In: Proceedings of the Second International Workshop on Multiple Classifier Systems, Springer, London, UK, MCS '01, pp 11–21

Yen SJ, Lee YS (2009) Cluster-based under-sampling approaches for imbalanced data distributions. Expert Systems with Applications 36(3, Part 1):5718–5727. DOI 10.1016/j.eswa.2008.06.108

Yu HF, Lo HY, Hsieh HP, Lou JK, McKenzie TG, Chou JW, Chung PH, Ho CH, Chang CF, Wei YH, et al (2010) Feature engineering and classifier ensemble for KDD Cup 2010. In: Proceedings of the KDD Cup 2010 Workshop, pp 1–16

Zha H, He X, Ding C, Simon H, Gu M (2001) Bipartite graph partitioning and data clustering. In: Proceedings of the Tenth International Conference on Information and Knowledge Management, ACM, New York, NY, USA, CIKM '01, pp 25–32. DOI 10.1145/502585.502591

Zhang J, Mani I (2003) kNN approach to unbalanced data distributions: A case study involving information extraction. In: Proceedings of the ICML 2003 Workshop on Learning from Imbalanced Datasets, Washington, DC

Ziegler CN, McNee SM, Konstan JA, Lausen G (2005) Improving recommendation lists through topic diversification. In: Proceedings of the 14th International Conference on World Wide Web, ACM, New York, NY, USA, WWW '05, pp 22–32. DOI 10.1145/1060745.1060754


ing noise and the overlap that is introduced from the sampling schemes (He and Garcia 2009). Representative work in this area includes the one-sided selection (OSS) method (Kubat and Matwin 1997) and the techniques discussed in Batista et al (2004) (e.g. SMOTE with Tomek links). The literature on sampling techniques does not end there. Inspired by the success of boosting algorithms (and ensemble learning in general), sampling techniques have been integrated into this process. Examples of these methods are SMOTEBoost (Chawla et al 2003), JOUS-Boost (Mease et al 2007), DataBoost-IM (Guo and Viktor 2004), BalanceCascade (Liu et al 2009) and EasyEnsemble (Liu et al 2009). EasyEnsemble is briefly described in Section 3.3.3.
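To make the EasyEnsemble idea concrete, a minimal sketch is given below, assuming scikit-learn-style estimators. The paper feeds each balanced subset to a boosting process; for brevity this sketch trains a single logistic regression per subset, and the function name and defaults are ours, not the authors'.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

def easy_ensemble_scores(X, y, X_test, n_subsets=4, seed=0):
    """EasyEnsemble-style scoring: draw several balanced subsets by
    undersampling the majority class, train one learner per subset
    and average the resulting decision scores."""
    rng = np.random.default_rng(seed)
    minority = np.flatnonzero(y == 1)
    majority = np.flatnonzero(y == 0)
    scores = np.zeros(X_test.shape[0])
    for _ in range(n_subsets):
        # each subset is only twice as large as the minority class
        sub = rng.choice(majority, size=minority.size, replace=False)
        idx = np.concatenate([minority, sub])
        clf = LogisticRegression().fit(X[idx], y[idx])
        scores += clf.decision_function(X_test)
    return scores / n_subsets
```

Because the subsets are drawn independently, the per-subset learners can be trained in parallel, which is where the speed advantage mentioned in the abstract comes from.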

Cost-sensitive learning makes use of cost matrices or misclassification costs to describe the penalty of misclassifying a certain instance. By incorporating misclassification costs, one can put a larger emphasis on the minority class. Indeed, misclassification costs of minority class examples are usually much larger than misclassification costs of majority class instances. The overall goal of cost-sensitive learning is "to develop a hypothesis that minimizes the overall cost on the training data set" (He and Garcia 2009) and can be accomplished in two ways. The first set of techniques uses cost-sensitive boosting variants, which basically incorporate misclassification costs in the weight updating scheme of the AdaBoost (Schapire 1999) algorithm. Popular techniques include AdaC1, AdaC2, AdaC3 (Sun et al 2007) and AdaCost (Fan et al 1999). A more detailed discussion on AdaBoost and AdaCost is presented in Sections 3.3.1 and 3.3.2, respectively. The second class of methods integrates misclassification costs directly into the underlying classifier; these methods are specifically tailored to the type of inducer that is being used to construct the hypothesis. This basically means that the learning algorithm itself is made cost-sensitive. Barandela et al (2003) present a cost-sensitive kNN classifier by using a weighted distance function that emphasizes the minority class. A cost-sensitive support vector machine (SVM) is another example of this category of techniques; see the related discussion in Section 3.3.2 for more details. Also worth mentioning are methods that use a combination of cost-sensitive learning with one of the sampling techniques previously described: Akbani et al (2004) first apply SMOTE to generate synthetic minority class instances (yet the dataset is still imbalanced after applying SMOTE), after which a cost-sensitive SVM is used.
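As an illustration of the second class of methods, per-class misclassification costs can be injected into a linear SVM by making the regularization parameter C class-dependent. The sketch below assumes scikit-learn's LinearSVC (a wrapper around LIBLINEAR); the 10:1 cost ratio and the toy data are arbitrary choices for illustration only.

```python
import numpy as np
from sklearn.svm import LinearSVC

rng = np.random.default_rng(0)
# Imbalanced toy data: 200 majority (label 0) vs 20 minority (label 1).
X = np.vstack([rng.normal(0, 1, (200, 5)), rng.normal(1, 1, (20, 5))])
y = np.array([0] * 200 + [1] * 20)

# Cost-sensitive SVM: errors on the minority class are penalised 10x
# more, i.e. C is multiplied per class via class_weight.
weighted = LinearSVC(C=1.0, class_weight={0: 1.0, 1: 10.0}).fit(X, y)
plain = LinearSVC(C=1.0).fit(X, y)
```

The weighted model shifts the separating hyperplane toward the majority class, so it typically flags more instances as minority than the unweighted one.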

The previously outlined sampling and cost-sensitive learning procedures are geared toward solving the between-class imbalance problem, where the number of majority class instances dominates the number of minority class examples. The vast majority of research focuses on this issue (Ali et al 2015; Guo et al 2008) and ignores the possible occurrence of within-class imbalance. This type of imbalance occurs when different subconcepts (clusters) within a single class have an entirely different amount of representatives. Jo and Japkowicz (2004) were the first to note that small disjuncts - clusters containing a limited amount of instances - can be responsible for performance degradation, since learning algorithms have difficulties in picking up on these concepts. They proposed a cluster-based oversampling method (CBO) where majority class instances and minority class instances are clustered separately. Afterwards, all clusters are oversampled in such a way that there is no more between-class and within-class imbalance. Cluster-based undersampling approaches have also been proposed. The undersampling based on clustering (SBC) technique (Yen and Lee 2009)


clusters all training data from both classes simultaneously and subsequently selects majority class instances from each cluster, in accordance with the ratio of the number of majority class samples to the number of minority class samples within the cluster under consideration. Sobhani et al (2015) solve the between-class and within-class imbalance problem by clustering the majority class instances and selecting an equal amount of majority representatives from each cluster, such that the total number of selected majority instances equals the total amount of minority class instances.
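The cluster-based undersampling idea of Sobhani et al (2015) can be sketched as follows, assuming scikit-learn's KMeans; the function name, cluster count and random seeds are illustrative choices of ours, not the authors' exact settings.

```python
import numpy as np
from sklearn.cluster import KMeans

def cluster_undersample(X, y, n_clusters=5, seed=0):
    """Cluster the majority class and draw an (approximately) equal
    number of examples from each cluster, so that the selected
    majority total matches the minority class size."""
    rng = np.random.default_rng(seed)
    minority = np.flatnonzero(y == 1)
    majority = np.flatnonzero(y == 0)
    labels = KMeans(n_clusters=n_clusters, n_init=10,
                    random_state=seed).fit_predict(X[majority])
    per_cluster = max(1, minority.size // n_clusters)
    keep = []
    for c in range(n_clusters):
        members = majority[labels == c]
        take = min(per_cluster, members.size)
        keep.append(rng.choice(members, size=take, replace=False))
    idx = np.concatenate([minority, *keep])
    return X[idx], y[idx]
```

Sampling evenly across clusters preserves the subconcept structure of the majority class, which plain random undersampling can destroy.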

The aforementioned techniques have all been investigated on traditional low-dimensional data, usually on datasets from the UCI repository. However, little is known on the effect of imbalance on the performance of classification algorithms when dealing with imbalanced behaviour data. An extensive discussion on behaviour data and its distinction from 'traditional' data is provided in Section 2.1. The classification techniques that are used for behaviour data can roughly be characterized as heuristic and regularization-based approaches, as discussed in more detail in Stankova et al (2015). Note that many classification techniques designed for traditional data cannot be directly applied to behaviour type of data. This is because tailored techniques need to be developed that have a low computational complexity and take the sparseness structure into account. Frasca et al (2013) present a cost-sensitive neural network algorithm designed to solve the imbalanced semi-supervised unigraph classification problem in the area of gene function prediction (GFP). They note that many algorithms designed for graph learning show a decay in the quality of the solutions when the input data are imbalanced.

1.2 Goals and contributions

The main goal of this article is to investigate the effect of a large variety of over- and undersampling methods, boosting variants and cost-sensitive learning techniques - which are traditionally applied to low-dimensional, dense data - on the problem of learning from imbalanced behaviour datasets. This is our first contribution, in the sense that we are the first to perform such an investigation. The base learner we will consider is a SVM1, for reasons indicated in Section 2.2. This learner will be applied on a large diversity of behaviour data gathered from several distinct domains. As will be discussed in Section 2.1, behaviour data have a different structure and show distinct characteristics compared to traditional data. As a direct consequence, the conclusions and results drawn from previous studies dealing with imbalanced learning cannot be generalized to this specific type of data. By conducting targeted experiments, we gain insights on the possible occurrence of overfitting during oversampling, dwell upon the presence of noise/outliers and its influence on the performance of undersampling methods and boosting variants, and finally gain knowledge on the effect of weak/strong learners in applying boosting algorithms. This investigation allows us to confirm or contradict these findings with the results obtained from studies dealing with 'traditional data'. Throughout the text, we will highlight the main differences.

The subgoal of this paper is to provide a benchmark for future studies in this domain. As we conduct a comprehensive comparison of the aforementioned techniques

1 In Section 5, we will investigate the effect of several base learners.


in terms of predictive performance and timings, researchers in this field can easily assess and compare their proposed techniques with the methods we have explored in this article. To enable this process, we provide implementations, datasets and results in our on-line repository: http://www.applieddatamining.com/cms?q=software

A third contribution lies in the fact that the applied sampling methods need to be adapted to cope with behaviour data. In Section 3.1, we provide specific implementations of the SMOTE and ADASYN algorithms and highlight the differences with their original formulations. The informed undersampling techniques presented in Section 3.2 rely on nearest neighbour computations, which require a different similarity measure as opposed to 'traditional data'.
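As a sketch of the kind of nearest neighbour computation involved: for sparse binary behaviour data, a cosine-style similarity over the active features is a natural choice (Euclidean distance is dominated by the shared zeros). The helper below is illustrative only and not the paper's implementation.

```python
import numpy as np
from scipy.sparse import csr_matrix
from sklearn.metrics.pairwise import cosine_similarity

# Toy behaviour matrix: 4 users x 5 items (1 = user acted on item).
X = csr_matrix(np.array([
    [1, 1, 0, 0, 0],
    [1, 1, 1, 0, 0],
    [0, 0, 0, 1, 1],
    [0, 0, 1, 1, 1],
]))

def k_nearest(X, i, k=1):
    """Indices of the k most cosine-similar rows to row i (excluding i)."""
    sims = cosine_similarity(X[i], X).ravel()
    sims[i] = -np.inf          # do not return the query row itself
    return np.argsort(-sims)[:k]
```

Users 0 and 1 share two items and are each other's neighbours, while they have zero similarity to users 2 and 3.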

Traditional studies integrating boosting with SVM-based component classifiers typically use a RBF kernel (Wickramaratna et al 2001; García and Lozano 2007; Li et al 2008). The specific boosting algorithm we propose in Section 3.3.1 combines a linear SVM with a logistic regression (LR) to form a single confidence-rated base learner. This specific kind of weak learner has, to our knowledge, never been proposed in earlier studies, yet proves to be very valuable in the setting of behaviour data.
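The general shape of such a combined learner can be sketched as follows: a linear SVM supplies the decision boundary, and a logistic regression maps its output scores to confidences (in the spirit of Platt 1999). This is an illustrative sketch assuming scikit-learn, not the authors' exact formulation; the class name is ours.

```python
import numpy as np
from sklearn.svm import LinearSVC
from sklearn.linear_model import LogisticRegression

class SVMConfidenceLearner:
    """Linear SVM whose output scores are passed through a logistic
    regression to obtain confidence-rated predictions."""

    def fit(self, X, y, C=1.0):
        self.svm = LinearSVC(C=C).fit(X, y)
        scores = self.svm.decision_function(X).reshape(-1, 1)
        # One-dimensional LR on the SVM scores yields calibrated
        # confidences, as required by confidence-rated boosting.
        self.lr = LogisticRegression().fit(scores, y)
        return self

    def predict_proba(self, X):
        scores = self.svm.decision_function(X).reshape(-1, 1)
        return self.lr.predict_proba(scores)
```

Confidence-rated boosting variants (Schapire and Singer 1999) need real-valued outputs rather than hard labels, which is what the LR stage provides here.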

Our fifth contribution lies in the exploration of a larger part of the parameter space. Chawla et al (2002) and He et al (2008) both employ K = 5 nearest neighbours in their experiments. Zhang and Mani (2003) present, among others, the Near-miss1 undersampling method, where majority class instances are selected based on their average distance to the three closest minority examples. Liu et al (2009) investigate their proposed EasyEnsemble algorithm (see Section 3.3.3 for a short description) using S = 4 subsets and T = 10 rounds of boosting. All of the aforementioned parameter settings are chosen without a clear motivation. In our experiments, we will consider a larger proportion of parameter space by varying, for instance, the number of nearest neighbours used in oversampling and undersampling methods, the number of subsets and boosting rounds in EasyEnsemble, etc. In doing this, we can more accurately compare distinct methods dealing with imbalanced learning. Parameter settings for each of the methods proposed in this study are shown in Section 4.2. Furthermore, we also study the effect of the number of subsets used in EasyEnsemble. As mentioned before, the authors only use 4 subsets.

The studies mentioned in the literature overview of Section 1.1 usually postulate a new method and compare it to some closely related variants by performing experiments over several datasets and reporting performance outcomes, without any further statistical grounding or interpretation. In this article, we provide statistical evidence by conducting hypothesis tests, which enables a more meaningful comparison.

2 Preliminaries

2.1 Behaviour data

The last decades have witnessed an explosion in data collection, data storage and data processing capabilities, leading to significant advances across various domains (Chen et al 2014). With the current technology, it is now possible to capture specific


conducts of persons or organisations. Following the definition of Shmueli (2017), behaviour data models fine-grained actions and/or interactions of entities, such as persons or objects. To distinguish 'traditional' data from behaviour data, it is appropriate to introduce the framework proposed in Junque de Fortuny et al (2014a): traditional predictive analytics describes instances by using a small number of features (dozens up to hundreds). These datasets are usually dense in nature, meaning that each instance has a non-trivial value for each of its features (or at least for most of them). Recent years have witnessed a growing number of applications making use of behaviour data, which reflect specific behaviours of individuals or organisations. Think for example about users liking certain pages on Facebook, visiting certain websites or making transactions with specific merchants, and organisations interacting with suppliers/clients. Such data are high-dimensional, containing thousands or even millions of features. Indeed, the amount of websites a user can visit, the number of Facebook pages a user can like or the number of unique merchants a user can transact with are enormous, ranging up to millions. A key characteristic is the sparse nature of behaviour data, resulting in a sparse matrix representation (see further in this section). The majority of attributes have a trivial value of "zero" (or not present), or as Junque de Fortuny et al (2014a) formulate it, "people only have a limited amount of behavioural capital". To summarize: behaviour data are very high-dimensional (10^4 - 10^8) and sparse, and are mostly originating from capturing the fine-grained behaviours of persons or companies2.

Besides the differences in structure/representation, behaviour data also show distinct properties in comparison to traditional data. Junque de Fortuny et al (2014a) proved empirically that larger behaviour datasets, in terms of the number of instances or features3, result in significant performance gains. This is in sheer contrast to the literature on sampling (reducing the number of instances) or feature selection that is commonly applied to traditional type of data, where usually there are a large number of irrelevant features that increase the variance and the opportunity to overfit (Provost and Fawcett 2013). Behaviour type of data show a different "relevance structure" in the sense that most of the features provide a small, though relevant, amount of additional information about the target prediction (Junque de Fortuny et al 2014a). Furthermore, the instances and features show a power-law distribution (Stankova et al 2015): the vast amount of instances have a low number of non-zero (active) features, and conversely, the majority of features are only present for a few instances. In her doctoral dissertation, Stankova (2016) showed that applying non-linear classifiers to behaviour type of data does not improve predictive performance in comparison to the plain application of linear techniques. This is definitely a major contrast to prior literature dealing with dense datasets, where linear methods generally have a lower predictive performance compared to highly non-linear techniques (Baesens et al 2003).

Imbalanced behaviour data occur naturally across a wide range of applications; some examples include companies employing data on the specific websites a user visits for targeted online advertising (Provost et al 2009), where usually only a relatively low amount of people respond positively to the advertisement. Data on the individual merchants with whom one transacts can be used to detect credit card fraud. The study of Bhattacharyya et al (2011), which uses real-life credit card transactional data, mentions that of the 49,858,600 transactions considered, only 2,420 were fraudulent (0.005%). Besides these marketing and fraud domains, other areas can be considered, such as churn prediction (Verbeke et al 2012), default prediction (Tobback et al 2016) and predictive policing (Bachner 2013). Despite the abundant application domains, little is known on the effect of imbalance on the performance of classification algorithms when dealing with this kind of data. Needless to say, empirical and theoretical developments in this field can give rise to major benefits for academia, companies and governments.

2 The last requirement is not strictly necessary when we talk about behaviour data in our study (the sparseness and high-dimensionality properties are sufficient).

3 In this sparse setting, instance removal or feature selection are in a certain sense equivalent to one another.

Behaviour data can be represented as a sparse matrix or, equivalently, as a bipartite graph. In the matrix representation, each instance i corresponds to a single row, and columns correspond to specific features. Let's take the example of users rating films, where we wish to predict the gender of each user. In that case, each user corresponds to a row in the matrix and each specific film corresponds to a single column. Alternatively, behaviour data can also be represented as a bipartite graph (Stankova et al 2015). Consider a set of bottom nodes NB and a set of top nodes NT. A bipartite graph is a network that contains edges only between nodes of a different type: each bottom node nb ∈ NB can only be connected to nodes nt ∈ NT. Returning to our example, each user i corresponds with a bottom node nbi and each film j corresponds with a top node ntj.
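The equivalence between the two representations is easy to see in code: each edge (user, film) of the bipartite graph becomes one non-zero cell of the sparse matrix. A minimal sketch with scipy (the toy edge list is our own illustration):

```python
import numpy as np
from scipy.sparse import csr_matrix

# Bipartite behaviour data as (user, film) edges: user i rated film j.
edges = [(0, 0), (0, 2), (1, 1), (2, 0), (2, 3)]
rows, cols = zip(*edges)
n_users, n_films = 3, 4

# Sparse matrix view: one row per user (bottom node), one column per
# film (top node); only the non-zero cells are stored.
X = csr_matrix((np.ones(len(edges)), (rows, cols)),
               shape=(n_users, n_films))
```

Only the five edges are stored, regardless of how many of the n_users × n_films cells are zero, which is what makes the representation scale to millions of features.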

Imbalanced behaviour data occur when the collection of instances contains substantially more examples of one class as compared to the other. In our running example of the previous paragraph, this means that there might be far more males rating films than females. In the datasets presented in Section 4.1, each of the instances contains a label indicating the particular class the example belongs to. In practice, instances do not necessarily have a label associated with them, and the goal is to infer a label/score for these unknown instances based on the examples with known labels. Note that we focus on the two-class classification problem, where labels are limited to two types. A multi-class classifier can be obtained by solving a sequence of binary classification problems. This can for instance be accomplished with one-versus-one or one-versus-all setups (Hsu and Lin 2002).
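A one-versus-all setup reduces the multi-class case to one binary problem per class and predicts the class whose binary scorer is most confident. A minimal sketch, with logistic regression standing in for an arbitrary binary scorer (function names are ours):

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

def one_vs_all_fit(X, y):
    """One-versus-all: train one binary scorer per class."""
    return {c: LogisticRegression().fit(X, (y == c).astype(int))
            for c in np.unique(y)}

def one_vs_all_predict(models, X):
    """Predict the class whose binary scorer outputs the highest score."""
    classes = sorted(models)
    scores = np.column_stack([models[c].decision_function(X)
                              for c in classes])
    return np.asarray(classes)[scores.argmax(axis=1)]
```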

2.2 Support vector machines (SVMs)

The support vector machine (SVM) is a "state-of-the-art" classification technique that has been applied in a wide variety of domains (Suykens et al 2002). The training data consist of a set of d-dimensional input vectors x_i with corresponding labels y_i ∈ {−1,1}, i = 1,…,m. A linear SVM constructs a linear classifier (hyperplane) in the input space. The hyperplane is constructed in such a way that as few points as possible are wrongfully classified, while simultaneously maximizing the margin between the two classes. This trade-off between minimizing the model complexity and reducing misclassification is governed by the regularization parameter C. The linear SVM is

8 Jellis Vanhoeyveld David Martens

the solution to the following quadratic optimization problem:

min_{w,b,ξ_i}  (1/2) w^T w + C Σ_{i=1}^{m} ξ_i
s.t.  y_i(w^T x_i + b) ≥ 1 − ξ_i,  i = 1,…,m
      ξ_i ≥ 0,  i = 1,…,m                                    (1)

where b represents the bias and ξ_i are the slack variables measuring classification errors. The classifier is given by y(x) = sign(w^T x + b), where w^T x + b represents the output score. It should be noted that we usually solve the dual form of equation (1), with solution y(x) = sign[Σ_{i=1}^{m} (α_i y_i x^T x_i) + b] and dual variables^4 α_i. An overview of the issues regarding SVMs and imbalanced classification for low-dimensional data can be found in Akbani et al (2004). Briefly, because of the imbalance, the majority of the slack variables ξ_i represent errors with respect to the majority class. This means that the minority class is under-represented and has a minor contribution to the goal function.

In this study, we have opted for a linear SVM (using the LIBLINEAR (Fan et al 2008) package) as the base learner to classify behaviour data^5. The feature vector has a sparse and high-dimensional representation with d = |N_T|. Note that the SVM formulation does not change in this setting; the solution vector w simply becomes high-dimensional. The kernel matrix contains inner products of sparse and high-dimensional vectors, resulting in a (possibly) sparse matrix.
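The efficiency gained from sparseness can be illustrated with a small sketch (our own toy helper, not part of LIBLINEAR): an inner product between two sparse instances only needs to touch the stored non-zero entries.

```python
def sparse_dot(a, b):
    # a, b: sparse vectors as {feature_index: value} dicts (zeros omitted)
    if len(b) < len(a):       # iterate over the shorter vector
        a, b = b, a
    return sum(v * b.get(k, 0.0) for k, v in a.items())

# two users sharing only feature 5
print(sparse_dot({1: 1.0, 5: 2.0}, {5: 3.0, 9: 1.0}))  # 6.0
```

The cost is proportional to the number of active features, not to the (very large) dimension d.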

The reasons we have chosen an SVM are twofold. First of all, it is a very popular technique for dealing with traditional and behaviour data and has been applied in many diverse domains (Suykens et al 2002). As we have noted in Section 1.1, tailored techniques need to be applied to behaviour datasets; these roughly fall in two categories: regularization based techniques and heuristic approaches. The latter type of techniques are not suitable for dealing with traditional data. The remaining regularization based techniques have formed the subject of traditional studies dealing with imbalanced data. We have opted for this type of techniques since we can then easily compare our results to the conclusions obtained from previous studies. Another reason we have opted for an SVM is the fact that many of our proposed techniques (see Section 3 for details) rely on a boosting process. Wickramaratna et al (2001) and García and Lozano (2007) noted that using a strong learner can result in performance degradation during boosting. Regularization based techniques offer an added element of flexibility in the sense that the strength of the learner can be controlled by varying the regularization parameter. Heuristic approaches today do not offer this attractive feature.

4 Also called support values.
5 In Section 5 we will consider different types of base learners.

Imbalanced classification in sparse and large behaviour datasets 9

2.3 Evaluation metrics

The performance measures that are typically used in traditional studies dealing with imbalanced data are accuracy, sensitivity, specificity, precision, F-measure and G-mean, see for instance Bhattacharyya et al (2011), Han et al (2005), González and Velásquez (2013). These measures are derived from the confusion matrix and are based on a certain threshold applied on the output scores of the classifier, where the threshold is usually contained within the classification algorithm. There are two main issues with this approach. First of all, the built-in threshold that is applied on the scores might not be suitable. This might cause low performance values with respect to these criteria; yet if we were to simply adapt the threshold, the same performance criteria might show excellent results. The second issue lies in the fact that the chosen threshold could be irrelevant with respect to the available capacity. We address this issue with a simple example. If a targeted advertisement company would apply this classifier (with built-in threshold) on new customers, the classifier might choose to predict 5% of all possible customers to target as positive. Yet the company only has a marketing budget that allows targeting 0.1% of all possible customers. It is clear that the chosen threshold is inappropriate here. For these reasons, we have chosen to opt for the area under the ROC-curve (AUC) instead^6. Chawla et al (2004) note that ROC-curves (and cost-curves) should be preferred over these traditional measures. The AUC, which measures the area under the ROC-curve, is more appropriate since it is independent of class skew and measures the ranking abilities of the classifier (Fawcett 2006). It answers the question: if we were to rank all instances according to output scores, is the classifier able to place the positive instances near the top of the list and the negative instances near the bottom? Because the method scans over all possible thresholds, it is independent of a specific cut-off value. Another reason we have chosen AUC is the fact that many boosting and cost-sensitive learning techniques evaluate performance using accuracy or misclassification cost, which suffer from the same issues as previously mentioned. Hence, we are one of the first to evaluate these techniques with respect to AUC. Note that AUC is also the preferred metric in the assessment of unsupervised anomaly detection techniques (Goldstein and Uchida 2016). Outliers/anomalies are strongly related to our field, as they signify rare events that are of specific importance to the analyst.
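This ranking interpretation can be made concrete: AUC equals the fraction of (positive, negative) pairs in which the positive instance receives the higher score, counting ties as one half. A small illustration with made-up toy scores:

```python
def auc(scores, labels):
    # AUC as the pairwise ranking probability (labels in {-1, +1})
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == -1]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0
               for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

scores = [0.9, 0.8, 0.35, 0.3, 0.1]
labels = [1, 1, -1, 1, -1]
print(auc(scores, labels))  # 5 of the 6 pos-neg pairs ranked correctly: 0.8333...
```

Because only the ordering of the scores matters, any monotone rescaling of the classifier output (in particular, any choice of threshold) leaves the AUC unchanged.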

Though we focus on AUC, other measures are suitable for performance assessment of imbalanced data. The weighted area under the ROC-curve (wAUC) (Li and Fine 2010) is able to emphasize certain regions of ROC-space. Lift-curves (Bekkar et al 2013), popular in the marketing domain, are ideal to assess the prevalence of positive cases among instances ranked highly by the classifier. These curves can be evaluated according to the available capacity. When cost information is available, this should be integrated in the performance assessment criterion. Indeed, the cost of a false negative is usually much larger than the cost of a false positive. We could then opt for cost-curves (Whitrow et al 2009; Bhattacharyya et al 2011), which can also be evaluated with consideration of the available capacity. Few studies make use of costs

6 We provide traditional measures (sensitivity, specificity, G-means, F-measure) in our online repository: http://www.applieddatamining.com/cms?q=software


(Ngai et al 2011), mainly because these are difficult to determine, uncertain and have a temporal characteristic. To conclude this section, we note that each of the performance assessment criteria mentioned in this paragraph requires some additional information in the form of a weight function (wAUC), capacity requirements and/or costs (lift/cost-curves). These measures are application specific, and this is the main reason we excluded them from our study.

3 Methods

3.1 Oversampling

Over the years, the data mining community has investigated several techniques that balance the data distribution by oversampling the minority class. In this section, we investigate the basic oversampling with replacement (OSR) approach in conjunction with synthetic sample generation procedures. The first technique simply duplicates minority class instances by a certain amount. Several references dealing with traditional, low-dimensional data note that this technique may make the decision regions of the learner smaller and too specific on the replicated instances, causing the learner to overfit (Chawla et al 2002; Han et al 2005; Liu et al 2009; He and Garcia 2009). The synthetic approaches are designed to overcome this overfitting behaviour by generating new, non-overlapping instances in the minority class space. The techniques we investigate are SMOTE (Chawla et al 2002) and ADASYN (He et al 2008).

Consider a certain minority class instance x_i. In the traditional setting, SMOTE and ADASYN generate a new synthetic instance x_syn by choosing a random point on the line segment between the point x_i and one of its K minority class nearest neighbours (computed according to Euclidean distance). In SMOTE, each original minority class instance x_i generates the same number of synthetic instances, whereas the ADASYN algorithm generates a variable number of synthetic instances by putting a larger weight on the harder to learn minority instances. This way, the learner is more focused on difficult instances.

These techniques need to be adapted when dealing with binary behaviour data, where each instance is represented by a large and sparse binary vector. The main differences with the original versions of SMOTE and ADASYN are indicated in Table 1 and are explained in more detail in the following paragraphs. The first difference lies in the generation of the new synthetic instances. As before, a synthetic sample is constructed based on two original minority instances. When both instances have a 0 or 1 in their corresponding column, the synthetic sample will also show a 0 or 1, respectively, at the considered position. When only one of the two minority instances shows a 1, the decision is made according to a user-specified parameter prior_opt. This parameter can be one of the following three options:

– "FlipCoin", where there is a 50% probability that the synthetic instance will show a 1 at the considered position.
– "Prior", where the value of the synthetic sample is determined by the prior within the minority class in the corresponding column. One generates a random number u in the interval [0,1] and puts a 1 in the corresponding position if u is smaller than the prior.
– "Reverse Prior", where one generates a random number u in the interval [0,1] and puts a 1 in the corresponding position if u is larger than the prior within the minority class for this column/feature.

The second difference lies in the way the nearest neighbours are determined. It would be unwise to consider Euclidean distance in that respect, because it treats a 0-0 match in the same way as a 1-1 match. Since we are working with dichotomous variables (i.e. present or absent), a 1-1 match at a certain position is far more informative than a 0-0 match. For instance, two users visiting the same web page contains more information than two users who didn't visit that specific page. The similarity between two instances is defined by a user-specified parameter sim_measure. We have limited ourselves to two popular choices: "Jaccard" uses the Jaccard similarity measure (Finch 2005) and "Cosine" uses the cosine similarity measure (Huang 2008). In principle, one could apply any of the metrics summarized in Stankova et al (2015).

Table 1 Differences between the original SMOTE and ADASYN implementations and the versions SMOTE_beh and ADASYN_beh tailored for behaviour data (parameter explanations are provided in Section 3.1)

                                SMOTE                SMOTE_beh        ADASYN               ADASYN_beh
Amount of oversampling          N                    β                β                    β
Synthetic sample generation     random point on      prior_opt        random point on      prior_opt
                                line segment                          line segment
Similarity measure              Euclidean            Jaccard/Cosine   Euclidean            Jaccard/Cosine
Number of nearest neighbours    K                    K, K̄             K                    K, K̄

A detailed pseudo-code implementation of our versions of SMOTE and ADASYN, called SMOTE_beh and ADASYN_beh, is shown in Algorithm 1. Also note that we introduced an extra parameter K̄ to decouple the determination of the number of synthetic instances that need to be generated for a certain minority instance from the number of nearest neighbours it uses to obtain the synthetic instances.

The experimental set-up adopted by Chawla et al (2002) and He et al (2008) considers K = 5 as the number of nearest neighbours used, without a detailed motivation. Furthermore, the latter paper compares SMOTE with an oversampling percentage N of 200% (meaning that the size of the newly created synthetic data instances is twice as large as the size of the original minority training data) with a completely balanced dataset (β = 1 in Algorithm 1) in ADASYN. In our experiments, we will consider a variety of possible K-values and compare SMOTE and ADASYN with identical oversampling rates, controlled by a single parameter β.


Algorithm 1 SMOTE_beh and ADASYN_beh pseudo-code implementation for binary behaviour data

Input: X_min, X_maj, β, prior_opt, sim_measure, K, K̄
a) determine the total amount of synthetic minority instances that need to be generated (β ∈ [0,1] is a parameter that controls the amount of oversampling; β = 1 means a fully balanced dataset will be created):
   G = (|X_maj| − |X_min|) × β
b) determine the number of synthetic samples g_i that need to be generated for each minority class instance x_i:
if SMOTE then
   g_i ← ⌈G / |X_min|⌉
else if ADASYN then
   calculate the K nearest neighbours (with the sim_measure option) for instance x_i from the set {X_min \ x_i, X_maj} and determine Δ_i, the number of majority class nearest neighbours. Next, calculate r_i = Δ_i / K and normalize these values: r̂_i = r_i / Σ_{j=1}^{|X_min|} r_j
   g_i ← ⌈r̂_i × G⌉
end if
c) generate g_i synthetic samples for minority instance x_i:
calculate the K̄ nearest neighbours (with the sim_measure option) for instance x_i from the set X_min \ x_i. Additionally, remove those nearest neighbours that have a similarity of 0 with x_i. The remaining nearest neighbours form the set K_used. If this set turns out to be empty, set K_used = {x_i}.
for iter = 1 → g_i do
   randomly choose 1 nearest neighbour from the set K_used
   generate a synthetic minority sample from x_i and the chosen nearest neighbour (according to prior_opt)
end for
d) because Σ g_i ≥ G, randomly remove synthetic points until the total number of synthetic samples equals G

3.2 Undersampling

In this section, we compare the simple random undersampling technique (RUS) with informed undersampling approaches. The first method randomly discards majority class training instances. While this technique can achieve fast training performance of the underlying base learner, an obvious disadvantage is the fact that it might discard potentially useful majority class instances. The informed approaches try to intelligently retain the most informative majority class instances, in the hope of increasing predictive performance while at the same time keeping the fast training speed of the underlying classifier.

The first set of informed undersampling techniques are based on the methods proposed by Zhang and Mani (2003) and Chyi (2003). The K-nearest neighbour 'classifier' is used to determine the importance of each majority class training instance by calculating the total similarity with the K closest minority class training set examples. Regarding similarity computations for binary behaviour data, we refer to the related discussion in Section 3.1. The first technique, called "Closest Knn", retains majority class examples that are closest to the minority class instances. These instances are the most difficult to classify and we would expect them to be most informative. The second method, called "Closest tot sim", is similar to the previously described technique. The difference is that it no longer computes similarities with the K closest minority neighbours; instead, it calculates the total similarity with all minority instances in determining the importance. The main reason we included this technique is computational speed^7. The last techniques, called "Farthest Knn" and "Farthest tot sim", are included for comparison with the previously mentioned techniques. Their implementation is identical to the previously described techniques; however, they retain majority class examples that are farthest from the minority class instances. For each of the proposed methods in this paragraph, the amount of undersampling is controlled by a user-specified parameter β_u according to the following formula:

Nr_rem = ⌊(|X_maj| − |X_min|) × β_u⌋                                    (2)

where Nr_rem represents the number of majority class instances to be discarded; β_u = 1 means a completely balanced dataset is obtained.
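The "Closest tot sim" selection can be sketched as follows (our own naming; instances again as sets of active features, with Jaccard similarity):

```python
def jaccard(a, b):
    # Jaccard similarity between two sets of active feature indices
    union = len(a | b)
    return len(a & b) / union if union else 0.0

def closest_tot_sim(X_maj, X_min, beta_u):
    # rank majority instances by total similarity to all minority
    # instances and discard the Nr_rem least similar ones (Equation (2))
    nr_rem = int((len(X_maj) - len(X_min)) * beta_u)
    ranked = sorted(X_maj, key=lambda x: sum(jaccard(x, m) for m in X_min),
                    reverse=True)
    return ranked[:len(X_maj) - nr_rem]
```

With beta_u = 1, the retained majority set has exactly the minority class size; "Farthest tot sim" would simply reverse the sort order.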

The second set of informed undersampling techniques aims at targeting the within-class imbalance problem and is based on the approach proposed in Sobhani et al (2015). They postulate that this within-class imbalance problem is more pronounced in the case of undersampling methods: if we were to randomly select majority class instances, then the probability of drawing an instance from small disjuncts within the majority class would be very low. These regions might therefore contain no representatives and remain unlearned. The authors chose to address this issue by clustering the majority class instances in a first step and subsequently selecting an equal number of representatives from each cluster. The reported results show their approach to outperform the CBO-algorithm (see Section 1.1). In the following paragraphs, we dig deeper into the subject of clustering behaviour data. We already refer to Algorithm 2 for an overview of our cluster-based undersampling method (CBU).

As we noted in Section 2.1, behaviour data can be represented as a bipartite graph. The clustering of behaviour data^8 aims at finding groups of nodes (communities) that connect more to each other than to other nodes in the network. This subject is currently an active area of research, with a rapid evolution of a vast number of cluster detection techniques (Zha et al 2001; Dhillon 2001; Larremore et al 2014; Beckett 2016). We refer to Porter et al (2009), Fortunato (2010) and Alzahrani and Horadam (2016) for detailed surveys on the problem. It should be noted that the vast majority of publications deal with the subject of clustering unigraphs (networks with only one type of nodes); it is only fairly recently that interest grew in the clustering of bigraphs. In our implementations, we have chosen the popular modularity-based approaches^9 for clustering bigraphs, and these fall into two directions. In the first, the modularity function that is used for unigraphs is adapted to be suitable for bigraphs, see for instance the work of Barber (2007). The other direction, which we adopt in our study, projects the bigraph to a unigraph of bottom nodes and performs community detection on the projection using traditional modularity definitions. Note that Guimerà et al (2007) observed no difference in the obtained communities using either direction.

7 For each majority class instance, we no longer need to sort the similarities with all minority instances in determining the K largest values.
8 This subject is more commonly known as community detection in bipartite graphs.
9 Modularity-based approaches attempt to optimize a quality function known as modularity for finding community structures in networks, and rely on the use of heuristics due to the complexity of the problem.


In this article, we adopt the methodology proposed in Alzahrani and Horadam (2016), which consists of projecting the bigraph to a unigraph of bottom nodes and applying^10 the Louvain algorithm (Blondel et al 2008) on the projection. Lancichinetti and Fortunato (2009) performed a comparative study of the performance of 12 community detection algorithms and concluded the Louvain method to be the best modularity-based algorithm and second best among all algorithms. The heuristic is very fast, with O(m) complexity, where m is the number of edges in the unigraph. We have chosen the Louvain algorithm because of its speed performance and the availability of a toolbox (Jutla et al 2011-2016) that is directly compatible with our implementations. The toolbox provides a generalized implementation of the Louvain algorithm in the sense that multiple definitions of modularity are possible. The quality function we chose is the popular Newman-Girvan modularity (Newman and Girvan 2004).

With respect to the projection, Alzahrani and Horadam (2016) connect two bottom nodes if they have at least one top node in common; the connection weight between two bottom nodes in the projection is set to the number of shared top nodes. In our implementation, we adapted the connection weights in accordance with Stankova et al (2015) as follows: first of all, we assign weights to the top nodes corresponding to the hyperbolic tangent applied to the inverse degree of the top node. Next, the connection weight between two bottom nodes in the projection corresponds with the total weight of the shared top nodes. Top nodes having low degrees therefore obtain a higher contribution in the projection (e.g. two users making a transaction to a local book store are assumed to be more closely connected to each other than two users making a transaction to a large retail store).
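This weighting scheme can be sketched as follows (a toy implementation under our own naming; in our experiments, the Louvain toolbox is then run on the resulting weighted unigraph):

```python
import math
from collections import defaultdict
from itertools import combinations

def project_bigraph(bottom_to_top):
    # bottom_to_top: bottom node -> set of top nodes it connects to
    degree = defaultdict(int)
    for tops in bottom_to_top.values():
        for t in tops:
            degree[t] += 1
    # top-node weight: tanh of the inverse degree (rare top nodes weigh more)
    w_top = {t: math.tanh(1.0 / d) for t, d in degree.items()}
    edges = {}
    for i, j in combinations(sorted(bottom_to_top), 2):
        shared = bottom_to_top[i] & bottom_to_top[j]
        if shared:  # connection weight = total weight of the shared top nodes
            edges[(i, j)] = sum(w_top[t] for t in shared)
    return edges
```

For example, two bottom nodes sharing only a degree-2 top node are connected with weight tanh(1/2) ≈ 0.46, while a shared top node of very high degree contributes a weight close to 0.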

In our CBU-algorithm (see Algorithm 2), after clustering the bigraph containing exclusively majority class instances, we randomly select an equal amount of majority instances from each community to target the within-class imbalance problem. In the rare situation that the number of obtained clusters exceeds the required amount of majority instances (Nr_retain), we sort the communities according to a user-specified parameter Clust_opt and randomly select 1 instance from the first Nr_retain clusters. The parameter Clust_opt can take the following values:

– C_Smallest, where we sort clusters in ascending order of their size.
– C_Largest, where we sort clusters in descending order of their size.

Note that we randomly select majority class instances from each cluster: Yen and Lee (2009) found a random selection strategy after clustering to be superior to informed approaches based on distance.

10 Note that they also made use of the flow-based algorithm Infomap (Rosvall and Bergstrom 2008), which shows excellent results on the LFR-benchmark.


Algorithm 2 CBU pseudo-code implementation for behaviour data

Input: X_min, X_maj, β_u, Clust_opt
a) Cluster the majority class instances X_maj:
– Assign weights to each top node corresponding with the hyperbolic tangent applied to the inverse of the node's degree.
– Project the bigraph X_maj to a weighted unigraph consisting of bottom node majority class instances. The weight w_ij between majority class instances i and j corresponds with the total weight of the shared top nodes.
– Apply the Louvain algorithm (Blondel et al 2008) on the projected unigraph to partition the majority class instances into clusters.
b) Select majority class instances:
Nr_rem ← ⌊(|X_maj| − |X_min|) × β_u⌋  (see Equation (2))
Nr_retain ← |X_maj| − Nr_rem
if Nr_retain < |Clust| then
   – Sort clusters according to Clust_opt
   – Randomly select 1 instance from the first Nr_retain clusters
else
   – Randomly select ⌈Nr_retain / |Clust|⌉ majority class instances from each cluster
   – Randomly discard instances from the previous step until its size corresponds with Nr_retain
end if
c) Return the new training set consisting of X_min and the selected majority class instances from step b
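The selection step b can be sketched as follows (a simplified version under our own naming; clusters smaller than the per-cluster quota are an edge case the pseudo-code leaves implicit, handled here by sampling at most the cluster size):

```python
import random

def cbu_select(clusters, nr_retain, clust_opt="C_Smallest"):
    # clusters: list of lists of majority class instances
    if nr_retain < len(clusters):
        order = sorted(clusters, key=len, reverse=(clust_opt == "C_Largest"))
        return [random.choice(c) for c in order[:nr_retain]]
    per = -(-nr_retain // len(clusters))          # ceil(nr_retain / |Clust|)
    picked = [x for c in clusters
              for x in random.sample(c, min(per, len(c)))]
    random.shuffle(picked)                        # discard the surplus at random
    return picked[:nr_retain]
```

Because every community contributes representatives, small disjuncts within the majority class are not silently dropped, which is the point of the method.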

3.3 Boosting, cost-sensitive learning and EasyEnsemble

3.3.1 AdaBoost

The AdaBoost (Schapire and Singer 1999; Schapire 1999) algorithm has been designed from the perspective of improving the performance of a weak learner so that it achieves accuracies that are comparable with a strong learning algorithm. Fundamental to the idea of boosting is to maintain a weight distribution over the training set. In each boosting iteration, the weights of wrongly classified instances are increased, so that the underlying weak learner puts more emphasis on these hard examples. In our implementation, we consider using a SVM as base learner. SVMs are generally regarded as strong learners. The studies of Wickramaratna et al (2001) and García and Lozano (2007) note that using a strong learner usually results in performance degradation during the boosting process; the RBF-kernel SVM classifier (Wickramaratna et al 2001; Li et al 2008) is used as the underlying classifier to prove their point. In our study, we employ a linear SVM, which can be considered a weaker version compared to the RBF-kernel. Furthermore, the regularization parameter C can be viewed as a "weakness" indicator^11: lowering the C-value results in weaker learners, as can be seen from the goal function of the SVM optimization problem (see equation (1)). We will come back to this point in Section 4.5.

11 The distinction between weak/strong learners is loosely 'defined' in Schapire (1999). A weak learner corresponds with a hypothesis that performs just slightly better than random guessing. A strong learner is able to generate a hypothesis with an arbitrarily low error rate, given enough data. We adopt these definitions, but consider the distinction between weak/strong based on training set error. In a SVM context, it is quite typical that error levels on training data drop with increasing C-values (Suykens et al 2002). A learner that is 'too strong' means that even though its performance on training data is very high, it fails to generalize well and the test set error increases due to overfitting.


The boosting algorithm with underlying SVM is presented in Algorithm 3. This algorithm closely follows the original boosting implementation presented in Schapire and Singer (1999) and requires each learner to output confidence-rated predictions in the interval [−1,1]. Since the SVM outputs real-valued scores, we apply the procedure of Platt (1999) to transform these scores into probability estimates (which can easily be translated to form confidence-rated predictions). A logistic regression (LR) model (Ng and Jordan 2002) is trained using the SVM-scores as input and the corresponding labels as output. Note that the same data that are used to construct the linear SVM can be used to estimate the LR-model, see Platt (1999) for a motivation.

The boosting algorithm requires the weak learner to be trained using a distribution D_t. One could sample from this distribution to generate (unweighted) examples and train a SVM next. We have chosen instead to include weights in the SVM goal function formulation (this requires an extension of the LIBLINEAR package):

min_{w,b,ξ_i}  (1/2) w^T w + C Σ_{i=1}^{m} weight_i ξ_i                      (3)

The weights weight_i are set according to the weight distribution D_t(i) in each round of boosting. Note that the C-value that is used to train the SVM model will be divided by mean(weight_i). This normalization allows for a fair comparison between the weighted and unweighted SVM versions (e.g. in the first round of boosting, weight_i = 1/m for all instances; by multiplying the C-value with m, this corresponds to solving the unweighted problem with the same C-value).

We introduce an additional parameter µ, called the weight percentage, with values in [0,100], in the boosting algorithm of Algorithm 3. This parameter controls the amount of training data that is used to construct the SVM model and subsequent LR model. We sort the original training data in descending order according to the distribution D_t. Next, we form a new training set of minimal cardinality by including points from the original sorted training data until the total weight is higher than the weight percentage µ/100. This way, the newly formed training data will contain only the part of the original training data that has the most weight. This partial set will then be used to construct a weighted SVM-model (according to equation (3), with updated distribution for this set) and a subsequent LR-model. The idea of using a partial dataset to construct the base learner not only reduces training times but also weakens the learner (García and Lozano 2007).
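The construction of this partial training set can be sketched as follows (our own helper name; D is the current boosting distribution, which sums to 1):

```python
def top_weight_subset(D, mu):
    # indices of the minimal set of highest-weight instances whose
    # cumulative weight exceeds mu/100
    order = sorted(range(len(D)), key=lambda i: D[i], reverse=True)
    subset, total = [], 0.0
    for i in order:
        subset.append(i)
        total += D[i]
        if total > mu / 100.0:
            break
    return subset

print(top_weight_subset([0.4, 0.3, 0.2, 0.1], 60))  # [0, 1]: 0.4 + 0.3 > 0.6
```

With µ = 100, the full training set is kept; smaller values of µ concentrate the weak learner on the hardest (highest-weight) examples.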

In Algorithm 3, we have an explicit check to verify if r_AB = 1. In this case, the SVM model outputs scores that allow for a perfect classification on the training set. The subsequent LR model will find a threshold and output a value of −1 if the SVM score is lower than this threshold (a value of +1 if the score is higher than this threshold). In this situation, the training data would be perfectly classified. In our implementation, we attempt to avoid this kind of behaviour because it can lead to overfitting, meaning it might pinpoint the wrong threshold and make too drastic decisions (we lose the meaning of confidence). The check if r_AB ≤ 0 verifies if the currently boosted model is performing worse than random (such a model would have a r_AB-value of 0). Obviously, if the model performs worse than random, we quit the boosting process. During the first round of boosting, we perform similar checks that are not explicitly indicated in Algorithm 3: in the case where r_AB = 1, we output the SVM scores instead of the LR binary values; when r_AB ≤ 0, we quit the boosting process and output the LR scores.

Algorithm 3 AdaBoost with a SVM-LR combination as a base learner

Input: (X,Y) = {(x_1,y_1), …, (x_m,y_m)}, C, T, µ
Initialize distribution D_1(i) = 1/m
for t = 1 to T do
– train weak learner using distribution D_t. The weak learner consists of a weighted linear SVM and a LR model, trained with weight percentage µ of D_t:
   h_t ← Train_WeakLearner(X, Y, D_t, C, µ)
– compute the weighted confidence r_AB on the training data:
   r_AB ← Σ_{i=1}^{m} D_t(i) y_i h_t(x_i)
   If (r_AB = 1 or r_AB ≤ 0) then α_t ← 0 and stop the boosting process
– choose α_t ∈ R:
   α_t ← (1/2) log((1 + r_AB) / (1 − r_AB))
– update distribution:
   D_{t+1}(i) ← D_t(i) exp(−α_t y_i h_t(x_i)) / Z_t
   where Z_t is a normalization factor (chosen so that D_{t+1} will be a distribution)
end for
Output the final hypothesis (the output score is the term contained in the sign function):
   H(x) = sign(Σ_{t=1}^{T} α̃_t h_t(x))  with  α̃_t = α_t / Σ_{i=1}^{T} α_i

3.3.2 AdaCost

The AdaCost algorithm (Fan et al 1999) is a variant of cost-sensitive learning where misclassification costs are introduced in the weight-update formula of AdaBoost. The cost-sensitive update rule increases the weights of costly misclassified instances more aggressively and decreases the weights of costly correct classifications more conservatively. Each instance is given a misclassification cost c_i, where we chose to put c_i = 1 for positive (minority) instances and c_i = 1/R for negative (majority) instances; R is a user-defined value that allows one to put more emphasis on the minority class. The implementation of AdaCost is similar to Algorithm 3, yet there are a few differences (Fan et al 1999): firstly, the initial distribution is chosen as D_1(i) = c_i / Σ_{j=1}^{m} c_j; secondly, the weight update rule is given by D_{t+1}(i) = D_t(i) exp(−α_t y_i h_t(x_i) β(i)) / Z_t, where β(i) = −0.5 sign(y_i h_t(x_i)) c_i + 0.5 is a cost-adjustment function; finally, α_t is given by α_t = (1/2) log((1 + r_AC) / (1 − r_AC)), where r_AC = Σ_{i=1}^{m} D_t(i) y_i h_t(x_i) β(i). Note that the checks to stop the boosting process prematurely (see the second bullet in Algorithm 3) are still based on the r-value obtained from AdaBoost (r_AB). This is because β ∈ [0,1].

In the first boosting round of AdaCost, the weighted SVM formulation (3) is equivalent to solving a SVM formulation with the following goal function:

min_{w,b,ξ_i}  (1/2) w^T w + C_+ Σ_{i|y_i=1} ξ_i + C_− Σ_{i|y_i=−1} ξ_i            (4)

where C_+/C_− = R. This can be seen as a cost-sensitive version of a SVM, an idea initially proposed by Veropoulos et al (1999).

3.3.3 EasyEnsemble

One of the disadvantages of the random undersampling method is the fact that we are discarding potentially valuable information. EasyEnsemble (Liu et al 2009) is a method that combines several balanced subsets (S in total), each containing randomly selected majority class instances together with all minority examples. Each subset contains the same amount of instances from both classes and is fed to the boosting algorithm presented in Algorithm 3. Afterwards, the weak learners h_{s,t} of each subset s are simply combined to form the final ensemble:

H(x) = \mathrm{sign}\left( \sum_{s=1}^{S} \sum_{t=1}^{T} \alpha_{s,t} h_{s,t}(x) \right), with s = 1, ..., S and t = 1, ..., T.    (5)

It is clear that this technique benefits from a combination of bagging and boosting (Liu et al 2009). Note that we apply the same boosting algorithm as previously described to each of the balanced subsets. However, when r_{AB} = 1 in the first round of boosting, we quit the boosting process, put \alpha_1 = 1 and continue to use the trained LR-model in the final ensemble. It was previously noted that this can cause overfitting, in the sense that it can point to a wrong threshold. However, the LR-model still contains information, and the combination thereof with the models obtained from the other subsets remains valuable (as we have noted from initial experiments, by comparing the situation where we include or reject those subsets).

There are a few subtle though important differences with respect to the experiments performed by Liu et al (2009). First of all, the authors use a CART-algorithm (Breiman et al 1984) as base learner; we employ a linear SVM with subsequent LR as a weak learner. To our knowledge, this combination has not been proposed elsewhere, yet proves to be very efficient in this setting. Secondly, their methodology employs a discrete version of AdaBoost, where each weak learner outputs binary values in {-1, 1}. As already stated, we make use of an improved version of AdaBoost that relies on confidence-rated predictions (Schapire and Singer 1999). Finally, the authors reported performances with fixed levels of S = 4 and T = 10; we will investigate the effect of varying S and T levels.
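A stripped-down sketch of the EasyEnsemble idea (balanced subsets, one learner per subset, summed decision scores) is given below. For brevity we use a plain logistic regression per subset instead of the boosted SVM/LR weak learners of the paper; data and names are our own:

```python
import numpy as np
from sklearn.base import clone
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

def easy_ensemble_fit(X, y, base, S=5, seed=0):
    """Fit S learners, each on all minority plus an equally sized random majority sample."""
    rng = np.random.default_rng(seed)
    min_idx = np.flatnonzero(y == 1)
    maj_idx = np.flatnonzero(y == 0)
    models = []
    for _ in range(S):
        sub = np.concatenate(
            [min_idx, rng.choice(maj_idx, size=len(min_idx), replace=False)])
        models.append(clone(base).fit(X[sub], y[sub]))
    return models

def easy_ensemble_score(models, X):
    """Combine the subset models by averaging their decision scores."""
    return np.mean([m.decision_function(X) for m in models], axis=0)

X, y = make_classification(n_samples=400, weights=[0.9, 0.1], random_state=1)
models = easy_ensemble_fit(X, y, LogisticRegression(max_iter=1000), S=5)
scores = easy_ensemble_score(models, X)
```

Each subset is only twice the minority class size, which is what makes the method fast and trivially parallelizable.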


4 Results and discussion

4.1 Datasets

Stankova et al (2015) provide the first large collection of benchmark behaviour datasets for classification. In our experiments, we make use of these data sources and extend this repository with two additional datasets. Each of these datasets shows a bipartite structure with a clear target variable to predict. We refer to this study and the next paragraph for a short description of the available data resources. In this paragraph, we indicate why we have chosen to include or reject certain data sources from the aforementioned study. The available datasets can be divided into small, medium and large datasets, based on the number of instances and the number of features present. The Norwegian companies and Reality Mining datasets comprise the small datasets, since they contain only a few hundreds of instances or features and are therefore regarded as impractical for our purposes. The MovieLens, Yahoo, TaFeng, Book-Crossing and LibimSeTi datasets belong to the medium-sized datasets, each containing a few thousands up to a few hundreds of thousands of instances and features. All of these datasets are included in our study. The large datasets, containing hundreds of thousands up to millions of instances and features, are the Flickr and Kdd databases.12 Other large, proprietary data sources not included in Stankova et al (2015) are the corporate residence fraud (CRF) and banking (Bank) datasets, which arise from real-life application domains with intrinsic imbalance.

To summarize, we have gathered datasets containing such fine-grained behaviour data from a wide variety of application domains. The MovieLens datasets, for which we are predicting the gender13 (Mov G) or the genre thriller14 (Mov Th), provide data on which films each user has rated. The Yahoo movies15 dataset has a similar structure, where the age of each user, Yahoo A (above or below average), or the gender, Yahoo G, is being predicted. The TaFeng dataset16 contains data on shopping behaviour, where age (below or above average) is being predicted based upon which products are purchased. In the book-crossing (Book) dataset (Ziegler et al 2005), users rate books and the age of the user (above or below average) is being predicted. LibimSeTi (LST) contains data from a dating site (Brozovsky and Petricek 2007), where users rate each other's profile and gender is being predicted. In the advertisement (Adver) dataset (Lichman 2013), we try to predict if a URL is an advertisement, based on a large variety of binary features of the URL. Note that this dataset does not arise from the behaviour of entities, yet it still has a high-dimensional and sparse representation. The Flickr dataset (Cha et al 2009) contains pictures being marked by users as favorite, and we predict the number of comments on each picture (below

12 Flickr and KDD will be excluded in the comparative study of Section 4.6. This is because some methods are too computationally intensive - especially in combination with the large number of possible parameter combinations - to be applied on these very large data sources. Furthermore, our statistical evidence is already sufficiently strong to conclude significance without these datasets. Having said this, these data sources will be included in the analysis of Section 5.

13 MovieLens 1M dataset from http://grouplens.org/datasets/movielens
14 MovieLens 10M dataset from http://grouplens.org/datasets/movielens
15 https://webscope.sandbox.yahoo.com
16 http://www.bigdatalab.ac.cn/benchmark/bm/dd?data=Ta-Feng


or above average). In the Kdd cup data, the performance of a student on a test is being predicted, based on artificially created binary features (Yu et al 2010). The corporate residence fraud dataset (CRF) (Junque de Fortuny et al 2014b) contains data on foreign companies making transactions with specific Belgian companies, where we try to predict whether the foreign company commits residence fraud (a type of fiscal fraud). Finally, the banking dataset (Bank) (Martens et al 2016) contains detailed behaviour on consumers making transactions with merchants or other persons, to predict interest in a pension fund product. Some characteristics of these datasets can be found in Table 2. The features column only shows the number of active features.17

Table 2 Behaviour data characteristics. The final column shows the imbalance ratio p, defined as the ratio of the number of minority class instances to the amount of majority class instances in the training set, expressed as a percentage. See Section 4.2 for details regarding p.

Name      |Xmaj|    |Xmin|    Features   p = 100 x |Xmin|train / |Xmaj|train
Mov G     4331      1709      3706       1 & 25
Mov Th    10546     131       69878      1.24 (p = [])
Yahoo A   6030      1612      11915      1 & 25
Yahoo G   5436      2206      11915      1 & 25
TaFeng    17330     14310     23719      1 & 25
Book      42900     18858     282973     1 & 25
LST       59702     60145     166353     1
Adver     2792      457       1555       16.38 (p = []) & 1
CRF       869071    62        108753     0.0072 (p = [])
Bank      1193619   11107     3139570    0.93 (p = [])
Flickr    8166814   3028330   497472     0.1
Kdd       7171885   1235867   19306083   0.5

4.2 Methodology

Regarding the experiments performed in the upcoming sections, we applied a tenfold cross-validation procedure. Each of the folds contains 80% training data, 10% validation data and 10% test data. Note that these percentages are valid for both the majority class and the minority class (stratified sampling). As can be seen from Table 2, some datasets are balanced in nature. We created artificial imbalance for these datasets by removing minority class instances from the initial training set, according to a user-defined parameter p. We ensured that the amount of minority training instances corresponds to p percent of the majority class training size: |Xmin|train = (p/100) |Xmaj|train. As an example, say that we are using the Book dataset with p = 25. In that case, we know that the majority class contains 34320 training instances (80% of |Xmaj| = 42900). The minority training data would contain 8580 instances (25% of 34320). When the dataset is already imbalanced, we define p = [], which means that no downsampling

17 Active features represent features that are present for at least one instance in the dataset. A non-active feature corresponds with a column of zeros in the matrix representation and would not contribute to the model.


of the minority class training data is performed. Note that the validation and test data are left untouched.18
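The size computation in the Book example above can be sketched as follows (our helper function, reproducing the numbers from the text):

```python
def minority_train_size(n_maj, p, train_frac=0.8):
    """|Xmin|train = (p/100) * |Xmaj|train, with |Xmaj|train = train_frac * |Xmaj|."""
    n_maj_train = round(train_frac * n_maj)
    n_min_train = round(p / 100 * n_maj_train)
    return n_maj_train, n_min_train

# Book dataset example from the text: |Xmaj| = 42900, p = 25.
n_maj_train, n_min_train = minority_train_size(42900, 25)
```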

The methods detailed in Section 3 are applied on the training data. For both the under- and oversampling approaches, a linear SVM is trained on the newly created balanced training data, with regularization parameter C having values

C = [10^-7, 10^-5, 10^-3, 10^-1, 10^0]

The validation data are used for parameter tuning purposes. The test data allow us to obtain the generalization performance. The results reported show the average over ten folds. In the remaining paragraphs of this section, we describe the various parameter settings used in our experiments and give a brief overview of the parameters occurring in each method.
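The tuning loop can be sketched as follows (synthetic data and our own splitting code; the paper's stratified tenfold splits differ in detail):

```python
from sklearn.datasets import make_classification
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split
from sklearn.svm import LinearSVC

C_GRID = [1e-7, 1e-5, 1e-3, 1e-1, 1e0]

# Synthetic stand-in data; each fold is split 80/10/10 (train/validation/test).
X, y = make_classification(n_samples=600, random_state=0)
X_tr, X_rest, y_tr, y_rest = train_test_split(X, y, train_size=0.8,
                                              stratify=y, random_state=0)
X_val, X_te, y_val, y_te = train_test_split(X_rest, y_rest, train_size=0.5,
                                            stratify=y_rest, random_state=0)

def val_auc(C):
    """Train a linear SVM with the given C and score AUC on the validation part."""
    m = LinearSVC(C=C, max_iter=10000).fit(X_tr, y_tr)
    return roc_auc_score(y_val, m.decision_function(X_val))

best_C = max(C_GRID, key=val_auc)   # C tuned on validation AUC
test_auc = roc_auc_score(y_te, LinearSVC(C=best_C, max_iter=10000)
                         .fit(X_tr, y_tr).decision_function(X_te))
```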

Considering the oversampling techniques, the parameter settings are as follows:

β = [0, 1/3, 2/3, 1]
prior_opt = {FlipCoin, Reverse Prior}
sim_measure = {Cosine, Jaccard}
K = [10^0, 10^1, 10^2, |Xmin|train]

We didn't include the "Prior" option, due to initial experiments showing a lower performance in comparison with the other options. This can be explained by the low priors occurring in each column, resulting in synthetic samples that mainly show zeros in 0-1 match situations. The oversampling with replacement method (OSR) only uses the β parameter, SMOTE uses all but the K parameter, and ADASYN uses all of the parameters listed above.
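For the simplest of these methods, OSR, the sampling step can be sketched as below. The interpolation of the balance level β between the original imbalance (β = 0) and full balance (β = 1) is our assumption about the parameter's meaning, not a statement of the paper's exact implementation:

```python
import numpy as np

def oversample_with_replacement(X_min, n_maj, beta, seed=0):
    """OSR sketch: replicate random minority rows until the minority size
    interpolates between its original size (beta = 0) and the majority
    size (beta = 1). The interpolation scheme is our assumption."""
    rng = np.random.default_rng(seed)
    n_min = len(X_min)
    n_target = int(round(n_min + beta * (n_maj - n_min)))
    extra = rng.choice(n_min, size=n_target - n_min, replace=True)
    return np.vstack([X_min, X_min[extra]])

X_min = np.eye(10)   # ten toy minority instances
X_bal = oversample_with_replacement(X_min, n_maj=100, beta=1.0)
```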

With respect to the undersampling techniques, the following parameter settings are used:

βu = [0, 1/4, 1/2, 3/4, 1]
sim_measure = {Cosine, Jaccard}
K = [10^0, 10^1, 10^2, |Xmin|train]
Clust_opt = {C Smallest, C Largest}

The random undersampling (RUS) technique only uses the βu parameter. The second set of methods, "Closest tot sim" and "Farthest tot sim", uses βu and sim_measure. The third set of techniques, "Closest Knn" and "Farthest Knn", makes use of all the parameters listed above except for Clust_opt. The final approach, CBU, employs βu and Clust_opt.
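The RUS counterpart can be sketched analogously (again, the interpolation of βu between keeping all majority instances and keeping only as many as there are minority instances is our assumption):

```python
import numpy as np

def random_undersample(maj_idx, n_min, beta_u, seed=0):
    """RUS sketch: keep a random majority subsample whose size interpolates
    between all majority instances (beta_u = 0) and the minority size
    (beta_u = 1). The interpolation scheme is our assumption."""
    rng = np.random.default_rng(seed)
    n_maj = len(maj_idx)
    n_keep = int(round(n_maj - beta_u * (n_maj - n_min)))
    return rng.choice(maj_idx, size=n_keep, replace=False)

kept = random_undersample(np.arange(1000), n_min=50, beta_u=1.0)
```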

The boosting variants (AdaBoost, AdaCost and EasyEnsemble) presented in Section 3.3 make use of the following settings:

T = 30
µ = [100, 75]
C = [10^-7, 10^-5, 10^-3, 10^-1]

18 This means that if we start from a balanced set, only the training data will show artificial imbalance according to the imbalance ratio p. The validation and test data would remain balanced. Since AUC (and some other metrics) is independent of class skew, it would be unwise to make these sets imbalanced as well, because that would lead to discarding minority class instances that are relevant for performance assessment.


R = [2, 8, RL], where RL = |Xmaj|train / |Xmin|train
S = [5, 10, 15]

The AdaBoost algorithm includes the T, µ and C parameters. AdaCost additionally uses cost-ratios R. We have chosen a range of values, because misclassification costs are unknown for many business applications (He and Garcia 2009; Fan et al 1999; Sun et al 2007). The final value RL seems to be a popular choice (Akbani et al 2004; Luts et al 2010), because the total weight on the majority class then balances with the total weight on the minority class. The final method, EasyEnsemble, uses S subsets in addition to the parameters previously mentioned for AdaBoost. Note that we consider the boosting iteration round t ∈ [1, T] as a tunable parameter.19
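The confidence-rated step weight used by the underlying boosting process, α_t = ½ log((1+r)/(1−r)) with r = Σ_i D(i) y_i h(x_i) (cf. Schapire and Singer 1999), translates directly into code; the numbers below are our own toy inputs:

```python
import numpy as np

def boosting_alpha(D, y, h):
    """Confidence-rated AdaBoost step weight:
    r = sum_i D(i) y_i h(x_i), alpha_t = 0.5 * log((1 + r) / (1 - r)),
    assuming predictions h(x_i) in [-1, 1]."""
    r = np.sum(D * y * h)
    return 0.5 * np.log((1 + r) / (1 - r))

D = np.full(4, 0.25)                  # uniform instance weights
y = np.array([1.0, 1.0, -1.0, -1.0])
h = np.array([0.8, 0.6, -0.4, 0.2])   # confidence-rated predictions
alpha = boosting_alpha(D, y, h)       # r = 0.4 for these inputs
```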

4.3 Oversampling

The oversampling techniques presented in Section 3.1 are applied to each of the datasets from Table 2, with varying imbalance levels (by varying the p-parameter). The experiments we conducted use the following methodology: for each dataset, we apply the oversampling techniques with all possible parameter combinations on the training data to create newly balanced datasets, after which linear SVMs are trained. The optimal parameter combination with respect to each imbalance ratio β is selected based on validation set AUC-performance. With these parameters, the AUC on the test data is obtained. Results show the average over ten folds.

The results on four arbitrarily selected datasets are shown in Table 3. Full results on each of the data sources can be found in Appendix A, Table 11. From these tables, we can conclude that performance generally increases with growing β-values. It is more correct to say that performance keeps improving with growing β-levels until an optimal point β∗ is reached; increasing the balance level after this optimal value will cause only small fluctuations with respect to the optimal performance. Traditional studies dealing with dense, low-dimensional data note that the OSR method can suffer from overfitting, as already mentioned in Section 3.1. It is interesting to see that in this sparse, high-dimensional setting this effect doesn't seem to occur.20 Furthermore, the computationally expensive synthetic sampling approaches do not seem to improve performance over the simpler OSR method. This can be explained by the fact that many instances have a very limited amount of non-zero elements (Junque de Fortuny et al 2014a; Stankova et al 2015), causing the synthetic sampling procedures to be limited in the amount of unique new samples they can produce. Note also that synthetic sampling procedures have historically been designed to overcome the overfitting behaviour of OSR, which no longer seems to apply here.

19 We sometimes make use of T as a symbol to indicate the boosting iteration number instead of t.
20 The SVM regularization parameter C controls for overfitting. Additionally, the influence of majority class noise/outliers (see Section 3.2) on the learned hyperplane decreases as we oversample the minority class. This hyperplane is more sensitive towards minority instances, and a change in its direction/orientation towards these majority class noise/outliers will be more heavily penalized.


Table 3 Oversampling experiments. Results show the average tenfold AUC test set performance with respect to increasing β-values (β = [β1, β2, β3, β4] = [0, 1/3, 2/3, 1]). Additionally, standard deviations are shown between brackets. Optimal performances for each method are highlighted in boldface. See Section 4.2 for parameter settings.

Mov Th (p = [])
          β1            β2            β3            β4
OSR       79.77 (5.33)  85.30 (4.66)  83.16 (4.50)  84.59 (5.69)
SMOTE     79.77 (5.33)  84.18 (6.51)  85.58 (5.97)  84.33 (5.75)
ADASYN    79.77 (5.33)  84.11 (6.77)  85.86 (6.06)  85.36 (5.13)

Yahoo G (p = 25)
          β1            β2            β3            β4
OSR       78.82 (1.39)  78.78 (2.02)  78.74 (1.86)  78.35 (1.47)
SMOTE     78.82 (1.39)  79.23 (1.57)  79.10 (1.20)  79.03 (1.89)
ADASYN    78.82 (1.39)  79.12 (1.43)  79.23 (1.37)  79.51 (2.01)

TaFeng (p = 25)
          β1            β2            β3            β4
OSR       66.94 (1.34)  68.84 (1.21)  66.99 (1.42)  67.70 (1.41)
SMOTE     66.94 (1.34)  68.47 (1.50)  67.07 (1.15)  66.65 (0.81)
ADASYN    66.94 (1.34)  68.62 (1.38)  67.85 (1.60)  66.91 (1.39)

Book (p = 25)
          β1            β2            β3            β4
OSR       60.08 (0.71)  61.05 (0.94)  61.82 (1.12)  64.62 (0.57)
SMOTE     60.08 (0.71)  62.60 (0.73)  60.95 (0.68)  63.00 (0.80)
ADASYN    60.08 (0.71)  62.33 (0.73)  60.77 (0.85)  63.04 (0.58)

4.4 Undersampling

Regarding the undersampling techniques, we employ a similar experimental set-up as in the previous section: firstly, train SVMs on the undersampled training data with all possible parameter combinations; secondly, choose a suitable parameter combination based on validation set AUC-performance; and finally, obtain the AUC-performance on the test set. The results are averaged across ten folds.

Before going into detail on the experimental findings, we give a short note on the effect of noise/outliers on SVM performance. Consider the following imaginary example: say that a majority class female is rating a lot of action films and thrillers. We can consider this as an outlier, since most of the females are rating romantic or drama films. This specific female has far more in common with the minority class males, who also rate action films and thrillers. Outliers are therefore instances showing odd behaviour; noise are wrongfully labelled instances. The effect of noise/outliers on SVM performance can be severe. Indeed, many of the instances in the dataset have a support value (dual variable αi) of 0. Instances contributing to the model (a non-zero support value) are examples close to the borderline (Suykens et al 2002) and noise/outliers (as can be derived from the KKT dual-complementarity conditions21).

21 If αi = 0, then yi(w^T xi + b) ≥ 1. For noise/outliers, the term yi(w^T xi + b) is negative, hence αi ≠ 0.


With respect to noise/outliers, the support value can reach a maximal value of C, the chosen regularization parameter in the SVM formulation. It should be clear that lowering the C-value leads to models that are less sensitive to noise/outliers.
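The bound 0 ≤ αi ≤ C on the support values can be inspected directly; a small scikit-learn sketch on synthetic data (our illustration, not the paper's experiments):

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.svm import SVC

# In the soft-margin dual, support values satisfy 0 <= alpha_i <= C;
# noise/outliers end up at the upper bound alpha_i = C.
X, y = make_classification(n_samples=200, flip_y=0.1, random_state=0)
C = 0.5
svm = SVC(kernel="linear", C=C).fit(X, y)
alphas = np.abs(svm.dual_coef_).ravel()   # |y_i alpha_i| of the support vectors
```

With a smaller C, the maximal weight any single noisy instance can exert on the hyperplane shrinks, which is the noise-damping effect described above.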

The experimental outcomes of the various undersampling processes are shown in Table 4 for the same four previously selected datasets. We refer to Appendix B, Table 12 for results on the entire data repository. When removing only a limited amount of majority class instances (βu = βu2 = 1/4), we observed that in 12 out of 16 datasets the "Farthest Knn" method outperforms the "Closest Knn" technique; the four remaining datasets show comparable performances.22 This finding shows that the "Farthest" method is very suitable in removing majority class noise/outliers and empirically shows their performance degrading effect. With higher undersampling rates (βu = 1), the "Closest Knn" method generally achieves higher performances than the "Farthest Knn" technique, though the results are less clear. We observed that only 9 out of 16 datasets indicate "Closest" as a clear winner, with 3 tie situations and 4 situations where "Farthest" shows the best results. Intuitively, we would expect the "Closest" method to perform better, since we are keeping the most informative instances (examples close to the minority class). However, this technique will also emphasize noise/outliers, and this is the main reason why the results are less significant. One can also see this from the fact that the RUS method shows far better results in comparison to the aforementioned techniques when βu = 1, an observation that is valid for all datasets. The RUS method is efficient in realizing noise/outlier reduction effects, since each SVM is trained with only a fraction of these types of instances. It is remarkable to see that in 9 out of 16 datasets the RUS method with βu = 1 outperforms the baseline model (βu = 0); in 4 datasets we observed equal performance, and in 3 a loss with respect to the baseline. RUS shows two attractive features: majority class noise/outlier removal and its ability to put more emphasis on the minority class instances by undersampling the majority class. A general downside of the approach is the information loss, and this effect causes the 3 losses to the baseline. Usually, though, the attractive features of RUS dominate the latter effect, and this indicates that it is definitely not necessary to include all majority class training instances in constructing the predictive model. Apparently, there is a high level of redundancy, which can be exploited to construct efficient hypotheses.

In the next paragraph, we compare the cluster-based undersampling (CBU) approach to the random undersampling (RUS) technique. Before proceeding, we should note that Table 12 does not show results on the LST, CRF and Bank datasets, due to the fact that the projected unigraph was too large to fit in memory (larger than 16 GB). A limited amount of features (top nodes) does seem to be active for a relatively large number of instances (bottom nodes). All these instances will be connected in the unigraph projection, resulting in relatively high degrees for the bottom nodes. A possible solution to circumvent this problem might be to cut the weakest edges in the projected unigraph.

If we apply a limited amount of undersampling (βu = βu2 = 1/4), the CBU technique outperforms RUS in 8 out of 13 datasets (4 losses and 1 tie with respect to RUS). On the highly imbalanced datasets (p = 1 or p = []), the CBU method wins

22 A tie occurs in the situation where the absolute difference in AUC is smaller than or equal to 0.5.


in 8 out of 8 cases. We can therefore conclude that CBU outperforms RUS at low undersampling rates. This indeed shows that handling the within-class imbalance can be beneficial in this environment. If we make a comparison at the highest undersampling rate, βu = βu5 = 1, the results are completely opposite: CBU wins in only 1 case, with 9 losses with respect to RUS and 3 ties. On the highly imbalanced datasets, CBU outperforms RUS for 1 dataset (with 4 losses and 3 ties). It is clear that RUS is the preferred technique in case of high undersampling rates. The reason is that CBU emphasizes the small disjuncts and basically fails to adequately pick up on the common, larger communities. Furthermore, these small disjuncts might actually correspond with majority class noise/outliers, and CBU is focusing on these types of instances. On the overall level, where we consider all undersampling rates (except βu = 0), CBU wins in 5 out of 13 datasets, RUS wins in 4 cases and there are 4 tie situations (yet RUS did outperform CBU in all 4 cases). Both methods seem to be competitive with one another across a wide range of imbalance levels. If we pay particular attention to the highly imbalanced sets, CBU wins in 5 cases and there are 3 tie situations on the remaining sets (yet RUS did outperform CBU in all 3 cases). CBU seems to be the method of choice (with respect to RUS) when the datasets show high imbalance levels.

Table 4 Undersampling experiments. Results show the average tenfold AUC test set performance with respect to increasing βu-values (βu = [βu1, βu2, βu3, βu4, βu5] = [0, 1/4, 1/2, 3/4, 1]). Optimal performances (boldface) are highlighted for each imbalance ratio. Additionally, standard deviations are shown between brackets. Cl K represents the "Closest Knn" technique; Cl T represents the "Closest tot sim" technique (similar for Far K and Far T; see Section 3.2). Parameter settings are given in Section 4.2.

Mov Th (p = [])
        βu1          βu2          βu3          βu4          βu5
RUS     79.77 (5.3)  80.32 (5.8)  81.57 (5.5)  81.86 (6.6)  81.26 (6.2)
Cl K    79.77 (5.3)  79.25 (4.5)  78.07 (5.0)  76.25 (6.5)  62.46 (8.5)
Cl T    79.77 (5.3)  78.40 (4.4)  72.41 (3.5)  64.66 (4.5)  60.37 (7.3)
Far K   79.77 (5.3)  84.54 (5.0)  83.64 (6.4)  80.02 (7.3)  56.82 (10.3)
Far T   79.77 (5.3)  85.03 (5.7)  82.68 (6.8)  75.61 (9.2)  56.77 (10.9)
CBU     80.11 (5.8)  81.17 (6.0)  81.08 (6.5)  84.17 (5.1)  80.96 (6.9)

Yahoo G (p = 25)
        βu1          βu2          βu3          βu4          βu5
RUS     78.82 (1.4)  78.91 (1.6)  78.97 (1.6)  78.61 (1.6)  77.82 (2.1)
Cl K    78.82 (1.4)  77.26 (1.5)  72.52 (1.5)  67.86 (2.0)  65.07 (2.7)
Cl T    78.82 (1.4)  76.83 (1.0)  71.99 (1.8)  67.15 (2.3)  61.10 (2.7)
Far K   78.82 (1.4)  78.26 (2.2)  74.69 (2.7)  67.22 (2.1)  60.72 (2.3)
Far T   78.82 (1.4)  77.68 (2.6)  72.44 (3.0)  64.94 (2.4)  59.60 (2.0)
CBU     75.25 (3.2)  75.22 (2.4)  74.69 (2.3)  73.07 (2.4)  70.69 (2.4)

TaFeng (p = 25)
        βu1          βu2          βu3          βu4          βu5
RUS     66.94 (1.3)  67.44 (1.3)  68.10 (1.4)  68.27 (1.4)  66.13 (1.2)
Cl K    66.94 (1.3)  66.13 (1.4)  63.39 (1.2)  59.83 (1.3)  56.94 (0.7)
Cl T    66.94 (1.3)  66.38 (1.5)  62.89 (1.6)  57.46 (1.3)  54.56 (1.3)
Far K   66.94 (1.3)  68.06 (1.4)  66.43 (1.6)  64.46 (1.5)  63.35 (1.3)
Far T   66.94 (1.3)  64.31 (1.1)  62.69 (1.0)  61.27 (1.1)  59.03 (1.0)
CBU     64.81 (1.2)  64.15 (1.1)  64.13 (1.2)  63.88 (0.8)  63.46 (0.8)

Book (p = 25)
        βu1          βu2          βu3          βu4          βu5
RUS     60.08 (0.7)  60.13 (0.6)  60.40 (0.8)  60.33 (0.8)  63.28 (0.8)
Cl K    60.08 (0.7)  59.96 (0.7)  60.13 (0.8)  59.96 (1.0)  59.28 (0.7)
Cl T    60.08 (0.7)  59.96 (0.7)  60.13 (0.8)  60.29 (0.4)  54.50 (0.9)
Far K   60.08 (0.7)  63.29 (1.0)  64.19 (0.8)  57.30 (1.1)  55.66 (1.1)
Far T   60.08 (0.7)  62.14 (0.5)  58.27 (0.6)  56.37 (1.0)  55.66 (1.1)
CBU     54.82 (0.9)  54.67 (0.9)  54.71 (0.9)  54.66 (1.0)  54.78 (0.9)

4.5 Boosting variants

In this section, the plain AdaBoost (AB) method is compared with the cost-sensitive AdaCost (AC) algorithm, the EasyEnsemble (EE) technique and the baseline (BL) method (where we train a single SVM on the imbalanced training data). AB, AC and EE have an underlying boosting process in common. The AUC-values are stored during each boosting iteration, which allows us to show the performance graphically as a function of the number of iterations t. In EasyEnsemble (EE), we combine the weak learners of each subset by summing their individual contributions. For example, when t = 2 boosting iterations are performed, we evaluate the performance of the combined learner \sum_{s=1}^{S} \sum_{t=1}^{2} \alpha_{s,t} h_{s,t}(x). Figures 1, 2, 3 and 4 show the average tenfold AUC-performance (with µ = 100) on the test data. The left-side figures show the AB, AC (with varying R-values) and EE (with varying S-values) performance with respect to the number of boosting iterations; the C-value is tuned according to highest validation set AUC-performance (over all possible boosting iterations). The right-side figures investigate the influence of the C-parameter for AB and EE (with S = 15) and allow us to gain insight on the effect of weak and strong learners on the generalization performance. Results on the remaining datasets are given in Appendix C, Figures 6-17. Note that we only indicate results with weight-percentage µ = 100 (use all instances in the training process); previous experiments (with µ = 75) showed inferior results. This suggests that it is indeed important to include all minority instances during the training of each weak learner, an observation also made by Liu et al (2009).

When investigating the influence of the regularization parameter C in AdaBoost, we can graphically verify that it is a "weakness" indicator, in the sense that lower C-values correspond to weaker learners. While these low C-value learners have a low performance in the first rounds of boosting, the possible boost in performance is usually much higher compared to higher C-value inducers. Strong learners can already make a suitable distinction between minority and majority class behaviours in the first boosting round. The boosting process will emphasize the hard to learn instances, among which are noise/outliers. Therefore, too strong learners are not suitable to be used in a boosting process, as already noted by Wickramaratna et al (2001). In Figures 2(b) and 3(b) we observe that the lowest C-values (C = 10^-7, 10^-5) can outperform higher C-values (C = 10^-3, 10^-1). In many cases the AB-process outperforms the baseline (BL); see also the results in Table 5. A combination of several weak learners can often achieve better generalization than a single strong learner.

Inspired by the success of the RUS method with βu = 1 (for reasons indicated in Section 4.4), the EE-technique tries to overcome an obvious downside of this approach: we are removing a lot of majority class instances that can be very informative. Hence, we try to make up for this loss by combining several of these balanced training sets, in the hope that the majority class is more adequately explored. This results in a very effective learner that seems to outperform the AB, AC and BL methods in the majority of cases, as can be seen from these figures. It is remarkable that we don't need to use that many subsets S to achieve high performances: increasing the S-value from 5 to 15 doesn't seem to cause major increases in AUC (see also Table 5). The influence of the C-parameter is similar as described in the previous paragraph.

Fig 1 Mov G (p = 25) dataset, showing average tenfold AUC-performance on test data (with µ = 100) for (a) AB, AC and EE, with C chosen according to highest validation set AUC-performance (over all possible T); (b) AB and EE (S = 15) with varying C-levels. [Figure: both panels plot test-set AUC against the boosting iteration T; panel (a) compares AB, AC (R = 2, 8, RL), EE (S = 5, 10, 15) and BL; panel (b) compares AB and EE for C = 1e-07, 1e-05, 0.001, 0.1 against BL.]


Fig 2 Book (p = 25) dataset. [Figure: same panel layout as Fig 1, plotting test-set AUC against the boosting iteration T.]

Fig 3 TaFeng (p = 25) dataset. [Figure: same panel layout as Fig 1, plotting test-set AUC against the boosting iteration T.]

Fig 4 Bank (p = []) dataset. [Figure: same panel layout as Fig 1, plotting test-set AUC against the boosting iteration T.]


4.6 Final comparison

4.6.1 Performance-wise comparison

In this section, we wish to determine which of the aforementioned methods shows superior performance in terms of AUC on test data. The experimental set-up is described in Section 4.2. Note that we exclude β = 0 and βu = 0 in determining the performance of the oversampling, respectively undersampling, techniques, to be able to compare them with the baseline (BL) approach.23 The results for AB, AC and EE are shown for µ = 100. The number of boosting iterations t ∈ [0, T] is also considered as a tunable parameter. Table 5 shows the final comparison between all methods. Additionally, we added an average rank column, indicating the mean rank of each algorithm across all datasets. The rank of an algorithm for a specific dataset corresponds to its position in the sorted AUC-performance spectrum, where the best performing algorithm gets the rank of 1, the second best rank 2, etc. In case two or more algorithms perform equally well, average ranks are assigned (if two algorithms have ranks of 3 and 4, respectively, and show equal performance, then both methods get assigned a rank of 3.5). Note that the LST dataset is excluded in the calculation of average ranks, since we consider this as a dataset not being affected by imbalance and showing equal performances.

As a side note, we noticed that increasing the number of minority training set instances can significantly improve performances. We can see this clearly by comparing the p = 25 outcomes with the results of p = 1 on the same dataset. Junque de Fortuny et al (2014a) already proved empirically that larger behaviour datasets (in terms of the number of instances or features) lead to higher AUC-performances. This situation is confirmed in our experiments, though we want to add that this is only true when the data size is increased in a balanced fashion.

The results in Table 5 confirm our previous analysis. Each of the oversampling, undersampling (except Cl Knn) and boosting variants seems to improve upon the BL performance. Sobhani et al (2015) note that the direct application of ensemble techniques on imbalanced data does not circumvent the imbalanced learning issue; in our experiments on behaviour data, we do find a simple boosting procedure applied to the imbalanced data to improve upon the predictive performance. The oversampling techniques OSR, SMOTE and ADASYN show similar outcomes and are competitive methods, seeming to outperform the undersampling techniques and the plain boosting (AB) and cost-sensitive learning (AC) approaches. When dealing with traditional data, the performances of oversampling techniques are generally worse than those of undersampling approaches (Drummond and Holte 2003). It is interesting to observe that for these large and sparse behaviour datasets the opposite is true. This is heavily related to the intrinsic properties of the data under consideration: undersampling (or feature selection, for that matter) is not recommended, because, as we have noted in Section 2.1, each feature provides a small additional contribution to the predictive performance. With respect to AC, the results indicate that R = RL is a better choice compared to more random cost ratios R = 2, 8. The EE-technique has the

23 The BL technique trains single SVMs on the imbalanced training data


lowest average rank and is our best performing method. The combination of several balanced subsets with filtered noise/outlier contributions allows one to explore the majority class space efficiently. To strengthen our previous statements, we will in the next paragraphs conduct statistical significance tests, in accordance with the procedure outlined in Demsar (2006). The latter paper presents a framework to perform statistical comparisons of classifiers over multiple datasets.

The first null-hypothesis we try to reject postulates that each of the algorithms performs equally well or, equivalently, that their average ranks R_j (see Table 5) do not differ. Under the null-hypothesis, the Friedman statistic (Friedman 1937)

\chi^2_F = \frac{12N}{k(k+1)} \left[ \sum_{j=1}^{k} R_j^2 - \frac{k(k+1)^2}{4} \right]   (6)

with N the number of datasets (N = 15) and k the number of algorithms (k = 13), can be used to compute the statistic (Iman and Davenport 1980)

F_F = \frac{(N-1)\,\chi^2_F}{N(k-1) - \chi^2_F}.   (7)

The latter is distributed according to the F-distribution with k − 1 and (k − 1)(N − 1) degrees of freedom. Evaluating this measure with the average ranks shown in Table 5 results in F_F = 22.98, which is a lot higher than the critical value of 1.81 at the α = 0.05 significance level. We can therefore reject the null-hypothesis and proceed with post-hoc tests.
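As a quick sanity check, both statistics can be reproduced directly from the average ranks in Table 5; the sketch below (plain Python, no dependencies) implements Eqs. (6) and (7).

```python
def friedman_stats(avg_ranks, n_datasets):
    """Friedman chi-square (Eq. 6) and the Iman-Davenport statistic (Eq. 7),
    computed from the average ranks R_j of k algorithms over N datasets."""
    k, N = len(avg_ranks), n_datasets
    chi2_f = (12.0 * N / (k * (k + 1))) * (
        sum(r * r for r in avg_ranks) - k * (k + 1) ** 2 / 4.0)
    f_f = (N - 1) * chi2_f / (N * (k - 1) - chi2_f)
    return chi2_f, f_f

# Average ranks from Table 5 (BL, OSR, SMOTE, ADASYN, RUS, Cl Knn, Far Knn,
# CBU, AB, AC(R = 28), AC(R = RL), EE(S = 10), EE(S = 15)); N = 15 (LST excluded)
ranks = [11.600, 5.000, 4.533, 4.800, 8.167, 12.467, 8.133, 8.567,
         8.267, 8.467, 5.400, 3.267, 2.333]
chi2_f, f_f = friedman_stats(ranks, n_datasets=15)
# f_f comes out at roughly 22.99 with these rounded ranks (the paper
# reports 22.98), well above the 1.81 critical value
```

With F_F far above the critical value, the null-hypothesis of equal average ranks is rejected, which licenses the post-hoc tests that follow.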

The Nemenyi test (Nemenyi 1963) allows one to compare each classifier to one another and adjusts the critical value to compensate for making k(k − 1)/2 comparisons.^24 "The performance of two classifiers is significantly different if the corresponding average ranks differ by at least the critical difference (CD)" (Demšar 2006). We refer to the aforementioned paper to calculate the CD. The result of these comparisons at the α = 0.05 significance level is shown in Table 6. This test is already quite revealing in the sense that, for instance, the oversampling methods (OSR, SMOTE and ADASYN), AC (R = RL) and the EE-techniques are significantly different from the BL. Yet it does not indicate that undersampling or applying plain boosting (AB) results in significantly different performances compared to the BL. The Nemenyi test is still conservative, and we therefore compare all classifiers to a single control classifier in the next paragraphs. Note that we only make k − 1 comparisons in that situation, resulting in stronger results compared to the Nemenyi test.

When comparing a single classifier i with a control classifier c, the null-hypothesis states that the two classifiers i and c perform equally well. Under this hypothesis, the test statistic (based on the average ranks R_i shown in Table 5)

z = \frac{R_i - R_c}{\sqrt{k(k+1)/(6N)}}   (8)

is distributed according to a standard normal distribution. This z-value is used to find the corresponding probability (p-value) p from the cumulative standard normal

24 The larger the number of comparisons, the higher the proportion of null-hypotheses that are wrongfully rejected due to random chance.

Imbalanced classification in sparse and large behaviour datasets 31

Table 5 Final comparison of all methods in terms of AUC test set performance; see Section 4.2 for details on the experimental set-up. β ≠ 0 and β_u ≠ 0 for the oversampling respectively undersampling techniques; µ = 100 for each of the boosting variants. Standard deviations are given between brackets and the performance difference with the BL is shown in square brackets

              Mov G(p = 1)         Mov G(p = 2.5)       Mov Th(p = [])       Yahoo A(p = 1)

BL            71.6 (2.62) [0]      81.41 (1.32) [0]     79.77 (5.33) [0]     55.92 (2.97) [0]
OSR           75.35 (2.27) [3.8]   83.76 (2.09) [2.3]   85.13 (6.1) [5.4]    60.05 (2.71) [4.1]
SMOTE         76.16 (2.27) [4.6]   83.7 (2.1) [2.3]     85.67 (4.98) [5.9]   60.1 (3) [4.2]
ADASYN        76.07 (2.26) [4.5]   83.63 (2.04) [2.2]   85.65 (5.6) [5.9]    59.9 (2.99) [4]
RUS           72.88 (2.73) [1.3]   81.52 (2.15) [0.1]   82.91 (7.19) [3.1]   57.04 (1.77) [1.1]
Cl Knn        71.43 (1.36) [-0.2]  80.88 (1.19) [-0.5]  78.87 (4.71) [-0.9]  55.78 (2.71) [-0.1]
Far Knn       71.9 (2.95) [0.3]    80.9 (1.48) [-0.5]   84.07 (4.64) [4.3]   57.2 (1.33) [1.3]
CBU           74.17 (2.36) [2.6]   81.51 (1.04) [0.1]   82.76 (7.22) [3]     58.77 (3.43) [2.8]
AB            71.65 (1.73) [0.1]   84.52 (1.89) [3.1]   82.43 (5.18) [2.7]   58.35 (2.62) [2.4]
AC(R = 28)    71.61 (2.46) [0]     83.46 (1.82) [2]     83.27 (5.6) [3.5]    57.72 (2.47) [1.8]
AC(R = RL)    74.65 (2.7) [3.1]    83.35 (2.09) [1.9]   85.41 (4.49) [5.6]   59.47 (2.33) [3.5]
EE(S = 10)    76.04 (2.66) [4.4]   85.05 (1.85) [3.6]   86.1 (5.78) [6.3]    59.66 (3.13) [3.7]
EE(S = 15)    76.12 (2.88) [4.5]   85.14 (1.86) [3.7]   86.42 (5.86) [6.7]   59.76 (2.93) [3.8]

              Yahoo A(p = 2.5)     Yahoo G(p = 1)       Yahoo G(p = 2.5)     TaFeng(p = 1)

BL            61.68 (2.42) [0]     66.84 (3.66) [0]     78.82 (1.39) [0]     55.75 (1.6) [0]
OSR           64.59 (3.12) [2.9]   73.08 (2.96) [6.2]   78.52 (2.01) [-0.3]  61.21 (2.24) [5.5]
SMOTE         65.56 (3.33) [3.9]   73.11 (3.12) [6.3]   79.01 (1.21) [0.2]   61.72 (1.81) [6]
ADASYN        65.13 (3.38) [3.4]   73.22 (3.17) [6.4]   79.74 (1.68) [0.9]   61.68 (1.86) [5.9]
RUS           64.11 (2.8) [2.4]    70.65 (3.39) [3.8]   78.91 (1.55) [0.1]   59.25 (2.18) [3.5]
Cl Knn        61.14 (2.13) [-0.5]  66.34 (3.54) [-0.5]  77.26 (1.46) [-1.6]  55.77 (1.28) [0]
Far Knn       63.96 (3.03) [2.3]   66.97 (3.54) [0.1]   78.26 (2.2) [-0.6]   59.98 (1.26) [4.2]
CBU           62.27 (1.79) [0.6]   71.27 (2.89) [4.4]   75.22 (2.42) [-3.6]  58.4 (1.57) [2.6]
AB            63.88 (2.67) [2.2]   68.9 (2.03) [2.1]    79.01 (1.66) [0.2]   56.21 (1.79) [0.5]
AC(R = 28)    64.32 (3.56) [2.6]   68.89 (3.11) [2]     78.99 (1.89) [0.2]   56.33 (1.83) [0.6]
AC(R = RL)    64.31 (3.03) [2.6]   73.13 (2.8) [6.3]    78.41 (2) [-0.4]     61.6 (2.26) [5.9]
EE(S = 10)    66.51 (3.24) [4.8]   72.61 (3.15) [5.8]   80.52 (1.6) [1.7]    61.2 (1.82) [5.4]
EE(S = 15)    66.36 (3.18) [4.7]   73.48 (2.32) [6.6]   80.54 (1.56) [1.7]   61.13 (1.83) [5.4]

              TaFeng(p = 2.5)      Book(p = 1)          Book(p = 2.5)        LST(p = 1)

BL            66.94 (1.34) [0]     52.6 (1.29) [0]      60.08 (0.71) [0]     99.99 (0.01) [0]
OSR           68.77 (1.23) [1.8]   55.87 (1.42) [3.3]   64.62 (0.57) [4.5]   99.99 (0.01) [0]
SMOTE         68.47 (1.5) [1.5]    55.07 (0.88) [2.5]   62.96 (0.82) [2.9]   99.99 (0.01) [0]
ADASYN        68.48 (1.47) [1.5]   55.04 (0.91) [2.4]   63.02 (0.57) [2.9]   99.99 (0.01) [0]
RUS           68.28 (1.39) [1.3]   54.26 (0.92) [1.7]   63.28 (0.8) [3.2]    99.98 (0.01) [0]
Cl Knn        66.13 (1.43) [-0.8]  52.69 (1.3) [0.1]    60.02 (0.79) [-0.1]  99.99 (0.01) [0]
Far Knn       68.06 (1.41) [1.1]   56.25 (1.52) [3.7]   64.15 (1.12) [4.1]   99.98 (0.01) [0]
CBU           63.84 (1.07) [-3.1]  53.75 (1.01) [1.2]   54.68 (0.88) [-5.4]  []
AB            67.65 (1.55) [0.7]   54.27 (1.95) [1.7]   65 (0.67) [4.9]      99.99 (0.01) [0]
AC(R = 28)    69.31 (1.23) [2.4]   53.72 (1) [1.1]      61.24 (0.8) [1.2]    99.98 (0.01) [0]
AC(R = RL)    67.15 (1.51) [0.2]   55.73 (1.22) [3.1]   64.6 (0.64) [4.5]    99.99 (0.01) [0]
EE(S = 10)    70.3 (1.35) [3.4]    55.09 (1.29) [2.5]   65.37 (0.61) [5.3]   99.98 (0.01) [0]
EE(S = 15)    70.4 (1.3) [3.5]     55.35 (1.26) [2.8]   65.4 (0.51) [5.3]    99.98 (0.01) [0]


Table 5 Continued. Additionally, an average rank column is added, showing the mean rank of each algorithm across all datasets. Note that we excluded the LST dataset for this purpose

              Adver(p = [])        Adver(p = 1)         CRF(p = [])           Bank(p = [])

BL            96.61 (1.82) [0]     90.93 (3.02) [0]     64.06 (16.43) [0]     66.82 (0.88) [0]
OSR           96.93 (1.91) [0.3]   93.3 (2.02) [2.4]    80.74 (12.93) [16.7]  71.39 (0.79) [4.6]
SMOTE         97.05 (1.66) [0.4]   93.35 (2.01) [2.4]   78.7 (16.56) [14.6]   []
ADASYN        96.91 (1.95) [0.3]   93.46 (2.21) [2.5]   78.87 (16.71) [14.8]  []
RUS           96.81 (1.87) [0.2]   92.38 (2.51) [1.5]   83.98 (5.99) [19.9]   69.41 (1.19) [2.6]
Cl Knn        96.4 (1.48) [-0.2]   89.73 (3.42) [-1.2]  76.63 (16.19) [12.6]  66.17 (0.72) [-0.6]
Far Knn       95.77 (1.81) [-0.8]  93.88 (1.78) [3]     83.75 (13.11) [19.7]  66.95 (0.56) [0.1]
CBU           97.15 (1.88) [0.5]   94.18 (2.3) [3.3]    []                    []
AB            97.34 (2.18) [0.7]   91.39 (3.23) [0.5]   77.62 (15.15) [13.6]  66.82 (0.88) [0]
AC(R = 28)    97.44 (1.93) [0.8]   91 (3.35) [0.1]      68.31 (14.93) [4.2]   67.67 (0.71) [0.9]
AC(R = RL)    97.46 (1.71) [0.8]   93.51 (2.17) [2.6]   85.08 (9.77) [21]     70.7 (0.8) [3.9]
EE(S = 10)    97.64 (1.35) [1]     92.97 (2.75) [2]     86.18 (10.17) [22.1]  71.46 (0.81) [4.6]
EE(S = 15)    97.63 (1.35) [1]     93.3 (2.14) [2.4]    86.35 (9.99) [22.3]   71.54 (0.76) [4.7]

              Average Rank

BL            11.600
OSR           5.000
SMOTE         4.533
ADASYN        4.800
RUS           8.167
Cl Knn        12.467
Far Knn       8.133
CBU           8.567
AB            8.267
AC(R = 28)    8.467
AC(R = RL)    5.400
EE(S = 10)    3.267
EE(S = 15)    2.333

Table 6 Nemenyi test outcomes at the α = 0.05 significance level. A value of 1 indicates the null-hypothesis is rejected and thus finds the two algorithms significantly different (a value of 0 means we accept the null-hypothesis). Methods are listed in the same order as in Table 5 and are represented more concisely

      BL  RO  SM  AD  RU  Cl  Fa  CBU  AB  AC1  AC2  EE1  EE2

BL     0   1   1   1   0   0   0   0    0   0    1    1    1
RO     1   0   0   0   0   1   0   0    0   0    0    0    0
SM     1   0   0   0   0   1   0   0    0   0    0    0    0
AD     1   0   0   0   0   1   0   0    0   0    0    0    0
RU     0   0   0   0   0   0   0   0    0   0    0    1    1
Cl     0   1   1   1   0   0   0   0    0   0    1    1    1
Fa     0   0   0   0   0   0   0   0    0   0    0    1    1
CBU    0   0   0   0   0   0   0   0    0   0    0    1    1
AB     0   0   0   0   0   0   0   0    0   0    0    1    1
AC1    0   0   0   0   0   0   0   0    0   0    0    1    1
AC2    1   0   0   0   0   1   0   0    0   0    0    0    0
EE1    1   0   0   0   1   1   1   1    1   1    0    0    0
EE2    1   0   0   0   1   1   1   1    1   1    0    0    0


distribution function: p = 2 × min(Φ(z), 1 − Φ(z)). Holm's method (Holm 1979) compares k − 1 classifiers to a single control classifier and sorts the corresponding p-values in ascending order, so that p_1 ≤ p_2 ≤ ... ≤ p_{k−1}. Each p_i is subsequently compared to its associated significance level^25 α_comp = α/(k − i). Holm starts with performing the check p_1 < α/(k − 1) and rejects the null-hypothesis if the test passes. It then proceeds with a similar check for i = 2 and continues until a certain null-hypothesis cannot be rejected. The remaining hypotheses are retained.
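The full step-down procedure is easy to script. The sketch below (plain Python) computes the z-statistics of Eq. (8), the two-sided p-values and the Holm thresholds α_comp for an arbitrary control classifier, using the average ranks from Table 5.

```python
import math

def _phi(z):
    """Standard normal CDF via the error function."""
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))

def holm_test(ranks, control, n_datasets, alpha=0.05):
    """Compare every method against `control` via Eq. (8) and apply Holm's
    step-down correction; returns rows (name, z, p, alpha_comp, reject)."""
    k = len(ranks)
    se = math.sqrt(k * (k + 1) / (6.0 * n_datasets))
    rc = ranks[control]
    # z per method, ordered by ascending two-sided p-value
    rows = sorted(((name, (r - rc) / se) for name, r in ranks.items()
                   if name != control),
                  key=lambda t: 2 * min(_phi(t[1]), 1 - _phi(t[1])))
    out, still_rejecting = [], True
    for i, (name, z) in enumerate(rows, start=1):
        p = 2 * min(_phi(z), 1 - _phi(z))
        a_comp = alpha / (k - i)
        still_rejecting = still_rejecting and p < a_comp  # stop at first failure
        out.append((name, z, p, a_comp, still_rejecting))
    return out

# Average ranks from Table 5, BL as control classifier (N = 15 datasets)
ranks = {"BL": 11.600, "OSR": 5.000, "SMOTE": 4.533, "ADASYN": 4.800,
         "RUS": 8.167, "Cl Knn": 12.467, "Far Knn": 8.133, "CBU": 8.567,
         "AB": 8.267, "AC(R=28)": 8.467, "AC(R=RL)": 5.400,
         "EE(S=10)": 3.267, "EE(S=15)": 2.333}
result = holm_test(ranks, "BL", n_datasets=15)
```

Running this reproduces the ordering of Table 7: EE(S = 15) heads the list with z close to −6.516 and is rejected against α/12 ≈ 0.004167, while the step-down halts before Cl Knn.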

Table 7 shows the result of the Holm test at the α = 0.05 significance level with the BL algorithm as control classifier. The table lists each method according to sorted p-values and shows the z-, p- and α_comp-values. The significance column indicates whether the proposed method is significantly different from the BL and coincidentally matches the result of the Nemenyi test. The p-values of the undersampling methods (Far Knn, RUS, CBU), AB and AC (R = 28) are only slightly higher than the level required to reject the null-hypothesis. We included a critical significance level α_crit, corresponding to the lowest possible significance level at which the method would be considered significantly different from the BL (if α = α_crit then p = α_comp). In the case of AB, for example, if we were to choose a significance level α = 0.0764 (a confidence level of 92.36%), the associated p-value would be smaller than α_comp and we would proceed to conclude^26 that AB performs significantly different from the BL. To summarize: the null-hypothesis of equal performance between the BL and an alternative method proposed in Section 3 (except Cl Knn) is rejected with a confidence level of more than 91% according to Holm's test. Hence, the performance of the BL approach is significantly worse than that of each of the proposed over- and undersampling methods (except Cl Knn), cost-sensitive learning techniques (AC) and boosting variants (AB, EE).

Similar to the previous paragraph, Table 8 presents the results of Holm's test at the α = 0.05 significance level with EE(S = 15) as control classifier. The results coincidentally correspond with the result of the Nemenyi test and indicate that the oversampling methods (OSR, SMOTE and ADASYN), AC(R = RL) and EE(S = 10) are not statistically different from the EE(S = 15) technique at the 0.05 significance level. However, if we were to adapt the significance level to approximately α = 0.25 (a 75% confidence level), the results would conclude significance (except for EE(S = 10)). To summarize: the null-hypothesis of equal performance between EE(S = 15) and an alternative method proposed in Section 3 (except EE(S = 10)) is rejected with a confidence level of more than 75% according to Holm's test. Hence, the performance of the oversampling, undersampling, AB and AC approaches is significantly worse than that of EE(S = 15). We can therefore safely conclude EE to be the preferred method in terms of AUC performance when dealing with imbalanced behaviour data.

25 α_comp adjusts the value of α to compensate for multiple comparisons.
26 This is not entirely true, since Holm's method requires Far Knn to pass the test first before proceeding, which only occurs at the 0.088662 significance level.


Table 7 Holm test at the α = 0.05 significance level with BL as reference. The table shows the z test statistic with associated p-value; α_comp = α/(k − i), with k the number of algorithms (k = 13) and i corresponding to the position of the method in the sorted p-value vector. Since we already ranked the p-values in ascending order, i takes on the row number (e.g. OSR has i = 5). The column "significant" denotes whether we can reject the null-hypothesis (significant: p < α_comp). α_crit corresponds with the smallest possible significance level at which we would decide to reject the null-hypothesis (α_crit = α · p/α_comp)

              z          p          α_comp     significant  α_crit

EE(S = 15)    -6.51642   7.2E-11    0.004167   1            8.64E-10
EE(S = 10)    -5.86009   4.63E-09   0.004545   1            5.09E-08
SMOTE         -4.96936   6.72E-07   0.005      1            6.72E-06
ADASYN        -4.78183   1.74E-06   0.005556   1            1.56E-05
OSR           -4.64119   3.46E-06   0.00625    1            2.77E-05
AC(R = RL)    -4.35991   1.3E-05    0.007143   1            9.11E-05
Far Knn       -2.4378    0.014777   0.008333   0            0.088662
RUS           -2.41436   0.015763   0.01       0            0.078815
AB            -2.34404   0.019076   0.0125     0            0.076305
AC(R = 28)    -2.20339   0.027567   0.016667   0            0.082701
CBU           -2.13307   0.032919   0.025      0            0.065837
Cl Knn        0.609449   0.542227   0.05       0            0.542227

Table 8 Holm test at the α = 0.05 significance level with EE(S = 15) as reference

              z          p          α_comp     significant  α_crit

Cl Knn        7.12587    1.03E-12   0.004167   1            1.24E-11
BL            6.516421   7.2E-11    0.004545   1            7.92E-10
CBU           4.383348   1.17E-05   0.005      1            0.000117
AC(R = 28)    4.313027   1.61E-05   0.005556   1            0.000145
AB            4.172384   3.01E-05   0.00625    1            0.000241
RUS           4.102063   4.09E-05   0.007143   1            0.000287
Far Knn       4.078623   4.53E-05   0.008333   1            0.000272
AC(R = RL)    2.156513   0.031044   0.01       0            0.155218
OSR           1.875229   0.060761   0.0125     0            0.243045
ADASYN        1.734587   0.082814   0.016667   0            0.248442
SMOTE         1.547064   0.121848   0.025      0            0.243696
EE(S = 10)    0.65633    0.511612   0.05       0            0.511612

4.6.2 Timing-wise comparison

The performance of each of the proposed algorithms in terms of training times depends highly upon the chosen parameters. The β and β_u parameters control the number of training instances and are an important contribution to the subsequent SVM learning times for each of the over- and undersampling techniques. The number of boosting iterations t ∈ [0, T] is also a major factor in each of the boosting variants proposed in Section 3.3. Furthermore, the chosen regularization parameter C highly influences the training times of the SVM inducer: higher C-values result in larger training times (Fan et al 2008). Not only do the chosen hyperparameters affect learning times; data characteristics such as the number of minority instances, the imbalance level, the number of features, the amount of overlap, etc. have a major effect as well.


In comparing each of the methods outlined in Section 3, we make use of a methodology similar to the one presented previously (see Section 4.6.1). For each of the ten folds, a suitable parameter combination is selected based on optimal validation set AUC-performance. These chosen parameters are applied in determining the timings of each method. The final results indicate the average timings across ten folds. Hence, we show the required amount of time for each method when using the parameter combination resulting in the best performance. Note that this might, for instance, result in using β = 13 in case of OSR or using only t = 5 boosting iterations in EE. Computational timings are shown in Table 9.

As one can observe from the aforementioned table, CBU is the slowest method; it relies on computationally intensive clustering procedures. The synthetic SMOTE and ADASYN oversampling approaches are very time-consuming methods. They both rely on computationally expensive nearest neighbour computations in conjunction with a synthetic sample generation procedure that requires a single pass over all minority training instances.

The RUS method shows, as expected, superior results in terms of computational timings. The other undersampling methods perform quite well, though they are slower than BL and OSR due to their inherent nearest neighbour computations.

The boosting variants seem to perform worse than the BL, OSR and each of the undersampling methods. Since EE was the best method in terms of AUC-performance (see Section 4.6.1), we focus on this learner in this paragraph. For the medium sized datasets (all but CRF and Bank), OSR outperforms EE(S = 15). Yet for the large datasets (CRF and Bank), the opposite is true. An SVM is known to have a quadratic computational complexity in the number of training instances (Suykens et al 2002; García and Lozano 2007). For this reason, we consider OSR (or any other oversampling technique) as inappropriate for large datasets. Since with EE each subset has a size limited to twice the minority class training size, we expect increasing benefits of the EE-technique for larger and more imbalanced behaviour data. As Junqué de Fortuny et al (2014a) have observed, larger behaviour data in terms of the number of instances or features also contribute to an increase in predictive power. Hence, we expect to see this type of big behaviour data more often in real-life applications, where computational resources can become a major constraint. Because the EE-technique is easily parallelizable by simultaneously performing the boosting process for each of the independent subsets, we added an extra row in Table 9: EE par corresponds with the time needed for EE(S = 15) divided by 15, indicating the time that would be required for EE if each subset were trained in parallel. We can see that EE par can outperform OSR in 8 out of 16 cases, precisely for the datasets that show a high imbalance level (e.g. p = 1 or p = []).
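The parallel speed-up is straightforward to realize because the S boosted learners never interact. A minimal sketch is given below; `train_boosted` is a hypothetical user-supplied callable standing in for the confidence-rated boosting run of Section 3.3.

```python
import random
from concurrent.futures import ThreadPoolExecutor

def easy_ensemble_subsets(majority_idx, minority_idx, S=15, seed=0):
    """Draw the S independent balanced subsets used by EasyEnsemble: each
    keeps all minority instances plus an equally large random majority
    sample, so every subset is only twice the minority class size."""
    rng = random.Random(seed)
    return [(rng.sample(majority_idx, len(minority_idx)), list(minority_idx))
            for _ in range(S)]

def train_parallel(subsets, train_boosted):
    """The S boosting runs are independent, so they can execute concurrently
    (threads here; processes or separate machines work equally well).
    Wall-clock time then approaches total time / S, i.e. the EE par row."""
    with ThreadPoolExecutor(max_workers=len(subsets)) as pool:
        return list(pool.map(train_boosted, subsets))
```

The final EE score for a test instance would then average the confidence-rated outputs of the S returned models.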

We conclude our comparison of the proposed methods in Figure 5, which shows the average ranks in terms of computational timings (see Table 9) and AUC performance for each of the available data sources. Regarding the ranks with respect to the latter, we added an extra row EE par, having the same AUC-performance as EE(S = 15) for each dataset in Table 5. Ranks are subsequently determined according to the procedure outlined in Section 4.6.1. Additionally, Figure 18 in Appendix D shows the result for the large and highly imbalanced CRF and Bank datasets. We expect the latter type of data to be of major interest for reasons already indicated.


Table 9 Computational timings (in seconds) averaged across ten folds, using the parameter combination resulting in the highest validation set AUC-performance. First and second best performances are emphasized in boldface (third and fourth best are underlined). EE par is the time required for EE(S = 15) divided by 15

              Mov G(p = 1)  Mov G(p = 2.5)  Mov Th(p = [])  Yahoo A(p = 1)

BL            0.032889      0.056697        0.558563        0.026922
OSR           0.055043      0.062802        0.99009         0.044421
SMOTE         0.218821      0.937057        3.841482        0.057726
ADASYN        0.284688      1.802399        5.191265        0.087694
RUS           0.011431      0.025383        0.155224        0.007991
CL Knn        0.046599      0.599846        0.989914        0.037182
Far Knn       0.039887      0.80072         0.683023        0.027788
CBU           1.034111      10.60173        6.822839        1.692477
AB            0.169792      0.841443        3.460246        0.139251
AC(R = 28)    0.471994      2.996585        1.086907        0.366555
AC(R = RL)    0.53376       1.179542        6.065177        0.209015
EE(S = 10)    0.117226      6.065145        1.17995         0.148973
EE(S = 15)    0.20474       7.173737        2.119991        0.180365

EE par        0.013649      0.478249        0.141333        0.012024

              Yahoo A(p = 2.5)  Yahoo G(p = 1)  Yahoo G(p = 2.5)  TaFeng(p = 1)

BL            0.092954      0.011915        0.044164        0.026728
OSR           0.027887      0.013241        0.047206        0.040919
SMOTE         1.062686      0.056153        0.883698        0.219553
ADASYN        2.050993      0.079073        1.733367        0.306618
RUS           0.048471      0.003234        0.033423        0.002916
CL Knn        0.84391       0.025404        0.502515        0.092167
Far Knn       0.664124      0.026576        0.500206        0.080159
CBU           15.69442      1.287221        13.55035        2.467279
AB            0.445546      0.078777        0.169977        0.114619
AC(R = 28)    1.034044      0.321723        0.515953        0.926178
AC(R = RL)    0.706215      0.226741        0.112949        0.610233
EE(S = 10)    1.026577      0.100331        1.527146        0.058052
EE(S = 15)    1.607596      0.077483        2.472582        0.10538

EE par        0.107173      0.005166        0.164839        0.007025

              TaFeng(p = 2.5)  Book(p = 1)  Book(p = 2.5)  LST(p = 1)

BL            0.032033      0.080035        0.318093        0.652045
OSR           0.032414      0.132927        0.092757        0.87152
SMOTE         5.089283      3.409418        11.43444        4.987705
ADASYN        8.148419      3.689661        12.25441        6.840083
RUS           0.020457      0.022713        0.031972        0.432839
CL Knn        1.713731      0.400873        3.711648        2.508374
Far Knn       1.539437      0.379086        3.988552        2.511037
CBU           26.42686      4.198663        46.31987        []
AB            0.713265      0.61719         1.238585        2.466151
AC(R = 28)    1.234647      1.666131        2.330635        1.451671
AC(R = RL)    0.279047      0.860346        0.197053        1.23763
EE(S = 10)    2.484502      2.145747        7.177484        0.524066
EE(S = 15)    3.363971      2.480066        11.21945        0.784111

EE par        0.224265      0.165338        0.747963        0.052274


Table 9 Continued. Additionally, an extra column indicating average ranks is included (lower ranks are preferred)

              Adver(p = [])  Adver(p = 1)  CRF(p = [])  Bank(p = [])

BL            0.010953      0.002796        0.725911        70.89334
OSR           0.012178      0.006166        3.685813        179.7481
SMOTE         0.123112      0.017764        5.633862        []
ADASYN        0.183767      0.021728        5.768669        []
RUS           0.012115      0.00204         0.147392        5.247441
CL Knn        0.061324      0.005568        1.106755        73.73282
Far Knn       0.079078      0.007069        1.110379        97.59619
CBU           3.378235      3.236754        []              []
AB            0.069199      0.103518        1.153196        83.08618
AC(R = 28)    0.193092      0.068905        2.047434        71.70548
AC(R = RL)    0.107652      0.037963        1.387174        106.3466
EE(S = 10)    0.138485      0.085686        0.198656        24.95117
EE(S = 15)    0.185136      0.139121        0.285345        36.40107

EE par        0.012342      0.009275        0.019023        2.426738

              Average Rank [pos]

BL            2.94 [2]
OSR           4.19 [4]
SMOTE         9.59 [11]
ADASYN        10.91 [13]
RUS           1.38 [1]
CL Knn        6.5 [5]
Far Knn       6.56 [6]
CBU           14 [14]
AB            8.06 [7]
AC(R = 28)    10.81 [12]
AC(R = RL)    9.25 [9]
EE(S = 10)    8.25 [8]
EE(S = 15)    9.56 [10]

EE par        3 [3]

Timings are a substantial concern with respect to these large datasets. As can be observed, the EE-technique thrives for these big and highly imbalanced data, even in its non-parallel form.


Fig. 5 Average rank AUC versus average rank Time (see Table 9) across all datasets from Table 2 (excluding Flickr and Kdd). Regarding the AUC rank, an extra row EE par is added, having the same AUC-value as EE(S = 15). Ranks for each dataset are subsequently obtained via the procedure outlined in Section 4.6.1. Points occurring in the upper-right region are preferred

5 The effect of the chosen base learner

In this short section we investigate the influence of the chosen base learner. More specifically, we ask ourselves whether some base inducers can more easily cope with the imbalanced learning issue, thereby verifying the conclusion that EE is a suitable technique to elevate upon the baseline performance. We rely on several popular classification methods for behaviour data: two regularization-based techniques (SVM, LR) and two heuristic approaches (Naive Bayes (NB), Behavioural Similarity (BeSim)). The next paragraphs briefly introduce these methods. Note that source codes and dataset results are provided in our on-line repository.

LR constructs a linear model that maximizes the likelihood of the observed data by imposing a logistic model for the parametric probability model P(Y|X). Ng and Jordan (2002) provide a thorough introduction to the topic^27 and note that in its plain form, LR can suffer from overfitting, especially when the input data is very high dimensional. We therefore resort to regularized logistic regression. We opt for L2-regularized LR, since L1-regularization corresponds with a natural way of imposing feature selection (Ng 2004). The LIBLINEAR toolbox (Fan et al 2008) is used to obtain these L2-regularized LR models.

27 For an accessible introduction, see the chapter on "Generative and Discriminative Classifiers: Naive Bayes and Logistic Regression" provided at the following link: http://www.cs.cmu.edu/~tom/NewChapters.html
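For illustration, a minimal L2-regularized LR trained by batch gradient descent on the regularized log-loss — a toy stand-in for LIBLINEAR's far more efficient solver, run on hypothetical data:

```python
import math

def train_l2_logreg(X, y, lam=0.1, lr=0.5, epochs=500):
    """L2-regularized logistic regression via batch gradient descent on
    (1/n) * log-loss + (lam/2) * ||w||^2 (no intercept, for brevity)."""
    n, d = len(X), len(X[0])
    w = [0.0] * d
    for _ in range(epochs):
        g = [0.0] * d
        for xi, yi in zip(X, y):
            p = 1.0 / (1.0 + math.exp(-sum(wj * xj for wj, xj in zip(w, xi))))
            for j, xj in enumerate(xi):
                g[j] += (p - yi) * xj          # log-loss gradient contribution
        w = [wj - lr * (gj / n + lam * wj) for wj, gj in zip(w, g)]
    return w

# Toy binary behaviour data: feature 0 is predictive, feature 1 is not
X = [[1, 0], [1, 1], [0, 0], [0, 1]]
y = [1, 1, 0, 0]
w = train_l2_logreg(X, y)  # w[0] dominates w[1] after training
```

Shrinking `lam` (i.e. raising the LIBLINEAR-style C = 1/lam) weakens the regularization and strengthens the learner, which matters later when LR is used inside a boosting process.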


NB (Ng and Jordan 2002) relies on the use of Bayes' rule, which expresses the posterior P(Y|X) as the product of the likelihood P(X|Y) and the prior P(Y), divided by the evidence P(X). The naivety of the approach originates from assuming conditional independence of the features. We impose a multivariate event model regarding the distribution P(X|Y). Junqué de Fortuny et al (2014a) provide an efficient NB implementation for dealing with large and sparse binary behaviour data.
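A minimal multivariate (Bernoulli) NB sketch for sparse binary data, storing only the active feature ids per instance; the toy data is hypothetical and the actual implementation of Junqué de Fortuny et al (2014a) is far more engineered:

```python
import math

def train_bernoulli_nb(rows, y, n_features, alpha=1.0):
    """Multivariate Bernoulli event model on sparse binary data: `rows`
    lists the active feature ids per instance; Laplace smoothing `alpha`."""
    n_c = {0: 0, 1: 0}
    counts = {0: [0] * n_features, 1: [0] * n_features}
    for feats, label in zip(rows, y):
        n_c[label] += 1
        for f in feats:
            counts[label][f] += 1
    # log P(x_f = 1 | c) with Laplace smoothing, plus class log-priors
    logp = {c: [math.log((counts[c][f] + alpha) / (n_c[c] + 2 * alpha))
                for f in range(n_features)] for c in (0, 1)}
    logprior = {c: math.log(n_c[c] / len(y)) for c in (0, 1)}
    return logp, logprior

def nb_log_odds(model, feats):
    """Log-odds score for the positive class; only active features are
    scored here, a common shortcut when the data is extremely sparse."""
    logp, logprior = model
    return (logprior[1] - logprior[0]
            + sum(logp[1][f] - logp[0][f] for f in feats))
```

The log-odds output is a ranking score, which is all that AUC evaluation requires.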

The BeSim approach conceptually corresponds to projecting the bigraph to a unigraph using the procedure outlined in Section 3.2. Next, the weighted-vote relational neighbour classifier (Macskassy and Provost 2007) is used to infer unknown labels/scores through a weighted probability estimation using the known labels of neighbouring nodes. A very fast and scalable implementation of this approach (the SW-transformation) is presented in detail in Stankova et al (2015).
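The idea can be sketched in a few lines: link two individuals with a weight equal to their number of shared behaviours and let the labelled neighbours vote. This is a simplified stand-in for the SW-transformation's actual weighting scheme:

```python
from collections import defaultdict

def besim_scores(train_rows, y, test_rows):
    """Weighted-vote relational neighbour on the projected unigraph, with
    the edge weight between two individuals taken as their co-occurrence
    count; rows are sets of active feature (behaviour) ids."""
    hits = defaultdict(lambda: [0, 0])        # feature id -> [neg, pos] counts
    for feats, label in zip(train_rows, y):
        for f in feats:
            hits[f][label] += 1
    scores = []
    for feats in test_rows:
        pos = sum(hits[f][1] for f in feats)  # votes arriving via shared features
        neg = sum(hits[f][0] for f in feats)
        scores.append(pos / (pos + neg) if pos + neg else 0.5)
    return scores
```

Because only the per-feature label counts are stored, a single pass over the training data suffices, which is what makes this family of methods scale to very large behaviour datasets.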

In Table 10 we compare the performances of the baseline (BL) application of the methods with their EE-version (with S = 15, T = 20). It is quite striking to see the EE-version dominating the baseline version when looking at each method individually. This suggests that the imbalanced learning issue has an impact on each of the aforementioned techniques. Focussing on the regularization-based approaches, we observe the baseline LR to be less influenced by the imbalance than the SVM method. Their EE-versions (EE SVM and EE LR) can boost the baseline performance in almost all cases and have the highest average ranks amongst all the proposed methods. As already emphasized, the regularization-based approaches offer an added element of flexibility in the sense that the strength of the underlying base learner can be controlled.

The heuristic approaches NB and BeSim do not offer this flexibility. Furthermore, it is not directly possible to integrate instance weights during the training phase of the base learner. During the boosting process, we therefore sample from the distribution D_t to generate (unweighted) examples and train a BeSim/NB learner next. The results in Table 10 indicate BL NB to be the worst performing algorithm amongst all methods tried. Due to the conditional independence assumption, one could regard this as a weak learner, which makes it suitable for use in a boosting process. Indeed, the EE NB approach is able to significantly improve upon its performance. We found this to be the slowest EE-version^28 (e.g. taking more than 24 hours to process a single fold of the Bank dataset, versus several minutes for the other versions). The average rank estimate of EE NB is rather pessimistic, since we assigned the lowest performance across all methods in case of missing results. BeSim can be considered a stronger learner than NB; its possible boost in performance is typically lower than that of NB. Having said this, we found EE BeSim to improve upon BL BeSim for all datasets (except Adver(p = 1)).

28 This statement only relates to the EE-version. In the baseline case, the timings of the NB-version are roughly of the same order of magnitude as SVM and LR (though this heavily depends on the chosen regularization parameter C). With EE, each subset has a size limited to twice the minority class training size (a relatively low number compared to the size of the majority class). This works in favour of SVM/LR (e.g. SVM has a computational complexity of O(m²), with m the number of training instances). Also note that the NB-version is indeed optimized for sparse data. Internally, the NB-implementation writes a fairly large model file to a specified directory (sizes of 1 GB can be encountered there). This writing operation slows down the training stage (especially in case of EE, where S × T = 15 × 20 = 300 model files are constructed).
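Weighted resampling of this kind is a one-liner; a sketch of how the boosting distribution D_t can be turned into an unweighted training sample for learners such as NB and BeSim:

```python
import random

def resample_from_distribution(instances, d_t, seed=0):
    """Draw an unweighted sample of the same size from the boosting
    distribution D_t, for base learners that cannot consume instance
    weights directly."""
    rng = random.Random(seed)
    return rng.choices(instances, weights=d_t, k=len(instances))
```

Hard-to-learn instances (large D_t entries) are then simply duplicated in the sample, which mimics the effect of weighting.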


Table 10 Average tenfold AUC test set performance of the BL compared to its EE-version (S = 15, T = 20) for SVM, LR, BeSim and NB. In case the EE NB implementation crashed, we put a × in the table; if it is too slow (> 24 hours per fold), we place a []. Additionally, average ranks are assigned to each method. In case EE NB has missing results for a particular dataset, we ensure it has the lowest rank across all other algorithms

              Mov G(p = 1)   Mov G(p = 2.5)  Mov Th(p = [])  Yahoo A(p = 1)

BL SVM        71.6 (2.62)    81.41 (1.32)    79.77 (5.33)    56.49 (3.37)
EE SVM        76.12 (2.88)   85.13 (1.86)    86.43 (5.86)    59.74 (2.96)
BL LR         71.02 (2.09)   84.39 (1.84)    83.14 (4.17)    57.84 (2.39)
EE LR         76.69 (2.92)   85.03 (1.98)    86.3 (5.37)     59.79 (2.62)
BL BeSim      76.1 (3.58)    81.3 (2.92)     82.81 (6.6)     56.27 (2.73)
EE BeSim      76.31 (3.71)   81.37 (2.9)     85.02 (6.28)    57.7 (1.71)
BL NB         70.26 (5.84)   77.01 (2.54)    70.48 (10.14)   52.56 (2.09)
EE NB         75.93 (2.83)   85.56 (2.01)    86.91 (4.15)    57.55 (2.73)

              Yahoo A(p = 2.5)  Yahoo G(p = 1)  Yahoo G(p = 2.5)  TaFeng(p = 1)

BL SVM        61.61 (2.48)   66.84 (3.66)    78.82 (1.39)    55.75 (1.6)
EE SVM        66.38 (3.16)   73.48 (2.32)    80.55 (1.55)    61.13 (1.83)
BL LR         66.27 (2.96)   69.82 (1.93)    80.45 (1.59)    58.91 (2.31)
EE LR         66.22 (3.28)   73.08 (2.14)    80.53 (1.56)    61.43 (2.32)
BL BeSim      64.54 (2.02)   68.89 (2.49)    79.55 (1.96)    57.89 (1.18)
EE BeSim      65.25 (2.23)   71.18 (2.91)    80.04 (1.85)    59.36 (1.47)
BL NB         65 (1.65)      63.33 (2.56)    78.89 (1.64)    54.61 (1.2)
EE NB         66.6 (2.79)    70.99 (2.88)    81.01 (1.3)     59.01 (1.84)

              TaFeng(p = 2.5)  Book(p = 1)   Book(p = 2.5)   LST(p = 1)

BL SVM        66.94 (1.34)   52.6 (1.29)     60.08 (0.71)    99.99 (0.01)
EE SVM        70.4 (1.3)     55.34 (1.28)    65.4 (0.51)     99.98 (0.01)
BL LR         69.24 (1.3)    55.34 (1.27)    63.84 (0.75)    99.99 (0.01)
EE LR         70.28 (1.28)   55.49 (1.49)    65.41 (0.63)    99.97 (0.02)
BL BeSim      67.49 (1.23)   55.19 (1.27)    63.7 (0.63)     99.99 (0.01)
EE BeSim      68 (1.21)      55.21 (1.15)    64.38 (0.42)    99.99 (0)
BL NB         65.21 (1.64)   52.93 (0.9)     59.75 (0.47)    98.69 (0.3)
EE NB         70.72 (1.15)   ×               63.46 (0.61)    99.92 (0.04)

              Adver(p = [])  Adver(p = 1)    CRF(p = [])     Bank(p = [])

BL SVM        96.37 (1.94)   91.18 (2.97)    64.36 (18.97)   66.82 (0.88)
EE SVM        97.63 (1.35)   93.3 (2.14)     86.35 (9.99)    71.54 (0.76)
BL LR         97.19 (1.44)   88.51 (1.93)    81.87 (19.63)   71.43 (0.72)
EE LR         97.57 (0.96)   93.02 (2.06)    86.84 (9.62)    71.77 (0.62)
BL BeSim      97.26 (1.12)   95.38 (1.35)    86.91 (9.36)    67.85 (0.67)
EE BeSim      97.38 (1.04)   93.83 (1.35)    87.02 (10.43)   70.41 (0.55)
BL NB         93.75 (1.9)    93.37 (1.9)     87.24 (9.38)    67.83 (0.63)
EE NB         94.04 (1.75)   ×               ×               []

              Flickr(p = 0.1)  Kdd(p = 0.5)  Average Rank

BL SVM        74.92 (0.17)   74.53 (0.05)    6.44 [7]
EE SVM        79.86 (0.13)   80.98 (0.05)    2.39 [1]
BL LR         79.03 (0.11)   81.29 (0.04)    4.28 [4]
EE LR         79.85 (0.13)   80.75 (0.05)    2.61 [2]
BL BeSim      74.62 (0.13)   74.95 (0)       5.11 [6]
EE BeSim      76.4 (0.13)    77.55 (0.03)    3.61 [3]
BL NB         81.36 (0.1)    74.29 (0.05)    6.5 [8]
EE NB         []             []              5.06 [5]


6 Conclusions and future research directions

We investigated the effects of sampling and cost-sensitive learning techniques on the problem of learning from imbalanced behaviour data. This type of big data has fundamentally different properties and internal structure than traditional data, and is characterized by sparseness and very large dimensions. This is a new research topic, which enables benefits across a wide variety of application domains such as fraud detection, targeted advertising, predictive policing, churn prediction, default prediction, etc. To enable future developments in this field, we provide an on-line repository containing implementations, datasets and results, available at the following link: http://www.applieddatamining.com/cms?q=software

Oversampling techniques need to be adapted in order to cope with this kind of data. We proposed new versions of the traditional SMOTE and ADASYN techniques, which differ in the way synthetic samples are generated and similarity computations are performed. The plain random oversampling (OSR) and the synthetic approaches are competitive methods in terms of AUC and do not seem to suffer from overfitting, as traditional studies report. Though the synthetic approaches have AUC-performances comparable to OSR, they are useless in practice due to their computationally expensive procedures. The OSR method is very fast for small and medium sized datasets, yet becomes inappropriate for larger datasets because the training times of the underlying base learner increase drastically.

With respect to undersampling methodologies, we employed a random undersampling technique (RUS) and Knn-based informed approaches. The latter type selects majority class instances that are closest (Cl Knn) or farthest (Far Knn) from the minority class examples and allows one to investigate the effect of noise/outliers on SVM performance. We observed that these odd instances have a severe performance degrading effect. The RUS and Far Knn approaches can significantly improve performance with respect to the baseline, thanks to their abilities to reduce noise/outlier levels and to put more emphasis on the minority class. Having said this, they usually perform worse than any of the oversampling approaches. This is in sheer contrast to the results obtained from traditional studies and is heavily related to the intrinsic nature of behaviour data. Timing-wise, RUS is the fastest method. The informed approaches are slower than RUS due to the required nearest neighbour computations. Additionally, a cluster-based undersampling (CBU) technique was proposed to alleviate the within-class imbalance. For low undersampling rates, the performance of CBU is superior to plain RUS, whereas for high undersampling rates the opposite is true. Overall, no substantial gain in performance is obtained in comparison with a simple RUS approach. Furthermore, the method is the slowest among all techniques considered and relies on computationally expensive clustering heuristics.

We studied the effects of boosting where we made use of a very specific lin-ear SVM and logistic regression combination that uses confidence rated predictionsinstead of a more popular plain ldquodiscreterdquo boosting algorithm (with weak hypothe-sis having values in minus11) Our experiments clearly indicated the regularizationconstant C in the SVM formulation to act as a ldquoweaknessrdquo indicator Indeed higherC-values cause stronger learners and shouldnrsquot be used in the boosting process whichmatches with the analysis of studies dealing with traditional data Strong learners can

42 Jellis Vanhoeyveld David Martens

already quite accurately distinguish majority class from minority class behaviours. The boosting process will focus on the hard-to-learn instances, among which are noise/outliers, and this can result in overfitting behaviour. We also observed that boosting can usually outperform the plain baseline, indicating that a combination of weak learners can achieve higher AUC-performance than a single strong learner.
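The re-weighting at the heart of this scheme can be written in a few lines (a simplified sketch of a confidence-rated update in the spirit of Schapire and Singer (1999); the weak learner producing the confidences `f` — in the paper an SVM-plus-LR combination — is abstracted away here):

```python
import numpy as np

def boosting_round_update(w, y, f):
    """One confidence-rated boosting weight update.

    y in {-1, +1} are the labels, f(x) in [-1, +1] the confidence-rated
    outputs of the current weak learner. Instances with y*f < 0
    (misclassified) gain weight; confidently correct ones lose weight."""
    w = w * np.exp(-y * f)
    return w / w.sum()
```

Because hard instances (often exactly the noise/outliers) accumulate weight round after round, this update makes the overfitting risk described above explicit.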

EasyEnsemble (EE) integrates random undersampling with a boosting process. By sampling several balanced subsets from the training data and combining these, we can emphasize the minority class, achieve noise/outlier reduction effects and simultaneously explore the majority class space efficiently. Our version differs from the original formulation because we employ a different type of underlying weak learner (SVM and LR combination) that is fed to a confidence-rated boosting algorithm (instead of the discrete version). We observed superior AUC-performance with respect to all of the aforementioned techniques. Furthermore, the method is very fast, since each subset can be trained in parallel and has a size that is twice as large as the minority class training set. We studied the effect of the number of subsets and noted that a limited amount of these (O(10)) is already sufficient to obtain satisfactory performances; increasing this number further brings only minor benefits.
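A bare-bones sketch of this sampling-and-combining scheme is given below. Note the simplifications: a single logistic regression stands in for the boosted SVM-and-LR weak learner used in the paper, scikit-learn is an assumed dependency, and members are combined by simple score averaging:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

def easy_ensemble_fit(X, y, n_subsets=10, seed=0):
    """y in {0, 1} with 1 the minority class. Each subset pairs all
    minority instances with an equally sized random majority sample,
    so every subset is only twice the minority class size."""
    rng = np.random.default_rng(seed)
    min_idx = np.flatnonzero(y == 1)
    maj_idx = np.flatnonzero(y == 0)
    models = []
    for _ in range(n_subsets):           # independent, hence parallelizable
        sub = rng.choice(maj_idx, size=min_idx.size, replace=False)
        idx = np.concatenate([min_idx, sub])
        models.append(LogisticRegression(max_iter=1000).fit(X[idx], y[idx]))
    return models

def easy_ensemble_score(models, X):
    # Combine the members by averaging their minority-class scores
    return np.mean([m.predict_proba(X)[:, 1] for m in models], axis=0)
```

Each iteration of the loop is independent, which is what makes the parallel implementation mentioned below straightforward.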

Additionally, a statistical comparison of each of the proposed methods was conducted across a wide variety of datasets. Holm's significance test indicates that each of the proposed techniques (oversampling, undersampling except Cl Knn, cost-sensitive learning and boosting variants) can significantly outperform the plain baseline performance (each of the null hypotheses of equal performance gets rejected with a confidence level of at least 91%). Comparison of the techniques with EE showed that the latter can significantly outperform all of them (each of the null hypotheses of equal performance gets rejected with a confidence level of at least 75%). A timing-wise comparison showed EE to be the preferred method (after RUS) for large and highly imbalanced datasets. A parallel implementation would make EE the fastest method even for medium-sized datasets that show a high level of imbalance.
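For reference, Holm's step-down procedure that underlies these comparisons can be stated compactly (a generic sketch operating on raw p-values; the actual comparisons in the paper derive these from Friedman average ranks):

```python
def holm_reject(p_values, alpha=0.05):
    """Holm's step-down procedure: sort p-values ascending and compare
    the i-th smallest against alpha / (m - i); stop at the first
    non-rejection, accepting all remaining hypotheses."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    reject = [False] * m
    for rank, i in enumerate(order):
        if p_values[i] <= alpha / (m - rank):
            reject[i] = True
        else:
            break   # all larger p-values are not rejected either
    return reject
```

Unlike a plain Bonferroni correction, the threshold relaxes at each step, which gives the procedure strictly more power at the same family-wise error rate.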

The influence of the chosen base learner was investigated by comparing the baseline to its EE-version for the SVM, LR, NB and BeSim inducers. Each of the EE-versions is able to improve upon the baseline performance for nearly all datasets, where base learners of the regularization type are superior because of their ability to control the strength of the learner.

Though we conducted an extensive benchmark study, there are still plenty of future research directions. First, and most important, there is an urgent need to develop a more general theoretical framework to completely understand the effect of imbalance and why certain methods outperform others. Secondly, other types of mechanisms could be investigated in this setting: advanced sampling (Guo et al 2008) set-ups perform sampling based on preliminary classifications; one-class learning is a recognition-based methodology where a discriminative boundary is learned around the training examples of a single class alone, and Raskutti and Kowalczyk (2004) demonstrated the superiority of the recognition-based one-class SVMs over the discriminative two-class SVMs; other types of ensemble learning strategies (e.g. bagging and stacking) could be investigated; Batista et al (2004) note that combining over- and undersampling can be beneficial when data are highly imbalanced or show very few minority class instances; the ideas behind Extended Nearest Neighbour (Tang and

Imbalanced classification in sparse and large behaviour datasets 43

He 2015) could guide us in the sampling process. As opposed to Knn, "it considers not only who are the nearest neighbours of a certain sample, but also who consider the particular sample as their nearest neighbour (Tang and He 2015)". The method assigns a class membership to an unknown sample that maximizes the intra-class coherence. Finally, more advanced techniques exist that can provide a K (the number of nearest neighbours) faster or with a (slightly) better performance.

Another issue relates to the lack of comprehensibility of the EasyEnsemble technique in its present form. Because we currently integrate logistic regression in the boosting process, the final ensemble classifier is non-linear and therefore incomprehensible. It might be interesting to investigate the numerical boosting technique proposed in Schapire and Singer (1999), where each weak learner is allowed to output real values (instead of values in [−1, 1]). In that case, we would be able to use a plain linear SVM as a weak learner, without the logistic regression component. The final ensemble would be a linear model, which is surely easier to comprehend. Also, the instance-based approach of Martens and Provost (2014) could be used to explain individual predictions of the non-linear EE-technique.

A Oversampling experiments

Table 11 Oversampling experiments. Results show the average tenfold AUC test set performance with respect to increasing β-values (β = [β1, β2, β3, β4] = [0, 1/3, 2/3, 1]). Additionally, standard deviations are shown between brackets. Optimal performances for each method are highlighted in boldface. See Section 4.2 for parameter settings.

         β1             β2             β3             β4
Mov G(p = 1)
OSR      71.6 (2.62)    74.37 (2.04)   73.6 (1.84)    74.73 (2.45)
SMOTE    71.6 (2.62)    75.08 (2.18)   76.02 (2.14)   76.48 (2.3)
ADASYN   71.6 (2.62)    75.16 (1.92)   75.93 (2.08)   76.47 (2.29)
Mov G(p = 25)
OSR      81.41 (1.32)   83.49 (1.81)   83.84 (1.96)   83.91 (2.04)
SMOTE    81.41 (1.32)   83.32 (1.97)   83.59 (2.04)   83.76 (2.11)
ADASYN   81.41 (1.32)   83.61 (1.82)   84.02 (1.97)   83.69 (1.96)
Mov Th(p = [])
OSR      79.77 (5.33)   85.3 (4.66)    83.16 (4.5)    84.59 (5.69)
SMOTE    79.77 (5.33)   84.18 (6.51)   85.58 (5.97)   84.33 (5.75)
ADASYN   79.77 (5.33)   84.11 (6.77)   85.86 (6.06)   85.36 (5.13)
Yahoo A(p = 1)
OSR      55.92 (2.97)   58.66 (3.27)   59.99 (2.28)   59.74 (1.78)
SMOTE    55.92 (2.97)   59.76 (2.62)   59.74 (2.67)   59.43 (2.4)
ADASYN   55.92 (2.97)   59.54 (2.53)   59.55 (2.94)   59.56 (2.22)
Yahoo A(p = 25)
OSR      61.68 (2.42)   64.19 (3.17)   65.08 (3.26)   64.67 (2.1)
SMOTE    61.68 (2.42)   65.46 (3.63)   65.33 (3.23)   64.52 (2.98)
ADASYN   61.68 (2.42)   65.04 (3.74)   65.41 (3.47)   64.4 (2.21)
Yahoo G(p = 1)
OSR      66.84 (3.66)   72.18 (2.36)   73.11 (2.7)    72.49 (3.41)
SMOTE    66.84 (3.66)   72.65 (2.85)   73.27 (3.36)   73.37 (3.56)
ADASYN   66.84 (3.66)   72.87 (2.83)   73.18 (3.2)    73.39 (3.59)
Yahoo G(p = 25)
OSR      78.82 (1.39)   78.78 (2.02)   78.74 (1.86)   78.35 (1.47)
SMOTE    78.82 (1.39)   79.23 (1.57)   79.1 (1.2)     79.03 (1.89)
ADASYN   78.82 (1.39)   79.12 (1.43)   79.23 (1.37)   79.51 (2.01)
TaFeng(p = 1)
OSR      55.75 (1.6)    59.23 (1.96)   60.0 (1.68)    61.04 (2.36)
SMOTE    55.75 (1.6)    60.26 (1.95)   61.49 (1.8)    61.13 (1.52)
ADASYN   55.75 (1.6)    60.26 (1.9)    61.44 (1.85)   61.16 (1.5)
TaFeng(p = 25)
OSR      66.94 (1.34)   68.84 (1.21)   66.99 (1.42)   67.7 (1.41)
SMOTE    66.94 (1.34)   68.47 (1.5)    67.07 (1.15)   66.65 (0.81)
ADASYN   66.94 (1.34)   68.62 (1.38)   67.85 (1.6)    66.91 (1.39)
Book(p = 1)
OSR      52.6 (1.29)    53.61 (0.94)   55.41 (1.75)   55.87 (1.44)
SMOTE    52.6 (1.29)    54.77 (0.99)   54.91 (0.8)    54.36 (0.98)
ADASYN   52.6 (1.29)    54.86 (1.13)   55.06 (0.73)   54.54 (0.92)
Book(p = 25)
OSR      60.08 (0.71)   61.05 (0.94)   61.82 (1.12)   64.62 (0.57)
SMOTE    60.08 (0.71)   62.6 (0.73)    60.95 (0.68)   63.0 (0.8)
ADASYN   60.08 (0.71)   62.33 (0.73)   60.77 (0.85)   63.04 (0.58)
LST(p = 1)
OSR      99.99 (0.01)   99.99 (0.01)   99.99 (0.01)   99.99 (0.01)
SMOTE    99.99 (0.01)   99.99 (0.01)   99.99 (0.01)   99.99 (0.01)
ADASYN   99.99 (0.01)   99.99 (0.01)   99.99 (0.01)   99.99 (0.01)
Adver(p = [])
OSR      96.61 (1.82)   97.31 (1.65)   97.07 (1.84)   97.07 (1.79)
SMOTE    96.61 (1.82)   96.91 (1.66)   97.19 (1.65)   97.07 (1.91)
ADASYN   96.61 (1.82)   97.1 (1.7)     97.08 (1.87)   97.07 (1.88)
Adver(p = 1)
OSR      90.93 (3.02)   91.27 (3.03)   92.66 (2.82)   93.29 (1.97)
SMOTE    90.93 (3.02)   92.51 (2.03)   92.96 (2.14)   93.53 (1.81)
ADASYN   90.93 (3.02)   92.22 (2.33)   92.7 (2.36)    93.88 (1.73)
CRF(p = [])
OSR      64.06 (16.43)  80.82 (12.94)  81.28 (12.27)  81.91 (11.28)
SMOTE    64.06 (16.43)  78.64 (16.86)  82.52 (13.74)  79.32 (16.26)
ADASYN   64.06 (16.43)  78.95 (16.72)  81.19 (16.32)  79.31 (16.19)
Bank(p = [])
OSR      66.82 (0.88)   70.1 (0.74)    71.39 (0.8)    71.47 (0.8)
SMOTE    []             []             []             []
ADASYN   []             []             []             []


B Undersampling experiments

Table 12 Undersampling experiments. Results show the average tenfold AUC test set performance with respect to increasing βu-values (βu = [βu1, βu2, βu3, βu4, βu5] = [0, 1/4, 1/2, 3/4, 1]). Optimal performances (boldface) are highlighted for each imbalance ratio. Additionally, standard deviations are shown between brackets. Cl K represents the "Closest Knn" technique; Cl T represents the "Closest tot sim" technique (similar for Far K and Far T, see Section 3.2). Parameter settings are given in Section 4.2.

        βu1            βu2            βu3            βu4            βu5
Mov G(p = 1)
RUS     71.6 (2.6)     71.83 (2.6)    72.54 (2.5)    72.39 (3.1)    70.61 (3.5)
Cl K    71.6 (2.6)     71.4 (2.0)     70.96 (1.9)    70.43 (2.4)    69.05 (4.1)
Cl T    71.6 (2.6)     70.28 (2.5)    66.74 (2.0)    66.8 (2.1)     68.18 (3.6)
Far K   71.6 (2.6)     72.36 (2.7)    71.26 (3.4)    66.57 (5.2)    53.5 (3.5)
Far T   71.6 (2.6)     72.22 (2.8)    71.63 (3.6)    64.28 (5.3)    50.88 (4.4)
CBU     72.55 (2.6)    73.28 (2.6)    73.12 (2.6)    73.84 (2.5)    73.0 (3.1)
Mov G(p = 25)
RUS     81.41 (1.3)    81.36 (1.3)    81.78 (1.7)    82.05 (1.7)    81.6 (2.1)
Cl K    81.41 (1.3)    80.86 (1.2)    80.95 (1.6)    79.73 (2.3)    77.95 (2.3)
Cl T    81.41 (1.3)    79.9 (1.2)     78.21 (1.4)    77.87 (1.5)    77.76 (2.3)
Far K   81.41 (1.3)    80.9 (1.5)     78.17 (1.8)    74.25 (2.4)    69.79 (3.2)
Far T   81.41 (1.3)    80.86 (1.5)    77.2 (2.4)     71.16 (2.7)    62.4 (2.8)
CBU     81.53 (1.4)    81.64 (1.3)    81.29 (1.6)    81.28 (2.1)    80.34 (2.7)
Mov Th(p = [])
RUS     79.77 (5.3)    80.32 (5.8)    81.57 (5.5)    81.86 (6.6)    81.26 (6.2)
Cl K    79.77 (5.3)    79.25 (4.5)    78.07 (5.0)    76.25 (6.5)    62.46 (8.5)
Cl T    79.77 (5.3)    78.4 (4.4)     72.41 (3.5)    64.66 (4.5)    60.37 (7.3)
Far K   79.77 (5.3)    84.54 (5.0)    83.64 (6.4)    80.02 (7.3)    56.82 (10.3)
Far T   79.77 (5.3)    85.03 (5.7)    82.68 (6.8)    75.61 (9.2)    56.77 (10.9)
CBU     80.11 (5.8)    81.17 (6.0)    81.08 (6.5)    84.17 (5.1)    80.96 (6.9)
Yahoo A(p = 1)
RUS     55.92 (3.0)    55.57 (3.4)    56.44 (3.0)    55.83 (3.4)    56.37 (3.3)
Cl K    55.92 (3.0)    55.67 (2.4)    53.12 (2.0)    50.57 (1.8)    53.79 (3.5)
Cl T    55.92 (3.0)    55.69 (2.1)    53.35 (2.2)    50.31 (2.2)    52.35 (3.3)
Far K   55.92 (3.0)    57.35 (2.2)    56.92 (1.1)    56.95 (2.3)    51.18 (2.0)
Far T   55.92 (3.0)    56.93 (2.4)    54.74 (1.9)    57.01 (1.8)    51.18 (2.0)
CBU     58.21 (2.6)    58.45 (3.3)    58.31 (3.5)    58.39 (3.5)    56.09 (2.6)
Yahoo A(p = 25)
RUS     61.68 (2.4)    62.9 (2.9)     63.62 (3.6)    63.75 (3.1)    63.19 (1.9)
Cl K    61.68 (2.4)    61.14 (2.1)    57.62 (1.6)    54.02 (1.8)    51.48 (1.4)
Cl T    61.68 (2.4)    60.89 (2.8)    58.11 (1.4)    54.4 (2.1)     51.76 (1.4)
Far K   61.68 (2.4)    63.96 (3.0)    62.62 (2.2)    59.61 (1.5)    56.25 (1.6)
Far T   61.68 (2.4)    63.71 (2.4)    59.72 (1.6)    57.27 (1.1)    54.47 (1.1)
CBU     62.46 (2.6)    61.85 (1.4)    61.78 (2.2)    59.94 (3.0)    60.1 (4.0)
Yahoo G(p = 1)
RUS     66.84 (3.7)    67.85 (3.2)    68.36 (3.2)    68.23 (4.0)    69.9 (4.2)
Cl K    66.84 (3.7)    66.71 (2.8)    64.3 (3.6)     61.98 (3.9)    61.15 (1.9)
Cl T    66.84 (3.7)    65.79 (2.7)    63.55 (3.3)    59.21 (3.5)    61.08 (2.4)
Far K   66.84 (3.7)    66.76 (4.1)    63.84 (3.4)    65.16 (2.0)    48.5 (2.9)
Far T   66.84 (3.7)    66.95 (4.1)    63.48 (2.9)    65.16 (2.0)    48.48 (2.9)
CBU     69.68 (4.1)    70.59 (3.2)    70.64 (3.7)    70.2 (2.9)     63.35 (3.6)
Yahoo G(p = 25)
RUS     78.82 (1.4)    78.91 (1.6)    78.97 (1.6)    78.61 (1.6)    77.82 (2.1)
Cl K    78.82 (1.4)    77.26 (1.5)    72.52 (1.5)    67.86 (2.0)    65.07 (2.7)
Cl T    78.82 (1.4)    76.83 (1.0)    71.99 (1.8)    67.15 (2.3)    61.1 (2.7)
Far K   78.82 (1.4)    78.26 (2.2)    74.69 (2.7)    67.22 (2.1)    60.72 (2.3)
Far T   78.82 (1.4)    77.68 (2.6)    72.44 (3.0)    64.94 (2.4)    59.6 (2.0)
CBU     75.25 (3.2)    75.22 (2.4)    74.69 (2.3)    73.07 (2.4)    70.69 (2.4)
TaFeng(p = 1)
RUS     55.75 (1.6)    56.1 (1.6)     56.26 (1.7)    57.23 (1.7)    59.25 (2.2)
Cl K    55.75 (1.6)    55.68 (1.6)    55.58 (1.5)    55.08 (1.1)    51.05 (1.5)
Cl T    55.75 (1.6)    55.67 (1.6)    54.47 (1.6)    47.53 (1.6)    49.3 (1.1)
Far K   55.75 (1.6)    58.99 (1.2)    59.47 (1.1)    60.04 (1.2)    56.31 (1.0)
Far T   55.75 (1.6)    58.92 (1.3)    59.25 (1.3)    58.58 (1.1)    56.31 (1.0)
CBU     57.8 (1.0)     58.47 (1.1)    58.15 (0.9)    58.87 (1.4)    57.65 (1.6)
TaFeng(p = 25)
RUS     66.94 (1.3)    67.44 (1.3)    68.1 (1.4)     68.27 (1.4)    66.13 (1.2)
Cl K    66.94 (1.3)    66.13 (1.4)    63.39 (1.2)    59.83 (1.3)    56.94 (0.7)
Cl T    66.94 (1.3)    66.38 (1.5)    62.89 (1.6)    57.46 (1.3)    54.56 (1.3)
Far K   66.94 (1.3)    68.06 (1.4)    66.43 (1.6)    64.46 (1.5)    63.35 (1.3)
Far T   66.94 (1.3)    64.31 (1.1)    62.69 (1.0)    61.27 (1.1)    59.03 (1.0)
CBU     64.81 (1.2)    64.15 (1.1)    64.13 (1.2)    63.88 (0.8)    63.46 (0.8)
Book(p = 1)
RUS     52.6 (1.3)     52.79 (0.9)    53.46 (0.8)    53.89 (0.9)    54.05 (0.9)
Cl K    52.6 (1.3)     52.56 (1.2)    52.52 (1.3)    52.39 (1.1)    53.09 (1.1)
Cl T    52.6 (1.3)     52.56 (1.2)    52.52 (1.3)    52.39 (1.1)    53.05 (0.7)
Far K   52.6 (1.3)     55.21 (1.2)    56.21 (1.8)    56.14 (1.2)    53.06 (1.0)
Far T   52.6 (1.3)     55.21 (1.2)    56.21 (1.8)    56.14 (1.2)    53.06 (1.0)
CBU     54.28 (0.9)    53.77 (1.0)    53.33 (1.1)    53.34 (0.9)    52.84 (0.8)
Book(p = 25)
RUS     60.08 (0.7)    60.13 (0.6)    60.4 (0.8)     60.33 (0.8)    63.28 (0.8)
Cl K    60.08 (0.7)    59.96 (0.7)    60.13 (0.8)    59.96 (1.0)    59.28 (0.7)
Cl T    60.08 (0.7)    59.96 (0.7)    60.13 (0.8)    60.29 (0.4)    54.5 (0.9)
Far K   60.08 (0.7)    63.29 (1.0)    64.19 (0.8)    57.3 (1.1)     55.66 (1.1)
Far T   60.08 (0.7)    62.14 (0.5)    58.27 (0.6)    56.37 (1.0)    55.66 (1.1)
CBU     54.82 (0.9)    54.67 (0.9)    54.71 (0.9)    54.66 (1.0)    54.78 (0.9)
LST(p = 1)
RUS     99.99 (0.0)    99.99 (0.0)    99.99 (0.0)    99.98 (0.0)    99.99 (0.0)
Cl K    99.99 (0.0)    99.99 (0.0)    99.99 (0.0)    99.99 (0.0)    99.99 (0.0)
Cl T    99.99 (0.0)    99.99 (0.0)    99.99 (0.0)    99.99 (0.0)    99.98 (0.0)
Far K   99.99 (0.0)    99.98 (0.0)    99.98 (0.0)    99.98 (0.0)    99.98 (0.0)
Far T   99.99 (0.0)    99.98 (0.0)    99.98 (0.0)    99.98 (0.0)    99.98 (0.0)
CBU     []             []             []             []             []
Adver(p = [])
RUS     96.61 (1.8)    96.32 (1.8)    96.63 (1.4)    97.12 (2.1)    96.22 (1.6)
Cl K    96.61 (1.8)    96.44 (1.5)    96.14 (1.5)    96.04 (2.0)    94.8 (2.5)
Cl T    96.61 (1.8)    95.87 (2.1)    94.32 (1.9)    93.01 (2.2)    90.72 (2.3)
Far K   96.61 (1.8)    96.53 (1.4)    95.76 (2.0)    94.39 (1.8)    90.49 (3.1)
Far T   96.61 (1.8)    96.54 (1.5)    95.67 (1.9)    94.54 (1.8)    89.3 (2.8)
CBU     96.85 (2.3)    96.85 (2.3)    97.05 (1.5)    96.6 (1.6)     96.06 (2.1)
Adver(p = 1)
RUS     90.93 (3.0)    91.53 (3.1)    92.37 (3.4)    91.9 (2.9)     91.93 (2.2)
Cl K    90.93 (3.0)    90.64 (3.0)    89.87 (3.9)    90.21 (3.6)    89.18 (2.0)
Cl T    90.93 (3.0)    89.7 (3.5)     88.55 (3.4)    85.76 (3.3)    88.2 (2.3)
Far K   90.93 (3.0)    93.8 (2.3)     92.4 (2.6)     88.73 (3.4)    85.51 (4.0)
Far T   90.93 (3.0)    93.62 (2.4)    93.2 (2.2)     88.41 (3.6)    85.51 (4.0)
CBU     93.22 (2.4)    93.76 (2.5)    93.89 (2.6)    93.52 (2.7)    91.27 (2.0)
CRF(p = [])
RUS     64.06 (16.4)   63.28 (15.9)   67.98 (17.4)   66.95 (21.9)   87.73 (8.8)
Cl K    64.06 (16.4)   62.44 (16.6)   62.34 (16.9)   71.37 (13.8)   78.22 (17.7)
Cl T    64.06 (16.4)   62.44 (16.6)   62.34 (16.9)   71.37 (13.8)   62.67 (22.9)
Far K   64.06 (16.4)   83.8 (14.2)    83.93 (14.8)   84.49 (13.7)   86.11 (9.7)
Far T   64.06 (16.4)   83.8 (14.2)    83.93 (14.8)   84.49 (13.7)   86.11 (9.7)
CBU     []             []             []             []             []
Bank(p = [])
RUS     66.82 (0.9)    67.02 (0.9)    67.37 (0.8)    67.99 (0.6)    69.5 (1.0)
Cl K    66.82 (0.9)    66.17 (0.7)    65.24 (0.6)    64.86 (0.6)    58.53 (1.1)
Cl T    66.82 (0.9)    64.92 (1.1)    60.69 (0.9)    56.33 (0.8)    52.87 (0.7)
Far K   66.82 (0.9)    66.95 (0.6)    66.19 (0.6)    64.42 (0.6)    58.25 (1.1)
Far T   66.82 (0.9)    67.16 (0.6)    64.2 (0.8)     59.67 (1.0)    58.25 (1.1)
CBU     []             []             []             []             []


C Boosting experiments

This section presents the results of each of the boosting experiments from Section 4.5 on the remaining data sources. Each of the figures below depicts the average tenfold AUC-performance on test data (with µ = 100) with respect to the number of boosting iterations for (left) AdaBoost (AB), AdaCost (AC) and EasyEnsemble (EE), with C chosen according to the highest validation set AUC-performance (over all possible boosting rounds), and (right) AB and EE (with S = 15) at varying C-levels.


Fig 6 Mov G(p = 1) dataset


Fig 7 Mov Th(p = []) dataset



Fig 8 Yahoo A(p = 1) dataset


Fig 9 Yahoo A(p = 25) dataset


Fig 10 Yahoo G(p = 1) dataset



Fig 11 Yahoo G(p = 25) dataset


Fig 12 TaFeng(p = 1) dataset


Fig 13 Book(p = 1) dataset



Fig 14 LST(p = 1) dataset


Fig 15 Adver(p = []) dataset


Fig 16 Adver(p = 1) dataset



Fig 17 CRF(p = []) dataset

D Final Comparison


Fig 18 Average rank AUC versus average rank Time (see Table 9) across the large datasets from Table 2 (CRF and Bank). Regarding the former, an extra row EE par is added, having the same AUC-value as EE(S = 15). Ranks for each dataset are subsequently obtained via the procedure outlined in Section 4.6.1. Points occurring in the upper-right region are preferred.


References

Akbani R, Kwek S, Japkowicz N (2004) Applying support vector machines to imbalanced datasets. In: Machine Learning: ECML 2004, 15th European Conference on Machine Learning, Pisa, Italy, September 20-24, 2004, Proceedings. Springer Berlin Heidelberg, Berlin, Heidelberg, pp 39–50. DOI 10.1007/978-3-540-30115-8_7
Ali A, Shamsuddin SM, Ralescu AL (2015) Classification with class imbalance problem: A review. International Journal of Advances in Soft Computing and its Applications 7(3):176–204
Alzahrani T, Horadam KJ (2016) Community detection in bipartite networks: Algorithms and case studies. In: Complex Systems and Networks: Dynamics, Controls and Applications. Springer Berlin Heidelberg, Berlin, Heidelberg, pp 25–50. DOI 10.1007/978-3-662-47824-0_2
Bachner J (2013) Predictive policing: Preventing crime with data and analytics. IBM Center for the Business of Government
Baesens B, Van Gestel T, Viaene S, Stepanova M, Suykens J, Vanthienen J (2003) Benchmarking state-of-the-art classification algorithms for credit scoring. Journal of the Operational Research Society 54(6):627–635. DOI 10.1057/palgrave.jors.2601545
Barandela R, Sánchez J, García V, Rangel E (2003) Strategies for learning in class imbalance problems. Pattern Recognition 36(3):849–851. DOI 10.1016/S0031-3203(02)00257-1
Barber MJ (2007) Modularity and community detection in bipartite networks. Physical Review E 76:066102. DOI 10.1103/PhysRevE.76.066102
Barua S, Islam MM, Yao X, Murase K (2014) MWMOTE – majority weighted minority oversampling technique for imbalanced data set learning. IEEE Transactions on Knowledge and Data Engineering 26(2):405–425. DOI 10.1109/TKDE.2012.232
Batista GEAPA, Prati RC, Monard MC (2004) A study of the behavior of several methods for balancing machine learning training data. SIGKDD Explor Newsl 6(1):20–29. DOI 10.1145/1007730.1007735
Beckett SJ (2016) Improved community detection in weighted bipartite networks. Royal Society Open Science 3(1). DOI 10.1098/rsos.140536
Bekkar M, Djemaa HK, Alitouche TA (2013) Evaluation measures for models assessment over imbalanced data sets. Journal of Information Engineering and Applications 3(10):27–38
Bhattacharyya S, Jha S, Tharakunnel K, Westland JC (2011) Data mining for credit card fraud: A comparative study. Decision Support Systems 50(3):602–613. DOI 10.1016/j.dss.2010.08.008
Blondel VD, Guillaume JL, Lambiotte R, Lefebvre E (2008) Fast unfolding of communities in large networks. Journal of Statistical Mechanics: Theory and Experiment 2008(10):P10008
Breiman L, Friedman J, Stone CJ, Olshen RA (1984) Classification and regression trees. Taylor & Francis
Brozovsky L, Petricek V (2007) Recommender system for online dating service. In: Proceedings of Znalosti 2007 Conference, VSB Ostrava
Cha M, Mislove A, Gummadi KP (2009) A measurement-driven analysis of information propagation in the Flickr social network. In: Proceedings of the 18th International Conference on World Wide Web, ACM, New York, NY, USA, WWW '09, pp 721–730. DOI 10.1145/1526709.1526806
Chawla NV (2005) Data mining for imbalanced datasets: An overview. In: Data Mining and Knowledge Discovery Handbook. Springer US, Boston, MA, pp 853–867
Chawla NV, Bowyer KW, Hall LO, Kegelmeyer WP (2002) SMOTE: synthetic minority over-sampling technique. Journal of Artificial Intelligence Research 16:321–357
Chawla NV, Lazarevic A, Hall LO, Bowyer KW (2003) SMOTEBoost: Improving prediction of the minority class in boosting. In: Knowledge Discovery in Databases: PKDD 2003. Springer Berlin Heidelberg, Berlin, Heidelberg, pp 107–119
Chawla NV, Japkowicz N, Kotcz A (2004) Editorial: Special issue on learning from imbalanced data sets. SIGKDD Explor Newsl 6(1):1–6. DOI 10.1145/1007730.1007733
Chen M, Mao S, Liu Y (2014) Big data: A survey. Mobile Networks and Applications 19(2):171–209. DOI 10.1007/s11036-013-0489-0
Chyi YM (2003) Classification analysis techniques for skewed class distribution problems. Master thesis, Department of Information Management, National Sun Yat-Sen University
Demšar J (2006) Statistical comparisons of classifiers over multiple data sets. Journal of Machine Learning Research 7(Jan):1–30
Dhillon IS (2001) Co-clustering documents and words using bipartite spectral graph partitioning. In: Proceedings of the Seventh ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, ACM, New York, NY, USA, KDD '01, pp 269–274. DOI 10.1145/502512.502550


Drummond C, Holte RC (2003) C4.5, class imbalance, and cost sensitivity: why under-sampling beats over-sampling. In: Proceedings of the ICML '03 Workshop on Learning from Imbalanced Datasets
Fan RE, Chang KW, Hsieh CJ, Wang XR, Lin CJ (2008) LIBLINEAR: A library for large linear classification. Journal of Machine Learning Research 9:1871–1874
Fan W, Stolfo SJ, Zhang J, Chan PK (1999) AdaCost: Misclassification cost-sensitive boosting. In: Proceedings of the Sixteenth International Conference on Machine Learning, Morgan Kaufmann Publishers Inc, San Francisco, CA, USA, ICML '99, pp 97–105
Fawcett T (2006) An introduction to ROC analysis. Pattern Recognition Letters 27(8):861–874. DOI 10.1016/j.patrec.2005.10.010
Finch H (2005) Comparison of distance measures in cluster analysis with dichotomous data. Journal of Data Science 3(1):85–100
Fortunato S (2010) Community detection in graphs. Physics Reports 486(3-5):75–174. DOI 10.1016/j.physrep.2009.11.002
Junque de Fortuny E, Martens D, Provost F (2014a) Predictive modeling with big data: is bigger really better? Big Data 1(4):215–226. DOI 10.1089/big.2013.0037
Junque de Fortuny E, Stankova M, Moeyersoms J, Minnaert B, Provost F, Martens D (2014b) Corporate residence fraud detection. In: Proceedings of the 20th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, ACM, New York, NY, USA, KDD '14, pp 1650–1659. DOI 10.1145/2623330.2623333
Frasca M, Bertoni A, Re M, Valentini G (2013) A neural network algorithm for semi-supervised node label learning from unbalanced data. Neural Networks 43:84–98. DOI 10.1016/j.neunet.2013.01.021
Friedman M (1937) The use of ranks to avoid the assumption of normality implicit in the analysis of variance. Journal of the American Statistical Association 32(200):675–701
García E, Lozano F (2007) Boosting support vector machines. In: Machine Learning and Data Mining in Pattern Recognition, 5th International Conference, MLDM 2007, Leipzig, Germany, July 18-20, Post Proceedings. IBaI publishing, pp 153–167
Goldstein M, Uchida S (2016) A comparative evaluation of unsupervised anomaly detection algorithms for multivariate data. PLOS ONE 11(4):1–31. DOI 10.1371/journal.pone.0152173
González PC, Velásquez JD (2013) Characterization and detection of taxpayers with false invoices using data mining techniques. Expert Systems with Applications 40(5):1427–1436. DOI 10.1016/j.eswa.2012.08.051
Guimerà R, Sales-Pardo M, Amaral LAN (2007) Module identification in bipartite and directed networks. Physical Review E 76:036102. DOI 10.1103/PhysRevE.76.036102
Guo H, Viktor HL (2004) Learning from imbalanced data sets with boosting and data generation: The DataBoost-IM approach. SIGKDD Explor Newsl 6(1):30–39. DOI 10.1145/1007730.1007736
Guo X, Yin Y, Dong C, Yang G, Zhou G (2008) On the class imbalance problem. In: 2008 Fourth International Conference on Natural Computation, IEEE, vol 4, pp 192–201. DOI 10.1109/ICNC.2008.871
Han H, Wang WY, Mao BH (2005) Borderline-SMOTE: A new over-sampling method in imbalanced data sets learning. In: Advances in Intelligent Computing. Springer Berlin Heidelberg, Berlin, Heidelberg, pp 878–887
He H, Garcia EA (2009) Learning from imbalanced data. IEEE Transactions on Knowledge and Data Engineering 21(9):1263–1284. DOI 10.1109/TKDE.2008.239
He H, Bai Y, Garcia EA, Li S (2008) ADASYN: Adaptive synthetic sampling approach for imbalanced learning. In: 2008 IEEE International Joint Conference on Neural Networks (IEEE World Congress on Computational Intelligence), IEEE, pp 1322–1328. DOI 10.1109/IJCNN.2008.4633969
Holm S (1979) A simple sequentially rejective multiple test procedure. Scandinavian Journal of Statistics 6(2):65–70
Hsu CW, Lin CJ (2002) A comparison of methods for multiclass support vector machines. IEEE Transactions on Neural Networks 13(2):415–425. DOI 10.1109/72.991427
Huang A (2008) Similarity measures for text document clustering. In: Proceedings of the Sixth New Zealand Computer Science Research Student Conference (NZCSRSC2008), Christchurch, New Zealand, pp 49–56
Iman RL, Davenport JM (1980) Approximations of the critical region of the Friedman statistic. Communications in Statistics - Theory and Methods 9(6):571–595
Jo T, Japkowicz N (2004) Class imbalances versus small disjuncts. ACM SIGKDD Explor Newsl 6(1):40–49. DOI 10.1145/1007730.1007737


Jutla IS, Jeub LG, Mucha PJ (2011-2016) A generalized Louvain method for community detection implemented in MATLAB. URL http://netwiki.amath.unc.edu/GenLouvain
Kubat M, Matwin S (1997) Addressing the curse of imbalanced training sets: One-sided selection. In: Proceedings of the Fourteenth International Conference on Machine Learning, Morgan Kaufmann Publishers Inc, San Francisco, CA, USA, pp 179–186
Lancichinetti A, Fortunato S (2009) Community detection algorithms: A comparative analysis. Physical Review E 80:056117. DOI 10.1103/PhysRevE.80.056117
Larremore DB, Clauset A, Jacobs AZ (2014) Efficiently inferring community structure in bipartite networks. Physical Review E 90:012805. DOI 10.1103/PhysRevE.90.012805
Li J, Fine JP (2010) Weighted area under the receiver operating characteristic curve and its application to gene selection. Journal of the Royal Statistical Society: Series C (Applied Statistics) 59(4):673–692. DOI 10.1111/j.1467-9876.2010.00713.x
Li X, Wang L, Sung E (2008) AdaBoost with SVM-based component classifiers. Engineering Applications of Artificial Intelligence 21(5):785–795. DOI 10.1016/j.engappai.2007.07.001
Lichman M (2013) UCI machine learning repository. URL http://archive.ics.uci.edu/ml
Liu W, Chawla S, Cieslak DA, Chawla NV (2010) A robust decision tree algorithm for imbalanced data sets. In: Proceedings of the Tenth SIAM International Conference on Data Mining, SIAM, Philadelphia, vol 10, pp 766–777
Liu XY, Wu J, Zhou ZH (2009) Exploratory undersampling for class-imbalance learning. IEEE Transactions on Systems, Man, and Cybernetics, Part B (Cybernetics) 39(2):539–550. DOI 10.1109/TSMCB.2008.2007853
Luts J, Ojeda F, Van de Plas R, De Moor B, Van Huffel S, Suykens JA (2010) A tutorial on support vector machine-based methods for classification problems in chemometrics. Analytica Chimica Acta 665(2):129–145. DOI 10.1016/j.aca.2010.03.030
Macskassy SA, Provost F (2007) Classification in networked data: A toolkit and a univariate case study. Journal of Machine Learning Research 8(May):935–983
Martens D, Provost F (2014) Explaining data-driven document classifications. MIS Quarterly 38(1):73–100
Martens D, Provost F, Clark J, Junque de Fortuny E (2016) Mining massive fine-grained behavior data to improve predictive analytics. MIS Quarterly 40(4):869–888
Mazurowski MA, Habas PA, Zurada JM, Lo JY, Baker JA, Tourassi GD (2008) Training neural network classifiers for medical decision making: The effects of imbalanced datasets on classification performance. Neural Networks 21(2-3):427–436. DOI 10.1016/j.neunet.2007.12.031
Mease D, Wyner AJ, Buja A (2007) Boosted classification trees and class probability/quantile estimation. Journal of Machine Learning Research 8:409–439
Nemenyi P (1963) Distribution-free Multiple Comparisons. Dissertation, Princeton University
Newman MEJ, Girvan M (2004) Finding and evaluating community structure in networks. Physical Review E 69:026113. DOI 10.1103/PhysRevE.69.026113
Ng AY (2004) Feature selection, L1 vs. L2 regularization, and rotational invariance. In: Proceedings of the Twenty-first International Conference on Machine Learning, ACM, New York, NY, USA, ICML '04, p 78. DOI 10.1145/1015330.1015435
Ng AY, Jordan MI (2002) On discriminative vs. generative classifiers: A comparison of logistic regression and naive Bayes. In: Advances in Neural Information Processing Systems 14, MIT Press, pp 841–848
Ngai E, Hu Y, Wong Y, Chen Y, Sun X (2011) The application of data mining techniques in financial fraud detection: A classification framework and an academic review of literature. Decision Support Systems 50(3):559–569. DOI 10.1016/j.dss.2010.08.006
Platt JC (1999) Probabilistic outputs for support vector machines and comparisons to regularized likelihood methods. In: Advances in Large-Margin Classifiers, MIT Press, pp 61–74
Porter MA, Onnela JP, Mucha PJ (2009) Communities in networks. Notices of the American Mathematical Society 56(9):1082–1097
Provost F, Fawcett T (2013) Data Science for Business: What you need to know about data mining and data-analytic thinking. O'Reilly Media, Inc
Provost F, Dalessandro B, Hook R, Zhang X, Murray A (2009) Audience selection for on-line brand advertising: Privacy-friendly social network targeting. In: Proceedings of the 15th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, ACM, New York, NY, USA, KDD '09, pp 707–716. DOI 10.1145/1557019.1557098

56 Jellis Vanhoeyveld David Martens

Raskutti B, Kowalczyk A (2004) Extreme re-balancing for SVMs: A case study. SIGKDD Explorations Newsletter 6(1):60–69, DOI 10.1145/1007730.1007739

Rosvall M, Bergstrom CT (2008) Maps of random walks on complex networks reveal community structure. Proceedings of the National Academy of Sciences 105(4):1118–1123, DOI 10.1073/pnas.0706851105

Schapire RE (1999) A brief introduction to boosting. In: Proceedings of the 16th International Joint Conference on Artificial Intelligence - Volume 2, Morgan Kaufmann Publishers Inc., San Francisco, CA, USA, IJCAI'99, pp 1401–1406

Schapire RE, Singer Y (1999) Improved boosting algorithms using confidence-rated predictions. Machine Learning 37(3):297–336, DOI 10.1023/A:1007614523901

Shmueli G (2017) Analyzing behavioral big data: Methodological, practical, ethical, and moral issues. Quality Engineering 29(1):57–74, DOI 10.1080/08982112.2016.1210979

Sobhani P, Viktor H, Matwin S (2015) Learning from imbalanced data using ensemble methods and cluster-based undersampling. In: New Frontiers in Mining Complex Patterns: Third International Workshop, NFMCP 2014, Held in Conjunction with ECML-PKDD 2014, Nancy, France, September 19, 2014, Revised Selected Papers, Springer International Publishing, Cham, pp 69–83, DOI 10.1007/978-3-319-17876-9_5

Stankova M (2016) Classification within network data with a bipartite structure. Dissertation, University of Antwerp

Stankova M, Martens D, Provost F (2015) Classification over bipartite graphs through projection. Working Papers 2015001, University of Antwerp, Faculty of Applied Economics

Sun Y, Kamel MS, Wong AK, Wang Y (2007) Cost-sensitive boosting for classification of imbalanced data. Pattern Recognition 40(12):3358–3378, DOI 10.1016/j.patcog.2007.04.009

Suykens JA, Van Gestel T, De Brabanter J, De Moor B, Vandewalle J (2002) Least squares support vector machines. World Scientific, Singapore

Tang B, He H (2015) ENN: Extended nearest neighbor method for pattern recognition [research frontier]. IEEE Computational Intelligence Magazine 10(3):52–60, DOI 10.1109/MCI.2015.2437512

Tang Y, Zhang YQ, Chawla NV, Krasser S (2009) SVMs modeling for highly imbalanced classification. IEEE Transactions on Systems, Man, and Cybernetics, Part B (Cybernetics) 39(1):281–288, DOI 10.1109/TSMCB.2008.2002909

Tobback E, Moeyersoms J, Stankova M, Martens D (2016) Bankruptcy prediction for SMEs using relational data. Working Paper 2016004, University of Antwerp, Faculty of Applied Economics

Verbeke W, Dejaeger K, Martens D, Hur J, Baesens B (2012) New insights into churn prediction in the telecommunication sector: A profit driven data mining approach. European Journal of Operational Research 218(1):211–229, DOI 10.1016/j.ejor.2011.09.031

Veropoulos K, Campbell I, Cristianini N (1999) Controlling the sensitivity of support vector machines. In: Proceedings of the International Joint Conference on Artificial Intelligence, Stockholm, Sweden (IJCAI99), pp 55–60

Whitrow C, Hand DJ, Juszczak P, Weston D, Adams NM (2009) Transaction aggregation as a strategy for credit card fraud detection. Data Mining and Knowledge Discovery 18(1):30–55, DOI 10.1007/s10618-008-0116-z

Wickramaratna J, Holden SB, Buxton BF (2001) Performance degradation in boosting. In: Proceedings of the Second International Workshop on Multiple Classifier Systems, Springer, London, UK, MCS '01, pp 11–21

Yen SJ, Lee YS (2009) Cluster-based under-sampling approaches for imbalanced data distributions. Expert Systems with Applications 36(3, Part 1):5718–5727, DOI 10.1016/j.eswa.2008.06.108

Yu HF, Lo HY, Hsieh HP, Lou JK, McKenzie TG, Chou JW, Chung PH, Ho CH, Chang CF, Wei YH, et al (2010) Feature engineering and classifier ensemble for KDD Cup 2010. In: Proceedings of the KDD Cup 2010 Workshop, pp 1–16

Zha H, He X, Ding C, Simon H, Gu M (2001) Bipartite graph partitioning and data clustering. In: Proceedings of the Tenth International Conference on Information and Knowledge Management, ACM, New York, NY, USA, CIKM '01, pp 25–32, DOI 10.1145/502585.502591

Zhang J, Mani I (2003) KNN approach to unbalanced data distributions: A case study involving information extraction. In: Proceedings of the ICML'2003 Workshop on Learning from Imbalanced Datasets, Washington, DC

Ziegler CN, McNee SM, Konstan JA, Lausen G (2005) Improving recommendation lists through topic diversification. In: Proceedings of the 14th International Conference on World Wide Web, ACM, New York, NY, USA, WWW '05, pp 22–32, DOI 10.1145/1060745.1060754


clusters all training data from both classes simultaneously and subsequently selects majority class instances from each cluster in accordance with the ratio of the number of majority class samples to the number of minority class samples within the cluster under consideration. Sobhani et al (2015) solve the between-class and within-class imbalance problem by clustering the majority class instances and selecting an equal amount of majority representatives from each cluster, such that the total number of selected majority instances equals the total amount of minority class instances.
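The cluster-based selection scheme of Sobhani et al (2015) described above can be sketched as follows. This is a minimal illustration rather than the authors' implementation; k-means, the random stand-in data and the function name `cluster_undersample` are all chosen here purely for demonstration:

```python
import numpy as np
from sklearn.cluster import KMeans

def cluster_undersample(X_maj, n_min, n_clusters=5, seed=0):
    """Cluster the majority class and draw an (approximately) equal number
    of representatives per cluster, so that the total selected roughly
    equals the minority class size n_min."""
    km = KMeans(n_clusters=n_clusters, n_init=10, random_state=seed).fit(X_maj)
    rng = np.random.default_rng(seed)
    per_cluster = n_min // n_clusters
    keep = []
    for c in range(n_clusters):
        idx = np.where(km.labels_ == c)[0]
        take = min(per_cluster, len(idx))  # a cluster may hold fewer points
        keep.extend(rng.choice(idx, size=take, replace=False))
    return X_maj[np.array(keep)]

# Toy majority class: 500 instances, 10 features; target 100 minority instances.
X_maj = np.random.default_rng(1).normal(size=(500, 10))
X_sel = cluster_undersample(X_maj, n_min=100)
print(X_sel.shape)  # at most (100, 10)
```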

The aforementioned techniques have all been investigated on traditional low-dimensional data, usually on datasets from the UCI repository. However, little is known on the effect of imbalance on the performance of classification algorithms when dealing with imbalanced behaviour data. An extensive discussion on behaviour data and its distinction from 'traditional' data is provided in Section 2.1. The classification techniques that are used for behaviour data can roughly be characterized as heuristic and regularization-based approaches, as discussed in more detail in Stankova et al (2015). Note that many classification techniques designed for traditional data cannot be directly applied to the behaviour type of data. This is because tailored techniques need to be developed that have a low computational complexity and take the sparseness structure into account. Frasca et al (2013) present a cost-sensitive neural network algorithm designed to solve the imbalanced semi-supervised unigraph classification problem in the area of gene function prediction (GFP). They note that many algorithms designed for graph learning show a decay in the quality of the solutions when the input data are imbalanced.

1.2 Goals and contributions

The main goal of this article is to investigate the effect of a large variety of over- and undersampling methods, boosting variants and cost-sensitive learning techniques - which are traditionally applied to low-dimensional, dense data - on the problem of learning from imbalanced behaviour datasets. This is our first contribution, in the sense that we are the first to perform such an investigation. The base learner we will consider is an SVM¹, for reasons indicated in Section 2.2. This learner will be applied to a large diversity of behaviour data gathered from several distinct domains. As will be discussed in Section 2.1, behaviour data have a different structure and show distinct characteristics compared to traditional data. As a direct consequence, the conclusions and results drawn from previous studies dealing with imbalanced learning cannot be generalized to this specific type of data. By conducting targeted experiments, we gain insights on the possible occurrence of overfitting during oversampling, dwell upon the presence of noise/outliers and its influence on the performance of undersampling methods and boosting variants, and finally we also gain knowledge on the effect of weak/strong learners in applying boosting algorithms. This investigation allows us to confirm or contradict these findings with the results obtained from studies dealing with 'traditional data'. Throughout the text, we will highlight the main differences.

The subgoal of this paper is to provide a benchmark for future studies in this domain. As we conduct a comprehensive comparison of the aforementioned techniques in terms of predictive performance and timings, researchers in this field can easily assess and compare their proposed techniques with the methods we have explored in this article. To enable this process, we provide implementations, datasets and results in our on-line repository http://www.applieddatamining.com/cms/?q=software.

¹ In Section 5 we will investigate the effect of several base learners.

A third contribution lies in the fact that the applied sampling methods need to be adapted to cope with behaviour data. In Section 3.1 we provide specific implementations of the SMOTE and ADASYN algorithms and highlight the differences with their original formulations. The informed undersampling techniques presented in Section 3.2 rely on nearest neighbour computations, which require a different similarity measure as opposed to 'traditional data'.

Traditional studies integrating boosting with SVM-based component classifiers typically use an RBF kernel (Wickramaratna et al 2001; García and Lozano 2007; Li et al 2008). The specific boosting algorithm we propose in Section 3.3.1 combines a linear SVM with a logistic regression (LR) to form a single confidence-rated base learner. This specific kind of weak learner has, to our knowledge, never been proposed in earlier studies, yet proves to be very valuable in the setting of behaviour data.

Our fifth contribution lies in the exploration of a larger part of the parameter space. Chawla et al (2002) and He et al (2008) both employ K = 5 nearest neighbours in their experiments. Zhang and Mani (2003) present, among others, the Near-miss 1 undersampling method, where majority class instances are selected based on their average distance to the three closest minority examples. Liu et al (2009) investigate their proposed EasyEnsemble algorithm (see Section 3.3.3 for a short description) using S = 4 subsets and T = 10 rounds of boosting. All of the aforementioned parameter settings are chosen without a clear motivation. In our experiments, we will consider a larger proportion of parameter space by varying, for instance, the number of nearest neighbours used in oversampling and undersampling methods, the number of subsets and boosting rounds in EasyEnsemble, etc. In doing this, we can more accurately compare distinct methods dealing with imbalanced learning. Parameter settings for each of the methods proposed in this study are shown in Section 4.2. Furthermore, we also study the effect of the number of subsets used in EasyEnsemble. As mentioned before, the authors only use 4 subsets.

The studies mentioned in the literature overview of Section 1.1 usually postulate a new method and compare it to one or more closely related variants by performing experiments over several datasets and reporting performance outcomes, without any further statistical grounding or interpretation. In this article, we provide statistical evidence by conducting hypothesis tests, which enables a more meaningful comparison.

2 Preliminaries

2.1 Behaviour data

The last decades have witnessed an explosion in data collection, data storage and data processing capabilities, leading to significant advances across various domains (Chen et al 2014). With the current technology, it is now possible to capture specific conducts of persons or organisations. Following the definition of Shmueli (2017), behaviour data models fine-grained actions and/or interactions of entities such as persons or objects. To distinguish 'traditional' data from behaviour data, it is appropriate to introduce the framework proposed in Junqué de Fortuny et al (2014a): traditional predictive analytics describes instances by using a small number of features (dozens up to hundreds). These datasets are usually dense in nature, meaning that each instance has a non-trivial value for each of its features (or at least for most of them). Recent years have witnessed a growing number of applications making use of behaviour data, which reflect specific behaviours of individuals or organisations. Think for example about users liking certain pages on Facebook, visiting certain websites or making transactions with specific merchants, and organisations interacting with suppliers/clients. Such data are high-dimensional, containing thousands or even millions of features. Indeed, the number of websites a user can visit, the number of Facebook pages a user can like or the number of unique merchants a user can transact with are enormous, ranging up to millions. A key characteristic is the sparse nature of behaviour data, resulting in a sparse matrix representation (see further in this section). The majority of attributes have a trivial value of "zero" (or not present) or, as Junqué de Fortuny et al (2014a) formulate it, "people only have a limited amount of behavioural capital". To summarize: behaviour data are very high-dimensional (10⁴–10⁸) and sparse, and mostly originate from capturing the fine-grained behaviours of persons or companies².

Besides the differences in structure/representation, behaviour data also show distinct properties in comparison to traditional data. Junqué de Fortuny et al (2014a) proved empirically that larger behaviour datasets, in terms of the number of instances or features³, result in significant performance gains. This is in stark contrast to the literature on sampling (reducing the number of instances) or feature selection that is commonly applied to the traditional type of data, where usually there are a large number of irrelevant features that increase the variance and the opportunity to overfit (Provost and Fawcett 2013). The behaviour type of data shows a different "relevance structure", in the sense that most of the features provide a small though relevant amount of additional information about the target prediction (Junqué de Fortuny et al 2014a). Furthermore, the instances and features show a power-law distribution (Stankova et al 2015): the vast amount of instances have a low number of non-zero (active) features and, conversely, the majority of features are only present for a few instances. In her doctoral dissertation, Stankova (2016) showed that applying non-linear classifiers to the behaviour type of data does not improve predictive performance in comparison to the plain application of linear techniques. This is definitely a major contrast to prior literature dealing with dense datasets, where linear methods generally have a lower predictive performance compared to highly non-linear techniques (Baesens et al 2003).

Imbalanced behaviour data occur naturally across a wide range of applications; some examples include companies employing data on the specific websites a user visits for targeted online advertising (Provost et al 2009), where usually only a relatively low amount of people respond positively to the advertisement. Data on the individual merchants with whom one transacts can be used to detect credit card fraud. The study of Bhattacharyya et al (2011), which uses real-life credit card transactional data, mentions that of the 49,858,600 transactions considered, only 2,420 were fraudulent (0.005%). Besides these marketing and fraud domains, other areas can be considered, such as churn prediction (Verbeke et al 2012), default prediction (Tobback et al 2016) and predictive policing (Bachner 2013). Despite the abundant application domains, little is known on the effect of imbalance on the performance of classification algorithms when dealing with this kind of data. Needless to say, empirical and theoretical developments in this field can give rise to major benefits for academia, companies and governments.

² The last requirement is not strictly necessary when we talk about behaviour data in our study (the sparseness and high-dimensionality properties are sufficient).

³ In this sparse setting, instance removal and feature selection are in a certain sense equivalent to one another.

Behaviour data can be represented as a sparse matrix or, equivalently, as a bipartite graph. In the matrix representation, each instance i corresponds to a single row and columns correspond to specific features. Let's take the example of users rating films, where we wish to predict the gender of each user. In that case, each user corresponds to a row in the matrix and each specific film corresponds to a single column. Alternatively, behaviour data can also be represented as a bipartite graph (Stankova et al 2015). Consider a set of bottom nodes N_B and a set of top nodes N_T. A bipartite graph is a network that contains edges only between nodes of a different type: each bottom node n_b ∈ N_B can only be connected to nodes n_t ∈ N_T. Returning to our example, each user i corresponds with a bottom node n_{b_i} and each film j corresponds with a top node n_{t_j}.
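The two representations can be illustrated with a small sketch. The user–film edge list below is invented for demonstration, and `scipy.sparse` is just one convenient way to hold the matrix:

```python
import numpy as np
from scipy.sparse import csr_matrix

# Hypothetical behaviour data: (user, film) pairs, "user i rated film j".
edges = [(0, 2), (0, 5), (1, 2), (2, 0), (2, 5), (2, 7)]
rows, cols = zip(*edges)
n_users, n_films = 3, 8

# Sparse matrix view: one row per user (bottom node), one column per film
# (top node); only the non-zero entries are stored.
X = csr_matrix((np.ones(len(edges)), (rows, cols)), shape=(n_users, n_films))
print(X.toarray().astype(int))

# The same data as a bipartite adjacency list: bottom node -> top nodes.
adj = {u: [f for uu, f in edges if uu == u] for u in range(n_users)}
print(adj[2])  # [0, 5, 7]
```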

Imbalanced behaviour data occur when the collection of instances contains substantially more examples of one class as compared to the other. In our running example of the previous paragraph, this means that there might be far more males rating films than females. In the datasets presented in Section 4.1, each of the instances contains a label indicating the particular class the example belongs to. In practice, instances do not necessarily have a label associated with them, and the goal is to infer a label/score for these unknown instances based on the examples with known labels. Note that we focus on the two-class classification problem, where labels are limited to two types. A multi-class classifier can be obtained by solving a sequence of binary classification problems. This can, for instance, be accomplished with one-versus-one or one-versus-all setups (Hsu and Lin 2002).

2.2 Support vector machines (SVMs)

The support vector machine (SVM) is a "state-of-the-art" classification technique that has been applied in a wide variety of domains (Suykens et al 2002). The training data consist of a set of d-dimensional input vectors x_i with corresponding labels y_i ∈ {−1, 1}, i = 1, …, m. A linear SVM constructs a linear classifier (hyperplane) in the input space. The hyperplane is constructed in such a way that as few points as possible are wrongfully classified while simultaneously maximizing the margin between the two classes. This trade-off between minimizing the model complexity and reducing misclassification is governed by the regularization parameter C. The linear SVM is the solution to the following quadratic optimization problem:

$$
\begin{aligned}
\min_{w,b,\xi_i}\quad & \frac{w^T w}{2} + C \sum_{i=1}^{m} \xi_i \\
\text{s.t.}\quad & y_i\left(w^T x_i + b\right) \ge 1 - \xi_i, \quad i = 1,\dots,m \\
& \xi_i \ge 0, \quad i = 1,\dots,m
\end{aligned}
\qquad (1)
$$

where b represents the bias and ξ_i are the slack variables measuring classification errors. The classifier is given by y(x) = sign(w^T x + b), where w^T x + b represents the output score. It should be noted that we usually solve the dual form of equation (1), with solution y(x) = sign[∑_{i=1}^{m} (α_i y_i x^T x_i) + b] and dual variables⁴ α_i. An overview on the issues regarding SVMs and imbalanced classification for low-dimensional data can be found in Akbani et al (2004). Briefly, because of the imbalance, the majority of the slack variables ξ_i represent errors with respect to the majority class. This means that the minority class is under-represented and has a minor contribution to the goal function.

In this study, we have opted for a linear SVM (using the LIBLINEAR (Fan et al 2008) package) as the base learner to classify behaviour data⁵. The feature vector will have a sparse and high-dimensional representation with d = |N_T|. Note that the SVM formulation does not change in this setting. The solution vector w will become high-dimensional. The kernel matrix contains inner products of sparse and high-dimensional vectors, resulting in a (possibly) sparse matrix.
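As a rough illustration of this setup (not the paper's experimental code), scikit-learn's LinearSVC, which wraps LIBLINEAR, can be trained directly on a sparse, high-dimensional binary matrix. The data below are random stand-ins:

```python
import numpy as np
from scipy.sparse import random as sparse_random
from sklearn.svm import LinearSVC

rng = np.random.default_rng(0)

# Synthetic stand-in for a behaviour matrix: 200 instances, 5000 binary
# features, roughly 0.5% of entries active (sparse and high-dimensional).
X = sparse_random(200, 5000, density=0.005, format="csr",
                  random_state=0, data_rvs=lambda n: np.ones(n))
y = rng.choice([-1, 1], size=200)

# LinearSVC uses the LIBLINEAR library; C is the regularization
# parameter of equation (1).
clf = LinearSVC(C=1.0).fit(X, y)

scores = clf.decision_function(X)  # output scores w^T x + b
preds = np.sign(scores)            # classifier y(x) = sign(w^T x + b)
print(clf.coef_.shape)             # w is high-dimensional: (1, 5000)
```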

The reasons we have chosen an SVM are twofold. First of all, it is a very popular technique for dealing with traditional and behaviour data and has been applied in many diverse domains (Suykens et al 2002). As we have noted in Section 1.1, tailored techniques need to be applied to behaviour datasets and roughly fall into two categories: regularization-based techniques and heuristic approaches. The latter type of techniques is not suitable for dealing with traditional data. The remaining regularization-based techniques have formed the subject of traditional studies dealing with imbalanced data. We have opted for this type of techniques since we can then easily compare our results to the conclusions obtained from previous studies. Another reason we have opted for an SVM is the fact that many of our proposed techniques (see Section 3 for details) rely on a boosting process. Wickramaratna et al (2001) and García and Lozano (2007) noted that using a strong learner can result in performance degradation during boosting. Regularization-based techniques offer an added element of flexibility, in the sense that the strength of the learner can be controlled by varying the regularization parameter. Heuristic approaches today do not offer this attractive feature.

⁴ Also called support values.
⁵ In Section 5 we will consider different types of base learners.

2.3 Evaluation metrics

The performance measures that are typically used in traditional studies dealing with imbalanced data are accuracy, sensitivity, specificity, precision, F-measure and G-mean; see for instance Bhattacharyya et al (2011), Han et al (2005), González and Velásquez (2013). These measures are derived from the confusion matrix and are based on a certain threshold applied to the output scores of the classifier, where the threshold is usually contained within the classification algorithm. There are two main issues with this approach. First of all, the built-in threshold that is applied to the scores might not be suitable. This might cause low performance values with respect to these criteria, yet if we were to simply adapt the threshold, the same performance criteria might show excellent results. The second issue lies in the fact that the chosen threshold could be irrelevant with respect to the available capacity. We address this issue with a simple example. If a targeted advertisement company would apply this classifier (with built-in threshold) to new customers, the classifier might choose to predict 5% of all possible customers to target as positive. Yet the company only has a marketing budget that allows targeting 0.1% of all possible customers. It is clear that the chosen threshold is inappropriate here. For these reasons, we have chosen to opt for the area under the ROC-curve (AUC) instead⁶. Chawla et al (2004) note that ROC-curves (and cost-curves) should be preferred over these traditional measures. The AUC, which measures the area under the ROC-curve, is more appropriate since it is independent of class skew and measures the ranking abilities of the classifier (Fawcett 2006). It answers the question: if we were to rank all instances according to output scores, is the classifier able to place the positive instances near the top of the list and the negative instances near the bottom? Because the method scans over all possible thresholds, it is independent of a specific cut-off value. Another reason we have chosen AUC is the fact that many boosting and cost-sensitive learning techniques evaluate the performance using accuracy or misclassification cost, which suffer from the same issues as previously mentioned. Hence, we are among the first to evaluate these techniques with respect to AUC. Note that AUC is also the preferred metric in the assessment of unsupervised anomaly detection techniques (Goldstein and Uchida 2016). Outliers/anomalies are strongly related to our field, as they signify rare events that are of specific importance to the analyst.
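The ranking interpretation and threshold-independence of AUC can be checked with a toy example. The labels and scores below are invented; scikit-learn's `roc_auc_score` is used for the computation:

```python
from sklearn.metrics import roc_auc_score

# Invented labels (1 = minority/positive) and classifier output scores.
y_true   = [1, 0, 1, 0, 0, 0, 1, 0]
y_scores = [0.9, 0.2, 0.35, 0.3, 0.1, 0.4, 0.8, 0.05]

# AUC depends only on how positives are ranked relative to negatives:
# here 14 of the 15 positive-negative pairs are ordered correctly -> 14/15.
print(roc_auc_score(y_true, y_scores))

# Any monotone rescaling of the scores leaves the ranking, and hence the
# AUC, unchanged; no specific cut-off is ever fixed.
print(roc_auc_score(y_true, [10 * s - 3 for s in y_scores]))
```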

Though we focus on AUC, other measures are suitable for performance assessment of imbalanced data. The weighted area under the ROC-curve (wAUC) (Li and Fine 2010) is able to emphasize certain regions of ROC-space. Lift-curves (Bekkar et al 2013), popular in the marketing domain, are ideal to assess the prevalence of positive cases among instances ranked highly by the classifier. These curves can be evaluated according to the available capacity. When cost information is available, this should be integrated in the performance assessment criterion. Indeed, the cost of a false negative is usually much larger than the cost of a false positive. We could then opt for cost-curves (Whitrow et al 2009; Bhattacharyya et al 2011), which can also be evaluated with consideration of the available capacity. Few studies make use of costs (Ngai et al 2011), mainly because these are difficult to determine, uncertain, and have a temporal characteristic. To conclude this section, we note that each of the performance assessment criteria mentioned in this paragraph requires some additional information in the form of a weight function (wAUC), capacity requirements and/or costs (lift/cost-curves). These measures are application-specific and this is the main reason we excluded them from our study.

⁶ We provide traditional measures (sensitivity, specificity, G-means, F-measure) in our online repository http://www.applieddatamining.com/cms/?q=software.

3 Methods

3.1 Oversampling

Over the years, the data mining community has investigated several techniques that balance the data distribution by oversampling the minority class. In this section, we investigate the basic oversampling with replacement (OSR) approach in conjunction with synthetic sample generation procedures. The first technique simply duplicates minority class instances by a certain amount. Several references dealing with traditional low-dimensional data note that this technique may make the decision regions of the learner smaller and too specific to the replicated instances, causing the learner to overfit (Chawla et al 2002; Han et al 2005; Liu et al 2009; He and Garcia 2009). The synthetic approaches are designed to overcome this overfitting behaviour by generating new, non-overlapping instances in the minority class space. The techniques we investigate are SMOTE (Chawla et al 2002) and ADASYN (He et al 2008).

Consider a certain minority class instance x_i. In the traditional setting, SMOTE and ADASYN will generate a new synthetic instance x_syn by choosing a random point on the line segment between the point x_i and one of its K minority class nearest neighbours (computed according to Euclidean distance). In SMOTE, each original minority class instance x_i generates the same number of synthetic instances, whereas the ADASYN algorithm generates a variable number of synthetic instances by putting a larger weight on the harder to learn minority instances. This way, the learner is more focussed toward difficult instances.

These techniques need to be adapted when dealing with binary behaviour data, where each instance is represented by a large and sparse binary vector. The main differences with the original versions of SMOTE and ADASYN are indicated in Table 1 and are explained in more detail in the following paragraphs. The first difference lies in the generation of the new synthetic instances. As before, a synthetic sample is constructed based on two original minority instances. When both instances have a 0 or 1 in their corresponding column, the synthetic sample will also show a 0 or 1, respectively, at the considered position. When only one of the two minority instances shows a 1, the decision will be made according to a user-specified parameter prior_opt. This parameter can be one of the following three options:

– "FlipCoin", where there is a 50% probability that the synthetic instance will show a 1 at the considered position.

– "Prior", where the value of the synthetic sample is determined by the prior within the minority class in the corresponding column. One generates a random number u in the interval [0,1] and puts a 1 in the corresponding position if u is smaller than the prior.

– "Reverse Prior", where one generates a random number u in the interval [0,1] and puts a 1 in the corresponding position if u is larger than the prior within the minority class for this column/feature.

The second difference lies in the way the nearest neighbours are determined. It would be unwise to consider Euclidean distance in that respect, because it treats a 0–0 match in the same way as a 1–1 match. Since we are working with dichotomous variables (i.e. present or absent), a 1–1 match at a certain position is far more informative than a 0–0 match. For instance, two users visiting the same web page contains more information than two users who didn't visit that specific page. The similarity between two instances is defined by a user-specified parameter sim_measure. We have limited ourselves to two popular choices: "Jaccard" uses the Jaccard similarity measure (Finch 2005) and "Cosine" uses the cosine similarity measure (Huang 2008). In principle, one could apply any of the metrics summarized in Stankova et al (2015).
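For binary vectors, both similarity options reduce to simple counts of 1–1 matches, which is why they ignore 0–0 matches by construction. A small sketch with illustrative vectors:

```python
import numpy as np

def jaccard(a, b):
    """Jaccard similarity for binary vectors:
    |1-1 matches| / |positions active in either vector|."""
    inter = np.sum(a & b)
    union = np.sum(a | b)
    return inter / union if union else 0.0

def cosine(a, b):
    """Cosine similarity; for binary vectors this is
    |1-1 matches| / sqrt(|active in a| * |active in b|)."""
    denom = np.sqrt(a.sum() * b.sum())
    return np.sum(a & b) / denom if denom else 0.0

u = np.array([1, 1, 0, 1, 0, 0, 0, 0])
v = np.array([1, 0, 0, 1, 0, 0, 1, 0])
print(jaccard(u, v))  # 2 matches out of 4 active-in-either -> 0.5
print(cosine(u, v))   # 2 / sqrt(3 * 3) = 2/3
```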

Table 1 Differences between the original SMOTE and ADASYN implementations and the versions SMOTE_beh and ADASYN_beh tailored for behaviour data (parameter explanations are provided in Section 3.1)

                               SMOTE            SMOTE_beh        ADASYN           ADASYN_beh
Amount of oversampling         N                β                β                β
Synthetic sample generation    random point     prior_opt        random point     prior_opt
                               on line segment                   on line segment
Similarity measure             Euclidean        Jaccard/Cosine   Euclidean        Jaccard/Cosine
Number of nearest neighbours   K                K̄                K                K, K̄

A detailed pseudo-code implementation of our versions of SMOTE and ADASYN, called SMOTE_beh and ADASYN_beh, is shown in Algorithm 1. Also note that we introduced an extra parameter K̄ to decouple the determination of the number of synthetic instances that need to be generated for a certain minority instance from the number of nearest neighbours it uses to obtain the synthetic instances.

The experimental set-up adopted by Chawla et al (2002) and He et al (2008) considers K = 5 as the number of nearest neighbours used, without a detailed motivation. Furthermore, the latter paper compares SMOTE with an oversampling percentage N of 200% (meaning that the size of the newly created synthetic data instances is twice as large as the size of the original minority training data) with a completely balanced dataset (β = 1 in Algorithm 1) in ADASYN. In our experiments, we will consider a variety of possible K-values and compare SMOTE and ADASYN with identical oversampling rates, controlled by a single parameter β.

Algorithm 1 SMOTEbeh and ADASYNbeh pseudo-code implementation for binary behaviour data

Input: X_min, X_maj, β, prior_opt, sim_measure, K, K̃

a) Determine the total amount of synthetic minority instances that need to be generated (β ∈ [0,1] is a parameter that controls the amount of oversampling; β = 1 means a fully balanced dataset will be created):

    G = (|X_maj| − |X_min|) × β

b) Determine the number of synthetic samples g_i that need to be generated for each minority class instance x_i:
if SMOTE then
    g_i ← ⌈G / |X_min|⌉
else if ADASYN then
    Calculate the K nearest neighbours (with the sim_measure option) of instance x_i from the set (X_min \ x_i) ∪ X_maj and determine Δ_i, the number of majority class nearest neighbours.
    Next, calculate r_i = Δ_i / K and normalize these values: r_i ← r_i / Σ_{j=1..|X_min|} r_j
    g_i ← ⌈r_i × G⌉
end if
c) Generate g_i synthetic samples for minority instance x_i:
Calculate the K̃ nearest neighbours (with the sim_measure option) of instance x_i from the set X_min \ x_i. Additionally, remove those nearest neighbours that have a similarity of 0 with x_i. The remaining nearest neighbours form the set K_used. If this set turns out to be empty, set K_used = {x_i}.
for iter = 1 → g_i do
    Randomly choose one nearest neighbour from the set K_used
    Generate a synthetic minority sample from x_i and the chosen nearest neighbour (according to prior_opt)
end for
d) Because Σ g_i ≥ G, randomly remove synthetic points until the total number of synthetic samples equals G
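To make the generation step concrete, the sketch below implements the SMOTEbeh branch of Algorithm 1 on a toy representation where each binary instance is a set of active feature indices. The trimming of the surplus follows the pseudo-code; the "FlipCoin" rule coded here (keep the features both parents share, flip a fair coin for each disputed feature) is our assumed reading of the prior_opt option, and `smote_beh` is an illustrative name, not the paper's implementation.

```python
import random

def cosine_sim(a, b):
    # a, b: binary instances represented as sets of active feature indices
    inter = len(a & b)
    return inter / ((len(a) * len(b)) ** 0.5) if inter else 0.0

def smote_beh(X_min, n_maj, beta, K=5, seed=0):
    """Sketch of SMOTEbeh; the FlipCoin generation rule is an assumption."""
    rng = random.Random(seed)
    G = int((n_maj - len(X_min)) * beta)      # step a: total synthetics
    g_i = -(-G // len(X_min))                 # step b: ceil(G / |X_min|)
    synthetic = []
    for i, x in enumerate(X_min):
        # step c: K nearest minority neighbours, zero-similarity ones removed
        sims = sorted(((cosine_sim(x, z), j)
                       for j, z in enumerate(X_min) if j != i), reverse=True)
        K_used = [X_min[j] for s, j in sims[:K] if s > 0] or [x]
        for _ in range(g_i):
            nb = rng.choice(K_used)
            # keep shared features; resolve disputed ones with a coin flip
            synthetic.append({f for f in (x | nb)
                              if f in (x & nb) or rng.random() < 0.5})
    rng.shuffle(synthetic)                    # step d: sum(g_i) >= G,
    return synthetic[:G]                      # so drop the surplus
```

A call such as `smote_beh([{0, 1}, {1, 2}, {0, 2}], n_maj=9, beta=1.0)` returns the six synthetic instances needed to balance the toy set.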

3.2 Undersampling

In this section we will compare the simple random undersampling technique (RUS) with informed undersampling approaches. The first method randomly discards majority class training instances. While this technique can achieve fast training performance of the underlying base learner, an obvious disadvantage is the fact that it might discard potentially useful majority class instances. The informed approaches try to intelligently retain the most informative majority class instances, in the hope of increasing predictive performance while at the same time keeping the fast training speed of the underlying classifier.

The first set of informed undersampling techniques is based on the methods proposed by Zhang and Mani (2003) and Chyi (2003). The K-nearest neighbour 'classifier' is used to determine the importance of each majority class training instance by calculating the total similarity with the K closest minority class training set examples. Regarding similarity computations for binary behaviour data, we refer to the related discussion in Section 3.1. The first technique, called "Closest Knn", retains majority class examples that are closest to the minority class instances. These instances are the most difficult to classify, and we would expect them to be the most informative. The second method, called "Closest tot sim", is similar to the previously described technique. The difference is that it no longer computes similarities with the K closest minority neighbours; instead, it calculates the total similarity with all minority instances in determining the importance. The main reason we included this technique is computational speed.7 The last techniques, called "Farthest Knn" and "Farthest tot sim", are included for comparison with the previously mentioned techniques. Their implementation is identical to that of the previously described techniques; however, they retain majority class examples that are farthest from the minority class instances. For each of the methods proposed in this paragraph, the amount of undersampling is controlled by a user-specified parameter β_u according to the following formula:

    Nr_rem = ⌊(|X_maj| − |X_min|) × β_u⌋    (2)

where Nr_rem represents the amount of majority class instances to be discarded; β_u = 1 means a completely balanced dataset is obtained.

The second set of informed undersampling techniques aims at targeting the within-class imbalance problem and is based on the approach proposed in Sobhani et al (2015). They postulate that this within-class imbalance problem is more pronounced in the case of undersampling methods: if we were to randomly select majority class instances, then the probability of drawing an instance from small disjuncts within the majority class would be very low. These regions might therefore contain no representatives and remain unlearned. The authors chose to address this issue by clustering the majority class instances in a first step and subsequently selecting an equal number of representatives from each cluster. The reported results show their approach to outperform the CBO algorithm (see Section 1.1). In the following paragraphs we will dig deeper into the subject of clustering behaviour data. We already refer to Algorithm 2 for an overview of our cluster-based undersampling method (CBU).

As we noted in Section 2.1, behaviour data can be represented as a bipartite graph. The clustering of behaviour data8 aims at finding groups of nodes (communities) that connect more to each other than to other nodes in the network. This subject is currently an active area of research, with a rapid evolution of a vast number of cluster detection techniques (Zha et al 2001; Dhillon 2001; Larremore et al 2014; Beckett 2016). We refer to Porter et al (2009), Fortunato (2010) and Alzahrani and Horadam (2016) for detailed surveys on the problem. It should be noted that the vast majority of publications deal with the subject of clustering unigraphs (networks with only one type of nodes); it is only fairly recently that interest grew in the clustering of bigraphs. In our implementations we have chosen the popular modularity-based approaches9 for clustering bigraphs, and these fall into two directions. In the first, the modularity function that is used for unigraphs is adapted to be suitable for bigraphs, see for instance the work of Barber (2007). The other direction, which we adopt in our study, projects the bigraph to a unigraph of bottom nodes and performs community detection on the projection using traditional modularity definitions. Note that Guimera et al (2007) observed no difference in the obtained communities using either direction.

7 For each majority class instance, we no longer need to sort the similarities with all minority instances in determining the K largest values.

8 This subject is more commonly known as community detection in bipartite graphs.

9 Modularity-based approaches attempt to optimize a quality function known as modularity for finding community structures in networks, and rely on the use of heuristics due to the complexity of the problem.


In this article we adopt the methodology proposed in Alzahrani and Horadam (2016), which consists of projecting the bigraph to a unigraph of bottom nodes and applying10 the Louvain algorithm (Blondel et al 2008) on the projection. Lancichinetti and Fortunato (2009) performed a comparative study regarding the performance of 12 community detection algorithms and concluded the Louvain method to be the best modularity-based algorithm and second best among all algorithms. The heuristic is very fast, with an O(m) complexity, m being the number of edges in the unigraph. We have chosen the Louvain algorithm because of its speed and the availability of a toolbox (Jutla et al 2011-2016) that is directly compatible with our implementations. The toolbox provides a generalized implementation of the Louvain algorithm, in the sense that multiple definitions of modularity are possible. The quality function we chose is the popular Newman-Girvan modularity (Newman and Girvan 2004).

With respect to the projection, Alzahrani and Horadam (2016) connect two bottom nodes if they have at least one top node in common. The connection weight between two bottom nodes in the projection is set to the number of shared top nodes. In our implementation, we adapted the connection weights in accordance with Stankova et al (2015) as follows: first of all, we assign weights to the top nodes corresponding to the hyperbolic tangent applied to the inverse degree of the top node; next, the connection weight between two bottom nodes in the projection corresponds with the total weight of the shared top nodes. Top nodes having low degrees therefore obtain a higher contribution in the projection (e.g. two users making a transaction to a local book store are assumed to be more closely connected to each other than two users making a transaction to a large retail store).
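The weighting scheme just described can be sketched as follows. `project_bigraph` is a hypothetical helper operating on a toy adjacency structure (bottom node to set of top nodes), not the toolbox code used in the paper:

```python
import math
from collections import defaultdict
from itertools import combinations

def project_bigraph(adj):
    """Project a bigraph onto a weighted unigraph of bottom nodes, with
    top-node weights tanh(1/degree) in the spirit of Stankova et al. (2015).
    `adj` maps each bottom node to its set of top nodes; returns a dict
    {(i, j): w_ij} of edges between bottom nodes."""
    top_deg = defaultdict(int)
    for tops in adj.values():
        for t in tops:
            top_deg[t] += 1
    # hyperbolic tangent of the inverse top-node degree: rare top nodes
    # (e.g. the local book store) contribute more than popular ones
    w_top = {t: math.tanh(1.0 / d) for t, d in top_deg.items()}
    edges = {}
    for i, j in combinations(sorted(adj), 2):
        shared = adj[i] & adj[j]
        if shared:  # connection weight = total weight of the shared top nodes
            edges[(i, j)] = sum(w_top[t] for t in shared)
    return edges
```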

In our CBU algorithm (see Algorithm 2), after clustering the bigraph containing exclusively majority class instances, we randomly select an equal amount of majority instances from each community to target the within-class imbalance problem. In the rare situation that the number of obtained clusters exceeds the required amount of majority instances (Nr_retain), we sort the communities according to a user-specified parameter Clust_opt and randomly select 1 instance from the first Nr_retain clusters. The parameter Clust_opt can take the following values:

– C_Smallest, where we sort clusters in ascending order of their size
– C_Largest, where we sort clusters in descending order of their size

Note that we randomly select majority class instances from each cluster. Yen and Lee (2009) found a random selection strategy after clustering to be superior to informed approaches based on distance.

10 Note that they also made use of the flow-based algorithm Infomap (Rosvall and Bergstrom 2008), which shows excellent results on the LFR benchmark.


Algorithm 2 CBU pseudo-code implementation for behaviour data

Input: X_min, X_maj, β_u, Clust_opt
a) Cluster the majority class instances X_maj:
– Assign weights to each top node, corresponding with the hyperbolic tangent applied to the inverse of the node's degree
– Project the bigraph X_maj to a weighted unigraph consisting of bottom node majority class instances. The weight w_ij between majority class instances i and j corresponds with the total weight of the shared top nodes
– Apply the Louvain algorithm (Blondel et al 2008) on the projected unigraph to partition the majority class instances into clusters
b) Select majority class instances:
Nr_rem ← ⌊(|X_maj| − |X_min|) × β_u⌋    (see Equation (2))
Nr_retain ← |X_maj| − Nr_rem
if Nr_retain < |Clust| then
    – Sort clusters according to Clust_opt
    – Randomly select 1 instance from the first Nr_retain clusters
else
    – Randomly select ⌈Nr_retain / |Clust|⌉ majority class instances from each cluster
    – Randomly discard instances from the previous step until its size corresponds with Nr_retain
end if
c) Return the new training set consisting of X_min and the selected majority class instances from step b
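Step b of CBU can be sketched as below, taking the Louvain partition as given (a list of clusters, each a list of majority instances). When a cluster is smaller than the per-cluster quota, we sample the whole cluster; the pseudo-code does not spell this case out, so that choice is our assumption:

```python
import random

def cbu_select(clusters, x_min_size, beta_u, clust_opt="C_Smallest", seed=0):
    """Equal-per-cluster selection of majority instances (step b of CBU)."""
    rng = random.Random(seed)
    n_maj = sum(len(c) for c in clusters)
    nr_rem = int((n_maj - x_min_size) * beta_u)   # Eq. (2)
    nr_retain = n_maj - nr_rem
    if nr_retain < len(clusters):
        # sort clusters by Clust_opt, take 1 instance from the first Nr_retain
        ordered = sorted(clusters, key=len,
                         reverse=(clust_opt == "C_Largest"))
        return [rng.choice(c) for c in ordered[:nr_retain]]
    per = -(-nr_retain // len(clusters))          # ceil(Nr_retain / |Clust|)
    picked = []
    for c in clusters:
        picked.extend(rng.sample(c, min(per, len(c))))
    rng.shuffle(picked)
    return picked[:nr_retain]                     # discard the surplus
```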

3.3 Boosting, cost-sensitive learning and EasyEnsemble

3.3.1 AdaBoost

The AdaBoost algorithm (Schapire and Singer 1999; Schapire 1999) has been designed from the perspective of improving the performance of a weak learner so that it achieves accuracies comparable with a strong learning algorithm. Fundamental to the idea of boosting is to maintain a weight distribution over the training set. In each boosting iteration, the weights of wrongly classified instances are increased, so that the underlying weak learner puts more emphasis on these hard examples. In our implementation we will consider using a SVM as base learner. SVMs are generally regarded as strong learners. The studies of Wickramaratna et al (2001) and García and Lozano (2007) note that using a strong learner usually results in performance degradation during the boosting process. The RBF-kernel SVM classifier (Wickramaratna et al 2001; Li et al 2008) is used as the underlying classifier to prove their point. In our study, we will employ a linear SVM, which can be considered a weaker version compared to the RBF-kernel. Furthermore, the regularization parameter C can be viewed as a "weakness" indicator.11 Lowering the C-value results in weaker learners, as can be seen from the goal function of the SVM optimization problem (see equation (1)). We will come back to this point in Section 4.5.

11 The distinction between weak/strong learners is loosely 'defined' in Schapire (1999). A weak learner corresponds with a hypothesis that performs just slightly better than random guessing. A strong learner is able to generate a hypothesis with an arbitrarily low error rate, given enough data. We adopt these definitions but consider the distinction between weak/strong based on training set error. In a SVM context it is quite typical that error levels on training data drop with increasing C-values (Suykens et al 2002). A learner that is 'too strong' means that even though its performance on training data is very high, it fails to generalize well and the test set error increases due to overfitting.


The boosting algorithm with underlying SVM is presented in Algorithm 3. This algorithm closely follows the original boosting implementation presented in Schapire and Singer (1999) and requires each learner to output confidence-rated predictions in the interval [−1,1]. Since the SVM outputs real-valued scores, we apply the procedure of Platt (1999) to transform these scores into probability estimates (which can easily be translated into confidence-rated predictions). A logistic regression (LR) model (Ng and Jordan 2002) is trained using the SVM scores as input and the corresponding labels as output. Note that the same data that are used to construct the linear SVM can be used to estimate the LR model; see Platt (1999) for a motivation.
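The score-to-confidence step can be sketched as below. Platt (1999) fits the sigmoid with a Newton-type solver and regularized targets; to keep this sketch dependency-free we use plain gradient descent on a univariate logistic regression, which is our simplification, not the paper's implementation:

```python
import math

def sigmoid(z):
    # numerically stable logistic function
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)

def fit_calibration(scores, labels, iters=500, lr=0.5):
    """Fit P(y=1 | s) = sigmoid(a*s + b) on SVM scores, Platt-style."""
    a, b, n = 1.0, 0.0, len(scores)
    for _ in range(iters):
        ga = gb = 0.0
        for s, y in zip(scores, labels):
            err = sigmoid(a * s + b) - (1.0 if y == 1 else 0.0)
            ga += err * s / n
            gb += err / n
        a, b = a - lr * ga, b - lr * gb
    return a, b

def confidence(score, a, b):
    # confidence-rated prediction in [-1, 1], as Algorithm 3 requires
    return 2.0 * sigmoid(a * score + b) - 1.0
```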

The boosting algorithm requires the weak learner to be trained using a distribution D_t. One could sample from this distribution to generate (unweighted) examples and train a SVM next. We have instead chosen to include weights in the SVM goal function formulation (this requires an extension of the LIBLINEAR package):

    min_{w,b,ξ_i}  (w^T w)/2 + C Σ_{i=1..m} weight_i ξ_i    (3)

The weights weight_i are set according to the weight distribution D_t(i) in each round of boosting. Note that the C-value that is used to train the SVM model will be divided by mean(weight_i). This normalization allows for a fair comparison between the weighted and unweighted SVM versions (e.g. in the first round of boosting, weight_i = 1/m for all instances; by multiplying the C-value with m, this corresponds to solving the unweighted problem with the same C-value).
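The normalization just described can be written down directly; `per_instance_costs` is an illustrative helper name:

```python
def per_instance_costs(C, weights):
    """Costs fed to the weighted SVM of Eq. (3): C is divided by the mean
    weight, so in round one (weight_i = 1/m) the problem coincides with the
    unweighted SVM at the same C-value."""
    mean_w = sum(weights) / len(weights)
    return [C / mean_w * w for w in weights]
```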

We introduce an additional parameter µ, called the weight percentage, with values in [0,100], in the boosting algorithm of Algorithm 3. This parameter controls the amount of training data that is used to construct the SVM model and subsequent LR model. We sort the original training data in descending order according to the distribution D_t. Next, we form a new training set of minimal cardinality by including points from the original sorted training data until the total weight is higher than the weight percentage µ/100. This way, the newly formed training data will contain only the part of the original training data that carries the most weight. This partial set will then be used to construct a weighted SVM model (according to equation (3), with updated distribution for this set) and a subsequent LR model. The idea of using a partial dataset to construct the base learner not only reduces training times but also weakens the learner (García and Lozano 2007).
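The µ-selection step can be sketched as follows, assuming D is the current distribution over training instances (µ = 100 keeps the full training set, since the cumulative weight never exceeds 1):

```python
def select_by_weight(D, mu):
    """Indices of the minimal subset carrying more than mu/100 of the total
    weight: sort by D descending and add instances until the cumulative
    weight exceeds the threshold."""
    order = sorted(range(len(D)), key=lambda i: D[i], reverse=True)
    chosen, total = [], 0.0
    for i in order:
        if total > mu / 100.0:
            break
        chosen.append(i)
        total += D[i]
    return chosen
```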

In Algorithm 3, we have an explicit check to verify whether r_AB = 1. In this case, the SVM model outputs scores that allow for a perfect classification on the training set. The subsequent LR model will find a threshold and output a value of −1 if the SVM score is lower than this threshold (a value of +1 if the score is higher). In this situation, the training data would be perfectly classified. In our implementation we attempt to avoid this kind of behaviour, because it can lead to overfitting, meaning it might pinpoint the wrong threshold and make too drastic decisions (we lose the meaning of confidence). The check whether r_AB ≤ 0 verifies if the currently boosted model is performing worse than random (such a model would have an r_AB-value of 0). Obviously, if the model performs worse than random, we quit the boosting process. During the first round of boosting, we perform similar checks that are not explicitly indicated in Algorithm 3: in the case where r_AB = 1, we output the SVM scores instead of the LR binary values; when r_AB ≤ 0, we quit the boosting process and output the LR scores.

Algorithm 3 AdaBoost with a SVM-LR combination as a base learner

Input: (X, Y) = {(x_1, y_1), …, (x_m, y_m)}, C, T, µ
Initialize distribution D_1(i) = 1/m
for t = 1 to T do
    – Train the weak learner using distribution D_t. The weak learner consists of a weighted linear SVM and a LR model, trained with weight-percentage µ of D_t:

        h_t ← Train_WeakLearner(X, Y, D_t, C, µ)

    – Compute the weighted confidence r_AB on the training data:

        r_AB ← Σ_{i=1..m} D_t(i) y_i h_t(x_i)

      If (r_AB = 1 or r_AB ≤ 0) then α_t ← 0 and stop the boosting process
    – Choose α_t ∈ R:

        α_t ← (1/2) log((1 + r_AB) / (1 − r_AB))

    – Update the distribution:

        D_{t+1}(i) ← D_t(i) exp(−α_t y_i h_t(x_i)) / Z_t

      where Z_t is a normalization factor (chosen so that D_{t+1} will be a distribution)
end for
Output the final hypothesis (the output score is the term contained in the sign function):

    H(x) = sign( Σ_{t=1..T} α̃_t h_t(x) )    with α̃_t = α_t / Σ_{i=1..T} α_i
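One round of the loop above, stripped of the SVM-LR training itself, can be sketched as follows (the stopping checks mirror the bullet for r_AB in Algorithm 3):

```python
import math

def adaboost_round(D, h, y):
    """One round of real-valued AdaBoost (Schapire and Singer 1999) with
    confidence-rated predictions h[i] in [-1, 1] and labels y[i] in {-1, 1}.
    Returns (alpha_t, D_next), or (None, D) when the stopping checks fire."""
    r = sum(d * yi * hi for d, yi, hi in zip(D, y, h))
    if r >= 1.0 or r <= 0.0:          # perfect fit, or worse than random
        return None, D
    alpha = 0.5 * math.log((1.0 + r) / (1.0 - r))
    D_new = [d * math.exp(-alpha * yi * hi) for d, yi, hi in zip(D, y, h)]
    Z = sum(D_new)                    # normalize so D_{t+1} is a distribution
    return alpha, [d / Z for d in D_new]
```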

3.3.2 AdaCost

The AdaCost algorithm (Fan et al 1999) is a variant of cost-sensitive learning where misclassification costs are introduced in the weight-update formula of AdaBoost. The cost-sensitive update rule increases the weights of costly misclassified instances more aggressively and decreases the weights of costly correct classifications more conservatively. Each instance is given a misclassification cost c_i, where we chose to put c_i = 1 for positive (minority) instances and c_i = 1/R for negative (majority) instances; R is a user-defined value that allows one to put more emphasis on the minority class. The implementation of AdaCost is similar to Algorithm 3, yet there are a few differences (Fan et al 1999): firstly, the initial distribution is chosen as D_1(i) = c_i / Σ_{j=1..m} c_j; secondly, the weight update rule is given by D_{t+1}(i) = D_t(i) exp(−α_t y_i h_t(x_i) β(i)) / Z_t, where β(i) = −0.5 sign(y_i h_t(x_i)) c_i + 0.5 is a cost-adjustment function; finally, the choice of α_t is given by α_t = (1/2) log((1 + r_AC) / (1 − r_AC)), where r_AC = Σ_{i=1..m} D_t(i) y_i h_t(x_i) β(i). Note that the checks to stop the boosting process prematurely (see the second bullet in Algorithm 3) are still based on the r-value obtained from AdaBoost (r_AB). This is because β(i) ∈ [0,1].
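Analogous to the AdaBoost round, the cost-adjusted update can be sketched as below (the r_AB-based stopping checks are omitted here for brevity):

```python
import math

def adacost_round(D, h, y, c):
    """One AdaCost round (Fan et al. 1999). The cost-adjustment function
    beta(i) in [0, 1] raises the weight of costly mistakes and tempers the
    weight decrease for costly correct classifications."""
    def beta(yi, hi, ci):
        s = 1.0 if yi * hi >= 0 else -1.0     # sign(y_i h_t(x_i))
        return -0.5 * s * ci + 0.5
    r_ac = sum(d * yi * hi * beta(yi, hi, ci)
               for d, yi, hi, ci in zip(D, y, h, c))
    alpha = 0.5 * math.log((1.0 + r_ac) / (1.0 - r_ac))
    D_new = [d * math.exp(-alpha * yi * hi * beta(yi, hi, ci))
             for d, yi, hi, ci in zip(D, y, h, c)]
    Z = sum(D_new)
    return alpha, [d / Z for d in D_new]
```

With all instances classified correctly, the high-cost (minority, c_i = 1) instances keep their weight while the low-cost ones shrink, exactly the conservative decrease described above.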

In the first boosting round of AdaCost, the weighted SVM formulation (3) is equivalent to solving a SVM formulation with the following goal function:

    min_{w,b,ξ_i}  (w^T w)/2 + C^+ Σ_{i|y_i=1}^{m^+} ξ_i + C^− Σ_{i|y_i=−1}^{m^−} ξ_i    (4)

where C^+ / C^− = R. This can be seen as a cost-sensitive version of a SVM, an idea that has initially been proposed by Veropoulos et al (1999).

3.3.3 EasyEnsemble

One of the disadvantages of the random undersampling method is the fact that we are discarding potentially valuable information. EasyEnsemble (Liu et al 2009) is a method that combines several balanced subsets (S in total), containing randomly selected majority class instances together with all minority examples. Each subset contains the same amount of instances from both classes and is fed to the boosting algorithm presented in Algorithm 3. Afterwards, the weak learners h_{s,t} of each subset s are simply combined to form the final ensemble:

    H(x) = sign( Σ_{s=1..S} Σ_{t=1..T} α̃_{s,t} h_{s,t}(x) )    with s = 1, …, S and t = 1, …, T    (5)

It is clear that this technique benefits from a combination of bagging and boosting (Liu et al 2009). Note that we apply the same boosting algorithm as previously described to each of the balanced subsets. However, when r_AB = 1 in the first round of boosting, we quit the boosting process, put α_1 = 1 and continue to use the trained LR model in the final ensemble. It was previously noted that this can cause overfitting, in the sense that it can pinpoint a wrong threshold. However, the LR model still contains information, and the combination thereof with the models obtained from the other subsets remains valuable (as we have noted from initial experiments, by comparing the situations where we include or reject those subsets).

There are a few subtle though important differences with respect to the experiments performed by Liu et al (2009). First of all, the authors use a CART algorithm (Breiman et al 1984) as base learner; we employ a linear SVM with subsequent LR as a weak learner. To our knowledge, this combination has not been proposed elsewhere, yet it proves to be very efficient in this setting. Secondly, their methodology employs a discrete version of AdaBoost, where each weak learner outputs binary values in {−1,1}. As already stated, we make use of an improved version of AdaBoost that relies on confidence-rated predictions (Schapire and Singer 1999). Finally, the authors reported performances with fixed levels of S = 4 and T = 10; we will investigate the effect of varying S and T levels.
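The EasyEnsemble driver itself can be sketched in a few lines. `boost` stands in for Algorithm 3 and must return a list of (α, h) pairs for the balanced set it is given; the helper name and signatures are ours:

```python
import random

def easy_ensemble(X_maj, X_min, S, boost, seed=0):
    """Sketch of EasyEnsemble (Liu et al. 2009): draw S balanced random
    subsets of the majority class, boost each one, and pool all weak
    hypotheses into the single ensemble of Eq. (5)."""
    rng = random.Random(seed)
    ensemble = []
    for _ in range(S):
        subset = rng.sample(X_maj, len(X_min))  # |subset| = |X_min| majority
        ensemble.extend(boost(subset + X_min))  # ... plus all minority points

    def H(x):
        # Eq. (5): sign(H(x)) gives the predicted class, H(x) the score
        return sum(alpha * h(x) for alpha, h in ensemble)
    return H
```

Because each subset is only twice the minority class size and the S boosting runs are independent, the scheme parallelizes trivially, which is the speed argument made for the method.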


4 Results and discussion

4.1 Datasets

Stankova et al (2015) provide the first large collection of benchmark behaviour datasets for classification. In our experiments we make use of these data sources and extend this repository with two additional datasets. Each of these datasets shows a bipartite structure with a clear target variable to predict. We refer to this study and the next paragraph for a short description of the available data resources. In this paragraph we indicate why we have chosen to include or reject certain data sources from the aforementioned study. The available datasets can be divided into small, medium and large datasets based on the number of instances and the number of features present. The Norwegian companies and Reality Mining datasets comprise the small datasets, since they contain only a few hundreds of instances or features and are therefore regarded as impractical for our purposes. The MovieLens, Yahoo, TaFeng, Book-Crossing and LibimSeTi datasets belong to the medium-sized datasets, each containing a few thousands up to a few hundreds of thousands of instances and features. All of these datasets are included in our study. The large datasets, containing hundreds of thousands up to millions of instances and features, are the Flickr and Kdd databases.12 Other large proprietary data sources, not included in Stankova et al (2015), are the corporate residence fraud (CRF) and banking (Bank) datasets, which arise from real-life application domains with intrinsic imbalance.

To summarize, we have gathered datasets containing such fine-grained behaviour data from a wide variety of application domains. The MovieLens datasets, for which we are predicting the gender13 (Mov G) or the genre thriller14 (Mov Th), provide data on which films each user has rated. The Yahoo movies15 dataset has a similar structure, where the age of each user, Yahoo A (above or below average), or the gender, Yahoo G, is being predicted. The TaFeng dataset16 contains data on shopping behaviour, where age (below or above average) is being predicted based upon which products are purchased. In the book-crossing (Book) dataset (Ziegler et al 2005), users rate books and the age of the user (above or below average) is being predicted. LibimSeTi (LST) contains data from a dating site (Brozovsky and Petricek 2007), where users rate each other's profiles and gender is being predicted. In the advertisement (Adver) dataset (Lichman 2013), we try to predict whether a url is an advertisement, based on a large variety of binary features of the url. Note that this dataset does not arise from the behaviour of entities, yet it still has a high-dimensional and sparse representation. The Flickr dataset (Cha et al 2009) contains pictures being marked by users as favorite, and we predict the number of comments on each picture (below

12 Flickr and KDD will be excluded in the comparative study of Section 4.6. This is because some methods are too computationally intensive (especially in combination with the large number of possible parameter combinations) to be applied on these very large data sources. Furthermore, our statistical evidence is already sufficiently strong to conclude significance without these datasets. Having said this, these data sources will be included in the analysis of Section 5.

13 MovieLens 1M dataset, from http://grouplens.org/datasets/movielens
14 MovieLens 10M dataset, from http://grouplens.org/datasets/movielens
15 https://webscope.sandbox.yahoo.com
16 http://www.bigdatalab.ac.cn/benchmark/bm/dd?data=Ta-Feng


or above average). In the Kdd cup data, performance of a student on a test is being predicted, based on artificially created binary features (Yu et al 2010). The corporate residence fraud dataset (CRF) (Junque de Fortuny et al 2014b) contains data on foreign companies making transactions with specific Belgian companies, where we try to predict whether the foreign company commits residence fraud (a type of fiscal fraud). Finally, the banking dataset (Bank) (Martens et al 2016) contains detailed behaviour on consumers making transactions with merchants or other persons, to predict interest in a pension fund product. Some characteristics of these datasets can be found in Table 2. The features column only shows the number of active features.17

Table 2 Behaviour data characteristics. The final column shows the imbalance ratio p, defined as the ratio of the number of minority class instances to the amount of majority class instances in the training set, expressed as a percentage. See Section 4.2 for details regarding p.

Name      |X_maj|    |X_min|    Features    p = 100 × |X_min|_train / |X_maj|_train
Mov G     4331       1709       3706        1 & 25
Mov Th    10546      131        69878       1.24 (p = [])
Yahoo A   6030       1612       11915       1 & 25
Yahoo G   5436       2206       11915       1 & 25
TaFeng    17330      14310      23719       1 & 25
Book      42900      18858      282973      1 & 25
LST       59702      60145      166353      1
Adver     2792       457        1555        16.38 (p = []) & 1
CRF       869071     62         108753      0.0072 (p = [])
Bank      1193619    11107      3139570     0.93 (p = [])
Flickr    8166814    3028330    497472      0.1
Kdd       7171885    1235867    19306083    0.5

4.2 Methodology

Regarding the experiments performed in the upcoming sections, we applied a tenfold cross-validation procedure. Each of the folds contains 80% training data, 10% validation data and 10% test data. Note that these percentages are valid for both the majority class and the minority class (stratified sampling). As can be seen from Table 2, some datasets are balanced in nature. We created artificial imbalance for these datasets by removing minority class instances from the initial training set according to a user-defined parameter p. We ensured that the amount of minority training instances corresponds to p percent of the majority class training size: |X_min|_train = (p/100) × |X_maj|_train. As an example, say that we are using the Book dataset with p = 25. In that case, we know that the majority class contains 34320 training instances (80% of |X_maj| = 42900). The minority training data would contain 8580 instances (25% of 34320). When the dataset is already imbalanced, we define p = [], which means that no downsampling of the minority class training data is performed. Note that the validation and test data are left untouched.18

17 Active features represent features that are present for at least one instance in the dataset. A non-active feature corresponds with a column of zeros in the matrix representation and would not contribute to the model.
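The worked example can be checked with a two-line helper (80/10/10 split as described; the function name is ours):

```python
def train_sizes(n_maj, p):
    """Training-set sizes under the 80/10/10 split with artificial imbalance
    ratio p (in percent): returns (|X_maj|_train, |X_min|_train)."""
    n_maj_train = int(0.8 * n_maj)
    return n_maj_train, int(p / 100.0 * n_maj_train)
```

For the Book dataset with p = 25, `train_sizes(42900, 25)` reproduces the 34320 majority and 8580 minority training instances from the text.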

The methods detailed in Section 3 are applied on the training data. For both the under- and oversampling approaches, a linear SVM is trained on the newly created balanced training data, with regularization parameter C taking the values

    C = [10^{-7}, 10^{-5}, 10^{-3}, 10^{-1}, 10^{0}]

The validation data are used for parameter tuning purposes. The test data allow us to obtain the generalization performance. The results reported show the average over ten folds. In the remaining paragraphs of this section, we describe the various parameter settings used in our experiments and give a brief overview of the parameters occurring in each method.

Considering the oversampling techniques, the parameter settings are as follows:

    β = [0, 1/3, 2/3, 1]
    prior_opt = {FlipCoin, Reverse Prior}
    sim_measure = {Cosine, Jaccard}
    K = K̃ = [10^0, 10^1, 10^2, |X_min|_train]

We didn't include the "Prior" option, due to initial experiments showing a lower performance in comparison with the other options. This can be explained by the low priors occurring in each column, resulting in synthetic samples that mainly show zeros in 0-1 match situations. The oversampling with replacement method (OSR) only uses the β parameter, SMOTE uses all but the K parameter, and ADASYN uses all of the parameters listed above.

With respect to the undersampling techniques, the following parameter settings are used:

    β_u = [0, 1/4, 1/2, 3/4, 1]
    sim_measure = {Cosine, Jaccard}
    K = [10^0, 10^1, 10^2, |X_min|_train]
    Clust_opt = {C_Smallest, C_Largest}

The random undersampling (RUS) technique only uses the β_u parameter. The second set of methods, "Closest tot sim" and "Farthest tot sim", uses β_u and sim_measure. The third set of techniques, "Closest Knn" and "Farthest Knn", makes use of all the parameters listed above except for Clust_opt. The final approach, CBU, employs β_u and Clust_opt.

The boosting variants (AdaBoost, AdaCost and EasyEnsemble) presented in Section 3.3 make use of the following settings:

    T = 30
    µ = [100, 75]
    C = [10^{-7}, 10^{-5}, 10^{-3}, 10^{-1}]
    R = [2, 8, R_L], where R_L = |X_maj|_train / |X_min|_train
    S = [5, 10, 15]

The AdaBoost algorithm includes the T, µ and C parameters. AdaCost additionally uses cost-ratios R. We have chosen a range of values because misclassification costs are unknown for many business applications (He and Garcia 2009; Fan et al 1999; Sun et al 2007). The final value, R_L, seems to be a popular choice (Akbani et al 2004; Luts et al 2010), because the total weight on the majority class then balances with the total weight on the minority class. The final method, EasyEnsemble, uses S subsets in addition to the parameters previously mentioned for AdaBoost. Note that we consider the boosting iteration round t ∈ [1,T] as a tunable parameter.19

18 This means that if we start from a balanced set, only the training data will show artificial imbalance according to the imbalance ratio p; the validation and test data would remain balanced. Since AUC (and some other metrics) is independent of class skew, it would be unwise to make these sets imbalanced as well, because that would lead to discarding minority class instances that are relevant for performance assessment.

4.3 Oversampling

The oversampling techniques presented in Section 3.1 are applied to each of the datasets from Table 2 with varying imbalance levels (by varying the p-parameter). The experiments we conducted use the following methodology: for each dataset, we apply the oversampling techniques with all possible parameter combinations on the training data to create newly balanced datasets, after which linear SVMs are trained. The optimal parameter combination with respect to each imbalance ratio β is selected based on validation set AUC-performance. With these parameters, the AUC on the test data is obtained. Results show the average over ten folds.

The results on four arbitrarily selected datasets are shown in Table 3. Full results on each of the data sources can be found in Appendix A, Table 11. From these tables we can conclude that performance generally increases with growing β-values. It is more correct to say that performance keeps improving with growing β-levels until an optimal point β* is reached. Increasing the balance level after this optimal value will cause only small fluctuations with respect to the optimal performance. Traditional studies dealing with dense, low-dimensional data note that the OSR method can suffer from overfitting, as already mentioned in Section 3.1. It is interesting to see that in this sparse, high-dimensional setting, this effect doesn't seem to occur.20 Furthermore, the computationally expensive synthetic sampling approaches do not seem to improve performance over the simpler OSR method. This can be explained by the fact that many instances have a very limited amount of non-zero elements (Junque de Fortuny et al 2014a; Stankova et al 2015), causing the synthetic sampling procedures to be limited in the amount of unique new samples they can produce. Note also that synthetic sampling procedures have historically been designed to overcome the overfitting behaviour of OSR, which no longer seems to apply here.

19 We sometimes make use of T as a symbol to indicate the boosting iteration number instead of t.

20 The SVM regularization parameter C controls for overfitting. Additionally, the influence of majority class noise/outliers (see Section 3.2) on the learned hyperplane decreases as we oversample the minority class. This hyperplane is more sensitive towards minority instances, and a change in its direction/orientation towards these majority class noise/outliers will be more heavily penalized.


Table 3 Oversampling experiments. Results show the average tenfold AUC test set performance with respect to increasing β-values (β = [β1, β2, β3, β4] = [0, 1/3, 2/3, 1]). Additionally, standard deviations are shown between brackets. Optimal performances for each method are highlighted in boldface. See Section 4.2 for parameter settings.

Mov Th (p = [])
            β1            β2            β3            β4
OSR         79.77 (5.33)  85.3 (4.66)   83.16 (4.5)   84.59 (5.69)
SMOTE       79.77 (5.33)  84.18 (6.51)  85.58 (5.97)  84.33 (5.75)
ADASYN      79.77 (5.33)  84.11 (6.77)  85.86 (6.06)  85.36 (5.13)

Yahoo G (p = 25)
            β1            β2            β3            β4
OSR         78.82 (1.39)  78.78 (2.02)  78.74 (1.86)  78.35 (1.47)
SMOTE       78.82 (1.39)  79.23 (1.57)  79.1 (1.2)    79.03 (1.89)
ADASYN      78.82 (1.39)  79.12 (1.43)  79.23 (1.37)  79.51 (2.01)

TaFeng (p = 25)
            β1            β2            β3            β4
OSR         66.94 (1.34)  68.84 (1.21)  66.99 (1.42)  67.7 (1.41)
SMOTE       66.94 (1.34)  68.47 (1.5)   67.07 (1.15)  66.65 (0.81)
ADASYN      66.94 (1.34)  68.62 (1.38)  67.85 (1.6)   66.91 (1.39)

Book (p = 25)
            β1            β2            β3            β4
OSR         60.08 (0.71)  61.05 (0.94)  61.82 (1.12)  64.62 (0.57)
SMOTE       60.08 (0.71)  62.6 (0.73)   60.95 (0.68)  63 (0.8)
ADASYN      60.08 (0.71)  62.33 (0.73)  60.77 (0.85)  63.04 (0.58)

4.4 Undersampling

Regarding the undersampling techniques, we employ a similar experimental set-up as in the previous section: firstly, train SVMs on the undersampled training data with all possible parameter combinations; secondly, choose a suitable parameter combination based on validation set AUC-performance; and finally, obtain the AUC-performance on the test set. The results are averaged across ten folds.
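The distance-based selection behind the "Closest"/"Farthest" variants can be sketched as follows (a simplification of the Section 3.2 procedures: we score each majority instance by its mean dot-product similarity to the minority class and drop either end of the ranking; the function name and scoring choice are ours):

```python
import numpy as np

def undersample_by_similarity(X_maj, X_min, n_remove, mode="closest"):
    """Rank majority rows by mean similarity to the minority class and
    remove the `n_remove` closest or farthest ones."""
    sim = X_maj @ X_min.mean(axis=0)          # mean similarity per majority row
    order = np.argsort(sim)                   # ascending: farthest first
    drop = order[-n_remove:] if mode == "closest" else order[:n_remove]
    keep = np.setdiff1d(np.arange(len(X_maj)), drop)
    return X_maj[keep]

rng = np.random.default_rng(2)
X_min = (rng.random((20, 50)) < 0.2).astype(float)
X_maj = (rng.random((200, 50)) < 0.1).astype(float)
kept_close = undersample_by_similarity(X_maj, X_min, 50, mode="closest")
kept_far = undersample_by_similarity(X_maj, X_min, 50, mode="farthest")
print(kept_close.shape, kept_far.shape)   # (150, 50) (150, 50)
```

"Closest" keeps the borderline-distant majority points after removing the most minority-like ones; "Farthest" does the opposite and thereby tends to remove odd-behaving majority instances.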

Before going into detail on the experimental findings, we give a short note on the effect of noise/outliers on SVM performance. Consider the following imaginary example: say that a majority class female is rating a lot of action films and thrillers. We can consider this an outlier, since most of the females are rating romantic or drama films; this specific female has far more in common with the minority class males who also rate action films and thrillers. Outliers are therefore instances showing odd behaviour, while noise refers to wrongfully labelled instances. The effect of noise/outliers on SVM performance can be severe. Indeed, many of the instances in the dataset have a support value (dual variable α_i) of 0. Instances contributing to the model (a non-zero support value) are examples close to the borderline (Suykens et al 2002) and noise/outliers (as can be derived from the KKT dual-complementarity conditions21).

21 If α_i = 0, then y_i(w^T x_i + b) ≥ 1. For noise/outliers the term y_i(w^T x_i + b) is negative, hence α_i ≠ 0.

24 Jellis Vanhoeyveld David Martens

With respect to noise/outliers, the support value can reach a maximal value of C, the chosen regularization parameter in the SVM formulation. It should be clear that lowering the C-value leads to models that are less sensitive to noise/outliers.

The experimental outcomes of the various undersampling processes are shown in Table 4 for the same four previously selected datasets. We refer to Appendix B, Table 12 for results on the entire data repository. When removing only a limited amount of majority class instances (βu = βu2 = 1/4), we observed that in 12 out of 16 datasets the "Farthest Knn" method outperforms the "Closest Knn" technique; the four remaining datasets show comparable performances.22 This finding shows that the "Farthest" method is very suitable in removing majority class noise/outliers and empirically shows their performance degrading effect. With higher undersampling rates (βu = 1), the "Closest Knn" method generally achieves higher performances than the "Farthest Knn" technique, though the results are less clear. We observed that only 9 out of 16 datasets indicate "Closest" as a clear winner, with 3 tie situations and 4 situations where "Farthest" shows the best results. Intuitively, we would expect the "Closest" method to perform better, since we are keeping the most informative instances (examples close to the minority class). However, this technique will also emphasize noise/outliers, and this is the main reason why the results are less significant. One can also see this from the fact that the RUS method shows far better results in comparison to the aforementioned techniques when βu = 1, an observation that is valid for all datasets. The RUS method is efficient in realizing noise/outlier reduction effects, since each SVM is trained with only a fraction of these types of instances. It is remarkable that in 9 out of 16 datasets the RUS method with βu = 1 outperforms the baseline model (βu = 0); in 4 datasets we observed equal performance, and 3 losses with respect to the baseline. RUS shows two attractive features: majority class noise/outlier removal and its ability to put more emphasis on the minority class instances by undersampling the majority class. A general downside of the approach is the information loss, and this effect causes the 3 losses to the baseline. Usually, though, the attractive features of RUS dominate the latter effect, and this indicates that it is definitely not necessary to include all majority class training instances in constructing the predictive model. Apparently, there is a high level of redundancy which can be exploited to construct efficient hypotheses.

In the next paragraph, we compare the cluster-based undersampling (CBU) approach to the random undersampling (RUS) technique. Before proceeding, we should note that Table 12 does not show results on the LST, CRF and Bank datasets, due to the fact that the projected unigraph was too large to fit in memory (larger than 16 GB). A limited number of features (top nodes) does seem to be active for a relatively large number of instances (bottom nodes). All these instances will be connected in the unigraph projection, resulting in relatively high degrees for the bottom nodes. A possible solution to circumvent this problem might be to cut the weakest edges in the projected unigraph.

If we apply a limited amount of undersampling (βu = βu2 = 1/4), the CBU technique outperforms RUS in 8 out of 13 datasets (4 losses and 1 tie with respect to RUS). On the highly imbalanced datasets (p = 1 or p = []), the CBU method wins

22 A tie occurs in the situation where the absolute difference in AUC is smaller than or equal to 0.5.


in 8 out of 8 cases. We can therefore conclude that CBU outperforms RUS at low undersampling rates. This indeed shows that handling the within-class imbalance can be beneficial in this environment. If we make a comparison at the highest undersampling rate, βu = βu5 = 1, the results are completely opposite: CBU wins in only 1 case, with 9 losses with respect to RUS and 3 ties. On the highly imbalanced datasets, CBU outperforms RUS for 1 dataset (with 4 losses and 3 ties). It is clear that RUS is the preferred technique in case of high undersampling rates. The reason is that CBU emphasizes the small disjuncts and basically fails to adequately pick up on the common, larger communities. Furthermore, these small disjuncts might actually correspond with majority class noise/outliers, and CBU is focusing on these types of instances. On the overall level, where we consider all undersampling rates (except βu = 0), CBU wins in 5 out of 13 datasets, RUS wins in 4 cases, and there are 4 tie situations (yet RUS did outperform CBU in all 4 cases). Both methods seem to be competitive with one another across a wide range of imbalance levels. If we pay particular attention to the highly imbalanced sets, CBU wins in 5 cases and there are 3 tie situations on the remaining sets (yet RUS did outperform CBU in all 3 cases). CBU seems to be the method of choice (with respect to RUS) when the datasets show high imbalance levels.
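A cluster-based undersampling step in the spirit of CBU can be sketched as follows (our own simplification: k-means on the majority class, then sampling evenly from every cluster so that small disjuncts stay represented; the actual Section 3.2 method differs in detail):

```python
import numpy as np

def kmeans(X, k, rng, n_iter=20):
    """Plain Lloyd's algorithm; returns a cluster label per row."""
    centers = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(n_iter):
        d = ((X[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
        labels = d.argmin(axis=1)
        for j in range(k):
            if np.any(labels == j):
                centers[j] = X[labels == j].mean(axis=0)
    return labels

def cluster_undersample(X_maj, n_keep, k, rng):
    """Keep roughly n_keep/k majority rows from every cluster."""
    labels = kmeans(X_maj, k, rng)
    per_cluster = n_keep // k
    kept = []
    for j in range(k):
        members = np.flatnonzero(labels == j)
        take = min(per_cluster, len(members))
        kept.extend(rng.choice(members, size=take, replace=False))
    return X_maj[np.array(kept)]

rng = np.random.default_rng(3)
X_maj = rng.random((300, 10))
sample = cluster_undersample(X_maj, n_keep=60, k=6, rng=rng)
print(sample.shape)   # at most 60 rows kept
```

Sampling per cluster rather than uniformly is what lets small majority-class communities survive the undersampling, which, as noted above, helps at low undersampling rates but over-emphasizes small disjuncts at high ones.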

Table 4 Undersampling experiments. Results show the average tenfold AUC test set performance with respect to increasing βu-values (βu = [βu1, βu2, βu3, βu4, βu5] = [0, 1/4, 1/2, 3/4, 1]). Optimal performances (boldface) are highlighted for each imbalance ratio. Additionally, standard deviations are shown between brackets. Cl K represents the "Closest Knn" technique; Cl T represents the "Closest tot sim" technique (similar for Far K and Far T; see Section 3.2). Parameter settings are given in Section 4.2.

Mov Th (p = [])
        βu1          βu2          βu3          βu4          βu5
RUS     79.77 (5.3)  80.32 (5.8)  81.57 (5.5)  81.86 (6.6)  81.26 (6.2)
Cl K    79.77 (5.3)  79.25 (4.5)  78.07 (5)    76.25 (6.5)  62.46 (8.5)
Cl T    79.77 (5.3)  78.4 (4.4)   72.41 (3.5)  64.66 (4.5)  60.37 (7.3)
Far K   79.77 (5.3)  84.54 (5)    83.64 (6.4)  80.02 (7.3)  56.82 (10.3)
Far T   79.77 (5.3)  85.03 (5.7)  82.68 (6.8)  75.61 (9.2)  56.77 (10.9)
CBU     80.11 (5.8)  81.17 (6)    81.08 (6.5)  84.17 (5.1)  80.96 (6.9)

Yahoo G (p = 25)
        βu1          βu2          βu3          βu4          βu5
RUS     78.82 (1.4)  78.91 (1.6)  78.97 (1.6)  78.61 (1.6)  77.82 (2.1)
Cl K    78.82 (1.4)  77.26 (1.5)  72.52 (1.5)  67.86 (2)    65.07 (2.7)
Cl T    78.82 (1.4)  76.83 (1)    71.99 (1.8)  67.15 (2.3)  61.1 (2.7)
Far K   78.82 (1.4)  78.26 (2.2)  74.69 (2.7)  67.22 (2.1)  60.72 (2.3)
Far T   78.82 (1.4)  77.68 (2.6)  72.44 (3)    64.94 (2.4)  59.6 (2)
CBU     75.25 (3.2)  75.22 (2.4)  74.69 (2.3)  73.07 (2.4)  70.69 (2.4)

TaFeng (p = 25)
        βu1          βu2          βu3          βu4          βu5
RUS     66.94 (1.3)  67.44 (1.3)  68.1 (1.4)   68.27 (1.4)  66.13 (1.2)
Cl K    66.94 (1.3)  66.13 (1.4)  63.39 (1.2)  59.83 (1.3)  56.94 (0.7)
Cl T    66.94 (1.3)  66.38 (1.5)  62.89 (1.6)  57.46 (1.3)  54.56 (1.3)
Far K   66.94 (1.3)  68.06 (1.4)  66.43 (1.6)  64.46 (1.5)  63.35 (1.3)
Far T   66.94 (1.3)  64.31 (1.1)  62.69 (1)    61.27 (1.1)  59.03 (1)
CBU     64.81 (1.2)  64.15 (1.1)  64.13 (1.2)  63.88 (0.8)  63.46 (0.8)

Book (p = 25)
        βu1          βu2          βu3          βu4          βu5
RUS     60.08 (0.7)  60.13 (0.6)  60.4 (0.8)   60.33 (0.8)  63.28 (0.8)
Cl K    60.08 (0.7)  59.96 (0.7)  60.13 (0.8)  59.96 (1)    59.28 (0.7)
Cl T    60.08 (0.7)  59.96 (0.7)  60.13 (0.8)  60.29 (0.4)  54.5 (0.9)
Far K   60.08 (0.7)  63.29 (1)    64.19 (0.8)  57.3 (1.1)   55.66 (1.1)
Far T   60.08 (0.7)  62.14 (0.5)  58.27 (0.6)  56.37 (1)    55.66 (1.1)
CBU     54.82 (0.9)  54.67 (0.9)  54.71 (0.9)  54.66 (1)    54.78 (0.9)

4.5 Boosting variants

In this section, the plain AdaBoost (AB) method is compared with the cost-sensitive AdaCost (AC) algorithm, the EasyEnsemble (EE) technique and the baseline (BL) method (where we train a single SVM on the imbalanced training data). AB, AC and EE have an underlying boosting process in common. The AUC-values are stored during each boosting iteration, which allows us to show the performance graphically as a function of the number of iterations t. In EasyEnsemble (EE), we combine the weak learners of each subset by summing their individual contributions. For example, when t = 2 boosting iterations are performed, we evaluate the performance of the combined learner

∑_{s=1}^{S} ∑_{t=1}^{2} α_{s,t} h_{s,t}(x).

Figures 1, 2, 3 and 4 show the average tenfold AUC-performance (with µ = 100%) on the test data. The left-side figures show the AB, AC (with varying R-values) and EE (with varying S-values) performance with respect to the number of boosting iterations; the C-value is tuned according to highest validation set AUC-performance (over all possible boosting iterations). The right-side figures investigate the influence of the C-parameter for AB and EE (with S = 15) and allow us to gain insight on the effect of weak and strong learners on the generalization performance. Results on the remaining datasets are given in Appendix C, Figures 6-17. Note that we only indicate results with weight-percentage µ = 100% (use all instances in the training process); previous experiments (with µ = 75%) showed inferior results. This suggests that it is indeed important to include all minority instances during the training of each weak learner, an observation also made by Liu et al (2009).

When investigating the influence of the regularization parameter C in AdaBoost, we can graphically verify that it is a "weakness" indicator, in the sense that lower C-values correspond to weaker learners. While these low C-value learners have a low performance in the first rounds of boosting, the possible boost in performance is usually much higher compared to higher C-value inducers. Strong learners can already make a suitable distinction between minority and majority class behaviours in the first boosting round; the boosting process will then emphasize the hard to learn instances, among which are noise/outliers. Therefore, too strong learners are not suitable to be used in a boosting process, as already noted by Wickramaratna et al (2001). In Figures 2(b) and 3(b), we observe that the lowest C-values (C = 10^-7, 10^-5) can outperform higher C-values (C = 10^-3, 10^-1). In many cases the AB-process outperforms the baseline (BL); see also the results in Table 5. A combination of several weak learners can often achieve better generalization than a single strong learner.

Inspired by the success of the RUS method with βu = 1 (for reasons indicated in Section 4.4), the EE-technique tries to overcome an obvious downside of this approach: we are removing a lot of majority class instances that can be very informative. Hence, we try to make up for this loss by combining several of these balanced training sets, in the hope that the majority class is more adequately explored. This results in a very effective learner that seems to outperform the AB, AC and BL methods in the majority of cases, as can be seen from these figures. It is remarkable that we don't need to use that many subsets S to achieve high performances: increasing the S-value from 5 to 15 doesn't seem to cause major increases in AUC (see also Table 5). The influence of the C-parameter is similar as described in the previous paragraph.
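The EE combination described above can be sketched as follows (a toy version: decision stumps replace the linear SVMs of the paper so that the example stays self-contained; S balanced subsets are each boosted for T rounds, and the hypotheses are summed as in the formula earlier in this section):

```python
import numpy as np

def fit_stump(X, y, w):
    """Best single-feature threshold classifier under weights w; y in {-1,+1}."""
    best = (0, 0.0, 1, np.inf)                    # (feature, threshold, polarity, error)
    for f in range(X.shape[1]):
        for thr in np.unique(X[:, f]):
            for pol in (1, -1):
                pred = pol * np.where(X[:, f] >= thr, 1, -1)
                err = w[pred != y].sum()
                if err < best[3]:
                    best = (f, thr, pol, err)
    return best

def stump_predict(stump, X):
    f, thr, pol, _ = stump
    return pol * np.where(X[:, f] >= thr, 1, -1)

def adaboost(X, y, T):
    w = np.full(len(y), 1 / len(y))
    model = []
    for _ in range(T):
        stump = fit_stump(X, y, w)
        err = max(stump[3], 1e-10)
        alpha = 0.5 * np.log((1 - err) / err)     # AdaBoost weight for this learner
        w *= np.exp(-alpha * y * stump_predict(stump, X))
        w /= w.sum()
        model.append((alpha, stump))
    return model

def easy_ensemble(X_min, X_maj, S, T, rng):
    """S random balanced subsets, each boosted for T rounds."""
    models = []
    for _ in range(S):
        sub = X_maj[rng.choice(len(X_maj), size=len(X_min), replace=False)]
        X = np.vstack([X_min, sub])
        y = np.concatenate([np.ones(len(X_min)), -np.ones(len(sub))])
        models.append(adaboost(X, y, T))
    return models

def ee_score(models, X):
    """Combined learner: sum over s and t of alpha_{s,t} * h_{s,t}(x)."""
    return sum(a * stump_predict(st, X) for m in models for a, st in m)

rng = np.random.default_rng(4)
X_min = rng.normal(1.0, 1.0, size=(20, 3))        # minority, shifted to the right
X_maj = rng.normal(-1.0, 1.0, size=(200, 3))
models = easy_ensemble(X_min, X_maj, S=5, T=10, rng=rng)
scores = ee_score(models, np.vstack([X_min, X_maj]))
labels = np.concatenate([np.ones(20), -np.ones(200)])
print((np.sign(scores) == labels).mean())         # well above chance on this toy data
```

Each subset is only twice the minority class size, which is why the method stays cheap even though S boosted ensembles are trained.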

[Figure] Fig 1 Mov G(p = 25) dataset results showing average tenfold AUC-performance on test data (with µ = 100%) for (a) AB, AC and EE, with C chosen according to highest validation set AUC-performance (over all possible T); (b) AB and EE(S = 15) with varying C-levels.


[Figure] Fig 2 Book(p = 25) dataset.

[Figure] Fig 3 TaFeng(p = 25) dataset.

[Figure] Fig 4 Bank(p = []) dataset.


4.6 Final comparison

4.6.1 Performance-wise comparison

In this section, we wish to determine which of the aforementioned methods shows superior performance in terms of AUC on test data. The experimental set-up is described in Section 4.2. Note that we exclude β = 0 and βu = 0 in determining the performance of the oversampling respectively undersampling techniques, to be able to compare them with the baseline (BL) approach.23 The results for AB, AC and EE are shown for µ = 100%. The number of boosting iterations t ∈ [0, T] is also considered as a tunable parameter. Table 5 shows the final comparison between all methods. Additionally, we added an average rank column indicating the mean rank of each algorithm across all datasets. The rank of an algorithm for a specific dataset corresponds to its position in the sorted AUC-performance spectrum, where the best performing algorithm gets the rank of 1, the second best rank 2, etc. In case two or more algorithms perform equally well, average ranks are assigned (if two algorithms have ranks of 3 and 4 respectively and show equal performance, then both methods get assigned a rank of 3.5). Note that the LST dataset is excluded in the calculation of average ranks, since we consider this as a dataset not being affected by imbalance and showing equal performances.

As a side note, we noticed that increasing the number of minority training set instances can significantly improve performances. We can see this clearly by comparing the p = 25 outcomes with the results of p = 1 on the same dataset. Junqué de Fortuny et al (2014a) already proved empirically that larger behaviour datasets (in terms of the number of instances or features) lead to higher AUC-performances. This situation is confirmed in our experiments, though we want to add that this is only true when the data size is increased in a balanced fashion.

The results in Table 5 confirm our previous analysis: each of the oversampling, undersampling (except Cl Knn) and boosting variants seems to improve upon the BL performance. Sobhani et al (2015) note that the direct application of ensemble techniques on imbalanced data does not circumvent the imbalanced learning issue; in our experiments on behaviour data, we do find a simple boosting procedure applied to the imbalanced data to improve upon the predictive performance. The oversampling techniques OSR, SMOTE and ADASYN show similar outcomes and are competitive methods, seeming to outperform the undersampling techniques and the plain boosting (AB) and cost-sensitive learning (AC) approaches. When dealing with traditional data, the performances of oversampling techniques are generally worse than those of undersampling approaches (Drummond and Holte 2003). It is interesting to observe that for these large and sparse behaviour datasets the opposite is true. This is heavily related to the intrinsic properties of the data under consideration: undersampling (or feature selection, for that matter) is not recommended because, as we have noted in Section 2.1, each feature provides a small additional contribution to the predictive performance. With respect to AC, the results indicate that R = R_L is a better choice compared to more random cost ratios R = 2, 8. The EE-technique has the

23 The BL technique trains single SVMs on the imbalanced training data.


lowest average rank and is our best performing method: the combination of several balanced subsets with filtered noise/outlier contributions allows one to explore the majority class space efficiently. To strengthen our previous statements, we will in the next paragraphs conduct statistical significance tests in accordance with the procedure outlined in Demšar (2006). The latter paper presents a framework to perform statistical comparisons of classifiers over multiple datasets.

The first null-hypothesis we try to reject postulates that each of the algorithms performs equally well, or, equivalently, that their average ranks R_j (see Table 5) are indifferent. Under the null-hypothesis, the Friedman statistic (Friedman 1937)

χ²_F = (12N / (k(k+1))) [ ∑_{j=1}^{k} R_j² − k(k+1)²/4 ],    (6)

with N the number of datasets (N = 15) and k the number of algorithms (k = 13), can be used to compute the statistic (Iman and Davenport 1980)

F_F = ((N−1) χ²_F) / (N(k−1) − χ²_F).    (7)

The latter is distributed according to the F-distribution with k−1 and (k−1)(N−1) degrees of freedom. Evaluating this measure with the average ranks shown in Table 5 results in F_F = 22.98, which is a lot higher than the critical value of 1.81 at the α = 0.05 significance level. We can therefore reject the null-hypothesis and proceed with post-hoc tests.
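The computation in Eqs. (6)-(7) is easy to reproduce from the average ranks listed in Table 5 (a quick numerical check; the small deviation from 22.98 comes from the ranks being rounded to three decimals in the table):

```python
# Average ranks from Table 5 (k = 13 algorithms, N = 15 datasets).
ranks = [11.600, 5.000, 4.533, 4.800, 8.167, 12.467, 8.133,
         8.567, 8.267, 8.467, 5.400, 3.267, 2.333]
N, k = 15, len(ranks)

# Eq. (6): Friedman statistic.
chi2_F = 12 * N / (k * (k + 1)) * (sum(r * r for r in ranks) - k * (k + 1) ** 2 / 4)
# Eq. (7): Iman-Davenport correction.
F_F = (N - 1) * chi2_F / (N * (k - 1) - chi2_F)
print(round(F_F, 2))   # 22.99, matching the reported 22.98 up to rank rounding
```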

The Nemenyi test (Nemenyi 1963) allows one to compare each classifier to one another and adjusts the critical value to compensate for making k(k−1)/2 comparisons.24 "The performance of two classifiers is significantly different if the corresponding average ranks differ by at least the critical difference (CD)" (Demšar 2006). We refer to the aforementioned paper for the calculation of the CD. The result of these comparisons at the α = 0.05 significance level is shown in Table 6. This test is already quite revealing, in the sense that, for instance, the oversampling methods (OSR, SMOTE and ADASYN), AC (R = R_L) and EE-techniques are significantly different from the BL. Yet it does not indicate that undersampling or applying plain boosting (AB) results in significantly different performances compared to the BL. The Nemenyi test is still conservative, and we therefore compare all classifiers to a single control classifier in the next paragraphs. Note that we only make k−1 comparisons in that situation, resulting in stronger results compared to the Nemenyi test.

When comparing a single classifier i with a control classifier c, the null-hypothesis states that the two classifiers i and c perform equally well. Under this hypothesis, the test statistic (based on the average ranks R_i shown in Table 5)

z = (R_i − R_c) / √(k(k+1)/(6N))    (8)

is distributed according to a standard normal distribution. This z-value is used to find the corresponding probability (p-value) p from the cumulative standard normal

24 The larger the number of comparisons, the higher the proportion of null-hypotheses that are wrongfully rejected due to random chance.


Table 5 Final comparison of all methods in terms of AUC test set performance; see Section 4.2 for details on the experimental set-up. β ≠ 0 and βu ≠ 0 for the oversampling respectively undersampling techniques; µ = 100% for each of the boosting variants. Standard deviations are given between brackets and the performance difference with the BL is shown in square brackets.

              Mov G(p = 1)         Mov G(p = 25)        Mov Th(p = [])       Yahoo A(p = 1)
BL            71.6 (2.62) [0]      81.41 (1.32) [0]     79.77 (5.33) [0]     55.92 (2.97) [0]
OSR           75.35 (2.27) [3.8]   83.76 (2.09) [2.3]   85.13 (6.1) [5.4]    60.05 (2.71) [4.1]
SMOTE         76.16 (2.27) [4.6]   83.7 (2.1) [2.3]     85.67 (4.98) [5.9]   60.1 (3) [4.2]
ADASYN        76.07 (2.26) [4.5]   83.63 (2.04) [2.2]   85.65 (5.6) [5.9]    59.9 (2.99) [4]
RUS           72.88 (2.73) [1.3]   81.52 (2.15) [0.1]   82.91 (7.19) [3.1]   57.04 (1.77) [1.1]
Cl Knn        71.43 (1.36) [-0.2]  80.88 (1.19) [-0.5]  78.87 (4.71) [-0.9]  55.78 (2.71) [-0.1]
Far Knn       71.9 (2.95) [0.3]    80.9 (1.48) [-0.5]   84.07 (4.64) [4.3]   57.2 (1.33) [1.3]
CBU           74.17 (2.36) [2.6]   81.51 (1.04) [0.1]   82.76 (7.22) [3]     58.77 (3.43) [2.8]
AB            71.65 (1.73) [0.1]   84.52 (1.89) [3.1]   82.43 (5.18) [2.7]   58.35 (2.62) [2.4]
AC(R = 2,8)   71.61 (2.46) [0]     83.46 (1.82) [2]     83.27 (5.6) [3.5]    57.72 (2.47) [1.8]
AC(R = R_L)   74.65 (2.7) [3.1]    83.35 (2.09) [1.9]   85.41 (4.49) [5.6]   59.47 (2.33) [3.5]
EE(S = 10)    76.04 (2.66) [4.4]   85.05 (1.85) [3.6]   86.1 (5.78) [6.3]    59.66 (3.13) [3.7]
EE(S = 15)    76.12 (2.88) [4.5]   85.14 (1.86) [3.7]   86.42 (5.86) [6.7]   59.76 (2.93) [3.8]

              Yahoo A(p = 25)      Yahoo G(p = 1)       Yahoo G(p = 25)      TaFeng(p = 1)
BL            61.68 (2.42) [0]     66.84 (3.66) [0]     78.82 (1.39) [0]     55.75 (1.6) [0]
OSR           64.59 (3.12) [2.9]   73.08 (2.96) [6.2]   78.52 (2.01) [-0.3]  61.21 (2.24) [5.5]
SMOTE         65.56 (3.33) [3.9]   73.11 (3.12) [6.3]   79.01 (1.21) [0.2]   61.72 (1.81) [6]
ADASYN        65.13 (3.38) [3.4]   73.22 (3.17) [6.4]   79.74 (1.68) [0.9]   61.68 (1.86) [5.9]
RUS           64.11 (2.8) [2.4]    70.65 (3.39) [3.8]   78.91 (1.55) [0.1]   59.25 (2.18) [3.5]
Cl Knn        61.14 (2.13) [-0.5]  66.34 (3.54) [-0.5]  77.26 (1.46) [-1.6]  55.77 (1.28) [0]
Far Knn       63.96 (3.03) [2.3]   66.97 (3.54) [0.1]   78.26 (2.2) [-0.6]   59.98 (1.26) [4.2]
CBU           62.27 (1.79) [0.6]   71.27 (2.89) [4.4]   75.22 (2.42) [-3.6]  58.4 (1.57) [2.6]
AB            63.88 (2.67) [2.2]   68.9 (2.03) [2.1]    79.01 (1.66) [0.2]   56.21 (1.79) [0.5]
AC(R = 2,8)   64.32 (3.56) [2.6]   68.89 (3.11) [2]     78.99 (1.89) [0.2]   56.33 (1.83) [0.6]
AC(R = R_L)   64.31 (3.03) [2.6]   73.13 (2.8) [6.3]    78.41 (2) [-0.4]     61.6 (2.26) [5.9]
EE(S = 10)    66.51 (3.24) [4.8]   72.61 (3.15) [5.8]   80.52 (1.6) [1.7]    61.2 (1.82) [5.4]
EE(S = 15)    66.36 (3.18) [4.7]   73.48 (2.32) [6.6]   80.54 (1.56) [1.7]   61.13 (1.83) [5.4]

              TaFeng(p = 25)       Book(p = 1)          Book(p = 25)         LST(p = 1)
BL            66.94 (1.34) [0]     52.6 (1.29) [0]      60.08 (0.71) [0]     99.99 (0.01) [0]
OSR           68.77 (1.23) [1.8]   55.87 (1.42) [3.3]   64.62 (0.57) [4.5]   99.99 (0.01) [0]
SMOTE         68.47 (1.5) [1.5]    55.07 (0.88) [2.5]   62.96 (0.82) [2.9]   99.99 (0.01) [0]
ADASYN        68.48 (1.47) [1.5]   55.04 (0.91) [2.4]   63.02 (0.57) [2.9]   99.99 (0.01) [0]
RUS           68.28 (1.39) [1.3]   54.26 (0.92) [1.7]   63.28 (0.8) [3.2]    99.98 (0.01) [0]
Cl Knn        66.13 (1.43) [-0.8]  52.69 (1.3) [0.1]    60.02 (0.79) [-0.1]  99.99 (0.01) [0]
Far Knn       68.06 (1.41) [1.1]   56.25 (1.52) [3.7]   64.15 (1.12) [4.1]   99.98 (0.01) [0]
CBU           63.84 (1.07) [-3.1]  53.75 (1.01) [1.2]   54.68 (0.88) [-5.4]  []
AB            67.65 (1.55) [0.7]   54.27 (1.95) [1.7]   65 (0.67) [4.9]      99.99 (0.01) [0]
AC(R = 2,8)   69.31 (1.23) [2.4]   53.72 (1) [1.1]      61.24 (0.8) [1.2]    99.98 (0.01) [0]
AC(R = R_L)   67.15 (1.51) [0.2]   55.73 (1.22) [3.1]   64.6 (0.64) [4.5]    99.99 (0.01) [0]
EE(S = 10)    70.3 (1.35) [3.4]    55.09 (1.29) [2.5]   65.37 (0.61) [5.3]   99.98 (0.01) [0]
EE(S = 15)    70.4 (1.3) [3.5]     55.35 (1.26) [2.8]   65.4 (0.51) [5.3]    99.98 (0.01) [0]


Table 5 Continued. Additionally, an average rank column is added, showing the mean rank of each algorithm across all datasets. Note that we excluded the LST dataset for this purpose.

              Adver(p = [])        Adver(p = 1)         CRF(p = [])           Bank(p = [])
BL            96.61 (1.82) [0]     90.93 (3.02) [0]     64.06 (16.43) [0]     66.82 (0.88) [0]
OSR           96.93 (1.91) [0.3]   93.3 (2.02) [2.4]    80.74 (12.93) [16.7]  71.39 (0.79) [4.6]
SMOTE         97.05 (1.66) [0.4]   93.35 (2.01) [2.4]   78.7 (16.56) [14.6]   []
ADASYN        96.91 (1.95) [0.3]   93.46 (2.21) [2.5]   78.87 (16.71) [14.8]  []
RUS           96.81 (1.87) [0.2]   92.38 (2.51) [1.5]   83.98 (5.99) [19.9]   69.41 (1.19) [2.6]
Cl Knn        96.4 (1.48) [-0.2]   89.73 (3.42) [-1.2]  76.63 (16.19) [12.6]  66.17 (0.72) [-0.6]
Far Knn       95.77 (1.81) [-0.8]  93.88 (1.78) [3]     83.75 (13.11) [19.7]  66.95 (0.56) [0.1]
CBU           97.15 (1.88) [0.5]   94.18 (2.3) [3.3]    []                    []
AB            97.34 (2.18) [0.7]   91.39 (3.23) [0.5]   77.62 (15.15) [13.6]  66.82 (0.88) [0]
AC(R = 2,8)   97.44 (1.93) [0.8]   91 (3.35) [0.1]      68.31 (14.93) [4.2]   67.67 (0.71) [0.9]
AC(R = R_L)   97.46 (1.71) [0.8]   93.51 (2.17) [2.6]   85.08 (9.77) [21]     70.7 (0.8) [3.9]
EE(S = 10)    97.64 (1.35) [1]     92.97 (2.75) [2]     86.18 (10.17) [22.1]  71.46 (0.81) [4.6]
EE(S = 15)    97.63 (1.35) [1]     93.3 (2.14) [2.4]    86.35 (9.99) [22.3]   71.54 (0.76) [4.7]

              Average rank
BL            11.600
OSR           5.000
SMOTE         4.533
ADASYN        4.800
RUS           8.167
Cl Knn        12.467
Far Knn       8.133
CBU           8.567
AB            8.267
AC(R = 2,8)   8.467
AC(R = R_L)   5.400
EE(S = 10)    3.267
EE(S = 15)    2.333

Table 6 Nemenyi test outcomes at the α = 0.05 significance level. A value of 1 indicates that the null-hypothesis is rejected and thus finds the two algorithms significantly different (a value of 0 means we accept the null-hypothesis). Methods are listed in the same order as Table 5 and are represented more concisely.

      BL  RO  SM  AD  RU  Cl  Fa  CBU  AB  AC1  AC2  EE1  EE2
BL    0   1   1   1   0   0   0   0    0   0    1    1    1
RO    1   0   0   0   0   1   0   0    0   0    0    0    0
SM    1   0   0   0   0   1   0   0    0   0    0    0    0
AD    1   0   0   0   0   1   0   0    0   0    0    0    0
RU    0   0   0   0   0   0   0   0    0   0    0    1    1
Cl    0   1   1   1   0   0   0   0    0   0    1    1    1
Fa    0   0   0   0   0   0   0   0    0   0    0    1    1
CBU   0   0   0   0   0   0   0   0    0   0    0    1    1
AB    0   0   0   0   0   0   0   0    0   0    0    1    1
AC1   0   0   0   0   0   0   0   0    0   0    0    1    1
AC2   1   0   0   0   0   1   0   0    0   0    0    0    0
EE1   1   0   0   0   1   1   1   1    1   1    0    0    0
EE2   1   0   0   0   1   1   1   1    1   1    0    0    0


distribution function: p = 2 × min(Φ(z), 1 − Φ(z)). Holm's method (Holm 1979) compares k−1 classifiers to a single control classifier and sorts the corresponding p-values in ascending order, so that p_1 ≤ p_2 ≤ ... ≤ p_{k−1}. Each p_i is subsequently compared to its associated confidence level25 αcomp = α/(k−i). Holm starts with performing the check p_1 < α/(k−1) and rejects the null-hypothesis if the test passes. It then proceeds with a similar check for i = 2 and continues until a certain null-hypothesis cannot be rejected. The remaining hypotheses are retained.
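Holm's step-down procedure is only a few lines of code; the sketch below reproduces the z- and p-values of Table 7 from the average ranks of Table 5 (the two-sided normal p-value is computed via `math.erfc`; small deviations from the table stem from the ranks being rounded there):

```python
import math

def holm(ranks, control, N, alpha=0.05):
    """Compare every method to `control` and apply Holm's step-down test."""
    k = len(ranks)
    se = math.sqrt(k * (k + 1) / (6 * N))
    rows = []
    for name, r in ranks.items():
        if name == control:
            continue
        z = (r - ranks[control]) / se
        p = math.erfc(abs(z) / math.sqrt(2))      # two-sided normal p-value
        rows.append((p, z, name))
    rows.sort()                                   # ascending p-values
    results, rejecting = {}, True
    for i, (p, z, name) in enumerate(rows, start=1):
        a_comp = alpha / (k - i)
        rejecting = rejecting and p < a_comp      # stop at the first failure
        results[name] = (z, p, rejecting)
    return results

ranks = {"BL": 11.600, "OSR": 5.000, "SMOTE": 4.533, "ADASYN": 4.800,
         "RUS": 8.167, "ClKnn": 12.467, "FarKnn": 8.133, "CBU": 8.567,
         "AB": 8.267, "AC28": 8.467, "ACRL": 5.400, "EE10": 3.267,
         "EE15": 2.333}
res = holm(ranks, control="BL", N=15)
z, p, sig = res["EE15"]
print(z, sig)   # z close to the -6.51642 reported in Table 7
```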

Table 7 shows the result of the Holm test at the α = 0.05 significance level with the BL algorithm as control classifier. The table lists each method according to sorted p-values and shows the z-, p- and αcomp-values. The significance column indicates whether the proposed method is significantly different from the BL and coincidently matches with the result of the Nemenyi test. The p-values of the (Far Knn, RUS, CBU) undersampling methods, AB and AC (R = 2, 8) are only slightly higher than the level required to reject the null-hypothesis. We included a critical significance level αcrit, corresponding to the lowest possible significance level upon which the method would be considered as significantly different from the BL (if α = αcrit, then p = αcomp). In the case of AB, for example, if we were to choose a significance level α = 0.0764 (a confidence level of 92.36%), the associated p-value would be smaller than αcomp and we would proceed to conclude26 that AB performs significantly different from the BL. To summarize: the null-hypothesis of equal performance between the BL and an alternative method proposed in Section 3 (except Cl Knn) is rejected with a confidence level of more than 91% according to Holm's test. Hence, the performance of the BL approach is significantly worse than that of each of the proposed over- and undersampling methods (except Cl Knn), cost-sensitive learning techniques (AC) and boosting variants (AB, EE).

Similar to the previous paragraph, Table 8 presents the results of Holm's test at the α = 0.05 significance level with EE(S = 15) as control classifier. The results coincidently correspond with the result of the Nemenyi test and indicate that the oversampling (OSR, SMOTE and ADASYN), AC(R = R_L) and EE(S = 10) methods are not statistically different from the EE(S = 15) technique at the 0.05 significance level. However, if we were to adapt the significance level to approximately α = 0.25 (75% confidence level), the results would conclude significance (except for EE(S = 10)). To summarize: the null-hypothesis of equal performance between EE(S = 15) and an alternative method proposed in Section 3 (except EE(S = 10)) is rejected with a confidence level of more than 75% according to Holm's test. Hence, the performance of the oversampling, undersampling, AB and AC approaches is significantly worse than that of EE(S = 15). We can therefore safely conclude EE to be the preferred method in terms of AUC performance when dealing with imbalanced behaviour data.

25 αcomp adjusts the value of α to compensate for multiple comparisons.
26 This is not entirely true, since Holm's method requires Far Knn to pass the test first before proceeding, which only occurs at the 0.088662 significance level.


Table 7 Holm test at the α = 0.05 significance level with BL reference. The table shows the z test statistic with associated p-value; αcomp = α/(k−i), with k the number of algorithms (k = 13) and i corresponding to the position of the method in the sorted p-value vector. Since we already ranked the p-values in ascending order, i takes on the row number (e.g. OSR has i = 5). The column "significant" denotes if we can reject the null-hypothesis (significant: p < αcomp). αcrit corresponds with the smallest possible significance level where we would decide to reject the null-hypothesis (αcrit = α·p/αcomp).

              z          p          αcomp      significant  αcrit
EE(S = 15)    -6.51642   7.2E-11    0.004167   1            8.64E-10
EE(S = 10)    -5.86009   4.63E-09   0.004545   1            5.09E-08
SMOTE         -4.96936   6.72E-07   0.005      1            6.72E-06
ADASYN        -4.78183   1.74E-06   0.005556   1            1.56E-05
OSR           -4.64119   3.46E-06   0.00625    1            2.77E-05
AC(R = R_L)   -4.35991   1.3E-05    0.007143   1            9.11E-05
Far Knn       -2.4378    0.014777   0.008333   0            0.088662
RUS           -2.41436   0.015763   0.01       0            0.078815
AB            -2.34404   0.019076   0.0125     0            0.076305
AC(R = 2,8)   -2.20339   0.027567   0.016667   0            0.082701
CBU           -2.13307   0.032919   0.025      0            0.065837
Cl Knn        0.609449   0.542227   0.05       0            0.542227

Table 8 Holm test at the α = 0.05 significance level with EE(S = 15) reference.

              z          p          αcomp      significant  αcrit
Cl Knn        7.12587    1.03E-12   0.004167   1            1.24E-11
BL            6.516421   7.2E-11    0.004545   1            7.92E-10
CBU           4.383348   1.17E-05   0.005      1            0.000117
AC(R = 2,8)   4.313027   1.61E-05   0.005556   1            0.000145
AB            4.172384   3.01E-05   0.00625    1            0.000241
RUS           4.102063   4.09E-05   0.007143   1            0.000287
Far Knn       4.078623   4.53E-05   0.008333   1            0.000272
AC(R = R_L)   2.156513   0.031044   0.01       0            0.155218
OSR           1.875229   0.060761   0.0125     0            0.243045
ADASYN        1.734587   0.082814   0.016667   0            0.248442
SMOTE         1.547064   0.121848   0.025      0            0.243696
EE(S = 10)    0.65633    0.511612   0.05       0            0.511612

4.6.2 Timing-wise comparison

The performance of each of the proposed algorithms in terms of training times depends highly upon the chosen parameters. The β and βu parameters control the number of training instances and are an important contribution to the subsequent SVM learning times for each of the over- and undersampling techniques. The number of boosting iterations t ∈ [0, T] is also a major factor in each of the boosting variants proposed in Section 3.3. Furthermore, the chosen regularization parameter C highly influences the training times of the SVM inducer: higher C-values result in larger training times (Fan et al 2008). Not only do the chosen hyperparameters affect learning times; data characteristics such as the number of minority instances, the imbalance level, the number of features, the amount of overlapping, etc., have a major effect as well.

Imbalanced classification in sparse and large behaviour datasets 35

In comparing each of the methods outlined in Section 3, we make use of a similar methodology as previously presented (see Section 4.6.1). For each of the ten folds, a suitable parameter combination is selected based on optimal validation set AUC-performance. These chosen parameters are applied in determining the timings of each method. The final results indicate the average timings across ten folds. Hence, we show the required amount of time for each method when using the parameter combination resulting in the best performance. Note that this might, for instance, result in using β = 1/3 in case of OSR or using only t = 5 boosting iterations in EE. Computational timings are shown in Table 9.

As one can observe from the aforementioned table, CBU is the slowest method; it relies on computationally intensive clustering procedures. The synthetic SMOTE and ADASYN oversampling approaches are very time-consuming methods as well: they both rely on computationally expensive nearest neighbour computations in conjunction with a synthetic sample generation procedure that requires a single pass over all minority training instances.
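As a rough illustration of where this cost comes from, the sketch below implements the classical dense-data SMOTE generation step: a full pairwise distance computation among minority instances, followed by interpolation between a minority instance and one of its nearest minority neighbours. The paper's adapted versions for sparse behaviour data differ in how similarities are computed, so this is only an assumption-laden stand-in:

```python
import numpy as np

def smote_sample(X_min, k=5, n_new=100, seed=0):
    """Classical SMOTE-style oversampling sketch (dense data).

    For each synthetic sample: pick a random minority instance, one of
    its k nearest minority neighbours, and a random point on the segment
    between them.
    """
    rng = np.random.default_rng(seed)
    n = len(X_min)
    # pairwise Euclidean distances among minority instances (the costly step)
    d = np.linalg.norm(X_min[:, None, :] - X_min[None, :, :], axis=2)
    np.fill_diagonal(d, np.inf)
    kk = min(k, n - 1)
    nn = np.argsort(d, axis=1)[:, :kk]  # kk nearest neighbours per instance
    out = np.empty((n_new, X_min.shape[1]))
    for j in range(n_new):
        i = rng.integers(n)
        p = X_min[nn[i, rng.integers(kk)]]
        gap = rng.random()
        out[j] = X_min[i] + gap * (p - X_min[i])  # interpolated synthetic sample
    return out
```

Every synthetic sample requires neighbour information, and the neighbour search itself scales quadratically with the minority class size in this naive form, which matches the timing behaviour reported above.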

The RUS method shows, as expected, superior results in terms of computational timings. Other undersampling methods perform quite well, though they are slower than BL and OSR due to their inherent nearest neighbour computations.

The boosting variants seem to perform worse than BL, OSR and each of the undersampling methods. Since EE was the best method in terms of AUC-performance (see Section 4.6.1), we focus on this learner in this paragraph. For the medium sized datasets (all but CRF and Bank), OSR outperforms EE(S = 15); yet for the large datasets (CRF and Bank) the opposite is true. A SVM is known to have a quadratic computational complexity in the number of training instances (Suykens et al 2002; García and Lozano 2007). Because of this reason, we consider OSR (or any other oversampling technique) as inappropriate for large datasets. Since with EE each subset has a size limited to twice the minority class training size, we expect increasing benefits of the EE-technique for larger and more imbalanced behaviour data. As Junque de Fortuny et al (2014a) have observed, larger behaviour data in terms of the number of instances or features also contributes to an increase in predictive power. Hence, we expect to see this type of big behaviour data more often in real-life applications, where computational resources can become a major constraint. Because the EE-technique is easily parallelizable by simultaneously performing the boosting process for each of the independent subsets, we added an extra row in Table 9: EE par corresponds with the time needed for EE(S = 15) divided by 15, indicating the time that would be required for EE if each subset were trained in parallel. We can see that EE par can outperform OSR in 8 out of 16 cases, precisely for the datasets that show a high imbalance level (e.g. p = 1 or p = []).
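The subset logic that makes EE parallelizable can be sketched as follows. A nearest-centroid scorer stands in for the paper's boosted SVM/LR base learner (an assumption made to keep the sketch self-contained); each balanced subset is independent, so each could be trained in its own process, which is exactly what the EE par row models:

```python
import numpy as np

def easy_ensemble_scores(X, y, X_test, S=10, seed=0):
    """EasyEnsemble sketch.

    Draw S balanced subsets (all minority instances plus an equally sized
    random majority sample, i.e. each subset is twice the minority size),
    fit a base scorer on each, and average the scores. A nearest-centroid
    scorer replaces the paper's boosted SVM/LR combination here.
    """
    rng = np.random.default_rng(seed)
    minority, majority = X[y == 1], X[y == 0]
    scores = np.zeros(len(X_test))
    for _ in range(S):
        idx = rng.choice(len(majority), size=len(minority), replace=False)
        sub = majority[idx]
        c_min, c_maj = minority.mean(axis=0), sub.mean(axis=0)
        # higher score = closer to the minority centroid than the majority one
        scores += (np.linalg.norm(X_test - c_maj, axis=1)
                   - np.linalg.norm(X_test - c_min, axis=1))
    return scores / S
```

Because the S loop iterations share nothing, dividing the sequential time by S gives a reasonable estimate of the parallel wall-clock time, up to scheduling overhead.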

We conclude our comparison of the proposed methods in Figure 5, which shows the average ranks in terms of computational timings (see Table 9) and AUC-performance for each of the available data sources. Regarding the ranks with respect to the latter, we added an extra row EE par having the same AUC-performance as EE(S = 15) for each dataset in Table 5. Ranks are subsequently determined according to the procedure outlined in Section 4.6.1. Additionally, Figure 18 in Appendix D shows the result for the large and highly imbalanced CRF and Bank datasets. We expect the latter type of data to be of major interest for reasons already indicated.

36 Jellis Vanhoeyveld David Martens

Table 9 Computational timings (in seconds) averaged across ten folds, using the parameter combination resulting in the highest validation set AUC-performance. First and second best performances are emphasized in boldface (third and fourth best are underlined). EE par is the time required for EE(S = 15) divided by 15.

Mov G(p = 1) | Mov G(p = 25) | Mov Th(p = []) | Yahoo A(p = 1)

BL 0.032889 0.056697 0.558563 0.026922
OSR 0.055043 0.062802 0.99009 0.044421
SMOTE 0.218821 0.937057 3.841482 0.057726
ADASYN 0.284688 1.802399 5.191265 0.087694
RUS 0.011431 0.025383 0.155224 0.007991
CL Knn 0.046599 0.599846 0.989914 0.037182
Far Knn 0.039887 0.80072 0.683023 0.027788
CBU 10.34111 10.60173 6.822839 16.92477
AB 0.169792 0.841443 3.460246 0.139251
AC(R = 28) 0.471994 2.996585 1.086907 0.366555
AC(R = RL) 0.53376 1.179542 6.065177 0.209015
EE(S = 10) 0.117226 6.065145 1.17995 0.148973
EE(S = 15) 0.20474 7.173737 2.119991 0.180365
EE par 0.013649 0.478249 0.141333 0.012024

Yahoo A(p = 25) | Yahoo G(p = 1) | Yahoo G(p = 25) | TaFeng(p = 1)

BL 0.092954 0.011915 0.044164 0.026728
OSR 0.027887 0.013241 0.047206 0.040919
SMOTE 1.062686 0.056153 0.883698 0.219553
ADASYN 2.050993 0.079073 1.733367 0.306618
RUS 0.048471 0.003234 0.033423 0.002916
CL Knn 0.84391 0.025404 0.502515 0.092167
Far Knn 0.664124 0.026576 0.500206 0.080159
CBU 15.69442 12.87221 13.55035 24.67279
AB 0.445546 0.078777 0.169977 0.114619
AC(R = 28) 1.034044 0.321723 0.515953 0.926178
AC(R = RL) 0.706215 0.226741 0.112949 0.610233
EE(S = 10) 1.026577 0.100331 1.527146 0.058052
EE(S = 15) 1.607596 0.077483 2.472582 0.10538
EE par 0.107173 0.005166 0.164839 0.007025

TaFeng(p = 25) | Book(p = 1) | Book(p = 25) | LST(p = 1)

BL 0.032033 0.080035 0.318093 0.652045
OSR 0.032414 0.132927 0.092757 0.87152
SMOTE 5.089283 3.409418 11.43444 4.987705
ADASYN 8.148419 3.689661 12.25441 6.840083
RUS 0.020457 0.022713 0.031972 0.432839
CL Knn 1.713731 0.400873 3.711648 2.508374
Far Knn 1.539437 0.379086 3.988552 2.511037
CBU 26.42686 41.98663 46.31987 []
AB 0.713265 0.61719 1.238585 2.466151
AC(R = 28) 1.234647 1.666131 2.330635 1.451671
AC(R = RL) 0.279047 0.860346 0.197053 1.23763
EE(S = 10) 2.484502 2.145747 7.177484 0.524066
EE(S = 15) 3.363971 2.480066 11.21945 0.784111
EE par 0.224265 0.165338 0.747963 0.052274


Table 9 Continued. Additionally, an extra column indicating average ranks is included (lower ranks are preferred).

Adver(p = []) | Adver(p = 1) | CRF(p = []) | Bank(p = [])

BL 0.010953 0.002796 0.725911 70.89334
OSR 0.012178 0.006166 3.685813 179.7481
SMOTE 0.123112 0.017764 5.633862 []
ADASYN 0.183767 0.021728 5.768669 []
RUS 0.012115 0.00204 0.147392 52.47441
CL Knn 0.061324 0.005568 1.106755 73.73282
Far Knn 0.079078 0.007069 1.110379 97.59619
CBU 3.378235 3.236754 [] []
AB 0.069199 0.103518 1.153196 83.08618
AC(R = 28) 0.193092 0.068905 2.047434 71.70548
AC(R = RL) 0.107652 0.037963 1.387174 106.3466
EE(S = 10) 0.138485 0.085686 0.198656 24.95117
EE(S = 15) 0.185136 0.139121 0.285345 36.40107
EE par 0.012342 0.009275 0.019023 2.426738

Average Rank [pos]

BL 2.94 [2]
OSR 4.19 [4]
SMOTE 9.59 [11]
ADASYN 10.91 [13]
RUS 1.38 [1]
CL Knn 6.5 [5]
Far Knn 6.56 [6]
CBU 14 [14]
AB 8.06 [7]
AC(R = 28) 10.81 [12]
AC(R = RL) 9.25 [9]
EE(S = 10) 8.25 [8]
EE(S = 15) 9.56 [10]
EE par 3 [3]

Timings are a substantial concern with respect to these large datasets As can be ob-served the EE-technique thrives for these big and highly imbalanced data even in itsnon-parallel form



Fig 5 Average rank AUC versus average rank Time (see Table 9) across all datasets from Table 2 (excluding Flickr and Kdd). Regarding the AUC rank, an extra row EE par is added having the same AUC-value as EE(S = 15). Ranks for each dataset are subsequently obtained via the procedure outlined in Section 4.6.1. Points occurring in the upper-right region are preferred.

5 The effect of the chosen base learner

In this short section we investigate the influence of the chosen base learner. More specifically, we ask ourselves whether some base inducers can more easily cope with the imbalanced learning issue, and in doing so we verify the conclusion that EE is a suitable technique to improve upon the baseline performance. We rely on several popular classification methods for behaviour data: two regularization-based techniques (SVM, LR) and two heuristic types of approaches (Naive Bayes (NB), Behavioural Similarity (BeSim)). The next paragraphs briefly introduce these methods. Note that source codes and dataset results are provided in our on-line repository.

LR constructs a linear model that maximizes the likelihood of the observed data by imposing a logistic model for the parametric probability model P(Y|X). Ng and Jordan (2002) provide a thorough introduction to the topic (see footnote 27) and note that, in its plain form, LR can suffer from overfitting, especially when the input data is very high dimensional. Therefore we resort to regularized logistic regression. We opt for L2-regularized LR, since L1-regularization corresponds with a natural way of imposing feature selection (Ng 2004). The LIBLINEAR toolbox (Fan et al 2008) is used to obtain these L2-regularized LR models.
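As a minimal stand-in for the LIBLINEAR solver, the sketch below fits L2-regularized LR by plain gradient descent on the objective 0.5·||w||² + C·Σᵢ log(1 + exp(−yᵢ w·xᵢ)) with labels y ∈ {−1,+1}; the learning-rate and iteration settings are illustrative assumptions, not the toolbox defaults:

```python
import numpy as np

def fit_l2_logreg(X, y, C=1.0, lr=0.1, iters=500):
    """Gradient descent on L2-regularized logistic regression.

    Objective: 0.5*||w||^2 + C * sum_i log(1 + exp(-y_i * w.x_i)),
    with y in {-1, +1}. Larger C weakens the regularization, i.e. it
    yields a "stronger" learner in the paper's terminology.
    """
    w = np.zeros(X.shape[1])
    for _ in range(iters):
        margins = y * (X @ w)
        # gradient: w - C * sum_i y_i * x_i * sigmoid(-margin_i)
        g = w - C * (X.T @ (y / (1.0 + np.exp(margins))))
        w -= lr * g / len(y)
    return w
```

A bias term can be folded in by appending a constant-one column to X, which is also how LIBLINEAR handles the intercept.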

27 For an accessible introduction, see the chapter on "Generative and Discriminative Classifiers: Naive Bayes and Logistic Regression" provided at the following link: http://www.cs.cmu.edu/~tom/NewChapters.html


NB (Ng and Jordan 2002) relies on the use of Bayes' rule, which expresses the posterior P(Y|X) as the product of the likelihood P(X|Y) and the prior P(Y), divided by the evidence P(X). The naivety of the approach originates from assuming conditional independence of the features. We impose a multivariate event model regarding the distribution P(X|Y). Junque de Fortuny et al (2014a) provide an efficient NB implementation for dealing with large and sparse binary behaviour data.
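A compact multivariate Bernoulli sketch of this model for binary behaviour features follows; it is a simple stand-in for the large-scale implementation cited above, and the Laplace smoothing is an assumed detail:

```python
import numpy as np

def bernoulli_nb_fit(X, y):
    """Multivariate Bernoulli Naive Bayes on a binary feature matrix X.

    Estimates P(feature = 1 | class) with add-one (Laplace) smoothing so
    that behaviours unseen for a class do not zero out its posterior.
    """
    classes = np.unique(y)
    log_prior = np.array([np.log(np.mean(y == c)) for c in classes])
    theta = np.array([(X[y == c].sum(axis=0) + 1.0) / ((y == c).sum() + 2.0)
                      for c in classes])
    return classes, log_prior, np.log(theta), np.log(1.0 - theta)

def bernoulli_nb_scores(model, X):
    """log P(c) + sum_j [x_j*log(theta_cj) + (1 - x_j)*log(1 - theta_cj)]."""
    classes, log_prior, log_t, log_1mt = model
    return log_prior + X @ log_t.T + (1 - X) @ log_1mt.T
```

Both fitting and scoring reduce to sparse matrix-vector products in a real implementation, which is what makes NB attractive at behaviour-data scale.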

The BeSim approach conceptually corresponds to projecting the bigraph to a unigraph using the procedure outlined in Section 3.2. Next, the weighted-vote relational neighbour classifier (Macskassy and Provost 2007) is used to infer unknown labels/scores through a weighted probability estimation using the known labels of neighbouring nodes. A very fast and scalable implementation of this approach (the SW-transformation) is presented in detail in Stankova et al (2015).
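A minimal sketch of this projection-plus-weighted-vote idea, using shared-behaviour counts as the link weights (an assumption for illustration; the SW-transformation uses its own similarity weights and avoids materializing the projected graph):

```python
import numpy as np

def besim_scores(A, labels, known):
    """Weighted-vote relational neighbour on a projected bigraph.

    A is a binary person-by-behaviour matrix. Two persons are linked
    with weight equal to the number of behaviours they share (unigraph
    projection). Each node's score is the weighted mean of the known
    labels among its neighbours; nodes without labelled neighbours get
    the uninformative prior 0.5.
    """
    W = A @ A.T                       # shared-behaviour counts
    np.fill_diagonal(W, 0)            # no self-links
    Wk = W[:, known]                  # links to labelled nodes only
    denom = Wk.sum(axis=1)
    votes = Wk @ labels[known] / np.maximum(denom, 1)
    return np.where(denom > 0, votes, 0.5)
```

In practice the projection W is never built explicitly for large sparse data; the score is computed by passing through each person's behaviours and the labelled persons attached to them.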

In Table 10 we compare the performances of the baseline (BL) application of the methods with their EE-version (with S = 15, T = 20). It is quite striking to see the EE-version dominating the baseline version when looking at each method individually. This suggests that the imbalanced learning issue has an impact on each of the aforementioned techniques. Focussing on the regularization-based approaches, we observe the baseline LR to be less influenced by the imbalance than the SVM method. Their EE-versions (EE SVM and EE LR) can boost the baseline performance in almost all cases and have the highest average ranks amongst all the proposed methods. As already emphasized, the regularization-based approaches offer an added element of flexibility, in the sense that the strength of the underlying base learner can be controlled.

The heuristic approaches NB and BeSim do not offer this flexibility. Furthermore, it is not directly possible to integrate instance weights during the training phase of the base learner. During the boosting process, we therefore sample from the distribution Dt to generate (unweighted) examples and train a BeSim/NB learner next. The results in Table 10 indicate BL NB to be the worst performing algorithm amongst all methods tried. Due to the conditional independence assumption, one could regard this as a weak learner, which is therefore suitable to be used in a boosting process. Indeed, the EE NB approach is able to significantly improve upon its performance. We found this to be the slowest EE-version (see footnote 28), e.g. taking more than 24 hours to process a single fold of the Bank dataset, versus several minutes for the other versions. The average rank estimate of EE NB is rather pessimistic, since we assigned the lowest performance across all methods in case of missing results. BeSim can be considered a stronger learner compared to NB; its possible boost in performance is typically lower than that of NB. Having said this, we found EE BeSim to improve upon BL BeSim for all datasets (except Adver(p = 1)).

28 This statement only relates to the EE-version. In the baseline case, the timings of the NB-version are roughly of the same order of magnitude as SVM and LR (though this heavily depends on the chosen regularization parameter C). With EE, each subset has a size limited to twice the minority class training size (a relatively low number compared to the size of the majority class). This works in favour of SVM/LR (e.g. SVM has a computational complexity of O(m^2), with m the number of training instances). Also note that the NB-version is indeed optimized for sparse data. Internally, the NB-implementation writes a fairly large model file to a specified directory (sizes of 1 GB can be encountered there). This writing operation slows down the training stage (especially in case of EE, where S × T = 15 × 20 = 300 model files are constructed).


Table 10 Average tenfold AUC test set performance of the BL compared to its EE-version (S = 15, T = 20) for SVM, LR, BeSim and NB. In case the EE NB implementation crashed, we put a × in the table; if it is too slow (> 24 hours per fold), we place a []. Additionally, average ranks are assigned to each method. In case EE NB has missing results for a particular dataset, we ensure it has the lowest rank across all other algorithms.

Mov G(p = 1) | Mov G(p = 25) | Mov Th(p = []) | Yahoo A(p = 1)

BL SVM 71.6 (2.62) 81.41 (1.32) 79.77 (5.33) 56.49 (3.37)
EE SVM 76.12 (2.88) 85.13 (1.86) 86.43 (5.86) 59.74 (2.96)
BL LR 71.02 (2.09) 84.39 (1.84) 83.14 (4.17) 57.84 (2.39)
EE LR 76.69 (2.92) 85.03 (1.98) 86.3 (5.37) 59.79 (2.62)
BL BeSim 76.1 (3.58) 81.3 (2.92) 82.81 (6.6) 56.27 (2.73)
EE BeSim 76.31 (3.71) 81.37 (2.9) 85.02 (6.28) 57.7 (1.71)
BL NB 70.26 (5.84) 77.01 (2.54) 70.48 (10.14) 52.56 (2.09)
EE NB 75.93 (2.83) 85.56 (2.01) 86.91 (4.15) 57.55 (2.73)

Yahoo A(p = 25) | Yahoo G(p = 1) | Yahoo G(p = 25) | TaFeng(p = 1)

BL SVM 61.61 (2.48) 66.84 (3.66) 78.82 (1.39) 55.75 (1.6)
EE SVM 66.38 (3.16) 73.48 (2.32) 80.55 (1.55) 61.13 (1.83)
BL LR 66.27 (2.96) 69.82 (1.93) 80.45 (1.59) 58.91 (2.31)
EE LR 66.22 (3.28) 73.08 (2.14) 80.53 (1.56) 61.43 (2.32)
BL BeSim 64.54 (2.02) 68.89 (2.49) 79.55 (1.96) 57.89 (1.18)
EE BeSim 65.25 (2.23) 71.18 (2.91) 80.04 (1.85) 59.36 (1.47)
BL NB 65 (1.65) 63.33 (2.56) 78.89 (1.64) 54.61 (1.2)
EE NB 66.6 (2.79) 70.99 (2.88) 81.01 (1.3) 59.01 (1.84)

TaFeng(p = 25) | Book(p = 1) | Book(p = 25) | LST(p = 1)

BL SVM 66.94 (1.34) 52.6 (1.29) 60.08 (0.71) 99.99 (0.01)
EE SVM 70.4 (1.3) 55.34 (1.28) 65.4 (0.51) 99.98 (0.01)
BL LR 69.24 (1.3) 55.34 (1.27) 63.84 (0.75) 99.99 (0.01)
EE LR 70.28 (1.28) 55.49 (1.49) 65.41 (0.63) 99.97 (0.02)
BL BeSim 67.49 (1.23) 55.19 (1.27) 63.7 (0.63) 99.99 (0.01)
EE BeSim 68 (1.21) 55.21 (1.15) 64.38 (0.42) 99.99 (0)
BL NB 65.21 (1.64) 52.93 (0.9) 59.75 (0.47) 98.69 (0.3)
EE NB 70.72 (1.15) × 63.46 (0.61) 99.92 (0.04)

Adver(p = []) | Adver(p = 1) | CRF(p = []) | Bank(p = [])

BL SVM 96.37 (1.94) 91.18 (2.97) 64.36 (18.97) 66.82 (0.88)
EE SVM 97.63 (1.35) 93.3 (2.14) 86.35 (9.99) 71.54 (0.76)
BL LR 97.19 (1.44) 88.51 (1.93) 81.87 (19.63) 71.43 (0.72)
EE LR 97.57 (0.96) 93.02 (2.06) 86.84 (9.62) 71.77 (0.62)
BL BeSim 97.26 (1.12) 95.38 (1.35) 86.91 (9.36) 67.85 (0.67)
EE BeSim 97.38 (1.04) 93.83 (1.35) 87.02 (10.43) 70.41 (0.55)
BL NB 93.75 (1.9) 93.37 (1.9) 87.24 (9.38) 67.83 (0.63)
EE NB 94.04 (1.75) × × []

Flickr(p = 0.1) | Kdd(p = 0.5) | Average Rank

BL SVM 74.92 (0.17) 74.53 (0.05) 6.44 [7]
EE SVM 79.86 (0.13) 80.98 (0.05) 2.39 [1]
BL LR 79.03 (0.11) 81.29 (0.04) 4.28 [4]
EE LR 79.85 (0.13) 80.75 (0.05) 2.61 [2]
BL BeSim 74.62 (0.13) 74.95 (0) 5.11 [6]
EE BeSim 76.4 (0.13) 77.55 (0.03) 3.61 [3]
BL NB 81.36 (0.1) 74.29 (0.05) 6.5 [8]
EE NB [] [] 5.06 [5]


6 Conclusions and future research directions

We investigated the effects of sampling and cost-sensitive learning techniques on the problem of learning from imbalanced behaviour data. This type of big data has fundamentally different properties and internal structure than traditional data and is characterized by sparseness and very large dimensions. This is a new research topic which enables benefits across a wide variety of application domains such as fraud detection, targeted advertising, predictive policing, churn prediction, default prediction, etc. To enable future developments in this field, we provide an on-line repository containing implementations, datasets and results, available at the following link: http://www.applieddatamining.com/cms/?q=software

Oversampling techniques need to be adapted in order to cope with this kind of data. We proposed new versions of the traditional SMOTE and ADASYN techniques, which differ in the way synthetic samples are generated and similarity computations are performed. The plain random oversampling (OSR) and the synthetic approaches are competitive methods in terms of AUC and do not seem to suffer from overfitting, as traditional studies report. Though the synthetic approaches have comparable AUC-performances to OSR, they are useless in practice due to their computationally expensive procedures. The OSR method is very fast for small and medium sized datasets, yet becomes inappropriate for larger datasets because the training times of the underlying base learner increase drastically.

With respect to undersampling methodologies, we employed a random undersampling technique (RUS) and Knn-based informed approaches. The latter type selects majority class instances that are closest (Cl Knn) or farthest (Far Knn) from the minority class examples, and allows one to investigate the effect of noise/outliers on SVM performance. We observed that these odd instances have a severe performance degrading effect. RUS and Far Knn approaches can significantly improve performance with respect to the baseline, thanks to their abilities to reduce noise/outlier levels and to put more emphasis on the minority class. Having said this, they usually perform worse than any of the oversampling approaches. This is in sheer contrast to the results obtained from traditional studies and is heavily related to the intrinsic nature of behaviour data. Timing-wise, RUS is the fastest method. The informed approaches are slower than RUS due to the required nearest neighbour computations. Additionally, a cluster-based undersampling (CBU) technique was proposed to alleviate the within-class imbalance. For low undersampling rates, the performance of CBU is superior to plain RUS, whereas for high undersampling rates the opposite is true. Overall, no substantial gain in performance is obtained in comparison with a simple RUS approach. Furthermore, the method is the slowest among all techniques considered and relies on computationally expensive clustering heuristics.

We studied the effects of boosting, where we made use of a very specific linear SVM and logistic regression combination that uses confidence-rated predictions instead of a more popular plain "discrete" boosting algorithm (with weak hypotheses having values in {−1,1}). Our experiments clearly indicated the regularization constant C in the SVM formulation to act as a "weakness" indicator. Indeed, higher C-values cause stronger learners and shouldn't be used in the boosting process, which matches the analysis of studies dealing with traditional data. Strong learners can


already quite accurately distinguish majority class from minority class behaviours. The boosting process will focus on the hard to learn instances, among which are noise/outliers, and this can result in overfitting behaviour. We also observed that boosting can usually outperform the plain baseline, indicating that a combination of weak learners can achieve higher AUC-performance than a single strong learner.

EasyEnsemble (EE) integrates random undersampling with a boosting process. By sampling several balanced subsets from the training data and combining these, we can emphasize the minority class, achieve noise/outlier reduction effects and simultaneously explore the majority class space efficiently. Our version differs from the original formulation because we employ a different type of underlying weak learner (SVM and LR combination) that is fed to a confidence-rated boosting algorithm (instead of the discrete version). We observed superior AUC-performance with respect to all of the aforementioned techniques. Furthermore, the method is very fast, since each subset can be trained in parallel and has a size that is twice as large as the minority class training set. We studied the effect of the number of subsets and noted that a limited amount of these (O(10)) is already sufficient to obtain satisfactory performances; increasing this number further causes only minor benefits.

Additionally, a statistical comparison of each of the proposed methods was conducted across a wide variety of datasets. Holm's significance test indicates that each of the proposed techniques (oversampling, undersampling except Cl Knn, cost-sensitive learning and boosting variants) can significantly outperform the plain baseline performance (each of the null-hypotheses of equal performance gets rejected with a confidence level of at least 91%). Comparison of the techniques with EE showed that the latter can significantly outperform all of them (each of the null-hypotheses of equal performance gets rejected with a confidence level of at least 75%). A timing-wise comparison showed EE to be the preferred method (after RUS) for large and highly imbalanced datasets. A parallel implementation would make EE the fastest method even for medium sized datasets that show a high level of imbalance.

The influence of the chosen base learner was investigated by comparing the baseline to its EE-version for the SVM, LR, NB and BeSim inducers. Each of the EE-versions is able to improve upon the baseline performance for nearly all datasets, where base learners of the regularization type are superior because of their abilities to control the strength of the learner.

Though we conducted an extensive benchmark study, there are still plenty of future research directions. First and most important, there is an urgent need to develop a more general theoretical framework to completely understand the effect of imbalance and why certain methods outperform others. Secondly, other types of mechanisms could be investigated in this setting: advanced sampling (Guo et al 2008) set-ups perform sampling based on preliminary classifications; one-class learning is a recognition-based methodology where a discriminative boundary is learned around the training examples of a single class alone, and Raskutti and Kowalczyk (2004) demonstrated the superiority of the recognition-based one-class SVMs over the discriminative two-class SVMs; other types of ensemble learning strategies (e.g. bagging and stacking) could be investigated; Batista et al (2004) note that combining over- and undersampling can be beneficial when data are highly imbalanced or show very few minority class instances; the ideas behind Extended Nearest Neighbour (Tang and


He 2015) could guide us in the sampling process. As opposed to Knn, "it considers not only who are the nearest neighbours of a certain sample, but also who consider the particular sample as their nearest neighbour" (Tang and He 2015). The method assigns a class membership to an unknown sample that maximizes the intra-class coherence. Finally, more advanced techniques exist that can determine the K nearest neighbours faster or with (slightly) better performance.

Another issue relates to the lack of comprehensibility of the EasyEnsemble technique in its present form. Because we currently integrate Logistic Regression in the boosting process, the final ensemble classifier is non-linear and therefore incomprehensible. It might be interesting to investigate the numerical boosting technique proposed in Schapire and Singer (1999), where each weak learner is allowed to output real values (instead of values in [−1,1]). In that case, we would be able to use a plain linear SVM as a weak learner without the Logistic Regression component. The final ensemble would be a linear model, which is surely easier to comprehend. Also, the instance-based approach of Martens and Provost (2014) could be used to explain individual predictions of the non-linear EE-technique.

A Oversampling experiments

Table 11 Oversampling experiments. Results show the average tenfold AUC test set performance with respect to increasing β-values (β = [β1, β2, β3, β4] = [0, 1/3, 2/3, 1]). Additionally, standard deviations are shown between brackets. Optimal performances for each method are highlighted in boldface. See Section 4.2 for parameter settings.

MOV G(p = 1): β1 β2 β3 β4
OSR 71.6 (2.62) 74.37 (2.04) 73.6 (1.84) 74.73 (2.45)
SMOTE 71.6 (2.62) 75.08 (2.18) 76.02 (2.14) 76.48 (2.3)
ADASYN 71.6 (2.62) 75.16 (1.92) 75.93 (2.08) 76.47 (2.29)

MOV G(p = 25): β1 β2 β3 β4
OSR 81.41 (1.32) 83.49 (1.81) 83.84 (1.96) 83.91 (2.04)
SMOTE 81.41 (1.32) 83.32 (1.97) 83.59 (2.04) 83.76 (2.11)
ADASYN 81.41 (1.32) 83.61 (1.82) 84.02 (1.97) 83.69 (1.96)

Mov Th(p = []): β1 β2 β3 β4
OSR 79.77 (5.33) 85.3 (4.66) 83.16 (4.5) 84.59 (5.69)
SMOTE 79.77 (5.33) 84.18 (6.51) 85.58 (5.97) 84.33 (5.75)
ADASYN 79.77 (5.33) 84.11 (6.77) 85.86 (6.06) 85.36 (5.13)

Yahoo A(p = 1): β1 β2 β3 β4
OSR 55.92 (2.97) 58.66 (3.27) 59.99 (2.28) 59.74 (1.78)
SMOTE 55.92 (2.97) 59.76 (2.62) 59.74 (2.67) 59.43 (2.4)
ADASYN 55.92 (2.97) 59.54 (2.53) 59.55 (2.94) 59.56 (2.22)

Yahoo A(p = 25): β1 β2 β3 β4
OSR 61.68 (2.42) 64.19 (3.17) 65.08 (3.26) 64.67 (2.1)
SMOTE 61.68 (2.42) 65.46 (3.63) 65.33 (3.23) 64.52 (2.98)
ADASYN 61.68 (2.42) 65.04 (3.74) 65.41 (3.47) 64.4 (2.21)


Yahoo G(p = 1): β1 β2 β3 β4
OSR 66.84 (3.66) 72.18 (2.36) 73.11 (2.7) 72.49 (3.41)
SMOTE 66.84 (3.66) 72.65 (2.85) 73.27 (3.36) 73.37 (3.56)
ADASYN 66.84 (3.66) 72.87 (2.83) 73.18 (3.2) 73.39 (3.59)

Yahoo G(p = 25): β1 β2 β3 β4
OSR 78.82 (1.39) 78.78 (2.02) 78.74 (1.86) 78.35 (1.47)
SMOTE 78.82 (1.39) 79.23 (1.57) 79.1 (1.2) 79.03 (1.89)
ADASYN 78.82 (1.39) 79.12 (1.43) 79.23 (1.37) 79.51 (2.01)

TaFeng(p = 1): β1 β2 β3 β4
OSR 55.75 (1.6) 59.23 (1.96) 60 (1.68) 61.04 (2.36)
SMOTE 55.75 (1.6) 60.26 (1.95) 61.49 (1.8) 61.13 (1.52)
ADASYN 55.75 (1.6) 60.26 (1.9) 61.44 (1.85) 61.16 (1.5)

TaFeng(p = 25): β1 β2 β3 β4
OSR 66.94 (1.34) 68.84 (1.21) 66.99 (1.42) 67.7 (1.41)
SMOTE 66.94 (1.34) 68.47 (1.5) 67.07 (1.15) 66.65 (0.81)
ADASYN 66.94 (1.34) 68.62 (1.38) 67.85 (1.6) 66.91 (1.39)

Book(p = 1): β1 β2 β3 β4
OSR 52.6 (1.29) 53.61 (0.94) 55.41 (1.75) 55.87 (1.44)
SMOTE 52.6 (1.29) 54.77 (0.99) 54.91 (0.8) 54.36 (0.98)
ADASYN 52.6 (1.29) 54.86 (1.13) 55.06 (0.73) 54.54 (0.92)

Book(p = 25): β1 β2 β3 β4
OSR 60.08 (0.71) 61.05 (0.94) 61.82 (1.12) 64.62 (0.57)
SMOTE 60.08 (0.71) 62.6 (0.73) 60.95 (0.68) 63 (0.8)
ADASYN 60.08 (0.71) 62.33 (0.73) 60.77 (0.85) 63.04 (0.58)

LST(p = 1): β1 β2 β3 β4
OSR 99.99 (0.01) 99.99 (0.01) 99.99 (0.01) 99.99 (0.01)
SMOTE 99.99 (0.01) 99.99 (0.01) 99.99 (0.01) 99.99 (0.01)
ADASYN 99.99 (0.01) 99.99 (0.01) 99.99 (0.01) 99.99 (0.01)

Adver(p = []): β1 β2 β3 β4
OSR 96.61 (1.82) 97.31 (1.65) 97.07 (1.84) 97.07 (1.79)
SMOTE 96.61 (1.82) 96.91 (1.66) 97.19 (1.65) 97.07 (1.91)
ADASYN 96.61 (1.82) 97.1 (1.7) 97.08 (1.87) 97.07 (1.88)

Adver(p = 1): β1 β2 β3 β4
OSR 90.93 (3.02) 91.27 (3.03) 92.66 (2.82) 93.29 (1.97)
SMOTE 90.93 (3.02) 92.51 (2.03) 92.96 (2.14) 93.53 (1.81)
ADASYN 90.93 (3.02) 92.22 (2.33) 92.7 (2.36) 93.88 (1.73)

CRF(p = []): β1 β2 β3 β4
OSR 64.06 (16.43) 80.82 (12.94) 81.28 (12.27) 81.91 (11.28)
SMOTE 64.06 (16.43) 78.64 (16.86) 82.52 (13.74) 79.32 (16.26)
ADASYN 64.06 (16.43) 78.95 (16.72) 81.19 (16.32) 79.31 (16.19)

Bank(p = []): β1 β2 β3 β4
OSR 66.82 (0.88) 70.1 (0.74) 71.39 (0.8) 71.47 (0.8)
SMOTE [] [] [] []
ADASYN [] [] [] []


B Undersampling experiments

Table 12 Undersampling experiments. Results show the average tenfold AUC test set performance with respect to increasing βu-values (βu = [βu1, βu2, βu3, βu4, βu5] = [0, 1/4, 1/2, 3/4, 1]). Optimal performances (boldface) are highlighted for each imbalance ratio. Additionally, standard deviations are shown between brackets. CL K represents the "Closest Knn" technique; CL T represents the "Closest tot sim" technique (similar for Far K and Far T, see Section 3.2). Parameter settings are given in Section 4.2.

Mov G(p = 1): βu1 βu2 βu3 βu4 βu5
RUS 71.6 (2.6) 71.83 (2.6) 72.54 (2.5) 72.39 (3.1) 70.61 (3.5)
Cl K 71.6 (2.6) 71.4 (2.0) 70.96 (1.9) 70.43 (2.4) 69.05 (4.1)
CL T 71.6 (2.6) 70.28 (2.5) 66.74 (2.0) 66.8 (2.1) 68.18 (3.6)
Far K 71.6 (2.6) 72.36 (2.7) 71.26 (3.4) 66.57 (5.2) 53.5 (3.5)
Far T 71.6 (2.6) 72.22 (2.8) 71.63 (3.6) 64.28 (5.3) 50.88 (4.4)
CBU 72.55 (2.6) 73.28 (2.6) 73.12 (2.6) 73.84 (2.5) 73 (3.1)

Mov G(p = 25): βu1 βu2 βu3 βu4 βu5
RUS 81.41 (1.3) 81.36 (1.3) 81.78 (1.7) 82.05 (1.7) 81.6 (2.1)
Cl K 81.41 (1.3) 80.86 (1.2) 80.95 (1.6) 79.73 (2.3) 77.95 (2.3)
CL T 81.41 (1.3) 79.9 (1.2) 78.21 (1.4) 77.87 (1.5) 77.76 (2.3)
Far K 81.41 (1.3) 80.9 (1.5) 78.17 (1.8) 74.25 (2.4) 69.79 (3.2)
Far T 81.41 (1.3) 80.86 (1.5) 77.2 (2.4) 71.16 (2.7) 62.4 (2.8)
CBU 81.53 (1.4) 81.64 (1.3) 81.29 (1.6) 81.28 (2.1) 80.34 (2.7)

Mov Th(p = []): βu1 βu2 βu3 βu4 βu5
RUS 79.77 (5.3) 80.32 (5.8) 81.57 (5.5) 81.86 (6.6) 81.26 (6.2)
Cl K 79.77 (5.3) 79.25 (4.5) 78.07 (5.0) 76.25 (6.5) 62.46 (8.5)
CL T 79.77 (5.3) 78.4 (4.4) 72.41 (3.5) 64.66 (4.5) 60.37 (7.3)
Far K 79.77 (5.3) 84.54 (5.0) 83.64 (6.4) 80.02 (7.3) 56.82 (10.3)
Far T 79.77 (5.3) 85.03 (5.7) 82.68 (6.8) 75.61 (9.2) 56.77 (10.9)
CBU 80.11 (5.8) 81.17 (6.0) 81.08 (6.5) 84.17 (5.1) 80.96 (6.9)

Yahoo A(p = 1): βu1 βu2 βu3 βu4 βu5
RUS 55.92 (3.0) 55.57 (3.4) 56.44 (3.0) 55.83 (3.4) 56.37 (3.3)
Cl K 55.92 (3.0) 55.67 (2.4) 53.12 (2.0) 50.57 (1.8) 53.79 (3.5)
CL T 55.92 (3.0) 55.69 (2.1) 53.35 (2.2) 50.31 (2.2) 52.35 (3.3)
Far K 55.92 (3.0) 57.35 (2.2) 56.92 (1.1) 56.95 (2.3) 51.18 (2.0)
Far T 55.92 (3.0) 56.93 (2.4) 54.74 (1.9) 57.01 (1.8) 51.18 (2.0)
CBU 58.21 (2.6) 58.45 (3.3) 58.31 (3.5) 58.39 (3.5) 56.09 (2.6)

Yahoo A(p = 25): βu1 βu2 βu3 βu4 βu5
RUS 61.68 (2.4) 62.9 (2.9) 63.62 (3.6) 63.75 (3.1) 63.19 (1.9)
Cl K 61.68 (2.4) 61.14 (2.1) 57.62 (1.6) 54.02 (1.8) 51.48 (1.4)
CL T 61.68 (2.4) 60.89 (2.8) 58.11 (1.4) 54.4 (2.1) 51.76 (1.4)
Far K 61.68 (2.4) 63.96 (3.0) 62.62 (2.2) 59.61 (1.5) 56.25 (1.6)
Far T 61.68 (2.4) 63.71 (2.4) 59.72 (1.6) 57.27 (1.1) 54.47 (1.1)
CBU 62.46 (2.6) 61.85 (1.4) 61.78 (2.2) 59.94 (3.0) 60.1 (4.0)


Yahoo G(p = 1): βu1 βu2 βu3 βu4 βu5
RUS 66.84 (3.7) 67.85 (3.2) 68.36 (3.2) 68.23 (4.0) 69.9 (4.2)
Cl K 66.84 (3.7) 66.71 (2.8) 64.3 (3.6) 61.98 (3.9) 61.15 (1.9)
CL T 66.84 (3.7) 65.79 (2.7) 63.55 (3.3) 59.21 (3.5) 61.08 (2.4)
Far K 66.84 (3.7) 66.76 (4.1) 63.84 (3.4) 65.16 (2.0) 48.5 (2.9)
Far T 66.84 (3.7) 66.95 (4.1) 63.48 (2.9) 65.16 (2.0) 48.48 (2.9)
CBU 69.68 (4.1) 70.59 (3.2) 70.64 (3.7) 70.2 (2.9) 63.35 (3.6)

Yahoo G(p = 25): βu1 βu2 βu3 βu4 βu5
RUS 78.82 (1.4) 78.91 (1.6) 78.97 (1.6) 78.61 (1.6) 77.82 (2.1)
Cl K 78.82 (1.4) 77.26 (1.5) 72.52 (1.5) 67.86 (2.0) 65.07 (2.7)
CL T 78.82 (1.4) 76.83 (1.0) 71.99 (1.8) 67.15 (2.3) 61.1 (2.7)
Far K 78.82 (1.4) 78.26 (2.2) 74.69 (2.7) 67.22 (2.1) 60.72 (2.3)
Far T 78.82 (1.4) 77.68 (2.6) 72.44 (3.0) 64.94 (2.4) 59.6 (2.0)
CBU 75.25 (3.2) 75.22 (2.4) 74.69 (2.3) 73.07 (2.4) 70.69 (2.4)

TaFeng(p = 1): βu1 βu2 βu3 βu4 βu5
RUS 55.75 (1.6) 56.1 (1.6) 56.26 (1.7) 57.23 (1.7) 59.25 (2.2)
Cl K 55.75 (1.6) 55.68 (1.6) 55.58 (1.5) 55.08 (1.1) 51.05 (1.5)
CL T 55.75 (1.6) 55.67 (1.6) 54.47 (1.6) 47.53 (1.6) 49.3 (1.1)
Far K 55.75 (1.6) 58.99 (1.2) 59.47 (1.1) 60.04 (1.2) 56.31 (1.0)
Far T 55.75 (1.6) 58.92 (1.3) 59.25 (1.3) 58.58 (1.1) 56.31 (1.0)
CBU 57.8 (1.0) 58.47 (1.1) 58.15 (0.9) 58.87 (1.4) 57.65 (1.6)

TaFeng(p = 25): βu1 βu2 βu3 βu4 βu5
RUS 66.94 (1.3) 67.44 (1.3) 68.1 (1.4) 68.27 (1.4) 66.13 (1.2)
Cl K 66.94 (1.3) 66.13 (1.4) 63.39 (1.2) 59.83 (1.3) 56.94 (0.7)
CL T 66.94 (1.3) 66.38 (1.5) 62.89 (1.6) 57.46 (1.3) 54.56 (1.3)
Far K 66.94 (1.3) 68.06 (1.4) 66.43 (1.6) 64.46 (1.5) 63.35 (1.3)
Far T 66.94 (1.3) 64.31 (1.1) 62.69 (1.0) 61.27 (1.1) 59.03 (1.0)
CBU 64.81 (1.2) 64.15 (1.1) 64.13 (1.2) 63.88 (0.8) 63.46 (0.8)

Book(p = 1): βu1 βu2 βu3 βu4 βu5
RUS 52.6 (1.3) 52.79 (0.9) 53.46 (0.8) 53.89 (0.9) 54.05 (0.9)
Cl K 52.6 (1.3) 52.56 (1.2) 52.52 (1.3) 52.39 (1.1) 53.09 (1.1)
CL T 52.6 (1.3) 52.56 (1.2) 52.52 (1.3) 52.39 (1.1) 53.05 (0.7)
Far K 52.6 (1.3) 55.21 (1.2) 56.21 (1.8) 56.14 (1.2) 53.06 (1.0)
Far T 52.6 (1.3) 55.21 (1.2) 56.21 (1.8) 56.14 (1.2) 53.06 (1.0)
CBU 54.28 (0.9) 53.77 (1.0) 53.33 (1.1) 53.34 (0.9) 52.84 (0.8)

Book(p = 25): βu1 βu2 βu3 βu4 βu5
RUS 60.08 (0.7) 60.13 (0.6) 60.4 (0.8) 60.33 (0.8) 63.28 (0.8)
Cl K 60.08 (0.7) 59.96 (0.7) 60.13 (0.8) 59.96 (1.0) 59.28 (0.7)
CL T 60.08 (0.7) 59.96 (0.7) 60.13 (0.8) 60.29 (0.4) 54.5 (0.9)
Far K 60.08 (0.7) 63.29 (1.0) 64.19 (0.8) 57.3 (1.1) 55.66 (1.1)
Far T 60.08 (0.7) 62.14 (0.5) 58.27 (0.6) 56.37 (1.0) 55.66 (1.1)
CBU 54.82 (0.9) 54.67 (0.9) 54.71 (0.9) 54.66 (1.0) 54.78 (0.9)


LST(p = 1): βu1 βu2 βu3 βu4 βu5
RUS 99.99 (0) 99.99 (0) 99.99 (0) 99.98 (0) 99.99 (0)
Cl K 99.99 (0) 99.99 (0) 99.99 (0) 99.99 (0) 99.99 (0)
CL T 99.99 (0) 99.99 (0) 99.99 (0) 99.99 (0) 99.98 (0)
Far K 99.99 (0) 99.98 (0) 99.98 (0) 99.98 (0) 99.98 (0)
Far T 99.99 (0) 99.98 (0) 99.98 (0) 99.98 (0) 99.98 (0)
CBU [] [] [] [] []

Adver(p = []): βu1 βu2 βu3 βu4 βu5
RUS 96.61 (1.8) 96.32 (1.8) 96.63 (1.4) 97.12 (2.1) 96.22 (1.6)
Cl K 96.61 (1.8) 96.44 (1.5) 96.14 (1.5) 96.04 (2.0) 94.8 (2.5)
CL T 96.61 (1.8) 95.87 (2.1) 94.32 (1.9) 93.01 (2.2) 90.72 (2.3)
Far K 96.61 (1.8) 96.53 (1.4) 95.76 (2.0) 94.39 (1.8) 90.49 (3.1)
Far T 96.61 (1.8) 96.54 (1.5) 95.67 (1.9) 94.54 (1.8) 89.3 (2.8)
CBU 96.85 (2.3) 96.85 (2.3) 97.05 (1.5) 96.6 (1.6) 96.06 (2.1)

Adver(p = 1): βu1 βu2 βu3 βu4 βu5
RUS 90.93 (3.0) 91.53 (3.1) 92.37 (3.4) 91.9 (2.9) 91.93 (2.2)
Cl K 90.93 (3.0) 90.64 (3.0) 89.87 (3.9) 90.21 (3.6) 89.18 (2.0)
CL T 90.93 (3.0) 89.7 (3.5) 88.55 (3.4) 85.76 (3.3) 88.2 (2.3)
Far K 90.93 (3.0) 93.8 (2.3) 92.4 (2.6) 88.73 (3.4) 85.51 (4.0)
Far T 90.93 (3.0) 93.62 (2.4) 93.2 (2.2) 88.41 (3.6) 85.51 (4.0)
CBU 93.22 (2.4) 93.76 (2.5) 93.89 (2.6) 93.52 (2.7) 91.27 (2.0)

CRF(p = []): βu1 βu2 βu3 βu4 βu5
RUS 64.06 (16.4) 63.28 (15.9) 67.98 (17.4) 66.95 (21.9) 87.73 (8.8)
Cl K 64.06 (16.4) 62.44 (16.6) 62.34 (16.9) 71.37 (13.8) 78.22 (17.7)
CL T 64.06 (16.4) 62.44 (16.6) 62.34 (16.9) 71.37 (13.8) 62.67 (22.9)
Far K 64.06 (16.4) 83.8 (14.2) 83.93 (14.8) 84.49 (13.7) 86.11 (9.7)
Far T 64.06 (16.4) 83.8 (14.2) 83.93 (14.8) 84.49 (13.7) 86.11 (9.7)
CBU [] [] [] [] []

Bank(p = []): βu1 βu2 βu3 βu4 βu5
RUS 66.82 (0.9) 67.02 (0.9) 67.37 (0.8) 67.99 (0.6) 69.5 (1.0)
Cl K 66.82 (0.9) 66.17 (0.7) 65.24 (0.6) 64.86 (0.6) 58.53 (1.1)
CL T 66.82 (0.9) 64.92 (1.1) 60.69 (0.9) 56.33 (0.8) 52.87 (0.7)
Far K 66.82 (0.9) 66.95 (0.6) 66.19 (0.6) 64.42 (0.6) 58.25 (1.1)
Far T 66.82 (0.9) 67.16 (0.6) 64.2 (0.8) 59.67 (1.0) 58.25 (1.1)
CBU [] [] [] [] []

48 Jellis Vanhoeyveld David Martens

C Boosting experiments

This section presents the results of each of the boosting experiments from Section 4.5 on the remaining data sources. Each of the figures below depicts the average tenfold AUC-performance on test data (with micro = 100) with respect to the number of boosting iterations for (left) Adaboost (AB), AdaCost (AC) and EasyEnsemble (EE), with C chosen according to the highest validation set AUC-performance (over all possible boosting rounds), and (right) AB and EE (with S = 15) at varying C-levels.
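The EasyEnsemble idea evaluated in these experiments can be sketched as follows. This is a simplified illustration, not the paper's implementation: a class-centroid scorer stands in for the boosted linear-SVM learners, and all names are ours.

```python
import numpy as np

def easy_ensemble_scores(X, y, n_subsets=15, rng=None):
    """Sketch of EasyEnsemble: draw balanced subsets from the majority
    class, train a learner on each balanced set, and average the scores.
    Assumes y in {0, 1} with 1 the minority class; the base learner here
    is a simple class-centroid scorer standing in for the boosted SVMs
    used in the paper. X is a dense array for brevity (the paper's data
    are sparse)."""
    rng = np.random.default_rng(rng)
    pos = np.flatnonzero(y == 1)          # minority indices
    neg = np.flatnonzero(y == 0)          # majority indices
    scores = np.zeros(len(y))
    for _ in range(n_subsets):
        # balanced subset: all minority + an equally sized majority sample
        sub_neg = rng.choice(neg, size=len(pos), replace=False)
        mu_pos = X[pos].mean(axis=0)
        mu_neg = X[sub_neg].mean(axis=0)
        # score = projection onto the direction between class centroids
        w = mu_pos - mu_neg
        scores += X @ w
    return scores / n_subsets
```

Each subset is only twice the minority class size, which is what makes the method fast and trivially parallelizable.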

[Figure: (a) test AUC versus boosting rounds T (0–30) for AB, AC R2, AC R8, AC RD, EE S5, EE S10, EE S15 and BL; (b) test AUC versus T for AB and EE at C = 1e-07, 1e-05, 0.001 and 0.1, with BL]

Fig. 6 Mov G(p = 1) dataset

[Figure: (a) test AUC versus boosting rounds T (0–30) for AB, AC R2, AC R8, AC RD, EE S5, EE S10, EE S15 and BL; (b) test AUC versus T for AB and EE at C = 1e-07, 1e-05, 0.001 and 0.1, with BL]

Fig. 7 Mov Th(p = []) dataset


[Figure: (a) test AUC versus boosting rounds T (0–30) for AB, AC R2, AC R8, AC RD, EE S5, EE S10, EE S15 and BL; (b) test AUC versus T for AB and EE at C = 1e-07, 1e-05, 0.001 and 0.1, with BL]

Fig. 8 Yahoo A(p = 1) dataset

[Figure: (a) test AUC versus boosting rounds T (0–30) for AB, AC R2, AC R8, AC RD, EE S5, EE S10, EE S15 and BL; (b) test AUC versus T for AB and EE at C = 1e-07, 1e-05, 0.001 and 0.1, with BL]

Fig. 9 Yahoo A(p = 25) dataset

[Figure: (a) test AUC versus boosting rounds T (0–30) for AB, AC R2, AC R8, AC RD, EE S5, EE S10, EE S15 and BL; (b) test AUC versus T for AB and EE at C = 1e-07, 1e-05, 0.001 and 0.1, with BL]

Fig. 10 Yahoo G(p = 1) dataset


[Figure: (a) test AUC versus boosting rounds T (0–30) for AB, AC R2, AC R8, AC RD, EE S5, EE S10, EE S15 and BL; (b) test AUC versus T for AB and EE at C = 1e-07, 1e-05, 0.001 and 0.1, with BL]

Fig. 11 Yahoo G(p = 25) dataset

[Figure: (a) test AUC versus boosting rounds T (0–30) for AB, AC R2, AC R8, AC RD, EE S5, EE S10, EE S15 and BL; (b) test AUC versus T for AB and EE at C = 1e-07, 1e-05, 0.001 and 0.1, with BL]

Fig. 12 TaFeng(p = 1) dataset

[Figure: (a) test AUC versus boosting rounds T (0–30) for AB, AC R2, AC R8, AC RD, EE S5, EE S10, EE S15 and BL; (b) test AUC versus T for AB and EE at C = 1e-07, 1e-05, 0.001 and 0.1, with BL]

Fig. 13 Book(p = 1) dataset


[Figure: (a) test AUC versus boosting rounds T (0–30) for AB, AC R2, AC R8, AC RD, EE S5, EE S10, EE S15 and BL; (b) test AUC versus T for AB and EE at C = 1e-07, 1e-05, 0.001 and 0.1, with BL]

Fig. 14 LST(p = 1) dataset

[Figure: (a) test AUC versus boosting rounds T (0–30) for AB, AC R2, AC R8, AC RD, EE S5, EE S10, EE S15 and BL; (b) test AUC versus T for AB and EE at C = 1e-07, 1e-05, 0.001 and 0.1, with BL]

Fig. 15 Adver(p = []) dataset

[Figure: (a) test AUC versus boosting rounds T (0–30) for AB, AC R2, AC R8, AC RD, EE S5, EE S10, EE S15 and BL; (b) test AUC versus T for AB and EE at C = 1e-07, 1e-05, 0.001 and 0.1, with BL]

Fig. 16 Adver(p = 1) dataset


[Figure: (a) test AUC versus boosting rounds T (0–30) for AB, AC R2, AC R8, AC RD, EE S5, EE S10, EE S15 and BL; (b) test AUC versus T for AB and EE at C = 1e-07, 1e-05, 0.001 and 0.1, with BL]

Fig. 17 CRF(p = []) dataset

D Final Comparison

[Figure: scatter of average rank AUC (x-axis, 0–14) versus average rank Time (y-axis, 0–18) for BL, OS, RS, SMOTE, ADASYN, RUS, Cl Knn, Far Knn, CBU, AB, AC(R=28), AC(R=RL), EE(S=10), EE(S=15) and EE par]

Fig. 18 Average rank AUC versus average rank Time (see Table 9) across the large datasets from Table 2 (CRF and Bank). Regarding the former, an extra row EE par is added, having the same AUC-value as EE(S = 15). Ranks for each dataset are subsequently obtained via the procedure outlined in Section 4.6.1. Points occurring in the upper-right region are preferred.


References

Akbani R, Kwek S, Japkowicz N (2004) Applying support vector machines to imbalanced datasets. In: Machine Learning: ECML 2004, 15th European Conference on Machine Learning, Pisa, Italy, September 20-24, 2004, Proceedings. Springer Berlin Heidelberg, Berlin, Heidelberg, pp 39–50. DOI 10.1007/978-3-540-30115-8_7

Ali A, Shamsuddin SM, Ralescu AL (2015) Classification with class imbalance problem: A review. International Journal of Advances in Soft Computing and its Applications 7(3):176–204

Alzahrani T, Horadam KJ (2016) Community detection in bipartite networks: Algorithms and case studies. In: Complex Systems and Networks: Dynamics, Controls and Applications. Springer Berlin Heidelberg, Berlin, Heidelberg, pp 25–50. DOI 10.1007/978-3-662-47824-0_2

Bachner J (2013) Predictive policing: Preventing crime with data and analytics. IBM Center for the Business of Government

Baesens B, Van Gestel T, Viaene S, Stepanova M, Suykens J, Vanthienen J (2003) Benchmarking state-of-the-art classification algorithms for credit scoring. Journal of the Operational Research Society 54(6):627–635. DOI 10.1057/palgrave.jors.2601545

Barandela R, Sánchez J, García V, Rangel E (2003) Strategies for learning in class imbalance problems. Pattern Recognition 36(3):849–851. DOI https://doi.org/10.1016/S0031-3203(02)00257-1

Barber MJ (2007) Modularity and community detection in bipartite networks. Physical Review E 76:066102. DOI 10.1103/PhysRevE.76.066102

Barua S, Islam MM, Yao X, Murase K (2014) MWMOTE – majority weighted minority oversampling technique for imbalanced data set learning. IEEE Transactions on Knowledge and Data Engineering 26(2):405–425. DOI 10.1109/TKDE.2012.232

Batista GEAPA, Prati RC, Monard MC (2004) A study of the behavior of several methods for balancing machine learning training data. SIGKDD Explor Newsl 6(1):20–29. DOI 10.1145/1007730.1007735

Beckett SJ (2016) Improved community detection in weighted bipartite networks. Royal Society Open Science 3(1). DOI 10.1098/rsos.140536

Bekkar M, Djemaa HK, Alitouche TA (2013) Evaluation measures for models assessment over imbalanced data sets. Journal of Information Engineering and Applications 3(10):27–38

Bhattacharyya S, Jha S, Tharakunnel K, Westland JC (2011) Data mining for credit card fraud: A comparative study. Decision Support Systems 50(3):602–613. DOI https://doi.org/10.1016/j.dss.2010.08.008

Blondel VD, Guillaume JL, Lambiotte R, Lefebvre E (2008) Fast unfolding of communities in large networks. Journal of Statistical Mechanics: Theory and Experiment 2008(10):P10008

Breiman L, Friedman J, Stone CJ, Olshen RA (1984) Classification and regression trees. Taylor & Francis

Brozovsky L, Petricek V (2007) Recommender system for online dating service. In: Proceedings of Znalosti 2007 Conference, VSB Ostrava

Cha M, Mislove A, Gummadi KP (2009) A measurement-driven analysis of information propagation in the Flickr social network. In: Proceedings of the 18th International Conference on World Wide Web, ACM, New York, NY, USA, WWW '09, pp 721–730. DOI 10.1145/1526709.1526806

Chawla NV (2005) Data mining for imbalanced datasets: An overview. In: Data Mining and Knowledge Discovery Handbook. Springer US, Boston, MA, pp 853–867

Chawla NV, Bowyer KW, Hall LO, Kegelmeyer WP (2002) SMOTE: synthetic minority over-sampling technique. Journal of Artificial Intelligence Research 16:321–357

Chawla NV, Lazarevic A, Hall LO, Bowyer KW (2003) SMOTEBoost: Improving prediction of the minority class in boosting. In: Knowledge Discovery in Databases: PKDD 2003. Springer Berlin Heidelberg, Berlin, Heidelberg, pp 107–119

Chawla NV, Japkowicz N, Kotcz A (2004) Editorial: Special issue on learning from imbalanced data sets. SIGKDD Explor Newsl 6(1):1–6. DOI 10.1145/1007730.1007733

Chen M, Mao S, Liu Y (2014) Big data: A survey. Mobile Networks and Applications 19(2):171–209. DOI 10.1007/s11036-013-0489-0

Chyi YM (2003) Classification analysis techniques for skewed class distribution problems. Master thesis, Department of Information Management, National Sun Yat-Sen University

Demšar J (2006) Statistical comparisons of classifiers over multiple data sets. Journal of Machine Learning Research 7(Jan):1–30

Dhillon IS (2001) Co-clustering documents and words using bipartite spectral graph partitioning. In: Proceedings of the Seventh ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, ACM, New York, NY, USA, KDD '01, pp 269–274. DOI 10.1145/502512.502550


Drummond C, Holte RC (2003) C4.5, class imbalance, and cost sensitivity: why under-sampling beats over-sampling. In: Proceedings of the ICML '03 Workshop on Learning from Imbalanced Datasets

Fan RE, Chang KW, Hsieh CJ, Wang XR, Lin CJ (2008) LIBLINEAR: A library for large linear classification. Journal of Machine Learning Research 9:1871–1874

Fan W, Stolfo SJ, Zhang J, Chan PK (1999) AdaCost: Misclassification cost-sensitive boosting. In: Proceedings of the Sixteenth International Conference on Machine Learning, Morgan Kaufmann Publishers Inc, San Francisco, CA, USA, ICML '99, pp 97–105

Fawcett T (2006) An introduction to ROC analysis. Pattern Recognition Letters 27(8):861–874. DOI https://doi.org/10.1016/j.patrec.2005.10.010

Finch H (2005) Comparison of distance measures in cluster analysis with dichotomous data. Journal of Data Science 3(1):85–100

Fortunato S (2010) Community detection in graphs. Physics Reports 486(3–5):75–174. DOI https://doi.org/10.1016/j.physrep.2009.11.002

Junque de Fortuny E, Martens D, Provost F (2014a) Predictive modeling with big data: is bigger really better? Big Data 1(4):215–226. DOI 10.1089/big.2013.0037

Junque de Fortuny E, Stankova M, Moeyersoms J, Minnaert B, Provost F, Martens D (2014b) Corporate residence fraud detection. In: Proceedings of the 20th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, ACM, New York, NY, USA, KDD '14, pp 1650–1659. DOI 10.1145/2623330.2623333

Frasca M, Bertoni A, Re M, Valentini G (2013) A neural network algorithm for semi-supervised node label learning from unbalanced data. Neural Networks 43:84–98. DOI https://doi.org/10.1016/j.neunet.2013.01.021

Friedman M (1937) The use of ranks to avoid the assumption of normality implicit in the analysis of variance. Journal of the American Statistical Association 32(200):675–701

García E, Lozano F (2007) Boosting support vector machines. In: Machine Learning and Data Mining in Pattern Recognition, 5th International Conference, MLDM 2007, Leipzig, Germany, July 18-20, Post Proceedings. IBaI publishing, pp 153–167

Goldstein M, Uchida S (2016) A comparative evaluation of unsupervised anomaly detection algorithms for multivariate data. PLOS ONE 11(4):1–31. DOI 10.1371/journal.pone.0152173

González PC, Velásquez JD (2013) Characterization and detection of taxpayers with false invoices using data mining techniques. Expert Systems with Applications 40(5):1427–1436. DOI https://doi.org/10.1016/j.eswa.2012.08.051

Guimerà R, Sales-Pardo M, Amaral LAN (2007) Module identification in bipartite and directed networks. Physical Review E 76:036102. DOI 10.1103/PhysRevE.76.036102

Guo H, Viktor HL (2004) Learning from imbalanced data sets with boosting and data generation: The DataBoost-IM approach. SIGKDD Explor Newsl 6(1):30–39. DOI 10.1145/1007730.1007736

Guo X, Yin Y, Dong C, Yang G, Zhou G (2008) On the class imbalance problem. In: 2008 Fourth International Conference on Natural Computation, IEEE, vol 4, pp 192–201. DOI 10.1109/ICNC.2008.871

Han H, Wang WY, Mao BH (2005) Borderline-SMOTE: A new over-sampling method in imbalanced data sets learning. In: Advances in Intelligent Computing. Springer Berlin Heidelberg, Berlin, Heidelberg, pp 878–887

He H, Garcia EA (2009) Learning from imbalanced data. IEEE Transactions on Knowledge and Data Engineering 21(9):1263–1284. DOI 10.1109/TKDE.2008.239

He H, Bai Y, Garcia EA, Li S (2008) ADASYN: Adaptive synthetic sampling approach for imbalanced learning. In: 2008 IEEE International Joint Conference on Neural Networks (IEEE World Congress on Computational Intelligence), IEEE, pp 1322–1328. DOI 10.1109/IJCNN.2008.4633969

Holm S (1979) A simple sequentially rejective multiple test procedure. Scandinavian Journal of Statistics 6(2):65–70

Hsu CW, Lin CJ (2002) A comparison of methods for multiclass support vector machines. IEEE Transactions on Neural Networks 13(2):415–425. DOI 10.1109/72.991427

Huang A (2008) Similarity measures for text document clustering. In: Proceedings of the Sixth New Zealand Computer Science Research Student Conference (NZCSRSC2008), Christchurch, New Zealand, pp 49–56

Iman RL, Davenport JM (1980) Approximations of the critical region of the Friedman statistic. Communications in Statistics - Theory and Methods 9(6):571–595

Jo T, Japkowicz N (2004) Class imbalances versus small disjuncts. ACM SIGKDD Explor Newsl 6(1):40–49. DOI 10.1145/1007730.1007737


Jutla IS, Jeub LG, Mucha PJ (2011-2016) A generalized Louvain method for community detection implemented in MATLAB. URL http://netwiki.amath.unc.edu/GenLouvain

Kubat M, Matwin S (1997) Addressing the curse of imbalanced training sets: One-sided selection. In: Proceedings of the Fourteenth International Conference on Machine Learning, Morgan Kaufmann Publishers Inc, San Francisco, CA, USA, pp 179–186

Lancichinetti A, Fortunato S (2009) Community detection algorithms: A comparative analysis. Physical Review E 80:056117. DOI 10.1103/PhysRevE.80.056117

Larremore DB, Clauset A, Jacobs AZ (2014) Efficiently inferring community structure in bipartite networks. Physical Review E: Statistical, Nonlinear, and Soft Matter Physics 90:012805. DOI 10.1103/PhysRevE.90.012805

Li J, Fine JP (2010) Weighted area under the receiver operating characteristic curve and its application to gene selection. Journal of the Royal Statistical Society: Series C (Applied Statistics) 59(4):673–692. DOI 10.1111/j.1467-9876.2010.00713.x

Li X, Wang L, Sung E (2008) AdaBoost with SVM-based component classifiers. Engineering Applications of Artificial Intelligence 21(5):785–795. DOI https://doi.org/10.1016/j.engappai.2007.07.001

Lichman M (2013) UCI machine learning repository. URL http://archive.ics.uci.edu/ml

Liu W, Chawla S, Cieslak DA, Chawla NV (2010) A robust decision tree algorithm for imbalanced data sets. In: Proceedings of the Tenth SIAM International Conference on Data Mining, SIAM, Philadelphia, vol 10, pp 766–777

Liu XY, Wu J, Zhou ZH (2009) Exploratory undersampling for class-imbalance learning. IEEE Transactions on Systems, Man, and Cybernetics, Part B (Cybernetics) 39(2):539–550. DOI 10.1109/TSMCB.2008.2007853

Luts J, Ojeda F, Van de Plas R, De Moor B, Van Huffel S, Suykens JA (2010) A tutorial on support vector machine-based methods for classification problems in chemometrics. Analytica Chimica Acta 665(2):129–145. DOI https://doi.org/10.1016/j.aca.2010.03.030

Macskassy SA, Provost F (2007) Classification in networked data: A toolkit and a univariate case study. Journal of Machine Learning Research 8(May):935–983

Martens D, Provost F (2014) Explaining data-driven document classifications. MIS Quarterly 38(1):73–100

Martens D, Provost F, Clark J, Junque de Fortuny E (2016) Mining massive fine-grained behavior data to improve predictive analytics. MIS Quarterly 40(4):869–888

Mazurowski MA, Habas PA, Zurada JM, Lo JY, Baker JA, Tourassi GD (2008) Training neural network classifiers for medical decision making: The effects of imbalanced datasets on classification performance. Neural Networks 21(2–3):427–436. DOI https://doi.org/10.1016/j.neunet.2007.12.031

Mease D, Wyner AJ, Buja A (2007) Boosted classification trees and class probability/quantile estimation. Journal of Machine Learning Research 8:409–439

Nemenyi P (1963) Distribution-free Multiple Comparisons. Dissertation, Princeton University

Newman MEJ, Girvan M (2004) Finding and evaluating community structure in networks. Physical Review E 69:026113. DOI 10.1103/PhysRevE.69.026113

Ng AY (2004) Feature selection, L1 vs. L2 regularization, and rotational invariance. In: Proceedings of the Twenty-first International Conference on Machine Learning, ACM, New York, NY, USA, ICML '04, pp 78–. DOI 10.1145/1015330.1015435

Ng AY, Jordan MI (2002) On discriminative vs. generative classifiers: A comparison of logistic regression and naive Bayes. In: Advances in Neural Information Processing Systems 14, MIT Press, pp 841–848

Ngai E, Hu Y, Wong Y, Chen Y, Sun X (2011) The application of data mining techniques in financial fraud detection: A classification framework and an academic review of literature. Decision Support Systems 50(3):559–569. DOI https://doi.org/10.1016/j.dss.2010.08.006

Platt JC (1999) Probabilistic outputs for support vector machines and comparisons to regularized likelihood methods. In: Advances in Large-Margin Classifiers, MIT Press, pp 61–74

Porter MA, Onnela JP, Mucha PJ (2009) Communities in networks. Notices of the American Mathematical Society 56(9):1082–1097

Provost F, Fawcett T (2013) Data Science for Business: What you need to know about data mining and data-analytic thinking. O'Reilly Media, Inc

Provost F, Dalessandro B, Hook R, Zhang X, Murray A (2009) Audience selection for on-line brand advertising: Privacy-friendly social network targeting. In: Proceedings of the 15th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, ACM, New York, NY, USA, KDD '09, pp 707–716. DOI 10.1145/1557019.1557098


Raskutti B, Kowalczyk A (2004) Extreme re-balancing for SVMs: A case study. SIGKDD Explor Newsl 6(1):60–69. DOI 10.1145/1007730.1007739

Rosvall M, Bergstrom CT (2008) Maps of random walks on complex networks reveal community structure. Proceedings of the National Academy of Sciences 105(4):1118–1123. DOI 10.1073/pnas.0706851105

Schapire RE (1999) A brief introduction to boosting. In: Proceedings of the 16th International Joint Conference on Artificial Intelligence - Volume 2, Morgan Kaufmann Publishers Inc, San Francisco, CA, USA, IJCAI '99, pp 1401–1406

Schapire RE, Singer Y (1999) Improved boosting algorithms using confidence-rated predictions. Machine Learning 37(3):297–336. DOI 10.1023/A:1007614523901

Shmueli G (2017) Analyzing behavioral big data: Methodological, practical, ethical, and moral issues. Quality Engineering 29(1):57–74. DOI 10.1080/08982112.2016.1210979

Sobhani P, Viktor H, Matwin S (2015) Learning from imbalanced data using ensemble methods and cluster-based undersampling. In: New Frontiers in Mining Complex Patterns: Third International Workshop, NFMCP 2014, Held in Conjunction with ECML-PKDD 2014, Nancy, France, September 19, 2014, Revised Selected Papers. Springer International Publishing, Cham, pp 69–83. DOI 10.1007/978-3-319-17876-9_5

Stankova M (2016) Classification within network data with a bipartite structure. Dissertation, University of Antwerp

Stankova M, Martens D, Provost F (2015) Classification over bipartite graphs through projection. Working Papers 2015001, University of Antwerp, Faculty of Applied Economics

Sun Y, Kamel MS, Wong AK, Wang Y (2007) Cost-sensitive boosting for classification of imbalanced data. Pattern Recognition 40(12):3358–3378. DOI https://doi.org/10.1016/j.patcog.2007.04.009

Suykens JA, Van Gestel T, De Brabanter J, De Moor B, Vandewalle J (2002) Least squares support vector machines. World Scientific, Singapore

Tang B, He H (2015) ENN: Extended nearest neighbor method for pattern recognition [research frontier]. IEEE Computational Intelligence Magazine 10(3):52–60. DOI 10.1109/MCI.2015.2437512

Tang Y, Zhang YQ, Chawla NV, Krasser S (2009) SVMs modeling for highly imbalanced classification. IEEE Transactions on Systems, Man, and Cybernetics, Part B (Cybernetics) 39(1):281–288. DOI 10.1109/TSMCB.2008.2002909

Tobback E, Moeyersoms J, Stankova M, Martens D (2016) Bankruptcy prediction for SMEs using relational data. Working Paper 2016004, University of Antwerp, Faculty of Applied Economics

Verbeke W, Dejaeger K, Martens D, Hur J, Baesens B (2012) New insights into churn prediction in the telecommunication sector: A profit driven data mining approach. European Journal of Operational Research 218(1):211–229. DOI https://doi.org/10.1016/j.ejor.2011.09.031

Veropoulos K, Campbell I, Cristianini N (1999) Controlling the sensitivity of support vector machines. In: Proceedings of the International Joint Conference on Artificial Intelligence, Stockholm, Sweden (IJCAI99), pp 55–60

Whitrow C, Hand DJ, Juszczak P, Weston D, Adams NM (2009) Transaction aggregation as a strategy for credit card fraud detection. Data Mining and Knowledge Discovery 18(1):30–55. DOI 10.1007/s10618-008-0116-z

Wickramaratna J, Holden SB, Buxton BF (2001) Performance degradation in boosting. In: Proceedings of the Second International Workshop on Multiple Classifier Systems, Springer, London, UK, MCS '01, pp 11–21

Yen SJ, Lee YS (2009) Cluster-based under-sampling approaches for imbalanced data distributions. Expert Systems with Applications 36(3, Part 1):5718–5727. DOI https://doi.org/10.1016/j.eswa.2008.06.108

Yu HF, Lo HY, Hsieh HP, Lou JK, McKenzie TG, Chou JW, Chung PH, Ho CH, Chang CF, Wei YH, et al (2010) Feature engineering and classifier ensemble for KDD Cup 2010. In: Proceedings of the KDD Cup 2010 Workshop, pp 1–16

Zha H, He X, Ding C, Simon H, Gu M (2001) Bipartite graph partitioning and data clustering. In: Proceedings of the Tenth International Conference on Information and Knowledge Management, ACM, New York, NY, USA, CIKM '01, pp 25–32. DOI 10.1145/502585.502591

Zhang J, Mani I (2003) KNN approach to unbalanced data distributions: A case study involving information extraction. In: Proceedings of the ICML 2003 Workshop on Learning from Imbalanced Datasets, Washington DC

Ziegler CN, McNee SM, Konstan JA, Lausen G (2005) Improving recommendation lists through topic diversification. In: Proceedings of the 14th International Conference on World Wide Web, ACM, New York, NY, USA, WWW '05, pp 22–32. DOI 10.1145/1060745.1060754



in terms of predictive performance and timings, researchers in this field can easily assess and compare their proposed techniques with the methods we have explored in this article. To enable this process, we provide implementations, datasets and results in our on-line repository: http://www.applieddatamining.com/cms/?q=software.

A third contribution lies in the fact that the applied sampling methods need to be adapted to cope with behaviour data. In Section 3.1, we provide specific implementations of the SMOTE and ADASYN algorithms and highlight the differences with their original formulations. The informed undersampling techniques presented in Section 3.2 rely on nearest neighbour computations, which require a different similarity measure than the one used for 'traditional data'.
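To illustrate the kind of adaptation meant here, the sketch below applies SMOTE-style interpolation to a sparse minority-class matrix, using cosine similarity to find neighbours. This is an assumption-laden sketch, not the paper's exact adaptation; all names are ours.

```python
import numpy as np
from scipy import sparse

def smote_sparse(X_min, k=3, n_new=10, rng=None):
    """SMOTE-style oversampling for a sparse minority-class matrix.
    Neighbours are the k most cosine-similar minority instances, which
    suits sparse high-dimensional behaviour data better than Euclidean
    distance. Synthetic examples interpolate between an instance and a
    random neighbour."""
    rng = np.random.default_rng(rng)
    X = sparse.csr_matrix(X_min, dtype=float)
    # row-normalize so that dot products equal cosine similarities
    norms = np.sqrt(np.asarray(X.multiply(X).sum(axis=1)).ravel())
    norms[norms == 0] = 1.0
    Xn = sparse.diags(1.0 / norms) @ X
    sim = np.asarray((Xn @ Xn.T).todense())
    np.fill_diagonal(sim, -np.inf)        # exclude self-similarity
    nbrs = np.argsort(-sim, axis=1)[:, :k]
    out = []
    for _ in range(n_new):
        i = rng.integers(X.shape[0])
        j = nbrs[i, rng.integers(k)]
        lam = rng.random()
        # convex combination stays within the sparse minority region
        out.append(X[i] + lam * (X[j] - X[i]))
    return sparse.vstack(out)
```

Computing the full similarity matrix is only feasible because the minority class is small, which is exactly the imbalanced setting.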

Traditional studies integrating boosting with SVM-based component classifiers typically use an RBF-kernel (Wickramaratna et al 2001; García and Lozano 2007; Li et al 2008). The specific boosting algorithm we propose in Section 3.3.1 combines a linear SVM with a logistic regression (LR) to form a single confidence-rated base learner. This specific kind of weak learner has, to our knowledge, never been proposed in earlier studies, yet proves to be very valuable in the setting of behaviour data.
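A minimal sketch of such a confidence-rated base learner, assuming scikit-learn and a Platt-style mapping from SVM scores to probabilities (parameter choices and names are illustrative, not the authors'):

```python
import numpy as np
from sklearn.svm import LinearSVC
from sklearn.linear_model import LogisticRegression

def svm_lr_weak_learner(X, y, C=1.0, sample_weight=None):
    """Fit a linear SVM, then calibrate its output scores w^T x + b
    with a univariate logistic regression, yielding a confidence-rated
    learner that returns P(y = 1 | x). sample_weight allows use inside
    a boosting loop."""
    svm = LinearSVC(C=C).fit(X, y, sample_weight=sample_weight)
    scores = svm.decision_function(X).reshape(-1, 1)
    lr = LogisticRegression().fit(scores, y, sample_weight=sample_weight)

    def predict_proba(X_new):
        s = svm.decision_function(X_new).reshape(-1, 1)
        return lr.predict_proba(s)[:, 1]

    return predict_proba
```

Because the learner outputs calibrated confidences rather than hard labels, it plugs directly into confidence-rated boosting schemes.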

Our fifth contribution lies in the exploration of a larger part of the parameter space. Chawla et al (2002) and He et al (2008) both employ K = 5 nearest neighbours in their experiments. Zhang and Mani (2003) present, among others, the Near-miss1 undersampling method, where majority class instances are selected based on their average distance to the three closest minority examples. Liu et al (2009) investigate their proposed EasyEnsemble algorithm (see Section 3.3.3 for a short description) using S = 4 subsets and T = 10 rounds of boosting. All of the aforementioned parameter settings are chosen without a clear motivation. In our experiments, we will consider a larger proportion of parameter space by varying, for instance, the number of nearest neighbours used in oversampling and undersampling methods, the number of subsets and boosting rounds in EasyEnsemble, etc. In doing this, we can more accurately compare distinct methods dealing with imbalanced learning. Parameter settings for each of the methods proposed in this study are shown in Section 4.2. Furthermore, we also study the effect of the number of subsets used in EasyEnsemble. As mentioned before, the authors only use 4 subsets.

The studies mentioned in the literature overview of Section 1.1 usually postulate a new method and compare it to some or more closely related variants by performing experiments over several datasets and reporting performance outcomes, without any further statistical grounding or interpretation. In this article, we provide statistical evidence by conducting hypothesis tests, which enable a more meaningful comparison.
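As an illustration of such a test, a Friedman test over per-dataset results can be run as below; the AUC numbers are invented for illustration and are not the paper's results.

```python
from scipy import stats

# Hypothetical AUC results of three methods on five datasets
# (one value per method per dataset).
method_a = [0.71, 0.80, 0.65, 0.90, 0.77]
method_b = [0.69, 0.78, 0.66, 0.88, 0.74]
method_c = [0.60, 0.70, 0.58, 0.80, 0.69]

# Friedman test: do the methods' average ranks across datasets
# differ significantly? A small p-value justifies post-hoc tests
# (e.g. Nemenyi or Holm) between individual method pairs.
stat, p = stats.friedmanchisquare(method_a, method_b, method_c)
print(stat, p)
```

This matches the usual protocol of Demšar (2006): a Friedman omnibus test first, followed by pairwise post-hoc comparisons only when the omnibus test rejects.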

2 Preliminaries

2.1 Behaviour data

The last decades have witnessed an explosion in data collection, data storage and data processing capabilities, leading to significant advances across various domains (Chen et al 2014). With the current technology, it is now possible to capture specific


conducts of persons or organisations. Following the definition of Shmueli (2017), behaviour data models fine-grained actions and/or interactions of entities, such as persons or objects. To distinguish 'traditional' data from behaviour data, it is appropriate to introduce the framework proposed in Junque de Fortuny et al (2014a): traditional predictive analytics describes instances by using a small number of features (dozens up to hundreds). These datasets are usually dense in nature, meaning that each instance has a non-trivial value for each of its features (or at least for most of them). Recent years have witnessed a growing number of applications making use of behaviour data, which reflect specific behaviours of individuals or organisations. Think for example about users liking certain pages on Facebook, visiting certain websites or making transactions with specific merchants, and organisations interacting with suppliers/clients. Such data are high-dimensional, containing thousands or even millions of features. Indeed, the number of websites a user can visit, the number of Facebook pages a user can like or the number of unique merchants a user can transact with are enormous, ranging up to millions. A key characteristic is the sparse nature of behaviour data, resulting in a sparse matrix representation (see further in this section). The majority of attributes have a trivial value of "zero" (or not present) or, as Junque de Fortuny et al (2014a) formulate it, "people only have a limited amount of behavioural capital". To summarize, behaviour data are very high-dimensional (10^4 – 10^8) and sparse, and mostly originate from capturing the fine-grained behaviours of persons or companies.²

Besides the differences in structure/representation, behaviour data also show distinct properties in comparison to traditional data. Junque de Fortuny et al (2014a) proved empirically that larger behaviour datasets, in terms of the number of instances or features³, result in significant performance gains. This is in sharp contrast to the literature on sampling (reducing the number of instances) or feature selection that is commonly applied to traditional types of data, where there are usually a large number of irrelevant features that increase the variance and the opportunity to overfit (Provost and Fawcett 2013). Behaviour-type data show a different "relevance structure", in the sense that most of the features provide a small, though relevant, amount of additional information about the target prediction (Junque de Fortuny et al 2014a). Furthermore, the instances and features show a power-law distribution (Stankova et al 2015): the vast amount of instances have a low number of non-zero (active) features and, conversely, the majority of features are only present for a few instances. In her doctoral dissertation, Stankova (2016) showed that applying non-linear classifiers to behaviour-type data does not improve predictive performance in comparison to the plain application of linear techniques. This is definitely a major contrast to prior literature dealing with dense datasets, where linear methods generally have a lower predictive performance compared to highly non-linear techniques (Baesens et al 2003).

Imbalanced behaviour data occur naturally across a wide range of applications. Some examples include companies employing data on the specific websites a user visits for targeted online advertising (Provost et al 2009), where usually only a relatively low amount of people respond positively to the advertisement. Data on the individual merchants with whom one transacts can be used to detect credit card fraud: the study of Bhattacharyya et al (2011), which uses real-life credit card transactional data, mentions that of the 49,858,600 transactions considered, only 2,420 were fraudulent (0.005%). Besides these marketing and fraud domains, other areas can be considered, such as churn prediction (Verbeke et al 2012), default prediction (Tobback et al 2016) and predictive policing (Bachner 2013). Despite the abundant application domains, little is known on the effect of imbalance on the performance of classification algorithms when dealing with this kind of data. Needless to say that empirical and theoretical developments in this field can give rise to major benefits for academia, companies and governments.

² The last requirement is not strictly necessary when we talk about behaviour data in our study (the sparseness and high-dimensionality properties are sufficient).

³ In this sparse setting, instance removal and feature selection are in a certain sense equivalent to one another.

Behaviour data can be represented as a sparse matrix or, equivalently, as a bipartite graph. In the matrix representation, each instance i corresponds to a single row and columns correspond to specific features. Let's take the example of users rating films, where we wish to predict the gender of each user. In that case, each user corresponds to a row in the matrix and each specific film corresponds to a single column. Alternatively, behaviour data can also be represented as a bipartite graph (Stankova et al 2015). Consider a set of bottom nodes N_B and a set of top nodes N_T. A bipartite graph is a network that contains edges only between nodes of a different type: each bottom node n_b ∈ N_B can only be connected to nodes n_t ∈ N_T. Returning to our example, each user i corresponds with a bottom node n_b,i and each film j corresponds with a top node n_t,j.
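The matrix representation can be made concrete with SciPy's sparse matrices; the ratings below are invented purely for illustration.

```python
import numpy as np
from scipy import sparse

# Three users (rows / bottom nodes) rating four films (columns / top
# nodes). A non-zero entry marks an edge of the bipartite graph.
rows = [0, 0, 1, 2, 2, 2]   # user index of each rating
cols = [0, 2, 1, 0, 2, 3]   # film index of each rating
vals = [1, 1, 1, 1, 1, 1]   # 1 = "user rated this film"
X = sparse.csr_matrix((vals, (rows, cols)), shape=(3, 4))

print(X.toarray())
# each user's "behavioural capital": the number of active features
print(np.asarray(X.sum(axis=1)).ravel())
```

Only the six non-zero entries are stored, which is what keeps datasets with millions of columns tractable.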

Imbalanced behaviour data occur when the collection of instances contains substantially more examples of one class as compared to the other. In our running example of the previous paragraph, this means that there might be far more males rating films than females. In the datasets presented in Section 4.1, each of the instances contains a label indicating the particular class the example belongs to. In practice, instances do not necessarily have a label associated, and the goal is to infer a label/score for these unknown instances based on the examples with known labels. Note that we focus on the two-class classification problem, where labels are limited to two types. A multi-class classifier can be obtained by solving a sequence of binary classification problems. This can for instance be accomplished with one-versus-one or one-versus-all setups (Hsu and Lin 2002).

2.2 Support vector machines (SVMs)

The support vector machine (SVM) is a "state-of-the-art" classification technique that has been applied in a wide variety of domains (Suykens et al 2002). The training data consist of a set of d-dimensional input vectors x_i with corresponding labels y_i ∈ {−1, 1}, i = 1, …, m. A linear SVM constructs a linear classifier (hyperplane) in the input space. The hyperplane is constructed in such a way that as few points as possible are wrongfully classified, while simultaneously maximizing the margin between the two classes. This trade-off between minimizing the model complexity and reducing misclassification is governed by the regularization parameter C. The linear SVM is

8 Jellis Vanhoeyveld David Martens

the solution to the following quadratic optimization problem:

\min_{w,b,\xi_i} \; \frac{w^T w}{2} + C \sum_{i=1}^{m} \xi_i
\quad \text{s.t.} \quad y_i \left( w^T x_i + b \right) \ge 1 - \xi_i, \; i = 1,\dots,m
\qquad\qquad\;\; \xi_i \ge 0, \; i = 1,\dots,m \qquad (1)

where b represents the bias and ξ_i are the slack variables measuring classification errors. The classifier is given by y(x) = \mathrm{sign}(w^T x + b), where w^T x + b represents the output score. It should be noted that we usually solve the dual form of equation (1), with solution y(x) = \mathrm{sign}\left[ \sum_{i=1}^{m} \alpha_i y_i x^T x_i + b \right] and dual variables⁴ α_i. An overview of the issues regarding SVMs and imbalanced classification for low-dimensional data can be found in Akbani et al (2004). Briefly, because of the imbalance, the majority of the slack variables ξ_i represent errors with respect to the majority class. This means that the minority class is under-represented and has a minor contribution to the goal function.

In this study we have opted for a linear SVM (using the LIBLINEAR (Fan et al 2008) package) as the base learner to classify behaviour data⁵. The feature vector will have a sparse and high-dimensional representation with d = |N_T|. Note that the SVM formulation does not change in this setting. The solution vector w will become high-dimensional. The kernel matrix contains inner products of sparse and high-dimensional vectors, resulting in a (possibly) sparse matrix.
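As an illustration of this setup (our own sketch, not the authors' code), a linear SVM can be trained on a sparse binary behaviour matrix with scikit-learn's LinearSVC, which wraps the LIBLINEAR library; note that scikit-learn's default objective is a squared-hinge variant of equation (1).

```python
import numpy as np
from scipy.sparse import csr_matrix
from sklearn.svm import LinearSVC

# Toy behaviour matrix: 6 users (rows) x 5 items (columns), binary and sparse.
X = csr_matrix(np.array([
    [1, 0, 1, 0, 0],
    [1, 1, 0, 0, 0],
    [0, 0, 1, 1, 0],
    [0, 0, 0, 1, 1],
    [1, 0, 0, 0, 1],
    [0, 1, 1, 0, 0],
], dtype=np.float64))
y = np.array([1, 1, -1, -1, 1, -1])

# C is the regularization parameter of equation (1); LIBLINEAR handles the
# sparse high-dimensional input without densifying it.
clf = LinearSVC(C=1.0)
clf.fit(X, y)

# Output scores w^T x + b; the predicted label is their sign.
scores = clf.decision_function(X)
pred = np.sign(scores)
```

The solution vector `clf.coef_` has one entry per top node (here 5), matching the d = |N_T| dimensionality discussed above.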

The reasons we have chosen an SVM are twofold. First of all, it is a very popular technique for dealing with traditional and behaviour data and has been applied in many diverse domains (Suykens et al 2002). As we have noted in Section 1.1, tailored techniques need to be applied to behaviour datasets and roughly fall in two categories: regularization based techniques and heuristic approaches. The latter type of techniques are not suitable for dealing with traditional data. The remaining regularization based techniques have formed the subject of traditional studies dealing with imbalanced data. We have opted for this type of techniques since we can then easily compare our results to the conclusions obtained from previous studies. Another reason we have opted for an SVM is the fact that many of our proposed techniques, see Section 3 for details, rely on a boosting process. Wickramaratna et al (2001); García and Lozano (2007) noted that using a strong learner can result in performance degradation during boosting. Regularization based techniques offer an added element of flexibility in the sense that the strength of the learner can be controlled by varying the regularization parameter. Heuristic approaches today do not offer this attractive feature.

⁴ Also called support values.
⁵ In Section 5 we will consider different types of base learners.


2.3 Evaluation metrics

The performance measures that are typically used in traditional studies dealing with imbalanced data are accuracy, sensitivity, specificity, precision, F-measure and G-mean, see for instance Bhattacharyya et al (2011); Han et al (2005); González and Velásquez (2013). These measures are derived from the confusion matrix and are based on a certain threshold applied on the output scores of the classifier, where the threshold is usually contained within the classification algorithm. There are two main issues with this approach. First of all, the built-in threshold that is applied on the scores might not be suitable. This might cause low performance values with respect to these criteria, yet if we were to simply adapt the threshold, the same performance criteria might show excellent results. The second issue lies in the fact that the chosen threshold could be irrelevant with respect to the available capacity. We address this issue with a simple example. If a targeted advertisement company were to apply this classifier (with built-in threshold) on new customers, the classifier might choose to predict 5% of all possible customers to target as positive. Yet the company only has a marketing budget that allows targeting 0.1% of all possible customers. It is clear that the chosen threshold is inappropriate here. For these reasons, we have chosen to opt for the area under the ROC-curve (AUC) instead⁶. Chawla et al (2004) note that ROC-curves (and cost-curves) should be preferred over these traditional measures. The AUC, which measures the area under the ROC-curve, is more appropriate since it is independent of class skew and measures the ranking abilities of the classifier (Fawcett 2006). It answers the question: if we were to rank all instances according to output scores, is the classifier able to place the positive instances near the top of the list and the negative instances near the bottom? Because the method scans over all possible thresholds, it is independent of a specific cut-off value. Another reason we have chosen AUC is the fact that many boosting and cost-sensitive learning techniques evaluate the performance using accuracy or misclassification cost, which suffer from the same issues as previously mentioned. Hence, we are one of the first to evaluate these techniques with respect to AUC. Note that AUC is also the preferred metric in the assessment of unsupervised anomaly detection techniques (Goldstein and Uchida 2016). Outliers/anomalies are strongly related to our field, as they signify rare events that are of specific importance to the analyst.
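The threshold-independence of AUC described above can be verified directly; a small illustration (our own toy scores, not from the paper's datasets) using scikit-learn's roc_auc_score:

```python
from sklearn.metrics import roc_auc_score

y_true = [1, 1, 0, 0, 0, 0, 0, 0]            # imbalanced toy labels
scores = [0.9, 0.4, 0.8, 0.3, 0.2, 0.1, 0.05, 0.0]

auc = roc_auc_score(y_true, scores)
# AUC depends only on the ranking: any strictly monotone rescaling of the
# scores leaves it unchanged, unlike threshold-based confusion-matrix measures.
auc_scaled = roc_auc_score(y_true, [10 * s - 3 for s in scores])
```

Here 11 of the 12 positive/negative pairs are ranked correctly, so the AUC equals 11/12, and rescaling the scores does not change it.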

Though we focus on AUC, other measures are suitable for performance assessment of imbalanced data. Weighted area under the ROC-curve (wAUC) (Li and Fine 2010) is able to emphasize certain regions of ROC-space. Lift-curves (Bekkar et al 2013), popular in the marketing domain, are ideal to assess the prevalence of positive cases among instances ranked highly by the classifier. These curves can be evaluated according to the available capacity. When cost information is available, this should be integrated in the performance assessment criterion. Indeed, the cost of a false negative is usually much larger than the cost of a false positive. We could then opt for cost-curves (Whitrow et al 2009; Bhattacharyya et al 2011) that can also be evaluated with consideration of the available capacity. Few studies make use of costs

⁶ We provide traditional measures (sensitivity, specificity, G-means, F-measure) in our online repository: http://www.applieddatamining.com/cms/?q=software


(Ngai et al 2011), mainly because these are difficult to determine, uncertain and have a temporal characteristic. To conclude this section, we note that each of the performance assessment criteria mentioned in this paragraph require some additional information in the form of a weight function (wAUC), capacity requirements and/or costs (lift/cost-curves). These measures are application specific, and this is the main reason we excluded them in our study.

3 Methods

3.1 Oversampling

Over the years, the data mining community investigated several techniques that balance the data distribution by oversampling the minority class. In this section, we investigate the basic oversampling with replacement (OSR) approach in conjunction with synthetic sample generation procedures. The first technique simply duplicates minority class instances by a certain amount. Several references dealing with traditional low-dimensional data note that this technique may make the decision regions of the learner smaller and too specific on the replicated instances, causing the learner to overfit (Chawla et al 2002; Han et al 2005; Liu et al 2009; He and Garcia 2009). The synthetic approaches are designed to overcome this overfitting behaviour by generating new non-overlapping instances in the minority class space. The techniques we investigate are SMOTE (Chawla et al 2002) and ADASYN (He et al 2008).

Consider a certain minority class instance x_i. In the traditional setting, SMOTE and ADASYN will generate a new synthetic instance x_syn by choosing a random point on the line segment between the point x_i and one of its K minority class nearest neighbours (computed according to Euclidean distance). In SMOTE, each original minority class instance x_i generates the same number of synthetic instances, whereas the ADASYN algorithm generates a variable number of synthetic instances by putting a larger weight on the harder to learn minority instances. This way, the learner is more focused toward difficult instances.

These techniques need to be adapted when dealing with binary behaviour data, where each instance is represented by a large and sparse binary vector. The main differences with the original versions of SMOTE and ADASYN are indicated in Table 1 and are explained in more detail in the following paragraphs. The first difference lies in the generation of the new synthetic instances. As before, a synthetic sample is constructed based on two original minority instances. When both instances have a 0 or 1 in their corresponding column, the synthetic sample will also show a 0 or 1, respectively, at the considered position. When only one of the two minority instances shows a 1, the decision will be made according to a user specified parameter prioropt. This parameter can be one of the following three options:

– "FlipCoin", where there is a 50% probability that the synthetic instance will show a 1 at the considered position.

– "Prior", where the value of the synthetic sample is determined by the prior within the minority class in the corresponding column. One generates a random number u in the interval [0,1] and puts a 1 in the corresponding position if u is smaller than the prior.

– "Reverse Prior", where one generates a random number u in the interval [0,1] and puts a 1 in the corresponding position if u is larger than the prior within the minority class for this column/feature.
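The tie-breaking step for a single synthetic instance can be sketched as follows (illustrative code of our own; the helper name synthetic_binary and its arguments are not from the paper):

```python
import numpy as np

rng = np.random.default_rng(0)

def synthetic_binary(xi, xn, prior, prior_opt="Prior"):
    """Combine two binary minority instances into one synthetic instance.

    Positions where both parents agree are copied verbatim; ties (exactly one
    parent shows a 1) are resolved by the prior_opt strategy described above.
    `prior` holds, per column, the fraction of minority instances with a 1.
    """
    out = np.where(xi == xn, xi, -1).astype(float)
    ties = out == -1
    u = rng.random(ties.sum())
    if prior_opt == "FlipCoin":
        out[ties] = (u < 0.5).astype(int)
    elif prior_opt == "Prior":
        out[ties] = (u < prior[ties]).astype(int)
    else:  # "Reverse Prior"
        out[ties] = (u > prior[ties]).astype(int)
    return out.astype(int)

xi = np.array([1, 0, 1, 0])
xn = np.array([1, 0, 0, 1])
prior = np.array([0.9, 0.1, 0.5, 0.5])   # column priors within the minority class
syn = synthetic_binary(xi, xn, prior, "Prior")
```

Positions 0 and 1 are copied (both parents agree), while positions 2 and 3 are resolved stochastically.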

The second difference lies in the way the nearest neighbours are determined. It would be unwise to consider Euclidean distance in that respect, because it treats a 0–0 match in the same way as a 1–1 match. Since we are working with dichotomous variables (i.e. present or absent), a 1–1 match at a certain position is far more informative than a 0–0 match. For instance, two users visiting the same web page contains more information than two users who didn't visit that specific page. The similarity between two instances is defined by a user-specified parameter simmeasure. We have limited ourselves to two popular choices: "Jaccard" uses the Jaccard similarity measure (Finch 2005) and "Cosine" uses the cosine similarity measure (Huang 2008). In principle, one could apply any of the metrics summarized in Stankova et al (2015).
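The difference between 0–0 and 1–1 matches is easy to see in code; a small illustration with our own helper functions (assuming NumPy):

```python
import numpy as np

def jaccard(a, b):
    # |intersection| / |union| over positions set to 1; 0-0 matches
    # contribute to neither numerator nor denominator.
    inter = np.sum((a == 1) & (b == 1))
    union = np.sum((a == 1) | (b == 1))
    return inter / union if union else 0.0

def cosine(a, b):
    denom = np.sqrt(a.sum()) * np.sqrt(b.sum())  # binary vectors: sum = count of 1s
    return np.sum(a * b) / denom if denom else 0.0

a = np.array([1, 1, 0, 0, 0])
b = np.array([1, 0, 0, 0, 0])

# Appending shared zero columns changes neither measure, unlike Euclidean
# distance, which would treat the extra 0-0 matches as increased similarity.
a2 = np.concatenate([a, np.zeros(100, dtype=int)])
b2 = np.concatenate([b, np.zeros(100, dtype=int)])
```

Both measures are invariant to the padded zeros, which is exactly the property argued for above.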

Table 1 Differences between the original SMOTE and ADASYN implementations and the versions SMOTEbeh and ADASYNbeh tailored for behaviour data (parameter explanations are provided in Section 3.1)

                               SMOTE                          SMOTEbeh         ADASYN                         ADASYNbeh
Amount of oversampling         N                              β                β                              β
Synthetic sample generation    random point on line segment   prioropt         random point on line segment   prioropt
Similarity measure             Euclidean                      Jaccard/Cosine   Euclidean                      Jaccard/Cosine
Number of nearest neighbours   K                              K, K̂            K                              K, K̂

A detailed pseudo-code implementation of our versions of SMOTE and ADASYN, called SMOTEbeh and ADASYNbeh, is shown in Algorithm 1. Also note that we introduced an extra parameter K̂ to decouple the determination of the number of synthetic instances that need to be generated for a certain minority instance and the number of nearest neighbours it uses to obtain the synthetic instances.

The experimental set-up adopted by Chawla et al (2002); He et al (2008) considers K = 5 as the number of nearest neighbours used, without a detailed motivation. Furthermore, the latter paper compares SMOTE with an oversampling percentage N of 200% (meaning that the size of the newly created synthetic data instances is twice as large as the size of the original minority training data) with a completely balanced dataset (β = 1 in Algorithm 1) in ADASYN. In our experiments, we will consider a variety of possible K-values and compare SMOTE and ADASYN with identical oversampling rates, controlled by a single parameter β.


Algorithm 1 SMOTEbeh and ADASYNbeh pseudo-code implementation for binary behaviour data

Input: X_min, X_maj, β, prioropt, simmeasure, K, K̂
a) Determine the total amount of synthetic minority instances that need to be generated (β ∈ [0,1] is a parameter that controls the amount of oversampling; β = 1 means a fully balanced dataset will be created):
    G = (|X_maj| − |X_min|) × β
b) Determine the number of synthetic samples g_i that need to be generated for each minority class instance x_i:
if SMOTE then
    g_i ← ⌈G / |X_min|⌉
else if ADASYN then
    calculate the K nearest neighbours (with the simmeasure option) for instance x_i from the set {X_min \ x_i, X_maj} and determine Δ_i, the number of majority class nearest neighbours. Next, calculate r_i = Δ_i / K and normalize these values: r̂_i = r_i / Σ_{j=1}^{|X_min|} r_j
    g_i ← ⌈r̂_i × G⌉
end if
c) Generate g_i synthetic samples for minority instance x_i:
calculate the K̂ nearest neighbours (with the simmeasure option) for instance x_i from the set X_min \ x_i. Additionally, remove those nearest neighbours that have a similarity of 0 with x_i. The remaining nearest neighbours form the set K_used. If this set turns out to be empty, set K_used = {x_i}.
for iter = 1 → g_i do
    randomly choose 1 nearest neighbour from the set K_used
    generate a synthetic minority sample from x_i and the chosen nearest neighbour (according to prioropt)
end for
d) Because Σ g_i ≥ G, randomly remove synthetic points until the total number of synthetic samples equals G.

3.2 Undersampling

In this section, we will compare the simple random undersampling technique (RUS) with informed undersampling approaches. The first method randomly discards majority class training instances. While this technique can achieve fast training performance of the underlying base learner, an obvious disadvantage is the fact that it might discard potentially useful majority class instances. The informed approaches try to intelligently retain the most informative majority class instances, in the hope to increase predictive performance while at the same time keeping the fast training speed of the underlying classifier.

The first set of informed undersampling techniques are based on the methods proposed by Zhang and Mani (2003); Chyi (2003). The K-nearest neighbour 'classifier' is used to determine the importance of each majority class training instance by calculating the total similarity with the K closest minority class training set examples. Regarding similarity computations for binary behaviour data, we refer to the related discussion in Section 3.1. The first technique, called "Closest Knn", retains majority class examples that are closest to the minority class instances. These instances are the most difficult to classify and we would expect them to be most informative. The second method, called "Closest tot sim", is similar to the previously described technique. The difference is that it no longer computes similarities with the K closest minority neighbours; instead, it calculates the total similarity with all minority instances in determining the importance. The main reason we included this technique is computational speed⁷. The last techniques, called "Farthest Knn" and "Farthest tot sim", are included for comparison with the previously mentioned techniques. Their implementation is identical to the previously described techniques, however they retain majority class examples that are farthest from the minority class instances. For each of the proposed methods in this paragraph, the amount of undersampling is controlled by a user-specified parameter β_u according to the following formula:

Nr_{rem} = \lfloor (|X_{maj}| - |X_{min}|) \times \beta_u \rfloor \qquad (2)

where Nr_rem represents the amount of majority class instances to be discarded; β_u = 1 means a completely balanced dataset is obtained.
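A rough sketch of the "Closest tot sim" variant together with equation (2) (our own illustrative implementation, using cosine similarity; the function name is ours):

```python
import numpy as np

rng = np.random.default_rng(1)

def closest_tot_sim(X_maj, X_min, beta_u):
    """Keep the majority instances with the highest total cosine similarity
    to the whole minority class; discard Nr_rem of them per equation (2)."""
    norms_maj = np.maximum(np.linalg.norm(X_maj, axis=1), 1e-12)
    norms_min = np.maximum(np.linalg.norm(X_min, axis=1), 1e-12)
    sims = (X_maj @ X_min.T) / np.outer(norms_maj, norms_min)
    total = sims.sum(axis=1)                       # total similarity per majority row
    nr_rem = int((len(X_maj) - len(X_min)) * beta_u)   # equation (2)
    keep = np.argsort(-total)[: len(X_maj) - nr_rem]   # closest = highest total sim
    return X_maj[keep]

X_maj = rng.integers(0, 2, size=(10, 6)).astype(float)
X_min = rng.integers(0, 2, size=(4, 6)).astype(float)
kept = closest_tot_sim(X_maj, X_min, beta_u=1.0)   # balanced: keep 4 majority rows
```

Sorting in ascending order of total similarity instead would yield the "Farthest tot sim" counterpart.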

The second set of informed undersampling techniques aim at targeting the within-class imbalance problem and are based on the approach proposed in Sobhani et al (2015). They postulate that this within-class imbalance problem is more pronounced in the case of undersampling methods. If we were to randomly select majority class instances, then the probability of drawing an instance from small disjuncts within the majority class would be very low. These regions might therefore contain no representatives and remain unlearned. The authors chose to address this issue by clustering the majority class instances in a first step and subsequently selecting an equal number of representatives from each cluster. The reported results show their approach to outperform the CBO-algorithm (see Section 1.1). In the following paragraphs, we will dig deeper into the subject of clustering behaviour data. We already refer to Algorithm 2 for an overview of our cluster-based undersampling method (CBU).

As we noted in Section 2.1, behaviour data can be represented as a bipartite graph. The clustering of behaviour data⁸ aims at finding groups of nodes (communities) that connect more to each other than to other nodes in the network. This subject is currently an active area of research, with a rapid evolution of a vast number of cluster detection techniques (Zha et al 2001; Dhillon 2001; Larremore et al 2014; Beckett 2016). We refer to Porter et al (2009); Fortunato (2010); Alzahrani and Horadam (2016) for detailed surveys on the problem. It should be noted that the vast majority of publications deal with the subject of clustering unigraphs (networks with only one type of nodes). It is only fairly recent that interest grew in the clustering of bigraphs. In our implementations, we have chosen the popular modularity-based approaches⁹ for clustering bigraphs, and these fall into two directions: the modularity function that is used for unigraphs is adapted to be suitable for bigraphs, see for instance the work of Barber (2007). The other direction, which we adopt in our study, projects the bigraph to a unigraph of bottom nodes and performs community detection on the projection using traditional modularity definitions. Note that Guimerà et al (2007) observed no difference in the obtained communities using either direction.

⁷ For each majority class instance, we no longer need to sort the similarities with all minority instances in determining the K largest values.

⁸ This subject is more commonly known as community detection in bipartite graphs.
⁹ Modularity-based approaches attempt to optimize a quality function known as modularity for finding community structures in networks, and rely on the use of heuristics due to the complexity of the problem.


In this article, we adopt the methodology proposed in Alzahrani and Horadam (2016), which consists of projecting the bigraph to a unigraph of bottom nodes and applying¹⁰ the Louvain algorithm (Blondel et al 2008) on the projection. Lancichinetti and Fortunato (2009) performed a comparative study regarding the performance of 12 community detection algorithms and concluded the Louvain method to be the best modularity-based algorithm and second best among all algorithms. The heuristic is very fast, with a O(m) complexity, with m the number of edges in the unigraph. We have chosen the Louvain algorithm because of its speed performance and the availability of a toolbox (Jutla et al 2011-2016) that is directly compatible with our implementations. The toolbox provides a generalized implementation of the Louvain algorithm, in the sense that multiple definitions of modularity are possible. The quality function we chose is the popular Newman-Girvan modularity (Newman and Girvan 2004).

With respect to the projection, Alzahrani and Horadam (2016) connect two bottom nodes if they have at least one top node in common. The connection weight between two bottom nodes in the projection is set to the number of shared top nodes. In our implementation, we adapted the connection weights in accordance with Stankova et al (2015) as follows: first of all, we assign weights to the top nodes corresponding to the hyperbolic tangent applied to the inverse degree of the top node. Next, the connection weight between two bottom nodes in the projection corresponds with the total weight of the shared top nodes. Top nodes having low degrees therefore obtain a higher contribution in the projection (e.g. two users making a transaction to a local book store are assumed to be more closely connected to each other than two users making a transaction to a large retail store).
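The projection step with hyperbolic-tangent top-node weights can be sketched on a toy bigraph (illustrative NumPy code of our own, not the authors' implementation):

```python
import numpy as np

# Toy bigraph: rows = bottom nodes (majority instances), cols = top nodes.
A = np.array([
    [1, 1, 0],
    [1, 0, 1],
    [0, 1, 1],
], dtype=float)

# Weight each top node by tanh(1 / degree): low-degree (rarer) top nodes
# contribute more to the projection, as argued above.
deg = A.sum(axis=0)
w_top = np.tanh(1.0 / deg)

# Projected unigraph: W[i, j] = total weight of the top nodes shared by i and j.
W = (A * w_top) @ A.T
np.fill_diagonal(W, 0.0)   # no self-loops in the projection
```

The resulting symmetric weight matrix W is what a unigraph community detection method such as Louvain would operate on.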

In our CBU-algorithm (see Algorithm 2), after clustering the bigraph containing exclusively majority class instances, we randomly select an equal amount of majority instances from each community to target the within-class imbalance problem. In the rare situation that the number of obtained clusters exceeds the required amount of majority instances (Nr_retain), we sort the communities according to a user-specified parameter Clustopt and randomly select 1 instance from the first Nr_retain clusters. The parameter Clustopt can take the following values:

– C Smallest, where we sort clusters in ascending order of their size.
– C Largest, where we sort clusters in descending order of their size.

Note that we randomly select majority class instances from each cluster. Yen and Lee (2009) found a random selection strategy after clustering to be superior to informed approaches based on distance.

¹⁰ Note that they also made use of the flow-based algorithm Infomap (Rosvall and Bergstrom 2008), which shows excellent results on the LFR-benchmark.


Algorithm 2 CBU pseudo-code implementation for behaviour data

Input: X_min, X_maj, β_u, Clustopt
a) Cluster the majority class instances X_maj:
– Assign weights to each top node corresponding with the hyperbolic tangent applied to the inverse of the node's degree.
– Project the bigraph X_maj to a weighted unigraph consisting of bottom-node majority class instances. The weight w_ij between majority class instances i and j corresponds with the total weight of the shared top nodes.
– Apply the Louvain algorithm (Blondel et al 2008) on the projected unigraph to partition the majority class instances into clusters.
b) Select majority class instances:
    Nr_rem ← ⌊(|X_maj| − |X_min|) × β_u⌋ (see Equation (2))
    Nr_retain ← |X_maj| − Nr_rem
if Nr_retain < |Clust| then
    – Sort clusters according to Clustopt.
    – Randomly select 1 instance from the first Nr_retain clusters.
else
    – Randomly select ⌈Nr_retain / |Clust|⌉ majority class instances from each cluster.
    – Randomly discard instances from the previous step until its size corresponds with Nr_retain.
end if
c) Return the new training set consisting of X_min and the selected majority class instances from step b.

3.3 Boosting, cost-sensitive learning and EasyEnsemble

3.3.1 AdaBoost

The AdaBoost (Schapire and Singer 1999; Schapire 1999) algorithm has been designed from a perspective of improving the performance of a weak learner, so that it achieves accuracies that are comparable with a strong learning algorithm. Fundamental to the idea of boosting is to maintain a weight distribution over the training set. In each boosting iteration, the weights of wrongly classified instances are increased, so that the underlying weak learner puts more emphasis on these hard examples. In our implementation, we will consider using a SVM as base learner. SVMs are generally regarded as strong learners. The studies of Wickramaratna et al (2001); García and Lozano (2007) note that using a strong learner usually results in performance degradation during the boosting process. The RBF-kernel SVM classifier (Wickramaratna et al 2001; Li et al 2008) is used as the underlying classifier to prove their point. In our study, we will employ a linear SVM, which can be considered as a weaker version compared to the RBF-kernel. Furthermore, the regularization parameter C can be viewed as a "weakness" indicator¹¹. Lowering the C-value results in weaker learners, as can be seen from the goal function of the SVM optimization problem (see equation (1)). We will come back to this point in Section 4.5.

¹¹ The distinction between weak/strong learners is loosely 'defined' in Schapire (1999). A weak learner corresponds with a hypothesis that performs just slightly better than random guessing. A strong learner is able to generate a hypothesis with an arbitrary low error rate, given enough data. We adopt these definitions, but consider the distinction between weak/strong based on training set error. In a SVM context, it is quite typical that error levels on training data drop with increasing C-values (Suykens et al 2002). A learner that is 'too strong' means that even though its performance on training data is very high, it fails to generalize well and the test set error increases due to overfitting.


The boosting algorithm with underlying SVM is presented in Algorithm 3. This algorithm closely follows the original boosting implementation presented in Schapire and Singer (1999) and requires each learner to output confidence rated predictions in the interval [−1, 1]. Since the SVM outputs real-valued scores, we apply the procedure of Platt (1999) to transform these scores into probability estimates (which can easily be translated to form confidence rated predictions). A logistic regression (LR) model (Ng and Jordan 2002) is trained using as input the SVM-scores and the corresponding labels as output. Note that the same data that are used to construct the linear SVM can be used to estimate the LR-model, see Platt (1999) for a motivation.

The boosting algorithm requires the weak learner to be trained using a distribution D_t. One could sample from this distribution to generate (unweighted) examples and train a SVM next. We have chosen to include weights in the SVM goal function formulation (this requires an extension of the LIBLINEAR package):

\min_{w,b,\xi_i} \; \frac{w^T w}{2} + C \sum_{i=1}^{m} weight_i \, \xi_i \qquad (3)
The weights weight_i are set according to the weight distribution D_t(i) in each round of boosting. Note that the C-value that is used to train the SVM model will be divided by mean(weight_i). This normalization allows for a fair comparison between the weighted and unweighted SVM versions (e.g. in the first round of boosting, weight_i = 1/m for all instances; by multiplying the C-value with m, this corresponds to solving the unweighted problem with the same C-value).
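As a sketch of the weighted formulation in equation (3), scikit-learn's LinearSVC accepts per-instance weights through sample_weight (our own illustration on toy data, standing in for the authors' LIBLINEAR extension); the normalization described above is mimicked by dividing C by the mean weight:

```python
import numpy as np
from sklearn.svm import LinearSVC

rng = np.random.default_rng(2)
X = rng.random((40, 5))
y = np.where(np.arange(40) % 2 == 0, 1, -1)   # deterministic toy labels

# The boosting distribution D_t plays the role of weight_i in equation (3).
D = rng.random(40)
D /= D.sum()

# Divide C by mean(weight_i): with uniform weights 1/m this reduces exactly
# to the unweighted problem at the same C.
C = 1.0
clf = LinearSVC(C=C / D.mean())
clf.fit(X, y, sample_weight=D)
```

Heavily weighted instances now contribute proportionally more to the hinge-loss term, which is the effect the boosting rounds rely on.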

We introduce an additional parameter μ, called the weight percentage, with values in [0, 100], in the boosting algorithm of Algorithm 3. This parameter controls the amount of training data that is used to construct the SVM model and subsequent LR model. We sort the original training data in descending order according to the distribution D_t. Next, we form a new training set of minimal cardinality by including points from the original sorted training data until the total weight is higher than the weight-percentage μ/100. This way, the newly formed training data will contain only the part of the original training data that has the most weight. This partial set will then be used to construct a weighted SVM-model (according to equation (3), with updated distribution for this set) and a subsequent LR-model. The idea of using a partial dataset to construct the base learner not only reduces training times, but also weakens the learner (García and Lozano 2007).

In Algorithm 3, we have an explicit check to verify if r_AB = 1. In this case, the SVM model outputs scores that allow for a perfect classification on the training set. The subsequent LR model will find a threshold and output a value of −1 if the SVM score is lower than this threshold (a value of +1 if the score is higher than this threshold). In this situation, the training data would be perfectly classified. In our implementation, we attempt to avoid this kind of behaviour, because it can lead to overfitting, meaning it might pinpoint the wrong threshold and make too drastic decisions (we lose the meaning of confidence). The check if r_AB ≤ 0 verifies if the currently boosted model is performing worse than random (this model would have a r_AB-value of 0). Obviously, if the model performs worse than random, we quit the boosting process. During the first round of boosting, we perform similar checks that are not explicitly indicated in Algorithm 3. In the case where r_AB = 1, we output the SVM scores instead of the LR binary values. When r_AB ≤ 0, we quit the boosting process and output the LR scores.

Algorithm 3 AdaBoost with a SVM-LR combination as a base learner

Input: (X, Y) = {(x_1, y_1), …, (x_m, y_m)}, C, T, μ
Initialize distribution D_1(i) = 1/m
for t = 1 to T do
– Train the weak learner using distribution D_t. The weak learner consists of a weighted linear SVM and a LR model, trained with weight-percentage μ of D_t:
    h_t ← Train_WeakLearner(X, Y, D_t, C, μ)
– Compute the weighted confidence r_AB on the training data:
    r_AB ← Σ_{i=1}^{m} D_t(i) y_i h_t(x_i)
    If (r_AB = 1 or r_AB ≤ 0) then α_t ← 0 and stop the boosting process.
– Choose α_t ∈ ℝ:
    α_t ← (1/2) log((1 + r_AB) / (1 − r_AB))
– Update the distribution:
    D_{t+1}(i) ← D_t(i) exp(−α_t y_i h_t(x_i)) / Z_t
    where Z_t is a normalization factor (chosen so that D_{t+1} will be a distribution).
end for
Output the final hypothesis (the output score is the term contained in the sign function):
    H(x) = sign(Σ_{t=1}^{T} α̃_t h_t(x)) with α̃_t = α_t / Σ_{i=1}^{T} α_i
3.3.2 AdaCost

The AdaCost algorithm (Fan et al 1999) is a variant of cost-sensitive learning where misclassification costs are introduced in the weight-update formula of AdaBoost. The cost-sensitive update rule will increase the weights of costly misclassified instances more aggressively and decrease the weights of costly correct classifications more conservatively. Each instance is given a misclassification cost ci, where we chose to put ci = 1 for positive (minority) instances and ci = 1/R for negative (majority) instances. R is a user-defined value that allows one to put more emphasis on the minority class. The implementation of AdaCost is similar to Algorithm 3, yet there are a few differences (Fan et al 1999): firstly, the initial distribution is chosen as D1(i) = ci / Σ_{j=1}^{m} cj; secondly, the weight update rule is given by Dt+1(i) = Dt(i) exp(−αt yi ht(xi) β(i)) / Zt, where β(i) = −0.5 sign(yi ht(xi)) ci + 0.5 is a cost-adjustment function; finally, the choice of αt is given by αt = (1/2) log((1 + rAC) / (1 − rAC)), where rAC = Σ_{i=1}^{m} Dt(i) yi ht(xi) β(i). Note that the checks to stop the boosting process prematurely (see the second bullet in Algorithm 3) are still based on the r-value obtained from AdaBoost (rAB). This is because β ∈ [0, 1].
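The cost-adjustment function described above can be sketched in a few lines (illustrative values only; the choice R = 4 is an arbitrary example, not a value from the experiments):

```python
import numpy as np

def adacost_beta(y, scores, c):
    """Cost-adjustment function from Fan et al. (1999) as used in the text:
    beta(i) = -0.5 * sign(y_i * h(x_i)) * c_i + 0.5, so beta lies in [0, 1]."""
    return -0.5 * np.sign(y * scores) * c + 0.5

R = 4.0
y      = np.array([ 1,  1, -1])          # minority, minority, majority
scores = np.array([-1,  1, -1])          # first instance is misclassified
c      = np.array([1.0, 1.0, 1.0 / R])   # c_i = 1 (minority), 1/R (majority)
beta   = adacost_beta(y, scores, c)
# misclassified costly instances get the largest beta, so the AdaCost update
# D_{t+1}(i) ~ D_t(i) * exp(-alpha * y_i * h(x_i) * beta(i)) boosts them hardest
```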

In the first boosting round of AdaCost, the weighted SVM formulation (3) is equivalent to solving a SVM formulation with the following goal function:

    min_{w,b,ξi}  w^T w / 2  +  C+ Σ_{i|yi=1} ξi  +  C− Σ_{i|yi=−1} ξi        (4)

where the sums run over the m+ positive and m− negative instances, respectively, and C+/C− = R. This can be seen as a cost-sensitive version of a SVM, and this idea has initially been proposed by Veropoulos et al (1999).
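With scikit-learn, the asymmetric goal function (4) can be reproduced by scaling C per class, since `class_weight` multiplies the error penalty of each class. A sketch on synthetic data follows; the ratio R = 8 and the toy data are our own illustration, not the paper's setup.

```python
import numpy as np
from sklearn.svm import LinearSVC

R, C = 8.0, 0.1
rng = np.random.RandomState(0)
# synthetic 8:1 imbalanced data: majority around +1, minority around -1
X = np.vstack([rng.randn(200, 5) + 1.0, rng.randn(25, 5) - 1.0])
y = np.array([-1] * 200 + [1] * 25)

# class_weight scales C per class, so C+ / C- = R as in Eq. (4)
svm = LinearSVC(C=C, class_weight={1: R, -1: 1.0})
svm.fit(X, y)
minority_recall = (svm.predict(X[200:]) == 1).mean()
```

Putting more weight on the minority class shifts the separating hyperplane towards the majority class, which raises minority recall at the cost of some majority errors.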

3.3.3 EasyEnsemble

One of the disadvantages of the random undersampling method is the fact that we are discarding potentially valuable information. EasyEnsemble (Liu et al 2009) is a method that combines several balanced subsets (S in total) containing randomly selected majority class instances together with all minority examples. Each subset contains the same amount of instances from both classes and is fed to the boosting algorithm presented in Algorithm 3. Afterwards, the weak learners hst of each subset s are simply combined to form the final ensemble:

    H(x) = sign( Σ_{s=1}^{S} Σ_{t=1}^{T} αst hst(x) ),  with s = 1, ..., S and t = 1, ..., T        (5)

It is clear that this technique benefits from a combination of bagging and boosting (Liu et al 2009). Note that we apply the same boosting algorithm as previously described to each of the balanced subsets. However, when rAB = 1 in the first round of boosting, we quit the boosting process, put α1 = 1, and continue to use the trained LR-model in the final ensemble. It was previously noted that this can cause overfitting, in the sense that it can point to a wrong threshold. However, the LR-model still contains information, and the combination thereof with the models obtained from the other subsets remains valuable (as we have noted from initial experiments by comparing the situation where we include or reject those subsets).
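The sampling step of EasyEnsemble is simple to express. The sketch below covers only the subset construction; the per-subset boosting and the final combination of Eq. (5) are omitted.

```python
import numpy as np

def easy_ensemble_subsets(y, S, rng):
    """Return S index sets, each holding all minority instances plus an
    equally sized random sample (without replacement) of the majority."""
    min_idx = np.flatnonzero(y == 1)
    maj_idx = np.flatnonzero(y == -1)
    return [np.concatenate([min_idx,
                            rng.choice(maj_idx, size=len(min_idx), replace=False)])
            for _ in range(S)]

rng = np.random.RandomState(42)
y = np.array([1] * 10 + [-1] * 100)      # toy labels, imbalance ratio 10%
subsets = easy_ensemble_subsets(y, S=5, rng=rng)
# each subset is balanced and only twice the minority class size,
# which is what makes the method fast and parallelizable
```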

There are a few subtle though important differences with respect to the experiments performed by Liu et al (2009). First of all, the authors use a CART-algorithm (Breiman et al 1984) as base learner. We employ a linear SVM with subsequent LR as a weak learner. To our knowledge, this combination has not been proposed elsewhere, yet proves to be very efficient in this setting. Secondly, their methodology employs a discrete version of AdaBoost, where each weak learner outputs binary values in {−1, 1}. As already stated, we make use of an improved version of AdaBoost that relies on confidence-rated predictions (Schapire and Singer 1999). Finally, the authors reported performances with fixed levels of S = 4 and T = 10. We will investigate the effect of varying S and T levels.


4 Results and discussion

4.1 Datasets

Stankova et al (2015) provide the first large collection of benchmark behaviour datasets for classification. In our experiments, we make use of these data sources and extend this repository with two additional datasets. Each of these datasets shows a bipartite structure with a clear target variable to predict. We refer to this study and the next paragraph for a short description of the available data resources. In this paragraph, we indicate why we have chosen to include or reject certain data sources from the aforementioned study. The available datasets can be divided into small, medium and large datasets, based on the number of instances and the number of features present. The Norwegian companies and Reality Mining datasets comprise the small datasets, since they contain only a few hundred instances or features and are therefore regarded as impractical for our purposes. The MovieLens, Yahoo, TaFeng, Book-Crossing and LibimSeTi datasets belong to the medium-sized datasets, each containing a few thousand up to a few hundred thousand instances and features. All of these datasets are included in our study. The large datasets, containing hundreds of thousands up to millions of instances and features, are the Flickr and Kdd databases.12 Other large proprietary data sources not included in Stankova et al (2015) are the corporate residence fraud (CRF) and banking (Bank) datasets, which arise from real-life application domains with intrinsic imbalance.

To summarize, we have gathered datasets containing such fine-grained behaviour data from a wide variety of application domains. The MovieLens datasets, for which we are predicting the gender13 (Mov G) or the genre thriller14 (Mov Th), provide data on which films each user has rated. The Yahoo movies15 dataset has a similar structure, where the age of each user, Yahoo A (above or below average), or the gender, Yahoo G, is being predicted. The TaFeng dataset16 contains data on shopping behaviour, where age (below or above average) is being predicted based upon which products are purchased. In the Book-Crossing (Book) dataset (Ziegler et al 2005), users rate books, and the age of the user (above or below average) is being predicted. LibimSeTi (LST) contains data from a dating site (Brozovsky and Petricek 2007), where users rate each other's profiles and gender is being predicted. In the advertisement (Adver) dataset (Lichman 2013), we try to predict if a URL is an advertisement, based on a large variety of binary features of the URL. Note that this dataset does not arise from the behaviour of entities, yet it still has a high-dimensional and sparse representation. The Flickr dataset (Cha et al 2009) contains pictures being marked by users as favorite, and we predict the number of comments on each picture (below

12 Flickr and KDD will be excluded in the comparative study of Section 46 This is because somemethods are too computationally intensive - especially in combination with the large number of possibleparameter combinations - to be applied on these very large data sources Furthermore our statistical evi-dence is already sufficiently strong to conclude significance without these datasets Having said this thesedata sources will be included in the analysis of Section 5

13 MovieLens 1M dataset from http://grouplens.org/datasets/movielens
14 MovieLens 10M dataset from http://grouplens.org/datasets/movielens
15 https://webscope.sandbox.yahoo.com
16 http://www.bigdatalab.ac.cn/benchmark/bm/dd?data=Ta-Feng


or above average). In the Kdd cup data, the performance of a student on a test is being predicted, based on artificially created binary features (Yu et al 2010). The corporate residence fraud dataset (CRF) (Junque de Fortuny et al 2014b) contains data on foreign companies making transactions with specific Belgian companies, where we try to predict whether the foreign company commits residence fraud (a type of fiscal fraud). Finally, the banking dataset (Bank) (Martens et al 2016) contains detailed behaviour on consumers making transactions with merchants or other persons, to predict interest in a pension fund product. Some characteristics of these datasets can be found in Table 2. The features column only shows the number of active features.17
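Active features (see footnote 17) are easy to count in a sparse matrix representation; a toy sketch with made-up data:

```python
import numpy as np
from scipy.sparse import csr_matrix

# toy instance-by-feature matrix: features 1 and 2 are all-zero columns
X = csr_matrix(np.array([[1, 0, 0, 2],
                         [0, 0, 0, 1],
                         [3, 0, 0, 0]]))
active = np.flatnonzero(X.getnnz(axis=0) > 0)   # columns with >= 1 non-zero
n_active = len(active)
# the non-active columns would not contribute to a linear model
```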

Table 2 Behaviour data characteristics. The final column shows the imbalance ratio p, defined as the ratio of the number of minority class instances to the amount of majority class instances in the training set, expressed as a percentage. See Section 4.2 for details regarding p.

Name      |Xmaj|     |Xmin|     Features    p = 100 × |Xmin|train / |Xmaj|train
Mov G     4331       1709       3706        1 & 25
Mov Th    10546      131        69878       1.24 (p = [])
Yahoo A   6030       1612       11915       1 & 25
Yahoo G   5436       2206       11915       1 & 25
TaFeng    17330      14310      23719       1 & 25
Book      42900      18858      282973      1 & 25
LST       59702      60145      166353      1
Adver     2792       457        1555        16.38 (p = []) & 1
CRF       869071     62         108753      0.0072 (p = [])
Bank      1193619    11107      3139570     0.93 (p = [])
Flickr    8166814    3028330    497472      0.1
Kdd       7171885    1235867    19306083    0.5

4.2 Methodology

Regarding the experiments performed in the upcoming sections, we applied a tenfold cross-validation procedure. Each of the folds contains 80% training data, 10% validation data and 10% test data. Note that these percentages are valid for both the majority class and the minority class (stratified sampling). As can be seen from Table 2, some datasets are balanced in nature. We created artificial imbalance for these datasets by removing minority class instances from the initial training set, according to a user-defined parameter p. We ensured that the amount of minority training instances corresponds to p percent of the majority class training size: |Xmin|train = (p/100) |Xmaj|train. As an example, say that we are using the Book dataset with p = 25. In that case, we know that the majority class contains 34320 training instances (80% of |Xmaj| = 42900). The minority training data would contain 8580 instances (25% of 34320). When the dataset is already imbalanced, we define p = [], which means that no downsampling of the minority class training data is performed. Note that the validation and test data are left untouched.18

17 Active features represent features that are present for at least one instance in the dataset. A non-active feature corresponds with a column of zeros in the matrix representation and would not contribute to the model.
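The downsampling rule |Xmin|train = (p/100) · |Xmaj|train can be sketched as a small helper and checked against the Book example above (the function name is ours, not the paper's):

```python
import numpy as np

def downsample_minority(min_train_idx, maj_train_size, p, rng):
    """Keep only (p/100) * |X_maj|_train minority training instances."""
    target = int(round(p / 100.0 * maj_train_size))
    return rng.choice(min_train_idx, size=target, replace=False)

rng = np.random.RandomState(0)
maj_train_size = int(0.80 * 42900)            # Book: 80% of 42900 = 34320
min_train_idx = np.arange(int(0.80 * 18858))  # 80% of the 18858 minority instances
kept = downsample_minority(min_train_idx, maj_train_size, 25, rng)
```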

The methods detailed in Section 3 are applied on the training data. For both the under- and oversampling approaches, a linear SVM is trained on the newly created balanced training data, with regularization parameter C having values

C = [10^−7, 10^−5, 10^−3, 10^−1, 10^0]

The validation data are used for parameter tuning purposes. The test data allow us to obtain the generalization performance. The results reported show the average over ten folds. In the remaining paragraphs of this section, we will describe the various parameter settings used in our experiments and give a brief overview of the parameters occurring in each method.

Considering the oversampling techniques, the parameter settings are as follows:

β = [0, 1/3, 2/3, 1]
prioropt = {FlipCoin, Reverse Priors}
simmeasure = {Cosine, Jaccard}
K = [10^0, 10^1, 10^2, |Xmin|train]

We didn't include the "Prior" option, due to initial experiments showing a lower performance in comparison with the other options. This can be explained by the low priors occurring in each column, resulting in synthetic samples that mainly show zeros in 0-1 match situations. The oversampling with replacement method (OSR) only uses the β parameter, SMOTE uses all but the K parameter, and ADASYN uses all of the parameters listed above.
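Under one plausible reading of the β parameter (β = 0 leaves the data untouched and β = 1 yields full balance; this interpretation is our assumption, not a definition quoted from the paper), oversampling with replacement can be sketched as:

```python
import numpy as np

def oversample_replacement(min_idx, maj_idx, beta, rng):
    """OSR sketch: draw minority indices with replacement until the minority
    size reaches |min| + beta * (|maj| - |min|)."""
    target = int(round(len(min_idx) + beta * (len(maj_idx) - len(min_idx))))
    extra = rng.choice(min_idx, size=target - len(min_idx), replace=True)
    return np.concatenate([min_idx, extra])

rng = np.random.RandomState(1)
min_idx = np.arange(30)                # 30 minority instances
maj_idx = np.arange(30, 130)           # 100 majority instances
balanced = oversample_replacement(min_idx, maj_idx, beta=1.0, rng=rng)
partial  = oversample_replacement(min_idx, maj_idx, beta=1.0 / 3, rng=rng)
```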

With respect to the undersampling techniques, the following parameter settings are used:

βu = [0, 1/4, 1/2, 3/4, 1]
simmeasure = {Cosine, Jaccard}
K = [10^0, 10^1, 10^2, |Xmin|train]
Clustopt = {C Smallest, C Largest}

The random undersampling (RUS) technique only uses the βu parameter. The second set of methods, "Closest tot sim" and "Farthest tot sim", uses βu and simmeasure. The third set of techniques, "Closest Knn" and "Farthest Knn", makes use of all the parameters listed above, except for Clustopt. The final approach, CBU, employs βu and Clustopt.

The boosting variants (AdaBoost, AdaCost and EasyEnsemble) presented in Section 3.3 make use of the following settings:

T = 30
μ = [100, 75] (%)
C = [10^−7, 10^−5, 10^−3, 10^−1]
R = [2, 8, RL], where RL = |Xmaj|train / |Xmin|train
S = [5, 10, 15]

The AdaBoost algorithm includes the T, μ and C parameters. AdaCost additionally uses cost-ratios R. We have chosen a range of values, because misclassification costs are unknown for many business applications (He and Garcia 2009; Fan et al 1999; Sun et al 2007). The final value RL seems to be a popular choice (Akbani et al 2004; Luts et al 2010), because the total weight on the majority class balances with the total weight on the minority class. The final method, EasyEnsemble, uses S subsets in addition to the parameters previously mentioned for AdaBoost. Note that we consider the boosting iteration round t ∈ [1, T] as a tunable parameter.19

18 This means that if we start from a balanced set, only the training data will show artificial imbalance according to the imbalance ratio p. The validation and test data would remain balanced. Since AUC (and some other metrics) is independent of class skew, it would be unwise to make these sets imbalanced as well, because that would lead to discarding minority class instances that are relevant for performance assessment.

4.3 Oversampling

The oversampling techniques presented in Section 3.1 are applied to each of the datasets from Table 2, with varying imbalance levels (by varying the p-parameter). The experiments we conducted use the following methodology: for each dataset, we apply the oversampling techniques with all possible parameter combinations on the training data to create newly balanced datasets, after which linear SVMs are trained. The optimal parameter combination with respect to each imbalance ratio β is selected based on validation set AUC-performance. With these parameters, the AUC on the test data is obtained. Results show the average over ten folds.

The results on four arbitrarily selected datasets are shown in Table 3. Full results on each of the data sources can be found in Appendix A, Table 11. From these tables, we can conclude that performance generally increases with growing β-values. It is more correct to say that performance keeps improving with growing β-levels until an optimal point β* is reached. Increasing the balance level after this optimal value will cause only small fluctuations with respect to the optimal performance. Traditional studies dealing with dense low-dimensional data note that the OSR method can suffer from overfitting, as already mentioned in Section 3.1. It is interesting to see that in this sparse high-dimensional setting this effect doesn't seem to occur.20 Furthermore, the computationally expensive synthetic sampling approaches do not seem to improve performance over a more simple OSR method. This can be explained by the fact that many instances seem to have a very limited amount of non-zero elements (Junque de Fortuny et al 2014a; Stankova et al 2015), causing the synthetic sampling procedures to be limited in the amount of unique new samples they can produce. Note also that synthetic sampling procedures have historically been designed to overcome the overfitting behaviour of OSR, which no longer seems to apply here.

19 We sometimes make use of T as a symbol to indicate the boosting iteration number instead of t.
20 The SVM regularization parameter C controls for overfitting. Additionally, the influence of majority class noise/outliers (see Section 3.2) on the learned hyperplane decreases as we oversample the minority class. This hyperplane is more sensitive towards minority instances, and a change in its direction/orientation towards these majority class noise/outliers will be more heavily penalized.


Table 3 Oversampling experiments. Results show the average tenfold AUC test set performance with respect to increasing β-values (β = [β1, β2, β3, β4] = [0, 1/3, 2/3, 1]). Additionally, standard deviations are shown between brackets. Optimal performances for each method are highlighted in boldface. See Section 4.2 for parameter settings.

Mov Th (p = [])
          β1             β2             β3             β4
OSR       79.77 (5.33)   85.3 (4.66)    83.16 (4.5)    84.59 (5.69)
SMOTE     79.77 (5.33)   84.18 (6.51)   85.58 (5.97)   84.33 (5.75)
ADASYN    79.77 (5.33)   84.11 (6.77)   85.86 (6.06)   85.36 (5.13)

Yahoo G (p = 25)
          β1             β2             β3             β4
OSR       78.82 (1.39)   78.78 (2.02)   78.74 (1.86)   78.35 (1.47)
SMOTE     78.82 (1.39)   79.23 (1.57)   79.1 (1.2)     79.03 (1.89)
ADASYN    78.82 (1.39)   79.12 (1.43)   79.23 (1.37)   79.51 (2.01)

TaFeng (p = 25)
          β1             β2             β3             β4
OSR       66.94 (1.34)   68.84 (1.21)   66.99 (1.42)   67.7 (1.41)
SMOTE     66.94 (1.34)   68.47 (1.5)    67.07 (1.15)   66.65 (0.81)
ADASYN    66.94 (1.34)   68.62 (1.38)   67.85 (1.6)    66.91 (1.39)

Book (p = 25)
          β1             β2             β3             β4
OSR       60.08 (0.71)   61.05 (0.94)   61.82 (1.12)   64.62 (0.57)
SMOTE     60.08 (0.71)   62.6 (0.73)    60.95 (0.68)   63.0 (0.8)
ADASYN    60.08 (0.71)   62.33 (0.73)   60.77 (0.85)   63.04 (0.58)

4.4 Undersampling

Regarding the undersampling techniques, we employ a similar experimental set-up as in the previous section: firstly, train SVMs on the undersampled training data with all possible parameter combinations; secondly, choose a suitable parameter combination based on validation set AUC-performance; and finally, obtain the AUC-performance on the test set. The results are averaged across ten folds.

Before going into detail on the experimental findings, we give a short note on the effect of noise/outliers on SVM performance. Consider the following imaginary example: say that a majority class female is rating a lot of action films and thrillers. We can consider this as an outlier, since most of the females are rating romantic or drama films. This specific female has far more in common with the minority class males who also rate action films and thrillers. Outliers are therefore instances showing odd behaviour. Noise are wrongfully labelled instances. The effect of noise/outliers on SVM performance can be severe. Indeed, many of the instances in the dataset have a support value (dual variable αi) of 0. Instances contributing to the model (a non-zero support value) are examples close to the borderline (Suykens et al 2002) and noise/outliers (as can be derived from the KKT dual-complementarity conditions21). With respect to noise/outliers, the support value can reach a maximal value of C, the chosen regularization parameter in the SVM formulation. It should be clear that lowering the C-value leads to models that are less sensitive to noise/outliers.

21 If αi = 0, then yi(w^T xi + b) ≥ 1. For noise/outliers, the term yi(w^T xi + b) is negative, hence αi ≠ 0.

The experimental outcomes of the various undersampling processes are shown in Table 4 for the same four previously selected datasets. We refer to Appendix B, Table 12 for results on the entire data repository. When removing only a limited amount of majority class instances (βu = βu2 = 1/4), we observed that in 12 out of 16 datasets the "Farthest Knn" method outperforms the "Closest Knn" technique. The four remaining datasets show comparable performances.22 This finding shows that the "Farthest" method is very suitable in removing majority class noise/outliers, and empirically shows their performance degrading effect. With higher undersampling rates (βu = 1), the "Closest Knn" method achieves higher performances than the "Farthest Knn" technique in general, though the results are less clear. We observed that only 9 out of 16 datasets indicate "Closest" as a clear winner, with 3 tie situations and 4 situations where "Farthest" shows the best results. Intuitively, we would expect to see the "Closest" method perform better, since we are keeping the most informative instances (examples close to the minority class). However, this technique will also emphasize noise/outliers, and this is the main reason why the results are less significant. One can also see this from the fact that the RUS method indicates far better results in comparison to the aforementioned techniques when βu = 1, an observation that is valid for all datasets. The RUS method is efficient in realizing noise/outlier reduction effects, since each SVM is trained with only a fraction of these types of instances. It is remarkable to see that in 9 out of 16 datasets the RUS method with βu = 1 outperforms the baseline model (βu = 0). In 4 datasets we observed equal performance, and 3 losses with respect to the baseline. RUS shows two attractive features: majority class noise/outlier removal and its ability to put more emphasis on the minority class instances by undersampling the majority class. A general downside of the approach is the information loss, and this effect causes the 3 losses to the baseline. Usually though, the attractive features of RUS dominate the latter effect, and this indicates that it is definitely not necessary to include all majority class training instances in constructing the predictive model. Apparently there is a high level of redundancy, which can be exploited to construct efficient hypotheses.

In the next paragraph, we compare the cluster-based undersampling (CBU) approach to the random undersampling (RUS) technique. Before proceeding, we should note that Table 12 does not show results on the LST, CRF and Bank datasets, due to the fact that the projected unigraph was too large to fit in memory (larger than 16 GB). A limited amount of features (top nodes) does seem to be active for a relatively large number of instances (bottom nodes). All these instances will be connected in the unigraph projection, resulting in relatively high degrees for the bottom nodes. A possible solution to circumvent this problem might be to cut the weakest edges in the projected unigraph.

If we apply a limited amount of undersampling (βu = βu2 = 1/4), the CBU technique outperforms RUS in 8 out of 13 datasets (4 losses and 1 tie with respect to RUS). On the highly imbalanced datasets (p = 1 or p = []), the CBU method wins in 8 out of 8 cases. We can therefore conclude CBU to outperform RUS with low undersampling rates. This indeed shows that handling the within-class imbalance can be beneficial in this environment. If we make a comparison at the highest undersampling rate, βu = βu5 = 1, the results are completely opposite: CBU wins in only 1 case, with 9 losses with respect to RUS and 3 ties. On the highly imbalanced datasets, CBU outperforms RUS for 1 dataset (with 4 losses and 3 ties). It is clear that RUS is the preferred technique in case of high undersampling rates. The reason is that CBU emphasizes the small disjuncts and basically fails to adequately pick up on the common larger communities. Furthermore, these small disjuncts might actually correspond with majority class noise/outliers, and CBU is focusing on these types of instances. On the overall level, where we consider all undersampling rates (except βu = 0), CBU wins in 5 out of 13 datasets, RUS wins in 4 cases, and there are 4 tie situations (yet RUS did outperform CBU in all 4 cases). Both methods seem to be competitive to one another across a wide range of imbalance levels. If we pay particular attention to the highly imbalanced sets, CBU wins in 5 cases, and there are 3 tie situations on the remaining sets (yet RUS did outperform CBU in all 3 cases). CBU seems to be the method of choice (with respect to RUS) when the datasets show high imbalance levels.

22 A tie occurs in the situation where the absolute difference in AUC is smaller than or equal to 0.5.

Table 4 Undersampling experiments. Results show the average tenfold AUC test set performance with respect to increasing βu-values (βu = [βu1, βu2, βu3, βu4, βu5] = [0, 1/4, 1/2, 3/4, 1]). Optimal performances (boldface) are highlighted for each imbalance ratio. Additionally, standard deviations are shown between brackets. Cl K represents the "Closest Knn" technique; Cl T represents the "Closest tot sim" technique (similar for Far K and Far T, see Section 3.2). Parameter settings are given in Section 4.2.

Mov Th (p = [])
        βu1           βu2           βu3           βu4           βu5
RUS     79.77 (5.3)   80.32 (5.8)   81.57 (5.5)   81.86 (6.6)   81.26 (6.2)
Cl K    79.77 (5.3)   79.25 (4.5)   78.07 (5.0)   76.25 (6.5)   62.46 (8.5)
Cl T    79.77 (5.3)   78.4 (4.4)    72.41 (3.5)   64.66 (4.5)   60.37 (7.3)
Far K   79.77 (5.3)   84.54 (5.0)   83.64 (6.4)   80.02 (7.3)   56.82 (10.3)
Far T   79.77 (5.3)   85.03 (5.7)   82.68 (6.8)   75.61 (9.2)   56.77 (10.9)
CBU     80.11 (5.8)   81.17 (6.0)   81.08 (6.5)   84.17 (5.1)   80.96 (6.9)

Yahoo G (p = 25)
        βu1           βu2           βu3           βu4           βu5
RUS     78.82 (1.4)   78.91 (1.6)   78.97 (1.6)   78.61 (1.6)   77.82 (2.1)
Cl K    78.82 (1.4)   77.26 (1.5)   72.52 (1.5)   67.86 (2.0)   65.07 (2.7)
Cl T    78.82 (1.4)   76.83 (1.0)   71.99 (1.8)   67.15 (2.3)   61.1 (2.7)
Far K   78.82 (1.4)   78.26 (2.2)   74.69 (2.7)   67.22 (2.1)   60.72 (2.3)
Far T   78.82 (1.4)   77.68 (2.6)   72.44 (3.0)   64.94 (2.4)   59.6 (2.0)
CBU     75.25 (3.2)   75.22 (2.4)   74.69 (2.3)   73.07 (2.4)   70.69 (2.4)

TaFeng (p = 25)
        βu1           βu2           βu3           βu4           βu5
RUS     66.94 (1.3)   67.44 (1.3)   68.1 (1.4)    68.27 (1.4)   66.13 (1.2)
Cl K    66.94 (1.3)   66.13 (1.4)   63.39 (1.2)   59.83 (1.3)   56.94 (0.7)
Cl T    66.94 (1.3)   66.38 (1.5)   62.89 (1.6)   57.46 (1.3)   54.56 (1.3)
Far K   66.94 (1.3)   68.06 (1.4)   66.43 (1.6)   64.46 (1.5)   63.35 (1.3)
Far T   66.94 (1.3)   64.31 (1.1)   62.69 (1.0)   61.27 (1.1)   59.03 (1.0)
CBU     64.81 (1.2)   64.15 (1.1)   64.13 (1.2)   63.88 (0.8)   63.46 (0.8)

Book (p = 25)
        βu1           βu2           βu3           βu4           βu5
RUS     60.08 (0.7)   60.13 (0.6)   60.4 (0.8)    60.33 (0.8)   63.28 (0.8)
Cl K    60.08 (0.7)   59.96 (0.7)   60.13 (0.8)   59.96 (1.0)   59.28 (0.7)
Cl T    60.08 (0.7)   59.96 (0.7)   60.13 (0.8)   60.29 (0.4)   54.5 (0.9)
Far K   60.08 (0.7)   63.29 (1.0)   64.19 (0.8)   57.3 (1.1)    55.66 (1.1)
Far T   60.08 (0.7)   62.14 (0.5)   58.27 (0.6)   56.37 (1.0)   55.66 (1.1)
CBU     54.82 (0.9)   54.67 (0.9)   54.71 (0.9)   54.66 (1.0)   54.78 (0.9)

4.5 Boosting variants

In this section, the plain AdaBoost (AB) method is compared with the cost-sensitive AdaCost (AC) algorithm, the EasyEnsemble (EE) technique and the baseline (BL) method (where we train a single SVM on the imbalanced training data). AB, AC and EE have an underlying boosting process in common. The AUC-values are stored during each boosting iteration, which allows us to show the performance graphically as a function of the number of iterations t. In EasyEnsemble (EE), we combine the weak learners of each subset by summing their individual contributions. For example, when t = 2 boosting iterations are performed, we evaluate the performance of the combined learner Σ_{s=1}^{S} Σ_{t=1}^{2} αst hst(x). Figures 1, 2, 3 and 4 show the average tenfold AUC-performance (with μ = 100%) on the test data. The left-side figures show the AB, AC (with varying R-values) and EE (with varying S-values) performance with respect to the number of boosting iterations. The C-value is tuned according to highest validation set AUC-performance (over all possible boosting iterations). The right-side figures investigate the influence of the C-parameter for AB and EE (with S = 15), and allow us to gain insight on the effect of weak and strong learners on the generalization performance. Results on the remaining datasets are given in Appendix C, Figures 6-17. Note that we only indicate results with weight-percentage μ = 100 (use all instances in the training process). Previous experiments (with μ = 75) showed inferior results. This suggests that it is indeed important to include all minority instances during the training of each weak learner, an observation also made by Liu et al (2009).

When investigating the influence of the regularization parameter C in AdaBoost, we can graphically verify that it is a "weakness" indicator, in the sense that lower C-values correspond to weaker learners. While these low C-value learners have a low performance in the first rounds of boosting, the possible boost in performance is usually much higher compared to higher C-value inducers. Strong learners can already make a suitable distinction between minority and majority class behaviours in the first boosting round. The boosting process will emphasize the hard to learn instances, among which are noise/outliers. Therefore, too strong learners are not suitable to be used in a boosting process, as already noted by Wickramaratna et al (2001). In Figures 2(b) and 3(b), we observe that the lowest C-values (C = 10^−7, 10^−5) can outperform higher C-values (C = 10^−3, 10^−1). In many cases the AB-process outperforms the baseline (BL); see also the results in Table 5. A combination of several weak learners can often achieve better generalization than a single strong learner.

Inspired by the success of the RUS method with βu = 1 (for reasons indicated in Section 4.4), the EE-technique tries to overcome an obvious downside of this approach: we are removing a lot of majority class instances that can be very informative. Hence, we try to make up for this loss by combining several of these balanced training sets, in the hope that the majority class is more adequately explored. This results in a very effective learner that seems to outperform the AB, AC and BL methods in the majority of cases, as can be seen from these figures. It is remarkable that we don't need to use that many subsets S to achieve high performances: increasing the S-value from 5 to 15 doesn't seem to cause major increases in AUC (see also Table 5). The influence of the C-parameter is similar as described in the previous paragraph.

[Figure: two panels plotting AUCtest versus boosting iteration T; legends: (a) AB, AC (R = 2, 8, RL), EE (S = 5, 10, 15), BL; (b) AB and EE at C = 1e-07, 1e-05, 0.001, 0.1, plus BL.]

Fig. 1 Mov G (p = 25) dataset results showing average tenfold AUC-performance on test data (with μ = 100%) for (a) AB, AC and EE with C chosen according to highest validation set AUC-performance (over all possible T); (b) AB and EE (S = 15) with varying C-levels.


[Figure: two panels plotting AUCtest versus boosting iteration T; same legends as Fig. 1.]

Fig. 2 Book (p = 25) dataset.

[Figure: two panels plotting AUCtest versus boosting iteration T; same legends as Fig. 1.]

Fig. 3 TaFeng (p = 25) dataset.

[Figure: two panels plotting AUCtest versus boosting iteration T; same legends as Fig. 1.]

Fig. 4 Bank (p = []) dataset.


4.6 Final comparison

4.6.1 Performance-wise comparison

In this section, we wish to determine which of the aforementioned methods shows superior performance in terms of AUC on test data. The experimental set-up is described in Section 4.2. Note that we exclude β = 0 and βu = 0 in determining the performance of the oversampling, respectively undersampling, techniques, to be able to compare them with the baseline (BL) approach.23 The results for AB, AC and EE are shown for μ = 100. The number of boosting iterations t ∈ [0, T] is also considered as a tunable parameter. Table 5 shows the final comparison between all methods. Additionally, we added an average rank column, indicating the mean rank of each algorithm across all datasets. The rank of an algorithm for a specific dataset corresponds to its position in the sorted AUC-performance spectrum, where the best performing algorithm gets the rank of 1, the second best rank 2, etc. In case two or more algorithms perform equally well, average ranks are assigned (if two algorithms have ranks of 3 and 4 respectively and show equal performance, then both methods get assigned a rank of 3.5). Note that the LST dataset is excluded in the calculation of average ranks, since we consider this as a dataset not being affected by imbalance and showing equal performances.
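The average-rank computation described above (best AUC gets rank 1, ties share average ranks) corresponds directly to `scipy.stats.rankdata` applied to the negated AUC matrix. The numbers below are toy values, not those of Table 5:

```python
import numpy as np
from scipy.stats import rankdata

# toy AUC matrix: rows = datasets, columns = algorithms
auc = np.array([[0.85, 0.80, 0.85],
                [0.70, 0.75, 0.60],
                [0.90, 0.88, 0.91]])
ranks = rankdata(-auc, axis=1)       # negate: higher AUC -> lower (better) rank
avg_rank = ranks.mean(axis=0)        # mean rank of each algorithm over datasets
# in row 0 the two 0.85 entries tie and share rank (1 + 2) / 2 = 1.5
```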

As a side note, we noticed that increasing the number of minority training set instances can significantly improve performances. We can see this clearly by comparing the p = 25 outcomes with the results of p = 1 on the same dataset. Junque de Fortuny et al (2014a) already proved empirically that larger behaviour datasets (in terms of the number of instances or features) lead to higher AUC-performances. This situation is confirmed in our experiments, though we want to add that this is only true when the data size is increased in a balanced fashion.

The results in Table 5 confirm our previous analysis. Each of the oversampling, undersampling (except Cl Knn) and boosting variants seems to improve upon the BL performance. Sobhani et al (2015) note that the direct application of ensemble techniques on imbalanced data does not circumvent the imbalanced learning issue. In our experiments on behaviour data, we do find a simple boosting procedure applied to the imbalanced data to improve upon the predictive performance. The oversampling techniques OSR, SMOTE and ADASYN show similar outcomes and are competitive methods, seeming to outperform the undersampling techniques and the plain boosting (AB) and cost-sensitive learning (AC) approaches. When dealing with traditional data, the performances of oversampling techniques are generally worse than those of undersampling approaches (Drummond and Holte 2003). It is interesting to observe that for these large and sparse behaviour datasets the opposite is true. This is heavily related to the intrinsic properties of the data under consideration. Undersampling (or feature selection, for that matter) is not recommended, because, as we have noted in Section 2.1, each feature provides a small additional contribution to the predictive performance. With respect to AC, the results indicate that R = RL is a better choice compared to more random cost ratios R = 2, 8. The EE-technique has the lowest average rank and is our best performing method. The combination of several balanced subsets with filtered noise/outlier contributions allows one to explore the majority class space efficiently. To strengthen our previous statements, we will in the next paragraphs conduct statistical significance tests in accordance with the procedure outlined in Demsar (2006). The latter paper presents a framework to perform statistical comparisons of classifiers over multiple datasets.

23 The BL technique trains single SVMs on the imbalanced training data.

The first null-hypothesis we try to reject postulates that all algorithms perform equally well or, equivalently, that their average ranks R_j (see Table 5) do not differ. Under the null-hypothesis, the Friedman statistic (Friedman 1937)

\chi_F^2 = \frac{12N}{k(k+1)} \left[ \sum_{j=1}^{k} R_j^2 - \frac{k(k+1)^2}{4} \right] \quad (6)

with N the number of datasets (N = 15) and k the number of algorithms (k = 13), can be used to compute the statistic (Iman and Davenport 1980)

F_F = \frac{(N-1)\,\chi_F^2}{N(k-1) - \chi_F^2} \quad (7)

The latter is distributed according to the F-distribution with k − 1 and (k − 1)(N − 1) degrees of freedom. Evaluating this measure with the average ranks shown in Table 5 results in F_F = 22.98, which is a lot higher than the critical value of 1.81 at the α = 0.05 significance level. We can therefore reject the null-hypothesis and proceed with post-hoc tests.
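Eqs. (6)–(7) can be checked directly from the average rank column of Table 5 (a hedged sketch, not the authors' implementation; the rank values are the paper's reported average ranks and the critical value 1.81 is quoted from the text):

```python
# Friedman statistic (Eq. 6) and Iman-Davenport correction (Eq. 7) computed
# from the average ranks reported in Table 5 (N = 15 datasets, k = 13 methods).

ranks = [11.600, 5.000, 4.533, 4.800, 8.167, 12.467, 8.133,
         8.567, 8.267, 8.467, 5.400, 3.267, 2.333]
N, k = 15, len(ranks)

chi2_F = (12.0 * N / (k * (k + 1))) * (sum(R * R for R in ranks)
                                       - k * (k + 1) ** 2 / 4.0)   # Eq. (6)
F_F = (N - 1) * chi2_F / (N * (k - 1) - chi2_F)                    # Eq. (7)

print(round(chi2_F, 2), round(F_F, 2))  # 111.88 22.99
# F_F far exceeds the 1.81 critical value, so the null-hypothesis is rejected
# (the text reports 22.98; the tiny gap comes from rounding of the ranks).
```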

The Nemenyi test (Nemenyi 1963) allows one to compare each classifier to one another and adjusts the critical value to compensate for making k(k − 1)/2 comparisons.24 "The performance of two classifiers is significantly different if the corresponding average ranks differ by at least the critical difference (CD)" (Demsar 2006). We refer to the aforementioned paper to calculate the CD. The result of these comparisons at the α = 0.05 significance level is shown in Table 6. This test is already quite revealing in the sense that, for instance, the oversampling methods (OSR, SMOTE and ADASYN), AC (R = RL) and the EE-techniques are significantly different from the BL. Yet it does not indicate that undersampling or applying plain boosting (AB) results in significantly different performances compared to the BL. The Nemenyi test is still conservative, and we therefore compare all classifiers to a single control classifier in the next paragraphs. Note that we only make k − 1 comparisons in that situation, resulting in stronger results compared to the Nemenyi test.

When comparing a single classifier i with a control classifier c, the null-hypothesis states that the two classifiers i and c perform equally well. Under this hypothesis, the test statistic (based on the average ranks R_i shown in Table 5)

z = \frac{R_i - R_c}{\sqrt{k(k+1)/(6N)}} \quad (8)

is distributed according to a standard normal distribution. This z-value is used to find the corresponding probability (p-value) p from the cumulative standard normal

24 The larger the number of comparisons, the higher the proportion of null-hypotheses that are wrongfully rejected due to random chance.


Table 5 Final comparison of all methods in terms of AUC test set performance; see Section 4.2 for details on the experimental set-up. β ≠ 0 and βu ≠ 0 for the oversampling respectively undersampling techniques; μ = 100 for each of the boosting variants. Standard deviations are given between brackets and the performance difference with the BL is shown in square brackets.

             Mov G(p = 1)        Mov G(p = 25)       Mov Th(p = [])      Yahoo A(p = 1)
BL           71.6 (2.62) [0]     81.41 (1.32) [0]    79.77 (5.33) [0]    55.92 (2.97) [0]
OSR          75.35 (2.27) [3.8]  83.76 (2.09) [2.3]  85.13 (6.1) [5.4]   60.05 (2.71) [4.1]
SMOTE        76.16 (2.27) [4.6]  83.7 (2.1) [2.3]    85.67 (4.98) [5.9]  60.1 (3) [4.2]
ADASYN       76.07 (2.26) [4.5]  83.63 (2.04) [2.2]  85.65 (5.6) [5.9]   59.9 (2.99) [4]
RUS          72.88 (2.73) [1.3]  81.52 (2.15) [0.1]  82.91 (7.19) [3.1]  57.04 (1.77) [1.1]
Cl Knn       71.43 (1.36) [-0.2] 80.88 (1.19) [-0.5] 78.87 (4.71) [-0.9] 55.78 (2.71) [-0.1]
Far Knn      71.9 (2.95) [0.3]   80.9 (1.48) [-0.5]  84.07 (4.64) [4.3]  57.2 (1.33) [1.3]
CBU          74.17 (2.36) [2.6]  81.51 (1.04) [0.1]  82.76 (7.22) [3]    58.77 (3.43) [2.8]
AB           71.65 (1.73) [0.1]  84.52 (1.89) [3.1]  82.43 (5.18) [2.7]  58.35 (2.62) [2.4]
AC(R = 28)   71.61 (2.46) [0]    83.46 (1.82) [2]    83.27 (5.6) [3.5]   57.72 (2.47) [1.8]
AC(R = RL)   74.65 (2.7) [3.1]   83.35 (2.09) [1.9]  85.41 (4.49) [5.6]  59.47 (2.33) [3.5]
EE(S = 10)   76.04 (2.66) [4.4]  85.05 (1.85) [3.6]  86.1 (5.78) [6.3]   59.66 (3.13) [3.7]
EE(S = 15)   76.12 (2.88) [4.5]  85.14 (1.86) [3.7]  86.42 (5.86) [6.7]  59.76 (2.93) [3.8]

             Yahoo A(p = 25)     Yahoo G(p = 1)      Yahoo G(p = 25)     TaFeng(p = 1)
BL           61.68 (2.42) [0]    66.84 (3.66) [0]    78.82 (1.39) [0]    55.75 (1.6) [0]
OSR          64.59 (3.12) [2.9]  73.08 (2.96) [6.2]  78.52 (2.01) [-0.3] 61.21 (2.24) [5.5]
SMOTE        65.56 (3.33) [3.9]  73.11 (3.12) [6.3]  79.01 (1.21) [0.2]  61.72 (1.81) [6]
ADASYN       65.13 (3.38) [3.4]  73.22 (3.17) [6.4]  79.74 (1.68) [0.9]  61.68 (1.86) [5.9]
RUS          64.11 (2.8) [2.4]   70.65 (3.39) [3.8]  78.91 (1.55) [0.1]  59.25 (2.18) [3.5]
Cl Knn       61.14 (2.13) [-0.5] 66.34 (3.54) [-0.5] 77.26 (1.46) [-1.6] 55.77 (1.28) [0]
Far Knn      63.96 (3.03) [2.3]  66.97 (3.54) [0.1]  78.26 (2.2) [-0.6]  59.98 (1.26) [4.2]
CBU          62.27 (1.79) [0.6]  71.27 (2.89) [4.4]  75.22 (2.42) [-3.6] 58.4 (1.57) [2.6]
AB           63.88 (2.67) [2.2]  68.9 (2.03) [2.1]   79.01 (1.66) [0.2]  56.21 (1.79) [0.5]
AC(R = 28)   64.32 (3.56) [2.6]  68.89 (3.11) [2]    78.99 (1.89) [0.2]  56.33 (1.83) [0.6]
AC(R = RL)   64.31 (3.03) [2.6]  73.13 (2.8) [6.3]   78.41 (2) [-0.4]    61.6 (2.26) [5.9]
EE(S = 10)   66.51 (3.24) [4.8]  72.61 (3.15) [5.8]  80.52 (1.6) [1.7]   61.2 (1.82) [5.4]
EE(S = 15)   66.36 (3.18) [4.7]  73.48 (2.32) [6.6]  80.54 (1.56) [1.7]  61.13 (1.83) [5.4]

             TaFeng(p = 25)      Book(p = 1)         Book(p = 25)        LST(p = 1)
BL           66.94 (1.34) [0]    52.6 (1.29) [0]     60.08 (0.71) [0]    99.99 (0.01) [0]
OSR          68.77 (1.23) [1.8]  55.87 (1.42) [3.3]  64.62 (0.57) [4.5]  99.99 (0.01) [0]
SMOTE        68.47 (1.5) [1.5]   55.07 (0.88) [2.5]  62.96 (0.82) [2.9]  99.99 (0.01) [0]
ADASYN       68.48 (1.47) [1.5]  55.04 (0.91) [2.4]  63.02 (0.57) [2.9]  99.99 (0.01) [0]
RUS          68.28 (1.39) [1.3]  54.26 (0.92) [1.7]  63.28 (0.8) [3.2]   99.98 (0.01) [0]
Cl Knn       66.13 (1.43) [-0.8] 52.69 (1.3) [0.1]   60.02 (0.79) [-0.1] 99.99 (0.01) [0]
Far Knn      68.06 (1.41) [1.1]  56.25 (1.52) [3.7]  64.15 (1.12) [4.1]  99.98 (0.01) [0]
CBU          63.84 (1.07) [-3.1] 53.75 (1.01) [1.2]  54.68 (0.88) [-5.4] []
AB           67.65 (1.55) [0.7]  54.27 (1.95) [1.7]  65 (0.67) [4.9]     99.99 (0.01) [0]
AC(R = 28)   69.31 (1.23) [2.4]  53.72 (1) [1.1]     61.24 (0.8) [1.2]   99.98 (0.01) [0]
AC(R = RL)   67.15 (1.51) [0.2]  55.73 (1.22) [3.1]  64.6 (0.64) [4.5]   99.99 (0.01) [0]
EE(S = 10)   70.3 (1.35) [3.4]   55.09 (1.29) [2.5]  65.37 (0.61) [5.3]  99.98 (0.01) [0]
EE(S = 15)   70.4 (1.3) [3.5]    55.35 (1.26) [2.8]  65.4 (0.51) [5.3]   99.98 (0.01) [0]

Table 5 Continued. Additionally, an average rank column is added, showing the mean rank of each algorithm across all datasets. Note that we excluded the LST dataset for this purpose.

             Adver(p = [])       Adver(p = 1)        CRF(p = [])          Bank(p = [])
BL           96.61 (1.82) [0]    90.93 (3.02) [0]    64.06 (16.43) [0]    66.82 (0.88) [0]
OSR          96.93 (1.91) [0.3]  93.3 (2.02) [2.4]   80.74 (12.93) [16.7] 71.39 (0.79) [4.6]
SMOTE        97.05 (1.66) [0.4]  93.35 (2.01) [2.4]  78.7 (16.56) [14.6]  []
ADASYN       96.91 (1.95) [0.3]  93.46 (2.21) [2.5]  78.87 (16.71) [14.8] []
RUS          96.81 (1.87) [0.2]  92.38 (2.51) [1.5]  83.98 (5.99) [19.9]  69.41 (1.19) [2.6]
Cl Knn       96.4 (1.48) [-0.2]  89.73 (3.42) [-1.2] 76.63 (16.19) [12.6] 66.17 (0.72) [-0.6]
Far Knn      95.77 (1.81) [-0.8] 93.88 (1.78) [3]    83.75 (13.11) [19.7] 66.95 (0.56) [0.1]
CBU          97.15 (1.88) [0.5]  94.18 (2.3) [3.3]   []                   []
AB           97.34 (2.18) [0.7]  91.39 (3.23) [0.5]  77.62 (15.15) [13.6] 66.82 (0.88) [0]
AC(R = 28)   97.44 (1.93) [0.8]  91 (3.35) [0.1]     68.31 (14.93) [4.2]  67.67 (0.71) [0.9]
AC(R = RL)   97.46 (1.71) [0.8]  93.51 (2.17) [2.6]  85.08 (9.77) [21]    70.7 (0.8) [3.9]
EE(S = 10)   97.64 (1.35) [1]    92.97 (2.75) [2]    86.18 (10.17) [22.1] 71.46 (0.81) [4.6]
EE(S = 15)   97.63 (1.35) [1]    93.3 (2.14) [2.4]   86.35 (9.99) [22.3]  71.54 (0.76) [4.7]

             Average Rank
BL           11.600
OSR          5.000
SMOTE        4.533
ADASYN       4.800
RUS          8.167
Cl Knn       12.467
Far Knn      8.133
CBU          8.567
AB           8.267
AC(R = 28)   8.467
AC(R = RL)   5.400
EE(S = 10)   3.267
EE(S = 15)   2.333

Table 6 Nemenyi test outcomes at the α = 0.05 significance level. A value of 1 indicates the null-hypothesis is rejected and thus marks the two algorithms as significantly different (a value of 0 means we accept the null-hypothesis). Methods are listed in the same order as in Table 5 and are represented more concisely.

      BL  RO  SM  AD  RU  Cl  Fa  CBU  AB  AC1  AC2  EE1  EE2
BL    0   1   1   1   0   0   0   0    0   0    1    1    1
RO    1   0   0   0   0   1   0   0    0   0    0    0    0
SM    1   0   0   0   0   1   0   0    0   0    0    0    0
AD    1   0   0   0   0   1   0   0    0   0    0    0    0
RU    0   0   0   0   0   0   0   0    0   0    0    1    1
Cl    0   1   1   1   0   0   0   0    0   0    1    1    1
Fa    0   0   0   0   0   0   0   0    0   0    0    1    1
CBU   0   0   0   0   0   0   0   0    0   0    0    1    1
AB    0   0   0   0   0   0   0   0    0   0    0    1    1
AC1   0   0   0   0   0   0   0   0    0   0    0    1    1
AC2   1   0   0   0   0   1   0   0    0   0    0    0    0
EE1   1   0   0   0   1   1   1   1    1   1    0    0    0
EE2   1   0   0   0   1   1   1   1    1   1    0    0    0


distribution function: p = 2 × min(Φ(z), 1 − Φ(z)). Holm's method (Holm 1979) compares k − 1 classifiers to a single control classifier and sorts the corresponding p-values in ascending order, so that p1 ≤ p2 ≤ … ≤ p(k−1). Each pi is subsequently compared to its associated confidence level25 αcomp = α/(k − i). Holm starts with performing the check p1 < α/(k − 1) and rejects the null-hypothesis if the test passes. It then proceeds with a similar check for i = 2 and continues until a certain null-hypothesis cannot be rejected. The remaining hypotheses are retained.
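The step-down procedure, together with the z-statistic of Eq. (8), can be sketched as follows (illustrative code, not the authors' implementation; the average ranks are taken from Table 5, and with BL as control it recovers the same set of rejected hypotheses as Table 7):

```python
# Holm's step-down test against a control classifier, using the two-sided
# p-value of the normal z-statistic from Eq. (8).
import math

def normal_cdf(z):
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))

def holm(avg_ranks, control, k, N, alpha=0.05):
    se = math.sqrt(k * (k + 1) / (6.0 * N))
    stats = []
    for name, R in avg_ranks.items():
        if name == control:
            continue
        z = (R - avg_ranks[control]) / se              # Eq. (8)
        p = 2 * min(normal_cdf(z), 1 - normal_cdf(z))  # two-sided p-value
        stats.append((p, name))
    stats.sort()                                       # ascending p-values
    rejected = []
    for i, (p, name) in enumerate(stats, start=1):
        if p < alpha / (k - i):                        # compare to alpha/(k - i)
            rejected.append(name)
        else:
            break                                      # retain remaining hypotheses
    return rejected

ranks = {"BL": 11.600, "EE(S=15)": 2.333, "EE(S=10)": 3.267, "SMOTE": 4.533,
         "ADASYN": 4.800, "OSR": 5.000, "AC(R=RL)": 5.400, "Far Knn": 8.133,
         "RUS": 8.167, "AB": 8.267, "AC(R=28)": 8.467, "CBU": 8.567,
         "Cl Knn": 12.467}
print(holm(ranks, control="BL", k=13, N=15))
# ['EE(S=15)', 'EE(S=10)', 'SMOTE', 'ADASYN', 'OSR', 'AC(R=RL)']
```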

Table 7 shows the result of the Holm test at the α = 0.05 significance level with the BL algorithm as control classifier. The table lists each method according to sorted p-values and shows the z-, p- and αcomp-values. The significance column indicates whether the proposed method is significantly different from the BL and incidentally matches the result of the Nemenyi test. The p-values of the (Far Knn, RUS, CBU) undersampling methods, AB and AC (R = 28) are only slightly higher than the level required to reject the null-hypothesis. We included a critical significance level αcrit, corresponding to the lowest possible significance level upon which the method would be considered as significantly different from the BL (if α = αcrit, then p = αcomp). In the case of AB, for example, if we were to choose a significance level α = 0.0764 (a confidence level of 92.36%), the associated p-value would be smaller than αcomp and we would proceed to conclude26 that AB performs significantly differently from the BL. To summarize: the null-hypothesis of equal performance between the BL and an alternative method proposed in Section 3 (except Cl Knn) is rejected with a confidence level of more than 91% according to Holm's test. Hence, the performance of the BL approach is significantly worse than that of each of the proposed over- and undersampling methods (except Cl Knn), cost-sensitive learning techniques (AC) and boosting variants (AB, EE).

Similar to the previous paragraph, Table 8 presents the results of Holm's test at the α = 0.05 significance level with EE(S = 15) as control classifier. The results incidentally correspond with the result of the Nemenyi test and indicate that the oversampling (OSR, SMOTE and ADASYN), AC(R = RL) and EE(S = 10) methods are not statistically different from the EE(S = 15) technique at the 0.05 significance level. However, if we were to adapt the significance level to approximately α = 0.25 (a 75% confidence level), the results would conclude significance (except for EE(S = 10)). To summarize: the null-hypothesis of equal performance between EE(S = 15) and an alternative method proposed in Section 3 (except EE(S = 10)) is rejected with a confidence level of more than 75% according to Holm's test. Hence, the performance of the oversampling, undersampling, AB and AC approaches is significantly worse than that of EE(S = 15). We can therefore safely conclude EE to be the preferred method in terms of AUC performance when dealing with imbalanced behaviour data.

25 αcomp adjusts the value of α to compensate for multiple comparisons.
26 This is not entirely true, since Holm's method requires Far Knn to pass the test first before proceeding, which only occurs at the 0.088662 significance level.


Table 7 Holm test at the α = 0.05 significance level with BL as reference. The table shows the z test statistic with associated p-value; αcomp = α/(k − i), with k the number of algorithms (k = 13) and i corresponding to the position of the method in the sorted p-value vector. Since we already ranked the p-values in ascending order, i takes on the row number (e.g. OSR has i = 5). The column "significant" denotes if we can reject the null-hypothesis (significant: p < αcomp). αcrit corresponds with the smallest possible significance level where we would decide to reject the null-hypothesis (αcrit = α · p/αcomp).

             z          p          αcomp      significant  αcrit
EE(S = 15)   -6.51642   7.2E-11    0.004167   1            8.64E-10
EE(S = 10)   -5.86009   4.63E-09   0.004545   1            5.09E-08
SMOTE        -4.96936   6.72E-07   0.005      1            6.72E-06
ADASYN       -4.78183   1.74E-06   0.005556   1            1.56E-05
OSR          -4.64119   3.46E-06   0.00625    1            2.77E-05
AC(R = RL)   -4.35991   1.3E-05    0.007143   1            9.11E-05
Far Knn      -2.4378    0.014777   0.008333   0            0.088662
RUS          -2.41436   0.015763   0.01       0            0.078815
AB           -2.34404   0.019076   0.0125     0            0.076305
AC(R = 28)   -2.20339   0.027567   0.016667   0            0.082701
CBU          -2.13307   0.032919   0.025      0            0.065837
Cl Knn       0.609449   0.542227   0.05       0            0.542227

Table 8 Holm test at the α = 0.05 significance level with EE(S = 15) as reference.

             z          p          αcomp      significant  αcrit
Cl Knn       7.12587    1.03E-12   0.004167   1            1.24E-11
BL           6.516421   7.2E-11    0.004545   1            7.92E-10
CBU          4.383348   1.17E-05   0.005      1            0.000117
AC(R = 28)   4.313027   1.61E-05   0.005556   1            0.000145
AB           4.172384   3.01E-05   0.00625    1            0.000241
RUS          4.102063   4.09E-05   0.007143   1            0.000287
Far Knn      4.078623   4.53E-05   0.008333   1            0.000272
AC(R = RL)   2.156513   0.031044   0.01       0            0.155218
OSR          1.875229   0.060761   0.0125     0            0.243045
ADASYN       1.734587   0.082814   0.016667   0            0.248442
SMOTE        1.547064   0.121848   0.025      0            0.243696
EE(S = 10)   0.65633    0.511612   0.05       0            0.511612

4.6.2 Timing-wise comparison

The performance of each of the proposed algorithms in terms of training times depends highly upon the chosen parameters. The β and βu parameters control the number of training instances and are an important contribution to the subsequent SVM learning times for each of the over- and undersampling techniques. The number of boosting iterations t ∈ [0,T] is also a major factor in each of the boosting variants proposed in Section 3.3. Furthermore, the chosen regularization parameter C highly influences the training times of the SVM inducer: higher C-values result in larger training times (Fan et al 2008). Not only do the chosen hyperparameters affect learning times; data characteristics such as the number of minority instances, the imbalance level, the number of features, the amount of overlapping, etc. have a major effect as well.


In comparing each of the methods outlined in Section 3, we make use of a methodology similar to the one previously presented (see Section 4.6.1). For each of the ten folds, a suitable parameter combination is selected based on optimal validation set AUC-performance. These chosen parameters are applied in determining the timings of each method. The final results indicate the average timings across ten folds. Hence, we show the required amount of time for each method when using the parameter combination resulting in the best performance. Note that this might, for instance, result in using β = 13 in case of OSR, or using only t = 5 boosting iterations in EE. Computational timings are shown in Table 9.

As one can observe from the aforementioned table, CBU is the slowest method, as it relies on computationally intensive clustering procedures. The synthetic SMOTE and ADASYN oversampling approaches are also very time consuming: they both rely on computationally expensive nearest neighbour computations, in conjunction with a synthetic sample generation procedure that requires a single pass over all minority training instances.

The RUS method shows, as expected, superior results in terms of computational timings. Other undersampling methods perform quite well, though they are slower than BL and OSR due to their inherent nearest neighbour computations.

The boosting variants seem to perform worse than the BL, OSR and each of the undersampling methods. Since EE was the best method in terms of AUC-performance (see Section 4.6.1), we focus on this learner in this paragraph. For the medium sized datasets (all but CRF and Bank), OSR outperforms EE(S = 15), yet for the large datasets (CRF and Bank) the opposite is true. A SVM is known to have a quadratic computational complexity in the number of training instances (Suykens et al 2002; García and Lozano 2007). For this reason, we consider OSR (or any other oversampling technique) as inappropriate for large datasets. Since with EE each subset has a size limited to twice the minority class training size, we expect increasing benefits of the EE-technique for larger and more imbalanced behaviour data. As Junque de Fortuny et al (2014a) have observed, larger behaviour data in terms of the number of instances or features also contributes to an increase in predictive power. Hence, we expect to see this type of big behaviour data more often in real-life applications, where computational resources can become a major constraint. Because the EE-technique is easily parallelizable by simultaneously performing the boosting process for each of the independent subsets, we added an extra row in Table 9: EE par corresponds with the time needed for EE(S = 15) divided by 15, indicating the time that would be required for EE if each subset were trained in parallel. We can see that EE par can outperform OSR in 8 out of 16 cases, precisely for the datasets that show a high imbalance level (e.g. p = 1 or p = []).
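The structure of EE and its trivial parallelism can be sketched as follows (a toy illustration, not the authors' implementation: the boosted SVM/LR base learner of the paper is replaced by a simple class-centroid scorer, and the data is synthetic; each subset contains all minority instances plus an equally sized random majority sample, i.e. twice the minority class size):

```python
# Structural sketch of EasyEnsemble: draw S balanced subsets, train one
# learner per subset (each iteration is independent, hence parallelizable),
# and average the resulting scores.
import random

def centroid_scorer(X, y):
    """Toy base learner: score = squared distance to the majority centroid
    minus squared distance to the minority centroid (higher = more minority-like)."""
    pos = [x for x, t in zip(X, y) if t == 1]
    neg = [x for x, t in zip(X, y) if t == 0]
    cp = [sum(c) / len(pos) for c in zip(*pos)]
    cn = [sum(c) / len(neg) for c in zip(*neg)]
    def score(x):
        dp = sum((a - b) ** 2 for a, b in zip(x, cp))
        dn = sum((a - b) ** 2 for a, b in zip(x, cn))
        return dn - dp
    return score

def easy_ensemble(X, y, S=10, base=centroid_scorer, seed=0):
    rng = random.Random(seed)
    minority = [i for i, t in enumerate(y) if t == 1]
    majority = [i for i, t in enumerate(y) if t == 0]
    scorers = []
    for _ in range(S):  # independent subsets -> trivially parallel
        sub = minority + rng.sample(majority, len(minority))
        scorers.append(base([X[i] for i in sub], [y[i] for i in sub]))
    return lambda x: sum(s(x) for s in scorers) / S

# synthetic imbalanced data: 190 majority points near (0,0), 10 minority near (2,2)
random.seed(1)
X = [[random.gauss(0, 1), random.gauss(0, 1)] for _ in range(190)] \
  + [[random.gauss(2, 1), random.gauss(2, 1)] for _ in range(10)]
y = [0] * 190 + [1] * 10
f = easy_ensemble(X, y, S=5)
print(f([2.0, 2.0]) > f([0.0, 0.0]))  # True: the minority region scores higher
```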

We conclude our comparison of the proposed methods in Figure 5, which shows the average ranks in terms of computational timings (see Table 9) and AUC performance for each of the available data sources. Regarding the ranks with respect to the latter, we added an extra row EE par, having the same AUC-performance as EE(S = 15) for each dataset in Table 5. Ranks are subsequently determined according to the procedure outlined in Section 4.6.1. Additionally, Figure 18 in Appendix D shows the result for the large and highly imbalanced CRF and Bank datasets. We expect the latter type of data to be of major interest for reasons already indicated.


Table 9 Computational timings (in seconds) averaged across ten folds, using the parameter combination resulting in the highest validation set AUC-performance. First and second best performances are emphasized in boldface (third and fourth best are underlined). EE par is the time required for EE(S = 15) divided by 15.

             Mov G(p = 1)  Mov G(p = 25)  Mov Th(p = [])  Yahoo A(p = 1)
BL           0.032889      0.056697       0.558563        0.026922
OSR          0.055043      0.062802       0.99009         0.044421
SMOTE        0.218821      0.937057       3.841482        0.057726
ADASYN       0.284688      1.802399       5.191265        0.087694
RUS          0.011431      0.025383       0.155224        0.007991
Cl Knn       0.046599      0.599846       0.989914        0.037182
Far Knn      0.039887      0.80072        0.683023        0.027788
CBU          1.034111      10.60173       6.822839        1.692477
AB           0.169792      0.841443       3.460246        0.139251
AC(R = 28)   0.471994      2.996585       1.086907        0.366555
AC(R = RL)   0.53376       1.179542       6.065177        0.209015
EE(S = 10)   0.117226      6.065145       1.17995         0.148973
EE(S = 15)   0.20474       7.173737       2.119991        0.180365
EE par       0.013649      0.478249       0.141333        0.012024

             Yahoo A(p = 25)  Yahoo G(p = 1)  Yahoo G(p = 25)  TaFeng(p = 1)
BL           0.092954         0.011915        0.044164         0.026728
OSR          0.027887         0.013241        0.047206         0.040919
SMOTE        1.062686         0.056153        0.883698         0.219553
ADASYN       2.050993         0.079073        1.733367         0.306618
RUS          0.048471         0.003234        0.033423         0.002916
Cl Knn       0.84391          0.025404        0.502515         0.092167
Far Knn      0.664124         0.026576        0.500206         0.080159
CBU          15.69442         1.287221        13.55035         2.467279
AB           0.445546         0.078777        0.169977         0.114619
AC(R = 28)   1.034044         0.321723        0.515953         0.926178
AC(R = RL)   0.706215         0.226741        0.112949         0.610233
EE(S = 10)   1.026577         0.100331        1.527146         0.058052
EE(S = 15)   1.607596         0.077483        2.472582         0.10538
EE par       0.107173         0.005166        0.164839         0.007025

             TaFeng(p = 25)  Book(p = 1)  Book(p = 25)  LST(p = 1)
BL           0.032033        0.080035     0.318093      0.652045
OSR          0.032414        0.132927     0.092757      0.87152
SMOTE        5.089283        3.409418     11.43444      4.987705
ADASYN       8.148419        3.689661     12.25441      6.840083
RUS          0.020457        0.022713     0.031972      0.432839
Cl Knn       1.713731        0.400873     3.711648      2.508374
Far Knn      1.539437        0.379086     3.988552      2.511037
CBU          26.42686        4.198663     46.31987      []
AB           0.713265        0.61719      1.238585      2.466151
AC(R = 28)   1.234647        1.666131     2.330635      1.451671
AC(R = RL)   0.279047        0.860346     0.197053      1.23763
EE(S = 10)   2.484502        2.145747     7.177484      0.524066
EE(S = 15)   3.363971        2.480066     11.21945      0.784111
EE par       0.224265        0.165338     0.747963      0.052274

Table 9 Continued. Additionally, an extra column indicating average ranks is included (lower ranks are preferred).

             Adver(p = [])  Adver(p = 1)  CRF(p = [])  Bank(p = [])
BL           0.010953       0.002796      0.725911     70.89334
OSR          0.012178       0.006166      3.685813     179.7481
SMOTE        0.123112       0.017764      5.633862     []
ADASYN       0.183767       0.021728      5.768669     []
RUS          0.012115       0.00204       0.147392     5.247441
Cl Knn       0.061324       0.005568      1.106755     73.73282
Far Knn      0.079078       0.007069      1.110379     97.59619
CBU          3.378235       3.236754      []           []
AB           0.069199       0.103518      1.153196     83.08618
AC(R = 28)   0.193092       0.068905      2.047434     71.70548
AC(R = RL)   0.107652       0.037963      1.387174     106.3466
EE(S = 10)   0.138485       0.085686      0.198656     24.95117
EE(S = 15)   0.185136       0.139121      0.285345     36.40107
EE par       0.012342       0.009275      0.019023     2.426738

             Average Rank [pos]
BL           2.94 [2]
OSR          4.19 [4]
SMOTE        9.59 [11]
ADASYN       10.91 [13]
RUS          1.38 [1]
Cl Knn       6.5 [5]
Far Knn      6.56 [6]
CBU          14 [14]
AB           8.06 [7]
AC(R = 28)   10.81 [12]
AC(R = RL)   9.25 [9]
EE(S = 10)   8.25 [8]
EE(S = 15)   9.56 [10]
EE par       3 [3]

Timings are a substantial concern with respect to these large datasets. As can be observed, the EE-technique thrives for this big and highly imbalanced data, even in its non-parallel form.


[Figure 5: scatter plot of average rank (AUC) versus average rank (Time) for BL, OSR, SMOTE, ADASYN, RUS, Cl Knn, Far Knn, CBU, AB, AC(R = 28), AC(R = RL), EE(S = 10), EE(S = 15) and EE par.]

Fig. 5 Average rank AUC versus average rank Time (see Table 9) across all datasets from Table 2 (excluding Flickr and Kdd). Regarding the AUC rank, an extra row EE par is added, having the same AUC-value as EE(S = 15). Ranks for each dataset are subsequently obtained via the procedure outlined in Section 4.6.1. Points occurring in the upper-right region are preferred.

5 The effect of the chosen base learner

In this short section, we investigate the influence of the chosen base learner. More specifically, we ask ourselves whether some base inducers can more easily cope with the imbalanced learning issue, and in doing so we verify the conclusion that EE is a suitable technique to elevate upon the baseline performance. We rely on several popular classification methods for behaviour data: two regularization based techniques (SVM, LR) and two heuristic types of approaches (Naive Bayes (NB), Behavioural Similarity (BeSim)). The next paragraphs briefly introduce these methods. Note that source codes and dataset results are provided in our on-line repository.

LR constructs a linear model that maximizes the likelihood of the observed data by imposing a logistic model for the parametric probability model P(Y|X). Ng and Jordan (2002) provide a thorough introduction to the topic27 and note that, in its plain form, LR can suffer from overfitting, especially when the input data is very high dimensional. Therefore, we resort to regularized logistic regression. We opt for L2-regularized LR, since L1-regularization corresponds with a natural way of imposing feature selection (Ng 2004). The LIBLINEAR toolbox (Fan et al 2008) is used to obtain these L2-regularized LR models.

27 For an accessible introduction, see the chapter on "Generative and Discriminative Classifiers: Naive Bayes and Logistic Regression" provided at the following link: http://www.cs.cmu.edu/~tom/NewChapters.html


NB (Ng and Jordan 2002) relies on the use of Bayes' rule, which expresses the posterior P(Y|X) as the product of the likelihood P(X|Y) and the prior P(Y), divided by the evidence P(X). The naivety of the approach originates from assuming conditional independence of the features. We impose a multivariate event model regarding the distribution P(X|Y). Junque de Fortuny et al (2014a) provide an efficient NB implementation for dealing with large and sparse binary behaviour data.

The BeSim approach conceptually corresponds to projecting the bigraph to a unigraph, using the procedure outlined in Section 3.2. Next, the weighted-vote relational neighbour classifier (Macskassy and Provost 2007) is used to infer unknown labels/scores through a weighted probability estimation using the known labels of neighbouring nodes. A very fast and scalable implementation of this approach (the SW-transformation) is presented in detail in Stankova et al (2015).
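A minimal sketch of this idea follows (illustrative only, and deliberately naive: it assumes shared-item counts as unigraph weights, whereas the actual SW-transformation of Stankova et al (2015) is a far more scalable formulation; names and data are invented):

```python
# Toy BeSim-style pipeline: project a bipartite person-by-item graph to a
# person-person unigraph, then score an unlabelled person by a weighted vote
# over the known labels of its neighbours (weighted-vote relational neighbour).

def project_bigraph(items_per_person):
    """Unigraph weight = number of shared items between two persons."""
    persons = list(items_per_person)
    w = {}
    for i, a in enumerate(persons):
        for b in persons[i + 1:]:
            shared = len(items_per_person[a] & items_per_person[b])
            if shared:
                w[(a, b)] = w[(b, a)] = shared
    return w

def wvrn_score(person, w, labels):
    """Weighted-vote relational neighbour: weighted mean of known labels."""
    num = den = 0.0
    for (a, b), weight in w.items():
        if a == person and b in labels:
            num += weight * labels[b]
            den += weight
    return num / den if den else 0.5   # fall back to prior with no neighbours

items = {"ann": {"i1", "i2"}, "bob": {"i1", "i2", "i3"},
         "eve": {"i3", "i4"}, "joe": {"i2", "i3"}}
w = project_bigraph(items)
# joe shares 1 item with ann (label 1), 2 with bob (label 1), 1 with eve (label 0)
print(round(wvrn_score("joe", w, {"ann": 1, "bob": 1, "eve": 0}), 2))  # 0.75
```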

In Table 10, we compare the performances of the baseline (BL) application of the methods with their EE-version (with S = 15, T = 20). It is quite striking to see the EE-version dominate the baseline version when looking at each method individually. This suggests that the imbalanced learning issue has an impact on each of the aforementioned techniques. Focussing on the regularization based approaches, we observe the baseline LR to be less influenced by the imbalance than the SVM method. Their EE-versions (EE SVM and EE LR) can boost the baseline performance in almost all cases and have the highest average ranks amongst all the proposed methods. As already emphasized, the regularization based approaches offer an added element of flexibility in the sense that the strength of the underlying base learner can be controlled.

The heuristic approaches NB and BeSim do not offer this flexibility. Furthermore, it is not directly possible to integrate instance weights during the training phase of the base learner. During the boosting process, we therefore sample from the distribution Dt to generate (unweighted) examples and train a BeSim/NB learner next. The results in Table 10 indicate BL NB to be the worst performing algorithm amongst all methods tried. Due to the conditional independence assumption, one could regard this as a weak learner, which is therefore suitable to be used in a boosting process. Indeed, the EE NB approach is able to significantly improve upon its performance. We found this to be the slowest EE-version28 (e.g. taking more than 24 hours to process a single fold of the Bank dataset, versus several minutes for the other versions). The average rank estimate of EE NB is rather pessimistic, since we assigned the lowest performance across all methods in case of missing results. BeSim can be considered a stronger learner compared to NB; its possible boost in performance is typically lower than that of NB. Having said this, we found EE BeSim to improve upon BL BeSim for all datasets (except Adver(p = 1)).

28 This statement only relates to the EE-version. In the baseline case, the timings of the NB-version are roughly of the same order of magnitude as SVM and LR (though this heavily depends on the chosen regularization parameter C). With EE, each subset has a size limited to twice the minority class training size (a relatively low number compared to the size of the majority class). This works in favour of SVM/LR (e.g. SVM has a computational complexity of O(m²), with m the number of training instances). Also note that the NB-version is indeed optimized for sparse data. Internally, the NB-implementation writes a fairly large model file to a specified directory (sizes of 1 GB can be encountered there). This writing operation slows down the training stage (especially in case of EE, where S × T = 15 × 20 = 300 model files are constructed).


Table 10 Average tenfold AUC test set performance of the BL compared to its EE-version (S = 15, T = 20) for SVM, LR, BeSim and NB. In case the EE NB implementation crashed, we put a × in the table; if it is too slow (> 24 hours per fold), we place a []. Additionally, average ranks are assigned to each method. In case EE NB has missing results for a particular dataset, we ensure it has the lowest rank across all other algorithms.

             Mov G(p = 1)  Mov G(p = 25)  Mov Th(p = [])  Yahoo A(p = 1)
BL SVM       71.6 (2.62)   81.41 (1.32)   79.77 (5.33)    56.49 (3.37)
EE SVM       76.12 (2.88)  85.13 (1.86)   86.43 (5.86)    59.74 (2.96)
BL LR        71.02 (2.09)  84.39 (1.84)   83.14 (4.17)    57.84 (2.39)
EE LR        76.69 (2.92)  85.03 (1.98)   86.3 (5.37)     59.79 (2.62)
BL BeSim     76.1 (3.58)   81.3 (2.92)    82.81 (6.6)     56.27 (2.73)
EE BeSim     76.31 (3.71)  81.37 (2.9)    85.02 (6.28)    57.7 (1.71)
BL NB        70.26 (5.84)  77.01 (2.54)   70.48 (10.14)   52.56 (2.09)
EE NB        75.93 (2.83)  85.56 (2.01)   86.91 (4.15)    57.55 (2.73)

             Yahoo A(p = 25)  Yahoo G(p = 1)  Yahoo G(p = 25)  TaFeng(p = 1)
BL SVM       61.61 (2.48)     66.84 (3.66)    78.82 (1.39)     55.75 (1.6)
EE SVM       66.38 (3.16)     73.48 (2.32)    80.55 (1.55)     61.13 (1.83)
BL LR        66.27 (2.96)     69.82 (1.93)    80.45 (1.59)     58.91 (2.31)
EE LR        66.22 (3.28)     73.08 (2.14)    80.53 (1.56)     61.43 (2.32)
BL BeSim     64.54 (2.02)     68.89 (2.49)    79.55 (1.96)     57.89 (1.18)
EE BeSim     65.25 (2.23)     71.18 (2.91)    80.04 (1.85)     59.36 (1.47)
BL NB        65 (1.65)        63.33 (2.56)    78.89 (1.64)     54.61 (1.2)
EE NB        66.6 (2.79)      70.99 (2.88)    81.01 (1.3)      59.01 (1.84)

             TaFeng(p = 25)  Book(p = 1)   Book(p = 25)  LST(p = 1)
BL SVM       66.94 (1.34)    52.6 (1.29)   60.08 (0.71)  99.99 (0.01)
EE SVM       70.4 (1.3)      55.34 (1.28)  65.4 (0.51)   99.98 (0.01)
BL LR        69.24 (1.3)     55.34 (1.27)  63.84 (0.75)  99.99 (0.01)
EE LR        70.28 (1.28)    55.49 (1.49)  65.41 (0.63)  99.97 (0.02)
BL BeSim     67.49 (1.23)    55.19 (1.27)  63.7 (0.63)   99.99 (0.01)
EE BeSim     68 (1.21)       55.21 (1.15)  64.38 (0.42)  99.99 (0)
BL NB        65.21 (1.64)    52.93 (0.9)   59.75 (0.47)  98.69 (0.3)
EE NB        70.72 (1.15)    ×             63.46 (0.61)  99.92 (0.04)

             Adver(p = [])  Adver(p = 1)  CRF(p = [])    Bank(p = [])
BL SVM       96.37 (1.94)   91.18 (2.97)  64.36 (18.97)  66.82 (0.88)
EE SVM       97.63 (1.35)   93.3 (2.14)   86.35 (9.99)   71.54 (0.76)
BL LR        97.19 (1.44)   88.51 (1.93)  81.87 (19.63)  71.43 (0.72)
EE LR        97.57 (0.96)   93.02 (2.06)  86.84 (9.62)   71.77 (0.62)
BL BeSim     97.26 (1.12)   95.38 (1.35)  86.91 (9.36)   67.85 (0.67)
EE BeSim     97.38 (1.04)   93.83 (1.35)  87.02 (10.43)  70.41 (0.55)
BL NB        93.75 (1.9)    93.37 (1.9)   87.24 (9.38)   67.83 (0.63)
EE NB        94.04 (1.75)   ×             ×              []

             Flickr(p = 01)  Kdd(p = 05)   Average Rank
BL SVM       74.92 (0.17)    74.53 (0.05)  6.44 [7]
EE SVM       79.86 (0.13)    80.98 (0.05)  2.39 [1]
BL LR        79.03 (0.11)    81.29 (0.04)  4.28 [4]
EE LR        79.85 (0.13)    80.75 (0.05)  2.61 [2]
BL BeSim     74.62 (0.13)    74.95 (0)     5.11 [6]
EE BeSim     76.4 (0.13)     77.55 (0.03)  3.61 [3]
BL NB        81.36 (0.1)     74.29 (0.05)  6.5 [8]
EE NB        []              []            5.06 [5]


6 Conclusions and future research directions

We investigated the effects of sampling and cost-sensitive learning techniques on the problem of learning from imbalanced behaviour data. This type of big data has fundamentally different properties and internal structure than traditional data and is characterized by sparseness and very large dimensions. This is a new research topic, which enables benefits across a wide variety of application domains such as fraud detection, targeted advertising, predictive policing, churn prediction, default prediction, etc. To enable future developments in this field, we provide an on-line repository containing implementations, datasets and results, available at the following link: http://www.applieddatamining.com/cms/?q=software

Oversampling techniques need to be adapted in order to cope with this kind of data. We proposed new versions of the traditional SMOTE and ADASYN techniques, which differ in the way synthetic samples are generated and similarity computations are performed. The plain random oversampling (OSR) and the synthetic approaches are competitive methods in terms of AUC and do not seem to suffer from overfitting, as traditional studies report. Though the synthetic approaches have comparable AUC-performances in comparison to OSR, they are useless in practice due to their computationally expensive procedures. The OSR method is very fast for small and medium sized datasets, yet becomes inappropriate for larger datasets because the training times of the underlying base learner increase drastically.

With respect to undersampling methodologies, we employed a random undersampling technique (RUS) and Knn-based informed approaches. The latter type selects majority class instances that are closest (Cl Knn) or farthest (Far Knn) from the minority class examples and allows one to investigate the effect of noise/outliers on SVM performance. We observed that these odd instances have a severe performance degrading effect. RUS and Far Knn approaches can significantly improve performance with respect to the baseline, thanks to their abilities to reduce noise/outlier levels and to put more emphasis on the minority class. Having said this, they usually perform worse than any of the oversampling approaches. This is in sheer contrast to the results obtained from traditional studies and is heavily related to the intrinsic nature of behaviour data. Timing-wise, RUS is the fastest method. The informed approaches are slower than RUS due to the required nearest neighbour computations. Additionally, a cluster-based undersampling (CBU) technique was proposed to alleviate the within-class imbalance. For low undersampling rates, the performance of CBU is superior to plain RUS, whereas for high undersampling rates the opposite is true. Overall, no substantial gain in performance is obtained in comparison with a simple RUS approach. Furthermore, the method is the slowest among all techniques considered and relies on computationally expensive clustering heuristics.

We studied the effects of boosting, where we made use of a very specific linear SVM and logistic regression combination that uses confidence-rated predictions, instead of a more popular plain "discrete" boosting algorithm (with weak hypotheses having values in {-1, 1}). Our experiments clearly indicated that the regularization constant C in the SVM formulation acts as a "weakness" indicator. Indeed, higher C-values produce stronger learners and shouldn't be used in the boosting process, which matches the analysis of studies dealing with traditional data. Strong learners can already quite accurately distinguish majority-class from minority-class behaviours. The boosting process will focus on the hard-to-learn instances, among which are noise/outliers, and this can result in overfitting behaviour. We also observed that boosting can usually outperform the plain baseline, indicating that a combination of weak learners can achieve higher AUC performance than a single strong learner.
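One confidence-rated boosting round in the style of Schapire and Singer (1999) can be sketched as below; mapping the learner's outputs to real-valued margins in [-1, 1] is an assumption for illustration, not the paper's exact SVM+LR construction:

```python
import math

def boost_round(weights, margins, labels):
    """One confidence-rated AdaBoost update: with real-valued hypothesis
    outputs h(x_i) and labels y_i in {-1, +1},
      r = sum_i D(i) * y_i * h(x_i),  alpha = 0.5 * ln((1 + r) / (1 - r)),
    and the distribution D is re-weighted towards examples the current
    learner scores poorly."""
    r = sum(d * y * h for d, h, y in zip(weights, margins, labels))
    alpha = 0.5 * math.log((1 + r) / (1 - r))
    new = [d * math.exp(-alpha * y * h)
           for d, h, y in zip(weights, margins, labels)]
    z = sum(new)  # normalizer, so the weights stay a distribution
    return alpha, [d / z for d in new]
```

A weaker learner (small C) yields margins closer to zero, so each round shifts the distribution gently; a strong learner drives r towards 1 and the process fixates on noisy instances.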

EasyEnsemble (EE) integrates random undersampling with a boosting process. By sampling several balanced subsets from the training data and combining these, we can emphasize the minority class, achieve noise/outlier reduction effects, and simultaneously explore the majority-class space efficiently. Our version differs from the original formulation because we employ a different type of underlying weak learner (the SVM and LR combination) that is fed to a confidence-rated boosting algorithm (instead of the discrete version). We observed superior AUC performance with respect to all of the aforementioned techniques. Furthermore, the method is very fast, since each subset can be trained in parallel and has a size that is only twice as large as the minority-class training set. We studied the effect of the number of subsets and noted that a limited amount of these (O(10)) is already sufficient to obtain satisfactory performance; increasing this number further brings only minor benefits.
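The EE skeleton can be sketched as follows; the train/score callables stand in for the paper's boosted SVM+LR learners, and all names here are illustrative assumptions:

```python
import random

def easy_ensemble_score(X_min, X_maj, x_test, train, score, S=10, seed=0):
    """EasyEnsemble skeleton: draw S random majority subsets of size
    |X_min| (each training set is balanced and only twice the minority
    size), train one model per subset -- trivially parallelizable -- and
    average their scores on a test instance."""
    rng = random.Random(seed)
    total = 0.0
    for _ in range(S):
        subset = rng.sample(X_maj, len(X_min))  # balanced majority draw
        model = train(X_min, subset)
        total += score(model, x_test)
    return total / S
```

Because each draw sees a different slice of the majority class, averaging the S hypotheses covers the majority space without ever training on the full imbalanced set.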

Additionally, a statistical comparison of each of the proposed methods was conducted across a wide variety of datasets. Holm's significance test indicates that each of the proposed techniques (oversampling, undersampling except Cl Knn, cost-sensitive learning, and boosting variants) can significantly outperform the plain baseline (each of the null hypotheses of equal performance gets rejected with a confidence level of at least 91%). Comparison of the techniques with EE showed that the latter can significantly outperform all of them (each of the null hypotheses of equal performance gets rejected with a confidence level of at least 75%). A timing-wise comparison showed EE to be the preferred method (after RUS) for large and highly imbalanced datasets. A parallel implementation would make EE the fastest method even for medium-sized datasets that show a high level of imbalance.
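Holm's step-down procedure itself is simple to state; a generic sketch (not tied to the paper's exact test statistics or p-values) is:

```python
def holm_reject(p_values, alpha=0.05):
    """Holm's step-down procedure: sort the k p-values ascending, compare
    the i-th smallest (0-indexed) to alpha / (k - i), and stop rejecting
    at the first non-significant comparison. Returns one reject flag per
    hypothesis, in the original order."""
    k = len(p_values)
    order = sorted(range(k), key=lambda i: p_values[i])
    reject = [False] * k
    for step, i in enumerate(order):
        if p_values[i] <= alpha / (k - step):
            reject[i] = True
        else:
            break  # all remaining (larger) p-values are accepted
    return reject
```

The smallest p-value thus faces the strictest Bonferroni-style threshold alpha/k, and the thresholds relax as hypotheses are rejected.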

The influence of the chosen base learner was investigated by comparing the baseline to its EE-version for the SVM, LR, NB and BeSim inducers. Each of the EE-versions is able to improve upon the baseline performance for nearly all datasets, where base learners of the regularization type are superior because of their ability to control the strength of the learner.

Though we conducted an extensive benchmark study, there are still plenty of future research directions. First, and most important, there is an urgent need to develop a more general theoretical framework to completely understand the effect of imbalance and why certain methods outperform others. Secondly, other types of mechanisms could be investigated in this setting: advanced sampling (Guo et al 2008) set-ups perform sampling based on preliminary classifications; one-class learning is a recognition-based methodology where a discriminative boundary is learned around the training examples of a single class alone, and Raskutti and Kowalczyk (2004) demonstrated the superiority of the recognition-based one-class SVMs over the discriminative two-class SVMs; other types of ensemble learning strategies (e.g. bagging and stacking) could be investigated; Batista et al (2004) note that combining over- and undersampling can be beneficial when data are highly imbalanced or show very few minority class instances; the ideas behind Extended Nearest Neighbour (Tang and He 2015) could guide us in the sampling process. As opposed to Knn, "it considers not only who are the nearest neighbours of a certain sample, but also who consider the particular sample as their nearest neighbour" (Tang and He 2015). The method assigns a class membership to an unknown sample that maximizes the intra-class coherence. Finally, more advanced techniques do exist that can provide a K (the number of nearest neighbours) faster or with a (slightly) better performance.

Another issue relates to the lack of comprehensibility of the EasyEnsemble technique in its present form. Because we currently integrate logistic regression in the boosting process, the final ensemble classifier is non-linear and therefore incomprehensible. It might be interesting to investigate the numerical boosting technique proposed in Schapire and Singer (1999), where each weak learner is allowed to output real values (instead of values in [-1, 1]). In that case, we would be able to use a plain linear SVM as a weak learner, without the logistic regression component. The final ensemble would be a linear model, which is surely easier to comprehend. Also, the instance-based approach of Martens and Provost (2014) could be used to explain individual predictions of the non-linear EE-technique.

A Oversampling experiments

Table 11 Oversampling experiments. Results show the average tenfold AUC test set performance with respect to increasing β-values (β = [β1, β2, β3, β4] = [0, 1/3, 2/3, 1]). Additionally, standard deviations are shown between brackets. Optimal performances for each method are highlighted in boldface. See Section 4.2 for parameter settings.

Mov G(p = 1)
            β1             β2             β3             β4
OSR         71.6 (2.62)    74.37 (2.04)   73.6 (1.84)    74.73 (2.45)
SMOTE       71.6 (2.62)    75.08 (2.18)   76.02 (2.14)   76.48 (2.3)
ADASYN      71.6 (2.62)    75.16 (1.92)   75.93 (2.08)   76.47 (2.29)

Mov G(p = 25)
            β1             β2             β3             β4
OSR         81.41 (1.32)   83.49 (1.81)   83.84 (1.96)   83.91 (2.04)
SMOTE       81.41 (1.32)   83.32 (1.97)   83.59 (2.04)   83.76 (2.11)
ADASYN      81.41 (1.32)   83.61 (1.82)   84.02 (1.97)   83.69 (1.96)

Mov Th(p = [])
            β1             β2             β3             β4
OSR         79.77 (5.33)   85.3 (4.66)    83.16 (4.5)    84.59 (5.69)
SMOTE       79.77 (5.33)   84.18 (6.51)   85.58 (5.97)   84.33 (5.75)
ADASYN      79.77 (5.33)   84.11 (6.77)   85.86 (6.06)   85.36 (5.13)

Yahoo A(p = 1)
            β1             β2             β3             β4
OSR         55.92 (2.97)   58.66 (3.27)   59.99 (2.28)   59.74 (1.78)
SMOTE       55.92 (2.97)   59.76 (2.62)   59.74 (2.67)   59.43 (2.4)
ADASYN      55.92 (2.97)   59.54 (2.53)   59.55 (2.94)   59.56 (2.22)

Yahoo A(p = 25)
            β1             β2             β3             β4
OSR         61.68 (2.42)   64.19 (3.17)   65.08 (3.26)   64.67 (2.1)
SMOTE       61.68 (2.42)   65.46 (3.63)   65.33 (3.23)   64.52 (2.98)
ADASYN      61.68 (2.42)   65.04 (3.74)   65.41 (3.47)   64.4 (2.21)

Yahoo G(p = 1)
            β1             β2             β3             β4
OSR         66.84 (3.66)   72.18 (2.36)   73.11 (2.7)    72.49 (3.41)
SMOTE       66.84 (3.66)   72.65 (2.85)   73.27 (3.36)   73.37 (3.56)
ADASYN      66.84 (3.66)   72.87 (2.83)   73.18 (3.2)    73.39 (3.59)

Yahoo G(p = 25)
            β1             β2             β3             β4
OSR         78.82 (1.39)   78.78 (2.02)   78.74 (1.86)   78.35 (1.47)
SMOTE       78.82 (1.39)   79.23 (1.57)   79.1 (1.2)     79.03 (1.89)
ADASYN      78.82 (1.39)   79.12 (1.43)   79.23 (1.37)   79.51 (2.01)

TaFeng(p = 1)
            β1             β2             β3             β4
OSR         55.75 (1.6)    59.23 (1.96)   60.0 (1.68)    61.04 (2.36)
SMOTE       55.75 (1.6)    60.26 (1.95)   61.49 (1.8)    61.13 (1.52)
ADASYN      55.75 (1.6)    60.26 (1.9)    61.44 (1.85)   61.16 (1.5)

TaFeng(p = 25)
            β1             β2             β3             β4
OSR         66.94 (1.34)   68.84 (1.21)   66.99 (1.42)   67.7 (1.41)
SMOTE       66.94 (1.34)   68.47 (1.5)    67.07 (1.15)   66.65 (0.81)
ADASYN      66.94 (1.34)   68.62 (1.38)   67.85 (1.6)    66.91 (1.39)

Book(p = 1)
            β1             β2             β3             β4
OSR         52.6 (1.29)    53.61 (0.94)   55.41 (1.75)   55.87 (1.44)
SMOTE       52.6 (1.29)    54.77 (0.99)   54.91 (0.8)    54.36 (0.98)
ADASYN      52.6 (1.29)    54.86 (1.13)   55.06 (0.73)   54.54 (0.92)

Book(p = 25)
            β1             β2             β3             β4
OSR         60.08 (0.71)   61.05 (0.94)   61.82 (1.12)   64.62 (0.57)
SMOTE       60.08 (0.71)   62.6 (0.73)    60.95 (0.68)   63.0 (0.8)
ADASYN      60.08 (0.71)   62.33 (0.73)   60.77 (0.85)   63.04 (0.58)

LST(p = 1)
            β1             β2             β3             β4
OSR         99.99 (0.01)   99.99 (0.01)   99.99 (0.01)   99.99 (0.01)
SMOTE       99.99 (0.01)   99.99 (0.01)   99.99 (0.01)   99.99 (0.01)
ADASYN      99.99 (0.01)   99.99 (0.01)   99.99 (0.01)   99.99 (0.01)

Adver(p = [])
            β1             β2             β3             β4
OSR         96.61 (1.82)   97.31 (1.65)   97.07 (1.84)   97.07 (1.79)
SMOTE       96.61 (1.82)   96.91 (1.66)   97.19 (1.65)   97.07 (1.91)
ADASYN      96.61 (1.82)   97.1 (1.7)     97.08 (1.87)   97.07 (1.88)

Adver(p = 1)
            β1             β2             β3             β4
OSR         90.93 (3.02)   91.27 (3.03)   92.66 (2.82)   93.29 (1.97)
SMOTE       90.93 (3.02)   92.51 (2.03)   92.96 (2.14)   93.53 (1.81)
ADASYN      90.93 (3.02)   92.22 (2.33)   92.7 (2.36)    93.88 (1.73)

CRF(p = [])
            β1             β2             β3             β4
OSR         64.06 (16.43)  80.82 (12.94)  81.28 (12.27)  81.91 (11.28)
SMOTE       64.06 (16.43)  78.64 (16.86)  82.52 (13.74)  79.32 (16.26)
ADASYN      64.06 (16.43)  78.95 (16.72)  81.19 (16.32)  79.31 (16.19)

Bank(p = [])
            β1             β2             β3             β4
OSR         66.82 (0.88)   70.1 (0.74)    71.39 (0.8)    71.47 (0.8)
SMOTE       []             []             []             []
ADASYN      []             []             []             []

B Undersampling experiments

Table 12 Undersampling experiments. Results show the average tenfold AUC test set performance with respect to increasing βu-values (βu = [βu1, βu2, βu3, βu4, βu5] = [0, 1/4, 1/2, 3/4, 1]). Optimal performances (boldface) are highlighted for each imbalance ratio. Additionally, standard deviations are shown between brackets. Cl K represents the "Closest Knn" technique; Cl T represents the "Closest tot sim" technique (similar for Far K and Far T, see Section 3.2). Parameter settings are given in Section 4.2.

Mov G(p = 1)
        βu1           βu2           βu3           βu4           βu5
RUS     71.6 (2.6)    71.83 (2.6)   72.54 (2.5)   72.39 (3.1)   70.61 (3.5)
Cl K    71.6 (2.6)    71.4 (2.0)    70.96 (1.9)   70.43 (2.4)   69.05 (4.1)
Cl T    71.6 (2.6)    70.28 (2.5)   66.74 (2.0)   66.8 (2.1)    68.18 (3.6)
Far K   71.6 (2.6)    72.36 (2.7)   71.26 (3.4)   66.57 (5.2)   53.5 (3.5)
Far T   71.6 (2.6)    72.22 (2.8)   71.63 (3.6)   64.28 (5.3)   50.88 (4.4)
CBU     72.55 (2.6)   73.28 (2.6)   73.12 (2.6)   73.84 (2.5)   73.0 (3.1)

Mov G(p = 25)
        βu1           βu2           βu3           βu4           βu5
RUS     81.41 (1.3)   81.36 (1.3)   81.78 (1.7)   82.05 (1.7)   81.6 (2.1)
Cl K    81.41 (1.3)   80.86 (1.2)   80.95 (1.6)   79.73 (2.3)   77.95 (2.3)
Cl T    81.41 (1.3)   79.9 (1.2)    78.21 (1.4)   77.87 (1.5)   77.76 (2.3)
Far K   81.41 (1.3)   80.9 (1.5)    78.17 (1.8)   74.25 (2.4)   69.79 (3.2)
Far T   81.41 (1.3)   80.86 (1.5)   77.2 (2.4)    71.16 (2.7)   62.4 (2.8)
CBU     81.53 (1.4)   81.64 (1.3)   81.29 (1.6)   81.28 (2.1)   80.34 (2.7)

Mov Th(p = [])
        βu1           βu2           βu3           βu4           βu5
RUS     79.77 (5.3)   80.32 (5.8)   81.57 (5.5)   81.86 (6.6)   81.26 (6.2)
Cl K    79.77 (5.3)   79.25 (4.5)   78.07 (5.0)   76.25 (6.5)   62.46 (8.5)
Cl T    79.77 (5.3)   78.4 (4.4)    72.41 (3.5)   64.66 (4.5)   60.37 (7.3)
Far K   79.77 (5.3)   84.54 (5.0)   83.64 (6.4)   80.02 (7.3)   56.82 (10.3)
Far T   79.77 (5.3)   85.03 (5.7)   82.68 (6.8)   75.61 (9.2)   56.77 (10.9)
CBU     80.11 (5.8)   81.17 (6.0)   81.08 (6.5)   84.17 (5.1)   80.96 (6.9)

Yahoo A(p = 1)
        βu1           βu2           βu3           βu4           βu5
RUS     55.92 (3.0)   55.57 (3.4)   56.44 (3.0)   55.83 (3.4)   56.37 (3.3)
Cl K    55.92 (3.0)   55.67 (2.4)   53.12 (2.0)   50.57 (1.8)   53.79 (3.5)
Cl T    55.92 (3.0)   55.69 (2.1)   53.35 (2.2)   50.31 (2.2)   52.35 (3.3)
Far K   55.92 (3.0)   57.35 (2.2)   56.92 (1.1)   56.95 (2.3)   51.18 (2.0)
Far T   55.92 (3.0)   56.93 (2.4)   54.74 (1.9)   57.01 (1.8)   51.18 (2.0)
CBU     58.21 (2.6)   58.45 (3.3)   58.31 (3.5)   58.39 (3.5)   56.09 (2.6)

Yahoo A(p = 25)
        βu1           βu2           βu3           βu4           βu5
RUS     61.68 (2.4)   62.9 (2.9)    63.62 (3.6)   63.75 (3.1)   63.19 (1.9)
Cl K    61.68 (2.4)   61.14 (2.1)   57.62 (1.6)   54.02 (1.8)   51.48 (1.4)
Cl T    61.68 (2.4)   60.89 (2.8)   58.11 (1.4)   54.4 (2.1)    51.76 (1.4)
Far K   61.68 (2.4)   63.96 (3.0)   62.62 (2.2)   59.61 (1.5)   56.25 (1.6)
Far T   61.68 (2.4)   63.71 (2.4)   59.72 (1.6)   57.27 (1.1)   54.47 (1.1)
CBU     62.46 (2.6)   61.85 (1.4)   61.78 (2.2)   59.94 (3.0)   60.1 (4.0)

Yahoo G(p = 1)
        βu1           βu2           βu3           βu4           βu5
RUS     66.84 (3.7)   67.85 (3.2)   68.36 (3.2)   68.23 (4.0)   69.9 (4.2)
Cl K    66.84 (3.7)   66.71 (2.8)   64.3 (3.6)    61.98 (3.9)   61.15 (1.9)
Cl T    66.84 (3.7)   65.79 (2.7)   63.55 (3.3)   59.21 (3.5)   61.08 (2.4)
Far K   66.84 (3.7)   66.76 (4.1)   63.84 (3.4)   65.16 (2.0)   48.5 (2.9)
Far T   66.84 (3.7)   66.95 (4.1)   63.48 (2.9)   65.16 (2.0)   48.48 (2.9)
CBU     69.68 (4.1)   70.59 (3.2)   70.64 (3.7)   70.2 (2.9)    63.35 (3.6)

Yahoo G(p = 25)
        βu1           βu2           βu3           βu4           βu5
RUS     78.82 (1.4)   78.91 (1.6)   78.97 (1.6)   78.61 (1.6)   77.82 (2.1)
Cl K    78.82 (1.4)   77.26 (1.5)   72.52 (1.5)   67.86 (2.0)   65.07 (2.7)
Cl T    78.82 (1.4)   76.83 (1.0)   71.99 (1.8)   67.15 (2.3)   61.1 (2.7)
Far K   78.82 (1.4)   78.26 (2.2)   74.69 (2.7)   67.22 (2.1)   60.72 (2.3)
Far T   78.82 (1.4)   77.68 (2.6)   72.44 (3.0)   64.94 (2.4)   59.6 (2.0)
CBU     75.25 (3.2)   75.22 (2.4)   74.69 (2.3)   73.07 (2.4)   70.69 (2.4)

TaFeng(p = 1)
        βu1           βu2           βu3           βu4           βu5
RUS     55.75 (1.6)   56.1 (1.6)    56.26 (1.7)   57.23 (1.7)   59.25 (2.2)
Cl K    55.75 (1.6)   55.68 (1.6)   55.58 (1.5)   55.08 (1.1)   51.05 (1.5)
Cl T    55.75 (1.6)   55.67 (1.6)   54.47 (1.6)   47.53 (1.6)   49.3 (1.1)
Far K   55.75 (1.6)   58.99 (1.2)   59.47 (1.1)   60.04 (1.2)   56.31 (1.0)
Far T   55.75 (1.6)   58.92 (1.3)   59.25 (1.3)   58.58 (1.1)   56.31 (1.0)
CBU     57.8 (1.0)    58.47 (1.1)   58.15 (0.9)   58.87 (1.4)   57.65 (1.6)

TaFeng(p = 25)
        βu1           βu2           βu3           βu4           βu5
RUS     66.94 (1.3)   67.44 (1.3)   68.1 (1.4)    68.27 (1.4)   66.13 (1.2)
Cl K    66.94 (1.3)   66.13 (1.4)   63.39 (1.2)   59.83 (1.3)   56.94 (0.7)
Cl T    66.94 (1.3)   66.38 (1.5)   62.89 (1.6)   57.46 (1.3)   54.56 (1.3)
Far K   66.94 (1.3)   68.06 (1.4)   66.43 (1.6)   64.46 (1.5)   63.35 (1.3)
Far T   66.94 (1.3)   64.31 (1.1)   62.69 (1.0)   61.27 (1.1)   59.03 (1.0)
CBU     64.81 (1.2)   64.15 (1.1)   64.13 (1.2)   63.88 (0.8)   63.46 (0.8)

Book(p = 1)
        βu1           βu2           βu3           βu4           βu5
RUS     52.6 (1.3)    52.79 (0.9)   53.46 (0.8)   53.89 (0.9)   54.05 (0.9)
Cl K    52.6 (1.3)    52.56 (1.2)   52.52 (1.3)   52.39 (1.1)   53.09 (1.1)
Cl T    52.6 (1.3)    52.56 (1.2)   52.52 (1.3)   52.39 (1.1)   53.05 (0.7)
Far K   52.6 (1.3)    55.21 (1.2)   56.21 (1.8)   56.14 (1.2)   53.06 (1.0)
Far T   52.6 (1.3)    55.21 (1.2)   56.21 (1.8)   56.14 (1.2)   53.06 (1.0)
CBU     54.28 (0.9)   53.77 (1.0)   53.33 (1.1)   53.34 (0.9)   52.84 (0.8)

Book(p = 25)
        βu1           βu2           βu3           βu4           βu5
RUS     60.08 (0.7)   60.13 (0.6)   60.4 (0.8)    60.33 (0.8)   63.28 (0.8)
Cl K    60.08 (0.7)   59.96 (0.7)   60.13 (0.8)   59.96 (1.0)   59.28 (0.7)
Cl T    60.08 (0.7)   59.96 (0.7)   60.13 (0.8)   60.29 (0.4)   54.5 (0.9)
Far K   60.08 (0.7)   63.29 (1.0)   64.19 (0.8)   57.3 (1.1)    55.66 (1.1)
Far T   60.08 (0.7)   62.14 (0.5)   58.27 (0.6)   56.37 (1.0)   55.66 (1.1)
CBU     54.82 (0.9)   54.67 (0.9)   54.71 (0.9)   54.66 (1.0)   54.78 (0.9)

LST(p = 1)
        βu1           βu2           βu3           βu4           βu5
RUS     99.99 (0.0)   99.99 (0.0)   99.99 (0.0)   99.98 (0.0)   99.99 (0.0)
Cl K    99.99 (0.0)   99.99 (0.0)   99.99 (0.0)   99.99 (0.0)   99.99 (0.0)
Cl T    99.99 (0.0)   99.99 (0.0)   99.99 (0.0)   99.99 (0.0)   99.98 (0.0)
Far K   99.99 (0.0)   99.98 (0.0)   99.98 (0.0)   99.98 (0.0)   99.98 (0.0)
Far T   99.99 (0.0)   99.98 (0.0)   99.98 (0.0)   99.98 (0.0)   99.98 (0.0)
CBU     []            []            []            []            []

Adver(p = [])
        βu1           βu2           βu3           βu4           βu5
RUS     96.61 (1.8)   96.32 (1.8)   96.63 (1.4)   97.12 (2.1)   96.22 (1.6)
Cl K    96.61 (1.8)   96.44 (1.5)   96.14 (1.5)   96.04 (2.0)   94.8 (2.5)
Cl T    96.61 (1.8)   95.87 (2.1)   94.32 (1.9)   93.01 (2.2)   90.72 (2.3)
Far K   96.61 (1.8)   96.53 (1.4)   95.76 (2.0)   94.39 (1.8)   90.49 (3.1)
Far T   96.61 (1.8)   96.54 (1.5)   95.67 (1.9)   94.54 (1.8)   89.3 (2.8)
CBU     96.85 (2.3)   96.85 (2.3)   97.05 (1.5)   96.6 (1.6)    96.06 (2.1)

Adver(p = 1)
        βu1           βu2           βu3           βu4           βu5
RUS     90.93 (3.0)   91.53 (3.1)   92.37 (3.4)   91.9 (2.9)    91.93 (2.2)
Cl K    90.93 (3.0)   90.64 (3.0)   89.87 (3.9)   90.21 (3.6)   89.18 (2.0)
Cl T    90.93 (3.0)   89.7 (3.5)    88.55 (3.4)   85.76 (3.3)   88.2 (2.3)
Far K   90.93 (3.0)   93.8 (2.3)    92.4 (2.6)    88.73 (3.4)   85.51 (4.0)
Far T   90.93 (3.0)   93.62 (2.4)   93.2 (2.2)    88.41 (3.6)   85.51 (4.0)
CBU     93.22 (2.4)   93.76 (2.5)   93.89 (2.6)   93.52 (2.7)   91.27 (2.0)

CRF(p = [])
        βu1           βu2           βu3           βu4           βu5
RUS     64.06 (16.4)  63.28 (15.9)  67.98 (17.4)  66.95 (21.9)  87.73 (8.8)
Cl K    64.06 (16.4)  62.44 (16.6)  62.34 (16.9)  71.37 (13.8)  78.22 (17.7)
Cl T    64.06 (16.4)  62.44 (16.6)  62.34 (16.9)  71.37 (13.8)  62.67 (22.9)
Far K   64.06 (16.4)  83.8 (14.2)   83.93 (14.8)  84.49 (13.7)  86.11 (9.7)
Far T   64.06 (16.4)  83.8 (14.2)   83.93 (14.8)  84.49 (13.7)  86.11 (9.7)
CBU     []            []            []            []            []

Bank(p = [])
        βu1           βu2           βu3           βu4           βu5
RUS     66.82 (0.9)   67.02 (0.9)   67.37 (0.8)   67.99 (0.6)   69.5 (1.0)
Cl K    66.82 (0.9)   66.17 (0.7)   65.24 (0.6)   64.86 (0.6)   58.53 (1.1)
Cl T    66.82 (0.9)   64.92 (1.1)   60.69 (0.9)   56.33 (0.8)   52.87 (0.7)
Far K   66.82 (0.9)   66.95 (0.6)   66.19 (0.6)   64.42 (0.6)   58.25 (1.1)
Far T   66.82 (0.9)   67.16 (0.6)   64.2 (0.8)    59.67 (1.0)   58.25 (1.1)
CBU     []            []            []            []            []

C Boosting experiments

This section presents the results of each of the boosting experiments from Section 4.5 on the remaining data sources. Each of the figures below depicts the average tenfold AUC-performance on test data (with µ = 100) with respect to the number of boosting iterations T for (left) AdaBoost (AB), AdaCost (AC) and EasyEnsemble (EE), with C chosen according to highest validation set AUC-performance (over all possible boosting rounds), and (right) AB and EE (with S = 15) with varying C-levels.

[Figure: AUC test vs. T, panels (a) and (b)] Fig 6 Mov G(p = 1) dataset

[Figure: AUC test vs. T, panels (a) and (b)] Fig 7 Mov Th(p = []) dataset

[Figure: AUC test vs. T, panels (a) and (b)] Fig 8 Yahoo A(p = 1) dataset

[Figure: AUC test vs. T, panels (a) and (b)] Fig 9 Yahoo A(p = 25) dataset

[Figure: AUC test vs. T, panels (a) and (b)] Fig 10 Yahoo G(p = 1) dataset

[Figure: AUC test vs. T, panels (a) and (b)] Fig 11 Yahoo G(p = 25) dataset

[Figure: AUC test vs. T, panels (a) and (b)] Fig 12 TaFeng(p = 1) dataset

[Figure: AUC test vs. T, panels (a) and (b)] Fig 13 Book(p = 1) dataset

[Figure: AUC test vs. T, panels (a) and (b)] Fig 14 LST(p = 1) dataset

[Figure: AUC test vs. T, panels (a) and (b)] Fig 15 Adver(p = []) dataset

[Figure: AUC test vs. T, panels (a) and (b)] Fig 16 Adver(p = 1) dataset

[Figure: AUC test vs. T, panels (a) and (b)] Fig 17 CRF(p = []) dataset

D Final Comparison

[Figure: Average Rank AUC (x-axis) vs. Average Rank Time (y-axis); legend: BL, OSR, SMOTE, ADASYN, RUS, Cl Knn, Far Knn, CBU, AB, AC(R=28), AC(R=RL), EE(S=10), EE(S=15), EE par]

Fig 18 Average rank AUC versus average rank Time (see Table 9) across the large datasets from Table 2 (CRF and Bank). Regarding the former, an extra row EE par is added, having the same AUC-value as EE(S = 15). Ranks for each dataset are subsequently obtained via the procedure outlined in Section 4.6.1. Points occurring in the upper-right region are preferred.

Imbalanced classification in sparse and large behaviour datasets 53

References

Akbani R, Kwek S, Japkowicz N (2004) Applying support vector machines to imbalanced datasets. In: Machine Learning: ECML 2004, 15th European Conference on Machine Learning, Pisa, Italy, September 20-24, 2004, Proceedings. Springer Berlin Heidelberg, Berlin, Heidelberg, pp 39–50. DOI 10.1007/978-3-540-30115-8_7

Ali A, Shamsuddin SM, Ralescu AL (2015) Classification with class imbalance problem: A review. International Journal of Advances in Soft Computing and its Applications 7(3):176–204

Alzahrani T, Horadam KJ (2016) Community detection in bipartite networks: Algorithms and case studies. In: Complex Systems and Networks: Dynamics, Controls and Applications. Springer Berlin Heidelberg, Berlin, Heidelberg, pp 25–50. DOI 10.1007/978-3-662-47824-0_2

Bachner J (2013) Predictive policing: Preventing crime with data and analytics. IBM Center for the Business of Government

Baesens B, Van Gestel T, Viaene S, Stepanova M, Suykens J, Vanthienen J (2003) Benchmarking state-of-the-art classification algorithms for credit scoring. Journal of the Operational Research Society 54(6):627–635. DOI 10.1057/palgrave.jors.2601545

Barandela R, Sánchez J, García V, Rangel E (2003) Strategies for learning in class imbalance problems. Pattern Recognition 36(3):849–851. DOI 10.1016/S0031-3203(02)00257-1

Barber MJ (2007) Modularity and community detection in bipartite networks. Physical Review E 76:066102. DOI 10.1103/PhysRevE.76.066102

Barua S, Islam MM, Yao X, Murase K (2014) MWMOTE–majority weighted minority oversampling technique for imbalanced data set learning. IEEE Transactions on Knowledge and Data Engineering 26(2):405–425. DOI 10.1109/TKDE.2012.232

Batista GEAPA, Prati RC, Monard MC (2004) A study of the behavior of several methods for balancing machine learning training data. SIGKDD Explor Newsl 6(1):20–29. DOI 10.1145/1007730.1007735

Beckett SJ (2016) Improved community detection in weighted bipartite networks. Royal Society Open Science 3(1). DOI 10.1098/rsos.140536

Bekkar M, Djemaa HK, Alitouche TA (2013) Evaluation measures for models assessment over imbalanced data sets. Journal of Information Engineering and Applications 3(10):27–38

Bhattacharyya S, Jha S, Tharakunnel K, Westland JC (2011) Data mining for credit card fraud: A comparative study. Decision Support Systems 50(3):602–613. DOI 10.1016/j.dss.2010.08.008

Blondel VD, Guillaume JL, Lambiotte R, Lefebvre E (2008) Fast unfolding of communities in large networks. Journal of Statistical Mechanics: Theory and Experiment 2008(10):P10008

Breiman L, Friedman J, Stone CJ, Olshen RA (1984) Classification and regression trees. Taylor & Francis

Brozovsky L, Petricek V (2007) Recommender system for online dating service. In: Proceedings of Znalosti 2007 Conference, VSB Ostrava

Cha M, Mislove A, Gummadi KP (2009) A measurement-driven analysis of information propagation in the Flickr social network. In: Proceedings of the 18th International Conference on World Wide Web, ACM, New York, NY, USA, WWW '09, pp 721–730. DOI 10.1145/1526709.1526806

Chawla NV (2005) Data mining for imbalanced datasets: An overview. In: Data Mining and Knowledge Discovery Handbook. Springer US, Boston, MA, pp 853–867

Chawla NV, Bowyer KW, Hall LO, Kegelmeyer WP (2002) SMOTE: synthetic minority over-sampling technique. Journal of Artificial Intelligence Research 16:321–357

Chawla NV, Lazarevic A, Hall LO, Bowyer KW (2003) SMOTEBoost: Improving prediction of the minority class in boosting. In: Knowledge Discovery in Databases: PKDD 2003. Springer Berlin Heidelberg, Berlin, Heidelberg, pp 107–119

Chawla NV, Japkowicz N, Kotcz A (2004) Editorial: Special issue on learning from imbalanced data sets. SIGKDD Explor Newsl 6(1):1–6. DOI 10.1145/1007730.1007733

Chen M, Mao S, Liu Y (2014) Big data: A survey. Mobile Networks and Applications 19(2):171–209. DOI 10.1007/s11036-013-0489-0

Chyi YM (2003) Classification analysis techniques for skewed class distribution problems. Master thesis, Department of Information Management, National Sun Yat-Sen University

Demšar J (2006) Statistical comparisons of classifiers over multiple data sets. Journal of Machine Learning Research 7(Jan):1–30

Dhillon IS (2001) Co-clustering documents and words using bipartite spectral graph partitioning. In: Proceedings of the Seventh ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, ACM, New York, NY, USA, KDD '01, pp 269–274. DOI 10.1145/502512.502550

Drummond C, Holte RC (2003) C4.5, class imbalance, and cost sensitivity: why under-sampling beats over-sampling. In: Proceedings of the ICML '03 Workshop on Learning from Imbalanced Datasets

Fan RE, Chang KW, Hsieh CJ, Wang XR, Lin CJ (2008) LIBLINEAR: A library for large linear classification. Journal of Machine Learning Research 9:1871–1874

Fan W, Stolfo SJ, Zhang J, Chan PK (1999) AdaCost: Misclassification cost-sensitive boosting. In: Proceedings of the Sixteenth International Conference on Machine Learning, Morgan Kaufmann Publishers Inc, San Francisco, CA, USA, ICML '99, pp 97–105

Fawcett T (2006) An introduction to ROC analysis. Pattern Recognition Letters 27(8):861–874. DOI 10.1016/j.patrec.2005.10.010

Finch H (2005) Comparison of distance measures in cluster analysis with dichotomous data. Journal of Data Science 3(1):85–100

Fortunato S (2010) Community detection in graphs. Physics Reports 486(3-5):75–174. DOI 10.1016/j.physrep.2009.11.002

Junqué de Fortuny E, Martens D, Provost F (2014a) Predictive modeling with big data: is bigger really better? Big Data 1(4):215–226. DOI 10.1089/big.2013.0037

Junqué de Fortuny E, Stankova M, Moeyersoms J, Minnaert B, Provost F, Martens D (2014b) Corporate residence fraud detection. In: Proceedings of the 20th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, ACM, New York, NY, USA, KDD '14, pp 1650–1659. DOI 10.1145/2623330.2623333

Frasca M, Bertoni A, Re M, Valentini G (2013) A neural network algorithm for semi-supervised node label learning from unbalanced data. Neural Networks 43:84–98. DOI 10.1016/j.neunet.2013.01.021

Friedman M (1937) The use of ranks to avoid the assumption of normality implicit in the analysis of variance. Journal of the American Statistical Association 32(200):675–701

García E, Lozano F (2007) Boosting support vector machines. In: Machine Learning and Data Mining in Pattern Recognition, 5th International Conference, MLDM 2007, Leipzig, Germany, July 18-20, Post Proceedings. IBaI publishing, pp 153–167

Goldstein M, Uchida S (2016) A comparative evaluation of unsupervised anomaly detection algorithms for multivariate data. PLOS ONE 11(4):1–31. DOI 10.1371/journal.pone.0152173

González PC, Velásquez JD (2013) Characterization and detection of taxpayers with false invoices using data mining techniques. Expert Systems with Applications 40(5):1427–1436. DOI 10.1016/j.eswa.2012.08.051

Guimerà R, Sales-Pardo M, Amaral LAN (2007) Module identification in bipartite and directed networks. Physical Review E 76:036102. DOI 10.1103/PhysRevE.76.036102

Guo H, Viktor HL (2004) Learning from imbalanced data sets with boosting and data generation: The DataBoost-IM approach. SIGKDD Explor Newsl 6(1):30–39. DOI 10.1145/1007730.1007736

Guo X, Yin Y, Dong C, Yang G, Zhou G (2008) On the class imbalance problem. In: 2008 Fourth International Conference on Natural Computation, IEEE, vol 4, pp 192–201. DOI 10.1109/ICNC.2008.871

Han H, Wang WY, Mao BH (2005) Borderline-SMOTE: A new over-sampling method in imbalanced data sets learning. In: Advances in Intelligent Computing. Springer Berlin Heidelberg, Berlin, Heidelberg, pp 878–887

He H, Garcia EA (2009) Learning from imbalanced data. IEEE Transactions on Knowledge and Data Engineering 21(9):1263–1284. DOI 10.1109/TKDE.2008.239

He H, Bai Y, Garcia EA, Li S (2008) ADASYN: Adaptive synthetic sampling approach for imbalanced learning. In: 2008 IEEE International Joint Conference on Neural Networks (IEEE World Congress on Computational Intelligence), IEEE, pp 1322–1328. DOI 10.1109/IJCNN.2008.4633969

Holm S (1979) A simple sequentially rejective multiple test procedure. Scandinavian Journal of Statistics 6(2):65–70

Hsu CW, Lin CJ (2002) A comparison of methods for multiclass support vector machines. IEEE Transactions on Neural Networks 13(2):415–425. DOI 10.1109/72.991427

Huang A (2008) Similarity measures for text document clustering. In: Proceedings of the Sixth New Zealand Computer Science Research Student Conference (NZCSRSC2008), Christchurch, New Zealand, pp 49–56

Iman RL, Davenport JM (1980) Approximations of the critical region of the Friedman statistic. Communications in Statistics - Theory and Methods 9(6):571–595

Jo T, Japkowicz N (2004) Class imbalances versus small disjuncts. ACM SIGKDD Explor Newsl 6(1):40–49. DOI 10.1145/1007730.1007737

Jutla IS, Jeub LG, Mucha PJ (2011-2016) A generalized Louvain method for community detection implemented in MATLAB. URL http://netwiki.amath.unc.edu/GenLouvain

Kubat M, Matwin S (1997) Addressing the curse of imbalanced training sets: One-sided selection. In: Proceedings of the Fourteenth International Conference on Machine Learning, Morgan Kaufmann Publishers Inc, San Francisco, CA, USA, pp 179–186

Lancichinetti A, Fortunato S (2009) Community detection algorithms: A comparative analysis. Physical Review E 80:056117. DOI 10.1103/PhysRevE.80.056117

Larremore DB, Clauset A, Jacobs AZ (2014) Efficiently inferring community structure in bipartite networks. Physical Review E: Statistical, Nonlinear, and Soft Matter Physics 90:012805. DOI 10.1103/PhysRevE.90.012805

Li J, Fine JP (2010) Weighted area under the receiver operating characteristic curve and its application to gene selection. Journal of the Royal Statistical Society: Series C (Applied Statistics) 59(4):673–692. DOI 10.1111/j.1467-9876.2010.00713.x

Li X, Wang L, Sung E (2008) AdaBoost with SVM-based component classifiers. Engineering Applications of Artificial Intelligence 21(5):785–795. DOI 10.1016/j.engappai.2007.07.001

Lichman M (2013) UCI machine learning repository. URL http://archive.ics.uci.edu/ml

Liu W, Chawla S, Cieslak DA, Chawla NV (2010) A robust decision tree algorithm for imbalanced data sets. In: Proceedings of the Tenth SIAM International Conference on Data Mining, SIAM, Philadelphia, vol 10, pp 766–777

Liu XY, Wu J, Zhou ZH (2009) Exploratory undersampling for class-imbalance learning. IEEE Transactions on Systems, Man, and Cybernetics, Part B (Cybernetics) 39(2):539–550. DOI 10.1109/TSMCB.2008.2007853

Luts J, Ojeda F, Van de Plas R, De Moor B, Van Huffel S, Suykens JA (2010) A tutorial on support vector machine-based methods for classification problems in chemometrics. Analytica Chimica Acta 665(2):129–145. DOI 10.1016/j.aca.2010.03.030

Macskassy SA, Provost F (2007) Classification in networked data: A toolkit and a univariate case study. Journal of Machine Learning Research 8(May):935–983

Martens D, Provost F (2014) Explaining data-driven document classifications. MIS Quarterly 38(1):73–100

Martens D, Provost F, Clark J, Junqué de Fortuny E (2016) Mining massive fine-grained behavior data to improve predictive analytics. MIS Quarterly 40(4):869–888

Mazurowski MA, Habas PA, Zurada JM, Lo JY, Baker JA, Tourassi GD (2008) Training neural network classifiers for medical decision making: The effects of imbalanced datasets on classification performance. Neural Networks 21(2-3):427–436. DOI 10.1016/j.neunet.2007.12.031

Mease D, Wyner AJ, Buja A (2007) Boosted classification trees and class probability/quantile estimation. Journal of Machine Learning Research 8:409–439

Nemenyi P (1963) Distribution-free Multiple Comparisons. Dissertation, Princeton University

Newman MEJ, Girvan M (2004) Finding and evaluating community structure in networks. Physical Review E 69:026113. DOI 10.1103/PhysRevE.69.026113

Ng AY (2004) Feature selection, L1 vs. L2 regularization, and rotational invariance. In: Proceedings of the Twenty-first International Conference on Machine Learning, ACM, New York, NY, USA, ICML '04, pp 78–. DOI 10.1145/1015330.1015435

Ng AY, Jordan MI (2002) On discriminative vs. generative classifiers: A comparison of logistic regression and naive Bayes. In: Advances in Neural Information Processing Systems 14, MIT Press, pp 841–848

Ngai E, Hu Y, Wong Y, Chen Y, Sun X (2011) The application of data mining techniques in financial fraud detection: A classification framework and an academic review of literature. Decision Support Systems 50(3):559–569. DOI 10.1016/j.dss.2010.08.006

Platt JC (1999) Probabilistic outputs for support vector machines and comparisons to regularized likelihood methods. In: Advances in Large-Margin Classifiers, MIT Press, pp 61–74

Porter MA, Onnela JP, Mucha PJ (2009) Communities in networks. Notices of the American Mathematical Society 56(9):1082–1097

Provost F, Fawcett T (2013) Data Science for Business: What you need to know about data mining and data-analytic thinking. O'Reilly Media, Inc

Provost F, Dalessandro B, Hook R, Zhang X, Murray A (2009) Audience selection for on-line brand advertising: Privacy-friendly social network targeting. In: Proceedings of the 15th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, ACM, New York, NY, USA, KDD '09, pp 707–716. DOI 10.1145/1557019.1557098

56 Jellis Vanhoeyveld David Martens

Raskutti B, Kowalczyk A (2004) Extreme re-balancing for SVMs: A case study. SIGKDD Explorations Newsletter 6(1):60–69, DOI 10.1145/1007730.1007739

Rosvall M, Bergstrom CT (2008) Maps of random walks on complex networks reveal community structure. Proceedings of the National Academy of Sciences 105(4):1118–1123, DOI 10.1073/pnas.0706851105

Schapire RE (1999) A brief introduction to boosting. In: Proceedings of the 16th International Joint Conference on Artificial Intelligence - Volume 2, Morgan Kaufmann Publishers Inc., San Francisco, CA, USA, IJCAI'99, pp 1401–1406

Schapire RE, Singer Y (1999) Improved boosting algorithms using confidence-rated predictions. Machine Learning 37(3):297–336, DOI 10.1023/A:1007614523901

Shmueli G (2017) Analyzing behavioral big data: Methodological, practical, ethical, and moral issues. Quality Engineering 29(1):57–74, DOI 10.1080/08982112.2016.1210979

Sobhani P, Viktor H, Matwin S (2015) Learning from imbalanced data using ensemble methods and cluster-based undersampling. In: New Frontiers in Mining Complex Patterns: Third International Workshop, NFMCP 2014, Held in Conjunction with ECML-PKDD 2014, Nancy, France, September 19, 2014, Revised Selected Papers, Springer International Publishing, Cham, pp 69–83, DOI 10.1007/978-3-319-17876-9_5

Stankova M (2016) Classification within network data with a bipartite structure. Dissertation, University of Antwerp

Stankova M, Martens D, Provost F (2015) Classification over bipartite graphs through projection. Working Papers 2015001, University of Antwerp, Faculty of Applied Economics

Sun Y, Kamel MS, Wong AK, Wang Y (2007) Cost-sensitive boosting for classification of imbalanced data. Pattern Recognition 40(12):3358–3378, DOI https://doi.org/10.1016/j.patcog.2007.04.009

Suykens JA, Van Gestel T, De Brabanter J, De Moor B, Vandewalle J (2002) Least Squares Support Vector Machines. World Scientific, Singapore

Tang B, He H (2015) ENN: Extended nearest neighbor method for pattern recognition [research frontier]. IEEE Computational Intelligence Magazine 10(3):52–60, DOI 10.1109/MCI.2015.2437512

Tang Y, Zhang YQ, Chawla NV, Krasser S (2009) SVMs modeling for highly imbalanced classification. IEEE Transactions on Systems, Man, and Cybernetics, Part B (Cybernetics) 39(1):281–288, DOI 10.1109/TSMCB.2008.2002909

Tobback E, Moeyersoms J, Stankova M, Martens D (2016) Bankruptcy prediction for SMEs using relational data. Working Paper 2016004, University of Antwerp, Faculty of Applied Economics

Verbeke W, Dejaeger K, Martens D, Hur J, Baesens B (2012) New insights into churn prediction in the telecommunication sector: A profit driven data mining approach. European Journal of Operational Research 218(1):211–229, DOI https://doi.org/10.1016/j.ejor.2011.09.031

Veropoulos K, Campbell I, Cristianini N (1999) Controlling the sensitivity of support vector machines. In: Proceedings of the International Joint Conference on Artificial Intelligence, Stockholm, Sweden (IJCAI99), pp 55–60

Whitrow C, Hand DJ, Juszczak P, Weston D, Adams NM (2009) Transaction aggregation as a strategy for credit card fraud detection. Data Mining and Knowledge Discovery 18(1):30–55, DOI 10.1007/s10618-008-0116-z

Wickramaratna J, Holden SB, Buxton BF (2001) Performance degradation in boosting. In: Proceedings of the Second International Workshop on Multiple Classifier Systems, Springer, London, UK, MCS '01, pp 11–21

Yen SJ, Lee YS (2009) Cluster-based under-sampling approaches for imbalanced data distributions. Expert Systems with Applications 36(3, Part 1):5718–5727, DOI https://doi.org/10.1016/j.eswa.2008.06.108

Yu HF, Lo HY, Hsieh HP, Lou JK, McKenzie TG, Chou JW, Chung PH, Ho CH, Chang CF, Wei YH, et al (2010) Feature engineering and classifier ensemble for KDD Cup 2010. In: Proceedings of the KDD Cup 2010 Workshop, pp 1–16

Zha H, He X, Ding C, Simon H, Gu M (2001) Bipartite graph partitioning and data clustering. In: Proceedings of the Tenth International Conference on Information and Knowledge Management, ACM, New York, NY, USA, CIKM '01, pp 25–32, DOI 10.1145/502585.502591

Zhang J, Mani I (2003) KNN approach to unbalanced data distributions: A case study involving information extraction. In: Proceedings of the ICML'2003 Workshop on Learning from Imbalanced Datasets, Washington, DC

Ziegler CN, McNee SM, Konstan JA, Lausen G (2005) Improving recommendation lists through topic diversification. In: Proceedings of the 14th International Conference on World Wide Web, ACM, New York, NY, USA, WWW '05, pp 22–32, DOI 10.1145/1060745.1060754



conducts of persons or organisations. Following the definition of Shmueli (2017), behaviour data models fine-grained actions and/or interactions of entities such as persons or objects. To distinguish 'traditional' data from behaviour data, it is appropriate to introduce the framework proposed in Junqué de Fortuny et al (2014a): traditional predictive analytics describes instances by using a small number of features (dozens up to hundreds). These datasets are usually dense in nature, meaning that each instance has a non-trivial value for each of its features (or at least for most of them). Recent years have witnessed a growing number of applications making use of behaviour data, which reflect specific behaviours of individuals or organisations. Think for example about users liking certain pages on Facebook, visiting certain websites or making transactions with specific merchants, and organisations interacting with suppliers/clients. Such data are high-dimensional, containing thousands or even millions of features. Indeed, the number of websites a user can visit, the number of Facebook pages a user can like or the number of unique merchants a user can transact with are enormous, ranging up to millions. A key characteristic is the sparse nature of behaviour data, resulting in a sparse matrix representation (see further in this section). The majority of attributes have a trivial value of "zero" (or not present) or, as Junqué de Fortuny et al (2014a) formulate it, "people only have a limited amount of behavioural capital". To summarize, behaviour data are very high-dimensional (10^4 − 10^8) and sparse, and mostly originate from capturing the fine-grained behaviours of persons or companies.²

Besides the differences in structure/representation, behaviour data also show distinct properties in comparison to traditional data. Junqué de Fortuny et al (2014a) proved empirically that larger behaviour datasets, in terms of the number of instances or features,³ result in significant performance gains. This is in sharp contrast to the literature on sampling (reducing the number of instances) or feature selection that is commonly applied to traditional types of data, where usually there are a large number of irrelevant features that increase the variance and the opportunity to overfit (Provost and Fawcett 2013). Behaviour type of data show a different "relevance structure", in the sense that most of the features provide a small, though relevant, amount of additional information about the target prediction (Junqué de Fortuny et al 2014a). Furthermore, the instances and features show a power-law distribution (Stankova et al 2015): the vast amount of instances have a low number of non-zero (active) features and, conversely, the majority of features are only present for a few instances. In her doctoral dissertation, Stankova (2016) showed that applying non-linear classifiers to behaviour type of data does not improve predictive performance in comparison to the plain application of linear techniques. This is definitely a major contrast to prior literature dealing with dense datasets, where linear methods generally have a lower predictive performance compared to highly non-linear techniques (Baesens et al 2003).

Imbalanced behaviour data occur naturally across a wide range of applications; some examples include companies employing data on the specific websites a user visits for targeted online advertising (Provost et al 2009), where usually only a relatively low amount of people respond positively to the advertisement. Data on the individual merchants with whom one transacts can be used to detect credit card fraud. The study of Bhattacharyya et al (2011), which uses real-life credit card transactional data, mentions that of the 49,858,600 transactions considered, only 2,420 were fraudulent (0.005%). Besides these marketing and fraud domains, other areas can be considered, such as churn prediction (Verbeke et al 2012), default prediction (Tobback et al 2016) and predictive policing (Bachner 2013). Despite the abundant application domains, little is known on the effect of imbalance on the performance of classification algorithms when dealing with this kind of data. Needless to say that empirical and theoretical developments in this field can give rise to major benefits for academia, companies and governments.

² The last requirement is not strictly necessary when we talk about behaviour data in our study (the sparseness and high-dimensionality properties are sufficient).

³ In this sparse setting, instance removal or feature selection are in a certain sense equivalent to one another.

Behaviour data can be represented as a sparse matrix or, equivalently, as a bipartite graph. In the matrix representation, each instance i corresponds to a single row and columns correspond to specific features. Let's take the example of users rating films, where we wish to predict the gender of each user. In that case, each user corresponds to a row in the matrix and each specific film corresponds to a single column. Alternatively, behaviour data can also be represented as a bipartite graph (Stankova et al 2015). Consider a set of bottom nodes NB and a set of top nodes NT. A bipartite graph is a network that contains edges only between nodes of a different type: each bottom node nb ∈ NB can only be connected to nodes nt ∈ NT. Returning to our example, each user i corresponds with a bottom node nb,i and each film j corresponds with a top node nt,j.
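To make the equivalence of the two representations concrete, the following sketch (the toy data and variable names are ours, not taken from the paper's datasets) stores the users-rating-films example as a SciPy sparse matrix; every non-zero entry X[i, j] is exactly one edge between bottom node i and top node j in the bipartite-graph view.

```python
import numpy as np
from scipy.sparse import csr_matrix

# Toy behaviour matrix: 4 users (rows / bottom nodes) x 5 films
# (columns / top nodes); a 1 means "this user rated this film".
rows = np.array([0, 0, 1, 2, 2, 3])
cols = np.array([1, 4, 0, 1, 2, 4])
X = csr_matrix((np.ones(6, dtype=np.int8), (rows, cols)), shape=(4, 5))

print(X.toarray())
# Sparseness: only 6 of the 20 entries are non-zero.
print("density:", X.nnz / (X.shape[0] * X.shape[1]))

# The same data viewed as bipartite edges (user_i, film_j):
edges = list(zip(*X.nonzero()))
```

The CSR format stores only the non-zero entries, which is what makes working with 10^4 to 10^8 features feasible in practice.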

Imbalanced behaviour data occur when the collection of instances contains substantially more examples of one class as compared to the other. In our running example of the previous paragraph, this means that there might be far more males rating films than females. In the datasets presented in Section 4.1, each of the instances contains a label indicating the particular class the example belongs to. In practice, instances do not necessarily have a label associated, and the goal is to infer a label/score for these unknown instances based on the examples with known labels. Note that we focus on the two-class classification problem, where labels are limited to two types. A multi-class classifier can be obtained by solving a sequence of binary classification problems. This can for instance be accomplished with one-versus-one or one-versus-all setups (Hsu and Lin 2002).

2.2 Support vector machines (SVMs)

The support vector machine (SVM) is a "state-of-the-art" classification technique that has been applied in a wide variety of domains (Suykens et al 2002). The training data consist of a set of d-dimensional input vectors xi with corresponding labels yi ∈ {−1, 1}, i = 1, . . . , m. A linear SVM constructs a linear classifier (hyperplane) in the input space. The hyperplane is constructed in such a way that as few points as possible are wrongfully classified, while simultaneously maximizing the margin between the two classes. This trade-off between minimizing the model complexity and reducing misclassification is governed by the regularization parameter C. The linear SVM is


the solution to the following quadratic optimization problem:

    \min_{w,b,\xi_i} \; \frac{w^T w}{2} + C \sum_{i=1}^{m} \xi_i
    \text{s.t.} \quad y_i \left( w^T x_i + b \right) \ge 1 - \xi_i, \quad i = 1,\dots,m
                \quad\; \xi_i \ge 0, \quad i = 1,\dots,m          (1)

where b represents the bias and ξi are the slack variables measuring classification errors. The classifier is given by y(x) = sign(w^T x + b), where w^T x + b represents the output score. It should be noted that we usually solve the dual form of equation (1), with solution y(x) = \text{sign}\left[ \sum_{i=1}^{m} \alpha_i y_i x^T x_i + b \right] and dual variables⁴ αi. An overview of the issues regarding SVMs and imbalanced classification for low-dimensional data can be found in Akbani et al (2004). Briefly, because of the imbalance, the majority of the slack variables ξi represent errors with respect to the majority class. This means that the minority class is under-represented and has a minor contribution to the goal function.

In this study we have opted for a linear SVM (using the LIBLINEAR (Fan et al 2008) package) as the base learner to classify behaviour data.⁵ The feature vector will have a sparse and high-dimensional representation, with d = |NT|. Note that the SVM formulation does not change in this setting. The solution vector w will become high-dimensional. The kernel matrix contains inner products of sparse and high-dimensional vectors, resulting in a (possibly) sparse matrix.
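A minimal sketch of this setup, assuming scikit-learn's LinearSVC (which wraps the same LIBLINEAR solver) and a randomly generated stand-in for a sparse behaviour matrix, not one of the paper's datasets:

```python
import numpy as np
from scipy.sparse import random as sparse_random
from sklearn.svm import LinearSVC

# Synthetic stand-in for behaviour data: 300 instances, 2000 sparse
# binary features, roughly 1% of the entries active.
rng = np.random.RandomState(0)
X = sparse_random(300, 2000, density=0.01, format="csr", random_state=rng)
X.data[:] = 1.0                        # binarize: behaviour present/absent
y = np.array([1] * 30 + [-1] * 270)    # imbalanced labels y_i in {-1, +1}

# C is the regularization parameter of equation (1); LIBLINEAR exploits
# sparsity, so training scales with the number of non-zeros, not with d.
clf = LinearSVC(C=1.0, max_iter=10000).fit(X, y)
scores = clf.decision_function(X)      # output score w^T x + b
```

The `decision_function` values are the raw output scores w^T x + b, which is what the AUC-based evaluation of Section 2.3 ranks on.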

The reasons we have chosen for an SVM are twofold. First of all, it is a very popular technique for dealing with traditional and behaviour data and has been applied in many diverse domains (Suykens et al 2002). As we have noted in Section 1.1, tailored techniques need to be applied to behaviour datasets, and these roughly fall in two categories: regularization based techniques and heuristic approaches. The latter type of techniques are not suitable for dealing with traditional data. The remaining regularization based techniques have formed the subject of traditional studies dealing with imbalanced data. We have opted for this type of techniques since we can then easily compare our results to the conclusions obtained from previous studies. Another reason we have opted for an SVM is the fact that many of our proposed techniques, see Section 3 for details, rely on a boosting process. Wickramaratna et al (2001); García and Lozano (2007) noted that using a strong learner can result in performance degradation during boosting. Regularization based techniques offer an added element of flexibility in the sense that the strength of the learner can be controlled by varying the regularization parameter. Heuristic approaches today do not offer this attractive feature.

⁴ Also called support values.
⁵ In Section 5 we will consider different types of base learners.


2.3 Evaluation metrics

The performance measures that are typically used in traditional studies dealing with imbalanced data are accuracy, sensitivity, specificity, precision, F-measure and G-mean, see for instance Bhattacharyya et al (2011); Han et al (2005); González and Velásquez (2013). These measures are derived from the confusion matrix and are based on a certain threshold applied on the output scores of the classifier, where the threshold is usually contained within the classification algorithm. There are two main issues with this approach. First of all, the built-in threshold that is applied on the scores might not be suitable. This might cause low performance values with respect to these criteria; yet if we were to simply adapt the threshold, the same performance criteria might show excellent results. The second issue lies in the fact that the chosen threshold could be irrelevant with respect to the available capacity. We address this issue with a simple example. If a targeted advertisement company would apply this classifier (with built-in threshold) on new customers, the classifier might choose to predict 5% of all possible customers to target as positive. Yet the company only has a marketing budget that allows targeting 0.1% of all possible customers. It is clear that the chosen threshold is inappropriate here. For these reasons, we have chosen to opt for area under the ROC-curve (AUC) instead.⁶ Chawla et al (2004) note that ROC-curves (and cost-curves) should be preferred over these traditional measures. The AUC, which measures the area under the ROC-curve, is more appropriate since it is independent of class skew and measures the ranking abilities of the classifier (Fawcett 2006). It answers the question: if we were to rank all instances according to output scores, is the classifier able to place the positive instances near the top of the list and the negative instances near the bottom? Because the method scans over all possible thresholds, it is independent of a specific cut-off value. Another reason we have chosen for AUC is the fact that many boosting and cost-sensitive learning techniques evaluate the performance using accuracy or misclassification cost, which suffer from the same issues as previously mentioned. Hence, we are one of the first to evaluate these techniques with respect to AUC. Note that AUC is also the preferred metric in the assessment of unsupervised anomaly detection techniques (Goldstein and Uchida 2016). Outliers/anomalies are strongly related to our field, as they signify rare events that are of specific importance to the analyst.
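The threshold-independence of AUC can be illustrated in a few lines; the scores below are hypothetical classifier outputs of our own making, not results from the paper.

```python
from sklearn.metrics import roc_auc_score

# 2 positives among 10 instances, mimicking class imbalance.
y_true = [1, 1, 0, 0, 0, 0, 0, 0, 0, 0]
scores = [0.9, 0.6, 0.7, 0.3, 0.2, 0.1, 0.1, 0.1, 0.1, 0.1]

# AUC = fraction of (positive, negative) pairs that are ranked correctly:
# 15 of the 2 x 8 = 16 pairs here, since one negative (0.7) outranks 0.6.
auc = roc_auc_score(y_true, scores)
print(auc)                                   # 0.9375

# Any monotone rescaling preserves the ranking, and hence the AUC;
# this is why AUC does not depend on a built-in threshold.
assert roc_auc_score(y_true, [10 * s + 5 for s in scores]) == auc
```

A confusion-matrix measure such as accuracy, by contrast, would change with every choice of cut-off on these same scores.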

Though we focus on AUC, other measures are suitable for performance assessment of imbalanced data. The weighted area under the ROC-curve (wAUC) (Li and Fine 2010) is able to emphasize certain regions of ROC-space. Lift-curves (Bekkar et al 2013), popular in the marketing domain, are ideal to assess the prevalence of positive cases among instances ranked highly by the classifier. These curves can be evaluated according to the available capacity. When cost information is available, this should be integrated in the performance assessment criterion. Indeed, the cost of a false negative is usually much larger than the cost of a false positive. We could then opt for cost-curves (Whitrow et al 2009; Bhattacharyya et al 2011), which can also be evaluated with consideration of the available capacity. Few studies make use of costs (Ngai et al 2011), mainly because these are difficult to determine, uncertain, and have a temporal characteristic. To conclude this section, we note that each of the performance assessment criteria mentioned in this paragraph requires some additional information in the form of a weight function (wAUC), capacity requirements and/or costs (lift/cost-curves). These measures are application specific, and this is the main reason we excluded them from our study.

⁶ We provide traditional measures (sensitivity, specificity, G-means, F-measure) in our online repository: http://www.applieddatamining.com/cms?q=software

3 Methods

3.1 Oversampling

Over the years, the data mining community has investigated several techniques that balance the data distribution by oversampling the minority class. In this section, we investigate the basic oversampling with replacement (OSR) approach in conjunction with synthetic sample generation procedures. The first technique simply duplicates minority class instances by a certain amount. Several references dealing with traditional low-dimensional data note that this technique may make the decision regions of the learner smaller and too specific on the replicated instances, causing the learner to overfit (Chawla et al 2002; Han et al 2005; Liu et al 2009; He and Garcia 2009). The synthetic approaches are designed to overcome this overfitting behaviour by generating new, non-overlapping instances in the minority class space. The techniques we investigate are SMOTE (Chawla et al 2002) and ADASYN (He et al 2008).

Consider a certain minority class instance xi. In the traditional setting, SMOTE and ADASYN will generate a new synthetic instance xsyn by choosing a random point on the line segment between the point xi and one of its K minority class nearest neighbours (computed according to Euclidean distance). In SMOTE, each original minority class instance xi generates the same number of synthetic instances, whereas the ADASYN algorithm generates a variable number of synthetic instances by putting a larger weight on the harder to learn minority instances. This way, the learner is more focused toward difficult instances.
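For traditional numeric data, this line-segment interpolation is a one-liner; a minimal sketch with toy vectors of our own choosing:

```python
import numpy as np

rng = np.random.default_rng(0)
xi = np.array([1.0, 2.0])      # a minority class instance
xnn = np.array([3.0, 4.0])     # one of its K minority-class nearest neighbours

u = rng.random()               # uniform in [0, 1]
x_syn = xi + u * (xnn - xi)    # random point on the line segment xi -> xnn
```

Note that this convex combination produces fractional feature values, which is exactly why the rule must be adapted for the binary behaviour data discussed next.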

These techniques need to be adapted when dealing with binary behaviour data, where each instance is represented by a large and sparse binary vector. The main differences with the original versions of SMOTE and ADASYN are indicated in Table 1 and are explained in more detail in the following paragraphs. The first difference lies in the generation of the new synthetic instances. As before, a synthetic sample is constructed based on two original minority instances. When both instances have a 0 or 1 in their corresponding column, the synthetic sample will also show a 0 or 1, respectively, at the considered position. When only one of the two minority instances shows a 1, the decision will be made according to a user-specified parameter prioropt. This parameter can be one of the following three options:

– "FlipCoin", where there is a 50% probability that the synthetic instance will show a 1 at the considered position.

– "Prior", where the value of the synthetic sample is determined by the prior within the minority class in the corresponding column. One generates a random number u in the interval [0, 1] and puts a 1 in the corresponding position if u is smaller than the prior.

– "Reverse Prior", where one generates a random number u in the interval [0, 1] and puts a 1 in the corresponding position if u is larger than the prior within the minority class for this column/feature.

The second difference lies in the way the nearest neighbours are determined. It would be unwise to consider Euclidean distance in that respect, because it treats a 0−0 match in the same way as a 1−1 match. Since we are working with dichotomous variables (i.e. present or absent), a 1−1 match at a certain position is far more informative than a 0−0 match. For instance, two users visiting the same web page contains more information than two users who didn't visit that specific page. The similarity between two instances is defined by a user-specified parameter simmeasure. We have limited ourselves to two popular choices: "Jaccard" uses the Jaccard similarity measure (Finch 2005) and "Cosine" uses the cosine similarity measure (Huang 2008). In principle, one could apply any of the metrics summarized in Stankova et al (2015).
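The per-position merging rule described above can be sketched as follows; the helper function `merge_pair` and its signature are our own, not the authors' code.

```python
import numpy as np

def merge_pair(xi, xn, prior, prioropt, rng):
    """Sketch of the prioropt rule for binary behaviour data: positions
    where both parents agree are copied; disputed positions (exactly one
    parent shows a 1) are resolved by the chosen option."""
    agree_one = xi & xn                  # 1-1 match -> always a 1
    disputed = xi ^ xn                   # exactly one parent shows a 1
    u = rng.random(xi.size)
    if prioropt == "FlipCoin":
        keep = u < 0.5                   # 50% probability of a 1
    elif prioropt == "Prior":
        keep = u < prior                 # 1 if u is smaller than the prior
    elif prioropt == "ReversePrior":
        keep = u > prior                 # 1 if u is larger than the prior
    else:
        raise ValueError(prioropt)
    return agree_one | (disputed & keep)

rng = np.random.default_rng(42)
X_min = np.array([[1, 1, 0, 0],
                  [1, 0, 1, 0]], dtype=bool)
prior = X_min.mean(axis=0)               # per-column prior in the minority class
syn = merge_pair(X_min[0], X_min[1], prior, "FlipCoin", rng)
# syn[0] is always 1 (1-1 match) and syn[3] always 0 (0-0 match).
```

Only the disputed positions are random, so the synthetic instance stays inside the "behavioural capital" of its two parents and the result remains a sparse binary vector.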

Table 1 Differences between the original SMOTE and ADASYN implementations and the versions SMOTEbeh and ADASYNbeh tailored for behaviour data (parameter explanations are provided in Section 3.1)

                               SMOTE                          SMOTEbeh          ADASYN                         ADASYNbeh
Amount of oversampling         N                              β                 β                              β
Synthetic sample generation    random point on line segment   prioropt          random point on line segment   prioropt
Similarity measure             Euclidean                      Jaccard/Cosine    Euclidean                      Jaccard/Cosine
Number of nearest neighbours   K                              K, K′             K                              K, K′

A detailed pseudo-code implementation of our versions of SMOTE and ADASYN, called SMOTEbeh and ADASYNbeh, is shown in Algorithm 1. Also note that we introduced an extra parameter K′ to decouple the determination of the number of synthetic instances that need to be generated for a certain minority instance from the number of nearest neighbours it uses to obtain the synthetic instances.

The experimental set-up adopted by Chawla et al (2002); He et al (2008) considers K = 5 as the number of nearest neighbours used, without a detailed motivation. Furthermore, the latter paper compares SMOTE with an oversampling percentage N of 200% (meaning that the size of the newly created synthetic data instances is twice as large as the size of the original minority training data) with a completely balanced dataset (β = 1 in Algorithm 1) in ADASYN. In our experiments, we will consider a variety of possible K-values and compare SMOTE and ADASYN with identical oversampling rates, controlled by a single parameter β.


Algorithm 1 SMOTEbeh and ADASYNbeh pseudo-code implementation for binary behaviour data

Input: Xmin, Xmaj, β, prioropt, simmeasure, K, K′
a) Determine the total amount of synthetic minority instances that need to be generated (β ∈ [0, 1] is a parameter that controls the amount of oversampling; β = 1 means a fully balanced dataset will be created):

    G = (|Xmaj| − |Xmin|) × β

b) Determine the number of synthetic samples gi that need to be generated for each minority class instance xi:
if SMOTE then
    gi ← ⌈G / |Xmin|⌉
else if ADASYN then
    Calculate the K nearest neighbours (with the simmeasure option) of instance xi from the set (Xmin \ {xi}) ∪ Xmaj and determine ∆i, the number of majority class nearest neighbours. Next, calculate ri = ∆i / K and normalize these values: r̂i = ri / Σ_{j=1..|Xmin|} rj.
    gi ← ⌈r̂i × G⌉
end if
c) Generate gi synthetic samples for minority instance xi:
Calculate the K′ nearest neighbours (with the simmeasure option) of instance xi from the set Xmin \ {xi}. Additionally, remove those nearest neighbours that have a similarity of 0 with xi. The remaining nearest neighbours form the set Kused. If this set turns out to be empty, set Kused = {xi}.
for iter = 1 → gi do
    Randomly choose 1 nearest neighbour from the set Kused.
    Generate a synthetic minority sample from xi and the chosen nearest neighbour (according to prioropt).
end for
d) Because Σ gi ≥ G, randomly remove synthetic points until the total number of synthetic samples equals G.

3.2 Undersampling

In this section we will compare the simple random undersampling technique (RUS) with informed undersampling approaches. The first method randomly discards majority class training instances. While this technique can achieve fast training performance of the underlying base learner, an obvious disadvantage is the fact that it might discard potentially useful majority class instances. The informed approaches try to intelligently retain the most informative majority class instances in the hope of increasing predictive performance, while at the same time keeping the fast training speed of the underlying classifier.

The first set of informed undersampling techniques are based on the methods proposed by Zhang and Mani (2003); Chyi (2003). The K-nearest neighbour 'classifier' is used to determine the importance of each majority class training instance by calculating the total similarity with the K closest minority class training set examples. Regarding similarity computations for binary behaviour data, we refer to the related discussion in Section 3.1. The first technique, called "Closest Knn", retains majority class examples that are closest to the minority class instances. These instances are the most difficult to classify, and we would expect them to be most informative. The second method, called "Closest tot sim", is similar to the previously described technique. The difference is that it no longer computes similarities with the K closest minority neighbours; instead, it calculates the total similarity with all minority instances in determining the importance. The main reason we included this technique is computational speed.⁷ The last techniques, called "Farthest Knn" and "Farthest tot sim", are included for comparison with the previously mentioned techniques. Their implementation is identical to the previously described techniques; however, they retain majority class examples that are farthest from the minority class instances. For each of the proposed methods in this paragraph, the amount of undersampling is controlled by a user-specified parameter βu according to the following formula:

    Nr_{rem} = \lfloor (|X_{maj}| − |X_{min}|) \times \beta_u \rfloor          (2)

where Nrrem represents the number of majority class instances to be discarded; βu = 1 means a completely balanced dataset is obtained.
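The "Closest tot sim" variant with cosine similarity and the removal count of equation (2) can be sketched as follows (the helper name and toy data are ours; the paper's own implementation may differ in detail):

```python
import numpy as np

def closest_tot_sim(X_maj, X_min, beta_u):
    """Sketch of 'Closest tot sim': keep the majority instances with the
    highest total cosine similarity to the full minority class;
    beta_u = 1 yields a balanced dataset."""
    def normalize(X):
        n = np.linalg.norm(X, axis=1, keepdims=True)
        n[n == 0] = 1.0                     # avoid division by zero
        return X / n
    # |X_maj| x |X_min| cosine similarities, summed per majority instance
    total_sim = (normalize(X_maj) @ normalize(X_min).T).sum(axis=1)
    n_remove = int(np.floor((len(X_maj) - len(X_min)) * beta_u))  # eq. (2)
    keep = np.argsort(-total_sim)[: len(X_maj) - n_remove]        # closest first
    return X_maj[keep]

X_min = np.array([[1, 1, 0], [1, 0, 1]], dtype=float)
X_maj = np.array([[1, 1, 1], [0, 0, 1], [1, 0, 0],
                  [0, 1, 0], [0, 1, 1], [1, 1, 0]], dtype=float)
X_kept = closest_tot_sim(X_maj, X_min, beta_u=1.0)   # balanced: 2 rows survive
```

Because the ranking needs only one matrix product and no per-instance top-K sort, this is the cheaper of the two "closest" variants, which matches the computational-speed motivation given above.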

The second set of informed undersampling techniques aims at targeting the within-class imbalance problem and is based on the approach proposed in Sobhani et al (2015). They postulate that this within-class imbalance problem is more pronounced in the case of undersampling methods. If we were to randomly select majority class instances, then the probability of drawing an instance from small disjuncts within the majority class would be very low. These regions might therefore contain no representatives and remain unlearned. The authors chose to address this issue by clustering the majority class instances in a first step and subsequently selecting an equal number of representatives from each cluster. The reported results show their approach to outperform the CBO-algorithm (see Section 1.1). In the following paragraphs we will dig deeper into the subject of clustering behaviour data. We already refer to Algorithm 2 for an overview of our cluster-based undersampling method (CBU).

As we noted in Section 2.1, behaviour data can be represented as a bipartite graph. The clustering of behaviour data⁸ aims at finding groups of nodes (communities) that connect more to each other than to other nodes in the network. This subject is currently an active area of research, with a rapid evolution of a vast number of cluster detection techniques (Zha et al 2001; Dhillon 2001; Larremore et al 2014; Beckett 2016). We refer to Porter et al (2009); Fortunato (2010); Alzahrani and Horadam (2016) for detailed surveys on the problem. It should be noted that the vast majority of publications deal with the subject of clustering unigraphs (networks with only one type of nodes). It is only fairly recently that interest grew in the clustering of bigraphs. In our implementations we have chosen the popular modularity-based approaches⁹ for clustering bigraphs, and these fall into two directions. In the first direction, the modularity function that is used for unigraphs is adapted to be suitable for bigraphs, see for instance the work of Barber (2007). The other direction, which we adopt in our study, projects the bigraph to a unigraph of bottom nodes and performs community detection on the projection using traditional modularity definitions. Note that Guimerà et al (2007) observed no difference in the obtained communities using either direction.

⁷ For each majority class instance, we no longer need to sort the similarities with all minority instances in determining the K largest values.

⁸ This subject is more commonly known as community detection in bipartite graphs.
⁹ Modularity-based approaches attempt to optimize a quality function known as modularity for finding community structures in networks and rely on the use of heuristics due to the complexity of the problem.


In this article we adopt the methodology proposed in Alzahrani and Horadam (2016), which consists of projecting the bigraph to a unigraph of bottom nodes and applying¹⁰ the Louvain algorithm (Blondel et al 2008) on the projection. Lancichinetti and Fortunato (2009) performed a comparative study regarding the performance of 12 community detection algorithms and concluded the Louvain method to be the best modularity based algorithm and the second best among all algorithms. The heuristic is very fast, with an O(m) complexity, with m the number of edges in the unigraph. We have chosen the Louvain algorithm because of its speed performance and the availability of a toolbox (Jutla et al 2011-2016) that is directly compatible with our implementations. The toolbox provides a generalized implementation of the Louvain algorithm in the sense that multiple definitions of modularity are possible. The quality function we chose is the popular Newman-Girvan modularity (Newman and Girvan 2004).

With respect to the projection, Alzahrani and Horadam (2016) connect two bottom nodes if they have at least one top node in common. The connection weight between two bottom nodes in the projection is set to the number of shared top nodes. In our implementation, we adapted the connection weights in accordance with Stankova et al (2015) as follows: first of all, we assign weights to the top nodes corresponding to the hyperbolic tangent applied to the inverse degree of the top node. Next, the connection weight between two bottom nodes in the projection corresponds with the total weight of the shared top nodes. Top nodes having low degrees therefore obtain a higher contribution in the projection (e.g. two users making a transaction to a local book store are assumed to be more closely connected to each other than two users making a transaction to a large retail store).
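The weighting scheme above can be sketched in a few lines; this is an illustrative implementation under our own naming, assuming the bigraph is given as a map from each top node to its set of bottom nodes.

```python
import math
from collections import defaultdict
from itertools import combinations

def project_bigraph(bottom_of_top):
    """Project a bigraph onto its bottom nodes. Each shared top node contributes
    tanh(1/degree(top)) to the pairwise weight, so rare top nodes count more
    (the weighting of Stankova et al 2015 described in the text)."""
    weights = defaultdict(float)
    for top, bottoms in bottom_of_top.items():
        contrib = math.tanh(1.0 / len(bottoms))  # top-node weight
        for i, j in combinations(sorted(bottoms), 2):
            weights[(i, j)] += contrib
    return dict(weights)

# 'local_store' has degree 2 -> strong tie; 'big_retailer' degree 4 -> weak ties.
transactions = {'local_store': {'u1', 'u2'},
                'big_retailer': {'u1', 'u2', 'u3', 'u4'}}
uni = project_bigraph(transactions)
print(round(uni[('u1', 'u2')], 3))  # tanh(1/2) + tanh(1/4)
print(round(uni[('u3', 'u4')], 3))  # tanh(1/4) only
```

Users u1 and u2 end up much more strongly connected than u3 and u4, who only share the large retailer.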

In our CBU algorithm (see Algorithm 2), after clustering the bigraph containing exclusively majority class instances, we randomly select an equal amount of majority instances from each community to target the within-class imbalance problem. In the rare situation that the number of obtained clusters exceeds the required amount of majority instances (Nrretain), we sort the communities according to a user-specified parameter Clustopt and randomly select 1 instance from the first Nrretain clusters. The parameter Clustopt can take the following values:

– C_Smallest, where we sort clusters in ascending order of their size
– C_Largest, where we sort clusters in descending order of their size

Note that we randomly select majority class instances from each cluster. Yen and Lee (2009) found a random selection strategy after clustering to be superior to informed approaches based on distance.

10 Note that they also made use of the flow-based algorithm Infomap (Rosvall and Bergstrom 2008), which shows excellent results on the LFR benchmark.

Imbalanced classification in sparse and large behaviour datasets 15

Algorithm 2 CBU pseudo-code implementation for behaviour data
Input: Xmin, Xmaj, βu, Clustopt
a) Cluster the majority class instances Xmaj:
– Assign weights to each top node, corresponding with the hyperbolic tangent applied to the inverse of the node's degree.
– Project the bigraph Xmaj to a weighted unigraph consisting of bottom node majority class instances. The weight wij between majority class instances i and j corresponds with the total weight of the shared top nodes.
– Apply the Louvain algorithm (Blondel et al 2008) on the projected unigraph to partition the majority class instances into clusters.
b) Select majority class instances:
Nrrem ← ⌊(|Xmaj| − |Xmin|) × βu⌋ (see Equation (2))
Nrretain ← |Xmaj| − Nrrem
if Nrretain < |Clust| then
– Sort clusters according to Clustopt
– Randomly select 1 instance from the first Nrretain clusters
else
– Randomly select ⌈Nrretain / |Clust|⌉ majority class instances from each cluster
– Randomly discard instances from the previous step until its size corresponds with Nrretain
end if
c) Return the new training set consisting of Xmin and the selected majority class instances from step b.
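Step b of the algorithm can be sketched as follows; the cluster input format and function name are our own, and unlike the pseudo-code above this sketch simply exhausts clusters smaller than the per-cluster quota.

```python
import math
import random

def cbu_select(clusters, n_min, beta_u, seed=0):
    """Selection step of CBU: clusters is a list of lists of majority-class
    instances (one list per community). Keeps
    Nr_retain = |Xmaj| - floor((|Xmaj| - n_min) * beta_u) instances,
    spread as evenly as possible over the clusters (within-class balance)."""
    rng = random.Random(seed)
    n_maj = sum(len(c) for c in clusters)
    nr_retain = n_maj - math.floor((n_maj - n_min) * beta_u)
    if nr_retain < len(clusters):
        # Rare case: fewer instances needed than clusters -> 1 per sorted cluster.
        ordered = sorted(clusters, key=len)  # Clust_opt = C_Smallest
        return [rng.choice(c) for c in ordered[:nr_retain]]
    per_cluster = math.ceil(nr_retain / len(clusters))
    picked = [x for c in clusters for x in rng.sample(c, min(per_cluster, len(c)))]
    rng.shuffle(picked)
    return picked[:nr_retain]  # randomly discard the surplus

clusters = [list(range(0, 6)), list(range(6, 10)), list(range(10, 12))]  # 12 majority
sel = cbu_select(clusters, n_min=4, beta_u=1.0)
print(len(sel))  # nr_retain = 12 - 8 = 4
```

With βu = 1, the retained majority set has the same size as the minority class, but its members are drawn evenly from all communities rather than uniformly at random.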

3.3 Boosting, cost-sensitive learning and EasyEnsemble

3.3.1 AdaBoost

The AdaBoost (Schapire and Singer 1999; Schapire 1999) algorithm has been designed from the perspective of improving the performance of a weak learner so that it achieves accuracies that are comparable with a strong learning algorithm. Fundamental to the idea of boosting is to maintain a weight distribution over the training set. In each boosting iteration, the weights of wrongly classified instances are increased, so that the underlying weak learner puts more emphasis on these hard examples. In our implementation, we will consider using a SVM as base learner. SVMs are generally regarded as strong learners. The studies of Wickramaratna et al (2001) and García and Lozano (2007) note that using a strong learner usually results in performance degradation during the boosting process. The RBF-kernel SVM classifier (Wickramaratna et al 2001; Li et al 2008) is used as the underlying classifier to prove their point. In our study, we will employ a linear SVM, which can be considered as a weaker version compared to the RBF-kernel. Furthermore, the regularization parameter C can be viewed as a "weakness" indicator11. Lowering the C-value results in weaker learners, as can be seen from the goal function of the SVM optimization problem (see Equation (1)). We will come back to this point in Section 4.5.

11 The distinction between weak/strong learners is loosely 'defined' in Schapire (1999). A weak learner corresponds with a hypothesis that performs just slightly better than random guessing. A strong learner is able to generate a hypothesis with an arbitrarily low error rate, given enough data. We adopt these definitions, but consider the distinction between weak/strong based on training set error. In a SVM context, it is quite typical that error levels on training data drop with increasing C-values (Suykens et al 2002). A learner that is 'too strong' means that even though its performance on training data is very high, it fails to generalize well and the test set error increases due to overfitting.


The boosting algorithm with underlying SVM is presented in Algorithm 3. This algorithm closely follows the original boosting implementation presented in Schapire and Singer (1999) and requires each learner to output confidence-rated predictions in the interval [−1, 1]. Since the SVM outputs real-valued scores, we apply the procedure of Platt (1999) to transform these scores into probability estimates (which can easily be translated to form confidence-rated predictions). A logistic regression (LR) model (Ng and Jordan 2002) is trained using the SVM scores as input and the corresponding labels as output. Note that the same data that are used to construct the linear SVM can be used to estimate the LR model; see Platt (1999) for a motivation.
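The Platt procedure amounts to fitting a sigmoid P(y = 1 | s) = 1/(1 + exp(A·s + B)) to the one-dimensional SVM scores. The sketch below fits A and B by plain gradient descent on the negative log-likelihood; Platt's original method additionally uses target smoothing and a second-order optimizer, so this is a simplified stand-in, not the exact algorithm.

```python
import math

def fit_platt(scores, labels, lr=0.1, epochs=2000):
    """Fit P(y=1|s) = 1 / (1 + exp(A*s + B)) by gradient descent on the
    negative log-likelihood; labels are in {-1, +1}."""
    a, b = 0.0, 0.0
    for _ in range(epochs):
        ga = gb = 0.0
        for s, y in zip(scores, labels):
            p = 1.0 / (1.0 + math.exp(a * s + b))  # probability of the +1 class
            t = 1.0 if y == 1 else 0.0
            ga += (p - t) * (-s)   # d(-loglik)/dA
            gb += (p - t) * (-1.0) # d(-loglik)/dB
        a -= lr * ga / len(scores)
        b -= lr * gb / len(scores)
    return a, b

# Higher (more positive) SVM scores should map to higher probabilities.
scores = [-2.0, -1.0, -0.5, 0.5, 1.0, 2.0]
labels = [-1, -1, -1, 1, 1, 1]
a, b = fit_platt(scores, labels)
prob = lambda s: 1.0 / (1.0 + math.exp(a * s + b))
print(prob(2.0) > 0.9, prob(-2.0) < 0.1)
```

A calibrated probability p is then mapped to a confidence-rated prediction in [−1, 1] as 2p − 1.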

The boosting algorithm requires the weak learner to be trained using a distribution Dt. One could sample from this distribution to generate (unweighted) examples and train a SVM next. We have chosen to include weights in the SVM goal function formulation (this requires an extension of the LIBLINEAR package):

min_{w,b,ξ_i}  wᵀw/2 + C ∑_{i=1}^{m} weight_i ξ_i        (3)

The weights weight_i are set according to the weight distribution Dt(i) in each round of boosting. Note that the C-value that is used to train the SVM model will be divided by mean(weight_i). This normalization allows for a fair comparison between the weighted and unweighted SVM versions (e.g. in the first round of boosting, weight_i = 1/m for all instances; by multiplying the C-value with m, this corresponds to solving the unweighted problem with the same C-value).

We introduce an additional parameter µ, called the weight percentage, with values in [0, 100], in the boosting algorithm of Algorithm 3. This parameter controls the amount of training data that is used to construct the SVM model and subsequent LR model. We sort the original training data in descending order according to the distribution Dt. Next, we form a new training set of minimal cardinality by including points from the original sorted training data until the total weight is higher than the weight percentage µ/100. This way, the newly formed training data will contain only the part of the original training data that has the most weight. This partial set will then be used to construct a weighted SVM model (according to Equation (3), with updated distribution for this set) and a subsequent LR model. The idea of using a partial dataset to construct the base learner not only reduces training times but also weakens the learner (García and Lozano 2007).
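The subset-selection rule just described can be sketched directly; the function name is illustrative.

```python
def top_weight_subset(data, dist, mu):
    """Return the smallest prefix of the data, sorted by boosting weight in
    descending order, whose cumulative weight exceeds mu/100."""
    order = sorted(range(len(data)), key=lambda i: dist[i], reverse=True)
    subset, total = [], 0.0
    for i in order:
        subset.append(data[i])
        total += dist[i]
        if total > mu / 100.0:
            break
    return subset

dist = [0.4, 0.3, 0.2, 0.1]        # a boosting distribution D_t
data = ['x0', 'x1', 'x2', 'x3']
print(top_weight_subset(data, dist, 75))  # 0.4 + 0.3 = 0.7, + 0.2 -> 0.9 > 0.75
```

With µ = 100 the full training set is used; with µ = 75 only the three heaviest instances survive in this toy distribution.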

In Algorithm 3, we have an explicit check to verify if rAB = 1. In this case, the SVM model outputs scores that allow for a perfect classification on the training set. The subsequent LR model will find a threshold and output a value of −1 if the SVM score is lower than this threshold (a value of +1 if the score is higher than this threshold). In this situation, the training data would be perfectly classified. In our implementation, we attempt to avoid this kind of behaviour, because it can lead to overfitting, meaning it might pinpoint the wrong threshold and make too drastic decisions (we lose the meaning of confidence). The check if rAB ≤ 0 verifies if the currently boosted model is performing worse than random (a random model would have an rAB-value of 0). Obviously, if the model performs worse than random, we quit the boosting process. During the first round of boosting, we perform similar checks that are not explicitly indicated in Algorithm 3. In the case where rAB = 1, we output the SVM scores instead of the LR binary values. When rAB ≤ 0, we quit the boosting process and output the LR scores.

Algorithm 3 AdaBoost with a SVM-LR combination as a base learner
Input: (X, Y) = (x1, y1), ..., (xm, ym); C; T; µ
Initialize distribution D1(i) = 1/m
for t = 1 to T do
– train weak learner using distribution Dt. The weak learner consists of a weighted linear SVM and LR model, trained with weight percentage µ of Dt:
    ht ← Train_WeakLearner(X, Y, Dt, C, µ)
– compute the weighted confidence rAB on the training data:
    rAB ← ∑_{i=1}^{m} Dt(i) yi ht(xi)
    If (rAB = 1 or rAB ≤ 0) then αt ← 0 and stop the boosting process
– choose αt ∈ ℝ:
    αt ← (1/2) log((1 + rAB) / (1 − rAB))
– update distribution:
    Dt+1(i) ← Dt(i) exp(−αt yi ht(xi)) / Zt
    where Zt is a normalization factor (chosen so that Dt+1 will be a distribution)
end for
Output the final hypothesis (the output score is the term contained in the sign function):
    H(x) = sign(∑_{t=1}^{T} α̃t ht(x))   with α̃t = αt / ∑_{i=1}^{T} αi
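A single round of the confidence-rated update in Algorithm 3 can be sketched as follows, with toy learner outputs standing in for the SVM-LR predictions; the helper name is our own.

```python
import math

def boost_round(D, preds, labels):
    """One confidence-rated AdaBoost round (Schapire and Singer 1999):
    r = sum_i D(i)*y_i*h(x_i); alpha = 0.5*log((1+r)/(1-r));
    D'(i) proportional to D(i)*exp(-alpha*y_i*h(x_i))."""
    r = sum(d * y * h for d, y, h in zip(D, labels, preds))
    if r >= 1 or r <= 0:          # stopping checks from Algorithm 3
        return None, D
    alpha = 0.5 * math.log((1 + r) / (1 - r))
    new_d = [d * math.exp(-alpha * y * h) for d, y, h in zip(D, labels, preds)]
    z = sum(new_d)                # normalization factor Z_t
    return alpha, [d / z for d in new_d]

D = [0.25] * 4
preds = [0.8, 0.6, -0.9, 0.2]     # confidence-rated outputs in [-1, 1]
labels = [1, 1, -1, -1]           # last example is misclassified (h > 0, y = -1)
alpha, D2 = boost_round(D, preds, labels)
print(round(alpha, 3), D2.index(max(D2)))
```

After one round, the misclassified fourth example carries the largest weight, so the next weak learner concentrates on it.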

3.3.2 AdaCost

The AdaCost algorithm (Fan et al 1999) is a variant of cost-sensitive learning where misclassification costs are introduced in the weight-update formula of AdaBoost. The cost-sensitive update rule increases the weights of costly misclassified instances more aggressively and decreases the weights of costly correct classifications more conservatively. Each instance is given a misclassification cost ci, where we chose to put ci = 1 for positive (minority) instances and ci = 1/R for negative (majority) instances. R is a user-defined value that allows one to put more emphasis on the minority class. The implementation of AdaCost is similar to Algorithm 3, yet there are a few differences (Fan et al 1999): firstly, the initial distribution is chosen as D1(i) = ci / ∑_{j=1}^{m} cj; secondly, the weight update rule is given by Dt+1(i) = Dt(i) exp(−αt yi ht(xi) β(i)) / Zt, where β(i) = −0.5 sign(yi ht(xi)) ci + 0.5 is a cost-adjustment function; finally, the choice of αt is given by αt = (1/2) log((1 + rAC) / (1 − rAC)), where rAC = ∑_{i=1}^{m} Dt(i) yi ht(xi) β(i). Note that the checks to stop the boosting process prematurely (see the second bullet in Algorithm 3) are still based on the r-value obtained from AdaBoost (rAB). This is because β(i) ∈ [0, 1].
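The behaviour of the cost-adjustment function β(i) is easy to verify numerically; this sketch (names our own) treats yi·ht(xi) = 0 as a misclassification for simplicity.

```python
def beta_adj(y, h, c):
    """AdaCost cost-adjustment: beta(i) = -0.5*sign(y*h)*c + 0.5, in [0, 1].
    Misclassified (y*h < 0): beta = 0.5*c + 0.5 -> stronger weight increase.
    Correct (y*h > 0):       beta = -0.5*c + 0.5 -> milder weight decrease."""
    sign = 1.0 if y * h > 0 else -1.0
    return -0.5 * sign * c + 0.5

R = 4.0                           # cost ratio: c=1 for minority, c=1/R for majority
print(beta_adj(1, -0.7, 1.0))     # misclassified minority -> 1.0 (full boost)
print(beta_adj(-1, 0.3, 1 / R))   # misclassified majority -> 0.625
print(beta_adj(1, 0.7, 1.0))      # correctly classified minority -> 0.0 (no decay)
```

A misclassified minority instance thus gets the full exponential weight increase, while a correctly classified one keeps its weight, skewing the distribution towards the minority class over the rounds.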

In the first boosting round of AdaCost, the weighted SVM formulation (3) is equivalent to solving a SVM formulation with the following goal function:

min_{w,b,ξ_i}  wᵀw/2 + C⁺ ∑_{i|yi=+1} ξ_i + C⁻ ∑_{i|yi=−1} ξ_i        (4)

where C⁺/C⁻ = R. This can be seen as a cost-sensitive version of a SVM, and this idea was initially proposed by Veropoulos et al (1999).
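A small hypothetical helper (not part of the paper's implementation) makes the relation between R and the two class-dependent regularization parameters in Equation (4) concrete, including the balanced default R_L discussed in Section 4.2.

```python
def class_costs(C, n_maj, n_min, R=None):
    """Split a base parameter C into (C_plus, C_minus) for Equation (4), with
    C_plus / C_minus = R. The default R = R_L = n_maj / n_min equalizes the
    total penalty weight of both classes: C_plus*n_min == C_minus*n_maj."""
    if R is None:
        R = n_maj / n_min
    return C * R, C

c_plus, c_minus = class_costs(0.1, n_maj=900, n_min=100)
print(round(c_plus, 3), c_minus)  # minority slack penalized 9x more heavily
```

With R = R_L, each minority slack variable is penalized n_maj/n_min times more than a majority one, which compensates for the class-size difference in the objective.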

3.3.3 EasyEnsemble

One of the disadvantages of the random undersampling method is the fact that we are discarding potentially valuable information. EasyEnsemble (Liu et al 2009) is a method that combines several balanced subsets (S in total), containing randomly selected majority class instances together with all minority examples. Each subset contains the same number of instances from both classes and is fed to the boosting algorithm presented in Algorithm 3. Afterwards, the weak learners hs,t of each subset s are simply combined to form the final ensemble:

H(x) = sign(∑_{s=1}^{S} ∑_{t=1}^{T} α̃s,t hs,t(x))   with s = 1, ..., S and t = 1, ..., T        (5)

It is clear that this technique benefits from a combination of bagging and boosting (Liu et al 2009). Note that we apply the same boosting algorithm as previously described to each of the balanced subsets. However, when rAB = 1 in the first round of boosting, we quit the boosting process, put α1 = 1 and continue to use the trained LR model in the final ensemble. It was previously noted that this can cause overfitting, in the sense that it can pinpoint a wrong threshold. However, the LR model still contains information, and the combination thereof with the models obtained from the other subsets remains valuable (as we have noted from initial experiments, by comparing the situation where we include or reject those subsets).

There are a few subtle though important differences with respect to the experiments performed by Liu et al (2009). First of all, the authors use a CART algorithm (Breiman et al 1984) as base learner. We employ a linear SVM with subsequent LR as a weak learner. To our knowledge, this combination has not been proposed elsewhere, yet proves to be very efficient in this setting. Secondly, their methodology employs a discrete version of AdaBoost, where each weak learner outputs binary values in {−1, 1}. As already stated, we make use of an improved version of AdaBoost that relies on confidence-rated predictions (Schapire and Singer 1999). Finally, the authors reported performances with fixed levels of S = 4 and T = 10. We will investigate the effect of varying S and T levels.
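The bagging part (drawing S balanced subsets) and the final combination of Equation (5) can be sketched as follows; the toy (alpha, h) pairs stand in for the boosted SVM-LR learners and all names are illustrative.

```python
import math
import random

def make_subsets(majority, minority, S, seed=0):
    """Draw S balanced subsets: all minority instances plus an equally sized
    random sample of the majority class (the bagging part of EasyEnsemble)."""
    rng = random.Random(seed)
    return [minority + rng.sample(majority, len(minority)) for _ in range(S)]

def easy_ensemble_score(x, subset_models):
    """Final EasyEnsemble score: sum over subsets s and boosting rounds t of
    alpha_{s,t} * h_{s,t}(x); the predicted label is its sign (Equation (5))."""
    return sum(a * h(x) for models in subset_models for a, h in models)

majority = list(range(100))
minority = [-1, -2, -3]
subsets = make_subsets(majority, minority, S=4)
print(all(len(s) == 6 for s in subsets))

# Toy ensemble: two subsets, one weak learner each.
models = [[(0.7, lambda x: math.tanh(x))],
          [(0.3, lambda x: 1.0 if x > 0 else -1.0)]]
print(easy_ensemble_score(2.0, models) > 0)
```

Each subset is only twice the minority class size, which is what makes the method fast and trivially parallelizable across subsets.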


4 Results and discussion

4.1 Datasets

Stankova et al (2015) provide the first large collection of benchmark behaviour datasets for classification. In our experiments, we make use of these data sources and extend this repository with two additional datasets. Each of these datasets shows a bipartite structure with a clear target variable to predict. We refer to this study and the next paragraph for a short description of the available data resources. In this paragraph, we indicate why we have chosen to include or reject certain data sources from the aforementioned study. The available datasets can be divided into small, medium and large datasets, based on the number of instances and the number of features present. The Norwegian companies and Reality Mining datasets comprise the small datasets, since they contain only a few hundreds of instances or features, and are therefore regarded as impractical for our purposes. The MovieLens, Yahoo, TaFeng, Book-Crossing and LibimSeTi datasets belong to the medium-sized datasets, each containing a few thousands up to a few hundreds of thousands of instances and features. All of these datasets are included in our study. The large datasets, containing hundreds of thousands up to millions of instances and features, are the Flickr and Kdd databases12. Other large proprietary data sources, not included in Stankova et al (2015), are the corporate residence fraud (CRF) and banking (Bank) datasets, which arise from real-life application domains with intrinsic imbalance.

To summarize, we have gathered datasets containing such fine-grained behaviour data from a wide variety of application domains. The MovieLens datasets, for which we are predicting the gender13 (Mov G) or the genre thriller14 (Mov Th), provide data on which films each user has rated. The Yahoo movies15 dataset has a similar structure, where the age of each user, Yahoo A (above or below average), or the gender, Yahoo G, is being predicted. The TaFeng dataset16 contains data on shopping behaviour, where age (below or above average) is being predicted based upon which products are purchased. In the Book-Crossing (Book) dataset (Ziegler et al 2005), users rate books and the age of the user (above or below average) is being predicted. LibimSeTi (LST) contains data from a dating site (Brozovsky and Petricek 2007), where users rate each other's profile and gender is being predicted. In the advertisement (Adver) dataset (Lichman 2013), we try to predict if a URL is an advertisement, based on a large variety of binary features of the URL. Note that this dataset does not arise from the behaviour of entities, yet it still has a high-dimensional and sparse representation. The Flickr dataset (Cha et al 2009) contains pictures being marked by users as favorite, and we predict the number of comments on each picture (below

12 Flickr and KDD will be excluded in the comparative study of Section 4.6. This is because some methods are too computationally intensive - especially in combination with the large number of possible parameter combinations - to be applied on these very large data sources. Furthermore, our statistical evidence is already sufficiently strong to conclude significance without these datasets. Having said this, these data sources will be included in the analysis of Section 5.

13 MovieLens 1M dataset from http://grouplens.org/datasets/movielens
14 MovieLens 10M dataset from http://grouplens.org/datasets/movielens
15 https://webscope.sandbox.yahoo.com
16 http://www.bigdatalab.ac.cn/benchmark/bm/dd?data=Ta-Feng


or above average). In the Kdd cup data, the performance of a student on a test is being predicted, based on artificially created binary features (Yu et al 2010). The corporate residence fraud dataset (CRF) (Junque de Fortuny et al 2014b) contains data on foreign companies making transactions with specific Belgian companies, where we try to predict whether the foreign company commits residence fraud (a type of fiscal fraud). Finally, the banking dataset (Bank) (Martens et al 2016) contains detailed behaviour on consumers making transactions with merchants or other persons, to predict interest in a pension fund product. Some characteristics of these datasets can be found in Table 2. The features column only shows the number of active features17.

Table 2 Behaviour data characteristics. The final column shows the imbalance ratio p, defined as the ratio of the number of minority class instances to the amount of majority class instances in the training set, expressed as a percentage. See Section 4.2 for details regarding p.

Name      |Xmaj|     |Xmin|     Features    p = 100 × |Xmin|train / |Xmaj|train
Mov G     4331       1709       3706        1 & 25
Mov Th    10546      131        69878       1.24 (p = [])
Yahoo A   6030       1612       11915       1 & 25
Yahoo G   5436       2206       11915       1 & 25
TaFeng    17330      14310      23719       1 & 25
Book      42900      18858      282973      1 & 25
LST       59702      60145      166353      1
Adver     2792       457        1555        16.38 (p = []) & 1
CRF       869071     62         108753      0.0072 (p = [])
Bank      1193619    11107      3139570     0.93 (p = [])
Flickr    8166814    3028330    497472      0.1
Kdd       7171885    1235867    19306083    0.5

4.2 Methodology

Regarding the experiments performed in the upcoming sections, we applied a tenfold cross-validation procedure. Each of the folds contains 80% training data, 10% validation data and 10% test data. Note that these percentages are valid for both the majority class and the minority class (stratified sampling). As can be seen from Table 2, some datasets are balanced in nature. We created artificial imbalance for these datasets by removing minority class instances from the initial training set, according to a user-defined parameter p. We ensured that the amount of minority training instances corresponds to p percent of the majority class training size: |Xmin|train = (p/100) × |Xmaj|train. As an example, say that we are using the Book dataset with p = 25. In that case, we know that the majority class contains 34320 training instances (80% of |Xmaj| = 42900). The minority training data would contain 8580 instances (25% of 34320). When the dataset is already imbalanced, we define p = [], which means that no downsampling

17 Active features represent features that are present for at least one instance in the dataset. A non-active feature corresponds with a column of zeros in the matrix representation and would not contribute to the model.


of the minority class training data is performed. Note that the validation and test data are left untouched18.
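The arithmetic of the Book example above can be checked with a one-line helper (names our own):

```python
def minority_train_size(n_maj, p, train_frac=0.8):
    """Artificial imbalance: keep p% of the majority *training* size as minority
    training instances, i.e. |Xmin|_train = (p/100) * |Xmaj|_train."""
    n_maj_train = round(train_frac * n_maj)
    return n_maj_train, round(p / 100.0 * n_maj_train)

# Book dataset example from the text: |Xmaj| = 42900, p = 25.
print(minority_train_size(42900, 25))  # (34320, 8580)
```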

The methods detailed in Section 3 are applied on the training data. For both the under- and oversampling approaches, a linear SVM is trained on the newly created balanced training data, with regularization parameter C having values

C = [10⁻⁷, 10⁻⁵, 10⁻³, 10⁻¹, 10⁰]

The validation data are used for parameter tuning purposes. The test data allow us to obtain the generalization performance. The results reported show the average over ten folds. In the remaining paragraphs of this section, we will describe the various parameter settings used in our experiments and give a brief overview of the parameters occurring in each method.

Considering the oversampling techniques, the parameter settings are as follows:

β = [0, 1/3, 2/3, 1]
prioropt = {FlipCoin, Reverse Priors}
simmeasure = {Cosine, Jaccard}
K = [10⁰, 10¹, 10², |Xmin|train]

We didn't include the "Prior" option, due to initial experiments showing a lower performance in comparison with the other options. This can be explained by the low priors occurring in each column, resulting in synthetic samples that mainly show zeros in 0-1 match situations. The oversampling with replacement method (OSR) only uses the β parameter, SMOTE uses all but the K parameter, and ADASYN uses all of the parameters listed above.

With respect to the undersampling techniques, the following parameter settings are used:

βu = [0, 1/4, 1/2, 3/4, 1]
simmeasure = {Cosine, Jaccard}
K = [10⁰, 10¹, 10², |Xmin|train]
Clustopt = {C_Smallest, C_Largest}

The random undersampling (RUS) technique only uses the βu parameter. The second set of methods, "Closest tot sim" and "Farthest tot sim", uses βu and simmeasure. The third set of techniques, "Closest Knn" and "Farthest Knn", makes use of all the parameters listed above, except for Clustopt. The final approach, CBU, employs βu and Clustopt.

The boosting variants (AdaBoost, AdaCost and EasyEnsemble) presented in Section 3.3 make use of the following settings:

T = 30
µ = [100, 75]
C = [10⁻⁷, 10⁻⁵, 10⁻³, 10⁻¹]

18 This means that if we start from a balanced set, only the training data will show artificial imbalance, according to the imbalance ratio p. The validation and test data would remain balanced. Since AUC (and some other metrics) is independent of class skew, it would be unwise to make these sets imbalanced as well, because that would lead to discarding minority class instances that are relevant for performance assessment.

22 Jellis Vanhoeyveld David Martens

R = [2, 8, RL], where RL = |Xmaj|train / |Xmin|train

S = [5, 10, 15]

The AdaBoost algorithm includes the T, µ and C parameters. AdaCost additionally uses cost ratios R. We have chosen a range of values, because misclassification costs are unknown for many business applications (He and Garcia 2009; Fan et al 1999; Sun et al 2007). The final value, RL, seems to be a popular choice (Akbani et al 2004; Luts et al 2010), because the total weight on the majority class balances with the total weight on the minority class. The final method, EasyEnsemble, uses S subsets in addition to the parameters previously mentioned for AdaBoost. Note that we consider the boosting iteration round t ∈ [1, T] as a tunable parameter19.

4.3 Oversampling

The oversampling techniques presented in Section 3.1 are applied to each of the datasets from Table 2, with varying imbalance levels (by varying the p-parameter). The experiments we conducted use the following methodology: for each dataset, we apply the oversampling techniques with all possible parameter combinations on the training data to create newly balanced datasets, after which linear SVMs are trained. The optimal parameter combination with respect to each imbalance ratio β is selected based on validation set AUC performance. With these parameters, the AUC on the test data is obtained. Results show the average over ten folds.
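For reference, the simplest of these techniques, oversampling with replacement (OSR), can be sketched in a few lines. We assume here that β plays the oversampling analogue of βu in Equation (2), i.e. it controls which fraction of the class-size gap is closed; names are our own.

```python
import random

def random_oversample(minority, n_maj, beta, seed=0):
    """Oversampling with replacement (OSR): add floor((n_maj - |Xmin|) * beta)
    duplicated minority instances, so beta = 1 yields a fully balanced set."""
    rng = random.Random(seed)
    n_add = int((n_maj - len(minority)) * beta)
    return minority + [rng.choice(minority) for _ in range(n_add)]

minority = ['a', 'b', 'c']
full = random_oversample(minority, n_maj=10, beta=1.0)
half = random_oversample(minority, n_maj=10, beta=2 / 3)
print(len(full), len(half))  # 10 7
```

SMOTE and ADASYN replace the duplication step with synthetic interpolation between minority neighbours, which is where the K and simmeasure parameters come in.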

The results on four arbitrarily selected datasets are shown in Table 3. Full results on each of the data sources can be found in Appendix A, Table 11. From these tables, we can conclude that performance generally increases with growing β-values. It is more correct to say that performance keeps improving with growing β-levels until an optimal point β* is reached. Increasing the balance level after this optimal value will cause only small fluctuations with respect to the optimal performance. Traditional studies dealing with dense, low-dimensional data note that the OSR method can suffer from overfitting, as already mentioned in Section 3.1. It is interesting to see that in this sparse, high-dimensional setting, this effect doesn't seem to occur20. Furthermore, the computationally expensive synthetic sampling approaches do not seem to improve performance over the simpler OSR method. This can be explained by the fact that many instances seem to have a very limited amount of non-zero elements (Junque de Fortuny et al 2014a; Stankova et al 2015), causing the synthetic sampling procedures to be limited in the amount of unique new samples they can produce. Note also that synthetic sampling procedures have historically been designed to overcome the overfitting behaviour of OSR, which no longer seems to apply here.

19 We sometimes make use of T as a symbol to indicate the boosting iteration number instead of t.
20 The SVM regularization parameter C controls for overfitting. Additionally, the influence of majority class noise/outliers (see Section 3.2) on the learned hyperplane decreases as we oversample the minority class. This hyperplane is more sensitive towards minority instances, and a change in its direction/orientation towards these majority class noise/outliers will be more heavily penalized.


Table 3 Oversampling experiments. Results show the average tenfold AUC test set performance with respect to increasing β-values (β = [β1, β2, β3, β4] = [0, 1/3, 2/3, 1]). Additionally, standard deviations are shown between brackets. Optimal performances for each method are highlighted in boldface. See Section 4.2 for parameter settings.

Mov Th (p = [])
          β1            β2            β3            β4
OSR       79.77 (5.33)  85.3 (4.66)   83.16 (4.5)   84.59 (5.69)
SMOTE     79.77 (5.33)  84.18 (6.51)  85.58 (5.97)  84.33 (5.75)
ADASYN    79.77 (5.33)  84.11 (6.77)  85.86 (6.06)  85.36 (5.13)

Yahoo G (p = 25)
          β1            β2            β3            β4
OSR       78.82 (1.39)  78.78 (2.02)  78.74 (1.86)  78.35 (1.47)
SMOTE     78.82 (1.39)  79.23 (1.57)  79.1 (1.2)    79.03 (1.89)
ADASYN    78.82 (1.39)  79.12 (1.43)  79.23 (1.37)  79.51 (2.01)

TaFeng (p = 25)
          β1            β2            β3            β4
OSR       66.94 (1.34)  68.84 (1.21)  66.99 (1.42)  67.7 (1.41)
SMOTE     66.94 (1.34)  68.47 (1.5)   67.07 (1.15)  66.65 (0.81)
ADASYN    66.94 (1.34)  68.62 (1.38)  67.85 (1.6)   66.91 (1.39)

Book (p = 25)
          β1            β2            β3            β4
OSR       60.08 (0.71)  61.05 (0.94)  61.82 (1.12)  64.62 (0.57)
SMOTE     60.08 (0.71)  62.6 (0.73)   60.95 (0.68)  63 (0.8)
ADASYN    60.08 (0.71)  62.33 (0.73)  60.77 (0.85)  63.04 (0.58)

4.4 Undersampling

Regarding the undersampling techniques, we employ a similar experimental set-up as in the previous section: firstly, train SVMs on the undersampled training data with all possible parameter combinations; secondly, choose a suitable parameter combination based on validation set AUC performance; and finally, obtain the AUC performance on the test set. The results are averaged across ten folds.

Before going into detail on the experimental findings, we give a short note on the effect of noise/outliers on SVM performance. Consider the following imaginary example: say that a majority class female is rating a lot of action films and thrillers. We can consider this an outlier, since most of the females are rating romantic or drama films. This specific female has far more in common with the minority class males, who also rate action films and thrillers. Outliers are therefore instances showing odd behaviour. Noise refers to wrongly labelled instances. The effect of noise/outliers on SVM performance can be severe. Indeed, many of the instances in the dataset have a support value (dual variable αi) of 0. Instances contributing to the model (a non-zero support value) are examples close to the borderline (Suykens et al 2002) and noise/outliers (as can be derived from the KKT dual-complementarity conditions21).

21 If αi = 0, then yi(wᵀxi + b) ≥ 1. For noise/outliers, the term yi(wᵀxi + b) is negative, hence αi ≠ 0.


With respect to noise/outliers, the support value can reach a maximal value of C, the chosen regularization parameter in the SVM formulation. It should be clear that lowering the C-value leads to models that are less sensitive to noise/outliers.

The experimental outcomes of the various undersampling processes are shown in Table 4 for the same four previously selected datasets. We refer to Appendix B, Table 12, for results on the entire data repository. When removing only a limited amount of majority class instances (βu = βu2 = 1/4), we observed that in 12 out of 16 datasets, the "Farthest Knn" method outperforms the "Closest Knn" technique. The four remaining datasets show comparable performances22. This finding shows that the "Farthest" method is very suitable in removing majority class noise/outliers, and empirically shows their performance degrading effect. With higher undersampling rates (βu = 1), the "Closest Knn" method achieves higher performances than the "Farthest Knn" technique in general, though the results are less clear. We observed that only 9 out of 16 datasets indicate "Closest" as a clear winner, with 3 tie situations and 4 situations where "Farthest" shows the best results. Intuitively, we would expect to see the "Closest" method perform better, since we are keeping the most informative instances (examples close to the minority class). However, this technique will also emphasize noise/outliers, and this is the main reason why the results are less significant. One can also see this from the fact that the RUS method indicates far better results in comparison to the aforementioned techniques when βu = 1, an observation that is valid for all datasets. The RUS method is efficient in realizing noise/outlier reduction effects, since each SVM is trained with only a fraction of these types of instances. It is remarkable to see that in 9 out of 16 datasets, the RUS method with βu = 1 outperforms the baseline model (βu = 0). In 4 datasets we observed equal performance, and in 3 a loss with respect to the baseline. RUS shows two attractive features: majority class noise/outlier removal, and its ability to put more emphasis on the minority class instances by undersampling the majority class. A general downside of the approach is the information loss, and this effect causes the 3 losses to the baseline. Usually though, the attractive features of RUS dominate the latter effect, and this indicates that it is definitely not necessary to include all majority class training instances in constructing the predictive model. Apparently, there is a high level of redundancy, which can be exploited to construct efficient hypotheses.

In the next paragraph, we compare the cluster-based undersampling (CBU) approach to the random undersampling (RUS) technique. Before proceeding, we should note that Table 12 does not show results on the LST, CRF and Bank datasets, due to the fact that the projected unigraph was too large to fit in memory (larger than 16 GB). A limited number of features (top nodes) does seem to be active for a relatively large number of instances (bottom nodes). All these instances will be connected in the unigraph projection, resulting in relatively high degrees for the bottom nodes. A possible solution to circumvent this problem might be to cut the weakest edges in the projected unigraph.

If we apply a limited amount of undersampling (βu = βu2 = 1/4), the CBU technique outperforms RUS in 8 out of 13 datasets (4 losses and 1 tie with respect to RUS). On the highly imbalanced datasets (p = 1 or p = []), the CBU method wins

22 A tie occurs in the situation where the absolute difference in AUC is smaller than or equal to 0.5.


in 8 out of 8 cases. We can therefore conclude CBU to outperform RUS with low undersampling rates. This indeed shows that handling the within-class imbalance can be beneficial in this environment. If we make a comparison at the highest undersampling rate, βu = βu5 = 1, the results are completely opposite: CBU wins in only 1 case, with 9 losses with respect to RUS and 3 ties. On the highly imbalanced datasets, CBU outperforms RUS for 1 dataset (with 4 losses and 3 ties). It is clear that RUS is the preferred technique in case of high undersampling rates. The reason is that CBU emphasizes the small disjuncts and basically fails to adequately pick up on the common, larger communities. Furthermore, these small disjuncts might actually correspond with majority class noise/outliers, and CBU is focusing on these types of instances. On the overall level, where we consider all undersampling rates (except βu = 0), CBU wins in 5 out of 13 datasets, RUS wins in 4 cases, and there are 4 tie situations (yet RUS did outperform CBU in all 4 cases). Both methods seem to be competitive to one another across a wide range of imbalance levels. If we pay particular attention to the highly imbalanced sets, CBU wins in 5 cases, and there are 3 tie situations on the remaining sets (yet RUS did outperform CBU in all 3 cases). CBU seems to be the method of choice (with respect to RUS) when the datasets show high imbalance levels.

Table 4 Undersampling experiments. Results show the average tenfold AUC test set performance with respect to increasing βu-values (βu = [βu,1, βu,2, βu,3, βu,4, βu,5] = [0, 1/4, 1/2, 3/4, 1]). Optimal performances (boldface) are highlighted for each imbalance ratio. Additionally, standard deviations are shown between brackets. Cl K represents the "Closest Knn" technique, Cl T represents the "Closest tot sim" technique (similar for Far K and Far T, see Section 3.2). Parameter settings are given in Section 4.2.

Mov Th (p = [])
        βu,1        βu,2        βu,3        βu,4        βu,5
RUS     79.77(5.3)  80.32(5.8)  81.57(5.5)  81.86(6.6)  81.26(6.2)
Cl K    79.77(5.3)  79.25(4.5)  78.07(5.0)  76.25(6.5)  62.46(8.5)
Cl T    79.77(5.3)  78.4(4.4)   72.41(3.5)  64.66(4.5)  60.37(7.3)
Far K   79.77(5.3)  84.54(5.0)  83.64(6.4)  80.02(7.3)  56.82(10.3)
Far T   79.77(5.3)  85.03(5.7)  82.68(6.8)  75.61(9.2)  56.77(10.9)
CBU     80.11(5.8)  81.17(6.0)  81.08(6.5)  84.17(5.1)  80.96(6.9)

Yahoo G (p = 25)
        βu,1        βu,2        βu,3        βu,4        βu,5
RUS     78.82(1.4)  78.91(1.6)  78.97(1.6)  78.61(1.6)  77.82(2.1)
Cl K    78.82(1.4)  77.26(1.5)  72.52(1.5)  67.86(2.0)  65.07(2.7)
Cl T    78.82(1.4)  76.83(1.0)  71.99(1.8)  67.15(2.3)  61.1(2.7)
Far K   78.82(1.4)  78.26(2.2)  74.69(2.7)  67.22(2.1)  60.72(2.3)
Far T   78.82(1.4)  77.68(2.6)  72.44(3.0)  64.94(2.4)  59.6(2.0)
CBU     75.25(3.2)  75.22(2.4)  74.69(2.3)  73.07(2.4)  70.69(2.4)

26 Jellis Vanhoeyveld David Martens

Table 4 continued
TaFeng (p = 25)
        βu,1        βu,2        βu,3        βu,4        βu,5
RUS     66.94(1.3)  67.44(1.3)  68.1(1.4)   68.27(1.4)  66.13(1.2)
Cl K    66.94(1.3)  66.13(1.4)  63.39(1.2)  59.83(1.3)  56.94(0.7)
Cl T    66.94(1.3)  66.38(1.5)  62.89(1.6)  57.46(1.3)  54.56(1.3)
Far K   66.94(1.3)  68.06(1.4)  66.43(1.6)  64.46(1.5)  63.35(1.3)
Far T   66.94(1.3)  64.31(1.1)  62.69(1.0)  61.27(1.1)  59.03(1.0)
CBU     64.81(1.2)  64.15(1.1)  64.13(1.2)  63.88(0.8)  63.46(0.8)

Book (p = 25)
        βu,1        βu,2        βu,3        βu,4        βu,5
RUS     60.08(0.7)  60.13(0.6)  60.4(0.8)   60.33(0.8)  63.28(0.8)
Cl K    60.08(0.7)  59.96(0.7)  60.13(0.8)  59.96(1.0)  59.28(0.7)
Cl T    60.08(0.7)  59.96(0.7)  60.13(0.8)  60.29(0.4)  54.5(0.9)
Far K   60.08(0.7)  63.29(1.0)  64.19(0.8)  57.3(1.1)   55.66(1.1)
Far T   60.08(0.7)  62.14(0.5)  58.27(0.6)  56.37(1.0)  55.66(1.1)
CBU     54.82(0.9)  54.67(0.9)  54.71(0.9)  54.66(1.0)  54.78(0.9)

4.5 Boosting variants

In this section the plain AdaBoost (AB) method is compared with the cost-sensitive AdaCost (AC) algorithm, the EasyEnsemble (EE) technique and the baseline (BL) method (where we train a single SVM on the imbalanced training data). AB, AC and EE have an underlying boosting process in common. The AUC-values are stored during each boosting iteration, which allows us to show the performance graphically as a function of the number of iterations t. In EasyEnsemble (EE), we combine the weak learners of each subset by summing their individual contributions. For example, when t = 2 boosting iterations are performed, we evaluate the performance of the combined learner \sum_{s=1}^{S} \sum_{t=1}^{2} \alpha_{s,t} h_{s,t}(x). Figures 1, 2, 3 and 4 show the average tenfold AUC-performance (with µ = 100%) on the test data. The left-side figures show the AB, AC (with varying R-values) and EE (with varying S-values) performance with respect to the number of boosting iterations. The C-value is tuned according to highest validation set AUC-performance (over all possible boosting iterations). The right-side figures investigate the influence of the C-parameter for AB and EE (with S = 15) and allow us to gain insight into the effect of weak and strong learners on the generalization performance. Results on the remaining datasets are given in Appendix C, Figures 6-17. Note that we only indicate results with weight-percentage µ = 100% (use all instances in the training process); previous experiments (with µ = 75%) showed inferior results. This suggests that it is indeed important to include all minority instances during the training of each weak learner, an observation also made by Liu et al (2009).

When investigating the influence of the regularization parameter C in AdaBoost, we can graphically verify that it is a "weakness" indicator, in the sense that lower C-values correspond to weaker learners. While these low C-value learners have a low performance in the first rounds of boosting, the possible boost in performance is usually much higher compared to higher C-value inducers. Strong learners can already make a suitable distinction between minority and majority class behaviours in the first boosting round. The boosting process will emphasize the hard to learn instances, among which are noise/outliers. Therefore, too strong learners are not suitable to be used in a boosting process, as already noted by Wickramaratna et al (2001). In Figures 2(b) and 3(b), we observe that the lowest C-values (C = 10⁻⁷, 10⁻⁵) can outperform higher C-values (C = 10⁻³, 10⁻¹). In many cases the AB-process outperforms the baseline (BL), see also the results in Table 5. A combination of several weak learners can often achieve better generalization than a single strong learner.

Inspired by the success of the RUS method with βu = 1 (for reasons indicated in Section 4.4), the EE-technique tries to overcome an obvious downside of this approach: we are removing a lot of majority class instances that can be very informative. Hence, we try to make up for this loss by combining several of these balanced training sets, in the hope that the majority class is more adequately explored. This results in a very effective learner that seems to outperform the AB, AC and BL methods in the majority of cases, as can be seen from these figures. It is remarkable that we don't need to use that many subsets S to achieve high performances: increasing the S-value from 5 to 15 doesn't seem to cause major increases in AUC (see also Table 5). The influence of the C-parameter is similar as described in the previous paragraph.
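The combined learner described above (S balanced subsets, each boosted with weak linear SVMs whose hypotheses h_{s,t} are pooled with their weights α_{s,t}) can be sketched as follows. The toy data, the C-value and the stopping rules are illustrative assumptions, not the paper's exact configuration.

```python
import numpy as np
from sklearn.svm import LinearSVC

def easy_ensemble(X, y, S=5, T=10, C=1e-3, seed=0):
    """EasyEnsemble sketch: draw S balanced subsets (all minority
    instances plus an equally sized random majority sample) and boost
    weak linear SVMs on each; hypotheses are pooled across subsets."""
    rng = np.random.default_rng(seed)
    pos, neg = np.flatnonzero(y == 1), np.flatnonzero(y == -1)
    ensemble = []                              # (alpha_st, h_st) pairs
    for _ in range(S):
        sub = np.concatenate([pos, rng.choice(neg, size=pos.size, replace=False)])
        Xs, ys = X[sub], y[sub]
        w = np.full(sub.size, 1.0 / sub.size)  # AdaBoost instance weights
        for _ in range(T):
            clf = LinearSVC(C=C).fit(Xs, ys, sample_weight=w * sub.size)
            pred = clf.predict(Xs)
            err = w[pred != ys].sum()
            if err == 0:                       # separable: keep and stop
                ensemble.append((1.0, clf))
                break
            if err >= 0.5:                     # weaker than chance: stop
                break
            alpha = 0.5 * np.log((1 - err) / err)
            ensemble.append((alpha, clf))
            w *= np.exp(-alpha * ys * pred)    # re-weight hard instances
            w /= w.sum()
    return ensemble

def ensemble_score(ensemble, X):
    """Sum alpha_st * h_st(x) over every kept weak learner."""
    return sum(a * clf.decision_function(X) for a, clf in ensemble)

# toy usage: 20 positives vs 200 negatives, well separated
rng = np.random.default_rng(2)
X = np.vstack([rng.normal(2.0, 1.0, (20, 4)), rng.normal(-2.0, 1.0, (200, 4))])
y = np.concatenate([np.ones(20), -np.ones(200)])
scores = ensemble_score(easy_ensemble(X, y, S=3, T=5), X)
```

Note that each subset holds only twice the minority class size, which is what makes the method fast and trivially parallelizable over the S subsets.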

[Figure 1 shows two panels of test-set AUC (y-axis) versus the number of boosting iterations T (x-axis, 0-30). Panel (a) compares AB, AC (R2, R8, RD), EE (S = 5, 10, 15) and BL; panel (b) compares AB and EE at C = 1e-07, 1e-05, 0.001 and 0.1 against BL.]

Fig. 1 Mov G(p = 25) dataset results showing average tenfold AUC-performance on test data (with µ = 100%) for (a) AB, AC and EE with C chosen according to highest validation set AUC-performance (over all possible T); (b) AB and EE(S = 15) with varying C-levels


[Figure 2: test-set AUC versus boosting iterations T; same panels and legends as Fig. 1.]

Fig. 2 Book(p = 25) dataset

[Figure 3: test-set AUC versus boosting iterations T; same panels and legends as Fig. 1.]

Fig. 3 TaFeng(p = 25) dataset

[Figure 4: test-set AUC versus boosting iterations T; same panels and legends as Fig. 1.]

Fig. 4 Bank(p = []) dataset


4.6 Final comparison

4.6.1 Performance-wise comparison

In this section we wish to determine which of the aforementioned methods shows superior performance in terms of AUC on test data. The experimental set-up is described in Section 4.2. Note that we exclude β = 0 and βu = 0 in determining the performance of the oversampling respectively undersampling techniques, to be able to compare them with the baseline (BL) approach23. The results for AB, AC and EE are shown for µ = 100%. The number of boosting iterations t ∈ [0,T] is also considered as a tunable parameter. Table 5 shows the final comparison between all methods. Additionally, we added an average rank column, indicating the mean rank of each algorithm across all datasets. The rank of an algorithm for a specific dataset corresponds to its position in the sorted AUC-performance spectrum, where the best performing algorithm gets the rank of 1, the second best rank 2, etc. In case two or more algorithms perform equally well, average ranks are assigned (if two algorithms have ranks of 3 and 4 respectively and show equal performance, then both methods get assigned a rank of 3.5). Note that the LST dataset is excluded in the calculation of average ranks, since we consider this as a dataset not being affected by imbalance and showing equal performances.
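The ranking scheme just described (best AUC gets rank 1, ties receive average ranks) corresponds to "average" ranking of the negated AUC values. A small sketch with toy numbers:

```python
import numpy as np
from scipy.stats import rankdata

# rows = datasets, columns = methods; higher AUC is better (toy values)
auc = np.array([[0.72, 0.75, 0.76],
                [0.81, 0.84, 0.84],    # tie -> both methods get rank 1.5
                [0.56, 0.60, 0.60]])
# rank per dataset: negate so that the largest AUC receives rank 1
ranks = np.vstack([rankdata(-row, method='average') for row in auc])
avg_rank = ranks.mean(axis=0)          # mean rank per method
```

These per-method mean ranks are exactly the quantities that feed into the Friedman test used below.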

As a side note, we noticed that increasing the number of minority training set instances can significantly improve performances. We can see this clearly by comparing the p = 25 outcomes with the results of p = 1 on the same dataset. Junque de Fortuny et al (2014a) already showed empirically that larger behaviour datasets (in terms of the number of instances or features) lead to higher AUC-performances. This situation is confirmed in our experiments, though we want to add that this is only true when the data size is increased in a balanced fashion.

The results in Table 5 confirm our previous analysis. Each of the oversampling, undersampling (except Cl Knn) and boosting variants seems to improve upon the BL performance. Sobhani et al (2015) note that the direct application of ensemble techniques on imbalanced data does not circumvent the imbalanced learning issue. In our experiments on behaviour data, we do find a simple boosting procedure applied to the imbalanced data to improve upon the predictive performance. The oversampling techniques OSR, SMOTE and ADASYN show similar outcomes and are competitive methods, seeming to outperform the undersampling techniques and the plain boosting (AB) and cost-sensitive learning (AC) approaches. When dealing with traditional data, the performances of oversampling techniques are generally worse than those of undersampling approaches (Drummond and Holte 2003). It is interesting to observe that for these large and sparse behaviour datasets the opposite is true. This is heavily related to the intrinsic properties of the data under consideration. Undersampling (or feature selection, for that matter) is not recommended because, as we have noted in Section 2.1, each feature provides a small additional contribution to the predictive performance. With respect to AC, the results indicate that R = RL is a better choice compared to more arbitrary cost ratios R = 2, 8. The EE-technique has the

23 The BL technique trains single SVMs on the imbalanced training data


lowest average rank and is our best performing method. The combination of several balanced subsets with filtered noise/outlier contributions allows one to explore the majority class space efficiently. To strengthen our previous statements, we will in the next paragraphs conduct statistical significance tests in accordance with the procedure outlined in Demsar (2006). The latter paper presents a framework to perform statistical comparisons of classifiers over multiple datasets.

The first null-hypothesis we try to reject postulates that each of the algorithms performs equally well or, equivalently, that their average ranks R_j (see Table 5) are indifferent. Under the null-hypothesis, the Friedman statistic (Friedman 1937)

\chi_F^2 = \frac{12N}{k(k+1)} \left[ \sum_{j=1}^{k} R_j^2 - \frac{k(k+1)^2}{4} \right],   (6)

with N the number of datasets (N = 15) and k the number of algorithms (k = 13), can be used to compute the statistic (Iman and Davenport 1980)

F_F = \frac{(N-1)\,\chi_F^2}{N(k-1) - \chi_F^2}.   (7)

The latter is distributed according to the F-distribution with k−1 and (k−1)(N−1) degrees of freedom. Evaluating this measure with the average ranks shown in Table 5 results in F_F = 22.98, which is a lot higher than the critical value of 1.81 at the α = 0.05 significance level. We can therefore reject the null-hypothesis and proceed with post-hoc tests.
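Plugging the average ranks of Table 5 into Eqs. (6)-(7) is a one-liner; the snippet below reproduces F_F ≈ 23 (the small deviation from the reported 22.98 comes from the three-decimal rounding of the published ranks):

```python
import numpy as np

# average ranks from Table 5 (k = 13 algorithms over N = 15 datasets)
R = np.array([11.600, 5.000, 4.533, 4.800, 8.167, 12.467, 8.133,
              8.567, 8.267, 8.467, 5.400, 3.267, 2.333])
N, k = 15, len(R)

# Friedman statistic, Eq. (6)
chi2_F = 12 * N / (k * (k + 1)) * ((R ** 2).sum() - k * (k + 1) ** 2 / 4)
# Iman-Davenport correction, Eq. (7)
F_F = (N - 1) * chi2_F / (N * (k - 1) - chi2_F)
```

F_F is then compared against the F-distribution with k−1 = 12 and (k−1)(N−1) = 168 degrees of freedom.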

The Nemenyi test (Nemenyi 1963) allows one to compare each classifier to one another and adjusts the critical value to compensate for making k(k−1)/2 comparisons24. "The performance of two classifiers is significantly different if the corresponding average ranks differ by at least the critical difference (CD)" (Demsar 2006). We refer to the aforementioned paper for the calculation of the CD. The result of these comparisons at the α = 0.05 significance level is shown in Table 6. This test is already quite revealing, in the sense that, for instance, the oversampling methods (OSR, SMOTE and ADASYN), AC (R = RL) and the EE-techniques are significantly different from the BL. Yet it does not indicate that undersampling or applying plain boosting (AB) results in significantly different performances compared to the BL. The Nemenyi test is still conservative and we therefore compare all classifiers to a single control classifier in the next paragraphs. Note that we only make k−1 comparisons in that situation, resulting in stronger results compared to the Nemenyi test.

When comparing a single classifier i with a control classifier c, the null-hypothesis states that the two classifiers i and c perform equally well. Under this hypothesis, the test statistic (based on the average ranks R_i shown in Table 5)

z = \frac{R_i - R_c}{\sqrt{k(k+1)/(6N)}}   (8)

is distributed according to a standard normal distribution. This z-value is used to find the corresponding probability (p-value) p from the cumulative standard normal

24 The larger the number of comparisons, the higher the proportion of null-hypotheses that are wrongfully rejected due to random chance.


Table 5 Final comparison of all methods in terms of AUC test set performance, see Section 4.2 for details on the experimental set-up. β ≠ 0 and βu ≠ 0 for the oversampling respectively undersampling techniques; µ = 100% for each of the boosting variants. Standard deviations are given between brackets and the performance difference with the BL is shown in square brackets.

             Mov G(p = 1)        Mov G(p = 25)       Mov Th(p = [])      Yahoo A(p = 1)
BL           71.6(2.62)[0]       81.41(1.32)[0]      79.77(5.33)[0]      55.92(2.97)[0]
OSR          75.35(2.27)[3.8]    83.76(2.09)[2.3]    85.13(6.1)[5.4]     60.05(2.71)[4.1]
SMOTE        76.16(2.27)[4.6]    83.7(2.1)[2.3]      85.67(4.98)[5.9]    60.1(3.0)[4.2]
ADASYN       76.07(2.26)[4.5]    83.63(2.04)[2.2]    85.65(5.6)[5.9]     59.9(2.99)[4.0]
RUS          72.88(2.73)[1.3]    81.52(2.15)[0.1]    82.91(7.19)[3.1]    57.04(1.77)[1.1]
Cl Knn       71.43(1.36)[-0.2]   80.88(1.19)[-0.5]   78.87(4.71)[-0.9]   55.78(2.71)[-0.1]
Far Knn      71.9(2.95)[0.3]     80.9(1.48)[-0.5]    84.07(4.64)[4.3]    57.2(1.33)[1.3]
CBU          74.17(2.36)[2.6]    81.51(1.04)[0.1]    82.76(7.22)[3.0]    58.77(3.43)[2.8]
AB           71.65(1.73)[0.1]    84.52(1.89)[3.1]    82.43(5.18)[2.7]    58.35(2.62)[2.4]
AC(R = 2,8)  71.61(2.46)[0.0]    83.46(1.82)[2.0]    83.27(5.6)[3.5]     57.72(2.47)[1.8]
AC(R = RL)   74.65(2.7)[3.1]     83.35(2.09)[1.9]    85.41(4.49)[5.6]    59.47(2.33)[3.5]
EE(S = 10)   76.04(2.66)[4.4]    85.05(1.85)[3.6]    86.1(5.78)[6.3]     59.66(3.13)[3.7]
EE(S = 15)   76.12(2.88)[4.5]    85.14(1.86)[3.7]    86.42(5.86)[6.7]    59.76(2.93)[3.8]

             Yahoo A(p = 25)     Yahoo G(p = 1)      Yahoo G(p = 25)     TaFeng(p = 1)
BL           61.68(2.42)[0]      66.84(3.66)[0]      78.82(1.39)[0]      55.75(1.6)[0]
OSR          64.59(3.12)[2.9]    73.08(2.96)[6.2]    78.52(2.01)[-0.3]   61.21(2.24)[5.5]
SMOTE        65.56(3.33)[3.9]    73.11(3.12)[6.3]    79.01(1.21)[0.2]    61.72(1.81)[6.0]
ADASYN       65.13(3.38)[3.4]    73.22(3.17)[6.4]    79.74(1.68)[0.9]    61.68(1.86)[5.9]
RUS          64.11(2.8)[2.4]     70.65(3.39)[3.8]    78.91(1.55)[0.1]    59.25(2.18)[3.5]
Cl Knn       61.14(2.13)[-0.5]   66.34(3.54)[-0.5]   77.26(1.46)[-1.6]   55.77(1.28)[0.0]
Far Knn      63.96(3.03)[2.3]    66.97(3.54)[0.1]    78.26(2.2)[-0.6]    59.98(1.26)[4.2]
CBU          62.27(1.79)[0.6]    71.27(2.89)[4.4]    75.22(2.42)[-3.6]   58.4(1.57)[2.6]
AB           63.88(2.67)[2.2]    68.9(2.03)[2.1]     79.01(1.66)[0.2]    56.21(1.79)[0.5]
AC(R = 2,8)  64.32(3.56)[2.6]    68.89(3.11)[2.0]    78.99(1.89)[0.2]    56.33(1.83)[0.6]
AC(R = RL)   64.31(3.03)[2.6]    73.13(2.8)[6.3]     78.41(2.0)[-0.4]    61.6(2.26)[5.9]
EE(S = 10)   66.51(3.24)[4.8]    72.61(3.15)[5.8]    80.52(1.6)[1.7]     61.2(1.82)[5.4]
EE(S = 15)   66.36(3.18)[4.7]    73.48(2.32)[6.6]    80.54(1.56)[1.7]    61.13(1.83)[5.4]

             TaFeng(p = 25)      Book(p = 1)         Book(p = 25)        LST(p = 1)
BL           66.94(1.34)[0]      52.6(1.29)[0]       60.08(0.71)[0]      99.99(0.01)[0]
OSR          68.77(1.23)[1.8]    55.87(1.42)[3.3]    64.62(0.57)[4.5]    99.99(0.01)[0]
SMOTE        68.47(1.5)[1.5]     55.07(0.88)[2.5]    62.96(0.82)[2.9]    99.99(0.01)[0]
ADASYN       68.48(1.47)[1.5]    55.04(0.91)[2.4]    63.02(0.57)[2.9]    99.99(0.01)[0]
RUS          68.28(1.39)[1.3]    54.26(0.92)[1.7]    63.28(0.8)[3.2]     99.98(0.01)[0]
Cl Knn       66.13(1.43)[-0.8]   52.69(1.3)[0.1]     60.02(0.79)[-0.1]   99.99(0.01)[0]
Far Knn      68.06(1.41)[1.1]    56.25(1.52)[3.7]    64.15(1.12)[4.1]    99.98(0.01)[0]
CBU          63.84(1.07)[-3.1]   53.75(1.01)[1.2]    54.68(0.88)[-5.4]   []
AB           67.65(1.55)[0.7]    54.27(1.95)[1.7]    65.0(0.67)[4.9]     99.99(0.01)[0]
AC(R = 2,8)  69.31(1.23)[2.4]    53.72(1.0)[1.1]     61.24(0.8)[1.2]     99.98(0.01)[0]
AC(R = RL)   67.15(1.51)[0.2]    55.73(1.22)[3.1]    64.6(0.64)[4.5]     99.99(0.01)[0]
EE(S = 10)   70.3(1.35)[3.4]     55.09(1.29)[2.5]    65.37(0.61)[5.3]    99.98(0.01)[0]
EE(S = 15)   70.4(1.3)[3.5]      55.35(1.26)[2.8]    65.4(0.51)[5.3]     99.98(0.01)[0]


Table 5 Continued. Additionally, an average rank column is added, showing the mean rank of each algorithm across all datasets. Note that we excluded the LST dataset for this purpose.

             Adver(p = [])       Adver(p = 1)        CRF(p = [])          Bank(p = [])
BL           96.61(1.82)[0]      90.93(3.02)[0]      64.06(16.43)[0]      66.82(0.88)[0]
OSR          96.93(1.91)[0.3]    93.3(2.02)[2.4]     80.74(12.93)[16.7]   71.39(0.79)[4.6]
SMOTE        97.05(1.66)[0.4]    93.35(2.01)[2.4]    78.7(16.56)[14.6]    []
ADASYN       96.91(1.95)[0.3]    93.46(2.21)[2.5]    78.87(16.71)[14.8]   []
RUS          96.81(1.87)[0.2]    92.38(2.51)[1.5]    83.98(5.99)[19.9]    69.41(1.19)[2.6]
Cl Knn       96.4(1.48)[-0.2]    89.73(3.42)[-1.2]   76.63(16.19)[12.6]   66.17(0.72)[-0.6]
Far Knn      95.77(1.81)[-0.8]   93.88(1.78)[3.0]    83.75(13.11)[19.7]   66.95(0.56)[0.1]
CBU          97.15(1.88)[0.5]    94.18(2.3)[3.3]     []                   []
AB           97.34(2.18)[0.7]    91.39(3.23)[0.5]    77.62(15.15)[13.6]   66.82(0.88)[0]
AC(R = 2,8)  97.44(1.93)[0.8]    91.0(3.35)[0.1]     68.31(14.93)[4.2]    67.67(0.71)[0.9]
AC(R = RL)   97.46(1.71)[0.8]    93.51(2.17)[2.6]    85.08(9.77)[21.0]    70.7(0.8)[3.9]
EE(S = 10)   97.64(1.35)[1.0]    92.97(2.75)[2.0]    86.18(10.17)[22.1]   71.46(0.81)[4.6]
EE(S = 15)   97.63(1.35)[1.0]    93.3(2.14)[2.4]     86.35(9.99)[22.3]    71.54(0.76)[4.7]

             Average Rank
BL           11.600
OSR          5.000
SMOTE        4.533
ADASYN       4.800
RUS          8.167
Cl Knn       12.467
Far Knn      8.133
CBU          8.567
AB           8.267
AC(R = 2,8)  8.467
AC(R = RL)   5.400
EE(S = 10)   3.267
EE(S = 15)   2.333

Table 6 Nemenyi test outcomes at the α = 0.05 significance level. A value of 1 indicates the null-hypothesis is rejected and thus finds the two algorithms significantly different (a value of 0 means we accept the null-hypothesis). Methods are listed in the same order as Table 5 and are represented more concisely.

BL RO SM AD RU Cl Fa CBU AB AC1 AC2 EE1 EE2

BL  0 1 1 1 0 0 0 0 0 0 1 1 1
RO  1 0 0 0 0 1 0 0 0 0 0 0 0
SM  1 0 0 0 0 1 0 0 0 0 0 0 0
AD  1 0 0 0 0 1 0 0 0 0 0 0 0
RU  0 0 0 0 0 0 0 0 0 0 0 1 1
Cl  0 1 1 1 0 0 0 0 0 0 1 1 1
Fa  0 0 0 0 0 0 0 0 0 0 0 1 1
CBU 0 0 0 0 0 0 0 0 0 0 0 1 1
AB  0 0 0 0 0 0 0 0 0 0 0 1 1
AC1 0 0 0 0 0 0 0 0 0 0 0 1 1
AC2 1 0 0 0 0 1 0 0 0 0 0 0 0
EE1 1 0 0 0 1 1 1 1 1 1 0 0 0
EE2 1 0 0 0 1 1 1 1 1 1 0 0 0


distribution function: p = 2 × min(Φ(z), 1 − Φ(z)). Holm's method (Holm 1979) compares k−1 classifiers to a single control classifier and sorts the corresponding p-values in ascending order, so that p_1 ≤ p_2 ≤ … ≤ p_{k−1}. Each p_i is subsequently compared to its associated confidence level25 α_comp = α/(k−i). Holm starts with performing the check p_1 < α/(k−1) and rejects the null-hypothesis if the test passes. It then proceeds with a similar check for i = 2 and continues until a certain null-hypothesis cannot be rejected. The remaining hypotheses are retained.
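The step-down procedure is easy to express in code. Applied to the twelve sorted p-values of the comparisons against BL (Table 7), it rejects exactly the first six null-hypotheses:

```python
import numpy as np

def holm_rejections(pvals, alpha=0.05, k=13):
    """Holm step-down: compare the i-th smallest p-value with
    alpha/(k-i) and stop at the first failure (k-1 comparisons)."""
    rejected = 0
    for i, p in enumerate(np.sort(np.asarray(pvals)), start=1):
        if p < alpha / (k - i):
            rejected += 1
        else:
            break
    return rejected

# sorted p-values of the comparisons against the BL control (Table 7)
p_bl = [7.2e-11, 4.63e-9, 6.72e-7, 1.74e-6, 3.46e-6, 1.3e-5,
        0.014777, 0.015763, 0.019076, 0.027567, 0.032919, 0.542227]
n_rejected = holm_rejections(p_bl)
```

The first failure occurs at Far Knn (p = 0.014777 against α_comp = 0.008333), so everything from Far Knn onward is retained, matching the significance column of Table 7.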

Table 7 shows the result of the Holm test at the α = 0.05 significance level, with the BL algorithm as control classifier. The table lists each method according to sorted p-values and shows the z-, p- and α_comp-values. The significance column indicates whether the proposed method is significantly different from the BL, and coincidentally matches with the result of the Nemenyi test. The p-values of the (Far Knn, RUS, CBU) undersampling methods, AB and AC (R = 2, 8) are only slightly higher than the level required to reject the null-hypothesis. We included a critical significance level α_crit, corresponding to the lowest possible significance level upon which the method would be considered as significantly different from the BL (if α = α_crit then p = α_comp). In the case of AB, for example, if we were to choose a significance level α = 0.0764 (a confidence level of 92.36%), the associated p-value would be smaller than α_comp and we would proceed to conclude26 that AB performs significantly different from the BL. To summarize: the null-hypothesis of equal performance between the BL and an alternative method proposed in Section 3 (except Cl Knn) is rejected with a confidence level of more than 91% according to Holm's test. Hence, the performance of the BL approach is significantly worse than that of each of the proposed over- and undersampling methods (except Cl Knn), cost-sensitive learning techniques (AC) and boosting variants (AB, EE).

Similar to the previous paragraph, Table 8 presents the results of Holm's test at the α = 0.05 significance level, with EE(S = 15) as control classifier. The results coincidentally correspond with the result of the Nemenyi test and indicate that the oversampling (OSR, SMOTE and ADASYN), AC(R = RL) and EE(S = 10) methods are not statistically different from the EE(S = 15) technique at the 0.05 significance level. However, if we were to adapt the significance level to approximately α = 0.25 (a 75% confidence level), the results would conclude significance (except for EE(S = 10)). To summarize: the null-hypothesis of equal performance between EE(S = 15) and an alternative method proposed in Section 3 (except EE(S = 10)) is rejected with a confidence level of more than 75% according to Holm's test. Hence, the performance of the oversampling, undersampling, AB and AC approaches is significantly worse than that of EE(S = 15). We can therefore safely conclude EE to be the preferred method in terms of AUC performance when dealing with imbalanced behaviour data.

25 α_comp adjusts the value of α to compensate for multiple comparisons.
26 This is not entirely true, since Holm's method requires Far Knn to pass the test first before proceeding, which only occurs at the 0.088662 significance level.


Table 7 Holm test at the α = 0.05 significance level with BL reference. The table shows the z test statistic with associated p-value, and α_comp = α/(k−i), with k the number of algorithms (k = 13) and i corresponding to the position of the method in the sorted p-value vector. Since we already ranked the p-values in ascending order, i takes on the row number (e.g. OSR has i = 5). The column "significant" denotes whether we can reject the null-hypothesis (significant: p < α_comp). α_crit corresponds with the smallest possible significance level where we would decide to reject the null-hypothesis (α_crit = α · p/α_comp).

             z          p          α_comp     significant  α_crit
EE(S = 15)   -6.51642   7.2E-11    0.004167   1            8.64E-10
EE(S = 10)   -5.86009   4.63E-09   0.004545   1            5.09E-08
SMOTE        -4.96936   6.72E-07   0.005      1            6.72E-06
ADASYN       -4.78183   1.74E-06   0.005556   1            1.56E-05
OSR          -4.64119   3.46E-06   0.00625    1            2.77E-05
AC(R = RL)   -4.35991   1.3E-05    0.007143   1            9.11E-05
Far Knn      -2.4378    0.014777   0.008333   0            0.088662
RUS          -2.41436   0.015763   0.01       0            0.078815
AB           -2.34404   0.019076   0.0125     0            0.076305
AC(R = 2,8)  -2.20339   0.027567   0.016667   0            0.082701
CBU          -2.13307   0.032919   0.025      0            0.065837
Cl Knn       0.609449   0.542227   0.05       0            0.542227

Table 8 Holm test at the α = 0.05 significance level with EE(S = 15) reference

             z          p          α_comp     significant  α_crit
Cl Knn       7.12587    1.03E-12   0.004167   1            1.24E-11
BL           6.516421   7.2E-11    0.004545   1            7.92E-10
CBU          4.383348   1.17E-05   0.005      1            0.000117
AC(R = 2,8)  4.313027   1.61E-05   0.005556   1            0.000145
AB           4.172384   3.01E-05   0.00625    1            0.000241
RUS          4.102063   4.09E-05   0.007143   1            0.000287
Far Knn      4.078623   4.53E-05   0.008333   1            0.000272
AC(R = RL)   2.156513   0.031044   0.01       0            0.155218
OSR          1.875229   0.060761   0.0125     0            0.243045
ADASYN       1.734587   0.082814   0.016667   0            0.248442
SMOTE        1.547064   0.121848   0.025      0            0.243696
EE(S = 10)   0.65633    0.511612   0.05       0            0.511612

4.6.2 Timing-wise comparison

The performance of each of the proposed algorithms in terms of training times depends highly upon the chosen parameters. The β and βu parameters control the number of training instances and are an important contribution to the subsequent SVM learning times for each of the over- and undersampling techniques. The number of boosting iterations t ∈ [0,T] is also a major factor in each of the boosting variants proposed in Section 3.3. Furthermore, the chosen regularization parameter C highly influences the training times of the SVM inducer: higher C-values result in larger training times (Fan et al 2008). Not only do the chosen hyperparameters affect learning times; data characteristics such as the number of minority instances, the imbalance level, the number of features, the amount of overlapping, etc. have a major effect as well.


In comparing each of the methods outlined in Section 3, we make use of a similar methodology as previously presented (see Section 4.6.1). For each of the ten folds, a suitable parameter combination is selected based on optimal validation set AUC-performance. These chosen parameters are applied in determining the timings of each method. The final results indicate the average timings across ten folds. Hence, we show the required amount of time for each method when using the parameter combination resulting in the best performance. Note that this might for instance result in using β = 1/3 in case of OSR, or using only t = 5 boosting iterations in EE. Computational timings are shown in Table 9.

As one can observe from the aforementioned table, CBU is the slowest method, as it relies on computationally intensive clustering procedures. The synthetic SMOTE and ADASYN oversampling approaches are very time consuming methods as well: they both rely on computationally expensive nearest neighbour computations, in conjunction with a synthetic sample generation procedure that requires a single pass over all minority training instances.
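To make that cost structure concrete, a minimal dense-data SMOTE sketch is shown below (the neighbour count, sampling and interpolation choices are illustrative; practical implementations on sparse behaviour data need sparse-aware neighbour search):

```python
import numpy as np
from sklearn.neighbors import NearestNeighbors

def smote(X_min, n_new, k=5, seed=0):
    """Minimal SMOTE sketch: each synthetic point is a random
    interpolation between a minority instance and one of its k
    nearest minority neighbours."""
    rng = np.random.default_rng(seed)
    nn = NearestNeighbors(n_neighbors=k + 1).fit(X_min)  # the costly part
    _, nbrs = nn.kneighbors(X_min)        # column 0 is the point itself
    base = rng.integers(0, len(X_min), size=n_new)
    picked = nbrs[base, rng.integers(1, k + 1, size=n_new)]
    gap = rng.random((n_new, 1))          # interpolation factor in [0, 1)
    return X_min[base] + gap * (X_min[picked] - X_min[base])

# toy usage: double-and-more the minority class
X_min = np.random.default_rng(3).normal(size=(30, 4))
X_syn = smote(X_min, n_new=60)
```

Both the k-nearest-neighbour fit and the per-sample interpolation pass over minority instances are visible here, which is exactly where the timing overhead reported in Table 9 originates.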

The RUS method shows, as expected, superior results in terms of computational timings. Other undersampling methods perform quite well, though they are slower than BL and OSR due to their inherent nearest neighbour computations.

The boosting variants seem to perform worse than the BL, OSR and each of the undersampling methods. Since EE was the best method in terms of AUC-performance (see Section 4.6.1), we focus on this learner in this paragraph. For the medium sized datasets (all but CRF and Bank), OSR outperforms EE(S = 15). Yet for the large datasets (CRF and Bank), the opposite is true. A SVM is known to have a quadratic computational complexity in the number of training instances (Suykens et al 2002; García and Lozano 2007). For this reason, we consider OSR (or any other oversampling technique) as inappropriate for large datasets. Since with EE each subset has a size limited to twice the minority class training size, we expect increasing benefits of the EE-technique for larger and more imbalanced behaviour data. As Junque de Fortuny et al (2014a) have observed, larger behaviour data in terms of the number of instances or features also contributes to an increase in predictive power. Hence, we expect to see this type of big behaviour data more often in real-life applications, where computational resources can become a major constraint. Because the EE-technique is easily parallelizable, by simultaneously performing the boosting process for each of the independent subsets, we added an extra row in Table 9: EE par corresponds with the time needed for EE(S = 15) divided by 15, indicating the time that would be required for EE if each subset were trained in parallel. We can see that EE par can outperform OSR in 8 out of 16 cases, precisely for the datasets that show a high imbalance level (e.g. p = 1 or p = []).

We conclude our comparison of the proposed methods in Figure 5, which shows the average ranks in terms of computational timings (see Table 9) and AUC performance for each of the available data sources. Regarding the ranks with respect to the latter, we added an extra row EE par, having the same AUC-performance as EE(S = 15) for each dataset in Table 5. Ranks are subsequently determined according to the procedure outlined in Section 4.6.1. Additionally, Figure 18 in Appendix D shows the result for the large and highly imbalanced CRF and Bank datasets. We expect the latter type of data to be of major interest for reasons already indicated.


Table 9 Computational timings (in seconds) averaged across ten folds, using the parameter combination resulting in the highest validation set AUC-performance. First and second best performances are emphasized in boldface (third and fourth best are underlined). EE par is the time required for EE(S = 15) divided by 15.

             Mov G(p = 1)   Mov G(p = 25)   Mov Th(p = [])   Yahoo A(p = 1)
BL           0.032889       0.056697        0.558563         0.026922
OSR          0.055043       0.062802        0.99009          0.044421
SMOTE        0.218821       0.937057        3.841482         0.057726
ADASYN       0.284688       1.802399        5.191265         0.087694
RUS          0.011431       0.025383        0.155224         0.007991
Cl Knn       0.046599       0.599846        0.989914         0.037182
Far Knn      0.039887       0.80072         0.683023         0.027788
CBU          1.034111       1.060173        6.822839         1.692477
AB           0.169792       0.841443        3.460246         0.139251
AC(R = 2,8)  0.471994       2.996585        1.086907         0.366555
AC(R = RL)   0.53376        1.179542        6.065177         0.209015
EE(S = 10)   0.117226       6.065145        1.17995          0.148973
EE(S = 15)   0.20474        7.173737        2.119991         0.180365
EE par       0.013649       0.478249        0.141333         0.012024

             Yahoo A(p = 25)  Yahoo G(p = 1)  Yahoo G(p = 25)  TaFeng(p = 1)
BL           0.092954         0.011915        0.044164         0.026728
OSR          0.027887         0.013241        0.047206         0.040919
SMOTE        1.062686         0.056153        0.883698         0.219553
ADASYN       2.050993         0.079073        1.733367         0.306618
RUS          0.048471         0.003234        0.033423         0.002916
Cl Knn       0.84391          0.025404        0.502515         0.092167
Far Knn      0.664124         0.026576        0.500206         0.080159
CBU          1.569442         1.287221        1.355035         2.467279
AB           0.445546         0.078777        0.169977         0.114619
AC(R = 2,8)  1.034044         0.321723        0.515953         0.926178
AC(R = RL)   0.706215         0.226741        0.112949         0.610233
EE(S = 10)   1.026577         0.100331        1.527146         0.058052
EE(S = 15)   1.607596         0.077483        2.472582         0.10538
EE par       0.107173         0.005166        0.164839         0.007025

             TaFeng(p = 25)  Book(p = 1)   Book(p = 25)   LST(p = 1)
BL           0.032033        0.080035      0.318093       0.652045
OSR          0.032414        0.132927      0.092757       0.87152
SMOTE        5.089283        3.409418      11.43444       4.987705
ADASYN       8.148419        3.689661      12.25441       6.840083
RUS          0.020457        0.022713      0.031972       0.432839
Cl Knn       1.713731        0.400873      37.11648       2.508374
Far Knn      1.539437        0.379086      39.88552       2.511037
CBU          2.642686        4.198663      46.31987       []
AB           0.713265        0.61719       12.38585       2.466151
AC(R = 2,8)  1.234647        1.666131      23.30635       1.451671
AC(R = RL)   0.279047        0.860346      0.197053       1.23763
EE(S = 10)   2.484502        2.145747      7.177484       0.524066
EE(S = 15)   3.363971        2.480066      11.21945       0.784111
EE par       0.224265        0.165338      0.747963       0.052274


Table 9 Continued. Additionally, an extra column indicating average ranks is included (lower ranks are preferred).

             Adver(p = [])  Adver(p = 1)  CRF(p = [])  Bank(p = [])
BL           0.010953       0.002796      0.725911     70.89334
OSR          0.012178       0.006166      3.685813     179.7481
SMOTE        0.123112       0.017764      5.633862     []
ADASYN       0.183767       0.021728      5.768669     []
RUS          0.012115       0.00204       0.147392     52.47441
Cl Knn       0.061324       0.005568      1.106755     73.73282
Far Knn      0.079078       0.007069      1.110379     97.59619
CBU          3.378235       3.236754      []           []
AB           0.069199       0.103518      1.153196     83.08618
AC(R = 2,8)  0.193092       0.068905      2.047434     71.70548
AC(R = RL)   0.107652       0.037963      1.387174     106.3466
EE(S = 10)   0.138485       0.085686      0.198656     24.95117
EE(S = 15)   0.185136       0.139121      0.285345     36.40107
EE par       0.012342       0.009275      0.019023     2.426738

             Average Rank [pos]
BL           2.94 [2]
OSR          4.19 [4]
SMOTE        9.59 [11]
ADASYN       10.91 [13]
RUS          1.38 [1]
Cl Knn       6.5 [5]
Far Knn      6.56 [6]
CBU          14 [14]
AB           8.06 [7]
AC(R = 2,8)  10.81 [12]
AC(R = RL)   9.25 [9]
EE(S = 10)   8.25 [8]
EE(S = 15)   9.56 [10]
EE par       3 [3]

Timings are a substantial concern with respect to these large datasets. As can be observed, the EE-technique thrives for these big and highly imbalanced data, even in its non-parallel form.


[Figure 5 is a scatter plot of average rank AUC (x-axis, 0-14) versus average rank Time (y-axis, 0-18) for BL, OSR, SMOTE, ADASYN, RUS, Cl Knn, Far Knn, CBU, AB, AC(R = 2,8), AC(R = RL), EE(S = 10), EE(S = 15) and EE par.]

Fig. 5 Average rank AUC versus average rank Time (see Table 9) across all datasets from Table 2 (excluding Flickr and Kdd). Regarding the AUC rank, an extra row EE par is added, having the same AUC-value as EE(S = 15). Ranks for each dataset are subsequently obtained via the procedure outlined in Section 4.6.1. Points occurring in the upper-right region are preferred.

5 The effect of the chosen base learner

In this short section we investigate the influence of the chosen base learner. More specifically, we ask ourselves whether some base inducers can more easily cope with the imbalanced learning issue, and in doing so we verify the conclusion that EE is a suitable technique to improve upon the baseline performance. We rely on several popular classification methods for behaviour data: two regularization based techniques (SVM, LR) and two heuristic types of approaches (Naive Bayes (NB), Behavioural Similarity (BeSim)). The next paragraphs briefly introduce these methods. Note that source codes and dataset results are provided in our on-line repository.

LR constructs a linear model that maximizes the likelihood of the observed data by imposing a logistic model for the parametric probability model P(Y|X). Ng and Jordan (2002) provide a thorough introduction to the topic27 and note that in its plain form LR can suffer from overfitting, especially when the input data is very high-dimensional. Therefore, we resort to regularized logistic regression. We opt for L2-regularized LR, since L1-regularization corresponds with a natural way of imposing feature selection (Ng 2004). The LIBLINEAR toolbox (Fan et al 2008) is used to obtain these L2-regularized LR models.
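As an illustration of the kind of model involved, the following is a minimal numpy sketch of L2-regularized logistic regression fitted by plain gradient descent. The paper obtains these models with LIBLINEAR; the toy data, step size and iteration count below are illustrative assumptions only.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def fit_l2_logreg(X, y, C=10.0, lr=0.05, n_iter=3000):
    """Minimize 0.5*||w||^2 + C * sum_i log(1 + exp(-y_i * w.x_i)), y_i in {-1,+1}."""
    w = np.zeros(X.shape[1])
    for _ in range(n_iter):
        margins = y * (X @ w)
        # gradient of the regularizer plus the logistic loss term
        grad = w - C * (X.T @ (y * sigmoid(-margins)))
        w -= lr * grad
    return w

# Tiny toy "behaviour" matrix: binary features, labels in {-1,+1}
X = np.array([[1, 0, 1], [1, 1, 0], [0, 1, 1], [0, 0, 1]], dtype=float)
y = np.array([1, 1, -1, -1])
w = fit_l2_logreg(X, y)
probs = sigmoid(X @ w)  # probability scores, usable for AUC ranking
```

Larger C weighs the data-fit term more heavily and yields a stronger (less regularized) learner, which is the flexibility referred to later in this section.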

27 For an accessible introduction, see the chapter on "Generative and Discriminative Classifiers: Naive Bayes and Logistic Regression" provided at the following link: http://www.cs.cmu.edu/~tom/NewChapters.html

Imbalanced classification in sparse and large behaviour datasets 39

NB (Ng and Jordan 2002) relies on the use of Bayes' rule, which expresses the posterior P(Y|X) as the product of the likelihood P(X|Y) and the prior P(Y), divided by the evidence P(X). The naivety of the approach originates from assuming conditional independence of the features. We impose a multivariate event model regarding the distribution P(X|Y). Junque de Fortuny et al (2014a) provide an efficient NB implementation for dealing with large and sparse binary behaviour data.
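A minimal sketch of such a multivariate (Bernoulli) event model with Laplace smoothing is given below, in plain numpy. The implementation of Junque de Fortuny et al (2014a) is optimized for sparse data; this dense toy version is purely illustrative.

```python
import numpy as np

def nb_fit(X, y):
    """Fit class priors and Laplace-smoothed Bernoulli feature probabilities."""
    priors, cond = {}, {}
    for c in np.unique(y):
        Xc = X[y == c]
        priors[c] = np.log(len(Xc) / len(X))
        # smoothed estimate of P(feature = 1 | class = c)
        cond[c] = (Xc.sum(axis=0) + 1.0) / (len(Xc) + 2.0)
    return priors, cond

def nb_predict(X, priors, cond):
    classes = sorted(priors)
    scores = []
    for c in classes:
        p = cond[c]
        # log P(c) + sum over features of log P(x_j | c), x_j binary
        scores.append(priors[c] + X @ np.log(p) + (1 - X) @ np.log(1 - p))
    return np.array(classes)[np.argmax(np.vstack(scores), axis=0)]

X = np.array([[1, 1, 0], [1, 0, 0], [0, 1, 1], [0, 0, 1]], dtype=float)
y = np.array([1, 1, 0, 0])
priors, cond = nb_fit(X, y)
pred = nb_predict(X, priors, cond)
```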

The BeSim approach conceptually corresponds to projecting the bigraph to a unigraph using the procedure outlined in Section 3.2. Next, the weighted-vote relational neighbour classifier (Macskassy and Provost 2007) is used to infer unknown labels/scores through a weighted probability estimation using the known labels of neighbouring nodes. A very fast and scalable implementation of this approach (the SW-transformation) is presented in detail in Stankova et al (2015).
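The underlying weighted-vote idea can be sketched as follows on a dense toy example; the SW-transformation of Stankova et al (2015) produces this kind of score without materializing the projected unigraph, which is what makes it scale.

```python
import numpy as np

# rows: users, cols: behaviours (e.g. merchants visited); toy data
A = np.array([
    [1, 1, 0],
    [1, 0, 1],
    [0, 1, 1],
    [1, 1, 0],   # unlabelled user, identical behaviour to user 0
], dtype=float)
labels = np.array([1.0, 1.0, 0.0, np.nan])  # nan = unknown label

known = ~np.isnan(labels)
S = A @ A.T                 # unigraph projection: co-occurrence weights
np.fill_diagonal(S, 0.0)    # a node does not vote for itself

# weighted vote over labelled neighbours
votes = S[:, known] @ labels[known]
weight = S[:, known].sum(axis=1)
scores = np.divide(votes, weight, out=np.full(len(A), 0.5), where=weight > 0)
```

User 3 shares two behaviours with the positively labelled user 0 and one each with users 1 and 2, so its score leans towards the positive class.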

In Table 10 we compare the performances of the baseline (BL) application of the methods with their EE-version (with S = 15, T = 20). It is quite striking to see the EE-version dominating the baseline version when looking at each method individually. This suggests that the imbalanced learning issue has an impact on each of the aforementioned techniques. Focussing on the regularization-based approaches, we observe the baseline LR to be less influenced by the imbalance than the SVM method. Their EE-versions (EE SVM and EE LR) can boost the baseline performance in almost all cases and have the highest average ranks amongst all the proposed methods. As already emphasized, the regularization-based approaches offer an added element of flexibility in the sense that the strength of the underlying base learner can be controlled.

The heuristic approaches NB and BeSim do not offer this flexibility. Furthermore, it is not directly possible to integrate instance weights during the training phase of the base learner. During the boosting process, we therefore sample from the distribution Dt to generate (unweighted) examples and subsequently train a BeSim/NB learner. The results in Table 10 indicate BL NB to be the worst performing algorithm amongst all methods tried. Due to the conditional independence assumption, one could regard this as a weak learner, which makes it suitable for use in a boosting process. Indeed, the EE NB approach is able to significantly improve upon its performance. We found this to be the slowest EE-version28 (e.g. taking more than 24 hours to process a single fold of the Bank dataset, versus several minutes for the other versions). The average rank estimate of EE NB is rather pessimistic, since we assigned the lowest performance across all methods in case of missing results. BeSim can be considered a stronger learner than NB; its possible boost in performance is typically lower than that of NB. Having said this, we found EE BeSim to improve upon BL BeSim for all datasets (except Adver(p = 1)).
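For base learners that cannot handle instance weights directly, resampling according to the boosting distribution Dt amounts to a simple weighted draw; a minimal sketch with illustrative weights:

```python
import numpy as np

rng = np.random.default_rng(0)
D_t = np.array([0.1, 0.4, 0.2, 0.3])       # current boosting weights, sum to 1

# draw an unweighted sample whose composition approximates D_t;
# a BeSim/NB learner can then be trained on X[idx] as usual
idx = rng.choice(len(D_t), size=1000, p=D_t)
```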

28 This statement only relates to the EE-version. In the baseline case, the timings of the NB-version are roughly of the same order of magnitude as SVM and LR (though this heavily depends on the chosen regularization parameter C). With EE, each subset has a size limited to twice the minority class training size (a relatively low number compared to the size of the majority class). This works in favour of SVM/LR (e.g. SVM has a computational complexity of O(m^2), with m the number of training instances). Also note that the NB-version is indeed optimized for sparse data. Internally, the NB-implementation writes a fairly large model file to a specified directory (sizes of 1 GB can be encountered there). This writing operation slows down the training stage (especially in case of EE, where S × T = 15 × 20 = 300 model files are constructed).


Table 10 Average tenfold AUC test set performance of the BL compared to its EE-version (S = 15, T = 20) for SVM, LR, BeSim and NB. In case the EE NB implementation crashed, we put a × in the table. If it is too slow (> 24 hours per fold), we place a []. Additionally, average ranks are assigned to each method. In case EE NB has missing results for a particular dataset, we ensure it has the lowest rank across all other algorithms

Mov G(p = 1) Mov G(p = 25) Mov Th(p = []) Yahoo A(p = 1)

BL SVM 716 (262) 8141 (132) 7977 (533) 5649 (337)
EE SVM 7612 (288) 8513 (186) 8643 (586) 5974 (296)
BL LR 7102 (209) 8439 (184) 8314 (417) 5784 (239)
EE LR 7669 (292) 8503 (198) 863 (537) 5979 (262)

BL BeSim 761 (358) 813 (292) 8281 (66) 5627 (273)
EE BeSim 7631 (371) 8137 (29) 8502 (628) 577 (171)

BL NB 7026 (584) 7701 (254) 7048 (1014) 5256 (209)
EE NB 7593 (283) 8556 (201) 8691 (415) 5755 (273)

Yahoo A(p = 25) Yahoo G(p = 1) Yahoo G(p = 25) TaFeng(p = 1)

BL SVM 6161 (248) 6684 (366) 7882 (139) 5575 (16)
EE SVM 6638 (316) 7348 (232) 8055 (155) 6113 (183)
BL LR 6627 (296) 6982 (193) 8045 (159) 5891 (231)
EE LR 6622 (328) 7308 (214) 8053 (156) 6143 (232)

BL BeSim 6454 (202) 6889 (249) 7955 (196) 5789 (118)
EE BeSim 6525 (223) 7118 (291) 8004 (185) 5936 (147)

BL NB 65 (165) 6333 (256) 7889 (164) 5461 (12)
EE NB 666 (279) 7099 (288) 8101 (13) 5901 (184)

TaFeng(p = 25) Book(p = 1) Book(p = 25) LST(p = 1)

BL SVM 6694 (134) 526 (129) 6008 (071) 9999 (001)
EE SVM 704 (13) 5534 (128) 654 (051) 9998 (001)
BL LR 6924 (13) 5534 (127) 6384 (075) 9999 (001)
EE LR 7028 (128) 5549 (149) 6541 (063) 9997 (002)

BL BeSim 6749 (123) 5519 (127) 637 (063) 9999 (001)
EE BeSim 68 (121) 5521 (115) 6438 (042) 9999 (0)

BL NB 6521 (164) 5293 (09) 5975 (047) 9869 (03)
EE NB 7072 (115) × 6346 (061) 9992 (004)

Adver(p = []) Adver(p = 1) CRF(p = []) Bank(p = [])

BL SVM 9637 (194) 9118 (297) 6436 (1897) 6682 (088)
EE SVM 9763 (135) 933 (214) 8635 (999) 7154 (076)
BL LR 9719 (144) 8851 (193) 8187 (1963) 7143 (072)
EE LR 9757 (096) 9302 (206) 8684 (962) 7177 (062)

BL BeSim 9726 (112) 9538 (135) 8691 (936) 6785 (067)
EE BeSim 9738 (104) 9383 (135) 8702 (1043) 7041 (055)

BL NB 9375 (19) 9337 (19) 8724 (938) 6783 (063)
EE NB 9404 (175) × × []

Flickr(p = 01) Kdd(p = 05) Average Rank

BL SVM 7492 (017) 7453 (005) 6.44 [7]
EE SVM 7986 (013) 8098 (005) 2.39 [1]
BL LR 7903 (011) 8129 (004) 4.28 [4]
EE LR 7985 (013) 8075 (005) 2.61 [2]

BL BeSim 7462 (013) 7495 (0) 5.11 [6]
EE BeSim 764 (013) 7755 (003) 3.61 [3]

BL NB 8136 (01) 7429 (005) 6.5 [8]
EE NB [] [] 5.06 [5]


6 Conclusions and future research directions

We investigated the effects of sampling and cost-sensitive learning techniques on the problem of learning from imbalanced behaviour data. This type of big data has fundamentally different properties and internal structure than traditional data and is characterized by sparseness and very large dimensions. This is a new research topic which enables benefits across a wide variety of application domains, such as fraud detection, targeted advertising, predictive policing, churn prediction, default prediction, etc. To enable future developments in this field, we provide an on-line repository containing implementations, datasets and results, available at the following link: http://www.applieddatamining.com/cms/?q=software

Oversampling techniques need to be adapted in order to cope with this kind of data. We proposed new versions of the traditional SMOTE and ADASYN techniques, which differ in the way synthetic samples are generated and similarity computations are performed. The plain random oversampling (OSR) and the synthetic approaches are competitive methods in terms of AUC and do not seem to suffer from overfitting, as traditional studies report. Though the synthetic approaches have AUC-performances comparable to OSR, they are useless in practice due to their computationally expensive procedures. The OSR method is very fast for small and medium sized datasets, yet it becomes inappropriate for larger datasets because the training times of the underlying base learner increase drastically.
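For reference, plain random oversampling is itself a very small operation; a sketch with illustrative toy data, duplicating minority instances up to full class balance (the largest β setting):

```python
import numpy as np

rng = np.random.default_rng(0)
X = np.arange(20).reshape(10, 2)               # 10 toy instances
y = np.array([1, 0, 0, 0, 0, 0, 0, 0, 0, 0])   # heavy imbalance: 1 vs 9

min_idx = np.flatnonzero(y == 1)
n_extra = (y == 0).sum() - len(min_idx)        # duplicates needed for balance
extra = rng.choice(min_idx, size=n_extra, replace=True)
X_bal = np.vstack([X, X[extra]])
y_bal = np.concatenate([y, y[extra]])
```

The cost shows up downstream: the balanced training set is nearly twice the majority class size, which is what drives up base-learner training times on large data.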

With respect to undersampling methodologies, we employed a random undersampling technique (RUS) and Knn-based informed approaches. The latter type selects majority class instances that are closest (Cl Knn) or farthest (Far Knn) from the minority class examples and allows one to investigate the effect of noise/outliers on SVM performance. We observed that these odd instances have a severe performance degrading effect. RUS and Far Knn approaches can significantly improve performance with respect to the baseline, thanks to their abilities to reduce noise/outlier levels and to put more emphasis on the minority class. Having said this, they usually perform worse than any of the oversampling approaches. This is in sheer contrast to the results obtained from traditional studies and is heavily related to the intrinsic nature of behaviour data. Timing-wise, RUS is the fastest method. The informed approaches are slower than RUS due to the required nearest neighbour computations. Additionally, a cluster-based undersampling (CBU) technique was proposed to alleviate the within-class imbalance. For low undersampling rates, the performance of CBU is superior to plain RUS, whereas for high undersampling rates the opposite is true. Overall, no substantial gain in performance is obtained in comparison with a simple RUS approach. Furthermore, the method is the slowest among all techniques considered and relies on computationally expensive clustering heuristics.
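Random undersampling is equally simple; a sketch with illustrative data, keeping only as many (randomly chosen) majority instances as there are minority instances, i.e. undersampling down to a fully balanced set:

```python
import numpy as np

rng = np.random.default_rng(0)
y = np.array([1] * 5 + [0] * 95)            # 5 minority, 95 majority instances

min_idx = np.flatnonzero(y == 1)
maj_idx = np.flatnonzero(y == 0)
keep = rng.choice(maj_idx, size=len(min_idx), replace=False)
sel = np.concatenate([min_idx, keep])       # balanced training index set
```

The speed advantage is immediate: the base learner now trains on 10 instances instead of 100, at the price of discarding most of the majority class.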

We studied the effects of boosting, where we made use of a very specific linear SVM and logistic regression combination that uses confidence-rated predictions, instead of a more popular plain "discrete" boosting algorithm (with weak hypotheses having values in {−1,1}). Our experiments clearly indicated the regularization constant C in the SVM formulation to act as a "weakness" indicator. Indeed, higher C-values cause stronger learners and shouldn't be used in the boosting process, which matches the analysis of studies dealing with traditional data. Strong learners can


already quite accurately distinguish majority class from minority class behaviours. The boosting process will focus on the hard-to-learn instances, among which are noise/outliers, and this can result in overfitting behaviour. We also observed that boosting can usually outperform the plain baseline, indicating that a combination of weak learners can achieve higher AUC-performance than a single strong learner.

EasyEnsemble (EE) integrates random undersampling with a boosting process. By sampling several balanced subsets from the training data and combining these, we can emphasize the minority class, achieve noise/outlier reduction effects and simultaneously explore the majority class space efficiently. Our version differs from the original formulation because we employ a different type of underlying weak learner (an SVM and LR combination) that is fed to a confidence-rated boosting algorithm (instead of the discrete version). We observed superior AUC-performance with respect to all of the aforementioned techniques. Furthermore, the method is very fast, since each subset can be trained in parallel and has a size that is only twice as large as the minority class training set. We studied the effect of the number of subsets and noted that a limited amount of these (O(10)) is already sufficient to obtain satisfactory performances; increasing this number further yields only minor benefits.
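The subset-and-combine core of EE can be sketched as follows. The boosted SVM/LR weak learner of the paper is replaced here by a deliberately simple difference-of-means linear scorer, and the data and number of subsets are illustrative assumptions:

```python
import numpy as np

rng = np.random.default_rng(0)

def train_scorer(X, y):
    # difference-of-class-means direction: a deliberately weak linear learner
    w = X[y == 1].mean(axis=0) - X[y == 0].mean(axis=0)
    return lambda Z: Z @ w

# Imbalanced toy data: 10 minority vs 200 majority instances
n_min, n_maj = 10, 200
X_min = rng.normal(1.0, 1.0, size=(n_min, 5))
X_maj = rng.normal(-1.0, 1.0, size=(n_maj, 5))
X = np.vstack([X_min, X_maj])
y = np.array([1] * n_min + [0] * n_maj)

min_idx, maj_idx = np.flatnonzero(y == 1), np.flatnonzero(y == 0)
scorers = []
for _ in range(10):  # S = 10 balanced subsets, each trainable in parallel
    sub = np.concatenate([min_idx, rng.choice(maj_idx, size=n_min, replace=False)])
    scorers.append(train_scorer(X[sub], y[sub]))

# final hypothesis: average of the subset scorers' confidence scores
ensemble_score = lambda Z: np.mean([s(Z) for s in scorers], axis=0)
```

Each subset keeps all minority instances plus an equally sized random slice of the majority class, so together the subsets cover much more of the majority class space than a single undersampled set would.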

Additionally, a statistical comparison of each of the proposed methods was conducted across a wide variety of datasets. Holm's significance test indicates that each of the proposed techniques (oversampling, undersampling except Cl Knn, cost-sensitive learning and boosting variants) can significantly outperform the plain baseline performance (each of the null-hypotheses of equal performance gets rejected with a confidence level of at least 91%). Comparison of the techniques with EE showed that the latter can significantly outperform all of them (each of the null-hypotheses of equal performance gets rejected with a confidence level of at least 75%). A timing-wise comparison showed EE to be the preferred method (after RUS) for large and highly imbalanced datasets. A parallel implementation would make EE the fastest method even for medium sized datasets that show a high level of imbalance.

The influence of the chosen base learner was investigated by comparing the baseline to its EE-version for the SVM, LR, NB and BeSim inducers. Each of the EE-versions is able to improve upon the baseline performance for nearly all datasets, where base learners of the regularization type are superior because of their ability to control the strength of the learner.

Though we conducted an extensive benchmark study, there are still plenty of future research directions. First, and most important, there is an urgent need to develop a more general theoretical framework to completely understand the effect of imbalance and why certain methods outperform others. Secondly, other types of mechanisms could be investigated in this setting: advanced sampling (Guo et al 2008) set-ups perform sampling based on preliminary classifications; one-class learning is a recognition-based methodology where a discriminative boundary is learned around the training examples of a single class alone, and Raskutti and Kowalczyk (2004) demonstrated the superiority of the recognition-based one-class SVMs over the discriminative two-class SVMs; other types of ensemble learning strategies (e.g. bagging and stacking) could be investigated; Batista et al (2004) note that combining over- and undersampling can be beneficial when data are highly imbalanced or show very few minority class instances; the ideas behind Extended Nearest Neighbour (Tang and


He 2015) could guide us in the sampling process. As opposed to Knn, "it considers not only who are the nearest neighbours of a certain sample, but also who consider the particular sample as their nearest neighbour" (Tang and He 2015). The method assigns a class membership to an unknown sample that maximizes the intra-class coherence. Finally, more advanced techniques do exist that can provide a K (the number of nearest neighbours) faster or with a (slightly) better performance.

Another issue relates to the lack of comprehensibility of the EasyEnsemble technique in its present form. Because we currently integrate Logistic Regression in the boosting process, the final ensemble classifier is non-linear and therefore incomprehensible. It might be interesting to investigate the numerical boosting technique proposed in Schapire and Singer (1999), where each weak learner is allowed to output real values (instead of values in [−1,1]). In that case, we would be able to use a plain linear SVM as a weak learner, without the Logistic Regression component. The final ensemble would be a linear model, which is surely easier to comprehend. Also, the instance-based approach of Martens and Provost (2014) could be used to explain individual predictions of the non-linear EE-technique.

A Oversampling experiments

Table 11 Oversampling experiments. Results show the average tenfold AUC test set performance with respect to increasing β-values (β = [β1, β2, β3, β4] = [0, 1/3, 2/3, 1]). Additionally, standard deviations are shown between brackets. Optimal performances for each method are highlighted in boldface. See Section 4.2 for parameter settings

MOV G(p = 1)
β1 β2 β3 β4
OSR 716 (262) 7437 (204) 736 (184) 7473 (245)
SMOTE 716 (262) 7508 (218) 7602 (214) 7648 (23)
ADASYN 716 (262) 7516 (192) 7593 (208) 7647 (229)

MOV G(p = 25)
β1 β2 β3 β4
OSR 8141 (132) 8349 (181) 8384 (196) 8391 (204)
SMOTE 8141 (132) 8332 (197) 8359 (204) 8376 (211)
ADASYN 8141 (132) 8361 (182) 8402 (197) 8369 (196)

Mov Th(p = [])
β1 β2 β3 β4
OSR 7977 (533) 853 (466) 8316 (45) 8459 (569)
SMOTE 7977 (533) 8418 (651) 8558 (597) 8433 (575)
ADASYN 7977 (533) 8411 (677) 8586 (606) 8536 (513)

Yahoo A(p = 1)
β1 β2 β3 β4
OSR 5592 (297) 5866 (327) 5999 (228) 5974 (178)
SMOTE 5592 (297) 5976 (262) 5974 (267) 5943 (24)
ADASYN 5592 (297) 5954 (253) 5955 (294) 5956 (222)

Yahoo A(p = 25)
β1 β2 β3 β4
OSR 6168 (242) 6419 (317) 6508 (326) 6467 (21)
SMOTE 6168 (242) 6546 (363) 6533 (323) 6452 (298)
ADASYN 6168 (242) 6504 (374) 6541 (347) 644 (221)

Continues on next page


Table 11 continued
Yahoo G(p = 1)
β1 β2 β3 β4
OSR 6684 (366) 7218 (236) 7311 (27) 7249 (341)
SMOTE 6684 (366) 7265 (285) 7327 (336) 7337 (356)
ADASYN 6684 (366) 7287 (283) 7318 (32) 7339 (359)

Yahoo G(p = 25)
β1 β2 β3 β4
OSR 7882 (139) 7878 (202) 7874 (186) 7835 (147)
SMOTE 7882 (139) 7923 (157) 791 (12) 7903 (189)
ADASYN 7882 (139) 7912 (143) 7923 (137) 7951 (201)

TaFeng(p = 1)
β1 β2 β3 β4
OSR 5575 (16) 5923 (196) 60 (168) 6104 (236)
SMOTE 5575 (16) 6026 (195) 6149 (18) 6113 (152)
ADASYN 5575 (16) 6026 (19) 6144 (185) 6116 (15)

TaFeng(p = 25)
β1 β2 β3 β4
OSR 6694 (134) 6884 (121) 6699 (142) 677 (141)
SMOTE 6694 (134) 6847 (15) 6707 (115) 6665 (081)
ADASYN 6694 (134) 6862 (138) 6785 (16) 6691 (139)

Book(p = 1)
β1 β2 β3 β4
OSR 526 (129) 5361 (094) 5541 (175) 5587 (144)
SMOTE 526 (129) 5477 (099) 5491 (08) 5436 (098)
ADASYN 526 (129) 5486 (113) 5506 (073) 5454 (092)

Book(p = 25)
β1 β2 β3 β4
OSR 6008 (071) 6105 (094) 6182 (112) 6462 (057)
SMOTE 6008 (071) 626 (073) 6095 (068) 63 (08)
ADASYN 6008 (071) 6233 (073) 6077 (085) 6304 (058)

LST(p = 1)
β1 β2 β3 β4
OSR 9999 (001) 9999 (001) 9999 (001) 9999 (001)
SMOTE 9999 (001) 9999 (001) 9999 (001) 9999 (001)
ADASYN 9999 (001) 9999 (001) 9999 (001) 9999 (001)

Adver(p = [])
β1 β2 β3 β4
OSR 9661 (182) 9731 (165) 9707 (184) 9707 (179)
SMOTE 9661 (182) 9691 (166) 9719 (165) 9707 (191)
ADASYN 9661 (182) 971 (17) 9708 (187) 9707 (188)

Adver(p = 1)
β1 β2 β3 β4
OSR 9093 (302) 9127 (303) 9266 (282) 9329 (197)
SMOTE 9093 (302) 9251 (203) 9296 (214) 9353 (181)
ADASYN 9093 (302) 9222 (233) 927 (236) 9388 (173)

CRF(p = [])
β1 β2 β3 β4
OSR 6406 (1643) 8082 (1294) 8128 (1227) 8191 (1128)
SMOTE 6406 (1643) 7864 (1686) 8252 (1374) 7932 (1626)
ADASYN 6406 (1643) 7895 (1672) 8119 (1632) 7931 (1619)

Bank(p = [])
β1 β2 β3 β4
OSR 6682 (088) 701 (074) 7139 (08) 7147 (08)
SMOTE
ADASYN


B Undersampling experiments

Table 12 Undersampling experiments. Results show the average tenfold AUC test set performance with respect to increasing βu-values (βu = [βu1, βu2, βu3, βu4, βu5] = [0, 1/4, 1/2, 3/4, 1]). Optimal performances (boldface) are highlighted for each imbalance ratio. Additionally, standard deviations are shown between brackets. CL K represents the "Closest Knn" technique, CL T represents the "Closest tot sim" technique (similar for Far K and Far T, see Section 3.2). Parameter settings are given in Section 4.2

Mov G(p = 1)
βu1 βu2 βu3 βu4 βu5
RUS 716(26) 7183(26) 7254(25) 7239(31) 7061(35)
Cl K 716(26) 714(2) 7096(19) 7043(24) 6905(41)
CL T 716(26) 7028(25) 6674(2) 668(21) 6818(36)
Far K 716(26) 7236(27) 7126(34) 6657(52) 535(35)
Far T 716(26) 7222(28) 7163(36) 6428(53) 5088(44)
CBU 7255(26) 7328(26) 7312(26) 7384(25) 73(31)

Mov G(p = 25)
βu1 βu2 βu3 βu4 βu5
RUS 8141(13) 8136(13) 8178(17) 8205(17) 816(21)
Cl K 8141(13) 8086(12) 8095(16) 7973(23) 7795(23)
CL T 8141(13) 799(12) 7821(14) 7787(15) 7776(23)
Far K 8141(13) 809(15) 7817(18) 7425(24) 6979(32)
Far T 8141(13) 8086(15) 772(24) 7116(27) 624(28)
CBU 8153(14) 8164(13) 8129(16) 8128(21) 8034(27)

Mov Th(p = [])
βu1 βu2 βu3 βu4 βu5
RUS 7977(53) 8032(58) 8157(55) 8186(66) 8126(62)
Cl K 7977(53) 7925(45) 7807(5) 7625(65) 6246(85)
CL T 7977(53) 784(44) 7241(35) 6466(45) 6037(73)
Far K 7977(53) 8454(5) 8364(64) 8002(73) 5682(103)
Far T 7977(53) 8503(57) 8268(68) 7561(92) 5677(109)
CBU 8011(58) 8117(6) 8108(65) 8417(51) 8096(69)

Yahoo A(p = 1)
βu1 βu2 βu3 βu4 βu5
RUS 5592(3) 5557(34) 5644(3) 5583(34) 5637(33)
Cl K 5592(3) 5567(24) 5312(2) 5057(18) 5379(35)
CL T 5592(3) 5569(21) 5335(22) 5031(22) 5235(33)
Far K 5592(3) 5735(22) 5692(11) 5695(23) 5118(2)
Far T 5592(3) 5693(24) 5474(19) 5701(18) 5118(2)
CBU 5821(26) 5845(33) 5831(35) 5839(35) 5609(26)

Yahoo A(p = 25)
βu1 βu2 βu3 βu4 βu5
RUS 6168(24) 629(29) 6362(36) 6375(31) 6319(19)
Cl K 6168(24) 6114(21) 5762(16) 5402(18) 5148(14)
CL T 6168(24) 6089(28) 5811(14) 544(21) 5176(14)
Far K 6168(24) 6396(3) 6262(22) 5961(15) 5625(16)
Far T 6168(24) 6371(24) 5972(16) 5727(11) 5447(11)
CBU 6246(26) 6185(14) 6178(22) 5994(3) 601(4)

Continues on next page


Table 12 continued
Yahoo G(p = 1)
βu1 βu2 βu3 βu4 βu5
RUS 6684(37) 6785(32) 6836(32) 6823(4) 699(42)
Cl K 6684(37) 6671(28) 643(36) 6198(39) 6115(19)
CL T 6684(37) 6579(27) 6355(33) 5921(35) 6108(24)
Far K 6684(37) 6676(41) 6384(34) 6516(2) 485(29)
Far T 6684(37) 6695(41) 6348(29) 6516(2) 4848(29)
CBU 6968(41) 7059(32) 7064(37) 702(29) 6335(36)

Yahoo G(p = 25)
βu1 βu2 βu3 βu4 βu5
RUS 7882(14) 7891(16) 7897(16) 7861(16) 7782(21)
Cl K 7882(14) 7726(15) 7252(15) 6786(2) 6507(27)
CL T 7882(14) 7683(1) 7199(18) 6715(23) 611(27)
Far K 7882(14) 7826(22) 7469(27) 6722(21) 6072(23)
Far T 7882(14) 7768(26) 7244(3) 6494(24) 596(2)
CBU 7525(32) 7522(24) 7469(23) 7307(24) 7069(24)

TaFeng(p = 1)
βu1 βu2 βu3 βu4 βu5
RUS 5575(16) 561(16) 5626(17) 5723(17) 5925(22)
Cl K 5575(16) 5568(16) 5558(15) 5508(11) 5105(15)
CL T 5575(16) 5567(16) 5447(16) 4753(16) 493(11)
Far K 5575(16) 5899(12) 5947(11) 6004(12) 5631(1)
Far T 5575(16) 5892(13) 5925(13) 5858(11) 5631(1)
CBU 578(1) 5847(11) 5815(09) 5887(14) 5765(16)

TaFeng(p = 25)
βu1 βu2 βu3 βu4 βu5
RUS 6694(13) 6744(13) 681(14) 6827(14) 6613(12)
Cl K 6694(13) 6613(14) 6339(12) 5983(13) 5694(07)
CL T 6694(13) 6638(15) 6289(16) 5746(13) 5456(13)
Far K 6694(13) 6806(14) 6643(16) 6446(15) 6335(13)
Far T 6694(13) 6431(11) 6269(1) 6127(11) 5903(1)
CBU 6481(12) 6415(11) 6413(12) 6388(08) 6346(08)

Book(p = 1)
βu1 βu2 βu3 βu4 βu5
RUS 526(13) 5279(09) 5346(08) 5389(09) 5405(09)
Cl K 526(13) 5256(12) 5252(13) 5239(11) 5309(11)
CL T 526(13) 5256(12) 5252(13) 5239(11) 5305(07)
Far K 526(13) 5521(12) 5621(18) 5614(12) 5306(1)
Far T 526(13) 5521(12) 5621(18) 5614(12) 5306(1)
CBU 5428(09) 5377(1) 5333(11) 5334(09) 5284(08)

Book(p = 25)
βu1 βu2 βu3 βu4 βu5
RUS 6008(07) 6013(06) 604(08) 6033(08) 6328(08)
Cl K 6008(07) 5996(07) 6013(08) 5996(1) 5928(07)
CL T 6008(07) 5996(07) 6013(08) 6029(04) 545(09)
Far K 6008(07) 6329(1) 6419(08) 573(11) 5566(11)
Far T 6008(07) 6214(05) 5827(06) 5637(1) 5566(11)
CBU 5482(09) 5467(09) 5471(09) 5466(1) 5478(09)

Continues on next page


Table 12 continued
LST(p = 1)
βu1 βu2 βu3 βu4 βu5
RUS 9999(0) 9999(0) 9999(0) 9998(0) 9999(0)
Cl K 9999(0) 9999(0) 9999(0) 9999(0) 9999(0)
CL T 9999(0) 9999(0) 9999(0) 9999(0) 9998(0)
Far K 9999(0) 9998(0) 9998(0) 9998(0) 9998(0)
Far T 9999(0) 9998(0) 9998(0) 9998(0) 9998(0)
CBU [] [] [] [] []

Adver(p = [])
βu1 βu2 βu3 βu4 βu5
RUS 9661(18) 9632(18) 9663(14) 9712(21) 9622(16)
Cl K 9661(18) 9644(15) 9614(15) 9604(2) 948(25)
CL T 9661(18) 9587(21) 9432(19) 9301(22) 9072(23)
Far K 9661(18) 9653(14) 9576(2) 9439(18) 9049(31)
Far T 9661(18) 9654(15) 9567(19) 9454(18) 893(28)
CBU 9685(23) 9685(23) 9705(15) 966(16) 9606(21)

Adver(p = 1)
βu1 βu2 βu3 βu4 βu5
RUS 9093(3) 9153(31) 9237(34) 919(29) 9193(22)
Cl K 9093(3) 9064(3) 8987(39) 9021(36) 8918(2)
CL T 9093(3) 897(35) 8855(34) 8576(33) 882(23)
Far K 9093(3) 938(23) 924(26) 8873(34) 8551(4)
Far T 9093(3) 9362(24) 932(22) 8841(36) 8551(4)
CBU 9322(24) 9376(25) 9389(26) 9352(27) 9127(2)

CRF(p = [])
βu1 βu2 βu3 βu4 βu5
RUS 6406(164) 6328(159) 6798(174) 6695(219) 8773(88)
Cl K 6406(164) 6244(166) 6234(169) 7137(138) 7822(177)
CL T 6406(164) 6244(166) 6234(169) 7137(138) 6267(229)
Far K 6406(164) 838(142) 8393(148) 8449(137) 8611(97)
Far T 6406(164) 838(142) 8393(148) 8449(137) 8611(97)
CBU [] [] [] [] []

Bank(p = [])
βu1 βu2 βu3 βu4 βu5
RUS 6682(09) 6702(09) 6737(08) 6799(06) 695(1)
Cl K 6682(09) 6617(07) 6524(06) 6486(06) 5853(11)
CL T 6682(09) 6492(11) 6069(09) 5633(08) 5287(07)
Far K 6682(09) 6695(06) 6619(06) 6442(06) 5825(11)
Far T 6682(09) 6716(06) 642(08) 5967(1) 5825(11)
CBU [] [] [] [] []


C Boosting experiments

This section presents the results of each of the boosting experiments from Section 4.5 on the remaining data sources. Each of the figures below depicts the average tenfold AUC-performance on test data (with µ = 100) with respect to the number of boosting iterations for (left) Adaboost (AB), AdaCost (AC) and EasyEnsemble (EE), with C chosen according to the highest validation set AUC-performance (over all possible boosting rounds), and (right) AB and EE (with S = 15) with varying C-levels


Fig 6 Mov G(p = 1) dataset


Fig 7 Mov Th(p = []) dataset



Fig 8 Yahoo A(p = 1) dataset


Fig 9 Yahoo A(p = 25) dataset


Fig 10 Yahoo G(p = 1) dataset



Fig 11 Yahoo G(p = 25) dataset


Fig 12 TaFeng(p = 1) dataset


Fig 13 Book(p = 1) dataset



Fig 14 LST(p = 1) dataset


Fig 15 Adver(p = []) dataset


Fig 16 Adver(p = 1) dataset



Fig 17 CRF(p = []) dataset

D Final Comparison


Fig 18 Average rank AUC versus average rank Time (see Table 9) across the large datasets from Table 2 (CRF and Bank). Regarding the former, an extra row EE par is added, having the same AUC-value as EE(S = 15). Ranks for each dataset are subsequently obtained via the procedure outlined in Section 4.6.1. Points occurring in the upper-right region are preferred


References

Akbani R, Kwek S, Japkowicz N (2004) Applying support vector machines to imbalanced datasets. In: Machine Learning: ECML 2004, 15th European Conference on Machine Learning, Pisa, Italy, September 20-24, 2004, Proceedings, Springer Berlin Heidelberg, Berlin, Heidelberg, pp 39–50. DOI 10.1007/978-3-540-30115-8_7

Ali A, Shamsuddin SM, Ralescu AL (2015) Classification with class imbalance problem: A review. International Journal of Advances in Soft Computing and its Applications 7(3):176–204

Alzahrani T, Horadam KJ (2016) Community detection in bipartite networks: Algorithms and case studies. In: Complex Systems and Networks: Dynamics, Controls and Applications, Springer Berlin Heidelberg, Berlin, Heidelberg, pp 25–50. DOI 10.1007/978-3-662-47824-0_2

Bachner J (2013) Predictive policing: Preventing crime with data and analytics. IBM Center for the Business of Government

Baesens B, Van Gestel T, Viaene S, Stepanova M, Suykens J, Vanthienen J (2003) Benchmarking state-of-the-art classification algorithms for credit scoring. Journal of the Operational Research Society 54(6):627–635. DOI 10.1057/palgrave.jors.2601545

Barandela R, Sánchez J, García V, Rangel E (2003) Strategies for learning in class imbalance problems. Pattern Recognition 36(3):849–851. DOI 10.1016/S0031-3203(02)00257-1

Barber MJ (2007) Modularity and community detection in bipartite networks. Physical Review E 76:066102. DOI 10.1103/PhysRevE.76.066102

Barua S, Islam MM, Yao X, Murase K (2014) MWMOTE–majority weighted minority oversampling technique for imbalanced data set learning. IEEE Transactions on Knowledge and Data Engineering 26(2):405–425. DOI 10.1109/TKDE.2012.232

Batista GEAPA, Prati RC, Monard MC (2004) A study of the behavior of several methods for balancing machine learning training data. SIGKDD Explor Newsl 6(1):20–29. DOI 10.1145/1007730.1007735

Beckett SJ (2016) Improved community detection in weighted bipartite networks. Royal Society Open Science 3(1). DOI 10.1098/rsos.140536

Bekkar M, Djemaa HK, Alitouche TA (2013) Evaluation measures for models assessment over imbalanced data sets. Journal of Information Engineering and Applications 3(10):27–38

Bhattacharyya S, Jha S, Tharakunnel K, Westland JC (2011) Data mining for credit card fraud: A comparative study. Decision Support Systems 50(3):602–613. DOI 10.1016/j.dss.2010.08.008

Blondel VD, Guillaume JL, Lambiotte R, Lefebvre E (2008) Fast unfolding of communities in large networks. Journal of Statistical Mechanics: Theory and Experiment 2008(10):P10008

Breiman L, Friedman J, Stone CJ, Olshen RA (1984) Classification and regression trees. Taylor & Francis

Brozovsky L, Petricek V (2007) Recommender system for online dating service. In: Proceedings of Znalosti 2007 Conference, VSB Ostrava

Cha M, Mislove A, Gummadi KP (2009) A measurement-driven analysis of information propagation in the Flickr social network. In: Proceedings of the 18th International Conference on World Wide Web, ACM, New York, NY, USA, WWW '09, pp 721–730. DOI 10.1145/1526709.1526806

Chawla NV (2005) Data mining for imbalanced datasets: An overview. In: Data mining and knowledge discovery handbook, Springer US, Boston, MA, pp 853–867

Chawla NV, Bowyer KW, Hall LO, Kegelmeyer WP (2002) SMOTE: synthetic minority over-sampling technique. Journal of Artificial Intelligence Research 16:321–357

Chawla NV, Lazarevic A, Hall LO, Bowyer KW (2003) SMOTEBoost: Improving prediction of the minority class in boosting. In: Knowledge Discovery in Databases: PKDD 2003, Springer Berlin Heidelberg, Berlin, Heidelberg, pp 107–119

Chawla NV, Japkowicz N, Kotcz A (2004) Editorial: Special issue on learning from imbalanced data sets. SIGKDD Explor Newsl 6(1):1–6. DOI 10.1145/1007730.1007733

Chen M, Mao S, Liu Y (2014) Big data: A survey. Mobile Networks and Applications 19(2):171–209. DOI 10.1007/s11036-013-0489-0

Chyi YM (2003) Classification analysis techniques for skewed class distribution problems. Master thesis, Department of Information Management, National Sun Yat-Sen University

Demšar J (2006) Statistical comparisons of classifiers over multiple data sets. Journal of Machine Learning Research 7(Jan):1–30

Dhillon IS (2001) Co-clustering documents and words using bipartite spectral graph partitioning. In: Proceedings of the Seventh ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, ACM, New York, NY, USA, KDD '01, pp 269–274. DOI 10.1145/502512.502550

Drummond C, Holte RC (2003) C4.5, class imbalance, and cost sensitivity: why under-sampling beats over-sampling. In: Proceedings of the ICML '03 Workshop on Learning from Imbalanced Datasets

Fan RE, Chang KW, Hsieh CJ, Wang XR, Lin CJ (2008) LIBLINEAR: A library for large linear classification. Journal of Machine Learning Research 9:1871–1874

Fan W, Stolfo SJ, Zhang J, Chan PK (1999) AdaCost: Misclassification cost-sensitive boosting. In: Proceedings of the Sixteenth International Conference on Machine Learning, Morgan Kaufmann Publishers Inc, San Francisco, CA, USA, ICML '99, pp 97–105

Fawcett T (2006) An introduction to ROC analysis. Pattern Recognition Letters 27(8):861–874. DOI 10.1016/j.patrec.2005.10.010

Finch H (2005) Comparison of distance measures in cluster analysis with dichotomous data. Journal of Data Science 3(1):85–100

Fortunato S (2010) Community detection in graphs. Physics Reports 486(3-5):75–174. DOI 10.1016/j.physrep.2009.11.002

Junqué de Fortuny E, Martens D, Provost F (2014a) Predictive modeling with big data: is bigger really better? Big Data 1(4):215–226. DOI 10.1089/big.2013.0037

Junqué de Fortuny E, Stankova M, Moeyersoms J, Minnaert B, Provost F, Martens D (2014b) Corporate residence fraud detection. In: Proceedings of the 20th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, ACM, New York, NY, USA, KDD '14, pp 1650–1659. DOI 10.1145/2623330.2623333

Frasca M, Bertoni A, Re M, Valentini G (2013) A neural network algorithm for semi-supervised node label learning from unbalanced data. Neural Networks 43:84–98. DOI 10.1016/j.neunet.2013.01.021

Friedman M (1937) The use of ranks to avoid the assumption of normality implicit in the analysis of variance. Journal of the American Statistical Association 32(200):675–701

García E, Lozano F (2007) Boosting support vector machines. In: Machine Learning and Data Mining in Pattern Recognition, 5th International Conference, MLDM 2007, Leipzig, Germany, July 18-20, Post Proceedings, IBaI publishing, pp 153–167

Goldstein M, Uchida S (2016) A comparative evaluation of unsupervised anomaly detection algorithms for multivariate data. PLOS ONE 11(4):1–31. DOI 10.1371/journal.pone.0152173

González PC, Velásquez JD (2013) Characterization and detection of taxpayers with false invoices using data mining techniques. Expert Systems with Applications 40(5):1427–1436. DOI 10.1016/j.eswa.2012.08.051

Guimerà R, Sales-Pardo M, Amaral LAN (2007) Module identification in bipartite and directed networks. Physical Review E 76:036102. DOI 10.1103/PhysRevE.76.036102

Guo H, Viktor HL (2004) Learning from imbalanced data sets with boosting and data generation: The DataBoost-IM approach. SIGKDD Explor Newsl 6(1):30–39. DOI 10.1145/1007730.1007736

Guo X, Yin Y, Dong C, Yang G, Zhou G (2008) On the class imbalance problem. In: 2008 Fourth International Conference on Natural Computation, IEEE, vol 4, pp 192–201. DOI 10.1109/ICNC.2008.871

Han H, Wang WY, Mao BH (2005) Borderline-SMOTE: A new over-sampling method in imbalanced data sets learning. In: Advances in Intelligent Computing, Springer Berlin Heidelberg, Berlin, Heidelberg, pp 878–887

He H, Garcia EA (2009) Learning from imbalanced data. IEEE Transactions on Knowledge and Data Engineering 21(9):1263–1284. DOI 10.1109/TKDE.2008.239

He H, Bai Y, Garcia EA, Li S (2008) ADASYN: Adaptive synthetic sampling approach for imbalanced learning. In: 2008 IEEE International Joint Conference on Neural Networks (IEEE World Congress on Computational Intelligence), IEEE, pp 1322–1328. DOI 10.1109/IJCNN.2008.4633969

Holm S (1979) A simple sequentially rejective multiple test procedure. Scandinavian Journal of Statistics 6(2):65–70

Hsu CW, Lin CJ (2002) A comparison of methods for multiclass support vector machines. IEEE Transactions on Neural Networks 13(2):415–425. DOI 10.1109/72.991427

Huang A (2008) Similarity measures for text document clustering. In: Proceedings of the Sixth New Zealand Computer Science Research Student Conference (NZCSRSC2008), Christchurch, New Zealand, pp 49–56

Iman RL, Davenport JM (1980) Approximations of the critical region of the Friedman statistic. Communications in Statistics - Theory and Methods 9(6):571–595

Jo T, Japkowicz N (2004) Class imbalances versus small disjuncts. ACM SIGKDD Explor Newsl 6(1):40–49. DOI 10.1145/1007730.1007737

Jutla IS, Jeub LG, Mucha PJ (2011-2016) A generalized Louvain method for community detection implemented in MATLAB. URL http://netwiki.amath.unc.edu/GenLouvain

Kubat M, Matwin S (1997) Addressing the curse of imbalanced training sets: One-sided selection. In: Proceedings of the Fourteenth International Conference on Machine Learning, Morgan Kaufmann Publishers Inc, San Francisco, CA, USA, pp 179–186

Lancichinetti A, Fortunato S (2009) Community detection algorithms: A comparative analysis. Physical Review E 80:056117. DOI 10.1103/PhysRevE.80.056117

Larremore DB, Clauset A, Jacobs AZ (2014) Efficiently inferring community structure in bipartite networks. Physical Review E: Statistical, Nonlinear, and Soft Matter Physics 90:012805. DOI 10.1103/PhysRevE.90.012805

Li J, Fine JP (2010) Weighted area under the receiver operating characteristic curve and its application to gene selection. Journal of the Royal Statistical Society: Series C (Applied Statistics) 59(4):673–692. DOI 10.1111/j.1467-9876.2010.00713.x

Li X, Wang L, Sung E (2008) AdaBoost with SVM-based component classifiers. Engineering Applications of Artificial Intelligence 21(5):785–795. DOI 10.1016/j.engappai.2007.07.001

Lichman M (2013) UCI machine learning repository. URL http://archive.ics.uci.edu/ml

Liu W, Chawla S, Cieslak DA, Chawla NV (2010) A robust decision tree algorithm for imbalanced data sets. In: Proceedings of the Tenth SIAM International Conference on Data Mining, SIAM, Philadelphia, vol 10, pp 766–777

Liu XY, Wu J, Zhou ZH (2009) Exploratory undersampling for class-imbalance learning. IEEE Transactions on Systems, Man, and Cybernetics, Part B (Cybernetics) 39(2):539–550. DOI 10.1109/TSMCB.2008.2007853

Luts J, Ojeda F, Van de Plas R, De Moor B, Van Huffel S, Suykens JA (2010) A tutorial on support vector machine-based methods for classification problems in chemometrics. Analytica Chimica Acta 665(2):129–145. DOI 10.1016/j.aca.2010.03.030

Macskassy SA, Provost F (2007) Classification in networked data: A toolkit and a univariate case study. Journal of Machine Learning Research 8(May):935–983

Martens D, Provost F (2014) Explaining data-driven document classifications. MIS Quarterly 38(1):73–100

Martens D, Provost F, Clark J, Junqué de Fortuny E (2016) Mining massive fine-grained behavior data to improve predictive analytics. MIS Quarterly 40(4):869–888

Mazurowski MA, Habas PA, Zurada JM, Lo JY, Baker JA, Tourassi GD (2008) Training neural network classifiers for medical decision making: The effects of imbalanced datasets on classification performance. Neural Networks 21(2-3):427–436. DOI 10.1016/j.neunet.2007.12.031

Mease D, Wyner AJ, Buja A (2007) Boosted classification trees and class probability/quantile estimation. Journal of Machine Learning Research 8:409–439

Nemenyi P (1963) Distribution-free multiple comparisons. Dissertation, Princeton University

Newman MEJ, Girvan M (2004) Finding and evaluating community structure in networks. Physical Review E 69:026113. DOI 10.1103/PhysRevE.69.026113

Ng AY (2004) Feature selection, L1 vs. L2 regularization, and rotational invariance. In: Proceedings of the Twenty-first International Conference on Machine Learning, ACM, New York, NY, USA, ICML '04, pp 78–. DOI 10.1145/1015330.1015435

Ng AY, Jordan MI (2002) On discriminative vs. generative classifiers: A comparison of logistic regression and naive Bayes. In: Advances in Neural Information Processing Systems 14, MIT Press, pp 841–848

Ngai E, Hu Y, Wong Y, Chen Y, Sun X (2011) The application of data mining techniques in financial fraud detection: A classification framework and an academic review of literature. Decision Support Systems 50(3):559–569. DOI 10.1016/j.dss.2010.08.006

Platt JC (1999) Probabilistic outputs for support vector machines and comparisons to regularized likelihood methods. In: Advances in Large-Margin Classifiers, MIT Press, pp 61–74

Porter MA, Onnela JP, Mucha PJ (2009) Communities in networks. Notices of the American Mathematical Society 56(9):1082–1097

Provost F, Fawcett T (2013) Data Science for Business: What you need to know about data mining and data-analytic thinking. O'Reilly Media, Inc

Provost F, Dalessandro B, Hook R, Zhang X, Murray A (2009) Audience selection for on-line brand advertising: Privacy-friendly social network targeting. In: Proceedings of the 15th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, ACM, New York, NY, USA, KDD '09, pp 707–716. DOI 10.1145/1557019.1557098

Raskutti B, Kowalczyk A (2004) Extreme re-balancing for SVMs: A case study. SIGKDD Explor Newsl 6(1):60–69. DOI 10.1145/1007730.1007739

Rosvall M, Bergstrom CT (2008) Maps of random walks on complex networks reveal community structure. Proceedings of the National Academy of Sciences 105(4):1118–1123. DOI 10.1073/pnas.0706851105

Schapire RE (1999) A brief introduction to boosting. In: Proceedings of the 16th International Joint Conference on Artificial Intelligence - Volume 2, Morgan Kaufmann Publishers Inc, San Francisco, CA, USA, IJCAI'99, pp 1401–1406

Schapire RE, Singer Y (1999) Improved boosting algorithms using confidence-rated predictions. Machine Learning 37(3):297–336. DOI 10.1023/A:1007614523901

Shmueli G (2017) Analyzing behavioral big data: Methodological, practical, ethical, and moral issues. Quality Engineering 29(1):57–74. DOI 10.1080/08982112.2016.1210979

Sobhani P, Viktor H, Matwin S (2015) Learning from imbalanced data using ensemble methods and cluster-based undersampling. In: New Frontiers in Mining Complex Patterns: Third International Workshop, NFMCP 2014, Held in Conjunction with ECML-PKDD 2014, Nancy, France, September 19, 2014, Revised Selected Papers, Springer International Publishing, Cham, pp 69–83. DOI 10.1007/978-3-319-17876-9_5

Stankova M (2016) Classification within network data with a bipartite structure. Dissertation, University of Antwerp

Stankova M, Martens D, Provost F (2015) Classification over bipartite graphs through projection. Working Papers 2015001, University of Antwerp, Faculty of Applied Economics

Sun Y, Kamel MS, Wong AK, Wang Y (2007) Cost-sensitive boosting for classification of imbalanced data. Pattern Recognition 40(12):3358–3378. DOI 10.1016/j.patcog.2007.04.009

Suykens JA, Van Gestel T, De Brabanter J, De Moor B, Vandewalle J (2002) Least squares support vector machines. World Scientific, Singapore

Tang B, He H (2015) ENN: Extended nearest neighbor method for pattern recognition [research frontier]. IEEE Computational Intelligence Magazine 10(3):52–60. DOI 10.1109/MCI.2015.2437512

Tang Y, Zhang YQ, Chawla NV, Krasser S (2009) SVMs modeling for highly imbalanced classification. IEEE Transactions on Systems, Man, and Cybernetics, Part B (Cybernetics) 39(1):281–288. DOI 10.1109/TSMCB.2008.2002909

Tobback E, Moeyersoms J, Stankova M, Martens D (2016) Bankruptcy prediction for SMEs using relational data. Working Paper 2016004, University of Antwerp, Faculty of Applied Economics

Verbeke W, Dejaeger K, Martens D, Hur J, Baesens B (2012) New insights into churn prediction in the telecommunication sector: A profit driven data mining approach. European Journal of Operational Research 218(1):211–229. DOI 10.1016/j.ejor.2011.09.031

Veropoulos K, Campbell I, Cristianini N (1999) Controlling the sensitivity of support vector machines. In: Proceedings of the International Joint Conference on Artificial Intelligence, Stockholm, Sweden (IJCAI99), pp 55–60

Whitrow C, Hand DJ, Juszczak P, Weston D, Adams NM (2009) Transaction aggregation as a strategy for credit card fraud detection. Data Mining and Knowledge Discovery 18(1):30–55. DOI 10.1007/s10618-008-0116-z

Wickramaratna J, Holden SB, Buxton BF (2001) Performance degradation in boosting. In: Proceedings of the Second International Workshop on Multiple Classifier Systems, Springer, London, UK, MCS '01, pp 11–21

Yen SJ, Lee YS (2009) Cluster-based under-sampling approaches for imbalanced data distributions. Expert Systems with Applications 36(3, Part 1):5718–5727. DOI 10.1016/j.eswa.2008.06.108

Yu HF, Lo HY, Hsieh HP, Lou JK, McKenzie TG, Chou JW, Chung PH, Ho CH, Chang CF, Wei YH, et al (2010) Feature engineering and classifier ensemble for KDD Cup 2010. In: Proceedings of the KDD Cup 2010 Workshop, pp 1–16

Zha H, He X, Ding C, Simon H, Gu M (2001) Bipartite graph partitioning and data clustering. In: Proceedings of the Tenth International Conference on Information and Knowledge Management, ACM, New York, NY, USA, CIKM '01, pp 25–32. DOI 10.1145/502585.502591

Zhang J, Mani I (2003) kNN approach to unbalanced data distributions: A case study involving information extraction. In: Proceedings of the ICML'2003 Workshop on Learning from Imbalanced Datasets, Washington DC

Ziegler CN, McNee SM, Konstan JA, Lausen G (2005) Improving recommendation lists through topic diversification. In: Proceedings of the 14th International Conference on World Wide Web, ACM, New York, NY, USA, WWW '05, pp 22–32. DOI 10.1145/1060745.1060754


atively low amount of people respond positively on the advertisement. Data on the individual merchants with whom one transacts can be used to detect credit card fraud. The study of Bhattacharyya et al (2011), which uses real-life credit card transactional data, mentions that of the 49,858,600 transactions considered, only 2,420 were fraudulent (0.005%). Besides these marketing and fraud domains, other areas can be considered, such as churn prediction (Verbeke et al 2012), default prediction (Tobback et al 2016) and predictive policing (Bachner 2013). Despite the abundant application domains, little is known on the effect of imbalance on the performance of classification algorithms when dealing with this kind of data. Needless to say that empirical and theoretical developments in this field can give rise to major benefits for academia, companies and governments.

Behaviour data can be represented as a sparse matrix or, equivalently, as a bipartite graph. In the matrix representation, each instance i corresponds to a single row and columns correspond to specific features. Let's take the example of users rating films, where we wish to predict the gender of each user. In that case, each user corresponds to a row in the matrix and each specific film corresponds to a single column. Alternatively, behaviour data can also be represented as a bipartite graph (Stankova et al 2015). Consider a set of bottom nodes N_B and a set of top nodes N_T. A bipartite graph is a network that contains edges only between nodes of a different type: each bottom node n_b ∈ N_B can only be connected to nodes n_t ∈ N_T. Returning to our example, each user i corresponds with a bottom node n_b,i and each film j corresponds with a top node n_t,j.
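The two equivalent representations above can be made concrete with a minimal pure-Python sketch (a real implementation would typically use a sparse matrix library; the user and film indices are invented for illustration). Each user maps to the set of films they rated, which is simultaneously the sparse row of the matrix and the adjacency list of the bottom node:

```python
# Minimal sketch: behaviour data as a sparse user-by-film matrix, stored as
# an adjacency mapping (user index -> set of film columns containing a 1).
# This is also the bipartite-graph view: bottom nodes = users, top nodes = films.
def build_sparse_matrix(edges, n_users):
    """edges: iterable of (user, film) pairs, one per observed behaviour."""
    rows = {u: set() for u in range(n_users)}
    for user, film in edges:
        rows[user].add(film)
    return rows

# hypothetical example: 3 users, 4 films
edges = [(0, 0), (0, 2), (1, 1), (2, 0), (2, 1), (2, 2)]
X = build_sparse_matrix(edges, n_users=3)
print(sorted(X[2]))  # films rated by user 2 -> [0, 1, 2]
```

Only the 1-entries are stored, which is what makes very large, sparse behaviour matrices tractable.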

Imbalanced behaviour data occur when the collection of instances contains substantially more examples of one class as compared to the other. In our running example of the previous paragraph, this means that there might be far more males rating films than females. In the datasets presented in Section 4.1, each of the instances contains a label indicating the particular class the example belongs to. In practice, instances do not necessarily have a label associated and the goal is to infer a label/score for these unknown instances based on the examples with known labels. Note that we focus on the two-class classification problem, where labels are limited to two types. A multi-class classifier can be obtained by solving a sequence of binary classification problems. This can, for instance, be accomplished with one-versus-one or one-versus-all setups (Hsu and Lin 2002).

22 Support vector machines (SVMs)

The support vector machine (SVM) is a "state-of-the-art" classification technique that has been applied in a wide variety of domains (Suykens et al 2002). The training data consist of a set of d-dimensional input vectors x_i with corresponding labels y_i ∈ {−1, 1}, i = 1, ..., m. A linear SVM constructs a linear classifier (hyperplane) in the input space. The hyperplane is constructed in such a way that as few points as possible are wrongfully classified, while simultaneously maximizing the margin between the two classes. This trade-off between minimizing the model complexity and reducing misclassification is governed by the regularization parameter C. The linear SVM is

8 Jellis Vanhoeyveld David Martens

the solution to the following quadratic optimization problem:

\[
\begin{aligned}
\min_{w,\,b,\,\xi_i} \quad & \frac{w^T w}{2} + C \sum_{i=1}^{m} \xi_i \\
\text{s.t.} \quad & y_i \left( w^T x_i + b \right) \ge 1 - \xi_i, \quad i = 1, \dots, m \\
& \xi_i \ge 0, \quad i = 1, \dots, m
\end{aligned}
\tag{1}
\]

where b represents the bias and ξ_i are the slack variables measuring classification errors. The classifier is given by y(x) = sign(w^T x + b), where w^T x + b represents the output score. It should be noted that we usually solve the dual form of equation (1), with solution y(x) = sign[∑_{i=1}^{m} (α_i y_i x^T x_i) + b] and dual variables^4 α_i. An overview of the issues regarding SVMs and imbalanced classification for low-dimensional data can be found in Akbani et al (2004). Briefly, because of the imbalance, the majority of the slack variables ξ_i represent errors with respect to the majority class. This means that the minority class is under-represented and has a minor contribution to the goal function.

In this study, we have opted for a linear SVM (using the LIBLINEAR (Fan et al 2008) package) as the base learner to classify behaviour data.^5 The feature vector will have a sparse and high-dimensional representation with d = |N_T|. Note that the SVM formulation does not change in this setting. The solution vector w will become high-dimensional. The kernel matrix contains inner products of sparse and high-dimensional vectors, resulting in a (possibly) sparse matrix.
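The trade-off in equation (1) can be made concrete: at the optimum, each slack variable ξ_i equals the hinge loss max(0, 1 − y_i(w^T x_i + b)) of instance i, so the primal objective can be evaluated directly. The sketch below is purely illustrative (a tiny dense example with invented values; real behaviour data would be sparse and high-dimensional, and the actual optimization is done by LIBLINEAR):

```python
# Sketch: evaluate the primal SVM objective of Eq. (1) for a given (w, b).
# J(w, b) = w.w / 2 + C * sum_i max(0, 1 - y_i * (w.x_i + b))
def svm_objective(w, b, X, y, C):
    margin_term = sum(wj * wj for wj in w) / 2.0
    hinge = 0.0
    for xi, yi in zip(X, y):
        score = sum(wj * xj for wj, xj in zip(w, xi)) + b
        hinge += max(0.0, 1.0 - yi * score)  # slack xi_i at the optimum
    return margin_term + C * hinge

# invented toy data: two instances, one per class, separated by w = [1, -1]
X, y = [[1, 0], [0, 1]], [1, -1]
print(svm_objective([1, -1], 0.0, X, y, C=1.0))  # 1.0 (zero slack, ||w||^2/2 = 1)
```

With class imbalance, most terms in the hinge sum come from the majority class, which is exactly why the minority class contributes little to the goal function.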

The reasons we have chosen an SVM are twofold. First of all, it is a very popular technique for dealing with traditional and behaviour data and has been applied in many diverse domains (Suykens et al 2002). As we have noted in Section 1.1, tailored techniques need to be applied to behaviour datasets and roughly fall into two categories: regularization-based techniques and heuristic approaches. The latter type of techniques is not suitable for dealing with traditional data. The remaining regularization-based techniques have formed the subject of traditional studies dealing with imbalanced data. We have opted for this type of technique since we can then easily compare our results to the conclusions obtained from previous studies. The other reason we have opted for an SVM is the fact that many of our proposed techniques, see Section 3 for details, rely on a boosting process. Wickramaratna et al (2001) and García and Lozano (2007) noted that using a strong learner can result in performance degradation during boosting. Regularization-based techniques offer an added element of flexibility in the sense that the strength of the learner can be controlled by varying the regularization parameter. Heuristic approaches today do not offer this attractive feature.

^4 Also called support values.
^5 In Section 5, we will consider different types of base learners.

23 Evaluation metrics

The performance measures that are typically used in traditional studies dealing with imbalanced data are accuracy, sensitivity, specificity, precision, F-measure and G-mean, see for instance Bhattacharyya et al (2011), Han et al (2005), González and Velásquez (2013). These measures are derived from the confusion matrix and are based on a certain threshold applied on the output scores of the classifier, where the threshold is usually contained within the classification algorithm. There are two main issues with this approach. First of all, the built-in threshold that is applied on the scores might not be suitable. This might cause low performance values with respect to these criteria, yet if we were to simply adapt the threshold, the same performance criteria might show excellent results. The second issue lies in the fact that the chosen threshold could be irrelevant with respect to the available capacity. We address this issue with a simple example. If a targeted advertisement company would apply this classifier (with built-in threshold) on new customers, the classifier might choose to predict 5% of all possible customers to target as positive. Yet the company only has a marketing budget that allows targeting 0.1% of all possible customers. It is clear that the chosen threshold is inappropriate here. For these reasons, we have chosen to opt for the area under the ROC-curve (AUC) instead.^6 Chawla et al (2004) note that ROC-curves (and cost-curves) should be preferred over these traditional measures. The AUC, which measures the area under the ROC-curve, is more appropriate since it is independent of class skew and measures the ranking abilities of the classifier (Fawcett 2006). It answers the question: if we were to rank all instances according to output scores, is the classifier able to place the positive instances near the top of the list and the negative instances near the bottom? Because the method scans over all possible thresholds, it is independent of a specific cut-off value. Another reason we have chosen AUC is the fact that many boosting and cost-sensitive learning techniques evaluate performance using accuracy or misclassification cost, which suffer from the same issues as previously mentioned. Hence, we are one of the first to evaluate these techniques with respect to AUC. Note that AUC is also the preferred metric in the assessment of unsupervised anomaly detection techniques (Goldstein and Uchida 2016). Outliers/anomalies are strongly related to our field, as they signify rare events that are of specific importance to the analyst.
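The ranking interpretation admits a direct computation: AUC equals the probability that a randomly chosen positive instance receives a higher score than a randomly chosen negative one (ties counting one half). A small illustrative sketch, not the evaluation code used in the study (the scores below are invented):

```python
# Sketch: AUC as the probability that a random positive outranks a random
# negative instance; no threshold is involved, and class skew cancels out
# because we only compare positive-negative pairs.
def auc(scores, labels):
    pos = [s for s, l in zip(scores, labels) if l == 1]
    neg = [s for s, l in zip(scores, labels) if l == 0]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0
               for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

print(auc([0.9, 0.8, 0.3, 0.1], [1, 0, 1, 0]))  # 0.75
```

Note that rescaling or shifting the scores leaves the AUC unchanged: only the ranking matters.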

Though we focus on AUC, other measures are suitable for the performance assessment of imbalanced data. The weighted area under the ROC-curve (wAUC) (Li and Fine 2010) is able to emphasize certain regions of ROC-space. Lift-curves (Bekkar et al 2013), popular in the marketing domain, are ideal to assess the prevalence of positive cases among instances ranked highly by the classifier. These curves can be evaluated according to the available capacity. When cost information is available, this should be integrated in the performance assessment criterion. Indeed, the cost of a false negative is usually much larger than the cost of a false positive. We could then opt for cost-curves (Whitrow et al 2009; Bhattacharyya et al 2011), which can also be evaluated with consideration of the available capacity. Few studies make use of costs

^6 We provide traditional measures (sensitivity, specificity, G-means, F-measure) in our online repository: http://www.applieddatamining.com/cms?q=software

(Ngai et al 2011), mainly because these are difficult to determine, uncertain, and have a temporal characteristic. To conclude this section, we note that each of the performance assessment criteria mentioned in this paragraph requires some additional information in the form of a weight function (wAUC), capacity requirements and/or costs (lift/cost-curves). These measures are application specific, and this is the main reason we excluded them from our study.

3 Methods

31 Oversampling

Over the years, the data mining community investigated several techniques that balance the data distribution by oversampling the minority class. In this section, we investigate the basic oversampling with replacement (OSR) approach in conjunction with synthetic sample generation procedures. The first technique simply duplicates minority class instances by a certain amount. Several references dealing with traditional low-dimensional data note that this technique may make the decision regions of the learner smaller and too specific on the replicated instances, causing the learner to overfit (Chawla et al 2002; Han et al 2005; Liu et al 2009; He and Garcia 2009). The synthetic approaches are designed to overcome this overfitting behaviour by generating new non-overlapping instances in the minority class space. The techniques we investigate are SMOTE (Chawla et al 2002) and ADASYN (He et al 2008).
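The OSR baseline can be sketched in a few lines; `oversample_with_replacement` is a hypothetical helper name, not the authors' code, and the toy minority instances are invented:

```python
import random

# Sketch of basic oversampling with replacement (OSR): append randomly
# chosen duplicates of minority instances until the desired amount is reached.
def oversample_with_replacement(X_min, n_extra, seed=0):
    rng = random.Random(seed)  # fixed seed for reproducibility
    return X_min + [rng.choice(X_min) for _ in range(n_extra)]

X_min = [[1, 0, 1], [0, 1, 1]]          # two minority instances (binary rows)
augmented = oversample_with_replacement(X_min, n_extra=3)
print(len(augmented))  # 5
```

Because every added row is an exact copy, the learner may fit the replicated points too tightly, which is precisely the overfitting concern the synthetic approaches address.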

Consider a certain minority class instance x_i. In the traditional setting, SMOTE and ADASYN will generate a new synthetic instance x_syn by choosing a random point on the line segment between the point x_i and one of its K minority class nearest neighbours (computed according to Euclidean distance). In SMOTE, each original minority class instance x_i generates the same number of synthetic instances, whereas the ADASYN algorithm generates a variable number of synthetic instances by putting a larger weight on the harder to learn minority instances. This way, the learner is more focussed toward difficult instances.

These techniques need to be adapted when dealing with binary behaviour data, where each instance is represented by a large and sparse binary vector. The main differences with the original versions of SMOTE and ADASYN are indicated in Table 1 and are explained in more detail in the following paragraphs. The first difference lies in the generation of the new synthetic instances. As before, a synthetic sample is constructed based on two original minority instances. When both instances have a 0 or 1 in their corresponding column, the synthetic sample will also show a 0 or 1, respectively, at the considered position. When only one of the two minority instances shows a 1, the decision will be made according to a user-specified parameter prior_opt. This parameter can be one of the following three options:

– "FlipCoin", where there is a 50% probability that the synthetic instance will show a 1 at the considered position.

– "Prior", where the value of the synthetic sample is determined by the prior within the minority class in the corresponding column. One generates a random number u in the interval [0,1] and puts a 1 in the corresponding position if u is smaller than the prior.

– "Reverse Prior", where one generates a random number u in the interval [0,1] and puts a 1 in the corresponding position if u is larger than the prior within the minority class for this column/feature.

The second difference lies in the way the nearest neighbours are determined. It would be unwise to consider Euclidean distance in that respect, because it treats a 0-0 match in the same way as a 1-1 match. Since we are working with dichotomous variables (i.e. present or absent), a 1-1 match at a certain position is far more informative than a 0-0 match. For instance, two users visiting the same web page contains more information than two users who didn't visit that specific page. The similarity between two instances is defined by a user-specified parameter sim_measure. We have limited ourselves to two popular choices: "Jaccard" uses the Jaccard similarity measure (Finch 2005) and "Cosine" uses the cosine similarity measure (Huang 2008). In principle, one could apply any of the metrics summarized in Stankova et al (2015).
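For binary vectors represented as sets of active feature indices, both measures reduce to counting 1-1 matches: Jaccard is |A ∩ B| / |A ∪ B| and cosine is |A ∩ B| / √(|A|·|B|). A small sketch with invented index sets (note that 0-0 matches never enter either formula):

```python
import math

# Sketch: similarity measures for sparse binary instances stored as sets of
# active (value 1) feature indices; only 1-1 matches contribute.
def jaccard(a, b):
    return len(a & b) / len(a | b) if a | b else 0.0

def cosine(a, b):
    return len(a & b) / math.sqrt(len(a) * len(b)) if a and b else 0.0

u, v = {0, 2, 5}, {0, 2, 7}   # hypothetical active-feature sets of two users
print(jaccard(u, v))  # 2/4 = 0.5
print(cosine(u, v))   # 2/3, approximately 0.667
```

Storing instances as index sets keeps both computations proportional to the number of active features rather than to the full dimension d.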

Table 1 Differences between the original SMOTE and ADASYN implementations and the versions SMOTEbeh and ADASYNbeh tailored for behaviour data (parameter explanations are provided in Section 3.1).

                              SMOTE                         SMOTEbeh         ADASYN                        ADASYNbeh
Amount of oversampling        N                             β                β                             β
Synthetic sample generation   random point on line segment  prior_opt        random point on line segment  prior_opt
Similarity measure            Euclidean                     Jaccard/Cosine   Euclidean                     Jaccard/Cosine
Number of nearest neighbours  K                             K, K̄            K                             K, K̄

A detailed pseudo-code implementation of our versions of SMOTE and ADASYN, called SMOTEbeh and ADASYNbeh, is shown in Algorithm 1. Also note that we introduced an extra parameter K̄ to decouple the determination of the number of synthetic instances that need to be generated for a certain minority instance and the number of nearest neighbours it uses to obtain the synthetic instances.

The experimental set-up adopted by Chawla et al (2002), He et al (2008) considers K = 5 as the number of nearest neighbours used, without a detailed motivation. Furthermore, the latter paper compares SMOTE with an oversampling percentage N of 200% (meaning that the size of the newly created synthetic data instances is twice as large as the size of the original minority training data) with a completely balanced dataset (β = 1 in Algorithm 1) in ADASYN. In our experiments, we will consider a variety of possible K-values and compare SMOTE and ADASYN with identical oversampling rates, controlled by a single parameter β.

Algorithm 1 SMOTEbeh and ADASYNbeh pseudo-code implementation for binary behaviour data

Input: X_min, X_maj, β, prior_opt, sim_measure, K, K̄
a) determine the total amount of synthetic minority instances that need to be generated (β ∈ [0,1] is a parameter that controls the amount of oversampling; β = 1 means a fully balanced dataset will be created):
    G = (|X_maj| − |X_min|) × β
b) determine the number of synthetic samples g_i that need to be generated for each minority class instance x_i:
if SMOTE then
    g_i ← ⌈G / |X_min|⌉
else if ADASYN then
    calculate the K nearest neighbours (with the sim_measure option) for instance x_i from the set (X_min \ {x_i}) ∪ X_maj and determine Δ_i, the number of majority class nearest neighbours. Next calculate r_i = Δ_i / K and normalize these values: r̂_i = r_i / Σ_{j=1}^{|X_min|} r_j
    g_i ← ⌈r̂_i × G⌉
end if
c) generate g_i synthetic samples for minority instance x_i:
calculate the K̄ nearest neighbours (with the sim_measure option) for instance x_i from the set X_min \ {x_i}. Additionally, remove those nearest neighbours that have a similarity of 0 with x_i. The remaining nearest neighbours form the set K_used. If this set turns out to be empty, set K_used = {x_i}.
for iter = 1 → g_i do
    randomly choose 1 nearest neighbour from the set K_used
    generate a synthetic minority sample from x_i and the chosen nearest neighbour (according to prior_opt)
end for
d) because Σ g_i ≥ G, randomly remove synthetic points until the total number of synthetic samples equals G
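For binary behaviour data, the sample-generation step (c) reduces to combining two sparse 0/1 rows. The sketch below assumes (as one plausible reading of the FlipCoin option, which the text does not spell out) that features active in both instances are kept and mismatching features are resolved by a fair coin flip; the function name and the set-based sparse representation are our own:

```python
import random

def smote_beh_sample(x_i, neighbour, prior_opt="FlipCoin", rng=random):
    """Generate one synthetic binary sample from a minority instance and one of
    its nearest neighbours, as in step (c) of Algorithm 1 (illustrative sketch).
    Instances are given as sets of active (non-zero) feature indices."""
    x_i, neighbour = set(x_i), set(neighbour)
    synthetic = x_i & neighbour                 # agreement: active in both
    for f in x_i ^ neighbour:                   # mismatch: active in exactly one
        if prior_opt == "FlipCoin" and rng.random() < 0.5:
            synthetic.add(f)                    # assumed coin-flip resolution
    return synthetic

# usage: two sparse rows that agree on features 1 and 4
rng = random.Random(0)
s = smote_beh_sample({1, 4, 7}, {1, 4, 9}, rng=rng)
# s always contains {1, 4}; features 7 and 9 each survive a coin flip
```

Because instances typically have few active features, the number of distinct samples such a pair can produce is small, which is consistent with the limited gains of synthetic sampling over OSR reported later.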

3.2 Undersampling

In this section we will compare the simple random undersampling technique (RUS) with informed undersampling approaches. The first method randomly discards majority class training instances. While this technique can achieve fast training performance of the underlying base learner, an obvious disadvantage is that it might discard potentially useful majority class instances. The informed approaches try to intelligently retain the most informative majority class instances, in the hope of increasing predictive performance while keeping the fast training speed of the underlying classifier.

The first set of informed undersampling techniques is based on the methods proposed by Zhang and Mani (2003) and Chyi (2003). The K-nearest neighbour 'classifier' is used to determine the importance of each majority class training instance by calculating the total similarity with the K closest minority class training set examples. Regarding similarity computations for binary behaviour data, we refer to the related discussion in Section 3.1. The first technique, called "Closest Knn", retains majority class examples that are closest to the minority class instances. These instances are the most difficult to classify and we would expect them to be the most informative. The second method, called "Closest tot sim", is similar to the previously described technique. The difference is that it no longer computes similarities with the K closest minority neighbours; instead, it calculates the total similarity with all minority instances in determining the importance. The main reason we included this technique is computational speed7. The last techniques, called "Farthest Knn" and "Farthest tot sim", are included for comparison with the previously mentioned techniques. Their implementation is identical to the previously described techniques; however, they retain majority class examples that are farthest from the minority class instances. For each of the proposed methods in this paragraph, the amount of undersampling is controlled by a user-specified parameter β_u according to the following formula:

Nr_rem = ⌊(|X_maj| − |X_min|) × β_u⌋    (2)

where Nr_rem represents the amount of majority class instances to be discarded; β_u = 1 means a completely balanced dataset is obtained.
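As a sketch of the "Closest Knn" selection, assuming the majority-to-minority similarity matrix (cosine or Jaccard) has already been computed; the names and the dense-matrix representation are illustrative only:

```python
import numpy as np

def closest_knn_undersample(S, K, beta_u):
    """Retain the majority instances whose total similarity to their K most
    similar minority training instances is largest (sketch of 'Closest Knn').
    S: (n_maj, n_min) similarity matrix between majority and minority rows."""
    n_maj, n_min = S.shape
    nr_rem = int(np.floor((n_maj - n_min) * beta_u))   # Equation (2)
    # total similarity with the K closest minority neighbours, per majority row
    top_k_sim = np.sort(S, axis=1)[:, -min(K, n_min):].sum(axis=1)
    keep = np.argsort(-top_k_sim)[: n_maj - nr_rem]    # closest instances first
    return np.sort(keep)

# toy example: 5 majority vs 2 minority instances, beta_u = 1 -> retain 2
S = np.array([[0.9, 0.8], [0.1, 0.0], [0.5, 0.4], [0.2, 0.3], [0.0, 0.1]])
kept = closest_knn_undersample(S, K=2, beta_u=1.0)
# rows 0 and 2 have the largest total similarity to the minority class
```

"Farthest Knn" follows by reversing the sort order; "Closest tot sim" replaces the top-K sum with a full row sum, which avoids the per-row sort (footnote 7).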

The second set of informed undersampling techniques aims at targeting the within-class imbalance problem and is based on the approach proposed in Sobhani et al (2015). They postulate that this within-class imbalance problem is more pronounced in the case of undersampling methods. If we were to randomly select majority class instances, then the probability of drawing an instance from small disjuncts within the majority class would be very low. These regions might therefore contain no representatives and remain unlearned. The authors chose to address this issue by clustering the majority class instances in a first step and subsequently selecting an equal number of representatives from each cluster. The reported results show their approach to outperform the CBO-algorithm (see Section 1.1). In the following paragraphs, we will dig deeper into the subject of clustering behaviour data. We already refer to Algorithm 2 for an overview of our cluster-based undersampling method (CBU).

As we noted in Section 2.1, behaviour data can be represented as a bipartite graph. The clustering of behaviour data8 aims at finding groups of nodes (communities) that connect more to each other than to other nodes in the network. This subject is currently an active area of research, with a rapid evolution of a vast number of cluster detection techniques (Zha et al 2001; Dhillon 2001; Larremore et al 2014; Beckett 2016). We refer to Porter et al (2009), Fortunato (2010) and Alzahrani and Horadam (2016) for detailed surveys on the problem. It should be noted that the vast majority of publications deal with the subject of clustering unigraphs (networks with only one type of nodes); it is only fairly recently that interest grew in the clustering of bigraphs. In our implementation we have chosen the popular modularity-based approaches9 for clustering bigraphs, and these fall into two directions. In the first, the modularity function that is used for unigraphs is adapted to be suitable for bigraphs, see for instance the work of Barber (2007). The other direction, which we adopt in our study, projects the bigraph to a unigraph of bottom nodes and performs community detection on the projection using traditional modularity definitions. Note that Guimera et al (2007) observed no difference in the obtained communities using either direction.

7 For each majority class instance, we no longer need to sort the similarities with all minority instances in determining the K largest values.

8 This subject is more commonly known as community detection in bipartite graphs.
9 Modularity-based approaches attempt to optimize a quality function known as modularity for finding community structures in networks, and rely on the use of heuristics due to the complexity of the problem.


In this article we adopt the methodology proposed in Alzahrani and Horadam (2016), which consists of projecting the bigraph to a unigraph of bottom nodes and applying10 the Louvain algorithm (Blondel et al 2008) on the projection. Lancichinetti and Fortunato (2009) performed a comparative study regarding the performance of 12 community detection algorithms and concluded the Louvain method to be the best modularity-based algorithm and second best among all algorithms. The heuristic is very fast, with an O(m) complexity, where m is the number of edges in the unigraph. We have chosen the Louvain algorithm because of its speed performance and the availability of a toolbox (Jutla et al 2011-2016) that is directly compatible with our implementations. The toolbox provides a generalized implementation of the Louvain algorithm, in the sense that multiple definitions of modularity are possible. The quality function we chose is the popular Newman-Girvan modularity (Newman and Girvan 2004).

With respect to the projection, Alzahrani and Horadam (2016) connect two bottom nodes if they have at least one top node in common. The connection weight between two bottom nodes in the projection is set to the number of shared top nodes. In our implementation we adapted the connection weights in accordance with Stankova et al (2015) as follows: first of all, we assign weights to the top nodes corresponding to the hyperbolic tangent applied to the inverse degree of the top node. Next, the connection weight between two bottom nodes in the projection corresponds with the total weight of the shared top nodes. Top nodes having low degrees therefore obtain a higher contribution in the projection (e.g. two users making a transaction to a local book store are assumed to be more closely connected to each other than two users making a transaction to a large retail store).
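This weighting scheme can be sketched as follows, assuming the bigraph is given as a dense 0/1 bottom-by-top matrix (a real implementation would use sparse matrices); `project_bigraph` is a hypothetical name:

```python
import numpy as np

def project_bigraph(A):
    """Project a binary bigraph onto its bottom nodes. Each top node gets
    weight tanh(1/degree); the unigraph weight w_ij is the total weight of the
    top nodes shared by bottom nodes i and j (sketch of the projection above)."""
    top_degree = A.sum(axis=0)                         # degree of each top node
    w_top = np.tanh(1.0 / np.maximum(top_degree, 1))   # low degree -> high weight
    W = (A * w_top) @ A.T                              # sum weights of shared tops
    np.fill_diagonal(W, 0.0)                           # drop self-loops
    return W

# two users sharing a low-degree "local store" vs. a high-degree "big retailer"
A = np.array([[1, 1, 0],    # user 0: local store + big retailer
              [1, 1, 0],    # user 1: local store + big retailer
              [0, 1, 1]])   # user 2: big retailer + another shop
W = project_bigraph(A)
# W[0, 1] > W[0, 2]: the shared low-degree store binds users 0 and 1 strongly
```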

In our CBU-algorithm (see Algorithm 2), after clustering the bigraph containing exclusively majority class instances, we randomly select an equal amount of majority instances from each community to target the within-class imbalance problem. In the rare situation that the number of obtained clusters exceeds the required amount of majority instances (Nr_retain), we sort the communities according to a user-specified parameter Clust_opt and randomly select 1 instance from the first Nr_retain clusters. The parameter Clust_opt can take the following values:

– C_Smallest, where we sort clusters in ascending order of their size
– C_Largest, where we sort clusters in descending order of their size

Note that we randomly select majority class instances from each cluster: Yen and Lee (2009) found a random selection strategy after clustering to be superior to informed approaches based on distance.

10 Note that they also made use of the flow-based algorithm Infomap (Rosvall and Bergstrom 2008), which shows excellent results on the LFR-benchmark.


Algorithm 2 CBU pseudo-code implementation for behaviour data

Input: X_min, X_maj, β_u, Clust_opt
a) Cluster the majority class instances X_maj:
– Assign weights to each top node corresponding with the hyperbolic tangent applied to the inverse of the node's degree.
– Project the bigraph X_maj to a weighted unigraph consisting of bottom node majority class instances. The weight w_ij between majority class instances i and j corresponds with the total weight of the shared top nodes.
– Apply the Louvain algorithm (Blondel et al 2008) on the projected unigraph to partition the majority class instances into clusters.
b) Select majority class instances:
Nr_rem ← ⌊(|X_maj| − |X_min|) × β_u⌋ (see Equation (2))
Nr_retain ← |X_maj| − Nr_rem
if Nr_retain < |Clust| then
    – Sort clusters according to Clust_opt
    – Randomly select 1 instance from the first Nr_retain clusters
else
    – Randomly select ⌈Nr_retain / |Clust|⌉ majority class instances from each cluster
    – Randomly discard instances from the previous step until its size corresponds with Nr_retain
end if
c) Return the new training set consisting of X_min and the selected majority class instances from step b
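Step (b) of Algorithm 2 can be sketched as follows, with the Louvain communities assumed to be given as lists of majority instance ids (function and option names mirror the pseudo-code but are otherwise illustrative):

```python
import math
import random

def cbu_select(clusters, n_min, beta_u, clust_opt="C_Smallest", rng=random):
    """Pick majority instances evenly across communities (step b of CBU)."""
    n_maj = sum(len(c) for c in clusters)
    nr_rem = math.floor((n_maj - n_min) * beta_u)       # Equation (2)
    nr_retain = n_maj - nr_rem
    if nr_retain < len(clusters):
        # more clusters than slots: sort them and take 1 from the first ones
        order = sorted(clusters, key=len, reverse=(clust_opt == "C_Largest"))
        return [rng.choice(c) for c in order[:nr_retain]]
    per_cluster = math.ceil(nr_retain / len(clusters))
    picked = [x for c in clusters
              for x in rng.sample(c, min(per_cluster, len(c)))]
    rng.shuffle(picked)
    return picked[:nr_retain]                           # discard the surplus

# 7 majority instances in 3 communities, 3 minority instances, beta_u = 1
rng = random.Random(1)
sel = cbu_select([[0, 1, 2, 3], [4, 5], [6]], n_min=3, beta_u=1.0, rng=rng)
# retains 3 instances, roughly one per community (within-class coverage)
```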

3.3 Boosting, cost-sensitive learning and EasyEnsemble

3.3.1 AdaBoost

The AdaBoost (Schapire and Singer 1999; Schapire 1999) algorithm has been designed from the perspective of improving the performance of a weak learner so that it achieves accuracies that are comparable with a strong learning algorithm. Fundamental to the idea of boosting is to maintain a weight distribution over the training set. In each boosting iteration, the weights of wrongly classified instances are increased, so that the underlying weak learner puts more emphasis on these hard examples. In our implementation we will consider using a SVM as base learner. SVMs are generally regarded as strong learners. The studies of Wickramaratna et al (2001) and García and Lozano (2007) note that using a strong learner usually results in performance degradation during the boosting process; the RBF-kernel SVM classifier (Wickramaratna et al 2001; Li et al 2008) is used as the underlying classifier to prove their point. In our study we will employ a linear SVM, which can be considered a weaker version compared to the RBF-kernel. Furthermore, the regularization parameter C can be viewed as a "weakness" indicator11: lowering the C-value results in weaker learners, as can be seen from the goal function of the SVM optimization problem (see Equation (1)). We will come back to this point in Section 4.5.

11 The distinction between weak/strong learners is loosely 'defined' in Schapire (1999): a weak learner corresponds with a hypothesis that performs just slightly better than random guessing; a strong learner is able to generate a hypothesis with an arbitrarily low error rate, given enough data. We adopt these definitions, but consider the distinction between weak/strong based on training set error. In a SVM context it is quite typical that error levels on training data drop with increasing C-values (Suykens et al 2002). A learner that is 'too strong' means that even though its performance on training data is very high, it fails to generalize well and the test set error increases due to overfitting.


The boosting algorithm with underlying SVM is presented in Algorithm 3. This algorithm closely follows the original boosting implementation presented in Schapire and Singer (1999) and requires each learner to output confidence-rated predictions in the interval [−1,1]. Since the SVM outputs real-valued scores, we apply the procedure of Platt (1999) to transform these scores into probability estimates (which can easily be translated into confidence-rated predictions). A logistic regression (LR) model (Ng and Jordan 2002) is trained using the SVM-scores as input and the corresponding labels as output. Note that the same data that are used to construct the linear SVM can be used to estimate the LR-model; see Platt (1999) for a motivation.
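The score-to-probability step can be sketched as a one-dimensional logistic regression fitted with plain gradient descent. This is a simplified stand-in for Platt's (1999) procedure (which uses regularized target values and a more careful optimizer); the names and hyperparameters below are our own:

```python
import numpy as np

def fit_score_to_probability(scores, labels, lr=0.5, iters=500):
    """Map SVM decision values to P(y = +1 | score) with a tiny logistic
    regression (a sketch of the LR step used after the linear SVM)."""
    s = np.asarray(scores, dtype=float)
    y = (np.asarray(labels) == 1).astype(float)
    w, b = 0.0, 0.0
    for _ in range(iters):
        p = 1.0 / (1.0 + np.exp(-(w * s + b)))   # predicted probabilities
        w -= lr * np.mean((p - y) * s)           # gradient of the logistic NLL
        b -= lr * np.mean(p - y)
    return lambda new: 1.0 / (1.0 + np.exp(-(w * new + b)))

# separable toy scores: positives score high, negatives score low
to_prob = fit_score_to_probability([2.0, 1.5, -1.0, -2.0], [1, 1, -1, -1])
# 2*p - 1 turns a probability into a confidence-rated prediction in [-1, 1]
h = 2 * to_prob(0.8) - 1
```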

The boosting algorithm requires the weak learner to be trained using a distribution D_t. One could sample from this distribution to generate (unweighted) examples and train a SVM next. We have instead chosen to include weights in the SVM goal function formulation (this requires an extension of the LIBLINEAR package):

min_{w,b,ξ_i}  w^T w / 2 + C Σ_{i=1}^m weight_i ξ_i    (3)

The weights weight_i are set according to the weight distribution D_t(i) in each round of boosting. Note that the C-value that is used to train the SVM model will be divided by mean(weight_i). This normalization allows for a fair comparison between the weighted and unweighted SVM versions (e.g. in the first round of boosting, weight_i = 1/m for all instances; by multiplying the C-value with m, this corresponds to solving the unweighted problem with the same C-value).

We introduce an additional parameter μ, called the weight percentage, with values in [0,100], in the boosting algorithm of Algorithm 3. This parameter controls the amount of training data that is used to construct the SVM model and subsequent LR model. We sort the original training data in descending order according to the distribution D_t. Next, we form a new training set of minimal cardinality by including points from the original sorted training data until the total weight is higher than the weight percentage μ/100. This way, the newly formed training data will contain only the part of the original training data that has the most weight. This partial set will then be used to construct a weighted SVM-model (according to Equation (3), with updated distribution for this set) and a subsequent LR-model. The idea of using a partial dataset to construct the base learner not only reduces training times, but also weakens the learner (García and Lozano 2007).
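The weight-percentage selection can be sketched as follows (a hypothetical helper; the strict "higher than μ/100" rule above determines the prefix length):

```python
import numpy as np

def weight_percentage_subset(D, mu):
    """Return the minimal set of training indices whose cumulative weight
    under the boosting distribution D strictly exceeds mu/100 (sketch)."""
    order = np.argsort(-D)                    # heaviest instances first
    cum = np.cumsum(D[order])
    # smallest prefix whose total weight is strictly above mu/100
    cutoff = int(np.searchsorted(cum, mu / 100.0, side="right")) + 1
    return order[:min(cutoff, len(D))]

D = np.array([0.4, 0.3, 0.2, 0.1])            # a boosting distribution D_t
idx = weight_percentage_subset(D, mu=75)      # 0.4 + 0.3 <= 0.75, so 3 needed
# idx -> indices [0, 1, 2], together carrying 90% of the total weight
```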

In Algorithm 3 we have an explicit check to verify if r_AB = 1. In this case the SVM model outputs scores that allow for a perfect classification on the training set. The subsequent LR model will find a threshold and output a value of −1 if the SVM score is lower than this threshold (a value of +1 if the score is higher than this threshold). In this situation the training data would be perfectly classified. In our implementation we attempt to avoid this kind of behaviour because it can lead to overfitting, meaning it might pinpoint the wrong threshold and make too drastic decisions (we lose the meaning of confidence). The check if r_AB ≤ 0 verifies if the currently boosted model is performing worse than random (such a model would have a r_AB-value of 0); obviously, if the model performs worse than random, we quit the boosting process. During the first round of boosting we perform similar checks that are not explicitly indicated in Algorithm 3: in the case where r_AB = 1, we output the SVM scores instead of the LR binary values; when r_AB ≤ 0, we quit the boosting process and output the LR scores.

Algorithm 3 AdaBoost with a SVM-LR combination as a base learner

Input: (X,Y) = {(x_1,y_1), …, (x_m,y_m)}, C, T, μ
Initialize distribution D_1(i) = 1/m
for t = 1 to T do
– train the weak learner using distribution D_t. The weak learner consists of a weighted linear SVM and a LR model, trained with weight percentage μ of D_t:
    h_t ← Train_WeakLearner(X, Y, D_t, C, μ)
– compute the weighted confidence r_AB on the training data:
    r_AB ← Σ_{i=1}^m D_t(i) y_i h_t(x_i)
    If (r_AB = 1 or r_AB ≤ 0) then α_t ← 0 and stop the boosting process
– choose α_t ∈ ℝ:
    α_t ← (1/2) log((1 + r_AB) / (1 − r_AB))
– update the distribution:
    D_{t+1}(i) ← D_t(i) exp(−α_t y_i h_t(x_i)) / Z_t
    where Z_t is a normalization factor (chosen so that D_{t+1} will be a distribution)
end for
Output the final hypothesis (the output score is the term contained in the sign function):
    H(x) = sign(Σ_{t=1}^T ᾱ_t h_t(x))  with  ᾱ_t = α_t / Σ_{i=1}^T α_i
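One round of this confidence-rated scheme, including the r_AB checks, can be sketched as follows (the weak learner's confidences h(x_i) ∈ [−1,1] are assumed given; names are illustrative):

```python
import numpy as np

def boost_round(D, y, h):
    """One AdaBoost round as in Algorithm 3. D: weight distribution, y: labels
    in {-1,+1}, h: confidence-rated predictions in [-1,+1] on the training
    set. Returns (alpha_t, D_next); alpha_t = 0 signals that boosting stops."""
    r_ab = np.sum(D * y * h)                   # weighted confidence r_AB
    if r_ab >= 1.0 or r_ab <= 0.0:             # perfect, or worse than random
        return 0.0, D
    alpha = 0.5 * np.log((1 + r_ab) / (1 - r_ab))
    D_next = D * np.exp(-alpha * y * h)
    return alpha, D_next / D_next.sum()        # divide by Z_t to renormalize

y = np.array([1.0, 1.0, -1.0, -1.0])
h = np.array([0.8, -0.2, -0.9, 0.1])           # one mistake on each class
alpha, D2 = boost_round(np.full(4, 0.25), y, h)
# the misclassified instances (indices 1 and 3) gain weight in D2
```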

3.3.2 AdaCost

The AdaCost algorithm (Fan et al 1999) is a variant of cost-sensitive learning where misclassification costs are introduced in the weight-update formula of AdaBoost. The cost-sensitive update rule increases the weights of costly misclassified instances more aggressively and decreases the weights of costly correct classifications more conservatively. Each instance is given a misclassification cost c_i, where we chose to put c_i = 1 for positive (minority) instances and c_i = 1/R for negative (majority) instances; R is a user-defined value that allows one to put more emphasis on the minority class. The implementation of AdaCost is similar to Algorithm 3, yet there are a few differences (Fan et al 1999). Firstly, the initial distribution is chosen as D_1(i) = c_i / Σ_{j=1}^m c_j. Secondly, the weight-update rule is given by D_{t+1}(i) = D_t(i) exp(−α_t y_i h_t(x_i) β(i)) / Z_t, where β(i) = −0.5 sign(y_i h_t(x_i)) c_i + 0.5 is a cost-adjustment function. Finally, the choice of α_t is given by α_t = (1/2) log((1 + r_AC) / (1 − r_AC)), where r_AC = Σ_{i=1}^m D_t(i) y_i h_t(x_i) β(i). Note that the checks to stop the boosting process prematurely (see the second bullet in Algorithm 3) are still based on the r-value obtained from AdaBoost (r_AB). This is because β(i) ∈ [0,1].
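The AdaCost modifications can be sketched analogously (illustrative names; the toy confidences are chosen so that r_AC stays positive):

```python
import numpy as np

def adacost_round(D, y, h, c):
    """One AdaCost round: the cost-adjustment function beta(i) makes costly
    mistakes gain weight faster (sketch of the update rules above).
    c: per-instance misclassification costs in [0, 1]."""
    beta = -0.5 * np.sign(y * h) * c + 0.5     # beta(i) in [0, 1]
    r_ac = np.sum(D * y * h * beta)
    alpha = 0.5 * np.log((1 + r_ac) / (1 - r_ac))
    D_next = D * np.exp(-alpha * y * h * beta)
    return alpha, D_next / D_next.sum()

# costs: 1 for minority (+1) instances, 1/R for majority (-1) instances
R = 4.0
y = np.array([1.0, 1.0, -1.0, -1.0])
h = np.array([-0.1, 0.9, -0.9, -0.9])          # only instance 0 is wrong
c = np.array([1.0, 1.0, 1 / R, 1 / R])
alpha, D2 = adacost_round(np.full(4, 0.25), y, h, c)
# the misclassified minority instance (index 0) receives the largest weight
```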

In the first boosting round of AdaCost, the weighted SVM formulation (3) is equivalent to solving a SVM formulation with the following goal function:

min_{w,b,ξ_i}  w^T w / 2 + C^+ Σ_{i|y_i=1} ξ_i + C^− Σ_{i|y_i=−1} ξ_i    (4)

where C^+ / C^− = R. This can be seen as a cost-sensitive version of a SVM, an idea initially proposed by Veropoulos et al (1999).

3.3.3 EasyEnsemble

One of the disadvantages of the random undersampling method is the fact that we are discarding potentially valuable information. EasyEnsemble (Liu et al 2009) is a method that combines several balanced subsets (S in total), each containing randomly selected majority class instances together with all minority examples. Each subset contains the same amount of instances from both classes and is fed to the boosting algorithm presented in Algorithm 3. Afterwards, the weak learners h_{s,t} of each subset s are simply combined to form the final ensemble:

H(x) = sign(Σ_{s=1}^S Σ_{t=1}^T ᾱ_{s,t} h_{s,t}(x))  with s = 1,…,S and t = 1,…,T    (5)

It is clear that this technique benefits from a combination of bagging and boosting (Liu et al 2009). Note that we apply the same boosting algorithm as previously described to each of the balanced subsets. However, when r_AB = 1 in the first round of boosting, we quit the boosting process, put α_1 = 1 and continue to use the trained LR-model in the final ensemble. It was previously noted that this can cause overfitting, in the sense that it can pinpoint a wrong threshold. However, the LR-model still contains information, and the combination thereof with the models obtained from the other subsets remains valuable (as we have noted from initial experiments by comparing the situation where we include or reject those subsets).

There are a few subtle though important differences with respect to the experiments performed by Liu et al (2009). First of all, the authors use a CART-algorithm (Breiman et al 1984) as base learner; we employ a linear SVM with subsequent LR as a weak learner. To our knowledge, this combination has not been proposed elsewhere, yet proves to be very efficient in this setting. Secondly, their methodology employs a discrete version of AdaBoost, where each weak learner outputs binary values in {−1,1}. As already stated, we make use of an improved version of AdaBoost that relies on confidence-rated predictions (Schapire and Singer 1999). Finally, the authors reported performances with fixed levels of S = 4 and T = 10; we will investigate the effect of varying S and T levels.
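The overall structure, with the boosted SVM-LR learner abstracted behind a train_boosted callback, can be sketched as follows (all names are illustrative; the toy threshold learner below is only a stand-in for Algorithm 3):

```python
import random

def easy_ensemble_fit(X_min, X_maj, S, train_boosted, rng=random):
    """EasyEnsemble skeleton: boost S random balanced subsets and pool all
    weak hypotheses into the single ensemble of Equation (5). train_boosted
    must return a list of (alpha, h) pairs with h a scoring function."""
    ensemble = []
    for _ in range(S):
        subset_maj = rng.sample(X_maj, len(X_min))   # subset size = 2*|X_min|
        ensemble.extend(train_boosted(X_min, subset_maj))
    return lambda x: sum(alpha * h(x) for alpha, h in ensemble)

# toy stand-in learner: a single hypothesis thresholding on the class means
def toy_boost(pos, neg):
    thr = (sum(pos) / len(pos) + sum(neg) / len(neg)) / 2
    return [(1.0, lambda x, t=thr: 1.0 if x > t else -1.0)]

rng = random.Random(0)
H = easy_ensemble_fit([5.0, 6.0], [1.0, 2.0, 1.5, 0.5], S=3,
                      train_boosted=toy_boost, rng=rng)
# H(x) > 0 for minority-like points and H(x) < 0 for majority-like points
```

Because the S subsets are drawn independently, both training and scoring parallelize trivially, which underlies the speed advantage claimed for the method.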


4 Results and discussion

4.1 Datasets

Stankova et al (2015) provide the first large collection of benchmark behaviour datasets for classification. In our experiments we make use of these data sources and extend this repository with two additional datasets. Each of these datasets shows a bipartite structure with a clear target variable to predict. We refer to this study and the next paragraph for a short description of the available data resources. In this paragraph we indicate why we have chosen to include or reject certain data sources from the aforementioned study. The available datasets can be divided into small, medium and large datasets, based on the number of instances and the number of features present. The Norwegian companies and Reality Mining datasets comprise the small datasets, since they contain only a few hundreds of instances or features and are therefore regarded as impractical for our purposes. The MovieLens, Yahoo, TaFeng, Book-Crossing and LibimSeTi datasets belong to the medium-sized datasets, each containing a few thousands up to a few hundreds of thousands of instances and features; all of these datasets are included in our study. The large datasets, containing hundreds of thousands up to millions of instances and features, are the Flickr and Kdd databases12. Other large proprietary data sources, not included in Stankova et al (2015), are the corporate residence fraud (CRF) and banking (Bank) datasets, which arise from real-life application domains with intrinsic imbalance.

To summarize, we have gathered datasets containing such fine-grained behaviour data from a wide variety of application domains. The MovieLens datasets, for which we are predicting the gender13 (Mov G) or the genre thriller14 (Mov Th), provide data on which films each user has rated. The Yahoo movies15 dataset has a similar structure, where the age of each user, Yahoo A (above or below average), or the gender, Yahoo G, is being predicted. The TaFeng dataset16 contains data on shopping behaviour, where age (below or above average) is being predicted based upon which products are purchased. In the Book-Crossing (Book) dataset (Ziegler et al 2005), users rate books and the age of the user (above or below average) is being predicted. LibimSeTi (LST) contains data from a dating site (Brozovsky and Petricek 2007), where users rate each other's profiles and gender is being predicted. In the advertisement (Adver) (Lichman 2013) dataset, we try to predict if a url is an advertisement, based on a large variety of binary features of the url. Note that this dataset does not arise from the behaviour of entities, yet it still has a high-dimensional and sparse representation. The Flickr dataset (Cha et al 2009) contains pictures being marked by users as favorite, and we predict the number of comments on each picture (below or above average). In the Kdd cup data, the performance of a student on a test is being predicted, based on artificially created binary features (Yu et al 2010). The corporate residence fraud dataset (CRF) (Junque de Fortuny et al 2014b) contains data on foreign companies making transactions with specific Belgian companies, where we try to predict whether the foreign company commits residence fraud (a type of fiscal fraud). Finally, the banking dataset (Bank) (Martens et al 2016) contains detailed behaviour on consumers making transactions with merchants or other persons, to predict interest in a pension fund product. Some characteristics of these datasets can be found in Table 2; the features column only shows the number of active features17.

12 Flickr and KDD will be excluded in the comparative study of Section 4.6. This is because some methods are too computationally intensive - especially in combination with the large number of possible parameter combinations - to be applied on these very large data sources. Furthermore, our statistical evidence is already sufficiently strong to conclude significance without these datasets. Having said this, these data sources will be included in the analysis of Section 5.

13 MovieLens 1M dataset from http://grouplens.org/datasets/movielens
14 MovieLens 10M dataset from http://grouplens.org/datasets/movielens
15 https://webscope.sandbox.yahoo.com
16 http://www.bigdatalab.ac.cn/benchmark/bm/dd?data=Ta-Feng

Table 2 Behaviour data characteristics. The final column shows the imbalance ratio p, defined as the ratio of the number of minority class instances to the amount of majority class instances in the training set, expressed as a percentage. See Section 4.2 for details regarding p.

Name      |X_maj|    |X_min|    Features    p = 100 × |X_min|train / |X_maj|train
Mov G     4331       1709       3706        1 & 25
Mov Th    10546      131        69878       1.24 (p = [])
Yahoo A   6030       1612       11915       1 & 25
Yahoo G   5436       2206       11915       1 & 25
TaFeng    17330      14310      23719       1 & 25
Book      42900      18858      282973      1 & 25
LST       59702      60145      166353      1
Adver     2792       457        1555        16.38 (p = []) & 1
CRF       869071     62         108753      0.0072 (p = [])
Bank      1193619    11107      3139570     0.93 (p = [])
Flickr    8166814    3028330    497472      0.1
Kdd       7171885    1235867    19306083    0.5

4.2 Methodology

Regarding the experiments performed in the upcoming sections, we applied a tenfold cross-validation procedure. Each of the folds contains 80% training data, 10% validation data and 10% test data. Note that these percentages are valid for both the majority class and the minority class (stratified sampling). As can be seen from Table 2, some datasets are balanced in nature. We created artificial imbalance for these datasets by removing minority class instances from the initial training set according to a user-defined parameter p. We ensured that the amount of minority training instances corresponds to p percent of the majority class training size: |X_min|train = (p/100) × |X_maj|train. As an example, say that we are using the Book dataset with p = 25. In that case, we know that the majority class contains 34320 training instances (80% of |X_maj| = 42900); the minority training data would then contain 8580 instances (25% of 34320). When the dataset is already imbalanced, we define p = [], which means that no downsampling of the minority class training data is performed. Note that the validation and test data are left untouched18.

17 Active features represent features that are present for at least one instance in the dataset. A non-active feature corresponds with a column of zeros in the matrix representation and would not contribute to the model.
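The sampling arithmetic can be checked with a small helper (hypothetical name), reproducing the Book example from the text:

```python
def training_counts(n_maj, n_min, p):
    """80% of each class forms the training fold; if an imbalance ratio p is
    given, the minority training set is downsampled to p percent of the
    majority training size (p = None reproduces p = [], no downsampling)."""
    maj_train = int(0.8 * n_maj)
    min_train = int(0.8 * n_min) if p is None else int(p / 100 * maj_train)
    return maj_train, min_train

# the Book dataset with p = 25, as in the example above
maj_train, min_train = training_counts(42900, 18858, 25)
# -> 34320 majority and 8580 minority training instances
```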

The methods detailed in Section 3 are applied on the training data. For both the under- and oversampling approaches, a linear SVM is trained on the newly created balanced training data, with regularization parameter C having values

C = [10^−7, 10^−5, 10^−3, 10^−1, 10^0]

The validation data are used for parameter tuning purposes; the test data allow us to obtain the generalization performance. The results reported show the average over ten folds. In the remaining paragraphs of this section, we will describe the various parameter settings used in our experiments and give a brief overview of the parameters occurring in each method.

Considering the oversampling techniques, the parameter settings are as follows:

β = [0, 1/3, 2/3, 1]
prior_opt = {FlipCoin, Reverse Prior}
sim_measure = {Cosine, Jaccard}
K = K̄ = [10^0, 10^1, 10^2, |X_min|train]

We didn't include the "Prior" option, due to initial experiments showing a lower performance in comparison with the other options. This can be explained by the low priors occurring in each column, resulting in synthetic samples that mainly show zeros in 0-1 match situations. The oversampling with replacement method (OSR) only uses the β parameter, SMOTE uses all but the K parameter, and ADASYN uses all of the parameters listed above.

With respect to the undersampling techniques, the following parameter settings are used:

β_u = [0, 1/4, 1/2, 3/4, 1]
sim_measure = {Cosine, Jaccard}
K = [10^0, 10^1, 10^2, |X_min|train]
Clust_opt = {C_Smallest, C_Largest}

The random undersampling (RUS) technique only uses the β_u parameter. The second set of methods, "Closest tot sim" and "Farthest tot sim", uses β_u and sim_measure. The third set of techniques, "Closest Knn" and "Farthest Knn", makes use of all the parameters listed above, except for Clust_opt. The final approach, CBU, employs β_u and Clust_opt.

The boosting variants (AdaBoost, AdaCost and EasyEnsemble) presented in Section 3.3 make use of the following settings:

T = 30
μ = [100, 75]
C = [10^−7, 10^−5, 10^−3, 10^−1]

18 This means that if we start from a balanced set, only the training data will show artificial imbalance according to the imbalance ratio p; the validation and test data would remain balanced. Since AUC (and some other metrics) is independent of class skew, it would be unwise to make these sets imbalanced as well, because that would lead to discarding minority class instances that are relevant for performance assessment.


R = [2, 8, RL], where RL = |X_maj|train / |X_min|train
S = [5, 10, 15]

The AdaBoost algorithm includes the T, μ and C parameters. AdaCost additionally uses cost-ratios R. We have chosen a range of values because misclassification costs are unknown for many business applications (He and Garcia 2009; Fan et al 1999; Sun et al 2007). The final value RL seems to be a popular choice (Akbani et al 2004; Luts et al 2010), because the total weight on the majority class then balances with the total weight on the minority class. The final method, EasyEnsemble, uses S subsets in addition to the parameters previously mentioned for AdaBoost. Note that we consider the boosting iteration round t ∈ [1,T] as a tunable parameter19.

4.3 Oversampling

The oversampling techniques presented in Section 3.1 are applied to each of the datasets from Table 2, with varying imbalance levels (by varying the p-parameter). The experiments we conducted use the following methodology: for each dataset, we apply the oversampling techniques with all possible parameter combinations on the training data to create newly balanced datasets, after which linear SVMs are trained. The optimal parameter combination with respect to each imbalance ratio β is selected based on validation set AUC-performance; with these parameters, the AUC on the test data is obtained. Results show the average over ten folds.

The results on four arbitrarily selected datasets are shown in Table 3; full results on each of the data sources can be found in Appendix A, Table 11. From these tables we can conclude that performance generally increases with growing β-values. It is more correct to say that performance keeps improving with growing β-levels until an optimal point β* is reached; increasing the balance level after this optimal value will cause only small fluctuations with respect to the optimal performance. Traditional studies dealing with dense, low-dimensional data note that the OSR method can suffer from overfitting, as already mentioned in Section 3.1. It is interesting to see that in this sparse, high-dimensional setting this effect doesn't seem to occur20. Furthermore, the computationally expensive synthetic sampling approaches do not seem to improve performance over the simpler OSR method. This can be explained by the fact that many instances have a very limited amount of non-zero elements (Junque de Fortuny et al 2014a; Stankova et al 2015), causing the synthetic sampling procedures to be limited in the amount of unique new samples they can produce. Note also that synthetic sampling procedures have historically been designed to overcome the overfitting behaviour of OSR, which no longer seems to apply here.

19 We sometimes make use of T as a symbol to indicate the boosting iteration number instead of t.
20 The SVM regularization parameter C controls for overfitting. Additionally, the influence of majority class noise/outliers (see Section 3.2) on the learned hyperplane decreases as we oversample the minority class. This hyperplane is more sensitive towards minority instances, and a change in its direction/orientation towards these majority class noise/outliers will be more heavily penalized.

Imbalanced classification in sparse and large behaviour datasets 23

Table 3 Oversampling experiments. Results show the average tenfold AUC test set performance with respect to increasing β-values (β = [β1, β2, β3, β4] = [0, 1/3, 2/3, 1]). Additionally, standard deviations are shown between brackets. Optimal performances for each method are highlighted in boldface. See Section 4.2 for parameter settings.

Mov Th(p = [])
          β1             β2             β3             β4
OSR       79.77 (5.33)   85.3 (4.66)    83.16 (4.5)    84.59 (5.69)
SMOTE     79.77 (5.33)   84.18 (6.51)   85.58 (5.97)   84.33 (5.75)
ADASYN    79.77 (5.33)   84.11 (6.77)   85.86 (6.06)   85.36 (5.13)

Yahoo G(p = 25)
          β1             β2             β3             β4
OSR       78.82 (1.39)   78.78 (2.02)   78.74 (1.86)   78.35 (1.47)
SMOTE     78.82 (1.39)   79.23 (1.57)   79.1 (1.2)     79.03 (1.89)
ADASYN    78.82 (1.39)   79.12 (1.43)   79.23 (1.37)   79.51 (2.01)

TaFeng(p = 25)
          β1             β2             β3             β4
OSR       66.94 (1.34)   68.84 (1.21)   66.99 (1.42)   67.7 (1.41)
SMOTE     66.94 (1.34)   68.47 (1.5)    67.07 (1.15)   66.65 (0.81)
ADASYN    66.94 (1.34)   68.62 (1.38)   67.85 (1.6)    66.91 (1.39)

Book(p = 25)
          β1             β2             β3             β4
OSR       60.08 (0.71)   61.05 (0.94)   61.82 (1.12)   64.62 (0.57)
SMOTE     60.08 (0.71)   62.6 (0.73)    60.95 (0.68)   63.0 (0.8)
ADASYN    60.08 (0.71)   62.33 (0.73)   60.77 (0.85)   63.04 (0.58)

4.4 Undersampling

Regarding the undersampling techniques, we employ a similar experimental set-up as in the previous section: firstly, train SVMs on the undersampled training data with all possible parameter combinations; secondly, choose a suitable parameter combination based on validation set AUC-performance; and finally, obtain the AUC-performance on the test set. The results are averaged across ten folds.

Before going into detail on the experimental findings, we give a short note on the effect of noise/outliers on SVM performance. Consider the following imaginary example: say that a majority class female is rating a lot of action films and thrillers. We can consider this an outlier, since most of the females are rating romantic or drama films; this specific female has far more in common with the minority class males, who also rate action films and thrillers. Outliers are therefore instances showing odd behaviour, whereas noise refers to wrongfully labelled instances. The effect of noise/outliers on SVM performance can be severe. Indeed, many of the instances in the dataset have a support value (dual variable αi) of 0. Instances contributing to the model (a non-zero support value) are examples close to the borderline (Suykens et al 2002) and noise/outliers (as can be derived from the KKT dual-complementarity conditions^21).

21 If αi = 0, then yi(w^T xi + b) ≥ 1. For noise/outliers the term yi(w^T xi + b) is negative, hence αi ≠ 0.

24 Jellis Vanhoeyveld David Martens

With respect to noise/outliers, the support value can reach a maximal value of C, the chosen regularization parameter in the SVM formulation. It should be clear that lowering the C-value leads to models that are less sensitive to noise/outliers.
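This bound on the support values can be observed directly: in scikit-learn, `SVC` exposes α_i·y_i through `dual_coef_`, whose magnitudes never exceed C, and a mislabelled (noise) instance ends up as a bounded support vector. A toy illustration (not the paper's data):

```python
import numpy as np
from sklearn.svm import SVC

rng = np.random.default_rng(1)
X = np.vstack([rng.normal(-2, 0.5, (20, 2)), rng.normal(2, 0.5, (20, 2))])
y = np.array([0] * 20 + [1] * 20)
y[0] = 1  # one wrongly labelled instance deep inside the class-0 cloud

C = 0.1
clf = SVC(kernel="linear", C=C).fit(X, y)
alphas = np.abs(clf.dual_coef_).ravel()  # sklearn stores alpha_i * y_i
print(alphas.max() <= C + 1e-12)  # True: support values are capped at C
print(0 in clf.support_)          # True: the mislabelled point is a support vector
```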

The experimental outcomes of the various undersampling processes are shown in Table 4 for the same four previously selected datasets. We refer to Appendix B, Table 12 for results on the entire data repository. When removing only a limited amount of majority class instances (βu = βu2 = 1/4), we observed that in 12 out of 16 datasets the "Farthest Knn" method outperforms the "Closest Knn" technique; the four remaining datasets show comparable performances.^22 This finding shows that the "Farthest" method is very suitable for removing majority class noise/outliers, and empirically shows their performance degrading effect. With higher undersampling rates (βu = 1), the "Closest Knn" method generally achieves higher performances than the "Farthest Knn" technique, though the results are less clear. We observed that only 9 out of 16 datasets indicate "Closest" as a clear winner, with 3 tie situations and 4 situations where "Farthest" shows the best results. Intuitively, we would expect the "Closest" method to perform better, since we are keeping the most informative instances (examples close to the minority class). However, this technique will also emphasize noise/outliers, and this is the main reason why the results are less significant. One can also see this from the fact that the RUS method shows far better results in comparison to the aforementioned techniques when βu = 1, an observation that is valid for all datasets. The RUS method is efficient in realizing noise/outlier reduction effects, since each SVM is trained with only a fraction of these types of instances. It is remarkable that in 9 out of 16 datasets the RUS method with βu = 1 outperforms the baseline model (βu = 0); in 4 datasets we observed equal performance, and in 3 datasets losses with respect to the baseline. RUS has two attractive features: majority class noise/outlier removal and its ability to put more emphasis on the minority class instances by undersampling the majority class. A general downside of the approach is the information loss, and this effect causes the 3 losses to the baseline. Usually, though, the attractive features of RUS dominate the latter effect, and this indicates that it is definitely not necessary to include all majority class training instances in constructing the predictive model. Apparently there is a high level of redundancy, which can be exploited to construct efficient hypotheses.
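The "Closest/Farthest Knn" heuristics discussed above can be sketched as follows; this is our reconstruction of the idea (rank majority instances by their mean distance to the k nearest minority instances), not the authors' exact implementation:

```python
import numpy as np

def knn_undersample(X_maj, X_min, n_remove, k=3, mode="farthest"):
    # mean Euclidean distance from each majority instance to its
    # k nearest minority instances
    d = np.linalg.norm(X_maj[:, None, :] - X_min[None, :, :], axis=2)
    mean_knn = np.sort(d, axis=1)[:, :k].mean(axis=1)
    order = np.argsort(mean_knn)  # closest-to-minority first
    drop = order[:n_remove] if mode == "closest" else order[-n_remove:]
    keep = np.setdiff1d(np.arange(len(X_maj)), drop)
    return X_maj[keep]

X_maj = np.array([[0.0, 0], [5, 5], [1, 1], [10, 10]])
X_min = np.array([[0.5, 0.5], [1.5, 1.5]])
# "farthest" removes the outlying majority point at (10, 10) first
print(len(knn_undersample(X_maj, X_min, n_remove=1, mode="farthest")))  # 3
```

The "farthest" variant prunes majority instances showing odd behaviour (potential noise/outliers), while the "closest" variant keeps only the borderline, most informative ones.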

In the next paragraph we compare the cluster-based undersampling (CBU) approach to the random undersampling (RUS) technique. Before proceeding, we should note that Table 12 does not show results on the LST, CRF and Bank datasets, because the projected unigraph was too large to fit in memory (larger than 16 GB). A limited number of features (top nodes) does seem to be active for a relatively large number of instances (bottom nodes); all these instances will be connected in the unigraph projection, resulting in relatively high degrees for the bottom nodes. A possible solution to circumvent this problem might be to cut the weakest edges in the projected unigraph.

If we apply a limited amount of undersampling (βu = βu2 = 1/4), the CBU technique outperforms RUS in 8 out of 13 datasets (4 losses and 1 tie with respect to RUS). On the highly imbalanced datasets (p = 1 or p = []), the CBU method wins

22 A tie occurs in the situation where the absolute difference in AUC is smaller than or equal to 0.5.


in 8 out of 8 cases. We can therefore conclude that CBU outperforms RUS at low undersampling rates. This indeed shows that handling the within-class imbalance can be beneficial in this environment. If we make a comparison at the highest undersampling rate (βu = βu5 = 1), the results are completely opposite: CBU wins in only 1 case, with 9 losses with respect to RUS and 3 ties. On the highly imbalanced datasets, CBU outperforms RUS for 1 dataset (with 4 losses and 3 ties). It is clear that RUS is the preferred technique in case of high undersampling rates. The reason is that CBU emphasizes the small disjuncts and basically fails to adequately pick up on the common, larger communities. Furthermore, these small disjuncts might actually correspond with majority class noise/outliers, and CBU is focusing on these types of instances. On the overall level, where we consider all undersampling rates (except βu = 0), CBU wins in 5 out of 13 datasets, RUS wins in 4 cases, and there are 4 tie situations (yet RUS did outperform CBU in all 4 cases). Both methods seem to be competitive with one another across a wide range of imbalance levels. If we pay particular attention to the highly imbalanced sets, CBU wins in 5 cases and there are 3 tie situations on the remaining sets (yet RUS did outperform CBU in all 3 cases). CBU seems to be the method of choice (with respect to RUS) when the datasets show high imbalance levels.
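For concreteness, a hedged sketch of cluster-based undersampling: cluster the majority class and draw the retained sample evenly across clusters, so that small communities (disjuncts) stay represented. The paper clusters a unigraph projection of the behaviour data; here we simply use k-means on feature vectors:

```python
import numpy as np
from sklearn.cluster import KMeans

def cluster_undersample(X_maj, n_keep, n_clusters=5, seed=0):
    """Keep roughly n_keep majority instances, spread evenly over clusters."""
    rng = np.random.default_rng(seed)
    labels = KMeans(n_clusters=n_clusters, n_init=10,
                    random_state=seed).fit_predict(X_maj)
    keep = []
    per_cluster = n_keep // n_clusters
    for c in range(n_clusters):
        idx = np.where(labels == c)[0]
        take = min(per_cluster, len(idx))  # small clusters contribute all they have
        keep.extend(rng.choice(idx, size=take, replace=False))
    return X_maj[np.array(keep)]

X_maj = np.random.default_rng(0).normal(size=(200, 10))
print(cluster_undersample(X_maj, n_keep=50).shape[0] <= 50)  # True
```

Sampling evenly across clusters is what makes CBU emphasize small disjuncts; at high undersampling rates this is exactly why it can latch onto noise/outliers, as discussed above.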

Table 4 Undersampling experiments. Results show the average tenfold AUC test set performance with respect to increasing βu-values (βu = [βu1, βu2, βu3, βu4, βu5] = [0, 1/4, 1/2, 3/4, 1]). Optimal performances (boldface) are highlighted for each imbalance ratio. Additionally, standard deviations are shown between brackets. Cl K represents the "Closest Knn" technique; Cl T represents the "Closest tot sim" technique (similarly for Far K and Far T, see Section 3.2). Parameter settings are given in Section 4.2.

Mov Th(p = [])
        βu1           βu2           βu3           βu4           βu5
RUS     79.77 (5.3)   80.32 (5.8)   81.57 (5.5)   81.86 (6.6)   81.26 (6.2)
Cl K    79.77 (5.3)   79.25 (4.5)   78.07 (5.0)   76.25 (6.5)   62.46 (8.5)
Cl T    79.77 (5.3)   78.4 (4.4)    72.41 (3.5)   64.66 (4.5)   60.37 (7.3)
Far K   79.77 (5.3)   84.54 (5.0)   83.64 (6.4)   80.02 (7.3)   56.82 (10.3)
Far T   79.77 (5.3)   85.03 (5.7)   82.68 (6.8)   75.61 (9.2)   56.77 (10.9)
CBU     80.11 (5.8)   81.17 (6.0)   81.08 (6.5)   84.17 (5.1)   80.96 (6.9)

Yahoo G(p = 25)
        βu1           βu2           βu3           βu4           βu5
RUS     78.82 (1.4)   78.91 (1.6)   78.97 (1.6)   78.61 (1.6)   77.82 (2.1)
Cl K    78.82 (1.4)   77.26 (1.5)   72.52 (1.5)   67.86 (2.0)   65.07 (2.7)
Cl T    78.82 (1.4)   76.83 (1.0)   71.99 (1.8)   67.15 (2.3)   61.1 (2.7)
Far K   78.82 (1.4)   78.26 (2.2)   74.69 (2.7)   67.22 (2.1)   60.72 (2.3)
Far T   78.82 (1.4)   77.68 (2.6)   72.44 (3.0)   64.94 (2.4)   59.6 (2.0)
CBU     75.25 (3.2)   75.22 (2.4)   74.69 (2.3)   73.07 (2.4)   70.69 (2.4)

TaFeng(p = 25)
        βu1           βu2           βu3           βu4           βu5
RUS     66.94 (1.3)   67.44 (1.3)   68.1 (1.4)    68.27 (1.4)   66.13 (1.2)
Cl K    66.94 (1.3)   66.13 (1.4)   63.39 (1.2)   59.83 (1.3)   56.94 (0.7)
Cl T    66.94 (1.3)   66.38 (1.5)   62.89 (1.6)   57.46 (1.3)   54.56 (1.3)
Far K   66.94 (1.3)   68.06 (1.4)   66.43 (1.6)   64.46 (1.5)   63.35 (1.3)
Far T   66.94 (1.3)   64.31 (1.1)   62.69 (1.0)   61.27 (1.1)   59.03 (1.0)
CBU     64.81 (1.2)   64.15 (1.1)   64.13 (1.2)   63.88 (0.8)   63.46 (0.8)

Book(p = 25)
        βu1           βu2           βu3           βu4           βu5
RUS     60.08 (0.7)   60.13 (0.6)   60.4 (0.8)    60.33 (0.8)   63.28 (0.8)
Cl K    60.08 (0.7)   59.96 (0.7)   60.13 (0.8)   59.96 (1.0)   59.28 (0.7)
Cl T    60.08 (0.7)   59.96 (0.7)   60.13 (0.8)   60.29 (0.4)   54.5 (0.9)
Far K   60.08 (0.7)   63.29 (1.0)   64.19 (0.8)   57.3 (1.1)    55.66 (1.1)
Far T   60.08 (0.7)   62.14 (0.5)   58.27 (0.6)   56.37 (1.0)   55.66 (1.1)
CBU     54.82 (0.9)   54.67 (0.9)   54.71 (0.9)   54.66 (1.0)   54.78 (0.9)

4.5 Boosting variants

In this section, the plain AdaBoost (AB) method is compared with the cost-sensitive AdaCost (AC) algorithm, the EasyEnsemble (EE) technique and the baseline (BL) method (where we train a single SVM on the imbalanced training data). AB, AC and EE have an underlying boosting process in common. The AUC-values are stored during each boosting iteration, which allows us to show the performance graphically as a function of the number of iterations t. In EasyEnsemble (EE), we combine the weak learners of each subset by summing their individual contributions. For example, when t = 2 boosting iterations are performed, we evaluate the performance of the combined learner Σ_{s=1}^{S} Σ_{t=1}^{2} α_st h_st(x). Figures 1, 2, 3 and 4 show the average tenfold AUC-performance (with µ = 100) on the test data. The left-side figures show the AB, AC (with varying R-values) and EE (with varying S-values) performance with respect to the number of boosting iterations; the C-value is tuned according to the highest validation set AUC-performance (over all possible boosting iterations). The right-side figures investigate the influence of the C-parameter for AB and EE (with S = 15) and allow us to gain insight into the effect of weak and strong learners on the generalization performance. Results on the remaining datasets are given in Appendix C, Figures 6-17. Note that we only report results with weight-percentage µ = 100 (use all instances in the training process); previous experiments (with µ = 75) showed inferior results. This suggests that it is indeed important to include all minority instances during the training of each weak learner, an observation also made by Liu et al (2009).
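The combination rule Σ_s Σ_t α_st h_st(x) can be sketched as follows. Note that for brevity we boost decision stumps via scikit-learn's AdaBoost (whose `decision_function` is exactly the weighted sum of its weak learners), whereas the paper boosts linear SVMs; the subset sizes follow the EasyEnsemble recipe (each balanced subset is twice the minority size):

```python
import numpy as np
from sklearn.ensemble import AdaBoostClassifier
from sklearn.metrics import roc_auc_score

def easy_ensemble_fit(X, y, S=5, T=10, seed=0):
    rng = np.random.default_rng(seed)
    min_idx = np.where(y == 1)[0]
    maj_idx = np.where(y == 0)[0]
    models = []
    for _ in range(S):
        sub = rng.choice(maj_idx, size=len(min_idx), replace=False)
        idx = np.concatenate([min_idx, sub])  # balanced subset: 2x minority size
        models.append(AdaBoostClassifier(n_estimators=T).fit(X[idx], y[idx]))
    return models

def easy_ensemble_score(models, X):
    # sum of the per-subset boosted ensembles: sum_s sum_t alpha_st * h_st(x)
    return sum(m.decision_function(X) for m in models)

rng = np.random.default_rng(0)
X = rng.normal(size=(300, 5))
y = (X[:, 0] > 1.0).astype(int)  # roughly 16% minority
models = easy_ensemble_fit(X, y)
scores = easy_ensemble_score(models, X)
print(scores.shape)  # (300,)
```

Because each subset sees all minority instances but only a random fraction of the majority class, the summed ensemble explores the majority space while keeping per-learner training sets small.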

When investigating the influence of the regularization parameter C in AdaBoost, we can graphically verify that it is a "weakness" indicator, in the sense that lower C-values correspond to weaker learners. While these low C-value learners have a low performance in the first rounds of boosting, the possible boost in performance is usually much higher compared to higher C-value inducers. Strong learners can already make a suitable distinction between minority and majority class behaviours in the first boosting round. The boosting process will emphasize the hard to learn instances, among which are noise/outliers. Therefore, too strong learners are not suitable for use in a boosting process, as already noted by Wickramaratna et al (2001). In Figures 2(b) and 3(b) we observe that the lowest C-values (C = 10^-7, 10^-5) can outperform higher C-values (C = 10^-3, 10^-1). In many cases the AB-process outperforms the baseline (BL); see also the results in Table 5. A combination of several weak learners can often achieve better generalization than a single strong learner.

Inspired by the success of the RUS method with βu = 1 (for reasons indicated in Section 4.4), the EE-technique tries to overcome an obvious downside of this approach: we are removing a lot of majority class instances that can be very informative. Hence, we try to make up for this loss by combining several of these balanced training sets, in the hope that the majority class is more adequately explored. This results in a very effective learner that seems to outperform the AB, AC and BL methods in the majority of cases, as can be seen from these figures. It is remarkable that we don't need to use that many subsets S to achieve high performances: increasing the S-value from 5 to 15 doesn't seem to cause major increases in AUC (see also Table 5). The influence of the C-parameter is similar to that described in the previous paragraph.

[Figure: panels (a) and (b) plotting AUC test against boosting iteration T]
Fig 1 Mov G(p = 25) dataset results showing average tenfold AUC-performance on test data (with µ = 100) for (a) AB, AC and EE with C chosen according to highest validation set AUC-performance (over all possible T); (b) AB and EE(S = 15) with varying C-levels


[Figure: panels (a) and (b) plotting AUC test against boosting iteration T]
Fig 2 Book(p = 25) dataset

[Figure: panels (a) and (b) plotting AUC test against boosting iteration T]
Fig 3 TaFeng(p = 25) dataset

[Figure: panels (a) and (b) plotting AUC test against boosting iteration T]
Fig 4 Bank(p = []) dataset


4.6 Final comparison

4.6.1 Performance-wise comparison

In this section we wish to determine which of the aforementioned methods shows superior performance in terms of AUC on test data. The experimental set-up is described in Section 4.2. Note that we exclude β = 0 and βu = 0 in determining the performance of the oversampling and undersampling techniques respectively, to be able to compare them with the baseline (BL) approach.^23 The results for AB, AC and EE are shown for µ = 100. The number of boosting iterations t ∈ [0, T] is also considered as a tunable parameter. Table 5 shows the final comparison between all methods. Additionally, we added an average rank column indicating the mean rank of each algorithm across all datasets. The rank of an algorithm for a specific dataset corresponds to its position in the sorted AUC-performance spectrum, where the best performing algorithm gets the rank of 1, the second best rank 2, etc. In case two or more algorithms perform equally well, average ranks are assigned (if two algorithms have ranks of 3 and 4 respectively and show equal performance, then both methods get assigned a rank of 3.5). Note that the LST dataset is excluded from the calculation of average ranks, since we consider this a dataset not being affected by imbalance and showing equal performances.
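The ranking scheme just described (rank 1 = best AUC, ties receive average ranks) corresponds to "average" ranking of the negated AUCs, e.g. with SciPy:

```python
from scipy.stats import rankdata

# hypothetical AUCs of four methods on one dataset; two methods tie at 0.85
aucs = [0.85, 0.79, 0.85, 0.72]
ranks = rankdata([-a for a in aucs])  # negate: higher AUC -> lower (better) rank
print(ranks.tolist())  # [1.5, 3.0, 1.5, 4.0]
```

The average rank column in Table 5 is then the mean of these per-dataset ranks.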

As a side note, we noticed that increasing the number of minority training set instances can significantly improve performances. We can see this clearly by comparing the p = 25 outcomes with the results of p = 1 on the same dataset. Junque de Fortuny et al (2014a) already showed empirically that larger behaviour datasets (in terms of the number of instances or features) lead to higher AUC-performances. This situation is confirmed in our experiments, though we want to add that this is only true when the data size is increased in a balanced fashion.

The results in Table 5 confirm our previous analysis. Each of the oversampling, undersampling (except Cl Knn) and boosting variants seems to improve upon the BL performance. Sobhani et al (2015) note that the direct application of ensemble techniques on imbalanced data does not circumvent the imbalanced learning issue; in our experiments on behaviour data, we do find a simple boosting procedure applied to the imbalanced data to improve upon the predictive performance. The oversampling techniques OSR, SMOTE and ADASYN show similar outcomes and are competitive methods, seeming to outperform the undersampling techniques and the plain boosting (AB) and cost-sensitive learning (AC) approaches. When dealing with traditional data, the performances of oversampling techniques are generally worse than those of undersampling approaches (Drummond and Holte 2003). It is interesting to observe that for these large and sparse behaviour datasets the opposite is true. This is heavily related to the intrinsic properties of the data under consideration: undersampling (or feature selection, for that matter) is not recommended because, as we have noted in Section 2.1, each feature provides a small additional contribution to the predictive performance. With respect to AC, the results indicate that R = R_L is a better choice compared to more arbitrary cost ratios R = 2, 8. The EE-technique has the lowest average rank and is our best performing method: the combination of several balanced subsets with filtered noise/outlier contributions allows one to explore the majority class space efficiently. To strengthen our previous statements, we will in the next paragraphs conduct statistical significance tests in accordance with the procedure outlined in Demsar (2006). The latter paper presents a framework to perform statistical comparisons of classifiers over multiple datasets.

23 The BL technique trains single SVMs on the imbalanced training data.

The first null-hypothesis we try to reject postulates that all of the algorithms perform equally well or, equivalently, that their average ranks R_j (see Table 5) are indifferent. Under the null-hypothesis, the Friedman statistic (Friedman 1937)

    χ²_F = [12N / (k(k+1))] [ Σ_{j=1}^{k} R_j² − k(k+1)²/4 ],  (6)

with N the number of datasets (N = 15) and k the number of algorithms (k = 13), can be used to compute the statistic (Iman and Davenport 1980)

    F_F = (N − 1) χ²_F / (N(k − 1) − χ²_F).  (7)

The latter is distributed according to the F-distribution with k − 1 and (k − 1)(N − 1) degrees of freedom. Evaluating this measure with the average ranks shown in Table 5 results in F_F = 22.98, which is a lot higher than the critical value of 1.81 at the α = 0.05 significance level. We can therefore reject the null-hypothesis and proceed with post-hoc tests.
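Recomputing (6) and (7) from the average ranks reported in Table 5 (N = 15, k = 13) reproduces this value up to rank rounding:

```python
# Friedman statistic (6) and Iman-Davenport correction (7), using the
# average ranks from Table 5 (LST excluded, hence N = 15 datasets).
N, k = 15, 13
R = [11.600, 5.000, 4.533, 4.800, 8.167, 12.467, 8.133,
     8.567, 8.267, 8.467, 5.400, 3.267, 2.333]

chi2_F = 12 * N / (k * (k + 1)) * (sum(r * r for r in R) - k * (k + 1) ** 2 / 4)
F_F = (N - 1) * chi2_F / (N * (k - 1) - chi2_F)
print(round(F_F, 2))  # 22.99 (the paper reports 22.98; rank rounding)
```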

The Nemenyi test (Nemenyi 1963) allows one to compare each classifier to one another and adjusts the critical value to compensate for making k(k − 1)/2 comparisons.^24 "The performance of two classifiers is significantly different if the corresponding average ranks differ by at least the critical difference (CD)" (Demsar 2006). We refer to the aforementioned paper for the calculation of the CD. The result of these comparisons at the α = 0.05 significance level is shown in Table 6. This test is already quite revealing, in the sense that, for instance, the oversampling methods (OSR, SMOTE and ADASYN), AC (R = R_L) and EE-techniques are significantly different from the BL. Yet it does not indicate that undersampling or applying plain boosting (AB) results in significantly different performances compared to the BL. The Nemenyi test is still conservative, and we therefore compare all classifiers to a single control classifier in the next paragraphs. Note that we only make k − 1 comparisons in that situation, resulting in stronger results compared to the Nemenyi test.

When comparing a single classifier i with a control classifier c, the null-hypothesis states that the two classifiers i and c perform equally well. Under this hypothesis, the test statistic (based on the average ranks R_i shown in Table 5)

    z = (R_i − R_c) / sqrt( k(k+1) / (6N) )  (8)

is distributed according to a standard normal distribution. This z-value is used to find the corresponding probability (p-value) p from the cumulative standard normal

24 The larger the number of comparisons, the higher the proportion of null-hypotheses that are wrongfully rejected due to random chance.


Table 5 Final comparison of all methods in terms of AUC test set performance; see Section 4.2 for details on the experimental set-up. β ≠ 0 and βu ≠ 0 for the oversampling and undersampling techniques respectively; µ = 100 for each of the boosting variants. Standard deviations are given between brackets and the performance difference with the BL is shown in square brackets.

             Mov G(p = 1)         Mov G(p = 25)        Mov Th(p = [])       Yahoo A(p = 1)
BL           71.6 (2.62) [0]      81.41 (1.32) [0]     79.77 (5.33) [0]     55.92 (2.97) [0]
OSR          75.35 (2.27) [3.8]   83.76 (2.09) [2.3]   85.13 (6.1) [5.4]    60.05 (2.71) [4.1]
SMOTE        76.16 (2.27) [4.6]   83.7 (2.1) [2.3]     85.67 (4.98) [5.9]   60.1 (3.0) [4.2]
ADASYN       76.07 (2.26) [4.5]   83.63 (2.04) [2.2]   85.65 (5.6) [5.9]    59.9 (2.99) [4.0]
RUS          72.88 (2.73) [1.3]   81.52 (2.15) [0.1]   82.91 (7.19) [3.1]   57.04 (1.77) [1.1]
Cl Knn       71.43 (1.36) [-0.2]  80.88 (1.19) [-0.5]  78.87 (4.71) [-0.9]  55.78 (2.71) [-0.1]
Far Knn      71.9 (2.95) [0.3]    80.9 (1.48) [-0.5]   84.07 (4.64) [4.3]   57.2 (1.33) [1.3]
CBU          74.17 (2.36) [2.6]   81.51 (1.04) [0.1]   82.76 (7.22) [3.0]   58.77 (3.43) [2.8]
AB           71.65 (1.73) [0.1]   84.52 (1.89) [3.1]   82.43 (5.18) [2.7]   58.35 (2.62) [2.4]
AC(R = 2,8)  71.61 (2.46) [0]     83.46 (1.82) [2.0]   83.27 (5.6) [3.5]    57.72 (2.47) [1.8]
AC(R = R_L)  74.65 (2.7) [3.1]    83.35 (2.09) [1.9]   85.41 (4.49) [5.6]   59.47 (2.33) [3.5]
EE(S = 10)   76.04 (2.66) [4.4]   85.05 (1.85) [3.6]   86.1 (5.78) [6.3]    59.66 (3.13) [3.7]
EE(S = 15)   76.12 (2.88) [4.5]   85.14 (1.86) [3.7]   86.42 (5.86) [6.7]   59.76 (2.93) [3.8]

             Yahoo A(p = 25)      Yahoo G(p = 1)       Yahoo G(p = 25)      TaFeng(p = 1)
BL           61.68 (2.42) [0]     66.84 (3.66) [0]     78.82 (1.39) [0]     55.75 (1.6) [0]
OSR          64.59 (3.12) [2.9]   73.08 (2.96) [6.2]   78.52 (2.01) [-0.3]  61.21 (2.24) [5.5]
SMOTE        65.56 (3.33) [3.9]   73.11 (3.12) [6.3]   79.01 (1.21) [0.2]   61.72 (1.81) [6.0]
ADASYN       65.13 (3.38) [3.4]   73.22 (3.17) [6.4]   79.74 (1.68) [0.9]   61.68 (1.86) [5.9]
RUS          64.11 (2.8) [2.4]    70.65 (3.39) [3.8]   78.91 (1.55) [0.1]   59.25 (2.18) [3.5]
Cl Knn       61.14 (2.13) [-0.5]  66.34 (3.54) [-0.5]  77.26 (1.46) [-1.6]  55.77 (1.28) [0]
Far Knn      63.96 (3.03) [2.3]   66.97 (3.54) [0.1]   78.26 (2.2) [-0.6]   59.98 (1.26) [4.2]
CBU          62.27 (1.79) [0.6]   71.27 (2.89) [4.4]   75.22 (2.42) [-3.6]  58.4 (1.57) [2.6]
AB           63.88 (2.67) [2.2]   68.9 (2.03) [2.1]    79.01 (1.66) [0.2]   56.21 (1.79) [0.5]
AC(R = 2,8)  64.32 (3.56) [2.6]   68.89 (3.11) [2.0]   78.99 (1.89) [0.2]   56.33 (1.83) [0.6]
AC(R = R_L)  64.31 (3.03) [2.6]   73.13 (2.8) [6.3]    78.41 (2.0) [-0.4]   61.6 (2.26) [5.9]
EE(S = 10)   66.51 (3.24) [4.8]   72.61 (3.15) [5.8]   80.52 (1.6) [1.7]    61.2 (1.82) [5.4]
EE(S = 15)   66.36 (3.18) [4.7]   73.48 (2.32) [6.6]   80.54 (1.56) [1.7]   61.13 (1.83) [5.4]

             TaFeng(p = 25)       Book(p = 1)          Book(p = 25)         LST(p = 1)
BL           66.94 (1.34) [0]     52.6 (1.29) [0]      60.08 (0.71) [0]     99.99 (0.01) [0]
OSR          68.77 (1.23) [1.8]   55.87 (1.42) [3.3]   64.62 (0.57) [4.5]   99.99 (0.01) [0]
SMOTE        68.47 (1.5) [1.5]    55.07 (0.88) [2.5]   62.96 (0.82) [2.9]   99.99 (0.01) [0]
ADASYN       68.48 (1.47) [1.5]   55.04 (0.91) [2.4]   63.02 (0.57) [2.9]   99.99 (0.01) [0]
RUS          68.28 (1.39) [1.3]   54.26 (0.92) [1.7]   63.28 (0.8) [3.2]    99.98 (0.01) [0]
Cl Knn       66.13 (1.43) [-0.8]  52.69 (1.3) [0.1]    60.02 (0.79) [-0.1]  99.99 (0.01) [0]
Far Knn      68.06 (1.41) [1.1]   56.25 (1.52) [3.7]   64.15 (1.12) [4.1]   99.98 (0.01) [0]
CBU          63.84 (1.07) [-3.1]  53.75 (1.01) [1.2]   54.68 (0.88) [-5.4]  []
AB           67.65 (1.55) [0.7]   54.27 (1.95) [1.7]   65.0 (0.67) [4.9]    99.99 (0.01) [0]
AC(R = 2,8)  69.31 (1.23) [2.4]   53.72 (1.0) [1.1]    61.24 (0.8) [1.2]    99.98 (0.01) [0]
AC(R = R_L)  67.15 (1.51) [0.2]   55.73 (1.22) [3.1]   64.6 (0.64) [4.5]    99.99 (0.01) [0]
EE(S = 10)   70.3 (1.35) [3.4]    55.09 (1.29) [2.5]   65.37 (0.61) [5.3]   99.98 (0.01) [0]
EE(S = 15)   70.4 (1.3) [3.5]     55.35 (1.26) [2.8]   65.4 (0.51) [5.3]    99.98 (0.01) [0]


Table 5 Continued. Additionally, an average rank column is added, showing the mean rank of each algorithm across all datasets. Note that we excluded the LST dataset for this purpose.

             Adver(p = [])        Adver(p = 1)         CRF(p = [])           Bank(p = [])
BL           96.61 (1.82) [0]     90.93 (3.02) [0]     64.06 (16.43) [0]     66.82 (0.88) [0]
OSR          96.93 (1.91) [0.3]   93.3 (2.02) [2.4]    80.74 (12.93) [16.7]  71.39 (0.79) [4.6]
SMOTE        97.05 (1.66) [0.4]   93.35 (2.01) [2.4]   78.7 (16.56) [14.6]   []
ADASYN       96.91 (1.95) [0.3]   93.46 (2.21) [2.5]   78.87 (16.71) [14.8]  []
RUS          96.81 (1.87) [0.2]   92.38 (2.51) [1.5]   83.98 (5.99) [19.9]   69.41 (1.19) [2.6]
Cl Knn       96.4 (1.48) [-0.2]   89.73 (3.42) [-1.2]  76.63 (16.19) [12.6]  66.17 (0.72) [-0.6]
Far Knn      95.77 (1.81) [-0.8]  93.88 (1.78) [3.0]   83.75 (13.11) [19.7]  66.95 (0.56) [0.1]
CBU          97.15 (1.88) [0.5]   94.18 (2.3) [3.3]    []                    []
AB           97.34 (2.18) [0.7]   91.39 (3.23) [0.5]   77.62 (15.15) [13.6]  66.82 (0.88) [0]
AC(R = 2,8)  97.44 (1.93) [0.8]   91.0 (3.35) [0.1]    68.31 (14.93) [4.2]   67.67 (0.71) [0.9]
AC(R = R_L)  97.46 (1.71) [0.8]   93.51 (2.17) [2.6]   85.08 (9.77) [21.0]   70.7 (0.8) [3.9]
EE(S = 10)   97.64 (1.35) [1.0]   92.97 (2.75) [2.0]   86.18 (10.17) [22.1]  71.46 (0.81) [4.6]
EE(S = 15)   97.63 (1.35) [1.0]   93.3 (2.14) [2.4]    86.35 (9.99) [22.3]   71.54 (0.76) [4.7]

             Average Rank
BL           11.600
OSR          5.000
SMOTE        4.533
ADASYN       4.800
RUS          8.167
Cl Knn       12.467
Far Knn      8.133
CBU          8.567
AB           8.267
AC(R = 2,8)  8.467
AC(R = R_L)  5.400
EE(S = 10)   3.267
EE(S = 15)   2.333

Table 6 Nemenyi test outcomes at the α = 0.05 significance level. A value of 1 indicates that the null-hypothesis is rejected, thus finding the two algorithms significantly different (a value of 0 means we accept the null-hypothesis). Methods are listed in the same order as in Table 5 and are represented more concisely.

      BL  RO  SM  AD  RU  Cl  Fa  CBU  AB  AC1  AC2  EE1  EE2
BL     0   1   1   1   0   0   0   0    0   0    1    1    1
RO     1   0   0   0   0   1   0   0    0   0    0    0    0
SM     1   0   0   0   0   1   0   0    0   0    0    0    0
AD     1   0   0   0   0   1   0   0    0   0    0    0    0
RU     0   0   0   0   0   0   0   0    0   0    0    1    1
Cl     0   1   1   1   0   0   0   0    0   0    1    1    1
Fa     0   0   0   0   0   0   0   0    0   0    0    1    1
CBU    0   0   0   0   0   0   0   0    0   0    0    1    1
AB     0   0   0   0   0   0   0   0    0   0    0    1    1
AC1    0   0   0   0   0   0   0   0    0   0    0    1    1
AC2    1   0   0   0   0   1   0   0    0   0    0    0    0
EE1    1   0   0   0   1   1   1   1    1   1    0    0    0
EE2    1   0   0   0   1   1   1   1    1   1    0    0    0


distribution function: p = 2 × min(Φ(z), 1 − Φ(z)). Holm's method (Holm 1979) compares k − 1 classifiers to a single control classifier and sorts the corresponding p-values in ascending order, so that p1 ≤ p2 ≤ ... ≤ p_{k−1}. Each p_i is subsequently compared to its associated confidence level^25 α_comp = α/(k − i). Holm starts by performing the check p1 < α/(k − 1) and rejects the null-hypothesis if the test passes. It then proceeds with a similar check for i = 2, and continues until a certain null-hypothesis cannot be rejected. The remaining hypotheses are retained.
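Holm's step-down procedure as just described takes only a few lines of code (the p-values below are hypothetical):

```python
def holm(p_values, alpha=0.05):
    """Step-down Holm test against a control: sort p-values ascending and
    compare the i-th smallest against alpha / (k - i); stop at first failure."""
    k = len(p_values) + 1  # k - 1 comparisons against the control
    order = sorted(range(len(p_values)), key=lambda i: p_values[i])
    rejected = [False] * len(p_values)
    for step, i in enumerate(order, start=1):
        if p_values[i] < alpha / (k - step):
            rejected[i] = True
        else:
            break  # retain this null-hypothesis and all with larger p-values
    return rejected

print(holm([0.001, 0.04, 0.002, 0.3]))  # [True, False, True, False]
```

Note how 0.04 is retained even though 0.04 < 0.05: the procedure stops at the first non-rejection, exactly the behaviour exploited in footnote 26 below.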

Table 7 shows the result of the Holm test at the α = 0.05 significance level with the BL algorithm as control classifier. The table lists each method according to sorted p-values and shows the z-, p- and α_comp-values. The significance column indicates whether the proposed method is significantly different from the BL, and coincidentally matches the result of the Nemenyi test. The p-values of the (Far Knn, RUS, CBU) undersampling methods, AB and AC (R = 2, 8) are only slightly higher than the level required to reject the null-hypothesis. We included a critical significance level α_crit, corresponding to the lowest possible significance level at which the method would be considered significantly different from the BL (if α = α_crit, then p = α_comp). In the case of AB, for example, if we were to choose a significance level α = 0.0764 (a confidence level of 92.36%), the associated p-value would be smaller than α_comp and we would proceed to conclude^26 that AB performs significantly differently from the BL. To summarize: the null-hypothesis of equal performance between the BL and an alternative method proposed in Section 3 (except Cl Knn) is rejected with a confidence level of more than 91% according to Holm's test. Hence, the performance of the BL approach is significantly worse than that of each of the proposed over- and undersampling methods (except Cl Knn), cost-sensitive learning techniques (AC) and boosting variants (AB, EE).

Similarly to the previous paragraph, Table 8 presents the results of Holm's test at the α = 0.05 significance level with EE(S = 15) as control classifier. The results coincidentally correspond with the result of the Nemenyi test and indicate that the oversampling (OSR, SMOTE and ADASYN), AC(R = R_L) and EE(S = 10) methods are not statistically different from the EE(S = 15) technique at the 0.05 significance level. However, if we were to adapt the significance level to approximately α = 0.25 (a 75% confidence level), the results would conclude significance (except for EE(S = 10)). To summarize: the null-hypothesis of equal performance between EE(S = 15) and an alternative method proposed in Section 3 (except EE(S = 10)) is rejected with a confidence level of more than 75% according to Holm's test. Hence, the performance of the oversampling, undersampling, AB and AC approaches is significantly worse than that of EE(S = 15). We can therefore safely conclude EE to be the preferred method in terms of AUC performance when dealing with imbalanced behaviour data.

25 α_comp adjusts the value of α to compensate for multiple comparisons.
26 This is not entirely true, since Holm's method requires Far Knn to pass the test first before proceeding, which only occurs at the 0.088662 significance level.


Table 7 Holm test at the α = 0.05 significance level with BL as reference. The table shows the z test statistic with associated p-value; α_comp = α/(k − i), with k the number of algorithms (k = 13) and i corresponding to the position of the method in the sorted p-value vector. Since we already ranked the p-values in ascending order, i takes on the row number (e.g. OSR has i = 5). The column "significant" denotes whether we can reject the null-hypothesis (significant: p < α_comp). α_crit corresponds to the smallest possible significance level at which we would decide to reject the null-hypothesis (α_crit = α·p/α_comp).

              z          p          α_comp     significant  α_crit
EE(S = 15)    -6.51642   7.2E-11    0.004167   1            8.64E-10
EE(S = 10)    -5.86009   4.63E-09   0.004545   1            5.09E-08
SMOTE         -4.96936   6.72E-07   0.005      1            6.72E-06
ADASYN        -4.78183   1.74E-06   0.005556   1            1.56E-05
OSR           -4.64119   3.46E-06   0.00625    1            2.77E-05
AC(R = R_L)   -4.35991   1.3E-05    0.007143   1            9.11E-05
Far Knn       -2.4378    0.014777   0.008333   0            0.088662
RUS           -2.41436   0.015763   0.01       0            0.078815
AB            -2.34404   0.019076   0.0125     0            0.076305
AC(R = 2,8)   -2.20339   0.027567   0.016667   0            0.082701
CBU           -2.13307   0.032919   0.025      0            0.065837
Cl Knn        0.609449   0.542227   0.05       0            0.542227

Table 8 Holm test at the α = 0.05 significance level with EE(S = 15) as reference

              z          p          α_comp     significant  α_crit
Cl Knn        7.12587    1.03E-12   0.004167   1            1.24E-11
BL            6.516421   7.2E-11    0.004545   1            7.92E-10
CBU           4.383348   1.17E-05   0.005      1            0.000117
AC(R = 2,8)   4.313027   1.61E-05   0.005556   1            0.000145
AB            4.172384   3.01E-05   0.00625    1            0.000241
RUS           4.102063   4.09E-05   0.007143   1            0.000287
Far Knn       4.078623   4.53E-05   0.008333   1            0.000272
AC(R = R_L)   2.156513   0.031044   0.01       0            0.155218
OSR           1.875229   0.060761   0.0125     0            0.243045
ADASYN        1.734587   0.082814   0.016667   0            0.248442
SMOTE         1.547064   0.121848   0.025      0            0.243696
EE(S = 10)    0.65633    0.511612   0.05       0            0.511612

4.6.2 Timing-wise comparison

The performance of each of the proposed algorithms in terms of training times depends highly upon the chosen parameters. The β and βu parameters control the number of training instances and are an important contribution to the subsequent SVM learning times for each of the over- and undersampling techniques. The number of boosting iterations t ∈ [0, T] is also a major factor in each of the boosting variants proposed in Section 3.3. Furthermore, the chosen regularization parameter C highly influences the training times of the SVM inducer: higher C-values result in larger training times (Fan et al 2008). Not only do the chosen hyperparameters affect learning times; data characteristics such as the number of minority instances, the imbalance level, the number of features, the amount of overlapping, etc. also have a major effect.

Imbalanced classification in sparse and large behaviour datasets 35

In comparing each of the methods outlined in Section 3, we use a methodology similar to the one presented in Section 4.6.1. For each of the ten folds, a suitable parameter combination is selected based on optimal validation set AUC performance. These chosen parameters are applied in determining the timings of each method. The final results indicate the average timings across ten folds. Hence, we show the time required by each method when using the parameter combination resulting in the best performance. Note that this might, for instance, result in using β = 1/3 in case of OSR, or using only t = 5 boosting iterations in EE. Computational timings are shown in Table 9.

As one can observe from the aforementioned table, CBU is the slowest method, relying on computationally intensive clustering procedures. The synthetic SMOTE and ADASYN oversampling approaches are also very time consuming: they both rely on computationally expensive nearest neighbour computations, in conjunction with a synthetic sample generation procedure that requires a single pass over all minority training instances.

The RUS method shows, as expected, superior results in terms of computational timings. Other undersampling methods perform quite well, though they are slower than BL and OSR due to their inherent nearest neighbour computations.

The boosting variants seem to perform worse than BL, OSR and each of the undersampling methods. Since EE was the best method in terms of AUC performance (see Section 4.6.1), we focus on this learner in this paragraph. For the medium sized datasets (all but CRF and Bank), OSR outperforms EE(S = 15), yet for the large datasets (CRF and Bank) the opposite is true. An SVM is known to have a quadratic computational complexity in the number of training instances (Suykens et al. 2002; García and Lozano 2007). For this reason, we consider OSR (or any other oversampling technique) inappropriate for large datasets. Since with EE each subset has a size limited to twice the minority class training size, we expect increasing benefits of the EE technique for larger and more imbalanced behaviour data. As Junqué de Fortuny et al. (2014a) have observed, larger behaviour data, in terms of the number of instances or features, also contributes to an increase in predictive power. Hence, we expect to see this type of big behaviour data more often in real-life applications, where computational resources can become a major constraint. Because the EE technique is easily parallelizable, by simultaneously performing the boosting process for each of the independent subsets, we added an extra row in Table 9: EE par corresponds with the time needed for EE(S = 15) divided by 15, indicating the time that would be required for EE if each subset were trained in parallel. We can see that EE par outperforms OSR in 8 out of 16 cases, precisely for the datasets that show a high imbalance level (e.g. p = 1 or p = []).
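The subset construction that keeps EE cheap can be sketched as follows. This is a simplified illustration with hypothetical index lists; in the paper, an SVM/LR learner is additionally boosted on every subset and the S resulting hypotheses are combined:

```python
import random

def easy_ensemble_subsets(majority_idx, minority_idx, S, seed=0):
    """Draw S balanced subsets: each one holds all minority instances plus
    an equally large random majority sample, so each subset has a size of
    twice the minority class size."""
    rng = random.Random(seed)
    subsets = []
    for _ in range(S):
        sampled = rng.sample(majority_idx, len(minority_idx))  # without replacement
        subsets.append(minority_idx + sampled)
    return subsets

majority = list(range(2000))          # hypothetical majority-class indices
minority = list(range(2000, 2100))    # hypothetical minority-class indices (20:1)
subsets = easy_ensemble_subsets(majority, minority, S=15)
```

Since the S subsets are independent, each one could be handed to a separate worker (e.g. via `concurrent.futures.ProcessPoolExecutor`), which is exactly the situation the EE par row in Table 9 simulates.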

We conclude our comparison of the proposed methods in Figure 5, which shows the average ranks in terms of computational timings (see Table 9) and AUC performance for each of the available data sources. Regarding the ranks with respect to the latter, we added an extra row EE par, having the same AUC performance as EE(S = 15) for each dataset in Table 5. Ranks are subsequently determined according to the procedure outlined in Section 4.6.1. Additionally, Figure 18 in Appendix D shows the result for the large and highly imbalanced CRF and Bank datasets. We expect the latter type of data to be of major interest for reasons already indicated.


Table 9 Computational timings (in seconds), averaged across ten folds, using the parameter combination resulting in the highest validation set AUC performance. First and second best performances are emphasized in boldface (third and fourth best are underlined). EE par is the time required for EE(S = 15) divided by 15.

              Mov G(p = 1)   Mov G(p = 2.5)   Mov Th(p = [])   Yahoo A(p = 1)
BL            0.032889       0.056697         0.558563         0.026922
OSR           0.055043       0.062802         0.99009          0.044421
SMOTE         0.218821       0.937057         3.841482         0.057726
ADASYN        0.284688       1.802399         5.191265         0.087694
RUS           0.011431       0.025383         0.155224         0.007991
Cl Knn        0.046599       0.599846         0.989914         0.037182
Far Knn       0.039887       0.80072          0.683023         0.027788
CBU           10.34111       10.60173         68.22839         16.92477
AB            0.169792       0.841443         3.460246         0.139251
AC(R = 28)    0.471994       2.996585         10.86907         0.366555
AC(R = RL)    0.53376        1.179542         6.065177         0.209015
EE(S = 10)    0.117226       6.065145         1.17995          0.148973
EE(S = 15)    0.20474        7.173737         2.119991         0.180365
EE par        0.013649       0.478249         0.141333         0.012024

              Yahoo A(p = 2.5)   Yahoo G(p = 1)   Yahoo G(p = 2.5)   TaFeng(p = 1)
BL            0.092954           0.011915         0.044164           0.026728
OSR           0.027887           0.013241         0.047206           0.040919
SMOTE         1.062686           0.056153         0.883698           0.219553
ADASYN        2.050993           0.079073         1.733367           0.306618
RUS           0.048471           0.003234         0.033423           0.002916
Cl Knn        0.84391            0.025404         0.502515           0.092167
Far Knn       0.664124           0.026576         0.500206           0.080159
CBU           15.69442           12.87221         13.55035           24.67279
AB            0.445546           0.078777         0.169977           0.114619
AC(R = 28)    1.034044           0.321723         0.515953           0.926178
AC(R = RL)    0.706215           0.226741         0.112949           0.610233
EE(S = 10)    1.026577           0.100331         1.527146           0.058052
EE(S = 15)    1.607596           0.077483         2.472582           0.10538
EE par        0.107173           0.005166         0.164839           0.007025

              TaFeng(p = 2.5)   Book(p = 1)   Book(p = 2.5)   LST(p = 1)
BL            0.032033          0.080035      0.318093        0.652045
OSR           0.032414          0.132927      0.092757        0.87152
SMOTE         5.089283          3.409418      11.43444        4.987705
ADASYN        8.148419          3.689661      12.25441        6.840083
RUS           0.020457          0.022713      0.031972        0.432839
Cl Knn        1.713731          0.400873      3.711648        2.508374
Far Knn       1.539437          0.379086      3.988552        2.511037
CBU           26.42686          41.98663      46.31987        []
AB            0.713265          0.61719       1.238585        2.466151
AC(R = 28)    1.234647          1.666131      2.330635        1.451671
AC(R = RL)    0.279047          0.860346      0.197053        1.23763
EE(S = 10)    2.484502          2.145747      7.177484        0.524066
EE(S = 15)    3.363971          2.480066      11.21945        0.784111
EE par        0.224265          0.165338      0.747963        0.052274


Table 9 Continued. Additionally, an extra column indicating average ranks is included (lower ranks are preferred).

              Adver(p = [])   Adver(p = 1)   CRF(p = [])   Bank(p = [])
BL            0.010953        0.002796       0.725911      70.89334
OSR           0.012178        0.006166       3.685813      179.7481
SMOTE         0.123112        0.017764       5.633862      []
ADASYN        0.183767        0.021728       5.768669      []
RUS           0.012115        0.00204        0.147392      5.247441
Cl Knn        0.061324        0.005568       1.106755      73.73282
Far Knn       0.079078        0.007069       1.110379      97.59619
CBU           33.78235        32.36754       []            []
AB            0.069199        0.103518       1.153196      83.08618
AC(R = 28)    0.193092        0.068905       2.047434      71.70548
AC(R = RL)    0.107652        0.037963       1.387174      106.3466
EE(S = 10)    0.138485        0.085686       0.198656      24.95117
EE(S = 15)    0.185136        0.139121       0.285345      36.40107
EE par        0.012342        0.009275       0.019023      2.426738

              Average Rank [pos]
BL            2.94 [2]
OSR           4.19 [4]
SMOTE         9.59 [11]
ADASYN        10.91 [13]
RUS           1.38 [1]
Cl Knn        6.5 [5]
Far Knn       6.56 [6]
CBU           14 [14]
AB            8.06 [7]
AC(R = 28)    10.81 [12]
AC(R = RL)    9.25 [9]
EE(S = 10)    8.25 [8]
EE(S = 15)    9.56 [10]
EE par        3 [3]

Timings are a substantial concern with respect to these large datasets. As can be observed, the EE technique thrives for these big and highly imbalanced data, even in its non-parallel form.


[Figure: scatter plot of average rank Time (y-axis) versus average rank AUC (x-axis) for BL, OSR, SMOTE, ADASYN, RUS, Cl Knn, Far Knn, CBU, AB, AC(R=28), AC(R=RL), EE(S=10), EE(S=15) and EE par]

Fig 5 Average rank AUC versus average rank Time (see Table 9) across all datasets from Table 2 (excluding Flickr and Kdd). Regarding the AUC rank, an extra row EE par is added, having the same AUC-value as EE(S = 15). Ranks for each dataset are subsequently obtained via the procedure outlined in Section 4.6.1. Points occurring in the upper-right region are preferred.

5 The effect of the chosen base learner

In this short section we investigate the influence of the chosen base learner. More specifically, we ask whether some base inducers can cope more easily with the imbalanced learning issue, and in doing so verify the conclusion that EE is a suitable technique to improve upon the baseline performance. We rely on several popular classification methods for behaviour data: two regularization-based techniques (SVM, LR) and two heuristic approaches (Naive Bayes (NB) and Behavioural Similarity (BeSim)). The next paragraphs briefly introduce these methods. Note that source codes and dataset results are provided in our on-line repository.

LR constructs a linear model that maximizes the likelihood of the observed data by imposing a logistic model for the parametric probability model P(Y|X). Ng and Jordan (2002) provide a thorough introduction on the topic27 and note that, in its plain form, LR can suffer from overfitting, especially when the input data is very high dimensional. Therefore, we resort to regularized logistic regression. We opt for L2-regularized LR, since L1-regularization corresponds with a natural way of imposing feature selection (Ng 2004). The LIBLINEAR toolbox (Fan et al. 2008) is used to obtain these L2-regularized LR models.

27 For an accessible introduction, see the chapter on "Generative and Discriminative Classifiers: Naive Bayes and Logistic Regression" provided at the following link: http://www.cs.cmu.edu/~tom/NewChapters.html
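To illustrate the objective being optimized, the following is a minimal gradient-descent sketch of L2-regularized LR. The paper itself uses the LIBLINEAR solver; the toy data and step sizes here are purely illustrative:

```python
import numpy as np

def train_l2_lr(X, y, C=1.0, lr=0.1, epochs=2000):
    """Gradient descent on: mean cross-entropy + ||w||^2 / (2*C*n).
    Larger C means a weaker penalty, i.e. a stronger learner."""
    n, d = X.shape
    w, b = np.zeros(d), 0.0
    for _ in range(epochs):
        p = 1.0 / (1.0 + np.exp(-(X @ w + b)))        # P(Y = 1 | X)
        w -= lr * (X.T @ (p - y) / n + w / (C * n))   # penalty shrinks w
        b -= lr * np.mean(p - y)                      # intercept is not penalized
    return w, b

# Hypothetical 1-D toy data: class 1 for the two largest feature values
X = np.array([[0.0], [1.0], [2.0], [3.0]])
y = np.array([0.0, 0.0, 1.0, 1.0])
w, b = train_l2_lr(X, y)
pred = (1.0 / (1.0 + np.exp(-(X @ w + b))) > 0.5).astype(int)  # [0, 0, 1, 1]
```

The regularized weight vector stays finite even on separable data, which is the overfitting control referred to above.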


NB (Ng and Jordan 2002) relies on the use of Bayes' rule, which expresses the posterior P(Y|X) as the product of the likelihood P(X|Y) and the prior P(Y), divided by the evidence P(X). The naivety of the approach originates from assuming conditional independence of the features. We impose a multivariate event model regarding the distribution P(X|Y). Junqué de Fortuny et al. (2014a) provide an efficient NB implementation for dealing with large and sparse binary behaviour data.
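A toy version of the multivariate Bernoulli event model may clarify the independence assumption. This is not the optimized sparse implementation referenced above; the data and Laplace smoothing constant are illustrative:

```python
import numpy as np

def nb_fit(X, y, smooth=1.0):
    """Estimate P(Y = c) and P(x_j = 1 | Y = c) with Laplace smoothing."""
    priors, cond = {}, {}
    for c in np.unique(y):
        Xc = X[y == c]
        priors[c] = len(Xc) / len(X)
        cond[c] = (Xc.sum(axis=0) + smooth) / (len(Xc) + 2 * smooth)
    return priors, cond

def nb_predict(X, priors, cond):
    scores = []
    for c, prior in priors.items():
        p = cond[c]
        # log P(c) + sum_j log P(x_j | c): the "naive" independence product
        ll = np.log(prior) + X @ np.log(p) + (1 - X) @ np.log(1 - p)
        scores.append(ll)
    classes = np.array(list(priors))
    return classes[np.argmax(np.vstack(scores), axis=0)]

X = np.array([[1, 0, 0], [1, 1, 0], [0, 0, 1], [0, 1, 1]])  # binary behaviours
y = np.array([0, 0, 1, 1])
priors, cond = nb_fit(X, y)
pred = nb_predict(X, priors, cond)  # [0, 0, 1, 1]
```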

The BeSim approach conceptually corresponds to projecting the bigraph to a unigraph using the procedure outlined in Section 3.2. Next, the weighted-vote relational neighbour classifier (Macskassy and Provost 2007) is used to infer unknown labels/scores through a weighted probability estimation using the known labels of neighbouring nodes. A very fast and scalable implementation of this approach (the SW-transformation) is presented in detail in Stankova et al. (2015).
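A drastically simplified sketch of the underlying idea, assuming plain co-occurrence counts as the projected edge weights (the actual SW-transformation of Stankova et al. (2015) uses different weighting and a far more scalable formulation than this naive loop):

```python
def besim_score(target_items, labeled_users):
    """Weighted-vote estimate of P(Y = 1) for a target user.

    labeled_users: list of (set_of_items, label) pairs with label in {0, 1}.
    Similarity between users = number of items shared in the bigraph.
    """
    pos = sum(len(target_items & items) for items, lab in labeled_users if lab == 1)
    tot = sum(len(target_items & items) for items, lab in labeled_users)
    return pos / tot if tot else 0.5  # fall back to an uninformative prior

# Hypothetical labeled neighbourhood and target user
known = [({"a", "b", "c"}, 1), ({"c", "d"}, 0), ({"a", "b"}, 1)]
score = besim_score({"a", "b", "d"}, known)  # 4 positive votes out of 5 -> 0.8
```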

In Table 10 we compare the performances of the baseline (BL) application of the methods with their EE-versions (with S = 15, T = 20). It is quite striking to see the EE-version dominate the baseline version for each method individually. This suggests that the imbalanced learning issue has an impact on each of the aforementioned techniques. Focussing on the regularization-based approaches, we observe the baseline LR to be less influenced by the imbalance than the SVM method. Their EE-versions (EE SVM and EE LR) boost the baseline performance in almost all cases and have the highest average ranks amongst all the proposed methods. As already emphasized, the regularization-based approaches offer an added element of flexibility, in the sense that the strength of the underlying base learner can be controlled.

The heuristic approaches NB and BeSim do not offer this flexibility. Furthermore, it is not directly possible to integrate instance weights during the training phase of the base learner. During the boosting process, we therefore sample from the distribution Dt to generate (unweighted) examples and subsequently train a BeSim/NB learner. The results in Table 10 indicate BL NB to be the worst performing algorithm amongst all methods tried. Due to the conditional independence assumption, one could regard this as a weak learner, which makes it suitable for use in a boosting process. Indeed, the EE NB approach is able to significantly improve upon its performance. We found this to be the slowest EE-version28 (e.g. taking more than 24 hours to process a single fold of the Bank dataset, versus several minutes for the other versions). The average rank estimate of EE NB is rather pessimistic, since we assigned the lowest performance across all methods in case of missing results. BeSim can be considered a stronger learner compared to NB; its possible boost in performance is typically lower than that of NB. Having said this, we found EE BeSim to improve upon BL BeSim for all datasets (except Adver(p = 1)).

28 This statement only relates to the EE-version. In the baseline case, the timings of the NB-version are roughly of the same order of magnitude as SVM and LR (though this heavily depends on the chosen regularization parameter C). With EE, each subset has a size limited to twice the minority class training size (a relatively low number compared to the size of the majority class). This works in favour of SVM/LR (e.g. SVM has a computational complexity of O(m²), with m the number of training instances). Also note that the NB-version is indeed optimized for sparse data. Internally, the NB-implementation writes a fairly large model file to a specified directory (sizes of 1 GB can be encountered there). This writing operation slows down the training stage (especially in case of EE, where S × T = 15 × 20 = 300 model files are constructed).
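The weight-to-sample workaround described above for NB/BeSim can be sketched as follows (names are illustrative; in the real boosting loop the distribution Dt is updated after every round):

```python
import random

def resample_from_dt(instances, weights, size, seed=0):
    """Draw an unweighted training sample, with replacement, proportionally
    to the current boosting weights D_t."""
    rng = random.Random(seed)
    return rng.choices(instances, weights=weights, k=size)

data = ["x1", "x2", "x3", "x4"]
d_t = [0.7, 0.1, 0.1, 0.1]          # hypothetical boosting distribution
sample = resample_from_dt(data, d_t, size=1000)
# the heavily weighted instance dominates the unweighted sample
```

The resampled set can then be fed to any learner that only accepts unweighted examples, such as the NB and BeSim implementations used here.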


Table 10 Average tenfold AUC test set performance of the BL compared to its EE-version (S = 15, T = 20) for SVM, LR, BeSim and NB. In case the EE NB implementation crashed, we put a × in the table; if it is too slow (> 24 hours per fold), we place a []. Additionally, average ranks are assigned to each method. In case EE NB has missing results for a particular dataset, we ensure it has the lowest rank across all other algorithms.

              Mov G(p = 1)    Mov G(p = 2.5)   Mov Th(p = [])   Yahoo A(p = 1)
BL SVM        71.6 (2.62)     81.41 (1.32)     79.77 (5.33)     56.49 (3.37)
EE SVM        76.12 (2.88)    85.13 (1.86)     86.43 (5.86)     59.74 (2.96)
BL LR         71.02 (2.09)    84.39 (1.84)     83.14 (4.17)     57.84 (2.39)
EE LR         76.69 (2.92)    85.03 (1.98)     86.3 (5.37)      59.79 (2.62)
BL BeSim      76.1 (3.58)     81.3 (2.92)      82.81 (6.6)      56.27 (2.73)
EE BeSim      76.31 (3.71)    81.37 (2.9)      85.02 (6.28)     57.7 (1.71)
BL NB         70.26 (5.84)    77.01 (2.54)     70.48 (10.14)    52.56 (2.09)
EE NB         75.93 (2.83)    85.56 (2.01)     86.91 (4.15)     57.55 (2.73)

              Yahoo A(p = 2.5)   Yahoo G(p = 1)   Yahoo G(p = 2.5)   TaFeng(p = 1)
BL SVM        61.61 (2.48)       66.84 (3.66)     78.82 (1.39)       55.75 (1.6)
EE SVM        66.38 (3.16)       73.48 (2.32)     80.55 (1.55)       61.13 (1.83)
BL LR         66.27 (2.96)       69.82 (1.93)     80.45 (1.59)       58.91 (2.31)
EE LR         66.22 (3.28)       73.08 (2.14)     80.53 (1.56)       61.43 (2.32)
BL BeSim      64.54 (2.02)       68.89 (2.49)     79.55 (1.96)       57.89 (1.18)
EE BeSim      65.25 (2.23)       71.18 (2.91)     80.04 (1.85)       59.36 (1.47)
BL NB         65 (1.65)          63.33 (2.56)     78.89 (1.64)       54.61 (1.2)
EE NB         66.6 (2.79)        70.99 (2.88)     81.01 (1.3)        59.01 (1.84)

              TaFeng(p = 2.5)   Book(p = 1)    Book(p = 2.5)   LST(p = 1)
BL SVM        66.94 (1.34)      52.6 (1.29)    60.08 (0.71)    99.99 (0.01)
EE SVM        70.4 (1.3)        55.34 (1.28)   65.4 (0.51)     99.98 (0.01)
BL LR         69.24 (1.3)       55.34 (1.27)   63.84 (0.75)    99.99 (0.01)
EE LR         70.28 (1.28)      55.49 (1.49)   65.41 (0.63)    99.97 (0.02)
BL BeSim      67.49 (1.23)      55.19 (1.27)   63.7 (0.63)     99.99 (0.01)
EE BeSim      68 (1.21)         55.21 (1.15)   64.38 (0.42)    99.99 (0)
BL NB         65.21 (1.64)      52.93 (0.9)    59.75 (0.47)    98.69 (0.3)
EE NB         70.72 (1.15)      ×              63.46 (0.61)    99.92 (0.04)

              Adver(p = [])   Adver(p = 1)   CRF(p = [])     Bank(p = [])
BL SVM        96.37 (1.94)    91.18 (2.97)   64.36 (18.97)   66.82 (0.88)
EE SVM        97.63 (1.35)    93.3 (2.14)    86.35 (9.99)    71.54 (0.76)
BL LR         97.19 (1.44)    88.51 (1.93)   81.87 (19.63)   71.43 (0.72)
EE LR         97.57 (0.96)    93.02 (2.06)   86.84 (9.62)    71.77 (0.62)
BL BeSim      97.26 (1.12)    95.38 (1.35)   86.91 (9.36)    67.85 (0.67)
EE BeSim      97.38 (1.04)    93.83 (1.35)   87.02 (10.43)   70.41 (0.55)
BL NB         93.75 (1.9)     93.37 (1.9)    87.24 (9.38)    67.83 (0.63)
EE NB         94.04 (1.75)    ×              ×               []

              Flickr(p = 0.1)   Kdd(p = 0.5)   Average Rank
BL SVM        74.92 (0.17)      74.53 (0.05)   6.44 [7]
EE SVM        79.86 (0.13)      80.98 (0.05)   2.39 [1]
BL LR         79.03 (0.11)      81.29 (0.04)   4.28 [4]
EE LR         79.85 (0.13)      80.75 (0.05)   2.61 [2]
BL BeSim      74.62 (0.13)      74.95 (0)      5.11 [6]
EE BeSim      76.4 (0.13)       77.55 (0.03)   3.61 [3]
BL NB         81.36 (0.1)       74.29 (0.05)   6.5 [8]
EE NB         []                []             5.06 [5]


6 Conclusions and future research directions

We investigated the effects of sampling and cost-sensitive learning techniques on the problem of learning from imbalanced behaviour data. This type of big data has fundamentally different properties and internal structure than traditional data and is characterized by sparseness and very large dimensions. This is a new research topic which enables benefits across a wide variety of application domains, such as fraud detection, targeted advertising, predictive policing, churn prediction and default prediction. To enable future developments in this field, we provide an on-line repository containing implementations, datasets and results, available at the following link: http://www.applieddatamining.com/cms/?q=software

Oversampling techniques need to be adapted in order to cope with this kind of data. We proposed new versions of the traditional SMOTE and ADASYN techniques, which differ in the way synthetic samples are generated and similarity computations are performed. The plain random oversampling (OSR) and the synthetic approaches are competitive methods in terms of AUC and do not seem to suffer from overfitting, as traditional studies report. Though the synthetic approaches have AUC performances comparable to OSR, they are useless in practice due to their computationally expensive procedures. The OSR method is very fast for small and medium sized datasets, yet becomes inappropriate for larger datasets because the training times of the underlying base learner increase drastically.

With respect to undersampling methodologies, we employed a random undersampling technique (RUS) and Knn-based informed approaches. The latter type selects majority class instances that are closest (Cl Knn) or farthest (Far Knn) from the minority class examples and allows one to investigate the effect of noise/outliers on SVM performance. We observed that these odd instances have a severe performance degrading effect. RUS and Far Knn approaches can significantly improve performance with respect to the baseline, thanks to their abilities to reduce noise/outlier levels and to put more emphasis on the minority class. Having said this, they usually perform worse than any of the oversampling approaches. This is in sheer contrast to the results obtained from traditional studies and is heavily related to the intrinsic nature of behaviour data. Timing-wise, RUS is the fastest method. The informed approaches are slower than RUS due to the required nearest neighbour computations. Additionally, a cluster-based undersampling (CBU) technique was proposed to alleviate the within-class imbalance. For low undersampling rates, the performance of CBU is superior to plain RUS, whereas for high undersampling rates the opposite is true. Overall, no substantial gain in performance is obtained in comparison with a simple RUS approach. Furthermore, the method is the slowest among all techniques considered and relies on computationally expensive clustering heuristics.

We studied the effects of boosting, where we made use of a very specific linear SVM and logistic regression combination that uses confidence-rated predictions, instead of a more popular plain "discrete" boosting algorithm (with weak hypotheses having values in {−1, 1}). Our experiments clearly indicated that the regularization constant C in the SVM formulation acts as a "weakness" indicator. Indeed, higher C-values cause stronger learners and should not be used in the boosting process, which matches the analysis of studies dealing with traditional data. Strong learners can already quite accurately distinguish majority class from minority class behaviours. The boosting process will focus on the hard to learn instances, among which are noise/outliers, and this can result in overfitting behaviour. We also observed that boosting can usually outperform the plain baseline, indicating that a combination of weak learners can achieve higher AUC performance than a single strong learner.

EasyEnsemble (EE) integrates random undersampling with a boosting process. By sampling several balanced subsets from the training data and combining these, we can emphasize the minority class, achieve noise/outlier reduction effects and simultaneously explore the majority class space efficiently. Our version differs from the original formulation because we employ a different type of underlying weak learner (an SVM and LR combination) that is fed to a confidence-rated boosting algorithm (instead of the discrete version). We observed superior AUC performance with respect to all of the aforementioned techniques. Furthermore, the method is very fast, since each subset can be trained in parallel and has a size that is twice as large as the minority class training set. We studied the effect of the number of subsets and noted that a limited amount of these (O(10)) is already sufficient to obtain satisfactory performances; increasing this number further yields only minor benefits.

Additionally, a statistical comparison of each of the proposed methods was conducted across a wide variety of datasets. Holm's significance test indicates that each of the proposed techniques (oversampling, undersampling (except Cl Knn), cost-sensitive learning and boosting variants) can significantly outperform the plain baseline performance (each of the null-hypotheses of equal performance gets rejected with a confidence level of at least 91%). Comparison of the techniques with EE showed that the latter can significantly outperform all of them (each of the null-hypotheses of equal performance gets rejected with a confidence level of at least 75%). A timing-wise comparison showed EE to be the preferred method (after RUS) for large and highly imbalanced datasets. A parallel implementation would make EE the fastest method, even for medium sized datasets that show a high level of imbalance.

The influence of the chosen base learner was investigated by comparing the baseline to its EE-version for the SVM, LR, NB and BeSim inducers. Each of the EE-versions is able to improve upon the baseline performance for nearly all datasets, where base learners of the regularization type are superior because of their ability to control the strength of the learner.

Though we conducted an extensive benchmark study, there are still plenty of future research directions. First, and most important, there is an urgent need to develop a more general theoretical framework to completely understand the effect of imbalance and why certain methods outperform others. Secondly, other types of mechanisms could be investigated in this setting: advanced sampling (Guo et al. 2008) set-ups perform sampling based on preliminary classifications; one-class learning is a recognition-based methodology where a discriminative boundary is learned around the training examples of a single class alone, and Raskutti and Kowalczyk (2004) demonstrated the superiority of the recognition-based one-class SVMs over the discriminative two-class SVMs; other types of ensemble learning strategies (e.g. bagging and stacking) could be investigated; Batista et al. (2004) note that combining over- and undersampling can be beneficial when data are highly imbalanced or show very few minority class instances; the ideas behind Extended Nearest Neighbour (Tang and He 2015) could guide us in the sampling process. As opposed to Knn, "it considers not only who are the nearest neighbours of a certain sample, but also who consider the particular sample as their nearest neighbour" (Tang and He 2015). The method assigns a class membership to an unknown sample that maximizes the intra-class coherence. Finally, more advanced techniques exist that can determine K (the number of nearest neighbours) faster or with (slightly) better performance.

Another issue relates to the lack of comprehensibility of the EasyEnsemble technique in its present form. Because we currently integrate Logistic Regression in the boosting process, the final ensemble classifier is non-linear and therefore incomprehensible. It might be interesting to investigate the numerical boosting technique proposed in Schapire and Singer (1999), where each weak learner is allowed to output real values (instead of values in [−1, 1]). In that case, we would be able to use a plain linear SVM as a weak learner, without the Logistic Regression component. The final ensemble would be a linear model, which is surely easier to comprehend. Also, the instance-based approach of Martens and Provost (2014) could be used to explain individual predictions of the non-linear EE-technique.

A Oversampling experiments

Table 11 Oversampling experiments. Results show the average tenfold AUC test set performance with respect to increasing β-values (β = [β1, β2, β3, β4] = [0, 1/3, 2/3, 1]). Additionally, standard deviations are shown between brackets. Optimal performances for each method are highlighted in boldface. See Section 4.2 for parameter settings.

(columns: β1, β2, β3, β4)

Mov G(p = 1)
OSR       71.6 (2.62)    74.37 (2.04)   73.6 (1.84)    74.73 (2.45)
SMOTE     71.6 (2.62)    75.08 (2.18)   76.02 (2.14)   76.48 (2.3)
ADASYN    71.6 (2.62)    75.16 (1.92)   75.93 (2.08)   76.47 (2.29)

Mov G(p = 2.5)
OSR       81.41 (1.32)   83.49 (1.81)   83.84 (1.96)   83.91 (2.04)
SMOTE     81.41 (1.32)   83.32 (1.97)   83.59 (2.04)   83.76 (2.11)
ADASYN    81.41 (1.32)   83.61 (1.82)   84.02 (1.97)   83.69 (1.96)

Mov Th(p = [])
OSR       79.77 (5.33)   85.3 (4.66)    83.16 (4.5)    84.59 (5.69)
SMOTE     79.77 (5.33)   84.18 (6.51)   85.58 (5.97)   84.33 (5.75)
ADASYN    79.77 (5.33)   84.11 (6.77)   85.86 (6.06)   85.36 (5.13)

Yahoo A(p = 1)
OSR       55.92 (2.97)   58.66 (3.27)   59.99 (2.28)   59.74 (1.78)
SMOTE     55.92 (2.97)   59.76 (2.62)   59.74 (2.67)   59.43 (2.4)
ADASYN    55.92 (2.97)   59.54 (2.53)   59.55 (2.94)   59.56 (2.22)

Yahoo A(p = 2.5)
OSR       61.68 (2.42)   64.19 (3.17)   65.08 (3.26)   64.67 (2.1)
SMOTE     61.68 (2.42)   65.46 (3.63)   65.33 (3.23)   64.52 (2.98)
ADASYN    61.68 (2.42)   65.04 (3.74)   65.41 (3.47)   64.4 (2.21)

Continues on next page


Table 11 continued (columns: β1, β2, β3, β4)

Yahoo G(p = 1)
OSR       66.84 (3.66)   72.18 (2.36)   73.11 (2.7)    72.49 (3.41)
SMOTE     66.84 (3.66)   72.65 (2.85)   73.27 (3.36)   73.37 (3.56)
ADASYN    66.84 (3.66)   72.87 (2.83)   73.18 (3.2)    73.39 (3.59)

Yahoo G(p = 2.5)
OSR       78.82 (1.39)   78.78 (2.02)   78.74 (1.86)   78.35 (1.47)
SMOTE     78.82 (1.39)   79.23 (1.57)   79.1 (1.2)     79.03 (1.89)
ADASYN    78.82 (1.39)   79.12 (1.43)   79.23 (1.37)   79.51 (2.01)

TaFeng(p = 1)
OSR       55.75 (1.6)    59.23 (1.96)   60 (1.68)      61.04 (2.36)
SMOTE     55.75 (1.6)    60.26 (1.95)   61.49 (1.8)    61.13 (1.52)
ADASYN    55.75 (1.6)    60.26 (1.9)    61.44 (1.85)   61.16 (1.5)

TaFeng(p = 2.5)
OSR       66.94 (1.34)   68.84 (1.21)   66.99 (1.42)   67.7 (1.41)
SMOTE     66.94 (1.34)   68.47 (1.5)    67.07 (1.15)   66.65 (0.81)
ADASYN    66.94 (1.34)   68.62 (1.38)   67.85 (1.6)    66.91 (1.39)

Book(p = 1)
OSR       52.6 (1.29)    53.61 (0.94)   55.41 (1.75)   55.87 (1.44)
SMOTE     52.6 (1.29)    54.77 (0.99)   54.91 (0.8)    54.36 (0.98)
ADASYN    52.6 (1.29)    54.86 (1.13)   55.06 (0.73)   54.54 (0.92)

Book(p = 2.5)
OSR       60.08 (0.71)   61.05 (0.94)   61.82 (1.12)   64.62 (0.57)
SMOTE     60.08 (0.71)   62.6 (0.73)    60.95 (0.68)   63 (0.8)
ADASYN    60.08 (0.71)   62.33 (0.73)   60.77 (0.85)   63.04 (0.58)

LST(p = 1)
OSR       99.99 (0.01)   99.99 (0.01)   99.99 (0.01)   99.99 (0.01)
SMOTE     99.99 (0.01)   99.99 (0.01)   99.99 (0.01)   99.99 (0.01)
ADASYN    99.99 (0.01)   99.99 (0.01)   99.99 (0.01)   99.99 (0.01)

Adver(p = [])
OSR       96.61 (1.82)   97.31 (1.65)   97.07 (1.84)   97.07 (1.79)
SMOTE     96.61 (1.82)   96.91 (1.66)   97.19 (1.65)   97.07 (1.91)
ADASYN    96.61 (1.82)   97.1 (1.7)     97.08 (1.87)   97.07 (1.88)

Adver(p = 1)
OSR       90.93 (3.02)   91.27 (3.03)   92.66 (2.82)   93.29 (1.97)
SMOTE     90.93 (3.02)   92.51 (2.03)   92.96 (2.14)   93.53 (1.81)
ADASYN    90.93 (3.02)   92.22 (2.33)   92.7 (2.36)    93.88 (1.73)

CRF(p = [])
OSR       64.06 (16.43)  80.82 (12.94)  81.28 (12.27)  81.91 (11.28)
SMOTE     64.06 (16.43)  78.64 (16.86)  82.52 (13.74)  79.32 (16.26)
ADASYN    64.06 (16.43)  78.95 (16.72)  81.19 (16.32)  79.31 (16.19)

Bank(p = [])
OSR       66.82 (0.88)   70.1 (0.74)    71.39 (0.8)    71.47 (0.8)
SMOTE     []
ADASYN    []


B Undersampling experiments

Table 12 Undersampling experiments. Results show the average tenfold AUC test set performance with respect to increasing βu-values (βu = [βu1, βu2, βu3, βu4, βu5] = [0, 1/4, 1/2, 3/4, 1]). Optimal performances (boldface) are highlighted for each imbalance ratio. Additionally, standard deviations are shown between brackets. Cl K represents the "Closest Knn" technique; Cl T represents the "Closest tot sim" technique (similarly for Far K and Far T; see Section 3.2). Parameter settings are given in Section 4.2.

(columns: βu1, βu2, βu3, βu4, βu5)

Mov G(p = 1)
RUS      71.6 (2.6)    71.83 (2.6)   72.54 (2.5)   72.39 (3.1)   70.61 (3.5)
Cl K     71.6 (2.6)    71.4 (2)      70.96 (1.9)   70.43 (2.4)   69.05 (4.1)
Cl T     71.6 (2.6)    70.28 (2.5)   66.74 (2)     66.8 (2.1)    68.18 (3.6)
Far K    71.6 (2.6)    72.36 (2.7)   71.26 (3.4)   66.57 (5.2)   53.5 (3.5)
Far T    71.6 (2.6)    72.22 (2.8)   71.63 (3.6)   64.28 (5.3)   50.88 (4.4)
CBU      72.55 (2.6)   73.28 (2.6)   73.12 (2.6)   73.84 (2.5)   73 (3.1)

Mov G(p = 2.5)
RUS      81.41 (1.3)   81.36 (1.3)   81.78 (1.7)   82.05 (1.7)   81.6 (2.1)
Cl K     81.41 (1.3)   80.86 (1.2)   80.95 (1.6)   79.73 (2.3)   77.95 (2.3)
Cl T     81.41 (1.3)   79.9 (1.2)    78.21 (1.4)   77.87 (1.5)   77.76 (2.3)
Far K    81.41 (1.3)   80.9 (1.5)    78.17 (1.8)   74.25 (2.4)   69.79 (3.2)
Far T    81.41 (1.3)   80.86 (1.5)   77.2 (2.4)    71.16 (2.7)   62.4 (2.8)
CBU      81.53 (1.4)   81.64 (1.3)   81.29 (1.6)   81.28 (2.1)   80.34 (2.7)

Mov Th(p = [])
RUS      79.77 (5.3)   80.32 (5.8)   81.57 (5.5)   81.86 (6.6)   81.26 (6.2)
Cl K     79.77 (5.3)   79.25 (4.5)   78.07 (5)     76.25 (6.5)   62.46 (8.5)
Cl T     79.77 (5.3)   78.4 (4.4)    72.41 (3.5)   64.66 (4.5)   60.37 (7.3)
Far K    79.77 (5.3)   84.54 (5)     83.64 (6.4)   80.02 (7.3)   56.82 (10.3)
Far T    79.77 (5.3)   85.03 (5.7)   82.68 (6.8)   75.61 (9.2)   56.77 (10.9)
CBU      80.11 (5.8)   81.17 (6)     81.08 (6.5)   84.17 (5.1)   80.96 (6.9)

Yahoo A(p = 1)
RUS      55.92 (3)     55.57 (3.4)   56.44 (3)     55.83 (3.4)   56.37 (3.3)
Cl K     55.92 (3)     55.67 (2.4)   53.12 (2)     50.57 (1.8)   53.79 (3.5)
Cl T     55.92 (3)     55.69 (2.1)   53.35 (2.2)   50.31 (2.2)   52.35 (3.3)
Far K    55.92 (3)     57.35 (2.2)   56.92 (1.1)   56.95 (2.3)   51.18 (2)
Far T    55.92 (3)     56.93 (2.4)   54.74 (1.9)   57.01 (1.8)   51.18 (2)
CBU      58.21 (2.6)   58.45 (3.3)   58.31 (3.5)   58.39 (3.5)   56.09 (2.6)

Yahoo A(p = 2.5)
RUS      61.68 (2.4)   62.9 (2.9)    63.62 (3.6)   63.75 (3.1)   63.19 (1.9)
Cl K     61.68 (2.4)   61.14 (2.1)   57.62 (1.6)   54.02 (1.8)   51.48 (1.4)
Cl T     61.68 (2.4)   60.89 (2.8)   58.11 (1.4)   54.4 (2.1)    51.76 (1.4)
Far K    61.68 (2.4)   63.96 (3)     62.62 (2.2)   59.61 (1.5)   56.25 (1.6)
Far T    61.68 (2.4)   63.71 (2.4)   59.72 (1.6)   57.27 (1.1)   54.47 (1.1)
CBU      62.46 (2.6)   61.85 (1.4)   61.78 (2.2)   59.94 (3)     60.1 (4)

Continues on next page


Table 12 continued (columns: βu1, βu2, βu3, βu4, βu5)

Yahoo G(p = 1)
RUS      66.84 (3.7)   67.85 (3.2)   68.36 (3.2)   68.23 (4)     69.9 (4.2)
Cl K     66.84 (3.7)   66.71 (2.8)   64.3 (3.6)    61.98 (3.9)   61.15 (1.9)
Cl T     66.84 (3.7)   65.79 (2.7)   63.55 (3.3)   59.21 (3.5)   61.08 (2.4)
Far K    66.84 (3.7)   66.76 (4.1)   63.84 (3.4)   65.16 (2)     48.5 (2.9)
Far T    66.84 (3.7)   66.95 (4.1)   63.48 (2.9)   65.16 (2)     48.48 (2.9)
CBU      69.68 (4.1)   70.59 (3.2)   70.64 (3.7)   70.2 (2.9)    63.35 (3.6)

Yahoo G(p = 2.5)
RUS      78.82 (1.4)   78.91 (1.6)   78.97 (1.6)   78.61 (1.6)   77.82 (2.1)
Cl K     78.82 (1.4)   77.26 (1.5)   72.52 (1.5)   67.86 (2)     65.07 (2.7)
Cl T     78.82 (1.4)   76.83 (1)     71.99 (1.8)   67.15 (2.3)   61.1 (2.7)
Far K    78.82 (1.4)   78.26 (2.2)   74.69 (2.7)   67.22 (2.1)   60.72 (2.3)
Far T    78.82 (1.4)   77.68 (2.6)   72.44 (3)     64.94 (2.4)   59.6 (2)
CBU      75.25 (3.2)   75.22 (2.4)   74.69 (2.3)   73.07 (2.4)   70.69 (2.4)

TaFeng(p = 1)
RUS      55.75 (1.6)   56.1 (1.6)    56.26 (1.7)   57.23 (1.7)   59.25 (2.2)
Cl K     55.75 (1.6)   55.68 (1.6)   55.58 (1.5)   55.08 (1.1)   51.05 (1.5)
Cl T     55.75 (1.6)   55.67 (1.6)   54.47 (1.6)   47.53 (1.6)   49.3 (1.1)
Far K    55.75 (1.6)   58.99 (1.2)   59.47 (1.1)   60.04 (1.2)   56.31 (1)
Far T    55.75 (1.6)   58.92 (1.3)   59.25 (1.3)   58.58 (1.1)   56.31 (1)
CBU      57.8 (1)      58.47 (1.1)   58.15 (0.9)   58.87 (1.4)   57.65 (1.6)

TaFeng(p = 2.5)
RUS      66.94 (1.3)   67.44 (1.3)   68.1 (1.4)    68.27 (1.4)   66.13 (1.2)
Cl K     66.94 (1.3)   66.13 (1.4)   63.39 (1.2)   59.83 (1.3)   56.94 (0.7)
Cl T     66.94 (1.3)   66.38 (1.5)   62.89 (1.6)   57.46 (1.3)   54.56 (1.3)
Far K    66.94 (1.3)   68.06 (1.4)   66.43 (1.6)   64.46 (1.5)   63.35 (1.3)
Far T    66.94 (1.3)   64.31 (1.1)   62.69 (1)     61.27 (1.1)   59.03 (1)
CBU      64.81 (1.2)   64.15 (1.1)   64.13 (1.2)   63.88 (0.8)   63.46 (0.8)

Book(p = 1)
RUS      52.6 (1.3)    52.79 (0.9)   53.46 (0.8)   53.89 (0.9)   54.05 (0.9)
Cl K     52.6 (1.3)    52.56 (1.2)   52.52 (1.3)   52.39 (1.1)   53.09 (1.1)
Cl T     52.6 (1.3)    52.56 (1.2)   52.52 (1.3)   52.39 (1.1)   53.05 (0.7)
Far K    52.6 (1.3)    55.21 (1.2)   56.21 (1.8)   56.14 (1.2)   53.06 (1)
Far T    52.6 (1.3)    55.21 (1.2)   56.21 (1.8)   56.14 (1.2)   53.06 (1)
CBU      54.28 (0.9)   53.77 (1)     53.33 (1.1)   53.34 (0.9)   52.84 (0.8)

Book(p = 2.5)
RUS      60.08 (0.7)   60.13 (0.6)   60.4 (0.8)    60.33 (0.8)   63.28 (0.8)
Cl K     60.08 (0.7)   59.96 (0.7)   60.13 (0.8)   59.96 (1)     59.28 (0.7)
Cl T     60.08 (0.7)   59.96 (0.7)   60.13 (0.8)   60.29 (0.4)   54.5 (0.9)
Far K    60.08 (0.7)   63.29 (1)     64.19 (0.8)   57.3 (1.1)    55.66 (1.1)
Far T    60.08 (0.7)   62.14 (0.5)   58.27 (0.6)   56.37 (1)     55.66 (1.1)
CBU      54.82 (0.9)   54.67 (0.9)   54.71 (0.9)   54.66 (1)     54.78 (0.9)

Continues on next page


Table 12 continuedLST(p = 1)

βu1 βu2 βu3 βu4 βu5RUS 9999(0) 9999(0) 9999(0) 9998(0) 9999(0)Cl K 9999(0) 9999(0) 9999(0) 9999(0) 9999(0)CL T 9999(0) 9999(0) 9999(0) 9999(0) 9998(0)Far K 9999(0) 9998(0) 9998(0) 9998(0) 9998(0)Far T 9999(0) 9998(0) 9998(0) 9998(0) 9998(0)CBU [] [] [] [] []

Adver(p = [])βu1 βu2 βu3 βu4 βu5

RUS 9661(18) 9632(18) 9663(14) 9712(21) 9622(16)Cl K 9661(18) 9644(15) 9614(15) 9604(2) 948(25)CL T 9661(18) 9587(21) 9432(19) 9301(22) 9072(23)Far K 9661(18) 9653(14) 9576(2) 9439(18) 9049(31)Far T 9661(18) 9654(15) 9567(19) 9454(18) 893(28)CBU 9685(23) 9685(23) 9705(15) 966(16) 9606(21)

Adver (p = 1)

          βu1            βu2            βu3            βu4            βu5
RUS       90.93 (3)      91.53 (3.1)    92.37 (3.4)    91.9 (2.9)     91.93 (2.2)
Cl K      90.93 (3)      90.64 (3)      89.87 (3.9)    90.21 (3.6)    89.18 (2)
Cl T      90.93 (3)      89.7 (3.5)     88.55 (3.4)    85.76 (3.3)    88.2 (2.3)
Far K     90.93 (3)      93.8 (2.3)     92.4 (2.6)     88.73 (3.4)    85.51 (4)
Far T     90.93 (3)      93.62 (2.4)    93.2 (2.2)     88.41 (3.6)    85.51 (4)
CBU       93.22 (2.4)    93.76 (2.5)    93.89 (2.6)    93.52 (2.7)    91.27 (2)

CRF (p = [])

          βu1            βu2            βu3            βu4            βu5
RUS       64.06 (16.4)   63.28 (15.9)   67.98 (17.4)   66.95 (21.9)   87.73 (8.8)
Cl K      64.06 (16.4)   62.44 (16.6)   62.34 (16.9)   71.37 (13.8)   78.22 (17.7)
Cl T      64.06 (16.4)   62.44 (16.6)   62.34 (16.9)   71.37 (13.8)   62.67 (22.9)
Far K     64.06 (16.4)   83.8 (14.2)    83.93 (14.8)   84.49 (13.7)   86.11 (9.7)
Far T     64.06 (16.4)   83.8 (14.2)    83.93 (14.8)   84.49 (13.7)   86.11 (9.7)
CBU       []             []             []             []             []

Bank (p = [])

          βu1            βu2            βu3            βu4            βu5
RUS       66.82 (0.9)    67.02 (0.9)    67.37 (0.8)    67.99 (0.6)    69.5 (1)
Cl K      66.82 (0.9)    66.17 (0.7)    65.24 (0.6)    64.86 (0.6)    58.53 (1.1)
Cl T      66.82 (0.9)    64.92 (1.1)    60.69 (0.9)    56.33 (0.8)    52.87 (0.7)
Far K     66.82 (0.9)    66.95 (0.6)    66.19 (0.6)    64.42 (0.6)    58.25 (1.1)
Far T     66.82 (0.9)    67.16 (0.6)    64.2 (0.8)     59.67 (1)      58.25 (1.1)
CBU       []             []             []             []             []

48 Jellis Vanhoeyveld David Martens

C Boosting experiments

This section presents the results of each of the boosting experiments from Section 4.5 on the remaining data sources. Each of the figures below depicts the average tenfold AUC-performance on test data (with µ = 100) with respect to the number of boosting iterations, for (left) Adaboost (AB), AdaCost (AC) and EasyEnsemble (EE) with C chosen according to the highest validation set AUC-performance (over all possible boosting rounds), and (right) AB and EE (with S = 15) at varying C-levels.
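As a reminder of what the EE(S) curves in the panels below trace, the EasyEnsemble procedure can be sketched as follows. This is an illustrative reconstruction from the method's description, not the experimental code: the use of scikit-learn's AdaBoostClassifier as the boosted learner and the function name are our own assumptions.

```python
import numpy as np
from sklearn.ensemble import AdaBoostClassifier

def easy_ensemble_scores(X, y, X_test, S=15, T=30, seed=0):
    """EasyEnsemble sketch: draw S balanced subsets (each combining all
    minority instances with an equally sized random majority sample),
    boost each subset for T rounds, and average the output scores."""
    rng = np.random.RandomState(seed)
    pos = np.where(y == 1)[0]              # minority class indices
    neg = np.where(y == 0)[0]              # majority class indices
    scores = np.zeros(X_test.shape[0])
    for _ in range(S):
        sub = rng.choice(neg, size=len(pos), replace=False)
        idx = np.concatenate([pos, sub])   # subset is only 2x the minority size
        clf = AdaBoostClassifier(n_estimators=T, random_state=seed)
        clf.fit(X[idx], y[idx])
        scores += clf.decision_function(X_test)
    return scores / S
```

Because each subset is small and the S iterations are independent, the procedure parallelizes trivially, which is the speed advantage emphasized in the paper.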

[Figure: test AUC versus boosting round T (0-30); (a) AB, AC (R = 2), AC (R = 8), AC (RD), EE (S = 5, 10, 15), BL; (b) AB and EE at C = 1e-07, 1e-05, 0.001, 0.1, plus BL.]

Fig 6 Mov G(p = 1) dataset

[Figure: test AUC versus boosting round T (0-30); (a) AB, AC (R = 2), AC (R = 8), AC (RD), EE (S = 5, 10, 15), BL; (b) AB and EE at C = 1e-07, 1e-05, 0.001, 0.1, plus BL.]

Fig 7 Mov Th(p = []) dataset


[Figure: test AUC versus boosting round T (0-30); (a) AB, AC (R = 2), AC (R = 8), AC (RD), EE (S = 5, 10, 15), BL; (b) AB and EE at C = 1e-07, 1e-05, 0.001, 0.1, plus BL.]

Fig 8 Yahoo A(p = 1) dataset

[Figure: test AUC versus boosting round T (0-30); (a) AB, AC (R = 2), AC (R = 8), AC (RD), EE (S = 5, 10, 15), BL; (b) AB and EE at C = 1e-07, 1e-05, 0.001, 0.1, plus BL.]

Fig 9 Yahoo A(p = 25) dataset

[Figure: test AUC versus boosting round T (0-30); (a) AB, AC (R = 2), AC (R = 8), AC (RD), EE (S = 5, 10, 15), BL; (b) AB and EE at C = 1e-07, 1e-05, 0.001, 0.1, plus BL.]

Fig 10 Yahoo G(p = 1) dataset


[Figure: test AUC versus boosting round T (0-30); (a) AB, AC (R = 2), AC (R = 8), AC (RD), EE (S = 5, 10, 15), BL; (b) AB and EE at C = 1e-07, 1e-05, 0.001, 0.1, plus BL.]

Fig 11 Yahoo G(p = 25) dataset

[Figure: test AUC versus boosting round T (0-30); (a) AB, AC (R = 2), AC (R = 8), AC (RD), EE (S = 5, 10, 15), BL; (b) AB and EE at C = 1e-07, 1e-05, 0.001, 0.1, plus BL.]

Fig 12 TaFeng(p = 1) dataset

[Figure: test AUC versus boosting round T (0-30); (a) AB, AC (R = 2), AC (R = 8), AC (RD), EE (S = 5, 10, 15), BL; (b) AB and EE at C = 1e-07, 1e-05, 0.001, 0.1, plus BL.]

Fig 13 Book(p = 1) dataset


[Figure: test AUC versus boosting round T (0-30); (a) AB, AC (R = 2), AC (R = 8), AC (RD), EE (S = 5, 10, 15), BL; (b) AB and EE at C = 1e-07, 1e-05, 0.001, 0.1, plus BL.]

Fig 14 LST(p = 1) dataset

[Figure: test AUC versus boosting round T (0-30); (a) AB, AC (R = 2), AC (R = 8), AC (RD), EE (S = 5, 10, 15), BL; (b) AB and EE at C = 1e-07, 1e-05, 0.001, 0.1, plus BL.]

Fig 15 Adver(p = []) dataset

[Figure: test AUC versus boosting round T (0-30); (a) AB, AC (R = 2), AC (R = 8), AC (RD), EE (S = 5, 10, 15), BL; (b) AB and EE at C = 1e-07, 1e-05, 0.001, 0.1, plus BL.]

Fig 16 Adver(p = 1) dataset


[Figure: test AUC versus boosting round T (0-30); (a) AB, AC (R = 2), AC (R = 8), AC (RD), EE (S = 5, 10, 15), BL; (b) AB and EE at C = 1e-07, 1e-05, 0.001, 0.1, plus BL.]

Fig 17 CRF(p = []) dataset

D Final Comparison

[Figure: scatter of average rank AUC (x-axis, 0-14) versus average rank Time (y-axis, 0-18) for BL, OSR, SMOTE, ADASYN, RUS, Cl Knn, Far Knn, CBU, AB, AC (R = 2, 8), AC (R = RL), EE (S = 10), EE (S = 15) and EE par.]

Fig 18 Average rank AUC versus average rank Time (see Table 9) across the large datasets from Table 2 (CRF and Bank). Regarding the former, an extra row EE par is added, having the same AUC-value as EE(S = 15). Ranks for each dataset are subsequently obtained via the procedure outlined in Section 4.6.1. Points occurring in the upper-right region are preferred.


References

Akbani R, Kwek S, Japkowicz N (2004) Applying support vector machines to imbalanced datasets. In: Machine Learning: ECML 2004, 15th European Conference on Machine Learning, Pisa, Italy, September 20-24, 2004, Proceedings. Springer Berlin Heidelberg, pp 39–50. DOI 10.1007/978-3-540-30115-8_7

Ali A Shamsuddin SM Ralescu AL (2015) Classification with class imbalance problem A review Inter-national Journal of Advances in Soft Computing and its Applications 7(3)176ndash204

Alzahrani T Horadam KJ (2016) Community detection in bipartite networks Algorithms and case studiesIn Complex Systems and Networks Dynamics Controls and Applications Springer Berlin Heidel-berg Berlin Heidelberg pp 25ndash50 DOI 101007978-3-662-47824-0 2

Bachner J (2013) Predictive policing Preventing crime with data and analytics IBM Center for the Busi-ness of Government

Baesens B Van Gestel T Viaene S Stepanova M Suykens J Vanthienen J (2003) Benchmarking state-of-the-art classification algorithms for credit scoring Journal of the Operational Research Society54(6)627ndash635 DOI 101057palgravejors2601545

Barandela R, Sánchez J, García V, Rangel E (2003) Strategies for learning in class imbalance problems. Pattern Recognition 36(3):849–851. DOI 10.1016/S0031-3203(02)00257-1

Barber MJ (2007) Modularity and community detection in bipartite networks Physical Review E76066102 DOI 101103PhysRevE76066102

Barua S Islam MM Yao X Murase K (2014) MWMOTEndashmajority weighted minority oversamplingtechnique for imbalanced data set learning IEEE Transactions on Knowledge and Data Engineer-ing 26(2)405ndash425 DOI 101109TKDE2012232

Batista GEAPA Prati RC Monard MC (2004) A study of the behavior of several methods for balancingmachine learning training data SIGKDD Explor Newsl 6(1)20ndash29 DOI 10114510077301007735

Beckett SJ (2016) Improved community detection in weighted bipartite networks Royal Society OpenScience 3(1) DOI 101098rsos140536

Bekkar M Djemaa HK Alitouche TA (2013) Evaluation measures for models assessment over imbalanceddata sets Journal of Information Engineering and Applications 3(10)27ndash38

Bhattacharyya S Jha S Tharakunnel K Westland JC (2011) Data mining for credit card fraud A compara-tive study Decision Support Systems 50(3)602 ndash 613 DOI httpsdoiorg101016jdss201008008

Blondel VD Guillaume JL Lambiotte R Lefebvre E (2008) Fast unfolding of communities in large net-works Journal of Statistical Mechanics Theory and Experiment 2008(10)P10008

Breiman L, Friedman J, Stone CJ, Olshen RA (1984) Classification and regression trees. Taylor & Francis

Brozovsky L, Petricek V (2007) Recommender system for online dating service. In: Proceedings of Znalosti 2007 Conference, VSB Ostrava

Cha M, Mislove A, Gummadi KP (2009) A measurement-driven analysis of information propagation in the Flickr social network. In: Proceedings of the 18th International Conference on World Wide Web, ACM, New York, NY, USA, WWW '09, pp 721–730. DOI 10.1145/1526709.1526806

Chawla NV (2005) Data mining for imbalanced datasets An overview In Data mining and knowledgediscovery handbook Springer US Boston MA pp 853ndash867

Chawla NV Bowyer KW Hall LO Kegelmeyer WP (2002) SMOTE synthetic minority over-samplingtechnique Journal of artificial intelligence research 16321ndash357

Chawla NV Lazarevic A Hall LO Bowyer KW (2003) Smoteboost Improving prediction of the minorityclass in boosting In Knowledge Discovery in Databases PKDD 2003 Springer Berlin HeidelbergBerlin Heidelberg pp 107ndash119

Chawla NV Japkowicz N Kotcz A (2004) Editorial Special issue on learning from imbalanced data setsSIGKDD Explor Newsl 6(1)1ndash6 DOI 10114510077301007733

Chen M Mao S Liu Y (2014) Big data A survey Mobile Networks and Applications 19(2)171ndash209DOI 101007s11036-013-0489-0

Chyi YM (2003) Classification analysis techniques for skewed class distribution problems Master thesisDepartment of Information management National Sun Yat-Sen University

Demšar J (2006) Statistical comparisons of classifiers over multiple data sets. Journal of Machine Learning Research 7(Jan):1–30

Dhillon IS (2001) Co-clustering documents and words using bipartite spectral graph partitioning In Pro-ceedings of the Seventh ACM SIGKDD International Conference on Knowledge Discovery and DataMining ACM New York NY USA KDD rsquo01 pp 269ndash274 DOI 101145502512502550


Drummond C Holte RC (2003) C45 class imbalance and cost sensitivity why under-sampling beatsover-sampling In Proceedings of the ICML rsquo03 Workshop on Learning from Imbalanced Datasets

Fan RE Chang KW Hsieh CJ Wang XR Lin CJ (2008) LIBLINEAR A library for large linear classifi-cation Journal of Machine Learning Research 91871ndash1874

Fan W Stolfo SJ Zhang J Chan PK (1999) AdaCost Misclassification cost-sensitive boosting In Pro-ceedings of the Sixteenth International Conference on Machine Learning Morgan Kaufmann Pub-lishers Inc San Francisco CA USA ICML rsquo99 pp 97ndash105

Fawcett T (2006) An introduction to ROC analysis Pattern Recognition Letters 27(8)861 ndash 874 DOIhttpsdoiorg101016jpatrec200510010

Finch H (2005) Comparison of distance measures in cluster analysis with dichotomous data Journal ofData Science 3(1)85ndash100

Fortunato S (2010) Community detection in graphs Physics Reports 486(3 5)75 ndash 174 DOI httpsdoiorg101016jphysrep200911002

Junque de Fortuny E Martens D Provost F (2014a) Predictive modeling with big data is bigger reallybetter Big Data 1(4)215ndash226 DOI 101089big20130037

Junque de Fortuny E Stankova M Moeyersoms J Minnaert B Provost F Martens D (2014b) Corporateresidence fraud detection In Proceedings of the 20th ACM SIGKDD International Conference onKnowledge Discovery and Data Mining ACM New York NY USA KDD rsquo14 pp 1650ndash1659 DOI10114526233302623333

Frasca M Bertoni A Re M Valentini G (2013) A neural network algorithm for semi-supervised node labellearning from unbalanced data Neural Networks 4384 ndash 98 DOI httpsdoiorg101016jneunet201301021

Friedman M (1937) The use of ranks to avoid the assumption of normality implicit in the analysis ofvariance Journal of the american statistical association 32(200)675ndash701

García E, Lozano F (2007) Boosting support vector machines. In: Machine Learning and Data Mining in Pattern Recognition, 5th International Conference, MLDM 2007, Leipzig, Germany, July 18-20, Post Proceedings. IBaI publishing, pp 153–167

Goldstein M Uchida S (2016) A comparative evaluation of unsupervised anomaly detection algorithmsfor multivariate data PLOS ONE 11(4)1ndash31 DOI 101371journalpone0152173

González PC, Velásquez JD (2013) Characterization and detection of taxpayers with false invoices using data mining techniques. Expert Systems with Applications 40(5):1427–1436. DOI 10.1016/j.eswa.2012.08.051

Guimera R Sales-Pardo M Amaral LAN (2007) Module identification in bipartite and directed networksPhysical Review E 76036102 DOI 101103PhysRevE76036102

Guo H Viktor HL (2004) Learning from imbalanced data sets with boosting and data generation TheDataBoost-IM approach SIGKDD Explor Newsl 6(1)30ndash39 DOI 10114510077301007736

Guo X Yin Y Dong C Yang G Zhou G (2008) On the class imbalance problem In 2008 Fourth Interna-tional Conference on Natural Computation IEEE vol 4 pp 192ndash201 DOI 101109ICNC2008871

Han H Wang WY Mao BH (2005) Borderline-SMOTE A new over-sampling method in imbalanced datasets learning In Advances in Intelligent Computing Springer Berlin Heidelberg Berlin Heidelbergpp 878ndash887

He H Garcia EA (2009) Learning from imbalanced data IEEE Transactions on Knowledge and DataEngineering 21(9)1263ndash1284 DOI 101109TKDE2008239

He H Bai Y Garcia EA Li S (2008) ADASYN Adaptive synthetic sampling approach for imbalancedlearning In 2008 IEEE International Joint Conference on Neural Networks (IEEE World Congresson Computational Intelligence) IEEE pp 1322ndash1328 DOI 101109IJCNN20084633969

Holm S (1979) A simple sequentially rejective multiple test procedure Scandinavian journal of statistics6(2)65ndash70

Hsu CW Lin CJ (2002) A comparison of methods for multiclass support vector machines IEEE Transac-tions on Neural Networks 13(2)415ndash425 DOI 10110972991427

Huang A (2008) Similarity measures for text document clustering In Proceedings of the sixth new zealandcomputer science research student conference (NZCSRSC2008) Christchurch New Zealand pp 49ndash56

Iman RL Davenport JM (1980) Approximations of the critical region of the Friedman statistic Commu-nications in Statistics-Theory and Methods 9(6)571ndash595

Jo T Japkowicz N (2004) Class imbalances versus small disjuncts ACM SIGKDD Explor Newsl 6(1)40ndash49 DOI 10114510077301007737


Jutla IS Jeub LG Mucha PJ (2011-2016) A generalized louvain method for community detection imple-mented in MATLAB URL httpnetwikiamathunceduGenLouvain

Kubat M Matwin S (1997) Addressing the curse of imbalanced training sets One-sided selection InProceedings of the Fourteenth International Conference on Machine Learning Morgan KaufmannPublishers Inc San Francisco CA USA pp 179ndash186

Lancichinetti A Fortunato S (2009) Community detection algorithms A comparative analysis PhysicalReview E 80056117 DOI 101103PhysRevE80056117

Larremore DB Clauset A Jacobs AZ (2014) Efficiently inferring community structure in bipartitenetworks Physical Review E Statistical Nonlinear and Soft Matter Physics 90012805 DOI101103PhysRevE90012805

Li J Fine JP (2010) Weighted area under the receiver operating characteristic curve and its application togene selection Journal of the Royal Statistical Society Series C (Applied Statistics) 59(4)673ndash692DOI 101111j1467-9876201000713x

Li X Wang L Sung E (2008) AdaBoost with SVM-based component classifiers Engineering Applicationsof Artificial Intelligence 21(5)785ndash795 DOI httpsdoiorg101016jengappai200707001

Lichman M (2013) UCI machine learning repository. URL http://archive.ics.uci.edu/ml

Liu W, Chawla S, Cieslak DA, Chawla NV (2010) A robust decision tree algorithm for imbalanced data sets. In: Proceedings of the tenth SIAM international conference on data mining, SIAM, Philadelphia, vol 10, pp 766–777

Luts J Ojeda F Van de Plas R De Moor B Van Huffel S Suykens JA (2010) A tutorial on supportvector machine-based methods for classification problems in chemometrics Analytica Chimica Acta665(2)129ndash145 DOI httpsdoiorg101016jaca201003030

Macskassy SA Provost F (2007) Classification in networked data A toolkit and a univariate case studyJournal of Machine Learning Research 8(May)935ndash983

Martens D Provost F (2014) Explaining data-driven document classifications MIS Quarterly 38(1)73ndash100

Martens D Provost F Clark J Junque de Fortuny E (2016) Mining massive fine-grained behavior data toimprove predictive analytics MIS Quarterly 40(4)869ndash888

Mazurowski MA Habas PA Zurada JM Lo JY Baker JA Tourassi GD (2008) Training neural networkclassifiers for medical decision making The effects of imbalanced datasets on classification perfor-mance Neural Networks 21(23)427 ndash 436 DOI httpsdoiorg101016jneunet200712031

Mease D Wyner AJ Buja A (2007) Boosted classification trees and class probabilityquantile estimationJournal of Machine Learning Research 8409ndash439

Nemenyi P (1963) Distribution-free Multiple Comparisons. Dissertation, Princeton University

Newman MEJ, Girvan M (2004) Finding and evaluating community structure in networks. Physical Review E 69:026113. DOI 10.1103/PhysRevE.69.026113

Ng AY (2004) Feature selection, L1 vs. L2 regularization, and rotational invariance. In: Proceedings of the Twenty-first International Conference on Machine Learning, ACM, New York, NY, USA, ICML '04, pp 78–. DOI 10.1145/1015330.1015435

Ng AY Jordan MI (2002) On discriminative vs generative classifiers A comparison of logistic regressionand naive bayes In Advances in Neural Information Processing Systems 14 MIT Press pp 841ndash848

Ngai E Hu Y Wong Y Chen Y Sun X (2011) The application of data mining techniques in financial frauddetection A classification framework and an academic review of literature Decision Support Systems50(3)559 ndash 569 DOI httpsdoiorg101016jdss201008006

Platt JC (1999) Probabilistic outputs for support vector machines and comparisons to regularized likeli-hood methods In Advances in Large-Margin Classifiers MIT Press pp 61ndash74

Porter MA Onnela JP Mucha PJ (2009) Communities in networks Notices of the American MathematicalSociety 56(9)1082ndash1097

Provost F Fawcett T (2013) Data Science for Business What you need to know about data mining anddata-analytic thinking OrsquoReilly Media Inc

Provost F Dalessandro B Hook R Zhang X Murray A (2009) Audience selection for on-line brandadvertising Privacy-friendly social network targeting In Proceedings of the 15th ACM SIGKDDInternational Conference on Knowledge Discovery and Data Mining ACM New York NY USAKDD rsquo09 pp 707ndash716 DOI 10114515570191557098


Raskutti B Kowalczyk A (2004) Extreme re-balancing for SVMs A case study SIGKDD Explor Newsl6(1)60ndash69 DOI 10114510077301007739

Rosvall M Bergstrom CT (2008) Maps of random walks on complex networks reveal community structureProceedings of the National Academy of Sciences 105(4)1118ndash1123 DOI 101073pnas0706851105

Schapire RE (1999) A brief introduction to boosting In Proceedings of the 16th International Joint Con-ference on Artificial Intelligence - Volume 2 Morgan Kaufmann Publishers Inc San Francisco CAUSA IJCAIrsquo99 pp 1401ndash1406

Schapire RE Singer Y (1999) Improved boosting algorithms using confidence-rated predictions Machinelearning 37(3)297ndash336 DOI 101023A1007614523901

Shmueli G (2017) Analyzing behavioral big data Methodological practical ethical and moral issuesQuality Engineering 29(1)57ndash74 DOI 1010800898211220161210979

Sobhani P Viktor H Matwin S (2015) Learning from imbalanced data using ensemble methods andcluster-based undersampling In New Frontiers in Mining Complex Patterns Third InternationalWorkshop NFMCP 2014 Held in Conjunction with ECML-PKDD 2014 Nancy France Septem-ber 19 2014 Revised Selected Papers Springer International Publishing Cham pp 69ndash83 DOI101007978-3-319-17876-9 5

Stankova M (2016) Classification within network data with a bipartite structure Dissertation Universityof Antwerp

Stankova M Martens D Provost F (2015) Classification over bipartite graphs through projection WorkingPapers 2015001 University of Antwerp Faculty of Applied Economics

Sun Y Kamel MS Wong AK Wang Y (2007) Cost-sensitive boosting for classification of imbalanceddata Pattern Recognition 40(12)3358 ndash 3378 DOI httpsdoiorg101016jpatcog200704009

Suykens JA Van Gestel T De Brabanter J De Moor B Vandewalle J Suykens J Van Gestel T (2002)Least squares support vector machines World Scientific Singapore

Tang B He H (2015) Enn Extended nearest neighbor method for pattern recognition [research frontier]IEEE Computational Intelligence Magazine 10(3)52ndash60 DOI 101109MCI20152437512

Tang Y Zhang YQ Chawla NV Krasser S (2009) SVMs modeling for highly imbalanced classificationIEEE Transactions on Systems Man and Cybernetics Part B (Cybernetics) 39(1)281ndash288 DOI101109TSMCB20082002909

Tobback E Moeyersoms J Stankova M Martens D (2016) Bankruptcy prediction for SMEs using rela-tional data Working Paper 2016004 University of Antwerp Faculty of Applied Economics

Verbeke W Dejaeger K Martens D Hur J Baesens B (2012) New insights into churn prediction in thetelecommunication sector A profit driven data mining approach European Journal of OperationalResearch 218(1)211 ndash 229 DOI httpsdoiorg101016jejor201109031

Veropoulos K Campbell I Cristianini N (1999) Controlling the sensitivity of support vector machinesIn Proceedings of the International Joint Conference on Artificial Intelligence Stockholm Sweden(IJCAI99) pp 55 ndash 60

Whitrow C Hand DJ Juszczak P Weston D Adams NM (2009) Transaction aggregation as a strategyfor credit card fraud detection Data Mining and Knowledge Discovery 18(1)30ndash55 DOI 101007s10618-008-0116-z

Wickramaratna J Holden SB Buxton BF (2001) Performance degradation in boosting In Proceedings ofthe Second International Workshop on Multiple Classifier Systems Springer London UK MCS rsquo01pp 11ndash21

Yen SJ Lee YS (2009) Cluster-based under-sampling approaches for imbalanced data distributions ExpertSystems with Applications 36(3 Part 1)5718 ndash 5727 DOI httpsdoiorg101016jeswa200806108

Yu HF Lo HY Hsieh HP Lou JK McKenzie TG Chou JW Chung PH Ho CH Chang CF Wei YH et al(2010) Feature engineering and classifier ensemble for kdd cup 2010 In Proceedings of the KDDCup 2010 Workshop pp 1ndash16

Zha H He X Ding C Simon H Gu M (2001) Bipartite graph partitioning and data clustering In Pro-ceedings of the Tenth International Conference on Information and Knowledge Management ACMNew York NY USA CIKM rsquo01 pp 25ndash32 DOI 101145502585502591

Zhang J Mani I (2003) Knn approach to unbalanced data distributions A case study involving informationextraction In Proceedings of the ICMLrsquo2003 Workshop on Learning from Imbalanced DatasetsWashington DC

Ziegler CN McNee SM Konstan JA Lausen G (2005) Improving recommendation lists through topicdiversification In Proceedings of the 14th International Conference on World Wide Web ACM NewYork NY USA WWW rsquo05 pp 22ndash32 DOI 10114510607451060754



the solution to the following quadratic optimization problem:

    min_{w,b,ξi}   (wT w)/2 + C ∑_{i=1}^{m} ξi

    s.t.   yi(wT xi + b) ≥ 1 − ξi,   i = 1, ..., m          (1)
           ξi ≥ 0,                   i = 1, ..., m

where b represents the bias and ξi are the slack variables measuring classification errors. The classifier is given by y(x) = sign(wT x + b), where wT x + b represents the output score. It should be noted that we usually solve the dual form of equation (1), with solution y(x) = sign[∑_{i=1}^{m} (αi yi xT xi) + b] and dual variables4 αi. An overview on the issues regarding SVMs and imbalanced classification for low-dimensional data can be found in (Akbani et al 2004). Briefly, because of the imbalance, the majority of the slack variables ξi represent errors with respect to the majority class. This means that the minority class is under-represented and has a minor contribution to the goal function.

In this study we have opted for a linear SVM (using the LIBLINEAR (Fan et al 2008) package) as the base learner to classify behaviour data.5 The feature vector will have a sparse and high-dimensional representation with d = |NT|. Note that the SVM formulation does not change in this setting. The solution vector w will become high-dimensional. The kernel matrix contains inner products of sparse and high-dimensional vectors, resulting in a (possibly) sparse matrix.
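Concretely, fitting such a linear SVM on a sparse, high-dimensional binary feature matrix might look as follows. This is a sketch with invented toy data; it assumes scikit-learn, whose LinearSVC wraps the same LIBLINEAR library used here.

```python
import numpy as np
from scipy.sparse import csr_matrix
from sklearn.svm import LinearSVC

# Toy behaviour data: rows are individuals, columns are fine-grained
# actions (e.g. pages visited); entries are binary and mostly zero.
rng = np.random.RandomState(0)
X = csr_matrix((rng.rand(200, 5000) < 0.01).astype(np.float64))  # ~1% ones
y = np.array([1] * 20 + [0] * 180)   # imbalanced class labels

clf = LinearSVC(C=0.001)             # C is the regularization parameter
clf.fit(X, y)
scores = clf.decision_function(X)    # output score wT x + b, used for ranking
```

The solver never densifies the matrix, so training remains feasible even with very large dimensions.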

The reasons we have chosen an SVM are twofold. First of all, it is a very popular technique for dealing with traditional and behaviour data and has been applied in many diverse domains (Suykens et al 2002). As we have noted in Section 1.1, tailored techniques need to be applied to behaviour datasets; these roughly fall in two categories: regularization-based techniques and heuristic approaches. The latter type of techniques is not suitable for dealing with traditional data. The remaining regularization-based techniques have formed the subject of traditional studies dealing with imbalanced data. We have opted for this type of technique since we can then easily compare our results to the conclusions obtained from previous studies. Another reason we have opted for an SVM is the fact that many of our proposed techniques (see Section 3 for details) rely on a boosting process. Wickramaratna et al (2001) and García and Lozano (2007) noted that using a strong learner can result in performance degradation during boosting. Regularization-based techniques offer an added element of flexibility in the sense that the strength of the learner can be controlled by varying the regularization parameter. Heuristic approaches today do not offer this attractive feature.

4 Also called support values.
5 In Section 5 we will consider different types of base learners.


2.3 Evaluation metrics

The performance measures that are typically used in traditional studies dealing with imbalanced data are accuracy, sensitivity, specificity, precision, F-measure and G-mean, see for instance Bhattacharyya et al (2011), Han et al (2005), González and Velásquez (2013). These measures are derived from the confusion matrix and are based on a certain threshold applied on the output scores of the classifier, where the threshold is usually contained within the classification algorithm. There are two main issues with this approach. First of all, the built-in threshold that is applied on the scores might not be suitable. This might cause low performance values with respect to these criteria, yet if we were to simply adapt the threshold, the same performance criteria might show excellent results. The second issue lies in the fact that the chosen threshold could be irrelevant with respect to the available capacity. We address this issue with a simple example. If a targeted advertisement company would apply this classifier (with built-in threshold) on new customers, the classifier might choose to predict 5% of all possible customers to target as positive. Yet the company only has a marketing budget that allows targeting 0.1% of all possible customers. It is clear that the chosen threshold is inappropriate here. For these reasons, we have chosen to opt for area under the ROC-curve (AUC) instead.6 Chawla et al (2004) note that ROC-curves (and cost-curves) should be preferred over these traditional measures. The AUC, which measures the area under the ROC-curve, is more appropriate since it is independent of class skew and measures the ranking abilities of the classifier (Fawcett 2006). It answers the question: if we were to rank all instances according to output scores, is the classifier able to place the positive instances near the top of the list and the negative instances near the bottom? Because the method scans over all possible thresholds, it is independent of a specific cut-off value. Another reason we have chosen AUC is the fact that many boosting and cost-sensitive learning techniques evaluate the performance using accuracy or misclassification cost, which suffer from the same issues as previously mentioned. Hence, we are one of the first to evaluate these techniques with respect to AUC. Note that AUC is also the preferred metric in the assessment of unsupervised anomaly detection techniques (Goldstein and Uchida 2016). Outliers/anomalies are strongly related to our field as they signify rare events that are of specific importance to the analyst.
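The ranking interpretation can be made concrete: AUC equals the probability that a randomly chosen positive instance is scored above a randomly chosen negative one, which is computable from ranks alone. A small illustration with made-up scores, assuming SciPy and scikit-learn for the cross-check:

```python
import numpy as np
from scipy.stats import rankdata
from sklearn.metrics import roc_auc_score

def auc_from_ranks(y_true, scores):
    """AUC via the Mann-Whitney statistic: only the ordering of the
    output scores matters, never a specific threshold."""
    ranks = rankdata(scores)                 # rank 1 = lowest score
    pos = ranks[np.asarray(y_true) == 1]
    m, n = len(pos), len(ranks) - len(pos)
    return (pos.sum() - m * (m + 1) / 2) / (m * n)

y = [1, 0, 1, 0, 0]
s = [0.9, 0.1, 0.35, 0.4, 0.2]
# One of the six positive-negative pairs is ordered wrongly (0.35 < 0.4),
# so both computations give 5/6.
print(auc_from_ranks(y, s), roc_auc_score(y, s))
```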

Though we focus on AUC, other measures are suitable for performance assessment of imbalanced data. The weighted area under the ROC-curve (wAUC) (Li and Fine 2010) is able to emphasize certain regions of ROC-space. Lift-curves (Bekkar et al 2013), popular in the marketing domain, are ideal to assess the prevalence of positive cases among instances ranked highly by the classifier. These curves can be evaluated according to the available capacity. When cost information is available, this should be integrated in the performance assessment criterion. Indeed, the cost of a false negative is usually much larger than the cost of a false positive. We could then opt for cost-curves (Whitrow et al 2009; Bhattacharyya et al 2011) that can also be evaluated with consideration of the available capacity. Few studies make use of costs

6 We provide traditional measures (sensitivity, specificity, G-means, F-measure) in our online repository: http://www.applieddatamining.com/cms/?q=software


(Ngai et al 2011), mainly because these are difficult to determine, uncertain, and have a temporal characteristic. To conclude this section, we note that each of the performance assessment criteria mentioned in this paragraph requires some additional information in the form of a weight function (wAUC), capacity requirements and/or costs (lift/cost-curves). These measures are application specific and this is the main reason we excluded them in our study.

3 Methods

3.1 Oversampling

Over the years, the data mining community investigated several techniques that balance the data distribution by oversampling the minority class. In this section, we investigate the basic oversampling with replacement (OSR) approach in conjunction with synthetic sample generation procedures. The first technique simply duplicates minority class instances by a certain amount. Several references dealing with traditional low-dimensional data note that this technique may make the decision regions of the learner smaller and too specific on the replicated instances, causing the learner to overfit (Chawla et al 2002; Han et al 2005; Liu et al 2009; He and Garcia 2009). The synthetic approaches are designed to overcome this overfitting behaviour by generating new, non-overlapping instances in the minority class space. The techniques we investigate are SMOTE (Chawla et al 2002) and ADASYN (He et al 2008).

Consider a certain minority class instance xi. In the traditional setting, SMOTE and ADASYN will generate a new synthetic instance xsyn by choosing a random point on the line segment between the point xi and one of its K minority class nearest neighbours (computed according to Euclidean distance). In SMOTE, each original minority class instance xi generates the same number of synthetic instances, whereas the ADASYN algorithm generates a variable number of synthetic instances by putting a larger weight on the harder to learn minority instances. This way, the learner is more focussed toward difficult instances.
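The interpolation step just described can be sketched in a few lines (illustrative only: brute-force neighbour search for clarity, and the function name is ours):

```python
import numpy as np

def smote_sample(X_min, i, K=5, rng=np.random):
    """Generate one synthetic instance from minority instance X_min[i] by
    picking a random point on the segment towards one of its K nearest
    minority-class neighbours (Euclidean distance, as in classic SMOTE)."""
    d = np.linalg.norm(X_min - X_min[i], axis=1)
    nn = np.argsort(d)[1:K + 1]        # skip position 0: the point itself
    j = rng.choice(nn)                 # random neighbour
    lam = rng.rand()                   # random location on the segment
    return X_min[i] + lam * (X_min[j] - X_min[i])
```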

These techniques need to be adapted when dealing with binary behaviour data, where each instance is represented by a large and sparse binary vector. The main differences with the original versions of SMOTE and ADASYN are indicated in Table 1 and are explained in more detail in the following paragraphs. The first difference lies in the generation of the new synthetic instances. As before, a synthetic sample is constructed based on two original minority instances. When both instances have a 0 or 1 in their corresponding column, the synthetic sample will also show a 0 or 1, respectively, at the considered position. When only one of the two minority instances shows a 1, the decision is made according to a user-specified parameter prioropt. This parameter can be one of the following three options:

– “FlipCoin”, where there is a 50% probability that the synthetic instance will show a 1 at the considered position.

– “Prior”, where the value of the synthetic sample is determined by the prior within the minority class in the corresponding column. One generates a random number u in the interval [0,1] and puts a 1 in the corresponding position if u is smaller than the prior.

– “Reverse Prior”, where one generates a random number u in the interval [0,1] and puts a 1 in the corresponding position if u is larger than the prior within the minority class for this column/feature.
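As an illustration, the mismatch handling described above can be sketched as follows for two dense 0/1 parent vectors (the function and argument names are ours, not taken from the authors' code):

```python
import numpy as np

def synthetic_binary_sample(x1, x2, priors, prior_opt="FlipCoin", rng=None):
    """Generate one synthetic binary instance from two minority parents.

    x1, x2    : 1-D 0/1 arrays (the two original minority instances)
    priors    : per-feature fraction of 1s within the minority class
    prior_opt : "FlipCoin", "Prior" or "Reverse Prior" (see Section 3.1)
    """
    rng = rng or np.random.default_rng()
    x1, x2 = np.asarray(x1), np.asarray(x2)
    syn = np.where(x1 == x2, x1, 0)       # 0-0 and 1-1 positions: copy shared value
    mismatch = np.flatnonzero(x1 != x2)   # 0-1 positions: decide per option
    u = rng.random(mismatch.size)
    if prior_opt == "FlipCoin":
        syn[mismatch] = (u < 0.5).astype(int)
    elif prior_opt == "Prior":
        syn[mismatch] = (u < priors[mismatch]).astype(int)
    elif prior_opt == "Reverse Prior":
        syn[mismatch] = (u > priors[mismatch]).astype(int)
    return syn
```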

The second difference lies in the way the nearest neighbours are determined. It would be unwise to consider Euclidean distance in that respect, because it treats a 0-0 match in the same way as a 1-1 match. Since we are working with dichotomous variables (i.e. present or absent), a 1-1 match at a certain position is far more informative than a 0-0 match. For instance, two users visiting the same web page contains more information than two users who didn't visit that specific page. The similarity between two instances is defined by a user-specified parameter simmeasure. We have limited ourselves to two popular choices: “Jaccard” uses the Jaccard similarity measure (Finch 2005) and “Cosine” uses the cosine similarity measure (Huang 2008). In principle, one could apply any of the metrics summarized in Stankova et al (2015).
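For binary vectors, both measures reduce to counting 1-1 matches; a minimal sketch (the helper names are ours):

```python
import numpy as np

def jaccard_sim(a, b):
    """Jaccard similarity for 0/1 vectors: |a AND b| / |a OR b|.
    Note that 0-0 matches carry no weight."""
    both = np.sum((a == 1) & (b == 1))
    either = np.sum((a == 1) | (b == 1))
    return both / either if either else 0.0

def cosine_sim(a, b):
    """Cosine similarity for 0/1 vectors: |a AND b| / sqrt(|a| * |b|)."""
    both = np.sum((a == 1) & (b == 1))
    na, nb = np.sum(a), np.sum(b)
    return both / np.sqrt(na * nb) if na and nb else 0.0
```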

Table 1 Differences between the original SMOTE and ADASYN implementations and the versions SMOTEbeh and ADASYNbeh tailored for behaviour data (parameter explanations are provided in Section 3.1)

                                SMOTE                          SMOTEbeh         ADASYN                         ADASYNbeh
Amount of oversampling          N                              β                β                              β
Synthetic sample generation     random point on line segment   prioropt         random point on line segment   prioropt
Similarity measure              Euclidean                      Jaccard/Cosine   Euclidean                      Jaccard/Cosine
Number of nearest neighbours    K                              K                K                              K′, K

A detailed pseudo-code implementation of our versions of SMOTE and ADASYN, called SMOTEbeh and ADASYNbeh, is shown in Algorithm 1. Also note that we introduced an extra parameter K′ to decouple the determination of the number of synthetic instances that need to be generated for a certain minority instance from the number of nearest neighbours it uses to obtain the synthetic instances.

The experimental set-up adopted by Chawla et al (2002), He et al (2008) considers K = 5 as the number of nearest neighbours used, without a detailed motivation. Furthermore, the latter paper compares SMOTE with an oversampling percentage N of 200% (meaning that the size of the newly created synthetic data instances is twice as large as the size of the original minority training data) with a completely balanced dataset (β = 1 in Algorithm 1) in ADASYN. In our experiments we will consider a variety of possible K-values and compare SMOTE and ADASYN with identical oversampling rates, controlled by a single parameter β.


Algorithm 1 SMOTEbeh and ADASYNbeh pseudo-code implementation for binary behaviour data

Input: Xmin, Xmaj, β, prioropt, simmeasure, K′, K
a) determine the total amount of synthetic minority instances that need to be generated (β ∈ [0,1] is a parameter that controls the amount of oversampling; β = 1 means a fully balanced dataset will be created):
    G = (|Xmaj| − |Xmin|) × β
b) determine the number of synthetic samples gi that need to be generated for each minority class instance xi:
if SMOTE then
    gi ← ⌈G / |Xmin|⌉
else if ADASYN then
    calculate the K′ nearest neighbours (with simmeasure option) for instance xi from the set (Xmin \ {xi}) ∪ Xmaj and determine Δi, the number of majority class nearest neighbours. Next, calculate ri = Δi / K′ and normalize these values: r̂i = ri / Σ_{j=1}^{|Xmin|} rj
    gi ← ⌈r̂i × G⌉
end if
c) generate gi synthetic samples for minority instance xi:
calculate the K nearest neighbours (with simmeasure option) for instance xi from the set Xmin \ {xi}. Additionally, remove those nearest neighbours that have a similarity of 0 with xi. The remaining nearest neighbours form the set Kused. If this set turns out to be empty, set Kused = {xi}.
for iter = 1 → gi do
    randomly choose 1 nearest neighbour from the set Kused
    generate a synthetic minority sample from xi and the chosen nearest neighbour (according to prioropt)
end for
d) because Σ gi ≥ G, randomly remove synthetic points until the total number of synthetic samples equals G
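The ADASYN branch of step b can be sketched as follows, assuming the ratios ri = Δi / K′ have already been computed (the function name is ours):

```python
import numpy as np

def adasyn_counts(r, G):
    """Step b of Algorithm 1 (ADASYN branch): normalize the per-instance ratios
    r_i = Delta_i / K' and allocate g_i = ceil(r_hat_i * G) synthetic samples to
    minority instance x_i. Because sum(g_i) >= G, the surplus is removed at
    random in step d."""
    r = np.asarray(r, dtype=float)
    r_hat = r / r.sum()
    return np.ceil(r_hat * G).astype(int)
```

Instances with many majority-class neighbours (large ri) thus receive proportionally more synthetic samples.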

3.2 Undersampling

In this section we will compare the simple random undersampling technique (RUS) with informed undersampling approaches. The first method randomly discards majority class training instances. While this technique can achieve fast training performance of the underlying base learner, an obvious disadvantage is the fact that it might discard potentially useful majority class instances. The informed approaches try to intelligently retain the most informative majority class instances in the hope of increasing predictive performance while at the same time keeping the fast training speed of the underlying classifier.

The first set of informed undersampling techniques is based on the methods proposed by Zhang and Mani (2003), Chyi (2003). The K-nearest neighbour 'classifier' is used to determine the importance of each majority class training instance by calculating the total similarity with the K closest minority class training set examples. Regarding similarity computations for binary behaviour data, we refer to the related discussion in Section 3.1. The first technique, called “Closest Knn”, retains majority class examples that are closest to the minority class instances. These instances are the most difficult to classify and we would expect them to be most informative. The second method, called “Closest tot sim”, is similar to the previously described technique. The difference is that it no longer computes similarities with the K closest minority neighbours; instead, it calculates the total similarity with all minority instances in determining the importance. The main reason we included this technique is computational speed.7 The last techniques, called “Farthest Knn” and “Farthest tot sim”, are included for comparison with the previously mentioned techniques. Their implementation is identical to the previously described techniques; however, they retain majority class examples that are farthest from the minority class instances. For each of the methods proposed in this paragraph, the amount of undersampling is controlled by a user-specified parameter βu according to the following formula:

    Nrrem = ⌊(|Xmaj| − |Xmin|) × βu⌋    (2)

where Nrrem represents the amount of majority class instances to be discarded; βu = 1 means a completely balanced dataset is obtained.
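Assuming a precomputed majority-by-minority similarity matrix (Jaccard or cosine), the “Closest Knn” selection with the βu of Equation (2) might look as follows (a sketch with our own names, not the authors' code):

```python
import numpy as np

def closest_knn_undersample(S_maj_min, n_min, beta_u, K):
    """Informed undersampling ("Closest Knn"): retain the majority instances
    with the highest total similarity to their K most similar minority
    neighbours.

    S_maj_min : (n_maj, n_min) similarity matrix between the two classes
    beta_u    : amount of undersampling; 1 -> fully balanced (Equation (2))
    Returns the indices of the retained majority instances.
    """
    n_maj = S_maj_min.shape[0]
    nr_rem = int(np.floor((n_maj - n_min) * beta_u))
    # importance = total similarity with the K most similar minority instances
    topK = np.sort(S_maj_min, axis=1)[:, -K:]
    importance = topK.sum(axis=1)
    order = np.argsort(-importance)        # most similar (hardest) first
    return order[: n_maj - nr_rem]
```

“Farthest Knn” would simply reverse the sort order, and the “tot sim” variants replace the top-K sum with `S_maj_min.sum(axis=1)`.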

The second set of informed undersampling techniques aims at targeting the within-class imbalance problem and is based on the approach proposed in Sobhani et al (2015). They postulate that this within-class imbalance problem is more pronounced in the case of undersampling methods. If we were to randomly select majority class instances, then the probability of drawing an instance from small disjuncts within the majority class would be very low. These regions might therefore contain no representatives and remain unlearned. The authors chose to address this issue by clustering the majority class instances in a first step and subsequently selecting an equal number of representatives from each cluster. The reported results show their approach to outperform the CBO-algorithm (see Section 1.1). In the following paragraphs we will dig deeper into the subject of clustering behaviour data. We already refer to Algorithm 2 for an overview of our cluster-based undersampling method (CBU).

As we noted in Section 2.1, behaviour data can be represented as a bipartite graph. The clustering of behaviour data8 aims at finding groups of nodes (communities) that connect more to each other than to other nodes in the network. This subject is currently an active area of research, with a rapid evolution of a vast number of cluster detection techniques (Zha et al 2001; Dhillon 2001; Larremore et al 2014; Beckett 2016). We refer to Porter et al (2009), Fortunato (2010), Alzahrani and Horadam (2016) for detailed surveys on the problem. It should be noted that the vast majority of publications deal with the subject of clustering unigraphs (networks with only one type of nodes). It is only fairly recently that interest grew in the clustering of bigraphs. In our implementations we have chosen the popular modularity-based approaches9 for clustering bigraphs, and these fall into two directions. In the first, the modularity function that is used for unigraphs is adapted to be suitable for bigraphs, see for instance the work of Barber (2007). The other direction, which we adopt in our study, projects the bigraph to a unigraph of bottom nodes and performs community detection on the projection using traditional modularity definitions. Note that Guimera et al (2007) observed no difference in the obtained communities using either direction.

7 For each majority class instance, we no longer need to sort the similarities with all minority instances in determining the K largest values.

8 This subject is more commonly known as community detection in bipartite graphs.
9 Modularity-based approaches attempt to optimize a quality function, known as modularity, for finding community structures in networks, and rely on the use of heuristics due to the complexity of the problem.


In this article we adopt the methodology proposed in Alzahrani and Horadam (2016), which consists of projecting the bigraph to a unigraph of bottom nodes and applying10 the Louvain algorithm (Blondel et al 2008) on the projection. Lancichinetti and Fortunato (2009) performed a comparative study regarding the performance of 12 community detection algorithms and concluded the Louvain method to be the best modularity-based algorithm and second best among all algorithms. The heuristic is very fast, with an O(m) complexity, with m the number of edges in the unigraph. We have chosen the Louvain algorithm because of its speed performance and the availability of a toolbox (Jutla et al 2011-2016) that is directly compatible with our implementations. The toolbox provides a generalized implementation of the Louvain algorithm in the sense that multiple definitions of modularity are possible. The quality function we chose is the popular Newman-Girvan modularity (Newman and Girvan 2004).

With respect to the projection, Alzahrani and Horadam (2016) connect two bottom nodes if they have at least one top node in common. The connection weight between two bottom nodes in the projection is set to the number of shared top nodes. In our implementation we adapted the connection weights in accordance with Stankova et al (2015) as follows: first of all, we assign weights to the top nodes corresponding to the hyperbolic tangent applied to the inverse degree of the top node. Next, the connection weight between two bottom nodes in the projection corresponds with the total weight of the shared top nodes. Top nodes having low degrees therefore obtain a higher contribution in the projection (e.g. two users making a transaction to a local book store are assumed to be more closely connected to each other than two users making a transaction to a large retail store).
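A minimal sketch of this weighted projection, assuming the bigraph is given as a dense 0/1 biadjacency matrix (bottom nodes × top nodes; in practice a sparse matrix would be used):

```python
import numpy as np

def project_bigraph(A):
    """Project a bottom-node x top-node 0/1 biadjacency matrix A to a weighted
    unigraph of bottom nodes (weighting of Stankova et al 2015): each top node
    gets weight tanh(1/degree), and w_ij is the total weight of the top nodes
    shared by bottom nodes i and j."""
    A = np.asarray(A, dtype=float)
    top_deg = A.sum(axis=0)
    top_w = np.where(top_deg > 0, np.tanh(1.0 / np.maximum(top_deg, 1)), 0.0)
    W = (A * top_w) @ A.T          # w_ij = sum of weights of shared top nodes
    np.fill_diagonal(W, 0.0)       # no self-loops
    return W
```

The resulting weighted adjacency matrix W would then be handed to a Louvain implementation for community detection.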

In our CBU-algorithm (see Algorithm 2), after clustering the bigraph containing exclusively majority class instances, we randomly select an equal amount of majority instances from each community to target the within-class imbalance problem. In the rare situation that the number of obtained clusters exceeds the required amount of majority instances (Nrretain), we sort the communities according to a user-specified parameter Clustopt and randomly select 1 instance from the first Nrretain clusters. The parameter Clustopt can take the following values:

– C Smallest, where we sort clusters in ascending order of their size
– C Largest, where we sort clusters in descending order of their size

Note that we randomly select majority class instances from each cluster. Yen and Lee (2009) found a random selection strategy after clustering to be superior to informed approaches based on distance.

10 Note that they also made use of the flow-based algorithm Infomap (Rosvall and Bergstrom 2008), which shows excellent results on the LFR-benchmark.


Algorithm 2 CBU pseudo-code implementation for behaviour data

Input: Xmin, Xmaj, βu, Clustopt
a) Cluster the majority class instances Xmaj:
– Assign weights to each top node, corresponding with the hyperbolic tangent applied to the inverse of the node's degree
– Project the bigraph Xmaj to a weighted unigraph consisting of bottom node majority class instances. The weight wij between majority class instances i and j corresponds with the total weight of the shared top nodes
– Apply the Louvain algorithm (Blondel et al 2008) on the projected unigraph to partition the majority class instances into clusters
b) Select majority class instances:
Nrrem ← ⌊(|Xmaj| − |Xmin|) × βu⌋ (see Equation (2))
Nrretain ← |Xmaj| − Nrrem
if Nrretain < |Clust| then
    – Sort clusters according to Clustopt
    – Randomly select 1 instance from the first Nrretain clusters
else
    – Randomly select ⌈Nrretain / |Clust|⌉ majority class instances from each cluster
    – Randomly discard instances from the previous step until its size corresponds with Nrretain
end if
c) Return the new training set consisting of Xmin and the selected majority class instances from step b
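The else-branch of step b (equal sampling per cluster) can be sketched as follows, assuming precomputed cluster labels; we hedge with `min()` for clusters smaller than the per-cluster quota, which the pseudo-code leaves implicit:

```python
import numpy as np

def select_from_clusters(cluster_ids, nr_retain, rng=None):
    """Step b of Algorithm 2 (else-branch): draw an (approximately) equal number
    of majority instances from each cluster, then trim the surplus at random."""
    rng = rng or np.random.default_rng()
    clusters = [np.flatnonzero(cluster_ids == c) for c in np.unique(cluster_ids)]
    per_cluster = int(np.ceil(nr_retain / len(clusters)))
    picked = np.concatenate([
        rng.choice(c, size=min(per_cluster, c.size), replace=False)
        for c in clusters
    ])
    rng.shuffle(picked)
    return picked[:nr_retain]
```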

3.3 Boosting, cost-sensitive learning and EasyEnsemble

3.3.1 AdaBoost

The AdaBoost (Schapire and Singer 1999; Schapire 1999) algorithm has been designed from the perspective of improving the performance of a weak learner so that it achieves accuracies that are comparable with a strong learning algorithm. Fundamental to the idea of boosting is to maintain a weight distribution over the training set. In each boosting iteration, the weights of wrongly classified instances are increased, so that the underlying weak learner puts more emphasis on these hard examples. In our implementation we will consider using a SVM as base learner. SVMs are generally regarded as strong learners. The studies of Wickramaratna et al (2001), García and Lozano (2007) note that using a strong learner usually results in performance degradation during the boosting process. The RBF-kernel SVM classifier (Wickramaratna et al 2001; Li et al 2008) is used as the underlying classifier to prove their point. In our study we will employ a linear SVM, which can be considered as a weaker version compared to the RBF-kernel. Furthermore, the regularization parameter C can be viewed as a “weakness” indicator.11 Lowering the C-value results in weaker learners, as can be seen from the goal function of the SVM optimization problem (see equation (1)). We will come back to this point in Section 4.5.

11 The distinction between weak/strong learners is loosely 'defined' in Schapire (1999). A weak learner corresponds with a hypothesis that performs just slightly better than random guessing. A strong learner is able to generate a hypothesis with an arbitrarily low error rate, given enough data. We adopt these definitions but consider the distinction between weak/strong based on training set error. In a SVM context it is quite typical that error levels on training data drop with increasing C-values (Suykens et al 2002). A learner that is 'too strong' means that even though its performance on training data is very high, it fails to generalize well and the test set error increases due to overfitting.


The boosting algorithm with underlying SVM is presented in Algorithm 3. This algorithm closely follows the original boosting implementation presented in Schapire and Singer (1999) and requires each learner to output confidence-rated predictions in the interval [−1, 1]. Since the SVM outputs real-valued scores, we apply the procedure of Platt (1999) to transform these scores into probability estimates (which can easily be translated to form confidence-rated predictions). A logistic regression (LR) model (Ng and Jordan 2002) is trained using as input the SVM-scores and the corresponding labels as output. Note that the same data that are used to construct the linear SVM can be used to estimate the LR-model, see Platt (1999) for a motivation.

The boosting algorithm requires the weak learner to be trained using a distribution Dt. One could sample from this distribution to generate (unweighted) examples and train a SVM next. We have chosen to include weights in the SVM goal function formulation (this requires an extension of the LIBLINEAR package):

    min_{w,b,ξ}  (w^T w)/2 + C Σ_{i=1}^{m} weight_i ξ_i    (3)

The weights weight_i are set according to the weight distribution Dt(i) in each round of boosting. Note that the C-value that is used to train the SVM model will be divided by mean(weight_i). This normalization allows for a fair comparison between the weighted and unweighted SVM versions (e.g. in the first round of boosting, weight_i = 1/m for all instances; by multiplying the C-value with m, this corresponds to solving the unweighted problem with the same C-value).
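The authors solve Equation (3) through an extension of LIBLINEAR; purely as an illustration of the same weighted objective (including the C/mean(weight_i) normalization), a plain subgradient sketch under our own naming:

```python
import numpy as np

def weighted_linear_svm(X, y, weights, C, epochs=500, lr=0.05):
    """Subgradient-descent illustration of the instance-weighted linear SVM of
    Equation (3): min_w ||w||^2/2 + C' * sum_i weight_i * max(0, 1 - y_i w.x_i),
    with y in {-1,+1}. C is first divided by mean(weight_i), so that uniform
    weights 1/m reduce to the ordinary unweighted problem with the same C."""
    X = np.c_[X, np.ones(len(X))]          # absorb the bias term
    w = np.zeros(X.shape[1])
    Cn = C / np.mean(weights)
    for _ in range(epochs):
        margins = y * (X @ w)
        active = margins < 1               # instances with nonzero hinge loss
        grad = w - Cn * (weights[active] * y[active]) @ X[active]
        w -= lr * grad
    return w                               # scores: [x, 1] @ w
```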

We introduce an additional parameter μ, called the weight percentage, with values in [0, 100], in the boosting algorithm of Algorithm 3. This parameter controls the amount of training data that is used to construct the SVM model and subsequent LR model. We sort the original training data in descending order according to the distribution Dt. Next, we form a new training set of minimal cardinality by including points from the original sorted training data until the total weight is higher than the weight-percentage μ/100. This way, the newly formed training data will contain only the part of the original training data that has the most weight. This partial set will then be used to construct a weighted SVM-model (according to equation (3), with updated distribution for this set) and a subsequent LR-model. The idea of using a partial dataset to construct the base learner not only reduces training times but also weakens the learner (García and Lozano 2007).
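The selection of this minimal top-weight subset can be sketched as (the function name is ours):

```python
import numpy as np

def top_weight_subset(D, mu):
    """Select the minimal set of training indices whose cumulative weight under
    the boosting distribution D reaches the weight percentage mu/100
    (the parameter of Section 3.3.1)."""
    order = np.argsort(-D)                 # heaviest instances first
    cum = np.cumsum(D[order])
    k = int(np.searchsorted(cum, mu / 100.0)) + 1
    return order[:k]
```

With μ = 100 this returns the full training set, matching the unreduced case.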

In Algorithm 3 we have an explicit check to verify if rAB = 1. In this case the SVM model outputs scores that allow for a perfect classification on the training set. The subsequent LR model will find a threshold and output a value of −1 if the SVM score is lower than this threshold (a value of +1 if the score is higher than this threshold). In this situation the training data would be perfectly classified. In our implementation we attempt to avoid this kind of behaviour, because it can lead to overfitting, meaning it might pinpoint the wrong threshold and make too drastic decisions (we lose the meaning of confidence). The check if rAB ≤ 0 verifies if the currently boosted model is performing worse than random (this model would have an rAB-value of 0). Obviously, if the model performs worse than random, we quit the boosting process. During the first round of boosting we perform similar checks that are not explicitly indicated in Algorithm 3: in the case where rAB = 1, we output the SVM scores instead of the LR binary values; when rAB ≤ 0, we quit the boosting process and output the LR scores.

Algorithm 3 AdaBoost with a SVM-LR combination as a base learner

Input: (X, Y) = {(x1, y1), ..., (xm, ym)}, C, T, μ
Initialize distribution D1(i) = 1/m
for t = 1 to T do
    – train the weak learner using distribution Dt. The weak learner consists of a weighted linear SVM and LR model, trained with weight-percentage μ of Dt:
        ht ← Train_WeakLearner(X, Y, Dt, C, μ)
    – compute the weighted confidence rAB on the training data:
        rAB ← Σ_{i=1}^{m} Dt(i) yi ht(xi)
      If (rAB = 1 or rAB ≤ 0) then αt ← 0 and stop the boosting process
    – choose αt ∈ R:
        αt ← (1/2) log((1 + rAB) / (1 − rAB))
    – update the distribution:
        Dt+1(i) ← Dt(i) exp(−αt yi ht(xi)) / Zt
      where Zt is a normalization factor (chosen so that Dt+1 will be a distribution)
end for
Output the final hypothesis (the output score is the term contained in the sign function):

    H(x) = sign( Σ_{t=1}^{T} α̃t ht(x) )  with α̃t = αt / Σ_{i=1}^{T} αi
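One round of this confidence-rated boosting scheme, including the rAB stopping checks, can be sketched as (a sketch of the update only; the SVM-LR weak learner itself is omitted):

```python
import numpy as np

def adaboost_round(D, y, h_scores):
    """One AdaBoost round with confidence-rated predictions h(x) in [-1, 1]
    (Schapire and Singer 1999): returns (alpha_t, updated distribution), or
    (None, D) when the stopping checks of Algorithm 3 fire."""
    r = np.sum(D * y * h_scores)           # weighted confidence r_AB
    if r >= 1 or r <= 0:                   # perfect or worse-than-random: stop
        return None, D
    alpha = 0.5 * np.log((1 + r) / (1 - r))
    D_new = D * np.exp(-alpha * y * h_scores)
    return alpha, D_new / D_new.sum()      # renormalize (the Z_t factor)
```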

3.3.2 AdaCost

The AdaCost algorithm (Fan et al 1999) is a variant of cost-sensitive learning where misclassification costs are introduced in the weight-update formula of AdaBoost. The cost-sensitive update rule increases the weights of costly misclassified instances more aggressively and decreases the weights of costly correct classifications more conservatively. Each instance is given a misclassification cost ci, where we chose to put ci = 1 for positive (minority) instances and ci = 1/R for negative (majority) instances; R is a user-defined value that allows one to put more emphasis on the minority class. The implementation of AdaCost is similar to Algorithm 3, yet there are a few differences (Fan et al 1999): firstly, the initial distribution is chosen as D1(i) = ci / Σ_{j=1}^{m} cj; secondly, the weight update rule is given by Dt+1(i) = Dt(i) exp(−αt yi ht(xi) β(i)) / Zt, where β(i) = −0.5 sign(yi ht(xi)) ci + 0.5 is a cost-adjustment function; finally, the choice of αt is given by αt = (1/2) log((1 + rAC) / (1 − rAC)), where rAC = Σ_{i=1}^{m} Dt(i) yi ht(xi) β(i). Note that the checks to stop the boosting process prematurely, see the second bullet in Algorithm 3, are still based on the r-value obtained from AdaBoost (rAB). This is because β(i) ∈ [0, 1].
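These AdaCost modifications can be sketched as follows (a sketch of the cost-adjusted update only, with our own naming):

```python
import numpy as np

def adacost_update(D, y, h_scores, cost):
    """AdaCost update (Fan et al 1999): beta(i) = -0.5*sign(y_i h(x_i))*c_i + 0.5
    increases the weights of costly mistakes more aggressively and decreases the
    weights of costly correct classifications more conservatively."""
    beta = -0.5 * np.sign(y * h_scores) * cost + 0.5
    r = np.sum(D * y * h_scores * beta)    # r_AC
    alpha = 0.5 * np.log((1 + r) / (1 - r))
    D_new = D * np.exp(-alpha * y * h_scores * beta)
    return alpha, D_new / D_new.sum()
```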

In the first boosting round of AdaCost, the weighted SVM formulation (3) is equivalent to solving a SVM formulation with the following goal function:

    min_{w,b,ξ}  (w^T w)/2 + C^+ Σ_{i|yi=+1} ξi + C^− Σ_{i|yi=−1} ξi    (4)

where C^+ / C^− = R. This can be seen as a cost-sensitive version of a SVM, and this idea has initially been proposed by Veropoulos et al (1999).

3.3.3 EasyEnsemble

One of the disadvantages of the random undersampling method is the fact that we are discarding potentially valuable information. EasyEnsemble (Liu et al 2009) is a method that combines several balanced subsets (S in total), containing randomly selected majority class instances together with all minority examples. Each subset contains the same amount of instances from both classes and is fed to the boosting algorithm presented in Algorithm 3. Afterwards, the weak learners h_{s,t} of each subset s are simply combined to form the final ensemble:

    H(x) = sign( Σ_{s=1}^{S} Σ_{t=1}^{T} α̃_{s,t} h_{s,t}(x) )  with s = 1, ..., S; t = 1, ..., T    (5)

It is clear that this technique benefits from a combination of bagging and boosting (Liu et al 2009). Note that we apply the same boosting algorithm as previously described to each of the balanced subsets. However, when rAB = 1 in the first round of boosting, we quit the boosting process, put α1 = 1 and continue to use the trained LR-model in the final ensemble. It was previously noted that this can cause overfitting, in the sense that it can pinpoint a wrong threshold. However, the LR-model still contains information and the combination thereof with the models obtained from the other subsets remains valuable (as we have noted from initial experiments by comparing the situation where we include or reject those subsets).
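The sampling step of EasyEnsemble can be sketched as follows (index-based, with our own naming; each returned subset would then be passed to the boosting procedure of Algorithm 3):

```python
import numpy as np

def easy_ensemble_subsets(maj_idx, min_idx, S, rng=None):
    """EasyEnsemble sampling step: S independent balanced subsets, each pairing
    all minority instances with an equal-size random draw (without replacement)
    from the majority class. The weak learners trained on all subsets are
    afterwards pooled into one ensemble (Equation (5))."""
    rng = rng or np.random.default_rng()
    return [np.concatenate([min_idx,
                            rng.choice(maj_idx, size=len(min_idx), replace=False)])
            for _ in range(S)]
```

Because each subset is only twice the minority class size and the subsets are independent, the S boosting runs are cheap and trivially parallelizable.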

There are a few subtle though important differences with respect to the experiments performed by Liu et al (2009). First of all, the authors use a CART-algorithm (Breiman et al 1984) as base learner. We employ a linear SVM with subsequent LR as a weak learner. To our knowledge this combination has not been proposed elsewhere, yet it proves to be very efficient in this setting. Secondly, their methodology employs a discrete version of AdaBoost, where each weak learner outputs binary values in {−1, 1}. As already stated, we make use of an improved version of AdaBoost that relies on confidence-rated predictions (Schapire and Singer 1999). Finally, the authors reported performances with fixed levels of S = 4 and T = 10. We will investigate the effect of varying S and T levels.

Imbalanced classification in sparse and large behaviour datasets 19

4 Results and discussion

4.1 Datasets

Stankova et al (2015) provide the first large collection of benchmark behaviour datasets for classification. In our experiments we make use of these data sources and extend this repository with two additional datasets. Each of these datasets shows a bipartite structure with a clear target variable to predict. We refer to this study and the next paragraph for a short description of the available data resources. In this paragraph we indicate why we have chosen to include or reject certain data sources from the aforementioned study. The available datasets can be divided into small, medium and large datasets, based on the number of instances and the number of features present. The Norwegian companies and Reality Mining datasets comprise the small datasets, since they contain only a few hundreds of instances or features and are therefore regarded as impractical for our purposes. The MovieLens, Yahoo, TaFeng, Book-Crossing and LibimSeTi datasets belong to the medium-sized datasets, each containing a few thousands up to a few hundreds of thousands of instances and features. All of these datasets are included in our study. The large datasets, containing hundreds of thousands up to millions of instances and features, are the Flickr and Kdd databases.12 Other large proprietary data sources, not included in Stankova et al (2015), are the corporate residence fraud (CRF) and banking (Bank) datasets, which arise from real-life application domains with intrinsic imbalance.

To summarize, we have gathered datasets containing such fine-grained behaviour data from a wide variety of application domains. The MovieLens datasets, for which we are predicting the gender13 (Mov G) or the genre thriller14 (Mov Th), provide data on which films each user has rated. The Yahoo movies15 dataset has a similar structure, where the age of each user, Yahoo A (above or below average), or the gender, Yahoo G, is being predicted. The TaFeng dataset16 contains data on shopping behaviour, where age (below or above average) is being predicted based upon which products are purchased. In the Book-Crossing (Book) dataset (Ziegler et al 2005), users rate books and the age of the user (above or below average) is being predicted. LibimSeTi (LST) contains data from a dating site (Brozovsky and Petricek 2007), where users rate each other's profile and gender is being predicted. In the advertisement (Adver) dataset (Lichman 2013) we try to predict if a url is an advertisement, based on a large variety of binary features of the url. Note that this dataset does not arise from the behaviour of entities, yet it still has a high-dimensional and sparse representation. The Flickr dataset (Cha et al 2009) contains pictures being marked by users as favorite, and we predict the number of comments on each picture (below

12 Flickr and KDD will be excluded in the comparative study of Section 4.6. This is because some methods are too computationally intensive, especially in combination with the large number of possible parameter combinations, to be applied on these very large data sources. Furthermore, our statistical evidence is already sufficiently strong to conclude significance without these datasets. Having said this, these data sources will be included in the analysis of Section 5.
13 MovieLens 1M dataset from http://grouplens.org/datasets/movielens
14 MovieLens 10M dataset from http://grouplens.org/datasets/movielens
15 https://webscope.sandbox.yahoo.com
16 http://www.bigdatalab.ac.cn/benchmark/bm/dd?data=Ta-Feng


or above average). In the Kdd cup data, the performance of a student on a test is being predicted, based on artificially created binary features (Yu et al 2010). The corporate residence fraud dataset (CRF) (Junque de Fortuny et al 2014b) contains data on foreign companies making transactions with specific Belgian companies, where we try to predict whether the foreign company commits residence fraud (a type of fiscal fraud). Finally, the banking dataset (Bank) (Martens et al 2016) contains detailed behaviour on consumers making transactions with merchants or other persons, to predict interest in a pension fund product. Some characteristics of these datasets can be found in Table 2. The features column only shows the number of active features.17

Table 2 Behaviour data characteristics. The final column shows the imbalance ratio p, defined as the ratio of the number of minority class instances to the amount of majority class instances in the training set, expressed as a percentage. See Section 4.2 for details regarding p.

Name      |Xmaj|      |Xmin|      Features     p = 100 × |Xmin|train / |Xmaj|train
Mov G     4331        1709        3706         1 & 25
Mov Th    10546       131         69878        1.24 (p = [])
Yahoo A   6030        1612        11915        1 & 25
Yahoo G   5436        2206        11915        1 & 25
TaFeng    17330       14310       23719        1 & 25
Book      42900       18858       282973       1 & 25
LST       59702       60145       166353       1
Adver     2792        457         1555         16.38 (p = []) & 1
CRF       869071      62          108753       0.0072 (p = [])
Bank      1193619     11107       3139570      0.93 (p = [])
Flickr    8166814     3028330     497472       0.1
Kdd       7171885     1235867     19306083     0.5

4.2 Methodology

Regarding the experiments performed in the upcoming sections, we applied a tenfold cross-validation procedure. Each of the folds contains 80% training data, 10% validation data and 10% test data. Note that these percentages are valid for both the majority class and the minority class (stratified sampling). As can be seen from Table 2, some datasets are balanced in nature. We created artificial imbalance for these datasets by removing minority class instances from the initial training set according to a user-defined parameter p. We ensured that the amount of minority training instances corresponds to p percent of the majority class training size: |Xmin|train = (p/100) × |Xmaj|train. As an example, say that we are using the Book dataset with p = 25. In that case we know that the majority class contains 34320 training instances (80% of |Xmaj| = 42900). The minority training data would contain 8580 instances (25% of 34320). When the dataset is already imbalanced we define p = [], which means that no downsampling

17 Active features represent features that are present for at least one instance in the dataset. A non-active feature corresponds with a column of zeros in the matrix representation and would not contribute to the model.


of the minority class training data is performed. Note that the validation and test data are left untouched.18
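The computation of this artificial imbalance can be sketched as (the helper name is ours; it reproduces the Book example with p = 25):

```python
def minority_train_size(n_maj, p, train_frac=0.8):
    """Artificial imbalance of Section 4.2: |Xmin|_train = (p/100) * |Xmaj|_train,
    with |Xmaj|_train = train_frac * |Xmaj| (the 80% training split).
    Returns (majority training size, minority training size)."""
    n_maj_train = int(train_frac * n_maj)
    return n_maj_train, int(p / 100.0 * n_maj_train)
```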

The methods detailed in Section 3 are applied on the training data. For both the under- and oversampling approaches, a linear SVM is trained on the newly created balanced training data, with regularization parameter C having values

    C = [10^−7, 10^−5, 10^−3, 10^−1, 10^0]

The validation data are used for parameter tuning purposes. The test data allow us to obtain the generalization performance. The results reported show the average over ten folds. In the remaining paragraphs of this section we will describe the various parameter settings used in our experiments and give a brief overview of the parameters occurring in each method.

Considering the oversampling techniques, the parameter settings are as follows:

β = [0, 1/3, 2/3, 1], prior_opt = {FlipCoin, Reverse Prior}, sim_measure = {Cosine, Jaccard}, K = [10^0, 10^1, 10^2, |X_min|_train]

We didn't include the "Prior" option due to initial experiments showing a lower performance in comparison with the other options. This can be explained by the low priors occurring in each column, resulting in synthetic samples that mainly show zeros in 0-1 match situations. The oversampling with replacement method (OSR) only uses the β parameter, SMOTE uses all but the K parameter, and ADASYN uses all of the parameters listed above.

With respect to the undersampling techniques, the following parameter settings are used:

β_u = [0, 1/4, 1/2, 3/4, 1], sim_measure = {Cosine, Jaccard}, K = [10^0, 10^1, 10^2, |X_min|_train], Clust_opt = {C Smallest, C Largest}

The random undersampling (RUS) technique only uses the β_u parameter. The second set of methods, "Closest tot sim" and "Farthest tot sim", uses β_u and sim_measure. The third set of techniques, "Closest Knn" and "Farthest Knn", makes use of all the parameters listed above except for Clust_opt. The final approach, CBU, employs β_u and Clust_opt.

The boosting variants (AdaBoost, AdaCost and EasyEnsemble) presented in Section 3.3 make use of the following settings:

T = 30, μ = [100, 75], C = [10^-7, 10^-5, 10^-3, 10^-1], R = [2, 8, R_L] where R_L = |X_maj|_train / |X_min|_train, S = [5, 10, 15]

The AdaBoost algorithm includes the T, μ and C parameters. AdaCost additionally uses cost-ratios R. We have chosen a range of values because misclassification costs are unknown for many business applications (He and Garcia 2009; Fan et al 1999; Sun et al 2007). The final value R_L seems to be a popular choice (Akbani et al 2004; Luts et al 2010) because the total weight on the majority class balances with the total weight on the minority class. The final method, EasyEnsemble, uses S subsets in addition to the parameters previously mentioned for AdaBoost. Note that we consider the boosting iteration round t ∈ [1, T] as a tunable parameter.19

18 This means that if we start from a balanced set, only the training data will show artificial imbalance according to the imbalance ratio p. The validation and test data would remain balanced. Since AUC (and some other metrics) is independent of class skew, it would be unwise to make these sets imbalanced as well, because that would lead to discarding minority class instances that are relevant for performance assessment.

4.3 Oversampling

The oversampling techniques presented in Section 3.1 are applied to each of the datasets from Table 2 with varying imbalance levels (by varying the p-parameter). The experiments we conducted use the following methodology: for each dataset, we apply the oversampling techniques with all possible parameter combinations on the training data to create newly balanced datasets, after which linear SVMs are trained. The optimal parameter combination with respect to each imbalance ratio β is selected based on validation set AUC-performance. With these parameters, the AUC on the test data is obtained. Results show the average over ten folds.

The results on four arbitrarily selected datasets are shown in Table 3. Full results on each of the data sources can be found in Appendix A, Table 11. From these tables we can conclude that performance generally increases with growing β-values. More precisely, performance keeps improving with growing β-levels until an optimal point β is reached; increasing the balance level after this optimal value will cause only small fluctuations with respect to the optimal performance. Traditional studies dealing with dense, low-dimensional data note that the OSR method can suffer from overfitting, as already mentioned in Section 3.1. It is interesting to see that in this sparse, high-dimensional setting this effect doesn't seem to occur.20 Furthermore, the computationally expensive synthetic sampling approaches do not seem to improve performance over the simpler OSR method. This can be explained by the fact that many instances have a very limited amount of non-zero elements (Junque de Fortuny et al 2014a; Stankova et al 2015), causing the synthetic sampling procedures to be limited in the amount of unique new samples they can produce. Note also that synthetic sampling procedures have historically been designed to overcome the overfitting behaviour of OSR, which no longer seems to apply here.

19 We sometimes make use of T as a symbol to indicate the boosting iteration number instead of t.
20 The SVM regularization parameter C controls for overfitting. Additionally, the influence of majority class noise/outliers (see Section 3.2) on the learned hyperplane decreases as we oversample the minority class. This hyperplane is more sensitive towards minority instances, and a change in its direction/orientation towards these majority class noise/outliers will be more heavily penalized.
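The plain OSR scheme is straightforward to sketch. The snippet below is illustrative only, under the assumption that β = 0 adds no samples and β = 1 yields a fully balanced training set (the function name and interface are ours, not the paper's):

```python
import numpy as np

def osr_indices(min_idx, n_maj, beta, seed=0):
    """Oversampling with replacement: replicate random minority indices
    until the balance level beta is reached (beta = 1 -> fully balanced)."""
    rng = np.random.default_rng(seed)
    min_idx = np.asarray(min_idx)
    n_min = min_idx.size
    target = int(n_min + beta * (n_maj - n_min))
    extra = rng.choice(min_idx, size=target - n_min, replace=True)
    return np.concatenate([min_idx, extra])
```

Training then proceeds on the original majority indices plus `osr_indices(...)`; SMOTE and ADASYN differ only in how the extra rows are synthesized rather than duplicated.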


Table 3 Oversampling experiments. Results show the average tenfold AUC test set performance with respect to increasing β-values (β = [β1, β2, β3, β4] = [0, 1/3, 2/3, 1]). Additionally, standard deviations are shown between brackets. Optimal performances for each method are highlighted in boldface. See Section 4.2 for parameter settings.

Mov Th (p = [])
         β1            β2            β3            β4
OSR      79.77 (5.33)  85.3 (4.66)   83.16 (4.5)   84.59 (5.69)
SMOTE    79.77 (5.33)  84.18 (6.51)  85.58 (5.97)  84.33 (5.75)
ADASYN   79.77 (5.33)  84.11 (6.77)  85.86 (6.06)  85.36 (5.13)

Yahoo G (p = 25)
         β1            β2            β3            β4
OSR      78.82 (1.39)  78.78 (2.02)  78.74 (1.86)  78.35 (1.47)
SMOTE    78.82 (1.39)  79.23 (1.57)  79.1 (1.2)    79.03 (1.89)
ADASYN   78.82 (1.39)  79.12 (1.43)  79.23 (1.37)  79.51 (2.01)

TaFeng (p = 25)
         β1            β2            β3            β4
OSR      66.94 (1.34)  68.84 (1.21)  66.99 (1.42)  67.7 (1.41)
SMOTE    66.94 (1.34)  68.47 (1.5)   67.07 (1.15)  66.65 (0.81)
ADASYN   66.94 (1.34)  68.62 (1.38)  67.85 (1.6)   66.91 (1.39)

Book (p = 25)
         β1            β2            β3            β4
OSR      60.08 (0.71)  61.05 (0.94)  61.82 (1.12)  64.62 (0.57)
SMOTE    60.08 (0.71)  62.6 (0.73)   60.95 (0.68)  63.0 (0.8)
ADASYN   60.08 (0.71)  62.33 (0.73)  60.77 (0.85)  63.04 (0.58)

4.4 Undersampling

Regarding the undersampling techniques, we employ a similar experimental set-up as in the previous section: firstly, train SVMs on the undersampled training data with all possible parameter combinations; secondly, choose a suitable parameter combination based on validation set AUC-performance; and finally, obtain the AUC-performance on the test set. The results are averaged across ten folds.

Before going into detail on the experimental findings, we give a short note on the effect of noise/outliers on SVM performance. Consider the following imaginary example: say that a majority class female is rating a lot of action films and thrillers. We can consider this an outlier, since most of the females are rating romantic or drama films. This specific female has far more in common with the minority class males who also rate action films and thrillers. Outliers are therefore instances showing odd behaviour; noise are wrongfully labelled instances. The effect of noise/outliers on SVM performance can be severe. Indeed, many of the instances in the dataset have a support value (dual variable α_i) of 0. Instances contributing to the model (a non-zero support value) are examples close to the borderline (Suykens et al 2002) and noise/outliers (as can be derived from the KKT dual-complementarity conditions21). With respect to noise/outliers, the support value can reach a maximal value of C, the chosen regularization parameter in the SVM formulation. It should be clear that lowering the C-value leads to models that are less sensitive to noise/outliers.

21 If α_i = 0, then y_i(w^T x_i + b) ≥ 1. For noise/outliers the term y_i(w^T x_i + b) is negative, hence α_i ≠ 0.
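The box constraint 0 ≤ α_i ≤ C can be inspected directly. A small sketch on synthetic data (assuming scikit-learn, whose `SVC.dual_coef_` stores y_i α_i for the support vectors):

```python
import numpy as np
from sklearn.svm import SVC

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 5))
y = (X[:, 0] > 0).astype(int)
y[:5] = 1 - y[:5]  # inject label noise

for C in [1e-3, 1e-1, 1e1]:
    svm = SVC(kernel="linear", C=C).fit(X, y)
    alphas = np.abs(svm.dual_coef_).ravel()  # alpha_i of the support vectors
    # KKT box constraint: every alpha_i lies in [0, C];
    # noise/outliers are the instances pinned at the upper bound C
    assert alphas.max() <= C + 1e-8
    print(C, int((alphas > C - 1e-8).sum()), "support vectors at the C bound")
```

Lowering C thus caps how much any single mislabelled instance can pull on the hyperplane, which is the noise-dampening effect exploited by the weak learners in the boosting experiments later on.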

The experimental outcomes of the various undersampling processes are shown in Table 4 for the same four previously selected datasets. We refer to Appendix B, Table 12 for results on the entire data repository. When removing only a limited amount of majority class instances (β_u = β_u2 = 1/4), we observed that in 12 out of 16 datasets the "Farthest Knn" method outperforms the "Closest Knn" technique; the four remaining datasets show comparable performances.22 This finding shows that the "Farthest" method is very suitable in removing majority class noise/outliers and empirically shows their performance degrading effect. With higher undersampling rates (β_u = 1), the "Closest Knn" method in general achieves higher performances than the "Farthest Knn" technique, though the results are less clear. We observed that only 9 out of 16 datasets indicate "Closest" as a clear winner, with 3 tie situations and 4 situations where "Farthest" shows the best results. Intuitively, we would expect the "Closest" method to perform better, since we are keeping the most informative instances (examples close to the minority class). However, this technique will also emphasize noise/outliers, and this is the main reason why the results are less significant. One can also see this from the fact that the RUS method indicates far better results in comparison to the aforementioned techniques when β_u = 1, an observation that is valid for all datasets. The RUS method is efficient in realizing noise/outlier reduction effects, since each SVM is trained with only a fraction of these types of instances. It is remarkable to see that in 9 out of 16 datasets the RUS method with β_u = 1 outperforms the baseline model (β_u = 0); in 4 datasets we observed equal performance, and 3 losses with respect to the baseline. RUS shows two attractive features: majority class noise/outlier removal and its ability to put more emphasis on the minority class instances by undersampling the majority class. A general downside of the approach is the information loss, and this effect causes the 3 losses to the baseline. Usually, though, the attractive features of RUS dominate the latter effect, and this indicates that it is definitely not necessary to include all majority class training instances in constructing the predictive model. Apparently, there is a high level of redundancy, which can be exploited to construct efficient hypotheses.

In the next paragraphs, we compare the cluster-based undersampling (CBU) approach to the random undersampling (RUS) technique. Before proceeding, we should note that Table 12 does not show results on the LST, CRF and Bank datasets, due to the fact that the projected unigraph was too large to fit in memory (larger than 16 GB). A limited amount of features (top nodes) does seem to be active for a relatively large number of instances (bottom nodes). All these instances will be connected in the unigraph projection, resulting in relatively high degrees for the bottom nodes. A possible solution to circumvent this problem might be to cut the weakest edges in the projected unigraph.

If we apply a limited amount of undersampling (β_u = β_u2 = 1/4), the CBU technique outperforms RUS in 8 out of 13 datasets (4 losses and 1 tie with respect to RUS). On the highly imbalanced datasets (p = 1 or p = []), the CBU method wins in 8 out of 8 cases. We can therefore conclude that CBU outperforms RUS at low undersampling rates. This indeed shows that handling the within-class imbalance can be beneficial in this environment. If we make a comparison at the highest undersampling rate, β_u = β_u5 = 1, the results are completely opposite: CBU wins in only 1 case, with 9 losses with respect to RUS and 3 ties. On the highly imbalanced datasets, CBU outperforms RUS for 1 dataset (with 4 losses and 3 ties). It is clear that RUS is the preferred technique in case of high undersampling rates. The reason is that CBU emphasizes the small disjuncts and basically fails to adequately pick up on the common, larger communities. Furthermore, these small disjuncts might actually correspond with majority class noise/outliers, and CBU is focusing on these types of instances. On the overall level, where we consider all undersampling rates (except β_u = 0), CBU wins in 5 out of 13 datasets, RUS wins in 4 cases, and there are 4 tie situations (yet RUS did outperform CBU in all 4 cases). Both methods seem to be competitive with one another across a wide range of imbalance levels. If we pay particular attention to the highly imbalanced sets, CBU wins in 5 cases and there are 3 tie situations on the remaining sets (yet RUS did outperform CBU in all 3 cases). CBU seems to be the method of choice (with respect to RUS) when the datasets show high imbalance levels.

22 A tie occurs in the situation where the absolute difference in AUC is smaller than or equal to 0.5.

Table 4 Undersampling experiments. Results show the average tenfold AUC test set performance with respect to increasing β_u-values (β_u = [β_u1, β_u2, β_u3, β_u4, β_u5] = [0, 1/4, 1/2, 3/4, 1]). Optimal performances (boldface) are highlighted for each imbalance ratio. Additionally, standard deviations are shown between brackets. Cl K represents the "Closest Knn" technique; Cl T represents the "Closest tot sim" technique (similar for Far K and Far T, see Section 3.2). Parameter settings are given in Section 4.2.

Mov Th (p = [])
        β_u1         β_u2         β_u3         β_u4         β_u5
RUS     79.77 (5.3)  80.32 (5.8)  81.57 (5.5)  81.86 (6.6)  81.26 (6.2)
Cl K    79.77 (5.3)  79.25 (4.5)  78.07 (5)    76.25 (6.5)  62.46 (8.5)
Cl T    79.77 (5.3)  78.4 (4.4)   72.41 (3.5)  64.66 (4.5)  60.37 (7.3)
Far K   79.77 (5.3)  84.54 (5)    83.64 (6.4)  80.02 (7.3)  56.82 (10.3)
Far T   79.77 (5.3)  85.03 (5.7)  82.68 (6.8)  75.61 (9.2)  56.77 (10.9)
CBU     80.11 (5.8)  81.17 (6)    81.08 (6.5)  84.17 (5.1)  80.96 (6.9)

Yahoo G (p = 25)
        β_u1         β_u2         β_u3         β_u4         β_u5
RUS     78.82 (1.4)  78.91 (1.6)  78.97 (1.6)  78.61 (1.6)  77.82 (2.1)
Cl K    78.82 (1.4)  77.26 (1.5)  72.52 (1.5)  67.86 (2)    65.07 (2.7)
Cl T    78.82 (1.4)  76.83 (1)    71.99 (1.8)  67.15 (2.3)  61.1 (2.7)
Far K   78.82 (1.4)  78.26 (2.2)  74.69 (2.7)  67.22 (2.1)  60.72 (2.3)
Far T   78.82 (1.4)  77.68 (2.6)  72.44 (3)    64.94 (2.4)  59.6 (2)
CBU     75.25 (3.2)  75.22 (2.4)  74.69 (2.3)  73.07 (2.4)  70.69 (2.4)

TaFeng (p = 25)
        β_u1         β_u2         β_u3         β_u4         β_u5
RUS     66.94 (1.3)  67.44 (1.3)  68.1 (1.4)   68.27 (1.4)  66.13 (1.2)
Cl K    66.94 (1.3)  66.13 (1.4)  63.39 (1.2)  59.83 (1.3)  56.94 (0.7)
Cl T    66.94 (1.3)  66.38 (1.5)  62.89 (1.6)  57.46 (1.3)  54.56 (1.3)
Far K   66.94 (1.3)  68.06 (1.4)  66.43 (1.6)  64.46 (1.5)  63.35 (1.3)
Far T   66.94 (1.3)  64.31 (1.1)  62.69 (1)    61.27 (1.1)  59.03 (1)
CBU     64.81 (1.2)  64.15 (1.1)  64.13 (1.2)  63.88 (0.8)  63.46 (0.8)

Book (p = 25)
        β_u1         β_u2         β_u3         β_u4         β_u5
RUS     60.08 (0.7)  60.13 (0.6)  60.4 (0.8)   60.33 (0.8)  63.28 (0.8)
Cl K    60.08 (0.7)  59.96 (0.7)  60.13 (0.8)  59.96 (1)    59.28 (0.7)
Cl T    60.08 (0.7)  59.96 (0.7)  60.13 (0.8)  60.29 (0.4)  54.5 (0.9)
Far K   60.08 (0.7)  63.29 (1)    64.19 (0.8)  57.3 (1.1)   55.66 (1.1)
Far T   60.08 (0.7)  62.14 (0.5)  58.27 (0.6)  56.37 (1)    55.66 (1.1)
CBU     54.82 (0.9)  54.67 (0.9)  54.71 (0.9)  54.66 (1)    54.78 (0.9)

4.5 Boosting variants

In this section, the plain AdaBoost (AB) method is compared with the cost-sensitive AdaCost (AC) algorithm, the EasyEnsemble (EE) technique and the baseline (BL) method (where we train a single SVM on the imbalanced training data). AB, AC and EE have an underlying boosting process in common. The AUC-values are stored during each boosting iteration, which allows us to show the performance graphically as a function of the number of iterations t. In EasyEnsemble (EE), we combine the weak learners of each subset by summing their individual contributions. For example, when t = 2 boosting iterations are performed, we evaluate the performance of the combined learner Σ_{s=1}^{S} Σ_{t=1}^{2} α_st h_st(x). Figures 1, 2, 3 and 4 show the average tenfold AUC-performance (with μ = 100) on the test data. The left-side figures show the AB, AC (with varying R-values) and EE (with varying S-values) performance with respect to the number of boosting iterations; the C-value is tuned according to highest validation set AUC-performance (over all possible boosting iterations). The right-side figures investigate the influence of the C-parameter for AB and EE (with S = 15) and allow us to gain insight into the effect of weak and strong learners on the generalization performance. Results on the remaining datasets are given in Appendix C, Figures 6-17. Note that we only indicate results with weight-percentage μ = 100 (use all instances in the training process); previous experiments (with μ = 75) showed inferior results. This suggests that it is indeed important to include all minority instances during the training of each weak learner, an observation also made by Liu et al (2009).
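The combination scheme above can be sketched end-to-end. The following is an illustrative sketch, not the authors' code: discrete AdaBoost over weak linear SVMs, with labels assumed to be in {-1, +1} and class -1 the majority:

```python
import numpy as np
from sklearn.svm import LinearSVC

def adaboost_svm(X, y, T=30, C=1e-3):
    """Discrete AdaBoost with weak linear SVMs; y must be in {-1, +1}."""
    n = len(y)
    w = np.full(n, 1.0 / n)
    hs, alphas = [], []
    for _ in range(T):
        h = LinearSVC(C=C).fit(X, y, sample_weight=w * n)
        pred = h.predict(X)
        err = max(w[pred != y].sum(), 1e-10)
        if err >= 0.5:
            if not hs:                 # keep at least one learner
                hs.append(h); alphas.append(1.0)
            break                      # no better than chance: stop boosting
        alpha = 0.5 * np.log((1.0 - err) / err)
        hs.append(h)
        alphas.append(alpha)
        w = w * np.exp(np.where(pred == y, -alpha, alpha))
        w = w / w.sum()
        if err <= 1e-10:               # training error is zero: stop early
            break
    return hs, alphas

def easy_ensemble(X, y, S=10, T=30, C=1e-3, seed=0):
    """Boost S random balanced subsets (all minority + equal-size majority)."""
    rng = np.random.default_rng(seed)
    min_idx = np.flatnonzero(y == 1)
    maj_idx = np.flatnonzero(y == -1)
    models = []
    for _ in range(S):
        sub = rng.choice(maj_idx, size=min_idx.size, replace=False)
        idx = np.concatenate([min_idx, sub])
        models.append(adaboost_svm(X[idx], y[idx], T=T, C=C))
    return models

def ee_decision(models, X):
    # combined learner: sum_s sum_t alpha_st * h_st(x), h_st(x) in {-1, +1}
    return sum(a * h.predict(X)
               for hs, alphas in models for h, a in zip(hs, alphas))
```

Summing the S boosted scores lets each balanced subset vote, which is what makes the majority-class exploration cheap: every subset is only twice the minority class size and the S boosting processes can be trained in parallel.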

When investigating the influence of the regularization parameter C in AdaBoost, we can graphically verify that it is a "weakness" indicator, in the sense that lower C-values correspond to weaker learners. While these low C-value learners have a low performance in the first rounds of boosting, the possible boost in performance is usually much higher compared to higher C-value inducers. Strong learners can already make a suitable distinction between minority and majority class behaviours in the first boosting round. The boosting process will emphasize the hard-to-learn instances, among which are noise/outliers. Therefore, too strong learners are not suitable to be used in a boosting process, as already noted by Wickramaratna et al (2001). In Figures 2(b) and 3(b) we observe that the lowest C-values (C = 10^-7, 10^-5) can outperform higher C-values (C = 10^-3, 10^-1). In many cases the AB-process outperforms the baseline (BL); see also the results in Table 5. A combination of several weak learners can often achieve better generalization than a single strong learner.

Inspired by the success of the RUS method with β_u = 1 (for reasons indicated in Section 4.4), the EE-technique tries to overcome an obvious downside of this approach: we are removing a lot of majority class instances that can be very informative. Hence, we try to make up for this loss by combining several of these balanced training sets, in the hope that the majority class is more adequately explored. This results in a very effective learner that seems to outperform the AB, AC and BL methods in the majority of cases, as can be seen from these figures. It is remarkable that we don't need to use that many subsets S to achieve high performances: increasing the S-value from 5 to 15 doesn't seem to cause major increases in AUC (see also Table 5). The influence of the C-parameter is similar as described in the previous paragraph.

[Figure: two panels of AUC test versus boosting iteration T, comparing AB, AC (R = 2, R = 8, R_L), EE (S = 5, 10, 15) and BL]
Fig. 1 Mov G (p = 25) dataset results showing average tenfold AUC-performance on test data (with μ = 100) for (a) AB, AC and EE with C chosen according to highest validation set AUC-performance (over all possible T); (b) AB and EE (S = 15) with varying C-levels.


[Figure: panels (a) and (b) as in Fig. 1]
Fig. 2 Book (p = 25) dataset.

[Figure: panels (a) and (b) as in Fig. 1]
Fig. 3 TaFeng (p = 25) dataset.

[Figure: panels (a) and (b) as in Fig. 1]
Fig. 4 Bank (p = []) dataset.


4.6 Final comparison

4.6.1 Performance-wise comparison

In this section, we wish to determine which of the aforementioned methods shows superior performance in terms of AUC on test data. The experimental set-up is described in Section 4.2. Note that we exclude β = 0 and β_u = 0 in determining the performance of the oversampling respectively undersampling techniques, to be able to compare them with the baseline (BL) approach.23 The results for AB, AC and EE are shown for μ = 100. The number of boosting iterations t ∈ [0, T] is also considered as a tunable parameter. Table 5 shows the final comparison between all methods. Additionally, we added an average rank column indicating the mean rank of each algorithm across all datasets. The rank of an algorithm for a specific dataset corresponds to its position in the sorted AUC-performance spectrum, where the best performing algorithm gets the rank of 1, the second best rank 2, etc. In case two or more algorithms perform equally well, average ranks are assigned (if two algorithms have ranks of 3 and 4 respectively and show equal performance, then both methods get assigned a rank of 3.5). Note that the LST dataset is excluded in the calculation of average ranks, since we consider this a dataset not being affected by imbalance and showing equal performances.
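The ranking step can be sketched as follows (assuming SciPy, whose `rankdata` averages tied ranks exactly as described; the toy AUC matrix is illustrative):

```python
import numpy as np
from scipy.stats import rankdata

def average_ranks(auc_matrix):
    """auc_matrix: (n_datasets, n_algorithms) AUC values.
    Higher AUC gets the lower (better) rank; ties share averaged ranks."""
    ranks = np.apply_along_axis(rankdata, 1, -np.asarray(auc_matrix))
    return ranks.mean(axis=0)

# toy example: 3 datasets, 3 algorithms; a tie occurs on the first dataset
avg = average_ranks([[0.80, 0.75, 0.75],
                     [0.70, 0.72, 0.68],
                     [0.90, 0.85, 0.80]])
```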

As a side note, we noticed that increasing the number of minority training set instances can significantly improve performances. We can see this clearly by comparing the p = 25 outcomes with the results of p = 1 on the same dataset. Junque de Fortuny et al (2014a) already proved empirically that larger behaviour datasets (in terms of the number of instances or features) lead to higher AUC-performances. This situation is confirmed in our experiments, though we want to add that this is only true when the data size is increased in a balanced fashion.

The results in Table 5 confirm our previous analysis. Each of the oversampling, undersampling (except Cl Knn) and boosting variants seems to improve upon the BL performance. Sobhani et al (2015) note that the direct application of ensemble techniques on imbalanced data does not circumvent the imbalanced learning issue. In our experiments on behaviour data, we do find a simple boosting procedure applied to the imbalanced data to improve upon the predictive performance. The oversampling techniques OSR, SMOTE and ADASYN show similar outcomes and are competitive methods, seeming to outperform the undersampling techniques and the plain boosting (AB) and cost-sensitive learning (AC) approaches. When dealing with traditional data, the performances of oversampling techniques are generally worse than those of undersampling approaches (Drummond and Holte 2003). It is interesting to observe that for these large and sparse behaviour datasets the opposite is true. This is heavily related to the intrinsic properties of the data under consideration. Undersampling (or feature selection, for that matter) is not recommended, because, as we have noted in Section 2.1, each feature provides a small additional contribution to the predictive performance. With respect to AC, the results indicate that R = R_L is a better choice compared to more random cost ratios R = 2, 8. The EE-technique has the lowest average rank and is our best performing method: the combination of several balanced subsets with filtered noise/outlier contributions allows one to explore the majority class space efficiently. To strengthen our previous statements, we will in the next paragraphs conduct statistical significance tests in accordance with the procedure outlined in Demsar (2006). The latter paper presents a framework to perform statistical comparisons of classifiers over multiple datasets.

23 The BL technique trains single SVMs on the imbalanced training data.

The first null-hypothesis we try to reject postulates that each of the algorithms performs equally well, or equivalently, that their average ranks R_j (see Table 5) are indifferent. Under the null-hypothesis, the Friedman statistic (Friedman 1937)

χ²_F = (12N / (k(k + 1))) [ Σ_{j=1}^{k} R_j² − k(k + 1)² / 4 ]    (6)

with N the number of datasets (N = 15) and k the number of algorithms (k = 13), can be used to compute the statistic (Iman and Davenport 1980)

F_F = (N − 1) χ²_F / (N(k − 1) − χ²_F)    (7)

The latter is distributed according to the F-distribution with k − 1 and (k − 1)(N − 1) degrees of freedom. Evaluating this measure with the average ranks shown in Table 5 results in F_F = 22.98, which is a lot higher than the critical value of 1.81 at the α = 0.05 significance level. We can therefore reject the null-hypothesis and proceed with post-hoc tests.
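Plugging the average-rank column of Table 5 into Eqs. (6)-(7) reproduces this value; a sketch (assuming SciPy for the F critical value):

```python
import numpy as np
from scipy.stats import f as f_dist

def iman_davenport(avg_ranks, n_datasets, alpha=0.05):
    """Friedman chi-square, Eq. (6), and Iman-Davenport F_F, Eq. (7)."""
    R = np.asarray(avg_ranks, dtype=float)
    k, N = R.size, n_datasets
    chi2_f = 12.0 * N / (k * (k + 1)) * (np.sum(R ** 2) - k * (k + 1) ** 2 / 4.0)
    f_f = (N - 1) * chi2_f / (N * (k - 1) - chi2_f)
    crit = f_dist.ppf(1 - alpha, k - 1, (k - 1) * (N - 1))
    return f_f, crit

ranks = [11.600, 5.000, 4.533, 4.800, 8.167, 12.467, 8.133,
         8.567, 8.267, 8.467, 5.400, 3.267, 2.333]  # Table 5, BL first
f_f, crit = iman_davenport(ranks, n_datasets=15)
# f_f comes out at about 22.99 with these rounded ranks (22.98 in the text)
```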

The Nemenyi test (Nemenyi 1963) allows one to compare each classifier to one another and adjusts the critical value to compensate for making k(k − 1)/2 comparisons.24 "The performance of two classifiers is significantly different if the corresponding average ranks differ by at least the critical difference (CD)" (Demsar 2006). We refer to the aforementioned paper to calculate the CD. The result of these comparisons at the α = 0.05 significance level is shown in Table 6. This test is already quite revealing, in the sense that for instance the oversampling methods (OSR, SMOTE and ADASYN), AC (R = R_L) and EE-techniques are significantly different from the BL. Yet it does not indicate that undersampling or applying plain boosting (AB) results in significantly different performances compared to the BL. The Nemenyi test is still conservative, and we therefore compare all classifiers to a single control classifier in the next paragraphs. Note that we only make k − 1 comparisons in that situation, resulting in stronger results compared to the Nemenyi test.

When comparing a single classifier i with a control classifier c, the null-hypothesis states that the two classifiers i and c perform equally well. Under this hypothesis, the test statistic (based on the average ranks R_i shown in Table 5)

z = (R_i − R_c) / √(k(k + 1) / (6N))    (8)

is distributed according to a standard normal distribution.

24 The larger the amount of comparisons, the higher the proportion of null-hypotheses that are wrongfully rejected due to random chance.

This z-value is used to find the corresponding probability (p-value) p from the cumulative standard normal


Table 5 Final comparison of all methods in terms of AUC test set performance; see Section 4.2 for details on the experimental set-up. β ≠ 0 and β_u ≠ 0 for the oversampling respectively undersampling techniques; μ = 100 for each of the boosting variants. Standard deviations are given between brackets and the performance difference with the BL is shown in square brackets.

              Mov G (p = 1)       Mov G (p = 25)      Mov Th (p = [])     Yahoo A (p = 1)
BL            71.6 (2.62) [0]     81.41 (1.32) [0]    79.77 (5.33) [0]    55.92 (2.97) [0]
OSR           75.35 (2.27) [3.8]  83.76 (2.09) [2.3]  85.13 (6.1) [5.4]   60.05 (2.71) [4.1]
SMOTE         76.16 (2.27) [4.6]  83.7 (2.1) [2.3]    85.67 (4.98) [5.9]  60.1 (3) [4.2]
ADASYN        76.07 (2.26) [4.5]  83.63 (2.04) [2.2]  85.65 (5.6) [5.9]   59.9 (2.99) [4]
RUS           72.88 (2.73) [1.3]  81.52 (2.15) [0.1]  82.91 (7.19) [3.1]  57.04 (1.77) [1.1]
Cl Knn        71.43 (1.36) [-0.2] 80.88 (1.19) [-0.5] 78.87 (4.71) [-0.9] 55.78 (2.71) [-0.1]
Far Knn       71.9 (2.95) [0.3]   80.9 (1.48) [-0.5]  84.07 (4.64) [4.3]  57.2 (1.33) [1.3]
CBU           74.17 (2.36) [2.6]  81.51 (1.04) [0.1]  82.76 (7.22) [3]    58.77 (3.43) [2.8]
AB            71.65 (1.73) [0.1]  84.52 (1.89) [3.1]  82.43 (5.18) [2.7]  58.35 (2.62) [2.4]
AC (R = 2,8)  71.61 (2.46) [0]    83.46 (1.82) [2]    83.27 (5.6) [3.5]   57.72 (2.47) [1.8]
AC (R = R_L)  74.65 (2.7) [3.1]   83.35 (2.09) [1.9]  85.41 (4.49) [5.6]  59.47 (2.33) [3.5]
EE (S = 10)   76.04 (2.66) [4.4]  85.05 (1.85) [3.6]  86.1 (5.78) [6.3]   59.66 (3.13) [3.7]
EE (S = 15)   76.12 (2.88) [4.5]  85.14 (1.86) [3.7]  86.42 (5.86) [6.7]  59.76 (2.93) [3.8]

              Yahoo A (p = 25)    Yahoo G (p = 1)     Yahoo G (p = 25)    TaFeng (p = 1)
BL            61.68 (2.42) [0]    66.84 (3.66) [0]    78.82 (1.39) [0]    55.75 (1.6) [0]
OSR           64.59 (3.12) [2.9]  73.08 (2.96) [6.2]  78.52 (2.01) [-0.3] 61.21 (2.24) [5.5]
SMOTE         65.56 (3.33) [3.9]  73.11 (3.12) [6.3]  79.01 (1.21) [0.2]  61.72 (1.81) [6]
ADASYN        65.13 (3.38) [3.4]  73.22 (3.17) [6.4]  79.74 (1.68) [0.9]  61.68 (1.86) [5.9]
RUS           64.11 (2.8) [2.4]   70.65 (3.39) [3.8]  78.91 (1.55) [0.1]  59.25 (2.18) [3.5]
Cl Knn        61.14 (2.13) [-0.5] 66.34 (3.54) [-0.5] 77.26 (1.46) [-1.6] 55.77 (1.28) [0]
Far Knn       63.96 (3.03) [2.3]  66.97 (3.54) [0.1]  78.26 (2.2) [-0.6]  59.98 (1.26) [4.2]
CBU           62.27 (1.79) [0.6]  71.27 (2.89) [4.4]  75.22 (2.42) [-3.6] 58.4 (1.57) [2.6]
AB            63.88 (2.67) [2.2]  68.9 (2.03) [2.1]   79.01 (1.66) [0.2]  56.21 (1.79) [0.5]
AC (R = 2,8)  64.32 (3.56) [2.6]  68.89 (3.11) [2]    78.99 (1.89) [0.2]  56.33 (1.83) [0.6]
AC (R = R_L)  64.31 (3.03) [2.6]  73.13 (2.8) [6.3]   78.41 (2) [-0.4]    61.6 (2.26) [5.9]
EE (S = 10)   66.51 (3.24) [4.8]  72.61 (3.15) [5.8]  80.52 (1.6) [1.7]   61.2 (1.82) [5.4]
EE (S = 15)   66.36 (3.18) [4.7]  73.48 (2.32) [6.6]  80.54 (1.56) [1.7]  61.13 (1.83) [5.4]

              TaFeng (p = 25)     Book (p = 1)        Book (p = 25)       LST (p = 1)
BL            66.94 (1.34) [0]    52.6 (1.29) [0]     60.08 (0.71) [0]    99.99 (0.01) [0]
OSR           68.77 (1.23) [1.8]  55.87 (1.42) [3.3]  64.62 (0.57) [4.5]  99.99 (0.01) [0]
SMOTE         68.47 (1.5) [1.5]   55.07 (0.88) [2.5]  62.96 (0.82) [2.9]  99.99 (0.01) [0]
ADASYN        68.48 (1.47) [1.5]  55.04 (0.91) [2.4]  63.02 (0.57) [2.9]  99.99 (0.01) [0]
RUS           68.28 (1.39) [1.3]  54.26 (0.92) [1.7]  63.28 (0.8) [3.2]   99.98 (0.01) [0]
Cl Knn        66.13 (1.43) [-0.8] 52.69 (1.3) [0.1]   60.02 (0.79) [-0.1] 99.99 (0.01) [0]
Far Knn       68.06 (1.41) [1.1]  56.25 (1.52) [3.7]  64.15 (1.12) [4.1]  99.98 (0.01) [0]
CBU           63.84 (1.07) [-3.1] 53.75 (1.01) [1.2]  54.68 (0.88) [-5.4] []
AB            67.65 (1.55) [0.7]  54.27 (1.95) [1.7]  65 (0.67) [4.9]     99.99 (0.01) [0]
AC (R = 2,8)  69.31 (1.23) [2.4]  53.72 (1) [1.1]     61.24 (0.8) [1.2]   99.98 (0.01) [0]
AC (R = R_L)  67.15 (1.51) [0.2]  55.73 (1.22) [3.1]  64.6 (0.64) [4.5]   99.99 (0.01) [0]
EE (S = 10)   70.3 (1.35) [3.4]   55.09 (1.29) [2.5]  65.37 (0.61) [5.3]  99.98 (0.01) [0]
EE (S = 15)   70.4 (1.3) [3.5]    55.35 (1.26) [2.8]  65.4 (0.51) [5.3]   99.98 (0.01) [0]


Table 5 Continued. Additionally, an average rank column is added showing the mean rank of each algorithm across all datasets. Note that we excluded the LST dataset for this purpose.

              Adver (p = [])      Adver (p = 1)       CRF (p = [])         Bank (p = [])
BL            96.61 (1.82) [0]    90.93 (3.02) [0]    64.06 (16.43) [0]    66.82 (0.88) [0]
OSR           96.93 (1.91) [0.3]  93.3 (2.02) [2.4]   80.74 (12.93) [16.7] 71.39 (0.79) [4.6]
SMOTE         97.05 (1.66) [0.4]  93.35 (2.01) [2.4]  78.7 (16.56) [14.6]  []
ADASYN        96.91 (1.95) [0.3]  93.46 (2.21) [2.5]  78.87 (16.71) [14.8] []
RUS           96.81 (1.87) [0.2]  92.38 (2.51) [1.5]  83.98 (5.99) [19.9]  69.41 (1.19) [2.6]
Cl Knn        96.4 (1.48) [-0.2]  89.73 (3.42) [-1.2] 76.63 (16.19) [12.6] 66.17 (0.72) [-0.6]
Far Knn       95.77 (1.81) [-0.8] 93.88 (1.78) [3]    83.75 (13.11) [19.7] 66.95 (0.56) [0.1]
CBU           97.15 (1.88) [0.5]  94.18 (2.3) [3.3]   []                   []
AB            97.34 (2.18) [0.7]  91.39 (3.23) [0.5]  77.62 (15.15) [13.6] 66.82 (0.88) [0]
AC (R = 2,8)  97.44 (1.93) [0.8]  91 (3.35) [0.1]     68.31 (14.93) [4.2]  67.67 (0.71) [0.9]
AC (R = R_L)  97.46 (1.71) [0.8]  93.51 (2.17) [2.6]  85.08 (9.77) [21]    70.7 (0.8) [3.9]
EE (S = 10)   97.64 (1.35) [1]    92.97 (2.75) [2]    86.18 (10.17) [22.1] 71.46 (0.81) [4.6]
EE (S = 15)   97.63 (1.35) [1]    93.3 (2.14) [2.4]   86.35 (9.99) [22.3]  71.54 (0.76) [4.7]

              Average Rank
BL            11.600
OSR           5.000
SMOTE         4.533
ADASYN        4.800
RUS           8.167
Cl Knn        12.467
Far Knn       8.133
CBU           8.567
AB            8.267
AC (R = 2,8)  8.467
AC (R = R_L)  5.400
EE (S = 10)   3.267
EE (S = 15)   2.333

Table 6 Nemenyi test outcomes at the α = 0.05 significance level. A value of 1 indicates the null-hypothesis is rejected and thus finds the two algorithms significantly different (a value of 0 means we accept the null-hypothesis). Methods are listed in the same order as Table 5 and are represented more concisely.

BL RO SM AD RU Cl Fa CBU AB AC1 AC2 EE1 EE2

BL 0 1 1 1 0 0 0 0 0 0 1 1 1RO 1 0 0 0 0 1 0 0 0 0 0 0 0SM 1 0 0 0 0 1 0 0 0 0 0 0 0AD 1 0 0 0 0 1 0 0 0 0 0 0 0RU 0 0 0 0 0 0 0 0 0 0 0 1 1Cl 0 1 1 1 0 0 0 0 0 0 1 1 1Fa 0 0 0 0 0 0 0 0 0 0 0 1 1

CBU 0 0 0 0 0 0 0 0 0 0 0 1 1AB 0 0 0 0 0 0 0 0 0 0 0 1 1AC1 0 0 0 0 0 0 0 0 0 0 0 1 1AC2 1 0 0 0 0 1 0 0 0 0 0 0 0EE1 1 0 0 0 1 1 1 1 1 1 0 0 0EE2 1 0 0 0 1 1 1 1 1 1 0 0 0


distribution function p = 2timesmin(Φ(z)1minusΦ(z)) Holmrsquos method (Holm 1979)compares kminus 1 classifiers to a single control classifier and sorts the correspondingp-values in ascending order so that p1 le p2 le le pkminus1 Each pi is subsequentlycompared to its associated confidence level25 αcomp = α(kminus i) Holm starts withperforming the check p1 ltα(kminus1) and rejects the null-hypothesis if the test passesIt then proceeds with a similar check for i = 2 and continues until a certain null-hypothesis cannot be rejected The remaining hypothesis are retained

Table 7 shows the result of the Holm test at the α = 005 significance level withthe BL algorithm as control classifier The table lists each method according to sortedp-values and shows the z- p- and αcomp-values The significance column indicateswhether the proposed method is significantly different from the BL and coincidentlymatches with the result of the Nemenyi test The p-values of the (Far Knn RUSCBU) undersampling methods AB and AC (R= 28) are only slightly higher thanthe level required to reject the null-hypothesis We included a critical significancelevel αcrit corresponding to the lowest possible significance level upon which themethod would be considered as significantly different from the BL (if α = αcrit thenp = αcomp) In the case of AB for example if we were to choose a significance levelα = 00764 (a confidence level of 9236) the associated p-value would be smallerthan αcomp and we would proceed to conclude26 that AB would perform significantlydifferent from the BL To summarize the null-hypothesis of equal performance be-tween the BL and an alternative method proposed in Section 3 (except Cl Knn) isrejected with a confidence level of more than 91 according to Holmrsquos test Hencethe performance of the BL approach is significantly worse than that of each of theproposed over- and undersampling methods (except Cl Knn) cost-sensitive learningtechniques (AC) and boosting variants (AB EE)

Similar to the previous paragraph, Table 8 presents the results of Holm's test at the α = 0.05 significance level with EE(S = 15) as control classifier. The results again coincide with those of the Nemenyi test and indicate that the oversampling methods (OSR, SMOTE and ADASYN), AC(R = RL) and EE(S = 10) are not statistically different from the EE(S = 15) technique at the 0.05 significance level. However, if we were to adapt the significance level to approximately α = 0.25 (a 75% confidence level), the results would indicate significance (except for EE(S = 10)). To summarize, the null-hypothesis of equal performance between EE(S = 15) and an alternative method proposed in Section 3 (except EE(S = 10)) is rejected with a confidence level of more than 75% according to Holm's test. Hence, the performance of the oversampling, undersampling, AB and AC approaches is significantly worse than that of EE(S = 15). We can therefore safely conclude EE to be the preferred method in terms of AUC performance when dealing with imbalanced behaviour data.

25 αcomp adjusts the value of α to compensate for multiple comparisons.
26 This is not entirely true, since Holm's method requires Far Knn to pass the test first before proceeding, which only occurs at the 0.088662 significance level.

34 Jellis Vanhoeyveld David Martens

Table 7 Holm test at the α = 0.05 significance level with BL as reference. The table shows the z test statistic with associated p-value and αcomp = α/(k − i), with k the number of algorithms (k = 13) and i corresponding to the position of the method in the sorted p-value vector. Since we already ranked the p-values in ascending order, i takes on the row number (e.g. OSR has i = 5). The column "significant" denotes whether we can reject the null-hypothesis (significant ⟺ p < αcomp). αcrit corresponds with the smallest possible significance level at which we would decide to reject the null-hypothesis (αcrit = α · p/αcomp).

            z          p         αcomp     significant  αcrit
EE(S = 15)  -6.51642   7.2E-11   0.004167  1            8.64E-10
EE(S = 10)  -5.86009   4.63E-09  0.004545  1            5.09E-08
SMOTE       -4.96936   6.72E-07  0.005     1            6.72E-06
ADASYN      -4.78183   1.74E-06  0.005556  1            1.56E-05
OSR         -4.64119   3.46E-06  0.00625   1            2.77E-05
AC(R = RL)  -4.35991   1.3E-05   0.007143  1            9.11E-05
Far Knn     -2.4378    0.014777  0.008333  0            0.088662
RUS         -2.41436   0.015763  0.01      0            0.078815
AB          -2.34404   0.019076  0.0125    0            0.076305
AC(R = 28)  -2.20339   0.027567  0.016667  0            0.082701
CBU         -2.13307   0.032919  0.025     0            0.065837
Cl Knn      0.609449   0.542227  0.05      0            0.542227

Table 8 Holm test at the α = 0.05 significance level with EE(S = 15) as reference.

            z         p         αcomp     significant  αcrit
Cl Knn      7.12587   1.03E-12  0.004167  1            1.24E-11
BL          6.516421  7.2E-11   0.004545  1            7.92E-10
CBU         4.383348  1.17E-05  0.005     1            0.000117
AC(R = 28)  4.313027  1.61E-05  0.005556  1            0.000145
AB          4.172384  3.01E-05  0.00625   1            0.000241
RUS         4.102063  4.09E-05  0.007143  1            0.000287
Far Knn     4.078623  4.53E-05  0.008333  1            0.000272
AC(R = RL)  2.156513  0.031044  0.01      0            0.155218
OSR         1.875229  0.060761  0.0125    0            0.243045
ADASYN      1.734587  0.082814  0.016667  0            0.248442
SMOTE       1.547064  0.121848  0.025     0            0.243696
EE(S = 10)  0.65633   0.511612  0.05      0            0.511612

4.6.2 Timing-wise comparison

The performance of each of the proposed algorithms in terms of training times depends highly upon the chosen parameters. The β and βu parameters control the number of training instances and are an important contribution to the subsequent SVM learning times for each of the over- and undersampling techniques. The number of boosting iterations t ∈ [0, T] is also a major factor in each of the boosting variants proposed in Section 3.3. Furthermore, the chosen regularization parameter C highly influences the training times of the SVM inducer: higher C-values result in larger training times (Fan et al 2008). Not only do the chosen hyperparameters affect learning times; data characteristics such as the number of minority instances, the imbalance level, the number of features, the amount of overlapping etc. also have a major effect.


In comparing each of the methods outlined in Section 3, we make use of a similar methodology as previously presented (see Section 4.6.1). For each of the ten folds, a suitable parameter combination is selected based on optimal validation set AUC-performance. These chosen parameters are applied in determining the timings of each method. The final results indicate the average timings across ten folds. Hence, we show the required amount of time for each method when using the parameter combination resulting in the best performance. Note that this might, for instance, result in using β = 1/3 in case of OSR or using only t = 5 boosting iterations in EE. Computational timings are shown in Table 9.

As one can observe from the aforementioned table, CBU is the slowest method as it relies on computationally intensive clustering procedures. The synthetic SMOTE and ADASYN oversampling approaches are very time consuming methods: they both rely on computationally expensive nearest neighbour computations in conjunction with a synthetic sample generation procedure that requires a single pass over all minority training instances.

The RUS method shows, as expected, superior results in terms of computational timings. Other undersampling methods perform quite well, though they are slower than BL and OSR due to their inherent nearest neighbour computations.

The boosting variants seem to perform worse than the BL, OSR and each of the undersampling methods. Since EE was the best method in terms of AUC-performance (see Section 4.6.1), we focus on this learner in this paragraph. For the medium sized datasets (all but CRF and Bank), OSR outperforms EE(S = 15), yet for the large datasets (CRF and Bank) the opposite is true. A SVM is known to have a quadratic computational complexity in the number of training instances (Suykens et al 2002; García and Lozano 2007). For this reason, we consider OSR (or any other oversampling technique) as inappropriate for large datasets. Since with EE each subset has a size limited to twice the minority class training size, we expect increasing benefits of the EE-technique for larger and more imbalanced behaviour data. As Junque de Fortuny et al (2014a) have observed, larger behaviour data, in terms of the number of instances or features, also contribute to an increase in predictive power. Hence, we expect to see this type of big behaviour data more often in real-life applications, where computational resources can become a major constraint. Because the EE-technique is easily parallelizable, by simultaneously performing the boosting process for each of the independent subsets, we added an extra row in Table 9: EE par corresponds with the time needed for EE(S = 15) divided by 15, indicating the time that would be required for EE if each subset were trained in parallel. We can see that EE par can outperform OSR in 8 out of 16 cases, precisely for the datasets that show a high imbalance level (e.g. p = 1 or p = []).
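The subset construction that makes this parallelism possible can be sketched as follows (a toy illustration with index lists; the boosting run on each subset is omitted):

```python
import random

def easy_ensemble_subsets(majority, minority, S=15, seed=0):
    """Draw S independent balanced subsets, each of size 2 * |minority|."""
    rng = random.Random(seed)
    # Each subset pairs a fresh random undersample of the majority class
    # with the full minority class, so no subset exceeds twice the
    # minority training size.
    return [rng.sample(majority, len(minority)) + minority for _ in range(S)]

maj = list(range(1000))          # toy majority-class indices
mino = list(range(1000, 1025))   # 25 toy minority-class indices
subsets = easy_ensemble_subsets(maj, mino)
print(len(subsets), len(subsets[0]))  # -> 15 50
```

Since the S boosting runs are independent, they can be mapped onto separate workers; EE par approximates this by dividing the sequential EE(S = 15) time by 15.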

We conclude our comparison of the proposed methods in Figure 5, which shows the average ranks in terms of computational timings (see Table 9) and AUC performance for each of the available data sources. Regarding the ranks with respect to the latter, we added an extra row EE par having the same AUC-performance as EE(S = 15) for each dataset in Table 5. Ranks are subsequently determined according to the procedure outlined in Section 4.6.1. Additionally, Figure 18 in Appendix D shows the result for the large and highly imbalanced CRF and Bank datasets. We expect the latter type of data to be of major interest for reasons already indicated.


Table 9 Computational timings (in seconds) averaged across ten folds, using the parameter combination resulting in the highest validation set AUC-performance. First and second best performances are emphasized in boldface (third and fourth best are underlined). EE par is the time required for EE(S = 15) divided by 15.

            Mov G(p = 1)  Mov G(p = 25)  Mov Th(p = [])  Yahoo A(p = 1)
BL          0.032889      0.056697       0.558563        0.026922
OSR         0.055043      0.062802       0.99009         0.044421
SMOTE       0.218821      0.937057       3.841482        0.057726
ADASYN      0.284688      1.802399       5.191265        0.087694
RUS         0.011431      0.025383       0.155224        0.007991
CL Knn      0.046599      0.599846       0.989914        0.037182
Far Knn     0.039887      0.80072        0.683023        0.027788
CBU         10.34111      10.60173       68.22839        16.92477
AB          0.169792      0.841443       3.460246        0.139251
AC(R = 28)  0.471994      2.996585       10.86907        0.366555
AC(R = RL)  0.53376       1.179542       6.065177        0.209015
EE(S = 10)  0.117226      6.065145       1.17995         0.148973
EE(S = 15)  0.20474       7.173737       2.119991        0.180365
EE par      0.013649      0.478249       0.141333        0.012024

            Yahoo A(p = 25)  Yahoo G(p = 1)  Yahoo G(p = 25)  TaFeng(p = 1)
BL          0.092954         0.011915        0.044164         0.026728
OSR         0.027887         0.013241        0.047206         0.040919
SMOTE       1.062686         0.056153        0.883698         0.219553
ADASYN      2.050993         0.079073        1.733367         0.306618
RUS         0.048471         0.003234        0.033423         0.002916
CL Knn      0.84391          0.025404        0.502515         0.092167
Far Knn     0.664124         0.026576        0.500206         0.080159
CBU         15.69442         12.87221        13.55035         24.67279
AB          0.445546         0.078777        0.169977         0.114619
AC(R = 28)  1.034044         0.321723        0.515953         0.926178
AC(R = RL)  0.706215         0.226741        0.112949         0.610233
EE(S = 10)  1.026577         0.100331        1.527146         0.058052
EE(S = 15)  1.607596         0.077483        2.472582         0.10538
EE par      0.107173         0.005166        0.164839         0.007025

            TaFeng(p = 25)  Book(p = 1)  Book(p = 25)  LST(p = 1)
BL          0.032033        0.080035     0.318093      0.652045
OSR         0.032414        0.132927     0.092757      0.87152
SMOTE       5.089283        3.409418     11.43444      4.987705
ADASYN      8.148419        3.689661     12.25441      6.840083
RUS         0.020457        0.022713     0.031972      0.432839
CL Knn      1.713731        0.400873     3.711648      2.508374
Far Knn     1.539437        0.379086     3.988552      2.511037
CBU         26.42686        41.98663     46.31987      []
AB          0.713265        0.61719      1.238585      2.466151
AC(R = 28)  1.234647        1.666131     2.330635      1.451671
AC(R = RL)  0.279047        0.860346     0.197053      1.23763
EE(S = 10)  2.484502        2.145747     7.177484      0.524066
EE(S = 15)  3.363971        2.480066     11.21945      0.784111
EE par      0.224265        0.165338     0.747963      0.052274


Table 9 Continued. Additionally, an extra column indicating average ranks is included (lower ranks are preferred).

            Adver(p = [])  Adver(p = 1)  CRF(p = [])  Bank(p = [])
BL          0.010953       0.002796      0.725911     70.89334
OSR         0.012178       0.006166      3.685813     179.7481
SMOTE       0.123112       0.017764      5.633862     []
ADASYN      0.183767       0.021728      5.768669     []
RUS         0.012115       0.00204       0.147392     52.47441
CL Knn      0.061324       0.005568      1.106755     73.73282
Far Knn     0.079078       0.007069      1.110379     97.59619
CBU         33.78235       32.36754      []           []
AB          0.069199       0.103518      1.153196     83.08618
AC(R = 28)  0.193092       0.068905      2.047434     71.70548
AC(R = RL)  0.107652       0.037963      1.387174     106.3466
EE(S = 10)  0.138485       0.085686      0.198656     24.95117
EE(S = 15)  0.185136       0.139121      0.285345     36.40107
EE par      0.012342       0.009275      0.019023     2.426738

            Average Rank [pos]
BL          2.94 [2]
OSR         4.19 [4]
SMOTE       9.59 [11]
ADASYN      10.91 [13]
RUS         1.38 [1]
CL Knn      6.5 [5]
Far Knn     6.56 [6]
CBU         14 [14]
AB          8.06 [7]
AC(R = 28)  10.81 [12]
AC(R = RL)  9.25 [9]
EE(S = 10)  8.25 [8]
EE(S = 15)  9.56 [10]
EE par      3 [3]

Timings are a substantial concern with respect to these large datasets. As can be observed, the EE-technique thrives for these big and highly imbalanced data, even in its non-parallel form.


Fig 5 Average rank AUC versus average rank Time (see Table 9) across all datasets from Table 2 (excluding Flickr and Kdd). Regarding the AUC rank, an extra row EE par is added having the same AUC-value as EE(S = 15). Ranks for each dataset are subsequently obtained via the procedure outlined in Section 4.6.1. Points occurring in the upper-right region are preferred.

5 The effect of the chosen base learner

In this short section we investigate the influence of the chosen base learner. More specifically, we ask ourselves whether some base inducers can more easily cope with the imbalanced learning issue, and in doing so verify the conclusion that EE is a suitable technique to improve upon the baseline performance. We rely on several popular classification methods for behaviour data: two regularization based techniques (SVM, LR) and two heuristic types of approaches (Naive Bayes (NB), Behavioural Similarity (BeSim)). The next paragraphs briefly introduce these methods. Note that source codes and dataset results are provided in our on-line repository.

LR constructs a linear model that maximizes the likelihood of the observed data by imposing a logistic model for the parametric probability model P(Y|X). Ng and Jordan (2002) provide a thorough introduction to the topic27 and note that in its plain form LR can suffer from overfitting, especially when the input data is very high dimensional. Therefore, we resort to regularized logistic regression. We opt for L2-regularized LR, since L1-regularization corresponds with a natural way of imposing feature selection (Ng 2004). The LIBLINEAR toolbox (Fan et al 2008) is used to obtain these L2-regularized LR models.
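For intuition, the objective LIBLINEAR minimizes for L2-regularized LR can be written as 0.5·wᵀw + C·Σᵢ log(1 + exp(−yᵢ wᵀxᵢ)). The sketch below optimizes it with plain gradient descent on a two-feature toy problem (illustrative only, not the LIBLINEAR solver):

```python
import math

def train_l2_lr(X, y, C=1.0, lr=0.1, iters=500):
    """Gradient descent on 0.5*||w||^2 + C * sum_i log(1 + exp(-y_i w.x_i))."""
    w = [0.0, 0.0]
    for _ in range(iters):
        grad = list(w)                           # gradient of the L2 term
        for xi, yi in zip(X, y):
            margin = yi * sum(wj * xj for wj, xj in zip(w, xi))
            coeff = -yi * C / (1.0 + math.exp(margin))   # logistic loss term
            grad = [g + coeff * xj for g, xj in zip(grad, xi)]
        w = [wj - lr * g for wj, g in zip(w, grad)]
    return w

X = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [0.1, 0.9]]
y = [1, 1, -1, -1]
w = train_l2_lr(X, y)
print(w[0] > 0 > w[1])  # the learned weights separate the toy classes -> True
```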

27 For an accessible introduction, see the chapter on "Generative and Discriminative Classifiers: Naive Bayes and Logistic Regression" provided at the following link: http://www.cs.cmu.edu/~tom/NewChapters.html


NB (Ng and Jordan 2002) relies on the use of Bayes' rule, which expresses the posterior P(Y|X) as the product of the likelihood P(X|Y) and the prior P(Y), divided by the evidence P(X). The naivety of the approach originates from assuming conditional independence of the features. We impose a multivariate event model regarding the distribution P(X|Y). Junque de Fortuny et al (2014a) provide an efficient NB implementation for dealing with large and sparse binary behaviour data.
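A multivariate(-Bernoulli) event model scores a sparse binary instance as sketched below (our own toy illustration with Laplace smoothing, not the implementation of Junque de Fortuny et al (2014a)):

```python
import math

def train_nb(rows, labels):
    """rows: list of sets of active feature ids; returns per-class statistics."""
    model = {}
    for c in set(labels):
        idx = [r for r, l in zip(rows, labels) if l == c]
        counts = {}
        for r in idx:
            for f in r:
                counts[f] = counts.get(f, 0) + 1
        model[c] = (len(idx) / len(rows), counts, len(idx))
    return model

def log_score(model, row, c, n_features):
    prior, counts, n = model[c]
    s = math.log(prior)
    for f in range(n_features):
        p = (counts.get(f, 0) + 1) / (n + 2)   # Laplace-smoothed Bernoulli
        s += math.log(p if f in row else 1.0 - p)
    return s

rows = [{0, 1}, {0}, {2}, {2, 3}]
labels = [1, 1, 0, 0]
m = train_nb(rows, labels)
print(log_score(m, {0, 1}, 1, 4) > log_score(m, {0, 1}, 0, 4))  # -> True
```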

The BeSim approach conceptually corresponds to projecting the bigraph to a unigraph using the procedure outlined in Section 3.2. Next, the weighted-vote relational neighbour classifier (Macskassy and Provost 2007) is used to infer unknown labels/scores through a weighted probability estimation using the known labels of neighbouring nodes. A very fast and scalable implementation of this approach (the SW-transformation) is presented in detail in Stankova et al (2015).
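The weighted-vote step can be sketched as follows (a minimal stand-in; the SW-transformation of Stankova et al (2015) computes this kind of score far more efficiently):

```python
def wvrn_score(neighbours, labels):
    """Weighted-vote relational neighbour score for one node.

    neighbours: {node: edge weight in the projected unigraph}
    labels:     {node: 0/1} for the subset of nodes with known labels
    """
    known = [(w, labels[n]) for n, w in neighbours.items() if n in labels]
    total = sum(w for w, _ in known)
    if total == 0:
        return 0.5              # no labelled neighbours: uninformative prior
    return sum(w * l for w, l in known) / total

# A node with three labelled neighbours; the heavier neighbour "a" pulls
# the estimate towards class 1:
print(wvrn_score({"a": 2.0, "b": 1.0, "c": 1.0}, {"a": 1, "b": 0, "c": 1}))  # -> 0.75
```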

In Table 10 we compare the performances of the baseline (BL) application of the methods with their EE-version (with S = 15, T = 20). It is quite striking to see the EE-version dominate the baseline version when looking at each method individually. This suggests that the imbalanced learning issue has an impact on each of the aforementioned techniques. Focussing on the regularization based approaches, we observe the baseline LR to be less influenced by the imbalance than the SVM method. Their EE-versions (EE SVM and EE LR) can boost the baseline performance in almost all cases and have the highest average ranks amongst all the proposed methods. As already emphasized, the regularization based approaches offer an added element of flexibility in the sense that the strength of the underlying base learner can be controlled.

The heuristic approaches NB and BeSim do not offer this flexibility. Furthermore, it is not directly possible to integrate instance weights during the training phase of the base learner. During the boosting process, we therefore sample from the distribution Dt to generate (unweighted) examples and train a BeSim/NB learner next. The results in Table 10 indicate BL NB to be the worst performing algorithm amongst all methods tried. Due to the conditional independence assumption, one could regard this as a weak learner that is therefore suitable to be used in a boosting process. Indeed, the EE NB approach is able to significantly improve upon its performance. We found this to be the slowest EE-version28 (e.g. taking more than 24 hours to process a single fold of the Bank dataset, versus several minutes for the other versions). The average rank estimate of EE NB is rather pessimistic, since we assigned the lowest performance across all methods in case of missing results. BeSim can be considered a stronger learner compared to NB; its possible boost in performance is typically lower than that of NB. Having said this, we found EE BeSim to improve upon BL BeSim for all datasets (except Adver(p = 1)).
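The resampling trick mentioned above can be sketched as follows (hypothetical helper): draw an unweighted training set from the boosting distribution Dt, so that a learner without native instance-weight support can still take part in the boosting process:

```python
import random

def resample_from_dt(instances, dt, size, seed=0):
    """Sample an unweighted training set according to boosting weights dt."""
    rng = random.Random(seed)
    return rng.choices(instances, weights=dt, k=size)

instances = ["x1", "x2", "x3"]
dt = [0.7, 0.2, 0.1]                 # Dt after some boosting iteration t
sample = resample_from_dt(instances, dt, size=1000)
# Hard-to-learn instances (high Dt weight) dominate the resampled set:
print(sample.count("x1") > sample.count("x2") > sample.count("x3"))  # -> True
```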

28 This statement only relates to the EE-version. In the baseline case, the timings of the NB-version are roughly of the same order of magnitude as SVM and LR (though this heavily depends on the chosen regularization parameter C). With EE, each subset has a size limited to twice the minority class training size (a relatively low number compared to the size of the majority class). This works in favour of SVM/LR (e.g. SVM has a computational complexity of O(m²), with m the number of training instances). Also note that the NB-version is indeed optimized for sparse data. Internally, the NB-implementation writes a fairly large model file to a specified directory (sizes of 1 GB can be encountered there). This writing operation slows down the training stage (especially in case of EE, where S × T = 15 × 20 = 300 model files are constructed).


Table 10 Average tenfold AUC test set performance of the BL compared to its EE-version (S = 15, T = 20) for SVM, LR, BeSim and NB. In case the EE NB implementation crashed, we put a × in the table; if it is too slow (> 24 hours per fold), we place a []. Additionally, average ranks are assigned to each method. In case EE NB has missing results for a particular dataset, we ensure it has the lowest rank across all other algorithms.

          Mov G(p = 1)  Mov G(p = 25)  Mov Th(p = [])  Yahoo A(p = 1)
BL SVM    71.6 (2.62)   81.41 (1.32)   79.77 (5.33)    56.49 (3.37)
EE SVM    76.12 (2.88)  85.13 (1.86)   86.43 (5.86)    59.74 (2.96)
BL LR     71.02 (2.09)  84.39 (1.84)   83.14 (4.17)    57.84 (2.39)
EE LR     76.69 (2.92)  85.03 (1.98)   86.3 (5.37)     59.79 (2.62)
BL BeSim  76.1 (3.58)   81.3 (2.92)    82.81 (6.6)     56.27 (2.73)
EE BeSim  76.31 (3.71)  81.37 (2.9)    85.02 (6.28)    57.7 (1.71)
BL NB     70.26 (5.84)  77.01 (2.54)   70.48 (10.14)   52.56 (2.09)
EE NB     75.93 (2.83)  85.56 (2.01)   86.91 (4.15)    57.55 (2.73)

          Yahoo A(p = 25)  Yahoo G(p = 1)  Yahoo G(p = 25)  TaFeng(p = 1)
BL SVM    61.61 (2.48)     66.84 (3.66)    78.82 (1.39)     55.75 (1.6)
EE SVM    66.38 (3.16)     73.48 (2.32)    80.55 (1.55)     61.13 (1.83)
BL LR     66.27 (2.96)     69.82 (1.93)    80.45 (1.59)     58.91 (2.31)
EE LR     66.22 (3.28)     73.08 (2.14)    80.53 (1.56)     61.43 (2.32)
BL BeSim  64.54 (2.02)     68.89 (2.49)    79.55 (1.96)     57.89 (1.18)
EE BeSim  65.25 (2.23)     71.18 (2.91)    80.04 (1.85)     59.36 (1.47)
BL NB     65 (1.65)        63.33 (2.56)    78.89 (1.64)     54.61 (1.2)
EE NB     66.6 (2.79)      70.99 (2.88)    81.01 (1.3)      59.01 (1.84)

          TaFeng(p = 25)  Book(p = 1)   Book(p = 25)  LST(p = 1)
BL SVM    66.94 (1.34)    52.6 (1.29)   60.08 (0.71)  99.99 (0.01)
EE SVM    70.4 (1.3)      55.34 (1.28)  65.4 (0.51)   99.98 (0.01)
BL LR     69.24 (1.3)     55.34 (1.27)  63.84 (0.75)  99.99 (0.01)
EE LR     70.28 (1.28)    55.49 (1.49)  65.41 (0.63)  99.97 (0.02)
BL BeSim  67.49 (1.23)    55.19 (1.27)  63.7 (0.63)   99.99 (0.01)
EE BeSim  68 (1.21)       55.21 (1.15)  64.38 (0.42)  99.99 (0)
BL NB     65.21 (1.64)    52.93 (0.9)   59.75 (0.47)  98.69 (0.3)
EE NB     70.72 (1.15)    ×             63.46 (0.61)  99.92 (0.04)

          Adver(p = [])  Adver(p = 1)  CRF(p = [])    Bank(p = [])
BL SVM    96.37 (1.94)   91.18 (2.97)  64.36 (18.97)  66.82 (0.88)
EE SVM    97.63 (1.35)   93.3 (2.14)   86.35 (9.99)   71.54 (0.76)
BL LR     97.19 (1.44)   88.51 (1.93)  81.87 (19.63)  71.43 (0.72)
EE LR     97.57 (0.96)   93.02 (2.06)  86.84 (9.62)   71.77 (0.62)
BL BeSim  97.26 (1.12)   95.38 (1.35)  86.91 (9.36)   67.85 (0.67)
EE BeSim  97.38 (1.04)   93.83 (1.35)  87.02 (10.43)  70.41 (0.55)
BL NB     93.75 (1.9)    93.37 (1.9)   87.24 (9.38)   67.83 (0.63)
EE NB     94.04 (1.75)   ×             ×              []

          Flickr(p = 0.1)  Kdd(p = 0.5)  Average Rank
BL SVM    74.92 (0.17)     74.53 (0.05)  6.44 [7]
EE SVM    79.86 (0.13)     80.98 (0.05)  2.39 [1]
BL LR     79.03 (0.11)     81.29 (0.04)  4.28 [4]
EE LR     79.85 (0.13)     80.75 (0.05)  2.61 [2]
BL BeSim  74.62 (0.13)     74.95 (0)     5.11 [6]
EE BeSim  76.4 (0.13)      77.55 (0.03)  3.61 [3]
BL NB     81.36 (0.1)      74.29 (0.05)  6.5 [8]
EE NB     []               []            5.06 [5]


6 Conclusions and future research directions

We investigated the effects of sampling and cost-sensitive learning techniques on the problem of learning from imbalanced behaviour data. This type of big data has fundamentally different properties and internal structure than traditional data and is characterized by sparseness and very large dimensions. This is a new research topic which enables benefits across a wide variety of application domains, such as fraud detection, targeted advertising, predictive policing, churn prediction, default prediction etc. To enable future developments in this field, we provide an on-line repository containing implementations, datasets and results, available at the following link: http://www.applieddatamining.com/cms/?q=software

Oversampling techniques need to be adapted in order to cope with this kind of data. We proposed new versions of the traditional SMOTE and ADASYN techniques, which differ in the way synthetic samples are generated and similarity computations are performed. The plain random oversampling (OSR) and the synthetic approaches are competitive methods in terms of AUC and do not seem to suffer from overfitting, as traditional studies report. Though the synthetic approaches have comparable AUC-performances in comparison to OSR, they are useless in practice due to their computationally expensive procedures. The OSR method is very fast for small and medium sized datasets, yet becomes inappropriate for larger datasets because the training times of the underlying base learner increase drastically.

With respect to undersampling methodologies, we employed a random undersampling technique (RUS) and Knn-based informed approaches. The latter type selects majority class instances that are closest (Cl Knn) or farthest (Far Knn) from the minority class examples and allows one to investigate the effect of noise/outliers on SVM performance. We observed that these odd instances have a severe performance degrading effect. RUS and Far Knn approaches can significantly improve performance with respect to the baseline thanks to their abilities to reduce noise/outlier levels and to put more emphasis on the minority class. Having said this, they usually perform worse than any of the oversampling approaches. This is in sheer contrast to the results obtained from traditional studies and is heavily related to the intrinsic nature of behaviour data. Timing-wise, RUS is the fastest method; the informed approaches are slower than RUS due to the required nearest neighbour computations. Additionally, a cluster-based undersampling (CBU) technique was proposed to alleviate the within-class imbalance. For low undersampling rates, the performance of CBU is superior to plain RUS, whereas for high undersampling rates the opposite is true. Overall, no substantial gain in performance is obtained in comparison with a simple RUS approach. Furthermore, the method is the slowest among all techniques considered and relies on computationally expensive clustering heuristics.

We studied the effects of boosting, where we made use of a very specific linear SVM and logistic regression combination that uses confidence rated predictions instead of a more popular plain "discrete" boosting algorithm (with weak hypotheses having values in {−1, +1}). Our experiments clearly indicated the regularization constant C in the SVM formulation to act as a "weakness" indicator. Indeed, higher C-values cause stronger learners and shouldn't be used in the boosting process, which matches the analysis of studies dealing with traditional data. Strong learners can


already quite accurately distinguish majority class from minority class behaviours. The boosting process will focus on the hard to learn instances, among which are noise/outliers, and this can result in overfitting behaviour. We also observed that boosting can usually outperform the plain baseline, indicating that a combination of weak learners can achieve higher AUC-performance than a single strong learner.

EasyEnsemble (EE) integrates random undersampling with a boosting process. By sampling several balanced subsets from the training data and combining these, we can emphasize the minority class, achieve noise/outlier reduction effects and simultaneously explore the majority class space efficiently. Our version differs from the original formulation because we employ a different type of underlying weak learner (SVM and LR combination) that is fed to a confidence rated boosting algorithm (instead of the discrete version). We observed superior AUC-performance with respect to all of the aforementioned techniques. Furthermore, the method is very fast, since each subset can be trained in parallel and has a size that is twice as large as the minority class training set. We studied the effect of the number of subsets and noted that a limited amount of these (O(10)) is already sufficient to obtain satisfactory performances; increasing this number further brings only minor benefits.

Additionally, a statistical comparison of each of the proposed methods was conducted across a wide variety of datasets. Holm's significance test indicates that each of the proposed techniques (oversampling, undersampling except Cl Knn, cost-sensitive learning and boosting variants) can significantly outperform the plain baseline performance (each of the null-hypotheses of equal performance gets rejected with a confidence level of at least 91%). Comparison of the techniques with EE showed that the latter can significantly outperform all of them (each of the null-hypotheses of equal performance gets rejected with a confidence level of at least 75%). A timing-wise comparison showed EE to be the preferred method (after RUS) for large and highly imbalanced datasets. A parallel implementation would make EE the fastest method even for medium sized datasets that show a high level of imbalance.

The influence of the chosen base learner was investigated by comparing the baseline to its EE-version for the SVM, LR, NB and BeSim inducers. Each of the EE-versions is able to improve upon the baseline performance for nearly all datasets, where base learners of the regularization type are superior because of their abilities to control the strength of the learner.

Though we conducted an extensive benchmark study, there are still plenty of future research directions. First and most important, there is an urgent need to develop a more general theoretical framework to completely understand the effect of imbalance and why certain methods outperform others. Secondly, other types of mechanisms could be investigated in this setting: advanced sampling (Guo et al 2008) set-ups perform sampling based on preliminary classifications; one-class learning is a recognition based methodology where a discriminative boundary is learned around the training examples of a single class alone, and Raskutti and Kowalczyk (2004) demonstrated the superiority of the recognition based one-class SVMs over the discriminative two-class SVMs; other types of ensemble learning strategies (e.g. bagging and stacking) could be investigated; Batista et al (2004) note that combining over- and undersampling can be beneficial when data are highly imbalanced or show very few minority class instances; the ideas behind Extended Nearest Neighbour (Tang and


He 2015) could guide us in the sampling process. As opposed to Knn, "it considers not only who are the nearest neighbours of a certain sample, but also who consider the particular sample as their nearest neighbour" (Tang and He 2015). The method assigns a class membership to an unknown sample that maximizes the intra-class coherence. Finally, more advanced techniques exist that can provide the K nearest neighbours faster or with (slightly) better performance.

Another issue relates to the lack of comprehensibility of the EasyEnsemble technique in its present form. Because we currently integrate Logistic Regression in the boosting process, the final ensemble classifier is non-linear and therefore incomprehensible. It might be interesting to investigate the numerical boosting technique proposed in Schapire and Singer (1999), where each weak learner is allowed to output real values (instead of values in [−1, 1]). In that case we would be able to use a plain linear SVM as a weak learner, without the Logistic Regression component. The final ensemble would be a linear model, which is surely easier to comprehend. Also, the instance-based approach of Martens and Provost (2014) could be used to explain individual predictions of the non-linear EE-technique.

A Oversampling experiments

Table 11 Oversampling experiments. Results show the average tenfold AUC test set performance with respect to increasing β-values (β = [β1, β2, β3, β4] = [0, 1/3, 2/3, 1]). Additionally, standard deviations are shown between brackets. Optimal performances for each method are highlighted in boldface. See Section 4.2 for parameter settings.

MOV G(p = 1)
        β1            β2            β3            β4
OSR     71.6 (2.62)   74.37 (2.04)  73.6 (1.84)   74.73 (2.45)
SMOTE   71.6 (2.62)   75.08 (2.18)  76.02 (2.14)  76.48 (2.3)
ADASYN  71.6 (2.62)   75.16 (1.92)  75.93 (2.08)  76.47 (2.29)

MOV G(p = 25)
        β1            β2            β3            β4
OSR     81.41 (1.32)  83.49 (1.81)  83.84 (1.96)  83.91 (2.04)
SMOTE   81.41 (1.32)  83.32 (1.97)  83.59 (2.04)  83.76 (2.11)
ADASYN  81.41 (1.32)  83.61 (1.82)  84.02 (1.97)  83.69 (1.96)

Mov Th(p = [])
        β1            β2            β3            β4
OSR     79.77 (5.33)  85.3 (4.66)   83.16 (4.5)   84.59 (5.69)
SMOTE   79.77 (5.33)  84.18 (6.51)  85.58 (5.97)  84.33 (5.75)
ADASYN  79.77 (5.33)  84.11 (6.77)  85.86 (6.06)  85.36 (5.13)

Yahoo A(p = 1)
        β1            β2            β3            β4
OSR     55.92 (2.97)  58.66 (3.27)  59.99 (2.28)  59.74 (1.78)
SMOTE   55.92 (2.97)  59.76 (2.62)  59.74 (2.67)  59.43 (2.4)
ADASYN  55.92 (2.97)  59.54 (2.53)  59.55 (2.94)  59.56 (2.22)

Yahoo A(p = 25)
        β1            β2            β3            β4
OSR     61.68 (2.42)  64.19 (3.17)  65.08 (3.26)  64.67 (2.1)
SMOTE   61.68 (2.42)  65.46 (3.63)  65.33 (3.23)  64.52 (2.98)
ADASYN  61.68 (2.42)  65.04 (3.74)  65.41 (3.47)  64.4 (2.21)


Table 11 continued

Yahoo G(p = 1)
        β1            β2            β3            β4
OSR     66.84 (3.66)  72.18 (2.36)  73.11 (2.7)   72.49 (3.41)
SMOTE   66.84 (3.66)  72.65 (2.85)  73.27 (3.36)  73.37 (3.56)
ADASYN  66.84 (3.66)  72.87 (2.83)  73.18 (3.2)   73.39 (3.59)

Yahoo G(p = 25)
        β1            β2            β3            β4
OSR     78.82 (1.39)  78.78 (2.02)  78.74 (1.86)  78.35 (1.47)
SMOTE   78.82 (1.39)  79.23 (1.57)  79.1 (1.2)    79.03 (1.89)
ADASYN  78.82 (1.39)  79.12 (1.43)  79.23 (1.37)  79.51 (2.01)

TaFeng(p = 1)
        β1            β2            β3            β4
OSR     55.75 (1.6)   59.23 (1.96)  60 (1.68)     61.04 (2.36)
SMOTE   55.75 (1.6)   60.26 (1.95)  61.49 (1.8)   61.13 (1.52)
ADASYN  55.75 (1.6)   60.26 (1.9)   61.44 (1.85)  61.16 (1.5)

TaFeng(p = 25)
        β1            β2            β3            β4
OSR     66.94 (1.34)  68.84 (1.21)  66.99 (1.42)  67.7 (1.41)
SMOTE   66.94 (1.34)  68.47 (1.5)   67.07 (1.15)  66.65 (0.81)
ADASYN  66.94 (1.34)  68.62 (1.38)  67.85 (1.6)   66.91 (1.39)

Book(p = 1)
        β1            β2            β3            β4
OSR     52.6 (1.29)   53.61 (0.94)  55.41 (1.75)  55.87 (1.44)
SMOTE   52.6 (1.29)   54.77 (0.99)  54.91 (0.8)   54.36 (0.98)
ADASYN  52.6 (1.29)   54.86 (1.13)  55.06 (0.73)  54.54 (0.92)

Book(p = 25)
        β1            β2            β3            β4
OSR     60.08 (0.71)  61.05 (0.94)  61.82 (1.12)  64.62 (0.57)
SMOTE   60.08 (0.71)  62.6 (0.73)   60.95 (0.68)  63 (0.8)
ADASYN  60.08 (0.71)  62.33 (0.73)  60.77 (0.85)  63.04 (0.58)

LST(p = 1)
        β1            β2            β3            β4
OSR     99.99 (0.01)  99.99 (0.01)  99.99 (0.01)  99.99 (0.01)
SMOTE   99.99 (0.01)  99.99 (0.01)  99.99 (0.01)  99.99 (0.01)
ADASYN  99.99 (0.01)  99.99 (0.01)  99.99 (0.01)  99.99 (0.01)

Adver(p = [])
        β1            β2            β3            β4
OSR     96.61 (1.82)  97.31 (1.65)  97.07 (1.84)  97.07 (1.79)
SMOTE   96.61 (1.82)  96.91 (1.66)  97.19 (1.65)  97.07 (1.91)
ADASYN  96.61 (1.82)  97.1 (1.7)    97.08 (1.87)  97.07 (1.88)

Adver(p = 1)
        β1            β2            β3            β4
OSR     90.93 (3.02)  91.27 (3.03)  92.66 (2.82)  93.29 (1.97)
SMOTE   90.93 (3.02)  92.51 (2.03)  92.96 (2.14)  93.53 (1.81)
ADASYN  90.93 (3.02)  92.22 (2.33)  92.7 (2.36)   93.88 (1.73)

CRF(p = [])
        β1             β2             β3             β4
OSR     64.06 (16.43)  80.82 (12.94)  81.28 (12.27)  81.91 (11.28)
SMOTE   64.06 (16.43)  78.64 (16.86)  82.52 (13.74)  79.32 (16.26)
ADASYN  64.06 (16.43)  78.95 (16.72)  81.19 (16.32)  79.31 (16.19)

Bank(p = [])
        β1            β2            β3            β4
OSR     66.82 (0.88)  70.1 (0.74)   71.39 (0.8)   71.47 (0.8)
SMOTE   []            []            []            []
ADASYN  []            []            []            []


B Undersampling experiments

Table 12 Undersampling experiments. Results show the average tenfold AUC test set performance with respect to increasing βu-values (βu = [βu1, βu2, βu3, βu4, βu5] = [0, 1/4, 1/2, 3/4, 1]). Optimal performances (boldface) are highlighted for each imbalance ratio. Additionally, standard deviations are shown between brackets. CL K represents the "Closest Knn" technique; CL T represents the "Closest tot sim" technique (similar for Far K and Far T; see Section 3.2). Parameter settings are given in Section 4.2.

Mov G(p = 1)
       βu1         βu2          βu3          βu4          βu5
RUS    71.6(2.6)   71.83(2.6)   72.54(2.5)   72.39(3.1)   70.61(3.5)
Cl K   71.6(2.6)   71.4(2)      70.96(1.9)   70.43(2.4)   69.05(4.1)
CL T   71.6(2.6)   70.28(2.5)   66.74(2)     66.8(2.1)    68.18(3.6)
Far K  71.6(2.6)   72.36(2.7)   71.26(3.4)   66.57(5.2)   53.5(3.5)
Far T  71.6(2.6)   72.22(2.8)   71.63(3.6)   64.28(5.3)   50.88(4.4)
CBU    72.55(2.6)  73.28(2.6)   73.12(2.6)   73.84(2.5)   73(3.1)

Mov G(p = 25)
       βu1         βu2          βu3          βu4          βu5
RUS    81.41(1.3)  81.36(1.3)   81.78(1.7)   82.05(1.7)   81.6(2.1)
Cl K   81.41(1.3)  80.86(1.2)   80.95(1.6)   79.73(2.3)   77.95(2.3)
CL T   81.41(1.3)  79.9(1.2)    78.21(1.4)   77.87(1.5)   77.76(2.3)
Far K  81.41(1.3)  80.9(1.5)    78.17(1.8)   74.25(2.4)   69.79(3.2)
Far T  81.41(1.3)  80.86(1.5)   77.2(2.4)    71.16(2.7)   62.4(2.8)
CBU    81.53(1.4)  81.64(1.3)   81.29(1.6)   81.28(2.1)   80.34(2.7)

Mov Th(p = [])
       βu1         βu2          βu3          βu4          βu5
RUS    79.77(5.3)  80.32(5.8)   81.57(5.5)   81.86(6.6)   81.26(6.2)
Cl K   79.77(5.3)  79.25(4.5)   78.07(5)     76.25(6.5)   62.46(8.5)
CL T   79.77(5.3)  78.4(4.4)    72.41(3.5)   64.66(4.5)   60.37(7.3)
Far K  79.77(5.3)  84.54(5)     83.64(6.4)   80.02(7.3)   56.82(10.3)
Far T  79.77(5.3)  85.03(5.7)   82.68(6.8)   75.61(9.2)   56.77(10.9)
CBU    80.11(5.8)  81.17(6)     81.08(6.5)   84.17(5.1)   80.96(6.9)

Yahoo A(p = 1)
       βu1         βu2          βu3          βu4          βu5
RUS    55.92(3)    55.57(3.4)   56.44(3)     55.83(3.4)   56.37(3.3)
Cl K   55.92(3)    55.67(2.4)   53.12(2)     50.57(1.8)   53.79(3.5)
CL T   55.92(3)    55.69(2.1)   53.35(2.2)   50.31(2.2)   52.35(3.3)
Far K  55.92(3)    57.35(2.2)   56.92(1.1)   56.95(2.3)   51.18(2)
Far T  55.92(3)    56.93(2.4)   54.74(1.9)   57.01(1.8)   51.18(2)
CBU    58.21(2.6)  58.45(3.3)   58.31(3.5)   58.39(3.5)   56.09(2.6)

Yahoo A(p = 25)
       βu1         βu2          βu3          βu4          βu5
RUS    61.68(2.4)  62.9(2.9)    63.62(3.6)   63.75(3.1)   63.19(1.9)
Cl K   61.68(2.4)  61.14(2.1)   57.62(1.6)   54.02(1.8)   51.48(1.4)
CL T   61.68(2.4)  60.89(2.8)   58.11(1.4)   54.4(2.1)    51.76(1.4)
Far K  61.68(2.4)  63.96(3)     62.62(2.2)   59.61(1.5)   56.25(1.6)
Far T  61.68(2.4)  63.71(2.4)   59.72(1.6)   57.27(1.1)   54.47(1.1)
CBU    62.46(2.6)  61.85(1.4)   61.78(2.2)   59.94(3)     60.1(4)

46 Jellis Vanhoeyveld David Martens

Table 12 continued

Yahoo G(p = 1) βu1          βu2           βu3           βu4           βu5
RUS    66.84 (3.7)  67.85 (3.2)   68.36 (3.2)   68.23 (4)     69.9 (4.2)
Cl K   66.84 (3.7)  66.71 (2.8)   64.3 (3.6)    61.98 (3.9)   61.15 (1.9)
Cl T   66.84 (3.7)  65.79 (2.7)   63.55 (3.3)   59.21 (3.5)   61.08 (2.4)
Far K  66.84 (3.7)  66.76 (4.1)   63.84 (3.4)   65.16 (2)     48.5 (2.9)
Far T  66.84 (3.7)  66.95 (4.1)   63.48 (2.9)   65.16 (2)     48.48 (2.9)
CBU    69.68 (4.1)  70.59 (3.2)   70.64 (3.7)   70.2 (2.9)    63.35 (3.6)

Yahoo G(p = 25) βu1         βu2           βu3           βu4           βu5
RUS    78.82 (1.4)  78.91 (1.6)   78.97 (1.6)   78.61 (1.6)   77.82 (2.1)
Cl K   78.82 (1.4)  77.26 (1.5)   72.52 (1.5)   67.86 (2)     65.07 (2.7)
Cl T   78.82 (1.4)  76.83 (1)     71.99 (1.8)   67.15 (2.3)   61.1 (2.7)
Far K  78.82 (1.4)  78.26 (2.2)   74.69 (2.7)   67.22 (2.1)   60.72 (2.3)
Far T  78.82 (1.4)  77.68 (2.6)   72.44 (3)     64.94 (2.4)   59.6 (2)
CBU    75.25 (3.2)  75.22 (2.4)   74.69 (2.3)   73.07 (2.4)   70.69 (2.4)

TaFeng(p = 1)  βu1          βu2           βu3           βu4           βu5
RUS    55.75 (1.6)  56.1 (1.6)    56.26 (1.7)   57.23 (1.7)   59.25 (2.2)
Cl K   55.75 (1.6)  55.68 (1.6)   55.58 (1.5)   55.08 (1.1)   51.05 (1.5)
Cl T   55.75 (1.6)  55.67 (1.6)   54.47 (1.6)   47.53 (1.6)   49.3 (1.1)
Far K  55.75 (1.6)  58.99 (1.2)   59.47 (1.1)   60.04 (1.2)   56.31 (1)
Far T  55.75 (1.6)  58.92 (1.3)   59.25 (1.3)   58.58 (1.1)   56.31 (1)
CBU    57.8 (1)     58.47 (1.1)   58.15 (0.9)   58.87 (1.4)   57.65 (1.6)

TaFeng(p = 25) βu1          βu2           βu3           βu4           βu5
RUS    66.94 (1.3)  67.44 (1.3)   68.1 (1.4)    68.27 (1.4)   66.13 (1.2)
Cl K   66.94 (1.3)  66.13 (1.4)   63.39 (1.2)   59.83 (1.3)   56.94 (0.7)
Cl T   66.94 (1.3)  66.38 (1.5)   62.89 (1.6)   57.46 (1.3)   54.56 (1.3)
Far K  66.94 (1.3)  68.06 (1.4)   66.43 (1.6)   64.46 (1.5)   63.35 (1.3)
Far T  66.94 (1.3)  64.31 (1.1)   62.69 (1)     61.27 (1.1)   59.03 (1)
CBU    64.81 (1.2)  64.15 (1.1)   64.13 (1.2)   63.88 (0.8)   63.46 (0.8)

Book(p = 1)    βu1          βu2           βu3           βu4           βu5
RUS    52.6 (1.3)   52.79 (0.9)   53.46 (0.8)   53.89 (0.9)   54.05 (0.9)
Cl K   52.6 (1.3)   52.56 (1.2)   52.52 (1.3)   52.39 (1.1)   53.09 (1.1)
Cl T   52.6 (1.3)   52.56 (1.2)   52.52 (1.3)   52.39 (1.1)   53.05 (0.7)
Far K  52.6 (1.3)   55.21 (1.2)   56.21 (1.8)   56.14 (1.2)   53.06 (1)
Far T  52.6 (1.3)   55.21 (1.2)   56.21 (1.8)   56.14 (1.2)   53.06 (1)
CBU    54.28 (0.9)  53.77 (1)     53.33 (1.1)   53.34 (0.9)   52.84 (0.8)

Book(p = 25)   βu1          βu2           βu3           βu4           βu5
RUS    60.08 (0.7)  60.13 (0.6)   60.4 (0.8)    60.33 (0.8)   63.28 (0.8)
Cl K   60.08 (0.7)  59.96 (0.7)   60.13 (0.8)   59.96 (1)     59.28 (0.7)
Cl T   60.08 (0.7)  59.96 (0.7)   60.13 (0.8)   60.29 (0.4)   54.5 (0.9)
Far K  60.08 (0.7)  63.29 (1)     64.19 (0.8)   57.3 (1.1)    55.66 (1.1)
Far T  60.08 (0.7)  62.14 (0.5)   58.27 (0.6)   56.37 (1)     55.66 (1.1)
CBU    54.82 (0.9)  54.67 (0.9)   54.71 (0.9)   54.66 (1)     54.78 (0.9)


Table 12 continued

LST(p = 1)     βu1          βu2           βu3           βu4           βu5
RUS    99.99 (0)    99.99 (0)     99.99 (0)     99.98 (0)     99.99 (0)
Cl K   99.99 (0)    99.99 (0)     99.99 (0)     99.99 (0)     99.99 (0)
Cl T   99.99 (0)    99.99 (0)     99.99 (0)     99.99 (0)     99.98 (0)
Far K  99.99 (0)    99.98 (0)     99.98 (0)     99.98 (0)     99.98 (0)
Far T  99.99 (0)    99.98 (0)     99.98 (0)     99.98 (0)     99.98 (0)
CBU    []           []            []            []            []

Adver(p = [])  βu1          βu2           βu3           βu4           βu5
RUS    96.61 (1.8)  96.32 (1.8)   96.63 (1.4)   97.12 (2.1)   96.22 (1.6)
Cl K   96.61 (1.8)  96.44 (1.5)   96.14 (1.5)   96.04 (2)     94.8 (2.5)
Cl T   96.61 (1.8)  95.87 (2.1)   94.32 (1.9)   93.01 (2.2)   90.72 (2.3)
Far K  96.61 (1.8)  96.53 (1.4)   95.76 (2)     94.39 (1.8)   90.49 (3.1)
Far T  96.61 (1.8)  96.54 (1.5)   95.67 (1.9)   94.54 (1.8)   89.3 (2.8)
CBU    96.85 (2.3)  96.85 (2.3)   97.05 (1.5)   96.6 (1.6)    96.06 (2.1)

Adver(p = 1)   βu1          βu2           βu3           βu4           βu5
RUS    90.93 (3)    91.53 (3.1)   92.37 (3.4)   91.9 (2.9)    91.93 (2.2)
Cl K   90.93 (3)    90.64 (3)     89.87 (3.9)   90.21 (3.6)   89.18 (2)
Cl T   90.93 (3)    89.7 (3.5)    88.55 (3.4)   85.76 (3.3)   88.2 (2.3)
Far K  90.93 (3)    93.8 (2.3)    92.4 (2.6)    88.73 (3.4)   85.51 (4)
Far T  90.93 (3)    93.62 (2.4)   93.2 (2.2)    88.41 (3.6)   85.51 (4)
CBU    93.22 (2.4)  93.76 (2.5)   93.89 (2.6)   93.52 (2.7)   91.27 (2)

CRF(p = [])    βu1           βu2           βu3           βu4           βu5
RUS    64.06 (16.4)  63.28 (15.9)  67.98 (17.4)  66.95 (21.9)  87.73 (8.8)
Cl K   64.06 (16.4)  62.44 (16.6)  62.34 (16.9)  71.37 (13.8)  78.22 (17.7)
Cl T   64.06 (16.4)  62.44 (16.6)  62.34 (16.9)  71.37 (13.8)  62.67 (22.9)
Far K  64.06 (16.4)  83.8 (14.2)   83.93 (14.8)  84.49 (13.7)  86.11 (9.7)
Far T  64.06 (16.4)  83.8 (14.2)   83.93 (14.8)  84.49 (13.7)  86.11 (9.7)
CBU    []            []            []            []            []

Bank(p = [])   βu1          βu2           βu3           βu4           βu5
RUS    66.82 (0.9)  67.02 (0.9)   67.37 (0.8)   67.99 (0.6)   69.5 (1)
Cl K   66.82 (0.9)  66.17 (0.7)   65.24 (0.6)   64.86 (0.6)   58.53 (1.1)
Cl T   66.82 (0.9)  64.92 (1.1)   60.69 (0.9)   56.33 (0.8)   52.87 (0.7)
Far K  66.82 (0.9)  66.95 (0.6)   66.19 (0.6)   64.42 (0.6)   58.25 (1.1)
Far T  66.82 (0.9)  67.16 (0.6)   64.2 (0.8)    59.67 (1)     58.25 (1.1)
CBU    []           []            []            []            []


C Boosting experiments

This section presents the results of each of the boosting experiments from Section 4.5 on the remaining data sources. Each of the figures below depicts the average tenfold AUC performance on test data (with µ = 100) with respect to the number of boosting iterations T, for (left) Adaboost (AB), AdaCost (AC) and EasyEnsemble (EE), with C chosen according to the highest validation set AUC performance (over all possible boosting rounds), and (right) AB and EE (with S = 15) with varying C-levels.

[Figure: test AUC versus boosting rounds T. Panel (a): AB, AC (R2, R8, RD), EE (S = 5, 10, 15) and BL. Panel (b): AB and EE at C = 1e-07, 1e-05, 0.001, 0.1, and BL.]

Fig. 6 Mov G(p = 1) dataset

[Figure: test AUC versus boosting rounds T. Panel (a): AB, AC (R2, R8, RD), EE (S = 5, 10, 15) and BL. Panel (b): AB and EE at C = 1e-07, 1e-05, 0.001, 0.1, and BL.]

Fig. 7 Mov Th(p = []) dataset


[Figure: test AUC versus boosting rounds T. Panel (a): AB, AC (R2, R8, RD), EE (S = 5, 10, 15) and BL. Panel (b): AB and EE at C = 1e-07, 1e-05, 0.001, 0.1, and BL.]

Fig. 8 Yahoo A(p = 1) dataset

[Figure: test AUC versus boosting rounds T. Panel (a): AB, AC (R2, R8, RD), EE (S = 5, 10, 15) and BL. Panel (b): AB and EE at C = 1e-07, 1e-05, 0.001, 0.1, and BL.]

Fig. 9 Yahoo A(p = 25) dataset

[Figure: test AUC versus boosting rounds T. Panel (a): AB, AC (R2, R8, RD), EE (S = 5, 10, 15) and BL. Panel (b): AB and EE at C = 1e-07, 1e-05, 0.001, 0.1, and BL.]

Fig. 10 Yahoo G(p = 1) dataset


[Figure: test AUC versus boosting rounds T. Panel (a): AB, AC (R2, R8, RD), EE (S = 5, 10, 15) and BL. Panel (b): AB and EE at C = 1e-07, 1e-05, 0.001, 0.1, and BL.]

Fig. 11 Yahoo G(p = 25) dataset

[Figure: test AUC versus boosting rounds T. Panel (a): AB, AC (R2, R8, RD), EE (S = 5, 10, 15) and BL. Panel (b): AB and EE at C = 1e-07, 1e-05, 0.001, 0.1, and BL.]

Fig. 12 TaFeng(p = 1) dataset

[Figure: test AUC versus boosting rounds T. Panel (a): AB, AC (R2, R8, RD), EE (S = 5, 10, 15) and BL. Panel (b): AB and EE at C = 1e-07, 1e-05, 0.001, 0.1, and BL.]

Fig. 13 Book(p = 1) dataset


[Figure: test AUC versus boosting rounds T. Panel (a): AB, AC (R2, R8, RD), EE (S = 5, 10, 15) and BL. Panel (b): AB and EE at C = 1e-07, 1e-05, 0.001, 0.1, and BL.]

Fig. 14 LST(p = 1) dataset

[Figure: test AUC versus boosting rounds T. Panel (a): AB, AC (R2, R8, RD), EE (S = 5, 10, 15) and BL. Panel (b): AB and EE at C = 1e-07, 1e-05, 0.001, 0.1, and BL.]

Fig. 15 Adver(p = []) dataset

[Figure: test AUC versus boosting rounds T. Panel (a): AB, AC (R2, R8, RD), EE (S = 5, 10, 15) and BL. Panel (b): AB and EE at C = 1e-07, 1e-05, 0.001, 0.1, and BL.]

Fig. 16 Adver(p = 1) dataset


[Figure: test AUC versus boosting rounds T. Panel (a): AB, AC (R2, R8, RD), EE (S = 5, 10, 15) and BL. Panel (b): AB and EE at C = 1e-07, 1e-05, 0.001, 0.1, and BL.]

Fig. 17 CRF(p = []) dataset

D Final Comparison

[Figure: scatter of average rank AUC (horizontal axis) versus average rank Time (vertical axis) for BL, OSR, SMOTE, ADASYN, RUS, Cl Knn, Far Knn, CBU, AB, AC(R=28), AC(R=RL), EE(S=10), EE(S=15) and EE par.]

Fig. 18 Average rank AUC versus average rank Time (see Table 9) across the large datasets from Table 2 (CRF and Bank). Regarding the former, an extra row EE par is added, having the same AUC-value as EE(S = 15). Ranks for each dataset are subsequently obtained via the procedure outlined in Section 4.6.1. Points occurring in the upper-right region are preferred.


References

Akbani R, Kwek S, Japkowicz N (2004) Applying support vector machines to imbalanced datasets. In: Machine Learning: ECML 2004, 15th European Conference on Machine Learning, Pisa, Italy, September 20-24, 2004, Proceedings. Springer Berlin Heidelberg, Berlin, Heidelberg, pp 39–50. DOI 10.1007/978-3-540-30115-8_7

Ali A, Shamsuddin SM, Ralescu AL (2015) Classification with class imbalance problem: A review. International Journal of Advances in Soft Computing and its Applications 7(3):176–204

Alzahrani T, Horadam KJ (2016) Community detection in bipartite networks: Algorithms and case studies. In: Complex Systems and Networks: Dynamics, Controls and Applications. Springer Berlin Heidelberg, Berlin, Heidelberg, pp 25–50. DOI 10.1007/978-3-662-47824-0_2

Bachner J (2013) Predictive policing: Preventing crime with data and analytics. IBM Center for the Business of Government

Baesens B, Van Gestel T, Viaene S, Stepanova M, Suykens J, Vanthienen J (2003) Benchmarking state-of-the-art classification algorithms for credit scoring. Journal of the Operational Research Society 54(6):627–635. DOI 10.1057/palgrave.jors.2601545

Barandela R, Sánchez J, García V, Rangel E (2003) Strategies for learning in class imbalance problems. Pattern Recognition 36(3):849–851. DOI 10.1016/S0031-3203(02)00257-1

Barber MJ (2007) Modularity and community detection in bipartite networks. Physical Review E 76:066102. DOI 10.1103/PhysRevE.76.066102

Barua S, Islam MM, Yao X, Murase K (2014) MWMOTE–majority weighted minority oversampling technique for imbalanced data set learning. IEEE Transactions on Knowledge and Data Engineering 26(2):405–425. DOI 10.1109/TKDE.2012.232

Batista GEAPA, Prati RC, Monard MC (2004) A study of the behavior of several methods for balancing machine learning training data. SIGKDD Explor Newsl 6(1):20–29. DOI 10.1145/1007730.1007735

Beckett SJ (2016) Improved community detection in weighted bipartite networks. Royal Society Open Science 3(1). DOI 10.1098/rsos.140536

Bekkar M, Djemaa HK, Alitouche TA (2013) Evaluation measures for models assessment over imbalanced data sets. Journal of Information Engineering and Applications 3(10):27–38

Bhattacharyya S, Jha S, Tharakunnel K, Westland JC (2011) Data mining for credit card fraud: A comparative study. Decision Support Systems 50(3):602–613. DOI 10.1016/j.dss.2010.08.008

Blondel VD, Guillaume JL, Lambiotte R, Lefebvre E (2008) Fast unfolding of communities in large networks. Journal of Statistical Mechanics: Theory and Experiment 2008(10):P10008

Breiman L, Friedman J, Stone CJ, Olshen RA (1984) Classification and regression trees. Taylor & Francis

Brozovsky L, Petricek V (2007) Recommender system for online dating service. In: Proceedings of Znalosti 2007 Conference, VSB Ostrava

Cha M, Mislove A, Gummadi KP (2009) A measurement-driven analysis of information propagation in the Flickr social network. In: Proceedings of the 18th International Conference on World Wide Web, ACM, New York, NY, USA, WWW '09, pp 721–730. DOI 10.1145/1526709.1526806

Chawla NV (2005) Data mining for imbalanced datasets: An overview. In: Data mining and knowledge discovery handbook. Springer US, Boston, MA, pp 853–867

Chawla NV, Bowyer KW, Hall LO, Kegelmeyer WP (2002) SMOTE: Synthetic minority over-sampling technique. Journal of Artificial Intelligence Research 16:321–357

Chawla NV, Lazarevic A, Hall LO, Bowyer KW (2003) SMOTEBoost: Improving prediction of the minority class in boosting. In: Knowledge Discovery in Databases: PKDD 2003. Springer Berlin Heidelberg, Berlin, Heidelberg, pp 107–119

Chawla NV, Japkowicz N, Kotcz A (2004) Editorial: Special issue on learning from imbalanced data sets. SIGKDD Explor Newsl 6(1):1–6. DOI 10.1145/1007730.1007733

Chen M, Mao S, Liu Y (2014) Big data: A survey. Mobile Networks and Applications 19(2):171–209. DOI 10.1007/s11036-013-0489-0

Chyi YM (2003) Classification analysis techniques for skewed class distribution problems. Master thesis, Department of Information Management, National Sun Yat-Sen University

Demšar J (2006) Statistical comparisons of classifiers over multiple data sets. Journal of Machine Learning Research 7(Jan):1–30

Dhillon IS (2001) Co-clustering documents and words using bipartite spectral graph partitioning. In: Proceedings of the Seventh ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, ACM, New York, NY, USA, KDD '01, pp 269–274. DOI 10.1145/502512.502550


Drummond C, Holte RC (2003) C4.5, class imbalance, and cost sensitivity: Why under-sampling beats over-sampling. In: Proceedings of the ICML '03 Workshop on Learning from Imbalanced Datasets

Fan RE, Chang KW, Hsieh CJ, Wang XR, Lin CJ (2008) LIBLINEAR: A library for large linear classification. Journal of Machine Learning Research 9:1871–1874

Fan W, Stolfo SJ, Zhang J, Chan PK (1999) AdaCost: Misclassification cost-sensitive boosting. In: Proceedings of the Sixteenth International Conference on Machine Learning, Morgan Kaufmann Publishers Inc., San Francisco, CA, USA, ICML '99, pp 97–105

Fawcett T (2006) An introduction to ROC analysis. Pattern Recognition Letters 27(8):861–874. DOI 10.1016/j.patrec.2005.10.010

Finch H (2005) Comparison of distance measures in cluster analysis with dichotomous data. Journal of Data Science 3(1):85–100

Fortunato S (2010) Community detection in graphs. Physics Reports 486(3-5):75–174. DOI 10.1016/j.physrep.2009.11.002

Junque de Fortuny E, Martens D, Provost F (2014a) Predictive modeling with big data: Is bigger really better? Big Data 1(4):215–226. DOI 10.1089/big.2013.0037

Junque de Fortuny E, Stankova M, Moeyersoms J, Minnaert B, Provost F, Martens D (2014b) Corporate residence fraud detection. In: Proceedings of the 20th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, ACM, New York, NY, USA, KDD '14, pp 1650–1659. DOI 10.1145/2623330.2623333

Frasca M, Bertoni A, Re M, Valentini G (2013) A neural network algorithm for semi-supervised node label learning from unbalanced data. Neural Networks 43:84–98. DOI 10.1016/j.neunet.2013.01.021

Friedman M (1937) The use of ranks to avoid the assumption of normality implicit in the analysis of variance. Journal of the American Statistical Association 32(200):675–701

García E, Lozano F (2007) Boosting support vector machines. In: Machine Learning and Data Mining in Pattern Recognition, 5th International Conference, MLDM 2007, Leipzig, Germany, July 18-20, Post Proceedings. IBaI Publishing, pp 153–167

Goldstein M, Uchida S (2016) A comparative evaluation of unsupervised anomaly detection algorithms for multivariate data. PLOS ONE 11(4):1–31. DOI 10.1371/journal.pone.0152173

González PC, Velásquez JD (2013) Characterization and detection of taxpayers with false invoices using data mining techniques. Expert Systems with Applications 40(5):1427–1436. DOI 10.1016/j.eswa.2012.08.051

Guimerà R, Sales-Pardo M, Amaral LAN (2007) Module identification in bipartite and directed networks. Physical Review E 76:036102. DOI 10.1103/PhysRevE.76.036102

Guo H, Viktor HL (2004) Learning from imbalanced data sets with boosting and data generation: The DataBoost-IM approach. SIGKDD Explor Newsl 6(1):30–39. DOI 10.1145/1007730.1007736

Guo X, Yin Y, Dong C, Yang G, Zhou G (2008) On the class imbalance problem. In: 2008 Fourth International Conference on Natural Computation, IEEE, vol 4, pp 192–201. DOI 10.1109/ICNC.2008.871

Han H, Wang WY, Mao BH (2005) Borderline-SMOTE: A new over-sampling method in imbalanced data sets learning. In: Advances in Intelligent Computing. Springer Berlin Heidelberg, Berlin, Heidelberg, pp 878–887

He H, Garcia EA (2009) Learning from imbalanced data. IEEE Transactions on Knowledge and Data Engineering 21(9):1263–1284. DOI 10.1109/TKDE.2008.239

He H, Bai Y, Garcia EA, Li S (2008) ADASYN: Adaptive synthetic sampling approach for imbalanced learning. In: 2008 IEEE International Joint Conference on Neural Networks (IEEE World Congress on Computational Intelligence), IEEE, pp 1322–1328. DOI 10.1109/IJCNN.2008.4633969

Holm S (1979) A simple sequentially rejective multiple test procedure. Scandinavian Journal of Statistics 6(2):65–70

Hsu CW, Lin CJ (2002) A comparison of methods for multiclass support vector machines. IEEE Transactions on Neural Networks 13(2):415–425. DOI 10.1109/72.991427

Huang A (2008) Similarity measures for text document clustering. In: Proceedings of the Sixth New Zealand Computer Science Research Student Conference (NZCSRSC2008), Christchurch, New Zealand, pp 49–56

Iman RL, Davenport JM (1980) Approximations of the critical region of the Friedman statistic. Communications in Statistics - Theory and Methods 9(6):571–595

Jo T, Japkowicz N (2004) Class imbalances versus small disjuncts. ACM SIGKDD Explor Newsl 6(1):40–49. DOI 10.1145/1007730.1007737


Jutla IS, Jeub LG, Mucha PJ (2011-2016) A generalized Louvain method for community detection implemented in MATLAB. URL http://netwiki.amath.unc.edu/GenLouvain

Kubat M, Matwin S (1997) Addressing the curse of imbalanced training sets: One-sided selection. In: Proceedings of the Fourteenth International Conference on Machine Learning, Morgan Kaufmann Publishers Inc., San Francisco, CA, USA, pp 179–186

Lancichinetti A, Fortunato S (2009) Community detection algorithms: A comparative analysis. Physical Review E 80:056117. DOI 10.1103/PhysRevE.80.056117

Larremore DB, Clauset A, Jacobs AZ (2014) Efficiently inferring community structure in bipartite networks. Physical Review E: Statistical, Nonlinear, and Soft Matter Physics 90:012805. DOI 10.1103/PhysRevE.90.012805

Li J, Fine JP (2010) Weighted area under the receiver operating characteristic curve and its application to gene selection. Journal of the Royal Statistical Society: Series C (Applied Statistics) 59(4):673–692. DOI 10.1111/j.1467-9876.2010.00713.x

Li X, Wang L, Sung E (2008) AdaBoost with SVM-based component classifiers. Engineering Applications of Artificial Intelligence 21(5):785–795. DOI 10.1016/j.engappai.2007.07.001

Lichman M (2013) UCI machine learning repository. URL http://archive.ics.uci.edu/ml

Liu W, Chawla S, Cieslak DA, Chawla NV (2010) A robust decision tree algorithm for imbalanced data sets. In: Proceedings of the Tenth SIAM International Conference on Data Mining, SIAM, Philadelphia, vol 10, pp 766–777

Liu XY, Wu J, Zhou ZH (2009) Exploratory undersampling for class-imbalance learning. IEEE Transactions on Systems, Man, and Cybernetics, Part B (Cybernetics) 39(2):539–550. DOI 10.1109/TSMCB.2008.2007853

Luts J, Ojeda F, Van de Plas R, De Moor B, Van Huffel S, Suykens JA (2010) A tutorial on support vector machine-based methods for classification problems in chemometrics. Analytica Chimica Acta 665(2):129–145. DOI 10.1016/j.aca.2010.03.030

Macskassy SA, Provost F (2007) Classification in networked data: A toolkit and a univariate case study. Journal of Machine Learning Research 8(May):935–983

Martens D, Provost F (2014) Explaining data-driven document classifications. MIS Quarterly 38(1):73–100

Martens D, Provost F, Clark J, Junque de Fortuny E (2016) Mining massive fine-grained behavior data to improve predictive analytics. MIS Quarterly 40(4):869–888

Mazurowski MA, Habas PA, Zurada JM, Lo JY, Baker JA, Tourassi GD (2008) Training neural network classifiers for medical decision making: The effects of imbalanced datasets on classification performance. Neural Networks 21(2-3):427–436. DOI 10.1016/j.neunet.2007.12.031

Mease D, Wyner AJ, Buja A (2007) Boosted classification trees and class probability/quantile estimation. Journal of Machine Learning Research 8:409–439

Nemenyi P (1963) Distribution-free multiple comparisons. Dissertation, Princeton University

Newman MEJ, Girvan M (2004) Finding and evaluating community structure in networks. Physical Review E 69:026113. DOI 10.1103/PhysRevE.69.026113

Ng AY (2004) Feature selection, L1 vs. L2 regularization, and rotational invariance. In: Proceedings of the Twenty-first International Conference on Machine Learning, ACM, New York, NY, USA, ICML '04, pp 78–. DOI 10.1145/1015330.1015435

Ng AY, Jordan MI (2002) On discriminative vs. generative classifiers: A comparison of logistic regression and naive Bayes. In: Advances in Neural Information Processing Systems 14, MIT Press, pp 841–848

Ngai E, Hu Y, Wong Y, Chen Y, Sun X (2011) The application of data mining techniques in financial fraud detection: A classification framework and an academic review of literature. Decision Support Systems 50(3):559–569. DOI 10.1016/j.dss.2010.08.006

Platt JC (1999) Probabilistic outputs for support vector machines and comparisons to regularized likelihood methods. In: Advances in Large-Margin Classifiers, MIT Press, pp 61–74

Porter MA, Onnela JP, Mucha PJ (2009) Communities in networks. Notices of the American Mathematical Society 56(9):1082–1097

Provost F, Fawcett T (2013) Data Science for Business: What you need to know about data mining and data-analytic thinking. O'Reilly Media, Inc.

Provost F, Dalessandro B, Hook R, Zhang X, Murray A (2009) Audience selection for on-line brand advertising: Privacy-friendly social network targeting. In: Proceedings of the 15th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, ACM, New York, NY, USA, KDD '09, pp 707–716. DOI 10.1145/1557019.1557098


Raskutti B, Kowalczyk A (2004) Extreme re-balancing for SVMs: A case study. SIGKDD Explor Newsl 6(1):60–69. DOI 10.1145/1007730.1007739

Rosvall M, Bergstrom CT (2008) Maps of random walks on complex networks reveal community structure. Proceedings of the National Academy of Sciences 105(4):1118–1123. DOI 10.1073/pnas.0706851105

Schapire RE (1999) A brief introduction to boosting. In: Proceedings of the 16th International Joint Conference on Artificial Intelligence - Volume 2, Morgan Kaufmann Publishers Inc., San Francisco, CA, USA, IJCAI '99, pp 1401–1406

Schapire RE, Singer Y (1999) Improved boosting algorithms using confidence-rated predictions. Machine Learning 37(3):297–336. DOI 10.1023/A:1007614523901

Shmueli G (2017) Analyzing behavioral big data: Methodological, practical, ethical, and moral issues. Quality Engineering 29(1):57–74. DOI 10.1080/08982112.2016.1210979

Sobhani P, Viktor H, Matwin S (2015) Learning from imbalanced data using ensemble methods and cluster-based undersampling. In: New Frontiers in Mining Complex Patterns: Third International Workshop, NFMCP 2014, Held in Conjunction with ECML-PKDD 2014, Nancy, France, September 19, 2014, Revised Selected Papers. Springer International Publishing, Cham, pp 69–83. DOI 10.1007/978-3-319-17876-9_5

Stankova M (2016) Classification within network data with a bipartite structure. Dissertation, University of Antwerp

Stankova M, Martens D, Provost F (2015) Classification over bipartite graphs through projection. Working Papers 2015001, University of Antwerp, Faculty of Applied Economics

Sun Y, Kamel MS, Wong AK, Wang Y (2007) Cost-sensitive boosting for classification of imbalanced data. Pattern Recognition 40(12):3358–3378. DOI 10.1016/j.patcog.2007.04.009

Suykens JA, Van Gestel T, De Brabanter J, De Moor B, Vandewalle J (2002) Least squares support vector machines. World Scientific, Singapore

Tang B, He H (2015) ENN: Extended nearest neighbor method for pattern recognition [research frontier]. IEEE Computational Intelligence Magazine 10(3):52–60. DOI 10.1109/MCI.2015.2437512

Tang Y, Zhang YQ, Chawla NV, Krasser S (2009) SVMs modeling for highly imbalanced classification. IEEE Transactions on Systems, Man, and Cybernetics, Part B (Cybernetics) 39(1):281–288. DOI 10.1109/TSMCB.2008.2002909

Tobback E, Moeyersoms J, Stankova M, Martens D (2016) Bankruptcy prediction for SMEs using relational data. Working Paper 2016004, University of Antwerp, Faculty of Applied Economics

Verbeke W, Dejaeger K, Martens D, Hur J, Baesens B (2012) New insights into churn prediction in the telecommunication sector: A profit driven data mining approach. European Journal of Operational Research 218(1):211–229. DOI 10.1016/j.ejor.2011.09.031

Veropoulos K, Campbell I, Cristianini N (1999) Controlling the sensitivity of support vector machines. In: Proceedings of the International Joint Conference on Artificial Intelligence, Stockholm, Sweden (IJCAI99), pp 55–60

Whitrow C, Hand DJ, Juszczak P, Weston D, Adams NM (2009) Transaction aggregation as a strategy for credit card fraud detection. Data Mining and Knowledge Discovery 18(1):30–55. DOI 10.1007/s10618-008-0116-z

Wickramaratna J, Holden SB, Buxton BF (2001) Performance degradation in boosting. In: Proceedings of the Second International Workshop on Multiple Classifier Systems, Springer, London, UK, MCS '01, pp 11–21

Yen SJ, Lee YS (2009) Cluster-based under-sampling approaches for imbalanced data distributions. Expert Systems with Applications 36(3, Part 1):5718–5727. DOI 10.1016/j.eswa.2008.06.108

Yu HF, Lo HY, Hsieh HP, Lou JK, McKenzie TG, Chou JW, Chung PH, Ho CH, Chang CF, Wei YH, et al (2010) Feature engineering and classifier ensemble for KDD Cup 2010. In: Proceedings of the KDD Cup 2010 Workshop, pp 1–16

Zha H, He X, Ding C, Simon H, Gu M (2001) Bipartite graph partitioning and data clustering. In: Proceedings of the Tenth International Conference on Information and Knowledge Management, ACM, New York, NY, USA, CIKM '01, pp 25–32. DOI 10.1145/502585.502591

Zhang J, Mani I (2003) KNN approach to unbalanced data distributions: A case study involving information extraction. In: Proceedings of the ICML'2003 Workshop on Learning from Imbalanced Datasets, Washington, DC

Ziegler CN, McNee SM, Konstan JA, Lausen G (2005) Improving recommendation lists through topic diversification. In: Proceedings of the 14th International Conference on World Wide Web, ACM, New York, NY, USA, WWW '05, pp 22–32. DOI 10.1145/1060745.1060754



2.3 Evaluation metrics

The performance measures that are typically used in traditional studies dealing with imbalanced data are accuracy, sensitivity, specificity, precision, F-measure and G-mean, see for instance Bhattacharyya et al (2011), Han et al (2005), González and Velásquez (2013). These measures are derived from the confusion matrix and are based on a certain threshold applied on the output scores of the classifier, where the threshold is usually contained within the classification algorithm. There are two main issues with this approach. First of all, the built-in threshold that is applied on the scores might not be suitable. This might cause low performance values with respect to these criteria, yet if we were to simply adapt the threshold, the same performance criteria might show excellent results. The second issue lies in the fact that the chosen threshold could be irrelevant with respect to the available capacity. We address this issue with a simple example. If a targeted advertisement company would apply this classifier (with built-in threshold) on new customers, the classifier might choose to predict 5% of all possible customers to target as positive. Yet, the company only has a marketing budget that allows targeting 0.1% of all possible customers. It is clear that the chosen threshold is inappropriate here. For these reasons, we have chosen to opt for the area under the ROC curve (AUC) instead.6 Chawla et al (2004) note that ROC curves (and cost curves) should be preferred over these traditional measures. The AUC, which measures the area under the ROC curve, is more appropriate since it is independent of class skew and measures the ranking abilities of the classifier (Fawcett 2006). It answers the question: if we were to rank all instances according to output scores, is the classifier able to place the positive instances near the top of the list and the negative instances near the bottom? Because the method scans over all possible thresholds, it is independent of a specific cut-off value. Another reason we have chosen for AUC is the fact that many boosting and cost-sensitive learning techniques evaluate the performance using accuracy or misclassification cost, which suffer from the same issues as previously mentioned. Hence, we are one of the first to evaluate these techniques with respect to AUC. Note that AUC is also the preferred metric in the assessment of unsupervised anomaly detection techniques (Goldstein and Uchida 2016). Outliers/anomalies are strongly related to our field as they signify rare events that are of specific importance to the analyst.
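The ranking interpretation of AUC can be made concrete with a small sketch (our own illustration, not code from the paper; `auc_rank` is a hypothetical helper):

```python
import numpy as np

def auc_rank(scores, labels):
    """AUC via the Mann-Whitney rank statistic: the probability that a
    randomly drawn positive instance is scored above a randomly drawn
    negative one. No threshold is involved, only the ranking."""
    scores = np.asarray(scores, dtype=float)
    labels = np.asarray(labels)
    pos = scores[labels == 1]
    neg = scores[labels == 0]
    wins = (pos[:, None] > neg[None, :]).sum()   # positive ranked above negative
    ties = (pos[:, None] == neg[None, :]).sum()  # ties contribute one half
    return (wins + 0.5 * ties) / (len(pos) * len(neg))

# A classifier that places every positive above every negative reaches
# AUC = 1, no matter which cut-off would later be applied to the scores.
print(auc_rank([0.9, 0.8, 0.3, 0.1], [1, 1, 0, 0]))   # 1.0
print(auc_rank([0.9, 0.4, 0.35, 0.1], [1, 0, 1, 0]))  # 0.75
```

Shifting all scores by a constant, or rescaling them, leaves this quantity unchanged, which is exactly the threshold-independence argued for above.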

Though we focus on AUC, other measures are suitable for performance assessment of imbalanced data. Weighted area under the ROC curve (wAUC) (Li and Fine 2010) is able to emphasize certain regions of ROC space. Lift curves (Bekkar et al 2013), popular in the marketing domain, are ideal to assess the prevalence of positive cases among instances ranked highly by the classifier. These curves can be evaluated according to the available capacity. When cost information is available, this should be integrated in the performance assessment criterion. Indeed, the cost of a false negative is usually much larger than the cost of a false positive. We could then opt for cost curves (Whitrow et al 2009; Bhattacharyya et al 2011) that can also be evaluated with consideration of the available capacity. Few studies make use of costs

6 We provide traditional measures (sensitivity, specificity, G-means, F-measure) in our online repository: http://www.applieddatamining.com/cms/?q=software


(Ngai et al 2011), mainly because these are difficult to determine, uncertain, and have a temporal characteristic. To conclude this section, we note that each of the performance assessment criteria mentioned in this paragraph requires some additional information in the form of a weight function (wAUC), capacity requirements and/or costs (lift/cost curves). These measures are application specific and this is the main reason we excluded them in our study.
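The capacity-aware evaluation discussed above is straightforward to compute once a ranking is available. A minimal sketch (our own illustration; the function names and the capacity parametrization are ours, not from the paper):

```python
import numpy as np

def precision_at_capacity(scores, labels, capacity):
    """Fraction of positives among the top-`capacity` fraction of
    instances ranked by score, e.g. capacity=0.001 targets the top 0.1%."""
    scores, labels = np.asarray(scores, dtype=float), np.asarray(labels)
    k = max(1, int(round(capacity * len(scores))))
    top = np.argsort(-scores)[:k]          # indices of the highest-scored instances
    return labels[top].mean()

def lift_at_capacity(scores, labels, capacity):
    """Precision within the targeted top fraction, divided by the base rate."""
    return precision_at_capacity(scores, labels, capacity) / np.mean(labels)

scores = [0.9, 0.8, 0.7, 0.6]
labels = [1, 0, 1, 0]
print(precision_at_capacity(scores, labels, 0.5))  # 0.5
print(lift_at_capacity(scores, labels, 0.5))       # 1.0
```

A lift above 1 means the classifier concentrates more positives in the targeted group than random selection would, which is the quantity a capacity-constrained campaign actually cares about.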

3 Methods

3.1 Oversampling

Over the years, the data mining community investigated several techniques that balance the data distribution by oversampling the minority class. In this section, we investigate the basic oversampling with replacement (OSR) approach in conjunction with synthetic sample generation procedures. The first technique simply duplicates minority class instances by a certain amount. Several references dealing with traditional low-dimensional data note that this technique may make the decision regions of the learner smaller and too specific on the replicated instances, causing the learner to overfit (Chawla et al 2002; Han et al 2005; Liu et al 2009; He and Garcia 2009). The synthetic approaches are designed to overcome this overfitting behaviour by generating new, non-overlapping instances in the minority class space. The techniques we investigate are SMOTE (Chawla et al 2002) and ADASYN (He et al 2008).

Consider a certain minority class instance xi. In the traditional setting, SMOTE and ADASYN will generate a new synthetic instance xsyn by choosing a random point on the line segment between the point xi and one of its K minority class nearest neighbours (computed according to Euclidean distance). In SMOTE, each original minority class instance xi generates the same number of synthetic instances, whereas the ADASYN algorithm generates a variable number of synthetic instances by putting a larger weight on the harder to learn minority instances. This way, the learner is more focussed toward difficult instances.
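The traditional generation step can be written as xsyn = xi + u · (xnn − xi), with u drawn uniformly from [0, 1] and xnn one of the selected minority neighbours. A minimal sketch of this interpolation (the vectors are illustrative):

```python
import random

def smote_interpolate(x_i, x_nn, rng):
    """Classic SMOTE step: a random point on the line segment between a
    minority instance x_i and one of its minority nearest neighbours x_nn."""
    u = rng.random()  # gap parameter, uniform in [0, 1)
    return [a + u * (b - a) for a, b in zip(x_i, x_nn)]

# Illustrative 2-d minority instance and neighbour.
x_syn = smote_interpolate([1.0, 2.0], [3.0, 2.0], random.Random(0))
# x_syn lies on the segment: first coordinate in [1, 3], second stays 2.
```

As the next paragraphs explain, this continuous interpolation is not meaningful for sparse binary vectors, which is why the behaviour-data variants replace it.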

These techniques need to be adapted when dealing with binary behaviour data, where each instance is represented by a large and sparse binary vector. The main differences with the original versions of SMOTE and ADASYN are indicated in Table 1 and are explained in more detail in the following paragraphs. The first difference lies in the generation of the new synthetic instances. As before, a synthetic sample is constructed based on two original minority instances. When both instances have a 0 or 1 in their corresponding column, the synthetic sample will also show a 0 or 1, respectively, at the considered position. When only one of the two minority instances shows a 1, the decision will be made according to a user-specified parameter prioropt. This parameter can be one of the following three options:

– "FlipCoin", where there is a 50% probability that the synthetic instance will show a 1 at the considered position.

– "Prior", where the value of the synthetic sample is determined by the prior within the minority class in the corresponding column. One generates a random number u in the interval [0,1] and puts a 1 in the corresponding position if u is smaller than the prior.

– "Reverse Prior", where one generates a random number u in the interval [0,1] and puts a 1 in the corresponding position if u is larger than the prior within the minority class for this column/feature.
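The three options can be sketched as follows; `generate_synthetic` is a hypothetical helper (not from the paper) that combines two binary minority instances position by position, copying 0-0 and 1-1 matches and resolving 0-1 mismatches according to prioropt:

```python
import random

def generate_synthetic(x1, x2, priors, prior_opt, rng=None):
    """Combine two binary minority instances into one synthetic instance.

    priors[j] is the fraction of minority instances showing a 1 in column j.
    0-0 and 1-1 matches are copied directly; 0-1 mismatches are resolved
    according to prior_opt (FlipCoin / Prior / Reverse Prior)."""
    rng = rng or random.Random(0)
    syn = []
    for j, (a, b) in enumerate(zip(x1, x2)):
        if a == b:                        # 0-0 or 1-1 match: copy the value
            syn.append(a)
        else:                             # 0-1 mismatch: consult prior_opt
            u = rng.random()              # uniform in [0, 1)
            if prior_opt == "FlipCoin":
                syn.append(1 if u < 0.5 else 0)
            elif prior_opt == "Prior":
                syn.append(1 if u < priors[j] else 0)
            else:                         # "Reverse Prior"
                syn.append(1 if u > priors[j] else 0)
    return syn

# Matching positions (columns 0 and 2) are copied; the rest is randomized.
syn = generate_synthetic([1, 0, 1, 0], [1, 1, 1, 1],
                         [0.8, 0.1, 0.9, 0.2], "FlipCoin")
```

The sketch makes explicit why "Reverse Prior" tends to activate rare features: a low column prior makes u > prior likely, so infrequent features survive in the synthetic samples.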

The second difference lies in the way the nearest neighbours are determined. It would be unwise to consider Euclidean distance in that respect, because it treats a 0-0 match in the same way as a 1-1 match. Since we are working with dichotomous variables (i.e. present or absent), a 1-1 match at a certain position is far more informative than a 0-0 match. For instance, two users visiting the same web page conveys more information than two users who didn't visit that specific page. The similarity between two instances is defined by a user-specified parameter simmeasure. We have limited ourselves to two popular choices: "Jaccard" uses the Jaccard similarity measure (Finch 2005) and "Cosine" uses the cosine similarity measure (Huang 2008). In principle, one could apply any of the metrics summarized in Stankova et al (2015).
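For binary vectors both measures reduce to counts of 1-1 matches, which is exactly why they ignore the uninformative 0-0 matches. A small self-contained sketch (the user vectors are illustrative):

```python
import math

def jaccard(a, b):
    """Jaccard similarity for binary vectors: 1-1 matches divided by the
    number of positions where at least one vector is active (0-0 ignored)."""
    both = sum(1 for x, y in zip(a, b) if x and y)
    either = sum(1 for x, y in zip(a, b) if x or y)
    return both / either if either else 0.0

def cosine(a, b):
    """Cosine similarity; for binary vectors the dot product equals the
    number of 1-1 matches."""
    both = sum(1 for x, y in zip(a, b) if x and y)
    na, nb = math.sqrt(sum(a)), math.sqrt(sum(b))
    return both / (na * nb) if na and nb else 0.0

u = [1, 1, 0, 0, 1]   # e.g. pages visited by user u
v = [1, 0, 0, 0, 1]   # pages visited by user v
print(jaccard(u, v))  # 2 shared pages out of 3 distinct active pages
```

Note that adding trailing zero columns to both vectors leaves both similarities unchanged, whereas the Euclidean distance criticized above would be diluted by them.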

Table 1 Differences between the original SMOTE and ADASYN implementations and the versions SMOTEbeh and ADASYNbeh tailored for behaviour data (parameter explanations are provided in Section 3.1)

                               SMOTE                          SMOTEbeh          ADASYN                         ADASYNbeh
Amount of oversampling         N                              β                 β                              β
Synthetic sample generation    random point on line segment   prioropt          random point on line segment   prioropt
Similarity measure             Euclidean                      Jaccard/Cosine    Euclidean                      Jaccard/Cosine
Number of nearest neighbours   K                              K̄                 K                              K, K̄

A detailed pseudo-code implementation of our versions of SMOTE and ADASYN, called SMOTEbeh and ADASYNbeh, is shown in Algorithm 1. Also note that we introduced an extra parameter K̄ to decouple the determination of the number of synthetic instances that need to be generated for a certain minority instance and the number of nearest neighbours it uses to obtain the synthetic instances.

The experimental set-up adopted by Chawla et al (2002); He et al (2008) considers K = 5 as the number of nearest neighbours used, without a detailed motivation. Furthermore, the latter paper compares SMOTE with an oversampling percentage N of 200% (meaning that the size of the newly created synthetic data instances is twice as large as the size of the original minority training data) with a completely balanced dataset (β = 1 in Algorithm 1) in ADASYN. In our experiments, we will consider a variety of possible K-values and compare SMOTE and ADASYN with identical oversampling rates, controlled by a single parameter β.


Algorithm 1 SMOTEbeh and ADASYNbeh pseudo-code implementation for binary behaviour data

Input: Xmin, Xmaj, β, prioropt, simmeasure, K, K̄
a) determine the total amount of synthetic minority instances that need to be generated (β ∈ [0,1] is a parameter that controls the amount of oversampling; β = 1 means a fully balanced dataset will be created):
    G = (|Xmaj| − |Xmin|) × β
b) determine the number of synthetic samples gi that need to be generated for each minority class instance xi:
if SMOTE then
    gi ← ⌈G / |Xmin|⌉
else if ADASYN then
    calculate the K nearest neighbours (with the simmeasure option) for instance xi from the set (Xmin \ xi) ∪ Xmaj and determine ∆i, the number of majority class nearest neighbours. Next, calculate ri = ∆i / K and normalize these values: ri ← ri / Σ_{j=1}^{|Xmin|} rj
    gi ← ⌈ri × G⌉
end if
c) generate gi synthetic samples for minority instance xi:
calculate the K̄ nearest neighbours (with the simmeasure option) for instance xi from the set Xmin \ xi. Additionally, remove those nearest neighbours that have a similarity of 0 with xi. The remaining nearest neighbours form the set Kused. If this set turns out to be empty, set Kused = {xi}
for iter = 1 → gi do
    randomly choose 1 nearest neighbour from the set Kused
    generate a synthetic minority sample from xi and the chosen nearest neighbour (according to prioropt)
end for
d) because Σ gi ≥ G, randomly remove synthetic points until the total number of synthetic samples equals G

3.2 Undersampling

In this section, we will compare the simple random undersampling technique (RUS) with informed undersampling approaches. The first method randomly discards majority class training instances. While this technique can achieve fast training performance of the underlying base learner, an obvious disadvantage is the fact that it might discard potentially useful majority class instances. The informed approaches try to intelligently retain the most informative majority class instances, in the hope to increase predictive performance while at the same time keeping the fast training speed of the underlying classifier.

The first set of informed undersampling techniques are based on the methods proposed by Zhang and Mani (2003); Chyi (2003). The K-nearest neighbour 'classifier' is used to determine the importance of each majority class training instance by calculating the total similarity with the K closest minority class training set examples. Regarding similarity computations for binary behaviour data, we refer to the related discussion in Section 3.1. The first technique, called "Closest Knn", retains majority class examples that are closest to the minority class instances. These instances are the most difficult to classify and we would expect them to be most informative. The second method, called "Closest tot sim", is similar to the previously described technique. The difference is that it no longer computes similarities with the K closest minority neighbours; instead, it calculates the total similarity with all minority instances in determining the importance. The main reason we included this technique is computational speed.7 The last techniques, called "Farthest Knn" and "Farthest tot sim", are included for comparison with the previously mentioned techniques. Their implementation is identical to the previously described techniques; however, they retain majority class examples that are farthest from the minority class instances. For each of the proposed methods in this paragraph, the amount of undersampling is controlled by a user-specified parameter βu according to the following formula:

Nrrem = ⌊(|Xmaj| − |Xmin|) × βu⌋    (2)

where Nrrem represents the amount of majority class instances to be discarded; βu = 1 means a completely balanced dataset is obtained.
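A simplified sketch of the "Closest Knn" selection (the similarity here is a plain count of 1-1 matches, standing in for the Jaccard or cosine measures of Section 3.1; the data are illustrative):

```python
def closest_knn_undersample(X_maj, X_min, K, nr_retain):
    """Retain the nr_retain majority instances with the largest total
    similarity to their K closest minority training instances."""
    def sim(a, b):                # number of 1-1 matches (stand-in similarity)
        return sum(1 for x, y in zip(a, b) if x and y)

    def score(xm):                # total similarity with the K closest minorities
        sims = sorted((sim(xm, xn) for xn in X_min), reverse=True)
        return sum(sims[:K])

    ranked = sorted(range(len(X_maj)),
                    key=lambda i: score(X_maj[i]), reverse=True)
    return [X_maj[i] for i in ranked[:nr_retain]]

X_maj = [[1, 1, 0], [0, 0, 1], [1, 0, 1]]
X_min = [[1, 1, 1], [1, 1, 0]]
kept = closest_knn_undersample(X_maj, X_min, K=1, nr_retain=2)
```

"Farthest Knn" simply reverses the ranking, and "Closest tot sim" replaces `sims[:K]` with the sum over all minority instances, which removes the sorting step per majority instance, the speed argument given above.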

The second set of informed undersampling techniques aim at targeting the within-class imbalance problem and are based on the approach proposed in Sobhani et al (2015). They postulate that this within-class imbalance problem is more pronounced in the case of undersampling methods. If we were to randomly select majority class instances, then the probability of drawing an instance from small disjuncts within the majority class would be very low. These regions might therefore contain no representatives and remain unlearned. The authors chose to address this issue by clustering the majority class instances in a first step and subsequently selecting an equal number of representatives from each cluster. The reported results show their approach to outperform the CBO-algorithm (see Section 1.1). In the following paragraphs, we will dig deeper into the subject of clustering behaviour data. We already refer to Algorithm 2 for an overview of our cluster-based undersampling method (CBU).

As we noted in Section 2.1, behaviour data can be represented as a bipartite graph. The clustering of behaviour data8 aims at finding groups of nodes (communities) that connect more to each other than to other nodes in the network. This subject is currently an active area of research, with a rapid evolution of a vast number of cluster detection techniques (Zha et al 2001; Dhillon 2001; Larremore et al 2014; Beckett 2016). We refer to Porter et al (2009); Fortunato (2010); Alzahrani and Horadam (2016) for detailed surveys on the problem. It should be noted that the vast majority of publications deal with the subject of clustering unigraphs (networks with only one type of nodes). It is only fairly recently that interest grew in the clustering of bigraphs. In our implementations, we have chosen for the popular modularity-based approaches9 for clustering bigraphs, and these fall into two directions. In the first direction, the modularity function that is used for unigraphs is adapted to be suitable for bigraphs, see for instance the work of Barber (2007). The other direction, which we adopt in our study, projects the bigraph to a unigraph of bottom nodes and performs community detection on the projection using traditional modularity definitions. Note that Guimerà et al (2007) observed no difference in the obtained communities using either direction.

7 For each majority class instance, we no longer need to sort the similarities with all minority instances in determining the K largest values.

8 This subject is more commonly known as community detection in bipartite graphs.
9 Modularity-based approaches attempt to optimize a quality function known as modularity for finding community structures in networks and rely on the use of heuristics due to the complexity of the problem.


In this article, we adopt the methodology proposed in Alzahrani and Horadam (2016), which consists of projecting the bigraph to a unigraph of bottom nodes and applying10 the Louvain algorithm (Blondel et al 2008) on the projection. Lancichinetti and Fortunato (2009) performed a comparative study regarding the performance of 12 community detection algorithms and concluded the Louvain method to be the best modularity-based algorithm and second best among all algorithms. The heuristic is very fast, with a O(m) complexity, where m is the number of edges in the unigraph. We have chosen for the Louvain algorithm because of its speed performance and the availability of a toolbox (Jutla et al 2011-2016) that is directly compatible with our implementations. The toolbox provides a generalized implementation of the Louvain algorithm, in the sense that multiple definitions of modularity are possible. The quality function we chose is the popular Newman-Girvan modularity (Newman and Girvan 2004).

With respect to the projection, Alzahrani and Horadam (2016) connect two bottom nodes if they have at least one top node in common. The connection weight between two bottom nodes in the projection is set to the number of shared top nodes. In our implementation, we adapted the connection weights in accordance with Stankova et al (2015) as follows: first of all, we assign weights to the top nodes corresponding to the hyperbolic tangent applied to the inverse degree of the top node. Next, the connection weight between two bottom nodes in the projection corresponds with the total weight of the shared top nodes. Top nodes having low degrees therefore obtain a higher contribution in the projection (e.g. two users making a transaction to a local book store are assumed to be more closely connected to each other than two users making a transaction to a large retail store).
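This weighting scheme can be sketched directly from the bipartite adjacency structure. A toy example with hypothetical node names (top nodes are merchants, bottom nodes are users):

```python
import math

def project_bigraph(bottom_to_top):
    """Project a bigraph to a weighted unigraph of bottom nodes.

    bottom_to_top[i] is the set of top nodes connected to bottom node i.
    Each top node gets weight tanh(1/degree); the edge weight between two
    bottom nodes is the total weight of their shared top nodes."""
    degree = {}
    for tops in bottom_to_top:
        for t in tops:
            degree[t] = degree.get(t, 0) + 1
    top_weight = {t: math.tanh(1.0 / d) for t, d in degree.items()}

    n = len(bottom_to_top)
    weights = {}
    for i in range(n):
        for j in range(i + 1, n):
            shared = bottom_to_top[i] & bottom_to_top[j]
            if shared:
                weights[(i, j)] = sum(top_weight[t] for t in shared)
    return weights

# Users 0 and 1 share a low-degree local shop; user 2 only shares the
# high-degree big store, so the (0,1) edge ends up heavier than (0,2).
w = project_bigraph([{"local_shop", "big_store"},
                     {"local_shop", "big_store"},
                     {"big_store"}])
```

Because tanh(1/d) decreases with d, a shared low-degree top node contributes more to the edge weight than a shared high-degree one, exactly the book-store intuition above.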

In our CBU-algorithm (see Algorithm 2), after clustering the bigraph containing exclusively majority class instances, we randomly select an equal amount of majority instances from each community to target the within-class imbalance problem. In the rare situation that the number of obtained clusters exceeds the required amount of majority instances (Nrretain), we sort the communities according to a user-specified parameter Clustopt and randomly select 1 instance from the first Nrretain clusters. The parameter Clustopt can take the following values:

– C Smallest, where we sort clusters in ascending order of their size
– C Largest, where we sort clusters in descending order of their size

Note that we randomly select majority class instances from each cluster. Yen and Lee (2009) found a random selection strategy after clustering to be superior to informed approaches based on distance.

10 Note that they also made use of the flow-based algorithm Infomap (Rosvall and Bergstrom 2008), which shows excellent results on the LFR-benchmark.


Algorithm 2 CBU pseudo-code implementation for behaviour data

Input: Xmin, Xmaj, βu, Clustopt
a) Cluster the majority class instances Xmaj:
– Assign weights to each top node, corresponding with the hyperbolic tangent applied to the inverse of the node's degree.
– Project the bigraph Xmaj to a weighted unigraph consisting of bottom node majority class instances. The weight wij between majority class instances i and j corresponds with the total weight of the shared top nodes.
– Apply the Louvain algorithm (Blondel et al 2008) on the projected unigraph to partition the majority class instances into clusters.
b) Select majority class instances:
Nrrem ← ⌊(|Xmaj| − |Xmin|) × βu⌋ (see Equation (2))
Nrretain ← |Xmaj| − Nrrem
if Nrretain < |Clust| then
    – Sort clusters according to Clustopt
    – Randomly select 1 instance from the first Nrretain clusters
else
    – Randomly select ⌈Nrretain / |Clust|⌉ majority class instances from each cluster
    – Randomly discard instances from the previous step until its size corresponds with Nrretain
end if
c) Return the new training set consisting of Xmin and the selected majority class instances from step b

3.3 Boosting, cost-sensitive learning and EasyEnsemble

3.3.1 AdaBoost

The AdaBoost (Schapire and Singer 1999; Schapire 1999) algorithm has been designed from the perspective of improving the performance of a weak learner, so that it achieves accuracies that are comparable with a strong learning algorithm. Fundamental to the idea of boosting is to maintain a weight distribution over the training set. In each boosting iteration, the weights of wrongly classified instances are increased, so that the underlying weak learner puts more emphasis on these hard examples. In our implementation, we will consider using a SVM as base learner. SVMs are generally regarded as strong learners. The studies of Wickramaratna et al (2001); García and Lozano (2007) note that using a strong learner usually results in performance degradation during the boosting process. The RBF-kernel SVM classifier (Wickramaratna et al 2001; Li et al 2008) is used as the underlying classifier to prove their point. In our study, we will employ a linear SVM, which can be considered as a weaker version compared to the RBF-kernel. Furthermore, the regularization parameter C can be viewed as a "weakness" indicator.11 Lowering the C-value results in weaker learners, as can be seen from the goal function of the SVM optimization problem (see equation (1)). We will come back to this point in Section 4.5.

11 The distinction between weak/strong learners is loosely 'defined' in Schapire (1999). A weak learner corresponds with a hypothesis that performs just slightly better than random guessing. A strong learner is able to generate a hypothesis with an arbitrarily low error rate, given enough data. We adopt these definitions, but consider the distinction between weak/strong based on training set error. In a SVM context, it is quite typical that error levels on training data drop with increasing C-values (Suykens et al 2002). A learner that is 'too strong' means that even though its performance on training data is very high, it fails to generalize well and the test set error increases due to overfitting.


The boosting algorithm with underlying SVM is presented in Algorithm 3. This algorithm closely follows the original boosting implementation presented in Schapire and Singer (1999) and requires each learner to output confidence-rated predictions in the interval [−1,1]. Since the SVM outputs real-valued scores, we apply the procedure of Platt (1999) to transform these scores into probability estimates (which can easily be translated to form confidence-rated predictions). A logistic regression (LR) model (Ng and Jordan 2002) is trained using as input the SVM-scores and the corresponding labels as output. Note that the same data that are used to construct the linear SVM can be used to estimate the LR-model, see Platt (1999) for a motivation.

The boosting algorithm requires the weak learner to be trained using a distribution Dt. One could sample from this distribution to generate (unweighted) examples and train a SVM next. We have chosen to include weights in the SVM goal function formulation (this requires an extension of the LIBLINEAR package):

    min_{w,b,ξi}  (wᵀw)/2 + C Σ_{i=1}^{m} weighti ξi    (3)

The weights weighti are set according to the weight distribution Dt(i) in each round of boosting. Note that the C-value that is used to train the SVM model will be divided by mean(weighti). This normalization allows for a fair comparison between the weighted and unweighted SVM versions (e.g. in the first round of boosting, weighti = 1/m for all instances; by multiplying the C-value with m, this corresponds to solving the unweighted problem with the same C-value).

We introduce an additional parameter μ, called the weight percentage, with values in [0,100], in the boosting algorithm of Algorithm 3. This parameter controls the amount of training data that is used to construct the SVM model and subsequent LR model. We sort the original training data in descending order according to the distribution Dt. Next, we form a new training set of minimal cardinality by including points from the original sorted training data until the total weight is higher than the weight-percentage μ/100. This way, the newly formed training data will contain only the part of the original training data that has the most weight. This partial set will then be used to construct a weighted SVM-model (according to equation (3), with an updated distribution for this set) and a subsequent LR-model. The idea of using a partial dataset to construct the base learner not only reduces training times, but also weakens the learner (García and Lozano 2007).
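The selection of this partial training set can be sketched as follows (indices sorted by weight, keeping the smallest prefix whose cumulative weight exceeds μ/100; the distribution below is illustrative):

```python
def select_by_weight_percentage(D, mu):
    """Return indices of the minimal-cardinality training subset whose
    total weight under distribution D exceeds mu/100 (mu in [0, 100])."""
    order = sorted(range(len(D)), key=lambda i: D[i], reverse=True)
    subset, total = [], 0.0
    for i in order:
        subset.append(i)
        total += D[i]
        if total > mu / 100.0:
            break
    return subset

# Distribution concentrated on instance 2: mu = 60 keeps only two points.
print(select_by_weight_percentage([0.1, 0.1, 0.5, 0.3], mu=60))  # → [2, 3]
```

With μ = 100 the full training set is kept, so smaller μ-values give both the speed-up and the weakening effect described above.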

In Algorithm 3, we have an explicit check to verify if rAB = 1. In this case, the SVM model outputs scores that allow for a perfect classification on the training set. The subsequent LR model will find a threshold and output a value of −1 if the SVM score is lower than this threshold (a value of +1 if the score is higher than this threshold). In this situation, the training data would be perfectly classified. In our implementation, we attempt to avoid this kind of behaviour, because it can lead to overfitting, meaning it might pinpoint on the wrong threshold and make too drastic decisions (we lose the meaning of confidence). The check if rAB ≤ 0 verifies if the currently boosted model is performing worse than random (this model would have a rAB-value of 0). Obviously, if the model performs worse than random, we quit the boosting process. During the first round of boosting, we perform similar checks that are not explicitly indicated in Algorithm 3. In the case where rAB = 1, we output the SVM scores instead of the LR binary values. When rAB ≤ 0, we quit the boosting process and output the LR scores.

Algorithm 3 AdaBoost with a SVM-LR combination as a base learner

Input: (X, Y) = {(x1,y1), ..., (xm,ym)}, C, T, μ
Initialize distribution D1(i) = 1/m
for t = 1 to T do
    – train weak learner using distribution Dt. The weak learner consists of a weighted linear SVM and LR model, trained with weight-percentage μ of Dt:
        ht ← Train_WeakLearner(X, Y, Dt, C, μ)
    – compute the weighted confidence rAB on the training data:
        rAB ← Σ_{i=1}^{m} Dt(i) yi ht(xi)
        If (rAB = 1 or rAB ≤ 0) then αt ← 0 and stop the boosting process
    – choose αt ∈ R:
        αt ← (1/2) log((1 + rAB)/(1 − rAB))
    – update distribution:
        Dt+1(i) ← Dt(i) exp(−αt yi ht(xi)) / Zt
        where Zt is a normalization factor (chosen so that Dt+1 will be a distribution)
end for
Output the final hypothesis (the output score is the term contained in the sign function):
    H(x) = sign(Σ_{t=1}^{T} α̃t ht(x))  with  α̃t = αt / Σ_{i=1}^{T} αi

3.3.2 AdaCost

The AdaCost algorithm (Fan et al 1999) is a variant of cost-sensitive learning where misclassification costs are introduced in the weight-update formula of AdaBoost. The cost-sensitive update rule will increase the weights of costly misclassified instances more aggressively and decrease the weights of costly correct classifications more conservatively. Each instance is given a misclassification cost ci, where we chose to put ci = 1 for positive (minority) instances and ci = 1/R for negative (majority) instances. R is a user-defined value that allows one to put more emphasis on the minority class. The implementation of AdaCost is similar to Algorithm 3, yet there are a few differences (Fan et al 1999): firstly, the initial distribution is chosen as D1(i) = ci / Σ_{j=1}^{m} cj; secondly, the weight update rule is given by Dt+1(i) = Dt(i) exp(−αt yi ht(xi) β(i)) / Zt, where β(i) = −0.5 sign(yi ht(xi)) ci + 0.5 is a cost-adjustment function; finally, the choice of αt is given by αt = (1/2) log((1 + rAC)/(1 − rAC)), where rAC = Σ_{i=1}^{m} Dt(i) yi ht(xi) β(i). Note that the checks to stop the boosting process prematurely, see the second bullet in Algorithm 3, are still based on the r-value obtained from AdaBoost (rAB). This is because β ∈ [0,1].
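The cost-adjustment function and the resulting rAC can be sketched as follows (the distribution, labels, scores and the value R = 4 below are hypothetical, chosen only to trace the formulas through):

```python
import math

def beta(y, h, c):
    """AdaCost cost-adjustment: β(i) = −0.5·sign(y·h)·c + 0.5.

    Misclassified (y·h < 0): β = 0.5·c + 0.5, so the weight grows faster.
    Correct (y·h > 0):       β = −0.5·c + 0.5, so the weight shrinks slower."""
    sign = 1.0 if y * h > 0 else (-1.0 if y * h < 0 else 0.0)
    return -0.5 * sign * c + 0.5

def r_adacost(D, y, h, c):
    """Weighted confidence rAC = Σ D(i)·y_i·h(x_i)·β(i)."""
    return sum(Di * yi * hi * beta(yi, hi, ci)
               for Di, yi, hi, ci in zip(D, y, h, c))

# Two instances: a correctly classified minority point (c = 1) and a
# misclassified majority point (c = 1/R with hypothetical R = 4).
D, y, h, c = [0.5, 0.5], [1, -1], [0.8, 0.6], [1.0, 0.25]
r = r_adacost(D, y, h, c)
alpha = 0.5 * math.log((1 + r) / (1 - r))
```

Note that with costs ci ∈ [0,1] the function β(i) indeed stays in [0,1], which is the property the premature-stopping remark above relies on.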

In the first boosting round of AdaCost, the weighted SVM formulation (3) is equivalent to solving a SVM formulation with the following goal function:

    min_{w,b,ξi}  (wᵀw)/2 + C⁺ Σ_{i|yi=1} ξi + C⁻ Σ_{i|yi=−1} ξi    (4)

where C⁺/C⁻ = R. This can be seen as a cost-sensitive version of a SVM, and this idea has initially been proposed by Veropoulos et al (1999).
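The goal function (4) can be evaluated directly once the hinge-loss slacks ξi = max(0, 1 − yi(w·xi + b)) are written out. A small sketch with hypothetical weights and data (not a trained model):

```python
def cost_sensitive_svm_objective(w, b, X, y, C_pos, C_neg):
    """Goal function (4): ||w||²/2 + C⁺ Σ_{yi=+1} ξi + C⁻ Σ_{yi=−1} ξi,
    with slack ξi = max(0, 1 − yi·(w·xi + b))."""
    reg = 0.5 * sum(wj * wj for wj in w)
    loss = 0.0
    for xi, yi in zip(X, y):
        margin = yi * (sum(wj * xj for wj, xj in zip(w, xi)) + b)
        slack = max(0.0, 1.0 - margin)
        loss += (C_pos if yi == 1 else C_neg) * slack
    return reg + loss

# C⁺/C⁻ = R = 10 puts ten times more weight on minority (positive) slack.
obj = cost_sensitive_svm_objective(
    w=[1.0, 0.0], b=0.0,
    X=[[0.5, 0.0], [-2.0, 0.0]], y=[1, -1],
    C_pos=1.0, C_neg=0.1)
```

Only the positive instance violates the margin here, so its slack is charged at the higher rate C⁺; this is exactly how the formulation biases the separator toward the minority class.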

3.3.3 EasyEnsemble

One of the disadvantages of the random undersampling method is the fact that we are discarding potentially valuable information. EasyEnsemble (Liu et al 2009) is a method that combines several balanced subsets (S in total) containing randomly selected majority class instances, together with all minority examples. Each subset contains the same amount of instances from both classes and is fed to the boosting algorithm presented in Algorithm 3. Afterwards, the weak learners hs,t of each subset s are simply combined to form the final ensemble:

    H(x) = sign(Σ_{s=1}^{S} Σ_{t=1}^{T} α̃s,t hs,t(x))  with s = 1, ..., S; t = 1, ..., T    (5)

It is clear that this technique benefits from a combination of bagging and boosting (Liu et al 2009). Note that we apply the same boosting algorithm as previously described to each of the balanced subsets. However, when rAB = 1 in the first round of boosting, we quit the boosting process, put α1 = 1 and continue to use the trained LR-model in the final ensemble. It was previously noted that this can cause overfitting, in the sense that it can pinpoint to a wrong threshold. However, the LR-model still contains information and the combination thereof with the models obtained from the other subsets remains valuable (as we have noted from initial experiments, by comparing the situation where we include or reject those subsets).

There are a few subtle, though important, differences with respect to the experiments performed by Liu et al (2009). First of all, the authors use a CART-algorithm (Breiman et al 1984) as base learner. We employ a linear SVM with subsequent LR as a weak learner. To our knowledge, this combination has not been proposed elsewhere, yet proves to be very efficient in this setting. Secondly, their methodology employs a discrete version of AdaBoost, where each weak learner outputs binary values in {−1,1}. As already stated, we make use of an improved version of AdaBoost that relies on confidence-rated predictions (Schapire and Singer 1999). Finally, the authors reported performances with fixed levels of S = 4 and T = 10. We will investigate the effect of varying S and T levels.
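The subset sampling and the ensemble combination (5) can be sketched as follows, with each hypothesis represented by an (α, h) pair. The threshold rules below are toy stand-ins; in the paper each hs,t is a boosted SVM-LR model:

```python
import random

def easy_ensemble_score(x, subset_models):
    """Combine boosted learners from S balanced subsets, Equation (5):
    score(x) = Σ_s Σ_t α_{s,t}·h_{s,t}(x); the prediction is sign(score)."""
    return sum(alpha * h(x)
               for models in subset_models   # one list of learners per subset s
               for alpha, h in models)       # T weighted learners per subset

def sample_balanced_subset(X_maj, X_min, rng):
    """One EasyEnsemble subset: a random majority sample of minority size,
    plus all minority instances."""
    return rng.sample(X_maj, len(X_min)) + list(X_min)

# Toy stand-in learners: threshold rules on a 1-d input.
models = [
    [(0.7, lambda x: 1 if x > 0.0 else -1)],   # subset 1 (T = 1)
    [(0.3, lambda x: 1 if x > 0.5 else -1)],   # subset 2 (T = 1)
]
score = easy_ensemble_score(0.2, models)       # 0.7·(+1) + 0.3·(−1) ≈ 0.4
```

Since each subset is only twice the minority class size and the S boosting runs are independent, the sampling and training loop parallelizes trivially, which is the speed argument made in the abstract.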


4 Results and discussion

4.1 Datasets

Stankova et al (2015) provide the first large collection of benchmark behaviour datasets for classification. In our experiments, we make use of these data sources and extend this repository with two additional datasets. Each of these datasets shows a bipartite structure with a clear target variable to predict. We refer to this study and the next paragraph for a short description of the available data resources. In this paragraph, we indicate why we have chosen to include or reject certain data sources from the aforementioned study. The available datasets can be divided into small, medium and large datasets, based on the number of instances and the number of features present. The Norwegian companies and Reality Mining datasets comprise the small datasets, since they contain only a few hundreds of instances or features, and are therefore regarded as impractical for our purposes. The MovieLens, Yahoo, TaFeng, Book-Crossing and LibimSeTi datasets belong to the medium-sized datasets, each containing a few thousands up to a few hundreds of thousands of instances and features. All of these datasets are included in our study. The large datasets, containing hundreds of thousands up to millions of instances and features, are the Flickr and Kdd databases.12 Other large proprietary data sources, not included in Stankova et al (2015), are the corporate residence fraud (CRF) and banking (Bank) datasets, which arise from real-life application domains with intrinsic imbalance.

To summarize, we have gathered datasets containing such fine-grained behaviour data from a wide variety of application domains. The MovieLens datasets, for which we are predicting the gender13 (Mov G) or the genre thriller14 (Mov Th), provide data on which films each user has rated. The Yahoo movies15 dataset has a similar structure, where the age of each user, Yahoo A (above or below average), or the gender, Yahoo G, is being predicted. The TaFeng dataset16 contains data on shopping behaviour, where age (below or above average) is being predicted based upon which products are purchased. In the Book-Crossing (Book) dataset (Ziegler et al 2005), users rate books and the age of the user (above or below average) is being predicted. LibimSeTi (LST) contains data from a dating site (Brozovsky and Petricek 2007), where users rate each other's profiles and gender is being predicted. In the advertisement (Adver) (Lichman 2013) dataset, we try to predict if a url is an advertisement, based on a large variety of binary features of the url. Note that this dataset does not arise from the behaviour of entities, yet it still has a high-dimensional and sparse representation. The Flickr dataset (Cha et al 2009) contains pictures being marked by users as favorite, and we predict the number of comments on each picture (below

12 Flickr and KDD will be excluded in the comparative study of Section 4.6. This is because some methods are too computationally intensive - especially in combination with the large number of possible parameter combinations - to be applied on these very large data sources. Furthermore, our statistical evidence is already sufficiently strong to conclude significance without these datasets. Having said this, these data sources will be included in the analysis of Section 5.

13 MovieLens 1M dataset from http://grouplens.org/datasets/movielens
14 MovieLens 10M dataset from http://grouplens.org/datasets/movielens
15 https://webscope.sandbox.yahoo.com
16 http://www.bigdatalab.ac.cn/benchmark/bm/dd?data=Ta-Feng


or above average). In the Kdd cup data, the performance of a student on a test is being predicted, based on artificially created binary features (Yu et al 2010). The corporate residence fraud dataset (CRF) (Junqué de Fortuny et al 2014b) contains data on foreign companies making transactions with specific Belgian companies, where we try to predict whether the foreign company commits residence fraud (a type of fiscal fraud). Finally, the banking dataset (Bank) (Martens et al 2016) contains detailed behaviour on consumers making transactions with merchants or other persons, to predict interest in a pension fund product. Some characteristics of these datasets can be found in Table 2. The features column only shows the number of active features.17

Table 2 Behaviour data characteristics. The final column shows the imbalance ratio p, defined as the ratio of the number of minority class instances to the amount of majority class instances in the training set, expressed as a percentage. See Section 4.2 for details regarding p.

Name      |Xmaj|      |Xmin|     Features    p = 100 × |Xmin|_train / |Xmaj|_train
Mov G     4331        1709       3706        1 & 25
Mov Th    10546       131        69878       1.24 (p = [])
Yahoo A   6030        1612       11915       1 & 25
Yahoo G   5436        2206       11915       1 & 25
TaFeng    17330       14310      23719       1 & 25
Book      42900       18858      282973      1 & 25
LST       59702       60145      166353      1
Adver     2792        457        1555        16.38 (p = []) & 1
CRF       869071      62         108753      0.0072 (p = [])
Bank      1193619     11107      3139570     0.93 (p = [])
Flickr    8166814     3028330    497472      0.1
Kdd       7171885     1235867    19306083    0.5

4.2 Methodology

Regarding the experiments performed in the upcoming sections, we applied a tenfold cross-validation procedure. Each of the folds contains 80% training data, 10% validation data and 10% test data. Note that these percentages are valid for both the majority class and the minority class (stratified sampling). As can be seen from Table 2, some datasets are balanced in nature. We created artificial imbalance for these datasets by removing minority class instances from the initial training set, according to a user-defined parameter p. We ensured that the amount of minority training instances corresponds to p percent of the majority class training size: |Xmin|_train = (p/100) × |Xmaj|_train. As an example, say that we are using the Book dataset with p = 25. In that case, we know that the majority class contains 34320 training instances (80% of |Xmaj| = 42900). The minority training data would contain 8580 instances (25% of 34320). When the dataset is already imbalanced, we define p = [], which means that no downsampling

^17 Active features represent features that are present for at least one instance in the dataset. A non-active feature corresponds with a column of zeros in the matrix representation and would not contribute to the model.


of the minority class training data is performed. Note that the validation and test data are left untouched.^18
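The downsampling rule above is simple enough to sketch directly; the snippet below is a minimal illustration (the function and variable names are ours, not the paper's):

```python
import random

def make_imbalanced(maj_train, min_train, p, seed=0):
    """Downsample the minority TRAINING fold so that its size equals p percent
    of the majority training size; validation and test folds are untouched."""
    rng = random.Random(seed)
    n_min = int(round(p / 100.0 * len(maj_train)))
    n_min = min(n_min, len(min_train))          # cannot keep more than we have
    return maj_train, rng.sample(min_train, n_min)
```

With the Book figures above (34320 majority training instances, p = 25), this yields the 8580 minority training instances mentioned in the text.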

The methods detailed in Section 3 are applied on the training data. For both the under- and oversampling approaches, a linear SVM is trained on the newly created balanced training data, with regularization parameter C having values

C = [10^-7, 10^-5, 10^-3, 10^-1, 10^0].

The validation data are used for parameter tuning purposes. The test data allow us to obtain the generalization performance. The results reported show the average over ten folds. In the remaining paragraphs of this section, we will describe the various parameter settings used in our experiments and give a brief overview of the parameters occurring in each method.

Considering the oversampling techniques, the parameter settings are as follows:

β = [0, 1/3, 2/3, 1]
prior_opt = {FlipCoin, Reverse Priors}
sim_measure = {Cosine, Jaccard}
K = [10^0, 10^1, 10^2, |X_min|_train]

We didn't include the "Prior" option, due to initial experiments showing a lower performance in comparison with the other options. This can be explained by the low priors occurring in each column, resulting in synthetic samples that mainly show zeros in 0-1 match situations. The oversampling with replacement method (OSR) only uses the β parameter, SMOTE uses all but the K parameter, and ADASYN uses all of the parameters listed above.
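As an illustration of the simplest of these techniques, the following sketch implements oversampling with replacement (OSR). The mapping from β to the target minority size (β = 1 meaning a fully balanced training set) is our reading of the set-up, not code from the paper:

```python
import random

def oversample_with_replacement(X_maj, X_min, beta, seed=0):
    """OSR sketch: replicate random minority instances (with replacement)
    until the minority size has moved a fraction `beta` of the way from its
    original size towards the majority size (beta = 1 -> fully balanced)."""
    rng = random.Random(seed)
    target = len(X_min) + int(round(beta * (len(X_maj) - len(X_min))))
    extra = [rng.choice(X_min) for _ in range(target - len(X_min))]
    return X_min + extra
```

Because every added sample is an exact copy of an existing minority instance, OSR adds no new feature values; the synthetic approaches (SMOTE, ADASYN) instead interpolate between minority neighbours.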

With respect to the undersampling techniques, the following parameter settings are used:

β_u = [0, 1/4, 1/2, 3/4, 1]
sim_measure = {Cosine, Jaccard}
K = [10^0, 10^1, 10^2, |X_min|_train]
Clust_opt = {C_Smallest, C_Largest}

The random undersampling (RUS) technique only uses the β_u parameter. The second set of methods, "Closest tot sim" and "Farthest tot sim", uses β_u and sim_measure. The third set of techniques, "Closest Knn" and "Farthest Knn", makes use of all the parameters listed above except for Clust_opt. The final approach, CBU, employs β_u and Clust_opt.
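The "Farthest Knn" idea can be sketched as follows for sparse instances stored as feature-to-value dicts. The exact scoring rule (mean cosine similarity to the K most similar minority instances) and the mapping from β_u to the number of removed instances are our assumptions:

```python
import math

def cosine(u, v):
    """Cosine similarity between sparse vectors stored as {feature: value} dicts."""
    dot = sum(w * v.get(f, 0.0) for f, w in u.items())
    nu = math.sqrt(sum(w * w for w in u.values()))
    nv = math.sqrt(sum(w * w for w in v.values()))
    return dot / (nu * nv) if nu and nv else 0.0

def undersample_farthest_knn(X_maj, X_min, beta_u, K=1):
    """'Farthest Knn' sketch: score each majority instance by the mean cosine
    similarity to its K most similar minority instances, then drop the
    beta_u-controlled number of LOWEST-scoring (farthest) majority instances,
    which often include noise/outliers."""
    def score(x):
        sims = sorted((cosine(x, m) for m in X_min), reverse=True)
        return sum(sims[:K]) / K
    ranked = sorted(X_maj, key=score, reverse=True)            # most similar first
    n_remove = int(round(beta_u * (len(X_maj) - len(X_min))))  # beta_u = 1 -> balanced
    return ranked[:len(X_maj) - n_remove]
```

Flipping the retention order (keeping the lowest-scoring instances) gives the corresponding "Closest Knn" variant.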

The boosting variants (AdaBoost, AdaCost and EasyEnsemble) presented in Section 3.3 make use of the following settings:

T = 30
μ = [100, 75]
C = [10^-7, 10^-5, 10^-3, 10^-1]
R = [2, 8, R_L], where R_L = |X_maj|_train / |X_min|_train
S = [5, 10, 15]

The AdaBoost algorithm includes the T, μ and C parameters. AdaCost additionally uses cost-ratios R. We have chosen a range of values because misclassification costs are unknown for many business applications (He and Garcia 2009; Fan et al 1999; Sun et al 2007). The final value R_L seems to be a popular choice (Akbani et al 2004; Luts et al 2010), because the total weight on the majority class then balances with the total weight on the minority class. The final method, EasyEnsemble, uses S subsets in addition to the parameters previously mentioned for AdaBoost. Note that we consider the boosting iteration round t ∈ [1, T] as a tunable parameter.^19

^18 This means that if we start from a balanced set, only the training data will show artificial imbalance according to the imbalance ratio p. The validation and test data would remain balanced. Since AUC (and some other metrics) is independent of class skew, it would be unwise to make these sets imbalanced as well, because that would lead to discarding minority class instances that are relevant for performance assessment.

4.3 Oversampling

The oversampling techniques presented in Section 3.1 are applied to each of the datasets from Table 2 with varying imbalance levels (by varying the p-parameter). The experiments we conducted use the following methodology: for each dataset, we apply the oversampling techniques with all possible parameter combinations on the training data to create newly balanced datasets, after which linear SVMs are trained. The optimal parameter combination with respect to each imbalance ratio β is selected based on validation set AUC-performance. With these parameters, the AUC on the test data is obtained. Results show the average over ten folds.

The results on four arbitrarily selected datasets are shown in Table 3. Full results on each of the data sources can be found in Appendix A, Table 11. From these tables, we can conclude that performance generally increases with growing β-values. It is more correct to say that performance keeps improving with growing β-levels until an optimal point β is reached. Increasing the balance level after this optimal value will cause only small fluctuations with respect to the optimal performance. Traditional studies dealing with dense, low-dimensional data note that the OSR method can suffer from overfitting, as already mentioned in Section 3.1. It is interesting to see that in this sparse, high-dimensional setting, this effect doesn't seem to occur.^20 Furthermore, the computationally expensive synthetic sampling approaches do not seem to improve performance over the simpler OSR method. This can be explained by the fact that many instances seem to have a very limited number of non-zero elements (Junque de Fortuny et al 2014a; Stankova et al 2015), causing the synthetic sampling procedures to be limited in the number of unique new samples they can produce. Note also that synthetic sampling procedures have historically been designed to overcome the overfitting behaviour of OSR, which no longer seems to apply here.

^19 We sometimes make use of T as a symbol to indicate the boosting iteration number instead of t.
^20 The SVM regularization parameter C controls for overfitting. Additionally, the influence of majority class noise/outliers (see Section 3.2) on the learned hyperplane decreases as we oversample the minority class. This hyperplane is more sensitive towards minority instances, and a change in its direction/orientation towards these majority class noise/outliers will be more heavily penalized.


Table 3 Oversampling experiments. Results show the average tenfold AUC test set performance with respect to increasing β-values (β = [β1, β2, β3, β4] = [0, 1/3, 2/3, 1]). Additionally, standard deviations are shown between brackets. Optimal performances for each method are highlighted. See Section 4.2 for parameter settings.

Mov Th (p = [])
          β1            β2            β3            β4
OSR       79.77 (5.33)  85.3 (4.66)   83.16 (4.5)   84.59 (5.69)
SMOTE     79.77 (5.33)  84.18 (6.51)  85.58 (5.97)  84.33 (5.75)
ADASYN    79.77 (5.33)  84.11 (6.77)  85.86 (6.06)  85.36 (5.13)

Yahoo G (p = 25)
          β1            β2            β3            β4
OSR       78.82 (1.39)  78.78 (2.02)  78.74 (1.86)  78.35 (1.47)
SMOTE     78.82 (1.39)  79.23 (1.57)  79.1 (1.2)    79.03 (1.89)
ADASYN    78.82 (1.39)  79.12 (1.43)  79.23 (1.37)  79.51 (2.01)

TaFeng (p = 25)
          β1            β2            β3            β4
OSR       66.94 (1.34)  68.84 (1.21)  66.99 (1.42)  67.7 (1.41)
SMOTE     66.94 (1.34)  68.47 (1.5)   67.07 (1.15)  66.65 (0.81)
ADASYN    66.94 (1.34)  68.62 (1.38)  67.85 (1.6)   66.91 (1.39)

Book (p = 25)
          β1            β2            β3            β4
OSR       60.08 (0.71)  61.05 (0.94)  61.82 (1.12)  64.62 (0.57)
SMOTE     60.08 (0.71)  62.6 (0.73)   60.95 (0.68)  63 (0.8)
ADASYN    60.08 (0.71)  62.33 (0.73)  60.77 (0.85)  63.04 (0.58)

4.4 Undersampling

Regarding the undersampling techniques, we employ a similar experimental set-up as in the previous section: firstly, train SVMs on the undersampled training data with all possible parameter combinations; secondly, choose a suitable parameter combination based on validation set AUC-performance; and finally, obtain the AUC-performance on the test set. The results are averaged across ten folds.

Before going into detail on the experimental findings, we give a short note on the effect of noise/outliers on SVM performance. Consider the following imaginary example: say that a majority class female is rating a lot of action films and thrillers. We can consider this as an outlier, since most of the females are rating romantic or drama films. This specific female has far more in common with the minority class males, who also rate action films and thrillers. Outliers are therefore instances showing odd behaviour; noise are wrongfully labelled instances. The effect of noise/outliers on SVM performance can be severe. Indeed, many of the instances in the dataset have a support value (dual variable α_i) of 0. Instances contributing to the model (a non-zero support value) are examples close to the borderline (Suykens et al 2002) and noise/outliers (as can be derived from the KKT dual-complementarity conditions^21).

^21 If α_i = 0, then y_i(w^T x_i + b) ≥ 1. For noise/outliers, the term y_i(w^T x_i + b) is negative, hence α_i ≠ 0.


With respect to noise/outliers, the support value can reach a maximal value of C, the chosen regularization parameter in the SVM formulation. It should be clear that lowering the C-value leads to models that are less sensitive to noise/outliers.

The experimental outcomes of the various undersampling processes are shown in Table 4 for the same four previously selected datasets. We refer to Appendix B, Table 12 for results on the entire data repository. When removing only a limited number of majority class instances (β_u = β_u2 = 1/4), we observed that in 12 out of 16 datasets, the "Farthest Knn" method outperforms the "Closest Knn" technique. The four remaining datasets show comparable performances.^22 This finding shows that the "Farthest" method is very suitable for removing majority class noise/outliers and empirically shows their performance degrading effect. With higher undersampling rates (β_u = 1), the "Closest Knn" method achieves higher performances than the "Farthest Knn" technique in general, though the results are less clear. We observed that only 9 out of 16 datasets indicate "Closest" as a clear winner, with 3 tie situations and 4 situations where "Farthest" shows the best results. Intuitively, we would expect to see the "Closest" method perform better, since we are keeping the most informative instances (examples close to the minority class). However, this technique will also emphasize noise/outliers, and this is the main reason why the results are less significant. One can also see this from the fact that the RUS method indicates far better results in comparison to the aforementioned techniques when β_u = 1, an observation that is valid for all datasets. The RUS method is efficient in realizing noise/outlier reduction effects, since each SVM is trained with only a fraction of these types of instances. It is remarkable to see that in 9 out of 16 datasets, the RUS method with β_u = 1 outperforms the baseline model (β_u = 0). In 4 datasets we observed equal performance, and 3 losses with respect to the baseline. RUS shows two attractive features: majority class noise/outlier removal and its ability to put more emphasis on the minority class instances by undersampling the majority class. A general downside of the approach is the information loss, and this effect causes the 3 losses to the baseline. Usually though, the attractive features of RUS dominate the latter effect, and this indicates that it is definitely not necessary to include all majority class training instances in constructing the predictive model. Apparently, there is a high level of redundancy, which can be exploited to construct efficient hypotheses.

In the next paragraph, we compare the cluster-based undersampling (CBU) approach to the random undersampling (RUS) technique. Before proceeding, we should note that Table 12 does not show results on the LST, CRF and Bank datasets, due to the fact that the projected unigraph was too large to fit in memory (larger than 16 GB). A limited number of features (top nodes) does seem to be active for a relatively large number of instances (bottom nodes). All these instances will be connected in the unigraph projection, resulting in relatively high degrees for the bottom nodes. A possible solution to circumvent this problem might be to cut the weakest edges in the projected unigraph.

If we apply a limited amount of undersampling (β_u = β_u2 = 1/4), the CBU technique outperforms RUS in 8 out of 13 datasets (4 losses and 1 tie with respect to RUS). On the highly imbalanced datasets (p = 1 or p = []), the CBU method wins

^22 A tie occurs in the situation where the absolute difference in AUC is smaller than or equal to 0.5.


in 8 out of 8 cases. We can therefore conclude CBU to outperform RUS at low undersampling rates. This indeed shows that handling the within-class imbalance can be beneficial in this environment. If we make a comparison at the highest undersampling rate, β_u = β_u5 = 1, the results are completely opposite: CBU wins in only 1 case, with 9 losses with respect to RUS and 3 ties. On the highly imbalanced datasets, CBU outperforms RUS for 1 dataset (with 4 losses and 3 ties). It is clear that RUS is the preferred technique in case of high undersampling rates. The reason is that CBU emphasizes the small disjuncts and basically fails to adequately pick up on the common, larger communities. Furthermore, these small disjuncts might actually correspond with majority class noise/outliers, and CBU is then focusing on these types of instances. On the overall level, where we consider all undersampling rates (except β_u = 0), CBU wins in 5 out of 13 datasets, RUS wins in 4 cases and there are 4 tie situations (yet RUS did outperform CBU in all 4 cases). Both methods seem to be competitive with one another across a wide range of imbalance levels. If we pay particular attention to the highly imbalanced sets, CBU wins in 5 cases and there are 3 tie situations on the remaining sets (yet RUS did outperform CBU in all 3 cases). CBU seems to be the method of choice (with respect to RUS) when the datasets show high imbalance levels.
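A minimal sketch of the cluster-based idea, assuming the majority class has already been partitioned into clusters and using an equal-allocation rule of our own choosing (so that small disjuncts stay represented alongside the large communities):

```python
import random

def cluster_based_undersample(clusters, n_keep, seed=0):
    """CBU sketch: given the majority class partitioned into clusters
    (a list of lists), draw roughly the same number of instances from each
    cluster, so small disjuncts are not drowned out by large communities."""
    rng = random.Random(seed)
    per = max(1, n_keep // len(clusters))       # equal allocation per cluster
    kept = []
    for c in clusters:
        kept.extend(rng.sample(c, min(per, len(c))))
    return kept[:n_keep]
```

The trade-off discussed above falls out of this allocation: singleton clusters (small disjuncts, possibly noise/outliers) are always represented, which helps at low undersampling rates but hurts when very few majority instances are retained.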

Table 4 Undersampling experiments. Results show the average tenfold AUC test set performance with respect to increasing β_u-values (β_u = [β_u1, β_u2, β_u3, β_u4, β_u5] = [0, 1/4, 1/2, 3/4, 1]). Optimal performances are highlighted for each imbalance ratio. Additionally, standard deviations are shown between brackets. Cl K represents the "Closest Knn" technique, Cl T the "Closest tot sim" technique (similar for Far K and Far T, see Section 3.2). Parameter settings are given in Section 4.2.

Mov Th (p = [])
        β_u1          β_u2          β_u3          β_u4          β_u5
RUS     79.77 (5.3)   80.32 (5.8)   81.57 (5.5)   81.86 (6.6)   81.26 (6.2)
Cl K    79.77 (5.3)   79.25 (4.5)   78.07 (5)     76.25 (6.5)   62.46 (8.5)
Cl T    79.77 (5.3)   78.4 (4.4)    72.41 (3.5)   64.66 (4.5)   60.37 (7.3)
Far K   79.77 (5.3)   84.54 (5)     83.64 (6.4)   80.02 (7.3)   56.82 (10.3)
Far T   79.77 (5.3)   85.03 (5.7)   82.68 (6.8)   75.61 (9.2)   56.77 (10.9)
CBU     80.11 (5.8)   81.17 (6)     81.08 (6.5)   84.17 (5.1)   80.96 (6.9)

Yahoo G (p = 25)
        β_u1          β_u2          β_u3          β_u4          β_u5
RUS     78.82 (1.4)   78.91 (1.6)   78.97 (1.6)   78.61 (1.6)   77.82 (2.1)
Cl K    78.82 (1.4)   77.26 (1.5)   72.52 (1.5)   67.86 (2)     65.07 (2.7)
Cl T    78.82 (1.4)   76.83 (1)     71.99 (1.8)   67.15 (2.3)   61.1 (2.7)
Far K   78.82 (1.4)   78.26 (2.2)   74.69 (2.7)   67.22 (2.1)   60.72 (2.3)
Far T   78.82 (1.4)   77.68 (2.6)   72.44 (3)     64.94 (2.4)   59.6 (2)
CBU     75.25 (3.2)   75.22 (2.4)   74.69 (2.3)   73.07 (2.4)   70.69 (2.4)

TaFeng (p = 25)
        β_u1          β_u2          β_u3          β_u4          β_u5
RUS     66.94 (1.3)   67.44 (1.3)   68.1 (1.4)    68.27 (1.4)   66.13 (1.2)
Cl K    66.94 (1.3)   66.13 (1.4)   63.39 (1.2)   59.83 (1.3)   56.94 (0.7)
Cl T    66.94 (1.3)   66.38 (1.5)   62.89 (1.6)   57.46 (1.3)   54.56 (1.3)
Far K   66.94 (1.3)   68.06 (1.4)   66.43 (1.6)   64.46 (1.5)   63.35 (1.3)
Far T   66.94 (1.3)   64.31 (1.1)   62.69 (1)     61.27 (1.1)   59.03 (1)
CBU     64.81 (1.2)   64.15 (1.1)   64.13 (1.2)   63.88 (0.8)   63.46 (0.8)

Book (p = 25)
        β_u1          β_u2          β_u3          β_u4          β_u5
RUS     60.08 (0.7)   60.13 (0.6)   60.4 (0.8)    60.33 (0.8)   63.28 (0.8)
Cl K    60.08 (0.7)   59.96 (0.7)   60.13 (0.8)   59.96 (1)     59.28 (0.7)
Cl T    60.08 (0.7)   59.96 (0.7)   60.13 (0.8)   60.29 (0.4)   54.5 (0.9)
Far K   60.08 (0.7)   63.29 (1)     64.19 (0.8)   57.3 (1.1)    55.66 (1.1)
Far T   60.08 (0.7)   62.14 (0.5)   58.27 (0.6)   56.37 (1)     55.66 (1.1)
CBU     54.82 (0.9)   54.67 (0.9)   54.71 (0.9)   54.66 (1)     54.78 (0.9)

4.5 Boosting variants

In this section, the plain AdaBoost (AB) method is compared with the cost-sensitive AdaCost (AC) algorithm, the EasyEnsemble (EE) technique and the baseline (BL) method (where we train a single SVM on the imbalanced training data). AB, AC and EE have an underlying boosting process in common. The AUC-values are stored during each boosting iteration, which allows us to show the performance graphically as a function of the number of iterations t. In EasyEnsemble (EE), we combine the weak learners of each subset by summing their individual contributions. For example, when t = 2 boosting iterations are performed, we evaluate the performance of the combined learner

∑_{s=1}^{S} ∑_{t=1}^{2} α_{s,t} h_{s,t}(x).

Figures 1, 2, 3 and 4 show the average tenfold AUC-performance (with μ = 100) on the test data. The left-side figures show the AB, AC (with varying R-values) and EE (with varying S-values) performance with respect to the number of boosting iterations. The C-value is tuned according to highest validation set AUC-performance (over all possible boosting iterations). The right-side figures investigate the influence of the C-parameter for AB and EE (with S = 15) and allow us to gain insight on the effect of weak and strong learners on the generalization performance. Results on the remaining datasets are given in Appendix C, Figures 6-17. Note that we only indicate results with weight-percentage μ = 100 (use all instances in the training process); previous experiments (with μ = 75) showed inferior results. This suggests that it is indeed important to include all minority instances during the training of each weak learner, an observation also made by Liu et al (2009).

When investigating the influence of the regularization parameter C in AdaBoost, we can graphically verify that it is a "weakness" indicator, in the sense that lower C-values correspond to weaker learners. While these low C-value learners have a low performance in the first rounds of boosting, the possible boost in performance is usually much higher compared to higher C-value inducers. Strong learners can already make a suitable distinction between minority and majority class behaviours in the first boosting round. The boosting process will emphasize the hard to learn instances, among which are noise/outliers. Therefore, too strong learners are not suitable to be used in a boosting process, as already noted by Wickramaratna et al (2001). In Figures 2(b) and 3(b), we observe that the lowest C-values (C = 10^-7, 10^-5) can outperform higher C-values (C = 10^-3, 10^-1). In many cases, the AB-process outperforms the baseline (BL), see also the results in Table 5. A combination of several weak learners can often achieve better generalization than a single strong learner.

Inspired by the success of the RUS method with β_u = 1 (for reasons indicated in Section 4.4), the EE-technique tries to overcome an obvious downside of this approach: we are removing a lot of majority class instances that can be very informative. Hence, we try to make up for this loss by combining several of these balanced training sets, in the hope that the majority class is more adequately explored. This results in a very effective learner that seems to outperform the AB, AC and BL methods in the majority of cases, as can be seen from these figures. It is remarkable that we don't need to use that many subsets S to achieve high performances. Increasing the S-value from 5 to 15 doesn't seem to cause major increases in AUC (see also Table 5). The influence of the C-parameter is similar as described in the previous paragraph.
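The subset-combination step can be sketched as follows; `train_boosted` stands for any routine returning a real-valued boosted scorer (the names and interface are ours):

```python
import random

def easy_ensemble(X_maj, X_min, train_boosted, S=5, seed=0):
    """EasyEnsemble sketch: draw S independent majority subsets of size
    |X_min| (each training set is balanced and only twice the minority size),
    boost on each, and combine all hypotheses by summing their scores."""
    rng = random.Random(seed)
    scorers = []
    for _ in range(S):
        subset = rng.sample(X_maj, len(X_min))        # random balanced subset
        scorers.append(train_boosted(subset, X_min))  # boosted scorer per subset
    return lambda x: sum(f(x) for f in scorers)       # sum over all subsets
```

Since the S subsets are independent, the loop parallelizes trivially, which is the source of the speed advantage noted in the abstract.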

[Figure 1: two panels plotting average tenfold test-set AUC against the boosting iteration T (0 to 30); panel (a) compares AB, AC and EE variants with the BL, panel (b) compares AB and EE at C = 10^-7, 10^-5, 10^-3, 10^-1 with the BL]

Fig. 1 Mov G (p = 25) dataset: results showing average tenfold AUC-performance on test data (with μ = 100) for (a) AB, AC and EE, with C chosen according to highest validation set AUC-performance (over all possible T); (b) AB and EE (S = 15) with varying C-levels


[Figure 2: same layout and axes as Figure 1]

Fig. 2 Book (p = 25) dataset

[Figure 3: same layout and axes as Figure 1]

Fig. 3 TaFeng (p = 25) dataset

[Figure 4: same layout and axes as Figure 1]

Fig. 4 Bank (p = []) dataset


4.6 Final comparison

4.6.1 Performance-wise comparison

In this section, we wish to determine which of the aforementioned methods shows superior performance in terms of AUC on test data. The experimental set-up is described in Section 4.2. Note that we exclude β = 0 and β_u = 0 in determining the performance of the oversampling respectively undersampling techniques, to be able to compare them with the baseline (BL) approach.^23 The results for AB, AC and EE are shown for μ = 100. The number of boosting iterations t ∈ [0, T] is also considered as a tunable parameter. Table 5 shows the final comparison between all methods. Additionally, we added an average rank column, indicating the mean rank of each algorithm across all datasets. The rank of an algorithm for a specific dataset corresponds to its position in the sorted AUC-performance spectrum, where the best performing algorithm gets the rank of 1, the second best rank 2, etc. In case two or more algorithms perform equally well, average ranks are assigned (if two algorithms have ranks of 3 and 4 respectively and show equal performance, then both methods get assigned a rank of 3.5). Note that the LST dataset is excluded in the calculation of average ranks, since we consider this as a dataset not being affected by imbalance and showing equal performances.

As a side note, we noticed that increasing the number of minority training set instances can significantly improve performances. We can see this clearly by comparing the p = 25 outcomes with the results of p = 1 on the same dataset. Junque de Fortuny et al (2014a) already showed empirically that larger behaviour datasets (in terms of the number of instances or features) lead to higher AUC-performances. This situation is confirmed in our experiments, though we want to add that this is only true when the data size is increased in a balanced fashion.

The results in Table 5 confirm our previous analysis. Each of the oversampling, undersampling (except Cl Knn) and boosting variants seems to improve upon the BL performance. Sobhani et al (2015) note that the direct application of ensemble techniques on imbalanced data does not circumvent the imbalanced learning issue. In our experiments on behaviour data, we do find a simple boosting procedure applied to the imbalanced data to improve upon the predictive performance. The oversampling techniques OSR, SMOTE and ADASYN show similar outcomes and are competitive methods, seeming to outperform the undersampling techniques and the plain boosting (AB) and cost-sensitive learning (AC) approaches. When dealing with traditional data, the performances of oversampling techniques are generally worse than those of undersampling approaches (Drummond and Holte 2003). It is interesting to observe that for these large and sparse behaviour datasets, the opposite is true. This is heavily related to the intrinsic properties of the data under consideration. Undersampling (or feature selection, for that matter) is not recommended because, as we have noted in Section 2.1, each feature provides a small additional contribution to the predictive performance. With respect to AC, the results indicate that R = R_L is a better choice compared to more random cost ratios R = 2, 8. The EE-technique has the

^23 The BL technique trains single SVMs on the imbalanced training data.


lowest average rank and is our best performing method. The combination of several balanced subsets with filtered noise/outlier contributions allows one to explore the majority class space efficiently. To strengthen our previous statements, we will in the next paragraphs conduct statistical significance tests in accordance with the procedure outlined in Demsar (2006). The latter paper presents a framework to perform statistical comparisons of classifiers over multiple datasets.

The first null-hypothesis we try to reject postulates that each of the algorithms performs equally well, or equivalently, that their average ranks R_j (see Table 5) are indifferent. Under the null-hypothesis, the Friedman statistic (Friedman 1937)

\chi_F^2 = \frac{12N}{k(k+1)} \left[ \sum_{j=1}^{k} R_j^2 - \frac{k(k+1)^2}{4} \right],    (6)

with N the number of datasets (N = 15) and k the number of algorithms (k = 13), can be used to compute the statistic (Iman and Davenport 1980)

F_F = \frac{(N-1)\chi_F^2}{N(k-1) - \chi_F^2}.    (7)

The latter is distributed according to the F-distribution with k-1 and (k-1)(N-1) degrees of freedom. Evaluating this measure with the average ranks shown in Table 5 results in F_F = 22.98, which is a lot higher than the critical value of 1.81 at the α = 0.05 significance level. We can therefore reject the null-hypothesis and proceed with post-hoc tests.

The Nemenyi test (Nemenyi 1963) allows one to compare each classifier to one another and adjusts the critical value to compensate for making k(k-1)/2 comparisons.^24 "The performance of two classifiers is significantly different if the corresponding average ranks differ by at least the critical difference (CD)" (Demsar 2006). We refer to the aforementioned paper to calculate the CD. The result of these comparisons at the α = 0.05 significance level is shown in Table 6. This test is already quite revealing, in the sense that, for instance, the oversampling methods (OSR, SMOTE and ADASYN), AC (R = R_L) and the EE-techniques are significantly different from the BL. Yet it does not indicate that undersampling or applying plain boosting (AB) results in significantly different performances compared to the BL. The Nemenyi test is still conservative, and we therefore compare all classifiers to a single control classifier in the next paragraphs. Note that we only make k-1 comparisons in that situation, resulting in stronger results compared to the Nemenyi test.

When comparing a single classifier i with a control classifier c, the null-hypothesis states that the two classifiers i and c perform equally well. Under this hypothesis, the test statistic (based on the average ranks R_i shown in Table 5)

z = \frac{R_i - R_c}{\sqrt{k(k+1)/(6N)}},    (8)

is distributed according to a standard normal distribution. This z-value is used to find the corresponding probability (p-value) p from the cumulative standard normal

^24 The larger the number of comparisons, the higher the proportion of null-hypotheses that are wrongfully rejected due to random chance.


Table 5 Final comparison of all methods in terms of AUC test set performance; see Section 4.2 for details on the experimental set-up. β ≠ 0 and β_u ≠ 0 for the oversampling respectively undersampling techniques; μ = 100 for each of the boosting variants. Standard deviations are given between brackets and the performance difference with the BL is shown in square brackets.

             Mov G (p = 1)        Mov G (p = 25)       Mov Th (p = [])      Yahoo A (p = 1)
BL           71.6 (2.62) [0]      81.41 (1.32) [0]     79.77 (5.33) [0]     55.92 (2.97) [0]
OSR          75.35 (2.27) [3.8]   83.76 (2.09) [2.3]   85.13 (6.1) [5.4]    60.05 (2.71) [4.1]
SMOTE        76.16 (2.27) [4.6]   83.7 (2.1) [2.3]     85.67 (4.98) [5.9]   60.1 (3) [4.2]
ADASYN       76.07 (2.26) [4.5]   83.63 (2.04) [2.2]   85.65 (5.6) [5.9]    59.9 (2.99) [4]
RUS          72.88 (2.73) [1.3]   81.52 (2.15) [0.1]   82.91 (7.19) [3.1]   57.04 (1.77) [1.1]
Cl Knn       71.43 (1.36) [-0.2]  80.88 (1.19) [-0.5]  78.87 (4.71) [-0.9]  55.78 (2.71) [-0.1]
Far Knn      71.9 (2.95) [0.3]    80.9 (1.48) [-0.5]   84.07 (4.64) [4.3]   57.2 (1.33) [1.3]
CBU          74.17 (2.36) [2.6]   81.51 (1.04) [0.1]   82.76 (7.22) [3]     58.77 (3.43) [2.8]
AB           71.65 (1.73) [0.1]   84.52 (1.89) [3.1]   82.43 (5.18) [2.7]   58.35 (2.62) [2.4]
AC (R=2,8)   71.61 (2.46) [0]     83.46 (1.82) [2]     83.27 (5.6) [3.5]    57.72 (2.47) [1.8]
AC (R=R_L)   74.65 (2.7) [3.1]    83.35 (2.09) [1.9]   85.41 (4.49) [5.6]   59.47 (2.33) [3.5]
EE (S=10)    76.04 (2.66) [4.4]   85.05 (1.85) [3.6]   86.1 (5.78) [6.3]    59.66 (3.13) [3.7]
EE (S=15)    76.12 (2.88) [4.5]   85.14 (1.86) [3.7]   86.42 (5.86) [6.7]   59.76 (2.93) [3.8]

             Yahoo A (p = 25)     Yahoo G (p = 1)      Yahoo G (p = 25)     TaFeng (p = 1)
BL           61.68 (2.42) [0]     66.84 (3.66) [0]     78.82 (1.39) [0]     55.75 (1.6) [0]
OSR          64.59 (3.12) [2.9]   73.08 (2.96) [6.2]   78.52 (2.01) [-0.3]  61.21 (2.24) [5.5]
SMOTE        65.56 (3.33) [3.9]   73.11 (3.12) [6.3]   79.01 (1.21) [0.2]   61.72 (1.81) [6]
ADASYN       65.13 (3.38) [3.4]   73.22 (3.17) [6.4]   79.74 (1.68) [0.9]   61.68 (1.86) [5.9]
RUS          64.11 (2.8) [2.4]    70.65 (3.39) [3.8]   78.91 (1.55) [0.1]   59.25 (2.18) [3.5]
Cl Knn       61.14 (2.13) [-0.5]  66.34 (3.54) [-0.5]  77.26 (1.46) [-1.6]  55.77 (1.28) [0]
Far Knn      63.96 (3.03) [2.3]   66.97 (3.54) [0.1]   78.26 (2.2) [-0.6]   59.98 (1.26) [4.2]
CBU          62.27 (1.79) [0.6]   71.27 (2.89) [4.4]   75.22 (2.42) [-3.6]  58.4 (1.57) [2.6]
AB           63.88 (2.67) [2.2]   68.9 (2.03) [2.1]    79.01 (1.66) [0.2]   56.21 (1.79) [0.5]
AC (R=2,8)   64.32 (3.56) [2.6]   68.89 (3.11) [2]     78.99 (1.89) [0.2]   56.33 (1.83) [0.6]
AC (R=R_L)   64.31 (3.03) [2.6]   73.13 (2.8) [6.3]    78.41 (2) [-0.4]     61.6 (2.26) [5.9]
EE (S=10)    66.51 (3.24) [4.8]   72.61 (3.15) [5.8]   80.52 (1.6) [1.7]    61.2 (1.82) [5.4]
EE (S=15)    66.36 (3.18) [4.7]   73.48 (2.32) [6.6]   80.54 (1.56) [1.7]   61.13 (1.83) [5.4]

             TaFeng (p = 25)      Book (p = 1)         Book (p = 25)        LST (p = 1)
BL           66.94 (1.34) [0]     52.6 (1.29) [0]      60.08 (0.71) [0]     99.99 (0.01) [0]
OSR          68.77 (1.23) [1.8]   55.87 (1.42) [3.3]   64.62 (0.57) [4.5]   99.99 (0.01) [0]
SMOTE        68.47 (1.5) [1.5]    55.07 (0.88) [2.5]   62.96 (0.82) [2.9]   99.99 (0.01) [0]
ADASYN       68.48 (1.47) [1.5]   55.04 (0.91) [2.4]   63.02 (0.57) [2.9]   99.99 (0.01) [0]
RUS          68.28 (1.39) [1.3]   54.26 (0.92) [1.7]   63.28 (0.8) [3.2]    99.98 (0.01) [0]
Cl Knn       66.13 (1.43) [-0.8]  52.69 (1.3) [0.1]    60.02 (0.79) [-0.1]  99.99 (0.01) [0]
Far Knn      68.06 (1.41) [1.1]   56.25 (1.52) [3.7]   64.15 (1.12) [4.1]   99.98 (0.01) [0]
CBU          63.84 (1.07) [-3.1]  53.75 (1.01) [1.2]   54.68 (0.88) [-5.4]  []
AB           67.65 (1.55) [0.7]   54.27 (1.95) [1.7]   65 (0.67) [4.9]      99.99 (0.01) [0]
AC (R=2,8)   69.31 (1.23) [2.4]   53.72 (1) [1.1]      61.24 (0.8) [1.2]    99.98 (0.01) [0]
AC (R=R_L)   67.15 (1.51) [0.2]   55.73 (1.22) [3.1]   64.6 (0.64) [4.5]    99.99 (0.01) [0]
EE (S=10)    70.3 (1.35) [3.4]    55.09 (1.29) [2.5]   65.37 (0.61) [5.3]   99.98 (0.01) [0]
EE (S=15)    70.4 (1.3) [3.5]     55.35 (1.26) [2.8]   65.4 (0.51) [5.3]    99.98 (0.01) [0]

             Adver (p = [])       Adver (p = 1)        CRF (p = [])          Bank (p = [])
BL           96.61 (1.82) [0]     90.93 (3.02) [0]     64.06 (16.43) [0]     66.82 (0.88) [0]
OSR          96.93 (1.91) [0.3]   93.3 (2.02) [2.4]    80.74 (12.93) [16.7]  71.39 (0.79) [4.6]
SMOTE        97.05 (1.66) [0.4]   93.35 (2.01) [2.4]   78.7 (16.56) [14.6]   []
ADASYN       96.91 (1.95) [0.3]   93.46 (2.21) [2.5]   78.87 (16.71) [14.8]  []
RUS          96.81 (1.87) [0.2]   92.38 (2.51) [1.5]   83.98 (5.99) [19.9]   69.41 (1.19) [2.6]
Cl Knn       96.4 (1.48) [-0.2]   89.73 (3.42) [-1.2]  76.63 (16.19) [12.6]  66.17 (0.72) [-0.6]
Far Knn      95.77 (1.81) [-0.8]  93.88 (1.78) [3]     83.75 (13.11) [19.7]  66.95 (0.56) [0.1]
CBU          97.15 (1.88) [0.5]   94.18 (2.3) [3.3]    []                    []
AB           97.34 (2.18) [0.7]   91.39 (3.23) [0.5]   77.62 (15.15) [13.6]  66.82 (0.88) [0]
AC (R=2,8)   97.44 (1.93) [0.8]   91 (3.35) [0.1]      68.31 (14.93) [4.2]   67.67 (0.71) [0.9]
AC (R=R_L)   97.46 (1.71) [0.8]   93.51 (2.17) [2.6]   85.08 (9.77) [21]     70.7 (0.8) [3.9]
EE (S=10)    97.64 (1.35) [1]     92.97 (2.75) [2]     86.18 (10.17) [22.1]  71.46 (0.81) [4.6]
EE (S=15)    97.63 (1.35) [1]     93.3 (2.14) [2.4]    86.35 (9.99) [22.3]   71.54 (0.76) [4.7]

Additionally, an average rank column shows the mean rank of each algorithm across all datasets; the LST dataset is excluded for this purpose.

             Average Rank
BL           11.600
OSR          5.000
SMOTE        4.533
ADASYN       4.800
RUS          8.167
Cl Knn       12.467
Far Knn      8.133
CBU          8.567
AB           8.267
AC (R=2,8)   8.467
AC (R=R_L)   5.400
EE (S=10)    3.267
EE (S=15)    2.333

Table 6 Nemenyi test outcomes at the α = 0.05 significance level. A value of 1 indicates that the null-hypothesis is rejected and thus finds the two algorithms significantly different (a value of 0 means we accept the null-hypothesis). Methods are listed in the same order as in Table 5 and are represented more concisely.

      BL  RO  SM  AD  RU  Cl  Fa  CBU  AB  AC1  AC2  EE1  EE2
BL    0   1   1   1   0   0   0   0    0   0    1    1    1
RO    1   0   0   0   0   1   0   0    0   0    0    0    0
SM    1   0   0   0   0   1   0   0    0   0    0    0    0
AD    1   0   0   0   0   1   0   0    0   0    0    0    0
RU    0   0   0   0   0   0   0   0    0   0    0    1    1
Cl    0   1   1   1   0   0   0   0    0   0    1    1    1
Fa    0   0   0   0   0   0   0   0    0   0    0    1    1
CBU   0   0   0   0   0   0   0   0    0   0    0    1    1
AB    0   0   0   0   0   0   0   0    0   0    0    1    1
AC1   0   0   0   0   0   0   0   0    0   0    0    1    1
AC2   1   0   0   0   0   1   0   0    0   0    0    0    0
EE1   1   0   0   0   1   1   1   1    1   1    0    0    0
EE2   1   0   0   0   1   1   1   1    1   1    0    0    0

Imbalanced classification in sparse and large behaviour datasets 33

distribution function: p = 2 × min(Φ(z), 1 − Φ(z)). Holm's method (Holm 1979) compares k − 1 classifiers to a single control classifier and sorts the corresponding p-values in ascending order so that p1 ≤ p2 ≤ . . . ≤ pk−1. Each pi is subsequently compared to its associated confidence level25 αcomp = α/(k − i). Holm starts by performing the check p1 < α/(k − 1) and rejects the null-hypothesis if the test passes. It then proceeds with a similar check for i = 2 and continues until a certain null-hypothesis cannot be rejected. The remaining hypotheses are retained.
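The step-down procedure just described can be sketched in a few lines of Python (an illustrative re-implementation, not the authors' code; the function and variable names are our own):

```python
def holm_test(p_values, alpha=0.05):
    """Holm (1979) step-down test.

    p_values: the (unsorted) p-values of the k-1 methods compared against
    a single control classifier. Returns one boolean per method:
    True where the null-hypothesis of equal performance is rejected.
    """
    k = len(p_values) + 1                    # total number of classifiers
    order = sorted(range(len(p_values)), key=lambda i: p_values[i])
    rejected = [False] * len(p_values)
    for step, idx in enumerate(order, start=1):
        alpha_comp = alpha / (k - step)      # adjusted level alpha/(k - i)
        if p_values[idx] < alpha_comp:
            rejected[idx] = True             # reject and test the next p-value
        else:
            break                            # retain this and all remaining hypotheses
    return rejected
```

Applied to the twelve sorted p-values of Table 7 (k = 13), the first six hypotheses are rejected and the procedure stops at the seventh comparison.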

Table 7 shows the result of the Holm test at the α = 0.05 significance level with the BL algorithm as control classifier. The table lists each method according to sorted p-values and shows the z-, p- and αcomp-values. The significance column indicates whether the proposed method is significantly different from the BL and coincidentally matches the result of the Nemenyi test. The p-values of the (Far Knn, RUS, CBU) undersampling methods, AB and AC (R = 28) are only slightly higher than the level required to reject the null-hypothesis. We included a critical significance level αcrit, corresponding to the lowest possible significance level at which the method would be considered significantly different from the BL (if α = αcrit then p = αcomp). In the case of AB, for example, if we were to choose a significance level α = 0.0764 (a confidence level of 92.36%), the associated p-value would be smaller than αcomp and we would conclude26 that AB performs significantly differently from the BL. To summarize: the null-hypothesis of equal performance between the BL and an alternative method proposed in Section 3 (except Cl Knn) is rejected with a confidence level of more than 91% according to Holm's test. Hence, the performance of the BL approach is significantly worse than that of each of the proposed over- and undersampling methods (except Cl Knn), cost-sensitive learning techniques (AC) and boosting variants (AB, EE).

Similar to the previous paragraph, Table 8 presents the results of Holm's test at the α = 0.05 significance level with EE(S = 15) as control classifier. The results coincidentally correspond with the result of the Nemenyi test and indicate that the oversampling (OSR, SMOTE and ADASYN), AC(R = RL) and EE(S = 10) methods are not statistically different from the EE(S = 15) technique at the 0.05 significance level. However, if we were to adapt the significance level to approximately α = 0.25 (a 75% confidence level), the results would conclude significance (except for EE(S = 10)). To summarize: the null-hypothesis of equal performance between EE(S = 15) and an alternative method proposed in Section 3 (except EE(S = 10)) is rejected with a confidence level of more than 75% according to Holm's test. Hence, the performance of the oversampling, undersampling, AB and AC approaches is significantly worse than that of EE(S = 15). We can therefore safely conclude EE to be the preferred method in terms of AUC performance when dealing with imbalanced behaviour data.

25 αcomp adjusts the value of α to compensate for multiple comparisons.
26 This is not entirely true, since Holm's method requires Far Knn to pass the test first before proceeding, which only occurs at the 0.088662 significance level.


Table 7 Holm test at the α = 0.05 significance level with BL reference. The table shows the z test statistic with associated p-value, αcomp = α/(k − i), with k the number of algorithms (k = 13) and i corresponding to the position of the method in the sorted p-value vector. Since we already ranked the p-values in ascending order, i takes on the row number (e.g. OSR has i = 5). The column "significant" denotes if we can reject the null-hypothesis (significant: p < αcomp). αcrit corresponds with the smallest possible significance level where we would decide to reject the null-hypothesis (αcrit = α · p/αcomp).

             z         p         αcomp     significant  αcrit

EE(S = 15)   -6.51642  7.2E-11   0.004167  1            8.64E-10
EE(S = 10)   -5.86009  4.63E-09  0.004545  1            5.09E-08
SMOTE        -4.96936  6.72E-07  0.005     1            6.72E-06
ADASYN       -4.78183  1.74E-06  0.005556  1            1.56E-05
OSR          -4.64119  3.46E-06  0.00625   1            2.77E-05
AC(R = RL)   -4.35991  1.3E-05   0.007143  1            9.11E-05
Far Knn      -2.4378   0.014777  0.008333  0            0.088662
RUS          -2.41436  0.015763  0.01      0            0.078815
AB           -2.34404  0.019076  0.0125    0            0.076305
AC(R = 28)   -2.20339  0.027567  0.016667  0            0.082701
CBU          -2.13307  0.032919  0.025     0            0.065837
Cl Knn       0.609449  0.542227  0.05      0            0.542227

Table 8 Holm test at the α = 0.05 significance level with EE(S = 15) reference.

             z         p         αcomp     significant  αcrit

Cl Knn       7.12587   1.03E-12  0.004167  1            1.24E-11
BL           6.516421  7.2E-11   0.004545  1            7.92E-10
CBU          4.383348  1.17E-05  0.005     1            0.000117
AC(R = 28)   4.313027  1.61E-05  0.005556  1            0.000145
AB           4.172384  3.01E-05  0.00625   1            0.000241
RUS          4.102063  4.09E-05  0.007143  1            0.000287
Far Knn      4.078623  4.53E-05  0.008333  1            0.000272
AC(R = RL)   2.156513  0.031044  0.01      0            0.155218
OSR          1.875229  0.060761  0.0125    0            0.243045
ADASYN       1.734587  0.082814  0.016667  0            0.248442
SMOTE        1.547064  0.121848  0.025     0            0.243696
EE(S = 10)   0.65633   0.511612  0.05      0            0.511612

4.6.2 Timing-wise comparison

The performance of each of the proposed algorithms in terms of training times depends highly upon the chosen parameters. The β and βu parameters control the number of training instances and are an important contribution to the subsequent SVM learning times for each of the over- and undersampling techniques. The number of boosting iterations t ∈ [0, T] is also a major factor in each of the boosting variants proposed in Section 3.3. Furthermore, the chosen regularization parameter C highly influences the training times of the SVM inducer: higher C-values result in larger training times (Fan et al 2008). Not only do the chosen hyperparameters affect learning times, the data characteristics, such as the number of minority instances, the imbalance level, the number of features, the amount of overlapping etc., have a major effect as well.


In comparing each of the methods outlined in Section 3, we make use of a similar methodology as previously presented (see Section 4.6.1). For each of the ten folds, a suitable parameter combination is selected based on optimal validation set AUC-performance. These chosen parameters are applied in determining the timings of each method. The final results indicate the average timings across ten folds. Hence, we show the required amount of time for each method when using the parameter combination resulting in the best performance. Note that this might for instance result in using β = 1/3 in case of OSR, or using only t = 5 boosting iterations in EE. Computational timings are shown in Table 9.

As one can observe from the aforementioned table, CBU is the slowest method and relies on computationally intensive clustering procedures. The synthetic SMOTE and ADASYN oversampling approaches are very time-consuming methods. They both rely on computationally expensive nearest neighbour computations in conjunction with a synthetic sample generation procedure that requires a single pass over all minority training instances.

The RUS method shows, as expected, superior results in terms of computational timings. Other undersampling methods perform quite well, though they are slower than BL and OSR due to their inherent nearest neighbour computations.

The boosting variants seem to perform worse than the BL, OSR and each of the undersampling methods. Since EE was the best method in terms of AUC-performance (see Section 4.6.1), we focus on this learner in this paragraph. For the medium sized datasets (all but CRF and Bank), OSR outperforms EE(S = 15). Yet for the large datasets (CRF and Bank) the opposite is true. A SVM is known to have a quadratic computational complexity in the number of training instances (Suykens et al 2002; García and Lozano 2007). For this reason, we consider OSR (or any other oversampling technique) as inappropriate for large datasets. Since with EE each subset has a size limited to twice the minority class training size, we expect increasing benefits of the EE-technique for larger and more imbalanced behaviour data. As Junque de Fortuny et al (2014a) have observed, larger behaviour data, in terms of the number of instances or features, also contributes to an increase in predictive power. Hence, we expect to see this type of big behaviour data more often in real-life applications, where computational resources can become a major constraint. Because the EE-technique is easily parallelizable by simultaneously performing the boosting process for each of the independent subsets, we added an extra row in Table 9: EE par corresponds with the time needed for EE(S = 15) divided by 15, indicating the time that would be required for EE if each subset were trained in parallel. We can see that EE par can outperform OSR in 8 out of 16 cases, precisely for the datasets that show a high imbalance level (e.g. p = 1 or p = []).
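To make the structure of EE and its trivially parallel nature concrete, the following sketch illustrates the sampling scheme. A toy class-centroid scorer stands in for the boosted SVM/LR combination used in the paper; all names are illustrative:

```python
import random

def centroid_scorer(pos, neg):
    """Toy base learner: score along the difference of the class means."""
    d = len(pos[0])
    mu_p = [sum(x[j] for x in pos) / len(pos) for j in range(d)]
    mu_n = [sum(x[j] for x in neg) / len(neg) for j in range(d)]
    w = [mu_p[j] - mu_n[j] for j in range(d)]
    return lambda x: sum(w[j] * x[j] for j in range(d))

def easy_ensemble(minority, majority, S=15, seed=0):
    """Train S independent learners, each on the full minority class plus an
    equally sized random majority sample; average their scores."""
    rng = random.Random(seed)
    scorers = []
    for _ in range(S):
        subset = rng.sample(majority, len(minority))   # undersample the majority
        scorers.append(centroid_scorer(minority, subset))
    return lambda x: sum(f(x) for f in scorers) / len(scorers)
```

Since the S subsets are independent, the loop over subsets can be distributed across workers, which is exactly the assumption behind the EE par row.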

We conclude our comparison of the proposed methods in Figure 5, which shows the average ranks in terms of computational timings (see Table 9) and AUC performance for each of the available data sources. Regarding the ranks with respect to the latter, we added an extra row EE par, having the same AUC-performance as EE(S = 15) for each dataset in Table 5. Ranks are subsequently determined according to the procedure outlined in Section 4.6.1. Additionally, Figure 18 in Appendix D shows the result for the large and highly imbalanced CRF and Bank datasets. We expect the latter type of data to be of major interest for reasons already indicated.


Table 9 Computational timings (in seconds) averaged across ten folds, using the parameter combination resulting in the highest validation set AUC-performance. First and second best performances are emphasized in boldface (third and fourth best are underlined). EE par is the time required for EE(S = 15) divided by 15.

           Mov G(p = 1)  Mov G(p = 25)  Mov Th(p = [])  Yahoo A(p = 1)

BL         0.032889      0.056697       0.558563        0.026922
OSR        0.055043      0.062802       0.99009         0.044421
SMOTE      0.218821      0.937057       3.841482        0.057726
ADASYN     0.284688      1.802399       5.191265        0.087694
RUS        0.011431      0.025383       0.155224        0.007991
CL Knn     0.046599      0.599846       0.989914        0.037182
Far Knn    0.039887      0.80072        0.683023        0.027788
CBU        1.034111      10.60173       6.822839        1.692477
AB         0.169792      0.841443       3.460246        0.139251
AC(R = 28) 0.471994      2.996585       1.086907        0.366555
AC(R = RL) 0.53376       1.179542       6.065177        0.209015
EE(S = 10) 0.117226      6.065145       1.17995         0.148973
EE(S = 15) 0.20474       7.173737       2.119991        0.180365

EE par     0.013649      0.478249       0.141333        0.012024

           Yahoo A(p = 25)  Yahoo G(p = 1)  Yahoo G(p = 25)  TaFeng(p = 1)

BL         0.092954         0.011915        0.044164         0.026728
OSR        0.027887         0.013241        0.047206         0.040919
SMOTE      1.062686         0.056153        0.883698         0.219553
ADASYN     2.050993         0.079073        1.733367         0.306618
RUS        0.048471         0.003234        0.033423         0.002916
CL Knn     0.84391          0.025404        0.502515         0.092167
Far Knn    0.664124         0.026576        0.500206         0.080159
CBU        15.69442         1.287221        13.55035         2.467279
AB         0.445546         0.078777        0.169977         0.114619
AC(R = 28) 1.034044         0.321723        0.515953         0.926178
AC(R = RL) 0.706215         0.226741        0.112949         0.610233
EE(S = 10) 1.026577         0.100331        1.527146         0.058052
EE(S = 15) 1.607596         0.077483        2.472582         0.10538

EE par     0.107173         0.005166        0.164839         0.007025

           TaFeng(p = 25)  Book(p = 1)  Book(p = 25)  LST(p = 1)

BL         0.032033        0.080035     0.318093      0.652045
OSR        0.032414        0.132927     0.092757      0.87152
SMOTE      5.089283        3.409418     11.43444      4.987705
ADASYN     8.148419        3.689661     12.25441      6.840083
RUS        0.020457        0.022713     0.031972      0.432839
CL Knn     1.713731        0.400873     3.711648      2.508374
Far Knn    1.539437        0.379086     3.988552      2.511037
CBU        26.42686        4.198663     46.31987      []
AB         0.713265        0.61719      1.238585      2.466151
AC(R = 28) 1.234647        1.666131     2.330635      1.451671
AC(R = RL) 0.279047        0.860346     0.197053      1.23763
EE(S = 10) 2.484502        2.145747     7.177484      0.524066
EE(S = 15) 3.363971        2.480066     11.21945      0.784111

EE par     0.224265        0.165338     0.747963      0.052274


Table 9 Continued. Additionally, an extra column indicating average ranks is included (lower ranks are preferred).

           Adver(p = [])  Adver(p = 1)  CRF(p = [])  Bank(p = [])

BL         0.010953       0.002796      0.725911     70.89334
OSR        0.012178       0.006166      3.685813     179.7481
SMOTE      0.123112       0.017764      5.633862     []
ADASYN     0.183767       0.021728      5.768669     []
RUS        0.012115       0.00204       0.147392     5.247441
CL Knn     0.061324       0.005568      1.106755     73.73282
Far Knn    0.079078       0.007069      1.110379     97.59619
CBU        3.378235       3.236754      []           []
AB         0.069199       0.103518      1.153196     83.08618
AC(R = 28) 0.193092       0.068905      2.047434     71.70548
AC(R = RL) 0.107652       0.037963      1.387174     106.3466
EE(S = 10) 0.138485       0.085686      0.198656     24.95117
EE(S = 15) 0.185136       0.139121      0.285345     36.40107

EE par     0.012342       0.009275      0.019023     2.426738

           Average Rank [pos]

BL         2.94 [2]
OSR        4.19 [4]
SMOTE      9.59 [11]
ADASYN     10.91 [13]
RUS        1.38 [1]
CL Knn     6.5 [5]
Far Knn    6.56 [6]
CBU        14 [14]
AB         8.06 [7]
AC(R = 28) 10.81 [12]
AC(R = RL) 9.25 [9]
EE(S = 10) 8.25 [8]
EE(S = 15) 9.56 [10]

EE par     3 [3]

Timings are a substantial concern with respect to these large datasets. As can be observed, the EE-technique thrives for these big and highly imbalanced data, even in its non-parallel form.


[Figure 5: scatter plot of average rank AUC (x-axis) versus average rank Time (y-axis); legend: BL, OSR, SMOTE, ADASYN, RUS, Cl Knn, Far Knn, CBU, AB, AC(R = 28), AC(R = RL), EE(S = 10), EE(S = 15), EE par]

Fig 5 Average rank AUC versus average rank Time (see Table 9) across all datasets from Table 2 (excluding Flickr and Kdd). Regarding the AUC rank, an extra row EE par is added, having the same AUC-value as EE(S = 15). Ranks for each dataset are subsequently obtained via the procedure outlined in Section 4.6.1. Points occurring in the upper-right region are preferred.

5 The effect of the chosen base learner

In this short section we investigate the influence of the chosen base learner. More specifically, we ask ourselves whether some base inducers can more easily cope with the imbalanced learning issue, and in doing so we verify the conclusion that EE is a suitable technique to improve upon the baseline performance. We rely on several popular classification methods for behaviour data: two regularization based techniques (SVM, LR) and two heuristic types of approaches (Naive Bayes (NB), Behavioural Similarity (BeSim)). The next paragraphs briefly introduce these methods. Note that source codes and dataset results are provided in our on-line repository.

LR constructs a linear model that maximizes the likelihood of the observed data by imposing a logistic model for the parametric probability model P(Y|X). Ng and Jordan (2002) provide a thorough introduction on the topic27 and note that in its plain form, LR can suffer from overfitting, especially when the input data is very high dimensional. Therefore, we resort to regularized logistic regression. We opt for L2-regularized LR, since L1-regularization corresponds with a natural way of imposing feature selection (Ng 2004). The LIBLINEAR toolbox (Fan et al 2008) is used to obtain these L2-regularized LR models.

27 For an accessible introduction, see the chapter on "Generative and Discriminative Classifiers: Naive Bayes and Logistic Regression" provided at the following link: http://www.cs.cmu.edu/~tom/NewChapters.html
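The L2-regularized LR model described above can be written down in a few lines. This is a minimal gradient-descent sketch of the model family, not the LIBLINEAR solver itself; the learning rate, epoch count and function names are our own illustrative choices:

```python
import math

def train_l2_lr(X, y, C=1.0, lr=0.1, epochs=500):
    """Minimize (1/2)||w||^2 + C * sum of logistic losses (LIBLINEAR-style
    objective, bias unregularized). y takes values in {0, 1}."""
    d = len(X[0])
    w = [0.0] * d
    b = 0.0
    for _ in range(epochs):
        gw = list(w)                       # gradient of (1/2)||w||^2 is w itself
        gb = 0.0
        for x, t in zip(X, y):
            z = sum(wj * xj for wj, xj in zip(w, x)) + b
            p = 1.0 / (1.0 + math.exp(-z))
            for j in range(d):
                gw[j] += C * (p - t) * x[j]
            gb += C * (p - t)
        w = [wj - lr * gj for wj, gj in zip(w, gw)]
        b -= lr * gb
    return w, b

def predict_proba(w, b, x):
    return 1.0 / (1.0 + math.exp(-(sum(wj * xj for wj, xj in zip(w, x)) + b)))
```

In this objective a smaller C strengthens the regularization and hence weakens the learner, the property exploited by the boosting experiments discussed earlier.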


NB (Ng and Jordan 2002) relies on the use of the Bayes rule, which expresses the posterior P(Y|X) as the product of the likelihood P(X|Y) and the prior P(Y), divided by the evidence P(X). The naivety of the approach originates from assuming conditional independence of the features. We impose a multivariate event model regarding the distribution P(X|Y). Junque de Fortuny et al (2014a) provide an efficient NB implementation for dealing with large and sparse binary behaviour data.
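A minimal multivariate-Bernoulli NB for sparse binary behaviour data might look as follows (instances stored as sets of active feature ids, add-one smoothing; a simplified stand-in for the optimized implementation cited above, with illustrative names):

```python
import math
from collections import Counter

def train_bnb(instances, labels, n_features):
    """instances: list of sets of active feature ids; labels: list of 0/1."""
    counts = {0: Counter(), 1: Counter()}
    n = {0: 0, 1: 0}
    for feats, y in zip(instances, labels):
        n[y] += 1
        counts[y].update(feats)
    # log P(x_f = 1 | y) with add-one smoothing; P(x_f = 0 | y) is its complement
    logp1 = {y: [math.log((counts[y][f] + 1) / (n[y] + 2)) for f in range(n_features)]
             for y in (0, 1)}
    logp0 = {y: [math.log(1 - math.exp(lp)) for lp in logp1[y]] for y in (0, 1)}
    prior = {y: math.log(n[y] / len(labels)) for y in (0, 1)}
    return logp1, logp0, prior

def score_bnb(model, feats, n_features):
    """Log-odds of the positive class; > 0 favours class 1."""
    logp1, logp0, prior = model
    scores = {}
    for y in (0, 1):
        s = prior[y]
        for f in range(n_features):
            s += logp1[y][f] if f in feats else logp0[y][f]
        scores[y] = s
    return scores[1] - scores[0]
```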

The BeSim approach conceptually corresponds to projecting the bigraph to a unigraph using the procedure outlined in Section 3.2. Next, the weighted-vote relational neighbour classifier (Macskassy and Provost 2007) is used to infer unknown labels/scores through a weighted probability estimation using the known labels of neighbouring nodes. A very fast and scalable implementation of this approach (the SW-transformation) is presented in detail in Stankova et al (2015).
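Conceptually, the projection plus weighted vote amounts to the following dense toy sketch, where the link weight between two instances is taken as their number of shared active features. This version ignores the sparse SW optimizations of Stankova et al (2015), and all names are our own:

```python
def besim_scores(train_rows, labels, test_rows):
    """train_rows/test_rows: binary feature lists; labels: 0/1 per train row.
    Returns one weighted-vote score in [0, 1] per test row."""
    scores = []
    for x in test_rows:
        weight_sum = vote_sum = 0.0
        for r, y in zip(train_rows, labels):
            w = sum(a * b for a, b in zip(x, r))   # shared active features
            weight_sum += w
            vote_sum += w * y
        # weighted fraction of positive neighbours; 0.5 if no neighbour overlaps
        scores.append(vote_sum / weight_sum if weight_sum else 0.5)
    return scores
```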

In Table 10 we compare the performances of the baseline (BL) application of the methods with their EE-version (with S = 15, T = 20). It is quite striking to see the EE-version dominating the baseline version when looking at each method individually. This suggests that the imbalanced learning issue has an impact for each of the aforementioned techniques. Focussing on the regularization based approaches, we observe the baseline LR to be less influenced by the imbalance than the SVM method. Their EE-versions (EE SVM and EE LR) can boost the baseline performance in almost all cases and have the highest average ranks amongst all the proposed methods. As already emphasized, the regularization based approaches offer an added element of flexibility, in the sense that the strength of the underlying base learner can be controlled.

The heuristic approaches NB and BeSim do not offer this flexibility. Furthermore, it is not directly possible to integrate instance weights during the training phase of the base learner. During the boosting process, we therefore sample from the distribution Dt to generate (unweighted) examples and train a BeSim/NB learner next. The results in Table 10 indicate BL NB to be the worst performing algorithm amongst all methods tried. Due to the conditional independence assumption, one could regard this as a weak learner, which is therefore suitable to be used in a boosting process. Indeed, the EE NB approach is able to significantly improve upon its performance. We found this to be the slowest EE-version28 (e.g. taking more than 24 hours to process a single fold of the Bank dataset, versus several minutes for the other versions). The average rank estimate of EE NB is rather pessimistic, since we assigned the lowest performance across all methods in case of missing results. BeSim can be considered a stronger learner compared to NB. Its possible boost in performance is typically lower than that of NB. Having said this, we found EE BeSim to improve upon BL BeSim for all datasets (except Adver(p = 1)).

28 This statement only relates to the EE-version. In the baseline case, the timings of the NB-version are roughly of the same order of magnitude as SVM and LR (though this heavily depends on the chosen regularization parameter C). With EE, each subset has a size limited to twice the minority class training size (a relatively low number compared to the size of the majority class). This works in favour of SVM/LR (e.g. SVM has a computational complexity of O(m^2), with m the number of training instances). Also note that the NB-version is indeed optimized for sparse data. Internally, the NB-implementation writes a fairly large model file to a specified directory (sizes of 1 GB can be encountered there). This writing operation slows down the training stage (especially in case of EE, where S × T = 15 × 20 = 300 model files are constructed).


Table 10 Average tenfold AUC test set performance of the BL compared to its EE-version (S = 15, T = 20) for SVM, LR, BeSim and NB. In case the EE NB implementation crashed, we put a × in the table. If it is too slow (> 24 hours per fold), we place a []. Additionally, average ranks are assigned to each method. In case EE NB has missing results for a particular dataset, we ensure it has the lowest rank across all other algorithms.

          Mov G(p = 1)  Mov G(p = 25)  Mov Th(p = [])  Yahoo A(p = 1)

BL SVM    71.6 (2.62)   81.41 (1.32)   79.77 (5.33)    56.49 (3.37)
EE SVM    76.12 (2.88)  85.13 (1.86)   86.43 (5.86)    59.74 (2.96)
BL LR     71.02 (2.09)  84.39 (1.84)   83.14 (4.17)    57.84 (2.39)
EE LR     76.69 (2.92)  85.03 (1.98)   86.3 (5.37)     59.79 (2.62)
BL BeSim  76.1 (3.58)   81.3 (2.92)    82.81 (6.6)     56.27 (2.73)
EE BeSim  76.31 (3.71)  81.37 (2.9)    85.02 (6.28)    57.7 (1.71)
BL NB     70.26 (5.84)  77.01 (2.54)   70.48 (10.14)   52.56 (2.09)
EE NB     75.93 (2.83)  85.56 (2.01)   86.91 (4.15)    57.55 (2.73)

          Yahoo A(p = 25)  Yahoo G(p = 1)  Yahoo G(p = 25)  TaFeng(p = 1)

BL SVM    61.61 (2.48)     66.84 (3.66)    78.82 (1.39)     55.75 (1.6)
EE SVM    66.38 (3.16)     73.48 (2.32)    80.55 (1.55)     61.13 (1.83)
BL LR     66.27 (2.96)     69.82 (1.93)    80.45 (1.59)     58.91 (2.31)
EE LR     66.22 (3.28)     73.08 (2.14)    80.53 (1.56)     61.43 (2.32)
BL BeSim  64.54 (2.02)     68.89 (2.49)    79.55 (1.96)     57.89 (1.18)
EE BeSim  65.25 (2.23)     71.18 (2.91)    80.04 (1.85)     59.36 (1.47)
BL NB     65 (1.65)        63.33 (2.56)    78.89 (1.64)     54.61 (1.2)
EE NB     66.6 (2.79)      70.99 (2.88)    81.01 (1.3)      59.01 (1.84)

          TaFeng(p = 25)  Book(p = 1)   Book(p = 25)  LST(p = 1)

BL SVM    66.94 (1.34)    52.6 (1.29)   60.08 (0.71)  99.99 (0.01)
EE SVM    70.4 (1.3)      55.34 (1.28)  65.4 (0.51)   99.98 (0.01)
BL LR     69.24 (1.3)     55.34 (1.27)  63.84 (0.75)  99.99 (0.01)
EE LR     70.28 (1.28)    55.49 (1.49)  65.41 (0.63)  99.97 (0.02)
BL BeSim  67.49 (1.23)    55.19 (1.27)  63.7 (0.63)   99.99 (0.01)
EE BeSim  68 (1.21)       55.21 (1.15)  64.38 (0.42)  99.99 (0)
BL NB     65.21 (1.64)    52.93 (0.9)   59.75 (0.47)  98.69 (0.3)
EE NB     70.72 (1.15)    ×             63.46 (0.61)  99.92 (0.04)

          Adver(p = [])  Adver(p = 1)  CRF(p = [])    Bank(p = [])

BL SVM    96.37 (1.94)   91.18 (2.97)  64.36 (18.97)  66.82 (0.88)
EE SVM    97.63 (1.35)   93.3 (2.14)   86.35 (9.99)   71.54 (0.76)
BL LR     97.19 (1.44)   88.51 (1.93)  81.87 (19.63)  71.43 (0.72)
EE LR     97.57 (0.96)   93.02 (2.06)  86.84 (9.62)   71.77 (0.62)
BL BeSim  97.26 (1.12)   95.38 (1.35)  86.91 (9.36)   67.85 (0.67)
EE BeSim  97.38 (1.04)   93.83 (1.35)  87.02 (10.43)  70.41 (0.55)
BL NB     93.75 (1.9)    93.37 (1.9)   87.24 (9.38)   67.83 (0.63)
EE NB     94.04 (1.75)   ×             ×              []

          Flickr(p = 01)  Kdd(p = 05)   Average Rank

BL SVM    74.92 (0.17)    74.53 (0.05)  6.44 [7]
EE SVM    79.86 (0.13)    80.98 (0.05)  2.39 [1]
BL LR     79.03 (0.11)    81.29 (0.04)  4.28 [4]
EE LR     79.85 (0.13)    80.75 (0.05)  2.61 [2]
BL BeSim  74.62 (0.13)    74.95 (0)     5.11 [6]
EE BeSim  76.4 (0.13)     77.55 (0.03)  3.61 [3]
BL NB     81.36 (0.1)     74.29 (0.05)  6.5 [8]
EE NB     []              []            5.06 [5]


6 Conclusions and future research directions

We investigated the effects of sampling and cost-sensitive learning techniques on the problem of learning from imbalanced behaviour data. This type of big data has fundamentally different properties and internal structure than traditional data and is characterized by sparseness and very large dimensions. This is a new research topic which enables benefits across a wide variety of application domains, such as fraud detection, targeted advertising, predictive policing, churn prediction, default prediction etc. To enable future developments in this field, we provide an on-line repository containing implementations, datasets and results, available at the following link: http://www.applieddatamining.com/cms/?q=software

Oversampling techniques need to be adapted in order to cope with this kind of data. We proposed new versions of the traditional SMOTE and ADASYN techniques, which differ in the way synthetic samples are generated and similarity computations are performed. The plain random oversampling (OSR) and the synthetic approaches are competitive methods in terms of AUC and do not seem to suffer from overfitting, as traditional studies report. Though the synthetic approaches have comparable AUC-performances in comparison to OSR, they are useless in practice due to their computationally expensive procedures. The OSR method is very fast for small and medium sized datasets, yet becomes inappropriate for larger datasets because the training times of the underlying base learner increase drastically.

With respect to undersampling methodologies, we employed a random undersampling technique (RUS) and Knn-based informed approaches. The latter type selects majority class instances that are closest (Cl Knn) or farthest (Far Knn) from the minority class examples and allows one to investigate the effect of noise/outliers on SVM performance. We observed that these odd instances have a severe performance degrading effect. RUS and Far Knn approaches can significantly improve performance with respect to the baseline thanks to their abilities to reduce noise/outlier levels and to put more emphasis on the minority class. Having said this, they usually perform worse than any of the oversampling approaches. This is in sheer contrast to the results obtained from traditional studies and is heavily related to the intrinsic nature of behaviour data. Timing-wise, RUS is the fastest method. The informed approaches are slower than RUS due to the required nearest neighbour computations. Additionally, a cluster-based undersampling (CBU) technique was proposed to alleviate the within-class imbalance. For low undersampling rates, the performance of CBU is superior to plain RUS, whereas for high undersampling rates the opposite is true. Overall, no substantial gain in performance is obtained in comparison with a simple RUS approach. Furthermore, the method is the slowest among all techniques considered and relies on computationally expensive clustering heuristics.

We studied the effects of boosting, where we made use of a very specific linear SVM and logistic regression combination that uses confidence rated predictions, instead of a more popular plain "discrete" boosting algorithm (with weak hypotheses having values in {−1, 1}). Our experiments clearly indicated the regularization constant C in the SVM formulation to act as a "weakness" indicator. Indeed, higher C-values cause stronger learners and should not be used in the boosting process, which matches with the analysis of studies dealing with traditional data. Strong learners can


already quite accurately distinguish majority class from minority class behaviours. The boosting process will focus on the hard-to-learn instances, among which are noise/outliers, and this can result in overfitting behaviour. We also observed that boosting can usually outperform the plain baseline, indicating that a combination of weak learners can achieve higher AUC-performance than a single strong learner.

EasyEnsemble (EE) integrates random undersampling with a boosting process. By sampling several balanced subsets from the training data and combining these, we can emphasize the minority class, achieve noise/outlier reduction effects and simultaneously explore the majority class space efficiently. Our version differs from the original formulation because we employ a different type of underlying weak learner (SVM and LR combination) that is fed to a confidence rated boosting algorithm (instead of the discrete version). We observed superior AUC-performance with respect to all of the aforementioned techniques. Furthermore, the method is very fast, since each subset can be trained in parallel and has a size that is twice as large as the minority class training set. We studied the effect of the number of subsets and noted that a limited amount of these (O(10)) is already sufficient to obtain satisfactory performances; increasing this number further causes only minor benefits.

Additionally, a statistical comparison of each of the proposed methods was conducted across a wide variety of datasets. Holm's significance test indicates that each of the proposed techniques (oversampling, undersampling except Cl Knn, cost-sensitive learning and boosting variants) can significantly outperform the plain baseline performance (each of the null-hypotheses of equal performance gets rejected with a confidence level of at least 91%). Comparison of the techniques with EE showed that the latter can significantly outperform all of them (each of the null-hypotheses of equal performance gets rejected with a confidence level of at least 75%). A timing-wise comparison showed EE to be the preferred method (after RUS) for large and highly imbalanced datasets. A parallel implementation would make EE the fastest method, even for medium sized datasets that show a high level of imbalance.

The influence of the chosen base learner was investigated by comparing the baseline to its EE-version for the SVM, LR, NB and BeSim inducers. Each of the EE-versions is able to improve upon the baseline performance for nearly all datasets, where base learners of the regularization type are superior because of their abilities to control the strength of the learner.

Though we conducted an extensive benchmark study, there are still plenty of future research directions. First, and most important, there is an urgent need to develop a more general theoretical framework to completely understand the effect of imbalance and why certain methods outperform others. Secondly, other types of mechanisms could be investigated in this setting: advanced sampling (Guo et al 2008) set-ups perform sampling based on preliminary classifications; one-class learning is a recognition based methodology where a discriminative boundary is learned around the training examples of a single class alone, and Raskutti and Kowalczyk (2004) demonstrated the superiority of the recognition based one-class SVMs over the discriminative two-class SVMs; other types of ensemble learning strategies (e.g. bagging and stacking) could be investigated; Batista et al (2004) note that combining over- and undersampling can be beneficial when data are highly imbalanced or show very few minority class instances; the ideas behind Extended Nearest Neighbour (Tang and


He 2015) could guide us in the sampling process. As opposed to Knn, "it considers not only who are the nearest neighbours of a certain sample, but also who consider the particular sample as their nearest neighbour" (Tang and He 2015). The method assigns a class membership to an unknown sample that maximizes the intra-class coherence. Finally, more advanced techniques do exist that can provide a K (the number of nearest neighbours) faster or with (slightly) better performance.

Another issue relates to the lack of comprehensibility of the EasyEnsemble technique in its present form. Because we currently integrate Logistic Regression in the boosting process, the final ensemble classifier is non-linear and therefore incomprehensible. It might be interesting to investigate the numerical boosting technique proposed in Schapire and Singer (1999), where each weak learner is allowed to output real values (instead of values in [−1, 1]). In that case, we would be able to use a plain linear SVM as a weak learner, without the Logistic Regression component. The final ensemble would be a linear model, which is surely easier to comprehend. Also, the instance-based approach of Martens and Provost (2014) could be used to explain individual predictions of the non-linear EE-technique.

A Oversampling experiments

Table 11 Oversampling experiments. Results show the average tenfold AUC test set performance with respect to increasing β-values (β = [β1, β2, β3, β4] = [0, 1/3, 2/3, 1]). Additionally, standard deviations are shown between brackets. Optimal performances for each method are highlighted in boldface. See Section 4.2 for parameter settings.

MOV G(p = 1)
         β1            β2            β3            β4
OSR      71.6 (2.62)   74.37 (2.04)  73.6 (1.84)   74.73 (2.45)
SMOTE    71.6 (2.62)   75.08 (2.18)  76.02 (2.14)  76.48 (2.3)
ADASYN   71.6 (2.62)   75.16 (1.92)  75.93 (2.08)  76.47 (2.29)

MOV G(p = 25)
         β1            β2            β3            β4
OSR      81.41 (1.32)  83.49 (1.81)  83.84 (1.96)  83.91 (2.04)
SMOTE    81.41 (1.32)  83.32 (1.97)  83.59 (2.04)  83.76 (2.11)
ADASYN   81.41 (1.32)  83.61 (1.82)  84.02 (1.97)  83.69 (1.96)

Mov Th(p = [])
         β1            β2            β3            β4
OSR      79.77 (5.33)  85.3 (4.66)   83.16 (4.5)   84.59 (5.69)
SMOTE    79.77 (5.33)  84.18 (6.51)  85.58 (5.97)  84.33 (5.75)
ADASYN   79.77 (5.33)  84.11 (6.77)  85.86 (6.06)  85.36 (5.13)

Yahoo A(p = 1)
         β1            β2            β3            β4
OSR      55.92 (2.97)  58.66 (3.27)  59.99 (2.28)  59.74 (1.78)
SMOTE    55.92 (2.97)  59.76 (2.62)  59.74 (2.67)  59.43 (2.4)
ADASYN   55.92 (2.97)  59.54 (2.53)  59.55 (2.94)  59.56 (2.22)

Yahoo A(p = 25)
         β1            β2            β3            β4
OSR      61.68 (2.42)  64.19 (3.17)  65.08 (3.26)  64.67 (2.1)
SMOTE    61.68 (2.42)  65.46 (3.63)  65.33 (3.23)  64.52 (2.98)
ADASYN   61.68 (2.42)  65.04 (3.74)  65.41 (3.47)  64.4 (2.21)

Continues on next page


Table 11 continued

Yahoo G(p = 1)
         β1            β2            β3            β4
OSR      66.84 (3.66)  72.18 (2.36)  73.11 (2.7)   72.49 (3.41)
SMOTE    66.84 (3.66)  72.65 (2.85)  73.27 (3.36)  73.37 (3.56)
ADASYN   66.84 (3.66)  72.87 (2.83)  73.18 (3.2)   73.39 (3.59)

Yahoo G(p = 25)
         β1            β2            β3            β4
OSR      78.82 (1.39)  78.78 (2.02)  78.74 (1.86)  78.35 (1.47)
SMOTE    78.82 (1.39)  79.23 (1.57)  79.1 (1.2)    79.03 (1.89)
ADASYN   78.82 (1.39)  79.12 (1.43)  79.23 (1.37)  79.51 (2.01)

TaFeng(p = 1)
         β1            β2            β3            β4
OSR      55.75 (1.6)   59.23 (1.96)  60 (1.68)     61.04 (2.36)
SMOTE    55.75 (1.6)   60.26 (1.95)  61.49 (1.8)   61.13 (1.52)
ADASYN   55.75 (1.6)   60.26 (1.9)   61.44 (1.85)  61.16 (1.5)

TaFeng(p = 25)
         β1            β2            β3            β4
OSR      66.94 (1.34)  68.84 (1.21)  66.99 (1.42)  67.7 (1.41)
SMOTE    66.94 (1.34)  68.47 (1.5)   67.07 (1.15)  66.65 (0.81)
ADASYN   66.94 (1.34)  68.62 (1.38)  67.85 (1.6)   66.91 (1.39)

Book(p = 1)
         β1            β2            β3            β4
OSR      52.6 (1.29)   53.61 (0.94)  55.41 (1.75)  55.87 (1.44)
SMOTE    52.6 (1.29)   54.77 (0.99)  54.91 (0.8)   54.36 (0.98)
ADASYN   52.6 (1.29)   54.86 (1.13)  55.06 (0.73)  54.54 (0.92)

Book(p = 25)
         β1            β2            β3            β4
OSR      60.08 (0.71)  61.05 (0.94)  61.82 (1.12)  64.62 (0.57)
SMOTE    60.08 (0.71)  62.6 (0.73)   60.95 (0.68)  63 (0.8)
ADASYN   60.08 (0.71)  62.33 (0.73)  60.77 (0.85)  63.04 (0.58)

LST(p = 1)
         β1            β2            β3            β4
OSR      99.99 (0.01)  99.99 (0.01)  99.99 (0.01)  99.99 (0.01)
SMOTE    99.99 (0.01)  99.99 (0.01)  99.99 (0.01)  99.99 (0.01)
ADASYN   99.99 (0.01)  99.99 (0.01)  99.99 (0.01)  99.99 (0.01)

Adver(p = [])
         β1            β2            β3            β4
OSR      96.61 (1.82)  97.31 (1.65)  97.07 (1.84)  97.07 (1.79)
SMOTE    96.61 (1.82)  96.91 (1.66)  97.19 (1.65)  97.07 (1.91)
ADASYN   96.61 (1.82)  97.1 (1.7)    97.08 (1.87)  97.07 (1.88)

Adver(p = 1)
         β1            β2            β3            β4
OSR      90.93 (3.02)  91.27 (3.03)  92.66 (2.82)  93.29 (1.97)
SMOTE    90.93 (3.02)  92.51 (2.03)  92.96 (2.14)  93.53 (1.81)
ADASYN   90.93 (3.02)  92.22 (2.33)  92.7 (2.36)   93.88 (1.73)

CRF(p = [])
         β1             β2             β3             β4
OSR      64.06 (16.43)  80.82 (12.94)  81.28 (12.27)  81.91 (11.28)
SMOTE    64.06 (16.43)  78.64 (16.86)  82.52 (13.74)  79.32 (16.26)
ADASYN   64.06 (16.43)  78.95 (16.72)  81.19 (16.32)  79.31 (16.19)

Bank(p = [])
         β1            β2            β3            β4
OSR      66.82 (0.88)  70.1 (0.74)   71.39 (0.8)   71.47 (0.8)
SMOTE
ADASYN

Imbalanced classification in sparse and large behaviour datasets 45

B Undersampling experiments

Table 12 Undersampling experiments. Results show the average tenfold AUC test set performance with respect to increasing βu-values (βu = [βu1, βu2, βu3, βu4, βu5] = [0, 1/4, 1/2, 3/4, 1]). Optimal performances for each imbalance ratio are marked with an asterisk (*). Additionally, standard deviations are shown between brackets. Cl K represents the "Closest Knn" technique; Cl T represents the "Closest tot sim" technique (similarly for Far K and Far T; see Section 3.2). Parameter settings are given in Section 4.2.

Mov G(p = 1)
        βu1           βu2           βu3           βu4           βu5
RUS     71.6 (2.6)    71.83 (2.6)   72.54 (2.5)   72.39 (3.1)   70.61 (3.5)
Cl K    71.6 (2.6)    71.4 (2.0)    70.96 (1.9)   70.43 (2.4)   69.05 (4.1)
Cl T    71.6 (2.6)    70.28 (2.5)   66.74 (2.0)   66.8 (2.1)    68.18 (3.6)
Far K   71.6 (2.6)    72.36 (2.7)   71.26 (3.4)   66.57 (5.2)   53.5 (3.5)
Far T   71.6 (2.6)    72.22 (2.8)   71.63 (3.6)   64.28 (5.3)   50.88 (4.4)
CBU     72.55 (2.6)   73.28 (2.6)   73.12 (2.6)   73.84* (2.5)  73.0 (3.1)

Mov G(p = 25)
        βu1           βu2           βu3           βu4           βu5
RUS     81.41 (1.3)   81.36 (1.3)   81.78 (1.7)   82.05* (1.7)  81.6 (2.1)
Cl K    81.41 (1.3)   80.86 (1.2)   80.95 (1.6)   79.73 (2.3)   77.95 (2.3)
Cl T    81.41 (1.3)   79.9 (1.2)    78.21 (1.4)   77.87 (1.5)   77.76 (2.3)
Far K   81.41 (1.3)   80.9 (1.5)    78.17 (1.8)   74.25 (2.4)   69.79 (3.2)
Far T   81.41 (1.3)   80.86 (1.5)   77.2 (2.4)    71.16 (2.7)   62.4 (2.8)
CBU     81.53 (1.4)   81.64 (1.3)   81.29 (1.6)   81.28 (2.1)   80.34 (2.7)

Mov Th(p = [])
        βu1           βu2           βu3           βu4           βu5
RUS     79.77 (5.3)   80.32 (5.8)   81.57 (5.5)   81.86 (6.6)   81.26 (6.2)
Cl K    79.77 (5.3)   79.25 (4.5)   78.07 (5.0)   76.25 (6.5)   62.46 (8.5)
Cl T    79.77 (5.3)   78.4 (4.4)    72.41 (3.5)   64.66 (4.5)   60.37 (7.3)
Far K   79.77 (5.3)   84.54 (5.0)   83.64 (6.4)   80.02 (7.3)   56.82 (10.3)
Far T   79.77 (5.3)   85.03* (5.7)  82.68 (6.8)   75.61 (9.2)   56.77 (10.9)
CBU     80.11 (5.8)   81.17 (6.0)   81.08 (6.5)   84.17 (5.1)   80.96 (6.9)

Yahoo A(p = 1)
        βu1           βu2           βu3           βu4           βu5
RUS     55.92 (3.0)   55.57 (3.4)   56.44 (3.0)   55.83 (3.4)   56.37 (3.3)
Cl K    55.92 (3.0)   55.67 (2.4)   53.12 (2.0)   50.57 (1.8)   53.79 (3.5)
Cl T    55.92 (3.0)   55.69 (2.1)   53.35 (2.2)   50.31 (2.2)   52.35 (3.3)
Far K   55.92 (3.0)   57.35 (2.2)   56.92 (1.1)   56.95 (2.3)   51.18 (2.0)
Far T   55.92 (3.0)   56.93 (2.4)   54.74 (1.9)   57.01 (1.8)   51.18 (2.0)
CBU     58.21 (2.6)   58.45* (3.3)  58.31 (3.5)   58.39 (3.5)   56.09 (2.6)

Yahoo A(p = 25)
        βu1           βu2           βu3           βu4           βu5
RUS     61.68 (2.4)   62.9 (2.9)    63.62 (3.6)   63.75 (3.1)   63.19 (1.9)
Cl K    61.68 (2.4)   61.14 (2.1)   57.62 (1.6)   54.02 (1.8)   51.48 (1.4)
Cl T    61.68 (2.4)   60.89 (2.8)   58.11 (1.4)   54.4 (2.1)    51.76 (1.4)
Far K   61.68 (2.4)   63.96* (3.0)  62.62 (2.2)   59.61 (1.5)   56.25 (1.6)
Far T   61.68 (2.4)   63.71 (2.4)   59.72 (1.6)   57.27 (1.1)   54.47 (1.1)
CBU     62.46 (2.6)   61.85 (1.4)   61.78 (2.2)   59.94 (3.0)   60.1 (4.0)


Table 12 continued

Yahoo G(p = 1)
        βu1           βu2           βu3           βu4           βu5
RUS     66.84 (3.7)   67.85 (3.2)   68.36 (3.2)   68.23 (4.0)   69.9 (4.2)
Cl K    66.84 (3.7)   66.71 (2.8)   64.3 (3.6)    61.98 (3.9)   61.15 (1.9)
Cl T    66.84 (3.7)   65.79 (2.7)   63.55 (3.3)   59.21 (3.5)   61.08 (2.4)
Far K   66.84 (3.7)   66.76 (4.1)   63.84 (3.4)   65.16 (2.0)   48.5 (2.9)
Far T   66.84 (3.7)   66.95 (4.1)   63.48 (2.9)   65.16 (2.0)   48.48 (2.9)
CBU     69.68 (4.1)   70.59 (3.2)   70.64* (3.7)  70.2 (2.9)    63.35 (3.6)

Yahoo G(p = 25)
        βu1           βu2           βu3           βu4           βu5
RUS     78.82 (1.4)   78.91 (1.6)   78.97* (1.6)  78.61 (1.6)   77.82 (2.1)
Cl K    78.82 (1.4)   77.26 (1.5)   72.52 (1.5)   67.86 (2.0)   65.07 (2.7)
Cl T    78.82 (1.4)   76.83 (1.0)   71.99 (1.8)   67.15 (2.3)   61.1 (2.7)
Far K   78.82 (1.4)   78.26 (2.2)   74.69 (2.7)   67.22 (2.1)   60.72 (2.3)
Far T   78.82 (1.4)   77.68 (2.6)   72.44 (3.0)   64.94 (2.4)   59.6 (2.0)
CBU     75.25 (3.2)   75.22 (2.4)   74.69 (2.3)   73.07 (2.4)   70.69 (2.4)

TaFeng(p = 1)
        βu1           βu2           βu3           βu4           βu5
RUS     55.75 (1.6)   56.1 (1.6)    56.26 (1.7)   57.23 (1.7)   59.25 (2.2)
Cl K    55.75 (1.6)   55.68 (1.6)   55.58 (1.5)   55.08 (1.1)   51.05 (1.5)
Cl T    55.75 (1.6)   55.67 (1.6)   54.47 (1.6)   47.53 (1.6)   49.3 (1.1)
Far K   55.75 (1.6)   58.99 (1.2)   59.47 (1.1)   60.04* (1.2)  56.31 (1.0)
Far T   55.75 (1.6)   58.92 (1.3)   59.25 (1.3)   58.58 (1.1)   56.31 (1.0)
CBU     57.8 (1.0)    58.47 (1.1)   58.15 (0.9)   58.87 (1.4)   57.65 (1.6)

TaFeng(p = 25)
        βu1           βu2           βu3           βu4           βu5
RUS     66.94 (1.3)   67.44 (1.3)   68.1 (1.4)    68.27* (1.4)  66.13 (1.2)
Cl K    66.94 (1.3)   66.13 (1.4)   63.39 (1.2)   59.83 (1.3)   56.94 (0.7)
Cl T    66.94 (1.3)   66.38 (1.5)   62.89 (1.6)   57.46 (1.3)   54.56 (1.3)
Far K   66.94 (1.3)   68.06 (1.4)   66.43 (1.6)   64.46 (1.5)   63.35 (1.3)
Far T   66.94 (1.3)   64.31 (1.1)   62.69 (1.0)   61.27 (1.1)   59.03 (1.0)
CBU     64.81 (1.2)   64.15 (1.1)   64.13 (1.2)   63.88 (0.8)   63.46 (0.8)

Book(p = 1)
        βu1           βu2           βu3           βu4           βu5
RUS     52.6 (1.3)    52.79 (0.9)   53.46 (0.8)   53.89 (0.9)   54.05 (0.9)
Cl K    52.6 (1.3)    52.56 (1.2)   52.52 (1.3)   52.39 (1.1)   53.09 (1.1)
Cl T    52.6 (1.3)    52.56 (1.2)   52.52 (1.3)   52.39 (1.1)   53.05 (0.7)
Far K   52.6 (1.3)    55.21 (1.2)   56.21* (1.8)  56.14 (1.2)   53.06 (1.0)
Far T   52.6 (1.3)    55.21 (1.2)   56.21* (1.8)  56.14 (1.2)   53.06 (1.0)
CBU     54.28 (0.9)   53.77 (1.0)   53.33 (1.1)   53.34 (0.9)   52.84 (0.8)

Book(p = 25)
        βu1           βu2           βu3           βu4           βu5
RUS     60.08 (0.7)   60.13 (0.6)   60.4 (0.8)    60.33 (0.8)   63.28 (0.8)
Cl K    60.08 (0.7)   59.96 (0.7)   60.13 (0.8)   59.96 (1.0)   59.28 (0.7)
Cl T    60.08 (0.7)   59.96 (0.7)   60.13 (0.8)   60.29 (0.4)   54.5 (0.9)
Far K   60.08 (0.7)   63.29 (1.0)   64.19* (0.8)  57.3 (1.1)    55.66 (1.1)
Far T   60.08 (0.7)   62.14 (0.5)   58.27 (0.6)   56.37 (1.0)   55.66 (1.1)
CBU     54.82 (0.9)   54.67 (0.9)   54.71 (0.9)   54.66 (1.0)   54.78 (0.9)


Table 12 continued

LST(p = 1)
        βu1           βu2           βu3           βu4           βu5
RUS     99.99 (0.0)   99.99 (0.0)   99.99 (0.0)   99.98 (0.0)   99.99 (0.0)
Cl K    99.99 (0.0)   99.99 (0.0)   99.99 (0.0)   99.99 (0.0)   99.99 (0.0)
Cl T    99.99 (0.0)   99.99 (0.0)   99.99 (0.0)   99.99 (0.0)   99.98 (0.0)
Far K   99.99 (0.0)   99.98 (0.0)   99.98 (0.0)   99.98 (0.0)   99.98 (0.0)
Far T   99.99 (0.0)   99.98 (0.0)   99.98 (0.0)   99.98 (0.0)   99.98 (0.0)
CBU     []            []            []            []            []

Adver(p = [])
        βu1           βu2           βu3           βu4           βu5
RUS     96.61 (1.8)   96.32 (1.8)   96.63 (1.4)   97.12* (2.1)  96.22 (1.6)
Cl K    96.61 (1.8)   96.44 (1.5)   96.14 (1.5)   96.04 (2.0)   94.8 (2.5)
Cl T    96.61 (1.8)   95.87 (2.1)   94.32 (1.9)   93.01 (2.2)   90.72 (2.3)
Far K   96.61 (1.8)   96.53 (1.4)   95.76 (2.0)   94.39 (1.8)   90.49 (3.1)
Far T   96.61 (1.8)   96.54 (1.5)   95.67 (1.9)   94.54 (1.8)   89.3 (2.8)
CBU     96.85 (2.3)   96.85 (2.3)   97.05 (1.5)   96.6 (1.6)    96.06 (2.1)

Adver(p = 1)
        βu1           βu2           βu3           βu4           βu5
RUS     90.93 (3.0)   91.53 (3.1)   92.37 (3.4)   91.9 (2.9)    91.93 (2.2)
Cl K    90.93 (3.0)   90.64 (3.0)   89.87 (3.9)   90.21 (3.6)   89.18 (2.0)
Cl T    90.93 (3.0)   89.7 (3.5)    88.55 (3.4)   85.76 (3.3)   88.2 (2.3)
Far K   90.93 (3.0)   93.8 (2.3)    92.4 (2.6)    88.73 (3.4)   85.51 (4.0)
Far T   90.93 (3.0)   93.62 (2.4)   93.2 (2.2)    88.41 (3.6)   85.51 (4.0)
CBU     93.22 (2.4)   93.76 (2.5)   93.89* (2.6)  93.52 (2.7)   91.27 (2.0)

CRF(p = [])
        βu1           βu2           βu3           βu4           βu5
RUS     64.06 (16.4)  63.28 (15.9)  67.98 (17.4)  66.95 (21.9)  87.73* (8.8)
Cl K    64.06 (16.4)  62.44 (16.6)  62.34 (16.9)  71.37 (13.8)  78.22 (17.7)
Cl T    64.06 (16.4)  62.44 (16.6)  62.34 (16.9)  71.37 (13.8)  62.67 (22.9)
Far K   64.06 (16.4)  83.8 (14.2)   83.93 (14.8)  84.49 (13.7)  86.11 (9.7)
Far T   64.06 (16.4)  83.8 (14.2)   83.93 (14.8)  84.49 (13.7)  86.11 (9.7)
CBU     []            []            []            []            []

Bank(p = [])
        βu1           βu2           βu3           βu4           βu5
RUS     66.82 (0.9)   67.02 (0.9)   67.37 (0.8)   67.99 (0.6)   69.5* (1.0)
Cl K    66.82 (0.9)   66.17 (0.7)   65.24 (0.6)   64.86 (0.6)   58.53 (1.1)
Cl T    66.82 (0.9)   64.92 (1.1)   60.69 (0.9)   56.33 (0.8)   52.87 (0.7)
Far K   66.82 (0.9)   66.95 (0.6)   66.19 (0.6)   64.42 (0.6)   58.25 (1.1)
Far T   66.82 (0.9)   67.16 (0.6)   64.2 (0.8)    59.67 (1.0)   58.25 (1.1)
CBU     []            []            []            []            []


C Boosting experiments

This section presents the results of each of the boosting experiments from Section 4.5 on the remaining data sources. Each of the figures below depicts the average tenfold AUC-performance on test data (with µ = 100) with respect to the number of boosting iterations for (left) Adaboost (AB), AdaCost (AC) and EasyEnsemble (EE), with C chosen according to highest validation set AUC-performance (over all possible boosting rounds), and (right) AB and EE (with S = 15) at varying C-levels.

[Figure: test set AUC vs. boosting iterations T; panels (a) and (b)]

Fig 6 Mov G(p = 1) dataset

[Figure: test set AUC vs. boosting iterations T; panels (a) and (b)]

Fig 7 Mov Th(p = []) dataset


[Figure: test set AUC vs. boosting iterations T; panels (a) and (b)]

Fig 8 Yahoo A(p = 1) dataset

[Figure: test set AUC vs. boosting iterations T; panels (a) and (b)]

Fig 9 Yahoo A(p = 25) dataset

[Figure: test set AUC vs. boosting iterations T; panels (a) and (b)]

Fig 10 Yahoo G(p = 1) dataset


[Figure: test set AUC vs. boosting iterations T; panels (a) and (b)]

Fig 11 Yahoo G(p = 25) dataset

[Figure: test set AUC vs. boosting iterations T; panels (a) and (b)]

Fig 12 TaFeng(p = 1) dataset

[Figure: test set AUC vs. boosting iterations T; panels (a) and (b)]

Fig 13 Book(p = 1) dataset


[Figure: test set AUC vs. boosting iterations T; panels (a) and (b)]

Fig 14 LST(p = 1) dataset

[Figure: test set AUC vs. boosting iterations T; panels (a) and (b)]

Fig 15 Adver(p = []) dataset

[Figure: test set AUC vs. boosting iterations T; panels (a) and (b)]

Fig 16 Adver(p = 1) dataset


[Figure: test set AUC vs. boosting iterations T; panels (a) and (b)]

Fig 17 CRF(p = []) dataset

D Final Comparison

[Figure: Average Rank AUC (horizontal axis) vs. Average Rank Time (vertical axis) for BL, OSR, SMOTE, ADASYN, RUS, Cl Knn, Far Knn, CBU, AB, AC(R=28), AC(R=RL), EE(S=10), EE(S=15) and EE par]

Fig 18 Average rank AUC versus average rank Time (see Table 9) across the large datasets from Table 2 (CRF and Bank). Regarding the former, an extra row EE par is added, having the same AUC-value as EE(S = 15). Ranks for each dataset are subsequently obtained via the procedure outlined in Section 4.6.1. Points occurring in the upper-right region are preferred.


References

Akbani R, Kwek S, Japkowicz N (2004) Applying support vector machines to imbalanced datasets. In: Machine Learning: ECML 2004, 15th European Conference on Machine Learning, Pisa, Italy, September 20-24, 2004, Proceedings. Springer Berlin Heidelberg, Berlin, Heidelberg, pp 39–50. DOI 10.1007/978-3-540-30115-8_7
Ali A, Shamsuddin SM, Ralescu AL (2015) Classification with class imbalance problem: A review. International Journal of Advances in Soft Computing and its Applications 7(3):176–204
Alzahrani T, Horadam KJ (2016) Community detection in bipartite networks: Algorithms and case studies. In: Complex Systems and Networks: Dynamics, Controls and Applications. Springer Berlin Heidelberg, Berlin, Heidelberg, pp 25–50. DOI 10.1007/978-3-662-47824-0_2
Bachner J (2013) Predictive policing: Preventing crime with data and analytics. IBM Center for the Business of Government
Baesens B, Van Gestel T, Viaene S, Stepanova M, Suykens J, Vanthienen J (2003) Benchmarking state-of-the-art classification algorithms for credit scoring. Journal of the Operational Research Society 54(6):627–635. DOI 10.1057/palgrave.jors.2601545
Barandela R, Sánchez J, García V, Rangel E (2003) Strategies for learning in class imbalance problems. Pattern Recognition 36(3):849–851. DOI 10.1016/S0031-3203(02)00257-1
Barber MJ (2007) Modularity and community detection in bipartite networks. Physical Review E 76:066102. DOI 10.1103/PhysRevE.76.066102
Barua S, Islam MM, Yao X, Murase K (2014) MWMOTE – majority weighted minority oversampling technique for imbalanced data set learning. IEEE Transactions on Knowledge and Data Engineering 26(2):405–425. DOI 10.1109/TKDE.2012.232
Batista GEAPA, Prati RC, Monard MC (2004) A study of the behavior of several methods for balancing machine learning training data. SIGKDD Explor Newsl 6(1):20–29. DOI 10.1145/1007730.1007735
Beckett SJ (2016) Improved community detection in weighted bipartite networks. Royal Society Open Science 3(1). DOI 10.1098/rsos.140536
Bekkar M, Djemaa HK, Alitouche TA (2013) Evaluation measures for models assessment over imbalanced data sets. Journal of Information Engineering and Applications 3(10):27–38
Bhattacharyya S, Jha S, Tharakunnel K, Westland JC (2011) Data mining for credit card fraud: A comparative study. Decision Support Systems 50(3):602–613. DOI 10.1016/j.dss.2010.08.008
Blondel VD, Guillaume JL, Lambiotte R, Lefebvre E (2008) Fast unfolding of communities in large networks. Journal of Statistical Mechanics: Theory and Experiment 2008(10):P10008
Breiman L, Friedman J, Stone CJ, Olshen RA (1984) Classification and regression trees. Taylor & Francis
Brozovsky L, Petricek V (2007) Recommender system for online dating service. In: Proceedings of Znalosti 2007 Conference, VSB Ostrava
Cha M, Mislove A, Gummadi KP (2009) A measurement-driven analysis of information propagation in the Flickr social network. In: Proceedings of the 18th International Conference on World Wide Web, ACM, New York, NY, USA, WWW '09, pp 721–730. DOI 10.1145/1526709.1526806
Chawla NV (2005) Data mining for imbalanced datasets: An overview. In: Data mining and knowledge discovery handbook. Springer US, Boston, MA, pp 853–867
Chawla NV, Bowyer KW, Hall LO, Kegelmeyer WP (2002) SMOTE: synthetic minority over-sampling technique. Journal of Artificial Intelligence Research 16:321–357
Chawla NV, Lazarevic A, Hall LO, Bowyer KW (2003) SMOTEBoost: Improving prediction of the minority class in boosting. In: Knowledge Discovery in Databases: PKDD 2003. Springer Berlin Heidelberg, Berlin, Heidelberg, pp 107–119
Chawla NV, Japkowicz N, Kotcz A (2004) Editorial: Special issue on learning from imbalanced data sets. SIGKDD Explor Newsl 6(1):1–6. DOI 10.1145/1007730.1007733
Chen M, Mao S, Liu Y (2014) Big data: A survey. Mobile Networks and Applications 19(2):171–209. DOI 10.1007/s11036-013-0489-0
Chyi YM (2003) Classification analysis techniques for skewed class distribution problems. Master thesis, Department of Information Management, National Sun Yat-Sen University
Demšar J (2006) Statistical comparisons of classifiers over multiple data sets. Journal of Machine Learning Research 7(Jan):1–30
Dhillon IS (2001) Co-clustering documents and words using bipartite spectral graph partitioning. In: Proceedings of the Seventh ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, ACM, New York, NY, USA, KDD '01, pp 269–274. DOI 10.1145/502512.502550


Drummond C, Holte RC (2003) C4.5, class imbalance, and cost sensitivity: why under-sampling beats over-sampling. In: Proceedings of the ICML '03 Workshop on Learning from Imbalanced Datasets
Fan RE, Chang KW, Hsieh CJ, Wang XR, Lin CJ (2008) LIBLINEAR: A library for large linear classification. Journal of Machine Learning Research 9:1871–1874
Fan W, Stolfo SJ, Zhang J, Chan PK (1999) AdaCost: Misclassification cost-sensitive boosting. In: Proceedings of the Sixteenth International Conference on Machine Learning, Morgan Kaufmann Publishers Inc., San Francisco, CA, USA, ICML '99, pp 97–105
Fawcett T (2006) An introduction to ROC analysis. Pattern Recognition Letters 27(8):861–874. DOI 10.1016/j.patrec.2005.10.010
Finch H (2005) Comparison of distance measures in cluster analysis with dichotomous data. Journal of Data Science 3(1):85–100
Fortunato S (2010) Community detection in graphs. Physics Reports 486(3-5):75–174. DOI 10.1016/j.physrep.2009.11.002
Junque de Fortuny E, Martens D, Provost F (2014a) Predictive modeling with big data: is bigger really better? Big Data 1(4):215–226. DOI 10.1089/big.2013.0037
Junque de Fortuny E, Stankova M, Moeyersoms J, Minnaert B, Provost F, Martens D (2014b) Corporate residence fraud detection. In: Proceedings of the 20th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, ACM, New York, NY, USA, KDD '14, pp 1650–1659. DOI 10.1145/2623330.2623333
Frasca M, Bertoni A, Re M, Valentini G (2013) A neural network algorithm for semi-supervised node label learning from unbalanced data. Neural Networks 43:84–98. DOI 10.1016/j.neunet.2013.01.021
Friedman M (1937) The use of ranks to avoid the assumption of normality implicit in the analysis of variance. Journal of the American Statistical Association 32(200):675–701
García E, Lozano F (2007) Boosting support vector machines. In: Machine Learning and Data Mining in Pattern Recognition, 5th International Conference, MLDM 2007, Leipzig, Germany, July 18-20, Post Proceedings. IBaI publishing, pp 153–167
Goldstein M, Uchida S (2016) A comparative evaluation of unsupervised anomaly detection algorithms for multivariate data. PLOS ONE 11(4):1–31. DOI 10.1371/journal.pone.0152173
González PC, Velásquez JD (2013) Characterization and detection of taxpayers with false invoices using data mining techniques. Expert Systems with Applications 40(5):1427–1436. DOI 10.1016/j.eswa.2012.08.051
Guimera R, Sales-Pardo M, Amaral LAN (2007) Module identification in bipartite and directed networks. Physical Review E 76:036102. DOI 10.1103/PhysRevE.76.036102
Guo H, Viktor HL (2004) Learning from imbalanced data sets with boosting and data generation: The DataBoost-IM approach. SIGKDD Explor Newsl 6(1):30–39. DOI 10.1145/1007730.1007736
Guo X, Yin Y, Dong C, Yang G, Zhou G (2008) On the class imbalance problem. In: 2008 Fourth International Conference on Natural Computation, IEEE, vol 4, pp 192–201. DOI 10.1109/ICNC.2008.871
Han H, Wang WY, Mao BH (2005) Borderline-SMOTE: A new over-sampling method in imbalanced data sets learning. In: Advances in Intelligent Computing. Springer Berlin Heidelberg, Berlin, Heidelberg, pp 878–887
He H, Garcia EA (2009) Learning from imbalanced data. IEEE Transactions on Knowledge and Data Engineering 21(9):1263–1284. DOI 10.1109/TKDE.2008.239
He H, Bai Y, Garcia EA, Li S (2008) ADASYN: Adaptive synthetic sampling approach for imbalanced learning. In: 2008 IEEE International Joint Conference on Neural Networks (IEEE World Congress on Computational Intelligence), IEEE, pp 1322–1328. DOI 10.1109/IJCNN.2008.4633969
Holm S (1979) A simple sequentially rejective multiple test procedure. Scandinavian Journal of Statistics 6(2):65–70
Hsu CW, Lin CJ (2002) A comparison of methods for multiclass support vector machines. IEEE Transactions on Neural Networks 13(2):415–425. DOI 10.1109/72.991427
Huang A (2008) Similarity measures for text document clustering. In: Proceedings of the Sixth New Zealand Computer Science Research Student Conference (NZCSRSC2008), Christchurch, New Zealand, pp 49–56
Iman RL, Davenport JM (1980) Approximations of the critical region of the Friedman statistic. Communications in Statistics - Theory and Methods 9(6):571–595
Jo T, Japkowicz N (2004) Class imbalances versus small disjuncts. ACM SIGKDD Explor Newsl 6(1):40–49. DOI 10.1145/1007730.1007737


Jutla IS, Jeub LG, Mucha PJ (2011-2016) A generalized Louvain method for community detection implemented in MATLAB. URL http://netwiki.amath.unc.edu/GenLouvain
Kubat M, Matwin S (1997) Addressing the curse of imbalanced training sets: One-sided selection. In: Proceedings of the Fourteenth International Conference on Machine Learning, Morgan Kaufmann Publishers Inc., San Francisco, CA, USA, pp 179–186
Lancichinetti A, Fortunato S (2009) Community detection algorithms: A comparative analysis. Physical Review E 80:056117. DOI 10.1103/PhysRevE.80.056117
Larremore DB, Clauset A, Jacobs AZ (2014) Efficiently inferring community structure in bipartite networks. Physical Review E: Statistical, Nonlinear, and Soft Matter Physics 90:012805. DOI 10.1103/PhysRevE.90.012805
Li J, Fine JP (2010) Weighted area under the receiver operating characteristic curve and its application to gene selection. Journal of the Royal Statistical Society: Series C (Applied Statistics) 59(4):673–692. DOI 10.1111/j.1467-9876.2010.00713.x
Li X, Wang L, Sung E (2008) AdaBoost with SVM-based component classifiers. Engineering Applications of Artificial Intelligence 21(5):785–795. DOI 10.1016/j.engappai.2007.07.001
Lichman M (2013) UCI machine learning repository. URL http://archive.ics.uci.edu/ml
Liu W, Chawla S, Cieslak DA, Chawla NV (2010) A robust decision tree algorithm for imbalanced data sets. In: Proceedings of the Tenth SIAM International Conference on Data Mining, SIAM, Philadelphia, vol 10, pp 766–777
Liu XY, Wu J, Zhou ZH (2009) Exploratory undersampling for class-imbalance learning. IEEE Transactions on Systems, Man, and Cybernetics, Part B (Cybernetics) 39(2):539–550. DOI 10.1109/TSMCB.2008.2007853
Luts J, Ojeda F, Van de Plas R, De Moor B, Van Huffel S, Suykens JA (2010) A tutorial on support vector machine-based methods for classification problems in chemometrics. Analytica Chimica Acta 665(2):129–145. DOI 10.1016/j.aca.2010.03.030
Macskassy SA, Provost F (2007) Classification in networked data: A toolkit and a univariate case study. Journal of Machine Learning Research 8(May):935–983
Martens D, Provost F (2014) Explaining data-driven document classifications. MIS Quarterly 38(1):73–100
Martens D, Provost F, Clark J, Junque de Fortuny E (2016) Mining massive fine-grained behavior data to improve predictive analytics. MIS Quarterly 40(4):869–888
Mazurowski MA, Habas PA, Zurada JM, Lo JY, Baker JA, Tourassi GD (2008) Training neural network classifiers for medical decision making: The effects of imbalanced datasets on classification performance. Neural Networks 21(2-3):427–436. DOI 10.1016/j.neunet.2007.12.031
Mease D, Wyner AJ, Buja A (2007) Boosted classification trees and class probability/quantile estimation. Journal of Machine Learning Research 8:409–439
Nemenyi P (1963) Distribution-free Multiple Comparisons. Dissertation, Princeton University
Newman MEJ, Girvan M (2004) Finding and evaluating community structure in networks. Physical Review E 69:026113. DOI 10.1103/PhysRevE.69.026113
Ng AY (2004) Feature selection, L1 vs. L2 regularization, and rotational invariance. In: Proceedings of the Twenty-first International Conference on Machine Learning, ACM, New York, NY, USA, ICML '04, pp 78–. DOI 10.1145/1015330.1015435
Ng AY, Jordan MI (2002) On discriminative vs. generative classifiers: A comparison of logistic regression and naive Bayes. In: Advances in Neural Information Processing Systems 14, MIT Press, pp 841–848
Ngai E, Hu Y, Wong Y, Chen Y, Sun X (2011) The application of data mining techniques in financial fraud detection: A classification framework and an academic review of literature. Decision Support Systems 50(3):559–569. DOI 10.1016/j.dss.2010.08.006
Platt JC (1999) Probabilistic outputs for support vector machines and comparisons to regularized likelihood methods. In: Advances in Large-Margin Classifiers, MIT Press, pp 61–74
Porter MA, Onnela JP, Mucha PJ (2009) Communities in networks. Notices of the American Mathematical Society 56(9):1082–1097
Provost F, Fawcett T (2013) Data Science for Business: What you need to know about data mining and data-analytic thinking. O'Reilly Media, Inc.
Provost F, Dalessandro B, Hook R, Zhang X, Murray A (2009) Audience selection for on-line brand advertising: Privacy-friendly social network targeting. In: Proceedings of the 15th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, ACM, New York, NY, USA, KDD '09, pp 707–716. DOI 10.1145/1557019.1557098


Raskutti B, Kowalczyk A (2004) Extreme re-balancing for SVMs: A case study. SIGKDD Explor Newsl 6(1):60–69. DOI 10.1145/1007730.1007739
Rosvall M, Bergstrom CT (2008) Maps of random walks on complex networks reveal community structure. Proceedings of the National Academy of Sciences 105(4):1118–1123. DOI 10.1073/pnas.0706851105
Schapire RE (1999) A brief introduction to boosting. In: Proceedings of the 16th International Joint Conference on Artificial Intelligence - Volume 2, Morgan Kaufmann Publishers Inc., San Francisco, CA, USA, IJCAI'99, pp 1401–1406
Schapire RE, Singer Y (1999) Improved boosting algorithms using confidence-rated predictions. Machine Learning 37(3):297–336. DOI 10.1023/A:1007614523901
Shmueli G (2017) Analyzing behavioral big data: Methodological, practical, ethical, and moral issues. Quality Engineering 29(1):57–74. DOI 10.1080/08982112.2016.1210979
Sobhani P, Viktor H, Matwin S (2015) Learning from imbalanced data using ensemble methods and cluster-based undersampling. In: New Frontiers in Mining Complex Patterns: Third International Workshop, NFMCP 2014, Held in Conjunction with ECML-PKDD 2014, Nancy, France, September 19, 2014, Revised Selected Papers. Springer International Publishing, Cham, pp 69–83. DOI 10.1007/978-3-319-17876-9_5
Stankova M (2016) Classification within network data with a bipartite structure. Dissertation, University of Antwerp
Stankova M, Martens D, Provost F (2015) Classification over bipartite graphs through projection. Working Papers 2015001, University of Antwerp, Faculty of Applied Economics
Sun Y, Kamel MS, Wong AK, Wang Y (2007) Cost-sensitive boosting for classification of imbalanced data. Pattern Recognition 40(12):3358–3378. DOI 10.1016/j.patcog.2007.04.009
Suykens JA, Van Gestel T, De Brabanter J, De Moor B, Vandewalle J (2002) Least squares support vector machines. World Scientific, Singapore
Tang B, He H (2015) ENN: Extended nearest neighbor method for pattern recognition [research frontier]. IEEE Computational Intelligence Magazine 10(3):52–60. DOI 10.1109/MCI.2015.2437512
Tang Y, Zhang YQ, Chawla NV, Krasser S (2009) SVMs modeling for highly imbalanced classification. IEEE Transactions on Systems, Man, and Cybernetics, Part B (Cybernetics) 39(1):281–288. DOI 10.1109/TSMCB.2008.2002909
Tobback E, Moeyersoms J, Stankova M, Martens D (2016) Bankruptcy prediction for SMEs using relational data. Working Paper 2016004, University of Antwerp, Faculty of Applied Economics
Verbeke W, Dejaeger K, Martens D, Hur J, Baesens B (2012) New insights into churn prediction in the telecommunication sector: A profit driven data mining approach. European Journal of Operational Research 218(1):211–229. DOI 10.1016/j.ejor.2011.09.031
Veropoulos K, Campbell I, Cristianini N (1999) Controlling the sensitivity of support vector machines. In: Proceedings of the International Joint Conference on Artificial Intelligence, Stockholm, Sweden (IJCAI99), pp 55–60
Whitrow C, Hand DJ, Juszczak P, Weston D, Adams NM (2009) Transaction aggregation as a strategy for credit card fraud detection. Data Mining and Knowledge Discovery 18(1):30–55. DOI 10.1007/s10618-008-0116-z
Wickramaratna J, Holden SB, Buxton BF (2001) Performance degradation in boosting. In: Proceedings of the Second International Workshop on Multiple Classifier Systems, Springer, London, UK, MCS '01, pp 11–21
Yen SJ, Lee YS (2009) Cluster-based under-sampling approaches for imbalanced data distributions. Expert Systems with Applications 36(3, Part 1):5718–5727. DOI 10.1016/j.eswa.2008.06.108
Yu HF, Lo HY, Hsieh HP, Lou JK, McKenzie TG, Chou JW, Chung PH, Ho CH, Chang CF, Wei YH, et al (2010) Feature engineering and classifier ensemble for KDD Cup 2010. In: Proceedings of the KDD Cup 2010 Workshop, pp 1–16
Zha H, He X, Ding C, Simon H, Gu M (2001) Bipartite graph partitioning and data clustering. In: Proceedings of the Tenth International Conference on Information and Knowledge Management, ACM, New York, NY, USA, CIKM '01, pp 25–32. DOI 10.1145/502585.502591
Zhang J, Mani I (2003) KNN approach to unbalanced data distributions: A case study involving information extraction. In: Proceedings of the ICML'2003 Workshop on Learning from Imbalanced Datasets, Washington DC
Ziegler CN, McNee SM, Konstan JA, Lausen G (2005) Improving recommendation lists through topic diversification. In: Proceedings of the 14th International Conference on World Wide Web, ACM, New York, NY, USA, WWW '05, pp 22–32. DOI 10.1145/1060745.1060754



(Ngai et al 2011), mainly because these are difficult to determine, uncertain, and have a temporal characteristic. To conclude this section, we note that each of the performance assessment criteria mentioned in this paragraph requires some additional information in the form of a weight function (wAUC), capacity requirements and/or costs (lift/cost-curves). These measures are application specific, and this is the main reason we excluded them from our study.

3 Methods

3.1 Oversampling

Over the years, the data mining community investigated several techniques that balance the data distribution by oversampling the minority class. In this section, we investigate the basic oversampling with replacement (OSR) approach in conjunction with synthetic sample generation procedures. The first technique simply duplicates minority class instances by a certain amount. Several references dealing with traditional low-dimensional data note that this technique may make the decision regions of the learner smaller and too specific on the replicated instances, causing the learner to overfit (Chawla et al 2002; Han et al 2005; Liu et al 2009; He and Garcia 2009). The synthetic approaches are designed to overcome this overfitting behaviour by generating new, non-overlapping instances in the minority class space. The techniques we investigate are SMOTE (Chawla et al 2002) and ADASYN (He et al 2008).

Consider a certain minority class instance xi. In the traditional setting, SMOTE and ADASYN will generate a new synthetic instance xsyn by choosing a random point on the line segment between the point xi and one of its K minority class nearest neighbours (computed according to Euclidean distance). In SMOTE, each original minority class instance xi generates the same number of synthetic instances, whereas the ADASYN algorithm generates a variable number of synthetic instances by putting a larger weight on the harder to learn minority instances. This way, the learner is more focused toward difficult instances.
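The classic interpolation step can be sketched as follows (a toy illustration with made-up points; `smote_sample` is our own name, not taken from any SMOTE implementation):

```python
import numpy as np

rng = np.random.default_rng(42)

def smote_sample(x_i, neighbours):
    """Generate one synthetic point on the line segment between x_i and a
    randomly chosen minority-class nearest neighbour (classic SMOTE step)."""
    x_nn = neighbours[rng.integers(len(neighbours))]
    gap = rng.uniform()                  # random position on the segment
    return x_i + gap * (x_nn - x_i)

x_i = np.array([1.0, 2.0])
neighbours = [np.array([3.0, 2.0]), np.array([1.0, 4.0])]
x_syn = smote_sample(x_i, neighbours)
# x_syn lies coordinate-wise between x_i and the chosen neighbour
```

ADASYN uses the same interpolation step but decides, per instance xi, how many synthetic samples to generate based on how many majority-class points surround it.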

These techniques need to be adapted when dealing with binary behaviour data, where each instance is represented by a large and sparse binary vector. The main differences with the original versions of SMOTE and ADASYN are indicated in Table 1 and are explained in more detail in the following paragraphs. The first difference lies in the generation of the new synthetic instances. As before, a synthetic sample is constructed based on two original minority instances. When both instances have a 0 or 1 in their corresponding column, the synthetic sample will also show a 0 or 1, respectively, at the considered position. When only one of the two minority instances shows a 1, the decision is made according to a user-specified parameter prior_opt. This parameter can take one of the following three options:

– "FlipCoin", where there is a 50% probability that the synthetic instance will show a 1 at the considered position.

– "Prior", where the value of the synthetic sample is determined by the prior within the minority class in the corresponding column. One generates a random number

Imbalanced classification in sparse and large behaviour datasets 11

u in the interval [0,1] and puts a 1 in the corresponding position if u is smaller than the prior.

– "Reverse Prior", where one generates a random number u in the interval [0,1] and puts a 1 in the corresponding position if u is larger than the prior within the minority class for this column/feature.
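The three options above can be sketched as follows (a minimal NumPy sketch, not the authors' code; the helper name `synth_binary` and its interface are our own, for illustration):

```python
import numpy as np

def synth_binary(xi, xnn, prior, prior_opt, rng):
    """Generate one synthetic binary sample from minority instance xi and a
    chosen minority nearest neighbour xnn. `prior` holds the minority-class
    column priors. Hypothetical helper, for illustration only."""
    syn = np.zeros_like(xi)                  # 0-0 match: synthetic value stays 0
    syn[(xi == 1) & (xnn == 1)] = 1          # 1-1 match: synthetic value is 1
    clash = xi != xnn                        # exactly one of the two shows a 1
    u = rng.random(int(clash.sum()))         # one random draw per clashing column
    if prior_opt == "FlipCoin":              # 50% probability of a 1
        syn[clash] = (u < 0.5).astype(int)
    elif prior_opt == "Prior":               # 1 if u is smaller than the prior
        syn[clash] = (u < prior[clash]).astype(int)
    elif prior_opt == "Reverse Prior":       # 1 if u is larger than the prior
        syn[clash] = (u > prior[clash]).astype(int)
    return syn
```

Positions where both parents agree are copied verbatim; only the clashing positions are randomized according to the chosen option.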

The second difference lies in the way the nearest neighbours are determined. It would be unwise to consider Euclidean distance in that respect, because it treats a 0-0 match in the same way as a 1-1 match. Since we are working with dichotomous variables (i.e. present or absent), a 1-1 match at a certain position is far more informative than a 0-0 match. For instance, two users visiting the same web page contains more information than two users who didn't visit that specific page. The similarity between two instances is defined by a user-specified parameter sim_measure. We have limited ourselves to two popular choices: "Jaccard" uses the Jaccard similarity measure (Finch 2005) and "Cosine" uses the cosine similarity measure (Huang 2008). In principle, one could apply any of the metrics summarized in Stankova et al (2015).
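For 0/1 vectors, both measures reduce to simple counts of 1-1 matches, which is exactly why they ignore 0-0 matches; a small sketch (our own illustration, assuming dense binary NumPy arrays):

```python
import numpy as np

def jaccard(a, b):
    """Jaccard similarity for 0/1 vectors: |a AND b| / |a OR b|.
    A 1-1 match contributes; a 0-0 match does not."""
    union = np.logical_or(a, b).sum()
    return np.logical_and(a, b).sum() / union if union else 0.0

def cosine(a, b):
    """Cosine similarity; for 0/1 vectors this is |a AND b| / sqrt(|a| * |b|)."""
    denom = np.sqrt(a.sum() * b.sum())
    return np.logical_and(a, b).sum() / denom if denom else 0.0
```

Note that adding shared zero columns to both vectors leaves both scores unchanged, unlike Euclidean distance.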

Table 1 Differences between the original SMOTE and ADASYN implementations and the versions SMOTEbeh and ADASYNbeh tailored for behaviour data (parameter explanations are provided in Section 3.1)

                               SMOTE            SMOTEbeh         ADASYN           ADASYNbeh
Amount of oversampling         N                β                β                β
Synthetic sample generation    random point on  prior_opt        random point on  prior_opt
                               line segment                      line segment
Similarity measure             Euclidean        Jaccard/Cosine   Euclidean        Jaccard/Cosine
Number of nearest neighbours   K                K̄                K                K, K̄

A detailed pseudo-code implementation of our versions of SMOTE and ADASYN, called SMOTEbeh and ADASYNbeh, is shown in Algorithm 1. Also note that we introduced an extra parameter K̄ to decouple the determination of the number of synthetic instances that need to be generated for a certain minority instance from the number of nearest neighbours it uses to obtain the synthetic instances.

The experimental set-up adopted by Chawla et al (2002); He et al (2008) considers K = 5 as the number of nearest neighbours used, without a detailed motivation. Furthermore, the latter paper compares SMOTE with an oversampling percentage N of 200% (meaning that the size of the newly created synthetic data instances is twice as large as the size of the original minority training data) with a completely balanced dataset (β = 1 in Algorithm 1) in ADASYN. In our experiments, we will consider a variety of possible K-values and compare SMOTE and ADASYN with identical oversampling rates, controlled by a single parameter β.


Algorithm 1 SMOTEbeh and ADASYNbeh pseudo-code implementation for binary behaviour data

Input: X_min, X_maj, β, prior_opt, sim_measure, K, K̄
a) determine the total amount of synthetic minority instances that need to be generated (β ∈ [0,1] is a parameter that controls the amount of oversampling; β = 1 means a fully balanced dataset will be created):
    G = (|X_maj| − |X_min|) × β
b) determine the number of synthetic samples g_i that need to be generated for each minority class instance x_i:
if SMOTE then
    g_i ← ⌈G / |X_min|⌉
else if ADASYN then
    calculate the K nearest neighbours (with the sim_measure option) for instance x_i from the set (X_min \ x_i) ∪ X_maj and determine Δ_i, the number of majority class nearest neighbours. Next, calculate r_i = Δ_i / K and normalize these values: r_i ← r_i / Σ_{j=1}^{|X_min|} r_j
    g_i ← ⌈r_i × G⌉
end if
c) generate g_i synthetic samples for minority instance x_i:
calculate the K̄ nearest neighbours (with the sim_measure option) for instance x_i from the set X_min \ x_i. Additionally, remove those nearest neighbours that have a similarity of 0 with x_i. The remaining nearest neighbours form the set K_used. If this set turns out to be empty, set K_used = {x_i}.
for iter = 1 → g_i do
    randomly choose 1 nearest neighbour from the set K_used
    generate a synthetic minority sample from x_i and the chosen nearest neighbour (according to prior_opt)
end for
d) because Σ_i g_i ≥ G, randomly remove synthetic points until the total number of synthetic samples equals G

3.2 Undersampling

In this section, we will compare the simple random undersampling technique (RUS) with informed undersampling approaches. The first method randomly discards majority class training instances. While this technique can achieve fast training performance of the underlying base learner, an obvious disadvantage is the fact that it might discard potentially useful majority class instances. The informed approaches try to intelligently retain the most informative majority class instances, in the hope of increasing predictive performance while at the same time keeping the fast training speed of the underlying classifier.

The first set of informed undersampling techniques is based on the methods proposed by Zhang and Mani (2003); Chyi (2003). The K-nearest neighbour 'classifier' is used to determine the importance of each majority class training instance by calculating the total similarity with the K closest minority class training set examples. Regarding similarity computations for binary behaviour data, we refer to the related discussion in Section 3.1. The first technique, called "Closest Knn", retains majority class examples that are closest to the minority class instances. These instances are the most difficult to classify and we would expect them to be most informative. The second method, called "Closest tot sim", is similar to the previously described technique. The difference is that it no longer computes similarities with the K closest minority neighbours; instead, it calculates the total similarity with all minority instances in determining the importance. The main reason we included this technique is computational speed.7 The last techniques, called "Farthest Knn" and "Farthest tot sim", are included for comparison with the previously mentioned techniques. Their implementation is identical to the previously described techniques; however, they retain majority class examples that are farthest from the minority class instances. For each of the proposed methods in this paragraph, the amount of undersampling is controlled by a user-specified parameter β_u according to the following formula:

    Nr_rem = ⌊(|X_maj| − |X_min|) × β_u⌋    (2)

where Nr_rem represents the amount of majority class instances to be discarded; β_u = 1 means a completely balanced dataset is obtained.

The second set of informed undersampling techniques aims at targeting the within-class imbalance problem and is based on the approach proposed in Sobhani et al (2015). They postulate that this within-class imbalance problem is more pronounced in the case of undersampling methods: if we were to randomly select majority class instances, then the probability of drawing an instance from small disjuncts within the majority class would be very low. These regions might therefore contain no representatives and remain unlearned. The authors chose to address this issue by clustering the majority class instances in a first step and subsequently selecting an equal number of representatives from each cluster. The reported results show their approach to outperform the CBO-algorithm (see Section 1.1). In the following paragraphs, we will dig deeper into the subject of clustering behaviour data. We already refer to Algorithm 2 for an overview of our cluster-based undersampling method (CBU).

As we noted in Section 2.1, behaviour data can be represented as a bipartite graph. The clustering of behaviour data8 aims at finding groups of nodes (communities) that connect more to each other than to other nodes in the network. This subject is currently an active area of research, with a rapid evolution of a vast number of cluster detection techniques (Zha et al 2001; Dhillon 2001; Larremore et al 2014; Beckett 2016). We refer to Porter et al (2009); Fortunato (2010); Alzahrani and Horadam (2016) for detailed surveys on the problem. It should be noted that the vast majority of publications deal with the subject of clustering unigraphs (networks with only one type of nodes); it is only fairly recently that interest grew in the clustering of bigraphs. In our implementations, we have chosen the popular modularity-based approaches9 for clustering bigraphs, and these fall into two directions. In the first, the modularity function that is used for unigraphs is adapted to be suitable for bigraphs; see for instance the work of Barber (2007). The other direction, which we adopt in our study, projects the bigraph to a unigraph of bottom nodes and performs community detection on the projection using traditional modularity definitions. Note that Guimera et al (2007) observed no difference in the obtained communities using either direction.

7 For each majority class instance, we no longer need to sort the similarities with all minority instances in determining the K largest values.

8 This subject is more commonly known as community detection in bipartite graphs.
9 Modularity-based approaches attempt to optimize a quality function, known as modularity, for finding community structures in networks, and rely on the use of heuristics due to the complexity of the problem.


In this article, we adopt the methodology proposed in Alzahrani and Horadam (2016), which consists of projecting the bigraph to a unigraph of bottom nodes and applying10 the Louvain algorithm (Blondel et al 2008) on the projection. Lancichinetti and Fortunato (2009) performed a comparative study regarding the performance of 12 community detection algorithms and concluded the Louvain method to be the best modularity-based algorithm and second best among all algorithms. The heuristic is very fast, with an O(m) complexity, where m is the number of edges in the unigraph. We have chosen the Louvain algorithm because of its speed performance and the availability of a toolbox (Jutla et al 2011-2016) that is directly compatible with our implementations. The toolbox provides a generalized implementation of the Louvain algorithm, in the sense that multiple definitions of modularity are possible. The quality function we chose is the popular Newman-Girvan modularity (Newman and Girvan 2004).

With respect to the projection, Alzahrani and Horadam (2016) connect two bottom nodes if they have at least one top node in common. The connection weight between two bottom nodes in the projection is set to the number of shared top nodes. In our implementation, we adapted the connection weights in accordance with Stankova et al (2015) as follows: first of all, we assign weights to the top nodes corresponding to the hyperbolic tangent applied to the inverse degree of the top node. Next, the connection weight between two bottom nodes in the projection corresponds with the total weight of the shared top nodes. Top nodes having low degrees therefore obtain a higher contribution in the projection (e.g. two users making a transaction to a local book store are assumed to be more closely connected to each other than two users making a transaction to a large retail store).
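The weighted projection step can be sketched as follows (our own NumPy illustration of the tanh weighting, assuming a dense 0/1 biadjacency matrix with bottom nodes as rows; real behaviour data would of course use sparse matrices):

```python
import numpy as np

def project_bigraph(adj):
    """Project a bottom-by-top biadjacency matrix (0/1) onto a weighted
    unigraph of bottom nodes: each top node gets weight tanh(1/degree),
    and the edge weight between two bottom nodes is the total weight of
    their shared top nodes."""
    adj = np.asarray(adj, dtype=float)
    top_deg = adj.sum(axis=0)                                   # degree of each top node
    w = np.where(top_deg > 0, np.tanh(1.0 / np.maximum(top_deg, 1)), 0.0)
    W = (adj * w) @ adj.T                                       # sum of shared top-node weights
    np.fill_diagonal(W, 0.0)                                    # no self-loops
    return W
```

A top node shared by only two bottom nodes contributes tanh(1/2) to their edge, whereas a very popular top node contributes close to tanh(0) = 0, matching the intuition of the book-store example above.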

In our CBU-algorithm (see Algorithm 2), after clustering the bigraph containing exclusively majority class instances, we randomly select an equal amount of majority instances from each community to target the within-class imbalance problem. In the rare situation that the number of obtained clusters exceeds the required amount of majority instances (Nr_retain), we sort the communities according to a user-specified parameter Clust_opt and randomly select 1 instance from the first Nr_retain clusters. The parameter Clust_opt can take the following values:

– C_Smallest, where we sort clusters in ascending order of their size
– C_Largest, where we sort clusters in descending order of their size

Note that we randomly select majority class instances from each cluster; Yen and Lee (2009) found a random selection strategy after clustering to be superior to informed approaches based on distance.

10 Note that they also made use of the flow-based algorithm Infomap (Rosvall and Bergstrom 2008), which shows excellent results on the LFR-benchmark.


Algorithm 2 CBU pseudo-code implementation for behaviour data

Input: X_min, X_maj, β_u, Clust_opt
a) Cluster the majority class instances X_maj:
– Assign weights to each top node, corresponding with the hyperbolic tangent applied to the inverse of the node's degree.
– Project the bigraph X_maj to a weighted unigraph consisting of bottom node majority class instances. The weight w_ij between majority class instances i and j corresponds with the total weight of the shared top nodes.
– Apply the Louvain algorithm (Blondel et al 2008) on the projected unigraph to partition the majority class instances into clusters.
b) Select majority class instances:
Nr_rem ← ⌊(|X_maj| − |X_min|) × β_u⌋ (see Equation (2))
Nr_retain ← |X_maj| − Nr_rem
if Nr_retain < |Clust| then
    – Sort clusters according to Clust_opt
    – Randomly select 1 instance from the first Nr_retain clusters
else
    – Randomly select ⌈Nr_retain / |Clust|⌉ majority class instances from each cluster
    – Randomly discard instances from the previous step until its size corresponds with Nr_retain
end if
c) Return the new training set consisting of X_min and the selected majority class instances from step b)
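Step b) of Algorithm 2 might look as follows in simplified form (our own sketch; the rare branch that sorts clusters by Clust_opt when the cluster count exceeds Nr_retain is omitted here):

```python
import numpy as np

def cbu_select(cluster_ids, n_min, beta_u, seed=0):
    """Step b) of CBU, simplified: given the cluster label of every majority
    instance, pick an equal number of representatives per cluster, then
    randomly discard the surplus so that Nr_retain instances remain."""
    rng = np.random.default_rng(seed)
    cluster_ids = np.asarray(cluster_ids)
    n_maj = len(cluster_ids)
    nr_rem = int(np.floor((n_maj - n_min) * beta_u))       # Equation (2)
    nr_retain = n_maj - nr_rem
    clusters = [np.flatnonzero(cluster_ids == c) for c in np.unique(cluster_ids)]
    per_cluster = int(np.ceil(nr_retain / len(clusters)))  # equal quota per community
    picked = np.concatenate([
        rng.choice(c, size=min(per_cluster, len(c)), replace=False)
        for c in clusters
    ])
    rng.shuffle(picked)
    return picked[:nr_retain]                              # randomly discard the surplus
```

The equal per-cluster quota is what protects the small disjuncts: even a tiny community keeps at least one representative.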

3.3 Boosting, cost-sensitive learning and EasyEnsemble

3.3.1 AdaBoost

The AdaBoost (Schapire and Singer 1999; Schapire 1999) algorithm has been designed from the perspective of improving the performance of a weak learner, so that it achieves accuracies that are comparable with a strong learning algorithm. Fundamental to the idea of boosting is to maintain a weight distribution over the training set. In each boosting iteration, the weights of wrongly classified instances are increased, so that the underlying weak learner puts more emphasis on these hard examples. In our implementation, we will consider using a SVM as base learner. SVMs are generally regarded as strong learners. The studies of Wickramaratna et al (2001); García and Lozano (2007) note that using a strong learner usually results in performance degradation during the boosting process. The RBF-kernel SVM classifier (Wickramaratna et al 2001; Li et al 2008) is used as the underlying classifier to prove their point. In our study, we will employ a linear SVM, which can be considered as a weaker version compared to the RBF-kernel. Furthermore, the regularization parameter C can be viewed as a "weakness" indicator:11 lowering the C-value results in weaker learners, as can be seen from the goal function of the SVM optimization problem (see Equation (1)). We will come back to this point in Section 4.5.

11 The distinction between weak/strong learners is loosely 'defined' in Schapire (1999): a weak learner corresponds with a hypothesis that performs just slightly better than random guessing, while a strong learner is able to generate a hypothesis with an arbitrarily low error rate, given enough data. We adopt these definitions, but consider the distinction between weak/strong based on training set error. In a SVM context, it is quite typical that error levels on training data drop with increasing C-values (Suykens et al 2002). A learner that is 'too strong' means that even though its performance on training data is very high, it fails to generalize well and the test set error increases due to overfitting.


The boosting algorithm with underlying SVM is presented in Algorithm 3. This algorithm closely follows the original boosting implementation presented in Schapire and Singer (1999) and requires each learner to output confidence rated predictions in the interval [−1,1]. Since the SVM outputs real-valued scores, we apply the procedure of Platt (1999) to transform these scores into probability estimates (which can easily be translated to form confidence rated predictions). A logistic regression (LR) model (Ng and Jordan 2002) is trained using as input the SVM-scores and the corresponding labels as output. Note that the same data that are used to construct the linear SVM can be used to estimate the LR-model; see Platt (1999) for a motivation.

The boosting algorithm requires the weak learner to be trained using a distribution D_t. One could sample from this distribution to generate (unweighted) examples and train a SVM next. We have chosen to include weights in the SVM goal function formulation (this requires an extension of the LIBLINEAR package):

    min_{w,b,ξ_i}  (w^T w)/2 + C Σ_{i=1}^{m} weight_i ξ_i    (3)

The weights weight_i are set according to the weight distribution D_t(i) in each round of boosting. Note that the C-value that is used to train the SVM model will be divided by mean(weight_i). This normalization allows for a fair comparison between the weighted and unweighted SVM versions (e.g. in the first round of boosting, weight_i = 1/m for all instances; by multiplying the C-value with m, this corresponds to solving the unweighted problem with the same C-value).

We introduce an additional parameter µ, called the weight percentage, with values in [0,100], in the boosting algorithm of Algorithm 3. This parameter controls the amount of training data that is used to construct the SVM model and subsequent LR model. We sort the original training data in descending order according to the distribution D_t. Next, we form a new training set of minimal cardinality by including points from the original sorted training data until the total weight is higher than the weight-percentage µ/100. This way, the newly formed training data will contain only the part of the original training data that has the most weight. This partial set will then be used to construct a weighted SVM-model (according to Equation (3), with updated distribution for this set) and a subsequent LR-model. The idea of using a partial dataset to construct the base learner not only reduces training times, but also weakens the learner (García and Lozano 2007).
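The selection of the minimal subset covering the weight-percentage µ can be sketched as follows (our own illustration, not the authors' implementation):

```python
import numpy as np

def weight_subset(D, mu):
    """Select the minimal set of training indices whose cumulative weight
    under the boosting distribution D reaches the weight-percentage mu/100."""
    order = np.argsort(-D)                       # sort instances by descending weight
    cum = np.cumsum(D[order])
    # smallest prefix whose total weight reaches mu/100
    k = int(np.searchsorted(cum, mu / 100.0)) + 1
    return order[:min(k, len(D))]
```

With µ = 100 the full training set is returned; smaller values of µ drop the low-weight tail, which is where the speed-up and the weakening effect come from.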

In Algorithm 3, we have an explicit check to verify if r_AB = 1. In this case, the SVM model outputs scores that allow for a perfect classification on the training set. The subsequent LR model will find a threshold and output a value of −1 if the SVM score is lower than this threshold (a value of +1 if the score is higher than this threshold). In this situation, the training data would be perfectly classified. In our implementation, we attempt to avoid this kind of behaviour because it can lead to overfitting, meaning it might pinpoint the wrong threshold and make too drastic decisions (we lose the meaning of confidence). The check if r_AB ≤ 0 verifies if the currently boosted model is performing worse than random (this model would have a r_AB-value of 0). Obviously, if the model performs worse than random, we quit the boosting process. During the first round of boosting, we perform similar checks that are not explicitly indicated in Algorithm 3: in the case where r_AB = 1, we output the SVM scores instead of the LR binary values; when r_AB ≤ 0, we quit the boosting process and output the LR scores.

Algorithm 3 AdaBoost with a SVM-LR combination as a base learner

Input: (X,Y) = (x_1,y_1), …, (x_m,y_m), C, T, µ
Initialize distribution D_1(i) = 1/m
for t = 1 to T do
    – train weak learner using distribution D_t. The weak learner consists of a weighted linear SVM and a LR model, trained with weight-percentage µ of D_t:
        h_t ← Train_WeakLearner(X, Y, D_t, C, µ)
    – compute the weighted confidence r_AB on the training data:
        r_AB ← Σ_{i=1}^{m} D_t(i) y_i h_t(x_i)
      If (r_AB = 1 or r_AB ≤ 0) then α_t ← 0 and stop the boosting process
    – choose α_t ∈ ℝ:
        α_t ← (1/2) log((1 + r_AB) / (1 − r_AB))
    – update distribution:
        D_{t+1}(i) ← D_t(i) exp(−α_t y_i h_t(x_i)) / Z_t
      where Z_t is a normalization factor (chosen so that D_{t+1} will be a distribution)
end for
Output the final hypothesis (the output score is the term contained in the sign function):
    H(x) = sign(Σ_{t=1}^{T} α̃_t h_t(x))    with α̃_t = α_t / Σ_{i=1}^{T} α_i
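The per-round computations of Algorithm 3 (the confidence r_AB, the step size α_t, and the distribution update) can be sketched as follows, with the weak learner's real-valued predictions h(x_i) ∈ [−1,1] passed in (our own NumPy illustration):

```python
import numpy as np

def boost_round(D, y, h):
    """One round of the confidence-rated AdaBoost of Algorithm 3 (after
    Schapire and Singer 1999). D is the weight distribution, y the labels
    in {-1, +1}, h the weak learner's predictions in [-1, 1]."""
    r = float(np.sum(D * y * h))               # weighted confidence r_AB
    if r >= 1.0 or r <= 0.0:                   # stop conditions of Algorithm 3
        return r, 0.0, D
    alpha = 0.5 * np.log((1.0 + r) / (1.0 - r))
    D_next = D * np.exp(-alpha * y * h)        # up-weight misclassified instances
    D_next /= D_next.sum()                     # Z_t normalization
    return r, alpha, D_next
```

For example, a round in which three of four equally weighted instances are correctly scored gives r_AB = 0.5 and shifts weight toward the misclassified instance.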

3.3.2 AdaCost

The AdaCost algorithm (Fan et al 1999) is a variant of cost-sensitive learning where misclassification costs are introduced in the weight-update formula of AdaBoost. The cost-sensitive update rule increases the weights of costly misclassified instances more aggressively and decreases the weights of costly correct classifications more conservatively. Each instance is given a misclassification cost c_i, where we chose to put c_i = 1 for positive (minority) instances and c_i = 1/R for negative (majority) instances; R is a user-defined value that allows one to put more emphasis on the minority class. The implementation of AdaCost is similar to Algorithm 3, yet there are a few differences (Fan et al 1999): firstly, the initial distribution is chosen as D_1(i) = c_i / Σ_{j=1}^{m} c_j; secondly, the weight update rule is given by D_{t+1}(i) = D_t(i) exp(−α_t y_i h_t(x_i) β(i)) / Z_t, where β(i) = −0.5 sign(y_i h_t(x_i)) c_i + 0.5 is a cost-adjustment function; finally, the choice of α_t is given by α_t = (1/2) log((1 + r_AC) / (1 − r_AC)), where r_AC = Σ_{i=1}^{m} D_t(i) y_i h_t(x_i) β(i). Note that the checks to stop the boosting process prematurely (see the second bullet in Algorithm 3) are still based on the r-value obtained from AdaBoost (r_AB). This is because β ∈ [0,1].
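The cost-adjustment function β(i) is easily written down (our own one-line illustration of the formula of Fan et al (1999)):

```python
import numpy as np

def adacost_beta(y, h, c):
    """Cost-adjustment function of Fan et al (1999):
    beta(i) = -0.5 * sign(y_i * h(x_i)) * c_i + 0.5, so beta is in [0, 1].
    Misclassified instances (sign = -1) get beta = 0.5*c_i + 0.5, while
    correctly classified instances get beta = -0.5*c_i + 0.5."""
    return -0.5 * np.sign(y * h) * c + 0.5
```

With c_i = 1, a misclassification yields β(i) = 1 (full weight increase) and a correct classification yields β(i) = 0 (full weight decrease); smaller costs pull both cases toward the neutral value 0.5.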

In the first boosting round of AdaCost, the weighted SVM formulation (3) is equivalent to solving a SVM formulation with the following goal function:

    min_{w,b,ξ_i}  (w^T w)/2 + C_+ Σ_{i:y_i=+1} ξ_i + C_− Σ_{i:y_i=−1} ξ_i    (4)

where C_+ / C_− = R. This can be seen as a cost-sensitive version of a SVM, an idea that was initially proposed by Veropoulos et al (1999).

3.3.3 EasyEnsemble

One of the disadvantages of the random undersampling method is the fact that we are discarding potentially valuable information. EasyEnsemble (Liu et al 2009) is a method that combines several balanced subsets (S in total), containing randomly selected majority class instances together with all minority examples. Each subset contains the same amount of instances from both classes and is fed to the boosting algorithm presented in Algorithm 3. Afterwards, the weak learners h_{s,t} of each subset s are simply combined to form the final ensemble:

    H(x) = sign(Σ_{s=1}^{S} Σ_{t=1}^{T} α_{s,t} h_{s,t}(x))    with s = 1,…,S; t = 1,…,T    (5)

It is clear that this technique benefits from a combination of bagging and boosting (Liu et al 2009). Note that we apply the same boosting algorithm as previously described to each of the balanced subsets. However, when r_AB = 1 in the first round of boosting, we quit the boosting process, put α_1 = 1 and continue to use the trained LR-model in the final ensemble. It was previously noted that this can cause overfitting, in the sense that it can pinpoint a wrong threshold. However, the LR-model still contains information and the combination thereof with the models obtained from the other subsets remains valuable (as we have noted from initial experiments, by comparing the situation where we include or reject those subsets).

There are a few subtle, though important, differences with respect to the experiments performed by Liu et al (2009). First of all, the authors use a CART-algorithm (Breiman et al 1984) as base learner; we employ a linear SVM with subsequent LR as a weak learner. To our knowledge, this combination has not been proposed elsewhere, yet proves to be very efficient in this setting. Secondly, their methodology employs a discrete version of AdaBoost, where each weak learner outputs binary values in {−1,1}. As already stated, we make use of an improved version of AdaBoost that relies on confidence rated predictions (Schapire and Singer 1999). Finally, the authors reported performances with fixed levels of S = 4 and T = 10. We will investigate the effect of varying S and T levels.
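The bagging skeleton of Equation (5) can be sketched as follows; note that a simple nearest-centroid linear scorer stands in here for the boosted SVM-LR weak learners, purely to keep the illustration self-contained (our own sketch, not the authors' implementation):

```python
import numpy as np

def easy_ensemble_score(X_min, X_maj, S=5, seed=0):
    """EasyEnsemble skeleton: S balanced subsets, each containing all minority
    instances plus an equally sized random majority sample; one base learner
    per subset; the final score averages the linear decision scores."""
    rng = np.random.default_rng(seed)
    n = len(X_min)
    learners = []
    for _ in range(S):
        idx = rng.choice(len(X_maj), size=n, replace=False)  # balanced majority sample
        c_pos, c_neg = X_min.mean(axis=0), X_maj[idx].mean(axis=0)
        w = c_pos - c_neg                                    # stand-in linear learner
        b = w @ (c_pos + c_neg) / 2.0
        learners.append((w, b))
    def score(X):
        # combine the hypotheses of all subsets (cf. Equation (5))
        return np.mean([X @ w - b for (w, b) in learners], axis=0)
    return score
```

Because each subset is only twice the minority class size and the subsets are independent, the S learners can be trained in parallel, which is the source of the method's speed.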


4 Results and discussion

4.1 Datasets

Stankova et al (2015) provide the first large collection of benchmark behaviour datasets for classification. In our experiments, we make use of these data sources and extend this repository with two additional datasets. Each of these datasets shows a bipartite structure with a clear target variable to predict. We refer to this study and the next paragraph for a short description of the available data resources. In this paragraph, we indicate why we have chosen to include or reject certain data sources from the aforementioned study. The available datasets can be divided into small, medium and large datasets, based on the number of instances and the number of features present. The Norwegian companies and Reality Mining datasets comprise the small datasets, since they contain only a few hundreds of instances or features and are therefore regarded as impractical for our purposes. The MovieLens, Yahoo, TaFeng, Book-Crossing and LibimSeTi datasets belong to the medium sized datasets, each containing a few thousands up to a few hundreds of thousands of instances and features. All of these datasets are included in our study. The large datasets, containing hundreds of thousands up to millions of instances and features, are the Flickr and Kdd databases12. Other large proprietary data sources, not included in Stankova et al (2015), are the corporate residence fraud (CRF) and banking (Bank) datasets, which arise from real-life application domains with intrinsic imbalance.

To summarize, we have gathered datasets containing such fine-grained behaviour data from a wide variety of application domains. The MovieLens datasets, for which we are predicting the gender13 (Mov G) or the genre thriller14 (Mov Th), provide data on which films each user has rated. The Yahoo movies15 dataset has a similar structure, where the age of each user, Yahoo A (above or below average), or the gender, Yahoo G, is being predicted. The TaFeng dataset16 contains data on shopping behaviour, where age (below or above average) is being predicted based upon which products are purchased. In the book-crossing (Book) dataset (Ziegler et al 2005), users rate books and the age of the user (above or below average) is being predicted. LibimSeTi (LST) contains data from a dating site (Brozovsky and Petricek 2007), where users rate each other's profile and gender is being predicted. In the advertisement (Adver) (Lichman 2013) dataset, we try to predict if a url is an advertisement, based on a large variety of binary features of the url. Note that this dataset does not arise from the behaviour of entities, yet it still has a high dimensional and sparse representation. The Flickr dataset (Cha et al 2009) contains pictures being marked by users as favorite, and we predict the number of comments on each picture (below

12 Flickr and KDD will be excluded in the comparative study of Section 4.6. This is because some methods are too computationally intensive, especially in combination with the large number of possible parameter combinations, to be applied on these very large data sources. Furthermore, our statistical evidence is already sufficiently strong to conclude significance without these datasets. Having said this, these data sources will be included in the analysis of Section 5.
13 MovieLens 1M dataset from http://grouplens.org/datasets/movielens
14 MovieLens 10M dataset from http://grouplens.org/datasets/movielens
15 https://webscope.sandbox.yahoo.com
16 http://www.bigdatalab.ac.cn/benchmark/bm/dd?data=Ta-Feng


or above average). In the Kdd cup data, the performance of a student on a test is being predicted, based on artificially created binary features (Yu et al 2010). The corporate residence fraud dataset (CRF) (Junque de Fortuny et al 2014b) contains data on foreign companies making transactions with specific Belgian companies, where we try to predict whether the foreign company commits residence fraud (a type of fiscal fraud). Finally, the banking dataset (Bank) (Martens et al 2016) contains detailed behaviour on consumers making transactions with merchants or other persons, to predict interest in a pension fund product. Some characteristics of these datasets can be found in Table 2. The features column only shows the number of active features17.

Table 2 Behaviour data characteristics. The final column shows the imbalance ratio p, defined as the ratio of the number of minority class instances to the amount of majority class instances in the training set, expressed as a percentage. See Section 4.2 for details regarding p.

Name      |X_maj|    |X_min|    Features     p = 100 × |X_min|_train / |X_maj|_train
Mov G     4331       1709       3706         1% & 25%
Mov Th    10546      131        69878        1.24% (p = [])
Yahoo A   6030       1612       11915        1% & 25%
Yahoo G   5436       2206       11915        1% & 25%
TaFeng    17330      14310      23719        1% & 25%
Book      42900      18858      282973       1% & 25%
LST       59702      60145      166353       1%
Adver     2792       457        1555         16.38% (p = []) & 1%
CRF       869071     62         108753       0.0072% (p = [])
Bank      1193619    11107      3139570      0.93% (p = [])
Flickr    8166814    3028330    497472       0.1%
Kdd       7171885    1235867    19306083     0.5%

4.2 Methodology

Regarding the experiments performed in the upcoming sections, we applied a tenfold cross-validation procedure. Each of the folds contains 80% training data, 10% validation data and 10% test data. Note that these percentages are valid for both the majority class and the minority class (stratified sampling). As can be seen from Table 2, some datasets are balanced in nature. We created artificial imbalance for these datasets by removing minority class instances from the initial training set, according to a user-defined parameter p. We ensured that the amount of minority training instances corresponds to p percent of the majority class training size: |X_min|_train = (p/100) × |X_maj|_train. As an example, say that we are using the Book dataset with p = 25. In that case, we know that the majority class contains 34320 training instances (80% of |X_maj| = 42900). The minority training data would contain 8580 instances (25% of 34320). When the dataset is already imbalanced, we define p = [], which means that no downsampling

17 Active features represent features that are present for at least one instance in the dataset. A non-active feature corresponds with a column of zeros in the matrix representation and would not contribute to the model.


of the minority class training data is performed. Note that the validation and test data are left untouched18.

The methods detailed in Section 3 are applied on the training data. For both the under- and oversampling approaches, a linear SVM is trained on the newly created balanced training data, with regularization parameter C having values

    C = [10^−7, 10^−5, 10^−3, 10^−1, 10^0]

The validation data are used for parameter tuning purposes. The test data allow us to obtain the generalization performance. The results reported show the average over ten folds. In the remaining paragraphs of this section, we will describe the various parameter settings used in our experiments and give a brief overview of the parameters occurring in each method.

Considering the oversampling techniques, the parameter settings are as follows:

β = [0, 1/3, 2/3, 1], prior_opt = {FlipCoin, Reverse Priors}, sim_measure = {Cosine, Jaccard}, K = [10⁰, 10¹, 10², |X_min|_train].

We didn't include the "Prior" option, due to initial experiments showing a lower performance in comparison with the other options. This can be explained by the low priors occurring in each column, resulting in synthetic samples that mainly show zeros in 0-1 match situations. The oversampling with replacement method (OSR) only uses the β parameter, SMOTE uses all but the K parameter, and ADASYN uses all of the parameters listed above.

With respect to the undersampling techniques, the following parameter settings are used:

βu = [0, 1/4, 1/2, 3/4, 1], sim_measure = {Cosine, Jaccard}, K = [10⁰, 10¹, 10², |X_min|_train], Clust_opt = {C_Smallest, C_Largest}.

The random undersampling (RUS) technique only uses the βu parameter. The second set of methods, "Closest tot sim" and "Farthest tot sim", uses βu and sim_measure. The third set of techniques, "Closest Knn" and "Farthest Knn", makes use of all the parameters listed above except for Clust_opt. The final approach, CBU, employs βu and Clust_opt.

The boosting variants (AdaBoost, AdaCost and EasyEnsemble) presented in Section 3.3 make use of the following settings:

T = 30, μ = [100, 75], C = [10⁻⁷, 10⁻⁵, 10⁻³, 10⁻¹],

18 This means that if we start from a balanced set, only the training data will show artificial imbalance according to the imbalance ratio p; the validation and test data would remain balanced. Since AUC (and some other metrics) is independent of class skew, it would be unwise to make these sets imbalanced as well, because that would lead to discarding minority class instances that are relevant for performance assessment.

22 Jellis Vanhoeyveld David Martens

R = [2, 8, R_L], where R_L = |X_maj|_train / |X_min|_train,
S = [5, 10, 15].

The AdaBoost algorithm includes the T, μ and C parameters. AdaCost additionally uses cost-ratios R. We have chosen a range of values because misclassification costs are unknown for many business applications (He and Garcia 2009; Fan et al 1999; Sun et al 2007). The final value R_L seems to be a popular choice (Akbani et al 2004; Luts et al 2010), because the total weight on the majority class then balances with the total weight on the minority class. The final method, EasyEnsemble, uses S subsets in addition to the parameters previously mentioned for AdaBoost. Note that we consider the boosting iteration round t ∈ [1, T] as a tunable parameter.¹⁹
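The boosting loop shared by these variants is standard AdaBoost.M1. A compact sketch, with decision stumps standing in for the paper's weight-sensitive low-C SVMs (the stump learner and the toy interval-labelled data are ours, not the paper's setup):

```python
import numpy as np

def fit_stump(x, y, w):
    """Weighted-error-minimizing threshold stump on 1-D data."""
    best_err, best_thr, best_pol = np.inf, 0.0, 1
    for thr in np.unique(x):
        for pol in (1, -1):
            pred = np.where(x >= thr, pol, -pol)
            err = w[pred != y].sum()
            if err < best_err:
                best_err, best_thr, best_pol = err, thr, pol
    return best_err, best_thr, best_pol

def adaboost(x, y, T=20):
    w = np.full(len(x), 1.0 / len(x))
    ensemble = []
    for _ in range(T):
        err, thr, pol = fit_stump(x, y, w)
        err = float(np.clip(err, 1e-12, 1 - 1e-12))
        alpha = 0.5 * np.log((1 - err) / err)   # learner weight
        pred = np.where(x >= thr, pol, -pol)
        w *= np.exp(-alpha * y * pred)          # up-weight the mistakes
        w /= w.sum()
        ensemble.append((alpha, thr, pol))
    return ensemble

def predict(ensemble, x):
    score = sum(a * np.where(x >= t, p, -p) for a, t, p in ensemble)
    return np.where(score > 0, 1, -1)

# labels form an interval: no single stump fits, but a boosted combination does
x = np.linspace(-2, 2, 200)
y = np.where(np.abs(x) < 1, 1, -1)
model = adaboost(x, y, T=20)
```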

4.3 Oversampling

The oversampling techniques presented in Section 3.1 are applied to each of the datasets from Table 2, with varying imbalance levels (by varying the p-parameter). The experiments we conducted use the following methodology: for each dataset, we apply the oversampling techniques with all possible parameter combinations on the training data to create newly balanced datasets, after which linear SVMs are trained. The optimal parameter combination with respect to each imbalance ratio β is selected based on validation set AUC-performance. With these parameters, the AUC on the test data is obtained. Results show the average over ten folds.

The results on four arbitrarily selected datasets are shown in Table 3. Full results on each of the data sources can be found in Appendix A, Table 11. From these tables, we can conclude that performance generally increases with growing β-values. More precisely, performance keeps improving with growing β-levels until an optimal point β* is reached; increasing the balance level beyond this optimal value causes only small fluctuations with respect to the optimal performance. Traditional studies dealing with dense, low-dimensional data note that the OSR method can suffer from overfitting, as already mentioned in Section 3.1. It is interesting to see that in this sparse, high-dimensional setting this effect doesn't seem to occur.²⁰ Furthermore, the computationally expensive synthetic sampling approaches do not seem to improve performance over the simpler OSR method. This can be explained by the fact that many instances have a very limited amount of non-zero elements (Junque de Fortuny et al 2014a; Stankova et al 2015), causing the synthetic sampling procedures to be limited in the amount of unique new samples they can produce. Note also that synthetic sampling procedures have historically been designed to overcome the overfitting behaviour of OSR, which no longer seems to apply here.

19 We sometimes make use of T as a symbol to indicate the boosting iteration number instead of t.
20 The SVM regularization parameter C controls for overfitting. Additionally, the influence of majority class noise/outliers (see Section 3.2) on the learned hyperplane decreases as we oversample the minority class. This hyperplane is more sensitive towards minority instances, and a change in its direction/orientation towards these majority class noise/outliers will be more heavily penalized.


Table 3 Oversampling experiments. Results show the average tenfold AUC test set performance with respect to increasing β-values (β = [β1, β2, β3, β4] = [0, 1/3, 2/3, 1]). Additionally, standard deviations are shown between brackets. Optimal performances for each method are highlighted in boldface in the original. See Section 4.2 for parameter settings.

Mov Th (p = [])
          β1            β2            β3            β4
OSR       79.77 (5.33)  85.30 (4.66)  83.16 (4.50)  84.59 (5.69)
SMOTE     79.77 (5.33)  84.18 (6.51)  85.58 (5.97)  84.33 (5.75)
ADASYN    79.77 (5.33)  84.11 (6.77)  85.86 (6.06)  85.36 (5.13)

Yahoo G (p = 25)
OSR       78.82 (1.39)  78.78 (2.02)  78.74 (1.86)  78.35 (1.47)
SMOTE     78.82 (1.39)  79.23 (1.57)  79.10 (1.20)  79.03 (1.89)
ADASYN    78.82 (1.39)  79.12 (1.43)  79.23 (1.37)  79.51 (2.01)

TaFeng (p = 25)
OSR       66.94 (1.34)  68.84 (1.21)  66.99 (1.42)  67.70 (1.41)
SMOTE     66.94 (1.34)  68.47 (1.50)  67.07 (1.15)  66.65 (0.81)
ADASYN    66.94 (1.34)  68.62 (1.38)  67.85 (1.60)  66.91 (1.39)

Book (p = 25)
OSR       60.08 (0.71)  61.05 (0.94)  61.82 (1.12)  64.62 (0.57)
SMOTE     60.08 (0.71)  62.60 (0.73)  60.95 (0.68)  63.00 (0.80)
ADASYN    60.08 (0.71)  62.33 (0.73)  60.77 (0.85)  63.04 (0.58)

4.4 Undersampling

Regarding the undersampling techniques, we employ a similar experimental set-up as in the previous section: firstly, train SVMs on the undersampled training data with all possible parameter combinations; secondly, choose a suitable parameter combination based on validation set AUC-performance; and finally, obtain the AUC-performance on the test set. The results are averaged across ten folds.

Before going into detail on the experimental findings, we give a short note on the effect of noise/outliers on SVM performance. Consider the following imaginary example: say that a majority class female is rating a lot of action films and thrillers. We can consider this as an outlier, since most of the females are rating romantic or drama films. This specific female has far more in common with the minority class males, who also rate action films and thrillers. Outliers are therefore instances showing odd behaviour; noise are wrongfully labelled instances. The effect of noise/outliers on SVM performance can be severe. Indeed, many of the instances in the dataset have a support value (dual variable αi) of 0. Instances contributing to the model (a non-zero support value) are examples close to the borderline (Suykens et al 2002) and noise/outliers (as can be derived from the KKT dual-complementarity conditions²¹).

21 If αi = 0, then yi(wᵀxi + b) ≥ 1. For noise/outliers, the term yi(wᵀxi + b) is negative, hence αi ≠ 0.


With respect to noise/outliers, the support value can reach a maximal value of C, the chosen regularization parameter in the SVM formulation. It should be clear that lowering the C-value leads to models that are less sensitive to noise/outliers.
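This box constraint 0 ≤ αi ≤ C can be checked directly; a small demonstration with scikit-learn's SVC (the library exposes yi·αi as `dual_coef_`, so its magnitude is bounded by C; the noisy toy data are ours):

```python
import numpy as np
from sklearn.svm import SVC

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 5))
# heavy label noise ensures overlapping classes, i.e. "noise/outliers"
y = (X[:, 0] + rng.normal(scale=2.0, size=200) > 0).astype(int)

C = 0.1
clf = SVC(kernel="linear", C=C).fit(X, y)
alphas = np.abs(clf.dual_coef_).ravel()   # |y_i * alpha_i| = alpha_i

# noisy/overlapping points sit at the bound alpha_i = C ("bounded" support
# vectors); clean borderline points have 0 < alpha_i < C
print(alphas.max() <= C + 1e-8)
```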

The experimental outcomes of the various undersampling processes are shown in Table 4 for the same four previously selected datasets. We refer to Appendix B, Table 12 for results on the entire data repository. When removing only a limited amount of majority class instances (βu = βu2 = 1/4), we observed that in 12 out of 16 datasets the "Farthest Knn" method outperforms the "Closest Knn" technique. The four remaining datasets show comparable performances.²² This finding shows that the "Farthest" method is very suitable for removing majority class noise/outliers, and empirically demonstrates their performance degrading effect. With higher undersampling rates (βu = 1), the "Closest Knn" method generally achieves higher performances than the "Farthest Knn" technique, though the results are less clear. We observed that only 9 out of 16 datasets indicate "Closest" as a clear winner, with 3 tie situations and 4 situations where "Farthest" shows the best results. Intuitively, we would expect the "Closest" method to perform better, since we are keeping the most informative instances (examples close to the minority class). However, this technique will also emphasize noise/outliers, and this is the main reason why the results are less significant. One can also see this from the fact that the RUS method shows far better results in comparison to the aforementioned techniques when βu = 1, an observation that is valid for all datasets. The RUS method is efficient in realizing noise/outlier reduction effects, since each SVM is trained with only a fraction of these types of instances. It is remarkable that in 9 out of 16 datasets, the RUS method with βu = 1 outperforms the baseline model (βu = 0). In 4 datasets we observed equal performance, and 3 losses with respect to the baseline. RUS shows two attractive features: majority class noise/outlier removal and its ability to put more emphasis on the minority class instances by undersampling the majority class. A general downside of the approach is the information loss, and this effect causes the 3 losses to the baseline. Usually, though, the attractive features of RUS dominate the latter effect, which indicates that it is definitely not necessary to include all majority class training instances in constructing the predictive model. Apparently, there is a high level of redundancy, which can be exploited to construct efficient hypotheses.

In the next paragraph, we compare the cluster-based undersampling (CBU) approach to the random undersampling (RUS) technique. Before proceeding, we should note that Table 12 does not show results on the LST, CRF and Bank datasets, due to the fact that the projected unigraph was too large to fit in memory (larger than 16 GB). A limited amount of features (top nodes) does seem to be active for a relatively large number of instances (bottom nodes). All these instances will be connected in the unigraph projection, resulting in relatively high degrees for the bottom nodes. A possible solution to circumvent this problem might be to cut the weakest edges in the projected unigraph.

If we apply a limited amount of undersampling (βu = βu2 = 1/4), the CBU technique outperforms RUS in 8 out of 13 datasets (4 losses and 1 tie with respect to RUS). On the highly imbalanced datasets (p = 1 or p = []), the CBU method wins

22 A tie occurs in the situation where the absolute difference in AUC is smaller than or equal to 0.5.


in 8 out of 8 cases. We can therefore conclude that CBU outperforms RUS at low undersampling rates. This indeed shows that handling the within-class imbalance can be beneficial in this environment. If we make a comparison at the highest undersampling rate, βu = βu5 = 1, the results are completely opposite: CBU wins in only 1 case, with 9 losses with respect to RUS and 3 ties. On the highly imbalanced datasets, CBU outperforms RUS for 1 dataset (with 4 losses and 3 ties). It is clear that RUS is the preferred technique in case of high undersampling rates. The reason is that CBU emphasizes the small disjuncts and basically fails to adequately pick up on the common, larger communities. Furthermore, these small disjuncts might actually correspond with majority class noise/outliers, and CBU is focusing on these types of instances. On the overall level, where we consider all undersampling rates (except βu = 0), CBU wins in 5 out of 13 datasets, RUS wins in 4 cases, and there are 4 tie situations (yet RUS did outperform CBU in all 4 cases). Both methods seem to be competitive with one another across a wide range of imbalance levels. If we pay particular attention to the highly imbalanced sets, CBU wins in 5 cases, and there are 3 tie situations on the remaining sets (yet RUS did outperform CBU in all 3 cases). CBU seems to be the method of choice (with respect to RUS) when the datasets show high imbalance levels.

Table 4 Undersampling experiments. Results show the average tenfold AUC test set performance with respect to increasing βu-values (βu = [βu1, βu2, βu3, βu4, βu5] = [0, 1/4, 1/2, 3/4, 1]). Optimal performances (boldface in the original) are highlighted for each imbalance ratio. Additionally, standard deviations are shown between brackets. Cl K represents the "Closest Knn" technique, Cl T the "Closest tot sim" technique (similar for Far K and Far T, see Section 3.2). Parameter settings are given in Section 4.2.

Mov Th (p = [])
        βu1           βu2           βu3           βu4           βu5
RUS     79.77 (5.3)   80.32 (5.8)   81.57 (5.5)   81.86 (6.6)   81.26 (6.2)
Cl K    79.77 (5.3)   79.25 (4.5)   78.07 (5.0)   76.25 (6.5)   62.46 (8.5)
Cl T    79.77 (5.3)   78.40 (4.4)   72.41 (3.5)   64.66 (4.5)   60.37 (7.3)
Far K   79.77 (5.3)   84.54 (5.0)   83.64 (6.4)   80.02 (7.3)   56.82 (10.3)
Far T   79.77 (5.3)   85.03 (5.7)   82.68 (6.8)   75.61 (9.2)   56.77 (10.9)
CBU     80.11 (5.8)   81.17 (6.0)   81.08 (6.5)   84.17 (5.1)   80.96 (6.9)

Yahoo G (p = 25)
RUS     78.82 (1.4)   78.91 (1.6)   78.97 (1.6)   78.61 (1.6)   77.82 (2.1)
Cl K    78.82 (1.4)   77.26 (1.5)   72.52 (1.5)   67.86 (2.0)   65.07 (2.7)
Cl T    78.82 (1.4)   76.83 (1.0)   71.99 (1.8)   67.15 (2.3)   61.10 (2.7)
Far K   78.82 (1.4)   78.26 (2.2)   74.69 (2.7)   67.22 (2.1)   60.72 (2.3)
Far T   78.82 (1.4)   77.68 (2.6)   72.44 (3.0)   64.94 (2.4)   59.60 (2.0)
CBU     75.25 (3.2)   75.22 (2.4)   74.69 (2.3)   73.07 (2.4)   70.69 (2.4)

TaFeng (p = 25)
RUS     66.94 (1.3)   67.44 (1.3)   68.10 (1.4)   68.27 (1.4)   66.13 (1.2)
Cl K    66.94 (1.3)   66.13 (1.4)   63.39 (1.2)   59.83 (1.3)   56.94 (0.7)
Cl T    66.94 (1.3)   66.38 (1.5)   62.89 (1.6)   57.46 (1.3)   54.56 (1.3)
Far K   66.94 (1.3)   68.06 (1.4)   66.43 (1.6)   64.46 (1.5)   63.35 (1.3)
Far T   66.94 (1.3)   64.31 (1.1)   62.69 (1.0)   61.27 (1.1)   59.03 (1.0)
CBU     64.81 (1.2)   64.15 (1.1)   64.13 (1.2)   63.88 (0.8)   63.46 (0.8)

Book (p = 25)
RUS     60.08 (0.7)   60.13 (0.6)   60.40 (0.8)   60.33 (0.8)   63.28 (0.8)
Cl K    60.08 (0.7)   59.96 (0.7)   60.13 (0.8)   59.96 (1.0)   59.28 (0.7)
Cl T    60.08 (0.7)   59.96 (0.7)   60.13 (0.8)   60.29 (0.4)   54.50 (0.9)
Far K   60.08 (0.7)   63.29 (1.0)   64.19 (0.8)   57.30 (1.1)   55.66 (1.1)
Far T   60.08 (0.7)   62.14 (0.5)   58.27 (0.6)   56.37 (1.0)   55.66 (1.1)
CBU     54.82 (0.9)   54.67 (0.9)   54.71 (0.9)   54.66 (1.0)   54.78 (0.9)

4.5 Boosting variants

In this section, the plain AdaBoost (AB) method is compared with the cost-sensitive AdaCost (AC) algorithm, the EasyEnsemble (EE) technique and the baseline (BL) method (where we train a single SVM on the imbalanced training data). AB, AC and EE have an underlying boosting process in common. The AUC-values are stored during each boosting iteration, which allows us to show the performance graphically as a function of the number of iterations t. In EasyEnsemble (EE), we combine the weak learners of each subset by summing their individual contributions. For example, when t = 2 boosting iterations are performed, we evaluate the performance of the combined learner ∑_{s=1}^{S} ∑_{t=1}^{2} α_{s,t} h_{s,t}(x). Figures 1, 2, 3 and 4 show the average tenfold AUC-performance (with μ = 100) on the test data. The left-side figures show the AB, AC (with varying R-values) and EE (with varying S-values) performance with respect to the number of boosting iterations. The C-value is tuned according to highest validation set AUC-performance (over all possible boosting iterations). The right-side figures investigate the influence of the C-parameter for AB and EE (with S = 15), and allow us to gain insight into the effect of weak and strong learners on the generalization performance. Results on the remaining datasets are given in Appendix C, Figures 6-17. Note that we only indicate results with weight-percentage μ = 100 (use all instances in the training process); previous experiments (with μ = 75) showed inferior results. This suggests that it is indeed important to include all minority instances during the training of each weak learner, an observation also made by Liu et al (2009).

When investigating the influence of the regularization parameter C in AdaBoost, we can graphically verify that it is a "weakness" indicator, in the sense that lower C-values correspond to weaker learners. While these low C-value learners show low performance in the first rounds of boosting, the possible boost in performance is usually much higher compared to higher C-value inducers. Strong learners can already make a suitable distinction between minority and majority class behaviours in the first boosting round. The boosting process will emphasize the hard-to-learn instances, among which are noise/outliers. Therefore, too strong learners are not suitable for use in a boosting process, as already noted by Wickramaratna et al (2001). In Figures 2(b) and 3(b), we observe that the lowest C-values (C = 10⁻⁷, 10⁻⁵) can outperform higher C-values (C = 10⁻³, 10⁻¹). In many cases, the AB-process outperforms the baseline (BL); see also the results in Table 5. A combination of several weak learners can often achieve better generalization than a single strong learner.

Inspired by the success of the RUS method with βu = 1 (for reasons indicated in Section 4.4), the EE-technique tries to overcome an obvious downside of this approach: we are removing a lot of majority class instances that can be very informative. Hence, we try to make up for this loss by combining several of these balanced training sets, in the hope that the majority class is more adequately explored. This results in a very effective learner that seems to outperform the AB, AC and BL methods in the majority of cases, as can be seen from these figures. It is remarkable that we don't need to use that many subsets S to achieve high performances: increasing the S-value from 5 to 15 doesn't seem to cause major increases in AUC (see also Table 5). The influence of the C-parameter is similar to that described in the previous paragraph.
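Putting this together, a schematic EasyEnsemble: draw S balanced subsets (all minority instances plus an equally sized majority sample), boost on each, and sum all weak-learner contributions. The learner here is a stub scoring function; in the paper each h_{s,t} is a weak linear SVM:

```python
import numpy as np

def easy_ensemble_subsets(maj_idx, min_idx, S, rng):
    """S training sets, each = all minority instances + an equally sized
    random majority sample, so every subset has size 2 * |X_min|."""
    return [np.concatenate([min_idx,
                            rng.choice(maj_idx, size=len(min_idx), replace=False)])
            for _ in range(S)]

def ensemble_score(x, learners):
    """F(x) = sum_s sum_t alpha_{s,t} * h_{s,t}(x)."""
    return sum(alpha * h(x) for alpha, h in learners)

rng = np.random.default_rng(0)
maj_idx = np.arange(5000)            # 5,000 majority instances
min_idx = np.arange(5000, 5400)      # 400 minority instances
subsets = easy_ensemble_subsets(maj_idx, min_idx, S=10, rng=rng)
print([len(s) for s in subsets])     # ten subsets of 800 = 2 * |X_min| each
```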

Fig. 1 Mov G(p = 25) dataset: results showing average tenfold AUC-performance on test data (with μ = 100) for (a) AB, AC and EE, with C chosen according to highest validation set AUC-performance (over all possible T); (b) AB and EE(S = 15) with varying C-levels. [Plot curves omitted; both panels show AUC_test versus boosting iteration T (0-30), with legend entries AB, AC(R=2), AC(R=8), AC(R_L), EE(S=5,10,15) and BL, respectively the C-grid.]


Fig. 2 Book(p = 25) dataset. [Plot curves omitted; panels (a) and (b) as in Fig. 1.]

Fig. 3 TaFeng(p = 25) dataset. [Plot curves omitted; panels (a) and (b) as in Fig. 1.]

Fig. 4 Bank(p = []) dataset. [Plot curves omitted; panels (a) and (b) as in Fig. 1.]


4.6 Final comparison

4.6.1 Performance-wise comparison

In this section, we wish to determine which of the aforementioned methods shows superior performance in terms of AUC on test data. The experimental set-up is described in Section 4.2. Note that we exclude β = 0 and βu = 0 in determining the performance of the oversampling, respectively undersampling, techniques, to be able to compare them with the baseline (BL) approach.²³ The results for AB, AC and EE are shown for μ = 100. The number of boosting iterations t ∈ [0, T] is also considered as a tunable parameter. Table 5 shows the final comparison between all methods. Additionally, we added an average rank column indicating the mean rank of each algorithm across all datasets. The rank of an algorithm for a specific dataset corresponds to its position in the sorted AUC-performance spectrum, where the best performing algorithm gets the rank of 1, the second best rank 2, etc. In case two or more algorithms perform equally well, average ranks are assigned (if two algorithms have ranks of 3 and 4 respectively and show equal performance, then both methods get assigned a rank of 3.5). Note that the LST dataset is excluded in the calculation of average ranks, since we consider this as a dataset not being affected by imbalance and showing equal performances.

As a side note, we noticed that increasing the number of minority training set instances can significantly improve performances. We can see this clearly by comparing the p = 25 outcomes with the results of p = 1 on the same dataset. Junque de Fortuny et al (2014a) already showed empirically that larger behaviour datasets (in terms of the number of instances or features) lead to higher AUC-performances. This situation is confirmed in our experiments, though we want to add that this is only true when the data size is increased in a balanced fashion.

The results in Table 5 confirm our previous analysis. Each of the oversampling, undersampling (except Cl Knn) and boosting variants seems to improve upon the BL performance. Sobhani et al (2015) note that the direct application of ensemble techniques on imbalanced data does not circumvent the imbalanced learning issue. In our experiments on behaviour data, we do find a simple boosting procedure applied to the imbalanced data to improve upon the predictive performance. The oversampling techniques OSR, SMOTE and ADASYN show similar outcomes and are competitive methods, seeming to outperform the undersampling techniques and the plain boosting (AB) and cost-sensitive learning (AC) approaches. When dealing with traditional data, the performances of oversampling techniques are generally worse than those of undersampling approaches (Drummond and Holte 2003). It is interesting to observe that for these large and sparse behaviour datasets, the opposite is true. This is heavily related to the intrinsic properties of the data under consideration. Undersampling (or feature selection, for that matter) is not recommended because, as we have noted in Section 2.1, each feature provides a small additional contribution to the predictive performance. With respect to AC, the results indicate that R = R_L is a better choice compared to more random cost ratios R = 2, 8. The EE-technique has the

23 The BL technique trains single SVMs on the imbalanced training data.


lowest average rank and is our best performing method. The combination of several balanced subsets with filtered noise/outlier contributions allows one to explore the majority class space efficiently. To strengthen our previous statements, we will in the next paragraphs conduct statistical significance tests in accordance with the procedure outlined in Demsar (2006). The latter paper presents a framework to perform statistical comparisons of classifiers over multiple datasets.

The first null-hypothesis we try to reject postulates that all algorithms perform equally well or, equivalently, that their average ranks R_j (see Table 5) are indifferent. Under the null-hypothesis, the Friedman statistic (Friedman 1937)

χ²_F = [12N / (k(k+1))] [ ∑_{j=1}^{k} R_j² − k(k+1)²/4 ],     (6)

with N the number of datasets (N = 15) and k the number of algorithms (k = 13), can be used to compute the statistic (Iman and Davenport 1980)

F_F = (N−1) χ²_F / (N(k−1) − χ²_F).     (7)

The latter is distributed according to the F-distribution with k−1 and (k−1)(N−1) degrees of freedom. Evaluating this measure with the average ranks shown in Table 5 results in F_F = 22.98, which is a lot higher than the critical value of 1.81 at the α = 0.05 significance level. We can therefore reject the null-hypothesis and proceed with post-hoc tests.

The Nemenyi test (Nemenyi 1963) allows one to compare each classifier to one another, and adjusts the critical value to compensate for making k(k−1)/2 comparisons.²⁴ "The performance of two classifiers is significantly different if the corresponding average ranks differ by at least the critical difference (CD)" (Demsar 2006). We refer to the aforementioned paper for the calculation of the CD. The result of these comparisons at the α = 0.05 significance level is shown in Table 6. This test is already quite revealing, in the sense that, for instance, the oversampling methods (OSR, SMOTE and ADASYN), AC (R = R_L) and the EE-techniques are significantly different from the BL. Yet it does not indicate that undersampling or applying plain boosting (AB) results in significantly different performances compared to the BL. The Nemenyi test is still conservative, and we therefore compare all classifiers to a single control classifier in the next paragraphs. Note that we only make k−1 comparisons in that situation, resulting in stronger results compared to the Nemenyi test.
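For completeness, the critical difference used by the Nemenyi test is CD = q_α · √(k(k+1)/(6N)); the studentized-range constant q_0.05 for k = 13 classifiers is roughly 3.31 (our approximation, not a value from the paper):

```python
import math

N, k = 15, 13
q_alpha = 3.31   # assumed q_{0.05} for 13 classifiers (approximate)
spread = math.sqrt(k * (k + 1) / (6 * N))
CD = q_alpha * spread
# two methods differ significantly if their average ranks differ by >= CD
```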

When comparing a single classifier i with a control classifier c, the null-hypothesis states that the two classifiers i and c perform equally well. Under this hypothesis, the test statistic (based on the average ranks R_i shown in Table 5)

z = (R_i − R_c) / √(k(k+1)/(6N))     (8)

is distributed according to a standard normal distribution. This z-value is used to find the corresponding probability (p-value) p from the cumulative standard normal

24 The larger the amount of comparisons, the higher the proportion of null-hypotheses that are wrongfully rejected due to random chance.


Table 5 Final comparison of all methods in terms of AUC test set performance; see Section 4.2 for details on the experimental set-up. β ≠ 0 and βu ≠ 0 for the oversampling, respectively undersampling, techniques; μ = 100 for each of the boosting variants. Standard deviations are given between brackets and the performance difference with the BL is shown in square brackets.

              Mov G (p=1)          Mov G (p=25)         Mov Th (p=[])        Yahoo A (p=1)
BL            71.60 (2.62) [0.0]   81.41 (1.32) [0.0]   79.77 (5.33) [0.0]   55.92 (2.97) [0.0]
OSR           75.35 (2.27) [3.8]   83.76 (2.09) [2.3]   85.13 (6.10) [5.4]   60.05 (2.71) [4.1]
SMOTE         76.16 (2.27) [4.6]   83.70 (2.10) [2.3]   85.67 (4.98) [5.9]   60.10 (3.00) [4.2]
ADASYN        76.07 (2.26) [4.5]   83.63 (2.04) [2.2]   85.65 (5.60) [5.9]   59.90 (2.99) [4.0]
RUS           72.88 (2.73) [1.3]   81.52 (2.15) [0.1]   82.91 (7.19) [3.1]   57.04 (1.77) [1.1]
Cl Knn        71.43 (1.36) [-0.2]  80.88 (1.19) [-0.5]  78.87 (4.71) [-0.9]  55.78 (2.71) [-0.1]
Far Knn       71.90 (2.95) [0.3]   80.90 (1.48) [-0.5]  84.07 (4.64) [4.3]   57.20 (1.33) [1.3]
CBU           74.17 (2.36) [2.6]   81.51 (1.04) [0.1]   82.76 (7.22) [3.0]   58.77 (3.43) [2.8]
AB            71.65 (1.73) [0.1]   84.52 (1.89) [3.1]   82.43 (5.18) [2.7]   58.35 (2.62) [2.4]
AC(R = 2,8)   71.61 (2.46) [0.0]   83.46 (1.82) [2.0]   83.27 (5.60) [3.5]   57.72 (2.47) [1.8]
AC(R = R_L)   74.65 (2.70) [3.1]   83.35 (2.09) [1.9]   85.41 (4.49) [5.6]   59.47 (2.33) [3.5]
EE(S = 10)    76.04 (2.66) [4.4]   85.05 (1.85) [3.6]   86.10 (5.78) [6.3]   59.66 (3.13) [3.7]
EE(S = 15)    76.12 (2.88) [4.5]   85.14 (1.86) [3.7]   86.42 (5.86) [6.7]   59.76 (2.93) [3.8]

              Yahoo A (p=25)       Yahoo G (p=1)        Yahoo G (p=25)       TaFeng (p=1)
BL            61.68 (2.42) [0.0]   66.84 (3.66) [0.0]   78.82 (1.39) [0.0]   55.75 (1.60) [0.0]
OSR           64.59 (3.12) [2.9]   73.08 (2.96) [6.2]   78.52 (2.01) [-0.3]  61.21 (2.24) [5.5]
SMOTE         65.56 (3.33) [3.9]   73.11 (3.12) [6.3]   79.01 (1.21) [0.2]   61.72 (1.81) [6.0]
ADASYN        65.13 (3.38) [3.4]   73.22 (3.17) [6.4]   79.74 (1.68) [0.9]   61.68 (1.86) [5.9]
RUS           64.11 (2.80) [2.4]   70.65 (3.39) [3.8]   78.91 (1.55) [0.1]   59.25 (2.18) [3.5]
Cl Knn        61.14 (2.13) [-0.5]  66.34 (3.54) [-0.5]  77.26 (1.46) [-1.6]  55.77 (1.28) [0.0]
Far Knn       63.96 (3.03) [2.3]   66.97 (3.54) [0.1]   78.26 (2.20) [-0.6]  59.98 (1.26) [4.2]
CBU           62.27 (1.79) [0.6]   71.27 (2.89) [4.4]   75.22 (2.42) [-3.6]  58.40 (1.57) [2.6]
AB            63.88 (2.67) [2.2]   68.90 (2.03) [2.1]   79.01 (1.66) [0.2]   56.21 (1.79) [0.5]
AC(R = 2,8)   64.32 (3.56) [2.6]   68.89 (3.11) [2.0]   78.99 (1.89) [0.2]   56.33 (1.83) [0.6]
AC(R = R_L)   64.31 (3.03) [2.6]   73.13 (2.80) [6.3]   78.41 (2.00) [-0.4]  61.60 (2.26) [5.9]
EE(S = 10)    66.51 (3.24) [4.8]   72.61 (3.15) [5.8]   80.52 (1.60) [1.7]   61.20 (1.82) [5.4]
EE(S = 15)    66.36 (3.18) [4.7]   73.48 (2.32) [6.6]   80.54 (1.56) [1.7]   61.13 (1.83) [5.4]

              TaFeng (p=25)        Book (p=1)           Book (p=25)          LST (p=1)
BL            66.94 (1.34) [0.0]   52.60 (1.29) [0.0]   60.08 (0.71) [0.0]   99.99 (0.01) [0.0]
OSR           68.77 (1.23) [1.8]   55.87 (1.42) [3.3]   64.62 (0.57) [4.5]   99.99 (0.01) [0.0]
SMOTE         68.47 (1.50) [1.5]   55.07 (0.88) [2.5]   62.96 (0.82) [2.9]   99.99 (0.01) [0.0]
ADASYN        68.48 (1.47) [1.5]   55.04 (0.91) [2.4]   63.02 (0.57) [2.9]   99.99 (0.01) [0.0]
RUS           68.28 (1.39) [1.3]   54.26 (0.92) [1.7]   63.28 (0.80) [3.2]   99.98 (0.01) [0.0]
Cl Knn        66.13 (1.43) [-0.8]  52.69 (1.30) [0.1]   60.02 (0.79) [-0.1]  99.99 (0.01) [0.0]
Far Knn       68.06 (1.41) [1.1]   56.25 (1.52) [3.7]   64.15 (1.12) [4.1]   99.98 (0.01) [0.0]
CBU           63.84 (1.07) [-3.1]  53.75 (1.01) [1.2]   54.68 (0.88) [-5.4]  []
AB            67.65 (1.55) [0.7]   54.27 (1.95) [1.7]   65.00 (0.67) [4.9]   99.99 (0.01) [0.0]
AC(R = 2,8)   69.31 (1.23) [2.4]   53.72 (1.00) [1.1]   61.24 (0.80) [1.2]   99.98 (0.01) [0.0]
AC(R = R_L)   67.15 (1.51) [0.2]   55.73 (1.22) [3.1]   64.60 (0.64) [4.5]   99.99 (0.01) [0.0]
EE(S = 10)    70.30 (1.35) [3.4]   55.09 (1.29) [2.5]   65.37 (0.61) [5.3]   99.98 (0.01) [0.0]
EE(S = 15)    70.40 (1.30) [3.5]   55.35 (1.26) [2.8]   65.40 (0.51) [5.3]   99.98 (0.01) [0.0]


Table 5 Continued. Additionally, an average rank column is added, showing the mean rank of each algorithm across all datasets. Note that we excluded the LST dataset for this purpose.

              Adver (p=[])         Adver (p=1)          CRF (p=[])            Bank (p=[])
BL            96.61 (1.82) [0.0]   90.93 (3.02) [0.0]   64.06 (16.43) [0.0]   66.82 (0.88) [0.0]
OSR           96.93 (1.91) [0.3]   93.30 (2.02) [2.4]   80.74 (12.93) [16.7]  71.39 (0.79) [4.6]
SMOTE         97.05 (1.66) [0.4]   93.35 (2.01) [2.4]   78.70 (16.56) [14.6]  []
ADASYN        96.91 (1.95) [0.3]   93.46 (2.21) [2.5]   78.87 (16.71) [14.8]  []
RUS           96.81 (1.87) [0.2]   92.38 (2.51) [1.5]   83.98 (5.99) [19.9]   69.41 (1.19) [2.6]
Cl Knn        96.40 (1.48) [-0.2]  89.73 (3.42) [-1.2]  76.63 (16.19) [12.6]  66.17 (0.72) [-0.6]
Far Knn       95.77 (1.81) [-0.8]  93.88 (1.78) [3.0]   83.75 (13.11) [19.7]  66.95 (0.56) [0.1]
CBU           97.15 (1.88) [0.5]   94.18 (2.30) [3.3]   []                    []
AB            97.34 (2.18) [0.7]   91.39 (3.23) [0.5]   77.62 (15.15) [13.6]  66.82 (0.88) [0.0]
AC(R = 2,8)   97.44 (1.93) [0.8]   91.00 (3.35) [0.1]   68.31 (14.93) [4.2]   67.67 (0.71) [0.9]
AC(R = R_L)   97.46 (1.71) [0.8]   93.51 (2.17) [2.6]   85.08 (9.77) [21.0]   70.70 (0.80) [3.9]
EE(S = 10)    97.64 (1.35) [1.0]   92.97 (2.75) [2.0]   86.18 (10.17) [22.1]  71.46 (0.81) [4.6]
EE(S = 15)    97.63 (1.35) [1.0]   93.30 (2.14) [2.4]   86.35 (9.99) [22.3]   71.54 (0.76) [4.7]

Average rank
BL 11.600, OSR 5.000, SMOTE 4.533, ADASYN 4.800, RUS 8.167, Cl Knn 12.467,
Far Knn 8.133, CBU 8.567, AB 8.267, AC(R = 2,8) 8.467, AC(R = R_L) 5.400,
EE(S = 10) 3.267, EE(S = 15) 2.333

Table 6 Nemenyi test outcomes at the α = 0.05 significance level. A value of 1 indicates the null-hypothesis is rejected, and thus finds the two algorithms significantly different (a value of 0 means we accept the null-hypothesis). Methods are listed in the same order as in Table 5 and are represented more concisely.

BL RO SM AD RU Cl Fa CBU AB AC1 AC2 EE1 EE2

BL 0 1 1 1 0 0 0 0 0 0 1 1 1RO 1 0 0 0 0 1 0 0 0 0 0 0 0SM 1 0 0 0 0 1 0 0 0 0 0 0 0AD 1 0 0 0 0 1 0 0 0 0 0 0 0RU 0 0 0 0 0 0 0 0 0 0 0 1 1Cl 0 1 1 1 0 0 0 0 0 0 1 1 1Fa 0 0 0 0 0 0 0 0 0 0 0 1 1

CBU 0 0 0 0 0 0 0 0 0 0 0 1 1AB 0 0 0 0 0 0 0 0 0 0 0 1 1AC1 0 0 0 0 0 0 0 0 0 0 0 1 1AC2 1 0 0 0 0 1 0 0 0 0 0 0 0EE1 1 0 0 0 1 1 1 1 1 1 0 0 0EE2 1 0 0 0 1 1 1 1 1 1 0 0 0


distribution function: p = 2 × min(Φ(z), 1 − Φ(z)). Holm's method (Holm 1979) compares k−1 classifiers to a single control classifier, and sorts the corresponding p-values in ascending order, so that p1 ≤ p2 ≤ … ≤ p_{k−1}. Each p_i is subsequently compared to its associated confidence level²⁵ α_comp = α/(k−i). Holm starts by performing the check p1 < α/(k−1), and rejects the null-hypothesis if the test passes. It then proceeds with a similar check for i = 2, and continues until a certain null-hypothesis cannot be rejected. The remaining hypotheses are retained.
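The whole procedure — z-statistics from Eq. (8), two-sided p-values via the normal CDF, and Holm's step-down comparison against the BL — fits in a few lines (ranks re-typed from Table 5; Φ computed with `math.erfc`, so no external dependencies):

```python
import math

ranks = {"OSR": 5.000, "SMOTE": 4.533, "ADASYN": 4.800, "RUS": 8.167,
         "Cl Knn": 12.467, "Far Knn": 8.133, "CBU": 8.567, "AB": 8.267,
         "AC(R=2,8)": 8.467, "AC(R=R_L)": 5.400, "EE(S=10)": 3.267,
         "EE(S=15)": 2.333}
R_c, N, k, alpha = 11.600, 15, 13, 0.05     # control classifier: BL

def p_two_sided(z):
    phi = 0.5 * math.erfc(-z / math.sqrt(2))   # standard normal CDF
    return 2 * min(phi, 1 - phi)

se = math.sqrt(k * (k + 1) / (6 * N))          # denominator of Eq. (8)
stats = sorted((p_two_sided((R - R_c) / se), name)
               for name, R in ranks.items())

# Holm: compare the i-th smallest p-value to alpha/(k-i), stop at first failure
rejected = []
for i, (p, name) in enumerate(stats, start=1):
    if p < alpha / (k - i):
        rejected.append(name)
    else:
        break
```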

Table 7 shows the result of the Holm test at the α = 0.05 significance level, with the BL algorithm as control classifier. The table lists each method according to sorted p-values and shows the z-, p- and α_comp-values. The significance column indicates whether the proposed method is significantly different from the BL, and coincidentally matches the result of the Nemenyi test. The p-values of the (Far Knn, RUS, CBU) undersampling methods, AB and AC (R = 2, 8) are only slightly higher than the level required to reject the null-hypothesis. We included a critical significance level α_crit, corresponding to the lowest possible significance level at which the method would be considered as significantly different from the BL (if α = α_crit, then p = α_comp). In the case of AB, for example, if we were to choose a significance level α = 0.0764 (a confidence level of 92.36%), the associated p-value would be smaller than α_comp and we would proceed to conclude²⁶ that AB performs significantly different from the BL. To summarize: the null-hypothesis of equal performance between the BL and an alternative method proposed in Section 3 (except Cl Knn) is rejected with a confidence level of more than 91%, according to Holm's test. Hence, the performance of the BL approach is significantly worse than that of each of the proposed over- and undersampling methods (except Cl Knn), cost-sensitive learning techniques (AC) and boosting variants (AB, EE).

Similar to the previous paragraph, Table 8 presents the results of Holm's test at the α = 0.05 significance level with EE(S = 15) as control classifier. The results coincidentally correspond with the result of the Nemenyi test and indicate that the oversampling (OSR, SMOTE and ADASYN), AC(R = RL) and EE(S = 10) methods are not statistically different from the EE(S = 15) technique at the 0.05 significance level. However, if we were to adapt the significance level to approximately α = 0.25 (75% confidence level), we would conclude significance (except for EE(S = 10)). To summarize, the null hypothesis of equal performance between EE(S = 15) and an alternative method proposed in Section 3 (except EE(S = 10)) is rejected with a confidence level of more than 75% according to Holm's test. Hence, the performance of the oversampling, undersampling, AB and AC approaches is significantly worse than that of EE(S = 15). We can therefore safely conclude EE to be the preferred method in terms of AUC performance when dealing with imbalanced behaviour data.

25 αcomp adjusts the value of α to compensate for multiple comparisons.

26 This is not entirely true, since Holm's method requires Far Knn to pass the test first before proceeding, which only occurs at the 0.088662 significance level.


Table 7 Holm test at the α = 0.05 significance level with BL reference. The table shows the z test statistic with associated p-value, and αcomp = α/(k−i), with k the number of algorithms (k = 13) and i corresponding to the position of the method in the sorted p-value vector. Since we already ranked the p-values in ascending order, i takes on the row number (e.g. OSR has i = 5). The column "significant" denotes if we can reject the null hypothesis (significant: p < αcomp). αcrit corresponds with the smallest possible significance level where we would decide to reject the null hypothesis (αcrit = α·p/αcomp).

z p αcomp significant αcrit

EE(S = 15) -6.51642 7.2E-11 0.004167 1 8.64E-10
EE(S = 10) -5.86009 4.63E-09 0.004545 1 5.09E-08
SMOTE -4.96936 6.72E-07 0.005 1 6.72E-06
ADASYN -4.78183 1.74E-06 0.005556 1 1.56E-05
OSR -4.64119 3.46E-06 0.00625 1 2.77E-05
AC(R = RL) -4.35991 1.3E-05 0.007143 1 9.11E-05
Far Knn -2.4378 0.014777 0.008333 0 0.088662
RUS -2.41436 0.015763 0.01 0 0.078815
AB -2.34404 0.019076 0.0125 0 0.076305
AC(R = 2.8) -2.20339 0.027567 0.016667 0 0.082701
CBU -2.13307 0.032919 0.025 0 0.065837
Cl Knn 0.609449 0.542227 0.05 0 0.542227

Table 8 Holm test at the α = 0.05 significance level with EE(S = 15) reference

z p αcomp significant αcrit

Cl Knn 7.12587 1.03E-12 0.004167 1 1.24E-11
BL 6.516421 7.2E-11 0.004545 1 7.92E-10
CBU 4.383348 1.17E-05 0.005 1 0.000117
AC(R = 2.8) 4.313027 1.61E-05 0.005556 1 0.000145
AB 4.172384 3.01E-05 0.00625 1 0.000241
RUS 4.102063 4.09E-05 0.007143 1 0.000287
Far Knn 4.078623 4.53E-05 0.008333 1 0.000272
AC(R = RL) 2.156513 0.031044 0.01 0 0.155218
OSR 1.875229 0.060761 0.0125 0 0.243045
ADASYN 1.734587 0.082814 0.016667 0 0.248442
SMOTE 1.547064 0.121848 0.025 0 0.243696
EE(S = 10) 0.65633 0.511612 0.05 0 0.511612

4.6.2 Timing-wise comparison

The performance of each of the proposed algorithms in terms of training times depends highly upon the chosen parameters. The β and βu parameters control the number of training instances and are an important contribution to the subsequent SVM learning times for each of the over- and undersampling techniques. The number of boosting iterations t ∈ [0,T] is also a major factor in each of the boosting variants proposed in Section 3.3. Furthermore, the chosen regularization parameter C highly influences the training times of the SVM inducer: higher C-values result in larger training times (Fan et al 2008). Not only do the chosen hyperparameters affect learning times; data characteristics such as the number of minority instances, the imbalance level, the number of features, the amount of overlapping, etc. have a major effect as well.


In comparing each of the methods outlined in Section 3, we make use of a methodology similar to the one previously presented (see Section 4.6.1). For each of the ten folds, a suitable parameter combination is selected based on optimal validation set AUC-performance. These chosen parameters are applied in determining the timings of each method. The final results indicate the average timings across ten folds. Hence, we show the required amount of time for each method when using the parameter combination resulting in the best performance. Note that this might, for instance, result in using β = 1/3 in case of OSR, or using only t = 5 boosting iterations in EE. Computational timings are shown in Table 9.

As one can observe from the aforementioned table, CBU is the slowest method; it relies on computationally intensive clustering procedures. The synthetic SMOTE and ADASYN oversampling approaches are also very time-consuming. They both rely on computationally expensive nearest neighbour computations, in conjunction with a synthetic sample generation procedure that requires a single pass over all minority training instances.

The RUS method shows, as expected, superior results in terms of computational timings. The other undersampling methods perform quite well, though they are slower than BL and OSR due to their inherent nearest neighbour computations.

The boosting variants seem to perform worse than the BL, OSR and each of the undersampling methods. Since EE was the best method in terms of AUC-performance (see Section 4.6.1), we focus on this learner in this paragraph. For the medium sized datasets (all but CRF and Bank), OSR outperforms EE(S = 15). Yet for the large datasets (CRF and Bank), the opposite is true. An SVM is known to have a quadratic computational complexity in the number of training instances (Suykens et al 2002; García and Lozano 2007). For this reason, we consider OSR (or any other oversampling technique) inappropriate for large datasets. Since with EE each subset has a size limited to twice the minority class training size, we expect increasing benefits of the EE-technique for larger and more imbalanced behaviour data. As Junqué de Fortuny et al (2014a) have observed, larger behaviour data, in terms of the number of instances or features, also contributes to an increase in predictive power. Hence, we expect to see this type of big behaviour data more often in real-life applications, where computational resources can become a major constraint. Because the EE-technique is easily parallelizable, by simultaneously performing the boosting process for each of the independent subsets, we added an extra row in Table 9: EE par corresponds with the time needed for EE(S = 15) divided by 15, indicating the time that would be required for EE if each subset were trained in parallel. We can see that EE par outperforms OSR in 8 out of 16 cases, precisely for the datasets that show a high imbalance level (e.g. p = 1 or p = []).

We conclude our comparison of the proposed methods in Figure 5, which shows the average ranks in terms of computational timings (see Table 9) and AUC performance for each of the available data sources. Regarding the ranks with respect to the latter, we added an extra row EE par, having the same AUC-performance as EE(S = 15) for each dataset in Table 5. Ranks are subsequently determined according to the procedure outlined in Section 4.6.1. Additionally, Figure 18 in Appendix D shows the result for the large and highly imbalanced CRF and Bank datasets. We expect the latter type of data to be of major interest for reasons already indicated.


Table 9 Computational timings (in seconds) averaged across ten folds, using the parameter combination resulting in the highest validation set AUC-performance. First and second best performances are emphasized in boldface (third and fourth best are underlined). EE par is the time required for EE(S = 15) divided by 15.

Mov G(p = 1) Mov G(p = 25) Mov Th(p = []) Yahoo A(p = 1)

BL 0.032889 0.056697 0.558563 0.026922
OSR 0.055043 0.062802 0.99009 0.044421
SMOTE 0.218821 0.937057 3.841482 0.057726
ADASYN 0.284688 1.802399 5.191265 0.087694
RUS 0.011431 0.025383 0.155224 0.007991
CL Knn 0.046599 0.599846 0.989914 0.037182
Far Knn 0.039887 0.80072 0.683023 0.027788
CBU 10.34111 10.60173 68.22839 16.92477
AB 0.169792 0.841443 3.460246 0.139251
AC(R = 2.8) 0.471994 2.996585 10.86907 0.366555
AC(R = RL) 0.53376 1.179542 6.065177 0.209015
EE(S = 10) 0.117226 6.065145 1.17995 0.148973
EE(S = 15) 0.20474 7.173737 2.119991 0.180365
EE par 0.013649 0.478249 0.141333 0.012024

Yahoo A(p = 25) Yahoo G(p = 1) Yahoo G(p = 25) TaFeng(p = 1)

BL 0.092954 0.011915 0.044164 0.026728
OSR 0.027887 0.013241 0.047206 0.040919
SMOTE 1.062686 0.056153 0.883698 0.219553
ADASYN 2.050993 0.079073 1.733367 0.306618
RUS 0.048471 0.003234 0.033423 0.002916
CL Knn 0.84391 0.025404 0.502515 0.092167
Far Knn 0.664124 0.026576 0.500206 0.080159
CBU 15.69442 12.87221 13.55035 24.67279
AB 0.445546 0.078777 0.169977 0.114619
AC(R = 2.8) 1.034044 0.321723 0.515953 0.926178
AC(R = RL) 0.706215 0.226741 0.112949 0.610233
EE(S = 10) 1.026577 0.100331 1.527146 0.058052
EE(S = 15) 1.607596 0.077483 2.472582 0.10538
EE par 0.107173 0.005166 0.164839 0.007025

TaFeng(p = 25) Book(p = 1) Book(p = 25) LST(p = 1)

BL 0.032033 0.080035 0.318093 0.652045
OSR 0.032414 0.132927 0.092757 0.87152
SMOTE 5.089283 3.409418 11.43444 4.987705
ADASYN 8.148419 3.689661 12.25441 6.840083
RUS 0.020457 0.022713 0.031972 0.432839
CL Knn 1.713731 0.400873 3.711648 2.508374
Far Knn 1.539437 0.379086 3.988552 2.511037
CBU 26.42686 41.98663 46.31987 []
AB 0.713265 0.61719 1.238585 2.466151
AC(R = 2.8) 1.234647 1.666131 2.330635 1.451671
AC(R = RL) 0.279047 0.860346 0.197053 1.23763
EE(S = 10) 2.484502 2.145747 7.177484 0.524066
EE(S = 15) 3.363971 2.480066 11.21945 0.784111
EE par 0.224265 0.165338 0.747963 0.052274


Table 9 Continued. Additionally, an extra column indicating average ranks is included (lower ranks are preferred).

Adver(p = []) Adver(p = 1) CRF(p = []) Bank(p = [])

BL 0.010953 0.002796 0.725911 70.89334
OSR 0.012178 0.006166 3.685813 179.7481
SMOTE 0.123112 0.017764 5.633862 []
ADASYN 0.183767 0.021728 5.768669 []
RUS 0.012115 0.00204 0.147392 52.47441
CL Knn 0.061324 0.005568 1.106755 73.73282
Far Knn 0.079078 0.007069 1.110379 97.59619
CBU 33.78235 32.36754 [] []
AB 0.069199 0.103518 1.153196 83.08618
AC(R = 2.8) 0.193092 0.068905 2.047434 71.70548
AC(R = RL) 0.107652 0.037963 1.387174 106.3466
EE(S = 10) 0.138485 0.085686 0.198656 24.95117
EE(S = 15) 0.185136 0.139121 0.285345 36.40107
EE par 0.012342 0.009275 0.019023 2.426738

Average Rank [pos]

BL 2.94 [2]
OSR 4.19 [4]
SMOTE 9.59 [11]
ADASYN 10.91 [13]
RUS 1.38 [1]
CL Knn 6.5 [5]
Far Knn 6.56 [6]
CBU 14 [14]
AB 8.06 [7]
AC(R = 2.8) 10.81 [12]
AC(R = RL) 9.25 [9]
EE(S = 10) 8.25 [8]
EE(S = 15) 9.56 [10]
EE par 3 [3]

Timings are a substantial concern with respect to these large datasets. As can be observed, the EE-technique thrives for these big and highly imbalanced data, even in its non-parallel form.


[Figure 5: scatter plot of average rank AUC versus average rank Time; legend: BL, OSR, SMOTE, ADASYN, RUS, Cl Knn, Far Knn, CBU, AB, AC(R = 2.8), AC(R = RL), EE(S = 10), EE(S = 15), EE par]

Fig 5 Average rank AUC versus average rank Time (see Table 9) across all datasets from Table 2 (excluding Flickr and Kdd). Regarding the AUC rank, an extra row EE par is added, having the same AUC-value as EE(S = 15). Ranks for each dataset are subsequently obtained via the procedure outlined in Section 4.6.1. Points occurring in the upper-right region are preferred.

5 The effect of the chosen base learner

In this short section we investigate the influence of the chosen base learner. More specifically, we ask ourselves whether some base inducers can more easily cope with the imbalanced learning issue, and in doing so we verify the conclusion that EE is a suitable technique to elevate upon the baseline performance. We rely on several popular classification methods for behaviour data: two regularization-based techniques (SVM, LR) and two heuristic types of approaches (Naive Bayes (NB), Behavioural Similarity (BeSim)). The next paragraphs briefly introduce these methods. Note that source codes and dataset results are provided in our on-line repository.

LR constructs a linear model that maximizes the likelihood of the observed data by imposing a logistic model for the parametric probability model P(Y|X). Ng and Jordan (2002) provide a thorough introduction to the topic²⁷ and note that in its plain form LR can suffer from overfitting, especially when the input data is very high-dimensional. Therefore we resort to regularized logistic regression. We opt for L2-regularized LR, since L1-regularization corresponds with a natural way of imposing feature selection (Ng 2004). The LIBLINEAR toolbox (Fan et al 2008) is used to obtain these L2-regularized LR models.
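As a minimal illustration of this set-up (hypothetical random sparse data; scikit-learn's liblinear-backed solver standing in for the LIBLINEAR toolbox used in the paper):

```python
import numpy as np
from scipy.sparse import random as sparse_random
from sklearn.linear_model import LogisticRegression

rng = np.random.RandomState(0)
# sparse binary behaviour matrix (users x items); illustrative random data
X = sparse_random(200, 1000, density=0.02, format="csr", random_state=rng)
X.data[:] = 1.0                       # binarize: user has the item or not
y = rng.binomial(1, 0.2, size=200)    # hypothetical imbalanced labels

# L2-regularized logistic regression; the 'liblinear' solver wraps the same
# LIBLINEAR library cited in the text. C plays the same regularization role
# as in the SVM formulation discussed earlier.
clf = LogisticRegression(penalty="l2", C=1.0, solver="liblinear").fit(X, y)
scores = clf.decision_function(X)     # ranking scores usable for AUC
```

The sparse CSR input is handled natively, which matters for the very high-dimensional behaviour matrices discussed above.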

27 For an accessible introduction, see the chapter on "Generative and Discriminative Classifiers: Naive Bayes and Logistic Regression" provided at the following link: http://www.cs.cmu.edu/~tom/NewChapters.html


NB (Ng and Jordan 2002) relies on the use of Bayes' rule, which expresses the posterior P(Y|X) as the product of the likelihood P(X|Y) and the prior P(Y), divided by the evidence P(X). The naivety of the approach originates from assuming conditional independence of the features. We impose a multivariate event model regarding the distribution P(X|Y). Junqué de Fortuny et al (2014a) provide an efficient NB implementation for dealing with large and sparse binary behaviour data.
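The multivariate (Bernoulli) event model is sketched below with scikit-learn's BernoulliNB standing in for the specialized implementation cited above; the data is hypothetical:

```python
import numpy as np
from scipy.sparse import random as sparse_random
from sklearn.naive_bayes import BernoulliNB

rng = np.random.RandomState(1)
X = sparse_random(300, 500, density=0.02, format="csr", random_state=rng)
X.data[:] = 1.0                      # binary behaviour features
y = rng.binomial(1, 0.1, size=300)   # hypothetical imbalanced labels

# BernoulliNB models each feature as present/absent given the class,
# i.e. the multivariate event model for P(X|Y) on binary behaviour data.
nb = BernoulliNB().fit(X, y)
posteriors = nb.predict_proba(X)[:, 1]
```

The per-feature Bernoulli likelihoods make both training and scoring a single pass over the non-zero entries, which is what makes NB attractive for sparse behaviour matrices.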

The BeSim approach conceptually corresponds to projecting the bigraph to a unigraph using the procedure outlined in Section 3.2. Next, the weighted-vote relational neighbour classifier (Macskassy and Provost 2007) is used to infer unknown labels/scores through a weighted probability estimation using the known labels of neighbouring nodes. A very fast and scalable implementation of this approach (the SW-transformation) is presented in detail in Stankova et al (2015).
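A sketch of this idea, in the spirit of the SW-transformation: the user-user projection is never materialized, since two sparse matrix-vector products suffice. The exact weighting of the published method may differ; treat this as illustrative.

```python
import numpy as np
from scipy.sparse import csr_matrix

def besim_scores(A, y, labeled):
    """Score users by a weighted vote over the labels of behaviourally
    similar users, where similarity is the number of shared items in the
    user-item bigraph A (rows: users, columns: items)."""
    labels = np.zeros(A.shape[0]); labels[labeled] = y
    mask = np.zeros(A.shape[0]);   mask[labeled] = 1.0
    num = A @ (A.T @ labels)   # co-occurrence-weighted sum of neighbour labels
    den = A @ (A.T @ mask)     # total co-occurrence weight to labeled users
    return np.divide(num, den, out=np.zeros_like(num), where=den > 0)

# toy bigraph: user 2 (unlabeled) shares its only item with positive user 0
A = csr_matrix(np.array([[1, 0], [0, 1], [1, 0]], dtype=float))
scores = besim_scores(A, y=np.array([1.0, 0.0]), labeled=[0, 1])
```

Note that labeled users contribute their own label to their own score in this simplified version; a production implementation would exclude the self-loop.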

In Table 10 we compare the performances of the baseline (BL) application of the methods with their EE-version (with S = 15, T = 20). It is quite striking to see the EE-version dominating the baseline version when looking at each method individually. This suggests that the imbalanced learning issue has an impact on each of the aforementioned techniques. Focusing on the regularization-based approaches, we observe the baseline LR to be less influenced by the imbalance than the SVM method. Their EE-versions (EE SVM and EE LR) can boost the baseline performance in almost all cases and have the highest average ranks amongst all the proposed methods. As already emphasized, the regularization-based approaches offer an added element of flexibility in the sense that the strength of the underlying base learner can be controlled.

The heuristic approaches NB and BeSim do not offer this flexibility. Furthermore, it is not directly possible to integrate instance weights during the training phase of the base learner. During the boosting process, we therefore sample from the distribution Dt to generate (unweighted) examples and train a BeSim/NB learner next. The results in Table 10 indicate BL NB to be the worst performing algorithm amongst all methods tried. Due to the conditional independence assumption, one could regard this as a weak learner, which is therefore suitable to be used in a boosting process. Indeed, the EE NB approach is able to significantly improve upon its performance. We found this to be the slowest EE-version²⁸ (e.g. taking more than 24 hours to process a single fold of the Bank dataset, versus several minutes for the other versions). The average rank estimate of EE NB is rather pessimistic, since we assigned the lowest performance across all methods in case of missing results. BeSim can be considered a stronger learner compared to NB. Its possible boost in performance is typically lower than that of NB. Having said this, we found EE BeSim to improve upon BL BeSim for all datasets (except Adver(p = 1)).

28 This statement only relates to the EE-version. In the baseline case, the timings of the NB-version are roughly of the same order of magnitude as SVM and LR (though this heavily depends on the chosen regularization parameter C). With EE, each subset has a size limited to twice the minority class training size (a relatively low number compared to the size of the majority class). This works in favour of SVM/LR (e.g. SVM has a computational complexity of O(m²), with m the number of training instances). Also note that the NB-version is indeed optimized for sparse data. Internally, the NB-implementation writes a fairly large model file to a specified directory (sizes of 1 GB can be encountered there). This writing operation slows down the training stage (especially in case of EE, where S × T = 15 × 20 = 300 model files are constructed).


Table 10 Average tenfold AUC test set performance of the BL compared to its EE-version (S = 15, T = 20) for SVM, LR, BeSim and NB. In case the EE NB implementation crashed, we put a × in the table. If it is too slow (> 24 hours per fold), we place a []. Additionally, average ranks are assigned to each method. In case EE NB has missing results for a particular dataset, we ensure it has the lowest rank across all other algorithms.

Mov G(p = 1) Mov G(p = 25) Mov Th(p = []) Yahoo A(p = 1)

BL SVM 71.6 (2.62) 81.41 (1.32) 79.77 (5.33) 56.49 (3.37)
EE SVM 76.12 (2.88) 85.13 (1.86) 86.43 (5.86) 59.74 (2.96)
BL LR 71.02 (2.09) 84.39 (1.84) 83.14 (4.17) 57.84 (2.39)
EE LR 76.69 (2.92) 85.03 (1.98) 86.3 (5.37) 59.79 (2.62)
BL BeSim 76.1 (3.58) 81.3 (2.92) 82.81 (6.6) 56.27 (2.73)
EE BeSim 76.31 (3.71) 81.37 (2.9) 85.02 (6.28) 57.7 (1.71)
BL NB 70.26 (5.84) 77.01 (2.54) 70.48 (10.14) 52.56 (2.09)
EE NB 75.93 (2.83) 85.56 (2.01) 86.91 (4.15) 57.55 (2.73)

Yahoo A(p = 25) Yahoo G(p = 1) Yahoo G(p = 25) TaFeng(p = 1)

BL SVM 61.61 (2.48) 66.84 (3.66) 78.82 (1.39) 55.75 (1.6)
EE SVM 66.38 (3.16) 73.48 (2.32) 80.55 (1.55) 61.13 (1.83)
BL LR 66.27 (2.96) 69.82 (1.93) 80.45 (1.59) 58.91 (2.31)
EE LR 66.22 (3.28) 73.08 (2.14) 80.53 (1.56) 61.43 (2.32)
BL BeSim 64.54 (2.02) 68.89 (2.49) 79.55 (1.96) 57.89 (1.18)
EE BeSim 65.25 (2.23) 71.18 (2.91) 80.04 (1.85) 59.36 (1.47)
BL NB 65 (1.65) 63.33 (2.56) 78.89 (1.64) 54.61 (1.2)
EE NB 66.6 (2.79) 70.99 (2.88) 81.01 (1.3) 59.01 (1.84)

TaFeng(p = 25) Book(p = 1) Book(p = 25) LST(p = 1)

BL SVM 66.94 (1.34) 52.6 (1.29) 60.08 (0.71) 99.99 (0.01)
EE SVM 70.4 (1.3) 55.34 (1.28) 65.4 (0.51) 99.98 (0.01)
BL LR 69.24 (1.3) 55.34 (1.27) 63.84 (0.75) 99.99 (0.01)
EE LR 70.28 (1.28) 55.49 (1.49) 65.41 (0.63) 99.97 (0.02)
BL BeSim 67.49 (1.23) 55.19 (1.27) 63.7 (0.63) 99.99 (0.01)
EE BeSim 68 (1.21) 55.21 (1.15) 64.38 (0.42) 99.99 (0)
BL NB 65.21 (1.64) 52.93 (0.9) 59.75 (0.47) 98.69 (0.3)
EE NB 70.72 (1.15) × 63.46 (0.61) 99.92 (0.04)

Adver(p = []) Adver(p = 1) CRF(p = []) Bank(p = [])

BL SVM 96.37 (1.94) 91.18 (2.97) 64.36 (18.97) 66.82 (0.88)
EE SVM 97.63 (1.35) 93.3 (2.14) 86.35 (9.99) 71.54 (0.76)
BL LR 97.19 (1.44) 88.51 (1.93) 81.87 (19.63) 71.43 (0.72)
EE LR 97.57 (0.96) 93.02 (2.06) 86.84 (9.62) 71.77 (0.62)
BL BeSim 97.26 (1.12) 95.38 (1.35) 86.91 (9.36) 67.85 (0.67)
EE BeSim 97.38 (1.04) 93.83 (1.35) 87.02 (10.43) 70.41 (0.55)
BL NB 93.75 (1.9) 93.37 (1.9) 87.24 (9.38) 67.83 (0.63)
EE NB 94.04 (1.75) × × []

Flickr(p = 01) Kdd(p = 05) Average Rank

BL SVM 74.92 (0.17) 74.53 (0.05) 6.44 [7]
EE SVM 79.86 (0.13) 80.98 (0.05) 2.39 [1]
BL LR 79.03 (0.11) 81.29 (0.04) 4.28 [4]
EE LR 79.85 (0.13) 80.75 (0.05) 2.61 [2]
BL BeSim 74.62 (0.13) 74.95 (0) 5.11 [6]
EE BeSim 76.4 (0.13) 77.55 (0.03) 3.61 [3]
BL NB 81.36 (0.1) 74.29 (0.05) 6.5 [8]
EE NB [] [] 5.06 [5]


6 Conclusions and future research directions

We investigated the effects of sampling and cost-sensitive learning techniques on the problem of learning from imbalanced behaviour data. This type of big data has fundamentally different properties and internal structure than traditional data, and is characterized by sparseness and very large dimensions. This is a new research topic which enables benefits across a wide variety of application domains, such as fraud detection, targeted advertising, predictive policing, churn prediction, default prediction, etc. To enable future developments in this field, we provide an on-line repository containing implementations, datasets and results, available at the following link: http://www.applieddatamining.com/cms?q=software

Oversampling techniques need to be adapted in order to cope with this kind of data. We proposed new versions of the traditional SMOTE and ADASYN techniques, which differ in the way synthetic samples are generated and similarity computations are performed. The plain random oversampling (OSR) and the synthetic approaches are competitive methods in terms of AUC and do not seem to suffer from overfitting, as traditional studies report. Though the synthetic approaches have comparable AUC-performances to OSR, they are useless in practice due to their computationally expensive procedures. The OSR method is very fast for small and medium sized datasets, yet it becomes inappropriate for larger datasets, because the training times of the underlying base learner increase drastically.

With respect to undersampling methodologies, we employed a random undersampling technique (RUS) and Knn-based informed approaches. The latter type selects majority class instances that are closest (Cl Knn) or farthest (Far Knn) from the minority class examples, and allows one to investigate the effect of noise/outliers on SVM performance. We observed that these odd instances have a severe performance degrading effect. RUS and Far Knn approaches can significantly improve performance with respect to the baseline, thanks to their abilities to reduce noise/outlier levels and to put more emphasis on the minority class. Having said this, they usually perform worse than any of the oversampling approaches. This is in sheer contrast to the results obtained from traditional studies and is heavily related to the intrinsic nature of behaviour data. Timing-wise, RUS is the fastest method. The informed approaches are slower than RUS due to the required nearest neighbour computations. Additionally, a cluster-based undersampling (CBU) technique was proposed to alleviate the within-class imbalance. For low undersampling rates, the performance of CBU is superior to plain RUS, whereas for high undersampling rates the opposite is true. Overall, no substantial gain in performance is obtained in comparison with a simple RUS approach. Furthermore, the method is the slowest among all techniques considered and relies on computationally expensive clustering heuristics.

We studied the effects of boosting, where we made use of a very specific linear SVM and logistic regression combination that uses confidence rated predictions, instead of a more popular plain "discrete" boosting algorithm (with weak hypotheses having values in {−1,1}). Our experiments clearly indicated the regularization constant C in the SVM formulation to act as a "weakness" indicator. Indeed, higher C-values cause stronger learners and should not be used in the boosting process, which matches the analysis of studies dealing with traditional data. Strong learners can


already quite accurately distinguish majority class from minority class behaviours. The boosting process will focus on the hard to learn instances, among which are noise/outliers, and this can result in overfitting behaviour. We also observed that boosting can usually outperform the plain baseline, indicating that a combination of weak learners can achieve higher AUC-performance than a single strong learner.

EasyEnsemble (EE) integrates random undersampling with a boosting process. By sampling several balanced subsets from the training data and combining these, we can emphasize the minority class, achieve noise/outlier reduction effects and simultaneously explore the majority class space efficiently. Our version differs from the original formulation because we employ a different type of underlying weak learner (an SVM and LR combination) that is fed to a confidence rated boosting algorithm (instead of the discrete version). We observed superior AUC-performance with respect to all of the aforementioned techniques. Furthermore, the method is very fast, since each subset can be trained in parallel and has a size that is only twice as large as the minority class training set. We studied the effect of the number of subsets and noted that a limited amount of these (O(10)) is already sufficient to obtain satisfactory performances; increasing this number further yields only minor benefits.
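The scheme summarized above can be sketched in a few lines. This is a hypothetical re-implementation with scikit-learn parts: AdaBoost over decision stumps stands in for the paper's confidence-rated SVM/LR weak learner, and the scoring rule is our own simple average.

```python
import numpy as np
from sklearn.ensemble import AdaBoostClassifier

def easy_ensemble_fit(X, y, S=15, T=20, seed=0):
    """EasyEnsemble-style sketch: S balanced subsets, each boosted T rounds."""
    rng = np.random.RandomState(seed)
    minority = np.flatnonzero(y == 1)
    majority = np.flatnonzero(y == 0)
    ensembles = []
    for s in range(S):
        # all minority examples plus an equal-sized random majority draw:
        # every subset is only twice the minority class size
        idx = np.concatenate(
            [minority, rng.choice(majority, size=minority.size, replace=False)])
        booster = AdaBoostClassifier(n_estimators=T, random_state=seed + s)
        # subsets are independent, so these fits are trivially parallelizable
        ensembles.append(booster.fit(X[idx], y[idx]))
    return ensembles

def easy_ensemble_score(ensembles, X):
    # average the per-subset scores to rank instances (AUC-oriented output)
    return np.mean([e.decision_function(X) for e in ensembles], axis=0)
```

Because each booster only ever sees a subset of size 2× the minority class, training cost stays flat as the majority class grows, which is exactly the property exploited in the timing comparison of Section 4.6.2.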

Additionally, a statistical comparison of each of the proposed methods was conducted across a wide variety of datasets. Holm's significance test indicates that each of the proposed techniques (oversampling, undersampling except Cl Knn, cost-sensitive learning and boosting variants) can significantly outperform the plain baseline performance (each of the null-hypotheses of equal performance gets rejected with a confidence level of at least 91%). Comparison of the techniques with EE showed that the latter can significantly outperform all of them (each of the null-hypotheses of equal performance gets rejected with a confidence level of at least 75%). A timing-wise comparison showed EE to be the preferred method (after RUS) for large and highly imbalanced datasets. A parallel implementation would make EE the fastest method even for medium sized datasets that show a high level of imbalance.

The influence of the chosen base learner was investigated by comparing the baseline to its EE-version for the SVM, LR, NB and BeSim inducers. Each of the EE-versions is able to improve upon the baseline performance for nearly all datasets, where base learners of the regularization type are superior because of their abilities to control the strength of the learner.

Though we conducted an extensive benchmark study, there are still plenty of future research directions. First, and most important, there is an urgent need to develop a more general theoretical framework to completely understand the effect of imbalance and why certain methods outperform others. Secondly, other types of mechanisms could be investigated in this setting: advanced sampling (Guo et al 2008) set-ups perform sampling based on preliminary classifications; one-class learning is a recognition based methodology where a discriminative boundary is learned around the training examples of a single class alone, and Raskutti and Kowalczyk (2004) demonstrated the superiority of the recognition based one-class SVMs over the discriminative two-class SVMs; other types of ensemble learning strategies (e.g. bagging and stacking) could be investigated; Batista et al (2004) note that combining over- and undersampling can be beneficial when data are highly imbalanced or show very few minority class instances; the ideas behind Extended Nearest Neighbour (Tang and


He 2015) could guide us in the sampling process. As opposed to Knn, "it considers not only who are the nearest neighbours of a certain sample, but also who consider the particular sample as their nearest neighbour" (Tang and He 2015). The method assigns a class membership to an unknown sample that maximizes the intra-class coherence. Finally, more advanced techniques exist that can provide a K (the number of nearest neighbours) faster or with a (slightly) better performance.

Another issue relates to the lack of comprehensibility of the EasyEnsemble technique in its present form. Because we currently integrate Logistic Regression in the boosting process, the final ensemble classifier is non-linear and therefore incomprehensible. It might be interesting to investigate the numerical boosting technique proposed in Schapire and Singer (1999), where each weak learner is allowed to output real values (instead of values in [−1,1]). In that case, we would be able to use a plain linear SVM as a weak learner, without the Logistic Regression component. The final ensemble would be a linear model, which is surely easier to comprehend. Also, the instance-based approach of Martens and Provost (2014) could be used to explain individual predictions of the non-linear EE-technique.

A Oversampling experiments

Table 11 Oversampling experiments. Results show the average tenfold AUC test set performance with respect to increasing β-values (β = [β1,β2,β3,β4] = [0,1/3,2/3,1]). Additionally, standard deviations are shown between brackets. Optimal performances for each method are highlighted in boldface. See Section 4.2 for parameter settings.

MOV G(p = 1)
β1 β2 β3 β4
OSR 71.6 (2.62) 74.37 (2.04) 73.6 (1.84) 74.73 (2.45)
SMOTE 71.6 (2.62) 75.08 (2.18) 76.02 (2.14) 76.48 (2.3)
ADASYN 71.6 (2.62) 75.16 (1.92) 75.93 (2.08) 76.47 (2.29)

MOV G(p = 25)
β1 β2 β3 β4
OSR 81.41 (1.32) 83.49 (1.81) 83.84 (1.96) 83.91 (2.04)
SMOTE 81.41 (1.32) 83.32 (1.97) 83.59 (2.04) 83.76 (2.11)
ADASYN 81.41 (1.32) 83.61 (1.82) 84.02 (1.97) 83.69 (1.96)

Mov Th(p = [])
β1 β2 β3 β4
OSR 79.77 (5.33) 85.3 (4.66) 83.16 (4.5) 84.59 (5.69)
SMOTE 79.77 (5.33) 84.18 (6.51) 85.58 (5.97) 84.33 (5.75)
ADASYN 79.77 (5.33) 84.11 (6.77) 85.86 (6.06) 85.36 (5.13)

Yahoo A(p = 1)
β1 β2 β3 β4
OSR 55.92 (2.97) 58.66 (3.27) 59.99 (2.28) 59.74 (1.78)
SMOTE 55.92 (2.97) 59.76 (2.62) 59.74 (2.67) 59.43 (2.4)
ADASYN 55.92 (2.97) 59.54 (2.53) 59.55 (2.94) 59.56 (2.22)

Yahoo A(p = 25)
β1 β2 β3 β4
OSR 61.68 (2.42) 64.19 (3.17) 65.08 (3.26) 64.67 (2.1)
SMOTE 61.68 (2.42) 65.46 (3.63) 65.33 (3.23) 64.52 (2.98)
ADASYN 61.68 (2.42) 65.04 (3.74) 65.41 (3.47) 64.4 (2.21)

Continues on next page


Table 11 continued

Yahoo G(p = 1)
β1 β2 β3 β4
OSR 66.84 (3.66) 72.18 (2.36) 73.11 (2.7) 72.49 (3.41)
SMOTE 66.84 (3.66) 72.65 (2.85) 73.27 (3.36) 73.37 (3.56)
ADASYN 66.84 (3.66) 72.87 (2.83) 73.18 (3.2) 73.39 (3.59)

Yahoo G(p = 25)
β1 β2 β3 β4
OSR 78.82 (1.39) 78.78 (2.02) 78.74 (1.86) 78.35 (1.47)
SMOTE 78.82 (1.39) 79.23 (1.57) 79.1 (1.2) 79.03 (1.89)
ADASYN 78.82 (1.39) 79.12 (1.43) 79.23 (1.37) 79.51 (2.01)

TaFeng(p = 1)
β1 β2 β3 β4
OSR 55.75 (1.6) 59.23 (1.96) 60 (1.68) 61.04 (2.36)
SMOTE 55.75 (1.6) 60.26 (1.95) 61.49 (1.8) 61.13 (1.52)
ADASYN 55.75 (1.6) 60.26 (1.9) 61.44 (1.85) 61.16 (1.5)

TaFeng(p = 25)
β1 β2 β3 β4
OSR 66.94 (1.34) 68.84 (1.21) 66.99 (1.42) 67.7 (1.41)
SMOTE 66.94 (1.34) 68.47 (1.5) 67.07 (1.15) 66.65 (0.81)
ADASYN 66.94 (1.34) 68.62 (1.38) 67.85 (1.6) 66.91 (1.39)

Book(p = 1)
β1 β2 β3 β4
OSR 52.6 (1.29) 53.61 (0.94) 55.41 (1.75) 55.87 (1.44)
SMOTE 52.6 (1.29) 54.77 (0.99) 54.91 (0.8) 54.36 (0.98)
ADASYN 52.6 (1.29) 54.86 (1.13) 55.06 (0.73) 54.54 (0.92)

Book(p = 25)
β1 β2 β3 β4
OSR 60.08 (0.71) 61.05 (0.94) 61.82 (1.12) 64.62 (0.57)
SMOTE 60.08 (0.71) 62.6 (0.73) 60.95 (0.68) 63 (0.8)
ADASYN 60.08 (0.71) 62.33 (0.73) 60.77 (0.85) 63.04 (0.58)

LST(p = 1)
β1 β2 β3 β4
OSR 99.99 (0.01) 99.99 (0.01) 99.99 (0.01) 99.99 (0.01)
SMOTE 99.99 (0.01) 99.99 (0.01) 99.99 (0.01) 99.99 (0.01)
ADASYN 99.99 (0.01) 99.99 (0.01) 99.99 (0.01) 99.99 (0.01)

Adver(p = [])
β1 β2 β3 β4
OSR 96.61 (1.82) 97.31 (1.65) 97.07 (1.84) 97.07 (1.79)
SMOTE 96.61 (1.82) 96.91 (1.66) 97.19 (1.65) 97.07 (1.91)
ADASYN 96.61 (1.82) 97.1 (1.7) 97.08 (1.87) 97.07 (1.88)

Adver(p = 1)
β1 β2 β3 β4
OSR 90.93 (3.02) 91.27 (3.03) 92.66 (2.82) 93.29 (1.97)
SMOTE 90.93 (3.02) 92.51 (2.03) 92.96 (2.14) 93.53 (1.81)
ADASYN 90.93 (3.02) 92.22 (2.33) 92.7 (2.36) 93.88 (1.73)

CRF(p = [])
β1 β2 β3 β4
OSR 64.06 (16.43) 80.82 (12.94) 81.28 (12.27) 81.91 (11.28)
SMOTE 64.06 (16.43) 78.64 (16.86) 82.52 (13.74) 79.32 (16.26)
ADASYN 64.06 (16.43) 78.95 (16.72) 81.19 (16.32) 79.31 (16.19)

Bank(p = [])
β1 β2 β3 β4
OSR 66.82 (0.88) 70.1 (0.74) 71.39 (0.8) 71.47 (0.8)
SMOTE
ADASYN


B Undersampling experiments

Table 12 Undersampling experiments. Results show the average tenfold AUC test set performance with respect to increasing βu-values (βu = [βu1,βu2,βu3,βu4,βu5] = [0,1/4,1/2,3/4,1]). Optimal performances (boldface) are highlighted for each imbalance ratio. Additionally, standard deviations are shown between brackets. CL K represents the "Closest Knn" technique; CL T represents the "Closest tot sim" technique (similar for Far K and Far T, see Section 3.2). Parameter settings are given in Section 4.2.

Mov G(p = 1)βu1 βu2 βu3 βu4 βu5

RUS 716(26) 7183(26) 7254(25) 7239(31) 7061(35)Cl K 716(26) 714(2) 7096(19) 7043(24) 6905(41)CL T 716(26) 7028(25) 6674(2) 668(21) 6818(36)Far K 716(26) 7236(27) 7126(34) 6657(52) 535(35)Far T 716(26) 7222(28) 7163(36) 6428(53) 5088(44)CBU 7255(26) 7328(26) 7312(26) 7384(25) 73(31)

Mov G(p = 25)βu1 βu2 βu3 βu4 βu5

RUS 8141(13) 8136(13) 8178(17) 8205(17) 816(21)Cl K 8141(13) 8086(12) 8095(16) 7973(23) 7795(23)CL T 8141(13) 799(12) 7821(14) 7787(15) 7776(23)Far K 8141(13) 809(15) 7817(18) 7425(24) 6979(32)Far T 8141(13) 8086(15) 772(24) 7116(27) 624(28)CBU 8153(14) 8164(13) 8129(16) 8128(21) 8034(27)

Mov Th(p = [])
        βu1            βu2            βu3            βu4            βu5
RUS     79.77 (5.3)    80.32 (5.8)    81.57 (5.5)    81.86 (6.6)    81.26 (6.2)
Cl K    79.77 (5.3)    79.25 (4.5)    78.07 (5.0)    76.25 (6.5)    62.46 (8.5)
Cl T    79.77 (5.3)    78.4 (4.4)     72.41 (3.5)    64.66 (4.5)    60.37 (7.3)
Far K   79.77 (5.3)    84.54 (5.0)    83.64 (6.4)    80.02 (7.3)    56.82 (10.3)
Far T   79.77 (5.3)    85.03 (5.7)    82.68 (6.8)    75.61 (9.2)    56.77 (10.9)
CBU     80.11 (5.8)    81.17 (6.0)    81.08 (6.5)    84.17 (5.1)    80.96 (6.9)

Yahoo A(p = 1)
        βu1            βu2            βu3            βu4            βu5
RUS     55.92 (3.0)    55.57 (3.4)    56.44 (3.0)    55.83 (3.4)    56.37 (3.3)
Cl K    55.92 (3.0)    55.67 (2.4)    53.12 (2.0)    50.57 (1.8)    53.79 (3.5)
Cl T    55.92 (3.0)    55.69 (2.1)    53.35 (2.2)    50.31 (2.2)    52.35 (3.3)
Far K   55.92 (3.0)    57.35 (2.2)    56.92 (1.1)    56.95 (2.3)    51.18 (2.0)
Far T   55.92 (3.0)    56.93 (2.4)    54.74 (1.9)    57.01 (1.8)    51.18 (2.0)
CBU     58.21 (2.6)    58.45 (3.3)    58.31 (3.5)    58.39 (3.5)    56.09 (2.6)

Yahoo A(p = 25)
        βu1            βu2            βu3            βu4            βu5
RUS     61.68 (2.4)    62.9 (2.9)     63.62 (3.6)    63.75 (3.1)    63.19 (1.9)
Cl K    61.68 (2.4)    61.14 (2.1)    57.62 (1.6)    54.02 (1.8)    51.48 (1.4)
Cl T    61.68 (2.4)    60.89 (2.8)    58.11 (1.4)    54.4 (2.1)     51.76 (1.4)
Far K   61.68 (2.4)    63.96 (3.0)    62.62 (2.2)    59.61 (1.5)    56.25 (1.6)
Far T   61.68 (2.4)    63.71 (2.4)    59.72 (1.6)    57.27 (1.1)    54.47 (1.1)
CBU     62.46 (2.6)    61.85 (1.4)    61.78 (2.2)    59.94 (3.0)    60.1 (4.0)

46 Jellis Vanhoeyveld David Martens

Table 12 continued
Yahoo G(p = 1)
        βu1            βu2            βu3            βu4            βu5
RUS     66.84 (3.7)    67.85 (3.2)    68.36 (3.2)    68.23 (4.0)    69.9 (4.2)
Cl K    66.84 (3.7)    66.71 (2.8)    64.3 (3.6)     61.98 (3.9)    61.15 (1.9)
Cl T    66.84 (3.7)    65.79 (2.7)    63.55 (3.3)    59.21 (3.5)    61.08 (2.4)
Far K   66.84 (3.7)    66.76 (4.1)    63.84 (3.4)    65.16 (2.0)    48.5 (2.9)
Far T   66.84 (3.7)    66.95 (4.1)    63.48 (2.9)    65.16 (2.0)    48.48 (2.9)
CBU     69.68 (4.1)    70.59 (3.2)    70.64 (3.7)    70.2 (2.9)     63.35 (3.6)

Yahoo G(p = 25)
        βu1            βu2            βu3            βu4            βu5
RUS     78.82 (1.4)    78.91 (1.6)    78.97 (1.6)    78.61 (1.6)    77.82 (2.1)
Cl K    78.82 (1.4)    77.26 (1.5)    72.52 (1.5)    67.86 (2.0)    65.07 (2.7)
Cl T    78.82 (1.4)    76.83 (1.0)    71.99 (1.8)    67.15 (2.3)    61.1 (2.7)
Far K   78.82 (1.4)    78.26 (2.2)    74.69 (2.7)    67.22 (2.1)    60.72 (2.3)
Far T   78.82 (1.4)    77.68 (2.6)    72.44 (3.0)    64.94 (2.4)    59.6 (2.0)
CBU     75.25 (3.2)    75.22 (2.4)    74.69 (2.3)    73.07 (2.4)    70.69 (2.4)

TaFeng(p = 1)
        βu1            βu2            βu3            βu4            βu5
RUS     55.75 (1.6)    56.1 (1.6)     56.26 (1.7)    57.23 (1.7)    59.25 (2.2)
Cl K    55.75 (1.6)    55.68 (1.6)    55.58 (1.5)    55.08 (1.1)    51.05 (1.5)
Cl T    55.75 (1.6)    55.67 (1.6)    54.47 (1.6)    47.53 (1.6)    49.3 (1.1)
Far K   55.75 (1.6)    58.99 (1.2)    59.47 (1.1)    60.04 (1.2)    56.31 (1.0)
Far T   55.75 (1.6)    58.92 (1.3)    59.25 (1.3)    58.58 (1.1)    56.31 (1.0)
CBU     57.8 (1.0)     58.47 (1.1)    58.15 (0.9)    58.87 (1.4)    57.65 (1.6)

TaFeng(p = 25)
        βu1            βu2            βu3            βu4            βu5
RUS     66.94 (1.3)    67.44 (1.3)    68.1 (1.4)     68.27 (1.4)    66.13 (1.2)
Cl K    66.94 (1.3)    66.13 (1.4)    63.39 (1.2)    59.83 (1.3)    56.94 (0.7)
Cl T    66.94 (1.3)    66.38 (1.5)    62.89 (1.6)    57.46 (1.3)    54.56 (1.3)
Far K   66.94 (1.3)    68.06 (1.4)    66.43 (1.6)    64.46 (1.5)    63.35 (1.3)
Far T   66.94 (1.3)    64.31 (1.1)    62.69 (1.0)    61.27 (1.1)    59.03 (1.0)
CBU     64.81 (1.2)    64.15 (1.1)    64.13 (1.2)    63.88 (0.8)    63.46 (0.8)

Book(p = 1)
        βu1            βu2            βu3            βu4            βu5
RUS     52.6 (1.3)     52.79 (0.9)    53.46 (0.8)    53.89 (0.9)    54.05 (0.9)
Cl K    52.6 (1.3)     52.56 (1.2)    52.52 (1.3)    52.39 (1.1)    53.09 (1.1)
Cl T    52.6 (1.3)     52.56 (1.2)    52.52 (1.3)    52.39 (1.1)    53.05 (0.7)
Far K   52.6 (1.3)     55.21 (1.2)    56.21 (1.8)    56.14 (1.2)    53.06 (1.0)
Far T   52.6 (1.3)     55.21 (1.2)    56.21 (1.8)    56.14 (1.2)    53.06 (1.0)
CBU     54.28 (0.9)    53.77 (1.0)    53.33 (1.1)    53.34 (0.9)    52.84 (0.8)

Book(p = 25)
        βu1            βu2            βu3            βu4            βu5
RUS     60.08 (0.7)    60.13 (0.6)    60.4 (0.8)     60.33 (0.8)    63.28 (0.8)
Cl K    60.08 (0.7)    59.96 (0.7)    60.13 (0.8)    59.96 (1.0)    59.28 (0.7)
Cl T    60.08 (0.7)    59.96 (0.7)    60.13 (0.8)    60.29 (0.4)    54.5 (0.9)
Far K   60.08 (0.7)    63.29 (1.0)    64.19 (0.8)    57.3 (1.1)     55.66 (1.1)
Far T   60.08 (0.7)    62.14 (0.5)    58.27 (0.6)    56.37 (1.0)    55.66 (1.1)
CBU     54.82 (0.9)    54.67 (0.9)    54.71 (0.9)    54.66 (1.0)    54.78 (0.9)


Table 12 continued
LST(p = 1)
        βu1            βu2            βu3            βu4            βu5
RUS     99.99 (0.0)    99.99 (0.0)    99.99 (0.0)    99.98 (0.0)    99.99 (0.0)
Cl K    99.99 (0.0)    99.99 (0.0)    99.99 (0.0)    99.99 (0.0)    99.99 (0.0)
Cl T    99.99 (0.0)    99.99 (0.0)    99.99 (0.0)    99.99 (0.0)    99.98 (0.0)
Far K   99.99 (0.0)    99.98 (0.0)    99.98 (0.0)    99.98 (0.0)    99.98 (0.0)
Far T   99.99 (0.0)    99.98 (0.0)    99.98 (0.0)    99.98 (0.0)    99.98 (0.0)
CBU     []             []             []             []             []

Adver(p = [])
        βu1            βu2            βu3            βu4            βu5
RUS     96.61 (1.8)    96.32 (1.8)    96.63 (1.4)    97.12 (2.1)    96.22 (1.6)
Cl K    96.61 (1.8)    96.44 (1.5)    96.14 (1.5)    96.04 (2.0)    94.8 (2.5)
Cl T    96.61 (1.8)    95.87 (2.1)    94.32 (1.9)    93.01 (2.2)    90.72 (2.3)
Far K   96.61 (1.8)    96.53 (1.4)    95.76 (2.0)    94.39 (1.8)    90.49 (3.1)
Far T   96.61 (1.8)    96.54 (1.5)    95.67 (1.9)    94.54 (1.8)    89.3 (2.8)
CBU     96.85 (2.3)    96.85 (2.3)    97.05 (1.5)    96.6 (1.6)     96.06 (2.1)

Adver(p = 1)
        βu1            βu2            βu3            βu4            βu5
RUS     90.93 (3.0)    91.53 (3.1)    92.37 (3.4)    91.9 (2.9)     91.93 (2.2)
Cl K    90.93 (3.0)    90.64 (3.0)    89.87 (3.9)    90.21 (3.6)    89.18 (2.0)
Cl T    90.93 (3.0)    89.7 (3.5)     88.55 (3.4)    85.76 (3.3)    88.2 (2.3)
Far K   90.93 (3.0)    93.8 (2.3)     92.4 (2.6)     88.73 (3.4)    85.51 (4.0)
Far T   90.93 (3.0)    93.62 (2.4)    93.2 (2.2)     88.41 (3.6)    85.51 (4.0)
CBU     93.22 (2.4)    93.76 (2.5)    93.89 (2.6)    93.52 (2.7)    91.27 (2.0)

CRF(p = [])
        βu1            βu2            βu3            βu4            βu5
RUS     64.06 (16.4)   63.28 (15.9)   67.98 (17.4)   66.95 (21.9)   87.73 (8.8)
Cl K    64.06 (16.4)   62.44 (16.6)   62.34 (16.9)   71.37 (13.8)   78.22 (17.7)
Cl T    64.06 (16.4)   62.44 (16.6)   62.34 (16.9)   71.37 (13.8)   62.67 (22.9)
Far K   64.06 (16.4)   83.8 (14.2)    83.93 (14.8)   84.49 (13.7)   86.11 (9.7)
Far T   64.06 (16.4)   83.8 (14.2)    83.93 (14.8)   84.49 (13.7)   86.11 (9.7)
CBU     []             []             []             []             []

Bank(p = [])
        βu1            βu2            βu3            βu4            βu5
RUS     66.82 (0.9)    67.02 (0.9)    67.37 (0.8)    67.99 (0.6)    69.5 (1.0)
Cl K    66.82 (0.9)    66.17 (0.7)    65.24 (0.6)    64.86 (0.6)    58.53 (1.1)
Cl T    66.82 (0.9)    64.92 (1.1)    60.69 (0.9)    56.33 (0.8)    52.87 (0.7)
Far K   66.82 (0.9)    66.95 (0.6)    66.19 (0.6)    64.42 (0.6)    58.25 (1.1)
Far T   66.82 (0.9)    67.16 (0.6)    64.2 (0.8)     59.67 (1.0)    58.25 (1.1)
CBU     []             []             []             []             []


C Boosting experiments

This section presents the results of each of the boosting experiments from Section 4.5 on the remaining data sources. Each of the figures below depicts the average tenfold AUC-performance on test data (with µ = 100) with respect to the number of boosting iterations for (left) Adaboost (AB), AdaCost (AC) and EasyEnsemble (EE), with C chosen according to the highest validation set AUC-performance (over all possible boosting rounds), and (right) AB and EE (with S = 15) at varying C-levels.

[Figure: (a) average tenfold test AUC versus the number of boosting rounds T (0-30) for AB, AC-R2, AC-R8, AC-RD, EE-S5, EE-S10, EE-S15 and the baseline BL; (b) test AUC versus T for AB and EE with C = 1e-07, 1e-05, 0.001 and 0.1, plus BL.]
Fig. 6 Mov G(p = 1) dataset

[Figure: (a) average tenfold test AUC versus the number of boosting rounds T (0-30) for AB, AC-R2, AC-R8, AC-RD, EE-S5, EE-S10, EE-S15 and the baseline BL; (b) test AUC versus T for AB and EE with C = 1e-07, 1e-05, 0.001 and 0.1, plus BL.]
Fig. 7 Mov Th(p = []) dataset

[Figure: (a) average tenfold test AUC versus the number of boosting rounds T (0-30) for AB, AC-R2, AC-R8, AC-RD, EE-S5, EE-S10, EE-S15 and the baseline BL; (b) test AUC versus T for AB and EE with C = 1e-07, 1e-05, 0.001 and 0.1, plus BL.]
Fig. 8 Yahoo A(p = 1) dataset

[Figure: (a) average tenfold test AUC versus the number of boosting rounds T (0-30) for AB, AC-R2, AC-R8, AC-RD, EE-S5, EE-S10, EE-S15 and the baseline BL; (b) test AUC versus T for AB and EE with C = 1e-07, 1e-05, 0.001 and 0.1, plus BL.]
Fig. 9 Yahoo A(p = 25) dataset

[Figure: (a) average tenfold test AUC versus the number of boosting rounds T (0-30) for AB, AC-R2, AC-R8, AC-RD, EE-S5, EE-S10, EE-S15 and the baseline BL; (b) test AUC versus T for AB and EE with C = 1e-07, 1e-05, 0.001 and 0.1, plus BL.]
Fig. 10 Yahoo G(p = 1) dataset

[Figure: (a) average tenfold test AUC versus the number of boosting rounds T (0-30) for AB, AC-R2, AC-R8, AC-RD, EE-S5, EE-S10, EE-S15 and the baseline BL; (b) test AUC versus T for AB and EE with C = 1e-07, 1e-05, 0.001 and 0.1, plus BL.]
Fig. 11 Yahoo G(p = 25) dataset

[Figure: (a) average tenfold test AUC versus the number of boosting rounds T (0-30) for AB, AC-R2, AC-R8, AC-RD, EE-S5, EE-S10, EE-S15 and the baseline BL; (b) test AUC versus T for AB and EE with C = 1e-07, 1e-05, 0.001 and 0.1, plus BL.]
Fig. 12 TaFeng(p = 1) dataset

[Figure: (a) average tenfold test AUC versus the number of boosting rounds T (0-30) for AB, AC-R2, AC-R8, AC-RD, EE-S5, EE-S10, EE-S15 and the baseline BL; (b) test AUC versus T for AB and EE with C = 1e-07, 1e-05, 0.001 and 0.1, plus BL.]
Fig. 13 Book(p = 1) dataset

[Figure: (a) average tenfold test AUC versus the number of boosting rounds T (0-30) for AB, AC-R2, AC-R8, AC-RD, EE-S5, EE-S10, EE-S15 and the baseline BL; (b) test AUC versus T for AB and EE with C = 1e-07, 1e-05, 0.001 and 0.1, plus BL.]
Fig. 14 LST(p = 1) dataset

[Figure: (a) average tenfold test AUC versus the number of boosting rounds T (0-30) for AB, AC-R2, AC-R8, AC-RD, EE-S5, EE-S10, EE-S15 and the baseline BL; (b) test AUC versus T for AB and EE with C = 1e-07, 1e-05, 0.001 and 0.1, plus BL.]
Fig. 15 Adver(p = []) dataset

[Figure: (a) average tenfold test AUC versus the number of boosting rounds T (0-30) for AB, AC-R2, AC-R8, AC-RD, EE-S5, EE-S10, EE-S15 and the baseline BL; (b) test AUC versus T for AB and EE with C = 1e-07, 1e-05, 0.001 and 0.1, plus BL.]
Fig. 16 Adver(p = 1) dataset

[Figure: (a) average tenfold test AUC versus the number of boosting rounds T (0-30) for AB, AC-R2, AC-R8, AC-RD, EE-S5, EE-S10, EE-S15 and the baseline BL; (b) test AUC versus T for AB and EE with C = 1e-07, 1e-05, 0.001 and 0.1, plus BL.]
Fig. 17 CRF(p = []) dataset

D Final Comparison

[Figure: average rank Time (y-axis) versus average rank AUC (x-axis) for BL, OSR, SMOTE, ADASYN, RUS, Cl Knn, Far Knn, CBU, AB, AC(R=28), AC(R=RL), EE(S=10), EE(S=15) and EE par.]
Fig. 18 Average rank AUC versus average rank Time (see Table 9) across the large datasets from Table 2 (CRF and Bank). Regarding the former, an extra row, EE par, is added, having the same AUC-value as EE(S = 15). Ranks for each dataset are subsequently obtained via the procedure outlined in Section 4.6.1. Points occurring in the upper-right region are preferred.


References

Akbani R, Kwek S, Japkowicz N (2004) Applying support vector machines to imbalanced datasets. In: Machine Learning: ECML 2004, 15th European Conference on Machine Learning, Pisa, Italy, September 20-24, 2004, Proceedings. Springer Berlin Heidelberg, pp 39–50. DOI 10.1007/978-3-540-30115-8_7

Ali A, Shamsuddin SM, Ralescu AL (2015) Classification with class imbalance problem: A review. International Journal of Advances in Soft Computing and its Applications 7(3):176–204

Alzahrani T, Horadam KJ (2016) Community detection in bipartite networks: Algorithms and case studies. In: Complex Systems and Networks: Dynamics, Controls and Applications. Springer Berlin Heidelberg, pp 25–50. DOI 10.1007/978-3-662-47824-0_2

Bachner J (2013) Predictive policing: Preventing crime with data and analytics. IBM Center for the Business of Government

Baesens B, Van Gestel T, Viaene S, Stepanova M, Suykens J, Vanthienen J (2003) Benchmarking state-of-the-art classification algorithms for credit scoring. Journal of the Operational Research Society 54(6):627–635. DOI 10.1057/palgrave.jors.2601545

Barandela R, Sánchez J, García V, Rangel E (2003) Strategies for learning in class imbalance problems. Pattern Recognition 36(3):849–851. DOI 10.1016/S0031-3203(02)00257-1

Barber MJ (2007) Modularity and community detection in bipartite networks. Physical Review E 76:066102. DOI 10.1103/PhysRevE.76.066102

Barua S, Islam MM, Yao X, Murase K (2014) MWMOTE–majority weighted minority oversampling technique for imbalanced data set learning. IEEE Transactions on Knowledge and Data Engineering 26(2):405–425. DOI 10.1109/TKDE.2012.232

Batista GEAPA, Prati RC, Monard MC (2004) A study of the behavior of several methods for balancing machine learning training data. SIGKDD Explor Newsl 6(1):20–29. DOI 10.1145/1007730.1007735

Beckett SJ (2016) Improved community detection in weighted bipartite networks. Royal Society Open Science 3(1). DOI 10.1098/rsos.140536

Bekkar M, Djemaa HK, Alitouche TA (2013) Evaluation measures for models assessment over imbalanced data sets. Journal of Information Engineering and Applications 3(10):27–38

Bhattacharyya S, Jha S, Tharakunnel K, Westland JC (2011) Data mining for credit card fraud: A comparative study. Decision Support Systems 50(3):602–613. DOI 10.1016/j.dss.2010.08.008

Blondel VD, Guillaume JL, Lambiotte R, Lefebvre E (2008) Fast unfolding of communities in large networks. Journal of Statistical Mechanics: Theory and Experiment 2008(10):P10008

Breiman L, Friedman J, Stone CJ, Olshen RA (1984) Classification and regression trees. Taylor & Francis

Brozovsky L, Petricek V (2007) Recommender system for online dating service. In: Proceedings of Znalosti 2007 Conference, VSB Ostrava

Cha M, Mislove A, Gummadi KP (2009) A measurement-driven analysis of information propagation in the Flickr social network. In: Proceedings of the 18th International Conference on World Wide Web, ACM, New York, NY, USA, WWW '09, pp 721–730. DOI 10.1145/1526709.1526806

Chawla NV (2005) Data mining for imbalanced datasets: An overview. In: Data Mining and Knowledge Discovery Handbook. Springer US, Boston, MA, pp 853–867

Chawla NV, Bowyer KW, Hall LO, Kegelmeyer WP (2002) SMOTE: synthetic minority over-sampling technique. Journal of Artificial Intelligence Research 16:321–357

Chawla NV, Lazarevic A, Hall LO, Bowyer KW (2003) SMOTEBoost: Improving prediction of the minority class in boosting. In: Knowledge Discovery in Databases: PKDD 2003. Springer Berlin Heidelberg, pp 107–119

Chawla NV, Japkowicz N, Kotcz A (2004) Editorial: Special issue on learning from imbalanced data sets. SIGKDD Explor Newsl 6(1):1–6. DOI 10.1145/1007730.1007733

Chen M, Mao S, Liu Y (2014) Big data: A survey. Mobile Networks and Applications 19(2):171–209. DOI 10.1007/s11036-013-0489-0

Chyi YM (2003) Classification analysis techniques for skewed class distribution problems. Master thesis, Department of Information Management, National Sun Yat-Sen University

Demšar J (2006) Statistical comparisons of classifiers over multiple data sets. Journal of Machine Learning Research 7(Jan):1–30

Dhillon IS (2001) Co-clustering documents and words using bipartite spectral graph partitioning. In: Proceedings of the Seventh ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, ACM, New York, NY, USA, KDD '01, pp 269–274. DOI 10.1145/502512.502550


Drummond C, Holte RC (2003) C4.5, class imbalance, and cost sensitivity: why under-sampling beats over-sampling. In: Proceedings of the ICML '03 Workshop on Learning from Imbalanced Datasets

Fan RE, Chang KW, Hsieh CJ, Wang XR, Lin CJ (2008) LIBLINEAR: A library for large linear classification. Journal of Machine Learning Research 9:1871–1874

Fan W, Stolfo SJ, Zhang J, Chan PK (1999) AdaCost: Misclassification cost-sensitive boosting. In: Proceedings of the Sixteenth International Conference on Machine Learning, Morgan Kaufmann Publishers Inc, San Francisco, CA, USA, ICML '99, pp 97–105

Fawcett T (2006) An introduction to ROC analysis. Pattern Recognition Letters 27(8):861–874. DOI 10.1016/j.patrec.2005.10.010

Finch H (2005) Comparison of distance measures in cluster analysis with dichotomous data. Journal of Data Science 3(1):85–100

Fortunato S (2010) Community detection in graphs. Physics Reports 486(3-5):75–174. DOI 10.1016/j.physrep.2009.11.002

Junqué de Fortuny E, Martens D, Provost F (2014a) Predictive modeling with big data: is bigger really better? Big Data 1(4):215–226. DOI 10.1089/big.2013.0037

Junqué de Fortuny E, Stankova M, Moeyersoms J, Minnaert B, Provost F, Martens D (2014b) Corporate residence fraud detection. In: Proceedings of the 20th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, ACM, New York, NY, USA, KDD '14, pp 1650–1659. DOI 10.1145/2623330.2623333

Frasca M, Bertoni A, Re M, Valentini G (2013) A neural network algorithm for semi-supervised node label learning from unbalanced data. Neural Networks 43:84–98. DOI 10.1016/j.neunet.2013.01.021

Friedman M (1937) The use of ranks to avoid the assumption of normality implicit in the analysis of variance. Journal of the American Statistical Association 32(200):675–701

García E, Lozano F (2007) Boosting support vector machines. In: Machine Learning and Data Mining in Pattern Recognition, 5th International Conference, MLDM 2007, Leipzig, Germany, July 18-20, Post Proceedings. IBaI publishing, pp 153–167

Goldstein M, Uchida S (2016) A comparative evaluation of unsupervised anomaly detection algorithms for multivariate data. PLOS ONE 11(4):1–31. DOI 10.1371/journal.pone.0152173

González PC, Velásquez JD (2013) Characterization and detection of taxpayers with false invoices using data mining techniques. Expert Systems with Applications 40(5):1427–1436. DOI 10.1016/j.eswa.2012.08.051

Guimerà R, Sales-Pardo M, Amaral LAN (2007) Module identification in bipartite and directed networks. Physical Review E 76:036102. DOI 10.1103/PhysRevE.76.036102

Guo H, Viktor HL (2004) Learning from imbalanced data sets with boosting and data generation: The DataBoost-IM approach. SIGKDD Explor Newsl 6(1):30–39. DOI 10.1145/1007730.1007736

Guo X, Yin Y, Dong C, Yang G, Zhou G (2008) On the class imbalance problem. In: 2008 Fourth International Conference on Natural Computation, IEEE, vol 4, pp 192–201. DOI 10.1109/ICNC.2008.871

Han H, Wang WY, Mao BH (2005) Borderline-SMOTE: A new over-sampling method in imbalanced data sets learning. In: Advances in Intelligent Computing. Springer Berlin Heidelberg, pp 878–887

He H, Garcia EA (2009) Learning from imbalanced data. IEEE Transactions on Knowledge and Data Engineering 21(9):1263–1284. DOI 10.1109/TKDE.2008.239

He H, Bai Y, Garcia EA, Li S (2008) ADASYN: Adaptive synthetic sampling approach for imbalanced learning. In: 2008 IEEE International Joint Conference on Neural Networks (IEEE World Congress on Computational Intelligence), IEEE, pp 1322–1328. DOI 10.1109/IJCNN.2008.4633969

Holm S (1979) A simple sequentially rejective multiple test procedure. Scandinavian Journal of Statistics 6(2):65–70

Hsu CW, Lin CJ (2002) A comparison of methods for multiclass support vector machines. IEEE Transactions on Neural Networks 13(2):415–425. DOI 10.1109/72.991427

Huang A (2008) Similarity measures for text document clustering. In: Proceedings of the Sixth New Zealand Computer Science Research Student Conference (NZCSRSC2008), Christchurch, New Zealand, pp 49–56

Iman RL, Davenport JM (1980) Approximations of the critical region of the Friedman statistic. Communications in Statistics-Theory and Methods 9(6):571–595

Jo T, Japkowicz N (2004) Class imbalances versus small disjuncts. ACM SIGKDD Explor Newsl 6(1):40–49. DOI 10.1145/1007730.1007737


Jutla IS, Jeub LG, Mucha PJ (2011-2016) A generalized Louvain method for community detection implemented in MATLAB. URL http://netwiki.amath.unc.edu/GenLouvain

Kubat M, Matwin S (1997) Addressing the curse of imbalanced training sets: One-sided selection. In: Proceedings of the Fourteenth International Conference on Machine Learning, Morgan Kaufmann Publishers Inc, San Francisco, CA, USA, pp 179–186

Lancichinetti A, Fortunato S (2009) Community detection algorithms: A comparative analysis. Physical Review E 80:056117. DOI 10.1103/PhysRevE.80.056117

Larremore DB, Clauset A, Jacobs AZ (2014) Efficiently inferring community structure in bipartite networks. Physical Review E 90:012805. DOI 10.1103/PhysRevE.90.012805

Li J, Fine JP (2010) Weighted area under the receiver operating characteristic curve and its application to gene selection. Journal of the Royal Statistical Society: Series C (Applied Statistics) 59(4):673–692. DOI 10.1111/j.1467-9876.2010.00713.x

Li X, Wang L, Sung E (2008) AdaBoost with SVM-based component classifiers. Engineering Applications of Artificial Intelligence 21(5):785–795. DOI 10.1016/j.engappai.2007.07.001

Lichman M (2013) UCI machine learning repository. URL http://archive.ics.uci.edu/ml

Liu W, Chawla S, Cieslak DA, Chawla NV (2010) A robust decision tree algorithm for imbalanced data sets. In: Proceedings of the Tenth SIAM International Conference on Data Mining, SIAM, Philadelphia, vol 10, pp 766–777

Liu XY, Wu J, Zhou ZH (2009) Exploratory undersampling for class-imbalance learning. IEEE Transactions on Systems, Man, and Cybernetics, Part B (Cybernetics) 39(2):539–550. DOI 10.1109/TSMCB.2008.2007853

Luts J, Ojeda F, Van de Plas R, De Moor B, Van Huffel S, Suykens JA (2010) A tutorial on support vector machine-based methods for classification problems in chemometrics. Analytica Chimica Acta 665(2):129–145. DOI 10.1016/j.aca.2010.03.030

Macskassy SA, Provost F (2007) Classification in networked data: A toolkit and a univariate case study. Journal of Machine Learning Research 8(May):935–983

Martens D, Provost F (2014) Explaining data-driven document classifications. MIS Quarterly 38(1):73–100

Martens D, Provost F, Clark J, Junqué de Fortuny E (2016) Mining massive fine-grained behavior data to improve predictive analytics. MIS Quarterly 40(4):869–888

Mazurowski MA, Habas PA, Zurada JM, Lo JY, Baker JA, Tourassi GD (2008) Training neural network classifiers for medical decision making: The effects of imbalanced datasets on classification performance. Neural Networks 21(2-3):427–436. DOI 10.1016/j.neunet.2007.12.031

Mease D, Wyner AJ, Buja A (2007) Boosted classification trees and class probability/quantile estimation. Journal of Machine Learning Research 8:409–439

Nemenyi P (1963) Distribution-free Multiple Comparisons. Dissertation, Princeton University

Newman MEJ, Girvan M (2004) Finding and evaluating community structure in networks. Physical Review E 69:026113. DOI 10.1103/PhysRevE.69.026113

Ng AY (2004) Feature selection, L1 vs. L2 regularization, and rotational invariance. In: Proceedings of the Twenty-first International Conference on Machine Learning, ACM, New York, NY, USA, ICML '04, pp 78–. DOI 10.1145/1015330.1015435

Ng AY, Jordan MI (2002) On discriminative vs. generative classifiers: A comparison of logistic regression and naive Bayes. In: Advances in Neural Information Processing Systems 14, MIT Press, pp 841–848

Ngai E, Hu Y, Wong Y, Chen Y, Sun X (2011) The application of data mining techniques in financial fraud detection: A classification framework and an academic review of literature. Decision Support Systems 50(3):559–569. DOI 10.1016/j.dss.2010.08.006

Platt JC (1999) Probabilistic outputs for support vector machines and comparisons to regularized likelihood methods. In: Advances in Large-Margin Classifiers, MIT Press, pp 61–74

Porter MA, Onnela JP, Mucha PJ (2009) Communities in networks. Notices of the American Mathematical Society 56(9):1082–1097

Provost F, Fawcett T (2013) Data Science for Business: What you need to know about data mining and data-analytic thinking. O'Reilly Media, Inc

Provost F, Dalessandro B, Hook R, Zhang X, Murray A (2009) Audience selection for on-line brand advertising: Privacy-friendly social network targeting. In: Proceedings of the 15th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, ACM, New York, NY, USA, KDD '09, pp 707–716. DOI 10.1145/1557019.1557098


Raskutti B, Kowalczyk A (2004) Extreme re-balancing for SVMs: A case study. SIGKDD Explor Newsl 6(1):60–69. DOI 10.1145/1007730.1007739

Rosvall M, Bergstrom CT (2008) Maps of random walks on complex networks reveal community structure. Proceedings of the National Academy of Sciences 105(4):1118–1123. DOI 10.1073/pnas.0706851105

Schapire RE (1999) A brief introduction to boosting. In: Proceedings of the 16th International Joint Conference on Artificial Intelligence - Volume 2, Morgan Kaufmann Publishers Inc, San Francisco, CA, USA, IJCAI'99, pp 1401–1406

Schapire RE, Singer Y (1999) Improved boosting algorithms using confidence-rated predictions. Machine Learning 37(3):297–336. DOI 10.1023/A:1007614523901

Shmueli G (2017) Analyzing behavioral big data: Methodological, practical, ethical, and moral issues. Quality Engineering 29(1):57–74. DOI 10.1080/08982112.2016.1210979

Sobhani P, Viktor H, Matwin S (2015) Learning from imbalanced data using ensemble methods and cluster-based undersampling. In: New Frontiers in Mining Complex Patterns: Third International Workshop, NFMCP 2014, Held in Conjunction with ECML-PKDD 2014, Nancy, France, September 19, 2014, Revised Selected Papers. Springer International Publishing, Cham, pp 69–83. DOI 10.1007/978-3-319-17876-9_5

Stankova M (2016) Classification within network data with a bipartite structure. Dissertation, University of Antwerp

Stankova M, Martens D, Provost F (2015) Classification over bipartite graphs through projection. Working Papers 2015001, University of Antwerp, Faculty of Applied Economics

Sun Y, Kamel MS, Wong AK, Wang Y (2007) Cost-sensitive boosting for classification of imbalanced data. Pattern Recognition 40(12):3358–3378. DOI 10.1016/j.patcog.2007.04.009

Suykens JA, Van Gestel T, De Brabanter J, De Moor B, Vandewalle J (2002) Least squares support vector machines. World Scientific, Singapore

Tang B, He H (2015) ENN: Extended nearest neighbor method for pattern recognition [research frontier]. IEEE Computational Intelligence Magazine 10(3):52–60. DOI 10.1109/MCI.2015.2437512

Tang Y, Zhang YQ, Chawla NV, Krasser S (2009) SVMs modeling for highly imbalanced classification. IEEE Transactions on Systems, Man, and Cybernetics, Part B (Cybernetics) 39(1):281–288. DOI 10.1109/TSMCB.2008.2002909

Tobback E, Moeyersoms J, Stankova M, Martens D (2016) Bankruptcy prediction for SMEs using relational data. Working Paper 2016004, University of Antwerp, Faculty of Applied Economics

Verbeke W, Dejaeger K, Martens D, Hur J, Baesens B (2012) New insights into churn prediction in the telecommunication sector: A profit driven data mining approach. European Journal of Operational Research 218(1):211–229. DOI 10.1016/j.ejor.2011.09.031

Veropoulos K, Campbell I, Cristianini N (1999) Controlling the sensitivity of support vector machines. In: Proceedings of the International Joint Conference on Artificial Intelligence, Stockholm, Sweden (IJCAI99), pp 55–60

Whitrow C, Hand DJ, Juszczak P, Weston D, Adams NM (2009) Transaction aggregation as a strategy for credit card fraud detection. Data Mining and Knowledge Discovery 18(1):30–55. DOI 10.1007/s10618-008-0116-z

Wickramaratna J, Holden SB, Buxton BF (2001) Performance degradation in boosting. In: Proceedings of the Second International Workshop on Multiple Classifier Systems, Springer, London, UK, MCS '01, pp 11–21

Yen SJ, Lee YS (2009) Cluster-based under-sampling approaches for imbalanced data distributions. Expert Systems with Applications 36(3, Part 1):5718–5727. DOI 10.1016/j.eswa.2008.06.108

Yu HF, Lo HY, Hsieh HP, Lou JK, McKenzie TG, Chou JW, Chung PH, Ho CH, Chang CF, Wei YH, et al (2010) Feature engineering and classifier ensemble for KDD Cup 2010. In: Proceedings of the KDD Cup 2010 Workshop, pp 1–16

Zha H, He X, Ding C, Simon H, Gu M (2001) Bipartite graph partitioning and data clustering. In: Proceedings of the Tenth International Conference on Information and Knowledge Management, ACM, New York, NY, USA, CIKM '01, pp 25–32. DOI 10.1145/502585.502591

Zhang J, Mani I (2003) kNN approach to unbalanced data distributions: A case study involving information extraction. In: Proceedings of the ICML'2003 Workshop on Learning from Imbalanced Datasets, Washington DC

Ziegler CN, McNee SM, Konstan JA, Lausen G (2005) Improving recommendation lists through topic diversification. In: Proceedings of the 14th International Conference on World Wide Web, ACM, New York, NY, USA, WWW '05, pp 22–32. DOI 10.1145/1060745.1060754



u in the interval [0,1] and puts a 1 in the corresponding position if u is smaller than the prior.

– "Reverse Prior", where one generates a random number u in the interval [0,1] and puts a 1 in the corresponding position if u is larger than the prior within the minority class for this column/feature.

The second difference lies in the way the nearest neighbours are determined. It would be unwise to consider Euclidean distance in that respect, because it treats a 0–0 match in the same way as a 1–1 match. Since we are working with dichotomous variables (i.e. present or absent), a 1–1 match at a certain position is far more informative than a 0–0 match. For instance, two users visiting the same web page conveys more information than two users who didn't visit that specific page. The similarity between two instances is defined by a user-specified parameter sim_measure. We have limited ourselves to two popular choices: "Jaccard" uses the Jaccard similarity measure (Finch 2005) and "Cosine" uses the cosine similarity measure (Huang 2008). In principle, one could apply any of the metrics summarized in Stankova et al (2015).
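To make the 1–1 versus 0–0 distinction concrete, the two similarity options can be sketched as follows (a minimal illustration; the function names are ours, not from the paper):

```python
import numpy as np

def jaccard_sim(a, b):
    """Jaccard similarity for binary vectors: shared 1s divided by the
    positions where at least one vector has a 1; 0-0 matches are ignored."""
    inter = np.sum((a == 1) & (b == 1))
    union = np.sum((a == 1) | (b == 1))
    return inter / union if union > 0 else 0.0

def cosine_sim(a, b):
    """Cosine similarity; for binary vectors the dot product counts
    1-1 matches only, so 0-0 matches again contribute nothing."""
    na, nb = np.linalg.norm(a), np.linalg.norm(b)
    return float(a @ b) / (na * nb) if na > 0 and nb > 0 else 0.0

# Two users who share one visited page out of the two pages either visited:
u1 = np.array([1, 1, 0, 0, 0])
u2 = np.array([1, 0, 0, 0, 0])
print(jaccard_sim(u1, u2))  # 0.5
print(round(cosine_sim(u1, u2), 4))  # 0.7071
```

Note that the trailing zeros (pages neither user visited) change nothing in either score, whereas under Euclidean distance they would make the users look artificially close.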

Table 1 Differences between the original SMOTE and ADASYN implementations and the versions SMOTE_beh and ADASYN_beh tailored for behaviour data (parameter explanations are provided in Section 3.1)

                               SMOTE           SMOTE_beh        ADASYN          ADASYN_beh
Amount of oversampling         N               β                β               β
Synthetic sample generation    random point    prior_opt        random point    prior_opt
                               on line segment                  on line segment
Similarity measure             Euclidian       Jaccard/Cosine   Euclidian       Jaccard/Cosine
Number of nearest neighbours   K               K̄                K               K, K̄

A detailed pseudo-code implementation of our versions of SMOTE and ADASYN, called SMOTE_beh and ADASYN_beh, is shown in Algorithm 1. Also note that we introduced an extra parameter K̄ to decouple the determination of the number of synthetic instances that need to be generated for a certain minority instance from the number of nearest neighbours used to obtain the synthetic instances.

The experimental set-up adopted by Chawla et al (2002) and He et al (2008) considers K = 5 as the number of nearest neighbours used, without a detailed motivation. Furthermore, the latter paper compares SMOTE with an oversampling percentage N of 200 (meaning that the size of the newly created synthetic data instances is twice as large as the size of the original minority training data) against a completely balanced dataset (β = 1 in Algorithm 1) in ADASYN. In our experiments, we will consider a variety of possible K-values and compare SMOTE and ADASYN with identical oversampling rates, controlled by a single parameter β.


Algorithm 1 SMOTE_beh and ADASYN_beh pseudo-code implementation for binary behaviour data

Input: X_min, X_maj, β, prior_opt, sim_measure, K, K̄

a) Determine the total amount of synthetic minority instances that need to be generated (β ∈ [0,1] is a parameter that controls the amount of oversampling; β = 1 means a fully balanced dataset will be created):
   G = (|X_maj| − |X_min|) × β

b) Determine the number of synthetic samples g_i that need to be generated for each minority class instance x_i:
   if SMOTE then
      g_i ← ⌈G / |X_min|⌉
   else if ADASYN then
      calculate the K nearest neighbours (with the sim_measure option) of instance x_i from the set (X_min \ x_i) ∪ X_maj and determine Δ_i, the number of majority class nearest neighbours.
      Next, calculate r_i = Δ_i / K and normalize these values: r̂_i = r_i / Σ_{j=1}^{|X_min|} r_j
      g_i ← ⌈r̂_i × G⌉
   end if

c) Generate g_i synthetic samples for minority instance x_i:
   calculate the K̄ nearest neighbours (with the sim_measure option) of instance x_i from the set X_min \ x_i. Additionally, remove those nearest neighbours that have a similarity of 0 with x_i. The remaining nearest neighbours form the set K_used. If this set turns out to be empty, set K_used = {x_i}.
   for iter = 1 → g_i do
      randomly choose 1 nearest neighbour from the set K_used
      generate a synthetic minority sample from x_i and the chosen nearest neighbour (according to prior_opt)
   end for

d) Because Σ g_i ≥ G, randomly remove synthetic points until the total number of synthetic samples equals G.
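The SMOTE branch of Algorithm 1 can be sketched as runnable code (an illustration under our own naming; the exact prior_opt rule shown, drawing each feature as 1 when a uniform u falls below the minority-class prior, restricted to features active in either parent, is our reading of the "Prior" variant, not the authors' code):

```python
import numpy as np

def smote_beh(X_min, n_maj, beta, K_bar=5, seed=0):
    """Sketch of the SMOTE branch of Algorithm 1 on binary behaviour data.

    X_min : (n, d) binary minority matrix; n_maj : majority class size;
    beta in [0, 1] controls the amount of oversampling (step a);
    K_bar : number of neighbours used for generation (step c).
    """
    rng = np.random.default_rng(seed)
    n, d = X_min.shape
    G = int((n_maj - n) * beta)                    # step a
    g = int(np.ceil(G / n))                        # step b (SMOTE: uniform)
    priors = X_min.mean(axis=0)                    # per-feature minority prior
    synth = []
    for i in range(n):                             # step c
        inter = ((X_min == 1) & (X_min[i] == 1)).sum(axis=1)
        union = ((X_min == 1) | (X_min[i] == 1)).sum(axis=1)
        sim = inter / np.maximum(union, 1)         # Jaccard similarity
        sim[i] = -1.0                              # exclude x_i itself
        nbrs = np.argsort(sim)[::-1][:K_bar]
        K_used = [j for j in nbrs if sim[j] > 0] or [i]
        for _ in range(g):
            j = rng.choice(K_used)
            active = (X_min[i] == 1) | (X_min[j] == 1)
            u = rng.random(d)
            synth.append(((u < priors) & active).astype(int))
    synth = np.asarray(synth)                      # step d: trim to G samples
    return synth[rng.choice(len(synth), size=G, replace=False)]
```

With |X_min| = 10, n_maj = 30 and β = 1, the function returns G = 20 binary synthetic minority instances, fully balancing the training set.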

3.2 Undersampling

In this section, we will compare the simple random undersampling technique (RUS) with informed undersampling approaches. The first method randomly discards majority class training instances. While this technique can achieve fast training performance for the underlying base learner, an obvious disadvantage is the fact that it might discard potentially useful majority class instances. The informed approaches try to intelligently retain the most informative majority class instances, in the hope of increasing predictive performance while at the same time keeping the fast training speed of the underlying classifier.

The first set of informed undersampling techniques are based on the methods proposed by Zhang and Mani (2003) and Chyi (2003). The K-nearest neighbour "classifier" is used to determine the importance of each majority class training instance, by calculating the total similarity with the K closest minority class training set examples. Regarding similarity computations for binary behaviour data, we refer to the related discussion in Section 3.1. The first technique, called "Closest Knn", retains the majority class examples that are closest to the minority class instances. These instances are the most difficult to classify, and we would expect them to be the most informative. The second method, called "Closest tot sim", is similar to the previously described technique; the difference is that it no longer computes similarities with the K closest minority neighbours, but instead calculates the total similarity with all minority instances in determining the importance. The main reason we included this technique is computational speed.7 The last techniques, called "Farthest Knn" and "Farthest tot sim", are included for comparison with the previously mentioned techniques. Their implementation is identical to the previously described techniques, except that they retain the majority class examples that are farthest from the minority class instances. For each of the proposed methods in this paragraph, the amount of undersampling is controlled by a user-specified parameter β_u, according to the following formula:

Nr_rem = ⌊(|X_maj| − |X_min|) × β_u⌋    (2)

where Nr_rem represents the number of majority class instances to be discarded; β_u = 1 means a completely balanced dataset is obtained.

The second set of informed undersampling techniques aim at targeting the within-class imbalance problem and are based on the approach proposed in Sobhani et al (2015). They postulate that this within-class imbalance problem is more pronounced in the case of undersampling methods: if we were to randomly select majority class instances, then the probability of drawing an instance from small disjuncts within the majority class would be very low. These regions might therefore contain no representatives and remain unlearned. The authors chose to address this issue by clustering the majority class instances in a first step and subsequently selecting an equal number of representatives from each cluster. The reported results show their approach to outperform the CBO-algorithm (see Section 1.1). In the following paragraphs, we will dig deeper into the subject of clustering behaviour data. We already refer to Algorithm 2 for an overview of our cluster-based undersampling method (CBU).

As we noted in Section 2.1, behaviour data can be represented as a bipartite graph. The clustering of behaviour data8 aims at finding groups of nodes (communities) that connect more to each other than to other nodes in the network. This subject is currently an active area of research, with a rapid evolution of a vast number of cluster detection techniques (Zha et al 2001; Dhillon 2001; Larremore et al 2014; Beckett 2016). We refer to Porter et al (2009), Fortunato (2010) and Alzahrani and Horadam (2016) for detailed surveys on the problem. It should be noted that the vast majority of publications deal with the subject of clustering unigraphs (networks with only one type of nodes); interest in the clustering of bigraphs has grown only fairly recently. In our implementations, we have chosen the popular modularity-based approaches9 for clustering bigraphs, and these fall into two directions. In the first, the modularity function that is used for unigraphs is adapted to be suitable for bigraphs; see for instance the work of Barber (2007). The other direction, which we adopt in our study, projects the bigraph to a unigraph of bottom nodes and performs community detection on the projection using traditional modularity definitions. Note that Guimera et al (2007) observed no difference in the obtained communities using either direction.

7. For each majority class instance, we no longer need to sort the similarities with all minority instances in determining the K largest values.

8. This subject is more commonly known as community detection in bipartite graphs.
9. Modularity-based approaches attempt to optimize a quality function, known as modularity, for finding community structures in networks, and rely on the use of heuristics due to the complexity of the problem.


In this article, we adopt the methodology proposed in Alzahrani and Horadam (2016), which consists of projecting the bigraph to a unigraph of bottom nodes and applying10 the Louvain algorithm (Blondel et al 2008) on the projection. Lancichinetti and Fortunato (2009) performed a comparative study of the performance of 12 community detection algorithms and concluded the Louvain method to be the best modularity based algorithm and the second best among all algorithms. The heuristic is very fast, with a O(m) complexity, where m is the number of edges in the unigraph. We have chosen the Louvain algorithm because of its speed and the availability of a toolbox (Jutla et al 2011-2016) that is directly compatible with our implementations. The toolbox provides a generalized implementation of the Louvain algorithm, in the sense that multiple definitions of modularity are possible. The quality function we chose is the popular Newman-Girvan modularity (Newman and Girvan 2004).

With respect to the projection, Alzahrani and Horadam (2016) connect two bottom nodes if they have at least one top node in common; the connection weight between two bottom nodes in the projection is set to the number of shared top nodes. In our implementation, we adapted the connection weights in accordance with Stankova et al (2015) as follows: first, we assign to each top node a weight corresponding to the hyperbolic tangent applied to the inverse degree of the top node; next, the connection weight between two bottom nodes in the projection is set to the total weight of the shared top nodes. Top nodes having low degrees therefore obtain a higher contribution in the projection (e.g., two users making a transaction to a local book store are assumed to be more closely connected to each other than two users making a transaction to a large retail store).
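The weighted projection step can be sketched as follows (dense NumPy for clarity; the real datasets would require sparse matrices, and the function name is ours):

```python
import numpy as np

def project_bigraph(A):
    """Project a binary bigraph (rows = bottom nodes, columns = top nodes)
    onto a weighted unigraph of bottom nodes. Each top node gets weight
    tanh(1/degree); the edge weight between two bottom nodes is the total
    weight of their shared top nodes (Stankova et al 2015)."""
    A = np.asarray(A, dtype=float)
    top_deg = A.sum(axis=0)
    w = np.tanh(1.0 / np.maximum(top_deg, 1.0))  # guard against degree 0
    W = (A * w) @ A.T                            # weighted co-occurrence
    np.fill_diagonal(W, 0.0)                     # no self-loops
    return W

# All three top nodes below have degree 2, so each shared top node
# contributes tanh(1/2) to the connection weight between two bottom nodes.
A = np.array([[1, 1, 0],
              [1, 0, 1],
              [0, 1, 1]])
W = project_bigraph(A)
```

A top node with a very high degree (a large retail store) would contribute tanh(1/degree) ≈ 0, which is exactly the down-weighting effect described above.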

In our CBU-algorithm (see Algorithm 2), after clustering the bigraph containing exclusively majority class instances, we randomly select an equal amount of majority instances from each community to target the within-class imbalance problem. In the rare situation that the number of obtained clusters exceeds the required amount of majority instances (Nr_retain), we sort the communities according to a user-specified parameter Clust_opt and randomly select 1 instance from each of the first Nr_retain clusters. The parameter Clust_opt can take the following values:

– C_Smallest, where we sort clusters in ascending order of their size
– C_Largest, where we sort clusters in descending order of their size

Note that we randomly select majority class instances from each cluster: Yen and Lee (2009) found a random selection strategy after clustering to be superior to informed approaches based on distance.

10. Note that they also made use of the flow-based algorithm Infomap (Rosvall and Bergstrom 2008), which shows excellent results on the LFR-benchmark.


Algorithm 2 CBU pseudo-code implementation for behaviour data

Input: X_min, X_maj, β_u, Clust_opt

a) Cluster the majority class instances X_maj:
   – Assign to each top node a weight corresponding with the hyperbolic tangent applied to the inverse of the node's degree.
   – Project the bigraph X_maj to a weighted unigraph consisting of bottom node majority class instances. The weight w_ij between majority class instances i and j corresponds with the total weight of the shared top nodes.
   – Apply the Louvain algorithm (Blondel et al 2008) on the projected unigraph to partition the majority class instances into clusters.

b) Select majority class instances:
   Nr_rem ← ⌊(|X_maj| − |X_min|) × β_u⌋ (see Equation (2))
   Nr_retain ← |X_maj| − Nr_rem
   if Nr_retain < |Clust| then
      – Sort the clusters according to Clust_opt
      – Randomly select 1 instance from each of the first Nr_retain clusters
   else
      – Randomly select ⌈Nr_retain / |Clust|⌉ majority class instances from each cluster
      – Randomly discard instances from the previous step until the selection size corresponds with Nr_retain
   end if

c) Return the new training set consisting of X_min and the selected majority class instances from step b)
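Step b) of the algorithm can be sketched as follows. This is a simplified reading: we assume each cluster is a list of instance indices, and we cap the per-cluster draw at the cluster size (the pseudo-code above implicitly assumes every cluster is large enough); names are ours.

```python
import numpy as np

def cbu_select(clusters, nr_retain, rng, smallest_first=True):
    """Step b) of Algorithm 2: draw an (approximately) equal number of
    majority instances from each cluster, then trim to nr_retain. The
    sorted branch mirrors the C_Smallest / C_Largest options."""
    if nr_retain < len(clusters):
        clusters = sorted(clusters, key=len, reverse=not smallest_first)
        return [int(rng.choice(c)) for c in clusters[:nr_retain]]
    per = int(np.ceil(nr_retain / len(clusters)))
    picked = [int(i) for c in clusters
              for i in rng.choice(c, size=min(per, len(c)), replace=False)]
    rng.shuffle(picked)                 # randomly discard the surplus
    return picked[:nr_retain]

rng = np.random.default_rng(0)
clusters = [list(range(0, 5)), list(range(5, 12)), list(range(12, 14))]
kept = cbu_select(clusters, nr_retain=6, rng=rng)   # two per cluster
```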

3.3 Boosting, cost-sensitive learning and EasyEnsemble

3.3.1 AdaBoost

The AdaBoost (Schapire and Singer 1999; Schapire 1999) algorithm has been designed from the perspective of improving the performance of a weak learner so that it achieves accuracies comparable with a strong learning algorithm. Fundamental to the idea of boosting is to maintain a weight distribution over the training set. In each boosting iteration, the weights of wrongly classified instances are increased, so that the underlying weak learner puts more emphasis on these hard examples. In our implementation, we will consider using a SVM as base learner. SVMs are generally regarded as strong learners. The studies of Wickramaratna et al (2001) and García and Lozano (2007) note that using a strong learner usually results in performance degradation during the boosting process; the RBF-kernel SVM classifier (Wickramaratna et al 2001; Li et al 2008) is used as the underlying classifier to prove their point. In our study, we will employ a linear SVM, which can be considered a weaker version compared to the RBF-kernel. Furthermore, the regularization parameter C can be viewed as a "weakness" indicator:11 lowering the C-value results in weaker learners, as can be seen from the goal function of the SVM optimization problem (see equation (1)). We will come back to this point in Section 4.5.

11. The distinction between weak/strong learners is loosely 'defined' in Schapire (1999). A weak learner corresponds with a hypothesis that performs just slightly better than random guessing; a strong learner is able to generate a hypothesis with an arbitrarily low error rate, given enough data. We adopt these definitions, but consider the distinction between weak/strong based on training set error. In a SVM context, it is quite typical that error levels on training data drop with increasing C-values (Suykens et al 2002). A learner that is 'too strong' means that even though its performance on training data is very high, it fails to generalize well and the test set error increases due to overfitting.


The boosting algorithm with underlying SVM is presented in Algorithm 3. This algorithm closely follows the original boosting implementation presented in Schapire and Singer (1999) and requires each learner to output confidence rated predictions in the interval [−1, 1]. Since the SVM outputs real-valued scores, we apply the procedure of Platt (1999) to transform these scores into probability estimates (which can easily be translated into confidence rated predictions). A logistic regression (LR) model (Ng and Jordan 2002) is trained using the SVM-scores as input and the corresponding labels as output. Note that the same data that are used to construct the linear SVM can be used to estimate the LR-model; see Platt (1999) for a motivation.
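A minimal stand-in for this calibration step is sketched below: a one-feature logistic regression on the SVM scores, fitted by plain gradient descent on the cross-entropy (Platt's actual procedure additionally uses regularized targets, which we omit here; names are ours).

```python
import numpy as np

def platt_fit(scores, y01, n_iter=5000, step=0.5):
    """Fit P(y = 1 | s) = 1 / (1 + exp(A*s + B)) to SVM scores s and
    0/1 labels, in the spirit of Platt (1999)."""
    A, B = 0.0, 0.0
    for _ in range(n_iter):
        p = 1.0 / (1.0 + np.exp(A * scores + B))
        A -= step * np.mean((y01 - p) * scores)   # dL/dA = (y - p) * s
        B -= step * np.mean(y01 - p)              # dL/dB = (y - p)
    return A, B

scores = np.array([-2.0, -1.0, 1.0, 2.0])         # raw SVM scores
y = np.array([0, 0, 1, 1])
A, B = platt_fit(scores, y)
p = 1.0 / (1.0 + np.exp(A * scores + B))          # probability estimates
conf = 2.0 * p - 1.0                              # confidence in [-1, 1]
```

The final line shows the translation from probabilities to confidence rated predictions in [−1, 1], as required by the boosting algorithm.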

The boosting algorithm requires the weak learner to be trained using a distribution D_t. One could sample from this distribution to generate (unweighted) examples and subsequently train a SVM. We have chosen instead to include weights in the SVM goal function formulation (this requires an extension of the LIBLINEAR package):

min_{w,b,ξ_i} (wᵀw)/2 + C ∑_{i=1}^{m} weight_i ξ_i    (3)

The weights weight_i are set according to the weight distribution D_t(i) in each round of boosting. Note that the C-value that is used to train the SVM model will be divided by mean(weight_i). This normalization allows for a fair comparison between the weighted and unweighted SVM versions (e.g., in the first round of boosting, weight_i = 1/m for all instances; by multiplying the C-value with m, this corresponds to solving the unweighted problem with the same C-value).

We introduce an additional parameter μ, called the weight percentage, with values in [0, 100], in the boosting algorithm of Algorithm 3. This parameter controls the amount of training data that is used to construct the SVM model and subsequent LR model. We sort the original training data in descending order according to the distribution D_t. Next, we form a new training set of minimal cardinality by including points from the original sorted training data until the total weight is higher than the weight-percentage μ/100. This way, the newly formed training data will contain only the part of the original training data that has the most weight. This partial set will then be used to construct a weighted SVM-model (according to equation (3), with updated distribution for this set) and a subsequent LR-model. The idea of using a partial dataset to construct the base learner not only reduces training times, but also weakens the learner (García and Lozano 2007).
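The μ-based subset selection described above can be sketched as follows (function name is ours):

```python
import numpy as np

def top_weight_subset(D, mu):
    """Return the indices of the minimal set of training instances whose
    cumulative weight under distribution D exceeds mu/100, heaviest first."""
    order = np.argsort(-D)                  # sort by descending weight
    csum = np.cumsum(D[order])
    k = int(np.searchsorted(csum, mu / 100.0)) + 1
    return order[: min(k, len(D))]

D = np.array([0.5, 0.3, 0.1, 0.1])          # boosting distribution D_t
idx = top_weight_subset(D, mu=75)           # weights 0.5 + 0.3 >= 0.75
```

With μ = 100, the whole training set is used; with μ = 75, only the heaviest instances covering at least 75% of the total weight remain.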

In Algorithm 3, we have an explicit check to verify if r_AB = 1. In this case, the SVM model outputs scores that allow for a perfect classification on the training set. The subsequent LR model will find a threshold and output a value of −1 if the SVM score is lower than this threshold (a value of +1 if the score is higher). In this situation, the training data would be perfectly classified. In our implementation, we attempt to avoid this kind of behaviour because it can lead to overfitting, meaning it might pinpoint the wrong threshold and make too drastic decisions (we lose the meaning of confidence). The check if r_AB ≤ 0 verifies whether the currently boosted model is performing worse than random (a random model would have an r_AB-value of 0). Obviously, if the model performs worse than random, we quit the boosting process. During the first round of boosting, we perform similar checks that are not explicitly indicated in Algorithm 3: in the case where r_AB = 1, we output the SVM scores instead of the LR binary values; when r_AB ≤ 0, we quit the boosting process and output the LR scores.

Algorithm 3 AdaBoost with a SVM-LR combination as a base learner

Input: (X, Y) = {(x_1, y_1), ..., (x_m, y_m)}, C, T, μ
Initialize the distribution: D_1(i) = 1/m
for t = 1 to T do
   – train the weak learner using distribution D_t. The weak learner consists of a weighted linear SVM and a LR model, trained with weight-percentage μ of D_t:
      h_t ← Train_WeakLearner(X, Y, D_t, C, μ)
   – compute the weighted confidence r_AB on the training data:
      r_AB ← ∑_{i=1}^{m} D_t(i) y_i h_t(x_i)
      If (r_AB = 1 or r_AB ≤ 0) then α_t ← 0 and stop the boosting process
   – choose α_t ∈ ℝ:
      α_t ← (1/2) log((1 + r_AB) / (1 − r_AB))
   – update the distribution:
      D_{t+1}(i) ← D_t(i) exp(−α_t y_i h_t(x_i)) / Z_t
      where Z_t is a normalization factor (chosen so that D_{t+1} will be a distribution)
end for
Output the final hypothesis (the output score is the term contained in the sign function):
   H(x) = sign(∑_{t=1}^{T} α̃_t h_t(x)), with α̃_t = α_t / ∑_{i=1}^{T} α_i
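The core update of one round of Algorithm 3 can be written compactly in NumPy (weak-learner training omitted; h holds the confidence-rated outputs h_t(x_i) ∈ [−1, 1], and the function name is ours):

```python
import numpy as np

def adaboost_round(D, y, h):
    """One confidence-rated AdaBoost round (Schapire and Singer 1999):
    returns (alpha_t, D_{t+1}); alpha is None when boosting must stop."""
    r_ab = float(np.sum(D * y * h))            # weighted confidence r_AB
    if r_ab >= 1.0 or r_ab <= 0.0:             # stopping checks of Algorithm 3
        return None, D
    alpha = 0.5 * np.log((1.0 + r_ab) / (1.0 - r_ab))
    D_new = D * np.exp(-alpha * y * h)
    return alpha, D_new / D_new.sum()          # Z_t normalization

D = np.full(4, 0.25)
y = np.array([1.0, 1.0, -1.0, -1.0])
h = np.array([0.8, 0.6, -0.5, 0.3])            # last instance misclassified
alpha, D2 = adaboost_round(D, y, h)
```

Here r_AB = 0.4, so α_t = ½ log(1.4/0.6) > 0, and the weight of the misclassified fourth instance grows relative to the others, exactly as the update rule prescribes.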

3.3.2 AdaCost

The AdaCost algorithm (Fan et al 1999) is a variant of cost-sensitive learning, where misclassification costs are introduced in the weight-update formula of AdaBoost. The cost-sensitive update rule increases the weights of costly misclassified instances more aggressively and decreases the weights of costly correct classifications more conservatively. Each instance is given a misclassification cost c_i, where we chose to put c_i = 1 for positive (minority) instances and c_i = 1/R for negative (majority) instances; R is a user defined value that allows one to put more emphasis on the minority class. The implementation of AdaCost is similar to Algorithm 3, yet there are a few differences (Fan et al 1999). Firstly, the initial distribution is chosen as D_1(i) = c_i / ∑_{j=1}^{m} c_j. Secondly, the weight update rule is given by D_{t+1}(i) = D_t(i) exp(−α_t y_i h_t(x_i) β(i)) / Z_t, where β(i) = −0.5 sign(y_i h_t(x_i)) c_i + 0.5 is a cost-adjustment function. Finally, the choice of α_t is given by α_t = (1/2) log((1 + r_AC) / (1 − r_AC)), where r_AC = ∑_{i=1}^{m} D_t(i) y_i h_t(x_i) β(i). Note that the checks to stop the boosting process prematurely (see the second bullet in Algorithm 3) are still based on the r-value obtained from AdaBoost (r_AB); this is because β(i) ∈ [0, 1].
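Under this reading of the update formulas, a single AdaCost round can be sketched as follows (NumPy; the toy cost vector and all names are ours):

```python
import numpy as np

def adacost_round(D, y, h, c):
    """One AdaCost round (Fan et al 1999): c_i = 1 for minority instances,
    1/R for majority ones; beta(i) = -0.5*sign(y_i h(x_i))*c_i + 0.5."""
    beta = -0.5 * np.sign(y * h) * c + 0.5     # cost-adjustment, in [0, 1]
    r_ac = float(np.sum(D * y * h * beta))
    alpha = 0.5 * np.log((1.0 + r_ac) / (1.0 - r_ac))
    D_new = D * np.exp(-alpha * y * h * beta)
    return alpha, D_new / D_new.sum()

R = 4.0
c = np.array([1.0, 1.0, 1.0 / R, 1.0 / R])     # two minority, two majority
D0 = c / c.sum()                                # initial distribution D_1
y = np.array([1.0, 1.0, -1.0, -1.0])
h = np.array([0.9, 0.8, -0.8, -0.7])            # all correctly classified here
alpha, D2 = adacost_round(D0, y, h, c)
```

Note the asymmetry this produces: a correctly classified minority instance (c_i = 1) gets β(i) = 0, so its weight is not decreased at all, while correctly classified majority weights shrink, which is the conservative decrease described above.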

In the first boosting round of AdaCost, the weighted SVM formulation (3) is equivalent to solving a SVM formulation with the following goal function:

min_{w,b,ξ_i} (wᵀw)/2 + C_+ ∑_{i|y_i=1} ξ_i + C_− ∑_{i|y_i=−1} ξ_i    (4)

where C_+ / C_− = R. This can be seen as a cost-sensitive version of a SVM, an idea initially proposed by Veropoulos et al (1999).

3.3.3 EasyEnsemble

One of the disadvantages of the random undersampling method is the fact that we are discarding potentially valuable information. EasyEnsemble (Liu et al 2009) is a method that combines several balanced subsets (S in total) containing randomly selected majority class instances together with all minority examples. Each subset contains the same amount of instances from both classes and is fed to the boosting algorithm presented in Algorithm 3. Afterwards, the weak learners h_st of each subset s are simply combined to form the final ensemble:

H(x) = sign(∑_{s=1}^{S} ∑_{t=1}^{T} α_st h_st(x)), with s = 1, ..., S and t = 1, ..., T    (5)
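Given trained weak learners, the subset sampling and the combination of Eq. (5) reduce to a few lines (sketch; learner training is omitted and the names are ours):

```python
import numpy as np

def balanced_subsets(maj_idx, n_min, S, rng):
    """Draw S random majority-class subsets, each of minority-class size;
    every subset is later joined with all minority instances and boosted."""
    return [rng.choice(maj_idx, size=n_min, replace=False) for _ in range(S)]

def easy_ensemble_score(x, ensemble):
    """Eq. (5): sum alpha_st * h_st(x) over all subsets s and rounds t;
    sign(score) is the predicted class."""
    return sum(a * h(x) for subset in ensemble for a, h in subset)

rng = np.random.default_rng(0)
subsets = balanced_subsets(np.arange(1000), n_min=50, S=4, rng=rng)

# Toy ensemble: two subsets, each with two (alpha, h) pairs.
ensemble = [[(0.6, lambda x: 1.0), (0.4, lambda x: -1.0)],
            [(0.8, lambda x: 1.0), (0.2, lambda x: 1.0)]]
score = easy_ensemble_score(None, ensemble)
```

Each subset is only twice the minority class size, and the S boosting runs are independent, which is what makes the method both fast and parallelizable.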

It is clear that this technique benefits from a combination of bagging and boosting (Liu et al 2009). Note that we apply the same boosting algorithm as previously described to each of the balanced subsets. However, when r_AB = 1 in the first round of boosting, we quit the boosting process, put α_1 = 1 and continue to use the trained LR-model in the final ensemble. It was previously noted that this can cause overfitting, in the sense that it can pinpoint a wrong threshold. However, the LR-model still contains information, and the combination thereof with the models obtained from the other subsets remains valuable (as we noted from initial experiments comparing the situations where we include or reject those subsets).

There are a few subtle, though important, differences with respect to the experiments performed by Liu et al (2009). First of all, the authors use a CART-algorithm (Breiman et al 1984) as base learner; we employ a linear SVM with subsequent LR as a weak learner. To our knowledge, this combination has not been proposed elsewhere, yet proves to be very efficient in this setting. Secondly, their methodology employs a discrete version of AdaBoost, where each weak learner outputs binary values in {−1, 1}. As already stated, we make use of an improved version of AdaBoost that relies on confidence rated predictions (Schapire and Singer 1999). Finally, the authors reported performances with fixed levels of S = 4 and T = 10; we will investigate the effect of varying S and T levels.


4 Results and discussion

4.1 Datasets

Stankova et al (2015) provide the first large collection of benchmark behaviour datasets for classification. In our experiments, we make use of these data sources and extend this repository with two additional datasets. Each of these datasets shows a bipartite structure with a clear target variable to predict. We refer to this study and the next paragraph for a short description of the available data resources. In this paragraph, we indicate why we have chosen to include or reject certain data sources from the aforementioned study. The available datasets can be divided into small, medium and large datasets, based on the number of instances and the number of features present. The Norwegian companies and Reality Mining datasets comprise the small datasets: since they contain only a few hundred instances or features, they are regarded as impractical for our purposes. The MovieLens, Yahoo, TaFeng, Book-Crossing and LibimSeTi datasets belong to the medium sized datasets, each containing a few thousand up to a few hundred thousand instances and features. All of these datasets are included in our study. The large datasets, containing hundreds of thousands up to millions of instances and features, are the Flickr and Kdd databases.12 Other large, proprietary data sources not included in Stankova et al (2015) are the corporate residence fraud (CRF) and banking (Bank) datasets, which arise from real-life application domains with intrinsic imbalance.

To summarize, we have gathered datasets containing such fine-grained behaviour data from a wide variety of application domains. The MovieLens datasets, for which we are predicting the gender13 (Mov G) or the genre thriller14 (Mov Th), provide data on which films each user has rated. The Yahoo movies15 dataset has a similar structure, where the age of each user, Yahoo A (above or below average), or the gender, Yahoo G, is being predicted. The TaFeng dataset16 contains data on shopping behaviour, where age (below or above average) is being predicted based upon which products are purchased. In the book-crossing (Book) dataset (Ziegler et al 2005), users rate books and the age of the user (above or below average) is being predicted. LibimSeTi (LST) contains data from a dating site (Brozovsky and Petricek 2007), where users rate each other's profiles and gender is being predicted. In the advertisement (Adver) dataset (Lichman 2013), we try to predict whether a URL is an advertisement, based on a large variety of binary features of the URL. Note that this dataset does not arise from the behaviour of entities, yet it still has a high-dimensional and sparse representation. The Flickr dataset (Cha et al 2009) contains pictures being marked by users as favorite, and we predict the number of comments on each picture (below or above average). In the Kdd cup data, the performance of a student on a test is being predicted based on artificially created binary features (Yu et al 2010). The corporate residence fraud dataset (CRF) (Junque de Fortuny et al 2014b) contains data on foreign companies making transactions with specific Belgian companies, where we try to predict whether the foreign company commits residence fraud (a type of fiscal fraud). Finally, the banking dataset (Bank) (Martens et al 2016) contains detailed behaviour of consumers making transactions with merchants or other persons, to predict interest in a pension fund product. Some characteristics of these datasets can be found in Table 2. The features column only shows the number of active features.17

12. Flickr and KDD will be excluded in the comparative study of Section 4.6. This is because some methods are too computationally intensive (especially in combination with the large number of possible parameter combinations) to be applied on these very large data sources. Furthermore, our statistical evidence is already sufficiently strong to conclude significance without these datasets. Having said this, these data sources will be included in the analysis of Section 5.
13. MovieLens 1M dataset from http://grouplens.org/datasets/movielens
14. MovieLens 10M dataset from http://grouplens.org/datasets/movielens
15. https://webscope.sandbox.yahoo.com
16. http://www.bigdatalab.ac.cn/benchmark/bm/dd?data=Ta-Feng

Table 2 Behaviour data characteristics. The final column shows the imbalance ratio p, defined as the ratio of the number of minority class instances to the amount of majority class instances in the training set, expressed as a percentage. See Section 4.2 for details regarding p.

Name      |X_maj|    |X_min|    Features     p = 100 × |X_min|_train / |X_maj|_train
Mov G     4331       1709       3706         1 & 25
Mov Th    10546      131        69878        1.24 (p = [])
Yahoo A   6030       1612       11915        1 & 25
Yahoo G   5436       2206       11915        1 & 25
TaFeng    17330      14310      23719        1 & 25
Book      42900      18858      282973       1 & 25
LST       59702      60145      166353       1
Adver     2792       457        1555         16.38 (p = []) & 1
CRF       869071     62         108753       0.0072 (p = [])
Bank      1193619    11107      3139570      0.93 (p = [])
Flickr    8166814    3028330    497472       0.1
Kdd       7171885    1235867    19306083     0.5

4.2 Methodology

Regarding the experiments performed in the upcoming sections, we applied a tenfold cross-validation procedure. Each of the folds contains 80% training data, 10% validation data and 10% test data. Note that these percentages are valid for both the majority class and the minority class (stratified sampling). As can be seen from Table 2, some datasets are balanced in nature. We created artificial imbalance for these datasets by removing minority class instances from the initial training set, according to a user defined parameter p. We ensured that the amount of minority training instances corresponds to p percent of the majority class training size: |X_min|_train = (p/100) × |X_maj|_train. As an example, say that we are using the Book dataset with p = 25. In that case, we know that the majority class contains 34320 training instances (80% of |X_maj| = 42900). The minority training data would contain 8580 instances (25% of 34320). When the dataset is already imbalanced, we define p = [], which means that no downsampling of the minority class training data is performed. Note that the validation and test data are left untouched.18

17. Active features represent features that are present for at least one instance in the dataset. A non-active feature corresponds with a column of zeros in the matrix representation and would not contribute to the model.
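The Book example can be checked directly (an 80/10/10 split with p = 25):

```python
# Worked check of the Book example: 80% of the 42900 majority instances
# form the training set, and the minority training size is p% of that.
n_maj = 42900
n_maj_train = round(0.80 * n_maj)            # 34320 training instances
n_min_train = round(25 / 100 * n_maj_train)  # 8580 minority training instances
print(n_maj_train, n_min_train)
```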

The methods detailed in Section 3 are applied on the training data. For both the under- and oversampling approaches, a linear SVM is trained on the newly created balanced training data, with regularization parameter C taking values

C = [10^−7, 10^−5, 10^−3, 10^−1, 10^0]

The validation data are used for parameter tuning purposes; the test data allow us to obtain the generalization performance. The results reported show the average over ten folds. In the remaining paragraphs of this section, we will describe the various parameter settings used in our experiments and give a brief overview of the parameters occurring in each method.

Considering the oversampling techniques, the parameter settings are as follows:

β = [0, 1/3, 2/3, 1]
prior_opt = {FlipCoin, Reverse Prior}
sim_measure = {Cosine, Jaccard}
K = [10^0, 10^1, 10^2, |X_min|_train]

We didn't include the "Prior" option, due to initial experiments showing a lower performance in comparison with the other options. This can be explained by the low priors occurring in each column, resulting in synthetic samples that mainly show zeros in 0-1 match situations. The oversampling with replacement method (OSR) only uses the β parameter; SMOTE uses all but the K parameter, and ADASYN uses all of the parameters listed above.

With respect to the undersampling techniques, the following parameter settings are used:

β_u = [0, 1/4, 1/2, 3/4, 1]
sim_measure = {Cosine, Jaccard}
K = [10^0, 10^1, 10^2, |X_min|_train]
Clust_opt = {C_Smallest, C_Largest}

The random undersampling (RUS) technique only uses the β_u parameter. The second set of methods, "Closest tot sim" and "Farthest tot sim", uses β_u and sim_measure. The third set of techniques, "Closest Knn" and "Farthest Knn", makes use of all the parameters listed above except for Clust_opt. The final approach, CBU, employs β_u and Clust_opt.

The boosting variants (AdaBoost, AdaCost and EasyEnsemble) presented in Section 3.3 make use of the following settings:

T = 30
μ = [100, 75]
C = [10^−7, 10^−5, 10^−3, 10^−1]
R = [2, 8, R_L], where R_L = |X_maj|_train / |X_min|_train
S = [5, 10, 15]

18. This means that if we start from a balanced set, only the training data will show artificial imbalance according to the imbalance ratio p; the validation and test data would remain balanced. Since AUC (and some other metrics) is independent of class skew, it would be unwise to make these sets imbalanced as well, because that would lead to discarding minority class instances that are relevant for performance assessment.

The AdaBoost algorithm includes the T, μ and C parameters. AdaCost additionally uses cost-ratios R. We have chosen a range of values because misclassification costs are unknown for many business applications (He and Garcia 2009; Fan et al 1999; Sun et al 2007). The final value, R_L, seems to be a popular choice (Akbani et al 2004; Luts et al 2010), because the total weight on the majority class balances with the total weight on the minority class. The final method, EasyEnsemble, uses S subsets in addition to the parameters previously mentioned for AdaBoost. Note that we consider the boosting iteration round t ∈ [1, T] as a tunable parameter.19

4.3 Oversampling

The oversampling techniques presented in Section 3.1 are applied to each of the datasets from Table 2, with varying imbalance levels (by varying the p-parameter). The experiments we conducted use the following methodology: for each dataset, we apply the oversampling techniques with all possible parameter combinations on the training data to create newly balanced datasets, after which linear SVMs are trained. The optimal parameter combination with respect to each imbalance ratio β is selected based on validation set AUC-performance. With these parameters, the AUC on the test data is obtained. Results show the average over ten folds.

The results on four arbitrarily selected datasets are shown in Table 3; full results on each of the data sources can be found in Appendix A, Table 11. From these tables, we can conclude that performance generally increases with growing β-values. It is more correct to say that performance keeps improving with growing β-levels until an optimal point β* is reached; increasing the balance level beyond this optimal value causes only small fluctuations with respect to the optimal performance. Traditional studies dealing with dense, low-dimensional data note that the OSR method can suffer from overfitting, as already mentioned in Section 3.1. It is interesting to see that in this sparse, high-dimensional setting, this effect doesn't seem to occur.20 Furthermore, the computationally expensive synthetic sampling approaches do not seem to improve performance over the simpler OSR method. This can be explained by the fact that many instances have a very limited amount of non-zero elements (Junque de Fortuny et al 2014a; Stankova et al 2015), causing the synthetic sampling procedures to be limited in the amount of unique new samples they can produce. Note also that synthetic sampling procedures have historically been designed to overcome the overfitting behaviour of OSR, which no longer seems to apply here.

19. We sometimes make use of T as a symbol to indicate the boosting iteration number instead of t.
20. The SVM regularization parameter C controls for overfitting. Additionally, the influence of majority class noise/outliers (see Section 3.2) on the learned hyperplane decreases as we oversample the minority class: the hyperplane becomes more sensitive towards minority instances, and a change in its direction/orientation towards these majority class noise/outliers will be more heavily penalized.

Table 3 Oversampling experiments Results show the average tenfold AUC test set performance withrespect to increasing β -values (β = [β1β2β3β4] = [013231]) Additionally standard deviationsare shown between brackets Optimal performances for each method are highlighted in boldface SeeSection 42 for parameter settings

Mov Th(p = [])
          β1            β2            β3            β4
OSR       79.77 (5.33)  85.3 (4.66)   83.16 (4.5)   84.59 (5.69)
SMOTE     79.77 (5.33)  84.18 (6.51)  85.58 (5.97)  84.33 (5.75)
ADASYN    79.77 (5.33)  84.11 (6.77)  85.86 (6.06)  85.36 (5.13)

Yahoo G(p = 25)
          β1            β2            β3            β4
OSR       78.82 (1.39)  78.78 (2.02)  78.74 (1.86)  78.35 (1.47)
SMOTE     78.82 (1.39)  79.23 (1.57)  79.1 (1.2)    79.03 (1.89)
ADASYN    78.82 (1.39)  79.12 (1.43)  79.23 (1.37)  79.51 (2.01)

TaFeng(p = 25)
          β1            β2            β3            β4
OSR       66.94 (1.34)  68.84 (1.21)  66.99 (1.42)  67.7 (1.41)
SMOTE     66.94 (1.34)  68.47 (1.5)   67.07 (1.15)  66.65 (0.81)
ADASYN    66.94 (1.34)  68.62 (1.38)  67.85 (1.6)   66.91 (1.39)

Book(p = 25)
          β1            β2            β3            β4
OSR       60.08 (0.71)  61.05 (0.94)  61.82 (1.12)  64.62 (0.57)
SMOTE     60.08 (0.71)  62.6 (0.73)   60.95 (0.68)  63.0 (0.8)
ADASYN    60.08 (0.71)  62.33 (0.73)  60.77 (0.85)  63.04 (0.58)

4.4 Undersampling

Regarding the undersampling techniques, we employ a similar experimental set-up as in the previous section: first, train SVMs on the undersampled training data with all possible parameter combinations; second, choose a suitable parameter combination based on validation set AUC-performance; and finally, obtain the AUC-performance on the test set. The results are averaged across ten folds.
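The selection criterion throughout is AUC. As a self-contained reference, the statistic can be computed from ranks (the Wilcoxon-Mann-Whitney form); this sketch ignores tied scores, which is acceptable for continuous SVM decision values:

```python
import numpy as np

def auc(scores, y):
    """AUC via the Mann-Whitney rank-sum statistic.
    scores: real-valued classifier outputs; y: labels in {0, 1}."""
    order = np.argsort(scores)
    ranks = np.empty(len(scores))
    ranks[order] = np.arange(1, len(scores) + 1)
    n_pos = int(y.sum())
    n_neg = len(y) - n_pos
    return (ranks[y == 1].sum() - n_pos * (n_pos + 1) / 2) / (n_pos * n_neg)

y = np.array([0, 0, 1, 1])
print(auc(np.array([0.1, 0.4, 0.35, 0.8]), y))  # 0.75: one discordant pair out of four
```

The value equals the probability that a randomly drawn minority instance is scored above a randomly drawn majority instance, which is why it is insensitive to the class imbalance itself.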

Before going into detail on the experimental findings, we give a short note on the effect of noise/outliers on SVM performance. Consider the following imaginary example: say that a majority class female is rating a lot of action films and thrillers. We can consider this as an outlier, since most of the females are rating romantic or drama films; this specific female has far more in common with the minority class males, who also rate action films and thrillers. Outliers are therefore instances showing odd behaviour; noise refers to wrongfully labelled instances. The effect of noise/outliers on SVM performance can be severe. Indeed, many of the instances in the dataset have a support value (dual variable αi) of 0. Instances contributing to the model (a non-zero support value) are examples close to the borderline (Suykens et al 2002) and noise/outliers (as can be derived from the KKT dual-complementarity conditions21).

21 If αi = 0 then yi(wT xi + b) ≥ 1. For noise/outliers the term yi(wT xi + b) is negative, hence αi ≠ 0.

24 Jellis Vanhoeyveld David Martens

With respect to noise/outliers, the support value can reach a maximal value of C, the chosen regularization parameter in the SVM formulation. It should be clear that lowering the C-value leads to models that are less sensitive to noise/outliers.
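This cap at C can be checked directly on a toy problem. The sketch below uses scikit-learn's SVC, which exposes the dual variables; the data and the planted outlier are ours, not from the paper:

```python
import numpy as np
from sklearn.svm import SVC

rng = np.random.default_rng(1)
X_maj = rng.normal([-2.0, 0.0], 0.3, (50, 2))   # majority cluster
X_min = rng.normal([+2.0, 0.0], 0.3, (10, 2))   # minority cluster
outlier = np.array([[2.0, 0.0]])                # majority point deep in minority territory
X = np.vstack([X_maj, outlier, X_min])
y = np.array([0] * 51 + [1] * 10)

svm = SVC(kernel="linear", C=1.0).fit(X, y)
alpha = np.abs(svm.dual_coef_).ravel()          # |y_i * alpha_i| for the support vectors

# the outlier (index 50) is misclassified, so its dual variable sits at the cap C
assert 50 in svm.support_
assert np.isclose(alpha[list(svm.support_).index(50)], 1.0)
```

Lowering C shrinks the maximal weight any single noisy instance can exert on the hyperplane, which is exactly the "less sensitive to noise/outliers" behaviour described above.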

The experimental outcomes of the various undersampling processes are shown in Table 4 for the same four previously selected datasets. We refer to Appendix B, Table 12 for results on the entire data repository. When removing only a limited amount of majority class instances (βu = βu2 = 1/4), we observed that in 12 out of 16 datasets the "Farthest Knn" method outperforms the "Closest Knn" technique; the four remaining datasets show comparable performances.22 This finding shows that the "Farthest" method is very suitable in removing majority class noise/outliers and empirically shows their performance degrading effect. With higher undersampling rates (βu = 1), the "Closest Knn" method generally achieves higher performances than the "Farthest Knn" technique, though the results are less clear: only 9 out of 16 datasets indicate "Closest" as a clear winner, with 3 tie situations and 4 situations where "Farthest" shows the best results. Intuitively, we would expect the "Closest" method to perform better, since we are keeping the most informative instances (examples close to the minority class). However, this technique will also emphasize noise/outliers, and this is the main reason why the results are less significant. One can also see this from the fact that the RUS method shows far better results in comparison to the aforementioned techniques when βu = 1, an observation that is valid for all datasets. The RUS method is efficient in realizing noise/outlier reduction effects, since each SVM is trained with only a fraction of these types of instances. It is remarkable that in 9 out of 16 datasets the RUS method with βu = 1 outperforms the baseline model (βu = 0); in 4 datasets we observed equal performance, and 3 losses with respect to the baseline. RUS shows two attractive features: majority class noise/outlier removal and the ability to put more emphasis on the minority class instances by undersampling the majority class. A general downside of the approach is the information loss, and this effect causes the 3 losses to the baseline. Usually, though, the attractive features of RUS dominate the latter effect, and this indicates that it is definitely not necessary to include all majority class training instances in constructing the predictive model. Apparently there is a high level of redundancy, which can be exploited to construct efficient hypotheses.
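The undersampling variants discussed above can be sketched in a few lines. This is our own simplification: distances use the single nearest minority neighbour (k = 1), whereas the paper's "Knn"/"tot sim" variants aggregate over several neighbours or total similarity.

```python
import numpy as np

rng = np.random.default_rng(0)

def undersample(X_maj, X_min, beta_u, mode="random"):
    """Drop beta_u * (n_maj - n_min) majority rows; beta_u = 1 yields a balanced set.
    'closest' keeps the majority rows nearest to the minority class,
    'farthest' keeps the rows farthest away (discarding likely noise/outliers)."""
    n_keep = len(X_maj) - int(beta_u * (len(X_maj) - len(X_min)))
    if mode == "random":                                   # RUS
        return X_maj[rng.permutation(len(X_maj))[:n_keep]]
    # distance of every majority row to its nearest minority row (k = 1 for brevity)
    d = np.linalg.norm(X_maj[:, None, :] - X_min[None, :, :], axis=2).min(axis=1)
    order = np.argsort(d)                                  # ascending: closest first
    keep = order[:n_keep] if mode == "closest" else order[-n_keep:]
    return X_maj[keep]

X_maj, X_min = rng.random((100, 5)), rng.random((10, 5))
assert len(undersample(X_maj, X_min, 1.0)) == 10           # fully balanced
assert len(undersample(X_maj, X_min, 0.25, "farthest")) == 78
```

Note how "farthest" discards exactly the majority instances that resemble the minority class, the instances identified above as noise/outliers, while "closest" keeps (and thus emphasizes) them.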

In the next paragraph we compare the cluster-based undersampling (CBU) approach to the random undersampling (RUS) technique. Before proceeding, we should note that Table 12 does not show results on the LST, CRF and Bank datasets, due to the fact that the projected unigraph was too large to fit in memory (larger than 16 GB). A limited number of features (top nodes) does seem to be active for a relatively large number of instances (bottom nodes); all these instances will be connected in the unigraph projection, resulting in relatively high degrees for the bottom nodes. A possible solution to circumvent this problem might be to cut the weakest edges in the projected unigraph.
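The memory blow-up of the projection is easy to reproduce on synthetic data. A sketch with SciPy sparse matrices follows; the sizes and density are illustrative, not the paper's:

```python
from scipy import sparse

# hypothetical bipartite behaviour matrix: 5000 instances (bottom) x 300 features (top)
B = sparse.random(5000, 300, density=0.01, format="csr", random_state=0)
B.data[:] = 1.0

# unigraph projection: two instances are connected when they share an active feature
G = B @ B.T
print(B.nnz, G.nnz)   # the projected edge count explodes relative to the bipartite matrix
```

Each top node active in m bottom nodes contributes on the order of m² entries to the projection, so a few moderately popular features are enough to make the unigraph orders of magnitude denser than the bipartite matrix it came from.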

If we apply a limited amount of undersampling (βu = βu2 = 1/4), the CBU technique outperforms RUS in 8 out of 13 datasets (4 losses and 1 tie with respect to RUS). On the highly imbalanced datasets (p = 1 or p = []), the CBU method wins

22 A tie occurs in the situation where the absolute difference in AUC is smaller than or equal to 0.5.


in 8 out of 8 cases. We can therefore conclude that CBU outperforms RUS at low undersampling rates. This indeed shows that handling the within-class imbalance can be beneficial in this environment. If we make a comparison at the highest undersampling rate, βu = βu5 = 1, the results are completely opposite: CBU wins in only 1 case, with 9 losses with respect to RUS and 3 ties. On the highly imbalanced datasets, CBU outperforms RUS for 1 dataset (with 4 losses and 3 ties). It is clear that RUS is the preferred technique in case of high undersampling rates. The reason is that CBU emphasizes the small disjuncts and basically fails to adequately pick up on the common, larger communities. Furthermore, these small disjuncts might actually correspond with majority class noise/outliers, and CBU is focusing on these types of instances. On the overall level, where we consider all undersampling rates (except βu = 0), CBU wins in 5 out of 13 datasets, RUS wins in 4 cases, and there are 4 tie situations (yet RUS did outperform CBU in all 4 cases). Both methods seem to be competitive with one another across a wide range of imbalance levels. If we pay particular attention to the highly imbalanced sets, CBU wins in 5 cases and there are 3 tie situations on the remaining sets (yet RUS did outperform CBU in all 3 cases). CBU seems to be the method of choice (with respect to RUS) when the datasets show high imbalance levels.
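A minimal CBU-style sketch is shown below. This is our own simplification using k-means: the paper's method differs in how clusters are built and sampled, but the idea of drawing evenly across majority-class clusters, so that small sub-concepts (within-class imbalance) stay represented, is the same.

```python
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(4)

def cluster_undersample(X_maj, n_target, n_clusters=8):
    """Cluster the majority class and draw an (almost) equal number of
    instances from every cluster instead of sampling uniformly at random."""
    labels = KMeans(n_clusters=n_clusters, n_init=10, random_state=0).fit_predict(X_maj)
    per_cluster = max(1, n_target // n_clusters)
    keep = []
    for c in range(n_clusters):
        idx = np.flatnonzero(labels == c)
        keep.extend(rng.choice(idx, size=min(per_cluster, len(idx)), replace=False))
    return X_maj[np.sort(keep)]

X_maj = rng.random((400, 6))
X_small = cluster_undersample(X_maj, n_target=80)
assert 0 < len(X_small) <= 80
```

The failure mode discussed above is visible in this sketch too: at aggressive undersampling rates the per-cluster quota over-represents tiny clusters, which may well be majority-class noise/outliers rather than genuine communities.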

Table 4 Undersampling experiments. Results show the average tenfold AUC test set performance with respect to increasing βu-values (βu = [βu1, βu2, βu3, βu4, βu5] = [0, 1/4, 1/2, 3/4, 1]). Optimal performances (boldface) are highlighted for each imbalance ratio. Additionally, standard deviations are shown between brackets. Cl K represents the "Closest Knn" technique; Cl T represents the "Closest tot sim" technique (similar for Far K and Far T, see Section 3.2). Parameter settings are given in Section 4.2.

Mov Th(p = [])
        βu1           βu2           βu3           βu4           βu5
RUS     79.77 (5.3)   80.32 (5.8)   81.57 (5.5)   81.86 (6.6)   81.26 (6.2)
Cl K    79.77 (5.3)   79.25 (4.5)   78.07 (5.0)   76.25 (6.5)   62.46 (8.5)
Cl T    79.77 (5.3)   78.4 (4.4)    72.41 (3.5)   64.66 (4.5)   60.37 (7.3)
Far K   79.77 (5.3)   84.54 (5.0)   83.64 (6.4)   80.02 (7.3)   56.82 (10.3)
Far T   79.77 (5.3)   85.03 (5.7)   82.68 (6.8)   75.61 (9.2)   56.77 (10.9)
CBU     80.11 (5.8)   81.17 (6.0)   81.08 (6.5)   84.17 (5.1)   80.96 (6.9)

Yahoo G(p = 25)
        βu1           βu2           βu3           βu4           βu5
RUS     78.82 (1.4)   78.91 (1.6)   78.97 (1.6)   78.61 (1.6)   77.82 (2.1)
Cl K    78.82 (1.4)   77.26 (1.5)   72.52 (1.5)   67.86 (2.0)   65.07 (2.7)
Cl T    78.82 (1.4)   76.83 (1.0)   71.99 (1.8)   67.15 (2.3)   61.1 (2.7)
Far K   78.82 (1.4)   78.26 (2.2)   74.69 (2.7)   67.22 (2.1)   60.72 (2.3)
Far T   78.82 (1.4)   77.68 (2.6)   72.44 (3.0)   64.94 (2.4)   59.6 (2.0)
CBU     75.25 (3.2)   75.22 (2.4)   74.69 (2.3)   73.07 (2.4)   70.69 (2.4)


Table 4 continued
TaFeng(p = 25)
        βu1           βu2           βu3           βu4           βu5
RUS     66.94 (1.3)   67.44 (1.3)   68.1 (1.4)    68.27 (1.4)   66.13 (1.2)
Cl K    66.94 (1.3)   66.13 (1.4)   63.39 (1.2)   59.83 (1.3)   56.94 (0.7)
Cl T    66.94 (1.3)   66.38 (1.5)   62.89 (1.6)   57.46 (1.3)   54.56 (1.3)
Far K   66.94 (1.3)   68.06 (1.4)   66.43 (1.6)   64.46 (1.5)   63.35 (1.3)
Far T   66.94 (1.3)   64.31 (1.1)   62.69 (1.0)   61.27 (1.1)   59.03 (1.0)
CBU     64.81 (1.2)   64.15 (1.1)   64.13 (1.2)   63.88 (0.8)   63.46 (0.8)

Book(p = 25)
        βu1           βu2           βu3           βu4           βu5
RUS     60.08 (0.7)   60.13 (0.6)   60.4 (0.8)    60.33 (0.8)   63.28 (0.8)
Cl K    60.08 (0.7)   59.96 (0.7)   60.13 (0.8)   59.96 (1.0)   59.28 (0.7)
Cl T    60.08 (0.7)   59.96 (0.7)   60.13 (0.8)   60.29 (0.4)   54.5 (0.9)
Far K   60.08 (0.7)   63.29 (1.0)   64.19 (0.8)   57.3 (1.1)    55.66 (1.1)
Far T   60.08 (0.7)   62.14 (0.5)   58.27 (0.6)   56.37 (1.0)   55.66 (1.1)
CBU     54.82 (0.9)   54.67 (0.9)   54.71 (0.9)   54.66 (1.0)   54.78 (0.9)

4.5 Boosting variants

In this section the plain AdaBoost (AB) method is compared with the cost-sensitive AdaCost (AC) algorithm, the EasyEnsemble (EE) technique and the baseline (BL) method (where we train a single SVM on the imbalanced training data). AB, AC and EE have an underlying boosting process in common. The AUC-values are stored during each boosting iteration, which allows us to show the performance graphically as a function of the number of iterations t. In EasyEnsemble (EE) we combine the weak learners of each subset by summing their individual contributions. For example, when t = 2 boosting iterations are performed, we evaluate the performance of the combined learner Σ_{s=1}^{S} Σ_{t=1}^{2} α_{st} h_{st}(x). Figures 1, 2, 3 and 4 show the average tenfold AUC-performance (with µ = 100%) on the test data. The left-side figures show the AB, AC (with varying R-values) and EE (with varying S-values) performance with respect to the number of boosting iterations; the C-value is tuned according to highest validation set AUC-performance (over all possible boosting iterations). The right-side figures investigate the influence of the C-parameter for AB and EE (with S = 15) and allow us to gain insight into the effect of weak and strong learners on the generalization performance. Results on the remaining datasets are given in Appendix C, Figures 6-17. Note that we only indicate results with weight-percentage µ = 100% (use all instances in the training process); previous experiments (with µ = 75%) showed inferior results. This suggests that it is indeed important to include all minority instances during the training of each weak learner, an observation also made by Liu et al (2009).
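The combined learner Σs Σt αst hst(x) can be sketched end-to-end, with decision stumps standing in for the SVM weak learners. This is an illustrative simplification; all names and the toy data are ours.

```python
import numpy as np

rng = np.random.default_rng(5)

def stump_fit(X, y, w):
    """Best single-feature threshold stump under weights w (labels in {-1, +1})."""
    best = (np.inf, 0, 0.0, 1)
    for j in range(X.shape[1]):
        for thr in np.unique(X[:, j]):
            for sign in (1, -1):
                pred = sign * np.where(X[:, j] > thr, 1, -1)
                err = w[pred != y].sum()
                if err < best[0]:
                    best = (err, j, thr, sign)
    return best

def stump_predict(X, j, thr, sign):
    return sign * np.where(X[:, j] > thr, 1, -1)

def adaboost(X, y, T=10):
    """Discrete AdaBoost: reweight instances towards the hard-to-learn ones."""
    w = np.full(len(y), 1.0 / len(y))
    H = []
    for _ in range(T):
        err, j, thr, sign = stump_fit(X, y, w)
        err = min(max(err, 1e-10), 1 - 1e-10)
        alpha = 0.5 * np.log((1 - err) / err)
        w = w * np.exp(-alpha * y * stump_predict(X, j, thr, sign))
        w /= w.sum()
        H.append((alpha, j, thr, sign))
    return H

def easy_ensemble(X_maj, X_min, S=5, T=10):
    """EasyEnsemble sketch: S random balanced subsets, boost each,
    then sum every weighted weak hypothesis: sum_s sum_t alpha_st * h_st(x)."""
    hypotheses = []
    for _ in range(S):
        idx = rng.choice(len(X_maj), size=len(X_min), replace=False)
        Xs = np.vstack([X_maj[idx], X_min])
        ys = np.r_[-np.ones(len(X_min)), np.ones(len(X_min))]
        hypotheses.extend(adaboost(Xs, ys, T))
    def score(X):
        return sum(a * stump_predict(X, j, t, s) for (a, j, t, s) in hypotheses)
    return score

X_maj = rng.normal(-1.0, 0.5, (100, 2))
X_min = rng.normal(+1.0, 0.5, (20, 2))
score = easy_ensemble(X_maj, X_min)
```

Each subset only sees 2 × n_min instances, yet the summed score draws on S different random views of the majority class, which is the mechanism behind both the speed and the majority-space coverage discussed below.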

When investigating the influence of the regularization parameter C in AdaBoost, we can graphically verify that it is a "weakness" indicator, in the sense that lower C-values correspond to weaker learners. While these low C-value learners have a low performance in the first rounds of boosting, the possible boost in performance is usually much higher compared to higher C-value inducers. Strong learners can already make a suitable distinction between minority and majority class behaviours in the first boosting round. The boosting process will emphasize the hard to learn instances, among which are noise/outliers; therefore, too strong learners are not suitable to be used in a boosting process, as already noted by Wickramaratna et al (2001). In Figures 2(b) and 3(b) we observe that the lowest C-values (C = 10^-7, 10^-5) can outperform higher C-values (C = 10^-3, 10^-1). In many cases the AB-process outperforms the baseline (BL), see also the results in Table 5: a combination of several weak learners can often achieve better generalization than a single strong learner.

Inspired by the success of the RUS method with βu = 1 (for reasons indicated in Section 4.4), the EE-technique tries to overcome an obvious downside of this approach: we are removing a lot of majority class instances that can be very informative. Hence, we try to make up for this loss by combining several of these balanced training sets, in the hope that the majority class is more adequately explored. This results in a very effective learner that seems to outperform the AB, AC and BL methods in the majority of cases, as can be seen from these figures. It is remarkable that we don't need to use that many subsets S to achieve high performances: increasing the S-value from 5 to 15 doesn't seem to cause major increases in AUC (see also Table 5). The influence of the C-parameter is similar as described in the previous paragraph.

[Figure: two panels (a) and (b) plotting test-set AUC against the boosting iteration number T]

Fig. 1 Mov G(p = 25) dataset results showing average tenfold AUC-performance on test data (with µ = 100%) for (a) AB, AC and EE with C chosen according to highest validation set AUC-performance (over all possible T); (b) AB and EE(S = 15) with varying C-levels


[Figure: two panels (a) and (b) plotting test-set AUC against the boosting iteration number T]

Fig. 2 Book(p = 25) dataset

[Figure: two panels (a) and (b) plotting test-set AUC against the boosting iteration number T]

Fig. 3 TaFeng(p = 25) dataset

[Figure: two panels (a) and (b) plotting test-set AUC against the boosting iteration number T]

Fig. 4 Bank(p = []) dataset


4.6 Final comparison

4.6.1 Performance-wise comparison

In this section we wish to determine which of the aforementioned methods shows superior performance in terms of AUC on test data. The experimental set-up is described in Section 4.2. Note that we exclude β = 0 and βu = 0 in determining the performance of the oversampling and undersampling techniques respectively, to be able to compare them with the baseline (BL) approach.23 The results for AB, AC and EE are shown for µ = 100%. The number of boosting iterations t ∈ [0, T] is also considered as a tunable parameter. Table 5 shows the final comparison between all methods. Additionally, we added an average rank column indicating the mean rank of each algorithm across all datasets. The rank of an algorithm for a specific dataset corresponds to its position in the sorted AUC-performance spectrum, where the best performing algorithm gets the rank of 1, the second best rank 2, etc. In case two or more algorithms perform equally well, average ranks are assigned (if two algorithms have ranks of 3 and 4 respectively and show equal performance, then both methods get assigned a rank of 3.5). Note that the LST dataset is excluded in the calculation of average ranks, since we consider this a dataset not affected by imbalance and showing equal performances.

As a side note, we noticed that increasing the number of minority training set instances can significantly improve performance. We can see this clearly by comparing the p = 25 outcomes with the results of p = 1 on the same dataset. Junqué de Fortuny et al (2014a) already showed empirically that larger behaviour datasets (in terms of the number of instances or features) lead to higher AUC-performance. This is confirmed in our experiments, though we want to add that it is only true when the data size is increased in a balanced fashion.

The results in Table 5 confirm our previous analysis. Each of the oversampling, undersampling (except Cl Knn) and boosting variants seems to improve upon the BL performance. Sobhani et al (2015) note that the direct application of ensemble techniques on imbalanced data does not circumvent the imbalanced learning issue; in our experiments on behaviour data, we do find that a simple boosting procedure applied to the imbalanced data improves upon the predictive performance. The oversampling techniques OSR, SMOTE and ADASYN show similar outcomes and are competitive methods, seeming to outperform the undersampling techniques and the plain boosting (AB) and cost-sensitive learning (AC) approaches. When dealing with traditional data, the performances of oversampling techniques are generally worse than those of undersampling approaches (Drummond and Holte 2003). It is interesting to observe that for these large and sparse behaviour datasets the opposite is true. This is heavily related to the intrinsic properties of the data under consideration: undersampling (or feature selection, for that matter) is not recommended because, as we have noted in Section 2.1, each feature provides a small additional contribution to the predictive performance. With respect to AC, the results indicate that R = R_L is a better choice compared to the more arbitrary cost ratios R = 2, 8. The EE-technique has the

23 The BL technique trains single SVMs on the imbalanced training data.


lowest average rank and is our best performing method: the combination of several balanced subsets with filtered noise/outlier contributions allows one to explore the majority class space efficiently. To strengthen our previous statements, we will in the next paragraphs conduct statistical significance tests in accordance with the procedure outlined in Demšar (2006). The latter paper presents a framework to perform statistical comparisons of classifiers over multiple datasets.

The first null-hypothesis we try to reject postulates that all algorithms perform equally well, or equivalently, that their average ranks R_j (see Table 5) are indifferent. Under the null-hypothesis, the Friedman statistic (Friedman 1937)

\chi_F^2 = \frac{12N}{k(k+1)}\left[\sum_{j=1}^{k} R_j^2 - \frac{k(k+1)^2}{4}\right] \qquad (6)

with N the number of datasets (N = 15) and k the number of algorithms (k = 13), can be used to compute the statistic (Iman and Davenport 1980)

F_F = \frac{(N-1)\,\chi_F^2}{N(k-1)-\chi_F^2} \qquad (7)

The latter is distributed according to the F-distribution with k − 1 and (k − 1)(N − 1) degrees of freedom. Evaluating this measure with the average ranks shown in Table 5 results in F_F = 22.98, which is a lot higher than the critical value of 1.81 at the α = 0.05 significance level. We can therefore reject the null-hypothesis and proceed with post-hoc tests.
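The computation can be verified from the average-rank column of Table 5; in this sketch the residual gap to the reported 22.98 comes from the published ranks being rounded to three decimals.

```python
# Average ranks from Table 5, in the table's method order
# (BL, OSR, SMOTE, ADASYN, RUS, Cl Knn, Far Knn, CBU, AB,
#  AC(R = 2, 8), AC(R = R_L), EE(S = 10), EE(S = 15))
ranks = [11.600, 5.000, 4.533, 4.800, 8.167, 12.467, 8.133,
         8.567, 8.267, 8.467, 5.400, 3.267, 2.333]
N, k = 15, 13   # datasets and algorithms

chi2_F = 12 * N / (k * (k + 1)) * (sum(r * r for r in ranks) - k * (k + 1) ** 2 / 4)
F_F = (N - 1) * chi2_F / (N * (k - 1) - chi2_F)
print(round(F_F, 2))   # ~22.99, essentially the 22.98 reported in the text
```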

The Nemenyi test (Nemenyi 1963) allows one to compare each classifier to one another and adjusts the critical value to compensate for making k(k − 1)/2 comparisons.24 "The performance of two classifiers is significantly different if the corresponding average ranks differ by at least the critical difference (CD)" (Demšar 2006). We refer to the aforementioned paper for the calculation of the CD. The result of these comparisons at the α = 0.05 significance level is shown in Table 6. This test is already quite revealing, in the sense that, for instance, the oversampling methods (OSR, SMOTE and ADASYN), AC (R = R_L) and EE-techniques are significantly different from the BL. Yet it does not indicate that undersampling or applying plain boosting (AB) results in significantly different performances compared to the BL. The Nemenyi test is still conservative, and we therefore compare all classifiers to a single control classifier in the next paragraphs. Note that we only make k − 1 comparisons in that situation, resulting in stronger results compared to the Nemenyi test.
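A sketch of the CD check follows; the critical value q0.05 for k = 13 is an approximate, assumed constant (see Demšar (2006) for exact tables), and the ranks are a subset of Table 5:

```python
import math

ranks = {"BL": 11.600, "SMOTE": 4.533, "RUS": 8.167, "EE(S=15)": 2.333}
N, k = 15, 13
q_005 = 3.313   # Studentized-range-based critical value for k = 13 (approximate, assumed)
CD = q_005 * math.sqrt(k * (k + 1) / (6 * N))   # ~4.71 ranks

assert abs(ranks["BL"] - ranks["SMOTE"]) > CD   # significantly different (Table 6: 1)
assert abs(ranks["BL"] - ranks["RUS"]) < CD     # not distinguishable (Table 6: 0)
```

Both outcomes match the corresponding entries of Table 6.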

When comparing a single classifier i with a control classifier c, the null-hypothesis states that the two classifiers i and c perform equally well. Under this hypothesis, the test statistic (based on the average ranks R_i shown in Table 5)

z = \frac{R_i - R_c}{\sqrt{k(k+1)/(6N)}} \qquad (8)

is distributed according to a standard normal distribution. This z-value is used to find the corresponding probability (p-value) p from the cumulative standard normal

24 The larger the number of comparisons, the higher the proportion of null-hypotheses that are wrongfully rejected due to random chance.


Table 5 Final comparison of all methods in terms of AUC test set performance; see Section 4.2 for details on the experimental set-up. β ≠ 0 and βu ≠ 0 for the oversampling and undersampling techniques respectively; µ = 100% for each of the boosting variants. Standard deviations are given between brackets and the performance difference with the BL is shown in square brackets.

              Mov G(p = 1)         | Mov G(p = 25)        | Mov Th(p = [])       | Yahoo A(p = 1)
BL            71.6 (2.62) [0]      | 81.41 (1.32) [0]     | 79.77 (5.33) [0]     | 55.92 (2.97) [0]
OSR           75.35 (2.27) [3.8]   | 83.76 (2.09) [2.3]   | 85.13 (6.1) [5.4]    | 60.05 (2.71) [4.1]
SMOTE         76.16 (2.27) [4.6]   | 83.7 (2.1) [2.3]     | 85.67 (4.98) [5.9]   | 60.1 (3.0) [4.2]
ADASYN        76.07 (2.26) [4.5]   | 83.63 (2.04) [2.2]   | 85.65 (5.6) [5.9]    | 59.9 (2.99) [4.0]
RUS           72.88 (2.73) [1.3]   | 81.52 (2.15) [0.1]   | 82.91 (7.19) [3.1]   | 57.04 (1.77) [1.1]
Cl Knn        71.43 (1.36) [-0.2]  | 80.88 (1.19) [-0.5]  | 78.87 (4.71) [-0.9]  | 55.78 (2.71) [-0.1]
Far Knn       71.9 (2.95) [0.3]    | 80.9 (1.48) [-0.5]   | 84.07 (4.64) [4.3]   | 57.2 (1.33) [1.3]
CBU           74.17 (2.36) [2.6]   | 81.51 (1.04) [0.1]   | 82.76 (7.22) [3.0]   | 58.77 (3.43) [2.8]
AB            71.65 (1.73) [0.1]   | 84.52 (1.89) [3.1]   | 82.43 (5.18) [2.7]   | 58.35 (2.62) [2.4]
AC(R = 2, 8)  71.61 (2.46) [0]     | 83.46 (1.82) [2.0]   | 83.27 (5.6) [3.5]    | 57.72 (2.47) [1.8]
AC(R = R_L)   74.65 (2.7) [3.1]    | 83.35 (2.09) [1.9]   | 85.41 (4.49) [5.6]   | 59.47 (2.33) [3.5]
EE(S = 10)    76.04 (2.66) [4.4]   | 85.05 (1.85) [3.6]   | 86.1 (5.78) [6.3]    | 59.66 (3.13) [3.7]
EE(S = 15)    76.12 (2.88) [4.5]   | 85.14 (1.86) [3.7]   | 86.42 (5.86) [6.7]   | 59.76 (2.93) [3.8]

              Yahoo A(p = 25)      | Yahoo G(p = 1)       | Yahoo G(p = 25)      | TaFeng(p = 1)
BL            61.68 (2.42) [0]     | 66.84 (3.66) [0]     | 78.82 (1.39) [0]     | 55.75 (1.6) [0]
OSR           64.59 (3.12) [2.9]   | 73.08 (2.96) [6.2]   | 78.52 (2.01) [-0.3]  | 61.21 (2.24) [5.5]
SMOTE         65.56 (3.33) [3.9]   | 73.11 (3.12) [6.3]   | 79.01 (1.21) [0.2]   | 61.72 (1.81) [6.0]
ADASYN        65.13 (3.38) [3.4]   | 73.22 (3.17) [6.4]   | 79.74 (1.68) [0.9]   | 61.68 (1.86) [5.9]
RUS           64.11 (2.8) [2.4]    | 70.65 (3.39) [3.8]   | 78.91 (1.55) [0.1]   | 59.25 (2.18) [3.5]
Cl Knn        61.14 (2.13) [-0.5]  | 66.34 (3.54) [-0.5]  | 77.26 (1.46) [-1.6]  | 55.77 (1.28) [0]
Far Knn       63.96 (3.03) [2.3]   | 66.97 (3.54) [0.1]   | 78.26 (2.2) [-0.6]   | 59.98 (1.26) [4.2]
CBU           62.27 (1.79) [0.6]   | 71.27 (2.89) [4.4]   | 75.22 (2.42) [-3.6]  | 58.4 (1.57) [2.6]
AB            63.88 (2.67) [2.2]   | 68.9 (2.03) [2.1]    | 79.01 (1.66) [0.2]   | 56.21 (1.79) [0.5]
AC(R = 2, 8)  64.32 (3.56) [2.6]   | 68.89 (3.11) [2.0]   | 78.99 (1.89) [0.2]   | 56.33 (1.83) [0.6]
AC(R = R_L)   64.31 (3.03) [2.6]   | 73.13 (2.8) [6.3]    | 78.41 (2.0) [-0.4]   | 61.6 (2.26) [5.9]
EE(S = 10)    66.51 (3.24) [4.8]   | 72.61 (3.15) [5.8]   | 80.52 (1.6) [1.7]    | 61.2 (1.82) [5.4]
EE(S = 15)    66.36 (3.18) [4.7]   | 73.48 (2.32) [6.6]   | 80.54 (1.56) [1.7]   | 61.13 (1.83) [5.4]

              TaFeng(p = 25)       | Book(p = 1)          | Book(p = 25)         | LST(p = 1)
BL            66.94 (1.34) [0]     | 52.6 (1.29) [0]      | 60.08 (0.71) [0]     | 99.99 (0.01) [0]
OSR           68.77 (1.23) [1.8]   | 55.87 (1.42) [3.3]   | 64.62 (0.57) [4.5]   | 99.99 (0.01) [0]
SMOTE         68.47 (1.5) [1.5]    | 55.07 (0.88) [2.5]   | 62.96 (0.82) [2.9]   | 99.99 (0.01) [0]
ADASYN        68.48 (1.47) [1.5]   | 55.04 (0.91) [2.4]   | 63.02 (0.57) [2.9]   | 99.99 (0.01) [0]
RUS           68.28 (1.39) [1.3]   | 54.26 (0.92) [1.7]   | 63.28 (0.8) [3.2]    | 99.98 (0.01) [0]
Cl Knn        66.13 (1.43) [-0.8]  | 52.69 (1.3) [0.1]    | 60.02 (0.79) [-0.1]  | 99.99 (0.01) [0]
Far Knn       68.06 (1.41) [1.1]   | 56.25 (1.52) [3.7]   | 64.15 (1.12) [4.1]   | 99.98 (0.01) [0]
CBU           63.84 (1.07) [-3.1]  | 53.75 (1.01) [1.2]   | 54.68 (0.88) [-5.4]  | []
AB            67.65 (1.55) [0.7]   | 54.27 (1.95) [1.7]   | 65.0 (0.67) [4.9]    | 99.99 (0.01) [0]
AC(R = 2, 8)  69.31 (1.23) [2.4]   | 53.72 (1.0) [1.1]    | 61.24 (0.8) [1.2]    | 99.98 (0.01) [0]
AC(R = R_L)   67.15 (1.51) [0.2]   | 55.73 (1.22) [3.1]   | 64.6 (0.64) [4.5]    | 99.99 (0.01) [0]
EE(S = 10)    70.3 (1.35) [3.4]    | 55.09 (1.29) [2.5]   | 65.37 (0.61) [5.3]   | 99.98 (0.01) [0]
EE(S = 15)    70.4 (1.3) [3.5]     | 55.35 (1.26) [2.8]   | 65.4 (0.51) [5.3]    | 99.98 (0.01) [0]


Table 5 Continued. Additionally, an average rank column is added showing the mean rank of each algorithm across all datasets. Note that we excluded the LST dataset for this purpose.

              Adver(p = [])        | Adver(p = 1)         | CRF(p = [])          | Bank(p = [])
BL            96.61 (1.82) [0]     | 90.93 (3.02) [0]     | 64.06 (16.43) [0]    | 66.82 (0.88) [0]
OSR           96.93 (1.91) [0.3]   | 93.3 (2.02) [2.4]    | 80.74 (12.93) [16.7] | 71.39 (0.79) [4.6]
SMOTE         97.05 (1.66) [0.4]   | 93.35 (2.01) [2.4]   | 78.7 (16.56) [14.6]  | []
ADASYN        96.91 (1.95) [0.3]   | 93.46 (2.21) [2.5]   | 78.87 (16.71) [14.8] | []
RUS           96.81 (1.87) [0.2]   | 92.38 (2.51) [1.5]   | 83.98 (5.99) [19.9]  | 69.41 (1.19) [2.6]
Cl Knn        96.4 (1.48) [-0.2]   | 89.73 (3.42) [-1.2]  | 76.63 (16.19) [12.6] | 66.17 (0.72) [-0.6]
Far Knn       95.77 (1.81) [-0.8]  | 93.88 (1.78) [3.0]   | 83.75 (13.11) [19.7] | 66.95 (0.56) [0.1]
CBU           97.15 (1.88) [0.5]   | 94.18 (2.3) [3.3]    | []                   | []
AB            97.34 (2.18) [0.7]   | 91.39 (3.23) [0.5]   | 77.62 (15.15) [13.6] | 66.82 (0.88) [0]
AC(R = 2, 8)  97.44 (1.93) [0.8]   | 91.0 (3.35) [0.1]    | 68.31 (14.93) [4.2]  | 67.67 (0.71) [0.9]
AC(R = R_L)   97.46 (1.71) [0.8]   | 93.51 (2.17) [2.6]   | 85.08 (9.77) [21.0]  | 70.7 (0.8) [3.9]
EE(S = 10)    97.64 (1.35) [1.0]   | 92.97 (2.75) [2.0]   | 86.18 (10.17) [22.1] | 71.46 (0.81) [4.6]
EE(S = 15)    97.63 (1.35) [1.0]   | 93.3 (2.14) [2.4]    | 86.35 (9.99) [22.3]  | 71.54 (0.76) [4.7]

              Average Rank
BL            11.600
OSR            5.000
SMOTE          4.533
ADASYN         4.800
RUS            8.167
Cl Knn        12.467
Far Knn        8.133
CBU            8.567
AB             8.267
AC(R = 2, 8)   8.467
AC(R = R_L)    5.400
EE(S = 10)     3.267
EE(S = 15)     2.333

Table 6 Nemenyi test outcomes at the α = 0.05 significance level. A value of 1 indicates the null-hypothesis is rejected and thus the two algorithms are found to be significantly different (a value of 0 means we accept the null-hypothesis). Methods are listed in the same order as in Table 5 and are represented more concisely (AC1 = AC(R = 2, 8), AC2 = AC(R = R_L), EE1 = EE(S = 10), EE2 = EE(S = 15)).

      BL  RO  SM  AD  RU  Cl  Fa  CBU  AB  AC1  AC2  EE1  EE2
BL     0   1   1   1   0   0   0   0    0   0    1    1    1
RO     1   0   0   0   0   1   0   0    0   0    0    0    0
SM     1   0   0   0   0   1   0   0    0   0    0    0    0
AD     1   0   0   0   0   1   0   0    0   0    0    0    0
RU     0   0   0   0   0   0   0   0    0   0    0    1    1
Cl     0   1   1   1   0   0   0   0    0   0    1    1    1
Fa     0   0   0   0   0   0   0   0    0   0    0    1    1
CBU    0   0   0   0   0   0   0   0    0   0    0    1    1
AB     0   0   0   0   0   0   0   0    0   0    0    1    1
AC1    0   0   0   0   0   0   0   0    0   0    0    1    1
AC2    1   0   0   0   0   1   0   0    0   0    0    0    0
EE1    1   0   0   0   1   1   1   1    1   1    0    0    0
EE2    1   0   0   0   1   1   1   1    1   1    0    0    0


distribution function: p = 2 × min(Φ(z), 1 − Φ(z)). Holm's method (Holm 1979) compares k − 1 classifiers to a single control classifier and sorts the corresponding p-values in ascending order, so that p1 ≤ p2 ≤ ... ≤ pk−1. Each pi is subsequently compared to its associated confidence level25 αcomp = α/(k − i). Holm starts with performing the check p1 < α/(k − 1) and rejects the null-hypothesis if the test passes. It then proceeds with a similar check for i = 2, and continues until a certain null-hypothesis cannot be rejected. The remaining hypotheses are retained.
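The whole step-down procedure is a few lines; this sketch uses the z-values of Table 7 (BL as control) and only the standard library:

```python
import math

def p_two_sided(z):
    """p = 2 * min(Phi(z), 1 - Phi(z)), with Phi the standard normal CDF."""
    phi = 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))
    return 2.0 * min(phi, 1.0 - phi)

def holm(p_values, alpha=0.05, k=13):
    """Step-down Holm procedure: walk the p-values in ascending order and
    reject while p_i < alpha / (k - i); stop at the first failure."""
    rejected = []
    for i, (name, p) in enumerate(sorted(p_values.items(), key=lambda kv: kv[1]), 1):
        if p < alpha / (k - i):
            rejected.append(name)
        else:
            break
    return rejected

# z-statistics of each method vs. the BL control (Table 7)
z_bl = {"EE(S=15)": -6.51642, "EE(S=10)": -5.86009, "SMOTE": -4.96936,
        "ADASYN": -4.78183, "OSR": -4.64119, "AC(R=R_L)": -4.35991,
        "Far Knn": -2.4378, "RUS": -2.41436, "AB": -2.34404,
        "AC(R=2,8)": -2.20339, "CBU": -2.13307, "Cl Knn": 0.609449}
p_bl = {m: p_two_sided(z) for m, z in z_bl.items()}
print(holm(p_bl))   # the six methods listed above 'Far Knn' in Table 7 are rejected
```

The procedure stops at Far Knn (p = 0.014777 > 0.05/6), reproducing the significance column of Table 7.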

Table 7 shows the result of the Holm test at the α = 0.05 significance level with the BL algorithm as control classifier. The table lists each method according to sorted p-values and shows the z-, p- and αcomp-values. The significance column indicates whether the proposed method is significantly different from the BL and coincidentally matches the result of the Nemenyi test. The p-values of the (Far Knn, RUS, CBU) undersampling methods, AB and AC (R = 2, 8) are only slightly higher than the level required to reject the null-hypothesis. We included a critical significance level αcrit, corresponding to the lowest possible significance level at which the method would be considered significantly different from the BL (if α = αcrit then p = αcomp). In the case of AB, for example, if we were to choose a significance level α = 0.0764 (a confidence level of 92.36%), the associated p-value would be smaller than αcomp and we would proceed to conclude26 that AB performs significantly different from the BL. To summarize: the null-hypothesis of equal performance between the BL and an alternative method proposed in Section 3 (except Cl Knn) is rejected with a confidence level of more than 91% according to Holm's test. Hence the performance of the BL approach is significantly worse than that of each of the proposed over- and undersampling methods (except Cl Knn), cost-sensitive learning techniques (AC) and boosting variants (AB, EE).

Similar to the previous paragraph, Table 8 presents the results of Holm's test at the α = 0.05 significance level with EE(S = 15) as control classifier. The results coincidentally correspond with the result of the Nemenyi test and indicate that the oversampling (OSR, SMOTE and ADASYN), AC(R = R_L) and EE(S = 10) methods are not statistically different from the EE(S = 15) technique at the 0.05 significance level. However, if we were to adapt the significance level to approximately α = 0.25 (75% confidence level), the results would conclude significance (except for EE(S = 10)). To summarize: the null-hypothesis of equal performance between EE(S = 15) and an alternative method proposed in Section 3 (except EE(S = 10)) is rejected with a confidence level of more than 75% according to Holm's test. Hence the performance of the oversampling, undersampling, AB and AC approaches is significantly worse than that of EE(S = 15). We can therefore safely conclude that EE is the preferred method in terms of AUC performance when dealing with imbalanced behaviour data.

25 αcomp adjusts the value of α to compensate for multiple comparisons.

26 This is not entirely true, since Holm's method requires Far Knn to pass the test first before proceeding, which only occurs at the 0.088662 significance level.


Table 7 Holm test at the α = 0.05 significance level with BL as reference. The table shows the z test statistic with associated p-value and αcomp = α/(k − i), with k the number of algorithms (k = 13) and i the position of the method in the sorted p-value vector. Since we already ranked the p-values in ascending order, i takes on the row number (e.g. OSR has i = 5). The column "significant" denotes whether we can reject the null-hypothesis (significant ⟺ p < αcomp). αcrit corresponds with the smallest possible significance level at which we would decide to reject the null-hypothesis (αcrit = α·p/αcomp).

              z          p         αcomp     significant  αcrit
EE(S = 15)    -6.51642   7.2E-11   0.004167  1            8.64E-10
EE(S = 10)    -5.86009   4.63E-09  0.004545  1            5.09E-08
SMOTE         -4.96936   6.72E-07  0.005     1            6.72E-06
ADASYN        -4.78183   1.74E-06  0.005556  1            1.56E-05
OSR           -4.64119   3.46E-06  0.00625   1            2.77E-05
AC(R = R_L)   -4.35991   1.3E-05   0.007143  1            9.11E-05
Far Knn       -2.4378    0.014777  0.008333  0            0.088662
RUS           -2.41436   0.015763  0.01      0            0.078815
AB            -2.34404   0.019076  0.0125    0            0.076305
AC(R = 2, 8)  -2.20339   0.027567  0.016667  0            0.082701
CBU           -2.13307   0.032919  0.025     0            0.065837
Cl Knn        0.609449   0.542227  0.05      0            0.542227

Table 8 Holm test at the α = 0.05 significance level with EE(S = 15) as reference.

              z          p         αcomp     significant  αcrit
Cl Knn        7.12587    1.03E-12  0.004167  1            1.24E-11
BL            6.516421   7.2E-11   0.004545  1            7.92E-10
CBU           4.383348   1.17E-05  0.005     1            0.000117
AC(R = 2, 8)  4.313027   1.61E-05  0.005556  1            0.000145
AB            4.172384   3.01E-05  0.00625   1            0.000241
RUS           4.102063   4.09E-05  0.007143  1            0.000287
Far Knn       4.078623   4.53E-05  0.008333  1            0.000272
AC(R = R_L)   2.156513   0.031044  0.01      0            0.155218
OSR           1.875229   0.060761  0.0125    0            0.243045
ADASYN        1.734587   0.082814  0.016667  0            0.248442
SMOTE         1.547064   0.121848  0.025     0            0.243696
EE(S = 10)    0.65633    0.511612  0.05      0            0.511612

4.6.2 Timing-wise comparison

The performance of each of the proposed algorithms in terms of training times depends highly upon the chosen parameters. The β and βu parameters control the number of training instances and are an important contribution to the subsequent SVM learning times for each of the over- and undersampling techniques. The number of boosting iterations t ∈ [0, T] is also a major factor in each of the boosting variants proposed in Section 3.3. Furthermore, the chosen regularization parameter C highly influences the training times of the SVM inducer: higher C-values result in larger training times (Fan et al 2008). Not only do the chosen hyperparameters affect learning times; data characteristics such as the number of minority instances, the imbalance level, the number of features, the amount of overlapping, etc. have a major effect as well.


In comparing each of the methods outlined in Section 3, we make use of a similar methodology as previously presented (see Section 4.6.1). For each of the ten folds, a suitable parameter combination is selected based on optimal validation set AUC-performance. These chosen parameters are applied in determining the timings of each method. The final results indicate the average timings across ten folds. Hence, we show the required amount of time for each method when using the parameter combination resulting in the best performance; note that this might, for instance, result in using β = 1/3 in case of OSR, or using only t = 5 boosting iterations in EE. Computational timings are shown in Table 9.

As one can observe from the aforementioned table, CBU is the slowest method and relies on computationally intensive clustering procedures. The synthetic SMOTE and ADASYN oversampling approaches are very time-consuming methods: they both rely on computationally expensive nearest neighbour computations, in conjunction with a synthetic sample generation procedure that requires a single pass over all minority training instances.
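For reference, the two costly steps mentioned above — the nearest-neighbour search and the interpolation pass that generates synthetic minority samples — can be sketched as in classic SMOTE below. This is a minimal dense-data sketch only; the adapted versions evaluated in this paper change the similarity computations and sample generation to cope with sparse binary behaviour data.

```python
import numpy as np

def smote(X_min, n_synthetic, k=5, rng=None):
    """Classic SMOTE sketch: interpolate between each minority
    instance and one of its k nearest minority neighbours."""
    rng = np.random.default_rng(rng)
    n = len(X_min)
    k = min(k, n - 1)
    # pairwise distances among minority instances (the costly step)
    d = np.linalg.norm(X_min[:, None, :] - X_min[None, :, :], axis=2)
    np.fill_diagonal(d, np.inf)
    nn = np.argsort(d, axis=1)[:, :k]          # k nearest neighbours
    synth = np.empty((n_synthetic, X_min.shape[1]))
    for s in range(n_synthetic):
        i = rng.integers(n)                    # random minority seed
        j = nn[i, rng.integers(k)]             # random neighbour of the seed
        gap = rng.random()                     # interpolation factor in [0, 1)
        synth[s] = X_min[i] + gap * (X_min[j] - X_min[i])
    return synth
```

The quadratic distance matrix is what makes the synthetic approaches expensive relative to plain random oversampling.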

The RUS method shows, as expected, superior results in terms of computational timings. Other undersampling methods perform quite well, though they are slower than BL and OSR due to their inherent nearest neighbour computations.
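Random undersampling itself amounts to a single sampling step, which explains its speed. A sketch follows, assuming βu denotes the fraction of majority instances that is discarded (the paper's exact parameterisation may differ):

```python
import numpy as np

def random_undersample(X, y, beta_u=0.5, rng=None):
    """Random undersampling (RUS): keep all minority instances
    (y == 1) and a random subset of the majority instances."""
    rng = np.random.default_rng(rng)
    maj = np.flatnonzero(y == 0)
    keep = rng.choice(maj, size=int(round(len(maj) * (1 - beta_u))),
                      replace=False)
    idx = np.concatenate([np.flatnonzero(y == 1), keep])
    rng.shuffle(idx)
    return X[idx], y[idx]
```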

The boosting variants seem to perform worse than BL, OSR and each of the undersampling methods. Since EE was the best method in terms of AUC-performance (see Section 4.6.1), we focus on this learner in this paragraph. For the medium sized datasets (all but CRF and Bank), OSR outperforms EE(S = 15), yet for the large datasets (CRF and Bank) the opposite is true. A SVM is known to have a quadratic computational complexity in the number of training instances (Suykens et al 2002; García and Lozano 2007). For this reason, we consider OSR (or any other oversampling technique) as inappropriate for large datasets. Since with EE each subset has a size limited to twice the minority class training size, we expect increasing benefits of the EE-technique for larger and more imbalanced behaviour data. As Junqué de Fortuny et al (2014a) have observed, larger behaviour data in terms of the number of instances or features also contributes to an increase in predictive power. Hence, we expect to see this type of big behaviour data more often in real-life applications, where computational resources can become a major constraint. Because the EE-technique is easily parallelizable by simultaneously performing the boosting process for each of the independent subsets, we added an extra row in Table 9: EE par corresponds with the time needed for EE(S = 15) divided by 15, indicating the time that would be required for EE if each subset were trained in parallel. We can see that EE par can outperform OSR in 8 out of 16 cases, precisely for the datasets that show a high imbalance level (e.g. p = 1 or p = []).
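The parallelization argument can be illustrated with a small sketch in which the S balanced subsets are trained concurrently. The centroid-based stand-in learner and the `easy_ensemble_parallel` helper below are illustrative assumptions only; the paper boosts an SVM/LR combination on each subset.

```python
import numpy as np
from concurrent.futures import ThreadPoolExecutor

def balanced_subset(X, y, rng):
    """One EE subset: all minority instances plus an equally sized
    random draw from the majority (subset size = 2 * n_minority)."""
    mino, majo = np.flatnonzero(y == 1), np.flatnonzero(y == 0)
    pick = rng.choice(majo, size=len(mino), replace=False)
    idx = np.concatenate([mino, pick])
    return X[idx], y[idx]

def fit_centroid(Xs, ys):
    """Stand-in weak learner: class-mean difference direction."""
    return Xs[ys == 1].mean(0) - Xs[ys == 0].mean(0)

def easy_ensemble_parallel(X, y, S=15, seed=0):
    """Train the S independent subsets concurrently; the ensemble
    score is the average of the per-subset decision values."""
    rngs = [np.random.default_rng(seed + s) for s in range(S)]
    with ThreadPoolExecutor() as ex:
        ws = list(ex.map(lambda r: fit_centroid(*balanced_subset(X, y, r)),
                         rngs))
    return lambda Xt: Xt @ np.mean(ws, axis=0)
```

Because each subset only holds twice the minority class size, the per-worker cost stays small even when the majority class is huge.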

We conclude our comparison of the proposed methods in Figure 5, which shows the average ranks in terms of computational timings (see Table 9) and AUC-performance for each of the available data sources. Regarding the ranks with respect to the latter, we added an extra row EE par, having the same AUC-performance as EE(S = 15) for each dataset in Table 5. Ranks are subsequently determined according to the procedure outlined in Section 4.6.1. Additionally, Figure 18 in Appendix D shows the result for the large and highly imbalanced CRF and Bank datasets. We expect the latter type of data to be of major interest for reasons already indicated.

36 Jellis Vanhoeyveld David Martens

Table 9 Computational timings (in seconds) averaged across ten folds, using the parameter combination resulting in the highest validation set AUC-performance. First and second best performances are emphasized in boldface (third and fourth best are underlined). EE par is the time required for EE(S = 15) divided by 15.

              Mov G(p = 1)   Mov G(p = 25)  Mov Th(p = [])  Yahoo A(p = 1)
BL            0.032889       0.056697       0.558563        0.026922
OSR           0.055043       0.062802       0.99009         0.044421
SMOTE         0.218821       0.937057       3.841482        0.057726
ADASYN        0.284688       1.802399       5.191265        0.087694
RUS           0.011431       0.025383       0.155224        0.007991
CL Knn        0.046599       0.599846       0.989914        0.037182
Far Knn       0.039887       0.80072        0.683023        0.027788
CBU           10.34111       10.60173       68.22839        16.92477
AB            0.169792       0.841443       3.460246        0.139251
AC(R = 28)    0.471994       2.996585       10.86907        0.366555
AC(R = RL)    0.53376        1.179542       6.065177        0.209015
EE(S = 10)    0.117226       6.065145       1.17995         0.148973
EE(S = 15)    0.20474        7.173737       2.119991        0.180365
EE par        0.013649       0.478249       0.141333        0.012024

              Yahoo A(p = 25)  Yahoo G(p = 1)  Yahoo G(p = 25)  TaFeng(p = 1)
BL            0.092954         0.011915        0.044164         0.026728
OSR           0.027887         0.013241        0.047206         0.040919
SMOTE         1.062686         0.056153        0.883698         0.219553
ADASYN        2.050993         0.079073        1.733367         0.306618
RUS           0.048471         0.003234        0.033423         0.002916
CL Knn        0.84391          0.025404        0.502515         0.092167
Far Knn       0.664124         0.026576        0.500206         0.080159
CBU           15.69442         12.87221        13.55035         24.67279
AB            0.445546         0.078777        0.169977         0.114619
AC(R = 28)    1.034044         0.321723        0.515953         0.926178
AC(R = RL)    0.706215         0.226741        0.112949         0.610233
EE(S = 10)    1.026577         0.100331        1.527146         0.058052
EE(S = 15)    1.607596         0.077483        2.472582         0.10538
EE par        0.107173         0.005166        0.164839         0.007025

              TaFeng(p = 25)  Book(p = 1)  Book(p = 25)  LST(p = 1)
BL            0.032033        0.080035     0.318093      0.652045
OSR           0.032414        0.132927     0.092757      0.87152
SMOTE         5.089283        34.09418     11.43444      4.987705
ADASYN        8.148419        36.89661     12.25441      6.840083
RUS           0.020457        0.022713     0.031972      0.432839
CL Knn        1.713731        0.400873     3.711648      2.508374
Far Knn       1.539437        0.379086     3.988552      2.511037
CBU           26.42686        41.98663     46.31987      []
AB            0.713265        0.61719      1.238585      2.466151
AC(R = 28)    1.234647        1.666131     2.330635      1.451671
AC(R = RL)    0.279047        0.860346     0.197053      1.23763
EE(S = 10)    2.484502        2.145747     7.177484      0.524066
EE(S = 15)    3.363971        2.480066     11.21945      0.784111
EE par        0.224265        0.165338     0.747963      0.052274


Table 9 Continued. Additionally, an extra column indicating average ranks is included (lower ranks are preferred).

              Adver(p = [])  Adver(p = 1)  CRF(p = [])  Bank(p = [])
BL            0.010953       0.002796      0.725911     70.89334
OSR           0.012178       0.006166      3.685813     179.7481
SMOTE         0.123112       0.017764      5.633862     []
ADASYN        0.183767       0.021728      5.768669     []
RUS           0.012115       0.00204       0.147392     52.47441
CL Knn        0.061324       0.005568      1.106755     73.73282
Far Knn       0.079078       0.007069      1.110379     97.59619
CBU           33.78235       32.36754      []           []
AB            0.069199       0.103518      1.153196     83.08618
AC(R = 28)    0.193092       0.068905      2.047434     71.70548
AC(R = RL)    0.107652       0.037963      1.387174     106.3466
EE(S = 10)    0.138485       0.085686      0.198656     24.95117
EE(S = 15)    0.185136       0.139121      0.285345     36.40107
EE par        0.012342       0.009275      0.019023     2.426738

              Average Rank [pos]
BL            2.94 [2]
OSR           4.19 [4]
SMOTE         9.59 [11]
ADASYN        10.91 [13]
RUS           1.38 [1]
CL Knn        6.5 [5]
Far Knn       6.56 [6]
CBU           14 [14]
AB            8.06 [7]
AC(R = 28)    10.81 [12]
AC(R = RL)    9.25 [9]
EE(S = 10)    8.25 [8]
EE(S = 15)    9.56 [10]
EE par        3 [3]

Timings are a substantial concern with respect to these large datasets. As can be observed, the EE-technique thrives for these big and highly imbalanced data, even in its non-parallel form.


Fig 5 Average rank AUC versus average rank Time (see Table 9) across all datasets from Table 2 (excluding Flickr and Kdd). Regarding the AUC rank, an extra row EE par is added, having the same AUC-value as EE(S = 15). Ranks for each dataset are subsequently obtained via the procedure outlined in Section 4.6.1. Points occurring in the upper-right region are preferred.

5 The effect of the chosen base learner

In this short section we investigate the influence of the chosen base learner. More specifically, we ask ourselves whether some base inducers can more easily cope with the imbalanced learning issue, and in doing so we verify the conclusion that EE is a suitable technique to elevate upon the baseline performance. We rely on several popular classification methods for behaviour data: two regularization-based techniques (SVM, LR) and two heuristic types of approaches (Naive Bayes (NB) and Behavioural Similarity (BeSim)). The next paragraphs briefly introduce these methods. Note that source codes and dataset results are provided in our on-line repository.

LR constructs a linear model that maximizes the likelihood of the observed data by imposing a logistic model for the parametric probability model P(Y|X). Ng and Jordan (2002) provide a thorough introduction to the topic27 and note that in its plain form LR can suffer from overfitting, especially when the input data is very high-dimensional. Therefore, we resort to regularized logistic regression. We opt for L2-regularized LR, since L1-regularization corresponds with a natural way of imposing feature selection (Ng 2004). The LIBLINEAR toolbox (Fan et al 2008) is used to obtain these L2-regularized LR models.
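As an illustration, the primal objective that LIBLINEAR minimizes for L2-regularized LR, (1/2)wᵀw + C Σᵢ log(1 + exp(−yᵢ wᵀxᵢ)) with y ∈ {−1, +1}, can be optimized with plain gradient descent. This is a didactic sketch under that objective only; LIBLINEAR itself uses far more efficient solvers.

```python
import numpy as np

def fit_l2_logreg(X, y, C=1.0, lr=0.1, iters=500):
    """Gradient descent on the L2-regularized logistic loss:
    f(w) = 0.5 * w.w + C * sum_i log(1 + exp(-y_i * w.x_i)).
    Labels y are in {-1, +1}."""
    w = np.zeros(X.shape[1])
    for _ in range(iters):
        m = y * (X @ w)                                # margins
        g = w - C * X.T @ (y / (1 + np.exp(m)))        # gradient of f
        w -= lr * g / len(y)                           # scaled step
    return w
```

Larger C weighs the data-fit term more heavily, which is exactly why higher C-values yield "stronger" (less regularized) learners in the boosting discussion above.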

27 For an accessible introduction, see the chapter on "Generative and Discriminative Classifiers: Naive Bayes and Logistic Regression" provided at the following link: http://www.cs.cmu.edu/~tom/NewChapters.html


NB (Ng and Jordan 2002) relies on the use of the Bayes rule, which expresses the posterior P(Y|X) as the product of the likelihood P(X|Y) and the prior P(Y), divided by the evidence P(X). The naivety of the approach originates from assuming conditional independence of the features. We impose a multivariate event model regarding the distribution P(X|Y). Junqué de Fortuny et al (2014a) provide an efficient NB implementation for dealing with large and sparse binary behaviour data.
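A minimal dense sketch of the multivariate (Bernoulli) event model with Laplace smoothing follows; the implementation referred to above instead exploits sparseness by touching only the non-zero entries of the behaviour matrix.

```python
import numpy as np

def bernoulli_nb_fit(X, y, alpha=1.0):
    """Estimate, per class c, the prior log P(c) and the smoothed
    feature activation probabilities P(x_j = 1 | c)."""
    params = {}
    for c in np.unique(y):
        Xc = X[y == c]
        theta = (Xc.sum(0) + alpha) / (len(Xc) + 2 * alpha)
        params[c] = (np.log(len(Xc) / len(y)),   # log prior
                     np.log(theta),              # log P(x_j = 1 | c)
                     np.log1p(-theta))           # log P(x_j = 0 | c)
    return params

def bernoulli_nb_scores(params, X):
    """Log-posterior per class, up to the shared evidence term P(X)."""
    return {c: prior + X @ lt + (1 - X) @ lnt
            for c, (prior, lt, lnt) in params.items()}
```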

The BeSim approach conceptually corresponds to projecting the bigraph to a unigraph using the procedure outlined in Section 3.2. Next, the weighted-vote relational neighbour classifier (Macskassy and Provost 2007) is used to infer unknown labels/scores through a weighted probability estimation using the known labels of neighbouring nodes. A very fast and scalable implementation of this approach (the SW-transformation) is presented in detail in Stankova et al (2015).
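The following sketch conveys the idea: with A the binary users-by-items behaviour matrix, computing A(Aᵀy) yields the weighted votes of labelled neighbours in the projected unigraph (edge weight = number of shared items) without ever materializing the projection. This is only in the spirit of the SW-transformation, not the exact formulation of Stankova et al (2015).

```python
import numpy as np

def besim_scores(A, y, labeled):
    """Weighted-vote relational neighbour on the implicit projection
    of the bigraph A (n_users x n_items, binary). Returns, per user,
    the weighted fraction of labelled neighbours that are positive;
    0.5 when a user has no labelled neighbours."""
    yk = np.where(labeled, y, 0.0)          # known positive labels
    mk = labeled.astype(float)              # known-label mask
    votes = A @ (A.T @ yk)                  # weighted positive votes
    total = A @ (A.T @ mk)                  # total weighted labelled links
    deg = (A * A).sum(1)                    # self-link weights to subtract
    votes -= deg * yk
    total -= deg * mk
    with np.errstate(invalid='ignore', divide='ignore'):
        return np.where(total > 0, votes / total, 0.5)
```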

In Table 10 we compare the performances of the baseline (BL) application of the methods with their EE-version (with S = 15, T = 20). It is quite striking to see the EE-version dominating the baseline version when looking at each method individually. This suggests that the imbalanced learning issue has an impact for each of the aforementioned techniques. Focusing on the regularization based approaches, we observe the baseline LR to be less influenced by the imbalance than the SVM method. Their EE-versions (EE SVM and EE LR) can boost the baseline performance in almost all cases and have the highest average ranks amongst all the proposed methods. As already emphasized, the regularization based approaches offer an added element of flexibility, in the sense that the strength of the underlying base learner can be controlled.

The heuristic approaches NB and BeSim do not offer this flexibility. Furthermore, it is not directly possible to integrate instance weights during the training phase of the base learner. During the boosting process, we therefore sample from the distribution Dt to generate (unweighted) examples and train a BeSim/NB learner next. The results in Table 10 indicate BL NB to be the worst performing algorithm amongst all methods tried. Due to the conditional independence assumption, one could regard this as a weak learner, which is therefore suitable to be used in a boosting process. Indeed, the EE NB approach is able to significantly improve upon its performance. We found this to be the slowest EE-version28 (e.g. taking more than 24 hours to process a single fold of the Bank dataset, versus several minutes for the other versions). The average rank estimate of EE NB is rather pessimistic, since we assigned the lowest performance across all methods in case of missing results. BeSim can be considered a stronger learner compared to NB; its possible boost in performance is typically lower than that of NB. Having said this, we found EE BeSim to improve upon BL BeSim for all datasets (except Adver(p = 1)).

28 This statement only relates to the EE-version. In the baseline case, the timings of the NB-version are roughly of the same order of magnitude as SVM and LR (though this heavily depends on the chosen regularization parameter C). With EE, each subset has a size limited to twice the minority class training size (a relatively low number compared to the size of the majority class). This works in favour of SVM/LR (e.g. SVM has a computational complexity of O(m²), with m the number of training instances). Also note that the NB-version is indeed optimized for sparse data. Internally, the NB-implementation writes a fairly large model file to a specified directory (sizes of 1 GB can be encountered there). This writing operation slows down the training stage (especially in case of EE, where S × T = 15 × 20 = 300 model files are constructed).


Table 10 Average tenfold AUC test set performance of the BL compared to its EE-version (S = 15, T = 20) for SVM, LR, BeSim and NB. In case the EE NB implementation crashed, we put a × in the table. If it is too slow (> 24 hours per fold), we place a []. Additionally, average ranks are assigned to each method. In case EE NB has missing results for a particular dataset, we ensure it has the lowest rank across all other algorithms.

            Mov G(p = 1)   Mov G(p = 25)  Mov Th(p = [])  Yahoo A(p = 1)
BL SVM      71.6 (2.62)    81.41 (1.32)   79.77 (5.33)    56.49 (3.37)
EE SVM      76.12 (2.88)   85.13 (1.86)   86.43 (5.86)    59.74 (2.96)
BL LR       71.02 (2.09)   84.39 (1.84)   83.14 (4.17)    57.84 (2.39)
EE LR       76.69 (2.92)   85.03 (1.98)   86.3 (5.37)     59.79 (2.62)
BL BeSim    76.1 (3.58)    81.3 (2.92)    82.81 (6.6)     56.27 (2.73)
EE BeSim    76.31 (3.71)   81.37 (2.9)    85.02 (6.28)    57.7 (1.71)
BL NB       70.26 (5.84)   77.01 (2.54)   70.48 (10.14)   52.56 (2.09)
EE NB       75.93 (2.83)   85.56 (2.01)   86.91 (4.15)    57.55 (2.73)

            Yahoo A(p = 25)  Yahoo G(p = 1)  Yahoo G(p = 25)  TaFeng(p = 1)
BL SVM      61.61 (2.48)     66.84 (3.66)    78.82 (1.39)     55.75 (1.6)
EE SVM      66.38 (3.16)     73.48 (2.32)    80.55 (1.55)     61.13 (1.83)
BL LR       66.27 (2.96)     69.82 (1.93)    80.45 (1.59)     58.91 (2.31)
EE LR       66.22 (3.28)     73.08 (2.14)    80.53 (1.56)     61.43 (2.32)
BL BeSim    64.54 (2.02)     68.89 (2.49)    79.55 (1.96)     57.89 (1.18)
EE BeSim    65.25 (2.23)     71.18 (2.91)    80.04 (1.85)     59.36 (1.47)
BL NB       65 (1.65)        63.33 (2.56)    78.89 (1.64)     54.61 (1.2)
EE NB       66.6 (2.79)      70.99 (2.88)    81.01 (1.3)      59.01 (1.84)

            TaFeng(p = 25)  Book(p = 1)   Book(p = 25)  LST(p = 1)
BL SVM      66.94 (1.34)    52.6 (1.29)   60.08 (0.71)  99.99 (0.01)
EE SVM      70.4 (1.3)      55.34 (1.28)  65.4 (0.51)   99.98 (0.01)
BL LR       69.24 (1.3)     55.34 (1.27)  63.84 (0.75)  99.99 (0.01)
EE LR       70.28 (1.28)    55.49 (1.49)  65.41 (0.63)  99.97 (0.02)
BL BeSim    67.49 (1.23)    55.19 (1.27)  63.7 (0.63)   99.99 (0.01)
EE BeSim    68 (1.21)       55.21 (1.15)  64.38 (0.42)  99.99 (0)
BL NB       65.21 (1.64)    52.93 (0.9)   59.75 (0.47)  98.69 (0.3)
EE NB       70.72 (1.15)    ×             63.46 (0.61)  99.92 (0.04)

            Adver(p = [])  Adver(p = 1)  CRF(p = [])    Bank(p = [])
BL SVM      96.37 (1.94)   91.18 (2.97)  64.36 (18.97)  66.82 (0.88)
EE SVM      97.63 (1.35)   93.3 (2.14)   86.35 (9.99)   71.54 (0.76)
BL LR       97.19 (1.44)   88.51 (1.93)  81.87 (19.63)  71.43 (0.72)
EE LR       97.57 (0.96)   93.02 (2.06)  86.84 (9.62)   71.77 (0.62)
BL BeSim    97.26 (1.12)   95.38 (1.35)  86.91 (9.36)   67.85 (0.67)
EE BeSim    97.38 (1.04)   93.83 (1.35)  87.02 (10.43)  70.41 (0.55)
BL NB       93.75 (1.9)    93.37 (1.9)   87.24 (9.38)   67.83 (0.63)
EE NB       94.04 (1.75)   ×             ×              []

            Flickr(p = 0.1)  Kdd(p = 0.5)  Average Rank [pos]
BL SVM      74.92 (0.17)     74.53 (0.05)  6.44 [7]
EE SVM      79.86 (0.13)     80.98 (0.05)  2.39 [1]
BL LR       79.03 (0.11)     81.29 (0.04)  4.28 [4]
EE LR       79.85 (0.13)     80.75 (0.05)  2.61 [2]
BL BeSim    74.62 (0.13)     74.95 (0)     5.11 [6]
EE BeSim    76.4 (0.13)      77.55 (0.03)  3.61 [3]
BL NB       81.36 (0.1)      74.29 (0.05)  6.5 [8]
EE NB       []               []            5.06 [5]


6 Conclusions and future research directions

We investigated the effects of sampling and cost-sensitive learning techniques on the problem of learning from imbalanced behaviour data. This type of big data has fundamentally different properties and internal structure than traditional data and is characterized by sparseness and very large dimensions. This is a new research topic, which enables benefits across a wide variety of application domains such as fraud detection, targeted advertising, predictive policing, churn prediction, default prediction, etc. To enable future developments in this field, we provide an on-line repository containing implementations, datasets and results, available at the following link: http://www.applieddatamining.com/cms?q=software

Oversampling techniques need to be adapted in order to cope with this kind of data. We proposed new versions of the traditional SMOTE and ADASYN techniques, which differ in the way synthetic samples are generated and similarity computations are performed. The plain random oversampling (OSR) and the synthetic approaches are competitive methods in terms of AUC and do not seem to suffer from overfitting, as traditional studies report. Though the synthetic approaches have comparable AUC-performances in comparison to OSR, they are useless in practice due to their computationally expensive procedures. The OSR method is very fast for small and medium sized datasets, yet it becomes inappropriate for larger datasets because the training times of the underlying base learner increase drastically.

With respect to undersampling methodologies, we employed a random undersampling technique (RUS) and Knn-based informed approaches. The latter type selects majority class instances that are closest (Cl Knn) or farthest (Far Knn) from the minority class examples and allows one to investigate the effect of noise/outliers on SVM performance. We observed that these odd instances have a severe performance degrading effect. RUS and Far Knn approaches can significantly improve performance with respect to the baseline, thanks to their abilities to reduce noise/outlier levels and to put more emphasis on the minority class. Having said this, they usually perform worse than any of the oversampling approaches. This is in sheer contrast to the results obtained from traditional studies and is heavily related to the intrinsic nature of behaviour data. Timing-wise, RUS is the fastest method. The informed approaches are slower than RUS due to the required nearest neighbour computations. Additionally, a cluster-based undersampling (CBU) technique was proposed to alleviate the within-class imbalance. For low undersampling rates the performance of CBU is superior to plain RUS, whereas for high undersampling rates the opposite is true. Overall, no substantial gain in performance is obtained in comparison with a simple RUS approach. Furthermore, the method is the slowest among all techniques considered and relies on computationally expensive clustering heuristics.

We studied the effects of boosting, where we made use of a very specific linear SVM and logistic regression combination that uses confidence rated predictions instead of a more popular plain "discrete" boosting algorithm (with weak hypotheses having values in {−1, 1}). Our experiments clearly indicated the regularization constant C in the SVM formulation to act as a "weakness" indicator. Indeed, higher C-values cause stronger learners and shouldn't be used in the boosting process, which matches the analysis of studies dealing with traditional data. Strong learners can

42 Jellis Vanhoeyveld David Martens

already quite accurately distinguish majority class from minority class behaviours. The boosting process will focus on the hard to learn instances, among which are noise/outliers, and this can result in overfitting behaviour. We also observed that boosting can usually outperform the plain baseline, indicating that a combination of weak learners can achieve higher AUC-performance than a single strong learner.
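The distribution update behind such a confidence-rated scheme (Schapire and Singer 1999) can be sketched as follows: instances with a low or negative margin y·h(x) gain weight, which is exactly how noise/outliers come to dominate later rounds.

```python
import numpy as np

def confidence_rated_update(D, y, h, alpha):
    """One boosting round's distribution update. The weak hypothesis
    h returns real-valued confidences (not labels in {-1, +1});
    y holds labels in {-1, +1} and alpha is the learner's weight."""
    D_new = D * np.exp(-alpha * y * h)   # shrink correct, grow incorrect
    return D_new / D_new.sum()           # renormalize to a distribution
```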

EasyEnsemble (EE) integrates random undersampling with a boosting process. By sampling several balanced subsets from the training data and combining these, we can emphasize the minority class, achieve noise/outlier reduction effects and simultaneously explore the majority class space efficiently. Our version differs from the original formulation because we employ a different type of underlying weak learner (SVM and LR combination) that is fed to a confidence rated boosting algorithm (instead of the discrete version). We observed superior AUC-performance with respect to all of the aforementioned techniques. Furthermore, the method is very fast, since each subset can be trained in parallel and has a size that is twice as large as the minority class training set. We studied the effect of the number of subsets and noted that a limited amount of these (O(10)) is already sufficient to obtain satisfactory performances; increasing this number further causes only minor benefits.

Additionally, a statistical comparison of each of the proposed methods was conducted across a wide variety of datasets. Holm's significance test indicates that each of the proposed techniques (oversampling, undersampling except Cl Knn, cost-sensitive learning and boosting variants) can significantly outperform the plain baseline performance (each of the null-hypotheses of equal performance gets rejected with a confidence level of at least 91%). Comparison of the techniques with EE showed that the latter can significantly outperform all of them (each of the null-hypotheses of equal performance gets rejected with a confidence level of at least 75%). A timing-wise comparison showed EE to be the preferred method (after RUS) for large and highly imbalanced datasets. A parallel implementation would make EE the fastest method even for medium sized datasets that show a high level of imbalance.
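Holm's step-down procedure used for these comparisons can be sketched as: sort the k p-values ascending, compare the i-th smallest against α/(k − i), and stop rejecting at the first failure.

```python
def holm_test(pvals, alpha=0.05):
    """Holm's step-down multiple-comparison procedure.
    Returns a per-hypothesis rejection flag (True = rejected)."""
    k = len(pvals)
    order = sorted(range(k), key=lambda i: pvals[i])  # ascending p
    reject = [False] * k
    for rank, i in enumerate(order):
        if pvals[i] <= alpha / (k - rank):   # step-down threshold
            reject[i] = True
        else:
            break                            # all larger p-values fail too
    return reject
```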

The influence of the chosen base learner was investigated by comparing the baseline to its EE-version for the SVM, LR, NB and BeSim inducers. Each of the EE-versions is able to improve upon the baseline performance for nearly all datasets, where base learners of the regularization type are superior because of their ability to control the strength of the learner.

Though we conducted an extensive benchmark study, there are still plenty of future research directions. First, and most important, there is an urgent need to develop a more general theoretical framework to completely understand the effect of imbalance and why certain methods outperform others. Secondly, other types of mechanisms could be investigated in this setting: advanced sampling (Guo et al 2008) set-ups perform sampling based on preliminary classifications; one-class learning is a recognition based methodology where a discriminative boundary is learned around the training examples of a single class alone, and Raskutti and Kowalczyk (2004) demonstrated the superiority of the recognition based one-class SVMs over the discriminative two-class SVMs; other types of ensemble learning strategies (e.g. bagging and stacking) could be investigated; Batista et al (2004) note that combining over- and undersampling can be beneficial when data are highly imbalanced or show very few minority class instances; the ideas behind Extended Nearest Neighbour (Tang and


He 2015) could guide us in the sampling process. As opposed to Knn, "it considers not only who are the nearest neighbours of a certain sample, but also who consider the particular sample as their nearest neighbour" (Tang and He 2015). The method assigns a class membership to an unknown sample that maximizes the intra-class coherence. Finally, more advanced techniques do exist that can provide a K (the number of nearest neighbours) faster or with (slightly) better performance.

Another issue relates to the lack of comprehensibility of the EasyEnsemble technique in its present form. Because we currently integrate Logistic Regression in the boosting process, the final ensemble classifier is non-linear and therefore incomprehensible. It might be interesting to investigate the numerical boosting technique proposed in Schapire and Singer (1999), where each weak learner is allowed to output real values (instead of values in [−1, 1]). In that case, we would be able to use a plain linear SVM as a weak learner, without the Logistic Regression component. The final ensemble would be a linear model, which is surely easier to comprehend. Also, the instance-based approach of Martens and Provost (2014) could be used to explain individual predictions of the non-linear EE-technique.

A Oversampling experiments

Table 11 Oversampling experiments. Results show the average tenfold AUC test set performance with respect to increasing β-values (β = [β1, β2, β3, β4] = [0, 1/3, 2/3, 1]). Additionally, standard deviations are shown between brackets. Optimal performances for each method are highlighted in boldface. See Section 4.2 for parameter settings.

Mov G(p = 1)
          β1            β2            β3            β4
OSR       71.6 (2.62)   74.37 (2.04)  73.6 (1.84)   74.73 (2.45)
SMOTE     71.6 (2.62)   75.08 (2.18)  76.02 (2.14)  76.48 (2.3)
ADASYN    71.6 (2.62)   75.16 (1.92)  75.93 (2.08)  76.47 (2.29)

Mov G(p = 25)
          β1            β2            β3            β4
OSR       81.41 (1.32)  83.49 (1.81)  83.84 (1.96)  83.91 (2.04)
SMOTE     81.41 (1.32)  83.32 (1.97)  83.59 (2.04)  83.76 (2.11)
ADASYN    81.41 (1.32)  83.61 (1.82)  84.02 (1.97)  83.69 (1.96)

Mov Th(p = [])
          β1            β2            β3            β4
OSR       79.77 (5.33)  85.3 (4.66)   83.16 (4.5)   84.59 (5.69)
SMOTE     79.77 (5.33)  84.18 (6.51)  85.58 (5.97)  84.33 (5.75)
ADASYN    79.77 (5.33)  84.11 (6.77)  85.86 (6.06)  85.36 (5.13)

Yahoo A(p = 1)
          β1            β2            β3            β4
OSR       55.92 (2.97)  58.66 (3.27)  59.99 (2.28)  59.74 (1.78)
SMOTE     55.92 (2.97)  59.76 (2.62)  59.74 (2.67)  59.43 (2.4)
ADASYN    55.92 (2.97)  59.54 (2.53)  59.55 (2.94)  59.56 (2.22)

Yahoo A(p = 25)
          β1            β2            β3            β4
OSR       61.68 (2.42)  64.19 (3.17)  65.08 (3.26)  64.67 (2.1)
SMOTE     61.68 (2.42)  65.46 (3.63)  65.33 (3.23)  64.52 (2.98)
ADASYN    61.68 (2.42)  65.04 (3.74)  65.41 (3.47)  64.4 (2.21)

Yahoo G(p = 1)
          β1            β2            β3            β4
OSR       66.84 (3.66)  72.18 (2.36)  73.11 (2.7)   72.49 (3.41)
SMOTE     66.84 (3.66)  72.65 (2.85)  73.27 (3.36)  73.37 (3.56)
ADASYN    66.84 (3.66)  72.87 (2.83)  73.18 (3.2)   73.39 (3.59)

Yahoo G(p = 25)
          β1            β2            β3            β4
OSR       78.82 (1.39)  78.78 (2.02)  78.74 (1.86)  78.35 (1.47)
SMOTE     78.82 (1.39)  79.23 (1.57)  79.1 (1.2)    79.03 (1.89)
ADASYN    78.82 (1.39)  79.12 (1.43)  79.23 (1.37)  79.51 (2.01)

TaFeng(p = 1)
          β1            β2            β3            β4
OSR       55.75 (1.6)   59.23 (1.96)  60 (1.68)     61.04 (2.36)
SMOTE     55.75 (1.6)   60.26 (1.95)  61.49 (1.8)   61.13 (1.52)
ADASYN    55.75 (1.6)   60.26 (1.9)   61.44 (1.85)  61.16 (1.5)

TaFeng(p = 25)
          β1            β2            β3            β4
OSR       66.94 (1.34)  68.84 (1.21)  66.99 (1.42)  67.7 (1.41)
SMOTE     66.94 (1.34)  68.47 (1.5)   67.07 (1.15)  66.65 (0.81)
ADASYN    66.94 (1.34)  68.62 (1.38)  67.85 (1.6)   66.91 (1.39)

Book(p = 1)
          β1            β2            β3            β4
OSR       52.6 (1.29)   53.61 (0.94)  55.41 (1.75)  55.87 (1.44)
SMOTE     52.6 (1.29)   54.77 (0.99)  54.91 (0.8)   54.36 (0.98)
ADASYN    52.6 (1.29)   54.86 (1.13)  55.06 (0.73)  54.54 (0.92)

Book(p = 25)
          β1            β2            β3            β4
OSR       60.08 (0.71)  61.05 (0.94)  61.82 (1.12)  64.62 (0.57)
SMOTE     60.08 (0.71)  62.6 (0.73)   60.95 (0.68)  63 (0.8)
ADASYN    60.08 (0.71)  62.33 (0.73)  60.77 (0.85)  63.04 (0.58)

LST(p = 1)
          β1            β2            β3            β4
OSR       99.99 (0.01)  99.99 (0.01)  99.99 (0.01)  99.99 (0.01)
SMOTE     99.99 (0.01)  99.99 (0.01)  99.99 (0.01)  99.99 (0.01)
ADASYN    99.99 (0.01)  99.99 (0.01)  99.99 (0.01)  99.99 (0.01)

Adver(p = [])
          β1            β2            β3            β4
OSR       96.61 (1.82)  97.31 (1.65)  97.07 (1.84)  97.07 (1.79)
SMOTE     96.61 (1.82)  96.91 (1.66)  97.19 (1.65)  97.07 (1.91)
ADASYN    96.61 (1.82)  97.1 (1.7)    97.08 (1.87)  97.07 (1.88)

Adver(p = 1)
          β1            β2            β3            β4
OSR       90.93 (3.02)  91.27 (3.03)  92.66 (2.82)  93.29 (1.97)
SMOTE     90.93 (3.02)  92.51 (2.03)  92.96 (2.14)  93.53 (1.81)
ADASYN    90.93 (3.02)  92.22 (2.33)  92.7 (2.36)   93.88 (1.73)

CRF(p = [])
          β1             β2             β3             β4
OSR       64.06 (16.43)  80.82 (12.94)  81.28 (12.27)  81.91 (11.28)
SMOTE     64.06 (16.43)  78.64 (16.86)  82.52 (13.74)  79.32 (16.26)
ADASYN    64.06 (16.43)  78.95 (16.72)  81.19 (16.32)  79.31 (16.19)

Bank(p = [])
          β1            β2           β3           β4
OSR       66.82 (0.88)  70.1 (0.74)  71.39 (0.8)  71.47 (0.8)
SMOTE     []            []           []           []
ADASYN    []            []           []           []


B Undersampling experiments

Table 12 Undersampling experiments. Results show the average tenfold AUC test set performance with respect to increasing βu-values (βu = [βu1, βu2, βu3, βu4, βu5] = [0, 1/4, 1/2, 3/4, 1]). Optimal performances (boldface) are highlighted for each imbalance ratio. Additionally, standard deviations are shown between brackets. CL K represents the "Closest Knn" technique; CL T represents the "Closest tot sim" technique (similar for Far K and Far T, see Section 3.2). Parameter settings are given in Section 4.2.

Mov G(p = 1)
        βu1          βu2           βu3           βu4           βu5
RUS     71.6 (2.6)   71.83 (2.6)   72.54 (2.5)   72.39 (3.1)   70.61 (3.5)
Cl K    71.6 (2.6)   71.4 (2)      70.96 (1.9)   70.43 (2.4)   69.05 (4.1)
CL T    71.6 (2.6)   70.28 (2.5)   66.74 (2)     66.8 (2.1)    68.18 (3.6)
Far K   71.6 (2.6)   72.36 (2.7)   71.26 (3.4)   66.57 (5.2)   53.5 (3.5)
Far T   71.6 (2.6)   72.22 (2.8)   71.63 (3.6)   64.28 (5.3)   50.88 (4.4)
CBU     72.55 (2.6)  73.28 (2.6)   73.12 (2.6)   73.84 (2.5)   73 (3.1)

Mov G(p = 25)
        βu1          βu2           βu3           βu4           βu5
RUS     81.41 (1.3)  81.36 (1.3)   81.78 (1.7)   82.05 (1.7)   81.6 (2.1)
Cl K    81.41 (1.3)  80.86 (1.2)   80.95 (1.6)   79.73 (2.3)   77.95 (2.3)
CL T    81.41 (1.3)  79.9 (1.2)    78.21 (1.4)   77.87 (1.5)   77.76 (2.3)
Far K   81.41 (1.3)  80.9 (1.5)    78.17 (1.8)   74.25 (2.4)   69.79 (3.2)
Far T   81.41 (1.3)  80.86 (1.5)   77.2 (2.4)    71.16 (2.7)   62.4 (2.8)
CBU     81.53 (1.4)  81.64 (1.3)   81.29 (1.6)   81.28 (2.1)   80.34 (2.7)

Mov Th(p = [])
        βu1          βu2           βu3           βu4           βu5
RUS     79.77 (5.3)  80.32 (5.8)   81.57 (5.5)   81.86 (6.6)   81.26 (6.2)
Cl K    79.77 (5.3)  79.25 (4.5)   78.07 (5)     76.25 (6.5)   62.46 (8.5)
CL T    79.77 (5.3)  78.4 (4.4)    72.41 (3.5)   64.66 (4.5)   60.37 (7.3)
Far K   79.77 (5.3)  84.54 (5)     83.64 (6.4)   80.02 (7.3)   56.82 (10.3)
Far T   79.77 (5.3)  85.03 (5.7)   82.68 (6.8)   75.61 (9.2)   56.77 (10.9)
CBU     80.11 (5.8)  81.17 (6)     81.08 (6.5)   84.17 (5.1)   80.96 (6.9)

Yahoo A(p = 1)
        βu1          βu2           βu3           βu4           βu5
RUS     55.92 (3)    55.57 (3.4)   56.44 (3)     55.83 (3.4)   56.37 (3.3)
Cl K    55.92 (3)    55.67 (2.4)   53.12 (2)     50.57 (1.8)   53.79 (3.5)
CL T    55.92 (3)    55.69 (2.1)   53.35 (2.2)   50.31 (2.2)   52.35 (3.3)
Far K   55.92 (3)    57.35 (2.2)   56.92 (1.1)   56.95 (2.3)   51.18 (2)
Far T   55.92 (3)    56.93 (2.4)   54.74 (1.9)   57.01 (1.8)   51.18 (2)
CBU     58.21 (2.6)  58.45 (3.3)   58.31 (3.5)   58.39 (3.5)   56.09 (2.6)

Yahoo A(p = 25)
        βu1          βu2           βu3           βu4           βu5
RUS     61.68 (2.4)  62.9 (2.9)    63.62 (3.6)   63.75 (3.1)   63.19 (1.9)
Cl K    61.68 (2.4)  61.14 (2.1)   57.62 (1.6)   54.02 (1.8)   51.48 (1.4)
CL T    61.68 (2.4)  60.89 (2.8)   58.11 (1.4)   54.4 (2.1)    51.76 (1.4)
Far K   61.68 (2.4)  63.96 (3)     62.62 (2.2)   59.61 (1.5)   56.25 (1.6)
Far T   61.68 (2.4)  63.71 (2.4)   59.72 (1.6)   57.27 (1.1)   54.47 (1.1)
CBU     62.46 (2.6)  61.85 (1.4)   61.78 (2.2)   59.94 (3)     60.1 (4)

Yahoo G(p = 1)
        βu1          βu2           βu3           βu4           βu5
RUS     66.84 (3.7)  67.85 (3.2)   68.36 (3.2)   68.23 (4)     69.9 (4.2)
Cl K    66.84 (3.7)  66.71 (2.8)   64.3 (3.6)    61.98 (3.9)   61.15 (1.9)
CL T    66.84 (3.7)  65.79 (2.7)   63.55 (3.3)   59.21 (3.5)   61.08 (2.4)
Far K   66.84 (3.7)  66.76 (4.1)   63.84 (3.4)   65.16 (2)     48.5 (2.9)
Far T   66.84 (3.7)  66.95 (4.1)   63.48 (2.9)   65.16 (2)     48.48 (2.9)
CBU     69.68 (4.1)  70.59 (3.2)   70.64 (3.7)   70.2 (2.9)    63.35 (3.6)

Yahoo G(p = 25)
        βu1          βu2           βu3           βu4           βu5
RUS     78.82 (1.4)  78.91 (1.6)   78.97 (1.6)   78.61 (1.6)   77.82 (2.1)
Cl K    78.82 (1.4)  77.26 (1.5)   72.52 (1.5)   67.86 (2)     65.07 (2.7)
CL T    78.82 (1.4)  76.83 (1)     71.99 (1.8)   67.15 (2.3)   61.1 (2.7)
Far K   78.82 (1.4)  78.26 (2.2)   74.69 (2.7)   67.22 (2.1)   60.72 (2.3)
Far T   78.82 (1.4)  77.68 (2.6)   72.44 (3)     64.94 (2.4)   59.6 (2)
CBU     75.25 (3.2)  75.22 (2.4)   74.69 (2.3)   73.07 (2.4)   70.69 (2.4)

TaFeng(p = 1)
        βu1          βu2           βu3           βu4           βu5
RUS     55.75 (1.6)  56.1 (1.6)    56.26 (1.7)   57.23 (1.7)   59.25 (2.2)
Cl K    55.75 (1.6)  55.68 (1.6)   55.58 (1.5)   55.08 (1.1)   51.05 (1.5)
CL T    55.75 (1.6)  55.67 (1.6)   54.47 (1.6)   47.53 (1.6)   49.3 (1.1)
Far K   55.75 (1.6)  58.99 (1.2)   59.47 (1.1)   60.04 (1.2)   56.31 (1)
Far T   55.75 (1.6)  58.92 (1.3)   59.25 (1.3)   58.58 (1.1)   56.31 (1)
CBU     57.8 (1)     58.47 (1.1)   58.15 (0.9)   58.87 (1.4)   57.65 (1.6)

TaFeng(p = 25)
        βu1          βu2           βu3           βu4           βu5
RUS     66.94 (1.3)  67.44 (1.3)   68.1 (1.4)    68.27 (1.4)   66.13 (1.2)
Cl K    66.94 (1.3)  66.13 (1.4)   63.39 (1.2)   59.83 (1.3)   56.94 (0.7)
CL T    66.94 (1.3)  66.38 (1.5)   62.89 (1.6)   57.46 (1.3)   54.56 (1.3)
Far K   66.94 (1.3)  68.06 (1.4)   66.43 (1.6)   64.46 (1.5)   63.35 (1.3)
Far T   66.94 (1.3)  64.31 (1.1)   62.69 (1)     61.27 (1.1)   59.03 (1)
CBU     64.81 (1.2)  64.15 (1.1)   64.13 (1.2)   63.88 (0.8)   63.46 (0.8)

Book(p = 1)
        βu1          βu2           βu3           βu4           βu5
RUS     52.6 (1.3)   52.79 (0.9)   53.46 (0.8)   53.89 (0.9)   54.05 (0.9)
Cl K    52.6 (1.3)   52.56 (1.2)   52.52 (1.3)   52.39 (1.1)   53.09 (1.1)
CL T    52.6 (1.3)   52.56 (1.2)   52.52 (1.3)   52.39 (1.1)   53.05 (0.7)
Far K   52.6 (1.3)   55.21 (1.2)   56.21 (1.8)   56.14 (1.2)   53.06 (1)
Far T   52.6 (1.3)   55.21 (1.2)   56.21 (1.8)   56.14 (1.2)   53.06 (1)
CBU     54.28 (0.9)  53.77 (1)     53.33 (1.1)   53.34 (0.9)   52.84 (0.8)

Book(p = 25)
        βu1          βu2           βu3           βu4           βu5
RUS     60.08 (0.7)  60.13 (0.6)   60.4 (0.8)    60.33 (0.8)   63.28 (0.8)
Cl K    60.08 (0.7)  59.96 (0.7)   60.13 (0.8)   59.96 (1)     59.28 (0.7)
CL T    60.08 (0.7)  59.96 (0.7)   60.13 (0.8)   60.29 (0.4)   54.5 (0.9)
Far K   60.08 (0.7)  63.29 (1)     64.19 (0.8)   57.3 (1.1)    55.66 (1.1)
Far T   60.08 (0.7)  62.14 (0.5)   58.27 (0.6)   56.37 (1)     55.66 (1.1)
CBU     54.82 (0.9)  54.67 (0.9)   54.71 (0.9)   54.66 (1)     54.78 (0.9)

LST(p = 1)
        βu1          βu2           βu3           βu4           βu5
RUS     99.99 (0)    99.99 (0)     99.99 (0)     99.98 (0)     99.99 (0)
Cl K    99.99 (0)    99.99 (0)     99.99 (0)     99.99 (0)     99.99 (0)
CL T    99.99 (0)    99.99 (0)     99.99 (0)     99.99 (0)     99.98 (0)
Far K   99.99 (0)    99.98 (0)     99.98 (0)     99.98 (0)     99.98 (0)
Far T   99.99 (0)    99.98 (0)     99.98 (0)     99.98 (0)     99.98 (0)
CBU     []           []            []            []            []

Adver(p = [])
        βu1          βu2           βu3           βu4           βu5
RUS     96.61 (1.8)  96.32 (1.8)   96.63 (1.4)   97.12 (2.1)   96.22 (1.6)
Cl K    96.61 (1.8)  96.44 (1.5)   96.14 (1.5)   96.04 (2)     94.8 (2.5)
CL T    96.61 (1.8)  95.87 (2.1)   94.32 (1.9)   93.01 (2.2)   90.72 (2.3)
Far K   96.61 (1.8)  96.53 (1.4)   95.76 (2)     94.39 (1.8)   90.49 (3.1)
Far T   96.61 (1.8)  96.54 (1.5)   95.67 (1.9)   94.54 (1.8)   89.3 (2.8)
CBU     96.85 (2.3)  96.85 (2.3)   97.05 (1.5)   96.6 (1.6)    96.06 (2.1)

Adver(p = 1)
        βu1          βu2           βu3           βu4           βu5
RUS     90.93 (3)    91.53 (3.1)   92.37 (3.4)   91.9 (2.9)    91.93 (2.2)
Cl K    90.93 (3)    90.64 (3)     89.87 (3.9)   90.21 (3.6)   89.18 (2)
CL T    90.93 (3)    89.7 (3.5)    88.55 (3.4)   85.76 (3.3)   88.2 (2.3)
Far K   90.93 (3)    93.8 (2.3)    92.4 (2.6)    88.73 (3.4)   85.51 (4)
Far T   90.93 (3)    93.62 (2.4)   93.2 (2.2)    88.41 (3.6)   85.51 (4)
CBU     93.22 (2.4)  93.76 (2.5)   93.89 (2.6)   93.52 (2.7)   91.27 (2)

CRF(p = [])
        βu1           βu2           βu3           βu4           βu5
RUS     64.06 (16.4)  63.28 (15.9)  67.98 (17.4)  66.95 (21.9)  87.73 (8.8)
Cl K    64.06 (16.4)  62.44 (16.6)  62.34 (16.9)  71.37 (13.8)  78.22 (17.7)
CL T    64.06 (16.4)  62.44 (16.6)  62.34 (16.9)  71.37 (13.8)  62.67 (22.9)
Far K   64.06 (16.4)  83.8 (14.2)   83.93 (14.8)  84.49 (13.7)  86.11 (9.7)
Far T   64.06 (16.4)  83.8 (14.2)   83.93 (14.8)  84.49 (13.7)  86.11 (9.7)
CBU     []            []            []            []            []

Bank(p = [])
        βu1          βu2           βu3           βu4           βu5
RUS     66.82 (0.9)  67.02 (0.9)   67.37 (0.8)   67.99 (0.6)   69.5 (1)
Cl K    66.82 (0.9)  66.17 (0.7)   65.24 (0.6)   64.86 (0.6)   58.53 (1.1)
CL T    66.82 (0.9)  64.92 (1.1)   60.69 (0.9)   56.33 (0.8)   52.87 (0.7)
Far K   66.82 (0.9)  66.95 (0.6)   66.19 (0.6)   64.42 (0.6)   58.25 (1.1)
Far T   66.82 (0.9)  67.16 (0.6)   64.2 (0.8)    59.67 (1)     58.25 (1.1)
CBU     []           []            []            []            []

48 Jellis Vanhoeyveld David Martens

C Boosting experiments

This section presents the results of each of the boosting experiments from Section 4.5 on the remaining data sources. Each of the figures below depicts the average tenfold AUC-performance on test data (with µ = 100) with respect to the number of boosting iterations for (left) Adaboost (AB), AdaCost (AC) and EasyEnsemble (EE), with C chosen according to the highest validation set AUC-performance (over all possible boosting rounds), and (right) AB and EE (with S = 15) at varying C-levels.
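As a reading aid for the EE curves in the figures below, the EasyEnsemble structure amounts to: draw S balanced subsets of the majority class, boost a learner on each, and pool all weighted hypotheses. The sketch below illustrates that structure only; it is not the paper's implementation. In particular, it substitutes weighted decision stumps for the SVM base learner used in the paper, uses simplified AdaBoost internals, and all function names are our own.

```python
import numpy as np

def stump_fit(X, y, w):
    # exhaustive search for the threshold stump with lowest weighted error; y in {-1, +1}
    best = (0, 0.0, 1, np.inf)                      # (feature, threshold, polarity, error)
    for f in range(X.shape[1]):
        for t in np.unique(X[:, f]):
            for pol in (1, -1):
                pred = np.where(pol * (X[:, f] - t) >= 0, 1, -1)
                err = w[pred != y].sum()
                if err < best[3]:
                    best = (f, t, pol, err)
    return best

def stump_predict(model, X):
    f, t, pol, _ = model
    return np.where(pol * (X[:, f] - t) >= 0, 1, -1)

def adaboost(X, y, T=10):
    # plain AdaBoost over decision stumps
    n = len(y); w = np.full(n, 1.0 / n); models = []
    for _ in range(T):
        m = stump_fit(X, y, w)
        pred = stump_predict(m, X)
        err = max(w[pred != y].sum(), 1e-12)
        if err >= 0.5:                              # weak-learning condition violated
            break
        alpha = 0.5 * np.log((1 - err) / err)
        w *= np.exp(-alpha * y * pred); w /= w.sum()
        models.append((alpha, m))
    return models

def easy_ensemble(X_min, X_maj, S=5, T=10, rng=0):
    # boost on S randomly drawn balanced subsets, then pool all weighted hypotheses
    rng = np.random.default_rng(rng)
    ensembles = []
    for _ in range(S):
        idx = rng.choice(len(X_maj), size=len(X_min), replace=False)
        X = np.vstack([X_min, X_maj[idx]])
        y = np.r_[np.ones(len(X_min)), -np.ones(len(X_min))]
        ensembles.append(adaboost(X, y, T))
    def score(Xt):
        return sum(a * stump_predict(m, Xt)
                   for models in ensembles for a, m in models)
    return score
```

Because each subset is only as large as the minority class, every boosted member trains quickly, and the S members are trivially parallelizable, which is the speed argument made in the conclusions.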

Fig 6 Mov G(p = 1) dataset [figure: test AUC versus boosting round T (0-30); (a) AB, AC (R = 2), AC (R = 8), AC (RD), EE (S = 5, 10, 15) and BL; (b) AB and EE for C in {1e-07, 1e-05, 0.001, 0.1}, plus BL]

Fig 7 Mov Th(p = []) dataset [figure: panels (a) and (b), test AUC versus boosting round T]


Fig 8 Yahoo A(p = 1) dataset [figure: panels (a) and (b), test AUC versus boosting round T]

Fig 9 Yahoo A(p = 25) dataset [figure: panels (a) and (b), test AUC versus boosting round T]

Fig 10 Yahoo G(p = 1) dataset [figure: panels (a) and (b), test AUC versus boosting round T]


Fig 11 Yahoo G(p = 25) dataset [figure: panels (a) and (b), test AUC versus boosting round T]

Fig 12 TaFeng(p = 1) dataset [figure: panels (a) and (b), test AUC versus boosting round T]

Fig 13 Book(p = 1) dataset [figure: panels (a) and (b), test AUC versus boosting round T]


Fig 14 LST(p = 1) dataset [figure: panels (a) and (b), test AUC versus boosting round T]

Fig 15 Adver(p = []) dataset [figure: panels (a) and (b), test AUC versus boosting round T]

Fig 16 Adver(p = 1) dataset [figure: panels (a) and (b), test AUC versus boosting round T]


Fig 17 CRF(p = []) dataset [figure: panels (a) and (b), test AUC versus boosting round T]

D Final Comparison

[Figure: scatter of average rank Time (0-18) against average rank AUC (0-14) for BL, OSR, SMOTE, ADASYN, RUS, Cl Knn, Far Knn, CBU, AB, AC(R=28), AC(R=RL), EE(S=10), EE(S=15) and EE par]

Fig 18 Average rank AUC versus average rank Time (see Table 9) across the large datasets from Table 2 (CRF and Bank). Regarding the former, an extra row EE par is added, having the same AUC-value as EE(S = 15). Ranks for each dataset are subsequently obtained via the procedure outlined in Section 4.6.1. Points occurring in the upper-right region are preferred.

Imbalanced classification in sparse and large behaviour datasets 53

References

Akbani R, Kwek S, Japkowicz N (2004) Applying support vector machines to imbalanced datasets. In: Machine Learning: ECML 2004, 15th European Conference on Machine Learning, Pisa, Italy, September 20-24, 2004, Proceedings. Springer Berlin Heidelberg, Berlin, Heidelberg, pp 39-50. DOI 10.1007/978-3-540-30115-8_7

Ali A, Shamsuddin SM, Ralescu AL (2015) Classification with class imbalance problem: A review. International Journal of Advances in Soft Computing and its Applications 7(3):176-204

Alzahrani T, Horadam KJ (2016) Community detection in bipartite networks: Algorithms and case studies. In: Complex Systems and Networks: Dynamics, Controls and Applications. Springer Berlin Heidelberg, Berlin, Heidelberg, pp 25-50. DOI 10.1007/978-3-662-47824-0_2

Bachner J (2013) Predictive policing: Preventing crime with data and analytics. IBM Center for the Business of Government

Baesens B, Van Gestel T, Viaene S, Stepanova M, Suykens J, Vanthienen J (2003) Benchmarking state-of-the-art classification algorithms for credit scoring. Journal of the Operational Research Society 54(6):627-635. DOI 10.1057/palgrave.jors.2601545

Barandela R, Sánchez JS, García V, Rangel E (2003) Strategies for learning in class imbalance problems. Pattern Recognition 36(3):849-851. DOI 10.1016/S0031-3203(02)00257-1

Barber MJ (2007) Modularity and community detection in bipartite networks. Physical Review E 76:066102. DOI 10.1103/PhysRevE.76.066102

Barua S, Islam MM, Yao X, Murase K (2014) MWMOTE - majority weighted minority oversampling technique for imbalanced data set learning. IEEE Transactions on Knowledge and Data Engineering 26(2):405-425. DOI 10.1109/TKDE.2012.232

Batista GEAPA, Prati RC, Monard MC (2004) A study of the behavior of several methods for balancing machine learning training data. SIGKDD Explor Newsl 6(1):20-29. DOI 10.1145/1007730.1007735

Beckett SJ (2016) Improved community detection in weighted bipartite networks. Royal Society Open Science 3(1). DOI 10.1098/rsos.140536

Bekkar M, Djemaa HK, Alitouche TA (2013) Evaluation measures for models assessment over imbalanced data sets. Journal of Information Engineering and Applications 3(10):27-38

Bhattacharyya S, Jha S, Tharakunnel K, Westland JC (2011) Data mining for credit card fraud: A comparative study. Decision Support Systems 50(3):602-613. DOI 10.1016/j.dss.2010.08.008

Blondel VD, Guillaume JL, Lambiotte R, Lefebvre E (2008) Fast unfolding of communities in large networks. Journal of Statistical Mechanics: Theory and Experiment 2008(10):P10008

Breiman L, Friedman J, Stone CJ, Olshen RA (1984) Classification and regression trees. Taylor & Francis

Brozovsky L, Petricek V (2007) Recommender system for online dating service. In: Proceedings of Znalosti 2007 Conference, VSB Ostrava

Cha M, Mislove A, Gummadi KP (2009) A measurement-driven analysis of information propagation in the Flickr social network. In: Proceedings of the 18th International Conference on World Wide Web, ACM, New York, NY, USA, WWW '09, pp 721-730. DOI 10.1145/1526709.1526806

Chawla NV (2005) Data mining for imbalanced datasets: An overview. In: Data Mining and Knowledge Discovery Handbook. Springer US, Boston, MA, pp 853-867

Chawla NV, Bowyer KW, Hall LO, Kegelmeyer WP (2002) SMOTE: synthetic minority over-sampling technique. Journal of Artificial Intelligence Research 16:321-357

Chawla NV, Lazarevic A, Hall LO, Bowyer KW (2003) SMOTEBoost: Improving prediction of the minority class in boosting. In: Knowledge Discovery in Databases: PKDD 2003. Springer Berlin Heidelberg, Berlin, Heidelberg, pp 107-119

Chawla NV, Japkowicz N, Kotcz A (2004) Editorial: Special issue on learning from imbalanced data sets. SIGKDD Explor Newsl 6(1):1-6. DOI 10.1145/1007730.1007733

Chen M, Mao S, Liu Y (2014) Big data: A survey. Mobile Networks and Applications 19(2):171-209. DOI 10.1007/s11036-013-0489-0

Chyi YM (2003) Classification analysis techniques for skewed class distribution problems. Master thesis, Department of Information Management, National Sun Yat-Sen University

Demšar J (2006) Statistical comparisons of classifiers over multiple data sets. Journal of Machine Learning Research 7(Jan):1-30

Dhillon IS (2001) Co-clustering documents and words using bipartite spectral graph partitioning. In: Proceedings of the Seventh ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, ACM, New York, NY, USA, KDD '01, pp 269-274. DOI 10.1145/502512.502550


Drummond C, Holte RC (2003) C4.5, class imbalance, and cost sensitivity: why under-sampling beats over-sampling. In: Proceedings of the ICML '03 Workshop on Learning from Imbalanced Datasets

Fan RE, Chang KW, Hsieh CJ, Wang XR, Lin CJ (2008) LIBLINEAR: A library for large linear classification. Journal of Machine Learning Research 9:1871-1874

Fan W, Stolfo SJ, Zhang J, Chan PK (1999) AdaCost: Misclassification cost-sensitive boosting. In: Proceedings of the Sixteenth International Conference on Machine Learning, Morgan Kaufmann Publishers Inc, San Francisco, CA, USA, ICML '99, pp 97-105

Fawcett T (2006) An introduction to ROC analysis. Pattern Recognition Letters 27(8):861-874. DOI 10.1016/j.patrec.2005.10.010

Finch H (2005) Comparison of distance measures in cluster analysis with dichotomous data. Journal of Data Science 3(1):85-100

Fortunato S (2010) Community detection in graphs. Physics Reports 486(3-5):75-174. DOI 10.1016/j.physrep.2009.11.002

Junqué de Fortuny E, Martens D, Provost F (2014a) Predictive modeling with big data: is bigger really better? Big Data 1(4):215-226. DOI 10.1089/big.2013.0037

Junqué de Fortuny E, Stankova M, Moeyersoms J, Minnaert B, Provost F, Martens D (2014b) Corporate residence fraud detection. In: Proceedings of the 20th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, ACM, New York, NY, USA, KDD '14, pp 1650-1659. DOI 10.1145/2623330.2623333

Frasca M, Bertoni A, Re M, Valentini G (2013) A neural network algorithm for semi-supervised node label learning from unbalanced data. Neural Networks 43:84-98. DOI 10.1016/j.neunet.2013.01.021

Friedman M (1937) The use of ranks to avoid the assumption of normality implicit in the analysis of variance. Journal of the American Statistical Association 32(200):675-701

García E, Lozano F (2007) Boosting support vector machines. In: Machine Learning and Data Mining in Pattern Recognition, 5th International Conference, MLDM 2007, Leipzig, Germany, July 18-20, Post Proceedings. IBaI publishing, pp 153-167

Goldstein M, Uchida S (2016) A comparative evaluation of unsupervised anomaly detection algorithms for multivariate data. PLOS ONE 11(4):1-31. DOI 10.1371/journal.pone.0152173

González PC, Velásquez JD (2013) Characterization and detection of taxpayers with false invoices using data mining techniques. Expert Systems with Applications 40(5):1427-1436. DOI 10.1016/j.eswa.2012.08.051

Guimerà R, Sales-Pardo M, Amaral LAN (2007) Module identification in bipartite and directed networks. Physical Review E 76:036102. DOI 10.1103/PhysRevE.76.036102

Guo H, Viktor HL (2004) Learning from imbalanced data sets with boosting and data generation: The DataBoost-IM approach. SIGKDD Explor Newsl 6(1):30-39. DOI 10.1145/1007730.1007736

Guo X, Yin Y, Dong C, Yang G, Zhou G (2008) On the class imbalance problem. In: 2008 Fourth International Conference on Natural Computation, IEEE, vol 4, pp 192-201. DOI 10.1109/ICNC.2008.871

Han H, Wang WY, Mao BH (2005) Borderline-SMOTE: A new over-sampling method in imbalanced data sets learning. In: Advances in Intelligent Computing. Springer Berlin Heidelberg, Berlin, Heidelberg, pp 878-887

He H, Garcia EA (2009) Learning from imbalanced data. IEEE Transactions on Knowledge and Data Engineering 21(9):1263-1284. DOI 10.1109/TKDE.2008.239

He H, Bai Y, Garcia EA, Li S (2008) ADASYN: Adaptive synthetic sampling approach for imbalanced learning. In: 2008 IEEE International Joint Conference on Neural Networks (IEEE World Congress on Computational Intelligence), IEEE, pp 1322-1328. DOI 10.1109/IJCNN.2008.4633969

Holm S (1979) A simple sequentially rejective multiple test procedure. Scandinavian Journal of Statistics 6(2):65-70

Hsu CW, Lin CJ (2002) A comparison of methods for multiclass support vector machines. IEEE Transactions on Neural Networks 13(2):415-425. DOI 10.1109/72.991427

Huang A (2008) Similarity measures for text document clustering. In: Proceedings of the Sixth New Zealand Computer Science Research Student Conference (NZCSRSC2008), Christchurch, New Zealand, pp 49-56

Iman RL, Davenport JM (1980) Approximations of the critical region of the Friedman statistic. Communications in Statistics - Theory and Methods 9(6):571-595

Jo T, Japkowicz N (2004) Class imbalances versus small disjuncts. ACM SIGKDD Explor Newsl 6(1):40-49. DOI 10.1145/1007730.1007737


Jutla IS, Jeub LG, Mucha PJ (2011-2016) A generalized Louvain method for community detection implemented in MATLAB. URL http://netwiki.amath.unc.edu/GenLouvain

Kubat M, Matwin S (1997) Addressing the curse of imbalanced training sets: One-sided selection. In: Proceedings of the Fourteenth International Conference on Machine Learning, Morgan Kaufmann Publishers Inc, San Francisco, CA, USA, pp 179-186

Lancichinetti A, Fortunato S (2009) Community detection algorithms: A comparative analysis. Physical Review E 80:056117. DOI 10.1103/PhysRevE.80.056117

Larremore DB, Clauset A, Jacobs AZ (2014) Efficiently inferring community structure in bipartite networks. Physical Review E: Statistical, Nonlinear, and Soft Matter Physics 90:012805. DOI 10.1103/PhysRevE.90.012805

Li J, Fine JP (2010) Weighted area under the receiver operating characteristic curve and its application to gene selection. Journal of the Royal Statistical Society: Series C (Applied Statistics) 59(4):673-692. DOI 10.1111/j.1467-9876.2010.00713.x

Li X, Wang L, Sung E (2008) AdaBoost with SVM-based component classifiers. Engineering Applications of Artificial Intelligence 21(5):785-795. DOI 10.1016/j.engappai.2007.07.001

Lichman M (2013) UCI machine learning repository. URL http://archive.ics.uci.edu/ml

Liu W, Chawla S, Cieslak DA, Chawla NV (2010) A robust decision tree algorithm for imbalanced data sets. In: Proceedings of the Tenth SIAM International Conference on Data Mining, SIAM, Philadelphia, vol 10, pp 766-777

Liu XY, Wu J, Zhou ZH (2009) Exploratory undersampling for class-imbalance learning. IEEE Transactions on Systems, Man, and Cybernetics, Part B (Cybernetics) 39(2):539-550. DOI 10.1109/TSMCB.2008.2007853

Luts J, Ojeda F, Van de Plas R, De Moor B, Van Huffel S, Suykens JA (2010) A tutorial on support vector machine-based methods for classification problems in chemometrics. Analytica Chimica Acta 665(2):129-145. DOI 10.1016/j.aca.2010.03.030

Macskassy SA, Provost F (2007) Classification in networked data: A toolkit and a univariate case study. Journal of Machine Learning Research 8(May):935-983

Martens D, Provost F (2014) Explaining data-driven document classifications. MIS Quarterly 38(1):73-100

Martens D, Provost F, Clark J, Junqué de Fortuny E (2016) Mining massive fine-grained behavior data to improve predictive analytics. MIS Quarterly 40(4):869-888

Mazurowski MA, Habas PA, Zurada JM, Lo JY, Baker JA, Tourassi GD (2008) Training neural network classifiers for medical decision making: The effects of imbalanced datasets on classification performance. Neural Networks 21(2-3):427-436. DOI 10.1016/j.neunet.2007.12.031

Mease D, Wyner AJ, Buja A (2007) Boosted classification trees and class probability/quantile estimation. Journal of Machine Learning Research 8:409-439

Nemenyi P (1963) Distribution-free multiple comparisons. Dissertation, Princeton University

Newman MEJ, Girvan M (2004) Finding and evaluating community structure in networks. Physical Review E 69:026113. DOI 10.1103/PhysRevE.69.026113

Ng AY (2004) Feature selection, L1 vs. L2 regularization, and rotational invariance. In: Proceedings of the Twenty-first International Conference on Machine Learning, ACM, New York, NY, USA, ICML '04, pp 78-. DOI 10.1145/1015330.1015435

Ng AY, Jordan MI (2002) On discriminative vs. generative classifiers: A comparison of logistic regression and naive Bayes. In: Advances in Neural Information Processing Systems 14, MIT Press, pp 841-848

Ngai E, Hu Y, Wong Y, Chen Y, Sun X (2011) The application of data mining techniques in financial fraud detection: A classification framework and an academic review of literature. Decision Support Systems 50(3):559-569. DOI 10.1016/j.dss.2010.08.006

Platt JC (1999) Probabilistic outputs for support vector machines and comparisons to regularized likelihood methods. In: Advances in Large-Margin Classifiers, MIT Press, pp 61-74

Porter MA, Onnela JP, Mucha PJ (2009) Communities in networks. Notices of the American Mathematical Society 56(9):1082-1097

Provost F, Fawcett T (2013) Data Science for Business: What you need to know about data mining and data-analytic thinking. O'Reilly Media, Inc

Provost F, Dalessandro B, Hook R, Zhang X, Murray A (2009) Audience selection for on-line brand advertising: Privacy-friendly social network targeting. In: Proceedings of the 15th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, ACM, New York, NY, USA, KDD '09, pp 707-716. DOI 10.1145/1557019.1557098


Raskutti B, Kowalczyk A (2004) Extreme re-balancing for SVMs: A case study. SIGKDD Explor Newsl 6(1):60-69. DOI 10.1145/1007730.1007739

Rosvall M, Bergstrom CT (2008) Maps of random walks on complex networks reveal community structure. Proceedings of the National Academy of Sciences 105(4):1118-1123. DOI 10.1073/pnas.0706851105

Schapire RE (1999) A brief introduction to boosting. In: Proceedings of the 16th International Joint Conference on Artificial Intelligence - Volume 2, Morgan Kaufmann Publishers Inc, San Francisco, CA, USA, IJCAI'99, pp 1401-1406

Schapire RE, Singer Y (1999) Improved boosting algorithms using confidence-rated predictions. Machine Learning 37(3):297-336. DOI 10.1023/A:1007614523901

Shmueli G (2017) Analyzing behavioral big data: Methodological, practical, ethical, and moral issues. Quality Engineering 29(1):57-74. DOI 10.1080/08982112.2016.1210979

Sobhani P, Viktor H, Matwin S (2015) Learning from imbalanced data using ensemble methods and cluster-based undersampling. In: New Frontiers in Mining Complex Patterns: Third International Workshop, NFMCP 2014, Held in Conjunction with ECML-PKDD 2014, Nancy, France, September 19, 2014, Revised Selected Papers. Springer International Publishing, Cham, pp 69-83. DOI 10.1007/978-3-319-17876-9_5

Stankova M (2016) Classification within network data with a bipartite structure. Dissertation, University of Antwerp

Stankova M, Martens D, Provost F (2015) Classification over bipartite graphs through projection. Working Papers 2015001, University of Antwerp, Faculty of Applied Economics

Sun Y, Kamel MS, Wong AK, Wang Y (2007) Cost-sensitive boosting for classification of imbalanced data. Pattern Recognition 40(12):3358-3378. DOI 10.1016/j.patcog.2007.04.009

Suykens JA, Van Gestel T, De Brabanter J, De Moor B, Vandewalle J (2002) Least Squares Support Vector Machines. World Scientific, Singapore

Tang B, He H (2015) ENN: Extended nearest neighbor method for pattern recognition [research frontier]. IEEE Computational Intelligence Magazine 10(3):52-60. DOI 10.1109/MCI.2015.2437512

Tang Y, Zhang YQ, Chawla NV, Krasser S (2009) SVMs modeling for highly imbalanced classification. IEEE Transactions on Systems, Man, and Cybernetics, Part B (Cybernetics) 39(1):281-288. DOI 10.1109/TSMCB.2008.2002909

Tobback E, Moeyersoms J, Stankova M, Martens D (2016) Bankruptcy prediction for SMEs using relational data. Working Paper 2016004, University of Antwerp, Faculty of Applied Economics

Verbeke W, Dejaeger K, Martens D, Hur J, Baesens B (2012) New insights into churn prediction in the telecommunication sector: A profit driven data mining approach. European Journal of Operational Research 218(1):211-229. DOI 10.1016/j.ejor.2011.09.031

Veropoulos K, Campbell I, Cristianini N (1999) Controlling the sensitivity of support vector machines. In: Proceedings of the International Joint Conference on Artificial Intelligence, Stockholm, Sweden (IJCAI99), pp 55-60

Whitrow C, Hand DJ, Juszczak P, Weston D, Adams NM (2009) Transaction aggregation as a strategy for credit card fraud detection. Data Mining and Knowledge Discovery 18(1):30-55. DOI 10.1007/s10618-008-0116-z

Wickramaratna J, Holden SB, Buxton BF (2001) Performance degradation in boosting. In: Proceedings of the Second International Workshop on Multiple Classifier Systems, Springer, London, UK, MCS '01, pp 11-21

Yen SJ, Lee YS (2009) Cluster-based under-sampling approaches for imbalanced data distributions. Expert Systems with Applications 36(3, Part 1):5718-5727. DOI 10.1016/j.eswa.2008.06.108

Yu HF, Lo HY, Hsieh HP, Lou JK, McKenzie TG, Chou JW, Chung PH, Ho CH, Chang CF, Wei YH, et al (2010) Feature engineering and classifier ensemble for KDD Cup 2010. In: Proceedings of the KDD Cup 2010 Workshop, pp 1-16

Zha H, He X, Ding C, Simon H, Gu M (2001) Bipartite graph partitioning and data clustering. In: Proceedings of the Tenth International Conference on Information and Knowledge Management, ACM, New York, NY, USA, CIKM '01, pp 25-32. DOI 10.1145/502585.502591

Zhang J, Mani I (2003) KNN approach to unbalanced data distributions: A case study involving information extraction. In: Proceedings of the ICML'2003 Workshop on Learning from Imbalanced Datasets, Washington, DC

Ziegler CN, McNee SM, Konstan JA, Lausen G (2005) Improving recommendation lists through topic diversification. In: Proceedings of the 14th International Conference on World Wide Web, ACM, New York, NY, USA, WWW '05, pp 22-32. DOI 10.1145/1060745.1060754



Algorithm 1 SMOTE_beh and ADASYN_beh pseudo-code implementation for binary behaviour data

Input: X_min, X_maj, β, prior_opt, sim_measure, K
a) Determine the total amount of synthetic minority instances that need to be generated (β ∈ [0,1] is a parameter that controls the amount of oversampling; β = 1 means a fully balanced dataset will be created):
    G = (|X_maj| − |X_min|) × β
b) Determine the number of synthetic samples g_i that need to be generated for each minority class instance x_i:
    if SMOTE then
        g_i ← ⌈G / |X_min|⌉
    else if ADASYN then
        calculate the K nearest neighbours (with the sim_measure option) for instance x_i from the set (X_min \ x_i) ∪ X_maj and determine Δ_i, the number of majority class nearest neighbours. Next, calculate r_i = Δ_i / K and normalize these values: r̂_i = r_i / Σ_{j=1..|X_min|} r_j
        g_i ← ⌈r̂_i × G⌉
    end if
c) Generate g_i synthetic samples for minority instance x_i:
    calculate the K nearest neighbours (with the sim_measure option) for instance x_i from the set X_min \ x_i. Additionally, remove those nearest neighbours that have a similarity of 0 with x_i. The remaining nearest neighbours form the set K_used. If this set turns out to be empty, set K_used = {x_i}.
    for iter = 1 → g_i do
        randomly choose 1 nearest neighbour from the set K_used
        generate a synthetic minority sample from x_i and the chosen nearest neighbour (according to prior_opt)
    end for
d) Because Σ_i g_i ≥ G, randomly remove synthetic points until the total number of synthetic samples equals G.

3.2 Undersampling

In this section we will compare the simple random undersampling technique (RUS) with informed undersampling approaches. The first method randomly discards majority class training instances. While this technique can achieve fast training performance of the underlying base learner, an obvious disadvantage is the fact that it might discard potentially useful majority class instances. The informed approaches try to intelligently retain the most informative majority class instances, in the hope of increasing predictive performance while at the same time keeping the fast training speed of the underlying classifier.

The first set of informed undersampling techniques are based on the methods proposed by Zhang and Mani (2003); Chyi (2003). The K-nearest neighbour 'classifier' is used to determine the importance of each majority class training instance by calculating the total similarity with the K closest minority class training set examples. Regarding similarity computations for binary behaviour data, we refer to the related discussion in Section 3.1. The first technique, called "Closest Knn", retains majority class examples that are closest to the minority class instances. These instances are the most difficult to classify and we would expect them to be most informative. The second method, called "Closest tot sim", is similar to the previously described technique. The difference is that it no longer computes similarities with the K closest minority neighbours; instead, it calculates the total similarity with all minority instances in determining the importance. The main reason we included this technique is computational speed⁷. The last techniques, called "Farthest Knn" and "Farthest tot sim", are included for comparison with the previously mentioned techniques. Their implementation is identical to the previously described techniques; however, they retain majority class examples that are farthest from the minority class instances. For each of the proposed methods in this paragraph, the amount of undersampling is controlled by a user-specified parameter β_u according to the following formula:

Nr_rem = ⌊(|X_maj| − |X_min|) × β_u⌋     (2)

where Nr_rem represents the amount of majority class instances to be discarded; β_u = 1 means a completely balanced dataset is obtained.
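A minimal sketch of the "Closest Knn" selection, with Equation (2) fixing the number of removals, might look as follows; cosine similarity is assumed as the similarity measure, and the function name is our own.

```python
import numpy as np

def closest_knn_undersample(X_min, X_maj, beta_u=1.0, K=5):
    """Keep the majority instances most similar to the minority class,
    discarding Nr_rem = floor((|X_maj| - |X_min|) * beta_u) as in Eq. (2)."""
    nr_rem = int(np.floor((len(X_maj) - len(X_min)) * beta_u))
    # cosine similarity between every majority and minority instance
    Mn = X_min / (np.linalg.norm(X_min, axis=1, keepdims=True) + 1e-12)
    Mj = X_maj / (np.linalg.norm(X_maj, axis=1, keepdims=True) + 1e-12)
    S = Mj @ Mn.T
    K = min(K, S.shape[1])
    # importance = total similarity with the K closest minority instances
    imp = np.sort(S, axis=1)[:, -K:].sum(axis=1)
    keep = np.argsort(imp)[::-1][: len(X_maj) - nr_rem]
    return X_maj[np.sort(keep)]
```

"Closest tot sim" would simply replace the top-K sum with `S.sum(axis=1)`, and the "Farthest" variants would keep the lowest-importance instances instead.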

The second set of informed undersampling techniques aim at targeting the within-class imbalance problem and are based on the approach proposed in Sobhani et al (2015). They postulate that this within-class imbalance problem is more pronounced in the case of undersampling methods. If we were to randomly select majority class instances, then the probability of drawing an instance from small disjuncts within the majority class would be very low. These regions might therefore contain no representatives and remain unlearned. The authors chose to address this issue by clustering the majority class instances in a first step and subsequently selecting an equal number of representatives from each cluster. The reported results show their approach to outperform the CBO-algorithm (see Section 1.1). In the following paragraphs we will dig deeper into the subject of clustering behaviour data. We already refer to Algorithm 2 for an overview of our cluster-based undersampling method (CBU).

As we noted in Section 2.1, behaviour data can be represented as a bipartite graph. The clustering of behaviour data⁸ aims at finding groups of nodes (communities) that connect more to each other than to other nodes in the network. This subject is currently an active area of research, with a rapid evolution of a vast number of cluster detection techniques (Zha et al 2001; Dhillon 2001; Larremore et al 2014; Beckett 2016). We refer to Porter et al (2009); Fortunato (2010); Alzahrani and Horadam (2016) for detailed surveys on the problem. It should be noted that the vast majority of publications deal with the subject of clustering unigraphs (networks with only one type of nodes); it is only fairly recently that interest grew in the clustering of bigraphs. In our implementations, we have chosen the popular modularity-based approaches⁹ for clustering bigraphs, and these fall into two directions. In the first, the modularity function that is used for unigraphs is adapted to be suitable for bigraphs; see for instance the work of Barber (2007). The other direction, which we adopt in our study, projects the bigraph to a unigraph of bottom nodes and performs community detection on the projection using traditional modularity definitions. Note that Guimerà et al (2007) observed no difference in the obtained communities using either direction.

⁷ For each majority class instance, we no longer need to sort the similarities with all minority instances in determining the K largest values.

⁸ This subject is more commonly known as community detection in bipartite graphs.

⁹ Modularity-based approaches attempt to optimize a quality function known as modularity for finding community structures in networks, and rely on the use of heuristics due to the complexity of the problem.


In this article we adopt the methodology proposed in Alzahrani and Horadam (2016), which consists of projecting the bigraph to a unigraph of bottom nodes and applying¹⁰ the Louvain algorithm (Blondel et al 2008) on the projection. Lancichinetti and Fortunato (2009) performed a comparative study regarding the performance of 12 community detection algorithms and concluded the Louvain method to be the best modularity-based algorithm and second best among all algorithms. The heuristic is very fast, with an O(m) complexity, where m is the number of edges in the unigraph. We have chosen the Louvain algorithm because of its speed performance and the availability of a toolbox (Jutla et al 2011-2016) that is directly compatible with our implementations. The toolbox provides a generalized implementation of the Louvain algorithm in the sense that multiple definitions of modularity are possible. The quality function we chose is the popular Newman-Girvan modularity (Newman and Girvan 2004).

With respect to the projection, Alzahrani and Horadam (2016) connect two bottom nodes if they have at least one top node in common; the connection weight between two bottom nodes in the projection is set to the number of shared top nodes. In our implementation, we adapted the connection weights in accordance with Stankova et al (2015) as follows: first of all, we assign weights to the top nodes corresponding to the hyperbolic tangent applied to the inverse degree of the top node. Next, the connection weight between two bottom nodes in the projection corresponds with the total weight of the shared top nodes. Top nodes having low degrees therefore obtain a higher contribution in the projection (e.g. two users making a transaction to a local book store are assumed to be more closely connected to each other than two users making a transaction to a large retail store).
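The tanh-weighted projection just described can be sketched as follows; dense numpy arrays are used for readability, whereas a sparse implementation would be preferred for real behaviour data.

```python
import numpy as np

def project_bigraph(A):
    """Project a bigraph (rows = bottom nodes, columns = top nodes) onto a
    weighted unigraph of bottom nodes: each top node gets weight
    tanh(1/degree), and the weight between two bottom nodes is the total
    weight of their shared top nodes."""
    A = np.asarray(A, dtype=float)
    deg = A.sum(axis=0)                      # top-node degrees
    w = np.where(deg > 0, np.tanh(1.0 / np.maximum(deg, 1)), 0.0)
    W = (A * w) @ A.T                        # total shared top-node weight
    np.fill_diagonal(W, 0.0)                 # no self-loops
    return W
```

A top node of degree 1 contributes tanh(1) ≈ 0.76 to its (single) bottom node's edges, while one of degree 100 contributes tanh(0.01) ≈ 0.01 to each pair it connects, matching the book-store intuition above.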

In our CBU algorithm (see Algorithm 2), after clustering the bigraph containing exclusively majority class instances, we randomly select an equal number of majority instances from each community to target the within-class imbalance problem. In the rare situation that the number of obtained clusters exceeds the required number of majority instances (Nr_retain), we sort the communities according to a user-specified parameter Clust_opt and randomly select 1 instance from the first Nr_retain clusters. The parameter Clust_opt can take the following values:

– C_Smallest, where we sort clusters in ascending order of their size
– C_Largest, where we sort clusters in descending order of their size

Note that we randomly select majority class instances from each cluster; Yen and Lee (2009) found a random selection strategy after clustering to be superior to informed approaches based on distance.

10 Note that they also made use of the flow-based algorithm Infomap (Rosvall and Bergstrom 2008), which shows excellent results on the LFR-benchmark.

Imbalanced classification in sparse and large behaviour datasets 15

Algorithm 2 CBU pseudo-code implementation for behaviour data
Input: X_min, X_maj, β_u, Clust_opt
a) Cluster the majority class instances X_maj:
– Assign weights to each top node, corresponding with the hyperbolic tangent applied to the inverse of the node's degree.
– Project the bigraph X_maj to a weighted unigraph consisting of bottom node majority class instances. The weight w_ij between majority class instances i and j corresponds with the total weight of the shared top nodes.
– Apply the Louvain algorithm (Blondel et al 2008) on the projected unigraph to partition the majority class instances into clusters.
b) Select majority class instances:
Nr_rem ← ⌊(|X_maj| − |X_min|) × β_u⌋ (see Equation (2))
Nr_retain ← |X_maj| − Nr_rem
if Nr_retain < |Clust| then
– Sort clusters according to Clust_opt
– Randomly select 1 instance from the first Nr_retain clusters
else
– Randomly select ⌈Nr_retain / |Clust|⌉ majority class instances from each cluster
– Randomly discard instances from the previous step until its size corresponds with Nr_retain
end if
c) Return the new training set consisting of X_min and the selected majority class instances from step b)
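The selection logic of step b) can be sketched as follows, assuming the Louvain step has already produced a partition of the majority instances; the function and variable names are illustrative, not the authors' code:

```python
import math
import random

def cbu_select(majority, minority_size, clusters, beta_u, clust_opt="C_Smallest"):
    """Sketch of step b) of Algorithm 2.

    `clusters` maps cluster id -> list of majority instances (the Louvain
    partition); beta_u lies in [0, 1]. Returns the retained majority instances.
    """
    nr_rem = math.floor((len(majority) - minority_size) * beta_u)
    nr_retain = len(majority) - nr_rem
    groups = sorted(clusters.values(), key=len,
                    reverse=(clust_opt == "C_Largest"))
    if nr_retain < len(groups):
        # one random instance from each of the first nr_retain clusters
        return [random.choice(g) for g in groups[:nr_retain]]
    quota = math.ceil(nr_retain / len(groups))
    # clusters smaller than the quota contribute all their members
    selected = [x for g in groups for x in random.sample(g, min(quota, len(g)))]
    random.shuffle(selected)
    return selected[:nr_retain]    # randomly discard the surplus
```

Sampling the same quota from every cluster, regardless of cluster size, is what counters the within-class imbalance: small disjuncts are represented as strongly as large communities.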

3.3 Boosting, cost-sensitive learning and EasyEnsemble

3.3.1 AdaBoost

The AdaBoost (Schapire and Singer 1999; Schapire 1999) algorithm has been designed from the perspective of improving the performance of a weak learner so that it achieves accuracies that are comparable with a strong learning algorithm. Fundamental to the idea of boosting is to maintain a weight distribution over the training set. In each boosting iteration the weights of wrongly classified instances are increased, so that the underlying weak learner puts more emphasis on these hard examples. In our implementation we will consider using a SVM as base learner. SVMs are generally regarded as strong learners. The studies of Wickramaratna et al (2001) and García and Lozano (2007) note that using a strong learner usually results in performance degradation during the boosting process; the RBF-kernel SVM classifier (Wickramaratna et al 2001; Li et al 2008) is used as the underlying classifier to prove their point. In our study we will employ a linear SVM, which can be considered a weaker version compared to the RBF-kernel. Furthermore, the regularization parameter C can be viewed as a "weakness" indicator¹¹. Lowering the C-value results in weaker learners, as can be seen from the goal function of the SVM optimization problem (see Equation (1)). We will come back to this point in Section 4.5.

11 The distinction between weak/strong learners is loosely 'defined' in Schapire (1999). A weak learner corresponds with a hypothesis that performs just slightly better than random guessing; a strong learner is able to generate a hypothesis with an arbitrarily low error rate, given enough data. We adopt these definitions but consider the distinction between weak/strong based on training set error. In a SVM context it is quite typical that error levels on training data drop with increasing C-values (Suykens et al 2002). A learner that is 'too strong' means that, even though its performance on training data is very high, it fails to generalize well and the test set error increases due to overfitting.


The boosting algorithm with underlying SVM is presented in Algorithm 3. This algorithm closely follows the original boosting implementation presented in Schapire and Singer (1999) and requires each learner to output confidence rated predictions in the interval [−1, 1]. Since the SVM outputs real-valued scores, we apply the procedure of Platt (1999) to transform these scores into probability estimates (which can easily be translated to confidence rated predictions). A logistic regression (LR) model (Ng and Jordan 2002) is trained using the SVM-scores as input and the corresponding labels as output. Note that the same data that are used to construct the linear SVM can be used to estimate the LR-model; see Platt (1999) for a motivation.
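For illustration, the core of Platt's score-to-probability mapping can be sketched as a sigmoid fit; this toy version fits P(y = +1 | s) = 1/(1 + exp(A·s + B)) by plain gradient descent rather than the regularized Newton procedure Platt describes, and all names are our own:

```python
import math

def platt_scale(scores, labels, lr=0.01, iters=5000):
    """Fit P(y=+1 | s) = 1 / (1 + exp(A*s + B)) on SVM decision scores.

    `labels` are in {-1, +1}; returns a score -> probability mapping."""
    A, B = 0.0, 0.0
    targets = [(y + 1) / 2.0 for y in labels]          # map {-1,+1} -> {0,1}
    for _ in range(iters):
        grad_a = grad_b = 0.0
        for s, t in zip(scores, targets):
            p = 1.0 / (1.0 + math.exp(A * s + B))
            grad_a += (t - p) * s      # d(neg log-likelihood)/dA
            grad_b += (t - p)          # d(neg log-likelihood)/dB
        A -= lr * grad_a
        B -= lr * grad_b
    return lambda s: 1.0 / (1.0 + math.exp(A * s + B))
```

Confidence-rated predictions in [−1, 1] then follow as 2·P(y = +1 | s) − 1.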

The boosting algorithm requires the weak learner to be trained using a distribution D_t. One could sample from this distribution to generate (unweighted) examples and train a SVM next. We have chosen to include weights in the SVM goal function formulation (this requires an extension of the LIBLINEAR package):

min_{w,b,ξ_i}  (w^T w)/2 + C Σ_{i=1}^{m} weight_i ξ_i        (3)

The weights weight_i are set according to the weight distribution D_t(i) in each round of boosting. Note that the C-value that is used to train the SVM model will be divided by mean(weight_i). This normalization allows for a fair comparison between the weighted and unweighted SVM versions (e.g. in the first round of boosting, weight_i = 1/m for all instances; by multiplying the C-value with m, this corresponds to solving the unweighted problem with the same C-value).

We introduce an additional parameter μ, called the weight percentage, with values in [0, 100], in the boosting algorithm of Algorithm 3. This parameter controls the amount of training data that is used to construct the SVM model and subsequent LR model. We sort the original training data in descending order according to the distribution D_t. Next, we form a new training set of minimal cardinality by including points from the original sorted training data until the total weight is higher than the weight percentage μ/100. This way, the newly formed training data will contain only the part of the original training data that has the most weight. This partial set will then be used to construct a weighted SVM-model (according to Equation (3), with updated distribution for this set) and a subsequent LR-model. The idea of using a partial dataset to construct the base learner not only reduces training times but also weakens the learner (García and Lozano 2007).
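This selection step can be sketched as follows (an illustrative helper of our own, not from the paper's code):

```python
def heaviest_subset(weights, mu):
    """Indices of the minimal set of heaviest instances whose cumulative
    weight exceeds mu/100; `weights` is the boosting distribution D_t."""
    order = sorted(range(len(weights)), key=lambda i: -weights[i])
    chosen, total = [], 0.0
    for i in order:
        chosen.append(i)
        total += weights[i]
        if total > mu / 100.0:
            break
    return chosen
```

Since D_t sums to 1, μ = 100 can never be strictly exceeded, so that setting simply keeps the full training set.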

In Algorithm 3 we have an explicit check to verify if r_AB = 1. In this case the SVM model outputs scores that allow for a perfect classification on the training set. The subsequent LR model will find a threshold and output a value of −1 if the SVM score is lower than this threshold (a value of +1 if the score is higher). In this situation the training data would be perfectly classified. In our implementation we attempt to avoid this kind of behaviour because it can lead to overfitting, meaning it might pinpoint the wrong threshold and make too drastic decisions (we lose the meaning of confidence). The check if r_AB ≤ 0 verifies if the currently boosted model is performing worse than random (a random model would have an r_AB-value of 0). Obviously, if the model performs worse than random, we quit the boosting process. During the first round of boosting we perform similar checks that are not explicitly indicated in Algorithm 3: in the case where r_AB = 1, we output the SVM scores instead of the LR binary values; when r_AB ≤ 0, we quit the boosting process and output the LR scores.

Algorithm 3 AdaBoost with a SVM-LR combination as a base learner
Input: (X, Y) = (x_1, y_1), …, (x_m, y_m); C; T; μ
Initialize distribution D_1(i) = 1/m
for t = 1 to T do
– train weak learner using distribution D_t. The weak learner consists of a weighted linear SVM and LR model, trained with weight percentage μ of D_t:
    h_t ← Train_WeakLearner(X, Y, D_t, C, μ)
– compute the weighted confidence r_AB on the training data:
    r_AB ← Σ_{i=1}^{m} D_t(i) y_i h_t(x_i)
  If (r_AB = 1 or r_AB ≤ 0) then α_t ← 0 and stop the boosting process
– choose α_t ∈ R:
    α_t ← (1/2) log((1 + r_AB) / (1 − r_AB))
– update distribution:
    D_{t+1}(i) ← D_t(i) exp(−α_t y_i h_t(x_i)) / Z_t
  where Z_t is a normalization factor (chosen so that D_{t+1} will be a distribution)
end for
Output the final hypothesis (the output score is the term contained in the sign function):
    H(x) = sign(Σ_{t=1}^{T} α̃_t h_t(x))   with α̃_t = α_t / Σ_{i=1}^{T} α_i
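To make the mechanics concrete, here is a self-contained toy version of this loop; a one-dimensional threshold "stump" stands in for the weighted SVM-LR learner (our simplification for illustration, not the paper's base learner):

```python
import math

def stump(xs, ys, D):
    """Weak learner: pick the threshold/sign maximizing r = sum_i D_i*y_i*h(x_i)."""
    best_r, best_h = -2.0, None
    for thr in sorted(set(xs)):
        for sgn in (1, -1):
            h = lambda x, t=thr, s=sgn: s if x >= t else -s
            r = sum(d * y * h(x) for x, y, d in zip(xs, ys, D))
            if r > best_r:
                best_r, best_h = r, h
    return best_h

def adaboost(xs, ys, T=10):
    """Confidence-rated AdaBoost skeleton following Algorithm 3."""
    m = len(xs)
    D = [1.0 / m] * m
    ensemble = []
    for _ in range(T):
        h = stump(xs, ys, D)
        r = sum(d * y * h(x) for x, y, d in zip(xs, ys, D))
        if r >= 1 or r <= 0:                 # perfect or worse than random
            break
        alpha = 0.5 * math.log((1 + r) / (1 - r))
        ensemble.append((alpha, h))
        D = [d * math.exp(-alpha * y * h(x)) for x, y, d in zip(xs, ys, D)]
        Z = sum(D)
        D = [d / Z for d in D]
    if not ensemble:                         # first learner already stopped
        return lambda x: 0.0
    total = sum(a for a, _ in ensemble)
    return lambda x: sum(a * h(x) for a, h in ensemble) / total
```

On the toy labelling below no single stump is consistent, yet after three rounds the weighted vote classifies all four points correctly, which is the "combination of weak learners beats a single learner" effect the paper exploits.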

3.3.2 AdaCost

The AdaCost algorithm (Fan et al 1999) is a variant of cost-sensitive learning where misclassification costs are introduced in the weight-update formula of AdaBoost. The cost-sensitive update rule increases the weights of costly misclassified instances more aggressively and decreases the weights of costly correct classifications more conservatively. Each instance is given a misclassification cost c_i, where we chose to put c_i = 1 for positive (minority) instances and c_i = 1/R for negative (majority) instances; R is a user defined value that allows one to put more emphasis on the minority class. The implementation of AdaCost is similar to Algorithm 3, yet there are a few differences (Fan et al 1999): firstly, the initial distribution is chosen as D_1(i) = c_i / Σ_{j=1}^{m} c_j; secondly, the weight update rule is given by D_{t+1}(i) = D_t(i) exp(−α_t y_i h_t(x_i) β(i)) / Z_t, where β(i) = −0.5 sign(y_i h_t(x_i)) c_i + 0.5 is a cost-adjustment function; finally, the choice of α_t is given by α_t = (1/2) log((1 + r_AC) / (1 − r_AC)), where r_AC = Σ_{i=1}^{m} D_t(i) y_i h_t(x_i) β(i). Note that the checks to stop the boosting process prematurely (see the second bullet in Algorithm 3) are still based on the r-value obtained from AdaBoost (r_AB). This is because β(i) ∈ [0, 1].
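A single AdaCost weight update can be sketched as follows (toy code under our own naming; `hs` holds confidence-rated predictions and `costs` the c_i values):

```python
import math

def adacost_update(D, ys, hs, costs):
    """One AdaCost round: beta(i) = -0.5*sign(y_i*h_i)*c_i + 0.5, so costly
    mistakes gain weight aggressively while costly correct classifications
    lose weight only conservatively. Returns the new distribution and alpha."""
    sign = lambda v: (v > 0) - (v < 0)
    beta = [-0.5 * sign(y * h) * c + 0.5 for y, h, c in zip(ys, hs, costs)]
    r = sum(d * y * h * b for d, y, h, b in zip(D, ys, hs, beta))
    alpha = 0.5 * math.log((1 + r) / (1 - r))
    new_D = [d * math.exp(-alpha * y * h * b)
             for d, y, h, b in zip(D, ys, hs, beta)]
    Z = sum(new_D)
    return [d / Z for d in new_D], alpha
```

With c_i = 1 for the minority and c_i = 1/R for the majority, a misclassified minority instance (β = 1) gains more weight than a misclassified majority instance (β = 0.75 for R = 2), while a correctly classified minority instance (β = 0) keeps its weight untouched before normalization.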

In the first boosting round of AdaCost, the weighted SVM formulation (3) is equivalent to solving a SVM formulation with the following goal function:

min_{w,b,ξ_i}  (w^T w)/2 + C^+ Σ_{i|y_i=1} ξ_i + C^− Σ_{i|y_i=−1} ξ_i        (4)

where C^+/C^− = R. This can be seen as a cost-sensitive version of a SVM, an idea initially proposed by Veropoulos et al (1999).

3.3.3 EasyEnsemble

One of the disadvantages of the random undersampling method is the fact that we are discarding potentially valuable information. EasyEnsemble (Liu et al 2009) is a method that combines several balanced subsets (S in total), containing randomly selected majority class instances together with all minority examples. Each subset contains the same number of instances from both classes and is fed to the boosting algorithm presented in Algorithm 3. Afterwards, the weak learners h_{s,t} of each subset s are simply combined to form the final ensemble:

H(x) = sign(Σ_{s=1}^{S} Σ_{t=1}^{T} α_{s,t} h_{s,t}(x)),  with s = 1, …, S and t = 1, …, T        (5)

It is clear that this technique benefits from a combination of bagging and boosting (Liu et al 2009). Note that we apply the same boosting algorithm as previously described to each of the balanced subsets. However, when r_AB = 1 in the first round of boosting, we quit the boosting process, put α_1 = 1 and continue to use the trained LR-model in the final ensemble. It was previously noted that this can cause overfitting, in the sense that it can pinpoint a wrong threshold. However, the LR-model still contains information, and its combination with the models obtained from the other subsets remains valuable (as we have noted from initial experiments by comparing the situations where we include or reject those subsets).

There are a few subtle though important differences with respect to the experiments performed by Liu et al (2009). First of all, the authors use a CART-algorithm (Breiman et al 1984) as base learner; we employ a linear SVM with subsequent LR as a weak learner. To our knowledge this combination has not been proposed elsewhere, yet it proves to be very efficient in this setting. Secondly, their methodology employs a discrete version of AdaBoost, where each weak learner outputs binary values in {−1, 1}. As already stated, we make use of an improved version of AdaBoost that relies on confidence rated predictions (Schapire and Singer 1999). Finally, the authors reported performances with fixed levels of S = 4 and T = 10; we will investigate the effect of varying S and T levels.


4 Results and discussion

4.1 Datasets

Stankova et al (2015) provide the first large collection of benchmark behaviour datasets for classification. In our experiments we make use of these data sources and extend this repository with two additional datasets. Each of these datasets shows a bipartite structure with a clear target variable to predict. We refer to this study and the next paragraph for a short description of the available data resources. In this paragraph we indicate why we have chosen to include or reject certain data sources from the aforementioned study. The available datasets can be divided into small, medium and large datasets, based on the number of instances and the number of features present. The Norwegian companies and Reality Mining datasets comprise the small datasets: they contain only a few hundreds of instances or features and are therefore regarded as impractical for our purposes. The MovieLens, Yahoo, TaFeng, Book-Crossing and LibimSeTi datasets belong to the medium sized datasets, each containing a few thousands up to a few hundreds of thousands of instances and features. All of these datasets are included in our study. The large datasets, containing hundreds of thousands up to millions of instances and features, are the Flickr and Kdd databases¹². Other large proprietary data sources, not included in Stankova et al (2015), are the corporate residence fraud (CRF) and banking (Bank) datasets, which arise from real-life application domains with intrinsic imbalance.

To summarize, we have gathered datasets containing such fine-grained behaviour data from a wide variety of application domains. The MovieLens datasets, for which we are predicting the gender¹³ (Mov G) or the genre thriller¹⁴ (Mov Th), provide data on which films each user has rated. The Yahoo movies¹⁵ dataset has a similar structure, where the age of each user, Yahoo A (above or below average), or the gender, Yahoo G, is being predicted. The TaFeng dataset¹⁶ contains data on shopping behaviour, where age (below or above average) is being predicted based upon which products are purchased. In the Book-Crossing (Book) dataset (Ziegler et al 2005), users rate books and the age of the user (above or below average) is being predicted. LibimSeTi (LST) contains data from a dating site (Brozovsky and Petricek 2007), where users rate each other's profile and gender is being predicted. In the advertisement (Adver) dataset (Lichman 2013) we try to predict if a URL is an advertisement, based on a large variety of binary features of the URL. Note that this dataset does not arise from the behaviour of entities, yet it still has a high dimensional and sparse representation. The Flickr dataset (Cha et al 2009) contains pictures being marked by users as favorite, and we predict the number of comments on each picture (below or above average). In the Kdd cup data, the performance of a student on a test is being predicted, based on artificially created binary features (Yu et al 2010). The corporate residence fraud dataset (CRF) (Junque de Fortuny et al 2014b) contains data on foreign companies making transactions with specific Belgian companies, where we try to predict whether the foreign company commits residence fraud (a type of fiscal fraud). Finally, the banking dataset (Bank) (Martens et al 2016) contains detailed behaviour on consumers making transactions with merchants or other persons, to predict interest in a pension fund product. Some characteristics of these datasets can be found in Table 2. The features column only shows the number of active features¹⁷.

12 Flickr and KDD will be excluded in the comparative study of Section 4.6. This is because some methods are too computationally intensive (especially in combination with the large number of possible parameter combinations) to be applied on these very large data sources. Furthermore, our statistical evidence is already sufficiently strong to conclude significance without these datasets. Having said this, these data sources will be included in the analysis of Section 5.
13 MovieLens 1M dataset from http://grouplens.org/datasets/movielens
14 MovieLens 10M dataset from http://grouplens.org/datasets/movielens
15 https://webscope.sandbox.yahoo.com
16 http://www.bigdatalab.ac.cn/benchmark/bm/dd?data=Ta-Feng

Table 2 Behaviour data characteristics. The final column shows the imbalance ratio p, defined as the ratio of the number of minority class instances to the number of majority class instances in the training set, expressed as a percentage. See Section 4.2 for details regarding p.

Name      |X_maj|    |X_min|    Features    p = 100 × |X_min|_train / |X_maj|_train
Mov G        4331       1709        3706    1 & 25
Mov Th      10546        131       69878    1.24 (p = [])
Yahoo A      6030       1612       11915    1 & 25
Yahoo G      5436       2206       11915    1 & 25
TaFeng      17330      14310       23719    1 & 25
Book        42900      18858      282973    1 & 25
LST         59702      60145      166353    1
Adver        2792        457        1555    16.38 (p = []) & 1
CRF        869071         62      108753    0.0072 (p = [])
Bank      1193619      11107     3139570    0.93 (p = [])
Flickr    8166814    3028330      497472    0.1
Kdd       7171885    1235867    19306083    0.5

4.2 Methodology

Regarding the experiments performed in the upcoming sections, we applied a tenfold cross-validation procedure. Each of the folds contains 80% training data, 10% validation data and 10% test data. Note that these percentages are valid for both the majority class and the minority class (stratified sampling). As can be seen from Table 2, some datasets are balanced in nature. We created artificial imbalance for these datasets by removing minority class instances from the initial training set, according to a user defined parameter p. We ensured that the number of minority training instances corresponds to p percent of the majority class training size: |X_min|_train = (p/100) × |X_maj|_train. As an example, say that we are using the Book dataset with p = 25. In that case we know that the majority class contains 34320 training instances (80% of |X_maj| = 42900). The minority training data would contain 8580 instances (25% of 34320). When the dataset is already imbalanced, we define p = [], which means that no downsampling of the minority class training data is performed. Note that the validation and test data are left untouched¹⁸.

17 Active features represent features that are present for at least one instance in the dataset. A non-active feature corresponds with a column of zeros in the matrix representation and would not contribute to the model.
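The Book example can be checked numerically with a small helper of our own making (p=None mimics p = []):

```python
def training_sizes(n_maj, n_min, p, train_frac=0.8):
    """Majority/minority training set sizes after the stratified split;
    with a given p, the minority class is downsampled to p percent of
    the majority training size (p=None leaves it untouched)."""
    maj_train = int(train_frac * n_maj)
    if p is None:
        return maj_train, int(train_frac * n_min)
    return maj_train, int(p / 100.0 * maj_train)
```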

The methods detailed in Section 3 are applied on the training data. For both the under- and oversampling approaches, a linear SVM is trained on the newly created balanced training data, with regularization parameter C having values

C = [10^−7, 10^−5, 10^−3, 10^−1, 10^0]

The validation data are used for parameter tuning purposes. The test data allow us to obtain the generalization performance. The results reported show the average over ten folds. In the remaining paragraphs of this section we will describe the various parameter settings used in our experiments and give a brief overview of the parameters occurring in each method.

Considering the oversampling techniques, the parameter settings are as follows:

β = [0, 1/3, 2/3, 1]
prior_opt = {FlipCoin, Reverse Priors}
sim_measure = {Cosine, Jaccard}
K = [10^0, 10^1, 10^2, |X_min|_train]

We didn't include the "Prior" option, due to initial experiments showing a lower performance in comparison with the other options. This can be explained by the low priors occurring in each column, resulting in synthetic samples that mainly show zeros in 0-1 match situations. The oversampling with replacement method (OSR) only uses the β parameter, SMOTE uses all but the K parameter, and ADASYN uses all of the parameters listed above.

With respect to the undersampling techniques, the following parameter settings are used:

β_u = [0, 1/4, 1/2, 3/4, 1]
sim_measure = {Cosine, Jaccard}
K = [10^0, 10^1, 10^2, |X_min|_train]
Clust_opt = {C_Smallest, C_Largest}

The random undersampling (RUS) technique only uses the β_u parameter. The second set of methods, "Closest tot sim" and "Farthest tot sim", uses β_u and sim_measure. The third set of techniques, "Closest Knn" and "Farthest Knn", makes use of all the parameters listed above except for Clust_opt. The final approach, CBU, employs β_u and Clust_opt.

The boosting variants (AdaBoost, AdaCost and EasyEnsemble) presented in Section 3.3 make use of the following settings:

T = 30
μ = [100, 75]
C = [10^−7, 10^−5, 10^−3, 10^−1]
R = [2, 8, R_L], where R_L = |X_maj|_train / |X_min|_train
S = [5, 10, 15]

The AdaBoost algorithm includes the T, μ and C parameters. AdaCost additionally uses cost-ratios R. We have chosen a range of values because misclassification costs are unknown for many business applications (He and Garcia 2009; Fan et al 1999; Sun et al 2007). The final value R_L seems to be a popular choice (Akbani et al 2004; Luts et al 2010), because the total weight on the majority class balances with the total weight on the minority class. The final method, EasyEnsemble, uses S subsets in addition to the parameters previously mentioned for AdaBoost. Note that we consider the boosting iteration round t ∈ [1, T] as a tunable parameter¹⁹.

18 This means that if we start from a balanced set, only the training data will show artificial imbalance according to the imbalance ratio p; the validation and test data would remain balanced. Since AUC (and some other metrics) is independent of class skew, it would be unwise to make these sets imbalanced as well, because that would lead to discarding minority class instances that are relevant for performance assessment.

4.3 Oversampling

The oversampling techniques presented in Section 3.1 are applied to each of the datasets from Table 2, with varying imbalance levels (by varying the p-parameter). The experiments we conducted use the following methodology: for each dataset, we apply the oversampling techniques with all possible parameter combinations on the training data to create newly balanced datasets, after which linear SVMs are trained. The optimal parameter combination with respect to each imbalance ratio β is selected based on validation set AUC-performance. With these parameters, the AUC on the test data is obtained. Results show the average over ten folds.

The results on four arbitrarily selected datasets are shown in Table 3. Full results on each of the data sources can be found in Appendix A, Table 11. From these tables we can conclude that performance generally increases with growing β-values. It is more correct to say that performance keeps improving with growing β-levels until an optimal point β* is reached; increasing the balance level beyond this optimal value will cause only small fluctuations with respect to the optimal performance. Traditional studies dealing with dense low-dimensional data note that the OSR method can suffer from overfitting, as already mentioned in Section 3.1. It is interesting to see that in this sparse high-dimensional setting this effect doesn't seem to occur²⁰. Furthermore, the computationally expensive synthetic sampling approaches do not seem to improve performance over the simpler OSR method. This can be explained by the fact that many instances have a very limited number of non-zero elements (Junque de Fortuny et al 2014a; Stankova et al 2015), causing the synthetic sampling procedures to be limited in the number of unique new samples they can produce. Note also that synthetic sampling procedures have historically been designed to overcome the overfitting behaviour of OSR, which no longer seems to apply here.

19 We sometimes make use of T as a symbol to indicate the boosting iteration number instead of t.
20 The SVM regularization parameter C controls for overfitting. Additionally, the influence of majority class noise/outliers (see Section 3.2) on the learned hyperplane decreases as we oversample the minority class. This hyperplane is more sensitive towards minority instances, and a change in its direction/orientation towards these majority class noise/outliers will be more heavily penalized.


Table 3 Oversampling experiments. Results show the average tenfold AUC test set performance with respect to increasing β-values (β = [β1, β2, β3, β4] = [0, 1/3, 2/3, 1]). Additionally, standard deviations are shown between brackets. Optimal performances for each method are highlighted in boldface. See Section 4.2 for parameter settings.

Mov Th (p = [])
          β1             β2             β3             β4
OSR       79.77 (5.33)   85.3 (4.66)    83.16 (4.5)    84.59 (5.69)
SMOTE     79.77 (5.33)   84.18 (6.51)   85.58 (5.97)   84.33 (5.75)
ADASYN    79.77 (5.33)   84.11 (6.77)   85.86 (6.06)   85.36 (5.13)

Yahoo G (p = 25)
OSR       78.82 (1.39)   78.78 (2.02)   78.74 (1.86)   78.35 (1.47)
SMOTE     78.82 (1.39)   79.23 (1.57)   79.1 (1.2)     79.03 (1.89)
ADASYN    78.82 (1.39)   79.12 (1.43)   79.23 (1.37)   79.51 (2.01)

TaFeng (p = 25)
OSR       66.94 (1.34)   68.84 (1.21)   66.99 (1.42)   67.7 (1.41)
SMOTE     66.94 (1.34)   68.47 (1.5)    67.07 (1.15)   66.65 (0.81)
ADASYN    66.94 (1.34)   68.62 (1.38)   67.85 (1.6)    66.91 (1.39)

Book (p = 25)
OSR       60.08 (0.71)   61.05 (0.94)   61.82 (1.12)   64.62 (0.57)
SMOTE     60.08 (0.71)   62.6 (0.73)    60.95 (0.68)   63.0 (0.8)
ADASYN    60.08 (0.71)   62.33 (0.73)   60.77 (0.85)   63.04 (0.58)

4.4 Undersampling

Regarding the undersampling techniques, we employ a similar experimental set-up as in the previous section: firstly, train SVMs on the undersampled training data with all possible parameter combinations; secondly, choose a suitable parameter combination based on validation set AUC-performance; and finally, obtain the AUC-performance on the test set. The results are averaged across ten folds.

Before going into detail on the experimental findings, we give a short note on the effect of noise/outliers on SVM performance. Consider the following imaginary example: say that a majority class female is rating a lot of action films and thrillers. We can consider this an outlier, since most of the females are rating romantic or drama films. This specific female has far more in common with the minority class males, who also rate action films and thrillers. Outliers are therefore instances showing odd behaviour; noise are wrongfully labelled instances. The effect of noise/outliers on SVM performance can be severe. Indeed, many of the instances in the dataset have a support value (dual variable α_i) of 0. Instances contributing to the model (a non-zero support value) are examples close to the borderline (Suykens et al 2002) and noise/outliers (as can be derived from the KKT dual-complementarity conditions²¹).

21 If α_i = 0, then y_i(w^T x_i + b) ≥ 1. For noise/outliers the term y_i(w^T x_i + b) is negative, hence α_i ≠ 0.


With respect to noise/outliers, the support value can reach a maximal value of C, the chosen regularization parameter in the SVM formulation. It should be clear that lowering the C-value leads to models that are less sensitive to noise/outliers.

The experimental outcomes of the various undersampling processes are shown in Table 4 for the same four previously selected datasets. We refer to Appendix B, Table 12, for results on the entire data repository. When removing only a limited amount of majority class instances (β_u = β_u2 = 1/4), we observed that in 12 out of 16 datasets the "Farthest Knn" method outperforms the "Closest Knn" technique; the four remaining datasets show comparable performances²². This finding shows that the "Farthest" method is very suitable in removing majority class noise/outliers and empirically shows their performance degrading effect. With higher undersampling rates (β_u = 1), the "Closest Knn" method generally achieves higher performances than the "Farthest Knn" technique, though the results are less clear. We observed that only 9 out of 16 datasets indicate "Closest" as a clear winner, with 3 tie situations and 4 situations where "Farthest" shows the best results. Intuitively, we would expect the "Closest" method to perform better, since we are keeping the most informative instances (examples close to the minority class). However, this technique will also emphasize noise/outliers, and this is the main reason why the results are less significant. One can also see this from the fact that the RUS method shows far better results in comparison to the aforementioned techniques when β_u = 1, an observation that is valid for all datasets. The RUS method is efficient in realizing noise/outlier reduction effects, since each SVM is trained with only a fraction of these types of instances. It is remarkable to see that in 9 out of 16 datasets the RUS method with β_u = 1 outperforms the baseline model (β_u = 0); in 4 datasets we observed equal performance, and 3 losses with respect to the baseline. RUS shows two attractive features: majority class noise/outlier removal and its ability to put more emphasis on the minority class instances by undersampling the majority class. A general downside of the approach is the information loss, and this effect causes the 3 losses to the baseline. Usually, though, the attractive features of RUS dominate the latter effect, and this indicates that it is definitely not necessary to include all majority class training instances in constructing the predictive model. Apparently there is a high level of redundancy, which can be exploited to construct efficient hypotheses.

In the next paragraph we compare the cluster based undersampling (CBU) approach to the random undersampling (RUS) technique. Before proceeding, we should note that Table 12 does not show results on the LST, CRF and Bank datasets, due to the fact that the projected unigraph was too large to fit in memory (larger than 16 GB). A limited number of features (top nodes) does seem to be active for a relatively large number of instances (bottom nodes); all these instances will be connected in the unigraph projection, resulting in relatively high degrees for the bottom nodes. A possible solution to circumvent this problem might be to cut the weakest edges in the projected unigraph.

If we apply a limited amount of undersampling (β_u = β_u2 = 1/4), the CBU technique outperforms RUS in 8 out of 13 datasets (4 losses and 1 tie with respect to RUS). On the highly imbalanced datasets (p = 1 or p = []), the CBU method wins in 8 out of 8 cases. We can therefore conclude CBU to outperform RUS with low undersampling rates. This indeed shows that handling the within-class imbalance can be beneficial in this environment. If we make a comparison at the highest undersampling rate, β_u = β_u5 = 1, the results are completely opposite: CBU wins in only 1 case, with 9 losses with respect to RUS and 3 ties. On the highly imbalanced datasets, CBU outperforms RUS for 1 dataset (with 4 losses and 3 ties). It is clear that RUS is the preferred technique in case of high undersampling rates. The reason is that CBU emphasizes the small disjuncts and basically fails to adequately pick up on the common, larger communities. Furthermore, these small disjuncts might actually correspond with majority class noise/outliers, and CBU is focusing on these types of instances. On the overall level, where we consider all undersampling rates (except β_u = 0), CBU wins in 5 out of 13 datasets, RUS wins in 4 cases, and there are 4 tie situations (yet RUS did outperform CBU in all 4 cases). Both methods seem to be competitive to one another across a wide range of imbalance levels. If we pay particular attention to the highly imbalanced sets, CBU wins in 5 cases, and there are 3 tie situations on the remaining sets (yet RUS did outperform CBU in all 3 cases). CBU seems to be the method of choice (with respect to RUS) when the datasets show high imbalance levels.

22 A tie occurs in the situation where the absolute difference in AUC is smaller than or equal to 0.5.

Table 4 Undersampling experiments. Results show the average tenfold AUC test set performance with respect to increasing βu-values (βu = [βu1, βu2, βu3, βu4, βu5] = [0, 1/4, 1/2, 3/4, 1]). Optimal performances (boldface) are highlighted for each imbalance ratio. Additionally, standard deviations are shown between brackets. Cl K represents the "Closest Knn" technique, Cl T represents the "Closest tot sim" technique (similar for Far K and Far T, see Section 3.2). Parameter settings are given in Section 4.2.

Mov Th (p = [])
         βu1         βu2         βu3         βu4         βu5
RUS      7977(53)    8032(58)    8157(55)    8186(66)    8126(62)
Cl K     7977(53)    7925(45)    7807(5)     7625(65)    6246(85)
Cl T     7977(53)    784(44)     7241(35)    6466(45)    6037(73)
Far K    7977(53)    8454(5)     8364(64)    8002(73)    5682(103)
Far T    7977(53)    8503(57)    8268(68)    7561(92)    5677(109)
CBU      8011(58)    8117(6)     8108(65)    8417(51)    8096(69)

Yahoo G (p = 25)
         βu1         βu2         βu3         βu4         βu5
RUS      7882(14)    7891(16)    7897(16)    7861(16)    7782(21)
Cl K     7882(14)    7726(15)    7252(15)    6786(2)     6507(27)
Cl T     7882(14)    7683(1)     7199(18)    6715(23)    611(27)
Far K    7882(14)    7826(22)    7469(27)    6722(21)    6072(23)
Far T    7882(14)    7768(26)    7244(3)     6494(24)    596(2)
CBU      7525(32)    7522(24)    7469(23)    7307(24)    7069(24)

26 Jellis Vanhoeyveld David Martens

Table 4 continued

TaFeng (p = 25)
         βu1         βu2         βu3         βu4         βu5
RUS      6694(13)    6744(13)    681(14)     6827(14)    6613(12)
Cl K     6694(13)    6613(14)    6339(12)    5983(13)    5694(07)
Cl T     6694(13)    6638(15)    6289(16)    5746(13)    5456(13)
Far K    6694(13)    6806(14)    6643(16)    6446(15)    6335(13)
Far T    6694(13)    6431(11)    6269(1)     6127(11)    5903(1)
CBU      6481(12)    6415(11)    6413(12)    6388(08)    6346(08)

Book (p = 25)
         βu1         βu2         βu3         βu4         βu5
RUS      6008(07)    6013(06)    604(08)     6033(08)    6328(08)
Cl K     6008(07)    5996(07)    6013(08)    5996(1)     5928(07)
Cl T     6008(07)    5996(07)    6013(08)    6029(04)    545(09)
Far K    6008(07)    6329(1)     6419(08)    573(11)     5566(11)
Far T    6008(07)    6214(05)    5827(06)    5637(1)     5566(11)
CBU      5482(09)    5467(09)    5471(09)    5466(1)     5478(09)

4.5 Boosting variants

In this section the plain AdaBoost (AB) method is compared with the cost-sensitive AdaCost (AC) algorithm, the EasyEnsemble (EE) technique and the baseline (BL) method (where we train a single SVM on the imbalanced training data). AB, AC and EE have an underlying boosting process in common. The AUC-values are stored during each boosting iteration, which allows us to show the performance graphically as a function of the number of iterations t. In EasyEnsemble (EE) we combine the weak learners of each subset by summing their individual contributions. For example, when t = 2 boosting iterations are performed, we evaluate the performance of the combined learner $\sum_{s=1}^{S}\sum_{t=1}^{2}\alpha_{st}h_{st}(x)$. Figures 1, 2, 3 and 4 show the average tenfold AUC-performance (with μ = 100) on the test data. The left-side figures show the AB, AC (with varying R-values) and EE (with varying S-values) performance with respect to the number of boosting iterations. The C-value is tuned according to highest validation set AUC-performance (over all possible boosting iterations). The right-side figures investigate the influence of the C-parameter for AB and EE (with S = 15) and allow us to gain insight into the effect of weak and strong learners on the generalization performance. Results on the remaining datasets are given in Appendix C, Figures 6-17. Note that we only indicate results with weight-percentage μ = 100 (use all instances in the training process); previous experiments (with μ = 75) showed inferior results. This suggests that it is indeed important to include all minority instances during the training of each weak learner, an observation also made by Liu et al (2009).

When investigating the influence of the regularization parameter C in AdaBoost, we can graphically verify that it is a "weakness" indicator, in the sense that lower C-values correspond to weaker learners. While these low C-value learners have a low performance in the first rounds of boosting, the possible boost in performance is usually much higher compared to higher C-value inducers. Strong learners can already make a suitable distinction between minority and majority class behaviours in the first boosting round. The boosting process will emphasize the hard to learn instances, among which are noise/outliers. Therefore, too strong learners are not suitable to be used in a boosting process, as already noted by Wickramaratna et al (2001). In Figures 2(b) and 3(b) we observe that the lowest C-values (C = 10^-7, 10^-5) can outperform higher C-values (C = 10^-3, 10^-1). In many cases the AB-process outperforms the baseline (BL); see also the results in Table 5. A combination of several weak learners can often achieve better generalization than a single strong learner.

Inspired by the success of the RUS method with βu = 1 (for reasons indicated in Section 4.4), the EE-technique tries to overcome an obvious downside of this approach: we are removing a lot of majority class instances that can be very informative. Hence, we try to make up for this loss by combining several of these balanced training sets, in the hope that the majority class is more adequately explored. This results in a very effective learner that seems to outperform the AB, AC and BL methods in the majority of cases, as can be seen from these figures. It is remarkable that we don't need to use that many subsets S to achieve high performances: increasing the S-value from 5 to 15 doesn't seem to cause major increases in AUC (see also Table 5). The influence of the C-parameter is similar as described in the previous paragraph.
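The EE construction just described — S balanced subsets, a boosting run per subset, and a score that sums every weak hypothesis α_st h_st(x) — can be sketched as follows. This is a minimal numpy sketch: decision stumps stand in for the paper's linear SVM base learners, and plain AdaBoost re-weighting is assumed; labels are in {−1, +1} with the minority class as +1.

```python
import numpy as np

def fit_stump(X, y, w):
    """Weighted decision stump (stand-in for the paper's linear SVM):
    picks the feature/threshold/sign minimising the weighted error."""
    best = None
    for j in range(X.shape[1]):
        for thr in np.unique(X[:, j]):
            for sign in (1, -1):
                pred = np.where(X[:, j] >= thr, sign, -sign)
                err = w[pred != y].sum()
                if best is None or err < best[0]:
                    best = (err, j, thr, sign)
    err, j, thr, sign = best
    return err, (lambda Z, j=j, thr=thr, sign=sign:
                 np.where(Z[:, j] >= thr, sign, -sign))

def easy_ensemble_score(X, y, S=5, T=10, seed=0):
    """EasyEnsemble sketch: S balanced subsets (all minority plus an
    equal-sized random majority draw), AdaBoost with T rounds on each,
    scores summed over all S*T weak learners."""
    rng = np.random.default_rng(seed)
    min_idx = np.flatnonzero(y == 1)
    maj_idx = np.flatnonzero(y == -1)
    learners = []                                   # (alpha, h) pairs
    for _ in range(S):
        sub = np.concatenate([min_idx,
                              rng.choice(maj_idx, len(min_idx), replace=False)])
        Xs, ys = X[sub], y[sub]
        w = np.full(len(sub), 1.0 / len(sub))       # AdaBoost weights
        for _ in range(T):
            err, h = fit_stump(Xs, ys, w)
            err = min(max(err, 1e-10), 1 - 1e-10)
            alpha = 0.5 * np.log((1 - err) / err)
            w *= np.exp(-alpha * ys * h(Xs))        # re-weight hard cases
            w /= w.sum()
            learners.append((alpha, h))
    return lambda Z: sum(a * h(Z) for a, h in learners)
```

Because each subset is only twice the minority class size, each boosting run is cheap, and the S runs are independent, which is what makes the method parallelizable.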

Fig. 1 Mov G (p = 25) dataset results showing average tenfold AUC-performance on test data (with μ = 100) for (a) AB, AC and EE with C chosen according to highest validation set AUC-performance (over all possible T); (b) AB and EE (S = 15) with varying C-levels.

Fig. 2 Book (p = 25) dataset.

Fig. 3 TaFeng (p = 25) dataset.

Fig. 4 Bank (p = []) dataset.


4.6 Final comparison

4.6.1 Performance-wise comparison

In this section we wish to determine which of the aforementioned methods shows superior performance in terms of AUC on test data. The experimental set-up is described in Section 4.2. Note that we exclude β = 0 and βu = 0 in determining the performance of the oversampling respectively undersampling techniques, to be able to compare them with the baseline (BL) approach.23 The results for AB, AC and EE are shown for μ = 100. The number of boosting iterations t ∈ [0, T] is also considered as a tunable parameter. Table 5 shows the final comparison between all methods. Additionally, we added an average rank column indicating the mean rank of each algorithm across all datasets. The rank of an algorithm for a specific dataset corresponds to its position in the sorted AUC-performance spectrum, where the best performing algorithm gets the rank of 1, the second best rank 2, etc. In case two or more algorithms perform equally well, average ranks are assigned (if two algorithms have ranks of 3 and 4 respectively and show equal performance, then both methods get assigned a rank of 3.5). Note that the LST dataset is excluded in the calculation of average ranks, since we consider this a dataset not affected by imbalance and showing equal performances.
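The tie-averaged ranking just described can be sketched as follows (a minimal sketch; `auc_per_method`, mapping a method name to its per-dataset AUC list, is a hypothetical name):

```python
def average_ranks(auc_per_method):
    """Mean rank per method across datasets; best AUC gets rank 1,
    ties receive the average of the tied positions."""
    methods = list(auc_per_method)
    n_data = len(next(iter(auc_per_method.values())))
    ranks = {m: 0.0 for m in methods}
    for d in range(n_data):
        scores = sorted((auc_per_method[m][d] for m in methods), reverse=True)
        for m in methods:
            s = auc_per_method[m][d]
            first = scores.index(s) + 1            # best tied position
            count = scores.count(s)
            ranks[m] += first + (count - 1) / 2.0  # average over ties
    return {m: r / n_data for m, r in ranks.items()}
```

With two methods tied for positions 3 and 4, both receive rank 3.5, exactly as in the example above.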

As a side note, we noticed that increasing the number of minority training set instances can significantly improve performances. We can see this clearly by comparing the p = 25 outcomes with the results of p = 1 on the same dataset. Junque de Fortuny et al (2014a) already proved empirically that larger behaviour datasets (in terms of the number of instances or features) lead to higher AUC-performances. This situation is confirmed in our experiments, though we want to add that this is only true when the data size is increased in a balanced fashion.

The results in Table 5 confirm our previous analysis. Each of the oversampling, undersampling (except Cl Knn) and boosting variants seems to improve upon the BL performance. Sobhani et al (2015) note that the direct application of ensemble techniques on imbalanced data does not circumvent the imbalanced learning issue. In our experiments on behaviour data, we do find that a simple boosting procedure applied to the imbalanced data improves upon the predictive performance. The oversampling techniques OSR, SMOTE and ADASYN show similar outcomes and are competitive methods, seeming to outperform the undersampling techniques and the plain boosting (AB) and cost-sensitive learning (AC) approaches. When dealing with traditional data, the performances of oversampling techniques are generally worse than those of undersampling approaches (Drummond and Holte 2003). It is interesting to observe that for these large and sparse behaviour datasets the opposite is true. This is heavily related to the intrinsic properties of the data under consideration. Undersampling (or feature selection, for that matter) is not recommended because, as we have noted in Section 2.1, each feature provides a small additional contribution to the predictive performance. With respect to AC, the results indicate that R = RL is a better choice compared to more random cost ratios R = 28. The EE-technique has the

23 The BL technique trains single SVMs on the imbalanced training data


lowest average rank and is our best performing method. The combination of several balanced subsets with filtered noise/outlier contributions allows one to explore the majority class space efficiently. To strengthen our previous statements, we will in the next paragraphs conduct statistical significance tests in accordance with the procedure outlined in Demsar (2006). The latter paper presents a framework to perform statistical comparisons of classifiers over multiple datasets.

The first null-hypothesis we try to reject postulates that all algorithms perform equally well or, equivalently, that their average ranks R_j (see Table 5) are indifferent. Under the null-hypothesis, the Friedman statistic (Friedman 1937)

\[
\chi^2_F = \frac{12N}{k(k+1)}\left[\sum_{j=1}^{k} R_j^2 - \frac{k(k+1)^2}{4}\right] \tag{6}
\]

with N the number of datasets (N = 15) and k the number of algorithms (k = 13), can be used to compute the statistic (Iman and Davenport 1980)

\[
F_F = \frac{(N-1)\,\chi^2_F}{N(k-1)-\chi^2_F} \tag{7}
\]

The latter is distributed according to the F-distribution with k − 1 and (k − 1)(N − 1) degrees of freedom. Evaluating this measure with the average ranks shown in Table 5 results in F_F = 22.98, which is a lot higher than the critical value of 1.81 at the α = 0.05 significance level. We can therefore reject the null-hypothesis and proceed with post-hoc tests.
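A small sketch showing how Eqs. (6) and (7) produce this value from the average ranks of Table 5 (the rank vector below is transcribed from that table):

```python
def friedman_imandavenport(avg_ranks, n_datasets):
    """Eqs. (6)-(7): Friedman chi-square from average ranks, then the
    Iman-Davenport F statistic."""
    k = len(avg_ranks)
    chi2 = (12.0 * n_datasets / (k * (k + 1))) * (
        sum(r * r for r in avg_ranks) - k * (k + 1) ** 2 / 4.0)
    ff = (n_datasets - 1) * chi2 / (n_datasets * (k - 1) - chi2)
    return chi2, ff

# average ranks from Table 5 (k = 13 algorithms, N = 15 datasets),
# ordered BL, OSR, SMOTE, ADASYN, RUS, Cl Knn, Far Knn, CBU, AB,
# AC(R = 28), AC(R = RL), EE(S = 10), EE(S = 15)
ranks = [11.600, 5.000, 4.533, 4.800, 8.167, 12.467, 8.133,
         8.567, 8.267, 8.467, 5.400, 3.267, 2.333]
chi2, ff = friedman_imandavenport(ranks, 15)
# ff comes out near 23.0, matching the reported 22.98 up to rank rounding
```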

The Nemenyi test (Nemenyi 1963) allows one to compare each classifier to one another and adjusts the critical value to compensate for making k(k − 1)/2 comparisons.24 "The performance of two classifiers is significantly different if the corresponding average ranks differ by at least the critical difference (CD)" (Demsar 2006). We refer to the aforementioned paper to calculate the CD. The result of these comparisons at the α = 0.05 significance level is shown in Table 6. This test is already quite revealing in the sense that, for instance, oversampling methods (OSR, SMOTE and ADASYN), AC (R = RL) and EE-techniques are significantly different from the BL. Yet it does not indicate that undersampling or applying plain boosting (AB) results in significantly different performances compared to the BL. The Nemenyi test is still conservative and we therefore compare all classifiers to a single control classifier in the next paragraphs. Note that we only make k − 1 comparisons in that situation, resulting in stronger results compared to the Nemenyi test.

When comparing a single classifier i with a control classifier c, the null-hypothesis states that the two classifiers i and c perform equally well. Under this hypothesis, the test statistic (based on the average ranks R_i shown in Table 5)

\[
z = \frac{R_i - R_c}{\sqrt{\frac{k(k+1)}{6N}}} \tag{8}
\]

is distributed according to a standard normal distribution. This z-value is used to find the corresponding probability (p-value) p from the cumulative standard normal

24 The larger the number of comparisons, the higher the proportion of null-hypotheses that are wrongfully rejected due to random chance.


Table 5 Final comparison of all methods in terms of AUC test set performance; see Section 4.2 for details on the experimental set-up. β ≠ 0 and βu ≠ 0 for the oversampling respectively undersampling techniques; μ = 100 for each of the boosting variants. Standard deviations are given between brackets and the performance difference with the BL is shown in square brackets.

             Mov G (p = 1)    Mov G (p = 25)   Mov Th (p = [])  Yahoo A (p = 1)
BL           716(262)[0]      8141(132)[0]     7977(533)[0]     5592(297)[0]
OSR          7535(227)[38]    8376(209)[23]    8513(61)[54]     6005(271)[41]
SMOTE        7616(227)[46]    837(21)[23]      8567(498)[59]    601(3)[42]
ADASYN       7607(226)[45]    8363(204)[22]    8565(56)[59]     599(299)[4]
RUS          7288(273)[13]    8152(215)[01]    8291(719)[31]    5704(177)[11]
Cl Knn       7143(136)[-02]   8088(119)[-05]   7887(471)[-09]   5578(271)[-01]
Far Knn      719(295)[03]     809(148)[-05]    8407(464)[43]    572(133)[13]
CBU          7417(236)[26]    8151(104)[01]    8276(722)[3]     5877(343)[28]
AB           7165(173)[01]    8452(189)[31]    8243(518)[27]    5835(262)[24]
AC(R = 28)   7161(246)[0]     8346(182)[2]     8327(56)[35]     5772(247)[18]
AC(R = RL)   7465(27)[31]     8335(209)[19]    8541(449)[56]    5947(233)[35]
EE(S = 10)   7604(266)[44]    8505(185)[36]    861(578)[63]     5966(313)[37]
EE(S = 15)   7612(288)[45]    8514(186)[37]    8642(586)[67]    5976(293)[38]

             Yahoo A (p = 25) Yahoo G (p = 1)  Yahoo G (p = 25) TaFeng (p = 1)
BL           6168(242)[0]     6684(366)[0]     7882(139)[0]     5575(16)[0]
OSR          6459(312)[29]    7308(296)[62]    7852(201)[-03]   6121(224)[55]
SMOTE        6556(333)[39]    7311(312)[63]    7901(121)[02]    6172(181)[6]
ADASYN       6513(338)[34]    7322(317)[64]    7974(168)[09]    6168(186)[59]
RUS          6411(28)[24]     7065(339)[38]    7891(155)[01]    5925(218)[35]
Cl Knn       6114(213)[-05]   6634(354)[-05]   7726(146)[-16]   5577(128)[0]
Far Knn      6396(303)[23]    6697(354)[01]    7826(22)[-06]    5998(126)[42]
CBU          6227(179)[06]    7127(289)[44]    7522(242)[-36]   584(157)[26]
AB           6388(267)[22]    689(203)[21]     7901(166)[02]    5621(179)[05]
AC(R = 28)   6432(356)[26]    6889(311)[2]     7899(189)[02]    5633(183)[06]
AC(R = RL)   6431(303)[26]    7313(28)[63]     7841(2)[-04]     616(226)[59]
EE(S = 10)   6651(324)[48]    7261(315)[58]    8052(16)[17]     612(182)[54]
EE(S = 15)   6636(318)[47]    7348(232)[66]    8054(156)[17]    6113(183)[54]

             TaFeng (p = 25)  Book (p = 1)     Book (p = 25)    LST (p = 1)
BL           6694(134)[0]     526(129)[0]      6008(071)[0]     9999(001)[0]
OSR          6877(123)[18]    5587(142)[33]    6462(057)[45]    9999(001)[0]
SMOTE        6847(15)[15]     5507(088)[25]    6296(082)[29]    9999(001)[0]
ADASYN       6848(147)[15]    5504(091)[24]    6302(057)[29]    9999(001)[0]
RUS          6828(139)[13]    5426(092)[17]    6328(08)[32]     9998(001)[0]
Cl Knn       6613(143)[-08]   5269(13)[01]     6002(079)[-01]   9999(001)[0]
Far Knn      6806(141)[11]    5625(152)[37]    6415(112)[41]    9998(001)[0]
CBU          6384(107)[-31]   5375(101)[12]    5468(088)[-54]   []
AB           6765(155)[07]    5427(195)[17]    65(067)[49]      9999(001)[0]
AC(R = 28)   6931(123)[24]    5372(1)[11]      6124(08)[12]     9998(001)[0]
AC(R = RL)   6715(151)[02]    5573(122)[31]    646(064)[45]     9999(001)[0]
EE(S = 10)   703(135)[34]     5509(129)[25]    6537(061)[53]    9998(001)[0]
EE(S = 15)   704(13)[35]      5535(126)[28]    654(051)[53]     9998(001)[0]


Table 5 Continued. Additionally, an average rank column is added, showing the mean rank of each algorithm across all datasets. Note that we excluded the LST dataset for this purpose.

             Adver (p = [])   Adver (p = 1)    CRF (p = [])     Bank (p = [])
BL           9661(182)[0]     9093(302)[0]     6406(1643)[0]    6682(088)[0]
OSR          9693(191)[03]    933(202)[24]     8074(1293)[167]  7139(079)[46]
SMOTE        9705(166)[04]    9335(201)[24]    787(1656)[146]   []
ADASYN       9691(195)[03]    9346(221)[25]    7887(1671)[148]  []
RUS          9681(187)[02]    9238(251)[15]    8398(599)[199]   6941(119)[26]
Cl Knn       964(148)[-02]    8973(342)[-12]   7663(1619)[126]  6617(072)[-06]
Far Knn      9577(181)[-08]   9388(178)[3]     8375(1311)[197]  6695(056)[01]
CBU          9715(188)[05]    9418(23)[33]     []               []
AB           9734(218)[07]    9139(323)[05]    7762(1515)[136]  6682(088)[0]
AC(R = 28)   9744(193)[08]    91(335)[01]      6831(1493)[42]   6767(071)[09]
AC(R = RL)   9746(171)[08]    9351(217)[26]    8508(977)[21]    707(08)[39]
EE(S = 10)   9764(135)[1]     9297(275)[2]     8618(1017)[221]  7146(081)[46]
EE(S = 15)   9763(135)[1]     933(214)[24]     8635(999)[223]   7154(076)[47]

             Average Rank
BL           11.600
OSR          5.000
SMOTE        4.533
ADASYN       4.800
RUS          8.167
Cl Knn       12.467
Far Knn      8.133
CBU          8.567
AB           8.267
AC(R = 28)   8.467
AC(R = RL)   5.400
EE(S = 10)   3.267
EE(S = 15)   2.333

Table 6 Nemenyi test outcomes at the α = 0.05 significance level. A value of 1 indicates that the null-hypothesis is rejected and thus finds the two algorithms significantly different (a value of 0 means we accept the null-hypothesis). Methods are listed in the same order as in Table 5 and are represented more concisely.

      BL  RO  SM  AD  RU  Cl  Fa  CBU AB  AC1 AC2 EE1 EE2
BL    0   1   1   1   0   0   0   0   0   0   1   1   1
RO    1   0   0   0   0   1   0   0   0   0   0   0   0
SM    1   0   0   0   0   1   0   0   0   0   0   0   0
AD    1   0   0   0   0   1   0   0   0   0   0   0   0
RU    0   0   0   0   0   0   0   0   0   0   0   1   1
Cl    0   1   1   1   0   0   0   0   0   0   1   1   1
Fa    0   0   0   0   0   0   0   0   0   0   0   1   1
CBU   0   0   0   0   0   0   0   0   0   0   0   1   1
AB    0   0   0   0   0   0   0   0   0   0   0   1   1
AC1   0   0   0   0   0   0   0   0   0   0   0   1   1
AC2   1   0   0   0   0   1   0   0   0   0   0   0   0
EE1   1   0   0   0   1   1   1   1   1   1   0   0   0
EE2   1   0   0   0   1   1   1   1   1   1   0   0   0


distribution function: p = 2 × min(Φ(z), 1 − Φ(z)). Holm's method (Holm 1979) compares k − 1 classifiers to a single control classifier and sorts the corresponding p-values in ascending order, so that p1 ≤ p2 ≤ ... ≤ pk−1. Each pi is subsequently compared to its associated confidence level25 αcomp = α/(k − i). Holm starts with performing the check p1 < α/(k − 1) and rejects the null-hypothesis if the test passes. It then proceeds with a similar check for i = 2 and continues until a certain null-hypothesis cannot be rejected. The remaining hypotheses are retained.
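Holm's step-down procedure against a control classifier can be sketched directly from Eq. (8) and the average ranks of Table 5 (a minimal stdlib-only sketch; method labels are abbreviations):

```python
import math

def holm_vs_control(avg_ranks, control, k, n, alpha=0.05):
    """Holm step-down test of every method against a control classifier,
    using the z statistic of Eq. (8) and two-sided normal p-values."""
    se = math.sqrt(k * (k + 1) / (6.0 * n))
    rows = []
    for name, r in avg_ranks.items():
        if name == control:
            continue
        z = (r - avg_ranks[control]) / se
        phi = 0.5 * math.erfc(-z / math.sqrt(2))   # standard normal CDF
        rows.append((name, z, 2 * min(phi, 1 - phi)))
    rows.sort(key=lambda t: t[2])                  # ascending p-values
    results, rejecting = [], True
    for i, (name, z, p) in enumerate(rows, start=1):
        a_comp = alpha / (k - i)
        rejecting = rejecting and p < a_comp       # stop at first failure
        results.append((name, z, p, a_comp, rejecting))
    return results
```

Feeding it the Table 5 ranks with BL as control reproduces the z- and p-values reported in Table 7, including the step-down cut-off after AC(R = RL).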

Table 7 shows the result of the Holm test at the α = 0.05 significance level with the BL algorithm as control classifier. The table lists each method according to sorted p-values and shows the z-, p- and αcomp-values. The significance column indicates whether the proposed method is significantly different from the BL and coincidentally matches the result of the Nemenyi test. The p-values of the (Far Knn, RUS, CBU) undersampling methods, AB and AC (R = 28) are only slightly higher than the level required to reject the null-hypothesis. We included a critical significance level αcrit, corresponding to the lowest possible significance level upon which the method would be considered significantly different from the BL (if α = αcrit then p = αcomp). In the case of AB, for example, if we were to choose a significance level α = 0.0764 (a confidence level of 92.36%), the associated p-value would be smaller than αcomp and we would proceed to conclude26 that AB performs significantly different from the BL. To summarize: the null-hypothesis of equal performance between the BL and an alternative method proposed in Section 3 (except Cl Knn) is rejected with a confidence level of more than 91% according to Holm's test. Hence, the performance of the BL approach is significantly worse than that of each of the proposed over- and undersampling methods (except Cl Knn), cost-sensitive learning techniques (AC) and boosting variants (AB, EE).

Similar to the previous paragraph, Table 8 presents the results of Holm's test at the α = 0.05 significance level with EE(S = 15) as control classifier. The results coincidentally correspond with the result of the Nemenyi test and indicate that the oversampling (OSR, SMOTE and ADASYN), AC(R = RL) and EE(S = 10) methods are not statistically different from the EE(S = 15) technique at the 0.05 significance level. However, if we were to adapt the significance level to approximately α = 0.25 (a 75% confidence level), the results would conclude significance (except for EE(S = 10)). To summarize: the null-hypothesis of equal performance between EE(S = 15) and an alternative method proposed in Section 3 (except EE(S = 10)) is rejected with a confidence level of more than 75% according to Holm's test. Hence, the performance of the oversampling, undersampling, AB and AC approaches is significantly worse than that of EE(S = 15). We can therefore safely conclude EE to be the preferred method in terms of AUC performance when dealing with imbalanced behaviour data.

25 αcomp adjusts the value of α to compensate for multiple comparisons.
26 This is not entirely true, since Holm's method requires Far Knn to pass the test first before proceeding, which only occurs at the 0.088662 significance level.


Table 7 Holm test at the α = 0.05 significance level with BL as reference. The table shows the z test statistic with associated p-value; αcomp = α/(k − i), with k the number of algorithms (k = 13) and i corresponding to the position of the method in the sorted p-value vector. Since we already ranked the p-values in ascending order, i takes on the row number (e.g. OSR has i = 5). The column "significant" denotes whether we can reject the null-hypothesis (significant: p < αcomp). αcrit corresponds with the smallest possible significance level where we would decide to reject the null-hypothesis (αcrit = α · p/αcomp).

             z          p          αcomp     significant  αcrit
EE(S = 15)   -6.51642   7.2E-11    0.004167  1            8.64E-10
EE(S = 10)   -5.86009   4.63E-09   0.004545  1            5.09E-08
SMOTE        -4.96936   6.72E-07   0.005     1            6.72E-06
ADASYN       -4.78183   1.74E-06   0.005556  1            1.56E-05
OSR          -4.64119   3.46E-06   0.00625   1            2.77E-05
AC(R = RL)   -4.35991   1.3E-05    0.007143  1            9.11E-05
Far Knn      -2.4378    0.014777   0.008333  0            0.088662
RUS          -2.41436   0.015763   0.01      0            0.078815
AB           -2.34404   0.019076   0.0125    0            0.076305
AC(R = 28)   -2.20339   0.027567   0.016667  0            0.082701
CBU          -2.13307   0.032919   0.025     0            0.065837
Cl Knn       0.609449   0.542227   0.05      0            0.542227

Table 8 Holm test at the α = 0.05 significance level with EE(S = 15) as reference.

             z          p          αcomp     significant  αcrit
Cl Knn       7.12587    1.03E-12   0.004167  1            1.24E-11
BL           6.516421   7.2E-11    0.004545  1            7.92E-10
CBU          4.383348   1.17E-05   0.005     1            0.000117
AC(R = 28)   4.313027   1.61E-05   0.005556  1            0.000145
AB           4.172384   3.01E-05   0.00625   1            0.000241
RUS          4.102063   4.09E-05   0.007143  1            0.000287
Far Knn      4.078623   4.53E-05   0.008333  1            0.000272
AC(R = RL)   2.156513   0.031044   0.01      0            0.155218
OSR          1.875229   0.060761   0.0125    0            0.243045
ADASYN       1.734587   0.082814   0.016667  0            0.248442
SMOTE        1.547064   0.121848   0.025     0            0.243696
EE(S = 10)   0.65633    0.511612   0.05      0            0.511612

4.6.2 Timing-wise comparison

The performance of each of the proposed algorithms in terms of training times depends highly upon the chosen parameters. The β and βu parameters control the number of training instances and are an important contribution to the subsequent SVM learning times for each of the over- and undersampling techniques. The number of boosting iterations t ∈ [0, T] is also a major factor in each of the boosting variants proposed in Section 3.3. Furthermore, the chosen regularization parameter C highly influences the training times of the SVM inducer: higher C-values result in larger training times (Fan et al 2008). Not only do the chosen hyperparameters affect learning times; data characteristics such as the number of minority instances, the imbalance level, the number of features, the amount of overlapping, etc. also have a major effect.


In comparing each of the methods outlined in Section 3, we make use of a similar methodology as previously presented (see Section 4.6.1). For each of the ten folds, a suitable parameter combination is selected based on optimal validation set AUC-performance. These chosen parameters are applied in determining the timings of each method. The final results indicate the average timings across ten folds. Hence, we show the required amount of time for each method when using the parameter combination resulting in the best performance. Note that this might for instance result in using β = 13 in case of OSR, or using only t = 5 boosting iterations in EE. Computational timings are shown in Table 9.

As one can observe from the aforementioned table, CBU is the slowest method, relying on computationally intensive clustering procedures. The synthetic SMOTE and ADASYN oversampling approaches are very time consuming methods: they both rely on computationally expensive nearest neighbour computations, in conjunction with a synthetic sample generation procedure that requires a single pass over all minority training instances.

The RUS method shows, as expected, superior results in terms of computational timings. Other undersampling methods perform quite well, though they are slower than BL and OSR due to their inherent nearest neighbour computations.

The boosting variants seem to perform worse than the BL, OSR and each of the undersampling methods. Since EE was the best method in terms of AUC-performance (see Section 4.6.1), we focus on this learner in this paragraph. For the medium sized datasets (all but CRF and Bank), OSR outperforms EE(S = 15); yet for the large datasets (CRF and Bank), the opposite is true. An SVM is known to have a quadratic computational complexity in the number of training instances (Suykens et al 2002; García and Lozano 2007). For this reason, we consider OSR (or any other oversampling technique) as inappropriate for large datasets. Since with EE each subset has a size limited to twice the minority class training size, we expect increasing benefits of the EE-technique for larger and more imbalanced behaviour data. As Junque de Fortuny et al (2014a) have observed, larger behaviour data, in terms of the number of instances or features, also contributes to an increase in predictive power. Hence, we expect to see this type of big behaviour data more often in real-life applications, where computational resources can become a major constraint. Because the EE-technique is easily parallelizable, by simultaneously performing the boosting process for each of the independent subsets, we added an extra row in Table 9: EE par corresponds with the time needed for EE(S = 15) divided by 15, indicating the time that would be required for EE if each subset were trained in parallel. We can see that EE par can outperform OSR in 8 out of 16 cases, precisely for the datasets that show a high imbalance level (e.g. p = 1 or p = []).

We conclude our comparison of the proposed methods in Figure 5, which shows the average ranks in terms of computational timings (see Table 9) and AUC performance for each of the available data sources. Regarding the ranks with respect to the latter, we added an extra row EE par, having the same AUC-performance as EE(S = 15) for each dataset in Table 5. Ranks are subsequently determined according to the procedure outlined in Section 4.6.1. Additionally, Figure 18 in Appendix D shows the result for the large and highly imbalanced CRF and Bank datasets. We expect the latter type of data to be of major interest for reasons already indicated.


Table 9 Computational timings (in seconds) averaged across ten folds, using the parameter combination resulting in the highest validation set AUC-performance. First and second best performances are emphasized in boldface (third and fourth best are underlined). EE par is the time required for EE(S = 15) divided by 15.

             Mov G (p = 1)  Mov G (p = 25)  Mov Th (p = [])  Yahoo A (p = 1)
BL           0032889        0056697         0558563          0026922
OSR          0055043        0062802         099009           0044421
SMOTE        0218821        0937057         3841482          0057726
ADASYN       0284688        1802399         5191265          0087694
RUS          0011431        0025383         0155224          0007991
Cl Knn       0046599        0599846         0989914          0037182
Far Knn      0039887        080072          0683023          0027788
CBU          1034111        1060173         6822839          1692477
AB           0169792        0841443         3460246          0139251
AC(R = 28)   0471994        2996585         1086907          0366555
AC(R = RL)   053376         1179542         6065177          0209015
EE(S = 10)   0117226        6065145         117995           0148973
EE(S = 15)   020474         7173737         2119991          0180365
EE par       0013649        0478249         0141333          0012024

             Yahoo A (p = 25)  Yahoo G (p = 1)  Yahoo G (p = 25)  TaFeng (p = 1)
BL           0092954           0011915          0044164           0026728
OSR          0027887           0013241          0047206           0040919
SMOTE        1062686           0056153          0883698           0219553
ADASYN       2050993           0079073          1733367           0306618
RUS          0048471           0003234          0033423           0002916
Cl Knn       084391            0025404          0502515           0092167
Far Knn      0664124           0026576          0500206           0080159
CBU          1569442           1287221          1355035           2467279
AB           0445546           0078777          0169977           0114619
AC(R = 28)   1034044           0321723          0515953           0926178
AC(R = RL)   0706215           0226741          0112949           0610233
EE(S = 10)   1026577           0100331          1527146           0058052
EE(S = 15)   1607596           0077483          2472582           010538
EE par       0107173           0005166          0164839           0007025

             TaFeng (p = 25)  Book (p = 1)  Book (p = 25)  LST (p = 1)
BL           0032033          0080035       0318093        0652045
OSR          0032414          0132927       0092757        087152
SMOTE        5089283          3409418       1143444        4987705
ADASYN       8148419          3689661       1225441        6840083
RUS          0020457          0022713       0031972        0432839
Cl Knn       1713731          0400873       3711648        2508374
Far Knn      1539437          0379086       3988552        2511037
CBU          2642686          4198663       4631987        []
AB           0713265          061719        1238585        2466151
AC(R = 28)   1234647          1666131       2330635        1451671
AC(R = RL)   0279047          0860346       0197053        123763
EE(S = 10)   2484502          2145747       7177484        0524066
EE(S = 15)   3363971          2480066       1121945        0784111
EE par       0224265          0165338       0747963        0052274


Table 9 Continued. Additionally, an extra column indicating average ranks is included (lower ranks are preferred).

             Adver (p = [])  Adver (p = 1)  CRF (p = [])  Bank (p = [])
BL           0010953         0002796        0725911       7089334
OSR          0012178         0006166        3685813       1797481
SMOTE        0123112         0017764        5633862       []
ADASYN       0183767         0021728        5768669       []
RUS          0012115         000204         0147392       5247441
Cl Knn       0061324         0005568        1106755       7373282
Far Knn      0079078         0007069        1110379       9759619
CBU          3378235         3236754        []            []
AB           0069199         0103518        1153196       8308618
AC(R = 28)   0193092         0068905        2047434       7170548
AC(R = RL)   0107652         0037963        1387174       1063466
EE(S = 10)   0138485         0085686        0198656       2495117
EE(S = 15)   0185136         0139121        0285345       3640107
EE par       0012342         0009275        0019023       2426738

             Average Rank [pos]
BL           2.94 [2]
OSR          4.19 [4]
SMOTE        9.59 [11]
ADASYN       10.91 [13]
RUS          1.38 [1]
Cl Knn       6.5 [5]
Far Knn      6.56 [6]
CBU          14 [14]
AB           8.06 [7]
AC(R = 28)   10.81 [12]
AC(R = RL)   9.25 [9]
EE(S = 10)   8.25 [8]
EE(S = 15)   9.56 [10]
EE par       3 [3]

Timings are a substantial concern with respect to these large datasets. As can be observed, the EE-technique thrives for these big and highly imbalanced data, even in its non-parallel form.


Fig. 5 Average rank AUC versus average rank Time (see Table 9) across all datasets from Table 2 (excluding Flickr and Kdd). Regarding the AUC rank, an extra row EE par is added, having the same AUC-value as EE(S = 15). Ranks for each dataset are subsequently obtained via the procedure outlined in Section 4.6.1. Points occurring in the upper-right region are preferred.

5 The effect of the chosen base learner

In this short section we investigate the influence of the chosen base learner. More specifically, we ask ourselves whether some base inducers can more easily cope with the imbalanced learning issue, and in doing so we verify the conclusion that EE is a suitable technique to elevate upon the baseline performance. We rely on several popular classification methods for behaviour data: two regularization based techniques (SVM, LR) and two heuristic types of approaches (Naive Bayes (NB) and Behavioural Similarity (BeSim)). The next paragraphs briefly introduce these methods. Note that source codes and dataset results are provided in our on-line repository.

LR constructs a linear model that maximizes the likelihood of the observed data by imposing a logistic model for the parametric probability model P(Y|X). Ng and Jordan (2002) provide a thorough introduction to the topic27 and note that, in its plain form, LR can suffer from overfitting, especially when the input data is very high-dimensional. Therefore, we resort to regularized logistic regression. We opt for L2-regularized LR, since L1-regularization corresponds with a natural way of imposing feature selection (Ng 2004). The LIBLINEAR toolbox (Fan et al 2008) is used to obtain these L2-regularized LR models.
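The setup above can be sketched as follows. This is an illustrative stand-in only: the paper calls the LIBLINEAR toolbox directly, whereas here LIBLINEAR is reached through scikit-learn's wrapper (solver="liblinear"); the data matrix, its dimensions and the C value are made up.

```python
# Sketch: L2-regularized logistic regression on a sparse binary behaviour
# matrix (users x items). Assumes scikit-learn; sizes and C are illustrative.
import numpy as np
from scipy.sparse import csr_matrix
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
# 200 "users" x 1000 "items", roughly 1% nonzero entries
X = csr_matrix((rng.random((200, 1000)) < 0.01).astype(np.float64))
y = rng.integers(0, 2, size=200)          # binary class labels

clf = LogisticRegression(penalty="l2", C=1.0, solver="liblinear")
clf.fit(X, y)                             # liblinear handles sparse input natively
scores = clf.predict_proba(X)[:, 1]       # continuous scores, e.g. for AUC
```

Lower C values correspond to stronger regularization (weaker learners), which matters later when LR is used inside the boosting process.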

27 For an accessible introduction, see the chapter on "Generative and Discriminative Classifiers: Naive Bayes and Logistic Regression" provided at the following link: http://www.cs.cmu.edu/~tom/NewChapters.html

Imbalanced classification in sparse and large behaviour datasets 39

NB (Ng and Jordan 2002) relies on the use of Bayes' rule, which expresses the posterior P(Y|X) as the product of the likelihood P(X|Y) and the prior P(Y), divided by the evidence P(X). The naivety of the approach originates from assuming conditional independence of the features. We impose a multivariate event model regarding the distribution P(X|Y). Junque de Fortuny et al (2014a) provide an efficient NB implementation for dealing with large and sparse binary behaviour data.
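A minimal sketch of this model class, assuming the multivariate event model corresponds to a Bernoulli NB over binary features. The paper uses the dedicated large-scale implementation of Junque de Fortuny et al (2014a); scikit-learn's BernoulliNB is used here purely for illustration, on synthetic data.

```python
# Sketch: Naive Bayes with a multivariate (Bernoulli) event model on a
# sparse binary matrix. Data are synthetic; alpha is Laplace smoothing.
import numpy as np
from scipy.sparse import csr_matrix
from sklearn.naive_bayes import BernoulliNB

rng = np.random.default_rng(1)
X = csr_matrix((rng.random((300, 500)) < 0.02).astype(np.float64))
y = rng.integers(0, 2, size=300)

nb = BernoulliNB(alpha=1.0)
nb.fit(X, y)
log_probs = nb.predict_log_proba(X)[:, 1]   # log P(Y=1|X), usable as a ranking score
```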

The BeSim approach conceptually corresponds to projecting the bigraph to a unigraph using the procedure outlined in Section 3.2. Next, the weighted-vote relational neighbour classifier (Macskassy and Provost 2007) is used to infer unknown labels/scores through a weighted probability estimation using the known labels of neighbouring nodes. A very fast and scalable implementation of this approach (the SW-transformation) is presented in detail in Stankova et al (2015).
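The weighted-vote idea can be illustrated with a small sketch: the score of an unlabelled node is the weight-normalized vote of its labelled neighbours in the projected unigraph. This is a didactic stand-in, not the scalable SW-transformation implementation of Stankova et al (2015); the fallback value for isolated nodes is an assumption.

```python
# Sketch: weighted-vote relational neighbour scoring for one node.
def wvrn_score(neighbours, weights, labels):
    """neighbours: ids of labelled neighbour nodes; weights: edge weights
    to them in the projected unigraph; labels: dict node id -> 0/1 label."""
    total = sum(weights)
    if total == 0:
        return 0.5   # no labelled neighbours: uninformative fallback (assumption)
    return sum(w * labels[n] for n, w in zip(neighbours, weights)) / total

labels = {1: 1, 2: 0, 3: 1}
score = wvrn_score([1, 2, 3], [0.5, 0.25, 0.25], labels)  # -> 0.75
```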

In Table 10 we compare the performances of the baseline (BL) application of the methods with their EE-version (with S = 15, T = 20). It is quite striking to see the EE-version dominating the baseline version when looking at each method individually. This suggests that the imbalanced learning issue has an impact on each of the aforementioned techniques. Focusing on the regularization-based approaches, we observe the baseline LR to be less influenced by the imbalance than the SVM method. Their EE-versions (EE SVM and EE LR) can boost the baseline performance in almost all cases and have the highest average ranks amongst all the proposed methods. As already emphasized, the regularization-based approaches offer an added element of flexibility in the sense that the strength of the underlying base learner can be controlled.

The heuristic approaches NB and BeSim do not offer this flexibility. Furthermore, it is not directly possible to integrate instance weights during the training phase of the base learner. During the boosting process, we therefore sample from the distribution Dt to generate (unweighted) examples and subsequently train a BeSim/NB learner. The results in Table 10 indicate BL NB to be the worst performing algorithm amongst all methods tried. Due to the conditional independence assumption, one could regard this as a weak learner, which is therefore suitable to be used in a boosting process. Indeed, the EE NB approach is able to significantly improve upon its performance. We found this to be the slowest EE-version28 (e.g. taking more than 24 hours to process a single fold of the Bank dataset, versus several minutes for the other versions). The average rank estimate of EE NB is rather pessimistic, since we assigned the lowest performance across all methods in case of missing results. BeSim can be considered a stronger learner compared to NB; its possible boost in performance is typically lower than that of NB. Having said this, we found EE BeSim to improve upon BL BeSim for all datasets (except Adver(p = 1)).
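The workaround described above can be sketched as follows: a weight-oblivious base learner is trained on an unweighted bootstrap sample drawn from the boosting distribution Dt. Sizes are illustrative.

```python
# Sketch: turn the boosting distribution D_t into an unweighted training
# sample for base learners that cannot consume instance weights (NB, BeSim).
import numpy as np

rng = np.random.default_rng(2)
m = 1000
D_t = rng.random(m)
D_t /= D_t.sum()                                   # distribution over the m instances
idx = rng.choice(m, size=m, replace=True, p=D_t)   # weighted resample with replacement
# X[idx], y[idx] would now be handed to the weight-oblivious base learner
```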

28 This statement only relates to the EE-version. In the baseline case, the timings of the NB-version are roughly of the same order of magnitude as SVM and LR (though this heavily depends on the chosen regularization parameter C). With EE, each subset has a size limited to twice the minority class training size (a relatively low number compared to the size of the majority class). This works in favour of SVM/LR (e.g. SVM has a computational complexity of O(m²), with m the number of training instances). Also note that the NB-version is indeed optimized for sparse data. Internally, the NB-implementation writes a fairly large model file to a specified directory (sizes of 1 GB can be encountered there). This writing operation slows down the training stage (especially in case of EE, where S × T = 15 × 20 = 300 model files are constructed).


Table 10 Average tenfold AUC test set performance of the BL compared to its EE-version (S = 15, T = 20) for SVM, LR, BeSim and NB. In case the EE NB implementation crashed, we put a × in the table. If it is too slow (> 24 hours per fold), we place a []. Additionally, average ranks are assigned to each method. In case EE NB has missing results for a particular dataset, we ensure it has the lowest rank across all other algorithms.

           Mov G(p = 1)   Mov G(p = 25)  Mov Th(p = [])  Yahoo A(p = 1)
BL SVM     71.6 (2.62)    81.41 (1.32)   79.77 (5.33)    56.49 (3.37)
EE SVM     76.12 (2.88)   85.13 (1.86)   86.43 (5.86)    59.74 (2.96)
BL LR      71.02 (2.09)   84.39 (1.84)   83.14 (4.17)    57.84 (2.39)
EE LR      76.69 (2.92)   85.03 (1.98)   86.3 (5.37)     59.79 (2.62)
BL BeSim   76.1 (3.58)    81.3 (2.92)    82.81 (6.6)     56.27 (2.73)
EE BeSim   76.31 (3.71)   81.37 (2.9)    85.02 (6.28)    57.7 (1.71)
BL NB      70.26 (5.84)   77.01 (2.54)   70.48 (10.14)   52.56 (2.09)
EE NB      75.93 (2.83)   85.56 (2.01)   86.91 (4.15)    57.55 (2.73)

           Yahoo A(p = 25)  Yahoo G(p = 1)  Yahoo G(p = 25)  TaFeng(p = 1)
BL SVM     61.61 (2.48)     66.84 (3.66)    78.82 (1.39)     55.75 (1.6)
EE SVM     66.38 (3.16)     73.48 (2.32)    80.55 (1.55)     61.13 (1.83)
BL LR      66.27 (2.96)     69.82 (1.93)    80.45 (1.59)     58.91 (2.31)
EE LR      66.22 (3.28)     73.08 (2.14)    80.53 (1.56)     61.43 (2.32)
BL BeSim   64.54 (2.02)     68.89 (2.49)    79.55 (1.96)     57.89 (1.18)
EE BeSim   65.25 (2.23)     71.18 (2.91)    80.04 (1.85)     59.36 (1.47)
BL NB      65 (1.65)        63.33 (2.56)    78.89 (1.64)     54.61 (1.2)
EE NB      66.6 (2.79)      70.99 (2.88)    81.01 (1.3)      59.01 (1.84)

           TaFeng(p = 25)  Book(p = 1)   Book(p = 25)  LST(p = 1)
BL SVM     66.94 (1.34)    52.6 (1.29)   60.08 (0.71)  99.99 (0.01)
EE SVM     70.4 (1.3)      55.34 (1.28)  65.4 (0.51)   99.98 (0.01)
BL LR      69.24 (1.3)     55.34 (1.27)  63.84 (0.75)  99.99 (0.01)
EE LR      70.28 (1.28)    55.49 (1.49)  65.41 (0.63)  99.97 (0.02)
BL BeSim   67.49 (1.23)    55.19 (1.27)  63.7 (0.63)   99.99 (0.01)
EE BeSim   68 (1.21)       55.21 (1.15)  64.38 (0.42)  99.99 (0)
BL NB      65.21 (1.64)    52.93 (0.9)   59.75 (0.47)  98.69 (0.3)
EE NB      70.72 (1.15)    ×             63.46 (0.61)  99.92 (0.04)

           Adver(p = [])  Adver(p = 1)  CRF(p = [])    Bank(p = [])
BL SVM     96.37 (1.94)   91.18 (2.97)  64.36 (18.97)  66.82 (0.88)
EE SVM     97.63 (1.35)   93.3 (2.14)   86.35 (9.99)   71.54 (0.76)
BL LR      97.19 (1.44)   88.51 (1.93)  81.87 (19.63)  71.43 (0.72)
EE LR      97.57 (0.96)   93.02 (2.06)  86.84 (9.62)   71.77 (0.62)
BL BeSim   97.26 (1.12)   95.38 (1.35)  86.91 (9.36)   67.85 (0.67)
EE BeSim   97.38 (1.04)   93.83 (1.35)  87.02 (10.43)  70.41 (0.55)
BL NB      93.75 (1.9)    93.37 (1.9)   87.24 (9.38)   67.83 (0.63)
EE NB      94.04 (1.75)   ×             ×              []

           Flickr(p = 0.1)  Kdd(p = 0.5)  Average Rank
BL SVM     74.92 (0.17)     74.53 (0.05)  6.44 [7]
EE SVM     79.86 (0.13)     80.98 (0.05)  2.39 [1]
BL LR      79.03 (0.11)     81.29 (0.04)  4.28 [4]
EE LR      79.85 (0.13)     80.75 (0.05)  2.61 [2]
BL BeSim   74.62 (0.13)     74.95 (0)     5.11 [6]
EE BeSim   76.4 (0.13)      77.55 (0.03)  3.61 [3]
BL NB      81.36 (0.1)      74.29 (0.05)  6.5 [8]
EE NB      []               []            5.06 [5]


6 Conclusions and future research directions

We investigated the effects of sampling and cost-sensitive learning techniques on the problem of learning from imbalanced behaviour data. This type of big data has fundamentally different properties and internal structure than traditional data and is characterized by sparseness and very large dimensions. This is a new research topic which enables benefits across a wide variety of application domains, such as fraud detection, targeted advertising, predictive policing, churn prediction, default prediction, etc. To enable future developments in this field, we provide an on-line repository containing implementations, datasets and results, available at the following link: http://www.applieddatamining.com/cms/?q=software

Oversampling techniques need to be adapted in order to cope with this kind of data. We proposed new versions of the traditional SMOTE and ADASYN techniques, which differ in the way synthetic samples are generated and similarity computations are performed. The plain random oversampling (OSR) and the synthetic approaches are competitive methods in terms of AUC and do not seem to suffer from overfitting, as traditional studies report. Though the synthetic approaches have comparable AUC-performances in comparison to OSR, they are useless in practice due to their computationally expensive procedures. The OSR method is very fast for small and medium sized datasets, yet it becomes inappropriate for larger datasets because the training times of the underlying base learner increase drastically.
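The generation step shared by these synthetic approaches can be sketched as follows: a synthetic minority sample is a random point on the segment between a minority instance and one of its minority-class nearest neighbours. This shows only the generic SMOTE-style interpolation on sparse vectors; the paper's adapted variants differ precisely in how neighbours and similarities are computed, which is not reproduced here, and the two example vectors are made up.

```python
# Sketch: SMOTE-style synthetic sample generation on sparse vectors.
import numpy as np
from scipy.sparse import csr_matrix

rng = np.random.default_rng(3)
x = csr_matrix(np.array([[1.0, 0.0, 1.0, 0.0]]))     # a minority instance
x_nn = csr_matrix(np.array([[1.0, 1.0, 0.0, 0.0]]))  # a minority-class neighbour
lam = rng.random()                                   # interpolation factor in [0, 1)
x_syn = x + lam * (x_nn - x)                         # synthetic minority sample
```

Note that interpolating two sparse binary vectors yields a denser, real-valued vector, which is one reason the plain technique needs adaptation for behaviour data.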

With respect to undersampling methodologies, we employed a random undersampling technique (RUS) and Knn-based informed approaches. The latter type selects majority class instances that are closest (Cl Knn) or farthest (Far Knn) from the minority class examples and allows one to investigate the effect of noise/outliers on SVM performance. We observed that these odd instances have a severe performance degrading effect. RUS and Far Knn approaches can significantly improve performance with respect to the baseline thanks to their abilities to reduce noise/outlier levels and to put more emphasis on the minority class. Having said this, they usually perform worse than any of the oversampling approaches. This is in sheer contrast to the results obtained from traditional studies and is heavily related to the intrinsic nature of behaviour data. Timing-wise, RUS is the fastest method. The informed approaches are slower than RUS due to the required nearest neighbour computations. Additionally, a cluster-based undersampling (CBU) technique was proposed to alleviate the within-class imbalance. For low undersampling rates, the performance of CBU is superior to plain RUS, whereas for high undersampling rates the opposite is true. Overall, no substantial gain in performance is obtained in comparison with a simple RUS approach. Furthermore, the method is the slowest among all techniques considered and relies on computationally expensive clustering heuristics.

We studied the effects of boosting, where we made use of a very specific linear SVM and logistic regression combination that uses confidence rated predictions instead of a more popular plain "discrete" boosting algorithm (with weak hypotheses having values in {−1, 1}). Our experiments clearly indicated the regularization constant C in the SVM formulation to act as a "weakness" indicator. Indeed, higher C-values cause stronger learners and shouldn't be used in the boosting process, which matches with the analysis of studies dealing with traditional data. Strong learners can


already quite accurately distinguish majority class from minority class behaviours. The boosting process will focus on the hard to learn instances, among which are noise/outliers, and this can result in overfitting behaviour. We also observed that boosting can usually outperform the plain baseline, indicating that a combination of weak learners can achieve higher AUC-performance than a single strong learner.

EasyEnsemble (EE) integrates random undersampling with a boosting process. By sampling several balanced subsets from the training data and combining these, we can emphasize the minority class, achieve noise/outlier reduction effects and simultaneously explore the majority class space efficiently. Our version differs from the original formulation because we employ a different type of underlying weak learner (SVM and LR combination) that is fed to a confidence rated boosting algorithm (instead of the discrete version). We observed superior AUC-performance with respect to all of the aforementioned techniques. Furthermore, the method is very fast, since each subset can be trained in parallel and has a size that is twice as large as the minority class training set. We studied the effect of the number of subsets and noted that a limited amount of these (O(10)) is already sufficient to obtain satisfactory performances; increasing this number further causes only minor benefits.
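The subset-sampling scheme summarized above can be sketched as follows. This is a simplification: a single strongly regularized linear SVM stands in for the paper's boosted SVM/LR weak-learner combination on each subset, and the function name, sizes and C value are illustrative. It assumes the minority class is labelled 1 and is smaller than the majority class.

```python
# Sketch: EasyEnsemble-style scoring. Draw S balanced subsets (all minority
# instances plus an equally large random majority sample), train a learner
# on each, and average the S decision scores.
import numpy as np
from sklearn.svm import LinearSVC

def easy_ensemble_scores(X, y, X_test, S=10, seed=0):
    rng = np.random.default_rng(seed)
    min_idx = np.flatnonzero(y == 1)      # minority class (assumed label 1)
    maj_idx = np.flatnonzero(y == 0)
    scores = np.zeros(X_test.shape[0])
    for _ in range(S):
        sub = np.concatenate(
            [min_idx, rng.choice(maj_idx, size=min_idx.size, replace=False)])
        clf = LinearSVC(C=0.001).fit(X[sub], y[sub])   # weak, heavily regularized
        scores += clf.decision_function(X_test)
    return scores / S                      # averaged ensemble score
```

Because each subset only has twice the minority class size, the S fits are cheap and embarrassingly parallel, which matches the timing behaviour reported above.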

Additionally, statistical comparison of each of the proposed methods was conducted across a wide variety of datasets. Holm's significance test indicates that each of the proposed techniques (oversampling, undersampling except Cl Knn, cost sensitive learning and boosting variants) can significantly outperform the plain baseline performance (each of the null-hypotheses of equal performance gets rejected with a confidence level of at least 91%). Comparison of the techniques with EE showed that the latter can significantly outperform all of them (each of the null-hypotheses of equal performance gets rejected with a confidence level of at least 75%). A timing-wise comparison showed EE to be the preferred method (after RUS) for large and highly imbalanced datasets. A parallel implementation would make EE the fastest method, even for medium sized datasets that show a high level of imbalance.

The influence of the chosen base learner was investigated by comparing the baseline to its EE-version for the SVM, LR, NB and BeSim inducers. Each of the EE-versions is able to improve upon the baseline performance for nearly all datasets, where base learners of the regularization type are superior because of their abilities to control the strength of the learner.

Though we conducted an extensive benchmark study, there are still plenty of future research directions. First and most important, there is an urgent need to develop a more general theoretical framework to completely understand the effect of imbalance and why certain methods outperform others. Secondly, other types of mechanisms could be investigated in this setting: advanced sampling (Guo et al 2008) set-ups perform sampling based on preliminary classifications; one-class learning is a recognition based methodology where a discriminative boundary is learned around the training examples of a single class alone, and Raskutti and Kowalczyk (2004) demonstrated the superiority of the recognition based one-class SVMs over the discriminative two-class SVMs; other types of ensemble learning strategies (e.g. bagging and stacking) could be investigated; Batista et al (2004) note that combining over- and undersampling can be beneficial when data are highly imbalanced or show very few minority class instances; the ideas behind Extended Nearest Neighbour (Tang and He 2015) could guide us in the sampling process. As opposed to Knn, "it considers not only who are the nearest neighbours of a certain sample, but also who consider the particular sample as their nearest neighbour" (Tang and He 2015). The method assigns a class membership to an unknown sample that maximizes the intra-class coherence. Finally, more advanced techniques do exist that can provide a K (the number of nearest neighbours) faster or with (slightly) better performance.

Another issue relates to the lack of comprehensibility of the EasyEnsemble technique in its present form. Because we currently integrate Logistic Regression in the boosting process, the final ensemble classifier is non-linear and therefore incomprehensible. It might be interesting to investigate the numerical boosting technique proposed in Schapire and Singer (1999), where each weak learner is allowed to output real values (instead of values in [−1, 1]). In that case, we would be able to use a plain linear SVM as a weak learner, without the Logistic Regression component. The final ensemble would be a linear model, which is surely easier to comprehend. Also, the instance-based approach of Martens and Provost (2014) could be used to explain individual predictions of the non-linear EE-technique.
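The linearity argument in the previous paragraph can be made explicit: a weighted sum of linear weak learners collapses into a single linear model.

```latex
% If each weak learner is linear, h_t(x) = w_t^{\top} x + b_t, then the
% boosted ensemble is itself a linear model:
F(x) = \sum_{t=1}^{T} \alpha_t \left( w_t^{\top} x + b_t \right)
     = \Big( \sum_{t=1}^{T} \alpha_t w_t \Big)^{\!\top} x
       + \sum_{t=1}^{T} \alpha_t b_t
     = \tilde{w}^{\top} x + \tilde{b}.
```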

A Oversampling experiments

Table 11 Oversampling experiments. Results show the average tenfold AUC test set performance with respect to increasing β-values (β = [β1, β2, β3, β4] = [0, 1/3, 2/3, 1]). Additionally, standard deviations are shown between brackets. Optimal performances for each method are highlighted in boldface. See Section 4.2 for parameter settings.

MOV G(p = 1)
         β1             β2             β3             β4
OSR      71.6 (2.62)    74.37 (2.04)   73.6 (1.84)    74.73 (2.45)
SMOTE    71.6 (2.62)    75.08 (2.18)   76.02 (2.14)   76.48 (2.3)
ADASYN   71.6 (2.62)    75.16 (1.92)   75.93 (2.08)   76.47 (2.29)

MOV G(p = 25)
OSR      81.41 (1.32)   83.49 (1.81)   83.84 (1.96)   83.91 (2.04)
SMOTE    81.41 (1.32)   83.32 (1.97)   83.59 (2.04)   83.76 (2.11)
ADASYN   81.41 (1.32)   83.61 (1.82)   84.02 (1.97)   83.69 (1.96)

Mov Th(p = [])
OSR      79.77 (5.33)   85.3 (4.66)    83.16 (4.5)    84.59 (5.69)
SMOTE    79.77 (5.33)   84.18 (6.51)   85.58 (5.97)   84.33 (5.75)
ADASYN   79.77 (5.33)   84.11 (6.77)   85.86 (6.06)   85.36 (5.13)

Yahoo A(p = 1)
OSR      55.92 (2.97)   58.66 (3.27)   59.99 (2.28)   59.74 (1.78)
SMOTE    55.92 (2.97)   59.76 (2.62)   59.74 (2.67)   59.43 (2.4)
ADASYN   55.92 (2.97)   59.54 (2.53)   59.55 (2.94)   59.56 (2.22)

Yahoo A(p = 25)
OSR      61.68 (2.42)   64.19 (3.17)   65.08 (3.26)   64.67 (2.1)
SMOTE    61.68 (2.42)   65.46 (3.63)   65.33 (3.23)   64.52 (2.98)
ADASYN   61.68 (2.42)   65.04 (3.74)   65.41 (3.47)   64.4 (2.21)

Yahoo G(p = 1)
OSR      66.84 (3.66)   72.18 (2.36)   73.11 (2.7)    72.49 (3.41)
SMOTE    66.84 (3.66)   72.65 (2.85)   73.27 (3.36)   73.37 (3.56)
ADASYN   66.84 (3.66)   72.87 (2.83)   73.18 (3.2)    73.39 (3.59)

Yahoo G(p = 25)
OSR      78.82 (1.39)   78.78 (2.02)   78.74 (1.86)   78.35 (1.47)
SMOTE    78.82 (1.39)   79.23 (1.57)   79.1 (1.2)     79.03 (1.89)
ADASYN   78.82 (1.39)   79.12 (1.43)   79.23 (1.37)   79.51 (2.01)

TaFeng(p = 1)
OSR      55.75 (1.6)    59.23 (1.96)   60 (1.68)      61.04 (2.36)
SMOTE    55.75 (1.6)    60.26 (1.95)   61.49 (1.8)    61.13 (1.52)
ADASYN   55.75 (1.6)    60.26 (1.9)    61.44 (1.85)   61.16 (1.5)

TaFeng(p = 25)
OSR      66.94 (1.34)   68.84 (1.21)   66.99 (1.42)   67.7 (1.41)
SMOTE    66.94 (1.34)   68.47 (1.5)    67.07 (1.15)   66.65 (0.81)
ADASYN   66.94 (1.34)   68.62 (1.38)   67.85 (1.6)    66.91 (1.39)

Book(p = 1)
OSR      52.6 (1.29)    53.61 (0.94)   55.41 (1.75)   55.87 (1.44)
SMOTE    52.6 (1.29)    54.77 (0.99)   54.91 (0.8)    54.36 (0.98)
ADASYN   52.6 (1.29)    54.86 (1.13)   55.06 (0.73)   54.54 (0.92)

Book(p = 25)
OSR      60.08 (0.71)   61.05 (0.94)   61.82 (1.12)   64.62 (0.57)
SMOTE    60.08 (0.71)   62.6 (0.73)    60.95 (0.68)   63 (0.8)
ADASYN   60.08 (0.71)   62.33 (0.73)   60.77 (0.85)   63.04 (0.58)

LST(p = 1)
OSR      99.99 (0.01)   99.99 (0.01)   99.99 (0.01)   99.99 (0.01)
SMOTE    99.99 (0.01)   99.99 (0.01)   99.99 (0.01)   99.99 (0.01)
ADASYN   99.99 (0.01)   99.99 (0.01)   99.99 (0.01)   99.99 (0.01)

Adver(p = [])
OSR      96.61 (1.82)   97.31 (1.65)   97.07 (1.84)   97.07 (1.79)
SMOTE    96.61 (1.82)   96.91 (1.66)   97.19 (1.65)   97.07 (1.91)
ADASYN   96.61 (1.82)   97.1 (1.7)     97.08 (1.87)   97.07 (1.88)

Adver(p = 1)
OSR      90.93 (3.02)   91.27 (3.03)   92.66 (2.82)   93.29 (1.97)
SMOTE    90.93 (3.02)   92.51 (2.03)   92.96 (2.14)   93.53 (1.81)
ADASYN   90.93 (3.02)   92.22 (2.33)   92.7 (2.36)    93.88 (1.73)

CRF(p = [])
OSR      64.06 (16.43)  80.82 (12.94)  81.28 (12.27)  81.91 (11.28)
SMOTE    64.06 (16.43)  78.64 (16.86)  82.52 (13.74)  79.32 (16.26)
ADASYN   64.06 (16.43)  78.95 (16.72)  81.19 (16.32)  79.31 (16.19)

Bank(p = [])
OSR      66.82 (0.88)   70.1 (0.74)    71.39 (0.8)    71.47 (0.8)
SMOTE
ADASYN


B Undersampling experiments

Table 12 Undersampling experiments. Results show the average tenfold AUC test set performance with respect to increasing βu-values (βu = [βu1, βu2, βu3, βu4, βu5] = [0, 1/4, 1/2, 3/4, 1]). Optimal performances (boldface) are highlighted for each imbalance ratio. Additionally, standard deviations are shown between brackets. CL K represents the "Closest Knn" technique; CL T represents the "Closest tot sim" technique (similar for Far K and Far T, see Section 3.2). Parameter settings are given in Section 4.2.

Mov G(p = 1)
        βu1          βu2          βu3          βu4          βu5
RUS     71.6(2.6)    71.83(2.6)   72.54(2.5)   72.39(3.1)   70.61(3.5)
Cl K    71.6(2.6)    71.4(2)      70.96(1.9)   70.43(2.4)   69.05(4.1)
CL T    71.6(2.6)    70.28(2.5)   66.74(2)     66.8(2.1)    68.18(3.6)
Far K   71.6(2.6)    72.36(2.7)   71.26(3.4)   66.57(5.2)   53.5(3.5)
Far T   71.6(2.6)    72.22(2.8)   71.63(3.6)   64.28(5.3)   50.88(4.4)
CBU     72.55(2.6)   73.28(2.6)   73.12(2.6)   73.84(2.5)   73(3.1)

Mov G(p = 25)
RUS     81.41(1.3)   81.36(1.3)   81.78(1.7)   82.05(1.7)   81.6(2.1)
Cl K    81.41(1.3)   80.86(1.2)   80.95(1.6)   79.73(2.3)   77.95(2.3)
CL T    81.41(1.3)   79.9(1.2)    78.21(1.4)   77.87(1.5)   77.76(2.3)
Far K   81.41(1.3)   80.9(1.5)    78.17(1.8)   74.25(2.4)   69.79(3.2)
Far T   81.41(1.3)   80.86(1.5)   77.2(2.4)    71.16(2.7)   62.4(2.8)
CBU     81.53(1.4)   81.64(1.3)   81.29(1.6)   81.28(2.1)   80.34(2.7)

Mov Th(p = [])
RUS     79.77(5.3)   80.32(5.8)   81.57(5.5)   81.86(6.6)   81.26(6.2)
Cl K    79.77(5.3)   79.25(4.5)   78.07(5)     76.25(6.5)   62.46(8.5)
CL T    79.77(5.3)   78.4(4.4)    72.41(3.5)   64.66(4.5)   60.37(7.3)
Far K   79.77(5.3)   84.54(5)     83.64(6.4)   80.02(7.3)   56.82(10.3)
Far T   79.77(5.3)   85.03(5.7)   82.68(6.8)   75.61(9.2)   56.77(10.9)
CBU     80.11(5.8)   81.17(6)     81.08(6.5)   84.17(5.1)   80.96(6.9)

Yahoo A(p = 1)
RUS     55.92(3)     55.57(3.4)   56.44(3)     55.83(3.4)   56.37(3.3)
Cl K    55.92(3)     55.67(2.4)   53.12(2)     50.57(1.8)   53.79(3.5)
CL T    55.92(3)     55.69(2.1)   53.35(2.2)   50.31(2.2)   52.35(3.3)
Far K   55.92(3)     57.35(2.2)   56.92(1.1)   56.95(2.3)   51.18(2)
Far T   55.92(3)     56.93(2.4)   54.74(1.9)   57.01(1.8)   51.18(2)
CBU     58.21(2.6)   58.45(3.3)   58.31(3.5)   58.39(3.5)   56.09(2.6)

Yahoo A(p = 25)
RUS     61.68(2.4)   62.9(2.9)    63.62(3.6)   63.75(3.1)   63.19(1.9)
Cl K    61.68(2.4)   61.14(2.1)   57.62(1.6)   54.02(1.8)   51.48(1.4)
CL T    61.68(2.4)   60.89(2.8)   58.11(1.4)   54.4(2.1)    51.76(1.4)
Far K   61.68(2.4)   63.96(3)     62.62(2.2)   59.61(1.5)   56.25(1.6)
Far T   61.68(2.4)   63.71(2.4)   59.72(1.6)   57.27(1.1)   54.47(1.1)
CBU     62.46(2.6)   61.85(1.4)   61.78(2.2)   59.94(3)     60.1(4)

Yahoo G(p = 1)
RUS     66.84(3.7)   67.85(3.2)   68.36(3.2)   68.23(4)     69.9(4.2)
Cl K    66.84(3.7)   66.71(2.8)   64.3(3.6)    61.98(3.9)   61.15(1.9)
CL T    66.84(3.7)   65.79(2.7)   63.55(3.3)   59.21(3.5)   61.08(2.4)
Far K   66.84(3.7)   66.76(4.1)   63.84(3.4)   65.16(2)     48.5(2.9)
Far T   66.84(3.7)   66.95(4.1)   63.48(2.9)   65.16(2)     48.48(2.9)
CBU     69.68(4.1)   70.59(3.2)   70.64(3.7)   70.2(2.9)    63.35(3.6)

Yahoo G(p = 25)
RUS     78.82(1.4)   78.91(1.6)   78.97(1.6)   78.61(1.6)   77.82(2.1)
Cl K    78.82(1.4)   77.26(1.5)   72.52(1.5)   67.86(2)     65.07(2.7)
CL T    78.82(1.4)   76.83(1)     71.99(1.8)   67.15(2.3)   61.1(2.7)
Far K   78.82(1.4)   78.26(2.2)   74.69(2.7)   67.22(2.1)   60.72(2.3)
Far T   78.82(1.4)   77.68(2.6)   72.44(3)     64.94(2.4)   59.6(2)
CBU     75.25(3.2)   75.22(2.4)   74.69(2.3)   73.07(2.4)   70.69(2.4)

TaFeng(p = 1)
RUS     55.75(1.6)   56.1(1.6)    56.26(1.7)   57.23(1.7)   59.25(2.2)
Cl K    55.75(1.6)   55.68(1.6)   55.58(1.5)   55.08(1.1)   51.05(1.5)
CL T    55.75(1.6)   55.67(1.6)   54.47(1.6)   47.53(1.6)   49.3(1.1)
Far K   55.75(1.6)   58.99(1.2)   59.47(1.1)   60.04(1.2)   56.31(1)
Far T   55.75(1.6)   58.92(1.3)   59.25(1.3)   58.58(1.1)   56.31(1)
CBU     57.8(1)      58.47(1.1)   58.15(0.9)   58.87(1.4)   57.65(1.6)

TaFeng(p = 25)
RUS     66.94(1.3)   67.44(1.3)   68.1(1.4)    68.27(1.4)   66.13(1.2)
Cl K    66.94(1.3)   66.13(1.4)   63.39(1.2)   59.83(1.3)   56.94(0.7)
CL T    66.94(1.3)   66.38(1.5)   62.89(1.6)   57.46(1.3)   54.56(1.3)
Far K   66.94(1.3)   68.06(1.4)   66.43(1.6)   64.46(1.5)   63.35(1.3)
Far T   66.94(1.3)   64.31(1.1)   62.69(1)     61.27(1.1)   59.03(1)
CBU     64.81(1.2)   64.15(1.1)   64.13(1.2)   63.88(0.8)   63.46(0.8)

Book(p = 1)
RUS     52.6(1.3)    52.79(0.9)   53.46(0.8)   53.89(0.9)   54.05(0.9)
Cl K    52.6(1.3)    52.56(1.2)   52.52(1.3)   52.39(1.1)   53.09(1.1)
CL T    52.6(1.3)    52.56(1.2)   52.52(1.3)   52.39(1.1)   53.05(0.7)
Far K   52.6(1.3)    55.21(1.2)   56.21(1.8)   56.14(1.2)   53.06(1)
Far T   52.6(1.3)    55.21(1.2)   56.21(1.8)   56.14(1.2)   53.06(1)
CBU     54.28(0.9)   53.77(1)     53.33(1.1)   53.34(0.9)   52.84(0.8)

Book(p = 25)
RUS     60.08(0.7)   60.13(0.6)   60.4(0.8)    60.33(0.8)   63.28(0.8)
Cl K    60.08(0.7)   59.96(0.7)   60.13(0.8)   59.96(1)     59.28(0.7)
CL T    60.08(0.7)   59.96(0.7)   60.13(0.8)   60.29(0.4)   54.5(0.9)
Far K   60.08(0.7)   63.29(1)     64.19(0.8)   57.3(1.1)    55.66(1.1)
Far T   60.08(0.7)   62.14(0.5)   58.27(0.6)   56.37(1)     55.66(1.1)
CBU     54.82(0.9)   54.67(0.9)   54.71(0.9)   54.66(1)     54.78(0.9)

LST(p = 1)
RUS     99.99(0)     99.99(0)     99.99(0)     99.98(0)     99.99(0)
Cl K    99.99(0)     99.99(0)     99.99(0)     99.99(0)     99.99(0)
CL T    99.99(0)     99.99(0)     99.99(0)     99.99(0)     99.98(0)
Far K   99.99(0)     99.98(0)     99.98(0)     99.98(0)     99.98(0)
Far T   99.99(0)     99.98(0)     99.98(0)     99.98(0)     99.98(0)
CBU     []           []           []           []           []

Adver(p = [])
RUS     96.61(1.8)   96.32(1.8)   96.63(1.4)   97.12(2.1)   96.22(1.6)
Cl K    96.61(1.8)   96.44(1.5)   96.14(1.5)   96.04(2)     94.8(2.5)
CL T    96.61(1.8)   95.87(2.1)   94.32(1.9)   93.01(2.2)   90.72(2.3)
Far K   96.61(1.8)   96.53(1.4)   95.76(2)     94.39(1.8)   90.49(3.1)
Far T   96.61(1.8)   96.54(1.5)   95.67(1.9)   94.54(1.8)   89.3(2.8)
CBU     96.85(2.3)   96.85(2.3)   97.05(1.5)   96.6(1.6)    96.06(2.1)

Adver(p = 1)
RUS     90.93(3)     91.53(3.1)   92.37(3.4)   91.9(2.9)    91.93(2.2)
Cl K    90.93(3)     90.64(3)     89.87(3.9)   90.21(3.6)   89.18(2)
CL T    90.93(3)     89.7(3.5)    88.55(3.4)   85.76(3.3)   88.2(2.3)
Far K   90.93(3)     93.8(2.3)    92.4(2.6)    88.73(3.4)   85.51(4)
Far T   90.93(3)     93.62(2.4)   93.2(2.2)    88.41(3.6)   85.51(4)
CBU     93.22(2.4)   93.76(2.5)   93.89(2.6)   93.52(2.7)   91.27(2)

CRF(p = [])
RUS     64.06(16.4)  63.28(15.9)  67.98(17.4)  66.95(21.9)  87.73(8.8)
Cl K    64.06(16.4)  62.44(16.6)  62.34(16.9)  71.37(13.8)  78.22(17.7)
CL T    64.06(16.4)  62.44(16.6)  62.34(16.9)  71.37(13.8)  62.67(22.9)
Far K   64.06(16.4)  83.8(14.2)   83.93(14.8)  84.49(13.7)  86.11(9.7)
Far T   64.06(16.4)  83.8(14.2)   83.93(14.8)  84.49(13.7)  86.11(9.7)
CBU     []           []           []           []           []

Bank(p = [])
RUS     66.82(0.9)   67.02(0.9)   67.37(0.8)   67.99(0.6)   69.5(1)
Cl K    66.82(0.9)   66.17(0.7)   65.24(0.6)   64.86(0.6)   58.53(1.1)
CL T    66.82(0.9)   64.92(1.1)   60.69(0.9)   56.33(0.8)   52.87(0.7)
Far K   66.82(0.9)   66.95(0.6)   66.19(0.6)   64.42(0.6)   58.25(1.1)
Far T   66.82(0.9)   67.16(0.6)   64.2(0.8)    59.67(1)     58.25(1.1)
CBU     []           []           []           []           []


C Boosting experiments

This section presents the results of each of the boosting experiments from Section 4.5 on the remaining data sources. Each of the figures below depicts the average tenfold AUC-performance on test data (with µ = 100) with respect to the number of boosting iterations for (left) Adaboost (AB), AdaCost (AC) and EasyEnsemble (EE), with C chosen according to highest validation set AUC-performance (over all possible boosting rounds), and (right) AB and EE (with S = 15) with varying C-levels.

[(a) test AUC versus T for AB, AC(R2), AC(R8), AC(RD), EE(S5), EE(S10), EE(S15) and BL; (b) test AUC versus T for AB and EE with C ∈ {1e-07, 1e-05, 0.001, 0.1}, and BL]

Fig 6 Mov G(p = 1) dataset

[(a) test AUC versus T for AB, AC(R2), AC(R8), AC(RD), EE(S5), EE(S10), EE(S15) and BL; (b) test AUC versus T for AB and EE with C ∈ {1e-07, 1e-05, 0.001, 0.1}, and BL]

Fig 7 Mov Th(p = []) dataset


[(a) test AUC versus T for AB, AC(R2), AC(R8), AC(RD), EE(S5), EE(S10), EE(S15) and BL; (b) test AUC versus T for AB and EE with C ∈ {1e-07, 1e-05, 0.001, 0.1}, and BL]

Fig 8 Yahoo A(p = 1) dataset

[(a) test AUC versus T for AB, AC(R2), AC(R8), AC(RD), EE(S5), EE(S10), EE(S15) and BL; (b) test AUC versus T for AB and EE with C ∈ {1e-07, 1e-05, 0.001, 0.1}, and BL]

Fig 9 Yahoo A(p = 25) dataset

[(a) test AUC versus T for AB, AC(R2), AC(R8), AC(RD), EE(S5), EE(S10), EE(S15) and BL; (b) test AUC versus T for AB and EE with C ∈ {1e-07, 1e-05, 0.001, 0.1}, and BL]

Fig 10 Yahoo G(p = 1) dataset


[(a) test AUC versus T for AB, AC(R2), AC(R8), AC(RD), EE(S5), EE(S10), EE(S15) and BL; (b) test AUC versus T for AB and EE with C ∈ {1e-07, 1e-05, 0.001, 0.1}, and BL]

Fig 11 Yahoo G(p = 25) dataset

[(a) test AUC versus T for AB, AC(R2), AC(R8), AC(RD), EE(S5), EE(S10), EE(S15) and BL; (b) test AUC versus T for AB and EE with C ∈ {1e-07, 1e-05, 0.001, 0.1}, and BL]

Fig 12 TaFeng(p = 1) dataset

[(a) test AUC versus T for AB, AC(R2), AC(R8), AC(RD), EE(S5), EE(S10), EE(S15) and BL; (b) test AUC versus T for AB and EE with C ∈ {1e-07, 1e-05, 0.001, 0.1}, and BL]

Fig 13 Book(p = 1) dataset


[(a) test AUC versus T for AB, AC(R2), AC(R8), AC(RD), EE(S5), EE(S10), EE(S15) and BL; (b) test AUC versus T for AB and EE with C ∈ {1e-07, 1e-05, 0.001, 0.1}, and BL]

Fig 14 LST(p = 1) dataset

[(a) test AUC versus T for AB, AC(R2), AC(R8), AC(RD), EE(S5), EE(S10), EE(S15) and BL; (b) test AUC versus T for AB and EE with C ∈ {1e-07, 1e-05, 0.001, 0.1}, and BL]

Fig 15 Adver(p = []) dataset

[(a) test AUC versus T for AB, AC(R2), AC(R8), AC(RD), EE(S5), EE(S10), EE(S15) and BL; (b) test AUC versus T for AB and EE with C ∈ {1e-07, 1e-05, 0.001, 0.1}, and BL]

Fig 16 Adver(p = 1) dataset


[(a) test AUC versus T for AB, AC(R2), AC(R8), AC(RD), EE(S5), EE(S10), EE(S15) and BL; (b) test AUC versus T for AB and EE with C ∈ {1e-07, 1e-05, 0.001, 0.1}, and BL]

Fig 17 CRF(p = []) dataset

D Final Comparison

[Scatter plot: average rank AUC (x-axis, 0-14) versus average rank Time (y-axis, 0-18) for BL, OSR, SMOTE, ADASYN, RUS, Cl Knn, Far Knn, CBU, AB, AC(R=28), AC(R=RL), EE(S=10), EE(S=15) and EE par]

Fig 18 Average rank AUC versus average rank Time (see Table 9) across the large datasets from Table 2 (CRF and Bank). Regarding the former, an extra row EE par is added, having the same AUC-value as EE(S = 15). Ranks for each dataset are subsequently obtained via the procedure outlined in Section 4.6.1. Points occurring in the upper-right region are preferred.


References

Akbani R, Kwek S, Japkowicz N (2004) Applying support vector machines to imbalanced datasets. In: Machine Learning: ECML 2004, 15th European Conference on Machine Learning, Pisa, Italy, September 20-24, 2004, Proceedings, Springer Berlin Heidelberg, Berlin, Heidelberg, pp 39–50, DOI 10.1007/978-3-540-30115-8_7

Ali A, Shamsuddin SM, Ralescu AL (2015) Classification with class imbalance problem: A review. International Journal of Advances in Soft Computing and its Applications 7(3):176–204

Alzahrani T, Horadam KJ (2016) Community detection in bipartite networks: Algorithms and case studies. In: Complex Systems and Networks: Dynamics, Controls and Applications, Springer Berlin Heidelberg, Berlin, Heidelberg, pp 25–50, DOI 10.1007/978-3-662-47824-0_2

Bachner J (2013) Predictive policing: Preventing crime with data and analytics. IBM Center for the Business of Government

Baesens B, Van Gestel T, Viaene S, Stepanova M, Suykens J, Vanthienen J (2003) Benchmarking state-of-the-art classification algorithms for credit scoring. Journal of the Operational Research Society 54(6):627–635, DOI 10.1057/palgrave.jors.2601545

Barandela R, Sánchez JS, García V, Rangel E (2003) Strategies for learning in class imbalance problems. Pattern Recognition 36(3):849–851, DOI 10.1016/S0031-3203(02)00257-1

Barber MJ (2007) Modularity and community detection in bipartite networks. Physical Review E 76:066102, DOI 10.1103/PhysRevE.76.066102

Barua S, Islam MM, Yao X, Murase K (2014) MWMOTE–majority weighted minority oversampling technique for imbalanced data set learning. IEEE Transactions on Knowledge and Data Engineering 26(2):405–425, DOI 10.1109/TKDE.2012.232

Batista GEAPA, Prati RC, Monard MC (2004) A study of the behavior of several methods for balancing machine learning training data. SIGKDD Explor Newsl 6(1):20–29, DOI 10.1145/1007730.1007735

Beckett SJ (2016) Improved community detection in weighted bipartite networks. Royal Society Open Science 3(1), DOI 10.1098/rsos.140536

Bekkar M, Djemaa HK, Alitouche TA (2013) Evaluation measures for models assessment over imbalanced data sets. Journal of Information Engineering and Applications 3(10):27–38

Bhattacharyya S, Jha S, Tharakunnel K, Westland JC (2011) Data mining for credit card fraud: A comparative study. Decision Support Systems 50(3):602–613, DOI 10.1016/j.dss.2010.08.008

Blondel VD, Guillaume JL, Lambiotte R, Lefebvre E (2008) Fast unfolding of communities in large networks. Journal of Statistical Mechanics: Theory and Experiment 2008(10):P10008

Breiman L, Friedman J, Stone CJ, Olshen RA (1984) Classification and regression trees. Taylor & Francis

Brozovsky L, Petricek V (2007) Recommender system for online dating service. In: Proceedings of Znalosti 2007 Conference, VSB Ostrava

Cha M, Mislove A, Gummadi KP (2009) A measurement-driven analysis of information propagation in the Flickr social network. In: Proceedings of the 18th International Conference on World Wide Web, ACM, New York, NY, USA, WWW '09, pp 721–730, DOI 10.1145/1526709.1526806

Chawla NV (2005) Data mining for imbalanced datasets: An overview. In: Data mining and knowledge discovery handbook, Springer US, Boston, MA, pp 853–867

Chawla NV, Bowyer KW, Hall LO, Kegelmeyer WP (2002) SMOTE: synthetic minority over-sampling technique. Journal of Artificial Intelligence Research 16:321–357

Chawla NV, Lazarevic A, Hall LO, Bowyer KW (2003) SMOTEBoost: Improving prediction of the minority class in boosting. In: Knowledge Discovery in Databases: PKDD 2003, Springer Berlin Heidelberg, Berlin, Heidelberg, pp 107–119

Chawla NV, Japkowicz N, Kotcz A (2004) Editorial: Special issue on learning from imbalanced data sets. SIGKDD Explor Newsl 6(1):1–6, DOI 10.1145/1007730.1007733

Chen M, Mao S, Liu Y (2014) Big data: A survey. Mobile Networks and Applications 19(2):171–209, DOI 10.1007/s11036-013-0489-0

Chyi YM (2003) Classification analysis techniques for skewed class distribution problems. Master thesis, Department of Information Management, National Sun Yat-Sen University

Demšar J (2006) Statistical comparisons of classifiers over multiple data sets. Journal of Machine Learning Research 7(Jan):1–30

Dhillon IS (2001) Co-clustering documents and words using bipartite spectral graph partitioning. In: Proceedings of the Seventh ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, ACM, New York, NY, USA, KDD '01, pp 269–274, DOI 10.1145/502512.502550

54 Jellis Vanhoeyveld David Martens

Drummond C, Holte RC (2003) C4.5, class imbalance, and cost sensitivity: why under-sampling beats over-sampling. In: Proceedings of the ICML '03 Workshop on Learning from Imbalanced Datasets

Fan RE, Chang KW, Hsieh CJ, Wang XR, Lin CJ (2008) LIBLINEAR: A library for large linear classification. Journal of Machine Learning Research 9:1871–1874

Fan W, Stolfo SJ, Zhang J, Chan PK (1999) AdaCost: Misclassification cost-sensitive boosting. In: Proceedings of the Sixteenth International Conference on Machine Learning, Morgan Kaufmann Publishers Inc, San Francisco, CA, USA, ICML '99, pp 97–105

Fawcett T (2006) An introduction to ROC analysis. Pattern Recognition Letters 27(8):861–874, DOI 10.1016/j.patrec.2005.10.010

Finch H (2005) Comparison of distance measures in cluster analysis with dichotomous data. Journal of Data Science 3(1):85–100

Fortunato S (2010) Community detection in graphs. Physics Reports 486(3-5):75–174, DOI 10.1016/j.physrep.2009.11.002

Junqué de Fortuny E, Martens D, Provost F (2014a) Predictive modeling with big data: is bigger really better? Big Data 1(4):215–226, DOI 10.1089/big.2013.0037

Junqué de Fortuny E, Stankova M, Moeyersoms J, Minnaert B, Provost F, Martens D (2014b) Corporate residence fraud detection. In: Proceedings of the 20th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, ACM, New York, NY, USA, KDD '14, pp 1650–1659, DOI 10.1145/2623330.2623333

Frasca M, Bertoni A, Re M, Valentini G (2013) A neural network algorithm for semi-supervised node label learning from unbalanced data. Neural Networks 43:84–98, DOI 10.1016/j.neunet.2013.01.021

Friedman M (1937) The use of ranks to avoid the assumption of normality implicit in the analysis of variance. Journal of the American Statistical Association 32(200):675–701

García E, Lozano F (2007) Boosting support vector machines. In: Machine Learning and Data Mining in Pattern Recognition, 5th International Conference, MLDM 2007, Leipzig, Germany, July 18-20, Post Proceedings, IBaI publishing, pp 153–167

Goldstein M, Uchida S (2016) A comparative evaluation of unsupervised anomaly detection algorithms for multivariate data. PLOS ONE 11(4):1–31, DOI 10.1371/journal.pone.0152173

González PC, Velásquez JD (2013) Characterization and detection of taxpayers with false invoices using data mining techniques. Expert Systems with Applications 40(5):1427–1436, DOI 10.1016/j.eswa.2012.08.051

Guimerà R, Sales-Pardo M, Amaral LAN (2007) Module identification in bipartite and directed networks. Physical Review E 76:036102, DOI 10.1103/PhysRevE.76.036102

Guo H, Viktor HL (2004) Learning from imbalanced data sets with boosting and data generation: The DataBoost-IM approach. SIGKDD Explor Newsl 6(1):30–39, DOI 10.1145/1007730.1007736

Guo X, Yin Y, Dong C, Yang G, Zhou G (2008) On the class imbalance problem. In: 2008 Fourth International Conference on Natural Computation, IEEE, vol 4, pp 192–201, DOI 10.1109/ICNC.2008.871

Han H, Wang WY, Mao BH (2005) Borderline-SMOTE: A new over-sampling method in imbalanced data sets learning. In: Advances in Intelligent Computing, Springer Berlin Heidelberg, Berlin, Heidelberg, pp 878–887

He H, Garcia EA (2009) Learning from imbalanced data. IEEE Transactions on Knowledge and Data Engineering 21(9):1263–1284, DOI 10.1109/TKDE.2008.239

He H, Bai Y, Garcia EA, Li S (2008) ADASYN: Adaptive synthetic sampling approach for imbalanced learning. In: 2008 IEEE International Joint Conference on Neural Networks (IEEE World Congress on Computational Intelligence), IEEE, pp 1322–1328, DOI 10.1109/IJCNN.2008.4633969

Holm S (1979) A simple sequentially rejective multiple test procedure. Scandinavian Journal of Statistics 6(2):65–70

Hsu CW, Lin CJ (2002) A comparison of methods for multiclass support vector machines. IEEE Transactions on Neural Networks 13(2):415–425, DOI 10.1109/72.991427

Huang A (2008) Similarity measures for text document clustering. In: Proceedings of the Sixth New Zealand Computer Science Research Student Conference (NZCSRSC2008), Christchurch, New Zealand, pp 49–56

Iman RL, Davenport JM (1980) Approximations of the critical region of the Friedman statistic. Communications in Statistics-Theory and Methods 9(6):571–595

Jo T, Japkowicz N (2004) Class imbalances versus small disjuncts. ACM SIGKDD Explor Newsl 6(1):40–49, DOI 10.1145/1007730.1007737

Jutla IS, Jeub LG, Mucha PJ (2011-2016) A generalized Louvain method for community detection implemented in MATLAB. URL http://netwiki.amath.unc.edu/GenLouvain

Kubat M, Matwin S (1997) Addressing the curse of imbalanced training sets: One-sided selection. In: Proceedings of the Fourteenth International Conference on Machine Learning, Morgan Kaufmann Publishers Inc, San Francisco, CA, USA, pp 179–186

Lancichinetti A, Fortunato S (2009) Community detection algorithms: A comparative analysis. Physical Review E 80:056117, DOI 10.1103/PhysRevE.80.056117

Larremore DB, Clauset A, Jacobs AZ (2014) Efficiently inferring community structure in bipartite networks. Physical Review E: Statistical, Nonlinear, and Soft Matter Physics 90:012805, DOI 10.1103/PhysRevE.90.012805

Li J, Fine JP (2010) Weighted area under the receiver operating characteristic curve and its application to gene selection. Journal of the Royal Statistical Society: Series C (Applied Statistics) 59(4):673–692, DOI 10.1111/j.1467-9876.2010.00713.x

Li X, Wang L, Sung E (2008) AdaBoost with SVM-based component classifiers. Engineering Applications of Artificial Intelligence 21(5):785–795, DOI 10.1016/j.engappai.2007.07.001

Lichman M (2013) UCI machine learning repository. URL http://archive.ics.uci.edu/ml

Liu W, Chawla S, Cieslak DA, Chawla NV (2010) A robust decision tree algorithm for imbalanced data sets. In: Proceedings of the Tenth SIAM International Conference on Data Mining, SIAM, Philadelphia, vol 10, pp 766–777

Liu XY, Wu J, Zhou ZH (2009) Exploratory undersampling for class-imbalance learning. IEEE Transactions on Systems, Man, and Cybernetics, Part B (Cybernetics) 39(2):539–550, DOI 10.1109/TSMCB.2008.2007853

Luts J, Ojeda F, Van de Plas R, De Moor B, Van Huffel S, Suykens JA (2010) A tutorial on support vector machine-based methods for classification problems in chemometrics. Analytica Chimica Acta 665(2):129–145, DOI 10.1016/j.aca.2010.03.030

Macskassy SA, Provost F (2007) Classification in networked data: A toolkit and a univariate case study. Journal of Machine Learning Research 8(May):935–983

Martens D, Provost F (2014) Explaining data-driven document classifications. MIS Quarterly 38(1):73–100

Martens D, Provost F, Clark J, Junqué de Fortuny E (2016) Mining massive fine-grained behavior data to improve predictive analytics. MIS Quarterly 40(4):869–888

Mazurowski MA, Habas PA, Zurada JM, Lo JY, Baker JA, Tourassi GD (2008) Training neural network classifiers for medical decision making: The effects of imbalanced datasets on classification performance. Neural Networks 21(2-3):427–436, DOI 10.1016/j.neunet.2007.12.031

Mease D, Wyner AJ, Buja A (2007) Boosted classification trees and class probability/quantile estimation. Journal of Machine Learning Research 8:409–439

Nemenyi P (1963) Distribution-free multiple comparisons. Dissertation, Princeton University

Newman MEJ, Girvan M (2004) Finding and evaluating community structure in networks. Physical Review E 69:026113, DOI 10.1103/PhysRevE.69.026113

Ng AY (2004) Feature selection, L1 vs. L2 regularization, and rotational invariance. In: Proceedings of the Twenty-first International Conference on Machine Learning, ACM, New York, NY, USA, ICML '04, pp 78–, DOI 10.1145/1015330.1015435

Ng AY, Jordan MI (2002) On discriminative vs. generative classifiers: A comparison of logistic regression and naive Bayes. In: Advances in Neural Information Processing Systems 14, MIT Press, pp 841–848

Ngai E, Hu Y, Wong Y, Chen Y, Sun X (2011) The application of data mining techniques in financial fraud detection: A classification framework and an academic review of literature. Decision Support Systems 50(3):559–569, DOI 10.1016/j.dss.2010.08.006

Platt JC (1999) Probabilistic outputs for support vector machines and comparisons to regularized likelihood methods. In: Advances in Large-Margin Classifiers, MIT Press, pp 61–74

Porter MA, Onnela JP, Mucha PJ (2009) Communities in networks. Notices of the American Mathematical Society 56(9):1082–1097

Provost F, Fawcett T (2013) Data Science for Business: What you need to know about data mining and data-analytic thinking. O'Reilly Media, Inc

Provost F, Dalessandro B, Hook R, Zhang X, Murray A (2009) Audience selection for on-line brand advertising: Privacy-friendly social network targeting. In: Proceedings of the 15th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, ACM, New York, NY, USA, KDD '09, pp 707–716, DOI 10.1145/1557019.1557098

Raskutti B, Kowalczyk A (2004) Extreme re-balancing for SVMs: A case study. SIGKDD Explor Newsl 6(1):60–69, DOI 10.1145/1007730.1007739

Rosvall M, Bergstrom CT (2008) Maps of random walks on complex networks reveal community structure. Proceedings of the National Academy of Sciences 105(4):1118–1123, DOI 10.1073/pnas.0706851105

Schapire RE (1999) A brief introduction to boosting. In: Proceedings of the 16th International Joint Conference on Artificial Intelligence - Volume 2, Morgan Kaufmann Publishers Inc, San Francisco, CA, USA, IJCAI'99, pp 1401–1406

Schapire RE, Singer Y (1999) Improved boosting algorithms using confidence-rated predictions. Machine Learning 37(3):297–336, DOI 10.1023/A:1007614523901

Shmueli G (2017) Analyzing behavioral big data: Methodological, practical, ethical, and moral issues. Quality Engineering 29(1):57–74, DOI 10.1080/08982112.2016.1210979

Sobhani P, Viktor H, Matwin S (2015) Learning from imbalanced data using ensemble methods and cluster-based undersampling. In: New Frontiers in Mining Complex Patterns: Third International Workshop, NFMCP 2014, Held in Conjunction with ECML-PKDD 2014, Nancy, France, September 19, 2014, Revised Selected Papers, Springer International Publishing, Cham, pp 69–83, DOI 10.1007/978-3-319-17876-9_5

Stankova M (2016) Classification within network data with a bipartite structure. Dissertation, University of Antwerp

Stankova M, Martens D, Provost F (2015) Classification over bipartite graphs through projection. Working Papers 2015001, University of Antwerp, Faculty of Applied Economics

Sun Y, Kamel MS, Wong AK, Wang Y (2007) Cost-sensitive boosting for classification of imbalanced data. Pattern Recognition 40(12):3358–3378, DOI 10.1016/j.patcog.2007.04.009

Suykens JA, Van Gestel T, De Brabanter J, De Moor B, Vandewalle J (2002) Least squares support vector machines. World Scientific, Singapore

Tang B, He H (2015) ENN: Extended nearest neighbor method for pattern recognition [research frontier]. IEEE Computational Intelligence Magazine 10(3):52–60, DOI 10.1109/MCI.2015.2437512

Tang Y, Zhang YQ, Chawla NV, Krasser S (2009) SVMs modeling for highly imbalanced classification. IEEE Transactions on Systems, Man, and Cybernetics, Part B (Cybernetics) 39(1):281–288, DOI 10.1109/TSMCB.2008.2002909

Tobback E, Moeyersoms J, Stankova M, Martens D (2016) Bankruptcy prediction for SMEs using relational data. Working Paper 2016004, University of Antwerp, Faculty of Applied Economics

Verbeke W, Dejaeger K, Martens D, Hur J, Baesens B (2012) New insights into churn prediction in the telecommunication sector: A profit driven data mining approach. European Journal of Operational Research 218(1):211–229, DOI 10.1016/j.ejor.2011.09.031

Veropoulos K, Campbell I, Cristianini N (1999) Controlling the sensitivity of support vector machines. In: Proceedings of the International Joint Conference on Artificial Intelligence, Stockholm, Sweden (IJCAI99), pp 55–60

Whitrow C, Hand DJ, Juszczak P, Weston D, Adams NM (2009) Transaction aggregation as a strategy for credit card fraud detection. Data Mining and Knowledge Discovery 18(1):30–55, DOI 10.1007/s10618-008-0116-z

Wickramaratna J, Holden SB, Buxton BF (2001) Performance degradation in boosting. In: Proceedings of the Second International Workshop on Multiple Classifier Systems, Springer, London, UK, MCS '01, pp 11–21

Yen SJ, Lee YS (2009) Cluster-based under-sampling approaches for imbalanced data distributions. Expert Systems with Applications 36(3, Part 1):5718–5727, DOI 10.1016/j.eswa.2008.06.108

Yu HF, Lo HY, Hsieh HP, Lou JK, McKenzie TG, Chou JW, Chung PH, Ho CH, Chang CF, Wei YH, et al (2010) Feature engineering and classifier ensemble for KDD Cup 2010. In: Proceedings of the KDD Cup 2010 Workshop, pp 1–16

Zha H, He X, Ding C, Simon H, Gu M (2001) Bipartite graph partitioning and data clustering. In: Proceedings of the Tenth International Conference on Information and Knowledge Management, ACM, New York, NY, USA, CIKM '01, pp 25–32, DOI 10.1145/502585.502591

Zhang J, Mani I (2003) KNN approach to unbalanced data distributions: A case study involving information extraction. In: Proceedings of the ICML'2003 Workshop on Learning from Imbalanced Datasets, Washington DC

Ziegler CN, McNee SM, Konstan JA, Lausen G (2005) Improving recommendation lists through topic diversification. In: Proceedings of the 14th International Conference on World Wide Web, ACM, New York, NY, USA, WWW '05, pp 22–32, DOI 10.1145/1060745.1060754

nique. The difference is that it no longer computes similarities with the K closest minority neighbours; instead, it calculates the total similarity with all minority instances in determining the importance. The main reason we included this technique is computational speed7. The last techniques, called "Farthest Knn" and "Farthest tot sim", are included for comparison with the previously mentioned techniques. Their implementation is identical to the previously described techniques; however, they retain the majority class examples that are farthest from the minority class instances. For each of the proposed methods in this paragraph, the amount of undersampling is controlled by a user-specified parameter βu according to the following formula:

Nrrem = ⌊(|Xmaj| − |Xmin|) × βu⌋    (2)

where Nrrem represents the amount of majority class instances to be discarded; βu = 1 means a completely balanced dataset is obtained.
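The "tot sim" selection combined with Equation (2) can be sketched as follows. This is a minimal illustration: the function name, the cosine-similarity choice, and the `keep` argument (covering the "Farthest tot sim" variant) are our own assumptions, not the paper's exact implementation.

```python
import numpy as np

def undersample_total_similarity(X_maj, X_min, beta_u, keep="closest"):
    """Rank majority instances by their total (cosine) similarity with ALL
    minority instances and discard Nr_rem of them, per Equation (2).
    keep="closest" retains the majority instances most similar to the
    minority class; keep="farthest" mirrors the 'Farthest tot sim' variant."""
    def unit_rows(X):
        n = np.linalg.norm(X, axis=1, keepdims=True)
        return X / np.maximum(n, 1e-12)
    # One matrix-vector product: no per-instance sorting of K neighbours.
    tot_sim = unit_rows(X_maj) @ unit_rows(X_min).sum(axis=0)
    nr_rem = int(np.floor((len(X_maj) - len(X_min)) * beta_u))  # Equation (2)
    order = np.argsort(tot_sim)  # ascending total similarity
    retained = order[nr_rem:] if keep == "closest" else order[:len(X_maj) - nr_rem]
    return X_maj[retained]
```

With βu = 1 the returned set has exactly |Xmin| majority instances, i.e. a balanced dataset.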

The second set of informed undersampling techniques aims at targeting the within-class imbalance problem and is based on the approach proposed in Sobhani et al (2015). They postulate that this within-class imbalance problem is more pronounced in the case of undersampling methods. If we were to randomly select majority class instances, then the probability of drawing an instance from small disjuncts within the majority class would be very low. These regions might therefore contain no representatives and remain unlearned. The authors chose to address this issue by clustering the majority class instances in a first step and subsequently selecting an equal number of representatives from each cluster. The reported results show their approach to outperform the CBO-algorithm (see Section 1.1). In the following paragraphs we will dig deeper into the subject of clustering behaviour data. We already refer to Algorithm 2 for an overview of our cluster-based undersampling method (CBU).

As we noted in Section 2.1, behaviour data can be represented as a bipartite graph. The clustering of behaviour data8 aims at finding groups of nodes (communities) that connect more to each other than to other nodes in the network. This subject is currently an active area of research, with a rapid evolution of a vast number of cluster detection techniques (Zha et al 2001; Dhillon 2001; Larremore et al 2014; Beckett 2016). We refer to Porter et al (2009), Fortunato (2010) and Alzahrani and Horadam (2016) for detailed surveys on the problem. It should be noted that the vast majority of publications deal with the subject of clustering unigraphs (networks with only one type of nodes). It is only fairly recently that interest grew in the clustering of bigraphs. In our implementations, we have chosen the popular modularity-based approaches9 for clustering bigraphs, and these fall into two directions: in the first, the modularity function that is used for unigraphs is adapted to be suitable for bigraphs, see for instance the work of Barber (2007). The other direction, which we adopt in our study, projects the bigraph to a unigraph of bottom nodes and performs community detection on the projection using traditional modularity definitions. Note that Guimerà et al (2007) observed no difference in the obtained communities using either direction.

7 For each majority class instance, we no longer need to sort the similarities with all minority instances in determining the K largest values.

8 This subject is more commonly known as community detection in bipartite graphs.
9 Modularity-based approaches attempt to optimize a quality function known as modularity for finding community structures in networks and rely on the use of heuristics due to the complexity of the problem.

14 Jellis Vanhoeyveld David Martens

In this article we adopt the methodology proposed in Alzahrani and Horadam (2016), which consists of projecting the bigraph to a unigraph of bottom nodes and applying10 the Louvain algorithm (Blondel et al 2008) on the projection. Lancichinetti and Fortunato (2009) performed a comparative study regarding the performance of 12 community detection algorithms and concluded the Louvain method to be the best modularity-based algorithm and second best among all algorithms. The heuristic is very fast, with an O(m) complexity, with m the number of edges in the unigraph. We have chosen the Louvain algorithm because of its speed performance and the availability of a toolbox (Jutla et al 2011-2016) that is directly compatible with our implementations. The toolbox provides a generalized implementation of the Louvain algorithm in the sense that multiple definitions of modularity are possible. The quality function we chose is the popular Newman-Girvan modularity (Newman and Girvan 2004).

With respect to the projection, Alzahrani and Horadam (2016) connect two bottom nodes if they have at least one top node in common. The connection weight between two bottom nodes in the projection is set to the number of shared top nodes. In our implementation, we adapted the connection weights in accordance with Stankova et al (2015) as follows: first of all, we assign weights to the top nodes corresponding to the hyperbolic tangent applied to the inverse degree of the top node. Next, the connection weight between two bottom nodes in the projection corresponds with the total weight of the shared top nodes. Top nodes having low degrees therefore obtain a higher contribution in the projection (e.g. two users making a transaction to a local book store are assumed to be more closely connected to each other than two users making a transaction to a large retail store).
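The weighted projection above can be sketched compactly in sparse-matrix form; `project_bigraph` is a hypothetical name, and expressing the shared-top-node weight sum as A·diag(w)·Aᵀ is our own formulation under the assumption of a binary biadjacency matrix.

```python
import numpy as np
from scipy.sparse import csr_matrix

def project_bigraph(A):
    """Project a binary bottom x top biadjacency matrix A to a weighted
    unigraph of bottom nodes. Each top node gets weight tanh(1/degree);
    the edge weight between two bottom nodes is the total weight of their
    shared top nodes (as in Stankova et al 2015)."""
    A = csr_matrix(A, dtype=float)
    deg = np.asarray(A.sum(axis=0)).ravel()        # top-node degrees
    w = np.tanh(1.0 / np.maximum(deg, 1.0))        # low-degree tops weigh more
    Aw = A.multiply(w[np.newaxis, :]).tocsr()      # scale columns by top weight
    W = (Aw @ A.T).tocsr()                         # w_ij = sum of shared top weights
    W.setdiag(0.0)                                 # drop self-loops
    W.eliminate_zeros()
    return W
```

For example, two bottom nodes sharing a single top node of degree 2 get connection weight tanh(1/2) ≈ 0.46, while a shared top node of degree 100 contributes only tanh(0.01) ≈ 0.01.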

In our CBU-algorithm (see Algorithm 2), after clustering the bigraph containing exclusively majority class instances, we randomly select an equal amount of majority instances from each community to target the within-class imbalance problem. In the rare situation that the number of obtained clusters exceeds the required amount of majority instances (Nrretain), we sort the communities according to a user-specified parameter Clustopt and randomly select 1 instance from the first Nrretain clusters. The parameter Clustopt can take the following values:

– C_Smallest, where we sort clusters in ascending order of their size
– C_Largest, where we sort clusters in descending order of their size

Note that we randomly select majority class instances from each cluster. Yen and Lee (2009) found a random selection strategy after clustering to be superior to informed approaches based on distance.

10 Note that they also made use of the flow-based algorithm Infomap (Rosvall and Bergstrom 2008) that shows excellent results on the LFR-benchmark.

Imbalanced classification in sparse and large behaviour datasets 15

Algorithm 2 CBU: pseudo-code implementation for behaviour data

Input: Xmin, Xmaj, βu, Clustopt
a) Cluster the majority class instances Xmaj:
– Assign weights to each top node corresponding with the hyperbolic tangent applied to the inverse of the node's degree.
– Project the bigraph Xmaj to a weighted unigraph consisting of bottom node majority class instances. The weight wij between majority class instances i and j corresponds with the total weight of the shared top nodes.
– Apply the Louvain algorithm (Blondel et al 2008) on the projected unigraph to partition the majority class instances into clusters.
b) Select majority class instances:
Nrrem ← ⌊(|Xmaj| − |Xmin|) × βu⌋ (see Equation (2))
Nrretain ← |Xmaj| − Nrrem
if Nrretain < |Clust| then
– Sort clusters according to Clustopt
– Randomly select 1 instance from the first Nrretain clusters
else
– Randomly select ⌈Nrretain/|Clust|⌉ majority class instances from each cluster
– Randomly discard instances from the previous step until its size corresponds with Nrretain
end if
c) Return the new training set consisting of Xmin and the selected majority class instances from step b.
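Step b of Algorithm 2 could be sketched as below, assuming the Louvain cluster labels have already been computed; the function name and the handling of clusters smaller than the per-cluster quota are our own assumptions.

```python
import numpy as np

def cbu_select(maj_idx, labels, n_min, beta_u, clust_opt="C_Largest", rng=None):
    """Step b of Algorithm 2: pick majority instances evenly across clusters.
    maj_idx holds majority-instance ids, labels their cluster assignments."""
    rng = np.random.default_rng(rng)
    maj_idx, labels = np.asarray(maj_idx), np.asarray(labels)
    nr_rem = int(np.floor((len(maj_idx) - n_min) * beta_u))   # Equation (2)
    nr_retain = len(maj_idx) - nr_rem
    clusters = [maj_idx[labels == c] for c in np.unique(labels)]
    if nr_retain < len(clusters):
        # More clusters than instances to retain: sort by size according to
        # Clust_opt and take one random instance from the first nr_retain.
        clusters.sort(key=len, reverse=(clust_opt == "C_Largest"))
        return np.array([rng.choice(c) for c in clusters[:nr_retain]])
    per_cluster = int(np.ceil(nr_retain / len(clusters)))
    picked = np.concatenate([
        rng.choice(c, size=min(per_cluster, len(c)), replace=False)
        for c in clusters])
    # Randomly discard the surplus so exactly nr_retain instances remain.
    return rng.choice(picked, size=nr_retain, replace=False)
```

The equal per-cluster quota is what targets within-class imbalance: small disjuncts keep representatives that random undersampling would likely miss.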

3.3 Boosting, cost-sensitive learning and EasyEnsemble

3.3.1 AdaBoost

The AdaBoost (Schapire and Singer 1999; Schapire 1999) algorithm has been designed from a perspective of improving the performance of a weak learner so that it achieves accuracies that are comparable with a strong learning algorithm. Fundamental to the idea of boosting is to maintain a weight distribution over the training set. In each boosting iteration, the weights of wrongly classified instances are increased, so that the underlying weak learner puts more emphasis on these hard examples. In our implementation, we will consider using a SVM as base learner. SVMs are generally regarded as strong learners. The studies of Wickramaratna et al (2001) and García and Lozano (2007) note that using a strong learner usually results in performance degradation during the boosting process. The RBF-kernel SVM classifier (Wickramaratna et al 2001; Li et al 2008) is used as the underlying classifier to prove their point. In our study, we will employ a linear SVM, which can be considered as a weaker version compared to the RBF-kernel. Furthermore, the regularization parameter C can be viewed as a "weakness" indicator11. Lowering the C-value results in weaker learners, as can be seen from the goal function of the SVM optimization problem (see Equation (1)). We will come back to this point in Section 4.5.

11 The distinction between weak/strong learners is loosely 'defined' in Schapire (1999). A weak learner corresponds with a hypothesis that performs just slightly better than random guessing. A strong learner is able to generate a hypothesis with an arbitrarily low error rate, given enough data. We adopt these definitions, but consider the distinction between weak/strong based on training set error. In a SVM context, it is quite typical that error levels on training data drop with increasing C-values (Suykens et al 2002). A learner that is 'too strong' means that even though its performance on training data is very high, it fails to generalize well and the test set error increases due to overfitting.

The boosting algorithm with underlying SVM is presented in Algorithm 3. This algorithm closely follows the original boosting implementation presented in Schapire and Singer (1999) and requires each learner to output confidence-rated predictions in the interval [−1,1]. Since the SVM outputs real-valued scores, we apply the procedure of Platt (1999) to transform these scores into probability estimates (which can easily be translated to form confidence-rated predictions). A logistic regression (LR) model (Ng and Jordan 2002) is trained using as input the SVM-scores and the corresponding labels as output. Note that the same data that are used to construct the linear SVM can be used to estimate the LR-model; see Platt (1999) for a motivation.

The boosting algorithm requires the weak learner to be trained using a distribution Dt. One could sample from this distribution to generate (unweighted) examples and train a SVM next. We have chosen to include weights in the SVM goal function formulation (this requires an extension of the LIBLINEAR package):

min(w,b,ξi)  (wT w)/2 + C ∑i=1..m weighti ξi    (3)

The weights weighti are set according to the weight distribution Dt(i) in each round of boosting. Note that the C-value that is used to train the SVM model will be divided by mean(weighti). This normalization allows for a fair comparison between the weighted and unweighted SVM versions (e.g. in the first round of boosting, weighti = 1/m for all instances; by multiplying the C-value with m, this corresponds to solving the unweighted problem with the same C-value).
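The paper extends LIBLINEAR to support instance weights; as a stand-in, scikit-learn's `sample_weight` argument gives a comparable weighted hinge loss. The C/mean(weighti) normalisation follows the text, while the function name and the hinge-loss choice are our assumptions.

```python
import numpy as np
from sklearn.svm import LinearSVC

def fit_weighted_svm(X, y, D, C=1.0):
    """Train the weighted linear SVM of Equation (3). D is the boosting
    distribution D_t over the m training instances. Dividing C by the mean
    weight makes round 1 (D_t(i) = 1/m) equivalent to the unweighted SVM
    with the same C, as noted in the text."""
    weights = np.asarray(D, dtype=float)
    clf = LinearSVC(C=C / weights.mean(), loss="hinge", dual=True)
    clf.fit(X, y, sample_weight=weights)   # weight_i multiplies each slack term
    return clf
```

With uniform weights 1/m the effective regularisation is C·m on weights that sum to the original slack sum, reproducing the plain SVM objective.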

We introduce an additional parameter μ, called the weight percentage, with values in [0,100], in the boosting algorithm of Algorithm 3. This parameter controls the amount of training data that is used to construct the SVM model and subsequent LR model. We sort the original training data in descending order according to the distribution Dt. Next, we form a new training set of minimal cardinality by including points from the original sorted training data until the total weight is higher than the weight-percentage μ/100. This way, the newly formed training data will contain only the part of the original training data that has the most weight. This partial set will then be used to construct a weighted SVM-model (according to Equation (3), with updated distribution for this set) and a subsequent LR-model. The idea of using a partial dataset to construct the base learner not only reduces training times but also weakens the learner (García and Lozano 2007).
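The μ-based selection of the most heavily weighted part of the training set can be sketched as follows (the function name is ours):

```python
import numpy as np

def weight_percentage_subset(D, mu):
    """Return the smallest set of training indices whose cumulative weight
    under distribution D exceeds mu/100, i.e. the most heavily weighted
    part of the training data."""
    order = np.argsort(D)[::-1]             # sort indices by descending weight
    cum = np.cumsum(D[order])
    # First position where the cumulative weight passes the threshold:
    k = int(np.searchsorted(cum, mu / 100.0)) + 1
    return order[:min(k, len(D))]
```

For D = [0.5, 0.3, 0.1, 0.1] and μ = 70, the subset {0, 1} is returned (cumulative weight 0.8 > 0.7), so the base learner sees only half the instances.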

In Algorithm 3 we have an explicit check to verify if rAB = 1. In this case, the SVM model outputs scores that allow for a perfect classification on the training set. The subsequent LR model will find a threshold and output a value of −1 if the SVM score is lower than this threshold (a value of +1 if the score is higher than this threshold). In this situation, the training data would be perfectly classified. In our implementation, we attempt to avoid this kind of behaviour because it can lead to overfitting, meaning it might pinpoint the wrong threshold and make too drastic decisions (we lose the meaning of confidence). The check if rAB ≤ 0 verifies if the currently boosted model is performing worse than random (this model would have an rAB-value of 0). Obviously, if the model performs worse than random, we quit the boosting process. During the first round of boosting, we perform similar checks that are not explicitly indicated in Algorithm 3. In the case where rAB = 1, we output the SVM scores instead of the LR binary values. When rAB ≤ 0, we quit the boosting process and output the LR scores.

Algorithm 3 AdaBoost with a SVM-LR combination as a base learner

Input: (X, Y) = (x1, y1), ..., (xm, ym), C, T, μ
Initialize distribution D1(i) = 1/m
for t = 1 to T do
– train weak learner using distribution Dt. The weak learner consists of a weighted linear SVM and LR model, trained with weight-percentage μ of Dt:
    ht ← Train_WeakLearner(X, Y, Dt, C, μ)
– compute the weighted confidence rAB on the training data:
    rAB ← ∑i=1..m Dt(i) yi ht(xi)
  If (rAB = 1 or rAB ≤ 0) then αt ← 0 and stop the boosting process
– choose αt ∈ R:
    αt ← (1/2) log((1 + rAB) / (1 − rAB))
– update distribution:
    Dt+1(i) ← Dt(i) exp(−αt yi ht(xi)) / Zt
  where Zt is a normalization factor (chosen so that Dt+1 will be a distribution)
end for
Output the final hypothesis (the output score is the term contained in the sign function):
    H(x) = sign(∑t=1..T α̃t ht(x)) with α̃t = αt / ∑i=1..T αi
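Stripped of the SVM/LR specifics, the confidence-rated boosting loop of Algorithm 3 might look like the sketch below; `train_weak` stands for any base learner returning scores in [−1,1], and the error handling when boosting stops in round 1 is our own simplification.

```python
import numpy as np

def adaboost_confidence(X, y, train_weak, T):
    """Confidence-rated AdaBoost (Schapire and Singer 1999), mirroring
    Algorithm 3. train_weak(X, y, D) must return a hypothesis h whose
    outputs lie in [-1, 1]; y takes values in {-1, +1}."""
    m = len(y)
    D = np.full(m, 1.0 / m)              # D_1(i) = 1/m
    hyps, alphas = [], []
    for _ in range(T):
        h = train_weak(X, y, D)
        r = np.sum(D * y * h(X))         # weighted confidence r_AB
        if r >= 1 or r <= 0:             # perfect or worse than random: stop
            break
        alpha = 0.5 * np.log((1 + r) / (1 - r))
        D = D * np.exp(-alpha * y * h(X))
        D /= D.sum()                     # normalisation by Z_t
        hyps.append(h)
        alphas.append(alpha)
    if not hyps:                         # stopped in round 1 (see text)
        raise ValueError("base learner was perfect or worse than random")
    a = np.array(alphas) / np.sum(alphas)   # normalised alpha_t
    return lambda Xn: np.sign(sum(ai * h(Xn) for ai, h in zip(a, hyps)))
```

Note how r close to 1 yields a large αt: confident, accurate learners dominate the vote, which is exactly why C acts as a weakness knob for the SVM base learner.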

3.3.2 AdaCost

The AdaCost algorithm (Fan et al 1999) is a variant of cost-sensitive learning where misclassification costs are introduced in the weight-update formula of AdaBoost. The cost-sensitive update rule will increase the weights of costly misclassified instances more aggressively and decrease the weights of costly correct classifications more conservatively. Each instance is given a misclassification cost ci, where we chose to put ci = 1 for positive (minority) instances and ci = 1/R for negative (majority) instances. R is a user-defined value that allows one to put more emphasis on the minority class. The implementation of AdaCost is similar to Algorithm 3, yet there are a few differences (Fan et al 1999): firstly, the initial distribution is chosen as D1(i) = ci / ∑j=1..m cj; secondly, the weight update rule is given by Dt+1(i) = Dt(i) exp(−αt yi ht(xi) β(i)) / Zt, where β(i) = −0.5 sign(yi ht(xi)) ci + 0.5 is a cost-adjustment function; finally, the choice of αt is given by αt = (1/2) log((1 + rAC) / (1 − rAC)), where rAC = ∑i=1..m Dt(i) yi ht(xi) β(i). Note that the checks to stop the boosting process prematurely (see the second bullet in Algorithm 3) are still based on the r-value obtained from AdaBoost (rAB). This is because β ∈ [0,1].
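The cost-adjustment function and the resulting weight update can be written down directly (the function names are ours):

```python
import numpy as np

def adacost_beta(y, h_x, c):
    """AdaCost cost-adjustment: beta(i) = -0.5*sign(y_i h(x_i))*c_i + 0.5.
    Misclassifications (sign = -1) give beta = 0.5 + 0.5*c_i, correct
    classifications give beta = 0.5 - 0.5*c_i, so beta stays in [0, 1]."""
    return -0.5 * np.sign(y * h_x) * c + 0.5

def adacost_update(D, y, h_x, c, alpha):
    """AdaCost weight update: D_{t+1}(i) ∝ D_t(i) exp(-alpha y_i h(x_i) beta(i))."""
    Dn = D * np.exp(-alpha * y * h_x * adacost_beta(y, h_x, c))
    return Dn / Dn.sum()
```

With ci = 1 for minority instances, a misclassified minority example gets β = 1 (full weight boost), while a misclassified majority example with ci = 1/R gets only β = 0.5 + 0.5/R, implementing the asymmetric emphasis described above.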

In the first boosting round of AdaCost, the weighted SVM formulation (3) is equivalent to solving a SVM formulation with the following goal function:

min_{w,b,ξ_i}   wᵀw/2 + C⁺ Σ_{i|y_i=1} ξ_i + C⁻ Σ_{i|y_i=−1} ξ_i      (4)

where the sums run over the m⁺ positive and m⁻ negative training instances respectively, and C⁺/C⁻ = R. This can be seen as a cost-sensitive version of a SVM, and this idea was initially proposed by Veropoulos et al (1999).
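A minimal sketch of formulation (4) via subgradient descent on the class-weighted hinge loss follows. This is an illustration only, not the solver used in the paper (which trains SVMs in the dual); it assumes numpy is available.

```python
import numpy as np

def weighted_linear_svm(X, y, C_pos, C_neg, lr=0.01, epochs=500):
    """Cost-sensitive linear SVM (Veropoulos et al., 1999) sketch.

    Minimises  w'w/2 + C+ * sum_{y_i=+1} xi_i + C- * sum_{y_i=-1} xi_i
    with xi_i = max(0, 1 - y_i (w.x_i + b)) by plain subgradient descent.
    """
    n, d = X.shape
    w, b = np.zeros(d), 0.0
    C = np.where(y == 1, C_pos, C_neg)       # per-instance cost C+ or C-
    for _ in range(epochs):
        margins = y * (X @ w + b)
        viol = margins < 1                   # instances with xi_i > 0
        # subgradient of the weighted hinge objective
        gw = w - (C[viol] * y[viol]) @ X[viol]
        gb = -np.sum(C[viol] * y[viol])
        w -= lr * gw
        b -= lr * gb
    return w, b
```

Setting C⁺/C⁻ = R > 1 makes slack on minority instances R times as expensive, shifting the hyperplane away from the minority class.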

3.3.3 EasyEnsemble

One of the disadvantages of the random undersampling method is the fact that we are discarding potentially valuable information. EasyEnsemble (Liu et al 2009) is a method that combines several balanced subsets (S in total) containing randomly selected majority class instances together with all minority examples. Each subset contains the same amount of instances from both classes and is fed to the boosting algorithm presented in Algorithm 3. Afterwards, the weak learners h_st of each subset s are simply combined to form the final ensemble:

H(x) = sign( Σ_{s=1}^{S} Σ_{t=1}^{T} α̃_st h_st(x) )   with   s = 1,…,S;  t = 1,…,T      (5)

It is clear that this technique benefits from a combination of bagging and boosting (Liu et al 2009). Note that we apply the same boosting algorithm as previously described to each of the balanced subsets. However, when r_AB = 1 in the first round of boosting, we quit the boosting process, put α_1 = 1 and continue to use the trained LR-model in the final ensemble. It was previously noted that this can cause overfitting, in the sense that it can point to a wrong threshold. However, the LR-model still contains information, and its combination with the models obtained from the other subsets remains valuable (as we have noted from initial experiments by comparing the situation where we include or reject those subsets).

There are a few subtle though important differences with respect to the experiments performed by Liu et al (2009). First of all, the authors use a CART-algorithm (Breiman et al 1984) as base learner; we employ a linear SVM with subsequent LR as a weak learner. To our knowledge this combination has not been proposed elsewhere, yet it proves to be very efficient in this setting. Secondly, their methodology employs a discrete version of AdaBoost where each weak learner outputs binary values in {−1, 1}. As already stated, we make use of an improved version of AdaBoost that relies on confidence-rated predictions (Schapire and Singer 1999). Finally, the authors reported performances with fixed levels of S = 4 and T = 10; we will investigate the effect of varying S and T levels.
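The overall scheme can be sketched compactly. This is illustrative only: `boost` stands in for the confidence-rated boosting of Algorithm 3 with the SVM+LR weak learner, and instances are kept as plain Python objects.

```python
import random

def easy_ensemble(X_maj, X_min, boost, S, T):
    """EasyEnsemble (Liu et al., 2009) sketch: S random balanced subsets,
    each boosted for T rounds, pooled into one ensemble.

    boost(X, y, T) must return a list of (alpha, h) pairs, e.g. from the
    confidence-rated AdaBoost described in Algorithm 3.
    """
    models = []
    for _ in range(S):
        # sample |X_min| majority instances; keep every minority instance
        sub = random.sample(X_maj, len(X_min))
        X = sub + X_min
        y = [-1] * len(sub) + [1] * len(X_min)
        models.extend(boost(X, y, T))        # pool all weak learners

    def H(x):                                # sign(.) of this score is eq. (5)
        return sum(alpha * h(x) for alpha, h in models)
    return H
```

Each subset is only twice the minority class size and the S boosting runs are independent, which is why the method parallelizes so well.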

Imbalanced classification in sparse and large behaviour datasets 19

4 Results and discussion

4.1 Datasets

Stankova et al (2015) provide the first large collection of benchmark behaviour datasets for classification. In our experiments we make use of these data sources and extend this repository with two additional datasets. Each of these datasets shows a bipartite structure with a clear target variable to predict. We refer to this study and the next paragraph for a short description of the available data resources. In this paragraph we indicate why we have chosen to include or reject certain data sources from the aforementioned study. The available datasets can be divided into small, medium and large datasets, based on the number of instances and the number of features present. The Norwegian companies and Reality Mining datasets comprise the small datasets: since they contain only a few hundreds of instances or features, they are regarded as impractical for our purposes. The MovieLens, Yahoo, TaFeng, Book-Crossing and LibimSeTi datasets belong to the medium-sized datasets, each containing a few thousands up to a few hundreds of thousands of instances and features. All of these datasets are included in our study. The large datasets, containing hundreds of thousands up to millions of instances and features, are the Flickr and Kdd databases.12 Other large proprietary data sources not included in Stankova et al (2015) are the corporate residence fraud (CRF) and banking (Bank) datasets, which arise from real-life application domains with intrinsic imbalance.

To summarize, we have gathered datasets containing such fine-grained behaviour data from a wide variety of application domains. The MovieLens datasets, for which we are predicting the gender13 (Mov G) or the genre thriller14 (Mov Th), provide data on which films each user has rated. The Yahoo movies15 dataset has a similar structure, where the age of each user, Yahoo A (above or below average), or the gender, Yahoo G, is being predicted. The TaFeng dataset16 contains data on shopping behaviour, where age (below or above average) is being predicted based upon which products are purchased. In the Book-Crossing (Book) dataset (Ziegler et al 2005), users rate books and the age of the user (above or below average) is being predicted. LibimSeTi (LST) contains data from a dating site (Brozovsky and Petricek 2007), where users rate each other's profiles and gender is being predicted. In the advertisement (Adver) dataset (Lichman 2013), we try to predict whether a url is an advertisement, based on a large variety of binary features of the url. Note that this dataset does not arise from the behaviour of entities, yet it still has a high-dimensional and sparse representation. The Flickr dataset (Cha et al 2009) contains pictures being marked by users as favorite, and we predict the number of comments on each picture (below

12 Flickr and KDD will be excluded in the comparative study of Section 46 This is because somemethods are too computationally intensive - especially in combination with the large number of possibleparameter combinations - to be applied on these very large data sources Furthermore our statistical evi-dence is already sufficiently strong to conclude significance without these datasets Having said this thesedata sources will be included in the analysis of Section 5

13 MovieLens 1M dataset from http://grouplens.org/datasets/movielens
14 MovieLens 10M dataset from http://grouplens.org/datasets/movielens
15 https://webscope.sandbox.yahoo.com
16 http://www.bigdatalab.ac.cn/benchmark/bm/dd?data=Ta-Feng


or above average). In the Kdd cup data, the performance of a student on a test is being predicted based on artificially created binary features (Yu et al 2010). The corporate residence fraud dataset (CRF) (Junque de Fortuny et al 2014b) contains data on foreign companies making transactions with specific Belgian companies, where we try to predict whether the foreign company commits residence fraud (a type of fiscal fraud). Finally, the banking dataset (Bank) (Martens et al 2016) contains detailed behaviour of consumers making transactions with merchants or other persons, to predict interest in a pension fund product. Some characteristics of these datasets can be found in Table 2. The features column only shows the number of active features.17

Table 2 Behaviour data characteristics. The final column shows the imbalance ratio p, defined as the ratio of the number of minority class instances to the number of majority class instances in the training set (p = 100 × |X_min|_train / |X_maj|_train), expressed as a percentage. See Section 4.2 for details regarding p.

Name     |X_maj|    |X_min|    Features    p
Mov G    4331       1709       3706        1 & 25
Mov Th   10546      131        69878       1.24 (p = [])
Yahoo A  6030       1612       11915       1 & 25
Yahoo G  5436       2206       11915       1 & 25
TaFeng   17330      14310      23719       1 & 25
Book     42900      18858      282973      1 & 25
LST      59702      60145      166353      1
Adver    2792       457        1555        16.38 (p = []) & 1
CRF      869071     62         108753      0.0072 (p = [])
Bank     1193619    11107      3139570     0.93 (p = [])
Flickr   8166814    3028330    497472      0.1
Kdd      7171885    1235867    19306083    0.5

4.2 Methodology

Regarding the experiments performed in the upcoming sections, we applied a tenfold cross-validation procedure. Each of the folds contains 80% training data, 10% validation data and 10% test data. Note that these percentages are valid for both the majority class and the minority class (stratified sampling). As can be seen from Table 2, some datasets are balanced in nature. We created artificial imbalance for these datasets by removing minority class instances from the initial training set, according to a user-defined parameter p. We ensured that the amount of minority training instances corresponds to p percent of the majority class training size: |X_min|_train = (p/100) |X_maj|_train. As an example, say that we are using the Book dataset with p = 25. In that case we know that the majority class contains 34320 training instances (80% of |X_maj| = 42900). The minority training data would contain 8580 instances (25% of 34320). When the dataset is already imbalanced we define p = [], which means that no downsampling of the minority class training data is performed. Note that the validation and test data are left untouched.18

17 Active features represent features that are present for at least one instance in the dataset. A non-active feature corresponds with a column of zeros in the matrix representation and would not contribute to the model.
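The split arithmetic in the Book example above can be checked with a trivial sketch (the helper name is illustrative):

```python
def minority_train_size(n_maj, p, train_frac=0.8):
    """Number of majority and minority training instances for imbalance
    ratio p (in percent), given the majority class size and the 80/10/10
    split described in the text."""
    n_maj_train = int(train_frac * n_maj)
    n_min_train = int(p / 100.0 * n_maj_train)   # |X_min|_train = (p/100)|X_maj|_train
    return n_maj_train, n_min_train

# Book dataset with p = 25: |X_maj| = 42900
# -> 34320 majority and 8580 minority training instances, as in the text
```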

The methods detailed in Section 3 are applied on the training data. For both the under- and oversampling approaches, a linear SVM is trained on the newly created balanced training data, with regularization parameter C having values

C = [10⁻⁷, 10⁻⁵, 10⁻³, 10⁻¹, 10⁰]

The validation data are used for parameter tuning purposes. The test data allow us to obtain the generalization performance. The results reported show the average over ten folds. In the remaining paragraphs of this section we will describe the various parameter settings used in our experiments and give a brief overview of the parameters occurring in each method.

Considering the oversampling techniques, the parameter settings are as follows:

β = [0, 1/3, 2/3, 1]
prior_opt = {FlipCoin, Reverse Priors}
sim_measure = {Cosine, Jaccard}
K = [10⁰, 10¹, 10², |X_min|_train]

We didn't include the "Prior" option, due to initial experiments showing a lower performance in comparison with the other options. This can be explained by the low priors occurring in each column, resulting in synthetic samples that mainly show zeros in 0-1 match situations. The oversampling with replacement method (OSR) only uses the β parameter, SMOTE uses all but the K parameter, and ADASYN uses all of the parameters listed above.

With respect to the undersampling techniques, the following parameter settings are used:

β_u = [0, 1/4, 1/2, 3/4, 1]
sim_measure = {Cosine, Jaccard}
K = [10⁰, 10¹, 10², |X_min|_train]
Clust_opt = {C_Smallest, C_Largest}

The random undersampling (RUS) technique only uses the β_u parameter. The second set of methods, "Closest tot sim" and "Farthest tot sim", uses β_u and sim_measure. The third set of techniques, "Closest Knn" and "Farthest Knn", makes use of all the parameters listed above except for Clust_opt. The final approach, CBU, employs β_u and Clust_opt.
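To make the "Closest/Farthest Knn" idea concrete, here is a hedged pure-Python sketch: majority instances are scored by their mean cosine similarity to their K most similar minority instances, and part of the majority class is removed from one end of the ranking. The mapping of `beta_u` to a removed fraction is a simplification for illustration; the paper maps β_u to the degree of undersampling towards full balance.

```python
import math

def cosine(u, v):
    num = sum(a * b for a, b in zip(u, v))
    den = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return num / den if den else 0.0

def knn_undersample(X_maj, X_min, beta_u, K, keep="closest"):
    """Informed undersampling sketch: score each majority instance by the
    mean cosine similarity to its K most similar minority instances, then
    drop a beta_u fraction of the majority class. keep="closest" retains
    the instances nearest to the minority class; "farthest" the opposite."""
    def score(x):
        sims = sorted((cosine(x, m) for m in X_min), reverse=True)
        return sum(sims[:K]) / K

    ranked = sorted(X_maj, key=score, reverse=(keep == "closest"))
    n_drop = int(beta_u * len(X_maj))
    return ranked[: len(X_maj) - n_drop]
```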

The boosting variants (AdaBoost, AdaCost and EasyEnsemble) presented in Section 3.3 make use of the following settings:

T = 30
μ = [100, 75]
C = [10⁻⁷, 10⁻⁵, 10⁻³, 10⁻¹]
R = [2, 8, R_L], where R_L = |X_maj|_train / |X_min|_train
S = [5, 10, 15]

18 This means that if we start from a balanced set, only the training data will show artificial imbalance according to the imbalance ratio p; the validation and test data would remain balanced. Since AUC (and some other metrics) is independent of class skew, it would be unwise to make these sets imbalanced as well, because that would lead to discarding minority class instances that are relevant for performance assessment.

The AdaBoost algorithm includes the T, μ and C parameters. AdaCost additionally uses cost-ratios R. We have chosen a range of values because misclassification costs are unknown for many business applications (He and Garcia 2009; Fan et al 1999; Sun et al 2007). The final value, R_L, seems to be a popular choice (Akbani et al 2004; Luts et al 2010), because the total weight on the majority class balances with the total weight on the minority class. The final method, EasyEnsemble, uses S subsets in addition to the parameters previously mentioned for AdaBoost. Note that we consider the boosting iteration round t ∈ [1, T] as a tunable parameter.19

4.3 Oversampling

The oversampling techniques presented in Section 3.1 are applied to each of the datasets from Table 2 with varying imbalance levels (by varying the p-parameter). The experiments we conducted use the following methodology: for each dataset, we apply the oversampling techniques with all possible parameter combinations on the training data to create newly balanced datasets, after which linear SVMs are trained. The optimal parameter combination with respect to each imbalance ratio β is selected based on validation set AUC-performance. With these parameters, the AUC on the test data is obtained. Results show the average over ten folds.

The results on four arbitrarily selected datasets are shown in Table 3. Full results on each of the data sources can be found in Appendix A, Table 11. From these tables we can conclude that performance generally increases with growing β-values. It is more correct to say that performance keeps improving with growing β-levels until an optimal point β* is reached; increasing the balance level after this optimal value will cause only small fluctuations with respect to the optimal performance. Traditional studies dealing with dense, low-dimensional data note that the OSR method can suffer from overfitting, as already mentioned in Section 3.1. It is interesting to see that in this sparse, high-dimensional setting this effect doesn't seem to occur.20 Furthermore, the computationally expensive synthetic sampling approaches do not seem to improve performance over the simpler OSR method. This can be explained by the fact that many instances have a very limited amount of non-zero elements (Junque de Fortuny et al 2014a; Stankova et al 2015), causing the synthetic sampling procedures to be limited in the amount of unique new samples they can produce. Note also that synthetic sampling procedures have historically been designed to overcome the overfitting behaviour of OSR, which no longer seems to apply here.
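The sparsity argument can be seen directly from the SMOTE interpolation step, sketched below (illustrative only; the paper's set-up additionally selects among the K most similar minority neighbours per Section 4.2). On sparse binary behaviour data the synthetic sample stays zero wherever both parents are zero, which limits how many distinct samples can be produced.

```python
import random

def smote_sample(x, neighbors):
    """SMOTE-style interpolation sketch: create one synthetic minority
    instance between x and a randomly chosen minority neighbour."""
    n = random.choice(neighbors)
    g = random.random()                      # random gap in [0, 1)
    # features that are zero in both x and n remain exactly zero
    return [xi + g * (ni - xi) for xi, ni in zip(x, n)]
```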

19 We sometimes make use of T as a symbol to indicate the boosting iteration number instead of t.

20 The SVM regularization parameter C controls for overfitting. Additionally, the influence of majority class noise/outliers (see Section 3.2) on the learned hyperplane decreases as we oversample the minority class. This hyperplane is more sensitive towards minority instances, and a change in its direction/orientation towards these majority class noise/outliers will be more heavily penalized.


Table 3 Oversampling experiments. Results show the average tenfold AUC test set performance with respect to increasing β-values (β = [β1, β2, β3, β4] = [0, 1/3, 2/3, 1]). Additionally, standard deviations are shown between brackets. Optimal performances for each method are highlighted in boldface. See Section 4.2 for parameter settings.

Mov Th (p = [])
          β1             β2             β3             β4
OSR       79.77 (5.33)   85.30 (4.66)   83.16 (4.50)   84.59 (5.69)
SMOTE     79.77 (5.33)   84.18 (6.51)   85.58 (5.97)   84.33 (5.75)
ADASYN    79.77 (5.33)   84.11 (6.77)   85.86 (6.06)   85.36 (5.13)

Yahoo G (p = 25)
          β1             β2             β3             β4
OSR       78.82 (1.39)   78.78 (2.02)   78.74 (1.86)   78.35 (1.47)
SMOTE     78.82 (1.39)   79.23 (1.57)   79.10 (1.20)   79.03 (1.89)
ADASYN    78.82 (1.39)   79.12 (1.43)   79.23 (1.37)   79.51 (2.01)

TaFeng (p = 25)
          β1             β2             β3             β4
OSR       66.94 (1.34)   68.84 (1.21)   66.99 (1.42)   67.70 (1.41)
SMOTE     66.94 (1.34)   68.47 (1.50)   67.07 (1.15)   66.65 (0.81)
ADASYN    66.94 (1.34)   68.62 (1.38)   67.85 (1.60)   66.91 (1.39)

Book (p = 25)
          β1             β2             β3             β4
OSR       60.08 (0.71)   61.05 (0.94)   61.82 (1.12)   64.62 (0.57)
SMOTE     60.08 (0.71)   62.60 (0.73)   60.95 (0.68)   63.00 (0.80)
ADASYN    60.08 (0.71)   62.33 (0.73)   60.77 (0.85)   63.04 (0.58)

4.4 Undersampling

Regarding the undersampling techniques, we employ a similar experimental set-up as in the previous section: firstly, train SVMs on the undersampled training data with all possible parameter combinations; secondly, choose a suitable parameter combination based on validation set AUC-performance; and finally, obtain the AUC-performance on the test set. The results are averaged across ten folds.

Before going into detail on the experimental findings, we give a short note on the effect of noise/outliers on SVM performance. Consider the following imaginary example: say that a majority class female is rating a lot of action films and thrillers. We can consider this as an outlier, since most of the females are rating romantic or drama films; this specific female has far more in common with the minority class males, who also rate action films and thrillers. Outliers are therefore instances showing odd behaviour; noise instances are wrongfully labelled ones. The effect of noise/outliers on SVM performance can be severe. Indeed, many of the instances in the dataset have a support value (dual variable α_i) of 0. Instances contributing to the model (a non-zero support value) are examples close to the borderline (Suykens et al 2002) and noise/outliers (as can be derived from the KKT dual-complementarity conditions21).

21 If α_i = 0, then y_i(wᵀx_i + b) ≥ 1. For noise/outliers the term y_i(wᵀx_i + b) is negative, hence α_i ≠ 0.


With respect to noise/outliers, the support value can reach a maximal value of C, the chosen regularization parameter in the SVM formulation. It should be clear that lowering the C-value leads to models that are less sensitive to noise/outliers.

The experimental outcomes of the various undersampling processes are shown in Table 4 for the same four previously selected datasets. We refer to Appendix B, Table 12 for results on the entire data repository. When removing only a limited amount of majority class instances (β_u = β_u2 = 1/4), we observed that in 12 out of 16 datasets the "Farthest Knn" method outperforms the "Closest Knn" technique; the four remaining datasets show comparable performances.22 This finding shows that the "Farthest" method is very suitable in removing majority class noise/outliers, and empirically demonstrates their performance degrading effect. With higher undersampling rates (β_u = 1), the "Closest Knn" method achieves higher performances than the "Farthest Knn" technique in general, though the results are less clear. We observed that only 9 out of 16 datasets indicate "Closest" as a clear winner, with 3 tie situations and 4 situations where "Farthest" shows the best results. Intuitively, we would expect to see the "Closest" method perform better, since we are keeping the most informative instances (examples close to the minority class). However, this technique will also emphasize noise/outliers, and this is the main reason why the results are less significant. One can also see this from the fact that the RUS method shows far better results in comparison to the aforementioned techniques when β_u = 1, an observation that is valid for all datasets. The RUS method is efficient in realizing noise/outlier reduction effects, since each SVM is trained with only a fraction of these types of instances. It is remarkable to see that in 9 out of 16 datasets the RUS method with β_u = 1 outperforms the baseline model (β_u = 0); in 4 datasets we observed equal performance, and 3 losses with respect to the baseline. RUS shows two attractive features: majority class noise/outlier removal and its ability to put more emphasis on the minority class instances by undersampling the majority class. A general downside of the approach is the information loss, and this effect causes the 3 losses to the baseline. Usually though, the attractive features of RUS dominate the latter effect, and this indicates that it is definitely not necessary to include all majority class training instances in constructing the predictive model. Apparently there is a high level of redundancy, which can be exploited to construct efficient hypotheses.

In the next paragraph we compare the cluster-based undersampling (CBU) approach to the random undersampling (RUS) technique. Before proceeding, we should note that Table 12 does not show results on the LST, CRF and Bank datasets, due to the fact that the projected unigraph was too large to fit in memory (larger than 16 GB). A limited amount of features (top nodes) does seem to be active for a relatively large number of instances (bottom nodes); all these instances will be connected in the unigraph projection, resulting in relatively high degrees for the bottom nodes. A possible solution to circumvent this problem might be to cut the weakest edges in the projected unigraph.

If we apply a limited amount of undersampling (β_u = β_u2 = 1/4), the CBU technique outperforms RUS in 8 out of 13 datasets (4 losses and 1 tie with respect to RUS). On the highly imbalanced datasets (p = 1 or p = []), the CBU method wins

22 A tie occurs in the situation where the absolute difference in AUC is smaller than or equal to 0.5.


in 8 out of 8 cases. We can therefore conclude that CBU outperforms RUS at low undersampling rates. This indeed shows that handling the within-class imbalance can be beneficial in this environment. If we make a comparison at the highest undersampling rate, β_u = β_u5 = 1, the results are completely opposite: CBU wins in only 1 case, with 9 losses with respect to RUS and 3 ties. On the highly imbalanced datasets, CBU outperforms RUS for 1 dataset (with 4 losses and 3 ties). It is clear that RUS is the preferred technique in case of high undersampling rates. The reason is that CBU emphasizes the small disjuncts and basically fails to adequately pick up on the common larger communities. Furthermore, these small disjuncts might actually correspond with majority class noise/outliers, and CBU is focusing on these types of instances. On the overall level, where we consider all undersampling rates (except β_u = 0), CBU wins in 5 out of 13 datasets, RUS wins in 4 cases, and there are 4 tie situations (yet RUS did outperform CBU in all 4 cases). Both methods seem to be competitive with one another across a wide range of imbalance levels. If we pay particular attention to the highly imbalanced sets, CBU wins in 5 cases and there are 3 tie situations on the remaining sets (yet RUS did outperform CBU in all 3 cases). CBU seems to be the method of choice (with respect to RUS) when the datasets show high imbalance levels.

Table 4 Undersampling experiments. Results show the average tenfold AUC test set performance with respect to increasing β_u-values (β_u = [β_u1, β_u2, β_u3, β_u4, β_u5] = [0, 1/4, 1/2, 3/4, 1]). Optimal performances (boldface) are highlighted for each imbalance ratio. Additionally, standard deviations are shown between brackets. Cl K represents the "Closest Knn" technique; Cl T represents the "Closest tot sim" technique (similar for Far K and Far T; see Section 3.2). Parameter settings are given in Section 4.2.

Mov Th (p = [])
        β_u1          β_u2          β_u3          β_u4          β_u5
RUS     79.77 (5.3)   80.32 (5.8)   81.57 (5.5)   81.86 (6.6)   81.26 (6.2)
Cl K    79.77 (5.3)   79.25 (4.5)   78.07 (5.0)   76.25 (6.5)   62.46 (8.5)
Cl T    79.77 (5.3)   78.40 (4.4)   72.41 (3.5)   64.66 (4.5)   60.37 (7.3)
Far K   79.77 (5.3)   84.54 (5.0)   83.64 (6.4)   80.02 (7.3)   56.82 (10.3)
Far T   79.77 (5.3)   85.03 (5.7)   82.68 (6.8)   75.61 (9.2)   56.77 (10.9)
CBU     80.11 (5.8)   81.17 (6.0)   81.08 (6.5)   84.17 (5.1)   80.96 (6.9)

Yahoo G (p = 25)
        β_u1          β_u2          β_u3          β_u4          β_u5
RUS     78.82 (1.4)   78.91 (1.6)   78.97 (1.6)   78.61 (1.6)   77.82 (2.1)
Cl K    78.82 (1.4)   77.26 (1.5)   72.52 (1.5)   67.86 (2.0)   65.07 (2.7)
Cl T    78.82 (1.4)   76.83 (1.0)   71.99 (1.8)   67.15 (2.3)   61.10 (2.7)
Far K   78.82 (1.4)   78.26 (2.2)   74.69 (2.7)   67.22 (2.1)   60.72 (2.3)
Far T   78.82 (1.4)   77.68 (2.6)   72.44 (3.0)   64.94 (2.4)   59.60 (2.0)
CBU     75.25 (3.2)   75.22 (2.4)   74.69 (2.3)   73.07 (2.4)   70.69 (2.4)

TaFeng (p = 25)
        β_u1          β_u2          β_u3          β_u4          β_u5
RUS     66.94 (1.3)   67.44 (1.3)   68.10 (1.4)   68.27 (1.4)   66.13 (1.2)
Cl K    66.94 (1.3)   66.13 (1.4)   63.39 (1.2)   59.83 (1.3)   56.94 (0.7)
Cl T    66.94 (1.3)   66.38 (1.5)   62.89 (1.6)   57.46 (1.3)   54.56 (1.3)
Far K   66.94 (1.3)   68.06 (1.4)   66.43 (1.6)   64.46 (1.5)   63.35 (1.3)
Far T   66.94 (1.3)   64.31 (1.1)   62.69 (1.0)   61.27 (1.1)   59.03 (1.0)
CBU     64.81 (1.2)   64.15 (1.1)   64.13 (1.2)   63.88 (0.8)   63.46 (0.8)

Book (p = 25)
        β_u1          β_u2          β_u3          β_u4          β_u5
RUS     60.08 (0.7)   60.13 (0.6)   60.40 (0.8)   60.33 (0.8)   63.28 (0.8)
Cl K    60.08 (0.7)   59.96 (0.7)   60.13 (0.8)   59.96 (1.0)   59.28 (0.7)
Cl T    60.08 (0.7)   59.96 (0.7)   60.13 (0.8)   60.29 (0.4)   54.50 (0.9)
Far K   60.08 (0.7)   63.29 (1.0)   64.19 (0.8)   57.30 (1.1)   55.66 (1.1)
Far T   60.08 (0.7)   62.14 (0.5)   58.27 (0.6)   56.37 (1.0)   55.66 (1.1)
CBU     54.82 (0.9)   54.67 (0.9)   54.71 (0.9)   54.66 (1.0)   54.78 (0.9)

4.5 Boosting variants

In this section, the plain AdaBoost (AB) method is compared with the cost-sensitive AdaCost (AC) algorithm, the EasyEnsemble (EE) technique and the baseline (BL) method (where we train a single SVM on the imbalanced training data). AB, AC and EE have an underlying boosting process in common. The AUC-values are stored during each boosting iteration, which allows us to show the performance graphically as a function of the number of iterations t. In EasyEnsemble (EE), we combine the weak learners of each subset by summing their individual contributions. For example, when t = 2 boosting iterations are performed, we evaluate the performance of the combined learner Σ_{s=1}^{S} Σ_{t=1}^{2} α_st h_st(x). Figures 1, 2, 3 and 4 show the average tenfold AUC-performance (with μ = 100) on the test data. The left-side figures show the AB, AC (with varying R-values) and EE (with varying S-values) performance with respect to the number of boosting iterations; the C-value is tuned according to the highest validation set AUC-performance (over all possible boosting iterations). The right-side figures investigate the influence of the C-parameter for AB and EE (with S = 15) and allow us to gain insight into the effect of weak and strong learners on the generalization performance. Results on the remaining datasets are given in Appendix C, Figures 6-17. Note that we only indicate results with weight-percentage μ = 100 (use all instances in the training process); previous experiments (with μ = 75) showed inferior results. This suggests that it is indeed important to include all minority instances during the training of each weak learner, an observation also made by Liu et al (2009).

When investigating the influence of the regularization parameter C in AdaBoost, we can graphically verify that it is a "weakness" indicator, in the sense that lower C-values correspond to weaker learners. While these low C-value learners have a low performance in the first rounds of boosting, the possible boost in performance is usually much higher compared to higher C-value inducers. Strong learners can already make a suitable distinction between minority and majority class behaviours in the first boosting round; the boosting process will then emphasize the hard to learn instances, among which are noise/outliers. Therefore, too strong learners are not suitable to be used in a boosting process, as already noted by Wickramaratna et al (2001). In Figures 2(b) and 3(b) we observe that the lowest C-values (C = 10⁻⁷, 10⁻⁵) can outperform higher C-values (C = 10⁻³, 10⁻¹). In many cases the AB-process outperforms the baseline (BL); see also the results in Table 5. A combination of several weak learners can often achieve better generalization than a single strong learner.

Inspired by the success of the RUS method with β_u = 1 (for reasons indicated in Section 4.4), the EE-technique tries to overcome an obvious downside of this approach: we are removing a lot of majority class instances that can be very informative. Hence we try to make up for this loss by combining several of these balanced training sets, in the hope that the majority class is more adequately explored. This results in a very effective learner that seems to outperform the AB, AC and BL methods in the majority of cases, as can be seen from these figures. It is remarkable that we don't need to use that many subsets S to achieve high performances: increasing the S-value from 5 to 15 doesn't seem to cause major increases in AUC (see also Table 5). The influence of the C-parameter is similar as described in the previous paragraph.

Fig 1 Mov G (p = 25) dataset: results showing average tenfold AUC-performance on test data (with μ = 100) for (a) AB, AC and EE, with C chosen according to highest validation set AUC-performance (over all possible T); (b) AB and EE (S = 15) with varying C-levels. [Plots omitted: test AUC versus boosting iteration T; legends cover AB, AC (R = 2, 8, R_L), EE (S = 5, 10, 15), BL and, in panel (b), C = 10⁻⁷, 10⁻⁵, 10⁻³, 10⁻¹.]

Fig 2 Book (p = 25) dataset (same layout as Fig 1).

Fig 3 TaFeng (p = 25) dataset (same layout as Fig 1).

Fig 4 Bank (p = []) dataset (same layout as Fig 1).


4.6 Final comparison

4.6.1 Performance-wise comparison

In this section we wish to determine which of the aforementioned methods shows superior performance in terms of AUC on test data. The experimental set-up is described in Section 4.2. Note that we exclude β = 0 and β_u = 0 in determining the performance of the oversampling respectively undersampling techniques, to be able to compare them with the baseline (BL) approach.23 The results for AB, AC and EE are shown for μ = 100. The number of boosting iterations, t ∈ [0, T], is also considered as a tunable parameter. Table 5 shows the final comparison between all methods. Additionally, we added an average rank column, indicating the mean rank of each algorithm across all datasets. The rank of an algorithm for a specific dataset corresponds to its position in the sorted AUC-performance spectrum, where the best performing algorithm gets the rank of 1, the second best rank 2, etc. In case two or more algorithms perform equally well, average ranks are assigned (if two algorithms have ranks of 3 and 4 respectively and show equal performance, then both methods get assigned a rank of 3.5). Note that the LST dataset is excluded in the calculation of average ranks, since we consider this as a dataset not being affected by imbalance and showing equal performances.

As a side note, we noticed that increasing the number of minority training set instances can significantly improve performances. We can see this clearly by comparing the p = 25 outcomes with the results of p = 1 on the same dataset. Junque de Fortuny et al (2014a) already proved empirically that larger behaviour datasets (in terms of the number of instances or features) lead to higher AUC-performances. This situation is confirmed in our experiments, though we want to add that this is only true when the data size is increased in a balanced fashion.

The results in Table 5 confirm our previous analysis. Each of the oversampling, undersampling (except Cl Knn) and boosting variants seems to improve upon the BL performance. Sobhani et al (2015) note that the direct application of ensemble techniques on imbalanced data does not circumvent the imbalanced learning issue; in our experiments on behaviour data, however, we do find that a simple boosting procedure applied to the imbalanced data improves upon the predictive performance. The oversampling techniques OSR, SMOTE and ADASYN show similar outcomes and are competitive methods, seeming to outperform the undersampling techniques and the plain boosting (AB) and cost-sensitive learning (AC) approaches. When dealing with traditional data, the performances of oversampling techniques are generally worse than those of undersampling approaches (Drummond and Holte 2003). It is interesting to observe that for these large and sparse behaviour datasets the opposite is true. This is heavily related to the intrinsic properties of the data under consideration. Undersampling (or feature selection for that matter) is not recommended because, as we have noted in Section 2.1, each feature provides a small additional contribution to the predictive performance. With respect to AC, the results indicate that R = R_L is a better choice compared to more random cost ratios R = 2, 8. The EE-technique has the

²³ The BL technique trains single SVMs on the imbalanced training data.

30 Jellis Vanhoeyveld David Martens

lowest average rank and is our best performing method. The combination of several balanced subsets with filtered noise/outlier contributions allows one to explore the majority class space efficiently. To strengthen our previous statements, we will in the next paragraphs conduct statistical significance tests in accordance with the procedure outlined in Demšar (2006). The latter paper presents a framework to perform statistical comparisons of classifiers over multiple datasets.

The first null-hypothesis we try to reject postulates that all algorithms perform equally well or, equivalently, that their average ranks R_j (see Table 5) do not differ. Under the null-hypothesis, the Friedman statistic (Friedman 1937)

\chi_F^2 = \frac{12N}{k(k+1)} \left[ \sum_{j=1}^{k} R_j^2 - \frac{k(k+1)^2}{4} \right] \qquad (6)

with N the number of datasets (N = 15) and k the number of algorithms (k = 13), can be used to compute the statistic (Iman and Davenport 1980)

F_F = \frac{(N-1)\,\chi_F^2}{N(k-1) - \chi_F^2} \qquad (7)

The latter is distributed according to the F-distribution with k − 1 and (k − 1)(N − 1) degrees of freedom. Evaluating this measure with the average ranks shown in Table 5 results in F_F = 22.98, which is a lot higher than the critical value of 1.81 at the α = 0.05 significance level. We can therefore reject the null-hypothesis and proceed with post-hoc tests.
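Plugging the average ranks of Table 5 into Eqs. (6) and (7) is mechanical; the sketch below (plain Python, no external libraries) reproduces the reported value up to the rounding of the published ranks:

```python
def friedman_tests(avg_ranks, n_datasets):
    """Friedman chi-square (Eq. 6) and Iman-Davenport F statistic (Eq. 7)
    computed from the average ranks of k algorithms over N datasets."""
    k, n = len(avg_ranks), n_datasets
    chi2 = 12.0 * n / (k * (k + 1)) * (sum(r * r for r in avg_ranks)
                                       - k * (k + 1) ** 2 / 4.0)
    f_f = (n - 1) * chi2 / (n * (k - 1) - chi2)
    return chi2, f_f

# Average ranks from Table 5 (N = 15 datasets, k = 13 algorithms)
ranks = [11.600, 5.000, 4.533, 4.800, 8.167, 12.467, 8.133,
         8.567, 8.267, 8.467, 5.400, 3.267, 2.333]
chi2, f_f = friedman_tests(ranks, n_datasets=15)
print(round(f_f, 2))  # → 22.99; the paper reports 22.98 (gap due to rank rounding)
```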

The Nemenyi test (Nemenyi 1963) allows one to compare each classifier to one another and adjusts the critical value to compensate for making k(k − 1)/2 comparisons.²⁴ "The performance of two classifiers is significantly different if the corresponding average ranks differ by at least the critical difference (CD)" (Demšar 2006). We refer to the aforementioned paper for the calculation of the CD. The result of these comparisons at the α = 0.05 significance level is shown in Table 6. This test is already quite revealing in the sense that, for instance, the oversampling methods (OSR, SMOTE and ADASYN), AC (R = RL) and the EE-techniques are significantly different from the BL. Yet it does not indicate that undersampling or applying plain boosting (AB) results in significantly different performances compared to the BL. The Nemenyi test is still conservative and we therefore compare all classifiers to a single control classifier in the next paragraphs. Note that we only make k − 1 comparisons in that situation, resulting in stronger results compared to the Nemenyi test.
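For completeness, the CD can be computed from the studentized-range quantile, following Demšar (2006). The sketch below assumes SciPy ≥ 1.7 (which provides scipy.stats.studentized_range) and approximates the infinite-degrees-of-freedom quantile with a large finite df:

```python
import math
from scipy.stats import studentized_range

k, n_datasets, alpha = 13, 15, 0.05
# Nemenyi critical value q_alpha: studentized-range quantile (infinite
# degrees of freedom, approximated here by df = 10**4) divided by sqrt(2)
q_alpha = studentized_range.ppf(1 - alpha, k, 1e4) / math.sqrt(2)
cd = q_alpha * math.sqrt(k * (k + 1) / (6.0 * n_datasets))
print(round(cd, 2))

# Two methods differ significantly iff their average ranks differ by at
# least CD: e.g. BL (11.600) vs. OSR (5.000) is significant, while
# BL vs. RUS (8.167) is not -- consistent with Table 6
```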

When comparing a single classifier i with a control classifier c, the null-hypothesis states that the two classifiers i and c perform equally well. Under this hypothesis, the test statistic (based on the average ranks R_i shown in Table 5)

z = \frac{R_i - R_c}{\sqrt{k(k+1)/(6N)}} \qquad (8)

is distributed according to a standard normal distribution. This z-value is used to find the corresponding probability (p-value) p from the cumulative standard normal

²⁴ The larger the number of comparisons, the higher the proportion of null-hypotheses that are wrongfully rejected due to random chance.

Imbalanced classification in sparse and large behaviour datasets 31

Table 5 Final comparison of all methods in terms of AUC test set performance; see Section 4.2 for details on the experimental set-up. β ≠ 0 and βu ≠ 0 for the oversampling respectively undersampling techniques; µ = 100 for each of the boosting variants. Standard deviations are given in parentheses and the performance difference with the BL is shown in square brackets.

             Mov G(p = 1)       Mov G(p = 25)      Mov Th(p = [])      Yahoo A(p = 1)
BL           71.6 (2.62) [0]    81.41 (1.32) [0]   79.77 (5.33) [0]    55.92 (2.97) [0]
OSR          75.35 (2.27) [3.8] 83.76 (2.09) [2.3] 85.13 (6.1) [5.4]   60.05 (2.71) [4.1]
SMOTE        76.16 (2.27) [4.6] 83.7 (2.1) [2.3]   85.67 (4.98) [5.9]  60.1 (3) [4.2]
ADASYN       76.07 (2.26) [4.5] 83.63 (2.04) [2.2] 85.65 (5.6) [5.9]   59.9 (2.99) [4]
RUS          72.88 (2.73) [1.3] 81.52 (2.15) [0.1] 82.91 (7.19) [3.1]  57.04 (1.77) [1.1]
Cl Knn       71.43 (1.36) [-0.2] 80.88 (1.19) [-0.5] 78.87 (4.71) [-0.9] 55.78 (2.71) [-0.1]
Far Knn      71.9 (2.95) [0.3]  80.9 (1.48) [-0.5] 84.07 (4.64) [4.3]  57.2 (1.33) [1.3]
CBU          74.17 (2.36) [2.6] 81.51 (1.04) [0.1] 82.76 (7.22) [3]    58.77 (3.43) [2.8]
AB           71.65 (1.73) [0.1] 84.52 (1.89) [3.1] 82.43 (5.18) [2.7]  58.35 (2.62) [2.4]
AC(R = 28)   71.61 (2.46) [0]   83.46 (1.82) [2]   83.27 (5.6) [3.5]   57.72 (2.47) [1.8]
AC(R = RL)   74.65 (2.7) [3.1]  83.35 (2.09) [1.9] 85.41 (4.49) [5.6]  59.47 (2.33) [3.5]
EE(S = 10)   76.04 (2.66) [4.4] 85.05 (1.85) [3.6] 86.1 (5.78) [6.3]   59.66 (3.13) [3.7]
EE(S = 15)   76.12 (2.88) [4.5] 85.14 (1.86) [3.7] 86.42 (5.86) [6.7]  59.76 (2.93) [3.8]

             Yahoo A(p = 25)    Yahoo G(p = 1)     Yahoo G(p = 25)     TaFeng(p = 1)
BL           61.68 (2.42) [0]   66.84 (3.66) [0]   78.82 (1.39) [0]    55.75 (1.6) [0]
OSR          64.59 (3.12) [2.9] 73.08 (2.96) [6.2] 78.52 (2.01) [-0.3] 61.21 (2.24) [5.5]
SMOTE        65.56 (3.33) [3.9] 73.11 (3.12) [6.3] 79.01 (1.21) [0.2]  61.72 (1.81) [6]
ADASYN       65.13 (3.38) [3.4] 73.22 (3.17) [6.4] 79.74 (1.68) [0.9]  61.68 (1.86) [5.9]
RUS          64.11 (2.8) [2.4]  70.65 (3.39) [3.8] 78.91 (1.55) [0.1]  59.25 (2.18) [3.5]
Cl Knn       61.14 (2.13) [-0.5] 66.34 (3.54) [-0.5] 77.26 (1.46) [-1.6] 55.77 (1.28) [0]
Far Knn      63.96 (3.03) [2.3] 66.97 (3.54) [0.1] 78.26 (2.2) [-0.6]  59.98 (1.26) [4.2]
CBU          62.27 (1.79) [0.6] 71.27 (2.89) [4.4] 75.22 (2.42) [-3.6] 58.4 (1.57) [2.6]
AB           63.88 (2.67) [2.2] 68.9 (2.03) [2.1]  79.01 (1.66) [0.2]  56.21 (1.79) [0.5]
AC(R = 28)   64.32 (3.56) [2.6] 68.89 (3.11) [2]   78.99 (1.89) [0.2]  56.33 (1.83) [0.6]
AC(R = RL)   64.31 (3.03) [2.6] 73.13 (2.8) [6.3]  78.41 (2) [-0.4]    61.6 (2.26) [5.9]
EE(S = 10)   66.51 (3.24) [4.8] 72.61 (3.15) [5.8] 80.52 (1.6) [1.7]   61.2 (1.82) [5.4]
EE(S = 15)   66.36 (3.18) [4.7] 73.48 (2.32) [6.6] 80.54 (1.56) [1.7]  61.13 (1.83) [5.4]

             TaFeng(p = 25)     Book(p = 1)        Book(p = 25)        LST(p = 1)
BL           66.94 (1.34) [0]   52.6 (1.29) [0]    60.08 (0.71) [0]    99.99 (0.01) [0]
OSR          68.77 (1.23) [1.8] 55.87 (1.42) [3.3] 64.62 (0.57) [4.5]  99.99 (0.01) [0]
SMOTE        68.47 (1.5) [1.5]  55.07 (0.88) [2.5] 62.96 (0.82) [2.9]  99.99 (0.01) [0]
ADASYN       68.48 (1.47) [1.5] 55.04 (0.91) [2.4] 63.02 (0.57) [2.9]  99.99 (0.01) [0]
RUS          68.28 (1.39) [1.3] 54.26 (0.92) [1.7] 63.28 (0.8) [3.2]   99.98 (0.01) [0]
Cl Knn       66.13 (1.43) [-0.8] 52.69 (1.3) [0.1] 60.02 (0.79) [-0.1] 99.99 (0.01) [0]
Far Knn      68.06 (1.41) [1.1] 56.25 (1.52) [3.7] 64.15 (1.12) [4.1]  99.98 (0.01) [0]
CBU          63.84 (1.07) [-3.1] 53.75 (1.01) [1.2] 54.68 (0.88) [-5.4] []
AB           67.65 (1.55) [0.7] 54.27 (1.95) [1.7] 65 (0.67) [4.9]     99.99 (0.01) [0]
AC(R = 28)   69.31 (1.23) [2.4] 53.72 (1) [1.1]    61.24 (0.8) [1.2]   99.98 (0.01) [0]
AC(R = RL)   67.15 (1.51) [0.2] 55.73 (1.22) [3.1] 64.6 (0.64) [4.5]   99.99 (0.01) [0]
EE(S = 10)   70.3 (1.35) [3.4]  55.09 (1.29) [2.5] 65.37 (0.61) [5.3]  99.98 (0.01) [0]
EE(S = 15)   70.4 (1.3) [3.5]   55.35 (1.26) [2.8] 65.4 (0.51) [5.3]   99.98 (0.01) [0]


Table 5 Continued. Additionally, an average rank column is added, showing the mean rank of each algorithm across all datasets. Note that we excluded the LST dataset for this purpose.

             Adver(p = [])      Adver(p = 1)       CRF(p = [])          Bank(p = [])
BL           96.61 (1.82) [0]   90.93 (3.02) [0]   64.06 (16.43) [0]    66.82 (0.88) [0]
OSR          96.93 (1.91) [0.3] 93.3 (2.02) [2.4]  80.74 (12.93) [16.7] 71.39 (0.79) [4.6]
SMOTE        97.05 (1.66) [0.4] 93.35 (2.01) [2.4] 78.7 (16.56) [14.6]  []
ADASYN       96.91 (1.95) [0.3] 93.46 (2.21) [2.5] 78.87 (16.71) [14.8] []
RUS          96.81 (1.87) [0.2] 92.38 (2.51) [1.5] 83.98 (5.99) [19.9]  69.41 (1.19) [2.6]
Cl Knn       96.4 (1.48) [-0.2] 89.73 (3.42) [-1.2] 76.63 (16.19) [12.6] 66.17 (0.72) [-0.6]
Far Knn      95.77 (1.81) [-0.8] 93.88 (1.78) [3]  83.75 (13.11) [19.7] 66.95 (0.56) [0.1]
CBU          97.15 (1.88) [0.5] 94.18 (2.3) [3.3]  []                   []
AB           97.34 (2.18) [0.7] 91.39 (3.23) [0.5] 77.62 (15.15) [13.6] 66.82 (0.88) [0]
AC(R = 28)   97.44 (1.93) [0.8] 91 (3.35) [0.1]    68.31 (14.93) [4.2]  67.67 (0.71) [0.9]
AC(R = RL)   97.46 (1.71) [0.8] 93.51 (2.17) [2.6] 85.08 (9.77) [21]    70.7 (0.8) [3.9]
EE(S = 10)   97.64 (1.35) [1]   92.97 (2.75) [2]   86.18 (10.17) [22.1] 71.46 (0.81) [4.6]
EE(S = 15)   97.63 (1.35) [1]   93.3 (2.14) [2.4]  86.35 (9.99) [22.3]  71.54 (0.76) [4.7]

             Average Rank
BL           11.600
OSR          5.000
SMOTE        4.533
ADASYN       4.800
RUS          8.167
Cl Knn       12.467
Far Knn      8.133
CBU          8.567
AB           8.267
AC(R = 28)   8.467
AC(R = RL)   5.400
EE(S = 10)   3.267
EE(S = 15)   2.333

Table 6 Nemenyi test outcomes at the α = 0.05 significance level. A value of 1 indicates the null-hypothesis is rejected and thus finds the two algorithms significantly different (a value of 0 means we accept the null-hypothesis). Methods are listed in the same order as in Table 5 and are represented more concisely.

      BL  RO  SM  AD  RU  Cl  Fa  CBU AB  AC1 AC2 EE1 EE2
BL    0   1   1   1   0   0   0   0   0   0   1   1   1
RO    1   0   0   0   0   1   0   0   0   0   0   0   0
SM    1   0   0   0   0   1   0   0   0   0   0   0   0
AD    1   0   0   0   0   1   0   0   0   0   0   0   0
RU    0   0   0   0   0   0   0   0   0   0   0   1   1
Cl    0   1   1   1   0   0   0   0   0   0   1   1   1
Fa    0   0   0   0   0   0   0   0   0   0   0   1   1
CBU   0   0   0   0   0   0   0   0   0   0   0   1   1
AB    0   0   0   0   0   0   0   0   0   0   0   1   1
AC1   0   0   0   0   0   0   0   0   0   0   0   1   1
AC2   1   0   0   0   0   1   0   0   0   0   0   0   0
EE1   1   0   0   0   1   1   1   1   1   1   0   0   0
EE2   1   0   0   0   1   1   1   1   1   1   0   0   0


distribution function: p = 2 · min(Φ(z), 1 − Φ(z)). Holm's method (Holm 1979) compares k − 1 classifiers to a single control classifier and sorts the corresponding p-values in ascending order, so that p1 ≤ p2 ≤ ... ≤ p(k−1). Each pi is subsequently compared to its associated significance level²⁵ αcomp = α/(k − i). Holm starts with performing the check p1 < α/(k − 1) and rejects the null-hypothesis if the test passes. It then proceeds with a similar check for i = 2 and continues until a certain null-hypothesis cannot be rejected. The remaining hypotheses are retained.
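Holm's step-down procedure is straightforward to implement from Eq. (8). The following sketch (standard library only, using erf for the normal CDF) reproduces the z- and p-values of Table 7 from the published average ranks:

```python
from math import erf, sqrt

def phi(z):  # standard normal cumulative distribution function
    return 0.5 * (1.0 + erf(z / sqrt(2.0)))

def holm_vs_control(avg_ranks, control, n_datasets, alpha=0.05):
    """Holm's step-down procedure comparing every method to a control
    classifier, using the z statistic of Eq. (8) and two-sided p-values."""
    k = len(avg_ranks)
    se = sqrt(k * (k + 1) / (6.0 * n_datasets))
    stats = []
    for i, r in enumerate(avg_ranks):
        if i == control:
            continue
        z = (r - avg_ranks[control]) / se
        p = 2.0 * min(phi(z), 1.0 - phi(z))
        stats.append((p, z, i))
    stats.sort()  # ascending p-values: p1 <= p2 <= ... <= p_{k-1}
    rejected = []
    for step, (p, z, i) in enumerate(stats, start=1):
        if p < alpha / (k - step):   # compare p_i with alpha/(k - i)
            rejected.append(i)
        else:
            break                    # stop at the first retained hypothesis
    return stats, rejected

# Average ranks of Table 5 (BL is index 0, EE(S=15) is index 12); N = 15
ranks = [11.600, 5.000, 4.533, 4.800, 8.167, 12.467, 8.133,
         8.567, 8.267, 8.467, 5.400, 3.267, 2.333]
stats, rejected = holm_vs_control(ranks, control=0, n_datasets=15)
print(stats[0])  # EE(S=15): z ≈ -6.52, p ≈ 7e-11, matching Table 7
```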

Table 7 shows the result of the Holm test at the α = 0.05 significance level with the BL algorithm as control classifier. The table lists each method according to sorted p-values and shows the z-, p- and αcomp-values. The significance column indicates whether the proposed method is significantly different from the BL, and coincidentally matches the result of the Nemenyi test. The p-values of the (Far Knn, RUS, CBU) undersampling methods, AB and AC (R = 28) are only slightly higher than the level required to reject the null-hypothesis. We included a critical significance level αcrit, corresponding to the lowest possible significance level at which the method would be considered as significantly different from the BL (if α = αcrit then p = αcomp). In the case of AB, for example, if we were to choose a significance level α = 0.0764 (a confidence level of 92.36%), the associated p-value would be smaller than αcomp and we would proceed to conclude²⁶ that AB performs significantly different from the BL. To summarize: the null-hypothesis of equal performance between the BL and an alternative method proposed in Section 3 (except Cl Knn) is rejected with a confidence level of more than 91% according to Holm's test. Hence, the performance of the BL approach is significantly worse than that of each of the proposed over- and undersampling methods (except Cl Knn), cost-sensitive learning techniques (AC) and boosting variants (AB, EE).

Similar to the previous paragraph, Table 8 presents the results of Holm's test at the α = 0.05 significance level with EE(S = 15) as control classifier. The results coincidentally correspond with the result of the Nemenyi test and indicate that the oversampling methods (OSR, SMOTE and ADASYN), AC(R = RL) and EE(S = 10) are not statistically different from the EE(S = 15) technique at the 0.05 significance level. However, if we were to adapt the significance level to approximately α = 0.25 (75% confidence level), the results would conclude significance (except for EE(S = 10)). To summarize: the null-hypothesis of equal performance between EE(S = 15) and an alternative method proposed in Section 3 (except EE(S = 10)) is rejected with a confidence level of more than 75% according to Holm's test. Hence, the performance of the oversampling, undersampling, AB and AC approaches is significantly worse than that of EE(S = 15). We can therefore safely conclude EE to be the preferred method in terms of AUC performance when dealing with imbalanced behaviour data.

²⁵ αcomp adjusts the value of α to compensate for multiple comparisons.
²⁶ This is not entirely true, since Holm's method requires Far Knn to pass the test first before proceeding, which only occurs at the 0.088662 significance level.


Table 7 Holm test at the α = 0.05 significance level with BL as reference. The table shows the z test statistic with associated p-value, and αcomp = α/(k − i), with k the number of algorithms (k = 13) and i corresponding to the position of the method in the sorted p-value vector. Since we already ranked the p-values in ascending order, i takes on the row number (e.g. OSR has i = 5). The column "significant" denotes if we can reject the null-hypothesis (significant: p < αcomp). αcrit corresponds with the smallest possible significance level at which we would decide to reject the null-hypothesis (αcrit = α · p/αcomp).

             z          p          αcomp     significant  αcrit
EE(S = 15)   -6.51642   7.2E-11    0.004167  1            8.64E-10
EE(S = 10)   -5.86009   4.63E-09   0.004545  1            5.09E-08
SMOTE        -4.96936   6.72E-07   0.005     1            6.72E-06
ADASYN       -4.78183   1.74E-06   0.005556  1            1.56E-05
OSR          -4.64119   3.46E-06   0.00625   1            2.77E-05
AC(R = RL)   -4.35991   1.3E-05    0.007143  1            9.11E-05
Far Knn      -2.4378    0.014777   0.008333  0            0.088662
RUS          -2.41436   0.015763   0.01      0            0.078815
AB           -2.34404   0.019076   0.0125    0            0.076305
AC(R = 28)   -2.20339   0.027567   0.016667  0            0.082701
CBU          -2.13307   0.032919   0.025     0            0.065837
Cl Knn       0.609449   0.542227   0.05      0            0.542227

Table 8 Holm test at the α = 0.05 significance level with EE(S = 15) as reference.

             z          p          αcomp     significant  αcrit
Cl Knn       7.12587    1.03E-12   0.004167  1            1.24E-11
BL           6.516421   7.2E-11    0.004545  1            7.92E-10
CBU          4.383348   1.17E-05   0.005     1            0.000117
AC(R = 28)   4.313027   1.61E-05   0.005556  1            0.000145
AB           4.172384   3.01E-05   0.00625   1            0.000241
RUS          4.102063   4.09E-05   0.007143  1            0.000287
Far Knn      4.078623   4.53E-05   0.008333  1            0.000272
AC(R = RL)   2.156513   0.031044   0.01      0            0.155218
OSR          1.875229   0.060761   0.0125    0            0.243045
ADASYN       1.734587   0.082814   0.016667  0            0.248442
SMOTE        1.547064   0.121848   0.025     0            0.243696
EE(S = 10)   0.65633    0.511612   0.05      0            0.511612

4.6.2 Timing-wise comparison

The performance of each of the proposed algorithms in terms of training times depends highly upon the chosen parameters. The β and βu parameters control the number of training instances and are an important contribution to the subsequent SVM learning times for each of the over- and undersampling techniques. The number of boosting iterations t ∈ [0,T] is also a major factor in each of the boosting variants proposed in Section 3.3. Furthermore, the chosen regularization parameter C highly influences the training times of the SVM inducer: higher C-values result in larger training times (Fan et al 2008). Not only do the chosen hyperparameters affect learning times; the data characteristics, such as the number of minority instances, the imbalance level, the number of features, the amount of overlapping etc., have a major effect as well.


In comparing each of the methods outlined in Section 3, we make use of a similar methodology as previously presented (see Section 4.6.1). For each of the ten folds, a suitable parameter combination is selected based on optimal validation set AUC-performance. These chosen parameters are applied in determining the timings of each method. The final results indicate the average timings across ten folds. Hence, we show the required amount of time for each method when using the parameter combination resulting in the best performance. Note that this might for instance result in using β = 13 in case of OSR, or using only t = 5 boosting iterations in EE. Computational timings are shown in Table 9.

As one can observe from the aforementioned table, CBU is the slowest method, as it relies on computationally intensive clustering procedures. The synthetic SMOTE and ADASYN oversampling approaches are very time-consuming methods: they both rely on computationally expensive nearest neighbour computations, in conjunction with a synthetic sample generation procedure that requires a single pass over all minority training instances.

The RUS method shows, as expected, superior results in terms of computational timings. The other undersampling methods perform quite well, though they are slower than BL and OSR due to their inherent nearest neighbour computations.

The boosting variants seem to perform worse than the BL, OSR and each of the undersampling methods. Since EE was the best method in terms of AUC-performance (see Section 4.6.1), we focus on this learner in this paragraph. For the medium sized datasets (all but CRF and Bank), OSR outperforms EE(S = 15). Yet for the large datasets (CRF and Bank), the opposite is true. An SVM is known to have a quadratic computational complexity in the number of training instances (Suykens et al 2002; García and Lozano 2007). For this reason, we consider OSR (or any other oversampling technique) as inappropriate for large datasets. Since with EE each subset has a size limited to twice the minority class training size, we expect increasing benefits of the EE-technique for larger and more imbalanced behaviour data. As Junqué de Fortuny et al (2014a) have observed, larger behaviour data in terms of the number of instances or features also contributes to an increase in predictive power. Hence, we expect to see this type of big behaviour data more often in real-life applications, where computational resources can become a major constraint. Because the EE-technique is easily parallelizable, by simultaneously performing the boosting process for each of the independent subsets, we added an extra row in Table 9: EE par corresponds with the time needed for EE(S = 15) divided by 15, indicating the time that would be required for EE if each subset were trained in parallel. We can see that EE par can outperform OSR in 8 out of 16 cases, precisely for the datasets that show a high imbalance level (e.g. p = 1 or p = []).
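To make the parallelization argument concrete, the sketch below implements a stripped-down EE variant with scikit-learn (an assumption; the paper's own implementations live in the authors' on-line repository). The boosting step is omitted for brevity and a plain linear SVM is trained per balanced subset, which suffices to show that the per-subset work is independent and hence trivially parallelizable:

```python
import numpy as np
from sklearn.svm import LinearSVC
from sklearn.metrics import roc_auc_score

rng = np.random.default_rng(0)

def easy_ensemble_scores(X, y, X_test, n_subsets=15):
    """Simplified EasyEnsemble sketch: draw balanced subsets (all minority
    instances plus an equally sized random majority sample), train one base
    learner per subset and average the decision scores.  Each loop iteration
    is independent of the others, so this is the part that EE par runs in
    parallel; the confidence-rated boosting of the full method is omitted."""
    minority = np.flatnonzero(y == 1)
    majority = np.flatnonzero(y == 0)
    scores = np.zeros(len(X_test))
    for _ in range(n_subsets):
        sample = rng.choice(majority, size=len(minority), replace=False)
        idx = np.concatenate([minority, sample])
        clf = LinearSVC(C=0.01).fit(X[idx], y[idx])
        scores += clf.decision_function(X_test)
    return scores / n_subsets

# Hypothetical synthetic imbalanced data (roughly 5% minority)
X = rng.normal(size=(2000, 20))
y = (X[:, 0] + 0.5 * rng.normal(size=2000) > 1.8).astype(int)
s = easy_ensemble_scores(X[:1500], y[:1500], X[1500:])
print(round(roc_auc_score(y[1500:], s), 2))  # high AUC on this easy synthetic task
```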

We conclude our comparison of the proposed methods in Figure 5, which shows the average ranks in terms of computational timings (see Table 9) and AUC performance for each of the available data sources. Regarding the ranks with respect to the latter, we added an extra row EE par, having the same AUC-performance as EE(S = 15) for each dataset in Table 5. Ranks are subsequently determined according to the procedure outlined in Section 4.6.1. Additionally, Figure 18 in Appendix D shows the result for the large and highly imbalanced CRF and Bank datasets. We expect the latter type of data to be of major interest for reasons already indicated.


Table 9 Computational timings (in seconds), averaged across ten folds, using the parameter combination resulting in the highest validation set AUC-performance. First and second best performances are emphasized in boldface (third and fourth best are underlined). EE par is the time required for EE(S = 15) divided by 15.

             Mov G(p = 1)  Mov G(p = 25)  Mov Th(p = [])  Yahoo A(p = 1)
BL           0.032889      0.056697       0.558563        0.026922
OSR          0.055043      0.062802       0.99009         0.044421
SMOTE        0.218821      0.937057       3.841482        0.057726
ADASYN       0.284688      1.802399       5.191265        0.087694
RUS          0.011431      0.025383       0.155224        0.007991
Cl Knn       0.046599      0.599846       0.989914        0.037182
Far Knn      0.039887      0.80072        0.683023        0.027788
CBU          1.034111      10.60173       6.822839        1.692477
AB           0.169792      0.841443       3.460246        0.139251
AC(R = 28)   0.471994      2.996585       1.086907        0.366555
AC(R = RL)   0.53376       1.179542       6.065177        0.209015
EE(S = 10)   0.117226      6.065145       1.17995         0.148973
EE(S = 15)   0.20474       7.173737       2.119991        0.180365
EE par       0.013649      0.478249       0.141333        0.012024

             Yahoo A(p = 25)  Yahoo G(p = 1)  Yahoo G(p = 25)  TaFeng(p = 1)
BL           0.092954         0.011915        0.044164         0.026728
OSR          0.027887         0.013241        0.047206         0.040919
SMOTE        1.062686         0.056153        0.883698         0.219553
ADASYN       2.050993         0.079073        1.733367         0.306618
RUS          0.048471         0.003234        0.033423         0.002916
Cl Knn       0.84391          0.025404        0.502515         0.092167
Far Knn      0.664124         0.026576        0.500206         0.080159
CBU          15.69442         1.287221        13.55035         2.467279
AB           0.445546         0.078777        0.169977         0.114619
AC(R = 28)   1.034044         0.321723        0.515953         0.926178
AC(R = RL)   0.706215         0.226741        0.112949         0.610233
EE(S = 10)   1.026577         0.100331        1.527146         0.058052
EE(S = 15)   1.607596         0.077483        2.472582         0.10538
EE par       0.107173         0.005166        0.164839         0.007025

             TaFeng(p = 25)  Book(p = 1)  Book(p = 25)  LST(p = 1)
BL           0.032033        0.080035     0.318093      0.652045
OSR          0.032414        0.132927     0.092757      0.87152
SMOTE        5.089283        3.409418     11.43444      4.987705
ADASYN       8.148419        3.689661     12.25441      6.840083
RUS          0.020457        0.022713     0.031972      0.432839
Cl Knn       1.713731        0.400873     3.711648      2.508374
Far Knn      1.539437        0.379086     3.988552      2.511037
CBU          26.42686        4.198663     46.31987      []
AB           0.713265        0.61719      1.238585      2.466151
AC(R = 28)   1.234647        1.666131     2.330635      1.451671
AC(R = RL)   0.279047        0.860346     0.197053      1.23763
EE(S = 10)   2.484502        2.145747     7.177484      0.524066
EE(S = 15)   3.363971        2.480066     11.21945      0.784111
EE par       0.224265        0.165338     0.747963      0.052274


Table 9 Continued. Additionally, an extra column indicating average ranks is included (lower ranks are preferred).

             Adver(p = [])  Adver(p = 1)  CRF(p = [])  Bank(p = [])
BL           0.010953       0.002796      0.725911     70.89334
OSR          0.012178       0.006166      3.685813     179.7481
SMOTE        0.123112       0.017764      5.633862     []
ADASYN       0.183767       0.021728      5.768669     []
RUS          0.012115       0.00204       0.147392     5.247441
Cl Knn       0.061324       0.005568      1.106755     73.73282
Far Knn      0.079078       0.007069      1.110379     97.59619
CBU          3.378235       3.236754      []           []
AB           0.069199       0.103518      1.153196     83.08618
AC(R = 28)   0.193092       0.068905      2.047434     71.70548
AC(R = RL)   0.107652       0.037963      1.387174     106.3466
EE(S = 10)   0.138485       0.085686      0.198656     24.95117
EE(S = 15)   0.185136       0.139121      0.285345     36.40107
EE par       0.012342       0.009275      0.019023     2.426738

             Average Rank [pos]
BL           2.94 [2]
OSR          4.19 [4]
SMOTE        9.59 [11]
ADASYN       10.91 [13]
RUS          1.38 [1]
Cl Knn       6.5 [5]
Far Knn      6.56 [6]
CBU          14 [14]
AB           8.06 [7]
AC(R = 28)   10.81 [12]
AC(R = RL)   9.25 [9]
EE(S = 10)   8.25 [8]
EE(S = 15)   9.56 [10]
EE par       3 [3]

Timings are a substantial concern with respect to these large datasets. As can be observed, the EE-technique thrives for these big and highly imbalanced data, even in its non-parallel form.


[Figure 5: scatter plot of the average rank in terms of AUC against the average rank in terms of computation time, with one point per method (BL, OSR, SMOTE, ADASYN, RUS, Cl Knn, Far Knn, CBU, AB, AC(R = 28), AC(R = RL), EE(S = 10), EE(S = 15), EE par); only the caption is reproduced here.]

Fig 5 Average rank AUC versus average rank Time (see Table 9) across all datasets from Table 2 (excluding Flickr and Kdd). Regarding the AUC rank, an extra row EE par is added, having the same AUC-value as EE(S = 15). Ranks for each dataset are subsequently obtained via the procedure outlined in Section 4.6.1. Points occurring in the upper-right region are preferred.

5 The effect of the chosen base learner

In this short section we investigate the influence of the chosen base learner. More specifically, we ask ourselves whether some base inducers can more easily cope with the imbalanced learning issue, and in doing so we verify the conclusion that EE is a suitable technique to improve upon the baseline performance. We rely on several popular classification methods for behaviour data: two regularization-based techniques (SVM, LR) and two heuristic types of approaches (Naive Bayes (NB), Behavioural Similarity (BeSim)). The next paragraphs briefly introduce these methods. Note that source codes and dataset results are provided in our on-line repository.

LR constructs a linear model that maximizes the likelihood of the observed data by imposing a logistic model for the parametric probability model P(Y|X). Ng and Jordan (2002) provide a thorough introduction on the topic²⁷ and note that in its plain form LR can suffer from overfitting, especially when the input data is very high dimensional. Therefore, we resort to regularized logistic regression. We opt for L2-regularized LR, since L1-regularization corresponds with a natural way of imposing feature selection (Ng 2004). The LIBLINEAR toolbox (Fan et al 2008) is used to obtain these L2-regularized LR models.

²⁷ For an accessible introduction, see the chapter on "Generative and Discriminative Classifiers: Naive Bayes and Logistic Regression" provided at the following link: http://www.cs.cmu.edu/~tom/NewChapters.html
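A minimal sketch of this setup with scikit-learn's LIBLINEAR binding (an assumption for illustration; the paper uses the LIBLINEAR toolbox directly) could look as follows, with a tiny hypothetical sparse behaviour matrix:

```python
from scipy.sparse import csr_matrix
from sklearn.linear_model import LogisticRegression

# Tiny sparse "behaviour" matrix (rows = individuals, columns = fine-grained
# actions); entries are binary and the matrix is mostly empty, as in the text
X = csr_matrix([[1, 0, 0, 1, 0],
                [0, 1, 0, 0, 1],
                [1, 0, 1, 1, 0],
                [0, 1, 0, 0, 1]])
y = [1, 0, 1, 0]

# L2-regularized LR with the liblinear solver; C plays the same role as the
# SVM regularization parameter (smaller C = stronger regularization)
clf = LogisticRegression(penalty="l2", C=1.0, solver="liblinear").fit(X, y)
print(clf.predict_proba(X[:1])[0, 1])  # posterior P(Y=1|x) for the first row
```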


NB (Ng and Jordan 2002) relies on the use of Bayes' rule, which expresses the posterior P(Y|X) as the product of the likelihood P(X|Y) and the prior P(Y), divided by the evidence P(X). The naivety of the approach originates from assuming conditional independence of the features. We impose a multivariate event model regarding the distribution P(X|Y). Junqué de Fortuny et al (2014a) provide an efficient NB implementation for dealing with large and sparse binary behaviour data.

The BeSim approach conceptually corresponds to projecting the bigraph to a unigraph, using the procedure outlined in Section 3.2. Next, the weighted-vote relational neighbour classifier (Macskassy and Provost 2007) is used to infer unknown labels/scores through a weighted probability estimation using the known labels of neighbouring nodes. A very fast and scalable implementation of this approach (the SW-transformation) is presented in detail in Stankova et al (2015).
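The weighted-vote idea can be sketched with a few matrix operations. Note that this is a simplified illustration of the principle on hypothetical toy data, not the exact SW-transformation weighting of Stankova et al (2015):

```python
import numpy as np

# Binary behaviour matrices: rows = top nodes, columns = shared bottom nodes
X_train = np.array([[1, 1, 0, 0],
                    [1, 0, 1, 0],
                    [0, 0, 1, 1]])
y_train = np.array([1, 1, 0])
X_test = np.array([[1, 1, 1, 0],
                   [0, 0, 1, 1]])

# Projected-unigraph weights: W[i, j] = number of bottom nodes that test
# instance i shares with training instance j
W = X_test @ X_train.T
# Weighted vote: fraction of neighbour weight carrying a positive label
# (the max(.., 1) guards against test nodes without any neighbours)
scores = (W @ y_train) / np.maximum(W.sum(axis=1), 1)
print(scores)  # [0.8        0.33333333]
```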

In Table 10 we compare the performances of the baseline (BL) application of the methods with their EE-version (with S = 15, T = 20). It is quite striking to see the EE-version dominating the baseline version when looking at each method individually. This suggests that the imbalanced learning issue has an impact on each of the aforementioned techniques. Focussing on the regularization-based approaches, we observe the baseline LR to be less influenced by the imbalance than the SVM method. Their EE-versions (EE SVM and EE LR) can boost the baseline performance in almost all cases and have the highest average ranks amongst all the proposed methods. As already emphasized, the regularization-based approaches offer an added element of flexibility, in the sense that the strength of the underlying base learner can be controlled.

The heuristic approaches NB and BeSim do not offer this flexibility. Furthermore, it is not directly possible to integrate instance weights during the training phase of the base learner. During the boosting process, we therefore sample from the distribution Dt to generate (unweighted) examples and train a BeSim/NB learner next. The results in Table 10 indicate BL NB to be the worst performing algorithm amongst all methods tried. Due to the conditional independence assumption, one could regard this as a weak learner, which makes it suitable to be used in a boosting process. Indeed, the EE NB approach is able to significantly improve upon its performance. We found this to be the slowest EE-version²⁸ (e.g. taking more than 24 hours to process a single fold of the Bank dataset, versus several minutes for the other versions). The average rank estimate of EE NB is rather pessimistic, since we assigned the lowest performance across all methods in case of missing results. BeSim can be considered as a stronger learner compared to NB; its possible boost in performance is typically lower than that of NB. Having said this, we found EE BeSim to improve upon BL BeSim for all datasets (except Adver(p = 1)).

²⁸ This statement only relates to the EE-version. In the baseline case, the timings of the NB-version are roughly of the same order of magnitude as SVM and LR (though this heavily depends on the chosen regularization parameter C). With EE, each subset has a size limited to twice the minority class training size (a relatively low number compared to the size of the majority class). This works in favour of SVM/LR (e.g. SVM has a computational complexity of O(m²), with m the number of training instances). Also note that the NB-version is indeed optimized for sparse data. Internally, the NB-implementation writes a fairly large model file to a specified directory (sizes of 1 GB can be encountered there). This writing operation slows down the training stage (especially in case of EE, where S × T = 15 × 20 = 300 model files are constructed).
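The weighted-resampling step used for these weight-agnostic base learners can be sketched in one line with the standard library (the instance labels and weights below are hypothetical):

```python
import random

random.seed(7)

def resample_from_distribution(instances, d_t, n):
    """Draw an unweighted training sample of size n according to the boosting
    distribution D_t, for base learners (NB, BeSim) that cannot consume
    instance weights directly during training."""
    return random.choices(instances, weights=d_t, k=n)

# Hypothetical boosting weights concentrating on the hard instance "c"
sample = resample_from_distribution(["a", "b", "c"], [0.1, 0.1, 0.8], n=1000)
print(sample.count("c"))  # roughly 800 of the 1000 draws
```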


Table 10 Average tenfold AUC test set performance of the BL compared to its EE-version (S = 15, T = 20) for SVM, LR, BeSim and NB. In case the EE NB implementation crashed, we put a × in the table. If it is too slow (> 24 hours per fold), we place a []. Additionally, average ranks are assigned to each method. In case EE NB has missing results for a particular dataset, we ensure it has the lowest rank across all other algorithms.

           Mov G(p = 1)   Mov G(p = 25)  Mov Th(p = [])  Yahoo A(p = 1)
BL SVM     71.6 (2.62)    81.41 (1.32)   79.77 (5.33)    56.49 (3.37)
EE SVM     76.12 (2.88)   85.13 (1.86)   86.43 (5.86)    59.74 (2.96)
BL LR      71.02 (2.09)   84.39 (1.84)   83.14 (4.17)    57.84 (2.39)
EE LR      76.69 (2.92)   85.03 (1.98)   86.3 (5.37)     59.79 (2.62)
BL BeSim   76.1 (3.58)    81.3 (2.92)    82.81 (6.6)     56.27 (2.73)
EE BeSim   76.31 (3.71)   81.37 (2.9)    85.02 (6.28)    57.7 (1.71)
BL NB      70.26 (5.84)   77.01 (2.54)   70.48 (10.14)   52.56 (2.09)
EE NB      75.93 (2.83)   85.56 (2.01)   86.91 (4.15)    57.55 (2.73)

           Yahoo A(p = 25)  Yahoo G(p = 1)  Yahoo G(p = 25)  TaFeng(p = 1)
BL SVM     61.61 (2.48)     66.84 (3.66)    78.82 (1.39)     55.75 (1.6)
EE SVM     66.38 (3.16)     73.48 (2.32)    80.55 (1.55)     61.13 (1.83)
BL LR      66.27 (2.96)     69.82 (1.93)    80.45 (1.59)     58.91 (2.31)
EE LR      66.22 (3.28)     73.08 (2.14)    80.53 (1.56)     61.43 (2.32)
BL BeSim   64.54 (2.02)     68.89 (2.49)    79.55 (1.96)     57.89 (1.18)
EE BeSim   65.25 (2.23)     71.18 (2.91)    80.04 (1.85)     59.36 (1.47)
BL NB      65 (1.65)        63.33 (2.56)    78.89 (1.64)     54.61 (1.2)
EE NB      66.6 (2.79)      70.99 (2.88)    81.01 (1.3)      59.01 (1.84)

           TaFeng(p = 25)  Book(p = 1)   Book(p = 25)  LST(p = 1)
BL SVM     66.94 (1.34)    52.6 (1.29)   60.08 (0.71)  99.99 (0.01)
EE SVM     70.4 (1.3)      55.34 (1.28)  65.4 (0.51)   99.98 (0.01)
BL LR      69.24 (1.3)     55.34 (1.27)  63.84 (0.75)  99.99 (0.01)
EE LR      70.28 (1.28)    55.49 (1.49)  65.41 (0.63)  99.97 (0.02)
BL BeSim   67.49 (1.23)    55.19 (1.27)  63.7 (0.63)   99.99 (0.01)
EE BeSim   68 (1.21)       55.21 (1.15)  64.38 (0.42)  99.99 (0)
BL NB      65.21 (1.64)    52.93 (0.9)   59.75 (0.47)  98.69 (0.3)
EE NB      70.72 (1.15)    ×             63.46 (0.61)  99.92 (0.04)

           Adver(p = [])  Adver(p = 1)  CRF(p = [])    Bank(p = [])
BL SVM     96.37 (1.94)   91.18 (2.97)  64.36 (18.97)  66.82 (0.88)
EE SVM     97.63 (1.35)   93.3 (2.14)   86.35 (9.99)   71.54 (0.76)
BL LR      97.19 (1.44)   88.51 (1.93)  81.87 (19.63)  71.43 (0.72)
EE LR      97.57 (0.96)   93.02 (2.06)  86.84 (9.62)   71.77 (0.62)
BL BeSim   97.26 (1.12)   95.38 (1.35)  86.91 (9.36)   67.85 (0.67)
EE BeSim   97.38 (1.04)   93.83 (1.35)  87.02 (10.43)  70.41 (0.55)
BL NB      93.75 (1.9)    93.37 (1.9)   87.24 (9.38)   67.83 (0.63)
EE NB      94.04 (1.75)   ×             ×              []

           Flickr(p = 01)  Kdd(p = 05)   Average Rank
BL SVM     74.92 (0.17)    74.53 (0.05)  6.44 [7]
EE SVM     79.86 (0.13)    80.98 (0.05)  2.39 [1]
BL LR      79.03 (0.11)    81.29 (0.04)  4.28 [4]
EE LR      79.85 (0.13)    80.75 (0.05)  2.61 [2]
BL BeSim   74.62 (0.13)    74.95 (0)     5.11 [6]
EE BeSim   76.4 (0.13)     77.55 (0.03)  3.61 [3]
BL NB      81.36 (0.1)     74.29 (0.05)  6.5 [8]
EE NB      []              []            5.06 [5]


6 Conclusions and future research directions

We investigated the effects of sampling and cost-sensitive learning techniques on the problem of learning from imbalanced behaviour data. This type of big data has fundamentally different properties and internal structure than traditional data, and is characterized by sparseness and very large dimensions. This is a new research topic which enables benefits across a wide variety of application domains, such as fraud detection, targeted advertising, predictive policing, churn prediction, default prediction etc. To enable future developments in this field, we provide an on-line repository containing implementations, datasets and results, available at the following link: http://www.applieddatamining.com/cms?q=software

Oversampling techniques need to be adapted in order to cope with this kind of data. We proposed new versions of the traditional SMOTE and ADASYN techniques, which differ in the way synthetic samples are generated and similarity computations are performed. The plain random oversampling (OSR) and the synthetic approaches are competitive methods in terms of AUC and do not seem to suffer from overfitting, as traditional studies report. Though the synthetic approaches have comparable AUC-performances in comparison to OSR, they are useless in practice due to their computationally expensive procedures. The OSR method is very fast for small and medium sized datasets, yet becomes inappropriate for larger datasets, because the training times of the underlying base learner increase drastically.

With respect to undersampling methodologies, we employed a random undersampling technique (RUS) and Knn-based informed approaches. The latter type selects majority class instances that are closest (Cl Knn) or farthest (Far Knn) from the minority class examples, and allows one to investigate the effect of noise/outliers on SVM performance. We observed that these odd instances have a severe performance degrading effect. The RUS and Far Knn approaches can significantly improve performance with respect to the baseline, thanks to their abilities to reduce noise/outlier levels and to put more emphasis on the minority class. Having said this, they usually perform worse than any of the oversampling approaches. This is in sheer contrast to the results obtained from traditional studies, and is heavily related to the intrinsic nature of behaviour data. Timing-wise, RUS is the fastest method. The informed approaches are slower than RUS due to the required nearest neighbour computations. Additionally, a cluster-based undersampling (CBU) technique was proposed to alleviate the within-class imbalance. For low undersampling rates, the performance of CBU is superior to plain RUS, whereas for high undersampling rates the opposite is true. Overall, no substantial gain in performance is obtained in comparison with a simple RUS approach. Furthermore, the method is the slowest among all techniques considered and relies on computationally expensive clustering heuristics.

We studied the effects of boosting, where we made use of a very specific linear SVM and logistic regression combination that uses confidence-rated predictions, instead of a more popular plain "discrete" boosting algorithm (with weak hypotheses having values in {−1, 1}). Our experiments clearly indicated that the regularization constant C in the SVM formulation acts as a "weakness" indicator. Indeed, higher C-values cause stronger learners and shouldn't be used in the boosting process, which matches the analysis of studies dealing with traditional data. Strong learners can already quite accurately distinguish majority class from minority class behaviours. The boosting process will focus on the hard to learn instances, among which are noise/outliers, and this can result in overfitting behaviour. We also observed that boosting can usually outperform the plain baseline, indicating that a combination of weak learners can achieve higher AUC-performance than a single strong learner.
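The weight update at the heart of such a confidence-rated scheme (Schapire and Singer 1999) can be sketched as follows, with labels y in {−1, 1} and real-valued weak-learner scores; the function name is illustrative:

```python
import math

def boost_round(weights, y, h_scores):
    """One confidence-rated boosting weight update: the weak hypothesis
    outputs a real-valued score h(x); instance weights are multiplied by
    exp(-y * h(x)) and renormalised, so misscored (hard) instances gain
    weight and are emphasised in the next round."""
    new = [w * math.exp(-yi * hi) for w, yi, hi in zip(weights, y, h_scores)]
    z = sum(new)  # normalisation factor Z_t
    return [w / z for w in new]
```

Because noise/outliers are persistently misscored, their weights keep growing under this update, which is exactly the mechanism behind the overfitting behaviour noted above.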

EasyEnsemble (EE) integrates random undersampling with a boosting process. By sampling several balanced subsets from the training data and combining these, we can emphasize the minority class, achieve noise/outlier reduction effects and simultaneously explore the majority class space efficiently. Our version differs from the original formulation because we employ a different type of underlying weak learner (SVM and LR combination) that is fed to a confidence-rated boosting algorithm (instead of the discrete version). We observed superior AUC-performance with respect to all of the aforementioned techniques. Furthermore, the method is very fast since each subset can be trained in parallel and has a size that is twice as large as the minority class training set. We studied the effect of the number of subsets and noted that a limited amount of these (O(10)) is already sufficient to obtain satisfactory performances; increasing this number further brings only minor benefits.
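The overall EE recipe can be sketched as below. Here `fit` is an assumed placeholder standing in for the paper's boosted SVM/LR weak-learner combination; it is expected to return a scoring function:

```python
import random

def easy_ensemble(majority, minority, fit, n_subsets=10, seed=0):
    """EasyEnsemble sketch: draw several balanced subsets of the majority
    class (each the size of the minority class), train one model per
    subset, and average the models' scores."""
    rnd = random.Random(seed)
    models = []
    for _ in range(n_subsets):
        subset = rnd.sample(majority, len(minority))  # balanced subset
        models.append(fit(minority, subset))          # e.g. a boosted learner
    # each subset can be trained independently (hence in parallel);
    # the final score is a plain average over the subset models
    return lambda x: sum(m(x) for m in models) / len(models)
```

Since every subset model only ever sees 2 × |minority| training instances, the per-model cost stays small even when the majority class is huge, which is where the speed advantage comes from.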

Additionally, a statistical comparison of each of the proposed methods was conducted across a wide variety of datasets. Holm's significance test indicates that each of the proposed techniques (oversampling, undersampling except Cl Knn, cost-sensitive learning and boosting variants) can significantly outperform the plain baseline performance (each of the null hypotheses of equal performance gets rejected with a confidence level of at least 91%). Comparison of the techniques with EE showed that the latter can significantly outperform all of them (each of the null hypotheses of equal performance gets rejected with a confidence level of at least 75%). A timing-wise comparison showed EE to be the preferred method (after RUS) for large and highly imbalanced datasets. A parallel implementation would make EE the fastest method even for medium sized datasets that show a high level of imbalance.
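Holm's step-down procedure itself is simple to state; the following sketch uses the conventional significance level α = 0.05 purely for illustration (the paper instead reports the confidence levels at which the null hypotheses are rejected):

```python
def holm_reject(p_values, alpha=0.05):
    """Holm's step-down procedure: sort the m p-values ascending and
    compare the i-th smallest (i = 0, 1, ...) against alpha / (m - i);
    at the first failure, that hypothesis and all remaining (larger)
    p-values are retained."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    reject = [False] * m
    for rank, i in enumerate(order):
        if p_values[i] <= alpha / (m - rank):
            reject[i] = True
        else:
            break
    return reject
```

Compared with a plain Bonferroni correction (alpha / m for every test), the step-down thresholds are progressively less strict, which gives the procedure more power at the same family-wise error rate.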

The influence of the chosen base learner was investigated by comparing the baseline to its EE-version for the SVM, LR, NB and BeSim inducers. Each of the EE-versions is able to improve upon the baseline performance for nearly all datasets, where base learners of the regularization type are superior because of their ability to control the strength of the learner.

Though we conducted an extensive benchmark study, there are still plenty of future research directions. First and most important, there is an urgent need to develop a more general theoretical framework to completely understand the effect of imbalance and why certain methods outperform others. Secondly, other types of mechanisms could be investigated in this setting: advanced sampling (Guo et al 2008) set-ups perform sampling based on preliminary classifications; one-class learning is a recognition-based methodology where a discriminative boundary is learned around the training examples of a single class alone, and Raskutti and Kowalczyk (2004) demonstrated the superiority of the recognition-based one-class SVMs over the discriminative two-class SVMs; other types of ensemble learning strategies (e.g. bagging and stacking) could be investigated; Batista et al (2004) note that combining over- and undersampling can be beneficial when data are highly imbalanced or show very few minority class instances; the ideas behind Extended Nearest Neighbour (Tang and He 2015) could guide us in the sampling process. As opposed to Knn, "it considers not only who are the nearest neighbours of a certain sample, but also who consider the particular sample as their nearest neighbour (Tang and He 2015)". The method assigns a class membership to an unknown sample that maximizes the intra-class coherence. Finally, more advanced techniques do exist that can provide a K (the number of nearest neighbours) faster or with a (slightly) better performance.

Another issue relates to the lack of comprehensibility of the EasyEnsemble technique in its present form. Because we currently integrate logistic regression in the boosting process, the final ensemble classifier is non-linear and therefore incomprehensible. It might be interesting to investigate the numerical boosting technique proposed in Schapire and Singer (1999), where each weak learner is allowed to output real values (instead of values in {−1, 1}). In that case, we would be able to use a plain linear SVM as a weak learner without the logistic regression component. The final ensemble would be a linear model, which is surely easier to comprehend. Also, the instance-based approach of Martens and Provost (2014) could be used to explain individual predictions of the non-linear EE-technique.
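The observation that such an ensemble would be linear is easy to make concrete: a weighted sum of linear models collapses into a single weight vector and bias. A sketch, with each model given as an illustrative (w, b) pair:

```python
def collapse_linear_ensemble(models, alphas):
    """If every weak learner is linear, f_t(x) = w_t . x + b_t, then the
    boosted ensemble sum_t alpha_t * f_t(x) is itself linear and can be
    collapsed into one (comprehensible) weight vector and one bias.
    `models` is a list of (w, b) pairs, `alphas` the boosting weights."""
    dim = len(models[0][0])
    w = [sum(a * m[0][d] for m, a in zip(models, alphas)) for d in range(dim)]
    b = sum(a * m[1] for m, a in zip(models, alphas))
    return w, b
```

The collapsed weight vector can then be inspected directly, feature by feature, which is what makes the linear variant attractive from a comprehensibility standpoint.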

A Oversampling experiments

Table 11 Oversampling experiments. Results show the average tenfold AUC test set performance with respect to increasing β-values (β = [β1, β2, β3, β4] = [0, 1/3, 2/3, 1]). Additionally, standard deviations are shown between brackets. Optimal performances for each method are highlighted in boldface. See Section 4.2 for parameter settings.

Mov G(p = 1)
         β1            β2            β3            β4
OSR      71.6 (2.62)   74.37 (2.04)  73.6 (1.84)   74.73 (2.45)
SMOTE    71.6 (2.62)   75.08 (2.18)  76.02 (2.14)  76.48 (2.3)
ADASYN   71.6 (2.62)   75.16 (1.92)  75.93 (2.08)  76.47 (2.29)

Mov G(p = 25)
OSR      81.41 (1.32)  83.49 (1.81)  83.84 (1.96)  83.91 (2.04)
SMOTE    81.41 (1.32)  83.32 (1.97)  83.59 (2.04)  83.76 (2.11)
ADASYN   81.41 (1.32)  83.61 (1.82)  84.02 (1.97)  83.69 (1.96)

Mov Th(p = [])
OSR      79.77 (5.33)  85.3 (4.66)   83.16 (4.5)   84.59 (5.69)
SMOTE    79.77 (5.33)  84.18 (6.51)  85.58 (5.97)  84.33 (5.75)
ADASYN   79.77 (5.33)  84.11 (6.77)  85.86 (6.06)  85.36 (5.13)

Yahoo A(p = 1)
OSR      55.92 (2.97)  58.66 (3.27)  59.99 (2.28)  59.74 (1.78)
SMOTE    55.92 (2.97)  59.76 (2.62)  59.74 (2.67)  59.43 (2.4)
ADASYN   55.92 (2.97)  59.54 (2.53)  59.55 (2.94)  59.56 (2.22)

Yahoo A(p = 25)
OSR      61.68 (2.42)  64.19 (3.17)  65.08 (3.26)  64.67 (2.1)
SMOTE    61.68 (2.42)  65.46 (3.63)  65.33 (3.23)  64.52 (2.98)
ADASYN   61.68 (2.42)  65.04 (3.74)  65.41 (3.47)  64.4 (2.21)

Yahoo G(p = 1)
OSR      66.84 (3.66)  72.18 (2.36)  73.11 (2.7)   72.49 (3.41)
SMOTE    66.84 (3.66)  72.65 (2.85)  73.27 (3.36)  73.37 (3.56)
ADASYN   66.84 (3.66)  72.87 (2.83)  73.18 (3.2)   73.39 (3.59)

Yahoo G(p = 25)
OSR      78.82 (1.39)  78.78 (2.02)  78.74 (1.86)  78.35 (1.47)
SMOTE    78.82 (1.39)  79.23 (1.57)  79.1 (1.2)    79.03 (1.89)
ADASYN   78.82 (1.39)  79.12 (1.43)  79.23 (1.37)  79.51 (2.01)

TaFeng(p = 1)
OSR      55.75 (1.6)   59.23 (1.96)  60 (1.68)     61.04 (2.36)
SMOTE    55.75 (1.6)   60.26 (1.95)  61.49 (1.8)   61.13 (1.52)
ADASYN   55.75 (1.6)   60.26 (1.9)   61.44 (1.85)  61.16 (1.5)

TaFeng(p = 25)
OSR      66.94 (1.34)  68.84 (1.21)  66.99 (1.42)  67.7 (1.41)
SMOTE    66.94 (1.34)  68.47 (1.5)   67.07 (1.15)  66.65 (0.81)
ADASYN   66.94 (1.34)  68.62 (1.38)  67.85 (1.6)   66.91 (1.39)

Book(p = 1)
OSR      52.6 (1.29)   53.61 (0.94)  55.41 (1.75)  55.87 (1.44)
SMOTE    52.6 (1.29)   54.77 (0.99)  54.91 (0.8)   54.36 (0.98)
ADASYN   52.6 (1.29)   54.86 (1.13)  55.06 (0.73)  54.54 (0.92)

Book(p = 25)
OSR      60.08 (0.71)  61.05 (0.94)  61.82 (1.12)  64.62 (0.57)
SMOTE    60.08 (0.71)  62.6 (0.73)   60.95 (0.68)  63 (0.8)
ADASYN   60.08 (0.71)  62.33 (0.73)  60.77 (0.85)  63.04 (0.58)

LST(p = 1)
OSR      99.99 (0.01)  99.99 (0.01)  99.99 (0.01)  99.99 (0.01)
SMOTE    99.99 (0.01)  99.99 (0.01)  99.99 (0.01)  99.99 (0.01)
ADASYN   99.99 (0.01)  99.99 (0.01)  99.99 (0.01)  99.99 (0.01)

Adver(p = [])
OSR      96.61 (1.82)  97.31 (1.65)  97.07 (1.84)  97.07 (1.79)
SMOTE    96.61 (1.82)  96.91 (1.66)  97.19 (1.65)  97.07 (1.91)
ADASYN   96.61 (1.82)  97.1 (1.7)    97.08 (1.87)  97.07 (1.88)

Adver(p = 1)
OSR      90.93 (3.02)  91.27 (3.03)  92.66 (2.82)  93.29 (1.97)
SMOTE    90.93 (3.02)  92.51 (2.03)  92.96 (2.14)  93.53 (1.81)
ADASYN   90.93 (3.02)  92.22 (2.33)  92.7 (2.36)   93.88 (1.73)

CRF(p = [])
OSR      64.06 (16.43) 80.82 (12.94) 81.28 (12.27) 81.91 (11.28)
SMOTE    64.06 (16.43) 78.64 (16.86) 82.52 (13.74) 79.32 (16.26)
ADASYN   64.06 (16.43) 78.95 (16.72) 81.19 (16.32) 79.31 (16.19)

Bank(p = [])
OSR      66.82 (0.88)  70.1 (0.74)   71.39 (0.8)   71.47 (0.8)
SMOTE
ADASYN


B Undersampling experiments

Table 12 Undersampling experiments. Results show the average tenfold AUC test set performance with respect to increasing βu-values (βu = [βu1, βu2, βu3, βu4, βu5] = [0, 1/4, 1/2, 3/4, 1]). Optimal performances (boldface) are highlighted for each imbalance ratio. Additionally, standard deviations are shown between brackets. Cl K represents the "Closest Knn" technique; Cl T represents the "Closest tot sim" technique (similar for Far K and Far T, see Section 3.2). Parameter settings are given in Section 4.2.

Mov G(p = 1)
       βu1          βu2          βu3          βu4          βu5
RUS    71.6 (2.6)   71.83 (2.6)  72.54 (2.5)  72.39 (3.1)  70.61 (3.5)
Cl K   71.6 (2.6)   71.4 (2)     70.96 (1.9)  70.43 (2.4)  69.05 (4.1)
Cl T   71.6 (2.6)   70.28 (2.5)  66.74 (2)    66.8 (2.1)   68.18 (3.6)
Far K  71.6 (2.6)   72.36 (2.7)  71.26 (3.4)  66.57 (5.2)  53.5 (3.5)
Far T  71.6 (2.6)   72.22 (2.8)  71.63 (3.6)  64.28 (5.3)  50.88 (4.4)
CBU    72.55 (2.6)  73.28 (2.6)  73.12 (2.6)  73.84 (2.5)  73 (3.1)

Mov G(p = 25)
RUS    81.41 (1.3)  81.36 (1.3)  81.78 (1.7)  82.05 (1.7)  81.6 (2.1)
Cl K   81.41 (1.3)  80.86 (1.2)  80.95 (1.6)  79.73 (2.3)  77.95 (2.3)
Cl T   81.41 (1.3)  79.9 (1.2)   78.21 (1.4)  77.87 (1.5)  77.76 (2.3)
Far K  81.41 (1.3)  80.9 (1.5)   78.17 (1.8)  74.25 (2.4)  69.79 (3.2)
Far T  81.41 (1.3)  80.86 (1.5)  77.2 (2.4)   71.16 (2.7)  62.4 (2.8)
CBU    81.53 (1.4)  81.64 (1.3)  81.29 (1.6)  81.28 (2.1)  80.34 (2.7)

Mov Th(p = [])
RUS    79.77 (5.3)  80.32 (5.8)  81.57 (5.5)  81.86 (6.6)  81.26 (6.2)
Cl K   79.77 (5.3)  79.25 (4.5)  78.07 (5)    76.25 (6.5)  62.46 (8.5)
Cl T   79.77 (5.3)  78.4 (4.4)   72.41 (3.5)  64.66 (4.5)  60.37 (7.3)
Far K  79.77 (5.3)  84.54 (5)    83.64 (6.4)  80.02 (7.3)  56.82 (10.3)
Far T  79.77 (5.3)  85.03 (5.7)  82.68 (6.8)  75.61 (9.2)  56.77 (10.9)
CBU    80.11 (5.8)  81.17 (6)    81.08 (6.5)  84.17 (5.1)  80.96 (6.9)

Yahoo A(p = 1)
RUS    55.92 (3)    55.57 (3.4)  56.44 (3)    55.83 (3.4)  56.37 (3.3)
Cl K   55.92 (3)    55.67 (2.4)  53.12 (2)    50.57 (1.8)  53.79 (3.5)
Cl T   55.92 (3)    55.69 (2.1)  53.35 (2.2)  50.31 (2.2)  52.35 (3.3)
Far K  55.92 (3)    57.35 (2.2)  56.92 (1.1)  56.95 (2.3)  51.18 (2)
Far T  55.92 (3)    56.93 (2.4)  54.74 (1.9)  57.01 (1.8)  51.18 (2)
CBU    58.21 (2.6)  58.45 (3.3)  58.31 (3.5)  58.39 (3.5)  56.09 (2.6)

Yahoo A(p = 25)
RUS    61.68 (2.4)  62.9 (2.9)   63.62 (3.6)  63.75 (3.1)  63.19 (1.9)
Cl K   61.68 (2.4)  61.14 (2.1)  57.62 (1.6)  54.02 (1.8)  51.48 (1.4)
Cl T   61.68 (2.4)  60.89 (2.8)  58.11 (1.4)  54.4 (2.1)   51.76 (1.4)
Far K  61.68 (2.4)  63.96 (3)    62.62 (2.2)  59.61 (1.5)  56.25 (1.6)
Far T  61.68 (2.4)  63.71 (2.4)  59.72 (1.6)  57.27 (1.1)  54.47 (1.1)
CBU    62.46 (2.6)  61.85 (1.4)  61.78 (2.2)  59.94 (3)    60.1 (4)

Yahoo G(p = 1)
RUS    66.84 (3.7)  67.85 (3.2)  68.36 (3.2)  68.23 (4)    69.9 (4.2)
Cl K   66.84 (3.7)  66.71 (2.8)  64.3 (3.6)   61.98 (3.9)  61.15 (1.9)
Cl T   66.84 (3.7)  65.79 (2.7)  63.55 (3.3)  59.21 (3.5)  61.08 (2.4)
Far K  66.84 (3.7)  66.76 (4.1)  63.84 (3.4)  65.16 (2)    48.5 (2.9)
Far T  66.84 (3.7)  66.95 (4.1)  63.48 (2.9)  65.16 (2)    48.48 (2.9)
CBU    69.68 (4.1)  70.59 (3.2)  70.64 (3.7)  70.2 (2.9)   63.35 (3.6)

Yahoo G(p = 25)
RUS    78.82 (1.4)  78.91 (1.6)  78.97 (1.6)  78.61 (1.6)  77.82 (2.1)
Cl K   78.82 (1.4)  77.26 (1.5)  72.52 (1.5)  67.86 (2)    65.07 (2.7)
Cl T   78.82 (1.4)  76.83 (1)    71.99 (1.8)  67.15 (2.3)  61.1 (2.7)
Far K  78.82 (1.4)  78.26 (2.2)  74.69 (2.7)  67.22 (2.1)  60.72 (2.3)
Far T  78.82 (1.4)  77.68 (2.6)  72.44 (3)    64.94 (2.4)  59.6 (2)
CBU    75.25 (3.2)  75.22 (2.4)  74.69 (2.3)  73.07 (2.4)  70.69 (2.4)

TaFeng(p = 1)
RUS    55.75 (1.6)  56.1 (1.6)   56.26 (1.7)  57.23 (1.7)  59.25 (2.2)
Cl K   55.75 (1.6)  55.68 (1.6)  55.58 (1.5)  55.08 (1.1)  51.05 (1.5)
Cl T   55.75 (1.6)  55.67 (1.6)  54.47 (1.6)  47.53 (1.6)  49.3 (1.1)
Far K  55.75 (1.6)  58.99 (1.2)  59.47 (1.1)  60.04 (1.2)  56.31 (1)
Far T  55.75 (1.6)  58.92 (1.3)  59.25 (1.3)  58.58 (1.1)  56.31 (1)
CBU    57.8 (1)     58.47 (1.1)  58.15 (0.9)  58.87 (1.4)  57.65 (1.6)

TaFeng(p = 25)
RUS    66.94 (1.3)  67.44 (1.3)  68.1 (1.4)   68.27 (1.4)  66.13 (1.2)
Cl K   66.94 (1.3)  66.13 (1.4)  63.39 (1.2)  59.83 (1.3)  56.94 (0.7)
Cl T   66.94 (1.3)  66.38 (1.5)  62.89 (1.6)  57.46 (1.3)  54.56 (1.3)
Far K  66.94 (1.3)  68.06 (1.4)  66.43 (1.6)  64.46 (1.5)  63.35 (1.3)
Far T  66.94 (1.3)  64.31 (1.1)  62.69 (1)    61.27 (1.1)  59.03 (1)
CBU    64.81 (1.2)  64.15 (1.1)  64.13 (1.2)  63.88 (0.8)  63.46 (0.8)

Book(p = 1)
RUS    52.6 (1.3)   52.79 (0.9)  53.46 (0.8)  53.89 (0.9)  54.05 (0.9)
Cl K   52.6 (1.3)   52.56 (1.2)  52.52 (1.3)  52.39 (1.1)  53.09 (1.1)
Cl T   52.6 (1.3)   52.56 (1.2)  52.52 (1.3)  52.39 (1.1)  53.05 (0.7)
Far K  52.6 (1.3)   55.21 (1.2)  56.21 (1.8)  56.14 (1.2)  53.06 (1)
Far T  52.6 (1.3)   55.21 (1.2)  56.21 (1.8)  56.14 (1.2)  53.06 (1)
CBU    54.28 (0.9)  53.77 (1)    53.33 (1.1)  53.34 (0.9)  52.84 (0.8)

Book(p = 25)
RUS    60.08 (0.7)  60.13 (0.6)  60.4 (0.8)   60.33 (0.8)  63.28 (0.8)
Cl K   60.08 (0.7)  59.96 (0.7)  60.13 (0.8)  59.96 (1)    59.28 (0.7)
Cl T   60.08 (0.7)  59.96 (0.7)  60.13 (0.8)  60.29 (0.4)  54.5 (0.9)
Far K  60.08 (0.7)  63.29 (1)    64.19 (0.8)  57.3 (1.1)   55.66 (1.1)
Far T  60.08 (0.7)  62.14 (0.5)  58.27 (0.6)  56.37 (1)    55.66 (1.1)
CBU    54.82 (0.9)  54.67 (0.9)  54.71 (0.9)  54.66 (1)    54.78 (0.9)

LST(p = 1)
RUS    99.99 (0)    99.99 (0)    99.99 (0)    99.98 (0)    99.99 (0)
Cl K   99.99 (0)    99.99 (0)    99.99 (0)    99.99 (0)    99.99 (0)
Cl T   99.99 (0)    99.99 (0)    99.99 (0)    99.99 (0)    99.98 (0)
Far K  99.99 (0)    99.98 (0)    99.98 (0)    99.98 (0)    99.98 (0)
Far T  99.99 (0)    99.98 (0)    99.98 (0)    99.98 (0)    99.98 (0)
CBU    []           []           []           []           []

Adver(p = [])
RUS    96.61 (1.8)  96.32 (1.8)  96.63 (1.4)  97.12 (2.1)  96.22 (1.6)
Cl K   96.61 (1.8)  96.44 (1.5)  96.14 (1.5)  96.04 (2)    94.8 (2.5)
Cl T   96.61 (1.8)  95.87 (2.1)  94.32 (1.9)  93.01 (2.2)  90.72 (2.3)
Far K  96.61 (1.8)  96.53 (1.4)  95.76 (2)    94.39 (1.8)  90.49 (3.1)
Far T  96.61 (1.8)  96.54 (1.5)  95.67 (1.9)  94.54 (1.8)  89.3 (2.8)
CBU    96.85 (2.3)  96.85 (2.3)  97.05 (1.5)  96.6 (1.6)   96.06 (2.1)

Adver(p = 1)
RUS    90.93 (3)    91.53 (3.1)  92.37 (3.4)  91.9 (2.9)   91.93 (2.2)
Cl K   90.93 (3)    90.64 (3)    89.87 (3.9)  90.21 (3.6)  89.18 (2)
Cl T   90.93 (3)    89.7 (3.5)   88.55 (3.4)  85.76 (3.3)  88.2 (2.3)
Far K  90.93 (3)    93.8 (2.3)   92.4 (2.6)   88.73 (3.4)  85.51 (4)
Far T  90.93 (3)    93.62 (2.4)  93.2 (2.2)   88.41 (3.6)  85.51 (4)
CBU    93.22 (2.4)  93.76 (2.5)  93.89 (2.6)  93.52 (2.7)  91.27 (2)

CRF(p = [])
RUS    64.06 (16.4) 63.28 (15.9) 67.98 (17.4) 66.95 (21.9) 87.73 (8.8)
Cl K   64.06 (16.4) 62.44 (16.6) 62.34 (16.9) 71.37 (13.8) 78.22 (17.7)
Cl T   64.06 (16.4) 62.44 (16.6) 62.34 (16.9) 71.37 (13.8) 62.67 (22.9)
Far K  64.06 (16.4) 83.8 (14.2)  83.93 (14.8) 84.49 (13.7) 86.11 (9.7)
Far T  64.06 (16.4) 83.8 (14.2)  83.93 (14.8) 84.49 (13.7) 86.11 (9.7)
CBU    []           []           []           []           []

Bank(p = [])
RUS    66.82 (0.9)  67.02 (0.9)  67.37 (0.8)  67.99 (0.6)  69.5 (1)
Cl K   66.82 (0.9)  66.17 (0.7)  65.24 (0.6)  64.86 (0.6)  58.53 (1.1)
Cl T   66.82 (0.9)  64.92 (1.1)  60.69 (0.9)  56.33 (0.8)  52.87 (0.7)
Far K  66.82 (0.9)  66.95 (0.6)  66.19 (0.6)  64.42 (0.6)  58.25 (1.1)
Far T  66.82 (0.9)  67.16 (0.6)  64.2 (0.8)   59.67 (1)    58.25 (1.1)
CBU    []           []           []           []           []


C Boosting experiments

This section presents the results of each of the boosting experiments from Section 4.5 on the remaining data sources. Each of the figures below depicts the average tenfold AUC-performance on test data (with µ = 100) with respect to the number of boosting iterations for (left) Adaboost (AB), AdaCost (AC) and EasyEnsemble (EE), with C chosen according to the highest validation set AUC-performance (over all possible boosting rounds), and (right) AB and EE (with S = 15) with varying C-levels.

[Plots omitted. In each figure, panel (a) compares AB, AC (R2, R8, RD), EE (S = 5, 10, 15) and the baseline BL; panel (b) compares AB and EE for C in {1e-07, 1e-05, 0.001, 0.1} against BL. Axes: number of boosting rounds T (horizontal) versus test AUC (vertical).]

Fig. 6 Mov G(p = 1) dataset
Fig. 7 Mov Th(p = []) dataset
Fig. 8 Yahoo A(p = 1) dataset
Fig. 9 Yahoo A(p = 25) dataset
Fig. 10 Yahoo G(p = 1) dataset
Fig. 11 Yahoo G(p = 25) dataset
Fig. 12 TaFeng(p = 1) dataset
Fig. 13 Book(p = 1) dataset
Fig. 14 LST(p = 1) dataset
Fig. 15 Adver(p = []) dataset
Fig. 16 Adver(p = 1) dataset
Fig. 17 CRF(p = []) dataset

D Final Comparison

[Plot omitted: scatter of average rank AUC (horizontal) versus average rank Time (vertical) for BL, OSR, SMOTE, ADASYN, RUS, Cl Knn, Far Knn, CBU, AB, AC(R=28), AC(R=RL), EE(S=10), EE(S=15) and EE par.]

Fig. 18 Average rank AUC versus average rank Time (see Table 9) across the large datasets from Table 2 (CRF and Bank). Regarding the former, an extra row EE par is added, having the same AUC-value as EE(S = 15). Ranks for each dataset are subsequently obtained via the procedure outlined in Section 4.6.1. Points occurring in the upper-right region are preferred.


References

Akbani R Kwek S Japkowicz N (2004) Applying support vector machines to imbalanced datasetsIn Machine Learning ECML 2004 15th European Conference on Machine Learning Pisa ItalySeptember 20-24 2004 Proceedings Springer Berlin Heidelberg Berlin Heidelberg pp 39ndash50 DOI101007978-3-540-30115-8 7

Ali A Shamsuddin SM Ralescu AL (2015) Classification with class imbalance problem A review Inter-national Journal of Advances in Soft Computing and its Applications 7(3)176ndash204

Alzahrani T Horadam KJ (2016) Community detection in bipartite networks Algorithms and case studiesIn Complex Systems and Networks Dynamics Controls and Applications Springer Berlin Heidel-berg Berlin Heidelberg pp 25ndash50 DOI 101007978-3-662-47824-0 2

Bachner J (2013) Predictive policing Preventing crime with data and analytics IBM Center for the Busi-ness of Government

Baesens B Van Gestel T Viaene S Stepanova M Suykens J Vanthienen J (2003) Benchmarking state-of-the-art classification algorithms for credit scoring Journal of the Operational Research Society54(6)627ndash635 DOI 101057palgravejors2601545

Barandela R Snchez J Garca V Rangel E (2003) Strategies for learning in class imbalance problemsPattern Recognition 36(3)849 ndash 851 DOI httpsdoiorg101016S0031-3203(02)00257-1

Barber MJ (2007) Modularity and community detection in bipartite networks Physical Review E76066102 DOI 101103PhysRevE76066102

Barua S Islam MM Yao X Murase K (2014) MWMOTEndashmajority weighted minority oversamplingtechnique for imbalanced data set learning IEEE Transactions on Knowledge and Data Engineer-ing 26(2)405ndash425 DOI 101109TKDE2012232

Batista GEAPA Prati RC Monard MC (2004) A study of the behavior of several methods for balancingmachine learning training data SIGKDD Explor Newsl 6(1)20ndash29 DOI 10114510077301007735

Beckett SJ (2016) Improved community detection in weighted bipartite networks Royal Society OpenScience 3(1) DOI 101098rsos140536

Bekkar M Djemaa HK Alitouche TA (2013) Evaluation measures for models assessment over imbalanceddata sets Journal of Information Engineering and Applications 3(10)27ndash38

Bhattacharyya S Jha S Tharakunnel K Westland JC (2011) Data mining for credit card fraud A compara-tive study Decision Support Systems 50(3)602 ndash 613 DOI httpsdoiorg101016jdss201008008

Blondel VD Guillaume JL Lambiotte R Lefebvre E (2008) Fast unfolding of communities in large net-works Journal of Statistical Mechanics Theory and Experiment 2008(10)P10008

Breiman L Friedman J Stone CJ Olshen RA (1984) Classification and regression trees Taylor amp FrancisBrozovsky L Petricek V (2007) Recommender system for online dating service In Proceedings of

Znalosti 2007 Conference VSB OstravaCha M Mislove A Gummadi KP (2009) A measurement-driven analysis of information propagation in

the Flickr social network In Proceedings of the 18th International Conference on World Wide WebACM New York NY USA WWW rsquo09 pp 721ndash730 DOI 10114515267091526806

Chawla NV (2005) Data mining for imbalanced datasets An overview In Data mining and knowledgediscovery handbook Springer US Boston MA pp 853ndash867

Chawla NV Bowyer KW Hall LO Kegelmeyer WP (2002) SMOTE synthetic minority over-samplingtechnique Journal of artificial intelligence research 16321ndash357

Chawla NV Lazarevic A Hall LO Bowyer KW (2003) Smoteboost Improving prediction of the minorityclass in boosting In Knowledge Discovery in Databases PKDD 2003 Springer Berlin HeidelbergBerlin Heidelberg pp 107ndash119

Chawla NV Japkowicz N Kotcz A (2004) Editorial Special issue on learning from imbalanced data setsSIGKDD Explor Newsl 6(1)1ndash6 DOI 10114510077301007733

Chen M Mao S Liu Y (2014) Big data A survey Mobile Networks and Applications 19(2)171ndash209DOI 101007s11036-013-0489-0

Chyi YM (2003) Classification analysis techniques for skewed class distribution problems Master thesisDepartment of Information management National Sun Yat-Sen University

Demsar J (2006) Statistical comparisons of classifiers over multiple data sets Journal of Machine LearningResearch 7(Jan)1ndash30

Dhillon IS (2001) Co-clustering documents and words using bipartite spectral graph partitioning In Pro-ceedings of the Seventh ACM SIGKDD International Conference on Knowledge Discovery and DataMining ACM New York NY USA KDD rsquo01 pp 269ndash274 DOI 101145502512502550

54 Jellis Vanhoeyveld David Martens

Drummond C Holte RC (2003) C45 class imbalance and cost sensitivity why under-sampling beatsover-sampling In Proceedings of the ICML rsquo03 Workshop on Learning from Imbalanced Datasets

Fan RE Chang KW Hsieh CJ Wang XR Lin CJ (2008) LIBLINEAR A library for large linear classifi-cation Journal of Machine Learning Research 91871ndash1874

Fan W Stolfo SJ Zhang J Chan PK (1999) AdaCost Misclassification cost-sensitive boosting In Pro-ceedings of the Sixteenth International Conference on Machine Learning Morgan Kaufmann Pub-lishers Inc San Francisco CA USA ICML rsquo99 pp 97ndash105

Fawcett T (2006) An introduction to ROC analysis Pattern Recognition Letters 27(8)861 ndash 874 DOIhttpsdoiorg101016jpatrec200510010

Finch H (2005) Comparison of distance measures in cluster analysis with dichotomous data Journal ofData Science 3(1)85ndash100

Fortunato S (2010) Community detection in graphs Physics Reports 486(3 5)75 ndash 174 DOI httpsdoiorg101016jphysrep200911002

Junque de Fortuny E Martens D Provost F (2014a) Predictive modeling with big data is bigger reallybetter Big Data 1(4)215ndash226 DOI 101089big20130037

Junque de Fortuny E Stankova M Moeyersoms J Minnaert B Provost F Martens D (2014b) Corporateresidence fraud detection In Proceedings of the 20th ACM SIGKDD International Conference onKnowledge Discovery and Data Mining ACM New York NY USA KDD rsquo14 pp 1650ndash1659 DOI10114526233302623333

Frasca M Bertoni A Re M Valentini G (2013) A neural network algorithm for semi-supervised node labellearning from unbalanced data Neural Networks 4384 ndash 98 DOI httpsdoiorg101016jneunet201301021

Friedman M (1937) The use of ranks to avoid the assumption of normality implicit in the analysis ofvariance Journal of the american statistical association 32(200)675ndash701

Garcıa E Lozano F (2007) Boosting support vector machines In Machine Learning and Data Mining inPattern Recognition 5th International Conference MLDM 2007 Leipzig Germany July 18-20 PostProceedings IBaI publishing pp 153ndash167

Goldstein M Uchida S (2016) A comparative evaluation of unsupervised anomaly detection algorithmsfor multivariate data PLOS ONE 11(4)1ndash31 DOI 101371journalpone0152173

Gonzlez PC Velsquez JD (2013) Characterization and detection of taxpayers with false invoices usingdata mining techniques Expert Systems with Applications 40(5)1427 ndash 1436 DOI httpsdoiorg101016jeswa201208051

Guimera R Sales-Pardo M Amaral LAN (2007) Module identification in bipartite and directed networksPhysical Review E 76036102 DOI 101103PhysRevE76036102

Guo H Viktor HL (2004) Learning from imbalanced data sets with boosting and data generation TheDataBoost-IM approach SIGKDD Explor Newsl 6(1)30ndash39 DOI 10114510077301007736

Guo X Yin Y Dong C Yang G Zhou G (2008) On the class imbalance problem In 2008 Fourth Interna-tional Conference on Natural Computation IEEE vol 4 pp 192ndash201 DOI 101109ICNC2008871

Han H Wang WY Mao BH (2005) Borderline-SMOTE A new over-sampling method in imbalanced datasets learning In Advances in Intelligent Computing Springer Berlin Heidelberg Berlin Heidelbergpp 878ndash887

He H Garcia EA (2009) Learning from imbalanced data IEEE Transactions on Knowledge and DataEngineering 21(9)1263ndash1284 DOI 101109TKDE2008239

He H Bai Y Garcia EA Li S (2008) ADASYN Adaptive synthetic sampling approach for imbalancedlearning In 2008 IEEE International Joint Conference on Neural Networks (IEEE World Congresson Computational Intelligence) IEEE pp 1322ndash1328 DOI 101109IJCNN20084633969

Holm S (1979) A simple sequentially rejective multiple test procedure Scandinavian journal of statistics6(2)65ndash70

Hsu CW Lin CJ (2002) A comparison of methods for multiclass support vector machines IEEE Transac-tions on Neural Networks 13(2)415ndash425 DOI 10110972991427

Huang A (2008) Similarity measures for text document clustering In Proceedings of the sixth new zealandcomputer science research student conference (NZCSRSC2008) Christchurch New Zealand pp 49ndash56

Iman RL Davenport JM (1980) Approximations of the critical region of the Friedman statistic Commu-nications in Statistics-Theory and Methods 9(6)571ndash595

Jo T Japkowicz N (2004) Class imbalances versus small disjuncts ACM SIGKDD Explor Newsl 6(1)40ndash49 DOI 10114510077301007737

Imbalanced classification in sparse and large behaviour datasets 55

Jutla IS Jeub LG Mucha PJ (2011-2016) A generalized louvain method for community detection imple-mented in MATLAB URL httpnetwikiamathunceduGenLouvain

Kubat M Matwin S (1997) Addressing the curse of imbalanced training sets One-sided selection InProceedings of the Fourteenth International Conference on Machine Learning Morgan KaufmannPublishers Inc San Francisco CA USA pp 179ndash186

Lancichinetti A Fortunato S (2009) Community detection algorithms A comparative analysis PhysicalReview E 80056117 DOI 101103PhysRevE80056117

Larremore DB Clauset A Jacobs AZ (2014) Efficiently inferring community structure in bipartitenetworks Physical Review E Statistical Nonlinear and Soft Matter Physics 90012805 DOI101103PhysRevE90012805

Li J Fine JP (2010) Weighted area under the receiver operating characteristic curve and its application togene selection Journal of the Royal Statistical Society Series C (Applied Statistics) 59(4)673ndash692DOI 101111j1467-9876201000713x

Li X Wang L Sung E (2008) AdaBoost with SVM-based component classifiers Engineering Applicationsof Artificial Intelligence 21(5)785ndash795 DOI httpsdoiorg101016jengappai200707001

Lichman M (2013) UCI machine learning repository URL httparchiveicsuciedumlLiu W Chawla S Cieslak DA Chawla NV (2010) A robust decision tree algorithm for imbalanced data

sets In Proceedings of the tenth SIAM international conference on data mining SIAM Philadelphiavol 10 pp 766ndash777

Liu XY Wu J Zhou ZH (2009) Exploratory undersampling for class-imbalance learning IEEE Transac-tions on Systems Man and Cybernetics Part B (Cybernetics) 39(2)539ndash550 DOI 101109TSMCB20082007853

Luts J Ojeda F Van de Plas R De Moor B Van Huffel S Suykens JA (2010) A tutorial on supportvector machine-based methods for classification problems in chemometrics Analytica Chimica Acta665(2)129ndash145 DOI httpsdoiorg101016jaca201003030

Macskassy SA Provost F (2007) Classification in networked data A toolkit and a univariate case studyJournal of Machine Learning Research 8(May)935ndash983

Martens D Provost F (2014) Explaining data-driven document classifications MIS Quarterly 38(1)73ndash100

Martens D Provost F Clark J Junque de Fortuny E (2016) Mining massive fine-grained behavior data toimprove predictive analytics MIS Quarterly 40(4)869ndash888

Mazurowski MA Habas PA Zurada JM Lo JY Baker JA Tourassi GD (2008) Training neural networkclassifiers for medical decision making The effects of imbalanced datasets on classification perfor-mance Neural Networks 21(23)427 ndash 436 DOI httpsdoiorg101016jneunet200712031

Mease D Wyner AJ Buja A (2007) Boosted classification trees and class probabilityquantile estimationJournal of Machine Learning Research 8409ndash439

Nemenyi P (1963) Distribution-free Multiple Comparisons Dissertation Princeton UniversityNewman MEJ Girvan M (2004) Finding and evaluating community structure in networks Physical Re-

view E 69026113 DOI 101103PhysRevE69026113Ng AY (2004) Feature selection L1 vs L2 regularization and rotational invariance In Proceedings of the

Twenty-first International Conference on Machine Learning ACM New York NY USA ICML rsquo04pp 78ndash DOI 10114510153301015435

Ng AY Jordan MI (2002) On discriminative vs generative classifiers A comparison of logistic regressionand naive bayes In Advances in Neural Information Processing Systems 14 MIT Press pp 841ndash848

Ngai E Hu Y Wong Y Chen Y Sun X (2011) The application of data mining techniques in financial frauddetection A classification framework and an academic review of literature Decision Support Systems50(3)559 ndash 569 DOI httpsdoiorg101016jdss201008006

Platt JC (1999) Probabilistic outputs for support vector machines and comparisons to regularized likelihood methods. In: Advances in Large-Margin Classifiers, MIT Press, pp 61–74

Porter MA, Onnela JP, Mucha PJ (2009) Communities in networks. Notices of the American Mathematical Society 56(9):1082–1097

Provost F, Fawcett T (2013) Data Science for Business: What you need to know about data mining and data-analytic thinking. O'Reilly Media, Inc

Provost F, Dalessandro B, Hook R, Zhang X, Murray A (2009) Audience selection for on-line brand advertising: Privacy-friendly social network targeting. In: Proceedings of the 15th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, ACM, New York, NY, USA, KDD '09, pp 707–716, DOI 10.1145/1557019.1557098

56 Jellis Vanhoeyveld David Martens

Raskutti B, Kowalczyk A (2004) Extreme re-balancing for SVMs: A case study. SIGKDD Explorations Newsletter 6(1):60–69, DOI 10.1145/1007730.1007739

Rosvall M, Bergstrom CT (2008) Maps of random walks on complex networks reveal community structure. Proceedings of the National Academy of Sciences 105(4):1118–1123, DOI 10.1073/pnas.0706851105

Schapire RE (1999) A brief introduction to boosting. In: Proceedings of the 16th International Joint Conference on Artificial Intelligence - Volume 2, Morgan Kaufmann Publishers Inc, San Francisco, CA, USA, IJCAI'99, pp 1401–1406

Schapire RE, Singer Y (1999) Improved boosting algorithms using confidence-rated predictions. Machine Learning 37(3):297–336, DOI 10.1023/A:1007614523901

Shmueli G (2017) Analyzing behavioral big data: Methodological, practical, ethical, and moral issues. Quality Engineering 29(1):57–74, DOI 10.1080/08982112.2016.1210979

Sobhani P, Viktor H, Matwin S (2015) Learning from imbalanced data using ensemble methods and cluster-based undersampling. In: New Frontiers in Mining Complex Patterns: Third International Workshop, NFMCP 2014, Held in Conjunction with ECML-PKDD 2014, Nancy, France, September 19, 2014, Revised Selected Papers, Springer International Publishing, Cham, pp 69–83, DOI 10.1007/978-3-319-17876-9_5

Stankova M (2016) Classification within network data with a bipartite structure. Dissertation, University of Antwerp

Stankova M, Martens D, Provost F (2015) Classification over bipartite graphs through projection. Working Papers 2015001, University of Antwerp, Faculty of Applied Economics

Sun Y, Kamel MS, Wong AK, Wang Y (2007) Cost-sensitive boosting for classification of imbalanced data. Pattern Recognition 40(12):3358–3378, DOI 10.1016/j.patcog.2007.04.009

Suykens JA, Van Gestel T, De Brabanter J, De Moor B, Vandewalle J (2002) Least squares support vector machines. World Scientific, Singapore

Tang B, He H (2015) ENN: Extended nearest neighbor method for pattern recognition [research frontier]. IEEE Computational Intelligence Magazine 10(3):52–60, DOI 10.1109/MCI.2015.2437512

Tang Y, Zhang YQ, Chawla NV, Krasser S (2009) SVMs modeling for highly imbalanced classification. IEEE Transactions on Systems, Man, and Cybernetics, Part B (Cybernetics) 39(1):281–288, DOI 10.1109/TSMCB.2008.2002909

Tobback E, Moeyersoms J, Stankova M, Martens D (2016) Bankruptcy prediction for SMEs using relational data. Working Paper 2016004, University of Antwerp, Faculty of Applied Economics

Verbeke W, Dejaeger K, Martens D, Hur J, Baesens B (2012) New insights into churn prediction in the telecommunication sector: A profit driven data mining approach. European Journal of Operational Research 218(1):211–229, DOI 10.1016/j.ejor.2011.09.031

Veropoulos K, Campbell I, Cristianini N (1999) Controlling the sensitivity of support vector machines. In: Proceedings of the International Joint Conference on Artificial Intelligence, Stockholm, Sweden (IJCAI99), pp 55–60

Whitrow C, Hand DJ, Juszczak P, Weston D, Adams NM (2009) Transaction aggregation as a strategy for credit card fraud detection. Data Mining and Knowledge Discovery 18(1):30–55, DOI 10.1007/s10618-008-0116-z

Wickramaratna J, Holden SB, Buxton BF (2001) Performance degradation in boosting. In: Proceedings of the Second International Workshop on Multiple Classifier Systems, Springer, London, UK, MCS '01, pp 11–21

Yen SJ, Lee YS (2009) Cluster-based under-sampling approaches for imbalanced data distributions. Expert Systems with Applications 36(3, Part 1):5718–5727, DOI 10.1016/j.eswa.2008.06.108

Yu HF, Lo HY, Hsieh HP, Lou JK, McKenzie TG, Chou JW, Chung PH, Ho CH, Chang CF, Wei YH, et al (2010) Feature engineering and classifier ensemble for KDD Cup 2010. In: Proceedings of the KDD Cup 2010 Workshop, pp 1–16

Zha H, He X, Ding C, Simon H, Gu M (2001) Bipartite graph partitioning and data clustering. In: Proceedings of the Tenth International Conference on Information and Knowledge Management, ACM, New York, NY, USA, CIKM '01, pp 25–32, DOI 10.1145/502585.502591

Zhang J, Mani I (2003) KNN approach to unbalanced data distributions: A case study involving information extraction. In: Proceedings of the ICML'2003 Workshop on Learning from Imbalanced Datasets, Washington DC

Ziegler CN, McNee SM, Konstan JA, Lausen G (2005) Improving recommendation lists through topic diversification. In: Proceedings of the 14th International Conference on World Wide Web, ACM, New York, NY, USA, WWW '05, pp 22–32, DOI 10.1145/1060745.1060754



In this article we adopt the methodology proposed in Alzahrani and Horadam (2016), which consists of projecting the bigraph to a unigraph of bottom nodes and applying^10 the Louvain algorithm (Blondel et al 2008) on the projection. Lancichinetti and Fortunato (2009) performed a comparative study regarding the performance of 12 community detection algorithms and concluded the Louvain method to be the best modularity based algorithm and second best among all algorithms. The heuristic is very fast, with O(m) complexity, where m is the number of edges in the unigraph. We chose the Louvain algorithm because of its speed and the availability of a toolbox (Jutla et al 2011-2016) that is directly compatible with our implementations. The toolbox provides a generalized implementation of the Louvain algorithm, in the sense that multiple definitions of modularity are possible. The quality function we chose is the popular Newman-Girvan modularity (Newman and Girvan 2004).

With respect to the projection, Alzahrani and Horadam (2016) connect two bottom nodes if they have at least one top node in common. The connection weight between two bottom nodes in the projection is set to the number of shared top nodes. In our implementation we adapted the connection weights in accordance with Stankova et al (2015) as follows: first of all, we assign weights to the top nodes corresponding to the hyperbolic tangent applied to the inverse degree of the top node. Next, the connection weight between two bottom nodes in the projection corresponds with the total weight of the shared top nodes. Top nodes having low degrees therefore obtain a higher contribution in the projection (e.g. two users making a transaction to a local book store are assumed to be more closely connected to each other than two users making a transaction to a large retail store).
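The tanh-weighted projection described above can be sketched as follows. The function name `project_bigraph` and the use of SciPy sparse matrices are our own illustrative choices, not the authors' toolbox code:

```python
import numpy as np
from scipy.sparse import csr_matrix

def project_bigraph(B):
    """Project a sparse bottom-by-top biadjacency matrix B (0/1 entries) to a
    weighted bottom-node unigraph, following the tanh weighting of
    Stankova et al. (2015). Illustrative sketch only."""
    B = csr_matrix(B, dtype=float)
    top_degree = np.asarray(B.sum(axis=0)).ravel()   # degree of each top node
    w = np.zeros_like(top_degree)
    nz = top_degree > 0
    w[nz] = np.tanh(1.0 / top_degree[nz])            # low-degree top nodes weigh more
    # weight of edge (i, j) in the projection = total weight of shared top nodes
    W = (B.multiply(w)) @ B.T
    W.setdiag(0)                                     # drop self-loops
    W.eliminate_zeros()
    return W
```

For three bottom nodes each sharing exactly one top node of degree 2, every projection weight equals tanh(1/2), matching the book-store intuition that rarer top nodes bind bottom nodes more tightly.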

In our CBU-algorithm (see Algorithm 2), after clustering the bigraph containing exclusively majority class instances, we randomly select an equal amount of majority instances from each community, to target the within-class imbalance problem. In the rare situation that the number of obtained clusters exceeds the required amount of majority instances ($Nr_{retain}$), we sort the communities according to a user-specified parameter $Clust_{opt}$ and randomly select 1 instance from the first $Nr_{retain}$ clusters. The parameter $Clust_{opt}$ can take the following values:

– C_Smallest, where we sort clusters in ascending order of their size
– C_Largest, where we sort clusters in descending order of their size

Note that we randomly select majority class instances from each cluster. Yen and Lee (2009) found a random selection strategy after clustering to be superior to informed approaches based on distance.

10 Note that they also made use of the flow-based algorithm Infomap (Rosvall and Bergstrom 2008) thatshows excellent results on the LFR-benchmark

Imbalanced classification in sparse and large behaviour datasets 15

Algorithm 2 CBU pseudo-code implementation for behaviour data
Input: $X_{min}$, $X_{maj}$, $\beta_u$, $Clust_{opt}$
a) Cluster the majority class instances $X_{maj}$:
– Assign weights to each top node corresponding with the hyperbolic tangent applied to the inverse of the node's degree.
– Project the bigraph $X_{maj}$ to a weighted unigraph consisting of bottom node majority class instances. The weight $w_{ij}$ between majority class instances $i$ and $j$ corresponds with the total weight of the shared top nodes.
– Apply the Louvain algorithm (Blondel et al 2008) on the projected unigraph to partition the majority class instances into clusters.
b) Select majority class instances:
$Nr_{rem} \leftarrow \lfloor (|X_{maj}| - |X_{min}|) \times \beta_u \rfloor$ (see Equation (2))
$Nr_{retain} \leftarrow |X_{maj}| - Nr_{rem}$
if $Nr_{retain} < |Clust|$ then
– Sort clusters according to $Clust_{opt}$
– Randomly select 1 instance from the first $Nr_{retain}$ clusters
else
– Randomly select $\lceil Nr_{retain}/|Clust| \rceil$ majority class instances from each cluster
– Randomly discard instances from the previous step until its size corresponds with $Nr_{retain}$
end if
c) Return the new training set consisting of $X_{min}$ and the selected majority class instances from step b.
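Step (b) of Algorithm 2 can be sketched as follows. The helper `cbu_select` and its list-of-clusters input format are hypothetical; capping the per-cluster sample at the cluster size is our own guard for clusters smaller than the ceiling quota:

```python
import math
import random

def cbu_select(clusters, n_min, beta_u, clust_opt="C_Smallest", rng=random):
    """Pick majority instances evenly across clusters, mirroring step (b)
    of Algorithm 2. `clusters` is a list of lists of majority-instance ids.
    Hypothetical helper, not the authors' code."""
    n_maj = sum(len(c) for c in clusters)
    nr_rem = int((n_maj - n_min) * beta_u)
    nr_retain = n_maj - nr_rem
    if nr_retain < len(clusters):
        # rare case: fewer instances needed than clusters available
        ordered = sorted(clusters, key=len, reverse=(clust_opt == "C_Largest"))
        return [rng.choice(c) for c in ordered[:nr_retain]]
    per_cluster = math.ceil(nr_retain / len(clusters))
    picked = [x for c in clusters for x in rng.sample(c, min(per_cluster, len(c)))]
    rng.shuffle(picked)
    return picked[:nr_retain]   # randomly discard the surplus
```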

3.3 Boosting, cost-sensitive learning and EasyEnsemble

3.3.1 AdaBoost

The AdaBoost (Schapire and Singer 1999; Schapire 1999) algorithm has been designed from the perspective of improving the performance of a weak learner, so that it achieves accuracies that are comparable with a strong learning algorithm. Fundamental to the idea of boosting is to maintain a weight distribution over the training set. In each boosting iteration, the weights of wrongly classified instances are increased, so that the underlying weak learner puts more emphasis on these hard examples. In our implementation we will consider using a SVM as base learner. SVMs are generally regarded as strong learners. The studies of Wickramaratna et al (2001) and García and Lozano (2007) note that using a strong learner usually results in performance degradation during the boosting process. The RBF-kernel SVM classifier (Wickramaratna et al 2001; Li et al 2008) is used as the underlying classifier to prove their point. In our study we will employ a linear SVM, which can be considered a weaker version compared to the RBF-kernel. Furthermore, the regularization parameter C can be viewed as a "weakness" indicator.^11 Lowering the C-value results in weaker learners, as can be seen from the goal function of the SVM optimization problem (see equation (1)). We will come back to this point in Section 4.5.

11 The distinction between weak/strong learners is loosely 'defined' in Schapire (1999). A weak learner corresponds with a hypothesis that performs just slightly better than random guessing. A strong learner is able to generate a hypothesis with an arbitrarily low error rate, given enough data. We adopt these definitions, but consider the distinction between weak/strong based on training set error. In a SVM context it is quite typical that error levels on training data drop with increasing C-values (Suykens et al 2002). A learner that is 'too strong' means that even though its performance on training data is very high, it fails to generalize well and the test set error increases due to overfitting.


The boosting algorithm with underlying SVM is presented in Algorithm 3. This algorithm closely follows the original boosting implementation presented in Schapire and Singer (1999) and requires each learner to output confidence rated predictions in the interval [−1,1]. Since the SVM outputs real-valued scores, we apply the procedure of Platt (1999) to transform these scores into probability estimates (which can easily be translated to form confidence rated predictions). A logistic regression (LR) model (Ng and Jordan 2002) is trained using as input the SVM-scores and the corresponding labels as output. Note that the same data that are used to construct the linear SVM can be used to estimate the LR-model; see Platt (1999) for a motivation.
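The essence of the Platt (1999) step, fitting a sigmoid σ(A·s + B) to the SVM scores, can be sketched with plain gradient descent on the log-loss. The function name, learning rate and iteration count are our own choices; the paper itself fits the LR with standard tooling:

```python
import numpy as np

def platt_scale(scores, labels, iters=2000, lr=0.1):
    """Fit sigma(A*s + B) ~ P(y = +1 | score s), the essence of Platt (1999).
    Plain gradient descent on the log-loss; in the paper's setup the LR is fit
    on the same data used to train the SVM. Illustrative sketch only."""
    s = np.asarray(scores, dtype=float)
    t = (np.asarray(labels) > 0).astype(float)      # targets in {0, 1}
    A, B = 1.0, 0.0
    for _ in range(iters):
        p = 1.0 / (1.0 + np.exp(-(A * s + B)))      # current probability estimates
        gA, gB = np.mean((p - t) * s), np.mean(p - t)  # log-loss gradients
        A, B = A - lr * gA, B - lr * gB
    return A, B

# a confidence-rated prediction in [-1, 1] is then 2*sigma(A*s + B) - 1
```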

The boosting algorithm requires the weak learner to be trained using a distribution D_t. One could sample from this distribution to generate (unweighted) examples and train a SVM next. We have chosen to include weights in the SVM goal function formulation (this requires an extension of the LIBLINEAR package):

$$\min_{w,b,\xi_i}\; \frac{w^T w}{2} + C \sum_{i=1}^{m} \mathrm{weight}_i\, \xi_i \qquad (3)$$

The weights weight_i are set according to the weight distribution D_t(i) in each round of boosting. Note that the C-value that is used to train the SVM model will be divided by mean(weight_i). This normalization allows for a fair comparison between the weighted and unweighted SVM versions (e.g. in the first round of boosting, weight_i = 1/m for all instances; by multiplying the C-value with m, this corresponds to solving the unweighted problem with the same C-value).

We introduce an additional parameter μ, called the weight percentage, with values in [0,100], in the boosting algorithm of Algorithm 3. This parameter controls the amount of training data that is used to construct the SVM model and subsequent LR model. We sort the original training data in descending order according to the distribution D_t. Next, we form a new training set of minimal cardinality by including points from the original sorted training data until the total weight is higher than the weight-percentage μ/100. This way, the newly formed training data will contain only the part of the original training data that has the most weight. This partial set will then be used to construct a weighted SVM-model (according to equation (3), with updated distribution for this set) and a subsequent LR-model. The idea of using a partial dataset to construct the base learner not only reduces training times, but also weakens the learner (García and Lozano 2007).
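The minimal top-weight subset selection can be sketched as follows; `top_weight_subset` is a hypothetical helper mirroring the description above:

```python
import numpy as np

def top_weight_subset(D, mu):
    """Return indices of the minimal set of training points whose cumulative
    weight under distribution D strictly exceeds mu/100, as used to weaken
    and speed up the base learner. Hypothetical helper matching the text."""
    D = np.asarray(D, dtype=float)
    order = np.argsort(D)[::-1]                     # sort by weight, descending
    csum = np.cumsum(D[order])
    # first position where the cumulative weight exceeds mu/100 (strict '>')
    k = int(np.searchsorted(csum, mu / 100.0, side="right")) + 1
    return order[:k]
```

For example, with weights [0.4, 0.3, 0.2, 0.1] and μ = 75, the three heaviest points (total weight 0.9) form the minimal set exceeding 0.75.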

In Algorithm 3 we have an explicit check to verify if r_AB = 1. In this case, the SVM model outputs scores that allow for a perfect classification on the training set. The subsequent LR model will find a threshold and output a value of −1 if the SVM score is lower than this threshold (a value of +1 if the score is higher than this threshold). In this situation the training data would be perfectly classified. In our implementation we attempt to avoid this kind of behaviour, because it can lead to overfitting, meaning it might pinpoint the wrong threshold and make too drastic decisions (we lose the meaning of confidence). The check if r_AB ≤ 0 verifies whether the currently boosted model is performing worse than random (a random model would have a r_AB-value of 0). Obviously, if the model performs worse than random, we quit the boosting process. During the first round of boosting we perform similar checks that are not explicitly indicated in Algorithm 3. In the case where r_AB = 1, we output the SVM scores instead of the LR binary values. When r_AB ≤ 0, we quit the boosting process and output the LR scores.

Algorithm 3 AdaBoost with a SVM-LR combination as a base learner
Input: $(X,Y) = (x_1,y_1),\dots,(x_m,y_m)$, $C$, $T$, $\mu$
Initialize distribution $D_1(i) = 1/m$
for $t = 1$ to $T$ do
– train weak learner using distribution $D_t$. The weak learner consists of a weighted linear SVM and LR model, trained with weight-percentage $\mu$ of $D_t$:
  $h_t \leftarrow \mathrm{Train\_WeakLearner}(X, Y, D_t, C, \mu)$
– compute the weighted confidence $r_{AB}$ on the training data:
  $r_{AB} \leftarrow \sum_{i=1}^{m} D_t(i)\, y_i\, h_t(x_i)$
  If ($r_{AB} = 1$ or $r_{AB} \le 0$) then $\alpha_t \leftarrow 0$ and stop the boosting process
– choose $\alpha_t \in \mathbb{R}$:
  $\alpha_t \leftarrow \frac{1}{2}\log\left(\frac{1 + r_{AB}}{1 - r_{AB}}\right)$
– update distribution:
  $D_{t+1}(i) \leftarrow \frac{D_t(i)\exp(-\alpha_t\, y_i\, h_t(x_i))}{Z_t}$
  where $Z_t$ is a normalization factor (chosen so that $D_{t+1}$ will be a distribution)
end for
Output the final hypothesis (the output score is the term contained in the sign function):
$$H(x) = \mathrm{sign}\left(\sum_{t=1}^{T} \tilde{\alpha}_t\, h_t(x)\right) \quad \text{with } \tilde{\alpha}_t = \frac{\alpha_t}{\sum_{i=1}^{T}\alpha_i}$$
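The confidence-rated loop of Algorithm 3 can be sketched compactly. The `train_weak` argument stands in for the weighted SVM + LR base learner, which is abstracted away here; this is an illustrative sketch, not the authors' implementation:

```python
import numpy as np

def adaboost_confidence(X, y, train_weak, T=30):
    """Confidence-rated AdaBoost (Schapire & Singer 1999) as in Algorithm 3.
    `train_weak(X, y, D)` must return a hypothesis h with h(X) in [-1, 1].
    Sketch only."""
    m = len(y)
    D = np.full(m, 1.0 / m)
    models, alphas = [], []
    for _ in range(T):
        h = train_weak(X, y, D)
        r = float(np.sum(D * y * h(X)))     # weighted confidence r_AB
        if r >= 1.0 or r <= 0.0:            # perfect or worse-than-random: stop
            break
        a = 0.5 * np.log((1 + r) / (1 - r))
        D = D * np.exp(-a * y * h(X))
        D /= D.sum()                        # normalization factor Z_t
        models.append(h)
        alphas.append(a)
    alphas = np.array(alphas) / np.sum(alphas)   # normalized alpha-tilde_t
    return lambda Xn: sum(a * h(Xn) for a, h in zip(alphas, models))
```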

3.3.2 AdaCost

The AdaCost algorithm (Fan et al 1999) is a variant of cost-sensitive learning where misclassification costs are introduced in the weight-update formula of AdaBoost. The cost-sensitive update rule increases the weights of costly misclassified instances more aggressively and decreases the weights of costly correct classifications more conservatively. Each instance is given a misclassification cost $c_i$, where we chose to set $c_i = 1$ for positive (minority) instances and $c_i = 1/R$ for negative (majority) instances. R is a user-defined value that allows one to put more emphasis on the minority class. The implementation of AdaCost is similar to Algorithm 3, yet there are a few differences (Fan et al 1999): firstly, the initial distribution is chosen as $D_1(i) = c_i / \sum_{j=1}^{m} c_j$; secondly, the weight update rule is given by $D_{t+1}(i) = \frac{D_t(i)\exp(-\alpha_t\, y_i\, h_t(x_i)\,\beta(i))}{Z_t}$, where $\beta(i) = -0.5\,\mathrm{sign}(y_i h_t(x_i))\, c_i + 0.5$ is a cost-adjustment function; finally, the choice of $\alpha_t$ is given by $\alpha_t = \frac{1}{2}\log\left(\frac{1+r_{AC}}{1-r_{AC}}\right)$, where $r_{AC} = \sum_{i=1}^{m} D_t(i)\, y_i\, h_t(x_i)\,\beta(i)$. Note that the checks to stop the boosting process prematurely (see the second bullet in Algorithm 3) are still based on the r-value obtained from AdaBoost ($r_{AB}$). This is because $\beta \in [0,1]$.
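The AdaCost weight update can be sketched in isolation; `adacost_update` is our own illustrative helper implementing the formulas above:

```python
import numpy as np

def adacost_update(D, y, h_x, c, alpha):
    """One AdaCost (Fan et al. 1999) weight update. beta(i) lies in [0, 1]:
    it is 0.5 + 0.5*c_i for misclassified instances (aggressive increase)
    and 0.5 - 0.5*c_i for correct ones (conservative decrease). Sketch only."""
    beta = -0.5 * np.sign(y * h_x) * c + 0.5        # cost-adjustment function
    D_new = D * np.exp(-alpha * y * h_x * beta)
    return D_new / D_new.sum()                       # renormalize (Z_t)
```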

In the first boosting round of AdaCost the weighted SVM formulation (3) isequivalent to solving a SVM formulation with the following goal function

$$\min_{w,b,\xi_i}\; \frac{w^T w}{2} + C_{+} \sum_{i|y_i=1} \xi_i + C_{-} \sum_{i|y_i=-1} \xi_i \qquad (4)$$

where $C_{+}/C_{-} = R$. This can be seen as a cost-sensitive version of a SVM, and this idea has initially been proposed by Veropoulos et al (1999).

3.3.3 EasyEnsemble

One of the disadvantages of the random undersampling method is the fact that we are discarding potentially valuable information. EasyEnsemble (Liu et al 2009) is a method that combines several balanced subsets (S in total), containing randomly selected majority class instances together with all minority examples. Each subset contains the same amount of instances from both classes and is fed to the boosting algorithm presented in Algorithm 3. Afterwards, the weak learners $h_{st}$ of each subset $s$ are simply combined to form the final ensemble:

$$H(x) = \mathrm{sign}\left(\sum_{s=1}^{S}\sum_{t=1}^{T} \alpha_{st}\, h_{st}(x)\right) \quad \text{with } s = 1,\dots,S,\; t = 1,\dots,T \qquad (5)$$

It is clear that this technique benefits from a combination of bagging and boosting (Liu et al 2009). Note that we apply the same boosting algorithm as previously described to each of the balanced subsets. However, when r_AB = 1 in the first round of boosting, we quit the boosting process, put α_1 = 1 and continue to use the trained LR-model in the final ensemble. It was previously noted that this can cause overfitting, in the sense that it can pinpoint a wrong threshold. However, the LR-model still contains information, and the combination thereof with the models obtained from the other subsets remains valuable (as we have noted from initial experiments, by comparing the situation where we include or reject those subsets).

There are a few subtle though important differences with respect to the experiments performed by Liu et al (2009). First of all, the authors use a CART-algorithm (Breiman et al 1984) as base learner. We employ a linear SVM with subsequent LR as a weak learner; to our knowledge, this combination has not been proposed elsewhere, yet proves to be very efficient in this setting. Secondly, their methodology employs a discrete version of AdaBoost, where each weak learner outputs binary values in {−1,1}. As already stated, we make use of an improved version of AdaBoost that relies on confidence rated predictions (Schapire and Singer 1999). Finally, the authors reported performances with fixed levels of S = 4 and T = 10. We will investigate the effect of varying S and T levels.
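The subset-and-combine scheme of EasyEnsemble can be sketched as follows; `easy_ensemble` and its `boost` argument are hypothetical stand-ins (in the paper, the inner booster is Algorithm 3):

```python
import numpy as np

def easy_ensemble(X_maj, X_min, boost, S=10, rng=np.random.default_rng(0)):
    """EasyEnsemble (Liu et al. 2009): draw S balanced random subsets of the
    majority class, boost each one, and sum the resulting hypotheses.
    `boost(X, y)` must return a scoring function. Illustrative sketch only."""
    n_min = len(X_min)
    members = []
    for _ in range(S):
        idx = rng.choice(len(X_maj), size=n_min, replace=False)  # balanced subset
        Xs = np.vstack([X_maj[idx], X_min])
        ys = np.concatenate([-np.ones(n_min), np.ones(n_min)])
        members.append(boost(Xs, ys))
    # each subset is only twice the minority class size, so training is cheap
    # and the S subsets can be processed in parallel
    return lambda Xn: sum(h(Xn) for h in members) / S
```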


4 Results and discussion

4.1 Datasets

Stankova et al (2015) provide the first large collection of benchmark behaviour datasets for classification. In our experiments we make use of these data sources and extend this repository with two additional datasets. Each of these datasets shows a bipartite structure with a clear target variable to predict. We refer to this study and the next paragraph for a short description of the available data resources. In this paragraph we indicate why we have chosen to include or reject certain data sources from the aforementioned study. The available datasets can be divided into small, medium and large datasets, based on the number of instances and the number of features present. The Norwegian companies and Reality Mining datasets comprise the small datasets, since they contain only a few hundreds of instances or features and are therefore regarded as impractical for our purposes. The MovieLens, Yahoo, TaFeng, Book-Crossing and LibimSeTi datasets belong to the medium sized datasets, each containing a few thousands up to a few hundreds of thousands of instances and features. All of these datasets are included in our study. The large datasets, containing hundreds of thousands up to millions of instances and features, are the Flickr and Kdd databases.^12 Other large proprietary data sources, not included in Stankova et al (2015), are the corporate residence fraud (CRF) and banking (Bank) datasets, which arise from real-life application domains with intrinsic imbalance.

To summarize, we have gathered datasets containing such fine-grained behaviour data from a wide variety of application domains. The MovieLens datasets, for which we are predicting the gender^13 (Mov G) or the genre thriller^14 (Mov Th), provide data on which films each user has rated. The Yahoo movies^15 dataset has a similar structure, where the age of each user, Yahoo A (above or below average), or the gender, Yahoo G, is being predicted. The TaFeng dataset^16 contains data on shopping behaviour, where age (below or above average) is being predicted based upon which products are purchased. In the book-crossing (Book) dataset (Ziegler et al 2005), users rate books and the age of the user (above or below average) is being predicted. LibimSeTi (LST) contains data from a dating site (Brozovsky and Petricek 2007), where users rate each other's profile and gender is being predicted. In the advertisement (Adver) (Lichman 2013) dataset, we try to predict if a url is an advertisement, based on a large variety of binary features of the url. Note that this dataset does not arise from the behaviour of entities, yet it still has a high dimensional and sparse representation. The Flickr dataset (Cha et al 2009) contains pictures being marked by users as favorite, and we predict the number of comments on each picture (below

12 Flickr and KDD will be excluded in the comparative study of Section 4.6. This is because some methods are too computationally intensive, especially in combination with the large number of possible parameter combinations, to be applied on these very large data sources. Furthermore, our statistical evidence is already sufficiently strong to conclude significance without these datasets. Having said this, these data sources will be included in the analysis of Section 5.

13 MovieLens 1M dataset from http://grouplens.org/datasets/movielens
14 MovieLens 10M dataset from http://grouplens.org/datasets/movielens
15 https://webscope.sandbox.yahoo.com
16 http://www.bigdatalab.ac.cn/benchmark/bm/dd?data=Ta-Feng


or above average). In the Kdd cup data, performance of a student on a test is being predicted, based on artificially created binary features (Yu et al 2010). The corporate residence fraud dataset (CRF) (Junque de Fortuny et al 2014b) contains data on foreign companies making transactions with specific Belgian companies, where we try to predict whether the foreign company commits residence fraud (a type of fiscal fraud). Finally, the banking dataset (Bank) (Martens et al 2016) contains detailed behaviour on consumers making transactions with merchants or other persons, to predict interest in a pension fund product. Some characteristics of these datasets can be found in Table 2. The features column only shows the number of active features.^17

Table 2 Behaviour data characteristics. The final column shows the imbalance ratio p, defined as the ratio of the number of minority class instances to the amount of majority class instances in the training set, expressed as a percentage. See Section 4.2 for details regarding p.

Name      |X_maj|    |X_min|    Features    p = 100 × |X_min|_train / |X_maj|_train
Mov G     4331       1709       3706        1 & 25
Mov Th    10546      131        69878       1.24 (p = [])
Yahoo A   6030       1612       11915       1 & 25
Yahoo G   5436       2206       11915       1 & 25
TaFeng    17330      14310      23719       1 & 25
Book      42900      18858      282973      1 & 25
LST       59702      60145      166353      1
Adver     2792       457        1555        16.38 (p = []) & 1
CRF       869071     62         108753      0.0072 (p = [])
Bank      1193619    11107      3139570     0.93 (p = [])
Flickr    8166814    3028330    497472      0.1
Kdd       7171885    1235867    19306083    0.5

4.2 Methodology

Regarding the experiments performed in the upcoming sections, we applied a tenfold cross-validation procedure. Each of the folds contains 80% training data, 10% validation data and 10% test data. Note that these percentages are valid for both the majority class and the minority class (stratified sampling). As can be seen from Table 2, some datasets are balanced in nature. We created artificial imbalance for these datasets by removing minority class instances from the initial training set, according to a user-defined parameter p. We ensured that the amount of minority training instances corresponds to p percent of the majority class training size: $|X_{min}|_{train} = \frac{p}{100}\,|X_{maj}|_{train}$. As an example, say that we are using the Book dataset with p = 25. In that case we know that the majority class contains 34320 training instances (80% of $|X_{maj}|$ = 42900). The minority training data would contain 8580 instances (25% of 34320). When the dataset is already imbalanced, we define p = [], which means that no downsampling of the minority class training data is performed. Note that the validation and test data are left untouched.^18

17 Active features represent features that are present for at least one instance in the dataset. A non-active feature corresponds with a column of zeros in the matrix representation and would not contribute to the model.
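The Book example above can be reproduced with a small helper; `minority_train_size` is our own illustrative function:

```python
def minority_train_size(n_maj, p, train_frac=0.8):
    """Return (majority training size, minority training size) after the
    artificial imbalancing step: |X_min|_train = (p/100) * |X_maj|_train.
    Mirrors the Book example in the text; sketch only."""
    n_maj_train = int(train_frac * n_maj)        # e.g. 80% of 42900 = 34320
    return n_maj_train, int(p / 100.0 * n_maj_train)
```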

The methods detailed in Section 3 are applied on the training data. For both the under- and oversampling approaches, a linear SVM is trained on the newly created balanced training data, with regularization parameter C having values

C = [10^-7, 10^-5, 10^-3, 10^-1, 10^0]

The validation data are used for parameter tuning purposes. The test data allow us to obtain the generalization performance. The results reported show the average over ten folds. In the remaining paragraphs of this section, we will describe the various parameter settings used in our experiments and give a brief overview of the parameters occurring in each method.

Considering the oversampling techniques, the parameter settings are as follows:

β = [0, 1/3, 2/3, 1]
prior_opt = {FlipCoin, Reverse Priors}
sim_measure = {Cosine, Jaccard}
K = [10^0, 10^1, 10^2, |X_min|_train]

We didn't include the "Prior" option, due to initial experiments showing a lower performance in comparison with the other options. This can be explained by the low priors occurring in each column, resulting in synthetic samples that mainly show zeros in 0-1 match situations. The oversampling with replacement method (OSR) only uses the β parameter, SMOTE uses all but the K parameter, and ADASYN uses all of the parameters listed above.

With respect to the undersampling techniques, the following parameter settings are used:

β_u = [0, 1/4, 1/2, 3/4, 1]
sim_measure = {Cosine, Jaccard}
K = [10^0, 10^1, 10^2, |X_min|_train]
Clust_opt = {C_Smallest, C_Largest}

The random undersampling (RUS) technique only uses the β_u parameter. The second set of methods, "Closest tot sim" and "Farthest tot sim", uses β_u and sim_measure. The third set of techniques, "Closest Knn" and "Farthest Knn", makes use of all the parameters listed above except for Clust_opt. The final approach, CBU, employs β_u and Clust_opt.

The boosting variants (AdaBoost, AdaCost and EasyEnsemble) presented in Section 3.3 make use of the following settings:

T = 30
μ = [100, 75]
C = [10^-7, 10^-5, 10^-3, 10^-1]
R = [2, 8, R_L], where R_L = |X_maj|_train / |X_min|_train
S = [5, 10, 15]

18 This means that if we start from a balanced set, only the training data will show artificial imbalance according to the imbalance ratio p. The validation and test data would remain balanced. Since AUC (and some other metrics) is independent of class skew, it would be unwise to make these sets imbalanced as well, because that would lead to discarding minority class instances that are relevant for performance assessment.

The AdaBoost algorithm includes the Tmicro and C parameters AdaCost additionalyuses cost-ratios R We have chosen for a range of values because misclassificationcosts are unknown for many business applications (He and Garcia 2009 Fan et al1999 Sun et al 2007) The final value RL seems to be a popular choice (Akbaniet al 2004 Luts et al 2010) because the total weight on the majority class balanceswith the total weight on the minority class The final method EasyEnsemble uses Ssubsets in addition to the parameters previously mentioned for AdaBoost Note thatwe consider the boosting iteration round t isin [1T ] as a tunable parameter19

4.3 Oversampling

The oversampling techniques presented in Section 3.1 are applied to each of the datasets from Table 2, with varying imbalance levels (by varying the p-parameter). The experiments we conducted use the following methodology: for each dataset, we apply the oversampling techniques with all possible parameter combinations on the training data to create newly balanced datasets, after which linear SVMs are trained. The optimal parameter combination with respect to each imbalance ratio β is selected based on validation set AUC-performance. With these parameters, the AUC on the test data is obtained. Results show the average over ten folds.

The results on four arbitrarily selected datasets are shown in Table 3. Full results on each of the data sources can be found in Appendix A, Table 11. From these tables we can conclude that performance generally increases with growing β-values. It is more correct to say that performance keeps improving with growing β-levels until an optimal point β* is reached. Increasing the balance level beyond this optimal value causes only small fluctuations with respect to the optimal performance. Traditional studies dealing with dense, low-dimensional data note that the OSR method can suffer from overfitting, as already mentioned in Section 3.1. It is interesting to see that in this sparse, high-dimensional setting this effect doesn't seem to occur.^20 Furthermore, the computationally expensive synthetic sampling approaches do not seem to improve performance over the simpler OSR method. This can be explained by the fact that many instances have a very limited amount of non-zero elements (Junque de Fortuny et al 2014a; Stankova et al 2015), causing the synthetic sampling procedures to be limited in the amount of unique new samples they can produce. Note also that synthetic sampling procedures have historically been designed to overcome the overfitting behaviour of OSR, which no longer seems to apply here.

19 We sometimes make use of T as a symbol to indicate the boosting iteration number instead of t.
20 The SVM regularization parameter C controls for overfitting. Additionally, the influence of majority class noise/outliers (see Section 3.2) on the learned hyperplane decreases as we oversample the minority class. This hyperplane is more sensitive towards minority instances, and a change in its direction/orientation towards these majority class noise/outliers will be more heavily penalized.

Imbalanced classification in sparse and large behaviour datasets 23

Table 3 Oversampling experiments. Results show the average tenfold AUC test set performance with respect to increasing β-values (β = [β1, β2, β3, β4] = [0, 1/3, 2/3, 1]). Standard deviations are shown between brackets. Optimal performances for each method are highlighted in boldface. See Section 4.2 for parameter settings.

Mov Th (p = [])
            β1             β2             β3             β4
OSR         79.77 (5.33)   85.30 (4.66)   83.16 (4.50)   84.59 (5.69)
SMOTE       79.77 (5.33)   84.18 (6.51)   85.58 (5.97)   84.33 (5.75)
ADASYN      79.77 (5.33)   84.11 (6.77)   85.86 (6.06)   85.36 (5.13)

Yahoo G (p = 25)
OSR         78.82 (1.39)   78.78 (2.02)   78.74 (1.86)   78.35 (1.47)
SMOTE       78.82 (1.39)   79.23 (1.57)   79.10 (1.20)   79.03 (1.89)
ADASYN      78.82 (1.39)   79.12 (1.43)   79.23 (1.37)   79.51 (2.01)

TaFeng (p = 25)
OSR         66.94 (1.34)   68.84 (1.21)   66.99 (1.42)   67.70 (1.41)
SMOTE       66.94 (1.34)   68.47 (1.50)   67.07 (1.15)   66.65 (0.81)
ADASYN      66.94 (1.34)   68.62 (1.38)   67.85 (1.60)   66.91 (1.39)

Book (p = 25)
OSR         60.08 (0.71)   61.05 (0.94)   61.82 (1.12)   64.62 (0.57)
SMOTE       60.08 (0.71)   62.60 (0.73)   60.95 (0.68)   63.00 (0.80)
ADASYN      60.08 (0.71)   62.33 (0.73)   60.77 (0.85)   63.04 (0.58)

4.4 Undersampling

Regarding the undersampling techniques, we employ a similar experimental set-up as in the previous section: first, train SVMs on the undersampled training data with all possible parameter combinations; second, choose a suitable parameter combination based on validation set AUC-performance; and finally, obtain the AUC-performance on the test set. The results are averaged across ten folds.

Before going into detail on the experimental findings, we give a short note on the effect of noise/outliers on SVM performance. Consider the following imaginary example: say that a majority class female is rating a lot of action films and thrillers. We can consider this an outlier, since most of the females are rating romantic or drama films; this specific female has far more in common with the minority class males, who also rate action films and thrillers. Outliers are therefore instances showing odd behaviour, whereas noise refers to wrongfully labelled instances. The effect of noise/outliers on SVM performance can be severe. Indeed, many of the instances in the dataset have a support value (dual variable α_i) of 0. Instances contributing to the model (a non-zero support value) are examples close to the borderline (Suykens et al. 2002) and noise/outliers (as can be derived from the KKT dual-complementarity conditions²¹).

²¹ If α_i = 0, then y_i(w^T x_i + b) ≥ 1. For noise/outliers the term y_i(w^T x_i + b) is negative, hence α_i ≠ 0.

24 Jellis Vanhoeyveld David Martens

With respect to noise/outliers, the support value can reach a maximal value of C, the chosen regularization parameter in the SVM formulation. It should be clear that lowering the C-value leads to models that are less sensitive to noise/outliers.

The experimental outcomes of the various undersampling processes are shown in Table 4 for the same four previously selected datasets. We refer to Appendix B, Table 12 for results on the entire data repository. When removing only a limited amount of majority class instances (βu = βu2 = 1/4), we observed that in 12 out of 16 datasets the "Farthest Knn" method outperforms the "Closest Knn" technique; the four remaining datasets show comparable performances.²² This finding shows that the "Farthest" method is very suitable for removing majority class noise/outliers and empirically demonstrates their performance degrading effect. With higher undersampling rates (βu = 1), the "Closest Knn" method generally achieves higher performances than the "Farthest Knn" technique, though the results are less clear: only 9 out of 16 datasets indicate "Closest" as a clear winner, with 3 tie situations and 4 situations where "Farthest" shows the best results. Intuitively, we would expect the "Closest" method to perform better, since we are keeping the most informative instances (examples close to the minority class). However, this technique will also emphasize noise/outliers, and this is the main reason why the results are less significant. One can also see this from the fact that the RUS method shows far better results in comparison to the aforementioned techniques when βu = 1, an observation that is valid for all datasets. The RUS method is efficient in realizing noise/outlier reduction effects, since each SVM is trained with only a fraction of these types of instances. It is remarkable that in 9 out of 16 datasets the RUS method with βu = 1 outperforms the baseline model (βu = 0); in 4 datasets we observed equal performance, and 3 losses with respect to the baseline. RUS shows two attractive features: majority class noise/outlier removal and its ability to put more emphasis on the minority class instances by undersampling the majority class. A general downside of the approach is the information loss, and this effect causes the 3 losses to the baseline. Usually, though, the attractive features of RUS dominate the latter effect, and this indicates that it is definitely not necessary to include all majority class training instances in constructing the predictive model. Apparently there is a high level of redundancy, which can be exploited to construct efficient hypotheses.
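The "Closest"/"Farthest" idea can be sketched as follows: score every majority instance by its total similarity to the minority class and drop the closest or farthest fraction. This is a simplified stand-in (total cosine similarity rather than the paper's exact Knn-based variants); the function name and scoring are our own assumptions.

```python
import numpy as np

def undersample_by_similarity(X, y, frac_remove, mode="farthest", minority=1):
    """Informed undersampling sketch: score each majority instance by its
    total cosine similarity to the minority class, then drop the fraction
    frac_remove of majority instances closest to / farthest from it."""
    Xn = X / np.maximum(np.linalg.norm(X, axis=1, keepdims=True), 1e-12)
    min_idx = np.flatnonzero(y == minority)
    maj_idx = np.flatnonzero(y != minority)
    score = (Xn[maj_idx] @ Xn[min_idx].T).sum(axis=1)  # total sim to minority
    order = np.argsort(score)                          # ascending: farthest first
    n_drop = int(frac_remove * len(maj_idx))
    if n_drop == 0:
        drop = np.array([], dtype=int)
    elif mode == "farthest":
        drop = order[:n_drop]        # lowest total similarity
    else:                            # "closest"
        drop = order[-n_drop:]       # highest total similarity
    keep = np.concatenate([min_idx, np.delete(maj_idx, drop)])
    return X[keep], y[keep]
```

The "closest" mode removes exactly the borderline majority instances that carry information but also include the odd-behaviour outliers discussed above, which is consistent with its mixed results at high undersampling rates.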

In the next paragraph we compare the cluster-based undersampling (CBU) approach to the random undersampling (RUS) technique. Before proceeding, we should note that Table 12 does not show results on the LST, CRF and Bank datasets, due to the fact that the projected unigraph was too large to fit in memory (larger than 16 GB). A limited number of features (top nodes) is active for a relatively large number of instances (bottom nodes); all these instances become connected in the unigraph projection, resulting in relatively high degrees for the bottom nodes. A possible solution to circumvent this problem might be to cut the weakest edges in the projected unigraph.
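The memory blow-up is easy to reproduce on a hypothetical toy bigraph: a single popular feature connects nearly every pair of instances in the projection B·Bᵀ, so the unigraph becomes fully dense even when B itself is very sparse. The matrix sizes and the edge-cutting threshold below are illustrative assumptions.

```python
import numpy as np

# Toy bigraph: rows are instances (bottom nodes), columns are behaviours
# (top nodes). One very popular column connects nearly all bottom nodes
# in the unigraph projection B @ B.T.
rng = np.random.default_rng(0)
B = (rng.random((1000, 200)) < 0.01).astype(np.int32)  # ~1% of entries active
B[:, 0] = 1                                            # one feature active for everyone
co = (B @ B.T) > 0                                     # unigraph adjacency (co-activity)
print(f"bigraph density {B.mean():.3f} -> unigraph density {co.mean():.3f}")

# Cutting the weakest edges (keep only pairs sharing at least 2 behaviours)
# sparsifies the projection again:
co_cut = (B @ B.T) > 1
print(f"after cutting weakest edges: {co_cut.mean():.3f}")
```

Because every pair of rows shares column 0, the first projection is fully dense; thresholding the edge weights, as suggested above, restores sparsity.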

If we apply a limited amount of undersampling (βu = βu2 = 1/4), the CBU technique outperforms RUS in 8 out of 13 datasets (4 losses and 1 tie with respect to RUS). On the highly imbalanced datasets (p = 1 or p = []), the CBU method wins

²² A tie occurs when the absolute difference in AUC is smaller than or equal to 0.5.


in 8 out of 8 cases. We can therefore conclude that CBU outperforms RUS at low undersampling rates. This shows that handling the within-class imbalance can indeed be beneficial in this environment. If we make a comparison at the highest undersampling rate, βu = βu5 = 1, the results are completely opposite: CBU wins in only 1 case, with 9 losses with respect to RUS and 3 ties. On the highly imbalanced datasets, CBU outperforms RUS for 1 dataset (with 4 losses and 3 ties). It is clear that RUS is the preferred technique in case of high undersampling rates. The reason is that CBU emphasizes the small disjuncts and basically fails to adequately pick up on the common, larger communities. Furthermore, these small disjuncts might actually correspond with majority class noise/outliers, and CBU is focusing on these types of instances. On the overall level, where we consider all undersampling rates (except βu = 0), CBU wins in 5 out of 13 datasets, RUS wins in 4 cases, and there are 4 tie situations (yet RUS did outperform CBU in all 4 cases). Both methods seem to be competitive with one another across a wide range of imbalance levels. If we pay particular attention to the highly imbalanced sets, CBU wins in 5 cases and there are 3 tie situations on the remaining sets (yet RUS did outperform CBU in all 3 cases). CBU therefore seems to be the method of choice (with respect to RUS) when the datasets show high imbalance levels.

Table 4 Undersampling experiments. Results show the average tenfold AUC test set performance with respect to increasing βu-values (βu = [βu1, βu2, βu3, βu4, βu5] = [0, 1/4, 1/2, 3/4, 1]). Optimal performances (boldface) are highlighted for each imbalance ratio. Additionally, standard deviations are shown between brackets. Cl K represents the "Closest Knn" technique, Cl T the "Closest tot sim" technique (similarly for Far K and Far T, see Section 3.2). Parameter settings are given in Section 4.2.

Mov Th (p = [])
        βu1           βu2           βu3           βu4           βu5
RUS     79.77 (5.3)   80.32 (5.8)   81.57 (5.5)   81.86 (6.6)   81.26 (6.2)
Cl K    79.77 (5.3)   79.25 (4.5)   78.07 (5.0)   76.25 (6.5)   62.46 (8.5)
Cl T    79.77 (5.3)   78.40 (4.4)   72.41 (3.5)   64.66 (4.5)   60.37 (7.3)
Far K   79.77 (5.3)   84.54 (5.0)   83.64 (6.4)   80.02 (7.3)   56.82 (10.3)
Far T   79.77 (5.3)   85.03 (5.7)   82.68 (6.8)   75.61 (9.2)   56.77 (10.9)
CBU     80.11 (5.8)   81.17 (6.0)   81.08 (6.5)   84.17 (5.1)   80.96 (6.9)

Yahoo G (p = 25)
RUS     78.82 (1.4)   78.91 (1.6)   78.97 (1.6)   78.61 (1.6)   77.82 (2.1)
Cl K    78.82 (1.4)   77.26 (1.5)   72.52 (1.5)   67.86 (2.0)   65.07 (2.7)
Cl T    78.82 (1.4)   76.83 (1.0)   71.99 (1.8)   67.15 (2.3)   61.10 (2.7)
Far K   78.82 (1.4)   78.26 (2.2)   74.69 (2.7)   67.22 (2.1)   60.72 (2.3)
Far T   78.82 (1.4)   77.68 (2.6)   72.44 (3.0)   64.94 (2.4)   59.60 (2.0)
CBU     75.25 (3.2)   75.22 (2.4)   74.69 (2.3)   73.07 (2.4)   70.69 (2.4)

TaFeng (p = 25)
RUS     66.94 (1.3)   67.44 (1.3)   68.10 (1.4)   68.27 (1.4)   66.13 (1.2)
Cl K    66.94 (1.3)   66.13 (1.4)   63.39 (1.2)   59.83 (1.3)   56.94 (0.7)
Cl T    66.94 (1.3)   66.38 (1.5)   62.89 (1.6)   57.46 (1.3)   54.56 (1.3)
Far K   66.94 (1.3)   68.06 (1.4)   66.43 (1.6)   64.46 (1.5)   63.35 (1.3)
Far T   66.94 (1.3)   64.31 (1.1)   62.69 (1.0)   61.27 (1.1)   59.03 (1.0)
CBU     64.81 (1.2)   64.15 (1.1)   64.13 (1.2)   63.88 (0.8)   63.46 (0.8)

Book (p = 25)
RUS     60.08 (0.7)   60.13 (0.6)   60.40 (0.8)   60.33 (0.8)   63.28 (0.8)
Cl K    60.08 (0.7)   59.96 (0.7)   60.13 (0.8)   59.96 (1.0)   59.28 (0.7)
Cl T    60.08 (0.7)   59.96 (0.7)   60.13 (0.8)   60.29 (0.4)   54.50 (0.9)
Far K   60.08 (0.7)   63.29 (1.0)   64.19 (0.8)   57.30 (1.1)   55.66 (1.1)
Far T   60.08 (0.7)   62.14 (0.5)   58.27 (0.6)   56.37 (1.0)   55.66 (1.1)
CBU     54.82 (0.9)   54.67 (0.9)   54.71 (0.9)   54.66 (1.0)   54.78 (0.9)

4.5 Boosting variants

In this section the plain AdaBoost (AB) method is compared with the cost-sensitive AdaCost (AC) algorithm, the EasyEnsemble (EE) technique and the baseline (BL) method (where we train a single SVM on the imbalanced training data). AB, AC and EE have an underlying boosting process in common. The AUC-values are stored during each boosting iteration, which allows us to show the performance graphically as a function of the number of iterations t. In EasyEnsemble (EE) we combine the weak learners of each subset by summing their individual contributions; for example, when t = 2 boosting iterations are performed, we evaluate the performance of the combined learner \sum_{s=1}^{S} \sum_{t=1}^{2} \alpha_{s,t} h_{s,t}(x). Figures 1, 2, 3 and 4 show the average tenfold AUC-performance (with µ = 100%) on the test data. The left-side figures show the AB, AC (with varying R-values) and EE (with varying S-values) performance with respect to the number of boosting iterations; the C-value is tuned according to highest validation set AUC-performance (over all possible boosting iterations). The right-side figures investigate the influence of the C-parameter for AB and EE (with S = 15) and allow us to gain insight into the effect of weak and strong learners on the generalization performance. Results on the remaining datasets are given in Appendix C, Figures 6-17. Note that we only report results with weight-percentage µ = 100% (use all instances in the training process); previous experiments (with µ = 75%) showed inferior results. This suggests that it is indeed important to include all minority instances during the training of each weak learner, an observation also made by Liu et al. (2009).
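The EE combination rule above can be sketched end-to-end in a few dozen lines. The sketch below uses one-feature decision stumps as weak learners purely for illustration (the paper boosts linear SVMs); all function names are our own, and the aggregation is exactly the double sum over subsets s and boosting rounds t.

```python
import numpy as np

def stump_fit(X, y, w):
    """Weak learner: best axis-aligned threshold stump under weights w (y in {-1,+1})."""
    best = (None, None, None, np.inf)
    for j in range(X.shape[1]):
        for thr in np.unique(X[:, j]):
            for sign in (1, -1):
                pred = sign * np.where(X[:, j] >= thr, 1, -1)
                err = w[pred != y].sum()
                if err < best[3]:
                    best = (j, thr, sign, err)
    return best

def stump_predict(model, X):
    j, thr, sign, _ = model
    return sign * np.where(X[:, j] >= thr, 1, -1)

def adaboost(X, y, T):
    """Plain AdaBoost; returns a list of (alpha_t, model_t) pairs."""
    n = len(y); w = np.full(n, 1 / n); out = []
    for _ in range(T):
        model = stump_fit(X, y, w)
        err = max(model[3], 1e-12)
        if err >= 0.5:
            break
        alpha = 0.5 * np.log((1 - err) / err)
        pred = stump_predict(model, X)
        w *= np.exp(-alpha * y * pred); w /= w.sum()
        out.append((alpha, model))
    return out

def easy_ensemble(X, y, S, T, seed=0):
    """EasyEnsemble sketch: S random balanced subsets, each boosted T rounds;
    the final score sums alpha_{s,t} * h_{s,t}(x) over all subsets and rounds."""
    rng = np.random.default_rng(seed)
    min_idx = np.flatnonzero(y == 1); maj_idx = np.flatnonzero(y == -1)
    ensembles = []
    for _ in range(S):
        sub = np.concatenate([min_idx, rng.choice(maj_idx, len(min_idx), replace=False)])
        ensembles.append(adaboost(X[sub], y[sub], T))
    def score(Xq):
        return sum(a * stump_predict(m, Xq) for ens in ensembles for a, m in ens)
    return score
```

Each subset keeps all minority instances and only as many majority instances, which is why EE trains on sets only twice the minority class size.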

When investigating the influence of the regularization parameter C in AdaBoost, we can graphically verify that it is a "weakness" indicator, in the sense that lower C-values correspond to weaker learners. While these low C-value learners have a low performance in the first rounds of boosting, the possible boost in performance is usually much higher compared to higher C-value inducers. Strong learners can already make a suitable distinction between minority and majority class behaviours in the first boosting round. The boosting process will emphasize the hard-to-learn instances, among which are noise/outliers; therefore, too strong learners are not suitable for use in a boosting process, as already noted by Wickramaratna et al. (2001). In Figures 2(b) and 3(b) we observe that the lowest C-values (C = 10⁻⁷, 10⁻⁵) can outperform higher C-values (C = 10⁻³, 10⁻¹). In many cases the AB-process outperforms the baseline (BL); see also the results in Table 5. A combination of several weak learners can often achieve better generalization than a single strong learner.

Inspired by the success of the RUS method with βu = 1 (for reasons indicated in Section 4.4), the EE-technique tries to overcome an obvious downside of this approach: we are removing a lot of majority class instances that can be very informative. Hence, we try to make up for this loss by combining several of these balanced training sets, in the hope that the majority class is more adequately explored. This results in a very effective learner that seems to outperform the AB, AC and BL methods in the majority of cases, as can be seen from these figures. It is remarkable that we don't need to use that many subsets S to achieve high performances: increasing the S-value from 5 to 15 doesn't seem to cause major increases in AUC (see also Table 5). The influence of the C-parameter is similar to that described in the previous paragraph.

Fig. 1 Mov G (p = 25) dataset: average tenfold AUC test-set performance versus the number of boosting iterations T (µ = 100%) for (a) AB, AC (R = 2, 8, RD) and EE (S = 5, 10, 15), with C chosen according to highest validation set AUC-performance (over all possible T); (b) AB and EE (S = 15) with varying C-levels (C = 10⁻⁷, 10⁻⁵, 10⁻³, 10⁻¹). BL is shown for reference.


Fig. 2 Book (p = 25) dataset: average tenfold AUC test-set performance versus the number of boosting iterations T; same set-up as Fig. 1.

Fig. 3 TaFeng (p = 25) dataset: average tenfold AUC test-set performance versus the number of boosting iterations T; same set-up as Fig. 1.

Fig. 4 Bank (p = []) dataset: average tenfold AUC test-set performance versus the number of boosting iterations T; same set-up as Fig. 1.


4.6 Final comparison

4.6.1 Performance-wise comparison

In this section we wish to determine which of the aforementioned methods shows superior performance in terms of AUC on test data. The experimental set-up is described in Section 4.2. Note that we exclude β = 0 and βu = 0 in determining the performance of the oversampling respectively undersampling techniques, to be able to compare them with the baseline (BL) approach.²³ The results for AB, AC and EE are shown for µ = 100%; the number of boosting iterations t ∈ [0, T] is also considered as a tunable parameter. Table 5 shows the final comparison between all methods. Additionally, we added an average rank column indicating the mean rank of each algorithm across all datasets. The rank of an algorithm for a specific dataset corresponds to its position in the sorted AUC-performance spectrum, where the best performing algorithm gets the rank of 1, the second best rank 2, etc. In case two or more algorithms perform equally well, average ranks are assigned (if two algorithms have ranks of 3 and 4 respectively and show equal performance, then both methods get assigned a rank of 3.5). Note that the LST dataset is excluded from the calculation of average ranks, since we consider this a dataset not affected by imbalance and showing equal performances.
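The rank assignment just described, including the averaging of tied positions, can be sketched as follows (the function name and the toy AUC values are illustrative):

```python
import numpy as np

def average_ranks(auc_matrix):
    """Rank algorithms per dataset (rank 1 = highest AUC), assign average
    ranks on ties, and return the mean rank of each algorithm (column)."""
    auc = np.asarray(auc_matrix, dtype=float)
    ranks = np.empty_like(auc)
    for i, row in enumerate(auc):
        order = np.argsort(-row)                 # descending AUC
        r = np.empty(len(row)); r[order] = np.arange(1, len(row) + 1)
        for v in np.unique(row):                 # average out tied positions
            tied = row == v
            r[tied] = r[tied].mean()
        ranks[i] = r
    return ranks.mean(axis=0)
```

For a single dataset with AUCs [0.9, 0.8, 0.8, 0.7] this yields ranks [1, 2.5, 2.5, 4], exactly the 3.5-style tie handling described above.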

As a side note, we noticed that increasing the number of minority training set instances can significantly improve performances. We can see this clearly by comparing the p = 25 outcomes with the results of p = 1 on the same dataset. Junque de Fortuny et al. (2014a) already showed empirically that larger behaviour datasets (in terms of the number of instances or features) lead to higher AUC-performances. This is confirmed in our experiments, though we want to add that it only holds when the data size is increased in a balanced fashion.

The results in Table 5 confirm our previous analysis. Each of the oversampling, undersampling (except Cl Knn) and boosting variants seems to improve upon the BL performance. Sobhani et al. (2015) note that the direct application of ensemble techniques on imbalanced data does not circumvent the imbalanced learning issue; in our experiments on behaviour data, we do find a simple boosting procedure applied to the imbalanced data to improve upon the predictive performance. The oversampling techniques OSR, SMOTE and ADASYN show similar outcomes and are competitive methods, seeming to outperform the undersampling techniques and the plain boosting (AB) and cost-sensitive learning (AC) approaches. When dealing with traditional data, the performances of oversampling techniques are generally worse than those of undersampling approaches (Drummond and Holte 2003); it is interesting to observe that for these large and sparse behaviour datasets the opposite is true. This is heavily related to the intrinsic properties of the data under consideration: undersampling (or feature selection, for that matter) is not recommended because, as we have noted in Section 2.1, each feature provides a small additional contribution to the predictive performance. With respect to AC, the results indicate that R = R_L is a better choice compared to the more arbitrary cost ratios R = 2 and R = 8. The EE-technique has the

²³ The BL technique trains a single SVM on the imbalanced training data.


lowest average rank and is our best performing method. The combination of several balanced subsets with filtered noise/outlier contributions allows one to explore the majority class space efficiently. To strengthen our previous statements, we will in the next paragraphs conduct statistical significance tests in accordance with the procedure outlined in Demšar (2006); the latter paper presents a framework for performing statistical comparisons of classifiers over multiple datasets.

The first null-hypothesis we try to reject postulates that all of the algorithms perform equally well, or equivalently, that their average ranks R_j (see Table 5) are indifferent. Under the null-hypothesis, the Friedman statistic (Friedman 1937)

\chi_F^2 = \frac{12N}{k(k+1)} \left[ \sum_{j=1}^{k} R_j^2 - \frac{k(k+1)^2}{4} \right] \qquad (6)

with N the number of datasets (N = 15) and k the number of algorithms (k = 13), can be used to compute the statistic (Iman and Davenport 1980)

F_F = \frac{(N-1)\,\chi_F^2}{N(k-1) - \chi_F^2} \qquad (7)

The latter is distributed according to the F-distribution with k−1 and (k−1)(N−1) degrees of freedom. Evaluating this measure with the average ranks shown in Table 5 results in F_F = 22.98, which is a lot higher than the critical value of 1.81 at the α = 0.05 significance level. We can therefore reject the null-hypothesis and proceed with post-hoc tests.

The Nemenyi test (Nemenyi 1963) allows one to compare each classifier to every other and adjusts the critical value to compensate for making k(k−1)/2 comparisons.²⁴ "The performance of two classifiers is significantly different if the corresponding average ranks differ by at least the critical difference (CD)" (Demšar 2006); we refer to the aforementioned paper for the calculation of the CD. The result of these comparisons at the α = 0.05 significance level is shown in Table 6. This test is already quite revealing, in the sense that, for instance, the oversampling methods (OSR, SMOTE and ADASYN), AC (R = R_L) and the EE-techniques are significantly different from the BL. Yet it does not indicate that undersampling or applying plain boosting (AB) results in significantly different performances compared to the BL. The Nemenyi test is still conservative, and we therefore compare all classifiers to a single control classifier in the next paragraphs; note that we only make k−1 comparisons in that situation, resulting in stronger results compared to the Nemenyi test.
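The CD itself is a one-line computation; the critical value q_α, which depends on k and α and must be looked up in the Studentized-range-based tables reproduced by Demšar (2006), is left as an input here rather than hard-coded.

```python
import math

def nemenyi_cd(k, N, q_alpha):
    """Critical difference for the Nemenyi test: two classifiers differ
    significantly if their average ranks differ by at least CD.
    q_alpha must be taken from a table for the given k and alpha."""
    return q_alpha * math.sqrt(k * (k + 1) / (6 * N))
```

With k = 13 and N = 15 the standard-error factor is sqrt(182/90) ≈ 1.422, so CD ≈ 1.422·q_α.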

When comparing a single classifier i with a control classifier c, the null-hypothesis states that the two classifiers i and c perform equally well. Under this hypothesis, the test statistic (based on the average ranks R_i shown in Table 5)

z = \frac{R_i - R_c}{\sqrt{k(k+1)/(6N)}} \qquad (8)

is distributed according to a standard normal distribution. This z-value is used to find the corresponding probability (p-value) p from the cumulative standard normal

²⁴ The larger the number of comparisons, the higher the proportion of null-hypotheses that are wrongfully rejected due to random chance.


Table 5 Final comparison of all methods in terms of AUC test set performance; see Section 4.2 for details on the experimental set-up. β ≠ 0 and βu ≠ 0 for the oversampling respectively undersampling techniques; µ = 100% for each of the boosting variants. Standard deviations are given between brackets and the performance difference with the BL is shown in square brackets. The final block shows the average rank of each algorithm across all datasets (the LST dataset is excluded for this purpose).

              Mov G (p = 1)        Mov G (p = 25)       Mov Th (p = [])      Yahoo A (p = 1)
BL            71.60 (2.62) [0]     81.41 (1.32) [0]     79.77 (5.33) [0]     55.92 (2.97) [0]
OSR           75.35 (2.27) [3.8]   83.76 (2.09) [2.3]   85.13 (6.10) [5.4]   60.05 (2.71) [4.1]
SMOTE         76.16 (2.27) [4.6]   83.70 (2.10) [2.3]   85.67 (4.98) [5.9]   60.10 (3.00) [4.2]
ADASYN        76.07 (2.26) [4.5]   83.63 (2.04) [2.2]   85.65 (5.60) [5.9]   59.90 (2.99) [4.0]
RUS           72.88 (2.73) [1.3]   81.52 (2.15) [0.1]   82.91 (7.19) [3.1]   57.04 (1.77) [1.1]
Cl Knn        71.43 (1.36) [-0.2]  80.88 (1.19) [-0.5]  78.87 (4.71) [-0.9]  55.78 (2.71) [-0.1]
Far Knn       71.90 (2.95) [0.3]   80.90 (1.48) [-0.5]  84.07 (4.64) [4.3]   57.20 (1.33) [1.3]
CBU           74.17 (2.36) [2.6]   81.51 (1.04) [0.1]   82.76 (7.22) [3.0]   58.77 (3.43) [2.8]
AB            71.65 (1.73) [0.1]   84.52 (1.89) [3.1]   82.43 (5.18) [2.7]   58.35 (2.62) [2.4]
AC(R = 2,8)   71.61 (2.46) [0.0]   83.46 (1.82) [2.0]   83.27 (5.60) [3.5]   57.72 (2.47) [1.8]
AC(R = R_L)   74.65 (2.70) [3.1]   83.35 (2.09) [1.9]   85.41 (4.49) [5.6]   59.47 (2.33) [3.5]
EE(S = 10)    76.04 (2.66) [4.4]   85.05 (1.85) [3.6]   86.10 (5.78) [6.3]   59.66 (3.13) [3.7]
EE(S = 15)    76.12 (2.88) [4.5]   85.14 (1.86) [3.7]   86.42 (5.86) [6.7]   59.76 (2.93) [3.8]

              Yahoo A (p = 25)     Yahoo G (p = 1)      Yahoo G (p = 25)     TaFeng (p = 1)
BL            61.68 (2.42) [0]     66.84 (3.66) [0]     78.82 (1.39) [0]     55.75 (1.60) [0]
OSR           64.59 (3.12) [2.9]   73.08 (2.96) [6.2]   78.52 (2.01) [-0.3]  61.21 (2.24) [5.5]
SMOTE         65.56 (3.33) [3.9]   73.11 (3.12) [6.3]   79.01 (1.21) [0.2]   61.72 (1.81) [6.0]
ADASYN        65.13 (3.38) [3.4]   73.22 (3.17) [6.4]   79.74 (1.68) [0.9]   61.68 (1.86) [5.9]
RUS           64.11 (2.80) [2.4]   70.65 (3.39) [3.8]   78.91 (1.55) [0.1]   59.25 (2.18) [3.5]
Cl Knn        61.14 (2.13) [-0.5]  66.34 (3.54) [-0.5]  77.26 (1.46) [-1.6]  55.77 (1.28) [0.0]
Far Knn       63.96 (3.03) [2.3]   66.97 (3.54) [0.1]   78.26 (2.20) [-0.6]  59.98 (1.26) [4.2]
CBU           62.27 (1.79) [0.6]   71.27 (2.89) [4.4]   75.22 (2.42) [-3.6]  58.40 (1.57) [2.6]
AB            63.88 (2.67) [2.2]   68.90 (2.03) [2.1]   79.01 (1.66) [0.2]   56.21 (1.79) [0.5]
AC(R = 2,8)   64.32 (3.56) [2.6]   68.89 (3.11) [2.0]   78.99 (1.89) [0.2]   56.33 (1.83) [0.6]
AC(R = R_L)   64.31 (3.03) [2.6]   73.13 (2.80) [6.3]   78.41 (2.00) [-0.4]  61.60 (2.26) [5.9]
EE(S = 10)    66.51 (3.24) [4.8]   72.61 (3.15) [5.8]   80.52 (1.60) [1.7]   61.20 (1.82) [5.4]
EE(S = 15)    66.36 (3.18) [4.7]   73.48 (2.32) [6.6]   80.54 (1.56) [1.7]   61.13 (1.83) [5.4]

              TaFeng (p = 25)      Book (p = 1)         Book (p = 25)        LST (p = 1)
BL            66.94 (1.34) [0]     52.60 (1.29) [0]     60.08 (0.71) [0]     99.99 (0.01) [0]
OSR           68.77 (1.23) [1.8]   55.87 (1.42) [3.3]   64.62 (0.57) [4.5]   99.99 (0.01) [0]
SMOTE         68.47 (1.50) [1.5]   55.07 (0.88) [2.5]   62.96 (0.82) [2.9]   99.99 (0.01) [0]
ADASYN        68.48 (1.47) [1.5]   55.04 (0.91) [2.4]   63.02 (0.57) [2.9]   99.99 (0.01) [0]
RUS           68.28 (1.39) [1.3]   54.26 (0.92) [1.7]   63.28 (0.80) [3.2]   99.98 (0.01) [0]
Cl Knn        66.13 (1.43) [-0.8]  52.69 (1.30) [0.1]   60.02 (0.79) [-0.1]  99.99 (0.01) [0]
Far Knn       68.06 (1.41) [1.1]   56.25 (1.52) [3.7]   64.15 (1.12) [4.1]   99.98 (0.01) [0]
CBU           63.84 (1.07) [-3.1]  53.75 (1.01) [1.2]   54.68 (0.88) [-5.4]  []
AB            67.65 (1.55) [0.7]   54.27 (1.95) [1.7]   65.00 (0.67) [4.9]   99.99 (0.01) [0]
AC(R = 2,8)   69.31 (1.23) [2.4]   53.72 (1.00) [1.1]   61.24 (0.80) [1.2]   99.98 (0.01) [0]
AC(R = R_L)   67.15 (1.51) [0.2]   55.73 (1.22) [3.1]   64.60 (0.64) [4.5]   99.99 (0.01) [0]
EE(S = 10)    70.30 (1.35) [3.4]   55.09 (1.29) [2.5]   65.37 (0.61) [5.3]   99.98 (0.01) [0]
EE(S = 15)    70.40 (1.30) [3.5]   55.35 (1.26) [2.8]   65.40 (0.51) [5.3]   99.98 (0.01) [0]

              Adver (p = [])       Adver (p = 1)        CRF (p = [])         Bank (p = [])
BL            96.61 (1.82) [0]     90.93 (3.02) [0]     64.06 (16.43) [0]    66.82 (0.88) [0]
OSR           96.93 (1.91) [0.3]   93.30 (2.02) [2.4]   80.74 (12.93) [16.7] 71.39 (0.79) [4.6]
SMOTE         97.05 (1.66) [0.4]   93.35 (2.01) [2.4]   78.70 (16.56) [14.6] []
ADASYN        96.91 (1.95) [0.3]   93.46 (2.21) [2.5]   78.87 (16.71) [14.8] []
RUS           96.81 (1.87) [0.2]   92.38 (2.51) [1.5]   83.98 (5.99) [19.9]  69.41 (1.19) [2.6]
Cl Knn        96.40 (1.48) [-0.2]  89.73 (3.42) [-1.2]  76.63 (16.19) [12.6] 66.17 (0.72) [-0.6]
Far Knn       95.77 (1.81) [-0.8]  93.88 (1.78) [3.0]   83.75 (13.11) [19.7] 66.95 (0.56) [0.1]
CBU           97.15 (1.88) [0.5]   94.18 (2.30) [3.3]   []                   []
AB            97.34 (2.18) [0.7]   91.39 (3.23) [0.5]   77.62 (15.15) [13.6] 66.82 (0.88) [0]
AC(R = 2,8)   97.44 (1.93) [0.8]   91.00 (3.35) [0.1]   68.31 (14.93) [4.2]  67.67 (0.71) [0.9]
AC(R = R_L)   97.46 (1.71) [0.8]   93.51 (2.17) [2.6]   85.08 (9.77) [21.0]  70.70 (0.80) [3.9]
EE(S = 10)    97.64 (1.35) [1.0]   92.97 (2.75) [2.0]   86.18 (10.17) [22.1] 71.46 (0.81) [4.6]
EE(S = 15)    97.63 (1.35) [1.0]   93.30 (2.14) [2.4]   86.35 (9.99) [22.3]  71.54 (0.76) [4.7]

              Average Rank
BL            11.600
OSR           5.000
SMOTE         4.533
ADASYN        4.800
RUS           8.167
Cl Knn        12.467
Far Knn       8.133
CBU           8.567
AB            8.267
AC(R = 2,8)   8.467
AC(R = R_L)   5.400
EE(S = 10)    3.267
EE(S = 15)    2.333

Table 6 Nemenyi test outcomes at the α = 0.05 significance level. A value of 1 indicates the null-hypothesis is rejected and thus finds the two algorithms significantly different (a value of 0 means we accept the null-hypothesis). Methods are listed in the same order as Table 5 and are represented more concisely.

      BL  RO  SM  AD  RU  Cl  Fa  CBU  AB  AC1  AC2  EE1  EE2
BL     0   1   1   1   0   0   0    0   0    0    1    1    1
RO     1   0   0   0   0   1   0    0   0    0    0    0    0
SM     1   0   0   0   0   1   0    0   0    0    0    0    0
AD     1   0   0   0   0   1   0    0   0    0    0    0    0
RU     0   0   0   0   0   0   0    0   0    0    0    1    1
Cl     0   1   1   1   0   0   0    0   0    0    1    1    1
Fa     0   0   0   0   0   0   0    0   0    0    0    1    1
CBU    0   0   0   0   0   0   0    0   0    0    0    1    1
AB     0   0   0   0   0   0   0    0   0    0    0    1    1
AC1    0   0   0   0   0   0   0    0   0    0    0    1    1
AC2    1   0   0   0   0   1   0    0   0    0    0    0    0
EE1    1   0   0   0   1   1   1    1   1    1    0    0    0
EE2    1   0   0   0   1   1   1    1   1    1    0    0    0

Imbalanced classification in sparse and large behaviour datasets 33

distribution function: p = 2 × min(Φ(z), 1 − Φ(z)). Holm's method (Holm 1979) compares k−1 classifiers to a single control classifier and sorts the corresponding p-values in ascending order, so that p_1 ≤ p_2 ≤ ... ≤ p_{k−1}. Each p_i is subsequently compared to its associated confidence level²⁵ α_comp = α/(k−i). Holm starts by performing the check p_1 < α/(k−1) and rejects the null-hypothesis if the test passes. It then proceeds with a similar check for i = 2 and continues until a certain null-hypothesis cannot be rejected; the remaining hypotheses are retained.

Table 7 shows the result of the Holm test at the α = 0.05 significance level with the BL algorithm as control classifier. The table lists each method according to sorted p-values and shows the z-, p- and α_comp-values. The significance column indicates whether the proposed method is significantly different from the BL and coincidentally matches the result of the Nemenyi test. The p-values of the (Far Knn, RUS, CBU) undersampling methods, AB and AC (R = 2, 8) are only slightly higher than the level required to reject the null-hypothesis. We included a critical significance level α_crit, corresponding to the lowest possible significance level at which the method would be considered significantly different from the BL (if α = α_crit, then p = α_comp). In the case of AB, for example, if we were to choose a significance level α = 0.0764 (a confidence level of 92.36%), the associated p-value would be smaller than α_comp and we would proceed to conclude²⁶ that AB performs significantly differently from the BL. To summarize: the null-hypothesis of equal performance between the BL and an alternative method proposed in Section 3 (except Cl Knn) is rejected with a confidence level of more than 91% according to Holm's test. Hence, the performance of the BL approach is significantly worse than that of each of the proposed over- and undersampling methods (except Cl Knn), cost-sensitive learning techniques (AC) and boosting variants (AB, EE).
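Holm's step-down procedure against a control classifier is easy to reproduce from Eq. (8) and the average ranks; the sketch below (function names are ours) recovers the z-values of Table 7 up to the rounding of the published ranks.

```python
import math

def holm_vs_control(avg_ranks, names, control, N, alpha=0.05):
    """Holm's step-down test of every algorithm against a control classifier,
    using the z statistic of Eq. (8). Returns (name, z, p, rejected) tuples
    sorted by ascending p-value."""
    k = len(avg_ranks)
    se = math.sqrt(k * (k + 1) / (6 * N))
    phi = lambda x: 0.5 * (1 + math.erf(x / math.sqrt(2)))  # standard normal CDF
    Rc = avg_ranks[names.index(control)]
    stats = []
    for name, Ri in zip(names, avg_ranks):
        if name == control:
            continue
        z = (Ri - Rc) / se
        p = 2 * min(phi(z), 1 - phi(z))
        stats.append((p, z, name))
    stats.sort()                                  # ascending p-values
    results, rejecting = [], True
    for i, (p, z, name) in enumerate(stats, start=1):
        if p >= alpha / (k - i):                  # first failure stops rejections
            rejecting = False
        results.append((name, z, p, rejecting))
    return results
```

Feeding in the Table 5 ranks with BL as control gives z ≈ −6.52 for EE (S = 15) at the top of the list and z ≈ 0.61 for Cl Knn at the bottom, matching Table 7.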

Similar to the previous paragraph, Table 8 presents the results of Holm's test at the α = 0.05 significance level with EE (S = 15) as control classifier. The results coincidentally correspond with the result of the Nemenyi test and indicate that the oversampling techniques (OSR, SMOTE and ADASYN), AC (R = R_L) and EE (S = 10) are not statistically different from the EE (S = 15) technique at the 0.05 significance level. However, if we were to adapt the significance level to approximately α = 0.25 (a 75% confidence level), the results would conclude significance (except for EE (S = 10)). To summarize: the null-hypothesis of equal performance between EE (S = 15) and an alternative method proposed in Section 3 (except EE (S = 10)) is rejected with a confidence level of more than 75% according to Holm's test. Hence, the performance of the oversampling, undersampling, AB and AC approaches is significantly worse than that of EE (S = 15). We can therefore safely conclude EE to be the preferred method in terms of AUC performance when dealing with imbalanced behaviour data.

²⁵ α_comp adjusts the value of α to compensate for multiple comparisons.
²⁶ This is not entirely true, since Holm's method requires Far Knn to pass the test first before proceeding, which only occurs at the 0.088662 significance level.


Table 7 Holm test at the α = 0.05 significance level with BL reference. The table shows the z test statistic with associated p-value; α_comp = α/(k−i), with k the number of algorithms (k = 13) and i the position of the method in the sorted p-value vector. Since the p-values are ranked in ascending order, i equals the row number (e.g. OSR has i = 5). The column "significant" denotes whether we can reject the null-hypothesis (significant: p < α_comp). α_crit corresponds to the smallest possible significance level at which we would decide to reject the null-hypothesis (α_crit = α·p/α_comp).

              z          p          α_comp     significant   α_crit
EE(S = 15)    -6.51642   7.2E-11    0.004167   1             8.64E-10
EE(S = 10)    -5.86009   4.63E-09   0.004545   1             5.09E-08
SMOTE         -4.96936   6.72E-07   0.005      1             6.72E-06
ADASYN        -4.78183   1.74E-06   0.005556   1             1.56E-05
OSR           -4.64119   3.46E-06   0.00625    1             2.77E-05
AC(R = R_L)   -4.35991   1.3E-05    0.007143   1             9.11E-05
Far Knn       -2.4378    0.014777   0.008333   0             0.088662
RUS           -2.41436   0.015763   0.01       0             0.078815
AB            -2.34404   0.019076   0.0125     0             0.076305
AC(R = 2,8)   -2.20339   0.027567   0.016667   0             0.082701
CBU           -2.13307   0.032919   0.025      0             0.065837
Cl Knn        0.609449   0.542227   0.05       0             0.542227

Table 8 Holm test at the α = 0.05 significance level with EE (S = 15) reference.

              z          p          α_comp     significant   α_crit
Cl Knn        7.12587    1.03E-12   0.004167   1             1.24E-11
BL            6.516421   7.2E-11    0.004545   1             7.92E-10
CBU           4.383348   1.17E-05   0.005      1             0.000117
AC(R = 2,8)   4.313027   1.61E-05   0.005556   1             0.000145
AB            4.172384   3.01E-05   0.00625    1             0.000241
RUS           4.102063   4.09E-05   0.007143   1             0.000287
Far Knn       4.078623   4.53E-05   0.008333   1             0.000272
AC(R = R_L)   2.156513   0.031044   0.01       0             0.155218
OSR           1.875229   0.060761   0.0125     0             0.243045
ADASYN        1.734587   0.082814   0.016667   0             0.248442
SMOTE         1.547064   0.121848   0.025      0             0.243696
EE(S = 10)    0.65633    0.511612   0.05       0             0.511612

4.6.2 Timing-wise comparison

The performance of each of the proposed algorithms in terms of training times depends highly upon the chosen parameters. The β and βu parameters control the number of training instances and are an important contribution to the subsequent SVM learning times for each of the over- and undersampling techniques. The number of boosting iterations t ∈ [0, T] is also a major factor in each of the boosting variants proposed in Section 3.3. Furthermore, the chosen regularization parameter C highly influences the training times of the SVM inducer: higher C-values result in larger training times (Fan et al. 2008). Not only do the chosen hyperparameters affect learning times; data characteristics such as the number of minority instances, the imbalance level, the number of features, the amount of class overlap, etc., also have a major effect.

Imbalanced classification in sparse and large behaviour datasets 35

In comparing each of the methods outlined in Section 3, we make use of a methodology similar to the one presented in Section 4.6.1. For each of the ten folds, a suitable parameter combination is selected based on optimal validation set AUC-performance; these chosen parameters are applied in determining the timings of each method. The final results indicate the average timings across ten folds. Hence, we show the required amount of time for each method when using the parameter combination resulting in the best performance; note that this might, for instance, result in using β = 1/3 in case of OSR, or only t = 5 boosting iterations in EE. Computational timings are shown in Table 9.

As one can observe from the aforementioned table, CBU is the slowest method and relies on computationally intensive clustering procedures. The synthetic SMOTE and ADASYN oversampling approaches are also very time-consuming: they both rely on computationally expensive nearest neighbour computations, in conjunction with a synthetic sample generation procedure that requires a single pass over all minority training instances.

The RUS method shows, as expected, superior results in terms of computational timings. Other undersampling methods perform quite well, though they are slower than BL and OSR due to their inherent nearest neighbour computations.

The boosting variants seem to perform worse than BL, OSR and each of the undersampling methods. Since EE was the best method in terms of AUC-performance (see Section 4.6.1), we focus on this learner in this paragraph. For the medium sized datasets (all but CRF and Bank), OSR outperforms EE(S = 15), yet for the large datasets (CRF and Bank) the opposite is true. A SVM is known to have a quadratic computational complexity in the number of training instances (Suykens et al 2002; García and Lozano 2007). For this reason, we consider OSR (or any other oversampling technique) as inappropriate for large datasets. Since with EE each subset has a size limited to twice the minority class training size, we expect increasing benefits of the EE-technique for larger and more imbalanced behaviour data. As Junqué de Fortuny et al (2014a) have observed, larger behaviour data, in terms of the number of instances or features, also contributes to an increase in predictive power. Hence, we expect to see this type of big behaviour data more often in real-life applications, where computational resources can become a major constraint. Because the EE-technique is easily parallelizable, by simultaneously performing the boosting process for each of the independent subsets, we added an extra row in Table 9: EE par corresponds with the time needed for EE(S = 15) divided by 15, indicating the time that would be required for EE if each subset were trained in parallel. We can see that EE par can outperform OSR in 8 out of 16 cases, precisely for the datasets that show a high imbalance level (eg p = 1 or p = []).
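The EE par = EE(S = 15)/15 reasoning rests on the S balanced subsets (each of size twice the minority class) being independent, so their boosting runs can be dispatched concurrently. An illustrative sketch, where fit_one stands in for the boosted SVM/LR learner of the paper:

```python
import numpy as np
from concurrent.futures import ThreadPoolExecutor

def balanced_subsets(y, S, seed=0):
    """Draw S index subsets, each containing every minority instance (y == 1)
    plus an equally sized random draw, without replacement, from the
    majority class, so |subset| = 2 * n_min."""
    rng = np.random.default_rng(seed)
    pos, neg = np.flatnonzero(y == 1), np.flatnonzero(y == 0)
    return [np.concatenate((pos, rng.choice(neg, size=pos.size, replace=False)))
            for _ in range(S)]

def fit_ensemble_parallel(subsets, fit_one):
    # the subsets are independent, hence EE_par ~ EE(S)/S in wall-clock time
    with ThreadPoolExecutor() as pool:
        return list(pool.map(fit_one, subsets))
```

The final ensemble score is then simply the combination of the S boosted hypotheses, each trained on its own small balanced subset.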

We conclude our comparison of the proposed methods in Figure 5, which shows the average ranks in terms of computational timings (see Table 9) and AUC-performance for each of the available data sources. Regarding the ranks with respect to the latter, we added an extra row EE par, having the same AUC-performance as EE(S = 15) for each dataset in Table 5. Ranks are subsequently determined according to the procedure outlined in Section 4.6.1. Additionally, Figure 18 in Appendix D shows the result for the large and highly imbalanced CRF and Bank datasets. We expect the latter type of data to be of major interest for reasons already indicated.

36 Jellis Vanhoeyveld David Martens

Table 9 Computational timings (in seconds), averaged across ten folds, using the parameter combination resulting in the highest validation set AUC-performance. First and second best performances are emphasized in boldface (third and fourth best are underlined). EE par is the time required for EE(S = 15) divided by 15

            Mov G(p = 1)  Mov G(p = 25)  Mov Th(p = [])  Yahoo A(p = 1)
BL          0.032889      0.056697       0.558563        0.026922
OSR         0.055043      0.062802       0.99009         0.044421
SMOTE       0.218821      0.937057       3.841482        0.057726
ADASYN      0.284688      1.802399       5.191265        0.087694
RUS         0.011431      0.025383       0.155224        0.007991
Cl Knn      0.046599      0.599846       0.989914        0.037182
Far Knn     0.039887      0.80072        0.683023        0.027788
CBU         1.034111      10.60173       6.822839        1.692477
AB          0.169792      0.841443       3.460246        0.139251
AC(R = 28)  0.471994      2.996585       1.086907        0.366555
AC(R = RL)  0.53376       1.179542       6.065177        0.209015
EE(S = 10)  0.117226      6.065145       1.17995         0.148973
EE(S = 15)  0.20474       7.173737       2.119991        0.180365
EE par      0.013649      0.478249       0.141333        0.012024

            Yahoo A(p = 25)  Yahoo G(p = 1)  Yahoo G(p = 25)  TaFeng(p = 1)
BL          0.092954         0.011915        0.044164         0.026728
OSR         0.027887         0.013241        0.047206         0.040919
SMOTE       1.062686         0.056153        0.883698         0.219553
ADASYN      2.050993         0.079073        1.733367         0.306618
RUS         0.048471         0.003234        0.033423         0.002916
Cl Knn      0.84391          0.025404        0.502515         0.092167
Far Knn     0.664124         0.026576        0.500206         0.080159
CBU         15.69442         1.287221        13.55035         2.467279
AB          0.445546         0.078777        0.169977         0.114619
AC(R = 28)  1.034044         0.321723        0.515953         0.926178
AC(R = RL)  0.706215         0.226741        0.112949         0.610233
EE(S = 10)  1.026577         0.100331        1.527146         0.058052
EE(S = 15)  1.607596         0.077483        2.472582         0.10538
EE par      0.107173         0.005166        0.164839         0.007025

            TaFeng(p = 25)  Book(p = 1)  Book(p = 25)  LST(p = 1)
BL          0.032033        0.080035     0.318093      0.652045
OSR         0.032414        0.132927     0.092757      0.87152
SMOTE       5.089283        3.409418     11.43444      4.987705
ADASYN      8.148419        3.689661     12.25441      6.840083
RUS         0.020457        0.022713     0.031972      0.432839
Cl Knn      1.713731        0.400873     3.711648      2.508374
Far Knn     1.539437        0.379086     3.988552      2.511037
CBU         26.42686        4.198663     46.31987      []
AB          0.713265        0.61719      1.238585      2.466151
AC(R = 28)  1.234647        1.666131     2.330635      1.451671
AC(R = RL)  0.279047        0.860346     0.197053      1.23763
EE(S = 10)  2.484502        2.145747     7.177484      0.524066
EE(S = 15)  3.363971        2.480066     11.21945      0.784111
EE par      0.224265        0.165338     0.747963      0.052274

Table 9 Continued. Additionally, an extra column indicating average ranks is included (lower ranks are preferred)

            Adver(p = [])  Adver(p = 1)  CRF(p = [])  Bank(p = [])  Average Rank [pos]
BL          0.010953       0.002796      0.725911     708.9334      2.94 [2]
OSR         0.012178       0.006166      3.685813     1797.481      4.19 [4]
SMOTE       0.123112       0.017764      5.633862     []            9.59 [11]
ADASYN      0.183767       0.021728      5.768669     []            10.91 [13]
RUS         0.012115       0.00204       0.147392     5.247441      1.38 [1]
Cl Knn      0.061324       0.005568      1.106755     737.3282      6.5 [5]
Far Knn     0.079078       0.007069      1.110379     975.9619      6.56 [6]
CBU         3.378235       3.236754      []           []            14 [14]
AB          0.069199       0.103518      1.153196     830.8618      8.06 [7]
AC(R = 28)  0.193092       0.068905      2.047434     717.0548      10.81 [12]
AC(R = RL)  0.107652       0.037963      1.387174     1063.466      9.25 [9]
EE(S = 10)  0.138485       0.085686      0.198656     249.5117      8.25 [8]
EE(S = 15)  0.185136       0.139121      0.285345     364.0107      9.56 [10]
EE par      0.012342       0.009275      0.019023     24.26738      3 [3]

Timings are a substantial concern with respect to these large datasets As can be ob-served the EE-technique thrives for these big and highly imbalanced data even in itsnon-parallel form


[Figure 5: scatter plot of average rank AUC (x-axis, 0–14) versus average rank Time (y-axis, 0–18); markers for BL, OSR, SMOTE, ADASYN, RUS, Cl Knn, Far Knn, CBU, AB, AC(R = 28), AC(R = RL), EE(S = 10), EE(S = 15) and EE par]

Fig 5 Average rank AUC versus average rank Time (see Table 9) across all datasets from Table 2 (excluding Flickr and Kdd). Regarding the AUC rank, an extra row EE par is added, having the same AUC-value as EE(S = 15). Ranks for each dataset are subsequently obtained via the procedure outlined in Section 4.6.1. Points occurring in the upper-right region are preferred

5 The effect of the chosen base learner

In this short section we investigate the influence of the chosen base learner. More specifically, we ask ourselves whether some base inducers can more easily cope with the imbalanced learning issue, and in doing so we verify the conclusion that EE is a suitable technique to improve upon the baseline performance. We rely on several popular classification methods for behaviour data: two regularization based techniques (SVM, LR) and two heuristic types of approaches (Naive Bayes (NB), Behavioural Similarity (BeSim)). The next paragraphs briefly introduce these methods. Note that source codes and dataset results are provided in our on-line repository.

LR constructs a linear model that maximizes the likelihood of the observed data by imposing a logistic model for the parametric probability model P(Y|X). Ng and Jordan (2002) provide a thorough introduction on the topic²⁷ and note that in its plain form LR can suffer from overfitting, especially when the input data is very high dimensional. Therefore, we resort to regularized logistic regression. We opt for L2-regularized LR, since L1-regularization corresponds with a natural way of imposing feature selection (Ng 2004). The LIBLINEAR toolbox (Fan et al 2008) is used to obtain these L2-regularized LR models.
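L2-regularized LR as solved by LIBLINEAR minimizes ½wᵀw + C Σᵢ log(1 + e^(−yᵢwᵀxᵢ)) with yᵢ ∈ {−1,+1}. A minimal full-batch gradient-descent sketch of this objective (the toolbox itself uses a far more efficient trust-region Newton method):

```python
import numpy as np

def l2_logreg(X, y, C=1.0, lr=0.1, iters=500):
    """Gradient descent on the LIBLINEAR-style L2-regularized LR objective
       0.5 * ||w||^2 + C * sum_i log(1 + exp(-y_i * w.x_i)),  y_i in {-1,+1}."""
    w = np.zeros(X.shape[1])
    for _ in range(iters):
        m = y * (X @ w)                           # signed margins
        sig = 0.5 * (1.0 + np.tanh(-0.5 * m))     # stable sigmoid(-m)
        w -= lr * (w - C * (X.T @ (y * sig)))     # gradient of the objective
    return w
```

The regularization constant C plays the same "weakness" role discussed for the SVM: small C keeps ‖w‖ small and yields a weak learner, large C fits the log-likelihood term aggressively.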

27 For an accessible introduction, see the chapter on “Generative and Discriminative Classifiers: Naive Bayes and Logistic Regression” provided at the following link: http://www.cs.cmu.edu/~tom/NewChapters.html


NB (Ng and Jordan 2002) relies on the use of Bayes' rule, which expresses the posterior P(Y|X) as the product of the likelihood P(X|Y) and the prior P(Y), divided by the evidence P(X). The naivety of the approach originates from assuming conditional independence of the features. We impose a multivariate event model regarding the distribution P(X|Y). Junqué de Fortuny et al (2014a) provide an efficient NB implementation for dealing with large and sparse binary behaviour data.
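A minimal multivariate Bernoulli sketch with Laplace smoothing, purely to illustrate the arithmetic (the paper relies on the optimized implementation of Junqué de Fortuny et al; a truly sparse version would precompute Σ log(1−p) per class and correct it only for the active features, which is the identity the scoring line below exploits):

```python
import numpy as np

def bernoulli_nb_fit(X, y, alpha=1.0):
    """Multivariate Bernoulli NB with Laplace smoothing; X is a 0/1 matrix."""
    params = {}
    for c in np.unique(y):
        Xc = X[y == c]
        p = (Xc.sum(axis=0) + alpha) / (Xc.shape[0] + 2 * alpha)
        params[c] = (np.log(Xc.shape[0] / len(y)), np.log(p), np.log1p(-p))
    return params

def bernoulli_nb_scores(params, X):
    classes = sorted(params)
    out = np.empty((X.shape[0], len(classes)))
    for j, c in enumerate(classes):
        prior, logp, log1mp = params[c]
        # log P(x|c) = sum over all features of log(1-p), corrected for the
        # active features by (log p - log(1-p)) -- a single sparse dot product
        out[:, j] = prior + X @ (logp - log1mp) + log1mp.sum()
    return out
```

The correction trick is what keeps the cost per instance proportional to its number of active behaviours rather than to the full dimensionality.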

The BeSim approach conceptually corresponds to projecting the bigraph to a unigraph using the procedure outlined in Section 3.2. Next, the weighted-vote relational neighbour classifier (Macskassy and Provost 2007) is used to infer unknown labels/scores through a weighted probability estimation using the known labels of neighbouring nodes. A very fast and scalable implementation of this approach (the SW-transformation) is presented in detail in Stankova et al (2015).
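Conceptually, the two steps can be sketched as follows (illustrative only, with the projected edge weight taken to be the number of shared behaviours; the actual SW-transformation of Stankova et al avoids materializing the projected unigraph):

```python
import numpy as np

def wvrn_scores(X, y_known, labeled_mask):
    """wvRN on the projected unigraph: with 0/1 behaviour matrix X, the
    weight between two instances is the number of behaviours they share
    (X @ X.T counts co-occurrences); each node is scored by the weighted
    mean of its labeled neighbours' labels (0.5 when it has none)."""
    W = X @ X.T                          # bigraph -> unigraph projection
    np.fill_diagonal(W, 0)               # no self-links
    Wl = W[:, labeled_mask].astype(float)
    num = Wl @ y_known[labeled_mask]
    den = Wl.sum(axis=1)
    return np.divide(num, den, out=np.full(len(X), 0.5), where=den > 0)
```

Here the score of an unlabeled node sharing behaviours only with positively labeled nodes is 1, only with negatively labeled nodes 0, and a node without labeled neighbours falls back on the uninformative prior 0.5.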

In Table 10 we compare the performances of the baseline (BL) application of the methods with their EE-version (with S = 15, T = 20). It is quite striking to see the EE-version dominating the baseline version when looking at each method individually. This suggests that the imbalanced learning issue has an impact on each of the aforementioned techniques. Focussing on the regularization based approaches, we observe the baseline LR to be less influenced by the imbalance than the SVM method. Their EE-versions (EE SVM and EE LR) can boost the baseline performance in almost all cases and have the highest average ranks amongst all the proposed methods. As already emphasized, the regularization based approaches offer an added element of flexibility, in the sense that the strength of the underlying base learner can be controlled.

The heuristic approaches NB and BeSim do not offer this flexibility. Furthermore, it is not directly possible to integrate instance weights during the training phase of the base learner. During the boosting process, we therefore sample from the distribution Dt to generate (unweighted) examples and train a BeSim/NB learner next. The results in Table 10 indicate BL NB to be the worst performing algorithm amongst all methods tried. Due to the conditional independence assumption, one could regard this as a weak learner, which is therefore suitable to be used in a boosting process. Indeed, the EE NB approach is able to significantly improve upon its performance. We found this to be the slowest EE-version²⁸ (eg taking more than 24 hours to process a single fold of the Bank dataset, versus several minutes for the other versions). The average rank estimate of EE NB is rather pessimistic, since we assigned the lowest performance across all methods in case of missing results. BeSim can be considered a stronger learner compared to NB; its possible boost in performance is typically lower than that of NB. Having said this, we found EE BeSim to improve upon BL BeSim for all datasets (except Adver(p = 1)).
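The weight-to-sample step mentioned above can be sketched in a few lines: instead of handing the base learner the boosting weights Dt, each round materializes them by drawing an unweighted bootstrap sample with selection probabilities Dt:

```python
import numpy as np

def weighted_to_unweighted(D, n_samples, seed=0):
    """Draw an unweighted training sample of indices i ~ D_t, so that a
    weight-agnostic learner (NB, BeSim) sees the boosting distribution
    through duplicated/omitted instances instead of explicit weights."""
    D = np.asarray(D, dtype=float)
    D = D / D.sum()                  # D_t must be a proper distribution
    rng = np.random.default_rng(seed)
    return rng.choice(D.size, size=n_samples, replace=True, p=D)
```

Hard instances (large Dt entries) are then duplicated in the sample, while well-classified ones may be omitted entirely, approximating the effect of explicit instance weighting.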

28 This statement only relates to the EE-version. In the baseline case, the timings of the NB-version are roughly of the same order of magnitude as SVM and LR (though this heavily depends on the chosen regularization parameter C). With EE, each subset has a size limited to twice the minority class training size (a relatively low number compared to the size of the majority class). This works in favour of SVM/LR (eg SVM has a computational complexity of O(m²), with m the number of training instances). Also note that the NB-version is indeed optimized for sparse data. Internally, the NB-implementation writes a fairly large model file to a specified directory (sizes of 1 GB can be encountered there). This writing operation slows down the training stage (especially in case of EE, where S × T = 15 × 20 = 300 model files are constructed).


Table 10 Average tenfold AUC test set performance of the BL compared to its EE-version (S = 15, T = 20) for SVM, LR, BeSim and NB. In case the EE NB implementation crashed, we put a × in the table. If it is too slow (> 24 hours per fold), we place a []. Additionally, average ranks are assigned to each method. In case EE NB has missing results for a particular dataset, we ensure it has the lowest rank across all other algorithms

          Mov G(p = 1)  Mov G(p = 25)  Mov Th(p = [])  Yahoo A(p = 1)
BL SVM    71.6 (2.62)   81.41 (1.32)   79.77 (5.33)    56.49 (3.37)
EE SVM    76.12 (2.88)  85.13 (1.86)   86.43 (5.86)    59.74 (2.96)
BL LR     71.02 (2.09)  84.39 (1.84)   83.14 (4.17)    57.84 (2.39)
EE LR     76.69 (2.92)  85.03 (1.98)   86.3 (5.37)     59.79 (2.62)
BL BeSim  76.1 (3.58)   81.3 (2.92)    82.81 (6.6)     56.27 (2.73)
EE BeSim  76.31 (3.71)  81.37 (2.9)    85.02 (6.28)    57.7 (1.71)
BL NB     70.26 (5.84)  77.01 (2.54)   70.48 (10.14)   52.56 (2.09)
EE NB     75.93 (2.83)  85.56 (2.01)   86.91 (4.15)    57.55 (2.73)

          Yahoo A(p = 25)  Yahoo G(p = 1)  Yahoo G(p = 25)  TaFeng(p = 1)
BL SVM    61.61 (2.48)     66.84 (3.66)    78.82 (1.39)     55.75 (1.6)
EE SVM    66.38 (3.16)     73.48 (2.32)    80.55 (1.55)     61.13 (1.83)
BL LR     66.27 (2.96)     69.82 (1.93)    80.45 (1.59)     58.91 (2.31)
EE LR     66.22 (3.28)     73.08 (2.14)    80.53 (1.56)     61.43 (2.32)
BL BeSim  64.54 (2.02)     68.89 (2.49)    79.55 (1.96)     57.89 (1.18)
EE BeSim  65.25 (2.23)     71.18 (2.91)    80.04 (1.85)     59.36 (1.47)
BL NB     65 (1.65)        63.33 (2.56)    78.89 (1.64)     54.61 (1.2)
EE NB     66.6 (2.79)      70.99 (2.88)    81.01 (1.3)      59.01 (1.84)

          TaFeng(p = 25)  Book(p = 1)   Book(p = 25)  LST(p = 1)
BL SVM    66.94 (1.34)    52.6 (1.29)   60.08 (0.71)  99.99 (0.01)
EE SVM    70.4 (1.3)      55.34 (1.28)  65.4 (0.51)   99.98 (0.01)
BL LR     69.24 (1.3)     55.34 (1.27)  63.84 (0.75)  99.99 (0.01)
EE LR     70.28 (1.28)    55.49 (1.49)  65.41 (0.63)  99.97 (0.02)
BL BeSim  67.49 (1.23)    55.19 (1.27)  63.7 (0.63)   99.99 (0.01)
EE BeSim  68 (1.21)       55.21 (1.15)  64.38 (0.42)  99.99 (0)
BL NB     65.21 (1.64)    52.93 (0.9)   59.75 (0.47)  98.69 (0.3)
EE NB     70.72 (1.15)    ×             63.46 (0.61)  99.92 (0.04)

          Adver(p = [])  Adver(p = 1)  CRF(p = [])    Bank(p = [])
BL SVM    96.37 (1.94)   91.18 (2.97)  64.36 (18.97)  66.82 (0.88)
EE SVM    97.63 (1.35)   93.3 (2.14)   86.35 (9.99)   71.54 (0.76)
BL LR     97.19 (1.44)   88.51 (1.93)  81.87 (19.63)  71.43 (0.72)
EE LR     97.57 (0.96)   93.02 (2.06)  86.84 (9.62)   71.77 (0.62)
BL BeSim  97.26 (1.12)   95.38 (1.35)  86.91 (9.36)   67.85 (0.67)
EE BeSim  97.38 (1.04)   93.83 (1.35)  87.02 (10.43)  70.41 (0.55)
BL NB     93.75 (1.9)    93.37 (1.9)   87.24 (9.38)   67.83 (0.63)
EE NB     94.04 (1.75)   ×             ×              []

          Flickr(p = 01)  Kdd(p = 05)  Average Rank
BL SVM    74.92 (0.17)    74.53 (0.05)  6.44 [7]
EE SVM    79.86 (0.13)    80.98 (0.05)  2.39 [1]
BL LR     79.03 (0.11)    81.29 (0.04)  4.28 [4]
EE LR     79.85 (0.13)    80.75 (0.05)  2.61 [2]
BL BeSim  74.62 (0.13)    74.95 (0)     5.11 [6]
EE BeSim  76.4 (0.13)     77.55 (0.03)  3.61 [3]
BL NB     81.36 (0.1)     74.29 (0.05)  6.5 [8]
EE NB     []              []            5.06 [5]


6 Conclusions and future research directions

We investigated the effects of sampling and cost-sensitive learning techniques on the problem of learning from imbalanced behaviour data. This type of big data has fundamentally different properties and internal structure than traditional data, and is characterized by sparseness and very large dimensions. This is a new research topic, which enables benefits across a wide variety of application domains such as fraud detection, targeted advertising, predictive policing, churn prediction, default prediction, etc. To enable future developments in this field, we provide an on-line repository containing implementations, datasets and results, available at the following link: http://www.applieddatamining.com/cms/?q=software

Oversampling techniques need to be adapted in order to cope with this kind of data. We proposed new versions of the traditional SMOTE and ADASYN techniques, which differ in the way synthetic samples are generated and similarity computations are performed. The plain random oversampling (OSR) and the synthetic approaches are competitive methods in terms of AUC and do not seem to suffer from overfitting, as traditional studies report. Though the synthetic approaches have comparable AUC-performances in comparison to OSR, they are useless in practice due to their computationally expensive procedures. The OSR method is very fast for small and medium sized datasets, yet becomes inappropriate for larger datasets, because the training times of the underlying base learner increase drastically.

With respect to undersampling methodologies, we employed a random undersampling technique (RUS) and Knn-based informed approaches. The latter type selects majority class instances that are closest (Cl Knn) or farthest (Far Knn) from the minority class examples, and allows one to investigate the effect of noise/outliers on SVM performance. We observed that these odd instances have a severe performance degrading effect. RUS and Far Knn approaches can significantly improve performance with respect to the baseline, thanks to their abilities to reduce noise/outlier levels and to put more emphasis on the minority class. Having said this, they usually perform worse than any of the oversampling approaches. This is in sheer contrast to the results obtained from traditional studies, and is heavily related to the intrinsic nature of behaviour data. Timing-wise, RUS is the fastest method. The informed approaches are slower than RUS due to the required nearest neighbour computations. Additionally, a cluster-based undersampling (CBU) technique was proposed to alleviate the within-class imbalance. For low undersampling rates, the performance of CBU is superior to plain RUS, whereas for high undersampling rates the opposite is true. Overall, no substantial gain in performance is obtained in comparison with a simple RUS approach. Furthermore, the method is the slowest among all techniques considered and relies on computationally expensive clustering heuristics.

We studied the effects of boosting, where we made use of a very specific linear SVM and logistic regression combination that uses confidence rated predictions, instead of a more popular plain "discrete" boosting algorithm (with weak hypotheses having values in {−1,1}). Our experiments clearly indicated the regularization constant C in the SVM formulation to act as a "weakness" indicator. Indeed, higher C-values cause stronger learners and shouldn't be used in the boosting process, which matches the analysis of studies dealing with traditional data. Strong learners can


already quite accurately distinguish majority class from minority class behaviours. The boosting process will focus on the hard to learn instances, among which are noise/outliers, and this can result in overfitting behaviour. We also observed that boosting can usually outperform the plain baseline, indicating that a combination of weak learners can achieve higher AUC-performance than a single strong learner.

EasyEnsemble (EE) integrates random undersampling with a boosting process. By sampling several balanced subsets from the training data and combining these, we can emphasize the minority class, achieve noise/outlier reduction effects and simultaneously explore the majority class space efficiently. Our version differs from the original formulation, because we employ a different type of underlying weak learner (SVM and LR combination) that is fed to a confidence rated boosting algorithm (instead of the discrete version). We observed superior AUC-performance with respect to all of the aforementioned techniques. Furthermore, the method is very fast, since each subset can be trained in parallel and has a size that is twice as large as the minority class training set. We studied the effect of the number of subsets and noted that a limited amount of these (O(10)) is already sufficient to obtain satisfactory performances; increasing this number further causes only minor benefits.

Additionally, a statistical comparison of each of the proposed methods was conducted across a wide variety of datasets. Holm's significance test indicates that each of the proposed techniques (oversampling, undersampling except Cl Knn, cost sensitive learning and boosting variants) can significantly outperform the plain baseline performance (each of the null-hypotheses of equal performance gets rejected with a confidence level of at least 91%). Comparison of the techniques with EE showed that the latter can significantly outperform all of them (each of the null-hypotheses of equal performance gets rejected with a confidence level of at least 75%). A timing-wise comparison showed EE to be the preferred method (after RUS) for large and highly imbalanced datasets. A parallel implementation would make EE the fastest method, even for medium sized datasets that show a high level of imbalance.

The influence of the chosen base learner was investigated by comparing the baseline to its EE-version for the SVM, LR, NB and BeSim inducers. Each of the EE-versions is able to improve upon the baseline performance for nearly all datasets, where base learners of the regularization type are superior because of their ability to control the strength of the learner.

Though we conducted an extensive benchmark study, there are still plenty of future research directions. First, and most important, there is an urgent need to develop a more general theoretical framework to completely understand the effect of imbalance and why certain methods outperform others. Secondly, other types of mechanisms could be investigated in this setting: advanced sampling (Guo et al 2008) set-ups perform sampling based on preliminary classifications; one-class learning is a recognition based methodology where a discriminative boundary is learned around the training examples of a single class alone, and Raskutti and Kowalczyk (2004) demonstrated the superiority of the recognition based one-class SVMs over the discriminative two-class SVMs; other types of ensemble learning strategies (eg bagging and stacking) could be investigated; Batista et al (2004) note that combining over- and undersampling can be beneficial when data are highly imbalanced or show very few minority class instances; the ideas behind Extended Nearest Neighbour (Tang and He 2015) could guide us in the sampling process. As opposed to Knn, "it considers not only who are the nearest neighbours of a certain sample, but also who consider the particular sample as their nearest neighbour (Tang and He 2015)". The method assigns a class membership to an unknown sample that maximizes the intra-class coherence. Finally, more advanced techniques do exist that can provide a K (the number of nearest neighbours) faster or with a (slightly) better performance.

Another issue relates to the lack of comprehensibility of the EasyEnsemble technique in its present form. Because we currently integrate Logistic Regression in the boosting process, the final ensemble classifier is non-linear and therefore incomprehensible. It might be interesting to investigate the numerical boosting technique proposed in Schapire and Singer (1999), where each weak learner is allowed to output real values (instead of values in [−1,1]). In that case, we would be able to use a plain linear SVM as a weak learner, without the Logistic Regression component. The final ensemble would be a linear model, which is surely easier to comprehend. Also, the instance-based approach of Martens and Provost (2014) could be used to explain individual predictions of the non-linear EE-technique.

A Oversampling experiments

Table 11 Oversampling experiments. Results show the average tenfold AUC test set performance with respect to increasing β-values (β = [β1, β2, β3, β4] = [0, 1/3, 2/3, 1]). Additionally, standard deviations are shown between brackets. Optimal performances for each method are highlighted in boldface. See Section 4.2 for parameter settings

MOV G(p = 1)
        β1            β2            β3            β4
OSR     71.6 (2.62)   74.37 (2.04)  73.6 (1.84)   74.73 (2.45)
SMOTE   71.6 (2.62)   75.08 (2.18)  76.02 (2.14)  76.48 (2.3)
ADASYN  71.6 (2.62)   75.16 (1.92)  75.93 (2.08)  76.47 (2.29)

MOV G(p = 25)
OSR     81.41 (1.32)  83.49 (1.81)  83.84 (1.96)  83.91 (2.04)
SMOTE   81.41 (1.32)  83.32 (1.97)  83.59 (2.04)  83.76 (2.11)
ADASYN  81.41 (1.32)  83.61 (1.82)  84.02 (1.97)  83.69 (1.96)

Mov Th(p = [])
OSR     79.77 (5.33)  85.3 (4.66)   83.16 (4.5)   84.59 (5.69)
SMOTE   79.77 (5.33)  84.18 (6.51)  85.58 (5.97)  84.33 (5.75)
ADASYN  79.77 (5.33)  84.11 (6.77)  85.86 (6.06)  85.36 (5.13)

Yahoo A(p = 1)
OSR     55.92 (2.97)  58.66 (3.27)  59.99 (2.28)  59.74 (1.78)
SMOTE   55.92 (2.97)  59.76 (2.62)  59.74 (2.67)  59.43 (2.4)
ADASYN  55.92 (2.97)  59.54 (2.53)  59.55 (2.94)  59.56 (2.22)

Yahoo A(p = 25)
OSR     61.68 (2.42)  64.19 (3.17)  65.08 (3.26)  64.67 (2.1)
SMOTE   61.68 (2.42)  65.46 (3.63)  65.33 (3.23)  64.52 (2.98)
ADASYN  61.68 (2.42)  65.04 (3.74)  65.41 (3.47)  64.4 (2.21)

Yahoo G(p = 1)
OSR     66.84 (3.66)  72.18 (2.36)  73.11 (2.7)   72.49 (3.41)
SMOTE   66.84 (3.66)  72.65 (2.85)  73.27 (3.36)  73.37 (3.56)
ADASYN  66.84 (3.66)  72.87 (2.83)  73.18 (3.2)   73.39 (3.59)

Yahoo G(p = 25)
OSR     78.82 (1.39)  78.78 (2.02)  78.74 (1.86)  78.35 (1.47)
SMOTE   78.82 (1.39)  79.23 (1.57)  79.1 (1.2)    79.03 (1.89)
ADASYN  78.82 (1.39)  79.12 (1.43)  79.23 (1.37)  79.51 (2.01)

TaFeng(p = 1)
OSR     55.75 (1.6)   59.23 (1.96)  60 (1.68)     61.04 (2.36)
SMOTE   55.75 (1.6)   60.26 (1.95)  61.49 (1.8)   61.13 (1.52)
ADASYN  55.75 (1.6)   60.26 (1.9)   61.44 (1.85)  61.16 (1.5)

TaFeng(p = 25)
OSR     66.94 (1.34)  68.84 (1.21)  66.99 (1.42)  67.7 (1.41)
SMOTE   66.94 (1.34)  68.47 (1.5)   67.07 (1.15)  66.65 (0.81)
ADASYN  66.94 (1.34)  68.62 (1.38)  67.85 (1.6)   66.91 (1.39)

Book(p = 1)
OSR     52.6 (1.29)   53.61 (0.94)  55.41 (1.75)  55.87 (1.44)
SMOTE   52.6 (1.29)   54.77 (0.99)  54.91 (0.8)   54.36 (0.98)
ADASYN  52.6 (1.29)   54.86 (1.13)  55.06 (0.73)  54.54 (0.92)

Book(p = 25)
OSR     60.08 (0.71)  61.05 (0.94)  61.82 (1.12)  64.62 (0.57)
SMOTE   60.08 (0.71)  62.6 (0.73)   60.95 (0.68)  63 (0.8)
ADASYN  60.08 (0.71)  62.33 (0.73)  60.77 (0.85)  63.04 (0.58)

LST(p = 1)
OSR     99.99 (0.01)  99.99 (0.01)  99.99 (0.01)  99.99 (0.01)
SMOTE   99.99 (0.01)  99.99 (0.01)  99.99 (0.01)  99.99 (0.01)
ADASYN  99.99 (0.01)  99.99 (0.01)  99.99 (0.01)  99.99 (0.01)

Adver(p = [])
OSR     96.61 (1.82)  97.31 (1.65)  97.07 (1.84)  97.07 (1.79)
SMOTE   96.61 (1.82)  96.91 (1.66)  97.19 (1.65)  97.07 (1.91)
ADASYN  96.61 (1.82)  97.1 (1.7)    97.08 (1.87)  97.07 (1.88)

Adver(p = 1)
OSR     90.93 (3.02)  91.27 (3.03)  92.66 (2.82)  93.29 (1.97)
SMOTE   90.93 (3.02)  92.51 (2.03)  92.96 (2.14)  93.53 (1.81)
ADASYN  90.93 (3.02)  92.22 (2.33)  92.7 (2.36)   93.88 (1.73)

CRF(p = [])
OSR     64.06 (16.43)  80.82 (12.94)  81.28 (12.27)  81.91 (11.28)
SMOTE   64.06 (16.43)  78.64 (16.86)  82.52 (13.74)  79.32 (16.26)
ADASYN  64.06 (16.43)  78.95 (16.72)  81.19 (16.32)  79.31 (16.19)

Bank(p = [])
OSR     66.82 (0.88)  70.1 (0.74)  71.39 (0.8)  71.47 (0.8)
SMOTE   []            []           []           []
ADASYN  []            []           []           []


B Undersampling experiments

Table 12 Undersampling experiments. Results show the average tenfold AUC test set performance with respect to increasing βu-values (βu = [βu1, βu2, βu3, βu4, βu5] = [0, 1/4, 1/2, 3/4, 1]). Optimal performances (boldface) are highlighted for each imbalance ratio. Additionally, standard deviations are shown between brackets. CL K represents the "Closest Knn" technique, CL T represents the "Closest tot sim" technique (similar for Far K and Far T, see Section 3.2). Parameter settings are given in Section 4.2

Mov G(p = 1)
       βu1         βu2         βu3         βu4         βu5
RUS    71.6(2.6)   71.83(2.6)  72.54(2.5)  72.39(3.1)  70.61(3.5)
Cl K   71.6(2.6)   71.4(2)     70.96(1.9)  70.43(2.4)  69.05(4.1)
CL T   71.6(2.6)   70.28(2.5)  66.74(2)    66.8(2.1)   68.18(3.6)
Far K  71.6(2.6)   72.36(2.7)  71.26(3.4)  66.57(5.2)  53.5(3.5)
Far T  71.6(2.6)   72.22(2.8)  71.63(3.6)  64.28(5.3)  50.88(4.4)
CBU    72.55(2.6)  73.28(2.6)  73.12(2.6)  73.84(2.5)  73(3.1)

Mov G(p = 25)
RUS    81.41(1.3)  81.36(1.3)  81.78(1.7)  82.05(1.7)  81.6(2.1)
Cl K   81.41(1.3)  80.86(1.2)  80.95(1.6)  79.73(2.3)  77.95(2.3)
CL T   81.41(1.3)  79.9(1.2)   78.21(1.4)  77.87(1.5)  77.76(2.3)
Far K  81.41(1.3)  80.9(1.5)   78.17(1.8)  74.25(2.4)  69.79(3.2)
Far T  81.41(1.3)  80.86(1.5)  77.2(2.4)   71.16(2.7)  62.4(2.8)
CBU    81.53(1.4)  81.64(1.3)  81.29(1.6)  81.28(2.1)  80.34(2.7)

Mov Th(p = [])
RUS    79.77(5.3)  80.32(5.8)  81.57(5.5)  81.86(6.6)  81.26(6.2)
Cl K   79.77(5.3)  79.25(4.5)  78.07(5)    76.25(6.5)  62.46(8.5)
CL T   79.77(5.3)  78.4(4.4)   72.41(3.5)  64.66(4.5)  60.37(7.3)
Far K  79.77(5.3)  84.54(5)    83.64(6.4)  80.02(7.3)  56.82(10.3)
Far T  79.77(5.3)  85.03(5.7)  82.68(6.8)  75.61(9.2)  56.77(10.9)
CBU    80.11(5.8)  81.17(6)    81.08(6.5)  84.17(5.1)  80.96(6.9)

Yahoo A(p = 1)
RUS    55.92(3)    55.57(3.4)  56.44(3)    55.83(3.4)  56.37(3.3)
Cl K   55.92(3)    55.67(2.4)  53.12(2)    50.57(1.8)  53.79(3.5)
CL T   55.92(3)    55.69(2.1)  53.35(2.2)  50.31(2.2)  52.35(3.3)
Far K  55.92(3)    57.35(2.2)  56.92(1.1)  56.95(2.3)  51.18(2)
Far T  55.92(3)    56.93(2.4)  54.74(1.9)  57.01(1.8)  51.18(2)
CBU    58.21(2.6)  58.45(3.3)  58.31(3.5)  58.39(3.5)  56.09(2.6)

Yahoo A(p = 25)
RUS    61.68(2.4)  62.9(2.9)   63.62(3.6)  63.75(3.1)  63.19(1.9)
Cl K   61.68(2.4)  61.14(2.1)  57.62(1.6)  54.02(1.8)  51.48(1.4)
CL T   61.68(2.4)  60.89(2.8)  58.11(1.4)  54.4(2.1)   51.76(1.4)
Far K  61.68(2.4)  63.96(3)    62.62(2.2)  59.61(1.5)  56.25(1.6)
Far T  61.68(2.4)  63.71(2.4)  59.72(1.6)  57.27(1.1)  54.47(1.1)
CBU    62.46(2.6)  61.85(1.4)  61.78(2.2)  59.94(3)    60.1(4)

Yahoo G(p = 1)
RUS    66.84(3.7)  67.85(3.2)  68.36(3.2)  68.23(4)    69.9(4.2)
Cl K   66.84(3.7)  66.71(2.8)  64.3(3.6)   61.98(3.9)  61.15(1.9)
CL T   66.84(3.7)  65.79(2.7)  63.55(3.3)  59.21(3.5)  61.08(2.4)
Far K  66.84(3.7)  66.76(4.1)  63.84(3.4)  65.16(2)    48.5(2.9)
Far T  66.84(3.7)  66.95(4.1)  63.48(2.9)  65.16(2)    48.48(2.9)
CBU    69.68(4.1)  70.59(3.2)  70.64(3.7)  70.2(2.9)   63.35(3.6)

Yahoo G(p = 25)
RUS    78.82(1.4)  78.91(1.6)  78.97(1.6)  78.61(1.6)  77.82(2.1)
Cl K   78.82(1.4)  77.26(1.5)  72.52(1.5)  67.86(2)    65.07(2.7)
CL T   78.82(1.4)  76.83(1)    71.99(1.8)  67.15(2.3)  61.1(2.7)
Far K  78.82(1.4)  78.26(2.2)  74.69(2.7)  67.22(2.1)  60.72(2.3)
Far T  78.82(1.4)  77.68(2.6)  72.44(3)    64.94(2.4)  59.6(2)
CBU    75.25(3.2)  75.22(2.4)  74.69(2.3)  73.07(2.4)  70.69(2.4)

TaFeng(p = 1)
RUS    55.75(1.6)  56.1(1.6)   56.26(1.7)  57.23(1.7)  59.25(2.2)
Cl K   55.75(1.6)  55.68(1.6)  55.58(1.5)  55.08(1.1)  51.05(1.5)
CL T   55.75(1.6)  55.67(1.6)  54.47(1.6)  47.53(1.6)  49.3(1.1)
Far K  55.75(1.6)  58.99(1.2)  59.47(1.1)  60.04(1.2)  56.31(1)
Far T  55.75(1.6)  58.92(1.3)  59.25(1.3)  58.58(1.1)  56.31(1)
CBU    57.8(1)     58.47(1.1)  58.15(0.9)  58.87(1.4)  57.65(1.6)

TaFeng(p = 25)
RUS    66.94(1.3)  67.44(1.3)  68.1(1.4)   68.27(1.4)  66.13(1.2)
Cl K   66.94(1.3)  66.13(1.4)  63.39(1.2)  59.83(1.3)  56.94(0.7)
CL T   66.94(1.3)  66.38(1.5)  62.89(1.6)  57.46(1.3)  54.56(1.3)
Far K  66.94(1.3)  68.06(1.4)  66.43(1.6)  64.46(1.5)  63.35(1.3)
Far T  66.94(1.3)  64.31(1.1)  62.69(1)    61.27(1.1)  59.03(1)
CBU    64.81(1.2)  64.15(1.1)  64.13(1.2)  63.88(0.8)  63.46(0.8)

Book(p = 1)
RUS    52.6(1.3)   52.79(0.9)  53.46(0.8)  53.89(0.9)  54.05(0.9)
Cl K   52.6(1.3)   52.56(1.2)  52.52(1.3)  52.39(1.1)  53.09(1.1)
CL T   52.6(1.3)   52.56(1.2)  52.52(1.3)  52.39(1.1)  53.05(0.7)
Far K  52.6(1.3)   55.21(1.2)  56.21(1.8)  56.14(1.2)  53.06(1)
Far T  52.6(1.3)   55.21(1.2)  56.21(1.8)  56.14(1.2)  53.06(1)
CBU    54.28(0.9)  53.77(1)    53.33(1.1)  53.34(0.9)  52.84(0.8)

Book(p = 25)
RUS    60.08(0.7)  60.13(0.6)  60.4(0.8)   60.33(0.8)  63.28(0.8)
Cl K   60.08(0.7)  59.96(0.7)  60.13(0.8)  59.96(1)    59.28(0.7)
CL T   60.08(0.7)  59.96(0.7)  60.13(0.8)  60.29(0.4)  54.5(0.9)
Far K  60.08(0.7)  63.29(1)    64.19(0.8)  57.3(1.1)   55.66(1.1)
Far T  60.08(0.7)  62.14(0.5)  58.27(0.6)  56.37(1)    55.66(1.1)
CBU    54.82(0.9)  54.67(0.9)  54.71(0.9)  54.66(1)    54.78(0.9)

LST(p = 1)
RUS    99.99(0)    99.99(0)    99.99(0)    99.98(0)    99.99(0)
Cl K   99.99(0)    99.99(0)    99.99(0)    99.99(0)    99.99(0)
CL T   99.99(0)    99.99(0)    99.99(0)    99.99(0)    99.98(0)
Far K  99.99(0)    99.98(0)    99.98(0)    99.98(0)    99.98(0)
Far T  99.99(0)    99.98(0)    99.98(0)    99.98(0)    99.98(0)
CBU    []          []          []          []          []

Adver(p = [])
RUS    96.61(1.8)  96.32(1.8)  96.63(1.4)  97.12(2.1)  96.22(1.6)
Cl K   96.61(1.8)  96.44(1.5)  96.14(1.5)  96.04(2)    94.8(2.5)
CL T   96.61(1.8)  95.87(2.1)  94.32(1.9)  93.01(2.2)  90.72(2.3)
Far K  96.61(1.8)  96.53(1.4)  95.76(2)    94.39(1.8)  90.49(3.1)
Far T  96.61(1.8)  96.54(1.5)  95.67(1.9)  94.54(1.8)  89.3(2.8)
CBU    96.85(2.3)  96.85(2.3)  97.05(1.5)  96.6(1.6)   96.06(2.1)

Adver(p = 1)
RUS    90.93(3)    91.53(3.1)  92.37(3.4)  91.9(2.9)   91.93(2.2)
Cl K   90.93(3)    90.64(3)    89.87(3.9)  90.21(3.6)  89.18(2)
CL T   90.93(3)    89.7(3.5)   88.55(3.4)  85.76(3.3)  88.2(2.3)
Far K  90.93(3)    93.8(2.3)   92.4(2.6)   88.73(3.4)  85.51(4)
Far T  90.93(3)    93.62(2.4)  93.2(2.2)   88.41(3.6)  85.51(4)
CBU    93.22(2.4)  93.76(2.5)  93.89(2.6)  93.52(2.7)  91.27(2)

CRF(p = [])
RUS    64.06(16.4)  63.28(15.9)  67.98(17.4)  66.95(21.9)  87.73(8.8)
Cl K   64.06(16.4)  62.44(16.6)  62.34(16.9)  71.37(13.8)  78.22(17.7)
CL T   64.06(16.4)  62.44(16.6)  62.34(16.9)  71.37(13.8)  62.67(22.9)
Far K  64.06(16.4)  83.8(14.2)   83.93(14.8)  84.49(13.7)  86.11(9.7)
Far T  64.06(16.4)  83.8(14.2)   83.93(14.8)  84.49(13.7)  86.11(9.7)
CBU    []           []           []           []           []

Bank(p = [])
RUS    66.82(0.9)  67.02(0.9)  67.37(0.8)  67.99(0.6)  69.5(1)
Cl K   66.82(0.9)  66.17(0.7)  65.24(0.6)  64.86(0.6)  58.53(1.1)
CL T   66.82(0.9)  64.92(1.1)  60.69(0.9)  56.33(0.8)  52.87(0.7)
Far K  66.82(0.9)  66.95(0.6)  66.19(0.6)  64.42(0.6)  58.25(1.1)
Far T  66.82(0.9)  67.16(0.6)  64.2(0.8)   59.67(1)    58.25(1.1)
CBU    []          []          []          []          []

48 Jellis Vanhoeyveld David Martens

C Boosting experiments

This section presents the results of each of the boosting experiments from Section 4.5 on the remaining data sources. Each of the figures below depicts the average tenfold AUC performance on test data (with μ = 100) with respect to the number of boosting iterations for (left) AdaBoost (AB), AdaCost (AC) and EasyEnsemble (EE), with C chosen according to the highest validation set AUC performance (over all possible boosting rounds), and (right) AB and EE (with S = 15) at varying C-levels.

[Figure: tenfold test AUC versus boosting rounds T. (a) AB, AC and EE variants against BL; (b) AB and EE for C ∈ {1e-07, 1e-05, 0.001, 0.1} against BL]
Fig. 6 Mov G(p = 1) dataset

[Figure: tenfold test AUC versus boosting rounds T. (a) AB, AC and EE variants against BL; (b) AB and EE for C ∈ {1e-07, 1e-05, 0.001, 0.1} against BL]
Fig. 7 Mov Th(p = []) dataset


[Figure: tenfold test AUC versus boosting rounds T. (a) AB, AC and EE variants against BL; (b) AB and EE for C ∈ {1e-07, 1e-05, 0.001, 0.1} against BL]
Fig. 8 Yahoo A(p = 1) dataset

[Figure: tenfold test AUC versus boosting rounds T. (a) AB, AC and EE variants against BL; (b) AB and EE for C ∈ {1e-07, 1e-05, 0.001, 0.1} against BL]
Fig. 9 Yahoo A(p = 25) dataset

[Figure: tenfold test AUC versus boosting rounds T. (a) AB, AC and EE variants against BL; (b) AB and EE for C ∈ {1e-07, 1e-05, 0.001, 0.1} against BL]
Fig. 10 Yahoo G(p = 1) dataset


[Figure: tenfold test AUC versus boosting rounds T. (a) AB, AC and EE variants against BL; (b) AB and EE for C ∈ {1e-07, 1e-05, 0.001, 0.1} against BL]
Fig. 11 Yahoo G(p = 25) dataset

[Figure: tenfold test AUC versus boosting rounds T. (a) AB, AC and EE variants against BL; (b) AB and EE for C ∈ {1e-07, 1e-05, 0.001, 0.1} against BL]
Fig. 12 TaFeng(p = 1) dataset

[Figure: tenfold test AUC versus boosting rounds T. (a) AB, AC and EE variants against BL; (b) AB and EE for C ∈ {1e-07, 1e-05, 0.001, 0.1} against BL]
Fig. 13 Book(p = 1) dataset


[Figure: tenfold test AUC versus boosting rounds T. (a) AB, AC and EE variants against BL; (b) AB and EE for C ∈ {1e-07, 1e-05, 0.001, 0.1} against BL]
Fig. 14 LST(p = 1) dataset

[Figure: tenfold test AUC versus boosting rounds T. (a) AB, AC and EE variants against BL; (b) AB and EE for C ∈ {1e-07, 1e-05, 0.001, 0.1} against BL]
Fig. 15 Adver(p = []) dataset

[Figure: tenfold test AUC versus boosting rounds T. (a) AB, AC and EE variants against BL; (b) AB and EE for C ∈ {1e-07, 1e-05, 0.001, 0.1} against BL]
Fig. 16 Adver(p = 1) dataset


[Figure: tenfold test AUC versus boosting rounds T. (a) AB, AC and EE variants against BL; (b) AB and EE for C ∈ {1e-07, 1e-05, 0.001, 0.1} against BL]
Fig. 17 CRF(p = []) dataset

D Final Comparison

[Scatter plot: average rank AUC (horizontal axis) versus average rank Time (vertical axis) for BL, OS, RS, SMOTE, ADASYN, RUS, Cl Knn, Far Knn, CBU, AB, AC and EE variants, and EE par]
Fig. 18 Average rank AUC versus average rank Time (see Table 9) across the large datasets from Table 2 (CRF and Bank). Regarding the former, an extra row EE par is added, having the same AUC-value as EE(S = 15). Ranks for each dataset are subsequently obtained via the procedure outlined in Section 4.6.1. Points occurring in the upper-right region are preferred.


References

Akbani R, Kwek S, Japkowicz N (2004) Applying support vector machines to imbalanced datasets. In: Machine Learning: ECML 2004, 15th European Conference on Machine Learning, Pisa, Italy, September 20-24, 2004, Proceedings. Springer Berlin Heidelberg, Berlin, Heidelberg, pp 39–50. DOI 10.1007/978-3-540-30115-8_7

Ali A, Shamsuddin SM, Ralescu AL (2015) Classification with class imbalance problem: A review. International Journal of Advances in Soft Computing and its Applications 7(3):176–204

Alzahrani T, Horadam KJ (2016) Community detection in bipartite networks: Algorithms and case studies. In: Complex Systems and Networks: Dynamics, Controls and Applications. Springer Berlin Heidelberg, Berlin, Heidelberg, pp 25–50. DOI 10.1007/978-3-662-47824-0_2

Bachner J (2013) Predictive policing: Preventing crime with data and analytics. IBM Center for the Business of Government

Baesens B, Van Gestel T, Viaene S, Stepanova M, Suykens J, Vanthienen J (2003) Benchmarking state-of-the-art classification algorithms for credit scoring. Journal of the Operational Research Society 54(6):627–635. DOI 10.1057/palgrave.jors.2601545

Barandela R, Sánchez J, García V, Rangel E (2003) Strategies for learning in class imbalance problems. Pattern Recognition 36(3):849–851. DOI 10.1016/S0031-3203(02)00257-1

Barber MJ (2007) Modularity and community detection in bipartite networks. Physical Review E 76:066102. DOI 10.1103/PhysRevE.76.066102

Barua S, Islam MM, Yao X, Murase K (2014) MWMOTE–majority weighted minority oversampling technique for imbalanced data set learning. IEEE Transactions on Knowledge and Data Engineering 26(2):405–425. DOI 10.1109/TKDE.2012.232

Batista GEAPA, Prati RC, Monard MC (2004) A study of the behavior of several methods for balancing machine learning training data. SIGKDD Explor Newsl 6(1):20–29. DOI 10.1145/1007730.1007735

Beckett SJ (2016) Improved community detection in weighted bipartite networks. Royal Society Open Science 3(1). DOI 10.1098/rsos.140536

Bekkar M, Djemaa HK, Alitouche TA (2013) Evaluation measures for models assessment over imbalanced data sets. Journal of Information Engineering and Applications 3(10):27–38

Bhattacharyya S, Jha S, Tharakunnel K, Westland JC (2011) Data mining for credit card fraud: A comparative study. Decision Support Systems 50(3):602–613. DOI 10.1016/j.dss.2010.08.008

Blondel VD, Guillaume JL, Lambiotte R, Lefebvre E (2008) Fast unfolding of communities in large networks. Journal of Statistical Mechanics: Theory and Experiment 2008(10):P10008

Breiman L, Friedman J, Stone CJ, Olshen RA (1984) Classification and regression trees. Taylor & Francis

Brozovsky L, Petricek V (2007) Recommender system for online dating service. In: Proceedings of Znalosti 2007 Conference, VSB Ostrava

Cha M, Mislove A, Gummadi KP (2009) A measurement-driven analysis of information propagation in the Flickr social network. In: Proceedings of the 18th International Conference on World Wide Web, ACM, New York, NY, USA, WWW '09, pp 721–730. DOI 10.1145/1526709.1526806

Chawla NV (2005) Data mining for imbalanced datasets: An overview. In: Data mining and knowledge discovery handbook. Springer US, Boston, MA, pp 853–867

Chawla NV, Bowyer KW, Hall LO, Kegelmeyer WP (2002) SMOTE: synthetic minority over-sampling technique. Journal of Artificial Intelligence Research 16:321–357

Chawla NV, Lazarevic A, Hall LO, Bowyer KW (2003) SMOTEBoost: Improving prediction of the minority class in boosting. In: Knowledge Discovery in Databases: PKDD 2003. Springer Berlin Heidelberg, Berlin, Heidelberg, pp 107–119

Chawla NV, Japkowicz N, Kotcz A (2004) Editorial: Special issue on learning from imbalanced data sets. SIGKDD Explor Newsl 6(1):1–6. DOI 10.1145/1007730.1007733

Chen M, Mao S, Liu Y (2014) Big data: A survey. Mobile Networks and Applications 19(2):171–209. DOI 10.1007/s11036-013-0489-0

Chyi YM (2003) Classification analysis techniques for skewed class distribution problems. Master thesis, Department of Information Management, National Sun Yat-Sen University

Demšar J (2006) Statistical comparisons of classifiers over multiple data sets. Journal of Machine Learning Research 7(Jan):1–30

Dhillon IS (2001) Co-clustering documents and words using bipartite spectral graph partitioning. In: Proceedings of the Seventh ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, ACM, New York, NY, USA, KDD '01, pp 269–274. DOI 10.1145/502512.502550


Drummond C, Holte RC (2003) C4.5, class imbalance, and cost sensitivity: why under-sampling beats over-sampling. In: Proceedings of the ICML '03 Workshop on Learning from Imbalanced Datasets

Fan RE, Chang KW, Hsieh CJ, Wang XR, Lin CJ (2008) LIBLINEAR: A library for large linear classification. Journal of Machine Learning Research 9:1871–1874

Fan W, Stolfo SJ, Zhang J, Chan PK (1999) AdaCost: Misclassification cost-sensitive boosting. In: Proceedings of the Sixteenth International Conference on Machine Learning, Morgan Kaufmann Publishers Inc., San Francisco, CA, USA, ICML '99, pp 97–105

Fawcett T (2006) An introduction to ROC analysis. Pattern Recognition Letters 27(8):861–874. DOI 10.1016/j.patrec.2005.10.010

Finch H (2005) Comparison of distance measures in cluster analysis with dichotomous data. Journal of Data Science 3(1):85–100

Fortunato S (2010) Community detection in graphs. Physics Reports 486(3-5):75–174. DOI 10.1016/j.physrep.2009.11.002

Junqué de Fortuny E, Martens D, Provost F (2014a) Predictive modeling with big data: is bigger really better? Big Data 1(4):215–226. DOI 10.1089/big.2013.0037

Junqué de Fortuny E, Stankova M, Moeyersoms J, Minnaert B, Provost F, Martens D (2014b) Corporate residence fraud detection. In: Proceedings of the 20th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, ACM, New York, NY, USA, KDD '14, pp 1650–1659. DOI 10.1145/2623330.2623333

Frasca M, Bertoni A, Re M, Valentini G (2013) A neural network algorithm for semi-supervised node label learning from unbalanced data. Neural Networks 43:84–98. DOI 10.1016/j.neunet.2013.01.021

Friedman M (1937) The use of ranks to avoid the assumption of normality implicit in the analysis of variance. Journal of the American Statistical Association 32(200):675–701

García E, Lozano F (2007) Boosting support vector machines. In: Machine Learning and Data Mining in Pattern Recognition, 5th International Conference, MLDM 2007, Leipzig, Germany, July 18-20, Post Proceedings. IBaI publishing, pp 153–167

Goldstein M, Uchida S (2016) A comparative evaluation of unsupervised anomaly detection algorithms for multivariate data. PLOS ONE 11(4):1–31. DOI 10.1371/journal.pone.0152173

González PC, Velásquez JD (2013) Characterization and detection of taxpayers with false invoices using data mining techniques. Expert Systems with Applications 40(5):1427–1436. DOI 10.1016/j.eswa.2012.08.051

Guimerà R, Sales-Pardo M, Amaral LAN (2007) Module identification in bipartite and directed networks. Physical Review E 76:036102. DOI 10.1103/PhysRevE.76.036102

Guo H, Viktor HL (2004) Learning from imbalanced data sets with boosting and data generation: The DataBoost-IM approach. SIGKDD Explor Newsl 6(1):30–39. DOI 10.1145/1007730.1007736

Guo X, Yin Y, Dong C, Yang G, Zhou G (2008) On the class imbalance problem. In: 2008 Fourth International Conference on Natural Computation, IEEE, vol 4, pp 192–201. DOI 10.1109/ICNC.2008.871

Han H, Wang WY, Mao BH (2005) Borderline-SMOTE: A new over-sampling method in imbalanced data sets learning. In: Advances in Intelligent Computing. Springer Berlin Heidelberg, Berlin, Heidelberg, pp 878–887

He H, Garcia EA (2009) Learning from imbalanced data. IEEE Transactions on Knowledge and Data Engineering 21(9):1263–1284. DOI 10.1109/TKDE.2008.239

He H, Bai Y, Garcia EA, Li S (2008) ADASYN: Adaptive synthetic sampling approach for imbalanced learning. In: 2008 IEEE International Joint Conference on Neural Networks (IEEE World Congress on Computational Intelligence), IEEE, pp 1322–1328. DOI 10.1109/IJCNN.2008.4633969

Holm S (1979) A simple sequentially rejective multiple test procedure. Scandinavian Journal of Statistics 6(2):65–70

Hsu CW, Lin CJ (2002) A comparison of methods for multiclass support vector machines. IEEE Transactions on Neural Networks 13(2):415–425. DOI 10.1109/72.991427

Huang A (2008) Similarity measures for text document clustering. In: Proceedings of the Sixth New Zealand Computer Science Research Student Conference (NZCSRSC2008), Christchurch, New Zealand, pp 49–56

Iman RL, Davenport JM (1980) Approximations of the critical region of the Friedman statistic. Communications in Statistics-Theory and Methods 9(6):571–595

Jo T, Japkowicz N (2004) Class imbalances versus small disjuncts. ACM SIGKDD Explor Newsl 6(1):40–49. DOI 10.1145/1007730.1007737


Jutla IS, Jeub LG, Mucha PJ (2011-2016) A generalized Louvain method for community detection implemented in MATLAB. URL http://netwiki.amath.unc.edu/GenLouvain

Kubat M, Matwin S (1997) Addressing the curse of imbalanced training sets: One-sided selection. In: Proceedings of the Fourteenth International Conference on Machine Learning, Morgan Kaufmann Publishers Inc., San Francisco, CA, USA, pp 179–186

Lancichinetti A, Fortunato S (2009) Community detection algorithms: A comparative analysis. Physical Review E 80:056117. DOI 10.1103/PhysRevE.80.056117

Larremore DB, Clauset A, Jacobs AZ (2014) Efficiently inferring community structure in bipartite networks. Physical Review E: Statistical, Nonlinear, and Soft Matter Physics 90:012805. DOI 10.1103/PhysRevE.90.012805

Li J, Fine JP (2010) Weighted area under the receiver operating characteristic curve and its application to gene selection. Journal of the Royal Statistical Society: Series C (Applied Statistics) 59(4):673–692. DOI 10.1111/j.1467-9876.2010.00713.x

Li X, Wang L, Sung E (2008) AdaBoost with SVM-based component classifiers. Engineering Applications of Artificial Intelligence 21(5):785–795. DOI 10.1016/j.engappai.2007.07.001

Lichman M (2013) UCI machine learning repository. URL http://archive.ics.uci.edu/ml

Liu W, Chawla S, Cieslak DA, Chawla NV (2010) A robust decision tree algorithm for imbalanced data sets. In: Proceedings of the Tenth SIAM International Conference on Data Mining, SIAM, Philadelphia, vol 10, pp 766–777

Liu XY, Wu J, Zhou ZH (2009) Exploratory undersampling for class-imbalance learning. IEEE Transactions on Systems, Man, and Cybernetics, Part B (Cybernetics) 39(2):539–550. DOI 10.1109/TSMCB.2008.2007853

Luts J, Ojeda F, Van de Plas R, De Moor B, Van Huffel S, Suykens JA (2010) A tutorial on support vector machine-based methods for classification problems in chemometrics. Analytica Chimica Acta 665(2):129–145. DOI 10.1016/j.aca.2010.03.030

Macskassy SA, Provost F (2007) Classification in networked data: A toolkit and a univariate case study. Journal of Machine Learning Research 8(May):935–983

Martens D, Provost F (2014) Explaining data-driven document classifications. MIS Quarterly 38(1):73–100

Martens D, Provost F, Clark J, Junqué de Fortuny E (2016) Mining massive fine-grained behavior data to improve predictive analytics. MIS Quarterly 40(4):869–888

Mazurowski MA, Habas PA, Zurada JM, Lo JY, Baker JA, Tourassi GD (2008) Training neural network classifiers for medical decision making: The effects of imbalanced datasets on classification performance. Neural Networks 21(2-3):427–436. DOI 10.1016/j.neunet.2007.12.031

Mease D, Wyner AJ, Buja A (2007) Boosted classification trees and class probability/quantile estimation. Journal of Machine Learning Research 8:409–439

Nemenyi P (1963) Distribution-free Multiple Comparisons. Dissertation, Princeton University

Newman MEJ, Girvan M (2004) Finding and evaluating community structure in networks. Physical Review E 69:026113. DOI 10.1103/PhysRevE.69.026113

Ng AY (2004) Feature selection, L1 vs. L2 regularization, and rotational invariance. In: Proceedings of the Twenty-first International Conference on Machine Learning, ACM, New York, NY, USA, ICML '04, pp 78–. DOI 10.1145/1015330.1015435

Ng AY, Jordan MI (2002) On discriminative vs. generative classifiers: A comparison of logistic regression and naive Bayes. In: Advances in Neural Information Processing Systems 14, MIT Press, pp 841–848

Ngai E, Hu Y, Wong Y, Chen Y, Sun X (2011) The application of data mining techniques in financial fraud detection: A classification framework and an academic review of literature. Decision Support Systems 50(3):559–569. DOI 10.1016/j.dss.2010.08.006

Platt JC (1999) Probabilistic outputs for support vector machines and comparisons to regularized likelihood methods. In: Advances in Large-Margin Classifiers, MIT Press, pp 61–74

Porter MA, Onnela JP, Mucha PJ (2009) Communities in networks. Notices of the American Mathematical Society 56(9):1082–1097

Provost F, Fawcett T (2013) Data Science for Business: What you need to know about data mining and data-analytic thinking. O'Reilly Media, Inc.

Provost F, Dalessandro B, Hook R, Zhang X, Murray A (2009) Audience selection for on-line brand advertising: Privacy-friendly social network targeting. In: Proceedings of the 15th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, ACM, New York, NY, USA, KDD '09, pp 707–716. DOI 10.1145/1557019.1557098


Raskutti B, Kowalczyk A (2004) Extreme re-balancing for SVMs: A case study. SIGKDD Explor Newsl 6(1):60–69. DOI 10.1145/1007730.1007739

Rosvall M, Bergstrom CT (2008) Maps of random walks on complex networks reveal community structure. Proceedings of the National Academy of Sciences 105(4):1118–1123. DOI 10.1073/pnas.0706851105

Schapire RE (1999) A brief introduction to boosting. In: Proceedings of the 16th International Joint Conference on Artificial Intelligence - Volume 2, Morgan Kaufmann Publishers Inc., San Francisco, CA, USA, IJCAI '99, pp 1401–1406

Schapire RE, Singer Y (1999) Improved boosting algorithms using confidence-rated predictions. Machine Learning 37(3):297–336. DOI 10.1023/A:1007614523901

Shmueli G (2017) Analyzing behavioral big data: Methodological, practical, ethical, and moral issues. Quality Engineering 29(1):57–74. DOI 10.1080/08982112.2016.1210979

Sobhani P, Viktor H, Matwin S (2015) Learning from imbalanced data using ensemble methods and cluster-based undersampling. In: New Frontiers in Mining Complex Patterns: Third International Workshop, NFMCP 2014, Held in Conjunction with ECML-PKDD 2014, Nancy, France, September 19, 2014, Revised Selected Papers. Springer International Publishing, Cham, pp 69–83. DOI 10.1007/978-3-319-17876-9_5

Stankova M (2016) Classification within network data with a bipartite structure. Dissertation, University of Antwerp

Stankova M, Martens D, Provost F (2015) Classification over bipartite graphs through projection. Working Papers 2015001, University of Antwerp, Faculty of Applied Economics

Sun Y, Kamel MS, Wong AK, Wang Y (2007) Cost-sensitive boosting for classification of imbalanced data. Pattern Recognition 40(12):3358–3378. DOI 10.1016/j.patcog.2007.04.009

Suykens JA, Van Gestel T, De Brabanter J, De Moor B, Vandewalle J (2002) Least squares support vector machines. World Scientific, Singapore

Tang B, He H (2015) ENN: Extended nearest neighbor method for pattern recognition [research frontier]. IEEE Computational Intelligence Magazine 10(3):52–60. DOI 10.1109/MCI.2015.2437512

Tang Y, Zhang YQ, Chawla NV, Krasser S (2009) SVMs modeling for highly imbalanced classification. IEEE Transactions on Systems, Man, and Cybernetics, Part B (Cybernetics) 39(1):281–288. DOI 10.1109/TSMCB.2008.2002909

Tobback E, Moeyersoms J, Stankova M, Martens D (2016) Bankruptcy prediction for SMEs using relational data. Working Paper 2016004, University of Antwerp, Faculty of Applied Economics

Verbeke W, Dejaeger K, Martens D, Hur J, Baesens B (2012) New insights into churn prediction in the telecommunication sector: A profit driven data mining approach. European Journal of Operational Research 218(1):211–229. DOI 10.1016/j.ejor.2011.09.031

Veropoulos K, Campbell I, Cristianini N (1999) Controlling the sensitivity of support vector machines. In: Proceedings of the International Joint Conference on Artificial Intelligence, Stockholm, Sweden (IJCAI99), pp 55–60

Whitrow C, Hand DJ, Juszczak P, Weston D, Adams NM (2009) Transaction aggregation as a strategy for credit card fraud detection. Data Mining and Knowledge Discovery 18(1):30–55. DOI 10.1007/s10618-008-0116-z

Wickramaratna J, Holden SB, Buxton BF (2001) Performance degradation in boosting. In: Proceedings of the Second International Workshop on Multiple Classifier Systems, Springer, London, UK, MCS '01, pp 11–21

Yen SJ, Lee YS (2009) Cluster-based under-sampling approaches for imbalanced data distributions. Expert Systems with Applications 36(3, Part 1):5718–5727. DOI 10.1016/j.eswa.2008.06.108

Yu HF, Lo HY, Hsieh HP, Lou JK, McKenzie TG, Chou JW, Chung PH, Ho CH, Chang CF, Wei YH, et al (2010) Feature engineering and classifier ensemble for KDD Cup 2010. In: Proceedings of the KDD Cup 2010 Workshop, pp 1–16

Zha H, He X, Ding C, Simon H, Gu M (2001) Bipartite graph partitioning and data clustering. In: Proceedings of the Tenth International Conference on Information and Knowledge Management, ACM, New York, NY, USA, CIKM '01, pp 25–32. DOI 10.1145/502585.502591

Zhang J, Mani I (2003) KNN approach to unbalanced data distributions: A case study involving information extraction. In: Proceedings of the ICML'2003 Workshop on Learning from Imbalanced Datasets, Washington DC

Ziegler CN, McNee SM, Konstan JA, Lausen G (2005) Improving recommendation lists through topic diversification. In: Proceedings of the 14th International Conference on World Wide Web, ACM, New York, NY, USA, WWW '05, pp 22–32. DOI 10.1145/1060745.1060754



Algorithm 2 CBU pseudo-code implementation for behaviour data
Input: X_min, X_maj, β_u, Clust_opt
a) Cluster the majority class instances X_maj:
  – Assign weights to each top node corresponding with the hyperbolic tangent applied to the inverse of the node's degree.
  – Project the bigraph X_maj to a weighted unigraph consisting of bottom node majority class instances. The weight w_ij between majority class instances i and j corresponds with the total weight of the shared top nodes.
  – Apply the Louvain algorithm (Blondel et al 2008) on the projected unigraph to partition the majority class instances into clusters.
b) Select majority class instances:
  Nr_rem ← ⌊(|X_maj| − |X_min|) × β_u⌋ (see Equation (2))
  Nr_retain ← |X_maj| − Nr_rem
  if Nr_retain < |Clust| then
    – Sort clusters according to Clust_opt
    – Randomly select 1 instance from the first Nr_retain clusters
  else
    – Randomly select ⌈Nr_retain / |Clust|⌉ majority class instances from each cluster
    – Randomly discard instances from the previous step until its size corresponds with Nr_retain
  end if
c) Return the new training set consisting of X_min and the selected majority class instances from step b.

3.3 Boosting, cost-sensitive learning and EasyEnsemble

3.3.1 AdaBoost

The AdaBoost (Schapire and Singer 1999; Schapire 1999) algorithm has been designed from the perspective of improving the performance of a weak learner so that it achieves accuracies comparable with a strong learning algorithm. Fundamental to the idea of boosting is to maintain a weight distribution over the training set. In each boosting iteration, the weights of wrongly classified instances are increased, so that the underlying weak learner puts more emphasis on these hard examples. In our implementation we will consider using a SVM as base learner. SVMs are generally regarded as strong learners. The studies of Wickramaratna et al (2001) and García and Lozano (2007) note that using a strong learner usually results in performance degradation during the boosting process; the RBF-kernel SVM classifier (Wickramaratna et al 2001; Li et al 2008) is used as the underlying classifier to prove their point. In our study we will employ a linear SVM, which can be considered a weaker version compared to the RBF-kernel. Furthermore, the regularization parameter C can be viewed as a "weakness" indicator.¹¹ Lowering the C-value results in weaker learners, as can be seen from the goal function of the SVM optimization problem (see Equation (1)). We will come back to this point in Section 4.5.

¹¹ The distinction between weak/strong learners is loosely 'defined' in Schapire (1999). A weak learner corresponds with a hypothesis that performs just slightly better than random guessing; a strong learner is able to generate a hypothesis with an arbitrarily low error rate, given enough data. We adopt these definitions, but consider the distinction between weak/strong based on training set error. In a SVM context it is quite typical that error levels on training data drop with increasing C-values (Suykens et al 2002). A learner that is 'too strong' means that even though its performance on training data is very high, it fails to generalize well and the test set error increases due to overfitting.


The boosting algorithm with underlying SVM is presented in Algorithm 3. This algorithm closely follows the original boosting implementation presented in Schapire and Singer (1999) and requires each learner to output confidence-rated predictions in the interval [−1, 1]. Since the SVM outputs real-valued scores, we apply the procedure of Platt (1999) to transform these scores into probability estimates (which can easily be translated to confidence-rated predictions). A logistic regression (LR) model (Ng and Jordan 2002) is trained using the SVM scores as input and the corresponding labels as output. Note that the same data that are used to construct the linear SVM can be used to estimate the LR model; see Platt (1999) for a motivation.
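The score-to-confidence step can be sketched with scikit-learn in place of the authors' extended LIBLINEAR. The toy data, model choices and C-value below are illustrative assumptions, not the paper's setup:

```python
import numpy as np
from sklearn.svm import LinearSVC
from sklearn.linear_model import LogisticRegression

# illustrative toy data: 200 instances, labels in {-1, +1}
rng = np.random.RandomState(0)
X = rng.randn(200, 5)
y = np.where(X[:, 0] + 0.5 * rng.randn(200) > 0, 1, -1)

# linear SVM produces real-valued scores
svm = LinearSVC(C=0.01).fit(X, y)
scores = svm.decision_function(X).reshape(-1, 1)

# Platt (1999): fit a logistic regression on the SVM scores to obtain
# P(y = 1 | score); the same data used to fit the SVM is reused here,
# as the text above notes is admissible for the linear case
platt = LogisticRegression().fit(scores, y)
proba = platt.predict_proba(scores)[:, 1]

# map probabilities in [0, 1] to confidence-rated predictions in [-1, 1]
confidence = 2.0 * proba - 1.0
```

The resulting `confidence` values play the role of h_t(x) in Algorithm 3.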

The boosting algorithm requires the weak learner to be trained using a distribution D_t. One could sample from this distribution to generate (unweighted) examples and train a SVM next. We have instead chosen to include weights in the SVM goal function formulation (this requires an extension of the LIBLINEAR package):

\min_{w,b,\xi_i} \; \frac{w^{T} w}{2} + C \sum_{i=1}^{m} \mathrm{weight}_i \, \xi_i \qquad (3)

The weights weight_i are set according to the weight distribution D_t(i) in each round of boosting. Note that the C-value that is used to train the SVM model will be divided by mean(weight_i). This normalization allows for a fair comparison between the weighted and unweighted SVM versions (e.g., in the first round of boosting, weight_i = 1/m for all instances; multiplying the C-value by m then corresponds to solving the unweighted problem with the same C-value).

We introduce an additional parameter μ, called the weight percentage, with values in [0, 100], in the boosting algorithm of Algorithm 3. This parameter controls the amount of training data that is used to construct the SVM model and the subsequent LR model. We sort the original training data in descending order according to the distribution D_t. Next, we form a new training set of minimal cardinality by including points from the original sorted training data until the total weight is higher than the weight percentage μ/100. This way, the newly formed training data will contain only the part of the original training data that carries the most weight. This partial set is then used to construct a weighted SVM model (according to Equation (3), with updated distribution for this set) and a subsequent LR model. The idea of using a partial dataset to construct the base learner not only reduces training times but also weakens the learner (García and Lozano 2007).
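A minimal sketch of this weight-percentage selection, assuming a weight vector D summing to one; the helper name `top_weight_subset` is hypothetical:

```python
import numpy as np

def top_weight_subset(D, mu):
    """Indices of the minimal subset, sorted by descending weight D,
    whose cumulative weight exceeds mu/100."""
    order = np.argsort(D)[::-1]                    # heaviest instances first
    cum = np.cumsum(D[order])
    k = int(np.searchsorted(cum, mu / 100.0)) + 1  # minimal prefix length
    return order[:k]

D = np.array([0.4, 0.1, 0.3, 0.2])
idx = top_weight_subset(D, 60)   # smallest prefix with total weight > 0.6
```

Here `idx` selects instances 0 and 2 (weights 0.4 and 0.3, cumulative 0.7 > 0.6); the weighted SVM and LR of Algorithm 3 would then be fit on this subset only.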

In Algorithm 3 we have an explicit check to verify whether r_AB = 1. In this case, the SVM model outputs scores that allow for a perfect classification on the training set. The subsequent LR model will find a threshold and output a value of −1 if the SVM score is lower than this threshold (a value of +1 if the score is higher). In this situation the training data would be perfectly classified. In our implementation we attempt to avoid this kind of behaviour because it can lead to overfitting, meaning it might pinpoint the wrong threshold and make too drastic decisions (we lose the meaning of confidence). The check whether r_AB ≤ 0 verifies if the currently boosted model is performing worse than random (such a model would have an r_AB-value of 0). Obviously, if the model performs worse than random, we quit the boosting process. During the first round of boosting we perform similar checks that are not explicitly

Imbalanced classification in sparse and large behaviour datasets 17

indicated in Algorithm 3. In the case where r_AB = 1, we output the SVM scores instead of the LR binary values. When r_AB ≤ 0, we quit the boosting process and output the LR scores.

Algorithm 3 AdaBoost with a SVM-LR combination as a base learner
Input: (X, Y) = (x_1, y_1), …, (x_m, y_m); C; T; μ
Initialize distribution D_1(i) = 1/m
for t = 1 to T do
  – train weak learner using distribution D_t. The weak learner consists of a weighted linear SVM and a LR model trained with weight-percentage μ of D_t:
      h_t ← Train_WeakLearner(X, Y, D_t, C, μ)
  – compute the weighted confidence r_AB on the training data:
      r_AB ← Σ_{i=1}^{m} D_t(i) y_i h_t(x_i)
    If (r_AB = 1 or r_AB ≤ 0) then α_t ← 0 and stop the boosting process
  – choose α_t ∈ ℝ:
      α_t ← (1/2) log((1 + r_AB) / (1 − r_AB))
  – update the distribution:
      D_{t+1}(i) ← D_t(i) exp(−α_t y_i h_t(x_i)) / Z_t
    where Z_t is a normalization factor (chosen so that D_{t+1} will be a distribution)
end for
Output the final hypothesis (the output score is the term contained in the sign function):
  H(x) = sign( Σ_{t=1}^{T} α̃_t h_t(x) ), with α̃_t = α_t / Σ_{i=1}^{T} α_i
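The loop of Algorithm 3 can be sketched in plain NumPy. To keep the sketch self-contained, a weighted decision stump stands in for the paper's weighted-SVM-plus-LR base learner (an assumption, not the authors' setup); the weighted confidence r_AB, the stopping checks, α_t, the distribution update and the normalised final hypothesis follow the algorithm above:

```python
import numpy as np

def stump(X, y, D):
    """Weighted decision stump with outputs in {-1, +1}; a simple
    stand-in for the weighted-SVM-plus-LR base learner."""
    best = None
    for j in range(X.shape[1]):
        for thr in np.unique(X[:, j]):
            for s in (1.0, -1.0):
                pred = s * np.where(X[:, j] <= thr, 1.0, -1.0)
                r = float(np.sum(D * y * pred))       # weighted confidence
                if best is None or r > best[0]:
                    best = (r, j, thr, s)
    _, j, thr, s = best
    return lambda Z: s * np.where(Z[:, j] <= thr, 1.0, -1.0)

def adaboost(X, y, T=10):
    """Confidence-rated AdaBoost following the structure of Algorithm 3."""
    m = len(y)
    D = np.full(m, 1.0 / m)                  # D_1(i) = 1/m
    hs, alphas = [], []
    for _ in range(T):
        h = stump(X, y, D)
        r = float(np.sum(D * y * h(X)))      # r_AB
        if r >= 1.0:                         # perfect separation: stop
            if not hs:                       # first round: keep this model
                hs, alphas = [h], [1.0]
            break
        if r <= 0.0:                         # worse than random: quit
            break
        a = 0.5 * np.log((1.0 + r) / (1.0 - r))   # alpha_t
        hs.append(h)
        alphas.append(a)
        D = D * np.exp(-a * y * h(X))
        D = D / D.sum()                      # normalisation factor Z_t
    w = np.array(alphas) / np.sum(alphas)    # normalised alpha_t
    return lambda Z: np.sign(sum(wi * hi(Z) for wi, hi in zip(w, hs)))
```

With stump-separable data the loop stops after one round via the r_AB = 1 check; otherwise it reweights and continues for up to T rounds.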

3.3.2 AdaCost

The AdaCost algorithm (Fan et al 1999) is a variant of cost-sensitive learning where misclassification costs are introduced in the weight-update formula of AdaBoost. The cost-sensitive update rule increases the weights of costly misclassified instances more aggressively and decreases the weights of costly correct classifications more conservatively. Each instance is given a misclassification cost ci, where we chose to put ci = 1 for positive (minority) instances and ci = 1/R for negative (majority) instances. R is a user-defined value that allows one to put more emphasis on the minority class. The implementation of AdaCost is similar to Algorithm 3, yet there are a few differences (Fan et al 1999): firstly, the initial distribution is chosen as D1(i) = ci / Σ_{j=1}^{m} cj; secondly, the weight-update rule is given by Dt+1(i) = Dt(i) exp(−αt yi ht(xi) β(i)) / Zt, where β(i) = −0.5 sign(yi ht(xi)) ci + 0.5 is a cost-adjustment function; finally, the choice of αt is given by αt = (1/2) log((1 + rAC) / (1 − rAC)), where rAC = Σ_{i=1}^{m} Dt(i) yi ht(xi) β(i). Note that the checks to stop the boosting process prematurely (see the second bullet in Algorithm 3) are still based on the r-value obtained from AdaBoost (rAB). This is because β ∈ [0, 1].

18 Jellis Vanhoeyveld David Martens
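As a quick illustration of the cost-adjustment function, the sketch below (our own, with hypothetical inputs) performs a single AdaCost distribution update:

```python
import numpy as np

def adacost_update(D, y, h_x, alpha, c):
    """One AdaCost distribution update: beta(i) = -0.5*sign(y_i h(x_i))*c_i + 0.5,
    so costly misclassifications are boosted more aggressively and costly
    correct classifications are decreased more conservatively."""
    beta = -0.5 * np.sign(y * h_x) * c + 0.5
    D_new = D * np.exp(-alpha * y * h_x * beta)
    return D_new / D_new.sum()                  # divide by the normalizer Z_t
```

With ci ∈ [0, 1], the resulting β(i) indeed stays in [0, 1], consistent with the remark above.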

In the first boosting round of AdaCost, the weighted SVM formulation (3) is equivalent to solving an SVM formulation with the following goal function:

  min_{w, b, ξi}  (wT w)/2 + C+ Σ_{i | yi = +1} ξi + C− Σ_{i | yi = −1} ξi    (4)

where C+ / C− = R. This can be seen as a cost-sensitive version of a SVM, and this idea has initially been proposed by Veropoulos et al (1999).
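A rough sketch of this cost-sensitive formulation, using plain full-batch subgradient descent instead of the authors' SVM solver, could look as follows (the learning-rate schedule and default values are our assumptions):

```python
import numpy as np

def cost_sensitive_svm(X, y, R=10.0, C=1.0, lr=5e-4, epochs=400):
    """Minimize w'w/2 + C+ * sum_{y_i=+1} xi_i + C- * sum_{y_i=-1} xi_i
    with C+ / C- = R, via subgradient descent on the hinge loss."""
    m, n = X.shape
    w, b = np.zeros(n), 0.0
    cost = np.where(y == 1, C * R, C)        # per-instance penalty C+ or C-
    for t in range(epochs):
        step = lr / (1.0 + 0.01 * t)         # decaying step size
        margin = y * (X @ w + b)
        active = margin < 1                  # instances with non-zero hinge loss
        w -= step * (w - (cost[active] * y[active]) @ X[active])
        b -= step * (-np.sum(cost[active] * y[active]))
    return w, b
```

Raising R pushes the decision boundary toward the majority class, which typically raises minority-class recall, the intended effect of the C+/C− = R weighting.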

3.3.3 EasyEnsemble

One of the disadvantages of the random undersampling method is the fact that we are discarding potentially valuable information. EasyEnsemble (Liu et al 2009) is a method that combines several balanced subsets (S in total), each containing randomly selected majority class instances together with all minority examples. Each subset contains the same amount of instances from both classes and is fed to the boosting algorithm presented in Algorithm 3. Afterwards, the weak learners hs,t of each subset s are simply combined to form the final ensemble:

  H(x) = sign( Σ_{s=1}^{S} Σ_{t=1}^{T} αs,t hs,t(x) ), with s = 1, ..., S and t = 1, ..., T    (5)

It is clear that this technique benefits from a combination of bagging and boosting (Liu et al 2009). Note that we apply the same boosting algorithm as previously described to each of the balanced subsets. However, when rAB = 1 in the first round of boosting, we quit the boosting process, put α1 = 1 and continue to use the trained LR-model in the final ensemble. It was previously noted that this can cause overfitting, in the sense that it can pinpoint a wrong threshold. However, the LR-model still contains information, and the combination thereof with the models obtained from the other subsets remains valuable (as we have noted from initial experiments by comparing the situation where we include or reject those subsets).

There are a few subtle though important differences with respect to the experiments performed by Liu et al (2009). First of all, the authors use a CART-algorithm (Breiman et al 1984) as base learner; we employ a linear SVM with subsequent LR as a weak learner. To our knowledge this combination has not been proposed elsewhere, yet proves to be very efficient in this setting. Secondly, their methodology employs a discrete version of AdaBoost, where each weak learner outputs binary values in {−1, +1}. As already stated, we make use of an improved version of AdaBoost that relies on confidence-rated predictions (Schapire and Singer 1999). Finally, the authors reported performances with fixed levels of S = 4 and T = 10. We will investigate the effect of varying S and T levels.
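The sampling scheme behind EasyEnsemble can be sketched in a few lines. To keep the example self-contained, the boosted SVM-LR base learner is replaced here by a hypothetical difference-of-class-means linear scorer; only the balanced-subset sampling and the score combination of Eq. (5) are illustrated:

```python
import numpy as np

def easy_ensemble_scores(X, y, X_test, S=10, rng=None):
    """Draw S balanced subsets (all minority instances plus an equally sized
    random majority sample), fit a simple linear scorer on each, and average
    the per-subset scores, mirroring the double sum of Eq. (5)."""
    rng = rng or np.random.default_rng(0)
    X_min, X_maj = X[y == 1], X[y == -1]
    scores = np.zeros(len(X_test))
    for _ in range(S):
        idx = rng.choice(len(X_maj), size=len(X_min), replace=False)
        # Stand-in base learner: project on the difference of class means.
        w = X_min.mean(axis=0) - X_maj[idx].mean(axis=0)
        scores += X_test @ w
    return scores / S
```

Because each subset is only twice the minority class size, the S models are cheap to fit and trivially parallelizable, which is what makes the method fast in practice.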


4 Results and discussion

4.1 Datasets

Stankova et al (2015) provide the first large collection of benchmark behaviour datasets for classification. In our experiments we make use of these data sources and extend this repository with two additional datasets. Each of these datasets shows a bipartite structure with a clear target variable to predict. We refer to this study and the next paragraph for a short description of the available data resources. In this paragraph we indicate why we have chosen to include or reject certain data sources from the aforementioned study. The available datasets can be divided into small, medium and large datasets based on the number of instances and the number of features present. The Norwegian companies and Reality Mining datasets comprise the small datasets: since they contain only a few hundred instances or features, they are regarded as impractical for our purposes. The MovieLens, Yahoo, TaFeng, Book-Crossing and LibimSeTi datasets belong to the medium-sized datasets, each containing a few thousand up to a few hundred thousand instances and features. All of these datasets are included in our study. The large datasets, containing hundreds of thousands up to millions of instances and features, are the Flickr and Kdd databases.12 Other large proprietary data sources not included in Stankova et al (2015) are the corporate residence fraud (CRF) and banking (Bank) datasets, which arise from real-life application domains with intrinsic imbalance.

To summarize, we have gathered datasets containing such fine-grained behaviour data from a wide variety of application domains. The MovieLens datasets, for which we are predicting the gender13 (Mov G) or the genre thriller14 (Mov Th), provide data on which films each user has rated. The Yahoo movies15 dataset has a similar structure, where the age of each user, Yahoo A (above or below average), or the gender, Yahoo G, is being predicted. The TaFeng dataset16 contains data on shopping behaviour, where age (below or above average) is being predicted based upon which products are purchased. In the book-crossing (Book) dataset (Ziegler et al 2005), users rate books and the age of the user (above or below average) is being predicted. LibimSeTi (LST) contains data from a dating site (Brozovsky and Petricek 2007), where users rate each other's profile and gender is being predicted. In the advertisement (Adver) dataset (Lichman 2013) we try to predict whether a URL is an advertisement, based on a large variety of binary features of the URL. Note that this dataset does not arise from the behaviour of entities, yet it still has a high-dimensional and sparse representation. The Flickr dataset (Cha et al 2009) contains pictures being marked by users as favourite, and we predict the number of comments on each picture (below

12 Flickr and KDD will be excluded from the comparative study of Section 4.6. This is because some methods are too computationally intensive (especially in combination with the large number of possible parameter combinations) to be applied on these very large data sources. Furthermore, our statistical evidence is already sufficiently strong to conclude significance without these datasets. Having said this, these data sources will be included in the analysis of Section 5.

13 MovieLens 1M dataset from http://grouplens.org/datasets/movielens
14 MovieLens 10M dataset from http://grouplens.org/datasets/movielens
15 https://webscope.sandbox.yahoo.com
16 http://www.bigdatalab.ac.cn/benchmark/bm/dd?data=Ta-Feng


or above average). In the Kdd cup data, the performance of a student on a test is being predicted based on artificially created binary features (Yu et al 2010). The corporate residence fraud dataset (CRF) (Junque de Fortuny et al 2014b) contains data on foreign companies making transactions with specific Belgian companies, where we try to predict whether the foreign company commits residence fraud (a type of fiscal fraud). Finally, the banking dataset (Bank) (Martens et al 2016) contains detailed behaviour of consumers making transactions with merchants or other persons, used to predict interest in a pension fund product. Some characteristics of these datasets can be found in Table 2. The features column only shows the number of active features.17

Table 2 Behaviour data characteristics. The final column shows the imbalance ratio p, defined as the ratio of the number of minority class instances to the amount of majority class instances in the training set, expressed as a percentage. See Section 4.2 for details regarding p.

Name      |Xmaj|    |Xmin|    Features    p = 100 × |Xmin|train / |Xmaj|train
Mov G     4331      1709      3706        1 & 25
Mov Th    10546     131       69878       1.24 (p = [])
Yahoo A   6030      1612      11915       1 & 25
Yahoo G   5436      2206      11915       1 & 25
TaFeng    17330     14310     23719       1 & 25
Book      42900     18858     282973      1 & 25
LST       59702     60145     166353      1
Adver     2792      457       1555        16.38 (p = []) & 1
CRF       869071    62        108753      0.0072 (p = [])
Bank      1193619   11107     3139570     0.93 (p = [])
Flickr    8166814   3028330   497472      0.1
Kdd       7171885   1235867   19306083    0.5

4.2 Methodology

Regarding the experiments performed in the upcoming sections, we applied a tenfold cross-validation procedure. Each of the folds contains 80% training data, 10% validation data and 10% test data. Note that these percentages are valid for both the majority class and the minority class (stratified sampling). As can be seen from Table 2, some datasets are balanced in nature. We created artificial imbalance for these datasets by removing minority class instances from the initial training set according to a user-defined parameter p. We ensured that the amount of minority training instances corresponds to p percent of the majority class training size: |Xmin|train = (p/100) × |Xmaj|train. As an example, say that we are using the Book dataset with p = 25. In that case we know that the majority class contains 34320 training instances (80% of |Xmaj| = 42900). The minority training data would then contain 8580 instances (25% of 34320). When the dataset is already imbalanced, we define p = [], which means that no downsampling of the minority class training data is performed. Note that the validation and test data are left untouched.18

17 Active features represent features that are present for at least one instance in the dataset. A non-active feature corresponds with a column of zeros in the matrix representation and would not contribute to the model.
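The downsampling rule |Xmin|train = (p/100) × |Xmaj|train can be sketched as follows (our own helper, assuming labels +1 for the minority and −1 for the majority class):

```python
import numpy as np

def make_artificial_imbalance(y_train, p, rng=None):
    """Return indices keeping all majority (-1) training instances and a
    random minority subset of size (p/100) * |X_maj|_train."""
    rng = rng or np.random.default_rng(0)
    maj_idx = np.flatnonzero(y_train == -1)
    min_idx = np.flatnonzero(y_train == 1)
    n_keep = int(round(p / 100.0 * len(maj_idx)))
    keep_min = rng.choice(min_idx, size=n_keep, replace=False)
    return np.sort(np.concatenate([maj_idx, keep_min]))
```

For the Book example (34320 majority training instances, p = 25) this keeps exactly 8580 minority instances.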

The methods detailed in Section 3 are applied on the training data. For both the under- and oversampling approaches, a linear SVM is trained on the newly created balanced training data, with regularization parameter C taking values

C = [10^−7, 10^−5, 10^−3, 10^−1, 10^0]

The validation data are used for parameter tuning purposes. The test data allow us to obtain the generalization performance. The results reported show the average over ten folds. In the remaining paragraphs of this section, we will describe the various parameter settings used in our experiments and give a brief overview of the parameters occurring in each method.

Considering the oversampling techniques, the parameter settings are as follows:

β = [0, 1/3, 2/3, 1]
prior_opt = [Flip Coin, Reverse Priors]
sim_measure = [Cosine, Jaccard]
K = [10^0, 10^1, 10^2, |Xmin|train]

We didn't include the "Prior" option due to initial experiments showing a lower performance in comparison with the other options. This can be explained by the low priors occurring in each column, resulting in synthetic samples that mainly show zeros in 0-1 match situations. The oversampling with replacement method (OSR) only uses the β parameter, SMOTE uses all but the K parameter, and ADASYN uses all of the parameters listed above.
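Of these samplers, OSR is the simplest to sketch. The snippet below is our own illustration and assumes that β denotes the desired minority-to-majority size ratio after resampling, with β = 1 yielding a fully balanced training set:

```python
import numpy as np

def random_oversample(X, y, beta=1.0, rng=None):
    """Oversampling with replacement (OSR): duplicate randomly chosen
    minority instances until |X_min| = beta * |X_maj|."""
    rng = rng or np.random.default_rng(0)
    X_min, X_maj = X[y == 1], X[y == -1]
    n_extra = max(0, int(beta * len(X_maj)) - len(X_min))
    extra = X_min[rng.integers(0, len(X_min), size=n_extra)]
    X_new = np.vstack([X_maj, X_min, extra])
    y_new = np.concatenate([-np.ones(len(X_maj)),
                            np.ones(len(X_min) + n_extra)])
    return X_new, y_new
```

Since OSR only duplicates existing rows, it preserves the sparsity pattern of behaviour data, unlike synthetic samplers that interpolate between instances.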

With respect to the undersampling techniques, the following parameter settings are used:

βu = [0, 1/4, 1/2, 3/4, 1]
sim_measure = [Cosine, Jaccard]
K = [10^0, 10^1, 10^2, |Xmin|train]
Clust_opt = [C Smallest, C Largest]

The random undersampling (RUS) technique only uses the βu parameter. The second set of methods, "Closest tot sim" and "Farthest tot sim", uses βu and sim_measure. The third set of techniques, "Closest Knn" and "Farthest Knn", makes use of all the parameters listed above except for Clust_opt. The final approach, CBU, employs βu and Clust_opt.
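Random undersampling itself is nearly a one-liner. The sketch below is our own and assumes that βu interpolates linearly between keeping the full majority class (βu = 0) and keeping only as many majority instances as there are minority instances (βu = 1):

```python
import numpy as np

def random_undersample(X, y, beta_u=1.0, rng=None):
    """RUS sketch: keep a random majority subset whose size shrinks linearly
    from |X_maj| (beta_u = 0) down to |X_min| (beta_u = 1)."""
    rng = rng or np.random.default_rng(0)
    maj_idx = np.flatnonzero(y == -1)
    min_idx = np.flatnonzero(y == 1)
    n_keep = int(round(len(maj_idx) - beta_u * (len(maj_idx) - len(min_idx))))
    keep = rng.choice(maj_idx, size=n_keep, replace=False)
    idx = np.sort(np.concatenate([keep, min_idx]))
    return X[idx], y[idx]
```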

The boosting variants (AdaBoost, AdaCost and EasyEnsemble) presented in Section 3.3 make use of the following settings:

T = 30
µ = [100%, 75%]
C = [10^−7, 10^−5, 10^−3, 10^−1]

18 This means that if we start from a balanced set, only the training data will show artificial imbalance according to the imbalance ratio p. The validation and test data would remain balanced. Since AUC (and some other metrics) is independent of class skew, it would be unwise to make these sets imbalanced as well, because that would lead to discarding minority class instances that are relevant for performance assessment.


R = [2, 8, RL], where RL = |Xmaj|train / |Xmin|train
S = [5, 10, 15]

The AdaBoost algorithm includes the T, µ and C parameters. AdaCost additionally uses cost ratios R. We have chosen a range of values because misclassification costs are unknown for many business applications (He and Garcia 2009; Fan et al 1999; Sun et al 2007). The final value, RL, seems to be a popular choice (Akbani et al 2004; Luts et al 2010), because the total weight on the majority class then balances with the total weight on the minority class. The final method, EasyEnsemble, uses S subsets in addition to the parameters previously mentioned for AdaBoost. Note that we consider the boosting iteration round t ∈ [1, T] as a tunable parameter.19

4.3 Oversampling

The oversampling techniques presented in Section 3.1 are applied to each of the datasets from Table 2 with varying imbalance levels (by varying the p-parameter). The experiments we conducted use the following methodology: for each dataset, we apply the oversampling techniques with all possible parameter combinations on the training data to create newly balanced datasets, after which linear SVMs are trained. The optimal parameter combination with respect to each imbalance ratio β is selected based on validation set AUC-performance. With these parameters, the AUC on the test data is obtained. Results show the average over ten folds.

The results on four arbitrarily selected datasets are shown in Table 3. Full results on each of the data sources can be found in Appendix A, Table 11. From these tables we can conclude that performance generally increases with growing β-values. It is more correct to say that performance keeps improving with growing β-levels until an optimal point is reached; increasing the balance level after this optimal value will cause only small fluctuations with respect to the optimal performance. Traditional studies dealing with dense, low-dimensional data note that the OSR method can suffer from overfitting, as already mentioned in Section 3.1. It is interesting to see that in this sparse, high-dimensional setting this effect doesn't seem to occur.20 Furthermore, the computationally expensive synthetic sampling approaches do not seem to improve performance over the simpler OSR method. This can be explained by the fact that many instances have a very limited amount of non-zero elements (Junque de Fortuny et al 2014a; Stankova et al 2015), causing the synthetic sampling procedures to be limited in the amount of unique new samples they can produce. Note also that synthetic sampling procedures have historically been designed to overcome the overfitting behaviour of OSR, which no longer seems to apply here.

19 We sometimes make use of T as a symbol to indicate the boosting iteration number instead of t.
20 The SVM regularization parameter C controls for overfitting. Additionally, the influence of majority class noise/outliers (see Section 3.2) on the learned hyperplane decreases as we oversample the minority class. This hyperplane is more sensitive towards minority instances, and a change in its direction/orientation towards these majority class noise/outliers will be more heavily penalized.


Table 3 Oversampling experiments. Results show the average tenfold AUC test set performance with respect to increasing β-values (β = [β1, β2, β3, β4] = [0, 1/3, 2/3, 1]). Additionally, standard deviations are shown between brackets. Optimal performances for each method are highlighted in boldface. See Section 4.2 for parameter settings.

Mov Th (p = [])
          β1            β2            β3            β4
OSR       79.77 (5.33)  85.3 (4.66)   83.16 (4.5)   84.59 (5.69)
SMOTE     79.77 (5.33)  84.18 (6.51)  85.58 (5.97)  84.33 (5.75)
ADASYN    79.77 (5.33)  84.11 (6.77)  85.86 (6.06)  85.36 (5.13)

Yahoo G (p = 25)
          β1            β2            β3            β4
OSR       78.82 (1.39)  78.78 (2.02)  78.74 (1.86)  78.35 (1.47)
SMOTE     78.82 (1.39)  79.23 (1.57)  79.1 (1.2)    79.03 (1.89)
ADASYN    78.82 (1.39)  79.12 (1.43)  79.23 (1.37)  79.51 (2.01)

TaFeng (p = 25)
          β1            β2            β3            β4
OSR       66.94 (1.34)  68.84 (1.21)  66.99 (1.42)  67.7 (1.41)
SMOTE     66.94 (1.34)  68.47 (1.5)   67.07 (1.15)  66.65 (0.81)
ADASYN    66.94 (1.34)  68.62 (1.38)  67.85 (1.6)   66.91 (1.39)

Book (p = 25)
          β1            β2            β3            β4
OSR       60.08 (0.71)  61.05 (0.94)  61.82 (1.12)  64.62 (0.57)
SMOTE     60.08 (0.71)  62.6 (0.73)   60.95 (0.68)  63.0 (0.8)
ADASYN    60.08 (0.71)  62.33 (0.73)  60.77 (0.85)  63.04 (0.58)

4.4 Undersampling

Regarding the undersampling techniques, we employ a similar experimental set-up as in the previous section: firstly, train SVMs on the undersampled training data with all possible parameter combinations; secondly, choose a suitable parameter combination based on validation set AUC-performance; and finally, obtain the AUC-performance on the test set. The results are averaged across ten folds.

Before going into detail on the experimental findings, we give a short note on the effect of noise/outliers on SVM performance. Consider the following imaginary example: say that a majority class female is rating a lot of action films and thrillers. We can consider this as an outlier, since most of the females are rating romantic or drama films. This specific female has far more in common with the minority class males, who also rate action films and thrillers. Outliers are therefore instances showing odd behaviour; noise refers to wrongfully labelled instances. The effect of noise/outliers on SVM performance can be severe. Indeed, many of the instances in the dataset have a support value (dual variable αi) of 0. Instances contributing to the model (a non-zero support value) are examples close to the borderline (Suykens et al 2002) and noise/outliers (as can be derived from the KKT dual-complementarity conditions21).

21 If αi = 0, then yi(wT xi + b) ≥ 1. For noise/outliers the term yi(wT xi + b) is negative, hence αi ≠ 0.


With respect to noise/outliers, the support value can reach a maximal value of C, the chosen regularization parameter in the SVM formulation. It should be clear that lowering the C-value leads to models that are less sensitive to noise/outliers.

The experimental outcomes of the various undersampling processes are shown in Table 4 for the same four previously selected datasets. We refer to Appendix B, Table 12 for results on the entire data repository. When removing only a limited amount of majority class instances (βu = βu2 = 1/4), we observed that in 12 out of 16 datasets the "Farthest Knn" method outperforms the "Closest Knn" technique. The four remaining datasets show comparable performances.22 This finding shows that the "Farthest" method is very suitable for removing majority class noise/outliers and empirically shows their performance-degrading effect. With higher undersampling rates (βu = 1), the "Closest Knn" method achieves higher performances than the "Farthest Knn" technique in general, though the results are less clear. We observed that only 9 out of 16 datasets indicate "Closest" as a clear winner, with 3 tie situations and 4 situations where "Farthest" shows the best results. Intuitively, we would expect the "Closest" method to perform better, since we are keeping the most informative instances (examples close to the minority class). However, this technique will also emphasize noise/outliers, and this is the main reason why the results are less significant. One can also see this from the fact that the RUS method shows far better results in comparison to the aforementioned techniques when βu = 1, an observation that is valid for all datasets. The RUS method is efficient in realizing noise/outlier reduction effects, since each SVM is trained with only a fraction of these types of instances. It is remarkable to see that in 9 out of 16 datasets the RUS method with βu = 1 outperforms the baseline model (βu = 0); in 4 datasets we observed equal performance, and 3 losses with respect to the baseline. RUS shows two attractive features: majority class noise/outlier removal and its ability to put more emphasis on the minority class instances by undersampling the majority class. A general downside of the approach is the information loss, and this effect causes the 3 losses to the baseline. Usually, though, the attractive features of RUS dominate the latter effect, and this indicates that it is definitely not necessary to include all majority class training instances in constructing the predictive model. Apparently there is a high level of redundancy, which can be exploited to construct efficient hypotheses.

In the next paragraph we compare the cluster-based undersampling (CBU) approach to the random undersampling (RUS) technique. Before proceeding, we should note that Table 12 does not show results on the LST, CRF and Bank datasets, due to the fact that the projected unigraph was too large to fit in memory (larger than 16 GB). A limited amount of features (top nodes) does seem to be active for a relatively large number of instances (bottom nodes). All these instances will be connected in the unigraph projection, resulting in relatively high degrees for the bottom nodes. A possible solution to circumvent this problem might be to cut the weakest edges in the projected unigraph.

If we apply a limited amount of undersampling (βu = βu2 = 1/4), the CBU technique outperforms RUS in 8 out of 13 datasets (4 losses and 1 tie with respect to RUS). On the highly imbalanced datasets (p = 1 or p = []), the CBU method wins

22 A tie occurs in the situation where the absolute difference in AUC is smaller than or equal to 0.5.


in 8 out of 8 cases. We can therefore conclude that CBU outperforms RUS at low undersampling rates. This indeed shows that handling the within-class imbalance can be beneficial in this environment. If we make a comparison at the highest undersampling rate, βu = βu5 = 1, the results are completely opposite: CBU wins in only 1 case, with 9 losses with respect to RUS and 3 ties. On the highly imbalanced datasets, CBU outperforms RUS for 1 dataset (with 4 losses and 3 ties). It is clear that RUS is the preferred technique in case of high undersampling rates. The reason is that CBU emphasizes the small disjuncts and basically fails to adequately pick up on the common, larger communities. Furthermore, these small disjuncts might actually correspond with majority class noise/outliers, and CBU is focusing on these types of instances. On the overall level, where we consider all undersampling rates (except βu = 0), CBU wins in 5 out of 13 datasets, RUS wins in 4 cases and there are 4 tie situations (yet RUS did outperform CBU in all 4 cases). Both methods seem to be competitive with one another across a wide range of imbalance levels. If we pay particular attention to the highly imbalanced sets, CBU wins in 5 cases and there are 3 tie situations on the remaining sets (yet RUS did outperform CBU in all 3 cases). CBU seems to be the method of choice (with respect to RUS) when the datasets show high imbalance levels.

Table 4 Undersampling experiments. Results show the average tenfold AUC test set performance with respect to increasing βu-values (βu = [βu1, βu2, βu3, βu4, βu5] = [0, 1/4, 1/2, 3/4, 1]). Optimal performances (boldface) are highlighted for each imbalance ratio. Additionally, standard deviations are shown between brackets. Cl K represents the "Closest Knn" technique, Cl T represents the "Closest tot sim" technique (similar for Far K and Far T, see Section 3.2). Parameter settings are given in Section 4.2.

Mov Th (p = [])
        βu1           βu2           βu3           βu4           βu5
RUS     79.77 (5.3)   80.32 (5.8)   81.57 (5.5)   81.86 (6.6)   81.26 (6.2)
Cl K    79.77 (5.3)   79.25 (4.5)   78.07 (5.0)   76.25 (6.5)   62.46 (8.5)
Cl T    79.77 (5.3)   78.4 (4.4)    72.41 (3.5)   64.66 (4.5)   60.37 (7.3)
Far K   79.77 (5.3)   84.54 (5.0)   83.64 (6.4)   80.02 (7.3)   56.82 (10.3)
Far T   79.77 (5.3)   85.03 (5.7)   82.68 (6.8)   75.61 (9.2)   56.77 (10.9)
CBU     80.11 (5.8)   81.17 (6.0)   81.08 (6.5)   84.17 (5.1)   80.96 (6.9)

Yahoo G (p = 25)
        βu1           βu2           βu3           βu4           βu5
RUS     78.82 (1.4)   78.91 (1.6)   78.97 (1.6)   78.61 (1.6)   77.82 (2.1)
Cl K    78.82 (1.4)   77.26 (1.5)   72.52 (1.5)   67.86 (2.0)   65.07 (2.7)
Cl T    78.82 (1.4)   76.83 (1.0)   71.99 (1.8)   67.15 (2.3)   61.1 (2.7)
Far K   78.82 (1.4)   78.26 (2.2)   74.69 (2.7)   67.22 (2.1)   60.72 (2.3)
Far T   78.82 (1.4)   77.68 (2.6)   72.44 (3.0)   64.94 (2.4)   59.6 (2.0)
CBU     75.25 (3.2)   75.22 (2.4)   74.69 (2.3)   73.07 (2.4)   70.69 (2.4)


Table 4 continued

TaFeng (p = 25)
        βu1           βu2           βu3           βu4           βu5
RUS     66.94 (1.3)   67.44 (1.3)   68.1 (1.4)    68.27 (1.4)   66.13 (1.2)
Cl K    66.94 (1.3)   66.13 (1.4)   63.39 (1.2)   59.83 (1.3)   56.94 (0.7)
Cl T    66.94 (1.3)   66.38 (1.5)   62.89 (1.6)   57.46 (1.3)   54.56 (1.3)
Far K   66.94 (1.3)   68.06 (1.4)   66.43 (1.6)   64.46 (1.5)   63.35 (1.3)
Far T   66.94 (1.3)   64.31 (1.1)   62.69 (1.0)   61.27 (1.1)   59.03 (1.0)
CBU     64.81 (1.2)   64.15 (1.1)   64.13 (1.2)   63.88 (0.8)   63.46 (0.8)

Book (p = 25)
        βu1           βu2           βu3           βu4           βu5
RUS     60.08 (0.7)   60.13 (0.6)   60.4 (0.8)    60.33 (0.8)   63.28 (0.8)
Cl K    60.08 (0.7)   59.96 (0.7)   60.13 (0.8)   59.96 (1.0)   59.28 (0.7)
Cl T    60.08 (0.7)   59.96 (0.7)   60.13 (0.8)   60.29 (0.4)   54.5 (0.9)
Far K   60.08 (0.7)   63.29 (1.0)   64.19 (0.8)   57.3 (1.1)    55.66 (1.1)
Far T   60.08 (0.7)   62.14 (0.5)   58.27 (0.6)   56.37 (1.0)   55.66 (1.1)
CBU     54.82 (0.9)   54.67 (0.9)   54.71 (0.9)   54.66 (1.0)   54.78 (0.9)

4.5 Boosting variants

In this section the plain AdaBoost (AB) method is compared with the cost-sensitive AdaCost (AC) algorithm, the EasyEnsemble (EE) technique and the baseline (BL) method (where we train a single SVM on the imbalanced training data). AB, AC and EE have an underlying boosting process in common. The AUC-values are stored during each boosting iteration, which allows us to show the performance graphically as a function of the number of iterations t. In EasyEnsemble (EE) we combine the weak learners of each subset by summing their individual contributions. For example, when t = 2 boosting iterations are performed, we evaluate the performance of the combined learner Σ_{s=1}^{S} Σ_{t=1}^{2} αs,t hs,t(x). Figures 1, 2, 3 and 4 show the average tenfold AUC-performance (with µ = 100%) on the test data. The left-side figures show the AB, AC (with varying R-values) and EE (with varying S-values) performance with respect to the number of boosting iterations. The C-value is tuned according to highest validation set AUC-performance (over all possible boosting iterations). The right-side figures investigate the influence of the C-parameter for AB and EE (with S = 15) and allow us to gain insight on the effect of weak and strong learners on the generalization performance. Results on the remaining datasets are given in Appendix C, Figures 6-17. Note that we only indicate results with weight-percentage µ = 100% (use all instances in the training process); previous experiments (with µ = 75%) showed inferior results. This suggests that it is indeed important to include all minority instances during the training of each weak learner, an observation also made by Liu et al (2009).

When investigating the influence of the regularization parameter C in AdaBoost, we can graphically verify that it is a "weakness" indicator, in the sense that lower C-values correspond to weaker learners. While these low C-value learners have a low performance in the first rounds of boosting, the possible boost in performance is usually much higher compared to higher C-value inducers. Strong learners can already make a suitable distinction between minority and majority class behaviours in the first boosting round. The boosting process will emphasize the hard-to-learn instances, among which are noise/outliers. Therefore, too strong learners are not suitable for use in a boosting process, as already noted by Wickramaratna et al (2001). In Figures 2(b) and 3(b) we observe that the lowest C-values (C = 10^−7, 10^−5) can outperform higher C-values (C = 10^−3, 10^−1). In many cases the AB-process outperforms the baseline (BL); see also the results in Table 5. A combination of several weak learners can often achieve better generalization than a single strong learner.

Inspired by the success of the RUS method with βu = 1 (for reasons indicated in Section 4.4), the EE-technique tries to overcome an obvious downside of this approach: we are removing a lot of majority class instances that can be very informative. Hence, we try to make up for this loss by combining several of these balanced training sets, in the hope that the majority class is more adequately explored. This results in a very effective learner that seems to outperform the AB, AC and BL methods in the majority of cases, as can be seen from these figures. It is remarkable that we don't need to use that many subsets S to achieve high performances: increasing the S-value from 5 to 15 doesn't seem to cause major increases in AUC (see also Table 5). The influence of the C-parameter is similar as described in the previous paragraph.

[Figure omitted in this text version: two panels plotting test-set AUC against the number of boosting iterations T, panel (a) comparing AB, AC and EE against BL, panel (b) comparing AB and EE across C-values]

Fig 1 Mov G (p = 25) dataset results showing average tenfold AUC-performance on test data (with µ = 100%) for (a) AB, AC and EE with C chosen according to highest validation set AUC-performance (over all possible T); (b) AB and EE (S = 15) with varying C-levels


[Figure omitted in this text version: same layout as Fig 1]

Fig 2 Book (p = 25) dataset

[Figure omitted in this text version: same layout as Fig 1]

Fig 3 TaFeng (p = 25) dataset

[Figure omitted in this text version: same layout as Fig 1]

Fig 4 Bank (p = []) dataset


4.6 Final comparison

4.6.1 Performance-wise comparison

In this section we wish to determine which of the aforementioned methods shows superior performance in terms of AUC on test data. The experimental set-up is described in Section 4.2. Note that we exclude β = 0 and βu = 0 in determining the performance of the oversampling, respectively undersampling, techniques, to be able to compare them with the baseline (BL) approach.23 The results for AB, AC and EE are shown for µ = 100%. The number of boosting iterations t ∈ [0, T] is also considered as a tunable parameter. Table 5 shows the final comparison between all methods. Additionally, we added an average rank column, indicating the mean rank of each algorithm across all datasets. The rank of an algorithm for a specific dataset corresponds to its position in the sorted AUC-performance spectrum, where the best performing algorithm gets the rank of 1, the second best rank 2, etc. In case two or more algorithms perform equally well, average ranks are assigned (if two algorithms have ranks of 3 and 4 respectively and show equal performance, then both methods get assigned a rank of 3.5). Note that the LST dataset is excluded in the calculation of average ranks, since we consider this as a dataset not being affected by imbalance and showing equal performances.

As a side note, we noticed that increasing the number of minority training set instances can significantly improve performances. We can see this clearly by comparing the p = 25 outcomes with the results of p = 1 on the same dataset. Junque de Fortuny et al (2014a) already showed empirically that larger behaviour datasets (in terms of the number of instances or features) lead to higher AUC-performances. This situation is confirmed in our experiments, though we want to add that this is only true when the data size is increased in a balanced fashion.

The results in Table 5 confirm our previous analysis. Each of the oversampling, undersampling (except Cl Knn) and boosting variants seems to improve upon the BL performance. Sobhani et al (2015) note that the direct application of ensemble techniques on imbalanced data does not circumvent the imbalanced learning issue. In our experiments on behaviour data, we do find a simple boosting procedure applied to the imbalanced data to improve upon the predictive performance. The oversampling techniques OSR, SMOTE and ADASYN show similar outcomes and are competitive methods, seeming to outperform the undersampling techniques and the plain boosting (AB) and cost-sensitive learning (AC) approaches. When dealing with traditional data, the performances of oversampling techniques are generally worse than those of undersampling approaches (Drummond and Holte 2003). It is interesting to observe that for these large and sparse behaviour datasets the opposite is true. This is heavily related to the intrinsic properties of the data under consideration. Undersampling (or feature selection, for that matter) is not recommended because, as we have noted in Section 2.1, each feature provides a small additional contribution to the predictive performance. With respect to AC, the results indicate that R = RL is a better choice compared to more random cost ratios R = 28. The EE-technique has the lowest average rank and is our best performing method. The combination of several balanced subsets with filtered noise/outlier contributions allows one to explore the majority class space efficiently. To strengthen our previous statements, we will in the next paragraphs conduct statistical significance tests in accordance with the procedure outlined in Demšar (2006). The latter paper presents a framework to perform statistical comparisons of classifiers over multiple datasets.

23 The BL technique trains a single SVM on the imbalanced training data.

30 Jellis Vanhoeyveld David Martens
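A minimal sketch of the EE idea — S balanced subsets, one booster per subset, final score averaged — is shown below. Note it substitutes scikit-learn's default AdaBoost base learner (decision stumps) for the paper's SVM/LR combination, and the data in the usage example is a made-up toy set:

```python
import numpy as np
from sklearn.ensemble import AdaBoostClassifier

def easy_ensemble_fit(X, y, n_subsets=10, n_boost=20, seed=0):
    """EasyEnsemble sketch: draw S balanced subsets (all minority instances plus
    an equally sized random slice of the majority class) and boost each one."""
    rng = np.random.default_rng(seed)
    pos = np.flatnonzero(y == 1)   # minority class
    neg = np.flatnonzero(y == 0)   # majority class
    models = []
    for _ in range(n_subsets):
        sub = rng.choice(neg, size=len(pos), replace=False)  # balanced subset
        idx = np.concatenate([pos, sub])
        clf = AdaBoostClassifier(n_estimators=n_boost, random_state=0)
        clf.fit(X[idx], y[idx])
        models.append(clf)
    return models

def easy_ensemble_score(models, X):
    # combine the boosters by averaging their positive-class scores
    return np.mean([m.predict_proba(X)[:, 1] for m in models], axis=0)
```

Because each subset is only twice the minority class size and the subsets are independent, the S fits can trivially run in parallel, which is exactly the EE par scenario discussed in Section 4.6.2.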

The first null-hypothesis we try to reject postulates that all algorithms perform equally well or, equivalently, that their average ranks R_j (see Table 5) are identical. Under the null-hypothesis, the Friedman statistic (Friedman 1937)

\chi_F^2 = \frac{12N}{k(k+1)} \left[ \sum_{j=1}^{k} R_j^2 - \frac{k(k+1)^2}{4} \right]   (6)

with N the number of datasets (N = 15) and k the number of algorithms (k = 13), can be used to compute the statistic (Iman and Davenport 1980)

F_F = \frac{(N-1)\chi_F^2}{N(k-1) - \chi_F^2}   (7)

The latter is distributed according to the F-distribution with k − 1 and (k − 1)(N − 1) degrees of freedom. Evaluating this measure with the average ranks shown in Table 5 results in F_F = 22.98, which is far higher than the critical value of 1.81 at the α = 0.05 significance level. We can therefore reject the null-hypothesis and proceed with post-hoc tests.
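Equations (6) and (7) can be checked directly against the average ranks of Table 5; the following sketch reproduces F_F ≈ 22.98 up to the rounding of the published ranks:

```python
def friedman_iman_davenport(avg_ranks, n_datasets):
    """chi2_F of Eq. (6) and the Iman-Davenport F_F of Eq. (7)."""
    k = len(avg_ranks)
    N = n_datasets
    chi2 = 12.0 * N / (k * (k + 1)) * (sum(r * r for r in avg_ranks)
                                       - k * (k + 1) ** 2 / 4.0)
    ff = (N - 1) * chi2 / (N * (k - 1) - chi2)
    return chi2, ff

# average ranks from Table 5 (N = 15 datasets, k = 13 methods)
ranks = [11.600, 5.000, 4.533, 4.800, 8.167, 12.467, 8.133,
         8.567, 8.267, 8.467, 5.400, 3.267, 2.333]
chi2_f, f_f = friedman_iman_davenport(ranks, 15)  # f_f is approximately 22.98
```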

The Nemenyi test (Nemenyi 1963) allows one to compare each classifier to one another and adjusts the critical value to compensate for making k(k − 1)/2 comparisons.24 “The performance of two classifiers is significantly different if the corresponding average ranks differ by at least the critical difference (CD)” (Demšar 2006). We refer to the aforementioned paper for the calculation of the CD. The result of these comparisons at the α = 0.05 significance level is shown in Table 6. This test is already quite revealing in the sense that, for instance, the oversampling methods (OSR, SMOTE and ADASYN), AC (R = RL) and the EE-techniques are significantly different from the BL. Yet it does not indicate that undersampling or applying plain boosting (AB) results in significantly different performance compared to the BL. The Nemenyi test is still conservative and we therefore compare all classifiers to a single control classifier in the next paragraphs. Note that we only make k − 1 comparisons in that situation, resulting in stronger results compared to the Nemenyi test.
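A sketch of the CD computation following Demšar (2006); the critical value q_0.05 ≈ 3.313 for k = 13 comes from extended studentized-range tables (Demšar's own table stops at k = 10) and is an assumption here:

```python
import math

def nemenyi_cd(k, n_datasets, q_alpha):
    # critical difference: CD = q_alpha * sqrt(k * (k + 1) / (6 * N))
    return q_alpha * math.sqrt(k * (k + 1) / (6.0 * n_datasets))

cd = nemenyi_cd(13, 15, 3.313)  # roughly 4.71 for this study
# BL (avg rank 11.600) vs OSR (5.000): 6.600 > CD  -> significantly different
# BL vs RUS (8.167):                   3.433 < CD  -> not significant
```

These two checks match the corresponding 1/0 entries of Table 6.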

When comparing a single classifier i with a control classifier c, the null-hypothesis states that the two classifiers i and c perform equally well. Under this hypothesis, the test statistic (based on the average ranks R_i shown in Table 5)

z = \frac{R_i - R_c}{\sqrt{k(k+1)/(6N)}}   (8)

is distributed according to a standard normal distribution. This z-value is used to find the corresponding probability (p-value) p from the cumulative standard normal

24 The larger the number of comparisons, the higher the proportion of null-hypotheses that are wrongfully rejected due to random chance.

Imbalanced classification in sparse and large behaviour datasets 31

Table 5 Final comparison of all methods in terms of AUC test set performance; see Section 4.2 for details on the experimental set-up. β ≠ 0 and β_u ≠ 0 for the oversampling respectively undersampling techniques; µ = 100 for each of the boosting variants. Standard deviations are given between brackets and the performance difference with the BL is shown in square brackets.

            Mov G(p = 1)        Mov G(p = 25)        Mov Th(p = [])       Yahoo A(p = 1)
BL          71.6 (2.62) [0]     81.41 (1.32) [0]     79.77 (5.33) [0]     55.92 (2.97) [0]
OSR         75.35 (2.27) [3.8]  83.76 (2.09) [2.3]   85.13 (6.1) [5.4]    60.05 (2.71) [4.1]
SMOTE       76.16 (2.27) [4.6]  83.7 (2.1) [2.3]     85.67 (4.98) [5.9]   60.1 (3) [4.2]
ADASYN      76.07 (2.26) [4.5]  83.63 (2.04) [2.2]   85.65 (5.6) [5.9]    59.9 (2.99) [4]
RUS         72.88 (2.73) [1.3]  81.52 (2.15) [0.1]   82.91 (7.19) [3.1]   57.04 (1.77) [1.1]
Cl Knn      71.43 (1.36) [-0.2] 80.88 (1.19) [-0.5]  78.87 (4.71) [-0.9]  55.78 (2.71) [-0.1]
Far Knn     71.9 (2.95) [0.3]   80.9 (1.48) [-0.5]   84.07 (4.64) [4.3]   57.2 (1.33) [1.3]
CBU         74.17 (2.36) [2.6]  81.51 (1.04) [0.1]   82.76 (7.22) [3]     58.77 (3.43) [2.8]
AB          71.65 (1.73) [0.1]  84.52 (1.89) [3.1]   82.43 (5.18) [2.7]   58.35 (2.62) [2.4]
AC(R = 28)  71.61 (2.46) [0]    83.46 (1.82) [2]     83.27 (5.6) [3.5]    57.72 (2.47) [1.8]
AC(R = RL)  74.65 (2.7) [3.1]   83.35 (2.09) [1.9]   85.41 (4.49) [5.6]   59.47 (2.33) [3.5]
EE(S = 10)  76.04 (2.66) [4.4]  85.05 (1.85) [3.6]   86.1 (5.78) [6.3]    59.66 (3.13) [3.7]
EE(S = 15)  76.12 (2.88) [4.5]  85.14 (1.86) [3.7]   86.42 (5.86) [6.7]   59.76 (2.93) [3.8]

            Yahoo A(p = 25)     Yahoo G(p = 1)       Yahoo G(p = 25)      TaFeng(p = 1)
BL          61.68 (2.42) [0]    66.84 (3.66) [0]     78.82 (1.39) [0]     55.75 (1.6) [0]
OSR         64.59 (3.12) [2.9]  73.08 (2.96) [6.2]   78.52 (2.01) [-0.3]  61.21 (2.24) [5.5]
SMOTE       65.56 (3.33) [3.9]  73.11 (3.12) [6.3]   79.01 (1.21) [0.2]   61.72 (1.81) [6]
ADASYN      65.13 (3.38) [3.4]  73.22 (3.17) [6.4]   79.74 (1.68) [0.9]   61.68 (1.86) [5.9]
RUS         64.11 (2.8) [2.4]   70.65 (3.39) [3.8]   78.91 (1.55) [0.1]   59.25 (2.18) [3.5]
Cl Knn      61.14 (2.13) [-0.5] 66.34 (3.54) [-0.5]  77.26 (1.46) [-1.6]  55.77 (1.28) [0]
Far Knn     63.96 (3.03) [2.3]  66.97 (3.54) [0.1]   78.26 (2.2) [-0.6]   59.98 (1.26) [4.2]
CBU         62.27 (1.79) [0.6]  71.27 (2.89) [4.4]   75.22 (2.42) [-3.6]  58.4 (1.57) [2.6]
AB          63.88 (2.67) [2.2]  68.9 (2.03) [2.1]    79.01 (1.66) [0.2]   56.21 (1.79) [0.5]
AC(R = 28)  64.32 (3.56) [2.6]  68.89 (3.11) [2]     78.99 (1.89) [0.2]   56.33 (1.83) [0.6]
AC(R = RL)  64.31 (3.03) [2.6]  73.13 (2.8) [6.3]    78.41 (2) [-0.4]     61.6 (2.26) [5.9]
EE(S = 10)  66.51 (3.24) [4.8]  72.61 (3.15) [5.8]   80.52 (1.6) [1.7]    61.2 (1.82) [5.4]
EE(S = 15)  66.36 (3.18) [4.7]  73.48 (2.32) [6.6]   80.54 (1.56) [1.7]   61.13 (1.83) [5.4]

            TaFeng(p = 25)      Book(p = 1)          Book(p = 25)         LST(p = 1)
BL          66.94 (1.34) [0]    52.6 (1.29) [0]      60.08 (0.71) [0]     99.99 (0.01) [0]
OSR         68.77 (1.23) [1.8]  55.87 (1.42) [3.3]   64.62 (0.57) [4.5]   99.99 (0.01) [0]
SMOTE       68.47 (1.5) [1.5]   55.07 (0.88) [2.5]   62.96 (0.82) [2.9]   99.99 (0.01) [0]
ADASYN      68.48 (1.47) [1.5]  55.04 (0.91) [2.4]   63.02 (0.57) [2.9]   99.99 (0.01) [0]
RUS         68.28 (1.39) [1.3]  54.26 (0.92) [1.7]   63.28 (0.8) [3.2]    99.98 (0.01) [0]
Cl Knn      66.13 (1.43) [-0.8] 52.69 (1.3) [0.1]    60.02 (0.79) [-0.1]  99.99 (0.01) [0]
Far Knn     68.06 (1.41) [1.1]  56.25 (1.52) [3.7]   64.15 (1.12) [4.1]   99.98 (0.01) [0]
CBU         63.84 (1.07) [-3.1] 53.75 (1.01) [1.2]   54.68 (0.88) [-5.4]  []
AB          67.65 (1.55) [0.7]  54.27 (1.95) [1.7]   65 (0.67) [4.9]      99.99 (0.01) [0]
AC(R = 28)  69.31 (1.23) [2.4]  53.72 (1) [1.1]      61.24 (0.8) [1.2]    99.98 (0.01) [0]
AC(R = RL)  67.15 (1.51) [0.2]  55.73 (1.22) [3.1]   64.6 (0.64) [4.5]    99.99 (0.01) [0]
EE(S = 10)  70.3 (1.35) [3.4]   55.09 (1.29) [2.5]   65.37 (0.61) [5.3]   99.98 (0.01) [0]
EE(S = 15)  70.4 (1.3) [3.5]    55.35 (1.26) [2.8]   65.4 (0.51) [5.3]    99.98 (0.01) [0]


Table 5 Continued. Additionally, an average rank column is added, showing the mean rank of each algorithm across all datasets. Note that we excluded the LST dataset for this purpose.

            Adver(p = [])       Adver(p = 1)         CRF(p = [])           Bank(p = [])
BL          96.61 (1.82) [0]    90.93 (3.02) [0]     64.06 (16.43) [0]     66.82 (0.88) [0]
OSR         96.93 (1.91) [0.3]  93.3 (2.02) [2.4]    80.74 (12.93) [16.7]  71.39 (0.79) [4.6]
SMOTE       97.05 (1.66) [0.4]  93.35 (2.01) [2.4]   78.7 (16.56) [14.6]   []
ADASYN      96.91 (1.95) [0.3]  93.46 (2.21) [2.5]   78.87 (16.71) [14.8]  []
RUS         96.81 (1.87) [0.2]  92.38 (2.51) [1.5]   83.98 (5.99) [19.9]   69.41 (1.19) [2.6]
Cl Knn      96.4 (1.48) [-0.2]  89.73 (3.42) [-1.2]  76.63 (16.19) [12.6]  66.17 (0.72) [-0.6]
Far Knn     95.77 (1.81) [-0.8] 93.88 (1.78) [3]     83.75 (13.11) [19.7]  66.95 (0.56) [0.1]
CBU         97.15 (1.88) [0.5]  94.18 (2.3) [3.3]    []                    []
AB          97.34 (2.18) [0.7]  91.39 (3.23) [0.5]   77.62 (15.15) [13.6]  66.82 (0.88) [0]
AC(R = 28)  97.44 (1.93) [0.8]  91 (3.35) [0.1]      68.31 (14.93) [4.2]   67.67 (0.71) [0.9]
AC(R = RL)  97.46 (1.71) [0.8]  93.51 (2.17) [2.6]   85.08 (9.77) [21]     70.7 (0.8) [3.9]
EE(S = 10)  97.64 (1.35) [1]    92.97 (2.75) [2]     86.18 (10.17) [22.1]  71.46 (0.81) [4.6]
EE(S = 15)  97.63 (1.35) [1]    93.3 (2.14) [2.4]    86.35 (9.99) [22.3]   71.54 (0.76) [4.7]

            Average Rank
BL          11.600
OSR         5.000
SMOTE       4.533
ADASYN      4.800
RUS         8.167
Cl Knn      12.467
Far Knn     8.133
CBU         8.567
AB          8.267
AC(R = 28)  8.467
AC(R = RL)  5.400
EE(S = 10)  3.267
EE(S = 15)  2.333

Table 6 Nemenyi test outcomes at the α = 0.05 significance level. A value of 1 indicates the null-hypothesis is rejected and thus finds the two algorithms significantly different (a value of 0 means we accept the null-hypothesis). Methods are listed in the same order as in Table 5 and are represented more concisely.

      BL  RO  SM  AD  RU  Cl  Fa  CBU  AB  AC1  AC2  EE1  EE2
BL     0   1   1   1   0   0   0   0    0   0    1    1    1
RO     1   0   0   0   0   1   0   0    0   0    0    0    0
SM     1   0   0   0   0   1   0   0    0   0    0    0    0
AD     1   0   0   0   0   1   0   0    0   0    0    0    0
RU     0   0   0   0   0   0   0   0    0   0    0    1    1
Cl     0   1   1   1   0   0   0   0    0   0    1    1    1
Fa     0   0   0   0   0   0   0   0    0   0    0    1    1
CBU    0   0   0   0   0   0   0   0    0   0    0    1    1
AB     0   0   0   0   0   0   0   0    0   0    0    1    1
AC1    0   0   0   0   0   0   0   0    0   0    0    1    1
AC2    1   0   0   0   0   1   0   0    0   0    0    0    0
EE1    1   0   0   0   1   1   1   1    1   1    0    0    0
EE2    1   0   0   0   1   1   1   1    1   1    0    0    0


distribution function: p = 2 × min(Φ(z), 1 − Φ(z)). Holm's method (Holm 1979) compares k − 1 classifiers to a single control classifier and sorts the corresponding p-values in ascending order, so that p_1 ≤ p_2 ≤ … ≤ p_{k−1}. Each p_i is subsequently compared to its associated confidence level25 α_comp = α/(k − i). Holm starts by performing the check p_1 < α/(k − 1) and rejects the null-hypothesis if the test passes. It then proceeds with a similar check for i = 2 and continues until a certain null-hypothesis cannot be rejected. The remaining hypotheses are retained.

Table 7 shows the result of the Holm test at the α = 0.05 significance level with the BL algorithm as control classifier. The table lists each method according to sorted p-values and shows the z-, p- and α_comp-values. The significance column indicates whether the proposed method is significantly different from the BL and incidentally matches the result of the Nemenyi test. The p-values of the undersampling methods (Far Knn, RUS, CBU), AB and AC (R = 28) are only slightly higher than the level required to reject the null-hypothesis. We included a critical significance level α_crit, corresponding to the lowest possible significance level at which the method would be considered significantly different from the BL (if α = α_crit then p = α_comp). In the case of AB, for example, if we were to choose a significance level α = 0.0764 (a confidence level of 92.36%), the associated p-value would be smaller than α_comp and we would proceed to conclude26 that AB performs significantly differently from the BL. To summarize: the null-hypothesis of equal performance between the BL and an alternative method proposed in Section 3 (except Cl Knn) is rejected with a confidence level of more than 91% according to Holm's test. Hence, the performance of the BL approach is significantly worse than that of each of the proposed over- and undersampling methods (except Cl Knn), cost-sensitive learning techniques (AC) and boosting variants (AB, EE).
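The whole z → p → Holm step-down pipeline of Eq. (8) and Table 7 can be reproduced from the average ranks of Table 5 alone; a stdlib-only sketch, using p = 2 min(Φ(z), 1 − Φ(z)) = erfc(|z|/√2):

```python
import math

def holm_vs_control(avg_ranks, control, n_datasets, alpha=0.05):
    """Holm step-down test of every method against a control classifier.
    avg_ranks: {name: average rank}. Returns {name: (z, p, rejected)}."""
    k = len(avg_ranks)
    se = math.sqrt(k * (k + 1) / (6.0 * n_datasets))      # denominator of Eq. (8)
    stats = []
    for name, r in avg_ranks.items():
        if name == control:
            continue
        z = (r - avg_ranks[control]) / se
        p = math.erfc(abs(z) / math.sqrt(2))              # two-sided normal p-value
        stats.append((p, z, name))
    stats.sort()                                          # ascending p-values
    results, rejecting = {}, True
    for i, (p, z, name) in enumerate(stats, start=1):
        rejecting = rejecting and p < alpha / (k - i)     # stop at the first failure
        results[name] = (z, p, rejecting)
    return results

ranks = {"BL": 11.600, "OSR": 5.000, "SMOTE": 4.533, "ADASYN": 4.800,
         "RUS": 8.167, "Cl Knn": 12.467, "Far Knn": 8.133, "CBU": 8.567,
         "AB": 8.267, "AC(R=28)": 8.467, "AC(R=RL)": 5.400,
         "EE(S=10)": 3.267, "EE(S=15)": 2.333}
res = holm_vs_control(ranks, "BL", n_datasets=15)
```

For instance, the entry for EE(S = 15) recovers z ≈ −6.516 and p ≈ 7e-11, and Far Knn is the first method at which the step-down procedure stops, matching Table 7.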

Similar to the previous paragraph, Table 8 presents the results of Holm's test at the α = 0.05 significance level with EE(S = 15) as control classifier. The results incidentally correspond with the result of the Nemenyi test and indicate that the oversampling techniques (OSR, SMOTE and ADASYN), AC(R = RL) and EE(S = 10) are not statistically different from the EE(S = 15) technique at the 0.05 significance level. However, if we were to adapt the significance level to approximately α = 0.25 (a 75% confidence level), the results would conclude significance (except for EE(S = 10)). To summarize: the null-hypothesis of equal performance between EE(S = 15) and an alternative method proposed in Section 3 (except EE(S = 10)) is rejected with a confidence level of more than 75% according to Holm's test. Hence, the performance of the oversampling, undersampling, AB and AC approaches is significantly worse than that of EE(S = 15). We can therefore safely conclude EE to be the preferred method in terms of AUC performance when dealing with imbalanced behaviour data.

25 α_comp adjusts the value of α to compensate for multiple comparisons.
26 This is not entirely true, since Holm's method requires Far Knn to pass the test first before proceeding, which only occurs at the 0.088662 significance level.


Table 7 Holm test at the α = 0.05 significance level with BL as reference. The table shows the z test statistic with associated p-value; α_comp = α/(k − i), with k the number of algorithms (k = 13) and i corresponding to the position of the method in the sorted p-value vector. Since we already ranked the p-values in ascending order, i takes on the row number (e.g. OSR has i = 5). The column “significant” denotes whether we can reject the null-hypothesis (significant: p < α_comp). α_crit corresponds with the smallest possible significance level where we would decide to reject the null-hypothesis (α_crit = α · p/α_comp).

             z          p         α_comp    significant  α_crit
EE(S = 15)  -6.51642   7.2E-11   0.004167   1            8.64E-10
EE(S = 10)  -5.86009   4.63E-09  0.004545   1            5.09E-08
SMOTE       -4.96936   6.72E-07  0.005      1            6.72E-06
ADASYN      -4.78183   1.74E-06  0.005556   1            1.56E-05
OSR         -4.64119   3.46E-06  0.00625    1            2.77E-05
AC(R = RL)  -4.35991   1.3E-05   0.007143   1            9.11E-05
Far Knn     -2.4378    0.014777  0.008333   0            0.088662
RUS         -2.41436   0.015763  0.01       0            0.078815
AB          -2.34404   0.019076  0.0125     0            0.076305
AC(R = 28)  -2.20339   0.027567  0.016667   0            0.082701
CBU         -2.13307   0.032919  0.025      0            0.065837
Cl Knn       0.609449  0.542227  0.05       0            0.542227

Table 8 Holm test at the α = 0.05 significance level with EE(S = 15) as reference.

             z          p         α_comp    significant  α_crit
Cl Knn       7.12587   1.03E-12  0.004167   1            1.24E-11
BL           6.516421  7.2E-11   0.004545   1            7.92E-10
CBU          4.383348  1.17E-05  0.005      1            0.000117
AC(R = 28)   4.313027  1.61E-05  0.005556   1            0.000145
AB           4.172384  3.01E-05  0.00625    1            0.000241
RUS          4.102063  4.09E-05  0.007143   1            0.000287
Far Knn      4.078623  4.53E-05  0.008333   1            0.000272
AC(R = RL)   2.156513  0.031044  0.01       0            0.155218
OSR          1.875229  0.060761  0.0125     0            0.243045
ADASYN       1.734587  0.082814  0.016667   0            0.248442
SMOTE        1.547064  0.121848  0.025      0            0.243696
EE(S = 10)   0.65633   0.511612  0.05       0            0.511612

4.6.2 Timing-wise comparison

The performance of each of the proposed algorithms in terms of training times depends highly upon the chosen parameters. The β and β_u parameters control the number of training instances and are an important contribution to the subsequent SVM learning times for each of the over- and undersampling techniques. The number of boosting iterations t ∈ [0,T] is also a major factor in each of the boosting variants proposed in Section 3.3. Furthermore, the chosen regularization parameter C highly influences the training times of the SVM inducer: higher C-values result in larger training times (Fan et al 2008). Not only do the chosen hyperparameters affect learning times; data characteristics such as the number of minority instances, the imbalance level, the number of features, the amount of overlapping etc. have a major effect as well.


In comparing each of the methods outlined in Section 3, we make use of a methodology similar to the one presented previously (see Section 4.6.1). For each of the ten folds, a suitable parameter combination is selected based on optimal validation set AUC-performance. These chosen parameters are applied in determining the timings of each method. The final results indicate the average timings across ten folds. Hence, we show the required amount of time for each method when using the parameter combination resulting in the best performance. Note that this might, for instance, result in using β = 13 in case of OSR or using only t = 5 boosting iterations in EE. Computational timings are shown in Table 9.

As one can observe from the aforementioned table, CBU is the slowest method, as it relies on computationally intensive clustering procedures. The synthetic SMOTE and ADASYN oversampling approaches are also very time consuming: they both rely on computationally expensive nearest neighbour computations, in conjunction with a synthetic sample generation procedure that requires a single pass over all minority training instances.

The RUS method shows, as expected, superior results in terms of computational timings. Other undersampling methods perform quite well, though they are slower than BL and OSR due to their inherent nearest neighbour computations.

The boosting variants seem to perform worse than the BL, OSR and each of the undersampling methods. Since EE was the best method in terms of AUC-performance (see Section 4.6.1), we focus on this learner in this paragraph. For the medium sized datasets (all but CRF and Bank), OSR outperforms EE(S = 15). Yet for the large datasets (CRF and Bank), the opposite is true. An SVM is known to have a quadratic computational complexity in the number of training instances (Suykens et al 2002; García and Lozano 2007). For this reason, we consider OSR (or any other oversampling technique) inappropriate for large datasets. Since with EE each subset has a size limited to twice the minority class training size, we expect increasing benefits of the EE-technique for larger and more imbalanced behaviour data. As Junqué de Fortuny et al (2014a) have observed, larger behaviour data in terms of the number of instances or features also contributes to an increase in predictive power. Hence, we expect to see this type of big behaviour data more often in real-life applications, where computational resources can become a major constraint. Because the EE-technique is easily parallelizable, by simultaneously performing the boosting process for each of the independent subsets, we added an extra row in Table 9: EE par corresponds with the time needed for EE(S = 15) divided by 15, indicating the time that would be required for EE if each subset were trained in parallel. We can see that EE par can outperform OSR in 8 out of 16 cases, precisely for the datasets that show a high imbalance level (e.g. p = 1 or p = []).

We conclude our comparison of the proposed methods in Figure 5, which shows the average ranks in terms of computational timings (see Table 9) and AUC performance for each of the available data sources. Regarding the ranks with respect to the latter, we added an extra row EE par, having the same AUC-performance as EE(S = 15) for each dataset in Table 5. Ranks are subsequently determined according to the procedure outlined in Section 4.6.1. Additionally, Figure 18 in Appendix D shows the result for the large and highly imbalanced CRF and Bank datasets. We expect the latter type of data to be of major interest for reasons already indicated.


Table 9 Computational timings (in seconds), averaged across ten folds, using the parameter combination resulting in the highest validation set AUC-performance. First and second best performances are emphasized in boldface (third and fourth best are underlined). EE par is the time required for EE(S = 15) divided by 15.

            Mov G(p = 1)   Mov G(p = 25)   Mov Th(p = [])   Yahoo A(p = 1)
BL          0.032889       0.056697        0.558563         0.026922
OSR         0.055043       0.062802        0.99009          0.044421
SMOTE       0.218821       0.937057        3.841482         0.057726
ADASYN      0.284688       1.802399        5.191265         0.087694
RUS         0.011431       0.025383        0.155224         0.007991
CL Knn      0.046599       0.599846        0.989914         0.037182
Far Knn     0.039887       0.80072         0.683023         0.027788
CBU         10.34111       10.60173        68.22839         16.92477
AB          0.169792       0.841443        3.460246         0.139251
AC(R = 28)  0.471994       2.996585        10.86907         0.366555
AC(R = RL)  0.53376        1.179542        6.065177         0.209015
EE(S = 10)  0.117226       6.065145        1.17995          0.148973
EE(S = 15)  0.20474        7.173737        2.119991         0.180365
EE par      0.013649       0.478249        0.141333         0.012024

            Yahoo A(p = 25)  Yahoo G(p = 1)  Yahoo G(p = 25)  TaFeng(p = 1)
BL          0.092954         0.011915        0.044164         0.026728
OSR         0.027887         0.013241        0.047206         0.040919
SMOTE       1.062686         0.056153        0.883698         0.219553
ADASYN      2.050993         0.079073        1.733367         0.306618
RUS         0.048471         0.003234        0.033423         0.002916
CL Knn      0.84391          0.025404        0.502515         0.092167
Far Knn     0.664124         0.026576        0.500206         0.080159
CBU         15.69442         12.87221        13.55035         24.67279
AB          0.445546         0.078777        0.169977         0.114619
AC(R = 28)  1.034044         0.321723        0.515953         0.926178
AC(R = RL)  0.706215         0.226741        0.112949         0.610233
EE(S = 10)  1.026577         0.100331        1.527146         0.058052
EE(S = 15)  1.607596         0.077483        2.472582         0.10538
EE par      0.107173         0.005166        0.164839         0.007025

            TaFeng(p = 25)  Book(p = 1)  Book(p = 25)  LST(p = 1)
BL          0.032033        0.080035     0.318093      0.652045
OSR         0.032414        0.132927     0.092757      0.87152
SMOTE       5.089283        3.409418     11.43444      4.987705
ADASYN      8.148419        3.689661     12.25441      6.840083
RUS         0.020457        0.022713     0.031972      0.432839
CL Knn      1.713731        0.400873     3.711648      2.508374
Far Knn     1.539437        0.379086     3.988552      2.511037
CBU         26.42686        41.98663     46.31987      []
AB          0.713265        0.61719      1.238585      2.466151
AC(R = 28)  1.234647        1.666131     2.330635      1.451671
AC(R = RL)  0.279047        0.860346     0.197053      1.23763
EE(S = 10)  2.484502        2.145747     7.177484      0.524066
EE(S = 15)  3.363971        2.480066     11.21945      0.784111
EE par      0.224265        0.165338     0.747963      0.052274


Table 9 Continued. Additionally, an extra column indicating average ranks is included (lower ranks are preferred).

            Adver(p = [])  Adver(p = 1)  CRF(p = [])  Bank(p = [])
BL          0.010953       0.002796      0.725911     7.089334
OSR         0.012178       0.006166      3.685813     179.7481
SMOTE       0.123112       0.017764      5.633862     []
ADASYN      0.183767       0.021728      5.768669     []
RUS         0.012115       0.00204       0.147392     5.247441
CL Knn      0.061324       0.005568      1.106755     7.373282
Far Knn     0.079078       0.007069      1.110379     9.759619
CBU         3.378235       3.236754      []           []
AB          0.069199       0.103518      1.153196     8.308618
AC(R = 28)  0.193092       0.068905      2.047434     7.170548
AC(R = RL)  0.107652       0.037963      1.387174     10.63466
EE(S = 10)  0.138485       0.085686      0.198656     24.95117
EE(S = 15)  0.185136       0.139121      0.285345     36.40107
EE par      0.012342       0.009275      0.019023     2.426738

            Average Rank [pos]
BL          2.94 [2]
OSR         4.19 [4]
SMOTE       9.59 [11]
ADASYN      10.91 [13]
RUS         1.38 [1]
CL Knn      6.5 [5]
Far Knn     6.56 [6]
CBU         14 [14]
AB          8.06 [7]
AC(R = 28)  10.81 [12]
AC(R = RL)  9.25 [9]
EE(S = 10)  8.25 [8]
EE(S = 15)  9.56 [10]
EE par      3 [3]

Timings are a substantial concern with respect to these large datasets. As can be observed, the EE-technique thrives for this big and highly imbalanced data, even in its non-parallel form.


[Figure 5: scatter plot of average rank AUC (x-axis, 0–14) versus average rank Time (y-axis, 0–18) for BL, OSR, SMOTE, ADASYN, RUS, Cl Knn, Far Knn, CBU, AB, AC(R = 28), AC(R = RL), EE(S = 10), EE(S = 15) and EE par]

Fig 5 Average rank AUC versus average rank Time (see Table 9) across all datasets from Table 2 (excluding Flickr and Kdd). Regarding the AUC rank, an extra row EE par is added, having the same AUC-value as EE(S = 15). Ranks for each dataset are subsequently obtained via the procedure outlined in Section 4.6.1. Points occurring in the upper-right region are preferred.

5 The effect of the chosen base learner

In this short section we investigate the influence of the chosen base learner. More specifically, we ask ourselves whether some base inducers can more easily cope with the imbalanced learning issue, thereby verifying the conclusion that EE is a suitable technique to improve upon the baseline performance. We rely on several popular classification methods for behaviour data: two regularization based techniques (SVM, LR) and two heuristic approaches (Naive Bayes (NB) and Behavioural Similarity (BeSim)). The next paragraphs briefly introduce these methods. Note that source codes and dataset results are provided in our on-line repository.

LR constructs a linear model that maximizes the likelihood of the observed data by imposing a logistic model for the parametric probability model P(Y|X). Ng and Jordan (2002) provide a thorough introduction to the topic27 and note that in its plain form, LR can suffer from overfitting, especially when the input data is very high dimensional. Therefore, we resort to regularized logistic regression. We opt for L2-regularized LR, since L1-regularization corresponds with a natural way of imposing feature selection (Ng 2004). The LIBLINEAR toolbox (Fan et al 2008) is used to obtain these L2-regularized LR models.

27 For an accessible introduction, see the chapter on “Generative and Discriminative Classifiers: Naive Bayes and Logistic Regression” provided at the following link: http://www.cs.cmu.edu/~tom/NewChapters.html
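As an illustration, a hedged sketch of fitting L2-regularized LR on sparse binary behaviour data via scikit-learn's LIBLINEAR solver (a stand-in for the LIBLINEAR toolbox itself; the toy matrix below is made up):

```python
import numpy as np
from scipy.sparse import csr_matrix
from sklearn.linear_model import LogisticRegression

# toy sparse binary behaviour matrix: rows = instances, columns = fine-grained features
X = csr_matrix(np.array([[1, 0, 1, 0],
                         [1, 1, 0, 0],
                         [0, 0, 1, 1],
                         [0, 1, 0, 1]], dtype=float))
y = np.array([1, 1, 0, 0])

# L2-regularized LR via the LIBLINEAR solver; C is the inverse regularization strength
clf = LogisticRegression(penalty="l2", C=1.0, solver="liblinear").fit(X, y)
scores = clf.predict_proba(X)[:, 1]  # ranking scores, usable for AUC-style evaluation
```

Lowering C strengthens the regularization, which is exactly the "weakness" knob exploited by the boosting variants discussed earlier.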


NB (Ng and Jordan 2002) relies on the use of Bayes' rule, which expresses the posterior P(Y|X) as the product of the likelihood P(X|Y) and the prior P(Y), divided by the evidence P(X). The naivety of the approach originates from assuming conditional independence of the features. We impose a multivariate event model for the distribution P(X|Y). Junqué de Fortuny et al (2014a) provide an efficient NB implementation for dealing with large and sparse binary behaviour data.
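The multivariate (Bernoulli) event model can be sketched with scikit-learn's BernoulliNB as a stand-in for the dedicated sparse implementation of Junqué de Fortuny et al (2014a); toy data again made up:

```python
import numpy as np
from scipy.sparse import csr_matrix
from sklearn.naive_bayes import BernoulliNB

# multivariate (Bernoulli) event model on sparse binary behaviour data:
# both the presence and the absence of a feature contribute to P(X|Y)
X = csr_matrix(np.array([[1, 0, 1, 0],
                         [1, 1, 0, 0],
                         [0, 0, 1, 1],
                         [0, 1, 0, 1]]))
y = np.array([1, 1, 0, 0])

nb = BernoulliNB(alpha=1.0).fit(X, y)   # alpha = 1.0: Laplace smoothing
posterior = nb.predict_proba(X)[:, 1]   # P(y = 1 | x) under the naive assumption
```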

The BeSim approach conceptually corresponds to projecting the bigraph to a unigraph using the procedure outlined in Section 3.2. Next, the weighted-vote relational neighbour classifier (Macskassy and Provost 2007) is used to infer unknown labels/scores through a weighted probability estimation using the known labels of neighbouring nodes. A very fast and scalable implementation of this approach (the SW-transformation) is presented in detail in Stankova et al (2015).
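A conceptual sketch of the BeSim scoring idea — not the actual SW-transformation, whose weighting scheme differs — where the projected link weight between two instances is simply their number of shared binary features, and the score is the weighted vote of the labelled neighbours:

```python
import numpy as np

def besim_scores(X_train, y_train, X_test):
    """wvRN-style scoring sketch: the bigraph projection gives link weight
    W[i, j] = number of features instance i shares with training instance j;
    the score is the weighted fraction of positively labelled neighbours."""
    W = X_test @ X_train.T                  # co-occurrence weights (projection)
    pos_mass = W @ y_train                  # weight carried by positive neighbours
    total = W.sum(axis=1)
    return np.divide(pos_mass, total,
                     out=np.zeros_like(pos_mass, dtype=float), where=total > 0)

X_tr = np.array([[1, 1, 0, 0],
                 [1, 0, 1, 0],
                 [0, 0, 1, 1]], dtype=float)
y_tr = np.array([1.0, 1.0, 0.0])
X_te = np.array([[1, 0, 0, 0],              # shares features only with positives
                 [0, 0, 0, 1]], dtype=float)  # shares only with the negative
s = besim_scores(X_tr, y_tr, X_te)
```

With sparse matrices the same two products remain cheap, which is what makes this family of methods scale to very large behaviour data.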

In Table 10 we compare the performances of the baseline (BL) application of the methods with their EE-version (with S = 15, T = 20). It is quite striking to see the EE-version dominate the baseline version when looking at each method individually. This suggests that the imbalanced learning issue has an impact on each of the aforementioned techniques. Focussing on the regularization based approaches, we observe the baseline LR to be less influenced by the imbalance than the SVM method. Their EE-versions (EE SVM and EE LR) can boost the baseline performance in almost all cases and have the highest average ranks amongst all the proposed methods. As already emphasized, the regularization based approaches offer an added element of flexibility in the sense that the strength of the underlying base learner can be controlled.

The heuristic approaches NB and BeSim do not offer this flexibility. Furthermore, it is not directly possible to integrate instance weights during the training phase of the base learner. During the boosting process, we therefore sample from the distribution D_t to generate (unweighted) examples and train a BeSim/NB learner next. The results in Table 10 indicate BL NB to be the worst performing algorithm amongst all methods tried. Due to the conditional independence assumption, one could regard this as a weak learner, which makes it suitable for use in a boosting process. Indeed, the EE NB approach is able to significantly improve upon its performance. We found this to be the slowest EE-version28 (e.g. taking more than 24 hours to process a single fold of the Bank dataset, versus several minutes for the other versions). The average rank estimate of EE NB is rather pessimistic, since we assigned the lowest performance across all methods in case of missing results. BeSim can be considered a stronger learner compared to NB; its possible boost in performance is typically lower than that of NB. Having said this, we found EE BeSim to improve upon BL BeSim for all datasets (except Adver(p = 1)).
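Boosting-by-resampling for learners that cannot accept instance weights can be sketched as follows (the uniform D_t below is just the t = 0 case; at later iterations D_t concentrates on the hard instances):

```python
import numpy as np

def resample_for_weightless_learner(X, y, dist, rng):
    """Learners without sample-weight support in `fit` (the NB/BeSim case here)
    can still be boosted: draw an unweighted bootstrap sample from the boosting
    distribution D_t and train on that sample instead of passing weights."""
    idx = rng.choice(len(y), size=len(y), replace=True, p=dist)
    return X[idx], y[idx]

rng = np.random.default_rng(0)
X = np.arange(10).reshape(-1, 1)
y = np.array([0, 0, 0, 0, 0, 1, 1, 1, 1, 1])
D_t = np.full(10, 0.1)            # uniform boosting distribution at t = 0
Xs, ys = resample_for_weightless_learner(X, y, D_t, rng)
```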

28 This statement only relates to the EE-version. In the baseline case, the timings of the NB-version are roughly of the same order of magnitude as SVM and LR (though this heavily depends on the chosen regularization parameter C). With EE, each subset has a size limited to twice the minority class training size (a relatively low number compared to the size of the majority class). This works in favour of SVM/LR (e.g. SVM has a computational complexity of O(m²), with m the number of training instances). Also note that the NB-version is indeed optimized for sparse data. Internally, the NB-implementation writes a fairly large model file to a specified directory (sizes of 1 GB can be encountered there). This writing operation slows down the training stage (especially in case of EE, where S × T = 15 × 20 = 300 model files are constructed).


Table 10 Average tenfold AUC test set performance of the BL compared to its EE-version (S = 15, T = 20) for SVM, LR, BeSim and NB. In case the EE NB implementation crashed, we put a × in the table. If it is too slow (> 24 hours per fold), we place a []. Additionally, average ranks are assigned to each method. In case EE NB has missing results for a particular dataset, we ensure it has the lowest rank across all other algorithms.

          Mov G(p = 1)   Mov G(p = 25)  Mov Th(p = [])  Yahoo A(p = 1)
BL SVM    71.6 (2.62)    81.41 (1.32)   79.77 (5.33)    56.49 (3.37)
EE SVM    76.12 (2.88)   85.13 (1.86)   86.43 (5.86)    59.74 (2.96)
BL LR     71.02 (2.09)   84.39 (1.84)   83.14 (4.17)    57.84 (2.39)
EE LR     76.69 (2.92)   85.03 (1.98)   86.3 (5.37)     59.79 (2.62)
BL BeSim  76.1 (3.58)    81.3 (2.92)    82.81 (6.6)     56.27 (2.73)
EE BeSim  76.31 (3.71)   81.37 (2.9)    85.02 (6.28)    57.7 (1.71)
BL NB     70.26 (5.84)   77.01 (2.54)   70.48 (10.14)   52.56 (2.09)
EE NB     75.93 (2.83)   85.56 (2.01)   86.91 (4.15)    57.55 (2.73)

          Yahoo A(p = 25)  Yahoo G(p = 1)  Yahoo G(p = 25)  TaFeng(p = 1)
BL SVM    61.61 (2.48)     66.84 (3.66)    78.82 (1.39)     55.75 (1.6)
EE SVM    66.38 (3.16)     73.48 (2.32)    80.55 (1.55)     61.13 (1.83)
BL LR     66.27 (2.96)     69.82 (1.93)    80.45 (1.59)     58.91 (2.31)
EE LR     66.22 (3.28)     73.08 (2.14)    80.53 (1.56)     61.43 (2.32)
BL BeSim  64.54 (2.02)     68.89 (2.49)    79.55 (1.96)     57.89 (1.18)
EE BeSim  65.25 (2.23)     71.18 (2.91)    80.04 (1.85)     59.36 (1.47)
BL NB     65 (1.65)        63.33 (2.56)    78.89 (1.64)     54.61 (1.2)
EE NB     66.6 (2.79)      70.99 (2.88)    81.01 (1.3)      59.01 (1.84)

          TaFeng(p = 25)  Book(p = 1)   Book(p = 25)  LST(p = 1)
BL SVM    66.94 (1.34)    52.6 (1.29)   60.08 (0.71)  99.99 (0.01)
EE SVM    70.4 (1.3)      55.34 (1.28)  65.4 (0.51)   99.98 (0.01)
BL LR     69.24 (1.3)     55.34 (1.27)  63.84 (0.75)  99.99 (0.01)
EE LR     70.28 (1.28)    55.49 (1.49)  65.41 (0.63)  99.97 (0.02)
BL BeSim  67.49 (1.23)    55.19 (1.27)  63.7 (0.63)   99.99 (0.01)
EE BeSim  68 (1.21)       55.21 (1.15)  64.38 (0.42)  99.99 (0)
BL NB     65.21 (1.64)    52.93 (0.9)   59.75 (0.47)  98.69 (0.3)
EE NB     70.72 (1.15)    ×             63.46 (0.61)  99.92 (0.04)

          Adver(p = [])  Adver(p = 1)  CRF(p = [])    Bank(p = [])
BL SVM    96.37 (1.94)   91.18 (2.97)  64.36 (18.97)  66.82 (0.88)
EE SVM    97.63 (1.35)   93.3 (2.14)   86.35 (9.99)   71.54 (0.76)
BL LR     97.19 (1.44)   88.51 (1.93)  81.87 (19.63)  71.43 (0.72)
EE LR     97.57 (0.96)   93.02 (2.06)  86.84 (9.62)   71.77 (0.62)
BL BeSim  97.26 (1.12)   95.38 (1.35)  86.91 (9.36)   67.85 (0.67)
EE BeSim  97.38 (1.04)   93.83 (1.35)  87.02 (10.43)  70.41 (0.55)
BL NB     93.75 (1.9)    93.37 (1.9)   87.24 (9.38)   67.83 (0.63)
EE NB     94.04 (1.75)   ×             ×              []

          Flickr(p = 0.1)  Kdd(p = 0.5)  Average Rank
BL SVM    74.92 (0.17)     74.53 (0.05)  6.44 [7]
EE SVM    79.86 (0.13)     80.98 (0.05)  2.39 [1]
BL LR     79.03 (0.11)     81.29 (0.04)  4.28 [4]
EE LR     79.85 (0.13)     80.75 (0.05)  2.61 [2]
BL BeSim  74.62 (0.13)     74.95 (0)     5.11 [6]
EE BeSim  76.4 (0.13)      77.55 (0.03)  3.61 [3]
BL NB     81.36 (0.1)      74.29 (0.05)  6.5 [8]
EE NB     []               []            5.06 [5]


6 Conclusions and future research directions

We investigated the effects of sampling and cost-sensitive learning techniques on the problem of learning from imbalanced behaviour data. This type of big data has fundamentally different properties and internal structure than traditional data, and is characterized by sparseness and very large dimensions. This is a new research topic which enables benefits across a wide variety of application domains, such as fraud detection, targeted advertising, predictive policing, churn prediction, default prediction etc. To enable future developments in this field, we provide an on-line repository containing implementations, datasets and results, available at the following link: http://www.applieddatamining.com/cms?q=software

Oversampling techniques need to be adapted in order to cope with this kind of data. We proposed new versions of the traditional SMOTE and ADASYN techniques, which differ in the way synthetic samples are generated and similarity computations are performed. The plain random oversampling (OSR) and the synthetic approaches are competitive methods in terms of AUC and do not seem to suffer from overfitting, as traditional studies report. Though the synthetic approaches have AUC-performances comparable to OSR, they are useless in practice due to their computationally expensive procedures. The OSR method is very fast for small and medium sized datasets, yet becomes inappropriate for larger datasets, because the training times of the underlying base learner increase drastically.

With respect to undersampling methodologies, we employed a random undersampling technique (RUS) and Knn-based informed approaches. The latter type selects majority class instances that are closest (Cl Knn) or farthest (Far Knn) from the minority class examples, and allows one to investigate the effect of noise/outliers on SVM performance. We observed that these odd instances have a severe performance degrading effect. RUS and Far Knn approaches can significantly improve performance with respect to the baseline, thanks to their ability to reduce noise/outlier levels and to put more emphasis on the minority class. Having said this, they usually perform worse than any of the oversampling approaches. This is in sheer contrast to the results obtained from traditional studies and is heavily related to the intrinsic nature of behaviour data. Timing-wise, RUS is the fastest method. The informed approaches are slower than RUS due to the required nearest neighbour computations. Additionally, a cluster-based undersampling (CBU) technique was proposed to alleviate the within-class imbalance. For low undersampling rates, the performance of CBU is superior to plain RUS, whereas for high undersampling rates the opposite is true. Overall, no substantial gain in performance is obtained in comparison with a simple RUS approach. Furthermore, the method is the slowest among all techniques considered and relies on computationally expensive clustering heuristics.

We studied the effects of boosting, where we made use of a very specific linear SVM and logistic regression combination that uses confidence-rated predictions instead of the more popular plain "discrete" boosting algorithm (with weak hypotheses taking values in {−1, 1}). Our experiments clearly indicated that the regularization constant C in the SVM formulation acts as a "weakness" indicator. Indeed, higher C-values produce stronger learners and should not be used in the boosting process, which matches the analysis of studies dealing with traditional data. Strong learners can already quite accurately distinguish majority class from minority class behaviours. The boosting process will focus on the hard-to-learn instances, among which are noise/outliers, and this can result in overfitting behaviour. We also observed that boosting can usually outperform the plain baseline, indicating that a combination of weak learners can achieve higher AUC-performance than a single strong learner.
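The confidence-rated weight update referred to above (Schapire and Singer 1999) can be illustrated in a few lines: the weak hypothesis outputs a real-valued confidence f(x), and instance weights are scaled by exp(−y·f(x)), so confidently correct examples lose weight and misclassified ones gain it. The data below are illustrative only.

```python
import numpy as np

def confidence_rated_round(w, y, f):
    """One confidence-rated boosting round: w_i <- w_i * exp(-y_i * f(x_i)),
    followed by renormalization to a probability distribution."""
    w = w * np.exp(-y * f)
    return w / w.sum()

# y*f < 0 marks mistakes: the second and third examples gain weight
y = np.array([+1.0, +1.0, -1.0])
f = np.array([+2.0, -0.5, +1.0])
w = confidence_rated_round(np.full(3, 1 / 3), y, f)
```

The stronger the learner (e.g. a high-C SVM), the larger |f| becomes and the more aggressively weight concentrates on the remaining hard instances, which is one way to read the overfitting behaviour noted above.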

EasyEnsemble (EE) integrates random undersampling with a boosting process. By sampling several balanced subsets from the training data and combining these, we can emphasize the minority class, achieve noise/outlier reduction effects and simultaneously explore the majority class space efficiently. Our version differs from the original formulation because we employ a different type of underlying weak learner (an SVM and LR combination) that is fed to a confidence-rated boosting algorithm (instead of the discrete version). We observed superior AUC-performance with respect to all of the aforementioned techniques. Furthermore, the method is very fast, since each subset can be trained in parallel and has a size that is only twice as large as the minority class training set. We studied the effect of the number of subsets and noted that a limited amount of these (O(10)) is already sufficient to obtain satisfactory performances; increasing this number further brings only minor benefits.
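The subset-sampling-and-averaging structure of EE can be sketched as below. The `fit_score` argument stands in for the boosted SVM/LR learner of the paper; here a hypothetical centroid-difference scorer is used purely to make the sketch runnable.

```python
import numpy as np

def easy_ensemble_scores(X, y, X_test, fit_score, n_subsets=10, seed=0):
    """EE-style sketch: draw n_subsets balanced subsets (all minority rows
    plus an equally sized random majority sample), fit a scorer on each
    and average the test scores. Each subset could be trained in parallel."""
    rng = np.random.default_rng(seed)
    mino = np.flatnonzero(y == 1)
    maj = np.flatnonzero(y == 0)
    scores = np.zeros(len(X_test))
    for _ in range(n_subsets):
        sub = np.concatenate([mino, rng.choice(maj, size=len(mino), replace=False)])
        scores += fit_score(X[sub], y[sub], X_test)
    return scores / n_subsets

def centroid_score(X_tr, y_tr, X_te):
    # hypothetical stand-in for the boosted SVM/LR weak learner chain
    return X_te @ (X_tr[y_tr == 1].mean(axis=0) - X_tr[y_tr == 0].mean(axis=0))

X = np.array([[1.0, 1.0], [1.0, 1.0], [0.0, 0.0],
              [0.0, 0.0], [0.0, 0.0], [0.0, 0.0]])
y = np.array([1, 1, 0, 0, 0, 0])
s = easy_ensemble_scores(X, y, np.array([[1.0, 1.0], [0.0, 0.0]]), centroid_score, n_subsets=5)
```

Each subset holds 2 × |minority| instances, which is why training cost stays low even on highly imbalanced data.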

Additionally, a statistical comparison of each of the proposed methods was conducted across a wide variety of datasets. Holm's significance test indicates that each of the proposed techniques (oversampling, undersampling (except Cl Knn), cost-sensitive learning and boosting variants) can significantly outperform the plain baseline performance (each of the null-hypotheses of equal performance gets rejected with a confidence level of at least 91%). Comparison of the techniques with EE showed that the latter can significantly outperform all of them (each of the null-hypotheses of equal performance gets rejected with a confidence level of at least 75%). A timing-wise comparison showed EE to be the preferred method (after RUS) for large and highly imbalanced datasets. A parallel implementation would make EE the fastest method even for medium-sized datasets that show a high level of imbalance.
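Holm's step-down procedure used for these comparisons is simple to state: sort the m p-values in ascending order, compare the i-th smallest against α/(m − i), and stop rejecting at the first failure. A minimal sketch (our own helper name, with made-up p-values):

```python
def holm_reject(p_values, alpha=0.05):
    """Holm's step-down procedure: reject hypotheses in ascending p-value
    order while p <= alpha / (m - rank); retain everything after the
    first non-rejection."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    reject = [False] * m
    for rank, i in enumerate(order):
        if p_values[i] <= alpha / (m - rank):
            reject[i] = True
        else:
            break
    return reject

rej = holm_reject([0.001, 0.04, 0.03, 0.30], alpha=0.05)
```

In this example only the smallest p-value survives the corrected threshold (0.05/4 = 0.0125); the second-smallest, 0.03, exceeds 0.05/3 and halts the procedure.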

The influence of the chosen base learner was investigated by comparing the baseline to its EE-version for the SVM, LR, NB and BeSim inducers. Each of the EE-versions is able to improve upon the baseline performance for nearly all datasets, where base learners of the regularization type are superior because of their ability to control the strength of the learner.

Though we conducted an extensive benchmark study, there are still plenty of future research directions. First and most important, there is an urgent need to develop a more general theoretical framework to completely understand the effect of imbalance and why certain methods outperform others. Secondly, other types of mechanisms could be investigated in this setting: advanced sampling (Guo et al 2008) set-ups perform sampling based on preliminary classifications; one-class learning is a recognition-based methodology where a discriminative boundary is learned around the training examples of a single class alone, and Raskutti and Kowalczyk (2004) demonstrated the superiority of the recognition-based one-class SVMs over the discriminative two-class SVMs; other types of ensemble learning strategies (e.g. bagging and stacking) could be investigated; Batista et al (2004) note that combining over- and undersampling can be beneficial when data are highly imbalanced or show very few minority class instances; and the ideas behind Extended Nearest Neighbour (Tang and He 2015) could guide us in the sampling process. As opposed to Knn, "it considers not only who are the nearest neighbours of a certain sample, but also who consider the particular sample as their nearest neighbour" (Tang and He 2015). The method assigns a class membership to an unknown sample that maximizes the intra-class coherence. Finally, more advanced techniques exist that can perform a Knn search (with K the number of nearest neighbours) faster or with slightly better performance.

Another issue relates to the lack of comprehensibility of the EasyEnsemble technique in its present form. Because we currently integrate Logistic Regression in the boosting process, the final ensemble classifier is non-linear and therefore incomprehensible. It might be interesting to investigate the numerical boosting technique proposed in Schapire and Singer (1999), where each weak learner is allowed to output real values (instead of values in [−1, 1]). In that case, we would be able to use a plain linear SVM as a weak learner, without the Logistic Regression component. The final ensemble would be a linear model, which is surely easier to comprehend. Also, the instance-based approach of Martens and Provost (2014) could be used to explain individual predictions of the non-linear EE-technique.
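The comprehensibility argument above rests on a small algebraic fact worth making explicit: a weighted sum of linear models is itself a linear model, so a boosted ensemble of linear SVMs collapses into a single weight vector and intercept. A numeric sketch (all values illustrative):

```python
import numpy as np

# If every weak learner t is linear, f_t(x) = w_t . x + b_t, the ensemble
# F(x) = sum_t alpha_t * f_t(x) collapses into ONE linear model with
# weights sum_t alpha_t * w_t and intercept sum_t alpha_t * b_t.
alphas = np.array([0.6, 0.4])
W = np.array([[1.0, -2.0], [0.5, 1.0]])   # one weight row per weak learner
b = np.array([0.1, -0.3])

w_ens = alphas @ W                         # combined weight vector
b_ens = alphas @ b                         # combined intercept

x = np.array([2.0, 1.0])
ensemble_out = sum(a * (wt @ x + bt) for a, wt, bt in zip(alphas, W, b))
linear_out = w_ens @ x + b_ens
```

This is precisely why replacing the SVM/LR pair with a plain linear SVM under real-valued boosting would yield a single, interpretable linear model.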

A Oversampling experiments

Table 11 Oversampling experiments. Results show the average tenfold AUC test set performance with respect to increasing β-values (β = [β1, β2, β3, β4] = [0, 1/3, 2/3, 1]). Additionally, standard deviations are shown between brackets. Optimal performances for each method are highlighted in boldface. See Section 4.2 for parameter settings.

Mov G(p = 1)
         β1             β2             β3             β4
OSR      71.6 (2.62)    74.37 (2.04)   73.6 (1.84)    74.73 (2.45)
SMOTE    71.6 (2.62)    75.08 (2.18)   76.02 (2.14)   76.48 (2.3)
ADASYN   71.6 (2.62)    75.16 (1.92)   75.93 (2.08)   76.47 (2.29)

Mov G(p = 25)
         β1             β2             β3             β4
OSR      81.41 (1.32)   83.49 (1.81)   83.84 (1.96)   83.91 (2.04)
SMOTE    81.41 (1.32)   83.32 (1.97)   83.59 (2.04)   83.76 (2.11)
ADASYN   81.41 (1.32)   83.61 (1.82)   84.02 (1.97)   83.69 (1.96)

Mov Th(p = [])
         β1             β2             β3             β4
OSR      79.77 (5.33)   85.3 (4.66)    83.16 (4.5)    84.59 (5.69)
SMOTE    79.77 (5.33)   84.18 (6.51)   85.58 (5.97)   84.33 (5.75)
ADASYN   79.77 (5.33)   84.11 (6.77)   85.86 (6.06)   85.36 (5.13)

Yahoo A(p = 1)
         β1             β2             β3             β4
OSR      55.92 (2.97)   58.66 (3.27)   59.99 (2.28)   59.74 (1.78)
SMOTE    55.92 (2.97)   59.76 (2.62)   59.74 (2.67)   59.43 (2.4)
ADASYN   55.92 (2.97)   59.54 (2.53)   59.55 (2.94)   59.56 (2.22)

Yahoo A(p = 25)
         β1             β2             β3             β4
OSR      61.68 (2.42)   64.19 (3.17)   65.08 (3.26)   64.67 (2.1)
SMOTE    61.68 (2.42)   65.46 (3.63)   65.33 (3.23)   64.52 (2.98)
ADASYN   61.68 (2.42)   65.04 (3.74)   65.41 (3.47)   64.4 (2.21)

Yahoo G(p = 1)
         β1             β2             β3             β4
OSR      66.84 (3.66)   72.18 (2.36)   73.11 (2.7)    72.49 (3.41)
SMOTE    66.84 (3.66)   72.65 (2.85)   73.27 (3.36)   73.37 (3.56)
ADASYN   66.84 (3.66)   72.87 (2.83)   73.18 (3.2)    73.39 (3.59)

Yahoo G(p = 25)
         β1             β2             β3             β4
OSR      78.82 (1.39)   78.78 (2.02)   78.74 (1.86)   78.35 (1.47)
SMOTE    78.82 (1.39)   79.23 (1.57)   79.1 (1.2)     79.03 (1.89)
ADASYN   78.82 (1.39)   79.12 (1.43)   79.23 (1.37)   79.51 (2.01)

TaFeng(p = 1)
         β1             β2             β3             β4
OSR      55.75 (1.6)    59.23 (1.96)   60 (1.68)      61.04 (2.36)
SMOTE    55.75 (1.6)    60.26 (1.95)   61.49 (1.8)    61.13 (1.52)
ADASYN   55.75 (1.6)    60.26 (1.9)    61.44 (1.85)   61.16 (1.5)

TaFeng(p = 25)
         β1             β2             β3             β4
OSR      66.94 (1.34)   68.84 (1.21)   66.99 (1.42)   67.7 (1.41)
SMOTE    66.94 (1.34)   68.47 (1.5)    67.07 (1.15)   66.65 (0.81)
ADASYN   66.94 (1.34)   68.62 (1.38)   67.85 (1.6)    66.91 (1.39)

Book(p = 1)
         β1             β2             β3             β4
OSR      52.6 (1.29)    53.61 (0.94)   55.41 (1.75)   55.87 (1.44)
SMOTE    52.6 (1.29)    54.77 (0.99)   54.91 (0.8)    54.36 (0.98)
ADASYN   52.6 (1.29)    54.86 (1.13)   55.06 (0.73)   54.54 (0.92)

Book(p = 25)
         β1             β2             β3             β4
OSR      60.08 (0.71)   61.05 (0.94)   61.82 (1.12)   64.62 (0.57)
SMOTE    60.08 (0.71)   62.6 (0.73)    60.95 (0.68)   63 (0.8)
ADASYN   60.08 (0.71)   62.33 (0.73)   60.77 (0.85)   63.04 (0.58)

LST(p = 1)
         β1             β2             β3             β4
OSR      99.99 (0.01)   99.99 (0.01)   99.99 (0.01)   99.99 (0.01)
SMOTE    99.99 (0.01)   99.99 (0.01)   99.99 (0.01)   99.99 (0.01)
ADASYN   99.99 (0.01)   99.99 (0.01)   99.99 (0.01)   99.99 (0.01)

Adver(p = [])
         β1             β2             β3             β4
OSR      96.61 (1.82)   97.31 (1.65)   97.07 (1.84)   97.07 (1.79)
SMOTE    96.61 (1.82)   96.91 (1.66)   97.19 (1.65)   97.07 (1.91)
ADASYN   96.61 (1.82)   97.1 (1.7)     97.08 (1.87)   97.07 (1.88)

Adver(p = 1)
         β1             β2             β3             β4
OSR      90.93 (3.02)   91.27 (3.03)   92.66 (2.82)   93.29 (1.97)
SMOTE    90.93 (3.02)   92.51 (2.03)   92.96 (2.14)   93.53 (1.81)
ADASYN   90.93 (3.02)   92.22 (2.33)   92.7 (2.36)    93.88 (1.73)

CRF(p = [])
         β1             β2             β3             β4
OSR      64.06 (16.43)  80.82 (12.94)  81.28 (12.27)  81.91 (11.28)
SMOTE    64.06 (16.43)  78.64 (16.86)  82.52 (13.74)  79.32 (16.26)
ADASYN   64.06 (16.43)  78.95 (16.72)  81.19 (16.32)  79.31 (16.19)

Bank(p = [])
         β1             β2             β3             β4
OSR      66.82 (0.88)   70.1 (0.74)    71.39 (0.8)    71.47 (0.8)
SMOTE    []             []             []             []
ADASYN   []             []             []             []


B Undersampling experiments

Table 12 Undersampling experiments. Results show the average tenfold AUC test set performance with respect to increasing βu-values (βu = [βu1, βu2, βu3, βu4, βu5] = [0, 1/4, 1/2, 3/4, 1]). Optimal performances (boldface) are highlighted for each imbalance ratio. Additionally, standard deviations are shown between brackets. Cl K represents the "Closest Knn" technique; Cl T represents the "Closest tot sim" technique (similar for Far K and Far T; see Section 3.2). Parameter settings are given in Section 4.2.

Mov G(p = 1)
        βu1           βu2           βu3           βu4           βu5
RUS     71.6 (2.6)    71.83 (2.6)   72.54 (2.5)   72.39 (3.1)   70.61 (3.5)
Cl K    71.6 (2.6)    71.4 (2)      70.96 (1.9)   70.43 (2.4)   69.05 (4.1)
Cl T    71.6 (2.6)    70.28 (2.5)   66.74 (2)     66.8 (2.1)    68.18 (3.6)
Far K   71.6 (2.6)    72.36 (2.7)   71.26 (3.4)   66.57 (5.2)   53.5 (3.5)
Far T   71.6 (2.6)    72.22 (2.8)   71.63 (3.6)   64.28 (5.3)   50.88 (4.4)
CBU     72.55 (2.6)   73.28 (2.6)   73.12 (2.6)   73.84 (2.5)   73 (3.1)

Mov G(p = 25)
        βu1           βu2           βu3           βu4           βu5
RUS     81.41 (1.3)   81.36 (1.3)   81.78 (1.7)   82.05 (1.7)   81.6 (2.1)
Cl K    81.41 (1.3)   80.86 (1.2)   80.95 (1.6)   79.73 (2.3)   77.95 (2.3)
Cl T    81.41 (1.3)   79.9 (1.2)    78.21 (1.4)   77.87 (1.5)   77.76 (2.3)
Far K   81.41 (1.3)   80.9 (1.5)    78.17 (1.8)   74.25 (2.4)   69.79 (3.2)
Far T   81.41 (1.3)   80.86 (1.5)   77.2 (2.4)    71.16 (2.7)   62.4 (2.8)
CBU     81.53 (1.4)   81.64 (1.3)   81.29 (1.6)   81.28 (2.1)   80.34 (2.7)

Mov Th(p = [])
        βu1           βu2           βu3           βu4           βu5
RUS     79.77 (5.3)   80.32 (5.8)   81.57 (5.5)   81.86 (6.6)   81.26 (6.2)
Cl K    79.77 (5.3)   79.25 (4.5)   78.07 (5)     76.25 (6.5)   62.46 (8.5)
Cl T    79.77 (5.3)   78.4 (4.4)    72.41 (3.5)   64.66 (4.5)   60.37 (7.3)
Far K   79.77 (5.3)   84.54 (5)     83.64 (6.4)   80.02 (7.3)   56.82 (10.3)
Far T   79.77 (5.3)   85.03 (5.7)   82.68 (6.8)   75.61 (9.2)   56.77 (10.9)
CBU     80.11 (5.8)   81.17 (6)     81.08 (6.5)   84.17 (5.1)   80.96 (6.9)

Yahoo A(p = 1)
        βu1           βu2           βu3           βu4           βu5
RUS     55.92 (3)     55.57 (3.4)   56.44 (3)     55.83 (3.4)   56.37 (3.3)
Cl K    55.92 (3)     55.67 (2.4)   53.12 (2)     50.57 (1.8)   53.79 (3.5)
Cl T    55.92 (3)     55.69 (2.1)   53.35 (2.2)   50.31 (2.2)   52.35 (3.3)
Far K   55.92 (3)     57.35 (2.2)   56.92 (1.1)   56.95 (2.3)   51.18 (2)
Far T   55.92 (3)     56.93 (2.4)   54.74 (1.9)   57.01 (1.8)   51.18 (2)
CBU     58.21 (2.6)   58.45 (3.3)   58.31 (3.5)   58.39 (3.5)   56.09 (2.6)

Yahoo A(p = 25)
        βu1           βu2           βu3           βu4           βu5
RUS     61.68 (2.4)   62.9 (2.9)    63.62 (3.6)   63.75 (3.1)   63.19 (1.9)
Cl K    61.68 (2.4)   61.14 (2.1)   57.62 (1.6)   54.02 (1.8)   51.48 (1.4)
Cl T    61.68 (2.4)   60.89 (2.8)   58.11 (1.4)   54.4 (2.1)    51.76 (1.4)
Far K   61.68 (2.4)   63.96 (3)     62.62 (2.2)   59.61 (1.5)   56.25 (1.6)
Far T   61.68 (2.4)   63.71 (2.4)   59.72 (1.6)   57.27 (1.1)   54.47 (1.1)
CBU     62.46 (2.6)   61.85 (1.4)   61.78 (2.2)   59.94 (3)     60.1 (4)

Yahoo G(p = 1)
        βu1           βu2           βu3           βu4           βu5
RUS     66.84 (3.7)   67.85 (3.2)   68.36 (3.2)   68.23 (4)     69.9 (4.2)
Cl K    66.84 (3.7)   66.71 (2.8)   64.3 (3.6)    61.98 (3.9)   61.15 (1.9)
Cl T    66.84 (3.7)   65.79 (2.7)   63.55 (3.3)   59.21 (3.5)   61.08 (2.4)
Far K   66.84 (3.7)   66.76 (4.1)   63.84 (3.4)   65.16 (2)     48.5 (2.9)
Far T   66.84 (3.7)   66.95 (4.1)   63.48 (2.9)   65.16 (2)     48.48 (2.9)
CBU     69.68 (4.1)   70.59 (3.2)   70.64 (3.7)   70.2 (2.9)    63.35 (3.6)

Yahoo G(p = 25)
        βu1           βu2           βu3           βu4           βu5
RUS     78.82 (1.4)   78.91 (1.6)   78.97 (1.6)   78.61 (1.6)   77.82 (2.1)
Cl K    78.82 (1.4)   77.26 (1.5)   72.52 (1.5)   67.86 (2)     65.07 (2.7)
Cl T    78.82 (1.4)   76.83 (1)     71.99 (1.8)   67.15 (2.3)   61.1 (2.7)
Far K   78.82 (1.4)   78.26 (2.2)   74.69 (2.7)   67.22 (2.1)   60.72 (2.3)
Far T   78.82 (1.4)   77.68 (2.6)   72.44 (3)     64.94 (2.4)   59.6 (2)
CBU     75.25 (3.2)   75.22 (2.4)   74.69 (2.3)   73.07 (2.4)   70.69 (2.4)

TaFeng(p = 1)
        βu1           βu2           βu3           βu4           βu5
RUS     55.75 (1.6)   56.1 (1.6)    56.26 (1.7)   57.23 (1.7)   59.25 (2.2)
Cl K    55.75 (1.6)   55.68 (1.6)   55.58 (1.5)   55.08 (1.1)   51.05 (1.5)
Cl T    55.75 (1.6)   55.67 (1.6)   54.47 (1.6)   47.53 (1.6)   49.3 (1.1)
Far K   55.75 (1.6)   58.99 (1.2)   59.47 (1.1)   60.04 (1.2)   56.31 (1)
Far T   55.75 (1.6)   58.92 (1.3)   59.25 (1.3)   58.58 (1.1)   56.31 (1)
CBU     57.8 (1)      58.47 (1.1)   58.15 (0.9)   58.87 (1.4)   57.65 (1.6)

TaFeng(p = 25)
        βu1           βu2           βu3           βu4           βu5
RUS     66.94 (1.3)   67.44 (1.3)   68.1 (1.4)    68.27 (1.4)   66.13 (1.2)
Cl K    66.94 (1.3)   66.13 (1.4)   63.39 (1.2)   59.83 (1.3)   56.94 (0.7)
Cl T    66.94 (1.3)   66.38 (1.5)   62.89 (1.6)   57.46 (1.3)   54.56 (1.3)
Far K   66.94 (1.3)   68.06 (1.4)   66.43 (1.6)   64.46 (1.5)   63.35 (1.3)
Far T   66.94 (1.3)   64.31 (1.1)   62.69 (1)     61.27 (1.1)   59.03 (1)
CBU     64.81 (1.2)   64.15 (1.1)   64.13 (1.2)   63.88 (0.8)   63.46 (0.8)

Book(p = 1)
        βu1           βu2           βu3           βu4           βu5
RUS     52.6 (1.3)    52.79 (0.9)   53.46 (0.8)   53.89 (0.9)   54.05 (0.9)
Cl K    52.6 (1.3)    52.56 (1.2)   52.52 (1.3)   52.39 (1.1)   53.09 (1.1)
Cl T    52.6 (1.3)    52.56 (1.2)   52.52 (1.3)   52.39 (1.1)   53.05 (0.7)
Far K   52.6 (1.3)    55.21 (1.2)   56.21 (1.8)   56.14 (1.2)   53.06 (1)
Far T   52.6 (1.3)    55.21 (1.2)   56.21 (1.8)   56.14 (1.2)   53.06 (1)
CBU     54.28 (0.9)   53.77 (1)     53.33 (1.1)   53.34 (0.9)   52.84 (0.8)

Book(p = 25)
        βu1           βu2           βu3           βu4           βu5
RUS     60.08 (0.7)   60.13 (0.6)   60.4 (0.8)    60.33 (0.8)   63.28 (0.8)
Cl K    60.08 (0.7)   59.96 (0.7)   60.13 (0.8)   59.96 (1)     59.28 (0.7)
Cl T    60.08 (0.7)   59.96 (0.7)   60.13 (0.8)   60.29 (0.4)   54.5 (0.9)
Far K   60.08 (0.7)   63.29 (1)     64.19 (0.8)   57.3 (1.1)    55.66 (1.1)
Far T   60.08 (0.7)   62.14 (0.5)   58.27 (0.6)   56.37 (1)     55.66 (1.1)
CBU     54.82 (0.9)   54.67 (0.9)   54.71 (0.9)   54.66 (1)     54.78 (0.9)

LST(p = 1)
        βu1           βu2           βu3           βu4           βu5
RUS     99.99 (0)     99.99 (0)     99.99 (0)     99.98 (0)     99.99 (0)
Cl K    99.99 (0)     99.99 (0)     99.99 (0)     99.99 (0)     99.99 (0)
Cl T    99.99 (0)     99.99 (0)     99.99 (0)     99.99 (0)     99.98 (0)
Far K   99.99 (0)     99.98 (0)     99.98 (0)     99.98 (0)     99.98 (0)
Far T   99.99 (0)     99.98 (0)     99.98 (0)     99.98 (0)     99.98 (0)
CBU     []            []            []            []            []

Adver(p = [])
        βu1           βu2           βu3           βu4           βu5
RUS     96.61 (1.8)   96.32 (1.8)   96.63 (1.4)   97.12 (2.1)   96.22 (1.6)
Cl K    96.61 (1.8)   96.44 (1.5)   96.14 (1.5)   96.04 (2)     94.8 (2.5)
Cl T    96.61 (1.8)   95.87 (2.1)   94.32 (1.9)   93.01 (2.2)   90.72 (2.3)
Far K   96.61 (1.8)   96.53 (1.4)   95.76 (2)     94.39 (1.8)   90.49 (3.1)
Far T   96.61 (1.8)   96.54 (1.5)   95.67 (1.9)   94.54 (1.8)   89.3 (2.8)
CBU     96.85 (2.3)   96.85 (2.3)   97.05 (1.5)   96.6 (1.6)    96.06 (2.1)

Adver(p = 1)
        βu1           βu2           βu3           βu4           βu5
RUS     90.93 (3)     91.53 (3.1)   92.37 (3.4)   91.9 (2.9)    91.93 (2.2)
Cl K    90.93 (3)     90.64 (3)     89.87 (3.9)   90.21 (3.6)   89.18 (2)
Cl T    90.93 (3)     89.7 (3.5)    88.55 (3.4)   85.76 (3.3)   88.2 (2.3)
Far K   90.93 (3)     93.8 (2.3)    92.4 (2.6)    88.73 (3.4)   85.51 (4)
Far T   90.93 (3)     93.62 (2.4)   93.2 (2.2)    88.41 (3.6)   85.51 (4)
CBU     93.22 (2.4)   93.76 (2.5)   93.89 (2.6)   93.52 (2.7)   91.27 (2)

CRF(p = [])
        βu1           βu2           βu3           βu4           βu5
RUS     64.06 (16.4)  63.28 (15.9)  67.98 (17.4)  66.95 (21.9)  87.73 (8.8)
Cl K    64.06 (16.4)  62.44 (16.6)  62.34 (16.9)  71.37 (13.8)  78.22 (17.7)
Cl T    64.06 (16.4)  62.44 (16.6)  62.34 (16.9)  71.37 (13.8)  62.67 (22.9)
Far K   64.06 (16.4)  83.8 (14.2)   83.93 (14.8)  84.49 (13.7)  86.11 (9.7)
Far T   64.06 (16.4)  83.8 (14.2)   83.93 (14.8)  84.49 (13.7)  86.11 (9.7)
CBU     []            []            []            []            []

Bank(p = [])
        βu1           βu2           βu3           βu4           βu5
RUS     66.82 (0.9)   67.02 (0.9)   67.37 (0.8)   67.99 (0.6)   69.5 (1)
Cl K    66.82 (0.9)   66.17 (0.7)   65.24 (0.6)   64.86 (0.6)   58.53 (1.1)
Cl T    66.82 (0.9)   64.92 (1.1)   60.69 (0.9)   56.33 (0.8)   52.87 (0.7)
Far K   66.82 (0.9)   66.95 (0.6)   66.19 (0.6)   64.42 (0.6)   58.25 (1.1)
Far T   66.82 (0.9)   67.16 (0.6)   64.2 (0.8)    59.67 (1)     58.25 (1.1)
CBU     []            []            []            []            []


C Boosting experiments

This section presents the results of each of the boosting experiments from Section 4.5 on the remaining data sources. Each of the figures below depicts the average tenfold AUC-performance on test data (with µ = 100) with respect to the number of boosting iterations for (left) Adaboost (AB), AdaCost (AC) and EasyEnsemble (EE), with C chosen according to the highest validation set AUC-performance (over all possible boosting rounds), and (right) AB and EE (with S = 15) at varying C-levels.

[Figure images not recoverable in this version. Each figure plots AUC test (y-axis) against the number of boosting iterations T (x-axis); panel (a) compares AB, AC R2, AC R8, AC RD, EE S5, EE S10, EE S15 and BL, and panel (b) compares AB and EE at C = 1e-07, 1e-05, 0.001 and 0.1 against BL.]

Fig 6 Mov G(p = 1) dataset

Fig 7 Mov Th(p = []) dataset

Fig 8 Yahoo A(p = 1) dataset

Fig 9 Yahoo A(p = 25) dataset

Fig 10 Yahoo G(p = 1) dataset

Fig 11 Yahoo G(p = 25) dataset

Fig 12 TaFeng(p = 1) dataset

Fig 13 Book(p = 1) dataset

Fig 14 LST(p = 1) dataset

Fig 15 Adver(p = []) dataset

Fig 16 Adver(p = 1) dataset

Fig 17 CRF(p = []) dataset

D Final Comparison

[Figure 18 image not recoverable in this version: a scatter plot of average rank AUC (x-axis) against average rank Time (y-axis) for BL, OSR, SMOTE, ADASYN, RUS, Cl Knn, Far Knn, CBU, AB, AC(R=28), AC(R=RL), EE(S=10), EE(S=15) and EE par.]

Fig 18 Average rank AUC versus average rank Time (see Table 9) across the large datasets from Table 2 (CRF and Bank). Regarding the former, an extra row EE par is added, having the same AUC-value as EE(S = 15). Ranks for each dataset are subsequently obtained via the procedure outlined in Section 4.6.1. Points occurring in the upper-right region are preferred.


References

Akbani R, Kwek S, Japkowicz N (2004) Applying support vector machines to imbalanced datasets. In: Machine Learning: ECML 2004, 15th European Conference on Machine Learning, Pisa, Italy, September 20-24, 2004, Proceedings. Springer Berlin Heidelberg, Berlin, Heidelberg, pp 39–50. DOI 10.1007/978-3-540-30115-8_7

Ali A, Shamsuddin SM, Ralescu AL (2015) Classification with class imbalance problem: A review. International Journal of Advances in Soft Computing and its Applications 7(3):176–204

Alzahrani T, Horadam KJ (2016) Community detection in bipartite networks: Algorithms and case studies. In: Complex Systems and Networks: Dynamics, Controls and Applications. Springer Berlin Heidelberg, Berlin, Heidelberg, pp 25–50. DOI 10.1007/978-3-662-47824-0_2

Bachner J (2013) Predictive policing: Preventing crime with data and analytics. IBM Center for the Business of Government

Baesens B, Van Gestel T, Viaene S, Stepanova M, Suykens J, Vanthienen J (2003) Benchmarking state-of-the-art classification algorithms for credit scoring. Journal of the Operational Research Society 54(6):627–635. DOI 10.1057/palgrave.jors.2601545

Barandela R, Sánchez J, García V, Rangel E (2003) Strategies for learning in class imbalance problems. Pattern Recognition 36(3):849–851. DOI 10.1016/S0031-3203(02)00257-1

Barber MJ (2007) Modularity and community detection in bipartite networks. Physical Review E 76:066102. DOI 10.1103/PhysRevE.76.066102

Barua S, Islam MM, Yao X, Murase K (2014) MWMOTE–majority weighted minority oversampling technique for imbalanced data set learning. IEEE Transactions on Knowledge and Data Engineering 26(2):405–425. DOI 10.1109/TKDE.2012.232

Batista GEAPA, Prati RC, Monard MC (2004) A study of the behavior of several methods for balancing machine learning training data. SIGKDD Explor Newsl 6(1):20–29. DOI 10.1145/1007730.1007735

Beckett SJ (2016) Improved community detection in weighted bipartite networks. Royal Society Open Science 3(1). DOI 10.1098/rsos.140536

Bekkar M, Djemaa HK, Alitouche TA (2013) Evaluation measures for models assessment over imbalanced data sets. Journal of Information Engineering and Applications 3(10):27–38

Bhattacharyya S, Jha S, Tharakunnel K, Westland JC (2011) Data mining for credit card fraud: A comparative study. Decision Support Systems 50(3):602–613. DOI 10.1016/j.dss.2010.08.008

Blondel VD, Guillaume JL, Lambiotte R, Lefebvre E (2008) Fast unfolding of communities in large networks. Journal of Statistical Mechanics: Theory and Experiment 2008(10):P10008

Breiman L, Friedman J, Stone CJ, Olshen RA (1984) Classification and regression trees. Taylor & Francis

Brozovsky L, Petricek V (2007) Recommender system for online dating service. In: Proceedings of Znalosti 2007 Conference, VSB Ostrava

Cha M, Mislove A, Gummadi KP (2009) A measurement-driven analysis of information propagation in the Flickr social network. In: Proceedings of the 18th International Conference on World Wide Web, ACM, New York, NY, USA, WWW '09, pp 721–730. DOI 10.1145/1526709.1526806

Chawla NV (2005) Data mining for imbalanced datasets: An overview. In: Data mining and knowledge discovery handbook. Springer US, Boston, MA, pp 853–867

Chawla NV, Bowyer KW, Hall LO, Kegelmeyer WP (2002) SMOTE: synthetic minority over-sampling technique. Journal of Artificial Intelligence Research 16:321–357

Chawla NV, Lazarevic A, Hall LO, Bowyer KW (2003) SMOTEBoost: Improving prediction of the minority class in boosting. In: Knowledge Discovery in Databases: PKDD 2003. Springer Berlin Heidelberg, Berlin, Heidelberg, pp 107–119

Chawla NV, Japkowicz N, Kotcz A (2004) Editorial: Special issue on learning from imbalanced data sets. SIGKDD Explor Newsl 6(1):1–6. DOI 10.1145/1007730.1007733

Chen M, Mao S, Liu Y (2014) Big data: A survey. Mobile Networks and Applications 19(2):171–209. DOI 10.1007/s11036-013-0489-0

Chyi YM (2003) Classification analysis techniques for skewed class distribution problems. Master thesis, Department of Information Management, National Sun Yat-Sen University

Demšar J (2006) Statistical comparisons of classifiers over multiple data sets. Journal of Machine Learning Research 7(Jan):1–30

Dhillon IS (2001) Co-clustering documents and words using bipartite spectral graph partitioning. In: Proceedings of the Seventh ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, ACM, New York, NY, USA, KDD '01, pp 269–274. DOI 10.1145/502512.502550

Drummond C, Holte RC (2003) C4.5, class imbalance, and cost sensitivity: why under-sampling beats over-sampling. In: Proceedings of the ICML '03 Workshop on Learning from Imbalanced Datasets

Fan RE, Chang KW, Hsieh CJ, Wang XR, Lin CJ (2008) LIBLINEAR: A library for large linear classification. Journal of Machine Learning Research 9:1871–1874

Fan W, Stolfo SJ, Zhang J, Chan PK (1999) AdaCost: Misclassification cost-sensitive boosting. In: Proceedings of the Sixteenth International Conference on Machine Learning, Morgan Kaufmann Publishers Inc., San Francisco, CA, USA, ICML '99, pp 97–105

Fawcett T (2006) An introduction to ROC analysis. Pattern Recognition Letters 27(8):861–874. DOI 10.1016/j.patrec.2005.10.010

Finch H (2005) Comparison of distance measures in cluster analysis with dichotomous data. Journal of Data Science 3(1):85–100

Fortunato S (2010) Community detection in graphs. Physics Reports 486(3-5):75–174. DOI 10.1016/j.physrep.2009.11.002

Junque de Fortuny E, Martens D, Provost F (2014a) Predictive modeling with big data: is bigger really better? Big Data 1(4):215–226. DOI 10.1089/big.2013.0037

Junque de Fortuny E, Stankova M, Moeyersoms J, Minnaert B, Provost F, Martens D (2014b) Corporate residence fraud detection. In: Proceedings of the 20th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, ACM, New York, NY, USA, KDD '14, pp 1650–1659. DOI 10.1145/2623330.2623333

Frasca M, Bertoni A, Re M, Valentini G (2013) A neural network algorithm for semi-supervised node label learning from unbalanced data. Neural Networks 43:84–98. DOI 10.1016/j.neunet.2013.01.021

Friedman M (1937) The use of ranks to avoid the assumption of normality implicit in the analysis of variance. Journal of the American Statistical Association 32(200):675–701

García E, Lozano F (2007) Boosting support vector machines. In: Machine Learning and Data Mining in Pattern Recognition, 5th International Conference, MLDM 2007, Leipzig, Germany, July 18-20, Post Proceedings. IBaI publishing, pp 153–167

Goldstein M, Uchida S (2016) A comparative evaluation of unsupervised anomaly detection algorithms for multivariate data. PLOS ONE 11(4):1–31. DOI 10.1371/journal.pone.0152173

González PC, Velásquez JD (2013) Characterization and detection of taxpayers with false invoices using data mining techniques. Expert Systems with Applications 40(5):1427–1436. DOI 10.1016/j.eswa.2012.08.051

Guimerà R, Sales-Pardo M, Amaral LAN (2007) Module identification in bipartite and directed networks. Physical Review E 76:036102. DOI 10.1103/PhysRevE.76.036102

Guo H, Viktor HL (2004) Learning from imbalanced data sets with boosting and data generation: The DataBoost-IM approach. SIGKDD Explor Newsl 6(1):30–39. DOI 10.1145/1007730.1007736

Guo X, Yin Y, Dong C, Yang G, Zhou G (2008) On the class imbalance problem. In: 2008 Fourth International Conference on Natural Computation, IEEE, vol 4, pp 192–201. DOI 10.1109/ICNC.2008.871

Han H, Wang WY, Mao BH (2005) Borderline-SMOTE: A new over-sampling method in imbalanced data sets learning. In: Advances in Intelligent Computing. Springer Berlin Heidelberg, Berlin, Heidelberg, pp 878–887

He H, Garcia EA (2009) Learning from imbalanced data. IEEE Transactions on Knowledge and Data Engineering 21(9):1263–1284. DOI 10.1109/TKDE.2008.239

He H, Bai Y, Garcia EA, Li S (2008) ADASYN: Adaptive synthetic sampling approach for imbalanced learning. In: 2008 IEEE International Joint Conference on Neural Networks (IEEE World Congress on Computational Intelligence), IEEE, pp 1322–1328. DOI 10.1109/IJCNN.2008.4633969

Holm S (1979) A simple sequentially rejective multiple test procedure. Scandinavian Journal of Statistics 6(2):65–70

Hsu CW, Lin CJ (2002) A comparison of methods for multiclass support vector machines. IEEE Transactions on Neural Networks 13(2):415–425. DOI 10.1109/72.991427

Huang A (2008) Similarity measures for text document clustering. In: Proceedings of the Sixth New Zealand Computer Science Research Student Conference (NZCSRSC2008), Christchurch, New Zealand, pp 49–56

Iman RL, Davenport JM (1980) Approximations of the critical region of the Friedman statistic. Communications in Statistics - Theory and Methods 9(6):571–595

Jo T, Japkowicz N (2004) Class imbalances versus small disjuncts. ACM SIGKDD Explor Newsl 6(1):40–49. DOI 10.1145/1007730.1007737

Jutla IS, Jeub LG, Mucha PJ (2011-2016) A generalized Louvain method for community detection implemented in MATLAB. URL http://netwiki.amath.unc.edu/GenLouvain

Kubat M, Matwin S (1997) Addressing the curse of imbalanced training sets: One-sided selection. In: Proceedings of the Fourteenth International Conference on Machine Learning, Morgan Kaufmann Publishers Inc., San Francisco, CA, USA, pp 179–186

Lancichinetti A, Fortunato S (2009) Community detection algorithms: A comparative analysis. Physical Review E 80:056117. DOI 10.1103/PhysRevE.80.056117

Larremore DB, Clauset A, Jacobs AZ (2014) Efficiently inferring community structure in bipartite networks. Physical Review E: Statistical, Nonlinear, and Soft Matter Physics 90:012805. DOI 10.1103/PhysRevE.90.012805

Li J, Fine JP (2010) Weighted area under the receiver operating characteristic curve and its application to gene selection. Journal of the Royal Statistical Society: Series C (Applied Statistics) 59(4):673–692. DOI 10.1111/j.1467-9876.2010.00713.x

Li X, Wang L, Sung E (2008) AdaBoost with SVM-based component classifiers. Engineering Applications of Artificial Intelligence 21(5):785–795. DOI 10.1016/j.engappai.2007.07.001

Lichman M (2013) UCI machine learning repository. URL http://archive.ics.uci.edu/ml

Liu W, Chawla S, Cieslak DA, Chawla NV (2010) A robust decision tree algorithm for imbalanced data sets. In: Proceedings of the Tenth SIAM International Conference on Data Mining, SIAM, Philadelphia, vol 10, pp 766–777

Liu XY, Wu J, Zhou ZH (2009) Exploratory undersampling for class-imbalance learning. IEEE Transactions on Systems, Man, and Cybernetics, Part B (Cybernetics) 39(2):539–550. DOI 10.1109/TSMCB.2008.2007853

Luts J, Ojeda F, Van de Plas R, De Moor B, Van Huffel S, Suykens JA (2010) A tutorial on support vector machine-based methods for classification problems in chemometrics. Analytica Chimica Acta 665(2):129–145. DOI 10.1016/j.aca.2010.03.030

Macskassy SA, Provost F (2007) Classification in networked data: A toolkit and a univariate case study. Journal of Machine Learning Research 8(May):935–983

Martens D, Provost F (2014) Explaining data-driven document classifications. MIS Quarterly 38(1):73–100

Martens D, Provost F, Clark J, Junque de Fortuny E (2016) Mining massive fine-grained behavior data to improve predictive analytics. MIS Quarterly 40(4):869–888

Mazurowski MA, Habas PA, Zurada JM, Lo JY, Baker JA, Tourassi GD (2008) Training neural network classifiers for medical decision making: The effects of imbalanced datasets on classification performance. Neural Networks 21(2-3):427–436. DOI 10.1016/j.neunet.2007.12.031

Mease D, Wyner AJ, Buja A (2007) Boosted classification trees and class probability/quantile estimation. Journal of Machine Learning Research 8:409–439

Nemenyi P (1963) Distribution-free Multiple Comparisons. Dissertation, Princeton University

Newman MEJ, Girvan M (2004) Finding and evaluating community structure in networks. Physical Review E 69:026113. DOI 10.1103/PhysRevE.69.026113

Ng AY (2004) Feature selection, L1 vs. L2 regularization, and rotational invariance. In: Proceedings of the Twenty-first International Conference on Machine Learning, ACM, New York, NY, USA, ICML '04, pp 78–. DOI 10.1145/1015330.1015435

Ng AY, Jordan MI (2002) On discriminative vs. generative classifiers: A comparison of logistic regression and naive Bayes. In: Advances in Neural Information Processing Systems 14, MIT Press, pp 841–848

Ngai E, Hu Y, Wong Y, Chen Y, Sun X (2011) The application of data mining techniques in financial fraud detection: A classification framework and an academic review of literature. Decision Support Systems 50(3):559–569. DOI 10.1016/j.dss.2010.08.006

Platt JC (1999) Probabilistic outputs for support vector machines and comparisons to regularized likelihood methods. In: Advances in Large-Margin Classifiers, MIT Press, pp 61–74

Porter MA, Onnela JP, Mucha PJ (2009) Communities in networks. Notices of the American Mathematical Society 56(9):1082–1097

Provost F, Fawcett T (2013) Data Science for Business: What you need to know about data mining and data-analytic thinking. O'Reilly Media, Inc.

Provost F, Dalessandro B, Hook R, Zhang X, Murray A (2009) Audience selection for on-line brand advertising: Privacy-friendly social network targeting. In: Proceedings of the 15th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, ACM, New York, NY, USA, KDD '09, pp 707–716. DOI 10.1145/1557019.1557098

56 Jellis Vanhoeyveld David Martens

Raskutti B, Kowalczyk A (2004) Extreme re-balancing for SVMs: A case study. SIGKDD Explorations Newsletter 6(1):60–69. DOI 10.1145/1007730.1007739

Rosvall M, Bergstrom CT (2008) Maps of random walks on complex networks reveal community structure. Proceedings of the National Academy of Sciences 105(4):1118–1123. DOI 10.1073/pnas.0706851105

Schapire RE (1999) A brief introduction to boosting. In: Proceedings of the 16th International Joint Conference on Artificial Intelligence - Volume 2, Morgan Kaufmann Publishers Inc., San Francisco, CA, USA, IJCAI '99, pp 1401–1406

Schapire RE, Singer Y (1999) Improved boosting algorithms using confidence-rated predictions. Machine Learning 37(3):297–336. DOI 10.1023/A:1007614523901

Shmueli G (2017) Analyzing behavioral big data: Methodological, practical, ethical, and moral issues. Quality Engineering 29(1):57–74. DOI 10.1080/08982112.2016.1210979

Sobhani P, Viktor H, Matwin S (2015) Learning from imbalanced data using ensemble methods and cluster-based undersampling. In: New Frontiers in Mining Complex Patterns: Third International Workshop, NFMCP 2014, Held in Conjunction with ECML-PKDD 2014, Nancy, France, September 19, 2014, Revised Selected Papers, Springer International Publishing, Cham, pp 69–83. DOI 10.1007/978-3-319-17876-9_5

Stankova M (2016) Classification within network data with a bipartite structure. Dissertation, University of Antwerp

Stankova M, Martens D, Provost F (2015) Classification over bipartite graphs through projection. Working Papers 2015001, University of Antwerp, Faculty of Applied Economics

Sun Y, Kamel MS, Wong AK, Wang Y (2007) Cost-sensitive boosting for classification of imbalanced data. Pattern Recognition 40(12):3358–3378. DOI 10.1016/j.patcog.2007.04.009

Suykens JA, Van Gestel T, De Brabanter J, De Moor B, Vandewalle J (2002) Least Squares Support Vector Machines. World Scientific, Singapore

Tang B, He H (2015) ENN: Extended nearest neighbor method for pattern recognition [research frontier]. IEEE Computational Intelligence Magazine 10(3):52–60. DOI 10.1109/MCI.2015.2437512

Tang Y, Zhang YQ, Chawla NV, Krasser S (2009) SVMs modeling for highly imbalanced classification. IEEE Transactions on Systems, Man, and Cybernetics, Part B (Cybernetics) 39(1):281–288. DOI 10.1109/TSMCB.2008.2002909

Tobback E, Moeyersoms J, Stankova M, Martens D (2016) Bankruptcy prediction for SMEs using relational data. Working Paper 2016004, University of Antwerp, Faculty of Applied Economics

Verbeke W, Dejaeger K, Martens D, Hur J, Baesens B (2012) New insights into churn prediction in the telecommunication sector: A profit driven data mining approach. European Journal of Operational Research 218(1):211–229. DOI 10.1016/j.ejor.2011.09.031

Veropoulos K, Campbell I, Cristianini N (1999) Controlling the sensitivity of support vector machines. In: Proceedings of the International Joint Conference on Artificial Intelligence, Stockholm, Sweden (IJCAI99), pp 55–60

Whitrow C, Hand DJ, Juszczak P, Weston D, Adams NM (2009) Transaction aggregation as a strategy for credit card fraud detection. Data Mining and Knowledge Discovery 18(1):30–55. DOI 10.1007/s10618-008-0116-z

Wickramaratna J, Holden SB, Buxton BF (2001) Performance degradation in boosting. In: Proceedings of the Second International Workshop on Multiple Classifier Systems, Springer, London, UK, MCS '01, pp 11–21

Yen SJ, Lee YS (2009) Cluster-based under-sampling approaches for imbalanced data distributions. Expert Systems with Applications 36(3, Part 1):5718–5727. DOI 10.1016/j.eswa.2008.06.108

Yu HF, Lo HY, Hsieh HP, Lou JK, McKenzie TG, Chou JW, Chung PH, Ho CH, Chang CF, Wei YH, et al (2010) Feature engineering and classifier ensemble for KDD Cup 2010. In: Proceedings of the KDD Cup 2010 Workshop, pp 1–16

Zha H, He X, Ding C, Simon H, Gu M (2001) Bipartite graph partitioning and data clustering. In: Proceedings of the Tenth International Conference on Information and Knowledge Management, ACM, New York, NY, USA, CIKM '01, pp 25–32. DOI 10.1145/502585.502591

Zhang J, Mani I (2003) kNN approach to unbalanced data distributions: A case study involving information extraction. In: Proceedings of the ICML'2003 Workshop on Learning from Imbalanced Datasets, Washington DC

Ziegler CN, McNee SM, Konstan JA, Lausen G (2005) Improving recommendation lists through topic diversification. In: Proceedings of the 14th International Conference on World Wide Web, ACM, New York, NY, USA, WWW '05, pp 22–32. DOI 10.1145/1060745.1060754



The boosting algorithm with underlying SVM is presented in Algorithm 3. This algorithm closely follows the original boosting implementation presented in Schapire and Singer (1999) and requires each learner to output confidence-rated predictions in the interval [−1, 1]. Since the SVM outputs real-valued scores, we apply the procedure of Platt (1999) to transform these scores into probability estimates (which can easily be translated to form confidence-rated predictions). A logistic regression (LR) model (Ng and Jordan 2002) is trained with the SVM scores as input and the corresponding labels as output. Note that the same data that are used to construct the linear SVM can be used to estimate the LR model; see Platt (1999) for a motivation.
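A minimal sketch of this calibration step. Plain gradient descent on the log loss stands in for Platt's regularized Newton-style fit, and the function names are illustrative only:

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def fit_score_calibrator(scores, labels, lr=0.1, epochs=500):
    """Logistic regression on the SVM's real-valued scores gives probability
    estimates p; 2p - 1 then serves as a confidence-rated prediction in
    [-1, 1], as required by the boosting algorithm."""
    a, b = 0.0, 0.0
    targets = [1.0 if y == 1 else 0.0 for y in labels]
    n = float(len(scores))
    for _ in range(epochs):
        ga = gb = 0.0
        for s, t in zip(scores, targets):
            err = sigmoid(a * s + b) - t     # gradient factor of the log loss
            ga += err * s / n
            gb += err / n
        a -= lr * ga
        b -= lr * gb
    return lambda s: 2.0 * sigmoid(a * s + b) - 1.0   # confidence in [-1, 1]
```

Feeding the same training scores and labels used to fit the SVM reproduces the set-up described above.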

The boosting algorithm requires the weak learner to be trained using a distribution D_t. One could sample from this distribution to generate (unweighted) examples and train an SVM next. We have chosen instead to include weights in the SVM goal function formulation (this requires an extension of the LIBLINEAR package):

    min_{w, b, ξ_i}  (w^T w)/2 + C Σ_{i=1}^{m} weight_i ξ_i        (3)

The weights weight_i are set according to the weight distribution D_t(i) in each round of boosting. Note that the C-value that is used to train the SVM model is divided by mean(weight_i). This normalization allows for a fair comparison between the weighted and unweighted SVM versions (e.g., in the first round of boosting weight_i = 1/m for all instances; multiplying the C-value by m then corresponds to solving the unweighted problem with the same C-value).

We introduce an additional parameter μ, called the weight percentage, with values in [0, 100], in the boosting algorithm of Algorithm 3. This parameter controls the amount of training data that is used to construct the SVM model and subsequent LR model. We sort the original training data in descending order according to the distribution D_t. Next, we form a new training set of minimal cardinality by including points from the original sorted training data until their total weight exceeds the weight percentage μ/100. This way, the newly formed training set contains only the part of the original training data that carries the most weight. This partial set is then used to construct a weighted SVM model (according to equation (3), with the distribution renormalized over this set) and a subsequent LR model. The idea of using a partial dataset to construct the base learner not only reduces training times but also weakens the learner (García and Lozano 2007).
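The weight-percentage selection can be sketched as follows (a hypothetical helper; D is the boosting distribution D_t):

```python
def weight_percentage_subset(D, mu):
    """Keep the minimal prefix of instance indices, sorted by boosting weight
    (descending), whose cumulative weight exceeds mu/100; mu = 100 keeps
    (essentially) all of the training data."""
    order = sorted(range(len(D)), key=lambda i: D[i], reverse=True)
    chosen, total = [], 0.0
    for i in order:
        chosen.append(i)
        total += D[i]
        if total > mu / 100.0:
            break
    return chosen
```

With μ = 75 and weights [0.5, 0.2, 0.2, 0.1], only the three heaviest instances are retained; with μ = 100 the full set survives.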

In Algorithm 3 we have an explicit check to verify if r_AB = 1. In this case the SVM model outputs scores that allow for a perfect classification on the training set. The subsequent LR model will find a threshold and output a value of −1 if the SVM score is lower than this threshold (and a value of +1 if the score is higher). In this situation the training data would be perfectly classified. In our implementation we attempt to avoid this kind of behaviour because it can lead to overfitting: the model might pinpoint the wrong threshold and make too drastic decisions (we lose the meaning of confidence). The check r_AB ≤ 0 verifies whether the currently boosted model performs worse than random (a random model would have an r_AB-value of 0). Obviously, if the model performs worse than random, we quit the boosting process. During the first round of boosting we perform similar checks that are not explicitly

Imbalanced classification in sparse and large behaviour datasets 17

indicated in Algorithm 3: in the case where r_AB = 1 we output the SVM scores instead of the LR binary values; when r_AB ≤ 0 we quit the boosting process and output the LR scores.

Algorithm 3 AdaBoost with a SVM-LR combination as a base learner

Input: (X, Y) = {(x_1, y_1), ..., (x_m, y_m)}, C, T, μ
Initialize distribution D_1(i) = 1/m
for t = 1 to T do
  - train the weak learner using distribution D_t; the weak learner consists of a weighted linear SVM and LR model, trained with weight percentage μ of D_t:
        h_t ← Train_WeakLearner(X, Y, D_t, C, μ)
  - compute the weighted confidence r_AB on the training data:
        r_AB ← Σ_{i=1}^{m} D_t(i) y_i h_t(x_i)
    If (r_AB = 1 or r_AB ≤ 0), then α_t ← 0 and stop the boosting process
  - choose α_t ∈ R:
        α_t ← (1/2) log((1 + r_AB) / (1 − r_AB))
  - update the distribution:
        D_{t+1}(i) ← D_t(i) exp(−α_t y_i h_t(x_i)) / Z_t
    where Z_t is a normalization factor (chosen so that D_{t+1} will be a distribution)
end for
Output the final hypothesis (the output score is the term contained in the sign function):
        H(x) = sign( Σ_{t=1}^{T} α̃_t h_t(x) ),  with α̃_t = α_t / Σ_{i=1}^{T} α_i
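A sketch of this loop, with a toy 1-D decision stump standing in for the paper's weighted SVM-LR weak learner (the stopping checks and score normalization follow the algorithm; the learner itself is illustrative):

```python
import math

def stump_learner(X, y, D):
    """Hypothetical weak learner for 1-D inputs: the decision stump
    (threshold + polarity) with the lowest weighted error under D."""
    best_err, best_h = None, None
    for th in sorted(set(X)):
        for pol in (1, -1):
            def h(x, th=th, pol=pol):
                return pol if x > th else -pol
            err = sum(D[i] for i in range(len(X)) if h(X[i]) != y[i])
            if best_err is None or err < best_err:
                best_err, best_h = err, h
    return best_h

def adaboost_confidence(X, y, train_weak_learner, T=30):
    """AdaBoost with confidence-rated predictions h_t(x) in [-1, 1]
    (Schapire and Singer 1999), following the structure of Algorithm 3."""
    m = len(X)
    D = [1.0 / m] * m                            # D_1(i) = 1/m
    hypotheses, alphas = [], []
    for _ in range(T):
        h = train_weak_learner(X, y, D)          # weak learner on distribution D_t
        r = sum(D[i] * y[i] * h(X[i]) for i in range(m))  # weighted confidence r_AB
        if r >= 1.0 or r <= 0.0:                 # perfect fit or worse than random
            break
        alpha = 0.5 * math.log((1.0 + r) / (1.0 - r))
        hypotheses.append(h)
        alphas.append(alpha)
        D = [D[i] * math.exp(-alpha * y[i] * h(X[i])) for i in range(m)]
        Z = sum(D)                               # normalization factor Z_t
        D = [d / Z for d in D]
    total = sum(alphas)
    def H(x):                                    # ensemble score; sign(H(x)) is the class
        return sum((a / total) * h(x) for a, h in zip(alphas, hypotheses))
    return H
```

Because the α_t are normalized, the ensemble score stays in [−1, 1] whenever each weak learner does.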

3.3.2 AdaCost

The AdaCost algorithm (Fan et al 1999) is a variant of cost-sensitive learning where misclassification costs are introduced in the weight-update formula of AdaBoost. The cost-sensitive update rule increases the weights of costly misclassified instances more aggressively and decreases the weights of costly correct classifications more conservatively. Each instance is given a misclassification cost c_i, where we chose to put c_i = 1 for positive (minority) instances and c_i = 1/R for negative (majority) instances. R is a user-defined value that allows one to put more emphasis on the minority class. The implementation of AdaCost is similar to Algorithm 3, yet there are a few differences (Fan et al 1999): firstly, the initial distribution is chosen as D_1(i) = c_i / Σ_{j=1}^{m} c_j; secondly, the weight update rule is given by D_{t+1}(i) = D_t(i) exp(−α_t y_i h_t(x_i) β(i)) / Z_t, where β(i) = −0.5 sign(y_i h_t(x_i)) c_i + 0.5 is a cost-adjustment function; finally, the choice of α_t is given by α_t = (1/2) log((1 + r_AC) / (1 − r_AC)), where r_AC = Σ_{i=1}^{m} D_t(i) y_i h_t(x_i) β(i). Note that the checks to stop the boosting process prematurely (see the second bullet in Algorithm 3) are still based on the r-value obtained from AdaBoost (r_AB). This is because β(i) ∈ [0, 1].

In the first boosting round of AdaCost, the weighted SVM formulation (3) is equivalent to solving a SVM formulation with the following goal function:

    min_{w, b, ξ_i}  (w^T w)/2 + C+ Σ_{i|y_i=1} ξ_i + C− Σ_{i|y_i=−1} ξ_i        (4)

where C+/C− = R. This can be seen as a cost-sensitive version of a SVM, an idea initially proposed by Veropoulos et al (1999).
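The cost-adjustment function and a single AdaCost weight update can be sketched as follows (helper names are illustrative; the values hs[i] play the role of the confidence-rated predictions h_t(x_i)):

```python
import math

def initial_distribution(costs):
    """AdaCost initial distribution D_1(i) = c_i / sum_j c_j."""
    total = sum(costs)
    return [c / total for c in costs]

def cost_adjustment(y, h, c):
    """beta(i) = -0.5 * sign(y_i h_t(x_i)) * c_i + 0.5, which lies in [0, 1]
    for costs c in [0, 1]."""
    s = 1.0 if y * h >= 0 else -1.0
    return -0.5 * s * c + 0.5

def adacost_round(D, ys, hs, costs):
    """One AdaCost round: compute r_AC, alpha_t, and the cost-sensitive
    weight update D_{t+1}(i) = D_t(i) exp(-alpha y_i h_i beta(i)) / Z_t."""
    m = len(D)
    betas = [cost_adjustment(ys[i], hs[i], costs[i]) for i in range(m)]
    r_ac = sum(D[i] * ys[i] * hs[i] * betas[i] for i in range(m))
    alpha = 0.5 * math.log((1.0 + r_ac) / (1.0 - r_ac))
    D_new = [D[i] * math.exp(-alpha * ys[i] * hs[i] * betas[i]) for i in range(m)]
    Z = sum(D_new)                       # normalize so D_{t+1} is a distribution
    return [d / Z for d in D_new], alpha
```

With R = 4 (minority cost 1, majority cost 1/R), a misclassified majority instance gains weight while correctly classified majority instances lose weight, exactly the asymmetric behaviour described above.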

3.3.3 EasyEnsemble

One of the disadvantages of the random undersampling method is the fact that we are discarding potentially valuable information. EasyEnsemble (Liu et al 2009) is a method that combines several balanced subsets (S in total) containing randomly selected majority class instances together with all minority examples. Each subset contains the same number of instances from both classes and is fed to the boosting algorithm presented in Algorithm 3. Afterwards, the weak learners h_{s,t} of each subset s are simply combined to form the final ensemble:

    H(x) = sign( Σ_{s=1}^{S} Σ_{t=1}^{T} α_{s,t} h_{s,t}(x) ),  with s = 1, ..., S and t = 1, ..., T        (5)

It is clear that this technique benefits from a combination of bagging and boosting (Liu et al 2009). Note that we apply the same boosting algorithm as previously described to each of the balanced subsets. However, when r_AB = 1 in the first round of boosting, we quit the boosting process, put α_1 = 1 and continue to use the trained LR model in the final ensemble. It was previously noted that this can cause overfitting, in the sense that it can pinpoint a wrong threshold. However, the LR model still contains information, and its combination with the models obtained from the other subsets remains valuable (as we have noted from initial experiments comparing the situation where we include or reject those subsets).

There are a few subtle though important differences with respect to the experiments performed by Liu et al (2009). First of all, the authors use a CART algorithm (Breiman et al 1984) as base learner; we employ a linear SVM with subsequent LR as a weak learner. To our knowledge this combination has not been proposed elsewhere, yet proves to be very efficient in this setting. Secondly, their methodology employs a discrete version of AdaBoost where each weak learner outputs binary values in {−1, 1}; as already stated, we make use of an improved version of AdaBoost that relies on confidence-rated predictions (Schapire and Singer 1999). Finally, the authors reported performances with fixed levels of S = 4 and T = 10; we will investigate the effect of varying S and T levels.


4 Results and discussion

4.1 Datasets

Stankova et al (2015) provide the first large collection of benchmark behaviour datasets for classification. In our experiments we make use of these data sources and extend this repository with two additional datasets. Each of these datasets shows a bipartite structure with a clear target variable to predict. We refer to this study and the next paragraph for a short description of the available data resources. In this paragraph we indicate why we have chosen to include or reject certain data sources from the aforementioned study. The available datasets can be divided into small, medium and large datasets, based on the number of instances and the number of features present. The Norwegian companies and Reality Mining datasets comprise the small datasets: since they contain only a few hundreds of instances or features, they are regarded as impractical for our purposes. The MovieLens, Yahoo, TaFeng, Book-Crossing and LibimSeTi datasets belong to the medium-sized datasets, each containing a few thousand up to a few hundred thousand instances and features. All of these datasets are included in our study. The large datasets, containing hundreds of thousands up to millions of instances and features, are the Flickr and Kdd databases.12 Other large proprietary data sources, not included in Stankova et al (2015), are the corporate residence fraud (CRF) and banking (Bank) datasets, which arise from real-life application domains with intrinsic imbalance.

To summarize, we have gathered datasets containing such fine-grained behaviour data from a wide variety of application domains. The MovieLens datasets, for which we are predicting the gender13 (Mov G) or the genre thriller14 (Mov Th), provide data on which films each user has rated. The Yahoo movies15 dataset has a similar structure, where the age of each user, Yahoo A (above or below average), or the gender, Yahoo G, is being predicted. The TaFeng dataset16 contains data on shopping behaviour, where age (below or above average) is being predicted based upon which products are purchased. In the Book-Crossing (Book) dataset (Ziegler et al 2005) users rate books and the age of the user (above or below average) is being predicted. LibimSeTi (LST) contains data from a dating site (Brozovsky and Petricek 2007) where users rate each other's profiles and gender is being predicted. In the advertisement (Adver) dataset (Lichman 2013) we try to predict whether a URL is an advertisement, based on a large variety of binary features of the URL. Note that this dataset does not arise from the behaviour of entities, yet it still has a high-dimensional and sparse representation. The Flickr dataset (Cha et al 2009) contains pictures being marked by users as favorite, and we predict the number of comments on each picture (below

12 Flickr and KDD will be excluded in the comparative study of Section 4.6. This is because some methods are too computationally intensive, especially in combination with the large number of possible parameter combinations, to be applied on these very large data sources. Furthermore, our statistical evidence is already sufficiently strong to conclude significance without these datasets. Having said this, these data sources will be included in the analysis of Section 5.

13 MovieLens 1M dataset from http://grouplens.org/datasets/movielens/
14 MovieLens 10M dataset from http://grouplens.org/datasets/movielens/
15 https://webscope.sandbox.yahoo.com/
16 http://www.bigdatalab.ac.cn/benchmark/bm/dd?data=Ta-Feng


or above average). In the Kdd cup data, performance of a student on a test is being predicted, based on artificially created binary features (Yu et al 2010). The corporate residence fraud dataset (CRF) (Junqué de Fortuny et al 2014b) contains data on foreign companies making transactions with specific Belgian companies, where we try to predict whether the foreign company commits residence fraud (a type of fiscal fraud). Finally, the banking dataset (Bank) (Martens et al 2016) contains detailed behaviour of consumers making transactions with merchants or other persons, to predict interest in a pension fund product. Some characteristics of these datasets can be found in Table 2. The features column only shows the number of active features.17

Table 2 Behaviour data characteristics. The final column shows the imbalance ratio p, defined as the ratio of the number of minority class instances to the number of majority class instances in the training set, expressed as a percentage. See Section 4.2 for details regarding p.

Name      |X_maj|    |X_min|    Features    p = 100 × |X_min|_train / |X_maj|_train
Mov G     4331       1709       3706        1 & 25
Mov Th    10546      131        69878       1.24 (p = [])
Yahoo A   6030       1612       11915       1 & 25
Yahoo G   5436       2206       11915       1 & 25
TaFeng    17330      14310      23719       1 & 25
Book      42900      18858      282973      1 & 25
LST       59702      60145      166353      1
Adver     2792       457        1555        16.38 (p = []) & 1
CRF       869071     62         108753      0.0072 (p = [])
Bank      1193619    11107      3139570     0.93 (p = [])
Flickr    8166814    3028330    497472      0.1
Kdd       7171885    1235867    19306083    0.5

4.2 Methodology

Regarding the experiments performed in the upcoming sections, we applied a tenfold cross-validation procedure. Each of the folds contains 80% training data, 10% validation data and 10% test data. Note that these percentages are valid for both the majority class and the minority class (stratified sampling). As can be seen from Table 2, some datasets are balanced in nature. We created artificial imbalance for these datasets by removing minority class instances from the initial training set according to a user-defined parameter p. We ensured that the number of minority training instances corresponds to p percent of the majority class training size: |X_min|_train = (p/100) |X_maj|_train. As an example, say that we are using the Book dataset with p = 25. In that case we know that the majority class contains 34320 training instances (80% of |X_maj| = 42900). The minority training data would contain 8580 instances (25% of 34320). When the dataset is already imbalanced, we define p = [], which means that no downsampling of the minority class training data is performed. Note that the validation and test data are left untouched.18

17 Active features represent features that are present for at least one instance in the dataset. A non-active feature corresponds with a column of zeros in the matrix representation and would not contribute to the model.
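The split-then-downsample bookkeeping can be sketched as follows (the helper name is hypothetical, and p=None mimics the paper's p = [] convention):

```python
def imbalance_subsample(n_maj, n_min, p, train_frac=0.8):
    """Stratified 80/10/10 split (train_frac covers the 80% training part),
    after which the minority training size is capped at p percent of the
    majority training size; p=None means no downsampling (p = [])."""
    n_maj_train = int(train_frac * n_maj)
    n_min_train = int(train_frac * n_min)
    if p is not None:
        n_min_train = min(n_min_train, int(round(p / 100.0 * n_maj_train)))
    return n_maj_train, n_min_train
```

Running it on the Book numbers reproduces the worked example above: 34320 majority and 8580 minority training instances for p = 25.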

The methods detailed in Section 3 are applied on the training data. For both the under- and oversampling approaches, a linear SVM is trained on the newly created balanced training data, with regularization parameter C having values

    C = [10^-7, 10^-5, 10^-3, 10^-1, 10^0]

The validation data are used for parameter tuning purposes. The test data allow us to obtain the generalization performance. The results reported show the average over ten folds. In the remaining paragraphs of this section we will describe the various parameter settings used in our experiments and give a brief overview of the parameters occurring in each method.
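The tuning loop can be sketched as follows. The pairwise AUC below is the Wilcoxon-Mann-Whitney statistic, and fit/validate are caller-supplied placeholders for model training and validation-set scoring:

```python
def auc(scores_pos, scores_neg):
    """AUC as the fraction of (positive, negative) score pairs ranked
    correctly, counting ties as 1/2 (Wilcoxon-Mann-Whitney statistic)."""
    wins = 0.0
    for sp in scores_pos:
        for sn in scores_neg:
            wins += 1.0 if sp > sn else (0.5 if sp == sn else 0.0)
    return wins / (len(scores_pos) * len(scores_neg))

def select_C(candidates, fit, validate):
    """Pick the regularization value whose fitted model achieves the best
    validation AUC."""
    return max(candidates, key=lambda C: validate(fit(C)))
```

This rank-based form of the AUC is also why the metric is independent of class skew, a property the methodology relies on when the validation and test sets stay balanced.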

Considering the oversampling techniques, the parameter settings are as follows:

    β = [0, 1/3, 2/3, 1]
    prior_opt = {FlipCoin, Reverse Priors}
    sim_measure = {Cosine, Jaccard}
    K = [10^0, 10^1, 10^2, |X_min|_train]

We didn't include the "Prior" option, due to initial experiments showing a lower performance in comparison with the other options. This can be explained by the low priors occurring in each column, resulting in synthetic samples that mainly show zeros in 0-1 match situations. The oversampling with replacement method (OSR) only uses the β parameter, SMOTE uses all but the K parameter, and ADASYN uses all of the parameters listed above.

With respect to the undersampling techniques, the following parameter settings are used:

    βu = [0, 1/4, 1/2, 3/4, 1]
    sim_measure = {Cosine, Jaccard}
    K = [10^0, 10^1, 10^2, |X_min|_train]
    Clust_opt = {C Smallest, C Largest}

The random undersampling (RUS) technique only uses the βu parameter. The second set of methods, "Closest tot sim" and "Farthest tot sim", uses βu and sim_measure. The third set of techniques, "Closest Knn" and "Farthest Knn", makes use of all the parameters listed above except for Clust_opt. The final approach, CBU, employs βu and Clust_opt.

The boosting variants (AdaBoost, AdaCost and EasyEnsemble) presented in Section 3.3 make use of the following settings:

    T = 30
    μ = [100, 75]
    C = [10^-7, 10^-5, 10^-3, 10^-1]
    R = [2, 8, RL], where RL = |X_maj|_train / |X_min|_train
    S = [5, 10, 15]

The AdaBoost algorithm includes the T, μ and C parameters. AdaCost additionally uses cost ratios R. We have chosen a range of values because misclassification costs are unknown for many business applications (He and Garcia 2009; Fan et al 1999; Sun et al 2007). The final value RL seems to be a popular choice (Akbani et al 2004; Luts et al 2010), because the total weight on the majority class then balances with the total weight on the minority class. The final method, EasyEnsemble, uses S subsets in addition to the parameters previously mentioned for AdaBoost. Note that we consider the boosting iteration round t ∈ [1, T] as a tunable parameter.19

18 This means that if we start from a balanced set, only the training data will show artificial imbalance according to the imbalance ratio p; the validation and test data would remain balanced. Since AUC (and some other metrics) is independent of class skew, it would be unwise to make these sets imbalanced as well, because that would lead to discarding minority class instances that are relevant for performance assessment.

4.3 Oversampling

The oversampling techniques presented in Section 3.1 are applied to each of the datasets from Table 2 with varying imbalance levels (by varying the p-parameter). The experiments we conducted use the following methodology: for each dataset, we apply the oversampling techniques with all possible parameter combinations on the training data to create newly balanced datasets, after which linear SVMs are trained. The optimal parameter combination with respect to each imbalance ratio β is selected based on validation set AUC performance. With these parameters, the AUC on the test data is obtained. Results show the average over ten folds.

The results on four arbitrarily selected datasets are shown in Table 3. Full results on each of the data sources can be found in Appendix A, Table 11. From these tables we can conclude that performance generally increases with growing β-values. It is more correct to say that performance keeps improving with growing β-levels until an optimal point β* is reached; increasing the balance level beyond this optimal value causes only small fluctuations with respect to the optimal performance. Traditional studies dealing with dense, low-dimensional data note that the OSR method can suffer from overfitting, as already mentioned in Section 3.1. It is interesting to see that in this sparse, high-dimensional setting this effect doesn't seem to occur.20 Furthermore, the computationally expensive synthetic sampling approaches do not seem to improve performance over the simpler OSR method. This can be explained by the fact that many instances have a very limited number of non-zero elements (Junqué de Fortuny et al 2014a; Stankova et al 2015), causing the synthetic sampling procedures to be limited in the number of unique new samples they can produce. Note also that synthetic sampling procedures have historically been designed to overcome the overfitting behaviour of OSR, which no longer seems to apply here.
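A sketch of OSR under one plausible reading of the balance level β (β = 1 balances the classes, β = 0 leaves the data untouched; names are illustrative):

```python
import random

def oversample_with_replacement(X, y, beta, minority=1, seed=0):
    """OSR sketch: grow the minority class to
    |X_min| + beta * (|X_maj| - |X_min|) by duplicating randomly chosen
    minority points (sampling with replacement)."""
    rng = random.Random(seed)
    min_idx = [i for i, l in enumerate(y) if l == minority]
    n_min, n_maj = len(min_idx), len(y) - len(min_idx)
    target = int(round(n_min + beta * (n_maj - n_min)))
    X_new, y_new = list(X), list(y)
    for _ in range(target - n_min):
        i = rng.choice(min_idx)          # duplicate an existing minority point
        X_new.append(X[i])
        y_new.append(minority)
    return X_new, y_new
```

Because only exact duplicates are added, every synthetic-free oversampled set lies in the span of the original minority points, which is one way to see why SMOTE/ADASYN gain little on very sparse data.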

19 We sometimes make use of T as a symbol to indicate the boosting iteration number instead of t.

20 The SVM regularization parameter C controls for overfitting. Additionally, the influence of majority class noise/outliers (see Section 3.2) on the learned hyperplane decreases as we oversample the minority class: the hyperplane is more sensitive towards minority instances, and a change in its direction/orientation towards these majority class noise/outliers will be more heavily penalized.


Table 3 Oversampling experiments. Results show the average tenfold AUC test set performance with respect to increasing β-values (β = [β1, β2, β3, β4] = [0, 1/3, 2/3, 1]). Additionally, standard deviations are shown between brackets. Optimal performances for each method are highlighted in boldface. See Section 4.2 for parameter settings.

Mov Th (p = [])
           β1            β2            β3            β4
OSR        79.77 (5.33)  85.30 (4.66)  83.16 (4.50)  84.59 (5.69)
SMOTE      79.77 (5.33)  84.18 (6.51)  85.58 (5.97)  84.33 (5.75)
ADASYN     79.77 (5.33)  84.11 (6.77)  85.86 (6.06)  85.36 (5.13)

Yahoo G (p = 25)
           β1            β2            β3            β4
OSR        78.82 (1.39)  78.78 (2.02)  78.74 (1.86)  78.35 (1.47)
SMOTE      78.82 (1.39)  79.23 (1.57)  79.10 (1.20)  79.03 (1.89)
ADASYN     78.82 (1.39)  79.12 (1.43)  79.23 (1.37)  79.51 (2.01)

TaFeng (p = 25)
           β1            β2            β3            β4
OSR        66.94 (1.34)  68.84 (1.21)  66.99 (1.42)  67.70 (1.41)
SMOTE      66.94 (1.34)  68.47 (1.50)  67.07 (1.15)  66.65 (0.81)
ADASYN     66.94 (1.34)  68.62 (1.38)  67.85 (1.60)  66.91 (1.39)

Book (p = 25)
           β1            β2            β3            β4
OSR        60.08 (0.71)  61.05 (0.94)  61.82 (1.12)  64.62 (0.57)
SMOTE      60.08 (0.71)  62.60 (0.73)  60.95 (0.68)  63.00 (0.80)
ADASYN     60.08 (0.71)  62.33 (0.73)  60.77 (0.85)  63.04 (0.58)

4.4 Undersampling

Regarding the undersampling techniques, we employ a similar experimental set-up as in the previous section: firstly, train SVMs on the undersampled training data with all possible parameter combinations; secondly, choose a suitable parameter combination based on validation set AUC performance; and finally, obtain the AUC performance on the test set. The results are averaged across ten folds.

Before going into detail on the experimental findings, we give a short note on the effect of noise/outliers on SVM performance. Consider the following imaginary example: say that a majority class female is rating a lot of action films and thrillers. We can consider this an outlier, since most of the females are rating romantic or drama films. This specific female has far more in common with the minority class males, who also rate action films and thrillers. Outliers are therefore instances showing odd behaviour; noise are wrongfully labelled instances. The effect of noise/outliers on SVM performance can be severe. Indeed, many of the instances in the dataset have a support value (dual variable α_i) of 0. Instances contributing to the model (a non-zero support value) are examples close to the borderline (Suykens et al 2002) and noise/outliers (as can be derived from the KKT dual-complementarity conditions21).

21 If α_i = 0, then y_i(w^T x_i + b) ≥ 1. For noise/outliers the term y_i(w^T x_i + b) is negative, hence α_i ≠ 0.


With respect to noise/outliers, the support value can reach a maximal value of C, the chosen regularization parameter in the SVM formulation. It should be clear that lowering the C-value leads to models that are less sensitive to noise/outliers.

The experimental outcomes of the various undersampling processes are shown in Table 4 for the same four previously selected datasets. We refer to Appendix B, Table 12, for results on the entire data repository. When removing only a limited amount of majority class instances (βu = βu2 = 1/4), we observed that in 12 out of 16 datasets the "Farthest Knn" method outperforms the "Closest Knn" technique; the four remaining datasets show comparable performances.22 This finding shows that the "Farthest" method is very suitable for removing majority class noise/outliers and empirically shows their performance degrading effect. With higher undersampling rates (βu = 1), the "Closest Knn" method generally achieves higher performances than the "Farthest Knn" technique, though the results are less clear. We observed that only 9 out of 16 datasets indicate "Closest" as a clear winner, with 3 tie situations and 4 situations where "Farthest" shows the best results. Intuitively, we would expect the "Closest" method to perform better, since we are keeping the most informative instances (examples close to the minority class). However, this technique will also emphasize noise/outliers, and this is the main reason why the results are less significant. One can also see this from the fact that the RUS method shows far better results in comparison to the aforementioned techniques when βu = 1, an observation that is valid for all datasets. The RUS method is efficient in realizing noise/outlier reduction effects, since each SVM is trained with only a fraction of these types of instances. It is remarkable that in 9 out of 16 datasets the RUS method with βu = 1 outperforms the baseline model (βu = 0); in 4 datasets we observed equal performance, and 3 losses with respect to the baseline. RUS shows two attractive features: majority class noise/outlier removal and its ability to put more emphasis on the minority class instances by undersampling the majority class. A general downside of the approach is the information loss, and this effect causes the 3 losses to the baseline. Usually, though, the attractive features of RUS dominate the latter effect, and this indicates that it is definitely not necessary to include all majority class training instances in constructing the predictive model. Apparently there is a high level of redundancy, which can be exploited to construct efficient hypotheses.
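RUS itself can be sketched as follows, under an assumed linear reading of βu (βu = 0 keeps the full majority class, βu = 1 keeps only |X_min| randomly chosen majority instances):

```python
import random

def random_undersample(X, y, beta_u, minority=1, seed=0):
    """RUS sketch: randomly discard majority instances until the majority
    training size shrinks from |X_maj| (beta_u = 0) down to |X_min|
    (beta_u = 1, a balanced set)."""
    rng = random.Random(seed)
    maj_idx = [i for i, l in enumerate(y) if l != minority]
    min_idx = [i for i, l in enumerate(y) if l == minority]
    keep = int(round(len(maj_idx) - beta_u * (len(maj_idx) - len(min_idx))))
    kept = sorted(rng.sample(maj_idx, keep) + min_idx)
    return [X[i] for i in kept], [y[i] for i in kept]
```

Each random draw trains on only a fraction of the majority noise/outliers, which is the noise-reduction effect credited to RUS above.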

In the next paragraph we compare the cluster-based undersampling (CBU) approach to the random undersampling (RUS) technique. Before proceeding, we should note that Table 12 does not show results on the LST, CRF and Bank datasets, due to the fact that the projected unigraph was too large to fit in memory (larger than 16 GB). A limited number of features (top nodes) does seem to be active for a relatively large number of instances (bottom nodes). All these instances will be connected in the unigraph projection, resulting in relatively high degrees for the bottom nodes. A possible solution to circumvent this problem might be to cut the weakest edges in the projected unigraph.

If we apply a limited amount of undersampling (βu = βu2 = 1/4), the CBU technique outperforms RUS in 8 out of 13 datasets (4 losses and 1 tie with respect to RUS). On the highly imbalanced datasets (p = 1 or p = []), the CBU method wins

22 A tie occurs in the situation where the absolute difference in AUC is smaller than or equal to 0.5.

Imbalanced classification in sparse and large behaviour datasets 25

in 8 out of 8 cases. We can therefore conclude that CBU outperforms RUS at low undersampling rates. This indeed shows that handling the within-class imbalance can be beneficial in this environment. If we make a comparison at the highest undersampling rate (βu = βu,5 = 1), the results are completely opposite: CBU wins in only 1 case, with 9 losses with respect to RUS and 3 ties. On the highly imbalanced datasets, CBU outperforms RUS for 1 dataset (with 4 losses and 3 ties). It is clear that RUS is the preferred technique in case of high undersampling rates. The reason is that CBU emphasizes the small disjuncts and basically fails to adequately pick up on the common, larger communities. Furthermore, these small disjuncts might actually correspond with majority class noise/outliers, and CBU focuses on exactly these instances. On the overall level, where we consider all undersampling rates (except βu = 0), CBU wins in 5 out of 13 datasets, RUS wins in 4 cases, and there are 4 tie situations (yet RUS did outperform CBU in all 4 of these). Both methods seem competitive with one another across a wide range of imbalance levels. If we pay particular attention to the highly imbalanced sets, CBU wins in 5 cases and there are 3 tie situations on the remaining sets (yet RUS did outperform CBU in all 3 of these). CBU therefore seems to be the method of choice (with respect to RUS) when the datasets show high imbalance levels.
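A minimal version of the cluster-based idea can be sketched as follows (our own illustration with a tiny NumPy k-means; the paper's actual CBU procedure may differ in how cluster sizes are translated into sampling quotas). Drawing roughly equally many majority instances per cluster keeps small disjuncts represented, which explains both its strength at low undersampling rates and its weakness at high ones:

```python
import numpy as np

def mini_kmeans(X, k, iters=20, seed=0):
    """Plain Lloyd iterations; returns a cluster label per instance."""
    rng = np.random.default_rng(seed)
    C = X[rng.choice(len(X), k, replace=False)].astype(float)
    for _ in range(iters):
        lab = ((X[:, None, :] - C[None, :, :]) ** 2).sum(-1).argmin(1)
        for j in range(k):
            if np.any(lab == j):
                C[j] = X[lab == j].mean(0)
    return lab

def cluster_undersample(X_maj, n_keep, k=5, seed=0):
    """Sample ~n_keep/k majority instances per cluster, so that small
    disjuncts (small clusters) stay represented in the training set."""
    rng = np.random.default_rng(seed)
    lab = mini_kmeans(X_maj, k, seed=seed)
    quota = int(np.ceil(n_keep / k))
    picked = []
    for j in range(k):
        members = np.flatnonzero(lab == j)
        if len(members):
            picked.extend(rng.choice(members, min(quota, len(members)), replace=False))
    return np.sort(np.array(picked))[:n_keep]
```

When n_keep is small, the per-cluster quota forces many picks from tiny (possibly noisy) clusters, which matches the degradation observed at high undersampling rates.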

Table 4 Undersampling experiments. Results show the average tenfold AUC test set performance with respect to increasing βu-values (βu = [βu,1, βu,2, βu,3, βu,4, βu,5] = [0, 1/4, 1/2, 3/4, 1]). Optimal performances (boldface in the original) are highlighted for each imbalance ratio. Additionally, standard deviations are shown between brackets. Cl K represents the "Closest Knn" technique, Cl T the "Closest tot sim" technique (similarly for Far K and Far T, see Section 3.2). Parameter settings are given in Section 4.2.

Mov Th (p = [])
        βu,1         βu,2         βu,3         βu,4         βu,5
RUS     79.77(5.3)   80.32(5.8)   81.57(5.5)   81.86(6.6)   81.26(6.2)
Cl K    79.77(5.3)   79.25(4.5)   78.07(5.0)   76.25(6.5)   62.46(8.5)
Cl T    79.77(5.3)   78.4(4.4)    72.41(3.5)   64.66(4.5)   60.37(7.3)
Far K   79.77(5.3)   84.54(5.0)   83.64(6.4)   80.02(7.3)   56.82(10.3)
Far T   79.77(5.3)   85.03(5.7)   82.68(6.8)   75.61(9.2)   56.77(10.9)
CBU     80.11(5.8)   81.17(6.0)   81.08(6.5)   84.17(5.1)   80.96(6.9)

Yahoo G (p = 25)
        βu,1         βu,2         βu,3         βu,4         βu,5
RUS     78.82(1.4)   78.91(1.6)   78.97(1.6)   78.61(1.6)   77.82(2.1)
Cl K    78.82(1.4)   77.26(1.5)   72.52(1.5)   67.86(2.0)   65.07(2.7)
Cl T    78.82(1.4)   76.83(1.0)   71.99(1.8)   67.15(2.3)   61.1(2.7)
Far K   78.82(1.4)   78.26(2.2)   74.69(2.7)   67.22(2.1)   60.72(2.3)
Far T   78.82(1.4)   77.68(2.6)   72.44(3.0)   64.94(2.4)   59.6(2.0)
CBU     75.25(3.2)   75.22(2.4)   74.69(2.3)   73.07(2.4)   70.69(2.4)

TaFeng (p = 25)
        βu,1         βu,2         βu,3         βu,4         βu,5
RUS     66.94(1.3)   67.44(1.3)   68.1(1.4)    68.27(1.4)   66.13(1.2)
Cl K    66.94(1.3)   66.13(1.4)   63.39(1.2)   59.83(1.3)   56.94(0.7)
Cl T    66.94(1.3)   66.38(1.5)   62.89(1.6)   57.46(1.3)   54.56(1.3)
Far K   66.94(1.3)   68.06(1.4)   66.43(1.6)   64.46(1.5)   63.35(1.3)
Far T   66.94(1.3)   64.31(1.1)   62.69(1.0)   61.27(1.1)   59.03(1.0)
CBU     64.81(1.2)   64.15(1.1)   64.13(1.2)   63.88(0.8)   63.46(0.8)

Book (p = 25)
        βu,1         βu,2         βu,3         βu,4         βu,5
RUS     60.08(0.7)   60.13(0.6)   60.4(0.8)    60.33(0.8)   63.28(0.8)
Cl K    60.08(0.7)   59.96(0.7)   60.13(0.8)   59.96(1.0)   59.28(0.7)
Cl T    60.08(0.7)   59.96(0.7)   60.13(0.8)   60.29(0.4)   54.5(0.9)
Far K   60.08(0.7)   63.29(1.0)   64.19(0.8)   57.3(1.1)    55.66(1.1)
Far T   60.08(0.7)   62.14(0.5)   58.27(0.6)   56.37(1.0)   55.66(1.1)
CBU     54.82(0.9)   54.67(0.9)   54.71(0.9)   54.66(1.0)   54.78(0.9)

4.5 Boosting variants

In this section the plain AdaBoost (AB) method is compared with the cost-sensitive AdaCost (AC) algorithm, the EasyEnsemble (EE) technique and the baseline (BL) method (where we train a single SVM on the imbalanced training data). AB, AC and EE have an underlying boosting process in common. The AUC-values are stored during each boosting iteration, which allows us to show the performance graphically as a function of the number of iterations t. In EasyEnsemble (EE) we combine the weak learners of each subset by summing their individual contributions: for example, when t = 2 boosting iterations are performed, we evaluate the performance of the combined learner ∑_{s=1}^{S} ∑_{t=1}^{2} α_{s,t} h_{s,t}(x). Figures 1, 2, 3 and 4 show the average tenfold AUC-performance (with µ = 100%) on the test data. The left-side figures show the AB, AC (with varying R-values) and EE (with varying S-values) performance with respect to the number of boosting iterations. The C-value is tuned according to highest validation set AUC-performance (over all possible boosting iterations). The right-side figures investigate the influence of the C-parameter for AB and EE (with S = 15) and allow us to gain insight into the effect of weak and strong learners on the generalization performance. Results on the remaining datasets are given in Appendix C, Figures 6-17. Note that we only report results with weight-percentage µ = 100% (use all instances in the training process); previous experiments (with µ = 75%) showed inferior results. This suggests that it is indeed important to include all minority instances during the training of each weak learner, an observation also made by Liu et al (2009).

When investigating the influence of the regularization parameter C in AdaBoost, we can graphically verify that it is a "weakness" indicator, in the sense that lower C-values correspond to weaker learners. While these low C-value learners have a low performance in the first rounds of boosting, the possible boost in performance is usually much higher compared to higher C-value inducers. Strong learners can already make a suitable distinction between minority and majority class behaviours in the first boosting round. The boosting process will emphasize the hard-to-learn instances, among which are noise/outliers; therefore, too strong learners are not suitable for use in a boosting process, as already noted by Wickramaratna et al (2001). In Figures 2(b) and 3(b) we observe that the lowest C-values (C = 10^-7, 10^-5) can outperform higher C-values (C = 10^-3, 10^-1). In many cases the AB-process outperforms the baseline (BL), see also the results in Table 5: a combination of several weak learners can often achieve better generalization than a single strong learner.

Inspired by the success of the RUS method with βu = 1 (for reasons indicated in Section 4.4), the EE-technique tries to overcome an obvious downside of this approach: we are removing a lot of majority class instances that can be very informative. Hence, we try to make up for this loss by combining several of these balanced training sets, in the hope that the majority class is explored more adequately. This results in a very effective learner that seems to outperform the AB, AC and BL methods in the majority of cases, as can be seen from these figures. It is remarkable that we don't need that many subsets S to achieve high performances: increasing the S-value from 5 to 15 doesn't seem to cause major increases in AUC (see also Table 5). The influence of the C-parameter is similar as described in the previous paragraph.
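The EE procedure can be sketched as follows. This is a simplified stand-in, not the paper's implementation: scikit-learn's stump-based AdaBoostClassifier replaces the SVM weak learners used in the experiments, and the S boosted ensembles are combined by summing their decision values.

```python
import numpy as np
from sklearn.ensemble import AdaBoostClassifier

def easy_ensemble(X, y, S=5, T=10, seed=0):
    """Train a boosted learner on each of S random balanced subsets
    (all minority instances + an equally sized majority draw) and
    combine them by summing their decision values."""
    rng = np.random.default_rng(seed)
    mino, majo = np.flatnonzero(y == 1), np.flatnonzero(y == 0)
    models = []
    for s in range(S):
        sub = np.concatenate([mino, rng.choice(majo, len(mino), replace=False)])
        models.append(AdaBoostClassifier(n_estimators=T, random_state=s).fit(X[sub], y[sub]))
    return lambda Xt: sum(m.decision_function(Xt) for m in models)

# Toy imbalanced data: 200 majority points around 0, 20 minority points around 3
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0, 1, (200, 2)), rng.normal(3, 1, (20, 2))])
y = np.array([0] * 200 + [1] * 20)
score = easy_ensemble(X, y, S=5, T=10)
```

Each subset is only twice the minority class size, so every individual boosting run is cheap, while the S random majority draws jointly cover the majority class space.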

Fig 1 Mov G(p = 25) dataset results showing average tenfold AUC-performance on test data (with µ = 100%) for (a) AB, AC and EE with C chosen according to highest validation set AUC-performance (over all possible T); (b) AB and EE(S = 15) with varying C-levels. [Figure: panels (a) and (b) plot test AUC against the number of boosting iterations T.]


Fig 2 Book(p = 25) dataset. [Figure: same panel layout as Fig 1 — test AUC versus the number of boosting iterations T.]

Fig 3 TaFeng(p = 25) dataset. [Figure: same panel layout as Fig 1.]

Fig 4 Bank(p = []) dataset. [Figure: same panel layout as Fig 1.]


4.6 Final comparison

4.6.1 Performance-wise comparison

In this section we wish to determine which of the aforementioned methods shows superior performance in terms of AUC on test data. The experimental set-up is described in Section 4.2. Note that we exclude β = 0 and βu = 0 in determining the performance of the oversampling and undersampling techniques respectively, to be able to compare them with the baseline (BL) approach.23 The results for AB, AC and EE are shown for µ = 100%. The number of boosting iterations t ∈ [0, T] is also considered as a tunable parameter. Table 5 shows the final comparison between all methods. Additionally, we added an average rank column indicating the mean rank of each algorithm across all datasets. The rank of an algorithm for a specific dataset corresponds to its position in the sorted AUC-performance spectrum, where the best performing algorithm gets rank 1, the second best rank 2, etc. In case two or more algorithms perform equally well, average ranks are assigned (if two algorithms have ranks of 3 and 4 respectively and show equal performance, then both methods get assigned a rank of 3.5). Note that the LST dataset is excluded from the calculation of average ranks, since we consider this a dataset not affected by imbalance and showing equal performances.
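The tie-handling rule is the standard "average rank" convention, e.g. as implemented by scipy.stats.rankdata (the matrix below is toy data, not the paper's results):

```python
import numpy as np
from scipy.stats import rankdata

def average_ranks(auc):
    """auc: datasets x algorithms matrix. Rank per dataset (best AUC gets
    rank 1; tied algorithms share the average of the ranks involved),
    then average the per-dataset ranks over all datasets."""
    per_dataset = np.vstack([rankdata(-row, method="average") for row in auc])
    return per_dataset.mean(axis=0)

# Two algorithms tie for best on this toy dataset -> both get rank 1.5
ranks = average_ranks(np.array([[0.80, 0.80, 0.70]]))
```

Negating the AUC before ranking makes the highest performance receive the lowest (best) rank.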

As a side note, we noticed that increasing the number of minority training set instances can significantly improve performances. We can see this clearly by comparing the p = 25 outcomes with the results of p = 1 on the same dataset. Junque de Fortuny et al (2014a) already showed empirically that larger behaviour datasets (in terms of the number of instances or features) lead to higher AUC-performances. This is confirmed in our experiments, though we want to add that it only holds when the data size is increased in a balanced fashion.

The results in Table 5 confirm our previous analysis. Each of the oversampling, undersampling (except Cl Knn) and boosting variants seems to improve upon the BL performance. Sobhani et al (2015) note that the direct application of ensemble techniques on imbalanced data does not circumvent the imbalanced learning issue; in our experiments on behaviour data, however, we do find that a simple boosting procedure applied to the imbalanced data improves upon the predictive performance. The oversampling techniques OSR, SMOTE and ADASYN show similar outcomes and are competitive methods, seeming to outperform the undersampling techniques and the plain boosting (AB) and cost-sensitive learning (AC) approaches. When dealing with traditional data, the performances of oversampling techniques are generally worse than those of undersampling approaches (Drummond and Holte 2003); it is interesting to observe that for these large and sparse behaviour datasets the opposite is true. This is heavily related to the intrinsic properties of the data under consideration: undersampling (or feature selection, for that matter) is not recommended because, as we have noted in Section 2.1, each feature provides a small additional contribution to the predictive performance. With respect to AC, the results indicate that R = RL is a better choice compared to the more arbitrary cost ratios R = 2, 8. The EE-technique has the

23 The BL technique trains single SVMs on the imbalanced training data


lowest average rank and is our best performing method. The combination of several balanced subsets with filtered noise/outlier contributions allows one to explore the majority class space efficiently. To strengthen our previous statements, we conduct statistical significance tests in the next paragraphs, in accordance with the procedure outlined in Demšar (2006); the latter paper presents a framework for performing statistical comparisons of classifiers over multiple datasets.

The first null-hypothesis we try to reject postulates that all algorithms perform equally well or, equivalently, that their average ranks R_j (see Table 5) are indifferent. Under the null-hypothesis, the Friedman statistic (Friedman 1937)

χ²_F = [12N / (k(k+1))] [ ∑_{j=1}^{k} R_j² − k(k+1)²/4 ],   (6)

with N the number of datasets (N = 15) and k the number of algorithms (k = 13), can be used to compute the statistic (Iman and Davenport 1980)

F_F = (N−1) χ²_F / (N(k−1) − χ²_F).   (7)

The latter is distributed according to the F-distribution with k−1 and (k−1)(N−1) degrees of freedom. Evaluating this measure with the average ranks shown in Table 5 results in F_F = 22.98, which is far above the critical value of 1.81 at the α = 0.05 significance level. We can therefore reject the null-hypothesis and proceed with post-hoc tests.
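Plugging the average ranks of Table 5 into Eqs. (6)-(7) reproduces this value (up to small rounding differences in the reported ranks):

```python
import numpy as np

# Average ranks from Table 5: BL, OSR, SMOTE, ADASYN, RUS, Cl Knn, Far Knn,
# CBU, AB, AC(R = 28), AC(R = RL), EE(S = 10), EE(S = 15)
R = np.array([11.600, 5.000, 4.533, 4.800, 8.167, 12.467, 8.133,
              8.567, 8.267, 8.467, 5.400, 3.267, 2.333])
N, k = 15, 13                                     # datasets, algorithms

# Eq. (6): Friedman statistic
chi2_F = 12 * N / (k * (k + 1)) * ((R ** 2).sum() - k * (k + 1) ** 2 / 4)
# Eq. (7): Iman-Davenport correction
F_F = (N - 1) * chi2_F / (N * (k - 1) - chi2_F)
# F_F comes out near 23, matching the reported 22.98
```

As a sanity check, the ranks sum to k(k+1)/2 = 91 per dataset, as they must for a complete ranking.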

The Nemenyi test (Nemenyi 1963) compares each classifier to every other and adjusts the critical value to compensate for making k(k−1)/2 comparisons.24 "The performance of two classifiers is significantly different if the corresponding average ranks differ by at least the critical difference (CD)" (Demšar 2006); we refer to the aforementioned paper for the calculation of the CD. The result of these comparisons at the α = 0.05 significance level is shown in Table 6. This test is already quite revealing: for instance, the oversampling methods (OSR, SMOTE and ADASYN), AC (R = RL) and the EE-techniques are significantly different from the BL. Yet it does not indicate that undersampling or applying plain boosting (AB) results in significantly different performances compared to the BL. The Nemenyi test is still conservative, and we therefore compare all classifiers to a single control classifier in the next paragraphs. Note that we only make k−1 comparisons in that situation, resulting in stronger results compared to the Nemenyi test.

When comparing a single classifier i with a control classifier c, the null-hypothesis states that the two classifiers perform equally well. Under this hypothesis, the test statistic (based on the average ranks R_i shown in Table 5)

z = (R_i − R_c) / sqrt( k(k+1) / (6N) )   (8)

is distributed according to a standard normal distribution. This z-value is used to find the corresponding probability (p-value) p from the cumulative standard normal

24 The larger the number of comparisons, the higher the proportion of null-hypotheses that are wrongfully rejected due to random chance.


Table 5 Final comparison of all methods in terms of AUC test set performance; see Section 4.2 for details on the experimental set-up. β ≠ 0 and βu ≠ 0 for the oversampling and undersampling techniques respectively; µ = 100% for each of the boosting variants. Standard deviations are given between brackets and the performance difference with the BL is shown in square brackets.

Mov G(p = 1) Mov G(p = 25) Mov Th(p = []) Yahoo A(p = 1)

BL 716(262)[0] 8141(132)[0] 7977(533)[0] 5592(297)[0]OSR 7535(227)[38] 8376(209)[23] 8513(61)[54] 6005(271)[41]

SMOTE 7616(227)[46] 837(21)[23] 8567(498)[59] 601(3)[42]ADASYN 7607(226)[45] 8363(204)[22] 8565(56)[59] 599(299)[4]

RUS 7288(273)[13] 8152(215)[01] 8291(719)[31] 5704(177)[11]Cl Knn 7143(136)[-02] 8088(119)[-05] 7887(471)[-09] 5578(271)[-01]Far Knn 719(295)[03] 809(148)[-05] 8407(464)[43] 572(133)[13]

CBU 7417(236)[26] 8151(104)[01] 8276(722)[3] 5877(343)[28]AB 7165(173)[01] 8452(189)[31] 8243(518)[27] 5835(262)[24]

AC(R = 28) 7161(246)[0] 8346(182)[2] 8327(56)[35] 5772(247)[18]AC(R = R L) 7465(27)[31] 8335(209)[19] 8541(449)[56] 5947(233)[35]EE(S = 10) 7604(266)[44] 8505(185)[36] 861(578)[63] 5966(313)[37]EE(S = 15) 7612(288)[45] 8514(186)[37] 8642(586)[67] 5976(293)[38]

Yahoo A(p = 25) Yahoo G(p = 1) Yahoo G(p = 25) TaFeng(p = 1)

BL 6168(242)[0] 6684(366)[0] 7882(139)[0] 5575(16)[0]OSR 6459(312)[29] 7308(296)[62] 7852(201)[-03] 6121(224)[55]

SMOTE 6556(333)[39] 7311(312)[63] 7901(121)[02] 6172(181)[6]ADASYN 6513(338)[34] 7322(317)[64] 7974(168)[09] 6168(186)[59]

RUS 6411(28)[24] 7065(339)[38] 7891(155)[01] 5925(218)[35]Cl Knn 6114(213)[-05] 6634(354)[-05] 7726(146)[-16] 5577(128)[0]Far Knn 6396(303)[23] 6697(354)[01] 7826(22)[-06] 5998(126)[42]

CBU 6227(179)[06] 7127(289)[44] 7522(242)[-36] 584(157)[26]AB 6388(267)[22] 689(203)[21] 7901(166)[02] 5621(179)[05]

AC(R = 28) 6432(356)[26] 6889(311)[2] 7899(189)[02] 5633(183)[06]AC(R = R L) 6431(303)[26] 7313(28)[63] 7841(2)[-04] 616(226)[59]EE(S = 10) 6651(324)[48] 7261(315)[58] 8052(16)[17] 612(182)[54]EE(S = 15) 6636(318)[47] 7348(232)[66] 8054(156)[17] 6113(183)[54]

TaFeng(p = 25) Book(p = 1) Book(p = 25) LST(p = 1)

BL 6694(134)[0] 526(129)[0] 6008(071)[0] 9999(001)[0]OSR 6877(123)[18] 5587(142)[33] 6462(057)[45] 9999(001)[0]

SMOTE 6847(15)[15] 5507(088)[25] 6296(082)[29] 9999(001)[0]ADASYN 6848(147)[15] 5504(091)[24] 6302(057)[29] 9999(001)[0]

RUS 6828(139)[13] 5426(092)[17] 6328(08)[32] 9998(001)[0]Cl Knn 6613(143)[-08] 5269(13)[01] 6002(079)[-01] 9999(001)[0]Far Knn 6806(141)[11] 5625(152)[37] 6415(112)[41] 9998(001)[0]

CBU 6384(107)[-31] 5375(101)[12] 5468(088)[-54] []AB 6765(155)[07] 5427(195)[17] 65(067)[49] 9999(001)[0]

AC(R = 28) 6931(123)[24] 5372(1)[11] 6124(08)[12] 9998(001)[0]AC(R = R L) 6715(151)[02] 5573(122)[31] 646(064)[45] 9999(001)[0]EE(S = 10) 703(135)[34] 5509(129)[25] 6537(061)[53] 9998(001)[0]EE(S = 15) 704(13)[35] 5535(126)[28] 654(051)[53] 9998(001)[0]


Table 5 Continued. Additionally, an average rank column is added showing the mean rank of each algorithm across all datasets. Note that the LST dataset is excluded for this purpose.

Adver(p = []) Adver(p = 1) CRF(p = []) Bank(p = [])

BL 9661(182)[0] 9093(302)[0] 6406(1643)[0] 6682(088)[0]OSR 9693(191)[03] 933(202)[24] 8074(1293)[167] 7139(079)[46]

SMOTE 9705(166)[04] 9335(201)[24] 787(1656)[146] []ADASYN 9691(195)[03] 9346(221)[25] 7887(1671)[148] []

RUS 9681(187)[02] 9238(251)[15] 8398(599)[199] 6941(119)[26]Cl Knn 964(148)[-02] 8973(342)[-12] 7663(1619)[126] 6617(072)[-06]Far Knn 9577(181)[-08] 9388(178)[3] 8375(1311)[197] 6695(056)[01]

CBU 9715(188)[05] 9418(23)[33] [] []AB 9734(218)[07] 9139(323)[05] 7762(1515)[136] 6682(088)[0]

AC(R = 28) 9744(193)[08] 91(335)[01] 6831(1493)[42] 6767(071)[09]AC(R = R L) 9746(171)[08] 9351(217)[26] 8508(977)[21] 707(08)[39]EE(S = 10) 9764(135)[1] 9297(275)[2] 8618(1017)[221] 7146(081)[46]EE(S = 15) 9763(135)[1] 933(214)[24] 8635(999)[223] 7154(076)[47]

Average Rank

BL           11.600
OSR           5.000
SMOTE         4.533
ADASYN        4.800
RUS           8.167
Cl Knn       12.467
Far Knn       8.133
CBU           8.567
AB            8.267
AC(R = 28)    8.467
AC(R = RL)    5.400
EE(S = 10)    3.267
EE(S = 15)    2.333

Table 6 Nemenyi test outcomes at the α = 0.05 significance level. A value of 1 indicates the null-hypothesis is rejected and thus finds the two algorithms significantly different (a value of 0 means we accept the null-hypothesis). Methods are listed in the same order as in Table 5 and are represented more concisely.

      BL  RO  SM  AD  RU  Cl  Fa  CBU AB  AC1 AC2 EE1 EE2
BL    0   1   1   1   0   0   0   0   0   0   1   1   1
RO    1   0   0   0   0   1   0   0   0   0   0   0   0
SM    1   0   0   0   0   1   0   0   0   0   0   0   0
AD    1   0   0   0   0   1   0   0   0   0   0   0   0
RU    0   0   0   0   0   0   0   0   0   0   0   1   1
Cl    0   1   1   1   0   0   0   0   0   0   1   1   1
Fa    0   0   0   0   0   0   0   0   0   0   0   1   1
CBU   0   0   0   0   0   0   0   0   0   0   0   1   1
AB    0   0   0   0   0   0   0   0   0   0   0   1   1
AC1   0   0   0   0   0   0   0   0   0   0   0   1   1
AC2   1   0   0   0   0   1   0   0   0   0   0   0   0
EE1   1   0   0   0   1   1   1   1   1   1   0   0   0
EE2   1   0   0   0   1   1   1   1   1   1   0   0   0


distribution function: p = 2 × min(Φ(z), 1−Φ(z)). Holm's method (Holm 1979) compares k−1 classifiers to a single control classifier and sorts the corresponding p-values in ascending order, so that p_1 ≤ p_2 ≤ … ≤ p_{k−1}. Each p_i is subsequently compared to its associated confidence level25 α_comp = α/(k−i). Holm starts by performing the check p_1 < α/(k−1) and rejects the null-hypothesis if the test passes. It then proceeds with a similar check for i = 2, and continues until a certain null-hypothesis cannot be rejected; the remaining hypotheses are retained.
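The step-down procedure is easy to state in code (our own sketch; the p-values in the test below are those of Table 7, with decimal points restored):

```python
def holm(p_sorted, alpha=0.05):
    """Holm's step-down test. p_sorted: the k-1 p-values in ascending
    order. Compare p_i against alpha/(k-i) and stop rejecting at the
    first comparison that fails; all later hypotheses are retained."""
    m = len(p_sorted)                  # m = k - 1 comparisons
    reject = [False] * m
    for i, p in enumerate(p_sorted):
        if p < alpha / (m - i):        # thresholds alpha/(k-1), alpha/(k-2), ...
            reject[i] = True
        else:
            break
    return reject
```

With the twelve sorted p-values of Table 7 this rejects exactly the first six hypotheses at α = 0.05, in line with the table's significance column.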

Table 7 shows the result of the Holm test at the α = 0.05 significance level with the BL algorithm as control classifier. The table lists each method according to sorted p-values and shows the z-, p- and α_comp-values. The significance column indicates whether the proposed method is significantly different from the BL and coincidentally matches the result of the Nemenyi test. The p-values of the (Far Knn, RUS, CBU) undersampling methods, AB and AC (R = 2, 8) are only slightly higher than the level required to reject the null-hypothesis. We included a critical significance level α_crit, corresponding to the lowest possible significance level at which the method would be considered significantly different from the BL (if α = α_crit then p = α_comp). In the case of AB, for example, if we were to choose a significance level α = 0.0764 (a confidence level of 92.36%), the associated p-value would be smaller than α_comp and we would proceed to conclude26 that AB performs significantly differently from the BL. To summarize: the null-hypothesis of equal performance between the BL and an alternative method proposed in Section 3 (except Cl Knn) is rejected with a confidence level of more than 91% according to Holm's test. Hence, the performance of the BL approach is significantly worse than that of each of the proposed over- and undersampling methods (except Cl Knn), cost-sensitive learning techniques (AC) and boosting variants (AB, EE).

Similar to the previous paragraph, Table 8 presents the results of Holm's test at the α = 0.05 significance level with EE(S = 15) as control classifier. The results again correspond with those of the Nemenyi test and indicate that the oversampling methods (OSR, SMOTE and ADASYN), AC(R = RL) and EE(S = 10) are not statistically different from the EE(S = 15) technique at the 0.05 significance level. However, if we were to adapt the significance level to approximately α = 0.25 (a 75% confidence level), the results would conclude significance (except for EE(S = 10)). To summarize: the null-hypothesis of equal performance between EE(S = 15) and an alternative method proposed in Section 3 (except EE(S = 10)) is rejected with a confidence level of more than 75% according to Holm's test. Hence, the performance of the oversampling, undersampling, AB and AC approaches is significantly worse than that of EE(S = 15). We can therefore safely conclude EE to be the preferred method in terms of AUC performance when dealing with imbalanced behaviour data.

25 α_comp adjusts the value of α to compensate for multiple comparisons.
26 This is not entirely true, since Holm's method requires Far Knn to pass the test first before proceeding, which only occurs at the 0.088662 significance level.


Table 7 Holm test at the α = 0.05 significance level with BL as reference. The table shows the z test statistic with associated p-value; α_comp = α/(k−i), with k the number of algorithms (k = 13) and i the position of the method in the sorted p-value vector. Since the p-values are ranked in ascending order, i equals the row number (e.g. OSR has i = 5). The column "significant" denotes whether we can reject the null-hypothesis (significant: p < α_comp). α_crit corresponds with the smallest possible significance level at which we would decide to reject the null-hypothesis (α_crit = α · p/α_comp).

             z          p           α_comp     significant  α_crit

EE(S = 15)   -6.51642   7.2E-11     0.004167   1            8.64E-10
EE(S = 10)   -5.86009   4.63E-09    0.004545   1            5.09E-08
SMOTE        -4.96936   6.72E-07    0.005      1            6.72E-06
ADASYN       -4.78183   1.74E-06    0.005556   1            1.56E-05
OSR          -4.64119   3.46E-06    0.00625    1            2.77E-05
AC(R = RL)   -4.35991   1.3E-05     0.007143   1            9.11E-05
Far Knn      -2.4378    0.014777    0.008333   0            0.088662
RUS          -2.41436   0.015763    0.01       0            0.078815
AB           -2.34404   0.019076    0.0125     0            0.076305
AC(R = 28)   -2.20339   0.027567    0.016667   0            0.082701
CBU          -2.13307   0.032919    0.025      0            0.065837
Cl Knn        0.609449  0.542227    0.05       0            0.542227

Table 8 Holm test at the α = 0.05 significance level with EE(S = 15) as reference.

             z          p           α_comp     significant  α_crit

Cl Knn        7.12587   1.03E-12    0.004167   1            1.24E-11
BL            6.516421  7.2E-11     0.004545   1            7.92E-10
CBU           4.383348  1.17E-05    0.005      1            0.000117
AC(R = 28)    4.313027  1.61E-05    0.005556   1            0.000145
AB            4.172384  3.01E-05    0.00625    1            0.000241
RUS           4.102063  4.09E-05    0.007143   1            0.000287
Far Knn       4.078623  4.53E-05    0.008333   1            0.000272
AC(R = RL)    2.156513  0.031044    0.01       0            0.155218
OSR           1.875229  0.060761    0.0125     0            0.243045
ADASYN        1.734587  0.082814    0.016667   0            0.248442
SMOTE         1.547064  0.121848    0.025      0            0.243696
EE(S = 10)    0.65633   0.511612    0.05       0            0.511612

4.6.2 Timing-wise comparison

The performance of each of the proposed algorithms in terms of training times depends strongly on the chosen parameters. The β and βu parameters control the number of training instances and are an important contribution to the subsequent SVM learning times for each of the over- and undersampling techniques. The number of boosting iterations t ∈ [0, T] is also a major factor in each of the boosting variants proposed in Section 3.3. Furthermore, the chosen regularization parameter C strongly influences the training times of the SVM inducer: higher C-values result in larger training times (Fan et al 2008). Not only do the chosen hyperparameters affect learning times; data characteristics such as the number of minority instances, the imbalance level, the number of features and the amount of overlapping also have a major effect.


In comparing each of the methods outlined in Section 3, we make use of a methodology similar to the one previously presented (see Section 4.6.1). For each of the ten folds, a suitable parameter combination is selected based on optimal validation set AUC-performance. These chosen parameters are applied in determining the timings of each method, and the final results indicate the average timings across ten folds. Hence, we show the required amount of time for each method when using the parameter combination resulting in the best performance. Note that this might, for instance, result in using β = 1/3 in case of OSR, or using only t = 5 boosting iterations in EE. Computational timings are shown in Table 9.

As one can observe from the aforementioned table, CBU is the slowest method, as it relies on computationally intensive clustering procedures. The synthetic SMOTE and ADASYN oversampling approaches are also very time-consuming: they both rely on computationally expensive nearest neighbour computations, in conjunction with a synthetic sample generation procedure that requires a single pass over all minority training instances.

The RUS method shows, as expected, superior results in terms of computational timings. The other undersampling methods perform quite well, though they are slower than BL and OSR due to their inherent nearest neighbour computations.

The boosting variants seem to perform worse than the BL, OSR and each of the undersampling methods. Since EE was the best method in terms of AUC-performance (see Section 4.6.1), we focus on this learner here. For the medium-sized datasets (all but CRF and Bank), OSR outperforms EE(S = 15); for the large datasets (CRF and Bank), the opposite is true. A SVM is known to have a quadratic computational complexity in the number of training instances (Suykens et al 2002; García and Lozano 2007), and for this reason we consider OSR (or any other oversampling technique) inappropriate for large datasets. Since with EE each subset has a size limited to twice the minority class training size, we expect increasing benefits of the EE-technique for larger and more imbalanced behaviour data. As Junque de Fortuny et al (2014a) have observed, larger behaviour data, in terms of the number of instances or features, also contribute to an increase in predictive power. Hence, we expect to see this type of big behaviour data more often in real-life applications, where computational resources can become a major constraint. Because the EE-technique is easily parallelizable, by simultaneously performing the boosting process for each of the independent subsets, we added an extra row in Table 9: EE par corresponds with the time needed for EE(S = 15) divided by 15, indicating the time that would be required for EE if each subset were trained in parallel. We can see that EE par outperforms OSR in 8 out of 16 cases, precisely for the datasets that show a high imbalance level (e.g. p = 1 or p = []).
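Because the S subsets share no state, the EE par scenario is straightforward to realize, e.g. with a thread pool. The sketch below is our own illustration: L2-regularized logistic models stand in for the boosted SVMs of the actual experiments, and the pool size, function names and synthetic data are assumptions.

```python
import numpy as np
from concurrent.futures import ThreadPoolExecutor
from sklearn.linear_model import LogisticRegression

def fit_balanced_subset(X, y, seed):
    """Train one learner on all minority + an equal random majority draw."""
    rng = np.random.default_rng(seed)
    mino, majo = np.flatnonzero(y == 1), np.flatnonzero(y == 0)
    sub = np.concatenate([mino, rng.choice(majo, len(mino), replace=False)])
    return LogisticRegression(max_iter=1000).fit(X[sub], y[sub])

def parallel_easy_ensemble(X, y, S=15, workers=4):
    """Train the S independent balanced subsets concurrently; with enough
    workers the wall-clock time approaches the sequential time divided
    by S (the 'EE par' row in Table 9)."""
    with ThreadPoolExecutor(max_workers=workers) as ex:
        return list(ex.map(lambda s: fit_balanced_subset(X, y, s), range(S)))

rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0, 1, (300, 3)), rng.normal(2, 1, (30, 3))])
y = np.array([0] * 300 + [1] * 30)
models = parallel_easy_ensemble(X, y, S=5)
scores = sum(m.decision_function(X) for m in models)
```

Since each subset is only twice the minority class size, the per-task memory footprint stays small, which is exactly what makes this approach attractive for the large CRF and Bank datasets.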

We conclude our comparison of the proposed methods in Figure 5, which shows the average ranks in terms of computational timings (see Table 9) and AUC performance for each of the available data sources. Regarding the ranks with respect to the latter, we added an extra row EE par, having the same AUC-performance as EE(S = 15) for each dataset in Table 5; ranks are subsequently determined according to the procedure outlined in Section 4.6.1. Additionally, Figure 18 in Appendix D shows the result for the large and highly imbalanced CRF and Bank datasets. We expect the latter type of data to be of major interest for reasons already indicated.


Table 9 Computational timings (in seconds) averaged across ten folds, using the parameter combination resulting in the highest validation set AUC-performance. First and second best performances are emphasized in boldface in the original (third and fourth best are underlined). EE par is the time required for EE(S = 15) divided by 15.

Mov G(p = 1) Mov G(p = 25) Mov Th(p = []) Yahoo A(p = 1)

BL 0032889 0056697 0558563 0026922OSR 0055043 0062802 099009 0044421

SMOTE 0218821 0937057 3841482 0057726ADASYN 0284688 1802399 5191265 0087694

RUS 0011431 0025383 0155224 0007991CL Knn 0046599 0599846 0989914 0037182Far Knn 0039887 080072 0683023 0027788

CBU 1034111 1060173 6822839 1692477AB 0169792 0841443 3460246 0139251

AC(R = 28) 0471994 2996585 1086907 0366555AC(R = RL) 053376 1179542 6065177 0209015EE(S = 10) 0117226 6065145 117995 0148973EE(S = 15) 020474 7173737 2119991 0180365

EE par 0013649 0478249 0141333 0012024

Yahoo A(p = 25) Yahoo G(p = 1) Yahoo G(p = 25) TaFeng(p = 1)

BL 0092954 0011915 0044164 0026728OSR 0027887 0013241 0047206 0040919

SMOTE 1062686 0056153 0883698 0219553ADASYN 2050993 0079073 1733367 0306618

RUS 0048471 0003234 0033423 0002916CL Knn 084391 0025404 0502515 0092167Far Knn 0664124 0026576 0500206 0080159

CBU 1569442 1287221 1355035 2467279AB 0445546 0078777 0169977 0114619

AC(R = 28) 1034044 0321723 0515953 0926178AC(R = RL) 0706215 0226741 0112949 0610233EE(S = 10) 1026577 0100331 1527146 0058052EE(S = 15) 1607596 0077483 2472582 010538

EE par 0107173 0005166 0164839 0007025

TaFeng(p = 25) Book(p = 1) Book(p = 25) LST(p = 1)

BL 0032033 0080035 0318093 0652045OSR 0032414 0132927 0092757 087152

SMOTE 5089283 3409418 1143444 4987705ADASYN 8148419 3689661 1225441 6840083

RUS 0020457 0022713 0031972 0432839CL Knn 1713731 0400873 3711648 2508374Far Knn 1539437 0379086 3988552 2511037

CBU 2642686 4198663 4631987 []AB 0713265 061719 1238585 2466151

AC(R = 28) 1234647 1666131 2330635 1451671AC(R = RL) 0279047 0860346 0197053 123763EE(S = 10) 2484502 2145747 7177484 0524066EE(S = 15) 3363971 2480066 1121945 0784111

EE par 0224265 0165338 0747963 0052274


Table 9 Continued. Additionally, an extra column indicating average ranks is included (lower ranks are preferred).

            Adver(p = [])  Adver(p = 1)  CRF(p = [])  Bank(p = [])
BL          0.010953   0.002796   0.725911   7.089334
OSR         0.012178   0.006166   3.685813   17.97481
SMOTE       0.123112   0.017764   5.633862   []
ADASYN      0.183767   0.021728   5.768669   []
RUS         0.012115   0.00204    0.147392   5.247441
CL Knn      0.061324   0.005568   1.106755   7.373282
Far Knn     0.079078   0.007069   1.110379   9.759619
CBU         3.378235   3.236754   []         []
AB          0.069199   0.103518   1.153196   8.308618
AC(R = 28)  0.193092   0.068905   2.047434   7.170548
AC(R = RL)  0.107652   0.037963   1.387174   10.63466
EE(S = 10)  0.138485   0.085686   0.198656   2.495117
EE(S = 15)  0.185136   0.139121   0.285345   3.640107
EE par      0.012342   0.009275   0.019023   2.426738

            Average Rank [pos]
BL          2.94  [2]
OSR         4.19  [4]
SMOTE       9.59  [11]
ADASYN      10.91 [13]
RUS         1.38  [1]
CL Knn      6.5   [5]
Far Knn     6.56  [6]
CBU         14    [14]
AB          8.06  [7]
AC(R = 28)  10.81 [12]
AC(R = RL)  9.25  [9]
EE(S = 10)  8.25  [8]
EE(S = 15)  9.56  [10]
EE par      3     [3]

Timings are a substantial concern with respect to these large datasets. As can be observed, the EE-technique thrives for these big and highly imbalanced data, even in its non-parallel form.

38 Jellis Vanhoeyveld David Martens

Fig. 5 Average rank AUC versus average rank Time (see Table 9) across all datasets from Table 2 (excluding Flickr and Kdd). Regarding the AUC rank, an extra row EE par is added, having the same AUC-value as EE(S = 15). Ranks for each dataset are subsequently obtained via the procedure outlined in Section 4.6.1. Points occurring in the upper-right region are preferred.

5 The effect of the chosen base learner

In this short section we investigate the influence of the chosen base learner. More specifically, we ask ourselves whether some base inducers can more easily cope with the imbalanced learning issue, thereby verifying the conclusion that EE is a suitable technique to elevate upon the baseline performance. We rely on several popular classification methods for behaviour data: two regularization based techniques (SVM, LR) and two heuristic types of approaches (Naive Bayes (NB), Behavioural Similarity (BeSim)). The next paragraphs briefly introduce these methods. Note that source codes and dataset results are provided in our on-line repository.

LR constructs a linear model that maximizes the likelihood of the observed data by imposing a logistic model for the parametric probability model P(Y|X). Ng and Jordan (2002) provide a thorough introduction to the topic27 and note that in its plain form LR can suffer from overfitting, especially when the input data is very high dimensional. Therefore, we resort to regularized logistic regression. We opt for L2-regularized LR, since L1-regularization corresponds with a natural way of imposing feature selection (Ng 2004). The LIBLINEAR toolbox (Fan et al 2008) is used to obtain these L2-regularized LR models.
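As a minimal sketch of this setup, the snippet below fits an L2-regularized LR model on a synthetic sparse matrix. It uses scikit-learn's liblinear-backed LogisticRegression as a stand-in for the LIBLINEAR toolbox used here; the data, the C value and all dimensions are invented for the example.

```python
import numpy as np
from scipy.sparse import random as sparse_random
from sklearn.linear_model import LogisticRegression

rng = np.random.RandomState(0)
# Synthetic stand-in for a sparse, high-dimensional behaviour matrix.
X = sparse_random(200, 1000, density=0.01, format="csr", random_state=rng)
y = np.zeros(200, dtype=int)
y[:20] = 1  # ~10% minority class

# L2-regularized LR; the C parameter plays the same role as in the SVM formulation
# (small C = strong regularization = "weaker" learner).
clf = LogisticRegression(penalty="l2", C=1e-3, solver="liblinear")
clf.fit(X, y)
scores = clf.decision_function(X)  # real-valued confidence scores
```

The real-valued scores are what make this learner usable in a confidence-rated boosting scheme.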

27 For an accessible introduction, see the chapter on "Generative and Discriminative Classifiers: Naive Bayes and Logistic Regression" provided at the following link: http://www.cs.cmu.edu/~tom/NewChapters.html


NB (Ng and Jordan 2002) relies on the use of Bayes' rule, which expresses the posterior P(Y|X) as the product of the likelihood P(X|Y) and the prior P(Y), divided by the evidence P(X). The naivety of the approach originates from assuming conditional independence of the features. We impose a multivariate event model for the distribution P(X|Y). Junqué de Fortuny et al (2014a) provide an efficient NB implementation for dealing with large and sparse binary behaviour data.
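The multivariate (Bernoulli) event model can be sketched as follows; scikit-learn's BernoulliNB is used here as a generic stand-in for the dedicated large-scale implementation of Junqué de Fortuny et al (2014a), and the binary matrix is synthetic.

```python
import numpy as np
from scipy.sparse import csr_matrix
from sklearn.naive_bayes import BernoulliNB

rng = np.random.RandomState(1)
# Sparse binary behaviour matrix: rows = instances, columns = behaviours.
X = csr_matrix((rng.rand(300, 500) < 0.02).astype(np.float64))
y = np.zeros(300, dtype=int)
y[:30] = 1  # 10% minority class

nb = BernoulliNB()          # multivariate Bernoulli event model for P(X|Y)
nb.fit(X, y)
proba = nb.predict_proba(X)[:, 1]   # posterior P(Y = 1 | X)
```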

The BeSim approach conceptually corresponds to projecting the bigraph to a unigraph using the procedure outlined in Section 3.2. Next, the weighted-vote relational neighbour classifier (Macskassy and Provost 2007) is used to infer unknown labels/scores through a weighted probability estimation using the known labels of neighbouring nodes. A very fast and scalable implementation of this approach (the SW-transformation) is presented in detail in Stankova et al (2015).
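The weighted-vote idea can be illustrated with a toy sketch: a test node is scored by the labels of the training nodes it reaches through shared items in the bigraph. This is a deliberately simplified illustration, not the exact SW-transformation of Stankova et al (2015); the tiny matrices are invented.

```python
import numpy as np
from scipy.sparse import csr_matrix

# Toy binary behaviour matrices: rows = users, columns = items.
X_train = csr_matrix(np.array([[1, 1, 0, 0],
                               [0, 1, 1, 0],
                               [0, 0, 1, 1]], dtype=float))
y_train = np.array([1, 0, 0])
X_test = csr_matrix(np.array([[1, 1, 0, 0]], dtype=float))

# Per item: how many positive neighbours vs. how many neighbours in total.
item_votes = X_train.T @ y_train          # positive votes reachable via each item
item_total = X_train.T @ np.ones(3)       # total neighbours per item

# Weighted-vote score: fraction of positive neighbours reached via shared items.
scores = (X_test @ item_votes) / np.maximum(X_test @ item_total, 1)
# The single test user shares items 0 and 1, reaching 2 positive votes
# out of 3 neighbour links, so its score is 2/3.
```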

In Table 10 we compare the performances of the baseline (BL) application of the methods with their EE-version (with S = 15, T = 20). It is quite striking to see the EE-version dominating the baseline version when looking at each method individually. This suggests that the imbalanced learning issue has an impact for each of the aforementioned techniques. Focusing on the regularization based approaches, we observe the baseline LR to be less influenced by the imbalance than the SVM method. Their EE-versions (EE SVM and EE LR) can boost the baseline performance in almost all cases and have the highest average ranks amongst all the proposed methods. As already emphasized, the regularization based approaches offer an added element of flexibility, in the sense that the strength of the underlying base learner can be controlled.

The heuristic approaches NB and BeSim do not offer this flexibility. Furthermore, it is not directly possible to integrate instance weights during the training phase of the base learner. During the boosting process, we therefore sample from the distribution Dt to generate (unweighted) examples and subsequently train a BeSim/NB learner. The results in Table 10 indicate BL NB to be the worst performing algorithm amongst all methods tried. Due to the conditional independence assumption, one could regard this as a weak learner, which is therefore suitable to be used in a boosting process. Indeed, the EE NB approach is able to significantly improve upon its performance. We found this to be the slowest EE-version28 (e.g. taking more than 24 hours to process a single fold of the Bank dataset, versus several minutes for the other versions). The average rank estimate of EE NB is rather pessimistic, since we assigned the lowest performance across all methods in case of missing results. BeSim can be considered a stronger learner compared to NB. Its possible boost in performance is typically lower than that of NB. Having said this, we found EE BeSim to improve upon BL BeSim for all datasets (except Adver(p = 1)).
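Sampling from the boosting distribution Dt to feed weight-agnostic learners can be sketched as below; the helper name and the toy data are our own, but the mechanism (a weighted bootstrap draw) is the standard workaround described here.

```python
import numpy as np

def resample_from_dist(X, y, Dt, rng):
    """Draw an unweighted bootstrap sample of the training set according to
    the boosting distribution Dt, so that learners without native instance
    weights (such as NB or BeSim) can still be used inside the boosting loop."""
    idx = rng.choice(len(y), size=len(y), replace=True, p=Dt)
    return X[idx], y[idx]

rng = np.random.RandomState(0)
X = np.arange(20).reshape(10, 2)
y = np.array([0] * 8 + [1] * 2)
Dt = np.full(10, 0.1)          # uniform boosting distribution at round t = 1
Xs, ys = resample_from_dist(X, y, Dt, rng)
```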

28 This statement only relates to the EE-version. In the baseline case, the timings of the NB-version are roughly of the same order of magnitude as SVM and LR (though this heavily depends on the chosen regularization parameter C). With EE, each subset has a size limited to twice the minority class training size (a relatively low number compared to the size of the majority class). This works in favour of SVM/LR (e.g. SVM has a computational complexity of O(m²), with m the number of training instances). Also note that the NB-version is indeed optimized for sparse data. Internally, the NB-implementation writes a fairly large model file to a specified directory (sizes of 1 GB can be encountered there). This writing operation slows down the training stage (especially in case of EE, where S × T = 15 × 20 = 300 model files are constructed).


Table 10 Average tenfold AUC test set performance of the BL compared to its EE-version (S = 15, T = 20) for SVM, LR, BeSim and NB. In case the EE NB implementation crashed, we put a × in the table. If it is too slow (> 24 hours per fold), we place a []. Additionally, average ranks are assigned to each method. In case EE NB has missing results for a particular dataset, we ensure it has the lowest rank across all other algorithms.

          Mov G(p = 1)  Mov G(p = 25)  Mov Th(p = [])  Yahoo A(p = 1)
BL SVM    71.6 (2.62)   81.41 (1.32)   79.77 (5.33)    56.49 (3.37)
EE SVM    76.12 (2.88)  85.13 (1.86)   86.43 (5.86)    59.74 (2.96)
BL LR     71.02 (2.09)  84.39 (1.84)   83.14 (4.17)    57.84 (2.39)
EE LR     76.69 (2.92)  85.03 (1.98)   86.3 (5.37)     59.79 (2.62)
BL BeSim  76.1 (3.58)   81.3 (2.92)    82.81 (6.6)     56.27 (2.73)
EE BeSim  76.31 (3.71)  81.37 (2.9)    85.02 (6.28)    57.7 (1.71)
BL NB     70.26 (5.84)  77.01 (2.54)   70.48 (10.14)   52.56 (2.09)
EE NB     75.93 (2.83)  85.56 (2.01)   86.91 (4.15)    57.55 (2.73)

          Yahoo A(p = 25)  Yahoo G(p = 1)  Yahoo G(p = 25)  TaFeng(p = 1)
BL SVM    61.61 (2.48)  66.84 (3.66)   78.82 (1.39)    55.75 (1.6)
EE SVM    66.38 (3.16)  73.48 (2.32)   80.55 (1.55)    61.13 (1.83)
BL LR     66.27 (2.96)  69.82 (1.93)   80.45 (1.59)    58.91 (2.31)
EE LR     66.22 (3.28)  73.08 (2.14)   80.53 (1.56)    61.43 (2.32)
BL BeSim  64.54 (2.02)  68.89 (2.49)   79.55 (1.96)    57.89 (1.18)
EE BeSim  65.25 (2.23)  71.18 (2.91)   80.04 (1.85)    59.36 (1.47)
BL NB     65 (1.65)     63.33 (2.56)   78.89 (1.64)    54.61 (1.2)
EE NB     66.6 (2.79)   70.99 (2.88)   81.01 (1.3)     59.01 (1.84)

          TaFeng(p = 25)  Book(p = 1)  Book(p = 25)  LST(p = 1)
BL SVM    66.94 (1.34)  52.6 (1.29)    60.08 (0.71)   99.99 (0.01)
EE SVM    70.4 (1.3)    55.34 (1.28)   65.4 (0.51)    99.98 (0.01)
BL LR     69.24 (1.3)   55.34 (1.27)   63.84 (0.75)   99.99 (0.01)
EE LR     70.28 (1.28)  55.49 (1.49)   65.41 (0.63)   99.97 (0.02)
BL BeSim  67.49 (1.23)  55.19 (1.27)   63.7 (0.63)    99.99 (0.01)
EE BeSim  68 (1.21)     55.21 (1.15)   64.38 (0.42)   99.99 (0)
BL NB     65.21 (1.64)  52.93 (0.9)    59.75 (0.47)   98.69 (0.3)
EE NB     70.72 (1.15)  ×              63.46 (0.61)   99.92 (0.04)

          Adver(p = [])  Adver(p = 1)  CRF(p = [])  Bank(p = [])
BL SVM    96.37 (1.94)  91.18 (2.97)   64.36 (18.97)  66.82 (0.88)
EE SVM    97.63 (1.35)  93.3 (2.14)    86.35 (9.99)   71.54 (0.76)
BL LR     97.19 (1.44)  88.51 (1.93)   81.87 (19.63)  71.43 (0.72)
EE LR     97.57 (0.96)  93.02 (2.06)   86.84 (9.62)   71.77 (0.62)
BL BeSim  97.26 (1.12)  95.38 (1.35)   86.91 (9.36)   67.85 (0.67)
EE BeSim  97.38 (1.04)  93.83 (1.35)   87.02 (10.43)  70.41 (0.55)
BL NB     93.75 (1.9)   93.37 (1.9)    87.24 (9.38)   67.83 (0.63)
EE NB     94.04 (1.75)  ×              ×              []

          Flickr(p = 0.1)  Kdd(p = 0.5)  Average Rank
BL SVM    74.92 (0.17)  74.53 (0.05)   6.44 [7]
EE SVM    79.86 (0.13)  80.98 (0.05)   2.39 [1]
BL LR     79.03 (0.11)  81.29 (0.04)   4.28 [4]
EE LR     79.85 (0.13)  80.75 (0.05)   2.61 [2]
BL BeSim  74.62 (0.13)  74.95 (0)      5.11 [6]
EE BeSim  76.4 (0.13)   77.55 (0.03)   3.61 [3]
BL NB     81.36 (0.1)   74.29 (0.05)   6.5 [8]
EE NB     []            []             5.06 [5]


6 Conclusions and future research directions

We investigated the effects of sampling and cost-sensitive learning techniques on the problem of learning from imbalanced behaviour data. This type of big data has fundamentally different properties and internal structure than traditional data and is characterized by sparseness and very large dimensions. This is a new research topic which enables benefits across a wide variety of application domains, such as fraud detection, targeted advertising, predictive policing, churn prediction, default prediction, etc. To enable future developments in this field, we provide an on-line repository containing implementations, datasets and results, available at the following link: http://www.applieddatamining.com/cms/?q=software

Oversampling techniques need to be adapted in order to cope with this kind of data. We proposed new versions of the traditional SMOTE and ADASYN techniques, which differ in the way synthetic samples are generated and similarity computations are performed. The plain random oversampling (OSR) and the synthetic approaches are competitive methods in terms of AUC and do not seem to suffer from overfitting, as traditional studies report. Though the synthetic approaches have comparable AUC-performances in comparison to OSR, they are useless in practice due to their computationally expensive procedures. The OSR method is very fast for small and medium sized datasets, yet it becomes inappropriate for larger datasets because the training times of the underlying base learner increase drastically.
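The OSR scheme can be sketched in a few lines; the function name and the interpretation of β (β = 1 duplicating minority instances until the classes balance, consistent with how the β-values in Table 11 are used) are our own assumptions about the exact set-up, which is defined in Section 4.2.

```python
import numpy as np

def random_oversample(X, y, beta, rng):
    """OSR sketch (assumption: minority class is labelled 1): duplicate
    randomly chosen minority instances; beta = 1 yields a balanced set."""
    min_idx = np.flatnonzero(y == 1)
    maj_idx = np.flatnonzero(y == 0)
    n_extra = int(beta * (len(maj_idx) - len(min_idx)))
    extra = rng.choice(min_idx, size=n_extra, replace=True)
    keep = np.concatenate([maj_idx, min_idx, extra])
    return X[keep], y[keep]

rng = np.random.RandomState(0)
X = np.arange(40).reshape(20, 2)
y = np.array([0] * 16 + [1] * 4)
Xb, yb = random_oversample(X, y, beta=1.0, rng=rng)
# yb now holds 16 majority and 16 minority labels.
```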

With respect to undersampling methodologies, we employed a random undersampling technique (RUS) and Knn-based informed approaches. The latter type selects majority class instances that are closest (Cl Knn) or farthest (Far Knn) from the minority class examples, and allows one to investigate the effect of noise/outliers on SVM performance. We observed that these odd instances have a severe performance degrading effect. RUS and Far Knn approaches can significantly improve performance with respect to the baseline, thanks to their abilities to reduce noise/outlier levels and to put more emphasis on the minority class. Having said this, they usually perform worse than any of the oversampling approaches. This is in sheer contrast to the results obtained from traditional studies and is heavily related to the intrinsic nature of behaviour data. Timing-wise, RUS is the fastest method. The informed approaches are slower than RUS due to the required nearest neighbour computations. Additionally, a cluster-based undersampling (CBU) technique was proposed to alleviate the within-class imbalance. For low undersampling rates, the performance of CBU is superior to plain RUS, whereas for high undersampling rates the opposite is true. Overall, no substantial gain in performance is obtained in comparison with a simple RUS approach. Furthermore, the method is the slowest among all techniques considered and relies on computationally expensive clustering heuristics.
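RUS is the mirror image of OSR; the sketch below discards majority instances at random. As before, the helper name and the reading of βu (βu = 1 removing the entire majority surplus, matching the βu-values of Table 12) are our own assumptions about the exact scheme of Section 4.2.

```python
import numpy as np

def random_undersample(X, y, beta_u, rng):
    """RUS sketch (assumption: minority class is labelled 1): keep all
    minority instances and drop a fraction beta_u of the majority surplus;
    beta_u = 1 yields a fully balanced set."""
    min_idx = np.flatnonzero(y == 1)
    maj_idx = np.flatnonzero(y == 0)
    n_keep = len(maj_idx) - int(beta_u * (len(maj_idx) - len(min_idx)))
    keep_maj = rng.choice(maj_idx, size=n_keep, replace=False)
    keep = np.concatenate([keep_maj, min_idx])
    return X[keep], y[keep]

rng = np.random.RandomState(0)
X = np.arange(40).reshape(20, 2)
y = np.array([0] * 16 + [1] * 4)
Xu, yu = random_undersample(X, y, beta_u=1.0, rng=rng)
# yu now holds 4 majority and 4 minority labels.
```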

We studied the effects of boosting, where we made use of a very specific linear SVM and logistic regression combination that uses confidence rated predictions, instead of a more popular plain "discrete" boosting algorithm (with weak hypotheses having values in {−1, 1}). Our experiments clearly indicated the regularization constant C in the SVM formulation to act as a "weakness" indicator. Indeed, higher C-values cause stronger learners and shouldn't be used in the boosting process, which matches the analysis of studies dealing with traditional data. Strong learners can


already quite accurately distinguish majority class from minority class behaviours. The boosting process will focus on the hard to learn instances, among which are noise/outliers, and this can result in overfitting behaviour. We also observed that boosting can usually outperform the plain baseline, indicating that a combination of weak learners can achieve higher AUC-performance than a single strong learner.

EasyEnsemble (EE) integrates random undersampling with a boosting process. By sampling several balanced subsets from the training data and combining these, we can emphasize the minority class, achieve noise/outlier reduction effects and simultaneously explore the majority class space efficiently. Our version differs from the original formulation because we employ a different type of underlying weak learner (SVM and LR combination) that is fed to a confidence rated boosting algorithm (instead of the discrete version). We observed superior AUC-performance with respect to all of the aforementioned techniques. Furthermore, the method is very fast, since each subset can be trained in parallel and has a size that is twice as large as the minority class training set. We studied the effect of the number of subsets and noted that a limited amount of these (O(10)) is already sufficient to obtain satisfactory performances; increasing this number further causes only minor benefits.
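The EE skeleton (balanced subsets, a boosted learner per subset, averaged scores) can be sketched as follows. Note the deliberate simplifications: scikit-learn's stock AdaBoost with decision stumps stands in for the confidence-rated SVM/LR weak learner used in this work, and all data is synthetic.

```python
import numpy as np
from sklearn.ensemble import AdaBoostClassifier

def easy_ensemble(X, y, S, T, rng):
    """EE skeleton: S balanced subsets (all minority instances plus an
    equal-sized random draw of the majority), one boosted learner per
    subset; each subset is only twice the minority class size."""
    min_idx = np.flatnonzero(y == 1)
    maj_idx = np.flatnonzero(y == 0)
    models = []
    for _ in range(S):
        sub = np.concatenate(
            [min_idx, rng.choice(maj_idx, size=len(min_idx), replace=False)])
        booster = AdaBoostClassifier(n_estimators=T, random_state=0)
        models.append(booster.fit(X[sub], y[sub]))
    return models

rng = np.random.RandomState(0)
X = rng.randn(200, 5)
y = np.array([0] * 180 + [1] * 20)
models = easy_ensemble(X, y, S=5, T=10, rng=rng)
# Final score: average the boosted decision scores over the S subsets.
scores = np.mean([m.decision_function(X) for m in models], axis=0)
```

Because the S subsets are independent, the loop parallelizes trivially, which is what the EE par timings exploit.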

Additionally, a statistical comparison of each of the proposed methods was conducted across a wide variety of datasets. Holm's significance test indicates that each of the proposed techniques (oversampling, undersampling except Cl Knn, cost sensitive learning and boosting variants) can significantly outperform the plain baseline performance (each of the null-hypotheses of equal performance gets rejected with a confidence level of at least 91%). Comparison of the techniques with EE showed that the latter can significantly outperform all of them (each of the null-hypotheses of equal performance gets rejected with a confidence level of at least 75%). A timing-wise comparison showed EE to be the preferred method (after RUS) for large and highly imbalanced datasets. A parallel implementation would make EE the fastest method, even for medium sized datasets that show a high level of imbalance.

The influence of the chosen base learner was investigated by comparing the baseline to its EE-version for the SVM, LR, NB and BeSim inducers. Each of the EE-versions is able to improve upon the baseline performance for nearly all datasets, where base learners of the regularization type are superior because of their ability to control the strength of the learner.

Though we conducted an extensive benchmark study, there are still plenty of future research directions. First and most important, there is an urgent need to develop a more general theoretical framework to completely understand the effect of imbalance and why certain methods outperform others. Secondly, other types of mechanisms could be investigated in this setting: advanced sampling (Guo et al 2008) set-ups perform sampling based on preliminary classifications; one-class learning is a recognition based methodology where a discriminative boundary is learned around the training examples of a single class alone, and Raskutti and Kowalczyk (2004) demonstrated the superiority of the recognition based one-class SVMs over the discriminative two-class SVMs; other types of ensemble learning strategies (e.g. bagging and stacking) could be investigated; Batista et al (2004) note that combining over- and undersampling can be beneficial when data are highly imbalanced or show very few minority class instances; the ideas behind Extended Nearest Neighbour (Tang and


He 2015) could guide us in the sampling process. As opposed to Knn, "it considers not only who are the nearest neighbours of a certain sample, but also who consider the particular sample as their nearest neighbour (Tang and He 2015)". The method assigns a class membership to an unknown sample that maximizes the intra-class coherence. Finally, more advanced techniques do exist that can provide a K (the number of nearest neighbours) faster or with a (slightly) better performance.

Another issue relates to the lack of comprehensibility of the EasyEnsemble technique in its present form. Because we currently integrate Logistic Regression in the boosting process, the final ensemble classifier is non-linear and therefore incomprehensible. It might be interesting to investigate the numerical boosting technique proposed in Schapire and Singer (1999), where each weak learner is allowed to output real values (instead of values in [−1, 1]). In that case, we would be able to use a plain linear SVM as a weak learner, without the Logistic Regression component. The final ensemble would be a linear model, which is surely easier to comprehend. Also, the instance-based approach of Martens and Provost (2014) could be used to explain individual predictions of the non-linear EE-technique.
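The key observation behind that comprehensibility argument is that a weighted sum of linear scorers is itself linear: with weak learners f_t(x) = w_t·x + b_t and boosting weights a_t, the ensemble collapses into a single weight vector Σ a_t w_t and intercept Σ a_t b_t. A tiny numeric check (with made-up weights):

```python
import numpy as np

rng = np.random.RandomState(0)
d = 6
# Three linear weak learners f_t(x) = w_t . x + b_t with boosting weights a_t.
W, b = rng.randn(3, d), rng.randn(3)
a = np.array([0.5, 0.3, 0.2])

# The weighted vote collapses into one linear (hence interpretable) model.
w_ens, b_ens = a @ W, a @ b

x = rng.randn(d)
ensemble_score = sum(a[t] * (W[t] @ x + b[t]) for t in range(3))
# ensemble_score equals w_ens @ x + b_ens for any x.
```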

A Oversampling experiments

Table 11 Oversampling experiments. Results show the average tenfold AUC test set performance with respect to increasing β-values (β = [β1, β2, β3, β4] = [0, 1/3, 2/3, 1]). Additionally, standard deviations are shown between brackets. Optimal performances for each method are highlighted in boldface. See Section 4.2 for parameter settings.

MOV G(p = 1)   β1            β2            β3            β4
OSR      71.6 (2.62)   74.37 (2.04)  73.6 (1.84)   74.73 (2.45)
SMOTE    71.6 (2.62)   75.08 (2.18)  76.02 (2.14)  76.48 (2.3)
ADASYN   71.6 (2.62)   75.16 (1.92)  75.93 (2.08)  76.47 (2.29)

MOV G(p = 25)
OSR      81.41 (1.32)  83.49 (1.81)  83.84 (1.96)  83.91 (2.04)
SMOTE    81.41 (1.32)  83.32 (1.97)  83.59 (2.04)  83.76 (2.11)
ADASYN   81.41 (1.32)  83.61 (1.82)  84.02 (1.97)  83.69 (1.96)

Mov Th(p = [])
OSR      79.77 (5.33)  85.3 (4.66)   83.16 (4.5)   84.59 (5.69)
SMOTE    79.77 (5.33)  84.18 (6.51)  85.58 (5.97)  84.33 (5.75)
ADASYN   79.77 (5.33)  84.11 (6.77)  85.86 (6.06)  85.36 (5.13)

Yahoo A(p = 1)
OSR      55.92 (2.97)  58.66 (3.27)  59.99 (2.28)  59.74 (1.78)
SMOTE    55.92 (2.97)  59.76 (2.62)  59.74 (2.67)  59.43 (2.4)
ADASYN   55.92 (2.97)  59.54 (2.53)  59.55 (2.94)  59.56 (2.22)

Yahoo A(p = 25)
OSR      61.68 (2.42)  64.19 (3.17)  65.08 (3.26)  64.67 (2.1)
SMOTE    61.68 (2.42)  65.46 (3.63)  65.33 (3.23)  64.52 (2.98)
ADASYN   61.68 (2.42)  65.04 (3.74)  65.41 (3.47)  64.4 (2.21)

Yahoo G(p = 1)
OSR      66.84 (3.66)  72.18 (2.36)  73.11 (2.7)   72.49 (3.41)
SMOTE    66.84 (3.66)  72.65 (2.85)  73.27 (3.36)  73.37 (3.56)
ADASYN   66.84 (3.66)  72.87 (2.83)  73.18 (3.2)   73.39 (3.59)

Yahoo G(p = 25)
OSR      78.82 (1.39)  78.78 (2.02)  78.74 (1.86)  78.35 (1.47)
SMOTE    78.82 (1.39)  79.23 (1.57)  79.1 (1.2)    79.03 (1.89)
ADASYN   78.82 (1.39)  79.12 (1.43)  79.23 (1.37)  79.51 (2.01)

TaFeng(p = 1)
OSR      55.75 (1.6)   59.23 (1.96)  60 (1.68)     61.04 (2.36)
SMOTE    55.75 (1.6)   60.26 (1.95)  61.49 (1.8)   61.13 (1.52)
ADASYN   55.75 (1.6)   60.26 (1.9)   61.44 (1.85)  61.16 (1.5)

TaFeng(p = 25)
OSR      66.94 (1.34)  68.84 (1.21)  66.99 (1.42)  67.7 (1.41)
SMOTE    66.94 (1.34)  68.47 (1.5)   67.07 (1.15)  66.65 (0.81)
ADASYN   66.94 (1.34)  68.62 (1.38)  67.85 (1.6)   66.91 (1.39)

Book(p = 1)
OSR      52.6 (1.29)   53.61 (0.94)  55.41 (1.75)  55.87 (1.44)
SMOTE    52.6 (1.29)   54.77 (0.99)  54.91 (0.8)   54.36 (0.98)
ADASYN   52.6 (1.29)   54.86 (1.13)  55.06 (0.73)  54.54 (0.92)

Book(p = 25)
OSR      60.08 (0.71)  61.05 (0.94)  61.82 (1.12)  64.62 (0.57)
SMOTE    60.08 (0.71)  62.6 (0.73)   60.95 (0.68)  63 (0.8)
ADASYN   60.08 (0.71)  62.33 (0.73)  60.77 (0.85)  63.04 (0.58)

LST(p = 1)
OSR      99.99 (0.01)  99.99 (0.01)  99.99 (0.01)  99.99 (0.01)
SMOTE    99.99 (0.01)  99.99 (0.01)  99.99 (0.01)  99.99 (0.01)
ADASYN   99.99 (0.01)  99.99 (0.01)  99.99 (0.01)  99.99 (0.01)

Adver(p = [])
OSR      96.61 (1.82)  97.31 (1.65)  97.07 (1.84)  97.07 (1.79)
SMOTE    96.61 (1.82)  96.91 (1.66)  97.19 (1.65)  97.07 (1.91)
ADASYN   96.61 (1.82)  97.1 (1.7)    97.08 (1.87)  97.07 (1.88)

Adver(p = 1)
OSR      90.93 (3.02)  91.27 (3.03)  92.66 (2.82)  93.29 (1.97)
SMOTE    90.93 (3.02)  92.51 (2.03)  92.96 (2.14)  93.53 (1.81)
ADASYN   90.93 (3.02)  92.22 (2.33)  92.7 (2.36)   93.88 (1.73)

CRF(p = [])
OSR      64.06 (16.43) 80.82 (12.94) 81.28 (12.27) 81.91 (11.28)
SMOTE    64.06 (16.43) 78.64 (16.86) 82.52 (13.74) 79.32 (16.26)
ADASYN   64.06 (16.43) 78.95 (16.72) 81.19 (16.32) 79.31 (16.19)

Bank(p = [])
OSR      66.82 (0.88)  70.1 (0.74)   71.39 (0.8)   71.47 (0.8)
SMOTE
ADASYN


B Undersampling experiments

Table 12 Undersampling experiments. Results show the average tenfold AUC test set performance with respect to increasing βu-values (βu = [βu1, βu2, βu3, βu4, βu5] = [0, 1/4, 1/2, 3/4, 1]). Optimal performances (boldface) are highlighted for each imbalance ratio. Additionally, standard deviations are shown between brackets. CL K represents the "Closest Knn" technique; CL T represents the "Closest tot sim" technique (similar for Far K and Far T, see Section 3.2). Parameter settings are given in Section 4.2.

Mov G(p = 1)  βu1         βu2         βu3         βu4         βu5
RUS    71.6 (2.6)   71.83 (2.6)  72.54 (2.5)  72.39 (3.1)  70.61 (3.5)
Cl K   71.6 (2.6)   71.4 (2)     70.96 (1.9)  70.43 (2.4)  69.05 (4.1)
CL T   71.6 (2.6)   70.28 (2.5)  66.74 (2)    66.8 (2.1)   68.18 (3.6)
Far K  71.6 (2.6)   72.36 (2.7)  71.26 (3.4)  66.57 (5.2)  53.5 (3.5)
Far T  71.6 (2.6)   72.22 (2.8)  71.63 (3.6)  64.28 (5.3)  50.88 (4.4)
CBU    72.55 (2.6)  73.28 (2.6)  73.12 (2.6)  73.84 (2.5)  73 (3.1)

Mov G(p = 25)
RUS    81.41 (1.3)  81.36 (1.3)  81.78 (1.7)  82.05 (1.7)  81.6 (2.1)
Cl K   81.41 (1.3)  80.86 (1.2)  80.95 (1.6)  79.73 (2.3)  77.95 (2.3)
CL T   81.41 (1.3)  79.9 (1.2)   78.21 (1.4)  77.87 (1.5)  77.76 (2.3)
Far K  81.41 (1.3)  80.9 (1.5)   78.17 (1.8)  74.25 (2.4)  69.79 (3.2)
Far T  81.41 (1.3)  80.86 (1.5)  77.2 (2.4)   71.16 (2.7)  62.4 (2.8)
CBU    81.53 (1.4)  81.64 (1.3)  81.29 (1.6)  81.28 (2.1)  80.34 (2.7)

Mov Th(p = [])
RUS    79.77 (5.3)  80.32 (5.8)  81.57 (5.5)  81.86 (6.6)  81.26 (6.2)
Cl K   79.77 (5.3)  79.25 (4.5)  78.07 (5)    76.25 (6.5)  62.46 (8.5)
CL T   79.77 (5.3)  78.4 (4.4)   72.41 (3.5)  64.66 (4.5)  60.37 (7.3)
Far K  79.77 (5.3)  84.54 (5)    83.64 (6.4)  80.02 (7.3)  56.82 (10.3)
Far T  79.77 (5.3)  85.03 (5.7)  82.68 (6.8)  75.61 (9.2)  56.77 (10.9)
CBU    80.11 (5.8)  81.17 (6)    81.08 (6.5)  84.17 (5.1)  80.96 (6.9)

Yahoo A(p = 1)
RUS    55.92 (3)    55.57 (3.4)  56.44 (3)    55.83 (3.4)  56.37 (3.3)
Cl K   55.92 (3)    55.67 (2.4)  53.12 (2)    50.57 (1.8)  53.79 (3.5)
CL T   55.92 (3)    55.69 (2.1)  53.35 (2.2)  50.31 (2.2)  52.35 (3.3)
Far K  55.92 (3)    57.35 (2.2)  56.92 (1.1)  56.95 (2.3)  51.18 (2)
Far T  55.92 (3)    56.93 (2.4)  54.74 (1.9)  57.01 (1.8)  51.18 (2)
CBU    58.21 (2.6)  58.45 (3.3)  58.31 (3.5)  58.39 (3.5)  56.09 (2.6)

Yahoo A(p = 25)
RUS    61.68 (2.4)  62.9 (2.9)   63.62 (3.6)  63.75 (3.1)  63.19 (1.9)
Cl K   61.68 (2.4)  61.14 (2.1)  57.62 (1.6)  54.02 (1.8)  51.48 (1.4)
CL T   61.68 (2.4)  60.89 (2.8)  58.11 (1.4)  54.4 (2.1)   51.76 (1.4)
Far K  61.68 (2.4)  63.96 (3)    62.62 (2.2)  59.61 (1.5)  56.25 (1.6)
Far T  61.68 (2.4)  63.71 (2.4)  59.72 (1.6)  57.27 (1.1)  54.47 (1.1)
CBU    62.46 (2.6)  61.85 (1.4)  61.78 (2.2)  59.94 (3)    60.1 (4)

Yahoo G(p = 1)
RUS    66.84 (3.7)  67.85 (3.2)  68.36 (3.2)  68.23 (4)    69.9 (4.2)
Cl K   66.84 (3.7)  66.71 (2.8)  64.3 (3.6)   61.98 (3.9)  61.15 (1.9)
CL T   66.84 (3.7)  65.79 (2.7)  63.55 (3.3)  59.21 (3.5)  61.08 (2.4)
Far K  66.84 (3.7)  66.76 (4.1)  63.84 (3.4)  65.16 (2)    48.5 (2.9)
Far T  66.84 (3.7)  66.95 (4.1)  63.48 (2.9)  65.16 (2)    48.48 (2.9)
CBU    69.68 (4.1)  70.59 (3.2)  70.64 (3.7)  70.2 (2.9)   63.35 (3.6)

Yahoo G(p = 25)
RUS    78.82 (1.4)  78.91 (1.6)  78.97 (1.6)  78.61 (1.6)  77.82 (2.1)
Cl K   78.82 (1.4)  77.26 (1.5)  72.52 (1.5)  67.86 (2)    65.07 (2.7)
CL T   78.82 (1.4)  76.83 (1)    71.99 (1.8)  67.15 (2.3)  61.1 (2.7)
Far K  78.82 (1.4)  78.26 (2.2)  74.69 (2.7)  67.22 (2.1)  60.72 (2.3)
Far T  78.82 (1.4)  77.68 (2.6)  72.44 (3)    64.94 (2.4)  59.6 (2)
CBU    75.25 (3.2)  75.22 (2.4)  74.69 (2.3)  73.07 (2.4)  70.69 (2.4)

TaFeng(p = 1)
RUS    55.75 (1.6)  56.1 (1.6)   56.26 (1.7)  57.23 (1.7)  59.25 (2.2)
Cl K   55.75 (1.6)  55.68 (1.6)  55.58 (1.5)  55.08 (1.1)  51.05 (1.5)
CL T   55.75 (1.6)  55.67 (1.6)  54.47 (1.6)  47.53 (1.6)  49.3 (1.1)
Far K  55.75 (1.6)  58.99 (1.2)  59.47 (1.1)  60.04 (1.2)  56.31 (1)
Far T  55.75 (1.6)  58.92 (1.3)  59.25 (1.3)  58.58 (1.1)  56.31 (1)
CBU    57.8 (1)     58.47 (1.1)  58.15 (0.9)  58.87 (1.4)  57.65 (1.6)

TaFeng(p = 25)
RUS    66.94 (1.3)  67.44 (1.3)  68.1 (1.4)   68.27 (1.4)  66.13 (1.2)
Cl K   66.94 (1.3)  66.13 (1.4)  63.39 (1.2)  59.83 (1.3)  56.94 (0.7)
CL T   66.94 (1.3)  66.38 (1.5)  62.89 (1.6)  57.46 (1.3)  54.56 (1.3)
Far K  66.94 (1.3)  68.06 (1.4)  66.43 (1.6)  64.46 (1.5)  63.35 (1.3)
Far T  66.94 (1.3)  64.31 (1.1)  62.69 (1)    61.27 (1.1)  59.03 (1)
CBU    64.81 (1.2)  64.15 (1.1)  64.13 (1.2)  63.88 (0.8)  63.46 (0.8)

Book(p = 1)
RUS    52.6 (1.3)   52.79 (0.9)  53.46 (0.8)  53.89 (0.9)  54.05 (0.9)
Cl K   52.6 (1.3)   52.56 (1.2)  52.52 (1.3)  52.39 (1.1)  53.09 (1.1)
CL T   52.6 (1.3)   52.56 (1.2)  52.52 (1.3)  52.39 (1.1)  53.05 (0.7)
Far K  52.6 (1.3)   55.21 (1.2)  56.21 (1.8)  56.14 (1.2)  53.06 (1)
Far T  52.6 (1.3)   55.21 (1.2)  56.21 (1.8)  56.14 (1.2)  53.06 (1)
CBU    54.28 (0.9)  53.77 (1)    53.33 (1.1)  53.34 (0.9)  52.84 (0.8)

Book(p = 25)
RUS    60.08 (0.7)  60.13 (0.6)  60.4 (0.8)   60.33 (0.8)  63.28 (0.8)
Cl K   60.08 (0.7)  59.96 (0.7)  60.13 (0.8)  59.96 (1)    59.28 (0.7)
CL T   60.08 (0.7)  59.96 (0.7)  60.13 (0.8)  60.29 (0.4)  54.5 (0.9)
Far K  60.08 (0.7)  63.29 (1)    64.19 (0.8)  57.3 (1.1)   55.66 (1.1)
Far T  60.08 (0.7)  62.14 (0.5)  58.27 (0.6)  56.37 (1)    55.66 (1.1)
CBU    54.82 (0.9)  54.67 (0.9)  54.71 (0.9)  54.66 (1)    54.78 (0.9)

LST(p = 1)
RUS    99.99 (0)    99.99 (0)    99.99 (0)    99.98 (0)    99.99 (0)
Cl K   99.99 (0)    99.99 (0)    99.99 (0)    99.99 (0)    99.99 (0)
CL T   99.99 (0)    99.99 (0)    99.99 (0)    99.99 (0)    99.98 (0)
Far K  99.99 (0)    99.98 (0)    99.98 (0)    99.98 (0)    99.98 (0)
Far T  99.99 (0)    99.98 (0)    99.98 (0)    99.98 (0)    99.98 (0)
CBU    []           []           []           []           []

Adver(p = [])
RUS    96.61 (1.8)  96.32 (1.8)  96.63 (1.4)  97.12 (2.1)  96.22 (1.6)
Cl K   96.61 (1.8)  96.44 (1.5)  96.14 (1.5)  96.04 (2)    94.8 (2.5)
CL T   96.61 (1.8)  95.87 (2.1)  94.32 (1.9)  93.01 (2.2)  90.72 (2.3)
Far K  96.61 (1.8)  96.53 (1.4)  95.76 (2)    94.39 (1.8)  90.49 (3.1)
Far T  96.61 (1.8)  96.54 (1.5)  95.67 (1.9)  94.54 (1.8)  89.3 (2.8)
CBU    96.85 (2.3)  96.85 (2.3)  97.05 (1.5)  96.6 (1.6)   96.06 (2.1)

Adver(p = 1)
RUS    90.93 (3)    91.53 (3.1)  92.37 (3.4)  91.9 (2.9)   91.93 (2.2)
Cl K   90.93 (3)    90.64 (3)    89.87 (3.9)  90.21 (3.6)  89.18 (2)
CL T   90.93 (3)    89.7 (3.5)   88.55 (3.4)  85.76 (3.3)  88.2 (2.3)
Far K  90.93 (3)    93.8 (2.3)   92.4 (2.6)   88.73 (3.4)  85.51 (4)
Far T  90.93 (3)    93.62 (2.4)  93.2 (2.2)   88.41 (3.6)  85.51 (4)
CBU    93.22 (2.4)  93.76 (2.5)  93.89 (2.6)  93.52 (2.7)  91.27 (2)

CRF(p = [])
RUS    64.06 (16.4) 63.28 (15.9) 67.98 (17.4) 66.95 (21.9) 87.73 (8.8)
Cl K   64.06 (16.4) 62.44 (16.6) 62.34 (16.9) 71.37 (13.8) 78.22 (17.7)
CL T   64.06 (16.4) 62.44 (16.6) 62.34 (16.9) 71.37 (13.8) 62.67 (22.9)
Far K  64.06 (16.4) 83.8 (14.2)  83.93 (14.8) 84.49 (13.7) 86.11 (9.7)
Far T  64.06 (16.4) 83.8 (14.2)  83.93 (14.8) 84.49 (13.7) 86.11 (9.7)
CBU    []           []           []           []           []

Bank(p = [])
RUS    66.82 (0.9)  67.02 (0.9)  67.37 (0.8)  67.99 (0.6)  69.5 (1)
Cl K   66.82 (0.9)  66.17 (0.7)  65.24 (0.6)  64.86 (0.6)  58.53 (1.1)
CL T   66.82 (0.9)  64.92 (1.1)  60.69 (0.9)  56.33 (0.8)  52.87 (0.7)
Far K  66.82 (0.9)  66.95 (0.6)  66.19 (0.6)  64.42 (0.6)  58.25 (1.1)
Far T  66.82 (0.9)  67.16 (0.6)  64.2 (0.8)   59.67 (1)    58.25 (1.1)
CBU    []           []           []           []           []


C Boosting experiments

This section presents the results of each of the boosting experiments from Section 4.5 on the remaining data sources. Each of the figures below depicts the average tenfold AUC-performance on test data (with µ = 100) with respect to the number of boosting iterations for (left) AdaBoost (AB), AdaCost (AC) and EasyEnsemble (EE), with C chosen according to the highest validation set AUC-performance (over all possible boosting rounds), and (right) AB and EE (with S = 15) at varying C-levels.

Fig. 6 Mov G(p = 1) dataset

Fig. 7 Mov Th(p = []) dataset


Fig. 8 Yahoo A(p = 1) dataset

Fig. 9 Yahoo A(p = 25) dataset

Fig. 10 Yahoo G(p = 1) dataset


Fig. 11 Yahoo G(p = 25) dataset

Fig. 12 TaFeng(p = 1) dataset

Fig. 13 Book(p = 1) dataset


Fig. 14 LST(p = 1) dataset

Fig. 15 Adver(p = []) dataset

Fig. 16 Adver(p = 1) dataset


Fig. 17 CRF(p = []) dataset

D Final Comparison

Fig. 18 Average rank AUC versus average rank Time (see Table 9) across the large datasets from Table 2 (CRF and Bank). Regarding the former, an extra row EE par is added, having the same AUC-value as EE(S = 15). Ranks for each dataset are subsequently obtained via the procedure outlined in Section 4.6.1. Points occurring in the upper-right region are preferred.


References

Akbani R Kwek S Japkowicz N (2004) Applying support vector machines to imbalanced datasetsIn Machine Learning ECML 2004 15th European Conference on Machine Learning Pisa ItalySeptember 20-24 2004 Proceedings Springer Berlin Heidelberg Berlin Heidelberg pp 39ndash50 DOI101007978-3-540-30115-8 7

Ali A Shamsuddin SM Ralescu AL (2015) Classification with class imbalance problem A review Inter-national Journal of Advances in Soft Computing and its Applications 7(3)176ndash204

Alzahrani T Horadam KJ (2016) Community detection in bipartite networks Algorithms and case studiesIn Complex Systems and Networks Dynamics Controls and Applications Springer Berlin Heidel-berg Berlin Heidelberg pp 25ndash50 DOI 101007978-3-662-47824-0 2

Bachner J (2013) Predictive policing Preventing crime with data and analytics IBM Center for the Busi-ness of Government

Baesens B Van Gestel T Viaene S Stepanova M Suykens J Vanthienen J (2003) Benchmarking state-of-the-art classification algorithms for credit scoring Journal of the Operational Research Society54(6)627ndash635 DOI 101057palgravejors2601545

Barandela R, Sánchez JS, García V, Rangel E (2003) Strategies for learning in class imbalance problems. Pattern Recognition 36(3):849–851. DOI https://doi.org/10.1016/S0031-3203(02)00257-1

Barber MJ (2007) Modularity and community detection in bipartite networks Physical Review E76066102 DOI 101103PhysRevE76066102

Barua S Islam MM Yao X Murase K (2014) MWMOTEndashmajority weighted minority oversamplingtechnique for imbalanced data set learning IEEE Transactions on Knowledge and Data Engineer-ing 26(2)405ndash425 DOI 101109TKDE2012232

Batista GEAPA Prati RC Monard MC (2004) A study of the behavior of several methods for balancingmachine learning training data SIGKDD Explor Newsl 6(1)20ndash29 DOI 10114510077301007735

Beckett SJ (2016) Improved community detection in weighted bipartite networks Royal Society OpenScience 3(1) DOI 101098rsos140536

Bekkar M Djemaa HK Alitouche TA (2013) Evaluation measures for models assessment over imbalanceddata sets Journal of Information Engineering and Applications 3(10)27ndash38

Bhattacharyya S Jha S Tharakunnel K Westland JC (2011) Data mining for credit card fraud A compara-tive study Decision Support Systems 50(3)602 ndash 613 DOI httpsdoiorg101016jdss201008008

Blondel VD Guillaume JL Lambiotte R Lefebvre E (2008) Fast unfolding of communities in large net-works Journal of Statistical Mechanics Theory and Experiment 2008(10)P10008

Breiman L, Friedman J, Stone CJ, Olshen RA (1984) Classification and regression trees. Taylor & Francis

Brozovsky L, Petricek V (2007) Recommender system for online dating service. In: Proceedings of Znalosti 2007 Conference, VSB, Ostrava

Cha M, Mislove A, Gummadi KP (2009) A measurement-driven analysis of information propagation in the Flickr social network. In: Proceedings of the 18th International Conference on World Wide Web, ACM, New York, NY, USA, WWW '09, pp 721–730. DOI 10.1145/1526709.1526806

Chawla NV (2005) Data mining for imbalanced datasets An overview In Data mining and knowledgediscovery handbook Springer US Boston MA pp 853ndash867

Chawla NV Bowyer KW Hall LO Kegelmeyer WP (2002) SMOTE synthetic minority over-samplingtechnique Journal of artificial intelligence research 16321ndash357

Chawla NV Lazarevic A Hall LO Bowyer KW (2003) Smoteboost Improving prediction of the minorityclass in boosting In Knowledge Discovery in Databases PKDD 2003 Springer Berlin HeidelbergBerlin Heidelberg pp 107ndash119

Chawla NV Japkowicz N Kotcz A (2004) Editorial Special issue on learning from imbalanced data setsSIGKDD Explor Newsl 6(1)1ndash6 DOI 10114510077301007733

Chen M Mao S Liu Y (2014) Big data A survey Mobile Networks and Applications 19(2)171ndash209DOI 101007s11036-013-0489-0

Chyi YM (2003) Classification analysis techniques for skewed class distribution problems Master thesisDepartment of Information management National Sun Yat-Sen University

Demšar J (2006) Statistical comparisons of classifiers over multiple data sets. Journal of Machine Learning Research 7(Jan):1–30

Dhillon IS (2001) Co-clustering documents and words using bipartite spectral graph partitioning In Pro-ceedings of the Seventh ACM SIGKDD International Conference on Knowledge Discovery and DataMining ACM New York NY USA KDD rsquo01 pp 269ndash274 DOI 101145502512502550


Drummond C Holte RC (2003) C45 class imbalance and cost sensitivity why under-sampling beatsover-sampling In Proceedings of the ICML rsquo03 Workshop on Learning from Imbalanced Datasets

Fan RE Chang KW Hsieh CJ Wang XR Lin CJ (2008) LIBLINEAR A library for large linear classifi-cation Journal of Machine Learning Research 91871ndash1874

Fan W Stolfo SJ Zhang J Chan PK (1999) AdaCost Misclassification cost-sensitive boosting In Pro-ceedings of the Sixteenth International Conference on Machine Learning Morgan Kaufmann Pub-lishers Inc San Francisco CA USA ICML rsquo99 pp 97ndash105

Fawcett T (2006) An introduction to ROC analysis Pattern Recognition Letters 27(8)861 ndash 874 DOIhttpsdoiorg101016jpatrec200510010

Finch H (2005) Comparison of distance measures in cluster analysis with dichotomous data Journal ofData Science 3(1)85ndash100

Fortunato S (2010) Community detection in graphs. Physics Reports 486(3-5):75–174. DOI https://doi.org/10.1016/j.physrep.2009.11.002

Junque de Fortuny E Martens D Provost F (2014a) Predictive modeling with big data is bigger reallybetter Big Data 1(4)215ndash226 DOI 101089big20130037

Junque de Fortuny E Stankova M Moeyersoms J Minnaert B Provost F Martens D (2014b) Corporateresidence fraud detection In Proceedings of the 20th ACM SIGKDD International Conference onKnowledge Discovery and Data Mining ACM New York NY USA KDD rsquo14 pp 1650ndash1659 DOI10114526233302623333

Frasca M Bertoni A Re M Valentini G (2013) A neural network algorithm for semi-supervised node labellearning from unbalanced data Neural Networks 4384 ndash 98 DOI httpsdoiorg101016jneunet201301021

Friedman M (1937) The use of ranks to avoid the assumption of normality implicit in the analysis ofvariance Journal of the american statistical association 32(200)675ndash701

García E, Lozano F (2007) Boosting support vector machines. In: Machine Learning and Data Mining in Pattern Recognition, 5th International Conference, MLDM 2007, Leipzig, Germany, July 18-20, Post Proceedings, IBaI publishing, pp 153–167

Goldstein M Uchida S (2016) A comparative evaluation of unsupervised anomaly detection algorithmsfor multivariate data PLOS ONE 11(4)1ndash31 DOI 101371journalpone0152173

González PC, Velásquez JD (2013) Characterization and detection of taxpayers with false invoices using data mining techniques. Expert Systems with Applications 40(5):1427–1436. DOI https://doi.org/10.1016/j.eswa.2012.08.051

Guimera R Sales-Pardo M Amaral LAN (2007) Module identification in bipartite and directed networksPhysical Review E 76036102 DOI 101103PhysRevE76036102

Guo H Viktor HL (2004) Learning from imbalanced data sets with boosting and data generation TheDataBoost-IM approach SIGKDD Explor Newsl 6(1)30ndash39 DOI 10114510077301007736

Guo X Yin Y Dong C Yang G Zhou G (2008) On the class imbalance problem In 2008 Fourth Interna-tional Conference on Natural Computation IEEE vol 4 pp 192ndash201 DOI 101109ICNC2008871

Han H Wang WY Mao BH (2005) Borderline-SMOTE A new over-sampling method in imbalanced datasets learning In Advances in Intelligent Computing Springer Berlin Heidelberg Berlin Heidelbergpp 878ndash887

He H Garcia EA (2009) Learning from imbalanced data IEEE Transactions on Knowledge and DataEngineering 21(9)1263ndash1284 DOI 101109TKDE2008239

He H Bai Y Garcia EA Li S (2008) ADASYN Adaptive synthetic sampling approach for imbalancedlearning In 2008 IEEE International Joint Conference on Neural Networks (IEEE World Congresson Computational Intelligence) IEEE pp 1322ndash1328 DOI 101109IJCNN20084633969

Holm S (1979) A simple sequentially rejective multiple test procedure Scandinavian journal of statistics6(2)65ndash70

Hsu CW Lin CJ (2002) A comparison of methods for multiclass support vector machines IEEE Transac-tions on Neural Networks 13(2)415ndash425 DOI 10110972991427

Huang A (2008) Similarity measures for text document clustering In Proceedings of the sixth new zealandcomputer science research student conference (NZCSRSC2008) Christchurch New Zealand pp 49ndash56

Iman RL Davenport JM (1980) Approximations of the critical region of the Friedman statistic Commu-nications in Statistics-Theory and Methods 9(6)571ndash595

Jo T Japkowicz N (2004) Class imbalances versus small disjuncts ACM SIGKDD Explor Newsl 6(1)40ndash49 DOI 10114510077301007737


Jutla IS, Jeub LG, Mucha PJ (2011-2016) A generalized Louvain method for community detection implemented in MATLAB. URL http://netwiki.amath.unc.edu/GenLouvain

Kubat M Matwin S (1997) Addressing the curse of imbalanced training sets One-sided selection InProceedings of the Fourteenth International Conference on Machine Learning Morgan KaufmannPublishers Inc San Francisco CA USA pp 179ndash186

Lancichinetti A Fortunato S (2009) Community detection algorithms A comparative analysis PhysicalReview E 80056117 DOI 101103PhysRevE80056117

Larremore DB Clauset A Jacobs AZ (2014) Efficiently inferring community structure in bipartitenetworks Physical Review E Statistical Nonlinear and Soft Matter Physics 90012805 DOI101103PhysRevE90012805

Li J Fine JP (2010) Weighted area under the receiver operating characteristic curve and its application togene selection Journal of the Royal Statistical Society Series C (Applied Statistics) 59(4)673ndash692DOI 101111j1467-9876201000713x

Li X Wang L Sung E (2008) AdaBoost with SVM-based component classifiers Engineering Applicationsof Artificial Intelligence 21(5)785ndash795 DOI httpsdoiorg101016jengappai200707001

Lichman M (2013) UCI machine learning repository. URL http://archive.ics.uci.edu/ml

Liu W, Chawla S, Cieslak DA, Chawla NV (2010) A robust decision tree algorithm for imbalanced data sets. In: Proceedings of the tenth SIAM international conference on data mining, SIAM, Philadelphia, vol 10, pp 766–777

Liu XY Wu J Zhou ZH (2009) Exploratory undersampling for class-imbalance learning IEEE Transac-tions on Systems Man and Cybernetics Part B (Cybernetics) 39(2)539ndash550 DOI 101109TSMCB20082007853

Luts J Ojeda F Van de Plas R De Moor B Van Huffel S Suykens JA (2010) A tutorial on supportvector machine-based methods for classification problems in chemometrics Analytica Chimica Acta665(2)129ndash145 DOI httpsdoiorg101016jaca201003030

Macskassy SA Provost F (2007) Classification in networked data A toolkit and a univariate case studyJournal of Machine Learning Research 8(May)935ndash983

Martens D Provost F (2014) Explaining data-driven document classifications MIS Quarterly 38(1)73ndash100

Martens D Provost F Clark J Junque de Fortuny E (2016) Mining massive fine-grained behavior data toimprove predictive analytics MIS Quarterly 40(4)869ndash888

Mazurowski MA Habas PA Zurada JM Lo JY Baker JA Tourassi GD (2008) Training neural networkclassifiers for medical decision making The effects of imbalanced datasets on classification perfor-mance Neural Networks 21(23)427 ndash 436 DOI httpsdoiorg101016jneunet200712031

Mease D Wyner AJ Buja A (2007) Boosted classification trees and class probabilityquantile estimationJournal of Machine Learning Research 8409ndash439

Nemenyi P (1963) Distribution-free Multiple Comparisons. Dissertation, Princeton University

Newman MEJ, Girvan M (2004) Finding and evaluating community structure in networks. Physical Review E 69:026113. DOI 10.1103/PhysRevE.69.026113

Ng AY (2004) Feature selection, L1 vs. L2 regularization, and rotational invariance. In: Proceedings of the Twenty-first International Conference on Machine Learning, ACM, New York, NY, USA, ICML '04, pp 78–. DOI 10.1145/1015330.1015435

Ng AY Jordan MI (2002) On discriminative vs generative classifiers A comparison of logistic regressionand naive bayes In Advances in Neural Information Processing Systems 14 MIT Press pp 841ndash848

Ngai E Hu Y Wong Y Chen Y Sun X (2011) The application of data mining techniques in financial frauddetection A classification framework and an academic review of literature Decision Support Systems50(3)559 ndash 569 DOI httpsdoiorg101016jdss201008006

Platt JC (1999) Probabilistic outputs for support vector machines and comparisons to regularized likeli-hood methods In Advances in Large-Margin Classifiers MIT Press pp 61ndash74

Porter MA Onnela JP Mucha PJ (2009) Communities in networks Notices of the American MathematicalSociety 56(9)1082ndash1097

Provost F Fawcett T (2013) Data Science for Business What you need to know about data mining anddata-analytic thinking OrsquoReilly Media Inc

Provost F Dalessandro B Hook R Zhang X Murray A (2009) Audience selection for on-line brandadvertising Privacy-friendly social network targeting In Proceedings of the 15th ACM SIGKDDInternational Conference on Knowledge Discovery and Data Mining ACM New York NY USAKDD rsquo09 pp 707ndash716 DOI 10114515570191557098


Raskutti B Kowalczyk A (2004) Extreme re-balancing for SVMs A case study SIGKDD Explor Newsl6(1)60ndash69 DOI 10114510077301007739

Rosvall M Bergstrom CT (2008) Maps of random walks on complex networks reveal community structureProceedings of the National Academy of Sciences 105(4)1118ndash1123 DOI 101073pnas0706851105

Schapire RE (1999) A brief introduction to boosting In Proceedings of the 16th International Joint Con-ference on Artificial Intelligence - Volume 2 Morgan Kaufmann Publishers Inc San Francisco CAUSA IJCAIrsquo99 pp 1401ndash1406

Schapire RE Singer Y (1999) Improved boosting algorithms using confidence-rated predictions Machinelearning 37(3)297ndash336 DOI 101023A1007614523901

Shmueli G (2017) Analyzing behavioral big data Methodological practical ethical and moral issuesQuality Engineering 29(1)57ndash74 DOI 1010800898211220161210979

Sobhani P Viktor H Matwin S (2015) Learning from imbalanced data using ensemble methods andcluster-based undersampling In New Frontiers in Mining Complex Patterns Third InternationalWorkshop NFMCP 2014 Held in Conjunction with ECML-PKDD 2014 Nancy France Septem-ber 19 2014 Revised Selected Papers Springer International Publishing Cham pp 69ndash83 DOI101007978-3-319-17876-9 5

Stankova M (2016) Classification within network data with a bipartite structure Dissertation Universityof Antwerp

Stankova M Martens D Provost F (2015) Classification over bipartite graphs through projection WorkingPapers 2015001 University of Antwerp Faculty of Applied Economics

Sun Y Kamel MS Wong AK Wang Y (2007) Cost-sensitive boosting for classification of imbalanceddata Pattern Recognition 40(12)3358 ndash 3378 DOI httpsdoiorg101016jpatcog200704009

Suykens JA Van Gestel T De Brabanter J De Moor B Vandewalle J Suykens J Van Gestel T (2002)Least squares support vector machines World Scientific Singapore

Tang B He H (2015) Enn Extended nearest neighbor method for pattern recognition [research frontier]IEEE Computational Intelligence Magazine 10(3)52ndash60 DOI 101109MCI20152437512

Tang Y Zhang YQ Chawla NV Krasser S (2009) SVMs modeling for highly imbalanced classificationIEEE Transactions on Systems Man and Cybernetics Part B (Cybernetics) 39(1)281ndash288 DOI101109TSMCB20082002909

Tobback E Moeyersoms J Stankova M Martens D (2016) Bankruptcy prediction for SMEs using rela-tional data Working Paper 2016004 University of Antwerp Faculty of Applied Economics

Verbeke W Dejaeger K Martens D Hur J Baesens B (2012) New insights into churn prediction in thetelecommunication sector A profit driven data mining approach European Journal of OperationalResearch 218(1)211 ndash 229 DOI httpsdoiorg101016jejor201109031

Veropoulos K Campbell I Cristianini N (1999) Controlling the sensitivity of support vector machinesIn Proceedings of the International Joint Conference on Artificial Intelligence Stockholm Sweden(IJCAI99) pp 55 ndash 60

Whitrow C Hand DJ Juszczak P Weston D Adams NM (2009) Transaction aggregation as a strategyfor credit card fraud detection Data Mining and Knowledge Discovery 18(1)30ndash55 DOI 101007s10618-008-0116-z

Wickramaratna J Holden SB Buxton BF (2001) Performance degradation in boosting In Proceedings ofthe Second International Workshop on Multiple Classifier Systems Springer London UK MCS rsquo01pp 11ndash21

Yen SJ Lee YS (2009) Cluster-based under-sampling approaches for imbalanced data distributions ExpertSystems with Applications 36(3 Part 1)5718 ndash 5727 DOI httpsdoiorg101016jeswa200806108

Yu HF Lo HY Hsieh HP Lou JK McKenzie TG Chou JW Chung PH Ho CH Chang CF Wei YH et al(2010) Feature engineering and classifier ensemble for kdd cup 2010 In Proceedings of the KDDCup 2010 Workshop pp 1ndash16

Zha H He X Ding C Simon H Gu M (2001) Bipartite graph partitioning and data clustering In Pro-ceedings of the Tenth International Conference on Information and Knowledge Management ACMNew York NY USA CIKM rsquo01 pp 25ndash32 DOI 101145502585502591

Zhang J Mani I (2003) Knn approach to unbalanced data distributions A case study involving informationextraction In Proceedings of the ICMLrsquo2003 Workshop on Learning from Imbalanced DatasetsWashington DC

Ziegler CN McNee SM Konstan JA Lausen G (2005) Improving recommendation lists through topicdiversification In Proceedings of the 14th International Conference on World Wide Web ACM NewYork NY USA WWW rsquo05 pp 22ndash32 DOI 10114510607451060754


indicated in Algorithm 3. In the case where r_AB = 1, we output the SVM scores instead of the LR binary values. When r_AB ≤ 0, we quit the boosting process and output the LR scores.

Algorithm 3: AdaBoost with a SVM-LR combination as a base learner

Input: (X, Y) = {(x_1, y_1), ..., (x_m, y_m)}, C, T, μ
Initialize the distribution: D_1(i) = 1/m
for t = 1 to T do
  - train the weak learner using distribution D_t. The weak learner consists of a weighted linear SVM and a LR model trained with weight-percentage μ of D_t:
      h_t ← Train_WeakLearner(X, Y, D_t, C, μ)
  - compute the weighted confidence r_AB on the training data:
      r_AB ← Σ_{i=1}^{m} D_t(i) y_i h_t(x_i)
    If (r_AB = 1 or r_AB ≤ 0) then α_t ← 0 and stop the boosting process
  - choose α_t ∈ ℝ:
      α_t ← (1/2) log((1 + r_AB) / (1 − r_AB))
  - update the distribution:
      D_{t+1}(i) ← D_t(i) exp(−α_t y_i h_t(x_i)) / Z_t
    where Z_t is a normalization factor (chosen so that D_{t+1} will be a distribution)
end for
Output the final hypothesis (the output score is the term contained in the sign function):
    H(x) = sign( Σ_{t=1}^{T} α̃_t h_t(x) )  with  α̃_t = α_t / Σ_{i=1}^{T} α_i
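As an illustration, the confidence-rated boosting loop of Algorithm 3 can be sketched as follows. This is a simplified stand-in, not the paper's exact learner: a plain logistic regression replaces the weighted SVM-LR combination, its probabilities are mapped to [-1, 1] to serve as confidence-rated scores, and the function names are our own.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

def boost(X, y, T=30, Cs=1e-3):
    """Confidence-rated AdaBoost loop (sketch of Algorithm 3).
    y must be in {-1, +1}; weak-learner scores lie in [-1, 1]."""
    m = len(y)
    D = np.full(m, 1.0 / m)                    # D_1(i) = 1/m
    learners, alphas = [], []
    for t in range(T):
        clf = LogisticRegression(C=Cs).fit(X, y, sample_weight=D * m)
        h = 2.0 * clf.predict_proba(X)[:, 1] - 1.0   # map [0, 1] -> [-1, 1]
        r = np.sum(D * y * h)                         # weighted confidence r_AB
        if r >= 1.0 or r <= 0.0:                      # stop conditions of Algorithm 3
            break
        alpha = 0.5 * np.log((1.0 + r) / (1.0 - r))
        D = D * np.exp(-alpha * y * h)
        D /= D.sum()                                  # normalisation factor Z_t
        learners.append(clf)
        alphas.append(alpha)
    a = np.array(alphas)
    a /= a.sum()                                      # normalised alpha-tilde_t
    return learners, a

def score(learners, a, X):
    """Ensemble score; sign(score) is the predicted label H(x)."""
    return sum(w * (2.0 * c.predict_proba(X)[:, 1] - 1.0)
               for c, w in zip(learners, a))
```

With strictly confidence-rated scores in (-1, 1), r never reaches 1 exactly, so the early stop mainly guards against a weak learner that is no better than random (r ≤ 0).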

3.3.2 AdaCost

The AdaCost algorithm (Fan et al 1999) is a variant of cost-sensitive learning where misclassification costs are introduced in the weight-update formula of AdaBoost. The cost-sensitive update rule increases the weights of costly misclassified instances more aggressively and decreases the weights of costly correct classifications more conservatively. Each instance is given a misclassification cost c_i, where we chose to put c_i = 1 for positive (minority) instances and c_i = 1/R for negative (majority) instances. R is a user-defined value that allows one to put more emphasis on the minority class. The implementation of AdaCost is similar to Algorithm 3, yet there are a few differences (Fan et al 1999): firstly, the initial distribution is chosen as D_1(i) = c_i / Σ_{j=1}^{m} c_j; secondly, the weight-update rule is given by D_{t+1}(i) = D_t(i) exp(−α_t y_i h_t(x_i) β(i)) / Z_t, where β(i) = −0.5 sign(y_i h_t(x_i)) c_i + 0.5 is a cost-adjustment function; finally, the choice of α_t is given by α_t = (1/2) log((1 + r_AC) / (1 − r_AC)), where r_AC = Σ_{i=1}^{m} D_t(i) y_i h_t(x_i) β(i). Note that the checks to stop the boosting process prematurely (see the second bullet in Algorithm 3) are still based on the r-value obtained from AdaBoost (r_AB). This is because β ∈ [0, 1].
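The cost-sensitive weight update can be sketched in a few lines (a hedged illustration: `adacost_update` is our own helper name, weak-learner scores h are assumed to lie in [-1, 1], and y in {-1, +1}):

```python
import numpy as np

def adacost_update(D, y, h, c):
    """One AdaCost distribution update (after Fan et al. 1999).
    D: current distribution; y: labels in {-1, +1}; h: weak-learner
    scores in [-1, 1]; c: per-instance misclassification costs
    (here 1 for minority, 1/R for majority instances)."""
    # Cost-adjustment function beta(i) in [0, 1]: large for costly
    # mistakes (sign(y*h) < 0), small for costly correct predictions.
    beta = -0.5 * np.sign(y * h) * c + 0.5
    r_ac = np.sum(D * y * h * beta)                  # weighted confidence r_AC
    alpha = 0.5 * np.log((1.0 + r_ac) / (1.0 - r_ac))
    D_new = D * np.exp(-alpha * y * h * beta)
    return D_new / D_new.sum(), alpha                # renormalise (Z_t)
```

Note that for a correctly classified minority instance with c_i = 1, β(i) = 0, so its weight stays untouched, while correctly classified majority instances (β(i) = 0.5 − 0.5/R) are down-weighted.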

In the first boosting round of AdaCost, the weighted SVM formulation (3) is equivalent to solving a SVM formulation with the following goal function:

    min_{w, b, ξ_i}  (w^T w)/2 + C+ Σ_{i|y_i=1} ξ_i + C− Σ_{i|y_i=−1} ξ_i    (4)

where C+/C− = R. This can be seen as a cost-sensitive version of a SVM, and this idea was initially proposed by Veropoulos et al (1999).
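In practice, formulation (4) amounts to per-class penalty weights, which off-the-shelf linear SVM implementations expose directly. A minimal sketch with scikit-learn's LinearSVC follows; the toy data and the cost ratio R = 10 are illustrative, not values from the paper:

```python
import numpy as np
from sklearn.svm import LinearSVC

# Toy imbalanced data: 40 majority (-1) and 4 minority (+1) instances.
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(-1, 1, size=(40, 2)),
               rng.normal(+1, 1, size=(4, 2))])
y = np.array([-1] * 40 + [+1] * 4)

# class_weight scales the per-class penalty: C+ = R * C, C- = C,
# i.e. C+/C- = R as in formulation (4).
R = 10
clf = LinearSVC(C=1e-1, class_weight={+1: R, -1: 1})
clf.fit(X, y)
```

With R = R_L (the class ratio), the total penalty mass on both classes balances, which is the popular choice mentioned in Section 4.2.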

3.3.3 EasyEnsemble

One of the disadvantages of the random undersampling method is the fact that we are discarding potentially valuable information. EasyEnsemble (Liu et al 2009) is a method that combines several balanced subsets (S in total), containing randomly selected majority class instances together with all minority examples. Each subset contains the same amount of instances from both classes and is fed to the boosting algorithm presented in Algorithm 3. Afterwards, the weak learners h_{s,t} of each subset s are simply combined to form the final ensemble:

    H(x) = sign( Σ_{s=1}^{S} Σ_{t=1}^{T} α̃_{s,t} h_{s,t}(x) )  with s = 1, ..., S; t = 1, ..., T    (5)

It is clear that this technique benefits from a combination of bagging and boosting (Liu et al 2009). Note that we apply the same boosting algorithm as previously described to each of the balanced subsets. However, when r_AB = 1 in the first round of boosting, we quit the boosting process, put α_1 = 1 and continue to use the trained LR-model in the final ensemble. It was previously noted that this can cause overfitting, in the sense that it can pinpoint a wrong threshold. However, the LR-model still contains information, and the combination thereof with the models obtained from the other subsets remains valuable (as we have noted from initial experiments, by comparing the situation where we include or reject those subsets).

There are a few subtle though important differences with respect to the experiments performed by Liu et al (2009). First of all, the authors use a CART-algorithm (Breiman et al 1984) as base learner. We employ a linear SVM with subsequent LR as a weak learner. To our knowledge, this combination has not been proposed elsewhere, yet proves to be very efficient in this setting. Secondly, their methodology employs a discrete version of AdaBoost, where each weak learner outputs binary values in {−1, 1}. As already stated, we make use of an improved version of AdaBoost that relies on confidence-rated predictions (Schapire and Singer 1999). Finally, the authors reported performances with fixed levels of S = 4 and T = 10. We will investigate the effect of varying S and T levels.
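The subset-sampling shell of EasyEnsemble can be sketched as follows. This is a minimal illustration under our own naming: a single logistic regression stands in for the boosted SVM-LR learner that each subset is fed to in the actual method.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

def easy_ensemble_fit(X, y, S=10, seed=0):
    """EasyEnsemble sketch (after Liu et al. 2009): draw S balanced
    subsets, each holding all |X_min| minority instances plus an equal
    number of randomly chosen majority instances, and fit one model per
    subset. y must be in {-1, +1} with +1 the minority class."""
    rng = np.random.default_rng(seed)
    min_idx = np.flatnonzero(y == 1)
    maj_idx = np.flatnonzero(y == -1)
    models = []
    for _ in range(S):
        sub = rng.choice(maj_idx, size=len(min_idx), replace=False)
        idx = np.concatenate([min_idx, sub])
        models.append(LogisticRegression().fit(X[idx], y[idx]))
    return models

def easy_ensemble_score(models, X):
    """Average the per-model scores over all subsets, cf. equation (5);
    sign(score) is the predicted label."""
    return np.mean([2.0 * m.predict_proba(X)[:, 1] - 1.0 for m in models],
                   axis=0)
```

Each subset is only twice the minority class size and the S fits are independent, which is why the method parallelizes so well (the "EE par" variant in the final comparison).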


4 Results and discussion

4.1 Datasets

Stankova et al (2015) provide the first large collection of benchmark behaviour datasets for classification. In our experiments, we make use of these data sources and extend this repository with two additional datasets. Each of these datasets shows a bipartite structure with a clear target variable to predict. We refer to this study and the next paragraph for a short description of the available data resources. In this paragraph, we indicate why we have chosen to include or reject certain data sources from the aforementioned study. The available datasets can be divided into small, medium and large datasets, based on the number of instances and the number of features present. The Norwegian companies and Reality Mining datasets comprise the small datasets, since they contain only a few hundreds of instances or features and are therefore regarded as impractical for our purposes. The MovieLens, Yahoo, TaFeng, Book-Crossing and LibimSeTi datasets belong to the medium-sized datasets, each containing a few thousands up to a few hundreds of thousands of instances and features. All of these datasets are included in our study. The large datasets, containing hundreds of thousands up to millions of instances and features, are the Flickr and Kdd databases12. Other large proprietary data sources, not included in Stankova et al (2015), are the corporate residence fraud (CRF) and banking (Bank) datasets, which arise from real-life application domains with intrinsic imbalance.

To summarize, we have gathered datasets containing such fine-grained behaviour data from a wide variety of application domains. The MovieLens datasets, for which we are predicting the gender13 (Mov G) or the genre thriller14 (Mov Th), provide data on which films each user has rated. The Yahoo movies15 dataset has a similar structure, where the age of each user, Yahoo A (above or below average), or the gender, Yahoo G, is being predicted. The TaFeng dataset16 contains data on shopping behaviour, where age (below or above average) is being predicted based upon which products are purchased. In the Book-Crossing (Book) dataset (Ziegler et al 2005), users rate books and the age of the user (above or below average) is being predicted. LibimSeTi (LST) contains data from a dating site (Brozovsky and Petricek 2007), where users rate each other's profiles and gender is being predicted. In the advertisement (Adver) dataset (Lichman 2013), we try to predict if a URL is an advertisement, based on a large variety of binary features of the URL. Note that this dataset does not arise from the behaviour of entities, yet it still has a high-dimensional and sparse representation. The Flickr dataset (Cha et al 2009) contains pictures being marked by users as favorite, and we predict the number of comments on each picture (below

12 Flickr and KDD will be excluded in the comparative study of Section 4.6. This is because some methods are too computationally intensive (especially in combination with the large number of possible parameter combinations) to be applied on these very large data sources. Furthermore, our statistical evidence is already sufficiently strong to conclude significance without these datasets. Having said this, these data sources will be included in the analysis of Section 5.

13 MovieLens 1M dataset from http://grouplens.org/datasets/movielens
14 MovieLens 10M dataset from http://grouplens.org/datasets/movielens
15 https://webscope.sandbox.yahoo.com
16 http://www.bigdatalab.ac.cn/benchmark/bm/dd?data=Ta-Feng


or above average). In the Kdd cup data, the performance of a student on a test is being predicted, based on artificially created binary features (Yu et al 2010). The corporate residence fraud dataset (CRF) (Junque de Fortuny et al 2014b) contains data on foreign companies making transactions with specific Belgian companies, where we try to predict whether the foreign company commits residence fraud (a type of fiscal fraud). Finally, the banking dataset (Bank) (Martens et al 2016) contains detailed behaviour on consumers making transactions with merchants or other persons, to predict interest in a pension fund product. Some characteristics of these datasets can be found in Table 2. The features column only shows the number of active features17.

Table 2 Behaviour data characteristics. The final column shows the imbalance ratio p, defined as the ratio of the number of minority class instances to the number of majority class instances in the training set, expressed as a percentage. See Section 4.2 for details regarding p.

Name      |Xmaj|    |Xmin|    Features   p = 100 x |Xmin|_train / |Xmaj|_train
Mov G     4331      1709      3706       1 & 25
Mov Th    10546     131       69878      1.24 (p = [])
Yahoo A   6030      1612      11915      1 & 25
Yahoo G   5436      2206      11915      1 & 25
TaFeng    17330     14310     23719      1 & 25
Book      42900     18858     282973     1 & 25
LST       59702     60145     166353     1
Adver     2792      457       1555       16.38 (p = []) & 1
CRF       869071    62        108753     0.0072 (p = [])
Bank      1193619   11107     3139570    0.93 (p = [])
Flickr    8166814   3028330   497472     0.1
Kdd       7171885   1235867   19306083   0.5

4.2 Methodology

Regarding the experiments performed in the upcoming sections, we applied a tenfold cross-validation procedure. Each of the folds contains 80% training data, 10% validation data and 10% test data. Note that these percentages are valid for both the majority class and the minority class (stratified sampling). As can be seen from Table 2, some datasets are balanced in nature. We created artificial imbalance for these datasets by removing minority class instances from the initial training set, according to a user-defined parameter p. We ensured that the amount of minority training instances corresponds to p percent of the majority class training size: |Xmin|_train = (p/100) |Xmaj|_train. As an example, say that we are using the Book dataset with p = 25. In that case, we know that the majority class contains 34320 training instances (80% of |Xmaj| = 42900). The minority training data would contain 8580 instances (25% of 34320). When the dataset is already imbalanced, we define p = [], which means that no downsampling

17 Active features represent features that are present for at least one instance in the dataset. A non-active feature corresponds with a column of zeros in the matrix representation and would not contribute to the model.


of the minority class training data is performed. Note that the validation and test data are left untouched18.
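The subsampling arithmetic above can be checked with a small helper (the function name is ours; it simply applies the 80% training split followed by the p-ratio):

```python
def minority_train_size(n_maj, p, train_frac=0.8):
    """Return (|Xmaj|_train, |Xmin|_train) for imbalance ratio p (in %):
    |Xmin|_train = (p/100) * |Xmaj|_train, with |Xmaj|_train the 80%
    training portion of the majority class."""
    n_maj_train = int(train_frac * n_maj)
    return n_maj_train, int(p / 100.0 * n_maj_train)

# Book dataset with p = 25: 80% of 42900 majority instances gives 34320
# training instances, and 25% of that gives 8580 minority instances.
```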

The methods detailed in Section 3 are applied on the training data. For both the under- and oversampling approaches, a linear SVM is trained on the newly created balanced training data, with regularization parameter C having values

C = [10^-7, 10^-5, 10^-3, 10^-1, 10^0]

The validation data are used for parameter tuning purposes. The test data allow us to obtain the generalization performance. The results reported show the average over ten folds. In the remaining paragraphs of this section, we will describe the various parameter settings used in our experiments and give a brief overview of the parameters occurring in each method.

Considering the oversampling techniques, the parameter settings are as follows:

β = [0, 1/3, 2/3, 1]
prior_opt = {Flip Coin, Reverse Priors}
sim_measure = {Cosine, Jaccard}
K = [10^0, 10^1, 10^2, |Xmin|_train]

We didn't include the "Prior" option, due to initial experiments showing a lower performance in comparison with the other options. This can be explained by the low priors occurring in each column, resulting in synthetic samples that mainly show zeros in 0-1 match situations. The oversampling with replacement method (OSR) only uses the β parameter, SMOTE uses all but the K parameter, and ADASYN uses all of the parameters listed above.

With respect to the undersampling techniques, the following parameter settings are used:

β_u = [0, 1/4, 1/2, 3/4, 1]
sim_measure = {Cosine, Jaccard}
K = [10^0, 10^1, 10^2, |Xmin|_train]
Clust_opt = {C Smallest, C Largest}

The random undersampling (RUS) technique only uses the β_u parameter. The second set of methods, "Closest tot sim" and "Farthest tot sim", uses β_u and sim_measure. The third set of techniques, "Closest Knn" and "Farthest Knn", makes use of all the parameters listed above except for Clust_opt. The final approach, CBU, employs β_u and Clust_opt.

The boosting variants (AdaBoost, AdaCost and EasyEnsemble) presented in Section 3.3 make use of the following settings:

T = 30
μ = [100%, 75%]
C = [10^-7, 10^-5, 10^-3, 10^-1]

18 This means that, if we start from a balanced set, only the training data will show artificial imbalance according to the imbalance ratio p. The validation and test data would remain balanced. Since AUC (and some other metrics) is independent of class skew, it would be unwise to make these sets imbalanced as well, because that would lead to discarding minority class instances that are relevant for performance assessment.


R = [2, 8, RL] where RL = |Xmaj|_train / |Xmin|_train
S = [5, 10, 15]

The AdaBoost algorithm includes the T, μ and C parameters. AdaCost additionally uses cost-ratios R. We have chosen a range of values, because misclassification costs are unknown for many business applications (He and Garcia 2009; Fan et al 1999; Sun et al 2007). The final value RL seems to be a popular choice (Akbani et al 2004; Luts et al 2010), because the total weight on the majority class balances with the total weight on the minority class. The final method, EasyEnsemble, uses S subsets in addition to the parameters previously mentioned for AdaBoost. Note that we consider the boosting iteration round t ∈ [1, T] as a tunable parameter19.

4.3 Oversampling

The oversampling techniques presented in Section 3.1 are applied to each of the datasets from Table 2 with varying imbalance levels (by varying the p-parameter). The experiments we conducted use the following methodology: for each dataset, we apply the oversampling techniques with all possible parameter combinations on the training data to create newly balanced datasets, after which linear SVMs are trained. The optimal parameter combination with respect to each imbalance ratio β is selected based on validation set AUC-performance. With these parameters, the AUC on the test data is obtained. Results show the average over ten folds.
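The sampling step itself can be sketched in a few lines, again assuming (our reading) that β interpolates linearly between the original minority count (β = 0) and a fully balanced set (β = 1):

```python
import random

def oversample_with_replacement(min_idx, maj_idx, beta, seed=0):
    """OSR sketch: duplicate randomly drawn minority instances until the
    balance level beta is reached (beta = 0: unchanged, beta = 1: balanced)."""
    n_min, n_maj = len(min_idx), len(maj_idx)
    n_target = round(n_min + beta * (n_maj - n_min))
    rng = random.Random(seed)
    extra = [rng.choice(min_idx) for _ in range(n_target - n_min)]
    return min_idx + extra + maj_idx

# 50 minority vs. 1000 majority instances; beta = 1 duplicates the minority
# class (with replacement) up to the majority class size.
train_idx = oversample_with_replacement(list(range(50)), list(range(50, 1050)), beta=1.0)
```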

The results on four arbitrarily selected datasets are shown in Table 3. Full results on each of the data sources can be found in Appendix A, Table 11. From these tables we can conclude that performance generally increases with growing β-values. It is more correct to say that performance keeps improving with growing β-levels until an optimal point β* is reached; increasing the balance level after this optimal value will cause only small fluctuations with respect to the optimal performance. Traditional studies dealing with dense, low-dimensional data note that the OSR method can suffer from overfitting, as already mentioned in Section 3.1. It is interesting to see that in this sparse, high-dimensional setting this effect doesn't seem to occur.20 Furthermore, the computationally expensive synthetic sampling approaches do not seem to improve performance over the simpler OSR method. This can be explained by the fact that many instances have a very limited number of non-zero elements (Junque de Fortuny et al 2014a; Stankova et al 2015), causing the synthetic sampling procedures to be limited in the number of unique new samples they can produce. Note also that synthetic sampling procedures have historically been designed to overcome the overfitting behaviour of OSR, which no longer seems to apply here.
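The limited variety of synthetic samples is easy to see on toy vectors: a SMOTE-style interpolation between two sparse 0/1 rows can only be non-zero where one of the two rows is active. The rows below are invented for illustration:

```python
import random

def smote_interpolate(x, neighbor, rng):
    """SMOTE-style synthetic sample: x + gap * (neighbor - x), gap ~ U(0, 1)."""
    gap = rng.random()
    return [xi + gap * (ni - xi) for xi, ni in zip(x, neighbor)]

rng = random.Random(0)
x        = [1, 0, 0, 1, 0, 0, 0, 0]    # two sparse 0/1 minority rows
neighbor = [1, 0, 0, 0, 1, 0, 0, 0]
synthetic = smote_interpolate(x, neighbor, rng)

# Positions where both rows are zero stay zero in every synthetic sample,
# so with few active features there are few genuinely distinct samples.
nonzero_positions = [i for i, v in enumerate(synthetic) if v != 0]
```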

19 We sometimes make use of T as a symbol to indicate the boosting iteration number instead of t.

20 The SVM regularization parameter C controls for overfitting. Additionally, the influence of majority class noise/outliers (see Section 3.2) on the learned hyperplane decreases as we oversample the minority class. This hyperplane is more sensitive towards minority instances, and a change in its direction/orientation towards these majority class noise/outliers will be more heavily penalized.


Table 3 Oversampling experiments. Results show the average tenfold AUC test set performance with respect to increasing β-values (β = [β1, β2, β3, β4] = [0, 1/3, 2/3, 1]). Additionally, standard deviations are shown between brackets. Optimal performances for each method are highlighted in boldface. See Section 4.2 for parameter settings.

Mov Th (p = [])
          β1             β2             β3             β4
OSR       79.77 (5.33)   85.3 (4.66)    83.16 (4.5)    84.59 (5.69)
SMOTE     79.77 (5.33)   84.18 (6.51)   85.58 (5.97)   84.33 (5.75)
ADASYN    79.77 (5.33)   84.11 (6.77)   85.86 (6.06)   85.36 (5.13)

Yahoo G (p = 25)
          β1             β2             β3             β4
OSR       78.82 (1.39)   78.78 (2.02)   78.74 (1.86)   78.35 (1.47)
SMOTE     78.82 (1.39)   79.23 (1.57)   79.1 (1.2)     79.03 (1.89)
ADASYN    78.82 (1.39)   79.12 (1.43)   79.23 (1.37)   79.51 (2.01)

TaFeng (p = 25)
          β1             β2             β3             β4
OSR       66.94 (1.34)   68.84 (1.21)   66.99 (1.42)   67.7 (1.41)
SMOTE     66.94 (1.34)   68.47 (1.5)    67.07 (1.15)   66.65 (0.81)
ADASYN    66.94 (1.34)   68.62 (1.38)   67.85 (1.6)    66.91 (1.39)

Book (p = 25)
          β1             β2             β3             β4
OSR       60.08 (0.71)   61.05 (0.94)   61.82 (1.12)   64.62 (0.57)
SMOTE     60.08 (0.71)   62.6 (0.73)    60.95 (0.68)   63 (0.8)
ADASYN    60.08 (0.71)   62.33 (0.73)   60.77 (0.85)   63.04 (0.58)

4.4 Undersampling

Regarding the undersampling techniques, we employ a similar experimental set-up as in the previous section: firstly, train SVMs on the undersampled training data with all possible parameter combinations; secondly, choose a suitable parameter combination based on validation set AUC-performance; and finally, obtain the AUC-performance on the test set. The results are averaged across ten folds.

Before going into detail on the experimental findings, we give a short note on the effect of noise/outliers on SVM performance. Consider the following imaginary example: say that a majority class female is rating a lot of action films and thrillers. We can consider this an outlier, since most of the females are rating romantic or drama films. This specific female has far more in common with the minority class males, who also rate action films and thrillers. Outliers are therefore instances showing odd behaviour; noise are wrongfully labelled instances. The effect of noise/outliers on SVM performance can be severe. Indeed, many of the instances in the dataset have a support value (dual variable α_i) of 0. Instances contributing to the model (a non-zero support value) are examples close to the borderline (Suykens et al 2002) and noise/outliers (as can be derived from the KKT dual-complementarity conditions21).

21 If α_i = 0, then y_i(w^T x_i + b) ≥ 1. For noise/outliers the term y_i(w^T x_i + b) is negative, hence α_i ≠ 0.


With respect to noise/outliers, the support value can reach a maximal value of C, the chosen regularization parameter in the SVM formulation. It should be clear that lowering the C-value leads to models that are less sensitive to noise/outliers.

The experimental outcomes of the various undersampling processes are shown in Table 4 for the same four previously selected datasets. We refer to Appendix B, Table 12 for results on the entire data repository. When removing only a limited amount of majority class instances (βu = βu2 = 1/4), we observed that in 12 out of 16 datasets the "Farthest Knn" method outperforms the "Closest Knn" technique; the four remaining datasets show comparable performances.22 This finding shows that the "Farthest" method is very suitable for removing majority class noise/outliers, and empirically shows their performance degrading effect. With higher undersampling rates (βu = 1), the "Closest Knn" method generally achieves higher performances than the "Farthest Knn" technique, though the results are less clear. We observed that only 9 out of 16 datasets indicate "Closest" as a clear winner, with 3 tie situations and 4 situations where "Farthest" shows the best results. Intuitively, we would expect the "Closest" method to perform better, since we are keeping the most informative instances (examples close to the minority class). However, this technique will also emphasize noise/outliers, and this is the main reason why the results are less significant. One can also see this from the fact that the RUS method shows far better results in comparison to the aforementioned techniques when βu = 1, an observation that is valid for all datasets. The RUS method is efficient in realizing noise/outlier reduction effects, since each SVM is trained with only a fraction of these types of instances. It is remarkable that in 9 out of 16 datasets the RUS method with βu = 1 outperforms the baseline model (βu = 0); in 4 datasets we observed equal performance, and 3 losses with respect to the baseline. RUS shows two attractive features: majority class noise/outlier removal and its ability to put more emphasis on the minority class instances by undersampling the majority class. A general downside of the approach is the information loss, and this effect causes the 3 losses to the baseline. Usually, though, the attractive features of RUS dominate the latter effect, and this indicates that it is definitely not necessary to include all majority class training instances in constructing the predictive model. Apparently there is a high level of redundancy, which can be exploited to construct efficient hypotheses.

In the next paragraph we compare the cluster-based undersampling (CBU) approach to the random undersampling (RUS) technique. Before proceeding, we should note that Table 12 does not show results on the LST, CRF and Bank datasets, due to the fact that the projected unigraph was too large to fit in memory (larger than 16 GB). A limited number of features (top nodes) does seem to be active for a relatively large number of instances (bottom nodes). All these instances will be connected in the unigraph projection, resulting in relatively high degrees for the bottom nodes. A possible solution to circumvent this problem might be to cut the weakest edges in the projected unigraph.
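The memory blow-up of the unigraph projection, and the suggested fix of cutting weak edges, can be illustrated on a toy bipartite dataset (instances and features are invented):

```python
from collections import Counter
from itertools import combinations

# Toy bipartite behaviour data: instance (bottom node) -> active features (top nodes).
rows = {
    "u1": {"f1", "f2"},
    "u2": {"f1", "f2", "f3"},
    "u3": {"f1"},
    "u4": {"f2", "f3"},
}

# Unigraph projection: connect two instances with weight = number of shared
# features. A single feature active in n instances alone generates n*(n-1)/2
# edges, which is what exhausts memory for large behaviour data.
edges = Counter()
for a, b in combinations(sorted(rows), 2):
    shared = len(rows[a] & rows[b])
    if shared:
        edges[(a, b)] = shared

# The fix suggested above: cut the weakest edges before clustering.
strong_edges = {e: w for e, w in edges.items() if w >= 2}
```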

If we apply a limited amount of undersampling (βu = βu2 = 1/4), the CBU technique outperforms RUS in 8 out of 13 datasets (4 losses and 1 tie with respect to RUS). On the highly imbalanced datasets (p = 1 or p = []), the CBU method wins

22 A tie occurs in the situation where the absolute difference in AUC is smaller than or equal to 0.5.


in 8 out of 8 cases. We can therefore conclude that CBU outperforms RUS at low undersampling rates. This indeed shows that handling the within-class imbalance can be beneficial in this environment. If we make a comparison at the highest undersampling rate, βu = βu5 = 1, the results are completely opposite: CBU wins in only 1 case, with 9 losses with respect to RUS and 3 ties. On the highly imbalanced datasets, CBU outperforms RUS for 1 dataset (with 4 losses and 3 ties). It is clear that RUS is the preferred technique in case of high undersampling rates. The reason is that CBU emphasizes the small disjuncts and basically fails to adequately pick up on the common, larger communities. Furthermore, these small disjuncts might actually correspond with majority class noise/outliers, and CBU is focusing on these types of instances. On the overall level, where we consider all undersampling rates (except βu = 0), CBU wins in 5 out of 13 datasets, RUS wins in 4 cases and there are 4 tie situations (yet RUS did outperform CBU in all 4 cases). Both methods seem to be competitive with one another across a wide range of imbalance levels. If we pay particular attention to the highly imbalanced sets, CBU wins in 5 cases and there are 3 tie situations on the remaining sets (yet RUS did outperform CBU in all 3 cases). CBU seems to be the method of choice (with respect to RUS) when the datasets show high imbalance levels.
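The intuition that even coverage of majority clusters emphasizes small disjuncts can be illustrated with a toy sketch. This is not the exact CBU algorithm used in the experiments, just a hypothetical round-robin caricature of it:

```python
import random

def cluster_based_undersample(clusters, n_keep, seed=0):
    """Toy CBU-style sketch (one possible reading, not the authors' algorithm):
    draw majority instances round-robin across clusters, so that small
    disjuncts stay represented in the undersampled set."""
    rng = random.Random(seed)
    pools = [rng.sample(c, len(c)) for c in clusters]   # shuffled copies
    kept, i = [], 0
    while len(kept) < n_keep:
        pool = pools[i % len(pools)]
        if pool:
            kept.append(pool.pop())
        i += 1
    return kept

# One large majority community (90 instances) and two small disjuncts.
maj_clusters = [list(range(0, 90)), list(range(90, 96)), list(range(96, 100))]
kept = cluster_based_undersample(maj_clusters, n_keep=9)
small_disjunct_share = sum(1 for x in kept if x >= 90)
```

With 9 instances kept, the two small disjuncts contribute 6 of them: disproportionately many relative to their size, which mirrors the emphasis on small disjuncts discussed above.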

Table 4 Undersampling experiments. Results show the average tenfold AUC test set performance with respect to increasing βu-values (βu = [βu1, βu2, βu3, βu4, βu5] = [0, 1/4, 1/2, 3/4, 1]). Optimal performances (boldface) are highlighted for each imbalance ratio. Additionally, standard deviations are shown between brackets. Cl K represents the "Closest Knn" technique and Cl T the "Closest tot sim" technique (similarly for Far K and Far T, see Section 3.2). Parameter settings are given in Section 4.2.

Mov Th (p = [])
        βu1           βu2           βu3           βu4           βu5
RUS     79.77 (5.3)   80.32 (5.8)   81.57 (5.5)   81.86 (6.6)   81.26 (6.2)
Cl K    79.77 (5.3)   79.25 (4.5)   78.07 (5)     76.25 (6.5)   62.46 (8.5)
Cl T    79.77 (5.3)   78.4 (4.4)    72.41 (3.5)   64.66 (4.5)   60.37 (7.3)
Far K   79.77 (5.3)   84.54 (5)     83.64 (6.4)   80.02 (7.3)   56.82 (10.3)
Far T   79.77 (5.3)   85.03 (5.7)   82.68 (6.8)   75.61 (9.2)   56.77 (10.9)
CBU     80.11 (5.8)   81.17 (6)     81.08 (6.5)   84.17 (5.1)   80.96 (6.9)

Yahoo G (p = 25)
        βu1           βu2           βu3           βu4           βu5
RUS     78.82 (1.4)   78.91 (1.6)   78.97 (1.6)   78.61 (1.6)   77.82 (2.1)
Cl K    78.82 (1.4)   77.26 (1.5)   72.52 (1.5)   67.86 (2)     65.07 (2.7)
Cl T    78.82 (1.4)   76.83 (1)     71.99 (1.8)   67.15 (2.3)   61.1 (2.7)
Far K   78.82 (1.4)   78.26 (2.2)   74.69 (2.7)   67.22 (2.1)   60.72 (2.3)
Far T   78.82 (1.4)   77.68 (2.6)   72.44 (3)     64.94 (2.4)   59.6 (2)
CBU     75.25 (3.2)   75.22 (2.4)   74.69 (2.3)   73.07 (2.4)   70.69 (2.4)

TaFeng (p = 25)
        βu1           βu2           βu3           βu4           βu5
RUS     66.94 (1.3)   67.44 (1.3)   68.1 (1.4)    68.27 (1.4)   66.13 (1.2)
Cl K    66.94 (1.3)   66.13 (1.4)   63.39 (1.2)   59.83 (1.3)   56.94 (0.7)
Cl T    66.94 (1.3)   66.38 (1.5)   62.89 (1.6)   57.46 (1.3)   54.56 (1.3)
Far K   66.94 (1.3)   68.06 (1.4)   66.43 (1.6)   64.46 (1.5)   63.35 (1.3)
Far T   66.94 (1.3)   64.31 (1.1)   62.69 (1)     61.27 (1.1)   59.03 (1)
CBU     64.81 (1.2)   64.15 (1.1)   64.13 (1.2)   63.88 (0.8)   63.46 (0.8)

Book (p = 25)
        βu1           βu2           βu3           βu4           βu5
RUS     60.08 (0.7)   60.13 (0.6)   60.4 (0.8)    60.33 (0.8)   63.28 (0.8)
Cl K    60.08 (0.7)   59.96 (0.7)   60.13 (0.8)   59.96 (1)     59.28 (0.7)
Cl T    60.08 (0.7)   59.96 (0.7)   60.13 (0.8)   60.29 (0.4)   54.5 (0.9)
Far K   60.08 (0.7)   63.29 (1)     64.19 (0.8)   57.3 (1.1)    55.66 (1.1)
Far T   60.08 (0.7)   62.14 (0.5)   58.27 (0.6)   56.37 (1)     55.66 (1.1)
CBU     54.82 (0.9)   54.67 (0.9)   54.71 (0.9)   54.66 (1)     54.78 (0.9)

4.5 Boosting variants

In this section, the plain AdaBoost (AB) method is compared with the cost-sensitive AdaCost (AC) algorithm, the EasyEnsemble (EE) technique and the baseline (BL) method (where we train a single SVM on the imbalanced training data). AB, AC and EE have an underlying boosting process in common. The AUC-values are stored during each boosting iteration, which allows us to show the performance graphically as a function of the number of iterations t. In EasyEnsemble (EE), we combine the weak learners of each subset by summing their individual contributions. For example, when t = 2 boosting iterations are performed, we evaluate the performance of the combined learner ∑_{s=1}^{S} ∑_{t=1}^{2} α_{s,t} h_{s,t}(x). Figures 1, 2, 3 and 4 show the average tenfold AUC-performance (with µ = 100) on the test data. The left-side figures show the AB, AC (with varying R-values) and EE (with varying S-values) performance with respect to the number of boosting iterations; the C-value is tuned according to highest validation set AUC-performance (over all possible boosting iterations). The right-side figures investigate the influence of the C-parameter for AB and EE (with S = 15), and allow us to gain insight into the effect of weak and strong learners on the generalization performance. Results on the remaining datasets are given in Appendix C, Figures 6-17. Note that we only indicate results with weight-percentage µ = 100 (use all instances in the training process); previous experiments (with µ = 75) showed inferior results. This suggests that it is indeed important to include all minority instances during the training of each weak learner, an observation also made by Liu et al (2009).
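The combination rule can be sketched with hypothetical weak learners (toy decision stumps stand in for the SVM inducers used in the experiments):

```python
def ensemble_score(x, ensembles, t):
    """EasyEnsemble decision value after t boosting rounds:
    F(x) = sum_s sum_{r<=t} alpha_{s,r} * h_{s,r}(x)."""
    return sum(alpha * h(x) for subset in ensembles for alpha, h in subset[:t])

# Two subsets (S = 2), each a list of (alpha, h) pairs from its own boosting run.
ensembles = [
    [(0.8, lambda x: 1 if x > 0 else -1), (0.3, lambda x: 1 if x > 2 else -1)],
    [(0.5, lambda x: 1 if x > 1 else -1), (0.4, lambda x: 1 if x > 0 else -1)],
]
score_pos = ensemble_score(3, ensembles, t=2)    # all four stumps vote +1
score_neg = ensemble_score(-1, ensembles, t=2)   # all four stumps vote -1
```

Evaluating the combined learner at every t is what produces the AUC-versus-iterations curves in the figures.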

When investigating the influence of the regularization parameter C in AdaBoost, we can graphically verify that it is a "weakness" indicator, in the sense that lower C-values correspond to weaker learners. While these low C-value learners have a low performance in the first rounds of boosting, the possible boost in performance is usually much higher compared to higher C-value inducers. Strong learners can already make a suitable distinction between minority and majority class behaviours in the first boosting round. The boosting process will emphasize the hard to learn instances, among which are noise/outliers. Therefore, too strong learners are not suitable for use in a boosting process, as already noted by Wickramaratna et al (2001). In Figures 2(b) and 3(b) we observe that the lowest C-values (C = 10^-7, 10^-5) can outperform higher C-values (C = 10^-3, 10^-1). In many cases the AB-process outperforms the baseline (BL); see also the results in Table 5. A combination of several weak learners can often achieve better generalization than a single strong learner.

Inspired by the success of the RUS method with βu = 1 (for reasons indicated in Section 4.4), the EE-technique tries to overcome an obvious downside of this approach: we are removing a lot of majority class instances that can be very informative. Hence, we try to make up for this loss by combining several of these balanced training sets, in the hope that the majority class is more adequately explored. This results in a very effective learner that seems to outperform the AB, AC and BL methods in the majority of cases, as can be seen from these figures. It is remarkable that we don't need to use that many subsets S to achieve high performances: increasing the S-value from 5 to 15 doesn't seem to cause major increases in AUC (see also Table 5). The influence of the C-parameter is similar as described in the previous paragraph.


Fig. 1 Mov G(p = 25) dataset results showing average tenfold AUC-performance on test data (with µ = 100) for (a) AB, AC and EE with C chosen according to highest validation set AUC-performance (over all possible T); (b) AB and EE (S = 15) with varying C-levels


Fig. 2 Book(p = 25) dataset


Fig. 3 TaFeng(p = 25) dataset


Fig. 4 Bank(p = []) dataset


4.6 Final comparison

4.6.1 Performance-wise comparison

In this section, we wish to determine which of the aforementioned methods shows superior performance in terms of AUC on test data. The experimental set-up is described in Section 4.2. Note that we exclude β = 0 and βu = 0 in determining the performance of the oversampling, respectively undersampling, techniques, to be able to compare them with the baseline (BL) approach.23 The results for AB, AC and EE are shown for µ = 100. The number of boosting iterations t ∈ [0, T] is also considered as a tunable parameter. Table 5 shows the final comparison between all methods. Additionally, we added an average rank column indicating the mean rank of each algorithm across all datasets. The rank of an algorithm for a specific dataset corresponds to its position in the sorted AUC-performance spectrum, where the best performing algorithm gets the rank of 1, the second best rank 2, etc. In case two or more algorithms perform equally well, average ranks are assigned (if two algorithms have ranks of 3 and 4, respectively, and show equal performance, then both methods get assigned a rank of 3.5). Note that the LST dataset is excluded in the calculation of average ranks, since we consider this a dataset not being affected by imbalance and showing equal performances.
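The rank assignment described above, including the tie-handling, can be sketched as:

```python
def ranks_with_ties(aucs):
    """Rank algorithms on one dataset: the best AUC gets rank 1; tied
    algorithms share the average of the rank positions they occupy."""
    order = sorted(range(len(aucs)), key=lambda i: -aucs[i])
    ranks = [0.0] * len(aucs)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and aucs[order[j + 1]] == aucs[order[i]]:
            j += 1
        avg = (i + 1 + j + 1) / 2          # average of positions i+1 .. j+1
        for k in order[i:j + 1]:
            ranks[k] = avg
        i = j + 1
    return ranks

# Two equally performing methods at positions 3 and 4 both receive rank 3.5.
r = ranks_with_ties([0.86, 0.80, 0.75, 0.75])
```

Averaging such per-dataset ranks over all datasets yields the average rank column of Table 5.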

As a side note, we noticed that increasing the number of minority training set instances can significantly improve performances. We can see this clearly by comparing the p = 25 outcomes with the results of p = 1 on the same dataset. Junque de Fortuny et al (2014a) already showed empirically that larger behaviour datasets (in terms of the number of instances or features) lead to higher AUC-performances. This situation is confirmed in our experiments, though we want to add that this is only true when the data size is increased in a balanced fashion.

The results in Table 5 confirm our previous analysis. Each of the oversampling, undersampling (except Cl Knn) and boosting variants seems to improve upon the BL performance. Sobhani et al (2015) note that the direct application of ensemble techniques on imbalanced data does not circumvent the imbalanced learning issue; in our experiments on behaviour data, we do find a simple boosting procedure applied to the imbalanced data to improve upon the predictive performance. The oversampling techniques OSR, SMOTE and ADASYN show similar outcomes and are competitive methods, seeming to outperform the undersampling techniques and the plain boosting (AB) and cost-sensitive learning (AC) approaches. When dealing with traditional data, the performances of oversampling techniques are generally worse than those of undersampling approaches (Drummond and Holte 2003). It is interesting to observe that for these large and sparse behaviour datasets the opposite is true. This is heavily related to the intrinsic properties of the data under consideration. Undersampling (or feature selection, for that matter) is not recommended because, as we have noted in Section 2.1, each feature provides a small additional contribution to the predictive performance. With respect to AC, the results indicate that R = R_L is a better choice compared to more random cost ratios R = 2, 8. The EE-technique has the

23 The BL technique trains single SVMs on the imbalanced training data.


lowest average rank and is our best performing method. The combination of several balanced subsets with filtered noise/outlier contributions allows one to explore the majority class space efficiently. To strengthen our previous statements, we will in the next paragraphs conduct statistical significance tests in accordance with the procedure outlined in Demsar (2006). The latter paper presents a framework to perform statistical comparisons of classifiers over multiple datasets.

The first null-hypothesis we try to reject postulates that each of the algorithms performs equally well, or equivalently, that their average ranks R_j (see Table 5) are indifferent. Under the null-hypothesis, the Friedman statistic (Friedman 1937)

χ²_F = [12N / (k(k+1))] [ ∑_{j=1}^{k} R_j² − k(k+1)²/4 ],    (6)

with N the number of datasets (N = 15) and k the number of algorithms (k = 13), can be used to compute the statistic (Iman and Davenport 1980)

F_F = (N−1) χ²_F / (N(k−1) − χ²_F).    (7)

The latter is distributed according to the F-distribution with k−1 and (k−1)(N−1) degrees of freedom. Evaluating this measure with the average ranks shown in Table 5 results in F_F = 22.98, which is a lot higher than the critical value of 1.81 at the α = 0.05 significance level. We can therefore reject the null-hypothesis and proceed with post-hoc tests.
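Equations (6) and (7) can be checked directly against the average ranks reported in Table 5 (N = 15, k = 13):

```python
# Average ranks from Table 5, in the order BL, OSR, SMOTE, ADASYN, RUS,
# Cl Knn, Far Knn, CBU, AB, AC(R = 2,8), AC(R = R_L), EE(S = 10), EE(S = 15).
avg_ranks = [11.600, 5.000, 4.533, 4.800, 8.167, 12.467, 8.133,
             8.567, 8.267, 8.467, 5.400, 3.267, 2.333]
N, k = 15, 13

chi2_F = 12 * N / (k * (k + 1)) * (sum(r * r for r in avg_ranks)
                                   - k * (k + 1) ** 2 / 4)       # Eq. (6)
F_F = (N - 1) * chi2_F / (N * (k - 1) - chi2_F)                  # Eq. (7)
# F_F comes out near 22.98, well above the 0.05 critical value of 1.81.
```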

The Nemenyi test (Nemenyi 1963) allows one to compare each classifier to one another and adjusts the critical value to compensate for making k(k−1)/2 comparisons.24 "The performance of two classifiers is significantly different if the corresponding average ranks differ by at least the critical difference (CD)" (Demsar 2006). We refer to the aforementioned paper for the calculation of the CD. The result of these comparisons at the α = 0.05 significance level is shown in Table 6. This test is already quite revealing, in the sense that, for instance, the oversampling methods (OSR, SMOTE and ADASYN), AC (R = R_L) and the EE-techniques are significantly different from the BL. Yet it does not indicate that undersampling or applying plain boosting (AB) results in significantly different performances compared to the BL. The Nemenyi test is still conservative, and we therefore compare all classifiers to a single control classifier in the next paragraphs. Note that we only make k − 1 comparisons in that situation, resulting in stronger results compared to the Nemenyi test.

When comparing a single classifier i with a control classifier c, the null-hypothesis states that the two classifiers i and c perform equally well. Under this hypothesis, the test statistic (based on the average ranks R_i shown in Table 5)

z = (R_i − R_c) / √(k(k+1)/(6N))    (8)

is distributed according to a standard normal distribution. This z-value is used to find the corresponding probability (p-value) p from the cumulative standard normal

24 The larger the number of comparisons, the higher the proportion of null-hypotheses that are wrongfully rejected due to random chance.


Table 5 Final comparison of all methods in terms of AUC test set performance; see Section 4.2 for details on the experimental set-up. β ≠ 0 and βu ≠ 0 for the oversampling, respectively undersampling, techniques; µ = 100 for each of the boosting variants. Standard deviations are given between brackets and the performance difference with the BL is shown in square brackets.

              Mov G (p = 1)        Mov G (p = 25)       Mov Th (p = [])      Yahoo A (p = 1)
BL            71.6 (2.62) [0]      81.41 (1.32) [0]     79.77 (5.33) [0]     55.92 (2.97) [0]
OSR           75.35 (2.27) [3.8]   83.76 (2.09) [2.3]   85.13 (6.1) [5.4]    60.05 (2.71) [4.1]
SMOTE         76.16 (2.27) [4.6]   83.7 (2.1) [2.3]     85.67 (4.98) [5.9]   60.1 (3) [4.2]
ADASYN        76.07 (2.26) [4.5]   83.63 (2.04) [2.2]   85.65 (5.6) [5.9]    59.9 (2.99) [4]
RUS           72.88 (2.73) [1.3]   81.52 (2.15) [0.1]   82.91 (7.19) [3.1]   57.04 (1.77) [1.1]
Cl Knn        71.43 (1.36) [-0.2]  80.88 (1.19) [-0.5]  78.87 (4.71) [-0.9]  55.78 (2.71) [-0.1]
Far Knn       71.9 (2.95) [0.3]    80.9 (1.48) [-0.5]   84.07 (4.64) [4.3]   57.2 (1.33) [1.3]
CBU           74.17 (2.36) [2.6]   81.51 (1.04) [0.1]   82.76 (7.22) [3]     58.77 (3.43) [2.8]
AB            71.65 (1.73) [0.1]   84.52 (1.89) [3.1]   82.43 (5.18) [2.7]   58.35 (2.62) [2.4]
AC(R = 2,8)   71.61 (2.46) [0]     83.46 (1.82) [2]     83.27 (5.6) [3.5]    57.72 (2.47) [1.8]
AC(R = R_L)   74.65 (2.7) [3.1]    83.35 (2.09) [1.9]   85.41 (4.49) [5.6]   59.47 (2.33) [3.5]
EE(S = 10)    76.04 (2.66) [4.4]   85.05 (1.85) [3.6]   86.1 (5.78) [6.3]    59.66 (3.13) [3.7]
EE(S = 15)    76.12 (2.88) [4.5]   85.14 (1.86) [3.7]   86.42 (5.86) [6.7]   59.76 (2.93) [3.8]

              Yahoo A (p = 25)     Yahoo G (p = 1)      Yahoo G (p = 25)     TaFeng (p = 1)
BL            61.68 (2.42) [0]     66.84 (3.66) [0]     78.82 (1.39) [0]     55.75 (1.6) [0]
OSR           64.59 (3.12) [2.9]   73.08 (2.96) [6.2]   78.52 (2.01) [-0.3]  61.21 (2.24) [5.5]
SMOTE         65.56 (3.33) [3.9]   73.11 (3.12) [6.3]   79.01 (1.21) [0.2]   61.72 (1.81) [6]
ADASYN        65.13 (3.38) [3.4]   73.22 (3.17) [6.4]   79.74 (1.68) [0.9]   61.68 (1.86) [5.9]
RUS           64.11 (2.8) [2.4]    70.65 (3.39) [3.8]   78.91 (1.55) [0.1]   59.25 (2.18) [3.5]
Cl Knn        61.14 (2.13) [-0.5]  66.34 (3.54) [-0.5]  77.26 (1.46) [-1.6]  55.77 (1.28) [0]
Far Knn       63.96 (3.03) [2.3]   66.97 (3.54) [0.1]   78.26 (2.2) [-0.6]   59.98 (1.26) [4.2]
CBU           62.27 (1.79) [0.6]   71.27 (2.89) [4.4]   75.22 (2.42) [-3.6]  58.4 (1.57) [2.6]
AB            63.88 (2.67) [2.2]   68.9 (2.03) [2.1]    79.01 (1.66) [0.2]   56.21 (1.79) [0.5]
AC(R = 2,8)   64.32 (3.56) [2.6]   68.89 (3.11) [2]     78.99 (1.89) [0.2]   56.33 (1.83) [0.6]
AC(R = R_L)   64.31 (3.03) [2.6]   73.13 (2.8) [6.3]    78.41 (2) [-0.4]     61.6 (2.26) [5.9]
EE(S = 10)    66.51 (3.24) [4.8]   72.61 (3.15) [5.8]   80.52 (1.6) [1.7]    61.2 (1.82) [5.4]
EE(S = 15)    66.36 (3.18) [4.7]   73.48 (2.32) [6.6]   80.54 (1.56) [1.7]   61.13 (1.83) [5.4]

              TaFeng (p = 25)      Book (p = 1)         Book (p = 25)        LST (p = 1)
BL            66.94 (1.34) [0]     52.6 (1.29) [0]      60.08 (0.71) [0]     99.99 (0.01) [0]
OSR           68.77 (1.23) [1.8]   55.87 (1.42) [3.3]   64.62 (0.57) [4.5]   99.99 (0.01) [0]
SMOTE         68.47 (1.5) [1.5]    55.07 (0.88) [2.5]   62.96 (0.82) [2.9]   99.99 (0.01) [0]
ADASYN        68.48 (1.47) [1.5]   55.04 (0.91) [2.4]   63.02 (0.57) [2.9]   99.99 (0.01) [0]
RUS           68.28 (1.39) [1.3]   54.26 (0.92) [1.7]   63.28 (0.8) [3.2]    99.98 (0.01) [0]
Cl Knn        66.13 (1.43) [-0.8]  52.69 (1.3) [0.1]    60.02 (0.79) [-0.1]  99.99 (0.01) [0]
Far Knn       68.06 (1.41) [1.1]   56.25 (1.52) [3.7]   64.15 (1.12) [4.1]   99.98 (0.01) [0]
CBU           63.84 (1.07) [-3.1]  53.75 (1.01) [1.2]   54.68 (0.88) [-5.4]  []
AB            67.65 (1.55) [0.7]   54.27 (1.95) [1.7]   65 (0.67) [4.9]      99.99 (0.01) [0]
AC(R = 2,8)   69.31 (1.23) [2.4]   53.72 (1) [1.1]      61.24 (0.8) [1.2]    99.98 (0.01) [0]
AC(R = R_L)   67.15 (1.51) [0.2]   55.73 (1.22) [3.1]   64.6 (0.64) [4.5]    99.99 (0.01) [0]
EE(S = 10)    70.3 (1.35) [3.4]    55.09 (1.29) [2.5]   65.37 (0.61) [5.3]   99.98 (0.01) [0]
EE(S = 15)    70.4 (1.3) [3.5]     55.35 (1.26) [2.8]   65.4 (0.51) [5.3]    99.98 (0.01) [0]


Table 5 Continued. Additionally, an average rank column is added, showing the mean rank of each algorithm across all datasets. Note that we excluded the LST dataset for this purpose.

              Adver (p = [])       Adver (p = 1)        CRF (p = [])          Bank (p = [])
BL            96.61 (1.82) [0]     90.93 (3.02) [0]     64.06 (16.43) [0]     66.82 (0.88) [0]
OSR           96.93 (1.91) [0.3]   93.3 (2.02) [2.4]    80.74 (12.93) [16.7]  71.39 (0.79) [4.6]
SMOTE         97.05 (1.66) [0.4]   93.35 (2.01) [2.4]   78.7 (16.56) [14.6]   []
ADASYN        96.91 (1.95) [0.3]   93.46 (2.21) [2.5]   78.87 (16.71) [14.8]  []
RUS           96.81 (1.87) [0.2]   92.38 (2.51) [1.5]   83.98 (5.99) [19.9]   69.41 (1.19) [2.6]
Cl Knn        96.4 (1.48) [-0.2]   89.73 (3.42) [-1.2]  76.63 (16.19) [12.6]  66.17 (0.72) [-0.6]
Far Knn       95.77 (1.81) [-0.8]  93.88 (1.78) [3]     83.75 (13.11) [19.7]  66.95 (0.56) [0.1]
CBU           97.15 (1.88) [0.5]   94.18 (2.3) [3.3]    []                    []
AB            97.34 (2.18) [0.7]   91.39 (3.23) [0.5]   77.62 (15.15) [13.6]  66.82 (0.88) [0]
AC(R = 2,8)   97.44 (1.93) [0.8]   91 (3.35) [0.1]      68.31 (14.93) [4.2]   67.67 (0.71) [0.9]
AC(R = R_L)   97.46 (1.71) [0.8]   93.51 (2.17) [2.6]   85.08 (9.77) [21]     70.7 (0.8) [3.9]
EE(S = 10)    97.64 (1.35) [1]     92.97 (2.75) [2]     86.18 (10.17) [22.1]  71.46 (0.81) [4.6]
EE(S = 15)    97.63 (1.35) [1]     93.3 (2.14) [2.4]    86.35 (9.99) [22.3]   71.54 (0.76) [4.7]

              Average Rank
BL            11.600
OSR           5.000
SMOTE         4.533
ADASYN        4.800
RUS           8.167
Cl Knn        12.467
Far Knn       8.133
CBU           8.567
AB            8.267
AC(R = 2,8)   8.467
AC(R = R_L)   5.400
EE(S = 10)    3.267
EE(S = 15)    2.333

Table 6 Nemenyi test outcomes at the α = 0.05 significance level. A value of 1 indicates the null-hypothesis is rejected and thus finds the two algorithms significantly different (a value of 0 means we accept the null-hypothesis). Methods are listed in the same order as Table 5 and are more concisely represented.

        BL  RO  SM  AD  RU  Cl  Fa  CBU AB  AC1 AC2 EE1 EE2
BL      0   1   1   1   0   0   0   0   0   0   1   1   1
RO      1   0   0   0   0   1   0   0   0   0   0   0   0
SM      1   0   0   0   0   1   0   0   0   0   0   0   0
AD      1   0   0   0   0   1   0   0   0   0   0   0   0
RU      0   0   0   0   0   0   0   0   0   0   0   1   1
Cl      0   1   1   1   0   0   0   0   0   0   1   1   1
Fa      0   0   0   0   0   0   0   0   0   0   0   1   1
CBU     0   0   0   0   0   0   0   0   0   0   0   1   1
AB      0   0   0   0   0   0   0   0   0   0   0   1   1
AC1     0   0   0   0   0   0   0   0   0   0   0   1   1
AC2     1   0   0   0   0   1   0   0   0   0   0   0   0
EE1     1   0   0   0   1   1   1   1   1   1   0   0   0
EE2     1   0   0   0   1   1   1   1   1   1   0   0   0


distribution function: p = 2 × min(Φ(z), 1 − Φ(z)). Holm's method (Holm 1979) compares k − 1 classifiers to a single control classifier and sorts the corresponding p-values in ascending order, so that p_1 ≤ p_2 ≤ ... ≤ p_(k−1). Each p_i is subsequently compared to its associated confidence level25 αcomp = α/(k − i). Holm starts with performing the check p_1 < α/(k − 1) and rejects the null-hypothesis if the test passes. It then proceeds with a similar check for i = 2, and continues until a certain null-hypothesis cannot be rejected. The remaining hypotheses are retained.
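Holm's step-down rule, together with the z-statistic of Eq. (8), is small enough to sketch directly; the fragment also reproduces (up to rounding of the average ranks) the EE(S = 15) comparison against the BL control:

```python
import math

def holm(p_values, alpha=0.05):
    """Holm's step-down procedure over the k-1 ordered comparisons:
    compare the i-th smallest p-value against alpha/(k-i), stop at the
    first failure and retain all remaining hypotheses."""
    k = len(p_values) + 1
    order = sorted(range(len(p_values)), key=lambda i: p_values[i])
    rejected = [False] * len(p_values)
    for pos, i in enumerate(order, start=1):
        if p_values[i] < alpha / (k - pos):
            rejected[i] = True
        else:
            break
    return rejected

def z_and_p(r_i, r_c, k, n):
    """z-statistic of Eq. (8) and its two-sided p-value."""
    z = (r_i - r_c) / math.sqrt(k * (k + 1) / (6 * n))
    phi = 0.5 * math.erfc(-z / math.sqrt(2))      # standard normal CDF
    return z, 2 * min(phi, 1 - phi)

# EE(S = 15), average rank 2.333, versus the BL control, average rank 11.600
# (k = 13 algorithms, N = 15 datasets): z near -6.52, p near 7e-11.
z, p = z_and_p(2.333, 11.600, k=13, n=15)
```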

Table 7 shows the result of the Holm test at the α = 0.05 significance level with the BL algorithm as control classifier. The table lists each method according to sorted p-values and shows the z-, p- and αcomp-values. The significance column indicates whether the proposed method is significantly different from the BL, and coincidentally matches the result of the Nemenyi test. The p-values of the undersampling methods (Far Knn, RUS, CBU), AB and AC (R = 2, 8) are only slightly higher than the level required to reject the null-hypothesis. We included a critical significance level αcrit, corresponding to the lowest possible significance level upon which the method would be considered significantly different from the BL (if α = αcrit, then p = αcomp). In the case of AB, for example, if we were to choose a significance level α = 0.0764 (a confidence level of 92.36%), the associated p-value would be smaller than αcomp and we would proceed to conclude26 that AB performs significantly different from the BL. To summarize: the null-hypothesis of equal performance between the BL and an alternative method proposed in Section 3 (except Cl Knn) is rejected with a confidence level of more than 91% according to Holm's test. Hence, the performance of the BL approach is significantly worse than that of each of the proposed over- and undersampling methods (except Cl Knn), cost-sensitive learning techniques (AC) and boosting variants (AB, EE).

Similar to the previous paragraph, Table 8 presents the results of Holm's test at the α = 0.05 significance level with EE(S = 15) as control classifier. The results coincidentally correspond with the result of the Nemenyi test and indicate that the oversampling techniques (OSR, SMOTE and ADASYN), AC(R = R_L) and EE(S = 10) are not statistically different from the EE(S = 15) technique at the 0.05 significance level. However, if we were to adapt the significance level to approximately α = 0.25 (75% confidence level), the results would conclude significance (except for EE(S = 10)). To summarize: the null-hypothesis of equal performance between EE(S = 15) and an alternative method proposed in Section 3 (except EE(S = 10)) is rejected with a confidence level of more than 75% according to Holm's test. Hence, the performance of the oversampling, undersampling, AB and AC approaches is significantly worse than that of EE(S = 15). We can therefore safely conclude EE to be the preferred method in terms of AUC performance when dealing with imbalanced behaviour data.

25 αcomp adjusts the value of α to compensate for multiple comparisons.
26 This is not entirely true, since Holm's method requires Far Knn to pass the test first before proceeding, which only occurs at the 0.088662 significance level.


Table 7 Holm test at the α = 005 significance level with BL reference The table shows the z test statisticwith associated p-value αcomp =α(kminus i) with k the number of algorithms (k= 13) and i corresponding tothe position of the method in the sorted p-value vector Since we already ranked the p-values in ascendingorder i takes on the row number (eg OSR has i = 5) The column significant denotes if we can reject thenull-hypothesis (significant p lt αcomp) αcrit corresponds with the smallest possible significance levelwhere we would decide to reject the null-hypothesis (αcrit = α pαcomp)

             z          p         αcomp     significant  αcrit
EE(S = 15)   -6.51642   7.2E-11   0.004167  1            8.64E-10
EE(S = 10)   -5.86009   4.63E-09  0.004545  1            5.09E-08
SMOTE        -4.96936   6.72E-07  0.005     1            6.72E-06
ADASYN       -4.78183   1.74E-06  0.005556  1            1.56E-05
OSR          -4.64119   3.46E-06  0.00625   1            2.77E-05
AC(R = RL)   -4.35991   1.3E-05   0.007143  1            9.11E-05
Far Knn      -2.4378    0.014777  0.008333  0            0.088662
RUS          -2.41436   0.015763  0.01      0            0.078815
AB           -2.34404   0.019076  0.0125    0            0.076305
AC(R = 2.8)  -2.20339   0.027567  0.016667  0            0.082701
CBU          -2.13307   0.032919  0.025     0            0.065837
Cl Knn        0.609449  0.542227  0.05      0            0.542227

Table 8 Holm test at the α = 0.05 significance level with EE(S = 15) reference

             z         p         αcomp     significant  αcrit
Cl Knn       7.12587   1.03E-12  0.004167  1            1.24E-11
BL           6.516421  7.2E-11   0.004545  1            7.92E-10
CBU          4.383348  1.17E-05  0.005     1            0.000117
AC(R = 2.8)  4.313027  1.61E-05  0.005556  1            0.000145
AB           4.172384  3.01E-05  0.00625   1            0.000241
RUS          4.102063  4.09E-05  0.007143  1            0.000287
Far Knn      4.078623  4.53E-05  0.008333  1            0.000272
AC(R = RL)   2.156513  0.031044  0.01      0            0.155218
OSR          1.875229  0.060761  0.0125    0            0.243045
ADASYN       1.734587  0.082814  0.016667  0            0.248442
SMOTE        1.547064  0.121848  0.025     0            0.243696
EE(S = 10)   0.65633   0.511612  0.05      0            0.511612

4.6.2 Timing-wise comparison

The performance of each of the proposed algorithms in terms of training times depends highly upon the chosen parameters. The β and βu parameters control the number of training instances and are an important contribution to the subsequent SVM learning times for each of the over- and undersampling techniques. The number of boosting iterations t ∈ [0, T] is also a major factor in each of the boosting variants proposed in Section 3.3. Furthermore, the chosen regularization parameter C highly influences the training times of the SVM inducer: higher C-values result in larger training times (Fan et al 2008). Not only do the chosen hyperparameters affect learning times; data characteristics such as the number of minority instances, the imbalance level, the number of features, the amount of overlapping, etc. have a major effect as well.

Imbalanced classification in sparse and large behaviour datasets 35

In comparing each of the methods outlined in Section 3, we make use of a methodology similar to the one previously presented (see Section 4.6.1). For each of the ten folds, a suitable parameter combination is selected based on optimal validation set AUC-performance. These chosen parameters are applied in determining the timings of each method. The final results indicate the average timings across ten folds. Hence, we show the required amount of time for each method when using the parameter combination resulting in the best performance. Note that this might, for instance, result in using β = 1/3 in case of OSR, or using only t = 5 boosting iterations in EE. Computational timings are shown in Table 9.

As one can observe from the aforementioned table, CBU is the slowest method and relies on computationally intensive clustering procedures. The synthetic SMOTE and ADASYN oversampling approaches are very time consuming methods: they both rely on computationally expensive nearest neighbour computations, in conjunction with a synthetic sample generation procedure that requires a single pass over all minority training instances.
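The generation step that dominates SMOTE/ADASYN cost can be sketched as follows (a toy illustration, not the adapted implementation from the paper: each synthetic example requires a minority nearest-neighbour lookup followed by an interpolation):

```python
import random

def smote_synthetic(x, minority_neighbors, rng=None):
    """Create one synthetic minority example by interpolating between a
    minority instance x and a randomly chosen minority nearest neighbour."""
    rng = rng or random.Random(0)
    nn = rng.choice(minority_neighbors)
    gap = rng.random()  # interpolation factor in [0, 1)
    return [xi + gap * (ni - xi) for xi, ni in zip(x, nn)]

# the synthetic point lies on the segment between the two instances
s = smote_synthetic([0.0, 0.0], [[1.0, 1.0]])
```

Repeating this for every minority instance (times the oversampling rate β) is what makes the synthetic approaches so much slower than plain random oversampling.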

The RUS method shows, as expected, superior results in terms of computational timings. Other undersampling methods perform quite well, though they are slower than BL and OSR due to their inherent nearest neighbour computations.

The boosting variants seem to perform worse than the BL, OSR and each of the undersampling methods. Since EE was the best method in terms of AUC-performance (see Section 4.6.1), we focus on this learner in this paragraph. For the medium sized datasets (all but CRF and Bank), OSR outperforms EE(S = 15). Yet for the large datasets (CRF and Bank), the opposite is true. An SVM is known to have a quadratic computational complexity in the number of training instances (Suykens et al 2002; García and Lozano 2007). Because of this reason, we consider OSR (or any other oversampling technique) as inappropriate for large datasets. Since with EE each subset has a size limited to twice the minority class training size, we expect increasing benefits of the EE-technique for larger and more imbalanced behaviour data. As Junqué de Fortuny et al (2014a) have observed, larger behaviour data, in terms of the number of instances or features, also contributes to an increase in predictive power. Hence, we expect to see this type of big behaviour data more often in real-life applications, where computational resources can become a major constraint. Because the EE-technique is easily parallelizable, by simultaneously performing the boosting process for each of the independent subsets, we added an extra row in Table 9: EE par corresponds with the time needed for EE(S = 15) divided by 15, indicating the time that would be required for EE if each subset were trained in parallel. We can see that EE par can outperform OSR in 8 out of 16 cases, precisely for the datasets that show a high imbalance level (e.g. p = 1 or p = []).
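The subset-construction step that makes EE cheap and trivially parallelizable can be sketched as (a hypothetical helper, assuming index lists for both classes):

```python
import random

def easy_ensemble_subsets(majority_idx, minority_idx, S=15, seed=0):
    """Draw S balanced subsets: each keeps every minority instance plus an
    equally sized random majority sample, so every subset has size
    2 * |minority| and can be boosted independently (hence in parallel)."""
    rng = random.Random(seed)
    return [rng.sample(majority_idx, len(minority_idx)) + list(minority_idx)
            for _ in range(S)]

subs = easy_ensemble_subsets(list(range(1000)), list(range(1000, 1010)))
```

Because each subset is only twice the minority class size, the quadratic SVM training cost applies to a small problem S times, rather than once to the full (majority-dominated) dataset.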

We conclude our comparison of the proposed methods in Figure 5, which shows the average ranks in terms of computational timings (see Table 9) and AUC performance for each of the available data sources. Regarding the ranks with respect to the latter, we added an extra row, EE par, having the same AUC-performance as EE(S = 15) for each dataset in Table 5. Ranks are subsequently determined according to the procedure outlined in Section 4.6.1. Additionally, Figure 18 in Appendix D shows the result for the large and highly imbalanced CRF and Bank datasets. We expect the latter type of data to be of major interest for reasons already indicated.


Table 9 Computational timings (in seconds), averaged across ten folds, using the parameter combination resulting in the highest validation set AUC-performance. First and second best performances are emphasized in boldface (third and fourth best are underlined). EE par is the time required for EE(S = 15) divided by 15.

             Mov G(p = 1)  Mov G(p = 25)  Mov Th(p = [])  Yahoo A(p = 1)
BL           0.032889      0.056697       0.558563        0.026922
OSR          0.055043      0.062802       0.99009         0.044421
SMOTE        0.218821      0.937057       3.841482        0.057726
ADASYN       0.284688      1.802399       5.191265        0.087694
RUS          0.011431      0.025383       0.155224        0.007991
Cl Knn       0.046599      0.599846       0.989914        0.037182
Far Knn      0.039887      0.80072        0.683023        0.027788
CBU          1.034111      10.60173       6.822839        1.692477
AB           0.169792      0.841443       3.460246        0.139251
AC(R = 2.8)  0.471994      2.996585       1.086907        0.366555
AC(R = RL)   0.53376       1.179542       6.065177        0.209015
EE(S = 10)   0.117226      6.065145       1.17995         0.148973
EE(S = 15)   0.20474       7.173737       2.119991        0.180365
EE par       0.013649      0.478249       0.141333        0.012024

             Yahoo A(p = 25)  Yahoo G(p = 1)  Yahoo G(p = 25)  TaFeng(p = 1)
BL           0.092954         0.011915        0.044164         0.026728
OSR          0.027887         0.013241        0.047206         0.040919
SMOTE        1.062686         0.056153        0.883698         0.219553
ADASYN       2.050993         0.079073        1.733367         0.306618
RUS          0.048471         0.003234        0.033423         0.002916
Cl Knn       0.84391          0.025404        0.502515         0.092167
Far Knn      0.664124         0.026576        0.500206         0.080159
CBU          15.69442         1.287221        13.55035         2.467279
AB           0.445546         0.078777        0.169977         0.114619
AC(R = 2.8)  1.034044         0.321723        0.515953         0.926178
AC(R = RL)   0.706215         0.226741        0.112949         0.610233
EE(S = 10)   1.026577         0.100331        1.527146         0.058052
EE(S = 15)   1.607596         0.077483        2.472582         0.10538
EE par       0.107173         0.005166        0.164839         0.007025

             TaFeng(p = 25)  Book(p = 1)  Book(p = 25)  LST(p = 1)
BL           0.032033        0.080035     0.318093      0.652045
OSR          0.032414        0.132927     0.092757      0.87152
SMOTE        5.089283        3.409418     11.43444      4.987705
ADASYN       8.148419        3.689661     12.25441      6.840083
RUS          0.020457        0.022713     0.031972      0.432839
Cl Knn       1.713731        0.400873     3.711648      2.508374
Far Knn      1.539437        0.379086     3.988552      2.511037
CBU          26.42686        4.198663     46.31987      []
AB           0.713265        0.61719      1.238585      2.466151
AC(R = 2.8)  1.234647        1.666131     2.330635      1.451671
AC(R = RL)   0.279047        0.860346     0.197053      1.23763
EE(S = 10)   2.484502        2.145747     7.177484      0.524066
EE(S = 15)   3.363971        2.480066     11.21945      0.784111
EE par       0.224265        0.165338     0.747963      0.052274


Table 9 Continued. Additionally, an extra column indicating average ranks is included (lower ranks are preferred).

             Adver(p = [])  Adver(p = 1)  CRF(p = [])  Bank(p = [])
BL           0.010953       0.002796      0.725911     70.89334
OSR          0.012178       0.006166      3.685813     179.7481
SMOTE        0.123112       0.017764      5.633862     []
ADASYN       0.183767       0.021728      5.768669     []
RUS          0.012115       0.00204       0.147392     52.47441
Cl Knn       0.061324       0.005568      1.106755     73.73282
Far Knn      0.079078       0.007069      1.110379     97.59619
CBU          3.378235       3.236754      []           []
AB           0.069199       0.103518      1.153196     83.08618
AC(R = 2.8)  0.193092       0.068905      2.047434     71.70548
AC(R = RL)   0.107652       0.037963      1.387174     106.3466
EE(S = 10)   0.138485       0.085686      0.198656     24.95117
EE(S = 15)   0.185136       0.139121      0.285345     36.40107
EE par       0.012342       0.009275      0.019023     2.426738

             Average Rank [pos]
BL           2.94 [2]
OSR          4.19 [4]
SMOTE        9.59 [11]
ADASYN       10.91 [13]
RUS          1.38 [1]
Cl Knn       6.5 [5]
Far Knn      6.56 [6]
CBU          14 [14]
AB           8.06 [7]
AC(R = 2.8)  10.81 [12]
AC(R = RL)   9.25 [9]
EE(S = 10)   8.25 [8]
EE(S = 15)   9.56 [10]
EE par       3 [3]

Timings are a substantial concern with respect to these large datasets. As can be observed, the EE-technique thrives for these big and highly imbalanced data, even in its non-parallel form.


[Figure 5: scatter plot of average rank AUC (0-14) versus average rank Time (0-18) for BL, OSR, SMOTE, ADASYN, RUS, Cl Knn, Far Knn, CBU, AB, AC(R = 2.8), AC(R = RL), EE(S = 10), EE(S = 15) and EE par]

Fig 5 Average rank AUC versus average rank Time (see Table 9) across all datasets from Table 2 (excluding Flickr and Kdd). Regarding the AUC rank, an extra row EE par is added, having the same AUC-value as EE(S = 15). Ranks for each dataset are subsequently obtained via the procedure outlined in Section 4.6.1. Points occurring in the upper-right region are preferred.

5 The effect of the chosen base learner

In this short section, we investigate the influence of the chosen base learner. More specifically, we ask ourselves whether some base inducers can more easily cope with the imbalanced learning issue, in doing so verifying the conclusion that EE is a suitable technique to elevate upon the baseline performance. We rely on several popular classification methods for behaviour data: two regularization based techniques (SVM, LR) and two heuristic types of approaches (Naive Bayes (NB), Behavioural Similarity (BeSim)). The next paragraphs briefly introduce these methods. Note that source codes and dataset results are provided in our on-line repository.

LR constructs a linear model that maximizes the likelihood of the observed data by imposing a logistic model for the parametric probability model P(Y|X). Ng and Jordan (2002) provide a thorough introduction to the topic[27] and note that, in its plain form, LR can suffer from overfitting, especially when the input data is very high dimensional. Therefore, we resort to regularized logistic regression. We opt for L2-regularized LR, since L1-regularization corresponds with a natural way of imposing feature selection (Ng 2004). The LIBLINEAR toolbox (Fan et al 2008) is used to obtain these L2-regularized LR models.
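For intuition, the L2-regularized objective can be minimized with plain gradient descent (a didactic pure-Python sketch, not the LIBLINEAR solver used in the paper; λ plays the role of 1/C):

```python
import math

def l2_logreg(X, y, lam=0.01, lr=0.1, iters=500):
    """Gradient descent on the L2-regularized logistic loss; y in {0, 1}.
    The regularization term lam * w shrinks weights toward zero."""
    w = [0.0] * len(X[0])
    for _ in range(iters):
        grads = [lam * wj for wj in w]  # gradient of the L2 penalty
        for xi, yi in zip(X, y):
            p = 1.0 / (1.0 + math.exp(-sum(wj * xj for wj, xj in zip(w, xi))))
            for j, xj in enumerate(xi):
                grads[j] += (p - yi) * xj / len(X)  # logistic loss gradient
        w = [wj - lr * g for wj, g in zip(w, grads)]
    return w

X = [[0.0, 1.0], [0.0, 2.0], [3.0, 0.0], [4.0, 0.0]]
y = [0.0, 0.0, 1.0, 1.0]
w = l2_logreg(X, y)  # feature 0 votes for the positive class, feature 1 against
```

A larger λ (smaller C) yields a weaker, more regularized learner, which is exactly the knob exploited in the boosting experiments.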

[27] For an accessible introduction, see the chapter on "Generative and Discriminative Classifiers: Naive Bayes and Logistic Regression" provided at the following link: http://www.cs.cmu.edu/~tom/NewChapters.html


NB (Ng and Jordan 2002) relies on the use of Bayes' rule, which expresses the posterior P(Y|X) as the product of the likelihood P(X|Y) and the prior P(Y), divided by the evidence P(X). The naivety of the approach originates from assuming conditional independence of the features. We impose a multivariate event model regarding the distribution P(X|Y). Junqué de Fortuny et al (2014a) provide an efficient NB implementation for dealing with large and sparse binary behaviour data.
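Under a multivariate (Bernoulli) event model over sparse binary features, training and prediction reduce to the following (a compact toy sketch with Laplace smoothing, not the optimized implementation referenced above):

```python
import math
from collections import Counter

def train_nb(X, y, n_features, alpha=1.0):
    """Multivariate Bernoulli NB: X is a list of sets of active feature
    indices (sparse binary rows), y a list of class labels."""
    model = {}
    for c in set(y):
        rows = [x for x, yi in zip(X, y) if yi == c]
        on = Counter(f for x in rows for f in x)
        # Laplace-smoothed P(feature active | class)
        p_on = [(on[f] + alpha) / (len(rows) + 2 * alpha)
                for f in range(n_features)]
        model[c] = (math.log(len(rows) / len(y)), p_on)
    return model

def predict_nb(model, x, n_features):
    def log_post(c):
        log_prior, p_on = model[c]
        return log_prior + sum(
            math.log(p_on[f] if f in x else 1.0 - p_on[f])
            for f in range(n_features))
    return max(model, key=log_post)

m = train_nb([{0}, {0, 1}, {2}, {2, 3}], [0, 0, 1, 1], n_features=4)
```

The conditional independence assumption is what makes NB a "weak" learner here: every active behaviour contributes an independent log-odds term.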

The BeSim approach conceptually corresponds to projecting the bigraph to a unigraph using the procedure outlined in Section 3.2. Next, the weighted-vote relational neighbour classifier (Macskassy and Provost 2007) is used to infer unknown labels/scores through a weighted probability estimation using the known labels of neighbouring nodes. A very fast and scalable implementation of this approach (the SW-transformation) is presented in detail in Stankova et al (2015).
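Conceptually, scoring a target node then amounts to a similarity-weighted vote over the labeled neighbours in the projected unigraph (a simplified sketch; the actual SW-transformation of Stankova et al is a far more scalable formulation of this idea):

```python
def besim_score(target_feats, labeled_users):
    """Weighted-vote relational neighbour on the projected unigraph:
    similarity = number of shared behaviours (features); the score is the
    similarity-weighted fraction of positively labeled neighbours."""
    num = den = 0.0
    for feats, label in labeled_users:
        sim = len(target_feats & feats)  # shared bottom nodes in the bigraph
        num += sim * label
        den += sim
    return num / den if den else 0.0

score = besim_score({1, 2, 3}, [({1, 2}, 1), ({3}, 0), ({4, 5}, 1)])
```

Users sharing no behaviour with the target contribute nothing, which is why the computation stays cheap on sparse behaviour data.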

In Table 10, we compare the performances of the baseline (BL) application of the methods with their EE-versions (with S = 15, T = 20). It is quite striking to see the EE-version dominating the baseline version when looking at each method individually. This suggests that the imbalanced learning issue has an impact for each of the aforementioned techniques. Focussing on the regularization based approaches, we observe the baseline LR to be less influenced by the imbalance than the SVM method. Their EE-versions (EE SVM and EE LR) can boost the baseline performance in almost all cases and have the highest average ranks amongst all the proposed methods. As already emphasized, the regularization based approaches offer an added element of flexibility, in the sense that the strength of the underlying base learner can be controlled.

The heuristic approaches NB and BeSim do not offer this flexibility. Furthermore, it is not directly possible to integrate instance weights during the training phase of the base learner. During the boosting process, we therefore sample from the distribution Dt to generate (unweighted) examples and train a BeSim/NB learner next. The results in Table 10 indicate BL NB to be the worst performing algorithm amongst all methods tried. Due to the conditional independence assumption, one could regard this as a weak learner, and it is therefore suitable to be used in a boosting process. Indeed, the EE NB approach is able to significantly improve upon its performance. We found this to be the slowest EE-version[28] (e.g. taking more than 24 hours to process a single fold of the Bank dataset, versus several minutes for the other versions). The average rank estimate of EE NB is rather pessimistic, since we assigned the lowest performance across all methods in case of missing results. BeSim can be considered a stronger learner compared to NB. Its possible boost in performance is typically lower than that of NB. Having said this, we found EE BeSim to improve upon BL BeSim for all datasets (except Adver(p = 1)).
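The weight-to-sample workaround used for NB and BeSim can be sketched as (a hypothetical helper; `random.choices` performs the weighted draw from Dt):

```python
import random

def resample_from_dt(examples, dt_weights, n, seed=0):
    """Draw n unweighted training examples according to the boosting
    distribution D_t, so that weight-agnostic learners (NB, BeSim)
    can still be plugged into the boosting loop."""
    rng = random.Random(seed)
    return rng.choices(examples, weights=dt_weights, k=n)

sample = resample_from_dt(['hard', 'easy'], [0.95, 0.05], n=1000)
```

Instances with large boosting weight are drawn many times, so the resampled set approximates training directly on the weighted distribution.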

[28] This statement only relates to the EE-version. In the baseline case, the timings of the NB-version are roughly of the same order of magnitude as SVM and LR (though this heavily depends on the chosen regularization parameter C). With EE, each subset has a size limited to twice the minority class training size (a relatively low number compared to the size of the majority class). This works in favour of SVM/LR (e.g. SVM has a computational complexity of O(m²), with m the number of training instances). Also note that the NB-version is indeed optimized for sparse data. Internally, the NB-implementation writes a fairly large model file to a specified directory (sizes of 1 GB can be encountered there). This writing operation slows down the training stage (especially in case of EE, where S × T = 15 × 20 = 300 model files are constructed).


Table 10 Average tenfold AUC test set performance of the BL compared to its EE-version (S = 15, T = 20) for SVM, LR, BeSim and NB. In case the EE NB implementation crashed, we put a × in the table. If it is too slow (> 24 hours per fold), we place a []. Additionally, average ranks are assigned to each method. In case EE NB has missing results for a particular dataset, we ensure it has the lowest rank across all other algorithms.

          Mov G(p = 1)  Mov G(p = 25)  Mov Th(p = [])  Yahoo A(p = 1)
BL SVM    71.6 (2.62)   81.41 (1.32)   79.77 (5.33)    56.49 (3.37)
EE SVM    76.12 (2.88)  85.13 (1.86)   86.43 (5.86)    59.74 (2.96)
BL LR     71.02 (2.09)  84.39 (1.84)   83.14 (4.17)    57.84 (2.39)
EE LR     76.69 (2.92)  85.03 (1.98)   86.3 (5.37)     59.79 (2.62)
BL BeSim  76.1 (3.58)   81.3 (2.92)    82.81 (6.6)     56.27 (2.73)
EE BeSim  76.31 (3.71)  81.37 (2.9)    85.02 (6.28)    57.7 (1.71)
BL NB     70.26 (5.84)  77.01 (2.54)   70.48 (10.14)   52.56 (2.09)
EE NB     75.93 (2.83)  85.56 (2.01)   86.91 (4.15)    57.55 (2.73)

          Yahoo A(p = 25)  Yahoo G(p = 1)  Yahoo G(p = 25)  TaFeng(p = 1)
BL SVM    61.61 (2.48)     66.84 (3.66)    78.82 (1.39)     55.75 (1.6)
EE SVM    66.38 (3.16)     73.48 (2.32)    80.55 (1.55)     61.13 (1.83)
BL LR     66.27 (2.96)     69.82 (1.93)    80.45 (1.59)     58.91 (2.31)
EE LR     66.22 (3.28)     73.08 (2.14)    80.53 (1.56)     61.43 (2.32)
BL BeSim  64.54 (2.02)     68.89 (2.49)    79.55 (1.96)     57.89 (1.18)
EE BeSim  65.25 (2.23)     71.18 (2.91)    80.04 (1.85)     59.36 (1.47)
BL NB     65.0 (1.65)      63.33 (2.56)    78.89 (1.64)     54.61 (1.2)
EE NB     66.6 (2.79)      70.99 (2.88)    81.01 (1.3)      59.01 (1.84)

          TaFeng(p = 25)  Book(p = 1)   Book(p = 25)  LST(p = 1)
BL SVM    66.94 (1.34)    52.6 (1.29)   60.08 (0.71)  99.99 (0.01)
EE SVM    70.4 (1.3)      55.34 (1.28)  65.4 (0.51)   99.98 (0.01)
BL LR     69.24 (1.3)     55.34 (1.27)  63.84 (0.75)  99.99 (0.01)
EE LR     70.28 (1.28)    55.49 (1.49)  65.41 (0.63)  99.97 (0.02)
BL BeSim  67.49 (1.23)    55.19 (1.27)  63.7 (0.63)   99.99 (0.01)
EE BeSim  68.0 (1.21)     55.21 (1.15)  64.38 (0.42)  99.99 (0.0)
BL NB     65.21 (1.64)    52.93 (0.9)   59.75 (0.47)  98.69 (0.3)
EE NB     70.72 (1.15)    ×             63.46 (0.61)  99.92 (0.04)

          Adver(p = [])  Adver(p = 1)  CRF(p = [])    Bank(p = [])
BL SVM    96.37 (1.94)   91.18 (2.97)  64.36 (18.97)  66.82 (0.88)
EE SVM    97.63 (1.35)   93.3 (2.14)   86.35 (9.99)   71.54 (0.76)
BL LR     97.19 (1.44)   88.51 (1.93)  81.87 (19.63)  71.43 (0.72)
EE LR     97.57 (0.96)   93.02 (2.06)  86.84 (9.62)   71.77 (0.62)
BL BeSim  97.26 (1.12)   95.38 (1.35)  86.91 (9.36)   67.85 (0.67)
EE BeSim  97.38 (1.04)   93.83 (1.35)  87.02 (10.43)  70.41 (0.55)
BL NB     93.75 (1.9)    93.37 (1.9)   87.24 (9.38)   67.83 (0.63)
EE NB     94.04 (1.75)   ×             ×              []

          Flickr(p = 0.1)  Kdd(p = 0.5)  Average Rank
BL SVM    74.92 (0.17)     74.53 (0.05)  6.44 [7]
EE SVM    79.86 (0.13)     80.98 (0.05)  2.39 [1]
BL LR     79.03 (0.11)     81.29 (0.04)  4.28 [4]
EE LR     79.85 (0.13)     80.75 (0.05)  2.61 [2]
BL BeSim  74.62 (0.13)     74.95 (0.0)   5.11 [6]
EE BeSim  76.4 (0.13)      77.55 (0.03)  3.61 [3]
BL NB     81.36 (0.1)      74.29 (0.05)  6.5 [8]
EE NB     []               []            5.06 [5]


6 Conclusions and future research directions

We investigated the effects of sampling and cost-sensitive learning techniques on the problem of learning from imbalanced behaviour data. This type of big data has fundamentally different properties and internal structure than traditional data and is characterized by sparseness and very large dimensions. This is a new research topic which enables benefits across a wide variety of application domains, such as fraud detection, targeted advertising, predictive policing, churn prediction, default prediction, etc. To enable future developments in this field, we provide an on-line repository containing implementations, datasets and results, available at the following link: http://www.applieddatamining.com/cms/?q=software

Oversampling techniques need to be adapted in order to cope with this kind of data. We proposed new versions of the traditional SMOTE and ADASYN techniques, which differ in the way synthetic samples are generated and similarity computations are performed. The plain random oversampling (OSR) and the synthetic approaches are competitive methods in terms of AUC and do not seem to suffer from overfitting, as traditional studies report. Though the synthetic approaches have comparable AUC-performances in comparison to OSR, they are useless in practice due to their computationally expensive procedures. The OSR method is very fast for small and medium sized datasets, yet becomes inappropriate for larger datasets, because the training times of the underlying base learner increase drastically.

With respect to undersampling methodologies, we employed a random undersampling technique (RUS) and Knn-based informed approaches. The latter type selects majority class instances that are closest (Cl Knn) or farthest (Far Knn) from the minority class examples and allows one to investigate the effect of noise/outliers on SVM performance. We observed that these odd instances have a severe performance degrading effect. RUS and Far Knn approaches can significantly improve performance with respect to the baseline, thanks to their abilities to reduce noise/outlier levels and to put more emphasis on the minority class. Having said this, they usually perform worse than any of the oversampling approaches. This is in sheer contrast to the results obtained from traditional studies and is heavily related to the intrinsic nature of behaviour data. Timing-wise, RUS is the fastest method. The informed approaches are slower than RUS due to the required nearest neighbour computations. Additionally, a cluster-based undersampling (CBU) technique was proposed to alleviate the within-class imbalance. For low undersampling rates, the performance of CBU is superior to plain RUS, whereas for high undersampling rates the opposite is true. Overall, no substantial gain in performance is obtained in comparison with a simple RUS approach. Furthermore, the method is the slowest among all techniques considered and relies on computationally expensive clustering heuristics.

We studied the effects of boosting, where we made use of a very specific linear SVM and logistic regression combination that uses confidence rated predictions, instead of a more popular plain "discrete" boosting algorithm (with weak hypotheses having values in {−1, 1}). Our experiments clearly indicated the regularization constant C in the SVM formulation to act as a "weakness" indicator. Indeed, higher C-values cause stronger learners and shouldn't be used in the boosting process, which matches the analysis of studies dealing with traditional data. Strong learners can already quite accurately distinguish majority class from minority class behaviours. The boosting process will focus on the hard to learn instances, among which are noise/outliers, and this can result in overfitting behaviour. We also observed that boosting can usually outperform the plain baseline, indicating that a combination of weak learners can achieve higher AUC-performance than a single strong learner.

EasyEnsemble (EE) integrates random undersampling with a boosting process. By sampling several balanced subsets from the training data and combining these, we can emphasize the minority class, achieve noise/outlier reduction effects and simultaneously explore the majority class space efficiently. Our version differs from the original formulation because we employ a different type of underlying weak learner (an SVM and LR combination) that is fed to a confidence rated boosting algorithm (instead of the discrete version). We observed superior AUC-performance with respect to all of the aforementioned techniques. Furthermore, the method is very fast, since each subset can be trained in parallel and has a size that is twice as large as the minority class training set. We studied the effect of the number of subsets and noted that a limited amount of these (O(10)) is already sufficient to obtain satisfactory performances; increasing this number further causes only minor benefits.

Additionally, a statistical comparison of each of the proposed methods was conducted across a wide variety of datasets. Holm's significance test indicates that each of the proposed techniques (oversampling, undersampling (except Cl Knn), cost sensitive learning and boosting variants) can significantly outperform the plain baseline performance (each of the null-hypotheses of equal performance gets rejected with a confidence level of at least 91%). Comparison of the techniques with EE showed that the latter can significantly outperform all of them (each of the null-hypotheses of equal performance gets rejected with a confidence level of at least 75%). A timing-wise comparison showed EE to be the preferred method (after RUS) for large and highly imbalanced datasets. A parallel implementation would make EE the fastest method, even for medium sized datasets that show a high level of imbalance.

The influence of the chosen base learner was investigated by comparing the baseline to its EE-version for the SVM, LR, NB and BeSim inducers. Each of the EE-versions is able to improve upon the baseline performance for nearly all datasets, where base learners of the regularization type are superior because of their ability to control the strength of the learner.

Though we conducted an extensive benchmark study, there are still plenty of future research directions. First and most important, there is an urgent need to develop a more general theoretical framework to completely understand the effect of imbalance and why certain methods outperform others. Secondly, other types of mechanisms could be investigated in this setting: advanced sampling (Guo et al 2008) set-ups perform sampling based on preliminary classifications; one-class learning is a recognition based methodology where a discriminative boundary is learned around the training examples of a single class alone, and Raskutti and Kowalczyk (2004) demonstrated the superiority of the recognition based one-class SVMs over the discriminative two-class SVMs; other types of ensemble learning strategies (e.g. bagging and stacking) could be investigated; Batista et al (2004) note that combining over- and undersampling can be beneficial when data are highly imbalanced or show very few minority class instances; the ideas behind Extended Nearest Neighbour (Tang and He 2015) could guide us in the sampling process. As opposed to Knn, "it considers not only who are the nearest neighbours of a certain sample, but also who consider the particular sample as their nearest neighbour" (Tang and He 2015). The method assigns a class membership to an unknown sample that maximizes the intra-class coherence. Finally, more advanced techniques do exist that can provide the K (the number of nearest neighbours) nearest neighbours faster or with a (slightly) better performance.

Another issue relates to the lack of comprehensibility of the EasyEnsemble technique in its present form. Because we currently integrate Logistic Regression in the boosting process, the final ensemble classifier is non-linear and therefore incomprehensible. It might be interesting to investigate the numerical boosting technique proposed in Schapire and Singer (1999), where each weak learner is allowed to output real values (instead of values in [−1, 1]). In that case, we would be able to use a plain linear SVM as weak learner, without the Logistic Regression component. The final ensemble would be a linear model, which is surely easier to comprehend. Also, the instance-based approach of Martens and Provost (2014) could be used to explain individual predictions of the non-linear EE-technique.
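The confidence-rated reweighting at the heart of that scheme (Schapire and Singer 1999) is a one-liner per instance — a minimal sketch, where each margin is y_i · h_t(x_i) for a real-valued weak hypothesis h_t:

```python
import math

def update_dt(dt, margins):
    """Confidence-rated boosting update: D_{t+1}(i) is proportional to
    D_t(i) * exp(-y_i * h_t(x_i)), normalized by Z_t. Misclassified or
    low-confidence instances gain weight in the next round."""
    unnorm = [d * math.exp(-m) for d, m in zip(dt, margins)]
    z = sum(unnorm)  # normalization constant Z_t
    return [u / z for u in unnorm]

d_next = update_dt([0.5, 0.5], [1.0, -1.0])  # second instance was misclassified
```

Because h_t is real-valued, confidently correct predictions (large positive margins) shrink an instance's weight more aggressively than barely correct ones.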

A Oversampling experiments

Table 11 Oversampling experiments. Results show the average tenfold AUC test set performance with respect to increasing β-values (β = [β1, β2, β3, β4] = [0, 1/3, 2/3, 1]). Additionally, standard deviations are shown between brackets. Optimal performances for each method are highlighted in boldface. See Section 4.2 for parameter settings.

Mov G(p = 1)
        β1            β2            β3            β4
OSR     71.6 (2.62)   74.37 (2.04)  73.6 (1.84)   74.73 (2.45)
SMOTE   71.6 (2.62)   75.08 (2.18)  76.02 (2.14)  76.48 (2.3)
ADASYN  71.6 (2.62)   75.16 (1.92)  75.93 (2.08)  76.47 (2.29)

Mov G(p = 25)
        β1            β2            β3            β4
OSR     81.41 (1.32)  83.49 (1.81)  83.84 (1.96)  83.91 (2.04)
SMOTE   81.41 (1.32)  83.32 (1.97)  83.59 (2.04)  83.76 (2.11)
ADASYN  81.41 (1.32)  83.61 (1.82)  84.02 (1.97)  83.69 (1.96)

Mov Th(p = [])
        β1            β2            β3            β4
OSR     79.77 (5.33)  85.3 (4.66)   83.16 (4.5)   84.59 (5.69)
SMOTE   79.77 (5.33)  84.18 (6.51)  85.58 (5.97)  84.33 (5.75)
ADASYN  79.77 (5.33)  84.11 (6.77)  85.86 (6.06)  85.36 (5.13)

Yahoo A(p = 1)
        β1            β2            β3            β4
OSR     55.92 (2.97)  58.66 (3.27)  59.99 (2.28)  59.74 (1.78)
SMOTE   55.92 (2.97)  59.76 (2.62)  59.74 (2.67)  59.43 (2.4)
ADASYN  55.92 (2.97)  59.54 (2.53)  59.55 (2.94)  59.56 (2.22)

Yahoo A(p = 25)
        β1            β2            β3            β4
OSR     61.68 (2.42)  64.19 (3.17)  65.08 (3.26)  64.67 (2.1)
SMOTE   61.68 (2.42)  65.46 (3.63)  65.33 (3.23)  64.52 (2.98)
ADASYN  61.68 (2.42)  65.04 (3.74)  65.41 (3.47)  64.4 (2.21)

Continues on next page


Table 11 continued

Yahoo G(p = 1)
        β1            β2            β3            β4
OSR     66.84 (3.66)  72.18 (2.36)  73.11 (2.7)   72.49 (3.41)
SMOTE   66.84 (3.66)  72.65 (2.85)  73.27 (3.36)  73.37 (3.56)
ADASYN  66.84 (3.66)  72.87 (2.83)  73.18 (3.2)   73.39 (3.59)

Yahoo G(p = 25)
        β1            β2            β3            β4
OSR     78.82 (1.39)  78.78 (2.02)  78.74 (1.86)  78.35 (1.47)
SMOTE   78.82 (1.39)  79.23 (1.57)  79.1 (1.2)    79.03 (1.89)
ADASYN  78.82 (1.39)  79.12 (1.43)  79.23 (1.37)  79.51 (2.01)

TaFeng(p = 1)
        β1            β2            β3            β4
OSR     55.75 (1.6)   59.23 (1.96)  60.0 (1.68)   61.04 (2.36)
SMOTE   55.75 (1.6)   60.26 (1.95)  61.49 (1.8)   61.13 (1.52)
ADASYN  55.75 (1.6)   60.26 (1.9)   61.44 (1.85)  61.16 (1.5)

TaFeng(p = 25)
        β1            β2            β3            β4
OSR     66.94 (1.34)  68.84 (1.21)  66.99 (1.42)  67.7 (1.41)
SMOTE   66.94 (1.34)  68.47 (1.5)   67.07 (1.15)  66.65 (0.81)
ADASYN  66.94 (1.34)  68.62 (1.38)  67.85 (1.6)   66.91 (1.39)

Book(p = 1)
        β1            β2            β3            β4
OSR     52.6 (1.29)   53.61 (0.94)  55.41 (1.75)  55.87 (1.44)
SMOTE   52.6 (1.29)   54.77 (0.99)  54.91 (0.8)   54.36 (0.98)
ADASYN  52.6 (1.29)   54.86 (1.13)  55.06 (0.73)  54.54 (0.92)

Book(p = 25)
        β1            β2            β3            β4
OSR     60.08 (0.71)  61.05 (0.94)  61.82 (1.12)  64.62 (0.57)
SMOTE   60.08 (0.71)  62.6 (0.73)   60.95 (0.68)  63.0 (0.8)
ADASYN  60.08 (0.71)  62.33 (0.73)  60.77 (0.85)  63.04 (0.58)

LST(p = 1)
        β1            β2            β3            β4
OSR     99.99 (0.01)  99.99 (0.01)  99.99 (0.01)  99.99 (0.01)
SMOTE   99.99 (0.01)  99.99 (0.01)  99.99 (0.01)  99.99 (0.01)
ADASYN  99.99 (0.01)  99.99 (0.01)  99.99 (0.01)  99.99 (0.01)

Adver(p = [])
        β1            β2            β3            β4
OSR     96.61 (1.82)  97.31 (1.65)  97.07 (1.84)  97.07 (1.79)
SMOTE   96.61 (1.82)  96.91 (1.66)  97.19 (1.65)  97.07 (1.91)
ADASYN  96.61 (1.82)  97.1 (1.7)    97.08 (1.87)  97.07 (1.88)

Adver(p = 1)
        β1            β2            β3            β4
OSR     90.93 (3.02)  91.27 (3.03)  92.66 (2.82)  93.29 (1.97)
SMOTE   90.93 (3.02)  92.51 (2.03)  92.96 (2.14)  93.53 (1.81)
ADASYN  90.93 (3.02)  92.22 (2.33)  92.7 (2.36)   93.88 (1.73)

CRF(p = [])
        β1             β2             β3             β4
OSR     64.06 (16.43)  80.82 (12.94)  81.28 (12.27)  81.91 (11.28)
SMOTE   64.06 (16.43)  78.64 (16.86)  82.52 (13.74)  79.32 (16.26)
ADASYN  64.06 (16.43)  78.95 (16.72)  81.19 (16.32)  79.31 (16.19)

Bank(p = [])
        β1            β2            β3            β4
OSR     66.82 (0.88)  70.1 (0.74)   71.39 (0.8)   71.47 (0.8)
SMOTE   []            []            []            []
ADASYN  []            []            []            []


B Undersampling experiments

Table 12 Undersampling experiments. Results show the average tenfold AUC test set performance with respect to increasing βu-values (βu = [βu1, βu2, βu3, βu4, βu5] = [0, 1/4, 1/2, 3/4, 1]). Optimal performances (boldface) are highlighted for each imbalance ratio. Additionally, standard deviations are shown between brackets. Cl K represents the "Closest Knn" technique and Cl T the "Closest tot sim" technique (similarly for Far K and Far T, see Section 3.2). Parameter settings are given in Section 4.2.

Mov G(p = 1)
       βu1          βu2          βu3          βu4          βu5
RUS    71.6 (2.6)   71.83 (2.6)  72.54 (2.5)  72.39 (3.1)  70.61 (3.5)
Cl K   71.6 (2.6)   71.4 (2.0)   70.96 (1.9)  70.43 (2.4)  69.05 (4.1)
Cl T   71.6 (2.6)   70.28 (2.5)  66.74 (2.0)  66.8 (2.1)   68.18 (3.6)
Far K  71.6 (2.6)   72.36 (2.7)  71.26 (3.4)  66.57 (5.2)  53.5 (3.5)
Far T  71.6 (2.6)   72.22 (2.8)  71.63 (3.6)  64.28 (5.3)  50.88 (4.4)
CBU    72.55 (2.6)  73.28 (2.6)  73.12 (2.6)  73.84 (2.5)  73.0 (3.1)

Mov G(p = 25)
       βu1          βu2          βu3          βu4          βu5
RUS    81.41 (1.3)  81.36 (1.3)  81.78 (1.7)  82.05 (1.7)  81.6 (2.1)
Cl K   81.41 (1.3)  80.86 (1.2)  80.95 (1.6)  79.73 (2.3)  77.95 (2.3)
Cl T   81.41 (1.3)  79.9 (1.2)   78.21 (1.4)  77.87 (1.5)  77.76 (2.3)
Far K  81.41 (1.3)  80.9 (1.5)   78.17 (1.8)  74.25 (2.4)  69.79 (3.2)
Far T  81.41 (1.3)  80.86 (1.5)  77.2 (2.4)   71.16 (2.7)  62.4 (2.8)
CBU    81.53 (1.4)  81.64 (1.3)  81.29 (1.6)  81.28 (2.1)  80.34 (2.7)

Mov Th(p = [])
       βu1          βu2          βu3          βu4          βu5
RUS    79.77 (5.3)  80.32 (5.8)  81.57 (5.5)  81.86 (6.6)  81.26 (6.2)
Cl K   79.77 (5.3)  79.25 (4.5)  78.07 (5.0)  76.25 (6.5)  62.46 (8.5)
Cl T   79.77 (5.3)  78.4 (4.4)   72.41 (3.5)  64.66 (4.5)  60.37 (7.3)
Far K  79.77 (5.3)  84.54 (5.0)  83.64 (6.4)  80.02 (7.3)  56.82 (10.3)
Far T  79.77 (5.3)  85.03 (5.7)  82.68 (6.8)  75.61 (9.2)  56.77 (10.9)
CBU    80.11 (5.8)  81.17 (6.0)  81.08 (6.5)  84.17 (5.1)  80.96 (6.9)

Yahoo A(p = 1)
       βu1          βu2          βu3          βu4          βu5
RUS    55.92 (3.0)  55.57 (3.4)  56.44 (3.0)  55.83 (3.4)  56.37 (3.3)
Cl K   55.92 (3.0)  55.67 (2.4)  53.12 (2.0)  50.57 (1.8)  53.79 (3.5)
Cl T   55.92 (3.0)  55.69 (2.1)  53.35 (2.2)  50.31 (2.2)  52.35 (3.3)
Far K  55.92 (3.0)  57.35 (2.2)  56.92 (1.1)  56.95 (2.3)  51.18 (2.0)
Far T  55.92 (3.0)  56.93 (2.4)  54.74 (1.9)  57.01 (1.8)  51.18 (2.0)
CBU    58.21 (2.6)  58.45 (3.3)  58.31 (3.5)  58.39 (3.5)  56.09 (2.6)

Yahoo A(p = 25)
       βu1          βu2          βu3          βu4          βu5
RUS    61.68 (2.4)  62.9 (2.9)   63.62 (3.6)  63.75 (3.1)  63.19 (1.9)
Cl K   61.68 (2.4)  61.14 (2.1)  57.62 (1.6)  54.02 (1.8)  51.48 (1.4)
Cl T   61.68 (2.4)  60.89 (2.8)  58.11 (1.4)  54.4 (2.1)   51.76 (1.4)
Far K  61.68 (2.4)  63.96 (3.0)  62.62 (2.2)  59.61 (1.5)  56.25 (1.6)
Far T  61.68 (2.4)  63.71 (2.4)  59.72 (1.6)  57.27 (1.1)  54.47 (1.1)
CBU    62.46 (2.6)  61.85 (1.4)  61.78 (2.2)  59.94 (3.0)  60.1 (4.0)

Continues on next page


Table 12 continuedYahoo G(p = 1)

βu1 βu2 βu3 βu4 βu5RUS 6684(37) 6785(32) 6836(32) 6823(4) 699(42)Cl K 6684(37) 6671(28) 643(36) 6198(39) 6115(19)CL T 6684(37) 6579(27) 6355(33) 5921(35) 6108(24)Far K 6684(37) 6676(41) 6384(34) 6516(2) 485(29)Far T 6684(37) 6695(41) 6348(29) 6516(2) 4848(29)CBU 6968(41) 7059(32) 7064(37) 702(29) 6335(36)

Yahoo G(p = 25)βu1 βu2 βu3 βu4 βu5

RUS 7882(14) 7891(16) 7897(16) 7861(16) 7782(21)Cl K 7882(14) 7726(15) 7252(15) 6786(2) 6507(27)CL T 7882(14) 7683(1) 7199(18) 6715(23) 611(27)Far K 7882(14) 7826(22) 7469(27) 6722(21) 6072(23)Far T 7882(14) 7768(26) 7244(3) 6494(24) 596(2)CBU 7525(32) 7522(24) 7469(23) 7307(24) 7069(24)

TaFeng (p = 1)   βu1   βu2   βu3   βu4   βu5

RUS    5575(16) 561(16) 5626(17) 5723(17) 5925(22)
Cl K   5575(16) 5568(16) 5558(15) 5508(11) 5105(15)
Cl T   5575(16) 5567(16) 5447(16) 4753(16) 493(11)
Far K  5575(16) 5899(12) 5947(11) 6004(12) 5631(1)
Far T  5575(16) 5892(13) 5925(13) 5858(11) 5631(1)
CBU    578(1) 5847(11) 5815(09) 5887(14) 5765(16)

TaFeng (p = 25)   βu1   βu2   βu3   βu4   βu5

RUS    6694(13) 6744(13) 681(14) 6827(14) 6613(12)
Cl K   6694(13) 6613(14) 6339(12) 5983(13) 5694(07)
Cl T   6694(13) 6638(15) 6289(16) 5746(13) 5456(13)
Far K  6694(13) 6806(14) 6643(16) 6446(15) 6335(13)
Far T  6694(13) 6431(11) 6269(1) 6127(11) 5903(1)
CBU    6481(12) 6415(11) 6413(12) 6388(08) 6346(08)

Book (p = 1)   βu1   βu2   βu3   βu4   βu5

RUS    526(13) 5279(09) 5346(08) 5389(09) 5405(09)
Cl K   526(13) 5256(12) 5252(13) 5239(11) 5309(11)
Cl T   526(13) 5256(12) 5252(13) 5239(11) 5305(07)
Far K  526(13) 5521(12) 5621(18) 5614(12) 5306(1)
Far T  526(13) 5521(12) 5621(18) 5614(12) 5306(1)
CBU    5428(09) 5377(1) 5333(11) 5334(09) 5284(08)

Book (p = 25)   βu1   βu2   βu3   βu4   βu5

RUS    6008(07) 6013(06) 604(08) 6033(08) 6328(08)
Cl K   6008(07) 5996(07) 6013(08) 5996(1) 5928(07)
Cl T   6008(07) 5996(07) 6013(08) 6029(04) 545(09)
Far K  6008(07) 6329(1) 6419(08) 573(11) 5566(11)
Far T  6008(07) 6214(05) 5827(06) 5637(1) 5566(11)
CBU    5482(09) 5467(09) 5471(09) 5466(1) 5478(09)

Imbalanced classification in sparse and large behaviour datasets 47

Table 12 continued

LST (p = 1)   βu1   βu2   βu3   βu4   βu5

RUS    9999(0) 9999(0) 9999(0) 9998(0) 9999(0)
Cl K   9999(0) 9999(0) 9999(0) 9999(0) 9999(0)
Cl T   9999(0) 9999(0) 9999(0) 9999(0) 9998(0)
Far K  9999(0) 9998(0) 9998(0) 9998(0) 9998(0)
Far T  9999(0) 9998(0) 9998(0) 9998(0) 9998(0)
CBU    [] [] [] [] []

Adver (p = [])   βu1   βu2   βu3   βu4   βu5

RUS    9661(18) 9632(18) 9663(14) 9712(21) 9622(16)
Cl K   9661(18) 9644(15) 9614(15) 9604(2) 948(25)
Cl T   9661(18) 9587(21) 9432(19) 9301(22) 9072(23)
Far K  9661(18) 9653(14) 9576(2) 9439(18) 9049(31)
Far T  9661(18) 9654(15) 9567(19) 9454(18) 893(28)
CBU    9685(23) 9685(23) 9705(15) 966(16) 9606(21)

Adver (p = 1)   βu1   βu2   βu3   βu4   βu5

RUS    9093(3) 9153(31) 9237(34) 919(29) 9193(22)
Cl K   9093(3) 9064(3) 8987(39) 9021(36) 8918(2)
Cl T   9093(3) 897(35) 8855(34) 8576(33) 882(23)
Far K  9093(3) 938(23) 924(26) 8873(34) 8551(4)
Far T  9093(3) 9362(24) 932(22) 8841(36) 8551(4)
CBU    9322(24) 9376(25) 9389(26) 9352(27) 9127(2)

CRF (p = [])   βu1   βu2   βu3   βu4   βu5

RUS    6406(164) 6328(159) 6798(174) 6695(219) 8773(88)
Cl K   6406(164) 6244(166) 6234(169) 7137(138) 7822(177)
Cl T   6406(164) 6244(166) 6234(169) 7137(138) 6267(229)
Far K  6406(164) 838(142) 8393(148) 8449(137) 8611(97)
Far T  6406(164) 838(142) 8393(148) 8449(137) 8611(97)
CBU    [] [] [] [] []

Bank (p = [])   βu1   βu2   βu3   βu4   βu5

RUS    6682(09) 6702(09) 6737(08) 6799(06) 695(1)
Cl K   6682(09) 6617(07) 6524(06) 6486(06) 5853(11)
Cl T   6682(09) 6492(11) 6069(09) 5633(08) 5287(07)
Far K  6682(09) 6695(06) 6619(06) 6442(06) 5825(11)
Far T  6682(09) 6716(06) 642(08) 5967(1) 5825(11)
CBU    [] [] [] [] []


C Boosting experiments

This section presents the results of each of the boosting experiments from Section 4.5 on the remaining data sources. Each of the figures below depicts the average tenfold AUC performance on test data (with µ = 100) with respect to the number of boosting iterations for: (left) AdaBoost (AB), AdaCost (AC) and EasyEnsemble (EE), with C chosen according to the highest validation-set AUC performance (over all possible boosting rounds); and (right) AB and EE (with S = 15) at varying C levels.

[Figure: test AUC (y-axis) versus boosting round T (x-axis); panel (a) AB, AC and EE variants with BL, panel (b) AB and EE at varying C with BL]
Fig 6 Mov G (p = 1) dataset

[Figure: test AUC (y-axis) versus boosting round T (x-axis); panel (a) AB, AC and EE variants with BL, panel (b) AB and EE at varying C with BL]
Fig 7 Mov Th (p = []) dataset


[Figure: test AUC (y-axis) versus boosting round T (x-axis); panel (a) AB, AC and EE variants with BL, panel (b) AB and EE at varying C with BL]
Fig 8 Yahoo A (p = 1) dataset

[Figure: test AUC (y-axis) versus boosting round T (x-axis); panel (a) AB, AC and EE variants with BL, panel (b) AB and EE at varying C with BL]
Fig 9 Yahoo A (p = 25) dataset

[Figure: test AUC (y-axis) versus boosting round T (x-axis); panel (a) AB, AC and EE variants with BL, panel (b) AB and EE at varying C with BL]
Fig 10 Yahoo G (p = 1) dataset


[Figure: test AUC (y-axis) versus boosting round T (x-axis); panel (a) AB, AC and EE variants with BL, panel (b) AB and EE at varying C with BL]
Fig 11 Yahoo G (p = 25) dataset

[Figure: test AUC (y-axis) versus boosting round T (x-axis); panel (a) AB, AC and EE variants with BL, panel (b) AB and EE at varying C with BL]
Fig 12 TaFeng (p = 1) dataset

[Figure: test AUC (y-axis) versus boosting round T (x-axis); panel (a) AB, AC and EE variants with BL, panel (b) AB and EE at varying C with BL]
Fig 13 Book (p = 1) dataset


[Figure: test AUC (y-axis) versus boosting round T (x-axis); panel (a) AB, AC and EE variants with BL, panel (b) AB and EE at varying C with BL]
Fig 14 LST (p = 1) dataset

[Figure: test AUC (y-axis) versus boosting round T (x-axis); panel (a) AB, AC and EE variants with BL, panel (b) AB and EE at varying C with BL]
Fig 15 Adver (p = []) dataset

[Figure: test AUC (y-axis) versus boosting round T (x-axis); panel (a) AB, AC and EE variants with BL, panel (b) AB and EE at varying C with BL]
Fig 16 Adver (p = 1) dataset


[Figure: test AUC (y-axis) versus boosting round T (x-axis); panel (a) AB, AC and EE variants with BL, panel (b) AB and EE at varying C with BL]
Fig 17 CRF (p = []) dataset

D Final Comparison

[Figure: scatter of average rank AUC (x-axis) versus average rank Time (y-axis); legend: BL, OSR, SMOTE, ADASYN, RUS, Cl Knn, Far Knn, CBU, AB, AC(R=28), AC(R=RL), EE(S=10), EE(S=15), EE par]
Fig 18 Average rank AUC versus average rank Time (see Table 9) across the large datasets from Table 2 (CRF and Bank). Regarding the former, an extra row EE par is added, having the same AUC value as EE (S = 15). Ranks for each dataset are subsequently obtained via the procedure outlined in Section 4.6.1. Points occurring in the upper-right region are preferred.


References

Akbani R, Kwek S, Japkowicz N (2004) Applying support vector machines to imbalanced datasets. In: Machine Learning: ECML 2004, 15th European Conference on Machine Learning, Pisa, Italy, September 20-24, 2004, Proceedings. Springer Berlin Heidelberg, Berlin, Heidelberg, pp 39–50. DOI 10.1007/978-3-540-30115-8_7

Ali A, Shamsuddin SM, Ralescu AL (2015) Classification with class imbalance problem: A review. International Journal of Advances in Soft Computing and its Applications 7(3):176–204

Alzahrani T, Horadam KJ (2016) Community detection in bipartite networks: Algorithms and case studies. In: Complex Systems and Networks: Dynamics, Controls and Applications. Springer Berlin Heidelberg, Berlin, Heidelberg, pp 25–50. DOI 10.1007/978-3-662-47824-0_2

Bachner J (2013) Predictive policing: Preventing crime with data and analytics. IBM Center for the Business of Government

Baesens B, Van Gestel T, Viaene S, Stepanova M, Suykens J, Vanthienen J (2003) Benchmarking state-of-the-art classification algorithms for credit scoring. Journal of the Operational Research Society 54(6):627–635. DOI 10.1057/palgrave.jors.2601545

Barandela R, Sánchez J, García V, Rangel E (2003) Strategies for learning in class imbalance problems. Pattern Recognition 36(3):849–851. DOI 10.1016/S0031-3203(02)00257-1

Barber MJ (2007) Modularity and community detection in bipartite networks. Physical Review E 76:066102. DOI 10.1103/PhysRevE.76.066102

Barua S, Islam MM, Yao X, Murase K (2014) MWMOTE: majority weighted minority oversampling technique for imbalanced data set learning. IEEE Transactions on Knowledge and Data Engineering 26(2):405–425. DOI 10.1109/TKDE.2012.232

Batista GEAPA, Prati RC, Monard MC (2004) A study of the behavior of several methods for balancing machine learning training data. SIGKDD Explor Newsl 6(1):20–29. DOI 10.1145/1007730.1007735

Beckett SJ (2016) Improved community detection in weighted bipartite networks. Royal Society Open Science 3(1). DOI 10.1098/rsos.140536

Bekkar M, Djemaa HK, Alitouche TA (2013) Evaluation measures for models assessment over imbalanced data sets. Journal of Information Engineering and Applications 3(10):27–38

Bhattacharyya S, Jha S, Tharakunnel K, Westland JC (2011) Data mining for credit card fraud: A comparative study. Decision Support Systems 50(3):602–613. DOI 10.1016/j.dss.2010.08.008

Blondel VD, Guillaume JL, Lambiotte R, Lefebvre E (2008) Fast unfolding of communities in large networks. Journal of Statistical Mechanics: Theory and Experiment 2008(10):P10008

Breiman L, Friedman J, Stone CJ, Olshen RA (1984) Classification and regression trees. Taylor & Francis

Brozovsky L, Petricek V (2007) Recommender system for online dating service. In: Proceedings of Znalosti 2007 Conference, VSB Ostrava

Cha M, Mislove A, Gummadi KP (2009) A measurement-driven analysis of information propagation in the Flickr social network. In: Proceedings of the 18th International Conference on World Wide Web, ACM, New York, NY, USA, WWW '09, pp 721–730. DOI 10.1145/1526709.1526806

Chawla NV (2005) Data mining for imbalanced datasets: An overview. In: Data Mining and Knowledge Discovery Handbook. Springer US, Boston, MA, pp 853–867

Chawla NV, Bowyer KW, Hall LO, Kegelmeyer WP (2002) SMOTE: synthetic minority over-sampling technique. Journal of Artificial Intelligence Research 16:321–357

Chawla NV, Lazarevic A, Hall LO, Bowyer KW (2003) SMOTEBoost: Improving prediction of the minority class in boosting. In: Knowledge Discovery in Databases: PKDD 2003. Springer Berlin Heidelberg, Berlin, Heidelberg, pp 107–119

Chawla NV, Japkowicz N, Kotcz A (2004) Editorial: Special issue on learning from imbalanced data sets. SIGKDD Explor Newsl 6(1):1–6. DOI 10.1145/1007730.1007733

Chen M, Mao S, Liu Y (2014) Big data: A survey. Mobile Networks and Applications 19(2):171–209. DOI 10.1007/s11036-013-0489-0

Chyi YM (2003) Classification analysis techniques for skewed class distribution problems. Master thesis, Department of Information Management, National Sun Yat-Sen University

Demšar J (2006) Statistical comparisons of classifiers over multiple data sets. Journal of Machine Learning Research 7(Jan):1–30

Dhillon IS (2001) Co-clustering documents and words using bipartite spectral graph partitioning. In: Proceedings of the Seventh ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, ACM, New York, NY, USA, KDD '01, pp 269–274. DOI 10.1145/502512.502550


Drummond C, Holte RC (2003) C4.5, class imbalance, and cost sensitivity: why under-sampling beats over-sampling. In: Proceedings of the ICML '03 Workshop on Learning from Imbalanced Datasets

Fan RE, Chang KW, Hsieh CJ, Wang XR, Lin CJ (2008) LIBLINEAR: A library for large linear classification. Journal of Machine Learning Research 9:1871–1874

Fan W, Stolfo SJ, Zhang J, Chan PK (1999) AdaCost: Misclassification cost-sensitive boosting. In: Proceedings of the Sixteenth International Conference on Machine Learning, Morgan Kaufmann Publishers Inc., San Francisco, CA, USA, ICML '99, pp 97–105

Fawcett T (2006) An introduction to ROC analysis. Pattern Recognition Letters 27(8):861–874. DOI 10.1016/j.patrec.2005.10.010

Finch H (2005) Comparison of distance measures in cluster analysis with dichotomous data. Journal of Data Science 3(1):85–100

Fortunato S (2010) Community detection in graphs. Physics Reports 486(3-5):75–174. DOI 10.1016/j.physrep.2009.11.002

Junque de Fortuny E, Martens D, Provost F (2014a) Predictive modeling with big data: is bigger really better? Big Data 1(4):215–226. DOI 10.1089/big.2013.0037

Junque de Fortuny E, Stankova M, Moeyersoms J, Minnaert B, Provost F, Martens D (2014b) Corporate residence fraud detection. In: Proceedings of the 20th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, ACM, New York, NY, USA, KDD '14, pp 1650–1659. DOI 10.1145/2623330.2623333

Frasca M, Bertoni A, Re M, Valentini G (2013) A neural network algorithm for semi-supervised node label learning from unbalanced data. Neural Networks 43:84–98. DOI 10.1016/j.neunet.2013.01.021

Friedman M (1937) The use of ranks to avoid the assumption of normality implicit in the analysis of variance. Journal of the American Statistical Association 32(200):675–701

García E, Lozano F (2007) Boosting support vector machines. In: Machine Learning and Data Mining in Pattern Recognition, 5th International Conference, MLDM 2007, Leipzig, Germany, July 18-20, Post Proceedings. IBaI Publishing, pp 153–167

Goldstein M, Uchida S (2016) A comparative evaluation of unsupervised anomaly detection algorithms for multivariate data. PLOS ONE 11(4):1–31. DOI 10.1371/journal.pone.0152173

González PC, Velásquez JD (2013) Characterization and detection of taxpayers with false invoices using data mining techniques. Expert Systems with Applications 40(5):1427–1436. DOI 10.1016/j.eswa.2012.08.051

Guimerà R, Sales-Pardo M, Amaral LAN (2007) Module identification in bipartite and directed networks. Physical Review E 76:036102. DOI 10.1103/PhysRevE.76.036102

Guo H, Viktor HL (2004) Learning from imbalanced data sets with boosting and data generation: The DataBoost-IM approach. SIGKDD Explor Newsl 6(1):30–39. DOI 10.1145/1007730.1007736

Guo X, Yin Y, Dong C, Yang G, Zhou G (2008) On the class imbalance problem. In: 2008 Fourth International Conference on Natural Computation, IEEE, vol 4, pp 192–201. DOI 10.1109/ICNC.2008.871

Han H, Wang WY, Mao BH (2005) Borderline-SMOTE: A new over-sampling method in imbalanced data sets learning. In: Advances in Intelligent Computing. Springer Berlin Heidelberg, Berlin, Heidelberg, pp 878–887

He H, Garcia EA (2009) Learning from imbalanced data. IEEE Transactions on Knowledge and Data Engineering 21(9):1263–1284. DOI 10.1109/TKDE.2008.239

He H, Bai Y, Garcia EA, Li S (2008) ADASYN: Adaptive synthetic sampling approach for imbalanced learning. In: 2008 IEEE International Joint Conference on Neural Networks (IEEE World Congress on Computational Intelligence), IEEE, pp 1322–1328. DOI 10.1109/IJCNN.2008.4633969

Holm S (1979) A simple sequentially rejective multiple test procedure. Scandinavian Journal of Statistics 6(2):65–70

Hsu CW, Lin CJ (2002) A comparison of methods for multiclass support vector machines. IEEE Transactions on Neural Networks 13(2):415–425. DOI 10.1109/72.991427

Huang A (2008) Similarity measures for text document clustering. In: Proceedings of the Sixth New Zealand Computer Science Research Student Conference (NZCSRSC2008), Christchurch, New Zealand, pp 49–56

Iman RL, Davenport JM (1980) Approximations of the critical region of the Friedman statistic. Communications in Statistics: Theory and Methods 9(6):571–595

Jo T, Japkowicz N (2004) Class imbalances versus small disjuncts. ACM SIGKDD Explor Newsl 6(1):40–49. DOI 10.1145/1007730.1007737


Jutla IS, Jeub LG, Mucha PJ (2011-2016) A generalized Louvain method for community detection implemented in MATLAB. URL http://netwiki.amath.unc.edu/GenLouvain

Kubat M, Matwin S (1997) Addressing the curse of imbalanced training sets: One-sided selection. In: Proceedings of the Fourteenth International Conference on Machine Learning, Morgan Kaufmann Publishers Inc., San Francisco, CA, USA, pp 179–186

Lancichinetti A, Fortunato S (2009) Community detection algorithms: A comparative analysis. Physical Review E 80:056117. DOI 10.1103/PhysRevE.80.056117

Larremore DB, Clauset A, Jacobs AZ (2014) Efficiently inferring community structure in bipartite networks. Physical Review E: Statistical, Nonlinear, and Soft Matter Physics 90:012805. DOI 10.1103/PhysRevE.90.012805

Li J, Fine JP (2010) Weighted area under the receiver operating characteristic curve and its application to gene selection. Journal of the Royal Statistical Society: Series C (Applied Statistics) 59(4):673–692. DOI 10.1111/j.1467-9876.2010.00713.x

Li X, Wang L, Sung E (2008) AdaBoost with SVM-based component classifiers. Engineering Applications of Artificial Intelligence 21(5):785–795. DOI 10.1016/j.engappai.2007.07.001

Lichman M (2013) UCI machine learning repository. URL http://archive.ics.uci.edu/ml

Liu W, Chawla S, Cieslak DA, Chawla NV (2010) A robust decision tree algorithm for imbalanced data sets. In: Proceedings of the Tenth SIAM International Conference on Data Mining, SIAM, Philadelphia, vol 10, pp 766–777

Liu XY, Wu J, Zhou ZH (2009) Exploratory undersampling for class-imbalance learning. IEEE Transactions on Systems, Man, and Cybernetics, Part B (Cybernetics) 39(2):539–550. DOI 10.1109/TSMCB.2008.2007853

Luts J, Ojeda F, Van de Plas R, De Moor B, Van Huffel S, Suykens JA (2010) A tutorial on support vector machine-based methods for classification problems in chemometrics. Analytica Chimica Acta 665(2):129–145. DOI 10.1016/j.aca.2010.03.030

Macskassy SA, Provost F (2007) Classification in networked data: A toolkit and a univariate case study. Journal of Machine Learning Research 8(May):935–983

Martens D, Provost F (2014) Explaining data-driven document classifications. MIS Quarterly 38(1):73–100

Martens D, Provost F, Clark J, Junque de Fortuny E (2016) Mining massive fine-grained behavior data to improve predictive analytics. MIS Quarterly 40(4):869–888

Mazurowski MA, Habas PA, Zurada JM, Lo JY, Baker JA, Tourassi GD (2008) Training neural network classifiers for medical decision making: The effects of imbalanced datasets on classification performance. Neural Networks 21(2-3):427–436. DOI 10.1016/j.neunet.2007.12.031

Mease D, Wyner AJ, Buja A (2007) Boosted classification trees and class probability/quantile estimation. Journal of Machine Learning Research 8:409–439

Nemenyi P (1963) Distribution-free Multiple Comparisons. Dissertation, Princeton University

Newman MEJ, Girvan M (2004) Finding and evaluating community structure in networks. Physical Review E 69:026113. DOI 10.1103/PhysRevE.69.026113

Ng AY (2004) Feature selection, L1 vs. L2 regularization, and rotational invariance. In: Proceedings of the Twenty-first International Conference on Machine Learning, ACM, New York, NY, USA, ICML '04, pp 78–. DOI 10.1145/1015330.1015435

Ng AY, Jordan MI (2002) On discriminative vs. generative classifiers: A comparison of logistic regression and naive Bayes. In: Advances in Neural Information Processing Systems 14, MIT Press, pp 841–848

Ngai E, Hu Y, Wong Y, Chen Y, Sun X (2011) The application of data mining techniques in financial fraud detection: A classification framework and an academic review of literature. Decision Support Systems 50(3):559–569. DOI 10.1016/j.dss.2010.08.006

Platt JC (1999) Probabilistic outputs for support vector machines and comparisons to regularized likelihood methods. In: Advances in Large-Margin Classifiers, MIT Press, pp 61–74

Porter MA, Onnela JP, Mucha PJ (2009) Communities in networks. Notices of the American Mathematical Society 56(9):1082–1097

Provost F, Fawcett T (2013) Data Science for Business: What you need to know about data mining and data-analytic thinking. O'Reilly Media, Inc.

Provost F, Dalessandro B, Hook R, Zhang X, Murray A (2009) Audience selection for on-line brand advertising: Privacy-friendly social network targeting. In: Proceedings of the 15th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, ACM, New York, NY, USA, KDD '09, pp 707–716. DOI 10.1145/1557019.1557098


Raskutti B, Kowalczyk A (2004) Extreme re-balancing for SVMs: A case study. SIGKDD Explor Newsl 6(1):60–69. DOI 10.1145/1007730.1007739

Rosvall M, Bergstrom CT (2008) Maps of random walks on complex networks reveal community structure. Proceedings of the National Academy of Sciences 105(4):1118–1123. DOI 10.1073/pnas.0706851105

Schapire RE (1999) A brief introduction to boosting. In: Proceedings of the 16th International Joint Conference on Artificial Intelligence, Volume 2, Morgan Kaufmann Publishers Inc., San Francisco, CA, USA, IJCAI '99, pp 1401–1406

Schapire RE, Singer Y (1999) Improved boosting algorithms using confidence-rated predictions. Machine Learning 37(3):297–336. DOI 10.1023/A:1007614523901

Shmueli G (2017) Analyzing behavioral big data: Methodological, practical, ethical, and moral issues. Quality Engineering 29(1):57–74. DOI 10.1080/08982112.2016.1210979

Sobhani P, Viktor H, Matwin S (2015) Learning from imbalanced data using ensemble methods and cluster-based undersampling. In: New Frontiers in Mining Complex Patterns: Third International Workshop, NFMCP 2014, Held in Conjunction with ECML-PKDD 2014, Nancy, France, September 19, 2014, Revised Selected Papers. Springer International Publishing, Cham, pp 69–83. DOI 10.1007/978-3-319-17876-9_5

Stankova M (2016) Classification within network data with a bipartite structure. Dissertation, University of Antwerp

Stankova M, Martens D, Provost F (2015) Classification over bipartite graphs through projection. Working Papers 2015001, University of Antwerp, Faculty of Applied Economics

Sun Y, Kamel MS, Wong AK, Wang Y (2007) Cost-sensitive boosting for classification of imbalanced data. Pattern Recognition 40(12):3358–3378. DOI 10.1016/j.patcog.2007.04.009

Suykens JA, Van Gestel T, De Brabanter J, De Moor B, Vandewalle J (2002) Least squares support vector machines. World Scientific, Singapore

Tang B, He H (2015) ENN: Extended nearest neighbor method for pattern recognition [research frontier]. IEEE Computational Intelligence Magazine 10(3):52–60. DOI 10.1109/MCI.2015.2437512

Tang Y, Zhang YQ, Chawla NV, Krasser S (2009) SVMs modeling for highly imbalanced classification. IEEE Transactions on Systems, Man, and Cybernetics, Part B (Cybernetics) 39(1):281–288. DOI 10.1109/TSMCB.2008.2002909

Tobback E, Moeyersoms J, Stankova M, Martens D (2016) Bankruptcy prediction for SMEs using relational data. Working Paper 2016004, University of Antwerp, Faculty of Applied Economics

Verbeke W, Dejaeger K, Martens D, Hur J, Baesens B (2012) New insights into churn prediction in the telecommunication sector: A profit driven data mining approach. European Journal of Operational Research 218(1):211–229. DOI 10.1016/j.ejor.2011.09.031

Veropoulos K, Campbell I, Cristianini N (1999) Controlling the sensitivity of support vector machines. In: Proceedings of the International Joint Conference on Artificial Intelligence, Stockholm, Sweden (IJCAI99), pp 55–60

Whitrow C, Hand DJ, Juszczak P, Weston D, Adams NM (2009) Transaction aggregation as a strategy for credit card fraud detection. Data Mining and Knowledge Discovery 18(1):30–55. DOI 10.1007/s10618-008-0116-z

Wickramaratna J, Holden SB, Buxton BF (2001) Performance degradation in boosting. In: Proceedings of the Second International Workshop on Multiple Classifier Systems, Springer, London, UK, MCS '01, pp 11–21

Yen SJ, Lee YS (2009) Cluster-based under-sampling approaches for imbalanced data distributions. Expert Systems with Applications 36(3, Part 1):5718–5727. DOI 10.1016/j.eswa.2008.06.108

Yu HF, Lo HY, Hsieh HP, Lou JK, McKenzie TG, Chou JW, Chung PH, Ho CH, Chang CF, Wei YH, et al (2010) Feature engineering and classifier ensemble for KDD Cup 2010. In: Proceedings of the KDD Cup 2010 Workshop, pp 1–16

Zha H, He X, Ding C, Simon H, Gu M (2001) Bipartite graph partitioning and data clustering. In: Proceedings of the Tenth International Conference on Information and Knowledge Management, ACM, New York, NY, USA, CIKM '01, pp 25–32. DOI 10.1145/502585.502591

Zhang J, Mani I (2003) kNN approach to unbalanced data distributions: A case study involving information extraction. In: Proceedings of the ICML 2003 Workshop on Learning from Imbalanced Datasets, Washington DC

Ziegler CN, McNee SM, Konstan JA, Lausen G (2005) Improving recommendation lists through topic diversification. In: Proceedings of the 14th International Conference on World Wide Web, ACM, New York, NY, USA, WWW '05, pp 22–32. DOI 10.1145/1060745.1060754


$r_{AC} = \sum_{i=1}^{m} D_t(i)\, y_i\, h_t(x_i)\, \beta(i)$. Note that the checks to stop the boosting process prematurely (see the second bullet in Algorithm 3) are still based on the $r$-value obtained from AdaBoost ($r_{AB}$). This is because $\beta \in [0,1]$.
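The weighted statistic above, together with the weak-learner weight $\alpha_t = \frac{1}{2}\ln\frac{1+r}{1-r}$ of confidence-rated boosting (Schapire and Singer 1999), can be sketched as follows. The function names are illustrative, and β(i) defaults to one so that the plain AdaBoost value r_AB is recovered:

```python
import numpy as np

def boosting_r(D, y, h, beta=None):
    # r = sum_i D_t(i) * y_i * h_t(x_i) * beta(i); beta = 1 gives r_AB
    beta = np.ones_like(y, dtype=float) if beta is None else beta
    return float(np.sum(D * y * h * beta))

def alpha_from_r(r):
    # confidence-rated AdaBoost weight for a weak learner with statistic r
    return 0.5 * np.log((1.0 + r) / (1.0 - r))
```

With uniform weights and a weak learner that is half-confident but always correct in sign, r equals 0.5 and alpha equals ½ ln 3.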

In the first boosting round of AdaCost, the weighted SVM formulation (3) is equivalent to solving an SVM formulation with the following goal function:

$$\min_{w,b,\xi_i} \; \frac{w^T w}{2} + \frac{C^+}{m^+} \sum_{i \mid y_i = 1} \xi_i + \frac{C^-}{m^-} \sum_{i \mid y_i = -1} \xi_i \qquad (4)$$

where $C^+ / C^- = R$. This can be seen as a cost-sensitive version of an SVM, an idea that was initially proposed by Veropoulos et al (1999).
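A minimal sketch of such a cost-sensitive linear SVM, assuming scikit-learn's per-class C weighting; the toy data, the C value and the ratio R = 10 are illustrative assumptions (the additional normalisation by the class sizes m+ and m− could be folded into the class weights as well):

```python
import numpy as np
from sklearn.svm import LinearSVC

rng = np.random.default_rng(0)
# toy imbalanced data: 200 majority (-1) versus 20 minority (+1) instances
X = np.vstack([rng.normal(0.0, 1.0, size=(200, 5)),
               rng.normal(1.5, 1.0, size=(20, 5))])
y = np.array([-1] * 200 + [1] * 20)

R = 10.0  # assumed cost ratio C+/C-: minority slack penalized R times harder
clf = LinearSVC(C=1.0, class_weight={1: R, -1: 1.0}, max_iter=10000)
clf.fit(X, y)
print(clf.score(X, y))
```

The class_weight dictionary multiplies C per class, which is exactly the C+/C− asymmetry of the Veropoulos-style formulation.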

3.3.3 EasyEnsemble

One of the disadvantages of the random undersampling method is the fact that we are discarding potentially valuable information. EasyEnsemble (Liu et al 2009) is a method that combines several balanced subsets (S in total) containing randomly selected majority class instances together with all minority examples. Each subset contains the same amount of instances from both classes and is fed to the boosting algorithm presented in Algorithm 3. Afterwards, the weak learners $h_{st}$ of each subset $s$ are simply combined to form the final ensemble:

$$H(x) = \operatorname{sign}\left( \sum_{s=1}^{S} \sum_{t=1}^{T} \alpha_{st} h_{st}(x) \right), \quad s = 1,\ldots,S, \; t = 1,\ldots,T \qquad (5)$$

It is clear that this technique benefits from a combination of bagging and boosting (Liu et al 2009). Note that we apply the same boosting algorithm as previously described to each of the balanced subsets. However, when $r_{AB} = 1$ in the first round of boosting, we quit the boosting process, put $\alpha_1 = 1$ and continue to use the trained LR-model in the final ensemble. It was previously noted that this can cause overfitting, in the sense that it can point to a wrong threshold. However, the LR-model still contains information, and the combination thereof with the models obtained from the other subsets remains valuable (as we have noted from initial experiments by comparing the situation where we include or reject those subsets).

There are a few subtle though important differences with respect to the experiments performed by Liu et al (2009). First of all, the authors use a CART algorithm (Breiman et al 1984) as base learner. We employ a linear SVM with subsequent LR as a weak learner. To our knowledge, this combination has not been proposed elsewhere, yet proves to be very efficient in this setting. Secondly, their methodology employs a discrete version of AdaBoost, where each weak learner outputs binary values in $\{-1,1\}$. As already stated, we make use of an improved version of AdaBoost that relies on confidence-rated predictions (Schapire and Singer 1999). Finally, the authors reported performances with fixed levels of S = 4 and T = 10. We will investigate the effect of varying S and T levels.
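The linear-SVM-plus-LR weak learner could be sketched as below, assuming scikit-learn. The class name and default C are illustrative assumptions; the LR step plays the role of Platt-style scaling of the SVM decision values, yielding the confidence-rated output in [-1, 1] that real AdaBoost expects:

```python
import numpy as np
from sklearn.svm import LinearSVC
from sklearn.linear_model import LogisticRegression

class SvmLrWeakLearner:
    """Linear SVM whose decision values are calibrated by a subsequent LR."""

    def __init__(self, C=1e-3):
        self.svm = LinearSVC(C=C, max_iter=10000)
        self.lr = LogisticRegression()

    def fit(self, X, y, sample_weight=None):
        self.svm.fit(X, y, sample_weight=sample_weight)
        # Platt-style step: logistic regression on the 1-D SVM decision values
        d = self.svm.decision_function(X).reshape(-1, 1)
        self.lr.fit(d, y, sample_weight=sample_weight)
        return self

    def predict_confidence(self, X):
        # map P(y = 1 | x) to a confidence-rated prediction in [-1, 1]
        d = self.svm.decision_function(X).reshape(-1, 1)
        return 2.0 * self.lr.predict_proba(d)[:, 1] - 1.0
```

Accepting sample_weight in fit is what lets the boosting loop reweight instances between rounds.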


4 Results and discussion

4.1 Datasets

Stankova et al (2015) provide the first large collection of benchmark behaviour datasets for classification. In our experiments we make use of these data sources and extend this repository with two additional datasets. Each of these datasets shows a bipartite structure with a clear target variable to predict. We refer to this study and the next paragraph for a short description of the available data resources. In this paragraph we indicate why we have chosen to include or reject certain data sources from the aforementioned study. The available datasets can be divided into small, medium and large datasets, based on the number of instances and the number of features present. The Norwegian companies and Reality Mining datasets comprise the small datasets, since they contain only a few hundreds of instances or features and are therefore regarded as impractical for our purposes. The MovieLens, Yahoo, TaFeng, Book-Crossing and LibimSeTi datasets belong to the medium sized datasets, each containing a few thousands up to a few hundreds of thousands of instances and features. All of these datasets are included in our study. The large datasets, containing hundreds of thousands up to millions of instances and features, are the Flickr and Kdd databases¹². Other large proprietary data sources not included in Stankova et al (2015) are the corporate residence fraud (CRF) and banking (Bank) datasets, which arise from real-life application domains with intrinsic imbalance.

To summarize, we have gathered datasets containing such fine-grained behaviour data from a wide variety of application domains. The MovieLens datasets, for which we are predicting the gender¹³ (Mov G) or the genre thriller¹⁴ (Mov Th), provide data on which films each user has rated. The Yahoo movies¹⁵ dataset has a similar structure, where the age of each user, Yahoo A (above or below average), or the gender, Yahoo G, is being predicted. The TaFeng dataset¹⁶ contains data on shopping behaviour, where age (below or above average) is being predicted based upon which products are purchased. In the book-crossing (Book) dataset (Ziegler et al 2005) users rate books, and the age of the user (above or below average) is being predicted. LibimSeTi (LST) contains data from a dating site (Brozovsky and Petricek 2007), where users rate each other's profiles and gender is being predicted. In the advertisement (Adver) dataset (Lichman 2013) we try to predict if a URL is an advertisement, based on a large variety of binary features of the URL. Note that this dataset does not arise from the behaviour of entities, yet it still has a high-dimensional and sparse representation. The Flickr dataset (Cha et al 2009) contains pictures being marked by users as favorite, and we predict the number of comments on each picture (below

12 Flickr and KDD will be excluded from the comparative study of Section 4.6. This is because some methods are too computationally intensive, especially in combination with the large number of possible parameter combinations, to be applied to these very large data sources. Furthermore, our statistical evidence is already sufficiently strong to conclude significance without these datasets. Having said this, these data sources will be included in the analysis of Section 5.

13 MovieLens 1M dataset from http://grouplens.org/datasets/movielens
14 MovieLens 10M dataset from http://grouplens.org/datasets/movielens
15 https://webscope.sandbox.yahoo.com
16 http://www.bigdatalab.ac.cn/benchmark/bm/dd?data=Ta-Feng

20 Jellis Vanhoeyveld David Martens

or above average). In the Kdd cup data, the performance of a student on a test is being predicted based on artificially created binary features (Yu et al 2010). The corporate residence fraud dataset (CRF) (Junque de Fortuny et al 2014b) contains data on foreign companies making transactions with specific Belgian companies, where we try to predict whether the foreign company commits residence fraud (a type of fiscal fraud). Finally, the banking dataset (Bank) (Martens et al 2016) contains detailed behaviour of consumers making transactions with merchants or other persons, to predict interest in a pension fund product. Some characteristics of these datasets can be found in Table 2. The features column only shows the number of active features17.

Table 2 Behaviour data characteristics. The final column shows the imbalance ratio p, defined as the ratio of the number of minority class instances to the number of majority class instances in the training set, expressed as a percentage. See Section 4.2 for details regarding p.

Name      |Xmaj|     |Xmin|     Features    p = 100 x |Xmin|train / |Xmaj|train
Mov G     4331       1709       3706        1 & 25
Mov Th    10546      131        69878       1.24 (p = [])
Yahoo A   6030       1612       11915       1 & 25
Yahoo G   5436       2206       11915       1 & 25
TaFeng    17330      14310      23719       1 & 25
Book      42900      18858      282973      1 & 25
LST       59702      60145      166353      1
Adver     2792       457        1555        16.38 (p = []) & 1
CRF       869071     62         108753      0.0072 (p = [])
Bank      1193619    11107      3139570     0.93 (p = [])
Flickr    8166814    3028330    497472      0.1
Kdd       7171885    1235867    19306083    0.5
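As a sanity check, the p values in the final column can be recomputed from the class counts. A minimal sketch, assuming (as the Section 4.2 example suggests) an 80% training split with class counts rounded to the nearest integer; the helper name is ours:

```python
def imbalance_ratio(n_maj, n_min, train_frac=0.8):
    """p = 100 * |Xmin|_train / |Xmaj|_train, with the training-set sizes
    obtained by rounding train_frac of the full class counts."""
    n_min_tr = round(n_min * train_frac)
    n_maj_tr = round(n_maj * train_frac)
    return 100.0 * n_min_tr / n_maj_tr

# e.g. Mov Th: 131 minority vs. 10546 majority instances
print(round(imbalance_ratio(10546, 131), 2))   # → 1.24
```

With this convention the helper reproduces the tabulated ratios for the intrinsically imbalanced datasets (Mov Th 1.24, CRF 0.0072, Bank 0.93, Adver 16.38).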

4.2 Methodology

Regarding the experiments performed in the upcoming sections, we applied a tenfold cross-validation procedure. Each of the folds contains 80% training data, 10% validation data and 10% test data. Note that these percentages are valid for both the majority class and the minority class (stratified sampling). As can be seen from Table 2, some datasets are balanced in nature. We created artificial imbalance for these datasets by removing minority class instances from the initial training set according to a user-defined parameter p. We ensured that the number of minority training instances corresponds to p percent of the majority class training size: |Xmin|train = (p/100) x |Xmaj|train. As an example, say that we are using the Book dataset with p = 25. In that case we know that the majority class contains 34320 training instances (80% of |Xmaj| = 42900). The minority training data would contain 8580 instances (25% of 34320). When the dataset is already imbalanced, we define p = [], which means that no downsampling

17 Active features represent features that are present for at least one instance in the dataset. A non-active feature corresponds with a column of zeros in the matrix representation and would not contribute to the model.

Imbalanced classification in sparse and large behaviour datasets 21

of the minority class training data is performed. Note that the validation and test data are left untouched18.
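The artificial-imbalance step above can be sketched as follows; `downsample_minority` is a hypothetical helper (not from the paper), assuming dense NumPy arrays:

```python
import numpy as np

def downsample_minority(X_maj_train, X_min_train, p, seed=0):
    """Randomly keep only p% (relative to the majority training size)
    of the minority training instances."""
    rng = np.random.default_rng(seed)
    n_keep = int(round(p / 100.0 * X_maj_train.shape[0]))
    idx = rng.choice(X_min_train.shape[0], size=n_keep, replace=False)
    return X_min_train[idx]

# Book example from the text: 34320 majority training rows, p = 25
X_maj = np.zeros((34320, 3))
X_min = np.zeros((18858, 3))
print(downsample_minority(X_maj, X_min, 25).shape[0])   # → 8580
```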

The methods detailed in Section 3 are applied on the training data. For both the under- and oversampling approaches, a linear SVM is trained on the newly created balanced training data, with regularization parameter C taking values

C = [10^-7, 10^-5, 10^-3, 10^-1, 10^0]

The validation data are used for parameter tuning purposes. The test data allow us to obtain the generalization performance. The results reported show the average over ten folds. In the remaining paragraphs of this section we will describe the various parameter settings used in our experiments and give a brief overview of the parameters occurring in each method.
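The train/validate/select loop can be sketched end-to-end. The subgradient SVM trainer, the AUC helper and the toy data below are illustrative stand-ins (the paper uses a standard linear-SVM solver); only the C grid and the select-on-validation-AUC logic mirror the text:

```python
import numpy as np

C_GRID = [1e-7, 1e-5, 1e-3, 1e-1, 1e0]

def fit_linear_svm(X, y, C, epochs=200, lr=0.01):
    """Batch subgradient descent on 0.5*||w||^2 + C * sum of hinge losses."""
    w, b = np.zeros(X.shape[1]), 0.0
    for _ in range(epochs):
        viol = y * (X @ w + b) < 1                     # margin violators
        w -= lr * (w - C * (y[viol, None] * X[viol]).sum(axis=0))
        b -= lr * (-C * y[viol].sum())
    return w, b

def auc(scores, y):
    """P(random positive outranks random negative); ties count 1/2."""
    diff = scores[y == 1][:, None] - scores[y == -1][None, :]
    return (diff > 0).mean() + 0.5 * (diff == 0).mean()

def tune_C(X_tr, y_tr, X_val, y_val):
    """Pick the C with the highest validation-set AUC, as in Section 4.2."""
    results = {}
    for C in C_GRID:
        w, b = fit_linear_svm(X_tr, y_tr, C)
        results[C] = auc(X_val @ w + b, y_val)
    best_C = max(results, key=results.get)
    return best_C, results[best_C]

# Toy separable data: majority around (-2,-2), minority around (2,2)
rng = np.random.default_rng(0)
X_tr = np.r_[rng.normal(-2, 0.5, (60, 2)), rng.normal(2, 0.5, (15, 2))]
y_tr = np.r_[-np.ones(60), np.ones(15)]
X_val = np.r_[rng.normal(-2, 0.5, (20, 2)), rng.normal(2, 0.5, (5, 2))]
y_val = np.r_[-np.ones(20), np.ones(5)]
best_C, val_auc = tune_C(X_tr, y_tr, X_val, y_val)
```

Since AUC depends only on the ranking of the decision values, even very small C (very weak learners) can score well here, which foreshadows the weakness discussion in Section 4.5.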

Considering the oversampling techniques, the parameter settings are as follows:

β = [0, 1/3, 2/3, 1]
prior_opt = {Flip Coin, Reverse Prior}
sim_measure = {Cosine, Jaccard}
K = [10^0, 10^1, 10^2, |Xmin|train]

We didn't include the "Prior" option, due to initial experiments showing a lower performance in comparison with the other options. This can be explained by the low priors occurring in each column, resulting in synthetic samples that mainly show zeros in 0-1 match situations. The oversampling with replacement method (OSR) only uses the β parameter, SMOTE uses all but the K parameter, and ADASYN uses all of the parameters listed above.
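For illustration, oversampling with replacement (OSR) at balance level β can be sketched as below. Reading β as the fraction of the majority/minority gap that is closed (β = 1 giving a fully balanced set) is our interpretation of the settings above; the function name is hypothetical:

```python
import numpy as np

def osr(X_min, n_maj, beta, seed=0):
    """Duplicate random minority rows (with replacement) until the minority
    count reaches |Xmin| + beta * (n_maj - |Xmin|)."""
    rng = np.random.default_rng(seed)
    n_target = len(X_min) + int(round(beta * (n_maj - len(X_min))))
    extra = rng.integers(len(X_min), size=n_target - len(X_min))
    return np.vstack([X_min, X_min[extra]])

X_min = np.eye(5)                      # five sparse minority rows
balanced = osr(X_min, n_maj=40, beta=1.0)
print(len(balanced))                   # → 40, matching the majority size
```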

With respect to the undersampling techniques, the following parameter settings are used:

βu = [0, 1/4, 1/2, 3/4, 1]
sim_measure = {Cosine, Jaccard}
K = [10^0, 10^1, 10^2, |Xmin|train]
Clust_opt = {C Smallest, C Largest}

The random undersampling (RUS) technique only uses the βu parameter. The second set of methods, "Closest tot sim" and "Farthest tot sim", uses βu and sim_measure. The third set of techniques, "Closest Knn" and "Farthest Knn", makes use of all the parameters listed above except for Clust_opt. The final approach, CBU, employs βu and Clust_opt.
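A sketch of the "Closest/Farthest tot sim" idea under our reading of βu (the fraction of the majority surplus that is removed; βu = 1 yields a balanced set): score each majority row by its summed cosine similarity to the minority class and drop the farthest rows ("Closest") or the closest rows ("Farthest"). All names are illustrative:

```python
import numpy as np

def undersample_tot_sim(X_maj, X_min, beta_u, keep="closest"):
    """Remove a beta_u fraction of the majority surplus, ranked by total
    cosine similarity to the minority class."""
    def unit(A):
        return A / np.maximum(np.linalg.norm(A, axis=1, keepdims=True), 1e-12)
    score = (unit(X_maj) @ unit(X_min).T).sum(axis=1)
    n_remove = int(round(beta_u * (len(X_maj) - len(X_min))))
    if n_remove == 0:
        return X_maj.copy()
    order = np.argsort(score)              # ascending: farthest rows first
    drop = order[:n_remove] if keep == "closest" else order[-n_remove:]
    mask = np.ones(len(X_maj), bool)
    mask[drop] = False
    return X_maj[mask]

X_maj = np.array([[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [0.1, 0.9]])
X_min = np.array([[1.0, 0.05]])
kept = undersample_tot_sim(X_maj, X_min, beta_u=1.0, keep="closest")
```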

The boosting variants (AdaBoost, AdaCost and EasyEnsemble) presented in Section 3.3 make use of the following settings:

T = 30, µ = [100, 75], C = [10^-7, 10^-5, 10^-3, 10^-1]

18 This means that if we start from a balanced set, only the training data will show artificial imbalance according to the imbalance ratio p. The validation and test data would remain balanced. Since AUC (and some other metrics) is independent of class skew, it would be unwise to make these sets imbalanced as well, because that would lead to discarding minority class instances that are relevant for performance assessment.


R = [2, 8, RL], where RL = |Xmaj|train / |Xmin|train
S = [5, 10, 15]

The AdaBoost algorithm includes the T, µ and C parameters. AdaCost additionally uses cost ratios R. We have chosen a range of values because misclassification costs are unknown for many business applications (He and Garcia 2009; Fan et al 1999; Sun et al 2007). The final value, RL, seems to be a popular choice (Akbani et al 2004; Luts et al 2010), because the total weight on the majority class then balances with the total weight on the minority class. The final method, EasyEnsemble, uses S subsets in addition to the parameters previously mentioned for AdaBoost. Note that we consider the boosting iteration round t ∈ [1, T] as a tunable parameter19.

4.3 Oversampling

The oversampling techniques presented in Section 3.1 are applied to each of the datasets from Table 2 with varying imbalance levels (by varying the p-parameter). The experiments we conducted use the following methodology: for each dataset, we apply the oversampling techniques with all possible parameter combinations on the training data to create newly balanced datasets, after which linear SVMs are trained. The optimal parameter combination with respect to each imbalance ratio β is selected based on validation set AUC-performance. With these parameters, the AUC on the test data is obtained. Results show the average over ten folds.

The results on four arbitrarily selected datasets are shown in Table 3. Full results on each of the data sources can be found in Appendix A, Table 11. From these tables we can conclude that performance generally increases with growing β-values. It is more correct to say that performance keeps improving with growing β-levels until an optimal point β* is reached; increasing the balance level after this optimal value will cause only small fluctuations with respect to the optimal performance. Traditional studies dealing with dense, low-dimensional data note that the OSR method can suffer from overfitting, as already mentioned in Section 3.1. It is interesting to see that in this sparse, high-dimensional setting this effect doesn't seem to occur20. Furthermore, the computationally expensive synthetic sampling approaches do not seem to improve performance over the simpler OSR method. This can be explained by the fact that many instances have a very limited number of non-zero elements (Junque de Fortuny et al 2014a; Stankova et al 2015), causing the synthetic sampling procedures to be limited in the number of unique new samples they can produce. Note also that synthetic sampling procedures have historically been designed to overcome the overfitting behaviour of OSR, which no longer seems to apply here.
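The limitation mentioned above is easy to see on binary behaviour rows: a SMOTE-style interpolation x_i + λ(x_j − x_i) can only be non-zero where one of the two parent rows is non-zero, so two sparse parents always yield a synthetic sample supported on their (small) union. A small illustration with toy vectors, not taken from the datasets:

```python
import numpy as np

rng = np.random.default_rng(0)
a = np.zeros(1000); a[rng.choice(1000, 10, replace=False)] = 1.0
b = np.zeros(1000); b[rng.choice(1000, 10, replace=False)] = 1.0

lam = 0.4
synth = a + lam * (b - a)              # SMOTE-style interpolation

support = set(np.flatnonzero(synth))
union = set(np.flatnonzero(a + b))
assert support <= union                # non-zeros only where a parent is non-zero
print(len(support))                    # at most 20 of the 1000 dimensions
```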

19 We sometimes make use of T as a symbol to indicate the boosting iteration number instead of t.
20 The SVM regularization parameter C controls for overfitting. Additionally, the influence of majority class noise/outliers (see Section 3.2) on the learned hyperplane decreases as we oversample the minority class. This hyperplane is more sensitive towards minority instances, and a change in its direction/orientation towards these majority class noise/outliers will be more heavily penalized.


Table 3 Oversampling experiments. Results show the average tenfold AUC test set performance with respect to increasing β-values (β = [β1, β2, β3, β4] = [0, 1/3, 2/3, 1]). Additionally, standard deviations are shown between brackets. Optimal performances for each method are highlighted in boldface. See Section 4.2 for parameter settings.

Mov Th (p = [])
          β1            β2            β3            β4
OSR       79.77 (5.33)  85.30 (4.66)  83.16 (4.50)  84.59 (5.69)
SMOTE     79.77 (5.33)  84.18 (6.51)  85.58 (5.97)  84.33 (5.75)
ADASYN    79.77 (5.33)  84.11 (6.77)  85.86 (6.06)  85.36 (5.13)

Yahoo G (p = 25)
          β1            β2            β3            β4
OSR       78.82 (1.39)  78.78 (2.02)  78.74 (1.86)  78.35 (1.47)
SMOTE     78.82 (1.39)  79.23 (1.57)  79.10 (1.20)  79.03 (1.89)
ADASYN    78.82 (1.39)  79.12 (1.43)  79.23 (1.37)  79.51 (2.01)

TaFeng (p = 25)
          β1            β2            β3            β4
OSR       66.94 (1.34)  68.84 (1.21)  66.99 (1.42)  67.70 (1.41)
SMOTE     66.94 (1.34)  68.47 (1.50)  67.07 (1.15)  66.65 (0.81)
ADASYN    66.94 (1.34)  68.62 (1.38)  67.85 (1.60)  66.91 (1.39)

Book (p = 25)
          β1            β2            β3            β4
OSR       60.08 (0.71)  61.05 (0.94)  61.82 (1.12)  64.62 (0.57)
SMOTE     60.08 (0.71)  62.60 (0.73)  60.95 (0.68)  63.00 (0.80)
ADASYN    60.08 (0.71)  62.33 (0.73)  60.77 (0.85)  63.04 (0.58)

4.4 Undersampling

Regarding the undersampling techniques, we employ a similar experimental set-up as in the previous section: firstly, train SVMs on the undersampled training data with all possible parameter combinations; secondly, choose a suitable parameter combination based on validation set AUC-performance; and finally, obtain the AUC-performance on the test set. The results are averaged across ten folds.

Before going into detail on the experimental findings, we give a short note on the effect of noise/outliers on SVM performance. Consider the following imaginary example: say that a majority class female is rating a lot of action films and thrillers. We can consider this an outlier, since most of the females are rating romantic or drama films. This specific female has far more in common with the minority class males, who also rate action films and thrillers. Outliers are therefore instances showing odd behaviour; noise are wrongfully labelled instances. The effect of noise/outliers on SVM performance can be severe. Indeed, many of the instances in the dataset have a support value (dual variable αi) of 0. Instances contributing to the model (a non-zero support value) are examples close to the borderline (Suykens et al 2002) and noise/outliers (as can be derived from the KKT dual-complementarity conditions21).

21 If αi = 0, then yi(w^T xi + b) ≥ 1. For noise/outliers the term yi(w^T xi + b) is negative, hence αi ≠ 0.


With respect to noise/outliers, the support value can reach a maximal value of C, the chosen regularization parameter in the SVM formulation. It should be clear that lowering the C-value leads to models that are less sensitive to noise/outliers.

The experimental outcomes of the various undersampling processes are shown in Table 4 for the same four previously selected datasets. We refer to Appendix B, Table 12 for results on the entire data repository. When removing only a limited amount of majority class instances (βu = βu2 = 1/4), we observed that in 12 out of 16 datasets the "Farthest Knn" method outperforms the "Closest Knn" technique; the four remaining datasets show comparable performances22. This finding shows that the "Farthest" method is very suitable for removing majority class noise/outliers, and it empirically shows their performance degrading effect. With higher undersampling rates (βu = 1), the "Closest Knn" method generally achieves higher performances than the "Farthest Knn" technique, though the results are less clear: only 9 out of 16 datasets indicate "Closest" as a clear winner, with 3 tie situations and 4 situations where "Farthest" shows the best results. Intuitively, we would expect the "Closest" method to perform better, since we are keeping the most informative instances (examples close to the minority class). However, this technique will also emphasize noise/outliers, and this is the main reason why the results are less significant. One can also see this from the fact that the RUS method shows far better results in comparison to the aforementioned techniques when βu = 1, an observation that is valid for all datasets. The RUS method is efficient in realizing noise/outlier reduction effects, since each SVM is trained with only a fraction of these types of instances. It is remarkable that in 9 out of 16 datasets the RUS method with βu = 1 outperforms the baseline model (βu = 0); in 4 datasets we observed equal performance, and 3 losses with respect to the baseline. RUS shows two attractive features: majority class noise/outlier removal and its ability to put more emphasis on the minority class instances by undersampling the majority class. A general downside of the approach is the information loss, and this effect causes the 3 losses to the baseline. Usually, though, the attractive features of RUS dominate the latter effect, and this indicates that it is definitely not necessary to include all majority class training instances in constructing the predictive model. Apparently, there is a high level of redundancy, which can be exploited to construct efficient hypotheses.

In the next paragraph we compare the cluster-based undersampling (CBU) approach to the random undersampling (RUS) technique. Before proceeding, we should note that Table 12 does not show results on the LST, CRF and Bank datasets, because the projected unigraph was too large to fit in memory (larger than 16 GB). A limited number of features (top nodes) does seem to be active for a relatively large number of instances (bottom nodes); all these instances will be connected in the unigraph projection, resulting in relatively high degrees for the bottom nodes. A possible solution to circumvent this problem might be to cut the weakest edges in the projected unigraph.

If we apply a limited amount of undersampling (βu = βu2 = 1/4), the CBU technique outperforms RUS in 8 out of 13 datasets (4 losses and 1 tie with respect to RUS). On the highly imbalanced datasets (p = 1 or p = []), the CBU method wins

22 A tie occurs in the situation where the absolute difference in AUC is smaller than or equal to 0.5.


in 8 out of 8 cases. We can therefore conclude that CBU outperforms RUS at low undersampling rates. This indeed shows that handling the within-class imbalance can be beneficial in this environment. If we make a comparison at the highest undersampling rate, βu = βu5 = 1, the results are completely opposite: CBU wins in only 1 case, with 9 losses with respect to RUS and 3 ties. On the highly imbalanced datasets, CBU outperforms RUS for 1 dataset (with 4 losses and 3 ties). It is clear that RUS is the preferred technique in case of high undersampling rates. The reason is that CBU emphasizes the small disjuncts and basically fails to adequately pick up on the common, larger communities. Furthermore, these small disjuncts might actually correspond with majority class noise/outliers, and CBU is focusing on these types of instances. On the overall level, where we consider all undersampling rates (except βu = 0), CBU wins in 5 out of 13 datasets, RUS wins in 4 cases, and there are 4 tie situations (yet RUS did outperform CBU in all 4 cases). Both methods seem to be competitive with one another across a wide range of imbalance levels. If we pay particular attention to the highly imbalanced sets, CBU wins in 5 cases and there are 3 tie situations on the remaining sets (yet RUS did outperform CBU in all 3 cases). CBU seems to be the method of choice (with respect to RUS) when the datasets show high imbalance levels.

Table 4 Undersampling experiments. Results show the average tenfold AUC test set performance with respect to increasing βu-values (βu = [βu1, βu2, βu3, βu4, βu5] = [0, 1/4, 1/2, 3/4, 1]). Optimal performances (boldface) are highlighted for each imbalance ratio. Additionally, standard deviations are shown between brackets. Cl K represents the "Closest Knn" technique; Cl T represents the "Closest tot sim" technique (similarly for Far K and Far T, see Section 3.2). Parameter settings are given in Section 4.2.

Mov Th (p = [])
        βu1          βu2          βu3          βu4          βu5
RUS     79.77(5.3)   80.32(5.8)   81.57(5.5)   81.86(6.6)   81.26(6.2)
Cl K    79.77(5.3)   79.25(4.5)   78.07(5.0)   76.25(6.5)   62.46(8.5)
Cl T    79.77(5.3)   78.40(4.4)   72.41(3.5)   64.66(4.5)   60.37(7.3)
Far K   79.77(5.3)   84.54(5.0)   83.64(6.4)   80.02(7.3)   56.82(10.3)
Far T   79.77(5.3)   85.03(5.7)   82.68(6.8)   75.61(9.2)   56.77(10.9)
CBU     80.11(5.8)   81.17(6.0)   81.08(6.5)   84.17(5.1)   80.96(6.9)

Yahoo G (p = 25)
        βu1          βu2          βu3          βu4          βu5
RUS     78.82(1.4)   78.91(1.6)   78.97(1.6)   78.61(1.6)   77.82(2.1)
Cl K    78.82(1.4)   77.26(1.5)   72.52(1.5)   67.86(2.0)   65.07(2.7)
Cl T    78.82(1.4)   76.83(1.0)   71.99(1.8)   67.15(2.3)   61.10(2.7)
Far K   78.82(1.4)   78.26(2.2)   74.69(2.7)   67.22(2.1)   60.72(2.3)
Far T   78.82(1.4)   77.68(2.6)   72.44(3.0)   64.94(2.4)   59.60(2.0)
CBU     75.25(3.2)   75.22(2.4)   74.69(2.3)   73.07(2.4)   70.69(2.4)

TaFeng (p = 25)
        βu1          βu2          βu3          βu4          βu5
RUS     66.94(1.3)   67.44(1.3)   68.10(1.4)   68.27(1.4)   66.13(1.2)
Cl K    66.94(1.3)   66.13(1.4)   63.39(1.2)   59.83(1.3)   56.94(0.7)
Cl T    66.94(1.3)   66.38(1.5)   62.89(1.6)   57.46(1.3)   54.56(1.3)
Far K   66.94(1.3)   68.06(1.4)   66.43(1.6)   64.46(1.5)   63.35(1.3)
Far T   66.94(1.3)   64.31(1.1)   62.69(1.0)   61.27(1.1)   59.03(1.0)
CBU     64.81(1.2)   64.15(1.1)   64.13(1.2)   63.88(0.8)   63.46(0.8)

Book (p = 25)
        βu1          βu2          βu3          βu4          βu5
RUS     60.08(0.7)   60.13(0.6)   60.40(0.8)   60.33(0.8)   63.28(0.8)
Cl K    60.08(0.7)   59.96(0.7)   60.13(0.8)   59.96(1.0)   59.28(0.7)
Cl T    60.08(0.7)   59.96(0.7)   60.13(0.8)   60.29(0.4)   54.50(0.9)
Far K   60.08(0.7)   63.29(1.0)   64.19(0.8)   57.30(1.1)   55.66(1.1)
Far T   60.08(0.7)   62.14(0.5)   58.27(0.6)   56.37(1.0)   55.66(1.1)
CBU     54.82(0.9)   54.67(0.9)   54.71(0.9)   54.66(1.0)   54.78(0.9)

4.5 Boosting variants

In this section the plain AdaBoost (AB) method is compared with the cost-sensitive AdaCost (AC) algorithm, the EasyEnsemble (EE) technique and the baseline (BL) method (where we train a single SVM on the imbalanced training data). AB, AC and EE have an underlying boosting process in common. The AUC-values are stored during each boosting iteration, which allows us to show the performance graphically as a function of the number of iterations t. In EasyEnsemble (EE), we combine the weak learners of each subset by summing their individual contributions. For example, when t = 2 boosting iterations are performed, we evaluate the performance of the combined learner ∑_{s=1}^{S} ∑_{t=1}^{2} α_{s,t} h_{s,t}(x). Figures 1, 2, 3 and 4 show the average tenfold AUC-performance (with µ = 100) on the test data. The left-side figures show the AB, AC (with varying R-values) and EE (with varying S-values) performance with respect to the number of boosting iterations; the C-value is tuned according to the highest validation set AUC-performance (over all possible boosting iterations). The right-side figures investigate the influence of the C-parameter for AB and EE (with S = 15) and allow us to gain insight into the effect of weak and strong learners on the generalization performance. Results on the remaining datasets are given in Appendix C, Figures 6-17. Note that we only indicate results with weight-percentage µ = 100 (use all instances in the training process); previous experiments (with µ = 75) showed inferior results. This suggests that it is indeed important to include all minority instances during the training of each weak learner, an observation also made by Liu et al (2009).

When investigating the influence of the regularization parameter C in AdaBoost, we can graphically verify that it is a "weakness" indicator, in the sense that lower C-values correspond to weaker learners. While these low C-value learners have a low performance in the first rounds of boosting, the possible boost in performance is usually much higher compared to higher C-value inducers. Strong learners can already make a suitable distinction between minority and majority class behaviours in the first boosting round. The boosting process will emphasize the hard to learn instances, among which are noise/outliers. Therefore, too strong learners are not suitable to be used in a boosting process, as already noted by Wickramaratna et al (2001). In Figures 2(b) and 3(b) we observe that the lowest C-values (C = 10^-7, 10^-5) can outperform higher C-values (C = 10^-3, 10^-1). In many cases the AB-process outperforms the baseline (BL); see also the results in Table 5. A combination of several weak learners can often achieve better generalization than a single strong learner.

Inspired by the success of the RUS method with βu = 1 (for reasons indicated in Section 4.4), the EE-technique tries to overcome an obvious downside of this approach: we are removing a lot of majority class instances that can be very informative. Hence, we try to make up for this loss by combining several of these balanced training sets, in the hope that the majority class is more adequately explored. This results in a very effective learner that seems to outperform the AB, AC and BL methods in the majority of cases, as can be seen from these figures. It is remarkable that we don't need to use that many subsets S to achieve high performances: increasing the S-value from 5 to 15 doesn't seem to cause major increases in AUC (see also Table 5). The influence of the C-parameter is similar as described in the previous paragraph.
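The EE construction can be sketched as follows. Decision stumps stand in for the paper's linear SVM weak learners, and all names are illustrative; only the structure — S randomly drawn balanced subsets, T AdaBoost rounds per subset, and the summed score ∑_s ∑_t α_{s,t} h_{s,t}(x) — follows the text:

```python
import numpy as np

def stump_fit(X, y, w):
    """Weighted decision stump: the (feature, threshold, polarity)
    with the lowest weighted error."""
    best = (np.inf, 0, 0.0, 1)
    for f in range(X.shape[1]):
        for thr in np.unique(X[:, f]):
            for pol in (1, -1):
                pred = np.where(X[:, f] >= thr, pol, -pol)
                err = w[pred != y].sum()
                if err < best[0]:
                    best = (err, f, thr, pol)
    return best

def stump_predict(X, f, thr, pol):
    return np.where(X[:, f] >= thr, pol, -pol)

def easy_ensemble(X_maj, X_min, S=5, T=10, seed=0):
    """Boost T rounds on each of S randomly drawn balanced subsets."""
    rng = np.random.default_rng(seed)
    learners = []                                   # (alpha, f, thr, pol)
    for _ in range(S):
        idx = rng.choice(len(X_maj), size=len(X_min), replace=False)
        X = np.vstack([X_maj[idx], X_min])
        y = np.r_[-np.ones(len(X_min)), np.ones(len(X_min))]
        w = np.full(len(y), 1.0 / len(y))           # AdaBoost weights
        for _ in range(T):
            err, f, thr, pol = stump_fit(X, y, w)
            err = np.clip(err, 1e-10, 1 - 1e-10)
            alpha = 0.5 * np.log((1 - err) / err)
            pred = stump_predict(X, f, thr, pol)
            w = w * np.exp(-alpha * y * pred)
            w /= w.sum()
            learners.append((alpha, f, thr, pol))
    return learners

def ee_score(X, learners):
    """Combined score: sum_s sum_t alpha_{s,t} * h_{s,t}(x)."""
    return sum(a * stump_predict(X, f, t, p) for a, f, t, p in learners)

# Toy data: majority cluster at 0, minority cluster at 4 (one feature)
rng = np.random.default_rng(1)
X_maj = rng.normal(0.0, 1.0, (60, 1))
X_min = rng.normal(4.0, 1.0, (12, 1))
model = easy_ensemble(X_maj, X_min, S=5, T=10)
X_all = np.vstack([X_maj, X_min])
y_all = np.r_[-np.ones(60), np.ones(12)]
acc = (np.sign(ee_score(X_all, model)) == y_all).mean()
```

Each subset is only twice the minority size and the S boosting runs are independent, which is exactly why the method parallelizes so well.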

[Figure 1 omitted: two panels, (a) and (b), plotting test-set AUC against the number of boosting iterations T; legend in (a): AB, AC with varying R, EE with S = 5, 10, 15, and BL; legend in (b): AB and EE at C = 1e-07, 1e-05, 0.001, 0.1, and BL.]

Fig. 1 Mov G (p = 25) dataset results showing average tenfold AUC-performance on test data (with µ = 100) for (a) AB, AC and EE, with C chosen according to the highest validation set AUC-performance (over all possible T); (b) AB and EE (S = 15) with varying C-levels.


[Figure 2 omitted: same panel layout as Figure 1.]

Fig. 2 Book (p = 25) dataset.

[Figure 3 omitted: same panel layout as Figure 1.]

Fig. 3 TaFeng (p = 25) dataset.

[Figure 4 omitted: same panel layout as Figure 1.]

Fig. 4 Bank (p = []) dataset.


4.6 Final comparison

4.6.1 Performance-wise comparison

In this section we wish to determine which of the aforementioned methods shows superior performance in terms of AUC on test data. The experimental set-up is described in Section 4.2. Note that we exclude β = 0 and βu = 0 in determining the performance of the oversampling respectively undersampling techniques, to be able to compare them with the baseline (BL) approach23. The results for AB, AC and EE are shown for µ = 100. The number of boosting iterations t ∈ [0, T] is also considered as a tunable parameter. Table 5 shows the final comparison between all methods. Additionally, we added an average rank column indicating the mean rank of each algorithm across all datasets. The rank of an algorithm for a specific dataset corresponds to its position in the sorted AUC-performance spectrum, where the best performing algorithm gets the rank of 1, the second best rank 2, etc. In case two or more algorithms perform equally well, average ranks are assigned (if two algorithms have ranks of 3 and 4 respectively and show equal performance, then both methods get assigned a rank of 3.5). Note that the LST dataset is excluded in the calculation of average ranks, since we consider this a dataset not being affected by imbalance and showing equal performances.
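The ranking scheme just described (rank 1 for the best AUC per dataset, tied methods sharing the average of the positions they span) can be sketched as:

```python
import numpy as np

def average_ranks(auc_matrix):
    """auc_matrix: one row per dataset, one column per algorithm.
    Returns the mean rank of each algorithm (rank 1 = highest AUC;
    tied entries share the average of their positions)."""
    n_data, k = auc_matrix.shape
    ranks = np.empty((n_data, k))
    for d in range(n_data):
        row = auc_matrix[d]
        order = np.argsort(-row)                  # descending AUC
        r = np.empty(k)
        r[order] = np.arange(1, k + 1)
        for v in np.unique(row):                  # share ranks among ties
            tie = row == v
            r[tie] = r[tie].mean()
        ranks[d] = r
    return ranks.mean(axis=0)

r = average_ranks(np.array([[0.90, 0.80, 0.80, 0.70]]))
# the two tied algorithms share rank (2 + 3)/2 = 2.5: r == [1, 2.5, 2.5, 4]
```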

As a side note, we noticed that increasing the number of minority training set instances can significantly improve performances. We can see this clearly by comparing the p = 25 outcomes with the results of p = 1 on the same dataset. Junque de Fortuny et al (2014a) already proved empirically that larger behaviour datasets (in terms of the number of instances or features) lead to higher AUC-performances. This situation is confirmed in our experiments, though we want to add that this is only true when the data size is increased in a balanced fashion.

The results in Table 5 confirm our previous analysis. Each of the oversampling, undersampling (except Cl Knn) and boosting variants seems to improve upon the BL performance. Sobhani et al (2015) note that the direct application of ensemble techniques on imbalanced data does not circumvent the imbalanced learning issue. In our experiments on behaviour data, however, we do find that a simple boosting procedure applied to the imbalanced data improves upon the predictive performance. The oversampling techniques OSR, SMOTE and ADASYN show similar outcomes and are competitive methods, seeming to outperform the undersampling techniques and the plain boosting (AB) and cost-sensitive learning (AC) approaches. When dealing with traditional data, the performances of oversampling techniques are generally worse than those of undersampling approaches (Drummond and Holte 2003). It is interesting to observe that for these large and sparse behaviour datasets the opposite is true. This is heavily related to the intrinsic properties of the data under consideration. Undersampling (or feature selection, for that matter) is not recommended because, as we have noted in Section 2.1, each feature provides a small additional contribution to the predictive performance. With respect to AC, the results indicate that R = RL is a better choice compared to more random cost ratios R = 2, 8. The EE-technique has the

23 The BL technique trains single SVMs on the imbalanced training data.


lowest average rank and is our best performing method. The combination of several balanced subsets with filtered noise/outlier contributions allows one to explore the majority class space efficiently. To strengthen our previous statements, we will in the next paragraphs conduct statistical significance tests in accordance with the procedure outlined in Demsar (2006); the latter paper presents a framework to perform statistical comparisons of classifiers over multiple datasets.

The first null-hypothesis we try to reject postulates that each of the algorithms performs equally well or, equivalently, that their average ranks R_j (see Table 5) are indifferent. Under the null-hypothesis, the Friedman statistic (Friedman 1937)

χ²_F = [12N / (k(k+1))] [ ∑_{j=1}^{k} R_j² − k(k+1)²/4 ],    (6)

with N the number of datasets (N = 15) and k the number of algorithms (k = 13), can be used to compute the statistic (Iman and Davenport 1980)

F_F = (N − 1) χ²_F / (N(k − 1) − χ²_F).    (7)

The latter is distributed according to the F-distribution with k − 1 and (k − 1)(N − 1) degrees of freedom. Evaluating this measure with the average ranks shown in Table 5 results in F_F = 22.98, which is a lot higher than the critical value of 1.81 at the α = 0.05 significance level. We can therefore reject the null-hypothesis and proceed with post-hoc tests.
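Both statistics are easy to reproduce from the average-rank column of Table 5 (N = 15 datasets, k = 13 algorithms):

```python
def friedman_iman_davenport(ranks, n):
    """chi^2_F of Eq. (6) and F_F of Eq. (7) from the average ranks."""
    k = len(ranks)
    chi2 = 12 * n / (k * (k + 1)) * (sum(r * r for r in ranks)
                                     - k * (k + 1) ** 2 / 4)
    ff = (n - 1) * chi2 / (n * (k - 1) - chi2)
    return chi2, ff

# Average ranks from Table 5: BL, OSR, SMOTE, ADASYN, RUS, Cl Knn, Far Knn,
# CBU, AB, AC(R = 2, 8), AC(R = RL), EE(S = 10), EE(S = 15)
ranks = [11.600, 5.000, 4.533, 4.800, 8.167, 12.467, 8.133,
         8.567, 8.267, 8.467, 5.400, 3.267, 2.333]
chi2, ff = friedman_iman_davenport(ranks, n=15)
print(round(ff, 2))   # → 22.99, matching the F_F ≈ 22.98 reported in the text
```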

The Nemenyi test (Nemenyi 1963) allows one to compare each classifier to one another and adjusts the critical value to compensate for making k(k − 1)/2 comparisons24. "The performance of two classifiers is significantly different if the corresponding average ranks differ by at least the critical difference (CD)" (Demsar 2006). We refer to the aforementioned paper to calculate the CD. The result of these comparisons at the α = 0.05 significance level is shown in Table 6. This test is already quite revealing, in the sense that, for instance, the oversampling methods (OSR, SMOTE and ADASYN), AC (R = RL) and the EE-techniques are significantly different from the BL. Yet it does not indicate that undersampling or applying plain boosting (AB) results in significantly different performances compared to the BL. The Nemenyi test is still conservative, and we therefore compare all classifiers to a single control classifier in the next paragraphs. Note that we only make k − 1 comparisons in that situation, resulting in stronger results compared to the Nemenyi test.

When comparing a single classifier i with a control classifier c, the null-hypothesis states that the two classifiers i and c perform equally well. Under this hypothesis, the test statistic (based on the average ranks R_i shown in Table 5)

z = (R_i − R_c) / √(k(k+1)/(6N))    (8)

is distributed according to a standard normal distribution. This z-value is used to find the corresponding probability (p-value) p from the cumulative standard normal

24 The larger the number of comparisons, the higher the proportion of null-hypotheses that are wrongfully rejected due to random chance.
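The control-classifier comparison of Eq. (8) can be sketched as below, using the Table 5 ranks with EE (S = 15) (rank 2.333) as the control; computing the one-sided normal tail via `erfc` is our addition:

```python
import math

def z_vs_control(r_i, r_c, k=13, n=15):
    """z statistic of Eq. (8) plus its one-sided normal tail probability."""
    z = (r_i - r_c) / math.sqrt(k * (k + 1) / (6 * n))
    p = 0.5 * math.erfc(z / math.sqrt(2))    # P(Z >= z)
    return z, p

z, p = z_vs_control(11.600, 2.333)           # BL against the EE(S = 15) control
```

A large positive z (here well above 6) yields a vanishingly small p-value, so the BL is rejected against the control at any conventional significance level.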


Table 5 Final comparison of all methods in terms of AUC test set performance; see Section 4.2 for details on the experimental set-up. β ≠ 0 and βu ≠ 0 for the oversampling respectively undersampling techniques; µ = 100 for each of the boosting variants. Standard deviations are given between brackets and the performance difference with the BL is shown in square brackets.

              Mov G (p = 1)      Mov G (p = 25)     Mov Th (p = [])    Yahoo A (p = 1)
BL            71.6(2.62)[0]      81.41(1.32)[0]     79.77(5.33)[0]     55.92(2.97)[0]
OSR           75.35(2.27)[3.8]   83.76(2.09)[2.3]   85.13(6.1)[5.4]    60.05(2.71)[4.1]
SMOTE         76.16(2.27)[4.6]   83.7(2.1)[2.3]     85.67(4.98)[5.9]   60.1(3)[4.2]
ADASYN        76.07(2.26)[4.5]   83.63(2.04)[2.2]   85.65(5.6)[5.9]    59.9(2.99)[4]
RUS           72.88(2.73)[1.3]   81.52(2.15)[0.1]   82.91(7.19)[3.1]   57.04(1.77)[1.1]
Cl Knn        71.43(1.36)[-0.2]  80.88(1.19)[-0.5]  78.87(4.71)[-0.9]  55.78(2.71)[-0.1]
Far Knn       71.9(2.95)[0.3]    80.9(1.48)[-0.5]   84.07(4.64)[4.3]   57.2(1.33)[1.3]
CBU           74.17(2.36)[2.6]   81.51(1.04)[0.1]   82.76(7.22)[3]     58.77(3.43)[2.8]
AB            71.65(1.73)[0.1]   84.52(1.89)[3.1]   82.43(5.18)[2.7]   58.35(2.62)[2.4]
AC(R = 2,8)   71.61(2.46)[0]     83.46(1.82)[2]     83.27(5.6)[3.5]    57.72(2.47)[1.8]
AC(R = RL)    74.65(2.7)[3.1]    83.35(2.09)[1.9]   85.41(4.49)[5.6]   59.47(2.33)[3.5]
EE(S = 10)    76.04(2.66)[4.4]   85.05(1.85)[3.6]   86.1(5.78)[6.3]    59.66(3.13)[3.7]
EE(S = 15)    76.12(2.88)[4.5]   85.14(1.86)[3.7]   86.42(5.86)[6.7]   59.76(2.93)[3.8]

              Yahoo A (p = 25)   Yahoo G (p = 1)    Yahoo G (p = 25)   TaFeng (p = 1)
BL            61.68(2.42)[0]     66.84(3.66)[0]     78.82(1.39)[0]     55.75(1.6)[0]
OSR           64.59(3.12)[2.9]   73.08(2.96)[6.2]   78.52(2.01)[-0.3]  61.21(2.24)[5.5]
SMOTE         65.56(3.33)[3.9]   73.11(3.12)[6.3]   79.01(1.21)[0.2]   61.72(1.81)[6]
ADASYN        65.13(3.38)[3.4]   73.22(3.17)[6.4]   79.74(1.68)[0.9]   61.68(1.86)[5.9]
RUS           64.11(2.8)[2.4]    70.65(3.39)[3.8]   78.91(1.55)[0.1]   59.25(2.18)[3.5]
Cl Knn        61.14(2.13)[-0.5]  66.34(3.54)[-0.5]  77.26(1.46)[-1.6]  55.77(1.28)[0]
Far Knn       63.96(3.03)[2.3]   66.97(3.54)[0.1]   78.26(2.2)[-0.6]   59.98(1.26)[4.2]
CBU           62.27(1.79)[0.6]   71.27(2.89)[4.4]   75.22(2.42)[-3.6]  58.4(1.57)[2.6]
AB            63.88(2.67)[2.2]   68.9(2.03)[2.1]    79.01(1.66)[0.2]   56.21(1.79)[0.5]
AC(R = 2,8)   64.32(3.56)[2.6]   68.89(3.11)[2]     78.99(1.89)[0.2]   56.33(1.83)[0.6]
AC(R = RL)    64.31(3.03)[2.6]   73.13(2.8)[6.3]    78.41(2)[-0.4]     61.6(2.26)[5.9]
EE(S = 10)    66.51(3.24)[4.8]   72.61(3.15)[5.8]   80.52(1.6)[1.7]    61.2(1.82)[5.4]
EE(S = 15)    66.36(3.18)[4.7]   73.48(2.32)[6.6]   80.54(1.56)[1.7]   61.13(1.83)[5.4]

              TaFeng (p = 25)    Book (p = 1)       Book (p = 25)      LST (p = 1)
BL            66.94(1.34)[0]     52.6(1.29)[0]      60.08(0.71)[0]     99.99(0.01)[0]
OSR           68.77(1.23)[1.8]   55.87(1.42)[3.3]   64.62(0.57)[4.5]   99.99(0.01)[0]
SMOTE         68.47(1.5)[1.5]    55.07(0.88)[2.5]   62.96(0.82)[2.9]   99.99(0.01)[0]
ADASYN        68.48(1.47)[1.5]   55.04(0.91)[2.4]   63.02(0.57)[2.9]   99.99(0.01)[0]
RUS           68.28(1.39)[1.3]   54.26(0.92)[1.7]   63.28(0.8)[3.2]    99.98(0.01)[0]
Cl Knn        66.13(1.43)[-0.8]  52.69(1.3)[0.1]    60.02(0.79)[-0.1]  99.99(0.01)[0]
Far Knn       68.06(1.41)[1.1]   56.25(1.52)[3.7]   64.15(1.12)[4.1]   99.98(0.01)[0]
CBU           63.84(1.07)[-3.1]  53.75(1.01)[1.2]   54.68(0.88)[-5.4]  []
AB            67.65(1.55)[0.7]   54.27(1.95)[1.7]   65(0.67)[4.9]      99.99(0.01)[0]
AC(R = 2,8)   69.31(1.23)[2.4]   53.72(1)[1.1]      61.24(0.8)[1.2]    99.98(0.01)[0]
AC(R = RL)    67.15(1.51)[0.2]   55.73(1.22)[3.1]   64.6(0.64)[4.5]    99.99(0.01)[0]
EE(S = 10)    70.3(1.35)[3.4]    55.09(1.29)[2.5]   65.37(0.61)[5.3]   99.98(0.01)[0]
EE(S = 15)    70.4(1.3)[3.5]     55.35(1.26)[2.8]   65.4(0.51)[5.3]    99.98(0.01)[0]

32 Jellis Vanhoeyveld David Martens

Table 5 Continued. Additionally, an average rank column is added showing the mean rank of each algorithm across all datasets. Note that we excluded the LST dataset for this purpose.

Adver(p = []) Adver(p = 1) CRF(p = []) Bank(p = [])

BL 9661(182)[0] 9093(302)[0] 6406(1643)[0] 6682(088)[0]
OSR 9693(191)[03] 933(202)[24] 8074(1293)[167] 7139(079)[46]
SMOTE 9705(166)[04] 9335(201)[24] 787(1656)[146] []
ADASYN 9691(195)[03] 9346(221)[25] 7887(1671)[148] []
RUS 9681(187)[02] 9238(251)[15] 8398(599)[199] 6941(119)[26]
Cl Knn 964(148)[-02] 8973(342)[-12] 7663(1619)[126] 6617(072)[-06]
Far Knn 9577(181)[-08] 9388(178)[3] 8375(1311)[197] 6695(056)[01]
CBU 9715(188)[05] 9418(23)[33] [] []
AB 9734(218)[07] 9139(323)[05] 7762(1515)[136] 6682(088)[0]
AC(R = 28) 9744(193)[08] 91(335)[01] 6831(1493)[42] 6767(071)[09]
AC(R = R L) 9746(171)[08] 9351(217)[26] 8508(977)[21] 707(08)[39]
EE(S = 10) 9764(135)[1] 9297(275)[2] 8618(1017)[221] 7146(081)[46]
EE(S = 15) 9763(135)[1] 933(214)[24] 8635(999)[223] 7154(076)[47]

Average Rank

BL 11.600
OSR 5.000
SMOTE 4.533
ADASYN 4.800
RUS 8.167
Cl Knn 12.467
Far Knn 8.133
CBU 8.567
AB 8.267
AC(R = 28) 8.467
AC(R = R L) 5.400
EE(S = 10) 3.267
EE(S = 15) 2.333

Table 6 Nemenyi test outcomes at the α = 0.05 significance level. A value of 1 indicates the null-hypothesis is rejected and thus finds the two algorithms significantly different (a value of 0 means we accept the null-hypothesis). Methods are listed in the same order as Table 5 and are more concisely represented.

BL RO SM AD RU Cl Fa CBU AB AC1 AC2 EE1 EE2

BL 0 1 1 1 0 0 0 0 0 0 1 1 1
RO 1 0 0 0 0 1 0 0 0 0 0 0 0
SM 1 0 0 0 0 1 0 0 0 0 0 0 0
AD 1 0 0 0 0 1 0 0 0 0 0 0 0
RU 0 0 0 0 0 0 0 0 0 0 0 1 1
Cl 0 1 1 1 0 0 0 0 0 0 1 1 1
Fa 0 0 0 0 0 0 0 0 0 0 0 1 1
CBU 0 0 0 0 0 0 0 0 0 0 0 1 1
AB 0 0 0 0 0 0 0 0 0 0 0 1 1
AC1 0 0 0 0 0 0 0 0 0 0 0 1 1
AC2 1 0 0 0 0 1 0 0 0 0 0 0 0
EE1 1 0 0 0 1 1 1 1 1 1 0 0 0
EE2 1 0 0 0 1 1 1 1 1 1 0 0 0

Imbalanced classification in sparse and large behaviour datasets 33

distribution function: p = 2×min(Φ(z), 1−Φ(z)). Holm's method (Holm 1979) compares k−1 classifiers to a single control classifier and sorts the corresponding p-values in ascending order so that p1 ≤ p2 ≤ … ≤ pk−1. Each pi is subsequently compared to its associated confidence level25 αcomp = α/(k−i). Holm starts by performing the check p1 < α/(k−1) and rejects the null-hypothesis if the test passes. It then proceeds with a similar check for i = 2 and continues until a certain null-hypothesis cannot be rejected. The remaining hypotheses are retained.
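In code, the step-down logic is only a few lines. The sketch below (function names and example inputs are ours, purely illustrative) computes the two-sided p-value from each z statistic and walks down the sorted list with the adjusted levels α/(k−i):

```python
import math

def holm_test(z_stats, labels, alpha=0.05):
    """Holm's step-down test of k-1 methods against a control classifier."""
    def p_from_z(z):
        # two-sided p-value: p = 2 * min(Phi(z), 1 - Phi(z))
        phi = 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))
        return 2.0 * min(phi, 1.0 - phi)

    k = len(z_stats) + 1                              # algorithms incl. the control
    pairs = sorted(zip((p_from_z(z) for z in z_stats), labels))
    results, rejecting = [], True
    for i, (p, lab) in enumerate(pairs, start=1):
        alpha_comp = alpha / (k - i)                  # adjusted comparison level
        rejecting = rejecting and (p < alpha_comp)    # stop at the first failure
        results.append((lab, p, alpha_comp, rejecting))
    return results
```

Feeding in the z statistics of Table 7 would reproduce its sorted p-value, αcomp and significance columns.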

Table 7 shows the result of the Holm test at the α = 0.05 significance level with the BL algorithm as control classifier. The table lists each method according to sorted p-values and shows the z-, p- and αcomp-values. The significance column indicates whether the proposed method is significantly different from the BL and coincidentally matches the result of the Nemenyi test. The p-values of the undersampling methods (Far Knn, RUS, CBU), AB and AC(R = 2.8) are only slightly higher than the level required to reject the null-hypothesis. We included a critical significance level αcrit, corresponding to the lowest possible significance level at which the method would be considered significantly different from the BL (if α = αcrit, then p = αcomp). In the case of AB, for example, if we were to choose a significance level α = 0.0764 (a confidence level of 92.36%), the associated p-value would be smaller than αcomp and we would proceed to conclude26 that AB performs significantly differently from the BL. To summarize: the null-hypothesis of equal performance between the BL and an alternative method proposed in Section 3 (except Cl Knn) is rejected with a confidence level of more than 91% according to Holm's test. Hence, the performance of the BL approach is significantly worse than that of each of the proposed over- and undersampling methods (except Cl Knn), cost-sensitive learning techniques (AC) and boosting variants (AB, EE).

Similar to the previous paragraph, Table 8 presents the results of Holm's test at the α = 0.05 significance level with EE(S = 15) as control classifier. The results coincidentally correspond with those of the Nemenyi test and indicate that the oversampling techniques (OSR, SMOTE and ADASYN), AC(R = RL) and EE(S = 10) are not statistically different from the EE(S = 15) technique at the 0.05 significance level. However, if we were to raise the significance level to approximately α = 0.25 (a 75% confidence level), the results would conclude significance (except for EE(S = 10)). To summarize: the null-hypothesis of equal performance between EE(S = 15) and an alternative method proposed in Section 3 (except EE(S = 10)) is rejected with a confidence level of more than 75% according to Holm's test. Hence, the performance of the oversampling, undersampling, AB and AC approaches is significantly worse than that of EE(S = 15). We can therefore safely conclude EE to be the preferred method in terms of AUC performance when dealing with imbalanced behaviour data.

25 αcomp adjusts the value of α to compensate for multiple comparisons.
26 This is not entirely true, since Holm's method requires Far Knn to pass the test first before proceeding, which only occurs at the 0.088662 significance level.


Table 7 Holm test at the α = 0.05 significance level with BL as reference. The table shows the z test statistic with associated p-value, and αcomp = α/(k−i), with k the number of algorithms (k = 13) and i corresponding to the position of the method in the sorted p-value vector. Since we already ranked the p-values in ascending order, i takes on the row number (e.g. OSR has i = 5). The column "significant" denotes whether we can reject the null-hypothesis (significant iff p < αcomp). αcrit corresponds with the smallest possible significance level at which we would decide to reject the null-hypothesis (αcrit = α·p/αcomp).

z p αcomp significant αcrit

EE(S = 15) -6.51642 7.2E-11 0.004167 1 8.64E-10
EE(S = 10) -5.86009 4.63E-09 0.004545 1 5.09E-08
SMOTE -4.96936 6.72E-07 0.005 1 6.72E-06
ADASYN -4.78183 1.74E-06 0.005556 1 1.56E-05
OSR -4.64119 3.46E-06 0.00625 1 2.77E-05
AC(R = RL) -4.35991 1.3E-05 0.007143 1 9.11E-05
Far Knn -2.4378 0.014777 0.008333 0 0.088662
RUS -2.41436 0.015763 0.01 0 0.078815
AB -2.34404 0.019076 0.0125 0 0.076305
AC(R = 28) -2.20339 0.027567 0.016667 0 0.082701
CBU -2.13307 0.032919 0.025 0 0.065837
Cl Knn 0.609449 0.542227 0.05 0 0.542227

Table 8 Holm test at the α = 0.05 significance level with EE(S = 15) as reference.

z p αcomp significant αcrit

Cl Knn 7.12587 1.03E-12 0.004167 1 1.24E-11
BL 6.516421 7.2E-11 0.004545 1 7.92E-10
CBU 4.383348 1.17E-05 0.005 1 0.000117
AC(R = 28) 4.313027 1.61E-05 0.005556 1 0.000145
AB 4.172384 3.01E-05 0.00625 1 0.000241
RUS 4.102063 4.09E-05 0.007143 1 0.000287
Far Knn 4.078623 4.53E-05 0.008333 1 0.000272
AC(R = RL) 2.156513 0.031044 0.01 0 0.155218
OSR 1.875229 0.060761 0.0125 0 0.243045
ADASYN 1.734587 0.082814 0.016667 0 0.248442
SMOTE 1.547064 0.121848 0.025 0 0.243696
EE(S = 10) 0.65633 0.511612 0.05 0 0.511612

462 Timing-wise comparison

The performance of each of the proposed algorithms in terms of training times depends highly upon the chosen parameters. The β and βu parameters control the number of training instances and thereby strongly influence the subsequent SVM learning times for each of the over- and undersampling techniques. The number of boosting iterations t ∈ [0, T] is also a major factor in each of the boosting variants proposed in Section 3.3. Furthermore, the chosen regularization parameter C highly influences the training times of the SVM inducer: higher C-values result in larger training times (Fan et al 2008). Not only do the chosen hyperparameters affect learning times; data characteristics such as the number of minority instances, the imbalance level, the number of features, the amount of overlapping etc. also have a major effect.


In comparing each of the methods outlined in Section 3, we make use of a methodology similar to the one presented previously (see Section 4.6.1). For each of the ten folds, a suitable parameter combination is selected based on optimal validation set AUC-performance. These chosen parameters are applied in determining the timings of each method. The final results indicate the average timings across ten folds. Hence, we show the required amount of time for each method when using the parameter combination resulting in the best performance. Note that this might, for instance, result in using β = 1/3 in case of OSR, or using only t = 5 boosting iterations in EE. Computational timings are shown in Table 9.
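The timing protocol can be summarised in a short harness; everything below is an illustrative sketch (the paper does not publish its timing code), where `train_fn` stands in for any of the inducers and `best_params` holds the per-fold parameter combination selected on the validation set:

```python
import time

def avg_train_time(train_fn, folds, best_params):
    """Average training time across folds, refitting each fold with the
    parameter combination that maximised its validation-set AUC."""
    times = []
    for (X, y), params in zip(folds, best_params):
        t0 = time.perf_counter()
        train_fn(X, y, **params)                 # retrain with the chosen params
        times.append(time.perf_counter() - t0)
    return sum(times) / len(times)
```

The reported numbers in Table 9 are exactly such per-method averages over the ten folds.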

As one can observe from the aforementioned table, CBU is the slowest method; it relies on computationally intensive clustering procedures. The synthetic SMOTE and ADASYN oversampling approaches are also very time consuming. They both rely on computationally expensive nearest neighbour computations in conjunction with a synthetic sample generation procedure that requires a single pass over all minority training instances.

The RUS method shows, as expected, superior results in terms of computational timings. Other undersampling methods perform quite well, though they are slower than BL and OSR due to their inherent nearest neighbour computations.

The boosting variants seem to perform worse than the BL, OSR and each of the undersampling methods. Since EE was the best method in terms of AUC-performance (see Section 4.6.1), we focus on this learner in this paragraph. For the medium sized datasets (all but CRF and Bank), OSR outperforms EE(S = 15), yet for the large datasets (CRF and Bank) the opposite is true. An SVM is known to have a quadratic computational complexity in the number of training instances (Suykens et al 2002; García and Lozano 2007). For this reason, we consider OSR (or any other oversampling technique) inappropriate for large datasets. Since with EE each subset has a size limited to twice the minority class training size, we expect increasing benefits of the EE-technique for larger and more imbalanced behaviour data. As Junqué de Fortuny et al (2014a) have observed, larger behaviour data, in terms of the number of instances or features, also contributes to an increase in predictive power. Hence, we expect to see this type of big behaviour data more often in real-life applications, where computational resources can become a major constraint. Because the EE-technique is easily parallelizable, by simultaneously performing the boosting process for each of the independent subsets, we added an extra row in Table 9: EE par corresponds with the time needed for EE(S = 15) divided by 15, indicating the time that would be required for EE if each subset were trained in parallel. We can see that EE par can outperform OSR in 8 out of 16 cases, precisely for the datasets that show a high imbalance level (e.g. p = 1 or p = []).
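The parallelization argument is easy to see in code. In the sketch below (names are ours; `train_fn` stands in for the confidence-rated SVM/LR boosting routine applied to each subset), the S subsets are independent, so they can be dispatched to a worker pool:

```python
from concurrent.futures import ThreadPoolExecutor
import random

def balanced_subset(majority, minority, rng):
    """One EE subset: the full minority class plus an equally sized random
    draw from the majority class, so |subset| = 2 * |minority|."""
    return rng.sample(majority, len(minority)) + list(minority)

def easy_ensemble_parallel(majority, minority, train_fn, S=15, seed=0):
    """Train the S independent (boosted) learners of EasyEnsemble in parallel;
    with enough workers, wall-clock time approaches the single-subset time."""
    subsets = [balanced_subset(majority, minority, random.Random(seed + s))
               for s in range(S)]
    with ThreadPoolExecutor(max_workers=S) as pool:
        return list(pool.map(train_fn, subsets))
```

Since each subset holds only 2×|minority| instances, the per-worker SVM cost stays small even when the majority class is huge, which is exactly why EE par dominates for the highly imbalanced datasets.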

We conclude our comparison of the proposed methods in Figure 5, which shows the average ranks in terms of computational timings (see Table 9) and AUC performance for each of the available data sources. Regarding the ranks with respect to the latter, we added an extra row EE par, having the same AUC-performance as EE(S = 15) for each dataset in Table 5. Ranks are subsequently determined according to the procedure outlined in Section 4.6.1. Additionally, Figure 18 in Appendix D shows the result for the large and highly imbalanced CRF and Bank datasets. We expect the latter type of data to be of major interest for reasons already indicated.


Table 9 Computational timings (in seconds) averaged across ten folds, using the parameter combination resulting in the highest validation set AUC-performance. First and second best performances are emphasized in boldface (third and fourth best are underlined). EE par is the time required for EE(S = 15) divided by 15.

Mov G(p = 1) Mov G(p = 25) Mov Th(p = []) Yahoo A(p = 1)

BL 0032889 0056697 0558563 0026922
OSR 0055043 0062802 099009 0044421
SMOTE 0218821 0937057 3841482 0057726
ADASYN 0284688 1802399 5191265 0087694
RUS 0011431 0025383 0155224 0007991
CL Knn 0046599 0599846 0989914 0037182
Far Knn 0039887 080072 0683023 0027788
CBU 1034111 1060173 6822839 1692477
AB 0169792 0841443 3460246 0139251
AC(R = 28) 0471994 2996585 1086907 0366555
AC(R = RL) 053376 1179542 6065177 0209015
EE(S = 10) 0117226 6065145 117995 0148973
EE(S = 15) 020474 7173737 2119991 0180365
EE par 0013649 0478249 0141333 0012024

Yahoo A(p = 25) Yahoo G(p = 1) Yahoo G(p = 25) TaFeng(p = 1)

BL 0092954 0011915 0044164 0026728
OSR 0027887 0013241 0047206 0040919
SMOTE 1062686 0056153 0883698 0219553
ADASYN 2050993 0079073 1733367 0306618
RUS 0048471 0003234 0033423 0002916
CL Knn 084391 0025404 0502515 0092167
Far Knn 0664124 0026576 0500206 0080159
CBU 1569442 1287221 1355035 2467279
AB 0445546 0078777 0169977 0114619
AC(R = 28) 1034044 0321723 0515953 0926178
AC(R = RL) 0706215 0226741 0112949 0610233
EE(S = 10) 1026577 0100331 1527146 0058052
EE(S = 15) 1607596 0077483 2472582 010538
EE par 0107173 0005166 0164839 0007025

TaFeng(p = 25) Book(p = 1) Book(p = 25) LST(p = 1)

BL 0032033 0080035 0318093 0652045
OSR 0032414 0132927 0092757 087152
SMOTE 5089283 3409418 1143444 4987705
ADASYN 8148419 3689661 1225441 6840083
RUS 0020457 0022713 0031972 0432839
CL Knn 1713731 0400873 3711648 2508374
Far Knn 1539437 0379086 3988552 2511037
CBU 2642686 4198663 4631987 []
AB 0713265 061719 1238585 2466151
AC(R = 28) 1234647 1666131 2330635 1451671
AC(R = RL) 0279047 0860346 0197053 123763
EE(S = 10) 2484502 2145747 7177484 0524066
EE(S = 15) 3363971 2480066 1121945 0784111
EE par 0224265 0165338 0747963 0052274


Table 9 Continued. Additionally, an extra column indicating average ranks is included (lower ranks are preferred).

Adver(p = []) Adver(p = 1) CRF(p = []) Bank(p = [])

BL 0010953 0002796 0725911 7089334
OSR 0012178 0006166 3685813 1797481
SMOTE 0123112 0017764 5633862 []
ADASYN 0183767 0021728 5768669 []
RUS 0012115 000204 0147392 5247441
CL Knn 0061324 0005568 1106755 7373282
Far Knn 0079078 0007069 1110379 9759619
CBU 3378235 3236754 [] []
AB 0069199 0103518 1153196 8308618
AC(R = 28) 0193092 0068905 2047434 7170548
AC(R = RL) 0107652 0037963 1387174 1063466
EE(S = 10) 0138485 0085686 0198656 2495117
EE(S = 15) 0185136 0139121 0285345 3640107
EE par 0012342 0009275 0019023 2426738

Average Rank [pos]

BL 2.94 [2]
OSR 4.19 [4]
SMOTE 9.59 [11]
ADASYN 10.91 [13]
RUS 1.38 [1]
CL Knn 6.5 [5]
Far Knn 6.56 [6]
CBU 14 [14]
AB 8.06 [7]
AC(R = 28) 10.81 [12]
AC(R = RL) 9.25 [9]
EE(S = 10) 8.25 [8]
EE(S = 15) 9.56 [10]
EE par 3 [3]

Timings are a substantial concern with respect to these large datasets. As can be observed, the EE-technique thrives for these big and highly imbalanced data, even in its non-parallel form.


[Figure 5 appears here: a scatter plot of average rank AUC (vertical axis, 0-14) versus average rank Time (horizontal axis, 0-18), with one point per method: BL, OSR, SMOTE, ADASYN, RUS, Cl Knn, Far Knn, CBU, AB, AC(R = 28), AC(R = RL), EE(S = 10), EE(S = 15) and EE par.]

Fig 5 Average rank AUC versus average rank Time (see Table 9) across all datasets from Table 2 (excluding Flickr and Kdd). Regarding the AUC rank, an extra row EE par is added, having the same AUC-value as EE(S = 15). Ranks for each dataset are subsequently obtained via the procedure outlined in Section 4.6.1. Points occurring in the upper-right region are preferred.

5 The effect of the chosen base learner

In this short section we investigate the influence of the chosen base learner. More specifically, we ask ourselves whether some base inducers can more easily cope with the imbalanced learning issue, and in doing so we verify the conclusion that EE is a suitable technique to improve upon the baseline performance. We rely on several popular classification methods for behaviour data: two regularization based techniques (SVM, LR) and two heuristic types of approaches (Naive Bayes (NB), Behavioural Similarity (BeSim)). The next paragraphs briefly introduce these methods. Note that source codes and dataset results are provided in our on-line repository.

LR constructs a linear model that maximizes the likelihood of the observed data by imposing a logistic model for the parametric probability model P(Y|X). Ng and Jordan (2002) provide a thorough introduction to the topic27 and note that in its plain form LR can suffer from overfitting, especially when the input data is very high dimensional. Therefore, we resort to regularized logistic regression. We opt for L2-regularized LR, since L1-regularization corresponds with a natural way of imposing feature selection (Ng 2004). The LIBLINEAR toolbox (Fan et al 2008) is used to obtain these L2-regularized LR models.
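As a didactic stand-in for the LIBLINEAR solver (the paper uses LIBLINEAR itself; this pure-Python gradient-descent version is only meant to make the role of the regularization parameter C explicit), one can minimise the L2-regularised LR objective directly:

```python
import math

def train_l2_logreg(X, y, C=1.0, lr=0.1, epochs=200):
    """Gradient descent on (1/2)||w||^2 + C * sum_i log(1 + exp(-y_i w.x_i)),
    the L2-regularised logistic regression objective (labels y_i in {-1, +1})."""
    d, n = len(X[0]), len(X)
    w = [0.0] * d
    for _ in range(epochs):
        grad = list(w)                                   # gradient of the L2 term
        for xi, yi in zip(X, y):
            margin = yi * sum(wj * xj for wj, xj in zip(w, xi))
            coef = -yi / (1.0 + math.exp(margin))        # logistic-loss derivative
            for j, xj in enumerate(xi):
                grad[j] += C * coef * xj
        w = [wj - lr * gj / n for wj, gj in zip(w, grad)]
    return w

def predict_score(w, x):
    """P(y = +1 | x) under the fitted logistic model."""
    return 1.0 / (1.0 + math.exp(-sum(wj * xj for wj, xj in zip(w, x))))
```

Larger C puts more weight on the data-fit term and yields larger weights (a stronger, less regularised learner), which is the "weakness indicator" role of C discussed in the boosting experiments.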

27 For an accessible introduction, see the chapter on "Generative and Discriminative Classifiers: Naive Bayes and Logistic Regression" provided at the following link: http://www.cs.cmu.edu/~tom/NewChapters.html


NB (Ng and Jordan 2002) relies on the use of Bayes' rule, which expresses the posterior P(Y|X) as the product of the likelihood P(X|Y) and the prior P(Y), divided by the evidence P(X). The naivety of the approach originates from assuming conditional independence of the features. We impose a multivariate event model for the distribution P(X|Y). Junqué de Fortuny et al (2014a) provide an efficient NB implementation for dealing with large and sparse binary behaviour data.
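A minimal multivariate Bernoulli NB for sparse binary rows (each row a set of active feature indices) might look as follows; this is an illustrative sketch with Laplace smoothing, not the optimized implementation of Junqué de Fortuny et al (2014a):

```python
import math
from collections import defaultdict

def train_bernoulli_nb(rows, labels, n_features, alpha=1.0):
    """Multivariate Bernoulli NB on sparse binary data; each row is the set
    of active feature indices. Laplace smoothing with parameter alpha."""
    classes = sorted(set(labels))
    prior, cond = {}, {}
    for c in classes:
        idx = [i for i, lab in enumerate(labels) if lab == c]
        prior[c] = math.log(len(idx) / len(rows))
        counts = defaultdict(int)
        for i in idx:
            for f in rows[i]:
                counts[f] += 1
        # smoothed P(feature = 1 | class); this event model also scores absences
        cond[c] = {f: (counts[f] + alpha) / (len(idx) + 2 * alpha)
                   for f in range(n_features)}
    return classes, prior, cond

def nb_score(model, row):
    """Most probable class for a test row (set of active feature indices)."""
    classes, prior, cond = model
    best, best_s = None, float("-inf")
    for c in classes:
        s = prior[c]
        for f, p1 in cond[c].items():
            s += math.log(p1) if f in row else math.log(1.0 - p1)
        if s > best_s:
            best, best_s = c, s
    return best
```

Note that the multivariate event model sums over all features, including the inactive ones, which is precisely where a sparse-aware implementation gains its speed.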

The BeSim approach conceptually corresponds to projecting the bigraph to a unigraph using the procedure outlined in Section 3.2. Next, the weighted-vote relational neighbour classifier (Macskassy and Provost 2007) is used to infer unknown labels/scores through a weighted probability estimation using the known labels of neighbouring nodes. A very fast and scalable implementation of this approach (the SW-transformation) is presented in detail in Stankova et al (2015).
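The projection-plus-weighted-vote idea can be sketched as below. This naive version uses one simple choice of link weight (the raw count of shared bottom nodes) and is only meant to convey the concept; the actual SW-transformation of Stankova et al (2015) is a far more efficient formulation and discusses more refined weighting schemes:

```python
from collections import defaultdict

def besim_scores(train_rows, train_labels, test_rows):
    """Score test instances by a weighted vote over training instances that
    share bottom nodes (train_labels in {0, 1}; 0.5 when no neighbour exists)."""
    index = defaultdict(list)              # bottom node -> training instances
    for i, row in enumerate(train_rows):
        for f in row:
            index[f].append(i)
    scores = []
    for row in test_rows:
        votes = defaultdict(float)
        for f in row:
            for i in index[f]:
                votes[i] += 1.0            # one more shared bottom node
        den = sum(votes.values())
        num = sum(w * train_labels[i] for i, w in votes.items())
        scores.append(num / den if den else 0.5)
    return scores
```

The inverted index over bottom nodes is what keeps the computation proportional to the number of non-zero entries, matching the sparseness of behaviour data.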

In Table 10 we compare the performances of the baseline (BL) application of the methods with their EE-version (with S = 15, T = 20). It is quite striking to see the EE-version dominating the baseline version when looking at each method individually. This suggests that the imbalanced learning issue has an impact on each of the aforementioned techniques. Focussing on the regularization based approaches, we observe the baseline LR to be less influenced by the imbalance than the SVM method. Their EE-versions (EE SVM and EE LR) can boost the baseline performance in almost all cases and have the highest average ranks amongst all the proposed methods. As already emphasized, the regularization based approaches offer an added element of flexibility, in the sense that the strength of the underlying base learner can be controlled.

The heuristic approaches NB and BeSim do not offer this flexibility. Furthermore, it is not directly possible to integrate instance weights during the training phase of the base learner. During the boosting process, we therefore sample from the distribution Dt to generate (unweighted) examples and train a BeSim/NB learner next. The results in Table 10 indicate BL NB to be the worst performing algorithm amongst all methods tried. Due to the conditional independence assumption, one could regard NB as a weak learner, which makes it suitable for use in a boosting process. Indeed, the EE NB approach is able to significantly improve upon its performance. We found this to be the slowest EE-version28 (e.g. taking more than 24 hours to process a single fold of the Bank dataset, versus several minutes for the other versions). The average rank estimate of EE NB is rather pessimistic, since we assigned the lowest performance across all methods in case of missing results. BeSim can be considered a stronger learner than NB, and its possible boost in performance is typically lower than that of NB. Having said this, we found EE BeSim to improve upon BL BeSim for all datasets (except Adver(p = 1)).

28 This statement only relates to the EE-version. In the baseline case, the timings of the NB-version are roughly of the same order of magnitude as SVM and LR (though this heavily depends on the chosen regularization parameter C). With EE, each subset has a size limited to twice the minority class training size (a relatively low number compared to the size of the majority class). This works in favour of SVM/LR (e.g. SVM has a computational complexity of O(m²), with m the number of training instances). Also note that the NB-version is indeed optimized for sparse data. Internally, the NB-implementation writes a fairly large model file to a specified directory (sizes of 1 GB can be encountered there). This writing operation slows down the training stage (especially in case of EE, where S×T = 15×20 = 300 model files are constructed).


Table 10 Average tenfold AUC test set performance of the BL compared to its EE-version (S = 15, T = 20) for SVM, LR, BeSim and NB. In case the EE NB implementation crashed, we put a × in the table. If it is too slow (> 24 hours per fold), we place a []. Additionally, average ranks are assigned to each method. In case EE NB has missing results for a particular dataset, we ensure it has the lowest rank across all other algorithms.

Mov G(p = 1) Mov G(p = 25) Mov Th(p = []) Yahoo A(p = 1)

BL SVM 716 (262) 8141 (132) 7977 (533) 5649 (337)
EE SVM 7612 (288) 8513 (186) 8643 (586) 5974 (296)
BL LR 7102 (209) 8439 (184) 8314 (417) 5784 (239)
EE LR 7669 (292) 8503 (198) 863 (537) 5979 (262)
BL BeSim 761 (358) 813 (292) 8281 (66) 5627 (273)
EE BeSim 7631 (371) 8137 (29) 8502 (628) 577 (171)
BL NB 7026 (584) 7701 (254) 7048 (1014) 5256 (209)
EE NB 7593 (283) 8556 (201) 8691 (415) 5755 (273)

Yahoo A(p = 25) Yahoo G(p = 1) Yahoo G(p = 25) TaFeng(p = 1)

BL SVM 6161 (248) 6684 (366) 7882 (139) 5575 (16)
EE SVM 6638 (316) 7348 (232) 8055 (155) 6113 (183)
BL LR 6627 (296) 6982 (193) 8045 (159) 5891 (231)
EE LR 6622 (328) 7308 (214) 8053 (156) 6143 (232)
BL BeSim 6454 (202) 6889 (249) 7955 (196) 5789 (118)
EE BeSim 6525 (223) 7118 (291) 8004 (185) 5936 (147)
BL NB 65 (165) 6333 (256) 7889 (164) 5461 (12)
EE NB 666 (279) 7099 (288) 8101 (13) 5901 (184)

TaFeng(p = 25) Book(p = 1) Book(p = 25) LST(p = 1)

BL SVM 6694 (134) 526 (129) 6008 (071) 9999 (001)
EE SVM 704 (13) 5534 (128) 654 (051) 9998 (001)
BL LR 6924 (13) 5534 (127) 6384 (075) 9999 (001)
EE LR 7028 (128) 5549 (149) 6541 (063) 9997 (002)
BL BeSim 6749 (123) 5519 (127) 637 (063) 9999 (001)
EE BeSim 68 (121) 5521 (115) 6438 (042) 9999 (0)
BL NB 6521 (164) 5293 (09) 5975 (047) 9869 (03)
EE NB 7072 (115) × 6346 (061) 9992 (004)

Adver(p = []) Adver(p = 1) CRF(p = []) Bank(p = [])

BL SVM 9637 (194) 9118 (297) 6436 (1897) 6682 (088)
EE SVM 9763 (135) 933 (214) 8635 (999) 7154 (076)
BL LR 9719 (144) 8851 (193) 8187 (1963) 7143 (072)
EE LR 9757 (096) 9302 (206) 8684 (962) 7177 (062)
BL BeSim 9726 (112) 9538 (135) 8691 (936) 6785 (067)
EE BeSim 9738 (104) 9383 (135) 8702 (1043) 7041 (055)
BL NB 9375 (19) 9337 (19) 8724 (938) 6783 (063)
EE NB 9404 (175) × × []

Flickr(p = 0.1) Kdd(p = 0.5) Average Rank

BL SVM 7492 (017) 7453 (005) 6.44 [7]
EE SVM 7986 (013) 8098 (005) 2.39 [1]
BL LR 7903 (011) 8129 (004) 4.28 [4]
EE LR 7985 (013) 8075 (005) 2.61 [2]
BL BeSim 7462 (013) 7495 (0) 5.11 [6]
EE BeSim 764 (013) 7755 (003) 3.61 [3]
BL NB 8136 (01) 7429 (005) 6.5 [8]
EE NB [] [] 5.06 [5]


6 Conclusions and future research directions

We investigated the effects of sampling and cost-sensitive learning techniques on the problem of learning from imbalanced behaviour data. This type of big data has fundamentally different properties and internal structure than traditional data and is characterized by sparseness and very large dimensions. This is a new research topic which enables benefits across a wide variety of application domains, such as fraud detection, targeted advertising, predictive policing, churn prediction, default prediction etc. To enable future developments in this field, we provide an on-line repository containing implementations, datasets and results, available at the following link: http://www.applieddatamining.com/cms/?q=software

Oversampling techniques need to be adapted in order to cope with this kind of data. We proposed new versions of the traditional SMOTE and ADASYN techniques, which differ in the way synthetic samples are generated and similarity computations are performed. The plain random oversampling (OSR) and the synthetic approaches are competitive methods in terms of AUC and do not seem to suffer from overfitting, as traditional studies report. Though the synthetic approaches have AUC-performances comparable to OSR, they are impractical due to their computationally expensive procedures. The OSR method is very fast for small and medium sized datasets, yet becomes inappropriate for larger datasets because the training times of the underlying base learner increase drastically.
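For intuition, a generic SMOTE-style generator for sparse binary rows could look as follows. This is our illustrative adaptation, not the exact variant proposed in the paper: a synthetic sample keeps the features shared with a random near neighbour (by Jaccard similarity) plus a random subset of the non-shared ones, mimicking interpolation between the two parents:

```python
import random

def smote_binary(minority_rows, n_synthetic, k=5, seed=0):
    """SMOTE-style oversampling sketch for sparse binary rows
    (each row is the set of active feature indices)."""
    rng = random.Random(seed)

    def jaccard(a, b):
        return len(a & b) / len(a | b) if a | b else 0.0

    synthetic = []
    for _ in range(n_synthetic):
        x = rng.choice(minority_rows)
        # k nearest minority neighbours under Jaccard similarity
        neighbours = sorted((r for r in minority_rows if r is not x),
                            key=lambda r: -jaccard(x, r))[:k]
        nb = rng.choice(neighbours)
        common, diff = x & nb, x ^ nb
        keep = {f for f in diff if rng.random() < 0.5}   # "interpolate"
        synthetic.append(common | keep)
    return synthetic
```

Every synthetic feature originates from one of the two parents, so the generated instances stay inside the (sparse) minority feature space.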

With respect to undersampling methodologies, we employed a random undersampling technique (RUS) and Knn-based informed approaches. The latter type selects majority class instances that are closest (Cl Knn) or farthest (Far Knn) from the minority class examples and allows one to investigate the effect of noise/outliers on SVM performance. We observed that these odd instances have a severe performance degrading effect. RUS and Far Knn approaches can significantly improve performance with respect to the baseline, thanks to their abilities to reduce noise/outlier levels and to put more emphasis on the minority class. Having said this, they usually perform worse than any of the oversampling approaches. This is in sheer contrast to the results obtained in traditional studies and is heavily related to the intrinsic nature of behaviour data. Timing-wise, RUS is the fastest method. The informed approaches are slower than RUS due to the required nearest neighbour computations. Additionally, a cluster-based undersampling (CBU) technique was proposed to alleviate the within-class imbalance. For low undersampling rates, the performance of CBU is superior to plain RUS, whereas for high undersampling rates the opposite is true. Overall, no substantial gain in performance is obtained in comparison with a simple RUS approach. Furthermore, the method is the slowest among all techniques considered and relies on computationally expensive clustering heuristics.

We studied the effects of boosting, where we made use of a very specific linear SVM and logistic regression combination that uses confidence rated predictions instead of a more popular plain "discrete" boosting algorithm (with weak hypotheses having values in {−1, 1}). Our experiments clearly indicated that the regularization constant C in the SVM formulation acts as a "weakness" indicator. Indeed, higher C-values produce stronger learners and should not be used in the boosting process, which matches the analysis of studies dealing with traditional data. Strong learners can


already quite accurately distinguish majority class from minority class behaviours. The boosting process will focus on the hard-to-learn instances, among which are noise/outliers, and this can result in overfitting behaviour. We also observed that boosting can usually outperform the plain baseline, indicating that a combination of weak learners can achieve higher AUC-performance than a single strong learner.

EasyEnsemble (EE) integrates random undersampling with a boosting process. By sampling several balanced subsets from the training data and combining these, we can emphasize the minority class, achieve noise/outlier reduction effects and simultaneously explore the majority class space efficiently. Our version differs from the original formulation because we employ a different type of underlying weak learner (an SVM and LR combination) that is fed to a confidence rated boosting algorithm (instead of the discrete version). We observed superior AUC-performance with respect to all of the aforementioned techniques. Furthermore, the method is very fast, since each subset can be trained in parallel and has a size that is twice as large as the minority class training set. We studied the effect of the number of subsets and noted that a limited amount of these (O(10)) is already sufficient to obtain satisfactory performances; increasing this number further yields only minor benefits.

Additionally, a statistical comparison of each of the proposed methods was conducted across a wide variety of datasets. Holm's significance test indicates that each of the proposed techniques (oversampling, undersampling except Cl Knn, cost-sensitive learning and boosting variants) can significantly outperform the plain baseline performance (each of the null-hypotheses of equal performance gets rejected with a confidence level of at least 91%). Comparison of the techniques with EE showed that the latter can significantly outperform all of them (each of the null-hypotheses of equal performance gets rejected with a confidence level of at least 75%). A timing-wise comparison showed EE to be the preferred method (after RUS) for large and highly imbalanced datasets. A parallel implementation would make EE the fastest method even for medium sized datasets that show a high level of imbalance.

The influence of the chosen base learner was investigated by comparing the baseline to its EE-version for the SVM, LR, NB and BeSim inducers. Each of the EE-versions is able to improve upon the baseline performance for nearly all datasets, where base learners of the regularization type are superior because of their ability to control the strength of the learner.

Though we conducted an extensive benchmark study, there are still plenty of future research directions. First and most importantly, there is an urgent need to develop a more general theoretical framework to completely understand the effect of imbalance and why certain methods outperform others. Secondly, other types of mechanisms could be investigated in this setting: advanced sampling (Guo et al 2008) set-ups perform sampling based on preliminary classifications; one-class learning is a recognition based methodology where a discriminative boundary is learned around the training examples of a single class alone, and Raskutti and Kowalczyk (2004) demonstrated the superiority of the recognition based one-class SVMs over the discriminative two-class SVMs; other types of ensemble learning strategies (e.g. bagging and stacking) could be investigated; Batista et al (2004) note that combining over- and undersampling can be beneficial when data are highly imbalanced or show very few minority class instances; the ideas behind Extended Nearest Neighbour (Tang and


He 2015) could guide us in the sampling process. As opposed to Knn, "it considers not only who are the nearest neighbours of a certain sample, but also who consider the particular sample as their nearest neighbour" (Tang and He 2015). The method assigns a class membership to an unknown sample that maximizes the intra-class coherence. Finally, more advanced techniques exist that can provide a K (the number of nearest neighbours) faster or with (slightly) better performance.

Another issue relates to the lack of comprehensibility of the EasyEnsemble technique in its present form. Because we currently integrate Logistic Regression in the boosting process, the final ensemble classifier is non-linear and therefore incomprehensible. It might be interesting to investigate the numerical boosting technique proposed in Schapire and Singer (1999), where each weak learner is allowed to output real values (instead of values in [−1, 1]). In that case we would be able to use a plain linear SVM as weak learner, without the Logistic Regression component. The final ensemble would be a linear model, which is surely easier to comprehend. Also, the instance-based approach of Martens and Provost (2014) could be used to explain individual predictions of the non-linear EE-technique.

A Oversampling experiments

Table 11 Oversampling experiments. Results show the average tenfold AUC test set performance with respect to increasing β-values (β = [β1, β2, β3, β4] = [0, 1/3, 2/3, 1]). Additionally, standard deviations are shown between brackets. Optimal performances for each method are highlighted in boldface. See Section 4.2 for parameter settings.

Mov G(p = 1)
         β1             β2             β3             β4
OSR      71.6 (2.62)    74.37 (2.04)   73.6 (1.84)    74.73 (2.45)
SMOTE    71.6 (2.62)    75.08 (2.18)   76.02 (2.14)   76.48 (2.3)
ADASYN   71.6 (2.62)    75.16 (1.92)   75.93 (2.08)   76.47 (2.29)

Mov G(p = 25)
         β1             β2             β3             β4
OSR      81.41 (1.32)   83.49 (1.81)   83.84 (1.96)   83.91 (2.04)
SMOTE    81.41 (1.32)   83.32 (1.97)   83.59 (2.04)   83.76 (2.11)
ADASYN   81.41 (1.32)   83.61 (1.82)   84.02 (1.97)   83.69 (1.96)

Mov Th(p = [])
         β1             β2             β3             β4
OSR      79.77 (5.33)   85.3 (4.66)    83.16 (4.5)    84.59 (5.69)
SMOTE    79.77 (5.33)   84.18 (6.51)   85.58 (5.97)   84.33 (5.75)
ADASYN   79.77 (5.33)   84.11 (6.77)   85.86 (6.06)   85.36 (5.13)

Yahoo A(p = 1)
         β1             β2             β3             β4
OSR      55.92 (2.97)   58.66 (3.27)   59.99 (2.28)   59.74 (1.78)
SMOTE    55.92 (2.97)   59.76 (2.62)   59.74 (2.67)   59.43 (2.4)
ADASYN   55.92 (2.97)   59.54 (2.53)   59.55 (2.94)   59.56 (2.22)

Yahoo A(p = 25)
         β1             β2             β3             β4
OSR      61.68 (2.42)   64.19 (3.17)   65.08 (3.26)   64.67 (2.1)
SMOTE    61.68 (2.42)   65.46 (3.63)   65.33 (3.23)   64.52 (2.98)
ADASYN   61.68 (2.42)   65.04 (3.74)   65.41 (3.47)   64.4 (2.21)

Continues on next page

44 Jellis Vanhoeyveld David Martens

Table 11 continued

Yahoo G(p = 1)
         β1             β2             β3             β4
OSR      66.84 (3.66)   72.18 (2.36)   73.11 (2.7)    72.49 (3.41)
SMOTE    66.84 (3.66)   72.65 (2.85)   73.27 (3.36)   73.37 (3.56)
ADASYN   66.84 (3.66)   72.87 (2.83)   73.18 (3.2)    73.39 (3.59)

Yahoo G(p = 25)
         β1             β2             β3             β4
OSR      78.82 (1.39)   78.78 (2.02)   78.74 (1.86)   78.35 (1.47)
SMOTE    78.82 (1.39)   79.23 (1.57)   79.1 (1.2)     79.03 (1.89)
ADASYN   78.82 (1.39)   79.12 (1.43)   79.23 (1.37)   79.51 (2.01)

TaFeng(p = 1)
         β1             β2             β3             β4
OSR      55.75 (1.6)    59.23 (1.96)   60 (1.68)      61.04 (2.36)
SMOTE    55.75 (1.6)    60.26 (1.95)   61.49 (1.8)    61.13 (1.52)
ADASYN   55.75 (1.6)    60.26 (1.9)    61.44 (1.85)   61.16 (1.5)

TaFeng(p = 25)
         β1             β2             β3             β4
OSR      66.94 (1.34)   68.84 (1.21)   66.99 (1.42)   67.7 (1.41)
SMOTE    66.94 (1.34)   68.47 (1.5)    67.07 (1.15)   66.65 (0.81)
ADASYN   66.94 (1.34)   68.62 (1.38)   67.85 (1.6)    66.91 (1.39)

Book(p = 1)
         β1             β2             β3             β4
OSR      52.6 (1.29)    53.61 (0.94)   55.41 (1.75)   55.87 (1.44)
SMOTE    52.6 (1.29)    54.77 (0.99)   54.91 (0.8)    54.36 (0.98)
ADASYN   52.6 (1.29)    54.86 (1.13)   55.06 (0.73)   54.54 (0.92)

Book(p = 25)
         β1             β2             β3             β4
OSR      60.08 (0.71)   61.05 (0.94)   61.82 (1.12)   64.62 (0.57)
SMOTE    60.08 (0.71)   62.6 (0.73)    60.95 (0.68)   63 (0.8)
ADASYN   60.08 (0.71)   62.33 (0.73)   60.77 (0.85)   63.04 (0.58)

LST(p = 1)
         β1             β2             β3             β4
OSR      99.99 (0.01)   99.99 (0.01)   99.99 (0.01)   99.99 (0.01)
SMOTE    99.99 (0.01)   99.99 (0.01)   99.99 (0.01)   99.99 (0.01)
ADASYN   99.99 (0.01)   99.99 (0.01)   99.99 (0.01)   99.99 (0.01)

Adver(p = [])
         β1             β2             β3             β4
OSR      96.61 (1.82)   97.31 (1.65)   97.07 (1.84)   97.07 (1.79)
SMOTE    96.61 (1.82)   96.91 (1.66)   97.19 (1.65)   97.07 (1.91)
ADASYN   96.61 (1.82)   97.1 (1.7)     97.08 (1.87)   97.07 (1.88)

Adver(p = 1)
         β1             β2             β3             β4
OSR      90.93 (3.02)   91.27 (3.03)   92.66 (2.82)   93.29 (1.97)
SMOTE    90.93 (3.02)   92.51 (2.03)   92.96 (2.14)   93.53 (1.81)
ADASYN   90.93 (3.02)   92.22 (2.33)   92.7 (2.36)    93.88 (1.73)

CRF(p = [])
         β1              β2              β3              β4
OSR      64.06 (16.43)   80.82 (12.94)   81.28 (12.27)   81.91 (11.28)
SMOTE    64.06 (16.43)   78.64 (16.86)   82.52 (13.74)   79.32 (16.26)
ADASYN   64.06 (16.43)   78.95 (16.72)   81.19 (16.32)   79.31 (16.19)

Bank(p = [])
         β1             β2             β3             β4
OSR      66.82 (0.88)   70.1 (0.74)    71.39 (0.8)    71.47 (0.8)
SMOTE    []             []             []             []
ADASYN   []             []             []             []


B Undersampling experiments

Table 12 Undersampling experiments. Results show the average tenfold AUC test set performance with respect to increasing βu-values (βu = [βu1, βu2, βu3, βu4, βu5] = [0, 1/4, 1/2, 3/4, 1]). Optimal performances (boldface) are highlighted for each imbalance ratio. Additionally, standard deviations are shown between brackets. Cl K represents the "Closest Knn" technique; Cl T represents the "Closest tot sim" technique (similar for Far K and Far T, see Section 3.2). Parameter settings are given in Section 4.2.
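As an illustration of these similarity-ranked undersampling families, the sketch below ranks majority instances by their total similarity to the minority class and discards one end of the ranking. The dot-product similarity and the `mode` argument are our own simplification of the "Closest"/"Farthest" variants of Section 3.2, not the paper's exact procedure.

```python
def undersample_majority(X_maj, X_min, keep_frac=0.5, mode="far"):
    """Rank majority instances by their total similarity to the minority
    class and keep a fraction. mode='far' discards the least similar
    (farthest) instances; mode='close' discards the most similar ones
    ('Closest tot sim'-style noise removal)."""
    def dot(u, v):
        return sum(a * b for a, b in zip(u, v))
    tot_sim = [sum(dot(x, m) for m in X_min) for x in X_maj]
    order = sorted(range(len(X_maj)), key=lambda i: tot_sim[i])  # ascending
    n_keep = max(1, int(keep_frac * len(X_maj)))
    kept = order[-n_keep:] if mode == "far" else order[:n_keep]
    return [X_maj[i] for i in sorted(kept)]
```

With behaviour data the dot product of binary feature vectors simply counts shared actions, which is why similarity-based selection remains cheap even in very high dimensions.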

Mov G(p = 1)
        βu1           βu2           βu3           βu4           βu5
RUS     71.6 (2.6)    71.83 (2.6)   72.54 (2.5)   72.39 (3.1)   70.61 (3.5)
Cl K    71.6 (2.6)    71.4 (2.0)    70.96 (1.9)   70.43 (2.4)   69.05 (4.1)
Cl T    71.6 (2.6)    70.28 (2.5)   66.74 (2.0)   66.8 (2.1)    68.18 (3.6)
Far K   71.6 (2.6)    72.36 (2.7)   71.26 (3.4)   66.57 (5.2)   53.5 (3.5)
Far T   71.6 (2.6)    72.22 (2.8)   71.63 (3.6)   64.28 (5.3)   50.88 (4.4)
CBU     72.55 (2.6)   73.28 (2.6)   73.12 (2.6)   73.84 (2.5)   73 (3.1)

Mov G(p = 25)
        βu1           βu2           βu3           βu4           βu5
RUS     81.41 (1.3)   81.36 (1.3)   81.78 (1.7)   82.05 (1.7)   81.6 (2.1)
Cl K    81.41 (1.3)   80.86 (1.2)   80.95 (1.6)   79.73 (2.3)   77.95 (2.3)
Cl T    81.41 (1.3)   79.9 (1.2)    78.21 (1.4)   77.87 (1.5)   77.76 (2.3)
Far K   81.41 (1.3)   80.9 (1.5)    78.17 (1.8)   74.25 (2.4)   69.79 (3.2)
Far T   81.41 (1.3)   80.86 (1.5)   77.2 (2.4)    71.16 (2.7)   62.4 (2.8)
CBU     81.53 (1.4)   81.64 (1.3)   81.29 (1.6)   81.28 (2.1)   80.34 (2.7)

Mov Th(p = [])
        βu1           βu2           βu3           βu4           βu5
RUS     79.77 (5.3)   80.32 (5.8)   81.57 (5.5)   81.86 (6.6)   81.26 (6.2)
Cl K    79.77 (5.3)   79.25 (4.5)   78.07 (5.0)   76.25 (6.5)   62.46 (8.5)
Cl T    79.77 (5.3)   78.4 (4.4)    72.41 (3.5)   64.66 (4.5)   60.37 (7.3)
Far K   79.77 (5.3)   84.54 (5.0)   83.64 (6.4)   80.02 (7.3)   56.82 (10.3)
Far T   79.77 (5.3)   85.03 (5.7)   82.68 (6.8)   75.61 (9.2)   56.77 (10.9)
CBU     80.11 (5.8)   81.17 (6.0)   81.08 (6.5)   84.17 (5.1)   80.96 (6.9)

Yahoo A(p = 1)
        βu1           βu2           βu3           βu4           βu5
RUS     55.92 (3.0)   55.57 (3.4)   56.44 (3.0)   55.83 (3.4)   56.37 (3.3)
Cl K    55.92 (3.0)   55.67 (2.4)   53.12 (2.0)   50.57 (1.8)   53.79 (3.5)
Cl T    55.92 (3.0)   55.69 (2.1)   53.35 (2.2)   50.31 (2.2)   52.35 (3.3)
Far K   55.92 (3.0)   57.35 (2.2)   56.92 (1.1)   56.95 (2.3)   51.18 (2.0)
Far T   55.92 (3.0)   56.93 (2.4)   54.74 (1.9)   57.01 (1.8)   51.18 (2.0)
CBU     58.21 (2.6)   58.45 (3.3)   58.31 (3.5)   58.39 (3.5)   56.09 (2.6)

Yahoo A(p = 25)
        βu1           βu2           βu3           βu4           βu5
RUS     61.68 (2.4)   62.9 (2.9)    63.62 (3.6)   63.75 (3.1)   63.19 (1.9)
Cl K    61.68 (2.4)   61.14 (2.1)   57.62 (1.6)   54.02 (1.8)   51.48 (1.4)
Cl T    61.68 (2.4)   60.89 (2.8)   58.11 (1.4)   54.4 (2.1)    51.76 (1.4)
Far K   61.68 (2.4)   63.96 (3.0)   62.62 (2.2)   59.61 (1.5)   56.25 (1.6)
Far T   61.68 (2.4)   63.71 (2.4)   59.72 (1.6)   57.27 (1.1)   54.47 (1.1)
CBU     62.46 (2.6)   61.85 (1.4)   61.78 (2.2)   59.94 (3.0)   60.1 (4.0)

Continues on next page


Table 12 continued

Yahoo G(p = 1)
        βu1           βu2           βu3           βu4           βu5
RUS     66.84 (3.7)   67.85 (3.2)   68.36 (3.2)   68.23 (4.0)   69.9 (4.2)
Cl K    66.84 (3.7)   66.71 (2.8)   64.3 (3.6)    61.98 (3.9)   61.15 (1.9)
Cl T    66.84 (3.7)   65.79 (2.7)   63.55 (3.3)   59.21 (3.5)   61.08 (2.4)
Far K   66.84 (3.7)   66.76 (4.1)   63.84 (3.4)   65.16 (2.0)   48.5 (2.9)
Far T   66.84 (3.7)   66.95 (4.1)   63.48 (2.9)   65.16 (2.0)   48.48 (2.9)
CBU     69.68 (4.1)   70.59 (3.2)   70.64 (3.7)   70.2 (2.9)    63.35 (3.6)

Yahoo G(p = 25)
        βu1           βu2           βu3           βu4           βu5
RUS     78.82 (1.4)   78.91 (1.6)   78.97 (1.6)   78.61 (1.6)   77.82 (2.1)
Cl K    78.82 (1.4)   77.26 (1.5)   72.52 (1.5)   67.86 (2.0)   65.07 (2.7)
Cl T    78.82 (1.4)   76.83 (1.0)   71.99 (1.8)   67.15 (2.3)   61.1 (2.7)
Far K   78.82 (1.4)   78.26 (2.2)   74.69 (2.7)   67.22 (2.1)   60.72 (2.3)
Far T   78.82 (1.4)   77.68 (2.6)   72.44 (3.0)   64.94 (2.4)   59.6 (2.0)
CBU     75.25 (3.2)   75.22 (2.4)   74.69 (2.3)   73.07 (2.4)   70.69 (2.4)

TaFeng(p = 1)
        βu1           βu2           βu3           βu4           βu5
RUS     55.75 (1.6)   56.1 (1.6)    56.26 (1.7)   57.23 (1.7)   59.25 (2.2)
Cl K    55.75 (1.6)   55.68 (1.6)   55.58 (1.5)   55.08 (1.1)   51.05 (1.5)
Cl T    55.75 (1.6)   55.67 (1.6)   54.47 (1.6)   47.53 (1.6)   49.3 (1.1)
Far K   55.75 (1.6)   58.99 (1.2)   59.47 (1.1)   60.04 (1.2)   56.31 (1.0)
Far T   55.75 (1.6)   58.92 (1.3)   59.25 (1.3)   58.58 (1.1)   56.31 (1.0)
CBU     57.8 (1.0)    58.47 (1.1)   58.15 (0.9)   58.87 (1.4)   57.65 (1.6)

TaFeng(p = 25)
        βu1           βu2           βu3           βu4           βu5
RUS     66.94 (1.3)   67.44 (1.3)   68.1 (1.4)    68.27 (1.4)   66.13 (1.2)
Cl K    66.94 (1.3)   66.13 (1.4)   63.39 (1.2)   59.83 (1.3)   56.94 (0.7)
Cl T    66.94 (1.3)   66.38 (1.5)   62.89 (1.6)   57.46 (1.3)   54.56 (1.3)
Far K   66.94 (1.3)   68.06 (1.4)   66.43 (1.6)   64.46 (1.5)   63.35 (1.3)
Far T   66.94 (1.3)   64.31 (1.1)   62.69 (1.0)   61.27 (1.1)   59.03 (1.0)
CBU     64.81 (1.2)   64.15 (1.1)   64.13 (1.2)   63.88 (0.8)   63.46 (0.8)

Book(p = 1)
        βu1           βu2           βu3           βu4           βu5
RUS     52.6 (1.3)    52.79 (0.9)   53.46 (0.8)   53.89 (0.9)   54.05 (0.9)
Cl K    52.6 (1.3)    52.56 (1.2)   52.52 (1.3)   52.39 (1.1)   53.09 (1.1)
Cl T    52.6 (1.3)    52.56 (1.2)   52.52 (1.3)   52.39 (1.1)   53.05 (0.7)
Far K   52.6 (1.3)    55.21 (1.2)   56.21 (1.8)   56.14 (1.2)   53.06 (1.0)
Far T   52.6 (1.3)    55.21 (1.2)   56.21 (1.8)   56.14 (1.2)   53.06 (1.0)
CBU     54.28 (0.9)   53.77 (1.0)   53.33 (1.1)   53.34 (0.9)   52.84 (0.8)

Book(p = 25)
        βu1           βu2           βu3           βu4           βu5
RUS     60.08 (0.7)   60.13 (0.6)   60.4 (0.8)    60.33 (0.8)   63.28 (0.8)
Cl K    60.08 (0.7)   59.96 (0.7)   60.13 (0.8)   59.96 (1.0)   59.28 (0.7)
Cl T    60.08 (0.7)   59.96 (0.7)   60.13 (0.8)   60.29 (0.4)   54.5 (0.9)
Far K   60.08 (0.7)   63.29 (1.0)   64.19 (0.8)   57.3 (1.1)    55.66 (1.1)
Far T   60.08 (0.7)   62.14 (0.5)   58.27 (0.6)   56.37 (1.0)   55.66 (1.1)
CBU     54.82 (0.9)   54.67 (0.9)   54.71 (0.9)   54.66 (1.0)   54.78 (0.9)

Continues on next page


Table 12 continued

LST(p = 1)
        βu1           βu2           βu3           βu4           βu5
RUS     99.99 (0.0)   99.99 (0.0)   99.99 (0.0)   99.98 (0.0)   99.99 (0.0)
Cl K    99.99 (0.0)   99.99 (0.0)   99.99 (0.0)   99.99 (0.0)   99.99 (0.0)
Cl T    99.99 (0.0)   99.99 (0.0)   99.99 (0.0)   99.99 (0.0)   99.98 (0.0)
Far K   99.99 (0.0)   99.98 (0.0)   99.98 (0.0)   99.98 (0.0)   99.98 (0.0)
Far T   99.99 (0.0)   99.98 (0.0)   99.98 (0.0)   99.98 (0.0)   99.98 (0.0)
CBU     []            []            []            []            []

Adver(p = [])
        βu1           βu2           βu3           βu4           βu5
RUS     96.61 (1.8)   96.32 (1.8)   96.63 (1.4)   97.12 (2.1)   96.22 (1.6)
Cl K    96.61 (1.8)   96.44 (1.5)   96.14 (1.5)   96.04 (2.0)   94.8 (2.5)
Cl T    96.61 (1.8)   95.87 (2.1)   94.32 (1.9)   93.01 (2.2)   90.72 (2.3)
Far K   96.61 (1.8)   96.53 (1.4)   95.76 (2.0)   94.39 (1.8)   90.49 (3.1)
Far T   96.61 (1.8)   96.54 (1.5)   95.67 (1.9)   94.54 (1.8)   89.3 (2.8)
CBU     96.85 (2.3)   96.85 (2.3)   97.05 (1.5)   96.6 (1.6)    96.06 (2.1)

Adver(p = 1)
        βu1           βu2           βu3           βu4           βu5
RUS     90.93 (3.0)   91.53 (3.1)   92.37 (3.4)   91.9 (2.9)    91.93 (2.2)
Cl K    90.93 (3.0)   90.64 (3.0)   89.87 (3.9)   90.21 (3.6)   89.18 (2.0)
Cl T    90.93 (3.0)   89.7 (3.5)    88.55 (3.4)   85.76 (3.3)   88.2 (2.3)
Far K   90.93 (3.0)   93.8 (2.3)    92.4 (2.6)    88.73 (3.4)   85.51 (4.0)
Far T   90.93 (3.0)   93.62 (2.4)   93.2 (2.2)    88.41 (3.6)   85.51 (4.0)
CBU     93.22 (2.4)   93.76 (2.5)   93.89 (2.6)   93.52 (2.7)   91.27 (2.0)

CRF(p = [])
        βu1            βu2            βu3            βu4            βu5
RUS     64.06 (16.4)   63.28 (15.9)   67.98 (17.4)   66.95 (21.9)   87.73 (8.8)
Cl K    64.06 (16.4)   62.44 (16.6)   62.34 (16.9)   71.37 (13.8)   78.22 (17.7)
Cl T    64.06 (16.4)   62.44 (16.6)   62.34 (16.9)   71.37 (13.8)   62.67 (22.9)
Far K   64.06 (16.4)   83.8 (14.2)    83.93 (14.8)   84.49 (13.7)   86.11 (9.7)
Far T   64.06 (16.4)   83.8 (14.2)    83.93 (14.8)   84.49 (13.7)   86.11 (9.7)
CBU     []             []             []             []             []

Bank(p = [])
        βu1           βu2           βu3           βu4           βu5
RUS     66.82 (0.9)   67.02 (0.9)   67.37 (0.8)   67.99 (0.6)   69.5 (1.0)
Cl K    66.82 (0.9)   66.17 (0.7)   65.24 (0.6)   64.86 (0.6)   58.53 (1.1)
Cl T    66.82 (0.9)   64.92 (1.1)   60.69 (0.9)   56.33 (0.8)   52.87 (0.7)
Far K   66.82 (0.9)   66.95 (0.6)   66.19 (0.6)   64.42 (0.6)   58.25 (1.1)
Far T   66.82 (0.9)   67.16 (0.6)   64.2 (0.8)    59.67 (1.0)   58.25 (1.1)
CBU     []            []            []            []            []


C Boosting experiments

This section presents the results of each of the boosting experiments from Section 4.5 on the remaining data sources. Each of the figures below depicts the average tenfold AUC-performance on test data (with µ = 100) with respect to the number of boosting iterations for (left) Adaboost (AB), AdaCost (AC) and EasyEnsemble (EE), with C chosen according to the highest validation set AUC-performance (over all possible boosting rounds), and (right) AB and EE (with S = 15) at varying C-levels.
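The EasyEnsemble set-up evaluated in these figures can be sketched as follows. A nearest-centroid scorer stands in for the boosted linear-SVM base learner used in the paper, so the snippet only illustrates the outer loop (S random balanced subsets whose scores are averaged), not the actual classifier; all names are illustrative.

```python
import random

def centroid(points):
    """Mean vector of a list of equal-length tuples."""
    d = len(points[0])
    return [sum(p[i] for p in points) / len(points) for i in range(d)]

def easy_ensemble(X_min, X_maj, S=15, seed=0):
    """Outer loop of EasyEnsemble: draw S random majority subsets of the
    minority-class size, fit one base learner per balanced subset and
    average the S scores. Each subset is only as large as the minority
    class, which is what makes the method fast and parallelizable."""
    rng = random.Random(seed)
    scorers = []
    for _ in range(S):
        maj_sub = rng.sample(X_maj, len(X_min))      # balanced subset
        scorers.append((centroid(X_min), centroid(maj_sub)))

    def d2(a, b):
        return sum((ai - bi) ** 2 for ai, bi in zip(a, b))

    # positive score = closer to the minority centroid than to the majority one
    return lambda x: sum(d2(x, cm) - d2(x, cn) for cn, cm in scorers) / S
```

Because the S subsets are independent, each one explores a different region of the majority class, and averaging their hypotheses dampens the influence of noisy or outlying majority instances.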


Fig 6 Mov G(p = 1) dataset


Fig 7 Mov Th(p = []) dataset



Fig 8 Yahoo A(p = 1) dataset


Fig 9 Yahoo A(p = 25) dataset


Fig 10 Yahoo G(p = 1) dataset



Fig 11 Yahoo G(p = 25) dataset


Fig 12 TaFeng(p = 1) dataset


Fig 13 Book(p = 1) dataset



Fig 14 LST(p = 1) dataset


Fig 15 Adver(p = []) dataset


Fig 16 Adver(p = 1) dataset



Fig 17 CRF(p = []) dataset

D Final Comparison


Fig 18 Average rank AUC versus average rank Time (see Table 9) across the large datasets from Table 2 (CRF and Bank). Regarding the former, an extra row EE par is added, having the same AUC-value as EE(S = 15). Ranks for each dataset are subsequently obtained via the procedure outlined in Section 4.6.1. Points occurring in the upper-right region are preferred.
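The rank-averaging behind this figure (per-dataset ranks averaged across datasets, as in a Friedman-type comparison) can be sketched as follows; the function name is ours, and ties are resolved with the standard average-rank convention.

```python
def average_ranks(scores_per_dataset):
    """scores_per_dataset: list of {method: AUC} dicts, one per dataset.
    Returns {method: average rank}, where rank 1 = best AUC on a dataset
    and tied methods share the average of the ranks they span."""
    methods = sorted(scores_per_dataset[0])
    totals = {m: 0.0 for m in methods}
    for scores in scores_per_dataset:
        ordered = sorted(methods, key=lambda m: -scores[m])  # best first
        i = 0
        while i < len(ordered):
            j = i
            while j < len(ordered) and scores[ordered[j]] == scores[ordered[i]]:
                j += 1                     # extend the tie group
            avg = (i + 1 + j) / 2          # mean of ranks i+1 .. j
            for m in ordered[i:j]:
                totals[m] += avg
            i = j
    return {m: t / len(scores_per_dataset) for m, t in totals.items()}
```

A method's average rank, rather than its average AUC, is what post-hoc tests such as Nemenyi's compare, since ranks are insensitive to the very different AUC scales across datasets.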


References

Akbani R Kwek S Japkowicz N (2004) Applying support vector machines to imbalanced datasetsIn Machine Learning ECML 2004 15th European Conference on Machine Learning Pisa ItalySeptember 20-24 2004 Proceedings Springer Berlin Heidelberg Berlin Heidelberg pp 39ndash50 DOI101007978-3-540-30115-8 7

Ali A Shamsuddin SM Ralescu AL (2015) Classification with class imbalance problem A review Inter-national Journal of Advances in Soft Computing and its Applications 7(3)176ndash204

Alzahrani T Horadam KJ (2016) Community detection in bipartite networks Algorithms and case studiesIn Complex Systems and Networks Dynamics Controls and Applications Springer Berlin Heidel-berg Berlin Heidelberg pp 25ndash50 DOI 101007978-3-662-47824-0 2

Bachner J (2013) Predictive policing Preventing crime with data and analytics IBM Center for the Busi-ness of Government

Baesens B Van Gestel T Viaene S Stepanova M Suykens J Vanthienen J (2003) Benchmarking state-of-the-art classification algorithms for credit scoring Journal of the Operational Research Society54(6)627ndash635 DOI 101057palgravejors2601545

Barandela R, Sánchez J, García V, Rangel E (2003) Strategies for learning in class imbalance problems. Pattern Recognition 36(3):849–851. DOI https://doi.org/10.1016/S0031-3203(02)00257-1

Barber MJ (2007) Modularity and community detection in bipartite networks Physical Review E76066102 DOI 101103PhysRevE76066102

Barua S Islam MM Yao X Murase K (2014) MWMOTEndashmajority weighted minority oversamplingtechnique for imbalanced data set learning IEEE Transactions on Knowledge and Data Engineer-ing 26(2)405ndash425 DOI 101109TKDE2012232

Batista GEAPA Prati RC Monard MC (2004) A study of the behavior of several methods for balancingmachine learning training data SIGKDD Explor Newsl 6(1)20ndash29 DOI 10114510077301007735

Beckett SJ (2016) Improved community detection in weighted bipartite networks Royal Society OpenScience 3(1) DOI 101098rsos140536

Bekkar M Djemaa HK Alitouche TA (2013) Evaluation measures for models assessment over imbalanceddata sets Journal of Information Engineering and Applications 3(10)27ndash38

Bhattacharyya S Jha S Tharakunnel K Westland JC (2011) Data mining for credit card fraud A compara-tive study Decision Support Systems 50(3)602 ndash 613 DOI httpsdoiorg101016jdss201008008

Blondel VD Guillaume JL Lambiotte R Lefebvre E (2008) Fast unfolding of communities in large net-works Journal of Statistical Mechanics Theory and Experiment 2008(10)P10008

Breiman L, Friedman J, Stone CJ, Olshen RA (1984) Classification and regression trees. Taylor & Francis

Brozovsky L, Petricek V (2007) Recommender system for online dating service. In: Proceedings of Znalosti 2007 Conference, VSB Ostrava

Cha M, Mislove A, Gummadi KP (2009) A measurement-driven analysis of information propagation in the Flickr social network. In: Proceedings of the 18th International Conference on World Wide Web, ACM, New York, NY, USA, WWW '09, pp 721–730. DOI 10.1145/1526709.1526806

Chawla NV (2005) Data mining for imbalanced datasets An overview In Data mining and knowledgediscovery handbook Springer US Boston MA pp 853ndash867

Chawla NV Bowyer KW Hall LO Kegelmeyer WP (2002) SMOTE synthetic minority over-samplingtechnique Journal of artificial intelligence research 16321ndash357

Chawla NV Lazarevic A Hall LO Bowyer KW (2003) Smoteboost Improving prediction of the minorityclass in boosting In Knowledge Discovery in Databases PKDD 2003 Springer Berlin HeidelbergBerlin Heidelberg pp 107ndash119

Chawla NV Japkowicz N Kotcz A (2004) Editorial Special issue on learning from imbalanced data setsSIGKDD Explor Newsl 6(1)1ndash6 DOI 10114510077301007733

Chen M Mao S Liu Y (2014) Big data A survey Mobile Networks and Applications 19(2)171ndash209DOI 101007s11036-013-0489-0

Chyi YM (2003) Classification analysis techniques for skewed class distribution problems Master thesisDepartment of Information management National Sun Yat-Sen University

Demšar J (2006) Statistical comparisons of classifiers over multiple data sets. Journal of Machine Learning Research 7(Jan):1–30

Dhillon IS (2001) Co-clustering documents and words using bipartite spectral graph partitioning In Pro-ceedings of the Seventh ACM SIGKDD International Conference on Knowledge Discovery and DataMining ACM New York NY USA KDD rsquo01 pp 269ndash274 DOI 101145502512502550


Drummond C Holte RC (2003) C45 class imbalance and cost sensitivity why under-sampling beatsover-sampling In Proceedings of the ICML rsquo03 Workshop on Learning from Imbalanced Datasets

Fan RE Chang KW Hsieh CJ Wang XR Lin CJ (2008) LIBLINEAR A library for large linear classifi-cation Journal of Machine Learning Research 91871ndash1874

Fan W Stolfo SJ Zhang J Chan PK (1999) AdaCost Misclassification cost-sensitive boosting In Pro-ceedings of the Sixteenth International Conference on Machine Learning Morgan Kaufmann Pub-lishers Inc San Francisco CA USA ICML rsquo99 pp 97ndash105

Fawcett T (2006) An introduction to ROC analysis Pattern Recognition Letters 27(8)861 ndash 874 DOIhttpsdoiorg101016jpatrec200510010

Finch H (2005) Comparison of distance measures in cluster analysis with dichotomous data Journal ofData Science 3(1)85ndash100

Fortunato S (2010) Community detection in graphs Physics Reports 486(3 5)75 ndash 174 DOI httpsdoiorg101016jphysrep200911002

Junque de Fortuny E Martens D Provost F (2014a) Predictive modeling with big data is bigger reallybetter Big Data 1(4)215ndash226 DOI 101089big20130037

Junque de Fortuny E Stankova M Moeyersoms J Minnaert B Provost F Martens D (2014b) Corporateresidence fraud detection In Proceedings of the 20th ACM SIGKDD International Conference onKnowledge Discovery and Data Mining ACM New York NY USA KDD rsquo14 pp 1650ndash1659 DOI10114526233302623333

Frasca M Bertoni A Re M Valentini G (2013) A neural network algorithm for semi-supervised node labellearning from unbalanced data Neural Networks 4384 ndash 98 DOI httpsdoiorg101016jneunet201301021

Friedman M (1937) The use of ranks to avoid the assumption of normality implicit in the analysis ofvariance Journal of the american statistical association 32(200)675ndash701

García E, Lozano F (2007) Boosting support vector machines. In: Machine Learning and Data Mining in Pattern Recognition, 5th International Conference, MLDM 2007, Leipzig, Germany, July 18-20, Post Proceedings, IBaI publishing, pp 153–167

Goldstein M Uchida S (2016) A comparative evaluation of unsupervised anomaly detection algorithmsfor multivariate data PLOS ONE 11(4)1ndash31 DOI 101371journalpone0152173

González PC, Velásquez JD (2013) Characterization and detection of taxpayers with false invoices using data mining techniques. Expert Systems with Applications 40(5):1427–1436. DOI https://doi.org/10.1016/j.eswa.2012.08.051

Guimera R Sales-Pardo M Amaral LAN (2007) Module identification in bipartite and directed networksPhysical Review E 76036102 DOI 101103PhysRevE76036102

Guo H Viktor HL (2004) Learning from imbalanced data sets with boosting and data generation TheDataBoost-IM approach SIGKDD Explor Newsl 6(1)30ndash39 DOI 10114510077301007736

Guo X Yin Y Dong C Yang G Zhou G (2008) On the class imbalance problem In 2008 Fourth Interna-tional Conference on Natural Computation IEEE vol 4 pp 192ndash201 DOI 101109ICNC2008871

Han H Wang WY Mao BH (2005) Borderline-SMOTE A new over-sampling method in imbalanced datasets learning In Advances in Intelligent Computing Springer Berlin Heidelberg Berlin Heidelbergpp 878ndash887

He H Garcia EA (2009) Learning from imbalanced data IEEE Transactions on Knowledge and DataEngineering 21(9)1263ndash1284 DOI 101109TKDE2008239

He H Bai Y Garcia EA Li S (2008) ADASYN Adaptive synthetic sampling approach for imbalancedlearning In 2008 IEEE International Joint Conference on Neural Networks (IEEE World Congresson Computational Intelligence) IEEE pp 1322ndash1328 DOI 101109IJCNN20084633969

Holm S (1979) A simple sequentially rejective multiple test procedure Scandinavian journal of statistics6(2)65ndash70

Hsu CW Lin CJ (2002) A comparison of methods for multiclass support vector machines IEEE Transac-tions on Neural Networks 13(2)415ndash425 DOI 10110972991427

Huang A (2008) Similarity measures for text document clustering In Proceedings of the sixth new zealandcomputer science research student conference (NZCSRSC2008) Christchurch New Zealand pp 49ndash56

Iman RL Davenport JM (1980) Approximations of the critical region of the Friedman statistic Commu-nications in Statistics-Theory and Methods 9(6)571ndash595

Jo T Japkowicz N (2004) Class imbalances versus small disjuncts ACM SIGKDD Explor Newsl 6(1)40ndash49 DOI 10114510077301007737


Jutla IS Jeub LG Mucha PJ (2011-2016) A generalized louvain method for community detection imple-mented in MATLAB URL httpnetwikiamathunceduGenLouvain

Kubat M Matwin S (1997) Addressing the curse of imbalanced training sets One-sided selection InProceedings of the Fourteenth International Conference on Machine Learning Morgan KaufmannPublishers Inc San Francisco CA USA pp 179ndash186

Lancichinetti A Fortunato S (2009) Community detection algorithms A comparative analysis PhysicalReview E 80056117 DOI 101103PhysRevE80056117

Larremore DB Clauset A Jacobs AZ (2014) Efficiently inferring community structure in bipartitenetworks Physical Review E Statistical Nonlinear and Soft Matter Physics 90012805 DOI101103PhysRevE90012805

Li J Fine JP (2010) Weighted area under the receiver operating characteristic curve and its application togene selection Journal of the Royal Statistical Society Series C (Applied Statistics) 59(4)673ndash692DOI 101111j1467-9876201000713x

Li X Wang L Sung E (2008) AdaBoost with SVM-based component classifiers Engineering Applicationsof Artificial Intelligence 21(5)785ndash795 DOI httpsdoiorg101016jengappai200707001

Lichman M (2013) UCI machine learning repository. URL http://archive.ics.uci.edu/ml

Liu W, Chawla S, Cieslak DA, Chawla NV (2010) A robust decision tree algorithm for imbalanced data sets. In: Proceedings of the tenth SIAM international conference on data mining, SIAM, Philadelphia, vol 10, pp 766–777

Liu XY Wu J Zhou ZH (2009) Exploratory undersampling for class-imbalance learning IEEE Transac-tions on Systems Man and Cybernetics Part B (Cybernetics) 39(2)539ndash550 DOI 101109TSMCB20082007853

Luts J Ojeda F Van de Plas R De Moor B Van Huffel S Suykens JA (2010) A tutorial on supportvector machine-based methods for classification problems in chemometrics Analytica Chimica Acta665(2)129ndash145 DOI httpsdoiorg101016jaca201003030

Macskassy SA Provost F (2007) Classification in networked data A toolkit and a univariate case studyJournal of Machine Learning Research 8(May)935ndash983

Martens D Provost F (2014) Explaining data-driven document classifications MIS Quarterly 38(1)73ndash100

Martens D Provost F Clark J Junque de Fortuny E (2016) Mining massive fine-grained behavior data toimprove predictive analytics MIS Quarterly 40(4)869ndash888

Mazurowski MA Habas PA Zurada JM Lo JY Baker JA Tourassi GD (2008) Training neural networkclassifiers for medical decision making The effects of imbalanced datasets on classification perfor-mance Neural Networks 21(23)427 ndash 436 DOI httpsdoiorg101016jneunet200712031

Mease D Wyner AJ Buja A (2007) Boosted classification trees and class probabilityquantile estimationJournal of Machine Learning Research 8409ndash439

Nemenyi P (1963) Distribution-free Multiple Comparisons. Dissertation, Princeton University

Newman MEJ, Girvan M (2004) Finding and evaluating community structure in networks. Physical Review E 69:026113. DOI 10.1103/PhysRevE.69.026113

Ng AY (2004) Feature selection, L1 vs L2 regularization, and rotational invariance. In: Proceedings of the Twenty-first International Conference on Machine Learning, ACM, New York, NY, USA, ICML '04, pp 78–. DOI 10.1145/1015330.1015435

Ng AY Jordan MI (2002) On discriminative vs generative classifiers A comparison of logistic regressionand naive bayes In Advances in Neural Information Processing Systems 14 MIT Press pp 841ndash848

Ngai E Hu Y Wong Y Chen Y Sun X (2011) The application of data mining techniques in financial frauddetection A classification framework and an academic review of literature Decision Support Systems50(3)559 ndash 569 DOI httpsdoiorg101016jdss201008006

Platt JC (1999) Probabilistic outputs for support vector machines and comparisons to regularized likeli-hood methods In Advances in Large-Margin Classifiers MIT Press pp 61ndash74

Porter MA Onnela JP Mucha PJ (2009) Communities in networks Notices of the American MathematicalSociety 56(9)1082ndash1097

Provost F Fawcett T (2013) Data Science for Business What you need to know about data mining anddata-analytic thinking OrsquoReilly Media Inc

Provost F Dalessandro B Hook R Zhang X Murray A (2009) Audience selection for on-line brandadvertising Privacy-friendly social network targeting In Proceedings of the 15th ACM SIGKDDInternational Conference on Knowledge Discovery and Data Mining ACM New York NY USAKDD rsquo09 pp 707ndash716 DOI 10114515570191557098


Raskutti B Kowalczyk A (2004) Extreme re-balancing for SVMs A case study SIGKDD Explor Newsl6(1)60ndash69 DOI 10114510077301007739

Rosvall M Bergstrom CT (2008) Maps of random walks on complex networks reveal community structureProceedings of the National Academy of Sciences 105(4)1118ndash1123 DOI 101073pnas0706851105

Schapire RE (1999) A brief introduction to boosting In Proceedings of the 16th International Joint Con-ference on Artificial Intelligence - Volume 2 Morgan Kaufmann Publishers Inc San Francisco CAUSA IJCAIrsquo99 pp 1401ndash1406

Schapire RE Singer Y (1999) Improved boosting algorithms using confidence-rated predictions Machinelearning 37(3)297ndash336 DOI 101023A1007614523901

Shmueli G (2017) Analyzing behavioral big data Methodological practical ethical and moral issuesQuality Engineering 29(1)57ndash74 DOI 1010800898211220161210979

Sobhani P Viktor H Matwin S (2015) Learning from imbalanced data using ensemble methods andcluster-based undersampling In New Frontiers in Mining Complex Patterns Third InternationalWorkshop NFMCP 2014 Held in Conjunction with ECML-PKDD 2014 Nancy France Septem-ber 19 2014 Revised Selected Papers Springer International Publishing Cham pp 69ndash83 DOI101007978-3-319-17876-9 5

Stankova M (2016) Classification within network data with a bipartite structure Dissertation Universityof Antwerp

Stankova M Martens D Provost F (2015) Classification over bipartite graphs through projection WorkingPapers 2015001 University of Antwerp Faculty of Applied Economics

Sun Y Kamel MS Wong AK Wang Y (2007) Cost-sensitive boosting for classification of imbalanceddata Pattern Recognition 40(12)3358 ndash 3378 DOI httpsdoiorg101016jpatcog200704009

Suykens JA Van Gestel T De Brabanter J De Moor B Vandewalle J Suykens J Van Gestel T (2002)Least squares support vector machines World Scientific Singapore

Tang B He H (2015) Enn Extended nearest neighbor method for pattern recognition [research frontier]IEEE Computational Intelligence Magazine 10(3)52ndash60 DOI 101109MCI20152437512

Tang Y Zhang YQ Chawla NV Krasser S (2009) SVMs modeling for highly imbalanced classificationIEEE Transactions on Systems Man and Cybernetics Part B (Cybernetics) 39(1)281ndash288 DOI101109TSMCB20082002909

Tobback E Moeyersoms J Stankova M Martens D (2016) Bankruptcy prediction for SMEs using rela-tional data Working Paper 2016004 University of Antwerp Faculty of Applied Economics

Verbeke W Dejaeger K Martens D Hur J Baesens B (2012) New insights into churn prediction in thetelecommunication sector A profit driven data mining approach European Journal of OperationalResearch 218(1)211 ndash 229 DOI httpsdoiorg101016jejor201109031

Veropoulos K Campbell I Cristianini N (1999) Controlling the sensitivity of support vector machinesIn Proceedings of the International Joint Conference on Artificial Intelligence Stockholm Sweden(IJCAI99) pp 55 ndash 60

Whitrow C Hand DJ Juszczak P Weston D Adams NM (2009) Transaction aggregation as a strategyfor credit card fraud detection Data Mining and Knowledge Discovery 18(1)30ndash55 DOI 101007s10618-008-0116-z

Wickramaratna J Holden SB Buxton BF (2001) Performance degradation in boosting In Proceedings ofthe Second International Workshop on Multiple Classifier Systems Springer London UK MCS rsquo01pp 11ndash21

Yen SJ Lee YS (2009) Cluster-based under-sampling approaches for imbalanced data distributions ExpertSystems with Applications 36(3 Part 1)5718 ndash 5727 DOI httpsdoiorg101016jeswa200806108

Yu HF Lo HY Hsieh HP Lou JK McKenzie TG Chou JW Chung PH Ho CH Chang CF Wei YH et al(2010) Feature engineering and classifier ensemble for kdd cup 2010 In Proceedings of the KDDCup 2010 Workshop pp 1ndash16

Zha H He X Ding C Simon H Gu M (2001) Bipartite graph partitioning and data clustering In Pro-ceedings of the Tenth International Conference on Information and Knowledge Management ACMNew York NY USA CIKM rsquo01 pp 25ndash32 DOI 101145502585502591

Zhang J Mani I (2003) Knn approach to unbalanced data distributions A case study involving informationextraction In Proceedings of the ICMLrsquo2003 Workshop on Learning from Imbalanced DatasetsWashington DC

Ziegler CN McNee SM Konstan JA Lausen G (2005) Improving recommendation lists through topicdiversification In Proceedings of the 14th International Conference on World Wide Web ACM NewYork NY USA WWW rsquo05 pp 22ndash32 DOI 10114510607451060754


Imbalanced classification in sparse and large behaviour datasets 19

4 Results and discussion

4.1 Datasets

Stankova et al (2015) provide the first large collection of benchmark behaviour datasets for classification. In our experiments we make use of these data sources and extend this repository with two additional datasets. Each of these datasets shows a bipartite structure with a clear target variable to predict. We refer to this study and the next paragraph for a short description of the available data resources. In this paragraph we indicate why we have chosen to include or reject certain data sources from the aforementioned study. The available datasets can be divided into small, medium and large datasets, based on the number of instances and the number of features present. The Norwegian companies and Reality Mining datasets comprise the small datasets: since they contain only a few hundred instances or features, they are regarded as impractical for our purposes. The MovieLens, Yahoo, TaFeng, Book-Crossing and LibimSeTi datasets belong to the medium-sized datasets, each containing a few thousand up to a few hundred thousand instances and features. All of these datasets are included in our study. The large datasets, containing hundreds of thousands up to millions of instances and features, are the Flickr and Kdd databases12. Other large, proprietary data sources not included in Stankova et al (2015) are the corporate residence fraud (CRF) and banking (Bank) datasets, which arise from real-life application domains with intrinsic imbalance.

To summarize, we have gathered datasets containing such fine-grained behaviour data from a wide variety of application domains. The MovieLens datasets, for which we are predicting the gender13 (Mov G) or the genre thriller14 (Mov Th), provide data on which films each user has rated. The Yahoo movies15 dataset has a similar structure, where the age of each user, Yahoo A (above or below average), or the gender, Yahoo G, is being predicted. The TaFeng dataset16 contains data on shopping behaviour, where age (below or above average) is being predicted based upon which products are purchased. In the book-crossing (Book) dataset (Ziegler et al 2005), users rate books and the age of the user (above or below average) is being predicted. LibimSeTi (LST) contains data from a dating site (Brozovsky and Petricek 2007), where users rate each other's profiles and gender is being predicted. In the advertisement (Adver) dataset (Lichman 2013), we try to predict whether a URL is an advertisement, based on a large variety of binary features of the URL. Note that this dataset does not arise from the behaviour of entities, yet it still has a high-dimensional and sparse representation. The Flickr dataset (Cha et al 2009) contains pictures being marked by users as favorite, and we predict the number of comments on each picture (below

12 Flickr and KDD will be excluded in the comparative study of Section 4.6. This is because some methods are too computationally intensive, especially in combination with the large number of possible parameter combinations, to be applied on these very large data sources. Furthermore, our statistical evidence is already sufficiently strong to conclude significance without these datasets. Having said this, these data sources will be included in the analysis of Section 5.

13 MovieLens 1M dataset from http://grouplens.org/datasets/movielens
14 MovieLens 10M dataset from http://grouplens.org/datasets/movielens
15 https://webscope.sandbox.yahoo.com
16 http://www.bigdatalab.ac.cn/benchmark/bm/dd?data=Ta-Feng

20 Jellis Vanhoeyveld David Martens

or above average). In the Kdd cup data, the performance of a student on a test is being predicted, based on artificially created binary features (Yu et al 2010). The corporate residence fraud dataset (CRF) (Junque de Fortuny et al 2014b) contains data on foreign companies making transactions with specific Belgian companies, where we try to predict whether the foreign company commits residence fraud (a type of fiscal fraud). Finally, the banking dataset (Bank) (Martens et al 2016) contains detailed behaviour of consumers making transactions with merchants or other persons, to predict interest in a pension fund product. Some characteristics of these datasets can be found in Table 2. The features column only shows the number of active features17.

Table 2 Behaviour data characteristics. The final column shows the imbalance ratio p, defined as the ratio of the number of minority class instances to the number of majority class instances in the training set, expressed as a percentage. See Section 4.2 for details regarding p.

Name      |Xmaj|     |Xmin|     Features    p = 100 x |Xmin|train / |Xmaj|train
Mov G     4331       1709       3706        1 & 25
Mov Th    10546      131        69878       1.24 (p = [])
Yahoo A   6030       1612       11915       1 & 25
Yahoo G   5436       2206       11915       1 & 25
TaFeng    17330      14310      23719       1 & 25
Book      42900      18858      282973      1 & 25
LST       59702      60145      166353      1
Adver     2792       457        1555        16.38 (p = []) & 1
CRF       869071     62         108753      0.0072 (p = [])
Bank      1193619    11107      3139570     0.93 (p = [])
Flickr    8166814    3028330    497472      0.1
Kdd       7171885    1235867    19306083    0.5

4.2 Methodology

Regarding the experiments performed in the upcoming sections, we applied a tenfold cross-validation procedure. Each of the folds contains 80% training data, 10% validation data and 10% test data. Note that these percentages are valid for both the majority class and the minority class (stratified sampling). As can be seen from Table 2, some datasets are balanced in nature. We created artificial imbalance for these datasets by removing minority class instances from the initial training set, according to a user-defined parameter p. We ensured that the amount of minority training instances corresponds to p percent of the majority class training size: |Xmin|train = (p/100) x |Xmaj|train. As an example, say that we are using the Book dataset with p = 25. In that case we know that the majority class contains 34320 training instances (80% of |Xmaj| = 42900). The minority training data would contain 8580 instances (25% of 34320). When the dataset is already imbalanced, we define p = [], which means that no downsampling

17 Active features represent features that are present for at least one instance in the dataset. A non-active feature corresponds with a column of zeros in the matrix representation and would not contribute to the model.


of the minority class training data is performed. Note that the validation and test data are left untouched18.
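The downsampling step described above can be sketched in a few lines (a hypothetical illustration with numpy; the function name and rounding convention are ours, not the authors' code):

```python
import numpy as np

def make_imbalanced(X_maj_train, X_min_train, p, rng=None):
    """Keep only p% (relative to the majority training size) of the
    minority training instances; p=None (the paper's p = []) keeps all."""
    if p is None:
        return X_min_train
    rng = np.random.default_rng(rng)
    n_keep = int(round(p / 100.0 * len(X_maj_train)))
    idx = rng.choice(len(X_min_train), size=n_keep, replace=False)
    return X_min_train[idx]

# Book dataset example from the text: 80% of |Xmaj| = 42900 and
# 80% of |Xmin| = 18858 go to training; with p = 25 the minority
# training set shrinks to 8580 instances.
X_maj = np.zeros((34320, 5))
X_min = np.zeros((15086, 5))
X_min_small = make_imbalanced(X_maj, X_min, p=25, rng=0)
```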

The methods detailed in Section 3 are applied on the training data. For both the under- and oversampling approaches, a linear SVM is trained on the newly created balanced training data, with regularization parameter C having values

C = [10^-7, 10^-5, 10^-3, 10^-1, 10^0]

The validation data are used for parameter tuning purposes. The test data allow us to obtain the generalization performance. The results reported show the average over ten folds. In the remaining paragraphs of this section we will describe the various parameter settings used in our experiments and give a brief overview of the parameters occurring in each method.

Considering the oversampling techniques, the parameter settings are as follows:

β = [0, 1/3, 2/3, 1]
prior_opt = {FlipCoin, Reverse Priors}
sim_measure = {Cosine, Jaccard}
K = [10^0, 10^1, 10^2, |Xmin|train]

We didn't include the "Prior" option, due to initial experiments showing a lower performance in comparison with the other options. This can be explained by the low priors occurring in each column, resulting in synthetic samples that mainly show zeros in 0-1 match situations. The oversampling with replacement method (OSR) only uses the β parameter, SMOTE uses all but the K parameter, and ADASYN uses all of the parameters listed above.

With respect to the undersampling techniques, the following parameter settings are used:

βu = [0, 1/4, 1/2, 3/4, 1]
sim_measure = {Cosine, Jaccard}
K = [10^0, 10^1, 10^2, |Xmin|train]
Clust_opt = {C Smallest, C Largest}

The random undersampling (RUS) technique only uses the βu parameter. The second set of methods, "Closest tot sim" and "Farthest tot sim", uses βu and sim_measure. The third set of techniques, "Closest Knn" and "Farthest Knn", makes use of all the parameters listed above except for Clust_opt. The final approach, CBU, employs βu and Clust_opt.

The boosting variants (AdaBoost, AdaCost and EasyEnsemble) presented in Section 3.3 make use of the following settings:

T = 30
µ = [100, 75]
C = [10^-7, 10^-5, 10^-3, 10^-1]

18 This means that if we start from a balanced set, only the training data will show artificial imbalance according to the imbalance ratio p. The validation and test data would remain balanced. Since AUC (and some other metrics) is independent of class skew, it would be unwise to make these sets imbalanced as well, because that would lead to discarding minority class instances that are relevant for performance assessment.


R = [2, 8, RL], where RL = |Xmaj|train / |Xmin|train
S = [5, 10, 15]

The AdaBoost algorithm includes the T, µ and C parameters. AdaCost additionally uses cost-ratios R. We have chosen a range of values because misclassification costs are unknown for many business applications (He and Garcia 2009; Fan et al 1999; Sun et al 2007). The final value, RL, seems to be a popular choice (Akbani et al 2004; Luts et al 2010), because the total weight on the majority class balances with the total weight on the minority class. The final method, EasyEnsemble, uses S subsets in addition to the parameters previously mentioned for AdaBoost. Note that we consider the boosting iteration round t ∈ [1,T] as a tunable parameter19.

4.3 Oversampling

The oversampling techniques presented in Section 3.1 are applied to each of the datasets from Table 2, with varying imbalance levels (by varying the p-parameter). The experiments we conducted use the following methodology: for each dataset, we apply the oversampling techniques with all possible parameter combinations on the training data to create newly balanced datasets, after which linear SVMs are trained. The optimal parameter combination with respect to each imbalance ratio β is selected based on validation set AUC-performance. With these parameters, the AUC on the test data is obtained. Results show the average over ten folds.

The results on four arbitrarily selected datasets are shown in Table 3. Full results on each of the data sources can be found in Appendix A, Table 11. From these tables we can conclude that performance generally increases with growing β-values. More precisely, performance keeps improving with growing β-levels until an optimal point is reached; increasing the balance level beyond this optimal value causes only small fluctuations with respect to the optimal performance. Traditional studies dealing with dense, low-dimensional data note that the OSR method can suffer from overfitting, as already mentioned in Section 3.1. It is interesting to see that in this sparse, high-dimensional setting this effect doesn't seem to occur20. Furthermore, the computationally expensive synthetic sampling approaches do not seem to improve performance over the simpler OSR method. This can be explained by the fact that many instances have a very limited number of non-zero elements (Junque de Fortuny et al 2014a; Stankova et al 2015), causing the synthetic sampling procedures to be limited in the number of unique new samples they can produce. Note also that synthetic sampling procedures have historically been designed to overcome the overfitting behaviour of OSR, which no longer seems to apply here.

19 We sometimes make use of T as a symbol to indicate the boosting iteration number instead of t.
20 The SVM regularization parameter C controls for overfitting. Additionally, the influence of majority class noise/outliers (see Section 3.2) on the learned hyperplane decreases as we oversample the minority class. This hyperplane is more sensitive towards minority instances, and a change in its direction/orientation towards these majority class noise/outliers will be more heavily penalized.


Table 3 Oversampling experiments. Results show the average tenfold AUC test set performance with respect to increasing β-values (β = [β1, β2, β3, β4] = [0, 1/3, 2/3, 1]). Additionally, standard deviations are shown between brackets. Optimal performances for each method are highlighted in boldface. See Section 4.2 for parameter settings.

Mov Th (p = [])
          β1             β2             β3             β4
OSR       79.77 (5.33)   85.30 (4.66)   83.16 (4.50)   84.59 (5.69)
SMOTE     79.77 (5.33)   84.18 (6.51)   85.58 (5.97)   84.33 (5.75)
ADASYN    79.77 (5.33)   84.11 (6.77)   85.86 (6.06)   85.36 (5.13)

Yahoo G (p = 25)
          β1             β2             β3             β4
OSR       78.82 (1.39)   78.78 (2.02)   78.74 (1.86)   78.35 (1.47)
SMOTE     78.82 (1.39)   79.23 (1.57)   79.10 (1.20)   79.03 (1.89)
ADASYN    78.82 (1.39)   79.12 (1.43)   79.23 (1.37)   79.51 (2.01)

TaFeng (p = 25)
          β1             β2             β3             β4
OSR       66.94 (1.34)   68.84 (1.21)   66.99 (1.42)   67.70 (1.41)
SMOTE     66.94 (1.34)   68.47 (1.50)   67.07 (1.15)   66.65 (0.81)
ADASYN    66.94 (1.34)   68.62 (1.38)   67.85 (1.60)   66.91 (1.39)

Book (p = 25)
          β1             β2             β3             β4
OSR       60.08 (0.71)   61.05 (0.94)   61.82 (1.12)   64.62 (0.57)
SMOTE     60.08 (0.71)   62.60 (0.73)   60.95 (0.68)   63.00 (0.80)
ADASYN    60.08 (0.71)   62.33 (0.73)   60.77 (0.85)   63.04 (0.58)

4.4 Undersampling

Regarding the undersampling techniques, we employ a similar experimental set-up as in the previous section: firstly, train SVMs on the undersampled training data with all possible parameter combinations; secondly, choose a suitable parameter combination based on validation set AUC-performance; and finally, obtain the AUC-performance on the test set. The results are averaged across ten folds.

Before going into detail on the experimental findings, we give a short note on the effect of noise/outliers on SVM performance. Consider the following imaginary example: say that a majority class female is rating a lot of action films and thrillers. We can consider this an outlier, since most of the females are rating romantic or drama films. This specific female has far more in common with the minority class males, who also rate action films and thrillers. Outliers are therefore instances showing odd behaviour; noise are wrongfully labelled instances. The effect of noise/outliers on SVM performance can be severe. Indeed, many of the instances in the dataset have a support value (dual variable αi) of 0. Instances contributing to the model (a non-zero support value) are examples close to the borderline (Suykens et al 2002) and noise/outliers (as can be derived from the KKT dual-complementarity conditions21).

21 If αi = 0, then yi(w^T xi + b) ≥ 1. For noise/outliers the term yi(w^T xi + b) is negative, hence αi ≠ 0.
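This can be illustrated with scikit-learn's SVC on toy data of our own construction: a deliberately mislabelled point placed inside the other class becomes a support vector whose α hits the box bound C, while most cleanly separated points keep α = 0:

```python
import numpy as np
from sklearn.svm import SVC

rng = np.random.default_rng(0)
# Two well-separated blobs plus one mislabelled "noise" point.
X = np.vstack([rng.normal(-2, 0.3, (30, 2)), rng.normal(2, 0.3, (30, 2))])
y = np.array([0] * 30 + [1] * 30)
X = np.vstack([X, [[-2.0, -2.0]]])   # located at the class-0 centre...
y = np.append(y, 1)                  # ...but labelled as class 1

C = 1.0
clf = SVC(kernel="linear", C=C).fit(X, y)
alphas = np.abs(clf.dual_coef_).ravel()  # alpha_i of the support vectors
```

Here `clf.support_` contains only a handful of the 61 points, and the mislabelled point (index 60) is among them with its α at the bound C, matching the KKT argument in the text.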


With respect to noise/outliers, the support value can reach a maximal value of C, the chosen regularization parameter in the SVM formulation. It should be clear that lowering the C-value leads to models that are less sensitive to noise/outliers.

The experimental outcomes of the various undersampling processes are shown in Table 4 for the same four previously selected datasets. We refer to Appendix B, Table 12 for results on the entire data repository. When removing only a limited amount of majority class instances (βu = βu2 = 1/4), we observed that in 12 out of 16 datasets the "Farthest Knn" method outperforms the "Closest Knn" technique. The four remaining datasets show comparable performances22. This finding shows that the "Farthest" method is very suitable for removing majority class noise/outliers and empirically demonstrates their performance-degrading effect. With higher undersampling rates (βu = 1), the "Closest Knn" method generally achieves higher performances than the "Farthest Knn" technique, though the results are less clear. We observed that only 9 out of 16 datasets indicate "Closest" as a clear winner, with 3 tie situations and 4 situations where "Farthest" shows the best results. Intuitively, we would expect the "Closest" method to perform better, since we are keeping the most informative instances (examples close to the minority class). However, this technique will also emphasize noise/outliers, and this is the main reason why the results are less significant. One can also see this from the fact that the RUS method shows far better results in comparison to the aforementioned techniques when βu = 1, an observation that is valid for all datasets. The RUS method is efficient in realizing noise/outlier reduction effects, since each SVM is trained with only a fraction of these types of instances. It is remarkable to see that in 9 out of 16 datasets the RUS method with βu = 1 outperforms the baseline model (βu = 0). In 4 datasets we observed equal performance, and 3 losses with respect to the baseline. RUS shows two attractive features: majority class noise/outlier removal and its ability to put more emphasis on the minority class instances by undersampling the majority class. A general downside of the approach is the information loss, and this effect causes the 3 losses to the baseline. Usually, though, the attractive features of RUS dominate the latter effect, and this indicates that it is definitely not necessary to include all majority class training instances in constructing the predictive model. Apparently there is a high level of redundancy, which can be exploited to construct efficient hypotheses.

In the next paragraph we compare the cluster-based undersampling (CBU) approach to the random undersampling (RUS) technique. Before proceeding, we should note that Table 12 does not show results on the LST, CRF and Bank datasets, due to the fact that the projected unigraph was too large to fit in memory (larger than 16 GB). A limited number of features (top nodes) does seem to be active for a relatively large number of instances (bottom nodes). All these instances will be connected in the unigraph projection, resulting in relatively high degrees for the bottom nodes. A possible solution to circumvent this problem might be to cut the weakest edges in the projected unigraph.

If we apply a limited amount of undersampling (βu = βu2 = 1/4), the CBU technique outperforms RUS in 8 out of 13 datasets (4 losses and 1 tie with respect to RUS). On the highly imbalanced datasets (p = 1 or p = []), the CBU method wins

22 A tie occurs in the situation where the absolute difference in AUC is smaller than or equal to 0.5.


in 8 out of 8 cases. We can therefore conclude that CBU outperforms RUS at low undersampling rates. This indeed shows that handling the within-class imbalance can be beneficial in this environment. If we make a comparison at the highest undersampling rate, βu = βu5 = 1, the results are completely opposite: CBU wins in only 1 case, with 9 losses with respect to RUS and 3 ties. On the highly imbalanced datasets, CBU outperforms RUS for 1 dataset (with 4 losses and 3 ties). It is clear that RUS is the preferred technique in case of high undersampling rates. The reason is that CBU emphasizes the small disjuncts and basically fails to adequately pick up on the common, larger communities. Furthermore, these small disjuncts might actually correspond with majority class noise/outliers, and CBU is focusing on these types of instances. On the overall level, where we consider all undersampling rates (except βu = 0), CBU wins in 5 out of 13 datasets, RUS wins in 4 cases, and there are 4 tie situations (yet RUS did outperform CBU in all 4 cases). Both methods seem to be competitive with one another across a wide range of imbalance levels. If we pay particular attention to the highly imbalanced sets, CBU wins in 5 cases, and there are 3 tie situations on the remaining sets (yet RUS did outperform CBU in all 3 cases). CBU seems to be the method of choice (with respect to RUS) when the datasets show high imbalance levels.

Table 4 Undersampling experiments. Results show the average tenfold AUC test set performance with respect to increasing βu-values (βu = [βu1, βu2, βu3, βu4, βu5] = [0, 1/4, 1/2, 3/4, 1]). Optimal performances (boldface) are highlighted for each imbalance ratio. Additionally, standard deviations are shown between brackets. Cl K represents the "Closest Knn" technique; Cl T represents the "Closest tot sim" technique (similarly for Far K and Far T, see Section 3.2). Parameter settings are given in Section 4.2.

Mov Th (p = [])
        βu1           βu2           βu3           βu4           βu5
RUS     79.77 (5.3)   80.32 (5.8)   81.57 (5.5)   81.86 (6.6)   81.26 (6.2)
Cl K    79.77 (5.3)   79.25 (4.5)   78.07 (5.0)   76.25 (6.5)   62.46 (8.5)
Cl T    79.77 (5.3)   78.40 (4.4)   72.41 (3.5)   64.66 (4.5)   60.37 (7.3)
Far K   79.77 (5.3)   84.54 (5.0)   83.64 (6.4)   80.02 (7.3)   56.82 (10.3)
Far T   79.77 (5.3)   85.03 (5.7)   82.68 (6.8)   75.61 (9.2)   56.77 (10.9)
CBU     80.11 (5.8)   81.17 (6.0)   81.08 (6.5)   84.17 (5.1)   80.96 (6.9)

Yahoo G (p = 25)
        βu1           βu2           βu3           βu4           βu5
RUS     78.82 (1.4)   78.91 (1.6)   78.97 (1.6)   78.61 (1.6)   77.82 (2.1)
Cl K    78.82 (1.4)   77.26 (1.5)   72.52 (1.5)   67.86 (2.0)   65.07 (2.7)
Cl T    78.82 (1.4)   76.83 (1.0)   71.99 (1.8)   67.15 (2.3)   61.10 (2.7)
Far K   78.82 (1.4)   78.26 (2.2)   74.69 (2.7)   67.22 (2.1)   60.72 (2.3)
Far T   78.82 (1.4)   77.68 (2.6)   72.44 (3.0)   64.94 (2.4)   59.60 (2.0)
CBU     75.25 (3.2)   75.22 (2.4)   74.69 (2.3)   73.07 (2.4)   70.69 (2.4)
Continues on next page


Table 4 continued
TaFeng (p = 25)
        βu1           βu2           βu3           βu4           βu5
RUS     66.94 (1.3)   67.44 (1.3)   68.10 (1.4)   68.27 (1.4)   66.13 (1.2)
Cl K    66.94 (1.3)   66.13 (1.4)   63.39 (1.2)   59.83 (1.3)   56.94 (0.7)
Cl T    66.94 (1.3)   66.38 (1.5)   62.89 (1.6)   57.46 (1.3)   54.56 (1.3)
Far K   66.94 (1.3)   68.06 (1.4)   66.43 (1.6)   64.46 (1.5)   63.35 (1.3)
Far T   66.94 (1.3)   64.31 (1.1)   62.69 (1.0)   61.27 (1.1)   59.03 (1.0)
CBU     64.81 (1.2)   64.15 (1.1)   64.13 (1.2)   63.88 (0.8)   63.46 (0.8)

Book (p = 25)
        βu1           βu2           βu3           βu4           βu5
RUS     60.08 (0.7)   60.13 (0.6)   60.40 (0.8)   60.33 (0.8)   63.28 (0.8)
Cl K    60.08 (0.7)   59.96 (0.7)   60.13 (0.8)   59.96 (1.0)   59.28 (0.7)
Cl T    60.08 (0.7)   59.96 (0.7)   60.13 (0.8)   60.29 (0.4)   54.50 (0.9)
Far K   60.08 (0.7)   63.29 (1.0)   64.19 (0.8)   57.30 (1.1)   55.66 (1.1)
Far T   60.08 (0.7)   62.14 (0.5)   58.27 (0.6)   56.37 (1.0)   55.66 (1.1)
CBU     54.82 (0.9)   54.67 (0.9)   54.71 (0.9)   54.66 (1.0)   54.78 (0.9)

4.5 Boosting variants

In this section the plain AdaBoost (AB) method is compared with the cost-sensitiveAdaCost (AC) algorithm the EasyEnsemble (EE) technique and the baseline (BL)method (where we train a single SVM on the imbalanced training data) AB ACand EE have an underlying boosting process in common The AUC-values are storedduring each boosting iteration which allows us to show the performance graphicallyas a function of the number of iterations t In EasyEnsemble (EE) we combine theweak learners of each subset by summing their individual contribution For examplewhen t = 2 boosting iterations are performed we evaluate the performance of thecombined learner sum

Ss=1 sum

2t=1 αsthst(x) Figures 1 2 3 and 4 show the average ten-

fold AUC-performance (with micro = 100) on the test data The left-side figures show theAB AC (with varying R-values) and EE (with varying S-values) performance with re-spect to the number of boosting iterations The C-value is tuned according to highestvalidation set AUC-performance (over all possible boosting iterations) The right-sidefigures investigate the influence of the C-parameter for AB and EE (with S = 15) andallow us to gain insight on the effect of weak and strong learners on the generaliza-tion performance Results on the remaining datasets are given in Appendix C Fig-ures 6-17 Note that we only indicate results with weight-percentage micro = 100 (useall instances in the training process) Previous experiments (with micro = 75) showedinferior results This suggests that it is indeed important to include all minority in-stances during the training of each weak learner an observation also made by Liuet al (2009)
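The combination rule for the EE evaluation above can be sketched as follows (stub hypotheses stand in for the trained weak learners; the weights α and the list structure are illustrative):

```python
import numpy as np

def ee_score(X, ensembles):
    """Combined EasyEnsemble score F(x) = sum_s sum_t alpha_{s,t} h_{s,t}(x).
    `ensembles` is a list (one entry per balanced subset s) of lists of
    (alpha, h) pairs, where each h maps X to real-valued predictions."""
    return sum(alpha * h(X)
               for subset in ensembles
               for alpha, h in subset)

# Toy example: two subsets, two boosting rounds each, with stub learners.
h_pos = lambda X: np.ones(len(X))
h_neg = lambda X: -np.ones(len(X))
ensembles = [[(0.8, h_pos), (0.3, h_neg)],
             [(0.5, h_pos), (0.1, h_pos)]]
scores = ee_score(np.zeros((4, 2)), ensembles)   # 0.8 - 0.3 + 0.5 + 0.1 per row
```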

When investigating the influence of the regularization parameter C in AdaBoost, we can graphically verify that it is a "weakness" indicator, in the sense that lower C-values correspond to weaker learners. While these low C-value learners have a low performance in the first rounds of boosting, the possible boost in performance is usually much higher compared to higher C-value inducers. Strong learners can already make a suitable distinction between minority and majority class behaviours in the first boosting round. The boosting process will emphasize the hard-to-learn instances, among which are noise/outliers. Therefore, too strong learners are not suitable for use in a boosting process, as already noted by Wickramaratna et al (2001). In Figures 2(b) and 3(b) we observe that the lowest C-values (C = 10^-7, 10^-5) can outperform higher C-values (C = 10^-3, 10^-1). In many cases the AB-process outperforms the baseline (BL); see also the results in Table 5. A combination of several weak learners can often achieve better generalization than a single strong learner.

Inspired by the success of the RUS method with βu = 1 (for reasons indicated in Section 4.4), the EE-technique tries to overcome an obvious downside of this approach: we are removing a lot of majority class instances that can be very informative. Hence, we try to make up for this loss by combining several of these balanced training sets, in the hope that the majority class is more adequately explored. This results in a very effective learner that seems to outperform the AB, AC and BL methods in the majority of cases, as can be seen from these figures. It is remarkable that we don't need to use that many subsets S to achieve high performances: increasing the S-value from 5 to 15 doesn't seem to cause major increases in AUC (see also Table 5). The influence of the C-parameter is similar as described in the previous paragraph.

[Figure: two panels of line plots, AUC test-set performance (y-axis) versus boosting iteration T = 0..30 (x-axis); panel (a) compares AB, AC (R = 2, 8, RL) and EE (S = 5, 10, 15) against BL; panel (b) compares AB and EE (S = 15) for C = 1e-07, 1e-05, 0.001, 0.1 against BL]

Fig 1 Mov G(p = 25) dataset: results showing average tenfold AUC-performance on test data (with µ = 100) for (a) AB, AC and EE with C chosen according to highest validation set AUC-performance (over all possible T), (b) AB and EE (S = 15) with varying C-levels


[Figure: same panel layout as Fig 1]

Fig 2 Book(p = 25) dataset

[Figure: same panel layout as Fig 1]

Fig 3 TaFeng(p = 25) dataset

[Figure: same panel layout as Fig 1]

Fig 4 Bank(p = []) dataset


4.6 Final comparison

4.6.1 Performance-wise comparison

In this section we wish to determine which of the aforementioned methods shows superior performance in terms of AUC on test data. The experimental set-up is described in Section 4.2. Note that we exclude β = 0 and βu = 0 in determining the performance of the oversampling respectively undersampling techniques, to be able to compare them with the baseline (BL) approach23. The results for AB, AC and EE are shown for µ = 100. The number of boosting iterations t ∈ [0,T] is also considered as a tunable parameter. Table 5 shows the final comparison between all methods. Additionally, we added an average rank column, indicating the mean rank of each algorithm across all datasets. The rank of an algorithm for a specific dataset corresponds to its position in the sorted AUC-performance spectrum, where the best performing algorithm gets the rank of 1, the second best rank 2, etc. In case two or more algorithms perform equally well, average ranks are assigned (if two algorithms have ranks of 3 and 4 respectively and show equal performance, then both methods get assigned a rank of 3.5). Note that the LST dataset is excluded in the calculation of average ranks, since we consider this a dataset not being affected by imbalance and showing equal performances.
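The average-rank computation with tie handling can be sketched with scipy (toy AUC values, not the paper's):

```python
import numpy as np
from scipy.stats import rankdata

# AUC per (dataset, algorithm); rank 1 = best, ties get average ranks.
auc = np.array([[0.82, 0.85, 0.85],
                [0.70, 0.74, 0.72]])
# Negate so that higher AUC receives the lower (better) rank.
ranks = np.vstack([rankdata(-row) for row in auc])
avg_rank = ranks.mean(axis=0)
```

In the first toy dataset, the two tied algorithms each receive rank (1 + 2) / 2 = 1.5, exactly the convention described in the text.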

As a side note, we noticed that increasing the number of minority training set instances can significantly improve performances. We can see this clearly by comparing the p = 25 outcomes with the results of p = 1 on the same dataset. Junque de Fortuny et al (2014a) already showed empirically that larger behaviour datasets (in terms of the number of instances or features) lead to higher AUC-performances. This situation is confirmed in our experiments, though we want to add that this is only true when the data size is increased in a balanced fashion.

The results in Table 5 confirm our previous analysis. Each of the oversampling, undersampling (except Cl Knn) and boosting variants seems to improve upon the BL performance. Sobhani et al (2015) note that the direct application of ensemble techniques on imbalanced data does not circumvent the imbalanced learning issue. In our experiments on behaviour data, we do find a simple boosting procedure applied to the imbalanced data to improve upon the predictive performance. The oversampling techniques OSR, SMOTE and ADASYN show similar outcomes and are competitive methods, seeming to outperform the undersampling techniques and the plain boosting (AB) and cost-sensitive learning (AC) approaches. When dealing with traditional data, the performances of oversampling techniques are generally worse than those of undersampling approaches (Drummond and Holte 2003). It is interesting to observe that for these large and sparse behaviour datasets the opposite is true. This is heavily related to the intrinsic properties of the data under consideration. Undersampling (or feature selection, for that matter) is not recommended because, as we have noted in Section 2.1, each feature provides a small additional contribution to the predictive performance. With respect to AC, the results indicate that R = RL is a better choice compared to more random cost ratios R = 2, 8. The EE-technique has the

23 The BL technique trains single SVMs on the imbalanced training data.


lowest average rank and is our best performing method. The combination of several balanced subsets with filtered noise/outlier contributions allows one to explore the majority class space efficiently. To strengthen our previous statements, we will in the next paragraphs conduct statistical significance tests in accordance with the procedure outlined in Demsar (2006). The latter paper presents a framework to perform statistical comparisons of classifiers over multiple datasets.

The first null-hypothesis we try to reject postulates that each of the algorithms performs equally well or, equivalently, that their average ranks R_j (see Table 5) are indifferent. Under the null-hypothesis, the Friedman statistic (Friedman 1937)

    χ²_F = [12N / (k(k+1))] [ Σ_{j=1}^{k} R_j² − k(k+1)²/4 ]    (6)

with N the number of datasets (N = 15) and k the number of algorithms (k = 13), can be used to compute the statistic (Iman and Davenport 1980)

    F_F = (N−1) χ²_F / (N(k−1) − χ²_F)    (7)

The latter is distributed according to the F-distribution with k−1 and (k−1)(N−1) degrees of freedom. Evaluating this measure with the average ranks shown in Table 5 results in F_F = 22.98, which is a lot higher than the critical value of 1.81 at the α = 0.05 significance level. We can therefore reject the null-hypothesis and proceed with post-hoc tests.

The Nemenyi test (Nemenyi 1963) allows one to compare each classifier to one another and adjusts the critical value to compensate for making k(k−1)/2 comparisons24. "The performance of two classifiers is significantly different if the corresponding average ranks differ by at least the critical difference (CD)" (Demsar 2006). We refer to the aforementioned paper for the calculation of the CD. The result of these comparisons at the α = 0.05 significance level is shown in Table 6. This test is already quite revealing, in the sense that, for instance, the oversampling methods (OSR, SMOTE and ADASYN), AC (R = RL) and the EE-techniques are significantly different from the BL. Yet it does not indicate that undersampling or applying plain boosting (AB) results in significantly different performances compared to the BL. The Nemenyi test is still conservative, and we therefore compare all classifiers to a single control classifier in the next paragraphs. Note that we only make k−1 comparisons in that situation, resulting in stronger results compared to the Nemenyi test.

When comparing a single classifier i with a control classifier c, the null-hypothesis states that the two classifiers i and c perform equally well. Under this hypothesis, the test statistic (based on the average ranks R_i shown in Table 5)

z = \frac{R_i - R_c}{\sqrt{k(k+1)/(6N)}} \qquad (8)

is distributed according to a standard normal distribution. This z-value is used to find the corresponding probability (p-value) p from the cumulative standard normal

24 The larger the amount of comparisons, the higher the proportion of null-hypotheses that are wrongfully rejected due to random chance.


Table 5 Final comparison of all methods in terms of AUC test set performance; see Section 4.2 for details on experimental set-up. β ≠ 0 and β_u ≠ 0 for the oversampling respectively undersampling techniques; µ = 100 for each of the boosting variants. Standard deviations are given between brackets and the performance difference with the BL is shown in square brackets.

            Mov G(p=1)           Mov G(p=25)          Mov Th(p=[])         Yahoo A(p=1)
BL          71.6  (2.62) [0]     81.41 (1.32) [0]     79.77 (5.33) [0]     55.92 (2.97) [0]
OSR         75.35 (2.27) [3.8]   83.76 (2.09) [2.3]   85.13 (6.1)  [5.4]   60.05 (2.71) [4.1]
SMOTE       76.16 (2.27) [4.6]   83.7  (2.1)  [2.3]   85.67 (4.98) [5.9]   60.1  (3)    [4.2]
ADASYN      76.07 (2.26) [4.5]   83.63 (2.04) [2.2]   85.65 (5.6)  [5.9]   59.9  (2.99) [4]
RUS         72.88 (2.73) [1.3]   81.52 (2.15) [0.1]   82.91 (7.19) [3.1]   57.04 (1.77) [1.1]
Cl Knn      71.43 (1.36) [-0.2]  80.88 (1.19) [-0.5]  78.87 (4.71) [-0.9]  55.78 (2.71) [-0.1]
Far Knn     71.9  (2.95) [0.3]   80.9  (1.48) [-0.5]  84.07 (4.64) [4.3]   57.2  (1.33) [1.3]
CBU         74.17 (2.36) [2.6]   81.51 (1.04) [0.1]   82.76 (7.22) [3]     58.77 (3.43) [2.8]
AB          71.65 (1.73) [0.1]   84.52 (1.89) [3.1]   82.43 (5.18) [2.7]   58.35 (2.62) [2.4]
AC(R=28)    71.61 (2.46) [0]     83.46 (1.82) [2]     83.27 (5.6)  [3.5]   57.72 (2.47) [1.8]
AC(R=RL)    74.65 (2.7)  [3.1]   83.35 (2.09) [1.9]   85.41 (4.49) [5.6]   59.47 (2.33) [3.5]
EE(S=10)    76.04 (2.66) [4.4]   85.05 (1.85) [3.6]   86.1  (5.78) [6.3]   59.66 (3.13) [3.7]
EE(S=15)    76.12 (2.88) [4.5]   85.14 (1.86) [3.7]   86.42 (5.86) [6.7]   59.76 (2.93) [3.8]

            Yahoo A(p=25)        Yahoo G(p=1)         Yahoo G(p=25)        TaFeng(p=1)
BL          61.68 (2.42) [0]     66.84 (3.66) [0]     78.82 (1.39) [0]     55.75 (1.6)  [0]
OSR         64.59 (3.12) [2.9]   73.08 (2.96) [6.2]   78.52 (2.01) [-0.3]  61.21 (2.24) [5.5]
SMOTE       65.56 (3.33) [3.9]   73.11 (3.12) [6.3]   79.01 (1.21) [0.2]   61.72 (1.81) [6]
ADASYN      65.13 (3.38) [3.4]   73.22 (3.17) [6.4]   79.74 (1.68) [0.9]   61.68 (1.86) [5.9]
RUS         64.11 (2.8)  [2.4]   70.65 (3.39) [3.8]   78.91 (1.55) [0.1]   59.25 (2.18) [3.5]
Cl Knn      61.14 (2.13) [-0.5]  66.34 (3.54) [-0.5]  77.26 (1.46) [-1.6]  55.77 (1.28) [0]
Far Knn     63.96 (3.03) [2.3]   66.97 (3.54) [0.1]   78.26 (2.2)  [-0.6]  59.98 (1.26) [4.2]
CBU         62.27 (1.79) [0.6]   71.27 (2.89) [4.4]   75.22 (2.42) [-3.6]  58.4  (1.57) [2.6]
AB          63.88 (2.67) [2.2]   68.9  (2.03) [2.1]   79.01 (1.66) [0.2]   56.21 (1.79) [0.5]
AC(R=28)    64.32 (3.56) [2.6]   68.89 (3.11) [2]     78.99 (1.89) [0.2]   56.33 (1.83) [0.6]
AC(R=RL)    64.31 (3.03) [2.6]   73.13 (2.8)  [6.3]   78.41 (2)    [-0.4]  61.6  (2.26) [5.9]
EE(S=10)    66.51 (3.24) [4.8]   72.61 (3.15) [5.8]   80.52 (1.6)  [1.7]   61.2  (1.82) [5.4]
EE(S=15)    66.36 (3.18) [4.7]   73.48 (2.32) [6.6]   80.54 (1.56) [1.7]   61.13 (1.83) [5.4]

            TaFeng(p=25)         Book(p=1)            Book(p=25)           LST(p=1)
BL          66.94 (1.34) [0]     52.6  (1.29) [0]     60.08 (0.71) [0]     99.99 (0.01) [0]
OSR         68.77 (1.23) [1.8]   55.87 (1.42) [3.3]   64.62 (0.57) [4.5]   99.99 (0.01) [0]
SMOTE       68.47 (1.5)  [1.5]   55.07 (0.88) [2.5]   62.96 (0.82) [2.9]   99.99 (0.01) [0]
ADASYN      68.48 (1.47) [1.5]   55.04 (0.91) [2.4]   63.02 (0.57) [2.9]   99.99 (0.01) [0]
RUS         68.28 (1.39) [1.3]   54.26 (0.92) [1.7]   63.28 (0.8)  [3.2]   99.98 (0.01) [0]
Cl Knn      66.13 (1.43) [-0.8]  52.69 (1.3)  [0.1]   60.02 (0.79) [-0.1]  99.99 (0.01) [0]
Far Knn     68.06 (1.41) [1.1]   56.25 (1.52) [3.7]   64.15 (1.12) [4.1]   99.98 (0.01) [0]
CBU         63.84 (1.07) [-3.1]  53.75 (1.01) [1.2]   54.68 (0.88) [-5.4]  []
AB          67.65 (1.55) [0.7]   54.27 (1.95) [1.7]   65    (0.67) [4.9]   99.99 (0.01) [0]
AC(R=28)    69.31 (1.23) [2.4]   53.72 (1)    [1.1]   61.24 (0.8)  [1.2]   99.98 (0.01) [0]
AC(R=RL)    67.15 (1.51) [0.2]   55.73 (1.22) [3.1]   64.6  (0.64) [4.5]   99.99 (0.01) [0]
EE(S=10)    70.3  (1.35) [3.4]   55.09 (1.29) [2.5]   65.37 (0.61) [5.3]   99.98 (0.01) [0]
EE(S=15)    70.4  (1.3)  [3.5]   55.35 (1.26) [2.8]   65.4  (0.51) [5.3]   99.98 (0.01) [0]


Table 5 Continued. Additionally, an average rank column is added, showing the mean rank of each algorithm across all datasets. Note that we excluded the LST dataset for this purpose.

            Adver(p=[])          Adver(p=1)           CRF(p=[])             Bank(p=[])          Average Rank
BL          96.61 (1.82) [0]     90.93 (3.02) [0]     64.06 (16.43) [0]     66.82 (0.88) [0]    11.600
OSR         96.93 (1.91) [0.3]   93.3  (2.02) [2.4]   80.74 (12.93) [16.7]  71.39 (0.79) [4.6]  5.000
SMOTE       97.05 (1.66) [0.4]   93.35 (2.01) [2.4]   78.7  (16.56) [14.6]  []                  4.533
ADASYN      96.91 (1.95) [0.3]   93.46 (2.21) [2.5]   78.87 (16.71) [14.8]  []                  4.800
RUS         96.81 (1.87) [0.2]   92.38 (2.51) [1.5]   83.98 (5.99)  [19.9]  69.41 (1.19) [2.6]  8.167
Cl Knn      96.4  (1.48) [-0.2]  89.73 (3.42) [-1.2]  76.63 (16.19) [12.6]  66.17 (0.72) [-0.6] 12.467
Far Knn     95.77 (1.81) [-0.8]  93.88 (1.78) [3]     83.75 (13.11) [19.7]  66.95 (0.56) [0.1]  8.133
CBU         97.15 (1.88) [0.5]   94.18 (2.3)  [3.3]   []                    []                  8.567
AB          97.34 (2.18) [0.7]   91.39 (3.23) [0.5]   77.62 (15.15) [13.6]  66.82 (0.88) [0]    8.267
AC(R=28)    97.44 (1.93) [0.8]   91    (3.35) [0.1]   68.31 (14.93) [4.2]   67.67 (0.71) [0.9]  8.467
AC(R=RL)    97.46 (1.71) [0.8]   93.51 (2.17) [2.6]   85.08 (9.77)  [21]    70.7  (0.8)  [3.9]  5.400
EE(S=10)    97.64 (1.35) [1]     92.97 (2.75) [2]     86.18 (10.17) [22.1]  71.46 (0.81) [4.6]  3.267
EE(S=15)    97.63 (1.35) [1]     93.3  (2.14) [2.4]   86.35 (9.99)  [22.3]  71.54 (0.76) [4.7]  2.333

Table 6 Nemenyi test outcomes at the α = 0.05 significance level. A value of 1 indicates the null-hypothesis is rejected and thus finds the two algorithms significantly different (a value of 0 means we accept the null-hypothesis). Methods are listed in the same order as Table 5 and are more concisely represented.

      BL  RO  SM  AD  RU  Cl  Fa  CBU  AB  AC1  AC2  EE1  EE2
BL    0   1   1   1   0   0   0   0    0   0    1    1    1
RO    1   0   0   0   0   1   0   0    0   0    0    0    0
SM    1   0   0   0   0   1   0   0    0   0    0    0    0
AD    1   0   0   0   0   1   0   0    0   0    0    0    0
RU    0   0   0   0   0   0   0   0    0   0    0    1    1
Cl    0   1   1   1   0   0   0   0    0   0    1    1    1
Fa    0   0   0   0   0   0   0   0    0   0    0    1    1
CBU   0   0   0   0   0   0   0   0    0   0    0    1    1
AB    0   0   0   0   0   0   0   0    0   0    0    1    1
AC1   0   0   0   0   0   0   0   0    0   0    0    1    1
AC2   1   0   0   0   0   1   0   0    0   0    0    0    0
EE1   1   0   0   0   1   1   1   1    1   1    0    0    0
EE2   1   0   0   0   1   1   1   1    1   1    0    0    0


distribution function: p = 2 × min(Φ(z), 1 − Φ(z)). Holm's method (Holm 1979) compares k − 1 classifiers to a single control classifier and sorts the corresponding p-values in ascending order, so that p_1 ≤ p_2 ≤ ... ≤ p_{k−1}. Each p_i is subsequently compared to its associated confidence level^25 α_comp = α/(k − i). Holm starts with performing the check p_1 < α/(k − 1) and rejects the null-hypothesis if the test passes. It then proceeds with a similar check for i = 2 and continues until a certain null-hypothesis cannot be rejected. The remaining hypotheses are retained.
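Holm's step-down procedure against a control classifier is compact enough to sketch in full; the ranks are those of Table 5 and the helper names are ours:

```python
import math

avg_ranks = {"BL": 11.600, "OSR": 5.000, "SMOTE": 4.533, "ADASYN": 4.800,
             "RUS": 8.167, "Cl Knn": 12.467, "Far Knn": 8.133, "CBU": 8.567,
             "AB": 8.267, "AC(R=28)": 8.467, "AC(R=RL)": 5.400,
             "EE(S=10)": 3.267, "EE(S=15)": 2.333}
N, k, alpha = 15, 13, 0.05

def norm_cdf(z):  # standard normal CDF
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))

def holm_vs_control(control):
    se = math.sqrt(k * (k + 1) / (6.0 * N))  # denominator of Eq. (8)
    pvals = []
    for name, R in avg_ranks.items():
        if name == control:
            continue
        z = (R - avg_ranks[control]) / se    # Eq. (8)
        pvals.append((2.0 * min(norm_cdf(z), 1.0 - norm_cdf(z)), name))
    pvals.sort()
    rejected = []
    for i, (p, name) in enumerate(pvals, start=1):
        if p >= alpha / (k - i):             # first failed check: Holm stops here
            break
        rejected.append(name)
    return rejected

print(holm_vs_control("BL"))  # the six methods at the top of Table 7
```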

Table 7 shows the result of the Holm test at the α = 0.05 significance level with the BL algorithm as control classifier. The table lists each method according to sorted p-values and shows the z-, p- and α_comp-values. The significance column indicates whether the proposed method is significantly different from the BL and coincidentally matches with the result of the Nemenyi test. The p-values of the (Far Knn, RUS, CBU) undersampling methods, AB and AC (R = 28) are only slightly higher than the level required to reject the null-hypothesis. We included a critical significance level α_crit, corresponding to the lowest possible significance level upon which the method would be considered significantly different from the BL (if α = α_crit then p = α_comp). In the case of AB, for example, if we were to choose a significance level α = 0.0764 (a confidence level of 92.36%), the associated p-value would be smaller than α_comp and we would proceed to conclude^26 that AB performs significantly differently from the BL. To summarize: the null-hypothesis of equal performance between the BL and an alternative method proposed in Section 3 (except Cl Knn) is rejected with a confidence level of more than 91% according to Holm's test. Hence, the performance of the BL approach is significantly worse than that of each of the proposed over- and undersampling methods (except Cl Knn), cost-sensitive learning techniques (AC) and boosting variants (AB, EE).

Similar to the previous paragraph, Table 8 presents the results of Holm's test at the α = 0.05 significance level with EE(S = 15) as control classifier. The results coincidentally correspond with the result of the Nemenyi test and indicate that the oversampling (OSR, SMOTE and ADASYN), AC(R = RL) and EE(S = 10) methods are not statistically different from the EE(S = 15) technique at the 0.05 significance level. However, if we were to adapt the significance level to approximately α = 0.25 (a 75% confidence level), the results would conclude significance (except for EE(S = 10)). To summarize: the null-hypothesis of equal performance between EE(S = 15) and an alternative method proposed in Section 3 (except EE(S = 10)) is rejected with a confidence level of more than 75% according to Holm's test. Hence, the performance of the oversampling, undersampling, AB and AC approaches is significantly worse than that of EE(S = 15). We can therefore safely conclude EE to be the preferred method in terms of AUC performance when dealing with imbalanced behaviour data.

25 α_comp adjusts the value of α to compensate for multiple comparisons.
26 This is not entirely true, since Holm's method requires Far Knn to pass the test first before proceeding, which only occurs at the 0.088662 significance level.


Table 7 Holm test at the α = 0.05 significance level with BL reference. The table shows the z test statistic with associated p-value; α_comp = α/(k − i), with k the number of algorithms (k = 13) and i corresponding to the position of the method in the sorted p-value vector. Since we already ranked the p-values in ascending order, i takes on the row number (e.g. OSR has i = 5). The column "significant" denotes if we can reject the null-hypothesis (significant: p < α_comp). α_crit corresponds with the smallest possible significance level where we would decide to reject the null-hypothesis (α_crit = α · p/α_comp).

            z          p          α_comp     significant  α_crit
EE(S=15)    -6.51642   7.2E-11    0.004167   1            8.64E-10
EE(S=10)    -5.86009   4.63E-09   0.004545   1            5.09E-08
SMOTE       -4.96936   6.72E-07   0.005      1            6.72E-06
ADASYN      -4.78183   1.74E-06   0.005556   1            1.56E-05
OSR         -4.64119   3.46E-06   0.00625    1            2.77E-05
AC(R=RL)    -4.35991   1.3E-05    0.007143   1            9.11E-05
Far Knn     -2.4378    0.014777   0.008333   0            0.088662
RUS         -2.41436   0.015763   0.01       0            0.078815
AB          -2.34404   0.019076   0.0125     0            0.076305
AC(R=28)    -2.20339   0.027567   0.016667   0            0.082701
CBU         -2.13307   0.032919   0.025      0            0.065837
Cl Knn      0.609449   0.542227   0.05       0            0.542227

Table 8 Holm test at the α = 0.05 significance level with EE(S = 15) reference.

            z          p          α_comp     significant  α_crit
Cl Knn      7.12587    1.03E-12   0.004167   1            1.24E-11
BL          6.516421   7.2E-11    0.004545   1            7.92E-10
CBU         4.383348   1.17E-05   0.005      1            0.000117
AC(R=28)    4.313027   1.61E-05   0.005556   1            0.000145
AB          4.172384   3.01E-05   0.00625    1            0.000241
RUS         4.102063   4.09E-05   0.007143   1            0.000287
Far Knn     4.078623   4.53E-05   0.008333   1            0.000272
AC(R=RL)    2.156513   0.031044   0.01       0            0.155218
OSR         1.875229   0.060761   0.0125     0            0.243045
ADASYN      1.734587   0.082814   0.016667   0            0.248442
SMOTE       1.547064   0.121848   0.025      0            0.243696
EE(S=10)    0.65633    0.511612   0.05       0            0.511612

4.6.2 Timing-wise comparison

The performance of each of the proposed algorithms in terms of training times depends highly upon the chosen parameters. The β and β_u parameters control the number of training instances and are an important contribution to the subsequent SVM learning times for each of the over- and undersampling techniques. The number of boosting iterations t ∈ [0, T] is also a major factor in each of the boosting variants proposed in Section 3.3. Furthermore, the chosen regularization parameter C highly influences the training times of the SVM inducer: higher C-values result in larger training times (Fan et al 2008). Not only do the chosen hyperparameters affect learning times; data characteristics such as the number of minority instances, the imbalance level, the number of features, the amount of overlapping etc. have a major effect as well.


In comparing each of the methods outlined in Section 3, we make use of a similar methodology as previously presented (see Section 4.6.1). For each of the ten folds, a suitable parameter combination is selected based on optimal validation set AUC-performance. These chosen parameters are applied in determining the timings of each method. The final results indicate the average timings across ten folds. Hence, we show the required amount of time for each method when using the parameter combination resulting in the best performance. Note that this might, for instance, result in using β = 13 in case of OSR, or using only t = 5 boosting iterations in EE. Computational timings are shown in Table 9.

As one can observe from the aforementioned table, CBU is the slowest method, as it relies on computationally intensive clustering procedures. The synthetic SMOTE and ADASYN oversampling approaches are very time consuming methods as well: they both rely on computationally expensive nearest neighbour computations, in conjunction with a synthetic sample generation procedure that requires a single pass over all minority training instances.

The RUS method shows, as expected, superior results in terms of computational timings. Other undersampling methods perform quite well, though they are slower than BL and OSR due to their inherent nearest neighbour computations.

The boosting variants seem to perform worse than the BL, OSR and each of the undersampling methods. Since EE was the best method in terms of AUC-performance (see Section 4.6.1), we focus on this learner in this paragraph. For the medium sized datasets (all but CRF and Bank), OSR outperforms EE(S = 15); yet for the large datasets (CRF and Bank), the opposite is true. A SVM is known to have a quadratic computational complexity in the number of training instances (Suykens et al 2002; García and Lozano 2007). For this reason, we consider OSR (or any other oversampling technique) inappropriate for large datasets. Since with EE each subset has a size limited to twice the minority class training size, we expect increasing benefits of the EE-technique for larger and more imbalanced behaviour data. As Junqué de Fortuny et al (2014a) have observed, larger behaviour data, in terms of the number of instances or features, also contributes to an increase in predictive power. Hence, we expect to see this type of big behaviour data more often in real-life applications, where computational resources can become a major constraint. Because the EE-technique is easily parallelizable, by simultaneously performing the boosting process for each of the independent subsets, we added an extra row in Table 9: EE par corresponds with the time needed for EE(S = 15) divided by 15, indicating the time that would be required for EE if each subset were trained in parallel. We can see that EE par can outperform OSR in 8 out of 16 cases, precisely for the datasets that show a high imbalance level (e.g. p = 1 or p = []).
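The parallelization argument can be made concrete with a small sketch: the subsets are independent, so they can be trained simultaneously. The ridge-style scorer below is a hypothetical stand-in for the boosted SVM/LR learner used in the paper; only the subset construction (all minority instances plus an equally sized random draw of majority instances) follows the text.

```python
import numpy as np
from concurrent.futures import ThreadPoolExecutor

rng = np.random.default_rng(0)

def fit_linear_scorer(X, y01):
    # regularized least squares on {-1,+1} targets: a simple stand-in weak learner
    targets = 2.0 * y01 - 1.0
    return np.linalg.solve(X.T @ X + 0.01 * np.eye(X.shape[1]), X.T @ targets)

def easy_ensemble_scores(X, y, X_test, S=15):
    # each balanced subset: all minority rows + |minority| random majority rows,
    # so every subset has twice the minority class training size
    mino, majo = np.flatnonzero(y == 1), np.flatnonzero(y == 0)
    subsets = [np.concatenate([mino, rng.choice(majo, len(mino), replace=False)])
               for _ in range(S)]
    with ThreadPoolExecutor() as pool:  # subsets are independent -> train in parallel
        weights = list(pool.map(lambda idx: fit_linear_scorer(X[idx], y[idx]), subsets))
    return X_test @ np.mean(weights, axis=0)  # combine the S linear scorers
```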

We conclude our comparison of the proposed methods in Figure 5, which shows the average ranks in terms of computational timings (see Table 9) and AUC performance for each of the available data sources. Regarding the ranks with respect to the latter, we added an extra row EE par, having the same AUC-performance as EE(S = 15) for each dataset in Table 5. Ranks are subsequently determined according to the procedure outlined in Section 4.6.1. Additionally, Figure 18 in Appendix D shows the result for the large and highly imbalanced CRF and Bank datasets. We expect the latter type of data to be of major interest for reasons already indicated.


Table 9 Computational timings (in seconds) averaged across ten folds, using the parameter combination resulting in the highest validation set AUC-performance. First and second best performances are emphasized in boldface (third and fourth best are underlined). EE par is the time required for EE(S = 15) divided by 15.

            Mov G(p=1)  Mov G(p=25)  Mov Th(p=[])  Yahoo A(p=1)
BL          0.032889    0.056697     0.558563      0.026922
OSR         0.055043    0.062802     0.99009       0.044421
SMOTE       0.218821    0.937057     3.841482      0.057726
ADASYN      0.284688    1.802399     5.191265      0.087694
RUS         0.011431    0.025383     0.155224      0.007991
Cl Knn      0.046599    0.599846     0.989914      0.037182
Far Knn     0.039887    0.80072      0.683023      0.027788
CBU         1.034111    10.60173     68.22839      1.692477
AB          0.169792    0.841443     3.460246      0.139251
AC(R=28)    0.471994    2.996585     10.86907      0.366555
AC(R=RL)    0.53376     1.179542     6.065177      0.209015
EE(S=10)    0.117226    6.065145     1.17995       0.148973
EE(S=15)    0.20474     7.173737     2.119991      0.180365
EE par      0.013649    0.478249     0.141333      0.012024

            Yahoo A(p=25)  Yahoo G(p=1)  Yahoo G(p=25)  TaFeng(p=1)
BL          0.092954       0.011915      0.044164       0.026728
OSR         0.027887       0.013241      0.047206       0.040919
SMOTE       1.062686       0.056153      0.883698       0.219553
ADASYN      2.050993       0.079073      1.733367       0.306618
RUS         0.048471       0.003234      0.033423       0.002916
Cl Knn      0.84391        0.025404      0.502515       0.092167
Far Knn     0.664124       0.026576      0.500206       0.080159
CBU         15.69442       1.287221      13.55035       2.467279
AB          0.445546       0.078777      0.169977       0.114619
AC(R=28)    1.034044       0.321723      0.515953       0.926178
AC(R=RL)    0.706215       0.226741      0.112949       0.610233
EE(S=10)    1.026577       0.100331      1.527146       0.058052
EE(S=15)    1.607596       0.077483      2.472582       0.10538
EE par      0.107173       0.005166      0.164839       0.007025

            TaFeng(p=25)  Book(p=1)  Book(p=25)  LST(p=1)
BL          0.032033      0.080035   0.318093    0.652045
OSR         0.032414      0.132927   0.092757    0.87152
SMOTE       5.089283      3.409418   11.43444    4.987705
ADASYN      8.148419      3.689661   12.25441    6.840083
RUS         0.020457      0.022713   0.031972    0.432839
Cl Knn      1.713731      0.400873   3.711648    2.508374
Far Knn     1.539437      0.379086   3.988552    2.511037
CBU         26.42686      4.198663   46.31987    []
AB          0.713265      0.61719    1.238585    2.466151
AC(R=28)    1.234647      1.666131   2.330635    1.451671
AC(R=RL)    0.279047      0.860346   0.197053    1.23763
EE(S=10)    2.484502      2.145747   7.177484    0.524066
EE(S=15)    3.363971      2.480066   11.21945    0.784111
EE par      0.224265      0.165338   0.747963    0.052274


Table 9 Continued. Additionally, an extra column indicating average ranks is included (lower ranks are preferred).

            Adver(p=[])  Adver(p=1)  CRF(p=[])  Bank(p=[])  Average Rank [pos]
BL          0.010953     0.002796    0.725911   70.89334    2.94 [2]
OSR         0.012178     0.006166    3.685813   179.7481    4.19 [4]
SMOTE       0.123112     0.017764    5.633862   []          9.59 [11]
ADASYN      0.183767     0.021728    5.768669   []          10.91 [13]
RUS         0.012115     0.00204     0.147392   52.47441    1.38 [1]
Cl Knn      0.061324     0.005568    1.106755   73.73282    6.5  [5]
Far Knn     0.079078     0.007069    1.110379   97.59619    6.56 [6]
CBU         3.378235     3.236754    []         []          14   [14]
AB          0.069199     0.103518    1.153196   83.08618    8.06 [7]
AC(R=28)    0.193092     0.068905    2.047434   71.70548    10.81 [12]
AC(R=RL)    0.107652     0.037963    1.387174   106.3466    9.25 [9]
EE(S=10)    0.138485     0.085686    0.198656   24.95117    8.25 [8]
EE(S=15)    0.185136     0.139121    0.285345   36.40107    9.56 [10]
EE par      0.012342     0.009275    0.019023   2.426738    3    [3]

Timings are a substantial concern with respect to these large datasets. As can be observed, the EE-technique thrives for these big and highly imbalanced data, even in its non-parallel form.


[Figure 5 appears here: a scatter plot of average rank AUC versus average rank Time; legend: BL, OSR, SMOTE, ADASYN, RUS, Cl Knn, Far Knn, CBU, AB, AC(R=28), AC(R=RL), EE(S=10), EE(S=15), EE par.]

Fig. 5 Average rank AUC versus average rank Time (see Table 9) across all datasets from Table 2 (excluding Flickr and Kdd). Regarding the AUC rank, an extra row EE par is added, having the same AUC-value as EE(S = 15). Ranks for each dataset are subsequently obtained via the procedure outlined in Section 4.6.1. Points occurring in the upper-right region are preferred.

5 The effect of the chosen base learner

In this short section, we investigate the influence of the chosen base learner. More specifically, we ask ourselves whether some base inducers can more easily cope with the imbalanced learning issue, and in doing so we verify the conclusion that EE is a suitable technique to improve upon the baseline performance. We rely on several popular classification methods for behaviour data: two regularization based techniques (SVM, LR) and two heuristic types of approaches (Naive Bayes (NB), Behavioural Similarity (BeSim)). The next paragraphs briefly introduce these methods. Note that source codes and dataset results are provided in our on-line repository.

LR constructs a linear model that maximizes the likelihood of the observed data by imposing a logistic model for the parametric probability model P(Y|X). Ng and Jordan (2002) provide a thorough introduction to the topic^27 and note that in its plain form LR can suffer from overfitting, especially when the input data is very high dimensional. Therefore, we resort to regularized logistic regression. We opt for L2-regularized LR, since L1-regularization corresponds with a natural way of imposing feature selection (Ng 2004). The LIBLINEAR toolbox (Fan et al 2008) is used to obtain these L2-regularized LR models.
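To make the objective concrete, the following minimal numpy sketch minimizes the same L2-regularized LR objective that LIBLINEAR solves, (1/2)‖w‖² + C Σᵢ log(1 + exp(−yᵢ wᵀxᵢ)), by plain gradient descent; LIBLINEAR itself uses far more efficient solvers, so this is purely illustrative.

```python
import numpy as np

def fit_l2_logreg(X, y, C=1.0, step=0.05, iters=3000):
    # y must be encoded in {-1, +1}; returns the weight vector w
    w = np.zeros(X.shape[1])
    for _ in range(iters):
        margins = y * (X @ w)
        # gradient of (1/2)||w||^2 + C * sum_i log(1 + exp(-margin_i))
        grad = w - C * (X.T @ (y / (1.0 + np.exp(margins))))
        w -= step * grad
    return w
```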

27 For an accessible introduction, see the chapter on "Generative and Discriminative Classifiers: Naive Bayes and Logistic Regression" provided at the following link: http://www.cs.cmu.edu/~tom/NewChapters.html


NB (Ng and Jordan 2002) relies on the use of Bayes' rule, which expresses the posterior P(Y|X) as the product of the likelihood P(X|Y) and the prior P(Y), divided by the evidence P(X). The naivety of the approach originates from assuming conditional independence of the features. We impose a multivariate event model regarding the distribution P(X|Y). Junqué de Fortuny et al (2014a) provide an efficient NB implementation for dealing with large and sparse binary behaviour data.
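A minimal numpy sketch of this multivariate Bernoulli event model (our own illustration, not the optimized implementation referenced above):

```python
import numpy as np

def fit_bernoulli_nb(X, y, alpha=1.0):
    # X: binary matrix (instances x features), y in {0, 1}; Laplace smoothing alpha
    log_prior, log_p, log_q = [], [], []
    for c in (0, 1):
        Xc = X[y == c]
        theta = (Xc.sum(axis=0) + alpha) / (len(Xc) + 2 * alpha)  # P(x_f = 1 | c)
        log_prior.append(np.log(len(Xc) / len(X)))
        log_p.append(np.log(theta))
        log_q.append(np.log(1.0 - theta))
    return np.array(log_prior), np.array(log_p), np.array(log_q)

def nb_score(model, X):
    # log-odds of class 1 vs class 0 under conditional feature independence
    log_prior, log_p, log_q = model
    ll = X @ log_p.T + (1.0 - X) @ log_q.T + log_prior
    return ll[:, 1] - ll[:, 0]
```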

The BeSim approach conceptually corresponds to projecting the bigraph to a unigraph using the procedure outlined in Section 3.2. Next, the weighted-vote relational neighbour classifier (Macskassy and Provost 2007) is used to infer unknown labels/scores through a weighted probability estimation using the known labels of neighbouring nodes. A very fast and scalable implementation of this approach (the SW-transformation) is presented in detail in Stankova et al (2015).
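The flavour of this scheme can be sketched with two matrix products: labelled-node mass is pushed to the features, re-weighted, and pulled back to the unlabelled nodes. The 1/degree feature weight below is one simple choice among the several weighting functions studied in Stankova et al (2015):

```python
import numpy as np

def besim_scores(X_train, y_train, X_test):
    # bigraph adjacency X (instances x features); a test node's score is the
    # weighted sum of known labels of training nodes it shares features with
    feat_votes = X_train.T @ y_train                    # label mass per feature
    feat_weight = 1.0 / (X_train.sum(axis=0) + 1e-12)   # down-weight popular features
    return X_test @ (feat_weight * feat_votes)
```

The same two products work unchanged on scipy sparse matrices, which is what makes this style of computation scale to large behaviour data.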

In Table 10, we compare the performances of the baseline (BL) application of the methods with their EE-version (with S = 15, T = 20). It is quite striking to see the EE-version dominating the baseline version when looking at each method individually. This suggests that the imbalanced learning issue has an impact on each of the aforementioned techniques. Focussing on the regularization based approaches, we observe the baseline LR to be less influenced by the imbalance than the SVM method. Their EE-versions (EE SVM and EE LR) can boost the baseline performance in almost all cases and have the highest average ranks amongst all the proposed methods. As already emphasized, the regularization based approaches offer an added element of flexibility, in the sense that the strength of the underlying base learner can be controlled.

The heuristic approaches NB and BeSim do not offer this flexibility. Furthermore, it is not directly possible to integrate instance weights during the training phase of the base learner. During the boosting process, we therefore sample from the distribution D_t to generate (unweighted) examples and train a BeSim/NB learner next. The results in Table 10 indicate BL NB to be the worst performing algorithm amongst all methods tried. Due to the conditional independence assumption, one could regard this as a weak learner, which is therefore suitable to be used in a boosting process. Indeed, the EE NB approach is able to significantly improve upon its performance. We found this to be the slowest EE-version^28 (e.g. taking more than 24 hours to process a single fold of the Bank dataset, versus several minutes for the other versions). The average rank estimate of EE NB is rather pessimistic, since we assigned the lowest performance across all methods in case of missing results. BeSim can be considered a stronger learner compared to NB; its possible boost in performance is typically lower than that of NB. Having said this, we found EE BeSim to improve upon BL BeSim for all datasets (except Adver(p = 1)).
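Generating such an unweighted sample from the boosting distribution amounts to weighted resampling with replacement; a sketch (helper name and seed are ours):

```python
import numpy as np

def resample_from_weights(X, y, D, size=None, seed=0):
    # draw an unweighted bootstrap according to the boosting distribution D_t,
    # for base learners (NB, BeSim) that cannot consume instance weights
    rng = np.random.default_rng(seed)
    n = len(y) if size is None else size
    idx = rng.choice(len(y), size=n, p=D / D.sum())
    return X[idx], y[idx]
```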

28 This statement only relates to the EE-version. In the baseline case, the timings of the NB-version are roughly of the same order of magnitude as SVM and LR (though this heavily depends on the chosen regularization parameter C). With EE, each subset has a size limited to twice the minority class training size (a relatively low number compared to the size of the majority class). This works in favour of SVM/LR (e.g. SVM has a computational complexity of O(m²), with m the number of training instances). Also note that the NB-version is indeed optimized for sparse data. Internally, the NB-implementation writes a fairly large model file to a specified directory (sizes of 1 GB can be encountered there). This writing operation slows down the training stage (especially in case of EE, where S × T = 15 × 20 = 300 model files are constructed).


Table 10 Average tenfold AUC test set performance of the BL compared to its EE-version (S = 15, T = 20) for SVM, LR, BeSim and NB. In case the EE NB implementation crashed, we put a × in the table; if it is too slow (> 24 hours per fold), we place a []. Additionally, average ranks are assigned to each method. In case EE NB has missing results for a particular dataset, we ensure it has the lowest rank across all other algorithms.

            Mov G(p=1)    Mov G(p=25)   Mov Th(p=[])   Yahoo A(p=1)
BL SVM      71.6  (2.62)  81.41 (1.32)  79.77 (5.33)   56.49 (3.37)
EE SVM      76.12 (2.88)  85.13 (1.86)  86.43 (5.86)   59.74 (2.96)
BL LR       71.02 (2.09)  84.39 (1.84)  83.14 (4.17)   57.84 (2.39)
EE LR       76.69 (2.92)  85.03 (1.98)  86.3  (5.37)   59.79 (2.62)
BL BeSim    76.1  (3.58)  81.3  (2.92)  82.81 (6.6)    56.27 (2.73)
EE BeSim    76.31 (3.71)  81.37 (2.9)   85.02 (6.28)   57.7  (1.71)
BL NB       70.26 (5.84)  77.01 (2.54)  70.48 (10.14)  52.56 (2.09)
EE NB       75.93 (2.83)  85.56 (2.01)  86.91 (4.15)   57.55 (2.73)

            Yahoo A(p=25) Yahoo G(p=1)  Yahoo G(p=25)  TaFeng(p=1)
BL SVM      61.61 (2.48)  66.84 (3.66)  78.82 (1.39)   55.75 (1.6)
EE SVM      66.38 (3.16)  73.48 (2.32)  80.55 (1.55)   61.13 (1.83)
BL LR       66.27 (2.96)  69.82 (1.93)  80.45 (1.59)   58.91 (2.31)
EE LR       66.22 (3.28)  73.08 (2.14)  80.53 (1.56)   61.43 (2.32)
BL BeSim    64.54 (2.02)  68.89 (2.49)  79.55 (1.96)   57.89 (1.18)
EE BeSim    65.25 (2.23)  71.18 (2.91)  80.04 (1.85)   59.36 (1.47)
BL NB       65    (1.65)  63.33 (2.56)  78.89 (1.64)   54.61 (1.2)
EE NB       66.6  (2.79)  70.99 (2.88)  81.01 (1.3)    59.01 (1.84)

            TaFeng(p=25)  Book(p=1)     Book(p=25)     LST(p=1)
BL SVM      66.94 (1.34)  52.6  (1.29)  60.08 (0.71)   99.99 (0.01)
EE SVM      70.4  (1.3)   55.34 (1.28)  65.4  (0.51)   99.98 (0.01)
BL LR       69.24 (1.3)   55.34 (1.27)  63.84 (0.75)   99.99 (0.01)
EE LR       70.28 (1.28)  55.49 (1.49)  65.41 (0.63)   99.97 (0.02)
BL BeSim    67.49 (1.23)  55.19 (1.27)  63.7  (0.63)   99.99 (0.01)
EE BeSim    68    (1.21)  55.21 (1.15)  64.38 (0.42)   99.99 (0)
BL NB       65.21 (1.64)  52.93 (0.9)   59.75 (0.47)   98.69 (0.3)
EE NB       70.72 (1.15)  ×             63.46 (0.61)   99.92 (0.04)

            Adver(p=[])   Adver(p=1)    CRF(p=[])      Bank(p=[])
BL SVM      96.37 (1.94)  91.18 (2.97)  64.36 (18.97)  66.82 (0.88)
EE SVM      97.63 (1.35)  93.3  (2.14)  86.35 (9.99)   71.54 (0.76)
BL LR       97.19 (1.44)  88.51 (1.93)  81.87 (19.63)  71.43 (0.72)
EE LR       97.57 (0.96)  93.02 (2.06)  86.84 (9.62)   71.77 (0.62)
BL BeSim    97.26 (1.12)  95.38 (1.35)  86.91 (9.36)   67.85 (0.67)
EE BeSim    97.38 (1.04)  93.83 (1.35)  87.02 (10.43)  70.41 (0.55)
BL NB       93.75 (1.9)   93.37 (1.9)   87.24 (9.38)   67.83 (0.63)
EE NB       94.04 (1.75)  ×             ×              []

            Flickr(p=01)  Kdd(p=05)     Average Rank
BL SVM      74.92 (0.17)  74.53 (0.05)  6.44 [7]
EE SVM      79.86 (0.13)  80.98 (0.05)  2.39 [1]
BL LR       79.03 (0.11)  81.29 (0.04)  4.28 [4]
EE LR       79.85 (0.13)  80.75 (0.05)  2.61 [2]
BL BeSim    74.62 (0.13)  74.95 (0)     5.11 [6]
EE BeSim    76.4  (0.13)  77.55 (0.03)  3.61 [3]
BL NB       81.36 (0.1)   74.29 (0.05)  6.5  [8]
EE NB       []            []            5.06 [5]


6 Conclusions and future research directions

We investigated the effects of sampling and cost-sensitive learning techniques on the problem of learning from imbalanced behaviour data. This type of big data has fundamentally different properties and internal structure than traditional data, and is characterized by sparseness and very large dimensions. This is a new research topic which enables benefits across a wide variety of application domains, such as fraud detection, targeted advertising, predictive policing, churn prediction, default prediction etc. To enable future developments in this field, we provide an on-line repository containing implementations, datasets and results, available at the following link: http://www.applieddatamining.com/cms/?q=software

Oversampling techniques need to be adapted in order to cope with this kind of data. We proposed new versions of the traditional SMOTE and ADASYN techniques, which differ in the way synthetic samples are generated and similarity computations are performed. The plain random oversampling (OSR) and the synthetic approaches are competitive methods in terms of AUC and do not seem to suffer from overfitting, as traditional studies report. Though the synthetic approaches have comparable AUC-performances in comparison to OSR, they are useless in practice due to their computationally expensive procedures. The OSR method is very fast for small and medium sized datasets, yet it becomes inappropriate for larger datasets, because the training times of the underlying base learner increase drastically.

With respect to undersampling methodologies, we employed a random undersampling technique (RUS) and Knn-based informed approaches. The latter type selects majority class instances that are closest (Cl Knn) or farthest (Far Knn) from the minority class examples, and allows one to investigate the effect of noise/outliers on SVM performance. We observed that these odd instances have a severe performance degrading effect. RUS and Far Knn approaches can significantly improve performance with respect to the baseline, thanks to their abilities to reduce noise/outlier levels and to put more emphasis on the minority class. Having said this, they usually perform worse than any of the oversampling approaches. This is in sheer contrast to the results obtained from traditional studies and is heavily related to the intrinsic nature of behaviour data. Timing-wise, RUS is the fastest method; the informed approaches are slower than RUS due to the required nearest neighbour computations. Additionally, a cluster-based undersampling (CBU) technique was proposed to alleviate the within-class imbalance. For low undersampling rates, the performance of CBU is superior to plain RUS, whereas for high undersampling rates the opposite is true. Overall, no substantial gain in performance is obtained in comparison with a simple RUS approach. Furthermore, the method is the slowest among all techniques considered and relies on computationally expensive clustering heuristics.

We studied the effects of boosting, where we made use of a very specific linear SVM and logistic regression combination that uses confidence rated predictions, instead of a more popular plain "discrete" boosting algorithm (with weak hypotheses having values in {−1, 1}). Our experiments clearly indicated the regularization constant C in the SVM formulation to act as a "weakness" indicator. Indeed, higher C-values cause stronger learners and shouldn't be used in the boosting process, which matches with the analysis of studies dealing with traditional data. Strong learners can

42 Jellis Vanhoeyveld David Martens

already quite accurately distinguish majority class from minority class behaviours. The boosting process will focus on the hard to learn instances, among which are noise/outliers, and this can result in overfitting behaviour. We also observed that boosting can usually outperform the plain baseline, indicating that a combination of weak learners can achieve higher AUC-performance than a single strong learner.

EasyEnsemble (EE) integrates random undersampling with a boosting process. By sampling several balanced subsets from the training data and combining these, we can emphasize the minority class, achieve noise/outlier reduction effects and simultaneously explore the majority class space efficiently. Our version differs from the original formulation, because we employ a different type of underlying weak learner (a SVM and LR combination) that is fed to a confidence rated boosting algorithm (instead of the discrete version). We observed superior AUC-performance with respect to all of the aforementioned techniques. Furthermore, the method is very fast, since each subset can be trained in parallel and has a size that is twice as large as the minority class training set. We studied the effect of the number of subsets and noted that a limited amount of these (O(10)) is already sufficient to obtain satisfactory performances; increasing this number further causes only minor benefits.

Additionally, a statistical comparison of each of the proposed methods was conducted across a wide variety of datasets. Holm's significance test indicates that each of the proposed techniques (oversampling, undersampling (except Cl Knn), cost-sensitive learning and boosting variants) can significantly outperform the plain baseline performance (each of the null-hypotheses of equal performance gets rejected with a confidence level of at least 91%). Comparison of the techniques with EE showed that the latter can significantly outperform all of them (each of the null-hypotheses of equal performance gets rejected with a confidence level of at least 75%). A timing-wise comparison showed EE to be the preferred method (after RUS) for large and highly imbalanced datasets. A parallel implementation would make EE the fastest method, even for medium sized datasets that show a high level of imbalance.

The influence of the chosen base learner was investigated by comparing the baseline to its EE-version for the SVM, LR, NB and BeSim inducers. Each of the EE-versions is able to improve upon the baseline performance for nearly all datasets, where base learners of the regularization type are superior because of their ability to control the strength of the learner.

Though we conducted an extensive benchmark study, there are still plenty of future research directions. First and most important, there is an urgent need to develop a more general theoretical framework to completely understand the effect of imbalance and why certain methods outperform others. Secondly, other types of mechanisms could be investigated in this setting: advanced sampling (Guo et al 2008) set-ups perform sampling based on preliminary classifications; one-class learning is a recognition-based methodology where a discriminative boundary is learned around the training examples of a single class alone, and Raskutti and Kowalczyk (2004) demonstrated the superiority of the recognition-based one-class SVMs over the discriminative two-class SVMs; other types of ensemble learning strategies (e.g. bagging and stacking) could be investigated; Batista et al (2004) note that combining over- and undersampling can be beneficial when data are highly imbalanced or show very few minority class instances; the ideas behind Extended Nearest Neighbour (Tang and

Imbalanced classification in sparse and large behaviour datasets 43

He 2015) could guide us in the sampling process. As opposed to Knn, "it considers not only who are the nearest neighbours of a certain sample, but also who consider the particular sample as their nearest neighbour (Tang and He 2015)". The method assigns a class membership to an unknown sample that maximizes the intra-class coherence. Finally, more advanced techniques do exist that can provide a K (the number of nearest neighbours) faster or with a (slightly) better performance.
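Of the directions above, the recognition-based idea can be illustrated with a deliberately simple stand-in (a diagonal Gaussian density fit on one class alone, not the one-class SVM of Raskutti and Kowalczyk 2004): model a single class and rank unseen points by their score under that model.

```python
import numpy as np

def one_class_scorer(X_one_class):
    """Recognition-based learning in miniature: fit a diagonal Gaussian on
    training examples of a single class and score unseen points by their
    log-density. Lower scores indicate points unlike the modelled class."""
    mu = X_one_class.mean(axis=0)
    var = X_one_class.var(axis=0) + 1e-9   # avoid division by zero
    def score(X):
        return -0.5 * np.sum((X - mu) ** 2 / var + np.log(2 * np.pi * var),
                             axis=1)
    return score
```

No examples of the other class are needed at training time, which is exactly what makes recognition-based methods attractive under extreme imbalance.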

Another issue relates to the lack of comprehensibility of the EasyEnsemble technique in its present form. Because we currently integrate Logistic Regression in the boosting process, the final ensemble classifier is non-linear and therefore incomprehensible. It might be interesting to investigate the numerical boosting technique proposed in Schapire and Singer (1999), where each weak learner is allowed to output real values (instead of values in [−1, 1]). In that case, we would be able to use a plain linear SVM as a weak learner without the Logistic Regression component. The final ensemble would be a linear model, which is surely easier to comprehend. Also, the instance-based approach of Martens and Provost (2014) could be used to explain individual predictions of the non-linear EE-technique.
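The comprehensibility argument rests on a simple identity: a weighted sum of linear decision functions collapses into a single linear model, Σ_t α_t (w_t·x + b_t) = (Σ_t α_t w_t)·x + Σ_t α_t b_t. A toy numerical check (the ensemble size and dimension are illustrative):

```python
import numpy as np

# A weighted vote over T linear scorers is itself one linear model.
rng = np.random.default_rng(0)
T, d = 6, 4                      # hypothetical ensemble size and dimension
W = rng.normal(size=(T, d))      # weak-learner weight vectors w_t
b = rng.normal(size=T)           # weak-learner intercepts b_t
alpha = rng.random(T)            # boosting coefficients alpha_t
x = rng.normal(size=d)

ensemble_score = sum(a * (w @ x + c) for a, w, c in zip(alpha, W, b))
w_final, b_final = alpha @ W, alpha @ b   # the collapsed single linear model
assert np.isclose(ensemble_score, w_final @ x + b_final)
```

This is why real-valued weak learners that stay linear would yield a final model whose coefficients can be inspected directly, unlike the current SVM-plus-LR ensemble.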

A Oversampling experiments

Table 11 Oversampling experiments. Results show the average tenfold AUC test set performance with respect to increasing β-values (β = [β1, β2, β3, β4] = [0, 1/3, 2/3, 1]). Additionally, standard deviations are shown between brackets. Optimal performances for each method are highlighted in boldface. See Section 4.2 for parameter settings.

Mov G(p = 1)
         β1            β2            β3            β4
OSR      71.6 (2.62)   74.37 (2.04)  73.6 (1.84)   74.73 (2.45)
SMOTE    71.6 (2.62)   75.08 (2.18)  76.02 (2.14)  76.48 (2.3)
ADASYN   71.6 (2.62)   75.16 (1.92)  75.93 (2.08)  76.47 (2.29)

Mov G(p = 25)
         β1            β2            β3            β4
OSR      81.41 (1.32)  83.49 (1.81)  83.84 (1.96)  83.91 (2.04)
SMOTE    81.41 (1.32)  83.32 (1.97)  83.59 (2.04)  83.76 (2.11)
ADASYN   81.41 (1.32)  83.61 (1.82)  84.02 (1.97)  83.69 (1.96)

Mov Th(p = [])
         β1            β2            β3            β4
OSR      79.77 (5.33)  85.3 (4.66)   83.16 (4.5)   84.59 (5.69)
SMOTE    79.77 (5.33)  84.18 (6.51)  85.58 (5.97)  84.33 (5.75)
ADASYN   79.77 (5.33)  84.11 (6.77)  85.86 (6.06)  85.36 (5.13)

Yahoo A(p = 1)
         β1            β2            β3            β4
OSR      55.92 (2.97)  58.66 (3.27)  59.99 (2.28)  59.74 (1.78)
SMOTE    55.92 (2.97)  59.76 (2.62)  59.74 (2.67)  59.43 (2.4)
ADASYN   55.92 (2.97)  59.54 (2.53)  59.55 (2.94)  59.56 (2.22)

Yahoo A(p = 25)
         β1            β2            β3            β4
OSR      61.68 (2.42)  64.19 (3.17)  65.08 (3.26)  64.67 (2.1)
SMOTE    61.68 (2.42)  65.46 (3.63)  65.33 (3.23)  64.52 (2.98)
ADASYN   61.68 (2.42)  65.04 (3.74)  65.41 (3.47)  64.4 (2.21)

44 Jellis Vanhoeyveld David Martens

Table 11 continued

Yahoo G(p = 1)
         β1            β2            β3            β4
OSR      66.84 (3.66)  72.18 (2.36)  73.11 (2.7)   72.49 (3.41)
SMOTE    66.84 (3.66)  72.65 (2.85)  73.27 (3.36)  73.37 (3.56)
ADASYN   66.84 (3.66)  72.87 (2.83)  73.18 (3.2)   73.39 (3.59)

Yahoo G(p = 25)
         β1            β2            β3            β4
OSR      78.82 (1.39)  78.78 (2.02)  78.74 (1.86)  78.35 (1.47)
SMOTE    78.82 (1.39)  79.23 (1.57)  79.1 (1.2)    79.03 (1.89)
ADASYN   78.82 (1.39)  79.12 (1.43)  79.23 (1.37)  79.51 (2.01)

TaFeng(p = 1)
         β1            β2            β3            β4
OSR      55.75 (1.6)   59.23 (1.96)  60.0 (1.68)   61.04 (2.36)
SMOTE    55.75 (1.6)   60.26 (1.95)  61.49 (1.8)   61.13 (1.52)
ADASYN   55.75 (1.6)   60.26 (1.9)   61.44 (1.85)  61.16 (1.5)

TaFeng(p = 25)
         β1            β2            β3            β4
OSR      66.94 (1.34)  68.84 (1.21)  66.99 (1.42)  67.7 (1.41)
SMOTE    66.94 (1.34)  68.47 (1.5)   67.07 (1.15)  66.65 (0.81)
ADASYN   66.94 (1.34)  68.62 (1.38)  67.85 (1.6)   66.91 (1.39)

Book(p = 1)
         β1            β2            β3            β4
OSR      52.6 (1.29)   53.61 (0.94)  55.41 (1.75)  55.87 (1.44)
SMOTE    52.6 (1.29)   54.77 (0.99)  54.91 (0.8)   54.36 (0.98)
ADASYN   52.6 (1.29)   54.86 (1.13)  55.06 (0.73)  54.54 (0.92)

Book(p = 25)
         β1            β2            β3            β4
OSR      60.08 (0.71)  61.05 (0.94)  61.82 (1.12)  64.62 (0.57)
SMOTE    60.08 (0.71)  62.6 (0.73)   60.95 (0.68)  63.0 (0.8)
ADASYN   60.08 (0.71)  62.33 (0.73)  60.77 (0.85)  63.04 (0.58)

LST(p = 1)
         β1            β2            β3            β4
OSR      99.99 (0.01)  99.99 (0.01)  99.99 (0.01)  99.99 (0.01)
SMOTE    99.99 (0.01)  99.99 (0.01)  99.99 (0.01)  99.99 (0.01)
ADASYN   99.99 (0.01)  99.99 (0.01)  99.99 (0.01)  99.99 (0.01)

Adver(p = [])
         β1            β2            β3            β4
OSR      96.61 (1.82)  97.31 (1.65)  97.07 (1.84)  97.07 (1.79)
SMOTE    96.61 (1.82)  96.91 (1.66)  97.19 (1.65)  97.07 (1.91)
ADASYN   96.61 (1.82)  97.1 (1.7)    97.08 (1.87)  97.07 (1.88)

Adver(p = 1)
         β1            β2            β3            β4
OSR      90.93 (3.02)  91.27 (3.03)  92.66 (2.82)  93.29 (1.97)
SMOTE    90.93 (3.02)  92.51 (2.03)  92.96 (2.14)  93.53 (1.81)
ADASYN   90.93 (3.02)  92.22 (2.33)  92.7 (2.36)   93.88 (1.73)

CRF(p = [])
         β1            β2            β3            β4
OSR      64.06 (16.43) 80.82 (12.94) 81.28 (12.27) 81.91 (11.28)
SMOTE    64.06 (16.43) 78.64 (16.86) 82.52 (13.74) 79.32 (16.26)
ADASYN   64.06 (16.43) 78.95 (16.72) 81.19 (16.32) 79.31 (16.19)

Bank(p = [])
         β1            β2            β3            β4
OSR      66.82 (0.88)  70.1 (0.74)   71.39 (0.8)   71.47 (0.8)
SMOTE
ADASYN


B Undersampling experiments

Table 12 Undersampling experiments. Results show the average tenfold AUC test set performance with respect to increasing βu-values (βu = [βu1, βu2, βu3, βu4, βu5] = [0, 1/4, 1/2, 3/4, 1]). Optimal performances (boldface) are highlighted for each imbalance ratio. Additionally, standard deviations are shown between brackets. CL K represents the "Closest Knn" technique; CL T represents the "Closest tot sim" technique (similar for Far K and Far T; see Section 3.2). Parameter settings are given in Section 4.2.

Mov G(p = 1)
        βu1          βu2          βu3          βu4          βu5
RUS     71.6 (2.6)   71.83 (2.6)  72.54 (2.5)  72.39 (3.1)  70.61 (3.5)
Cl K    71.6 (2.6)   71.4 (2.0)   70.96 (1.9)  70.43 (2.4)  69.05 (4.1)
CL T    71.6 (2.6)   70.28 (2.5)  66.74 (2.0)  66.8 (2.1)   68.18 (3.6)
Far K   71.6 (2.6)   72.36 (2.7)  71.26 (3.4)  66.57 (5.2)  53.5 (3.5)
Far T   71.6 (2.6)   72.22 (2.8)  71.63 (3.6)  64.28 (5.3)  50.88 (4.4)
CBU     72.55 (2.6)  73.28 (2.6)  73.12 (2.6)  73.84 (2.5)  73.0 (3.1)

Mov G(p = 25)
        βu1          βu2          βu3          βu4          βu5
RUS     81.41 (1.3)  81.36 (1.3)  81.78 (1.7)  82.05 (1.7)  81.6 (2.1)
Cl K    81.41 (1.3)  80.86 (1.2)  80.95 (1.6)  79.73 (2.3)  77.95 (2.3)
CL T    81.41 (1.3)  79.9 (1.2)   78.21 (1.4)  77.87 (1.5)  77.76 (2.3)
Far K   81.41 (1.3)  80.9 (1.5)   78.17 (1.8)  74.25 (2.4)  69.79 (3.2)
Far T   81.41 (1.3)  80.86 (1.5)  77.2 (2.4)   71.16 (2.7)  62.4 (2.8)
CBU     81.53 (1.4)  81.64 (1.3)  81.29 (1.6)  81.28 (2.1)  80.34 (2.7)

Mov Th(p = [])
        βu1          βu2          βu3          βu4          βu5
RUS     79.77 (5.3)  80.32 (5.8)  81.57 (5.5)  81.86 (6.6)  81.26 (6.2)
Cl K    79.77 (5.3)  79.25 (4.5)  78.07 (5.0)  76.25 (6.5)  62.46 (8.5)
CL T    79.77 (5.3)  78.4 (4.4)   72.41 (3.5)  64.66 (4.5)  60.37 (7.3)
Far K   79.77 (5.3)  84.54 (5.0)  83.64 (6.4)  80.02 (7.3)  56.82 (10.3)
Far T   79.77 (5.3)  85.03 (5.7)  82.68 (6.8)  75.61 (9.2)  56.77 (10.9)
CBU     80.11 (5.8)  81.17 (6.0)  81.08 (6.5)  84.17 (5.1)  80.96 (6.9)

Yahoo A(p = 1)
        βu1          βu2          βu3          βu4          βu5
RUS     55.92 (3.0)  55.57 (3.4)  56.44 (3.0)  55.83 (3.4)  56.37 (3.3)
Cl K    55.92 (3.0)  55.67 (2.4)  53.12 (2.0)  50.57 (1.8)  53.79 (3.5)
CL T    55.92 (3.0)  55.69 (2.1)  53.35 (2.2)  50.31 (2.2)  52.35 (3.3)
Far K   55.92 (3.0)  57.35 (2.2)  56.92 (1.1)  56.95 (2.3)  51.18 (2.0)
Far T   55.92 (3.0)  56.93 (2.4)  54.74 (1.9)  57.01 (1.8)  51.18 (2.0)
CBU     58.21 (2.6)  58.45 (3.3)  58.31 (3.5)  58.39 (3.5)  56.09 (2.6)

Yahoo A(p = 25)
        βu1          βu2          βu3          βu4          βu5
RUS     61.68 (2.4)  62.9 (2.9)   63.62 (3.6)  63.75 (3.1)  63.19 (1.9)
Cl K    61.68 (2.4)  61.14 (2.1)  57.62 (1.6)  54.02 (1.8)  51.48 (1.4)
CL T    61.68 (2.4)  60.89 (2.8)  58.11 (1.4)  54.4 (2.1)   51.76 (1.4)
Far K   61.68 (2.4)  63.96 (3.0)  62.62 (2.2)  59.61 (1.5)  56.25 (1.6)
Far T   61.68 (2.4)  63.71 (2.4)  59.72 (1.6)  57.27 (1.1)  54.47 (1.1)
CBU     62.46 (2.6)  61.85 (1.4)  61.78 (2.2)  59.94 (3.0)  60.1 (4.0)

Table 12 continued

Yahoo G(p = 1)
        βu1          βu2          βu3          βu4          βu5
RUS     66.84 (3.7)  67.85 (3.2)  68.36 (3.2)  68.23 (4.0)  69.9 (4.2)
Cl K    66.84 (3.7)  66.71 (2.8)  64.3 (3.6)   61.98 (3.9)  61.15 (1.9)
CL T    66.84 (3.7)  65.79 (2.7)  63.55 (3.3)  59.21 (3.5)  61.08 (2.4)
Far K   66.84 (3.7)  66.76 (4.1)  63.84 (3.4)  65.16 (2.0)  48.5 (2.9)
Far T   66.84 (3.7)  66.95 (4.1)  63.48 (2.9)  65.16 (2.0)  48.48 (2.9)
CBU     69.68 (4.1)  70.59 (3.2)  70.64 (3.7)  70.2 (2.9)   63.35 (3.6)

Yahoo G(p = 25)
        βu1          βu2          βu3          βu4          βu5
RUS     78.82 (1.4)  78.91 (1.6)  78.97 (1.6)  78.61 (1.6)  77.82 (2.1)
Cl K    78.82 (1.4)  77.26 (1.5)  72.52 (1.5)  67.86 (2.0)  65.07 (2.7)
CL T    78.82 (1.4)  76.83 (1.0)  71.99 (1.8)  67.15 (2.3)  61.1 (2.7)
Far K   78.82 (1.4)  78.26 (2.2)  74.69 (2.7)  67.22 (2.1)  60.72 (2.3)
Far T   78.82 (1.4)  77.68 (2.6)  72.44 (3.0)  64.94 (2.4)  59.6 (2.0)
CBU     75.25 (3.2)  75.22 (2.4)  74.69 (2.3)  73.07 (2.4)  70.69 (2.4)

TaFeng(p = 1)
        βu1          βu2          βu3          βu4          βu5
RUS     55.75 (1.6)  56.1 (1.6)   56.26 (1.7)  57.23 (1.7)  59.25 (2.2)
Cl K    55.75 (1.6)  55.68 (1.6)  55.58 (1.5)  55.08 (1.1)  51.05 (1.5)
CL T    55.75 (1.6)  55.67 (1.6)  54.47 (1.6)  47.53 (1.6)  49.3 (1.1)
Far K   55.75 (1.6)  58.99 (1.2)  59.47 (1.1)  60.04 (1.2)  56.31 (1.0)
Far T   55.75 (1.6)  58.92 (1.3)  59.25 (1.3)  58.58 (1.1)  56.31 (1.0)
CBU     57.8 (1.0)   58.47 (1.1)  58.15 (0.9)  58.87 (1.4)  57.65 (1.6)

TaFeng(p = 25)
        βu1          βu2          βu3          βu4          βu5
RUS     66.94 (1.3)  67.44 (1.3)  68.1 (1.4)   68.27 (1.4)  66.13 (1.2)
Cl K    66.94 (1.3)  66.13 (1.4)  63.39 (1.2)  59.83 (1.3)  56.94 (0.7)
CL T    66.94 (1.3)  66.38 (1.5)  62.89 (1.6)  57.46 (1.3)  54.56 (1.3)
Far K   66.94 (1.3)  68.06 (1.4)  66.43 (1.6)  64.46 (1.5)  63.35 (1.3)
Far T   66.94 (1.3)  64.31 (1.1)  62.69 (1.0)  61.27 (1.1)  59.03 (1.0)
CBU     64.81 (1.2)  64.15 (1.1)  64.13 (1.2)  63.88 (0.8)  63.46 (0.8)

Book(p = 1)
        βu1          βu2          βu3          βu4          βu5
RUS     52.6 (1.3)   52.79 (0.9)  53.46 (0.8)  53.89 (0.9)  54.05 (0.9)
Cl K    52.6 (1.3)   52.56 (1.2)  52.52 (1.3)  52.39 (1.1)  53.09 (1.1)
CL T    52.6 (1.3)   52.56 (1.2)  52.52 (1.3)  52.39 (1.1)  53.05 (0.7)
Far K   52.6 (1.3)   55.21 (1.2)  56.21 (1.8)  56.14 (1.2)  53.06 (1.0)
Far T   52.6 (1.3)   55.21 (1.2)  56.21 (1.8)  56.14 (1.2)  53.06 (1.0)
CBU     54.28 (0.9)  53.77 (1.0)  53.33 (1.1)  53.34 (0.9)  52.84 (0.8)

Book(p = 25)
        βu1          βu2          βu3          βu4          βu5
RUS     60.08 (0.7)  60.13 (0.6)  60.4 (0.8)   60.33 (0.8)  63.28 (0.8)
Cl K    60.08 (0.7)  59.96 (0.7)  60.13 (0.8)  59.96 (1.0)  59.28 (0.7)
CL T    60.08 (0.7)  59.96 (0.7)  60.13 (0.8)  60.29 (0.4)  54.5 (0.9)
Far K   60.08 (0.7)  63.29 (1.0)  64.19 (0.8)  57.3 (1.1)   55.66 (1.1)
Far T   60.08 (0.7)  62.14 (0.5)  58.27 (0.6)  56.37 (1.0)  55.66 (1.1)
CBU     54.82 (0.9)  54.67 (0.9)  54.71 (0.9)  54.66 (1.0)  54.78 (0.9)

Table 12 continued

LST(p = 1)
        βu1          βu2          βu3          βu4          βu5
RUS     99.99 (0.0)  99.99 (0.0)  99.99 (0.0)  99.98 (0.0)  99.99 (0.0)
Cl K    99.99 (0.0)  99.99 (0.0)  99.99 (0.0)  99.99 (0.0)  99.99 (0.0)
CL T    99.99 (0.0)  99.99 (0.0)  99.99 (0.0)  99.99 (0.0)  99.98 (0.0)
Far K   99.99 (0.0)  99.98 (0.0)  99.98 (0.0)  99.98 (0.0)  99.98 (0.0)
Far T   99.99 (0.0)  99.98 (0.0)  99.98 (0.0)  99.98 (0.0)  99.98 (0.0)
CBU     []           []           []           []           []

Adver(p = [])
        βu1          βu2          βu3          βu4          βu5
RUS     96.61 (1.8)  96.32 (1.8)  96.63 (1.4)  97.12 (2.1)  96.22 (1.6)
Cl K    96.61 (1.8)  96.44 (1.5)  96.14 (1.5)  96.04 (2.0)  94.8 (2.5)
CL T    96.61 (1.8)  95.87 (2.1)  94.32 (1.9)  93.01 (2.2)  90.72 (2.3)
Far K   96.61 (1.8)  96.53 (1.4)  95.76 (2.0)  94.39 (1.8)  90.49 (3.1)
Far T   96.61 (1.8)  96.54 (1.5)  95.67 (1.9)  94.54 (1.8)  89.3 (2.8)
CBU     96.85 (2.3)  96.85 (2.3)  97.05 (1.5)  96.6 (1.6)   96.06 (2.1)

Adver(p = 1)
        βu1          βu2          βu3          βu4          βu5
RUS     90.93 (3.0)  91.53 (3.1)  92.37 (3.4)  91.9 (2.9)   91.93 (2.2)
Cl K    90.93 (3.0)  90.64 (3.0)  89.87 (3.9)  90.21 (3.6)  89.18 (2.0)
CL T    90.93 (3.0)  89.7 (3.5)   88.55 (3.4)  85.76 (3.3)  88.2 (2.3)
Far K   90.93 (3.0)  93.8 (2.3)   92.4 (2.6)   88.73 (3.4)  85.51 (4.0)
Far T   90.93 (3.0)  93.62 (2.4)  93.2 (2.2)   88.41 (3.6)  85.51 (4.0)
CBU     93.22 (2.4)  93.76 (2.5)  93.89 (2.6)  93.52 (2.7)  91.27 (2.0)

CRF(p = [])
        βu1           βu2          βu3          βu4          βu5
RUS     64.06 (16.4)  63.28 (15.9) 67.98 (17.4) 66.95 (21.9) 87.73 (8.8)
Cl K    64.06 (16.4)  62.44 (16.6) 62.34 (16.9) 71.37 (13.8) 78.22 (17.7)
CL T    64.06 (16.4)  62.44 (16.6) 62.34 (16.9) 71.37 (13.8) 62.67 (22.9)
Far K   64.06 (16.4)  83.8 (14.2)  83.93 (14.8) 84.49 (13.7) 86.11 (9.7)
Far T   64.06 (16.4)  83.8 (14.2)  83.93 (14.8) 84.49 (13.7) 86.11 (9.7)
CBU     []            []           []           []           []

Bank(p = [])
        βu1          βu2          βu3          βu4          βu5
RUS     66.82 (0.9)  67.02 (0.9)  67.37 (0.8)  67.99 (0.6)  69.5 (1.0)
Cl K    66.82 (0.9)  66.17 (0.7)  65.24 (0.6)  64.86 (0.6)  58.53 (1.1)
CL T    66.82 (0.9)  64.92 (1.1)  60.69 (0.9)  56.33 (0.8)  52.87 (0.7)
Far K   66.82 (0.9)  66.95 (0.6)  66.19 (0.6)  64.42 (0.6)  58.25 (1.1)
Far T   66.82 (0.9)  67.16 (0.6)  64.2 (0.8)   59.67 (1.0)  58.25 (1.1)
CBU     []           []           []           []           []


C Boosting experiments

This section presents the results of each of the boosting experiments from Section 4.5 on the remaining data sources. Each of the figures below depicts the average tenfold AUC-performance on test data (with µ = 100) with respect to the number of boosting iterations for (left) AdaBoost (AB), AdaCost (AC) and EasyEnsemble (EE), with C chosen according to the highest validation set AUC-performance (over all possible boosting rounds), and (right) AB and EE (with S = 15) at varying C-levels.

[Panels (a) and (b) omitted from this extraction: both plot the average tenfold test-set AUC against the number of boosting rounds T (0–30); panel (a) compares AB, the AC variants and EE (S = 5, 10, 15) with the baseline BL, panel (b) compares AB and EE at C = 1e-07, 1e-05, 0.001 and 0.1.]

Fig 6 Mov G(p = 1) dataset

[Panels (a) and (b) omitted from this extraction: both plot the average tenfold test-set AUC against the number of boosting rounds T (0–30); panel (a) compares AB, the AC variants and EE (S = 5, 10, 15) with the baseline BL, panel (b) compares AB and EE at C = 1e-07, 1e-05, 0.001 and 0.1.]

Fig 7 Mov Th(p = []) dataset


[Panels (a) and (b) omitted from this extraction: both plot the average tenfold test-set AUC against the number of boosting rounds T (0–30); panel (a) compares AB, the AC variants and EE (S = 5, 10, 15) with the baseline BL, panel (b) compares AB and EE at C = 1e-07, 1e-05, 0.001 and 0.1.]

Fig 8 Yahoo A(p = 1) dataset

[Panels (a) and (b) omitted from this extraction: both plot the average tenfold test-set AUC against the number of boosting rounds T (0–30); panel (a) compares AB, the AC variants and EE (S = 5, 10, 15) with the baseline BL, panel (b) compares AB and EE at C = 1e-07, 1e-05, 0.001 and 0.1.]

Fig 9 Yahoo A(p = 25) dataset

[Panels (a) and (b) omitted from this extraction: both plot the average tenfold test-set AUC against the number of boosting rounds T (0–30); panel (a) compares AB, the AC variants and EE (S = 5, 10, 15) with the baseline BL, panel (b) compares AB and EE at C = 1e-07, 1e-05, 0.001 and 0.1.]

Fig 10 Yahoo G(p = 1) dataset


[Panels (a) and (b) omitted from this extraction: both plot the average tenfold test-set AUC against the number of boosting rounds T (0–30); panel (a) compares AB, the AC variants and EE (S = 5, 10, 15) with the baseline BL, panel (b) compares AB and EE at C = 1e-07, 1e-05, 0.001 and 0.1.]

Fig 11 Yahoo G(p = 25) dataset

[Panels (a) and (b) omitted from this extraction: both plot the average tenfold test-set AUC against the number of boosting rounds T (0–30); panel (a) compares AB, the AC variants and EE (S = 5, 10, 15) with the baseline BL, panel (b) compares AB and EE at C = 1e-07, 1e-05, 0.001 and 0.1.]

Fig 12 TaFeng(p = 1) dataset

[Panels (a) and (b) omitted from this extraction: both plot the average tenfold test-set AUC against the number of boosting rounds T (0–30); panel (a) compares AB, the AC variants and EE (S = 5, 10, 15) with the baseline BL, panel (b) compares AB and EE at C = 1e-07, 1e-05, 0.001 and 0.1.]

Fig 13 Book(p = 1) dataset


[Panels (a) and (b) omitted from this extraction: both plot the average tenfold test-set AUC against the number of boosting rounds T (0–30); panel (a) compares AB, the AC variants and EE (S = 5, 10, 15) with the baseline BL, panel (b) compares AB and EE at C = 1e-07, 1e-05, 0.001 and 0.1.]

Fig 14 LST(p = 1) dataset

[Panels (a) and (b) omitted from this extraction: both plot the average tenfold test-set AUC against the number of boosting rounds T (0–30); panel (a) compares AB, the AC variants and EE (S = 5, 10, 15) with the baseline BL, panel (b) compares AB and EE at C = 1e-07, 1e-05, 0.001 and 0.1.]

Fig 15 Adver(p = []) dataset

[Panels (a) and (b) omitted from this extraction: both plot the average tenfold test-set AUC against the number of boosting rounds T (0–30); panel (a) compares AB, the AC variants and EE (S = 5, 10, 15) with the baseline BL, panel (b) compares AB and EE at C = 1e-07, 1e-05, 0.001 and 0.1.]

Fig 16 Adver(p = 1) dataset


[Panels (a) and (b) omitted from this extraction: both plot the average tenfold test-set AUC against the number of boosting rounds T (0–30); panel (a) compares AB, the AC variants and EE (S = 5, 10, 15) with the baseline BL, panel (b) compares AB and EE at C = 1e-07, 1e-05, 0.001 and 0.1.]

Fig 17 CRF(p = []) dataset

D Final Comparison

[Scatter plot omitted from this extraction: the methods BL, OSR, SMOTE, ADASYN, RUS, Cl Knn, Far Knn, CBU, AB, AC(R=28), AC(R=RL), EE(S=10), EE(S=15) and EE par are plotted by average rank AUC (horizontal axis) versus average rank Time (vertical axis).]

Fig 18 Average rank AUC versus average rank Time (see Table 9) across the large datasets from Table 2 (CRF and Bank). Regarding the former, an extra row EE par is added, having the same AUC-value as EE(S = 15). Ranks for each dataset are subsequently obtained via the procedure outlined in Section 4.6.1. Points occurring in the upper-right region are preferred.


References

Akbani R Kwek S Japkowicz N (2004) Applying support vector machines to imbalanced datasetsIn Machine Learning ECML 2004 15th European Conference on Machine Learning Pisa ItalySeptember 20-24 2004 Proceedings Springer Berlin Heidelberg Berlin Heidelberg pp 39ndash50 DOI101007978-3-540-30115-8 7

Ali A Shamsuddin SM Ralescu AL (2015) Classification with class imbalance problem A review Inter-national Journal of Advances in Soft Computing and its Applications 7(3)176ndash204

Alzahrani T Horadam KJ (2016) Community detection in bipartite networks Algorithms and case studiesIn Complex Systems and Networks Dynamics Controls and Applications Springer Berlin Heidel-berg Berlin Heidelberg pp 25ndash50 DOI 101007978-3-662-47824-0 2

Bachner J (2013) Predictive policing Preventing crime with data and analytics IBM Center for the Busi-ness of Government

Baesens B Van Gestel T Viaene S Stepanova M Suykens J Vanthienen J (2003) Benchmarking state-of-the-art classification algorithms for credit scoring Journal of the Operational Research Society54(6)627ndash635 DOI 101057palgravejors2601545

Barandela R Snchez J Garca V Rangel E (2003) Strategies for learning in class imbalance problemsPattern Recognition 36(3)849 ndash 851 DOI httpsdoiorg101016S0031-3203(02)00257-1

Barber MJ (2007) Modularity and community detection in bipartite networks Physical Review E76066102 DOI 101103PhysRevE76066102

Barua S Islam MM Yao X Murase K (2014) MWMOTEndashmajority weighted minority oversamplingtechnique for imbalanced data set learning IEEE Transactions on Knowledge and Data Engineer-ing 26(2)405ndash425 DOI 101109TKDE2012232

Batista GEAPA Prati RC Monard MC (2004) A study of the behavior of several methods for balancingmachine learning training data SIGKDD Explor Newsl 6(1)20ndash29 DOI 10114510077301007735

Beckett SJ (2016) Improved community detection in weighted bipartite networks Royal Society OpenScience 3(1) DOI 101098rsos140536

Bekkar M Djemaa HK Alitouche TA (2013) Evaluation measures for models assessment over imbalanceddata sets Journal of Information Engineering and Applications 3(10)27ndash38

Bhattacharyya S Jha S Tharakunnel K Westland JC (2011) Data mining for credit card fraud A compara-tive study Decision Support Systems 50(3)602 ndash 613 DOI httpsdoiorg101016jdss201008008

Blondel VD Guillaume JL Lambiotte R Lefebvre E (2008) Fast unfolding of communities in large net-works Journal of Statistical Mechanics Theory and Experiment 2008(10)P10008

Breiman L Friedman J Stone CJ Olshen RA (1984) Classification and regression trees Taylor amp FrancisBrozovsky L Petricek V (2007) Recommender system for online dating service In Proceedings of

Znalosti 2007 Conference VSB OstravaCha M Mislove A Gummadi KP (2009) A measurement-driven analysis of information propagation in

the Flickr social network In Proceedings of the 18th International Conference on World Wide WebACM New York NY USA WWW rsquo09 pp 721ndash730 DOI 10114515267091526806

Chawla NV (2005) Data mining for imbalanced datasets An overview In Data mining and knowledgediscovery handbook Springer US Boston MA pp 853ndash867

Chawla NV Bowyer KW Hall LO Kegelmeyer WP (2002) SMOTE synthetic minority over-samplingtechnique Journal of artificial intelligence research 16321ndash357

Chawla NV Lazarevic A Hall LO Bowyer KW (2003) Smoteboost Improving prediction of the minorityclass in boosting In Knowledge Discovery in Databases PKDD 2003 Springer Berlin HeidelbergBerlin Heidelberg pp 107ndash119

Chawla NV Japkowicz N Kotcz A (2004) Editorial Special issue on learning from imbalanced data setsSIGKDD Explor Newsl 6(1)1ndash6 DOI 10114510077301007733

Chen M Mao S Liu Y (2014) Big data A survey Mobile Networks and Applications 19(2)171ndash209DOI 101007s11036-013-0489-0

Chyi YM (2003) Classification analysis techniques for skewed class distribution problems Master thesisDepartment of Information management National Sun Yat-Sen University

Demsar J (2006) Statistical comparisons of classifiers over multiple data sets Journal of Machine LearningResearch 7(Jan)1ndash30

Dhillon IS (2001) Co-clustering documents and words using bipartite spectral graph partitioning In Pro-ceedings of the Seventh ACM SIGKDD International Conference on Knowledge Discovery and DataMining ACM New York NY USA KDD rsquo01 pp 269ndash274 DOI 101145502512502550

54 Jellis Vanhoeyveld David Martens

Drummond C Holte RC (2003) C45 class imbalance and cost sensitivity why under-sampling beatsover-sampling In Proceedings of the ICML rsquo03 Workshop on Learning from Imbalanced Datasets

Fan RE Chang KW Hsieh CJ Wang XR Lin CJ (2008) LIBLINEAR A library for large linear classifi-cation Journal of Machine Learning Research 91871ndash1874

Fan W Stolfo SJ Zhang J Chan PK (1999) AdaCost Misclassification cost-sensitive boosting In Pro-ceedings of the Sixteenth International Conference on Machine Learning Morgan Kaufmann Pub-lishers Inc San Francisco CA USA ICML rsquo99 pp 97ndash105

Fawcett T (2006) An introduction to ROC analysis Pattern Recognition Letters 27(8)861 ndash 874 DOIhttpsdoiorg101016jpatrec200510010

Finch H (2005) Comparison of distance measures in cluster analysis with dichotomous data Journal ofData Science 3(1)85ndash100

Fortunato S (2010) Community detection in graphs Physics Reports 486(3 5)75 ndash 174 DOI httpsdoiorg101016jphysrep200911002

Junque de Fortuny E Martens D Provost F (2014a) Predictive modeling with big data is bigger reallybetter Big Data 1(4)215ndash226 DOI 101089big20130037

Junque de Fortuny E Stankova M Moeyersoms J Minnaert B Provost F Martens D (2014b) Corporateresidence fraud detection In Proceedings of the 20th ACM SIGKDD International Conference onKnowledge Discovery and Data Mining ACM New York NY USA KDD rsquo14 pp 1650ndash1659 DOI10114526233302623333

Frasca M Bertoni A Re M Valentini G (2013) A neural network algorithm for semi-supervised node labellearning from unbalanced data Neural Networks 4384 ndash 98 DOI httpsdoiorg101016jneunet201301021

Friedman M (1937) The use of ranks to avoid the assumption of normality implicit in the analysis ofvariance Journal of the american statistical association 32(200)675ndash701

Garcıa E Lozano F (2007) Boosting support vector machines In Machine Learning and Data Mining inPattern Recognition 5th International Conference MLDM 2007 Leipzig Germany July 18-20 PostProceedings IBaI publishing pp 153ndash167

Goldstein M Uchida S (2016) A comparative evaluation of unsupervised anomaly detection algorithmsfor multivariate data PLOS ONE 11(4)1ndash31 DOI 101371journalpone0152173

Gonzlez PC Velsquez JD (2013) Characterization and detection of taxpayers with false invoices usingdata mining techniques Expert Systems with Applications 40(5)1427 ndash 1436 DOI httpsdoiorg101016jeswa201208051

Guimera R Sales-Pardo M Amaral LAN (2007) Module identification in bipartite and directed networksPhysical Review E 76036102 DOI 101103PhysRevE76036102

Guo H Viktor HL (2004) Learning from imbalanced data sets with boosting and data generation TheDataBoost-IM approach SIGKDD Explor Newsl 6(1)30ndash39 DOI 10114510077301007736

Guo X Yin Y Dong C Yang G Zhou G (2008) On the class imbalance problem In 2008 Fourth Interna-tional Conference on Natural Computation IEEE vol 4 pp 192ndash201 DOI 101109ICNC2008871

Han H Wang WY Mao BH (2005) Borderline-SMOTE A new over-sampling method in imbalanced datasets learning In Advances in Intelligent Computing Springer Berlin Heidelberg Berlin Heidelbergpp 878ndash887

He H Garcia EA (2009) Learning from imbalanced data IEEE Transactions on Knowledge and DataEngineering 21(9)1263ndash1284 DOI 101109TKDE2008239

He H Bai Y Garcia EA Li S (2008) ADASYN Adaptive synthetic sampling approach for imbalancedlearning In 2008 IEEE International Joint Conference on Neural Networks (IEEE World Congresson Computational Intelligence) IEEE pp 1322ndash1328 DOI 101109IJCNN20084633969

Holm S (1979) A simple sequentially rejective multiple test procedure Scandinavian journal of statistics6(2)65ndash70

Hsu CW Lin CJ (2002) A comparison of methods for multiclass support vector machines IEEE Transac-tions on Neural Networks 13(2)415ndash425 DOI 10110972991427

Huang A (2008) Similarity measures for text document clustering In Proceedings of the sixth new zealandcomputer science research student conference (NZCSRSC2008) Christchurch New Zealand pp 49ndash56

Iman RL Davenport JM (1980) Approximations of the critical region of the Friedman statistic Commu-nications in Statistics-Theory and Methods 9(6)571ndash595

Jo T Japkowicz N (2004) Class imbalances versus small disjuncts ACM SIGKDD Explor Newsl 6(1)40ndash49 DOI 10114510077301007737

Imbalanced classification in sparse and large behaviour datasets 55

Jutla IS Jeub LG Mucha PJ (2011-2016) A generalized louvain method for community detection imple-mented in MATLAB URL httpnetwikiamathunceduGenLouvain

Kubat M Matwin S (1997) Addressing the curse of imbalanced training sets One-sided selection InProceedings of the Fourteenth International Conference on Machine Learning Morgan KaufmannPublishers Inc San Francisco CA USA pp 179ndash186

Lancichinetti A Fortunato S (2009) Community detection algorithms A comparative analysis PhysicalReview E 80056117 DOI 101103PhysRevE80056117

Larremore DB Clauset A Jacobs AZ (2014) Efficiently inferring community structure in bipartitenetworks Physical Review E Statistical Nonlinear and Soft Matter Physics 90012805 DOI101103PhysRevE90012805

Li J Fine JP (2010) Weighted area under the receiver operating characteristic curve and its application togene selection Journal of the Royal Statistical Society Series C (Applied Statistics) 59(4)673ndash692DOI 101111j1467-9876201000713x

Li X Wang L Sung E (2008) AdaBoost with SVM-based component classifiers Engineering Applicationsof Artificial Intelligence 21(5)785ndash795 DOI httpsdoiorg101016jengappai200707001

Lichman M (2013) UCI machine learning repository URL httparchiveicsuciedumlLiu W Chawla S Cieslak DA Chawla NV (2010) A robust decision tree algorithm for imbalanced data

sets In Proceedings of the tenth SIAM international conference on data mining SIAM Philadelphiavol 10 pp 766ndash777

Liu XY Wu J Zhou ZH (2009) Exploratory undersampling for class-imbalance learning IEEE Transac-tions on Systems Man and Cybernetics Part B (Cybernetics) 39(2)539ndash550 DOI 101109TSMCB20082007853

Luts J Ojeda F Van de Plas R De Moor B Van Huffel S Suykens JA (2010) A tutorial on supportvector machine-based methods for classification problems in chemometrics Analytica Chimica Acta665(2)129ndash145 DOI httpsdoiorg101016jaca201003030

Macskassy SA Provost F (2007) Classification in networked data A toolkit and a univariate case studyJournal of Machine Learning Research 8(May)935ndash983

Martens D Provost F (2014) Explaining data-driven document classifications MIS Quarterly 38(1)73ndash100

Martens D Provost F Clark J Junque de Fortuny E (2016) Mining massive fine-grained behavior data toimprove predictive analytics MIS Quarterly 40(4)869ndash888

Mazurowski MA Habas PA Zurada JM Lo JY Baker JA Tourassi GD (2008) Training neural networkclassifiers for medical decision making The effects of imbalanced datasets on classification perfor-mance Neural Networks 21(23)427 ndash 436 DOI httpsdoiorg101016jneunet200712031

Mease D Wyner AJ Buja A (2007) Boosted classification trees and class probabilityquantile estimationJournal of Machine Learning Research 8409ndash439

Nemenyi P (1963) Distribution-free Multiple Comparisons Dissertation Princeton UniversityNewman MEJ Girvan M (2004) Finding and evaluating community structure in networks Physical Re-

view E 69026113 DOI 101103PhysRevE69026113Ng AY (2004) Feature selection L1 vs L2 regularization and rotational invariance In Proceedings of the

Twenty-first International Conference on Machine Learning ACM New York NY USA ICML rsquo04pp 78ndash DOI 10114510153301015435

Ng AY Jordan MI (2002) On discriminative vs generative classifiers A comparison of logistic regressionand naive bayes In Advances in Neural Information Processing Systems 14 MIT Press pp 841ndash848

Ngai E Hu Y Wong Y Chen Y Sun X (2011) The application of data mining techniques in financial frauddetection A classification framework and an academic review of literature Decision Support Systems50(3)559 ndash 569 DOI httpsdoiorg101016jdss201008006

Platt JC (1999) Probabilistic outputs for support vector machines and comparisons to regularized likeli-hood methods In Advances in Large-Margin Classifiers MIT Press pp 61ndash74

Porter MA Onnela JP Mucha PJ (2009) Communities in networks Notices of the American MathematicalSociety 56(9)1082ndash1097

Provost F Fawcett T (2013) Data Science for Business What you need to know about data mining anddata-analytic thinking OrsquoReilly Media Inc

Provost F Dalessandro B Hook R Zhang X Murray A (2009) Audience selection for on-line brandadvertising Privacy-friendly social network targeting In Proceedings of the 15th ACM SIGKDDInternational Conference on Knowledge Discovery and Data Mining ACM New York NY USAKDD rsquo09 pp 707ndash716 DOI 10114515570191557098

56 Jellis Vanhoeyveld David Martens

Raskutti B, Kowalczyk A (2004) Extreme re-balancing for SVMs: A case study. SIGKDD Explor Newsl 6(1):60-69. DOI 10.1145/1007730.1007739

Rosvall M, Bergstrom CT (2008) Maps of random walks on complex networks reveal community structure. Proceedings of the National Academy of Sciences 105(4):1118-1123. DOI 10.1073/pnas.0706851105

Schapire RE (1999) A brief introduction to boosting. In: Proceedings of the 16th International Joint Conference on Artificial Intelligence - Volume 2, Morgan Kaufmann Publishers Inc, San Francisco, CA, USA, IJCAI '99, pp 1401-1406

Schapire RE, Singer Y (1999) Improved boosting algorithms using confidence-rated predictions. Machine Learning 37(3):297-336. DOI 10.1023/A:1007614523901

Shmueli G (2017) Analyzing behavioral big data: Methodological, practical, ethical and moral issues. Quality Engineering 29(1):57-74. DOI 10.1080/08982112.2016.1210979

Sobhani P, Viktor H, Matwin S (2015) Learning from imbalanced data using ensemble methods and cluster-based undersampling. In: New Frontiers in Mining Complex Patterns: Third International Workshop, NFMCP 2014, Held in Conjunction with ECML-PKDD 2014, Nancy, France, September 19, 2014, Revised Selected Papers, Springer International Publishing, Cham, pp 69-83. DOI 10.1007/978-3-319-17876-9_5

Stankova M (2016) Classification within network data with a bipartite structure. Dissertation, University of Antwerp

Stankova M, Martens D, Provost F (2015) Classification over bipartite graphs through projection. Working Papers 2015001, University of Antwerp, Faculty of Applied Economics

Sun Y, Kamel MS, Wong AK, Wang Y (2007) Cost-sensitive boosting for classification of imbalanced data. Pattern Recognition 40(12):3358-3378. DOI 10.1016/j.patcog.2007.04.009

Suykens JA, Van Gestel T, De Brabanter J, De Moor B, Vandewalle J (2002) Least squares support vector machines. World Scientific, Singapore

Tang B, He H (2015) ENN: Extended nearest neighbor method for pattern recognition [research frontier]. IEEE Computational Intelligence Magazine 10(3):52-60. DOI 10.1109/MCI.2015.2437512

Tang Y, Zhang YQ, Chawla NV, Krasser S (2009) SVMs modeling for highly imbalanced classification. IEEE Transactions on Systems, Man, and Cybernetics, Part B (Cybernetics) 39(1):281-288. DOI 10.1109/TSMCB.2008.2002909

Tobback E, Moeyersoms J, Stankova M, Martens D (2016) Bankruptcy prediction for SMEs using relational data. Working Paper 2016004, University of Antwerp, Faculty of Applied Economics

Verbeke W, Dejaeger K, Martens D, Hur J, Baesens B (2012) New insights into churn prediction in the telecommunication sector: A profit driven data mining approach. European Journal of Operational Research 218(1):211-229. DOI 10.1016/j.ejor.2011.09.031

Veropoulos K, Campbell I, Cristianini N (1999) Controlling the sensitivity of support vector machines. In: Proceedings of the International Joint Conference on Artificial Intelligence, Stockholm, Sweden (IJCAI99), pp 55-60

Whitrow C, Hand DJ, Juszczak P, Weston D, Adams NM (2009) Transaction aggregation as a strategy for credit card fraud detection. Data Mining and Knowledge Discovery 18(1):30-55. DOI 10.1007/s10618-008-0116-z

Wickramaratna J, Holden SB, Buxton BF (2001) Performance degradation in boosting. In: Proceedings of the Second International Workshop on Multiple Classifier Systems, Springer, London, UK, MCS '01, pp 11-21

Yen SJ, Lee YS (2009) Cluster-based under-sampling approaches for imbalanced data distributions. Expert Systems with Applications 36(3, Part 1):5718-5727. DOI 10.1016/j.eswa.2008.06.108

Yu HF, Lo HY, Hsieh HP, Lou JK, McKenzie TG, Chou JW, Chung PH, Ho CH, Chang CF, Wei YH, et al (2010) Feature engineering and classifier ensemble for KDD Cup 2010. In: Proceedings of the KDD Cup 2010 Workshop, pp 1-16

Zha H, He X, Ding C, Simon H, Gu M (2001) Bipartite graph partitioning and data clustering. In: Proceedings of the Tenth International Conference on Information and Knowledge Management, ACM, New York, NY, USA, CIKM '01, pp 25-32. DOI 10.1145/502585.502591

Zhang J, Mani I (2003) kNN approach to unbalanced data distributions: A case study involving information extraction. In: Proceedings of the ICML'2003 Workshop on Learning from Imbalanced Datasets, Washington DC

Ziegler CN, McNee SM, Konstan JA, Lausen G (2005) Improving recommendation lists through topic diversification. In: Proceedings of the 14th International Conference on World Wide Web, ACM, New York, NY, USA, WWW '05, pp 22-32. DOI 10.1145/1060745.1060754



or above average). In the Kdd cup data, the performance of a student on a test is predicted based on artificially created binary features (Yu et al 2010). The corporate residence fraud dataset (CRF) (Junque de Fortuny et al 2014b) contains data on foreign companies making transactions with specific Belgian companies, where we try to predict whether the foreign company commits residence fraud (a type of fiscal fraud). Finally, the banking dataset (Bank) (Martens et al 2016) contains detailed behaviour on consumers making transactions with merchants or other persons, to predict interest in a pension fund product. Some characteristics of these datasets can be found in Table 2. The features column only shows the number of active features.17

Table 2 Behaviour data characteristics. The final column shows the imbalance ratio p, defined as the ratio of the number of minority class instances to the amount of majority class instances in the training set, expressed as a percentage. See Section 4.2 for details regarding p.

Name      |X_maj|    |X_min|    Features    p = 100 × |X_min|_train / |X_maj|_train
Mov G     4331       1709       3706        1 & 25
Mov Th    10546      131        69878       1.24 (p = [])
Yahoo A   6030       1612       11915       1 & 25
Yahoo G   5436       2206       11915       1 & 25
TaFeng    17330      14310      23719       1 & 25
Book      42900      18858      282973      1 & 25
LST       59702      60145      166353      1
Adver     2792       457        1555        16.38 (p = []) & 1
CRF       869071     62         108753      0.0072 (p = [])
Bank      1193619    11107      3139570     0.93 (p = [])
Flickr    8166814    3028330    497472      0.1
Kdd       7171885    1235867    19306083    0.5

4.2 Methodology

Regarding the experiments performed in the upcoming sections, we applied a tenfold cross-validation procedure. Each of the folds contains 80% training data, 10% validation data and 10% test data. Note that these percentages are valid for both the majority class and the minority class (stratified sampling). As can be seen from Table 2, some datasets are balanced in nature. We created artificial imbalance for these datasets by removing minority class instances from the initial training set according to a user-defined parameter p. We ensured that the amount of minority training instances corresponds to p percent of the majority class training size:

|X_min|_train = (p / 100) × |X_maj|_train

As an example, say that we are using the Book dataset with p = 25. In that case we know that the majority class contains 34320 training instances (80% of |X_maj| = 42900). The minority training data would contain 8580 instances (25% of 34320). When the dataset is already imbalanced, we define p = [], which means that no downsampling

17 Active features represent features that are present for at least one instance in the dataset. A non-active feature corresponds with a column of zeros in the matrix representation and would not contribute to the model.

Imbalanced classification in sparse and large behaviour datasets 21

of the minority class training data is performed. Note that the validation and test data are left untouched.18
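As an illustration, the artificial-imbalance step above can be sketched in a few lines (a minimal sketch only; the function name and the use of Python's `random` module are our own, not the authors' implementation):

```python
import random

def downsample_minority(min_train, n_maj_train, p, seed=0):
    """Create artificial imbalance: keep only p% (relative to the
    majority training size) of the minority TRAINING instances.
    Validation and test folds are left untouched."""
    target = round(p / 100.0 * n_maj_train)
    return random.Random(seed).sample(min_train, target)

# Book dataset example from the text: 80% of |X_maj| = 42900 -> 34320
# majority training instances; p = 25 then keeps 8580 minority instances.
min_train = list(range(15086))  # ~80% of the 18858 minority instances
print(len(downsample_minority(min_train, 34320, p=25)))  # -> 8580
```

The same routine applied per fold reproduces the 8580-instance minority training set of the Book example.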

The methods detailed in Section 3 are applied on the training data. For both the under- and oversampling approaches, a linear SVM is trained on the newly created balanced training data, with regularization parameter C having values

C = [10⁻⁷, 10⁻⁵, 10⁻³, 10⁻¹, 10⁰]

The validation data are used for parameter tuning purposes. The test data allow us to obtain the generalization performance. The results reported show the average over ten folds. In the remaining paragraphs of this section, we will describe the various parameter settings used in our experiments and give a brief overview of the parameters occurring in each method.
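Since all tuning in this section is driven by validation-set AUC, it may help to recall that AUC has a simple rank-based form. The sketch below (our own helper, not the paper's code) computes it by pairwise comparison, which also makes its independence of class skew evident:

```python
def auc(pos_scores, neg_scores):
    """Rank-based AUC: the probability that a randomly drawn minority
    (positive) instance scores higher than a randomly drawn majority
    (negative) one; ties count for 1/2. Independent of class skew."""
    pairs = len(pos_scores) * len(neg_scores)
    wins = sum((p > n) + 0.5 * (p == n) for p in pos_scores for n in neg_scores)
    return wins / pairs

print(auc([0.9, 0.8], [0.2, 0.1]))   # -> 1.0 (perfect ranking)
print(auc([0.6, 0.4], [0.6, 0.4]))   # -> 0.5 (no discrimination)
```

A C-value (and any method-specific parameter) is then chosen by evaluating this measure on the validation fold for every candidate setting and keeping the maximizer.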

Considering the oversampling techniques, the parameter settings are as follows:

β = [0, 1/3, 2/3, 1]
prior_opt = {Flip Coin, Reverse Priors}
sim_measure = {Cosine, Jaccard}
K = [10⁰, 10¹, 10², |X_min|_train]

We didn't include the "Prior" option, due to initial experiments showing a lower performance in comparison with the other options. This can be explained by the low priors occurring in each column, resulting in synthetic samples that mainly show zeros in 0-1 match situations. The oversampling with replacement method (OSR) only uses the β parameter, SMOTE uses all but the K parameter, and ADASYN uses all of the parameters listed above.
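A minimal OSR sketch is shown below. We assume here (our reading of the β parameter, not stated in this excerpt) that β interpolates between the original imbalance (β = 0) and a fully balanced set (β = 1):

```python
import random

def osr(minority, n_majority, beta, seed=0):
    """Oversampling with replacement (OSR) sketch: duplicate randomly
    chosen minority instances until the minority size has moved a
    fraction beta of the way towards full balance (assumption:
    beta = 1 -> equal class sizes, beta = 0 -> no oversampling)."""
    rng = random.Random(seed)
    n_extra = round(beta * (n_majority - len(minority)))
    return minority + [rng.choice(minority) for _ in range(n_extra)]

print(len(osr(list(range(100)), 1000, beta=1.0)))   # -> 1000
print(len(osr(list(range(100)), 1000, beta=1/3)))   # -> 400
```

SMOTE and ADASYN would replace the `rng.choice` duplication with synthetic interpolation between K nearest minority neighbours, which is where the sim_measure and K parameters come in.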

With respect to the undersampling techniques, the following parameter settings are used:

β_u = [0, 1/4, 1/2, 3/4, 1]
sim_measure = {Cosine, Jaccard}
K = [10⁰, 10¹, 10², |X_min|_train]
Clust_opt = {C Smallest, C Largest}

The random undersampling (RUS) technique only uses the β_u parameter. The second set of methods, "Closest tot sim" and "Farthest tot sim", uses β_u and sim_measure. The third set of techniques, "Closest Knn" and "Farthest Knn", makes use of all the parameters listed above except for Clust_opt. The final approach, CBU, employs β_u and Clust_opt.
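The "tot sim" family can be sketched on sparse behaviour vectors as follows (our own sketch under two assumptions: instances are represented as feature-weight dicts, and β_u = 1 leaves a balanced set):

```python
import math

def cosine(u, v):
    # cosine similarity between two sparse vectors {feature_id: weight}
    dot = sum(w * v.get(f, 0.0) for f, w in u.items())
    norm = lambda x: math.sqrt(sum(w * w for w in x.values()))
    d = norm(u) * norm(v)
    return dot / d if d else 0.0

def undersample_tot_sim(majority, minority, beta_u, keep="closest"):
    """'Closest/Farthest tot sim' sketch: score every majority instance
    by its total cosine similarity to the minority class, then discard
    the farthest (resp. closest) ones first; beta_u = 1 keeps only
    |X_min| majority instances (our reading of beta_u)."""
    n_keep = len(majority) - round(beta_u * (len(majority) - len(minority)))
    scored = sorted(majority,
                    key=lambda x: sum(cosine(x, m) for m in minority),
                    reverse=(keep == "closest"))
    return scored[:n_keep]

# toy sparse behaviour vectors
minority = [{0: 1.0, 1: 1.0}]
majority = [{0: 1.0, 1: 1.0}, {0: 1.0}, {2: 1.0}]
print(undersample_tot_sim(majority, minority, beta_u=1.0))  # -> [{0: 1.0, 1: 1.0}]
```

The "Knn" variants score each majority instance against its K nearest minority neighbours only, instead of against the whole minority class.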

The boosting variants (AdaBoost, AdaCost and EasyEnsemble) presented in Section 3.3 make use of the following settings:

T = 30
µ = [100%, 75%]
C = [10⁻⁷, 10⁻⁵, 10⁻³, 10⁻¹]
R = [2, 8, R_L], where R_L = |X_maj|_train / |X_min|_train
S = [5, 10, 15]

18 This means that if we start from a balanced set, only the training data will show artificial imbalance according to the imbalance ratio p; the validation and test data would remain balanced. Since AUC (and some other metrics) is independent of class skew, it would be unwise to make these sets imbalanced as well, because that would lead to discarding minority class instances that are relevant for performance assessment.

The AdaBoost algorithm includes the T, µ and C parameters. AdaCost additionally uses cost ratios R. We have chosen a range of values because misclassification costs are unknown for many business applications (He and Garcia 2009; Fan et al 1999; Sun et al 2007). The final value, R_L, seems to be a popular choice (Akbani et al 2004; Luts et al 2010) because the total weight on the majority class balances with the total weight on the minority class. The final method, EasyEnsemble, uses S subsets in addition to the parameters previously mentioned for AdaBoost. Note that we consider the boosting iteration round t ∈ [1, T] as a tunable parameter.19
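The reweighting step shared by these boosting variants follows the classical AdaBoost update (Schapire 1999). The sketch below shows a single round on hard ±1 predictions (our own simplification; the experiments use SVM base learners and, for AdaCost, a cost-adjusted update):

```python
import math

def adaboost_reweight(weights, y, pred):
    """One AdaBoost round: compute the weighted error eps, the learner
    weight alpha, and re-normalized instance weights that emphasize
    the misclassified (hard-to-learn) instances."""
    eps = sum(w for w, t, p in zip(weights, y, pred) if t != p)
    alpha = 0.5 * math.log((1 - eps) / eps)
    new = [w * math.exp(-alpha if t == p else alpha)
           for w, t, p in zip(weights, y, pred)]
    z = sum(new)  # normalization constant
    return [w / z for w in new], alpha

# 4 instances, uniform weights, one misclassification
w, a = adaboost_reweight([0.25] * 4, [1, 1, -1, -1], [1, 1, -1, 1])
print(round(w[3], 4))  # -> 0.5: the misclassified instance carries half the mass
```

After the update, the total weight on misclassified instances always equals 1/2, which is what forces the next weak learner to focus on them.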

4.3 Oversampling

The oversampling techniques presented in Section 3.1 are applied to each of the datasets from Table 2 with varying imbalance levels (by varying the p-parameter). The experiments we conducted use the following methodology: for each dataset, we apply the oversampling techniques with all possible parameter combinations on the training data to create newly balanced datasets, after which linear SVMs are trained. The optimal parameter combination with respect to each imbalance ratio β is selected based on validation set AUC-performance. With these parameters, the AUC on the test data is obtained. Results show the average over ten folds.

The results on four arbitrarily selected datasets are shown in Table 3. Full results on each of the data sources can be found in Appendix A, Table 11. From these tables we can conclude that performance generally increases with growing β-values. It is more correct to say that performance keeps improving with growing β-levels until an optimal point β is reached. Increasing the balance level after this optimal value will cause only small fluctuations with respect to the optimal performance. Traditional studies dealing with dense, low-dimensional data note that the OSR method can suffer from overfitting, as already mentioned in Section 3.1. It is interesting to see that in this sparse, high-dimensional setting this effect doesn't seem to occur.20 Furthermore, the computationally expensive synthetic sampling approaches do not seem to improve performance over the simpler OSR method. This can be explained by the fact that many instances have a very limited amount of non-zero elements (Junque de Fortuny et al 2014a; Stankova et al 2015), causing the synthetic sampling procedures to be limited in the amount of unique new samples they can produce. Note also that synthetic sampling procedures have historically been designed to overcome the overfitting behaviour of OSR, which no longer seems to apply here.

19 We sometimes make use of T as a symbol to indicate the boosting iteration number instead of t.

20 The SVM regularization parameter C controls for overfitting. Additionally, the influence of majority class noise/outliers (see Section 3.2) on the learned hyperplane decreases as we oversample the minority class. This hyperplane is more sensitive towards minority instances, and a change in its direction/orientation towards these majority class noise/outliers will be more heavily penalized.


Table 3 Oversampling experiments. Results show the average tenfold AUC test set performance with respect to increasing β-values (β = [β1, β2, β3, β4] = [0, 1/3, 2/3, 1]). Additionally, standard deviations are shown between brackets. Optimal performances for each method are highlighted in boldface. See Section 4.2 for parameter settings.

Mov Th (p = [])
          β1            β2            β3            β4
OSR       79.77 (5.33)  85.30 (4.66)  83.16 (4.50)  84.59 (5.69)
SMOTE     79.77 (5.33)  84.18 (6.51)  85.58 (5.97)  84.33 (5.75)
ADASYN    79.77 (5.33)  84.11 (6.77)  85.86 (6.06)  85.36 (5.13)

Yahoo G (p = 25)
          β1            β2            β3            β4
OSR       78.82 (1.39)  78.78 (2.02)  78.74 (1.86)  78.35 (1.47)
SMOTE     78.82 (1.39)  79.23 (1.57)  79.10 (1.20)  79.03 (1.89)
ADASYN    78.82 (1.39)  79.12 (1.43)  79.23 (1.37)  79.51 (2.01)

TaFeng (p = 25)
          β1            β2            β3            β4
OSR       66.94 (1.34)  68.84 (1.21)  66.99 (1.42)  67.70 (1.41)
SMOTE     66.94 (1.34)  68.47 (1.50)  67.07 (1.15)  66.65 (0.81)
ADASYN    66.94 (1.34)  68.62 (1.38)  67.85 (1.60)  66.91 (1.39)

Book (p = 25)
          β1            β2            β3            β4
OSR       60.08 (0.71)  61.05 (0.94)  61.82 (1.12)  64.62 (0.57)
SMOTE     60.08 (0.71)  62.60 (0.73)  60.95 (0.68)  63.00 (0.80)
ADASYN    60.08 (0.71)  62.33 (0.73)  60.77 (0.85)  63.04 (0.58)

4.4 Undersampling

Regarding the undersampling techniques, we employ a similar experimental set-up as in the previous section: firstly, train SVMs on the undersampled training data with all possible parameter combinations; secondly, choose a suitable parameter combination based on validation set AUC-performance; and finally, obtain the AUC-performance on the test set. The results are averaged across ten folds.

Before going into detail on the experimental findings, we give a short note on the effect of noise/outliers on SVM performance. Consider the following imaginary example: say that a majority class female is rating a lot of action films and thrillers. We can consider this as an outlier, since most of the females are rating romantic or drama films. This specific female has far more in common with the minority class males, who also rate action films and thrillers. Outliers are therefore instances showing odd behaviour; noise are wrongfully labelled instances. The effect of noise/outliers on SVM performance can be severe. Indeed, many of the instances in the dataset have a support value (dual variable α_i) of 0. Instances contributing to the model (a non-zero support value) are examples close to the borderline (Suykens et al 2002) and noise/outliers (as can be derived from the KKT dual-complementarity conditions21).

21 If α_i = 0, then y_i(w^T x_i + b) ≥ 1. For noise/outliers the term y_i(w^T x_i + b) is negative, hence α_i ≠ 0.


With respect to noise/outliers, the support value can reach a maximal value of C, the chosen regularization parameter in the SVM formulation. It should be clear that lowering the C-value leads to models that are less sensitive to noise/outliers.

The experimental outcomes of the various undersampling processes are shown in Table 4 for the same four previously selected datasets. We refer to Appendix B, Table 12 for results on the entire data repository. When removing only a limited amount of majority class instances (β_u = β_u2 = 1/4), we observed that in 12 out of 16 datasets the "Farthest Knn" method outperforms the "Closest Knn" technique. The four remaining datasets show comparable performances.22 This finding shows that the "Farthest" method is very suitable in removing majority class noise/outliers, and empirically shows their performance degrading effect. With higher undersampling rates (β_u = 1), the "Closest Knn" method achieves higher performances than the "Farthest Knn" technique in general, though the results are less clear. We observed that only 9 out of 16 datasets indicate "Closest" as a clear winner, with 3 tie situations and 4 situations where "Farthest" shows the best results. Intuitively, we would expect to see the "Closest" method perform better, since we are keeping the most informative instances (examples close to the minority class). However, this technique will also emphasize noise/outliers, and this is the main reason why the results are less significant. One can also see this from the fact that the RUS method indicates far better results in comparison to the aforementioned techniques when β_u = 1, an observation that is valid for all datasets. The RUS method is efficient in realizing noise/outlier reduction effects, since each SVM is trained with only a fraction of these types of instances. It is remarkable to see that in 9 out of 16 datasets, the RUS method with β_u = 1 outperforms the baseline model (β_u = 0). In 4 datasets we observed equal performance, and 3 losses with respect to the baseline. RUS shows two attractive features: majority class noise/outlier removal and its ability to put more emphasis on the minority class instances by undersampling the majority class. A general downside of the approach is the information loss, and this effect causes the 3 losses to the baseline. Usually, though, the attractive features of RUS dominate the latter effect, and this indicates that it is definitely not necessary to include all majority class training instances in constructing the predictive model. Apparently there is a high level of redundancy, which can be exploited to construct efficient hypotheses.

In the next paragraph, we compare the cluster-based undersampling (CBU) approach to the random undersampling (RUS) technique. Before proceeding, we should note that Table 12 does not show results on the LST, CRF and Bank datasets, due to the fact that the projected unigraph was too large to fit in memory (larger than 16 GB). A limited amount of features (top nodes) does seem to be active for a relatively large number of instances (bottom nodes). All these instances will be connected in the unigraph projection, resulting in relatively high degrees for the bottom nodes. A possible solution to circumvent this problem might be to cut the weakest edges in the projected unigraph.

If we apply a limited amount of undersampling (β_u = β_u2 = 1/4), the CBU technique outperforms RUS in 8 out of 13 datasets (4 losses and 1 tie with respect to RUS). On the highly imbalanced datasets (p = 1 or p = []), the CBU method wins

22 A tie occurs in the situation where the absolute difference in AUC is smaller than or equal to 0.5.


in 8 out of 8 cases. We can therefore conclude CBU to outperform RUS with low undersampling rates. This indeed shows that handling the within-class imbalance can be beneficial in this environment. If we make a comparison at the highest undersampling rate, β_u = β_u5 = 1, the results are completely opposite: CBU wins in only 1 case, with 9 losses with respect to RUS and 3 ties. On the highly imbalanced datasets, CBU outperforms RUS for 1 dataset (with 4 losses and 3 ties). It is clear that RUS is the preferred technique in case of high undersampling rates. The reason is that CBU emphasizes the small disjuncts and basically fails to adequately pick up on the common, larger communities. Furthermore, these small disjuncts might actually correspond with majority class noise/outliers, and CBU is focusing on these types of instances. On the overall level, where we consider all undersampling rates (except β_u = 0), CBU wins in 5 out of 13 datasets, RUS wins in 4 cases, and there are 4 tie situations (yet RUS did outperform CBU in all 4 cases). Both methods seem to be competitive to one another across a wide range of imbalance levels. If we pay particular attention to the highly imbalanced sets, CBU wins in 5 cases and there are 3 tie situations on the remaining sets (yet RUS did outperform CBU in all 3 cases). CBU seems to be the method of choice (with respect to RUS) when the datasets show high imbalance levels.

Table 4 Undersampling experiments. Results show the average tenfold AUC test set performance with respect to increasing β_u-values (β_u = [β_u1, β_u2, β_u3, β_u4, β_u5] = [0, 1/4, 1/2, 3/4, 1]). Optimal performances (boldface) are highlighted for each imbalance ratio. Additionally, standard deviations are shown between brackets. Cl K represents the "Closest Knn" technique; CL T represents the "Closest tot sim" technique (similar for Far K and Far T, see Section 3.2). Parameter settings are given in Section 4.2.

Mov Th (p = [])
        β_u1         β_u2         β_u3         β_u4         β_u5
RUS     79.77 (5.3)  80.32 (5.8)  81.57 (5.5)  81.86 (6.6)  81.26 (6.2)
Cl K    79.77 (5.3)  79.25 (4.5)  78.07 (5.0)  76.25 (6.5)  62.46 (8.5)
CL T    79.77 (5.3)  78.40 (4.4)  72.41 (3.5)  64.66 (4.5)  60.37 (7.3)
Far K   79.77 (5.3)  84.54 (5.0)  83.64 (6.4)  80.02 (7.3)  56.82 (10.3)
Far T   79.77 (5.3)  85.03 (5.7)  82.68 (6.8)  75.61 (9.2)  56.77 (10.9)
CBU     80.11 (5.8)  81.17 (6.0)  81.08 (6.5)  84.17 (5.1)  80.96 (6.9)

Yahoo G (p = 25)
        β_u1         β_u2         β_u3         β_u4         β_u5
RUS     78.82 (1.4)  78.91 (1.6)  78.97 (1.6)  78.61 (1.6)  77.82 (2.1)
Cl K    78.82 (1.4)  77.26 (1.5)  72.52 (1.5)  67.86 (2.0)  65.07 (2.7)
CL T    78.82 (1.4)  76.83 (1.0)  71.99 (1.8)  67.15 (2.3)  61.10 (2.7)
Far K   78.82 (1.4)  78.26 (2.2)  74.69 (2.7)  67.22 (2.1)  60.72 (2.3)
Far T   78.82 (1.4)  77.68 (2.6)  72.44 (3.0)  64.94 (2.4)  59.60 (2.0)
CBU     75.25 (3.2)  75.22 (2.4)  74.69 (2.3)  73.07 (2.4)  70.69 (2.4)

TaFeng (p = 25)
        β_u1         β_u2         β_u3         β_u4         β_u5
RUS     66.94 (1.3)  67.44 (1.3)  68.10 (1.4)  68.27 (1.4)  66.13 (1.2)
Cl K    66.94 (1.3)  66.13 (1.4)  63.39 (1.2)  59.83 (1.3)  56.94 (0.7)
CL T    66.94 (1.3)  66.38 (1.5)  62.89 (1.6)  57.46 (1.3)  54.56 (1.3)
Far K   66.94 (1.3)  68.06 (1.4)  66.43 (1.6)  64.46 (1.5)  63.35 (1.3)
Far T   66.94 (1.3)  64.31 (1.1)  62.69 (1.0)  61.27 (1.1)  59.03 (1.0)
CBU     64.81 (1.2)  64.15 (1.1)  64.13 (1.2)  63.88 (0.8)  63.46 (0.8)

Book (p = 25)
        β_u1         β_u2         β_u3         β_u4         β_u5
RUS     60.08 (0.7)  60.13 (0.6)  60.40 (0.8)  60.33 (0.8)  63.28 (0.8)
Cl K    60.08 (0.7)  59.96 (0.7)  60.13 (0.8)  59.96 (1.0)  59.28 (0.7)
CL T    60.08 (0.7)  59.96 (0.7)  60.13 (0.8)  60.29 (0.4)  54.50 (0.9)
Far K   60.08 (0.7)  63.29 (1.0)  64.19 (0.8)  57.30 (1.1)  55.66 (1.1)
Far T   60.08 (0.7)  62.14 (0.5)  58.27 (0.6)  56.37 (1.0)  55.66 (1.1)
CBU     54.82 (0.9)  54.67 (0.9)  54.71 (0.9)  54.66 (1.0)  54.78 (0.9)

4.5 Boosting variants

In this section, the plain AdaBoost (AB) method is compared with the cost-sensitive AdaCost (AC) algorithm, the EasyEnsemble (EE) technique and the baseline (BL) method (where we train a single SVM on the imbalanced training data). AB, AC and EE have an underlying boosting process in common. The AUC-values are stored during each boosting iteration, which allows us to show the performance graphically as a function of the number of iterations t. In EasyEnsemble (EE), we combine the weak learners of each subset by summing their individual contributions. For example, when t = 2 boosting iterations are performed, we evaluate the performance of the combined learner

∑_{s=1}^{S} ∑_{t=1}^{2} α_st h_st(x)

Figures 1, 2, 3 and 4 show the average tenfold AUC-performance (with µ = 100%) on the test data. The left-side figures show the AB, AC (with varying R-values) and EE (with varying S-values) performance with respect to the number of boosting iterations. The C-value is tuned according to highest validation set AUC-performance (over all possible boosting iterations). The right-side figures investigate the influence of the C-parameter for AB and EE (with S = 15), and allow us to gain insight on the effect of weak and strong learners on the generalization performance. Results on the remaining datasets are given in Appendix C, Figures 6-17. Note that we only indicate results with weight percentage µ = 100% (use all instances in the training process). Previous experiments (with µ = 75%) showed inferior results. This suggests that it is indeed important to include all minority instances during the training of each weak learner, an observation also made by Liu et al (2009).

When investigating the influence of the regularization parameter C in AdaBoost, we can graphically verify that it is a "weakness" indicator, in the sense that lower C-values correspond to weaker learners. While these low C-value learners have a low performance in the first rounds of boosting, the possible boost in performance is usually much higher compared to higher C-value inducers. Strong learners can already make a suitable distinction between minority and majority class behaviours in the first boosting round. The boosting process will emphasize the hard-to-learn instances, among which are noise/outliers. Therefore, too strong learners are not suitable to be used in a boosting process, as already noted by Wickramaratna et al (2001). In Figures 2(b) and 3(b), we observe that the lowest C-values (C = 10⁻⁷, 10⁻⁵) can outperform higher C-values (C = 10⁻³, 10⁻¹). In many cases the AB-process outperforms the baseline (BL); see also the results in Table 5. A combination of several weak learners can often achieve better generalization than a single strong learner.

Inspired by the success of the RUS method with β_u = 1 (for reasons indicated in Section 4.4), the EE-technique tries to overcome an obvious downside of this approach: we are removing a lot of majority class instances that can be very informative. Hence, we try to make up for this loss by combining several of these balanced training sets, in the hope that the majority class is more adequately explored. This results in a very effective learner that seems to outperform the AB, AC and BL methods in the majority of cases, as can be seen from these figures. It is remarkable that we don't need to use that many subsets S to achieve high performances: increasing the S-value from 5 to 15 doesn't seem to cause major increases in AUC (see also Table 5). The influence of the C-parameter is similar as described in the previous paragraph.
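The EE construction described above can be sketched as follows (our own sketch; `fit_boost` is a hypothetical stand-in for the AdaBoost routine of the previous paragraphs, and the toy learner below exists only to make the example runnable):

```python
import random

def easy_ensemble_fit(majority, minority, S, T, fit_boost, seed=0):
    """EasyEnsemble sketch: draw S independent balanced majority
    subsets (each |X_min| large), run a boosting process on each,
    and pool every weak learner. `fit_boost` must return a list of
    (alpha, h) pairs for one boosted subset."""
    rng = random.Random(seed)
    return [fit_boost(rng.sample(majority, len(minority)), minority, T)
            for _ in range(S)]

def easy_ensemble_predict(ensembles, x):
    # H(x) = sum_{s=1}^{S} sum_{t=1}^{T} alpha_st * h_st(x)
    return sum(alpha * h(x) for ens in ensembles for alpha, h in ens)

# toy stand-in boosting routine: T constant learners with alpha = 1
toy_boost = lambda maj, mini, T: [(1.0, lambda x: 1.0)] * T
ens = easy_ensemble_fit(list(range(1000)), list(range(50)), S=5, T=3,
                        fit_boost=toy_boost)
print(easy_ensemble_predict(ens, x=None))  # -> 15.0 (5 subsets x 3 learners)
```

Since every subset is only twice the minority class size and the S boosting runs are independent, the whole procedure parallelizes trivially, which is what makes the method fast in practice.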

[Figure 1: two line plots (panels (a) and (b)) of average tenfold test AUC versus the number of boosting iterations T.]

Fig. 1 Mov G (p = 25) dataset results, showing average tenfold AUC-performance on test data (with µ = 100%) for (a) AB, AC and EE, with C chosen according to highest validation set AUC-performance (over all possible T); (b) AB and EE (S = 15) with varying C-levels


[Figure 2: two line plots (panels (a) and (b)) of average tenfold test AUC versus the number of boosting iterations T, same layout as Fig. 1.]

Fig. 2 Book (p = 25) dataset

[Figure 3: two line plots (panels (a) and (b)) of average tenfold test AUC versus the number of boosting iterations T, same layout as Fig. 1.]

Fig. 3 TaFeng (p = 25) dataset

[Figure 4: two line plots (panels (a) and (b)) of average tenfold test AUC versus the number of boosting iterations T, same layout as Fig. 1.]

Fig. 4 Bank (p = []) dataset


4.6 Final comparison

4.6.1 Performance-wise comparison

In this section, we wish to determine which of the aforementioned methods shows superior performance in terms of AUC on test data. The experimental set-up is described in Section 4.2. Note that we exclude β = 0 and β_u = 0 in determining the performance of the oversampling, respectively undersampling, techniques, to be able to compare them with the baseline (BL) approach.23 The results for AB, AC and EE are shown for µ = 100%. The number of boosting iterations t ∈ [0, T] is also considered as a tunable parameter. Table 5 shows the final comparison between all methods. Additionally, we added an average rank column, indicating the mean rank of each algorithm across all datasets. The rank of an algorithm for a specific dataset corresponds to its position in the sorted AUC-performance spectrum, where the best performing algorithm gets the rank of 1, the second best rank 2, etc. In case two or more algorithms perform equally well, average ranks are assigned (if two algorithms have ranks of 3 and 4 respectively and show equal performance, then both methods get assigned a rank of 3.5). Note that the LST dataset is excluded in the calculation of average ranks, since we consider this as a dataset not being affected by imbalance and showing equal performances.
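The tie-aware ranking scheme just described can be sketched as follows (our own helper; averaging these per-dataset ranks over all datasets then yields the average-rank column of Table 5):

```python
def ranks(aucs):
    """Rank algorithms on one dataset: the best AUC gets rank 1; tied
    algorithms share the average of the rank positions they span."""
    order = sorted(range(len(aucs)), key=lambda i: -aucs[i])
    out = [0.0] * len(aucs)
    i = 0
    while i < len(order):
        j = i
        # extend the tie group while scores are identical
        while j + 1 < len(order) and aucs[order[j + 1]] == aucs[order[i]]:
            j += 1
        for idx in order[i:j + 1]:
            out[idx] = (i + j) / 2 + 1  # average of positions i..j (1-based)
        i = j + 1
    return out

# two equally performing methods at positions 3 and 4 both get 3.5
print(ranks([0.86, 0.84, 0.80, 0.80]))  # -> [1.0, 2.0, 3.5, 3.5]
```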

As a side note, we noticed that increasing the number of minority training set instances can significantly improve performances. We can see this clearly by comparing the p = 25 outcomes with the results of p = 1 on the same dataset. Junque de Fortuny et al (2014a) already showed empirically that larger behaviour datasets (in terms of the number of instances or features) lead to higher AUC-performances. This situation is confirmed in our experiments, though we want to add that this is only true when the data size is increased in a balanced fashion.

The results in Table 5 confirm our previous analysis. Each of the oversampling, undersampling (except Cl Knn) and boosting variants seems to improve upon the BL performance. Sobhani et al (2015) note that the direct application of ensemble techniques on imbalanced data does not circumvent the imbalanced learning issue. In our experiments on behaviour data, we do find a simple boosting procedure applied to the imbalanced data to improve upon the predictive performance. The oversampling techniques OSR, SMOTE and ADASYN show similar outcomes and are competitive methods, seeming to outperform the undersampling techniques and the plain boosting (AB) and cost-sensitive learning (AC) approaches. When dealing with traditional data, the performances of oversampling techniques are generally worse than those of undersampling approaches (Drummond and Holte 2003). It is interesting to observe that for these large and sparse behaviour datasets the opposite is true. This is heavily related to the intrinsic properties of the data under consideration. Undersampling (or feature selection, for that matter) is not recommended because, as we have noted in Section 2.1, each feature provides a small additional contribution to the predictive performance. With respect to AC, the results indicate that R = R_L is a better choice compared to more random cost ratios R = 2, 8. The EE-technique has the

23 The BL technique trains single SVMs on the imbalanced training data.


lowest average rank and is our best performing method. The combination of several balanced subsets with filtered noise/outlier contributions allows one to explore the majority class space efficiently. To strengthen our previous statements, we will in the next paragraphs conduct statistical significance tests in accordance with the procedure outlined in Demsar (2006). The latter paper presents a framework to perform statistical comparisons of classifiers over multiple datasets.

The first null-hypothesis we try to reject postulates that each of the algorithms performs equally well, or equivalently, that their average ranks R_j (see Table 5) are indifferent. Under the null-hypothesis, the Friedman statistic (Friedman 1937)

χ²_F = [12N / (k(k+1))] [ ∑_{j=1}^{k} R_j² − k(k+1)²/4 ]    (6)

with N the number of datasets (N = 15) and k the number of algorithms (k = 13), can be used to compute the statistic (Iman and Davenport 1980)

F_F = (N − 1) χ²_F / (N(k − 1) − χ²_F)    (7)

The latter is distributed according to the F-distribution with k − 1 and (k − 1)(N − 1) degrees of freedom. Evaluating this measure with the average ranks shown in Table 5 results in F_F = 22.98, which is a lot higher than the critical value of 1.81 at the α = 0.05 significance level. We can therefore reject the null-hypothesis and proceed with post-hoc tests.
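Equations (6) and (7) translate directly into code; the small sketch below (our own helper functions) makes the computation explicit:

```python
def friedman_chi2(avg_ranks, N):
    """Friedman statistic (Eq. 6) from the average ranks R_j of the
    k algorithms over N datasets."""
    k = len(avg_ranks)
    return 12 * N / (k * (k + 1)) * (sum(r * r for r in avg_ranks)
                                     - k * (k + 1) ** 2 / 4)

def iman_davenport_F(chi2, N, k):
    """Iman-Davenport statistic (Eq. 7), F-distributed with k-1 and
    (k-1)(N-1) degrees of freedom under the null-hypothesis."""
    return (N - 1) * chi2 / (N * (k - 1) - chi2)

# sanity check: identical average ranks everywhere -> no evidence
# of any performance difference
print(friedman_chi2([2.0, 2.0, 2.0], N=10))  # -> 0.0
```

Feeding the k = 13 average ranks of Table 5 with N = 15 into these two functions reproduces the F_F value used above.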

The Nemenyi test (Nemenyi 1963) allows one to compare each classifier to one another, and adjusts the critical value to compensate for making k(k − 1)/2 comparisons.24 "The performance of two classifiers is significantly different if the corresponding average ranks differ by at least the critical difference (CD)" (Demsar 2006). We refer to the aforementioned paper to calculate the CD. The result of these comparisons at the α = 0.05 significance level is shown in Table 6. This test is already quite revealing, in the sense that, for instance, the oversampling methods (OSR, SMOTE and ADASYN), AC (R = R_L) and EE-techniques are significantly different from the BL. Yet it does not indicate that undersampling or applying plain boosting (AB) results in significantly different performances compared to the BL. The Nemenyi test is still conservative, and we therefore compare all classifiers to a single control classifier in the next paragraphs. Note that we only make k − 1 comparisons in that situation, resulting in stronger results compared to the Nemenyi test.

When comparing a single classifier i with a control classifier c, the null-hypothesis states that the two classifiers i and c perform equally well. Under this hypothesis, the test statistic (based on the average ranks R_i shown in Table 5)

z = (R_i − R_c) / √(k(k+1) / (6N))    (8)

is distributed according to a standard normal distribution. This z-value is used to find the corresponding probability (p-value) p from the cumulative standard normal distribution.

24 The larger the amount of comparisons, the higher the proportion of null-hypotheses that are wrongfully rejected due to random chance.

Imbalanced classification in sparse and large behaviour datasets 31

Table 5 Final comparison of all methods in terms of AUC test set performance; see Section 4.2 for details on the experimental set-up. β ≠ 0 and βu ≠ 0 for the oversampling respectively undersampling techniques; µ = 100 for each of the boosting variants. Standard deviations are given between brackets and the performance difference with the BL is shown in square brackets.

             Mov G(p = 1)         Mov G(p = 2.5)       Mov Th(p = [])       Yahoo A(p = 1)

BL           71.6 (2.62) [0]      81.41 (1.32) [0]     79.77 (5.33) [0]     55.92 (2.97) [0]
OSR          75.35 (2.27) [3.8]   83.76 (2.09) [2.3]   85.13 (6.1) [5.4]    60.05 (2.71) [4.1]
SMOTE        76.16 (2.27) [4.6]   83.7 (2.1) [2.3]     85.67 (4.98) [5.9]   60.1 (3) [4.2]
ADASYN       76.07 (2.26) [4.5]   83.63 (2.04) [2.2]   85.65 (5.6) [5.9]    59.9 (2.99) [4]
RUS          72.88 (2.73) [1.3]   81.52 (2.15) [0.1]   82.91 (7.19) [3.1]   57.04 (1.77) [1.1]
Cl Knn       71.43 (1.36) [-0.2]  80.88 (1.19) [-0.5]  78.87 (4.71) [-0.9]  55.78 (2.71) [-0.1]
Far Knn      71.9 (2.95) [0.3]    80.9 (1.48) [-0.5]   84.07 (4.64) [4.3]   57.2 (1.33) [1.3]
CBU          74.17 (2.36) [2.6]   81.51 (1.04) [0.1]   82.76 (7.22) [3]     58.77 (3.43) [2.8]
AB           71.65 (1.73) [0.1]   84.52 (1.89) [3.1]   82.43 (5.18) [2.7]   58.35 (2.62) [2.4]
AC(R = 28)   71.61 (2.46) [0]     83.46 (1.82) [2]     83.27 (5.6) [3.5]    57.72 (2.47) [1.8]
AC(R = RL)   74.65 (2.7) [3.1]    83.35 (2.09) [1.9]   85.41 (4.49) [5.6]   59.47 (2.33) [3.5]
EE(S = 10)   76.04 (2.66) [4.4]   85.05 (1.85) [3.6]   86.1 (5.78) [6.3]    59.66 (3.13) [3.7]
EE(S = 15)   76.12 (2.88) [4.5]   85.14 (1.86) [3.7]   86.42 (5.86) [6.7]   59.76 (2.93) [3.8]

             Yahoo A(p = 2.5)     Yahoo G(p = 1)       Yahoo G(p = 2.5)     TaFeng(p = 1)

BL           61.68 (2.42) [0]     66.84 (3.66) [0]     78.82 (1.39) [0]     55.75 (1.6) [0]
OSR          64.59 (3.12) [2.9]   73.08 (2.96) [6.2]   78.52 (2.01) [-0.3]  61.21 (2.24) [5.5]
SMOTE        65.56 (3.33) [3.9]   73.11 (3.12) [6.3]   79.01 (1.21) [0.2]   61.72 (1.81) [6]
ADASYN       65.13 (3.38) [3.4]   73.22 (3.17) [6.4]   79.74 (1.68) [0.9]   61.68 (1.86) [5.9]
RUS          64.11 (2.8) [2.4]    70.65 (3.39) [3.8]   78.91 (1.55) [0.1]   59.25 (2.18) [3.5]
Cl Knn       61.14 (2.13) [-0.5]  66.34 (3.54) [-0.5]  77.26 (1.46) [-1.6]  55.77 (1.28) [0]
Far Knn      63.96 (3.03) [2.3]   66.97 (3.54) [0.1]   78.26 (2.2) [-0.6]   59.98 (1.26) [4.2]
CBU          62.27 (1.79) [0.6]   71.27 (2.89) [4.4]   75.22 (2.42) [-3.6]  58.4 (1.57) [2.6]
AB           63.88 (2.67) [2.2]   68.9 (2.03) [2.1]    79.01 (1.66) [0.2]   56.21 (1.79) [0.5]
AC(R = 28)   64.32 (3.56) [2.6]   68.89 (3.11) [2]     78.99 (1.89) [0.2]   56.33 (1.83) [0.6]
AC(R = RL)   64.31 (3.03) [2.6]   73.13 (2.8) [6.3]    78.41 (2) [-0.4]     61.6 (2.26) [5.9]
EE(S = 10)   66.51 (3.24) [4.8]   72.61 (3.15) [5.8]   80.52 (1.6) [1.7]    61.2 (1.82) [5.4]
EE(S = 15)   66.36 (3.18) [4.7]   73.48 (2.32) [6.6]   80.54 (1.56) [1.7]   61.13 (1.83) [5.4]

             TaFeng(p = 2.5)      Book(p = 1)          Book(p = 2.5)        LST(p = 1)

BL           66.94 (1.34) [0]     52.6 (1.29) [0]      60.08 (0.71) [0]     99.99 (0.01) [0]
OSR          68.77 (1.23) [1.8]   55.87 (1.42) [3.3]   64.62 (0.57) [4.5]   99.99 (0.01) [0]
SMOTE        68.47 (1.5) [1.5]    55.07 (0.88) [2.5]   62.96 (0.82) [2.9]   99.99 (0.01) [0]
ADASYN       68.48 (1.47) [1.5]   55.04 (0.91) [2.4]   63.02 (0.57) [2.9]   99.99 (0.01) [0]
RUS          68.28 (1.39) [1.3]   54.26 (0.92) [1.7]   63.28 (0.8) [3.2]    99.98 (0.01) [0]
Cl Knn       66.13 (1.43) [-0.8]  52.69 (1.3) [0.1]    60.02 (0.79) [-0.1]  99.99 (0.01) [0]
Far Knn      68.06 (1.41) [1.1]   56.25 (1.52) [3.7]   64.15 (1.12) [4.1]   99.98 (0.01) [0]
CBU          63.84 (1.07) [-3.1]  53.75 (1.01) [1.2]   54.68 (0.88) [-5.4]  []
AB           67.65 (1.55) [0.7]   54.27 (1.95) [1.7]   65 (0.67) [4.9]      99.99 (0.01) [0]
AC(R = 28)   69.31 (1.23) [2.4]   53.72 (1) [1.1]      61.24 (0.8) [1.2]    99.98 (0.01) [0]
AC(R = RL)   67.15 (1.51) [0.2]   55.73 (1.22) [3.1]   64.6 (0.64) [4.5]    99.99 (0.01) [0]
EE(S = 10)   70.3 (1.35) [3.4]    55.09 (1.29) [2.5]   65.37 (0.61) [5.3]   99.98 (0.01) [0]
EE(S = 15)   70.4 (1.3) [3.5]     55.35 (1.26) [2.8]   65.4 (0.51) [5.3]    99.98 (0.01) [0]

32 Jellis Vanhoeyveld David Martens

Table 5 Continued. Additionally, an average rank column is added, showing the mean rank of each algorithm across all datasets. Note that we excluded the LST dataset for this purpose.

             Adver(p = [])        Adver(p = 1)         CRF(p = [])           Bank(p = [])

BL           96.61 (1.82) [0]     90.93 (3.02) [0]     64.06 (16.43) [0]     66.82 (0.88) [0]
OSR          96.93 (1.91) [0.3]   93.3 (2.02) [2.4]    80.74 (12.93) [16.7]  71.39 (0.79) [4.6]
SMOTE        97.05 (1.66) [0.4]   93.35 (2.01) [2.4]   78.7 (16.56) [14.6]   []
ADASYN       96.91 (1.95) [0.3]   93.46 (2.21) [2.5]   78.87 (16.71) [14.8]  []
RUS          96.81 (1.87) [0.2]   92.38 (2.51) [1.5]   83.98 (5.99) [19.9]   69.41 (1.19) [2.6]
Cl Knn       96.4 (1.48) [-0.2]   89.73 (3.42) [-1.2]  76.63 (16.19) [12.6]  66.17 (0.72) [-0.6]
Far Knn      95.77 (1.81) [-0.8]  93.88 (1.78) [3]     83.75 (13.11) [19.7]  66.95 (0.56) [0.1]
CBU          97.15 (1.88) [0.5]   94.18 (2.3) [3.3]    []                    []
AB           97.34 (2.18) [0.7]   91.39 (3.23) [0.5]   77.62 (15.15) [13.6]  66.82 (0.88) [0]
AC(R = 28)   97.44 (1.93) [0.8]   91 (3.35) [0.1]      68.31 (14.93) [4.2]   67.67 (0.71) [0.9]
AC(R = RL)   97.46 (1.71) [0.8]   93.51 (2.17) [2.6]   85.08 (9.77) [21]     70.7 (0.8) [3.9]
EE(S = 10)   97.64 (1.35) [1]     92.97 (2.75) [2]     86.18 (10.17) [22.1]  71.46 (0.81) [4.6]
EE(S = 15)   97.63 (1.35) [1]     93.3 (2.14) [2.4]    86.35 (9.99) [22.3]   71.54 (0.76) [4.7]

             Average Rank

BL           11.600
OSR          5.000
SMOTE        4.533
ADASYN       4.800
RUS          8.167
Cl Knn       12.467
Far Knn      8.133
CBU          8.567
AB           8.267
AC(R = 28)   8.467
AC(R = RL)   5.400
EE(S = 10)   3.267
EE(S = 15)   2.333

Table 6 Nemenyi test outcomes at the α = 0.05 significance level. A value of 1 indicates the null-hypothesis is rejected and thus finds the two algorithms significantly different (a value of 0 means we accept the null-hypothesis). Methods are listed in the same order as Table 5 and are more concisely represented.

      BL  RO  SM  AD  RU  Cl  Fa  CBU AB  AC1 AC2 EE1 EE2

BL    0   1   1   1   0   0   0   0   0   0   1   1   1
RO    1   0   0   0   0   1   0   0   0   0   0   0   0
SM    1   0   0   0   0   1   0   0   0   0   0   0   0
AD    1   0   0   0   0   1   0   0   0   0   0   0   0
RU    0   0   0   0   0   0   0   0   0   0   0   1   1
Cl    0   1   1   1   0   0   0   0   0   0   1   1   1
Fa    0   0   0   0   0   0   0   0   0   0   0   1   1
CBU   0   0   0   0   0   0   0   0   0   0   0   1   1
AB    0   0   0   0   0   0   0   0   0   0   0   1   1
AC1   0   0   0   0   0   0   0   0   0   0   0   1   1
AC2   1   0   0   0   0   1   0   0   0   0   0   0   0
EE1   1   0   0   0   1   1   1   1   1   1   0   0   0
EE2   1   0   0   0   1   1   1   1   1   1   0   0   0


distribution function: p = 2 × min(Φ(z), 1 − Φ(z)). Holm's method (Holm 1979) compares k − 1 classifiers to a single control classifier and sorts the corresponding p-values in ascending order, so that p1 ≤ p2 ≤ . . . ≤ pk−1. Each pi is subsequently compared to its associated confidence level25 αcomp = α/(k − i). Holm starts with performing the check p1 < α/(k − 1) and rejects the null-hypothesis if the test passes. It then proceeds with a similar check for i = 2 and continues until a certain null-hypothesis cannot be rejected. The remaining hypotheses are retained.
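The z statistic of Eq. (8) and Holm's step-down procedure can be sketched in a few lines (an illustrative implementation assuming only the average ranks; `holm_test` is our own name, not a library routine):

```python
import math

def holm_test(avg_ranks, control, N, alpha=0.05):
    """Holm's step-down test of every method against `control`,
    using the rank-based z statistic of Eq. (8)."""
    k = len(avg_ranks)
    se = math.sqrt(k * (k + 1) / (6 * N))
    # two-sided p-value for each comparison with the control classifier
    stats = sorted(
        (math.erfc(abs((R - avg_ranks[control]) / se) / math.sqrt(2)), name)
        for name, R in avg_ranks.items() if name != control)
    rejected = []
    for i, (p, name) in enumerate(stats, start=1):
        if p >= alpha / (k - i):   # first non-rejection stops the procedure
            break
        rejected.append(name)
    return rejected
```

With BL as control and the ranks of Table 5 this reproduces the significance column of Table 7: the six methods with the smallest p-values are rejected, and Far Knn is the first comparison to fail.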

Table 7 shows the result of the Holm test at the α = 0.05 significance level with the BL algorithm as control classifier. The table lists each method according to sorted p-values and shows the z-, p- and αcomp-values. The significance column indicates whether the proposed method is significantly different from the BL and coincidentally matches with the result of the Nemenyi test. The p-values of the (Far Knn, RUS, CBU) undersampling methods, AB and AC (R = 28) are only slightly higher than the level required to reject the null-hypothesis. We included a critical significance level αcrit, corresponding to the lowest possible significance level upon which the method would be considered as significantly different from the BL (if α = αcrit then p = αcomp). In the case of AB, for example, if we were to choose a significance level α = 0.0764 (a confidence level of 92.36%), the associated p-value would be smaller than αcomp and we would proceed to conclude26 that AB performs significantly different from the BL. To summarize: the null-hypothesis of equal performance between the BL and an alternative method proposed in Section 3 (except Cl Knn) is rejected with a confidence level of more than 91% according to Holm's test. Hence the performance of the BL approach is significantly worse than that of each of the proposed over- and undersampling methods (except Cl Knn), cost-sensitive learning techniques (AC) and boosting variants (AB, EE).

Similar to the previous paragraph, Table 8 presents the results of Holm's test at the α = 0.05 significance level with EE(S = 15) as control classifier. The results coincidentally correspond with the result of the Nemenyi test and indicate that the oversampling (OSR, SMOTE and ADASYN), AC(R = RL) and EE(S = 10) methods are not statistically different from the EE(S = 15) technique at the 0.05 significance level. However, if we were to adapt the significance level to approximately α = 0.25 (a 75% confidence level), the results would conclude significance (except for EE(S = 10)). To summarize: the null-hypothesis of equal performance between EE(S = 15) and an alternative method proposed in Section 3 (except EE(S = 10)) is rejected with a confidence level of more than 75% according to Holm's test. Hence the performance of the oversampling, undersampling, AB and AC approaches is significantly worse than that of EE(S = 15). We can therefore safely conclude EE to be the preferred method in terms of AUC performance when dealing with imbalanced behaviour data.

25 αcomp adjusts the value of α to compensate for multiple comparisons.
26 This is not entirely true, since Holm's method requires Far Knn to pass the test first before proceeding, which only occurs at the 0.088662 significance level.


Table 7 Holm test at the α = 0.05 significance level with BL reference. The table shows the z test statistic with associated p-value, and αcomp = α/(k − i), with k the number of algorithms (k = 13) and i corresponding to the position of the method in the sorted p-value vector. Since we already ranked the p-values in ascending order, i takes on the row number (e.g. OSR has i = 5). The column "significant" denotes if we can reject the null-hypothesis (significant: p < αcomp). αcrit corresponds with the smallest possible significance level where we would decide to reject the null-hypothesis (αcrit = α·p/αcomp).

             z          p          αcomp      significant  αcrit

EE(S = 15)   -6.51642   7.2E-11    0.004167   1            8.64E-10
EE(S = 10)   -5.86009   4.63E-09   0.004545   1            5.09E-08
SMOTE        -4.96936   6.72E-07   0.005      1            6.72E-06
ADASYN       -4.78183   1.74E-06   0.005556   1            1.56E-05
OSR          -4.64119   3.46E-06   0.00625    1            2.77E-05
AC(R = RL)   -4.35991   1.3E-05    0.007143   1            9.11E-05
Far Knn      -2.4378    0.014777   0.008333   0            0.088662
RUS          -2.41436   0.015763   0.01       0            0.078815
AB           -2.34404   0.019076   0.0125     0            0.076305
AC(R = 28)   -2.20339   0.027567   0.016667   0            0.082701
CBU          -2.13307   0.032919   0.025      0            0.065837
Cl Knn       0.609449   0.542227   0.05       0            0.542227

Table 8 Holm test at the α = 0.05 significance level with EE(S = 15) reference.

             z          p          αcomp      significant  αcrit

Cl Knn       7.12587    1.03E-12   0.004167   1            1.24E-11
BL           6.516421   7.2E-11    0.004545   1            7.92E-10
CBU          4.383348   1.17E-05   0.005      1            0.000117
AC(R = 28)   4.313027   1.61E-05   0.005556   1            0.000145
AB           4.172384   3.01E-05   0.00625    1            0.000241
RUS          4.102063   4.09E-05   0.007143   1            0.000287
Far Knn      4.078623   4.53E-05   0.008333   1            0.000272
AC(R = RL)   2.156513   0.031044   0.01       0            0.155218
OSR          1.875229   0.060761   0.0125     0            0.243045
ADASYN       1.734587   0.082814   0.016667   0            0.248442
SMOTE        1.547064   0.121848   0.025      0            0.243696
EE(S = 10)   0.65633    0.511612   0.05       0            0.511612

4.6.2 Timing-wise comparison

The performance of each of the proposed algorithms in terms of training times depends highly upon the chosen parameters. The β and βu parameters control the number of training instances and are an important contribution to the subsequent SVM learning times for each of the over- and undersampling techniques. The number of boosting iterations t ∈ [0, T] is also a major factor in each of the boosting variants proposed in Section 3.3. Furthermore, the chosen regularization parameter C highly influences the training times of the SVM inducer: higher C-values result in larger training times (Fan et al 2008). Not only do the chosen hyperparameters affect learning times; data characteristics such as the number of minority instances, the imbalance level, the number of features, the amount of overlapping etc. have a major effect as well.


In comparing each of the methods outlined in Section 3, we make use of a similar methodology as previously presented (see Section 4.6.1). For each of the ten folds, a suitable parameter combination is selected based on optimal validation set AUC-performance. These chosen parameters are applied in determining the timings of each method. The final results indicate the average timings across ten folds. Hence we show the required amount of time for each method when using the parameter combination resulting in the best performance. Note that this might for instance result in using β = 13 in case of OSR, or using only t = 5 boosting iterations in EE. Computational timings are shown in Table 9.

As one can observe from the aforementioned table, CBU is the slowest method; it relies on computationally intensive clustering procedures. The synthetic SMOTE and ADASYN oversampling approaches are also very time-consuming methods. They both rely on computationally expensive nearest neighbour computations, in conjunction with a synthetic sample generation procedure that requires a single pass over all minority training instances.

The RUS method shows, as expected, superior results in terms of computational timings. Other undersampling methods perform quite well, though they are slower than BL and OSR due to their inherent nearest neighbour computations.

The boosting variants seem to perform worse than the BL, OSR and each of the undersampling methods. Since EE was the best method in terms of AUC-performance (see Section 4.6.1), we focus on this learner in this paragraph. For the medium sized datasets (all but CRF and Bank), OSR outperforms EE(S = 15). Yet for the large datasets (CRF and Bank), the opposite is true. An SVM is known to have a quadratic computational complexity in the number of training instances (Suykens et al 2002; García and Lozano 2007). For this reason we consider OSR (or any other oversampling technique) as inappropriate for large datasets. Since with EE each subset has a size limited to twice the minority class training size, we expect increasing benefits of the EE-technique for larger and more imbalanced behaviour data. As Junque de Fortuny et al (2014a) have observed, larger behaviour data, in terms of the number of instances or features, also contributes to an increase in predictive power. Hence we expect to see this type of big behaviour data more often in real-life applications, where computational resources can become a major constraint. Because the EE-technique is easily parallelizable, by simultaneously performing the boosting process for each of the independent subsets, we added an extra row in Table 9: EE par corresponds with the time needed for EE(S = 15) divided by 15, indicating the time that would be required for EE if each subset were trained in parallel. We can see that EE par can outperform OSR in 8 out of 16 cases, precisely for the datasets that show a high imbalance level (e.g. p = 1 or p = []).

We conclude our comparison of the proposed methods in Figure 5, which shows the average ranks in terms of computational timings (see Table 9) and AUC performance for each of the available data sources. Regarding the ranks with respect to the latter, we added an extra row EE par, having the same AUC-performance as EE(S = 15) for each dataset in Table 5. Ranks are subsequently determined according to the procedure outlined in Section 4.6.1. Additionally, Figure 18 in Appendix D shows the result for the large and highly imbalanced CRF and Bank datasets. We expect the latter type of data to be of major interest for reasons already indicated.


Table 9 Computational timings (in seconds), averaged across ten folds, using the parameter combination resulting in the highest validation set AUC-performance. First and second best performances are emphasized in boldface (third and fourth best are underlined). EE par is the time required for EE(S = 15) divided by 15.

             Mov G(p = 1)   Mov G(p = 2.5)   Mov Th(p = [])   Yahoo A(p = 1)

BL           0.032889       0.056697         0.558563         0.026922
OSR          0.055043       0.062802         0.99009          0.044421
SMOTE        0.218821       0.937057         3.841482         0.057726
ADASYN       0.284688       1.802399         5.191265         0.087694
RUS          0.011431       0.025383         0.155224         0.007991
Cl Knn       0.046599       0.599846         0.989914         0.037182
Far Knn      0.039887       0.80072          0.683023         0.027788
CBU          1.034111       10.60173         6.822839         1.692477
AB           0.169792       0.841443         3.460246         0.139251
AC(R = 28)   0.471994       2.996585         1.086907         0.366555
AC(R = RL)   0.53376        1.179542         6.065177         0.209015
EE(S = 10)   0.117226       6.065145         1.17995          0.148973
EE(S = 15)   0.20474        7.173737         2.119991         0.180365

EE par       0.013649       0.478249         0.141333         0.012024

             Yahoo A(p = 2.5)   Yahoo G(p = 1)   Yahoo G(p = 2.5)   TaFeng(p = 1)

BL           0.092954           0.011915         0.044164           0.026728
OSR          0.027887           0.013241         0.047206           0.040919
SMOTE        1.062686           0.056153         0.883698           0.219553
ADASYN       2.050993           0.079073         1.733367           0.306618
RUS          0.048471           0.003234         0.033423           0.002916
Cl Knn       0.84391            0.025404         0.502515           0.092167
Far Knn      0.664124           0.026576         0.500206           0.080159
CBU          15.69442           1.287221         13.55035           2.467279
AB           0.445546           0.078777         0.169977           0.114619
AC(R = 28)   1.034044           0.321723         0.515953           0.926178
AC(R = RL)   0.706215           0.226741         0.112949           0.610233
EE(S = 10)   1.026577           0.100331         1.527146           0.058052
EE(S = 15)   1.607596           0.077483         2.472582           0.10538

EE par       0.107173           0.005166         0.164839           0.007025

             TaFeng(p = 2.5)   Book(p = 1)   Book(p = 2.5)   LST(p = 1)

BL           0.032033          0.080035      0.318093        0.652045
OSR          0.032414          0.132927      0.092757        0.87152
SMOTE        5.089283          3.409418      11.43444        4.987705
ADASYN       8.148419          3.689661      12.25441        6.840083
RUS          0.020457          0.022713      0.031972        0.432839
Cl Knn       1.713731          0.400873      3.711648        2.508374
Far Knn      1.539437          0.379086      3.988552        2.511037
CBU          26.42686          4.198663      46.31987        []
AB           0.713265          0.61719       1.238585        2.466151
AC(R = 28)   1.234647          1.666131      2.330635        1.451671
AC(R = RL)   0.279047          0.860346      0.197053        1.23763
EE(S = 10)   2.484502          2.145747      7.177484        0.524066
EE(S = 15)   3.363971          2.480066      11.21945        0.784111

EE par       0.224265          0.165338      0.747963        0.052274


Table 9 Continued. Additionally, an extra column indicating average ranks is included (lower ranks are preferred).

             Adver(p = [])   Adver(p = 1)   CRF(p = [])   Bank(p = [])

BL           0.010953        0.002796       0.725911      70.89334
OSR          0.012178        0.006166       3.685813      179.7481
SMOTE        0.123112        0.017764       5.633862      []
ADASYN       0.183767        0.021728       5.768669      []
RUS          0.012115        0.00204        0.147392      5.247441
Cl Knn       0.061324        0.005568       1.106755      73.73282
Far Knn      0.079078        0.007069       1.110379      97.59619
CBU          3.378235        3.236754       []            []
AB           0.069199        0.103518       1.153196      83.08618
AC(R = 28)   0.193092        0.068905       2.047434      71.70548
AC(R = RL)   0.107652        0.037963       1.387174      106.3466
EE(S = 10)   0.138485        0.085686       0.198656      24.95117
EE(S = 15)   0.185136        0.139121       0.285345      36.40107

EE par       0.012342        0.009275       0.019023      2.426738

             Average Rank [pos]

BL           2.94 [2]
OSR          4.19 [4]
SMOTE        9.59 [11]
ADASYN       10.91 [13]
RUS          1.38 [1]
Cl Knn       6.5 [5]
Far Knn      6.56 [6]
CBU          14 [14]
AB           8.06 [7]
AC(R = 28)   10.81 [12]
AC(R = RL)   9.25 [9]
EE(S = 10)   8.25 [8]
EE(S = 15)   9.56 [10]

EE par       3 [3]

Timings are a substantial concern with respect to these large datasets. As can be observed, the EE-technique thrives for these big and highly imbalanced data, even in its non-parallel form.



Fig 5 Average rank AUC versus average rank Time (see Table 9) across all datasets from Table 2 (excluding Flickr and Kdd). Regarding the AUC rank, an extra row EE par is added, having the same AUC-value as EE(S = 15). Ranks for each dataset are subsequently obtained via the procedure outlined in Section 4.6.1. Points occurring in the upper-right region are preferred.

5 The effect of the chosen base learner

In this short section we investigate the influence of the chosen base learner. More specifically, we ask ourselves whether some base inducers can more easily cope with the imbalanced learning issue, and in doing so we verify the conclusion that EE is a suitable technique to elevate upon the baseline performance. We rely on several popular classification methods for behaviour data: two regularization based techniques (SVM, LR) and two heuristic type of approaches (Naive Bayes (NB), Behavioural Similarity (BeSim)). The next paragraphs briefly introduce these methods. Note that source codes and dataset results are provided in our on-line repository.

LR constructs a linear model that maximizes the likelihood of the observed data by imposing a logistic model for the parametric probability model P(Y|X). Ng and Jordan (2002) provide a thorough introduction on the topic27 and note that in its plain form LR can suffer from overfitting, especially when the input data is very high dimensional. Therefore we resort to regularized logistic regression. We opt for L2-regularized LR, since L1-regularization corresponds with a natural way of imposing feature selection (Ng 2004). The LIBLINEAR toolbox (Fan et al 2008) is used to obtain these L2-regularized LR models.

27 For an accessible introduction, see the chapter on "Generative and Discriminative Classifiers: Naive Bayes and Logistic Regression" provided at the following link: http://www.cs.cmu.edu/~tom/NewChapters.html


NB (Ng and Jordan 2002) relies on the use of the Bayes rule, which expresses the posterior P(Y|X) as the product of the likelihood P(X|Y) and the prior P(Y), divided by the evidence P(X). The naivety of the approach originates from assuming conditional independence of the features. We impose a multivariate event model regarding the distribution P(X|Y). Junque de Fortuny et al (2014a) provide an efficient NB implementation for dealing with large and sparse binary behaviour data.

The BeSim approach conceptually corresponds to projecting the bigraph to a unigraph using the procedure outlined in Section 3.2. Next, the weighted-vote relational neighbour classifier (Macskassy and Provost 2007) is used to infer unknown labels/scores through a weighted probability estimation using the known labels of neighbouring nodes. A very fast and scalable implementation of this approach (the SW-transformation) is presented in detail in Stankova et al (2015).
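The projection-plus-voting idea can be illustrated in a few lines (a minimal dense sketch with illustrative names; the actual SW-transformation of Stankova et al (2015) produces the same scores on sparse data without ever materializing the instance-by-instance weight matrix):

```python
import numpy as np

def besim_scores(A_train, y_train, A_test):
    """Weighted-vote relational neighbour scores after projecting the
    bipartite instance-feature graph: two instances are linked with a
    weight equal to the number of behaviours (active features) they share."""
    W = A_test @ A_train.T        # co-occurrence weights (unigraph projection)
    votes = W @ y_train           # weighted vote of the known neighbour labels
    degree = W.sum(axis=1)
    return np.where(degree > 0, votes / np.maximum(degree, 1), 0.0)
```

A test instance that only shares behaviours with positively labelled training instances thus receives the maximal score of 1.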

In Table 10 we compare the performances of the baseline (BL) application of the methods with their EE-version (with S = 15, T = 20). It is quite striking to see the EE-version dominating the baseline version when looking at each method individually. This suggests that the imbalanced learning issue has an impact for each of the aforementioned techniques. Focussing on the regularization based approaches, we observe the baseline LR to be less influenced by the imbalance than the SVM method. Their EE-versions (EE SVM and EE LR) can boost the baseline performance in almost all cases and have the highest average ranks amongst all the proposed methods. As already emphasized, the regularization based approaches offer an added element of flexibility in the sense that the strength of the underlying base learner can be controlled.

The heuristic approaches NB and BeSim do not offer this flexibility. Furthermore, it is not directly possible to integrate instance weights during the training phase of the base learner. During the boosting process, we therefore sample from the distribution Dt to generate (unweighted) examples and train a BeSim/NB learner next. The results in Table 10 indicate BL NB to be the worst performing algorithm amongst all methods tried. Due to the conditional independence assumption, one could regard this as a weak learner, which is therefore suitable to be used in a boosting process. Indeed, the EE NB approach is able to significantly improve upon its performance. We found this to be the slowest EE-version28 (e.g. taking more than 24 hours to process a single fold of the Bank dataset, versus several minutes for the other versions). The average rank estimate of EE NB is rather pessimistic, since we assigned the lowest performance across all methods in case of missing results. BeSim can be considered as a stronger learner compared to NB. Its possible boost in performance is typically lower than that of NB. Having said this, we found EE BeSim to improve upon BL BeSim for all datasets (except Adver(p = 1)).

28 This statement only relates to the EE-version. In the baseline case, the timings of the NB-version are roughly of the same order of magnitude as SVM and LR (though this heavily depends on the chosen regularization parameter C). With EE, each subset has a size limited to twice the minority class training size (a relatively low number compared to the size of the majority class). This works in favour of SVM/LR (e.g. SVM has a computational complexity of O(m²), with m the number of training instances). Also note that the NB-version is indeed optimized for sparse data. Internally, the NB-implementation writes a fairly large model file to a specified directory (sizes of 1 GB can be encountered there). This writing operation slows down the training stage (especially in case of EE, where S × T = 15 × 20 = 300 model files are constructed).


Table 10 Average tenfold AUC test set performance of the BL compared to its EE-version (S = 15, T = 20) for SVM, LR, BeSim and NB. In case the EE NB implementation crashed, we put a × in the table. If it is too slow (> 24 hours per fold), we place a []. Additionally, average ranks are assigned to each method. In case EE NB has missing results for a particular dataset, we ensure it has the lowest rank across all other algorithms.

           Mov G(p = 1)    Mov G(p = 2.5)   Mov Th(p = [])   Yahoo A(p = 1)

BL SVM     71.6 (2.62)     81.41 (1.32)     79.77 (5.33)     56.49 (3.37)
EE SVM     76.12 (2.88)    85.13 (1.86)     86.43 (5.86)     59.74 (2.96)
BL LR      71.02 (2.09)    84.39 (1.84)     83.14 (4.17)     57.84 (2.39)
EE LR      76.69 (2.92)    85.03 (1.98)     86.3 (5.37)      59.79 (2.62)
BL BeSim   76.1 (3.58)     81.3 (2.92)      82.81 (6.6)      56.27 (2.73)
EE BeSim   76.31 (3.71)    81.37 (2.9)      85.02 (6.28)     57.7 (1.71)
BL NB      70.26 (5.84)    77.01 (2.54)     70.48 (10.14)    52.56 (2.09)
EE NB      75.93 (2.83)    85.56 (2.01)     86.91 (4.15)     57.55 (2.73)

           Yahoo A(p = 2.5)   Yahoo G(p = 1)   Yahoo G(p = 2.5)   TaFeng(p = 1)

BL SVM     61.61 (2.48)       66.84 (3.66)     78.82 (1.39)       55.75 (1.6)
EE SVM     66.38 (3.16)       73.48 (2.32)     80.55 (1.55)       61.13 (1.83)
BL LR      66.27 (2.96)       69.82 (1.93)     80.45 (1.59)       58.91 (2.31)
EE LR      66.22 (3.28)       73.08 (2.14)     80.53 (1.56)       61.43 (2.32)
BL BeSim   64.54 (2.02)       68.89 (2.49)     79.55 (1.96)       57.89 (1.18)
EE BeSim   65.25 (2.23)       71.18 (2.91)     80.04 (1.85)       59.36 (1.47)
BL NB      65 (1.65)          63.33 (2.56)     78.89 (1.64)       54.61 (1.2)
EE NB      66.6 (2.79)        70.99 (2.88)     81.01 (1.3)        59.01 (1.84)

           TaFeng(p = 2.5)   Book(p = 1)    Book(p = 2.5)   LST(p = 1)

BL SVM     66.94 (1.34)      52.6 (1.29)    60.08 (0.71)    99.99 (0.01)
EE SVM     70.4 (1.3)        55.34 (1.28)   65.4 (0.51)     99.98 (0.01)
BL LR      69.24 (1.3)       55.34 (1.27)   63.84 (0.75)    99.99 (0.01)
EE LR      70.28 (1.28)      55.49 (1.49)   65.41 (0.63)    99.97 (0.02)
BL BeSim   67.49 (1.23)      55.19 (1.27)   63.7 (0.63)     99.99 (0.01)
EE BeSim   68 (1.21)         55.21 (1.15)   64.38 (0.42)    99.99 (0)
BL NB      65.21 (1.64)      52.93 (0.9)    59.75 (0.47)    98.69 (0.3)
EE NB      70.72 (1.15)      ×              63.46 (0.61)    99.92 (0.04)

           Adver(p = [])   Adver(p = 1)   CRF(p = [])     Bank(p = [])

BL SVM     96.37 (1.94)    91.18 (2.97)   64.36 (18.97)   66.82 (0.88)
EE SVM     97.63 (1.35)    93.3 (2.14)    86.35 (9.99)    71.54 (0.76)
BL LR      97.19 (1.44)    88.51 (1.93)   81.87 (19.63)   71.43 (0.72)
EE LR      97.57 (0.96)    93.02 (2.06)   86.84 (9.62)    71.77 (0.62)
BL BeSim   97.26 (1.12)    95.38 (1.35)   86.91 (9.36)    67.85 (0.67)
EE BeSim   97.38 (1.04)    93.83 (1.35)   87.02 (10.43)   70.41 (0.55)
BL NB      93.75 (1.9)     93.37 (1.9)    87.24 (9.38)    67.83 (0.63)
EE NB      94.04 (1.75)    ×              ×               []

           Flickr(p = 0.1)   Kdd(p = 0.5)   Average Rank

BL SVM     74.92 (0.17)      74.53 (0.05)   6.44 [7]
EE SVM     79.86 (0.13)      80.98 (0.05)   2.39 [1]
BL LR      79.03 (0.11)      81.29 (0.04)   4.28 [4]
EE LR      79.85 (0.13)      80.75 (0.05)   2.61 [2]
BL BeSim   74.62 (0.13)      74.95 (0)      5.11 [6]
EE BeSim   76.4 (0.13)       77.55 (0.03)   3.61 [3]
BL NB      81.36 (0.1)       74.29 (0.05)   6.5 [8]
EE NB      []                []             5.06 [5]


6 Conclusions and future research directions

We investigated the effects of sampling and cost-sensitive learning techniques on the problem of learning from imbalanced behaviour data. This type of big data has fundamentally different properties and internal structure than traditional data and is characterized by sparseness and very large dimensions. This is a new research topic which enables benefits across a wide variety of application domains, such as fraud detection, targeted advertising, predictive policing, churn prediction, default prediction etc. To enable future developments in this field, we provide an on-line repository containing implementations, datasets and results, available at the following link: http://www.applieddatamining.com/cms?q=software

Oversampling techniques need to be adapted in order to cope with this kind of data. We proposed new versions of the traditional SMOTE and ADASYN techniques, which differ in the way synthetic samples are generated and similarity computations are performed. The plain random oversampling (OSR) and the synthetic approaches are competitive methods in terms of AUC and do not seem to suffer from overfitting, as traditional studies report. Though the synthetic approaches have comparable AUC-performances in comparison to OSR, they are useless in practice due to their computationally expensive procedures. The OSR method is very fast for small and medium sized datasets, yet becomes inappropriate for larger datasets because the training times of the underlying base learner increase drastically.

With respect to undersampling methodologies, we employed a random undersampling technique (RUS) and Knn-based informed approaches. The latter type selects majority class instances that are closest (Cl Knn) or farthest (Far Knn) from the minority class examples and allows one to investigate the effect of noise/outliers on SVM performance. We observed that these odd instances have a severe performance degrading effect. RUS and Far Knn approaches can significantly improve performance with respect to the baseline, thanks to their abilities to reduce noise/outlier levels and to put more emphasis on the minority class. Having said this, they usually perform worse than any of the oversampling approaches. This is in sheer contrast to the results obtained from traditional studies and is heavily related to the intrinsic nature of behaviour data. Timing-wise, RUS is the fastest method. The informed approaches are slower than RUS due to the required nearest neighbour computations. Additionally, a cluster-based undersampling (CBU) technique was proposed to alleviate the within-class imbalance. For low undersampling rates, the performance of CBU is superior to plain RUS, whereas for high undersampling rates the opposite is true. Overall, no substantial gain in performance is obtained in comparison with a simple RUS approach. Furthermore, the method is the slowest among all techniques considered and relies on computationally expensive clustering heuristics.

We studied the effects of boosting, where we made use of a very specific linear SVM and logistic regression combination that uses confidence rated predictions, instead of a more popular plain "discrete" boosting algorithm (with weak hypotheses having values in {−1, 1}). Our experiments clearly indicated the regularization constant C in the SVM formulation to act as a "weakness" indicator. Indeed, higher C-values cause stronger learners and shouldn't be used in the boosting process, which matches with the analysis of studies dealing with traditional data. Strong learners can already quite accurately distinguish majority class from minority class behaviours. The boosting process will focus on the hard to learn instances, among which are noise/outliers, and this can result in overfitting behaviour. We also observed that boosting can usually outperform the plain baseline, indicating that a combination of weak learners can achieve higher AUC-performance than a single strong learner.
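The confidence-rated reweighting driving this behaviour can be sketched as follows (a schematic; `fit_scorer` is a placeholder for the weighted linear SVM whose outputs are calibrated by logistic regression, and T plays the role of the number of boosting iterations):

```python
import numpy as np

def confidence_rated_boost(X, y, fit_scorer, T=20):
    """Confidence-rated boosting in the style of Schapire and Singer (1999):
    each round fits a real-valued scorer on the current instance weights and
    the weights are updated multiplicatively with the signed confidences,
    so hard (often noisy) instances gain influence round after round."""
    D = np.full(len(y), 1.0 / len(y))          # uniform initial weights
    scorers = []
    for _ in range(T):
        f = fit_scorer(X, y, D)
        D = D * np.exp(-y * f(X))              # confidence-rated update
        D = D / D.sum()                        # renormalization (Z_t)
        scorers.append(f)
    return lambda Z: sum(f(Z) for f in scorers)  # additive ensemble score
```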

EasyEnsemble (EE) integrates random undersampling with a boosting process. By sampling several balanced subsets from the training data and combining these, we can emphasize the minority class, achieve noise/outlier reduction effects and simultaneously explore the majority class space efficiently. Our version differs from the original formulation because we employ a different type of underlying weak learner (SVM and LR combination) that is fed to a confidence rated boosting algorithm (instead of the discrete version). We observed superior AUC-performance with respect to all of the aforementioned techniques. Furthermore, the method is very fast, since each subset can be trained in parallel and has a size that is twice as large as the minority class training set. We studied the effect of the number of subsets and noted that a limited amount of these (O(10)) is already sufficient to obtain satisfactory performances; increasing this number further causes only minor benefits.
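In outline the procedure reads as follows (a schematic sketch; `train_boosted` stands in for the confidence rated SVM/LR boosting described above, and all names are illustrative):

```python
import numpy as np

def easy_ensemble(X_min, X_maj, train_boosted, S=15, T=20, seed=None):
    """EasyEnsemble sketch: draw S independent balanced subsets (each
    majority sample matches the minority class size, so a subset is twice
    the minority class), boost a weak learner on each, and average the
    resulting decision scores."""
    rng = np.random.default_rng(seed)
    scorers = []
    for _ in range(S):
        idx = rng.choice(len(X_maj), size=len(X_min), replace=False)
        X = np.vstack([X_min, X_maj[idx]])
        y = np.hstack([np.ones(len(X_min)), -np.ones(len(X_min))])
        scorers.append(train_boosted(X, y, T))
    return lambda Z: np.mean([f(Z) for f in scorers], axis=0)
```

Since the S subsets are independent, the loop is trivially parallelizable, which is precisely the EE par variant of Section 4.6.2.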

Additionally, a statistical comparison of each of the proposed methods was conducted across a wide variety of datasets. Holm's significance test indicates that each of the proposed techniques (oversampling, undersampling except Cl Knn, cost-sensitive learning and boosting variants) can significantly outperform the plain baseline performance (each of the null-hypotheses of equal performance gets rejected with a confidence level of at least 91%). Comparison of the techniques with EE showed that the latter can significantly outperform all of them (each of the null-hypotheses of equal performance gets rejected with a confidence level of at least 75%). A timing-wise comparison showed EE to be the preferred method (after RUS) for large and highly imbalanced datasets. A parallel implementation would make EE the fastest method, even for medium sized datasets that show a high level of imbalance.

The influence of the chosen base learner was investigated by comparing the baseline to its EE-version for the SVM, LR, NB and BeSim inducers. Each of the EE-versions is able to improve upon the baseline performance for nearly all datasets, where base learners of the regularization type are superior because of their ability to control the strength of the learner.

Though we conducted an extensive benchmark study, there are still plenty of future research directions. First, and most important, there is an urgent need to develop a more general theoretical framework to completely understand the effect of imbalance and why certain methods outperform others. Secondly, other types of mechanisms could be investigated in this setting: advanced sampling (Guo et al 2008) set-ups perform sampling based on preliminary classifications; one-class learning is a recognition-based methodology where a discriminative boundary is learned around the training examples of a single class alone, and Raskutti and Kowalczyk (2004) demonstrated the superiority of the recognition-based one-class SVMs over the discriminative two-class SVMs; other types of ensemble learning strategies (e.g. bagging and stacking) could be investigated; Batista et al (2004) note that combining over- and undersampling can be beneficial when data are highly imbalanced or show very few minority class instances; and the ideas behind Extended Nearest Neighbour (Tang and He 2015) could guide us in the sampling process. As opposed to Knn, "it considers not only who are the nearest neighbours of a certain sample, but also who consider the particular sample as their nearest neighbour (Tang and He 2015)". The method assigns a class membership to an unknown sample that maximizes the intra-class coherence. Finally, more advanced techniques do exist that can provide K (the number of nearest neighbours) faster or with a (slightly) better performance.

Imbalanced classification in sparse and large behaviour datasets 43
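The "reverse" nearest-neighbour view underlying ENN can be sketched as below; `reverse_nn_count` is a hypothetical helper written for illustration, not the authors' implementation.

```python
import numpy as np

def reverse_nn_count(X, x, k=1):
    """Count training samples that would include x among their k nearest
    neighbours once x is added -- the 'reverse' view that ENN exploits."""
    count = 0
    for i in range(len(X)):
        d_x = np.linalg.norm(X[i] - x)
        d_others = np.linalg.norm(X - X[i], axis=1)
        d_others = np.delete(d_others, i)           # drop the zero self-distance
        if np.sum(d_others < d_x) < k:              # x ranks in i's top-k
            count += 1
    return count

X = np.array([[0.0], [1.0], [10.0]])
reverse_nn_count(X, np.array([0.4]))                # -> 2
```

Standard Knn only asks who is near the query; counting reverse nearest neighbours per candidate class is one ingredient of the intra-class coherence that ENN maximizes.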

Another issue relates to the lack of comprehensibility of the EasyEnsemble technique in its present form. Because we currently integrate Logistic Regression in the boosting process, the final ensemble classifier is non-linear and therefore incomprehensible. It might be interesting to investigate the numerical boosting technique proposed in Schapire and Singer (1999), where each weak learner is allowed to output real values (instead of values in [-1, 1]). In that case, we would be able to use a plain linear SVM as a weak learner, without the Logistic Regression component. The final ensemble would be a linear model, which is surely easier to comprehend. Also, the instance-based approach of Martens and Provost (2014) could be used to explain individual predictions of the non-linear EE-technique.
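The claim that such an ensemble would be linear follows from distributivity: a weighted sum of linear scorers is itself a linear scorer. A small numerical check (with arbitrary random coefficients, purely illustrative):

```python
import numpy as np

rng = np.random.default_rng(1)
d = 5

# alpha_t-weighted linear weak learners f_t(x) = w_t @ x + b_t
alphas = rng.random(3)
ws = rng.normal(size=(3, d))
bs = rng.normal(size=3)

# the boosted ensemble sum_t alpha_t * (w_t @ x + b_t) collapses to one model
w_ens = alphas @ ws             # sum_t alpha_t * w_t
b_ens = alphas @ bs             # sum_t alpha_t * b_t

x = rng.normal(size=d)
ensemble_score = sum(a * (w @ x + b) for a, w, b in zip(alphas, ws, bs))
assert np.isclose(ensemble_score, w_ens @ x + b_ens)
```

The collapsed coefficient vector `w_ens` can then be inspected directly, which is what makes the linear variant easier to comprehend.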

A Oversampling experiments

Table 11 Oversampling experiments. Results show the average tenfold AUC test set performance with respect to increasing β-values (β = [β1, β2, β3, β4] = [0, 1/3, 2/3, 1]). Additionally, standard deviations are shown between brackets. Optimal performances for each method are highlighted in boldface. See Section 4.2 for parameter settings.

Mov G(p = 1)
          β1            β2            β3            β4
OSR       71.6 (2.62)   74.37 (2.04)  73.6 (1.84)   74.73 (2.45)
SMOTE     71.6 (2.62)   75.08 (2.18)  76.02 (2.14)  76.48 (2.3)
ADASYN    71.6 (2.62)   75.16 (1.92)  75.93 (2.08)  76.47 (2.29)

Mov G(p = 25)
          β1            β2            β3            β4
OSR       81.41 (1.32)  83.49 (1.81)  83.84 (1.96)  83.91 (2.04)
SMOTE     81.41 (1.32)  83.32 (1.97)  83.59 (2.04)  83.76 (2.11)
ADASYN    81.41 (1.32)  83.61 (1.82)  84.02 (1.97)  83.69 (1.96)

Mov Th(p = [])
          β1            β2            β3            β4
OSR       79.77 (5.33)  85.3 (4.66)   83.16 (4.5)   84.59 (5.69)
SMOTE     79.77 (5.33)  84.18 (6.51)  85.58 (5.97)  84.33 (5.75)
ADASYN    79.77 (5.33)  84.11 (6.77)  85.86 (6.06)  85.36 (5.13)

Yahoo A(p = 1)
          β1            β2            β3            β4
OSR       55.92 (2.97)  58.66 (3.27)  59.99 (2.28)  59.74 (1.78)
SMOTE     55.92 (2.97)  59.76 (2.62)  59.74 (2.67)  59.43 (2.4)
ADASYN    55.92 (2.97)  59.54 (2.53)  59.55 (2.94)  59.56 (2.22)

Yahoo A(p = 25)
          β1            β2            β3            β4
OSR       61.68 (2.42)  64.19 (3.17)  65.08 (3.26)  64.67 (2.1)
SMOTE     61.68 (2.42)  65.46 (3.63)  65.33 (3.23)  64.52 (2.98)
ADASYN    61.68 (2.42)  65.04 (3.74)  65.41 (3.47)  64.4 (2.21)

44 Jellis Vanhoeyveld David Martens

Table 11 continued

Yahoo G(p = 1)
          β1            β2            β3            β4
OSR       66.84 (3.66)  72.18 (2.36)  73.11 (2.7)   72.49 (3.41)
SMOTE     66.84 (3.66)  72.65 (2.85)  73.27 (3.36)  73.37 (3.56)
ADASYN    66.84 (3.66)  72.87 (2.83)  73.18 (3.2)   73.39 (3.59)

Yahoo G(p = 25)
          β1            β2            β3            β4
OSR       78.82 (1.39)  78.78 (2.02)  78.74 (1.86)  78.35 (1.47)
SMOTE     78.82 (1.39)  79.23 (1.57)  79.1 (1.2)    79.03 (1.89)
ADASYN    78.82 (1.39)  79.12 (1.43)  79.23 (1.37)  79.51 (2.01)

TaFeng(p = 1)
          β1            β2            β3            β4
OSR       55.75 (1.6)   59.23 (1.96)  60 (1.68)     61.04 (2.36)
SMOTE     55.75 (1.6)   60.26 (1.95)  61.49 (1.8)   61.13 (1.52)
ADASYN    55.75 (1.6)   60.26 (1.9)   61.44 (1.85)  61.16 (1.5)

TaFeng(p = 25)
          β1            β2            β3            β4
OSR       66.94 (1.34)  68.84 (1.21)  66.99 (1.42)  67.7 (1.41)
SMOTE     66.94 (1.34)  68.47 (1.5)   67.07 (1.15)  66.65 (0.81)
ADASYN    66.94 (1.34)  68.62 (1.38)  67.85 (1.6)   66.91 (1.39)

Book(p = 1)
          β1            β2            β3            β4
OSR       52.6 (1.29)   53.61 (0.94)  55.41 (1.75)  55.87 (1.44)
SMOTE     52.6 (1.29)   54.77 (0.99)  54.91 (0.8)   54.36 (0.98)
ADASYN    52.6 (1.29)   54.86 (1.13)  55.06 (0.73)  54.54 (0.92)

Book(p = 25)
          β1            β2            β3            β4
OSR       60.08 (0.71)  61.05 (0.94)  61.82 (1.12)  64.62 (0.57)
SMOTE     60.08 (0.71)  62.6 (0.73)   60.95 (0.68)  63 (0.8)
ADASYN    60.08 (0.71)  62.33 (0.73)  60.77 (0.85)  63.04 (0.58)

LST(p = 1)
          β1            β2            β3            β4
OSR       99.99 (0.01)  99.99 (0.01)  99.99 (0.01)  99.99 (0.01)
SMOTE     99.99 (0.01)  99.99 (0.01)  99.99 (0.01)  99.99 (0.01)
ADASYN    99.99 (0.01)  99.99 (0.01)  99.99 (0.01)  99.99 (0.01)

Adver(p = [])
          β1            β2            β3            β4
OSR       96.61 (1.82)  97.31 (1.65)  97.07 (1.84)  97.07 (1.79)
SMOTE     96.61 (1.82)  96.91 (1.66)  97.19 (1.65)  97.07 (1.91)
ADASYN    96.61 (1.82)  97.1 (1.7)    97.08 (1.87)  97.07 (1.88)

Adver(p = 1)
          β1            β2            β3            β4
OSR       90.93 (3.02)  91.27 (3.03)  92.66 (2.82)  93.29 (1.97)
SMOTE     90.93 (3.02)  92.51 (2.03)  92.96 (2.14)  93.53 (1.81)
ADASYN    90.93 (3.02)  92.22 (2.33)  92.7 (2.36)   93.88 (1.73)

CRF(p = [])
          β1             β2             β3             β4
OSR       64.06 (16.43)  80.82 (12.94)  81.28 (12.27)  81.91 (11.28)
SMOTE     64.06 (16.43)  78.64 (16.86)  82.52 (13.74)  79.32 (16.26)
ADASYN    64.06 (16.43)  78.95 (16.72)  81.19 (16.32)  79.31 (16.19)

Bank(p = [])
          β1            β2            β3            β4
OSR       66.82 (0.88)  70.1 (0.74)   71.39 (0.8)   71.47 (0.8)
SMOTE
ADASYN


B Undersampling experiments

Table 12 Undersampling experiments. Results show the average tenfold AUC test set performance with respect to increasing βu-values (βu = [βu1, βu2, βu3, βu4, βu5] = [0, 1/4, 1/2, 3/4, 1]). Optimal performances (boldface) are highlighted for each imbalance ratio. Additionally, standard deviations are shown between brackets. CL K represents the "Closest Knn" technique; CL T represents the "Closest tot sim" technique (similar for Far K and Far T, see Section 3.2). Parameter settings are given in Section 4.2.

Mov G(p = 1)
        βu1          βu2          βu3          βu4          βu5
RUS     71.6(2.6)    71.83(2.6)   72.54(2.5)   72.39(3.1)   70.61(3.5)
Cl K    71.6(2.6)    71.4(2)      70.96(1.9)   70.43(2.4)   69.05(4.1)
CL T    71.6(2.6)    70.28(2.5)   66.74(2)     66.8(2.1)    68.18(3.6)
Far K   71.6(2.6)    72.36(2.7)   71.26(3.4)   66.57(5.2)   53.5(3.5)
Far T   71.6(2.6)    72.22(2.8)   71.63(3.6)   64.28(5.3)   50.88(4.4)
CBU     72.55(2.6)   73.28(2.6)   73.12(2.6)   73.84(2.5)   73(3.1)

Mov G(p = 25)
        βu1          βu2          βu3          βu4          βu5
RUS     81.41(1.3)   81.36(1.3)   81.78(1.7)   82.05(1.7)   81.6(2.1)
Cl K    81.41(1.3)   80.86(1.2)   80.95(1.6)   79.73(2.3)   77.95(2.3)
CL T    81.41(1.3)   79.9(1.2)    78.21(1.4)   77.87(1.5)   77.76(2.3)
Far K   81.41(1.3)   80.9(1.5)    78.17(1.8)   74.25(2.4)   69.79(3.2)
Far T   81.41(1.3)   80.86(1.5)   77.2(2.4)    71.16(2.7)   62.4(2.8)
CBU     81.53(1.4)   81.64(1.3)   81.29(1.6)   81.28(2.1)   80.34(2.7)

Mov Th(p = [])
        βu1          βu2          βu3          βu4          βu5
RUS     79.77(5.3)   80.32(5.8)   81.57(5.5)   81.86(6.6)   81.26(6.2)
Cl K    79.77(5.3)   79.25(4.5)   78.07(5)     76.25(6.5)   62.46(8.5)
CL T    79.77(5.3)   78.4(4.4)    72.41(3.5)   64.66(4.5)   60.37(7.3)
Far K   79.77(5.3)   84.54(5)     83.64(6.4)   80.02(7.3)   56.82(10.3)
Far T   79.77(5.3)   85.03(5.7)   82.68(6.8)   75.61(9.2)   56.77(10.9)
CBU     80.11(5.8)   81.17(6)     81.08(6.5)   84.17(5.1)   80.96(6.9)

Yahoo A(p = 1)
        βu1          βu2          βu3          βu4          βu5
RUS     55.92(3)     55.57(3.4)   56.44(3)     55.83(3.4)   56.37(3.3)
Cl K    55.92(3)     55.67(2.4)   53.12(2)     50.57(1.8)   53.79(3.5)
CL T    55.92(3)     55.69(2.1)   53.35(2.2)   50.31(2.2)   52.35(3.3)
Far K   55.92(3)     57.35(2.2)   56.92(1.1)   56.95(2.3)   51.18(2)
Far T   55.92(3)     56.93(2.4)   54.74(1.9)   57.01(1.8)   51.18(2)
CBU     58.21(2.6)   58.45(3.3)   58.31(3.5)   58.39(3.5)   56.09(2.6)

Yahoo A(p = 25)
        βu1          βu2          βu3          βu4          βu5
RUS     61.68(2.4)   62.9(2.9)    63.62(3.6)   63.75(3.1)   63.19(1.9)
Cl K    61.68(2.4)   61.14(2.1)   57.62(1.6)   54.02(1.8)   51.48(1.4)
CL T    61.68(2.4)   60.89(2.8)   58.11(1.4)   54.4(2.1)    51.76(1.4)
Far K   61.68(2.4)   63.96(3)     62.62(2.2)   59.61(1.5)   56.25(1.6)
Far T   61.68(2.4)   63.71(2.4)   59.72(1.6)   57.27(1.1)   54.47(1.1)
CBU     62.46(2.6)   61.85(1.4)   61.78(2.2)   59.94(3)     60.1(4)


Table 12 continued

Yahoo G(p = 1)
        βu1          βu2          βu3          βu4          βu5
RUS     66.84(3.7)   67.85(3.2)   68.36(3.2)   68.23(4)     69.9(4.2)
Cl K    66.84(3.7)   66.71(2.8)   64.3(3.6)    61.98(3.9)   61.15(1.9)
CL T    66.84(3.7)   65.79(2.7)   63.55(3.3)   59.21(3.5)   61.08(2.4)
Far K   66.84(3.7)   66.76(4.1)   63.84(3.4)   65.16(2)     48.5(2.9)
Far T   66.84(3.7)   66.95(4.1)   63.48(2.9)   65.16(2)     48.48(2.9)
CBU     69.68(4.1)   70.59(3.2)   70.64(3.7)   70.2(2.9)    63.35(3.6)

Yahoo G(p = 25)
        βu1          βu2          βu3          βu4          βu5
RUS     78.82(1.4)   78.91(1.6)   78.97(1.6)   78.61(1.6)   77.82(2.1)
Cl K    78.82(1.4)   77.26(1.5)   72.52(1.5)   67.86(2)     65.07(2.7)
CL T    78.82(1.4)   76.83(1)     71.99(1.8)   67.15(2.3)   61.1(2.7)
Far K   78.82(1.4)   78.26(2.2)   74.69(2.7)   67.22(2.1)   60.72(2.3)
Far T   78.82(1.4)   77.68(2.6)   72.44(3)     64.94(2.4)   59.6(2)
CBU     75.25(3.2)   75.22(2.4)   74.69(2.3)   73.07(2.4)   70.69(2.4)

TaFeng(p = 1)
        βu1          βu2          βu3          βu4          βu5
RUS     55.75(1.6)   56.1(1.6)    56.26(1.7)   57.23(1.7)   59.25(2.2)
Cl K    55.75(1.6)   55.68(1.6)   55.58(1.5)   55.08(1.1)   51.05(1.5)
CL T    55.75(1.6)   55.67(1.6)   54.47(1.6)   47.53(1.6)   49.3(1.1)
Far K   55.75(1.6)   58.99(1.2)   59.47(1.1)   60.04(1.2)   56.31(1)
Far T   55.75(1.6)   58.92(1.3)   59.25(1.3)   58.58(1.1)   56.31(1)
CBU     57.8(1)      58.47(1.1)   58.15(0.9)   58.87(1.4)   57.65(1.6)

TaFeng(p = 25)
        βu1          βu2          βu3          βu4          βu5
RUS     66.94(1.3)   67.44(1.3)   68.1(1.4)    68.27(1.4)   66.13(1.2)
Cl K    66.94(1.3)   66.13(1.4)   63.39(1.2)   59.83(1.3)   56.94(0.7)
CL T    66.94(1.3)   66.38(1.5)   62.89(1.6)   57.46(1.3)   54.56(1.3)
Far K   66.94(1.3)   68.06(1.4)   66.43(1.6)   64.46(1.5)   63.35(1.3)
Far T   66.94(1.3)   64.31(1.1)   62.69(1)     61.27(1.1)   59.03(1)
CBU     64.81(1.2)   64.15(1.1)   64.13(1.2)   63.88(0.8)   63.46(0.8)

Book(p = 1)
        βu1          βu2          βu3          βu4          βu5
RUS     52.6(1.3)    52.79(0.9)   53.46(0.8)   53.89(0.9)   54.05(0.9)
Cl K    52.6(1.3)    52.56(1.2)   52.52(1.3)   52.39(1.1)   53.09(1.1)
CL T    52.6(1.3)    52.56(1.2)   52.52(1.3)   52.39(1.1)   53.05(0.7)
Far K   52.6(1.3)    55.21(1.2)   56.21(1.8)   56.14(1.2)   53.06(1)
Far T   52.6(1.3)    55.21(1.2)   56.21(1.8)   56.14(1.2)   53.06(1)
CBU     54.28(0.9)   53.77(1)     53.33(1.1)   53.34(0.9)   52.84(0.8)

Book(p = 25)
        βu1          βu2          βu3          βu4          βu5
RUS     60.08(0.7)   60.13(0.6)   60.4(0.8)    60.33(0.8)   63.28(0.8)
Cl K    60.08(0.7)   59.96(0.7)   60.13(0.8)   59.96(1)     59.28(0.7)
CL T    60.08(0.7)   59.96(0.7)   60.13(0.8)   60.29(0.4)   54.5(0.9)
Far K   60.08(0.7)   63.29(1)     64.19(0.8)   57.3(1.1)    55.66(1.1)
Far T   60.08(0.7)   62.14(0.5)   58.27(0.6)   56.37(1)     55.66(1.1)
CBU     54.82(0.9)   54.67(0.9)   54.71(0.9)   54.66(1)     54.78(0.9)


Table 12 continued

LST(p = 1)
        βu1          βu2          βu3          βu4          βu5
RUS     99.99(0)     99.99(0)     99.99(0)     99.98(0)     99.99(0)
Cl K    99.99(0)     99.99(0)     99.99(0)     99.99(0)     99.99(0)
CL T    99.99(0)     99.99(0)     99.99(0)     99.99(0)     99.98(0)
Far K   99.99(0)     99.98(0)     99.98(0)     99.98(0)     99.98(0)
Far T   99.99(0)     99.98(0)     99.98(0)     99.98(0)     99.98(0)
CBU     []           []           []           []           []

Adver(p = [])
        βu1          βu2          βu3          βu4          βu5
RUS     96.61(1.8)   96.32(1.8)   96.63(1.4)   97.12(2.1)   96.22(1.6)
Cl K    96.61(1.8)   96.44(1.5)   96.14(1.5)   96.04(2)     94.8(2.5)
CL T    96.61(1.8)   95.87(2.1)   94.32(1.9)   93.01(2.2)   90.72(2.3)
Far K   96.61(1.8)   96.53(1.4)   95.76(2)     94.39(1.8)   90.49(3.1)
Far T   96.61(1.8)   96.54(1.5)   95.67(1.9)   94.54(1.8)   89.3(2.8)
CBU     96.85(2.3)   96.85(2.3)   97.05(1.5)   96.6(1.6)    96.06(2.1)

Adver(p = 1)
        βu1          βu2          βu3          βu4          βu5
RUS     90.93(3)     91.53(3.1)   92.37(3.4)   91.9(2.9)    91.93(2.2)
Cl K    90.93(3)     90.64(3)     89.87(3.9)   90.21(3.6)   89.18(2)
CL T    90.93(3)     89.7(3.5)    88.55(3.4)   85.76(3.3)   88.2(2.3)
Far K   90.93(3)     93.8(2.3)    92.4(2.6)    88.73(3.4)   85.51(4)
Far T   90.93(3)     93.62(2.4)   93.2(2.2)    88.41(3.6)   85.51(4)
CBU     93.22(2.4)   93.76(2.5)   93.89(2.6)   93.52(2.7)   91.27(2)

CRF(p = [])
        βu1           βu2           βu3           βu4           βu5
RUS     64.06(16.4)   63.28(15.9)   67.98(17.4)   66.95(21.9)   87.73(8.8)
Cl K    64.06(16.4)   62.44(16.6)   62.34(16.9)   71.37(13.8)   78.22(17.7)
CL T    64.06(16.4)   62.44(16.6)   62.34(16.9)   71.37(13.8)   62.67(22.9)
Far K   64.06(16.4)   83.8(14.2)    83.93(14.8)   84.49(13.7)   86.11(9.7)
Far T   64.06(16.4)   83.8(14.2)    83.93(14.8)   84.49(13.7)   86.11(9.7)
CBU     []            []            []            []            []

Bank(p = [])
        βu1          βu2          βu3          βu4          βu5
RUS     66.82(0.9)   67.02(0.9)   67.37(0.8)   67.99(0.6)   69.5(1)
Cl K    66.82(0.9)   66.17(0.7)   65.24(0.6)   64.86(0.6)   58.53(1.1)
CL T    66.82(0.9)   64.92(1.1)   60.69(0.9)   56.33(0.8)   52.87(0.7)
Far K   66.82(0.9)   66.95(0.6)   66.19(0.6)   64.42(0.6)   58.25(1.1)
Far T   66.82(0.9)   67.16(0.6)   64.2(0.8)    59.67(1)     58.25(1.1)
CBU     []           []           []           []           []


C Boosting experiments

This section presents the results of each of the boosting experiments from Section 4.5 on the remaining data sources. Each of the figures below depicts the average tenfold AUC-performance on test data (with µ = 100) with respect to the number of boosting iterations T for (a) Adaboost (AB), AdaCost (AC) and EasyEnsemble (EE) with C chosen according to highest validation set AUC-performance (over all possible boosting rounds) and (b) AB and EE (with S = 15) with varying C-levels.

[Figures 6-17 are rendered plots; only their captions are retained here. Each figure plots AUC test performance (vertical axis) against the number of boosting rounds T = 0-30 (horizontal axis). Panel (a) compares AB, AC (R2, R8, RD), EE (S = 5, 10, 15) and BL; panel (b) compares AB and EE with C = 1e-07, 1e-05, 0.001 and 0.1 against BL.]

Fig 6 Mov G(p = 1) dataset
Fig 7 Mov Th(p = []) dataset
Fig 8 Yahoo A(p = 1) dataset
Fig 9 Yahoo A(p = 25) dataset
Fig 10 Yahoo G(p = 1) dataset
Fig 11 Yahoo G(p = 25) dataset
Fig 12 TaFeng(p = 1) dataset
Fig 13 Book(p = 1) dataset
Fig 14 LST(p = 1) dataset
Fig 15 Adver(p = []) dataset
Fig 16 Adver(p = 1) dataset
Fig 17 CRF(p = []) dataset

D Final Comparison

[Figure 18 is a rendered scatter plot of average rank AUC (horizontal axis) versus average rank Time (vertical axis) for BL, OSR, SMOTE, ADASYN, RUS, Cl Knn, Far Knn, CBU, AB, AC(R=28), AC(R=RL), EE(S=10), EE(S=15) and EE par.]

Fig 18 Average rank AUC versus average rank Time (see Table 9) across the large datasets from Table 2 (CRF and Bank). Regarding the former, an extra row EE par is added, having the same AUC-value as EE(S = 15). Ranks for each dataset are subsequently obtained via the procedure outlined in Section 4.6.1. Points occurring in the upper-right region are preferred.


References

Akbani R, Kwek S, Japkowicz N (2004) Applying support vector machines to imbalanced datasets. In: Machine Learning: ECML 2004, 15th European Conference on Machine Learning, Pisa, Italy, September 20-24, 2004, Proceedings, Springer Berlin Heidelberg, pp 39–50. DOI 10.1007/978-3-540-30115-8_7

Ali A, Shamsuddin SM, Ralescu AL (2015) Classification with class imbalance problem: A review. International Journal of Advances in Soft Computing and its Applications 7(3):176–204

Alzahrani T, Horadam KJ (2016) Community detection in bipartite networks: Algorithms and case studies. In: Complex Systems and Networks: Dynamics, Controls and Applications, Springer Berlin Heidelberg, pp 25–50. DOI 10.1007/978-3-662-47824-0_2

Bachner J (2013) Predictive policing: Preventing crime with data and analytics. IBM Center for the Business of Government

Baesens B, Van Gestel T, Viaene S, Stepanova M, Suykens J, Vanthienen J (2003) Benchmarking state-of-the-art classification algorithms for credit scoring. Journal of the Operational Research Society 54(6):627–635. DOI 10.1057/palgrave.jors.2601545

Barandela R, Sánchez J, García V, Rangel E (2003) Strategies for learning in class imbalance problems. Pattern Recognition 36(3):849–851. DOI 10.1016/S0031-3203(02)00257-1

Barber MJ (2007) Modularity and community detection in bipartite networks. Physical Review E 76:066102. DOI 10.1103/PhysRevE.76.066102

Barua S, Islam MM, Yao X, Murase K (2014) MWMOTE – majority weighted minority oversampling technique for imbalanced data set learning. IEEE Transactions on Knowledge and Data Engineering 26(2):405–425. DOI 10.1109/TKDE.2012.232

Batista GEAPA, Prati RC, Monard MC (2004) A study of the behavior of several methods for balancing machine learning training data. SIGKDD Explor Newsl 6(1):20–29. DOI 10.1145/1007730.1007735

Beckett SJ (2016) Improved community detection in weighted bipartite networks. Royal Society Open Science 3(1). DOI 10.1098/rsos.140536

Bekkar M, Djemaa HK, Alitouche TA (2013) Evaluation measures for models assessment over imbalanced data sets. Journal of Information Engineering and Applications 3(10):27–38

Bhattacharyya S, Jha S, Tharakunnel K, Westland JC (2011) Data mining for credit card fraud: A comparative study. Decision Support Systems 50(3):602–613. DOI 10.1016/j.dss.2010.08.008

Blondel VD, Guillaume JL, Lambiotte R, Lefebvre E (2008) Fast unfolding of communities in large networks. Journal of Statistical Mechanics: Theory and Experiment 2008(10):P10008

Breiman L, Friedman J, Stone CJ, Olshen RA (1984) Classification and regression trees. Taylor & Francis

Brozovsky L, Petricek V (2007) Recommender system for online dating service. In: Proceedings of Znalosti 2007 Conference, VSB Ostrava

Cha M, Mislove A, Gummadi KP (2009) A measurement-driven analysis of information propagation in the Flickr social network. In: Proceedings of the 18th International Conference on World Wide Web, ACM, New York, NY, USA, WWW '09, pp 721–730. DOI 10.1145/1526709.1526806

Chawla NV (2005) Data mining for imbalanced datasets: An overview. In: Data mining and knowledge discovery handbook, Springer US, Boston, MA, pp 853–867

Chawla NV, Bowyer KW, Hall LO, Kegelmeyer WP (2002) SMOTE: synthetic minority over-sampling technique. Journal of Artificial Intelligence Research 16:321–357

Chawla NV, Lazarevic A, Hall LO, Bowyer KW (2003) SMOTEBoost: Improving prediction of the minority class in boosting. In: Knowledge Discovery in Databases: PKDD 2003, Springer Berlin Heidelberg, pp 107–119

Chawla NV, Japkowicz N, Kotcz A (2004) Editorial: Special issue on learning from imbalanced data sets. SIGKDD Explor Newsl 6(1):1–6. DOI 10.1145/1007730.1007733

Chen M, Mao S, Liu Y (2014) Big data: A survey. Mobile Networks and Applications 19(2):171–209. DOI 10.1007/s11036-013-0489-0

Chyi YM (2003) Classification analysis techniques for skewed class distribution problems. Master thesis, Department of Information Management, National Sun Yat-Sen University

Demšar J (2006) Statistical comparisons of classifiers over multiple data sets. Journal of Machine Learning Research 7(Jan):1–30

Dhillon IS (2001) Co-clustering documents and words using bipartite spectral graph partitioning. In: Proceedings of the Seventh ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, ACM, New York, NY, USA, KDD '01, pp 269–274. DOI 10.1145/502512.502550


Drummond C, Holte RC (2003) C4.5, class imbalance, and cost sensitivity: why under-sampling beats over-sampling. In: Proceedings of the ICML '03 Workshop on Learning from Imbalanced Datasets

Fan RE, Chang KW, Hsieh CJ, Wang XR, Lin CJ (2008) LIBLINEAR: A library for large linear classification. Journal of Machine Learning Research 9:1871–1874

Fan W, Stolfo SJ, Zhang J, Chan PK (1999) AdaCost: Misclassification cost-sensitive boosting. In: Proceedings of the Sixteenth International Conference on Machine Learning, Morgan Kaufmann Publishers Inc, San Francisco, CA, USA, ICML '99, pp 97–105

Fawcett T (2006) An introduction to ROC analysis. Pattern Recognition Letters 27(8):861–874. DOI 10.1016/j.patrec.2005.10.010

Finch H (2005) Comparison of distance measures in cluster analysis with dichotomous data. Journal of Data Science 3(1):85–100

Fortunato S (2010) Community detection in graphs. Physics Reports 486(3-5):75–174. DOI 10.1016/j.physrep.2009.11.002

Junque de Fortuny E, Martens D, Provost F (2014a) Predictive modeling with big data: is bigger really better? Big Data 1(4):215–226. DOI 10.1089/big.2013.0037

Junque de Fortuny E, Stankova M, Moeyersoms J, Minnaert B, Provost F, Martens D (2014b) Corporate residence fraud detection. In: Proceedings of the 20th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, ACM, New York, NY, USA, KDD '14, pp 1650–1659. DOI 10.1145/2623330.2623333

Frasca M, Bertoni A, Re M, Valentini G (2013) A neural network algorithm for semi-supervised node label learning from unbalanced data. Neural Networks 43:84–98. DOI 10.1016/j.neunet.2013.01.021

Friedman M (1937) The use of ranks to avoid the assumption of normality implicit in the analysis of variance. Journal of the American Statistical Association 32(200):675–701

García E, Lozano F (2007) Boosting support vector machines. In: Machine Learning and Data Mining in Pattern Recognition, 5th International Conference, MLDM 2007, Leipzig, Germany, July 18-20, Post Proceedings, IBaI publishing, pp 153–167

Goldstein M, Uchida S (2016) A comparative evaluation of unsupervised anomaly detection algorithms for multivariate data. PLOS ONE 11(4):1–31. DOI 10.1371/journal.pone.0152173

González PC, Velásquez JD (2013) Characterization and detection of taxpayers with false invoices using data mining techniques. Expert Systems with Applications 40(5):1427–1436. DOI 10.1016/j.eswa.2012.08.051

Guimerà R, Sales-Pardo M, Amaral LAN (2007) Module identification in bipartite and directed networks. Physical Review E 76:036102. DOI 10.1103/PhysRevE.76.036102

Guo H, Viktor HL (2004) Learning from imbalanced data sets with boosting and data generation: The DataBoost-IM approach. SIGKDD Explor Newsl 6(1):30–39. DOI 10.1145/1007730.1007736

Guo X, Yin Y, Dong C, Yang G, Zhou G (2008) On the class imbalance problem. In: 2008 Fourth International Conference on Natural Computation, IEEE, vol 4, pp 192–201. DOI 10.1109/ICNC.2008.871

Han H, Wang WY, Mao BH (2005) Borderline-SMOTE: A new over-sampling method in imbalanced data sets learning. In: Advances in Intelligent Computing, Springer Berlin Heidelberg, pp 878–887

He H, Garcia EA (2009) Learning from imbalanced data. IEEE Transactions on Knowledge and Data Engineering 21(9):1263–1284. DOI 10.1109/TKDE.2008.239

He H, Bai Y, Garcia EA, Li S (2008) ADASYN: Adaptive synthetic sampling approach for imbalanced learning. In: 2008 IEEE International Joint Conference on Neural Networks (IEEE World Congress on Computational Intelligence), IEEE, pp 1322–1328. DOI 10.1109/IJCNN.2008.4633969

Holm S (1979) A simple sequentially rejective multiple test procedure. Scandinavian Journal of Statistics 6(2):65–70

Hsu CW, Lin CJ (2002) A comparison of methods for multiclass support vector machines. IEEE Transactions on Neural Networks 13(2):415–425. DOI 10.1109/72.991427

Huang A (2008) Similarity measures for text document clustering. In: Proceedings of the sixth New Zealand computer science research student conference (NZCSRSC2008), Christchurch, New Zealand, pp 49–56

Iman RL, Davenport JM (1980) Approximations of the critical region of the Friedman statistic. Communications in Statistics - Theory and Methods 9(6):571–595

Jo T, Japkowicz N (2004) Class imbalances versus small disjuncts. ACM SIGKDD Explor Newsl 6(1):40–49. DOI 10.1145/1007730.1007737


Jutla IS, Jeub LG, Mucha PJ (2011-2016) A generalized Louvain method for community detection implemented in MATLAB. URL http://netwiki.amath.unc.edu/GenLouvain

Kubat M, Matwin S (1997) Addressing the curse of imbalanced training sets: One-sided selection. In: Proceedings of the Fourteenth International Conference on Machine Learning, Morgan Kaufmann Publishers Inc, San Francisco, CA, USA, pp 179–186

Lancichinetti A, Fortunato S (2009) Community detection algorithms: A comparative analysis. Physical Review E 80:056117. DOI 10.1103/PhysRevE.80.056117

Larremore DB, Clauset A, Jacobs AZ (2014) Efficiently inferring community structure in bipartite networks. Physical Review E: Statistical, Nonlinear, and Soft Matter Physics 90:012805. DOI 10.1103/PhysRevE.90.012805

Li J, Fine JP (2010) Weighted area under the receiver operating characteristic curve and its application to gene selection. Journal of the Royal Statistical Society: Series C (Applied Statistics) 59(4):673–692. DOI 10.1111/j.1467-9876.2010.00713.x

Li X, Wang L, Sung E (2008) AdaBoost with SVM-based component classifiers. Engineering Applications of Artificial Intelligence 21(5):785–795. DOI 10.1016/j.engappai.2007.07.001

Lichman M (2013) UCI machine learning repository. URL http://archive.ics.uci.edu/ml

Liu W, Chawla S, Cieslak DA, Chawla NV (2010) A robust decision tree algorithm for imbalanced data sets. In: Proceedings of the tenth SIAM international conference on data mining, SIAM, Philadelphia, vol 10, pp 766–777

Liu XY, Wu J, Zhou ZH (2009) Exploratory undersampling for class-imbalance learning. IEEE Transactions on Systems, Man, and Cybernetics, Part B (Cybernetics) 39(2):539–550. DOI 10.1109/TSMCB.2008.2007853

Luts J, Ojeda F, Van de Plas R, De Moor B, Van Huffel S, Suykens JA (2010) A tutorial on support vector machine-based methods for classification problems in chemometrics. Analytica Chimica Acta 665(2):129–145. DOI 10.1016/j.aca.2010.03.030

Macskassy SA, Provost F (2007) Classification in networked data: A toolkit and a univariate case study. Journal of Machine Learning Research 8(May):935–983

Martens D, Provost F (2014) Explaining data-driven document classifications. MIS Quarterly 38(1):73–100

Martens D, Provost F, Clark J, Junque de Fortuny E (2016) Mining massive fine-grained behavior data to improve predictive analytics. MIS Quarterly 40(4):869–888

Mazurowski MA, Habas PA, Zurada JM, Lo JY, Baker JA, Tourassi GD (2008) Training neural network classifiers for medical decision making: The effects of imbalanced datasets on classification performance. Neural Networks 21(2-3):427–436. DOI 10.1016/j.neunet.2007.12.031

Mease D, Wyner AJ, Buja A (2007) Boosted classification trees and class probability/quantile estimation. Journal of Machine Learning Research 8:409–439

Nemenyi P (1963) Distribution-free Multiple Comparisons. Dissertation, Princeton University

Newman MEJ, Girvan M (2004) Finding and evaluating community structure in networks. Physical Review E 69:026113. DOI 10.1103/PhysRevE.69.026113

Ng AY (2004) Feature selection, L1 vs. L2 regularization, and rotational invariance. In: Proceedings of the Twenty-first International Conference on Machine Learning, ACM, New York, NY, USA, ICML '04, pp 78–. DOI 10.1145/1015330.1015435

Ng AY, Jordan MI (2002) On discriminative vs. generative classifiers: A comparison of logistic regression and naive Bayes. In: Advances in Neural Information Processing Systems 14, MIT Press, pp 841–848

Ngai E, Hu Y, Wong Y, Chen Y, Sun X (2011) The application of data mining techniques in financial fraud detection: A classification framework and an academic review of literature. Decision Support Systems 50(3):559–569. DOI 10.1016/j.dss.2010.08.006

Platt JC (1999) Probabilistic outputs for support vector machines and comparisons to regularized likelihood methods. In: Advances in Large-Margin Classifiers, MIT Press, pp 61–74

Porter MA, Onnela JP, Mucha PJ (2009) Communities in networks. Notices of the American Mathematical Society 56(9):1082–1097

Provost F, Fawcett T (2013) Data Science for Business: What you need to know about data mining and data-analytic thinking. O'Reilly Media, Inc

Provost F, Dalessandro B, Hook R, Zhang X, Murray A (2009) Audience selection for on-line brand advertising: Privacy-friendly social network targeting. In: Proceedings of the 15th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, ACM, New York, NY, USA, KDD '09, pp 707–716. DOI 10.1145/1557019.1557098


Raskutti B, Kowalczyk A (2004) Extreme re-balancing for SVMs: A case study. SIGKDD Explor Newsl 6(1):60–69, DOI 10.1145/1007730.1007739

Rosvall M, Bergstrom CT (2008) Maps of random walks on complex networks reveal community structure. Proceedings of the National Academy of Sciences 105(4):1118–1123, DOI 10.1073/pnas.0706851105

Schapire RE (1999) A brief introduction to boosting. In: Proceedings of the 16th International Joint Conference on Artificial Intelligence - Volume 2, Morgan Kaufmann Publishers Inc, San Francisco, CA, USA, IJCAI'99, pp 1401–1406

Schapire RE, Singer Y (1999) Improved boosting algorithms using confidence-rated predictions. Machine Learning 37(3):297–336, DOI 10.1023/A:1007614523901

Shmueli G (2017) Analyzing behavioral big data: Methodological, practical, ethical, and moral issues. Quality Engineering 29(1):57–74, DOI 10.1080/08982112.2016.1210979

Sobhani P, Viktor H, Matwin S (2015) Learning from imbalanced data using ensemble methods and cluster-based undersampling. In: New Frontiers in Mining Complex Patterns: Third International Workshop, NFMCP 2014, Held in Conjunction with ECML-PKDD 2014, Nancy, France, September 19, 2014, Revised Selected Papers, Springer International Publishing, Cham, pp 69–83, DOI 10.1007/978-3-319-17876-9_5

Stankova M (2016) Classification within network data with a bipartite structure. Dissertation, University of Antwerp

Stankova M, Martens D, Provost F (2015) Classification over bipartite graphs through projection. Working Papers 2015001, University of Antwerp, Faculty of Applied Economics

Sun Y, Kamel MS, Wong AK, Wang Y (2007) Cost-sensitive boosting for classification of imbalanced data. Pattern Recognition 40(12):3358–3378, DOI 10.1016/j.patcog.2007.04.009

Suykens JA, Van Gestel T, De Brabanter J, De Moor B, Vandewalle J (2002) Least Squares Support Vector Machines. World Scientific, Singapore

Tang B, He H (2015) ENN: Extended nearest neighbor method for pattern recognition [research frontier]. IEEE Computational Intelligence Magazine 10(3):52–60, DOI 10.1109/MCI.2015.2437512

Tang Y, Zhang YQ, Chawla NV, Krasser S (2009) SVMs modeling for highly imbalanced classification. IEEE Transactions on Systems, Man, and Cybernetics, Part B (Cybernetics) 39(1):281–288, DOI 10.1109/TSMCB.2008.2002909

Tobback E, Moeyersoms J, Stankova M, Martens D (2016) Bankruptcy prediction for SMEs using relational data. Working Paper 2016004, University of Antwerp, Faculty of Applied Economics

Verbeke W, Dejaeger K, Martens D, Hur J, Baesens B (2012) New insights into churn prediction in the telecommunication sector: A profit driven data mining approach. European Journal of Operational Research 218(1):211–229, DOI 10.1016/j.ejor.2011.09.031

Veropoulos K, Campbell I, Cristianini N (1999) Controlling the sensitivity of support vector machines. In: Proceedings of the International Joint Conference on Artificial Intelligence, Stockholm, Sweden (IJCAI99), pp 55–60

Whitrow C, Hand DJ, Juszczak P, Weston D, Adams NM (2009) Transaction aggregation as a strategy for credit card fraud detection. Data Mining and Knowledge Discovery 18(1):30–55, DOI 10.1007/s10618-008-0116-z

Wickramaratna J, Holden SB, Buxton BF (2001) Performance degradation in boosting. In: Proceedings of the Second International Workshop on Multiple Classifier Systems, Springer, London, UK, MCS '01, pp 11–21

Yen SJ, Lee YS (2009) Cluster-based under-sampling approaches for imbalanced data distributions. Expert Systems with Applications 36(3, Part 1):5718–5727, DOI 10.1016/j.eswa.2008.06.108

Yu HF, Lo HY, Hsieh HP, Lou JK, McKenzie TG, Chou JW, Chung PH, Ho CH, Chang CF, Wei YH, et al (2010) Feature engineering and classifier ensemble for KDD Cup 2010. In: Proceedings of the KDD Cup 2010 Workshop, pp 1–16

Zha H, He X, Ding C, Simon H, Gu M (2001) Bipartite graph partitioning and data clustering. In: Proceedings of the Tenth International Conference on Information and Knowledge Management, ACM, New York, NY, USA, CIKM '01, pp 25–32, DOI 10.1145/502585.502591

Zhang J, Mani I (2003) KNN approach to unbalanced data distributions: A case study involving information extraction. In: Proceedings of the ICML'2003 Workshop on Learning from Imbalanced Datasets, Washington, DC

Ziegler CN, McNee SM, Konstan JA, Lausen G (2005) Improving recommendation lists through topic diversification. In: Proceedings of the 14th International Conference on World Wide Web, ACM, New York, NY, USA, WWW '05, pp 22–32, DOI 10.1145/1060745.1060754


Imbalanced classification in sparse and large behaviour datasets 21

of the minority class training data is performed. Note that the validation and test data are left untouched [18].

The methods detailed in Section 3 are applied on the training data. For both the under- and oversampling approaches, a linear SVM is trained on the newly created balanced training data, with regularization parameter C taking values

C = [10^-7, 10^-5, 10^-3, 10^-1, 10^0]

The validation data are used for parameter tuning purposes. The test data allow us to obtain the generalization performance. The results reported show the average over ten folds. In the remaining paragraphs of this section we describe the various parameter settings used in our experiments and give a brief overview of the parameters occurring in each method.
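The tuning-and-evaluation loop described above can be sketched as follows. This is a minimal illustration on synthetic stand-in data (the rebalancing step and the tenfold averaging are omitted); the data generator and its parameters are hypothetical, not the paper's datasets.

```python
import numpy as np
from sklearn.svm import LinearSVC
from sklearn.metrics import roc_auc_score

rng = np.random.RandomState(0)

def toy_behaviour_data(n, d, pos_frac):
    """Hypothetical stand-in for a sparse binary behaviour matrix."""
    X = (rng.rand(n, d) < 0.02).astype(float)   # ~2% non-zero entries
    y = (rng.rand(n) < pos_frac).astype(int)    # imbalanced labels
    return X, y

X_tr, y_tr = toy_behaviour_data(600, 300, 0.1)
X_va, y_va = toy_behaviour_data(200, 300, 0.1)
X_te, y_te = toy_behaviour_data(200, 300, 0.1)

C_grid = [1e-7, 1e-5, 1e-3, 1e-1, 1e0]
best_C, best_val = None, -1.0
for C in C_grid:                                # tune C on the validation split
    clf = LinearSVC(C=C).fit(X_tr, y_tr)
    val = roc_auc_score(y_va, clf.decision_function(X_va))
    if val > best_val:
        best_C, best_val = C, val

# generalization estimate: refit with the selected C, score on the test split
test_auc = roc_auc_score(y_te, LinearSVC(C=best_C).fit(X_tr, y_tr).decision_function(X_te))
```

In the actual experiments this loop would run once per fold and per rebalanced training set, with the reported AUC averaged over the ten folds.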

Considering the oversampling techniques, the parameter settings are as follows:

β = [0, 1/3, 2/3, 1]
prior_opt = {FlipCoin, Reverse Prior}
sim_measure = {Cosine, Jaccard}
K = [10^0, 10^1, 10^2, |X_min|_train]

We didn't include the "Prior" option, due to initial experiments showing a lower performance in comparison with the other options. This can be explained by the low priors occurring in each column, resulting in synthetic samples that mainly show zeros in 0-1 match situations. The oversampling with replacement method (OSR) only uses the β parameter, SMOTE uses all but the K parameter, and ADASYN uses all of the parameters listed above.
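As a concrete illustration of the simplest of these methods, below is a sketch of oversampling with replacement (OSR). The interpretation of β as "move the minority count a fraction β of the way towards the majority count, with β = 1 fully balanced" is our reading of the balance parameter, not code from the paper.

```python
import numpy as np

def oversample_with_replacement(X, y, beta, rng):
    """OSR sketch: replicate randomly chosen minority rows (with replacement)
    until the minority count has moved a fraction beta of the way towards
    the majority count (beta = 1 yields a fully balanced training set)."""
    min_idx = np.flatnonzero(y == 1)
    maj_idx = np.flatnonzero(y == 0)
    target = len(min_idx) + int(beta * (len(maj_idx) - len(min_idx)))
    extra = rng.choice(min_idx, size=target - len(min_idx), replace=True)
    keep = np.concatenate([maj_idx, min_idx, extra])
    return X[keep], y[keep]

rng = np.random.default_rng(0)
X = np.arange(20).reshape(10, 2)
y = np.array([1, 1, 0, 0, 0, 0, 0, 0, 0, 0])   # 2 minority, 8 majority
X_bal, y_bal = oversample_with_replacement(X, y, beta=1.0, rng=rng)
```

Because OSR only replicates existing rows, it adds no new feature patterns; SMOTE and ADASYN instead interpolate synthetic rows, which — as argued above — has little room to create unique samples when instances have very few non-zero elements.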

With respect to the undersampling techniques, the following parameter settings are used:

β_u = [0, 1/4, 1/2, 3/4, 1]
sim_measure = {Cosine, Jaccard}
K = [10^0, 10^1, 10^2, |X_min|_train]
Clust_opt = {C_Smallest, C_Largest}

The random undersampling (RUS) technique only uses the β_u parameter. The second set of methods, "Closest tot sim" and "Farthest tot sim", uses β_u and sim_measure. The third set of techniques, "Closest Knn" and "Farthest Knn", makes use of all the parameters listed above except for Clust_opt. The final approach, CBU, employs β_u and Clust_opt.
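A sketch of the "tot sim" family follows, under one plausible reading of the method names: each majority instance is scored by its total cosine similarity to the minority class, and "Closest" retains the most similar majority instances while "Farthest" retains the least similar ones (thereby discarding majority instances that resemble the minority, i.e. potential noise/outliers). The β_u semantics (β_u = 1 keeps as many majority rows as there are minority rows) is likewise our assumption.

```python
import numpy as np

def undersample_tot_sim(X, y, beta_u, mode):
    """'tot sim' undersampling sketch: score each majority row by its total
    cosine similarity to the minority class, then keep the n_keep majority
    rows with the highest score ('closest') or the lowest score ('farthest')."""
    Xn = X / np.maximum(np.linalg.norm(X, axis=1, keepdims=True), 1e-12)
    maj = np.flatnonzero(y == 0)
    mino = np.flatnonzero(y == 1)
    score = Xn[maj] @ Xn[mino].sum(axis=0)   # total similarity to minority class
    order = np.argsort(score)                # least similar first
    if mode == "closest":
        order = order[::-1]                  # most similar first
    n_keep = len(maj) - int(beta_u * (len(maj) - len(mino)))
    keep = np.concatenate([mino, maj[order[:n_keep]]])
    return X[keep], y[keep]

X = np.array([[1.0, 0.0],    # minority
              [0.9, 0.1],    # majority, very similar to the minority
              [0.1, 0.9],    # majority
              [0.0, 1.0]])   # majority, least similar
y = np.array([1, 0, 0, 0])
X_far, y_far = undersample_tot_sim(X, y, beta_u=1.0, mode="farthest")
```

With `mode="farthest"` and full undersampling, only the majority row most dissimilar to the minority survives; `mode="closest"` would instead keep the near-duplicate of the minority instance.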

The boosting variants (AdaBoost, AdaCost and EasyEnsemble) presented in Section 3.3 make use of the following settings:

T = 30
µ = [100, 75]
C = [10^-7, 10^-5, 10^-3, 10^-1]
R = [2, 8, R_L], where R_L = |X_maj|_train / |X_min|_train
S = [5, 10, 15]

The AdaBoost algorithm includes the T, µ and C parameters. AdaCost additionally uses cost-ratios R. We have chosen a range of values because misclassification costs are unknown for many business applications (He and Garcia 2009; Fan et al 1999; Sun et al 2007). The final value, R_L, seems to be a popular choice (Akbani et al 2004; Luts et al 2010) because the total weight on the majority class balances with the total weight on the minority class. The final method, EasyEnsemble, uses S subsets in addition to the parameters previously mentioned for AdaBoost. Note that we consider the boosting iteration round t ∈ [1, T] as a tunable parameter [19].

4.3 Oversampling

The oversampling techniques presented in Section 3.1 are applied to each of the datasets from Table 2 with varying imbalance levels (by varying the p-parameter). The experiments we conducted use the following methodology: for each dataset, we apply the oversampling techniques with all possible parameter combinations on the training data to create newly balanced datasets, after which linear SVMs are trained. The optimal parameter combination with respect to each imbalance ratio β is selected based on validation set AUC-performance. With these parameters, the AUC on the test data is obtained. Results show the average over ten folds.

The results on four arbitrarily selected datasets are shown in Table 3. Full results on each of the data sources can be found in Appendix A, Table 11. From these tables we can conclude that performance generally increases with growing β-values. More precisely, performance keeps improving with growing β-levels until an optimal point is reached; increasing the balance level beyond this optimal value causes only small fluctuations with respect to the optimal performance. Traditional studies dealing with dense, low-dimensional data note that the OSR method can suffer from overfitting, as already mentioned in Section 3.1. It is interesting to see that in this sparse, high-dimensional setting this effect doesn't seem to occur [20]. Furthermore, the computationally expensive synthetic sampling approaches do not seem to improve performance over the simpler OSR method. This can be explained by the fact that many instances have a very limited number of non-zero elements (Junque de Fortuny et al 2014a; Stankova et al 2015), causing the synthetic sampling procedures to be limited in the number of unique new samples they can produce. Note also that synthetic sampling procedures were historically designed to overcome the overfitting behaviour of OSR, which no longer seems to apply here.

[19] We sometimes make use of T as a symbol to indicate the boosting iteration number instead of t.
[20] The SVM regularization parameter C controls for overfitting. Additionally, the influence of majority class noise/outliers (see Section 3.2) on the learned hyperplane decreases as we oversample the minority class. This hyperplane is more sensitive towards minority instances, and a change in its direction/orientation towards these majority class noise/outliers will be more heavily penalized.


Table 3 Oversampling experiments. Results show the average tenfold AUC test set performance with respect to increasing β-values (β = [β1, β2, β3, β4] = [0, 1/3, 2/3, 1]). Additionally, standard deviations are shown between brackets. Optimal performances for each method are highlighted in boldface. See Section 4.2 for parameter settings.

Mov Th (p = [])
         β1            β2            β3            β4
OSR      79.77 (5.33)  85.3 (4.66)   83.16 (4.5)   84.59 (5.69)
SMOTE    79.77 (5.33)  84.18 (6.51)  85.58 (5.97)  84.33 (5.75)
ADASYN   79.77 (5.33)  84.11 (6.77)  85.86 (6.06)  85.36 (5.13)

Yahoo G (p = 25)
         β1            β2            β3            β4
OSR      78.82 (1.39)  78.78 (2.02)  78.74 (1.86)  78.35 (1.47)
SMOTE    78.82 (1.39)  79.23 (1.57)  79.1 (1.2)    79.03 (1.89)
ADASYN   78.82 (1.39)  79.12 (1.43)  79.23 (1.37)  79.51 (2.01)

TaFeng (p = 25)
         β1            β2            β3            β4
OSR      66.94 (1.34)  68.84 (1.21)  66.99 (1.42)  67.7 (1.41)
SMOTE    66.94 (1.34)  68.47 (1.5)   67.07 (1.15)  66.65 (0.81)
ADASYN   66.94 (1.34)  68.62 (1.38)  67.85 (1.6)   66.91 (1.39)

Book (p = 25)
         β1            β2            β3            β4
OSR      60.08 (0.71)  61.05 (0.94)  61.82 (1.12)  64.62 (0.57)
SMOTE    60.08 (0.71)  62.6 (0.73)   60.95 (0.68)  63.0 (0.8)
ADASYN   60.08 (0.71)  62.33 (0.73)  60.77 (0.85)  63.04 (0.58)

4.4 Undersampling

Regarding the undersampling techniques, we employ a similar experimental set-up as in the previous section: first, train SVMs on the undersampled training data with all possible parameter combinations; second, choose a suitable parameter combination based on validation set AUC-performance; and finally, obtain the AUC-performance on the test set. The results are averaged across ten folds.

Before going into detail on the experimental findings, we give a short note on the effect of noise/outliers on SVM performance. Consider the following imaginary example: say that a majority class female is rating a lot of action films and thrillers. We can consider this an outlier, since most of the females are rating romantic or drama films. This specific female has far more in common with the minority class males, who also rate action films and thrillers. Outliers are therefore instances showing odd behaviour; noise are wrongfully labelled instances. The effect of noise/outliers on SVM performance can be severe. Indeed, many of the instances in the dataset have a support value (dual variable αi) of 0. Instances contributing to the model (a non-zero support value) are examples close to the borderline (Suykens et al 2002) and noise/outliers (as can be derived from the KKT dual-complementarity conditions [21]).

[21] If αi = 0, then yi(w^T xi + b) ≥ 1. For noise/outliers the term yi(w^T xi + b) is negative, hence αi ≠ 0.


With respect to noise/outliers, the support value can reach a maximal value of C, the chosen regularization parameter in the SVM formulation. It should be clear that lowering the C-value leads to models that are less sensitive to noise/outliers.
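The box constraint 0 ≤ αi ≤ C can be checked directly with scikit-learn, whose `SVC.dual_coef_` stores y_i·αi for the support vectors; the illustrative noisy-label data below is our own construction.

```python
import numpy as np
from sklearn.svm import SVC

rng = np.random.default_rng(0)
X = rng.normal(size=(80, 2))
# labels depend on the first feature plus noise, so some points are mislabelled
y = (X[:, 0] + 0.5 * rng.normal(size=80) > 0).astype(int)

max_alpha = {}
for C in (0.01, 1.0):
    svm = SVC(kernel="linear", C=C).fit(X, y)
    # |dual_coef_| = alpha_i; the box constraint caps every alpha_i at C
    max_alpha[C] = np.abs(svm.dual_coef_).ravel().max()
```

Mislabelled points hit the upper bound αi = C, so a smaller C directly caps how much any single noisy instance can pull on the hyperplane.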

The experimental outcomes of the various undersampling processes are shown in Table 4 for the same four previously selected datasets. We refer to Appendix B, Table 12 for results on the entire data repository. When removing only a limited amount of majority class instances (βu = βu2 = 1/4), we observed that in 12 out of 16 datasets the "Farthest Knn" method outperforms the "Closest Knn" technique. The four remaining datasets show comparable performances [22]. This finding shows that the "Farthest" method is very suitable for removing majority class noise/outliers, and empirically demonstrates their performance degrading effect. With higher undersampling rates (βu = 1), the "Closest Knn" method generally achieves higher performances than the "Farthest Knn" technique, though the results are less clear. We observed that only 9 out of 16 datasets indicate "Closest" as a clear winner, with 3 tie situations and 4 situations where "Farthest" shows the best results. Intuitively, we would expect the "Closest" method to perform better, since we are keeping the most informative instances (examples close to the minority class). However, this technique will also emphasize noise/outliers, and this is the main reason why the results are less significant. One can also see this from the fact that the RUS method shows far better results in comparison to the aforementioned techniques when βu = 1, an observation that is valid for all datasets. The RUS method is efficient in realizing noise/outlier reduction effects, since each SVM is trained with only a fraction of these types of instances. It is remarkable that in 9 out of 16 datasets the RUS method with βu = 1 outperforms the baseline model (βu = 0). In 4 datasets we observed equal performance, and 3 losses with respect to the baseline. RUS shows two attractive features: majority class noise/outlier removal and its ability to put more emphasis on the minority class instances by undersampling the majority class. A general downside of the approach is the information loss, and this effect causes the 3 losses to the baseline. Usually, though, the attractive features of RUS dominate the latter effect, and this indicates that it is definitely not necessary to include all majority class training instances in constructing the predictive model. Apparently there is a high level of redundancy, which can be exploited to construct efficient hypotheses.

In the next paragraph we compare the cluster-based undersampling (CBU) approach to the random undersampling (RUS) technique. Before proceeding, we should note that Table 12 does not show results on the LST, CRF and Bank datasets, due to the fact that the projected unigraph was too large to fit in memory (larger than 16 GB). A limited number of features (top nodes) does seem to be active for a relatively large number of instances (bottom nodes). All these instances will be connected in the unigraph projection, resulting in relatively high degrees for the bottom nodes. A possible solution to circumvent this problem might be to cut the weakest edges in the projected unigraph.

If we apply a limited amount of undersampling (βu = βu2 = 1/4), the CBU technique outperforms RUS in 8 out of 13 datasets (4 losses and 1 tie with respect to RUS). On the highly imbalanced datasets (p = 1 or p = []), the CBU method wins

[22] A tie occurs when the absolute difference in AUC is smaller than or equal to 0.5.


in 8 out of 8 cases. We can therefore conclude that CBU outperforms RUS at low undersampling rates. This indeed shows that handling the within-class imbalance can be beneficial in this environment. If we make a comparison at the highest undersampling rate, βu = βu5 = 1, the results are completely opposite: CBU wins in only 1 case, with 9 losses with respect to RUS and 3 ties. On the highly imbalanced datasets, CBU outperforms RUS for 1 dataset (with 4 losses and 3 ties). It is clear that RUS is the preferred technique in case of high undersampling rates. The reason is that CBU emphasizes the small disjuncts and basically fails to adequately pick up on the common, larger communities. Furthermore, these small disjuncts might actually correspond with majority class noise/outliers, and CBU is focusing on these types of instances. On the overall level, where we consider all undersampling rates (except βu = 0), CBU wins in 5 out of 13 datasets, RUS wins in 4 cases, and there are 4 tie situations (yet RUS did outperform CBU in all 4 cases). Both methods seem to be competitive with one another across a wide range of imbalance levels. If we pay particular attention to the highly imbalanced sets, CBU wins in 5 cases and there are 3 tie situations on the remaining sets (yet RUS did outperform CBU in all 3 cases). CBU seems to be the method of choice (with respect to RUS) when the datasets show high imbalance levels.

Table 4 Undersampling experiments. Results show the average tenfold AUC test set performance with respect to increasing βu-values (βu = [βu1, βu2, βu3, βu4, βu5] = [0, 1/4, 1/2, 3/4, 1]). Optimal performances (boldface) are highlighted for each imbalance ratio. Additionally, standard deviations are shown between brackets. Cl K represents the "Closest Knn" technique; Cl T represents the "Closest tot sim" technique (similar for Far K and Far T; see Section 3.2). Parameter settings are given in Section 4.2.

Mov Th (p = [])
       βu1          βu2          βu3          βu4          βu5
RUS    79.77 (5.3)  80.32 (5.8)  81.57 (5.5)  81.86 (6.6)  81.26 (6.2)
Cl K   79.77 (5.3)  79.25 (4.5)  78.07 (5.0)  76.25 (6.5)  62.46 (8.5)
Cl T   79.77 (5.3)  78.4 (4.4)   72.41 (3.5)  64.66 (4.5)  60.37 (7.3)
Far K  79.77 (5.3)  84.54 (5.0)  83.64 (6.4)  80.02 (7.3)  56.82 (10.3)
Far T  79.77 (5.3)  85.03 (5.7)  82.68 (6.8)  75.61 (9.2)  56.77 (10.9)
CBU    80.11 (5.8)  81.17 (6.0)  81.08 (6.5)  84.17 (5.1)  80.96 (6.9)

Yahoo G (p = 25)
       βu1          βu2          βu3          βu4          βu5
RUS    78.82 (1.4)  78.91 (1.6)  78.97 (1.6)  78.61 (1.6)  77.82 (2.1)
Cl K   78.82 (1.4)  77.26 (1.5)  72.52 (1.5)  67.86 (2.0)  65.07 (2.7)
Cl T   78.82 (1.4)  76.83 (1.0)  71.99 (1.8)  67.15 (2.3)  61.1 (2.7)
Far K  78.82 (1.4)  78.26 (2.2)  74.69 (2.7)  67.22 (2.1)  60.72 (2.3)
Far T  78.82 (1.4)  77.68 (2.6)  72.44 (3.0)  64.94 (2.4)  59.6 (2.0)
CBU    75.25 (3.2)  75.22 (2.4)  74.69 (2.3)  73.07 (2.4)  70.69 (2.4)
Continues on next page


Table 4 continued
TaFeng (p = 25)
       βu1          βu2          βu3          βu4          βu5
RUS    66.94 (1.3)  67.44 (1.3)  68.1 (1.4)   68.27 (1.4)  66.13 (1.2)
Cl K   66.94 (1.3)  66.13 (1.4)  63.39 (1.2)  59.83 (1.3)  56.94 (0.7)
Cl T   66.94 (1.3)  66.38 (1.5)  62.89 (1.6)  57.46 (1.3)  54.56 (1.3)
Far K  66.94 (1.3)  68.06 (1.4)  66.43 (1.6)  64.46 (1.5)  63.35 (1.3)
Far T  66.94 (1.3)  64.31 (1.1)  62.69 (1.0)  61.27 (1.1)  59.03 (1.0)
CBU    64.81 (1.2)  64.15 (1.1)  64.13 (1.2)  63.88 (0.8)  63.46 (0.8)

Book (p = 25)
       βu1          βu2          βu3          βu4          βu5
RUS    60.08 (0.7)  60.13 (0.6)  60.4 (0.8)   60.33 (0.8)  63.28 (0.8)
Cl K   60.08 (0.7)  59.96 (0.7)  60.13 (0.8)  59.96 (1.0)  59.28 (0.7)
Cl T   60.08 (0.7)  59.96 (0.7)  60.13 (0.8)  60.29 (0.4)  54.5 (0.9)
Far K  60.08 (0.7)  63.29 (1.0)  64.19 (0.8)  57.3 (1.1)   55.66 (1.1)
Far T  60.08 (0.7)  62.14 (0.5)  58.27 (0.6)  56.37 (1.0)  55.66 (1.1)
CBU    54.82 (0.9)  54.67 (0.9)  54.71 (0.9)  54.66 (1.0)  54.78 (0.9)

4.5 Boosting variants

In this section the plain AdaBoost (AB) method is compared with the cost-sensitive AdaCost (AC) algorithm, the EasyEnsemble (EE) technique, and the baseline (BL) method (where we train a single SVM on the imbalanced training data). AB, AC and EE have an underlying boosting process in common. The AUC-values are stored during each boosting iteration, which allows us to show the performance graphically as a function of the number of iterations t. In EasyEnsemble (EE) we combine the weak learners of each subset by summing their individual contributions. For example, when t = 2 boosting iterations are performed, we evaluate the performance of the combined learner Σ_{s=1}^{S} Σ_{t=1}^{2} α_st h_st(x). Figures 1, 2, 3 and 4 show the average tenfold AUC-performance (with µ = 100) on the test data. The left-side figures show the AB, AC (with varying R-values) and EE (with varying S-values) performance with respect to the number of boosting iterations. The C-value is tuned according to highest validation set AUC-performance (over all possible boosting iterations). The right-side figures investigate the influence of the C-parameter for AB and EE (with S = 15) and allow us to gain insight on the effect of weak and strong learners on the generalization performance. Results on the remaining datasets are given in Appendix C, Figures 6-17. Note that we only indicate results with weight-percentage µ = 100 (use all instances in the training process). Previous experiments (with µ = 75) showed inferior results. This suggests that it is indeed important to include all minority instances during the training of each weak learner, an observation also made by Liu et al (2009).

When investigating the influence of the regularization parameter C in AdaBoost, we can graphically verify that it is a "weakness" indicator, in the sense that lower C-values correspond to weaker learners. While these low C-value learners have a low performance in the first rounds of boosting, the possible boost in performance is usually much higher compared to higher C-value inducers. Strong learners can already make a suitable distinction between minority and majority class behaviours in the first boosting round. The boosting process will emphasize the hard to learn instances, among which are noise/outliers. Therefore, too strong learners are not suitable for use in a boosting process, as already noted by Wickramaratna et al (2001). In Figures 2(b) and 3(b) we observe that the lowest C-values (C = 10^-7, 10^-5) can outperform higher C-values (C = 10^-3, 10^-1). In many cases the AB-process outperforms the baseline (BL); see also the results in Table 5. A combination of several weak learners can often achieve better generalization than a single strong learner.

Inspired by the success of the RUS method with βu = 1 (for reasons indicated in Section 4.4), the EE technique tries to overcome an obvious downside of this approach: we are removing a lot of majority class instances that can be very informative. Hence, we try to make up for this loss by combining several of these balanced training sets, in the hope that the majority class is more adequately explored. This results in a very effective learner that seems to outperform the AB, AC and BL methods in the majority of cases, as can be seen from these figures. It is remarkable that we don't need to use that many subsets S to achieve high performances: increasing the S-value from 5 to 15 doesn't seem to cause major increases in AUC (see also Table 5). The influence of the C-parameter is similar as described in the previous paragraph.
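The subset-then-boost-then-sum structure of EasyEnsemble can be sketched as follows. Note the substitutions: the paper boosts linear SVMs, whereas this sketch uses scikit-learn's `AdaBoostClassifier` (decision stumps) as a stand-in weak learner, and the toy Gaussian data is our own illustration.

```python
import numpy as np
from sklearn.ensemble import AdaBoostClassifier
from sklearn.metrics import roc_auc_score

def easy_ensemble(X, y, S, T, rng):
    """EasyEnsemble sketch: draw S balanced subsets (all minority instances
    plus an equally sized random majority sample), boost T rounds on each."""
    mino = np.flatnonzero(y == 1)
    maj = np.flatnonzero(y == 0)
    models = []
    for _ in range(S):
        sub = np.concatenate([mino, rng.choice(maj, size=len(mino), replace=False)])
        models.append(AdaBoostClassifier(n_estimators=T, random_state=0).fit(X[sub], y[sub]))
    return models

rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0.0, 1.0, (90, 5)),    # majority class
               rng.normal(1.5, 1.0, (10, 5))])   # shifted minority class
y = np.array([0] * 90 + [1] * 10)

models = easy_ensemble(X, y, S=5, T=10, rng=rng)
score = sum(m.decision_function(X) for m in models)   # combined hypothesis
auc = roc_auc_score(y, score)
```

Each subset is only twice the minority class size and the S boosted learners are independent, which is what makes the method cheap and parallelizable.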

Fig 1 Mov G(p = 25) dataset results showing average tenfold AUC-performance on test data (with µ = 100) for (a) AB, AC and EE with C chosen according to highest validation set AUC-performance (over all possible T); (b) AB and EE (S = 15) with varying C-levels. [Figure: line plots of test AUC versus boosting iteration T; plotted data omitted.]


Fig 2 Book(p = 25) dataset. [Figure: line plots of test AUC versus boosting iteration T, panels (a) and (b) as in Fig 1; plotted data omitted.]

Fig 3 TaFeng(p = 25) dataset. [Figure: line plots of test AUC versus boosting iteration T, panels (a) and (b) as in Fig 1; plotted data omitted.]

Fig 4 Bank(p = []) dataset. [Figure: line plots of test AUC versus boosting iteration T, panels (a) and (b) as in Fig 1; plotted data omitted.]


4.6 Final comparison

4.6.1 Performance-wise comparison

In this section we wish to determine which of the aforementioned methods shows superior performance in terms of AUC on test data. The experimental set-up is described in Section 4.2. Note that we exclude β = 0 and βu = 0 in determining the performance of the oversampling and undersampling techniques respectively, to be able to compare them with the baseline (BL) approach [23]. The results for AB, AC and EE are shown for µ = 100. The number of boosting iterations t ∈ [0, T] is also considered as a tunable parameter. Table 5 shows the final comparison between all methods. Additionally, we added an average rank column indicating the mean rank of each algorithm across all datasets. The rank of an algorithm for a specific dataset corresponds to its position in the sorted AUC-performance spectrum, where the best performing algorithm gets the rank of 1, the second best rank 2, etc. In case two or more algorithms perform equally well, average ranks are assigned (if two algorithms have ranks of 3 and 4 respectively and show equal performance, then both methods get assigned a rank of 3.5). Note that the LST dataset is excluded from the calculation of average ranks, since we consider this a dataset not affected by imbalance and showing equal performances.
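The average-rank bookkeeping described above (best AUC gets rank 1, ties share the mean of the ranks they span) can be sketched in a few lines; the example AUC values are hypothetical.

```python
def average_ranks(auc_by_method):
    """Rank methods on one dataset: best AUC gets rank 1; equal AUCs share
    the mean of the ranks they span (ranks 3 and 4 -> both get 3.5)."""
    items = sorted(auc_by_method.items(), key=lambda kv: -kv[1])
    ranks, i = {}, 0
    while i < len(items):
        j = i
        while j < len(items) and items[j][1] == items[i][1]:
            j += 1                              # extend over the tie group
        for name, _ in items[i:j]:
            ranks[name] = (i + 1 + j) / 2       # mean of ranks i+1 .. j
        i = j
    return ranks

ranks = average_ranks({"BL": 0.716, "OSR": 0.7535, "EE": 0.7612, "AB": 0.7165})
```

Averaging these per-dataset ranks over all datasets (LST excluded) yields the "Average Rank" column of Table 5.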

As a side note, we noticed that increasing the number of minority training set instances can significantly improve performances. We can see this clearly by comparing the p = 25 outcomes with the results of p = 1 on the same dataset. Junque de Fortuny et al (2014a) already showed empirically that larger behaviour datasets (in terms of the number of instances or features) lead to higher AUC-performances. This is confirmed in our experiments, though we want to add that this is only true when the data size is increased in a balanced fashion.

The results in Table 5 confirm our previous analysis. Each of the oversampling, undersampling (except Cl Knn) and boosting variants seems to improve upon the BL performance. Sobhani et al (2015) note that the direct application of ensemble techniques on imbalanced data does not circumvent the imbalanced learning issue. In our experiments on behaviour data, we do find that a simple boosting procedure applied to the imbalanced data improves upon the predictive performance. The oversampling techniques OSR, SMOTE and ADASYN show similar outcomes and are competitive methods, seeming to outperform the undersampling techniques and the plain boosting (AB) and cost-sensitive learning (AC) approaches. When dealing with traditional data, the performances of oversampling techniques are generally worse than those of undersampling approaches (Drummond and Holte 2003). It is interesting to observe that for these large and sparse behaviour datasets the opposite is true. This is heavily related to the intrinsic properties of the data under consideration. Undersampling (or feature selection, for that matter) is not recommended because, as we have noted in Section 2.1, each feature provides a small additional contribution to the predictive performance. With respect to AC, the results indicate that R = R_L is a better choice compared to the more random cost ratios R = 2, 8. The EE technique has the

[23] The BL technique trains single SVMs on the imbalanced training data.


lowest average rank and is our best performing method. The combination of several balanced subsets with filtered noise/outlier contributions allows one to explore the majority class space efficiently. To strengthen our previous statements, we will in the next paragraphs conduct statistical significance tests in accordance with the procedure outlined in Demsar (2006). The latter paper presents a framework to perform statistical comparisons of classifiers over multiple datasets.

The first null-hypothesis we try to reject postulates that all algorithms perform equally well or, equivalently, that their average ranks R_j (see Table 5) are indifferent. Under the null-hypothesis, the Friedman statistic (Friedman 1937)

χ²_F = [12N / (k(k+1))] · [ Σ_{j=1}^{k} R_j² − k(k+1)²/4 ]    (6)

with N the number of datasets (N = 15) and k the number of algorithms (k = 13), can be used to compute the statistic (Iman and Davenport 1980)

F_F = (N−1) χ²_F / (N(k−1) − χ²_F)    (7)

The latter is distributed according to the F-distribution with k − 1 and (k − 1)(N − 1) degrees of freedom. Evaluating this measure with the average ranks shown in Table 5 results in F_F = 22.98, which is a lot higher than the critical value of 1.81 at the α = 0.05 significance level. We can therefore reject the null-hypothesis and proceed with post-hoc tests.
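The reported value can be reproduced directly from Eqs. (6) and (7) and the average ranks of Table 5; this check is our own, using the rounded ranks as printed.

```python
# Friedman / Iman-Davenport computation from the Table 5 average ranks
# (N = 15 datasets, k = 13 algorithms; LST excluded).
R = [11.600, 5.000, 4.533, 4.800, 8.167, 12.467, 8.133,
     8.567, 8.267, 8.467, 5.400, 3.267, 2.333]
N, k = 15, len(R)

# Eq. (6): Friedman chi-square from the average ranks
chi2_F = 12 * N / (k * (k + 1)) * (sum(r * r for r in R) - k * (k + 1) ** 2 / 4)
# Eq. (7): Iman-Davenport F statistic
F_F = (N - 1) * chi2_F / (N * (k - 1) - chi2_F)
```

With the rounded ranks this yields F_F ≈ 22.99, matching the 22.98 reported in the text up to rounding.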

The Nemenyi test (Nemenyi 1963) allows one to compare each classifier to one another and adjusts the critical value to compensate for making k(k − 1)/2 comparisons [24]. "The performance of two classifiers is significantly different if the corresponding average ranks differ by at least the critical difference (CD)" (Demsar 2006). We refer to the aforementioned paper for the calculation of the CD. The result of these comparisons at the α = 0.05 significance level is shown in Table 6. This test is already quite revealing, in the sense that, for instance, the oversampling methods (OSR, SMOTE and ADASYN), AC (R = R_L) and the EE techniques are significantly different from the BL. Yet it does not indicate that undersampling or applying plain boosting (AB) results in significantly different performances compared to the BL. The Nemenyi test is still conservative, and we therefore compare all classifiers to a single control classifier in the next paragraphs. Note that we only make k − 1 comparisons in that situation, resulting in stronger results compared to the Nemenyi test.

When comparing a single classifier i with a control classifier c, the null-hypothesis states that the two classifiers i and c perform equally well. Under this hypothesis, the test statistic (based on the average ranks R_i shown in Table 5)

z = (R_i − R_c) / √(k(k+1)/(6N))    (8)

is distributed according to a standard normal distribution. This z-value is used to find the corresponding probability (p-value) p from the cumulative standard normal

[24] The larger the number of comparisons, the higher the proportion of null-hypotheses that are wrongfully rejected due to random chance.
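A sketch of the control comparison of Eq. (8) follows. The choice of control and the two-sided p-value are our own illustration (Demsar (2006) further adjusts such p-values for the k − 1 comparisons, e.g. with a Holm-type procedure).

```python
import math

def control_z(R_i, R_c, k, N):
    """z statistic of Eq. (8) for comparing classifier i against a control c."""
    return (R_i - R_c) / math.sqrt(k * (k + 1) / (6 * N))

# e.g. baseline BL (average rank 11.600) against control EE(S = 15) (rank 2.333)
z = control_z(11.600, 2.333, k=13, N=15)
p = math.erfc(abs(z) / math.sqrt(2.0))   # two-sided p-value under the standard normal
```

The large rank gap between BL and EE(S = 15) yields a z-value above 6 and a vanishingly small p-value, consistent with the significance results discussed above.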


Table 5 Final comparison of all methods in terms of AUC test set performance; see Section 4.2 for details on the experimental set-up. β ≠ 0 and βu ≠ 0 for the oversampling and undersampling techniques respectively; µ = 100 for each of the boosting variants. Standard deviations are given between brackets, and the performance difference with the BL is shown in square brackets.

              Mov G (p = 1)       Mov G (p = 25)      Mov Th (p = [])     Yahoo A (p = 1)
BL            71.6 (2.62) [0]     81.41 (1.32) [0]    79.77 (5.33) [0]    55.92 (2.97) [0]
OSR           75.35 (2.27) [3.8]  83.76 (2.09) [2.3]  85.13 (6.1) [5.4]   60.05 (2.71) [4.1]
SMOTE         76.16 (2.27) [4.6]  83.7 (2.1) [2.3]    85.67 (4.98) [5.9]  60.1 (3.0) [4.2]
ADASYN        76.07 (2.26) [4.5]  83.63 (2.04) [2.2]  85.65 (5.6) [5.9]   59.9 (2.99) [4.0]
RUS           72.88 (2.73) [1.3]  81.52 (2.15) [0.1]  82.91 (7.19) [3.1]  57.04 (1.77) [1.1]
Cl Knn        71.43 (1.36) [-0.2] 80.88 (1.19) [-0.5] 78.87 (4.71) [-0.9] 55.78 (2.71) [-0.1]
Far Knn       71.9 (2.95) [0.3]   80.9 (1.48) [-0.5]  84.07 (4.64) [4.3]  57.2 (1.33) [1.3]
CBU           74.17 (2.36) [2.6]  81.51 (1.04) [0.1]  82.76 (7.22) [3.0]  58.77 (3.43) [2.8]
AB            71.65 (1.73) [0.1]  84.52 (1.89) [3.1]  82.43 (5.18) [2.7]  58.35 (2.62) [2.4]
AC(R = 2,8)   71.61 (2.46) [0]    83.46 (1.82) [2.0]  83.27 (5.6) [3.5]   57.72 (2.47) [1.8]
AC(R = R_L)   74.65 (2.7) [3.1]   83.35 (2.09) [1.9]  85.41 (4.49) [5.6]  59.47 (2.33) [3.5]
EE(S = 10)    76.04 (2.66) [4.4]  85.05 (1.85) [3.6]  86.1 (5.78) [6.3]   59.66 (3.13) [3.7]
EE(S = 15)    76.12 (2.88) [4.5]  85.14 (1.86) [3.7]  86.42 (5.86) [6.7]  59.76 (2.93) [3.8]

              Yahoo A (p = 25)    Yahoo G (p = 1)     Yahoo G (p = 25)    TaFeng (p = 1)
BL            61.68 (2.42) [0]    66.84 (3.66) [0]    78.82 (1.39) [0]    55.75 (1.6) [0]
OSR           64.59 (3.12) [2.9]  73.08 (2.96) [6.2]  78.52 (2.01) [-0.3] 61.21 (2.24) [5.5]
SMOTE         65.56 (3.33) [3.9]  73.11 (3.12) [6.3]  79.01 (1.21) [0.2]  61.72 (1.81) [6.0]
ADASYN        65.13 (3.38) [3.4]  73.22 (3.17) [6.4]  79.74 (1.68) [0.9]  61.68 (1.86) [5.9]
RUS           64.11 (2.8) [2.4]   70.65 (3.39) [3.8]  78.91 (1.55) [0.1]  59.25 (2.18) [3.5]
Cl Knn        61.14 (2.13) [-0.5] 66.34 (3.54) [-0.5] 77.26 (1.46) [-1.6] 55.77 (1.28) [0]
Far Knn       63.96 (3.03) [2.3]  66.97 (3.54) [0.1]  78.26 (2.2) [-0.6]  59.98 (1.26) [4.2]
CBU           62.27 (1.79) [0.6]  71.27 (2.89) [4.4]  75.22 (2.42) [-3.6] 58.4 (1.57) [2.6]
AB            63.88 (2.67) [2.2]  68.9 (2.03) [2.1]   79.01 (1.66) [0.2]  56.21 (1.79) [0.5]
AC(R = 2,8)   64.32 (3.56) [2.6]  68.89 (3.11) [2.0]  78.99 (1.89) [0.2]  56.33 (1.83) [0.6]
AC(R = R_L)   64.31 (3.03) [2.6]  73.13 (2.8) [6.3]   78.41 (2.0) [-0.4]  61.6 (2.26) [5.9]
EE(S = 10)    66.51 (3.24) [4.8]  72.61 (3.15) [5.8]  80.52 (1.6) [1.7]   61.2 (1.82) [5.4]
EE(S = 15)    66.36 (3.18) [4.7]  73.48 (2.32) [6.6]  80.54 (1.56) [1.7]  61.13 (1.83) [5.4]

TaFeng(p = 25) Book(p = 1) Book(p = 25) LST(p = 1)

BL 66.94 (1.34) [0] 52.6 (1.29) [0] 60.08 (0.71) [0] 99.99 (0.01) [0]
OSR 68.77 (1.23) [1.8] 55.87 (1.42) [3.3] 64.62 (0.57) [4.5] 99.99 (0.01) [0]
SMOTE 68.47 (1.5) [1.5] 55.07 (0.88) [2.5] 62.96 (0.82) [2.9] 99.99 (0.01) [0]
ADASYN 68.48 (1.47) [1.5] 55.04 (0.91) [2.4] 63.02 (0.57) [2.9] 99.99 (0.01) [0]
RUS 68.28 (1.39) [1.3] 54.26 (0.92) [1.7] 63.28 (0.8) [3.2] 99.98 (0.01) [0]
Cl Knn 66.13 (1.43) [-0.8] 52.69 (1.3) [0.1] 60.02 (0.79) [-0.1] 99.99 (0.01) [0]
Far Knn 68.06 (1.41) [1.1] 56.25 (1.52) [3.7] 64.15 (1.12) [4.1] 99.98 (0.01) [0]
CBU 63.84 (1.07) [-3.1] 53.75 (1.01) [1.2] 54.68 (0.88) [-5.4] []
AB 67.65 (1.55) [0.7] 54.27 (1.95) [1.7] 65 (0.67) [4.9] 99.99 (0.01) [0]
AC(R = 28) 69.31 (1.23) [2.4] 53.72 (1) [1.1] 61.24 (0.8) [1.2] 99.98 (0.01) [0]
AC(R = R_L) 67.15 (1.51) [0.2] 55.73 (1.22) [3.1] 64.6 (0.64) [4.5] 99.99 (0.01) [0]
EE(S = 10) 70.3 (1.35) [3.4] 55.09 (1.29) [2.5] 65.37 (0.61) [5.3] 99.98 (0.01) [0]
EE(S = 15) 70.4 (1.3) [3.5] 55.35 (1.26) [2.8] 65.4 (0.51) [5.3] 99.98 (0.01) [0]

32 Jellis Vanhoeyveld David Martens

Table 5 Continued. Additionally, an average rank column is added, showing the mean rank of each algorithm across all datasets. Note that we excluded the LST dataset for this purpose.

Adver(p = []) Adver(p = 1) CRF(p = []) Bank(p = [])

BL 96.61 (1.82) [0] 90.93 (3.02) [0] 64.06 (16.43) [0] 66.82 (0.88) [0]
OSR 96.93 (1.91) [0.3] 93.3 (2.02) [2.4] 80.74 (12.93) [16.7] 71.39 (0.79) [4.6]
SMOTE 97.05 (1.66) [0.4] 93.35 (2.01) [2.4] 78.7 (16.56) [14.6] []
ADASYN 96.91 (1.95) [0.3] 93.46 (2.21) [2.5] 78.87 (16.71) [14.8] []
RUS 96.81 (1.87) [0.2] 92.38 (2.51) [1.5] 83.98 (5.99) [19.9] 69.41 (1.19) [2.6]
Cl Knn 96.4 (1.48) [-0.2] 89.73 (3.42) [-1.2] 76.63 (16.19) [12.6] 66.17 (0.72) [-0.6]
Far Knn 95.77 (1.81) [-0.8] 93.88 (1.78) [3] 83.75 (13.11) [19.7] 66.95 (0.56) [0.1]
CBU 97.15 (1.88) [0.5] 94.18 (2.3) [3.3] [] []
AB 97.34 (2.18) [0.7] 91.39 (3.23) [0.5] 77.62 (15.15) [13.6] 66.82 (0.88) [0]
AC(R = 28) 97.44 (1.93) [0.8] 91 (3.35) [0.1] 68.31 (14.93) [4.2] 67.67 (0.71) [0.9]
AC(R = R_L) 97.46 (1.71) [0.8] 93.51 (2.17) [2.6] 85.08 (9.77) [21] 70.7 (0.8) [3.9]
EE(S = 10) 97.64 (1.35) [1] 92.97 (2.75) [2] 86.18 (10.17) [22.1] 71.46 (0.81) [4.6]
EE(S = 15) 97.63 (1.35) [1] 93.3 (2.14) [2.4] 86.35 (9.99) [22.3] 71.54 (0.76) [4.7]

Average Rank

BL 11.600
OSR 5.000
SMOTE 4.533
ADASYN 4.800
RUS 8.167
Cl Knn 12.467
Far Knn 8.133
CBU 8.567
AB 8.267
AC(R = 28) 8.467
AC(R = R_L) 5.400
EE(S = 10) 3.267
EE(S = 15) 2.333

Table 6 Nemenyi test outcomes at the α = 0.05 significance level. A value of 1 indicates the null-hypothesis is rejected and thus finds the two algorithms as significantly different (a value of 0 means we accept the null-hypothesis). Methods are listed in the same order as Table 5 and are more concisely represented.

BL RO SM AD RU Cl Fa CBU AB AC1 AC2 EE1 EE2

BL 0 1 1 1 0 0 0 0 0 0 1 1 1
RO 1 0 0 0 0 1 0 0 0 0 0 0 0
SM 1 0 0 0 0 1 0 0 0 0 0 0 0
AD 1 0 0 0 0 1 0 0 0 0 0 0 0
RU 0 0 0 0 0 0 0 0 0 0 0 1 1
Cl 0 1 1 1 0 0 0 0 0 0 1 1 1
Fa 0 0 0 0 0 0 0 0 0 0 0 1 1
CBU 0 0 0 0 0 0 0 0 0 0 0 1 1
AB 0 0 0 0 0 0 0 0 0 0 0 1 1
AC1 0 0 0 0 0 0 0 0 0 0 0 1 1
AC2 1 0 0 0 0 1 0 0 0 0 0 0 0
EE1 1 0 0 0 1 1 1 1 1 1 0 0 0
EE2 1 0 0 0 1 1 1 1 1 1 0 0 0

Imbalanced classification in sparse and large behaviour datasets 33

distribution function: p = 2 × min(Φ(z), 1 − Φ(z)). Holm's method (Holm 1979) compares k − 1 classifiers to a single control classifier and sorts the corresponding p-values in ascending order so that p1 ≤ p2 ≤ ... ≤ p(k−1). Each pi is subsequently compared to its associated confidence level25 αcomp = α/(k − i). Holm starts with performing the check p1 < α/(k − 1) and rejects the null-hypothesis if the test passes. It then proceeds with a similar check for i = 2 and continues until a certain null-hypothesis cannot be rejected. The remaining hypotheses are retained.
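The step-down procedure can be sketched in a few lines of Python (the function and variable names are ours; the z statistic follows the usual Friedman post-hoc form z = (R_i − R_control)/√(k(k+1)/(6N)) with k methods and N datasets):

```python
import math

def norm_cdf(z):
    # Standard normal CDF via the error function
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))

def holm_test(avg_ranks, control, n_datasets, alpha=0.05):
    """Step-down Holm procedure comparing every method to a control.

    avg_ranks: dict mapping method name -> average rank across datasets.
    Returns (name, z, p, alpha_comp, rejected) tuples sorted by p-value.
    """
    k = len(avg_ranks)
    se = math.sqrt(k * (k + 1) / (6.0 * n_datasets))
    stats = []
    for name, rank in avg_ranks.items():
        if name == control:
            continue
        z = (rank - avg_ranks[control]) / se
        p = 2.0 * min(norm_cdf(z), 1.0 - norm_cdf(z))  # two-sided p-value
        stats.append((p, name, z))
    stats.sort()
    results, still_rejecting = [], True
    for i, (p, name, z) in enumerate(stats, start=1):
        alpha_comp = alpha / (k - i)   # Holm's adjusted threshold
        still_rejecting = still_rejecting and (p < alpha_comp)
        results.append((name, z, p, alpha_comp, still_rejecting))
    return results
```

Feeding the thirteen average ranks of Table 5 with EE(S = 15) as control reproduces the z-values of Table 8 up to rank rounding (e.g. z ≈ 6.52 for BL over N = 15 datasets).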

Table 7 shows the result of the Holm test at the α = 0.05 significance level with the BL algorithm as control classifier. The table lists each method according to sorted p-values and shows the z-, p- and αcomp-values. The significance column indicates whether the proposed method is significantly different from the BL and coincidentally matches with the result of the Nemenyi test. The p-values of the (Far Knn, RUS, CBU) undersampling methods, AB and AC(R = 28) are only slightly higher than the level required to reject the null-hypothesis. We included a critical significance level αcrit, corresponding to the lowest possible significance level upon which the method would be considered as significantly different from the BL (if α = αcrit then p = αcomp). In the case of AB, for example, if we were to choose a significance level α = 0.0764 (a confidence level of 92.36%), the associated p-value would be smaller than αcomp and we would proceed to conclude26 that AB performs significantly different from the BL. To summarize: the null-hypothesis of equal performance between the BL and an alternative method proposed in Section 3 (except Cl Knn) is rejected with a confidence level of more than 91% according to Holm's test. Hence, the performance of the BL approach is significantly worse than that of each of the proposed over- and undersampling methods (except Cl Knn), cost-sensitive learning techniques (AC) and boosting variants (AB, EE).

Similar to the previous paragraph, Table 8 presents the results of Holm's test at the α = 0.05 significance level with EE(S = 15) as control classifier. The results coincidentally correspond with the result of the Nemenyi test and indicate that the oversampling (OSR, SMOTE and ADASYN), AC(R = R_L) and EE(S = 10) methods are not statistically different from the EE(S = 15) technique at the 0.05 significance level. However, if we were to adapt the significance level to approximately α = 0.25 (75% confidence level), the results would conclude significance (except for EE(S = 10)). To summarize: the null-hypothesis of equal performance between EE(S = 15) and an alternative method proposed in Section 3 (except EE(S = 10)) is rejected with a confidence level of more than 75% according to Holm's test. Hence, the performance of the oversampling, undersampling, AB and AC approaches is significantly worse than that of EE(S = 15). We can therefore safely conclude EE to be the preferred method in terms of AUC performance when dealing with imbalanced behaviour data.

25 αcomp adjusts the value of α to compensate for multiple comparisons.
26 This is not entirely true, since Holm's method requires Far Knn to pass the test first before proceeding, which only occurs at the 0.088662 significance level.


Table 7 Holm test at the α = 0.05 significance level with BL reference. The table shows the z test statistic with associated p-value, αcomp = α/(k − i), with k the number of algorithms (k = 13) and i corresponding to the position of the method in the sorted p-value vector. Since we already ranked the p-values in ascending order, i takes on the row number (e.g. OSR has i = 5). The column significant denotes if we can reject the null-hypothesis (significant: p < αcomp). αcrit corresponds with the smallest possible significance level where we would decide to reject the null-hypothesis (αcrit = α · p/αcomp).

z p αcomp significant αcrit

EE(S = 15) -6.51642 7.2E-11 0.004167 1 8.64E-10
EE(S = 10) -5.86009 4.63E-09 0.004545 1 5.09E-08
SMOTE -4.96936 6.72E-07 0.005 1 6.72E-06
ADASYN -4.78183 1.74E-06 0.005556 1 1.56E-05
OSR -4.64119 3.46E-06 0.00625 1 2.77E-05
AC(R = R_L) -4.35991 1.3E-05 0.007143 1 9.11E-05
Far Knn -2.4378 0.014777 0.008333 0 0.088662
RUS -2.41436 0.015763 0.01 0 0.078815
AB -2.34404 0.019076 0.0125 0 0.076305
AC(R = 28) -2.20339 0.027567 0.016667 0 0.082701
CBU -2.13307 0.032919 0.025 0 0.065837
Cl Knn 0.609449 0.542227 0.05 0 0.542227

Table 8 Holm test at the α = 0.05 significance level with EE(S = 15) reference.

z p αcomp significant αcrit

Cl Knn 7.12587 1.03E-12 0.004167 1 1.24E-11
BL 6.516421 7.2E-11 0.004545 1 7.92E-10
CBU 4.383348 1.17E-05 0.005 1 0.000117
AC(R = 28) 4.313027 1.61E-05 0.005556 1 0.000145
AB 4.172384 3.01E-05 0.00625 1 0.000241
RUS 4.102063 4.09E-05 0.007143 1 0.000287
Far Knn 4.078623 4.53E-05 0.008333 1 0.000272
AC(R = R_L) 2.156513 0.031044 0.01 0 0.155218
OSR 1.875229 0.060761 0.0125 0 0.243045
ADASYN 1.734587 0.082814 0.016667 0 0.248442
SMOTE 1.547064 0.121848 0.025 0 0.243696
EE(S = 10) 0.65633 0.511612 0.05 0 0.511612

4.6.2 Timing-wise comparison

The performance of each of the proposed algorithms in terms of training times depends highly upon the chosen parameters. The β and βu parameters control the number of training instances and are an important contribution to the subsequent SVM learning times for each of the over- and undersampling techniques. The number of boosting iterations t ∈ [0, T] is also a major factor in each of the boosting variants proposed in Section 3.3. Furthermore, the chosen regularization parameter C highly influences the training times of the SVM inducer: higher C-values result in larger training times (Fan et al 2008). Not only do the chosen hyperparameters affect learning times; data characteristics such as the number of minority instances, the imbalance level, the number of features, the amount of overlapping etc. have a major effect as well.


In comparing each of the methods outlined in Section 3, we make use of a similar methodology as previously presented (see Section 4.6.1). For each of the ten folds, a suitable parameter combination is selected based on optimal validation set AUC-performance. These chosen parameters are applied in determining the timings of each method. The final results indicate the average timings across ten folds. Hence, we show the required amount of time for each method when using the parameter combination resulting in the best performance. Note that this might for instance result in using β = 1/3 in case of OSR, or using only t = 5 boosting iterations in EE. Computational timings are shown in Table 9.

As one can observe from the aforementioned table, CBU is the slowest method; it relies on computationally intensive clustering procedures. The synthetic SMOTE and ADASYN oversampling approaches are very time-consuming methods as well. They both rely on computationally expensive nearest neighbour computations, in conjunction with a synthetic sample generation procedure that requires a single pass over all minority training instances.

The RUS method shows, as expected, superior results in terms of computational timings. Other undersampling methods perform quite well, though they are slower than BL and OSR due to their inherent nearest neighbour computations.

The boosting variants seem to perform worse than the BL, OSR and each of the undersampling methods. Since EE was the best method in terms of AUC-performance (see Section 4.6.1), we focus on this learner in this paragraph. For the medium sized datasets (all but CRF and Bank), OSR outperforms EE(S = 15); yet for the large datasets (CRF and Bank) the opposite is true. An SVM is known to have a quadratic computational complexity in the number of training instances (Suykens et al 2002; García and Lozano 2007). For this reason, we consider OSR (or any other oversampling technique) as inappropriate for large datasets. Since with EE each subset has a size limited to twice the minority class training size, we expect increasing benefits of the EE-technique for larger and more imbalanced behaviour data. As Junqué de Fortuny et al (2014a) have observed, larger behaviour data, in terms of the number of instances or features, also contributes to an increase in predictive power. Hence, we expect to see this type of big behaviour data more often in real-life applications, where computational resources can become a major constraint. Because the EE-technique is easily parallelizable by simultaneously performing the boosting process for each of the independent subsets, we added an extra row in Table 9: EE par corresponds with the time needed for EE(S = 15) divided by 15, indicating the time that would be required for EE if each subset were trained in parallel. We can see that EE par can outperform OSR in 8 out of 16 cases, precisely for the datasets that show a high imbalance level (e.g. p = 1 or p = []).
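As a rough illustration of why EE scales, the subset construction can be sketched as follows (names are ours; the boosting of each subset and the combination of the resulting hypotheses are omitted, and each subset could be handed to a separate worker):

```python
import random

def easy_ensemble_subsets(minority_idx, majority_idx, n_subsets=15, seed=0):
    """Draw S independent balanced subsets: all minority instances plus an
    equally sized random sample of the majority class (without replacement).
    Each subset is only twice the minority class training size, so the S
    boosting runs are cheap and fully independent (hence parallelizable)."""
    rng = random.Random(seed)
    subsets = []
    for _ in range(n_subsets):
        maj_sample = rng.sample(majority_idx, len(minority_idx))
        subsets.append(list(minority_idx) + maj_sample)
    return subsets
```

With, say, 100 minority and 4000 majority instances, every subset holds 200 instances regardless of the majority class size, which is what keeps the method fast on large, highly imbalanced data.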

We conclude our comparison of the proposed methods in Figure 5, which shows the average ranks in terms of computational timings (see Table 9) and AUC performance for each of the available data sources. Regarding the ranks with respect to the latter, we added an extra row EE par, having the same AUC-performance as EE(S = 15) for each dataset in Table 5. Ranks are subsequently determined according to the procedure outlined in Section 4.6.1. Additionally, Figure 18 in Appendix D shows the result for the large and highly imbalanced CRF and Bank datasets. We expect the latter type of data to be of major interest for reasons already indicated.


Table 9 Computational timings (in seconds) averaged across ten folds, using the parameter combination resulting in the highest validation set AUC-performance. First and second best performances are emphasized in boldface (third and fourth best are underlined). EE par is the time required for EE(S = 15) divided by 15.

Mov G(p = 1) Mov G(p = 25) Mov Th(p = []) Yahoo A(p = 1)

BL 0.032889 0.056697 0.558563 0.026922
OSR 0.055043 0.062802 0.99009 0.044421
SMOTE 0.218821 0.937057 3.841482 0.057726
ADASYN 0.284688 1.802399 5.191265 0.087694
RUS 0.011431 0.025383 0.155224 0.007991
Cl Knn 0.046599 0.599846 0.989914 0.037182
Far Knn 0.039887 0.80072 0.683023 0.027788
CBU 1.034111 10.60173 68.22839 1.692477
AB 0.169792 0.841443 3.460246 0.139251
AC(R = 28) 0.471994 2.996585 10.86907 0.366555
AC(R = R_L) 0.53376 1.179542 6.065177 0.209015
EE(S = 10) 0.117226 6.065145 1.17995 0.148973
EE(S = 15) 0.20474 7.173737 2.119991 0.180365
EE par 0.013649 0.478249 0.141333 0.012024

Yahoo A(p = 25) Yahoo G(p = 1) Yahoo G(p = 25) TaFeng(p = 1)

BL 0.092954 0.011915 0.044164 0.026728
OSR 0.027887 0.013241 0.047206 0.040919
SMOTE 1.062686 0.056153 0.883698 0.219553
ADASYN 2.050993 0.079073 1.733367 0.306618
RUS 0.048471 0.003234 0.033423 0.002916
Cl Knn 0.84391 0.025404 0.502515 0.092167
Far Knn 0.664124 0.026576 0.500206 0.080159
CBU 15.69442 1.287221 13.55035 2.467279
AB 0.445546 0.078777 0.169977 0.114619
AC(R = 28) 1.034044 0.321723 0.515953 0.926178
AC(R = R_L) 0.706215 0.226741 0.112949 0.610233
EE(S = 10) 1.026577 0.100331 1.527146 0.058052
EE(S = 15) 1.607596 0.077483 2.472582 0.10538
EE par 0.107173 0.005166 0.164839 0.007025

TaFeng(p = 25) Book(p = 1) Book(p = 25) LST(p = 1)

BL 0.032033 0.080035 0.318093 0.652045
OSR 0.032414 0.132927 0.092757 0.87152
SMOTE 5.089283 3.409418 11.43444 4.987705
ADASYN 8.148419 3.689661 12.25441 6.840083
RUS 0.020457 0.022713 0.031972 0.432839
Cl Knn 1.713731 0.400873 3.711648 2.508374
Far Knn 1.539437 0.379086 3.988552 2.511037
CBU 26.42686 4.198663 46.31987 []
AB 0.713265 0.61719 1.238585 2.466151
AC(R = 28) 1.234647 1.666131 2.330635 1.451671
AC(R = R_L) 0.279047 0.860346 0.197053 1.23763
EE(S = 10) 2.484502 2.145747 7.177484 0.524066
EE(S = 15) 3.363971 2.480066 11.21945 0.784111
EE par 0.224265 0.165338 0.747963 0.052274


Table 9 Continued. Additionally, an extra column indicating average ranks is included (lower ranks are preferred).

Adver(p = []) Adver(p = 1) CRF(p = []) Bank(p = [])

BL 0.010953 0.002796 0.725911 70.89334
OSR 0.012178 0.006166 3.685813 1797.481
SMOTE 0.123112 0.017764 5.633862 []
ADASYN 0.183767 0.021728 5.768669 []
RUS 0.012115 0.00204 0.147392 52.47441
Cl Knn 0.061324 0.005568 1.106755 73.73282
Far Knn 0.079078 0.007069 1.110379 97.59619
CBU 3.378235 3.236754 [] []
AB 0.069199 0.103518 1.153196 83.08618
AC(R = 28) 0.193092 0.068905 2.047434 71.70548
AC(R = R_L) 0.107652 0.037963 1.387174 1063.466
EE(S = 10) 0.138485 0.085686 0.198656 249.5117
EE(S = 15) 0.185136 0.139121 0.285345 364.0107
EE par 0.012342 0.009275 0.019023 24.26738

Average Rank [pos]

BL 2.94 [2]
OSR 4.19 [4]
SMOTE 9.59 [11]
ADASYN 10.91 [13]
RUS 1.38 [1]
Cl Knn 6.5 [5]
Far Knn 6.56 [6]
CBU 14 [14]
AB 8.06 [7]
AC(R = 28) 10.81 [12]
AC(R = R_L) 9.25 [9]
EE(S = 10) 8.25 [8]
EE(S = 15) 9.56 [10]
EE par 3 [3]

Timings are a substantial concern with respect to these large datasets. As can be observed, the EE-technique thrives for these big and highly imbalanced data, even in its non-parallel form.


[Figure 5 appears here: a scatter plot of average rank AUC (horizontal axis) against average rank Time (vertical axis) for BL, OSR, SMOTE, ADASYN, RUS, Cl Knn, Far Knn, CBU, AB, AC(R = 28), AC(R = R_L), EE(S = 10), EE(S = 15) and EE par.]

Fig 5 Average rank AUC versus average rank Time (see Table 9) across all datasets from Table 2 (excluding Flickr and Kdd). Regarding the AUC rank, an extra row EE par is added, having the same AUC-value as EE(S = 15). Ranks for each dataset are subsequently obtained via the procedure outlined in Section 4.6.1. Points occurring in the upper-right region are preferred.

5 The effect of the chosen base learner

In this short section we investigate the influence of the chosen base learner. More specifically, we ask ourselves whether some base inducers can more easily cope with the imbalanced learning issue, and in doing so we verify the conclusion that EE is a suitable technique to elevate upon the baseline performance. We rely on several popular classification methods for behaviour data: two regularization based techniques (SVM, LR) and two heuristic approaches (Naive Bayes (NB) and Behavioural Similarity (BeSim)). The next paragraphs briefly introduce these methods. Note that source codes and dataset results are provided in our on-line repository.

LR constructs a linear model that maximizes the likelihood of the observed data by imposing a logistic model for the parametric probability model P(Y|X). Ng and Jordan (2002) provide a thorough introduction on the topic27 and note that in its plain form LR can suffer from overfitting, especially when the input data is very high dimensional. Therefore we resort to regularized logistic regression. We opt for L2-regularized LR, since L1-regularization corresponds with a natural way of imposing feature selection (Ng 2004). The LIBLINEAR toolbox (Fan et al 2008) is used to obtain these L2-regularized LR models.
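The objective can be illustrated with a minimal pure-Python sketch (the actual models in the paper are fitted with the LIBLINEAR toolbox; the function names and the plain gradient-descent solver below are our own simplifications):

```python
import math

def train_l2_logreg(X, y, n_features, C=1.0, lr=0.1, epochs=200):
    """Plain gradient descent for L2-regularized logistic regression.
    X: list of active-feature-index lists (sparse binary rows); y in {0, 1}.
    Minimizes (1/2)||w||^2 + C * sum(log-loss), mirroring the LIBLINEAR-style
    objective; per instance only the active features are touched."""
    w = [0.0] * n_features
    b = 0.0
    for _ in range(epochs):
        grad_w = list(w)                 # gradient of the L2 penalty is w
        grad_b = 0.0
        for row, label in zip(X, y):
            z = b + sum(w[j] for j in row)
            p = 1.0 / (1.0 + math.exp(-z))
            err = C * (p - label)        # gradient of the logistic loss
            for j in row:
                grad_w[j] += err
            grad_b += err
        step = lr / len(X)
        w = [wj - step * g for wj, g in zip(w, grad_w)]
        b -= step * grad_b
    return w, b

def predict_score(w, b, row):
    # Linear decision value; higher means more likely class 1
    return b + sum(w[j] for j in row)
```

The L2 penalty keeps the weights of rarely active features small, which is exactly what prevents overfitting on very high dimensional behaviour data.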

27 For an accessible introduction, see the chapter on "Generative and Discriminative Classifiers: Naive Bayes and Logistic Regression" provided at the following link: http://www.cs.cmu.edu/~tom/NewChapters.html


NB (Ng and Jordan 2002) relies on the use of the Bayes rule, which expresses the posterior P(Y|X) as the product of the likelihood P(X|Y) and the prior P(Y), divided by the evidence P(X). The naivety of the approach originates from assuming conditional independence of the features. We impose a multivariate event model regarding the distribution P(X|Y). Junqué de Fortuny et al (2014a) provide an efficient NB implementation for dealing with large and sparse binary behaviour data.
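A minimal sketch of such a multivariate Bernoulli event model for sparse binary rows might look as follows (our own simplified implementation with Laplace smoothing, not the optimized one of Junqué de Fortuny et al; scoring touches only the active features via the log-odds trick):

```python
import math
from collections import defaultdict

class SparseBernoulliNB:
    """Multivariate Bernoulli naive Bayes; rows are lists of active
    feature indices, so P(X|Y) factorizes over per-feature Bernoullis."""

    def fit(self, X, y, n_features, alpha=1.0):
        self.classes = sorted(set(y))
        self.log_prior, self.log_odds, self.base = {}, {}, {}
        for c in self.classes:
            rows = [r for r, lab in zip(X, y) if lab == c]
            counts = defaultdict(int)
            for r in rows:
                for j in r:
                    counts[j] += 1
            self.log_prior[c] = math.log(len(rows) / len(X))
            odds, base = {}, 0.0
            for j in range(n_features):
                p = (counts[j] + alpha) / (len(rows) + 2 * alpha)  # Laplace
                odds[j] = math.log(p / (1 - p))   # paid only for active j
                base += math.log(1 - p)           # shared "all inactive" term
            self.log_odds[c], self.base[c] = odds, base
        return self

    def score(self, row, c):
        return (self.log_prior[c] + self.base[c]
                + sum(self.log_odds[c][j] for j in row))

    def predict(self, row):
        return max(self.classes, key=lambda c: self.score(row, c))
```

Because the "all features inactive" term is precomputed per class, classifying an instance costs time proportional to its number of active features only, which is what makes the event model viable for sparse behaviour data.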

The BeSim approach conceptually corresponds to projecting the bigraph to a unigraph using the procedure outlined in Section 3.2. Next, the weighted-vote relational neighbour classifier (Macskassy and Provost 2007) is used to infer unknown labels/scores through a weighted probability estimation using the known labels of neighbouring nodes. A very fast and scalable implementation of this approach (the SW-transformation) is presented in detail in Stankova et al (2015).
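Conceptually, the scoring can be sketched as below. Note that this is only an illustration of the weighted-vote idea with one simple choice of link weight (each shared feature votes with the positive fraction of its labeled holders), not the actual SW-transformation of Stankova et al (2015):

```python
from collections import defaultdict

def besim_scores(train_rows, train_labels, test_rows):
    """Sketch of BeSim-style scoring on a bigraph. Instances are linked
    when they share a (bottom-node) feature; the projected unigraph is
    never materialized, scores are accumulated feature-by-feature."""
    degree = defaultdict(int)  # labeled instances holding each feature
    pos = defaultdict(int)     # positive labeled instances holding it
    for row, lab in zip(train_rows, train_labels):
        for j in row:
            degree[j] += 1
            pos[j] += (lab == 1)
    scores = []
    for row in test_rows:
        active = [j for j in row if degree[j]]
        # average positive rate over the instance's observed features
        scores.append(sum(pos[j] / degree[j] for j in active) / len(active)
                      if active else 0.0)
    return scores
```

As in the SW-transformation, a single pass over the labeled data suffices, and scoring a test instance is linear in its number of active features.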

In Table 10 we compare the performances of the baseline (BL) application of the methods with their EE-version (with S = 15, T = 20). It is quite striking to see the EE-version dominating the baseline version when looking at each method individually. This suggests that the imbalanced learning issue has an impact for each of the aforementioned techniques. Focussing on the regularization based approaches, we observe the baseline LR to be less influenced by the imbalance than the SVM method. Their EE-versions (EE SVM and EE LR) can boost the baseline performance in almost all cases and have the highest average ranks amongst all the proposed methods. As already emphasized, the regularization based approaches offer an added element of flexibility, in the sense that the strength of the underlying base learner can be controlled.

The heuristic approaches NB and BeSim do not offer this flexibility. Furthermore, it is not directly possible to integrate instance weights during the training phase of the base learner. During the boosting process we therefore sample from the distribution Dt to generate (unweighted) examples and train a BeSim/NB learner next. The results in Table 10 indicate BL NB to be the worst performing algorithm amongst all methods tried. Due to the conditional independence assumption, one could regard this as a weak learner, which is therefore suitable to be used in a boosting process. Indeed, the EE NB approach is able to significantly improve upon its performance. We found this to be the slowest EE-version28 (e.g. taking more than 24 hours to process a single fold of the Bank dataset, versus several minutes for the other versions). The average rank estimate of EE NB is rather pessimistic, since we assigned the lowest performance across all methods in case of missing results. BeSim can be considered a stronger learner compared to NB; its possible boost in performance is typically lower than that of NB. Having said this, we found EE BeSim to improve upon BL BeSim for all datasets (except Adver(p = 1)).
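This weighted-resampling workaround can be sketched in a few lines with the standard library (names are ours; the sampled, unweighted set is then handed to the weight-agnostic base learner):

```python
import random

def resample_from_distribution(instances, weights, n_samples=None, seed=0):
    """Draw an unweighted sample (with replacement) from the boosting
    distribution Dt, so that learners without native instance-weight
    support, such as NB and BeSim here, can still be boosted."""
    rng = random.Random(seed)
    n = n_samples if n_samples is not None else len(instances)
    return rng.choices(instances, weights=weights, k=n)
```

Instances with large Dt-mass appear multiple times in the sample, which mimics the effect of instance weights in expectation.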

28 This statement only relates to the EE-version. In the baseline case, the timings of the NB-version are roughly of the same order of magnitude as SVM and LR (though this heavily depends on the chosen regularization parameter C). With EE, each subset has a size limited to twice the minority class training size (a relatively low number compared to the size of the majority class). This works in favour of SVM/LR (e.g. SVM has a computational complexity of O(m²), with m the number of training instances). Also note that the NB-version is indeed optimized for sparse data. Internally, the NB-implementation writes a fairly large model file to a specified directory (sizes of 1 GB can be encountered there). This writing operation slows down the training stage (especially in case of EE, where S × T = 15 × 20 = 300 model files are constructed).


Table 10 Average tenfold AUC test set performance of the BL compared to its EE-version (S = 15, T = 20) for SVM, LR, BeSim and NB. In case the EE NB implementation crashed, we put a × in the table. If it is too slow (> 24 hours per fold), we place a []. Additionally, average ranks are assigned to each method. In case EE NB has missing results for a particular dataset, we ensure it has the lowest rank across all other algorithms.

Mov G(p = 1) Mov G(p = 25) Mov Th(p = []) Yahoo A(p = 1)

BL SVM 71.6 (2.62) 81.41 (1.32) 79.77 (5.33) 56.49 (3.37)
EE SVM 76.12 (2.88) 85.13 (1.86) 86.43 (5.86) 59.74 (2.96)
BL LR 71.02 (2.09) 84.39 (1.84) 83.14 (4.17) 57.84 (2.39)
EE LR 76.69 (2.92) 85.03 (1.98) 86.3 (5.37) 59.79 (2.62)
BL BeSim 76.1 (3.58) 81.3 (2.92) 82.81 (6.6) 56.27 (2.73)
EE BeSim 76.31 (3.71) 81.37 (2.9) 85.02 (6.28) 57.7 (1.71)
BL NB 70.26 (5.84) 77.01 (2.54) 70.48 (10.14) 52.56 (2.09)
EE NB 75.93 (2.83) 85.56 (2.01) 86.91 (4.15) 57.55 (2.73)

Yahoo A(p = 25) Yahoo G(p = 1) Yahoo G(p = 25) TaFeng(p = 1)

BL SVM 61.61 (2.48) 66.84 (3.66) 78.82 (1.39) 55.75 (1.6)
EE SVM 66.38 (3.16) 73.48 (2.32) 80.55 (1.55) 61.13 (1.83)
BL LR 66.27 (2.96) 69.82 (1.93) 80.45 (1.59) 58.91 (2.31)
EE LR 66.22 (3.28) 73.08 (2.14) 80.53 (1.56) 61.43 (2.32)
BL BeSim 64.54 (2.02) 68.89 (2.49) 79.55 (1.96) 57.89 (1.18)
EE BeSim 65.25 (2.23) 71.18 (2.91) 80.04 (1.85) 59.36 (1.47)
BL NB 65 (1.65) 63.33 (2.56) 78.89 (1.64) 54.61 (1.2)
EE NB 66.6 (2.79) 70.99 (2.88) 81.01 (1.3) 59.01 (1.84)

TaFeng(p = 25) Book(p = 1) Book(p = 25) LST(p = 1)

BL SVM 66.94 (1.34) 52.6 (1.29) 60.08 (0.71) 99.99 (0.01)
EE SVM 70.4 (1.3) 55.34 (1.28) 65.4 (0.51) 99.98 (0.01)
BL LR 69.24 (1.3) 55.34 (1.27) 63.84 (0.75) 99.99 (0.01)
EE LR 70.28 (1.28) 55.49 (1.49) 65.41 (0.63) 99.97 (0.02)
BL BeSim 67.49 (1.23) 55.19 (1.27) 63.7 (0.63) 99.99 (0.01)
EE BeSim 68 (1.21) 55.21 (1.15) 64.38 (0.42) 99.99 (0)
BL NB 65.21 (1.64) 52.93 (0.9) 59.75 (0.47) 98.69 (0.3)
EE NB 70.72 (1.15) × 63.46 (0.61) 99.92 (0.04)

Adver(p = []) Adver(p = 1) CRF(p = []) Bank(p = [])

BL SVM 96.37 (1.94) 91.18 (2.97) 64.36 (18.97) 66.82 (0.88)
EE SVM 97.63 (1.35) 93.3 (2.14) 86.35 (9.99) 71.54 (0.76)
BL LR 97.19 (1.44) 88.51 (1.93) 81.87 (19.63) 71.43 (0.72)
EE LR 97.57 (0.96) 93.02 (2.06) 86.84 (9.62) 71.77 (0.62)
BL BeSim 97.26 (1.12) 95.38 (1.35) 86.91 (9.36) 67.85 (0.67)
EE BeSim 97.38 (1.04) 93.83 (1.35) 87.02 (10.43) 70.41 (0.55)
BL NB 93.75 (1.9) 93.37 (1.9) 87.24 (9.38) 67.83 (0.63)
EE NB 94.04 (1.75) × × []

Flickr(p = 0.1) Kdd(p = 0.5) Average Rank

BL SVM 74.92 (0.17) 74.53 (0.05) 6.44 [7]
EE SVM 79.86 (0.13) 80.98 (0.05) 2.39 [1]
BL LR 79.03 (0.11) 81.29 (0.04) 4.28 [4]
EE LR 79.85 (0.13) 80.75 (0.05) 2.61 [2]
BL BeSim 74.62 (0.13) 74.95 (0) 5.11 [6]
EE BeSim 76.4 (0.13) 77.55 (0.03) 3.61 [3]
BL NB 81.36 (0.1) 74.29 (0.05) 6.5 [8]
EE NB [] [] 5.06 [5]


6 Conclusions and future research directions

We investigated the effects of sampling and cost-sensitive learning techniques on the problem of learning from imbalanced behaviour data. This type of big data has fundamentally different properties and internal structure than traditional data and is characterized by sparseness and very large dimensions. This is a new research topic which enables benefits across a wide variety of application domains, such as fraud detection, targeted advertising, predictive policing, churn prediction, default prediction etc. To enable future developments in this field, we provide an on-line repository containing implementations, datasets and results, available at the following link: http://www.applieddatamining.com/cms/?q=software

Oversampling techniques need to be adapted in order to cope with this kind of data. We proposed new versions of the traditional SMOTE and ADASYN techniques, which differ in the way synthetic samples are generated and similarity computations are performed. The plain random oversampling (OSR) and the synthetic approaches are competitive methods in terms of AUC and do not seem to suffer from overfitting, as traditional studies report. Though the synthetic approaches have comparable AUC-performances in comparison to OSR, they are useless in practice due to their computationally expensive procedures. The OSR method is very fast for small and medium sized datasets, yet becomes inappropriate for larger datasets, because the training times of the underlying base learner increase drastically.

With respect to undersampling methodologies, we employed a random undersampling technique (RUS) and Knn-based informed approaches. The latter type selects majority class instances that are closest (Cl Knn) or farthest (Far Knn) from the minority class examples and allows one to investigate the effect of noise/outliers on SVM performance. We observed that these odd instances have a severe performance degrading effect. RUS and Far Knn approaches can significantly improve performance with respect to the baseline, thanks to their abilities to reduce noise/outlier levels and to put more emphasis on the minority class. Having said this, they usually perform worse than any of the oversampling approaches. This is in sheer contrast to the results obtained from traditional studies and is heavily related to the intrinsic nature of behaviour data. Timing-wise, RUS is the fastest method. The informed approaches are slower than RUS due to the required nearest neighbour computations. Additionally, a cluster-based undersampling (CBU) technique was proposed to alleviate the within-class imbalance. For low undersampling rates the performance of CBU is superior to plain RUS, whereas for high undersampling rates the opposite is true. Overall, no substantial gain in performance is obtained in comparison with a simple RUS approach. Furthermore, the method is the slowest among all techniques considered and relies on computationally expensive clustering heuristics.

We studied the effects of boosting, where we made use of a very specific linear SVM and logistic regression combination that uses confidence rated predictions, instead of a more popular plain "discrete" boosting algorithm (with weak hypotheses having values in {−1, 1}). Our experiments clearly indicated the regularization constant C in the SVM formulation to act as a "weakness" indicator. Indeed, higher C-values cause stronger learners and shouldn't be used in the boosting process, which matches with the analysis of studies dealing with traditional data. Strong learners can


already quite accurately distinguish majority class from minority class behaviours. The boosting process will focus on the hard to learn instances, among which are noise/outliers, and this can result in overfitting behaviour. We also observed that boosting can usually outperform the plain baseline, indicating that a combination of weak learners can achieve higher AUC-performance than a single strong learner.

EasyEnsemble (EE) integrates random undersampling with a boosting process. By sampling several balanced subsets from the training data and combining these, we can emphasize the minority class, achieve noise/outlier reduction effects and simultaneously explore the majority class space efficiently. Our version differs from the original formulation because we employ a different type of underlying weak learner (SVM and LR combination) that is fed to a confidence rated boosting algorithm (instead of the discrete version). We observed superior AUC-performance with respect to all of the aforementioned techniques. Furthermore, the method is very fast, since each subset can be trained in parallel and has a size that is twice as large as the minority class training set. We studied the effect of the number of subsets and noted that a limited amount of these (O(10)) is already sufficient to obtain satisfactory performances; increasing this number further causes only minor benefits.

Additionally, a statistical comparison of each of the proposed methods was conducted across a wide variety of datasets. Holm's significance test indicates that each of the proposed techniques (oversampling, undersampling except Cl Knn, cost-sensitive learning and boosting variants) can significantly outperform the plain baseline performance (each of the null-hypotheses of equal performance gets rejected with a confidence level of at least 91%). Comparison of the techniques with EE showed that the latter can significantly outperform all of them (each of the null-hypotheses of equal performance gets rejected with a confidence level of at least 75%). A timing-wise comparison showed EE to be the preferred method (after RUS) for large and highly imbalanced datasets. A parallel implementation would make EE the fastest method, even for medium sized datasets that show a high level of imbalance.

The influence of the chosen base learner was investigated by comparing the baseline to its EE-version for the SVM, LR, NB and BeSim inducers. Each of the EE-versions is able to improve upon the baseline performance for nearly all datasets, where base learners of the regularization type are superior because of their abilities to control the strength of the learner.

Though we conducted an extensive benchmark study, there are still plenty of future research directions. First and most important, there is an urgent need to develop a more general theoretical framework to completely understand the effect of imbalance and why certain methods outperform others. Secondly, other types of mechanisms could be investigated in this setting: advanced sampling (Guo et al 2008) set-ups perform sampling based on preliminary classifications; one-class learning is a recognition based methodology where a discriminative boundary is learned around the training examples of a single class alone, and Raskutti and Kowalczyk (2004) demonstrated the superiority of the recognition based one-class SVMs over the discriminative two-class SVMs; other types of ensemble learning strategies (e.g. bagging and stacking) could be investigated; Batista et al (2004) note that combining over- and undersampling can be beneficial when data are highly imbalanced or show very few minority class instances; the ideas behind Extended Nearest Neighbour (Tang and


He 2015) could guide us in the sampling process. As opposed to Knn, "it considers not only who are the nearest neighbours of a certain sample, but also who consider the particular sample as their nearest neighbour" (Tang and He 2015). The method assigns a class membership to an unknown sample that maximizes the intra-class coherence. Finally, more advanced techniques exist that can determine K (the number of nearest neighbours) faster or with (slightly) better performance.

Another issue relates to the lack of comprehensibility of the EasyEnsemble technique in its present form. Because we currently integrate Logistic Regression in the boosting process, the final ensemble classifier is non-linear and therefore incomprehensible. It might be interesting to investigate the numerical boosting technique proposed in Schapire and Singer (1999), where each weak learner is allowed to output real values (instead of values in [−1, 1]). In that case we would be able to use a plain linear SVM as a weak learner, without the Logistic Regression component. The final ensemble would be a linear model, which is surely easier to comprehend. Also, the instance-based approach of Martens and Provost (2014) could be used to explain individual predictions of the non-linear EE-technique.

A Oversampling experiments

Table 11 Oversampling experiments. Results show the average tenfold AUC test set performance with respect to increasing β-values (β = [β1, β2, β3, β4] = [0, 1/3, 2/3, 1]). Additionally, standard deviations are shown between brackets. Optimal performances for each method are highlighted in boldface. See Section 4.2 for parameter settings.

Mov G(p = 1)      β1           β2           β3           β4
  OSR         716 (262)    7437 (204)   736 (184)    7473 (245)
  SMOTE       716 (262)    7508 (218)   7602 (214)   7648 (23)
  ADASYN      716 (262)    7516 (192)   7593 (208)   7647 (229)

Mov G(p = 25)     β1           β2           β3           β4
  OSR         8141 (132)   8349 (181)   8384 (196)   8391 (204)
  SMOTE       8141 (132)   8332 (197)   8359 (204)   8376 (211)
  ADASYN      8141 (132)   8361 (182)   8402 (197)   8369 (196)

Mov Th(p = [])    β1           β2           β3           β4
  OSR         7977 (533)   853 (466)    8316 (45)    8459 (569)
  SMOTE       7977 (533)   8418 (651)   8558 (597)   8433 (575)
  ADASYN      7977 (533)   8411 (677)   8586 (606)   8536 (513)

Yahoo A(p = 1)    β1           β2           β3           β4
  OSR         5592 (297)   5866 (327)   5999 (228)   5974 (178)
  SMOTE       5592 (297)   5976 (262)   5974 (267)   5943 (24)
  ADASYN      5592 (297)   5954 (253)   5955 (294)   5956 (222)

Yahoo A(p = 25)   β1           β2           β3           β4
  OSR         6168 (242)   6419 (317)   6508 (326)   6467 (21)
  SMOTE       6168 (242)   6546 (363)   6533 (323)   6452 (298)
  ADASYN      6168 (242)   6504 (374)   6541 (347)   644 (221)

44 Jellis Vanhoeyveld David Martens

Table 11 continued
Yahoo G(p = 1)    β1           β2           β3           β4
  OSR         6684 (366)   7218 (236)   7311 (27)    7249 (341)
  SMOTE       6684 (366)   7265 (285)   7327 (336)   7337 (356)
  ADASYN      6684 (366)   7287 (283)   7318 (32)    7339 (359)

Yahoo G(p = 25)   β1           β2           β3           β4
  OSR         7882 (139)   7878 (202)   7874 (186)   7835 (147)
  SMOTE       7882 (139)   7923 (157)   791 (12)     7903 (189)
  ADASYN      7882 (139)   7912 (143)   7923 (137)   7951 (201)

TaFeng(p = 1)     β1           β2           β3           β4
  OSR         5575 (16)    5923 (196)   60 (168)     6104 (236)
  SMOTE       5575 (16)    6026 (195)   6149 (18)    6113 (152)
  ADASYN      5575 (16)    6026 (19)    6144 (185)   6116 (15)

TaFeng(p = 25)    β1           β2           β3           β4
  OSR         6694 (134)   6884 (121)   6699 (142)   677 (141)
  SMOTE       6694 (134)   6847 (15)    6707 (115)   6665 (081)
  ADASYN      6694 (134)   6862 (138)   6785 (16)    6691 (139)

Book(p = 1)       β1           β2           β3           β4
  OSR         526 (129)    5361 (094)   5541 (175)   5587 (144)
  SMOTE       526 (129)    5477 (099)   5491 (08)    5436 (098)
  ADASYN      526 (129)    5486 (113)   5506 (073)   5454 (092)

Book(p = 25)      β1           β2           β3           β4
  OSR         6008 (071)   6105 (094)   6182 (112)   6462 (057)
  SMOTE       6008 (071)   626 (073)    6095 (068)   63 (08)
  ADASYN      6008 (071)   6233 (073)   6077 (085)   6304 (058)

LST(p = 1)        β1           β2           β3           β4
  OSR         9999 (001)   9999 (001)   9999 (001)   9999 (001)
  SMOTE       9999 (001)   9999 (001)   9999 (001)   9999 (001)
  ADASYN      9999 (001)   9999 (001)   9999 (001)   9999 (001)

Adver(p = [])     β1           β2           β3           β4
  OSR         9661 (182)   9731 (165)   9707 (184)   9707 (179)
  SMOTE       9661 (182)   9691 (166)   9719 (165)   9707 (191)
  ADASYN      9661 (182)   971 (17)     9708 (187)   9707 (188)

Adver(p = 1)      β1           β2           β3           β4
  OSR         9093 (302)   9127 (303)   9266 (282)   9329 (197)
  SMOTE       9093 (302)   9251 (203)   9296 (214)   9353 (181)
  ADASYN      9093 (302)   9222 (233)   927 (236)    9388 (173)

CRF(p = [])       β1           β2           β3           β4
  OSR         6406 (1643)  8082 (1294)  8128 (1227)  8191 (1128)
  SMOTE       6406 (1643)  7864 (1686)  8252 (1374)  7932 (1626)
  ADASYN      6406 (1643)  7895 (1672)  8119 (1632)  7931 (1619)

Bank(p = [])      β1           β2           β3           β4
  OSR         6682 (088)   701 (074)    7139 (08)    7147 (08)
  SMOTE
  ADASYN


B Undersampling experiments

Table 12 Undersampling experiments. Results show the average tenfold AUC test set performance with respect to increasing βu-values (βu = [βu1, βu2, βu3, βu4, βu5] = [0, 1/4, 1/2, 3/4, 1]). Optimal performances (boldface) are highlighted for each imbalance ratio. Additionally, standard deviations are shown between brackets. CL K represents the "Closest Knn" technique, CL T represents the "Closest tot sim" technique (similar for Far K and Far T, see Section 3.2). Parameter settings are given in Section 4.2.

Mov G(p = 1)    βu1        βu2        βu3        βu4        βu5
  RUS       716(26)    7183(26)   7254(25)   7239(31)   7061(35)
  Cl K      716(26)    714(2)     7096(19)   7043(24)   6905(41)
  CL T      716(26)    7028(25)   6674(2)    668(21)    6818(36)
  Far K     716(26)    7236(27)   7126(34)   6657(52)   535(35)
  Far T     716(26)    7222(28)   7163(36)   6428(53)   5088(44)
  CBU       7255(26)   7328(26)   7312(26)   7384(25)   73(31)

Mov G(p = 25)   βu1        βu2        βu3        βu4        βu5
  RUS       8141(13)   8136(13)   8178(17)   8205(17)   816(21)
  Cl K      8141(13)   8086(12)   8095(16)   7973(23)   7795(23)
  CL T      8141(13)   799(12)    7821(14)   7787(15)   7776(23)
  Far K     8141(13)   809(15)    7817(18)   7425(24)   6979(32)
  Far T     8141(13)   8086(15)   772(24)    7116(27)   624(28)
  CBU       8153(14)   8164(13)   8129(16)   8128(21)   8034(27)

Mov Th(p = [])  βu1        βu2        βu3        βu4        βu5
  RUS       7977(53)   8032(58)   8157(55)   8186(66)   8126(62)
  Cl K      7977(53)   7925(45)   7807(5)    7625(65)   6246(85)
  CL T      7977(53)   784(44)    7241(35)   6466(45)   6037(73)
  Far K     7977(53)   8454(5)    8364(64)   8002(73)   5682(103)
  Far T     7977(53)   8503(57)   8268(68)   7561(92)   5677(109)
  CBU       8011(58)   8117(6)    8108(65)   8417(51)   8096(69)

Yahoo A(p = 1)  βu1        βu2        βu3        βu4        βu5
  RUS       5592(3)    5557(34)   5644(3)    5583(34)   5637(33)
  Cl K      5592(3)    5567(24)   5312(2)    5057(18)   5379(35)
  CL T      5592(3)    5569(21)   5335(22)   5031(22)   5235(33)
  Far K     5592(3)    5735(22)   5692(11)   5695(23)   5118(2)
  Far T     5592(3)    5693(24)   5474(19)   5701(18)   5118(2)
  CBU       5821(26)   5845(33)   5831(35)   5839(35)   5609(26)

Yahoo A(p = 25) βu1        βu2        βu3        βu4        βu5
  RUS       6168(24)   629(29)    6362(36)   6375(31)   6319(19)
  Cl K      6168(24)   6114(21)   5762(16)   5402(18)   5148(14)
  CL T      6168(24)   6089(28)   5811(14)   544(21)    5176(14)
  Far K     6168(24)   6396(3)    6262(22)   5961(15)   5625(16)
  Far T     6168(24)   6371(24)   5972(16)   5727(11)   5447(11)
  CBU       6246(26)   6185(14)   6178(22)   5994(3)    601(4)

Yahoo G(p = 1)  βu1        βu2        βu3        βu4        βu5
  RUS       6684(37)   6785(32)   6836(32)   6823(4)    699(42)
  Cl K      6684(37)   6671(28)   643(36)    6198(39)   6115(19)
  CL T      6684(37)   6579(27)   6355(33)   5921(35)   6108(24)
  Far K     6684(37)   6676(41)   6384(34)   6516(2)    485(29)
  Far T     6684(37)   6695(41)   6348(29)   6516(2)    4848(29)
  CBU       6968(41)   7059(32)   7064(37)   702(29)    6335(36)

Yahoo G(p = 25) βu1        βu2        βu3        βu4        βu5
  RUS       7882(14)   7891(16)   7897(16)   7861(16)   7782(21)
  Cl K      7882(14)   7726(15)   7252(15)   6786(2)    6507(27)
  CL T      7882(14)   7683(1)    7199(18)   6715(23)   611(27)
  Far K     7882(14)   7826(22)   7469(27)   6722(21)   6072(23)
  Far T     7882(14)   7768(26)   7244(3)    6494(24)   596(2)
  CBU       7525(32)   7522(24)   7469(23)   7307(24)   7069(24)

TaFeng(p = 1)   βu1        βu2        βu3        βu4        βu5
  RUS       5575(16)   561(16)    5626(17)   5723(17)   5925(22)
  Cl K      5575(16)   5568(16)   5558(15)   5508(11)   5105(15)
  CL T      5575(16)   5567(16)   5447(16)   4753(16)   493(11)
  Far K     5575(16)   5899(12)   5947(11)   6004(12)   5631(1)
  Far T     5575(16)   5892(13)   5925(13)   5858(11)   5631(1)
  CBU       578(1)     5847(11)   5815(09)   5887(14)   5765(16)

TaFeng(p = 25)  βu1        βu2        βu3        βu4        βu5
  RUS       6694(13)   6744(13)   681(14)    6827(14)   6613(12)
  Cl K      6694(13)   6613(14)   6339(12)   5983(13)   5694(07)
  CL T      6694(13)   6638(15)   6289(16)   5746(13)   5456(13)
  Far K     6694(13)   6806(14)   6643(16)   6446(15)   6335(13)
  Far T     6694(13)   6431(11)   6269(1)    6127(11)   5903(1)
  CBU       6481(12)   6415(11)   6413(12)   6388(08)   6346(08)

Book(p = 1)     βu1        βu2        βu3        βu4        βu5
  RUS       526(13)    5279(09)   5346(08)   5389(09)   5405(09)
  Cl K      526(13)    5256(12)   5252(13)   5239(11)   5309(11)
  CL T      526(13)    5256(12)   5252(13)   5239(11)   5305(07)
  Far K     526(13)    5521(12)   5621(18)   5614(12)   5306(1)
  Far T     526(13)    5521(12)   5621(18)   5614(12)   5306(1)
  CBU       5428(09)   5377(1)    5333(11)   5334(09)   5284(08)

Book(p = 25)    βu1        βu2        βu3        βu4        βu5
  RUS       6008(07)   6013(06)   604(08)    6033(08)   6328(08)
  Cl K      6008(07)   5996(07)   6013(08)   5996(1)    5928(07)
  CL T      6008(07)   5996(07)   6013(08)   6029(04)   545(09)
  Far K     6008(07)   6329(1)    6419(08)   573(11)    5566(11)
  Far T     6008(07)   6214(05)   5827(06)   5637(1)    5566(11)
  CBU       5482(09)   5467(09)   5471(09)   5466(1)    5478(09)

LST(p = 1)      βu1        βu2        βu3        βu4        βu5
  RUS       9999(0)    9999(0)    9999(0)    9998(0)    9999(0)
  Cl K      9999(0)    9999(0)    9999(0)    9999(0)    9999(0)
  CL T      9999(0)    9999(0)    9999(0)    9999(0)    9998(0)
  Far K     9999(0)    9998(0)    9998(0)    9998(0)    9998(0)
  Far T     9999(0)    9998(0)    9998(0)    9998(0)    9998(0)
  CBU       []         []         []         []         []

Adver(p = [])   βu1        βu2        βu3        βu4        βu5
  RUS       9661(18)   9632(18)   9663(14)   9712(21)   9622(16)
  Cl K      9661(18)   9644(15)   9614(15)   9604(2)    948(25)
  CL T      9661(18)   9587(21)   9432(19)   9301(22)   9072(23)
  Far K     9661(18)   9653(14)   9576(2)    9439(18)   9049(31)
  Far T     9661(18)   9654(15)   9567(19)   9454(18)   893(28)
  CBU       9685(23)   9685(23)   9705(15)   966(16)    9606(21)

Adver(p = 1)    βu1        βu2        βu3        βu4        βu5
  RUS       9093(3)    9153(31)   9237(34)   919(29)    9193(22)
  Cl K      9093(3)    9064(3)    8987(39)   9021(36)   8918(2)
  CL T      9093(3)    897(35)    8855(34)   8576(33)   882(23)
  Far K     9093(3)    938(23)    924(26)    8873(34)   8551(4)
  Far T     9093(3)    9362(24)   932(22)    8841(36)   8551(4)
  CBU       9322(24)   9376(25)   9389(26)   9352(27)   9127(2)

CRF(p = [])     βu1        βu2        βu3        βu4        βu5
  RUS       6406(164)  6328(159)  6798(174)  6695(219)  8773(88)
  Cl K      6406(164)  6244(166)  6234(169)  7137(138)  7822(177)
  CL T      6406(164)  6244(166)  6234(169)  7137(138)  6267(229)
  Far K     6406(164)  838(142)   8393(148)  8449(137)  8611(97)
  Far T     6406(164)  838(142)   8393(148)  8449(137)  8611(97)
  CBU       []         []         []         []         []

Bank(p = [])    βu1        βu2        βu3        βu4        βu5
  RUS       6682(09)   6702(09)   6737(08)   6799(06)   695(1)
  Cl K      6682(09)   6617(07)   6524(06)   6486(06)   5853(11)
  CL T      6682(09)   6492(11)   6069(09)   5633(08)   5287(07)
  Far K     6682(09)   6695(06)   6619(06)   6442(06)   5825(11)
  Far T     6682(09)   6716(06)   642(08)    5967(1)    5825(11)
  CBU       []         []         []         []         []


C Boosting experiments

This section presents the results of each of the boosting experiments from Section 4.5 on the remaining data sources. Each of the figures below depicts the average tenfold AUC-performance on test data (with µ = 100) with respect to the number of boosting iterations for (left) AdaBoost (AB), AdaCost (AC) and EasyEnsemble (EE), with C chosen according to highest validation set AUC-performance (over all possible boosting rounds), and (right) AB and EE (with S = 15) with varying C-levels.

Fig 6 Mov G(p = 1) dataset
Fig 7 Mov Th(p = []) dataset
Fig 8 Yahoo A(p = 1) dataset
Fig 9 Yahoo A(p = 25) dataset
Fig 10 Yahoo G(p = 1) dataset
Fig 11 Yahoo G(p = 25) dataset
Fig 12 TaFeng(p = 1) dataset
Fig 13 Book(p = 1) dataset
Fig 14 LST(p = 1) dataset
Fig 15 Adver(p = []) dataset
Fig 16 Adver(p = 1) dataset
Fig 17 CRF(p = []) dataset

D Final Comparison


Fig 18 Average rank AUC versus average rank Time (see Table 9) across the large datasets from Table 2 (CRF and Bank). Regarding the former, an extra row EE par is added, having the same AUC-value as EE(S = 15). Ranks for each dataset are subsequently obtained via the procedure outlined in Section 4.6.1. Points occurring in the upper-right region are preferred.


References

Akbani R Kwek S Japkowicz N (2004) Applying support vector machines to imbalanced datasetsIn Machine Learning ECML 2004 15th European Conference on Machine Learning Pisa ItalySeptember 20-24 2004 Proceedings Springer Berlin Heidelberg Berlin Heidelberg pp 39ndash50 DOI101007978-3-540-30115-8 7

Ali A Shamsuddin SM Ralescu AL (2015) Classification with class imbalance problem A review Inter-national Journal of Advances in Soft Computing and its Applications 7(3)176ndash204

Alzahrani T Horadam KJ (2016) Community detection in bipartite networks Algorithms and case studiesIn Complex Systems and Networks Dynamics Controls and Applications Springer Berlin Heidel-berg Berlin Heidelberg pp 25ndash50 DOI 101007978-3-662-47824-0 2

Bachner J (2013) Predictive policing Preventing crime with data and analytics IBM Center for the Busi-ness of Government

Baesens B Van Gestel T Viaene S Stepanova M Suykens J Vanthienen J (2003) Benchmarking state-of-the-art classification algorithms for credit scoring Journal of the Operational Research Society54(6)627ndash635 DOI 101057palgravejors2601545

Barandela R Snchez J Garca V Rangel E (2003) Strategies for learning in class imbalance problemsPattern Recognition 36(3)849 ndash 851 DOI httpsdoiorg101016S0031-3203(02)00257-1

Barber MJ (2007) Modularity and community detection in bipartite networks Physical Review E76066102 DOI 101103PhysRevE76066102

Barua S Islam MM Yao X Murase K (2014) MWMOTEndashmajority weighted minority oversamplingtechnique for imbalanced data set learning IEEE Transactions on Knowledge and Data Engineer-ing 26(2)405ndash425 DOI 101109TKDE2012232

Batista GEAPA Prati RC Monard MC (2004) A study of the behavior of several methods for balancingmachine learning training data SIGKDD Explor Newsl 6(1)20ndash29 DOI 10114510077301007735

Beckett SJ (2016) Improved community detection in weighted bipartite networks Royal Society OpenScience 3(1) DOI 101098rsos140536

Bekkar M Djemaa HK Alitouche TA (2013) Evaluation measures for models assessment over imbalanceddata sets Journal of Information Engineering and Applications 3(10)27ndash38

Bhattacharyya S Jha S Tharakunnel K Westland JC (2011) Data mining for credit card fraud A compara-tive study Decision Support Systems 50(3)602 ndash 613 DOI httpsdoiorg101016jdss201008008

Blondel VD Guillaume JL Lambiotte R Lefebvre E (2008) Fast unfolding of communities in large net-works Journal of Statistical Mechanics Theory and Experiment 2008(10)P10008

Breiman L Friedman J Stone CJ Olshen RA (1984) Classification and regression trees. Taylor & Francis
Brozovsky L Petricek V (2007) Recommender system for online dating service. In: Proceedings of Znalosti 2007 Conference, VSB Ostrava
Cha M Mislove A Gummadi KP (2009) A measurement-driven analysis of information propagation in the Flickr social network. In: Proceedings of the 18th International Conference on World Wide Web, ACM, New York, NY, USA, WWW '09, pp 721–730, DOI 10.1145/1526709.1526806

Chawla NV (2005) Data mining for imbalanced datasets An overview In Data mining and knowledgediscovery handbook Springer US Boston MA pp 853ndash867

Chawla NV Bowyer KW Hall LO Kegelmeyer WP (2002) SMOTE synthetic minority over-samplingtechnique Journal of artificial intelligence research 16321ndash357

Chawla NV Lazarevic A Hall LO Bowyer KW (2003) Smoteboost Improving prediction of the minorityclass in boosting In Knowledge Discovery in Databases PKDD 2003 Springer Berlin HeidelbergBerlin Heidelberg pp 107ndash119

Chawla NV Japkowicz N Kotcz A (2004) Editorial Special issue on learning from imbalanced data setsSIGKDD Explor Newsl 6(1)1ndash6 DOI 10114510077301007733

Chen M Mao S Liu Y (2014) Big data A survey Mobile Networks and Applications 19(2)171ndash209DOI 101007s11036-013-0489-0

Chyi YM (2003) Classification analysis techniques for skewed class distribution problems Master thesisDepartment of Information management National Sun Yat-Sen University

Demsar J (2006) Statistical comparisons of classifiers over multiple data sets Journal of Machine LearningResearch 7(Jan)1ndash30

Dhillon IS (2001) Co-clustering documents and words using bipartite spectral graph partitioning In Pro-ceedings of the Seventh ACM SIGKDD International Conference on Knowledge Discovery and DataMining ACM New York NY USA KDD rsquo01 pp 269ndash274 DOI 101145502512502550


Drummond C Holte RC (2003) C45 class imbalance and cost sensitivity why under-sampling beatsover-sampling In Proceedings of the ICML rsquo03 Workshop on Learning from Imbalanced Datasets

Fan RE Chang KW Hsieh CJ Wang XR Lin CJ (2008) LIBLINEAR A library for large linear classifi-cation Journal of Machine Learning Research 91871ndash1874

Fan W Stolfo SJ Zhang J Chan PK (1999) AdaCost Misclassification cost-sensitive boosting In Pro-ceedings of the Sixteenth International Conference on Machine Learning Morgan Kaufmann Pub-lishers Inc San Francisco CA USA ICML rsquo99 pp 97ndash105

Fawcett T (2006) An introduction to ROC analysis Pattern Recognition Letters 27(8)861 ndash 874 DOIhttpsdoiorg101016jpatrec200510010

Finch H (2005) Comparison of distance measures in cluster analysis with dichotomous data Journal ofData Science 3(1)85ndash100

Fortunato S (2010) Community detection in graphs Physics Reports 486(3 5)75 ndash 174 DOI httpsdoiorg101016jphysrep200911002

Junque de Fortuny E Martens D Provost F (2014a) Predictive modeling with big data is bigger reallybetter Big Data 1(4)215ndash226 DOI 101089big20130037

Junque de Fortuny E Stankova M Moeyersoms J Minnaert B Provost F Martens D (2014b) Corporateresidence fraud detection In Proceedings of the 20th ACM SIGKDD International Conference onKnowledge Discovery and Data Mining ACM New York NY USA KDD rsquo14 pp 1650ndash1659 DOI10114526233302623333

Frasca M Bertoni A Re M Valentini G (2013) A neural network algorithm for semi-supervised node labellearning from unbalanced data Neural Networks 4384 ndash 98 DOI httpsdoiorg101016jneunet201301021

Friedman M (1937) The use of ranks to avoid the assumption of normality implicit in the analysis ofvariance Journal of the american statistical association 32(200)675ndash701

García E Lozano F (2007) Boosting support vector machines. In: Machine Learning and Data Mining in Pattern Recognition, 5th International Conference, MLDM 2007, Leipzig, Germany, July 18-20, Post Proceedings, IBaI publishing, pp 153–167

Goldstein M Uchida S (2016) A comparative evaluation of unsupervised anomaly detection algorithmsfor multivariate data PLOS ONE 11(4)1ndash31 DOI 101371journalpone0152173

González PC Velásquez JD (2013) Characterization and detection of taxpayers with false invoices using data mining techniques. Expert Systems with Applications 40(5):1427–1436, DOI https://doi.org/10.1016/j.eswa.2012.08.051

Guimera R Sales-Pardo M Amaral LAN (2007) Module identification in bipartite and directed networksPhysical Review E 76036102 DOI 101103PhysRevE76036102

Guo H Viktor HL (2004) Learning from imbalanced data sets with boosting and data generation TheDataBoost-IM approach SIGKDD Explor Newsl 6(1)30ndash39 DOI 10114510077301007736

Guo X Yin Y Dong C Yang G Zhou G (2008) On the class imbalance problem In 2008 Fourth Interna-tional Conference on Natural Computation IEEE vol 4 pp 192ndash201 DOI 101109ICNC2008871

Han H Wang WY Mao BH (2005) Borderline-SMOTE A new over-sampling method in imbalanced datasets learning In Advances in Intelligent Computing Springer Berlin Heidelberg Berlin Heidelbergpp 878ndash887

He H Garcia EA (2009) Learning from imbalanced data IEEE Transactions on Knowledge and DataEngineering 21(9)1263ndash1284 DOI 101109TKDE2008239

He H Bai Y Garcia EA Li S (2008) ADASYN Adaptive synthetic sampling approach for imbalancedlearning In 2008 IEEE International Joint Conference on Neural Networks (IEEE World Congresson Computational Intelligence) IEEE pp 1322ndash1328 DOI 101109IJCNN20084633969

Holm S (1979) A simple sequentially rejective multiple test procedure Scandinavian journal of statistics6(2)65ndash70

Hsu CW Lin CJ (2002) A comparison of methods for multiclass support vector machines IEEE Transac-tions on Neural Networks 13(2)415ndash425 DOI 10110972991427

Huang A (2008) Similarity measures for text document clustering In Proceedings of the sixth new zealandcomputer science research student conference (NZCSRSC2008) Christchurch New Zealand pp 49ndash56

Iman RL Davenport JM (1980) Approximations of the critical region of the Friedman statistic Commu-nications in Statistics-Theory and Methods 9(6)571ndash595

Jo T Japkowicz N (2004) Class imbalances versus small disjuncts ACM SIGKDD Explor Newsl 6(1)40ndash49 DOI 10114510077301007737


Jutla IS Jeub LG Mucha PJ (2011-2016) A generalized Louvain method for community detection implemented in MATLAB. URL http://netwiki.amath.unc.edu/GenLouvain

Kubat M Matwin S (1997) Addressing the curse of imbalanced training sets One-sided selection InProceedings of the Fourteenth International Conference on Machine Learning Morgan KaufmannPublishers Inc San Francisco CA USA pp 179ndash186

Lancichinetti A Fortunato S (2009) Community detection algorithms A comparative analysis PhysicalReview E 80056117 DOI 101103PhysRevE80056117

Larremore DB Clauset A Jacobs AZ (2014) Efficiently inferring community structure in bipartitenetworks Physical Review E Statistical Nonlinear and Soft Matter Physics 90012805 DOI101103PhysRevE90012805

Li J Fine JP (2010) Weighted area under the receiver operating characteristic curve and its application togene selection Journal of the Royal Statistical Society Series C (Applied Statistics) 59(4)673ndash692DOI 101111j1467-9876201000713x

Li X Wang L Sung E (2008) AdaBoost with SVM-based component classifiers Engineering Applicationsof Artificial Intelligence 21(5)785ndash795 DOI httpsdoiorg101016jengappai200707001

Lichman M (2013) UCI machine learning repository. URL http://archive.ics.uci.edu/ml
Liu W Chawla S Cieslak DA Chawla NV (2010) A robust decision tree algorithm for imbalanced data sets. In: Proceedings of the tenth SIAM international conference on data mining, SIAM, Philadelphia, vol 10, pp 766–777

Liu XY Wu J Zhou ZH (2009) Exploratory undersampling for class-imbalance learning IEEE Transac-tions on Systems Man and Cybernetics Part B (Cybernetics) 39(2)539ndash550 DOI 101109TSMCB20082007853

Luts J Ojeda F Van de Plas R De Moor B Van Huffel S Suykens JA (2010) A tutorial on supportvector machine-based methods for classification problems in chemometrics Analytica Chimica Acta665(2)129ndash145 DOI httpsdoiorg101016jaca201003030

Macskassy SA Provost F (2007) Classification in networked data A toolkit and a univariate case studyJournal of Machine Learning Research 8(May)935ndash983

Martens D Provost F (2014) Explaining data-driven document classifications MIS Quarterly 38(1)73ndash100

Martens D Provost F Clark J Junque de Fortuny E (2016) Mining massive fine-grained behavior data toimprove predictive analytics MIS Quarterly 40(4)869ndash888

Mazurowski MA Habas PA Zurada JM Lo JY Baker JA Tourassi GD (2008) Training neural networkclassifiers for medical decision making The effects of imbalanced datasets on classification perfor-mance Neural Networks 21(23)427 ndash 436 DOI httpsdoiorg101016jneunet200712031

Mease D Wyner AJ Buja A (2007) Boosted classification trees and class probabilityquantile estimationJournal of Machine Learning Research 8409ndash439

Nemenyi P (1963) Distribution-free Multiple Comparisons. Dissertation, Princeton University
Newman MEJ Girvan M (2004) Finding and evaluating community structure in networks. Physical Review E 69:026113, DOI 10.1103/PhysRevE.69.026113
Ng AY (2004) Feature selection, L1 vs L2 regularization, and rotational invariance. In: Proceedings of the Twenty-first International Conference on Machine Learning, ACM, New York, NY, USA, ICML '04, pp 78–, DOI 10.1145/1015330.1015435

Ng AY Jordan MI (2002) On discriminative vs generative classifiers A comparison of logistic regressionand naive bayes In Advances in Neural Information Processing Systems 14 MIT Press pp 841ndash848

Ngai E Hu Y Wong Y Chen Y Sun X (2011) The application of data mining techniques in financial frauddetection A classification framework and an academic review of literature Decision Support Systems50(3)559 ndash 569 DOI httpsdoiorg101016jdss201008006

Platt JC (1999) Probabilistic outputs for support vector machines and comparisons to regularized likeli-hood methods In Advances in Large-Margin Classifiers MIT Press pp 61ndash74

Porter MA Onnela JP Mucha PJ (2009) Communities in networks Notices of the American MathematicalSociety 56(9)1082ndash1097

Provost F Fawcett T (2013) Data Science for Business: What you need to know about data mining and data-analytic thinking. O'Reilly Media, Inc.

Provost F Dalessandro B Hook R Zhang X Murray A (2009) Audience selection for on-line brandadvertising Privacy-friendly social network targeting In Proceedings of the 15th ACM SIGKDDInternational Conference on Knowledge Discovery and Data Mining ACM New York NY USAKDD rsquo09 pp 707ndash716 DOI 10114515570191557098


Raskutti B Kowalczyk A (2004) Extreme re-balancing for SVMs A case study SIGKDD Explor Newsl6(1)60ndash69 DOI 10114510077301007739

Rosvall M Bergstrom CT (2008) Maps of random walks on complex networks reveal community structureProceedings of the National Academy of Sciences 105(4)1118ndash1123 DOI 101073pnas0706851105

Schapire RE (1999) A brief introduction to boosting In Proceedings of the 16th International Joint Con-ference on Artificial Intelligence - Volume 2 Morgan Kaufmann Publishers Inc San Francisco CAUSA IJCAIrsquo99 pp 1401ndash1406

Schapire RE Singer Y (1999) Improved boosting algorithms using confidence-rated predictions Machinelearning 37(3)297ndash336 DOI 101023A1007614523901

Shmueli G (2017) Analyzing behavioral big data Methodological practical ethical and moral issuesQuality Engineering 29(1)57ndash74 DOI 1010800898211220161210979

Sobhani P Viktor H Matwin S (2015) Learning from imbalanced data using ensemble methods andcluster-based undersampling In New Frontiers in Mining Complex Patterns Third InternationalWorkshop NFMCP 2014 Held in Conjunction with ECML-PKDD 2014 Nancy France Septem-ber 19 2014 Revised Selected Papers Springer International Publishing Cham pp 69ndash83 DOI101007978-3-319-17876-9 5

Stankova M (2016) Classification within network data with a bipartite structure Dissertation Universityof Antwerp

Stankova M Martens D Provost F (2015) Classification over bipartite graphs through projection WorkingPapers 2015001 University of Antwerp Faculty of Applied Economics

Sun Y Kamel MS Wong AK Wang Y (2007) Cost-sensitive boosting for classification of imbalanceddata Pattern Recognition 40(12)3358 ndash 3378 DOI httpsdoiorg101016jpatcog200704009

Suykens JA Van Gestel T De Brabanter J De Moor B Vandewalle J Suykens J Van Gestel T (2002)Least squares support vector machines World Scientific Singapore

Tang B He H (2015) Enn Extended nearest neighbor method for pattern recognition [research frontier]IEEE Computational Intelligence Magazine 10(3)52ndash60 DOI 101109MCI20152437512

Tang Y Zhang YQ Chawla NV Krasser S (2009) SVMs modeling for highly imbalanced classificationIEEE Transactions on Systems Man and Cybernetics Part B (Cybernetics) 39(1)281ndash288 DOI101109TSMCB20082002909

Tobback E Moeyersoms J Stankova M Martens D (2016) Bankruptcy prediction for SMEs using rela-tional data Working Paper 2016004 University of Antwerp Faculty of Applied Economics

Verbeke W Dejaeger K Martens D Hur J Baesens B (2012) New insights into churn prediction in thetelecommunication sector A profit driven data mining approach European Journal of OperationalResearch 218(1)211 ndash 229 DOI httpsdoiorg101016jejor201109031

Veropoulos K Campbell I Cristianini N (1999) Controlling the sensitivity of support vector machinesIn Proceedings of the International Joint Conference on Artificial Intelligence Stockholm Sweden(IJCAI99) pp 55 ndash 60

Whitrow C Hand DJ Juszczak P Weston D Adams NM (2009) Transaction aggregation as a strategyfor credit card fraud detection Data Mining and Knowledge Discovery 18(1)30ndash55 DOI 101007s10618-008-0116-z

Wickramaratna J Holden SB Buxton BF (2001) Performance degradation in boosting In Proceedings ofthe Second International Workshop on Multiple Classifier Systems Springer London UK MCS rsquo01pp 11ndash21

Yen SJ Lee YS (2009) Cluster-based under-sampling approaches for imbalanced data distributions ExpertSystems with Applications 36(3 Part 1)5718 ndash 5727 DOI httpsdoiorg101016jeswa200806108

Yu HF Lo HY Hsieh HP Lou JK McKenzie TG Chou JW Chung PH Ho CH Chang CF Wei YH et al(2010) Feature engineering and classifier ensemble for kdd cup 2010 In Proceedings of the KDDCup 2010 Workshop pp 1ndash16

Zha H He X Ding C Simon H Gu M (2001) Bipartite graph partitioning and data clustering In Pro-ceedings of the Tenth International Conference on Information and Knowledge Management ACMNew York NY USA CIKM rsquo01 pp 25ndash32 DOI 101145502585502591

Zhang J Mani I (2003) Knn approach to unbalanced data distributions A case study involving informationextraction In Proceedings of the ICMLrsquo2003 Workshop on Learning from Imbalanced DatasetsWashington DC

Ziegler CN McNee SM Konstan JA Lausen G (2005) Improving recommendation lists through topicdiversification In Proceedings of the 14th International Conference on World Wide Web ACM NewYork NY USA WWW rsquo05 pp 22ndash32 DOI 10114510607451060754


22 Jellis Vanhoeyveld David Martens

R = [2, 8, R_L], where R_L = |X_maj,train| / |X_min,train|

S = [5, 10, 15]

The AdaBoost algorithm includes the T, µ and C parameters; AdaCost additionally uses the cost-ratios R. We chose a range of values because misclassification costs are unknown for many business applications (He and Garcia 2009; Fan et al 1999; Sun et al 2007). The final value R_L is a popular choice (Akbani et al 2004; Luts et al 2010) because the total weight on the majority class then balances with the total weight on the minority class. The final method, EasyEnsemble, uses S subsets in addition to the parameters previously mentioned for AdaBoost. Note that we consider the boosting iteration round t ∈ [1, T] as a tunable parameter.19
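As a small illustration of how such a grid can be assembled, the balanced cost-ratio R_L follows directly from the training class sizes. The class counts below are hypothetical, not taken from the paper's datasets:

```python
# Sketch: build the AdaCost cost-ratio grid R = [2, 8, R_L], where
# R_L = |X_maj,train| / |X_min,train| balances the total class weights.
def cost_ratio_grid(n_majority, n_minority):
    r_l = n_majority / n_minority
    return [2, 8, r_l]

# Hypothetical 95/5 split: R_L = 19.0
grid = cost_ratio_grid(n_majority=9500, n_minority=500)
```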

4.3 Oversampling

The oversampling techniques presented in Section 3.1 are applied to each of the datasets from Table 2 with varying imbalance levels (by varying the p-parameter). The experiments we conducted use the following methodology: for each dataset, we apply the oversampling techniques with all possible parameter combinations on the training data to create newly balanced datasets, after which linear SVMs are trained. The optimal parameter combination with respect to each imbalance ratio β is selected based on validation set AUC-performance. With these parameters, the AUC on the test data is obtained. Results show the average over ten folds.

The results on four arbitrarily selected datasets are shown in Table 3. Full results on each of the data sources can be found in Appendix A, Table 11. From these tables we can conclude that performance generally increases with growing β-values. More precisely, performance keeps improving with growing β-levels until an optimal point β is reached; increasing the balance level beyond this optimal value causes only small fluctuations with respect to the optimal performance. Traditional studies dealing with dense, low-dimensional data note that the OSR method can suffer from overfitting, as already mentioned in Section 3.1. It is interesting to see that in this sparse, high-dimensional setting this effect doesn't seem to occur.20 Furthermore, the computationally expensive synthetic sampling approaches do not seem to improve performance over the simpler OSR method. This can be explained by the fact that many instances have a very limited number of non-zero elements (Junque de Fortuny et al 2014a; Stankova et al 2015), causing the synthetic sampling procedures to be limited in the number of unique new samples they can produce. Note also that synthetic sampling procedures were historically designed to overcome the overfitting behaviour of OSR, which no longer seems to apply here.
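The OSR step itself can be sketched as follows. The interpolation between β = 0 (no resampling) and β = 1 (fully balanced) is our reading of the set-up, and the toy rows are purely illustrative:

```python
import random

def random_oversample(minority, majority, beta, seed=0):
    """Random oversampling with replacement (OSR) sketch.

    beta = 0 leaves the data untouched; beta = 1 duplicates minority
    instances until both classes have equal size.  This interpolation
    rule is our assumption, not necessarily the paper's exact one."""
    rng = random.Random(seed)
    target = len(minority) + round(beta * (len(majority) - len(minority)))
    extra = [rng.choice(minority) for _ in range(target - len(minority))]
    return minority + extra, majority

minority = [([1, 0, 1], +1)] * 20          # toy sparse rows, label +1
majority = [([0, 1, 0], -1)] * 100
new_min, new_maj = random_oversample(minority, majority, beta=1.0)
# len(new_min) == len(new_maj) == 100
```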

19 We sometimes make use of T as a symbol to indicate the boosting iteration number instead of t.

20 The SVM regularization parameter C controls for overfitting. Additionally, the influence of majority class noise/outliers (see Section 3.2) on the learned hyperplane decreases as we oversample the minority class. This hyperplane is more sensitive towards minority instances, and a change in its direction/orientation towards these majority class noise/outliers will be more heavily penalized.

Imbalanced classification in sparse and large behaviour datasets 23

Table 3 Oversampling experiments. Results show the average tenfold AUC test set performance with respect to increasing β-values (β = [β1, β2, β3, β4] = [0, 1/3, 2/3, 1]). Additionally, standard deviations are shown between brackets. Optimal performances for each method are highlighted in boldface. See Section 4.2 for parameter settings.

Mov Th (p = [])
          β1            β2            β3            β4
OSR       79.77 (5.33)  85.3 (4.66)   83.16 (4.5)   84.59 (5.69)
SMOTE     79.77 (5.33)  84.18 (6.51)  85.58 (5.97)  84.33 (5.75)
ADASYN    79.77 (5.33)  84.11 (6.77)  85.86 (6.06)  85.36 (5.13)

Yahoo G (p = 25)
          β1            β2            β3            β4
OSR       78.82 (1.39)  78.78 (2.02)  78.74 (1.86)  78.35 (1.47)
SMOTE     78.82 (1.39)  79.23 (1.57)  79.1 (1.2)    79.03 (1.89)
ADASYN    78.82 (1.39)  79.12 (1.43)  79.23 (1.37)  79.51 (2.01)

TaFeng (p = 25)
          β1            β2            β3            β4
OSR       66.94 (1.34)  68.84 (1.21)  66.99 (1.42)  67.7 (1.41)
SMOTE     66.94 (1.34)  68.47 (1.5)   67.07 (1.15)  66.65 (0.81)
ADASYN    66.94 (1.34)  68.62 (1.38)  67.85 (1.6)   66.91 (1.39)

Book (p = 25)
          β1            β2            β3            β4
OSR       60.08 (0.71)  61.05 (0.94)  61.82 (1.12)  64.62 (0.57)
SMOTE     60.08 (0.71)  62.6 (0.73)   60.95 (0.68)  63.0 (0.8)
ADASYN    60.08 (0.71)  62.33 (0.73)  60.77 (0.85)  63.04 (0.58)

4.4 Undersampling

Regarding the undersampling techniques, we employ a similar experimental set-up as in the previous section: first, train SVMs on the undersampled training data with all possible parameter combinations; second, choose a suitable parameter combination based on validation set AUC-performance; finally, obtain the AUC-performance on the test set. The results are averaged across ten folds.

Before going into detail on the experimental findings, we give a short note on the effect of noise/outliers on SVM performance. Consider the following imaginary example: say that a majority class female is rating a lot of action films and thrillers. We can consider this an outlier, since most of the females are rating romantic or drama films; this specific female has far more in common with the minority class males, who also rate action films and thrillers. Outliers are therefore instances showing odd behaviour, whereas noise refers to wrongly labelled instances. The effect of noise/outliers on SVM performance can be severe. Indeed, many of the instances in the dataset have a support value (dual variable αi) of 0. Instances contributing to the model (a non-zero support value) are examples close to the borderline (Suykens et al 2002) and noise/outliers (as can be derived from the KKT dual-complementarity conditions21).

21 If α_i = 0, then y_i(w^T x_i + b) ≥ 1. For noise/outliers the term y_i(w^T x_i + b) is negative, hence α_i ≠ 0.


With respect to noise/outliers, the support value can reach a maximal value of C, the chosen regularization parameter in the SVM formulation. It should be clear that lowering the C-value leads to models that are less sensitive to noise/outliers.
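Spelled out, this argument rests on the standard soft-margin SVM dual box constraint and KKT complementarity conditions (textbook material, not specific to this paper):

```latex
% Dual box constraint and KKT complementarity for the soft-margin SVM:
0 \le \alpha_i \le C, \qquad
\begin{aligned}
\alpha_i = 0 \;&\Rightarrow\; y_i(w^\top x_i + b) \ge 1,\\
0 < \alpha_i < C \;&\Rightarrow\; y_i(w^\top x_i + b) = 1,\\
\alpha_i = C \;&\Rightarrow\; y_i(w^\top x_i + b) \le 1.
\end{aligned}
```

A misclassified noise/outlier has y_i(w^T x_i + b) < 0 < 1, so its α_i is pushed to the upper bound C; lowering C therefore caps the influence any single such instance can exert on w.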

The experimental outcomes of the various undersampling processes are shown in Table 4 for the same four previously selected datasets. We refer to Appendix B, Table 12 for results on the entire data repository. When removing only a limited amount of majority class instances (βu = βu2 = 1/4), we observed that in 12 out of 16 datasets the "Farthest Knn" method outperforms the "Closest Knn" technique; the four remaining datasets show comparable performances.22 This finding shows that the "Farthest" method is very suitable for removing majority class noise/outliers and empirically demonstrates their performance degrading effect. With higher undersampling rates (βu = 1), the "Closest Knn" method generally achieves higher performances than the "Farthest Knn" technique, though the results are less clear. We observed that only 9 out of 16 datasets indicate "Closest" as a clear winner, with 3 tie situations and 4 situations where "Farthest" shows the best results. Intuitively, we would expect the "Closest" method to perform better, since we are keeping the most informative instances (examples close to the minority class). However, this technique will also emphasize noise/outliers, and this is the main reason why the results are less significant. One can also see this from the fact that the RUS method shows far better results in comparison to the aforementioned techniques when βu = 1, an observation that is valid for all datasets. The RUS method is efficient in realizing noise/outlier reduction effects, since each SVM is trained with only a fraction of these types of instances. It is remarkable that in 9 out of 16 datasets the RUS method with βu = 1 outperforms the baseline model (βu = 0); in 4 datasets we observed equal performance, and there were 3 losses with respect to the baseline. RUS shows two attractive features: majority class noise/outlier removal and its ability to put more emphasis on the minority class instances by undersampling the majority class. A general downside of the approach is the information loss, and this effect causes the 3 losses to the baseline. Usually, though, the attractive features of RUS dominate the latter effect, and this indicates that it is definitely not necessary to include all majority class training instances in constructing the predictive model. Apparently there is a high level of redundancy, which can be exploited to construct efficient hypotheses.
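The RUS step can be sketched as follows; the interpolation between βu = 0 (keep everything) and βu = 1 (fully balanced) is our reading of the set-up, and the toy data is illustrative only:

```python
import random

def random_undersample(majority, beta_u, n_minority, seed=0):
    """Random undersampling (RUS) sketch: beta_u = 0 keeps all majority
    instances, beta_u = 1 keeps only as many as there are minority ones.
    The interpolation rule is our assumption."""
    rng = random.Random(seed)
    keep = len(majority) - round(beta_u * (len(majority) - n_minority))
    return rng.sample(majority, keep)

majority = list(range(100))   # toy majority-class indices
kept = random_undersample(majority, beta_u=1.0, n_minority=20)
# 20 majority instances survive a fully balanced undersampling
```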

In the next paragraph we compare the cluster-based undersampling (CBU) approach to the random undersampling (RUS) technique. Before proceeding, we should note that Table 12 does not show results on the LST, CRF and Bank datasets, due to the fact that the projected unigraph was too large to fit in memory (larger than 16 GB). A limited number of features (top nodes) does seem to be active for a relatively large number of instances (bottom nodes); all these instances will be connected in the unigraph projection, resulting in relatively high degrees for the bottom nodes. A possible solution to circumvent this problem might be to cut the weakest edges in the projected unigraph.
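The memory blow-up can be seen in a toy projection; the data below is made up, and real behaviour datasets have millions of bottom nodes:

```python
from itertools import combinations

# Sketch of why the projected unigraph explodes: every pair of instances
# (bottom nodes) sharing at least one feature (top node) gets an edge,
# so a single feature active in n instances alone contributes n*(n-1)/2
# pairs.  Toy bipartite data:
instance_features = {
    "u1": {"f1", "f2"}, "u2": {"f1"}, "u3": {"f1", "f3"}, "u4": {"f3"},
}

edges = set()
for a, b in combinations(sorted(instance_features), 2):
    if instance_features[a] & instance_features[b]:
        edges.add((a, b))
# f1 links u1, u2, u3 pairwise; f3 links u3, u4 -> 4 edges in total
```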

If we apply a limited amount of undersampling (βu = βu2 = 1/4), the CBU technique outperforms RUS in 8 out of 13 datasets (4 losses and 1 tie with respect to RUS). On the highly imbalanced datasets (p = 1 or p = []), the CBU method wins in 8 out of 8 cases. We can therefore conclude that CBU outperforms RUS at low undersampling rates. This indeed shows that handling the within-class imbalance can be beneficial in this environment. If we make a comparison at the highest undersampling rate, βu = βu5 = 1, the results are completely opposite: CBU wins in only 1 case, with 9 losses with respect to RUS and 3 ties. On the highly imbalanced datasets, CBU outperforms RUS for 1 dataset (with 4 losses and 3 ties). It is clear that RUS is the preferred technique in case of high undersampling rates. The reason is that CBU emphasizes the small disjuncts and basically fails to adequately pick up on the common, larger communities. Furthermore, these small disjuncts might actually correspond with majority class noise/outliers, and CBU is focusing on these types of instances. On the overall level, where we consider all undersampling rates (except βu = 0), CBU wins in 5 out of 13 datasets, RUS wins in 4 cases, and there are 4 tie situations (yet RUS did outperform CBU in all 4 cases). Both methods seem to be competitive with one another across a wide range of imbalance levels. If we pay particular attention to the highly imbalanced sets, CBU wins in 5 cases and there are 3 tie situations on the remaining sets (yet RUS did outperform CBU on all 3 cases). CBU seems to be the method of choice (with respect to RUS) when the datasets show high imbalance levels.

22 A tie occurs in the situation where the absolute difference in AUC is smaller than or equal to 0.5.
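A minimal sketch of the sampling side of CBU, assuming the majority class has already been partitioned into clusters (the clustering itself is a separate graph-partitioning step); the even per-cluster allocation used here is our simplification of the idea that small disjuncts stay represented:

```python
import random

def cluster_based_undersample(clusters, n_keep, seed=0):
    """Cluster-based undersampling (CBU) sketch: spread the n_keep
    retained majority instances round-robin over the given clusters,
    so even a tiny cluster (small disjunct) keeps contributing."""
    rng = random.Random(seed)
    kept, pools = [], [list(c) for c in clusters]
    i = 0
    while len(kept) < n_keep and any(pools):
        pool = pools[i % len(pools)]
        if pool:
            kept.append(pool.pop(rng.randrange(len(pool))))
        i += 1
    return kept

# One big cluster and two small disjuncts (toy indices):
clusters = [list(range(0, 80)), list(range(80, 95)), list(range(95, 100))]
kept = cluster_based_undersample(clusters, n_keep=12)
# each cluster, even the 5-instance one, contributes 4 instances
```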

Table 4 Undersampling experiments. Results show the average tenfold AUC test set performance with respect to increasing βu-values (βu = [βu1, βu2, βu3, βu4, βu5] = [0, 1/4, 1/2, 3/4, 1]). Optimal performances (boldface) are highlighted for each imbalance ratio. Additionally, standard deviations are shown between brackets. Cl K represents the "Closest Knn" technique, Cl T the "Closest tot sim" technique (similar for Far K and Far T, see Section 3.2). Parameter settings are given in Section 4.2.

Mov Th (p = [])
        βu1          βu2          βu3          βu4          βu5
RUS     79.77 (5.3)  80.32 (5.8)  81.57 (5.5)  81.86 (6.6)  81.26 (6.2)
Cl K    79.77 (5.3)  79.25 (4.5)  78.07 (5.0)  76.25 (6.5)  62.46 (8.5)
Cl T    79.77 (5.3)  78.4 (4.4)   72.41 (3.5)  64.66 (4.5)  60.37 (7.3)
Far K   79.77 (5.3)  84.54 (5.0)  83.64 (6.4)  80.02 (7.3)  56.82 (10.3)
Far T   79.77 (5.3)  85.03 (5.7)  82.68 (6.8)  75.61 (9.2)  56.77 (10.9)
CBU     80.11 (5.8)  81.17 (6.0)  81.08 (6.5)  84.17 (5.1)  80.96 (6.9)

Yahoo G (p = 25)
        βu1          βu2          βu3          βu4          βu5
RUS     78.82 (1.4)  78.91 (1.6)  78.97 (1.6)  78.61 (1.6)  77.82 (2.1)
Cl K    78.82 (1.4)  77.26 (1.5)  72.52 (1.5)  67.86 (2.0)  65.07 (2.7)
Cl T    78.82 (1.4)  76.83 (1.0)  71.99 (1.8)  67.15 (2.3)  61.1 (2.7)
Far K   78.82 (1.4)  78.26 (2.2)  74.69 (2.7)  67.22 (2.1)  60.72 (2.3)
Far T   78.82 (1.4)  77.68 (2.6)  72.44 (3.0)  64.94 (2.4)  59.6 (2.0)
CBU     75.25 (3.2)  75.22 (2.4)  74.69 (2.3)  73.07 (2.4)  70.69 (2.4)


Table 4 continued
TaFeng (p = 25)
        βu1          βu2          βu3          βu4          βu5
RUS     66.94 (1.3)  67.44 (1.3)  68.1 (1.4)   68.27 (1.4)  66.13 (1.2)
Cl K    66.94 (1.3)  66.13 (1.4)  63.39 (1.2)  59.83 (1.3)  56.94 (0.7)
Cl T    66.94 (1.3)  66.38 (1.5)  62.89 (1.6)  57.46 (1.3)  54.56 (1.3)
Far K   66.94 (1.3)  68.06 (1.4)  66.43 (1.6)  64.46 (1.5)  63.35 (1.3)
Far T   66.94 (1.3)  64.31 (1.1)  62.69 (1.0)  61.27 (1.1)  59.03 (1.0)
CBU     64.81 (1.2)  64.15 (1.1)  64.13 (1.2)  63.88 (0.8)  63.46 (0.8)

Book (p = 25)
        βu1          βu2          βu3          βu4          βu5
RUS     60.08 (0.7)  60.13 (0.6)  60.4 (0.8)   60.33 (0.8)  63.28 (0.8)
Cl K    60.08 (0.7)  59.96 (0.7)  60.13 (0.8)  59.96 (1.0)  59.28 (0.7)
Cl T    60.08 (0.7)  59.96 (0.7)  60.13 (0.8)  60.29 (0.4)  54.5 (0.9)
Far K   60.08 (0.7)  63.29 (1.0)  64.19 (0.8)  57.3 (1.1)   55.66 (1.1)
Far T   60.08 (0.7)  62.14 (0.5)  58.27 (0.6)  56.37 (1.0)  55.66 (1.1)
CBU     54.82 (0.9)  54.67 (0.9)  54.71 (0.9)  54.66 (1.0)  54.78 (0.9)

4.5 Boosting variants

In this section the plain AdaBoost (AB) method is compared with the cost-sensitive AdaCost (AC) algorithm, the EasyEnsemble (EE) technique and the baseline (BL) method (where we train a single SVM on the imbalanced training data). AB, AC and EE have an underlying boosting process in common. The AUC-values are stored during each boosting iteration, which allows us to show the performance graphically as a function of the number of iterations t. In EasyEnsemble (EE) we combine the weak learners of each subset by summing their individual contributions. For example, when t = 2 boosting iterations are performed, we evaluate the performance of the combined learner \sum_{s=1}^{S} \sum_{t=1}^{2} \alpha_{s,t} h_{s,t}(x). Figures 1, 2, 3 and 4 show the average tenfold AUC-performance (with µ = 100%) on the test data. The left-side figures show the AB, AC (with varying R-values) and EE (with varying S-values) performance with respect to the number of boosting iterations; the C-value is tuned according to highest validation set AUC-performance (over all possible boosting iterations). The right-side figures investigate the influence of the C-parameter for AB and EE (with S = 15) and allow us to gain insight into the effect of weak and strong learners on the generalization performance. Results on the remaining datasets are given in Appendix C, Figures 6–17. Note that we only indicate results with weight-percentage µ = 100% (use all instances in the training process); previous experiments (with µ = 75%) showed inferior results. This suggests that it is indeed important to include all minority instances during the training of each weak learner, an observation also made by Liu et al (2009).
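The combined hypothesis above can be evaluated as a double sum over subsets and boosting rounds; the weights and weak learners below are toy stand-ins, not trained models:

```python
def easyensemble_score(x, ensembles, t_max):
    """Evaluate the combined EasyEnsemble hypothesis after t_max boosting
    rounds: sum over the S subsets of the alpha-weighted weak learners.
    `ensembles` is a list (one entry per subset) of (alpha, h) pairs."""
    return sum(alpha * h(x)
               for subset in ensembles
               for alpha, h in subset[:t_max])

# Two subsets, two boosting rounds each (hypothetical weights/learners):
ensembles = [
    [(0.8, lambda x: 1 if x > 0 else -1), (0.3, lambda x: 1 if x > 5 else -1)],
    [(0.6, lambda x: 1 if x > 2 else -1), (0.2, lambda x: -1)],
]
score = easyensemble_score(7, ensembles, t_max=2)
# 0.8*1 + 0.3*1 + 0.6*1 + 0.2*(-1) = 1.5
```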

When investigating the influence of the regularization parameter C in AdaBoost, we can graphically verify that it is a "weakness" indicator, in the sense that lower C-values correspond to weaker learners. While these low C-value learners have a low performance in the first rounds of boosting, the possible boost in performance is usually much higher compared to higher C-value inducers. Strong learners can already make a suitable distinction between minority and majority class behaviours in the first boosting round; the boosting process will then emphasize the hard to learn instances, among which are noise/outliers. Therefore, too strong learners are not suitable for use in a boosting process, as already noted by Wickramaratna et al (2001). In Figures 2(b) and 3(b) we observe that the lowest C-values (C = 10^-7, 10^-5) can outperform higher C-values (C = 10^-3, 10^-1). In many cases the AB-process outperforms the baseline (BL), see also the results in Table 5. A combination of several weak learners can often achieve better generalization than a single strong learner.

Inspired by the success of the RUS method with βu = 1 (for reasons indicated in Section 4.4), the EE-technique tries to overcome an obvious downside of this approach: we are removing a lot of majority class instances that can be very informative. Hence, we try to make up for this loss by combining several of these balanced training sets, in the hope that the majority class is more adequately explored. This results in a very effective learner that seems to outperform the AB, AC and BL methods in the majority of cases, as can be seen from these figures. It is remarkable that we don't need to use that many subsets S to achieve high performances: increasing the S-value from 5 to 15 doesn't seem to cause major increases in AUC (see also Table 5). The influence of the C-parameter is similar to that described in the previous paragraph.

[Figure: AUC test performance vs. boosting iteration T. Panel (a): AB, AC (R = 2, 8, RD) and EE (S = 5, 10, 15) against the BL. Panel (b): AB and EE (S = 15) at C = 10^-7, 10^-5, 10^-3, 10^-1 against the BL.]

Fig 1 Mov G (p = 25) dataset results showing average tenfold AUC-performance on test data (with µ = 100%) for (a) AB, AC and EE, with C chosen according to highest validation set AUC-performance (over all possible T); (b) AB and EE (S = 15) with varying C-levels


[Figure: AUC test performance vs. boosting iteration T. Panel (a): AB, AC (R = 2, 8, RD) and EE (S = 5, 10, 15) against the BL. Panel (b): AB and EE (S = 15) at C = 10^-7, 10^-5, 10^-3, 10^-1 against the BL.]

Fig 2 Book(p = 25) dataset

[Figure: AUC test performance vs. boosting iteration T. Panel (a): AB, AC (R = 2, 8, RD) and EE (S = 5, 10, 15) against the BL. Panel (b): AB and EE (S = 15) at C = 10^-7, 10^-5, 10^-3, 10^-1 against the BL.]

Fig 3 TaFeng(p = 25) dataset

[Figure: AUC test performance vs. boosting iteration T. Panel (a): AB, AC (R = 2, 8, RD) and EE (S = 5, 10, 15) against the BL. Panel (b): AB and EE (S = 15) at C = 10^-7, 10^-5, 10^-3, 10^-1 against the BL.]

Fig 4 Bank(p = []) dataset


4.6 Final comparison

4.6.1 Performance-wise comparison

In this section we wish to determine which of the aforementioned methods shows superior performance in terms of AUC on test data. The experimental set-up is described in Section 4.2. Note that we exclude β = 0 and βu = 0 in determining the performance of the oversampling respectively undersampling techniques, to be able to compare them with the baseline (BL) approach.23 The results for AB, AC and EE are shown for µ = 100%. The number of boosting iterations t ∈ [0, T] is also considered as a tunable parameter. Table 5 shows the final comparison between all methods. Additionally, we added an average rank column indicating the mean rank of each algorithm across all datasets. The rank of an algorithm for a specific dataset corresponds to its position in the sorted AUC-performance spectrum, where the best performing algorithm gets the rank of 1, the second best rank 2, etc. In case two or more algorithms perform equally well, average ranks are assigned (if two algorithms have ranks of 3 and 4 respectively and show equal performance, then both methods get assigned a rank of 3.5). Note that the LST dataset is excluded in the calculation of average ranks, since we consider this a dataset not affected by imbalance and showing equal performances.
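The tie-aware ranking described above can be sketched as follows (the AUC values are invented for illustration):

```python
def average_ranks(auc_by_method):
    """Rank methods on one dataset: best AUC gets rank 1; ties share
    the mean of the ranks they span (e.g. ranks 3 and 4 -> 3.5 each)."""
    ordered = sorted(auc_by_method, key=auc_by_method.get, reverse=True)
    ranks, i = {}, 0
    while i < len(ordered):
        j = i
        # extend j over the run of methods tied with position i
        while (j + 1 < len(ordered)
               and auc_by_method[ordered[j + 1]] == auc_by_method[ordered[i]]):
            j += 1
        mean_rank = (i + 1 + j + 1) / 2
        for m in ordered[i:j + 1]:
            ranks[m] = mean_rank
        i = j + 1
    return ranks

ranks = average_ranks({"EE": 86.4, "OSR": 85.1, "RUS": 85.1, "BL": 79.8})
# EE -> 1.0, OSR and RUS tie for ranks 2-3 -> 2.5 each, BL -> 4.0
```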

As a side note, we noticed that increasing the number of minority training set instances can significantly improve performances. We can see this clearly by comparing the p = 25 outcomes with the results of p = 1 on the same dataset. Junque de Fortuny et al (2014a) already showed empirically that larger behaviour datasets (in terms of the number of instances or features) lead to higher AUC-performances. This is confirmed in our experiments, though we want to add that it only holds when the data size is increased in a balanced fashion.

The results in Table 5 confirm our previous analysis. Each of the oversampling, undersampling (except Cl Knn) and boosting variants seems to improve upon the BL performance. Sobhani et al (2015) note that the direct application of ensemble techniques on imbalanced data does not circumvent the imbalanced learning issue; in our experiments on behaviour data, we do find that a simple boosting procedure applied to the imbalanced data improves upon the predictive performance. The oversampling techniques OSR, SMOTE and ADASYN show similar outcomes and are competitive methods, seeming to outperform the undersampling techniques and the plain boosting (AB) and cost-sensitive learning (AC) approaches. When dealing with traditional data, the performances of oversampling techniques are generally worse than those of undersampling approaches (Drummond and Holte 2003). It is interesting to observe that for these large and sparse behaviour datasets the opposite is true. This is heavily related to the intrinsic properties of the data under consideration: undersampling (or feature selection, for that matter) is not recommended because, as we have noted in Section 2.1, each feature provides a small additional contribution to the predictive performance. With respect to AC, the results indicate that R = R_L is a better choice compared to the more arbitrary cost ratios R = 2, 8. The EE-technique has the lowest average rank and is our best performing method. The combination of several balanced subsets with filtered noise/outlier contributions allows one to explore the majority class space efficiently. To strengthen our previous statements, we will in the next paragraphs conduct statistical significance tests in accordance with the procedure outlined in Demsar (2006); the latter paper presents a framework to perform statistical comparisons of classifiers over multiple datasets.

23 The BL technique trains single SVMs on the imbalanced training data.

The first null-hypothesis we try to reject postulates that all algorithms perform equally well or, equivalently, that their average ranks R_j (see Table 5) do not differ. Under the null-hypothesis, the Friedman statistic (Friedman 1937)

\chi^2_F = \frac{12N}{k(k+1)} \left[ \sum_{j=1}^{k} R_j^2 - \frac{k(k+1)^2}{4} \right],   (6)

with N the number of datasets (N = 15) and k the number of algorithms (k = 13), can be used to compute the statistic (Iman and Davenport 1980)

F_F = \frac{(N-1)\,\chi^2_F}{N(k-1) - \chi^2_F}.   (7)

The latter is distributed according to the F-distribution with k − 1 and (k − 1)(N − 1) degrees of freedom. Evaluating this measure with the average ranks shown in Table 5 results in F_F = 22.98, which is a lot higher than the critical value of 1.81 at the α = 0.05 significance level. We can therefore reject the null-hypothesis and proceed with post-hoc tests.
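As a numerical check, Eqs. (6) and (7) can be evaluated directly from the average ranks reported in Table 5:

```python
def friedman_iman_davenport(avg_ranks, n_datasets):
    """Compute the Friedman chi-square (Eq. 6) and the Iman-Davenport
    F statistic (Eq. 7) from the average ranks of k algorithms."""
    k = len(avg_ranks)
    chi2 = (12 * n_datasets / (k * (k + 1))) * (
        sum(r * r for r in avg_ranks) - k * (k + 1) ** 2 / 4)
    f_f = (n_datasets - 1) * chi2 / (n_datasets * (k - 1) - chi2)
    return chi2, f_f

# Average ranks from Table 5 (k = 13 algorithms, N = 15 datasets):
ranks = [11.600, 5.000, 4.533, 4.800, 8.167, 12.467, 8.133,
         8.567, 8.267, 8.467, 5.400, 3.267, 2.333]
chi2, f_f = friedman_iman_davenport(ranks, n_datasets=15)
# f_f comes out near the paper's reported value of 22.98
```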

The Nemenyi test (Nemenyi 1963) allows one to compare each classifier to one another and adjusts the critical value to compensate for making k(k − 1)/2 comparisons.24 "The performance of two classifiers is significantly different if the corresponding average ranks differ by at least the critical difference (CD)" (Demsar 2006). We refer to the aforementioned paper for the calculation of the CD. The result of these comparisons at the α = 0.05 significance level is shown in Table 6. This test is already quite revealing, in the sense that, for instance, the oversampling methods (OSR, SMOTE and ADASYN), AC (R = R_L) and the EE-techniques are significantly different from the BL. Yet it does not indicate that undersampling or applying plain boosting (AB) results in significantly different performances compared to the BL. The Nemenyi test is still conservative, and we therefore compare all classifiers to a single control classifier in the next paragraphs. Note that we only make k − 1 comparisons in that situation, resulting in stronger results compared to the Nemenyi test.

When comparing a single classifier i with a control classifier c, the null-hypothesis states that the two classifiers i and c perform equally well. Under this hypothesis, the test statistic (based on the average ranks R_i shown in Table 5)

z = \frac{R_i - R_c}{\sqrt{k(k+1)/(6N)}}   (8)

is distributed according to a standard normal distribution. This z-value is used to find the corresponding probability (p-value) p from the cumulative standard normal

24 The larger the number of comparisons, the higher the proportion of null-hypotheses that are wrongfully rejected due to random chance.


Table 5 Final comparison of all methods in terms of AUC test set performance; see Section 4.2 for details on the experimental set-up. β ≠ 0 and βu ≠ 0 for the oversampling respectively undersampling techniques; µ = 100% for each of the boosting variants. Standard deviations are given between brackets and the performance difference with the BL is shown in square brackets.

              Mov G (p = 1)       Mov G (p = 25)      Mov Th (p = [])     Yahoo A (p = 1)
BL            71.6 (2.62) [0]     81.41 (1.32) [0]    79.77 (5.33) [0]    55.92 (2.97) [0]
OSR           75.35 (2.27) [3.8]  83.76 (2.09) [2.3]  85.13 (6.1) [5.4]   60.05 (2.71) [4.1]
SMOTE         76.16 (2.27) [4.6]  83.7 (2.1) [2.3]    85.67 (4.98) [5.9]  60.1 (3.0) [4.2]
ADASYN        76.07 (2.26) [4.5]  83.63 (2.04) [2.2]  85.65 (5.6) [5.9]   59.9 (2.99) [4.0]
RUS           72.88 (2.73) [1.3]  81.52 (2.15) [0.1]  82.91 (7.19) [3.1]  57.04 (1.77) [1.1]
Cl Knn        71.43 (1.36) [-0.2] 80.88 (1.19) [-0.5] 78.87 (4.71) [-0.9] 55.78 (2.71) [-0.1]
Far Knn       71.9 (2.95) [0.3]   80.9 (1.48) [-0.5]  84.07 (4.64) [4.3]  57.2 (1.33) [1.3]
CBU           74.17 (2.36) [2.6]  81.51 (1.04) [0.1]  82.76 (7.22) [3.0]  58.77 (3.43) [2.8]
AB            71.65 (1.73) [0.1]  84.52 (1.89) [3.1]  82.43 (5.18) [2.7]  58.35 (2.62) [2.4]
AC(R = 2, 8)  71.61 (2.46) [0.0]  83.46 (1.82) [2.0]  83.27 (5.6) [3.5]   57.72 (2.47) [1.8]
AC(R = R_L)   74.65 (2.7) [3.1]   83.35 (2.09) [1.9]  85.41 (4.49) [5.6]  59.47 (2.33) [3.5]
EE(S = 10)    76.04 (2.66) [4.4]  85.05 (1.85) [3.6]  86.1 (5.78) [6.3]   59.66 (3.13) [3.7]
EE(S = 15)    76.12 (2.88) [4.5]  85.14 (1.86) [3.7]  86.42 (5.86) [6.7]  59.76 (2.93) [3.8]

              Yahoo A (p = 25)    Yahoo G (p = 1)     Yahoo G (p = 25)    TaFeng (p = 1)
BL            61.68 (2.42) [0]    66.84 (3.66) [0]    78.82 (1.39) [0]    55.75 (1.6) [0]
OSR           64.59 (3.12) [2.9]  73.08 (2.96) [6.2]  78.52 (2.01) [-0.3] 61.21 (2.24) [5.5]
SMOTE         65.56 (3.33) [3.9]  73.11 (3.12) [6.3]  79.01 (1.21) [0.2]  61.72 (1.81) [6.0]
ADASYN        65.13 (3.38) [3.4]  73.22 (3.17) [6.4]  79.74 (1.68) [0.9]  61.68 (1.86) [5.9]
RUS           64.11 (2.8) [2.4]   70.65 (3.39) [3.8]  78.91 (1.55) [0.1]  59.25 (2.18) [3.5]
Cl Knn        61.14 (2.13) [-0.5] 66.34 (3.54) [-0.5] 77.26 (1.46) [-1.6] 55.77 (1.28) [0.0]
Far Knn       63.96 (3.03) [2.3]  66.97 (3.54) [0.1]  78.26 (2.2) [-0.6]  59.98 (1.26) [4.2]
CBU           62.27 (1.79) [0.6]  71.27 (2.89) [4.4]  75.22 (2.42) [-3.6] 58.4 (1.57) [2.6]
AB            63.88 (2.67) [2.2]  68.9 (2.03) [2.1]   79.01 (1.66) [0.2]  56.21 (1.79) [0.5]
AC(R = 2, 8)  64.32 (3.56) [2.6]  68.89 (3.11) [2.0]  78.99 (1.89) [0.2]  56.33 (1.83) [0.6]
AC(R = R_L)   64.31 (3.03) [2.6]  73.13 (2.8) [6.3]   78.41 (2.0) [-0.4]  61.6 (2.26) [5.9]
EE(S = 10)    66.51 (3.24) [4.8]  72.61 (3.15) [5.8]  80.52 (1.6) [1.7]   61.2 (1.82) [5.4]
EE(S = 15)    66.36 (3.18) [4.7]  73.48 (2.32) [6.6]  80.54 (1.56) [1.7]  61.13 (1.83) [5.4]

              TaFeng (p = 25)     Book (p = 1)        Book (p = 25)       LST (p = 1)
BL            66.94 (1.34) [0]    52.6 (1.29) [0]     60.08 (0.71) [0]    99.99 (0.01) [0]
OSR           68.77 (1.23) [1.8]  55.87 (1.42) [3.3]  64.62 (0.57) [4.5]  99.99 (0.01) [0]
SMOTE         68.47 (1.5) [1.5]   55.07 (0.88) [2.5]  62.96 (0.82) [2.9]  99.99 (0.01) [0]
ADASYN        68.48 (1.47) [1.5]  55.04 (0.91) [2.4]  63.02 (0.57) [2.9]  99.99 (0.01) [0]
RUS           68.28 (1.39) [1.3]  54.26 (0.92) [1.7]  63.28 (0.8) [3.2]   99.98 (0.01) [0]
Cl Knn        66.13 (1.43) [-0.8] 52.69 (1.3) [0.1]   60.02 (0.79) [-0.1] 99.99 (0.01) [0]
Far Knn       68.06 (1.41) [1.1]  56.25 (1.52) [3.7]  64.15 (1.12) [4.1]  99.98 (0.01) [0]
CBU           63.84 (1.07) [-3.1] 53.75 (1.01) [1.2]  54.68 (0.88) [-5.4] []
AB            67.65 (1.55) [0.7]  54.27 (1.95) [1.7]  65.0 (0.67) [4.9]   99.99 (0.01) [0]
AC(R = 2, 8)  69.31 (1.23) [2.4]  53.72 (1.0) [1.1]   61.24 (0.8) [1.2]   99.98 (0.01) [0]
AC(R = R_L)   67.15 (1.51) [0.2]  55.73 (1.22) [3.1]  64.6 (0.64) [4.5]   99.99 (0.01) [0]
EE(S = 10)    70.3 (1.35) [3.4]   55.09 (1.29) [2.5]  65.37 (0.61) [5.3]  99.98 (0.01) [0]
EE(S = 15)    70.4 (1.3) [3.5]    55.35 (1.26) [2.8]  65.4 (0.51) [5.3]   99.98 (0.01) [0]


Table 5 continued. Additionally, an average rank column is added, showing the mean rank of each algorithm across all datasets. Note that we excluded the LST dataset for this purpose.

              Adver (p = [])      Adver (p = 1)       CRF (p = [])         Bank (p = [])
BL            96.61 (1.82) [0]    90.93 (3.02) [0]    64.06 (16.43) [0]    66.82 (0.88) [0]
OSR           96.93 (1.91) [0.3]  93.3 (2.02) [2.4]   80.74 (12.93) [16.7] 71.39 (0.79) [4.6]
SMOTE         97.05 (1.66) [0.4]  93.35 (2.01) [2.4]  78.7 (16.56) [14.6]  []
ADASYN        96.91 (1.95) [0.3]  93.46 (2.21) [2.5]  78.87 (16.71) [14.8] []
RUS           96.81 (1.87) [0.2]  92.38 (2.51) [1.5]  83.98 (5.99) [19.9]  69.41 (1.19) [2.6]
Cl Knn        96.4 (1.48) [-0.2]  89.73 (3.42) [-1.2] 76.63 (16.19) [12.6] 66.17 (0.72) [-0.6]
Far Knn       95.77 (1.81) [-0.8] 93.88 (1.78) [3.0]  83.75 (13.11) [19.7] 66.95 (0.56) [0.1]
CBU           97.15 (1.88) [0.5]  94.18 (2.3) [3.3]   []                   []
AB            97.34 (2.18) [0.7]  91.39 (3.23) [0.5]  77.62 (15.15) [13.6] 66.82 (0.88) [0.0]
AC(R = 2, 8)  97.44 (1.93) [0.8]  91.0 (3.35) [0.1]   68.31 (14.93) [4.2]  67.67 (0.71) [0.9]
AC(R = R_L)   97.46 (1.71) [0.8]  93.51 (2.17) [2.6]  85.08 (9.77) [21.0]  70.7 (0.8) [3.9]
EE(S = 10)    97.64 (1.35) [1.0]  92.97 (2.75) [2.0]  86.18 (10.17) [22.1] 71.46 (0.81) [4.6]
EE(S = 15)    97.63 (1.35) [1.0]  93.3 (2.14) [2.4]   86.35 (9.99) [22.3]  71.54 (0.76) [4.7]

              Average Rank
BL            11.600
OSR           5.000
SMOTE         4.533
ADASYN        4.800
RUS           8.167
Cl Knn        12.467
Far Knn       8.133
CBU           8.567
AB            8.267
AC(R = 2, 8)  8.467
AC(R = R_L)   5.400
EE(S = 10)    3.267
EE(S = 15)    2.333

Table 6 Nemenyi test outcomes at the α = 0.05 significance level. A value of 1 indicates that the null-hypothesis is rejected and thus finds the two algorithms significantly different (a value of 0 means we accept the null-hypothesis). Methods are listed in the same order as in Table 5 and are represented more concisely.

      BL  RO  SM  AD  RU  Cl  Fa  CBU  AB  AC1  AC2  EE1  EE2
BL     0   1   1   1   0   0   0    0   0    0    1    1    1
RO     1   0   0   0   0   1   0    0   0    0    0    0    0
SM     1   0   0   0   0   1   0    0   0    0    0    0    0
AD     1   0   0   0   0   1   0    0   0    0    0    0    0
RU     0   0   0   0   0   0   0    0   0    0    0    1    1
Cl     0   1   1   1   0   0   0    0   0    0    1    1    1
Fa     0   0   0   0   0   0   0    0   0    0    0    1    1
CBU    0   0   0   0   0   0   0    0   0    0    0    1    1
AB     0   0   0   0   0   0   0    0   0    0    0    1    1
AC1    0   0   0   0   0   0   0    0   0    0    0    1    1
AC2    1   0   0   0   0   1   0    0   0    0    0    0    0
EE1    1   0   0   0   1   1   1    1   1    1    0    0    0
EE2    1   0   0   0   1   1   1    1   1    1    0    0    0


distribution function: p = 2 × min(Φ(z), 1 − Φ(z)). Holm's method (Holm 1979) compares k − 1 classifiers to a single control classifier and sorts the corresponding p-values in ascending order, so that p1 ≤ p2 ≤ … ≤ p(k−1). Each pi is subsequently compared to its associated confidence level25 αcomp = α/(k − i). Holm starts with performing the check p1 < α/(k − 1) and rejects the null-hypothesis if the test passes. It then proceeds with a similar check for i = 2 and continues until a certain null-hypothesis cannot be rejected; the remaining hypotheses are retained.
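The step-down procedure can be sketched as follows; the p-values are hypothetical, for a toy comparison of three methods against a control (k = 4):

```python
def holm_stepdown(p_values, alpha=0.05):
    """Holm's step-down procedure: sort the k-1 p-values ascending,
    compare p_i to alpha/(k - i); stop at the first failure and
    retain the remaining null hypotheses."""
    k = len(p_values) + 1                      # k-1 comparisons made
    ordered = sorted(p_values.items(), key=lambda kv: kv[1])
    rejected = []
    for i, (name, p) in enumerate(ordered, start=1):
        if p < alpha / (k - i):
            rejected.append(name)
        else:
            break
    return rejected

rejected = holm_stepdown({"A": 0.001, "B": 0.030, "C": 0.300})
# "A" passes 0.001 < 0.05/3; "B" fails 0.030 < 0.05/2 -> stop; C retained
```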

Table 7 shows the result of the Holm test at the α = 0.05 significance level with the BL algorithm as control classifier. The table lists each method according to sorted p-values and shows the z-, p- and αcomp-values. The significance column indicates whether the proposed method is significantly different from the BL and coincidentally matches the result of the Nemenyi test. The p-values of the (Far Knn, RUS, CBU) undersampling methods, AB and AC (R = 2, 8) are only slightly higher than the level required to reject the null-hypothesis. We included a critical significance level αcrit, corresponding to the lowest possible significance level at which the method would be considered significantly different from the BL (if α = αcrit then p = αcomp). In the case of AB, for example, if we were to choose a significance level α = 0.0764 (a confidence level of 92.36%), the associated p-value would be smaller than αcomp and we would proceed to conclude26 that AB performs significantly different from the BL. To summarize: the null-hypothesis of equal performance between the BL and an alternative method proposed in Section 3 (except Cl Knn) is rejected with a confidence level of more than 91% according to Holm's test. Hence, the performance of the BL approach is significantly worse than that of each of the proposed over- and undersampling methods (except Cl Knn), cost-sensitive learning techniques (AC) and boosting variants (AB, EE).
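For illustration, Eq. (8) and the two-sided p-value behind the first row of Table 7 (EE(S = 15) against the BL control) can be recomputed from the average ranks:

```python
import math

def holm_z_p(rank_i, rank_c, k, n):
    """z statistic (Eq. 8) comparing algorithm i to control c, and its
    two-sided p-value via the standard normal CDF (computed with erf)."""
    z = (rank_i - rank_c) / math.sqrt(k * (k + 1) / (6 * n))
    phi = 0.5 * (1 + math.erf(z / math.sqrt(2)))
    return z, 2 * min(phi, 1 - phi)

# EE(S = 15) vs the BL control, with k = 13 algorithms and N = 15 datasets:
z, p = holm_z_p(2.333, 11.600, k=13, n=15)
# z is about -6.52, matching the first row of Table 7
```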

Similar to the previous paragraph, Table 8 presents the results of Holm's test at the α = 0.05 significance level with EE(S = 15) as control classifier. The results coincidentally correspond with those of the Nemenyi test and indicate that the oversampling (OSR, SMOTE and ADASYN), AC(R = R_L) and EE(S = 10) methods are not statistically different from the EE(S = 15) technique at the 0.05 significance level. However, if we were to adapt the significance level to approximately α = 0.25 (a 75% confidence level), the results would conclude significance (except for EE(S = 10)). To summarize: the null-hypothesis of equal performance between EE(S = 15) and an alternative method proposed in Section 3 (except EE(S = 10)) is rejected with a confidence level of more than 75% according to Holm's test. Hence, the performance of the oversampling, undersampling, AB and AC approaches is significantly worse than that of EE(S = 15). We can therefore safely conclude EE to be the preferred method in terms of AUC performance when dealing with imbalanced behaviour data.

25 αcomp adjusts the value of α to compensate for multiple comparisons.
26 This is not entirely true, since Holm's method requires Far Knn to pass the test first before proceeding, which only occurs at the 0.088662 significance level.

34 Jellis Vanhoeyveld David Martens

Table 7 Holm test at the α = 0.05 significance level with BL as reference. The table shows the z test statistic with associated p-value, αcomp = α/(k − i) with k the number of algorithms (k = 13) and i corresponding to the position of the method in the sorted p-value vector. Since we already ranked the p-values in ascending order, i takes on the row number (e.g. OSR has i = 5). The column "significant" denotes if we can reject the null-hypothesis (significant: p < αcomp). αcrit corresponds with the smallest possible significance level where we would decide to reject the null-hypothesis (αcrit = α · p/αcomp).

            z          p         αcomp     significant  αcrit
EE(S = 15)  -6.51642   7.2E-11   0.004167  1            8.64E-10
EE(S = 10)  -5.86009   4.63E-09  0.004545  1            5.09E-08
SMOTE       -4.96936   6.72E-07  0.005     1            6.72E-06
ADASYN      -4.78183   1.74E-06  0.005556  1            1.56E-05
OSR         -4.64119   3.46E-06  0.00625   1            2.77E-05
AC(R = RL)  -4.35991   1.3E-05   0.007143  1            9.11E-05
Far Knn     -2.4378    0.014777  0.008333  0            0.088662
RUS         -2.41436   0.015763  0.01      0            0.078815
AB          -2.34404   0.019076  0.0125    0            0.076305
AC(R = 28)  -2.20339   0.027567  0.016667  0            0.082701
CBU         -2.13307   0.032919  0.025     0            0.065837
Cl Knn      0.609449   0.542227  0.05      0            0.542227

Table 8 Holm test at the α = 0.05 significance level with EE(S = 15) as reference

            z         p         αcomp     significant  αcrit
Cl Knn      7.12587   1.03E-12  0.004167  1            1.24E-11
BL          6.516421  7.2E-11   0.004545  1            7.92E-10
CBU         4.383348  1.17E-05  0.005     1            0.000117
AC(R = 28)  4.313027  1.61E-05  0.005556  1            0.000145
AB          4.172384  3.01E-05  0.00625   1            0.000241
RUS         4.102063  4.09E-05  0.007143  1            0.000287
Far Knn     4.078623  4.53E-05  0.008333  1            0.000272
AC(R = RL)  2.156513  0.031044  0.01      0            0.155218
OSR         1.875229  0.060761  0.0125    0            0.243045
ADASYN      1.734587  0.082814  0.016667  0            0.248442
SMOTE       1.547064  0.121848  0.025     0            0.243696
EE(S = 10)  0.65633   0.511612  0.05      0            0.511612

4.6.2 Timing-wise comparison

The performance of each of the proposed algorithms in terms of training times depends highly upon the chosen parameters. The β and βu parameters control the number of training instances and are an important contribution to the subsequent SVM learning times for each of the over- and undersampling techniques. The number of boosting iterations t ∈ [0, T] is also a major factor in each of the boosting variants proposed in Section 3.3. Furthermore, the chosen regularization parameter C highly influences the training times of the SVM inducer: higher C-values result in larger training times (Fan et al. 2008). Not only do the chosen hyperparameters affect learning times, data characteristics such as the number of minority instances, the imbalance level, the number of features and the amount of class overlap also have a major effect.

Imbalanced classification in sparse and large behaviour datasets 35

In comparing each of the methods outlined in Section 3, we make use of a similar methodology as previously presented (see Section 4.6.1). For each of the ten folds, a suitable parameter combination is selected based on optimal validation set AUC-performance. These chosen parameters are applied in determining the timings of each method. The final results indicate the average timings across ten folds. Hence, we show the required amount of time for each method when using the parameter combination resulting in the best performance. Note that this might, for instance, result in using β = 1/3 in case of OSR, or using only t = 5 boosting iterations in EE. Computational timings are shown in Table 9.
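The fold-wise selection just described reduces to an argmax over validation AUCs followed by indexing into the corresponding timing grid. A toy numpy sketch (the grid values below are invented purely for illustration):

```python
import numpy as np

# toy grids: rows = folds, columns = candidate parameter combinations
val_auc = np.array([[0.70, 0.75, 0.72],
                    [0.68, 0.71, 0.74]])
fit_time = np.array([[0.10, 0.30, 0.90],
                     [0.12, 0.28, 0.85]])

best = val_auc.argmax(axis=1)  # best parameter combination per fold
# average the training time of the per-fold winners, as reported in Table 9
reported = fit_time[np.arange(len(best)), best].mean()
```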

As one can observe from the aforementioned table, CBU is the slowest method and relies on computationally intensive clustering procedures. The synthetic SMOTE and ADASYN oversampling approaches are very time consuming methods. They both rely on computationally expensive nearest neighbour computations, in conjunction with a synthetic sample generation procedure that requires a single pass over all minority training instances.

The RUS method shows, as expected, superior results in terms of computational timings. Other undersampling methods perform quite well, though they are slower than BL and OSR due to their inherent nearest neighbour computations.

The boosting variants seem to perform worse than the BL, OSR and each of the undersampling methods. Since EE was the best method in terms of AUC-performance (see Section 4.6.1), we focus on this learner in this paragraph. For the medium sized datasets (all but CRF and Bank), OSR outperforms EE(S = 15), yet for the large datasets (CRF and Bank) the opposite is true. An SVM is known to have a quadratic computational complexity in the number of training instances (Suykens et al. 2002; García and Lozano 2007). For this reason, we consider OSR (or any other oversampling technique) as inappropriate for large datasets. Since with EE each subset has a size limited to twice the minority class training size, we expect increasing benefits of the EE-technique for larger and more imbalanced behaviour data. As Junqué de Fortuny et al. (2014a) have observed, larger behaviour data, in terms of the number of instances or features, also contributes to an increase in predictive power. Hence, we expect to see this type of big behaviour data more often in real-life applications, where computational resources can become a major constraint. Because the EE-technique is easily parallelizable by simultaneously performing the boosting process for each of the independent subsets, we added an extra row in Table 9: EE par corresponds with the time needed for EE(S = 15) divided by 15, indicating the time that would be required for EE if each subset were trained in parallel. We can see that EE par can outperform OSR in 8 out of 16 cases, precisely for the datasets that show a high imbalance level (e.g. p = 1 or p = []).
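A minimal sketch of the EE idea referred to above: S balanced subsets are trained independently and their scores averaged, and the loop over subsets is exactly the part that can be run in parallel. The stand-in base learner here is a toy centroid-difference scorer, not the paper's boosted SVM/LR combination:

```python
import numpy as np

def easy_ensemble_scores(X, y, X_test, S=15, seed=0):
    """EasyEnsemble-style scoring sketch: each of the S subsets keeps all
    minority examples (y == 1) plus an equally sized random draw from the
    majority class, so every subset is only twice the minority size."""
    rng = np.random.default_rng(seed)
    min_idx = np.flatnonzero(y == 1)
    maj_idx = np.flatnonzero(y == 0)
    scores = np.zeros(len(X_test))
    for _ in range(S):  # independent subsets: trivially parallelizable
        sub = np.concatenate(
            [min_idx, rng.choice(maj_idx, size=min_idx.size, replace=False)])
        Xs, ys = X[sub], y[sub]
        w = Xs[ys == 1].mean(axis=0) - Xs[ys == 0].mean(axis=0)  # toy learner
        scores += X_test @ w
    return scores / S
```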

We conclude our comparison of the proposed methods in Figure 5, which shows the average ranks in terms of computational timings (see Table 9) and AUC performance for each of the available data sources. Regarding the ranks with respect to the latter, we added an extra row EE par having the same AUC-performance as EE(S = 15) for each dataset in Table 5. Ranks are subsequently determined according to the procedure outlined in Section 4.6.1. Additionally, Figure 18 in Appendix D shows the result for the large and highly imbalanced CRF and Bank datasets. We expect the latter type of data to be of major interest for reasons already indicated.

36 Jellis Vanhoeyveld David Martens

Table 9 Computational timings (in seconds) averaged across ten folds, using the parameter combination resulting in the highest validation set AUC-performance. First and second best performances are emphasized in boldface (third and fourth best are underlined). EE par is the time required for EE(S = 15) divided by 15.

            Mov G(p = 1)  Mov G(p = 25)  Mov Th(p = [])  Yahoo A(p = 1)
BL          0.032889      0.056697       0.558563        0.026922
OSR         0.055043      0.062802       0.99009         0.044421
SMOTE       0.218821      0.937057       3.841482        0.057726
ADASYN      0.284688      1.802399       5.191265        0.087694
RUS         0.011431      0.025383       0.155224        0.007991
CL Knn      0.046599      0.599846       0.989914        0.037182
Far Knn     0.039887      0.80072        0.683023        0.027788
CBU         1.034111      1.060173       6.822839        1.692477
AB          0.169792      0.841443       3.460246        0.139251
AC(R = 28)  0.471994      2.996585       1.086907        0.366555
AC(R = RL)  0.53376       1.179542       6.065177        0.209015
EE(S = 10)  0.117226      6.065145       1.17995         0.148973
EE(S = 15)  0.20474       7.173737       2.119991        0.180365
EE par      0.013649      0.478249       0.141333        0.012024

            Yahoo A(p = 25)  Yahoo G(p = 1)  Yahoo G(p = 25)  TaFeng(p = 1)
BL          0.092954         0.011915        0.044164         0.026728
OSR         0.027887         0.013241        0.047206         0.040919
SMOTE       1.062686         0.056153        0.883698         0.219553
ADASYN      2.050993         0.079073        1.733367         0.306618
RUS         0.048471         0.003234        0.033423         0.002916
CL Knn      0.84391          0.025404        0.502515         0.092167
Far Knn     0.664124         0.026576        0.500206         0.080159
CBU         1.569442         1.287221        1.355035         2.467279
AB          0.445546         0.078777        0.169977         0.114619
AC(R = 28)  1.034044         0.321723        0.515953         0.926178
AC(R = RL)  0.706215         0.226741        0.112949         0.610233
EE(S = 10)  1.026577         0.100331        1.527146         0.058052
EE(S = 15)  1.607596         0.077483        2.472582         0.10538
EE par      0.107173         0.005166        0.164839         0.007025

            TaFeng(p = 25)  Book(p = 1)  Book(p = 25)  LST(p = 1)
BL          0.032033        0.080035     0.318093      0.652045
OSR         0.032414        0.132927     0.092757      0.87152
SMOTE       5.089283        3.409418     1.143444      4.987705
ADASYN      8.148419        3.689661     1.225441      6.840083
RUS         0.020457        0.022713     0.031972      0.432839
CL Knn      1.713731        0.400873     3.711648      2.508374
Far Knn     1.539437        0.379086     3.988552      2.511037
CBU         2.642686        4.198663     4.631987      []
AB          0.713265        0.61719      1.238585      2.466151
AC(R = 28)  1.234647        1.666131     2.330635      1.451671
AC(R = RL)  0.279047        0.860346     0.197053      1.23763
EE(S = 10)  2.484502        2.145747     7.177484      0.524066
EE(S = 15)  3.363971        2.480066     11.21945      0.784111
EE par      0.224265        0.165338     0.747963      0.052274

Imbalanced classification in sparse and large behaviour datasets 37

Table 9 Continued. Additionally, an extra column indicating average ranks is included (lower ranks are preferred)

            Adver(p = [])  Adver(p = 1)  CRF(p = [])  Bank(p = [])
BL          0.010953       0.002796      0.725911     70.89334
OSR         0.012178       0.006166      3.685813     179.7481
SMOTE       0.123112       0.017764      5.633862     []
ADASYN      0.183767       0.021728      5.768669     []
RUS         0.012115       0.00204       0.147392     52.47441
CL Knn      0.061324       0.005568      1.106755     73.73282
Far Knn     0.079078       0.007069      1.110379     97.59619
CBU         3.378235       3.236754      []           []
AB          0.069199       0.103518      1.153196     83.08618
AC(R = 28)  0.193092       0.068905      2.047434     71.70548
AC(R = RL)  0.107652       0.037963      1.387174     106.3466
EE(S = 10)  0.138485       0.085686      0.198656     24.95117
EE(S = 15)  0.185136       0.139121      0.285345     36.40107
EE par      0.012342       0.009275      0.019023     2.426738

            Average Rank [pos]
BL          2.94 [2]
OSR         4.19 [4]
SMOTE       9.59 [11]
ADASYN      10.91 [13]
RUS         1.38 [1]
CL Knn      6.5 [5]
Far Knn     6.56 [6]
CBU         14 [14]
AB          8.06 [7]
AC(R = 28)  10.81 [12]
AC(R = RL)  9.25 [9]
EE(S = 10)  8.25 [8]
EE(S = 15)  9.56 [10]
EE par      3 [3]

Timings are a substantial concern with respect to these large datasets. As can be observed, the EE-technique thrives for these big and highly imbalanced data, even in its non-parallel form.


Fig. 5 Average rank AUC versus average rank Time (see Table 9) across all datasets from Table 2 (excluding Flickr and Kdd). Regarding the AUC rank, an extra row EE par is added having the same AUC-value as EE(S = 15). Ranks for each dataset are subsequently obtained via the procedure outlined in Section 4.6.1. Points occurring in the upper-right region are preferred.

5 The effect of the chosen base learner

In this short section, we investigate the influence of the chosen base learner. More specifically, we ask ourselves whether some base inducers can more easily cope with the imbalanced learning issue, and in doing so verify the conclusion that EE is a suitable technique to improve upon the baseline performance. We rely on several popular classification methods for behaviour data: two regularization based techniques (SVM, LR) and two heuristic types of approaches (Naive Bayes (NB), Behavioural Similarity (BeSim)). The next paragraphs briefly introduce these methods. Note that source codes and dataset results are provided in our on-line repository.

LR constructs a linear model that maximizes the likelihood of the observed data by imposing a logistic model for the parametric probability model P(Y|X). Ng and Jordan (2002) provide a thorough introduction to the topic^27 and note that in its plain form LR can suffer from overfitting, especially when the input data is very high dimensional. Therefore, we resort to regularized logistic regression. We opt for L2-regularized LR, since L1-regularization corresponds with a natural way of imposing feature selection (Ng 2004). The LIBLINEAR toolbox (Fan et al. 2008) is used to obtain these L2-regularized LR models.
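For intuition, the model fitted here is the minimiser of 0.5·wᵀw + C·Σᵢ log(1 + exp(−yᵢ wᵀxᵢ)) with yᵢ ∈ {−1, +1}. Below is a plain gradient-descent sketch of that objective; it is our toy illustration only (LIBLINEAR uses a far more efficient solver, and this version omits the bias term):

```python
import numpy as np

def l2_logreg(X, y, C=1.0, lr=0.1, iters=2000):
    """Minimise 0.5 * w.w + C * sum_i log(1 + exp(-y_i * w.x_i)) by batch
    gradient descent; y must take values in {-1, +1}."""
    w = np.zeros(X.shape[1])
    for _ in range(iters):
        margins = y * (X @ w)
        # gradient of the loss term: -C * sum_i y_i * x_i / (1 + exp(margin_i))
        grad = w - C * (X.T @ (y / (1.0 + np.exp(margins))))
        w -= lr * grad
    return w
```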

27 For an accessible introduction, see the chapter on "Generative and Discriminative Classifiers: Naive Bayes and Logistic Regression" provided at the following link: http://www.cs.cmu.edu/~tom/NewChapters.html


NB (Ng and Jordan 2002) relies on the use of Bayes' rule, which expresses the posterior P(Y|X) as the product of the likelihood P(X|Y) and the prior P(Y), divided by the evidence P(X). The naivety of the approach originates from assuming conditional independence of the features. We impose a multivariate event model regarding the distribution P(X|Y). Junqué de Fortuny et al. (2014a) provide an efficient NB implementation for dealing with large and sparse binary behaviour data.
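A multivariate (Bernoulli) event model treats each binary feature as an independent coin per class. A small numpy sketch with Laplace smoothing, as our own illustration; the implementation of Junqué de Fortuny et al. is engineered for sparse matrices and scales far beyond this:

```python
import numpy as np

def bnb_fit(X, y):
    """Per class: log prior and log P(x_j = 1 | class), Laplace-smoothed."""
    model = {}
    for c in np.unique(y):
        Xc = X[y == c]
        theta = (Xc.sum(axis=0) + 1.0) / (Xc.shape[0] + 2.0)
        model[c] = (np.log(Xc.shape[0] / len(y)), np.log(theta), np.log(1 - theta))
    return model

def bnb_predict(model, X):
    """Pick the class with the highest joint log-likelihood; a feature that
    is absent (x_j = 0) contributes log P(x_j = 0 | class)."""
    classes = sorted(model)
    loglik = np.column_stack(
        [model[c][0] + X @ model[c][1] + (1 - X) @ model[c][2] for c in classes])
    return np.array(classes)[loglik.argmax(axis=1)]
```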

The BeSim approach conceptually corresponds to projecting the bigraph to a unigraph using the procedure outlined in Section 3.2. Next, the weighted-vote relational neighbour classifier (Macskassy and Provost 2007) is used to infer unknown labels/scores through a weighted probability estimation using the known labels of neighbouring nodes. A very fast and scalable implementation of this approach (the SW-transformation) is presented in detail in Stankova et al. (2015).
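Conceptually the two steps are: form the user unigraph W = AAᵀ from the user-by-item adjacency matrix A (shared items become edge weights), then let each node take the weighted vote of its labelled neighbours. A dense toy sketch of that scoring follows; the actual SW-transformation of Stankova et al. avoids ever materialising W, and the variable names here are ours:

```python
import numpy as np

def besim_scores(A, y, known):
    """A: binary user-by-item matrix; y: 0/1 labels; known: 1.0 where the
    label is observed. Returns, per user, the weighted fraction of positive
    labelled neighbours in the projected unigraph."""
    W = A @ A.T                      # unigraph: W[u, v] = #items u and v share
    np.fill_diagonal(W, 0)           # a node does not vote for itself
    votes = W @ (y * known)          # weighted positive votes
    total = W @ known                # total weight of labelled neighbours
    return np.divide(votes, total,
                     out=np.zeros_like(votes, dtype=float), where=total > 0)
```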

In Table 10, we compare the performances of the baseline (BL) application of the methods with their EE-version (with S = 15, T = 20). It is quite striking to see the EE-version dominating the baseline version when looking at each method individually. This suggests that the imbalanced learning issue has an impact on each of the aforementioned techniques. Focussing on the regularization based approaches, we observe the baseline LR to be less influenced by the imbalance than the SVM method. Their EE-versions (EE SVM and EE LR) can boost the baseline performance in almost all cases and have the highest average ranks amongst all the proposed methods. As already emphasized, the regularization based approaches offer an added element of flexibility, in the sense that the strength of the underlying base learner can be controlled.

The heuristic approaches NB and BeSim do not offer this flexibility. Furthermore, it is not directly possible to integrate instance weights during the training phase of the base learner. During the boosting process, we therefore sample from the distribution D_t to generate (unweighted) examples and train a BeSim/NB learner next. The results in Table 10 indicate BL NB to be the worst performing algorithm amongst all methods tried. Due to the conditional independence assumption, one could regard this as a weak learner, which is therefore suitable to be used in a boosting process. Indeed, the EE NB approach is able to significantly improve upon its performance. We found this to be the slowest EE-version^28 (e.g. taking more than 24 hours to process a single fold of the Bank dataset, versus several minutes for the other versions). The average rank estimate of EE NB is rather pessimistic, since we assigned the lowest performance across all methods in case of missing results. BeSim can be considered a stronger learner compared to NB. Its possible boost in performance is typically lower than that of NB. Having said this, we found EE BeSim to improve upon BL BeSim for all datasets (except Adver(p = 1)).
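Sampling a plain training set from the boosting distribution D_t is a one-liner; here is a sketch of that workaround for weight-agnostic learners (function and variable names are ours):

```python
import numpy as np

def resample_from_distribution(X, y, D, seed=42):
    """Draw an unweighted sample of the original size in which instance i is
    selected with probability D[i], so base learners that cannot take
    instance weights (such as NB or BeSim) still see, approximately, the
    boosting distribution."""
    rng = np.random.default_rng(seed)
    idx = rng.choice(len(y), size=len(y), p=D)
    return X[idx], y[idx]
```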

28 This statement only relates to the EE-version. In the baseline case, the timings of the NB-version are roughly of the same order of magnitude as SVM and LR (though this heavily depends on the chosen regularization parameter C). With EE, each subset has a size limited to twice the minority class training size (a relatively low number compared to the size of the majority class). This works in favour of SVM/LR (e.g. SVM has a computational complexity of O(m²), with m the number of training instances). Also note that the NB-version is indeed optimized for sparse data. Internally, the NB-implementation writes a fairly large model file to a specified directory (sizes of 1 GB can be encountered there). This writing operation slows down the training stage (especially in case of EE, where S × T = 15 × 20 = 300 model files are constructed).


Table 10 Average tenfold AUC test set performance of the BL compared to its EE-version (S = 15, T = 20) for SVM, LR, BeSim and NB. In case the EE NB implementation crashed, we put a × in the table. If it is too slow (> 24 hours per fold), we place a []. Additionally, average ranks are assigned to each method. In case EE NB has missing results for a particular dataset, we ensure it has the lowest rank across all other algorithms.

          Mov G(p = 1)   Mov G(p = 25)  Mov Th(p = [])  Yahoo A(p = 1)
BL SVM    71.6 (2.62)    81.41 (1.32)   79.77 (5.33)    56.49 (3.37)
EE SVM    76.12 (2.88)   85.13 (1.86)   86.43 (5.86)    59.74 (2.96)
BL LR     71.02 (2.09)   84.39 (1.84)   83.14 (4.17)    57.84 (2.39)
EE LR     76.69 (2.92)   85.03 (1.98)   86.3 (5.37)     59.79 (2.62)
BL BeSim  76.1 (3.58)    81.3 (2.92)    82.81 (6.6)     56.27 (2.73)
EE BeSim  76.31 (3.71)   81.37 (2.9)    85.02 (6.28)    57.7 (1.71)
BL NB     70.26 (5.84)   77.01 (2.54)   70.48 (10.14)   52.56 (2.09)
EE NB     75.93 (2.83)   85.56 (2.01)   86.91 (4.15)    57.55 (2.73)

          Yahoo A(p = 25)  Yahoo G(p = 1)  Yahoo G(p = 25)  TaFeng(p = 1)
BL SVM    61.61 (2.48)     66.84 (3.66)    78.82 (1.39)     55.75 (1.6)
EE SVM    66.38 (3.16)     73.48 (2.32)    80.55 (1.55)     61.13 (1.83)
BL LR     66.27 (2.96)     69.82 (1.93)    80.45 (1.59)     58.91 (2.31)
EE LR     66.22 (3.28)     73.08 (2.14)    80.53 (1.56)     61.43 (2.32)
BL BeSim  64.54 (2.02)     68.89 (2.49)    79.55 (1.96)     57.89 (1.18)
EE BeSim  65.25 (2.23)     71.18 (2.91)    80.04 (1.85)     59.36 (1.47)
BL NB     65 (1.65)        63.33 (2.56)    78.89 (1.64)     54.61 (1.2)
EE NB     66.6 (2.79)      70.99 (2.88)    81.01 (1.3)      59.01 (1.84)

          TaFeng(p = 25)  Book(p = 1)   Book(p = 25)  LST(p = 1)
BL SVM    66.94 (1.34)    52.6 (1.29)   60.08 (0.71)  99.99 (0.01)
EE SVM    70.4 (1.3)      55.34 (1.28)  65.4 (0.51)   99.98 (0.01)
BL LR     69.24 (1.3)     55.34 (1.27)  63.84 (0.75)  99.99 (0.01)
EE LR     70.28 (1.28)    55.49 (1.49)  65.41 (0.63)  99.97 (0.02)
BL BeSim  67.49 (1.23)    55.19 (1.27)  63.7 (0.63)   99.99 (0.01)
EE BeSim  68 (1.21)       55.21 (1.15)  64.38 (0.42)  99.99 (0)
BL NB     65.21 (1.64)    52.93 (0.9)   59.75 (0.47)  98.69 (0.3)
EE NB     70.72 (1.15)    ×             63.46 (0.61)  99.92 (0.04)

          Adver(p = [])  Adver(p = 1)  CRF(p = [])    Bank(p = [])
BL SVM    96.37 (1.94)   91.18 (2.97)  64.36 (18.97)  66.82 (0.88)
EE SVM    97.63 (1.35)   93.3 (2.14)   86.35 (9.99)   71.54 (0.76)
BL LR     97.19 (1.44)   88.51 (1.93)  81.87 (19.63)  71.43 (0.72)
EE LR     97.57 (0.96)   93.02 (2.06)  86.84 (9.62)   71.77 (0.62)
BL BeSim  97.26 (1.12)   95.38 (1.35)  86.91 (9.36)   67.85 (0.67)
EE BeSim  97.38 (1.04)   93.83 (1.35)  87.02 (10.43)  70.41 (0.55)
BL NB     93.75 (1.9)    93.37 (1.9)   87.24 (9.38)   67.83 (0.63)
EE NB     94.04 (1.75)   ×             ×              []

          Flickr(p = 0.1)  Kdd(p = 0.5)  Average Rank
BL SVM    74.92 (0.17)     74.53 (0.05)  6.44 [7]
EE SVM    79.86 (0.13)     80.98 (0.05)  2.39 [1]
BL LR     79.03 (0.11)     81.29 (0.04)  4.28 [4]
EE LR     79.85 (0.13)     80.75 (0.05)  2.61 [2]
BL BeSim  74.62 (0.13)     74.95 (0)     5.11 [6]
EE BeSim  76.4 (0.13)      77.55 (0.03)  3.61 [3]
BL NB     81.36 (0.1)      74.29 (0.05)  6.5 [8]
EE NB     []               []            5.06 [5]


6 Conclusions and future research directions

We investigated the effects of sampling and cost-sensitive learning techniques on the problem of learning from imbalanced behaviour data. This type of big data has fundamentally different properties and internal structure than traditional data, and is characterized by sparseness and very large dimensions. This is a new research topic which enables benefits across a wide variety of application domains, such as fraud detection, targeted advertising, predictive policing, churn prediction, default prediction, etc. To enable future developments in this field, we provide an on-line repository containing implementations, datasets and results, available at the following link: http://www.applieddatamining.com/cms?q=software

Oversampling techniques need to be adapted in order to cope with this kind of data. We proposed new versions of the traditional SMOTE and ADASYN techniques, which differ in the way synthetic samples are generated and similarity computations are performed. The plain random oversampling (OSR) and the synthetic approaches are competitive methods in terms of AUC and do not seem to suffer from overfitting, as traditional studies report. Though the synthetic approaches have comparable AUC-performances in comparison to OSR, they are useless in practice due to their computationally expensive procedures. The OSR method is very fast for small and medium sized datasets, yet becomes inappropriate for larger datasets because the training times of the underlying base learner increase drastically.

With respect to undersampling methodologies, we employed a random undersampling technique (RUS) and Knn-based informed approaches. The latter type selects majority class instances that are closest (Cl Knn) or farthest (Far Knn) from the minority class examples, and allows one to investigate the effect of noise/outliers on SVM performance. We observed that these odd instances have a severe performance degrading effect. RUS and Far Knn approaches can significantly improve performance with respect to the baseline, thanks to their abilities to reduce noise/outlier levels and to put more emphasis on the minority class. Having said this, they usually perform worse than any of the oversampling approaches. This is in sheer contrast to the results obtained from traditional studies and is heavily related to the intrinsic nature of behaviour data. Timing-wise, RUS is the fastest method. The informed approaches are slower than RUS due to the required nearest neighbour computations. Additionally, a cluster-based undersampling (CBU) technique was proposed to alleviate the within-class imbalance. For low undersampling rates, the performance of CBU is superior to plain RUS, whereas for high undersampling rates the opposite is true. Overall, no substantial gain in performance is obtained in comparison with a simple RUS approach. Furthermore, the method is the slowest among all techniques considered and relies on computationally expensive clustering heuristics.

We studied the effects of boosting, where we made use of a very specific linear SVM and logistic regression combination that uses confidence rated predictions, instead of a more popular plain "discrete" boosting algorithm (with weak hypotheses having values in {−1, 1}). Our experiments clearly indicated the regularization constant C in the SVM formulation to act as a "weakness" indicator. Indeed, higher C-values cause stronger learners and shouldn't be used in the boosting process, which matches with the analysis of studies dealing with traditional data. Strong learners can


already quite accurately distinguish majority class from minority class behaviours. The boosting process will focus on the hard to learn instances, among which are noise/outliers, and this can result in overfitting behaviour. We also observed that boosting can usually outperform the plain baseline, indicating that a combination of weak learners can achieve higher AUC-performance than a single strong learner.

EasyEnsemble (EE) integrates random undersampling with a boosting process. By sampling several balanced subsets from the training data and combining these, we can emphasize the minority class, achieve noise/outlier reduction effects and simultaneously explore the majority class space efficiently. Our version differs from the original formulation because we employ a different type of underlying weak learner (SVM and LR combination) that is fed to a confidence rated boosting algorithm (instead of the discrete version). We observed superior AUC-performance with respect to all of the aforementioned techniques. Furthermore, the method is very fast, since each subset can be trained in parallel and has a size that is twice as large as the minority class training set. We studied the effect of the number of subsets and noted that a limited amount of these (O(10)) is already sufficient to obtain satisfactory performances; increasing this number further causes only minor benefits.

Additionally, a statistical comparison of each of the proposed methods was conducted across a wide variety of datasets. Holm's significance test indicates that each of the proposed techniques (oversampling, undersampling except Cl Knn, cost-sensitive learning and boosting variants) can significantly outperform the plain baseline performance (each of the null-hypotheses of equal performance gets rejected with a confidence level of at least 91%). Comparison of the techniques with EE showed that the latter can significantly outperform all of them (each of the null-hypotheses of equal performance gets rejected with a confidence level of at least 75%). A timing-wise comparison showed EE to be the preferred method (after RUS) for large and highly imbalanced datasets. A parallel implementation would make EE the fastest method, even for medium sized datasets that show a high level of imbalance.

The influence of the chosen base learner was investigated by comparing the baseline to its EE-version for the SVM, LR, NB and BeSim inducers. Each of the EE-versions is able to improve upon the baseline performance for nearly all datasets, where base learners of the regularization type are superior because of their abilities to control the strength of the learner.

Though we conducted an extensive benchmark study, there are still plenty of future research directions. First and most important, there is an urgent need to develop a more general theoretical framework to completely understand the effect of imbalance and why certain methods outperform others. Secondly, other types of mechanisms could be investigated in this setting: advanced sampling (Guo et al. 2008) set-ups perform sampling based on preliminary classifications; one-class learning is a recognition based methodology where a discriminative boundary is learned around the training examples of a single class alone, and Raskutti and Kowalczyk (2004) demonstrated the superiority of the recognition based one-class SVMs over the discriminative two-class SVMs; other types of ensemble learning strategies (e.g. bagging and stacking) could be investigated; Batista et al. (2004) note that combining over- and undersampling can be beneficial when data are highly imbalanced or show very few minority class instances; the ideas behind Extended Nearest Neighbour (Tang and


He 2015) could guide us in the sampling process. As opposed to Knn, "it considers not only who are the nearest neighbours of a certain sample, but also who consider the particular sample as their nearest neighbour" (Tang and He 2015). The method assigns a class membership to an unknown sample that maximizes the intra-class coherence. Finally, more advanced techniques do exist that can determine the K nearest neighbours faster or with (slightly) better performance.

Another issue relates to the lack of comprehensibility of the EasyEnsemble technique in its present form. Because we currently integrate Logistic Regression in the boosting process, the final ensemble classifier is non-linear and therefore incomprehensible. It might be interesting to investigate the numerical boosting technique proposed in Schapire and Singer (1999), where each weak learner is allowed to output real values (instead of values in [−1, 1]). In that case, we would be able to use a plain linear SVM as a weak learner, without the Logistic Regression component. The final ensemble would be a linear model, which is surely easier to comprehend. Also, the instance-based approach of Martens and Provost (2014) could be used to explain individual predictions of the non-linear EE-technique.

A Oversampling experiments

Table 11 Oversampling experiments. Results show the average tenfold AUC test set performance with respect to increasing β-values (β = [β1, β2, β3, β4] = [0, 1/3, 2/3, 1]). Additionally, standard deviations are shown between brackets. Optimal performances for each method are highlighted in boldface. See Section 4.2 for parameter settings.

MOV G(p = 1)
        β1            β2            β3            β4
OSR     71.6 (2.62)   74.37 (2.04)  73.6 (1.84)   74.73 (2.45)
SMOTE   71.6 (2.62)   75.08 (2.18)  76.02 (2.14)  76.48 (2.3)
ADASYN  71.6 (2.62)   75.16 (1.92)  75.93 (2.08)  76.47 (2.29)

MOV G(p = 25)
        β1            β2            β3            β4
OSR     81.41 (1.32)  83.49 (1.81)  83.84 (1.96)  83.91 (2.04)
SMOTE   81.41 (1.32)  83.32 (1.97)  83.59 (2.04)  83.76 (2.11)
ADASYN  81.41 (1.32)  83.61 (1.82)  84.02 (1.97)  83.69 (1.96)

Mov Th(p = [])
        β1            β2            β3            β4
OSR     79.77 (5.33)  85.3 (4.66)   83.16 (4.5)   84.59 (5.69)
SMOTE   79.77 (5.33)  84.18 (6.51)  85.58 (5.97)  84.33 (5.75)
ADASYN  79.77 (5.33)  84.11 (6.77)  85.86 (6.06)  85.36 (5.13)

Yahoo A(p = 1)
        β1            β2            β3            β4
OSR     55.92 (2.97)  58.66 (3.27)  59.99 (2.28)  59.74 (1.78)
SMOTE   55.92 (2.97)  59.76 (2.62)  59.74 (2.67)  59.43 (2.4)
ADASYN  55.92 (2.97)  59.54 (2.53)  59.55 (2.94)  59.56 (2.22)

Yahoo A(p = 25)
        β1            β2            β3            β4
OSR     61.68 (2.42)  64.19 (3.17)  65.08 (3.26)  64.67 (2.1)
SMOTE   61.68 (2.42)  65.46 (3.63)  65.33 (3.23)  64.52 (2.98)
ADASYN  61.68 (2.42)  65.04 (3.74)  65.41 (3.47)  64.4 (2.21)

Continues on next page


Table 11 continued

Yahoo G(p = 1)
        β1            β2            β3            β4
OSR     66.84 (3.66)  72.18 (2.36)  73.11 (2.7)   72.49 (3.41)
SMOTE   66.84 (3.66)  72.65 (2.85)  73.27 (3.36)  73.37 (3.56)
ADASYN  66.84 (3.66)  72.87 (2.83)  73.18 (3.2)   73.39 (3.59)

Yahoo G(p = 25)
        β1            β2            β3            β4
OSR     78.82 (1.39)  78.78 (2.02)  78.74 (1.86)  78.35 (1.47)
SMOTE   78.82 (1.39)  79.23 (1.57)  79.1 (1.2)    79.03 (1.89)
ADASYN  78.82 (1.39)  79.12 (1.43)  79.23 (1.37)  79.51 (2.01)

TaFeng(p = 1)
        β1            β2            β3            β4
OSR     55.75 (1.6)   59.23 (1.96)  60 (1.68)     61.04 (2.36)
SMOTE   55.75 (1.6)   60.26 (1.95)  61.49 (1.8)   61.13 (1.52)
ADASYN  55.75 (1.6)   60.26 (1.9)   61.44 (1.85)  61.16 (1.5)

TaFeng(p = 25)
        β1            β2            β3            β4
OSR     66.94 (1.34)  68.84 (1.21)  66.99 (1.42)  67.7 (1.41)
SMOTE   66.94 (1.34)  68.47 (1.5)   67.07 (1.15)  66.65 (0.81)
ADASYN  66.94 (1.34)  68.62 (1.38)  67.85 (1.6)   66.91 (1.39)

Book(p = 1)
        β1            β2            β3            β4
OSR     52.6 (1.29)   53.61 (0.94)  55.41 (1.75)  55.87 (1.44)
SMOTE   52.6 (1.29)   54.77 (0.99)  54.91 (0.8)   54.36 (0.98)
ADASYN  52.6 (1.29)   54.86 (1.13)  55.06 (0.73)  54.54 (0.92)

Book(p = 25)
        β1            β2            β3            β4
OSR     60.08 (0.71)  61.05 (0.94)  61.82 (1.12)  64.62 (0.57)
SMOTE   60.08 (0.71)  62.6 (0.73)   60.95 (0.68)  63 (0.8)
ADASYN  60.08 (0.71)  62.33 (0.73)  60.77 (0.85)  63.04 (0.58)

LST(p = 1)
        β1            β2            β3            β4
OSR     99.99 (0.01)  99.99 (0.01)  99.99 (0.01)  99.99 (0.01)
SMOTE   99.99 (0.01)  99.99 (0.01)  99.99 (0.01)  99.99 (0.01)
ADASYN  99.99 (0.01)  99.99 (0.01)  99.99 (0.01)  99.99 (0.01)

Adver(p = [])
        β1            β2            β3            β4
OSR     96.61 (1.82)  97.31 (1.65)  97.07 (1.84)  97.07 (1.79)
SMOTE   96.61 (1.82)  96.91 (1.66)  97.19 (1.65)  97.07 (1.91)
ADASYN  96.61 (1.82)  97.1 (1.7)    97.08 (1.87)  97.07 (1.88)

Adver(p = 1)
        β1            β2            β3            β4
OSR     90.93 (3.02)  91.27 (3.03)  92.66 (2.82)  93.29 (1.97)
SMOTE   90.93 (3.02)  92.51 (2.03)  92.96 (2.14)  93.53 (1.81)
ADASYN  90.93 (3.02)  92.22 (2.33)  92.7 (2.36)   93.88 (1.73)

CRF(p = [])
        β1             β2             β3             β4
OSR     64.06 (16.43)  80.82 (12.94)  81.28 (12.27)  81.91 (11.28)
SMOTE   64.06 (16.43)  78.64 (16.86)  82.52 (13.74)  79.32 (16.26)
ADASYN  64.06 (16.43)  78.95 (16.72)  81.19 (16.32)  79.31 (16.19)

Bank(p = [])
        β1            β2            β3           β4
OSR     66.82 (0.88)  70.1 (0.74)   71.39 (0.8)  71.47 (0.8)
SMOTE
ADASYN


B Undersampling experiments

Table 12 Undersampling experiments. Results show the average tenfold AUC test set performance with respect to increasing βu-values (βu = [βu1, βu2, βu3, βu4, βu5] = [0, 1/4, 1/2, 3/4, 1]). Optimal performances (boldface) are highlighted for each imbalance ratio. Additionally, standard deviations are shown between brackets. Cl K represents the "Closest Knn" technique; Cl T represents the "Closest tot sim" technique (similar for Far K and Far T, see Section 3.2). Parameter settings are given in Section 4.2.

Mov G(p = 1)
       βu1          βu2          βu3          βu4          βu5
RUS    71.6 (2.6)   71.83 (2.6)  72.54 (2.5)  72.39 (3.1)  70.61 (3.5)
Cl K   71.6 (2.6)   71.4 (2)     70.96 (1.9)  70.43 (2.4)  69.05 (4.1)
Cl T   71.6 (2.6)   70.28 (2.5)  66.74 (2)    66.8 (2.1)   68.18 (3.6)
Far K  71.6 (2.6)   72.36 (2.7)  71.26 (3.4)  66.57 (5.2)  53.5 (3.5)
Far T  71.6 (2.6)   72.22 (2.8)  71.63 (3.6)  64.28 (5.3)  50.88 (4.4)
CBU    72.55 (2.6)  73.28 (2.6)  73.12 (2.6)  73.84 (2.5)  73 (3.1)

Mov G(p = 25)
       βu1          βu2          βu3          βu4          βu5
RUS    81.41 (1.3)  81.36 (1.3)  81.78 (1.7)  82.05 (1.7)  81.6 (2.1)
Cl K   81.41 (1.3)  80.86 (1.2)  80.95 (1.6)  79.73 (2.3)  77.95 (2.3)
Cl T   81.41 (1.3)  79.9 (1.2)   78.21 (1.4)  77.87 (1.5)  77.76 (2.3)
Far K  81.41 (1.3)  80.9 (1.5)   78.17 (1.8)  74.25 (2.4)  69.79 (3.2)
Far T  81.41 (1.3)  80.86 (1.5)  77.2 (2.4)   71.16 (2.7)  62.4 (2.8)
CBU    81.53 (1.4)  81.64 (1.3)  81.29 (1.6)  81.28 (2.1)  80.34 (2.7)

Mov Th(p = [])
       βu1          βu2          βu3          βu4          βu5
RUS    79.77 (5.3)  80.32 (5.8)  81.57 (5.5)  81.86 (6.6)  81.26 (6.2)
Cl K   79.77 (5.3)  79.25 (4.5)  78.07 (5)    76.25 (6.5)  62.46 (8.5)
Cl T   79.77 (5.3)  78.4 (4.4)   72.41 (3.5)  64.66 (4.5)  60.37 (7.3)
Far K  79.77 (5.3)  84.54 (5)    83.64 (6.4)  80.02 (7.3)  56.82 (10.3)
Far T  79.77 (5.3)  85.03 (5.7)  82.68 (6.8)  75.61 (9.2)  56.77 (10.9)
CBU    80.11 (5.8)  81.17 (6)    81.08 (6.5)  84.17 (5.1)  80.96 (6.9)

Yahoo A(p = 1)
       βu1          βu2          βu3          βu4          βu5
RUS    55.92 (3)    55.57 (3.4)  56.44 (3)    55.83 (3.4)  56.37 (3.3)
Cl K   55.92 (3)    55.67 (2.4)  53.12 (2)    50.57 (1.8)  53.79 (3.5)
Cl T   55.92 (3)    55.69 (2.1)  53.35 (2.2)  50.31 (2.2)  52.35 (3.3)
Far K  55.92 (3)    57.35 (2.2)  56.92 (1.1)  56.95 (2.3)  51.18 (2)
Far T  55.92 (3)    56.93 (2.4)  54.74 (1.9)  57.01 (1.8)  51.18 (2)
CBU    58.21 (2.6)  58.45 (3.3)  58.31 (3.5)  58.39 (3.5)  56.09 (2.6)

Yahoo A(p = 25)
       βu1          βu2          βu3          βu4          βu5
RUS    61.68 (2.4)  62.9 (2.9)   63.62 (3.6)  63.75 (3.1)  63.19 (1.9)
Cl K   61.68 (2.4)  61.14 (2.1)  57.62 (1.6)  54.02 (1.8)  51.48 (1.4)
Cl T   61.68 (2.4)  60.89 (2.8)  58.11 (1.4)  54.4 (2.1)   51.76 (1.4)
Far K  61.68 (2.4)  63.96 (3)    62.62 (2.2)  59.61 (1.5)  56.25 (1.6)
Far T  61.68 (2.4)  63.71 (2.4)  59.72 (1.6)  57.27 (1.1)  54.47 (1.1)
CBU    62.46 (2.6)  61.85 (1.4)  61.78 (2.2)  59.94 (3)    60.1 (4)

Continues on next page

46 Jellis Vanhoeyveld David Martens

Table 12 continued

Yahoo G(p = 1)
       βu1       βu2       βu3       βu4       βu5
RUS    6684(37)  6785(32)  6836(32)  6823(4)   699(42)
Cl K   6684(37)  6671(28)  643(36)   6198(39)  6115(19)
CL T   6684(37)  6579(27)  6355(33)  5921(35)  6108(24)
Far K  6684(37)  6676(41)  6384(34)  6516(2)   485(29)
Far T  6684(37)  6695(41)  6348(29)  6516(2)   4848(29)
CBU    6968(41)  7059(32)  7064(37)  702(29)   6335(36)

Yahoo G(p = 25)
       βu1       βu2       βu3       βu4       βu5
RUS    7882(14)  7891(16)  7897(16)  7861(16)  7782(21)
Cl K   7882(14)  7726(15)  7252(15)  6786(2)   6507(27)
CL T   7882(14)  7683(1)   7199(18)  6715(23)  611(27)
Far K  7882(14)  7826(22)  7469(27)  6722(21)  6072(23)
Far T  7882(14)  7768(26)  7244(3)   6494(24)  596(2)
CBU    7525(32)  7522(24)  7469(23)  7307(24)  7069(24)

TaFeng(p = 1)
       βu1       βu2       βu3       βu4       βu5
RUS    5575(16)  561(16)   5626(17)  5723(17)  5925(22)
Cl K   5575(16)  5568(16)  5558(15)  5508(11)  5105(15)
CL T   5575(16)  5567(16)  5447(16)  4753(16)  493(11)
Far K  5575(16)  5899(12)  5947(11)  6004(12)  5631(1)
Far T  5575(16)  5892(13)  5925(13)  5858(11)  5631(1)
CBU    578(1)    5847(11)  5815(09)  5887(14)  5765(16)

TaFeng(p = 25)
       βu1       βu2       βu3       βu4       βu5
RUS    6694(13)  6744(13)  681(14)   6827(14)  6613(12)
Cl K   6694(13)  6613(14)  6339(12)  5983(13)  5694(07)
CL T   6694(13)  6638(15)  6289(16)  5746(13)  5456(13)
Far K  6694(13)  6806(14)  6643(16)  6446(15)  6335(13)
Far T  6694(13)  6431(11)  6269(1)   6127(11)  5903(1)
CBU    6481(12)  6415(11)  6413(12)  6388(08)  6346(08)

Book(p = 1)
       βu1       βu2       βu3       βu4       βu5
RUS    526(13)   5279(09)  5346(08)  5389(09)  5405(09)
Cl K   526(13)   5256(12)  5252(13)  5239(11)  5309(11)
CL T   526(13)   5256(12)  5252(13)  5239(11)  5305(07)
Far K  526(13)   5521(12)  5621(18)  5614(12)  5306(1)
Far T  526(13)   5521(12)  5621(18)  5614(12)  5306(1)
CBU    5428(09)  5377(1)   5333(11)  5334(09)  5284(08)

Book(p = 25)
       βu1       βu2       βu3       βu4       βu5
RUS    6008(07)  6013(06)  604(08)   6033(08)  6328(08)
Cl K   6008(07)  5996(07)  6013(08)  5996(1)   5928(07)
CL T   6008(07)  5996(07)  6013(08)  6029(04)  545(09)
Far K  6008(07)  6329(1)   6419(08)  573(11)   5566(11)
Far T  6008(07)  6214(05)  5827(06)  5637(1)   5566(11)
CBU    5482(09)  5467(09)  5471(09)  5466(1)   5478(09)


Table 12 continued

LST(p = 1)
       βu1       βu2       βu3       βu4       βu5
RUS    9999(0)   9999(0)   9999(0)   9998(0)   9999(0)
Cl K   9999(0)   9999(0)   9999(0)   9999(0)   9999(0)
CL T   9999(0)   9999(0)   9999(0)   9999(0)   9998(0)
Far K  9999(0)   9998(0)   9998(0)   9998(0)   9998(0)
Far T  9999(0)   9998(0)   9998(0)   9998(0)   9998(0)
CBU    []        []        []        []        []

Adver(p = [])
       βu1       βu2       βu3       βu4       βu5
RUS    9661(18)  9632(18)  9663(14)  9712(21)  9622(16)
Cl K   9661(18)  9644(15)  9614(15)  9604(2)   948(25)
CL T   9661(18)  9587(21)  9432(19)  9301(22)  9072(23)
Far K  9661(18)  9653(14)  9576(2)   9439(18)  9049(31)
Far T  9661(18)  9654(15)  9567(19)  9454(18)  893(28)
CBU    9685(23)  9685(23)  9705(15)  966(16)   9606(21)

Adver(p = 1)
       βu1       βu2       βu3       βu4       βu5
RUS    9093(3)   9153(31)  9237(34)  919(29)   9193(22)
Cl K   9093(3)   9064(3)   8987(39)  9021(36)  8918(2)
CL T   9093(3)   897(35)   8855(34)  8576(33)  882(23)
Far K  9093(3)   938(23)   924(26)   8873(34)  8551(4)
Far T  9093(3)   9362(24)  932(22)   8841(36)  8551(4)
CBU    9322(24)  9376(25)  9389(26)  9352(27)  9127(2)

CRF(p = [])
       βu1        βu2        βu3        βu4        βu5
RUS    6406(164)  6328(159)  6798(174)  6695(219)  8773(88)
Cl K   6406(164)  6244(166)  6234(169)  7137(138)  7822(177)
CL T   6406(164)  6244(166)  6234(169)  7137(138)  6267(229)
Far K  6406(164)  838(142)   8393(148)  8449(137)  8611(97)
Far T  6406(164)  838(142)   8393(148)  8449(137)  8611(97)
CBU    []         []         []         []         []

Bank(p = [])
       βu1       βu2       βu3       βu4       βu5
RUS    6682(09)  6702(09)  6737(08)  6799(06)  695(1)
Cl K   6682(09)  6617(07)  6524(06)  6486(06)  5853(11)
CL T   6682(09)  6492(11)  6069(09)  5633(08)  5287(07)
Far K  6682(09)  6695(06)  6619(06)  6442(06)  5825(11)
Far T  6682(09)  6716(06)  642(08)   5967(1)   5825(11)
CBU    []        []        []        []        []


C Boosting experiments

This section presents the results of each of the boosting experiments from Section 4.5 on the remaining data sources. Each of the figures below depicts the average tenfold AUC-performance on test data (with µ = 100) with respect to the number of boosting iterations for (left) Adaboost (AB), AdaCost (AC) and EasyEnsemble (EE), with C chosen according to the highest validation set AUC-performance (over all possible boosting rounds), and (right) AB and EE (with S = 15) at varying C-levels.
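EasyEnsemble draws S balanced subsets of the majority class, learns on each, and aggregates the resulting hypotheses. The sketch below shows only this sampling-and-aggregation skeleton under simplifying assumptions: the inner AdaBoost loop of the actual experiments is omitted, and a toy nearest-centroid scorer stands in for the SVM base learner.

```python
import random

def centroid_score(X_min, X_maj, x):
    """Toy base learner: score = distance to the majority centroid minus
    distance to the minority centroid (higher = more 'minority-like')."""
    def centroid(X):
        d = len(X[0])
        return [sum(row[i] for row in X) / len(X) for i in range(d)]
    def dist(a, b):
        return sum((ai - bi) ** 2 for ai, bi in zip(a, b)) ** 0.5
    c_min, c_maj = centroid(X_min), centroid(X_maj)
    return dist(x, c_maj) - dist(x, c_min)

def easy_ensemble_scores(X_min, X_maj, X_test, S=10, seed=0):
    """Draw S random balanced majority subsets (|subset| = |minority set|),
    fit one base scorer per subset, and average the test scores."""
    rng = random.Random(seed)
    scores = [0.0] * len(X_test)
    for _ in range(S):
        subset = rng.sample(X_maj, len(X_min))  # balanced subset
        for i, x in enumerate(X_test):
            scores[i] += centroid_score(X_min, subset, x) / S
    return scores
```

Each base learner only ever sees a subset twice the minority class size, which is what makes the method cheap and parallelizable; averaging over the S subsets is what lets it explore the majority class space while dampening the influence of any single noisy draw.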

Fig 6 Mov G(p = 1) dataset [panels (a) and (b) omitted]

Fig 7 Mov Th(p = []) dataset [panels (a) and (b) omitted]


Fig 8 Yahoo A(p = 1) dataset [panels (a) and (b) omitted]

Fig 9 Yahoo A(p = 25) dataset [panels (a) and (b) omitted]

Fig 10 Yahoo G(p = 1) dataset [panels (a) and (b) omitted]


Fig 11 Yahoo G(p = 25) dataset [panels (a) and (b) omitted]

Fig 12 TaFeng(p = 1) dataset [panels (a) and (b) omitted]

Fig 13 Book(p = 1) dataset [panels (a) and (b) omitted]


Fig 14 LST(p = 1) dataset [panels (a) and (b) omitted]

Fig 15 Adver(p = []) dataset [panels (a) and (b) omitted]

Fig 16 Adver(p = 1) dataset [panels (a) and (b) omitted]


Fig 17 CRF(p = []) dataset [panels (a) and (b) omitted]

D Final Comparison

Fig 18 Average rank AUC versus average rank Time (see Table 9) across the large datasets from Table 2 (CRF and Bank). Regarding the former, an extra row EE par is added, having the same AUC-value as EE(S = 15). Ranks for each dataset are subsequently obtained via the procedure outlined in Section 4.6.1. Points occurring in the upper-right region are preferred. [scatter plot omitted]
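The ranking underlying Fig 18 is the usual Friedman-style average rank: each method is ranked per dataset (rank 1 = best) and ranks are averaged across datasets. A small sketch of that standard step, with mean ranks assigned to ties (the paper's exact procedure is the one in Section 4.6.1; the tie handling here is the conventional assumption):

```python
def average_ranks(perf):
    """perf: dict method -> list of per-dataset scores (higher is better).
    Returns dict method -> average rank across datasets, where the best
    method on a dataset gets rank 1 and ties share the mean of their ranks."""
    methods = list(perf)
    n_data = len(next(iter(perf.values())))
    avg = {m: 0.0 for m in methods}
    for d in range(n_data):
        col = sorted(methods, key=lambda m: -perf[m][d])  # best first
        i = 0
        while i < len(col):
            j = i
            while j + 1 < len(col) and perf[col[j + 1]][d] == perf[col[i]][d]:
                j += 1                       # extend the tie group
            mean_rank = (i + 1 + j + 1) / 2  # ranks are 1-based
            for k in range(i, j + 1):
                avg[col[k]] += mean_rank / n_data
            i = j + 1
    return avg
```

For example, with scores {"A": [0.9, 0.8], "B": [0.8, 0.8], "C": [0.7, 0.7]} over two datasets, A and B tie on the second dataset and share rank 1.5 there, giving average ranks A = 1.25, B = 1.75, C = 3.0.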


References

Akbani R, Kwek S, Japkowicz N (2004) Applying support vector machines to imbalanced datasets. In: Machine Learning: ECML 2004, 15th European Conference on Machine Learning, Pisa, Italy, September 20-24, 2004, Proceedings. Springer Berlin Heidelberg, Berlin, Heidelberg, pp 39–50. DOI 10.1007/978-3-540-30115-8_7

Ali A Shamsuddin SM Ralescu AL (2015) Classification with class imbalance problem A review Inter-national Journal of Advances in Soft Computing and its Applications 7(3)176ndash204

Alzahrani T Horadam KJ (2016) Community detection in bipartite networks Algorithms and case studiesIn Complex Systems and Networks Dynamics Controls and Applications Springer Berlin Heidel-berg Berlin Heidelberg pp 25ndash50 DOI 101007978-3-662-47824-0 2

Bachner J (2013) Predictive policing Preventing crime with data and analytics IBM Center for the Busi-ness of Government

Baesens B Van Gestel T Viaene S Stepanova M Suykens J Vanthienen J (2003) Benchmarking state-of-the-art classification algorithms for credit scoring Journal of the Operational Research Society54(6)627ndash635 DOI 101057palgravejors2601545

Barandela R, Sánchez J, García V, Rangel E (2003) Strategies for learning in class imbalance problems. Pattern Recognition 36(3):849–851. DOI 10.1016/S0031-3203(02)00257-1

Barber MJ (2007) Modularity and community detection in bipartite networks Physical Review E76066102 DOI 101103PhysRevE76066102

Barua S Islam MM Yao X Murase K (2014) MWMOTEndashmajority weighted minority oversamplingtechnique for imbalanced data set learning IEEE Transactions on Knowledge and Data Engineer-ing 26(2)405ndash425 DOI 101109TKDE2012232

Batista GEAPA Prati RC Monard MC (2004) A study of the behavior of several methods for balancingmachine learning training data SIGKDD Explor Newsl 6(1)20ndash29 DOI 10114510077301007735

Beckett SJ (2016) Improved community detection in weighted bipartite networks Royal Society OpenScience 3(1) DOI 101098rsos140536

Bekkar M Djemaa HK Alitouche TA (2013) Evaluation measures for models assessment over imbalanceddata sets Journal of Information Engineering and Applications 3(10)27ndash38

Bhattacharyya S Jha S Tharakunnel K Westland JC (2011) Data mining for credit card fraud A compara-tive study Decision Support Systems 50(3)602 ndash 613 DOI httpsdoiorg101016jdss201008008

Blondel VD Guillaume JL Lambiotte R Lefebvre E (2008) Fast unfolding of communities in large net-works Journal of Statistical Mechanics Theory and Experiment 2008(10)P10008

Breiman L, Friedman J, Stone CJ, Olshen RA (1984) Classification and regression trees. Taylor & Francis

Brozovsky L, Petricek V (2007) Recommender system for online dating service. In: Proceedings of Znalosti 2007 Conference, VSB Ostrava

Cha M, Mislove A, Gummadi KP (2009) A measurement-driven analysis of information propagation in the Flickr social network. In: Proceedings of the 18th International Conference on World Wide Web, ACM, New York, NY, USA, WWW '09, pp 721–730. DOI 10.1145/1526709.1526806

Chawla NV (2005) Data mining for imbalanced datasets An overview In Data mining and knowledgediscovery handbook Springer US Boston MA pp 853ndash867

Chawla NV Bowyer KW Hall LO Kegelmeyer WP (2002) SMOTE synthetic minority over-samplingtechnique Journal of artificial intelligence research 16321ndash357

Chawla NV Lazarevic A Hall LO Bowyer KW (2003) Smoteboost Improving prediction of the minorityclass in boosting In Knowledge Discovery in Databases PKDD 2003 Springer Berlin HeidelbergBerlin Heidelberg pp 107ndash119

Chawla NV Japkowicz N Kotcz A (2004) Editorial Special issue on learning from imbalanced data setsSIGKDD Explor Newsl 6(1)1ndash6 DOI 10114510077301007733

Chen M Mao S Liu Y (2014) Big data A survey Mobile Networks and Applications 19(2)171ndash209DOI 101007s11036-013-0489-0

Chyi YM (2003) Classification analysis techniques for skewed class distribution problems. Master thesis, Department of Information Management, National Sun Yat-Sen University

Demšar J (2006) Statistical comparisons of classifiers over multiple data sets. Journal of Machine Learning Research 7(Jan):1–30

Dhillon IS (2001) Co-clustering documents and words using bipartite spectral graph partitioning In Pro-ceedings of the Seventh ACM SIGKDD International Conference on Knowledge Discovery and DataMining ACM New York NY USA KDD rsquo01 pp 269ndash274 DOI 101145502512502550

54 Jellis Vanhoeyveld David Martens

Drummond C Holte RC (2003) C45 class imbalance and cost sensitivity why under-sampling beatsover-sampling In Proceedings of the ICML rsquo03 Workshop on Learning from Imbalanced Datasets

Fan RE Chang KW Hsieh CJ Wang XR Lin CJ (2008) LIBLINEAR A library for large linear classifi-cation Journal of Machine Learning Research 91871ndash1874

Fan W Stolfo SJ Zhang J Chan PK (1999) AdaCost Misclassification cost-sensitive boosting In Pro-ceedings of the Sixteenth International Conference on Machine Learning Morgan Kaufmann Pub-lishers Inc San Francisco CA USA ICML rsquo99 pp 97ndash105

Fawcett T (2006) An introduction to ROC analysis Pattern Recognition Letters 27(8)861 ndash 874 DOIhttpsdoiorg101016jpatrec200510010

Finch H (2005) Comparison of distance measures in cluster analysis with dichotomous data Journal ofData Science 3(1)85ndash100

Fortunato S (2010) Community detection in graphs Physics Reports 486(3 5)75 ndash 174 DOI httpsdoiorg101016jphysrep200911002

Junque de Fortuny E Martens D Provost F (2014a) Predictive modeling with big data is bigger reallybetter Big Data 1(4)215ndash226 DOI 101089big20130037

Junque de Fortuny E Stankova M Moeyersoms J Minnaert B Provost F Martens D (2014b) Corporateresidence fraud detection In Proceedings of the 20th ACM SIGKDD International Conference onKnowledge Discovery and Data Mining ACM New York NY USA KDD rsquo14 pp 1650ndash1659 DOI10114526233302623333

Frasca M Bertoni A Re M Valentini G (2013) A neural network algorithm for semi-supervised node labellearning from unbalanced data Neural Networks 4384 ndash 98 DOI httpsdoiorg101016jneunet201301021

Friedman M (1937) The use of ranks to avoid the assumption of normality implicit in the analysis ofvariance Journal of the american statistical association 32(200)675ndash701

García E, Lozano F (2007) Boosting support vector machines. In: Machine Learning and Data Mining in Pattern Recognition, 5th International Conference, MLDM 2007, Leipzig, Germany, July 18-20, Post Proceedings. IBaI publishing, pp 153–167

Goldstein M Uchida S (2016) A comparative evaluation of unsupervised anomaly detection algorithmsfor multivariate data PLOS ONE 11(4)1ndash31 DOI 101371journalpone0152173

González PC, Velásquez JD (2013) Characterization and detection of taxpayers with false invoices using data mining techniques. Expert Systems with Applications 40(5):1427–1436. DOI 10.1016/j.eswa.2012.08.051

Guimera R Sales-Pardo M Amaral LAN (2007) Module identification in bipartite and directed networksPhysical Review E 76036102 DOI 101103PhysRevE76036102

Guo H Viktor HL (2004) Learning from imbalanced data sets with boosting and data generation TheDataBoost-IM approach SIGKDD Explor Newsl 6(1)30ndash39 DOI 10114510077301007736

Guo X Yin Y Dong C Yang G Zhou G (2008) On the class imbalance problem In 2008 Fourth Interna-tional Conference on Natural Computation IEEE vol 4 pp 192ndash201 DOI 101109ICNC2008871

Han H Wang WY Mao BH (2005) Borderline-SMOTE A new over-sampling method in imbalanced datasets learning In Advances in Intelligent Computing Springer Berlin Heidelberg Berlin Heidelbergpp 878ndash887

He H Garcia EA (2009) Learning from imbalanced data IEEE Transactions on Knowledge and DataEngineering 21(9)1263ndash1284 DOI 101109TKDE2008239

He H Bai Y Garcia EA Li S (2008) ADASYN Adaptive synthetic sampling approach for imbalancedlearning In 2008 IEEE International Joint Conference on Neural Networks (IEEE World Congresson Computational Intelligence) IEEE pp 1322ndash1328 DOI 101109IJCNN20084633969

Holm S (1979) A simple sequentially rejective multiple test procedure Scandinavian journal of statistics6(2)65ndash70

Hsu CW Lin CJ (2002) A comparison of methods for multiclass support vector machines IEEE Transac-tions on Neural Networks 13(2)415ndash425 DOI 10110972991427

Huang A (2008) Similarity measures for text document clustering In Proceedings of the sixth new zealandcomputer science research student conference (NZCSRSC2008) Christchurch New Zealand pp 49ndash56

Iman RL Davenport JM (1980) Approximations of the critical region of the Friedman statistic Commu-nications in Statistics-Theory and Methods 9(6)571ndash595

Jo T Japkowicz N (2004) Class imbalances versus small disjuncts ACM SIGKDD Explor Newsl 6(1)40ndash49 DOI 10114510077301007737


Jutla IS, Jeub LG, Mucha PJ (2011-2016) A generalized Louvain method for community detection implemented in MATLAB. URL http://netwiki.amath.unc.edu/GenLouvain

Kubat M Matwin S (1997) Addressing the curse of imbalanced training sets One-sided selection InProceedings of the Fourteenth International Conference on Machine Learning Morgan KaufmannPublishers Inc San Francisco CA USA pp 179ndash186

Lancichinetti A Fortunato S (2009) Community detection algorithms A comparative analysis PhysicalReview E 80056117 DOI 101103PhysRevE80056117

Larremore DB Clauset A Jacobs AZ (2014) Efficiently inferring community structure in bipartitenetworks Physical Review E Statistical Nonlinear and Soft Matter Physics 90012805 DOI101103PhysRevE90012805

Li J Fine JP (2010) Weighted area under the receiver operating characteristic curve and its application togene selection Journal of the Royal Statistical Society Series C (Applied Statistics) 59(4)673ndash692DOI 101111j1467-9876201000713x

Li X Wang L Sung E (2008) AdaBoost with SVM-based component classifiers Engineering Applicationsof Artificial Intelligence 21(5)785ndash795 DOI httpsdoiorg101016jengappai200707001

Lichman M (2013) UCI machine learning repository. URL http://archive.ics.uci.edu/ml

Liu W, Chawla S, Cieslak DA, Chawla NV (2010) A robust decision tree algorithm for imbalanced data sets. In: Proceedings of the Tenth SIAM International Conference on Data Mining, SIAM, Philadelphia, vol 10, pp 766–777

Liu XY Wu J Zhou ZH (2009) Exploratory undersampling for class-imbalance learning IEEE Transac-tions on Systems Man and Cybernetics Part B (Cybernetics) 39(2)539ndash550 DOI 101109TSMCB20082007853

Luts J Ojeda F Van de Plas R De Moor B Van Huffel S Suykens JA (2010) A tutorial on supportvector machine-based methods for classification problems in chemometrics Analytica Chimica Acta665(2)129ndash145 DOI httpsdoiorg101016jaca201003030

Macskassy SA Provost F (2007) Classification in networked data A toolkit and a univariate case studyJournal of Machine Learning Research 8(May)935ndash983

Martens D Provost F (2014) Explaining data-driven document classifications MIS Quarterly 38(1)73ndash100

Martens D Provost F Clark J Junque de Fortuny E (2016) Mining massive fine-grained behavior data toimprove predictive analytics MIS Quarterly 40(4)869ndash888

Mazurowski MA Habas PA Zurada JM Lo JY Baker JA Tourassi GD (2008) Training neural networkclassifiers for medical decision making The effects of imbalanced datasets on classification perfor-mance Neural Networks 21(23)427 ndash 436 DOI httpsdoiorg101016jneunet200712031

Mease D Wyner AJ Buja A (2007) Boosted classification trees and class probabilityquantile estimationJournal of Machine Learning Research 8409ndash439

Nemenyi P (1963) Distribution-free Multiple Comparisons. Dissertation, Princeton University

Newman MEJ, Girvan M (2004) Finding and evaluating community structure in networks. Physical Review E 69:026113. DOI 10.1103/PhysRevE.69.026113

Ng AY (2004) Feature selection, L1 vs. L2 regularization, and rotational invariance. In: Proceedings of the Twenty-first International Conference on Machine Learning, ACM, New York, NY, USA, ICML '04, pp 78–. DOI 10.1145/1015330.1015435

Ng AY Jordan MI (2002) On discriminative vs generative classifiers A comparison of logistic regressionand naive bayes In Advances in Neural Information Processing Systems 14 MIT Press pp 841ndash848

Ngai E Hu Y Wong Y Chen Y Sun X (2011) The application of data mining techniques in financial frauddetection A classification framework and an academic review of literature Decision Support Systems50(3)559 ndash 569 DOI httpsdoiorg101016jdss201008006

Platt JC (1999) Probabilistic outputs for support vector machines and comparisons to regularized likeli-hood methods In Advances in Large-Margin Classifiers MIT Press pp 61ndash74

Porter MA Onnela JP Mucha PJ (2009) Communities in networks Notices of the American MathematicalSociety 56(9)1082ndash1097

Provost F, Fawcett T (2013) Data Science for Business: What you need to know about data mining and data-analytic thinking. O'Reilly Media, Inc.

Provost F Dalessandro B Hook R Zhang X Murray A (2009) Audience selection for on-line brandadvertising Privacy-friendly social network targeting In Proceedings of the 15th ACM SIGKDDInternational Conference on Knowledge Discovery and Data Mining ACM New York NY USAKDD rsquo09 pp 707ndash716 DOI 10114515570191557098


Raskutti B Kowalczyk A (2004) Extreme re-balancing for SVMs A case study SIGKDD Explor Newsl6(1)60ndash69 DOI 10114510077301007739

Rosvall M Bergstrom CT (2008) Maps of random walks on complex networks reveal community structureProceedings of the National Academy of Sciences 105(4)1118ndash1123 DOI 101073pnas0706851105

Schapire RE (1999) A brief introduction to boosting In Proceedings of the 16th International Joint Con-ference on Artificial Intelligence - Volume 2 Morgan Kaufmann Publishers Inc San Francisco CAUSA IJCAIrsquo99 pp 1401ndash1406

Schapire RE Singer Y (1999) Improved boosting algorithms using confidence-rated predictions Machinelearning 37(3)297ndash336 DOI 101023A1007614523901

Shmueli G (2017) Analyzing behavioral big data Methodological practical ethical and moral issuesQuality Engineering 29(1)57ndash74 DOI 1010800898211220161210979

Sobhani P Viktor H Matwin S (2015) Learning from imbalanced data using ensemble methods andcluster-based undersampling In New Frontiers in Mining Complex Patterns Third InternationalWorkshop NFMCP 2014 Held in Conjunction with ECML-PKDD 2014 Nancy France Septem-ber 19 2014 Revised Selected Papers Springer International Publishing Cham pp 69ndash83 DOI101007978-3-319-17876-9 5

Stankova M (2016) Classification within network data with a bipartite structure Dissertation Universityof Antwerp

Stankova M Martens D Provost F (2015) Classification over bipartite graphs through projection WorkingPapers 2015001 University of Antwerp Faculty of Applied Economics

Sun Y Kamel MS Wong AK Wang Y (2007) Cost-sensitive boosting for classification of imbalanceddata Pattern Recognition 40(12)3358 ndash 3378 DOI httpsdoiorg101016jpatcog200704009

Suykens JA Van Gestel T De Brabanter J De Moor B Vandewalle J Suykens J Van Gestel T (2002)Least squares support vector machines World Scientific Singapore

Tang B He H (2015) Enn Extended nearest neighbor method for pattern recognition [research frontier]IEEE Computational Intelligence Magazine 10(3)52ndash60 DOI 101109MCI20152437512

Tang Y Zhang YQ Chawla NV Krasser S (2009) SVMs modeling for highly imbalanced classificationIEEE Transactions on Systems Man and Cybernetics Part B (Cybernetics) 39(1)281ndash288 DOI101109TSMCB20082002909

Tobback E Moeyersoms J Stankova M Martens D (2016) Bankruptcy prediction for SMEs using rela-tional data Working Paper 2016004 University of Antwerp Faculty of Applied Economics

Verbeke W Dejaeger K Martens D Hur J Baesens B (2012) New insights into churn prediction in thetelecommunication sector A profit driven data mining approach European Journal of OperationalResearch 218(1)211 ndash 229 DOI httpsdoiorg101016jejor201109031

Veropoulos K Campbell I Cristianini N (1999) Controlling the sensitivity of support vector machinesIn Proceedings of the International Joint Conference on Artificial Intelligence Stockholm Sweden(IJCAI99) pp 55 ndash 60

Whitrow C Hand DJ Juszczak P Weston D Adams NM (2009) Transaction aggregation as a strategyfor credit card fraud detection Data Mining and Knowledge Discovery 18(1)30ndash55 DOI 101007s10618-008-0116-z

Wickramaratna J Holden SB Buxton BF (2001) Performance degradation in boosting In Proceedings ofthe Second International Workshop on Multiple Classifier Systems Springer London UK MCS rsquo01pp 11ndash21

Yen SJ Lee YS (2009) Cluster-based under-sampling approaches for imbalanced data distributions ExpertSystems with Applications 36(3 Part 1)5718 ndash 5727 DOI httpsdoiorg101016jeswa200806108

Yu HF Lo HY Hsieh HP Lou JK McKenzie TG Chou JW Chung PH Ho CH Chang CF Wei YH et al(2010) Feature engineering and classifier ensemble for kdd cup 2010 In Proceedings of the KDDCup 2010 Workshop pp 1ndash16

Zha H He X Ding C Simon H Gu M (2001) Bipartite graph partitioning and data clustering In Pro-ceedings of the Tenth International Conference on Information and Knowledge Management ACMNew York NY USA CIKM rsquo01 pp 25ndash32 DOI 101145502585502591

Zhang J Mani I (2003) Knn approach to unbalanced data distributions A case study involving informationextraction In Proceedings of the ICMLrsquo2003 Workshop on Learning from Imbalanced DatasetsWashington DC

Ziegler CN McNee SM Konstan JA Lausen G (2005) Improving recommendation lists through topicdiversification In Proceedings of the 14th International Conference on World Wide Web ACM NewYork NY USA WWW rsquo05 pp 22ndash32 DOI 10114510607451060754



Table 3 Oversampling experiments. Results show the average tenfold AUC test set performance with respect to increasing β-values (β = [β1, β2, β3, β4] = [0, 1/3, 2/3, 1]). Additionally, standard deviations are shown between brackets. Optimal performances for each method are highlighted in boldface. See Section 4.2 for parameter settings.

Mov Th(p = [])
        β1          β2          β3          β4
OSR     7977 (533)  853 (466)   8316 (45)   8459 (569)
SMOTE   7977 (533)  8418 (651)  8558 (597)  8433 (575)
ADASYN  7977 (533)  8411 (677)  8586 (606)  8536 (513)

Yahoo G(p = 25)
        β1          β2          β3          β4
OSR     7882 (139)  7878 (202)  7874 (186)  7835 (147)
SMOTE   7882 (139)  7923 (157)  791 (12)    7903 (189)
ADASYN  7882 (139)  7912 (143)  7923 (137)  7951 (201)

TaFeng(p = 25)
        β1          β2          β3          β4
OSR     6694 (134)  6884 (121)  6699 (142)  677 (141)
SMOTE   6694 (134)  6847 (15)   6707 (115)  6665 (081)
ADASYN  6694 (134)  6862 (138)  6785 (16)   6691 (139)

Book(p = 25)
        β1          β2          β3          β4
OSR     6008 (071)  6105 (094)  6182 (112)  6462 (057)
SMOTE   6008 (071)  626 (073)   6095 (068)  63 (08)
ADASYN  6008 (071)  6233 (073)  6077 (085)  6304 (058)
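The oversamplers compared in Table 3 differ in how they create new minority instances: OSR duplicates existing ones, while SMOTE (Chawla et al 2002) interpolates between a minority point and one of its k nearest minority neighbours (ADASYN does the same but biases the choice of seed points towards harder examples). A minimal dense-vector sketch of the SMOTE interpolation step, for illustration only (the experiments themselves run on sparse behaviour data):

```python
import random

def smote(X_min, n_new, k=5, seed=0):
    """Generate n_new synthetic minority samples: pick a random minority
    point x, one of its k nearest minority neighbours z, and emit
    x + u * (z - x) for a uniform u in [0, 1]."""
    rng = random.Random(seed)
    def d2(a, b):  # squared Euclidean distance
        return sum((ai - bi) ** 2 for ai, bi in zip(a, b))
    synthetic = []
    for _ in range(n_new):
        x = rng.choice(X_min)
        neighbours = sorted((p for p in X_min if p is not x),
                            key=lambda p: d2(x, p))[:k]
        z = rng.choice(neighbours)
        u = rng.random()
        synthetic.append([xi + u * (zi - xi) for xi, zi in zip(x, z)])
    return synthetic
```

Because every synthetic point is a convex combination of two minority instances, the new examples stay inside the minority region rather than being exact copies, which is why SMOTE tends to overfit less than plain replication.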

4.4 Undersampling

Regarding the undersampling techniques, we employ a similar experimental set-up as in the previous section: firstly, train SVMs on the undersampled training data with all possible parameter combinations; secondly, choose a suitable parameter combination based on validation set AUC-performance; and finally, obtain the AUC-performance on the test set. The results are averaged across ten folds.
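All of these comparisons are made in terms of the AUC, which by the Mann-Whitney formulation equals the probability that a randomly chosen minority (positive) instance receives a higher score than a randomly chosen majority (negative) instance. A small reference implementation of that rank statistic:

```python
def auc(scores_pos, scores_neg):
    """Mann-Whitney formulation of the AUC: the fraction of
    (positive, negative) pairs ranked correctly, counting ties as 1/2."""
    wins = 0.0
    for sp in scores_pos:
        for sn in scores_neg:
            if sp > sn:
                wins += 1.0
            elif sp == sn:
                wins += 0.5
    return wins / (len(scores_pos) * len(scores_neg))
```

A perfect ranker scores 1.0 and a random one about 0.5, independently of the class imbalance ratio, which is what makes the AUC a sensible selection criterion here (the O(n²) pairwise loop is for clarity; rank-based implementations are linear after sorting).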

Before going into detail on the experimental findings, we give a short note on the effect of noise/outliers on SVM performance. Consider the following imaginary example: say that a majority class female is rating a lot of action films and thrillers. We can consider this an outlier, since most of the females are rating romantic or drama films. This specific female has far more in common with the minority class males, who also rate action films and thrillers. Outliers are therefore instances showing odd behaviour; noise consists of wrongfully labelled instances. The effect of noise/outliers on SVM performance can be severe. Indeed, many of the instances in the dataset have a support value (dual variable αi) of 0. Instances contributing to the model (a non-zero support value) are examples close to the borderline (Suykens et al 2002) and noise/outliers (as can be derived from the KKT dual-complementarity conditions; see footnote 21).

21 If αi = 0, then yi(wᵀxi + b) ≥ 1. For noise/outliers the term yi(wᵀxi + b) is smaller than 1 (negative for misclassified instances); hence αi ≠ 0.
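Written out, the dual-complementarity conditions the footnote appeals to are the standard soft-margin SVM KKT cases (textbook material, not specific to this paper):

```latex
\alpha_i = 0 \;\Rightarrow\; y_i(w^\top x_i + b) \ge 1, \qquad
0 < \alpha_i < C \;\Rightarrow\; y_i(w^\top x_i + b) = 1, \qquad
\alpha_i = C \;\Rightarrow\; y_i(w^\top x_i + b) \le 1.
```

Margin violators (noise/outliers) fall in the last case, so their support values are capped at C; lowering C therefore directly bounds how strongly such instances can pull on the decision boundary.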

24 Jellis Vanhoeyveld David Martens

With respect to noise/outliers, the support value can reach a maximal value of C, the chosen regularization parameter in the SVM formulation. It should be clear that lowering the C-value leads to models that are less sensitive to noise/outliers.

The experimental outcomes of the various undersampling processes are shown in Table 4 for the same four previously selected datasets. We refer to Appendix B, Table 12 for results on the entire data repository. When removing only a limited amount of majority class instances (βu = βu,2 = 1/4), we observed that in 12 out of 16 datasets the "Farthest Knn" method outperforms the "Closest Knn" technique. The four remaining datasets show comparable performances (see footnote 22). This finding shows that the "Farthest" method is very suitable in removing majority class noise/outliers and empirically shows their performance degrading effect. With higher undersampling rates (βu = 1), the "Closest Knn" method generally achieves higher performances than the "Farthest Knn" technique, though the results are less clear. We observed that only 9 out of 16 datasets indicate "Closest" as a clear winner, with 3 tie situations and 4 situations where "Farthest" shows the best results. Intuitively, we would expect the "Closest" method to perform better, since we are keeping the most informative instances (examples close to the minority class). However, this technique will also emphasize noise/outliers, and this is the main reason why the results are less significant. One can also see this from the fact that the RUS method shows far better results than the aforementioned techniques when βu = 1, an observation that is valid for all datasets. The RUS method is efficient in realizing noise/outlier reduction effects, since each SVM is trained with only a fraction of these types of instances. It is remarkable to see that in 9 out of 16 datasets the RUS method with βu = 1 outperforms the baseline model (βu = 0). In 4 datasets we observed equal performance, and 3 losses with respect to the baseline. RUS shows two attractive features: majority class noise/outlier removal and its ability to put more emphasis on the minority class instances by undersampling the majority class. A general downside of the approach is the information loss, and this effect causes the 3 losses to the baseline. Usually, though, the attractive features of RUS dominate the latter effect, and this indicates that it is definitely not necessary to include all majority class training instances in constructing the predictive model. Apparently there is a high level of redundancy, which can be exploited to construct efficient hypotheses.
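The "Closest" and "Farthest" selection schemes compared above can be sketched as follows. This toy version is an assumption-laden illustration: it ranks majority instances by mean Euclidean distance to their k nearest minority neighbours, whereas the actual experiments use the behaviour-data similarity measures of Section 3.2.

```python
def knn_undersample(X_maj, X_min, n_remove, k=3, mode="farthest"):
    """Rank majority instances by their mean distance to the k nearest
    minority instances; remove the n_remove closest ('closest') or
    farthest ('farthest') ones and return the retained majority set."""
    def dist(a, b):
        return sum((ai - bi) ** 2 for ai, bi in zip(a, b)) ** 0.5
    def mean_knn_dist(x):
        ds = sorted(dist(x, m) for m in X_min)
        return sum(ds[:k]) / min(k, len(ds))
    ranked = sorted(X_maj, key=mean_knn_dist)   # closest to minority first
    if mode == "closest":
        return ranked[n_remove:]                # drop the closest
    return ranked[:len(ranked) - n_remove]      # drop the farthest
```

The sketch makes the trade-off discussed above concrete: "farthest" removal discards majority points that sit deep inside minority territory (likely noise/outliers), while "closest" removal keeps the borderline region sparse at the risk of retaining exactly those noisy points.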

In the next paragraph we compare the cluster-based undersampling (CBU) approach to the random undersampling (RUS) technique. Before proceeding, we should note that Table 12 does not show results on the LST, CRF and Bank datasets, because the projected unigraph was too large to fit in memory (larger than 16 GB). A limited number of features (top nodes) does seem to be active for a relatively large number of instances (bottom nodes). All these instances will be connected in the unigraph projection, resulting in relatively high degrees for the bottom nodes. A possible solution to circumvent this problem might be to cut the weakest edges in the projected unigraph.
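A toy illustration of why a single popular top node densifies the bottom-node projection; the data, names and the unweighted "share at least one feature" rule below are hypothetical simplifications of whatever weighting the actual projection uses.

```python
from itertools import combinations

# Toy bipartite data: bottom nodes (instances) -> sets of top nodes (features).
# One very popular feature 'f0' is active for every instance, so the projected
# unigraph on instances degenerates into a clique.
instances = {f"i{k}": {"f0"} | ({f"f{k}"} if k % 2 else set()) for k in range(6)}

def project_bottom(inst):
    """Unweighted unigraph projection: connect two instances iff they share
    at least one feature (a simplification; a weighted projection could
    instead score edges, e.g. by the number of shared features)."""
    edges = set()
    for a, b in combinations(sorted(inst), 2):
        if inst[a] & inst[b]:
            edges.add((a, b))
    return edges

print(len(project_bottom(instances)))  # 15 = C(6,2): a full clique
```

Removing the hub feature `f0` (the "cut the weakest edges" idea taken to its extreme) leaves no shared features in this toy example, so the projection collapses from a clique to an empty graph.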

If we apply a limited amount of undersampling (βu = βu2 = 1/4), the CBU technique outperforms RUS in 8 out of 13 datasets (4 losses and 1 tie with respect to RUS). On the highly imbalanced datasets (p = 1 or p = []), the CBU method wins in 8 out of 8 cases. We can therefore conclude that CBU outperforms RUS at low undersampling rates. This indeed shows that handling the within-class imbalance can be beneficial in this environment. If we make a comparison at the highest undersampling rate (βu = βu5 = 1), the results are completely opposite: CBU wins in only 1 case, with 9 losses with respect to RUS and 3 ties. On the highly imbalanced datasets, CBU outperforms RUS for 1 dataset (with 4 losses and 3 ties). It is clear that RUS is the preferred technique in case of high undersampling rates. The reason is that CBU emphasizes the small disjuncts and basically fails to adequately pick up on the common, larger communities. Furthermore, these small disjuncts might actually correspond with majority class noise/outliers, and CBU is focusing on these types of instances. On the overall level, where we consider all undersampling rates (except βu = 0), CBU wins in 5 out of 13 datasets, RUS wins in 4 cases, and there are 4 tie situations (yet RUS did outperform CBU in all 4 cases). Both methods seem to be competitive with one another across a wide range of imbalance levels. If we pay particular attention to the highly imbalanced sets, CBU wins in 5 cases and there are 3 tie situations on the remaining sets (yet RUS did outperform CBU in all 3 cases). CBU seems to be the method of choice (with respect to RUS) when the datasets show high imbalance levels.

22 A tie occurs in the situation where the absolute difference in AUC is smaller than or equal to 0.5.

Imbalanced classification in sparse and large behaviour datasets 25

Table 4 Undersampling experiments. Results show the average tenfold AUC test set performance with respect to increasing βu-values (βu = [βu1, βu2, βu3, βu4, βu5] = [0, 1/4, 1/2, 3/4, 1]). Optimal performances (boldface) are highlighted for each imbalance ratio. Additionally, standard deviations are shown between brackets. Cl K represents the "Closest Knn" technique; Cl T represents the "Closest tot sim" technique (similar for Far K and Far T, see Section 3.2). Parameter settings are given in Section 4.2.

Mov Th (p = [])
        βu1         βu2         βu3         βu4         βu5
RUS     79.77(5.3)  80.32(5.8)  81.57(5.5)  81.86(6.6)  81.26(6.2)
Cl K    79.77(5.3)  79.25(4.5)  78.07(5.0)  76.25(6.5)  62.46(8.5)
Cl T    79.77(5.3)  78.4(4.4)   72.41(3.5)  64.66(4.5)  60.37(7.3)
Far K   79.77(5.3)  84.54(5.0)  83.64(6.4)  80.02(7.3)  56.82(10.3)
Far T   79.77(5.3)  85.03(5.7)  82.68(6.8)  75.61(9.2)  56.77(10.9)
CBU     80.11(5.8)  81.17(6.0)  81.08(6.5)  84.17(5.1)  80.96(6.9)

Yahoo G (p = 25)
        βu1         βu2         βu3         βu4         βu5
RUS     78.82(1.4)  78.91(1.6)  78.97(1.6)  78.61(1.6)  77.82(2.1)
Cl K    78.82(1.4)  77.26(1.5)  72.52(1.5)  67.86(2.0)  65.07(2.7)
Cl T    78.82(1.4)  76.83(1.0)  71.99(1.8)  67.15(2.3)  61.1(2.7)
Far K   78.82(1.4)  78.26(2.2)  74.69(2.7)  67.22(2.1)  60.72(2.3)
Far T   78.82(1.4)  77.68(2.6)  72.44(3.0)  64.94(2.4)  59.6(2.0)
CBU     75.25(3.2)  75.22(2.4)  74.69(2.3)  73.07(2.4)  70.69(2.4)

26 Jellis Vanhoeyveld David Martens

Table 4 continued
TaFeng (p = 25)
        βu1         βu2         βu3         βu4         βu5
RUS     66.94(1.3)  67.44(1.3)  68.1(1.4)   68.27(1.4)  66.13(1.2)
Cl K    66.94(1.3)  66.13(1.4)  63.39(1.2)  59.83(1.3)  56.94(0.7)
Cl T    66.94(1.3)  66.38(1.5)  62.89(1.6)  57.46(1.3)  54.56(1.3)
Far K   66.94(1.3)  68.06(1.4)  66.43(1.6)  64.46(1.5)  63.35(1.3)
Far T   66.94(1.3)  64.31(1.1)  62.69(1.0)  61.27(1.1)  59.03(1.0)
CBU     64.81(1.2)  64.15(1.1)  64.13(1.2)  63.88(0.8)  63.46(0.8)

Book (p = 25)
        βu1         βu2         βu3         βu4         βu5
RUS     60.08(0.7)  60.13(0.6)  60.4(0.8)   60.33(0.8)  63.28(0.8)
Cl K    60.08(0.7)  59.96(0.7)  60.13(0.8)  59.96(1.0)  59.28(0.7)
Cl T    60.08(0.7)  59.96(0.7)  60.13(0.8)  60.29(0.4)  54.5(0.9)
Far K   60.08(0.7)  63.29(1.0)  64.19(0.8)  57.3(1.1)   55.66(1.1)
Far T   60.08(0.7)  62.14(0.5)  58.27(0.6)  56.37(1.0)  55.66(1.1)
CBU     54.82(0.9)  54.67(0.9)  54.71(0.9)  54.66(1.0)  54.78(0.9)

4.5 Boosting variants

In this section the plain AdaBoost (AB) method is compared with the cost-sensitive AdaCost (AC) algorithm, the EasyEnsemble (EE) technique and the baseline (BL) method (where we train a single SVM on the imbalanced training data). AB, AC and EE have an underlying boosting process in common. The AUC-values are stored during each boosting iteration, which allows us to show the performance graphically as a function of the number of iterations t. In EasyEnsemble (EE) we combine the weak learners of each subset by summing their individual contributions. For example, when t = 2 boosting iterations are performed, we evaluate the performance of the combined learner

\sum_{s=1}^{S} \sum_{t=1}^{2} \alpha_{s,t} h_{s,t}(x).

Figures 1, 2, 3 and 4 show the average tenfold AUC-performance (with µ = 100) on the test data. The left-side figures show the AB, AC (with varying R-values) and EE (with varying S-values) performance with respect to the number of boosting iterations; the C-value is tuned according to highest validation set AUC-performance (over all possible boosting iterations). The right-side figures investigate the influence of the C-parameter for AB and EE (with S = 15), and allow us to gain insight into the effect of weak and strong learners on the generalization performance. Results on the remaining datasets are given in Appendix C, Figures 6-17. Note that we only indicate results with weight-percentage µ = 100 (use all instances in the training process); previous experiments (with µ = 75) showed inferior results. This suggests that it is indeed important to include all minority instances during the training of each weak learner, an observation also made by Liu et al (2009).
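The combined EE learner above can be sketched as follows, with stub hypotheses standing in for the boosted SVMs; the stubs and weights are invented for illustration only.

```python
def ee_score(x, ensembles):
    """EasyEnsemble decision value: sum of the boosted scores over the S
    subsets, sum_{s=1..S} sum_{t=1..T_s} alpha[s][t] * h[s][t](x).
    `ensembles` is a list of (alphas, hypotheses) pairs; each hypothesis is
    a callable returning a real-valued score (stubs here, not SVMs)."""
    return sum(a * h(x)
               for alphas, hyps in ensembles
               for a, h in zip(alphas, hyps))

# Two subsets, two boosting rounds each (stub decision stumps).
ens = [
    ([0.6, 0.4], [lambda x: 1.0 if x > 0 else -1.0, lambda x: 1.0]),
    ([0.7, 0.3], [lambda x: -1.0, lambda x: 1.0 if x > 2 else -1.0]),
]
print(ee_score(3.0, ens))  # 0.6*1 + 0.4*1 + 0.7*(-1) + 0.3*1 = 0.6
```

Because only the final real-valued score is thresholded (or ranked for AUC), the per-subset boosted learners never need to agree with one another; their weighted votes are simply pooled.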

When investigating the influence of the regularization parameter C in AdaBoost, we can graphically verify that it is a "weakness" indicator, in the sense that lower C-values correspond to weaker learners. While these low C-value learners have a low performance in the first rounds of boosting, the possible boost in performance is usually much higher compared to higher C-value inducers. Strong learners can already make a suitable distinction between minority and majority class behaviours in the first boosting round. The boosting process will emphasize the hard-to-learn instances, among which are noise/outliers. Therefore, too strong learners are not suitable for use in a boosting process, as already noted by Wickramaratna et al (2001). In Figures 2(b) and 3(b) we observe that the lowest C-values (C = 10^-7, 10^-5) can outperform higher C-values (C = 10^-3, 10^-1). In many cases the AB-process outperforms the baseline (BL), see also the results in Table 5. A combination of several weak learners can often achieve better generalization than a single strong learner.

Inspired by the success of the RUS method with βu = 1 (for reasons indicated in Section 4.4), the EE-technique tries to overcome an obvious downside of this approach: we are removing a lot of majority class instances that can be very informative. Hence we try to make up for this loss by combining several of these balanced training sets, in the hope that the majority class is more adequately explored. This results in a very effective learner that seems to outperform the AB, AC and BL methods in the majority of cases, as can be seen from these figures. It is remarkable that we don't need to use that many subsets S to achieve high performances: increasing the S-value from 5 to 15 doesn't seem to cause major increases in AUC (see also Table 5). The influence of the C-parameter is similar as described in the previous paragraph.

[Figure 1: two panels of average tenfold AUC test performance versus the number of boosting iterations T (0-30); panel (a) compares AB, AC (R2, R8, RD), EE (S5, S10, S15) and BL, panel (b) compares AB and EE at C = 1e-07, 1e-05, 0.001 and 0.1 against BL]

Fig 1 Mov G(p = 25) dataset results showing average tenfold AUC-performance on test data (with µ = 100) for (a) AB, AC and EE with C chosen according to highest validation set AUC-performance (over all possible T); (b) AB and EE(S = 15) with varying C-levels

[Figure 2: same panel layout as Figure 1]

Fig 2 Book(p = 25) dataset

[Figure 3: same panel layout as Figure 1]

Fig 3 TaFeng(p = 25) dataset

[Figure 4: same panel layout as Figure 1]

Fig 4 Bank(p = []) dataset


4.6 Final comparison

4.6.1 Performance-wise comparison

In this section we wish to determine which of the aforementioned methods shows superior performance in terms of AUC on test data. The experimental set-up is described in Section 4.2. Note that we exclude β = 0 and βu = 0 in determining the performance of the oversampling respectively undersampling techniques, to be able to compare them with the baseline (BL) approach.23 The results for AB, AC and EE are shown for µ = 100; the number of boosting iterations t ∈ [0, T] is also considered as a tunable parameter. Table 5 shows the final comparison between all methods. Additionally, we added an average rank column indicating the mean rank of each algorithm across all datasets. The rank of an algorithm for a specific dataset corresponds to its position in the sorted AUC-performance spectrum, where the best performing algorithm gets the rank of 1, the second best rank 2, etc. In case two or more algorithms perform equally well, average ranks are assigned (if two algorithms have ranks of 3 and 4 respectively and show equal performance, then both methods get assigned a rank of 3.5). Note that the LST dataset is excluded in the calculation of average ranks, since we consider this a dataset not affected by imbalance and showing equal performances.
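The ranking procedure, including the tie-handling rule, can be sketched as follows (a generic reimplementation for illustration, not the authors' code):

```python
def average_ranks(scores_per_dataset):
    """Mean rank of each algorithm across datasets (rank 1 = best score);
    tied scores receive the average of the ranks they span, e.g. two
    algorithms tied for positions 3 and 4 both get 3.5."""
    n_alg = len(scores_per_dataset[0])
    totals = [0.0] * n_alg
    for scores in scores_per_dataset:
        order = sorted(range(n_alg), key=lambda j: -scores[j])
        ranks = [0.0] * n_alg
        i = 0
        while i < n_alg:
            j = i
            # extend j over the run of algorithms tied with position i
            while j + 1 < n_alg and scores[order[j + 1]] == scores[order[i]]:
                j += 1
            avg = (i + j) / 2 + 1          # average of 1-based positions i..j
            for k in range(i, j + 1):
                ranks[order[k]] = avg
            i = j + 1
        for j in range(n_alg):
            totals[j] += ranks[j]
    return [t / len(scores_per_dataset) for t in totals]

print(average_ranks([[0.9, 0.8, 0.8, 0.7]]))  # [1.0, 2.5, 2.5, 4.0]
```

Averaging these per-dataset ranks over all datasets yields the average rank column of Table 5.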

As a side note, we noticed that increasing the number of minority training set instances can significantly improve performances. We can see this clearly by comparing the p = 25 outcomes with the results of p = 1 on the same dataset. Junque de Fortuny et al (2014a) already showed empirically that larger behaviour datasets (in terms of the number of instances or features) lead to higher AUC-performances. This situation is confirmed in our experiments, though we want to add that this is only true when the data size is increased in a balanced fashion.

The results in Table 5 confirm our previous analysis. Each of the oversampling, undersampling (except Cl Knn) and boosting variants seems to improve upon the BL performance. Sobhani et al (2015) note that the direct application of ensemble techniques on imbalanced data does not circumvent the imbalanced learning issue. In our experiments on behaviour data, we do find a simple boosting procedure applied to the imbalanced data to improve upon the predictive performance. The oversampling techniques OSR, SMOTE and ADASYN show similar outcomes and are competitive methods, seeming to outperform the undersampling techniques and the plain boosting (AB) and cost-sensitive learning (AC) approaches. When dealing with traditional data, the performances of oversampling techniques are generally worse than those of undersampling approaches (Drummond and Holte 2003). It is interesting to observe that for these large and sparse behaviour datasets the opposite is true. This is heavily related to the intrinsic properties of the data under consideration. Undersampling (or feature selection, for that matter) is not recommended because, as we have noted in Section 2.1, each feature provides a small additional contribution to the predictive performance. With respect to AC, the results indicate that R = RL is a better choice compared to more random cost ratios R = 28. The EE-technique has the lowest average rank and is our best performing method. The combination of several balanced subsets with filtered noise/outlier contributions allows one to explore the majority class space efficiently. To strengthen our previous statements, we will in the next paragraphs conduct statistical significance tests in accordance with the procedure outlined in Demsar (2006). The latter paper presents a framework to perform statistical comparisons of classifiers over multiple datasets.

23 The BL technique trains single SVMs on the imbalanced training data.

The first null-hypothesis we try to reject postulates that all algorithms perform equally well, or equivalently, that their average ranks R_j (see Table 5) are indifferent. Under the null-hypothesis, the Friedman statistic (Friedman 1937)

\chi^2_F = \frac{12N}{k(k+1)} \left[ \sum_{j=1}^{k} R_j^2 - \frac{k(k+1)^2}{4} \right],   (6)

with N the number of datasets (N = 15) and k the number of algorithms (k = 13), can be used to compute the statistic (Iman and Davenport 1980)

F_F = \frac{(N-1)\,\chi^2_F}{N(k-1) - \chi^2_F}.   (7)

The latter is distributed according to the F-distribution with k − 1 and (k − 1)(N − 1) degrees of freedom. Evaluating this measure with the average ranks shown in Table 5 results in F_F = 22.98, which is a lot higher than the critical value of 1.81 at the α = 0.05 significance level. We can therefore reject the null-hypothesis and proceed with post-hoc tests.
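Plugging the average ranks reported in Table 5 into Eqs. (6) and (7) reproduces this value up to rounding of the reported ranks:

```python
# Friedman and Iman-Davenport statistics (Eqs. 6-7) recomputed from the
# average ranks of Table 5 (N = 15 datasets, k = 13 algorithms).
R = [11.600, 5.000, 4.533, 4.800, 8.167, 12.467, 8.133,
     8.567, 8.267, 8.467, 5.400, 3.267, 2.333]
N, k = 15, len(R)

chi2_F = 12 * N / (k * (k + 1)) * (sum(r * r for r in R) - k * (k + 1) ** 2 / 4)
F_F = (N - 1) * chi2_F / (N * (k - 1) - chi2_F)
print(round(F_F, 2))  # ~22.99; the text reports 22.98 from the unrounded ranks
```

The small discrepancy (22.99 versus 22.98) comes purely from the three-decimal rounding of the published average ranks.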

The Nemenyi test (Nemenyi 1963) allows one to compare each classifier to one another and adjusts the critical value to compensate for making k(k − 1)/2 comparisons.24 "The performance of two classifiers is significantly different if the corresponding average ranks differ by at least the critical difference (CD)" (Demsar 2006). We refer to the aforementioned paper for the calculation of the CD. The result of these comparisons at the α = 0.05 significance level is shown in Table 6. This test is already quite revealing in the sense that, for instance, the oversampling methods (OSR, SMOTE and ADASYN), AC (R = RL) and the EE-techniques are significantly different from the BL. Yet it does not indicate that undersampling or applying plain boosting (AB) results in significantly different performances compared to the BL. The Nemenyi test is still conservative, and we therefore compare all classifiers to a single control classifier in the next paragraphs. Note that we only make k − 1 comparisons in that situation, resulting in stronger results compared to the Nemenyi test.
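For reference, the critical difference has the closed form CD = q_α √(k(k+1)/(6N)). The value q_0.05 = 3.313 for k = 13 used below is taken from an extended Studentized-range table and should be treated as an assumption of this sketch, not a value stated in the paper:

```python
import math

def nemenyi_cd(q_alpha, k, N):
    """Critical difference for the Nemenyi test (Demsar 2006):
    CD = q_alpha * sqrt(k(k+1) / (6N)). q_alpha is the critical value of
    the Studentized range statistic divided by sqrt(2)."""
    return q_alpha * math.sqrt(k * (k + 1) / (6 * N))

print(round(nemenyi_cd(3.313, k=13, N=15), 3))  # 4.711
```

With CD ≈ 4.7, two methods in Table 5 differ significantly only if their average ranks are at least that far apart, which is why, for example, BL (11.600) and EE(S = 15) (2.333) are flagged while BL and RUS (8.167) are not.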

When comparing a single classifier i with a control classifier c, the null-hypothesis states that the two classifiers i and c perform equally well. Under this hypothesis, the test statistic (based on the average ranks R_i shown in Table 5)

z = \frac{R_i - R_c}{\sqrt{\frac{k(k+1)}{6N}}}   (8)

is distributed according to a standard normal distribution. This z-value is used to find the corresponding probability (p-value) p from the cumulative standard normal

24 The larger the number of comparisons, the higher the proportion of null-hypotheses that are wrongfully rejected due to random chance.


Table 5 Final comparison of all methods in terms of AUC test set performance; see Section 4.2 for details on the experimental set-up. β ≠ 0 and βu ≠ 0 for the oversampling respectively undersampling techniques; µ = 100 for each of the boosting variants. Standard deviations are given between brackets and the performance difference with the BL is shown in square brackets.

             Mov G(p = 1)       Mov G(p = 25)      Mov Th(p = [])     Yahoo A(p = 1)
BL           71.6(2.62)[0]      81.41(1.32)[0]     79.77(5.33)[0]     55.92(2.97)[0]
OSR          75.35(2.27)[3.8]   83.76(2.09)[2.3]   85.13(6.1)[5.4]    60.05(2.71)[4.1]
SMOTE        76.16(2.27)[4.6]   83.7(2.1)[2.3]     85.67(4.98)[5.9]   60.1(3)[4.2]
ADASYN       76.07(2.26)[4.5]   83.63(2.04)[2.2]   85.65(5.6)[5.9]    59.9(2.99)[4]
RUS          72.88(2.73)[1.3]   81.52(2.15)[0.1]   82.91(7.19)[3.1]   57.04(1.77)[1.1]
Cl Knn       71.43(1.36)[-0.2]  80.88(1.19)[-0.5]  78.87(4.71)[-0.9]  55.78(2.71)[-0.1]
Far Knn      71.9(2.95)[0.3]    80.9(1.48)[-0.5]   84.07(4.64)[4.3]   57.2(1.33)[1.3]
CBU          74.17(2.36)[2.6]   81.51(1.04)[0.1]   82.76(7.22)[3]     58.77(3.43)[2.8]
AB           71.65(1.73)[0.1]   84.52(1.89)[3.1]   82.43(5.18)[2.7]   58.35(2.62)[2.4]
AC(R = 28)   71.61(2.46)[0]     83.46(1.82)[2]     83.27(5.6)[3.5]    57.72(2.47)[1.8]
AC(R = RL)   74.65(2.7)[3.1]    83.35(2.09)[1.9]   85.41(4.49)[5.6]   59.47(2.33)[3.5]
EE(S = 10)   76.04(2.66)[4.4]   85.05(1.85)[3.6]   86.1(5.78)[6.3]    59.66(3.13)[3.7]
EE(S = 15)   76.12(2.88)[4.5]   85.14(1.86)[3.7]   86.42(5.86)[6.7]   59.76(2.93)[3.8]

             Yahoo A(p = 25)    Yahoo G(p = 1)     Yahoo G(p = 25)    TaFeng(p = 1)
BL           61.68(2.42)[0]     66.84(3.66)[0]     78.82(1.39)[0]     55.75(1.6)[0]
OSR          64.59(3.12)[2.9]   73.08(2.96)[6.2]   78.52(2.01)[-0.3]  61.21(2.24)[5.5]
SMOTE        65.56(3.33)[3.9]   73.11(3.12)[6.3]   79.01(1.21)[0.2]   61.72(1.81)[6]
ADASYN       65.13(3.38)[3.4]   73.22(3.17)[6.4]   79.74(1.68)[0.9]   61.68(1.86)[5.9]
RUS          64.11(2.8)[2.4]    70.65(3.39)[3.8]   78.91(1.55)[0.1]   59.25(2.18)[3.5]
Cl Knn       61.14(2.13)[-0.5]  66.34(3.54)[-0.5]  77.26(1.46)[-1.6]  55.77(1.28)[0]
Far Knn      63.96(3.03)[2.3]   66.97(3.54)[0.1]   78.26(2.2)[-0.6]   59.98(1.26)[4.2]
CBU          62.27(1.79)[0.6]   71.27(2.89)[4.4]   75.22(2.42)[-3.6]  58.4(1.57)[2.6]
AB           63.88(2.67)[2.2]   68.9(2.03)[2.1]    79.01(1.66)[0.2]   56.21(1.79)[0.5]
AC(R = 28)   64.32(3.56)[2.6]   68.89(3.11)[2]     78.99(1.89)[0.2]   56.33(1.83)[0.6]
AC(R = RL)   64.31(3.03)[2.6]   73.13(2.8)[6.3]    78.41(2)[-0.4]     61.6(2.26)[5.9]
EE(S = 10)   66.51(3.24)[4.8]   72.61(3.15)[5.8]   80.52(1.6)[1.7]    61.2(1.82)[5.4]
EE(S = 15)   66.36(3.18)[4.7]   73.48(2.32)[6.6]   80.54(1.56)[1.7]   61.13(1.83)[5.4]

             TaFeng(p = 25)     Book(p = 1)        Book(p = 25)       LST(p = 1)
BL           66.94(1.34)[0]     52.6(1.29)[0]      60.08(0.71)[0]     99.99(0.01)[0]
OSR          68.77(1.23)[1.8]   55.87(1.42)[3.3]   64.62(0.57)[4.5]   99.99(0.01)[0]
SMOTE        68.47(1.5)[1.5]    55.07(0.88)[2.5]   62.96(0.82)[2.9]   99.99(0.01)[0]
ADASYN       68.48(1.47)[1.5]   55.04(0.91)[2.4]   63.02(0.57)[2.9]   99.99(0.01)[0]
RUS          68.28(1.39)[1.3]   54.26(0.92)[1.7]   63.28(0.8)[3.2]    99.98(0.01)[0]
Cl Knn       66.13(1.43)[-0.8]  52.69(1.3)[0.1]    60.02(0.79)[-0.1]  99.99(0.01)[0]
Far Knn      68.06(1.41)[1.1]   56.25(1.52)[3.7]   64.15(1.12)[4.1]   99.98(0.01)[0]
CBU          63.84(1.07)[-3.1]  53.75(1.01)[1.2]   54.68(0.88)[-5.4]  []
AB           67.65(1.55)[0.7]   54.27(1.95)[1.7]   65(0.67)[4.9]      99.99(0.01)[0]
AC(R = 28)   69.31(1.23)[2.4]   53.72(1)[1.1]      61.24(0.8)[1.2]    99.98(0.01)[0]
AC(R = RL)   67.15(1.51)[0.2]   55.73(1.22)[3.1]   64.6(0.64)[4.5]    99.99(0.01)[0]
EE(S = 10)   70.3(1.35)[3.4]    55.09(1.29)[2.5]   65.37(0.61)[5.3]   99.98(0.01)[0]
EE(S = 15)   70.4(1.3)[3.5]     55.35(1.26)[2.8]   65.4(0.51)[5.3]    99.98(0.01)[0]


Table 5 Continued. Additionally, an average rank column is added, showing the mean rank of each algorithm across all datasets. Note that we excluded the LST dataset for this purpose.

             Adver(p = [])      Adver(p = 1)       CRF(p = [])         Bank(p = [])
BL           96.61(1.82)[0]     90.93(3.02)[0]     64.06(16.43)[0]     66.82(0.88)[0]
OSR          96.93(1.91)[0.3]   93.3(2.02)[2.4]    80.74(12.93)[16.7]  71.39(0.79)[4.6]
SMOTE        97.05(1.66)[0.4]   93.35(2.01)[2.4]   78.7(16.56)[14.6]   []
ADASYN       96.91(1.95)[0.3]   93.46(2.21)[2.5]   78.87(16.71)[14.8]  []
RUS          96.81(1.87)[0.2]   92.38(2.51)[1.5]   83.98(5.99)[19.9]   69.41(1.19)[2.6]
Cl Knn       96.4(1.48)[-0.2]   89.73(3.42)[-1.2]  76.63(16.19)[12.6]  66.17(0.72)[-0.6]
Far Knn      95.77(1.81)[-0.8]  93.88(1.78)[3]     83.75(13.11)[19.7]  66.95(0.56)[0.1]
CBU          97.15(1.88)[0.5]   94.18(2.3)[3.3]    []                  []
AB           97.34(2.18)[0.7]   91.39(3.23)[0.5]   77.62(15.15)[13.6]  66.82(0.88)[0]
AC(R = 28)   97.44(1.93)[0.8]   91(3.35)[0.1]      68.31(14.93)[4.2]   67.67(0.71)[0.9]
AC(R = RL)   97.46(1.71)[0.8]   93.51(2.17)[2.6]   85.08(9.77)[21]     70.7(0.8)[3.9]
EE(S = 10)   97.64(1.35)[1]     92.97(2.75)[2]     86.18(10.17)[22.1]  71.46(0.81)[4.6]
EE(S = 15)   97.63(1.35)[1]     93.3(2.14)[2.4]    86.35(9.99)[22.3]   71.54(0.76)[4.7]

             Average Rank
BL           11.600
OSR          5.000
SMOTE        4.533
ADASYN       4.800
RUS          8.167
Cl Knn       12.467
Far Knn      8.133
CBU          8.567
AB           8.267
AC(R = 28)   8.467
AC(R = RL)   5.400
EE(S = 10)   3.267
EE(S = 15)   2.333

Table 6 Nemenyi test outcomes at the α = 0.05 significance level. A value of 1 indicates the null-hypothesis is rejected and thus finds the two algorithms significantly different (a value of 0 means we accept the null-hypothesis). Methods are listed in the same order as Table 5 and are represented more concisely.

      BL  RO  SM  AD  RU  Cl  Fa  CBU  AB  AC1  AC2  EE1  EE2
BL    0   1   1   1   0   0   0   0    0   0    1    1    1
RO    1   0   0   0   0   1   0   0    0   0    0    0    0
SM    1   0   0   0   0   1   0   0    0   0    0    0    0
AD    1   0   0   0   0   1   0   0    0   0    0    0    0
RU    0   0   0   0   0   0   0   0    0   0    0    1    1
Cl    0   1   1   1   0   0   0   0    0   0    1    1    1
Fa    0   0   0   0   0   0   0   0    0   0    0    1    1
CBU   0   0   0   0   0   0   0   0    0   0    0    1    1
AB    0   0   0   0   0   0   0   0    0   0    0    1    1
AC1   0   0   0   0   0   0   0   0    0   0    0    1    1
AC2   1   0   0   0   0   1   0   0    0   0    0    0    0
EE1   1   0   0   0   1   1   1   1    1   1    0    0    0
EE2   1   0   0   0   1   1   1   1    1   1    0    0    0


distribution function: p = 2 × min(Φ(z), 1 − Φ(z)). Holm's method (Holm 1979) compares k − 1 classifiers to a single control classifier and sorts the corresponding p-values in ascending order, so that p_1 ≤ p_2 ≤ … ≤ p_{k−1}. Each p_i is subsequently compared to its associated confidence level25 α_comp = α/(k − i). Holm starts by performing the check p_1 < α/(k − 1) and rejects the null-hypothesis if the test passes. It then proceeds with a similar check for i = 2, and continues until a certain null-hypothesis cannot be rejected. The remaining hypotheses are retained.
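Holm's step-down procedure against a control classifier follows directly from Eq. (8) and the rule p_i < α/(k − i); feeding it the average ranks of Table 5 with BL as control reproduces the z- and p-values of Table 7. The function below is an illustrative reimplementation, not the authors' code:

```python
import math

def holm_vs_control(avg_ranks, control, N=15, alpha=0.05):
    """For every algorithm i != control: z = (R_i - R_c)/sqrt(k(k+1)/(6N)),
    two-sided normal p-value, then Holm's step-down check p_i < alpha/(k - i)
    over the ascending p-values. Returns (name, z, p, rejected) tuples."""
    k = len(avg_ranks)
    se = math.sqrt(k * (k + 1) / (6 * N))
    res = []
    for name, r in avg_ranks.items():
        if name == control:
            continue
        z = (r - avg_ranks[control]) / se
        phi = 0.5 * (1 + math.erf(z / math.sqrt(2)))   # standard normal CDF
        res.append((name, z, 2 * min(phi, 1 - phi)))
    res.sort(key=lambda t: t[2])                       # ascending p-values
    out, still_rejecting = [], True
    for i, (name, z, p) in enumerate(res, start=1):
        still_rejecting = still_rejecting and p < alpha / (k - i)
        out.append((name, z, p, still_rejecting))
    return out

ranks = {"BL": 11.600, "OSR": 5.000, "SMOTE": 4.533, "ADASYN": 4.800,
         "RUS": 8.167, "Cl Knn": 12.467, "Far Knn": 8.133, "CBU": 8.567,
         "AB": 8.267, "AC(R=28)": 8.467, "AC(R=RL)": 5.400,
         "EE(S=10)": 3.267, "EE(S=15)": 2.333}
for name, z, p, rej in holm_vs_control(ranks, "BL"):
    print(f"{name:10s} z={z:+.3f} p={p:.2e} reject={rej}")
```

The step-down chain stops at Far Knn (p = 0.0148 > 0.05/6 = 0.0083), so all later hypotheses are retained, exactly as in Table 7.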

Table 7 shows the result of the Holm test at the α = 0.05 significance level, with the BL algorithm as control classifier. The table lists each method according to sorted p-values and shows the z-, p- and α_comp-values. The significance column indicates whether the proposed method is significantly different from the BL, and coincidentally matches the result of the Nemenyi test. The p-values of the (Far Knn, RUS, CBU) undersampling methods, AB and AC (R = 28) are only slightly higher than the level required to reject the null-hypothesis. We included a critical significance level α_crit, corresponding to the lowest possible significance level at which the method would be considered significantly different from the BL (if α = α_crit then p = α_comp). In the case of AB, for example, if we were to choose a significance level α = 0.0764 (a confidence level of 92.36%), the associated p-value would be smaller than α_comp and we would proceed to conclude26 that AB performs significantly different from the BL. To summarize: the null-hypothesis of equal performance between the BL and an alternative method proposed in Section 3 (except Cl Knn) is rejected with a confidence level of more than 91% according to Holm's test. Hence, the performance of the BL approach is significantly worse than that of each of the proposed over- and undersampling methods (except Cl Knn), cost-sensitive learning techniques (AC) and boosting variants (AB, EE).

Similar to the previous paragraph, Table 8 presents the results of Holm's test at the α = 0.05 significance level, with EE(S = 15) as control classifier. The results coincidentally correspond with the result of the Nemenyi test and indicate that the oversampling methods (OSR, SMOTE and ADASYN), AC(R = RL) and EE(S = 10) are not statistically different from the EE(S = 15) technique at the 0.05 significance level. However, if we were to adapt the significance level to approximately α = 0.25 (75% confidence level), the results would conclude significance (except for EE(S = 10)). To summarize: the null-hypothesis of equal performance between EE(S = 15) and an alternative method proposed in Section 3 (except EE(S = 10)) is rejected with a confidence level of more than 75% according to Holm's test. Hence, the performance of the oversampling, undersampling, AB and AC approaches is significantly worse than that of EE(S = 15). We can therefore safely conclude EE to be the preferred method in terms of AUC performance when dealing with imbalanced behaviour data.

25 α_comp adjusts the value of α to compensate for multiple comparisons.
26 This is not entirely true, since Holm's method requires Far Knn to pass the test first before proceeding, which only occurs at the 0.088662 significance level.


Table 7 Holm test at the α = 0.05 significance level with BL reference. The table shows the z test statistic with associated p-value; α_comp = α/(k − i), with k the number of algorithms (k = 13) and i corresponding to the position of the method in the sorted p-value vector. Since we already ranked the p-values in ascending order, i takes on the row number (e.g. OSR has i = 5). The column "significant" denotes if we can reject the null-hypothesis (significant: p < α_comp). α_crit corresponds with the smallest possible significance level where we would decide to reject the null-hypothesis (α_crit = α p/α_comp).

             z          p         α_comp    significant  α_crit
EE(S = 15)   -6.51642   7.2E-11   0.004167  1            8.64E-10
EE(S = 10)   -5.86009   4.63E-09  0.004545  1            5.09E-08
SMOTE        -4.96936   6.72E-07  0.005     1            6.72E-06
ADASYN       -4.78183   1.74E-06  0.005556  1            1.56E-05
OSR          -4.64119   3.46E-06  0.00625   1            2.77E-05
AC(R = RL)   -4.35991   1.3E-05   0.007143  1            9.11E-05
Far Knn      -2.4378    0.014777  0.008333  0            0.088662
RUS          -2.41436   0.015763  0.01      0            0.078815
AB           -2.34404   0.019076  0.0125    0            0.076305
AC(R = 28)   -2.20339   0.027567  0.016667  0            0.082701
CBU          -2.13307   0.032919  0.025     0            0.065837
Cl Knn       0.609449   0.542227  0.05      0            0.542227

Table 8 Holm test at the α = 0.05 significance level with EE(S = 15) reference

             z          p         α_comp    significant  α_crit
Cl Knn       7.12587    1.03E-12  0.004167  1            1.24E-11
BL           6.516421   7.2E-11   0.004545  1            7.92E-10
CBU          4.383348   1.17E-05  0.005     1            0.000117
AC(R = 28)   4.313027   1.61E-05  0.005556  1            0.000145
AB           4.172384   3.01E-05  0.00625   1            0.000241
RUS          4.102063   4.09E-05  0.007143  1            0.000287
Far Knn      4.078623   4.53E-05  0.008333  1            0.000272
AC(R = RL)   2.156513   0.031044  0.01      0            0.155218
OSR          1.875229   0.060761  0.0125    0            0.243045
ADASYN       1.734587   0.082814  0.016667  0            0.248442
SMOTE        1.547064   0.121848  0.025     0            0.243696
EE(S = 10)   0.65633    0.511612  0.05      0            0.511612

4.6.2 Timing-wise comparison

The performance of each of the proposed algorithms in terms of training times depends highly upon the chosen parameters. The β and βu parameters control the number of training instances and are an important contribution to the subsequent SVM learning times for each of the over- and undersampling techniques. The number of boosting iterations t ∈ [0, T] is also a major factor in each of the boosting variants proposed in Section 3.3. Furthermore, the chosen regularization parameter C highly influences the training times of the SVM inducer: higher C-values result in larger training times (Fan et al 2008). Not only do the chosen hyperparameters affect learning times; data characteristics such as the number of minority instances, the imbalance level, the number of features, the amount of overlapping, etc. have a major effect as well.


In comparing each of the methods outlined in Section 3, we make use of a methodology similar to the one previously presented (see Section 4.6.1). For each of the ten folds, a suitable parameter combination is selected based on optimal validation set AUC-performance. These chosen parameters are applied in determining the timings of each method. The final results indicate the average timings across ten folds. Hence we show the required amount of time for each method when using the parameter combination resulting in the best performance. Note that this might, for instance, result in using β = 1/3 in case of OSR, or using only t = 5 boosting iterations in EE. Computational timings are shown in Table 9.

As one can observe from the aforementioned table, CBU is the slowest method, as it relies on computationally intensive clustering procedures. The synthetic SMOTE and ADASYN oversampling approaches are also very time consuming: they both rely on computationally expensive nearest neighbour computations, in conjunction with a synthetic sample generation procedure that requires a single pass over all minority training instances.

The RUS method shows, as expected, superior results in terms of computational timings. Other undersampling methods perform quite well, though they are slower than BL and OSR due to their inherent nearest neighbour computations.

The boosting variants seem to perform worse than the BL, OSR and each of the undersampling methods. Since EE was the best method in terms of AUC-performance (see Section 4.6.1), we focus on this learner in this paragraph. For the medium sized datasets (all but CRF and Bank), OSR outperforms EE(S = 15); yet for the large datasets (CRF and Bank) the opposite is true. An SVM is known to have a quadratic computational complexity in the number of training instances (Suykens et al 2002; García and Lozano 2007). For this reason, we consider OSR (or any other oversampling technique) inappropriate for large datasets. Since with EE each subset has a size limited to twice the minority class training size, we expect increasing benefits of the EE-technique for larger and more imbalanced behaviour data. As Junque de Fortuny et al (2014a) have observed, larger behaviour data in terms of the number of instances or features also contributes to an increase in predictive power. Hence we expect to see this type of big behaviour data more often in real-life applications, where computational resources can become a major constraint. Because the EE-technique is easily parallelizable by simultaneously performing the boosting process for each of the independent subsets, we added an extra row in Table 9: EE par corresponds with the time needed for EE(S = 15) divided by 15, indicating the time that would be required for EE if each subset were trained in parallel. We can see that EE par can outperform OSR in 8 out of 16 cases, precisely for the datasets that show a high imbalance level (e.g. p = 1 or p = []).
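The "EE par" row corresponds to training the S independent subsets concurrently, which can be sketched as below. The subset trainer is a stub standing in for "sample a balanced subset and boost SVMs on it"; real CPU-bound SVM training would be dispatched to processes rather than threads.

```python
from concurrent.futures import ThreadPoolExecutor

def train_boosted_subset(seed):
    """Stub for 'sample a balanced subset and boost weak SVMs on it';
    returns (alphas, hypotheses) for one subset. Real training goes here."""
    alphas = [1.0 / (seed + 1)]
    hyps = [lambda x, s=seed: float(x > s)]
    return alphas, hyps

def train_ee(S, parallel=True):
    seeds = range(S)
    if parallel:  # the S subsets are independent, so they can train concurrently
        with ThreadPoolExecutor(max_workers=S) as ex:
            return list(ex.map(train_boosted_subset, seeds))
    return [train_boosted_subset(s) for s in seeds]

ens = train_ee(S=15)
score = sum(a * h(3) for alphas, hyps in ens for a, h in zip(alphas, hyps))
print(round(score, 4))  # 1.8333 = 1 + 1/2 + 1/3 (seeds 0, 1, 2 fire for x = 3)
```

Because the subsets share no state, the parallel and sequential variants produce identical ensembles, which is exactly why dividing the EE(S = 15) time by 15 is a fair estimate of the parallel wall-clock time.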

We conclude our comparison of the proposed methods in Figure 5, which shows the average ranks in terms of computational timings (see Table 9) and AUC performance for each of the available data sources. Regarding the ranks with respect to the latter, we added an extra row EE par, having the same AUC-performance as EE(S = 15) for each dataset in Table 5. Ranks are subsequently determined according to the procedure outlined in Section 4.6.1. Additionally, Figure 18 in Appendix D shows the result for the large and highly imbalanced CRF and Bank datasets. We expect the latter type of data to be of major interest for reasons already indicated.


Table 9 Computational timings (in seconds) averaged across ten folds, using the parameter combination resulting in the highest validation set AUC-performance. First and second best performances are emphasized in boldface (third and fourth best are underlined). EE par is the time required for EE(S = 15) divided by 15.

             Mov G(p = 1)  Mov G(p = 25)  Mov Th(p = [])  Yahoo A(p = 1)
BL           0.032889      0.056697       0.558563        0.026922
OSR          0.055043      0.062802       0.99009         0.044421
SMOTE        0.218821      0.937057       3.841482        0.057726
ADASYN       0.284688      1.802399       5.191265        0.087694
RUS          0.011431      0.025383       0.155224        0.007991
Cl Knn       0.046599      0.599846       0.989914        0.037182
Far Knn      0.039887      0.80072        0.683023        0.027788
CBU          10.34111      10.60173       68.22839        16.92477
AB           0.169792      0.841443       3.460246        0.139251
AC(R = 28)   0.471994      2.996585       10.86907        0.366555
AC(R = RL)   0.53376       1.179542       6.065177        0.209015
EE(S = 10)   0.117226      6.065145       1.17995         0.148973
EE(S = 15)   0.20474       7.173737       2.119991        0.180365
EE par       0.013649      0.478249       0.141333        0.012024

             Yahoo A(p = 25)  Yahoo G(p = 1)  Yahoo G(p = 25)  TaFeng(p = 1)
BL           0.092954         0.011915        0.044164         0.026728
OSR          0.027887         0.013241        0.047206         0.040919
SMOTE        1.062686         0.056153        0.883698         0.219553
ADASYN       2.050993         0.079073        1.733367         0.306618
RUS          0.048471         0.003234        0.033423         0.002916
Cl Knn       0.84391          0.025404        0.502515         0.092167
Far Knn      0.664124         0.026576        0.500206         0.080159
CBU          15.69442         12.87221        13.55035         24.67279
AB           0.445546         0.078777        0.169977         0.114619
AC(R = 28)   1.034044         0.321723        0.515953         0.926178
AC(R = RL)   0.706215         0.226741        0.112949         0.610233
EE(S = 10)   1.026577         0.100331        1.527146         0.058052
EE(S = 15)   1.607596         0.077483        2.472582         0.10538
EE par       0.107173         0.005166        0.164839         0.007025

             TaFeng(p = 25)  Book(p = 1)  Book(p = 25)  LST(p = 1)
BL           0.032033        0.080035     0.318093      0.652045
OSR          0.032414        0.132927     0.092757      0.87152
SMOTE        5.089283        3.409418     11.43444      4.987705
ADASYN       8.148419        3.689661     12.25441      6.840083
RUS          0.020457        0.022713     0.031972      0.432839
Cl Knn       1.713731        0.400873     3.711648      2.508374
Far Knn      1.539437        0.379086     3.988552      2.511037
CBU          26.42686        41.98663     46.31987      []
AB           0.713265        0.61719      12.38585      2.466151
AC(R = 28)   1.234647        1.666131     2.330635      1.451671
AC(R = RL)   0.279047        0.860346     0.197053      1.23763
EE(S = 10)   2.484502        2.145747     7.177484      0.524066
EE(S = 15)   3.363971        2.480066     11.21945      0.784111
EE par       0.224265        0.165338     0.747963      0.052274

Imbalanced classification in sparse and large behaviour datasets 37

Table 9 Continued. Additionally, an extra column indicating average ranks is included (lower ranks are preferred).

              Adver(p = [])  Adver(p = 1)  CRF(p = [])  Bank(p = [])
BL              0.010953       0.002796      0.725911     70.89334
OSR             0.012178       0.006166      3.685813    179.7481
SMOTE           0.123112       0.017764      5.633862       []
ADASYN          0.183767       0.021728      5.768669       []
RUS             0.012115       0.00204       0.147392     52.47441
CL Knn          0.061324       0.005568      1.106755     73.73282
Far Knn         0.079078       0.007069      1.110379     97.59619
CBU             3.378235       3.236754        []           []
AB              0.069199       0.103518      1.153196     83.08618
AC(R = 28)      0.193092       0.068905      2.047434     71.70548
AC(R = RL)      0.107652       0.037963      1.387174    106.3466
EE(S = 10)      0.138485       0.085686      0.198656     24.95117
EE(S = 15)      0.185136       0.139121      0.285345     36.40107
EE par          0.012342       0.009275      0.019023      2.426738

              Average Rank [pos]
BL               2.94  [2]
OSR              4.19  [4]
SMOTE            9.59 [11]
ADASYN          10.91 [13]
RUS              1.38  [1]
CL Knn           6.5   [5]
Far Knn          6.56  [6]
CBU             14    [14]
AB               8.06  [7]
AC(R = 28)      10.81 [12]
AC(R = RL)       9.25  [9]
EE(S = 10)       8.25  [8]
EE(S = 15)       9.56 [10]
EE par           3     [3]

Timings are a substantial concern with respect to these large datasets. As can be observed, the EE-technique thrives for these big and highly imbalanced data, even in its non-parallel form.


[Figure 5: scatter plot of average rank AUC (x-axis) versus average rank Time (y-axis), with markers for BL, OSR, SMOTE, ADASYN, RUS, Cl Knn, Far Knn, CBU, AB, AC(R=28), AC(R=RL), EE(S=10), EE(S=15) and EE par]

Fig 5 Average rank AUC versus average rank Time (see Table 9) across all datasets from Table 2 (excluding Flickr and Kdd). Regarding the AUC rank, an extra row EE par is added having the same AUC-value as EE(S = 15). Ranks for each dataset are subsequently obtained via the procedure outlined in Section 4.6.1. Points occurring in the upper-right region are preferred.

5 The effect of the chosen base learner

In this short section we investigate the influence of the chosen base learner. More specifically, we ask whether some base inducers can more easily cope with the imbalanced learning issue, and in doing so we verify the conclusion that EE is a suitable technique to elevate upon the baseline performance. We rely on several popular classification methods for behaviour data: two regularization-based techniques (SVM, LR) and two heuristic types of approaches (Naive Bayes (NB), Behavioural Similarity (BeSim)). The next paragraphs briefly introduce these methods. Note that source codes and dataset results are provided in our on-line repository.

LR constructs a linear model that maximizes the likelihood of the observed data by imposing a logistic model for the parametric probability model P(Y|X). Ng and Jordan (2002) provide a thorough introduction to the topic27 and note that in its plain form LR can suffer from overfitting, especially when the input data is very high-dimensional. Therefore, we resort to regularized logistic regression. We opt for L2-regularized LR, since L1-regularization corresponds with a natural way of imposing feature selection (Ng 2004). The LIBLINEAR toolbox (Fan et al 2008) is used to obtain these L2-regularized LR models.
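As a minimal illustration of the above (not the LIBLINEAR implementation used in the experiments), the following sketch fits L2-regularized LR by gradient descent on sparse binary rows, each represented as a set of active feature indices; all function names and the toy data representation are ours:

```python
import math

def train_l2_lr(rows, labels, n_features, C=1.0, lr=0.1, epochs=200):
    """L2-regularized logistic regression on sparse binary rows.
    rows: iterable of sets of active feature indices; labels: 0/1.
    Minimizes (1/2)||w||^2 + C * sum(log-loss) by full-batch gradient descent,
    so a larger C weighs the data term more heavily (a 'stronger' learner)."""
    w = [0.0] * n_features
    b = 0.0
    for _ in range(epochs):
        gw = list(w)                       # gradient of the L2 term is w itself
        gb = 0.0
        for row, y in zip(rows, labels):
            z = b + sum(w[j] for j in row)
            p = 1.0 / (1.0 + math.exp(-z))
            err = C * (p - y)
            for j in row:                  # only active features contribute
                gw[j] += err
            gb += err
        w = [wi - lr * gi for wi, gi in zip(w, gw)]
        b -= lr * gb
    return w, b

def predict_score(w, b, row):
    z = b + sum(w[j] for j in row)
    return 1.0 / (1.0 + math.exp(-z))
```

Because only active indices are touched, one pass costs time proportional to the number of non-zeros, which is what makes linear models attractive for sparse behaviour data.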

27 For an accessible introduction, see the chapter on "Generative and Discriminative Classifiers: Naive Bayes and Logistic Regression" provided at the following link: http://www.cs.cmu.edu/~tom/NewChapters.html

Imbalanced classification in sparse and large behaviour datasets 39

NB (Ng and Jordan 2002) relies on the use of Bayes' rule, which expresses the posterior P(Y|X) as the product of the likelihood P(X|Y) and the prior P(Y), divided by the evidence P(X). The naivety of the approach originates from assuming conditional independence of the features. We impose a multivariate event model regarding the distribution P(X|Y). Junque de Fortuny et al (2014a) provide an efficient NB implementation for dealing with large and sparse binary behaviour data.
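The sparseness trick behind such implementations can be sketched as follows (an illustrative multivariate Bernoulli NB, not the implementation of Junque de Fortuny et al.; names and data layout are ours): precompute the log-likelihood of an all-zeros row per class, then correct only for the active features of each instance.

```python
import math
from collections import defaultdict

def train_bernoulli_nb(rows, labels, n_features, alpha=1.0):
    """Multivariate Bernoulli naive Bayes for sparse binary data.
    rows: iterable of sets of active feature indices; labels: 0/1."""
    counts = {0: defaultdict(int), 1: defaultdict(int)}
    n = {0: 0, 1: 0}
    for row, y in zip(rows, labels):
        n[y] += 1
        for j in row:
            counts[y][j] += 1
    model = {}
    for y in (0, 1):
        p = [(counts[y][j] + alpha) / (n[y] + 2 * alpha) for j in range(n_features)]
        logp = [math.log(pj) for pj in p]
        log1mp = [math.log(1 - pj) for pj in p]
        base = sum(log1mp)  # log-likelihood of the all-zeros row
        model[y] = (math.log(n[y] / len(labels)), logp, log1mp, base)
    return model

def log_posterior(model, row, y):
    prior, logp, log1mp, base = model[y]
    # start from the all-zeros likelihood, then correct only the active features
    return prior + base + sum(logp[j] - log1mp[j] for j in row)
```

Scoring an instance then costs time proportional to its number of active features rather than to the full (very large) dimensionality.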

The BeSim approach conceptually corresponds to projecting the bigraph to a unigraph using the procedure outlined in Section 3.2. Next, the weighted-vote relational neighbour classifier (Macskassy and Provost 2007) is used to infer unknown labels/scores through a weighted probability estimation using the known labels of neighbouring nodes. A very fast and scalable implementation of this approach (the SW-transformation) is presented in detail in Stankova et al (2015).
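Conceptually, the weighted-vote step can be sketched as below (an illustrative toy version, not the SW-transformation of Stankova et al.; the shared-feature link weight and all names are our simplifying assumptions):

```python
from collections import defaultdict

def besim_scores(train_rows, train_labels, test_rows):
    """Weighted-vote relational neighbour scoring on the projected unigraph.
    Two instances are linked with a weight equal to the number of behaviours
    (active features) they share; a test instance's score is the weighted
    fraction of positively labelled neighbours (0.5 if it has no neighbours)."""
    # invert the bipartite adjacency: feature -> labelled instances having it
    feat_index = defaultdict(list)
    for i, row in enumerate(train_rows):
        for j in row:
            feat_index[j].append(i)
    scores = []
    for row in test_rows:
        votes = defaultdict(float)
        for j in row:
            for i in feat_index[j]:
                votes[i] += 1.0            # one unit of weight per shared behaviour
        tot = sum(votes.values())
        pos = sum(w * train_labels[i] for i, w in votes.items())
        scores.append(pos / tot if tot else 0.5)
    return scores
```

The inverted feature index is what keeps this scalable: each test instance only visits the labelled instances with which it actually shares behaviours.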

In Table 10 we compare the performances of the baseline (BL) application of the methods with their EE-version (with S = 15, T = 20). It is quite striking to see the EE-version dominating the baseline version when looking at each method individually. This suggests that the imbalanced learning issue has an impact on each of the aforementioned techniques. Focusing on the regularization-based approaches, we observe the baseline LR to be less influenced by the imbalance than the SVM method. Their EE-versions (EE SVM and EE LR) can boost the baseline performance in almost all cases and have the highest average ranks amongst all the proposed methods. As already emphasized, the regularization-based approaches offer an added element of flexibility in the sense that the strength of the underlying base learner can be controlled.

The heuristic approaches NB and BeSim do not offer this flexibility. Furthermore, it is not directly possible to integrate instance weights during the training phase of the base learner. During the boosting process, we therefore sample from the distribution Dt to generate (unweighted) examples and train a BeSim/NB learner next. The results in Table 10 indicate BL NB to be the worst performing algorithm amongst all methods tried. Due to the conditional independence assumption, one could regard this as a weak learner, which is therefore suitable to be used in a boosting process. Indeed, the EE NB approach is able to significantly improve upon its performance. We found this to be the slowest EE-version28 (e.g., taking more than 24 hours to process a single fold of the Bank dataset, versus several minutes for the other versions). The average rank estimate of EE NB is rather pessimistic, since we assigned the lowest performance across all methods in case of missing results. BeSim can be considered a stronger learner compared to NB. Its possible boost in performance is typically lower than that of NB. Having said this, we found EE BeSim to improve upon BL BeSim for all datasets (except Adver(p = 1)).
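The weighted-resampling workaround described above can be sketched as follows (an illustrative helper with hypothetical names, not the authors' code): draw an unweighted sample in which each instance appears with probability proportional to its boosting weight.

```python
import random

def resample_from_distribution(instances, weights, rng=random.Random(7)):
    """Draw an unweighted sample of the same size where instance i is picked
    with probability proportional to its boosting weight D_t(i); the result
    can be fed to a weight-agnostic learner such as NB or BeSim."""
    total = sum(weights)
    probs = [w / total for w in weights]
    return rng.choices(instances, weights=probs, k=len(instances))
```

Instances with large weight are duplicated and instances with negligible weight tend to drop out, approximating weighted training.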

28 This statement only relates to the EE-version. In the baseline case, the timings of the NB-version are roughly of the same order of magnitude as SVM and LR (though this heavily depends on the chosen regularization parameter C). With EE, each subset has a size limited to twice the minority class training size (a relatively low number compared to the size of the majority class). This works in favour of SVM/LR (e.g., SVM has a computational complexity of O(m²), with m the number of training instances). Also note that the NB-version is indeed optimized for sparse data. Internally, the NB-implementation writes a fairly large model file to a specified directory (sizes of 1 GB can be encountered there). This writing operation slows down the training stage (especially in case of EE, where S × T = 15 × 20 = 300 model files are constructed).


Table 10 Average tenfold AUC test set performance of the BL compared to its EE-version (S = 15, T = 20) for SVM, LR, BeSim and NB. In case the EE NB implementation crashed, we put a × in the table. If it is too slow (> 24 hours per fold), we place a []. Additionally, average ranks are assigned to each method. In case EE NB has missing results for a particular dataset, we ensure it has the lowest rank across all other algorithms.

            Mov G(p = 1)   Mov G(p = 25)  Mov Th(p = [])  Yahoo A(p = 1)
BL SVM      71.6  (2.62)   81.41 (1.32)   79.77  (5.33)   56.49 (3.37)
EE SVM      76.12 (2.88)   85.13 (1.86)   86.43  (5.86)   59.74 (2.96)
BL LR       71.02 (2.09)   84.39 (1.84)   83.14  (4.17)   57.84 (2.39)
EE LR       76.69 (2.92)   85.03 (1.98)   86.3   (5.37)   59.79 (2.62)
BL BeSim    76.1  (3.58)   81.3  (2.92)   82.81  (6.6)    56.27 (2.73)
EE BeSim    76.31 (3.71)   81.37 (2.9)    85.02  (6.28)   57.7  (1.71)
BL NB       70.26 (5.84)   77.01 (2.54)   70.48 (10.14)   52.56 (2.09)
EE NB       75.93 (2.83)   85.56 (2.01)   86.91  (4.15)   57.55 (2.73)

            Yahoo A(p = 25)  Yahoo G(p = 1)  Yahoo G(p = 25)  TaFeng(p = 1)
BL SVM      61.61 (2.48)     66.84 (3.66)    78.82 (1.39)     55.75 (1.6)
EE SVM      66.38 (3.16)     73.48 (2.32)    80.55 (1.55)     61.13 (1.83)
BL LR       66.27 (2.96)     69.82 (1.93)    80.45 (1.59)     58.91 (2.31)
EE LR       66.22 (3.28)     73.08 (2.14)    80.53 (1.56)     61.43 (2.32)
BL BeSim    64.54 (2.02)     68.89 (2.49)    79.55 (1.96)     57.89 (1.18)
EE BeSim    65.25 (2.23)     71.18 (2.91)    80.04 (1.85)     59.36 (1.47)
BL NB       65.0  (1.65)     63.33 (2.56)    78.89 (1.64)     54.61 (1.2)
EE NB       66.6  (2.79)     70.99 (2.88)    81.01 (1.3)      59.01 (1.84)

            TaFeng(p = 25)  Book(p = 1)   Book(p = 25)  LST(p = 1)
BL SVM      66.94 (1.34)    52.6  (1.29)  60.08 (0.71)  99.99 (0.01)
EE SVM      70.4  (1.3)     55.34 (1.28)  65.4  (0.51)  99.98 (0.01)
BL LR       69.24 (1.3)     55.34 (1.27)  63.84 (0.75)  99.99 (0.01)
EE LR       70.28 (1.28)    55.49 (1.49)  65.41 (0.63)  99.97 (0.02)
BL BeSim    67.49 (1.23)    55.19 (1.27)  63.7  (0.63)  99.99 (0.01)
EE BeSim    68.0  (1.21)    55.21 (1.15)  64.38 (0.42)  99.99 (0.0)
BL NB       65.21 (1.64)    52.93 (0.9)   59.75 (0.47)  98.69 (0.3)
EE NB       70.72 (1.15)      ×           63.46 (0.61)  99.92 (0.04)

            Adver(p = [])  Adver(p = 1)  CRF(p = [])    Bank(p = [])
BL SVM      96.37 (1.94)   91.18 (2.97)  64.36 (18.97)  66.82 (0.88)
EE SVM      97.63 (1.35)   93.3  (2.14)  86.35  (9.99)  71.54 (0.76)
BL LR       97.19 (1.44)   88.51 (1.93)  81.87 (19.63)  71.43 (0.72)
EE LR       97.57 (0.96)   93.02 (2.06)  86.84  (9.62)  71.77 (0.62)
BL BeSim    97.26 (1.12)   95.38 (1.35)  86.91  (9.36)  67.85 (0.67)
EE BeSim    97.38 (1.04)   93.83 (1.35)  87.02 (10.43)  70.41 (0.55)
BL NB       93.75 (1.9)    93.37 (1.9)   87.24  (9.38)  67.83 (0.63)
EE NB       94.04 (1.75)     ×             ×              []

            Flickr(p = 0.1)  Kdd(p = 0.5)  Average Rank
BL SVM      74.92 (0.17)     74.53 (0.05)    6.44 [7]
EE SVM      79.86 (0.13)     80.98 (0.05)    2.39 [1]
BL LR       79.03 (0.11)     81.29 (0.04)    4.28 [4]
EE LR       79.85 (0.13)     80.75 (0.05)    2.61 [2]
BL BeSim    74.62 (0.13)     74.95 (0.0)     5.11 [6]
EE BeSim    76.4  (0.13)     77.55 (0.03)    3.61 [3]
BL NB       81.36 (0.1)      74.29 (0.05)    6.5  [8]
EE NB         []               []            5.06 [5]


6 Conclusions and future research directions

We investigated the effects of sampling and cost-sensitive learning techniques on the problem of learning from imbalanced behaviour data. This type of big data has fundamentally different properties and internal structure than traditional data and is characterized by sparseness and very large dimensions. This is a new research topic which enables benefits across a wide variety of application domains, such as fraud detection, targeted advertising, predictive policing, churn prediction, default prediction, etc. To enable future developments in this field, we provide an on-line repository containing implementations, datasets and results, available at the following link: http://www.applieddatamining.com/cms?q=software

Oversampling techniques need to be adapted in order to cope with this kind of data. We proposed new versions of the traditional SMOTE and ADASYN techniques, which differ in the way synthetic samples are generated and similarity computations are performed. The plain random oversampling (OSR) and the synthetic approaches are competitive methods in terms of AUC and do not seem to suffer from overfitting, as traditional studies report. Though the synthetic approaches have comparable AUC-performances in comparison to OSR, they are useless in practice due to their computationally expensive procedures. The OSR method is very fast for small and medium sized datasets, yet becomes inappropriate for larger datasets because the training times of the underlying base learner increase drastically.
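For reference, plain OSR amounts to duplicating randomly chosen minority instances until a target balance is reached; a minimal sketch (our naming and toy data layout, with β as the desired minority/majority ratio) is:

```python
import random

def random_oversample(rows, labels, beta=1.0, rng=random.Random(0)):
    """Random oversampling (OSR): duplicate randomly chosen minority rows
    (label 1) until the minority size equals beta times the majority size."""
    minority = [r for r, y in zip(rows, labels) if y == 1]
    majority = [r for r, y in zip(rows, labels) if y == 0]
    target = int(beta * len(majority))
    extra = [rng.choice(minority) for _ in range(max(0, target - len(minority)))]
    return rows + extra, labels + [1] * len(extra)
```

Note that the enlarged training set is exactly what makes OSR expensive on big behaviour data: the base learner now trains on up to twice the majority class size.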

With respect to undersampling methodologies, we employed a random undersampling technique (RUS) and Knn-based informed approaches. The latter type selects majority class instances that are closest (Cl Knn) or farthest (Far Knn) from the minority class examples and allows one to investigate the effect of noise/outliers on SVM performance. We observed that these odd instances have a severe performance degrading effect. RUS and Far Knn approaches can significantly improve performance with respect to the baseline, thanks to their abilities to reduce noise/outlier levels and to put more emphasis on the minority class. Having said this, they usually perform worse than any of the oversampling approaches. This is in sheer contrast to the results obtained from traditional studies and is heavily related to the intrinsic nature of behaviour data. Timing-wise, RUS is the fastest method. The informed approaches are slower than RUS due to the required nearest neighbour computations. Additionally, a cluster-based undersampling (CBU) technique was proposed to alleviate the within-class imbalance. For low undersampling rates, the performance of CBU is superior to plain RUS, whereas for high undersampling rates the opposite is true. Overall, no substantial gain in performance is obtained in comparison with a simple RUS approach. Furthermore, the method is the slowest among all techniques considered and relies on computationally expensive clustering heuristics.
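The RUS baseline itself is a one-liner conceptually; a sketch under our toy conventions (βu interpreted as the target minority/majority ratio, names ours) reads:

```python
import random

def random_undersample(rows, labels, beta_u=1.0, rng=random.Random(0)):
    """Random undersampling (RUS): keep all minority rows (label 1) and a
    uniformly random subset of the majority so that the kept majority size
    is roughly len(minority) / beta_u."""
    minority = [r for r, y in zip(rows, labels) if y == 1]
    majority = [r for r, y in zip(rows, labels) if y == 0]
    keep = min(len(majority), int(len(minority) / beta_u)) if beta_u > 0 else len(majority)
    kept = rng.sample(majority, keep)
    return minority + kept, [1] * len(minority) + [0] * keep
```

The informed variants (Cl Knn / Far Knn) differ only in replacing the uniform `rng.sample` by a selection of the majority instances closest to or farthest from the minority class.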

We studied the effects of boosting, where we made use of a very specific linear SVM and logistic regression combination that uses confidence rated predictions instead of a more popular plain "discrete" boosting algorithm (with weak hypotheses having values in {−1, 1}). Our experiments clearly indicated the regularization constant C in the SVM formulation to act as a "weakness" indicator. Indeed, higher C-values cause stronger learners and shouldn't be used in the boosting process, which matches the analysis of studies dealing with traditional data. Strong learners can


already quite accurately distinguish majority class from minority class behaviours. The boosting process will focus on the hard to learn instances, among which are noise/outliers, and this can result in overfitting behaviour. We also observed that boosting can usually outperform the plain baseline, indicating that a combination of weak learners can achieve higher AUC-performance than a single strong learner.
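The confidence-rated weight update driving this behaviour (Schapire and Singer 1999) can be sketched as follows; here margins[i] = y_i · h_t(x_i) with a real-valued weak hypothesis, and the fixed step size α is a simplifying assumption (in practice it is chosen to minimise the normaliser Z_t):

```python
import math

def boosting_round(weights, margins, alpha=1.0):
    """One confidence-rated boosting weight update: instances that are
    misclassified, or classified with low confidence, gain weight for the
    next round; correctly and confidently classified instances lose weight."""
    new_w = [w * math.exp(-alpha * m) for w, m in zip(weights, margins)]
    z = sum(new_w)                      # normaliser Z_t
    return [w / z for w in new_w]
```

This exponential re-weighting is also why noisy instances, which keep being misclassified, accumulate weight over rounds and can drive overfitting.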

EasyEnsemble (EE) integrates random undersampling with a boosting process. By sampling several balanced subsets from the training data and combining these, we can emphasize the minority class, achieve noise/outlier reduction effects and simultaneously explore the majority class space efficiently. Our version differs from the original formulation because we employ a different type of underlying weak learner (SVM and LR combination) that is fed to a confidence rated boosting algorithm (instead of the discrete version). We observed superior AUC-performance with respect to all of the aforementioned techniques. Furthermore, the method is very fast, since each subset can be trained in parallel and has a size that is twice as large as the minority class training set. We studied the effect of the number of subsets and noted that a limited amount of these (O(10)) is already sufficient to obtain satisfactory performances; increasing this number further causes only minor benefits.
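The overall EE control flow can be sketched as below (an illustrative skeleton, not the authors' implementation; `train_boosted` stands for any factory that boosts a weak SVM/LR-style learner on one subset and returns a scoring function):

```python
import random

def easy_ensemble_fit(rows, labels, train_boosted, S=10, rng=random.Random(0)):
    """EasyEnsemble sketch: draw S independent balanced subsets (all minority
    rows plus an equally sized random majority sample), train a boosted
    scorer on each, and average the S resulting scorers."""
    minority = [r for r, y in zip(rows, labels) if y == 1]
    majority = [r for r, y in zip(rows, labels) if y == 0]
    scorers = []
    for _ in range(S):          # each iteration is independent: trivially parallel
        sub_maj = rng.sample(majority, min(len(minority), len(majority)))
        sub_rows = minority + sub_maj
        sub_labels = [1] * len(minority) + [0] * len(sub_maj)
        scorers.append(train_boosted(sub_rows, sub_labels))
    def ensemble_score(row):
        return sum(f(row) for f in scorers) / len(scorers)
    return ensemble_score
```

Each subset has only twice the minority class size, which is what keeps the per-subset training cheap and makes the loop embarrassingly parallel.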

Additionally, a statistical comparison of each of the proposed methods was conducted across a wide variety of datasets. Holm's significance test indicates that each of the proposed techniques (oversampling, undersampling except Cl Knn, cost-sensitive learning and boosting variants) can significantly outperform the plain baseline performance (each of the null-hypotheses of equal performance gets rejected with a confidence level of at least 91%). Comparison of the techniques with EE showed that the latter can significantly outperform all of them (each of the null-hypotheses of equal performance gets rejected with a confidence level of at least 75%). A timing-wise comparison showed EE to be the preferred method (after RUS) for large and highly imbalanced datasets. A parallel implementation would make EE the fastest method, even for medium sized datasets that show a high level of imbalance.
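Holm's step-down procedure used above can be sketched as follows (an illustrative helper, not the authors' testing code; the p-values in the test are hypothetical):

```python
def holm_rejections(p_values, alpha=0.05):
    """Holm's step-down procedure: sort the k per-comparison p-values and
    compare the i-th smallest (i = 0, 1, ...) against alpha / (k - i);
    stop rejecting at the first non-rejection."""
    k = len(p_values)
    order = sorted(range(k), key=lambda i: p_values[i])
    rejected = [False] * k
    for step, i in enumerate(order):
        if p_values[i] <= alpha / (k - step):
            rejected[i] = True
        else:
            break
    return rejected
```

Compared to a plain Bonferroni correction (alpha / k for every comparison), the step-down schedule is uniformly more powerful while still controlling the family-wise error rate.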

The influence of the chosen base learner was investigated by comparing the baseline to its EE-version for the SVM, LR, NB and BeSim inducers. Each of the EE-versions is able to improve upon the baseline performance for nearly all datasets, where base learners of the regularization type are superior because of their abilities to control the strength of the learner.

Though we conducted an extensive benchmark study, there are still plenty of future research directions. First and most important, there is an urgent need to develop a more general theoretical framework to completely understand the effect of imbalance and why certain methods outperform others. Secondly, other types of mechanisms could be investigated in this setting: advanced sampling (Guo et al 2008) set-ups perform sampling based on preliminary classifications; one-class learning is a recognition based methodology where a discriminative boundary is learned around the training examples of a single class alone, and Raskutti and Kowalczyk (2004) demonstrated the superiority of the recognition based one-class SVMs over the discriminative two-class SVMs; other types of ensemble learning strategies (e.g., bagging and stacking) could be investigated; Batista et al (2004) note that combining over- and undersampling can be beneficial when data are highly imbalanced or show very few minority class instances; the ideas behind Extended Nearest Neighbour (Tang and


He 2015) could guide us in the sampling process. As opposed to Knn, "it considers not only who are the nearest neighbours of a certain sample, but also who consider the particular sample as their nearest neighbour (Tang and He 2015)". The method assigns a class membership to an unknown sample that maximizes the intra-class coherence. Finally, more advanced techniques do exist that can provide a K (the number of nearest neighbours) faster or with a (slightly) better performance.

Another issue relates to the lack of comprehensibility of the EasyEnsemble technique in its present form. Because we currently integrate Logistic Regression in the boosting process, the final ensemble classifier is non-linear and therefore incomprehensible. It might be interesting to investigate the numerical boosting technique proposed in Schapire and Singer (1999), where each weak learner is allowed to output real values (instead of values in [−1, 1]). In that case, we would be able to use a plain linear SVM as a weak learner without the Logistic Regression component. The final ensemble would be a linear model, which is surely easier to comprehend. Also, the instance-based approach of Martens and Provost (2014) could be used to explain individual predictions of the non-linear EE-technique.
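The comprehensibility argument rests on a simple observation: a weighted sum of linear hypotheses h_t(x) = w_t · x + b_t is itself linear, so the whole boosted ensemble collapses into a single (w, b) pair. A minimal sketch (our naming; models are (weight-vector, bias) pairs):

```python
def collapse_linear_ensemble(models, alphas):
    """Collapse sum_t alpha_t * (w_t . x + b_t) into one linear model (w, b).
    models: list of (w_t, b_t) pairs; alphas: the boosting coefficients."""
    n = len(models[0][0])
    w = [sum(a * m[0][j] for m, a in zip(models, alphas)) for j in range(n)]
    b = sum(a * m[1] for m, a in zip(models, alphas))
    return w, b
```

The collapsed model scores any instance identically to the full ensemble, while exposing a single interpretable coefficient per behaviour.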

A Oversampling experiments

Table 11 Oversampling experiments. Results show the average tenfold AUC test set performance with respect to increasing β-values (β = [β1, β2, β3, β4] = [0, 1/3, 2/3, 1]). Additionally, standard deviations are shown between brackets. Optimal performances for each method are highlighted in boldface. See Section 4.2 for parameter settings.

Mov G(p = 1)
          β1            β2            β3            β4
OSR     71.6  (2.62)  74.37 (2.04)  73.6  (1.84)  74.73 (2.45)
SMOTE   71.6  (2.62)  75.08 (2.18)  76.02 (2.14)  76.48 (2.3)
ADASYN  71.6  (2.62)  75.16 (1.92)  75.93 (2.08)  76.47 (2.29)

Mov G(p = 25)
          β1            β2            β3            β4
OSR     81.41 (1.32)  83.49 (1.81)  83.84 (1.96)  83.91 (2.04)
SMOTE   81.41 (1.32)  83.32 (1.97)  83.59 (2.04)  83.76 (2.11)
ADASYN  81.41 (1.32)  83.61 (1.82)  84.02 (1.97)  83.69 (1.96)

Mov Th(p = [])
          β1            β2            β3            β4
OSR     79.77 (5.33)  85.3  (4.66)  83.16 (4.5)   84.59 (5.69)
SMOTE   79.77 (5.33)  84.18 (6.51)  85.58 (5.97)  84.33 (5.75)
ADASYN  79.77 (5.33)  84.11 (6.77)  85.86 (6.06)  85.36 (5.13)

Yahoo A(p = 1)
          β1            β2            β3            β4
OSR     55.92 (2.97)  58.66 (3.27)  59.99 (2.28)  59.74 (1.78)
SMOTE   55.92 (2.97)  59.76 (2.62)  59.74 (2.67)  59.43 (2.4)
ADASYN  55.92 (2.97)  59.54 (2.53)  59.55 (2.94)  59.56 (2.22)

Yahoo A(p = 25)
          β1            β2            β3            β4
OSR     61.68 (2.42)  64.19 (3.17)  65.08 (3.26)  64.67 (2.1)
SMOTE   61.68 (2.42)  65.46 (3.63)  65.33 (3.23)  64.52 (2.98)
ADASYN  61.68 (2.42)  65.04 (3.74)  65.41 (3.47)  64.4  (2.21)

Yahoo G(p = 1)
          β1            β2            β3            β4
OSR     66.84 (3.66)  72.18 (2.36)  73.11 (2.7)   72.49 (3.41)
SMOTE   66.84 (3.66)  72.65 (2.85)  73.27 (3.36)  73.37 (3.56)
ADASYN  66.84 (3.66)  72.87 (2.83)  73.18 (3.2)   73.39 (3.59)

Yahoo G(p = 25)
          β1            β2            β3            β4
OSR     78.82 (1.39)  78.78 (2.02)  78.74 (1.86)  78.35 (1.47)
SMOTE   78.82 (1.39)  79.23 (1.57)  79.1  (1.2)   79.03 (1.89)
ADASYN  78.82 (1.39)  79.12 (1.43)  79.23 (1.37)  79.51 (2.01)

TaFeng(p = 1)
          β1            β2            β3            β4
OSR     55.75 (1.6)   59.23 (1.96)  60.0  (1.68)  61.04 (2.36)
SMOTE   55.75 (1.6)   60.26 (1.95)  61.49 (1.8)   61.13 (1.52)
ADASYN  55.75 (1.6)   60.26 (1.9)   61.44 (1.85)  61.16 (1.5)

TaFeng(p = 25)
          β1            β2            β3            β4
OSR     66.94 (1.34)  68.84 (1.21)  66.99 (1.42)  67.7  (1.41)
SMOTE   66.94 (1.34)  68.47 (1.5)   67.07 (1.15)  66.65 (0.81)
ADASYN  66.94 (1.34)  68.62 (1.38)  67.85 (1.6)   66.91 (1.39)

Book(p = 1)
          β1            β2            β3            β4
OSR     52.6  (1.29)  53.61 (0.94)  55.41 (1.75)  55.87 (1.44)
SMOTE   52.6  (1.29)  54.77 (0.99)  54.91 (0.8)   54.36 (0.98)
ADASYN  52.6  (1.29)  54.86 (1.13)  55.06 (0.73)  54.54 (0.92)

Book(p = 25)
          β1            β2            β3            β4
OSR     60.08 (0.71)  61.05 (0.94)  61.82 (1.12)  64.62 (0.57)
SMOTE   60.08 (0.71)  62.6  (0.73)  60.95 (0.68)  63.0  (0.8)
ADASYN  60.08 (0.71)  62.33 (0.73)  60.77 (0.85)  63.04 (0.58)

LST(p = 1)
          β1            β2            β3            β4
OSR     99.99 (0.01)  99.99 (0.01)  99.99 (0.01)  99.99 (0.01)
SMOTE   99.99 (0.01)  99.99 (0.01)  99.99 (0.01)  99.99 (0.01)
ADASYN  99.99 (0.01)  99.99 (0.01)  99.99 (0.01)  99.99 (0.01)

Adver(p = [])
          β1            β2            β3            β4
OSR     96.61 (1.82)  97.31 (1.65)  97.07 (1.84)  97.07 (1.79)
SMOTE   96.61 (1.82)  96.91 (1.66)  97.19 (1.65)  97.07 (1.91)
ADASYN  96.61 (1.82)  97.1  (1.7)   97.08 (1.87)  97.07 (1.88)

Adver(p = 1)
          β1            β2            β3            β4
OSR     90.93 (3.02)  91.27 (3.03)  92.66 (2.82)  93.29 (1.97)
SMOTE   90.93 (3.02)  92.51 (2.03)  92.96 (2.14)  93.53 (1.81)
ADASYN  90.93 (3.02)  92.22 (2.33)  92.7  (2.36)  93.88 (1.73)

CRF(p = [])
          β1             β2             β3             β4
OSR     64.06 (16.43)  80.82 (12.94)  81.28 (12.27)  81.91 (11.28)
SMOTE   64.06 (16.43)  78.64 (16.86)  82.52 (13.74)  79.32 (16.26)
ADASYN  64.06 (16.43)  78.95 (16.72)  81.19 (16.32)  79.31 (16.19)

Bank(p = [])
          β1            β2            β3            β4
OSR     66.82 (0.88)  70.1  (0.74)  71.39 (0.8)   71.47 (0.8)
SMOTE     []            []            []            []
ADASYN    []            []            []            []


B Undersampling experiments

Table 12 Undersampling experiments. Results show the average tenfold AUC test set performance with respect to increasing βu-values (βu = [βu1, βu2, βu3, βu4, βu5] = [0, 1/4, 1/2, 3/4, 1]). Optimal performances (boldface) are highlighted for each imbalance ratio. Additionally, standard deviations are shown between brackets. Cl K represents the "Closest Knn" technique; Cl T represents the "Closest tot sim" technique (similar for Far K and Far T; see Section 3.2). Parameter settings are given in Section 4.2.

Mov G(p = 1)
         βu1          βu2          βu3          βu4          βu5
RUS    71.6 (2.6)   71.83(2.6)   72.54(2.5)   72.39(3.1)   70.61(3.5)
Cl K   71.6 (2.6)   71.4 (2.0)   70.96(1.9)   70.43(2.4)   69.05(4.1)
Cl T   71.6 (2.6)   70.28(2.5)   66.74(2.0)   66.8 (2.1)   68.18(3.6)
Far K  71.6 (2.6)   72.36(2.7)   71.26(3.4)   66.57(5.2)   53.5 (3.5)
Far T  71.6 (2.6)   72.22(2.8)   71.63(3.6)   64.28(5.3)   50.88(4.4)
CBU    72.55(2.6)   73.28(2.6)   73.12(2.6)   73.84(2.5)   73.0 (3.1)

Mov G(p = 25)
         βu1          βu2          βu3          βu4          βu5
RUS    81.41(1.3)   81.36(1.3)   81.78(1.7)   82.05(1.7)   81.6 (2.1)
Cl K   81.41(1.3)   80.86(1.2)   80.95(1.6)   79.73(2.3)   77.95(2.3)
Cl T   81.41(1.3)   79.9 (1.2)   78.21(1.4)   77.87(1.5)   77.76(2.3)
Far K  81.41(1.3)   80.9 (1.5)   78.17(1.8)   74.25(2.4)   69.79(3.2)
Far T  81.41(1.3)   80.86(1.5)   77.2 (2.4)   71.16(2.7)   62.4 (2.8)
CBU    81.53(1.4)   81.64(1.3)   81.29(1.6)   81.28(2.1)   80.34(2.7)

Mov Th(p = [])
         βu1          βu2          βu3          βu4          βu5
RUS    79.77(5.3)   80.32(5.8)   81.57(5.5)   81.86(6.6)   81.26(6.2)
Cl K   79.77(5.3)   79.25(4.5)   78.07(5.0)   76.25(6.5)   62.46(8.5)
Cl T   79.77(5.3)   78.4 (4.4)   72.41(3.5)   64.66(4.5)   60.37(7.3)
Far K  79.77(5.3)   84.54(5.0)   83.64(6.4)   80.02(7.3)   56.82(10.3)
Far T  79.77(5.3)   85.03(5.7)   82.68(6.8)   75.61(9.2)   56.77(10.9)
CBU    80.11(5.8)   81.17(6.0)   81.08(6.5)   84.17(5.1)   80.96(6.9)

Yahoo A(p = 1)
         βu1          βu2          βu3          βu4          βu5
RUS    55.92(3.0)   55.57(3.4)   56.44(3.0)   55.83(3.4)   56.37(3.3)
Cl K   55.92(3.0)   55.67(2.4)   53.12(2.0)   50.57(1.8)   53.79(3.5)
Cl T   55.92(3.0)   55.69(2.1)   53.35(2.2)   50.31(2.2)   52.35(3.3)
Far K  55.92(3.0)   57.35(2.2)   56.92(1.1)   56.95(2.3)   51.18(2.0)
Far T  55.92(3.0)   56.93(2.4)   54.74(1.9)   57.01(1.8)   51.18(2.0)
CBU    58.21(2.6)   58.45(3.3)   58.31(3.5)   58.39(3.5)   56.09(2.6)

Yahoo A(p = 25)
         βu1          βu2          βu3          βu4          βu5
RUS    61.68(2.4)   62.9 (2.9)   63.62(3.6)   63.75(3.1)   63.19(1.9)
Cl K   61.68(2.4)   61.14(2.1)   57.62(1.6)   54.02(1.8)   51.48(1.4)
Cl T   61.68(2.4)   60.89(2.8)   58.11(1.4)   54.4 (2.1)   51.76(1.4)
Far K  61.68(2.4)   63.96(3.0)   62.62(2.2)   59.61(1.5)   56.25(1.6)
Far T  61.68(2.4)   63.71(2.4)   59.72(1.6)   57.27(1.1)   54.47(1.1)
CBU    62.46(2.6)   61.85(1.4)   61.78(2.2)   59.94(3.0)   60.1 (4.0)

Yahoo G(p = 1)
         βu1          βu2          βu3          βu4          βu5
RUS    66.84(3.7)   67.85(3.2)   68.36(3.2)   68.23(4.0)   69.9 (4.2)
Cl K   66.84(3.7)   66.71(2.8)   64.3 (3.6)   61.98(3.9)   61.15(1.9)
Cl T   66.84(3.7)   65.79(2.7)   63.55(3.3)   59.21(3.5)   61.08(2.4)
Far K  66.84(3.7)   66.76(4.1)   63.84(3.4)   65.16(2.0)   48.5 (2.9)
Far T  66.84(3.7)   66.95(4.1)   63.48(2.9)   65.16(2.0)   48.48(2.9)
CBU    69.68(4.1)   70.59(3.2)   70.64(3.7)   70.2 (2.9)   63.35(3.6)

Yahoo G(p = 25)
         βu1          βu2          βu3          βu4          βu5
RUS    78.82(1.4)   78.91(1.6)   78.97(1.6)   78.61(1.6)   77.82(2.1)
Cl K   78.82(1.4)   77.26(1.5)   72.52(1.5)   67.86(2.0)   65.07(2.7)
Cl T   78.82(1.4)   76.83(1.0)   71.99(1.8)   67.15(2.3)   61.1 (2.7)
Far K  78.82(1.4)   78.26(2.2)   74.69(2.7)   67.22(2.1)   60.72(2.3)
Far T  78.82(1.4)   77.68(2.6)   72.44(3.0)   64.94(2.4)   59.6 (2.0)
CBU    75.25(3.2)   75.22(2.4)   74.69(2.3)   73.07(2.4)   70.69(2.4)

TaFeng(p = 1)
         βu1          βu2          βu3          βu4          βu5
RUS    55.75(1.6)   56.1 (1.6)   56.26(1.7)   57.23(1.7)   59.25(2.2)
Cl K   55.75(1.6)   55.68(1.6)   55.58(1.5)   55.08(1.1)   51.05(1.5)
Cl T   55.75(1.6)   55.67(1.6)   54.47(1.6)   47.53(1.6)   49.3 (1.1)
Far K  55.75(1.6)   58.99(1.2)   59.47(1.1)   60.04(1.2)   56.31(1.0)
Far T  55.75(1.6)   58.92(1.3)   59.25(1.3)   58.58(1.1)   56.31(1.0)
CBU    57.8 (1.0)   58.47(1.1)   58.15(0.9)   58.87(1.4)   57.65(1.6)

TaFeng(p = 25)
         βu1          βu2          βu3          βu4          βu5
RUS    66.94(1.3)   67.44(1.3)   68.1 (1.4)   68.27(1.4)   66.13(1.2)
Cl K   66.94(1.3)   66.13(1.4)   63.39(1.2)   59.83(1.3)   56.94(0.7)
Cl T   66.94(1.3)   66.38(1.5)   62.89(1.6)   57.46(1.3)   54.56(1.3)
Far K  66.94(1.3)   68.06(1.4)   66.43(1.6)   64.46(1.5)   63.35(1.3)
Far T  66.94(1.3)   64.31(1.1)   62.69(1.0)   61.27(1.1)   59.03(1.0)
CBU    64.81(1.2)   64.15(1.1)   64.13(1.2)   63.88(0.8)   63.46(0.8)

Book(p = 1)
         βu1          βu2          βu3          βu4          βu5
RUS    52.6 (1.3)   52.79(0.9)   53.46(0.8)   53.89(0.9)   54.05(0.9)
Cl K   52.6 (1.3)   52.56(1.2)   52.52(1.3)   52.39(1.1)   53.09(1.1)
Cl T   52.6 (1.3)   52.56(1.2)   52.52(1.3)   52.39(1.1)   53.05(0.7)
Far K  52.6 (1.3)   55.21(1.2)   56.21(1.8)   56.14(1.2)   53.06(1.0)
Far T  52.6 (1.3)   55.21(1.2)   56.21(1.8)   56.14(1.2)   53.06(1.0)
CBU    54.28(0.9)   53.77(1.0)   53.33(1.1)   53.34(0.9)   52.84(0.8)

Book(p = 25)
         βu1          βu2          βu3          βu4          βu5
RUS    60.08(0.7)   60.13(0.6)   60.4 (0.8)   60.33(0.8)   63.28(0.8)
Cl K   60.08(0.7)   59.96(0.7)   60.13(0.8)   59.96(1.0)   59.28(0.7)
Cl T   60.08(0.7)   59.96(0.7)   60.13(0.8)   60.29(0.4)   54.5 (0.9)
Far K  60.08(0.7)   63.29(1.0)   64.19(0.8)   57.3 (1.1)   55.66(1.1)
Far T  60.08(0.7)   62.14(0.5)   58.27(0.6)   56.37(1.0)   55.66(1.1)
CBU    54.82(0.9)   54.67(0.9)   54.71(0.9)   54.66(1.0)   54.78(0.9)

LST(p = 1)
         βu1          βu2          βu3          βu4          βu5
RUS    99.99(0.0)   99.99(0.0)   99.99(0.0)   99.98(0.0)   99.99(0.0)
Cl K   99.99(0.0)   99.99(0.0)   99.99(0.0)   99.99(0.0)   99.99(0.0)
Cl T   99.99(0.0)   99.99(0.0)   99.99(0.0)   99.99(0.0)   99.98(0.0)
Far K  99.99(0.0)   99.98(0.0)   99.98(0.0)   99.98(0.0)   99.98(0.0)
Far T  99.99(0.0)   99.98(0.0)   99.98(0.0)   99.98(0.0)   99.98(0.0)
CBU      []           []           []           []           []

Adver(p = [])
         βu1          βu2          βu3          βu4          βu5
RUS    96.61(1.8)   96.32(1.8)   96.63(1.4)   97.12(2.1)   96.22(1.6)
Cl K   96.61(1.8)   96.44(1.5)   96.14(1.5)   96.04(2.0)   94.8 (2.5)
Cl T   96.61(1.8)   95.87(2.1)   94.32(1.9)   93.01(2.2)   90.72(2.3)
Far K  96.61(1.8)   96.53(1.4)   95.76(2.0)   94.39(1.8)   90.49(3.1)
Far T  96.61(1.8)   96.54(1.5)   95.67(1.9)   94.54(1.8)   89.3 (2.8)
CBU    96.85(2.3)   96.85(2.3)   97.05(1.5)   96.6 (1.6)   96.06(2.1)

Adver(p = 1)
         βu1          βu2          βu3          βu4          βu5
RUS    90.93(3.0)   91.53(3.1)   92.37(3.4)   91.9 (2.9)   91.93(2.2)
Cl K   90.93(3.0)   90.64(3.0)   89.87(3.9)   90.21(3.6)   89.18(2.0)
Cl T   90.93(3.0)   89.7 (3.5)   88.55(3.4)   85.76(3.3)   88.2 (2.3)
Far K  90.93(3.0)   93.8 (2.3)   92.4 (2.6)   88.73(3.4)   85.51(4.0)
Far T  90.93(3.0)   93.62(2.4)   93.2 (2.2)   88.41(3.6)   85.51(4.0)
CBU    93.22(2.4)   93.76(2.5)   93.89(2.6)   93.52(2.7)   91.27(2.0)

CRF(p = [])
         βu1           βu2           βu3           βu4           βu5
RUS    64.06(16.4)   63.28(15.9)   67.98(17.4)   66.95(21.9)   87.73(8.8)
Cl K   64.06(16.4)   62.44(16.6)   62.34(16.9)   71.37(13.8)   78.22(17.7)
Cl T   64.06(16.4)   62.44(16.6)   62.34(16.9)   71.37(13.8)   62.67(22.9)
Far K  64.06(16.4)   83.8 (14.2)   83.93(14.8)   84.49(13.7)   86.11(9.7)
Far T  64.06(16.4)   83.8 (14.2)   83.93(14.8)   84.49(13.7)   86.11(9.7)
CBU      []            []            []            []            []

Bank(p = [])
         βu1          βu2          βu3          βu4          βu5
RUS    66.82(0.9)   67.02(0.9)   67.37(0.8)   67.99(0.6)   69.5 (1.0)
Cl K   66.82(0.9)   66.17(0.7)   65.24(0.6)   64.86(0.6)   58.53(1.1)
Cl T   66.82(0.9)   64.92(1.1)   60.69(0.9)   56.33(0.8)   52.87(0.7)
Far K  66.82(0.9)   66.95(0.6)   66.19(0.6)   64.42(0.6)   58.25(1.1)
Far T  66.82(0.9)   67.16(0.6)   64.2 (0.8)   59.67(1.0)   58.25(1.1)
CBU      []           []           []           []           []


C Boosting experiments

This section presents the results of each of the boosting experiments from Section 4.5 on the remaining data sources. Each of the figures below depicts the average tenfold AUC-performance on test data (with µ = 100) with respect to the number of boosting iterations for (left) Adaboost (AB), AdaCost (AC) and EasyEnsemble (EE), with C chosen according to highest validation set AUC-performance (over all possible boosting rounds), and (right) AB and EE (with S = 15) with varying C-levels.

[Figure: AUC test performance versus the number of boosting rounds T; panel (a) compares AB, AC variants and EE(S = 5, 10, 15) against BL, panel (b) compares AB and EE for C = 1e-07, 1e-05, 0.001, 0.1 against BL]

Fig 6 Mov G(p = 1) dataset

[Figure: AUC test performance versus the number of boosting rounds T; panel (a) compares AB, AC variants and EE(S = 5, 10, 15) against BL, panel (b) compares AB and EE for C = 1e-07, 1e-05, 0.001, 0.1 against BL]

Fig 7 Mov Th(p = []) dataset


[Figure: AUC test performance versus the number of boosting rounds T; panel (a) compares AB, AC variants and EE(S = 5, 10, 15) against BL, panel (b) compares AB and EE for C = 1e-07, 1e-05, 0.001, 0.1 against BL]

Fig 8 Yahoo A(p = 1) dataset

[Figure: AUC test performance versus the number of boosting rounds T; panel (a) compares AB, AC variants and EE(S = 5, 10, 15) against BL, panel (b) compares AB and EE for C = 1e-07, 1e-05, 0.001, 0.1 against BL]

Fig 9 Yahoo A(p = 25) dataset

[Figure: AUC test performance versus the number of boosting rounds T; panel (a) compares AB, AC variants and EE(S = 5, 10, 15) against BL, panel (b) compares AB and EE for C = 1e-07, 1e-05, 0.001, 0.1 against BL]

Fig 10 Yahoo G(p = 1) dataset


[Figure: AUC test performance versus the number of boosting rounds T; panel (a) compares AB, AC variants and EE(S = 5, 10, 15) against BL, panel (b) compares AB and EE for C = 1e-07, 1e-05, 0.001, 0.1 against BL]

Fig 11 Yahoo G(p = 25) dataset

[Figure: AUC test performance versus the number of boosting rounds T; panel (a) compares AB, AC variants and EE(S = 5, 10, 15) against BL, panel (b) compares AB and EE for C = 1e-07, 1e-05, 0.001, 0.1 against BL]

Fig 12 TaFeng(p = 1) dataset

[Figure: AUC test performance versus the number of boosting rounds T; panel (a) compares AB, AC variants and EE(S = 5, 10, 15) against BL, panel (b) compares AB and EE for C = 1e-07, 1e-05, 0.001, 0.1 against BL]

Fig 13 Book(p = 1) dataset


[Figure: AUC test performance versus the number of boosting rounds T; panel (a) compares AB, AC variants and EE(S = 5, 10, 15) against BL, panel (b) compares AB and EE for C = 1e-07, 1e-05, 0.001, 0.1 against BL]

Fig 14 LST(p = 1) dataset

[Figure: AUC test performance versus the number of boosting rounds T; panel (a) compares AB, AC variants and EE(S = 5, 10, 15) against BL, panel (b) compares AB and EE for C = 1e-07, 1e-05, 0.001, 0.1 against BL]

Fig 15 Adver(p = []) dataset

[Figure: AUC test performance versus the number of boosting rounds T; panel (a) compares AB, AC variants and EE(S = 5, 10, 15) against BL, panel (b) compares AB and EE for C = 1e-07, 1e-05, 0.001, 0.1 against BL]

Fig 16 Adver(p = 1) dataset


[Figure: AUC test performance versus the number of boosting rounds T; panel (a) compares AB, AC variants and EE(S = 5, 10, 15) against BL, panel (b) compares AB and EE for C = 1e-07, 1e-05, 0.001, 0.1 against BL]

Fig 17 CRF(p = []) dataset

D Final Comparison

Fig 18 Average rank AUC versus average rank Time (see Table 9) across the large datasets from Table 2 (CRF and Bank). Regarding the former, an extra row EE par is added, having the same AUC-value as EE(S = 15). Ranks for each dataset are subsequently obtained via the procedure outlined in Section 4.6.1. Points occurring in the upper-right region are preferred. [figure omitted: scatter of average rank Time against average rank AUC for BL, OSR, SMOTE, ADASYN, RUS, Cl Knn, Far Knn, CBU, AB, AC(R=2,8), AC(R=RL), EE(S=10), EE(S=15) and EE par]


References

Akbani R Kwek S Japkowicz N (2004) Applying support vector machines to imbalanced datasetsIn Machine Learning ECML 2004 15th European Conference on Machine Learning Pisa ItalySeptember 20-24 2004 Proceedings Springer Berlin Heidelberg Berlin Heidelberg pp 39ndash50 DOI101007978-3-540-30115-8 7

Ali A Shamsuddin SM Ralescu AL (2015) Classification with class imbalance problem A review Inter-national Journal of Advances in Soft Computing and its Applications 7(3)176ndash204

Alzahrani T Horadam KJ (2016) Community detection in bipartite networks Algorithms and case studiesIn Complex Systems and Networks Dynamics Controls and Applications Springer Berlin Heidel-berg Berlin Heidelberg pp 25ndash50 DOI 101007978-3-662-47824-0 2

Bachner J (2013) Predictive policing Preventing crime with data and analytics IBM Center for the Busi-ness of Government

Baesens B Van Gestel T Viaene S Stepanova M Suykens J Vanthienen J (2003) Benchmarking state-of-the-art classification algorithms for credit scoring Journal of the Operational Research Society54(6)627ndash635 DOI 101057palgravejors2601545

Barandela R, Sánchez J, García V, Rangel E (2003) Strategies for learning in class imbalance problems. Pattern Recognition 36(3):849–851, DOI 10.1016/S0031-3203(02)00257-1

Barber MJ (2007) Modularity and community detection in bipartite networks Physical Review E76066102 DOI 101103PhysRevE76066102

Barua S Islam MM Yao X Murase K (2014) MWMOTEndashmajority weighted minority oversamplingtechnique for imbalanced data set learning IEEE Transactions on Knowledge and Data Engineer-ing 26(2)405ndash425 DOI 101109TKDE2012232

Batista GEAPA Prati RC Monard MC (2004) A study of the behavior of several methods for balancingmachine learning training data SIGKDD Explor Newsl 6(1)20ndash29 DOI 10114510077301007735

Beckett SJ (2016) Improved community detection in weighted bipartite networks Royal Society OpenScience 3(1) DOI 101098rsos140536

Bekkar M Djemaa HK Alitouche TA (2013) Evaluation measures for models assessment over imbalanceddata sets Journal of Information Engineering and Applications 3(10)27ndash38

Bhattacharyya S Jha S Tharakunnel K Westland JC (2011) Data mining for credit card fraud A compara-tive study Decision Support Systems 50(3)602 ndash 613 DOI httpsdoiorg101016jdss201008008

Blondel VD Guillaume JL Lambiotte R Lefebvre E (2008) Fast unfolding of communities in large net-works Journal of Statistical Mechanics Theory and Experiment 2008(10)P10008

Breiman L, Friedman J, Stone CJ, Olshen RA (1984) Classification and regression trees. Taylor & Francis

Brozovsky L, Petricek V (2007) Recommender system for online dating service. In: Proceedings of Znalosti 2007 Conference, VSB Ostrava

Cha M, Mislove A, Gummadi KP (2009) A measurement-driven analysis of information propagation in the Flickr social network. In: Proceedings of the 18th International Conference on World Wide Web, ACM, New York, NY, USA, WWW '09, pp 721–730, DOI 10.1145/1526709.1526806

Chawla NV (2005) Data mining for imbalanced datasets An overview In Data mining and knowledgediscovery handbook Springer US Boston MA pp 853ndash867

Chawla NV Bowyer KW Hall LO Kegelmeyer WP (2002) SMOTE synthetic minority over-samplingtechnique Journal of artificial intelligence research 16321ndash357

Chawla NV Lazarevic A Hall LO Bowyer KW (2003) Smoteboost Improving prediction of the minorityclass in boosting In Knowledge Discovery in Databases PKDD 2003 Springer Berlin HeidelbergBerlin Heidelberg pp 107ndash119

Chawla NV Japkowicz N Kotcz A (2004) Editorial Special issue on learning from imbalanced data setsSIGKDD Explor Newsl 6(1)1ndash6 DOI 10114510077301007733

Chen M Mao S Liu Y (2014) Big data A survey Mobile Networks and Applications 19(2)171ndash209DOI 101007s11036-013-0489-0

Chyi YM (2003) Classification analysis techniques for skewed class distribution problems Master thesisDepartment of Information management National Sun Yat-Sen University

Demsar J (2006) Statistical comparisons of classifiers over multiple data sets Journal of Machine LearningResearch 7(Jan)1ndash30

Dhillon IS (2001) Co-clustering documents and words using bipartite spectral graph partitioning In Pro-ceedings of the Seventh ACM SIGKDD International Conference on Knowledge Discovery and DataMining ACM New York NY USA KDD rsquo01 pp 269ndash274 DOI 101145502512502550


Drummond C Holte RC (2003) C45 class imbalance and cost sensitivity why under-sampling beatsover-sampling In Proceedings of the ICML rsquo03 Workshop on Learning from Imbalanced Datasets

Fan RE Chang KW Hsieh CJ Wang XR Lin CJ (2008) LIBLINEAR A library for large linear classifi-cation Journal of Machine Learning Research 91871ndash1874

Fan W Stolfo SJ Zhang J Chan PK (1999) AdaCost Misclassification cost-sensitive boosting In Pro-ceedings of the Sixteenth International Conference on Machine Learning Morgan Kaufmann Pub-lishers Inc San Francisco CA USA ICML rsquo99 pp 97ndash105

Fawcett T (2006) An introduction to ROC analysis Pattern Recognition Letters 27(8)861 ndash 874 DOIhttpsdoiorg101016jpatrec200510010

Finch H (2005) Comparison of distance measures in cluster analysis with dichotomous data Journal ofData Science 3(1)85ndash100

Fortunato S (2010) Community detection in graphs Physics Reports 486(3 5)75 ndash 174 DOI httpsdoiorg101016jphysrep200911002

Junque de Fortuny E Martens D Provost F (2014a) Predictive modeling with big data is bigger reallybetter Big Data 1(4)215ndash226 DOI 101089big20130037

Junque de Fortuny E Stankova M Moeyersoms J Minnaert B Provost F Martens D (2014b) Corporateresidence fraud detection In Proceedings of the 20th ACM SIGKDD International Conference onKnowledge Discovery and Data Mining ACM New York NY USA KDD rsquo14 pp 1650ndash1659 DOI10114526233302623333

Frasca M Bertoni A Re M Valentini G (2013) A neural network algorithm for semi-supervised node labellearning from unbalanced data Neural Networks 4384 ndash 98 DOI httpsdoiorg101016jneunet201301021

Friedman M (1937) The use of ranks to avoid the assumption of normality implicit in the analysis ofvariance Journal of the american statistical association 32(200)675ndash701

García E, Lozano F (2007) Boosting support vector machines. In: Machine Learning and Data Mining in Pattern Recognition, 5th International Conference, MLDM 2007, Leipzig, Germany, July 18-20, Post Proceedings, IBaI publishing, pp 153–167

Goldstein M Uchida S (2016) A comparative evaluation of unsupervised anomaly detection algorithmsfor multivariate data PLOS ONE 11(4)1ndash31 DOI 101371journalpone0152173

González PC, Velásquez JD (2013) Characterization and detection of taxpayers with false invoices using data mining techniques. Expert Systems with Applications 40(5):1427–1436, DOI 10.1016/j.eswa.2012.08.051

Guimera R Sales-Pardo M Amaral LAN (2007) Module identification in bipartite and directed networksPhysical Review E 76036102 DOI 101103PhysRevE76036102

Guo H Viktor HL (2004) Learning from imbalanced data sets with boosting and data generation TheDataBoost-IM approach SIGKDD Explor Newsl 6(1)30ndash39 DOI 10114510077301007736

Guo X Yin Y Dong C Yang G Zhou G (2008) On the class imbalance problem In 2008 Fourth Interna-tional Conference on Natural Computation IEEE vol 4 pp 192ndash201 DOI 101109ICNC2008871

Han H Wang WY Mao BH (2005) Borderline-SMOTE A new over-sampling method in imbalanced datasets learning In Advances in Intelligent Computing Springer Berlin Heidelberg Berlin Heidelbergpp 878ndash887

He H Garcia EA (2009) Learning from imbalanced data IEEE Transactions on Knowledge and DataEngineering 21(9)1263ndash1284 DOI 101109TKDE2008239

He H Bai Y Garcia EA Li S (2008) ADASYN Adaptive synthetic sampling approach for imbalancedlearning In 2008 IEEE International Joint Conference on Neural Networks (IEEE World Congresson Computational Intelligence) IEEE pp 1322ndash1328 DOI 101109IJCNN20084633969

Holm S (1979) A simple sequentially rejective multiple test procedure Scandinavian journal of statistics6(2)65ndash70

Hsu CW Lin CJ (2002) A comparison of methods for multiclass support vector machines IEEE Transac-tions on Neural Networks 13(2)415ndash425 DOI 10110972991427

Huang A (2008) Similarity measures for text document clustering In Proceedings of the sixth new zealandcomputer science research student conference (NZCSRSC2008) Christchurch New Zealand pp 49ndash56

Iman RL Davenport JM (1980) Approximations of the critical region of the Friedman statistic Commu-nications in Statistics-Theory and Methods 9(6)571ndash595

Jo T Japkowicz N (2004) Class imbalances versus small disjuncts ACM SIGKDD Explor Newsl 6(1)40ndash49 DOI 10114510077301007737


Jutla IS, Jeub LG, Mucha PJ (2011-2016) A generalized louvain method for community detection implemented in MATLAB. URL http://netwiki.amath.unc.edu/GenLouvain

Kubat M Matwin S (1997) Addressing the curse of imbalanced training sets One-sided selection InProceedings of the Fourteenth International Conference on Machine Learning Morgan KaufmannPublishers Inc San Francisco CA USA pp 179ndash186

Lancichinetti A Fortunato S (2009) Community detection algorithms A comparative analysis PhysicalReview E 80056117 DOI 101103PhysRevE80056117

Larremore DB Clauset A Jacobs AZ (2014) Efficiently inferring community structure in bipartitenetworks Physical Review E Statistical Nonlinear and Soft Matter Physics 90012805 DOI101103PhysRevE90012805

Li J Fine JP (2010) Weighted area under the receiver operating characteristic curve and its application togene selection Journal of the Royal Statistical Society Series C (Applied Statistics) 59(4)673ndash692DOI 101111j1467-9876201000713x

Li X Wang L Sung E (2008) AdaBoost with SVM-based component classifiers Engineering Applicationsof Artificial Intelligence 21(5)785ndash795 DOI httpsdoiorg101016jengappai200707001

Lichman M (2013) UCI machine learning repository. URL http://archive.ics.uci.edu/ml

Liu W, Chawla S, Cieslak DA, Chawla NV (2010) A robust decision tree algorithm for imbalanced data sets. In: Proceedings of the tenth SIAM international conference on data mining, SIAM, Philadelphia, vol 10, pp 766–777

Liu XY Wu J Zhou ZH (2009) Exploratory undersampling for class-imbalance learning IEEE Transac-tions on Systems Man and Cybernetics Part B (Cybernetics) 39(2)539ndash550 DOI 101109TSMCB20082007853

Luts J Ojeda F Van de Plas R De Moor B Van Huffel S Suykens JA (2010) A tutorial on supportvector machine-based methods for classification problems in chemometrics Analytica Chimica Acta665(2)129ndash145 DOI httpsdoiorg101016jaca201003030

Macskassy SA Provost F (2007) Classification in networked data A toolkit and a univariate case studyJournal of Machine Learning Research 8(May)935ndash983

Martens D Provost F (2014) Explaining data-driven document classifications MIS Quarterly 38(1)73ndash100

Martens D Provost F Clark J Junque de Fortuny E (2016) Mining massive fine-grained behavior data toimprove predictive analytics MIS Quarterly 40(4)869ndash888

Mazurowski MA Habas PA Zurada JM Lo JY Baker JA Tourassi GD (2008) Training neural networkclassifiers for medical decision making The effects of imbalanced datasets on classification perfor-mance Neural Networks 21(23)427 ndash 436 DOI httpsdoiorg101016jneunet200712031

Mease D Wyner AJ Buja A (2007) Boosted classification trees and class probabilityquantile estimationJournal of Machine Learning Research 8409ndash439

Nemenyi P (1963) Distribution-free Multiple Comparisons. Dissertation, Princeton University

Newman MEJ, Girvan M (2004) Finding and evaluating community structure in networks. Physical Review E 69:026113, DOI 10.1103/PhysRevE.69.026113

Ng AY (2004) Feature selection, L1 vs L2 regularization, and rotational invariance. In: Proceedings of the Twenty-first International Conference on Machine Learning, ACM, New York, NY, USA, ICML '04, pp 78–, DOI 10.1145/1015330.1015435

Ng AY Jordan MI (2002) On discriminative vs generative classifiers A comparison of logistic regressionand naive bayes In Advances in Neural Information Processing Systems 14 MIT Press pp 841ndash848

Ngai E Hu Y Wong Y Chen Y Sun X (2011) The application of data mining techniques in financial frauddetection A classification framework and an academic review of literature Decision Support Systems50(3)559 ndash 569 DOI httpsdoiorg101016jdss201008006

Platt JC (1999) Probabilistic outputs for support vector machines and comparisons to regularized likeli-hood methods In Advances in Large-Margin Classifiers MIT Press pp 61ndash74

Porter MA Onnela JP Mucha PJ (2009) Communities in networks Notices of the American MathematicalSociety 56(9)1082ndash1097

Provost F Fawcett T (2013) Data Science for Business What you need to know about data mining anddata-analytic thinking OrsquoReilly Media Inc

Provost F Dalessandro B Hook R Zhang X Murray A (2009) Audience selection for on-line brandadvertising Privacy-friendly social network targeting In Proceedings of the 15th ACM SIGKDDInternational Conference on Knowledge Discovery and Data Mining ACM New York NY USAKDD rsquo09 pp 707ndash716 DOI 10114515570191557098


Raskutti B Kowalczyk A (2004) Extreme re-balancing for SVMs A case study SIGKDD Explor Newsl6(1)60ndash69 DOI 10114510077301007739

Rosvall M Bergstrom CT (2008) Maps of random walks on complex networks reveal community structureProceedings of the National Academy of Sciences 105(4)1118ndash1123 DOI 101073pnas0706851105

Schapire RE (1999) A brief introduction to boosting In Proceedings of the 16th International Joint Con-ference on Artificial Intelligence - Volume 2 Morgan Kaufmann Publishers Inc San Francisco CAUSA IJCAIrsquo99 pp 1401ndash1406

Schapire RE Singer Y (1999) Improved boosting algorithms using confidence-rated predictions Machinelearning 37(3)297ndash336 DOI 101023A1007614523901

Shmueli G (2017) Analyzing behavioral big data Methodological practical ethical and moral issuesQuality Engineering 29(1)57ndash74 DOI 1010800898211220161210979

Sobhani P Viktor H Matwin S (2015) Learning from imbalanced data using ensemble methods andcluster-based undersampling In New Frontiers in Mining Complex Patterns Third InternationalWorkshop NFMCP 2014 Held in Conjunction with ECML-PKDD 2014 Nancy France Septem-ber 19 2014 Revised Selected Papers Springer International Publishing Cham pp 69ndash83 DOI101007978-3-319-17876-9 5

Stankova M (2016) Classification within network data with a bipartite structure Dissertation Universityof Antwerp

Stankova M Martens D Provost F (2015) Classification over bipartite graphs through projection WorkingPapers 2015001 University of Antwerp Faculty of Applied Economics

Sun Y Kamel MS Wong AK Wang Y (2007) Cost-sensitive boosting for classification of imbalanceddata Pattern Recognition 40(12)3358 ndash 3378 DOI httpsdoiorg101016jpatcog200704009

Suykens JA Van Gestel T De Brabanter J De Moor B Vandewalle J Suykens J Van Gestel T (2002)Least squares support vector machines World Scientific Singapore

Tang B He H (2015) Enn Extended nearest neighbor method for pattern recognition [research frontier]IEEE Computational Intelligence Magazine 10(3)52ndash60 DOI 101109MCI20152437512

Tang Y Zhang YQ Chawla NV Krasser S (2009) SVMs modeling for highly imbalanced classificationIEEE Transactions on Systems Man and Cybernetics Part B (Cybernetics) 39(1)281ndash288 DOI101109TSMCB20082002909

Tobback E Moeyersoms J Stankova M Martens D (2016) Bankruptcy prediction for SMEs using rela-tional data Working Paper 2016004 University of Antwerp Faculty of Applied Economics

Verbeke W Dejaeger K Martens D Hur J Baesens B (2012) New insights into churn prediction in thetelecommunication sector A profit driven data mining approach European Journal of OperationalResearch 218(1)211 ndash 229 DOI httpsdoiorg101016jejor201109031

Veropoulos K Campbell I Cristianini N (1999) Controlling the sensitivity of support vector machinesIn Proceedings of the International Joint Conference on Artificial Intelligence Stockholm Sweden(IJCAI99) pp 55 ndash 60

Whitrow C Hand DJ Juszczak P Weston D Adams NM (2009) Transaction aggregation as a strategyfor credit card fraud detection Data Mining and Knowledge Discovery 18(1)30ndash55 DOI 101007s10618-008-0116-z

Wickramaratna J Holden SB Buxton BF (2001) Performance degradation in boosting In Proceedings ofthe Second International Workshop on Multiple Classifier Systems Springer London UK MCS rsquo01pp 11ndash21

Yen SJ Lee YS (2009) Cluster-based under-sampling approaches for imbalanced data distributions ExpertSystems with Applications 36(3 Part 1)5718 ndash 5727 DOI httpsdoiorg101016jeswa200806108

Yu HF Lo HY Hsieh HP Lou JK McKenzie TG Chou JW Chung PH Ho CH Chang CF Wei YH et al(2010) Feature engineering and classifier ensemble for kdd cup 2010 In Proceedings of the KDDCup 2010 Workshop pp 1ndash16

Zha H He X Ding C Simon H Gu M (2001) Bipartite graph partitioning and data clustering In Pro-ceedings of the Tenth International Conference on Information and Knowledge Management ACMNew York NY USA CIKM rsquo01 pp 25ndash32 DOI 101145502585502591

Zhang J Mani I (2003) Knn approach to unbalanced data distributions A case study involving informationextraction In Proceedings of the ICMLrsquo2003 Workshop on Learning from Imbalanced DatasetsWashington DC

Ziegler CN McNee SM Konstan JA Lausen G (2005) Improving recommendation lists through topicdiversification In Proceedings of the 14th International Conference on World Wide Web ACM NewYork NY USA WWW rsquo05 pp 22ndash32 DOI 10114510607451060754



With respect to noise/outliers, the support value can reach a maximal value of C, the chosen regularization parameter in the SVM formulation. It should be clear that lowering the C-value leads to models that are less sensitive to noise/outliers.
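This bound is the box constraint of the standard soft-margin SVM dual, stated here for reference (a textbook formulation, not reproduced from this paper): each support value α_i is confined to [0, C], so no single noisy instance can contribute more than C to the decision function.

```latex
\max_{\alpha}\ \sum_{i=1}^{n} \alpha_i
  - \frac{1}{2} \sum_{i=1}^{n} \sum_{j=1}^{n} \alpha_i \alpha_j\, y_i y_j\, x_i^{\top} x_j
\quad \text{s.t.}\quad 0 \le \alpha_i \le C \ \ (i = 1,\dots,n),
\qquad \sum_{i=1}^{n} \alpha_i y_i = 0
```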

The experimental outcomes of the various undersampling processes are shown in Table 4 for the same four previously selected datasets. We refer to Appendix B, Table 12 for results on the entire data repository. When removing only a limited amount of majority class instances (βu = βu,2 = 1/4), we observed that in 12 out of 16 datasets the "Farthest Knn" method outperforms the "Closest Knn" technique. The four remaining datasets show comparable performances22. This finding shows that the "Farthest" method is very suitable in removing majority class noise/outliers and empirically shows their performance degrading effect. With higher undersampling rates (βu = 1), the "Closest Knn" method generally achieves higher performances than the "Farthest Knn" technique, though the results are less clear. We observed that only 9 out of 16 datasets indicate "Closest" as a clear winner, with 3 tie situations and 4 situations where "Farthest" shows the best results. Intuitively, we would expect the "Closest" method to perform better, since we are keeping the most informative instances (examples close to the minority class). However, this technique will also emphasize noise/outliers, and this is the main reason why the results are less significant. One can also see this from the fact that the RUS method shows far better results in comparison to the aforementioned techniques when βu = 1, an observation that is valid for all datasets. The RUS method is efficient in realizing noise/outlier reduction effects, since each SVM is trained with only a fraction of these types of instances. It is remarkable that in 9 out of 16 datasets the RUS method with βu = 1 outperforms the baseline model (βu = 0). In 4 datasets we observed equal performance, and 3 losses with respect to the baseline. RUS shows two attractive features: majority class noise/outlier removal and its ability to put more emphasis on the minority class instances by undersampling the majority class. A general downside of the approach is the information loss, and this effect causes the 3 losses to the baseline. Usually, though, the attractive features of RUS dominate the latter effect, and this indicates that it is definitely not necessary to include all majority class training instances in constructing the predictive model. Apparently, there is a high level of redundancy, which can be exploited to construct efficient hypotheses.
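The "Closest"/"Farthest" selection discussed above can be sketched as follows. This is an illustrative reimplementation under simplifying assumptions (Jaccard similarity on sparse feature sets, a hypothetical `knn_undersample` helper, and `beta_u` taken simply as the fraction of majority instances removed), not the authors' code:

```python
def knn_undersample(majority, minority, beta_u, k=5, mode="farthest"):
    """Informed undersampling sketch (hypothetical helper, not the paper's code).

    Instances are sets of active feature ids (sparse behaviour data). Each
    majority instance is scored by its mean Jaccard similarity to its k most
    similar minority instances. Mode "farthest" keeps the majority instances
    farthest from the minority class, i.e. it drops those lying in minority
    territory (suspected noise/outliers); mode "closest" keeps the closest,
    most informative instances instead.
    """
    def jaccard(a, b):
        inter = len(a & b)
        return inter / (len(a) + len(b) - inter) if inter else 0.0

    scores = []
    for m in majority:
        sims = sorted((jaccard(m, x) for x in minority), reverse=True)[:k]
        scores.append(sum(sims) / len(sims))
    order = sorted(range(len(majority)), key=lambda i: scores[i])  # ascending
    n_drop = int(beta_u * len(majority))
    if n_drop == 0:
        drop = set()
    elif mode == "farthest":
        drop = set(order[-n_drop:])   # drop the most minority-like instances
    else:
        drop = set(order[:n_drop])    # drop the least minority-like instances
    return [m for i, m in enumerate(majority) if i not in drop]
```

On a toy example, "farthest" removes the majority instance that overlaps the minority class, while "closest" removes the isolated one.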

In the next paragraph we compare the cluster-based undersampling (CBU) approach to the random undersampling (RUS) technique. Before proceeding, we should note that Table 12 does not show results on the LST, CRF and Bank datasets, due to the fact that the projected unigraph was too large to fit in memory (larger than 16 GB). A limited number of features (top nodes) does seem to be active for a relatively large number of instances (bottom nodes). All these instances will be connected in the unigraph projection, resulting in relatively high degrees for the bottom nodes. A possible solution to circumvent this problem might be to cut the weakest edges in the projected unigraph.

If we apply a limited amount of undersampling (βu = βu,2 = 1/4), the CBU technique outperforms RUS in 8 out of 13 datasets (4 losses and 1 tie with respect to RUS). On the highly imbalanced datasets (p = 1 or p = []), the CBU method wins

22 A tie occurs in the situation where the absolute difference in AUC is smaller than or equal to 0.5.


in 8 out of 8 cases. We can therefore conclude CBU to outperform RUS at low undersampling rates. This indeed shows that handling the within-class imbalance can be beneficial in this environment. If we make a comparison at the highest undersampling rate, βu = βu,5 = 1, the results are completely opposite: CBU wins in only 1 case, with 9 losses with respect to RUS and 3 ties. On the highly imbalanced datasets, CBU outperforms RUS for 1 dataset (with 4 losses and 3 ties). It is clear that RUS is the preferred technique in case of high undersampling rates. The reason is that CBU emphasizes the small disjuncts and basically fails to adequately pick up on the common, larger communities. Furthermore, these small disjuncts might actually correspond with majority class noise/outliers, and CBU is focusing on these types of instances. On the overall level, where we consider all undersampling rates (except βu = 0), CBU wins in 5 out of 13 datasets, RUS wins in 4 cases, and there are 4 tie situations (yet RUS did outperform CBU in all 4 cases). Both methods seem to be competitive with one another across a wide range of imbalance levels. If we pay particular attention to the highly imbalanced sets, CBU wins in 5 cases, and there are 3 tie situations on the remaining sets (yet RUS did outperform CBU in all 3 cases). CBU seems to be the method of choice (with respect to RUS) when the datasets show high imbalance levels.
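The cluster-based idea can be sketched minimally: partition the majority class first, then undersample per cluster so that small disjuncts keep representatives. The naive leader clustering and the 0.3 similarity threshold below are illustrative stand-ins for the community-detection step used in the paper:

```python
import random

def jaccard(a, b):
    inter = len(a & b)
    return inter / (len(a) + len(b) - inter) if inter else 0.0

def cluster_based_undersample(majority, n_keep, sim_threshold=0.3, seed=0):
    """CBU sketch: naive leader clustering on sparse feature sets, then
    round-robin sampling so every cluster (disjunct) stays represented."""
    leaders, clusters = [], []
    for m in majority:
        best, best_sim = None, sim_threshold
        for ci, leader in enumerate(leaders):
            s = jaccard(m, leader)
            if s >= best_sim:
                best, best_sim = ci, s
        if best is None:
            leaders.append(m)          # m starts a new cluster
            clusters.append([m])
        else:
            clusters[best].append(m)   # m joins its most similar leader
    rng = random.Random(seed)
    for c in clusters:
        rng.shuffle(c)
    kept = []
    while len(kept) < n_keep and any(clusters):
        for c in clusters:             # one instance per cluster per pass
            if c and len(kept) < n_keep:
                kept.append(c.pop())
    return kept
```

Because sampling cycles over clusters, even a two-instance disjunct contributes before any large community is sampled twice; this is exactly the behaviour that helps at low undersampling rates and hurts at high ones.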

Table 4 Undersampling experiments. Results show the average tenfold AUC test set performance with respect to increasing βu-values (βu = [βu,1, βu,2, βu,3, βu,4, βu,5] = [0, 1/4, 1/2, 3/4, 1]). Optimal performances (boldface) are highlighted for each imbalance ratio. Additionally, standard deviations are shown between brackets. Cl K represents the "Closest Knn" technique, Cl T represents the "Closest tot sim" technique (similar for Far K and Far T, see Section 3.2). Parameter settings are given in Section 4.2.

Mov Th (p = [])
        βu,1          βu,2          βu,3          βu,4          βu,5
RUS     79.77 (.53)   80.32 (.58)   81.57 (.55)   81.86 (.66)   81.26 (.62)
Cl K    79.77 (.53)   79.25 (.45)   78.07 (.5)    76.25 (.65)   62.46 (.85)
Cl T    79.77 (.53)   78.4 (.44)    72.41 (.35)   64.66 (.45)   60.37 (.73)
Far K   79.77 (.53)   84.54 (.5)    83.64 (.64)   80.02 (.73)   56.82 (1.03)
Far T   79.77 (.53)   85.03 (.57)   82.68 (.68)   75.61 (.92)   56.77 (1.09)
CBU     80.11 (.58)   81.17 (.6)    81.08 (.65)   84.17 (.51)   80.96 (.69)

Yahoo G (p = 25)
        βu,1          βu,2          βu,3          βu,4          βu,5
RUS     78.82 (.14)   78.91 (.16)   78.97 (.16)   78.61 (.16)   77.82 (.21)
Cl K    78.82 (.14)   77.26 (.15)   72.52 (.15)   67.86 (.2)    65.07 (.27)
Cl T    78.82 (.14)   76.83 (.1)    71.99 (.18)   67.15 (.23)   61.1 (.27)
Far K   78.82 (.14)   78.26 (.22)   74.69 (.27)   67.22 (.21)   60.72 (.23)
Far T   78.82 (.14)   77.68 (.26)   72.44 (.3)    64.94 (.24)   59.6 (.2)
CBU     75.25 (.32)   75.22 (.24)   74.69 (.23)   73.07 (.24)   70.69 (.24)


Table 4 continued
TaFeng (p = 25)
        βu,1          βu,2          βu,3          βu,4          βu,5
RUS     66.94 (.13)   67.44 (.13)   68.1 (.14)    68.27 (.14)   66.13 (.12)
Cl K    66.94 (.13)   66.13 (.14)   63.39 (.12)   59.83 (.13)   56.94 (.07)
Cl T    66.94 (.13)   66.38 (.15)   62.89 (.16)   57.46 (.13)   54.56 (.13)
Far K   66.94 (.13)   68.06 (.14)   66.43 (.16)   64.46 (.15)   63.35 (.13)
Far T   66.94 (.13)   64.31 (.11)   62.69 (.1)    61.27 (.11)   59.03 (.1)
CBU     64.81 (.12)   64.15 (.11)   64.13 (.12)   63.88 (.08)   63.46 (.08)

Book (p = 25)
        βu,1          βu,2          βu,3          βu,4          βu,5
RUS     60.08 (.07)   60.13 (.06)   60.4 (.08)    60.33 (.08)   63.28 (.08)
Cl K    60.08 (.07)   59.96 (.07)   60.13 (.08)   59.96 (.1)    59.28 (.07)
Cl T    60.08 (.07)   59.96 (.07)   60.13 (.08)   60.29 (.04)   54.5 (.09)
Far K   60.08 (.07)   63.29 (.1)    64.19 (.08)   57.3 (.11)    55.66 (.11)
Far T   60.08 (.07)   62.14 (.05)   58.27 (.06)   56.37 (.1)    55.66 (.11)
CBU     54.82 (.09)   54.67 (.09)   54.71 (.09)   54.66 (.1)    54.78 (.09)

4.5 Boosting variants

In this section the plain AdaBoost (AB) method is compared with the cost-sensitive AdaCost (AC) algorithm, the EasyEnsemble (EE) technique and the baseline (BL) method (where we train a single SVM on the imbalanced training data). AB, AC and EE have an underlying boosting process in common. The AUC-values are stored during each boosting iteration, which allows us to show the performance graphically as a function of the number of iterations t. In EasyEnsemble (EE) we combine the weak learners of each subset by summing their individual contributions. For example, when t = 2 boosting iterations are performed, we evaluate the performance of the combined learner ∑_{s=1}^{S} ∑_{t=1}^{2} α_{s,t} h_{s,t}(x). Figures 1, 2, 3 and 4 show the average tenfold AUC-performance (with µ = 100%) on the test data. The left-side figures show the AB, AC (with varying R-values) and EE (with varying S-values) performance with respect to the number of boosting iterations. The C-value is tuned according to highest validation set AUC-performance (over all possible boosting iterations). The right-side figures investigate the influence of the C-parameter for AB and EE (with S = 15) and allow us to gain insight on the effect of weak and strong learners on the generalization performance. Results on the remaining datasets are given in Appendix C, Figures 6-17. Note that we only indicate results with weight-percentage µ = 100% (use all instances in the training process). Previous experiments (with µ = 75%) showed inferior results. This suggests that it is indeed important to include all minority instances during the training of each weak learner, an observation also made by Liu et al (2009).

When investigating the influence of the regularization parameter C in AdaBoost, we can graphically verify that it is a "weakness" indicator, in the sense that lower C-values correspond to weaker learners. While these low C-value learners have a low performance in the first rounds of boosting, the possible boost in performance is usually much higher compared to higher C-value inducers. Strong learners can already make a suitable distinction between minority and majority class behaviours in the first boosting round. The boosting process will emphasize the hard to learn instances, among which are noise/outliers. Therefore, too strong learners are not suitable to be used in a boosting process, as already noted by Wickramaratna et al (2001). In Figures 2(b) and 3(b) we observe that the lowest C-values (C = 10^-7, 10^-5) can outperform higher C-values (C = 10^-3, 10^-1). In many cases the AB-process outperforms the baseline (BL), see also the results in Table 5. A combination of several weak learners can often achieve better generalization than a single strong learner.

Inspired by the success of the RUS method with βu = 1 (for reasons indicated in Section 4.4), the EE-technique tries to overcome an obvious downside of this approach: we are removing a lot of majority class instances that can be very informative. Hence, we try to make up for this loss by combining several of these balanced training sets, in the hope that the majority class is more adequately explored. This results in a very effective learner that seems to outperform the AB, AC and BL methods in the majority of cases, as can be seen from these figures. It is remarkable that we don't need to use that many subsets S to achieve high performances: increasing the S-value from 5 to 15 doesn't seem to cause major increases in AUC (see also Table 5). The influence of the C-parameter is similar as described in the previous paragraph.
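The mechanics can be sketched end-to-end. The toy implementation below uses one-feature "presence" stumps as weak learners instead of the linear SVMs used in the paper, purely to keep the example self-contained; the subset/boosting/combination structure, with score(x) = ∑_s ∑_t α_{s,t} h_{s,t}(x), is the point being illustrated:

```python
import math
import random

def easy_ensemble(pos, neg, S=5, T=10, seed=0):
    """EasyEnsemble sketch: draw S balanced majority subsets, run AdaBoost on
    each (one-feature presence stumps stand in for the paper's SVM base
    learners), and sum the alpha-weighted hypotheses across all subsets.
    Instances are sets of feature ids; labels: pos = +1, neg = -1."""
    rng = random.Random(seed)
    feats = sorted(set().union(*pos, *neg))
    ensemble = []  # (alpha, feature, polarity) triples across all subsets
    for _ in range(S):
        sub = rng.sample(neg, min(len(pos), len(neg)))  # balanced subset
        X, y = pos + sub, [1] * len(pos) + [-1] * len(sub)
        w = [1.0 / len(X)] * len(X)
        for _ in range(T):
            # pick the stump h(x) = p if feature present else -p
            # with the lowest weighted error
            err, f, pol = min(
                (sum(wi for wi, xi, yi in zip(w, X, y)
                     if (p if ft in xi else -p) != yi), ft, p)
                for ft in feats for p in (1, -1))
            err = min(max(err, 1e-9), 1 - 1e-9)
            alpha = 0.5 * math.log((1 - err) / err)
            ensemble.append((alpha, f, pol))
            w = [wi * math.exp(-alpha * yi * (pol if f in xi else -pol))
                 for wi, xi, yi in zip(w, X, y)]
            z = sum(w)
            w = [wi / z for wi in w]
    # combined learner: sum over all subsets and boosting rounds
    return lambda x: sum(a * (p if f in x else -p) for a, f, p in ensemble)
```

Note that each subset is only twice the minority class size and the S boosting runs are independent, which is what makes the method fast and parallelizable.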

Fig 1 Mov G(p = 25) dataset: results showing average tenfold AUC-performance on test data (with µ = 100%) for (a) AB, AC and EE with C chosen according to highest validation set AUC-performance (over all possible T); (b) AB and EE (S = 15) with varying C-levels. [figure omitted: panel (a) shows curves for AB, AC, EE (S = 5, 10, 15) and BL; panel (b) for AB and EE at C = 10^-7, 10^-5, 10^-3, 10^-1]


Fig 2 Book(p = 25) dataset. [figure omitted: same panel layout as Fig 1]

Fig 3 TaFeng(p = 25) dataset. [figure omitted: same panel layout as Fig 1]

Fig 4 Bank(p = []) dataset. [figure omitted: same panel layout as Fig 1]


4.6 Final comparison

4.6.1 Performance-wise comparison

In this section we wish to determine which of the aforementioned methods shows superior performance in terms of AUC on test data. The experimental set-up is described in Section 4.2. Note that we exclude β = 0 and βu = 0 in determining the performance of the oversampling respectively undersampling techniques, to be able to compare them with the baseline (BL) approach23. The results for AB, AC and EE are shown for µ = 100%. The number of boosting iterations t ∈ [0, T] is also considered as a tunable parameter. Table 5 shows the final comparison between all methods. Additionally, we added an average rank column indicating the mean rank of each algorithm across all datasets. The rank of an algorithm for a specific dataset corresponds to its position in the sorted AUC-performance spectrum, where the best performing algorithm gets the rank of 1, the second best rank 2, etc. In case two or more algorithms perform equally well, average ranks are assigned (if two algorithms have ranks of 3 and 4 respectively and show equal performance, then both methods get assigned a rank of 3.5). Note that the LST dataset is excluded in the calculation of average ranks, since we consider this as a dataset not being affected by imbalance and showing equal performances.
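The tie-handling rank assignment can be made concrete. Below is a small illustrative helper (the function name and input layout are assumptions, not taken from the paper): given per-dataset AUCs, it returns each method's average rank, giving tied methods the mean of their rank positions exactly as in the 3.5 example above.

```python
def average_ranks(auc_by_method):
    """auc_by_method maps method name -> list of AUCs, one per dataset.
    Best AUC on a dataset gets rank 1; exact ties share the mean of the
    tied rank positions (e.g. ranks 3 and 4 both become 3.5)."""
    methods = list(auc_by_method)
    n_data = len(next(iter(auc_by_method.values())))
    total = {m: 0.0 for m in methods}
    for d in range(n_data):
        # sort methods by AUC on dataset d, best first
        scores = sorted(((auc_by_method[m][d], m) for m in methods),
                        reverse=True)
        i = 0
        while i < len(scores):
            j = i
            while j + 1 < len(scores) and scores[j + 1][0] == scores[i][0]:
                j += 1                      # extend the block of tied scores
            avg_rank = (i + j) / 2 + 1      # mean of 1-based positions i+1..j+1
            for k in range(i, j + 1):
                total[scores[k][1]] += avg_rank
            i = j + 1
    return {m: total[m] / n_data for m in methods}
```

For example, with AUCs A = [0.9, 0.8], B = [0.8, 0.8], C = [0.7, 0.9], methods A and B tie on the second dataset and each receive rank 2.5 there, giving average ranks of 1.75, 2.25 and 2.0 respectively.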

As a side note, we noticed that increasing the number of minority training set instances can significantly improve performances. We can see this clearly by comparing the p = 25 outcomes with the results of p = 1 on the same dataset. Junque de Fortuny et al (2014a) already proved empirically that larger behaviour datasets (in terms of the number of instances or features) lead to higher AUC-performances. This is confirmed in our experiments, though we want to add that this is only true when the data size is increased in a balanced fashion.

The results in Table 5 confirm our previous analysis. Each of the oversampling, undersampling (except Cl Knn) and boosting variants seems to improve upon the BL performance. Sobhani et al (2015) note that the direct application of ensemble techniques on imbalanced data does not circumvent the imbalanced learning issue. In our experiments on behaviour data, we do find a simple boosting procedure applied to the imbalanced data to improve upon the predictive performance. The oversampling techniques OSR, SMOTE and ADASYN show similar outcomes and are competitive methods, seeming to outperform the undersampling techniques and the plain boosting (AB) and cost-sensitive learning (AC) approaches. When dealing with traditional data, the performances of oversampling techniques are generally worse than those of undersampling approaches (Drummond and Holte 2003). It is interesting to observe that for these large and sparse behaviour datasets the opposite is true. This is heavily related to the intrinsic properties of the data under consideration. Undersampling (or feature selection, for that matter) is not recommended because, as we have noted in Section 2.1, each feature provides a small additional contribution to the predictive performance. With respect to AC, the results indicate that R = RL is a better choice compared to more random cost ratios R = 28. The EE-technique has the

23 The BL technique trains single SVMs on the imbalanced training data.

30 Jellis Vanhoeyveld David Martens

lowest average rank and is our best performing method. The combination of several balanced subsets with filtered noise/outlier contributions allows one to explore the majority class space efficiently. To strengthen our previous statements, we will in the next paragraphs conduct statistical significance tests in accordance with the procedure outlined in Demsar (2006). The latter paper presents a framework to perform statistical comparisons of classifiers over multiple datasets.

The first null-hypothesis we try to reject postulates that each of the algorithms performs equally well or, equivalently, that their average ranks R_j (see Table 5) are indifferent. Under the null-hypothesis, the Friedman statistic (Friedman 1937)

χ²_F = (12N / (k(k+1))) [ Σ_{j=1}^{k} R_j² − k(k+1)²/4 ]    (6)

with N the number of datasets (N = 15) and k the number of algorithms (k = 13), can be used to compute the statistic (Iman and Davenport 1980)

F_F = ((N − 1) χ²_F) / (N(k − 1) − χ²_F)    (7)

The latter is distributed according to the F-distribution with k − 1 and (k − 1)(N − 1) degrees of freedom. Evaluating this measure with the average ranks shown in Table 5 results in F_F = 22.98, which is a lot higher than the critical value of 1.81 at the α = 0.05 significance level. We can therefore reject the null-hypothesis and proceed with post-hoc tests.
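Plugging the average ranks of Table 5 into Eqs. (6) and (7) reproduces this value up to rank rounding; a quick check:

```python
import numpy as np

# Average ranks from Table 5 (k = 13 algorithms, N = 15 ranked datasets)
R = np.array([11.600, 5.000, 4.533, 4.800, 8.167, 12.467, 8.133,
              8.567, 8.267, 8.467, 5.400, 3.267, 2.333])
N, k = 15, len(R)

# Friedman statistic, Eq. (6)
chi2_F = 12 * N / (k * (k + 1)) * (np.sum(R ** 2) - k * (k + 1) ** 2 / 4)
# Iman-Davenport statistic, Eq. (7)
F_F = (N - 1) * chi2_F / (N * (k - 1) - chi2_F)
print(round(F_F, 2))  # close to the reported 22.98 (the ranks are rounded)
```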

The Nemenyi test (Nemenyi 1963) allows one to compare each classifier to one another and adjusts the critical value to compensate for making k(k − 1)/2 comparisons.24 "The performance of two classifiers is significantly different if the corresponding average ranks differ by at least the critical difference (CD)" (Demsar 2006). We refer to the aforementioned paper to calculate the CD. The result of these comparisons at the α = 0.05 significance level is shown in Table 6. This test is already quite revealing, in the sense that for instance the oversampling methods (OSR, SMOTE and ADASYN), AC (R = RL) and EE-techniques are significantly different from the BL. Yet it does not indicate that undersampling or applying plain boosting (AB) results in significantly different performances compared to the BL. The Nemenyi test is still conservative and we therefore compare all classifiers to a single control classifier in the next paragraphs. Note that we only make k − 1 comparisons in that situation, resulting in stronger results compared to the Nemenyi test.
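A sketch of the CD computation; the critical value q_0.05 ≈ 3.313 for k = 13 is taken from an extended two-tailed Nemenyi table and should be treated as an assumption here (Demsar's paper tabulates only up to k = 10), though the resulting CD ≈ 4.71 is consistent with the 0/1 entries of Table 6:

```python
import math

# CD = q_alpha * sqrt(k(k+1) / (6N)), as in Demsar (2006)
# Assumption: q_0.05 = 3.313 for k = 13 (extended two-tailed Nemenyi table)
N, k, q_alpha = 15, 13, 3.313
CD = q_alpha * math.sqrt(k * (k + 1) / (6 * N))
print(round(CD, 2))  # ~4.71

# Consistency with Table 6: BL (rank 11.600) vs OSR (rank 5.000) differ by
# 6.6 > CD (entry 1), while BL vs RUS (rank 8.167) differ by 3.43 < CD (entry 0)
```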

When comparing a single classifier i with a control classifier c, the null-hypothesis states that the two classifiers i and c perform equally well. Under this hypothesis, the test statistic (based on the average ranks R_i shown in Table 5)

z = (R_i − R_c) / √(k(k+1)/(6N))    (8)

is distributed according to a standard normal distribution. This z-value is used to find the corresponding probability (p-value) p from the cumulative standard normal

24 The larger the amount of comparisons, the higher the proportion of null-hypotheses that are wrongfully rejected due to random chance.


Table 5 Final comparison of all methods in terms of AUC test set performance; see Section 4.2 for details on experimental set-up. β ≠ 0 and βu ≠ 0 for the oversampling respectively undersampling techniques; µ = 100 for each of the boosting variants. Standard deviations are given between brackets and the performance difference with the BL is shown in square brackets.

             Mov G(p = 1)        Mov G(p = 25)       Mov Th(p = [])      Yahoo A(p = 1)
BL           71.6 (2.62) [0]     81.41 (1.32) [0]    79.77 (5.33) [0]    55.92 (2.97) [0]
OSR          75.35 (2.27) [3.8]  83.76 (2.09) [2.3]  85.13 (6.1) [5.4]   60.05 (2.71) [4.1]
SMOTE        76.16 (2.27) [4.6]  83.7 (2.1) [2.3]    85.67 (4.98) [5.9]  60.1 (3) [4.2]
ADASYN       76.07 (2.26) [4.5]  83.63 (2.04) [2.2]  85.65 (5.6) [5.9]   59.9 (2.99) [4]
RUS          72.88 (2.73) [1.3]  81.52 (2.15) [0.1]  82.91 (7.19) [3.1]  57.04 (1.77) [1.1]
Cl Knn       71.43 (1.36) [-0.2] 80.88 (1.19) [-0.5] 78.87 (4.71) [-0.9] 55.78 (2.71) [-0.1]
Far Knn      71.9 (2.95) [0.3]   80.9 (1.48) [-0.5]  84.07 (4.64) [4.3]  57.2 (1.33) [1.3]
CBU          74.17 (2.36) [2.6]  81.51 (1.04) [0.1]  82.76 (7.22) [3]    58.77 (3.43) [2.8]
AB           71.65 (1.73) [0.1]  84.52 (1.89) [3.1]  82.43 (5.18) [2.7]  58.35 (2.62) [2.4]
AC(R = 28)   71.61 (2.46) [0]    83.46 (1.82) [2]    83.27 (5.6) [3.5]   57.72 (2.47) [1.8]
AC(R = RL)   74.65 (2.7) [3.1]   83.35 (2.09) [1.9]  85.41 (4.49) [5.6]  59.47 (2.33) [3.5]
EE(S = 10)   76.04 (2.66) [4.4]  85.05 (1.85) [3.6]  86.1 (5.78) [6.3]   59.66 (3.13) [3.7]
EE(S = 15)   76.12 (2.88) [4.5]  85.14 (1.86) [3.7]  86.42 (5.86) [6.7]  59.76 (2.93) [3.8]

             Yahoo A(p = 25)     Yahoo G(p = 1)      Yahoo G(p = 25)     TaFeng(p = 1)
BL           61.68 (2.42) [0]    66.84 (3.66) [0]    78.82 (1.39) [0]    55.75 (1.6) [0]
OSR          64.59 (3.12) [2.9]  73.08 (2.96) [6.2]  78.52 (2.01) [-0.3] 61.21 (2.24) [5.5]
SMOTE        65.56 (3.33) [3.9]  73.11 (3.12) [6.3]  79.01 (1.21) [0.2]  61.72 (1.81) [6]
ADASYN       65.13 (3.38) [3.4]  73.22 (3.17) [6.4]  79.74 (1.68) [0.9]  61.68 (1.86) [5.9]
RUS          64.11 (2.8) [2.4]   70.65 (3.39) [3.8]  78.91 (1.55) [0.1]  59.25 (2.18) [3.5]
Cl Knn       61.14 (2.13) [-0.5] 66.34 (3.54) [-0.5] 77.26 (1.46) [-1.6] 55.77 (1.28) [0]
Far Knn      63.96 (3.03) [2.3]  66.97 (3.54) [0.1]  78.26 (2.2) [-0.6]  59.98 (1.26) [4.2]
CBU          62.27 (1.79) [0.6]  71.27 (2.89) [4.4]  75.22 (2.42) [-3.6] 58.4 (1.57) [2.6]
AB           63.88 (2.67) [2.2]  68.9 (2.03) [2.1]   79.01 (1.66) [0.2]  56.21 (1.79) [0.5]
AC(R = 28)   64.32 (3.56) [2.6]  68.89 (3.11) [2]    78.99 (1.89) [0.2]  56.33 (1.83) [0.6]
AC(R = RL)   64.31 (3.03) [2.6]  73.13 (2.8) [6.3]   78.41 (2) [-0.4]    61.6 (2.26) [5.9]
EE(S = 10)   66.51 (3.24) [4.8]  72.61 (3.15) [5.8]  80.52 (1.6) [1.7]   61.2 (1.82) [5.4]
EE(S = 15)   66.36 (3.18) [4.7]  73.48 (2.32) [6.6]  80.54 (1.56) [1.7]  61.13 (1.83) [5.4]

             TaFeng(p = 25)      Book(p = 1)         Book(p = 25)        LST(p = 1)
BL           66.94 (1.34) [0]    52.6 (1.29) [0]     60.08 (0.71) [0]    99.99 (0.01) [0]
OSR          68.77 (1.23) [1.8]  55.87 (1.42) [3.3]  64.62 (0.57) [4.5]  99.99 (0.01) [0]
SMOTE        68.47 (1.5) [1.5]   55.07 (0.88) [2.5]  62.96 (0.82) [2.9]  99.99 (0.01) [0]
ADASYN       68.48 (1.47) [1.5]  55.04 (0.91) [2.4]  63.02 (0.57) [2.9]  99.99 (0.01) [0]
RUS          68.28 (1.39) [1.3]  54.26 (0.92) [1.7]  63.28 (0.8) [3.2]   99.98 (0.01) [0]
Cl Knn       66.13 (1.43) [-0.8] 52.69 (1.3) [0.1]   60.02 (0.79) [-0.1] 99.99 (0.01) [0]
Far Knn      68.06 (1.41) [1.1]  56.25 (1.52) [3.7]  64.15 (1.12) [4.1]  99.98 (0.01) [0]
CBU          63.84 (1.07) [-3.1] 53.75 (1.01) [1.2]  54.68 (0.88) [-5.4] []
AB           67.65 (1.55) [0.7]  54.27 (1.95) [1.7]  65 (0.67) [4.9]     99.99 (0.01) [0]
AC(R = 28)   69.31 (1.23) [2.4]  53.72 (1) [1.1]     61.24 (0.8) [1.2]   99.98 (0.01) [0]
AC(R = RL)   67.15 (1.51) [0.2]  55.73 (1.22) [3.1]  64.6 (0.64) [4.5]   99.99 (0.01) [0]
EE(S = 10)   70.3 (1.35) [3.4]   55.09 (1.29) [2.5]  65.37 (0.61) [5.3]  99.98 (0.01) [0]
EE(S = 15)   70.4 (1.3) [3.5]    55.35 (1.26) [2.8]  65.4 (0.51) [5.3]   99.98 (0.01) [0]


Table 5 Continued. Additionally, an average rank column is added, showing the mean rank of each algorithm across all datasets. Note that we excluded the LST dataset for this purpose.

             Adver(p = [])       Adver(p = 1)        CRF(p = [])          Bank(p = [])
BL           96.61 (1.82) [0]    90.93 (3.02) [0]    64.06 (16.43) [0]    66.82 (0.88) [0]
OSR          96.93 (1.91) [0.3]  93.3 (2.02) [2.4]   80.74 (12.93) [16.7] 71.39 (0.79) [4.6]
SMOTE        97.05 (1.66) [0.4]  93.35 (2.01) [2.4]  78.7 (16.56) [14.6]  []
ADASYN       96.91 (1.95) [0.3]  93.46 (2.21) [2.5]  78.87 (16.71) [14.8] []
RUS          96.81 (1.87) [0.2]  92.38 (2.51) [1.5]  83.98 (5.99) [19.9]  69.41 (1.19) [2.6]
Cl Knn       96.4 (1.48) [-0.2]  89.73 (3.42) [-1.2] 76.63 (16.19) [12.6] 66.17 (0.72) [-0.6]
Far Knn      95.77 (1.81) [-0.8] 93.88 (1.78) [3]    83.75 (13.11) [19.7] 66.95 (0.56) [0.1]
CBU          97.15 (1.88) [0.5]  94.18 (2.3) [3.3]   []                   []
AB           97.34 (2.18) [0.7]  91.39 (3.23) [0.5]  77.62 (15.15) [13.6] 66.82 (0.88) [0]
AC(R = 28)   97.44 (1.93) [0.8]  91 (3.35) [0.1]     68.31 (14.93) [4.2]  67.67 (0.71) [0.9]
AC(R = RL)   97.46 (1.71) [0.8]  93.51 (2.17) [2.6]  85.08 (9.77) [21]    70.7 (0.8) [3.9]
EE(S = 10)   97.64 (1.35) [1]    92.97 (2.75) [2]    86.18 (10.17) [22.1] 71.46 (0.81) [4.6]
EE(S = 15)   97.63 (1.35) [1]    93.3 (2.14) [2.4]   86.35 (9.99) [22.3]  71.54 (0.76) [4.7]

             Average Rank
BL           11.600
OSR          5.000
SMOTE        4.533
ADASYN       4.800
RUS          8.167
Cl Knn       12.467
Far Knn      8.133
CBU          8.567
AB           8.267
AC(R = 28)   8.467
AC(R = RL)   5.400
EE(S = 10)   3.267
EE(S = 15)   2.333

Table 6 Nemenyi test outcomes at the α = 0.05 significance level. A value of 1 indicates the null-hypothesis is rejected and thus finds the two algorithms significantly different (a value of 0 means we accept the null-hypothesis). Methods are listed in the same order as Table 5 and are more concisely represented.

      BL  RO  SM  AD  RU  Cl  Fa  CBU AB  AC1 AC2 EE1 EE2
BL    0   1   1   1   0   0   0   0   0   0   1   1   1
RO    1   0   0   0   0   1   0   0   0   0   0   0   0
SM    1   0   0   0   0   1   0   0   0   0   0   0   0
AD    1   0   0   0   0   1   0   0   0   0   0   0   0
RU    0   0   0   0   0   0   0   0   0   0   0   1   1
Cl    0   1   1   1   0   0   0   0   0   0   1   1   1
Fa    0   0   0   0   0   0   0   0   0   0   0   1   1
CBU   0   0   0   0   0   0   0   0   0   0   0   1   1
AB    0   0   0   0   0   0   0   0   0   0   0   1   1
AC1   0   0   0   0   0   0   0   0   0   0   0   1   1
AC2   1   0   0   0   0   1   0   0   0   0   0   0   0
EE1   1   0   0   0   1   1   1   1   1   1   0   0   0
EE2   1   0   0   0   1   1   1   1   1   1   0   0   0


distribution function: p = 2 × min(Φ(z), 1 − Φ(z)). Holm's method (Holm 1979) compares k − 1 classifiers to a single control classifier and sorts the corresponding p-values in ascending order, so that p1 ≤ p2 ≤ ... ≤ pk−1. Each pi is subsequently compared to its associated confidence level25 αcomp = α/(k − i). Holm starts with performing the check p1 < α/(k − 1) and rejects the null-hypothesis if the test passes. It then proceeds with a similar check for i = 2 and continues until a certain null-hypothesis cannot be rejected. The remaining hypotheses are retained.
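Holm's step-down procedure with BL as control can be reproduced from the average ranks alone; a minimal sketch using Eq. (8) (SciPy's normal CDF assumed):

```python
import numpy as np
from scipy.stats import norm

# Average ranks from Table 5; BL is the control classifier
ranks = {"BL": 11.600, "OSR": 5.000, "SMOTE": 4.533, "ADASYN": 4.800,
         "RUS": 8.167, "Cl Knn": 12.467, "Far Knn": 8.133, "CBU": 8.567,
         "AB": 8.267, "AC(R=28)": 8.467, "AC(R=RL)": 5.400,
         "EE(S=10)": 3.267, "EE(S=15)": 2.333}
N, k, alpha = 15, len(ranks), 0.05
se = np.sqrt(k * (k + 1) / (6 * N))

# z and two-sided p-value versus the control, Eq. (8)
stats = []
for name, R in ranks.items():
    if name == "BL":
        continue
    z = (R - ranks["BL"]) / se
    p = 2 * min(norm.cdf(z), 1 - norm.cdf(z))
    stats.append((p, name, z))

# Holm step-down: compare the i-th smallest p-value to alpha / (k - i)
stats.sort()
rejected = []
for i, (p, name, z) in enumerate(stats, start=1):
    if p < alpha / (k - i):
        rejected.append(name)
    else:
        break  # the first non-rejection stops the procedure
print(rejected)
```

Running this reproduces the top of Table 7: the six methods from EE(S = 15) down to AC(R = RL) are rejected, and the procedure stops at Far Knn.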

Table 7 shows the result of the Holm test at the α = 0.05 significance level with the BL algorithm as control classifier. The table lists each method according to sorted p-values and shows the z-, p- and αcomp-values. The significance column indicates whether the proposed method is significantly different from the BL and coincidentally matches the result of the Nemenyi test. The p-values of the (Far Knn, RUS, CBU) undersampling methods, AB and AC (R = 28) are only slightly higher than the level required to reject the null-hypothesis. We included a critical significance level αcrit, corresponding to the lowest possible significance level at which the method would be considered significantly different from the BL (if α = αcrit then p = αcomp). In the case of AB, for example, if we were to choose a significance level α = 0.0764 (a confidence level of 92.36%), the associated p-value would be smaller than αcomp and we would proceed to conclude26 that AB performs significantly different from the BL. To summarize: the null-hypothesis of equal performance between the BL and an alternative method proposed in Section 3 (except Cl Knn) is rejected with a confidence level of more than 91% according to Holm's test. Hence, the performance of the BL approach is significantly worse than that of each of the proposed over- and undersampling methods (except Cl Knn), cost-sensitive learning techniques (AC) and boosting variants (AB, EE).

Similar to the previous paragraph, Table 8 presents the results of Holm's test at the α = 0.05 significance level with EE(S = 15) as control classifier. The results coincidentally correspond with the result of the Nemenyi test and indicate that the oversampling (OSR, SMOTE and ADASYN), AC(R = RL) and EE(S = 10) methods are not statistically different from the EE(S = 15) technique at the 0.05 significance level. However, if we were to adapt the significance level to approximately α = 0.25 (a 75% confidence level), the results would conclude significance (except for EE(S = 10)). To summarize: the null-hypothesis of equal performance between EE(S = 15) and an alternative method proposed in Section 3 (except EE(S = 10)) is rejected with a confidence level of more than 75% according to Holm's test. Hence, the performance of the oversampling, undersampling, AB and AC approaches is significantly worse than that of EE(S = 15). We can therefore safely conclude EE to be the preferred method in terms of AUC performance when dealing with imbalanced behaviour data.

25 αcomp adjusts the value of α to compensate for multiple comparisons.
26 This is not entirely true, since Holm's method requires Far Knn to pass the test first before proceeding, which only occurs at the 0.088662 significance level.


Table 7 Holm test at the α = 0.05 significance level with BL reference. The table shows the z test statistic with associated p-value; αcomp = α/(k − i), with k the number of algorithms (k = 13) and i corresponding to the position of the method in the sorted p-value vector. Since we already ranked the p-values in ascending order, i takes on the row number (e.g. OSR has i = 5). The column "significant" denotes if we can reject the null-hypothesis (significant: p < αcomp). αcrit corresponds with the smallest possible significance level where we would decide to reject the null-hypothesis (αcrit = α · p/αcomp).

             z          p         αcomp     significant  αcrit
EE(S = 15)   -6.51642   7.2E-11   0.004167  1            8.64E-10
EE(S = 10)   -5.86009   4.63E-09  0.004545  1            5.09E-08
SMOTE        -4.96936   6.72E-07  0.005     1            6.72E-06
ADASYN       -4.78183   1.74E-06  0.005556  1            1.56E-05
OSR          -4.64119   3.46E-06  0.00625   1            2.77E-05
AC(R = RL)   -4.35991   1.3E-05   0.007143  1            9.11E-05
Far Knn      -2.4378    0.014777  0.008333  0            0.088662
RUS          -2.41436   0.015763  0.01      0            0.078815
AB           -2.34404   0.019076  0.0125    0            0.076305
AC(R = 28)   -2.20339   0.027567  0.016667  0            0.082701
CBU          -2.13307   0.032919  0.025     0            0.065837
Cl Knn       0.609449   0.542227  0.05      0            0.542227

Table 8 Holm test at the α = 0.05 significance level with EE(S = 15) reference.

             z          p         αcomp     significant  αcrit
Cl Knn       7.12587    1.03E-12  0.004167  1            1.24E-11
BL           6.516421   7.2E-11   0.004545  1            7.92E-10
CBU          4.383348   1.17E-05  0.005     1            0.000117
AC(R = 28)   4.313027   1.61E-05  0.005556  1            0.000145
AB           4.172384   3.01E-05  0.00625   1            0.000241
RUS          4.102063   4.09E-05  0.007143  1            0.000287
Far Knn      4.078623   4.53E-05  0.008333  1            0.000272
AC(R = RL)   2.156513   0.031044  0.01      0            0.155218
OSR          1.875229   0.060761  0.0125    0            0.243045
ADASYN       1.734587   0.082814  0.016667  0            0.248442
SMOTE        1.547064   0.121848  0.025     0            0.243696
EE(S = 10)   0.65633    0.511612  0.05      0            0.511612

4.6.2 Timing-wise comparison

The performance of each of the proposed algorithms in terms of training times depends highly upon the chosen parameters. The β and βu parameters control the number of training instances and are an important contribution to the subsequent SVM learning times for each of the over- and undersampling techniques. The number of boosting iterations t ∈ [0, T] is also a major factor in each of the boosting variants proposed in Section 3.3. Furthermore, the chosen regularization parameter C highly influences the training times of the SVM inducer: higher C-values result in larger training times (Fan et al 2008). Not only do the chosen hyperparameters affect learning times, the data characteristics, such as the number of minority instances, the imbalance level, the number of features, the amount of overlapping etc., have a major effect.


In comparing each of the methods outlined in Section 3, we make use of a similar methodology as previously presented (see Section 4.6.1). For each of the ten folds, a suitable parameter combination is selected based on optimal validation set AUC-performance. These chosen parameters are applied in determining the timings of each method. The final results indicate the average timings across ten folds. Hence, we show the required amount of time for each method when using the parameter combination resulting in the best performance. Note that this might for instance result in using β = 13 in case of OSR, or using only t = 5 boosting iterations in EE. Computational timings are shown in Table 9.

As one can observe from the aforementioned table, CBU is the slowest method and relies on computationally intensive clustering procedures. The synthetic SMOTE and ADASYN oversampling approaches are very time consuming methods. They both rely on computationally expensive nearest neighbour computations, in conjunction with a synthetic sample generation procedure that requires a single pass over all minority training instances.

The RUS method shows, as expected, superior results in terms of computational timings. Other undersampling methods perform quite well, though they are slower than BL and OSR due to their inherent nearest neighbour computations.

The boosting variants seem to perform worse than the BL, OSR and each of the undersampling methods. Since EE was the best method in terms of AUC-performance (see Section 4.6.1), we focus on this learner in this paragraph. For the medium sized datasets (all but CRF and Bank), OSR outperforms EE(S = 15). Yet for the large datasets (CRF and Bank), the opposite is true. A SVM is known to have a quadratic computational complexity in the number of training instances (Suykens et al 2002; García and Lozano 2007). For this reason, we consider OSR (or any other oversampling technique) inappropriate for large datasets. Since with EE each subset has a size limited to twice the minority class training size, we expect increasing benefits of the EE-technique for larger and more imbalanced behaviour data. As Junque de Fortuny et al (2014a) have observed, larger behaviour data in terms of the number of instances or features also contributes to an increase in predictive power. Hence, we expect to see this type of big behaviour data more often in real-life applications, where computational resources can become a major constraint. Because the EE-technique is easily parallelizable, by simultaneously performing the boosting process for each of the independent subsets, we added an extra row in Table 9: EE par corresponds with the time needed for EE(S = 15) divided by 15, indicating the time that would be required for EE if each subset were trained in parallel. We can see that EE par can outperform OSR in 8 out of 16 cases, precisely for the datasets that show a high imbalance level (e.g. p = 1 or p = []).

We conclude our comparison of the proposed methods in Figure 5, which shows the average ranks in terms of computational timings (see Table 9) and AUC performance for each of the available data sources. Regarding the ranks with respect to the latter, we added an extra row EE par, having the same AUC-performance as EE(S = 15) for each dataset in Table 5. Ranks are subsequently determined according to the procedure outlined in Section 4.6.1. Additionally, Figure 18 in Appendix D shows the result for the large and highly imbalanced CRF and Bank datasets. We expect the latter type of data to be of major interest for reasons already indicated.


Table 9 Computational timings (in seconds), averaged across ten folds, using the parameter combination resulting in the highest validation set AUC-performance. First and second best performances are emphasized in boldface (third and fourth best are underlined). EE par is the time required for EE(S = 15) divided by 15.

             Mov G(p = 1)  Mov G(p = 25)  Mov Th(p = [])  Yahoo A(p = 1)
BL           0.032889      0.056697       0.558563        0.026922
OSR          0.055043      0.062802       0.99009         0.044421
SMOTE        0.218821      0.937057       3841482         0.057726
ADASYN       0.284688      1802399        5191265         0.087694
RUS          0.011431      0.025383       0.155224        0.007991
CL Knn       0.046599      0.599846       0.989914        0.037182
Far Knn      0.039887      0.80072        0.683023        0.027788
CBU          1034111       1060173        6822839         1692477
AB           0.169792      0.841443       3460246         0.139251
AC(R = 28)   0.471994      2996585        1086907         0.366555
AC(R = RL)   0.53376       1179542        6065177         0.209015
EE(S = 10)   0.117226      6.065145       1.17995         0.148973
EE(S = 15)   0.20474       7.173737       2.119991        0.180365
EE par       0.013649      0.478249       0.141333        0.012024

             Yahoo A(p = 25)  Yahoo G(p = 1)  Yahoo G(p = 25)  TaFeng(p = 1)
BL           0.092954         0.011915        0.044164         0.026728
OSR          0.027887         0.013241        0.047206         0.040919
SMOTE        1062686          0.056153        0.883698         0.219553
ADASYN       2050993          0.079073        1733367          0.306618
RUS          0.048471         0.003234        0.033423         0.002916
CL Knn       0.84391          0.025404        0.502515         0.092167
Far Knn      0.664124         0.026576        0.500206         0.080159
CBU          1569442          1287221         1355035          2467279
AB           0.445546         0.078777        0.169977         0.114619
AC(R = 28)   1034044          0.321723        0.515953         0.926178
AC(R = RL)   0.706215         0.226741        0.112949         0.610233
EE(S = 10)   1.026577         0.100331        1.527146         0.058052
EE(S = 15)   1.607596         0.077483        2.472582         0.10538
EE par       0.107173         0.005166        0.164839         0.007025

             TaFeng(p = 25)  Book(p = 1)  Book(p = 25)  LST(p = 1)
BL           0.032033        0.080035     0.318093      0.652045
OSR          0.032414        0.132927     0.092757      0.87152
SMOTE        5089283         3409418      1143444       4987705
ADASYN       8148419         3689661      1225441       6840083
RUS          0.020457        0.022713     0.031972      0.432839
CL Knn       1713731         0.400873     3711648       2508374
Far Knn      1539437         0.379086     3988552       2511037
CBU          2642686         4198663      4631987       []
AB           0.713265        0.61719      1238585       2466151
AC(R = 28)   1234647         1666131      2330635       1451671
AC(R = RL)   0.279047        0.860346     0.197053      123763
EE(S = 10)   2.484502        2.145747     7.177484      0.524066
EE(S = 15)   3.363971        2.480066     11.21945      0.784111
EE par       0.224265        0.165338     0.747963      0.052274


Table 9 Continued. Additionally, an extra column indicating average ranks is included (lower ranks are preferred).

             Adver(p = [])  Adver(p = 1)  CRF(p = [])  Bank(p = [])
BL           0.010953       0.002796      0.725911     7089334
OSR          0.012178       0.006166      3685813      1797481
SMOTE        0.123112       0.017764      5633862      []
ADASYN       0.183767       0.021728      5768669      []
RUS          0.012115       0.00204       0.147392     5247441
CL Knn       0.061324       0.005568      1106755      7373282
Far Knn      0.079078       0.007069      1110379      9759619
CBU          3378235        3236754       []           []
AB           0.069199       0.103518      1153196      8308618
AC(R = 28)   0.193092       0.068905      2047434      7170548
AC(R = RL)   0.107652       0.037963      1387174      1063466
EE(S = 10)   0.138485       0.085686      0.198656     24.95117
EE(S = 15)   0.185136       0.139121      0.285345     36.40107
EE par       0.012342       0.009275      0.019023     2.426738

             Average Rank [pos]
BL           2.94   [2]
OSR          4.19   [4]
SMOTE        9.59   [11]
ADASYN       10.91  [13]
RUS          1.38   [1]
CL Knn       6.5    [5]
Far Knn      6.56   [6]
CBU          14     [14]
AB           8.06   [7]
AC(R = 28)   10.81  [12]
AC(R = RL)   9.25   [9]
EE(S = 10)   8.25   [8]
EE(S = 15)   9.56   [10]
EE par       3      [3]

Timings are a substantial concern with respect to these large datasets. As can be observed, the EE-technique thrives for these big and highly imbalanced data, even in its non-parallel form.


[Plot omitted: scatter of average rank AUC (0–14) against average rank Time (0–18) for BL, OSR, SMOTE, ADASYN, RUS, Cl Knn, Far Knn, CBU, AB, AC(R = 28), AC(R = RL), EE(S = 10), EE(S = 15) and EE par.]

Fig. 5 Average rank AUC versus average rank Time (see Table 9) across all datasets from Table 2 (excluding Flickr and Kdd). Regarding the AUC rank, an extra row EE par is added, having the same AUC-value as EE(S = 15). Ranks for each dataset are subsequently obtained via the procedure outlined in Section 4.6.1. Points occurring in the upper-right region are preferred.

5 The effect of the chosen base learner

In this short section we investigate the influence of the chosen base learner. More specifically, we ask ourselves whether some base inducers can more easily cope with the imbalanced learning issue, and in doing so we verify the conclusion that EE is a suitable technique to elevate upon the baseline performance. We rely on several popular classification methods for behaviour data: two regularization based techniques (SVM, LR) and two heuristic types of approaches (Naive Bayes (NB), Behavioural Similarity (BeSim)). The next paragraphs briefly introduce these methods. Note that source codes and dataset results are provided in our on-line repository.

LR constructs a linear model that maximizes the likelihood of the observed data by imposing a logistic model for the parametric probability model P(Y|X). Ng and Jordan (2002) provide a thorough introduction on the topic27 and note that in its plain form LR can suffer from overfitting, especially when the input data is very high dimensional. Therefore, we resort to regularized logistic regression. We opt for L2-regularized LR, since L1-regularization corresponds with a natural way of imposing feature selection (Ng 2004). The LIBLINEAR toolbox (Fan et al 2008) is used to obtain these L2-regularized LR models.
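As a sketch of what this looks like in practice, using scikit-learn's LIBLINEAR-backed solver rather than the authors' toolchain (the data and parameter values below are made up for illustration):

```python
import numpy as np
from scipy.sparse import csr_matrix
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
# 200 instances x 1000 binary features, ~1% nonzero: sparse like behaviour data
X = csr_matrix(rng.random((200, 1000)) < 0.01, dtype=np.float64)
y = rng.integers(0, 2, size=200)

# penalty="l2" mirrors the regularized LR discussed in the text; C plays the
# same role as the SVM regularization parameter
clf = LogisticRegression(penalty="l2", C=1.0, solver="liblinear").fit(X, y)
scores = clf.decision_function(X)  # one ranking score per instance, usable for AUC
print(scores.shape)
```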

27 For an accessible introduction, see the chapter on "Generative and Discriminative Classifiers: Naive Bayes and Logistic Regression" provided at the following link: http://www.cs.cmu.edu/~tom/NewChapters.html


NB (Ng and Jordan 2002) relies on the use of Bayes' rule, which expresses the posterior P(Y|X) as the product of the likelihood P(X|Y) and the prior P(Y), divided by the evidence P(X). The naivety of the approach originates from assuming conditional independence of the features. We impose a multivariate event model regarding the distribution P(X|Y). Junque de Fortuny et al (2014a) provide an efficient NB implementation for dealing with large and sparse binary behaviour data.
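A sketch of the same idea with scikit-learn's `BernoulliNB` (a multivariate Bernoulli event model), on synthetic sparse binary data rather than the optimized implementation referenced in the text:

```python
import numpy as np
from scipy.sparse import csr_matrix
from sklearn.naive_bayes import BernoulliNB

rng = np.random.default_rng(1)
# Made-up sparse binary behaviour matrix: 300 instances x 500 features
X = csr_matrix(rng.random((300, 500)) < 0.02, dtype=np.float64)
y = rng.integers(0, 2, size=300)

# Each feature is modelled as present/absent, conditionally independent
# given the class; alpha=1.0 applies Laplace smoothing to the counts
nb = BernoulliNB(alpha=1.0).fit(X, y)
proba = nb.predict_proba(X)[:, 1]  # P(Y=1 | X), usable as a ranking score
print(proba.shape)
```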

The BeSim approach conceptually corresponds to projecting the bigraph to a unigraph using the procedure outlined in Section 3.2. Next, the weighted-vote relational neighbour classifier (Macskassy and Provost 2007) is used to infer unknown labels/scores through a weighted probability estimation using the known labels of neighbouring nodes. A very fast and scalable implementation of this approach (the SW-transformation) is presented in detail in Stankova et al (2015).
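The projection-plus-weighted-vote idea can be sketched as follows; this is a naive illustration on made-up data, not the optimized SW-transformation of Stankova et al (2015):

```python
import numpy as np
from scipy.sparse import csr_matrix

rng = np.random.default_rng(2)
# Made-up sparse binary behaviour matrices (instances x features)
X_train = csr_matrix(rng.random((100, 400)) < 0.03, dtype=np.float64)
X_test = csr_matrix(rng.random((20, 400)) < 0.03, dtype=np.float64)
y_train = rng.integers(0, 2, size=100).astype(np.float64)

# Unigraph projection: two instances are linked with weight equal to the
# number of behaviour features they share
W = (X_test @ X_train.T).toarray()
# Weighted-vote relational neighbour: normalized weighted sum of known labels
scores = (W @ y_train) / (W.sum(axis=1) + 1e-12)
print(scores.shape)  # one relational score per test instance
```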

In Table 10 we compare the performances of the baseline (BL) application of the methods with their EE-version (with S = 15, T = 20). It is quite striking to see the EE-version dominating the baseline version when looking at each method individually. This suggests that the imbalanced learning issue has an impact for each of the aforementioned techniques. Focussing on the regularization based approaches, we observe the baseline LR to be less influenced by the imbalance than the SVM method. Their EE-versions (EE SVM and EE LR) can boost the baseline performance in almost all cases and have the highest average ranks amongst all the proposed methods. As already emphasized, the regularization based approaches offer an added element of flexibility, in the sense that the strength of the underlying base learner can be controlled.

The heuristic approaches NB and BeSim do not offer this flexibility. Furthermore, it is not directly possible to integrate instance weights during the training phase of the base learner. During the boosting process, we therefore sample from the distribution Dt to generate (unweighted) examples and train a BeSim/NB learner next. The results in Table 10 indicate BL NB to be the worst performing algorithm amongst all methods tried. Due to the conditional independence assumption, one could regard this as a weak learner, which is therefore suitable to be used in a boosting process. Indeed, the EE NB approach is able to significantly improve upon its performance. We found this to be the slowest EE-version28 (e.g. taking more than 24 hours to process a single fold of the Bank dataset, versus several minutes for the other versions). The average rank estimate of EE NB is rather pessimistic, since we assigned the lowest performance across all methods in case of missing results. BeSim can be considered as a stronger learner compared to NB. Its possible boost in performance is typically lower than that of NB. Having said this, we found EE BeSim to improve upon BL BeSim for all datasets (except Adver(p = 1)).
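Because NB and BeSim cannot weight instances directly, each boosting round instead draws an unweighted sample whose composition mimics Dt; a toy sketch of this resampling step (the distribution values are made up):

```python
import numpy as np

rng = np.random.default_rng(0)
Dt = np.array([0.5, 0.25, 0.125, 0.125])  # current boosting distribution
# Unweighted bootstrap sample drawn according to Dt; an ordinary,
# weight-unaware learner can then be trained on the sampled instances
idx = rng.choice(len(Dt), size=20000, replace=True, p=Dt)
freq = np.bincount(idx, minlength=len(Dt)) / idx.size
print(np.round(freq, 2))  # empirical sampling frequencies track Dt
```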

28 This statement only relates to the EE-version. In the baseline case, the timings of the NB-version are roughly of the same order of magnitude as SVM and LR (though this heavily depends on the chosen regularization parameter C). With EE, each subset has a size limited to twice the minority class training size (a relatively low number compared to the size of the majority class). This works in favour of SVM/LR (e.g. SVM has a computational complexity of O(m²), with m the number of training instances). Also note that the NB-version is indeed optimized for sparse data. Internally, the NB-implementation writes a fairly large model file to a specified directory (sizes of 1 GB can be encountered there). This writing operation slows down the training stage (especially in case of EE, where S × T = 15 × 20 = 300 model files are constructed).


Table 10 Average tenfold AUC test set performance of the BL compared to its EE-version (S = 15, T = 20) for SVM, LR, BeSim and NB. In case the EE NB implementation crashed, we put a × in the table. If it is too slow (> 24 hours per fold), we place a []. Additionally, average ranks are assigned to each method. In case EE NB has missing results for a particular dataset, we ensure it has the lowest rank across all other algorithms.

           Mov G(p = 1)   Mov G(p = 25)  Mov Th(p = []) Yahoo A(p = 1)
BL SVM     71.6 (2.62)    81.41 (1.32)   79.77 (5.33)   56.49 (3.37)
EE SVM     76.12 (2.88)   85.13 (1.86)   86.43 (5.86)   59.74 (2.96)
BL LR      71.02 (2.09)   84.39 (1.84)   83.14 (4.17)   57.84 (2.39)
EE LR      76.69 (2.92)   85.03 (1.98)   86.3 (5.37)    59.79 (2.62)
BL BeSim   76.1 (3.58)    81.3 (2.92)    82.81 (6.6)    56.27 (2.73)
EE BeSim   76.31 (3.71)   81.37 (2.9)    85.02 (6.28)   57.7 (1.71)
BL NB      70.26 (5.84)   77.01 (2.54)   70.48 (10.14)  52.56 (2.09)
EE NB      75.93 (2.83)   85.56 (2.01)   86.91 (4.15)   57.55 (2.73)

           Yahoo A(p = 25) Yahoo G(p = 1) Yahoo G(p = 25) TaFeng(p = 1)
BL SVM     61.61 (2.48)    66.84 (3.66)   78.82 (1.39)    55.75 (1.6)
EE SVM     66.38 (3.16)    73.48 (2.32)   80.55 (1.55)    61.13 (1.83)
BL LR      66.27 (2.96)    69.82 (1.93)   80.45 (1.59)    58.91 (2.31)
EE LR      66.22 (3.28)    73.08 (2.14)   80.53 (1.56)    61.43 (2.32)
BL BeSim   64.54 (2.02)    68.89 (2.49)   79.55 (1.96)    57.89 (1.18)
EE BeSim   65.25 (2.23)    71.18 (2.91)   80.04 (1.85)    59.36 (1.47)
BL NB      65 (1.65)       63.33 (2.56)   78.89 (1.64)    54.61 (1.2)
EE NB      66.6 (2.79)     70.99 (2.88)   81.01 (1.3)     59.01 (1.84)

           TaFeng(p = 25)  Book(p = 1)    Book(p = 25)   LST(p = 1)
BL SVM     66.94 (1.34)    52.6 (1.29)    60.08 (0.71)   99.99 (0.01)
EE SVM     70.4 (1.3)      55.34 (1.28)   65.4 (0.51)    99.98 (0.01)
BL LR      69.24 (1.3)     55.34 (1.27)   63.84 (0.75)   99.99 (0.01)
EE LR      70.28 (1.28)    55.49 (1.49)   65.41 (0.63)   99.97 (0.02)
BL BeSim   67.49 (1.23)    55.19 (1.27)   63.7 (0.63)    99.99 (0.01)
EE BeSim   68 (1.21)       55.21 (1.15)   64.38 (0.42)   99.99 (0)
BL NB      65.21 (1.64)    52.93 (0.9)    59.75 (0.47)   98.69 (0.3)
EE NB      70.72 (1.15)    ×              63.46 (0.61)   99.92 (0.04)

           Adver(p = [])   Adver(p = 1)   CRF(p = [])    Bank(p = [])
BL SVM     96.37 (1.94)    91.18 (2.97)   64.36 (18.97)  66.82 (0.88)
EE SVM     97.63 (1.35)    93.3 (2.14)    86.35 (9.99)   71.54 (0.76)
BL LR      97.19 (1.44)    88.51 (1.93)   81.87 (19.63)  71.43 (0.72)
EE LR      97.57 (0.96)    93.02 (2.06)   86.84 (9.62)   71.77 (0.62)
BL BeSim   97.26 (1.12)    95.38 (1.35)   86.91 (9.36)   67.85 (0.67)
EE BeSim   97.38 (1.04)    93.83 (1.35)   87.02 (10.43)  70.41 (0.55)
BL NB      93.75 (1.9)     93.37 (1.9)    87.24 (9.38)   67.83 (0.63)
EE NB      94.04 (1.75)    ×              ×              []

           Flickr(p = 0.1) Kdd(p = 0.5)   Average Rank
BL SVM     74.92 (0.17)    74.53 (0.05)   6.44 [7]
EE SVM     79.86 (0.13)    80.98 (0.05)   2.39 [1]
BL LR      79.03 (0.11)    81.29 (0.04)   4.28 [4]
EE LR      79.85 (0.13)    80.75 (0.05)   2.61 [2]
BL BeSim   74.62 (0.13)    74.95 (0)      5.11 [6]
EE BeSim   76.4 (0.13)     77.55 (0.03)   3.61 [3]
BL NB      81.36 (0.1)     74.29 (0.05)   6.5 [8]
EE NB      []              []             5.06 [5]


6 Conclusions and future research directions

We investigated the effects of sampling and cost-sensitive learning techniques on the problem of learning from imbalanced behaviour data. This type of big data has fundamentally different properties and internal structure than traditional data and is characterized by sparseness and very large dimensions. This is a new research topic which enables benefits across a wide variety of application domains, such as fraud detection, targeted advertising, predictive policing, churn prediction, default prediction, etc. To enable future developments in this field, we provide an on-line repository containing implementations, datasets and results, available at the following link: http://www.applieddatamining.com/cms?q=software

Oversampling techniques need to be adapted in order to cope with this kind of data. We proposed new versions of the traditional SMOTE and ADASYN techniques, which differ in the way synthetic samples are generated and similarity computations are performed. Plain random oversampling (OSR) and the synthetic approaches are competitive methods in terms of AUC and do not seem to suffer from overfitting, as traditional studies report. Though the synthetic approaches achieve AUC performances comparable to OSR, they are impractical due to their computationally expensive procedures. The OSR method is very fast for small and medium-sized datasets, yet it becomes inappropriate for larger datasets because the training times of the underlying base learner increase drastically.
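The core of random oversampling can be sketched in a few lines. The function below is an illustrative sketch (the name, the `beta` parameter mirroring the β-ratios of the experiments, and the default generator are ours, not the paper's code): it duplicates randomly chosen minority rows until the minority count reaches `beta` times the majority count. Because only row indices are resampled, the same idea applies to sparse behaviour matrices.

```python
import numpy as np

def random_oversample(X, y, beta=1.0, rng=None):
    """OSR sketch: duplicate random minority rows (with replacement)
    until the minority count reaches beta * (majority count)."""
    rng = np.random.default_rng(rng)
    y = np.asarray(y)
    classes, counts = np.unique(y, return_counts=True)
    minority = classes[np.argmin(counts)]
    min_idx = np.flatnonzero(y == minority)
    target = int(beta * counts.max())            # desired minority size
    n_extra = max(target - len(min_idx), 0)
    extra = rng.choice(min_idx, size=n_extra, replace=True)
    keep = np.concatenate([np.arange(len(y)), extra])
    return X[keep], y[keep]
```

With `beta = 1` the returned sample is fully balanced; intermediate β-values reproduce the partial balancing evaluated in Table 11.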

With respect to undersampling methodologies, we employed a random undersampling technique (RUS) and Knn-based informed approaches. The latter type selects majority class instances that are closest (Cl Knn) or farthest (Far Knn) from the minority class examples and allows one to investigate the effect of noise/outliers on SVM performance. We observed that these odd instances have a severe performance-degrading effect. RUS and Far Knn approaches can significantly improve performance with respect to the baseline thanks to their ability to reduce noise/outlier levels and to put more emphasis on the minority class. Having said this, they usually perform worse than any of the oversampling approaches. This is in sheer contrast to the results obtained from traditional studies and is heavily related to the intrinsic nature of behaviour data. Timing-wise, RUS is the fastest method; the informed approaches are slower due to the required nearest neighbour computations. Additionally, a cluster-based undersampling (CBU) technique was proposed to alleviate the within-class imbalance. For low undersampling rates the performance of CBU is superior to plain RUS, whereas for high undersampling rates the opposite is true. Overall, no substantial gain in performance is obtained in comparison with a simple RUS approach. Furthermore, the method is the slowest among all techniques considered and relies on computationally expensive clustering heuristics.
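The informed "farthest" selection can be illustrated with a small sketch. Everything here is an assumption-laden stand-in for the paper's Far Knn procedure: we rank majority rows by their mean cosine similarity to the minority class and keep the `beta_u` fraction lying farthest away, which mimics the idea of discarding majority instances that sit near (or inside) the minority region.

```python
import numpy as np

def far_undersample(X, y, beta_u=0.5, minority=1):
    """'Far'-style informed undersampling sketch: keep the beta_u
    fraction of majority rows least similar (cosine) to the minority."""
    X = np.asarray(X, dtype=float)
    y = np.asarray(y)
    maj_idx = np.flatnonzero(y != minority)
    min_idx = np.flatnonzero(y == minority)
    unit = lambda M: M / np.maximum(np.linalg.norm(M, axis=1, keepdims=True), 1e-12)
    # mean cosine similarity of each majority row to all minority rows
    mean_sim = (unit(X[maj_idx]) @ unit(X[min_idx]).T).mean(axis=1)
    n_keep = max(int(round(beta_u * len(maj_idx))), 1)
    keep_maj = maj_idx[np.argsort(mean_sim)[:n_keep]]  # farthest rows
    keep = np.sort(np.concatenate([keep_maj, min_idx]))
    return X[keep], y[keep]
```

Replacing `[:n_keep]` by `[-n_keep:]` would give the "closest" variant, whose noise-amplifying behaviour the experiments expose.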

We studied the effects of boosting, where we made use of a very specific linear SVM and logistic regression combination that uses confidence-rated predictions instead of the more popular plain "discrete" boosting algorithm (with weak hypotheses taking values in {−1, +1}). Our experiments clearly indicated that the regularization constant C in the SVM formulation acts as a "weakness" indicator. Indeed, higher C-values produce stronger learners and should not be used in the boosting process, which matches the analysis of studies dealing with traditional data. Strong learners can

42 Jellis Vanhoeyveld David Martens

already quite accurately distinguish majority class from minority class behaviours. The boosting process will focus on the hard-to-learn instances, among which are noise/outliers, and this can result in overfitting behaviour. We also observed that boosting can usually outperform the plain baseline, indicating that a combination of weak learners can achieve a higher AUC performance than a single strong learner.
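The confidence-rated update at the heart of this scheme can be sketched generically. This is a minimal sketch in the Schapire-Singer style, not the paper's exact learner: each weak learner returns a real-valued scorer f_t, example weights are multiplied by exp(−y_i f_t(x_i)) for labels in {−1, +1}, and the ensemble score is the sum of the f_t. The caller supplies `fit_weak` (in the paper, a weighted linear SVM whose outputs are mapped to confidences via logistic regression); the trivial weighted-direction stump in the test below is purely illustrative.

```python
import numpy as np

def boost_confidence_rated(X, y, fit_weak, rounds=10):
    """Confidence-rated boosting sketch: fit_weak(X, y, w) -> scorer.
    Labels y must be in {-1, +1}; returns the summed ensemble scorer."""
    y = np.asarray(y, dtype=float)
    w = np.full(len(y), 1.0 / len(y))
    scorers = []
    for _ in range(rounds):
        f = fit_weak(X, y, w)
        w = w * np.exp(-y * f(X))   # down-weight well-classified examples
        w = w / w.sum()
        scorers.append(f)
    return lambda Xt: sum(f(Xt) for f in scorers)
```

A weak (small-C) learner keeps the individual f_t modest, so the weight distribution shifts gradually; a strong learner drives the first f_t's margins so large that later rounds fixate on noise/outliers.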

EasyEnsemble (EE) integrates random undersampling with a boosting process. By sampling several balanced subsets from the training data and combining these, we can emphasize the minority class, achieve noise/outlier reduction effects and simultaneously explore the majority class space efficiently. Our version differs from the original formulation because we employ a different type of underlying weak learner (an SVM and LR combination) that is fed to a confidence-rated boosting algorithm (instead of the discrete version). We observed superior AUC performance with respect to all of the aforementioned techniques. Furthermore, the method is very fast, since each subset can be trained in parallel and has a size that is only twice that of the minority class training set. We studied the effect of the number of subsets and noted that a limited amount (O(10)) is already sufficient to obtain satisfactory performances; increasing this number further yields only minor benefits.
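The subset-sampling backbone of EE can be sketched as follows. This is an illustrative skeleton under stated assumptions (minority label 1, a deliberately simple default centroid scorer standing in for the boosted SVM/LR learner of the paper): draw `n_subsets` balanced subsets, each consisting of all minority rows plus an equally sized random majority sample, fit a scorer on each, and average the test scores.

```python
import numpy as np

def easy_ensemble_scores(X, y, X_test, n_subsets=10, fit_score=None, rng=None):
    """EasyEnsemble-style sketch: average the scores of learners fit on
    n_subsets balanced (minority + sampled majority) subsets."""
    rng = np.random.default_rng(rng)
    X, X_test, y = np.asarray(X, float), np.asarray(X_test, float), np.asarray(y)
    min_idx, maj_idx = np.flatnonzero(y == 1), np.flatnonzero(y == 0)
    if fit_score is None:
        def fit_score(Xs, ys):           # toy stand-in for boosted SVM/LR
            d = Xs[ys == 1].mean(axis=0) - Xs[ys == 0].mean(axis=0)
            return lambda Xt: Xt @ d     # higher score = more minority-like
    total = np.zeros(len(X_test))
    for _ in range(n_subsets):           # each subset could be fit in parallel
        sub = np.concatenate([min_idx,
                              rng.choice(maj_idx, size=len(min_idx), replace=False)])
        total += fit_score(X[sub], y[sub])(X_test)
    return total / n_subsets
```

Each subset holds only 2 × (minority size) rows, which is why the per-subset fits are cheap and trivially parallelizable.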

Additionally, a statistical comparison of the proposed methods was conducted across a wide variety of datasets. Holm's significance test indicates that each of the proposed techniques (oversampling, undersampling except Cl Knn, cost-sensitive learning and boosting variants) can significantly outperform the plain baseline (each of the null hypotheses of equal performance is rejected with a confidence level of at least 91%). Comparison of the techniques with EE showed that the latter can significantly outperform all of them (each of the null hypotheses of equal performance is rejected with a confidence level of at least 75%). A timing-wise comparison showed EE to be the preferred method (after RUS) for large and highly imbalanced datasets. A parallel implementation would make EE the fastest method even for medium-sized datasets that show a high level of imbalance.
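Holm's step-down procedure used in these comparisons is short enough to sketch directly (a generic implementation of the procedure, not the paper's analysis script): sort the m p-values ascending, compare the k-th smallest against α/(m − k), and stop rejecting at the first failure.

```python
def holm_reject(p_values, alpha=0.05):
    """Holm's step-down multiple-testing procedure: returns a
    reject/accept flag per hypothesis, in the original input order."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    reject = [False] * m
    for k, i in enumerate(order):
        if p_values[i] <= alpha / (m - k):
            reject[i] = True
        else:
            break                 # all larger p-values are accepted too
    return reject
```

The step-down thresholds make Holm uniformly more powerful than a plain Bonferroni correction while still controlling the family-wise error rate.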

The influence of the chosen base learner was investigated by comparing the baseline to its EE-version for the SVM, LR, NB and BeSim inducers. Each of the EE-versions is able to improve upon the baseline performance for nearly all datasets, where base learners of the regularization type are superior because of their ability to control the strength of the learner.

Though we conducted an extensive benchmark study, there are still plenty of future research directions. First, and most important, there is an urgent need to develop a more general theoretical framework to completely understand the effect of imbalance and why certain methods outperform others. Secondly, other types of mechanisms could be investigated in this setting: advanced sampling (Guo et al 2008) set-ups perform sampling based on preliminary classifications; one-class learning is a recognition-based methodology where a discriminative boundary is learned around the training examples of a single class alone, and Raskutti and Kowalczyk (2004) demonstrated the superiority of recognition-based one-class SVMs over discriminative two-class SVMs; other types of ensemble learning strategies (e.g. bagging and stacking) could be investigated; Batista et al (2004) note that combining over- and undersampling can be beneficial when data are highly imbalanced or show very few minority class instances; the ideas behind Extended Nearest Neighbour (Tang and


He 2015) could guide us in the sampling process. As opposed to Knn, "it considers not only who are the nearest neighbours of a certain sample, but also who consider the particular sample as their nearest neighbour" (Tang and He 2015). The method assigns a class membership to an unknown sample that maximizes the intra-class coherence. Finally, more advanced techniques exist that can determine K (the number of nearest neighbours) faster or with slightly better performance.

Another issue relates to the lack of comprehensibility of the EasyEnsemble technique in its present form. Because we currently integrate logistic regression in the boosting process, the final ensemble classifier is non-linear and therefore incomprehensible. It might be interesting to investigate the numerical boosting technique proposed in Schapire and Singer (1999), where each weak learner is allowed to output real values (instead of values in {−1, +1}). In that case, we would be able to use a plain linear SVM as a weak learner, without the logistic regression component. The final ensemble would then be a linear model, which is surely easier to comprehend. Also, the instance-based approach of Martens and Provost (2014) could be used to explain individual predictions of the non-linear EE-technique.

A Oversampling experiments

Table 11 Oversampling experiments. Results show the average tenfold AUC test set performance with respect to increasing β-values (β = [β1, β2, β3, β4] = [0, 1/3, 2/3, 1]). Additionally, standard deviations are shown between brackets. Optimal performances for each method are highlighted in boldface. See Section 4.2 for parameter settings.

Mov G(p = 1)
          β1             β2             β3             β4
OSR       71.6 (2.62)    74.37 (2.04)   73.6 (1.84)    74.73 (2.45)
SMOTE     71.6 (2.62)    75.08 (2.18)   76.02 (2.14)   76.48 (2.3)
ADASYN    71.6 (2.62)    75.16 (1.92)   75.93 (2.08)   76.47 (2.29)

Mov G(p = 25)
          β1             β2             β3             β4
OSR       81.41 (1.32)   83.49 (1.81)   83.84 (1.96)   83.91 (2.04)
SMOTE     81.41 (1.32)   83.32 (1.97)   83.59 (2.04)   83.76 (2.11)
ADASYN    81.41 (1.32)   83.61 (1.82)   84.02 (1.97)   83.69 (1.96)

Mov Th(p = [])
          β1             β2             β3             β4
OSR       79.77 (5.33)   85.3 (4.66)    83.16 (4.5)    84.59 (5.69)
SMOTE     79.77 (5.33)   84.18 (6.51)   85.58 (5.97)   84.33 (5.75)
ADASYN    79.77 (5.33)   84.11 (6.77)   85.86 (6.06)   85.36 (5.13)

Yahoo A(p = 1)
          β1             β2             β3             β4
OSR       55.92 (2.97)   58.66 (3.27)   59.99 (2.28)   59.74 (1.78)
SMOTE     55.92 (2.97)   59.76 (2.62)   59.74 (2.67)   59.43 (2.4)
ADASYN    55.92 (2.97)   59.54 (2.53)   59.55 (2.94)   59.56 (2.22)

Yahoo A(p = 25)
          β1             β2             β3             β4
OSR       61.68 (2.42)   64.19 (3.17)   65.08 (3.26)   64.67 (2.1)
SMOTE     61.68 (2.42)   65.46 (3.63)   65.33 (3.23)   64.52 (2.98)
ADASYN    61.68 (2.42)   65.04 (3.74)   65.41 (3.47)   64.4 (2.21)

Yahoo G(p = 1)
          β1             β2             β3             β4
OSR       66.84 (3.66)   72.18 (2.36)   73.11 (2.7)    72.49 (3.41)
SMOTE     66.84 (3.66)   72.65 (2.85)   73.27 (3.36)   73.37 (3.56)
ADASYN    66.84 (3.66)   72.87 (2.83)   73.18 (3.2)    73.39 (3.59)

Yahoo G(p = 25)
          β1             β2             β3             β4
OSR       78.82 (1.39)   78.78 (2.02)   78.74 (1.86)   78.35 (1.47)
SMOTE     78.82 (1.39)   79.23 (1.57)   79.1 (1.2)     79.03 (1.89)
ADASYN    78.82 (1.39)   79.12 (1.43)   79.23 (1.37)   79.51 (2.01)

TaFeng(p = 1)
          β1             β2             β3             β4
OSR       55.75 (1.6)    59.23 (1.96)   60 (1.68)      61.04 (2.36)
SMOTE     55.75 (1.6)    60.26 (1.95)   61.49 (1.8)    61.13 (1.52)
ADASYN    55.75 (1.6)    60.26 (1.9)    61.44 (1.85)   61.16 (1.5)

TaFeng(p = 25)
          β1             β2             β3             β4
OSR       66.94 (1.34)   68.84 (1.21)   66.99 (1.42)   67.7 (1.41)
SMOTE     66.94 (1.34)   68.47 (1.5)    67.07 (1.15)   66.65 (0.81)
ADASYN    66.94 (1.34)   68.62 (1.38)   67.85 (1.6)    66.91 (1.39)

Book(p = 1)
          β1             β2             β3             β4
OSR       52.6 (1.29)    53.61 (0.94)   55.41 (1.75)   55.87 (1.44)
SMOTE     52.6 (1.29)    54.77 (0.99)   54.91 (0.8)    54.36 (0.98)
ADASYN    52.6 (1.29)    54.86 (1.13)   55.06 (0.73)   54.54 (0.92)

Book(p = 25)
          β1             β2             β3             β4
OSR       60.08 (0.71)   61.05 (0.94)   61.82 (1.12)   64.62 (0.57)
SMOTE     60.08 (0.71)   62.6 (0.73)    60.95 (0.68)   63 (0.8)
ADASYN    60.08 (0.71)   62.33 (0.73)   60.77 (0.85)   63.04 (0.58)

LST(p = 1)
          β1             β2             β3             β4
OSR       99.99 (0.01)   99.99 (0.01)   99.99 (0.01)   99.99 (0.01)
SMOTE     99.99 (0.01)   99.99 (0.01)   99.99 (0.01)   99.99 (0.01)
ADASYN    99.99 (0.01)   99.99 (0.01)   99.99 (0.01)   99.99 (0.01)

Adver(p = [])
          β1             β2             β3             β4
OSR       96.61 (1.82)   97.31 (1.65)   97.07 (1.84)   97.07 (1.79)
SMOTE     96.61 (1.82)   96.91 (1.66)   97.19 (1.65)   97.07 (1.91)
ADASYN    96.61 (1.82)   97.1 (1.7)     97.08 (1.87)   97.07 (1.88)

Adver(p = 1)
          β1             β2             β3             β4
OSR       90.93 (3.02)   91.27 (3.03)   92.66 (2.82)   93.29 (1.97)
SMOTE     90.93 (3.02)   92.51 (2.03)   92.96 (2.14)   93.53 (1.81)
ADASYN    90.93 (3.02)   92.22 (2.33)   92.7 (2.36)    93.88 (1.73)

CRF(p = [])
          β1             β2             β3             β4
OSR       64.06 (16.43)  80.82 (12.94)  81.28 (12.27)  81.91 (11.28)
SMOTE     64.06 (16.43)  78.64 (16.86)  82.52 (13.74)  79.32 (16.26)
ADASYN    64.06 (16.43)  78.95 (16.72)  81.19 (16.32)  79.31 (16.19)

Bank(p = [])
          β1             β2             β3             β4
OSR       66.82 (0.88)   70.1 (0.74)    71.39 (0.8)    71.47 (0.8)
SMOTE
ADASYN


B Undersampling experiments

Table 12 Undersampling experiments. Results show the average tenfold AUC test set performance with respect to increasing βu-values (βu = [βu1, βu2, βu3, βu4, βu5] = [0, 1/4, 1/2, 3/4, 1]). Optimal performances (boldface) are highlighted for each imbalance ratio. Additionally, standard deviations are shown between brackets. CL K represents the "Closest Knn" technique; CL T represents the "Closest tot sim" technique (similar for Far K and Far T; see Section 3.2). Parameter settings are given in Section 4.2.

Mov G(p = 1)
       βu1           βu2           βu3           βu4           βu5
RUS    71.6 (2.6)    71.83 (2.6)   72.54 (2.5)   72.39 (3.1)   70.61 (3.5)
Cl K   71.6 (2.6)    71.4 (2)      70.96 (1.9)   70.43 (2.4)   69.05 (4.1)
CL T   71.6 (2.6)    70.28 (2.5)   66.74 (2)     66.8 (2.1)    68.18 (3.6)
Far K  71.6 (2.6)    72.36 (2.7)   71.26 (3.4)   66.57 (5.2)   53.5 (3.5)
Far T  71.6 (2.6)    72.22 (2.8)   71.63 (3.6)   64.28 (5.3)   50.88 (4.4)
CBU    72.55 (2.6)   73.28 (2.6)   73.12 (2.6)   73.84 (2.5)   73 (3.1)

Mov G(p = 25)
       βu1           βu2           βu3           βu4           βu5
RUS    81.41 (1.3)   81.36 (1.3)   81.78 (1.7)   82.05 (1.7)   81.6 (2.1)
Cl K   81.41 (1.3)   80.86 (1.2)   80.95 (1.6)   79.73 (2.3)   77.95 (2.3)
CL T   81.41 (1.3)   79.9 (1.2)    78.21 (1.4)   77.87 (1.5)   77.76 (2.3)
Far K  81.41 (1.3)   80.9 (1.5)    78.17 (1.8)   74.25 (2.4)   69.79 (3.2)
Far T  81.41 (1.3)   80.86 (1.5)   77.2 (2.4)    71.16 (2.7)   62.4 (2.8)
CBU    81.53 (1.4)   81.64 (1.3)   81.29 (1.6)   81.28 (2.1)   80.34 (2.7)

Mov Th(p = [])
       βu1           βu2           βu3           βu4           βu5
RUS    79.77 (5.3)   80.32 (5.8)   81.57 (5.5)   81.86 (6.6)   81.26 (6.2)
Cl K   79.77 (5.3)   79.25 (4.5)   78.07 (5)     76.25 (6.5)   62.46 (8.5)
CL T   79.77 (5.3)   78.4 (4.4)    72.41 (3.5)   64.66 (4.5)   60.37 (7.3)
Far K  79.77 (5.3)   84.54 (5)     83.64 (6.4)   80.02 (7.3)   56.82 (10.3)
Far T  79.77 (5.3)   85.03 (5.7)   82.68 (6.8)   75.61 (9.2)   56.77 (10.9)
CBU    80.11 (5.8)   81.17 (6)     81.08 (6.5)   84.17 (5.1)   80.96 (6.9)

Yahoo A(p = 1)
       βu1           βu2           βu3           βu4           βu5
RUS    55.92 (3)     55.57 (3.4)   56.44 (3)     55.83 (3.4)   56.37 (3.3)
Cl K   55.92 (3)     55.67 (2.4)   53.12 (2)     50.57 (1.8)   53.79 (3.5)
CL T   55.92 (3)     55.69 (2.1)   53.35 (2.2)   50.31 (2.2)   52.35 (3.3)
Far K  55.92 (3)     57.35 (2.2)   56.92 (1.1)   56.95 (2.3)   51.18 (2)
Far T  55.92 (3)     56.93 (2.4)   54.74 (1.9)   57.01 (1.8)   51.18 (2)
CBU    58.21 (2.6)   58.45 (3.3)   58.31 (3.5)   58.39 (3.5)   56.09 (2.6)

Yahoo A(p = 25)
       βu1           βu2           βu3           βu4           βu5
RUS    61.68 (2.4)   62.9 (2.9)    63.62 (3.6)   63.75 (3.1)   63.19 (1.9)
Cl K   61.68 (2.4)   61.14 (2.1)   57.62 (1.6)   54.02 (1.8)   51.48 (1.4)
CL T   61.68 (2.4)   60.89 (2.8)   58.11 (1.4)   54.4 (2.1)    51.76 (1.4)
Far K  61.68 (2.4)   63.96 (3)     62.62 (2.2)   59.61 (1.5)   56.25 (1.6)
Far T  61.68 (2.4)   63.71 (2.4)   59.72 (1.6)   57.27 (1.1)   54.47 (1.1)
CBU    62.46 (2.6)   61.85 (1.4)   61.78 (2.2)   59.94 (3)     60.1 (4)

Yahoo G(p = 1)
       βu1           βu2           βu3           βu4           βu5
RUS    66.84 (3.7)   67.85 (3.2)   68.36 (3.2)   68.23 (4)     69.9 (4.2)
Cl K   66.84 (3.7)   66.71 (2.8)   64.3 (3.6)    61.98 (3.9)   61.15 (1.9)
CL T   66.84 (3.7)   65.79 (2.7)   63.55 (3.3)   59.21 (3.5)   61.08 (2.4)
Far K  66.84 (3.7)   66.76 (4.1)   63.84 (3.4)   65.16 (2)     48.5 (2.9)
Far T  66.84 (3.7)   66.95 (4.1)   63.48 (2.9)   65.16 (2)     48.48 (2.9)
CBU    69.68 (4.1)   70.59 (3.2)   70.64 (3.7)   70.2 (2.9)    63.35 (3.6)

Yahoo G(p = 25)
       βu1           βu2           βu3           βu4           βu5
RUS    78.82 (1.4)   78.91 (1.6)   78.97 (1.6)   78.61 (1.6)   77.82 (2.1)
Cl K   78.82 (1.4)   77.26 (1.5)   72.52 (1.5)   67.86 (2)     65.07 (2.7)
CL T   78.82 (1.4)   76.83 (1)     71.99 (1.8)   67.15 (2.3)   61.1 (2.7)
Far K  78.82 (1.4)   78.26 (2.2)   74.69 (2.7)   67.22 (2.1)   60.72 (2.3)
Far T  78.82 (1.4)   77.68 (2.6)   72.44 (3)     64.94 (2.4)   59.6 (2)
CBU    75.25 (3.2)   75.22 (2.4)   74.69 (2.3)   73.07 (2.4)   70.69 (2.4)

TaFeng(p = 1)
       βu1           βu2           βu3           βu4           βu5
RUS    55.75 (1.6)   56.1 (1.6)    56.26 (1.7)   57.23 (1.7)   59.25 (2.2)
Cl K   55.75 (1.6)   55.68 (1.6)   55.58 (1.5)   55.08 (1.1)   51.05 (1.5)
CL T   55.75 (1.6)   55.67 (1.6)   54.47 (1.6)   47.53 (1.6)   49.3 (1.1)
Far K  55.75 (1.6)   58.99 (1.2)   59.47 (1.1)   60.04 (1.2)   56.31 (1)
Far T  55.75 (1.6)   58.92 (1.3)   59.25 (1.3)   58.58 (1.1)   56.31 (1)
CBU    57.8 (1)      58.47 (1.1)   58.15 (0.9)   58.87 (1.4)   57.65 (1.6)

TaFeng(p = 25)
       βu1           βu2           βu3           βu4           βu5
RUS    66.94 (1.3)   67.44 (1.3)   68.1 (1.4)    68.27 (1.4)   66.13 (1.2)
Cl K   66.94 (1.3)   66.13 (1.4)   63.39 (1.2)   59.83 (1.3)   56.94 (0.7)
CL T   66.94 (1.3)   66.38 (1.5)   62.89 (1.6)   57.46 (1.3)   54.56 (1.3)
Far K  66.94 (1.3)   68.06 (1.4)   66.43 (1.6)   64.46 (1.5)   63.35 (1.3)
Far T  66.94 (1.3)   64.31 (1.1)   62.69 (1)     61.27 (1.1)   59.03 (1)
CBU    64.81 (1.2)   64.15 (1.1)   64.13 (1.2)   63.88 (0.8)   63.46 (0.8)

Book(p = 1)
       βu1           βu2           βu3           βu4           βu5
RUS    52.6 (1.3)    52.79 (0.9)   53.46 (0.8)   53.89 (0.9)   54.05 (0.9)
Cl K   52.6 (1.3)    52.56 (1.2)   52.52 (1.3)   52.39 (1.1)   53.09 (1.1)
CL T   52.6 (1.3)    52.56 (1.2)   52.52 (1.3)   52.39 (1.1)   53.05 (0.7)
Far K  52.6 (1.3)    55.21 (1.2)   56.21 (1.8)   56.14 (1.2)   53.06 (1)
Far T  52.6 (1.3)    55.21 (1.2)   56.21 (1.8)   56.14 (1.2)   53.06 (1)
CBU    54.28 (0.9)   53.77 (1)     53.33 (1.1)   53.34 (0.9)   52.84 (0.8)

Book(p = 25)
       βu1           βu2           βu3           βu4           βu5
RUS    60.08 (0.7)   60.13 (0.6)   60.4 (0.8)    60.33 (0.8)   63.28 (0.8)
Cl K   60.08 (0.7)   59.96 (0.7)   60.13 (0.8)   59.96 (1)     59.28 (0.7)
CL T   60.08 (0.7)   59.96 (0.7)   60.13 (0.8)   60.29 (0.4)   54.5 (0.9)
Far K  60.08 (0.7)   63.29 (1)     64.19 (0.8)   57.3 (1.1)    55.66 (1.1)
Far T  60.08 (0.7)   62.14 (0.5)   58.27 (0.6)   56.37 (1)     55.66 (1.1)
CBU    54.82 (0.9)   54.67 (0.9)   54.71 (0.9)   54.66 (1)     54.78 (0.9)

LST(p = 1)
       βu1           βu2           βu3           βu4           βu5
RUS    99.99 (0)     99.99 (0)     99.99 (0)     99.98 (0)     99.99 (0)
Cl K   99.99 (0)     99.99 (0)     99.99 (0)     99.99 (0)     99.99 (0)
CL T   99.99 (0)     99.99 (0)     99.99 (0)     99.99 (0)     99.98 (0)
Far K  99.99 (0)     99.98 (0)     99.98 (0)     99.98 (0)     99.98 (0)
Far T  99.99 (0)     99.98 (0)     99.98 (0)     99.98 (0)     99.98 (0)
CBU    []            []            []            []            []

Adver(p = [])
       βu1           βu2           βu3           βu4           βu5
RUS    96.61 (1.8)   96.32 (1.8)   96.63 (1.4)   97.12 (2.1)   96.22 (1.6)
Cl K   96.61 (1.8)   96.44 (1.5)   96.14 (1.5)   96.04 (2)     94.8 (2.5)
CL T   96.61 (1.8)   95.87 (2.1)   94.32 (1.9)   93.01 (2.2)   90.72 (2.3)
Far K  96.61 (1.8)   96.53 (1.4)   95.76 (2)     94.39 (1.8)   90.49 (3.1)
Far T  96.61 (1.8)   96.54 (1.5)   95.67 (1.9)   94.54 (1.8)   89.3 (2.8)
CBU    96.85 (2.3)   96.85 (2.3)   97.05 (1.5)   96.6 (1.6)    96.06 (2.1)

Adver(p = 1)
       βu1           βu2           βu3           βu4           βu5
RUS    90.93 (3)     91.53 (3.1)   92.37 (3.4)   91.9 (2.9)    91.93 (2.2)
Cl K   90.93 (3)     90.64 (3)     89.87 (3.9)   90.21 (3.6)   89.18 (2)
CL T   90.93 (3)     89.7 (3.5)    88.55 (3.4)   85.76 (3.3)   88.2 (2.3)
Far K  90.93 (3)     93.8 (2.3)    92.4 (2.6)    88.73 (3.4)   85.51 (4)
Far T  90.93 (3)     93.62 (2.4)   93.2 (2.2)    88.41 (3.6)   85.51 (4)
CBU    93.22 (2.4)   93.76 (2.5)   93.89 (2.6)   93.52 (2.7)   91.27 (2)

CRF(p = [])
       βu1           βu2           βu3           βu4           βu5
RUS    64.06 (16.4)  63.28 (15.9)  67.98 (17.4)  66.95 (21.9)  87.73 (8.8)
Cl K   64.06 (16.4)  62.44 (16.6)  62.34 (16.9)  71.37 (13.8)  78.22 (17.7)
CL T   64.06 (16.4)  62.44 (16.6)  62.34 (16.9)  71.37 (13.8)  62.67 (22.9)
Far K  64.06 (16.4)  83.8 (14.2)   83.93 (14.8)  84.49 (13.7)  86.11 (9.7)
Far T  64.06 (16.4)  83.8 (14.2)   83.93 (14.8)  84.49 (13.7)  86.11 (9.7)
CBU    []            []            []            []            []

Bank(p = [])
       βu1           βu2           βu3           βu4           βu5
RUS    66.82 (0.9)   67.02 (0.9)   67.37 (0.8)   67.99 (0.6)   69.5 (1)
Cl K   66.82 (0.9)   66.17 (0.7)   65.24 (0.6)   64.86 (0.6)   58.53 (1.1)
CL T   66.82 (0.9)   64.92 (1.1)   60.69 (0.9)   56.33 (0.8)   52.87 (0.7)
Far K  66.82 (0.9)   66.95 (0.6)   66.19 (0.6)   64.42 (0.6)   58.25 (1.1)
Far T  66.82 (0.9)   67.16 (0.6)   64.2 (0.8)    59.67 (1)     58.25 (1.1)
CBU    []            []            []            []            []


C Boosting experiments

This section presents the results of each of the boosting experiments from Section 4.5 on the remaining data sources. Each of the figures below depicts the average tenfold AUC performance on test data (with µ = 100) with respect to the number of boosting iterations for (left) Adaboost (AB), AdaCost (AC) and EasyEnsemble (EE) with C chosen according to the highest validation set AUC performance (over all possible boosting rounds), and (right) AB and EE (with S = 15) with varying C-levels.

[Figure: panels (a) and (b), tenfold AUC test performance versus the number of boosting rounds T (0-30), as described in the preamble of this appendix]
Fig 6 Mov G(p = 1) dataset

[Figure: panels (a) and (b), tenfold AUC test performance versus the number of boosting rounds T (0-30), as described in the preamble of this appendix]
Fig 7 Mov Th(p = []) dataset


[Figure: panels (a) and (b), tenfold AUC test performance versus the number of boosting rounds T (0-30), as described in the preamble of this appendix]
Fig 8 Yahoo A(p = 1) dataset

[Figure: panels (a) and (b), tenfold AUC test performance versus the number of boosting rounds T (0-30), as described in the preamble of this appendix]
Fig 9 Yahoo A(p = 25) dataset

[Figure: panels (a) and (b), tenfold AUC test performance versus the number of boosting rounds T (0-30), as described in the preamble of this appendix]
Fig 10 Yahoo G(p = 1) dataset


[Figure: panels (a) and (b), tenfold AUC test performance versus the number of boosting rounds T (0-30), as described in the preamble of this appendix]
Fig 11 Yahoo G(p = 25) dataset

[Figure: panels (a) and (b), tenfold AUC test performance versus the number of boosting rounds T (0-30), as described in the preamble of this appendix]
Fig 12 TaFeng(p = 1) dataset

[Figure: panels (a) and (b), tenfold AUC test performance versus the number of boosting rounds T (0-30), as described in the preamble of this appendix]
Fig 13 Book(p = 1) dataset


[Figure: panels (a) and (b), tenfold AUC test performance versus the number of boosting rounds T (0-30), as described in the preamble of this appendix]
Fig 14 LST(p = 1) dataset

[Figure: panels (a) and (b), tenfold AUC test performance versus the number of boosting rounds T (0-30), as described in the preamble of this appendix]
Fig 15 Adver(p = []) dataset

[Figure: panels (a) and (b), tenfold AUC test performance versus the number of boosting rounds T (0-30), as described in the preamble of this appendix]
Fig 16 Adver(p = 1) dataset


[Figure: panels (a) and (b), tenfold AUC test performance versus the number of boosting rounds T (0-30), as described in the preamble of this appendix]
Fig 17 CRF(p = []) dataset

D Final Comparison

[Figure: scatter of average rank AUC (x-axis, 0-14) versus average rank Time (y-axis, 0-18); one marker per method: BL, OSR, SMOTE, ADASYN, RUS, Cl Knn, Far Knn, CBU, AB, AC variants (R = 2, 8, RL), EE(S = 10), EE(S = 15) and EE par]
Fig 18 Average rank AUC versus average rank Time (see Table 9) across the large datasets from Table 2 (CRF and Bank). Regarding the former, an extra row EE par is added, having the same AUC-value as EE(S = 15). Ranks for each dataset are subsequently obtained via the procedure outlined in Section 4.6.1. Points occurring in the upper-right region are preferred.


References

Akbani R Kwek S Japkowicz N (2004) Applying support vector machines to imbalanced datasetsIn Machine Learning ECML 2004 15th European Conference on Machine Learning Pisa ItalySeptember 20-24 2004 Proceedings Springer Berlin Heidelberg Berlin Heidelberg pp 39ndash50 DOI101007978-3-540-30115-8 7

Ali A Shamsuddin SM Ralescu AL (2015) Classification with class imbalance problem A review Inter-national Journal of Advances in Soft Computing and its Applications 7(3)176ndash204

Alzahrani T Horadam KJ (2016) Community detection in bipartite networks Algorithms and case studiesIn Complex Systems and Networks Dynamics Controls and Applications Springer Berlin Heidel-berg Berlin Heidelberg pp 25ndash50 DOI 101007978-3-662-47824-0 2

Bachner J (2013) Predictive policing Preventing crime with data and analytics IBM Center for the Busi-ness of Government

Baesens B Van Gestel T Viaene S Stepanova M Suykens J Vanthienen J (2003) Benchmarking state-of-the-art classification algorithms for credit scoring Journal of the Operational Research Society54(6)627ndash635 DOI 101057palgravejors2601545

Barandela R Snchez J Garca V Rangel E (2003) Strategies for learning in class imbalance problemsPattern Recognition 36(3)849 ndash 851 DOI httpsdoiorg101016S0031-3203(02)00257-1

Barber MJ (2007) Modularity and community detection in bipartite networks Physical Review E76066102 DOI 101103PhysRevE76066102

Barua S Islam MM Yao X Murase K (2014) MWMOTEndashmajority weighted minority oversamplingtechnique for imbalanced data set learning IEEE Transactions on Knowledge and Data Engineer-ing 26(2)405ndash425 DOI 101109TKDE2012232

Batista GEAPA Prati RC Monard MC (2004) A study of the behavior of several methods for balancingmachine learning training data SIGKDD Explor Newsl 6(1)20ndash29 DOI 10114510077301007735

Beckett SJ (2016) Improved community detection in weighted bipartite networks Royal Society OpenScience 3(1) DOI 101098rsos140536

Bekkar M Djemaa HK Alitouche TA (2013) Evaluation measures for models assessment over imbalanceddata sets Journal of Information Engineering and Applications 3(10)27ndash38

Bhattacharyya S Jha S Tharakunnel K Westland JC (2011) Data mining for credit card fraud A compara-tive study Decision Support Systems 50(3)602 ndash 613 DOI httpsdoiorg101016jdss201008008

Blondel VD Guillaume JL Lambiotte R Lefebvre E (2008) Fast unfolding of communities in large net-works Journal of Statistical Mechanics Theory and Experiment 2008(10)P10008

Breiman L Friedman J Stone CJ Olshen RA (1984) Classification and regression trees Taylor amp FrancisBrozovsky L Petricek V (2007) Recommender system for online dating service In Proceedings of

Znalosti 2007 Conference VSB OstravaCha M Mislove A Gummadi KP (2009) A measurement-driven analysis of information propagation in

the Flickr social network In Proceedings of the 18th International Conference on World Wide WebACM New York NY USA WWW rsquo09 pp 721ndash730 DOI 10114515267091526806

Chawla NV (2005) Data mining for imbalanced datasets An overview In Data mining and knowledgediscovery handbook Springer US Boston MA pp 853ndash867

Chawla NV Bowyer KW Hall LO Kegelmeyer WP (2002) SMOTE synthetic minority over-samplingtechnique Journal of artificial intelligence research 16321ndash357

Chawla NV Lazarevic A Hall LO Bowyer KW (2003) Smoteboost Improving prediction of the minorityclass in boosting In Knowledge Discovery in Databases PKDD 2003 Springer Berlin HeidelbergBerlin Heidelberg pp 107ndash119

Chawla NV Japkowicz N Kotcz A (2004) Editorial Special issue on learning from imbalanced data setsSIGKDD Explor Newsl 6(1)1ndash6 DOI 10114510077301007733

Chen M Mao S Liu Y (2014) Big data A survey Mobile Networks and Applications 19(2)171ndash209DOI 101007s11036-013-0489-0

Chyi YM (2003) Classification analysis techniques for skewed class distribution problems Master thesisDepartment of Information management National Sun Yat-Sen University

Demsar J (2006) Statistical comparisons of classifiers over multiple data sets Journal of Machine LearningResearch 7(Jan)1ndash30

Dhillon IS (2001) Co-clustering documents and words using bipartite spectral graph partitioning In Pro-ceedings of the Seventh ACM SIGKDD International Conference on Knowledge Discovery and DataMining ACM New York NY USA KDD rsquo01 pp 269ndash274 DOI 101145502512502550

54 Jellis Vanhoeyveld David Martens

Drummond C Holte RC (2003) C45 class imbalance and cost sensitivity why under-sampling beatsover-sampling In Proceedings of the ICML rsquo03 Workshop on Learning from Imbalanced Datasets

Fan RE Chang KW Hsieh CJ Wang XR Lin CJ (2008) LIBLINEAR A library for large linear classifi-cation Journal of Machine Learning Research 91871ndash1874

Fan W Stolfo SJ Zhang J Chan PK (1999) AdaCost Misclassification cost-sensitive boosting In Pro-ceedings of the Sixteenth International Conference on Machine Learning Morgan Kaufmann Pub-lishers Inc San Francisco CA USA ICML rsquo99 pp 97ndash105

Fawcett T (2006) An introduction to ROC analysis Pattern Recognition Letters 27(8)861 ndash 874 DOIhttpsdoiorg101016jpatrec200510010

Finch H (2005) Comparison of distance measures in cluster analysis with dichotomous data Journal ofData Science 3(1)85ndash100

Fortunato S (2010) Community detection in graphs Physics Reports 486(3 5)75 ndash 174 DOI httpsdoiorg101016jphysrep200911002

Junque de Fortuny E Martens D Provost F (2014a) Predictive modeling with big data is bigger reallybetter Big Data 1(4)215ndash226 DOI 101089big20130037

Junque de Fortuny E Stankova M Moeyersoms J Minnaert B Provost F Martens D (2014b) Corporateresidence fraud detection In Proceedings of the 20th ACM SIGKDD International Conference onKnowledge Discovery and Data Mining ACM New York NY USA KDD rsquo14 pp 1650ndash1659 DOI10114526233302623333

Frasca M Bertoni A Re M Valentini G (2013) A neural network algorithm for semi-supervised node labellearning from unbalanced data Neural Networks 4384 ndash 98 DOI httpsdoiorg101016jneunet201301021

Friedman M (1937) The use of ranks to avoid the assumption of normality implicit in the analysis ofvariance Journal of the american statistical association 32(200)675ndash701

Garcıa E Lozano F (2007) Boosting support vector machines In Machine Learning and Data Mining inPattern Recognition 5th International Conference MLDM 2007 Leipzig Germany July 18-20 PostProceedings IBaI publishing pp 153ndash167

Goldstein M Uchida S (2016) A comparative evaluation of unsupervised anomaly detection algorithmsfor multivariate data PLOS ONE 11(4)1ndash31 DOI 101371journalpone0152173

Gonzlez PC Velsquez JD (2013) Characterization and detection of taxpayers with false invoices usingdata mining techniques Expert Systems with Applications 40(5)1427 ndash 1436 DOI httpsdoiorg101016jeswa201208051

Guimera R Sales-Pardo M Amaral LAN (2007) Module identification in bipartite and directed networksPhysical Review E 76036102 DOI 101103PhysRevE76036102

Guo H Viktor HL (2004) Learning from imbalanced data sets with boosting and data generation TheDataBoost-IM approach SIGKDD Explor Newsl 6(1)30ndash39 DOI 10114510077301007736

Guo X Yin Y Dong C Yang G Zhou G (2008) On the class imbalance problem In 2008 Fourth Interna-tional Conference on Natural Computation IEEE vol 4 pp 192ndash201 DOI 101109ICNC2008871

Han H Wang WY Mao BH (2005) Borderline-SMOTE A new over-sampling method in imbalanced datasets learning In Advances in Intelligent Computing Springer Berlin Heidelberg Berlin Heidelbergpp 878ndash887

He H Garcia EA (2009) Learning from imbalanced data IEEE Transactions on Knowledge and DataEngineering 21(9)1263ndash1284 DOI 101109TKDE2008239

He H Bai Y Garcia EA Li S (2008) ADASYN Adaptive synthetic sampling approach for imbalancedlearning In 2008 IEEE International Joint Conference on Neural Networks (IEEE World Congresson Computational Intelligence) IEEE pp 1322ndash1328 DOI 101109IJCNN20084633969

Holm S (1979) A simple sequentially rejective multiple test procedure Scandinavian journal of statistics6(2)65ndash70

Hsu CW Lin CJ (2002) A comparison of methods for multiclass support vector machines IEEE Transac-tions on Neural Networks 13(2)415ndash425 DOI 10110972991427

Huang A (2008) Similarity measures for text document clustering In Proceedings of the sixth new zealandcomputer science research student conference (NZCSRSC2008) Christchurch New Zealand pp 49ndash56

Iman RL Davenport JM (1980) Approximations of the critical region of the Friedman statistic Commu-nications in Statistics-Theory and Methods 9(6)571ndash595

Jo T Japkowicz N (2004) Class imbalances versus small disjuncts ACM SIGKDD Explor Newsl 6(1)40ndash49 DOI 10114510077301007737

Imbalanced classification in sparse and large behaviour datasets 55

Jutla IS, Jeub LG, Mucha PJ (2011–2016) A generalized Louvain method for community detection implemented in MATLAB. URL http://netwiki.amath.unc.edu/GenLouvain

Kubat M, Matwin S (1997) Addressing the curse of imbalanced training sets: One-sided selection. In: Proceedings of the Fourteenth International Conference on Machine Learning, Morgan Kaufmann Publishers Inc., San Francisco, CA, USA, pp 179–186

Lancichinetti A, Fortunato S (2009) Community detection algorithms: A comparative analysis. Physical Review E 80:056117. DOI 10.1103/PhysRevE.80.056117

Larremore DB, Clauset A, Jacobs AZ (2014) Efficiently inferring community structure in bipartite networks. Physical Review E: Statistical, Nonlinear, and Soft Matter Physics 90:012805. DOI 10.1103/PhysRevE.90.012805

Li J, Fine JP (2010) Weighted area under the receiver operating characteristic curve and its application to gene selection. Journal of the Royal Statistical Society: Series C (Applied Statistics) 59(4):673–692. DOI 10.1111/j.1467-9876.2010.00713.x

Li X, Wang L, Sung E (2008) AdaBoost with SVM-based component classifiers. Engineering Applications of Artificial Intelligence 21(5):785–795. DOI 10.1016/j.engappai.2007.07.001

Lichman M (2013) UCI machine learning repository. URL http://archive.ics.uci.edu/ml

Liu W, Chawla S, Cieslak DA, Chawla NV (2010) A robust decision tree algorithm for imbalanced data sets. In: Proceedings of the Tenth SIAM International Conference on Data Mining, SIAM, Philadelphia, vol 10, pp 766–777

Liu XY, Wu J, Zhou ZH (2009) Exploratory undersampling for class-imbalance learning. IEEE Transactions on Systems, Man, and Cybernetics, Part B (Cybernetics) 39(2):539–550. DOI 10.1109/TSMCB.2008.2007853

Luts J, Ojeda F, Van de Plas R, De Moor B, Van Huffel S, Suykens JA (2010) A tutorial on support vector machine-based methods for classification problems in chemometrics. Analytica Chimica Acta 665(2):129–145. DOI 10.1016/j.aca.2010.03.030

Macskassy SA, Provost F (2007) Classification in networked data: A toolkit and a univariate case study. Journal of Machine Learning Research 8(May):935–983

Martens D, Provost F (2014) Explaining data-driven document classifications. MIS Quarterly 38(1):73–100

Martens D, Provost F, Clark J, Junqué de Fortuny E (2016) Mining massive fine-grained behavior data to improve predictive analytics. MIS Quarterly 40(4):869–888

Mazurowski MA, Habas PA, Zurada JM, Lo JY, Baker JA, Tourassi GD (2008) Training neural network classifiers for medical decision making: The effects of imbalanced datasets on classification performance. Neural Networks 21(2–3):427–436. DOI 10.1016/j.neunet.2007.12.031

Mease D, Wyner AJ, Buja A (2007) Boosted classification trees and class probability/quantile estimation. Journal of Machine Learning Research 8:409–439

Nemenyi P (1963) Distribution-free Multiple Comparisons. Dissertation, Princeton University

Newman MEJ, Girvan M (2004) Finding and evaluating community structure in networks. Physical Review E 69:026113. DOI 10.1103/PhysRevE.69.026113

Ng AY (2004) Feature selection, L1 vs. L2 regularization, and rotational invariance. In: Proceedings of the Twenty-first International Conference on Machine Learning, ACM, New York, NY, USA, ICML '04, pp 78–. DOI 10.1145/1015330.1015435

Ng AY, Jordan MI (2002) On discriminative vs. generative classifiers: A comparison of logistic regression and naive Bayes. In: Advances in Neural Information Processing Systems 14, MIT Press, pp 841–848

Ngai E, Hu Y, Wong Y, Chen Y, Sun X (2011) The application of data mining techniques in financial fraud detection: A classification framework and an academic review of literature. Decision Support Systems 50(3):559–569. DOI 10.1016/j.dss.2010.08.006

Platt JC (1999) Probabilistic outputs for support vector machines and comparisons to regularized likelihood methods. In: Advances in Large-Margin Classifiers, MIT Press, pp 61–74

Porter MA, Onnela JP, Mucha PJ (2009) Communities in networks. Notices of the American Mathematical Society 56(9):1082–1097

Provost F, Fawcett T (2013) Data Science for Business: What you need to know about data mining and data-analytic thinking. O'Reilly Media, Inc.

Provost F, Dalessandro B, Hook R, Zhang X, Murray A (2009) Audience selection for on-line brand advertising: Privacy-friendly social network targeting. In: Proceedings of the 15th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, ACM, New York, NY, USA, KDD '09, pp 707–716. DOI 10.1145/1557019.1557098

56 Jellis Vanhoeyveld David Martens

Raskutti B, Kowalczyk A (2004) Extreme re-balancing for SVMs: A case study. SIGKDD Explor Newsl 6(1):60–69. DOI 10.1145/1007730.1007739

Rosvall M, Bergstrom CT (2008) Maps of random walks on complex networks reveal community structure. Proceedings of the National Academy of Sciences 105(4):1118–1123. DOI 10.1073/pnas.0706851105

Schapire RE (1999) A brief introduction to boosting. In: Proceedings of the 16th International Joint Conference on Artificial Intelligence - Volume 2, Morgan Kaufmann Publishers Inc., San Francisco, CA, USA, IJCAI'99, pp 1401–1406

Schapire RE, Singer Y (1999) Improved boosting algorithms using confidence-rated predictions. Machine Learning 37(3):297–336. DOI 10.1023/A:1007614523901

Shmueli G (2017) Analyzing behavioral big data: Methodological, practical, ethical, and moral issues. Quality Engineering 29(1):57–74. DOI 10.1080/08982112.2016.1210979

Sobhani P, Viktor H, Matwin S (2015) Learning from imbalanced data using ensemble methods and cluster-based undersampling. In: New Frontiers in Mining Complex Patterns: Third International Workshop, NFMCP 2014, Held in Conjunction with ECML-PKDD 2014, Nancy, France, September 19, 2014, Revised Selected Papers, Springer International Publishing, Cham, pp 69–83. DOI 10.1007/978-3-319-17876-9_5

Stankova M (2016) Classification within network data with a bipartite structure. Dissertation, University of Antwerp

Stankova M, Martens D, Provost F (2015) Classification over bipartite graphs through projection. Working Papers 2015001, University of Antwerp, Faculty of Applied Economics

Sun Y, Kamel MS, Wong AK, Wang Y (2007) Cost-sensitive boosting for classification of imbalanced data. Pattern Recognition 40(12):3358–3378. DOI 10.1016/j.patcog.2007.04.009

Suykens JA, Van Gestel T, De Brabanter J, De Moor B, Vandewalle J (2002) Least squares support vector machines. World Scientific, Singapore

Tang B, He H (2015) ENN: Extended nearest neighbor method for pattern recognition [research frontier]. IEEE Computational Intelligence Magazine 10(3):52–60. DOI 10.1109/MCI.2015.2437512

Tang Y, Zhang YQ, Chawla NV, Krasser S (2009) SVMs modeling for highly imbalanced classification. IEEE Transactions on Systems, Man, and Cybernetics, Part B (Cybernetics) 39(1):281–288. DOI 10.1109/TSMCB.2008.2002909

Tobback E, Moeyersoms J, Stankova M, Martens D (2016) Bankruptcy prediction for SMEs using relational data. Working Paper 2016004, University of Antwerp, Faculty of Applied Economics

Verbeke W, Dejaeger K, Martens D, Hur J, Baesens B (2012) New insights into churn prediction in the telecommunication sector: A profit driven data mining approach. European Journal of Operational Research 218(1):211–229. DOI 10.1016/j.ejor.2011.09.031

Veropoulos K, Campbell I, Cristianini N (1999) Controlling the sensitivity of support vector machines. In: Proceedings of the International Joint Conference on Artificial Intelligence, Stockholm, Sweden (IJCAI99), pp 55–60

Whitrow C, Hand DJ, Juszczak P, Weston D, Adams NM (2009) Transaction aggregation as a strategy for credit card fraud detection. Data Mining and Knowledge Discovery 18(1):30–55. DOI 10.1007/s10618-008-0116-z

Wickramaratna J, Holden SB, Buxton BF (2001) Performance degradation in boosting. In: Proceedings of the Second International Workshop on Multiple Classifier Systems, Springer, London, UK, MCS '01, pp 11–21

Yen SJ, Lee YS (2009) Cluster-based under-sampling approaches for imbalanced data distributions. Expert Systems with Applications 36(3, Part 1):5718–5727. DOI 10.1016/j.eswa.2008.06.108

Yu HF, Lo HY, Hsieh HP, Lou JK, McKenzie TG, Chou JW, Chung PH, Ho CH, Chang CF, Wei YH, et al (2010) Feature engineering and classifier ensemble for KDD Cup 2010. In: Proceedings of the KDD Cup 2010 Workshop, pp 1–16

Zha H, He X, Ding C, Simon H, Gu M (2001) Bipartite graph partitioning and data clustering. In: Proceedings of the Tenth International Conference on Information and Knowledge Management, ACM, New York, NY, USA, CIKM '01, pp 25–32. DOI 10.1145/502585.502591

Zhang J, Mani I (2003) kNN approach to unbalanced data distributions: A case study involving information extraction. In: Proceedings of the ICML'2003 Workshop on Learning from Imbalanced Datasets, Washington, DC

Ziegler CN, McNee SM, Konstan JA, Lausen G (2005) Improving recommendation lists through topic diversification. In: Proceedings of the 14th International Conference on World Wide Web, ACM, New York, NY, USA, WWW '05, pp 22–32. DOI 10.1145/1060745.1060754


in 8 out of 8 cases. We can therefore conclude that CBU outperforms RUS at low undersampling rates. This indeed shows that handling the within-class imbalance can be beneficial in this environment. If we make a comparison at the highest undersampling rate βu = βu5 = 1, the results are completely opposite: CBU wins in only 1 case, with 9 losses with respect to RUS and 3 ties. On the highly imbalanced datasets, CBU outperforms RUS for 1 dataset (with 4 losses and 3 ties). It is clear that RUS is the preferred technique in case of high undersampling rates. The reason is that CBU emphasizes the small disjuncts and basically fails to adequately pick up on the common, larger communities. Furthermore, these small disjuncts might actually correspond with majority class noise/outliers, and CBU is focusing on these types of instances. On the overall level, where we consider all undersampling rates (except βu = 0), CBU wins in 5 out of 13 datasets, RUS wins in 4 cases and there are 4 tie situations (yet RUS did outperform CBU in all 4 cases). Both methods seem to be competitive with one another across a wide range of imbalance levels. If we pay particular attention to the highly imbalanced sets, CBU wins in 5 cases and there are 3 tie situations on the remaining sets (yet RUS did outperform CBU in all 3 cases). CBU seems to be the method of choice (with respect to RUS) when the datasets show high imbalance levels.
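For concreteness, a minimal sketch of the RUS step, under the assumption that βu interpolates linearly between keeping the full majority class (βu = 0) and keeping only as many majority as minority instances, i.e. a balanced set (βu = 1); the function and variable names are ours, not the paper's:

```python
import random

def random_undersample(majority_idx, minority_idx, beta_u, seed=0):
    """Random undersampling (RUS) sketch: beta_u = 0 keeps every majority
    instance, beta_u = 1 keeps as many majority instances as there are
    minority instances; intermediate values interpolate linearly."""
    n_maj, n_min = len(majority_idx), len(minority_idx)
    # target majority size shrinks linearly from n_maj down to n_min
    n_keep = round(n_maj - beta_u * (n_maj - n_min))
    rng = random.Random(seed)
    kept = rng.sample(majority_idx, n_keep)
    return kept + minority_idx  # training indices after undersampling
```

With 900 majority and 100 minority instances, βu = 1 yields a balanced training set of 200 instances, while βu = 0.5 yields 600.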

Table 4 Undersampling experiments. Results show the average tenfold AUC test set performance with respect to increasing βu-values (βu = [βu1, βu2, βu3, βu4, βu5] = [0, 1/4, 1/2, 3/4, 1]). Optimal performances (boldface) are highlighted for each imbalance ratio. Additionally, standard deviations are shown between brackets. Cl K represents the "Closest Knn" technique, Cl T represents the "Closest tot sim" technique (similar for Far K and Far T, see Section 3.2). Parameter settings are given in Section 4.2.

Mov Th(p = [])
         βu1          βu2          βu3          βu4          βu5
RUS      79.77(5.3)   80.32(5.8)   81.57(5.5)   81.86(6.6)   81.26(6.2)
Cl K     79.77(5.3)   79.25(4.5)   78.07(5.0)   76.25(6.5)   62.46(8.5)
Cl T     79.77(5.3)   78.4(4.4)    72.41(3.5)   64.66(4.5)   60.37(7.3)
Far K    79.77(5.3)   84.54(5.0)   83.64(6.4)   80.02(7.3)   56.82(10.3)
Far T    79.77(5.3)   85.03(5.7)   82.68(6.8)   75.61(9.2)   56.77(10.9)
CBU      80.11(5.8)   81.17(6.0)   81.08(6.5)   84.17(5.1)   80.96(6.9)

Yahoo G(p = 25)
         βu1          βu2          βu3          βu4          βu5
RUS      78.82(1.4)   78.91(1.6)   78.97(1.6)   78.61(1.6)   77.82(2.1)
Cl K     78.82(1.4)   77.26(1.5)   72.52(1.5)   67.86(2.0)   65.07(2.7)
Cl T     78.82(1.4)   76.83(1.0)   71.99(1.8)   67.15(2.3)   61.1(2.7)
Far K    78.82(1.4)   78.26(2.2)   74.69(2.7)   67.22(2.1)   60.72(2.3)
Far T    78.82(1.4)   77.68(2.6)   72.44(3.0)   64.94(2.4)   59.6(2.0)
CBU      75.25(3.2)   75.22(2.4)   74.69(2.3)   73.07(2.4)   70.69(2.4)

TaFeng(p = 25)
         βu1          βu2          βu3          βu4          βu5
RUS      66.94(1.3)   67.44(1.3)   68.1(1.4)    68.27(1.4)   66.13(1.2)
Cl K     66.94(1.3)   66.13(1.4)   63.39(1.2)   59.83(1.3)   56.94(0.7)
Cl T     66.94(1.3)   66.38(1.5)   62.89(1.6)   57.46(1.3)   54.56(1.3)
Far K    66.94(1.3)   68.06(1.4)   66.43(1.6)   64.46(1.5)   63.35(1.3)
Far T    66.94(1.3)   64.31(1.1)   62.69(1.0)   61.27(1.1)   59.03(1.0)
CBU      64.81(1.2)   64.15(1.1)   64.13(1.2)   63.88(0.8)   63.46(0.8)

Book(p = 25)
         βu1          βu2          βu3          βu4          βu5
RUS      60.08(0.7)   60.13(0.6)   60.4(0.8)    60.33(0.8)   63.28(0.8)
Cl K     60.08(0.7)   59.96(0.7)   60.13(0.8)   59.96(1.0)   59.28(0.7)
Cl T     60.08(0.7)   59.96(0.7)   60.13(0.8)   60.29(0.4)   54.5(0.9)
Far K    60.08(0.7)   63.29(1.0)   64.19(0.8)   57.3(1.1)    55.66(1.1)
Far T    60.08(0.7)   62.14(0.5)   58.27(0.6)   56.37(1.0)   55.66(1.1)
CBU      54.82(0.9)   54.67(0.9)   54.71(0.9)   54.66(1.0)   54.78(0.9)

4.5 Boosting variants

In this section, the plain AdaBoost (AB) method is compared with the cost-sensitive AdaCost (AC) algorithm, the EasyEnsemble (EE) technique and the baseline (BL) method (where we train a single SVM on the imbalanced training data). AB, AC and EE have an underlying boosting process in common. The AUC-values are stored during each boosting iteration, which allows us to show the performance graphically as a function of the number of iterations t. In EasyEnsemble (EE), we combine the weak learners of each subset by summing their individual contributions. For example, when t = 2 boosting iterations are performed, we evaluate the performance of the combined learner $\sum_{s=1}^{S} \sum_{t=1}^{2} \alpha_{s,t} h_{s,t}(x)$. Figures 1, 2, 3 and 4 show the average tenfold AUC-performance (with µ = 100%) on the test data. The left-side figures show the AB, AC (with varying R-values) and EE (with varying S-values) performance with respect to the number of boosting iterations. The C-value is tuned according to highest validation set AUC-performance (over all possible boosting iterations). The right-side figures investigate the influence of the C-parameter for AB and EE (with S = 15) and allow us to gain insight on the effect of weak and strong learners on the generalization performance. Results on the remaining datasets are given in Appendix C, Figures 6-17. Note that we only indicate results with weight-percentage µ = 100% (use all instances in the training process). Previous experiments (with µ = 75%) showed inferior results. This suggests that it is indeed important to include all minority instances during the training of each weak learner, an observation also made by Liu et al (2009).

When investigating the influence of the regularization parameter C in AdaBoost, we can graphically verify that it is a "weakness" indicator, in the sense that lower C-values correspond to weaker learners. While these low C-value learners have a low performance in the first rounds of boosting, the possible boost in performance is usually much higher compared to higher C-value inducers. Strong learners can already make a suitable distinction between minority and majority class behaviours in the first boosting round. The boosting process will emphasize the hard to learn instances, among which are noise/outliers. Therefore, too strong learners are not suitable to be used in a boosting process, as already noted by Wickramaratna et al (2001). In Figures 2(b) and 3(b), we observe that the lowest C-values (C = 10^-7, 10^-5) can outperform higher C-values (C = 10^-3, 10^-1). In many cases, the AB-process outperforms the baseline (BL), see also the results in Table 5. A combination of several weak learners can often achieve better generalization than a single strong learner.

Inspired by the success of the RUS method with βu = 1 (for reasons indicated in Section 4.4), the EE-technique tries to overcome an obvious downside of this approach: we are removing a lot of majority class instances that can be very informative. Hence, we try to make up for this loss by combining several of these balanced training sets, in the hope that the majority class is more adequately explored. This results in a very effective learner that seems to outperform the AB, AC and BL methods in the majority of cases, as can be seen from these figures. It is remarkable that we don't need to use that many subsets S to achieve high performances: increasing the S-value from 5 to 15 doesn't seem to cause major increases in AUC (see also Table 5). The influence of the C-parameter is similar as described in the previous paragraph.
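The EE procedure can be written down compactly. The sketch below is illustrative only: decision stumps stand in for the paper's linear SVM base learner, and all names are ours. It samples S balanced subsets (all minority instances paired with an equally sized random draw of majority instances), boosts T rounds on each, and sums the weighted hypotheses α_{s,t} h_{s,t}(x) across all subsets:

```python
import numpy as np

def stump_fit(X, y, w):
    """Best decision stump under instance weights w (labels in {-1, +1})."""
    best = (None, None, 1, np.inf)  # (feature, threshold, polarity, error)
    for j in range(X.shape[1]):
        for thr in np.unique(X[:, j]):
            for pol in (1, -1):
                pred = np.where(X[:, j] >= thr, pol, -pol)
                err = w[pred != y].sum()
                if err < best[3]:
                    best = (j, thr, pol, err)
    return best

def stump_predict(stump, X):
    j, thr, pol, _ = stump
    return np.where(X[:, j] >= thr, pol, -pol)

def adaboost(X, y, T):
    """Plain AdaBoost on one subset; returns a list of (alpha, stump)."""
    w = np.full(len(y), 1.0 / len(y))
    ensemble = []
    for _ in range(T):
        stump = stump_fit(X, y, w)
        err = max(stump[3], 1e-12)
        if err >= 0.5:
            break
        alpha = 0.5 * np.log((1 - err) / err)
        pred = stump_predict(stump, X)
        w *= np.exp(-alpha * y * pred)   # upweight misclassified instances
        w /= w.sum()
        ensemble.append((alpha, stump))
    return ensemble

def easy_ensemble(X, y, S=5, T=5, seed=0):
    """Sample S balanced subsets, boost on each, pool all weighted learners."""
    rng = np.random.default_rng(seed)
    min_idx = np.flatnonzero(y == 1)
    maj_idx = np.flatnonzero(y == -1)
    learners = []
    for _ in range(S):
        sub = np.concatenate(
            [min_idx, rng.choice(maj_idx, size=len(min_idx), replace=False)])
        learners.extend(adaboost(X[sub], y[sub], T))
    return learners

def ee_score(learners, X):
    """Combined score: sum over s and t of alpha_{s,t} * h_{s,t}(x)."""
    return sum(a * stump_predict(s, X) for a, s in learners)
```

Each subset is only twice the minority class size, and the S boosting runs are independent, which is why the method parallelizes so well.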

[Plots omitted: average tenfold AUC on test data versus the number of boosting iterations T, two panels per figure]

Fig. 1 Mov G(p = 25) dataset: results showing average tenfold AUC-performance on test data (with µ = 100%) for (a) AB, AC and EE, with C chosen according to highest validation set AUC-performance (over all possible T); (b) AB and EE(S = 15) with varying C-levels

Fig. 2 Book(p = 25) dataset

Fig. 3 TaFeng(p = 25) dataset

Fig. 4 Bank(p = []) dataset


4.6 Final comparison

4.6.1 Performance-wise comparison

In this section, we wish to determine which of the aforementioned methods shows superior performance in terms of AUC on test data. The experimental set-up is described in Section 4.2. Note that we exclude β = 0 and βu = 0 in determining the performance of the oversampling respectively undersampling techniques, to be able to compare them with the baseline (BL) approach23. The results for AB, AC and EE are shown for µ = 100%. The number of boosting iterations t ∈ [0,T] is also considered as a tunable parameter. Table 5 shows the final comparison between all methods. Additionally, we added an average rank column indicating the mean rank of each algorithm across all datasets. The rank of an algorithm for a specific dataset corresponds to its position in the sorted AUC-performance spectrum, where the best performing algorithm gets the rank of 1, the second best rank 2, etc. In case two or more algorithms perform equally well, average ranks are assigned (if two algorithms have ranks of 3 and 4 respectively and show equal performance, then both methods get assigned a rank of 3.5). Note that the LST dataset is excluded in the calculation of average ranks, since we consider this as a dataset not being affected by imbalance and showing equal performances.
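The tie-handling rule above (tied methods share the mean of the positions they occupy) can be sketched as follows; the AUC values are hypothetical:

```python
def average_ranks(auc_values):
    """Rank algorithms by descending AUC; tied values share the mean of
    the positions they would occupy (e.g. a tie at positions 3 and 4
    yields rank 3.5 for both)."""
    order = sorted(range(len(auc_values)), key=lambda i: -auc_values[i])
    ranks = [0.0] * len(auc_values)
    pos = 0
    while pos < len(order):
        tied = [order[pos]]
        while pos + len(tied) < len(order) and \
                auc_values[order[pos + len(tied)]] == auc_values[tied[0]]:
            tied.append(order[pos + len(tied)])
        mean_rank = pos + 1 + (len(tied) - 1) / 2.0
        for i in tied:
            ranks[i] = mean_rank
        pos += len(tied)
    return ranks

average_ranks([0.81, 0.79, 0.79, 0.75])  # -> [1.0, 2.5, 2.5, 4.0]
```

Averaging these per-dataset ranks over all datasets gives the final column of Table 5.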

As a side note, we noticed that increasing the number of minority training set instances can significantly improve performances. We can see this clearly by comparing the p = 25 outcomes with the results of p = 1 on the same dataset. Junqué de Fortuny et al (2014a) already proved empirically that larger behaviour datasets (in terms of the number of instances or features) lead to higher AUC-performances. This situation is confirmed in our experiments, though we want to add that this is only true when the data size is increased in a balanced fashion.

The results in Table 5 confirm our previous analysis. Each of the oversampling, undersampling (except Cl Knn) and boosting variants seems to improve upon the BL performance. Sobhani et al (2015) note that the direct application of ensemble techniques on imbalanced data does not circumvent the imbalanced learning issue. In our experiments on behaviour data, we do find a simple boosting procedure applied to the imbalanced data to improve upon the predictive performance. The oversampling techniques OSR, SMOTE and ADASYN show similar outcomes and are competitive methods, seeming to outperform the undersampling techniques and the plain boosting (AB) and cost-sensitive learning (AC) approaches. When dealing with traditional data, the performances of oversampling techniques are generally worse than that of undersampling approaches (Drummond and Holte 2003). It is interesting to observe that for these large and sparse behaviour datasets the opposite is true. This is heavily related to the intrinsic properties of the data under consideration. Undersampling (or feature selection, for that matter) is not recommended because, as we have noted in Section 2.1, each feature provides a small additional contribution to the predictive performance. With respect to AC, the results indicate that R = RL is a better choice compared to more random cost ratios R = 28. The EE-technique has the

23 The BL technique trains single SVMs on the imbalanced training data


lowest average rank and is our best performing method. The combination of several balanced subsets with filtered noise/outlier contributions allows one to explore the majority class space efficiently. To strengthen our previous statements, we will in the next paragraphs conduct statistical significance tests in accordance with the procedure outlined in Demsar (2006). The latter paper presents a framework to perform statistical comparisons of classifiers over multiple datasets.

The first null-hypothesis we try to reject postulates that each of the algorithms performs equally well, or equivalently, that their average ranks R_j (see Table 5) are indifferent. Under the null-hypothesis, the Friedman statistic (Friedman 1937)

\chi^2_F = \frac{12N}{k(k+1)} \left[ \sum_{j=1}^{k} R_j^2 - \frac{k(k+1)^2}{4} \right]    (6)

with N the number of datasets (N = 15) and k the number of algorithms (k = 13), can be used to compute the statistic (Iman and Davenport 1980)

F_F = \frac{(N-1)\chi^2_F}{N(k-1) - \chi^2_F}    (7)

The latter is distributed according to the F-distribution with k−1 and (k−1)(N−1) degrees of freedom. Evaluating this measure with the average ranks shown in Table 5 results in F_F = 22.98, which is a lot higher than the critical value of 1.81 at the α = 0.05 significance level. We can therefore reject the null-hypothesis and proceed with post-hoc tests.
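Equations (6) and (7) are straightforward to script. The check below uses the average ranks transcribed from Table 5 (rounded to three decimals, so the result matches the reported 22.98 only up to rounding):

```python
def friedman_iman_davenport(ranks, n_datasets):
    """Friedman chi-square (Eq. 6) and Iman-Davenport F statistic (Eq. 7)."""
    k = len(ranks)
    chi2 = (12.0 * n_datasets / (k * (k + 1))) * (
        sum(r * r for r in ranks) - k * (k + 1) ** 2 / 4.0)
    f_stat = (n_datasets - 1) * chi2 / (n_datasets * (k - 1) - chi2)
    return chi2, f_stat

# average ranks of the 13 methods over N = 15 datasets (Table 5)
ranks = [11.600, 5.000, 4.533, 4.800, 8.167, 12.467, 8.133,
         8.567, 8.267, 8.467, 5.400, 3.267, 2.333]
chi2, f_stat = friedman_iman_davenport(ranks, n_datasets=15)
# f_stat comes out near 23 with these rounded ranks, well above the
# F(12, 168) critical value of 1.81
```

F_F is compared against the F-distribution with k−1 = 12 and (k−1)(N−1) = 168 degrees of freedom, hence the 1.81 critical value quoted above.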

The Nemenyi test (Nemenyi 1963) allows one to compare each classifier to one another, and adjusts the critical value to compensate for making k(k−1)/2 comparisons24. "The performance of two classifiers is significantly different if the corresponding average ranks differ by at least the critical difference (CD)" (Demsar 2006). We refer to the aforementioned paper to calculate the CD. The result of these comparisons at the α = 0.05 significance level is shown in Table 6. This test is already quite revealing, in the sense that, for instance, the oversampling methods (OSR, SMOTE and ADASYN), AC (R = RL) and the EE-techniques are significantly different from the BL. Yet it does not indicate that undersampling or applying plain boosting (AB) results in significantly different performances compared to the BL. The Nemenyi test is still conservative, and we therefore compare all classifiers to a single control classifier in the next paragraphs. Note that we only make k−1 comparisons in that situation, resulting in stronger results compared to the Nemenyi test.
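In Demšar (2006) the critical difference is CD = q_α √(k(k+1)/(6N)), where q_α comes from a table based on the Studentized range statistic. The paper does not print q_0.05 for k = 13; the value 3.31 below is our assumption taken from such a table, and with it the pairwise decisions are consistent with Table 6:

```python
import math

def nemenyi_cd(k, n_datasets, q_alpha):
    """Critical difference from Demsar (2006): CD = q_alpha * sqrt(k(k+1)/(6N))."""
    return q_alpha * math.sqrt(k * (k + 1) / (6.0 * n_datasets))

# q_{0.05} for k = 13 classifiers is roughly 3.31 (assumed; consult the
# Studentized-range-based table in Demsar (2006) for the exact value)
cd = nemenyi_cd(k=13, n_datasets=15, q_alpha=3.31)
# |11.600 - 5.000| = 6.600 > CD  -> BL vs OSR significantly different
# |11.600 - 8.167| = 3.433 < CD  -> BL vs RUS not significantly different
```

With these numbers, CD works out to roughly 4.7 average-rank units, which reproduces the 0/1 pattern of Table 6.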

When comparing a single classifier i with a control classifier c, the null-hypothesis states that the two classifiers i and c perform equally well. Under this hypothesis, the test statistic (based on the average ranks R_i shown in Table 5)

z = \frac{R_i - R_c}{\sqrt{k(k+1)/(6N)}}    (8)

is distributed according to a standard normal distribution. This z-value is used to find the corresponding probability (p-value) p from the cumulative standard normal

24 The larger the amount of comparisons, the higher the proportion of null-hypotheses that are wrongfully rejected due to random chance.


Table 5 Final comparison of all methods in terms of AUC test set performance, see Section 4.2 for details on experimental set-up. β ≠ 0 and βu ≠ 0 for the oversampling respectively undersampling techniques; µ = 100% for each of the boosting variants. Standard deviations are given between brackets and the performance difference with the BL is shown in square brackets.

             Mov G(p = 1)       Mov G(p = 25)      Mov Th(p = [])     Yahoo A(p = 1)
BL           71.6(2.62)[0]      81.41(1.32)[0]     79.77(5.33)[0]     55.92(2.97)[0]
OSR          75.35(2.27)[3.8]   83.76(2.09)[2.3]   85.13(6.1)[5.4]    60.05(2.71)[4.1]
SMOTE        76.16(2.27)[4.6]   83.7(2.1)[2.3]     85.67(4.98)[5.9]   60.1(3.0)[4.2]
ADASYN       76.07(2.26)[4.5]   83.63(2.04)[2.2]   85.65(5.6)[5.9]    59.9(2.99)[4.0]
RUS          72.88(2.73)[1.3]   81.52(2.15)[0.1]   82.91(7.19)[3.1]   57.04(1.77)[1.1]
Cl Knn       71.43(1.36)[-0.2]  80.88(1.19)[-0.5]  78.87(4.71)[-0.9]  55.78(2.71)[-0.1]
Far Knn      71.9(2.95)[0.3]    80.9(1.48)[-0.5]   84.07(4.64)[4.3]   57.2(1.33)[1.3]
CBU          74.17(2.36)[2.6]   81.51(1.04)[0.1]   82.76(7.22)[3.0]   58.77(3.43)[2.8]
AB           71.65(1.73)[0.1]   84.52(1.89)[3.1]   82.43(5.18)[2.7]   58.35(2.62)[2.4]
AC(R = 28)   71.61(2.46)[0]     83.46(1.82)[2.0]   83.27(5.6)[3.5]    57.72(2.47)[1.8]
AC(R = R L)  74.65(2.7)[3.1]    83.35(2.09)[1.9]   85.41(4.49)[5.6]   59.47(2.33)[3.5]
EE(S = 10)   76.04(2.66)[4.4]   85.05(1.85)[3.6]   86.1(5.78)[6.3]    59.66(3.13)[3.7]
EE(S = 15)   76.12(2.88)[4.5]   85.14(1.86)[3.7]   86.42(5.86)[6.7]   59.76(2.93)[3.8]

             Yahoo A(p = 25)    Yahoo G(p = 1)     Yahoo G(p = 25)    TaFeng(p = 1)
BL           61.68(2.42)[0]     66.84(3.66)[0]     78.82(1.39)[0]     55.75(1.6)[0]
OSR          64.59(3.12)[2.9]   73.08(2.96)[6.2]   78.52(2.01)[-0.3]  61.21(2.24)[5.5]
SMOTE        65.56(3.33)[3.9]   73.11(3.12)[6.3]   79.01(1.21)[0.2]   61.72(1.81)[6.0]
ADASYN       65.13(3.38)[3.4]   73.22(3.17)[6.4]   79.74(1.68)[0.9]   61.68(1.86)[5.9]
RUS          64.11(2.8)[2.4]    70.65(3.39)[3.8]   78.91(1.55)[0.1]   59.25(2.18)[3.5]
Cl Knn       61.14(2.13)[-0.5]  66.34(3.54)[-0.5]  77.26(1.46)[-1.6]  55.77(1.28)[0]
Far Knn      63.96(3.03)[2.3]   66.97(3.54)[0.1]   78.26(2.2)[-0.6]   59.98(1.26)[4.2]
CBU          62.27(1.79)[0.6]   71.27(2.89)[4.4]   75.22(2.42)[-3.6]  58.4(1.57)[2.6]
AB           63.88(2.67)[2.2]   68.9(2.03)[2.1]    79.01(1.66)[0.2]   56.21(1.79)[0.5]
AC(R = 28)   64.32(3.56)[2.6]   68.89(3.11)[2.0]   78.99(1.89)[0.2]   56.33(1.83)[0.6]
AC(R = R L)  64.31(3.03)[2.6]   73.13(2.8)[6.3]    78.41(2.0)[-0.4]   61.6(2.26)[5.9]
EE(S = 10)   66.51(3.24)[4.8]   72.61(3.15)[5.8]   80.52(1.6)[1.7]    61.2(1.82)[5.4]
EE(S = 15)   66.36(3.18)[4.7]   73.48(2.32)[6.6]   80.54(1.56)[1.7]   61.13(1.83)[5.4]

             TaFeng(p = 25)     Book(p = 1)        Book(p = 25)       LST(p = 1)
BL           66.94(1.34)[0]     52.6(1.29)[0]      60.08(0.71)[0]     99.99(0.01)[0]
OSR          68.77(1.23)[1.8]   55.87(1.42)[3.3]   64.62(0.57)[4.5]   99.99(0.01)[0]
SMOTE        68.47(1.5)[1.5]    55.07(0.88)[2.5]   62.96(0.82)[2.9]   99.99(0.01)[0]
ADASYN       68.48(1.47)[1.5]   55.04(0.91)[2.4]   63.02(0.57)[2.9]   99.99(0.01)[0]
RUS          68.28(1.39)[1.3]   54.26(0.92)[1.7]   63.28(0.8)[3.2]    99.98(0.01)[0]
Cl Knn       66.13(1.43)[-0.8]  52.69(1.3)[0.1]    60.02(0.79)[-0.1]  99.99(0.01)[0]
Far Knn      68.06(1.41)[1.1]   56.25(1.52)[3.7]   64.15(1.12)[4.1]   99.98(0.01)[0]
CBU          63.84(1.07)[-3.1]  53.75(1.01)[1.2]   54.68(0.88)[-5.4]  []
AB           67.65(1.55)[0.7]   54.27(1.95)[1.7]   65.0(0.67)[4.9]    99.99(0.01)[0]
AC(R = 28)   69.31(1.23)[2.4]   53.72(1.0)[1.1]    61.24(0.8)[1.2]    99.98(0.01)[0]
AC(R = R L)  67.15(1.51)[0.2]   55.73(1.22)[3.1]   64.6(0.64)[4.5]    99.99(0.01)[0]
EE(S = 10)   70.3(1.35)[3.4]    55.09(1.29)[2.5]   65.37(0.61)[5.3]   99.98(0.01)[0]
EE(S = 15)   70.4(1.3)[3.5]     55.35(1.26)[2.8]   65.4(0.51)[5.3]    99.98(0.01)[0]

Table 5 Continued. Additionally, an average rank column is added, showing the mean rank of each algorithm across all datasets. Note that we excluded the LST dataset for this purpose.

             Adver(p = [])      Adver(p = 1)       CRF(p = [])         Bank(p = [])
BL           96.61(1.82)[0]     90.93(3.02)[0]     64.06(16.43)[0]     66.82(0.88)[0]
OSR          96.93(1.91)[0.3]   93.3(2.02)[2.4]    80.74(12.93)[16.7]  71.39(0.79)[4.6]
SMOTE        97.05(1.66)[0.4]   93.35(2.01)[2.4]   78.7(16.56)[14.6]   []
ADASYN       96.91(1.95)[0.3]   93.46(2.21)[2.5]   78.87(16.71)[14.8]  []
RUS          96.81(1.87)[0.2]   92.38(2.51)[1.5]   83.98(5.99)[19.9]   69.41(1.19)[2.6]
Cl Knn       96.4(1.48)[-0.2]   89.73(3.42)[-1.2]  76.63(16.19)[12.6]  66.17(0.72)[-0.6]
Far Knn      95.77(1.81)[-0.8]  93.88(1.78)[3.0]   83.75(13.11)[19.7]  66.95(0.56)[0.1]
CBU          97.15(1.88)[0.5]   94.18(2.3)[3.3]    []                  []
AB           97.34(2.18)[0.7]   91.39(3.23)[0.5]   77.62(15.15)[13.6]  66.82(0.88)[0]
AC(R = 28)   97.44(1.93)[0.8]   91.0(3.35)[0.1]    68.31(14.93)[4.2]   67.67(0.71)[0.9]
AC(R = R L)  97.46(1.71)[0.8]   93.51(2.17)[2.6]   85.08(9.77)[21.0]   70.7(0.8)[3.9]
EE(S = 10)   97.64(1.35)[1.0]   92.97(2.75)[2.0]   86.18(10.17)[22.1]  71.46(0.81)[4.6]
EE(S = 15)   97.63(1.35)[1.0]   93.3(2.14)[2.4]    86.35(9.99)[22.3]   71.54(0.76)[4.7]

             Average Rank
BL           11.600
OSR          5.000
SMOTE        4.533
ADASYN       4.800
RUS          8.167
Cl Knn       12.467
Far Knn      8.133
CBU          8.567
AB           8.267
AC(R = 28)   8.467
AC(R = R L)  5.400
EE(S = 10)   3.267
EE(S = 15)   2.333

Table 6 Nemenyi test outcomes at the α = 0.05 significance level. A value of 1 indicates the null-hypothesis is rejected and thus finds the two algorithms as significantly different (a value of 0 means we accept the null-hypothesis). Methods are listed in the same order as Table 5 and are more concisely represented.

      BL  RO  SM  AD  RU  Cl  Fa  CBU  AB  AC1  AC2  EE1  EE2
BL     0   1   1   1   0   0   0   0    0   0    1    1    1
RO     1   0   0   0   0   1   0   0    0   0    0    0    0
SM     1   0   0   0   0   1   0   0    0   0    0    0    0
AD     1   0   0   0   0   1   0   0    0   0    0    0    0
RU     0   0   0   0   0   0   0   0    0   0    0    1    1
Cl     0   1   1   1   0   0   0   0    0   0    1    1    1
Fa     0   0   0   0   0   0   0   0    0   0    0    1    1
CBU    0   0   0   0   0   0   0   0    0   0    0    1    1
AB     0   0   0   0   0   0   0   0    0   0    0    1    1
AC1    0   0   0   0   0   0   0   0    0   0    0    1    1
AC2    1   0   0   0   0   1   0   0    0   0    0    0    0
EE1    1   0   0   0   1   1   1   1    1   1    0    0    0
EE2    1   0   0   0   1   1   1   1    1   1    0    0    0


distribution function: p = 2 × min(Φ(z), 1 − Φ(z)). Holm's method (Holm 1979) compares k−1 classifiers to a single control classifier and sorts the corresponding p-values in ascending order, so that p1 ≤ p2 ≤ … ≤ p(k−1). Each pi is subsequently compared to its associated confidence level25 αcomp = α/(k−i). Holm starts with performing the check p1 < α/(k−1) and rejects the null-hypothesis if the test passes. It then proceeds with a similar check for i = 2, and continues until a certain null-hypothesis cannot be rejected. The remaining hypotheses are retained.
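The step-down logic is easy to get subtly wrong (the loop must stop at the first non-rejection rather than test every comparison independently). A small sketch, feeding in the z-statistics of Table 7 and converting them to two-sided p-values via the standard normal CDF:

```python
import math

def two_sided_p(z):
    """p = 2 * min(Phi(z), 1 - Phi(z)) for a standard normal z."""
    phi = 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))
    return 2.0 * min(phi, 1.0 - phi)

def holm(p_values, alpha=0.05):
    """Step-down Holm test: compare the i-th smallest p-value against
    alpha/(k-i) and stop at the first failure; returns rejected indices."""
    order = sorted(range(len(p_values)), key=lambda i: p_values[i])
    k = len(p_values) + 1          # k-1 comparisons against one control
    rejected = []
    for pos, idx in enumerate(order, start=1):
        if p_values[idx] < alpha / (k - pos):
            rejected.append(idx)
        else:
            break
    return rejected

# z-statistics versus the BL control, in the order of Table 7
z_vals = [-6.51642, -5.86009, -4.96936, -4.78183, -4.64119, -4.35991,
          -2.4378, -2.41436, -2.34404, -2.20339, -2.13307, 0.609449]
p_vals = [two_sided_p(z) for z in z_vals]
holm(p_vals)  # only the first six comparisons are rejected, as in Table 7
```

Note how the procedure stops at Far Knn: its p-value (0.0148) exceeds α/6 = 0.00833, so it and everything after it is retained.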

Table 7 shows the result of the Holm test at the α = 0.05 significance level, with the BL algorithm as control classifier. The table lists each method according to sorted p-values and shows the z-, p- and αcomp-values. The significance column indicates whether the proposed method is significantly different from the BL, and coincidentally matches with the result of the Nemenyi test. The p-values of the (Far Knn, RUS, CBU) undersampling methods, AB and AC (R = 28) are only slightly higher than the level required to reject the null-hypothesis. We included a critical significance level αcrit, corresponding to the lowest possible significance level upon which the method would be considered as significantly different from the BL (if α = αcrit, then p = αcomp). In the case of AB, for example, if we were to choose a significance level α = 0.0764 (a confidence level of 92.36%), the associated p-value would be smaller than αcomp and we would proceed to conclude26 that AB would perform significantly different from the BL. To summarize, the null-hypothesis of equal performance between the BL and an alternative method proposed in Section 3 (except Cl Knn) is rejected with a confidence level of more than 91% according to Holm's test. Hence, the performance of the BL approach is significantly worse than that of each of the proposed over- and undersampling methods (except Cl Knn), cost-sensitive learning techniques (AC) and boosting variants (AB, EE).

Similar to the previous paragraph, Table 8 presents the results of Holm's test at the α = 0.05 significance level, with EE(S = 15) as control classifier. The results coincidentally correspond with the result of the Nemenyi test and indicate that the oversampling (OSR, SMOTE and ADASYN), AC(R = RL) and EE(S = 10) methods are not statistically different from the EE(S = 15) technique at the 0.05 significance level. However, if we were to adapt the significance level to approximately α = 0.25 (a 75% confidence level), the results would conclude significance (except for EE(S = 10)). To summarize, the null-hypothesis of equal performance between EE(S = 15) and an alternative method proposed in Section 3 (except EE(S = 10)) is rejected with a confidence level of more than 75% according to Holm's test. Hence, the performance of the oversampling, undersampling, AB and AC approaches is significantly worse than that of EE(S = 15). We can therefore safely conclude EE to be the preferred method in terms of AUC performance when dealing with imbalanced behaviour data.

25 αcomp adjusts the value of α to compensate for multiple comparisons.
26 This is not entirely true, since Holm's method requires Far Knn to pass the test first before proceeding, which only occurs at the 0.088662 significance level.


Table 7 Holm test at the α = 0.05 significance level with BL reference. The table shows the z test statistic with associated p-value, αcomp = α/(k−i), with k the number of algorithms (k = 13) and i corresponding to the position of the method in the sorted p-value vector. Since we already ranked the p-values in ascending order, i takes on the row number (e.g. OSR has i = 5). The column "significant" denotes if we can reject the null-hypothesis (significant: p < αcomp). αcrit corresponds with the smallest possible significance level where we would decide to reject the null-hypothesis (αcrit = α·p/αcomp).

              z          p          αcomp      significant  αcrit
EE(S = 15)    -6.51642   7.2E-11    0.004167   1            8.64E-10
EE(S = 10)    -5.86009   4.63E-09   0.004545   1            5.09E-08
SMOTE         -4.96936   6.72E-07   0.005      1            6.72E-06
ADASYN        -4.78183   1.74E-06   0.005556   1            1.56E-05
OSR           -4.64119   3.46E-06   0.00625    1            2.77E-05
AC(R = RL)    -4.35991   1.3E-05    0.007143   1            9.11E-05
Far Knn       -2.4378    0.014777   0.008333   0            0.088662
RUS           -2.41436   0.015763   0.01       0            0.078815
AB            -2.34404   0.019076   0.0125     0            0.076305
AC(R = 28)    -2.20339   0.027567   0.016667   0            0.082701
CBU           -2.13307   0.032919   0.025      0            0.065837
Cl Knn         0.609449  0.542227   0.05       0            0.542227

Table 8 Holm's test at the α = 0.05 significance level with EE(S = 15) as reference

             z         p        αcomp    significant  αcrit
Cl Knn       7.12587   1.03E-12 0.004167 1            1.24E-11
BL           6.516421  7.2E-11  0.004545 1            7.92E-10
CBU          4.383348  1.17E-05 0.005    1            0.000117
AC(R = 28)   4.313027  1.61E-05 0.005556 1            0.000145
AB           4.172384  3.01E-05 0.00625  1            0.000241
RUS          4.102063  4.09E-05 0.007143 1            0.000287
Far Knn      4.078623  4.53E-05 0.008333 1            0.000272
AC(R = RL)   2.156513  0.031044 0.01     0            0.155218
OSR          1.875229  0.060761 0.0125   0            0.243045
ADASYN       1.734587  0.082814 0.016667 0            0.248442
SMOTE        1.547064  0.121848 0.025    0            0.243696
EE(S = 10)   0.65633   0.511612 0.05     0            0.511612

4.6.2 Timing-wise comparison

The performance of each of the proposed algorithms in terms of training times depends highly upon the chosen parameters. The β and βu parameters control the number of training instances and are an important contribution to the subsequent SVM learning times for each of the over- and undersampling techniques. The number of boosting iterations t ∈ [0, T] is also a major factor in each of the boosting variants proposed in Section 3.3. Furthermore, the chosen regularization parameter C highly influences the training times of the SVM inducer: higher C-values result in larger training times (Fan et al 2008). Not only do the chosen hyperparameters affect learning times; data characteristics such as the number of minority instances, the imbalance level, the number of features, the amount of overlapping, etc. have a major effect as well.

Imbalanced classification in sparse and large behaviour datasets 35

In comparing each of the methods outlined in Section 3, we make use of a methodology similar to the one presented previously (see Section 4.6.1). For each of the ten folds, a suitable parameter combination is selected based on optimal validation set AUC-performance. These chosen parameters are applied in determining the timings of each method. The final results indicate the average timings across ten folds. Hence, we show the required amount of time for each method when using the parameter combination resulting in the best performance. Note that this might, for instance, result in using β = 1/3 in case of OSR, or using only t = 5 boosting iterations in EE. Computational timings are shown in Table 9.

As one can observe from the aforementioned table, CBU is the slowest method, as it relies on computationally intensive clustering procedures. The synthetic SMOTE and ADASYN oversampling approaches are very time consuming as well: they both rely on computationally expensive nearest neighbour computations, in conjunction with a synthetic sample generation procedure that requires a single pass over all minority training instances.

The RUS method shows, as expected, superior results in terms of computational timings. Other undersampling methods perform quite well, though they are slower than BL and OSR due to their inherent nearest neighbour computations.

The boosting variants seem to perform worse than BL, OSR and each of the undersampling methods. Since EE was the best method in terms of AUC-performance (see Section 4.6.1), we focus on this learner in this paragraph. For the medium sized datasets (all but CRF and Bank), OSR outperforms EE(S = 15), yet for the large datasets (CRF and Bank) the opposite is true. An SVM is known to have a quadratic computational complexity in the number of training instances (Suykens et al 2002; García and Lozano 2007). For this reason, we consider OSR (or any other oversampling technique) as inappropriate for large datasets. Since with EE each subset has a size limited to twice the minority class training size, we expect increasing benefits of the EE-technique for larger and more imbalanced behaviour data. As Junque de Fortuny et al (2014a) have observed, larger behaviour data, in terms of the number of instances or features, also contributes to an increase in predictive power. Hence, we expect to see this type of big behaviour data more often in real-life applications, where computational resources can become a major constraint. Because the EE-technique is easily parallelizable, by simultaneously performing the boosting process for each of the independent subsets, we added an extra row in Table 9: EE par corresponds to the time needed for EE(S = 15) divided by 15, indicating the time that would be required for EE if each subset were trained in parallel. We can see that EE par outperforms OSR in 8 out of 16 cases, precisely for the datasets that show a high imbalance level (e.g. p = 1 or p = []).
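The subset construction that makes EE parallelizable can be sketched in a few lines. The helper below is a minimal illustration (function name and index lists are hypothetical), assuming each of the S subsets keeps all minority instances plus an equally sized random draw from the majority class, so each subset is twice the minority class size and can be boosted independently.

```python
# Minimal sketch of EasyEnsemble subset construction. Each subset combines
# the full minority class with a without-replacement draw of equally many
# majority instances; the S subsets are independent, hence trainable in
# parallel (which is what the EE par row approximates as time / S).
import random

def easy_ensemble_subsets(majority_idx, minority_idx, S, seed=0):
    rng = random.Random(seed)
    subsets = []
    for _ in range(S):
        sampled = rng.sample(majority_idx, len(minority_idx))  # balanced draw
        subsets.append(sampled + list(minority_idx))
    return subsets
```

Each returned subset would then be fed to its own boosting process, and the S resulting hypotheses combined.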

We conclude our comparison of the proposed methods in Figure 5, which shows the average ranks in terms of computational timings (see Table 9) and AUC performance for each of the available data sources. Regarding the ranks with respect to the latter, we added an extra row, EE par, having the same AUC-performance as EE(S = 15) for each dataset in Table 5. Ranks are subsequently determined according to the procedure outlined in Section 4.6.1. Additionally, Figure 18 in Appendix D shows the result for the large and highly imbalanced CRF and Bank datasets. We expect the latter type of data to be of major interest, for reasons already indicated.


Table 9 Computational timings (in seconds) averaged across ten folds, using the parameter combination resulting in the highest validation set AUC-performance. First and second best performances are emphasized in boldface (third and fourth best are underlined). EE par is the time required for EE(S = 15) divided by 15

             Mov G(p = 1)  Mov G(p = 25)  Mov Th(p = [])  Yahoo A(p = 1)
BL           0.032889      0.056697       0.558563        0.026922
OSR          0.055043      0.062802       0.99009         0.044421
SMOTE        0.218821      0.937057       3.841482        0.057726
ADASYN       0.284688      1.802399       5.191265        0.087694
RUS          0.011431      0.025383       0.155224        0.007991
CL Knn       0.046599      0.599846       0.989914        0.037182
Far Knn      0.039887      0.80072        0.683023        0.027788
CBU          1.034111      10.60173       6.822839        1.692477
AB           0.169792      0.841443       3.460246        0.139251
AC(R = 28)   0.471994      2.996585       1.086907        0.366555
AC(R = RL)   0.53376       1.179542       6.065177        0.209015
EE(S = 10)   0.117226      6.065145       1.17995         0.148973
EE(S = 15)   0.20474       7.173737       2.119991        0.180365
EE par       0.013649      0.478249       0.141333        0.012024

             Yahoo A(p = 25)  Yahoo G(p = 1)  Yahoo G(p = 25)  TaFeng(p = 1)
BL           0.092954         0.011915        0.044164         0.026728
OSR          0.027887         0.013241        0.047206         0.040919
SMOTE        1.062686         0.056153        0.883698         0.219553
ADASYN       2.050993         0.079073        1.733367         0.306618
RUS          0.048471         0.003234        0.033423         0.002916
CL Knn       0.84391          0.025404        0.502515         0.092167
Far Knn      0.664124         0.026576        0.500206         0.080159
CBU          15.69442         1.287221        13.55035         2.467279
AB           0.445546         0.078777        0.169977         0.114619
AC(R = 28)   1.034044         0.321723        0.515953         0.926178
AC(R = RL)   0.706215         0.226741        0.112949         0.610233
EE(S = 10)   1.026577         0.100331        1.527146         0.058052
EE(S = 15)   1.607596         0.077483        2.472582         0.10538
EE par       0.107173         0.005166        0.164839         0.007025

             TaFeng(p = 25)  Book(p = 1)  Book(p = 25)  LST(p = 1)
BL           0.032033        0.080035     0.318093      0.652045
OSR          0.032414        0.132927     0.092757      0.87152
SMOTE        5.089283        3.409418     11.43444      4.987705
ADASYN       8.148419        3.689661     12.25441      6.840083
RUS          0.020457        0.022713     0.031972      0.432839
CL Knn       1.713731        0.400873     3.711648      2.508374
Far Knn      1.539437        0.379086     3.988552      2.511037
CBU          26.42686        4.198663     46.31987      []
AB           0.713265        0.61719      1.238585      2.466151
AC(R = 28)   1.234647        1.666131     2.330635      1.451671
AC(R = RL)   0.279047        0.860346     0.197053      1.23763
EE(S = 10)   2.484502        2.145747     7.177484      0.524066
EE(S = 15)   3.363971        2.480066     11.21945      0.784111
EE par       0.224265        0.165338     0.747963      0.052274


Table 9 Continued. Additionally, an extra column indicating average ranks is included (lower ranks are preferred)

             Adver(p = [])  Adver(p = 1)  CRF(p = [])  Bank(p = [])
BL           0.010953       0.002796      0.725911     70.89334
OSR          0.012178       0.006166      3.685813     179.7481
SMOTE        0.123112       0.017764      5.633862     []
ADASYN       0.183767       0.021728      5.768669     []
RUS          0.012115       0.00204       0.147392     5.247441
CL Knn       0.061324       0.005568      1.106755     73.73282
Far Knn      0.079078       0.007069      1.110379     97.59619
CBU          3.378235       3.236754      []           []
AB           0.069199       0.103518      1.153196     83.08618
AC(R = 28)   0.193092       0.068905      2.047434     71.70548
AC(R = RL)   0.107652       0.037963      1.387174     106.3466
EE(S = 10)   0.138485       0.085686      0.198656     24.95117
EE(S = 15)   0.185136       0.139121      0.285345     36.40107
EE par       0.012342       0.009275      0.019023     2.426738

             Average Rank [pos]
BL           2.94 [2]
OSR          4.19 [4]
SMOTE        9.59 [11]
ADASYN       10.91 [13]
RUS          1.38 [1]
CL Knn       6.5 [5]
Far Knn      6.56 [6]
CBU          14 [14]
AB           8.06 [7]
AC(R = 28)   10.81 [12]
AC(R = RL)   9.25 [9]
EE(S = 10)   8.25 [8]
EE(S = 15)   9.56 [10]
EE par       3 [3]

Timings are a substantial concern with respect to these large datasets. As can be observed, the EE-technique thrives for these big and highly imbalanced data, even in its non-parallel form.


Fig 5 Average rank AUC versus average rank Time (see Table 9) across all datasets from Table 2 (excluding Flickr and Kdd). Regarding the AUC rank, an extra row EE par is added, having the same AUC-value as EE(S = 15). Ranks for each dataset are subsequently obtained via the procedure outlined in Section 4.6.1. Points occurring in the upper-right region are preferred

5 The effect of the chosen base learner

In this short section, we investigate the influence of the chosen base learner. More specifically, we ask ourselves whether some base inducers can more easily cope with the imbalanced learning issue, and in doing so we verify the conclusion that EE is a suitable technique to improve upon the baseline performance. We rely on several popular classification methods for behaviour data: two regularization based techniques (SVM, LR) and two heuristic types of approaches (Naive Bayes (NB), Behavioural Similarity (BeSim)). The next paragraphs briefly introduce these methods. Note that source codes and dataset results are provided in our on-line repository.

LR constructs a linear model that maximizes the likelihood of the observed data by imposing a logistic model for the parametric probability model P(Y|X). Ng and Jordan (2002) provide a thorough introduction to the topic27 and note that, in its plain form, LR can suffer from overfitting, especially when the input data is very high dimensional. Therefore, we resort to regularized logistic regression. We opt for L2-regularized LR, since L1-regularization corresponds with a natural way of imposing feature selection (Ng 2004). The LIBLINEAR toolbox (Fan et al 2008) is used to obtain these L2-regularized LR models.

27 For an accessible introduction, see the chapter on "Generative and Discriminative Classifiers: Naive Bayes and Logistic Regression" provided at the following link: http://www.cs.cmu.edu/~tom/NewChapters.html


NB (Ng and Jordan 2002) relies on the use of Bayes' rule, which expresses the posterior P(Y|X) as the product of the likelihood P(X|Y) and the prior P(Y), divided by the evidence P(X). The naivety of the approach originates from assuming conditional independence of the features. We impose a multivariate event model regarding the distribution P(X|Y). Junque de Fortuny et al (2014a) provide an efficient NB implementation for dealing with large and sparse binary behaviour data.
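The multivariate (Bernoulli) event model can be written out in a few lines. The toy implementation below is a sketch of that model with Laplace smoothing, not the optimized implementation of Junque de Fortuny et al (2014a); all function names and data are illustrative.

```python
# Toy multivariate (Bernoulli) Naive Bayes for binary behaviour features.
# Conditional independence: P(x|c) factorizes over the features.
import math

def train_nb(X, y):
    classes = sorted(set(y))
    model = {}
    for c in classes:
        rows = [x for x, label in zip(X, y) if label == c]
        prior = len(rows) / len(X)
        # P(x_j = 1 | c) with Laplace smoothing
        theta = [(sum(r[j] for r in rows) + 1) / (len(rows) + 2)
                 for j in range(len(X[0]))]
        model[c] = (math.log(prior), theta)
    return model

def log_posterior(model, x, c):
    log_prior, theta = model[c]
    return log_prior + sum(
        math.log(t if xj else 1 - t) for xj, t in zip(x, theta))

def predict(model, x):
    return max(model, key=lambda c: log_posterior(model, x, c))
```

A sparse implementation would only iterate over the non-zero features, which is what makes NB attractive for behaviour data.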

The BeSim approach conceptually corresponds to projecting the bigraph to a unigraph using the procedure outlined in Section 3.2. Next, the weighted-vote relational neighbour classifier (Macskassy and Provost 2007) is used to infer unknown labels/scores through a weighted probability estimation using the known labels of neighbouring nodes. A very fast and scalable implementation of this approach (the SW-transformation) is presented in detail in Stankova et al (2015).
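The scoring step of the weighted-vote relational neighbour classifier reduces to a weighted average over labelled neighbours. The snippet below is a conceptual sketch on an already-projected unigraph (the function name, graph and weights are toy assumptions, not the SW-transformation of Stankova et al (2015)).

```python
# Sketch of wvRN scoring on a projected unigraph: an unlabelled node's score
# is the weighted average of the class memberships of its labelled neighbours.
def wvrn_score(neighbours, labels):
    """neighbours: list of (node_id, weight); labels: dict node_id -> 0/1."""
    num = sum(w * labels[j] for j, w in neighbours if j in labels)
    den = sum(w for j, w in neighbours if j in labels)
    return num / den if den else 0.5  # no labelled neighbours: uninformative
```

Edge weights would come from the bigraph projection of Section 3.2, so that strongly co-behaving nodes vote with more weight.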

In Table 10, we compare the performances of the baseline (BL) application of the methods with their EE-version (with S = 15, T = 20). It is quite striking to see the EE-version dominating the baseline version when looking at each method individually. This suggests that the imbalanced learning issue has an impact on each of the aforementioned techniques. Focussing on the regularization based approaches, we observe the baseline LR to be less influenced by the imbalance than the SVM method. Their EE-versions (EE SVM and EE LR) can boost the baseline performance in almost all cases and have the highest average ranks amongst all the proposed methods. As already emphasized, the regularization based approaches offer an added element of flexibility, in the sense that the strength of the underlying base learner can be controlled.

The heuristic approaches NB and BeSim do not offer this flexibility. Furthermore, it is not directly possible to integrate instance weights during the training phase of the base learner. During the boosting process, we therefore sample from the distribution Dt to generate (unweighted) examples and train a BeSim/NB learner next. The results in Table 10 indicate BL NB to be the worst performing algorithm amongst all methods tried. Due to the conditional independence assumption, one could regard this as a weak learner, which is therefore suitable to be used in a boosting process. Indeed, the EE NB approach is able to significantly improve upon its performance. We found this to be the slowest EE-version28 (e.g. taking more than 24 hours to process a single fold of the Bank dataset, versus several minutes for the other versions). The average rank estimate of EE NB is rather pessimistic, since we assigned the lowest performance across all methods in case of missing results. BeSim can be considered a stronger learner compared to NB; its possible boost in performance is typically lower than that of NB. Having said this, we found EE BeSim to improve upon BL BeSim for all datasets (except Adver(p = 1)).

28 This statement only relates to the EE-version. In the baseline case, the timings of the NB-version are roughly of the same order of magnitude as SVM and LR (though this heavily depends on the chosen regularization parameter C). With EE, each subset has a size limited to twice the minority class training size (a relatively low number compared to the size of the majority class). This works in favour of SVM/LR (e.g. SVM has a computational complexity of O(m^2), with m the number of training instances). Also note that the NB-version is indeed optimized for sparse data. Internally, the NB-implementation writes a fairly large model file to a specified directory (sizes of 1 GB can be encountered there). This writing operation slows down the training stage (especially in case of EE, where S × T = 15 × 20 = 300 model files are constructed).
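Materializing the boosting distribution Dt for weight-unaware learners amounts to weighted resampling. The helper below is a minimal sketch of that step (the function name is hypothetical): it draws an unweighted training set of the same size with probabilities proportional to the boosting weights.

```python
# When a base learner cannot handle instance weights (NB, BeSim), the
# boosting distribution D_t can be approximated by resampling: draw an
# unweighted training set with probabilities proportional to the weights.
import numpy as np

def resample_from_dt(X, weights, seed=0):
    rng = np.random.default_rng(seed)
    d = np.asarray(weights, dtype=float)
    d /= d.sum()                           # D_t must be a proper distribution
    idx = rng.choice(len(X), size=len(X), replace=True, p=d)
    return [X[i] for i in idx]
```

Hard-to-learn instances, which receive large weights under Dt, are then simply duplicated in the resampled set.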


Table 10 Average tenfold AUC test set performance of the BL compared to its EE-version (S = 15, T = 20) for SVM, LR, BeSim and NB. In case the EE NB implementation crashed, we put a × in the table. If it is too slow (> 24 hours per fold), we place a []. Additionally, average ranks are assigned to each method. In case EE NB has missing results for a particular dataset, we ensure it has the lowest rank across all other algorithms

           Mov G(p = 1)  Mov G(p = 25)  Mov Th(p = [])  Yahoo A(p = 1)
BL SVM     71.6 (2.62)   81.41 (1.32)   79.77 (5.33)    56.49 (3.37)
EE SVM     76.12 (2.88)  85.13 (1.86)   86.43 (5.86)    59.74 (2.96)
BL LR      71.02 (2.09)  84.39 (1.84)   83.14 (4.17)    57.84 (2.39)
EE LR      76.69 (2.92)  85.03 (1.98)   86.3 (5.37)     59.79 (2.62)
BL BeSim   76.1 (3.58)   81.3 (2.92)    82.81 (6.6)     56.27 (2.73)
EE BeSim   76.31 (3.71)  81.37 (2.9)    85.02 (6.28)    57.7 (1.71)
BL NB      70.26 (5.84)  77.01 (2.54)   70.48 (10.14)   52.56 (2.09)
EE NB      75.93 (2.83)  85.56 (2.01)   86.91 (4.15)    57.55 (2.73)

           Yahoo A(p = 25)  Yahoo G(p = 1)  Yahoo G(p = 25)  TaFeng(p = 1)
BL SVM     61.61 (2.48)     66.84 (3.66)    78.82 (1.39)     55.75 (1.6)
EE SVM     66.38 (3.16)     73.48 (2.32)    80.55 (1.55)     61.13 (1.83)
BL LR      66.27 (2.96)     69.82 (1.93)    80.45 (1.59)     58.91 (2.31)
EE LR      66.22 (3.28)     73.08 (2.14)    80.53 (1.56)     61.43 (2.32)
BL BeSim   64.54 (2.02)     68.89 (2.49)    79.55 (1.96)     57.89 (1.18)
EE BeSim   65.25 (2.23)     71.18 (2.91)    80.04 (1.85)     59.36 (1.47)
BL NB      65 (1.65)        63.33 (2.56)    78.89 (1.64)     54.61 (1.2)
EE NB      66.6 (2.79)      70.99 (2.88)    81.01 (1.3)      59.01 (1.84)

           TaFeng(p = 25)  Book(p = 1)   Book(p = 25)  LST(p = 1)
BL SVM     66.94 (1.34)    52.6 (1.29)   60.08 (0.71)  99.99 (0.01)
EE SVM     70.4 (1.3)      55.34 (1.28)  65.4 (0.51)   99.98 (0.01)
BL LR      69.24 (1.3)     55.34 (1.27)  63.84 (0.75)  99.99 (0.01)
EE LR      70.28 (1.28)    55.49 (1.49)  65.41 (0.63)  99.97 (0.02)
BL BeSim   67.49 (1.23)    55.19 (1.27)  63.7 (0.63)   99.99 (0.01)
EE BeSim   68 (1.21)       55.21 (1.15)  64.38 (0.42)  99.99 (0)
BL NB      65.21 (1.64)    52.93 (0.9)   59.75 (0.47)  98.69 (0.3)
EE NB      70.72 (1.15)    ×             63.46 (0.61)  99.92 (0.04)

           Adver(p = [])  Adver(p = 1)  CRF(p = [])    Bank(p = [])
BL SVM     96.37 (1.94)   91.18 (2.97)  64.36 (18.97)  66.82 (0.88)
EE SVM     97.63 (1.35)   93.3 (2.14)   86.35 (9.99)   71.54 (0.76)
BL LR      97.19 (1.44)   88.51 (1.93)  81.87 (19.63)  71.43 (0.72)
EE LR      97.57 (0.96)   93.02 (2.06)  86.84 (9.62)   71.77 (0.62)
BL BeSim   97.26 (1.12)   95.38 (1.35)  86.91 (9.36)   67.85 (0.67)
EE BeSim   97.38 (1.04)   93.83 (1.35)  87.02 (10.43)  70.41 (0.55)
BL NB      93.75 (1.9)    93.37 (1.9)   87.24 (9.38)   67.83 (0.63)
EE NB      94.04 (1.75)   ×             ×              []

           Flickr(p = 0.1)  Kdd(p = 0.5)  Average Rank
BL SVM     74.92 (0.17)     74.53 (0.05)  6.44 [7]
EE SVM     79.86 (0.13)     80.98 (0.05)  2.39 [1]
BL LR      79.03 (0.11)     81.29 (0.04)  4.28 [4]
EE LR      79.85 (0.13)     80.75 (0.05)  2.61 [2]
BL BeSim   74.62 (0.13)     74.95 (0)     5.11 [6]
EE BeSim   76.4 (0.13)      77.55 (0.03)  3.61 [3]
BL NB      81.36 (0.1)      74.29 (0.05)  6.5 [8]
EE NB      []               []            5.06 [5]


6 Conclusions and future research directions

We investigated the effects of sampling and cost-sensitive learning techniques on the problem of learning from imbalanced behaviour data. This type of big data has fundamentally different properties and internal structure than traditional data, and is characterized by sparseness and very large dimensions. This is a new research topic, which enables benefits across a wide variety of application domains, such as fraud detection, targeted advertising, predictive policing, churn prediction, default prediction, etc. To enable future developments in this field, we provide an on-line repository containing implementations, datasets and results, available at the following link: http://www.applieddatamining.com/cms?q=software

Oversampling techniques need to be adapted in order to cope with this kind of data. We proposed new versions of the traditional SMOTE and ADASYN techniques, which differ in the way synthetic samples are generated and similarity computations are performed. The plain random oversampling (OSR) and the synthetic approaches are competitive methods in terms of AUC and do not seem to suffer from overfitting, as traditional studies report. Though the synthetic approaches have comparable AUC-performances in comparison to OSR, they are useless in practice due to their computationally expensive procedures. The OSR method is very fast for small and medium sized datasets, yet becomes inappropriate for larger datasets, because the training times of the underlying base learner increase drastically.

With respect to undersampling methodologies, we employed a random undersampling technique (RUS) and Knn-based informed approaches. The latter type selects majority class instances that are closest (Cl Knn) to or farthest (Far Knn) from the minority class examples, and allows one to investigate the effect of noise/outliers on SVM performance. We observed that these odd instances have a severe performance degrading effect. RUS and Far Knn approaches can significantly improve performance with respect to the baseline, thanks to their abilities to reduce noise/outlier levels and to put more emphasis on the minority class. Having said this, they usually perform worse than any of the oversampling approaches. This is in sheer contrast to the results obtained from traditional studies, and is heavily related to the intrinsic nature of behaviour data. Timing-wise, RUS is the fastest method. The informed approaches are slower than RUS due to the required nearest neighbour computations. Additionally, a cluster-based undersampling (CBU) technique was proposed to alleviate the within-class imbalance. For low undersampling rates, the performance of CBU is superior to plain RUS, whereas for high undersampling rates the opposite is true. Overall, no substantial gain in performance is obtained in comparison with a simple RUS approach. Furthermore, the method is the slowest among all techniques considered and relies on computationally expensive clustering heuristics.

We studied the effects of boosting, where we made use of a very specific linear SVM and logistic regression combination that uses confidence rated predictions, instead of a more popular plain "discrete" boosting algorithm (with weak hypotheses having values in {−1, 1}). Our experiments clearly indicated the regularization constant C in the SVM formulation to act as a "weakness" indicator. Indeed, higher C-values cause stronger learners and shouldn't be used in the boosting process, which matches with the analysis of studies dealing with traditional data. Strong learners can already quite accurately distinguish majority class from minority class behaviours. The boosting process will focus on the hard to learn instances, among which are noise/outliers, and this can result in overfitting behaviour. We also observed that boosting can usually outperform the plain baseline, indicating that a combination of weak learners can achieve higher AUC-performance than a single strong learner.

EasyEnsemble (EE) integrates random undersampling with a boosting process. By sampling several balanced subsets from the training data and combining these, we can emphasize the minority class, achieve noise/outlier reduction effects and simultaneously explore the majority class space efficiently. Our version differs from the original formulation, because we employ a different type of underlying weak learner (SVM and LR combination) that is fed to a confidence rated boosting algorithm (instead of the discrete version). We observed superior AUC-performance with respect to all of the aforementioned techniques. Furthermore, the method is very fast, since each subset can be trained in parallel and has a size that is twice as large as the minority class training set. We studied the effect of the number of subsets and noted that a limited amount of these (O(10)) is already sufficient to obtain satisfactory performances, and increasing this number further causes only minor benefits.

Additionally, a statistical comparison of each of the proposed methods was conducted across a wide variety of datasets. Holm's significance test indicates that each of the proposed techniques (oversampling, undersampling (except Cl Knn), cost sensitive learning and boosting variants) can significantly outperform the plain baseline performance (each of the null-hypotheses of equal performance gets rejected with a confidence level of at least 91%). Comparison of the techniques with EE showed that the latter can significantly outperform all of them (each of the null-hypotheses of equal performance gets rejected with a confidence level of at least 75%). A timing-wise comparison showed EE to be the preferred method (after RUS) for large and highly imbalanced datasets. A parallel implementation would make EE the fastest method, even for medium sized datasets that show a high level of imbalance.

The influence of the chosen base learner was investigated by comparing the baseline to its EE-version for the SVM, LR, NB and BeSim inducers. Each of the EE-versions is able to improve upon the baseline performance for nearly all datasets, where base learners of the regularization type are superior because of their ability to control the strength of the learner.

Though we conducted an extensive benchmark study, there are still plenty of future research directions. First and most important, there is an urgent need to develop a more general theoretical framework to completely understand the effect of imbalance and why certain methods outperform others. Secondly, other types of mechanisms could be investigated in this setting: advanced sampling (Guo et al 2008) set-ups perform sampling based on preliminary classifications; one-class learning is a recognition based methodology where a discriminative boundary is learned around the training examples of a single class alone (Raskutti and Kowalczyk (2004) demonstrated the superiority of the recognition based one-class SVMs over the discriminative two-class SVMs); other types of ensemble learning strategies (e.g. bagging and stacking) could be investigated; Batista et al (2004) note that combining over- and undersampling can be beneficial when data are highly imbalanced or show very few minority class instances; the ideas behind Extended Nearest Neighbour (Tang and He 2015) could guide us in the sampling process. As opposed to Knn, "it considers not only who are the nearest neighbours of a certain sample, but also who consider the particular sample as their nearest neighbour" (Tang and He 2015). The method assigns a class membership to an unknown sample that maximizes the intra-class coherence. Finally, more advanced techniques do exist that can provide a K (the number of nearest neighbours) faster or with a (slightly) better performance.

Another issue relates to the lack of comprehensibility of the EasyEnsemble technique in its present form. Because we currently integrate Logistic Regression in the boosting process, the final ensemble classifier is non-linear and therefore incomprehensible. It might be interesting to investigate the numerical boosting technique proposed in Schapire and Singer (1999), where each weak learner is allowed to output real values (instead of values in {−1, 1}). In that case, we would be able to use a plain linear SVM as a weak learner, without the Logistic Regression component. The final ensemble would be a linear model, which is surely easier to comprehend. Also, the instance-based approach of Martens and Provost (2014) could be used to explain individual predictions of the non-linear EE-technique.
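The observation above can be shown in miniature: a weighted sum of linear scorers is itself a linear scorer, so a real-valued boosting ensemble of plain linear SVMs could be collapsed into a single comprehensible weight vector. The vectors and coefficients below are toy values, not results from the paper.

```python
# A weighted sum of linear models collapses into one linear model:
# sum_t alpha_t * (w_t . x) = (sum_t alpha_t * w_t) . x
import numpy as np

weak_w = [np.array([1.0, -2.0]), np.array([0.5, 0.5])]   # weak-learner weights
alphas = [0.7, 0.3]                                      # boosting coefficients

w_ensemble = sum(a * w for a, w in zip(alphas, weak_w))  # single linear model

x = np.array([2.0, 1.0])
ensemble_score = sum(a * w.dot(x) for a, w in zip(alphas, weak_w))
collapsed_score = w_ensemble.dot(x)
```

This equality is exactly what breaks once a non-linear link (such as the logistic component) is applied inside each weak learner.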

A Oversampling experiments

Table 11 Oversampling experiments. Results show the average tenfold AUC test set performance with respect to increasing β-values (β = [β1, β2, β3, β4] = [0, 1/3, 2/3, 1]). Additionally, standard deviations are shown between brackets. Optimal performances for each method are highlighted in boldface. See Section 4.2 for parameter settings

Mov G(p = 1)
         β1            β2            β3            β4
OSR      71.6 (2.62)   74.37 (2.04)  73.6 (1.84)   74.73 (2.45)
SMOTE    71.6 (2.62)   75.08 (2.18)  76.02 (2.14)  76.48 (2.3)
ADASYN   71.6 (2.62)   75.16 (1.92)  75.93 (2.08)  76.47 (2.29)
Mov G(p = 25)
         β1            β2            β3            β4
OSR      81.41 (1.32)  83.49 (1.81)  83.84 (1.96)  83.91 (2.04)
SMOTE    81.41 (1.32)  83.32 (1.97)  83.59 (2.04)  83.76 (2.11)
ADASYN   81.41 (1.32)  83.61 (1.82)  84.02 (1.97)  83.69 (1.96)
Mov Th(p = [])
         β1            β2            β3            β4
OSR      79.77 (5.33)  85.3 (4.66)   83.16 (4.5)   84.59 (5.69)
SMOTE    79.77 (5.33)  84.18 (6.51)  85.58 (5.97)  84.33 (5.75)
ADASYN   79.77 (5.33)  84.11 (6.77)  85.86 (6.06)  85.36 (5.13)
Yahoo A(p = 1)
         β1            β2            β3            β4
OSR      55.92 (2.97)  58.66 (3.27)  59.99 (2.28)  59.74 (1.78)
SMOTE    55.92 (2.97)  59.76 (2.62)  59.74 (2.67)  59.43 (2.4)
ADASYN   55.92 (2.97)  59.54 (2.53)  59.55 (2.94)  59.56 (2.22)
Yahoo A(p = 25)
         β1            β2            β3            β4
OSR      61.68 (2.42)  64.19 (3.17)  65.08 (3.26)  64.67 (2.1)
SMOTE    61.68 (2.42)  65.46 (3.63)  65.33 (3.23)  64.52 (2.98)
ADASYN   61.68 (2.42)  65.04 (3.74)  65.41 (3.47)  64.4 (2.21)
Continues on next page


Table 11 continued
Yahoo G(p = 1)
         β1            β2            β3            β4
OSR      66.84 (3.66)  72.18 (2.36)  73.11 (2.7)   72.49 (3.41)
SMOTE    66.84 (3.66)  72.65 (2.85)  73.27 (3.36)  73.37 (3.56)
ADASYN   66.84 (3.66)  72.87 (2.83)  73.18 (3.2)   73.39 (3.59)
Yahoo G(p = 25)
         β1            β2            β3            β4
OSR      78.82 (1.39)  78.78 (2.02)  78.74 (1.86)  78.35 (1.47)
SMOTE    78.82 (1.39)  79.23 (1.57)  79.1 (1.2)    79.03 (1.89)
ADASYN   78.82 (1.39)  79.12 (1.43)  79.23 (1.37)  79.51 (2.01)
TaFeng(p = 1)
         β1            β2            β3            β4
OSR      55.75 (1.6)   59.23 (1.96)  60 (1.68)     61.04 (2.36)
SMOTE    55.75 (1.6)   60.26 (1.95)  61.49 (1.8)   61.13 (1.52)
ADASYN   55.75 (1.6)   60.26 (1.9)   61.44 (1.85)  61.16 (1.5)
TaFeng(p = 25)
         β1            β2            β3            β4
OSR      66.94 (1.34)  68.84 (1.21)  66.99 (1.42)  67.7 (1.41)
SMOTE    66.94 (1.34)  68.47 (1.5)   67.07 (1.15)  66.65 (0.81)
ADASYN   66.94 (1.34)  68.62 (1.38)  67.85 (1.6)   66.91 (1.39)
Book(p = 1)
         β1            β2            β3            β4
OSR      52.6 (1.29)   53.61 (0.94)  55.41 (1.75)  55.87 (1.44)
SMOTE    52.6 (1.29)   54.77 (0.99)  54.91 (0.8)   54.36 (0.98)
ADASYN   52.6 (1.29)   54.86 (1.13)  55.06 (0.73)  54.54 (0.92)
Book(p = 25)
         β1            β2            β3            β4
OSR      60.08 (0.71)  61.05 (0.94)  61.82 (1.12)  64.62 (0.57)
SMOTE    60.08 (0.71)  62.6 (0.73)   60.95 (0.68)  63 (0.8)
ADASYN   60.08 (0.71)  62.33 (0.73)  60.77 (0.85)  63.04 (0.58)
LST(p = 1)
         β1            β2            β3            β4
OSR      99.99 (0.01)  99.99 (0.01)  99.99 (0.01)  99.99 (0.01)
SMOTE    99.99 (0.01)  99.99 (0.01)  99.99 (0.01)  99.99 (0.01)
ADASYN   99.99 (0.01)  99.99 (0.01)  99.99 (0.01)  99.99 (0.01)
Adver(p = [])
         β1            β2            β3            β4
OSR      96.61 (1.82)  97.31 (1.65)  97.07 (1.84)  97.07 (1.79)
SMOTE    96.61 (1.82)  96.91 (1.66)  97.19 (1.65)  97.07 (1.91)
ADASYN   96.61 (1.82)  97.1 (1.7)    97.08 (1.87)  97.07 (1.88)
Adver(p = 1)
         β1            β2            β3            β4
OSR      90.93 (3.02)  91.27 (3.03)  92.66 (2.82)  93.29 (1.97)
SMOTE    90.93 (3.02)  92.51 (2.03)  92.96 (2.14)  93.53 (1.81)
ADASYN   90.93 (3.02)  92.22 (2.33)  92.7 (2.36)   93.88 (1.73)
CRF(p = [])
         β1             β2             β3             β4
OSR      64.06 (16.43)  80.82 (12.94)  81.28 (12.27)  81.91 (11.28)
SMOTE    64.06 (16.43)  78.64 (16.86)  82.52 (13.74)  79.32 (16.26)
ADASYN   64.06 (16.43)  78.95 (16.72)  81.19 (16.32)  79.31 (16.19)
Bank(p = [])
         β1            β2            β3            β4
OSR      66.82 (0.88)  70.1 (0.74)   71.39 (0.8)   71.47 (0.8)
SMOTE
ADASYN


B Undersampling experiments

Table 12 Undersampling experiments. Results show the average tenfold AUC test set performance with respect to increasing βu-values (βu = [βu1, βu2, βu3, βu4, βu5] = [0, 1/4, 1/2, 3/4, 1]). Optimal performances (boldface) are highlighted for each imbalance ratio. Additionally, standard deviations are shown between brackets. CL K represents the "Closest Knn" technique; CL T represents the "Closest tot sim" technique (similar for Far K and Far T; see Section 3.2). Parameter settings are given in Section 4.2

Mov G(p = 1)
       βu1         βu2         βu3         βu4         βu5
RUS    71.6(2.6)   71.83(2.6)  72.54(2.5)  72.39(3.1)  70.61(3.5)
Cl K   71.6(2.6)   71.4(2)     70.96(1.9)  70.43(2.4)  69.05(4.1)
CL T   71.6(2.6)   70.28(2.5)  66.74(2)    66.8(2.1)   68.18(3.6)
Far K  71.6(2.6)   72.36(2.7)  71.26(3.4)  66.57(5.2)  53.5(3.5)
Far T  71.6(2.6)   72.22(2.8)  71.63(3.6)  64.28(5.3)  50.88(4.4)
CBU    72.55(2.6)  73.28(2.6)  73.12(2.6)  73.84(2.5)  73(3.1)
Mov G(p = 25)
       βu1         βu2         βu3         βu4         βu5
RUS    81.41(1.3)  81.36(1.3)  81.78(1.7)  82.05(1.7)  81.6(2.1)
Cl K   81.41(1.3)  80.86(1.2)  80.95(1.6)  79.73(2.3)  77.95(2.3)
CL T   81.41(1.3)  79.9(1.2)   78.21(1.4)  77.87(1.5)  77.76(2.3)
Far K  81.41(1.3)  80.9(1.5)   78.17(1.8)  74.25(2.4)  69.79(3.2)
Far T  81.41(1.3)  80.86(1.5)  77.2(2.4)   71.16(2.7)  62.4(2.8)
CBU    81.53(1.4)  81.64(1.3)  81.29(1.6)  81.28(2.1)  80.34(2.7)
Mov Th(p = [])
       βu1         βu2         βu3         βu4         βu5
RUS    79.77(5.3)  80.32(5.8)  81.57(5.5)  81.86(6.6)  81.26(6.2)
Cl K   79.77(5.3)  79.25(4.5)  78.07(5)    76.25(6.5)  62.46(8.5)
CL T   79.77(5.3)  78.4(4.4)   72.41(3.5)  64.66(4.5)  60.37(7.3)
Far K  79.77(5.3)  84.54(5)    83.64(6.4)  80.02(7.3)  56.82(10.3)
Far T  79.77(5.3)  85.03(5.7)  82.68(6.8)  75.61(9.2)  56.77(10.9)
CBU    80.11(5.8)  81.17(6)    81.08(6.5)  84.17(5.1)  80.96(6.9)
Yahoo A(p = 1)
       βu1         βu2         βu3         βu4         βu5
RUS    55.92(3)    55.57(3.4)  56.44(3)    55.83(3.4)  56.37(3.3)
Cl K   55.92(3)    55.67(2.4)  53.12(2)    50.57(1.8)  53.79(3.5)
CL T   55.92(3)    55.69(2.1)  53.35(2.2)  50.31(2.2)  52.35(3.3)
Far K  55.92(3)    57.35(2.2)  56.92(1.1)  56.95(2.3)  51.18(2)
Far T  55.92(3)    56.93(2.4)  54.74(1.9)  57.01(1.8)  51.18(2)
CBU    58.21(2.6)  58.45(3.3)  58.31(3.5)  58.39(3.5)  56.09(2.6)
Yahoo A(p = 25)
       βu1         βu2         βu3         βu4         βu5
RUS    61.68(2.4)  62.9(2.9)   63.62(3.6)  63.75(3.1)  63.19(1.9)
Cl K   61.68(2.4)  61.14(2.1)  57.62(1.6)  54.02(1.8)  51.48(1.4)
CL T   61.68(2.4)  60.89(2.8)  58.11(1.4)  54.4(2.1)   51.76(1.4)
Far K  61.68(2.4)  63.96(3)    62.62(2.2)  59.61(1.5)  56.25(1.6)
Far T  61.68(2.4)  63.71(2.4)  59.72(1.6)  57.27(1.1)  54.47(1.1)
CBU    62.46(2.6)  61.85(1.4)  61.78(2.2)  59.94(3)    60.1(4)
Continues on next page


Table 12 continuedYahoo G(p = 1)

βu1 βu2 βu3 βu4 βu5RUS 6684(37) 6785(32) 6836(32) 6823(4) 699(42)Cl K 6684(37) 6671(28) 643(36) 6198(39) 6115(19)CL T 6684(37) 6579(27) 6355(33) 5921(35) 6108(24)Far K 6684(37) 6676(41) 6384(34) 6516(2) 485(29)Far T 6684(37) 6695(41) 6348(29) 6516(2) 4848(29)CBU 6968(41) 7059(32) 7064(37) 702(29) 6335(36)

Yahoo G(p = 25)βu1 βu2 βu3 βu4 βu5

RUS 7882(14) 7891(16) 7897(16) 7861(16) 7782(21)Cl K 7882(14) 7726(15) 7252(15) 6786(2) 6507(27)CL T 7882(14) 7683(1) 7199(18) 6715(23) 611(27)Far K 7882(14) 7826(22) 7469(27) 6722(21) 6072(23)Far T 7882(14) 7768(26) 7244(3) 6494(24) 596(2)CBU 7525(32) 7522(24) 7469(23) 7307(24) 7069(24)

TaFeng(p = 1)βu1 βu2 βu3 βu4 βu5

RUS 5575(16) 561(16) 5626(17) 5723(17) 5925(22)Cl K 5575(16) 5568(16) 5558(15) 5508(11) 5105(15)CL T 5575(16) 5567(16) 5447(16) 4753(16) 493(11)Far K 5575(16) 5899(12) 5947(11) 6004(12) 5631(1)Far T 5575(16) 5892(13) 5925(13) 5858(11) 5631(1)CBU 578(1) 5847(11) 5815(09) 5887(14) 5765(16)

TaFeng (p = 25)  βu = 1     βu = 2     βu = 3     βu = 4     βu = 5
RUS              6694(13)   6744(13)   681(14)    6827(14)   6613(12)
Cl K             6694(13)   6613(14)   6339(12)   5983(13)   5694(07)
Cl T             6694(13)   6638(15)   6289(16)   5746(13)   5456(13)
Far K            6694(13)   6806(14)   6643(16)   6446(15)   6335(13)
Far T            6694(13)   6431(11)   6269(1)    6127(11)   5903(1)
CBU              6481(12)   6415(11)   6413(12)   6388(08)   6346(08)

Book (p = 1)     βu = 1     βu = 2     βu = 3     βu = 4     βu = 5
RUS              526(13)    5279(09)   5346(08)   5389(09)   5405(09)
Cl K             526(13)    5256(12)   5252(13)   5239(11)   5309(11)
Cl T             526(13)    5256(12)   5252(13)   5239(11)   5305(07)
Far K            526(13)    5521(12)   5621(18)   5614(12)   5306(1)
Far T            526(13)    5521(12)   5621(18)   5614(12)   5306(1)
CBU              5428(09)   5377(1)    5333(11)   5334(09)   5284(08)

Book (p = 25)    βu = 1     βu = 2     βu = 3     βu = 4     βu = 5
RUS              6008(07)   6013(06)   604(08)    6033(08)   6328(08)
Cl K             6008(07)   5996(07)   6013(08)   5996(1)    5928(07)
Cl T             6008(07)   5996(07)   6013(08)   6029(04)   545(09)
Far K            6008(07)   6329(1)    6419(08)   573(11)    5566(11)
Far T            6008(07)   6214(05)   5827(06)   5637(1)    5566(11)
CBU              5482(09)   5467(09)   5471(09)   5466(1)    5478(09)

Imbalanced classification in sparse and large behaviour datasets 47

Table 12 continued

LST (p = 1)      βu = 1     βu = 2     βu = 3     βu = 4     βu = 5
RUS              9999(0)    9999(0)    9999(0)    9998(0)    9999(0)
Cl K             9999(0)    9999(0)    9999(0)    9999(0)    9999(0)
Cl T             9999(0)    9999(0)    9999(0)    9999(0)    9998(0)
Far K            9999(0)    9998(0)    9998(0)    9998(0)    9998(0)
Far T            9999(0)    9998(0)    9998(0)    9998(0)    9998(0)
CBU              []         []         []         []         []

Adver (p = [])   βu = 1     βu = 2     βu = 3     βu = 4     βu = 5
RUS              9661(18)   9632(18)   9663(14)   9712(21)   9622(16)
Cl K             9661(18)   9644(15)   9614(15)   9604(2)    948(25)
Cl T             9661(18)   9587(21)   9432(19)   9301(22)   9072(23)
Far K            9661(18)   9653(14)   9576(2)    9439(18)   9049(31)
Far T            9661(18)   9654(15)   9567(19)   9454(18)   893(28)
CBU              9685(23)   9685(23)   9705(15)   966(16)    9606(21)

Adver (p = 1)    βu = 1     βu = 2     βu = 3     βu = 4     βu = 5
RUS              9093(3)    9153(31)   9237(34)   919(29)    9193(22)
Cl K             9093(3)    9064(3)    8987(39)   9021(36)   8918(2)
Cl T             9093(3)    897(35)    8855(34)   8576(33)   882(23)
Far K            9093(3)    938(23)    924(26)    8873(34)   8551(4)
Far T            9093(3)    9362(24)   932(22)    8841(36)   8551(4)
CBU              9322(24)   9376(25)   9389(26)   9352(27)   9127(2)

CRF (p = [])     βu = 1      βu = 2      βu = 3      βu = 4      βu = 5
RUS              6406(164)   6328(159)   6798(174)   6695(219)   8773(88)
Cl K             6406(164)   6244(166)   6234(169)   7137(138)   7822(177)
Cl T             6406(164)   6244(166)   6234(169)   7137(138)   6267(229)
Far K            6406(164)   838(142)    8393(148)   8449(137)   8611(97)
Far T            6406(164)   838(142)    8393(148)   8449(137)   8611(97)
CBU              []          []          []          []          []

Bank (p = [])    βu = 1     βu = 2     βu = 3     βu = 4     βu = 5
RUS              6682(09)   6702(09)   6737(08)   6799(06)   695(1)
Cl K             6682(09)   6617(07)   6524(06)   6486(06)   5853(11)
Cl T             6682(09)   6492(11)   6069(09)   5633(08)   5287(07)
Far K            6682(09)   6695(06)   6619(06)   6442(06)   5825(11)
Far T            6682(09)   6716(06)   642(08)    5967(1)    5825(11)
CBU              []         []         []         []         []


C Boosting experiments

This section presents the results of each of the boosting experiments from Section 4.5 on the remaining data sources. Each of the figures below depicts the average tenfold AUC-performance on test data (with μ = 100) with respect to the number of boosting iterations for (left) AdaBoost (AB), AdaCost (AC) and EasyEnsemble (EE) with C chosen according to highest validation set AUC-performance (over all possible boosting rounds) and (right) AB and EE (with S = 15) with varying C-levels.

[Figure: two panels plotting test AUC against the number of boosting iterations T. Panel (a) compares AB, AC R2, AC R8, AC RD, EE S5, EE S10, EE S15 and BL; panel (b) compares AB and EE at C = 1e-07, 1e-05, 0.001 and 0.1 against BL.]

Fig 6 Mov G(p = 1) dataset

[Figure: two panels plotting test AUC against the number of boosting iterations T. Panel (a) compares AB, AC R2, AC R8, AC RD, EE S5, EE S10, EE S15 and BL; panel (b) compares AB and EE at C = 1e-07, 1e-05, 0.001 and 0.1 against BL.]

Fig 7 Mov Th(p = []) dataset


[Figure: two panels plotting test AUC against the number of boosting iterations T. Panel (a) compares AB, AC R2, AC R8, AC RD, EE S5, EE S10, EE S15 and BL; panel (b) compares AB and EE at C = 1e-07, 1e-05, 0.001 and 0.1 against BL.]

Fig 8 Yahoo A(p = 1) dataset

[Figure: two panels plotting test AUC against the number of boosting iterations T. Panel (a) compares AB, AC R2, AC R8, AC RD, EE S5, EE S10, EE S15 and BL; panel (b) compares AB and EE at C = 1e-07, 1e-05, 0.001 and 0.1 against BL.]

Fig 9 Yahoo A(p = 25) dataset

[Figure: two panels plotting test AUC against the number of boosting iterations T. Panel (a) compares AB, AC R2, AC R8, AC RD, EE S5, EE S10, EE S15 and BL; panel (b) compares AB and EE at C = 1e-07, 1e-05, 0.001 and 0.1 against BL.]

Fig 10 Yahoo G(p = 1) dataset


[Figure: two panels plotting test AUC against the number of boosting iterations T. Panel (a) compares AB, AC R2, AC R8, AC RD, EE S5, EE S10, EE S15 and BL; panel (b) compares AB and EE at C = 1e-07, 1e-05, 0.001 and 0.1 against BL.]

Fig 11 Yahoo G(p = 25) dataset

[Figure: two panels plotting test AUC against the number of boosting iterations T. Panel (a) compares AB, AC R2, AC R8, AC RD, EE S5, EE S10, EE S15 and BL; panel (b) compares AB and EE at C = 1e-07, 1e-05, 0.001 and 0.1 against BL.]

Fig 12 TaFeng(p = 1) dataset

[Figure: two panels plotting test AUC against the number of boosting iterations T. Panel (a) compares AB, AC R2, AC R8, AC RD, EE S5, EE S10, EE S15 and BL; panel (b) compares AB and EE at C = 1e-07, 1e-05, 0.001 and 0.1 against BL.]

Fig 13 Book(p = 1) dataset


[Figure: two panels plotting test AUC against the number of boosting iterations T. Panel (a) compares AB, AC R2, AC R8, AC RD, EE S5, EE S10, EE S15 and BL; panel (b) compares AB and EE at C = 1e-07, 1e-05, 0.001 and 0.1 against BL.]

Fig 14 LST(p = 1) dataset

[Figure: two panels plotting test AUC against the number of boosting iterations T. Panel (a) compares AB, AC R2, AC R8, AC RD, EE S5, EE S10, EE S15 and BL; panel (b) compares AB and EE at C = 1e-07, 1e-05, 0.001 and 0.1 against BL.]

Fig 15 Adver(p = []) dataset

[Figure: two panels plotting test AUC against the number of boosting iterations T. Panel (a) compares AB, AC R2, AC R8, AC RD, EE S5, EE S10, EE S15 and BL; panel (b) compares AB and EE at C = 1e-07, 1e-05, 0.001 and 0.1 against BL.]

Fig 16 Adver(p = 1) dataset


[Figure: two panels plotting test AUC against the number of boosting iterations T. Panel (a) compares AB, AC R2, AC R8, AC RD, EE S5, EE S10, EE S15 and BL; panel (b) compares AB and EE at C = 1e-07, 1e-05, 0.001 and 0.1 against BL.]

Fig 17 CRF(p = []) dataset

D Final Comparison

[Figure: scatter of average rank AUC (x-axis) versus average rank Time (y-axis) for BL, OSR, SMOTE, ADASYN, RUS, Cl Knn, Far Knn, CBU, AB, AC(R = 2, 8), AC(R = RL), EE(S = 10), EE(S = 15) and EE par.]

Fig 18 Average rank AUC versus average rank Time (see Table 9) across the large datasets from Table 2 (CRF and Bank). Regarding the former, an extra row EE par is added, having the same AUC-value as EE(S = 15). Ranks for each dataset are subsequently obtained via the procedure outlined in Section 4.6.1. Points occurring in the upper-right region are preferred


References

Akbani R, Kwek S, Japkowicz N (2004) Applying support vector machines to imbalanced datasets. In: Machine Learning: ECML 2004, 15th European Conference on Machine Learning, Pisa, Italy, September 20-24, 2004, Proceedings. Springer Berlin Heidelberg, pp 39–50. DOI 10.1007/978-3-540-30115-8_7

Ali A, Shamsuddin SM, Ralescu AL (2015) Classification with class imbalance problem: A review. International Journal of Advances in Soft Computing and its Applications 7(3):176–204

Alzahrani T, Horadam KJ (2016) Community detection in bipartite networks: Algorithms and case studies. In: Complex Systems and Networks: Dynamics, Controls and Applications. Springer Berlin Heidelberg, pp 25–50. DOI 10.1007/978-3-662-47824-0_2

Bachner J (2013) Predictive policing: Preventing crime with data and analytics. IBM Center for the Business of Government

Baesens B, Van Gestel T, Viaene S, Stepanova M, Suykens J, Vanthienen J (2003) Benchmarking state-of-the-art classification algorithms for credit scoring. Journal of the Operational Research Society 54(6):627–635. DOI 10.1057/palgrave.jors.2601545

Barandela R, Sánchez J, García V, Rangel E (2003) Strategies for learning in class imbalance problems. Pattern Recognition 36(3):849–851. DOI 10.1016/S0031-3203(02)00257-1

Barber MJ (2007) Modularity and community detection in bipartite networks. Physical Review E 76:066102. DOI 10.1103/PhysRevE.76.066102

Barua S, Islam MM, Yao X, Murase K (2014) MWMOTE – majority weighted minority oversampling technique for imbalanced data set learning. IEEE Transactions on Knowledge and Data Engineering 26(2):405–425. DOI 10.1109/TKDE.2012.232

Batista GEAPA, Prati RC, Monard MC (2004) A study of the behavior of several methods for balancing machine learning training data. SIGKDD Explor Newsl 6(1):20–29. DOI 10.1145/1007730.1007735

Beckett SJ (2016) Improved community detection in weighted bipartite networks. Royal Society Open Science 3(1). DOI 10.1098/rsos.140536

Bekkar M, Djemaa HK, Alitouche TA (2013) Evaluation measures for models assessment over imbalanced data sets. Journal of Information Engineering and Applications 3(10):27–38

Bhattacharyya S, Jha S, Tharakunnel K, Westland JC (2011) Data mining for credit card fraud: A comparative study. Decision Support Systems 50(3):602–613. DOI 10.1016/j.dss.2010.08.008

Blondel VD, Guillaume JL, Lambiotte R, Lefebvre E (2008) Fast unfolding of communities in large networks. Journal of Statistical Mechanics: Theory and Experiment 2008(10):P10008

Breiman L, Friedman J, Stone CJ, Olshen RA (1984) Classification and regression trees. Taylor & Francis

Brozovsky L, Petricek V (2007) Recommender system for online dating service. In: Proceedings of Znalosti 2007 Conference, VSB Ostrava

Cha M, Mislove A, Gummadi KP (2009) A measurement-driven analysis of information propagation in the Flickr social network. In: Proceedings of the 18th International Conference on World Wide Web, ACM, New York, NY, USA, WWW '09, pp 721–730. DOI 10.1145/1526709.1526806

Chawla NV (2005) Data mining for imbalanced datasets: An overview. In: Data Mining and Knowledge Discovery Handbook. Springer US, Boston, MA, pp 853–867

Chawla NV, Bowyer KW, Hall LO, Kegelmeyer WP (2002) SMOTE: synthetic minority over-sampling technique. Journal of Artificial Intelligence Research 16:321–357

Chawla NV, Lazarevic A, Hall LO, Bowyer KW (2003) SMOTEBoost: Improving prediction of the minority class in boosting. In: Knowledge Discovery in Databases: PKDD 2003. Springer Berlin Heidelberg, pp 107–119

Chawla NV, Japkowicz N, Kotcz A (2004) Editorial: Special issue on learning from imbalanced data sets. SIGKDD Explor Newsl 6(1):1–6. DOI 10.1145/1007730.1007733

Chen M, Mao S, Liu Y (2014) Big data: A survey. Mobile Networks and Applications 19(2):171–209. DOI 10.1007/s11036-013-0489-0

Chyi YM (2003) Classification analysis techniques for skewed class distribution problems. Master thesis, Department of Information Management, National Sun Yat-Sen University

Demšar J (2006) Statistical comparisons of classifiers over multiple data sets. Journal of Machine Learning Research 7(Jan):1–30

Dhillon IS (2001) Co-clustering documents and words using bipartite spectral graph partitioning. In: Proceedings of the Seventh ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, ACM, New York, NY, USA, KDD '01, pp 269–274. DOI 10.1145/502512.502550


Drummond C, Holte RC (2003) C4.5, class imbalance, and cost sensitivity: why under-sampling beats over-sampling. In: Proceedings of the ICML '03 Workshop on Learning from Imbalanced Datasets

Fan RE, Chang KW, Hsieh CJ, Wang XR, Lin CJ (2008) LIBLINEAR: A library for large linear classification. Journal of Machine Learning Research 9:1871–1874

Fan W, Stolfo SJ, Zhang J, Chan PK (1999) AdaCost: Misclassification cost-sensitive boosting. In: Proceedings of the Sixteenth International Conference on Machine Learning, Morgan Kaufmann Publishers Inc., San Francisco, CA, USA, ICML '99, pp 97–105

Fawcett T (2006) An introduction to ROC analysis. Pattern Recognition Letters 27(8):861–874. DOI 10.1016/j.patrec.2005.10.010

Finch H (2005) Comparison of distance measures in cluster analysis with dichotomous data. Journal of Data Science 3(1):85–100

Fortunato S (2010) Community detection in graphs. Physics Reports 486(3-5):75–174. DOI 10.1016/j.physrep.2009.11.002

Junque de Fortuny E, Martens D, Provost F (2014a) Predictive modeling with big data: is bigger really better? Big Data 1(4):215–226. DOI 10.1089/big.2013.0037

Junque de Fortuny E, Stankova M, Moeyersoms J, Minnaert B, Provost F, Martens D (2014b) Corporate residence fraud detection. In: Proceedings of the 20th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, ACM, New York, NY, USA, KDD '14, pp 1650–1659. DOI 10.1145/2623330.2623333

Frasca M, Bertoni A, Re M, Valentini G (2013) A neural network algorithm for semi-supervised node label learning from unbalanced data. Neural Networks 43:84–98. DOI 10.1016/j.neunet.2013.01.021

Friedman M (1937) The use of ranks to avoid the assumption of normality implicit in the analysis of variance. Journal of the American Statistical Association 32(200):675–701

García E, Lozano F (2007) Boosting support vector machines. In: Machine Learning and Data Mining in Pattern Recognition, 5th International Conference, MLDM 2007, Leipzig, Germany, July 18-20, Post Proceedings. IBaI Publishing, pp 153–167

Goldstein M, Uchida S (2016) A comparative evaluation of unsupervised anomaly detection algorithms for multivariate data. PLOS ONE 11(4):1–31. DOI 10.1371/journal.pone.0152173

González PC, Velásquez JD (2013) Characterization and detection of taxpayers with false invoices using data mining techniques. Expert Systems with Applications 40(5):1427–1436. DOI 10.1016/j.eswa.2012.08.051

Guimerà R, Sales-Pardo M, Amaral LAN (2007) Module identification in bipartite and directed networks. Physical Review E 76:036102. DOI 10.1103/PhysRevE.76.036102

Guo H, Viktor HL (2004) Learning from imbalanced data sets with boosting and data generation: The DataBoost-IM approach. SIGKDD Explor Newsl 6(1):30–39. DOI 10.1145/1007730.1007736

Guo X, Yin Y, Dong C, Yang G, Zhou G (2008) On the class imbalance problem. In: 2008 Fourth International Conference on Natural Computation, IEEE, vol 4, pp 192–201. DOI 10.1109/ICNC.2008.871

Han H, Wang WY, Mao BH (2005) Borderline-SMOTE: A new over-sampling method in imbalanced data sets learning. In: Advances in Intelligent Computing. Springer Berlin Heidelberg, pp 878–887

He H, Garcia EA (2009) Learning from imbalanced data. IEEE Transactions on Knowledge and Data Engineering 21(9):1263–1284. DOI 10.1109/TKDE.2008.239

He H, Bai Y, Garcia EA, Li S (2008) ADASYN: Adaptive synthetic sampling approach for imbalanced learning. In: 2008 IEEE International Joint Conference on Neural Networks (IEEE World Congress on Computational Intelligence), IEEE, pp 1322–1328. DOI 10.1109/IJCNN.2008.4633969

Holm S (1979) A simple sequentially rejective multiple test procedure. Scandinavian Journal of Statistics 6(2):65–70

Hsu CW, Lin CJ (2002) A comparison of methods for multiclass support vector machines. IEEE Transactions on Neural Networks 13(2):415–425. DOI 10.1109/72.991427

Huang A (2008) Similarity measures for text document clustering. In: Proceedings of the Sixth New Zealand Computer Science Research Student Conference (NZCSRSC2008), Christchurch, New Zealand, pp 49–56

Iman RL, Davenport JM (1980) Approximations of the critical region of the Friedman statistic. Communications in Statistics - Theory and Methods 9(6):571–595

Jo T, Japkowicz N (2004) Class imbalances versus small disjuncts. ACM SIGKDD Explor Newsl 6(1):40–49. DOI 10.1145/1007730.1007737


Jutla IS, Jeub LG, Mucha PJ (2011-2016) A generalized Louvain method for community detection implemented in MATLAB. URL http://netwiki.amath.unc.edu/GenLouvain

Kubat M, Matwin S (1997) Addressing the curse of imbalanced training sets: One-sided selection. In: Proceedings of the Fourteenth International Conference on Machine Learning, Morgan Kaufmann Publishers Inc., San Francisco, CA, USA, pp 179–186

Lancichinetti A, Fortunato S (2009) Community detection algorithms: A comparative analysis. Physical Review E 80:056117. DOI 10.1103/PhysRevE.80.056117

Larremore DB, Clauset A, Jacobs AZ (2014) Efficiently inferring community structure in bipartite networks. Physical Review E: Statistical, Nonlinear, and Soft Matter Physics 90:012805. DOI 10.1103/PhysRevE.90.012805

Li J, Fine JP (2010) Weighted area under the receiver operating characteristic curve and its application to gene selection. Journal of the Royal Statistical Society: Series C (Applied Statistics) 59(4):673–692. DOI 10.1111/j.1467-9876.2010.00713.x

Li X, Wang L, Sung E (2008) AdaBoost with SVM-based component classifiers. Engineering Applications of Artificial Intelligence 21(5):785–795. DOI 10.1016/j.engappai.2007.07.001

Lichman M (2013) UCI machine learning repository. URL http://archive.ics.uci.edu/ml

Liu W, Chawla S, Cieslak DA, Chawla NV (2010) A robust decision tree algorithm for imbalanced data sets. In: Proceedings of the Tenth SIAM International Conference on Data Mining, SIAM, Philadelphia, vol 10, pp 766–777

Liu XY, Wu J, Zhou ZH (2009) Exploratory undersampling for class-imbalance learning. IEEE Transactions on Systems, Man, and Cybernetics, Part B (Cybernetics) 39(2):539–550. DOI 10.1109/TSMCB.2008.2007853

Luts J, Ojeda F, Van de Plas R, De Moor B, Van Huffel S, Suykens JA (2010) A tutorial on support vector machine-based methods for classification problems in chemometrics. Analytica Chimica Acta 665(2):129–145. DOI 10.1016/j.aca.2010.03.030

Macskassy SA, Provost F (2007) Classification in networked data: A toolkit and a univariate case study. Journal of Machine Learning Research 8(May):935–983

Martens D, Provost F (2014) Explaining data-driven document classifications. MIS Quarterly 38(1):73–100

Martens D, Provost F, Clark J, Junque de Fortuny E (2016) Mining massive fine-grained behavior data to improve predictive analytics. MIS Quarterly 40(4):869–888

Mazurowski MA, Habas PA, Zurada JM, Lo JY, Baker JA, Tourassi GD (2008) Training neural network classifiers for medical decision making: The effects of imbalanced datasets on classification performance. Neural Networks 21(2-3):427–436. DOI 10.1016/j.neunet.2007.12.031

Mease D, Wyner AJ, Buja A (2007) Boosted classification trees and class probability/quantile estimation. Journal of Machine Learning Research 8:409–439

Nemenyi P (1963) Distribution-free multiple comparisons. Dissertation, Princeton University

Newman MEJ, Girvan M (2004) Finding and evaluating community structure in networks. Physical Review E 69:026113. DOI 10.1103/PhysRevE.69.026113

Ng AY (2004) Feature selection, L1 vs. L2 regularization, and rotational invariance. In: Proceedings of the Twenty-first International Conference on Machine Learning, ACM, New York, NY, USA, ICML '04, pp 78. DOI 10.1145/1015330.1015435

Ng AY, Jordan MI (2002) On discriminative vs. generative classifiers: A comparison of logistic regression and naive Bayes. In: Advances in Neural Information Processing Systems 14, MIT Press, pp 841–848

Ngai E, Hu Y, Wong Y, Chen Y, Sun X (2011) The application of data mining techniques in financial fraud detection: A classification framework and an academic review of literature. Decision Support Systems 50(3):559–569. DOI 10.1016/j.dss.2010.08.006

Platt JC (1999) Probabilistic outputs for support vector machines and comparisons to regularized likelihood methods. In: Advances in Large-Margin Classifiers, MIT Press, pp 61–74

Porter MA, Onnela JP, Mucha PJ (2009) Communities in networks. Notices of the American Mathematical Society 56(9):1082–1097

Provost F, Fawcett T (2013) Data Science for Business: What you need to know about data mining and data-analytic thinking. O'Reilly Media, Inc.

Provost F, Dalessandro B, Hook R, Zhang X, Murray A (2009) Audience selection for on-line brand advertising: Privacy-friendly social network targeting. In: Proceedings of the 15th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, ACM, New York, NY, USA, KDD '09, pp 707–716. DOI 10.1145/1557019.1557098


Raskutti B, Kowalczyk A (2004) Extreme re-balancing for SVMs: A case study. SIGKDD Explor Newsl 6(1):60–69. DOI 10.1145/1007730.1007739

Rosvall M, Bergstrom CT (2008) Maps of random walks on complex networks reveal community structure. Proceedings of the National Academy of Sciences 105(4):1118–1123. DOI 10.1073/pnas.0706851105

Schapire RE (1999) A brief introduction to boosting. In: Proceedings of the 16th International Joint Conference on Artificial Intelligence - Volume 2, Morgan Kaufmann Publishers Inc., San Francisco, CA, USA, IJCAI '99, pp 1401–1406

Schapire RE, Singer Y (1999) Improved boosting algorithms using confidence-rated predictions. Machine Learning 37(3):297–336. DOI 10.1023/A:1007614523901

Shmueli G (2017) Analyzing behavioral big data: Methodological, practical, ethical, and moral issues. Quality Engineering 29(1):57–74. DOI 10.1080/08982112.2016.1210979

Sobhani P, Viktor H, Matwin S (2015) Learning from imbalanced data using ensemble methods and cluster-based undersampling. In: New Frontiers in Mining Complex Patterns: Third International Workshop, NFMCP 2014, Held in Conjunction with ECML-PKDD 2014, Nancy, France, September 19, 2014, Revised Selected Papers. Springer International Publishing, Cham, pp 69–83. DOI 10.1007/978-3-319-17876-9_5

Stankova M (2016) Classification within network data with a bipartite structure. Dissertation, University of Antwerp

Stankova M, Martens D, Provost F (2015) Classification over bipartite graphs through projection. Working Papers 2015001, University of Antwerp, Faculty of Applied Economics

Sun Y, Kamel MS, Wong AK, Wang Y (2007) Cost-sensitive boosting for classification of imbalanced data. Pattern Recognition 40(12):3358–3378. DOI 10.1016/j.patcog.2007.04.009

Suykens JA, Van Gestel T, De Brabanter J, De Moor B, Vandewalle J (2002) Least Squares Support Vector Machines. World Scientific, Singapore

Tang B, He H (2015) ENN: Extended nearest neighbor method for pattern recognition [research frontier]. IEEE Computational Intelligence Magazine 10(3):52–60. DOI 10.1109/MCI.2015.2437512

Tang Y, Zhang YQ, Chawla NV, Krasser S (2009) SVMs modeling for highly imbalanced classification. IEEE Transactions on Systems, Man, and Cybernetics, Part B (Cybernetics) 39(1):281–288. DOI 10.1109/TSMCB.2008.2002909

Tobback E, Moeyersoms J, Stankova M, Martens D (2016) Bankruptcy prediction for SMEs using relational data. Working Paper 2016004, University of Antwerp, Faculty of Applied Economics

Verbeke W, Dejaeger K, Martens D, Hur J, Baesens B (2012) New insights into churn prediction in the telecommunication sector: A profit driven data mining approach. European Journal of Operational Research 218(1):211–229. DOI 10.1016/j.ejor.2011.09.031

Veropoulos K, Campbell I, Cristianini N (1999) Controlling the sensitivity of support vector machines. In: Proceedings of the International Joint Conference on Artificial Intelligence, Stockholm, Sweden (IJCAI99), pp 55–60

Whitrow C, Hand DJ, Juszczak P, Weston D, Adams NM (2009) Transaction aggregation as a strategy for credit card fraud detection. Data Mining and Knowledge Discovery 18(1):30–55. DOI 10.1007/s10618-008-0116-z

Wickramaratna J, Holden SB, Buxton BF (2001) Performance degradation in boosting. In: Proceedings of the Second International Workshop on Multiple Classifier Systems, Springer, London, UK, MCS '01, pp 11–21

Yen SJ, Lee YS (2009) Cluster-based under-sampling approaches for imbalanced data distributions. Expert Systems with Applications 36(3, Part 1):5718–5727. DOI 10.1016/j.eswa.2008.06.108

Yu HF, Lo HY, Hsieh HP, Lou JK, McKenzie TG, Chou JW, Chung PH, Ho CH, Chang CF, Wei YH, et al (2010) Feature engineering and classifier ensemble for KDD Cup 2010. In: Proceedings of the KDD Cup 2010 Workshop, pp 1–16

Zha H, He X, Ding C, Simon H, Gu M (2001) Bipartite graph partitioning and data clustering. In: Proceedings of the Tenth International Conference on Information and Knowledge Management, ACM, New York, NY, USA, CIKM '01, pp 25–32. DOI 10.1145/502585.502591

Zhang J, Mani I (2003) kNN approach to unbalanced data distributions: A case study involving information extraction. In: Proceedings of the ICML'2003 Workshop on Learning from Imbalanced Datasets, Washington, DC

Ziegler CN, McNee SM, Konstan JA, Lausen G (2005) Improving recommendation lists through topic diversification. In: Proceedings of the 14th International Conference on World Wide Web, ACM, New York, NY, USA, WWW '05, pp 22–32. DOI 10.1145/1060745.1060754



Table 4 continued

TaFeng (p = 25)  βu = 1     βu = 2     βu = 3     βu = 4     βu = 5
RUS              6694(13)   6744(13)   681(14)    6827(14)   6613(12)
Cl K             6694(13)   6613(14)   6339(12)   5983(13)   5694(07)
Cl T             6694(13)   6638(15)   6289(16)   5746(13)   5456(13)
Far K            6694(13)   6806(14)   6643(16)   6446(15)   6335(13)
Far T            6694(13)   6431(11)   6269(1)    6127(11)   5903(1)
CBU              6481(12)   6415(11)   6413(12)   6388(08)   6346(08)

Book (p = 25)    βu = 1     βu = 2     βu = 3     βu = 4     βu = 5
RUS              6008(07)   6013(06)   604(08)    6033(08)   6328(08)
Cl K             6008(07)   5996(07)   6013(08)   5996(1)    5928(07)
Cl T             6008(07)   5996(07)   6013(08)   6029(04)   545(09)
Far K            6008(07)   6329(1)    6419(08)   573(11)    5566(11)
Far T            6008(07)   6214(05)   5827(06)   5637(1)    5566(11)
CBU              5482(09)   5467(09)   5471(09)   5466(1)    5478(09)

4.5 Boosting variants

In this section the plain AdaBoost (AB) method is compared with the cost-sensitive AdaCost (AC) algorithm, the EasyEnsemble (EE) technique and the baseline (BL) method (where we train a single SVM on the imbalanced training data). AB, AC and EE have an underlying boosting process in common. The AUC-values are stored during each boosting iteration, which allows us to show the performance graphically as a function of the number of iterations t. In EasyEnsemble (EE) we combine the weak learners of each subset by summing their individual contributions. For example, when t = 2 boosting iterations are performed, we evaluate the performance of the combined learner ∑_{s=1}^{S} ∑_{t=1}^{2} α_{s,t} h_{s,t}(x). Figures 1, 2, 3 and 4 show the average tenfold AUC-performance (with μ = 100) on the test data. The left-side figures show the AB, AC (with varying R-values) and EE (with varying S-values) performance with respect to the number of boosting iterations. The C-value is tuned according to highest validation set AUC-performance (over all possible boosting iterations). The right-side figures investigate the influence of the C-parameter for AB and EE (with S = 15) and allow us to gain insight on the effect of weak and strong learners on the generalization performance. Results on the remaining datasets are given in Appendix C, Figures 6-17. Note that we only indicate results with weight-percentage μ = 100 (use all instances in the training process). Previous experiments (with μ = 75) showed inferior results. This suggests that it is indeed important to include all minority instances during the training of each weak learner, an observation also made by Liu et al (2009).
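The combined-learner evaluation just described can be sketched in a few lines. This is an illustrative reconstruction under our own naming (`ee_score`, `ensembles`), not the authors' implementation:

```python
def ee_score(x, ensembles, t=None):
    """EasyEnsemble decision value for instance x.

    `ensembles` holds S boosted classifiers, one per balanced subset;
    each classifier is a list of (alpha, h) pairs produced by boosting,
    where h maps an instance to a label in {-1, +1} and alpha is its
    vote weight.  Passing t evaluates only the first t boosting rounds
    of every subset, which is how the AUC-versus-t curves are traced.
    """
    return sum(alpha * h(x)
               for boosted in ensembles
               for alpha, h in (boosted if t is None else boosted[:t]))
```

With S subsets and t = 2 this computes exactly the double sum ∑_{s=1}^{S} ∑_{t'=1}^{2} α_{s,t'} h_{s,t'}(x) from the text; ranking instances by this score is enough to compute AUC.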

When investigating the influence of the regularization parameter C in AdaBoost, we can graphically verify that it is a "weakness" indicator, in the sense that lower C-values correspond to weaker learners. While these low C-value learners have a low performance in the first rounds of boosting, the possible boost in performance is usually much higher compared to higher C-value inducers. Strong learners can already make a suitable distinction between minority and majority class behaviours in the first boosting round. The boosting process will emphasize the hard to learn instances, among which are noise/outliers. Therefore, too strong learners are not suitable to be used in a boosting process, as already noted by Wickramaratna et al (2001). In Figures 2(b) and 3(b) we observe that the lowest C-values (C = 10^-7, 10^-5) can outperform higher C-values (C = 10^-3, 10^-1). In many cases the AB-process outperforms the baseline (BL); see also the results in Table 5. A combination of several weak learners can often achieve better generalization than a single strong learner.
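The "emphasize the hard instances" mechanism can be made concrete with one textbook AdaBoost step; this is a generic sketch of the standard update, not the paper's code, and the function name is ours:

```python
import math

def adaboost_round(weights, predictions, labels):
    """One AdaBoost step over a labelled sample.

    Given the current instance weights, the weak hypothesis' predictions
    and the true labels (both in {-1, +1}), return its weighted error,
    its vote alpha = 0.5 * ln((1 - err) / err), and the re-normalised
    weights, which grow for misclassified (hard) instances, including
    any noise/outliers.
    """
    err = sum(w for w, p, y in zip(weights, predictions, labels) if p != y)
    alpha = 0.5 * math.log((1 - err) / err)
    new_w = [w * math.exp(-alpha * p * y)
             for w, p, y in zip(weights, predictions, labels)]
    z = sum(new_w)                      # normalisation constant
    return err, alpha, [w / z for w in new_w]
```

A strong learner leaves only noise/outliers misclassified, so every subsequent round concentrates weight on exactly those points, which is the degradation effect discussed above.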

Inspired by the success of the RUS method with βu = 1 (for reasons indicated in Section 4.4), the EE-technique tries to overcome an obvious downside of this approach: we are removing a lot of majority class instances that can be very informative. Hence, we try to make up for this loss by combining several of these balanced training sets, in the hope that the majority class is more adequately explored. This results in a very effective learner that seems to outperform the AB, AC and BL methods in the majority of cases, as can be seen from these figures. It is remarkable that we don't need to use that many subsets S to achieve high performances. Increasing the S-value from 5 to 15 doesn't seem to cause major increases in AUC (see also Table 5). The influence of the C-parameter is similar as described in the previous paragraph.
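The subset construction just motivated can be sketched as follows; an assumed helper of ours, not the authors' code:

```python
import random

def balanced_subsets(majority_idx, minority_idx, S, seed=0):
    """Draw the S training sets used by EasyEnsemble.

    Every set keeps all minority instances and pairs them with an
    equally large random sample of the majority class, so each set is
    roughly twice the minority class size; the union of the S samples
    explores the majority class more fully than a single random
    undersample (RUS with beta_u = 1) would.
    """
    rng = random.Random(seed)
    return [list(minority_idx) + rng.sample(list(majority_idx), len(minority_idx))
            for _ in range(S)]
```

Each returned set would then be fed to its own boosting run, and the S boosted classifiers are combined by summing their decision values.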

[Figure: two panels plotting test AUC against the number of boosting iterations T. Panel (a) compares AB, AC R2, AC R8, AC RD, EE S5, EE S10, EE S15 and BL; panel (b) compares AB and EE at C = 1e-07, 1e-05, 0.001 and 0.1 against BL.]

Fig 1 Mov G(p = 25) dataset results showing average tenfold AUC-performance on test data (with μ = 100) for (a) AB, AC and EE with C chosen according to highest validation set AUC-performance (over all possible T), (b) AB and EE (S = 15) with varying C-levels


[Figure: two panels plotting test AUC against the number of boosting iterations T. Panel (a) compares AB, AC R2, AC R8, AC RD, EE S5, EE S10, EE S15 and BL; panel (b) compares AB and EE at C = 1e-07, 1e-05, 0.001 and 0.1 against BL.]

Fig 2 Book(p = 25) dataset

[Figure: two panels plotting test AUC against the number of boosting iterations T. Panel (a) compares AB, AC R2, AC R8, AC RD, EE S5, EE S10, EE S15 and BL; panel (b) compares AB and EE at C = 1e-07, 1e-05, 0.001 and 0.1 against BL.]

Fig 3 TaFeng(p = 25) dataset

[Figure: two panels plotting test AUC against the number of boosting iterations T. Panel (a) compares AB, AC R2, AC R8, AC RD, EE S5, EE S10, EE S15 and BL; panel (b) compares AB and EE at C = 1e-07, 1e-05, 0.001 and 0.1 against BL.]

Fig 4 Bank(p = []) dataset


4.6 Final comparison

4.6.1 Performance-wise comparison

In this section we wish to determine which of the aforementioned methods shows superior performance in terms of AUC on test data. The experimental set-up is described in Section 4.2. Note that we exclude β = 0 and βu = 0 in determining the performance of the oversampling respectively undersampling techniques, to be able to compare them with the baseline (BL) approach.²³ The results for AB, AC and EE are shown for μ = 100. The number of boosting iterations t ∈ [0, T] is also considered as a tunable parameter. Table 5 shows the final comparison between all methods. Additionally, we added an average rank column indicating the mean rank of each algorithm across all datasets. The rank of an algorithm for a specific dataset corresponds to its position in the sorted AUC-performance spectrum, where the best performing algorithm gets the rank of 1, the second best rank 2, etc. In case two or more algorithms perform equally well, average ranks are assigned (if two algorithms have ranks of 3 and 4 respectively and show equal performance, then both methods get assigned a rank of 3.5). Note that the LST dataset is excluded in the calculation of average ranks, since we consider this as a dataset not being affected by imbalance and showing equal performances.
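The tie-handling rank scheme just described can be written down directly; a sketch under our own naming, not the authors' code:

```python
from collections import defaultdict

def ranks_one_dataset(auc):
    """auc: {method: AUC}.  Returns {method: rank}, 1 = best; tied
    methods share the mean of the positions they occupy (e.g. two
    methods tied at positions 3 and 4 both receive 3.5)."""
    ordered = sorted(auc, key=auc.get, reverse=True)
    ranks = {}
    i = 0
    while i < len(ordered):
        j = i
        while j + 1 < len(ordered) and auc[ordered[j + 1]] == auc[ordered[i]]:
            j += 1                        # extend the tie group
        mean_rank = ((i + 1) + (j + 1)) / 2   # positions are 1-based
        for m in ordered[i:j + 1]:
            ranks[m] = mean_rank
        i = j + 1
    return ranks

def average_ranks(datasets):
    """datasets: iterable of {method: AUC} dicts, one per dataset;
    returns the mean rank of every method across the datasets."""
    totals, n = defaultdict(float), 0
    for auc in datasets:
        n += 1
        for m, r in ranks_one_dataset(auc).items():
            totals[m] += r
    return {m: t / n for m, t in totals.items()}
```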

As a side note, we noticed that increasing the number of minority training set instances can significantly improve performance. This is clearly visible when comparing the p = 25 outcomes with the results of p = 1 on the same dataset. Junqué de Fortuny et al (2014a) already showed empirically that larger behaviour datasets (in terms of the number of instances or features) lead to higher AUC-performances. This is confirmed in our experiments, though we want to add that it only holds when the data size is increased in a balanced fashion.

The results in Table 5 confirm our previous analysis. Each of the oversampling, undersampling (except Cl Knn) and boosting variants seems to improve upon the BL performance. Sobhani et al (2015) note that the direct application of ensemble techniques on imbalanced data does not circumvent the imbalanced learning issue. In our experiments on behaviour data, however, we do find that a simple boosting procedure applied to the imbalanced data improves upon the predictive performance. The oversampling techniques OSR, SMOTE and ADASYN show similar outcomes and are competitive methods, seeming to outperform the undersampling techniques and the plain boosting (AB) and cost-sensitive learning (AC) approaches. When dealing with traditional data, the performances of oversampling techniques are generally worse than those of undersampling approaches (Drummond and Holte 2003). It is interesting to observe that for these large and sparse behaviour datasets the opposite is true. This is heavily related to the intrinsic properties of the data under consideration. Undersampling (or feature selection, for that matter) is not recommended because, as we have noted in Section 2.1, each feature provides a small additional contribution to the predictive performance. With respect to AC, the results indicate that R = RL is a better choice compared to more random cost ratios R = 28. The EE-technique has the

23 The BL technique trains single SVMs on the imbalanced training data.

30 Jellis Vanhoeyveld David Martens

lowest average rank and is our best performing method. The combination of several balanced subsets with filtered noise/outlier contributions allows one to explore the majority class space efficiently. To strengthen our previous statements, we will in the next paragraphs conduct statistical significance tests in accordance with the procedure outlined in Demšar (2006). The latter paper presents a framework to perform statistical comparisons of classifiers over multiple datasets.

The first null-hypothesis we try to reject postulates that all algorithms perform equally well or, equivalently, that their average ranks R_j (see Table 5) do not differ. Under the null-hypothesis, the Friedman statistic (Friedman 1937)

\chi^2_F = \frac{12N}{k(k+1)} \left[ \sum_{j=1}^{k} R_j^2 - \frac{k(k+1)^2}{4} \right], \quad (6)

with N the number of datasets (N = 15) and k the number of algorithms (k = 13), can be used to compute the statistic (Iman and Davenport 1980)

F_F = \frac{(N-1)\,\chi^2_F}{N(k-1) - \chi^2_F}. \quad (7)

The latter is distributed according to the F-distribution with k − 1 and (k − 1)(N − 1) degrees of freedom. Evaluating this measure with the average ranks shown in Table 5 results in F_F = 22.98, which is a lot higher than the critical value of 1.81 at the α = 0.05 significance level. We can therefore reject the null-hypothesis and proceed with post-hoc tests.
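As an illustration, the Friedman and Iman–Davenport statistics can be evaluated directly on the average-rank column of Table 5. The sketch below copies that rank vector; the small gap with the value reported in the text comes from the ranks being rounded to three decimals in the table.

```python
def friedman_chi2(ranks, N):
    """Friedman statistic, Eq. (6): ranks is the list of k average ranks."""
    k = len(ranks)
    return 12 * N / (k * (k + 1)) * (sum(R * R for R in ranks) - k * (k + 1) ** 2 / 4)

def iman_davenport_F(ranks, N):
    """Iman-Davenport correction, Eq. (7), F-distributed with (k-1, (k-1)(N-1)) dof."""
    k = len(ranks)
    chi2 = friedman_chi2(ranks, N)
    return (N - 1) * chi2 / (N * (k - 1) - chi2)

# Average ranks from Table 5 (BL, OSR, SMOTE, ADASYN, RUS, Cl Knn, Far Knn,
# CBU, AB, AC(R = 28), AC(R = RL), EE(S = 10), EE(S = 15)); N = 15 datasets.
avg_ranks = [11.600, 5.000, 4.533, 4.800, 8.167, 12.467, 8.133,
             8.567, 8.267, 8.467, 5.400, 3.267, 2.333]
print(iman_davenport_F(avg_ranks, N=15))  # ≈ 23, matching the reported F_F = 22.98 up to rank rounding
```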

The Nemenyi test (Nemenyi 1963) allows one to compare each classifier to one another and adjusts the critical value to compensate for making k(k − 1)/2 comparisons.24 "The performance of two classifiers is significantly different if the corresponding average ranks differ by at least the critical difference (CD)" (Demšar 2006). We refer to the aforementioned paper for the calculation of the CD. The result of these comparisons at the α = 0.05 significance level is shown in Table 6. This test is already quite revealing in the sense that, for instance, the oversampling methods (OSR, SMOTE and ADASYN), AC (R = RL) and the EE-techniques are significantly different from the BL. Yet it does not indicate that undersampling or applying plain boosting (AB) results in significantly different performances compared to the BL. The Nemenyi test is still conservative and we therefore compare all classifiers to a single control classifier in the next paragraphs. Note that we only make k − 1 comparisons in that situation, resulting in stronger results compared to the Nemenyi test.

When comparing a single classifier i with a control classifier c, the null-hypothesis states that the two classifiers i and c perform equally well. Under this hypothesis, the test statistic (based on the average ranks R_i shown in Table 5)

z = \frac{R_i - R_c}{\sqrt{k(k+1)/(6N)}} \quad (8)

is distributed according to a standard normal distribution. This z-value is used to find the corresponding probability (p-value) p from the cumulative standard normal

24 The larger the number of comparisons, the higher the proportion of null-hypotheses that are wrongfully rejected due to random chance.


Table 5 Final comparison of all methods in terms of AUC test set performance; see Section 4.2 for details on the experimental set-up. β ≠ 0 and βu ≠ 0 for the oversampling and undersampling techniques respectively; µ = 100 for each of the boosting variants. Standard deviations are given between brackets and the performance difference with the BL is shown in square brackets.

Mov G(p = 1) | Mov G(p = 25) | Mov Th(p = []) | Yahoo A(p = 1)
BL 71.6 (2.62) [0] | 81.41 (1.32) [0] | 79.77 (5.33) [0] | 55.92 (2.97) [0]
OSR 75.35 (2.27) [3.8] | 83.76 (2.09) [2.3] | 85.13 (6.1) [5.4] | 60.05 (2.71) [4.1]
SMOTE 76.16 (2.27) [4.6] | 83.7 (2.1) [2.3] | 85.67 (4.98) [5.9] | 60.1 (3) [4.2]
ADASYN 76.07 (2.26) [4.5] | 83.63 (2.04) [2.2] | 85.65 (5.6) [5.9] | 59.9 (2.99) [4]
RUS 72.88 (2.73) [1.3] | 81.52 (2.15) [0.1] | 82.91 (7.19) [3.1] | 57.04 (1.77) [1.1]
Cl Knn 71.43 (1.36) [-0.2] | 80.88 (1.19) [-0.5] | 78.87 (4.71) [-0.9] | 55.78 (2.71) [-0.1]
Far Knn 71.9 (2.95) [0.3] | 80.9 (1.48) [-0.5] | 84.07 (4.64) [4.3] | 57.2 (1.33) [1.3]
CBU 74.17 (2.36) [2.6] | 81.51 (1.04) [0.1] | 82.76 (7.22) [3] | 58.77 (3.43) [2.8]
AB 71.65 (1.73) [0.1] | 84.52 (1.89) [3.1] | 82.43 (5.18) [2.7] | 58.35 (2.62) [2.4]
AC(R = 28) 71.61 (2.46) [0] | 83.46 (1.82) [2] | 83.27 (5.6) [3.5] | 57.72 (2.47) [1.8]
AC(R = RL) 74.65 (2.7) [3.1] | 83.35 (2.09) [1.9] | 85.41 (4.49) [5.6] | 59.47 (2.33) [3.5]
EE(S = 10) 76.04 (2.66) [4.4] | 85.05 (1.85) [3.6] | 86.1 (5.78) [6.3] | 59.66 (3.13) [3.7]
EE(S = 15) 76.12 (2.88) [4.5] | 85.14 (1.86) [3.7] | 86.42 (5.86) [6.7] | 59.76 (2.93) [3.8]

Yahoo A(p = 25) | Yahoo G(p = 1) | Yahoo G(p = 25) | TaFeng(p = 1)
BL 61.68 (2.42) [0] | 66.84 (3.66) [0] | 78.82 (1.39) [0] | 55.75 (1.6) [0]
OSR 64.59 (3.12) [2.9] | 73.08 (2.96) [6.2] | 78.52 (2.01) [-0.3] | 61.21 (2.24) [5.5]
SMOTE 65.56 (3.33) [3.9] | 73.11 (3.12) [6.3] | 79.01 (1.21) [0.2] | 61.72 (1.81) [6]
ADASYN 65.13 (3.38) [3.4] | 73.22 (3.17) [6.4] | 79.74 (1.68) [0.9] | 61.68 (1.86) [5.9]
RUS 64.11 (2.8) [2.4] | 70.65 (3.39) [3.8] | 78.91 (1.55) [0.1] | 59.25 (2.18) [3.5]
Cl Knn 61.14 (2.13) [-0.5] | 66.34 (3.54) [-0.5] | 77.26 (1.46) [-1.6] | 55.77 (1.28) [0]
Far Knn 63.96 (3.03) [2.3] | 66.97 (3.54) [0.1] | 78.26 (2.2) [-0.6] | 59.98 (1.26) [4.2]
CBU 62.27 (1.79) [0.6] | 71.27 (2.89) [4.4] | 75.22 (2.42) [-3.6] | 58.4 (1.57) [2.6]
AB 63.88 (2.67) [2.2] | 68.9 (2.03) [2.1] | 79.01 (1.66) [0.2] | 56.21 (1.79) [0.5]
AC(R = 28) 64.32 (3.56) [2.6] | 68.89 (3.11) [2] | 78.99 (1.89) [0.2] | 56.33 (1.83) [0.6]
AC(R = RL) 64.31 (3.03) [2.6] | 73.13 (2.8) [6.3] | 78.41 (2) [-0.4] | 61.6 (2.26) [5.9]
EE(S = 10) 66.51 (3.24) [4.8] | 72.61 (3.15) [5.8] | 80.52 (1.6) [1.7] | 61.2 (1.82) [5.4]
EE(S = 15) 66.36 (3.18) [4.7] | 73.48 (2.32) [6.6] | 80.54 (1.56) [1.7] | 61.13 (1.83) [5.4]

TaFeng(p = 25) | Book(p = 1) | Book(p = 25) | LST(p = 1)
BL 66.94 (1.34) [0] | 52.6 (1.29) [0] | 60.08 (0.71) [0] | 99.99 (0.01) [0]
OSR 68.77 (1.23) [1.8] | 55.87 (1.42) [3.3] | 64.62 (0.57) [4.5] | 99.99 (0.01) [0]
SMOTE 68.47 (1.5) [1.5] | 55.07 (0.88) [2.5] | 62.96 (0.82) [2.9] | 99.99 (0.01) [0]
ADASYN 68.48 (1.47) [1.5] | 55.04 (0.91) [2.4] | 63.02 (0.57) [2.9] | 99.99 (0.01) [0]
RUS 68.28 (1.39) [1.3] | 54.26 (0.92) [1.7] | 63.28 (0.8) [3.2] | 99.98 (0.01) [0]
Cl Knn 66.13 (1.43) [-0.8] | 52.69 (1.3) [0.1] | 60.02 (0.79) [-0.1] | 99.99 (0.01) [0]
Far Knn 68.06 (1.41) [1.1] | 56.25 (1.52) [3.7] | 64.15 (1.12) [4.1] | 99.98 (0.01) [0]
CBU 63.84 (1.07) [-3.1] | 53.75 (1.01) [1.2] | 54.68 (0.88) [-5.4] | []
AB 67.65 (1.55) [0.7] | 54.27 (1.95) [1.7] | 65 (0.67) [4.9] | 99.99 (0.01) [0]
AC(R = 28) 69.31 (1.23) [2.4] | 53.72 (1) [1.1] | 61.24 (0.8) [1.2] | 99.98 (0.01) [0]
AC(R = RL) 67.15 (1.51) [0.2] | 55.73 (1.22) [3.1] | 64.6 (0.64) [4.5] | 99.99 (0.01) [0]
EE(S = 10) 70.3 (1.35) [3.4] | 55.09 (1.29) [2.5] | 65.37 (0.61) [5.3] | 99.98 (0.01) [0]
EE(S = 15) 70.4 (1.3) [3.5] | 55.35 (1.26) [2.8] | 65.4 (0.51) [5.3] | 99.98 (0.01) [0]


Table 5 Continued. Additionally, an average rank column is added showing the mean rank of each algorithm across all datasets. Note that we excluded the LST dataset for this purpose.

Adver(p = []) | Adver(p = 1) | CRF(p = []) | Bank(p = [])
BL 96.61 (1.82) [0] | 90.93 (3.02) [0] | 64.06 (16.43) [0] | 66.82 (0.88) [0]
OSR 96.93 (1.91) [0.3] | 93.3 (2.02) [2.4] | 80.74 (12.93) [16.7] | 71.39 (0.79) [4.6]
SMOTE 97.05 (1.66) [0.4] | 93.35 (2.01) [2.4] | 78.7 (16.56) [14.6] | []
ADASYN 96.91 (1.95) [0.3] | 93.46 (2.21) [2.5] | 78.87 (16.71) [14.8] | []
RUS 96.81 (1.87) [0.2] | 92.38 (2.51) [1.5] | 83.98 (5.99) [19.9] | 69.41 (1.19) [2.6]
Cl Knn 96.4 (1.48) [-0.2] | 89.73 (3.42) [-1.2] | 76.63 (16.19) [12.6] | 66.17 (0.72) [-0.6]
Far Knn 95.77 (1.81) [-0.8] | 93.88 (1.78) [3] | 83.75 (13.11) [19.7] | 66.95 (0.56) [0.1]
CBU 97.15 (1.88) [0.5] | 94.18 (2.3) [3.3] | [] | []
AB 97.34 (2.18) [0.7] | 91.39 (3.23) [0.5] | 77.62 (15.15) [13.6] | 66.82 (0.88) [0]
AC(R = 28) 97.44 (1.93) [0.8] | 91 (3.35) [0.1] | 68.31 (14.93) [4.2] | 67.67 (0.71) [0.9]
AC(R = RL) 97.46 (1.71) [0.8] | 93.51 (2.17) [2.6] | 85.08 (9.77) [21] | 70.7 (0.8) [3.9]
EE(S = 10) 97.64 (1.35) [1] | 92.97 (2.75) [2] | 86.18 (10.17) [22.1] | 71.46 (0.81) [4.6]
EE(S = 15) 97.63 (1.35) [1] | 93.3 (2.14) [2.4] | 86.35 (9.99) [22.3] | 71.54 (0.76) [4.7]

Average Rank
BL 11.600
OSR 5.000
SMOTE 4.533
ADASYN 4.800
RUS 8.167
Cl Knn 12.467
Far Knn 8.133
CBU 8.567
AB 8.267
AC(R = 28) 8.467
AC(R = RL) 5.400
EE(S = 10) 3.267
EE(S = 15) 2.333

Table 6 Nemenyi test outcomes at the α = 0.05 significance level. A value of 1 indicates that the null-hypothesis is rejected and thus that the two algorithms are found significantly different (a value of 0 means we accept the null-hypothesis). Methods are listed in the same order as in Table 5 and are represented more concisely.

     BL RO SM AD RU Cl Fa CBU AB AC1 AC2 EE1 EE2
BL    0  1  1  1  0  0  0  0   0  0   1   1   1
RO    1  0  0  0  0  1  0  0   0  0   0   0   0
SM    1  0  0  0  0  1  0  0   0  0   0   0   0
AD    1  0  0  0  0  1  0  0   0  0   0   0   0
RU    0  0  0  0  0  0  0  0   0  0   0   1   1
Cl    0  1  1  1  0  0  0  0   0  0   1   1   1
Fa    0  0  0  0  0  0  0  0   0  0   0   1   1
CBU   0  0  0  0  0  0  0  0   0  0   0   1   1
AB    0  0  0  0  0  0  0  0   0  0   0   1   1
AC1   0  0  0  0  0  0  0  0   0  0   0   1   1
AC2   1  0  0  0  0  1  0  0   0  0   0   0   0
EE1   1  0  0  0  1  1  1  1   1  1   0   0   0
EE2   1  0  0  0  1  1  1  1   1  1   0   0   0


distribution function: p = 2 × min(Φ(z), 1 − Φ(z)). Holm's method (Holm 1979) compares k − 1 classifiers to a single control classifier and sorts the corresponding p-values in ascending order, so that p1 ≤ p2 ≤ ... ≤ p(k−1). Each pi is subsequently compared to its associated confidence level25 αcomp = α/(k − i). Holm starts by performing the check p1 < α/(k − 1) and rejects the null-hypothesis if the test passes. It then proceeds with a similar check for i = 2 and continues until a certain null-hypothesis cannot be rejected. The remaining hypotheses are retained.
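The control-classifier z-statistic of Eq. (8) and Holm's step-down procedure can be sketched as follows. This is illustrative only: Φ is obtained from math.erf, and the average ranks are taken from Table 5 with BL as control.

```python
import math

def phi(z):
    """Standard normal cumulative distribution function."""
    return 0.5 * (1 + math.erf(z / math.sqrt(2)))

def holm_vs_control(ranks, control, N, alpha=0.05):
    """ranks: dict name -> average rank over N datasets. Returns the set of
    methods significantly different from the control under Holm's step-down test."""
    k = len(ranks)
    se = math.sqrt(k * (k + 1) / (6 * N))        # denominator of Eq. (8)
    pvals = []
    for name, R in ranks.items():
        if name == control:
            continue
        z = (R - ranks[control]) / se
        pvals.append((2 * min(phi(z), 1 - phi(z)), name))
    pvals.sort()                                  # ascending p-values
    rejected = set()
    for i, (p, name) in enumerate(pvals, start=1):
        if p < alpha / (k - i):                   # compare to alpha_comp = alpha / (k - i)
            rejected.add(name)
        else:
            break                                 # first non-rejection: retain the rest
    return rejected

ranks = {"BL": 11.600, "OSR": 5.000, "SMOTE": 4.533, "ADASYN": 4.800,
         "RUS": 8.167, "Cl Knn": 12.467, "Far Knn": 8.133, "CBU": 8.567,
         "AB": 8.267, "AC(R = 28)": 8.467, "AC(R = RL)": 5.400,
         "EE(S = 10)": 3.267, "EE(S = 15)": 2.333}
print(holm_vs_control(ranks, control="BL", N=15))
```

Run on the Table 5 ranks, this recovers the six methods found significant against the BL in Table 7.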

Table 7 shows the result of the Holm test at the α = 0.05 significance level with the BL algorithm as control classifier. The table lists each method according to sorted p-values and shows the z-, p- and αcomp-values. The significance column indicates whether the proposed method is significantly different from the BL and coincidentally matches the result of the Nemenyi test. The p-values of the (Far Knn, RUS, CBU) undersampling methods, AB and AC (R = 28) are only slightly higher than the level required to reject the null-hypothesis. We included a critical significance level αcrit, corresponding to the lowest possible significance level at which the method would be considered significantly different from the BL (if α = αcrit then p = αcomp). In the case of AB, for example, if we were to choose a significance level α = 0.0764 (a confidence level of 92.36%), the associated p-value would be smaller than αcomp and we would proceed to conclude26 that AB performs significantly differently from the BL. To summarize: the null-hypothesis of equal performance between the BL and an alternative method proposed in Section 3 (except Cl Knn) is rejected with a confidence level of more than 91% according to Holm's test. Hence, the performance of the BL approach is significantly worse than that of each of the proposed over- and undersampling methods (except Cl Knn), cost-sensitive learning techniques (AC) and boosting variants (AB, EE).

Similar to the previous paragraph, Table 8 presents the results of Holm's test at the α = 0.05 significance level with EE(S = 15) as control classifier. The results coincidentally correspond with the result of the Nemenyi test and indicate that the oversampling (OSR, SMOTE and ADASYN), AC(R = RL) and EE(S = 10) methods are not statistically different from the EE(S = 15) technique at the 0.05 significance level. However, if we were to raise the significance level to approximately α = 0.25 (a 75% confidence level), the results would conclude significance (except for EE(S = 10)). To summarize: the null-hypothesis of equal performance between EE(S = 15) and an alternative method proposed in Section 3 (except EE(S = 10)) is rejected with a confidence level of more than 75% according to Holm's test. Hence, the performance of the oversampling, undersampling, AB and AC approaches is significantly worse than that of EE(S = 15). We can therefore safely conclude that EE is the preferred method in terms of AUC performance when dealing with imbalanced behaviour data.

25 αcomp adjusts the value of α to compensate for multiple comparisons.
26 This is not entirely true, since Holm's method requires Far Knn to pass the test first before proceeding, which only occurs at the 0.088662 significance level.


Table 7 Holm test at the α = 0.05 significance level with BL as reference. The table shows the z test statistic with associated p-value; αcomp = α/(k − i), with k the number of algorithms (k = 13) and i the position of the method in the sorted p-value vector. Since the p-values are ranked in ascending order, i equals the row number (e.g. OSR has i = 5). The column "significant" denotes whether we can reject the null-hypothesis (significant: p < αcomp). αcrit corresponds to the smallest possible significance level at which we would decide to reject the null-hypothesis (αcrit = α p/αcomp).

z | p | αcomp | significant | αcrit
EE(S = 15) -6.51642 | 7.2E-11 | 0.004167 | 1 | 8.64E-10
EE(S = 10) -5.86009 | 4.63E-09 | 0.004545 | 1 | 5.09E-08
SMOTE -4.96936 | 6.72E-07 | 0.005 | 1 | 6.72E-06
ADASYN -4.78183 | 1.74E-06 | 0.005556 | 1 | 1.56E-05
OSR -4.64119 | 3.46E-06 | 0.00625 | 1 | 2.77E-05
AC(R = RL) -4.35991 | 1.3E-05 | 0.007143 | 1 | 9.11E-05
Far Knn -2.4378 | 0.014777 | 0.008333 | 0 | 0.088662
RUS -2.41436 | 0.015763 | 0.01 | 0 | 0.078815
AB -2.34404 | 0.019076 | 0.0125 | 0 | 0.076305
AC(R = 28) -2.20339 | 0.027567 | 0.016667 | 0 | 0.082701
CBU -2.13307 | 0.032919 | 0.025 | 0 | 0.065837
Cl Knn 0.609449 | 0.542227 | 0.05 | 0 | 0.542227

Table 8 Holm test at the α = 0.05 significance level with EE(S = 15) as reference.

z | p | αcomp | significant | αcrit
Cl Knn 7.12587 | 1.03E-12 | 0.004167 | 1 | 1.24E-11
BL 6.516421 | 7.2E-11 | 0.004545 | 1 | 7.92E-10
CBU 4.383348 | 1.17E-05 | 0.005 | 1 | 0.000117
AC(R = 28) 4.313027 | 1.61E-05 | 0.005556 | 1 | 0.000145
AB 4.172384 | 3.01E-05 | 0.00625 | 1 | 0.000241
RUS 4.102063 | 4.09E-05 | 0.007143 | 1 | 0.000287
Far Knn 4.078623 | 4.53E-05 | 0.008333 | 1 | 0.000272
AC(R = RL) 2.156513 | 0.031044 | 0.01 | 0 | 0.155218
OSR 1.875229 | 0.060761 | 0.0125 | 0 | 0.243045
ADASYN 1.734587 | 0.082814 | 0.016667 | 0 | 0.248442
SMOTE 1.547064 | 0.121848 | 0.025 | 0 | 0.243696
EE(S = 10) 0.65633 | 0.511612 | 0.05 | 0 | 0.511612

4.6.2 Timing-wise comparison

The performance of each of the proposed algorithms in terms of training times depends highly upon the chosen parameters. The β and βu parameters control the number of training instances and are an important contribution to the subsequent SVM learning times for each of the over- and undersampling techniques. The number of boosting iterations t ∈ [0,T] is also a major factor in each of the boosting variants proposed in Section 3.3. Furthermore, the chosen regularization parameter C highly influences the training times of the SVM inducer: higher C-values result in larger training times (Fan et al 2008). Not only do the chosen hyperparameters affect learning times; data characteristics such as the number of minority instances, the imbalance level, the number of features, the amount of overlap, etc. have a major effect as well.


In comparing each of the methods outlined in Section 3, we make use of a methodology similar to the one presented previously (see Section 4.6.1). For each of the ten folds, a suitable parameter combination is selected based on optimal validation set AUC-performance. These chosen parameters are applied in determining the timings of each method. The final results indicate the average timings across ten folds. Hence, we show the required amount of time for each method when using the parameter combination resulting in the best performance. Note that this might, for instance, result in using β = 1/3 in case of OSR, or using only t = 5 boosting iterations in EE. Computational timings are shown in Table 9.

As one can observe from the aforementioned table, CBU is the slowest method, as it relies on computationally intensive clustering procedures. The synthetic SMOTE and ADASYN oversampling approaches are also very time consuming. They both rely on computationally expensive nearest neighbour computations, in conjunction with a synthetic sample generation procedure that requires a single pass over all minority training instances.

The RUS method shows, as expected, superior results in terms of computational timings. The other undersampling methods perform quite well, though they are slower than BL and OSR due to their inherent nearest neighbour computations.

The boosting variants seem to perform worse than BL, OSR and each of the undersampling methods. Since EE was the best method in terms of AUC-performance (see Section 4.6.1), we focus on this learner here. For the medium sized datasets (all but CRF and Bank), OSR outperforms EE(S = 15). Yet for the large datasets (CRF and Bank) the opposite is true. An SVM is known to have a quadratic computational complexity in the number of training instances (Suykens et al 2002; García and Lozano 2007). For this reason we consider OSR (or any other oversampling technique) inappropriate for large datasets. Since with EE each subset has a size limited to twice the minority class training size, we expect increasing benefits of the EE-technique for larger and more imbalanced behaviour data. As Junqué de Fortuny et al (2014a) have observed, larger behaviour data in terms of the number of instances or features also contributes to an increase in predictive power. Hence, we expect to see this type of big behaviour data more often in real-life applications, where computational resources can become a major constraint. Because the EE-technique is easily parallelizable, by simultaneously performing the boosting process for each of the independent subsets, we added an extra row in Table 9: EE par corresponds to the time needed for EE(S = 15) divided by 15, indicating the time that would be required for EE if each subset were trained in parallel. We can see that EE par can outperform OSR in 8 out of 16 cases, precisely for the datasets that show a high imbalance level (e.g. p = 1 or p = []).
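A minimal sketch of the parallel EE variant discussed above may help fix ideas. It is illustrative only: a simple centroid-distance scorer stands in for the boosted linear SVMs of the actual method, instances are dense lists rather than sparse behaviour vectors, and the helper names are our own.

```python
import random
from concurrent.futures import ThreadPoolExecutor

def train_scorer(minority, majority_subset):
    """Stand-in base learner: scores by distance to the two class centroids."""
    def centroid(X):
        d = len(X[0])
        return [sum(x[j] for x in X) / len(X) for j in range(d)]
    c_min, c_maj = centroid(minority), centroid(majority_subset)
    def score(x):  # higher = more minority-like
        d2 = lambda c: sum((a - b) ** 2 for a, b in zip(x, c))
        return d2(c_maj) - d2(c_min)
    return score

def easy_ensemble(minority, majority, S=15, seed=0):
    """Draw S random majority subsets of minority-class size, train one model
    per subset in parallel, and average the resulting decision scores."""
    rng = random.Random(seed)
    subsets = [rng.sample(majority, len(minority)) for _ in range(S)]
    with ThreadPoolExecutor() as pool:
        scorers = list(pool.map(lambda sub: train_scorer(minority, sub), subsets))
    return lambda x: sum(f(x) for f in scorers) / len(scorers)
```

Each base model only ever sees 2 × |minority| instances, which is what keeps the per-subset training cost bounded regardless of the majority class size.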

We conclude our comparison of the proposed methods in Figure 5, which shows the average ranks in terms of computational timings (see Table 9) and AUC performance for each of the available data sources. Regarding the ranks with respect to the latter, we added an extra row EE par having the same AUC-performance as EE(S = 15) for each dataset in Table 5. Ranks are subsequently determined according to the procedure outlined in Section 4.6.1. Additionally, Figure 18 in Appendix D shows the result for the large and highly imbalanced CRF and Bank datasets. We expect the latter type of data to be of major interest for reasons already indicated.


Table 9 Computational timings (in seconds) averaged across ten folds, using the parameter combination resulting in the highest validation set AUC-performance. First and second best performances are emphasized in boldface (third and fourth best are underlined). EE par is the time required for EE(S = 15) divided by 15.

Mov G(p = 1) | Mov G(p = 25) | Mov Th(p = []) | Yahoo A(p = 1)
BL 0.032889 | 0.056697 | 0.558563 | 0.026922
OSR 0.055043 | 0.062802 | 0.99009 | 0.044421
SMOTE 0.218821 | 0.937057 | 3.841482 | 0.057726
ADASYN 0.284688 | 1.802399 | 5.191265 | 0.087694
RUS 0.011431 | 0.025383 | 0.155224 | 0.007991
Cl Knn 0.046599 | 0.599846 | 0.989914 | 0.037182
Far Knn 0.039887 | 0.80072 | 0.683023 | 0.027788
CBU 1.034111 | 10.60173 | 68.22839 | 1.692477
AB 0.169792 | 0.841443 | 3.460246 | 0.139251
AC(R = 28) 0.471994 | 2.996585 | 10.86907 | 0.366555
AC(R = RL) 0.53376 | 1.179542 | 6.065177 | 0.209015
EE(S = 10) 0.117226 | 6.065145 | 1.17995 | 0.148973
EE(S = 15) 0.20474 | 7.173737 | 2.119991 | 0.180365
EE par 0.013649 | 0.478249 | 0.141333 | 0.012024

Yahoo A(p = 25) | Yahoo G(p = 1) | Yahoo G(p = 25) | TaFeng(p = 1)
BL 0.092954 | 0.011915 | 0.044164 | 0.026728
OSR 0.027887 | 0.013241 | 0.047206 | 0.040919
SMOTE 1.062686 | 0.056153 | 0.883698 | 0.219553
ADASYN 2.050993 | 0.079073 | 1.733367 | 0.306618
RUS 0.048471 | 0.003234 | 0.033423 | 0.002916
Cl Knn 0.84391 | 0.025404 | 0.502515 | 0.092167
Far Knn 0.664124 | 0.026576 | 0.500206 | 0.080159
CBU 15.69442 | 1.287221 | 13.55035 | 2.467279
AB 0.445546 | 0.078777 | 0.169977 | 0.114619
AC(R = 28) 1.034044 | 0.321723 | 0.515953 | 0.926178
AC(R = RL) 0.706215 | 0.226741 | 0.112949 | 0.610233
EE(S = 10) 1.026577 | 0.100331 | 1.527146 | 0.058052
EE(S = 15) 1.607596 | 0.077483 | 2.472582 | 0.10538
EE par 0.107173 | 0.005166 | 0.164839 | 0.007025

TaFeng(p = 25) | Book(p = 1) | Book(p = 25) | LST(p = 1)
BL 0.032033 | 0.080035 | 0.318093 | 0.652045
OSR 0.032414 | 0.132927 | 0.092757 | 0.87152
SMOTE 5.089283 | 3.409418 | 11.43444 | 4.987705
ADASYN 8.148419 | 3.689661 | 12.25441 | 6.840083
RUS 0.020457 | 0.022713 | 0.031972 | 0.432839
Cl Knn 1.713731 | 0.400873 | 3.711648 | 2.508374
Far Knn 1.539437 | 0.379086 | 3.988552 | 2.511037
CBU 26.42686 | 4.198663 | 46.31987 | []
AB 0.713265 | 0.61719 | 1.238585 | 2.466151
AC(R = 28) 1.234647 | 1.666131 | 2.330635 | 1.451671
AC(R = RL) 0.279047 | 0.860346 | 0.197053 | 1.23763
EE(S = 10) 2.484502 | 2.145747 | 7.177484 | 0.524066
EE(S = 15) 3.363971 | 2.480066 | 11.21945 | 0.784111
EE par 0.224265 | 0.165338 | 0.747963 | 0.052274


Table 9 Continued. Additionally, an extra column indicating average ranks is included (lower ranks are preferred).

Adver(p = []) | Adver(p = 1) | CRF(p = []) | Bank(p = [])
BL 0.010953 | 0.002796 | 0.725911 | 7.089334
OSR 0.012178 | 0.006166 | 3.685813 | 179.7481
SMOTE 0.123112 | 0.017764 | 5.633862 | []
ADASYN 0.183767 | 0.021728 | 5.768669 | []
RUS 0.012115 | 0.00204 | 0.147392 | 5.247441
Cl Knn 0.061324 | 0.005568 | 1.106755 | 7.373282
Far Knn 0.079078 | 0.007069 | 1.110379 | 9.759619
CBU 3.378235 | 3.236754 | [] | []
AB 0.069199 | 0.103518 | 1.153196 | 8.308618
AC(R = 28) 0.193092 | 0.068905 | 2.047434 | 7.170548
AC(R = RL) 0.107652 | 0.037963 | 1.387174 | 10.63466
EE(S = 10) 0.138485 | 0.085686 | 0.198656 | 24.95117
EE(S = 15) 0.185136 | 0.139121 | 0.285345 | 36.40107
EE par 0.012342 | 0.009275 | 0.019023 | 2.426738

Average Rank [pos]
BL 2.94 [2]
OSR 4.19 [4]
SMOTE 9.59 [11]
ADASYN 10.91 [13]
RUS 1.38 [1]
Cl Knn 6.5 [5]
Far Knn 6.56 [6]
CBU 14 [14]
AB 8.06 [7]
AC(R = 28) 10.81 [12]
AC(R = RL) 9.25 [9]
EE(S = 10) 8.25 [8]
EE(S = 15) 9.56 [10]
EE par 3 [3]

Timings are a substantial concern with respect to these large datasets. As can be observed, the EE-technique thrives for these big and highly imbalanced data, even in its non-parallel form.


Fig 5 Average rank AUC versus average rank Time (see Table 9) across all datasets from Table 2 (excluding Flickr and Kdd). [Scatter plot of BL, OSR, SMOTE, ADASYN, RUS, Cl Knn, Far Knn, CBU, AB, AC(R = 28), AC(R = RL), EE(S = 10), EE(S = 15) and EE par.] Regarding the AUC rank, an extra row EE par is added having the same AUC-value as EE(S = 15). Ranks for each dataset are subsequently obtained via the procedure outlined in Section 4.6.1. Points occurring in the upper-right region are preferred.

5 The effect of the chosen base learner

In this short section we investigate the influence of the chosen base learner. More specifically, we ask ourselves whether some base inducers can more easily cope with the imbalanced learning issue, and in doing so we verify the conclusion that EE is a suitable technique to improve upon the baseline performance. We rely on several popular classification methods for behaviour data: two regularization based techniques (SVM, LR) and two heuristic types of approaches (Naive Bayes (NB), Behavioural Similarity (BeSim)). The next paragraphs briefly introduce these methods. Note that source codes and dataset results are provided in our on-line repository.

LR constructs a linear model that maximizes the likelihood of the observed data by imposing a logistic model for the parametric probability model P(Y|X). Ng and Jordan (2002) provide a thorough introduction to the topic27 and note that in its plain form LR can suffer from overfitting, especially when the input data is very high dimensional. Therefore, we resort to regularized logistic regression. We opt for L2-regularized LR, since L1-regularization corresponds with a natural way of imposing feature selection (Ng 2004). The LIBLINEAR toolbox (Fan et al 2008) is used to obtain these L2-regularized LR models.

27 For an accessible introduction, see the chapter on "Generative and Discriminative Classifiers: Naive Bayes and Logistic Regression" provided at the following link: http://www.cs.cmu.edu/~tom/NewChapters.html


NB (Ng and Jordan 2002) relies on the use of Bayes' rule, which expresses the posterior P(Y|X) as the product of the likelihood P(X|Y) and the prior P(Y), divided by the evidence P(X). The naivety of the approach originates from assuming conditional independence of the features. We impose a multivariate event model for the distribution P(X|Y). Junqué de Fortuny et al (2014a) provide an efficient NB implementation for dealing with large and sparse binary behaviour data.
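For binary behaviour data, the multivariate (Bernoulli) event model only ever needs the active features of an instance: precomputing, per class, the sum of log(1 − p_j) over all features lets the score be corrected with a sparse term for the active features only. The sketch below is a compact illustration of that idea with Laplace smoothing, not the implementation of Junqué de Fortuny et al; all names are ours.

```python
import math
from collections import Counter

def train_bernoulli_nb(docs, labels, n_features, alpha=1.0):
    """docs: list of sets of active feature ids; labels: 0/1. Laplace smoothing."""
    model = {}
    for y in (0, 1):
        idx = [i for i, l in enumerate(labels) if l == y]
        counts = Counter(f for i in idx for f in docs[i])
        n = len(idx)
        log_p, log_1p = {}, {}
        for f in range(n_features):
            p = (counts[f] + alpha) / (n + 2 * alpha)   # smoothed P(feature f active | y)
            log_p[f], log_1p[f] = math.log(p), math.log(1 - p)
        # cache the "all features inactive" baseline once per class
        model[y] = (math.log(n / len(labels)), log_p, log_1p, sum(log_1p.values()))
    return model

def log_posterior(model, doc, y):
    """Unnormalized log P(y | doc), touching only the active features of doc."""
    prior, log_p, log_1p, base = model[y]
    return prior + base + sum(log_p[f] - log_1p[f] for f in doc)
```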

The BeSim approach conceptually corresponds to projecting the bigraph to a unigraph using the procedure outlined in Section 3.2. Next, the weighted-vote relational neighbour classifier (Macskassy and Provost 2007) is used to infer unknown labels/scores through a weighted probability estimation using the known labels of neighbouring nodes. A very fast and scalable implementation of this approach (the SW-transformation) is presented in detail in Stankova et al (2015).
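Conceptually, scoring a node then amounts to a weighted vote over the labelled nodes it shares features with. A bare-bones sketch follows; it uses uniform link weights and ignores the inverse-degree feature weighting of the actual SW-transformation, so it is illustrative only.

```python
from collections import defaultdict

def besim_scores(train_feats, labels, test_feats):
    """train_feats/test_feats: dict node -> set of active feature ids;
    labels: dict node -> 0/1. Returns dict node -> minority-class score."""
    feature_index = defaultdict(list)            # feature -> labelled nodes having it
    for node, feats in train_feats.items():
        for f in feats:
            feature_index[f].append(node)
    scores = {}
    for node, feats in test_feats.items():
        votes, weight = 0.0, 0.0
        for f in feats:                          # each shared feature is one co-occurrence link
            for neighbour in feature_index[f]:
                votes += labels[neighbour]
                weight += 1.0
        scores[node] = votes / weight if weight else 0.5  # no neighbours: fall back to 0.5
    return scores
```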

In Table 10 we compare the performances of the baseline (BL) application of each method with its EE-version (with S = 15, T = 20). It is quite striking to see the EE-version dominate the baseline version when looking at each method individually. This suggests that the imbalanced learning issue has an impact on each of the aforementioned techniques. Focusing on the regularization based approaches, we observe the baseline LR to be less influenced by the imbalance than the SVM method. Their EE-versions (EE SVM and EE LR) boost the baseline performance in almost all cases and have the highest average ranks amongst all the proposed methods. As already emphasized, the regularization based approaches offer an added element of flexibility in the sense that the strength of the underlying base learner can be controlled.

The heuristic approaches NB and BeSim do not offer this flexibility. Furthermore, it is not directly possible to integrate instance weights during the training phase of the base learner. During the boosting process, we therefore sample from the distribution Dt to generate (unweighted) examples and train a BeSim/NB learner next. The results in Table 10 indicate BL NB to be the worst performing algorithm amongst all methods tried. Due to the conditional independence assumption, one could regard it as a weak learner, which makes it suitable for use in a boosting process. Indeed, the EE NB approach is able to significantly improve upon its performance. We found this to be the slowest EE-version28 (e.g. taking more than 24 hours to process a single fold of the Bank dataset, versus several minutes for the other versions). The average rank estimate of EE NB is rather pessimistic, since we assigned the lowest performance across all methods in case of missing results. BeSim can be considered a stronger learner than NB; its possible boost in performance is typically lower than that of NB. Having said this, we found EE BeSim to improve upon BL BeSim for all datasets (except Adver(p = 1)).

28 This statement only relates to the EE-version. In the baseline case, the timings of the NB-version are roughly of the same order of magnitude as SVM and LR (though this heavily depends on the chosen regularization parameter C). With EE, each subset has a size limited to twice the minority class training size (a relatively low number compared to the size of the majority class), which works in favour of SVM/LR (e.g. SVM has a computational complexity of O(m²), with m the number of training instances). Also note that the NB-version is indeed optimized for sparse data. Internally, the NB-implementation writes a fairly large model file to a specified directory (sizes of 1 GB can be encountered there). This writing operation slows down the training stage (especially in case of EE, where S × T = 15 × 20 = 300 model files are constructed).


Table 10 Average tenfold AUC test set performance of the BL compared to its EE-version (S = 15, T = 20) for SVM, LR, BeSim and NB. In case the EE NB implementation crashed, we put a × in the table; if it is too slow (> 24 hours per fold), we place a []. Additionally, average ranks are assigned to each method. In case EE NB has missing results for a particular dataset, we ensure it has the lowest rank across all other algorithms.

Mov G(p = 1) | Mov G(p = 25) | Mov Th(p = []) | Yahoo A(p = 1)
BL SVM 71.6 (2.62) | 81.41 (1.32) | 79.77 (5.33) | 56.49 (3.37)
EE SVM 76.12 (2.88) | 85.13 (1.86) | 86.43 (5.86) | 59.74 (2.96)
BL LR 71.02 (2.09) | 84.39 (1.84) | 83.14 (4.17) | 57.84 (2.39)
EE LR 76.69 (2.92) | 85.03 (1.98) | 86.3 (5.37) | 59.79 (2.62)
BL BeSim 76.1 (3.58) | 81.3 (2.92) | 82.81 (6.6) | 56.27 (2.73)
EE BeSim 76.31 (3.71) | 81.37 (2.9) | 85.02 (6.28) | 57.7 (1.71)
BL NB 70.26 (5.84) | 77.01 (2.54) | 70.48 (10.14) | 52.56 (2.09)
EE NB 75.93 (2.83) | 85.56 (2.01) | 86.91 (4.15) | 57.55 (2.73)

Yahoo A(p = 25) | Yahoo G(p = 1) | Yahoo G(p = 25) | TaFeng(p = 1)
BL SVM 61.61 (2.48) | 66.84 (3.66) | 78.82 (1.39) | 55.75 (1.6)
EE SVM 66.38 (3.16) | 73.48 (2.32) | 80.55 (1.55) | 61.13 (1.83)
BL LR 66.27 (2.96) | 69.82 (1.93) | 80.45 (1.59) | 58.91 (2.31)
EE LR 66.22 (3.28) | 73.08 (2.14) | 80.53 (1.56) | 61.43 (2.32)
BL BeSim 64.54 (2.02) | 68.89 (2.49) | 79.55 (1.96) | 57.89 (1.18)
EE BeSim 65.25 (2.23) | 71.18 (2.91) | 80.04 (1.85) | 59.36 (1.47)
BL NB 65 (1.65) | 63.33 (2.56) | 78.89 (1.64) | 54.61 (1.2)
EE NB 66.6 (2.79) | 70.99 (2.88) | 81.01 (1.3) | 59.01 (1.84)

TaFeng(p = 25) | Book(p = 1) | Book(p = 25) | LST(p = 1)
BL SVM 66.94 (1.34) | 52.6 (1.29) | 60.08 (0.71) | 99.99 (0.01)
EE SVM 70.4 (1.3) | 55.34 (1.28) | 65.4 (0.51) | 99.98 (0.01)
BL LR 69.24 (1.3) | 55.34 (1.27) | 63.84 (0.75) | 99.99 (0.01)
EE LR 70.28 (1.28) | 55.49 (1.49) | 65.41 (0.63) | 99.97 (0.02)
BL BeSim 67.49 (1.23) | 55.19 (1.27) | 63.7 (0.63) | 99.99 (0.01)
EE BeSim 68 (1.21) | 55.21 (1.15) | 64.38 (0.42) | 99.99 (0)
BL NB 65.21 (1.64) | 52.93 (0.9) | 59.75 (0.47) | 98.69 (0.3)
EE NB 70.72 (1.15) | × | 63.46 (0.61) | 99.92 (0.04)

Adver(p = []) | Adver(p = 1) | CRF(p = []) | Bank(p = [])
BL SVM 96.37 (1.94) | 91.18 (2.97) | 64.36 (18.97) | 66.82 (0.88)
EE SVM 97.63 (1.35) | 93.3 (2.14) | 86.35 (9.99) | 71.54 (0.76)
BL LR 97.19 (1.44) | 88.51 (1.93) | 81.87 (19.63) | 71.43 (0.72)
EE LR 97.57 (0.96) | 93.02 (2.06) | 86.84 (9.62) | 71.77 (0.62)
BL BeSim 97.26 (1.12) | 95.38 (1.35) | 86.91 (9.36) | 67.85 (0.67)
EE BeSim 97.38 (1.04) | 93.83 (1.35) | 87.02 (10.43) | 70.41 (0.55)
BL NB 93.75 (1.9) | 93.37 (1.9) | 87.24 (9.38) | 67.83 (0.63)
EE NB 94.04 (1.75) | × | × | []

Flickr(p = 0.1) | Kdd(p = 0.5) | Average Rank
BL SVM 74.92 (0.17) | 74.53 (0.05) | 6.44 [7]
EE SVM 79.86 (0.13) | 80.98 (0.05) | 2.39 [1]
BL LR 79.03 (0.11) | 81.29 (0.04) | 4.28 [4]
EE LR 79.85 (0.13) | 80.75 (0.05) | 2.61 [2]
BL BeSim 74.62 (0.13) | 74.95 (0) | 5.11 [6]
EE BeSim 76.4 (0.13) | 77.55 (0.03) | 3.61 [3]
BL NB 81.36 (0.1) | 74.29 (0.05) | 6.5 [8]
EE NB [] | [] | 5.06 [5]


6 Conclusions and future research directions

We investigated the effects of sampling and cost-sensitive learning techniques on the problem of learning from imbalanced behaviour data. This type of big data has fundamentally different properties and internal structure than traditional data and is characterized by sparseness and very large dimensions. This is a new research topic which enables benefits across a wide variety of application domains, such as fraud detection, targeted advertising, predictive policing, churn prediction, default prediction, etc. To enable future developments in this field, we provide an on-line repository containing implementations, datasets and results, available at the following link: http://www.applieddatamining.com/cms/?q=software

Oversampling techniques need to be adapted in order to cope with this kind of data. We proposed new versions of the traditional SMOTE and ADASYN techniques, which differ in the way synthetic samples are generated and similarity computations are performed. The plain random oversampling (OSR) and the synthetic approaches are competitive methods in terms of AUC and do not seem to suffer from overfitting, as traditional studies report. Though the synthetic approaches have AUC-performances comparable to OSR, they are impractical due to their computationally expensive procedures. The OSR method is very fast for small and medium sized datasets, yet becomes inappropriate for larger datasets because the training times of the underlying base learner increase drastically.
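As a rough illustration of how plain random oversampling can be applied to sparse behaviour data, consider the sketch below. The function name and the β parameterization are our own simplification for exposition, not the paper's exact implementation:

```python
import numpy as np
from scipy import sparse

def random_oversample(X, y, beta=1.0, seed=0):
    """Duplicate randomly drawn minority rows of a sparse matrix X until the
    class imbalance is reduced by a fraction beta (beta=1 yields a 50/50 split)."""
    rng = np.random.default_rng(seed)
    min_idx = np.flatnonzero(y == 1)
    maj_idx = np.flatnonzero(y == 0)
    # number of duplicated minority rows needed to reach the target ratio
    n_extra = int(beta * (len(maj_idx) - len(min_idx)))
    extra = rng.choice(min_idx, size=n_extra, replace=True)
    # stacking sparse rows keeps the data sparse, so training remains feasible
    X_new = sparse.vstack([X, X[extra]], format="csr")
    y_new = np.concatenate([y, y[extra]])
    return X_new, y_new
```

Because only row indices are duplicated, memory grows with the number of minority copies rather than with the (very large) feature dimension.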

With respect to undersampling methodologies, we employed a random undersampling technique (RUS) and Knn-based informed approaches. The latter type selects majority class instances that are closest (Cl Knn) or farthest (Far Knn) from the minority class examples and allows one to investigate the effect of noise/outliers on SVM performance. We observed that these odd instances have a severe performance degrading effect. RUS and Far Knn approaches can significantly improve performance with respect to the baseline, thanks to their abilities to reduce noise/outlier levels and to put more emphasis on the minority class. Having said this, they usually perform worse than any of the oversampling approaches. This is in sheer contrast to the results obtained from traditional studies and is heavily related to the intrinsic nature of behaviour data. Timing-wise, RUS is the fastest method; the informed approaches are slower than RUS due to the required nearest neighbour computations. Additionally, a cluster-based undersampling (CBU) technique was proposed to alleviate the within-class imbalance. For low undersampling rates the performance of CBU is superior to plain RUS, whereas for high undersampling rates the opposite is true. Overall, no substantial gain in performance is obtained in comparison with a simple RUS approach. Furthermore, the method is the slowest among all techniques considered and relies on computationally expensive clustering heuristics.
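A minimal dense-matrix sketch of the "farthest" informed-undersampling idea follows. This is our own simplification, ranking majority instances by their mean cosine similarity to the minority class, whereas the paper's Far Knn variant ranks by nearest minority neighbours:

```python
import numpy as np

def farthest_undersample(X, y, keep_frac=0.5):
    """Keep the keep_frac fraction of majority rows whose mean cosine
    similarity to the minority class is lowest, i.e. discard majority
    instances that overlap the minority region (likely noise/outliers)."""
    Xn = X / np.linalg.norm(X, axis=1, keepdims=True)
    min_mask = y == 1
    sim_to_min = Xn[~min_mask] @ Xn[min_mask].T   # cosine similarities
    score = sim_to_min.mean(axis=1)               # closeness to minority class
    maj_idx = np.flatnonzero(~min_mask)
    order = np.argsort(score)                     # ascending: farthest first
    keep_maj = maj_idx[order[: int(keep_frac * len(maj_idx))]]
    keep = np.concatenate([np.flatnonzero(min_mask), keep_maj])
    return X[keep], y[keep]
```

On behaviour data one would keep the matrix sparse and compute similarities blockwise; the dense version above only conveys the selection logic.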

We studied the effects of boosting, where we made use of a very specific linear SVM and logistic regression combination that uses confidence-rated predictions, instead of the more popular plain "discrete" boosting algorithm (with weak hypotheses having values in {−1, 1}). Our experiments clearly indicated that the regularization constant C in the SVM formulation acts as a "weakness" indicator. Indeed, higher C-values cause stronger learners and shouldn't be used in the boosting process, which matches the analysis of studies dealing with traditional data. Strong learners can already quite accurately distinguish majority class from minority class behaviours. The boosting process will then focus on the hard-to-learn instances, among which are noise/outliers, and this can result in overfitting behaviour. We also observed that boosting can usually outperform the plain baseline, indicating that a combination of weak learners can achieve a higher AUC-performance than a single strong learner.

EasyEnsemble (EE) integrates random undersampling with a boosting process. By sampling several balanced subsets from the training data and combining these, we can emphasize the minority class, achieve noise/outlier reduction effects and simultaneously explore the majority class space efficiently. Our version differs from the original formulation because we employ a different type of underlying weak learner (SVM and LR combination) that is fed to a confidence-rated boosting algorithm (instead of the discrete version). We observed superior AUC-performance with respect to all of the aforementioned techniques. Furthermore, the method is very fast since each subset can be trained in parallel and has a size that is only twice as large as the minority class training set. We studied the effect of the number of subsets and noted that a limited amount of these (O(10)) is already sufficient to obtain satisfactory performances; increasing this number further brings only minor benefits.
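The core sampling-plus-aggregation loop can be sketched as follows. This is a deliberately simplified stand-in: the paper boosts SVM/LR weak learners inside each subset, whereas this toy version fits one centroid-difference linear scorer per balanced subset and averages the resulting weight vectors:

```python
import numpy as np

def centroid_learner(Xs, ys):
    """Toy linear weak learner: weight vector pointing from the majority
    centroid towards the minority centroid."""
    return Xs[ys == 1].mean(axis=0) - Xs[ys == 0].mean(axis=0)

def easy_ensemble(X, y, n_subsets=10, seed=0):
    """Train one weak learner per balanced subset and average the models.
    Each subset holds all minority rows plus an equally sized random
    majority sample, i.e. it is only twice the minority class size."""
    rng = np.random.default_rng(seed)
    min_idx = np.flatnonzero(y == 1)
    maj_idx = np.flatnonzero(y == 0)
    ws = []
    for _ in range(n_subsets):
        sub_maj = rng.choice(maj_idx, size=len(min_idx), replace=False)
        subset = np.concatenate([min_idx, sub_maj])
        ws.append(centroid_learner(X[subset], y[subset]))
    return np.mean(ws, axis=0)  # members are linear, so their average is too

def ee_score(w, X):
    return X @ w  # higher scores indicate the minority class
```

Since the subsets are independent, the loop is trivially parallelizable, which is what makes the method fast on large data.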

Additionally, a statistical comparison of each of the proposed methods was conducted across a wide variety of datasets. Holm's significance test indicates that each of the proposed techniques (oversampling, undersampling (except Cl Knn), cost-sensitive learning and boosting variants) can significantly outperform the plain baseline performance (each of the null-hypotheses of equal performance gets rejected with a confidence level of at least 91%). Comparison of the techniques with EE showed that the latter can significantly outperform all of them (each of the null-hypotheses of equal performance gets rejected with a confidence level of at least 75%). A timing-wise comparison showed EE to be the preferred method (after RUS) for large and highly imbalanced datasets. A parallel implementation would make EE the fastest method, even for medium sized datasets that show a high level of imbalance.
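For reference, Holm's step-down procedure itself is simple to state in code. This is a generic sketch of the multiple-comparison correction, not tied to the paper's specific Friedman-test pipeline:

```python
def holm_reject(p_values, alpha=0.05):
    """Holm's step-down procedure: test sorted p-values against successively
    less strict thresholds alpha/m, alpha/(m-1), ... and stop at the first
    failure; all remaining (larger) p-values are then retained as well."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    reject = [False] * m
    for rank, i in enumerate(order):
        if p_values[i] <= alpha / (m - rank):
            reject[i] = True
        else:
            break  # once one hypothesis is retained, so are all larger p-values
    return reject
```

For example, with p-values (0.010, 0.040, 0.030) at α = 0.05, only the first hypothesis is rejected: 0.010 ≤ 0.05/3, but 0.030 > 0.05/2 stops the procedure.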

The influence of the chosen base learner was investigated by comparing the baseline to its EE-version for the SVM, LR, NB and BeSim inducers. Each of the EE-versions is able to improve upon the baseline performance for nearly all datasets, where base learners of the regularization type are superior because of their ability to control the strength of the learner.

Though we conducted an extensive benchmark study, there are still plenty of future research directions. First, and most important, there is an urgent need to develop a more general theoretical framework to completely understand the effect of imbalance and why certain methods outperform others. Secondly, other types of mechanisms could be investigated in this setting: advanced sampling (Guo et al 2008) set-ups perform sampling based on preliminary classifications; one-class learning is a recognition based methodology where a discriminative boundary is learned around the training examples of a single class alone, and Raskutti and Kowalczyk (2004) demonstrated the superiority of the recognition based one-class SVMs over the discriminative two-class SVMs; other types of ensemble learning strategies (e.g. bagging and stacking) could be investigated; Batista et al (2004) note that combining over- and undersampling can be beneficial when data are highly imbalanced or show very few minority class instances; the ideas behind Extended Nearest Neighbour (Tang and He 2015) could guide us in the sampling process. As opposed to Knn, "it considers not only who are the nearest neighbours of a certain sample, but also who consider the particular sample as their nearest neighbour (Tang and He 2015)". The method assigns a class membership to an unknown sample that maximizes the intra-class coherence. Finally, more advanced techniques do exist that can provide a K (the number of nearest neighbours) faster or with a (slightly) better performance.

Another issue relates to the lack of comprehensibility of the EasyEnsemble technique in its present form. Because we currently integrate Logistic Regression in the boosting process, the final ensemble classifier is non-linear and therefore incomprehensible. It might be interesting to investigate the numerical boosting technique proposed in Schapire and Singer (1999), where each weak learner is allowed to output real values (instead of values in {−1, 1}). In that case we would be able to use a plain linear SVM as a weak learner, without the Logistic Regression component. The final ensemble would be a linear model, which is surely easier to comprehend. Also, the instance-based approach of Martens and Provost (2014) could be used to explain individual predictions of the non-linear EE-technique.
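To see why real-valued weak learners would restore comprehensibility: if every weak learner is linear, f_t(x) = ⟨w_t, x⟩, then the weighted ensemble Σ_t α_t f_t(x) collapses into the single linear model ⟨Σ_t α_t w_t, x⟩. A small numeric check, using made-up weight vectors and boosting coefficients:

```python
import numpy as np

rng = np.random.default_rng(0)
ws = rng.normal(size=(5, 3))                     # five linear weak learners w_t
alphas = np.array([0.5, 0.3, 0.1, 0.07, 0.03])   # illustrative boosting weights
x = rng.normal(size=3)

# ensemble prediction: weighted sum of the members' real-valued outputs
ensemble_score = sum(a * (w @ x) for a, w in zip(alphas, ws))
# the same prediction from one interpretable weight vector
collapsed_w = alphas @ ws
assert np.isclose(ensemble_score, collapsed_w @ x)
```

The collapsed vector can then be inspected feature by feature, exactly like a single linear SVM model.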

A Oversampling experiments

Table 11 Oversampling experiments. Results show the average tenfold AUC test set performance with respect to increasing β-values (β = [β1, β2, β3, β4] = [0, 1/3, 2/3, 1]). Additionally, standard deviations are shown between brackets. Optimal performances for each method are highlighted in boldface. See Section 4.2 for parameter settings.

Mov G(p = 1)    β1           β2           β3           β4
OSR             716 (262)    7437 (204)   736 (184)    7473 (245)
SMOTE           716 (262)    7508 (218)   7602 (214)   7648 (23)
ADASYN          716 (262)    7516 (192)   7593 (208)   7647 (229)

Mov G(p = 25)
OSR             8141 (132)   8349 (181)   8384 (196)   8391 (204)
SMOTE           8141 (132)   8332 (197)   8359 (204)   8376 (211)
ADASYN          8141 (132)   8361 (182)   8402 (197)   8369 (196)

Mov Th(p = [])
OSR             7977 (533)   853 (466)    8316 (45)    8459 (569)
SMOTE           7977 (533)   8418 (651)   8558 (597)   8433 (575)
ADASYN          7977 (533)   8411 (677)   8586 (606)   8536 (513)

Yahoo A(p = 1)
OSR             5592 (297)   5866 (327)   5999 (228)   5974 (178)
SMOTE           5592 (297)   5976 (262)   5974 (267)   5943 (24)
ADASYN          5592 (297)   5954 (253)   5955 (294)   5956 (222)

Yahoo A(p = 25)
OSR             6168 (242)   6419 (317)   6508 (326)   6467 (21)
SMOTE           6168 (242)   6546 (363)   6533 (323)   6452 (298)
ADASYN          6168 (242)   6504 (374)   6541 (347)   644 (221)

Yahoo G(p = 1)
OSR             6684 (366)   7218 (236)   7311 (27)    7249 (341)
SMOTE           6684 (366)   7265 (285)   7327 (336)   7337 (356)
ADASYN          6684 (366)   7287 (283)   7318 (32)    7339 (359)

Yahoo G(p = 25)
OSR             7882 (139)   7878 (202)   7874 (186)   7835 (147)
SMOTE           7882 (139)   7923 (157)   791 (12)     7903 (189)
ADASYN          7882 (139)   7912 (143)   7923 (137)   7951 (201)

TaFeng(p = 1)
OSR             5575 (16)    5923 (196)   60 (168)     6104 (236)
SMOTE           5575 (16)    6026 (195)   6149 (18)    6113 (152)
ADASYN          5575 (16)    6026 (19)    6144 (185)   6116 (15)

TaFeng(p = 25)
OSR             6694 (134)   6884 (121)   6699 (142)   677 (141)
SMOTE           6694 (134)   6847 (15)    6707 (115)   6665 (081)
ADASYN          6694 (134)   6862 (138)   6785 (16)    6691 (139)

Book(p = 1)
OSR             526 (129)    5361 (094)   5541 (175)   5587 (144)
SMOTE           526 (129)    5477 (099)   5491 (08)    5436 (098)
ADASYN          526 (129)    5486 (113)   5506 (073)   5454 (092)

Book(p = 25)
OSR             6008 (071)   6105 (094)   6182 (112)   6462 (057)
SMOTE           6008 (071)   626 (073)    6095 (068)   63 (08)
ADASYN          6008 (071)   6233 (073)   6077 (085)   6304 (058)

LST(p = 1)
OSR             9999 (001)   9999 (001)   9999 (001)   9999 (001)
SMOTE           9999 (001)   9999 (001)   9999 (001)   9999 (001)
ADASYN          9999 (001)   9999 (001)   9999 (001)   9999 (001)

Adver(p = [])
OSR             9661 (182)   9731 (165)   9707 (184)   9707 (179)
SMOTE           9661 (182)   9691 (166)   9719 (165)   9707 (191)
ADASYN          9661 (182)   971 (17)     9708 (187)   9707 (188)

Adver(p = 1)
OSR             9093 (302)   9127 (303)   9266 (282)   9329 (197)
SMOTE           9093 (302)   9251 (203)   9296 (214)   9353 (181)
ADASYN          9093 (302)   9222 (233)   927 (236)    9388 (173)

CRF(p = [])
OSR             6406 (1643)  8082 (1294)  8128 (1227)  8191 (1128)
SMOTE           6406 (1643)  7864 (1686)  8252 (1374)  7932 (1626)
ADASYN          6406 (1643)  7895 (1672)  8119 (1632)  7931 (1619)

Bank(p = [])
OSR             6682 (088)   701 (074)    7139 (08)    7147 (08)
SMOTE           []
ADASYN          []


B Undersampling experiments

Table 12 Undersampling experiments. Results show the average tenfold AUC test set performance with respect to increasing βu-values (βu = [βu1, βu2, βu3, βu4, βu5] = [0, 1/4, 1/2, 3/4, 1]). Optimal performances (boldface) are highlighted for each imbalance ratio. Additionally, standard deviations are shown between brackets. Cl K represents the "Closest Knn" technique and Cl T the "Closest tot sim" technique (similarly for Far K and Far T; see Section 3.2). Parameter settings are given in Section 4.2.

Mov G(p = 1)    βu1         βu2         βu3         βu4         βu5
RUS             716(26)     7183(26)    7254(25)    7239(31)    7061(35)
Cl K            716(26)     714(2)      7096(19)    7043(24)    6905(41)
Cl T            716(26)     7028(25)    6674(2)     668(21)     6818(36)
Far K           716(26)     7236(27)    7126(34)    6657(52)    535(35)
Far T           716(26)     7222(28)    7163(36)    6428(53)    5088(44)
CBU             7255(26)    7328(26)    7312(26)    7384(25)    73(31)

Mov G(p = 25)
RUS             8141(13)    8136(13)    8178(17)    8205(17)    816(21)
Cl K            8141(13)    8086(12)    8095(16)    7973(23)    7795(23)
Cl T            8141(13)    799(12)     7821(14)    7787(15)    7776(23)
Far K           8141(13)    809(15)     7817(18)    7425(24)    6979(32)
Far T           8141(13)    8086(15)    772(24)     7116(27)    624(28)
CBU             8153(14)    8164(13)    8129(16)    8128(21)    8034(27)

Mov Th(p = [])
RUS             7977(53)    8032(58)    8157(55)    8186(66)    8126(62)
Cl K            7977(53)    7925(45)    7807(5)     7625(65)    6246(85)
Cl T            7977(53)    784(44)     7241(35)    6466(45)    6037(73)
Far K           7977(53)    8454(5)     8364(64)    8002(73)    5682(103)
Far T           7977(53)    8503(57)    8268(68)    7561(92)    5677(109)
CBU             8011(58)    8117(6)     8108(65)    8417(51)    8096(69)

Yahoo A(p = 1)
RUS             5592(3)     5557(34)    5644(3)     5583(34)    5637(33)
Cl K            5592(3)     5567(24)    5312(2)     5057(18)    5379(35)
Cl T            5592(3)     5569(21)    5335(22)    5031(22)    5235(33)
Far K           5592(3)     5735(22)    5692(11)    5695(23)    5118(2)
Far T           5592(3)     5693(24)    5474(19)    5701(18)    5118(2)
CBU             5821(26)    5845(33)    5831(35)    5839(35)    5609(26)

Yahoo A(p = 25)
RUS             6168(24)    629(29)     6362(36)    6375(31)    6319(19)
Cl K            6168(24)    6114(21)    5762(16)    5402(18)    5148(14)
Cl T            6168(24)    6089(28)    5811(14)    544(21)     5176(14)
Far K           6168(24)    6396(3)     6262(22)    5961(15)    5625(16)
Far T           6168(24)    6371(24)    5972(16)    5727(11)    5447(11)
CBU             6246(26)    6185(14)    6178(22)    5994(3)     601(4)

Yahoo G(p = 1)
RUS             6684(37)    6785(32)    6836(32)    6823(4)     699(42)
Cl K            6684(37)    6671(28)    643(36)     6198(39)    6115(19)
Cl T            6684(37)    6579(27)    6355(33)    5921(35)    6108(24)
Far K           6684(37)    6676(41)    6384(34)    6516(2)     485(29)
Far T           6684(37)    6695(41)    6348(29)    6516(2)     4848(29)
CBU             6968(41)    7059(32)    7064(37)    702(29)     6335(36)

Yahoo G(p = 25)
RUS             7882(14)    7891(16)    7897(16)    7861(16)    7782(21)
Cl K            7882(14)    7726(15)    7252(15)    6786(2)     6507(27)
Cl T            7882(14)    7683(1)     7199(18)    6715(23)    611(27)
Far K           7882(14)    7826(22)    7469(27)    6722(21)    6072(23)
Far T           7882(14)    7768(26)    7244(3)     6494(24)    596(2)
CBU             7525(32)    7522(24)    7469(23)    7307(24)    7069(24)

TaFeng(p = 1)
RUS             5575(16)    561(16)     5626(17)    5723(17)    5925(22)
Cl K            5575(16)    5568(16)    5558(15)    5508(11)    5105(15)
Cl T            5575(16)    5567(16)    5447(16)    4753(16)    493(11)
Far K           5575(16)    5899(12)    5947(11)    6004(12)    5631(1)
Far T           5575(16)    5892(13)    5925(13)    5858(11)    5631(1)
CBU             578(1)      5847(11)    5815(09)    5887(14)    5765(16)

TaFeng(p = 25)
RUS             6694(13)    6744(13)    681(14)     6827(14)    6613(12)
Cl K            6694(13)    6613(14)    6339(12)    5983(13)    5694(07)
Cl T            6694(13)    6638(15)    6289(16)    5746(13)    5456(13)
Far K           6694(13)    6806(14)    6643(16)    6446(15)    6335(13)
Far T           6694(13)    6431(11)    6269(1)     6127(11)    5903(1)
CBU             6481(12)    6415(11)    6413(12)    6388(08)    6346(08)

Book(p = 1)
RUS             526(13)     5279(09)    5346(08)    5389(09)    5405(09)
Cl K            526(13)     5256(12)    5252(13)    5239(11)    5309(11)
Cl T            526(13)     5256(12)    5252(13)    5239(11)    5305(07)
Far K           526(13)     5521(12)    5621(18)    5614(12)    5306(1)
Far T           526(13)     5521(12)    5621(18)    5614(12)    5306(1)
CBU             5428(09)    5377(1)     5333(11)    5334(09)    5284(08)

Book(p = 25)
RUS             6008(07)    6013(06)    604(08)     6033(08)    6328(08)
Cl K            6008(07)    5996(07)    6013(08)    5996(1)     5928(07)
Cl T            6008(07)    5996(07)    6013(08)    6029(04)    545(09)
Far K           6008(07)    6329(1)     6419(08)    573(11)     5566(11)
Far T           6008(07)    6214(05)    5827(06)    5637(1)     5566(11)
CBU             5482(09)    5467(09)    5471(09)    5466(1)     5478(09)

LST(p = 1)
RUS             9999(0)     9999(0)     9999(0)     9998(0)     9999(0)
Cl K            9999(0)     9999(0)     9999(0)     9999(0)     9999(0)
Cl T            9999(0)     9999(0)     9999(0)     9999(0)     9998(0)
Far K           9999(0)     9998(0)     9998(0)     9998(0)     9998(0)
Far T           9999(0)     9998(0)     9998(0)     9998(0)     9998(0)
CBU             []          []          []          []          []

Adver(p = [])
RUS             9661(18)    9632(18)    9663(14)    9712(21)    9622(16)
Cl K            9661(18)    9644(15)    9614(15)    9604(2)     948(25)
Cl T            9661(18)    9587(21)    9432(19)    9301(22)    9072(23)
Far K           9661(18)    9653(14)    9576(2)     9439(18)    9049(31)
Far T           9661(18)    9654(15)    9567(19)    9454(18)    893(28)
CBU             9685(23)    9685(23)    9705(15)    966(16)     9606(21)

Adver(p = 1)
RUS             9093(3)     9153(31)    9237(34)    919(29)     9193(22)
Cl K            9093(3)     9064(3)     8987(39)    9021(36)    8918(2)
Cl T            9093(3)     897(35)     8855(34)    8576(33)    882(23)
Far K           9093(3)     938(23)     924(26)     8873(34)    8551(4)
Far T           9093(3)     9362(24)    932(22)     8841(36)    8551(4)
CBU             9322(24)    9376(25)    9389(26)    9352(27)    9127(2)

CRF(p = [])
RUS             6406(164)   6328(159)   6798(174)   6695(219)   8773(88)
Cl K            6406(164)   6244(166)   6234(169)   7137(138)   7822(177)
Cl T            6406(164)   6244(166)   6234(169)   7137(138)   6267(229)
Far K           6406(164)   838(142)    8393(148)   8449(137)   8611(97)
Far T           6406(164)   838(142)    8393(148)   8449(137)   8611(97)
CBU             []          []          []          []          []

Bank(p = [])
RUS             6682(09)    6702(09)    6737(08)    6799(06)    695(1)
Cl K            6682(09)    6617(07)    6524(06)    6486(06)    5853(11)
Cl T            6682(09)    6492(11)    6069(09)    5633(08)    5287(07)
Far K           6682(09)    6695(06)    6619(06)    6442(06)    5825(11)
Far T           6682(09)    6716(06)    642(08)     5967(1)     5825(11)
CBU             []          []          []          []          []


C Boosting experiments

This section presents the results of each of the boosting experiments from Section 4.5 on the remaining data sources. Each of the figures below depicts the average tenfold AUC-performance on test data (with µ = 100) with respect to the number of boosting iterations T for: (left) Adaboost (AB), AdaCost (AC) and EasyEnsemble (EE), with C chosen according to the highest validation set AUC-performance (over all possible boosting rounds); and (right) AB and EE (with S = 15) at varying C-levels.

Fig 6 Mov G(p = 1) dataset [figure: panels (a) and (b), AUC test vs. boosting rounds T]

Fig 7 Mov Th(p = []) dataset [figure: panels (a) and (b), AUC test vs. boosting rounds T]


Fig 8 Yahoo A(p = 1) dataset [figure: panels (a) and (b), AUC test vs. boosting rounds T]

Fig 9 Yahoo A(p = 25) dataset [figure: panels (a) and (b), AUC test vs. boosting rounds T]

Fig 10 Yahoo G(p = 1) dataset [figure: panels (a) and (b), AUC test vs. boosting rounds T]


Fig 11 Yahoo G(p = 25) dataset [figure: panels (a) and (b), AUC test vs. boosting rounds T]

Fig 12 TaFeng(p = 1) dataset [figure: panels (a) and (b), AUC test vs. boosting rounds T]

Fig 13 Book(p = 1) dataset [figure: panels (a) and (b), AUC test vs. boosting rounds T]


Fig 14 LST(p = 1) dataset [figure: panels (a) and (b), AUC test vs. boosting rounds T]

Fig 15 Adver(p = []) dataset [figure: panels (a) and (b), AUC test vs. boosting rounds T]

Fig 16 Adver(p = 1) dataset [figure: panels (a) and (b), AUC test vs. boosting rounds T]


Fig 17 CRF(p = []) dataset [figure: panels (a) and (b), AUC test vs. boosting rounds T]

D Final Comparison

[Figure: scatter of average rank Time against average rank AUC for BL, OSR, SMOTE, ADASYN, RUS, Cl Knn, Far Knn, CBU, AB, AC(R=28), AC(R=RL), EE(S=10), EE(S=15) and EE par]

Fig 18 Average rank AUC versus average rank Time (see Table 9) across the large datasets from Table 2 (CRF and Bank). Regarding the former, an extra row EE par is added, having the same AUC-value as EE(S = 15). Ranks for each dataset are subsequently obtained via the procedure outlined in Section 4.6.1. Points occurring in the upper-right region are preferred.


References

Akbani R Kwek S Japkowicz N (2004) Applying support vector machines to imbalanced datasetsIn Machine Learning ECML 2004 15th European Conference on Machine Learning Pisa ItalySeptember 20-24 2004 Proceedings Springer Berlin Heidelberg Berlin Heidelberg pp 39ndash50 DOI101007978-3-540-30115-8 7

Ali A Shamsuddin SM Ralescu AL (2015) Classification with class imbalance problem A review Inter-national Journal of Advances in Soft Computing and its Applications 7(3)176ndash204

Alzahrani T Horadam KJ (2016) Community detection in bipartite networks Algorithms and case studiesIn Complex Systems and Networks Dynamics Controls and Applications Springer Berlin Heidel-berg Berlin Heidelberg pp 25ndash50 DOI 101007978-3-662-47824-0 2

Bachner J (2013) Predictive policing Preventing crime with data and analytics IBM Center for the Busi-ness of Government

Baesens B Van Gestel T Viaene S Stepanova M Suykens J Vanthienen J (2003) Benchmarking state-of-the-art classification algorithms for credit scoring Journal of the Operational Research Society54(6)627ndash635 DOI 101057palgravejors2601545

Barandela R Snchez J Garca V Rangel E (2003) Strategies for learning in class imbalance problemsPattern Recognition 36(3)849 ndash 851 DOI httpsdoiorg101016S0031-3203(02)00257-1

Barber MJ (2007) Modularity and community detection in bipartite networks Physical Review E76066102 DOI 101103PhysRevE76066102

Barua S Islam MM Yao X Murase K (2014) MWMOTEndashmajority weighted minority oversamplingtechnique for imbalanced data set learning IEEE Transactions on Knowledge and Data Engineer-ing 26(2)405ndash425 DOI 101109TKDE2012232

Batista GEAPA Prati RC Monard MC (2004) A study of the behavior of several methods for balancingmachine learning training data SIGKDD Explor Newsl 6(1)20ndash29 DOI 10114510077301007735

Beckett SJ (2016) Improved community detection in weighted bipartite networks Royal Society OpenScience 3(1) DOI 101098rsos140536

Bekkar M Djemaa HK Alitouche TA (2013) Evaluation measures for models assessment over imbalanceddata sets Journal of Information Engineering and Applications 3(10)27ndash38

Bhattacharyya S Jha S Tharakunnel K Westland JC (2011) Data mining for credit card fraud A compara-tive study Decision Support Systems 50(3)602 ndash 613 DOI httpsdoiorg101016jdss201008008

Blondel VD Guillaume JL Lambiotte R Lefebvre E (2008) Fast unfolding of communities in large net-works Journal of Statistical Mechanics Theory and Experiment 2008(10)P10008

Breiman L Friedman J Stone CJ Olshen RA (1984) Classification and regression trees Taylor amp FrancisBrozovsky L Petricek V (2007) Recommender system for online dating service In Proceedings of

Znalosti 2007 Conference VSB OstravaCha M Mislove A Gummadi KP (2009) A measurement-driven analysis of information propagation in

the Flickr social network In Proceedings of the 18th International Conference on World Wide WebACM New York NY USA WWW rsquo09 pp 721ndash730 DOI 10114515267091526806

Chawla NV (2005) Data mining for imbalanced datasets An overview In Data mining and knowledgediscovery handbook Springer US Boston MA pp 853ndash867

Chawla NV Bowyer KW Hall LO Kegelmeyer WP (2002) SMOTE synthetic minority over-samplingtechnique Journal of artificial intelligence research 16321ndash357

Chawla NV Lazarevic A Hall LO Bowyer KW (2003) Smoteboost Improving prediction of the minorityclass in boosting In Knowledge Discovery in Databases PKDD 2003 Springer Berlin HeidelbergBerlin Heidelberg pp 107ndash119

Chawla NV Japkowicz N Kotcz A (2004) Editorial Special issue on learning from imbalanced data setsSIGKDD Explor Newsl 6(1)1ndash6 DOI 10114510077301007733

Chen M Mao S Liu Y (2014) Big data A survey Mobile Networks and Applications 19(2)171ndash209DOI 101007s11036-013-0489-0

Chyi YM (2003) Classification analysis techniques for skewed class distribution problems Master thesisDepartment of Information management National Sun Yat-Sen University

Demsar J (2006) Statistical comparisons of classifiers over multiple data sets Journal of Machine LearningResearch 7(Jan)1ndash30

Dhillon IS (2001) Co-clustering documents and words using bipartite spectral graph partitioning In Pro-ceedings of the Seventh ACM SIGKDD International Conference on Knowledge Discovery and DataMining ACM New York NY USA KDD rsquo01 pp 269ndash274 DOI 101145502512502550

54 Jellis Vanhoeyveld David Martens

Drummond C Holte RC (2003) C45 class imbalance and cost sensitivity why under-sampling beatsover-sampling In Proceedings of the ICML rsquo03 Workshop on Learning from Imbalanced Datasets

Fan RE Chang KW Hsieh CJ Wang XR Lin CJ (2008) LIBLINEAR A library for large linear classifi-cation Journal of Machine Learning Research 91871ndash1874

Fan W Stolfo SJ Zhang J Chan PK (1999) AdaCost Misclassification cost-sensitive boosting In Pro-ceedings of the Sixteenth International Conference on Machine Learning Morgan Kaufmann Pub-lishers Inc San Francisco CA USA ICML rsquo99 pp 97ndash105

Fawcett T (2006) An introduction to ROC analysis Pattern Recognition Letters 27(8)861 ndash 874 DOIhttpsdoiorg101016jpatrec200510010

Finch H (2005) Comparison of distance measures in cluster analysis with dichotomous data Journal ofData Science 3(1)85ndash100

Fortunato S (2010) Community detection in graphs Physics Reports 486(3 5)75 ndash 174 DOI httpsdoiorg101016jphysrep200911002

Junque de Fortuny E Martens D Provost F (2014a) Predictive modeling with big data is bigger reallybetter Big Data 1(4)215ndash226 DOI 101089big20130037

Junque de Fortuny E Stankova M Moeyersoms J Minnaert B Provost F Martens D (2014b) Corporateresidence fraud detection In Proceedings of the 20th ACM SIGKDD International Conference onKnowledge Discovery and Data Mining ACM New York NY USA KDD rsquo14 pp 1650ndash1659 DOI10114526233302623333

Frasca M Bertoni A Re M Valentini G (2013) A neural network algorithm for semi-supervised node labellearning from unbalanced data Neural Networks 4384 ndash 98 DOI httpsdoiorg101016jneunet201301021

Friedman M (1937) The use of ranks to avoid the assumption of normality implicit in the analysis ofvariance Journal of the american statistical association 32(200)675ndash701

Garcıa E Lozano F (2007) Boosting support vector machines In Machine Learning and Data Mining inPattern Recognition 5th International Conference MLDM 2007 Leipzig Germany July 18-20 PostProceedings IBaI publishing pp 153ndash167

Goldstein M Uchida S (2016) A comparative evaluation of unsupervised anomaly detection algorithmsfor multivariate data PLOS ONE 11(4)1ndash31 DOI 101371journalpone0152173

Gonzlez PC Velsquez JD (2013) Characterization and detection of taxpayers with false invoices usingdata mining techniques Expert Systems with Applications 40(5)1427 ndash 1436 DOI httpsdoiorg101016jeswa201208051

Guimera R Sales-Pardo M Amaral LAN (2007) Module identification in bipartite and directed networksPhysical Review E 76036102 DOI 101103PhysRevE76036102

Guo H Viktor HL (2004) Learning from imbalanced data sets with boosting and data generation TheDataBoost-IM approach SIGKDD Explor Newsl 6(1)30ndash39 DOI 10114510077301007736

Guo X Yin Y Dong C Yang G Zhou G (2008) On the class imbalance problem In 2008 Fourth Interna-tional Conference on Natural Computation IEEE vol 4 pp 192ndash201 DOI 101109ICNC2008871

Han H Wang WY Mao BH (2005) Borderline-SMOTE A new over-sampling method in imbalanced datasets learning In Advances in Intelligent Computing Springer Berlin Heidelberg Berlin Heidelbergpp 878ndash887

He H Garcia EA (2009) Learning from imbalanced data IEEE Transactions on Knowledge and DataEngineering 21(9)1263ndash1284 DOI 101109TKDE2008239

He H Bai Y Garcia EA Li S (2008) ADASYN Adaptive synthetic sampling approach for imbalancedlearning In 2008 IEEE International Joint Conference on Neural Networks (IEEE World Congresson Computational Intelligence) IEEE pp 1322ndash1328 DOI 101109IJCNN20084633969

Holm S (1979) A simple sequentially rejective multiple test procedure Scandinavian journal of statistics6(2)65ndash70

Hsu CW Lin CJ (2002) A comparison of methods for multiclass support vector machines IEEE Transac-tions on Neural Networks 13(2)415ndash425 DOI 10110972991427

Huang A (2008) Similarity measures for text document clustering In Proceedings of the sixth new zealandcomputer science research student conference (NZCSRSC2008) Christchurch New Zealand pp 49ndash56

Iman RL Davenport JM (1980) Approximations of the critical region of the Friedman statistic Commu-nications in Statistics-Theory and Methods 9(6)571ndash595

Jo T Japkowicz N (2004) Class imbalances versus small disjuncts ACM SIGKDD Explor Newsl 6(1)40ndash49 DOI 10114510077301007737

Imbalanced classification in sparse and large behaviour datasets 55

Jutla IS Jeub LG Mucha PJ (2011-2016) A generalized louvain method for community detection imple-mented in MATLAB URL httpnetwikiamathunceduGenLouvain

Kubat M Matwin S (1997) Addressing the curse of imbalanced training sets One-sided selection InProceedings of the Fourteenth International Conference on Machine Learning Morgan KaufmannPublishers Inc San Francisco CA USA pp 179ndash186

Lancichinetti A Fortunato S (2009) Community detection algorithms A comparative analysis PhysicalReview E 80056117 DOI 101103PhysRevE80056117

Larremore DB Clauset A Jacobs AZ (2014) Efficiently inferring community structure in bipartitenetworks Physical Review E Statistical Nonlinear and Soft Matter Physics 90012805 DOI101103PhysRevE90012805

Li J Fine JP (2010) Weighted area under the receiver operating characteristic curve and its application togene selection Journal of the Royal Statistical Society Series C (Applied Statistics) 59(4)673ndash692DOI 101111j1467-9876201000713x

Li X Wang L Sung E (2008) AdaBoost with SVM-based component classifiers Engineering Applicationsof Artificial Intelligence 21(5)785ndash795 DOI httpsdoiorg101016jengappai200707001

Lichman M (2013) UCI machine learning repository. URL http://archive.ics.uci.edu/ml

Liu W, Chawla S, Cieslak DA, Chawla NV (2010) A robust decision tree algorithm for imbalanced data sets. In: Proceedings of the tenth SIAM international conference on data mining, SIAM, Philadelphia, vol 10, pp 766–777

Liu XY, Wu J, Zhou ZH (2009) Exploratory undersampling for class-imbalance learning. IEEE Transactions on Systems, Man, and Cybernetics, Part B (Cybernetics) 39(2):539–550, DOI 10.1109/TSMCB.2008.2007853

Luts J, Ojeda F, Van de Plas R, De Moor B, Van Huffel S, Suykens JA (2010) A tutorial on support vector machine-based methods for classification problems in chemometrics. Analytica Chimica Acta 665(2):129–145, DOI https://doi.org/10.1016/j.aca.2010.03.030

Macskassy SA, Provost F (2007) Classification in networked data: A toolkit and a univariate case study. Journal of Machine Learning Research 8(May):935–983

Martens D, Provost F (2014) Explaining data-driven document classifications. MIS Quarterly 38(1):73–100

Martens D, Provost F, Clark J, Junqué de Fortuny E (2016) Mining massive fine-grained behavior data to improve predictive analytics. MIS Quarterly 40(4):869–888

Mazurowski MA, Habas PA, Zurada JM, Lo JY, Baker JA, Tourassi GD (2008) Training neural network classifiers for medical decision making: The effects of imbalanced datasets on classification performance. Neural Networks 21(2-3):427–436, DOI https://doi.org/10.1016/j.neunet.2007.12.031

Mease D, Wyner AJ, Buja A (2007) Boosted classification trees and class probability/quantile estimation. Journal of Machine Learning Research 8:409–439

Nemenyi P (1963) Distribution-free Multiple Comparisons. Dissertation, Princeton University

Newman MEJ, Girvan M (2004) Finding and evaluating community structure in networks. Physical Review E 69:026113, DOI 10.1103/PhysRevE.69.026113

Ng AY (2004) Feature selection, L1 vs. L2 regularization, and rotational invariance. In: Proceedings of the Twenty-first International Conference on Machine Learning, ACM, New York, NY, USA, ICML '04, pp 78–, DOI 10.1145/1015330.1015435

Ng AY, Jordan MI (2002) On discriminative vs. generative classifiers: A comparison of logistic regression and naive bayes. In: Advances in Neural Information Processing Systems 14, MIT Press, pp 841–848

Ngai E, Hu Y, Wong Y, Chen Y, Sun X (2011) The application of data mining techniques in financial fraud detection: A classification framework and an academic review of literature. Decision Support Systems 50(3):559–569, DOI https://doi.org/10.1016/j.dss.2010.08.006

Platt JC (1999) Probabilistic outputs for support vector machines and comparisons to regularized likelihood methods. In: Advances in Large-Margin Classifiers, MIT Press, pp 61–74

Porter MA, Onnela JP, Mucha PJ (2009) Communities in networks. Notices of the American Mathematical Society 56(9):1082–1097

Provost F, Fawcett T (2013) Data Science for Business: What you need to know about data mining and data-analytic thinking. O'Reilly Media, Inc.

Provost F, Dalessandro B, Hook R, Zhang X, Murray A (2009) Audience selection for on-line brand advertising: Privacy-friendly social network targeting. In: Proceedings of the 15th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, ACM, New York, NY, USA, KDD '09, pp 707–716, DOI 10.1145/1557019.1557098

56 Jellis Vanhoeyveld David Martens

Raskutti B, Kowalczyk A (2004) Extreme re-balancing for SVMs: A case study. SIGKDD Explor Newsl 6(1):60–69, DOI 10.1145/1007730.1007739

Rosvall M, Bergstrom CT (2008) Maps of random walks on complex networks reveal community structure. Proceedings of the National Academy of Sciences 105(4):1118–1123, DOI 10.1073/pnas.0706851105

Schapire RE (1999) A brief introduction to boosting. In: Proceedings of the 16th International Joint Conference on Artificial Intelligence - Volume 2, Morgan Kaufmann Publishers Inc., San Francisco, CA, USA, IJCAI'99, pp 1401–1406

Schapire RE, Singer Y (1999) Improved boosting algorithms using confidence-rated predictions. Machine Learning 37(3):297–336, DOI 10.1023/A:1007614523901

Shmueli G (2017) Analyzing behavioral big data: Methodological, practical, ethical, and moral issues. Quality Engineering 29(1):57–74, DOI 10.1080/08982112.2016.1210979

Sobhani P, Viktor H, Matwin S (2015) Learning from imbalanced data using ensemble methods and cluster-based undersampling. In: New Frontiers in Mining Complex Patterns: Third International Workshop, NFMCP 2014, Held in Conjunction with ECML-PKDD 2014, Nancy, France, September 19, 2014, Revised Selected Papers, Springer International Publishing, Cham, pp 69–83, DOI 10.1007/978-3-319-17876-9_5

Stankova M (2016) Classification within network data with a bipartite structure. Dissertation, University of Antwerp

Stankova M, Martens D, Provost F (2015) Classification over bipartite graphs through projection. Working Papers 2015001, University of Antwerp, Faculty of Applied Economics

Sun Y, Kamel MS, Wong AK, Wang Y (2007) Cost-sensitive boosting for classification of imbalanced data. Pattern Recognition 40(12):3358–3378, DOI https://doi.org/10.1016/j.patcog.2007.04.009

Suykens JA, Van Gestel T, De Brabanter J, De Moor B, Vandewalle J (2002) Least squares support vector machines. World Scientific, Singapore

Tang B, He H (2015) ENN: Extended nearest neighbor method for pattern recognition [research frontier]. IEEE Computational Intelligence Magazine 10(3):52–60, DOI 10.1109/MCI.2015.2437512

Tang Y, Zhang YQ, Chawla NV, Krasser S (2009) SVMs modeling for highly imbalanced classification. IEEE Transactions on Systems, Man, and Cybernetics, Part B (Cybernetics) 39(1):281–288, DOI 10.1109/TSMCB.2008.2002909

Tobback E, Moeyersoms J, Stankova M, Martens D (2016) Bankruptcy prediction for SMEs using relational data. Working Paper 2016004, University of Antwerp, Faculty of Applied Economics

Verbeke W, Dejaeger K, Martens D, Hur J, Baesens B (2012) New insights into churn prediction in the telecommunication sector: A profit driven data mining approach. European Journal of Operational Research 218(1):211–229, DOI https://doi.org/10.1016/j.ejor.2011.09.031

Veropoulos K, Campbell I, Cristianini N (1999) Controlling the sensitivity of support vector machines. In: Proceedings of the International Joint Conference on Artificial Intelligence, Stockholm, Sweden (IJCAI99), pp 55–60

Whitrow C, Hand DJ, Juszczak P, Weston D, Adams NM (2009) Transaction aggregation as a strategy for credit card fraud detection. Data Mining and Knowledge Discovery 18(1):30–55, DOI 10.1007/s10618-008-0116-z

Wickramaratna J, Holden SB, Buxton BF (2001) Performance degradation in boosting. In: Proceedings of the Second International Workshop on Multiple Classifier Systems, Springer, London, UK, MCS '01, pp 11–21

Yen SJ, Lee YS (2009) Cluster-based under-sampling approaches for imbalanced data distributions. Expert Systems with Applications 36(3, Part 1):5718–5727, DOI https://doi.org/10.1016/j.eswa.2008.06.108

Yu HF, Lo HY, Hsieh HP, Lou JK, McKenzie TG, Chou JW, Chung PH, Ho CH, Chang CF, Wei YH, et al (2010) Feature engineering and classifier ensemble for KDD Cup 2010. In: Proceedings of the KDD Cup 2010 Workshop, pp 1–16

Zha H, He X, Ding C, Simon H, Gu M (2001) Bipartite graph partitioning and data clustering. In: Proceedings of the Tenth International Conference on Information and Knowledge Management, ACM, New York, NY, USA, CIKM '01, pp 25–32, DOI 10.1145/502585.502591

Zhang J, Mani I (2003) kNN approach to unbalanced data distributions: A case study involving information extraction. In: Proceedings of the ICML'2003 Workshop on Learning from Imbalanced Datasets, Washington DC

Ziegler CN, McNee SM, Konstan JA, Lausen G (2005) Improving recommendation lists through topic diversification. In: Proceedings of the 14th International Conference on World Wide Web, ACM, New York, NY, USA, WWW '05, pp 22–32, DOI 10.1145/1060745.1060754



ally much higher compared to higher C-value inducers. Strong learners can already make a suitable distinction between minority and majority class behaviours in the first boosting round. The boosting process will emphasize the hard-to-learn instances, among which are noise/outliers. Therefore, too strong learners are not suitable to be used in a boosting process, as already noted by Wickramaratna et al (2001). In Figures 2(b) and 3(b) we observe that the lowest C-values (C = 10⁻⁷, 10⁻⁵) can outperform higher C-values (C = 10⁻³, 10⁻¹). In many cases the AB-process outperforms the baseline (BL), see also the results in Table 5. A combination of several weak learners can often achieve better generalization than a single strong learner.
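The weak-learner intuition above can be made concrete with a minimal AdaBoost sketch. The paper boosts weak linear SVMs (low C); in this illustration a depth-1 decision stump stands in as the weak learner (an assumption, not the paper's inducer), since it exhibits the same qualitative behaviour: the reweighting step concentrates mass on the hard instances that the previous weak hypotheses misclassified.

```python
import math

def stump_train(X, y, w):
    # exhaustively pick the (feature, threshold, polarity) minimising weighted error
    best = None
    for j in range(len(X[0])):
        for thr in sorted({x[j] for x in X}):
            for pol in (1, -1):
                pred = [pol if x[j] <= thr else -pol for x in X]
                err = sum(wi for wi, p, yi in zip(w, pred, y) if p != yi)
                if best is None or err < best[0]:
                    best = (err, j, thr, pol)
    return best

def stump_predict(stump, x):
    _, j, thr, pol = stump
    return pol if x[j] <= thr else -pol

def adaboost(X, y, T=10):
    n = len(X)
    w = [1.0 / n] * n
    ensemble = []
    for _ in range(T):
        err, j, thr, pol = stump_train(X, y, w)
        err = max(err, 1e-10)
        if err >= 0.5:          # weaker than random: stop boosting
            break
        alpha = 0.5 * math.log((1 - err) / err)
        stump = (err, j, thr, pol)
        # re-weight: misclassified ("hard") instances gain weight
        w = [wi * math.exp(-alpha * yi * stump_predict(stump, x))
             for wi, x, yi in zip(w, X, y)]
        s = sum(w)
        w = [wi / s for wi in w]
        ensemble.append((alpha, stump))
    return ensemble

def predict(ensemble, x):
    score = sum(a * stump_predict(s, x) for a, s in ensemble)
    return 1 if score >= 0 else -1
```

On a 1-D toy interval problem (positives inside (0.3, 0.7)), a single stump necessarily errs on one side, while three boosting rounds already classify the training set perfectly, illustrating how combined weak hypotheses can beat one hypothesis.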

Inspired by the success of the RUS method with βu = 1 (for reasons indicated in Section 4.4), the EE-technique tries to overcome an obvious downside of this approach: we are removing a lot of majority class instances that can be very informative. Hence, we try to make up for this loss by combining several of these balanced training sets, in the hope that the majority class is more adequately explored. This results in a very effective learner that seems to outperform the AB, AC and BL methods in the majority of cases, as can be seen from these figures. It is remarkable that we don't need to use that many subsets S to achieve high performances: increasing the S-value from 5 to 15 doesn't seem to cause major increases in AUC (see also Table 5). The influence of the C-parameter is similar as described in the previous paragraph.
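The EE recipe described above (S balanced subsets, one learner per subset, averaged decision scores) can be sketched as follows. The per-subset boosted SVM of the paper is replaced here by a toy nearest-centroid scorer, purely for illustration; the subset-sampling and score-averaging structure is the point.

```python
import random

def centroid_scorer(X, y):
    # toy stand-in for the boosted SVM trained on each balanced subset:
    # score = squared distance to the majority centroid minus that to the minority centroid
    pos = [x for x, yi in zip(X, y) if yi == 1]
    neg = [x for x, yi in zip(X, y) if yi == -1]
    d = len(X[0])
    cp = [sum(x[j] for x in pos) / len(pos) for j in range(d)]
    cn = [sum(x[j] for x in neg) / len(neg) for j in range(d)]
    def score(x):
        dp = sum((x[j] - cp[j]) ** 2 for j in range(d))
        dn = sum((x[j] - cn[j]) ** 2 for j in range(d))
        return dn - dp          # positive -> closer to the minority centroid
    return score

def easy_ensemble(X, y, S=5, seed=0):
    rng = random.Random(seed)
    minority = [(x, yi) for x, yi in zip(X, y) if yi == 1]
    majority = [(x, yi) for x, yi in zip(X, y) if yi == -1]
    scorers = []
    for _ in range(S):
        # each subset: all minority instances + an equally large random majority sample
        subset = minority + rng.sample(majority, len(minority))
        Xs, ys = zip(*subset)
        scorers.append(centroid_scorer(list(Xs), list(ys)))
    return lambda x: sum(f(x) for f in scorers) / S
```

Each subset is only twice the minority class size, so the per-subset learners stay cheap, while the S independent majority samples jointly cover more of the majority class space than a single undersampled set.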

Fig. 1 [plot omitted] Mov G(p = 25) dataset results showing average tenfold AUC-performance on test data (with µ = 100) for (a) AB, AC and EE with C chosen according to highest validation set AUC-performance (over all possible T); (b) AB and EE(S = 15) with varying C-levels. Panels plot AUCtest against the number of boosting iterations T.


Fig. 2 [plot omitted] Book(p = 25) dataset (same panel layout as Fig. 1).

Fig. 3 [plot omitted] TaFeng(p = 25) dataset (same panel layout as Fig. 1).

Fig. 4 [plot omitted] Bank(p = []) dataset (same panel layout as Fig. 1).


4.6 Final comparison

4.6.1 Performance-wise comparison

In this section we wish to determine which of the aforementioned methods shows superior performance in terms of AUC on test data. The experimental set-up is described in Section 4.2. Note that we exclude β = 0 and βu = 0 in determining the performance of the oversampling respectively undersampling techniques, to be able to compare them with the baseline (BL) approach²³. The results for AB, AC and EE are shown for µ = 100. The number of boosting iterations t ∈ [0, T] is also considered as a tunable parameter. Table 5 shows the final comparison between all methods. Additionally, we added an average rank column indicating the mean rank of each algorithm across all datasets. The rank of an algorithm for a specific dataset corresponds to its position in the sorted AUC-performance spectrum, where the best performing algorithm gets the rank of 1, the second best rank 2, etc. In case two or more algorithms perform equally well, average ranks are assigned (if two algorithms have ranks of 3 and 4 respectively and show equal performance, then both methods get assigned a rank of 3.5). Note that the LST dataset is excluded in the calculation of average ranks, since we consider this a dataset not being affected by imbalance and showing equal performances.
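The ranking scheme just described, including the averaging of tied positions, can be sketched in a few lines (a generic illustration, not the paper's code):

```python
def average_ranks(auc_by_method):
    # auc_by_method: {method: AUC on one dataset}; higher AUC -> lower (better) rank
    ordered = sorted(auc_by_method.items(), key=lambda kv: -kv[1])
    ranks, i = {}, 0
    while i < len(ordered):
        j = i
        while j < len(ordered) and ordered[j][1] == ordered[i][1]:
            j += 1                            # extend the block of tied methods
        tied_rank = (i + 1 + j) / 2           # mean of positions i+1 .. j
        for k in range(i, j):
            ranks[ordered[k][0]] = tied_rank
        i = j
    return ranks
```

For example, four methods with AUCs 0.8, 0.7, 0.7 and 0.5 receive ranks 1, 2.5, 2.5 and 4, matching the tie-handling rule in the text; averaging these per-dataset ranks over all datasets yields the average rank column of Table 5.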

As a side note, we noticed that increasing the number of minority training set instances can significantly improve performances. We can see this clearly by comparing the p = 25 outcomes with the results of p = 1 on the same dataset. Junqué de Fortuny et al (2014a) already proved empirically that larger behaviour datasets (in terms of the number of instances or features) lead to higher AUC-performances. This is confirmed in our experiments, though we want to add that this is only true when the data size is increased in a balanced fashion.

The results in Table 5 confirm our previous analysis. Each of the oversampling, undersampling (except Cl Knn) and boosting variants seems to improve upon the BL performance. Sobhani et al (2015) note that the direct application of ensemble techniques on imbalanced data does not circumvent the imbalanced learning issue. In our experiments on behaviour data, we do find a simple boosting procedure applied to the imbalanced data to improve upon the predictive performance. The oversampling techniques OSR, SMOTE and ADASYN show similar outcomes and are competitive methods, seeming to outperform the undersampling techniques and the plain boosting (AB) and cost-sensitive learning (AC) approaches. When dealing with traditional data, the performances of oversampling techniques are generally worse than those of undersampling approaches (Drummond and Holte 2003). It is interesting to observe that for these large and sparse behaviour datasets the opposite is true. This is heavily related to the intrinsic properties of the data under consideration. Undersampling (or feature selection, for that matter) is not recommended because, as we have noted in Section 2.1, each feature provides a small additional contribution to the predictive performance. With respect to AC, the results indicate that R = RL is a better choice compared to more random cost ratios R = 28. The EE-technique has the

23 The BL technique trains single SVMs on the imbalanced training data


lowest average rank and is our best performing method. The combination of several balanced subsets with filtered noise/outlier contributions allows one to explore the majority class space efficiently. To strengthen our previous statements, we will in the next paragraphs conduct statistical significance tests in accordance with the procedure outlined in Demšar (2006). The latter paper presents a framework to perform statistical comparisons of classifiers over multiple datasets.

The first null-hypothesis we try to reject postulates that all algorithms perform equally well or, equivalently, that their average ranks R_j (see Table 5) do not differ. Under the null-hypothesis, the Friedman statistic (Friedman 1937)

\chi_F^2 = \frac{12N}{k(k+1)} \left[ \sum_{j=1}^{k} R_j^2 - \frac{k(k+1)^2}{4} \right] \qquad (6)

with N the number of datasets (N = 15) and k the number of algorithms (k = 13), can be used to compute the statistic (Iman and Davenport 1980)

F_F = \frac{(N-1)\chi_F^2}{N(k-1)-\chi_F^2} \qquad (7)

The latter is distributed according to the F-distribution with k − 1 and (k − 1)(N − 1) degrees of freedom. Evaluating this measure with the average ranks shown in Table 5 results in F_F = 22.98, which is a lot higher than the critical value of 1.81 at the α = 0.05 significance level. We can therefore reject the null-hypothesis and proceed with post-hoc tests.
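Equations (6) and (7) can be checked directly against the average ranks reported in Table 5:

```python
# Average ranks from Table 5 (N = 15 datasets, k = 13 methods), in the order
# BL, OSR, SMOTE, ADASYN, RUS, Cl Knn, Far Knn, CBU, AB, AC(R=28), AC(R=RL), EE(S=10), EE(S=15)
R = [11.600, 5.000, 4.533, 4.800, 8.167, 12.467, 8.133,
     8.567, 8.267, 8.467, 5.400, 3.267, 2.333]
N, k = 15, 13

# Friedman statistic, Eq. (6)
chi2_F = 12 * N / (k * (k + 1)) * (sum(r * r for r in R) - k * (k + 1) ** 2 / 4)

# Iman-Davenport correction, Eq. (7)
F_F = (N - 1) * chi2_F / (N * (k - 1) - chi2_F)
```

Running this reproduces F_F ≈ 22.98 (up to rounding of the published ranks), well above the 1.81 critical value quoted in the text.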

The Nemenyi test (Nemenyi 1963) allows one to compare each classifier to one another and adjusts the critical value to compensate for making k(k − 1)/2 comparisons²⁴. "The performance of two classifiers is significantly different if the corresponding average ranks differ by at least the critical difference (CD)" (Demšar 2006). We refer to the aforementioned paper for the calculation of the CD. The result of these comparisons at the α = 0.05 significance level is shown in Table 6. This test is already quite revealing in the sense that, for instance, the oversampling methods (OSR, SMOTE and ADASYN), AC (R = RL) and EE-techniques are significantly different from the BL. Yet it does not indicate that undersampling or applying plain boosting (AB) results in significantly different performances compared to the BL. The Nemenyi test is still conservative and we therefore compare all classifiers to a single control classifier in the next paragraphs. Note that we only make k − 1 comparisons in that situation, resulting in stronger results compared to the Nemenyi test.
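The Nemenyi check is a simple threshold on rank differences: CD = q_α √(k(k+1)/(6N)) per Demšar (2006). A short sketch, where the critical value q_α ≈ 3.313 for 13 classifiers is an assumed table value (not given in this paper):

```python
import math

k, N = 13, 15
q_alpha = 3.313  # assumed approximate q_{0.05} for k = 13 from extended Studentized-range tables
CD = q_alpha * math.sqrt(k * (k + 1) / (6 * N))   # roughly 4.71 here

# a few average ranks from Table 5
ranks = {"BL": 11.600, "SMOTE": 4.533, "RUS": 8.167, "EE15": 2.333}

def differ(a, b):
    # two methods differ significantly if their average ranks differ by at least CD
    return abs(ranks[a] - ranks[b]) >= CD
```

With this CD, BL vs SMOTE (rank gap 7.07) is flagged while BL vs RUS (gap 3.43) is not, in line with the corresponding entries of Table 6.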

When comparing a single classifier i with a control classifier c, the null-hypothesis states that the two classifiers i and c perform equally well. Under this hypothesis, the test statistic (based on the average ranks R_i shown in Table 5)

z = \frac{R_i - R_c}{\sqrt{\frac{k(k+1)}{6N}}} \qquad (8)

is distributed according to a standard normal distribution This z-value is used tofind the corresponding probability (p-value) p from the cumulative standard normal

24 The larger the number of comparisons, the higher the proportion of null-hypotheses that are wrongfully rejected due to random chance.


Table 5 Final comparison of all methods in terms of AUC test set performance; see Section 4.2 for details on the experimental set-up. β ≠ 0 and βu ≠ 0 for the oversampling respectively undersampling techniques; µ = 100 for each of the boosting variants. Standard deviations are given between brackets and the performance difference with the BL is shown in square brackets.

Mov G(p = 1) | Mov G(p = 25) | Mov Th(p = []) | Yahoo A(p = 1)

BL 71.6(2.62)[0] 81.41(1.32)[0] 79.77(5.33)[0] 55.92(2.97)[0]
OSR 75.35(2.27)[3.8] 83.76(2.09)[2.3] 85.13(6.1)[5.4] 60.05(2.71)[4.1]
SMOTE 76.16(2.27)[4.6] 83.7(2.1)[2.3] 85.67(4.98)[5.9] 60.1(3)[4.2]
ADASYN 76.07(2.26)[4.5] 83.63(2.04)[2.2] 85.65(5.6)[5.9] 59.9(2.99)[4]
RUS 72.88(2.73)[1.3] 81.52(2.15)[0.1] 82.91(7.19)[3.1] 57.04(1.77)[1.1]
Cl Knn 71.43(1.36)[-0.2] 80.88(1.19)[-0.5] 78.87(4.71)[-0.9] 55.78(2.71)[-0.1]
Far Knn 71.9(2.95)[0.3] 80.9(1.48)[-0.5] 84.07(4.64)[4.3] 57.2(1.33)[1.3]
CBU 74.17(2.36)[2.6] 81.51(1.04)[0.1] 82.76(7.22)[3] 58.77(3.43)[2.8]
AB 71.65(1.73)[0.1] 84.52(1.89)[3.1] 82.43(5.18)[2.7] 58.35(2.62)[2.4]
AC(R = 28) 71.61(2.46)[0] 83.46(1.82)[2] 83.27(5.6)[3.5] 57.72(2.47)[1.8]
AC(R = RL) 74.65(2.7)[3.1] 83.35(2.09)[1.9] 85.41(4.49)[5.6] 59.47(2.33)[3.5]
EE(S = 10) 76.04(2.66)[4.4] 85.05(1.85)[3.6] 86.1(5.78)[6.3] 59.66(3.13)[3.7]
EE(S = 15) 76.12(2.88)[4.5] 85.14(1.86)[3.7] 86.42(5.86)[6.7] 59.76(2.93)[3.8]

Yahoo A(p = 25) | Yahoo G(p = 1) | Yahoo G(p = 25) | TaFeng(p = 1)

BL 61.68(2.42)[0] 66.84(3.66)[0] 78.82(1.39)[0] 55.75(1.6)[0]
OSR 64.59(3.12)[2.9] 73.08(2.96)[6.2] 78.52(2.01)[-0.3] 61.21(2.24)[5.5]
SMOTE 65.56(3.33)[3.9] 73.11(3.12)[6.3] 79.01(1.21)[0.2] 61.72(1.81)[6]
ADASYN 65.13(3.38)[3.4] 73.22(3.17)[6.4] 79.74(1.68)[0.9] 61.68(1.86)[5.9]
RUS 64.11(2.8)[2.4] 70.65(3.39)[3.8] 78.91(1.55)[0.1] 59.25(2.18)[3.5]
Cl Knn 61.14(2.13)[-0.5] 66.34(3.54)[-0.5] 77.26(1.46)[-1.6] 55.77(1.28)[0]
Far Knn 63.96(3.03)[2.3] 66.97(3.54)[0.1] 78.26(2.2)[-0.6] 59.98(1.26)[4.2]
CBU 62.27(1.79)[0.6] 71.27(2.89)[4.4] 75.22(2.42)[-3.6] 58.4(1.57)[2.6]
AB 63.88(2.67)[2.2] 68.9(2.03)[2.1] 79.01(1.66)[0.2] 56.21(1.79)[0.5]
AC(R = 28) 64.32(3.56)[2.6] 68.89(3.11)[2] 78.99(1.89)[0.2] 56.33(1.83)[0.6]
AC(R = RL) 64.31(3.03)[2.6] 73.13(2.8)[6.3] 78.41(2)[-0.4] 61.6(2.26)[5.9]
EE(S = 10) 66.51(3.24)[4.8] 72.61(3.15)[5.8] 80.52(1.6)[1.7] 61.2(1.82)[5.4]
EE(S = 15) 66.36(3.18)[4.7] 73.48(2.32)[6.6] 80.54(1.56)[1.7] 61.13(1.83)[5.4]

TaFeng(p = 25) | Book(p = 1) | Book(p = 25) | LST(p = 1)

BL 66.94(1.34)[0] 52.6(1.29)[0] 60.08(0.71)[0] 99.99(0.01)[0]
OSR 68.77(1.23)[1.8] 55.87(1.42)[3.3] 64.62(0.57)[4.5] 99.99(0.01)[0]
SMOTE 68.47(1.5)[1.5] 55.07(0.88)[2.5] 62.96(0.82)[2.9] 99.99(0.01)[0]
ADASYN 68.48(1.47)[1.5] 55.04(0.91)[2.4] 63.02(0.57)[2.9] 99.99(0.01)[0]
RUS 68.28(1.39)[1.3] 54.26(0.92)[1.7] 63.28(0.8)[3.2] 99.98(0.01)[0]
Cl Knn 66.13(1.43)[-0.8] 52.69(1.3)[0.1] 60.02(0.79)[-0.1] 99.99(0.01)[0]
Far Knn 68.06(1.41)[1.1] 56.25(1.52)[3.7] 64.15(1.12)[4.1] 99.98(0.01)[0]
CBU 63.84(1.07)[-3.1] 53.75(1.01)[1.2] 54.68(0.88)[-5.4] []
AB 67.65(1.55)[0.7] 54.27(1.95)[1.7] 65(0.67)[4.9] 99.99(0.01)[0]
AC(R = 28) 69.31(1.23)[2.4] 53.72(1)[1.1] 61.24(0.8)[1.2] 99.98(0.01)[0]
AC(R = RL) 67.15(1.51)[0.2] 55.73(1.22)[3.1] 64.6(0.64)[4.5] 99.99(0.01)[0]
EE(S = 10) 70.3(1.35)[3.4] 55.09(1.29)[2.5] 65.37(0.61)[5.3] 99.98(0.01)[0]
EE(S = 15) 70.4(1.3)[3.5] 55.35(1.26)[2.8] 65.4(0.51)[5.3] 99.98(0.01)[0]


Table 5 Continued. Additionally, an average rank column is added showing the mean rank of each algorithm across all datasets. Note that we excluded the LST dataset for this purpose.

Adver(p = []) | Adver(p = 1) | CRF(p = []) | Bank(p = [])

BL 96.61(1.82)[0] 90.93(3.02)[0] 64.06(16.43)[0] 66.82(0.88)[0]
OSR 96.93(1.91)[0.3] 93.3(2.02)[2.4] 80.74(12.93)[16.7] 71.39(0.79)[4.6]
SMOTE 97.05(1.66)[0.4] 93.35(2.01)[2.4] 78.7(16.56)[14.6] []
ADASYN 96.91(1.95)[0.3] 93.46(2.21)[2.5] 78.87(16.71)[14.8] []
RUS 96.81(1.87)[0.2] 92.38(2.51)[1.5] 83.98(5.99)[19.9] 69.41(1.19)[2.6]
Cl Knn 96.4(1.48)[-0.2] 89.73(3.42)[-1.2] 76.63(16.19)[12.6] 66.17(0.72)[-0.6]
Far Knn 95.77(1.81)[-0.8] 93.88(1.78)[3] 83.75(13.11)[19.7] 66.95(0.56)[0.1]
CBU 97.15(1.88)[0.5] 94.18(2.3)[3.3] [] []
AB 97.34(2.18)[0.7] 91.39(3.23)[0.5] 77.62(15.15)[13.6] 66.82(0.88)[0]
AC(R = 28) 97.44(1.93)[0.8] 91(3.35)[0.1] 68.31(14.93)[4.2] 67.67(0.71)[0.9]
AC(R = RL) 97.46(1.71)[0.8] 93.51(2.17)[2.6] 85.08(9.77)[21] 70.7(0.8)[3.9]
EE(S = 10) 97.64(1.35)[1] 92.97(2.75)[2] 86.18(10.17)[22.1] 71.46(0.81)[4.6]
EE(S = 15) 97.63(1.35)[1] 93.3(2.14)[2.4] 86.35(9.99)[22.3] 71.54(0.76)[4.7]

Average Rank

BL 11.600
OSR 5.000
SMOTE 4.533
ADASYN 4.800
RUS 8.167
Cl Knn 12.467
Far Knn 8.133
CBU 8.567
AB 8.267
AC(R = 28) 8.467
AC(R = RL) 5.400
EE(S = 10) 3.267
EE(S = 15) 2.333

Table 6 Nemenyi test outcomes at the α = 0.05 significance level. A value of 1 indicates the null-hypothesis is rejected and thus finds the two algorithms significantly different (a value of 0 means we accept the null-hypothesis). Methods are listed in the same order as in Table 5 and are represented more concisely.

BL RO SM AD RU Cl Fa CBU AB AC1 AC2 EE1 EE2

BL 0 1 1 1 0 0 0 0 0 0 1 1 1
RO 1 0 0 0 0 1 0 0 0 0 0 0 0
SM 1 0 0 0 0 1 0 0 0 0 0 0 0
AD 1 0 0 0 0 1 0 0 0 0 0 0 0
RU 0 0 0 0 0 0 0 0 0 0 0 1 1
Cl 0 1 1 1 0 0 0 0 0 0 1 1 1
Fa 0 0 0 0 0 0 0 0 0 0 0 1 1
CBU 0 0 0 0 0 0 0 0 0 0 0 1 1
AB 0 0 0 0 0 0 0 0 0 0 0 1 1
AC1 0 0 0 0 0 0 0 0 0 0 0 1 1
AC2 1 0 0 0 0 1 0 0 0 0 0 0 0
EE1 1 0 0 0 1 1 1 1 1 1 0 0 0
EE2 1 0 0 0 1 1 1 1 1 1 0 0 0


distribution function: p = 2 × min(Φ(z), 1 − Φ(z)). Holm's method (Holm 1979) compares k − 1 classifiers to a single control classifier and sorts the corresponding p-values in ascending order, so that p₁ ≤ p₂ ≤ … ≤ p_{k−1}. Each p_i is subsequently compared to its associated confidence level²⁵ αcomp = α/(k − i). Holm starts with performing the check p₁ < α/(k − 1) and rejects the null-hypothesis if the test passes. It then proceeds with a similar check for i = 2, and continues until a certain null-hypothesis cannot be rejected. The remaining hypotheses are retained.
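The z-statistic of Eq. (8) together with Holm's step-down procedure can be sketched compactly with the standard library (Φ via `math.erf`); the shortened method labels below stand for the Table 5 methods:

```python
from math import sqrt, erf

def holm_vs_control(avg_ranks, control, k, N, alpha=0.05):
    # z-statistic of Eq. (8) and two-sided p-value via the standard normal CDF
    phi = lambda z: 0.5 * (1 + erf(z / sqrt(2)))
    se = sqrt(k * (k + 1) / (6 * N))
    stats = []
    for name, r in avg_ranks.items():
        if name == control:
            continue
        z = (r - avg_ranks[control]) / se
        p = 2 * min(phi(z), 1 - phi(z))
        stats.append((p, z, name))
    stats.sort()                               # ascending p-values
    rejected = []
    for i, (p, z, name) in enumerate(stats, start=1):
        if p < alpha / (k - i):                # Holm's step-down threshold
            rejected.append(name)
        else:
            break                              # stop at the first non-rejection
    return stats, rejected
```

Fed with the average ranks of Table 5 and BL as control, this reproduces the z-values and rejection pattern of Table 7 (six methods rejected before Far Knn fails the check), up to rounding of the published ranks.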

Table 7 shows the result of the Holm test at the α = 0.05 significance level, with the BL algorithm as control classifier. The table lists each method according to sorted p-values and shows the z-, p- and αcomp-values. The significance column indicates whether the proposed method is significantly different from the BL, and coincidentally matches the result of the Nemenyi test. The p-values of the (Far Knn, RUS, CBU) undersampling methods, AB and AC (R = 28) are only slightly higher than the level required to reject the null-hypothesis. We included a critical significance level αcrit, corresponding to the lowest possible significance level at which the method would be considered significantly different from the BL (if α = αcrit then p = αcomp). In the case of AB, for example, if we were to choose a significance level α = 0.0764 (a confidence level of 92.36%), the associated p-value would be smaller than αcomp and we would proceed to conclude²⁶ that AB performs significantly different from the BL. To summarize: the null-hypothesis of equal performance between the BL and an alternative method proposed in Section 3 (except Cl Knn) is rejected with a confidence level of more than 91% according to Holm's test. Hence the performance of the BL approach is significantly worse than that of each of the proposed over- and undersampling methods (except Cl Knn), cost-sensitive learning techniques (AC) and boosting variants (AB, EE).

Similar to the previous paragraph, Table 8 presents the results of Holm's test at the α = 0.05 significance level with EE(S = 15) as control classifier. The results coincidentally correspond with the result of the Nemenyi test and indicate that the oversampling (OSR, SMOTE and ADASYN), AC(R = RL) and EE(S = 10) methods are not statistically different from the EE(S = 15) technique at the 0.05 significance level. However, if we were to adapt the significance level to approximately α = 0.25 (75% confidence level), the results would conclude significance (except for EE(S = 10)). To summarize: the null-hypothesis of equal performance between EE(S = 15) and an alternative method proposed in Section 3 (except EE(S = 10)) is rejected with a confidence level of more than 75% according to Holm's test. Hence the performance of the oversampling, undersampling, AB and AC approaches is significantly worse than that of EE(S = 15). We can therefore safely conclude EE to be the preferred method in terms of AUC performance when dealing with imbalanced behaviour data.

25 αcomp adjusts the value of α to compensate for multiple comparisons.
26 This is not entirely true, since Holm's method requires Far Knn to pass the test first before proceeding, which only occurs at the 0.088662 significance level.


Table 7 Holm test at the α = 0.05 significance level with BL as reference. The table shows the z test statistic with associated p-value; αcomp = α/(k − i), with k the number of algorithms (k = 13) and i corresponding to the position of the method in the sorted p-value vector. Since we already ranked the p-values in ascending order, i takes on the row number (e.g. OSR has i = 5). The column 'significant' denotes whether we can reject the null-hypothesis (significant: p < αcomp). αcrit corresponds with the smallest possible significance level at which we would decide to reject the null-hypothesis (αcrit = α·p/αcomp).

z | p | αcomp | significant | αcrit

EE(S = 15) -6.51642 7.2E-11 0.004167 1 8.64E-10
EE(S = 10) -5.86009 4.63E-09 0.004545 1 5.09E-08
SMOTE -4.96936 6.72E-07 0.005 1 6.72E-06
ADASYN -4.78183 1.74E-06 0.005556 1 1.56E-05
OSR -4.64119 3.46E-06 0.00625 1 2.77E-05
AC(R = RL) -4.35991 1.3E-05 0.007143 1 9.11E-05
Far Knn -2.4378 0.014777 0.008333 0 0.088662
RUS -2.41436 0.015763 0.01 0 0.078815
AB -2.34404 0.019076 0.0125 0 0.076305
AC(R = 28) -2.20339 0.027567 0.016667 0 0.082701
CBU -2.13307 0.032919 0.025 0 0.065837
Cl Knn 0.609449 0.542227 0.05 0 0.542227

Table 8 Holm test at the α = 0.05 significance level with EE(S = 15) as reference.

z | p | αcomp | significant | αcrit

Cl Knn 7.12587 1.03E-12 0.004167 1 1.24E-11
BL 6.516421 7.2E-11 0.004545 1 7.92E-10
CBU 4.383348 1.17E-05 0.005 1 0.000117
AC(R = 28) 4.313027 1.61E-05 0.005556 1 0.000145
AB 4.172384 3.01E-05 0.00625 1 0.000241
RUS 4.102063 4.09E-05 0.007143 1 0.000287
Far Knn 4.078623 4.53E-05 0.008333 1 0.000272
AC(R = RL) 2.156513 0.031044 0.01 0 0.155218
OSR 1.875229 0.060761 0.0125 0 0.243045
ADASYN 1.734587 0.082814 0.016667 0 0.248442
SMOTE 1.547064 0.121848 0.025 0 0.243696
EE(S = 10) 0.65633 0.511612 0.05 0 0.511612

4.6.2 Timing-wise comparison

The performance of each of the proposed algorithms in terms of training times depends highly upon the chosen parameters. The β and βu parameters control the number of training instances and are an important contribution to the subsequent SVM learning times for each of the over- and undersampling techniques. The number of boosting iterations t ∈ [0, T] is also a major factor in each of the boosting variants proposed in Section 3.3. Furthermore, the chosen regularization parameter C highly influences the training times of the SVM inducer: higher C-values result in larger training times (Fan et al 2008). Not only do the chosen hyperparameters affect learning times; data characteristics such as the number of minority instances, the imbalance level, the number of features, the amount of overlapping, etc. also have a major effect.


In comparing each of the methods outlined in Section 3, we make use of a similar methodology as previously presented (see Section 4.6.1). For each of the ten folds, a suitable parameter combination is selected based on optimal validation set AUC-performance. These chosen parameters are applied in determining the timings of each method. The final results indicate the average timings across ten folds. Hence, we show the required amount of time for each method when using the parameter combination resulting in the best performance. Note that this might for instance result in using β = 1/3 in case of OSR, or using only t = 5 boosting iterations in EE. Computational timings are shown in Table 9.

As one can observe from the aforementioned table, CBU is the slowest method, as it relies on computationally intensive clustering procedures. The synthetic SMOTE and ADASYN oversampling approaches are very time consuming methods: they both rely on computationally expensive nearest neighbour computations, in conjunction with a synthetic sample generation procedure that requires a single pass over all minority training instances.

The RUS method shows, as expected, superior results in terms of computational timings. Other undersampling methods perform quite well, though they are slower than BL and OSR due to their inherent nearest neighbour computations.

The boosting variants seem to perform worse than BL, OSR and each of the undersampling methods. Since EE was the best method in terms of AUC-performance (see Section 4.6.1), we focus on this learner in this paragraph. For the medium sized datasets (all but CRF and Bank), OSR outperforms EE(S = 15), yet for the large datasets (CRF and Bank) the opposite is true. An SVM is known to have a quadratic computational complexity in the number of training instances (Suykens et al 2002; García and Lozano 2007). For this reason, we consider OSR (or any other oversampling technique) as inappropriate for large datasets. Since with EE each subset has a size limited to twice the minority class training size, we expect increasing benefits of the EE-technique for larger and more imbalanced behaviour data. As Junqué de Fortuny et al (2014a) have observed, larger behaviour data in terms of the number of instances or features also contributes to an increase in predictive power. Hence, we expect to see this type of big behaviour data more often in real-life applications, where computational resources can become a major constraint. Because the EE-technique is easily parallelizable by simultaneously performing the boosting process for each of the independent subsets, we added an extra row in Table 9: EE par corresponds with the time needed for EE(S = 15) divided by 15, indicating the time that would be required for EE if each subset were trained in parallel. We can see that EE par can outperform OSR in 8 out of 16 cases, precisely for the datasets that show a high imbalance level (e.g. p = 1 or p = []).

We conclude our comparison of the proposed methods in Figure 5, which shows the average ranks in terms of computational timings (see Table 9) and AUC performance for each of the available data sources. Regarding the ranks with respect to the latter, we added an extra row EE par, having the same AUC-performance as EE(S = 15) for each dataset in Table 5. Ranks are subsequently determined according to the procedure outlined in Section 4.6.1. Additionally, Figure 18 in Appendix D shows the result for the large and highly imbalanced CRF and Bank datasets. We expect the latter type of data to be of major interest for reasons already indicated.


Table 9 Computational timings (in seconds) averaged across ten folds, using the parameter combination resulting in the highest validation set AUC-performance. First and second best performances are emphasized in boldface (third and fourth best are underlined). EE par is the time required for EE(S = 15) divided by 15.

Mov G(p = 1) Mov G(p = 25) Mov Th(p = []) Yahoo A(p = 1)

BL 0032889 0056697 0558563 0026922OSR 0055043 0062802 099009 0044421

SMOTE 0218821 0937057 3841482 0057726ADASYN 0284688 1802399 5191265 0087694

RUS 0011431 0025383 0155224 0007991CL Knn 0046599 0599846 0989914 0037182Far Knn 0039887 080072 0683023 0027788

CBU 1034111 1060173 6822839 1692477AB 0169792 0841443 3460246 0139251

AC(R = 28) 0471994 2996585 1086907 0366555AC(R = RL) 053376 1179542 6065177 0209015EE(S = 10) 0117226 6065145 117995 0148973EE(S = 15) 020474 7173737 2119991 0180365

EE par 0013649 0478249 0141333 0012024

Yahoo A(p = 25) Yahoo G(p = 1) Yahoo G(p = 25) TaFeng(p = 1)

BL 0092954 0011915 0044164 0026728OSR 0027887 0013241 0047206 0040919

SMOTE 1062686 0056153 0883698 0219553ADASYN 2050993 0079073 1733367 0306618

RUS 0048471 0003234 0033423 0002916CL Knn 084391 0025404 0502515 0092167Far Knn 0664124 0026576 0500206 0080159

CBU 1569442 1287221 1355035 2467279AB 0445546 0078777 0169977 0114619

AC(R = 28) 1034044 0321723 0515953 0926178AC(R = RL) 0706215 0226741 0112949 0610233EE(S = 10) 1026577 0100331 1527146 0058052EE(S = 15) 1607596 0077483 2472582 010538

EE par 0107173 0005166 0164839 0007025

TaFeng(p = 25) Book(p = 1) Book(p = 25) LST(p = 1)

BL 0032033 0080035 0318093 0652045OSR 0032414 0132927 0092757 087152

SMOTE 5089283 3409418 1143444 4987705ADASYN 8148419 3689661 1225441 6840083

RUS 0020457 0022713 0031972 0432839CL Knn 1713731 0400873 3711648 2508374Far Knn 1539437 0379086 3988552 2511037

CBU 2642686 4198663 4631987 []AB 0713265 061719 1238585 2466151

AC(R = 28) 1234647 1666131 2330635 1451671AC(R = RL) 0279047 0860346 0197053 123763EE(S = 10) 2484502 2145747 7177484 0524066EE(S = 15) 3363971 2480066 1121945 0784111

EE par 0224265 0165338 0747963 0052274

Imbalanced classification in sparse and large behaviour datasets 37

Table 9 Continued. Additionally, an extra column indicating average ranks is included (lower ranks are preferred).

              Adver(p = [])  Adver(p = 1)  CRF(p = [])  Bank(p = [])
BL            0.010953       0.002796      0.725911     70.89334
OSR           0.012178       0.006166      3.685813     179.7481
SMOTE         0.123112       0.017764      5.633862     []
ADASYN        0.183767       0.021728      5.768669     []
RUS           0.012115       0.00204       0.147392     52.47441
CL Knn        0.061324       0.005568      1.106755     73.73282
Far Knn       0.079078       0.007069      1.110379     97.59619
CBU           33.78235       32.36754      []           []
AB            0.069199       0.103518      1.153196     83.08618
AC(R = 28)    0.193092       0.068905      2.047434     71.70548
AC(R = RL)    0.107652       0.037963      1.387174     106.3466
EE(S = 10)    0.138485       0.085686      0.198656     24.95117
EE(S = 15)    0.185136       0.139121      0.285345     36.40107
EE par        0.012342       0.009275      0.019023     2.426738

              Average Rank [pos]
BL            2.94 [2]
OSR           4.19 [4]
SMOTE         9.59 [11]
ADASYN        10.91 [13]
RUS           1.38 [1]
CL Knn        6.5 [5]
Far Knn       6.56 [6]
CBU           14 [14]
AB            8.06 [7]
AC(R = 28)    10.81 [12]
AC(R = RL)    9.25 [9]
EE(S = 10)    8.25 [8]
EE(S = 15)    9.56 [10]
EE par        3 [3]

Timings are a substantial concern with respect to these large datasets. As can be observed, the EE-technique thrives for these big and highly imbalanced data, even in its non-parallel form.



Fig 5 Average rank AUC versus average rank Time (see Table 9) across all datasets from Table 2 (excluding Flickr and Kdd). Regarding the AUC rank, an extra row EE par is added, having the same AUC-value as EE(S = 15). Ranks for each dataset are subsequently obtained via the procedure outlined in Section 4.6.1. Points occurring in the upper-right region are preferred.

5 The effect of the chosen base learner

In this short section we investigate the influence of the chosen base learner. More specifically, we ask ourselves whether some base inducers can more easily cope with the imbalanced learning issue, and in doing so we verify the conclusion that EE is a suitable technique to improve upon the baseline performance. We rely on several popular classification methods for behaviour data: two regularization-based techniques (SVM, LR) and two heuristic approaches (Naive Bayes (NB), Behavioural Similarity (BeSim)). The next paragraphs briefly introduce these methods. Note that source codes and dataset results are provided in our on-line repository.

LR constructs a linear model that maximizes the likelihood of the observed data by imposing a logistic model for the parametric probability model P(Y|X). Ng and Jordan (2002) provide a thorough introduction to the topic²⁷ and note that in its plain form LR can suffer from overfitting, especially when the input data is very high-dimensional. Therefore we resort to regularized logistic regression. We opt for L2-regularized LR, since L1-regularization corresponds with a natural way of imposing feature selection (Ng 2004). The LIBLINEAR toolbox (Fan et al. 2008) is used to obtain these L2-regularized LR models.

27 For an accessible introduction, see the chapter on “Generative and Discriminative Classifiers: Naive Bayes and Logistic Regression” provided at the following link: http://www.cs.cmu.edu/~tom/NewChapters.html
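To make the regularized objective concrete, the L2-regularized LR model can be sketched with plain batch gradient descent. This is didactic only: the experiments in the paper use the LIBLINEAR solver, and only the objective (with cost parameter C and labels in {−1,+1}) follows the LIBLINEAR formulation; all names below are our own.

```python
import numpy as np

def train_l2_logreg(X, y, C=1.0, lr=0.1, n_iter=500):
    """Minimize 0.5*||w||^2 + C * sum_i log(1 + exp(-y_i * w.x_i))
    by batch gradient descent, with y_i in {-1, +1}."""
    y = np.where(y > 0, 1.0, -1.0)
    w = np.zeros(X.shape[1])
    for _ in range(n_iter):
        margins = np.clip(y * (X @ w), -30.0, 30.0)  # avoid overflow in exp
        # gradient of the objective: w - C * sum_i sigmoid(-m_i) * y_i * x_i
        grad = w - C * (X.T @ (y / (1.0 + np.exp(margins))))
        w -= lr * grad / len(X)
    return w
```

The L2 penalty keeps the weights small, which is what counteracts overfitting on very high-dimensional behaviour data; the strength of the learner is steered through C.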


NB (Ng and Jordan 2002) relies on the use of Bayes' rule, which expresses the posterior P(Y|X) as the product of the likelihood P(X|Y) and the prior P(Y), divided by the evidence P(X). The naivety of the approach originates from assuming conditional independence of the features. We impose a multivariate event model regarding the distribution P(X|Y). Junqué de Fortuny et al. (2014a) provide an efficient NB implementation for dealing with large and sparse binary behaviour data.
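A dense, didactic sketch of such a multivariate Bernoulli event model with Laplace smoothing for binary behaviour features follows; the implementation used in the paper is optimized for large sparse matrices, which this toy version is not.

```python
import numpy as np

def fit_bernoulli_nb(X, y):
    """Multivariate Bernoulli naive Bayes: per class c, estimate
    P(x_j = 1 | y = c) with add-one (Laplace) smoothing."""
    classes = np.unique(y)
    log_prior = np.log(np.array([(y == c).mean() for c in classes]))
    theta = np.array([(X[y == c].sum(axis=0) + 1.0) / ((y == c).sum() + 2.0)
                      for c in classes])
    return classes, log_prior, np.log(theta), np.log(1.0 - theta)

def predict_bernoulli_nb(model, X):
    classes, log_prior, log_t, log_1mt = model
    # log P(y=c) + sum_j [ x_j * log theta_cj + (1 - x_j) * log(1 - theta_cj) ]
    joint = log_prior + X @ log_t.T + (1 - X) @ log_1mt.T
    return classes[np.argmax(joint, axis=1)]
```

Working in log-space keeps the product of thousands of per-feature probabilities numerically stable, which matters for the high-dimensional data considered here.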

The BeSim approach conceptually corresponds to projecting the bigraph to a unigraph using the procedure outlined in Section 3.2. Next, the weighted-vote relational neighbour classifier (Macskassy and Provost 2007) is used to infer unknown labels/scores through a weighted probability estimation using the known labels of neighbouring nodes. A very fast and scalable implementation of this approach (the SW-transformation) is presented in detail in Stankova et al. (2015).
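The voting idea can be sketched as follows. This is a simplified stand-in: labelled training instances vote on a target through the features they share, and very popular features are damped; the exact weighting of the SW-transformation in Stankova et al. (2015) differs.

```python
import numpy as np

def besim_scores(X_train, y_train, X_test, eps=1e-12):
    """Weighted-vote sketch on the instance projection of the bigraph:
    score = (votes reaching the target from positive neighbours) /
            (votes reaching it from all labelled neighbours),
    where votes flow through shared binary features and each feature
    is down-weighted by (the log of) its degree."""
    degree = X_train.sum(axis=0)                 # instances linked to each feature
    w = 1.0 / np.log1p(degree + 1.0)             # damp hub features (illustrative)
    pos_votes = X_test @ (w * X_train[y_train == 1].sum(axis=0))
    all_votes = X_test @ (w * X_train.sum(axis=0))
    return pos_votes / (all_votes + eps)
```

Note that the score is computed directly from the bipartite adjacency matrix, without ever materializing the (dense) projected unigraph — the same property that makes the SW-transformation scale to behaviour data.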

In Table 10 we compare the performances of the baseline (BL) application of the methods with their EE-version (with S = 15, T = 20). It is quite striking to see the EE-version dominating the baseline version when looking at each method individually. This suggests that the imbalanced learning issue has an impact for each of the aforementioned techniques. Focussing on the regularization-based approaches, we observe the baseline LR to be less influenced by the imbalance than the SVM method. Their EE-versions (EE SVM and EE LR) can boost the baseline performance in almost all cases and have the highest average ranks amongst all the proposed methods. As already emphasized, the regularization-based approaches offer an added element of flexibility in the sense that the strength of the underlying base learner can be controlled.

The heuristic approaches NB and BeSim do not offer this flexibility. Furthermore, it is not directly possible to integrate instance weights during the training phase of the base learner. During the boosting process, we therefore sample from the distribution Dt to generate (unweighted) examples and train a BeSim/NB learner next. The results in Table 10 indicate BL NB to be the worst performing algorithm amongst all methods tried. Due to the conditional independence assumption, one could regard this as a weak learner, which is therefore suitable to be used in a boosting process. Indeed, the EE NB approach is able to significantly improve upon its performance. We found this to be the slowest EE-version²⁸ (e.g. taking more than 24 hours to process a single fold of the Bank dataset, versus several minutes for the other versions). The average rank estimate of EE NB is rather pessimistic, since we assigned the lowest performance across all methods in case of missing results. BeSim can be considered a stronger learner compared to NB. Its possible boost in performance is typically lower than that of NB. Having said this, we found EE BeSim to improve upon BL BeSim for all datasets (except Adver(p = 1)).

28 This statement only relates to the EE-version. In the baseline case, the timings of the NB-version are roughly of the same order of magnitude as SVM and LR (though this heavily depends on the chosen regularization parameter C). With EE, each subset has a size limited to twice the minority class training size (a relatively low number compared to the size of the majority class). This works in favour of SVM/LR (e.g. SVM has a computational complexity of O(m²), with m the number of training instances). Also note that the NB-version is indeed optimized for sparse data. Internally, the NB-implementation writes a fairly large model file to a specified directory (sizes of 1 GB can be encountered there). This writing operation slows down the training stage (especially in case of EE, where S × T = 15 × 20 = 300 model files are constructed).
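The resampling workaround for learners that cannot handle instance weights can be sketched as follows (a minimal illustration; variable names are our own):

```python
import numpy as np

def resample_from_distribution(X, y, D, n=None, seed=0):
    """Draw an unweighted bootstrap of the training set in which
    example i appears with probability D[i], the current boosting
    distribution (D must sum to 1). The weight-oblivious learner
    (NB or BeSim) is then trained on the returned sample as usual."""
    rng = np.random.default_rng(seed)
    n = len(X) if n is None else n
    idx = rng.choice(len(X), size=n, replace=True, p=D)
    return X[idx], y[idx]
```

Examples that the boosting distribution emphasizes are simply duplicated in the sample, which approximates weighted training at the cost of some sampling noise.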


Table 10 Average tenfold AUC test set performance of the BL compared to its EE-version (S = 15, T = 20) for SVM, LR, BeSim and NB. In case the EE NB implementation crashed, we put a × in the table. If it is too slow (> 24 hours per fold), we place a []. Additionally, average ranks are assigned to each method. In case EE NB has missing results for a particular dataset, we ensure it has the lowest rank across all other algorithms.

            Mov G(p = 1)  Mov G(p = 25)  Mov Th(p = [])  Yahoo A(p = 1)
BL SVM      71.6 (2.62)   81.41 (1.32)   79.77 (5.33)    56.49 (3.37)
EE SVM      76.12 (2.88)  85.13 (1.86)   86.43 (5.86)    59.74 (2.96)
BL LR       71.02 (2.09)  84.39 (1.84)   83.14 (4.17)    57.84 (2.39)
EE LR       76.69 (2.92)  85.03 (1.98)   86.3 (5.37)     59.79 (2.62)
BL BeSim    76.1 (3.58)   81.3 (2.92)    82.81 (6.6)     56.27 (2.73)
EE BeSim    76.31 (3.71)  81.37 (2.9)    85.02 (6.28)    57.7 (1.71)
BL NB       70.26 (5.84)  77.01 (2.54)   70.48 (10.14)   52.56 (2.09)
EE NB       75.93 (2.83)  85.56 (2.01)   86.91 (4.15)    57.55 (2.73)

            Yahoo A(p = 25)  Yahoo G(p = 1)  Yahoo G(p = 25)  TaFeng(p = 1)
BL SVM      61.61 (2.48)     66.84 (3.66)    78.82 (1.39)     55.75 (1.6)
EE SVM      66.38 (3.16)     73.48 (2.32)    80.55 (1.55)     61.13 (1.83)
BL LR       66.27 (2.96)     69.82 (1.93)    80.45 (1.59)     58.91 (2.31)
EE LR       66.22 (3.28)     73.08 (2.14)    80.53 (1.56)     61.43 (2.32)
BL BeSim    64.54 (2.02)     68.89 (2.49)    79.55 (1.96)     57.89 (1.18)
EE BeSim    65.25 (2.23)     71.18 (2.91)    80.04 (1.85)     59.36 (1.47)
BL NB       65 (1.65)        63.33 (2.56)    78.89 (1.64)     54.61 (1.2)
EE NB       66.6 (2.79)      70.99 (2.88)    81.01 (1.3)      59.01 (1.84)

            TaFeng(p = 25)  Book(p = 1)   Book(p = 25)  LST(p = 1)
BL SVM      66.94 (1.34)    52.6 (1.29)   60.08 (0.71)  99.99 (0.01)
EE SVM      70.4 (1.3)      55.34 (1.28)  65.4 (0.51)   99.98 (0.01)
BL LR       69.24 (1.3)     55.34 (1.27)  63.84 (0.75)  99.99 (0.01)
EE LR       70.28 (1.28)    55.49 (1.49)  65.41 (0.63)  99.97 (0.02)
BL BeSim    67.49 (1.23)    55.19 (1.27)  63.7 (0.63)   99.99 (0.01)
EE BeSim    68 (1.21)       55.21 (1.15)  64.38 (0.42)  99.99 (0)
BL NB       65.21 (1.64)    52.93 (0.9)   59.75 (0.47)  98.69 (0.3)
EE NB       70.72 (1.15)    ×             63.46 (0.61)  99.92 (0.04)

            Adver(p = [])  Adver(p = 1)  CRF(p = [])    Bank(p = [])
BL SVM      96.37 (1.94)   91.18 (2.97)  64.36 (18.97)  66.82 (0.88)
EE SVM      97.63 (1.35)   93.3 (2.14)   86.35 (9.99)   71.54 (0.76)
BL LR       97.19 (1.44)   88.51 (1.93)  81.87 (19.63)  71.43 (0.72)
EE LR       97.57 (0.96)   93.02 (2.06)  86.84 (9.62)   71.77 (0.62)
BL BeSim    97.26 (1.12)   95.38 (1.35)  86.91 (9.36)   67.85 (0.67)
EE BeSim    97.38 (1.04)   93.83 (1.35)  87.02 (10.43)  70.41 (0.55)
BL NB       93.75 (1.9)    93.37 (1.9)   87.24 (9.38)   67.83 (0.63)
EE NB       94.04 (1.75)   ×             ×              []

            Flickr(p = 0.1)  Kdd(p = 0.5)  Average Rank [pos]
BL SVM      74.92 (0.17)     74.53 (0.05)  6.44 [7]
EE SVM      79.86 (0.13)     80.98 (0.05)  2.39 [1]
BL LR       79.03 (0.11)     81.29 (0.04)  4.28 [4]
EE LR       79.85 (0.13)     80.75 (0.05)  2.61 [2]
BL BeSim    74.62 (0.13)     74.95 (0)     5.11 [6]
EE BeSim    76.4 (0.13)      77.55 (0.03)  3.61 [3]
BL NB       81.36 (0.1)      74.29 (0.05)  6.5 [8]
EE NB       []               []            5.06 [5]


6 Conclusions and future research directions

We investigated the effects of sampling and cost-sensitive learning techniques on the problem of learning from imbalanced behaviour data. This type of big data has fundamentally different properties and internal structure than traditional data and is characterized by sparseness and very large dimensions. This is a new research topic which enables benefits across a wide variety of application domains, such as fraud detection, targeted advertising, predictive policing, churn prediction, default prediction, etc. To enable future developments in this field, we provide an on-line repository containing implementations, datasets and results, available at the following link: http://www.applieddatamining.com/cms?q=software

Oversampling techniques need to be adapted in order to cope with this kind of data. We proposed new versions of the traditional SMOTE and ADASYN techniques, which differ in the way synthetic samples are generated and similarity computations are performed. The plain random oversampling (OSR) and the synthetic approaches are competitive methods in terms of AUC and do not seem to suffer from overfitting, as traditional studies report. Though the synthetic approaches have comparable AUC-performances in comparison to OSR, they are useless in practice due to their computationally expensive procedures. The OSR method is very fast for small and medium-sized datasets, yet becomes inappropriate for larger datasets because the training times of the underlying base learner increase drastically.

With respect to undersampling methodologies, we employed a random undersampling technique (RUS) and Knn-based informed approaches. The latter type selects majority class instances that are closest (Cl Knn) or farthest (Far Knn) from the minority class examples and allows one to investigate the effect of noise/outliers on SVM performance. We observed that these odd instances have a severe performance degrading effect. RUS and Far Knn approaches can significantly improve performance with respect to the baseline, thanks to their abilities to reduce noise/outlier levels and to put more emphasis on the minority class. Having said this, they usually perform worse than any of the oversampling approaches. This is in sheer contrast to the results obtained from traditional studies and is heavily related to the intrinsic nature of behaviour data. Timing-wise, RUS is the fastest method. The informed approaches are slower than RUS due to the required nearest neighbour computations. Additionally, a cluster-based undersampling (CBU) technique was proposed to alleviate the within-class imbalance. For low undersampling rates, the performance of CBU is superior to plain RUS, whereas for high undersampling rates the opposite is true. Overall, no substantial gain in performance is obtained in comparison with a simple RUS approach. Furthermore, the method is the slowest among all techniques considered and relies on computationally expensive clustering heuristics.

We studied the effects of boosting, where we made use of a very specific linear SVM and logistic regression combination that uses confidence-rated predictions instead of a more popular plain “discrete” boosting algorithm (with weak hypotheses having values in {−1,1}). Our experiments clearly indicated the regularization constant C in the SVM formulation to act as a “weakness” indicator. Indeed, higher C-values cause stronger learners and shouldn't be used in the boosting process, which matches with the analysis of studies dealing with traditional data. Strong learners can


already quite accurately distinguish majority class from minority class behaviours. The boosting process will focus on the hard to learn instances, among which are noise/outliers, and this can result in overfitting behaviour. We also observed that boosting can usually outperform the plain baseline, indicating that a combination of weak learners can achieve higher AUC-performance than a single strong learner.

EasyEnsemble (EE) integrates random undersampling with a boosting process. By sampling several balanced subsets from the training data and combining these, we can emphasize the minority class, achieve noise/outlier reduction effects and simultaneously explore the majority class space efficiently. Our version differs from the original formulation because we employ a different type of underlying weak learner (SVM and LR combination) that is fed to a confidence-rated boosting algorithm (instead of the discrete version). We observed superior AUC-performance with respect to all of the aforementioned techniques. Furthermore, the method is very fast, since each subset can be trained in parallel and has a size that is twice as large as the minority class training set. We studied the effect of the number of subsets and noted that a limited amount of these (O(10)) is already sufficient to obtain satisfactory performances; increasing this number further causes only minor benefits.

Additionally, a statistical comparison of each of the proposed methods was conducted across a wide variety of datasets. Holm's significance test indicates that each of the proposed techniques (oversampling, undersampling (except Cl Knn), cost-sensitive learning and boosting variants) can significantly outperform the plain baseline performance (each of the null-hypotheses of equal performance gets rejected with a confidence level of at least 91%). Comparison of the techniques with EE showed that the latter can significantly outperform all of them (each of the null-hypotheses of equal performance gets rejected with a confidence level of at least 75%). A timing-wise comparison showed EE to be the preferred method (after RUS) for large and highly imbalanced datasets. A parallel implementation would make EE the fastest method, even for medium-sized datasets that show a high level of imbalance.

The influence of the chosen base learner was investigated by comparing the baseline to its EE-version for the SVM, LR, NB and BeSim inducers. Each of the EE-versions is able to improve upon the baseline performance for nearly all datasets, where base learners of the regularization type are superior because of their ability to control the strength of the learner.

Though we conducted an extensive benchmark study, there are still plenty of future research directions. First and most important, there is an urgent need to develop a more general theoretical framework to completely understand the effect of imbalance and why certain methods outperform others. Secondly, other types of mechanisms could be investigated in this setting: advanced sampling (Guo et al. 2008) set-ups perform sampling based on preliminary classifications; one-class learning is a recognition-based methodology where a discriminative boundary is learned around the training examples of a single class alone, and Raskutti and Kowalczyk (2004) demonstrated the superiority of the recognition-based one-class SVMs over the discriminative two-class SVMs; other types of ensemble learning strategies (e.g. bagging and stacking) could be investigated; Batista et al. (2004) note that combining over- and undersampling can be beneficial when data are highly imbalanced or show very few minority class instances; the ideas behind Extended Nearest Neighbour (Tang and


He 2015) could guide us in the sampling process. As opposed to Knn, “it considers not only who are the nearest neighbours of a certain sample, but also who consider the particular sample as their nearest neighbour” (Tang and He 2015). The method assigns a class membership to an unknown sample that maximizes the intra-class coherence. Finally, more advanced techniques do exist that can provide the K nearest neighbours faster or with (slightly) better performance.

Another issue relates to the lack of comprehensibility of the EasyEnsemble technique in its present form. Because we currently integrate Logistic Regression in the boosting process, the final ensemble classifier is non-linear and therefore incomprehensible. It might be interesting to investigate the numerical boosting technique proposed in Schapire and Singer (1999), where each weak learner is allowed to output real values (instead of values in {−1,1}). In that case, we would be able to use a plain linear SVM as a weak learner, without the Logistic Regression component. The final ensemble would be a linear model, which is surely easier to comprehend. Also, the instance-based approach of Martens and Provost (2014) could be used to explain individual predictions of the non-linear EE-technique.

A Oversampling experiments

Table 11 Oversampling experiments. Results show the average tenfold AUC test set performance with respect to increasing β-values (β = [β1, β2, β3, β4] = [0, 1/3, 2/3, 1]). Additionally, standard deviations are shown between brackets. Optimal performances for each method are highlighted in boldface. See Section 4.2 for parameter settings.

MOV G(p = 1)
         β1             β2             β3             β4
OSR      71.6 (2.62)    74.37 (2.04)   73.6 (1.84)    74.73 (2.45)
SMOTE    71.6 (2.62)    75.08 (2.18)   76.02 (2.14)   76.48 (2.3)
ADASYN   71.6 (2.62)    75.16 (1.92)   75.93 (2.08)   76.47 (2.29)

MOV G(p = 25)
OSR      81.41 (1.32)   83.49 (1.81)   83.84 (1.96)   83.91 (2.04)
SMOTE    81.41 (1.32)   83.32 (1.97)   83.59 (2.04)   83.76 (2.11)
ADASYN   81.41 (1.32)   83.61 (1.82)   84.02 (1.97)   83.69 (1.96)

Mov Th(p = [])
OSR      79.77 (5.33)   85.3 (4.66)    83.16 (4.5)    84.59 (5.69)
SMOTE    79.77 (5.33)   84.18 (6.51)   85.58 (5.97)   84.33 (5.75)
ADASYN   79.77 (5.33)   84.11 (6.77)   85.86 (6.06)   85.36 (5.13)

Yahoo A(p = 1)
OSR      55.92 (2.97)   58.66 (3.27)   59.99 (2.28)   59.74 (1.78)
SMOTE    55.92 (2.97)   59.76 (2.62)   59.74 (2.67)   59.43 (2.4)
ADASYN   55.92 (2.97)   59.54 (2.53)   59.55 (2.94)   59.56 (2.22)

Yahoo A(p = 25)
OSR      61.68 (2.42)   64.19 (3.17)   65.08 (3.26)   64.67 (2.1)
SMOTE    61.68 (2.42)   65.46 (3.63)   65.33 (3.23)   64.52 (2.98)
ADASYN   61.68 (2.42)   65.04 (3.74)   65.41 (3.47)   64.4 (2.21)

Yahoo G(p = 1)
OSR      66.84 (3.66)   72.18 (2.36)   73.11 (2.7)    72.49 (3.41)
SMOTE    66.84 (3.66)   72.65 (2.85)   73.27 (3.36)   73.37 (3.56)
ADASYN   66.84 (3.66)   72.87 (2.83)   73.18 (3.2)    73.39 (3.59)

Yahoo G(p = 25)
OSR      78.82 (1.39)   78.78 (2.02)   78.74 (1.86)   78.35 (1.47)
SMOTE    78.82 (1.39)   79.23 (1.57)   79.1 (1.2)     79.03 (1.89)
ADASYN   78.82 (1.39)   79.12 (1.43)   79.23 (1.37)   79.51 (2.01)

TaFeng(p = 1)
OSR      55.75 (1.6)    59.23 (1.96)   60 (1.68)      61.04 (2.36)
SMOTE    55.75 (1.6)    60.26 (1.95)   61.49 (1.8)    61.13 (1.52)
ADASYN   55.75 (1.6)    60.26 (1.9)    61.44 (1.85)   61.16 (1.5)

TaFeng(p = 25)
OSR      66.94 (1.34)   68.84 (1.21)   66.99 (1.42)   67.7 (1.41)
SMOTE    66.94 (1.34)   68.47 (1.5)    67.07 (1.15)   66.65 (0.81)
ADASYN   66.94 (1.34)   68.62 (1.38)   67.85 (1.6)    66.91 (1.39)

Book(p = 1)
OSR      52.6 (1.29)    53.61 (0.94)   55.41 (1.75)   55.87 (1.44)
SMOTE    52.6 (1.29)    54.77 (0.99)   54.91 (0.8)    54.36 (0.98)
ADASYN   52.6 (1.29)    54.86 (1.13)   55.06 (0.73)   54.54 (0.92)

Book(p = 25)
OSR      60.08 (0.71)   61.05 (0.94)   61.82 (1.12)   64.62 (0.57)
SMOTE    60.08 (0.71)   62.6 (0.73)    60.95 (0.68)   63 (0.8)
ADASYN   60.08 (0.71)   62.33 (0.73)   60.77 (0.85)   63.04 (0.58)

LST(p = 1)
OSR      99.99 (0.01)   99.99 (0.01)   99.99 (0.01)   99.99 (0.01)
SMOTE    99.99 (0.01)   99.99 (0.01)   99.99 (0.01)   99.99 (0.01)
ADASYN   99.99 (0.01)   99.99 (0.01)   99.99 (0.01)   99.99 (0.01)

Adver(p = [])
OSR      96.61 (1.82)   97.31 (1.65)   97.07 (1.84)   97.07 (1.79)
SMOTE    96.61 (1.82)   96.91 (1.66)   97.19 (1.65)   97.07 (1.91)
ADASYN   96.61 (1.82)   97.1 (1.7)     97.08 (1.87)   97.07 (1.88)

Adver(p = 1)
OSR      90.93 (3.02)   91.27 (3.03)   92.66 (2.82)   93.29 (1.97)
SMOTE    90.93 (3.02)   92.51 (2.03)   92.96 (2.14)   93.53 (1.81)
ADASYN   90.93 (3.02)   92.22 (2.33)   92.7 (2.36)    93.88 (1.73)

CRF(p = [])
OSR      64.06 (16.43)  80.82 (12.94)  81.28 (12.27)  81.91 (11.28)
SMOTE    64.06 (16.43)  78.64 (16.86)  82.52 (13.74)  79.32 (16.26)
ADASYN   64.06 (16.43)  78.95 (16.72)  81.19 (16.32)  79.31 (16.19)

Bank(p = [])
OSR      66.82 (0.88)   70.1 (0.74)    71.39 (0.8)    71.47 (0.8)
SMOTE    []             []             []             []
ADASYN   []             []             []             []


B Undersampling experiments

Table 12 Undersampling experiments. Results show the average tenfold AUC test set performance with respect to increasing βu-values (βu = [βu1, βu2, βu3, βu4, βu5] = [0, 1/4, 1/2, 3/4, 1]). Optimal performances (boldface) are highlighted for each imbalance ratio. Additionally, standard deviations are shown between brackets. Cl K represents the “Closest Knn” technique, CL T represents the “Closest tot sim” technique (similar for Far K and Far T, see Section 3.2). Parameter settings are given in Section 4.2.

Mov G(p = 1)
        βu1          βu2          βu3          βu4          βu5
RUS     71.6 (2.6)   71.83 (2.6)  72.54 (2.5)  72.39 (3.1)  70.61 (3.5)
Cl K    71.6 (2.6)   71.4 (2)     70.96 (1.9)  70.43 (2.4)  69.05 (4.1)
CL T    71.6 (2.6)   70.28 (2.5)  66.74 (2)    66.8 (2.1)   68.18 (3.6)
Far K   71.6 (2.6)   72.36 (2.7)  71.26 (3.4)  66.57 (5.2)  53.5 (3.5)
Far T   71.6 (2.6)   72.22 (2.8)  71.63 (3.6)  64.28 (5.3)  50.88 (4.4)
CBU     72.55 (2.6)  73.28 (2.6)  73.12 (2.6)  73.84 (2.5)  73 (3.1)

Mov G(p = 25)
RUS     81.41 (1.3)  81.36 (1.3)  81.78 (1.7)  82.05 (1.7)  81.6 (2.1)
Cl K    81.41 (1.3)  80.86 (1.2)  80.95 (1.6)  79.73 (2.3)  77.95 (2.3)
CL T    81.41 (1.3)  79.9 (1.2)   78.21 (1.4)  77.87 (1.5)  77.76 (2.3)
Far K   81.41 (1.3)  80.9 (1.5)   78.17 (1.8)  74.25 (2.4)  69.79 (3.2)
Far T   81.41 (1.3)  80.86 (1.5)  77.2 (2.4)   71.16 (2.7)  62.4 (2.8)
CBU     81.53 (1.4)  81.64 (1.3)  81.29 (1.6)  81.28 (2.1)  80.34 (2.7)

Mov Th(p = [])
RUS     79.77 (5.3)  80.32 (5.8)  81.57 (5.5)  81.86 (6.6)  81.26 (6.2)
Cl K    79.77 (5.3)  79.25 (4.5)  78.07 (5)    76.25 (6.5)  62.46 (8.5)
CL T    79.77 (5.3)  78.4 (4.4)   72.41 (3.5)  64.66 (4.5)  60.37 (7.3)
Far K   79.77 (5.3)  84.54 (5)    83.64 (6.4)  80.02 (7.3)  56.82 (10.3)
Far T   79.77 (5.3)  85.03 (5.7)  82.68 (6.8)  75.61 (9.2)  56.77 (10.9)
CBU     80.11 (5.8)  81.17 (6)    81.08 (6.5)  84.17 (5.1)  80.96 (6.9)

Yahoo A(p = 1)
RUS     55.92 (3)    55.57 (3.4)  56.44 (3)    55.83 (3.4)  56.37 (3.3)
Cl K    55.92 (3)    55.67 (2.4)  53.12 (2)    50.57 (1.8)  53.79 (3.5)
CL T    55.92 (3)    55.69 (2.1)  53.35 (2.2)  50.31 (2.2)  52.35 (3.3)
Far K   55.92 (3)    57.35 (2.2)  56.92 (1.1)  56.95 (2.3)  51.18 (2)
Far T   55.92 (3)    56.93 (2.4)  54.74 (1.9)  57.01 (1.8)  51.18 (2)
CBU     58.21 (2.6)  58.45 (3.3)  58.31 (3.5)  58.39 (3.5)  56.09 (2.6)

Yahoo A(p = 25)
RUS     61.68 (2.4)  62.9 (2.9)   63.62 (3.6)  63.75 (3.1)  63.19 (1.9)
Cl K    61.68 (2.4)  61.14 (2.1)  57.62 (1.6)  54.02 (1.8)  51.48 (1.4)
CL T    61.68 (2.4)  60.89 (2.8)  58.11 (1.4)  54.4 (2.1)   51.76 (1.4)
Far K   61.68 (2.4)  63.96 (3)    62.62 (2.2)  59.61 (1.5)  56.25 (1.6)
Far T   61.68 (2.4)  63.71 (2.4)  59.72 (1.6)  57.27 (1.1)  54.47 (1.1)
CBU     62.46 (2.6)  61.85 (1.4)  61.78 (2.2)  59.94 (3)    60.1 (4)

Yahoo G(p = 1)
RUS     66.84 (3.7)  67.85 (3.2)  68.36 (3.2)  68.23 (4)    69.9 (4.2)
Cl K    66.84 (3.7)  66.71 (2.8)  64.3 (3.6)   61.98 (3.9)  61.15 (1.9)
CL T    66.84 (3.7)  65.79 (2.7)  63.55 (3.3)  59.21 (3.5)  61.08 (2.4)
Far K   66.84 (3.7)  66.76 (4.1)  63.84 (3.4)  65.16 (2)    48.5 (2.9)
Far T   66.84 (3.7)  66.95 (4.1)  63.48 (2.9)  65.16 (2)    48.48 (2.9)
CBU     69.68 (4.1)  70.59 (3.2)  70.64 (3.7)  70.2 (2.9)   63.35 (3.6)

Yahoo G(p = 25)
RUS     78.82 (1.4)  78.91 (1.6)  78.97 (1.6)  78.61 (1.6)  77.82 (2.1)
Cl K    78.82 (1.4)  77.26 (1.5)  72.52 (1.5)  67.86 (2)    65.07 (2.7)
CL T    78.82 (1.4)  76.83 (1)    71.99 (1.8)  67.15 (2.3)  61.1 (2.7)
Far K   78.82 (1.4)  78.26 (2.2)  74.69 (2.7)  67.22 (2.1)  60.72 (2.3)
Far T   78.82 (1.4)  77.68 (2.6)  72.44 (3)    64.94 (2.4)  59.6 (2)
CBU     75.25 (3.2)  75.22 (2.4)  74.69 (2.3)  73.07 (2.4)  70.69 (2.4)

TaFeng(p = 1)
RUS     55.75 (1.6)  56.1 (1.6)   56.26 (1.7)  57.23 (1.7)  59.25 (2.2)
Cl K    55.75 (1.6)  55.68 (1.6)  55.58 (1.5)  55.08 (1.1)  51.05 (1.5)
CL T    55.75 (1.6)  55.67 (1.6)  54.47 (1.6)  47.53 (1.6)  49.3 (1.1)
Far K   55.75 (1.6)  58.99 (1.2)  59.47 (1.1)  60.04 (1.2)  56.31 (1)
Far T   55.75 (1.6)  58.92 (1.3)  59.25 (1.3)  58.58 (1.1)  56.31 (1)
CBU     57.8 (1)     58.47 (1.1)  58.15 (0.9)  58.87 (1.4)  57.65 (1.6)

TaFeng(p = 25)
RUS     66.94 (1.3)  67.44 (1.3)  68.1 (1.4)   68.27 (1.4)  66.13 (1.2)
Cl K    66.94 (1.3)  66.13 (1.4)  63.39 (1.2)  59.83 (1.3)  56.94 (0.7)
CL T    66.94 (1.3)  66.38 (1.5)  62.89 (1.6)  57.46 (1.3)  54.56 (1.3)
Far K   66.94 (1.3)  68.06 (1.4)  66.43 (1.6)  64.46 (1.5)  63.35 (1.3)
Far T   66.94 (1.3)  64.31 (1.1)  62.69 (1)    61.27 (1.1)  59.03 (1)
CBU     64.81 (1.2)  64.15 (1.1)  64.13 (1.2)  63.88 (0.8)  63.46 (0.8)

Book(p = 1)
RUS     52.6 (1.3)   52.79 (0.9)  53.46 (0.8)  53.89 (0.9)  54.05 (0.9)
Cl K    52.6 (1.3)   52.56 (1.2)  52.52 (1.3)  52.39 (1.1)  53.09 (1.1)
CL T    52.6 (1.3)   52.56 (1.2)  52.52 (1.3)  52.39 (1.1)  53.05 (0.7)
Far K   52.6 (1.3)   55.21 (1.2)  56.21 (1.8)  56.14 (1.2)  53.06 (1)
Far T   52.6 (1.3)   55.21 (1.2)  56.21 (1.8)  56.14 (1.2)  53.06 (1)
CBU     54.28 (0.9)  53.77 (1)    53.33 (1.1)  53.34 (0.9)  52.84 (0.8)

Book(p = 25)
RUS     60.08 (0.7)  60.13 (0.6)  60.4 (0.8)   60.33 (0.8)  63.28 (0.8)
Cl K    60.08 (0.7)  59.96 (0.7)  60.13 (0.8)  59.96 (1)    59.28 (0.7)
CL T    60.08 (0.7)  59.96 (0.7)  60.13 (0.8)  60.29 (0.4)  54.5 (0.9)
Far K   60.08 (0.7)  63.29 (1)    64.19 (0.8)  57.3 (1.1)   55.66 (1.1)
Far T   60.08 (0.7)  62.14 (0.5)  58.27 (0.6)  56.37 (1)    55.66 (1.1)
CBU     54.82 (0.9)  54.67 (0.9)  54.71 (0.9)  54.66 (1)    54.78 (0.9)

LST(p = 1)
RUS     99.99 (0)    99.99 (0)    99.99 (0)    99.98 (0)    99.99 (0)
Cl K    99.99 (0)    99.99 (0)    99.99 (0)    99.99 (0)    99.99 (0)
CL T    99.99 (0)    99.99 (0)    99.99 (0)    99.99 (0)    99.98 (0)
Far K   99.99 (0)    99.98 (0)    99.98 (0)    99.98 (0)    99.98 (0)
Far T   99.99 (0)    99.98 (0)    99.98 (0)    99.98 (0)    99.98 (0)
CBU     []           []           []           []           []

Adver(p = [])
RUS     96.61 (1.8)  96.32 (1.8)  96.63 (1.4)  97.12 (2.1)  96.22 (1.6)
Cl K    96.61 (1.8)  96.44 (1.5)  96.14 (1.5)  96.04 (2)    94.8 (2.5)
CL T    96.61 (1.8)  95.87 (2.1)  94.32 (1.9)  93.01 (2.2)  90.72 (2.3)
Far K   96.61 (1.8)  96.53 (1.4)  95.76 (2)    94.39 (1.8)  90.49 (3.1)
Far T   96.61 (1.8)  96.54 (1.5)  95.67 (1.9)  94.54 (1.8)  89.3 (2.8)
CBU     96.85 (2.3)  96.85 (2.3)  97.05 (1.5)  96.6 (1.6)   96.06 (2.1)

Adver(p = 1)
RUS     90.93 (3)    91.53 (3.1)  92.37 (3.4)  91.9 (2.9)   91.93 (2.2)
Cl K    90.93 (3)    90.64 (3)    89.87 (3.9)  90.21 (3.6)  89.18 (2)
CL T    90.93 (3)    89.7 (3.5)   88.55 (3.4)  85.76 (3.3)  88.2 (2.3)
Far K   90.93 (3)    93.8 (2.3)   92.4 (2.6)   88.73 (3.4)  85.51 (4)
Far T   90.93 (3)    93.62 (2.4)  93.2 (2.2)   88.41 (3.6)  85.51 (4)
CBU     93.22 (2.4)  93.76 (2.5)  93.89 (2.6)  93.52 (2.7)  91.27 (2)

CRF(p = [])
RUS     64.06 (16.4)  63.28 (15.9)  67.98 (17.4)  66.95 (21.9)  87.73 (8.8)
Cl K    64.06 (16.4)  62.44 (16.6)  62.34 (16.9)  71.37 (13.8)  78.22 (17.7)
CL T    64.06 (16.4)  62.44 (16.6)  62.34 (16.9)  71.37 (13.8)  62.67 (22.9)
Far K   64.06 (16.4)  83.8 (14.2)   83.93 (14.8)  84.49 (13.7)  86.11 (9.7)
Far T   64.06 (16.4)  83.8 (14.2)   83.93 (14.8)  84.49 (13.7)  86.11 (9.7)
CBU     []            []            []            []            []

Bank(p = [])
RUS     66.82 (0.9)  67.02 (0.9)  67.37 (0.8)  67.99 (0.6)  69.5 (1)
Cl K    66.82 (0.9)  66.17 (0.7)  65.24 (0.6)  64.86 (0.6)  58.53 (1.1)
CL T    66.82 (0.9)  64.92 (1.1)  60.69 (0.9)  56.33 (0.8)  52.87 (0.7)
Far K   66.82 (0.9)  66.95 (0.6)  66.19 (0.6)  64.42 (0.6)  58.25 (1.1)
Far T   66.82 (0.9)  67.16 (0.6)  64.2 (0.8)   59.67 (1)    58.25 (1.1)
CBU     []           []           []           []           []


C Boosting experiments

This section presents the results of each of the boosting experiments from Section 4.5 on the remaining data sources. Each of the figures below depicts the average tenfold AUC-performance on test data (with μ = 100) with respect to the number of boosting iterations T for (left) Adaboost (AB), AdaCost (AC) and EasyEnsemble (EE), with C chosen according to the highest validation set AUC-performance (over all possible boosting rounds), and (right) AB and EE (with S = 15) with varying C-levels.


Fig 6 Mov G(p = 1) dataset


Fig 7 Mov Th(p = []) dataset



Fig 8 Yahoo A(p = 1) dataset


Fig 9 Yahoo A(p = 25) dataset


Fig 10 Yahoo G(p = 1) dataset



Fig 11 Yahoo G(p = 25) dataset


Fig 12 TaFeng(p = 1) dataset

[Figure] Fig. 13 Book(p = 1) dataset. AUCtest versus boosting iteration T ∈ [0, 30]: panel (a) compares AB, AC and EE(S = 5, 10, 15) variants against the baseline BL; panel (b) compares AB and EE for C ∈ {1e-07, 1e-05, 0.001, 0.1} against BL.


[Figure] Fig. 14 LST(p = 1) dataset. AUCtest versus boosting iteration T ∈ [0, 30]: panel (a) compares AB, AC and EE(S = 5, 10, 15) variants against the baseline BL; panel (b) compares AB and EE for C ∈ {1e-07, 1e-05, 0.001, 0.1} against BL.

[Figure] Fig. 15 Adver(p = []) dataset. AUCtest versus boosting iteration T ∈ [0, 30]: panel (a) compares AB, AC and EE(S = 5, 10, 15) variants against the baseline BL; panel (b) compares AB and EE for C ∈ {1e-07, 1e-05, 0.001, 0.1} against BL.

[Figure] Fig. 16 Adver(p = 1) dataset. AUCtest versus boosting iteration T ∈ [0, 30]: panel (a) compares AB, AC and EE(S = 5, 10, 15) variants against the baseline BL; panel (b) compares AB and EE for C ∈ {1e-07, 1e-05, 0.001, 0.1} against BL.


[Figure] Fig. 17 CRF(p = []) dataset. AUCtest versus boosting iteration T ∈ [0, 30]: panel (a) compares AB, AC and EE(S = 5, 10, 15) variants against the baseline BL; panel (b) compares AB and EE for C ∈ {1e-07, 1e-05, 0.001, 0.1} against BL.

D Final Comparison

[Figure] Fig. 18 Average rank AUC versus average rank Time (see Table 9) across the large datasets from Table 2 (CRF and Bank). Regarding the former, an extra row EE par is added having the same AUC-value as EE(S = 15). Ranks for each dataset are subsequently obtained via the procedure outlined in Section 4.6.1. Points occurring in the upper-right region are preferred. Methods plotted: BL, OSR, SMOTE, ADASYN, RUS, Cl Knn, Far Knn, CBU, AB, AC(R = 28), AC(R = RL), EE(S = 10), EE(S = 15) and EE par.


References

Akbani R, Kwek S, Japkowicz N (2004) Applying support vector machines to imbalanced datasets. In: Machine Learning: ECML 2004, 15th European Conference on Machine Learning, Pisa, Italy, September 20-24, 2004, Proceedings. Springer Berlin Heidelberg, Berlin, Heidelberg, pp 39–50. DOI 10.1007/978-3-540-30115-8_7

Ali A Shamsuddin SM Ralescu AL (2015) Classification with class imbalance problem A review Inter-national Journal of Advances in Soft Computing and its Applications 7(3)176ndash204

Alzahrani T Horadam KJ (2016) Community detection in bipartite networks Algorithms and case studiesIn Complex Systems and Networks Dynamics Controls and Applications Springer Berlin Heidel-berg Berlin Heidelberg pp 25ndash50 DOI 101007978-3-662-47824-0 2

Bachner J (2013) Predictive policing Preventing crime with data and analytics IBM Center for the Busi-ness of Government

Baesens B Van Gestel T Viaene S Stepanova M Suykens J Vanthienen J (2003) Benchmarking state-of-the-art classification algorithms for credit scoring Journal of the Operational Research Society54(6)627ndash635 DOI 101057palgravejors2601545

Barandela R Snchez J Garca V Rangel E (2003) Strategies for learning in class imbalance problemsPattern Recognition 36(3)849 ndash 851 DOI httpsdoiorg101016S0031-3203(02)00257-1

Barber MJ (2007) Modularity and community detection in bipartite networks Physical Review E76066102 DOI 101103PhysRevE76066102

Barua S Islam MM Yao X Murase K (2014) MWMOTEndashmajority weighted minority oversamplingtechnique for imbalanced data set learning IEEE Transactions on Knowledge and Data Engineer-ing 26(2)405ndash425 DOI 101109TKDE2012232

Batista GEAPA Prati RC Monard MC (2004) A study of the behavior of several methods for balancingmachine learning training data SIGKDD Explor Newsl 6(1)20ndash29 DOI 10114510077301007735

Beckett SJ (2016) Improved community detection in weighted bipartite networks Royal Society OpenScience 3(1) DOI 101098rsos140536

Bekkar M Djemaa HK Alitouche TA (2013) Evaluation measures for models assessment over imbalanceddata sets Journal of Information Engineering and Applications 3(10)27ndash38

Bhattacharyya S Jha S Tharakunnel K Westland JC (2011) Data mining for credit card fraud A compara-tive study Decision Support Systems 50(3)602 ndash 613 DOI httpsdoiorg101016jdss201008008

Blondel VD Guillaume JL Lambiotte R Lefebvre E (2008) Fast unfolding of communities in large net-works Journal of Statistical Mechanics Theory and Experiment 2008(10)P10008

Breiman L, Friedman J, Stone CJ, Olshen RA (1984) Classification and regression trees. Taylor & Francis

Brozovsky L, Petricek V (2007) Recommender system for online dating service. In: Proceedings of Znalosti 2007 Conference, VSB Ostrava

Cha M, Mislove A, Gummadi KP (2009) A measurement-driven analysis of information propagation in the Flickr social network. In: Proceedings of the 18th International Conference on World Wide Web, ACM, New York, NY, USA, WWW '09, pp 721–730. DOI 10.1145/1526709.1526806

Chawla NV (2005) Data mining for imbalanced datasets An overview In Data mining and knowledgediscovery handbook Springer US Boston MA pp 853ndash867

Chawla NV Bowyer KW Hall LO Kegelmeyer WP (2002) SMOTE synthetic minority over-samplingtechnique Journal of artificial intelligence research 16321ndash357

Chawla NV Lazarevic A Hall LO Bowyer KW (2003) Smoteboost Improving prediction of the minorityclass in boosting In Knowledge Discovery in Databases PKDD 2003 Springer Berlin HeidelbergBerlin Heidelberg pp 107ndash119

Chawla NV Japkowicz N Kotcz A (2004) Editorial Special issue on learning from imbalanced data setsSIGKDD Explor Newsl 6(1)1ndash6 DOI 10114510077301007733

Chen M Mao S Liu Y (2014) Big data A survey Mobile Networks and Applications 19(2)171ndash209DOI 101007s11036-013-0489-0

Chyi YM (2003) Classification analysis techniques for skewed class distribution problems Master thesisDepartment of Information management National Sun Yat-Sen University

Demsar J (2006) Statistical comparisons of classifiers over multiple data sets Journal of Machine LearningResearch 7(Jan)1ndash30

Dhillon IS (2001) Co-clustering documents and words using bipartite spectral graph partitioning In Pro-ceedings of the Seventh ACM SIGKDD International Conference on Knowledge Discovery and DataMining ACM New York NY USA KDD rsquo01 pp 269ndash274 DOI 101145502512502550


Drummond C, Holte RC (2003) C4.5, class imbalance, and cost sensitivity: why under-sampling beats over-sampling. In: Proceedings of the ICML '03 Workshop on Learning from Imbalanced Datasets

Fan RE Chang KW Hsieh CJ Wang XR Lin CJ (2008) LIBLINEAR A library for large linear classifi-cation Journal of Machine Learning Research 91871ndash1874

Fan W Stolfo SJ Zhang J Chan PK (1999) AdaCost Misclassification cost-sensitive boosting In Pro-ceedings of the Sixteenth International Conference on Machine Learning Morgan Kaufmann Pub-lishers Inc San Francisco CA USA ICML rsquo99 pp 97ndash105

Fawcett T (2006) An introduction to ROC analysis Pattern Recognition Letters 27(8)861 ndash 874 DOIhttpsdoiorg101016jpatrec200510010

Finch H (2005) Comparison of distance measures in cluster analysis with dichotomous data Journal ofData Science 3(1)85ndash100

Fortunato S (2010) Community detection in graphs Physics Reports 486(3 5)75 ndash 174 DOI httpsdoiorg101016jphysrep200911002

Junque de Fortuny E Martens D Provost F (2014a) Predictive modeling with big data is bigger reallybetter Big Data 1(4)215ndash226 DOI 101089big20130037

Junque de Fortuny E Stankova M Moeyersoms J Minnaert B Provost F Martens D (2014b) Corporateresidence fraud detection In Proceedings of the 20th ACM SIGKDD International Conference onKnowledge Discovery and Data Mining ACM New York NY USA KDD rsquo14 pp 1650ndash1659 DOI10114526233302623333

Frasca M Bertoni A Re M Valentini G (2013) A neural network algorithm for semi-supervised node labellearning from unbalanced data Neural Networks 4384 ndash 98 DOI httpsdoiorg101016jneunet201301021

Friedman M (1937) The use of ranks to avoid the assumption of normality implicit in the analysis ofvariance Journal of the american statistical association 32(200)675ndash701

García E, Lozano F (2007) Boosting support vector machines. In: Machine Learning and Data Mining in Pattern Recognition, 5th International Conference, MLDM 2007, Leipzig, Germany, July 18-20, Post Proceedings. IBaI publishing, pp 153–167

Goldstein M Uchida S (2016) A comparative evaluation of unsupervised anomaly detection algorithmsfor multivariate data PLOS ONE 11(4)1ndash31 DOI 101371journalpone0152173

González PC, Velásquez JD (2013) Characterization and detection of taxpayers with false invoices using data mining techniques. Expert Systems with Applications 40(5):1427–1436. DOI 10.1016/j.eswa.2012.08.051

Guimera R Sales-Pardo M Amaral LAN (2007) Module identification in bipartite and directed networksPhysical Review E 76036102 DOI 101103PhysRevE76036102

Guo H Viktor HL (2004) Learning from imbalanced data sets with boosting and data generation TheDataBoost-IM approach SIGKDD Explor Newsl 6(1)30ndash39 DOI 10114510077301007736

Guo X Yin Y Dong C Yang G Zhou G (2008) On the class imbalance problem In 2008 Fourth Interna-tional Conference on Natural Computation IEEE vol 4 pp 192ndash201 DOI 101109ICNC2008871

Han H Wang WY Mao BH (2005) Borderline-SMOTE A new over-sampling method in imbalanced datasets learning In Advances in Intelligent Computing Springer Berlin Heidelberg Berlin Heidelbergpp 878ndash887

He H Garcia EA (2009) Learning from imbalanced data IEEE Transactions on Knowledge and DataEngineering 21(9)1263ndash1284 DOI 101109TKDE2008239

He H Bai Y Garcia EA Li S (2008) ADASYN Adaptive synthetic sampling approach for imbalancedlearning In 2008 IEEE International Joint Conference on Neural Networks (IEEE World Congresson Computational Intelligence) IEEE pp 1322ndash1328 DOI 101109IJCNN20084633969

Holm S (1979) A simple sequentially rejective multiple test procedure Scandinavian journal of statistics6(2)65ndash70

Hsu CW Lin CJ (2002) A comparison of methods for multiclass support vector machines IEEE Transac-tions on Neural Networks 13(2)415ndash425 DOI 10110972991427

Huang A (2008) Similarity measures for text document clustering In Proceedings of the sixth new zealandcomputer science research student conference (NZCSRSC2008) Christchurch New Zealand pp 49ndash56

Iman RL Davenport JM (1980) Approximations of the critical region of the Friedman statistic Commu-nications in Statistics-Theory and Methods 9(6)571ndash595

Jo T Japkowicz N (2004) Class imbalances versus small disjuncts ACM SIGKDD Explor Newsl 6(1)40ndash49 DOI 10114510077301007737


Jutla IS, Jeub LG, Mucha PJ (2011-2016) A generalized Louvain method for community detection implemented in MATLAB. URL http://netwiki.amath.unc.edu/GenLouvain

Kubat M Matwin S (1997) Addressing the curse of imbalanced training sets One-sided selection InProceedings of the Fourteenth International Conference on Machine Learning Morgan KaufmannPublishers Inc San Francisco CA USA pp 179ndash186

Lancichinetti A Fortunato S (2009) Community detection algorithms A comparative analysis PhysicalReview E 80056117 DOI 101103PhysRevE80056117

Larremore DB Clauset A Jacobs AZ (2014) Efficiently inferring community structure in bipartitenetworks Physical Review E Statistical Nonlinear and Soft Matter Physics 90012805 DOI101103PhysRevE90012805

Li J Fine JP (2010) Weighted area under the receiver operating characteristic curve and its application togene selection Journal of the Royal Statistical Society Series C (Applied Statistics) 59(4)673ndash692DOI 101111j1467-9876201000713x

Li X Wang L Sung E (2008) AdaBoost with SVM-based component classifiers Engineering Applicationsof Artificial Intelligence 21(5)785ndash795 DOI httpsdoiorg101016jengappai200707001

Lichman M (2013) UCI machine learning repository. URL http://archive.ics.uci.edu/ml

Liu W, Chawla S, Cieslak DA, Chawla NV (2010) A robust decision tree algorithm for imbalanced data sets. In: Proceedings of the tenth SIAM international conference on data mining, SIAM, Philadelphia, vol 10, pp 766–777

Liu XY Wu J Zhou ZH (2009) Exploratory undersampling for class-imbalance learning IEEE Transac-tions on Systems Man and Cybernetics Part B (Cybernetics) 39(2)539ndash550 DOI 101109TSMCB20082007853

Luts J Ojeda F Van de Plas R De Moor B Van Huffel S Suykens JA (2010) A tutorial on supportvector machine-based methods for classification problems in chemometrics Analytica Chimica Acta665(2)129ndash145 DOI httpsdoiorg101016jaca201003030

Macskassy SA Provost F (2007) Classification in networked data A toolkit and a univariate case studyJournal of Machine Learning Research 8(May)935ndash983

Martens D Provost F (2014) Explaining data-driven document classifications MIS Quarterly 38(1)73ndash100

Martens D Provost F Clark J Junque de Fortuny E (2016) Mining massive fine-grained behavior data toimprove predictive analytics MIS Quarterly 40(4)869ndash888

Mazurowski MA Habas PA Zurada JM Lo JY Baker JA Tourassi GD (2008) Training neural networkclassifiers for medical decision making The effects of imbalanced datasets on classification perfor-mance Neural Networks 21(23)427 ndash 436 DOI httpsdoiorg101016jneunet200712031

Mease D Wyner AJ Buja A (2007) Boosted classification trees and class probabilityquantile estimationJournal of Machine Learning Research 8409ndash439

Nemenyi P (1963) Distribution-free Multiple Comparisons. Dissertation, Princeton University

Newman MEJ, Girvan M (2004) Finding and evaluating community structure in networks. Physical Review E 69:026113. DOI 10.1103/PhysRevE.69.026113

Ng AY (2004) Feature selection, L1 vs. L2 regularization, and rotational invariance. In: Proceedings of the Twenty-first International Conference on Machine Learning, ACM, New York, NY, USA, ICML '04, pp 78–. DOI 10.1145/1015330.1015435

Ng AY Jordan MI (2002) On discriminative vs generative classifiers A comparison of logistic regressionand naive bayes In Advances in Neural Information Processing Systems 14 MIT Press pp 841ndash848

Ngai E Hu Y Wong Y Chen Y Sun X (2011) The application of data mining techniques in financial frauddetection A classification framework and an academic review of literature Decision Support Systems50(3)559 ndash 569 DOI httpsdoiorg101016jdss201008006

Platt JC (1999) Probabilistic outputs for support vector machines and comparisons to regularized likeli-hood methods In Advances in Large-Margin Classifiers MIT Press pp 61ndash74

Porter MA Onnela JP Mucha PJ (2009) Communities in networks Notices of the American MathematicalSociety 56(9)1082ndash1097

Provost F Fawcett T (2013) Data Science for Business What you need to know about data mining anddata-analytic thinking OrsquoReilly Media Inc

Provost F Dalessandro B Hook R Zhang X Murray A (2009) Audience selection for on-line brandadvertising Privacy-friendly social network targeting In Proceedings of the 15th ACM SIGKDDInternational Conference on Knowledge Discovery and Data Mining ACM New York NY USAKDD rsquo09 pp 707ndash716 DOI 10114515570191557098


Raskutti B Kowalczyk A (2004) Extreme re-balancing for SVMs A case study SIGKDD Explor Newsl6(1)60ndash69 DOI 10114510077301007739

Rosvall M Bergstrom CT (2008) Maps of random walks on complex networks reveal community structureProceedings of the National Academy of Sciences 105(4)1118ndash1123 DOI 101073pnas0706851105

Schapire RE (1999) A brief introduction to boosting In Proceedings of the 16th International Joint Con-ference on Artificial Intelligence - Volume 2 Morgan Kaufmann Publishers Inc San Francisco CAUSA IJCAIrsquo99 pp 1401ndash1406

Schapire RE Singer Y (1999) Improved boosting algorithms using confidence-rated predictions Machinelearning 37(3)297ndash336 DOI 101023A1007614523901

Shmueli G (2017) Analyzing behavioral big data Methodological practical ethical and moral issuesQuality Engineering 29(1)57ndash74 DOI 1010800898211220161210979

Sobhani P Viktor H Matwin S (2015) Learning from imbalanced data using ensemble methods andcluster-based undersampling In New Frontiers in Mining Complex Patterns Third InternationalWorkshop NFMCP 2014 Held in Conjunction with ECML-PKDD 2014 Nancy France Septem-ber 19 2014 Revised Selected Papers Springer International Publishing Cham pp 69ndash83 DOI101007978-3-319-17876-9 5

Stankova M (2016) Classification within network data with a bipartite structure Dissertation Universityof Antwerp

Stankova M Martens D Provost F (2015) Classification over bipartite graphs through projection WorkingPapers 2015001 University of Antwerp Faculty of Applied Economics

Sun Y Kamel MS Wong AK Wang Y (2007) Cost-sensitive boosting for classification of imbalanceddata Pattern Recognition 40(12)3358 ndash 3378 DOI httpsdoiorg101016jpatcog200704009

Suykens JA Van Gestel T De Brabanter J De Moor B Vandewalle J Suykens J Van Gestel T (2002)Least squares support vector machines World Scientific Singapore

Tang B He H (2015) Enn Extended nearest neighbor method for pattern recognition [research frontier]IEEE Computational Intelligence Magazine 10(3)52ndash60 DOI 101109MCI20152437512

Tang Y Zhang YQ Chawla NV Krasser S (2009) SVMs modeling for highly imbalanced classificationIEEE Transactions on Systems Man and Cybernetics Part B (Cybernetics) 39(1)281ndash288 DOI101109TSMCB20082002909

Tobback E Moeyersoms J Stankova M Martens D (2016) Bankruptcy prediction for SMEs using rela-tional data Working Paper 2016004 University of Antwerp Faculty of Applied Economics

Verbeke W Dejaeger K Martens D Hur J Baesens B (2012) New insights into churn prediction in thetelecommunication sector A profit driven data mining approach European Journal of OperationalResearch 218(1)211 ndash 229 DOI httpsdoiorg101016jejor201109031

Veropoulos K Campbell I Cristianini N (1999) Controlling the sensitivity of support vector machinesIn Proceedings of the International Joint Conference on Artificial Intelligence Stockholm Sweden(IJCAI99) pp 55 ndash 60

Whitrow C Hand DJ Juszczak P Weston D Adams NM (2009) Transaction aggregation as a strategyfor credit card fraud detection Data Mining and Knowledge Discovery 18(1)30ndash55 DOI 101007s10618-008-0116-z

Wickramaratna J Holden SB Buxton BF (2001) Performance degradation in boosting In Proceedings ofthe Second International Workshop on Multiple Classifier Systems Springer London UK MCS rsquo01pp 11ndash21

Yen SJ Lee YS (2009) Cluster-based under-sampling approaches for imbalanced data distributions ExpertSystems with Applications 36(3 Part 1)5718 ndash 5727 DOI httpsdoiorg101016jeswa200806108

Yu HF Lo HY Hsieh HP Lou JK McKenzie TG Chou JW Chung PH Ho CH Chang CF Wei YH et al(2010) Feature engineering and classifier ensemble for kdd cup 2010 In Proceedings of the KDDCup 2010 Workshop pp 1ndash16

Zha H He X Ding C Simon H Gu M (2001) Bipartite graph partitioning and data clustering In Pro-ceedings of the Tenth International Conference on Information and Knowledge Management ACMNew York NY USA CIKM rsquo01 pp 25ndash32 DOI 101145502585502591

Zhang J Mani I (2003) Knn approach to unbalanced data distributions A case study involving informationextraction In Proceedings of the ICMLrsquo2003 Workshop on Learning from Imbalanced DatasetsWashington DC

Ziegler CN McNee SM Konstan JA Lausen G (2005) Improving recommendation lists through topicdiversification In Proceedings of the 14th International Conference on World Wide Web ACM NewYork NY USA WWW rsquo05 pp 22ndash32 DOI 10114510607451060754


[Figure] Fig. 2 Book(p = 25) dataset. AUCtest versus boosting iteration T ∈ [0, 30]: panel (a) compares AB, AC and EE(S = 5, 10, 15) variants against the baseline BL; panel (b) compares AB and EE for C ∈ {1e-07, 1e-05, 0.001, 0.1} against BL.

[Figure] Fig. 3 TaFeng(p = 25) dataset. AUCtest versus boosting iteration T ∈ [0, 30]: panel (a) compares AB, AC and EE(S = 5, 10, 15) variants against the baseline BL; panel (b) compares AB and EE for C ∈ {1e-07, 1e-05, 0.001, 0.1} against BL.

[Figure] Fig. 4 Bank(p = []) dataset. AUCtest versus boosting iteration T ∈ [0, 30]: panel (a) compares AB, AC and EE(S = 5, 10, 15) variants against the baseline BL; panel (b) compares AB and EE for C ∈ {1e-07, 1e-05, 0.001, 0.1} against BL.


4.6 Final comparison

4.6.1 Performance-wise comparison

In this section we wish to determine which of the aforementioned methods shows superior performance in terms of AUC on test data. The experimental set-up is described in Section 4.2. Note that we exclude β = 0 and βu = 0 in determining the performance of the oversampling respectively undersampling techniques, to be able to compare them with the baseline (BL) approach23. The results for AB, AC and EE are shown for µ = 100. The number of boosting iterations t ∈ [0, T] is also considered as a tunable parameter. Table 5 shows the final comparison between all methods. Additionally, we added an average rank column indicating the mean rank of each algorithm across all datasets. The rank of an algorithm for a specific dataset corresponds to its position in the sorted AUC-performance spectrum, where the best performing algorithm gets the rank of 1, the second best rank 2, etc. In case two or more algorithms perform equally well, average ranks are assigned (if two algorithms have ranks of 3 and 4 respectively and show equal performance, then both methods get assigned a rank of 3.5). Note that the LST dataset is excluded from the calculation of average ranks, since we consider this as a dataset that is not affected by imbalance and on which all methods show equal performances.
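The tie-aware ranking scheme above can be sketched as follows (a minimal illustration; the function name and the example AUC values are ours, not from the paper's code):

```python
# Per dataset: the method with the highest AUC gets rank 1; tied methods
# share the average of the positions they would jointly occupy.

def average_ranks(aucs):
    """aucs: dict mapping method -> AUC on one dataset; returns method -> rank."""
    ordered = sorted(aucs, key=aucs.get, reverse=True)  # best first (stable)
    ranks, i = {}, 0
    while i < len(ordered):
        j = i
        while j < len(ordered) and aucs[ordered[j]] == aucs[ordered[i]]:
            j += 1                            # extend the group of tied methods
        tied_rank = (i + 1 + j) / 2           # average of positions i+1 .. j
        for m in ordered[i:j]:
            ranks[m] = tied_rank
        i = j
    return ranks

# Example: AB and RUS tie for positions 3 and 4, so both get rank 3.5.
print(average_ranks({"EE": 86.4, "OSR": 85.1, "AB": 82.4, "RUS": 82.4, "BL": 79.8}))
# -> {'EE': 1.0, 'OSR': 2.0, 'AB': 3.5, 'RUS': 3.5, 'BL': 5.0}
```

Applying this per dataset and averaging the resulting ranks over all datasets (excluding LST) yields the average rank column of Table 5.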

As a side note, we noticed that increasing the number of minority training set instances can significantly improve performance. We can see this clearly by comparing the p = 25 outcomes with the results of p = 1 on the same dataset. Junque de Fortuny et al (2014a) already showed empirically that larger behaviour datasets (in terms of the number of instances or features) lead to higher AUC-performances. This is confirmed in our experiments, though we want to add that it only holds when the data size is increased in a balanced fashion.

The results in Table 5 confirm our previous analysis. Each of the oversampling, undersampling (except Cl Knn) and boosting variants seems to improve upon the BL performance. Sobhani et al (2015) note that the direct application of ensemble techniques on imbalanced data does not circumvent the imbalanced learning issue. In our experiments on behaviour data, we do find that a simple boosting procedure applied to the imbalanced data improves upon the predictive performance. The oversampling techniques OSR, SMOTE and ADASYN show similar outcomes and are competitive methods, seeming to outperform the undersampling techniques and the plain boosting (AB) and cost-sensitive learning (AC) approaches. When dealing with traditional data, the performances of oversampling techniques are generally worse than those of undersampling approaches (Drummond and Holte 2003). It is interesting to observe that for these large and sparse behaviour datasets the opposite is true. This is heavily related to the intrinsic properties of the data under consideration. Undersampling (or feature selection, for that matter) is not recommended because, as we have noted in Section 2.1, each feature provides a small additional contribution to the predictive performance. With respect to AC, the results indicate that R = RL is a better choice compared to more random cost ratios R = 28. The EE-technique has the

23 The BL technique trains single SVMs on the imbalanced training data


lowest average rank and is our best performing method. The combination of several balanced subsets with filtered noise/outlier contributions allows one to explore the majority class space efficiently. To strengthen our previous statements, we conduct statistical significance tests in the next paragraphs, in accordance with the procedure outlined in Demsar (2006). The latter paper presents a framework to perform statistical comparisons of classifiers over multiple datasets.

The first null-hypothesis we try to reject postulates that each of the algorithms performs equally well or, equivalently, that their average ranks R_j (see Table 5) are indifferent. Under the null-hypothesis, the Friedman statistic (Friedman 1937)

\chi^2_F = \frac{12N}{k(k+1)} \left[ \sum_{j=1}^{k} R_j^2 - \frac{k(k+1)^2}{4} \right] \quad (6)

with N the number of datasets (N = 15) and k the number of algorithms (k = 13), can be used to compute the statistic (Iman and Davenport 1980)

F_F = \frac{(N-1)\,\chi^2_F}{N(k-1) - \chi^2_F} \quad (7)

The latter is distributed according to the F-distribution with k − 1 and (k − 1)(N − 1) degrees of freedom. Evaluating this measure with the average ranks shown in Table 5 results in F_F = 22.98, which is a lot higher than the critical value of 1.81 at the α = 0.05 significance level. We can therefore reject the null-hypothesis and proceed with post-hoc tests.
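As a quick numerical check, Eqs. (6) and (7) can be evaluated directly from the average ranks in Table 5 (a sketch; the rank values are transcribed from the table):

```python
# Friedman statistic (Eq. 6) and Iman-Davenport correction (Eq. 7)
# for N = 15 datasets and k = 13 methods; ranks from Table 5 in the
# order BL, OSR, SMOTE, ADASYN, RUS, Cl Knn, Far Knn, CBU, AB,
# AC(R=28), AC(R=RL), EE(S=10), EE(S=15).
R = [11.600, 5.000, 4.533, 4.800, 8.167, 12.467, 8.133,
     8.567, 8.267, 8.467, 5.400, 3.267, 2.333]
N, k = 15, len(R)

chi2_F = 12 * N / (k * (k + 1)) * (sum(r * r for r in R) - k * (k + 1) ** 2 / 4)
F_F = (N - 1) * chi2_F / (N * (k - 1) - chi2_F)

print(round(chi2_F, 2), round(F_F, 2))  # -> 111.88 22.99
```

The tiny deviation from the reported 22.98 stems from the average ranks being rounded to three decimals in the table.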

The Nemenyi test (Nemenyi 1963) allows one to compare each classifier to one another and adjusts the critical value to compensate for making k(k − 1)/2 comparisons24. "The performance of two classifiers is significantly different if the corresponding average ranks differ by at least the critical difference (CD)" (Demsar 2006). We refer to the aforementioned paper for the calculation of the CD. The result of these comparisons at the α = 0.05 significance level is shown in Table 6. This test is already quite revealing in the sense that, for instance, the oversampling methods (OSR, SMOTE and ADASYN), AC (R = RL) and the EE-techniques are significantly different from the BL. Yet it does not indicate that undersampling or applying plain boosting (AB) results in significantly different performances compared to the BL. The Nemenyi test is still conservative, and we therefore compare all classifiers to a single control classifier in the next paragraphs. Note that we only make k − 1 comparisons in that situation, resulting in stronger results compared to the Nemenyi test.
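For completeness, the CD computation takes the form below. The critical value q_0.05 ≈ 3.313 for k = 13 is taken from the extended critical-value table accompanying Demsar (2006); it is our assumption here, not a number stated in this paper:

```python
import math

# Nemenyi critical difference: CD = q_alpha * sqrt(k(k+1) / (6N)).
# q_alpha is the Studentized-range critical value divided by sqrt(2);
# 3.313 (k = 13, alpha = 0.05) is an assumed table value, see lead-in.
N, k = 15, 13
q_alpha = 3.313
CD = q_alpha * math.sqrt(k * (k + 1) / (6 * N))
print(round(CD, 2))  # -> 4.71

# e.g. BL (rank 11.600) vs EE(S = 15) (rank 2.333): difference 9.267 > CD,
# so the pair is marked significantly different (a 1 in Table 6).
```

Any pair of methods whose average ranks in Table 5 differ by more than this CD receives a 1 in Table 6.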

When comparing a single classifier i with a control classifier c, the null-hypothesis states that the two classifiers i and c perform equally well. Under this hypothesis, the test statistic (based on the average ranks R_i shown in Table 5)

z = \frac{R_i - R_c}{\sqrt{k(k+1)/(6N)}} \quad (8)

is distributed according to a standard normal distribution. This z-value is used to find the corresponding probability (p-value) p from the cumulative standard normal

24 The larger the number of comparisons, the higher the proportion of null-hypotheses that are wrongfully rejected due to random chance.


Table 5 Final comparison of all methods in terms of AUC test set performance; see Section 4.2 for details on the experimental set-up. β ≠ 0 and βu ≠ 0 for the oversampling respectively undersampling techniques; µ = 100 for each of the boosting variants. Standard deviations are given between parentheses and the performance difference with the BL is shown in square brackets.

              Mov G(p = 1)       Mov G(p = 25)      Mov Th(p = [])     Yahoo A(p = 1)

BL            71.6(2.62)[0]      81.41(1.32)[0]     79.77(5.33)[0]     55.92(2.97)[0]
OSR           75.35(2.27)[3.8]   83.76(2.09)[2.3]   85.13(6.1)[5.4]    60.05(2.71)[4.1]
SMOTE         76.16(2.27)[4.6]   83.7(2.1)[2.3]     85.67(4.98)[5.9]   60.1(3)[4.2]
ADASYN        76.07(2.26)[4.5]   83.63(2.04)[2.2]   85.65(5.6)[5.9]    59.9(2.99)[4]
RUS           72.88(2.73)[1.3]   81.52(2.15)[0.1]   82.91(7.19)[3.1]   57.04(1.77)[1.1]
Cl Knn        71.43(1.36)[-0.2]  80.88(1.19)[-0.5]  78.87(4.71)[-0.9]  55.78(2.71)[-0.1]
Far Knn       71.9(2.95)[0.3]    80.9(1.48)[-0.5]   84.07(4.64)[4.3]   57.2(1.33)[1.3]
CBU           74.17(2.36)[2.6]   81.51(1.04)[0.1]   82.76(7.22)[3]     58.77(3.43)[2.8]
AB            71.65(1.73)[0.1]   84.52(1.89)[3.1]   82.43(5.18)[2.7]   58.35(2.62)[2.4]
AC(R = 28)    71.61(2.46)[0]     83.46(1.82)[2]     83.27(5.6)[3.5]    57.72(2.47)[1.8]
AC(R = RL)    74.65(2.7)[3.1]    83.35(2.09)[1.9]   85.41(4.49)[5.6]   59.47(2.33)[3.5]
EE(S = 10)    76.04(2.66)[4.4]   85.05(1.85)[3.6]   86.1(5.78)[6.3]    59.66(3.13)[3.7]
EE(S = 15)    76.12(2.88)[4.5]   85.14(1.86)[3.7]   86.42(5.86)[6.7]   59.76(2.93)[3.8]

              Yahoo A(p = 25)    Yahoo G(p = 1)     Yahoo G(p = 25)    TaFeng(p = 1)

BL            61.68(2.42)[0]     66.84(3.66)[0]     78.82(1.39)[0]     55.75(1.6)[0]
OSR           64.59(3.12)[2.9]   73.08(2.96)[6.2]   78.52(2.01)[-0.3]  61.21(2.24)[5.5]
SMOTE         65.56(3.33)[3.9]   73.11(3.12)[6.3]   79.01(1.21)[0.2]   61.72(1.81)[6]
ADASYN        65.13(3.38)[3.4]   73.22(3.17)[6.4]   79.74(1.68)[0.9]   61.68(1.86)[5.9]
RUS           64.11(2.8)[2.4]    70.65(3.39)[3.8]   78.91(1.55)[0.1]   59.25(2.18)[3.5]
Cl Knn        61.14(2.13)[-0.5]  66.34(3.54)[-0.5]  77.26(1.46)[-1.6]  55.77(1.28)[0]
Far Knn       63.96(3.03)[2.3]   66.97(3.54)[0.1]   78.26(2.2)[-0.6]   59.98(1.26)[4.2]
CBU           62.27(1.79)[0.6]   71.27(2.89)[4.4]   75.22(2.42)[-3.6]  58.4(1.57)[2.6]
AB            63.88(2.67)[2.2]   68.9(2.03)[2.1]    79.01(1.66)[0.2]   56.21(1.79)[0.5]
AC(R = 28)    64.32(3.56)[2.6]   68.89(3.11)[2]     78.99(1.89)[0.2]   56.33(1.83)[0.6]
AC(R = RL)    64.31(3.03)[2.6]   73.13(2.8)[6.3]    78.41(2)[-0.4]     61.6(2.26)[5.9]
EE(S = 10)    66.51(3.24)[4.8]   72.61(3.15)[5.8]   80.52(1.6)[1.7]    61.2(1.82)[5.4]
EE(S = 15)    66.36(3.18)[4.7]   73.48(2.32)[6.6]   80.54(1.56)[1.7]   61.13(1.83)[5.4]

              TaFeng(p = 25)     Book(p = 1)        Book(p = 25)       LST(p = 1)

BL            66.94(1.34)[0]     52.6(1.29)[0]      60.08(0.71)[0]     99.99(0.01)[0]
OSR           68.77(1.23)[1.8]   55.87(1.42)[3.3]   64.62(0.57)[4.5]   99.99(0.01)[0]
SMOTE         68.47(1.5)[1.5]    55.07(0.88)[2.5]   62.96(0.82)[2.9]   99.99(0.01)[0]
ADASYN        68.48(1.47)[1.5]   55.04(0.91)[2.4]   63.02(0.57)[2.9]   99.99(0.01)[0]
RUS           68.28(1.39)[1.3]   54.26(0.92)[1.7]   63.28(0.8)[3.2]    99.98(0.01)[0]
Cl Knn        66.13(1.43)[-0.8]  52.69(1.3)[0.1]    60.02(0.79)[-0.1]  99.99(0.01)[0]
Far Knn       68.06(1.41)[1.1]   56.25(1.52)[3.7]   64.15(1.12)[4.1]   99.98(0.01)[0]
CBU           63.84(1.07)[-3.1]  53.75(1.01)[1.2]   54.68(0.88)[-5.4]  []
AB            67.65(1.55)[0.7]   54.27(1.95)[1.7]   65(0.67)[4.9]      99.99(0.01)[0]
AC(R = 28)    69.31(1.23)[2.4]   53.72(1)[1.1]      61.24(0.8)[1.2]    99.98(0.01)[0]
AC(R = RL)    67.15(1.51)[0.2]   55.73(1.22)[3.1]   64.6(0.64)[4.5]    99.99(0.01)[0]
EE(S = 10)    70.3(1.35)[3.4]    55.09(1.29)[2.5]   65.37(0.61)[5.3]   99.98(0.01)[0]
EE(S = 15)    70.4(1.3)[3.5]     55.35(1.26)[2.8]   65.4(0.51)[5.3]    99.98(0.01)[0]


Table 5 Continued. Additionally, an average rank column is added, showing the mean rank of each algorithm across all datasets. Note that we excluded the LST dataset for this purpose.

              Adver(p = [])      Adver(p = 1)       CRF(p = [])        Bank(p = [])

BL            96.61(1.82)[0]     90.93(3.02)[0]     64.06(16.43)[0]    66.82(0.88)[0]
OSR           96.93(1.91)[0.3]   93.3(2.02)[2.4]    80.74(12.93)[16.7] 71.39(0.79)[4.6]
SMOTE         97.05(1.66)[0.4]   93.35(2.01)[2.4]   78.7(16.56)[14.6]  []
ADASYN        96.91(1.95)[0.3]   93.46(2.21)[2.5]   78.87(16.71)[14.8] []
RUS           96.81(1.87)[0.2]   92.38(2.51)[1.5]   83.98(5.99)[19.9]  69.41(1.19)[2.6]
Cl Knn        96.4(1.48)[-0.2]   89.73(3.42)[-1.2]  76.63(16.19)[12.6] 66.17(0.72)[-0.6]
Far Knn       95.77(1.81)[-0.8]  93.88(1.78)[3]     83.75(13.11)[19.7] 66.95(0.56)[0.1]
CBU           97.15(1.88)[0.5]   94.18(2.3)[3.3]    []                 []
AB            97.34(2.18)[0.7]   91.39(3.23)[0.5]   77.62(15.15)[13.6] 66.82(0.88)[0]
AC(R = 28)    97.44(1.93)[0.8]   91(3.35)[0.1]      68.31(14.93)[4.2]  67.67(0.71)[0.9]
AC(R = RL)    97.46(1.71)[0.8]   93.51(2.17)[2.6]   85.08(9.77)[21]    70.7(0.8)[3.9]
EE(S = 10)    97.64(1.35)[1]     92.97(2.75)[2]     86.18(10.17)[22.1] 71.46(0.81)[4.6]
EE(S = 15)    97.63(1.35)[1]     93.3(2.14)[2.4]    86.35(9.99)[22.3]  71.54(0.76)[4.7]

              Average Rank

BL            11.600
OSR           5.000
SMOTE         4.533
ADASYN        4.800
RUS           8.167
Cl Knn        12.467
Far Knn       8.133
CBU           8.567
AB            8.267
AC(R = 28)    8.467
AC(R = RL)    5.400
EE(S = 10)    3.267
EE(S = 15)    2.333

Table 6 Nemenyi test outcomes at the α = 0.05 significance level. A value of 1 indicates the null-hypothesis is rejected and thus finds the two algorithms significantly different (a value of 0 means we accept the null-hypothesis). Methods are listed in the same order as Table 5 and are more concisely represented.

BL RO SM AD RU Cl Fa CBU AB AC1 AC2 EE1 EE2

BL 0 1 1 1 0 0 0 0 0 0 1 1 1
RO 1 0 0 0 0 1 0 0 0 0 0 0 0
SM 1 0 0 0 0 1 0 0 0 0 0 0 0
AD 1 0 0 0 0 1 0 0 0 0 0 0 0
RU 0 0 0 0 0 0 0 0 0 0 0 1 1
Cl 0 1 1 1 0 0 0 0 0 0 1 1 1
Fa 0 0 0 0 0 0 0 0 0 0 0 1 1

CBU 0 0 0 0 0 0 0 0 0 0 0 1 1
AB 0 0 0 0 0 0 0 0 0 0 0 1 1
AC1 0 0 0 0 0 0 0 0 0 0 0 1 1
AC2 1 0 0 0 0 1 0 0 0 0 0 0 0
EE1 1 0 0 0 1 1 1 1 1 1 0 0 0
EE2 1 0 0 0 1 1 1 1 1 1 0 0 0

Imbalanced classification in sparse and large behaviour datasets 33

distribution function: p = 2 × min(Φ(z), 1 − Φ(z)). Holm's method (Holm 1979) compares k − 1 classifiers to a single control classifier and sorts the corresponding p-values in ascending order, so that p1 ≤ p2 ≤ . . . ≤ pk−1. Each pi is subsequently compared to its associated confidence level25 αcomp = α/(k − i). Holm starts by performing the check p1 < α/(k − 1) and rejects the null-hypothesis if the test passes. It then proceeds with a similar check for i = 2 and continues until a certain null-hypothesis cannot be rejected. The remaining hypotheses are retained.
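Holm's step-down procedure is simple enough to sketch directly. The helper below is a minimal illustration (the function name and input format are our own, not from the paper): it converts each z statistic to a two-sided p-value via the standard normal CDF, sorts the p-values, and walks down the list until the first check fails.

```python
import math

def holm_test(z_by_method, alpha=0.05):
    """Step-down Holm procedure for k-1 comparisons against one control.
    z_by_method maps a method name to its z statistic versus the control.
    Returns a dict: method -> True if its null-hypothesis is rejected."""
    def p_value(z):
        # Two-sided p-value: p = 2 * min(Phi(z), 1 - Phi(z))
        phi = 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))
        return 2.0 * min(phi, 1.0 - phi)

    k = len(z_by_method) + 1          # number of algorithms incl. the control
    ranked = sorted(z_by_method.items(), key=lambda kv: p_value(kv[1]))
    rejected, stop = {}, False
    for i, (name, z) in enumerate(ranked, start=1):
        if not stop and p_value(z) < alpha / (k - i):
            rejected[name] = True
        else:
            stop = True               # retain this and all remaining hypotheses
            rejected[name] = False
    return rejected
```

Fed with, for instance, the two extreme z-values of Table 7, this rejects the hypothesis for EE(S = 15) and retains it for Cl Knn, in line with the table.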

Table 7 shows the result of the Holm test at the α = 0.05 significance level with the BL algorithm as control classifier. The table lists each method according to sorted p-values and shows the z-, p- and αcomp-values. The significance column indicates whether the proposed method is significantly different from the BL and coincidentally matches the result of the Nemenyi test. The p-values of the (Far Knn, RUS, CBU) undersampling methods, AB and AC (R = 28) are only slightly higher than the level required to reject the null-hypothesis. We included a critical significance level αcrit, corresponding to the lowest possible significance level upon which the method would be considered significantly different from the BL (if α = αcrit then p = αcomp). In the case of AB, for example, if we were to choose a significance level α = 0.0764 (a confidence level of 92.36%), the associated p-value would be smaller than αcomp and we would proceed to conclude26 that AB performs significantly different from the BL. To summarize: the null-hypothesis of equal performance between the BL and an alternative method proposed in Section 3 (except Cl Knn) is rejected with a confidence level of more than 91% according to Holm's test. Hence, the performance of the BL approach is significantly worse than that of each of the proposed over- and undersampling methods (except Cl Knn), cost-sensitive learning techniques (AC) and boosting variants (AB, EE).

Similar to the previous paragraph, Table 8 presents the results of Holm's test at the α = 0.05 significance level with EE(S = 15) as control classifier. The results coincidentally correspond with the result of the Nemenyi test and indicate that the oversampling (OSR, SMOTE and ADASYN), AC(R = RL) and EE(S = 10) methods are not statistically different from the EE(S = 15) technique at the 0.05 significance level. However, if we were to adapt the significance level to approximately α = 0.25 (a 75% confidence level), the results would conclude significance (except for EE(S = 10)). To summarize: the null-hypothesis of equal performance between EE(S = 15) and an alternative method proposed in Section 3 (except EE(S = 10)) is rejected with a confidence level of more than 75% according to Holm's test. Hence, the performance of the oversampling, undersampling, AB and AC approaches is significantly worse than that of EE(S = 15). We can therefore safely conclude EE to be the preferred method in terms of AUC performance when dealing with imbalanced behaviour data.

25 αcomp adjusts the value of α to compensate for multiple comparisons.
26 This is not entirely true, since Holm's method requires Far Knn to pass the test first before proceeding, which only occurs at the 0.088662 significance level.


Table 7 Holm test at the α = 0.05 significance level with BL as reference. The table shows the z test statistic with associated p-value and αcomp = α/(k − i), with k the number of algorithms (k = 13) and i corresponding to the position of the method in the sorted p-value vector. Since we already ranked the p-values in ascending order, i takes on the row number (e.g. OSR has i = 5). The column "significant" denotes if we can reject the null-hypothesis (significant: p < αcomp). αcrit corresponds with the smallest possible significance level where we would decide to reject the null-hypothesis (αcrit = α · p/αcomp).

z p αcomp significant αcrit

EE(S = 15) -6.51642 7.2E-11 0.004167 1 8.64E-10
EE(S = 10) -5.86009 4.63E-09 0.004545 1 5.09E-08

SMOTE -4.96936 6.72E-07 0.005 1 6.72E-06
ADASYN -4.78183 1.74E-06 0.005556 1 1.56E-05

OSR -4.64119 3.46E-06 0.00625 1 2.77E-05
AC(R = RL) -4.35991 1.3E-05 0.007143 1 9.11E-05

Far Knn -2.4378 0.014777 0.008333 0 0.088662
RUS -2.41436 0.015763 0.01 0 0.078815
AB -2.34404 0.019076 0.0125 0 0.076305

AC(R = 28) -2.20339 0.027567 0.016667 0 0.082701
CBU -2.13307 0.032919 0.025 0 0.065837

Cl Knn 0.609449 0.542227 0.05 0 0.542227

Table 8 Holm test at the α = 0.05 significance level with EE(S = 15) as reference

z p αcomp significant αcrit

Cl Knn 7.12587 1.03E-12 0.004167 1 1.24E-11
BL 6.516421 7.2E-11 0.004545 1 7.92E-10

CBU 4.383348 1.17E-05 0.005 1 0.000117
AC(R = 28) 4.313027 1.61E-05 0.005556 1 0.000145

AB 4.172384 3.01E-05 0.00625 1 0.000241
RUS 4.102063 4.09E-05 0.007143 1 0.000287

Far Knn 4.078623 4.53E-05 0.008333 1 0.000272
AC(R = RL) 2.156513 0.031044 0.01 0 0.155218

OSR 1.875229 0.060761 0.0125 0 0.243045
ADASYN 1.734587 0.082814 0.016667 0 0.248442
SMOTE 1.547064 0.121848 0.025 0 0.243696

EE(S = 10) 0.65633 0.511612 0.05 0 0.511612

4.6.2 Timing-wise comparison

The performance of each of the proposed algorithms in terms of training times depends highly upon the chosen parameters. The β and βu parameters control the number of training instances and are an important contribution to the subsequent SVM learning times for each of the over- and undersampling techniques. The number of boosting iterations t ∈ [0, T] is also a major factor in each of the boosting variants proposed in Section 3.3. Furthermore, the chosen regularization parameter C highly influences the training times of the SVM inducer: higher C-values result in larger training times (Fan et al 2008). Not only do the chosen hyperparameters affect learning times; data characteristics such as the number of minority instances, the imbalance level, the number of features, the amount of overlapping, etc. also have a major effect.


In comparing each of the methods outlined in Section 3, we make use of a similar methodology as previously presented (see Section 4.6.1). For each of the ten folds, a suitable parameter combination is selected based on optimal validation set AUC-performance. These chosen parameters are applied in determining the timings of each method. The final results indicate the average timings across ten folds. Hence, we show the required amount of time for each method when using the parameter combination resulting in the best performance. Note that this might, for instance, result in using β = 1/3 in case of OSR, or using only t = 5 boosting iterations in EE. Computational timings are shown in Table 9.
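The selection scheme can be sketched in a few lines. The helper below is illustrative only (the function name and the callables it takes are our own, not from the paper): per fold, it picks the parameter combination maximizing validation AUC and records the training time under exactly those parameters.

```python
def average_timing(folds, param_grid, validation_auc, training_time):
    """For each fold, select the parameter combination with the highest
    validation-set AUC, then measure the training time under exactly those
    parameters; report the mean timing across all folds."""
    timings = []
    for fold in folds:
        best = max(param_grid, key=lambda params: validation_auc(fold, params))
        timings.append(training_time(fold, best))
    return sum(timings) / len(timings)
```

Different folds may therefore be timed under different hyperparameters, which is intentional: the reported figure is the cost of the best-performing configuration, not of a fixed one.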

As one can observe from the aforementioned table, CBU is the slowest method; it relies on computationally intensive clustering procedures. The synthetic SMOTE and ADASYN oversampling approaches are also very time consuming. They both rely on computationally expensive nearest neighbour computations, in conjunction with a synthetic sample generation procedure that requires a single pass over all minority training instances.

The RUS method shows, as expected, superior results in terms of computational timings. Other undersampling methods perform quite well, though they are slower than BL and OSR due to their inherent nearest neighbour computations.

The boosting variants seem to perform worse than the BL, OSR and each of the undersampling methods. Since EE was the best method in terms of AUC-performance (see Section 4.6.1), we focus on this learner in this paragraph. For the medium sized datasets (all but CRF and Bank), OSR outperforms EE(S = 15), yet for the large datasets (CRF and Bank) the opposite is true. An SVM is known to have a quadratic computational complexity in the number of training instances (Suykens et al 2002; García and Lozano 2007). For this reason, we consider OSR (or any other oversampling technique) as inappropriate for large datasets. Since with EE each subset has a size limited to twice the minority class training size, we expect increasing benefits of the EE-technique for larger and more imbalanced behaviour data. As Junque de Fortuny et al (2014a) have observed, larger behaviour data, in terms of the number of instances or features, also contribute to an increase in predictive power. Hence, we expect to see this type of big behaviour data more often in real-life applications, where computational resources can become a major constraint. Because the EE-technique is easily parallelizable, by simultaneously performing the boosting process for each of the independent subsets, we added an extra row in Table 9: EE par corresponds with the time needed for EE(S = 15) divided by 15, indicating the time that would be required for EE if each subset were trained in parallel. We can see that EE par can outperform OSR in 8 out of 16 cases, precisely for the datasets that show a high imbalance level (e.g. p = 1 or p = []).
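The subset construction that makes this parallelism possible is easy to sketch. The function below is a minimal illustration (names are our own): each of the S balanced subsets keeps every minority instance and adds an equally sized random draw from the majority class, so each subset, and hence each independent boosting run, only sees twice the minority class size.

```python
import random

def easy_ensemble_subsets(majority_idx, minority_idx, S, seed=0):
    """Draw S independent balanced subsets: all minority instances plus an
    equally sized sample (without replacement) of the majority class.
    Each subset can be boosted in parallel; scores are combined afterwards."""
    rng = random.Random(seed)
    return [list(minority_idx) + rng.sample(list(majority_idx), len(minority_idx))
            for _ in range(S)]
```

Because the S boosting runs share no state, distributing them over workers divides the wall-clock time by S, which is exactly what the EE par row approximates.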

We conclude our comparison of the proposed methods in Figure 5, which shows the average ranks in terms of computational timings (see Table 9) and AUC performance for each of the available data sources. Regarding the ranks with respect to the latter, we added an extra row EE par having the same AUC-performance as EE(S = 15) for each dataset in Table 5. Ranks are subsequently determined according to the procedure outlined in Section 4.6.1. Additionally, Figure 18 in Appendix D shows the result for the large and highly imbalanced CRF and Bank datasets. We expect the latter type of data to be of major interest for reasons already indicated.


Table 9 Computational timings (in seconds) averaged across ten folds, using the parameter combination resulting in the highest validation set AUC-performance. First and second best performances are emphasized in boldface (third and fourth best are underlined). EE par is the time required for EE(S = 15) divided by 15.

Mov G(p = 1) Mov G(p = 25) Mov Th(p = []) Yahoo A(p = 1)

BL 0032889 0056697 0558563 0026922
OSR 0055043 0062802 099009 0044421

SMOTE 0218821 0937057 3841482 0057726
ADASYN 0284688 1802399 5191265 0087694

RUS 0011431 0025383 0155224 0007991
CL Knn 0046599 0599846 0989914 0037182
Far Knn 0039887 080072 0683023 0027788

CBU 1034111 1060173 6822839 1692477
AB 0169792 0841443 3460246 0139251

AC(R = 28) 0471994 2996585 1086907 0366555
AC(R = RL) 053376 1179542 6065177 0209015
EE(S = 10) 0117226 6065145 117995 0148973
EE(S = 15) 020474 7173737 2119991 0180365

EE par 0013649 0478249 0141333 0012024

Yahoo A(p = 25) Yahoo G(p = 1) Yahoo G(p = 25) TaFeng(p = 1)

BL 0092954 0011915 0044164 0026728
OSR 0027887 0013241 0047206 0040919

SMOTE 1062686 0056153 0883698 0219553
ADASYN 2050993 0079073 1733367 0306618

RUS 0048471 0003234 0033423 0002916
CL Knn 084391 0025404 0502515 0092167
Far Knn 0664124 0026576 0500206 0080159

CBU 1569442 1287221 1355035 2467279
AB 0445546 0078777 0169977 0114619

AC(R = 28) 1034044 0321723 0515953 0926178
AC(R = RL) 0706215 0226741 0112949 0610233
EE(S = 10) 1026577 0100331 1527146 0058052
EE(S = 15) 1607596 0077483 2472582 010538

EE par 0107173 0005166 0164839 0007025

TaFeng(p = 25) Book(p = 1) Book(p = 25) LST(p = 1)

BL 0032033 0080035 0318093 0652045
OSR 0032414 0132927 0092757 087152

SMOTE 5089283 3409418 1143444 4987705
ADASYN 8148419 3689661 1225441 6840083

RUS 0020457 0022713 0031972 0432839
CL Knn 1713731 0400873 3711648 2508374
Far Knn 1539437 0379086 3988552 2511037

CBU 2642686 4198663 4631987 []
AB 0713265 061719 1238585 2466151

AC(R = 28) 1234647 1666131 2330635 1451671
AC(R = RL) 0279047 0860346 0197053 123763
EE(S = 10) 2484502 2145747 7177484 0524066
EE(S = 15) 3363971 2480066 1121945 0784111

EE par 0224265 0165338 0747963 0052274

Imbalanced classification in sparse and large behaviour datasets 37

Table 9 Continued Additionally an extra column indicating average ranks is included (lower ranks arepreferred)

Adver(p = []) Adver(p = 1) CRF(p = []) Bank(p = [])

BL 0010953 0002796 0725911 7089334
OSR 0012178 0006166 3685813 1797481

SMOTE 0123112 0017764 5633862 []
ADASYN 0183767 0021728 5768669 []

RUS 0012115 000204 0147392 5247441
CL Knn 0061324 0005568 1106755 7373282
Far Knn 0079078 0007069 1110379 9759619

CBU 3378235 3236754 [] []
AB 0069199 0103518 1153196 8308618

AC(R = 28) 0193092 0068905 2047434 7170548
AC(R = RL) 0107652 0037963 1387174 1063466
EE(S = 10) 0138485 0085686 0198656 2495117
EE(S = 15) 0185136 0139121 0285345 3640107

EE par 0012342 0009275 0019023 2426738

Average Rank [pos]

BL 2.94 [2]
OSR 4.19 [4]

SMOTE 9.59 [11]
ADASYN 10.91 [13]

RUS 1.38 [1]
CL Knn 6.5 [5]
Far Knn 6.56 [6]

CBU 14 [14]
AB 8.06 [7]

AC(R = 28) 10.81 [12]
AC(R = RL) 9.25 [9]
EE(S = 10) 8.25 [8]
EE(S = 15) 9.56 [10]

EE par 3 [3]

Timings are a substantial concern with respect to these large datasets. As can be observed, the EE-technique thrives for these big and highly imbalanced data, even in its non-parallel form.


Fig. 5 Average rank AUC versus average rank Time (see Table 9) across all datasets from Table 2 (excluding Flickr and Kdd). Regarding the AUC rank, an extra row EE par is added, having the same AUC-value as EE(S = 15). Ranks for each dataset are subsequently obtained via the procedure outlined in Section 4.6.1. Points occurring in the upper-right region are preferred.

5 The effect of the chosen base learner

In this short section we investigate the influence of the chosen base learner. More specifically, we ask ourselves whether some base inducers can more easily cope with the imbalanced learning issue, and in doing so we verify the conclusion that EE is a suitable technique to improve upon the baseline performance. We rely on several popular classification methods for behaviour data: two regularization based techniques (SVM, LR) and two heuristic approaches (Naive Bayes (NB), Behavioural Similarity (BeSim)). The next paragraphs briefly introduce these methods. Note that source codes and dataset results are provided in our on-line repository.

LR constructs a linear model that maximizes the likelihood of the observed data by imposing a logistic model for the parametric probability model P(Y|X). Ng and Jordan (2002) provide a thorough introduction to the topic27 and note that in its plain form LR can suffer from overfitting, especially when the input data is very high dimensional. Therefore, we resort to regularized logistic regression. We opt for L2-regularized LR, since L1-regularization corresponds with a natural way of imposing feature selection (Ng 2004). The LIBLINEAR toolbox (Fan et al 2008) is used to obtain these L2-regularized LR models.
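As a minimal illustration of the objective being solved, the sketch below fits L2-regularized LR by plain gradient descent on the LIBLINEAR-style objective 0.5·||w||² + C · Σ log(1 + exp(−yᵢ w·xᵢ)) with labels yᵢ ∈ {−1, +1}. It is a toy stand-in for the actual LIBLINEAR solver, with function names and the tiny dense data format our own:

```python
import math

def train_l2_logreg(X, y, C=1.0, lr=0.1, epochs=500):
    """Gradient descent on 0.5*||w||^2 + C * sum_i log(1 + exp(-y_i * w.x_i)),
    the L2-regularized LR objective also used by LIBLINEAR (y_i in {-1, +1})."""
    w = [0.0] * len(X[0])
    for _ in range(epochs):
        grad = list(w)                            # gradient of 0.5*||w||^2 is w
        for xi, yi in zip(X, y):
            margin = yi * sum(wj * xj for wj, xj in zip(w, xi))
            coef = -yi / (1.0 + math.exp(margin))  # logistic loss derivative
            for j, xj in enumerate(xi):
                grad[j] += C * coef * xj
        w = [wj - lr * gj for wj, gj in zip(w, grad)]
    return w

# Toy binary behaviour matrix: rows = users, columns = "liked" items.
X = [[1, 0, 1, 0], [0, 1, 0, 1], [1, 1, 0, 0], [0, 0, 1, 1]]
y = [1, -1, 1, -1]
w = train_l2_logreg(X, y)
scores = [sum(wj * xj for wj, xj in zip(w, xi)) for xi in X]  # usable for AUC
```

Real behaviour data would of course be stored sparsely; lowering C strengthens the regularization, which is the same "weakness" knob discussed for the boosted SVM variant.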

27 For an accessible introduction, see the chapter on "Generative and Discriminative Classifiers: Naive Bayes and Logistic Regression" provided at the following link: http://www.cs.cmu.edu/~tom/NewChapters.html


NB (Ng and Jordan 2002) relies on the use of Bayes' rule, which expresses the posterior P(Y|X) as the product of the likelihood P(X|Y) and the prior P(Y), divided by the evidence P(X). The naivety of the approach originates from assuming conditional independence of the features. We impose a multivariate event model regarding the distribution P(X|Y). Junque de Fortuny et al (2014a) provide an efficient NB implementation for dealing with large and sparse binary behaviour data.
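The multivariate (Bernoulli) event model treats every feature as a per-class coin flip, so both the presence and the absence of a feature contribute to the posterior. The following is a minimal stdlib sketch (names, data format and the Laplace smoothing constant are our own; it is not the implementation used in the paper):

```python
import math
from collections import Counter

def train_bernoulli_nb(rows, y, n_features, alpha=1.0):
    """rows: list of sets of active feature indices (sparse binary data).
    Returns class priors and per-class Bernoulli parameters P(x_f = 1 | c)."""
    prior, cond = {}, {}
    for c in set(y):
        idx = [i for i, yi in enumerate(y) if yi == c]
        prior[c] = len(idx) / len(y)
        counts = Counter(f for i in idx for f in rows[i])
        cond[c] = [(counts[f] + alpha) / (len(idx) + 2 * alpha)
                   for f in range(n_features)]
    return prior, cond

def log_posterior(prior, cond, x, c):
    """Unnormalized log P(c | x): every feature votes, present or absent."""
    lp = math.log(prior[c])
    for f, p in enumerate(cond[c]):
        lp += math.log(p) if f in x else math.log(1.0 - p)
    return lp
```

Working in log space avoids underflow when thousands of sparse features each contribute a small probability factor.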

The BeSim approach conceptually corresponds to projecting the bigraph to a unigraph using the procedure outlined in Section 3.2. Next, the weighted-vote relational neighbour classifier (Macskassy and Provost 2007) is used to infer unknown labels/scores through a weighted probability estimation using the known labels of neighbouring nodes. A very fast and scalable implementation of this approach (the SW-transformation) is presented in detail in Stankova et al (2015).
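The weighted-vote step itself is a one-liner once the unigraph is built. The helper below is a simplified sketch of that scoring rule only (not the SW-transformation itself, and the names are our own):

```python
def wvrn_score(neighbours, labels):
    """Weighted-vote relational neighbour: score an unlabelled node by the
    weighted average label of its labelled neighbours in the projected unigraph.
    neighbours: dict node -> edge weight; labels: dict node -> 0/1 class."""
    num = sum(w * labels[n] for n, w in neighbours.items() if n in labels)
    den = sum(w for n, w in neighbours.items() if n in labels)
    return num / den if den else 0.5   # fall back to an uninformative prior
```

A node whose heavy edges point to minority-labelled neighbours thus receives a score close to 1.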

In Table 10 we compare the performances of the baseline (BL) application of the methods with their EE-version (with S = 15, T = 20). It is quite striking to see the EE-version dominating the baseline version when looking at each method individually. This suggests that the imbalanced learning issue has an impact on each of the aforementioned techniques. Focussing on the regularization based approaches, we observe the baseline LR to be less influenced by the imbalance than the SVM method. Their EE-versions (EE SVM and EE LR) can boost the baseline performance in almost all cases and have the highest average ranks amongst all the proposed methods. As already emphasized, the regularization based approaches offer an added element of flexibility, in the sense that the strength of the underlying base learner can be controlled.

The heuristic approaches NB and BeSim do not offer this flexibility. Furthermore, it is not directly possible to integrate instance weights during the training phase of the base learner. During the boosting process, we therefore sample from the distribution Dt to generate (unweighted) examples and train a BeSim/NB learner next. The results in Table 10 indicate BL NB to be the worst performing algorithm amongst all methods tried. Due to the conditional independence assumption, one could regard this as a weak learner, which makes it suitable to be used in a boosting process. Indeed, the EE NB approach is able to significantly improve upon its performance. We found this to be the slowest EE-version28 (e.g. taking more than 24 hours to process a single fold of the Bank dataset, versus several minutes for the other versions). The average rank estimate of EE NB is rather pessimistic, since we assigned the lowest performance across all methods in case of missing results. BeSim can be considered a stronger learner compared to NB; its possible boost in performance is typically lower than that of NB. Having said this, we found EE BeSim to improve upon BL BeSim for all datasets (except Adver(p = 1)).

28 This statement only relates to the EE-version. In the baseline case, the timings of the NB-version are roughly of the same order of magnitude as SVM and LR (though this heavily depends on the chosen regularization parameter C). With EE, each subset has a size limited to twice the minority class training size (a relatively low number compared to the size of the majority class). This works in favour of SVM/LR (e.g. SVM has a computational complexity of O(m²), with m the number of training instances). Also note that the NB-version is indeed optimized for sparse data. Internally, the NB-implementation writes a fairly large model file to a specified directory (sizes of 1 GB can be encountered there). This writing operation slows down the training stage (especially in case of EE, where S × T = 15 × 20 = 300 model files are constructed).


Table 10 Average tenfold AUC test set performance of the BL compared to its EE-version (S = 15, T = 20) for SVM, LR, BeSim and NB. In case the EE NB implementation crashed, we put a × in the table; if it is too slow (> 24 hours per fold), we place a []. Additionally, average ranks are assigned to each method. In case EE NB has missing results for a particular dataset, we ensure it has the lowest rank across all other algorithms.

Mov G(p = 1) Mov G(p = 25) Mov Th(p = []) Yahoo A(p = 1)

BL SVM 71.6 (2.62) 81.41 (1.32) 79.77 (5.33) 56.49 (3.37)
EE SVM 76.12 (2.88) 85.13 (1.86) 86.43 (5.86) 59.74 (2.96)
BL LR 71.02 (2.09) 84.39 (1.84) 83.14 (4.17) 57.84 (2.39)
EE LR 76.69 (2.92) 85.03 (1.98) 86.3 (5.37) 59.79 (2.62)

BL BeSim 76.1 (3.58) 81.3 (2.92) 82.81 (6.6) 56.27 (2.73)
EE BeSim 76.31 (3.71) 81.37 (2.9) 85.02 (6.28) 57.7 (1.71)

BL NB 70.26 (5.84) 77.01 (2.54) 70.48 (10.14) 52.56 (2.09)
EE NB 75.93 (2.83) 85.56 (2.01) 86.91 (4.15) 57.55 (2.73)

Yahoo A(p = 25) Yahoo G(p = 1) Yahoo G(p = 25) TaFeng(p = 1)

BL SVM 61.61 (2.48) 66.84 (3.66) 78.82 (1.39) 55.75 (1.6)
EE SVM 66.38 (3.16) 73.48 (2.32) 80.55 (1.55) 61.13 (1.83)
BL LR 66.27 (2.96) 69.82 (1.93) 80.45 (1.59) 58.91 (2.31)
EE LR 66.22 (3.28) 73.08 (2.14) 80.53 (1.56) 61.43 (2.32)

BL BeSim 64.54 (2.02) 68.89 (2.49) 79.55 (1.96) 57.89 (1.18)
EE BeSim 65.25 (2.23) 71.18 (2.91) 80.04 (1.85) 59.36 (1.47)

BL NB 65 (1.65) 63.33 (2.56) 78.89 (1.64) 54.61 (1.2)
EE NB 66.6 (2.79) 70.99 (2.88) 81.01 (1.3) 59.01 (1.84)

TaFeng(p = 25) Book(p = 1) Book(p = 25) LST(p = 1)

BL SVM 66.94 (1.34) 52.6 (1.29) 60.08 (0.71) 99.99 (0.01)
EE SVM 70.4 (1.3) 55.34 (1.28) 65.4 (0.51) 99.98 (0.01)
BL LR 69.24 (1.3) 55.34 (1.27) 63.84 (0.75) 99.99 (0.01)
EE LR 70.28 (1.28) 55.49 (1.49) 65.41 (0.63) 99.97 (0.02)

BL BeSim 67.49 (1.23) 55.19 (1.27) 63.7 (0.63) 99.99 (0.01)
EE BeSim 68 (1.21) 55.21 (1.15) 64.38 (0.42) 99.99 (0)

BL NB 65.21 (1.64) 52.93 (0.9) 59.75 (0.47) 98.69 (0.3)
EE NB 70.72 (1.15) × 63.46 (0.61) 99.92 (0.04)

Adver(p = []) Adver(p = 1) CRF(p = []) Bank(p = [])

BL SVM 96.37 (1.94) 91.18 (2.97) 64.36 (18.97) 66.82 (0.88)
EE SVM 97.63 (1.35) 93.3 (2.14) 86.35 (9.99) 71.54 (0.76)
BL LR 97.19 (1.44) 88.51 (1.93) 81.87 (19.63) 71.43 (0.72)
EE LR 97.57 (0.96) 93.02 (2.06) 86.84 (9.62) 71.77 (0.62)

BL BeSim 97.26 (1.12) 95.38 (1.35) 86.91 (9.36) 67.85 (0.67)
EE BeSim 97.38 (1.04) 93.83 (1.35) 87.02 (10.43) 70.41 (0.55)

BL NB 93.75 (1.9) 93.37 (1.9) 87.24 (9.38) 67.83 (0.63)
EE NB 94.04 (1.75) × × []

Flickr(p = 01) Kdd(p = 05) Average Rank

BL SVM 74.92 (0.17) 74.53 (0.05) 6.44 [7]
EE SVM 79.86 (0.13) 80.98 (0.05) 2.39 [1]
BL LR 79.03 (0.11) 81.29 (0.04) 4.28 [4]
EE LR 79.85 (0.13) 80.75 (0.05) 2.61 [2]

BL BeSim 74.62 (0.13) 74.95 (0) 5.11 [6]
EE BeSim 76.4 (0.13) 77.55 (0.03) 3.61 [3]

BL NB 81.36 (0.1) 74.29 (0.05) 6.5 [8]
EE NB [] [] 5.06 [5]


6 Conclusions and future research directions

We investigated the effects of sampling and cost-sensitive learning techniques on the problem of learning from imbalanced behaviour data. This type of big data has fundamentally different properties and internal structure than traditional data and is characterized by sparseness and very large dimensions. This is a new research topic which enables benefits across a wide variety of application domains, such as fraud detection, targeted advertising, predictive policing, churn prediction, default prediction, etc. To enable future developments in this field, we provide an on-line repository containing implementations, datasets and results, available at the following link: http://www.applieddatamining.com/cms?q=software

Oversampling techniques need to be adapted in order to cope with this kind of data. We proposed new versions of the traditional SMOTE and ADASYN techniques, which differ in the way synthetic samples are generated and similarity computations are performed. The plain random oversampling (OSR) and the synthetic approaches are competitive methods in terms of AUC and do not seem to suffer from overfitting, as traditional studies report. Though the synthetic approaches have comparable AUC-performances in comparison to OSR, they are impractical due to their computationally expensive procedures. The OSR method is very fast for small and medium sized datasets, yet becomes inappropriate for larger datasets because the training times of the underlying base learner increase drastically.

With respect to undersampling methodologies, we employed a random undersampling technique (RUS) and Knn-based informed approaches. The latter type selects majority class instances that are closest (Cl Knn) or farthest (Far Knn) from the minority class examples and allows one to investigate the effect of noise/outliers on SVM performance. We observed that these odd instances have a severe performance degrading effect. RUS and Far Knn approaches can significantly improve performance with respect to the baseline, thanks to their abilities to reduce noise/outlier levels and to put more emphasis on the minority class. Having said this, they usually perform worse than any of the oversampling approaches. This is in sheer contrast to the results obtained from traditional studies and is heavily related to the intrinsic nature of behaviour data. Timing-wise, RUS is the fastest method. The informed approaches are slower than RUS due to the required nearest neighbour computations. Additionally, a cluster-based undersampling (CBU) technique was proposed to alleviate the within-class imbalance. For low undersampling rates, the performance of CBU is superior to plain RUS, whereas for high undersampling rates the opposite is true. Overall, no substantial gain in performance is obtained in comparison with a simple RUS approach. Furthermore, the method is the slowest among all techniques considered and relies on computationally expensive clustering heuristics.

We studied the effects of boosting, where we made use of a very specific linear SVM and logistic regression combination that uses confidence rated predictions, instead of a more popular plain "discrete" boosting algorithm (with weak hypotheses having values in {−1, 1}). Our experiments clearly indicated that the regularization constant C in the SVM formulation acts as a "weakness" indicator. Indeed, higher C-values cause stronger learners and should not be used in the boosting process, which matches with the analysis of studies dealing with traditional data. Strong learners can


already quite accurately distinguish majority class from minority class behaviours. The boosting process will focus on the hard to learn instances, among which are noise/outliers, and this can result in overfitting behaviour. We also observed that boosting can usually outperform the plain baseline, indicating that a combination of weak learners can achieve higher AUC-performance than a single strong learner.

EasyEnsemble (EE) integrates random undersampling with a boosting process. By sampling several balanced subsets from the training data and combining these, we can emphasize the minority class, achieve noise/outlier reduction effects and simultaneously explore the majority class space efficiently. Our version differs from the original formulation because we employ a different type of underlying weak learner (an SVM and LR combination) that is fed to a confidence rated boosting algorithm (instead of the discrete version). We observed superior AUC-performance with respect to all of the aforementioned techniques. Furthermore, the method is very fast, since each subset can be trained in parallel and has a size that is only twice as large as the minority class training set. We studied the effect of the number of subsets and noted that a limited amount of these (O(10)) is already sufficient to obtain satisfactory performances; increasing this number further brings only minor benefits.

Additionally, a statistical comparison of each of the proposed methods was conducted across a wide variety of datasets. Holm's significance test indicates that each of the proposed techniques (oversampling, undersampling except Cl Knn, cost-sensitive learning and boosting variants) can significantly outperform the plain baseline performance (each of the null-hypotheses of equal performance gets rejected with a confidence level of at least 91%). Comparison of the techniques with EE showed that the latter can significantly outperform all of them (each of the null-hypotheses of equal performance gets rejected with a confidence level of at least 75%). A timing-wise comparison showed EE to be the preferred method (after RUS) for large and highly imbalanced datasets. A parallel implementation would make EE the fastest method even for medium sized datasets that show a high level of imbalance.

The influence of the chosen base learner was investigated by comparing the baseline to its EE-version for the SVM, LR, NB and BeSim inducers. Each of the EE-versions is able to improve upon the baseline performance for nearly all datasets, where base learners of the regularization type are superior because of their abilities to control the strength of the learner.

Though we conducted an extensive benchmark study, there are still plenty of future research directions. First, and most important, there is an urgent need to develop a more general theoretical framework to completely understand the effect of imbalance and why certain methods outperform others. Secondly, other types of mechanisms could be investigated in this setting: advanced sampling (Guo et al 2008) set-ups perform sampling based on preliminary classifications; one-class learning is a recognition based methodology where a discriminative boundary is learned around the training examples of a single class alone, and Raskutti and Kowalczyk (2004) demonstrated the superiority of the recognition based one-class SVMs over the discriminative two-class SVMs; other types of ensemble learning strategies (e.g. bagging and stacking) could be investigated; Batista et al (2004) note that combining over- and undersampling can be beneficial when data are highly imbalanced or show very few minority class instances; the ideas behind Extended Nearest Neighbour (Tang and


He 2015) could guide us in the sampling process. As opposed to Knn, "it considers not only who are the nearest neighbours of a certain sample, but also who consider the particular sample as their nearest neighbour" (Tang and He 2015). The method assigns a class membership to an unknown sample that maximizes the intra-class coherence. Finally, more advanced techniques exist that can provide the K nearest neighbours faster or with (slightly) better performance.

Another issue relates to the lack of comprehensibility of the EasyEnsemble technique in its present form. Because we currently integrate Logistic Regression in the boosting process, the final ensemble classifier is non-linear and therefore incomprehensible. It might be interesting to investigate the numerical boosting technique proposed in Schapire and Singer (1999), where each weak learner is allowed to output real values (instead of values in [−1, 1]). In that case we would be able to use a plain linear SVM as weak learner, without the Logistic Regression component. The final ensemble would be a linear model, which is surely easier to comprehend. Also, the instance-based approach of Martens and Provost (2014) could be used to explain individual predictions of the non-linear EE-technique.

A Oversampling experiments

Table 11 Oversampling experiments. Results show the average tenfold AUC test set performance with respect to increasing β-values (β = [β1, β2, β3, β4] = [0, 1/3, 2/3, 1]). Additionally, standard deviations are shown between brackets. Optimal performances for each method are highlighted in boldface. See Section 4.2 for parameter settings.

MOV G(p = 1)
β1 β2 β3 β4

OSR 71.6 (2.62) 74.37 (2.04) 73.6 (1.84) 74.73 (2.45)
SMOTE 71.6 (2.62) 75.08 (2.18) 76.02 (2.14) 76.48 (2.3)

ADASYN 71.6 (2.62) 75.16 (1.92) 75.93 (2.08) 76.47 (2.29)
MOV G(p = 25)

β1 β2 β3 β4
OSR 81.41 (1.32) 83.49 (1.81) 83.84 (1.96) 83.91 (2.04)

SMOTE 81.41 (1.32) 83.32 (1.97) 83.59 (2.04) 83.76 (2.11)
ADASYN 81.41 (1.32) 83.61 (1.82) 84.02 (1.97) 83.69 (1.96)

Mov Th(p = [])
β1 β2 β3 β4

OSR 79.77 (5.33) 85.3 (4.66) 83.16 (4.5) 84.59 (5.69)
SMOTE 79.77 (5.33) 84.18 (6.51) 85.58 (5.97) 84.33 (5.75)

ADASYN 79.77 (5.33) 84.11 (6.77) 85.86 (6.06) 85.36 (5.13)
Yahoo A(p = 1)

β1 β2 β3 β4
OSR 55.92 (2.97) 58.66 (3.27) 59.99 (2.28) 59.74 (1.78)

SMOTE 55.92 (2.97) 59.76 (2.62) 59.74 (2.67) 59.43 (2.4)
ADASYN 55.92 (2.97) 59.54 (2.53) 59.55 (2.94) 59.56 (2.22)

Yahoo A(p = 25)
β1 β2 β3 β4

OSR 61.68 (2.42) 64.19 (3.17) 65.08 (3.26) 64.67 (2.1)
SMOTE 61.68 (2.42) 65.46 (3.63) 65.33 (3.23) 64.52 (2.98)

ADASYN 61.68 (2.42) 65.04 (3.74) 65.41 (3.47) 64.4 (2.21)


Table 11 continued
Yahoo G(p = 1)

β1 β2 β3 β4
OSR 66.84 (3.66) 72.18 (2.36) 73.11 (2.7) 72.49 (3.41)

SMOTE 66.84 (3.66) 72.65 (2.85) 73.27 (3.36) 73.37 (3.56)
ADASYN 66.84 (3.66) 72.87 (2.83) 73.18 (3.2) 73.39 (3.59)

Yahoo G(p = 25)
β1 β2 β3 β4

OSR 78.82 (1.39) 78.78 (2.02) 78.74 (1.86) 78.35 (1.47)
SMOTE 78.82 (1.39) 79.23 (1.57) 79.1 (1.2) 79.03 (1.89)

ADASYN 78.82 (1.39) 79.12 (1.43) 79.23 (1.37) 79.51 (2.01)
TaFeng(p = 1)

β1 β2 β3 β4OSR 5575 (16) 5923 (196) 60 (168) 6104 (236)

SMOTE 5575 (16) 6026 (195) 6149 (18) 6113 (152)ADASYN 5575 (16) 6026 (19) 6144 (185) 6116 (15)

TaFeng(p = 25)β1 β2 β3 β4

OSR 6694 (134) 6884 (121) 6699 (142) 677 (141)SMOTE 6694 (134) 6847 (15) 6707 (115) 6665 (081)

ADASYN 6694 (134) 6862 (138) 6785 (16) 6691 (139)Book(p = 1)

β1 β2 β3 β4OSR 526 (129) 5361 (094) 5541 (175) 5587 (144)

SMOTE 526 (129) 5477 (099) 5491 (08) 5436 (098)ADASYN 526 (129) 5486 (113) 5506 (073) 5454 (092)

Book(p = 25)β1 β2 β3 β4

OSR 6008 (071) 6105 (094) 6182 (112) 6462 (057)SMOTE 6008 (071) 626 (073) 6095 (068) 63 (08)

ADASYN 6008 (071) 6233 (073) 6077 (085) 6304 (058)LST(p = 1)

β1 β2 β3 β4OSR 9999 (001) 9999 (001) 9999 (001) 9999 (001)

SMOTE 9999 (001) 9999 (001) 9999 (001) 9999 (001)ADASYN 9999 (001) 9999 (001) 9999 (001) 9999 (001)

Adver(p = [])β1 β2 β3 β4

OSR 9661 (182) 9731 (165) 9707 (184) 9707 (179)SMOTE 9661 (182) 9691 (166) 9719 (165) 9707 (191)

ADASYN 9661 (182) 971 (17) 9708 (187) 9707 (188)Adver(p = 1)

β1 β2 β3 β4OSR 9093 (302) 9127 (303) 9266 (282) 9329 (197)

SMOTE 9093 (302) 9251 (203) 9296 (214) 9353 (181)ADASYN 9093 (302) 9222 (233) 927 (236) 9388 (173)

CRF(p = [])β1 β2 β3 β4

OSR 6406 (1643) 8082 (1294) 8128 (1227) 8191 (1128)SMOTE 6406 (1643) 7864 (1686) 8252 (1374) 7932 (1626)

ADASYN 6406 (1643) 7895 (1672) 8119 (1632) 7931 (1619)Bank(p = [])

β1 β2 β3 β4OSR 6682 (088) 701 (074) 7139 (08) 7147 (08)

SMOTE ADASYN

Imbalanced classification in sparse and large behaviour datasets 45

B Undersampling experiments

Table 12 Undersampling experiments. Results show the average tenfold AUC test set performance with respect to increasing βu-values (βu = [βu1, βu2, βu3, βu4, βu5] = [0, 1/4, 1/2, 3/4, 1]). Optimal performances (boldface) are highlighted for each imbalance ratio. Additionally, standard deviations are shown between brackets. Cl K represents the "Closest Knn" technique, Cl T represents the "Closest tot sim" technique (similar for Far K and Far T, see Section 3.2). Parameter settings are given in Section 4.2.

Mov G(p = 1)   βu1          βu2          βu3          βu4          βu5
RUS            71.6 (2.6)   71.83 (2.6)  72.54 (2.5)  72.39 (3.1)  70.61 (3.5)
Cl K           71.6 (2.6)   71.4 (2)     70.96 (1.9)  70.43 (2.4)  69.05 (4.1)
Cl T           71.6 (2.6)   70.28 (2.5)  66.74 (2)    66.8 (2.1)   68.18 (3.6)
Far K          71.6 (2.6)   72.36 (2.7)  71.26 (3.4)  66.57 (5.2)  53.5 (3.5)
Far T          71.6 (2.6)   72.22 (2.8)  71.63 (3.6)  64.28 (5.3)  50.88 (4.4)
CBU            72.55 (2.6)  73.28 (2.6)  73.12 (2.6)  73.84 (2.5)  73 (3.1)

Mov G(p = 25)
RUS            81.41 (1.3)  81.36 (1.3)  81.78 (1.7)  82.05 (1.7)  81.6 (2.1)
Cl K           81.41 (1.3)  80.86 (1.2)  80.95 (1.6)  79.73 (2.3)  77.95 (2.3)
Cl T           81.41 (1.3)  79.9 (1.2)   78.21 (1.4)  77.87 (1.5)  77.76 (2.3)
Far K          81.41 (1.3)  80.9 (1.5)   78.17 (1.8)  74.25 (2.4)  69.79 (3.2)
Far T          81.41 (1.3)  80.86 (1.5)  77.2 (2.4)   71.16 (2.7)  62.4 (2.8)
CBU            81.53 (1.4)  81.64 (1.3)  81.29 (1.6)  81.28 (2.1)  80.34 (2.7)

Mov Th(p = [])
RUS            79.77 (5.3)  80.32 (5.8)  81.57 (5.5)  81.86 (6.6)  81.26 (6.2)
Cl K           79.77 (5.3)  79.25 (4.5)  78.07 (5)    76.25 (6.5)  62.46 (8.5)
Cl T           79.77 (5.3)  78.4 (4.4)   72.41 (3.5)  64.66 (4.5)  60.37 (7.3)
Far K          79.77 (5.3)  84.54 (5)    83.64 (6.4)  80.02 (7.3)  56.82 (10.3)
Far T          79.77 (5.3)  85.03 (5.7)  82.68 (6.8)  75.61 (9.2)  56.77 (10.9)
CBU            80.11 (5.8)  81.17 (6)    81.08 (6.5)  84.17 (5.1)  80.96 (6.9)

Yahoo A(p = 1)
RUS            55.92 (3)    55.57 (3.4)  56.44 (3)    55.83 (3.4)  56.37 (3.3)
Cl K           55.92 (3)    55.67 (2.4)  53.12 (2)    50.57 (1.8)  53.79 (3.5)
Cl T           55.92 (3)    55.69 (2.1)  53.35 (2.2)  50.31 (2.2)  52.35 (3.3)
Far K          55.92 (3)    57.35 (2.2)  56.92 (1.1)  56.95 (2.3)  51.18 (2)
Far T          55.92 (3)    56.93 (2.4)  54.74 (1.9)  57.01 (1.8)  51.18 (2)
CBU            58.21 (2.6)  58.45 (3.3)  58.31 (3.5)  58.39 (3.5)  56.09 (2.6)

Yahoo A(p = 25)
RUS            61.68 (2.4)  62.9 (2.9)   63.62 (3.6)  63.75 (3.1)  63.19 (1.9)
Cl K           61.68 (2.4)  61.14 (2.1)  57.62 (1.6)  54.02 (1.8)  51.48 (1.4)
Cl T           61.68 (2.4)  60.89 (2.8)  58.11 (1.4)  54.4 (2.1)   51.76 (1.4)
Far K          61.68 (2.4)  63.96 (3)    62.62 (2.2)  59.61 (1.5)  56.25 (1.6)
Far T          61.68 (2.4)  63.71 (2.4)  59.72 (1.6)  57.27 (1.1)  54.47 (1.1)
CBU            62.46 (2.6)  61.85 (1.4)  61.78 (2.2)  59.94 (3)    60.1 (4)

Yahoo G(p = 1)
RUS            66.84 (3.7)  67.85 (3.2)  68.36 (3.2)  68.23 (4)    69.9 (4.2)
Cl K           66.84 (3.7)  66.71 (2.8)  64.3 (3.6)   61.98 (3.9)  61.15 (1.9)
Cl T           66.84 (3.7)  65.79 (2.7)  63.55 (3.3)  59.21 (3.5)  61.08 (2.4)
Far K          66.84 (3.7)  66.76 (4.1)  63.84 (3.4)  65.16 (2)    48.5 (2.9)
Far T          66.84 (3.7)  66.95 (4.1)  63.48 (2.9)  65.16 (2)    48.48 (2.9)
CBU            69.68 (4.1)  70.59 (3.2)  70.64 (3.7)  70.2 (2.9)   63.35 (3.6)

Yahoo G(p = 25)
RUS            78.82 (1.4)  78.91 (1.6)  78.97 (1.6)  78.61 (1.6)  77.82 (2.1)
Cl K           78.82 (1.4)  77.26 (1.5)  72.52 (1.5)  67.86 (2)    65.07 (2.7)
Cl T           78.82 (1.4)  76.83 (1)    71.99 (1.8)  67.15 (2.3)  61.1 (2.7)
Far K          78.82 (1.4)  78.26 (2.2)  74.69 (2.7)  67.22 (2.1)  60.72 (2.3)
Far T          78.82 (1.4)  77.68 (2.6)  72.44 (3)    64.94 (2.4)  59.6 (2)
CBU            75.25 (3.2)  75.22 (2.4)  74.69 (2.3)  73.07 (2.4)  70.69 (2.4)

TaFeng(p = 1)
RUS            55.75 (1.6)  56.1 (1.6)   56.26 (1.7)  57.23 (1.7)  59.25 (2.2)
Cl K           55.75 (1.6)  55.68 (1.6)  55.58 (1.5)  55.08 (1.1)  51.05 (1.5)
Cl T           55.75 (1.6)  55.67 (1.6)  54.47 (1.6)  47.53 (1.6)  49.3 (1.1)
Far K          55.75 (1.6)  58.99 (1.2)  59.47 (1.1)  60.04 (1.2)  56.31 (1)
Far T          55.75 (1.6)  58.92 (1.3)  59.25 (1.3)  58.58 (1.1)  56.31 (1)
CBU            57.8 (1)     58.47 (1.1)  58.15 (0.9)  58.87 (1.4)  57.65 (1.6)

TaFeng(p = 25)
RUS            66.94 (1.3)  67.44 (1.3)  68.1 (1.4)   68.27 (1.4)  66.13 (1.2)
Cl K           66.94 (1.3)  66.13 (1.4)  63.39 (1.2)  59.83 (1.3)  56.94 (0.7)
Cl T           66.94 (1.3)  66.38 (1.5)  62.89 (1.6)  57.46 (1.3)  54.56 (1.3)
Far K          66.94 (1.3)  68.06 (1.4)  66.43 (1.6)  64.46 (1.5)  63.35 (1.3)
Far T          66.94 (1.3)  64.31 (1.1)  62.69 (1)    61.27 (1.1)  59.03 (1)
CBU            64.81 (1.2)  64.15 (1.1)  64.13 (1.2)  63.88 (0.8)  63.46 (0.8)

Book(p = 1)
RUS            52.6 (1.3)   52.79 (0.9)  53.46 (0.8)  53.89 (0.9)  54.05 (0.9)
Cl K           52.6 (1.3)   52.56 (1.2)  52.52 (1.3)  52.39 (1.1)  53.09 (1.1)
Cl T           52.6 (1.3)   52.56 (1.2)  52.52 (1.3)  52.39 (1.1)  53.05 (0.7)
Far K          52.6 (1.3)   55.21 (1.2)  56.21 (1.8)  56.14 (1.2)  53.06 (1)
Far T          52.6 (1.3)   55.21 (1.2)  56.21 (1.8)  56.14 (1.2)  53.06 (1)
CBU            54.28 (0.9)  53.77 (1)    53.33 (1.1)  53.34 (0.9)  52.84 (0.8)

Book(p = 25)
RUS            60.08 (0.7)  60.13 (0.6)  60.4 (0.8)   60.33 (0.8)  63.28 (0.8)
Cl K           60.08 (0.7)  59.96 (0.7)  60.13 (0.8)  59.96 (1)    59.28 (0.7)
Cl T           60.08 (0.7)  59.96 (0.7)  60.13 (0.8)  60.29 (0.4)  54.5 (0.9)
Far K          60.08 (0.7)  63.29 (1)    64.19 (0.8)  57.3 (1.1)   55.66 (1.1)
Far T          60.08 (0.7)  62.14 (0.5)  58.27 (0.6)  56.37 (1)    55.66 (1.1)
CBU            54.82 (0.9)  54.67 (0.9)  54.71 (0.9)  54.66 (1)    54.78 (0.9)

LST(p = 1)
RUS            99.99 (0)    99.99 (0)    99.99 (0)    99.98 (0)    99.99 (0)
Cl K           99.99 (0)    99.99 (0)    99.99 (0)    99.99 (0)    99.99 (0)
Cl T           99.99 (0)    99.99 (0)    99.99 (0)    99.99 (0)    99.98 (0)
Far K          99.99 (0)    99.98 (0)    99.98 (0)    99.98 (0)    99.98 (0)
Far T          99.99 (0)    99.98 (0)    99.98 (0)    99.98 (0)    99.98 (0)
CBU            []           []           []           []           []

Adver(p = [])
RUS            96.61 (1.8)  96.32 (1.8)  96.63 (1.4)  97.12 (2.1)  96.22 (1.6)
Cl K           96.61 (1.8)  96.44 (1.5)  96.14 (1.5)  96.04 (2)    94.8 (2.5)
Cl T           96.61 (1.8)  95.87 (2.1)  94.32 (1.9)  93.01 (2.2)  90.72 (2.3)
Far K          96.61 (1.8)  96.53 (1.4)  95.76 (2)    94.39 (1.8)  90.49 (3.1)
Far T          96.61 (1.8)  96.54 (1.5)  95.67 (1.9)  94.54 (1.8)  89.3 (2.8)
CBU            96.85 (2.3)  96.85 (2.3)  97.05 (1.5)  96.6 (1.6)   96.06 (2.1)

Adver(p = 1)
RUS            90.93 (3)    91.53 (3.1)  92.37 (3.4)  91.9 (2.9)   91.93 (2.2)
Cl K           90.93 (3)    90.64 (3)    89.87 (3.9)  90.21 (3.6)  89.18 (2)
Cl T           90.93 (3)    89.7 (3.5)   88.55 (3.4)  85.76 (3.3)  88.2 (2.3)
Far K          90.93 (3)    93.8 (2.3)   92.4 (2.6)   88.73 (3.4)  85.51 (4)
Far T          90.93 (3)    93.62 (2.4)  93.2 (2.2)   88.41 (3.6)  85.51 (4)
CBU            93.22 (2.4)  93.76 (2.5)  93.89 (2.6)  93.52 (2.7)  91.27 (2)

CRF(p = [])
RUS            64.06 (16.4) 63.28 (15.9) 67.98 (17.4) 66.95 (21.9) 87.73 (8.8)
Cl K           64.06 (16.4) 62.44 (16.6) 62.34 (16.9) 71.37 (13.8) 78.22 (17.7)
Cl T           64.06 (16.4) 62.44 (16.6) 62.34 (16.9) 71.37 (13.8) 62.67 (22.9)
Far K          64.06 (16.4) 83.8 (14.2)  83.93 (14.8) 84.49 (13.7) 86.11 (9.7)
Far T          64.06 (16.4) 83.8 (14.2)  83.93 (14.8) 84.49 (13.7) 86.11 (9.7)
CBU            []           []           []           []           []

Bank(p = [])
RUS            66.82 (0.9)  67.02 (0.9)  67.37 (0.8)  67.99 (0.6)  69.5 (1)
Cl K           66.82 (0.9)  66.17 (0.7)  65.24 (0.6)  64.86 (0.6)  58.53 (1.1)
Cl T           66.82 (0.9)  64.92 (1.1)  60.69 (0.9)  56.33 (0.8)  52.87 (0.7)
Far K          66.82 (0.9)  66.95 (0.6)  66.19 (0.6)  64.42 (0.6)  58.25 (1.1)
Far T          66.82 (0.9)  67.16 (0.6)  64.2 (0.8)   59.67 (1)    58.25 (1.1)
CBU            []           []           []           []           []


C Boosting experiments

This section presents the results of each of the boosting experiments from Section 4.5 on the remaining data sources. Each of the figures below depicts the average tenfold AUC-performance on test data (with µ = 100) with respect to the number of boosting iterations for (left) AdaBoost (AB), AdaCost (AC) and EasyEnsemble (EE), with C chosen according to highest validation set AUC-performance (over all possible boosting rounds), and (right) AB and EE (with S = 15) with varying C-levels.

[Figures 6-17: plots omitted in this text version. Each figure shows the average tenfold AUC test set performance versus the number of boosting iterations T (0 to 30). Panel (a) compares AB, AC (R2, R8 and RD variants), EE (S = 5, 10, 15) and the BL baseline; panel (b) compares AB and EE at C-levels 1e-07, 1e-05, 0.001 and 0.1 against BL.]

Fig. 6 Mov G(p = 1) dataset
Fig. 7 Mov Th(p = []) dataset
Fig. 8 Yahoo A(p = 1) dataset
Fig. 9 Yahoo A(p = 25) dataset
Fig. 10 Yahoo G(p = 1) dataset
Fig. 11 Yahoo G(p = 25) dataset
Fig. 12 TaFeng(p = 1) dataset
Fig. 13 Book(p = 1) dataset
Fig. 14 LST(p = 1) dataset
Fig. 15 Adver(p = []) dataset
Fig. 16 Adver(p = 1) dataset
Fig. 17 CRF(p = []) dataset

D Final Comparison

[Figure 18: plot omitted in this text version. Scatter of average rank AUC (x-axis) versus average rank Time (y-axis) for BL, OSR, SMOTE, ADASYN, RUS, Cl Knn, Far Knn, CBU, AB, AC(R=2,8), AC(R=RL), EE(S=10), EE(S=15) and EE par.]

Fig. 18 Average rank AUC versus average rank Time (see Table 9) across the large datasets from Table 2 (CRF and Bank). Regarding the former, an extra row EE par is added, having the same AUC-value as EE(S = 15). Ranks for each dataset are subsequently obtained via the procedure outlined in Section 4.6.1. Points occurring in the upper-right region are preferred.

Imbalanced classification in sparse and large behaviour datasets 53

References

Akbani R, Kwek S, Japkowicz N (2004) Applying support vector machines to imbalanced datasets. In: Machine Learning: ECML 2004, 15th European Conference on Machine Learning, Pisa, Italy, September 20-24, 2004, Proceedings, Springer Berlin Heidelberg, Berlin, Heidelberg, pp 39-50. DOI 10.1007/978-3-540-30115-8_7

Ali A, Shamsuddin SM, Ralescu AL (2015) Classification with class imbalance problem: A review. International Journal of Advances in Soft Computing and its Applications 7(3):176-204

Alzahrani T, Horadam KJ (2016) Community detection in bipartite networks: Algorithms and case studies. In: Complex Systems and Networks: Dynamics, Controls and Applications, Springer Berlin Heidelberg, Berlin, Heidelberg, pp 25-50. DOI 10.1007/978-3-662-47824-0_2

Bachner J (2013) Predictive policing: Preventing crime with data and analytics. IBM Center for the Business of Government

Baesens B, Van Gestel T, Viaene S, Stepanova M, Suykens J, Vanthienen J (2003) Benchmarking state-of-the-art classification algorithms for credit scoring. Journal of the Operational Research Society 54(6):627-635. DOI 10.1057/palgrave.jors.2601545

Barandela R, Sánchez J, García V, Rangel E (2003) Strategies for learning in class imbalance problems. Pattern Recognition 36(3):849-851. DOI 10.1016/S0031-3203(02)00257-1

Barber MJ (2007) Modularity and community detection in bipartite networks. Physical Review E 76:066102. DOI 10.1103/PhysRevE.76.066102

Barua S, Islam MM, Yao X, Murase K (2014) MWMOTE - majority weighted minority oversampling technique for imbalanced data set learning. IEEE Transactions on Knowledge and Data Engineering 26(2):405-425. DOI 10.1109/TKDE.2012.232

Batista GEAPA, Prati RC, Monard MC (2004) A study of the behavior of several methods for balancing machine learning training data. SIGKDD Explor Newsl 6(1):20-29. DOI 10.1145/1007730.1007735

Beckett SJ (2016) Improved community detection in weighted bipartite networks. Royal Society Open Science 3(1). DOI 10.1098/rsos.140536

Bekkar M, Djemaa HK, Alitouche TA (2013) Evaluation measures for models assessment over imbalanced data sets. Journal of Information Engineering and Applications 3(10):27-38

Bhattacharyya S, Jha S, Tharakunnel K, Westland JC (2011) Data mining for credit card fraud: A comparative study. Decision Support Systems 50(3):602-613. DOI 10.1016/j.dss.2010.08.008

Blondel VD, Guillaume JL, Lambiotte R, Lefebvre E (2008) Fast unfolding of communities in large networks. Journal of Statistical Mechanics: Theory and Experiment 2008(10):P10008

Breiman L, Friedman J, Stone CJ, Olshen RA (1984) Classification and regression trees. Taylor & Francis

Brozovsky L, Petricek V (2007) Recommender system for online dating service. In: Proceedings of Znalosti 2007 Conference, VSB Ostrava

Cha M, Mislove A, Gummadi KP (2009) A measurement-driven analysis of information propagation in the Flickr social network. In: Proceedings of the 18th International Conference on World Wide Web, ACM, New York, NY, USA, WWW '09, pp 721-730. DOI 10.1145/1526709.1526806

Chawla NV (2005) Data mining for imbalanced datasets: An overview. In: Data mining and knowledge discovery handbook, Springer US, Boston, MA, pp 853-867

Chawla NV, Bowyer KW, Hall LO, Kegelmeyer WP (2002) SMOTE: synthetic minority over-sampling technique. Journal of Artificial Intelligence Research 16:321-357

Chawla NV, Lazarevic A, Hall LO, Bowyer KW (2003) SMOTEBoost: Improving prediction of the minority class in boosting. In: Knowledge Discovery in Databases: PKDD 2003, Springer Berlin Heidelberg, Berlin, Heidelberg, pp 107-119

Chawla NV, Japkowicz N, Kotcz A (2004) Editorial: Special issue on learning from imbalanced data sets. SIGKDD Explor Newsl 6(1):1-6. DOI 10.1145/1007730.1007733

Chen M, Mao S, Liu Y (2014) Big data: A survey. Mobile Networks and Applications 19(2):171-209. DOI 10.1007/s11036-013-0489-0

Chyi YM (2003) Classification analysis techniques for skewed class distribution problems. Master thesis, Department of Information Management, National Sun Yat-Sen University

Demšar J (2006) Statistical comparisons of classifiers over multiple data sets. Journal of Machine Learning Research 7(Jan):1-30

Dhillon IS (2001) Co-clustering documents and words using bipartite spectral graph partitioning. In: Proceedings of the Seventh ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, ACM, New York, NY, USA, KDD '01, pp 269-274. DOI 10.1145/502512.502550


Drummond C, Holte RC (2003) C4.5, class imbalance, and cost sensitivity: why under-sampling beats over-sampling. In: Proceedings of the ICML '03 Workshop on Learning from Imbalanced Datasets

Fan RE, Chang KW, Hsieh CJ, Wang XR, Lin CJ (2008) LIBLINEAR: A library for large linear classification. Journal of Machine Learning Research 9:1871-1874

Fan W, Stolfo SJ, Zhang J, Chan PK (1999) AdaCost: Misclassification cost-sensitive boosting. In: Proceedings of the Sixteenth International Conference on Machine Learning, Morgan Kaufmann Publishers Inc., San Francisco, CA, USA, ICML '99, pp 97-105

Fawcett T (2006) An introduction to ROC analysis. Pattern Recognition Letters 27(8):861-874. DOI 10.1016/j.patrec.2005.10.010

Finch H (2005) Comparison of distance measures in cluster analysis with dichotomous data. Journal of Data Science 3(1):85-100

Fortunato S (2010) Community detection in graphs. Physics Reports 486(3-5):75-174. DOI 10.1016/j.physrep.2009.11.002

Junque de Fortuny E, Martens D, Provost F (2014a) Predictive modeling with big data: is bigger really better? Big Data 1(4):215-226. DOI 10.1089/big.2013.0037

Junque de Fortuny E, Stankova M, Moeyersoms J, Minnaert B, Provost F, Martens D (2014b) Corporate residence fraud detection. In: Proceedings of the 20th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, ACM, New York, NY, USA, KDD '14, pp 1650-1659. DOI 10.1145/2623330.2623333

Frasca M, Bertoni A, Re M, Valentini G (2013) A neural network algorithm for semi-supervised node label learning from unbalanced data. Neural Networks 43:84-98. DOI 10.1016/j.neunet.2013.01.021

Friedman M (1937) The use of ranks to avoid the assumption of normality implicit in the analysis of variance. Journal of the American Statistical Association 32(200):675-701

García E, Lozano F (2007) Boosting support vector machines. In: Machine Learning and Data Mining in Pattern Recognition, 5th International Conference, MLDM 2007, Leipzig, Germany, July 18-20, Post Proceedings, IBaI publishing, pp 153-167

Goldstein M, Uchida S (2016) A comparative evaluation of unsupervised anomaly detection algorithms for multivariate data. PLOS ONE 11(4):1-31. DOI 10.1371/journal.pone.0152173

González PC, Velásquez JD (2013) Characterization and detection of taxpayers with false invoices using data mining techniques. Expert Systems with Applications 40(5):1427-1436. DOI 10.1016/j.eswa.2012.08.051

Guimerà R, Sales-Pardo M, Amaral LAN (2007) Module identification in bipartite and directed networks. Physical Review E 76:036102. DOI 10.1103/PhysRevE.76.036102

Guo H, Viktor HL (2004) Learning from imbalanced data sets with boosting and data generation: The DataBoost-IM approach. SIGKDD Explor Newsl 6(1):30-39. DOI 10.1145/1007730.1007736

Guo X, Yin Y, Dong C, Yang G, Zhou G (2008) On the class imbalance problem. In: 2008 Fourth International Conference on Natural Computation, IEEE, vol 4, pp 192-201. DOI 10.1109/ICNC.2008.871

Han H, Wang WY, Mao BH (2005) Borderline-SMOTE: A new over-sampling method in imbalanced data sets learning. In: Advances in Intelligent Computing, Springer Berlin Heidelberg, Berlin, Heidelberg, pp 878-887

He H, Garcia EA (2009) Learning from imbalanced data. IEEE Transactions on Knowledge and Data Engineering 21(9):1263-1284. DOI 10.1109/TKDE.2008.239

He H, Bai Y, Garcia EA, Li S (2008) ADASYN: Adaptive synthetic sampling approach for imbalanced learning. In: 2008 IEEE International Joint Conference on Neural Networks (IEEE World Congress on Computational Intelligence), IEEE, pp 1322-1328. DOI 10.1109/IJCNN.2008.4633969

Holm S (1979) A simple sequentially rejective multiple test procedure. Scandinavian Journal of Statistics 6(2):65-70

Hsu CW, Lin CJ (2002) A comparison of methods for multiclass support vector machines. IEEE Transactions on Neural Networks 13(2):415-425. DOI 10.1109/72.991427

Huang A (2008) Similarity measures for text document clustering. In: Proceedings of the Sixth New Zealand Computer Science Research Student Conference (NZCSRSC2008), Christchurch, New Zealand, pp 49-56

Iman RL, Davenport JM (1980) Approximations of the critical region of the Friedman statistic. Communications in Statistics - Theory and Methods 9(6):571-595

Jo T, Japkowicz N (2004) Class imbalances versus small disjuncts. ACM SIGKDD Explor Newsl 6(1):40-49. DOI 10.1145/1007730.1007737


Jutla IS, Jeub LG, Mucha PJ (2011-2016) A generalized Louvain method for community detection implemented in MATLAB. URL http://netwiki.amath.unc.edu/GenLouvain

Kubat M, Matwin S (1997) Addressing the curse of imbalanced training sets: One-sided selection. In: Proceedings of the Fourteenth International Conference on Machine Learning, Morgan Kaufmann Publishers Inc., San Francisco, CA, USA, pp 179-186

Lancichinetti A, Fortunato S (2009) Community detection algorithms: A comparative analysis. Physical Review E 80:056117. DOI 10.1103/PhysRevE.80.056117

Larremore DB, Clauset A, Jacobs AZ (2014) Efficiently inferring community structure in bipartite networks. Physical Review E: Statistical, Nonlinear, and Soft Matter Physics 90:012805. DOI 10.1103/PhysRevE.90.012805

Li J, Fine JP (2010) Weighted area under the receiver operating characteristic curve and its application to gene selection. Journal of the Royal Statistical Society: Series C (Applied Statistics) 59(4):673-692. DOI 10.1111/j.1467-9876.2010.00713.x

Li X, Wang L, Sung E (2008) AdaBoost with SVM-based component classifiers. Engineering Applications of Artificial Intelligence 21(5):785-795. DOI 10.1016/j.engappai.2007.07.001

Lichman M (2013) UCI machine learning repository. URL http://archive.ics.uci.edu/ml

Liu W, Chawla S, Cieslak DA, Chawla NV (2010) A robust decision tree algorithm for imbalanced data sets. In: Proceedings of the Tenth SIAM International Conference on Data Mining, SIAM, Philadelphia, vol 10, pp 766-777

Liu XY, Wu J, Zhou ZH (2009) Exploratory undersampling for class-imbalance learning. IEEE Transactions on Systems, Man, and Cybernetics, Part B (Cybernetics) 39(2):539-550. DOI 10.1109/TSMCB.2008.2007853

Luts J, Ojeda F, Van de Plas R, De Moor B, Van Huffel S, Suykens JA (2010) A tutorial on support vector machine-based methods for classification problems in chemometrics. Analytica Chimica Acta 665(2):129-145. DOI 10.1016/j.aca.2010.03.030

Macskassy SA, Provost F (2007) Classification in networked data: A toolkit and a univariate case study. Journal of Machine Learning Research 8(May):935-983

Martens D, Provost F (2014) Explaining data-driven document classifications. MIS Quarterly 38(1):73-100

Martens D, Provost F, Clark J, Junque de Fortuny E (2016) Mining massive fine-grained behavior data to improve predictive analytics. MIS Quarterly 40(4):869-888

Mazurowski MA, Habas PA, Zurada JM, Lo JY, Baker JA, Tourassi GD (2008) Training neural network classifiers for medical decision making: The effects of imbalanced datasets on classification performance. Neural Networks 21(2-3):427-436. DOI 10.1016/j.neunet.2007.12.031

Mease D, Wyner AJ, Buja A (2007) Boosted classification trees and class probability/quantile estimation. Journal of Machine Learning Research 8:409-439

Nemenyi P (1963) Distribution-free Multiple Comparisons. Dissertation, Princeton University

Newman MEJ, Girvan M (2004) Finding and evaluating community structure in networks. Physical Review E 69:026113. DOI 10.1103/PhysRevE.69.026113

Ng AY (2004) Feature selection, L1 vs. L2 regularization, and rotational invariance. In: Proceedings of the Twenty-first International Conference on Machine Learning, ACM, New York, NY, USA, ICML '04, pp 78-. DOI 10.1145/1015330.1015435

Ng AY, Jordan MI (2002) On discriminative vs. generative classifiers: A comparison of logistic regression and naive Bayes. In: Advances in Neural Information Processing Systems 14, MIT Press, pp 841-848

Ngai E, Hu Y, Wong Y, Chen Y, Sun X (2011) The application of data mining techniques in financial fraud detection: A classification framework and an academic review of literature. Decision Support Systems 50(3):559-569. DOI 10.1016/j.dss.2010.08.006

Platt JC (1999) Probabilistic outputs for support vector machines and comparisons to regularized likelihood methods. In: Advances in Large-Margin Classifiers, MIT Press, pp 61-74

Porter MA, Onnela JP, Mucha PJ (2009) Communities in networks. Notices of the American Mathematical Society 56(9):1082-1097

Provost F, Fawcett T (2013) Data Science for Business: What you need to know about data mining and data-analytic thinking. O'Reilly Media, Inc.

Provost F, Dalessandro B, Hook R, Zhang X, Murray A (2009) Audience selection for on-line brand advertising: Privacy-friendly social network targeting. In: Proceedings of the 15th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, ACM, New York, NY, USA, KDD '09, pp 707-716. DOI 10.1145/1557019.1557098


Raskutti B, Kowalczyk A (2004) Extreme re-balancing for SVMs: A case study. SIGKDD Explor Newsl 6(1):60-69. DOI 10.1145/1007730.1007739

Rosvall M, Bergstrom CT (2008) Maps of random walks on complex networks reveal community structure. Proceedings of the National Academy of Sciences 105(4):1118-1123. DOI 10.1073/pnas.0706851105

Schapire RE (1999) A brief introduction to boosting. In: Proceedings of the 16th International Joint Conference on Artificial Intelligence - Volume 2, Morgan Kaufmann Publishers Inc., San Francisco, CA, USA, IJCAI'99, pp 1401-1406

Schapire RE, Singer Y (1999) Improved boosting algorithms using confidence-rated predictions. Machine Learning 37(3):297-336. DOI 10.1023/A:1007614523901

Shmueli G (2017) Analyzing behavioral big data: Methodological, practical, ethical, and moral issues. Quality Engineering 29(1):57-74. DOI 10.1080/08982112.2016.1210979

Sobhani P, Viktor H, Matwin S (2015) Learning from imbalanced data using ensemble methods and cluster-based undersampling. In: New Frontiers in Mining Complex Patterns: Third International Workshop, NFMCP 2014, Held in Conjunction with ECML-PKDD 2014, Nancy, France, September 19, 2014, Revised Selected Papers, Springer International Publishing, Cham, pp 69-83. DOI 10.1007/978-3-319-17876-9_5

Stankova M (2016) Classification within network data with a bipartite structure. Dissertation, University of Antwerp

Stankova M, Martens D, Provost F (2015) Classification over bipartite graphs through projection. Working Papers 2015001, University of Antwerp, Faculty of Applied Economics

Sun Y, Kamel MS, Wong AK, Wang Y (2007) Cost-sensitive boosting for classification of imbalanced data. Pattern Recognition 40(12):3358-3378. DOI 10.1016/j.patcog.2007.04.009

Suykens JA, Van Gestel T, De Brabanter J, De Moor B, Vandewalle J (2002) Least squares support vector machines. World Scientific, Singapore

Tang B, He H (2015) ENN: Extended nearest neighbor method for pattern recognition [research frontier]. IEEE Computational Intelligence Magazine 10(3):52-60. DOI 10.1109/MCI.2015.2437512

Tang Y, Zhang YQ, Chawla NV, Krasser S (2009) SVMs modeling for highly imbalanced classification. IEEE Transactions on Systems, Man, and Cybernetics, Part B (Cybernetics) 39(1):281-288. DOI 10.1109/TSMCB.2008.2002909

Tobback E, Moeyersoms J, Stankova M, Martens D (2016) Bankruptcy prediction for SMEs using relational data. Working Paper 2016004, University of Antwerp, Faculty of Applied Economics

Verbeke W, Dejaeger K, Martens D, Hur J, Baesens B (2012) New insights into churn prediction in the telecommunication sector: A profit driven data mining approach. European Journal of Operational Research 218(1):211-229. DOI 10.1016/j.ejor.2011.09.031

Veropoulos K, Campbell I, Cristianini N (1999) Controlling the sensitivity of support vector machines. In: Proceedings of the International Joint Conference on Artificial Intelligence, Stockholm, Sweden (IJCAI99), pp 55-60

Whitrow C, Hand DJ, Juszczak P, Weston D, Adams NM (2009) Transaction aggregation as a strategy for credit card fraud detection. Data Mining and Knowledge Discovery 18(1):30-55. DOI 10.1007/s10618-008-0116-z

Wickramaratna J, Holden SB, Buxton BF (2001) Performance degradation in boosting. In: Proceedings of the Second International Workshop on Multiple Classifier Systems, Springer, London, UK, MCS '01, pp 11-21

Yen SJ, Lee YS (2009) Cluster-based under-sampling approaches for imbalanced data distributions. Expert Systems with Applications 36(3, Part 1):5718-5727. DOI 10.1016/j.eswa.2008.06.108

Yu HF, Lo HY, Hsieh HP, Lou JK, McKenzie TG, Chou JW, Chung PH, Ho CH, Chang CF, Wei YH, et al (2010) Feature engineering and classifier ensemble for KDD Cup 2010. In: Proceedings of the KDD Cup 2010 Workshop, pp 1-16

Zha H, He X, Ding C, Simon H, Gu M (2001) Bipartite graph partitioning and data clustering. In: Proceedings of the Tenth International Conference on Information and Knowledge Management, ACM, New York, NY, USA, CIKM '01, pp 25-32. DOI 10.1145/502585.502591

Zhang J, Mani I (2003) KNN approach to unbalanced data distributions: A case study involving information extraction. In: Proceedings of the ICML'2003 Workshop on Learning from Imbalanced Datasets, Washington DC

Ziegler CN, McNee SM, Konstan JA, Lausen G (2005) Improving recommendation lists through topic diversification. In: Proceedings of the 14th International Conference on World Wide Web, ACM, New York, NY, USA, WWW '05, pp 22-32. DOI 10.1145/1060745.1060754


4.6 Final comparison

4.6.1 Performance-wise comparison

In this section we wish to determine which of the aforementioned methods shows superior performance in terms of AUC on test data. The experimental set-up is described in Section 4.2. Note that we exclude β = 0 and βu = 0 in determining the performance of the oversampling and undersampling techniques respectively, to be able to compare them with the baseline (BL) approach.23 The results for AB, AC and EE are shown for µ = 100. The number of boosting iterations t ∈ [0,T] is also considered as a tunable parameter. Table 5 shows the final comparison between all methods. Additionally, we added an average rank column indicating the mean rank of each algorithm across all datasets. The rank of an algorithm for a specific dataset corresponds to its position in the sorted AUC-performance spectrum, where the best performing algorithm gets the rank of 1, the second best rank 2, etc. In case two or more algorithms perform equally well, average ranks are assigned (if two algorithms have ranks of 3 and 4 respectively and show equal performance, then both methods get assigned a rank of 3.5). Note that the LST dataset is excluded in the calculation of average ranks, since we consider this a dataset that is not affected by imbalance and on which all methods show equal performance.
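The ranking procedure just described (best AUC gets rank 1; tied algorithms share the average of the positions they occupy) can be written down compactly. This is an illustrative sketch, not the authors' code:

```python
def average_ranks(aucs):
    """Rank algorithms on one dataset: the highest AUC gets rank 1;
    tied algorithms share the average of the ranks they would occupy
    (e.g. a tie at positions 3 and 4 yields rank 3.5 for both)."""
    order = sorted(range(len(aucs)), key=lambda i: -aucs[i])
    ranks = [0.0] * len(aucs)
    i = 0
    while i < len(order):
        j = i
        # extend the tie group while the next sorted value is equal
        while j + 1 < len(order) and aucs[order[j + 1]] == aucs[order[i]]:
            j += 1
        avg = (i + 1 + j + 1) / 2  # average of positions i+1 .. j+1
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    return ranks
```

Averaging these per-dataset ranks across all datasets yields the average rank column of Table 5; for example, `average_ranks([0.9, 0.8, 0.6, 0.6])` reproduces the tie example from the text, assigning 3.5 to both tied methods.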

As a side note, we noticed that increasing the number of minority training set instances can significantly improve performance. We can see this clearly by comparing the p = 25 outcomes with the results of p = 1 on the same dataset. Junque de Fortuny et al (2014a) already showed empirically that larger behaviour datasets (in terms of the number of instances or features) lead to higher AUC-performance. This is confirmed in our experiments, though we want to add that it only holds when the data size is increased in a balanced fashion.

The results in Table 5 confirm our previous analysis. Each of the oversampling, undersampling (except Cl Knn) and boosting variants seems to improve upon the BL performance. Sobhani et al (2015) note that the direct application of ensemble techniques on imbalanced data does not circumvent the imbalanced learning issue. In our experiments on behaviour data, we do find a simple boosting procedure applied to the imbalanced data to improve upon the predictive performance. The oversampling techniques OSR, SMOTE and ADASYN show similar outcomes and are competitive methods, seeming to outperform the undersampling techniques and the plain boosting (AB) and cost-sensitive learning (AC) approaches. When dealing with traditional data, the performances of oversampling techniques are generally worse than those of undersampling approaches (Drummond and Holte 2003). It is interesting to observe that for these large and sparse behaviour datasets the opposite is true. This is heavily related to the intrinsic properties of the data under consideration. Undersampling (or feature selection, for that matter) is not recommended because, as we have noted in Section 2.1, each feature provides a small additional contribution to the predictive performance. With respect to AC, the results indicate that R = RL is a better choice compared to more random cost ratios R = 28. The EE-technique has the

23 The BL technique trains single SVMs on the imbalanced training data

30 Jellis Vanhoeyveld David Martens

lowest average rank and is our best performing method. The combination of several balanced subsets with filtered noise/outlier contributions allows one to explore the majority class space efficiently. To strengthen our previous statements, we will in the next paragraphs conduct statistical significance tests in accordance with the procedure outlined in Demsar (2006). The latter paper presents a framework to perform statistical comparisons of classifiers over multiple datasets.

The first null-hypothesis we try to reject postulates that each of the algorithms performs equally well or, equivalently, that their average ranks R_j (see Table 5) are indifferent. Under the null-hypothesis, the Friedman statistic (Friedman 1937)

χ²_F = [12N / (k(k+1))] · [ Σ_{j=1}^{k} R_j² − k(k+1)²/4 ]   (6)

with N the number of datasets (N = 15) and k the number of algorithms (k = 13), can be used to compute the statistic (Iman and Davenport 1980)

F_F = (N−1)χ²_F / (N(k−1) − χ²_F)   (7)

The latter is distributed according to the F-distribution with k−1 and (k−1)(N−1) degrees of freedom. Evaluating this measure with the average ranks shown in Table 5 results in F_F = 22.98, which is a lot higher than the critical value of 1.81 at the α = 0.05 significance level. We can therefore reject the null-hypothesis and proceed with post-hoc tests.
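Equations (6) and (7) translate directly into code; a minimal sketch (the rank vectors in the test values are illustrative):

```python
def friedman_chi2(avg_ranks, N):
    """Friedman chi-square statistic of Eq. (6);
    avg_ranks holds the average rank R_j of each of the k algorithms."""
    k = len(avg_ranks)
    return (12.0 * N / (k * (k + 1))) * (
        sum(R ** 2 for R in avg_ranks) - k * (k + 1) ** 2 / 4.0)

def iman_davenport_F(avg_ranks, N):
    """Iman-Davenport F statistic of Eq. (7), distributed as
    F(k-1, (k-1)(N-1)) under the null-hypothesis."""
    k = len(avg_ranks)
    chi2 = friedman_chi2(avg_ranks, N)
    return (N - 1) * chi2 / (N * (k - 1) - chi2)
```

Plugging the average ranks from Table 5 (k = 13, N = 15) into these functions approximately reproduces the F_F = 22.98 value reported in the text.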

The Nemenyi test (Nemenyi 1963) allows one to compare each classifier to one another and adjusts the critical value to compensate for making k(k−1)/2 comparisons24. "The performance of two classifiers is significantly different if the corresponding average ranks differ by at least the critical difference (CD) (Demsar 2006)". We refer to the aforementioned paper to calculate the CD. The result of these comparisons at the α = 0.05 significance level is shown in Table 6. This test is already quite revealing in the sense that, for instance, the oversampling methods (OSR, SMOTE and ADASYN), AC (R = RL) and EE-techniques are significantly different from the BL. Yet it does not indicate that undersampling or applying plain boosting (AB) results in significantly different performances compared to the BL. The Nemenyi test is still conservative and we therefore compare all classifiers to a single control classifier in the next paragraphs. Note that we only make k−1 comparisons in that situation, resulting in stronger results compared to the Nemenyi test.
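A minimal sketch of the CD computation from Demsar (2006); the critical value q_alpha must be looked up from a Studentized-range-based table for the given number of classifiers and is left as an input here:

```python
import math

def critical_difference(k, N, q_alpha):
    """Nemenyi critical difference: two classifiers perform significantly
    differently when their average ranks differ by at least CD.
    k: number of classifiers, N: number of datasets,
    q_alpha: tabulated critical value for k classifiers at level alpha."""
    return q_alpha * math.sqrt(k * (k + 1) / (6.0 * N))
```

Note that CD shrinks as the number of datasets N grows, so more datasets make it easier to declare a given rank difference significant.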

When comparing a single classifier i with a control classifier c, the null-hypothesis states that the two classifiers i and c perform equally well. Under this hypothesis, the test statistic (based on the average ranks R_i shown in Table 5)

z = (R_i − R_c) / √(k(k+1) / (6N))   (8)

is distributed according to a standard normal distribution. This z-value is used to find the corresponding probability (p-value) p from the cumulative standard normal

24 The larger the amount of comparisons, the higher the proportion of null-hypotheses that are wrongfully rejected due to random chance.


Table 5 Final comparison of all methods in terms of AUC test set performance; see Section 4.2 for details on experimental set-up. β ≠ 0 and βu ≠ 0 for the oversampling respectively undersampling techniques, µ = 100 for each of the boosting variants. Standard deviations are given between brackets and the performance difference with the BL is shown in square brackets.

Mov G(p = 1) Mov G(p = 25) Mov Th(p = []) Yahoo A(p = 1)

BL 716(262)[0] 8141(132)[0] 7977(533)[0] 5592(297)[0]
OSR 7535(227)[38] 8376(209)[23] 8513(61)[54] 6005(271)[41]
SMOTE 7616(227)[46] 837(21)[23] 8567(498)[59] 601(3)[42]
ADASYN 7607(226)[45] 8363(204)[22] 8565(56)[59] 599(299)[4]
RUS 7288(273)[13] 8152(215)[01] 8291(719)[31] 5704(177)[11]
Cl Knn 7143(136)[-02] 8088(119)[-05] 7887(471)[-09] 5578(271)[-01]
Far Knn 719(295)[03] 809(148)[-05] 8407(464)[43] 572(133)[13]
CBU 7417(236)[26] 8151(104)[01] 8276(722)[3] 5877(343)[28]
AB 7165(173)[01] 8452(189)[31] 8243(518)[27] 5835(262)[24]
AC(R = 28) 7161(246)[0] 8346(182)[2] 8327(56)[35] 5772(247)[18]
AC(R = R L) 7465(27)[31] 8335(209)[19] 8541(449)[56] 5947(233)[35]
EE(S = 10) 7604(266)[44] 8505(185)[36] 861(578)[63] 5966(313)[37]
EE(S = 15) 7612(288)[45] 8514(186)[37] 8642(586)[67] 5976(293)[38]

Yahoo A(p = 25) Yahoo G(p = 1) Yahoo G(p = 25) TaFeng(p = 1)

BL 6168(242)[0] 6684(366)[0] 7882(139)[0] 5575(16)[0]
OSR 6459(312)[29] 7308(296)[62] 7852(201)[-03] 6121(224)[55]
SMOTE 6556(333)[39] 7311(312)[63] 7901(121)[02] 6172(181)[6]
ADASYN 6513(338)[34] 7322(317)[64] 7974(168)[09] 6168(186)[59]
RUS 6411(28)[24] 7065(339)[38] 7891(155)[01] 5925(218)[35]
Cl Knn 6114(213)[-05] 6634(354)[-05] 7726(146)[-16] 5577(128)[0]
Far Knn 6396(303)[23] 6697(354)[01] 7826(22)[-06] 5998(126)[42]
CBU 6227(179)[06] 7127(289)[44] 7522(242)[-36] 584(157)[26]
AB 6388(267)[22] 689(203)[21] 7901(166)[02] 5621(179)[05]
AC(R = 28) 6432(356)[26] 6889(311)[2] 7899(189)[02] 5633(183)[06]
AC(R = R L) 6431(303)[26] 7313(28)[63] 7841(2)[-04] 616(226)[59]
EE(S = 10) 6651(324)[48] 7261(315)[58] 8052(16)[17] 612(182)[54]
EE(S = 15) 6636(318)[47] 7348(232)[66] 8054(156)[17] 6113(183)[54]

TaFeng(p = 25) Book(p = 1) Book(p = 25) LST(p = 1)

BL 6694(134)[0] 526(129)[0] 6008(071)[0] 9999(001)[0]
OSR 6877(123)[18] 5587(142)[33] 6462(057)[45] 9999(001)[0]
SMOTE 6847(15)[15] 5507(088)[25] 6296(082)[29] 9999(001)[0]
ADASYN 6848(147)[15] 5504(091)[24] 6302(057)[29] 9999(001)[0]
RUS 6828(139)[13] 5426(092)[17] 6328(08)[32] 9998(001)[0]
Cl Knn 6613(143)[-08] 5269(13)[01] 6002(079)[-01] 9999(001)[0]
Far Knn 6806(141)[11] 5625(152)[37] 6415(112)[41] 9998(001)[0]
CBU 6384(107)[-31] 5375(101)[12] 5468(088)[-54] []
AB 6765(155)[07] 5427(195)[17] 65(067)[49] 9999(001)[0]
AC(R = 28) 6931(123)[24] 5372(1)[11] 6124(08)[12] 9998(001)[0]
AC(R = R L) 6715(151)[02] 5573(122)[31] 646(064)[45] 9999(001)[0]
EE(S = 10) 703(135)[34] 5509(129)[25] 6537(061)[53] 9998(001)[0]
EE(S = 15) 704(13)[35] 5535(126)[28] 654(051)[53] 9998(001)[0]


Table 5 Continued. Additionally, an average rank column is added showing the mean rank of each algorithm across all datasets. Note that we excluded the LST dataset for this purpose.

Adver(p = []) Adver(p = 1) CRF(p = []) Bank(p = [])

BL 9661(182)[0] 9093(302)[0] 6406(1643)[0] 6682(088)[0]
OSR 9693(191)[03] 933(202)[24] 8074(1293)[167] 7139(079)[46]
SMOTE 9705(166)[04] 9335(201)[24] 787(1656)[146] []
ADASYN 9691(195)[03] 9346(221)[25] 7887(1671)[148] []
RUS 9681(187)[02] 9238(251)[15] 8398(599)[199] 6941(119)[26]
Cl Knn 964(148)[-02] 8973(342)[-12] 7663(1619)[126] 6617(072)[-06]
Far Knn 9577(181)[-08] 9388(178)[3] 8375(1311)[197] 6695(056)[01]
CBU 9715(188)[05] 9418(23)[33] [] []
AB 9734(218)[07] 9139(323)[05] 7762(1515)[136] 6682(088)[0]
AC(R = 28) 9744(193)[08] 91(335)[01] 6831(1493)[42] 6767(071)[09]
AC(R = R L) 9746(171)[08] 9351(217)[26] 8508(977)[21] 707(08)[39]
EE(S = 10) 9764(135)[1] 9297(275)[2] 8618(1017)[221] 7146(081)[46]
EE(S = 15) 9763(135)[1] 933(214)[24] 8635(999)[223] 7154(076)[47]

Average Rank

BL 11.600
OSR 5.000
SMOTE 4.533
ADASYN 4.800
RUS 8.167
Cl Knn 12.467
Far Knn 8.133
CBU 8.567
AB 8.267
AC(R = 28) 8.467
AC(R = R L) 5.400
EE(S = 10) 3.267
EE(S = 15) 2.333

Table 6 Nemenyi test outcomes at the α = 0.05 significance level. A value of 1 indicates the null-hypothesis is rejected and thus finds the two algorithms as significantly different (a value of 0 means we accept the null-hypothesis). Methods are listed in the same order as Table 5 and are more concisely represented.

BL RO SM AD RU Cl Fa CBU AB AC1 AC2 EE1 EE2

BL 0 1 1 1 0 0 0 0 0 0 1 1 1
RO 1 0 0 0 0 1 0 0 0 0 0 0 0
SM 1 0 0 0 0 1 0 0 0 0 0 0 0
AD 1 0 0 0 0 1 0 0 0 0 0 0 0
RU 0 0 0 0 0 0 0 0 0 0 0 1 1
Cl 0 1 1 1 0 0 0 0 0 0 1 1 1
Fa 0 0 0 0 0 0 0 0 0 0 0 1 1
CBU 0 0 0 0 0 0 0 0 0 0 0 1 1
AB 0 0 0 0 0 0 0 0 0 0 0 1 1
AC1 0 0 0 0 0 0 0 0 0 0 0 1 1
AC2 1 0 0 0 0 1 0 0 0 0 0 0 0
EE1 1 0 0 0 1 1 1 1 1 1 0 0 0
EE2 1 0 0 0 1 1 1 1 1 1 0 0 0


distribution function: p = 2×min(Φ(z), 1−Φ(z)). Holm's method (Holm 1979) compares k−1 classifiers to a single control classifier and sorts the corresponding p-values in ascending order, so that p1 ≤ p2 ≤ ... ≤ p(k−1). Each pi is subsequently compared to its associated confidence level25 αcomp = α/(k−i). Holm starts with performing the check p1 < α/(k−1) and rejects the null-hypothesis if the test passes. It then proceeds with a similar check for i = 2 and continues until a certain null-hypothesis cannot be rejected. The remaining hypotheses are retained.
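Holm's step-down procedure, with z and p computed per Eq. (8), can be sketched as follows (the rank values used in testing are illustrative, not the paper's):

```python
import math
from scipy.stats import norm

def holm_vs_control(avg_ranks, control, N, alpha=0.05):
    """Holm's step-down test of every method against a control classifier.
    Returns the indices of methods whose null-hypothesis is rejected."""
    k = len(avg_ranks)
    se = math.sqrt(k * (k + 1) / (6.0 * N))          # denominator of Eq. (8)
    pvals = []
    for i, R in enumerate(avg_ranks):
        if i == control:
            continue
        z = (R - avg_ranks[control]) / se
        p = 2 * min(norm.cdf(z), 1 - norm.cdf(z))    # two-sided p-value
        pvals.append((p, i))
    pvals.sort()                                     # ascending p-values
    rejected = []
    for pos, (p, i) in enumerate(pvals, start=1):
        if p < alpha / (k - pos):                    # alpha_comp = alpha/(k-i)
            rejected.append(i)
        else:
            break                                    # stop at first retention
    return rejected
```

Because the comparison thresholds α/(k−i) grow as the procedure steps down the sorted p-values, Holm is uniformly more powerful than a plain Bonferroni correction while still controlling the family-wise error rate.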

Table 7 shows the result of the Holm test at the α = 0.05 significance level with the BL algorithm as control classifier. The table lists each method according to sorted p-values and shows the z-, p- and αcomp-values. The significance column indicates whether the proposed method is significantly different from the BL and coincidentally matches with the result of the Nemenyi test. The p-values of the (Far Knn, RUS, CBU) undersampling methods, AB and AC (R = 28) are only slightly higher than the level required to reject the null-hypothesis. We included a critical significance level αcrit, corresponding to the lowest possible significance level upon which the method would be considered as significantly different from the BL (if α = αcrit then p = αcomp). In the case of AB, for example, if we were to choose a significance level α = 0.0764 (a confidence level of 92.36%), the associated p-value would be smaller than αcomp and we would proceed to conclude26 that AB performs significantly differently from the BL. To summarize: the null-hypothesis of equal performance between the BL and an alternative method proposed in Section 3 (except Cl Knn) is rejected with a confidence level of more than 91% according to Holm's test. Hence, the performance of the BL approach is significantly worse than that of each of the proposed over- and undersampling methods (except Cl Knn), cost-sensitive learning techniques (AC) and boosting variants (AB, EE).

Similar to the previous paragraph, Table 8 presents the results of Holm's test at the α = 0.05 significance level with EE(S = 15) as control classifier. The results coincidentally correspond with the result of the Nemenyi test and indicate that the oversampling (OSR, SMOTE and ADASYN), AC(R = RL) and EE(S = 10) methods are not statistically different from the EE(S = 15) technique at the 0.05 significance level. However, if we were to adapt the significance level to approximately α = 0.25 (a 75% confidence level), the results would conclude significance (except for EE(S = 10)). To summarize: the null-hypothesis of equal performance between EE(S = 15) and an alternative method proposed in Section 3 (except EE(S = 10)) is rejected with a confidence level of more than 75% according to Holm's test. Hence, the performance of the oversampling, undersampling, AB and AC approaches is significantly worse than that of EE(S = 15). We can therefore safely conclude EE to be the preferred method in terms of AUC performance when dealing with imbalanced behaviour data.

25 αcomp adjusts the value of α to compensate for multiple comparisons.
26 This is not entirely true, since Holm's method requires Far Knn to pass the test first before proceeding, which only occurs at the 0.088662 significance level.


Table 7 Holm test at the α = 0.05 significance level with BL reference. The table shows the z test statistic with associated p-value, αcomp = α/(k−i), with k the number of algorithms (k = 13) and i corresponding to the position of the method in the sorted p-value vector. Since we already ranked the p-values in ascending order, i takes on the row number (e.g. OSR has i = 5). The column "significant" denotes if we can reject the null-hypothesis (significant: p < αcomp). αcrit corresponds with the smallest possible significance level where we would decide to reject the null-hypothesis (αcrit = α · p/αcomp).

z p αcomp significant αcrit

EE(S = 15) -6.51642 7.2E-11 0.004167 1 8.64E-10
EE(S = 10) -5.86009 4.63E-09 0.004545 1 5.09E-08
SMOTE -4.96936 6.72E-07 0.005 1 6.72E-06
ADASYN -4.78183 1.74E-06 0.005556 1 1.56E-05
OSR -4.64119 3.46E-06 0.00625 1 2.77E-05
AC(R = RL) -4.35991 1.3E-05 0.007143 1 9.11E-05
Far Knn -2.4378 0.014777 0.008333 0 0.088662
RUS -2.41436 0.015763 0.01 0 0.078815
AB -2.34404 0.019076 0.0125 0 0.076305
AC(R = 28) -2.20339 0.027567 0.016667 0 0.082701
CBU -2.13307 0.032919 0.025 0 0.065837
Cl Knn 0.609449 0.542227 0.05 0 0.542227

Table 8 Holm test at the α = 0.05 significance level with EE(S = 15) reference.

z p αcomp significant αcrit

Cl Knn 7.12587 1.03E-12 0.004167 1 1.24E-11
BL 6.516421 7.2E-11 0.004545 1 7.92E-10
CBU 4.383348 1.17E-05 0.005 1 0.000117
AC(R = 28) 4.313027 1.61E-05 0.005556 1 0.000145
AB 4.172384 3.01E-05 0.00625 1 0.000241
RUS 4.102063 4.09E-05 0.007143 1 0.000287
Far Knn 4.078623 4.53E-05 0.008333 1 0.000272
AC(R = RL) 2.156513 0.031044 0.01 0 0.155218
OSR 1.875229 0.060761 0.0125 0 0.243045
ADASYN 1.734587 0.082814 0.016667 0 0.248442
SMOTE 1.547064 0.121848 0.025 0 0.243696
EE(S = 10) 0.65633 0.511612 0.05 0 0.511612

4.6.2 Timing-wise comparison

The performance of each of the proposed algorithms in terms of training times depends highly upon the chosen parameters. The β and βu parameters control the number of training instances and are an important contribution to the subsequent SVM learning times for each of the over- and undersampling techniques. The number of boosting iterations t ∈ [0, T] is also a major factor in each of the boosting variants proposed in Section 3.3. Furthermore, the chosen regularization parameter C highly influences the training times of the SVM inducer: higher C-values result in larger training times (Fan et al 2008). Not only do the chosen hyperparameters affect learning times; the data characteristics, such as the number of minority instances, the imbalance level, the number of features, the amount of overlapping, etc., have a major effect.


In comparing each of the methods outlined in Section 3, we make use of a similar methodology as previously presented (see Section 4.6.1). For each of the ten folds, a suitable parameter combination is selected based on optimal validation set AUC-performance. These chosen parameters are applied in determining the timings of each method. The final results indicate the average timings across ten folds. Hence, we show the required amount of time for each method when using the parameter combination resulting in the best performance. Note that this might, for instance, result in using β = 13 in case of OSR, or using only t = 5 boosting iterations in EE. Computational timings are shown in Table 9.

As one can observe from the aforementioned table, CBU is the slowest method and relies on computationally intensive clustering procedures. The synthetic SMOTE and ADASYN oversampling approaches are very time consuming methods. They both rely on computationally expensive nearest neighbour computations, in conjunction with a synthetic sample generation procedure that requires a single pass over all minority training instances.

The RUS method shows, as expected, superior results in terms of computational timings. Other undersampling methods perform quite well, though they are slower than BL and OSR due to their inherent nearest neighbour computations.

The boosting variants seem to perform worse than the BL, OSR and each of the undersampling methods. Since EE was the best method in terms of AUC-performance (see Section 4.6.1), we focus on this learner in this paragraph. For the medium sized datasets (all but CRF and Bank), OSR outperforms EE(S = 15). Yet for the large datasets (CRF and Bank), the opposite is true. A SVM is known to have a quadratic computational complexity in the number of training instances (Suykens et al 2002; García and Lozano 2007). Because of this reason, we consider OSR (or any other oversampling technique) as inappropriate for large datasets. Since with EE each subset has a size limited to twice the minority class training size, we expect increasing benefits of the EE-technique for larger and more imbalanced behaviour data. As Junque de Fortuny et al (2014a) have observed, larger behaviour data in terms of the number of instances or features also contributes to an increase in predictive power. Hence, we expect to see this type of big behaviour data more often in real-life applications, where computational resources can become a major constraint. Because the EE-technique is easily parallelizable by simultaneously performing the boosting process for each of the independent subsets, we added an extra row in Table 9: EE par corresponds with the time needed for EE(S = 15) divided by 15, indicating the time that would be required for EE if each subset were trained in parallel. We can see that EE par can outperform OSR in 8 out of 16 cases, precisely for the datasets that show a high imbalance level (e.g. p = 1 or p = []).
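A minimal sketch of this scheme, assuming scikit-learn's LinearSVC as the base learner and a single SVM per balanced subset instead of the paper's full boosting chain; since the S subsets are independent, the loop is trivially parallelizable:

```python
import numpy as np
from sklearn.svm import LinearSVC

def ee_scores(X, y, X_test, S=15, C=1.0, seed=0):
    """EasyEnsemble-style scorer: S independent balanced subsets, a linear
    SVM trained on each, averaged decision values. The paper additionally
    boosts within each subset; one SVM per subset keeps the sketch short."""
    rng = np.random.default_rng(seed)
    min_idx = np.flatnonzero(y == 1)          # minority class indices
    maj_idx = np.flatnonzero(y == 0)          # majority class indices
    scores = np.zeros(len(X_test))
    for _ in range(S):                        # each iteration is independent
        sub = rng.choice(maj_idx, size=len(min_idx), replace=False)
        idx = np.concatenate([min_idx, sub])  # subset of size 2 * |minority|
        clf = LinearSVC(C=C).fit(X[idx], y[idx])
        scores += clf.decision_function(X_test)
    return scores / S
```

Each subset holds only twice the minority class size, which is what keeps the per-learner SVM training cheap even when the majority class is huge.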

We conclude our comparison of the proposed methods in Figure 5, which shows the average ranks in terms of computational timings (see Table 9) and AUC performance for each of the available data sources. Regarding the ranks with respect to the latter, we added an extra row EE par, having the same AUC-performance as EE(S = 15) for each dataset in Table 5. Ranks are subsequently determined according to the procedure outlined in Section 4.6.1. Additionally, Figure 18 in Appendix D shows the result for the large and highly imbalanced CRF and Bank datasets. We expect the latter type of data to be of major interest for reasons already indicated.


Table 9 Computational timings (in seconds) averaged across ten folds, using the parameter combination resulting in the highest validation set AUC-performance. First and second best performances are emphasized in boldface (third and fourth best are underlined). EE par is the time required for EE(S = 15) divided by 15.

Mov G(p = 1) Mov G(p = 25) Mov Th(p = []) Yahoo A(p = 1)

BL 0032889 0056697 0558563 0026922
OSR 0055043 0062802 099009 0044421
SMOTE 0218821 0937057 3841482 0057726
ADASYN 0284688 1802399 5191265 0087694
RUS 0011431 0025383 0155224 0007991
CL Knn 0046599 0599846 0989914 0037182
Far Knn 0039887 080072 0683023 0027788
CBU 1034111 1060173 6822839 1692477
AB 0169792 0841443 3460246 0139251
AC(R = 28) 0471994 2996585 1086907 0366555
AC(R = RL) 053376 1179542 6065177 0209015
EE(S = 10) 0117226 6065145 117995 0148973
EE(S = 15) 020474 7173737 2119991 0180365

EE par 0013649 0478249 0141333 0012024

Yahoo A(p = 25) Yahoo G(p = 1) Yahoo G(p = 25) TaFeng(p = 1)

BL 0092954 0011915 0044164 0026728
OSR 0027887 0013241 0047206 0040919
SMOTE 1062686 0056153 0883698 0219553
ADASYN 2050993 0079073 1733367 0306618
RUS 0048471 0003234 0033423 0002916
CL Knn 084391 0025404 0502515 0092167
Far Knn 0664124 0026576 0500206 0080159
CBU 1569442 1287221 1355035 2467279
AB 0445546 0078777 0169977 0114619
AC(R = 28) 1034044 0321723 0515953 0926178
AC(R = RL) 0706215 0226741 0112949 0610233
EE(S = 10) 1026577 0100331 1527146 0058052
EE(S = 15) 1607596 0077483 2472582 010538

EE par 0107173 0005166 0164839 0007025

TaFeng(p = 25) Book(p = 1) Book(p = 25) LST(p = 1)

BL 0032033 0080035 0318093 0652045
OSR 0032414 0132927 0092757 087152
SMOTE 5089283 3409418 1143444 4987705
ADASYN 8148419 3689661 1225441 6840083
RUS 0020457 0022713 0031972 0432839
CL Knn 1713731 0400873 3711648 2508374
Far Knn 1539437 0379086 3988552 2511037
CBU 2642686 4198663 4631987 []
AB 0713265 061719 1238585 2466151
AC(R = 28) 1234647 1666131 2330635 1451671
AC(R = RL) 0279047 0860346 0197053 123763
EE(S = 10) 2484502 2145747 7177484 0524066
EE(S = 15) 3363971 2480066 1121945 0784111

EE par 0224265 0165338 0747963 0052274


Table 9 Continued. Additionally, an extra column indicating average ranks is included (lower ranks are preferred).

Adver(p = []) Adver(p = 1) CRF(p = []) Bank(p = [])

BL 0010953 0002796 0725911 7089334
OSR 0012178 0006166 3685813 1797481
SMOTE 0123112 0017764 5633862 []
ADASYN 0183767 0021728 5768669 []
RUS 0012115 000204 0147392 5247441
CL Knn 0061324 0005568 1106755 7373282
Far Knn 0079078 0007069 1110379 9759619
CBU 3378235 3236754 [] []
AB 0069199 0103518 1153196 8308618
AC(R = 28) 0193092 0068905 2047434 7170548
AC(R = RL) 0107652 0037963 1387174 1063466
EE(S = 10) 0138485 0085686 0198656 2495117
EE(S = 15) 0185136 0139121 0285345 3640107

EE par 0012342 0009275 0019023 2426738

Average Rank [pos]

BL 2.94 [2]
OSR 4.19 [4]
SMOTE 9.59 [11]
ADASYN 10.91 [13]
RUS 1.38 [1]
CL Knn 6.5 [5]
Far Knn 6.56 [6]
CBU 14 [14]
AB 8.06 [7]
AC(R = 28) 10.81 [12]
AC(R = RL) 9.25 [9]
EE(S = 10) 8.25 [8]
EE(S = 15) 9.56 [10]

EE par 3 [3]

Timings are a substantial concern with respect to these large datasets. As can be observed, the EE-technique thrives for these big and highly imbalanced data, even in its non-parallel form.


[Figure 5: scatter plot of average rank AUC (horizontal axis) against average rank Time (vertical axis) for BL, OSR, SMOTE, ADASYN, RUS, Cl Knn, Far Knn, CBU, AB, AC(R = 28), AC(R = RL), EE(S = 10), EE(S = 15) and EE par.]

Fig 5 Average rank AUC versus average rank Time (see Table 9) across all datasets from Table 2 (excluding Flickr and Kdd). Regarding the AUC rank, an extra row EE par is added, having the same AUC-value as EE(S = 15). Ranks for each dataset are subsequently obtained via the procedure outlined in Section 4.6.1. Points occurring in the upper-right region are preferred.

5 The effect of the chosen base learner

In this short section we investigate the influence of the chosen base learner. More specifically, we ask ourselves whether some base inducers can more easily cope with the imbalanced learning issue, and in doing so, verify the conclusion that EE is a suitable technique to elevate upon the baseline performance. We rely on several popular classification methods for behaviour data: two regularization based techniques (SVM, LR) and two heuristic types of approaches (Naive Bayes (NB), Behavioural Similarity (BeSim)). The next paragraphs briefly introduce these methods. Note that source codes and dataset results are provided in our on-line repository.

LR constructs a linear model that maximizes the likelihood of the observed data by imposing a logistic model for the parametric probability model P(Y|X). Ng and Jordan (2002) provide a thorough introduction on the topic27 and note that in its plain form LR can suffer from overfitting, especially when the input data is very high dimensional. Therefore, we resort to regularized logistic regression. We opt for L2-regularized LR, since L1-regularization corresponds with a natural way of imposing feature selection (Ng 2004). The LIBLINEAR toolbox (Fan et al 2008) is used to obtain these L2-regularized LR models.
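A minimal sketch, assuming scikit-learn's liblinear-backed LogisticRegression on a toy sparse matrix (the data and labels are made up for illustration):

```python
from scipy.sparse import csr_matrix
from sklearn.linear_model import LogisticRegression

# Toy sparse behaviour matrix: rows are users, columns are binary features.
X = csr_matrix([[1, 0, 1, 0],
                [0, 1, 0, 1],
                [1, 0, 0, 0],
                [0, 1, 1, 1]])
y = [1, 0, 1, 0]

# solver='liblinear' is backed by the same LIBLINEAR library the paper uses;
# penalty='l2' gives L2-regularized LR, C is the inverse regularization strength.
clf = LogisticRegression(penalty='l2', C=1.0, solver='liblinear').fit(X, y)
probs = clf.predict_proba(X)[:, 1]   # scores usable for AUC computation
```

The sparse csr_matrix input is important for behaviour data: the model is fit without ever densifying the (typically huge) feature matrix.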

27 For an accessible introduction, see the chapter on "Generative and Discriminative Classifiers: Naive Bayes and Logistic Regression" provided at the following link: http://www.cs.cmu.edu/~tom/NewChapters.html


NB (Ng and Jordan 2002) relies on the use of the Bayes rule, which expresses the posterior P(Y|X) as the product of the likelihood P(X|Y) and the prior P(Y), divided by the evidence P(X). The naivety of the approach originates from assuming conditional independence of the features. We impose a multivariate event model regarding the distribution P(X|Y). Junque de Fortuny et al (2014a) provide an efficient NB implementation for dealing with large and sparse binary behaviour data.
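A sketch of the multivariate (Bernoulli) event model using scikit-learn's BernoulliNB on made-up binary data; the paper relies on its own sparse-optimized implementation, so this is illustrative only:

```python
import numpy as np
from sklearn.naive_bayes import BernoulliNB

# Toy binary behaviour data: feature 0 signals class 1, feature 1 signals
# class 0, feature 2 is uninformative.
X = np.array([[1, 0, 1],
              [1, 0, 0],
              [0, 1, 1],
              [0, 1, 0]])
y = np.array([1, 1, 0, 0])

# BernoulliNB implements the multivariate event model: both the presence
# and the absence of each binary feature contribute to P(X|Y).
nb = BernoulliNB().fit(X, y)
scores = nb.predict_proba(X)[:, 1]
```

The conditional-independence assumption makes this a deliberately weak learner, which is precisely what makes it a natural candidate for the boosting process discussed below.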

The BeSim approach conceptually corresponds to projecting the bigraph to a unigraph using the procedure outlined in Section 3.2. Next, the weighted-vote relational neighbour classifier (Macskassy and Provost 2007) is used to infer unknown labels/scores through a weighted probability estimation using the known labels of neighbouring nodes. A very fast and scalable implementation of this approach (the SW-transformation) is presented in detail in Stankova et al (2015).
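Conceptually, BeSim can be sketched on a toy bigraph as follows; a dense co-occurrence projection stands in for the efficient SW-transformation, and the graph and labels are made up:

```python
import numpy as np

# Toy bigraph: rows = users, columns = behaviours (e.g. pages visited).
B = np.array([[1, 1, 0],
              [1, 0, 0],
              [0, 1, 1],
              [0, 0, 1]], dtype=float)
known = {0: 1.0, 3: 0.0}          # labels of the two labelled users

# Unigraph projection: two users become neighbours with a weight equal to
# the number of behaviours they share (a stand-in for the Section 3.2 projection).
W = B @ B.T
np.fill_diagonal(W, 0.0)          # no self-links

def wvrn_score(u):
    """Weighted-vote relational neighbour: weighted mean of known labels."""
    den = sum(W[u, v] for v in known)
    if den == 0:
        return 0.5                # no labelled neighbours: fall back to prior
    return sum(W[u, v] * y for v, y in known.items()) / den
```

User 1 shares a behaviour only with the positively labelled user 0 and therefore gets score 1.0, while user 2 is equally connected to both labelled users and gets 0.5.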

In Table 10 we compare the performances of the baseline (BL) application of the methods with their EE-version (with S = 15, T = 20). It is quite striking to see the EE-version dominating upon the baseline version when looking at each method individually. This suggests that the imbalanced learning issue has an impact for each of the aforementioned techniques. Focussing on the regularization based approaches, we observe the baseline LR to be less influenced by the imbalance than the SVM method. Their EE-versions (EE SVM and EE LR) can boost the baseline performance in almost all cases and have the highest average ranks amongst all the proposed methods. As already emphasized, the regularization based approaches offer an added element of flexibility, in the sense that the strength of the underlying base learner can be controlled.

The heuristic approaches NB and BeSim do not offer this flexibility. Furthermore, it is not directly possible to integrate instance weights during the training phase of the base learner. During the boosting process, we therefore sample from the distribution Dt to generate (unweighted) examples and train a BeSim/NB learner next. The results in Table 10 indicate BL NB to be the worst performing algorithm amongst all methods tried. Due to the conditional independence assumption, one could regard this as a weak learner, which is therefore suitable to be used in a boosting process. Indeed, the EE NB approach is able to significantly improve upon its performance. We found this to be the slowest EE-version28 (e.g. taking more than 24 hours to process a single fold of the Bank dataset, versus several minutes for the other versions). The average rank estimate of EE NB is rather pessimistic, since we assigned the lowest performance across all methods in case of missing results. BeSim can be considered as a stronger learner compared to NB. Its possible boost in performance is typically lower than that of NB. Having said this, we found EE BeSim to improve upon BL BeSim for all datasets (except Adver(p = 1)).
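The weighted-resampling step can be sketched as follows (sample_from_Dt is a hypothetical helper written for illustration, not the paper's code):

```python
import numpy as np

def sample_from_Dt(weights, n_samples, rng):
    """Draw an unweighted training sample according to the boosting
    distribution D_t, so that weight-agnostic base learners (NB, BeSim)
    can be trained on plain, unweighted examples."""
    p = np.asarray(weights, dtype=float)
    p /= p.sum()                                 # normalize D_t to a distribution
    return rng.choice(len(p), size=n_samples, replace=True, p=p)
```

Instances with high boosting weight are drawn more often, so the resampled training set approximates training the base learner directly under D_t.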

28 This statement only relates to the EE-version. In the baseline case, the timings of the NB-version are roughly of the same order of magnitude as SVM and LR (though this heavily depends on the chosen regularization parameter C). With EE, each subset has a size limited to twice the minority class training size (a relatively low number compared to the size of the majority class). This works in favour of SVM/LR (e.g. SVM has a computational complexity of O(m²), with m the number of training instances). Also note that the NB-version is indeed optimized for sparse data. Internally, the NB-implementation writes a fairly large model file to a specified directory (sizes of 1 GB can be encountered there). This writing operation slows down the training stage (especially in case of EE, where S × T = 15 × 20 = 300 model files are constructed).


Table 10 Average tenfold AUC test set performance of the BL compared to its EE-version (S = 15, T = 20) for SVM, LR, BeSim and NB. In case the EE NB implementation crashed, we put a × in the table. If it is too slow (> 24 hours per fold), we place a []. Additionally, average ranks are assigned to each method. In case EE NB has missing results for a particular dataset, we ensure it has the lowest rank across all other algorithms.

Mov G(p = 1) Mov G(p = 25) Mov Th(p = []) Yahoo A(p = 1)

BL SVM 716 (262) 8141 (132) 7977 (533) 5649 (337)
EE SVM 7612 (288) 8513 (186) 8643 (586) 5974 (296)
BL LR 7102 (209) 8439 (184) 8314 (417) 5784 (239)
EE LR 7669 (292) 8503 (198) 863 (537) 5979 (262)
BL BeSim 761 (358) 813 (292) 8281 (66) 5627 (273)
EE BeSim 7631 (371) 8137 (29) 8502 (628) 577 (171)
BL NB 7026 (584) 7701 (254) 7048 (1014) 5256 (209)
EE NB 7593 (283) 8556 (201) 8691 (415) 5755 (273)

Yahoo A(p = 25) Yahoo G(p = 1) Yahoo G(p = 25) TaFeng(p = 1)

BL SVM 6161 (248) 6684 (366) 7882 (139) 5575 (16)
EE SVM 6638 (316) 7348 (232) 8055 (155) 6113 (183)
BL LR 6627 (296) 6982 (193) 8045 (159) 5891 (231)
EE LR 6622 (328) 7308 (214) 8053 (156) 6143 (232)
BL BeSim 6454 (202) 6889 (249) 7955 (196) 5789 (118)
EE BeSim 6525 (223) 7118 (291) 8004 (185) 5936 (147)
BL NB 65 (165) 6333 (256) 7889 (164) 5461 (12)
EE NB 666 (279) 7099 (288) 8101 (13) 5901 (184)

TaFeng(p = 25) Book(p = 1) Book(p = 25) LST(p = 1)

BL SVM 6694 (134) 526 (129) 6008 (071) 9999 (001)
EE SVM 704 (13) 5534 (128) 654 (051) 9998 (001)
BL LR 6924 (13) 5534 (127) 6384 (075) 9999 (001)
EE LR 7028 (128) 5549 (149) 6541 (063) 9997 (002)
BL BeSim 6749 (123) 5519 (127) 637 (063) 9999 (001)
EE BeSim 68 (121) 5521 (115) 6438 (042) 9999 (0)
BL NB 6521 (164) 5293 (09) 5975 (047) 9869 (03)
EE NB 7072 (115) × 6346 (061) 9992 (004)

Adver(p = []) Adver(p = 1) CRF(p = []) Bank(p = [])

BL SVM 9637 (194) 9118 (297) 6436 (1897) 6682 (088)
EE SVM 9763 (135) 933 (214) 8635 (999) 7154 (076)
BL LR 9719 (144) 8851 (193) 8187 (1963) 7143 (072)
EE LR 9757 (096) 9302 (206) 8684 (962) 7177 (062)
BL BeSim 9726 (112) 9538 (135) 8691 (936) 6785 (067)
EE BeSim 9738 (104) 9383 (135) 8702 (1043) 7041 (055)
BL NB 9375 (19) 9337 (19) 8724 (938) 6783 (063)
EE NB 9404 (175) × × []

Flickr(p = 01) Kdd(p = 05) Average Rank

BL SVM 7492 (017) 7453 (005) 6.44 [7]
EE SVM 7986 (013) 8098 (005) 2.39 [1]
BL LR 7903 (011) 8129 (004) 4.28 [4]
EE LR 7985 (013) 8075 (005) 2.61 [2]
BL BeSim 7462 (013) 7495 (0) 5.11 [6]
EE BeSim 764 (013) 7755 (003) 3.61 [3]
BL NB 8136 (01) 7429 (005) 6.5 [8]
EE NB [] [] 5.06 [5]


6 Conclusions and future research directions

We investigated the effects of sampling and cost-sensitive learning techniques on the problem of learning from imbalanced behaviour data. This type of big data has fundamentally different properties and internal structure than traditional data and is characterized by sparseness and very large dimensions. This is a new research topic which enables benefits across a wide variety of application domains, such as fraud detection, targeted advertising, predictive policing, churn prediction, default prediction, etc. To enable future developments in this field, we provide an on-line repository containing implementations, datasets and results, available at the following link: http://www.applieddatamining.com/cms?q=software

Oversampling techniques need to be adapted in order to cope with this kind of data. We proposed new versions of the traditional SMOTE and ADASYN techniques, which differ in the way synthetic samples are generated and similarity computations are performed. The plain random oversampling (OSR) and the synthetic approaches are competitive methods in terms of AUC and do not seem to suffer from overfitting, as traditional studies report. Though the synthetic approaches have comparable AUC-performances in comparison to OSR, they are useless in practice due to their computationally expensive procedures. The OSR method is very fast for small and medium sized datasets, yet for larger datasets it becomes inappropriate because the training times of the underlying base learner increase drastically.
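For reference, the classic SMOTE interpolation step can be sketched as below (a dense sketch on made-up data; the paper's adapted variants change the neighbour search and sample generation to suit sparse behaviour data):

```python
import numpy as np

def smote_samples(X_min, n_new, k=3, rng=None):
    """Classic SMOTE: interpolate between a random minority point and one
    of its k nearest minority-class neighbours."""
    if rng is None:
        rng = np.random.default_rng(0)
    X_new = []
    for _ in range(n_new):
        i = rng.integers(len(X_min))
        d = np.linalg.norm(X_min - X_min[i], axis=1)   # distances to all minority points
        nn = np.argsort(d)[1:k + 1]                    # k nearest, skipping the point itself
        j = rng.choice(nn)
        gap = rng.random()                             # random position on the segment
        X_new.append(X_min[i] + gap * (X_min[j] - X_min[i]))
    return np.array(X_new)
```

Every synthetic point lies on a segment between two existing minority points, so the generated samples stay inside the minority class region rather than duplicating instances as OSR does.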

With respect to undersampling methodologies, we employed a random undersampling technique (RUS) and Knn-based informed approaches. The latter type selects majority class instances that are closest (Cl Knn) or farthest (Far Knn) from the minority class examples, and allows one to investigate the effect of noise/outliers on SVM performance. We observed that these odd instances have a severe performance degrading effect. RUS and Far Knn approaches can significantly improve performance with respect to the baseline, thanks to their abilities to reduce noise/outlier levels and to put more emphasis on the minority class. Having said this, they usually perform worse than any of the oversampling approaches. This is in sheer contrast to the results obtained from traditional studies and is heavily related to the intrinsic nature of behaviour data. Timing-wise, RUS is the fastest method. The informed approaches are slower than RUS due to the required nearest neighbour computations. Additionally, a cluster-based undersampling (CBU) technique was proposed to alleviate the within-class imbalance. For low undersampling rates the performance of CBU is superior to plain RUS, whereas for high undersampling rates the opposite is true. Overall, no substantial gain in performance is obtained in comparison with a simple RUS approach. Furthermore, the method is the slowest among all techniques considered and relies on computationally expensive clustering heuristics.
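A minimal sketch of the Knn-style informed selection described above (not the paper's exact procedure): each majority instance is scored by its maximum cosine similarity to the minority class, and the closest or farthest instances are retained. Sparse behaviour vectors are represented as `{feature: value}` dicts; `keep` and `mode` are illustrative parameter names.

```python
import math

def cosine(a, b):
    """Cosine similarity of two sparse vectors given as dicts."""
    dot = sum(v * b.get(k, 0.0) for k, v in a.items())
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

def informed_undersample(majority, minority, keep, mode="closest"):
    """Keep the `keep` majority instances closest to (or farthest
    from) the minority class, scored by maximum similarity to any
    minority instance."""
    scored = [(max(cosine(x, m) for m in minority), x) for x in majority]
    scored.sort(key=lambda t: t[0], reverse=(mode == "closest"))
    return [x for _, x in scored[:keep]]
```

Discarding the farthest instances (mode `"closest"`) tends to remove the odd, noise/outlier-like majority behaviours whose performance-degrading effect is reported above.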

We studied the effects of boosting, where we made use of a very specific linear SVM and logistic regression combination that uses confidence rated predictions, instead of a more popular plain "discrete" boosting algorithm (with weak hypotheses having values in {−1, 1}). Our experiments clearly indicated that the regularization constant C in the SVM formulation acts as a "weakness" indicator. Indeed, higher C-values cause stronger learners and shouldn't be used in the boosting process, which matches the analysis of studies dealing with traditional data. Strong learners can

42 Jellis Vanhoeyveld David Martens

already quite accurately distinguish majority class from minority class behaviours. The boosting process will focus on the hard to learn instances, among which are noise/outliers, and this can result in overfitting behaviour. We also observed that boosting can usually outperform the plain baseline, indicating that a combination of weak learners can achieve higher AUC-performance than a single strong learner.
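The confidence-rated reweighting at the heart of this process can be sketched as follows (a generic Schapire–Singer-style update, not the paper's exact algorithm): a real-valued weak hypothesis shifts weight toward the instances it gets wrong, which is why a strong learner (high C) leaves little for later rounds to correct.

```python
import math

def boost_round(weights, labels, scores):
    """One confidence-rated boosting update: w_i <- w_i * exp(-y_i * h(x_i))
    with y_i in {-1, +1} and h real-valued, then renormalise.
    Misclassified (hard) instances gain weight."""
    new = [w * math.exp(-y * s) for w, y, s in zip(weights, labels, scores)]
    z = sum(new)
    return [w / z for w in new]
```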

EasyEnsemble (EE) integrates random undersampling with a boosting process. By sampling several balanced subsets from the training data and combining these, we can emphasize the minority class, achieve noise/outlier reduction effects and simultaneously explore the majority class space efficiently. Our version differs from the original formulation because we employ a different type of underlying weak learner (SVM and LR combination) that is fed to a confidence rated boosting algorithm (instead of the discrete version). We observed superior AUC-performance with respect to all of the aforementioned techniques. Furthermore, the method is very fast, since each subset can be trained in parallel and has a size that is twice as large as the minority class training set. We studied the effect of the number of subsets and noted that a limited amount of these (O(10)) is already sufficient to obtain satisfactory performances; increasing this number further yields only minor benefits.
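The EE procedure above can be sketched as follows; `train` is a hypothetical callback standing in for the boosted SVM/LR learner used in the paper, and the final scores are combined by simple averaging.

```python
import random

def easy_ensemble(X_min, X_maj, train, S=10, seed=0):
    """Draw S random majority subsets of minority size, train one
    model per balanced subset via the `train` callback (which must
    return a scoring function), and score by averaging."""
    rng = random.Random(seed)
    models = []
    for _ in range(S):
        subset = rng.sample(X_maj, len(X_min))  # |subset| == |X_min|
        models.append(train(X_min, subset))
    return lambda x: sum(m(x) for m in models) / len(models)
```

Since the S training runs are independent, they can be executed in parallel, which is what makes the method fast despite training multiple models.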

Additionally, a statistical comparison of each of the proposed methods was conducted across a wide variety of datasets. Holm's significance test indicates that each of the proposed techniques (oversampling, undersampling except Cl Knn, cost-sensitive learning and boosting variants) can significantly outperform the plain baseline performance (each of the null-hypotheses of equal performance gets rejected with a confidence level of at least 91%). Comparison of the techniques with EE showed that the latter can significantly outperform all of them (each of the null-hypotheses of equal performance gets rejected with a confidence level of at least 75%). A timing-wise comparison showed EE to be the preferred method (after RUS) for large and highly imbalanced datasets. A parallel implementation would make EE the fastest method, even for medium sized datasets that show a high level of imbalance.
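Holm's step-down procedure used for these comparisons can be sketched as below (standard formulation, with the significance level α and the family of p-values as inputs):

```python
def holm(pvalues, alpha=0.1):
    """Holm's step-down procedure: sort p-values ascending and
    reject H0_(i) while p_(i) <= alpha / (k - i); once one test
    fails to reject, all remaining hypotheses are retained."""
    order = sorted(range(len(pvalues)), key=lambda i: pvalues[i])
    k = len(pvalues)
    reject = [False] * k
    for rank, i in enumerate(order):
        if pvalues[i] <= alpha / (k - rank):
            reject[i] = True
        else:
            break
    return reject
```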

The influence of the chosen base learner was investigated by comparing the baseline to its EE-version for the SVM, LR, NB and BeSim inducers. Each of the EE-versions is able to improve upon the baseline performance for nearly all datasets, where base learners of the regularization type are superior because of their ability to control the strength of the learner.

Though we conducted an extensive benchmark study, there are still plenty of future research directions. First, and most important, there is an urgent need to develop a more general theoretical framework to completely understand the effect of imbalance and why certain methods outperform others. Secondly, other types of mechanisms could be investigated in this setting: advanced sampling (Guo et al 2008) set-ups perform sampling based on preliminary classifications; one-class learning is a recognition based methodology where a discriminative boundary is learned around the training examples of a single class alone, and Raskutti and Kowalczyk (2004) demonstrated the superiority of the recognition based one-class SVMs over the discriminative two-class SVMs; other types of ensemble learning strategies (e.g. bagging and stacking) could be investigated; Batista et al (2004) note that combining over- and undersampling can be beneficial when data are highly imbalanced or show very few minority class instances; the ideas behind Extended Nearest Neighbour (Tang and


He 2015) could guide us in the sampling process. As opposed to Knn, "it considers not only who are the nearest neighbours of a certain sample, but also who consider the particular sample as their nearest neighbour" (Tang and He 2015). The method assigns a class membership to an unknown sample that maximizes the intra-class coherence. Finally, more advanced techniques exist that can determine K (the number of nearest neighbours) faster or with (slightly) better performance.

Another issue relates to the lack of comprehensibility of the EasyEnsemble technique in its present form. Because we currently integrate Logistic Regression in the boosting process, the final ensemble classifier is non-linear and therefore incomprehensible. It might be interesting to investigate the numerical boosting technique proposed in Schapire and Singer (1999), where each weak learner is allowed to output real values (instead of values in [−1, 1]). In that case, we would be able to use a plain linear SVM as a weak learner, without the Logistic Regression component. The final ensemble would be a linear model, which is surely easier to comprehend. Also, the instance-based approach of Martens and Provost (2014) could be used to explain individual predictions of the non-linear EE-technique.

A Oversampling experiments

Table 11 Oversampling experiments. Results show the average tenfold AUC test set performance with respect to increasing β-values (β = [β1, β2, β3, β4] = [0, 1/3, 2/3, 1]). Additionally, standard deviations are shown between brackets. Optimal performances for each method are highlighted in boldface. See Section 4.2 for parameter settings.

Mov G(p = 1)
        β1            β2            β3            β4
OSR     71.6 (2.62)   74.37 (2.04)  73.6 (1.84)   74.73 (2.45)
SMOTE   71.6 (2.62)   75.08 (2.18)  76.02 (2.14)  76.48 (2.3)
ADASYN  71.6 (2.62)   75.16 (1.92)  75.93 (2.08)  76.47 (2.29)

Mov G(p = 25)
        β1            β2            β3            β4
OSR     81.41 (1.32)  83.49 (1.81)  83.84 (1.96)  83.91 (2.04)
SMOTE   81.41 (1.32)  83.32 (1.97)  83.59 (2.04)  83.76 (2.11)
ADASYN  81.41 (1.32)  83.61 (1.82)  84.02 (1.97)  83.69 (1.96)

Mov Th(p = [])
        β1            β2            β3            β4
OSR     79.77 (5.33)  85.3 (4.66)   83.16 (4.5)   84.59 (5.69)
SMOTE   79.77 (5.33)  84.18 (6.51)  85.58 (5.97)  84.33 (5.75)
ADASYN  79.77 (5.33)  84.11 (6.77)  85.86 (6.06)  85.36 (5.13)

Yahoo A(p = 1)
        β1            β2            β3            β4
OSR     55.92 (2.97)  58.66 (3.27)  59.99 (2.28)  59.74 (1.78)
SMOTE   55.92 (2.97)  59.76 (2.62)  59.74 (2.67)  59.43 (2.4)
ADASYN  55.92 (2.97)  59.54 (2.53)  59.55 (2.94)  59.56 (2.22)

Yahoo A(p = 25)
        β1            β2            β3            β4
OSR     61.68 (2.42)  64.19 (3.17)  65.08 (3.26)  64.67 (2.1)
SMOTE   61.68 (2.42)  65.46 (3.63)  65.33 (3.23)  64.52 (2.98)
ADASYN  61.68 (2.42)  65.04 (3.74)  65.41 (3.47)  64.4 (2.21)

Yahoo G(p = 1)
        β1            β2            β3            β4
OSR     66.84 (3.66)  72.18 (2.36)  73.11 (2.7)   72.49 (3.41)
SMOTE   66.84 (3.66)  72.65 (2.85)  73.27 (3.36)  73.37 (3.56)
ADASYN  66.84 (3.66)  72.87 (2.83)  73.18 (3.2)   73.39 (3.59)

Yahoo G(p = 25)
        β1            β2            β3            β4
OSR     78.82 (1.39)  78.78 (2.02)  78.74 (1.86)  78.35 (1.47)
SMOTE   78.82 (1.39)  79.23 (1.57)  79.1 (1.2)    79.03 (1.89)
ADASYN  78.82 (1.39)  79.12 (1.43)  79.23 (1.37)  79.51 (2.01)

TaFeng(p = 1)
        β1            β2            β3            β4
OSR     55.75 (1.6)   59.23 (1.96)  60.0 (1.68)   61.04 (2.36)
SMOTE   55.75 (1.6)   60.26 (1.95)  61.49 (1.8)   61.13 (1.52)
ADASYN  55.75 (1.6)   60.26 (1.9)   61.44 (1.85)  61.16 (1.5)

TaFeng(p = 25)
        β1            β2            β3            β4
OSR     66.94 (1.34)  68.84 (1.21)  66.99 (1.42)  67.7 (1.41)
SMOTE   66.94 (1.34)  68.47 (1.5)   67.07 (1.15)  66.65 (0.81)
ADASYN  66.94 (1.34)  68.62 (1.38)  67.85 (1.6)   66.91 (1.39)

Book(p = 1)
        β1            β2            β3            β4
OSR     52.6 (1.29)   53.61 (0.94)  55.41 (1.75)  55.87 (1.44)
SMOTE   52.6 (1.29)   54.77 (0.99)  54.91 (0.8)   54.36 (0.98)
ADASYN  52.6 (1.29)   54.86 (1.13)  55.06 (0.73)  54.54 (0.92)

Book(p = 25)
        β1            β2            β3            β4
OSR     60.08 (0.71)  61.05 (0.94)  61.82 (1.12)  64.62 (0.57)
SMOTE   60.08 (0.71)  62.6 (0.73)   60.95 (0.68)  63.0 (0.8)
ADASYN  60.08 (0.71)  62.33 (0.73)  60.77 (0.85)  63.04 (0.58)

LST(p = 1)
        β1            β2            β3            β4
OSR     99.99 (0.01)  99.99 (0.01)  99.99 (0.01)  99.99 (0.01)
SMOTE   99.99 (0.01)  99.99 (0.01)  99.99 (0.01)  99.99 (0.01)
ADASYN  99.99 (0.01)  99.99 (0.01)  99.99 (0.01)  99.99 (0.01)

Adver(p = [])
        β1            β2            β3            β4
OSR     96.61 (1.82)  97.31 (1.65)  97.07 (1.84)  97.07 (1.79)
SMOTE   96.61 (1.82)  96.91 (1.66)  97.19 (1.65)  97.07 (1.91)
ADASYN  96.61 (1.82)  97.1 (1.7)    97.08 (1.87)  97.07 (1.88)

Adver(p = 1)
        β1            β2            β3            β4
OSR     90.93 (3.02)  91.27 (3.03)  92.66 (2.82)  93.29 (1.97)
SMOTE   90.93 (3.02)  92.51 (2.03)  92.96 (2.14)  93.53 (1.81)
ADASYN  90.93 (3.02)  92.22 (2.33)  92.7 (2.36)   93.88 (1.73)

CRF(p = [])
        β1             β2             β3             β4
OSR     64.06 (16.43)  80.82 (12.94)  81.28 (12.27)  81.91 (11.28)
SMOTE   64.06 (16.43)  78.64 (16.86)  82.52 (13.74)  79.32 (16.26)
ADASYN  64.06 (16.43)  78.95 (16.72)  81.19 (16.32)  79.31 (16.19)

Bank(p = [])
        β1            β2            β3            β4
OSR     66.82 (0.88)  70.1 (0.74)   71.39 (0.8)   71.47 (0.8)
SMOTE
ADASYN


B Undersampling experiments

Table 12 Undersampling experiments. Results show the average tenfold AUC test set performance with respect to increasing βu-values (βu = [βu1, βu2, βu3, βu4, βu5] = [0, 1/4, 1/2, 3/4, 1]). Optimal performances (boldface) are highlighted for each imbalance ratio. Additionally, standard deviations are shown between brackets. Cl K represents the "Closest Knn" technique and Cl T the "Closest tot sim" technique (similarly for Far K and Far T; see Section 3.2). Parameter settings are given in Section 4.2.

Mov G(p = 1)
        βu1          βu2          βu3          βu4          βu5
RUS     71.6 (2.6)   71.83 (2.6)  72.54 (2.5)  72.39 (3.1)  70.61 (3.5)
Cl K    71.6 (2.6)   71.4 (2.0)   70.96 (1.9)  70.43 (2.4)  69.05 (4.1)
Cl T    71.6 (2.6)   70.28 (2.5)  66.74 (2.0)  66.8 (2.1)   68.18 (3.6)
Far K   71.6 (2.6)   72.36 (2.7)  71.26 (3.4)  66.57 (5.2)  53.5 (3.5)
Far T   71.6 (2.6)   72.22 (2.8)  71.63 (3.6)  64.28 (5.3)  50.88 (4.4)
CBU     72.55 (2.6)  73.28 (2.6)  73.12 (2.6)  73.84 (2.5)  73.0 (3.1)

Mov G(p = 25)
        βu1          βu2          βu3          βu4          βu5
RUS     81.41 (1.3)  81.36 (1.3)  81.78 (1.7)  82.05 (1.7)  81.6 (2.1)
Cl K    81.41 (1.3)  80.86 (1.2)  80.95 (1.6)  79.73 (2.3)  77.95 (2.3)
Cl T    81.41 (1.3)  79.9 (1.2)   78.21 (1.4)  77.87 (1.5)  77.76 (2.3)
Far K   81.41 (1.3)  80.9 (1.5)   78.17 (1.8)  74.25 (2.4)  69.79 (3.2)
Far T   81.41 (1.3)  80.86 (1.5)  77.2 (2.4)   71.16 (2.7)  62.4 (2.8)
CBU     81.53 (1.4)  81.64 (1.3)  81.29 (1.6)  81.28 (2.1)  80.34 (2.7)

Mov Th(p = [])
        βu1          βu2          βu3          βu4          βu5
RUS     79.77 (5.3)  80.32 (5.8)  81.57 (5.5)  81.86 (6.6)  81.26 (6.2)
Cl K    79.77 (5.3)  79.25 (4.5)  78.07 (5.0)  76.25 (6.5)  62.46 (8.5)
Cl T    79.77 (5.3)  78.4 (4.4)   72.41 (3.5)  64.66 (4.5)  60.37 (7.3)
Far K   79.77 (5.3)  84.54 (5.0)  83.64 (6.4)  80.02 (7.3)  56.82 (10.3)
Far T   79.77 (5.3)  85.03 (5.7)  82.68 (6.8)  75.61 (9.2)  56.77 (10.9)
CBU     80.11 (5.8)  81.17 (6.0)  81.08 (6.5)  84.17 (5.1)  80.96 (6.9)

Yahoo A(p = 1)
        βu1          βu2          βu3          βu4          βu5
RUS     55.92 (3.0)  55.57 (3.4)  56.44 (3.0)  55.83 (3.4)  56.37 (3.3)
Cl K    55.92 (3.0)  55.67 (2.4)  53.12 (2.0)  50.57 (1.8)  53.79 (3.5)
Cl T    55.92 (3.0)  55.69 (2.1)  53.35 (2.2)  50.31 (2.2)  52.35 (3.3)
Far K   55.92 (3.0)  57.35 (2.2)  56.92 (1.1)  56.95 (2.3)  51.18 (2.0)
Far T   55.92 (3.0)  56.93 (2.4)  54.74 (1.9)  57.01 (1.8)  51.18 (2.0)
CBU     58.21 (2.6)  58.45 (3.3)  58.31 (3.5)  58.39 (3.5)  56.09 (2.6)

Yahoo A(p = 25)
        βu1          βu2          βu3          βu4          βu5
RUS     61.68 (2.4)  62.9 (2.9)   63.62 (3.6)  63.75 (3.1)  63.19 (1.9)
Cl K    61.68 (2.4)  61.14 (2.1)  57.62 (1.6)  54.02 (1.8)  51.48 (1.4)
Cl T    61.68 (2.4)  60.89 (2.8)  58.11 (1.4)  54.4 (2.1)   51.76 (1.4)
Far K   61.68 (2.4)  63.96 (3.0)  62.62 (2.2)  59.61 (1.5)  56.25 (1.6)
Far T   61.68 (2.4)  63.71 (2.4)  59.72 (1.6)  57.27 (1.1)  54.47 (1.1)
CBU     62.46 (2.6)  61.85 (1.4)  61.78 (2.2)  59.94 (3.0)  60.1 (4.0)

Yahoo G(p = 1)
        βu1          βu2          βu3          βu4          βu5
RUS     66.84 (3.7)  67.85 (3.2)  68.36 (3.2)  68.23 (4.0)  69.9 (4.2)
Cl K    66.84 (3.7)  66.71 (2.8)  64.3 (3.6)   61.98 (3.9)  61.15 (1.9)
Cl T    66.84 (3.7)  65.79 (2.7)  63.55 (3.3)  59.21 (3.5)  61.08 (2.4)
Far K   66.84 (3.7)  66.76 (4.1)  63.84 (3.4)  65.16 (2.0)  48.5 (2.9)
Far T   66.84 (3.7)  66.95 (4.1)  63.48 (2.9)  65.16 (2.0)  48.48 (2.9)
CBU     69.68 (4.1)  70.59 (3.2)  70.64 (3.7)  70.2 (2.9)   63.35 (3.6)

Yahoo G(p = 25)
        βu1          βu2          βu3          βu4          βu5
RUS     78.82 (1.4)  78.91 (1.6)  78.97 (1.6)  78.61 (1.6)  77.82 (2.1)
Cl K    78.82 (1.4)  77.26 (1.5)  72.52 (1.5)  67.86 (2.0)  65.07 (2.7)
Cl T    78.82 (1.4)  76.83 (1.0)  71.99 (1.8)  67.15 (2.3)  61.1 (2.7)
Far K   78.82 (1.4)  78.26 (2.2)  74.69 (2.7)  67.22 (2.1)  60.72 (2.3)
Far T   78.82 (1.4)  77.68 (2.6)  72.44 (3.0)  64.94 (2.4)  59.6 (2.0)
CBU     75.25 (3.2)  75.22 (2.4)  74.69 (2.3)  73.07 (2.4)  70.69 (2.4)

TaFeng(p = 1)
        βu1          βu2          βu3          βu4          βu5
RUS     55.75 (1.6)  56.1 (1.6)   56.26 (1.7)  57.23 (1.7)  59.25 (2.2)
Cl K    55.75 (1.6)  55.68 (1.6)  55.58 (1.5)  55.08 (1.1)  51.05 (1.5)
Cl T    55.75 (1.6)  55.67 (1.6)  54.47 (1.6)  47.53 (1.6)  49.3 (1.1)
Far K   55.75 (1.6)  58.99 (1.2)  59.47 (1.1)  60.04 (1.2)  56.31 (1.0)
Far T   55.75 (1.6)  58.92 (1.3)  59.25 (1.3)  58.58 (1.1)  56.31 (1.0)
CBU     57.8 (1.0)   58.47 (1.1)  58.15 (0.9)  58.87 (1.4)  57.65 (1.6)

TaFeng(p = 25)
        βu1          βu2          βu3          βu4          βu5
RUS     66.94 (1.3)  67.44 (1.3)  68.1 (1.4)   68.27 (1.4)  66.13 (1.2)
Cl K    66.94 (1.3)  66.13 (1.4)  63.39 (1.2)  59.83 (1.3)  56.94 (0.7)
Cl T    66.94 (1.3)  66.38 (1.5)  62.89 (1.6)  57.46 (1.3)  54.56 (1.3)
Far K   66.94 (1.3)  68.06 (1.4)  66.43 (1.6)  64.46 (1.5)  63.35 (1.3)
Far T   66.94 (1.3)  64.31 (1.1)  62.69 (1.0)  61.27 (1.1)  59.03 (1.0)
CBU     64.81 (1.2)  64.15 (1.1)  64.13 (1.2)  63.88 (0.8)  63.46 (0.8)

Book(p = 1)
        βu1          βu2          βu3          βu4          βu5
RUS     52.6 (1.3)   52.79 (0.9)  53.46 (0.8)  53.89 (0.9)  54.05 (0.9)
Cl K    52.6 (1.3)   52.56 (1.2)  52.52 (1.3)  52.39 (1.1)  53.09 (1.1)
Cl T    52.6 (1.3)   52.56 (1.2)  52.52 (1.3)  52.39 (1.1)  53.05 (0.7)
Far K   52.6 (1.3)   55.21 (1.2)  56.21 (1.8)  56.14 (1.2)  53.06 (1.0)
Far T   52.6 (1.3)   55.21 (1.2)  56.21 (1.8)  56.14 (1.2)  53.06 (1.0)
CBU     54.28 (0.9)  53.77 (1.0)  53.33 (1.1)  53.34 (0.9)  52.84 (0.8)

Book(p = 25)
        βu1          βu2          βu3          βu4          βu5
RUS     60.08 (0.7)  60.13 (0.6)  60.4 (0.8)   60.33 (0.8)  63.28 (0.8)
Cl K    60.08 (0.7)  59.96 (0.7)  60.13 (0.8)  59.96 (1.0)  59.28 (0.7)
Cl T    60.08 (0.7)  59.96 (0.7)  60.13 (0.8)  60.29 (0.4)  54.5 (0.9)
Far K   60.08 (0.7)  63.29 (1.0)  64.19 (0.8)  57.3 (1.1)   55.66 (1.1)
Far T   60.08 (0.7)  62.14 (0.5)  58.27 (0.6)  56.37 (1.0)  55.66 (1.1)
CBU     54.82 (0.9)  54.67 (0.9)  54.71 (0.9)  54.66 (1.0)  54.78 (0.9)

LST(p = 1)
        βu1          βu2          βu3          βu4          βu5
RUS     99.99 (0.0)  99.99 (0.0)  99.99 (0.0)  99.98 (0.0)  99.99 (0.0)
Cl K    99.99 (0.0)  99.99 (0.0)  99.99 (0.0)  99.99 (0.0)  99.99 (0.0)
Cl T    99.99 (0.0)  99.99 (0.0)  99.99 (0.0)  99.99 (0.0)  99.98 (0.0)
Far K   99.99 (0.0)  99.98 (0.0)  99.98 (0.0)  99.98 (0.0)  99.98 (0.0)
Far T   99.99 (0.0)  99.98 (0.0)  99.98 (0.0)  99.98 (0.0)  99.98 (0.0)
CBU     []           []           []           []           []

Adver(p = [])
        βu1          βu2          βu3          βu4          βu5
RUS     96.61 (1.8)  96.32 (1.8)  96.63 (1.4)  97.12 (2.1)  96.22 (1.6)
Cl K    96.61 (1.8)  96.44 (1.5)  96.14 (1.5)  96.04 (2.0)  94.8 (2.5)
Cl T    96.61 (1.8)  95.87 (2.1)  94.32 (1.9)  93.01 (2.2)  90.72 (2.3)
Far K   96.61 (1.8)  96.53 (1.4)  95.76 (2.0)  94.39 (1.8)  90.49 (3.1)
Far T   96.61 (1.8)  96.54 (1.5)  95.67 (1.9)  94.54 (1.8)  89.3 (2.8)
CBU     96.85 (2.3)  96.85 (2.3)  97.05 (1.5)  96.6 (1.6)   96.06 (2.1)

Adver(p = 1)
        βu1          βu2          βu3          βu4          βu5
RUS     90.93 (3.0)  91.53 (3.1)  92.37 (3.4)  91.9 (2.9)   91.93 (2.2)
Cl K    90.93 (3.0)  90.64 (3.0)  89.87 (3.9)  90.21 (3.6)  89.18 (2.0)
Cl T    90.93 (3.0)  89.7 (3.5)   88.55 (3.4)  85.76 (3.3)  88.2 (2.3)
Far K   90.93 (3.0)  93.8 (2.3)   92.4 (2.6)   88.73 (3.4)  85.51 (4.0)
Far T   90.93 (3.0)  93.62 (2.4)  93.2 (2.2)   88.41 (3.6)  85.51 (4.0)
CBU     93.22 (2.4)  93.76 (2.5)  93.89 (2.6)  93.52 (2.7)  91.27 (2.0)

CRF(p = [])
        βu1           βu2           βu3           βu4           βu5
RUS     64.06 (16.4)  63.28 (15.9)  67.98 (17.4)  66.95 (21.9)  87.73 (8.8)
Cl K    64.06 (16.4)  62.44 (16.6)  62.34 (16.9)  71.37 (13.8)  78.22 (17.7)
Cl T    64.06 (16.4)  62.44 (16.6)  62.34 (16.9)  71.37 (13.8)  62.67 (22.9)
Far K   64.06 (16.4)  83.8 (14.2)   83.93 (14.8)  84.49 (13.7)  86.11 (9.7)
Far T   64.06 (16.4)  83.8 (14.2)   83.93 (14.8)  84.49 (13.7)  86.11 (9.7)
CBU     []            []            []            []            []

Bank(p = [])
        βu1          βu2          βu3          βu4          βu5
RUS     66.82 (0.9)  67.02 (0.9)  67.37 (0.8)  67.99 (0.6)  69.5 (1.0)
Cl K    66.82 (0.9)  66.17 (0.7)  65.24 (0.6)  64.86 (0.6)  58.53 (1.1)
Cl T    66.82 (0.9)  64.92 (1.1)  60.69 (0.9)  56.33 (0.8)  52.87 (0.7)
Far K   66.82 (0.9)  66.95 (0.6)  66.19 (0.6)  64.42 (0.6)  58.25 (1.1)
Far T   66.82 (0.9)  67.16 (0.6)  64.2 (0.8)   59.67 (1.0)  58.25 (1.1)
CBU     []           []           []           []           []


C Boosting experiments

This section presents the results of each of the boosting experiments from Section 4.5 on the remaining data sources. Each of the figures below depicts the average tenfold AUC-performance on test data (with μ = 100) with respect to the number of boosting iterations T, for (left) AdaBoost (AB), AdaCost (AC) and EasyEnsemble (EE), with C chosen according to the highest validation set AUC-performance (over all possible boosting rounds), and (right) AB and EE (with S = 15) with varying C-levels.

Fig 6 Mov G(p = 1) dataset

Fig 7 Mov Th(p = []) dataset


Fig 8 Yahoo A(p = 1) dataset

Fig 9 Yahoo A(p = 25) dataset

Fig 10 Yahoo G(p = 1) dataset


Fig 11 Yahoo G(p = 25) dataset

Fig 12 TaFeng(p = 1) dataset

Fig 13 Book(p = 1) dataset


Fig 14 LST(p = 1) dataset

Fig 15 Adver(p = []) dataset

Fig 16 Adver(p = 1) dataset


Fig 17 CRF(p = []) dataset

D Final Comparison

Fig 18 Average rank AUC versus average rank Time (see Table 9) across the large datasets from Table 2 (CRF and Bank). Methods shown: BL, OSR, SMOTE, ADASYN, RUS, Cl Knn, Far Knn, CBU, AB, AC(R=28), AC(R=RL), EE(S=10), EE(S=15) and EE par. Regarding the former, an extra row EE par is added, having the same AUC-value as EE(S = 15). Ranks for each dataset are subsequently obtained via the procedure outlined in Section 4.6.1. Points occurring in the upper-right region are preferred.


References

Akbani R Kwek S Japkowicz N (2004) Applying support vector machines to imbalanced datasetsIn Machine Learning ECML 2004 15th European Conference on Machine Learning Pisa ItalySeptember 20-24 2004 Proceedings Springer Berlin Heidelberg Berlin Heidelberg pp 39ndash50 DOI101007978-3-540-30115-8 7

Ali A Shamsuddin SM Ralescu AL (2015) Classification with class imbalance problem A review Inter-national Journal of Advances in Soft Computing and its Applications 7(3)176ndash204

Alzahrani T Horadam KJ (2016) Community detection in bipartite networks Algorithms and case studiesIn Complex Systems and Networks Dynamics Controls and Applications Springer Berlin Heidel-berg Berlin Heidelberg pp 25ndash50 DOI 101007978-3-662-47824-0 2

Bachner J (2013) Predictive policing Preventing crime with data and analytics IBM Center for the Busi-ness of Government

Baesens B Van Gestel T Viaene S Stepanova M Suykens J Vanthienen J (2003) Benchmarking state-of-the-art classification algorithms for credit scoring Journal of the Operational Research Society54(6)627ndash635 DOI 101057palgravejors2601545

Barandela R Snchez J Garca V Rangel E (2003) Strategies for learning in class imbalance problemsPattern Recognition 36(3)849 ndash 851 DOI httpsdoiorg101016S0031-3203(02)00257-1

Barber MJ (2007) Modularity and community detection in bipartite networks Physical Review E76066102 DOI 101103PhysRevE76066102

Barua S Islam MM Yao X Murase K (2014) MWMOTEndashmajority weighted minority oversamplingtechnique for imbalanced data set learning IEEE Transactions on Knowledge and Data Engineer-ing 26(2)405ndash425 DOI 101109TKDE2012232

Batista GEAPA Prati RC Monard MC (2004) A study of the behavior of several methods for balancingmachine learning training data SIGKDD Explor Newsl 6(1)20ndash29 DOI 10114510077301007735

Beckett SJ (2016) Improved community detection in weighted bipartite networks Royal Society OpenScience 3(1) DOI 101098rsos140536

Bekkar M Djemaa HK Alitouche TA (2013) Evaluation measures for models assessment over imbalanceddata sets Journal of Information Engineering and Applications 3(10)27ndash38

Bhattacharyya S Jha S Tharakunnel K Westland JC (2011) Data mining for credit card fraud A compara-tive study Decision Support Systems 50(3)602 ndash 613 DOI httpsdoiorg101016jdss201008008

Blondel VD Guillaume JL Lambiotte R Lefebvre E (2008) Fast unfolding of communities in large net-works Journal of Statistical Mechanics Theory and Experiment 2008(10)P10008

Breiman L Friedman J Stone CJ Olshen RA (1984) Classification and regression trees Taylor amp FrancisBrozovsky L Petricek V (2007) Recommender system for online dating service In Proceedings of

Znalosti 2007 Conference VSB OstravaCha M Mislove A Gummadi KP (2009) A measurement-driven analysis of information propagation in

the Flickr social network In Proceedings of the 18th International Conference on World Wide WebACM New York NY USA WWW rsquo09 pp 721ndash730 DOI 10114515267091526806

Chawla NV (2005) Data mining for imbalanced datasets An overview In Data mining and knowledgediscovery handbook Springer US Boston MA pp 853ndash867

Chawla NV Bowyer KW Hall LO Kegelmeyer WP (2002) SMOTE synthetic minority over-samplingtechnique Journal of artificial intelligence research 16321ndash357

Chawla NV Lazarevic A Hall LO Bowyer KW (2003) Smoteboost Improving prediction of the minorityclass in boosting In Knowledge Discovery in Databases PKDD 2003 Springer Berlin HeidelbergBerlin Heidelberg pp 107ndash119

Chawla NV Japkowicz N Kotcz A (2004) Editorial Special issue on learning from imbalanced data setsSIGKDD Explor Newsl 6(1)1ndash6 DOI 10114510077301007733

Chen M Mao S Liu Y (2014) Big data A survey Mobile Networks and Applications 19(2)171ndash209DOI 101007s11036-013-0489-0

Chyi YM (2003) Classification analysis techniques for skewed class distribution problems Master thesisDepartment of Information management National Sun Yat-Sen University

Demsar J (2006) Statistical comparisons of classifiers over multiple data sets Journal of Machine LearningResearch 7(Jan)1ndash30

Dhillon IS (2001) Co-clustering documents and words using bipartite spectral graph partitioning In Pro-ceedings of the Seventh ACM SIGKDD International Conference on Knowledge Discovery and DataMining ACM New York NY USA KDD rsquo01 pp 269ndash274 DOI 101145502512502550

54 Jellis Vanhoeyveld David Martens

Drummond C Holte RC (2003) C45 class imbalance and cost sensitivity why under-sampling beatsover-sampling In Proceedings of the ICML rsquo03 Workshop on Learning from Imbalanced Datasets

Fan RE Chang KW Hsieh CJ Wang XR Lin CJ (2008) LIBLINEAR A library for large linear classifi-cation Journal of Machine Learning Research 91871ndash1874

Fan W Stolfo SJ Zhang J Chan PK (1999) AdaCost Misclassification cost-sensitive boosting In Pro-ceedings of the Sixteenth International Conference on Machine Learning Morgan Kaufmann Pub-lishers Inc San Francisco CA USA ICML rsquo99 pp 97ndash105

Fawcett T (2006) An introduction to ROC analysis Pattern Recognition Letters 27(8)861 ndash 874 DOIhttpsdoiorg101016jpatrec200510010

Finch H (2005) Comparison of distance measures in cluster analysis with dichotomous data Journal ofData Science 3(1)85ndash100

Fortunato S (2010) Community detection in graphs Physics Reports 486(3 5)75 ndash 174 DOI httpsdoiorg101016jphysrep200911002

Junque de Fortuny E Martens D Provost F (2014a) Predictive modeling with big data is bigger reallybetter Big Data 1(4)215ndash226 DOI 101089big20130037

Junque de Fortuny E Stankova M Moeyersoms J Minnaert B Provost F Martens D (2014b) Corporateresidence fraud detection In Proceedings of the 20th ACM SIGKDD International Conference onKnowledge Discovery and Data Mining ACM New York NY USA KDD rsquo14 pp 1650ndash1659 DOI10114526233302623333

Frasca M Bertoni A Re M Valentini G (2013) A neural network algorithm for semi-supervised node labellearning from unbalanced data Neural Networks 4384 ndash 98 DOI httpsdoiorg101016jneunet201301021

Friedman M (1937) The use of ranks to avoid the assumption of normality implicit in the analysis ofvariance Journal of the american statistical association 32(200)675ndash701

Garcıa E Lozano F (2007) Boosting support vector machines In Machine Learning and Data Mining inPattern Recognition 5th International Conference MLDM 2007 Leipzig Germany July 18-20 PostProceedings IBaI publishing pp 153ndash167

Goldstein M Uchida S (2016) A comparative evaluation of unsupervised anomaly detection algorithmsfor multivariate data PLOS ONE 11(4)1ndash31 DOI 101371journalpone0152173

Gonzlez PC Velsquez JD (2013) Characterization and detection of taxpayers with false invoices usingdata mining techniques Expert Systems with Applications 40(5)1427 ndash 1436 DOI httpsdoiorg101016jeswa201208051

Guimera R Sales-Pardo M Amaral LAN (2007) Module identification in bipartite and directed networksPhysical Review E 76036102 DOI 101103PhysRevE76036102

Guo H Viktor HL (2004) Learning from imbalanced data sets with boosting and data generation TheDataBoost-IM approach SIGKDD Explor Newsl 6(1)30ndash39 DOI 10114510077301007736

Guo X Yin Y Dong C Yang G Zhou G (2008) On the class imbalance problem In 2008 Fourth Interna-tional Conference on Natural Computation IEEE vol 4 pp 192ndash201 DOI 101109ICNC2008871

Han H Wang WY Mao BH (2005) Borderline-SMOTE A new over-sampling method in imbalanced datasets learning In Advances in Intelligent Computing Springer Berlin Heidelberg Berlin Heidelbergpp 878ndash887

He H Garcia EA (2009) Learning from imbalanced data IEEE Transactions on Knowledge and DataEngineering 21(9)1263ndash1284 DOI 101109TKDE2008239

He H Bai Y Garcia EA Li S (2008) ADASYN Adaptive synthetic sampling approach for imbalancedlearning In 2008 IEEE International Joint Conference on Neural Networks (IEEE World Congresson Computational Intelligence) IEEE pp 1322ndash1328 DOI 101109IJCNN20084633969

Holm S (1979) A simple sequentially rejective multiple test procedure Scandinavian journal of statistics6(2)65ndash70

Hsu CW Lin CJ (2002) A comparison of methods for multiclass support vector machines IEEE Transac-tions on Neural Networks 13(2)415ndash425 DOI 10110972991427

Huang A (2008) Similarity measures for text document clustering In Proceedings of the sixth new zealandcomputer science research student conference (NZCSRSC2008) Christchurch New Zealand pp 49ndash56

Iman RL Davenport JM (1980) Approximations of the critical region of the Friedman statistic Commu-nications in Statistics-Theory and Methods 9(6)571ndash595

Jo T Japkowicz N (2004) Class imbalances versus small disjuncts ACM SIGKDD Explor Newsl 6(1)40ndash49 DOI 10114510077301007737

Imbalanced classification in sparse and large behaviour datasets 55

Jutla IS Jeub LG Mucha PJ (2011-2016) A generalized louvain method for community detection imple-mented in MATLAB URL httpnetwikiamathunceduGenLouvain

Kubat M Matwin S (1997) Addressing the curse of imbalanced training sets One-sided selection InProceedings of the Fourteenth International Conference on Machine Learning Morgan KaufmannPublishers Inc San Francisco CA USA pp 179ndash186

Lancichinetti A Fortunato S (2009) Community detection algorithms A comparative analysis PhysicalReview E 80056117 DOI 101103PhysRevE80056117

Larremore DB Clauset A Jacobs AZ (2014) Efficiently inferring community structure in bipartitenetworks Physical Review E Statistical Nonlinear and Soft Matter Physics 90012805 DOI101103PhysRevE90012805

Li J Fine JP (2010) Weighted area under the receiver operating characteristic curve and its application togene selection Journal of the Royal Statistical Society Series C (Applied Statistics) 59(4)673ndash692DOI 101111j1467-9876201000713x

Li X Wang L Sung E (2008) AdaBoost with SVM-based component classifiers Engineering Applicationsof Artificial Intelligence 21(5)785ndash795 DOI httpsdoiorg101016jengappai200707001

Lichman M (2013) UCI machine learning repository URL httparchiveicsuciedumlLiu W Chawla S Cieslak DA Chawla NV (2010) A robust decision tree algorithm for imbalanced data

sets In Proceedings of the tenth SIAM international conference on data mining SIAM Philadelphiavol 10 pp 766ndash777

Liu XY Wu J Zhou ZH (2009) Exploratory undersampling for class-imbalance learning IEEE Transac-tions on Systems Man and Cybernetics Part B (Cybernetics) 39(2)539ndash550 DOI 101109TSMCB20082007853

Luts J Ojeda F Van de Plas R De Moor B Van Huffel S Suykens JA (2010) A tutorial on supportvector machine-based methods for classification problems in chemometrics Analytica Chimica Acta665(2)129ndash145 DOI httpsdoiorg101016jaca201003030

Macskassy SA, Provost F (2007) Classification in networked data: A toolkit and a univariate case study. Journal of Machine Learning Research 8(May):935–983

Martens D, Provost F (2014) Explaining data-driven document classifications. MIS Quarterly 38(1):73–100

Martens D, Provost F, Clark J, Junqué de Fortuny E (2016) Mining massive fine-grained behavior data to improve predictive analytics. MIS Quarterly 40(4):869–888

Mazurowski MA, Habas PA, Zurada JM, Lo JY, Baker JA, Tourassi GD (2008) Training neural network classifiers for medical decision making: The effects of imbalanced datasets on classification performance. Neural Networks 21(2–3):427–436, DOI 10.1016/j.neunet.2007.12.031

Mease D, Wyner AJ, Buja A (2007) Boosted classification trees and class probability/quantile estimation. Journal of Machine Learning Research 8:409–439

Nemenyi P (1963) Distribution-free Multiple Comparisons. Dissertation, Princeton University

Newman MEJ, Girvan M (2004) Finding and evaluating community structure in networks. Physical Review E 69:026113, DOI 10.1103/PhysRevE.69.026113

Ng AY (2004) Feature selection, L1 vs. L2 regularization, and rotational invariance. In: Proceedings of the Twenty-first International Conference on Machine Learning, ACM, New York, NY, USA, ICML '04, pp 78–, DOI 10.1145/1015330.1015435

Ng AY, Jordan MI (2002) On discriminative vs. generative classifiers: A comparison of logistic regression and naive Bayes. In: Advances in Neural Information Processing Systems 14, MIT Press, pp 841–848

Ngai E, Hu Y, Wong Y, Chen Y, Sun X (2011) The application of data mining techniques in financial fraud detection: A classification framework and an academic review of literature. Decision Support Systems 50(3):559–569, DOI 10.1016/j.dss.2010.08.006

Platt JC (1999) Probabilistic outputs for support vector machines and comparisons to regularized likelihood methods. In: Advances in Large-Margin Classifiers, MIT Press, pp 61–74

Porter MA, Onnela JP, Mucha PJ (2009) Communities in networks. Notices of the American Mathematical Society 56(9):1082–1097

Provost F, Fawcett T (2013) Data Science for Business: What you need to know about data mining and data-analytic thinking. O'Reilly Media, Inc.

Provost F, Dalessandro B, Hook R, Zhang X, Murray A (2009) Audience selection for on-line brand advertising: Privacy-friendly social network targeting. In: Proceedings of the 15th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, ACM, New York, NY, USA, KDD '09, pp 707–716, DOI 10.1145/1557019.1557098

56 Jellis Vanhoeyveld David Martens

Raskutti B, Kowalczyk A (2004) Extreme re-balancing for SVMs: A case study. SIGKDD Explorations Newsletter 6(1):60–69, DOI 10.1145/1007730.1007739

Rosvall M, Bergstrom CT (2008) Maps of random walks on complex networks reveal community structure. Proceedings of the National Academy of Sciences 105(4):1118–1123, DOI 10.1073/pnas.0706851105

Schapire RE (1999) A brief introduction to boosting. In: Proceedings of the 16th International Joint Conference on Artificial Intelligence - Volume 2, Morgan Kaufmann Publishers Inc., San Francisco, CA, USA, IJCAI'99, pp 1401–1406

Schapire RE, Singer Y (1999) Improved boosting algorithms using confidence-rated predictions. Machine Learning 37(3):297–336, DOI 10.1023/A:1007614523901

Shmueli G (2017) Analyzing behavioral big data: Methodological, practical, ethical, and moral issues. Quality Engineering 29(1):57–74, DOI 10.1080/08982112.2016.1210979

Sobhani P, Viktor H, Matwin S (2015) Learning from imbalanced data using ensemble methods and cluster-based undersampling. In: New Frontiers in Mining Complex Patterns: Third International Workshop, NFMCP 2014, Held in Conjunction with ECML-PKDD 2014, Nancy, France, September 19, 2014, Revised Selected Papers, Springer International Publishing, Cham, pp 69–83, DOI 10.1007/978-3-319-17876-9_5

Stankova M (2016) Classification within network data with a bipartite structure. Dissertation, University of Antwerp

Stankova M, Martens D, Provost F (2015) Classification over bipartite graphs through projection. Working Papers 2015001, University of Antwerp, Faculty of Applied Economics

Sun Y, Kamel MS, Wong AK, Wang Y (2007) Cost-sensitive boosting for classification of imbalanced data. Pattern Recognition 40(12):3358–3378, DOI 10.1016/j.patcog.2007.04.009

Suykens JA, Van Gestel T, De Brabanter J, De Moor B, Vandewalle J (2002) Least Squares Support Vector Machines. World Scientific, Singapore

Tang B, He H (2015) ENN: Extended nearest neighbor method for pattern recognition [research frontier]. IEEE Computational Intelligence Magazine 10(3):52–60, DOI 10.1109/MCI.2015.2437512

Tang Y, Zhang YQ, Chawla NV, Krasser S (2009) SVMs modeling for highly imbalanced classification. IEEE Transactions on Systems, Man, and Cybernetics, Part B (Cybernetics) 39(1):281–288, DOI 10.1109/TSMCB.2008.2002909

Tobback E, Moeyersoms J, Stankova M, Martens D (2016) Bankruptcy prediction for SMEs using relational data. Working Paper 2016004, University of Antwerp, Faculty of Applied Economics

Verbeke W, Dejaeger K, Martens D, Hur J, Baesens B (2012) New insights into churn prediction in the telecommunication sector: A profit driven data mining approach. European Journal of Operational Research 218(1):211–229, DOI 10.1016/j.ejor.2011.09.031

Veropoulos K, Campbell I, Cristianini N (1999) Controlling the sensitivity of support vector machines. In: Proceedings of the International Joint Conference on Artificial Intelligence, Stockholm, Sweden (IJCAI99), pp 55–60

Whitrow C, Hand DJ, Juszczak P, Weston D, Adams NM (2009) Transaction aggregation as a strategy for credit card fraud detection. Data Mining and Knowledge Discovery 18(1):30–55, DOI 10.1007/s10618-008-0116-z

Wickramaratna J, Holden SB, Buxton BF (2001) Performance degradation in boosting. In: Proceedings of the Second International Workshop on Multiple Classifier Systems, Springer, London, UK, MCS '01, pp 11–21

Yen SJ, Lee YS (2009) Cluster-based under-sampling approaches for imbalanced data distributions. Expert Systems with Applications 36(3, Part 1):5718–5727, DOI 10.1016/j.eswa.2008.06.108

Yu HF, Lo HY, Hsieh HP, Lou JK, McKenzie TG, Chou JW, Chung PH, Ho CH, Chang CF, Wei YH, et al (2010) Feature engineering and classifier ensemble for KDD Cup 2010. In: Proceedings of the KDD Cup 2010 Workshop, pp 1–16

Zha H, He X, Ding C, Simon H, Gu M (2001) Bipartite graph partitioning and data clustering. In: Proceedings of the Tenth International Conference on Information and Knowledge Management, ACM, New York, NY, USA, CIKM '01, pp 25–32, DOI 10.1145/502585.502591

Zhang J, Mani I (2003) kNN approach to unbalanced data distributions: A case study involving information extraction. In: Proceedings of the ICML'2003 Workshop on Learning from Imbalanced Datasets, Washington DC

Ziegler CN, McNee SM, Konstan JA, Lausen G (2005) Improving recommendation lists through topic diversification. In: Proceedings of the 14th International Conference on World Wide Web, ACM, New York, NY, USA, WWW '05, pp 22–32, DOI 10.1145/1060745.1060754



lowest average rank and is our best performing method. The combination of several balanced subsets with filtered noise/outlier contributions allows one to explore the majority class space efficiently. To strengthen our previous statements, we conduct statistical significance tests in the next paragraphs, in accordance with the procedure outlined in Demsar (2006). The latter paper presents a framework to perform statistical comparisons of classifiers over multiple datasets.

The first null-hypothesis we try to reject postulates that all algorithms perform equally well or, equivalently, that their average ranks R_j (see Table 5) do not differ. Under the null-hypothesis, the Friedman statistic (Friedman 1937)

\chi^2_F = \frac{12N}{k(k+1)} \left[ \sum_{j=1}^{k} R_j^2 - \frac{k(k+1)^2}{4} \right]    (6)

with N the number of datasets (N = 15) and k the number of algorithms (k = 13), can be used to compute the statistic (Iman and Davenport 1980)

F_F = \frac{(N-1)\,\chi^2_F}{N(k-1) - \chi^2_F}    (7)

The latter is distributed according to the F-distribution with k - 1 and (k - 1)(N - 1) degrees of freedom. Evaluating this measure with the average ranks shown in Table 5 results in F_F = 22.98, well above the critical value of 1.81 at the α = 0.05 significance level. We can therefore reject the null-hypothesis and proceed with post-hoc tests.
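For concreteness, both statistics can be computed directly from the average ranks reported in Table 5. The sketch below (plain Python; ranks copied from the table) reproduces the reported value up to rank rounding:

```python
# Friedman statistic (Eq. 6) and Iman-Davenport correction (Eq. 7),
# evaluated on the average AUC ranks of Table 5 (k = 13 algorithms, N = 15 datasets).
avg_ranks = [11.600, 5.000, 4.533, 4.800, 8.167, 12.467, 8.133,
             8.567, 8.267, 8.467, 5.400, 3.267, 2.333]  # BL ... EE(S = 15)
N, k = 15, len(avg_ranks)

# Eq. (6): chi^2_F = 12N / (k(k+1)) * [ sum_j R_j^2 - k(k+1)^2 / 4 ]
chi2_F = 12 * N / (k * (k + 1)) * (sum(R ** 2 for R in avg_ranks)
                                   - k * (k + 1) ** 2 / 4)

# Eq. (7): F_F follows an F-distribution with (k-1) and (k-1)(N-1) df.
F_F = (N - 1) * chi2_F / (N * (k - 1) - chi2_F)

print(round(F_F, 2))  # close to the reported 22.98 (small deviation from rank rounding)
```

A useful sanity check is that the average ranks must sum to k(k + 1)/2 = 91, which the values in Table 5 indeed satisfy.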

The Nemenyi test (Nemenyi 1963) allows one to compare each classifier to every other and adjusts the critical value to compensate for making k(k - 1)/2 comparisons24. "The performance of two classifiers is significantly different if the corresponding average ranks differ by at least the critical difference (CD)" (Demsar 2006). We refer to the aforementioned paper for the calculation of the CD. The result of these comparisons at the α = 0.05 significance level is shown in Table 6. This test is already quite revealing in the sense that, for instance, the oversampling methods (OSR, SMOTE and ADASYN), AC (R = RL) and the EE-techniques are significantly different from the BL. Yet it does not indicate that undersampling or applying plain boosting (AB) results in significantly different performance compared to the BL. The Nemenyi test is still conservative, and we therefore compare all classifiers to a single control classifier in the next paragraphs. Note that we only make k - 1 comparisons in that situation, resulting in stronger results compared to the Nemenyi test.

When comparing a single classifier i with a control classifier c, the null-hypothesis states that the two classifiers i and c perform equally well. Under this hypothesis, the test statistic (based on the average ranks R_i shown in Table 5)

z = \frac{R_i - R_c}{\sqrt{k(k+1)/(6N)}}    (8)

is distributed according to a standard normal distribution. This z-value is used to find the corresponding probability (p-value) p from the cumulative standard normal

24 The larger the number of comparisons, the higher the proportion of null-hypotheses that are wrongfully rejected due to random chance.

Imbalanced classification in sparse and large behaviour datasets 31

Table 5 Final comparison of all methods in terms of AUC test set performance; see Section 4.2 for details on the experimental set-up. β ≠ 0 and βu ≠ 0 for the oversampling respectively undersampling techniques; µ = 100 for each of the boosting variants. Standard deviations are given between brackets and the performance difference with the BL is shown in square brackets.

Mov G(p = 1) Mov G(p = 25) Mov Th(p = []) Yahoo A(p = 1)

BL 716(262)[0] 8141(132)[0] 7977(533)[0] 5592(297)[0]OSR 7535(227)[38] 8376(209)[23] 8513(61)[54] 6005(271)[41]

SMOTE 7616(227)[46] 837(21)[23] 8567(498)[59] 601(3)[42]ADASYN 7607(226)[45] 8363(204)[22] 8565(56)[59] 599(299)[4]

RUS 7288(273)[13] 8152(215)[01] 8291(719)[31] 5704(177)[11]Cl Knn 7143(136)[-02] 8088(119)[-05] 7887(471)[-09] 5578(271)[-01]Far Knn 719(295)[03] 809(148)[-05] 8407(464)[43] 572(133)[13]

CBU 7417(236)[26] 8151(104)[01] 8276(722)[3] 5877(343)[28]AB 7165(173)[01] 8452(189)[31] 8243(518)[27] 5835(262)[24]

AC(R = 28) 7161(246)[0] 8346(182)[2] 8327(56)[35] 5772(247)[18]AC(R = R L) 7465(27)[31] 8335(209)[19] 8541(449)[56] 5947(233)[35]EE(S = 10) 7604(266)[44] 8505(185)[36] 861(578)[63] 5966(313)[37]EE(S = 15) 7612(288)[45] 8514(186)[37] 8642(586)[67] 5976(293)[38]

Yahoo A(p = 25) Yahoo G(p = 1) Yahoo G(p = 25) TaFeng(p = 1)

BL 6168(242)[0] 6684(366)[0] 7882(139)[0] 5575(16)[0]OSR 6459(312)[29] 7308(296)[62] 7852(201)[-03] 6121(224)[55]

SMOTE 6556(333)[39] 7311(312)[63] 7901(121)[02] 6172(181)[6]ADASYN 6513(338)[34] 7322(317)[64] 7974(168)[09] 6168(186)[59]

RUS 6411(28)[24] 7065(339)[38] 7891(155)[01] 5925(218)[35]Cl Knn 6114(213)[-05] 6634(354)[-05] 7726(146)[-16] 5577(128)[0]Far Knn 6396(303)[23] 6697(354)[01] 7826(22)[-06] 5998(126)[42]

CBU 6227(179)[06] 7127(289)[44] 7522(242)[-36] 584(157)[26]AB 6388(267)[22] 689(203)[21] 7901(166)[02] 5621(179)[05]

AC(R = 28) 6432(356)[26] 6889(311)[2] 7899(189)[02] 5633(183)[06]AC(R = R L) 6431(303)[26] 7313(28)[63] 7841(2)[-04] 616(226)[59]EE(S = 10) 6651(324)[48] 7261(315)[58] 8052(16)[17] 612(182)[54]EE(S = 15) 6636(318)[47] 7348(232)[66] 8054(156)[17] 6113(183)[54]

TaFeng(p = 25) Book(p = 1) Book(p = 25) LST(p = 1)

BL 6694(134)[0] 526(129)[0] 6008(071)[0] 9999(001)[0]OSR 6877(123)[18] 5587(142)[33] 6462(057)[45] 9999(001)[0]

SMOTE 6847(15)[15] 5507(088)[25] 6296(082)[29] 9999(001)[0]ADASYN 6848(147)[15] 5504(091)[24] 6302(057)[29] 9999(001)[0]

RUS 6828(139)[13] 5426(092)[17] 6328(08)[32] 9998(001)[0]Cl Knn 6613(143)[-08] 5269(13)[01] 6002(079)[-01] 9999(001)[0]Far Knn 6806(141)[11] 5625(152)[37] 6415(112)[41] 9998(001)[0]

CBU 6384(107)[-31] 5375(101)[12] 5468(088)[-54] []AB 6765(155)[07] 5427(195)[17] 65(067)[49] 9999(001)[0]

AC(R = 28) 6931(123)[24] 5372(1)[11] 6124(08)[12] 9998(001)[0]AC(R = R L) 6715(151)[02] 5573(122)[31] 646(064)[45] 9999(001)[0]EE(S = 10) 703(135)[34] 5509(129)[25] 6537(061)[53] 9998(001)[0]EE(S = 15) 704(13)[35] 5535(126)[28] 654(051)[53] 9998(001)[0]


Table 5 Continued. Additionally, an average rank column is added, showing the mean rank of each algorithm across all datasets. Note that we excluded the LST dataset for this purpose.

Adver(p = []) Adver(p = 1) CRF(p = []) Bank(p = [])

BL 9661(182)[0] 9093(302)[0] 6406(1643)[0] 6682(088)[0]OSR 9693(191)[03] 933(202)[24] 8074(1293)[167] 7139(079)[46]

SMOTE 9705(166)[04] 9335(201)[24] 787(1656)[146] []ADASYN 9691(195)[03] 9346(221)[25] 7887(1671)[148] []

RUS 9681(187)[02] 9238(251)[15] 8398(599)[199] 6941(119)[26]Cl Knn 964(148)[-02] 8973(342)[-12] 7663(1619)[126] 6617(072)[-06]Far Knn 9577(181)[-08] 9388(178)[3] 8375(1311)[197] 6695(056)[01]

CBU 9715(188)[05] 9418(23)[33] [] []AB 9734(218)[07] 9139(323)[05] 7762(1515)[136] 6682(088)[0]

AC(R = 28) 9744(193)[08] 91(335)[01] 6831(1493)[42] 6767(071)[09]AC(R = R L) 9746(171)[08] 9351(217)[26] 8508(977)[21] 707(08)[39]EE(S = 10) 9764(135)[1] 9297(275)[2] 8618(1017)[221] 7146(081)[46]EE(S = 15) 9763(135)[1] 933(214)[24] 8635(999)[223] 7154(076)[47]

Average Rank

BL          11.600
OSR          5.000
SMOTE        4.533
ADASYN       4.800
RUS          8.167
Cl Knn      12.467
Far Knn      8.133
CBU          8.567
AB           8.267
AC(R = 28)   8.467
AC(R = RL)   5.400
EE(S = 10)   3.267
EE(S = 15)   2.333

Table 6 Nemenyi test outcomes at the α = 0.05 significance level. A value of 1 indicates the null-hypothesis is rejected and thus finds the two algorithms significantly different (a value of 0 means we accept the null-hypothesis). Methods are listed in the same order as Table 5 and are represented more concisely.

      BL  RO  SM  AD  RU  Cl  Fa  CBU  AB  AC1  AC2  EE1  EE2

BL     0   1   1   1   0   0   0   0    0   0    1    1    1
RO     1   0   0   0   0   1   0   0    0   0    0    0    0
SM     1   0   0   0   0   1   0   0    0   0    0    0    0
AD     1   0   0   0   0   1   0   0    0   0    0    0    0
RU     0   0   0   0   0   0   0   0    0   0    0    1    1
Cl     0   1   1   1   0   0   0   0    0   0    1    1    1
Fa     0   0   0   0   0   0   0   0    0   0    0    1    1
CBU    0   0   0   0   0   0   0   0    0   0    0    1    1
AB     0   0   0   0   0   0   0   0    0   0    0    1    1
AC1    0   0   0   0   0   0   0   0    0   0    0    1    1
AC2    1   0   0   0   0   1   0   0    0   0    0    0    0
EE1    1   0   0   0   1   1   1   1    1   1    0    0    0
EE2    1   0   0   0   1   1   1   1    1   1    0    0    0


distribution function: p = 2 × min(Φ(z), 1 - Φ(z)). Holm's method (Holm 1979) compares k - 1 classifiers to a single control classifier and sorts the corresponding p-values in ascending order so that p_1 ≤ p_2 ≤ ... ≤ p_{k-1}. Each p_i is subsequently compared to its associated significance level25 α_comp = α/(k - i). Holm starts by performing the check p_1 < α/(k - 1) and rejects the null-hypothesis if the test passes. It then proceeds with a similar check for i = 2 and continues until a certain null-hypothesis cannot be rejected. The remaining hypotheses are retained.
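As an illustrative sketch (plain Python; average ranks taken from Table 5, with BL as control classifier), the z-statistic of Eq. (8), the two-sided p-value and the Holm step-down procedure can be computed as follows; the resulting values match Table 7 up to rank rounding:

```python
import math

# Average AUC ranks from Table 5 (N = 15 datasets, k = 13 algorithms).
ranks = {"BL": 11.600, "OSR": 5.000, "SMOTE": 4.533, "ADASYN": 4.800,
         "RUS": 8.167, "Cl Knn": 12.467, "Far Knn": 8.133, "CBU": 8.567,
         "AB": 8.267, "AC(R=28)": 8.467, "AC(R=RL)": 5.400,
         "EE(S=10)": 3.267, "EE(S=15)": 2.333}
N, k, alpha = 15, len(ranks), 0.05
se = math.sqrt(k * (k + 1) / (6 * N))  # denominator of Eq. (8)
Phi = lambda z: 0.5 * (1 + math.erf(z / math.sqrt(2)))  # standard normal CDF

control = "BL"
stats = []
for name, R in ranks.items():
    if name == control:
        continue
    z = (R - ranks[control]) / se       # Eq. (8)
    p = 2 * min(Phi(z), 1 - Phi(z))     # two-sided p-value
    stats.append((p, name, z))

# Holm's step-down: compare the i-th smallest p-value to alpha/(k - i)
# and stop rejecting at the first failure; the remainder is retained.
stats.sort()
rejected, still_checking = {}, True
for i, (p, name, z) in enumerate(stats, start=1):
    still_checking = still_checking and p < alpha / (k - i)
    rejected[name] = still_checking
```

Running this reproduces the pattern of Table 7: all methods up to AC(R = RL) are rejected, while Far Knn, RUS, AB, AC(R = 28), CBU and Cl Knn are retained at α = 0.05.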

Table 7 shows the result of the Holm test at the α = 0.05 significance level with the BL algorithm as control classifier. The table lists each method according to sorted p-values and shows the z-, p- and α_comp-values. The significance column indicates whether the proposed method is significantly different from the BL and coincidentally matches the result of the Nemenyi test. The p-values of the undersampling methods (Far Knn, RUS, CBU), AB and AC (R = 28) are only slightly higher than the level required to reject the null-hypothesis. We included a critical significance level α_crit corresponding to the lowest possible significance level at which the method would be considered significantly different from the BL (if α = α_crit then p = α_comp). In the case of AB, for example, if we were to choose a significance level α = 0.0764 (a confidence level of 92.36%), the associated p-value would be smaller than α_comp and we would proceed to conclude26 that AB performs significantly different from the BL. To summarize: the null-hypothesis of equal performance between the BL and an alternative method proposed in Section 3 (except Cl Knn) is rejected with a confidence level of more than 91% according to Holm's test. Hence, the performance of the BL approach is significantly worse than that of each of the proposed over- and undersampling methods (except Cl Knn), cost-sensitive learning techniques (AC) and boosting variants (AB, EE).

Similar to the previous paragraph, Table 8 presents the results of Holm's test at the α = 0.05 significance level with EE(S = 15) as control classifier. The results again correspond with those of the Nemenyi test and indicate that the oversampling methods (OSR, SMOTE and ADASYN), AC(R = RL) and EE(S = 10) are not statistically different from the EE(S = 15) technique at the 0.05 significance level. However, if we were to raise the significance level to approximately α = 0.25 (a 75% confidence level), the results would conclude significance (except for EE(S = 10)). To summarize: the null-hypothesis of equal performance between EE(S = 15) and an alternative method proposed in Section 3 (except EE(S = 10)) is rejected with a confidence level of more than 75% according to Holm's test. Hence, the performance of the oversampling, undersampling, AB and AC approaches is significantly worse than that of EE(S = 15). We can therefore safely conclude EE to be the preferred method in terms of AUC performance when dealing with imbalanced behaviour data.

25 α_comp adjusts the value of α to compensate for multiple comparisons.
26 This is not entirely true, since Holm's method requires Far Knn to pass the test first before proceeding, which only occurs at the 0.088662 significance level.


Table 7 Holm test at the α = 0.05 significance level with BL as reference. The table shows the z test statistic with associated p-value, and α_comp = α/(k - i) with k the number of algorithms (k = 13) and i corresponding to the position of the method in the sorted p-value vector. Since we already ranked the p-values in ascending order, i takes on the row number (e.g. OSR has i = 5). The column "significant" denotes if we can reject the null-hypothesis (significant: p < α_comp). α_crit corresponds with the smallest possible significance level at which we would decide to reject the null-hypothesis (α_crit = α·p/α_comp).

             z          p          αcomp     significant  αcrit

EE(S = 15)   -6.51642   7.2E-11    0.004167  1            8.64E-10
EE(S = 10)   -5.86009   4.63E-09   0.004545  1            5.09E-08
SMOTE        -4.96936   6.72E-07   0.005     1            6.72E-06
ADASYN       -4.78183   1.74E-06   0.005556  1            1.56E-05
OSR          -4.64119   3.46E-06   0.00625   1            2.77E-05
AC(R = RL)   -4.35991   1.3E-05    0.007143  1            9.11E-05
Far Knn      -2.4378    0.014777   0.008333  0            0.088662
RUS          -2.41436   0.015763   0.01      0            0.078815
AB           -2.34404   0.019076   0.0125    0            0.076305
AC(R = 28)   -2.20339   0.027567   0.016667  0            0.082701
CBU          -2.13307   0.032919   0.025     0            0.065837
Cl Knn        0.609449  0.542227   0.05      0            0.542227

Table 8 Holm test at the α = 0.05 significance level with EE(S = 15) as reference.

             z          p          αcomp     significant  αcrit

Cl Knn       7.12587    1.03E-12   0.004167  1            1.24E-11
BL           6.516421   7.2E-11    0.004545  1            7.92E-10
CBU          4.383348   1.17E-05   0.005     1            0.000117
AC(R = 28)   4.313027   1.61E-05   0.005556  1            0.000145
AB           4.172384   3.01E-05   0.00625   1            0.000241
RUS          4.102063   4.09E-05   0.007143  1            0.000287
Far Knn      4.078623   4.53E-05   0.008333  1            0.000272
AC(R = RL)   2.156513   0.031044   0.01      0            0.155218
OSR          1.875229   0.060761   0.0125    0            0.243045
ADASYN       1.734587   0.082814   0.016667  0            0.248442
SMOTE        1.547064   0.121848   0.025     0            0.243696
EE(S = 10)   0.65633    0.511612   0.05      0            0.511612

4.6.2 Timing-wise comparison

The performance of each of the proposed algorithms in terms of training times depends highly upon the chosen parameters. The β and βu parameters control the number of training instances and are an important contribution to the subsequent SVM learning times for each of the over- and undersampling techniques. The number of boosting iterations t ∈ [0, T] is also a major factor in each of the boosting variants proposed in Section 3.3. Furthermore, the chosen regularization parameter C highly influences the training times of the SVM inducer: higher C-values result in larger training times (Fan et al 2008). Not only do the chosen hyperparameters affect learning times; data characteristics such as the number of minority instances, the imbalance level, the number of features, the amount of overlapping etc. have a major effect as well.


In comparing each of the methods outlined in Section 3, we make use of a methodology similar to the one presented previously (see Section 4.6.1). For each of the ten folds, a suitable parameter combination is selected based on optimal validation set AUC-performance. These chosen parameters are applied in determining the timings of each method. The final results indicate the average timings across ten folds. Hence, we show the required amount of time for each method when using the parameter combination resulting in the best performance. Note that this might, for instance, result in using β = 13 in case of OSR, or using only t = 5 boosting iterations in EE. Computational timings are shown in Table 9.

As one can observe from the aforementioned table CBU is the slowest methodand relies on computationally intensive clustering procedures The synthetic SMOTEand ADASYN oversampling approaches are very time consuming methods Theyboth rely on computationally expensive nearest neighbour computations in conjunc-tion with a synthetic sample generation procedure that requires a single pass over allminority training instances

The RUS method shows as expected superior results in terms of computationaltimings Other undersampling methods perform quite well though they are slowerthan BL and OSR due to their inherent nearest neighbour computations

The boosting variants seem to perform worse than the BL, OSR and each of the undersampling methods. Since EE was the best method in terms of AUC-performance (see Section 4.6.1), we focus on this learner in this paragraph. For the medium sized datasets (all but CRF and Bank), OSR outperforms EE(S = 15); yet for the large datasets (CRF and Bank) the opposite is true. An SVM is known to have a quadratic computational complexity in the number of training instances (Suykens et al 2002; García and Lozano 2007). For this reason we consider OSR (or any other oversampling technique) inappropriate for large datasets. Since with EE each subset has a size limited to twice the minority class training size, we expect increasing benefits of the EE-technique for larger and more imbalanced behaviour data. As Junqué de Fortuny et al (2014a) have observed, larger behaviour data, in terms of the number of instances or features, also contributes to an increase in predictive power. Hence, we expect to see this type of big behaviour data more often in real-life applications, where computational resources can become a major constraint. Because the EE-technique is easily parallelizable, by simultaneously performing the boosting process for each of the independent subsets, we added an extra row in Table 9: EE par corresponds with the time needed for EE(S = 15) divided by 15, indicating the time that would be required for EE if each subset were trained in parallel. We can see that EE par outperforms OSR in 8 out of 16 cases, precisely for the datasets that show a high imbalance level (e.g. p = 1 or p = []).
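A minimal sketch of this idea (plain Python; purely for illustration, the boosted SVM trained on each subset in the paper is replaced here by a trivial class-mean scorer, and the function names are our own):

```python
import random
from concurrent.futures import ThreadPoolExecutor

def train_balanced_subset(args):
    """Train one base learner on a balanced subset: all minority instances
    plus an equally sized random sample of the majority class."""
    minority, majority, seed = args
    rng = random.Random(seed)
    subset_maj = rng.sample(majority, len(minority))  # undersample the majority
    # Illustrative stand-in for the boosted SVM of the paper: a linear scorer
    # given by the difference of per-feature class means.
    n = len(minority[0])
    mean = lambda rows, j: sum(r[j] for r in rows) / len(rows)
    return [mean(minority, j) - mean(subset_maj, j) for j in range(n)]

def easy_ensemble_scores(minority, majority, X_test, S=15):
    """EasyEnsemble-style scoring: the S subsets are independent, so they can
    be trained in parallel (the assumption behind the 'EE par' row of Table 9)."""
    with ThreadPoolExecutor() as pool:
        models = list(pool.map(train_balanced_subset,
                               [(minority, majority, s) for s in range(S)]))
    score = lambda x, w: sum(wi * xi for wi, xi in zip(w, x))
    # Combine the S hypotheses by averaging their scores.
    return [sum(score(x, w) for w in models) / len(models) for x in X_test]
```

Since each subset holds only twice the minority class size, every individual training job stays small regardless of how large the majority class grows, which is why the benefit increases with the imbalance level.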

We conclude our comparison of the proposed methods in Figure 5, which shows the average ranks in terms of computational timings (see Table 9) and AUC performance for each of the available data sources. Regarding the ranks with respect to the latter, we added an extra row EE par, having the same AUC-performance as EE(S = 15) for each dataset in Table 5. Ranks are subsequently determined according to the procedure outlined in Section 4.6.1. Additionally, Figure 18 in Appendix D shows the result for the large and highly imbalanced CRF and Bank datasets. We expect the latter type of data to be of major interest for reasons already indicated.


Table 9 Computational timings (in seconds), averaged across ten folds, using the parameter combination resulting in the highest validation set AUC-performance. First and second best performances are emphasized in boldface (third and fourth best are underlined). EE par is the time required for EE(S = 15) divided by 15.

Mov G(p = 1) Mov G(p = 25) Mov Th(p = []) Yahoo A(p = 1)

BL 0032889 0056697 0558563 0026922OSR 0055043 0062802 099009 0044421

SMOTE 0218821 0937057 3841482 0057726ADASYN 0284688 1802399 5191265 0087694

RUS 0011431 0025383 0155224 0007991CL Knn 0046599 0599846 0989914 0037182Far Knn 0039887 080072 0683023 0027788

CBU 1034111 1060173 6822839 1692477AB 0169792 0841443 3460246 0139251

AC(R = 28) 0471994 2996585 1086907 0366555AC(R = RL) 053376 1179542 6065177 0209015EE(S = 10) 0117226 6065145 117995 0148973EE(S = 15) 020474 7173737 2119991 0180365

EE par 0013649 0478249 0141333 0012024

Yahoo A(p = 25) Yahoo G(p = 1) Yahoo G(p = 25) TaFeng(p = 1)

BL 0092954 0011915 0044164 0026728OSR 0027887 0013241 0047206 0040919

SMOTE 1062686 0056153 0883698 0219553ADASYN 2050993 0079073 1733367 0306618

RUS 0048471 0003234 0033423 0002916CL Knn 084391 0025404 0502515 0092167Far Knn 0664124 0026576 0500206 0080159

CBU 1569442 1287221 1355035 2467279AB 0445546 0078777 0169977 0114619

AC(R = 28) 1034044 0321723 0515953 0926178AC(R = RL) 0706215 0226741 0112949 0610233EE(S = 10) 1026577 0100331 1527146 0058052EE(S = 15) 1607596 0077483 2472582 010538

EE par 0107173 0005166 0164839 0007025

TaFeng(p = 25) Book(p = 1) Book(p = 25) LST(p = 1)

BL 0032033 0080035 0318093 0652045OSR 0032414 0132927 0092757 087152

SMOTE 5089283 3409418 1143444 4987705ADASYN 8148419 3689661 1225441 6840083

RUS 0020457 0022713 0031972 0432839CL Knn 1713731 0400873 3711648 2508374Far Knn 1539437 0379086 3988552 2511037

CBU 2642686 4198663 4631987 []AB 0713265 061719 1238585 2466151

AC(R = 28) 1234647 1666131 2330635 1451671AC(R = RL) 0279047 0860346 0197053 123763EE(S = 10) 2484502 2145747 7177484 0524066EE(S = 15) 3363971 2480066 1121945 0784111

EE par 0224265 0165338 0747963 0052274


Table 9 Continued. Additionally, an extra column indicating average ranks is included (lower ranks are preferred).

Adver(p = []) Adver(p = 1) CRF(p = []) Bank(p = [])

BL 0010953 0002796 0725911 7089334OSR 0012178 0006166 3685813 1797481

SMOTE 0123112 0017764 5633862 []ADASYN 0183767 0021728 5768669 []

RUS 0012115 000204 0147392 5247441CL Knn 0061324 0005568 1106755 7373282Far Knn 0079078 0007069 1110379 9759619

CBU 3378235 3236754 [] []AB 0069199 0103518 1153196 8308618

AC(R = 28) 0193092 0068905 2047434 7170548AC(R = RL) 0107652 0037963 1387174 1063466EE(S = 10) 0138485 0085686 0198656 2495117EE(S = 15) 0185136 0139121 0285345 3640107

EE par 0012342 0009275 0019023 2426738

Average Rank [pos]

BL            2.94  [2]
OSR           4.19  [4]
SMOTE         9.59  [11]
ADASYN       10.91  [13]
RUS           1.38  [1]
CL Knn        6.5   [5]
Far Knn       6.56  [6]
CBU          14     [14]
AB            8.06  [7]
AC(R = 28)   10.81  [12]
AC(R = RL)    9.25  [9]
EE(S = 10)    8.25  [8]
EE(S = 15)    9.56  [10]
EE par        3     [3]

Timings are a substantial concern with respect to these large datasets. As can be observed, the EE-technique thrives for these big and highly imbalanced data, even in its non-parallel form.


[Figure 5 appears here in the original.]

Fig. 5 Average rank AUC versus average rank Time (see Table 9) across all datasets from Table 2 (excluding Flickr and Kdd). Regarding the AUC rank, an extra row EE par is added, having the same AUC-value as EE(S = 15). Ranks for each dataset are subsequently obtained via the procedure outlined in Section 4.6.1. Points occurring in the upper-right region are preferred.

5 The effect of the chosen base learner

In this short section we investigate the influence of the chosen base learner. More specifically, we ask ourselves whether some base inducers can more easily cope with the imbalanced learning issue, in doing so verifying the conclusion that EE is a suitable technique to elevate upon the baseline performance. We rely on several popular classification methods for behaviour data: two regularization based techniques (SVM, LR) and two heuristic types of approaches (Naive Bayes (NB), Behavioural Similarity (BeSim)). The next paragraphs briefly introduce these methods. Note that source codes and dataset results are provided in our on-line repository.

LR constructs a linear model that maximizes the likelihood of the observed data by imposing a logistic model for the parametric probability model P(Y|X). Ng and Jordan (2002) provide a thorough introduction to the topic27 and note that, in its plain form, LR can suffer from overfitting, especially when the input data is very high dimensional. We therefore resort to regularized logistic regression. We opt for L2-regularized LR, since L1-regularization corresponds with a natural way of imposing feature selection (Ng 2004). The LIBLINEAR toolbox (Fan et al 2008) is used to obtain these L2-regularized LR models.

27 For an accessible introduction, see the chapter on "Generative and Discriminative Classifiers: Naive Bayes and Logistic Regression" provided at the following link: http://www.cs.cmu.edu/~tom/NewChapters.html


NB (Ng and Jordan 2002) relies on the use of Bayes' rule, which expresses the posterior P(Y|X) as the product of the likelihood P(X|Y) and the prior P(Y), divided by the evidence P(X). The naivety of the approach originates from assuming conditional independence of the features. We impose a multivariate event model for the distribution P(X|Y). Junqué de Fortuny et al (2014a) provide an efficient NB implementation for dealing with large and sparse binary behaviour data.
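A compact sketch of the multivariate (Bernoulli) event model on sparse binary data (plain Python with Laplace smoothing; this toy code is our own illustration, not the optimized implementation of Junqué de Fortuny et al (2014a)):

```python
import math
from collections import defaultdict

def train_bernoulli_nb(rows, labels):
    """rows: list of sets of active feature ids (sparse binary instances).
    Returns log-priors and Laplace-smoothed P(x_j = 1 | y = c) per class."""
    classes = set(labels)
    n_feat = max(max(r, default=0) for r in rows) + 1
    prior, cond = {}, {}
    for c in classes:
        rows_c = [r for r, y in zip(rows, labels) if y == c]
        prior[c] = math.log(len(rows_c) / len(rows))
        counts = defaultdict(int)
        for r in rows_c:
            for j in r:
                counts[j] += 1
        cond[c] = [(counts[j] + 1) / (len(rows_c) + 2) for j in range(n_feat)]
    return prior, cond

def predict(prior, cond, row):
    """Conditional independence: sum log P(x_j | c) over ALL features,
    using P(x_j = 0 | c) = 1 - P(x_j = 1 | c) for inactive features."""
    def log_post(c):
        return prior[c] + sum(math.log(p if j in row else 1 - p)
                              for j, p in enumerate(cond[c]))
    return max(prior, key=log_post)
```

Note that the Bernoulli (multivariate) event model explicitly accounts for absent features, which is what distinguishes it from the multinomial event model.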

The BeSim approach conceptually corresponds to projecting the bigraph to a unigraph using the procedure outlined in Section 3.2. Next, the weighted-vote relational neighbour classifier (Macskassy and Provost 2007) is used to infer unknown labels/scores through a weighted probability estimation using the known labels of neighbouring nodes. A very fast and scalable implementation of this approach (the SW-transformation) is presented in detail in Stankova et al (2015).
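A conceptual sketch of this two-step idea (hypothetical toy code with a simple shared-neighbour weighting scheme of our own choosing; the actual SW-transformation of Stankova et al (2015) avoids materializing the projection and is far more scalable):

```python
from collections import defaultdict

def besim_scores(edges, labels):
    """edges: (top_node, bottom_node) pairs of the bigraph, e.g. (user, item).
    labels: dict mapping some top nodes to 0/1. Returns scores for unlabeled tops."""
    # Step 1: project the bigraph onto the top nodes; here two top nodes
    # become neighbours with a weight equal to their number of shared
    # bottom nodes (one simple choice of weighting, used for illustration).
    by_bottom = defaultdict(set)
    for top, bottom in edges:
        by_bottom[bottom].add(top)
    w = defaultdict(float)
    for tops in by_bottom.values():
        for a in tops:
            for b in tops:
                if a != b:
                    w[(a, b)] += 1.0
    # Step 2: weighted-vote relational neighbour classifier: the score of an
    # unlabeled node is the weighted average of its labelled neighbours' labels.
    all_tops = {t for t, _ in edges}
    scores = {}
    for node in all_tops - labels.keys():
        num = sum(w[(node, nb)] * y for nb, y in labels.items())
        den = sum(w[(node, nb)] for nb in labels)
        scores[node] = num / den if den else 0.0
    return scores
```

For example, an unlabeled user who shares items only with a positively labelled user receives score 1.0, while one sharing equally with a positive and a negative user receives 0.5.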

In Table 10 we compare the performances of the baseline (BL) application of the methods with their EE-version (with S = 15, T = 20). It is quite striking to see the EE-version dominating the baseline version for each method individually. This suggests that the imbalanced learning issue has an impact on each of the aforementioned techniques. Focussing on the regularization based approaches, we observe the baseline LR to be less influenced by the imbalance than the SVM method. Their EE-versions (EE SVM and EE LR) can boost the baseline performance in almost all cases and have the highest average ranks amongst all the proposed methods. As already emphasized, the regularization based approaches offer an added element of flexibility in the sense that the strength of the underlying base learner can be controlled.

The heuristic approaches NB and BeSim do not offer this flexibility. Furthermore, it is not directly possible to integrate instance weights during the training phase of the base learner. During the boosting process, we therefore sample from the distribution D_t to generate (unweighted) examples and train a BeSim/NB learner next. The results in Table 10 indicate BL NB to be the worst performing algorithm amongst all methods tried. Due to the conditional independence assumption, one could regard this as a weak learner, which makes it suitable for use in a boosting process. Indeed, the EE NB approach is able to significantly improve upon its performance. We found this to be the slowest EE-version28 (e.g. taking more than 24 hours to process a single fold of the Bank dataset, versus several minutes for the other versions). The average rank estimate of EE NB is rather pessimistic, since we assigned the lowest performance across all methods in case of missing results. BeSim can be considered a stronger learner than NB; its possible boost in performance is typically lower than that of NB. Having said this, we found EE BeSim to improve upon BL BeSim for all datasets (except Adver(p = 1)).

28 This statement only relates to the EE-version. In the baseline case, the timings of the NB-version are roughly of the same order of magnitude as SVM and LR (though this heavily depends on the chosen regularization parameter C). With EE, each subset has a size limited to twice the minority class training size (a relatively low number compared to the size of the majority class). This works in favour of SVM/LR (e.g. SVM has a computational complexity of O(m²), with m the number of training instances). Also note that the NB-version is indeed optimized for sparse data. Internally, the NB-implementation writes a fairly large model file to a specified directory (sizes of 1 GB can be encountered there). This writing operation slows down the training stage (especially in case of EE, where S × T = 15 × 20 = 300 model files are constructed).

40 Jellis Vanhoeyveld David Martens

Table 10 Average tenfold AUC test set performance of the BL compared to its EE-version (S = 15, T = 20) for SVM, LR, BeSim and NB. In case the EE NB implementation crashed, we put a × in the table. If it is too slow (> 24 hours per fold), we place a []. Additionally, average ranks are assigned to each method. In case EE NB has missing results for a particular dataset, we ensure it has the lowest rank across all other algorithms.

             Mov G(p = 1)    Mov G(p = 25)   Mov Th(p = [])  Yahoo A(p = 1)
BL SVM       71.6  (2.62)    81.41 (1.32)    79.77 (5.33)    56.49 (3.37)
EE SVM       76.12 (2.88)    85.13 (1.86)    86.43 (5.86)    59.74 (2.96)
BL LR        71.02 (2.09)    84.39 (1.84)    83.14 (4.17)    57.84 (2.39)
EE LR        76.69 (2.92)    85.03 (1.98)    86.3  (5.37)    59.79 (2.62)
BL BeSim     76.1  (3.58)    81.3  (2.92)    82.81 (6.6)     56.27 (2.73)
EE BeSim     76.31 (3.71)    81.37 (2.9)     85.02 (6.28)    57.7  (1.71)
BL NB        70.26 (5.84)    77.01 (2.54)    70.48 (10.14)   52.56 (2.09)
EE NB        75.93 (2.83)    85.56 (2.01)    86.91 (4.15)    57.55 (2.73)

             Yahoo A(p = 25) Yahoo G(p = 1)  Yahoo G(p = 25) TaFeng(p = 1)
BL SVM       61.61 (2.48)    66.84 (3.66)    78.82 (1.39)    55.75 (1.6)
EE SVM       66.38 (3.16)    73.48 (2.32)    80.55 (1.55)    61.13 (1.83)
BL LR        66.27 (2.96)    69.82 (1.93)    80.45 (1.59)    58.91 (2.31)
EE LR        66.22 (3.28)    73.08 (2.14)    80.53 (1.56)    61.43 (2.32)
BL BeSim     64.54 (2.02)    68.89 (2.49)    79.55 (1.96)    57.89 (1.18)
EE BeSim     65.25 (2.23)    71.18 (2.91)    80.04 (1.85)    59.36 (1.47)
BL NB        65.0  (1.65)    63.33 (2.56)    78.89 (1.64)    54.61 (1.2)
EE NB        66.6  (2.79)    70.99 (2.88)    81.01 (1.3)     59.01 (1.84)

             TaFeng(p = 25)  Book(p = 1)     Book(p = 25)    LST(p = 1)
BL SVM       66.94 (1.34)    52.6  (1.29)    60.08 (0.71)    99.99 (0.01)
EE SVM       70.4  (1.3)     55.34 (1.28)    65.4  (0.51)    99.98 (0.01)
BL LR        69.24 (1.3)     55.34 (1.27)    63.84 (0.75)    99.99 (0.01)
EE LR        70.28 (1.28)    55.49 (1.49)    65.41 (0.63)    99.97 (0.02)
BL BeSim     67.49 (1.23)    55.19 (1.27)    63.7  (0.63)    99.99 (0.01)
EE BeSim     68.0  (1.21)    55.21 (1.15)    64.38 (0.42)    99.99 (0)
BL NB        65.21 (1.64)    52.93 (0.9)     59.75 (0.47)    98.69 (0.3)
EE NB        70.72 (1.15)    ×               63.46 (0.61)    99.92 (0.04)

             Adver(p = [])   Adver(p = 1)    CRF(p = [])     Bank(p = [])
BL SVM       96.37 (1.94)    91.18 (2.97)    64.36 (18.97)   66.82 (0.88)
EE SVM       97.63 (1.35)    93.3  (2.14)    86.35 (9.99)    71.54 (0.76)
BL LR        97.19 (1.44)    88.51 (1.93)    81.87 (19.63)   71.43 (0.72)
EE LR        97.57 (0.96)    93.02 (2.06)    86.84 (9.62)    71.77 (0.62)
BL BeSim     97.26 (1.12)    95.38 (1.35)    86.91 (9.36)    67.85 (0.67)
EE BeSim     97.38 (1.04)    93.83 (1.35)    87.02 (10.43)   70.41 (0.55)
BL NB        93.75 (1.9)     93.37 (1.9)     87.24 (9.38)    67.83 (0.63)
EE NB        94.04 (1.75)    ×               ×               []

             Flickr(p = 01)  Kdd(p = 05)     Average Rank
BL SVM       74.92 (0.17)    74.53 (0.05)    6.44 [7]
EE SVM       79.86 (0.13)    80.98 (0.05)    2.39 [1]
BL LR        79.03 (0.11)    81.29 (0.04)    4.28 [4]
EE LR        79.85 (0.13)    80.75 (0.05)    2.61 [2]
BL BeSim     74.62 (0.13)    74.95 (0)       5.11 [6]
EE BeSim     76.4  (0.13)    77.55 (0.03)    3.61 [3]
BL NB        81.36 (0.1)     74.29 (0.05)    6.5  [8]
EE NB        []              []              5.06 [5]

Imbalanced classification in sparse and large behaviour datasets 41

6 Conclusions and future research directions

We investigated the effects of sampling and cost-sensitive learning techniques on the problem of learning from imbalanced behaviour data. This type of big data has fundamentally different properties and internal structure than traditional data and is characterized by sparseness and very large dimensions. This is a new research topic which enables benefits across a wide variety of application domains, such as fraud detection, targeted advertising, predictive policing, churn prediction, default prediction, etc. To enable future developments in this field, we provide an on-line repository containing implementations, datasets and results, available at the following link: http://www.applieddatamining.com/cms?q=software.

Oversampling techniques need to be adapted in order to cope with this kind of data. We proposed new versions of the traditional SMOTE and ADASYN techniques, which differ in the way synthetic samples are generated and similarity computations are performed. The plain random oversampling (OSR) and the synthetic approaches are competitive methods in terms of AUC and do not seem to suffer from overfitting, as traditional studies report. Though the synthetic approaches have comparable AUC-performances to OSR, they are useless in practice due to their computationally expensive procedures. The OSR method is very fast for small and medium sized datasets, yet becomes inappropriate for larger datasets because the training times of the underlying base learner increase drastically.
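
For concreteness, plain random oversampling (OSR) can be sketched as follows. The `beta` parameter mirrors the oversampling degree β used in Appendix A, and dense arrays stand in for the sparse matrices used in practice; this is an illustrative sketch, not the paper's implementation.

```python
import numpy as np

def random_oversample(X, y, beta=1.0, minority=1, rng=None):
    """Plain random oversampling (OSR) sketch: duplicate randomly chosen
    minority rows until the class-size gap is closed by a fraction beta
    (beta = 1 yields a fully balanced training set). Shown on dense
    arrays for brevity; on behaviour data one would stack scipy.sparse
    CSR rows instead, keeping the duplication cheap."""
    rng = np.random.default_rng(rng)
    min_idx = np.flatnonzero(y == minority)
    maj_idx = np.flatnonzero(y != minority)
    n_extra = int(beta * (len(maj_idx) - len(min_idx)))  # copies to add
    extra = rng.choice(min_idx, size=n_extra, replace=True)
    return np.vstack([X, X[extra]]), np.concatenate([y, y[extra]])
```

The cost of OSR is borne by the base learner: with beta = 1 the training set grows to roughly twice the majority class size, which explains the drastic increase in training times on large datasets.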

With respect to undersampling methodologies, we employed a random undersampling technique (RUS) and Knn-based informed approaches. The latter type selects majority class instances that are closest (Cl Knn) or farthest (Far Knn) from the minority class examples and allows one to investigate the effect of noise/outliers on SVM performance. We observed that these odd instances have a severe performance degrading effect. RUS and Far Knn approaches can significantly improve performance with respect to the baseline, thanks to their abilities to reduce noise/outlier levels and to put more emphasis on the minority class. Having said this, they usually perform worse than any of the oversampling approaches. This is in sheer contrast to the results obtained from traditional studies and is heavily related to the intrinsic nature of behaviour data. Timing-wise, RUS is the fastest method. The informed approaches are slower than RUS due to the required nearest neighbour computations. Additionally, a cluster-based undersampling (CBU) technique was proposed to alleviate the within-class imbalance. For low undersampling rates, the performance of CBU is superior to plain RUS, whereas for high undersampling rates the opposite is true. Overall, no substantial gain in performance is obtained in comparison with a simple RUS approach. Furthermore, the method is the slowest among all techniques considered and relies on computationally expensive clustering heuristics.
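
As an illustration of the informed undersampling idea, a minimal "farthest-first" selection in the spirit of Far Knn can be sketched as follows; the cosine-similarity scoring, the parameter names and the assumption of row-normalized dense arrays are our own simplification, not the paper's exact procedure.

```python
import numpy as np

def far_knn_undersample(X_maj, X_min, keep, k=5):
    """Informed undersampling sketch in the spirit of 'Far Knn': retain
    the `keep` majority instances whose mean cosine similarity to their
    k most similar minority instances is LOWEST (farthest from the
    minority class), discarding majority points that overlap the
    minority region. Rows are assumed L2-normalized."""
    sim = X_maj @ X_min.T                  # cosine similarities
    topk = np.sort(sim, axis=1)[:, -k:]    # k most similar minority points
    score = topk.mean(axis=1)              # closeness to the minority class
    order = np.argsort(score)              # ascending: farthest first
    return order[:keep]                    # indices of retained majority rows
```

Selecting the closest instances instead (Cl Knn) amounts to reversing the sort order, which is exactly the variant that suffered from the noise/outlier effect described above.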

We studied the effects of boosting, where we made use of a very specific linear SVM and logistic regression combination that uses confidence rated predictions instead of a more popular plain "discrete" boosting algorithm (with weak hypotheses having values in {-1, 1}). Our experiments clearly indicated the regularization constant C in the SVM formulation to act as a "weakness" indicator. Indeed, higher C-values cause stronger learners and shouldn't be used in the boosting process, which matches with the analysis of studies dealing with traditional data. Strong learners can


already quite accurately distinguish majority class from minority class behaviours. The boosting process will focus on the hard to learn instances, among which are noise/outliers, and this can result in overfitting behaviour. We also observed that boosting can usually outperform the plain baseline, indicating that a combination of weak learners can achieve higher AUC-performance than a single strong learner.
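
The confidence-rated boosting loop described above can be sketched generically. The re-weighting follows the real-valued (Schapire and Singer style) scheme; `train_fn` stands in for the weighted linear SVM/LR base learner, whose exact form is not reproduced here.

```python
import numpy as np

def confidence_boost(train_fn, X, y, rounds=20):
    """Confidence-rated boosting sketch: each weak hypothesis returns a
    real-valued score f_t(x), the instance distribution D_t is re-weighted
    by exp(-y * f_t(x)) (emphasizing misclassified or low-margin points),
    and the ensemble score is the sum of the f_t. `train_fn(X, y, D)` must
    return a scoring callable; labels y take values in {-1, +1}."""
    n = len(y)
    D = np.full(n, 1.0 / n)
    hyps = []
    for _ in range(rounds):
        h = train_fn(X, y, D)
        f = h(X)
        D = D * np.exp(-y * f)     # up-weight hard instances
        D = D / D.sum()            # renormalize to a distribution
        hyps.append(h)
    return lambda Xq: sum(h(Xq) for h in hyps)
```

With a low-C (weak) base learner the scores f_t stay small and the loop keeps redistributing weight; a strong learner drives the weights onto a handful of noisy instances, which is the overfitting mechanism noted above.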

EasyEnsemble (EE) integrates random undersampling with a boosting process. By sampling several balanced subsets from the training data and combining these, we can emphasize the minority class, achieve noise/outlier reduction effects and simultaneously explore the majority class space efficiently. Our version differs from the original formulation because we employ a different type of underlying weak learner (SVM and LR combination) that is fed to a confidence rated boosting algorithm (instead of the discrete version). We observed superior AUC-performance with respect to all of the aforementioned techniques. Furthermore, the method is very fast, since each subset can be trained in parallel and has a size that is only twice as large as the minority class training set. We studied the effect of the number of subsets and noted that a limited amount of these (O(10)) is already sufficient to obtain satisfactory performances; increasing this number further causes only minor benefits.
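
A minimal sketch of the EasyEnsemble recipe follows: balanced subset sampling plus averaging of the per-subset models. The boosting step inside each subset is abstracted away into `train_fn`; names and parameters are illustrative.

```python
import numpy as np

def easy_ensemble(train_fn, X, y, S=15, minority=1, rng=None):
    """EasyEnsemble sketch: draw S independent balanced subsets (all
    minority instances plus an equally sized random sample of the
    majority class), train a learner on each subset via `train_fn`
    (in the paper, a boosted SVM/LR combination), and average the
    resulting scoring functions. Subsets are independent, so the S
    training runs are embarrassingly parallel."""
    rng = np.random.default_rng(rng)
    min_idx = np.flatnonzero(y == minority)
    maj_idx = np.flatnonzero(y != minority)
    models = []
    for _ in range(S):
        sub_maj = rng.choice(maj_idx, size=len(min_idx), replace=False)
        idx = np.concatenate([min_idx, sub_maj])  # subset of size 2*|minority|
        models.append(train_fn(X[idx], y[idx]))
    return lambda Xq: np.mean([m(Xq) for m in models], axis=0)
```

Each subset has size 2 × |minority|, which is why training stays cheap even on heavily imbalanced data, while the union of the S majority samples still covers a large part of the majority class space.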

Additionally, a statistical comparison of each of the proposed methods was conducted across a wide variety of datasets. Holm's significance test indicates that each of the proposed techniques (oversampling, undersampling except Cl Knn, cost-sensitive learning and boosting variants) can significantly outperform the plain baseline performance (each of the null-hypotheses of equal performance gets rejected with a confidence level of at least 91%). Comparison of the techniques with EE showed that the latter can significantly outperform all of them (each of the null-hypotheses of equal performance gets rejected with a confidence level of at least 75%). A timing-wise comparison showed EE to be the preferred method (after RUS) for large and highly imbalanced datasets. A parallel implementation would make EE the fastest method even for medium sized datasets that show a high level of imbalance.
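
Holm's step-down procedure used in these comparisons can be sketched in a few lines; the p-values in the usage example are illustrative, not those from the study.

```python
def holm_correction(pvals, alpha=0.05):
    """Holm's step-down multiple-comparison procedure: sort the m
    p-values ascending, compare the i-th smallest (0-based) against
    alpha / (m - i), and reject hypotheses until the first failure,
    after which all remaining hypotheses are retained."""
    m = len(pvals)
    order = sorted(range(m), key=lambda i: pvals[i])
    reject = [False] * m
    for rank, i in enumerate(order):
        if pvals[i] <= alpha / (m - rank):
            reject[i] = True
        else:
            break                     # step-down: stop at first failure
    return reject
```

Compared with a plain Bonferroni correction, the step-down thresholds grow as hypotheses are rejected, which gives Holm's procedure strictly more power at the same family-wise error rate.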

The influence of the chosen base learner was investigated by comparing the baseline to its EE-version for the SVM, LR, NB and BeSim inducers. Each of the EE-versions is able to improve upon the baseline performance for nearly all datasets, where base learners of the regularization type are superior because of their ability to control the strength of the learner.

Though we conducted an extensive benchmark study, there are still plenty of future research directions. First, and most important, there is an urgent need to develop a more general theoretical framework to completely understand the effect of imbalance and why certain methods outperform others. Secondly, other types of mechanisms could be investigated in this setting: advanced sampling (Guo et al 2008) set-ups perform sampling based on preliminary classifications; one-class learning is a recognition based methodology where a discriminative boundary is learned around the training examples of a single class alone, and Raskutti and Kowalczyk (2004) demonstrated the superiority of the recognition based one-class SVMs over the discriminative two-class SVMs; other types of ensemble learning strategies (e.g. bagging and stacking) could be investigated; Batista et al (2004) note that combining over- and undersampling can be beneficial when data are highly imbalanced or show very few minority class instances; the ideas behind Extended Nearest Neighbour (Tang and


He 2015) could guide us in the sampling process. As opposed to Knn, "it considers not only who are the nearest neighbours of a certain sample, but also who consider the particular sample as their nearest neighbour" (Tang and He 2015). The method assigns a class membership to an unknown sample that maximizes the intra-class coherence. Finally, more advanced techniques exist that can determine K (the number of nearest neighbours) faster or with (slightly) better performance.

Another issue relates to the lack of comprehensibility of the EasyEnsemble technique in its present form. Because we currently integrate Logistic Regression in the boosting process, the final ensemble classifier is non-linear and therefore incomprehensible. It might be interesting to investigate the numerical boosting technique proposed in Schapire and Singer (1999), where each weak learner is allowed to output real values (instead of values in [-1, 1]). In that case, we would be able to use a plain linear SVM as a weak learner, without the Logistic Regression component. The final ensemble would be a linear model, which is surely easier to comprehend. Also, the instance-based approach of Martens and Provost (2014) could be used to explain individual predictions of the non-linear EE-technique.

A Oversampling experiments

Table 11 Oversampling experiments. Results show the average tenfold AUC test set performance with respect to increasing β-values (β = [β1, β2, β3, β4] = [0, 1/3, 2/3, 1]). Additionally, standard deviations are shown between brackets. Optimal performances for each method are highlighted in boldface. See Section 4.2 for parameter settings.

Columns in each block: β1, β2, β3, β4.

Mov G(p = 1)
OSR      71.6  (2.62)   74.37 (2.04)   73.6  (1.84)   74.73 (2.45)
SMOTE    71.6  (2.62)   75.08 (2.18)   76.02 (2.14)   76.48 (2.3)
ADASYN   71.6  (2.62)   75.16 (1.92)   75.93 (2.08)   76.47 (2.29)

Mov G(p = 25)
OSR      81.41 (1.32)   83.49 (1.81)   83.84 (1.96)   83.91 (2.04)
SMOTE    81.41 (1.32)   83.32 (1.97)   83.59 (2.04)   83.76 (2.11)
ADASYN   81.41 (1.32)   83.61 (1.82)   84.02 (1.97)   83.69 (1.96)

Mov Th(p = [])
OSR      79.77 (5.33)   85.3  (4.66)   83.16 (4.5)    84.59 (5.69)
SMOTE    79.77 (5.33)   84.18 (6.51)   85.58 (5.97)   84.33 (5.75)
ADASYN   79.77 (5.33)   84.11 (6.77)   85.86 (6.06)   85.36 (5.13)

Yahoo A(p = 1)
OSR      55.92 (2.97)   58.66 (3.27)   59.99 (2.28)   59.74 (1.78)
SMOTE    55.92 (2.97)   59.76 (2.62)   59.74 (2.67)   59.43 (2.4)
ADASYN   55.92 (2.97)   59.54 (2.53)   59.55 (2.94)   59.56 (2.22)

Yahoo A(p = 25)
OSR      61.68 (2.42)   64.19 (3.17)   65.08 (3.26)   64.67 (2.1)
SMOTE    61.68 (2.42)   65.46 (3.63)   65.33 (3.23)   64.52 (2.98)
ADASYN   61.68 (2.42)   65.04 (3.74)   65.41 (3.47)   64.4  (2.21)

Yahoo G(p = 1)
OSR      66.84 (3.66)   72.18 (2.36)   73.11 (2.7)    72.49 (3.41)
SMOTE    66.84 (3.66)   72.65 (2.85)   73.27 (3.36)   73.37 (3.56)
ADASYN   66.84 (3.66)   72.87 (2.83)   73.18 (3.2)    73.39 (3.59)

Yahoo G(p = 25)
OSR      78.82 (1.39)   78.78 (2.02)   78.74 (1.86)   78.35 (1.47)
SMOTE    78.82 (1.39)   79.23 (1.57)   79.1  (1.2)    79.03 (1.89)
ADASYN   78.82 (1.39)   79.12 (1.43)   79.23 (1.37)   79.51 (2.01)

TaFeng(p = 1)
OSR      55.75 (1.6)    59.23 (1.96)   60.0  (1.68)   61.04 (2.36)
SMOTE    55.75 (1.6)    60.26 (1.95)   61.49 (1.8)    61.13 (1.52)
ADASYN   55.75 (1.6)    60.26 (1.9)    61.44 (1.85)   61.16 (1.5)

TaFeng(p = 25)
OSR      66.94 (1.34)   68.84 (1.21)   66.99 (1.42)   67.7  (1.41)
SMOTE    66.94 (1.34)   68.47 (1.5)    67.07 (1.15)   66.65 (0.81)
ADASYN   66.94 (1.34)   68.62 (1.38)   67.85 (1.6)    66.91 (1.39)

Book(p = 1)
OSR      52.6  (1.29)   53.61 (0.94)   55.41 (1.75)   55.87 (1.44)
SMOTE    52.6  (1.29)   54.77 (0.99)   54.91 (0.8)    54.36 (0.98)
ADASYN   52.6  (1.29)   54.86 (1.13)   55.06 (0.73)   54.54 (0.92)

Book(p = 25)
OSR      60.08 (0.71)   61.05 (0.94)   61.82 (1.12)   64.62 (0.57)
SMOTE    60.08 (0.71)   62.6  (0.73)   60.95 (0.68)   63.0  (0.8)
ADASYN   60.08 (0.71)   62.33 (0.73)   60.77 (0.85)   63.04 (0.58)

LST(p = 1)
OSR      99.99 (0.01)   99.99 (0.01)   99.99 (0.01)   99.99 (0.01)
SMOTE    99.99 (0.01)   99.99 (0.01)   99.99 (0.01)   99.99 (0.01)
ADASYN   99.99 (0.01)   99.99 (0.01)   99.99 (0.01)   99.99 (0.01)

Adver(p = [])
OSR      96.61 (1.82)   97.31 (1.65)   97.07 (1.84)   97.07 (1.79)
SMOTE    96.61 (1.82)   96.91 (1.66)   97.19 (1.65)   97.07 (1.91)
ADASYN   96.61 (1.82)   97.1  (1.7)    97.08 (1.87)   97.07 (1.88)

Adver(p = 1)
OSR      90.93 (3.02)   91.27 (3.03)   92.66 (2.82)   93.29 (1.97)
SMOTE    90.93 (3.02)   92.51 (2.03)   92.96 (2.14)   93.53 (1.81)
ADASYN   90.93 (3.02)   92.22 (2.33)   92.7  (2.36)   93.88 (1.73)

CRF(p = [])
OSR      64.06 (16.43)  80.82 (12.94)  81.28 (12.27)  81.91 (11.28)
SMOTE    64.06 (16.43)  78.64 (16.86)  82.52 (13.74)  79.32 (16.26)
ADASYN   64.06 (16.43)  78.95 (16.72)  81.19 (16.32)  79.31 (16.19)

Bank(p = [])
OSR      66.82 (0.88)   70.1  (0.74)   71.39 (0.8)    71.47 (0.8)
SMOTE
ADASYN


B Undersampling experiments

Table 12 Undersampling experiments. Results show the average tenfold AUC test set performance with respect to increasing βu-values (βu = [βu1, βu2, βu3, βu4, βu5] = [0, 1/4, 1/2, 3/4, 1]). Optimal performances (boldface) are highlighted for each imbalance ratio. Additionally, standard deviations are shown between brackets. Cl K represents the "Closest Knn" technique; Cl T represents the "Closest tot sim" technique (similar for Far K and Far T; see Section 3.2). Parameter settings are given in Section 4.2.

Columns in each block: βu1, βu2, βu3, βu4, βu5.

Mov G(p = 1)
RUS     71.6  (2.6)   71.83 (2.6)   72.54 (2.5)   72.39 (3.1)   70.61 (3.5)
Cl K    71.6  (2.6)   71.4  (2)     70.96 (1.9)   70.43 (2.4)   69.05 (4.1)
Cl T    71.6  (2.6)   70.28 (2.5)   66.74 (2)     66.8  (2.1)   68.18 (3.6)
Far K   71.6  (2.6)   72.36 (2.7)   71.26 (3.4)   66.57 (5.2)   53.5  (3.5)
Far T   71.6  (2.6)   72.22 (2.8)   71.63 (3.6)   64.28 (5.3)   50.88 (4.4)
CBU     72.55 (2.6)   73.28 (2.6)   73.12 (2.6)   73.84 (2.5)   73.0  (3.1)

Mov G(p = 25)
RUS     81.41 (1.3)   81.36 (1.3)   81.78 (1.7)   82.05 (1.7)   81.6  (2.1)
Cl K    81.41 (1.3)   80.86 (1.2)   80.95 (1.6)   79.73 (2.3)   77.95 (2.3)
Cl T    81.41 (1.3)   79.9  (1.2)   78.21 (1.4)   77.87 (1.5)   77.76 (2.3)
Far K   81.41 (1.3)   80.9  (1.5)   78.17 (1.8)   74.25 (2.4)   69.79 (3.2)
Far T   81.41 (1.3)   80.86 (1.5)   77.2  (2.4)   71.16 (2.7)   62.4  (2.8)
CBU     81.53 (1.4)   81.64 (1.3)   81.29 (1.6)   81.28 (2.1)   80.34 (2.7)

Mov Th(p = [])
RUS     79.77 (5.3)   80.32 (5.8)   81.57 (5.5)   81.86 (6.6)   81.26 (6.2)
Cl K    79.77 (5.3)   79.25 (4.5)   78.07 (5)     76.25 (6.5)   62.46 (8.5)
Cl T    79.77 (5.3)   78.4  (4.4)   72.41 (3.5)   64.66 (4.5)   60.37 (7.3)
Far K   79.77 (5.3)   84.54 (5)     83.64 (6.4)   80.02 (7.3)   56.82 (10.3)
Far T   79.77 (5.3)   85.03 (5.7)   82.68 (6.8)   75.61 (9.2)   56.77 (10.9)
CBU     80.11 (5.8)   81.17 (6)     81.08 (6.5)   84.17 (5.1)   80.96 (6.9)

Yahoo A(p = 1)
RUS     55.92 (3)     55.57 (3.4)   56.44 (3)     55.83 (3.4)   56.37 (3.3)
Cl K    55.92 (3)     55.67 (2.4)   53.12 (2)     50.57 (1.8)   53.79 (3.5)
Cl T    55.92 (3)     55.69 (2.1)   53.35 (2.2)   50.31 (2.2)   52.35 (3.3)
Far K   55.92 (3)     57.35 (2.2)   56.92 (1.1)   56.95 (2.3)   51.18 (2)
Far T   55.92 (3)     56.93 (2.4)   54.74 (1.9)   57.01 (1.8)   51.18 (2)
CBU     58.21 (2.6)   58.45 (3.3)   58.31 (3.5)   58.39 (3.5)   56.09 (2.6)

Yahoo A(p = 25)
RUS     61.68 (2.4)   62.9  (2.9)   63.62 (3.6)   63.75 (3.1)   63.19 (1.9)
Cl K    61.68 (2.4)   61.14 (2.1)   57.62 (1.6)   54.02 (1.8)   51.48 (1.4)
Cl T    61.68 (2.4)   60.89 (2.8)   58.11 (1.4)   54.4  (2.1)   51.76 (1.4)
Far K   61.68 (2.4)   63.96 (3)     62.62 (2.2)   59.61 (1.5)   56.25 (1.6)
Far T   61.68 (2.4)   63.71 (2.4)   59.72 (1.6)   57.27 (1.1)   54.47 (1.1)
CBU     62.46 (2.6)   61.85 (1.4)   61.78 (2.2)   59.94 (3)     60.1  (4)

Yahoo G(p = 1)
RUS     66.84 (3.7)   67.85 (3.2)   68.36 (3.2)   68.23 (4)     69.9  (4.2)
Cl K    66.84 (3.7)   66.71 (2.8)   64.3  (3.6)   61.98 (3.9)   61.15 (1.9)
Cl T    66.84 (3.7)   65.79 (2.7)   63.55 (3.3)   59.21 (3.5)   61.08 (2.4)
Far K   66.84 (3.7)   66.76 (4.1)   63.84 (3.4)   65.16 (2)     48.5  (2.9)
Far T   66.84 (3.7)   66.95 (4.1)   63.48 (2.9)   65.16 (2)     48.48 (2.9)
CBU     69.68 (4.1)   70.59 (3.2)   70.64 (3.7)   70.2  (2.9)   63.35 (3.6)

Yahoo G(p = 25)
RUS     78.82 (1.4)   78.91 (1.6)   78.97 (1.6)   78.61 (1.6)   77.82 (2.1)
Cl K    78.82 (1.4)   77.26 (1.5)   72.52 (1.5)   67.86 (2)     65.07 (2.7)
Cl T    78.82 (1.4)   76.83 (1)     71.99 (1.8)   67.15 (2.3)   61.1  (2.7)
Far K   78.82 (1.4)   78.26 (2.2)   74.69 (2.7)   67.22 (2.1)   60.72 (2.3)
Far T   78.82 (1.4)   77.68 (2.6)   72.44 (3)     64.94 (2.4)   59.6  (2)
CBU     75.25 (3.2)   75.22 (2.4)   74.69 (2.3)   73.07 (2.4)   70.69 (2.4)

TaFeng(p = 1)
RUS     55.75 (1.6)   56.1  (1.6)   56.26 (1.7)   57.23 (1.7)   59.25 (2.2)
Cl K    55.75 (1.6)   55.68 (1.6)   55.58 (1.5)   55.08 (1.1)   51.05 (1.5)
Cl T    55.75 (1.6)   55.67 (1.6)   54.47 (1.6)   47.53 (1.6)   49.3  (1.1)
Far K   55.75 (1.6)   58.99 (1.2)   59.47 (1.1)   60.04 (1.2)   56.31 (1)
Far T   55.75 (1.6)   58.92 (1.3)   59.25 (1.3)   58.58 (1.1)   56.31 (1)
CBU     57.8  (1)     58.47 (1.1)   58.15 (0.9)   58.87 (1.4)   57.65 (1.6)

TaFeng(p = 25)
RUS     66.94 (1.3)   67.44 (1.3)   68.1  (1.4)   68.27 (1.4)   66.13 (1.2)
Cl K    66.94 (1.3)   66.13 (1.4)   63.39 (1.2)   59.83 (1.3)   56.94 (0.7)
Cl T    66.94 (1.3)   66.38 (1.5)   62.89 (1.6)   57.46 (1.3)   54.56 (1.3)
Far K   66.94 (1.3)   68.06 (1.4)   66.43 (1.6)   64.46 (1.5)   63.35 (1.3)
Far T   66.94 (1.3)   64.31 (1.1)   62.69 (1)     61.27 (1.1)   59.03 (1)
CBU     64.81 (1.2)   64.15 (1.1)   64.13 (1.2)   63.88 (0.8)   63.46 (0.8)

Book(p = 1)
RUS     52.6  (1.3)   52.79 (0.9)   53.46 (0.8)   53.89 (0.9)   54.05 (0.9)
Cl K    52.6  (1.3)   52.56 (1.2)   52.52 (1.3)   52.39 (1.1)   53.09 (1.1)
Cl T    52.6  (1.3)   52.56 (1.2)   52.52 (1.3)   52.39 (1.1)   53.05 (0.7)
Far K   52.6  (1.3)   55.21 (1.2)   56.21 (1.8)   56.14 (1.2)   53.06 (1)
Far T   52.6  (1.3)   55.21 (1.2)   56.21 (1.8)   56.14 (1.2)   53.06 (1)
CBU     54.28 (0.9)   53.77 (1)     53.33 (1.1)   53.34 (0.9)   52.84 (0.8)

Book(p = 25)
RUS     60.08 (0.7)   60.13 (0.6)   60.4  (0.8)   60.33 (0.8)   63.28 (0.8)
Cl K    60.08 (0.7)   59.96 (0.7)   60.13 (0.8)   59.96 (1)     59.28 (0.7)
Cl T    60.08 (0.7)   59.96 (0.7)   60.13 (0.8)   60.29 (0.4)   54.5  (0.9)
Far K   60.08 (0.7)   63.29 (1)     64.19 (0.8)   57.3  (1.1)   55.66 (1.1)
Far T   60.08 (0.7)   62.14 (0.5)   58.27 (0.6)   56.37 (1)     55.66 (1.1)
CBU     54.82 (0.9)   54.67 (0.9)   54.71 (0.9)   54.66 (1)     54.78 (0.9)

LST(p = 1)
RUS     99.99 (0)     99.99 (0)     99.99 (0)     99.98 (0)     99.99 (0)
Cl K    99.99 (0)     99.99 (0)     99.99 (0)     99.99 (0)     99.99 (0)
Cl T    99.99 (0)     99.99 (0)     99.99 (0)     99.99 (0)     99.98 (0)
Far K   99.99 (0)     99.98 (0)     99.98 (0)     99.98 (0)     99.98 (0)
Far T   99.99 (0)     99.98 (0)     99.98 (0)     99.98 (0)     99.98 (0)
CBU     []            []            []            []            []

Adver(p = [])
RUS     96.61 (1.8)   96.32 (1.8)   96.63 (1.4)   97.12 (2.1)   96.22 (1.6)
Cl K    96.61 (1.8)   96.44 (1.5)   96.14 (1.5)   96.04 (2)     94.8  (2.5)
Cl T    96.61 (1.8)   95.87 (2.1)   94.32 (1.9)   93.01 (2.2)   90.72 (2.3)
Far K   96.61 (1.8)   96.53 (1.4)   95.76 (2)     94.39 (1.8)   90.49 (3.1)
Far T   96.61 (1.8)   96.54 (1.5)   95.67 (1.9)   94.54 (1.8)   89.3  (2.8)
CBU     96.85 (2.3)   96.85 (2.3)   97.05 (1.5)   96.6  (1.6)   96.06 (2.1)

Adver(p = 1)
RUS     90.93 (3)     91.53 (3.1)   92.37 (3.4)   91.9  (2.9)   91.93 (2.2)
Cl K    90.93 (3)     90.64 (3)     89.87 (3.9)   90.21 (3.6)   89.18 (2)
Cl T    90.93 (3)     89.7  (3.5)   88.55 (3.4)   85.76 (3.3)   88.2  (2.3)
Far K   90.93 (3)     93.8  (2.3)   92.4  (2.6)   88.73 (3.4)   85.51 (4)
Far T   90.93 (3)     93.62 (2.4)   93.2  (2.2)   88.41 (3.6)   85.51 (4)
CBU     93.22 (2.4)   93.76 (2.5)   93.89 (2.6)   93.52 (2.7)   91.27 (2)

CRF(p = [])
RUS     64.06 (16.4)  63.28 (15.9)  67.98 (17.4)  66.95 (21.9)  87.73 (8.8)
Cl K    64.06 (16.4)  62.44 (16.6)  62.34 (16.9)  71.37 (13.8)  78.22 (17.7)
Cl T    64.06 (16.4)  62.44 (16.6)  62.34 (16.9)  71.37 (13.8)  62.67 (22.9)
Far K   64.06 (16.4)  83.8  (14.2)  83.93 (14.8)  84.49 (13.7)  86.11 (9.7)
Far T   64.06 (16.4)  83.8  (14.2)  83.93 (14.8)  84.49 (13.7)  86.11 (9.7)
CBU     []            []            []            []            []

Bank(p = [])
RUS     66.82 (0.9)   67.02 (0.9)   67.37 (0.8)   67.99 (0.6)   69.5  (1)
Cl K    66.82 (0.9)   66.17 (0.7)   65.24 (0.6)   64.86 (0.6)   58.53 (1.1)
Cl T    66.82 (0.9)   64.92 (1.1)   60.69 (0.9)   56.33 (0.8)   52.87 (0.7)
Far K   66.82 (0.9)   66.95 (0.6)   66.19 (0.6)   64.42 (0.6)   58.25 (1.1)
Far T   66.82 (0.9)   67.16 (0.6)   64.2  (0.8)   59.67 (1)     58.25 (1.1)
CBU     []            []            []            []            []


C Boosting experiments

This section presents the results of each of the boosting experiments from Section 4.5 on the remaining data sources. Each of the figures below depicts the average tenfold AUC-performance on test data (with µ = 100) with respect to the number of boosting iterations for (left) Adaboost (AB), AdaCost (AC) and EasyEnsemble (EE), with C chosen according to highest validation set AUC-performance (over all possible boosting rounds), and (right) AB and EE (with S = 15) with varying C-levels.

[Figures 6-17 are not reproduced in this text version. Each figure plots AUC test performance against the number of boosting rounds T (0-30) for one dataset: panel (a) compares AB, AC (R = 2), AC (R = 8), AC (RD), EE (S = 5), EE (S = 10), EE (S = 15) and BL; panel (b) compares AB and EE for C in {1e-07, 1e-05, 0.001, 0.1} against BL.]

Fig 6 Mov G(p = 1) dataset
Fig 7 Mov Th(p = []) dataset
Fig 8 Yahoo A(p = 1) dataset
Fig 9 Yahoo A(p = 25) dataset
Fig 10 Yahoo G(p = 1) dataset
Fig 11 Yahoo G(p = 25) dataset
Fig 12 TaFeng(p = 1) dataset
Fig 13 Book(p = 1) dataset
Fig 14 LST(p = 1) dataset
Fig 15 Adver(p = []) dataset
Fig 16 Adver(p = 1) dataset
Fig 17 CRF(p = []) dataset

D Final Comparison

[Figure 18 is not reproduced in this text version: a scatter plot of average rank AUC (x-axis) versus average rank Time (y-axis) for BL, OSR, SMOTE, ADASYN, RUS, Cl Knn, Far Knn, CBU, AB, the AC variants, EE(S = 10), EE(S = 15) and EE par.]

Fig 18 Average rank AUC versus average rank Time (see Table 9) across the large datasets from Table 2 (CRF and Bank). Regarding the former, an extra row EE par is added, having the same AUC-value as EE(S = 15). Ranks for each dataset are subsequently obtained via the procedure outlined in Section 4.6.1. Points occurring in the upper-right region are preferred.


References

Akbani R Kwek S Japkowicz N (2004) Applying support vector machines to imbalanced datasetsIn Machine Learning ECML 2004 15th European Conference on Machine Learning Pisa ItalySeptember 20-24 2004 Proceedings Springer Berlin Heidelberg Berlin Heidelberg pp 39ndash50 DOI101007978-3-540-30115-8 7

Ali A Shamsuddin SM Ralescu AL (2015) Classification with class imbalance problem A review Inter-national Journal of Advances in Soft Computing and its Applications 7(3)176ndash204

Alzahrani T Horadam KJ (2016) Community detection in bipartite networks Algorithms and case studiesIn Complex Systems and Networks Dynamics Controls and Applications Springer Berlin Heidel-berg Berlin Heidelberg pp 25ndash50 DOI 101007978-3-662-47824-0 2

Bachner J (2013) Predictive policing Preventing crime with data and analytics IBM Center for the Busi-ness of Government

Baesens B Van Gestel T Viaene S Stepanova M Suykens J Vanthienen J (2003) Benchmarking state-of-the-art classification algorithms for credit scoring Journal of the Operational Research Society54(6)627ndash635 DOI 101057palgravejors2601545

Barandela R Snchez J Garca V Rangel E (2003) Strategies for learning in class imbalance problemsPattern Recognition 36(3)849 ndash 851 DOI httpsdoiorg101016S0031-3203(02)00257-1

Barber MJ (2007) Modularity and community detection in bipartite networks Physical Review E76066102 DOI 101103PhysRevE76066102

Barua S Islam MM Yao X Murase K (2014) MWMOTEndashmajority weighted minority oversamplingtechnique for imbalanced data set learning IEEE Transactions on Knowledge and Data Engineer-ing 26(2)405ndash425 DOI 101109TKDE2012232

Batista GEAPA Prati RC Monard MC (2004) A study of the behavior of several methods for balancingmachine learning training data SIGKDD Explor Newsl 6(1)20ndash29 DOI 10114510077301007735

Beckett SJ (2016) Improved community detection in weighted bipartite networks Royal Society OpenScience 3(1) DOI 101098rsos140536

Bekkar M Djemaa HK Alitouche TA (2013) Evaluation measures for models assessment over imbalanceddata sets Journal of Information Engineering and Applications 3(10)27ndash38

Bhattacharyya S Jha S Tharakunnel K Westland JC (2011) Data mining for credit card fraud A compara-tive study Decision Support Systems 50(3)602 ndash 613 DOI httpsdoiorg101016jdss201008008

Blondel VD Guillaume JL Lambiotte R Lefebvre E (2008) Fast unfolding of communities in large net-works Journal of Statistical Mechanics Theory and Experiment 2008(10)P10008

Breiman L Friedman J Stone CJ Olshen RA (1984) Classification and regression trees Taylor amp FrancisBrozovsky L Petricek V (2007) Recommender system for online dating service In Proceedings of

Znalosti 2007 Conference VSB OstravaCha M Mislove A Gummadi KP (2009) A measurement-driven analysis of information propagation in

the Flickr social network In Proceedings of the 18th International Conference on World Wide WebACM New York NY USA WWW rsquo09 pp 721ndash730 DOI 10114515267091526806

Chawla NV (2005) Data mining for imbalanced datasets An overview In Data mining and knowledgediscovery handbook Springer US Boston MA pp 853ndash867

Chawla NV Bowyer KW Hall LO Kegelmeyer WP (2002) SMOTE synthetic minority over-samplingtechnique Journal of artificial intelligence research 16321ndash357

Chawla NV Lazarevic A Hall LO Bowyer KW (2003) Smoteboost Improving prediction of the minorityclass in boosting In Knowledge Discovery in Databases PKDD 2003 Springer Berlin HeidelbergBerlin Heidelberg pp 107ndash119

Chawla NV Japkowicz N Kotcz A (2004) Editorial Special issue on learning from imbalanced data setsSIGKDD Explor Newsl 6(1)1ndash6 DOI 10114510077301007733

Chen M Mao S Liu Y (2014) Big data A survey Mobile Networks and Applications 19(2)171ndash209DOI 101007s11036-013-0489-0

Chyi YM (2003) Classification analysis techniques for skewed class distribution problems Master thesisDepartment of Information management National Sun Yat-Sen University

Demsar J (2006) Statistical comparisons of classifiers over multiple data sets Journal of Machine LearningResearch 7(Jan)1ndash30

Dhillon IS (2001) Co-clustering documents and words using bipartite spectral graph partitioning In Pro-ceedings of the Seventh ACM SIGKDD International Conference on Knowledge Discovery and DataMining ACM New York NY USA KDD rsquo01 pp 269ndash274 DOI 101145502512502550

54 Jellis Vanhoeyveld David Martens

Drummond C, Holte RC (2003) C4.5, class imbalance, and cost sensitivity: why under-sampling beats over-sampling. In: Proceedings of the ICML '03 Workshop on Learning from Imbalanced Datasets

Fan RE, Chang KW, Hsieh CJ, Wang XR, Lin CJ (2008) LIBLINEAR: A library for large linear classification. Journal of Machine Learning Research 9:1871–1874

Fan W, Stolfo SJ, Zhang J, Chan PK (1999) AdaCost: Misclassification cost-sensitive boosting. In: Proceedings of the Sixteenth International Conference on Machine Learning, Morgan Kaufmann Publishers Inc., San Francisco, CA, USA, ICML '99, pp 97–105

Fawcett T (2006) An introduction to ROC analysis. Pattern Recognition Letters 27(8):861–874, DOI https://doi.org/10.1016/j.patrec.2005.10.010

Finch H (2005) Comparison of distance measures in cluster analysis with dichotomous data. Journal of Data Science 3(1):85–100

Fortunato S (2010) Community detection in graphs. Physics Reports 486(3-5):75–174, DOI https://doi.org/10.1016/j.physrep.2009.11.002

Junque de Fortuny E, Martens D, Provost F (2014a) Predictive modeling with big data: is bigger really better? Big Data 1(4):215–226, DOI 10.1089/big.2013.0037

Junque de Fortuny E, Stankova M, Moeyersoms J, Minnaert B, Provost F, Martens D (2014b) Corporate residence fraud detection. In: Proceedings of the 20th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, ACM, New York, NY, USA, KDD '14, pp 1650–1659, DOI 10.1145/2623330.2623333

Frasca M, Bertoni A, Re M, Valentini G (2013) A neural network algorithm for semi-supervised node label learning from unbalanced data. Neural Networks 43:84–98, DOI https://doi.org/10.1016/j.neunet.2013.01.021

Friedman M (1937) The use of ranks to avoid the assumption of normality implicit in the analysis of variance. Journal of the American Statistical Association 32(200):675–701

García E, Lozano F (2007) Boosting support vector machines. In: Machine Learning and Data Mining in Pattern Recognition, 5th International Conference, MLDM 2007, Leipzig, Germany, July 18-20, Post Proceedings, IBaI publishing, pp 153–167

Goldstein M, Uchida S (2016) A comparative evaluation of unsupervised anomaly detection algorithms for multivariate data. PLOS ONE 11(4):1–31, DOI 10.1371/journal.pone.0152173

González PC, Velásquez JD (2013) Characterization and detection of taxpayers with false invoices using data mining techniques. Expert Systems with Applications 40(5):1427–1436, DOI https://doi.org/10.1016/j.eswa.2012.08.051

Guimerà R, Sales-Pardo M, Amaral LAN (2007) Module identification in bipartite and directed networks. Physical Review E 76:036102, DOI 10.1103/PhysRevE.76.036102

Guo H, Viktor HL (2004) Learning from imbalanced data sets with boosting and data generation: The DataBoost-IM approach. SIGKDD Explorations Newsletter 6(1):30–39, DOI 10.1145/1007730.1007736

Guo X, Yin Y, Dong C, Yang G, Zhou G (2008) On the class imbalance problem. In: 2008 Fourth International Conference on Natural Computation, IEEE, vol 4, pp 192–201, DOI 10.1109/ICNC.2008.871

Han H, Wang WY, Mao BH (2005) Borderline-SMOTE: A new over-sampling method in imbalanced data sets learning. In: Advances in Intelligent Computing, Springer Berlin Heidelberg, Berlin, Heidelberg, pp 878–887

He H, Garcia EA (2009) Learning from imbalanced data. IEEE Transactions on Knowledge and Data Engineering 21(9):1263–1284, DOI 10.1109/TKDE.2008.239

He H, Bai Y, Garcia EA, Li S (2008) ADASYN: Adaptive synthetic sampling approach for imbalanced learning. In: 2008 IEEE International Joint Conference on Neural Networks (IEEE World Congress on Computational Intelligence), IEEE, pp 1322–1328, DOI 10.1109/IJCNN.2008.4633969

Holm S (1979) A simple sequentially rejective multiple test procedure. Scandinavian Journal of Statistics 6(2):65–70

Hsu CW, Lin CJ (2002) A comparison of methods for multiclass support vector machines. IEEE Transactions on Neural Networks 13(2):415–425, DOI 10.1109/72.991427

Huang A (2008) Similarity measures for text document clustering. In: Proceedings of the Sixth New Zealand Computer Science Research Student Conference (NZCSRSC2008), Christchurch, New Zealand, pp 49–56

Iman RL, Davenport JM (1980) Approximations of the critical region of the Friedman statistic. Communications in Statistics-Theory and Methods 9(6):571–595

Jo T, Japkowicz N (2004) Class imbalances versus small disjuncts. ACM SIGKDD Explorations Newsletter 6(1):40–49, DOI 10.1145/1007730.1007737


Jutla IS, Jeub LG, Mucha PJ (2011-2016) A generalized Louvain method for community detection implemented in MATLAB. URL http://netwiki.amath.unc.edu/GenLouvain

Kubat M, Matwin S (1997) Addressing the curse of imbalanced training sets: One-sided selection. In: Proceedings of the Fourteenth International Conference on Machine Learning, Morgan Kaufmann Publishers Inc., San Francisco, CA, USA, pp 179–186

Lancichinetti A, Fortunato S (2009) Community detection algorithms: A comparative analysis. Physical Review E 80:056117, DOI 10.1103/PhysRevE.80.056117

Larremore DB, Clauset A, Jacobs AZ (2014) Efficiently inferring community structure in bipartite networks. Physical Review E: Statistical, Nonlinear, and Soft Matter Physics 90:012805, DOI 10.1103/PhysRevE.90.012805

Li J, Fine JP (2010) Weighted area under the receiver operating characteristic curve and its application to gene selection. Journal of the Royal Statistical Society: Series C (Applied Statistics) 59(4):673–692, DOI 10.1111/j.1467-9876.2010.00713.x

Li X, Wang L, Sung E (2008) AdaBoost with SVM-based component classifiers. Engineering Applications of Artificial Intelligence 21(5):785–795, DOI https://doi.org/10.1016/j.engappai.2007.07.001

Lichman M (2013) UCI machine learning repository. URL http://archive.ics.uci.edu/ml

Liu W, Chawla S, Cieslak DA, Chawla NV (2010) A robust decision tree algorithm for imbalanced data sets. In: Proceedings of the Tenth SIAM International Conference on Data Mining, SIAM, Philadelphia, vol 10, pp 766–777

Liu XY, Wu J, Zhou ZH (2009) Exploratory undersampling for class-imbalance learning. IEEE Transactions on Systems, Man, and Cybernetics, Part B (Cybernetics) 39(2):539–550, DOI 10.1109/TSMCB.2008.2007853

Luts J, Ojeda F, Van de Plas R, De Moor B, Van Huffel S, Suykens JA (2010) A tutorial on support vector machine-based methods for classification problems in chemometrics. Analytica Chimica Acta 665(2):129–145, DOI https://doi.org/10.1016/j.aca.2010.03.030

Macskassy SA, Provost F (2007) Classification in networked data: A toolkit and a univariate case study. Journal of Machine Learning Research 8(May):935–983

Martens D, Provost F (2014) Explaining data-driven document classifications. MIS Quarterly 38(1):73–100

Martens D, Provost F, Clark J, Junque de Fortuny E (2016) Mining massive fine-grained behavior data to improve predictive analytics. MIS Quarterly 40(4):869–888

Mazurowski MA, Habas PA, Zurada JM, Lo JY, Baker JA, Tourassi GD (2008) Training neural network classifiers for medical decision making: The effects of imbalanced datasets on classification performance. Neural Networks 21(2-3):427–436, DOI https://doi.org/10.1016/j.neunet.2007.12.031

Mease D, Wyner AJ, Buja A (2007) Boosted classification trees and class probability/quantile estimation. Journal of Machine Learning Research 8:409–439

Nemenyi P (1963) Distribution-free Multiple Comparisons. Dissertation, Princeton University

Newman MEJ, Girvan M (2004) Finding and evaluating community structure in networks. Physical Review E 69:026113, DOI 10.1103/PhysRevE.69.026113

Ng AY (2004) Feature selection, L1 vs. L2 regularization, and rotational invariance. In: Proceedings of the Twenty-first International Conference on Machine Learning, ACM, New York, NY, USA, ICML '04, pp 78, DOI 10.1145/1015330.1015435

Ng AY, Jordan MI (2002) On discriminative vs. generative classifiers: A comparison of logistic regression and naive Bayes. In: Advances in Neural Information Processing Systems 14, MIT Press, pp 841–848

Ngai E, Hu Y, Wong Y, Chen Y, Sun X (2011) The application of data mining techniques in financial fraud detection: A classification framework and an academic review of literature. Decision Support Systems 50(3):559–569, DOI https://doi.org/10.1016/j.dss.2010.08.006

Platt JC (1999) Probabilistic outputs for support vector machines and comparisons to regularized likelihood methods. In: Advances in Large-Margin Classifiers, MIT Press, pp 61–74

Porter MA, Onnela JP, Mucha PJ (2009) Communities in networks. Notices of the American Mathematical Society 56(9):1082–1097

Provost F, Fawcett T (2013) Data Science for Business: What you need to know about data mining and data-analytic thinking. O'Reilly Media, Inc.

Provost F, Dalessandro B, Hook R, Zhang X, Murray A (2009) Audience selection for on-line brand advertising: Privacy-friendly social network targeting. In: Proceedings of the 15th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, ACM, New York, NY, USA, KDD '09, pp 707–716, DOI 10.1145/1557019.1557098


Raskutti B, Kowalczyk A (2004) Extreme re-balancing for SVMs: A case study. SIGKDD Explorations Newsletter 6(1):60–69, DOI 10.1145/1007730.1007739

Rosvall M, Bergstrom CT (2008) Maps of random walks on complex networks reveal community structure. Proceedings of the National Academy of Sciences 105(4):1118–1123, DOI 10.1073/pnas.0706851105

Schapire RE (1999) A brief introduction to boosting. In: Proceedings of the 16th International Joint Conference on Artificial Intelligence - Volume 2, Morgan Kaufmann Publishers Inc., San Francisco, CA, USA, IJCAI '99, pp 1401–1406

Schapire RE, Singer Y (1999) Improved boosting algorithms using confidence-rated predictions. Machine Learning 37(3):297–336, DOI 10.1023/A:1007614523901

Shmueli G (2017) Analyzing behavioral big data: Methodological, practical, ethical, and moral issues. Quality Engineering 29(1):57–74, DOI 10.1080/08982112.2016.1210979

Sobhani P, Viktor H, Matwin S (2015) Learning from imbalanced data using ensemble methods and cluster-based undersampling. In: New Frontiers in Mining Complex Patterns: Third International Workshop, NFMCP 2014, Held in Conjunction with ECML-PKDD 2014, Nancy, France, September 19, 2014, Revised Selected Papers, Springer International Publishing, Cham, pp 69–83, DOI 10.1007/978-3-319-17876-9_5

Stankova M (2016) Classification within network data with a bipartite structure. Dissertation, University of Antwerp

Stankova M, Martens D, Provost F (2015) Classification over bipartite graphs through projection. Working Papers 2015001, University of Antwerp, Faculty of Applied Economics

Sun Y, Kamel MS, Wong AK, Wang Y (2007) Cost-sensitive boosting for classification of imbalanced data. Pattern Recognition 40(12):3358–3378, DOI https://doi.org/10.1016/j.patcog.2007.04.009

Suykens JA, Van Gestel T, De Brabanter J, De Moor B, Vandewalle J (2002) Least squares support vector machines. World Scientific, Singapore

Tang B, He H (2015) ENN: Extended nearest neighbor method for pattern recognition [research frontier]. IEEE Computational Intelligence Magazine 10(3):52–60, DOI 10.1109/MCI.2015.2437512

Tang Y, Zhang YQ, Chawla NV, Krasser S (2009) SVMs modeling for highly imbalanced classification. IEEE Transactions on Systems, Man, and Cybernetics, Part B (Cybernetics) 39(1):281–288, DOI 10.1109/TSMCB.2008.2002909

Tobback E, Moeyersoms J, Stankova M, Martens D (2016) Bankruptcy prediction for SMEs using relational data. Working Paper 2016004, University of Antwerp, Faculty of Applied Economics

Verbeke W, Dejaeger K, Martens D, Hur J, Baesens B (2012) New insights into churn prediction in the telecommunication sector: A profit driven data mining approach. European Journal of Operational Research 218(1):211–229, DOI https://doi.org/10.1016/j.ejor.2011.09.031

Veropoulos K, Campbell I, Cristianini N (1999) Controlling the sensitivity of support vector machines. In: Proceedings of the International Joint Conference on Artificial Intelligence, Stockholm, Sweden (IJCAI99), pp 55–60

Whitrow C, Hand DJ, Juszczak P, Weston D, Adams NM (2009) Transaction aggregation as a strategy for credit card fraud detection. Data Mining and Knowledge Discovery 18(1):30–55, DOI 10.1007/s10618-008-0116-z

Wickramaratna J, Holden SB, Buxton BF (2001) Performance degradation in boosting. In: Proceedings of the Second International Workshop on Multiple Classifier Systems, Springer, London, UK, MCS '01, pp 11–21

Yen SJ, Lee YS (2009) Cluster-based under-sampling approaches for imbalanced data distributions. Expert Systems with Applications 36(3, Part 1):5718–5727, DOI https://doi.org/10.1016/j.eswa.2008.06.108

Yu HF, Lo HY, Hsieh HP, Lou JK, McKenzie TG, Chou JW, Chung PH, Ho CH, Chang CF, Wei YH, et al (2010) Feature engineering and classifier ensemble for KDD Cup 2010. In: Proceedings of the KDD Cup 2010 Workshop, pp 1–16

Zha H, He X, Ding C, Simon H, Gu M (2001) Bipartite graph partitioning and data clustering. In: Proceedings of the Tenth International Conference on Information and Knowledge Management, ACM, New York, NY, USA, CIKM '01, pp 25–32, DOI 10.1145/502585.502591

Zhang J, Mani I (2003) KNN approach to unbalanced data distributions: A case study involving information extraction. In: Proceedings of the ICML'2003 Workshop on Learning from Imbalanced Datasets, Washington, DC

Ziegler CN, McNee SM, Konstan JA, Lausen G (2005) Improving recommendation lists through topic diversification. In: Proceedings of the 14th International Conference on World Wide Web, ACM, New York, NY, USA, WWW '05, pp 22–32, DOI 10.1145/1060745.1060754



Table 5 Final comparison of all methods in terms of AUC test set performance; see Section 4.2 for details on experimental set-up. β ≠ 0 and βu ≠ 0 for the oversampling respectively undersampling techniques; µ = 100 for each of the boosting variants. Standard deviations are given between brackets and the performance difference with the BL is shown in square brackets.

             Mov G(p = 1)       Mov G(p = 25)      Mov Th(p = [])     Yahoo A(p = 1)
BL           71.6(2.62)[0]      81.41(1.32)[0]     79.77(5.33)[0]     55.92(2.97)[0]
OSR          75.35(2.27)[3.8]   83.76(2.09)[2.3]   85.13(6.1)[5.4]    60.05(2.71)[4.1]
SMOTE        76.16(2.27)[4.6]   83.7(2.1)[2.3]     85.67(4.98)[5.9]   60.1(3)[4.2]
ADASYN       76.07(2.26)[4.5]   83.63(2.04)[2.2]   85.65(5.6)[5.9]    59.9(2.99)[4]
RUS          72.88(2.73)[1.3]   81.52(2.15)[0.1]   82.91(7.19)[3.1]   57.04(1.77)[1.1]
Cl Knn       71.43(1.36)[-0.2]  80.88(1.19)[-0.5]  78.87(4.71)[-0.9]  55.78(2.71)[-0.1]
Far Knn      71.9(2.95)[0.3]    80.9(1.48)[-0.5]   84.07(4.64)[4.3]   57.2(1.33)[1.3]
CBU          74.17(2.36)[2.6]   81.51(1.04)[0.1]   82.76(7.22)[3]     58.77(3.43)[2.8]
AB           71.65(1.73)[0.1]   84.52(1.89)[3.1]   82.43(5.18)[2.7]   58.35(2.62)[2.4]
AC(R = 28)   71.61(2.46)[0]     83.46(1.82)[2]     83.27(5.6)[3.5]    57.72(2.47)[1.8]
AC(R = RL)   74.65(2.7)[3.1]    83.35(2.09)[1.9]   85.41(4.49)[5.6]   59.47(2.33)[3.5]
EE(S = 10)   76.04(2.66)[4.4]   85.05(1.85)[3.6]   86.1(5.78)[6.3]    59.66(3.13)[3.7]
EE(S = 15)   76.12(2.88)[4.5]   85.14(1.86)[3.7]   86.42(5.86)[6.7]   59.76(2.93)[3.8]

             Yahoo A(p = 25)    Yahoo G(p = 1)     Yahoo G(p = 25)    TaFeng(p = 1)
BL           61.68(2.42)[0]     66.84(3.66)[0]     78.82(1.39)[0]     55.75(1.6)[0]
OSR          64.59(3.12)[2.9]   73.08(2.96)[6.2]   78.52(2.01)[-0.3]  61.21(2.24)[5.5]
SMOTE        65.56(3.33)[3.9]   73.11(3.12)[6.3]   79.01(1.21)[0.2]   61.72(1.81)[6]
ADASYN       65.13(3.38)[3.4]   73.22(3.17)[6.4]   79.74(1.68)[0.9]   61.68(1.86)[5.9]
RUS          64.11(2.8)[2.4]    70.65(3.39)[3.8]   78.91(1.55)[0.1]   59.25(2.18)[3.5]
Cl Knn       61.14(2.13)[-0.5]  66.34(3.54)[-0.5]  77.26(1.46)[-1.6]  55.77(1.28)[0]
Far Knn      63.96(3.03)[2.3]   66.97(3.54)[0.1]   78.26(2.2)[-0.6]   59.98(1.26)[4.2]
CBU          62.27(1.79)[0.6]   71.27(2.89)[4.4]   75.22(2.42)[-3.6]  58.4(1.57)[2.6]
AB           63.88(2.67)[2.2]   68.9(2.03)[2.1]    79.01(1.66)[0.2]   56.21(1.79)[0.5]
AC(R = 28)   64.32(3.56)[2.6]   68.89(3.11)[2]     78.99(1.89)[0.2]   56.33(1.83)[0.6]
AC(R = RL)   64.31(3.03)[2.6]   73.13(2.8)[6.3]    78.41(2)[-0.4]     61.6(2.26)[5.9]
EE(S = 10)   66.51(3.24)[4.8]   72.61(3.15)[5.8]   80.52(1.6)[1.7]    61.2(1.82)[5.4]
EE(S = 15)   66.36(3.18)[4.7]   73.48(2.32)[6.6]   80.54(1.56)[1.7]   61.13(1.83)[5.4]

             TaFeng(p = 25)     Book(p = 1)        Book(p = 25)       LST(p = 1)
BL           66.94(1.34)[0]     52.6(1.29)[0]      60.08(0.71)[0]     99.99(0.01)[0]
OSR          68.77(1.23)[1.8]   55.87(1.42)[3.3]   64.62(0.57)[4.5]   99.99(0.01)[0]
SMOTE        68.47(1.5)[1.5]    55.07(0.88)[2.5]   62.96(0.82)[2.9]   99.99(0.01)[0]
ADASYN       68.48(1.47)[1.5]   55.04(0.91)[2.4]   63.02(0.57)[2.9]   99.99(0.01)[0]
RUS          68.28(1.39)[1.3]   54.26(0.92)[1.7]   63.28(0.8)[3.2]    99.98(0.01)[0]
Cl Knn       66.13(1.43)[-0.8]  52.69(1.3)[0.1]    60.02(0.79)[-0.1]  99.99(0.01)[0]
Far Knn      68.06(1.41)[1.1]   56.25(1.52)[3.7]   64.15(1.12)[4.1]   99.98(0.01)[0]
CBU          63.84(1.07)[-3.1]  53.75(1.01)[1.2]   54.68(0.88)[-5.4]  []
AB           67.65(1.55)[0.7]   54.27(1.95)[1.7]   65(0.67)[4.9]      99.99(0.01)[0]
AC(R = 28)   69.31(1.23)[2.4]   53.72(1)[1.1]      61.24(0.8)[1.2]    99.98(0.01)[0]
AC(R = RL)   67.15(1.51)[0.2]   55.73(1.22)[3.1]   64.6(0.64)[4.5]    99.99(0.01)[0]
EE(S = 10)   70.3(1.35)[3.4]    55.09(1.29)[2.5]   65.37(0.61)[5.3]   99.98(0.01)[0]
EE(S = 15)   70.4(1.3)[3.5]     55.35(1.26)[2.8]   65.4(0.51)[5.3]    99.98(0.01)[0]


Table 5 Continued. Additionally, an average rank column is added, showing the mean rank of each algorithm across all datasets. Note that we excluded the LST dataset for this purpose.

             Adver(p = [])      Adver(p = 1)       CRF(p = [])         Bank(p = [])
BL           96.61(1.82)[0]     90.93(3.02)[0]     64.06(16.43)[0]     66.82(0.88)[0]
OSR          96.93(1.91)[0.3]   93.3(2.02)[2.4]    80.74(12.93)[16.7]  71.39(0.79)[4.6]
SMOTE        97.05(1.66)[0.4]   93.35(2.01)[2.4]   78.7(16.56)[14.6]   []
ADASYN       96.91(1.95)[0.3]   93.46(2.21)[2.5]   78.87(16.71)[14.8]  []
RUS          96.81(1.87)[0.2]   92.38(2.51)[1.5]   83.98(5.99)[19.9]   69.41(1.19)[2.6]
Cl Knn       96.4(1.48)[-0.2]   89.73(3.42)[-1.2]  76.63(16.19)[12.6]  66.17(0.72)[-0.6]
Far Knn      95.77(1.81)[-0.8]  93.88(1.78)[3]     83.75(13.11)[19.7]  66.95(0.56)[0.1]
CBU          97.15(1.88)[0.5]   94.18(2.3)[3.3]    []                  []
AB           97.34(2.18)[0.7]   91.39(3.23)[0.5]   77.62(15.15)[13.6]  66.82(0.88)[0]
AC(R = 28)   97.44(1.93)[0.8]   91(3.35)[0.1]      68.31(14.93)[4.2]   67.67(0.71)[0.9]
AC(R = RL)   97.46(1.71)[0.8]   93.51(2.17)[2.6]   85.08(9.77)[21]     70.7(0.8)[3.9]
EE(S = 10)   97.64(1.35)[1]     92.97(2.75)[2]     86.18(10.17)[22.1]  71.46(0.81)[4.6]
EE(S = 15)   97.63(1.35)[1]     93.3(2.14)[2.4]    86.35(9.99)[22.3]   71.54(0.76)[4.7]

             Average Rank
BL           11.600
OSR          5.000
SMOTE        4.533
ADASYN       4.800
RUS          8.167
Cl Knn       12.467
Far Knn      8.133
CBU          8.567
AB           8.267
AC(R = 28)   8.467
AC(R = RL)   5.400
EE(S = 10)   3.267
EE(S = 15)   2.333

Table 6 Nemenyi test outcomes at the α = 0.05 significance level. A value of 1 indicates the null-hypothesis is rejected and thus finds the two algorithms significantly different (a value of 0 means we accept the null-hypothesis). Methods are listed in the same order as in Table 5 and are represented more concisely.

      BL  RO  SM  AD  RU  Cl  Fa  CBU  AB  AC1  AC2  EE1  EE2
BL    0   1   1   1   0   0   0   0    0   0    1    1    1
RO    1   0   0   0   0   1   0   0    0   0    0    0    0
SM    1   0   0   0   0   1   0   0    0   0    0    0    0
AD    1   0   0   0   0   1   0   0    0   0    0    0    0
RU    0   0   0   0   0   0   0   0    0   0    0    1    1
Cl    0   1   1   1   0   0   0   0    0   0    1    1    1
Fa    0   0   0   0   0   0   0   0    0   0    0    1    1
CBU   0   0   0   0   0   0   0   0    0   0    0    1    1
AB    0   0   0   0   0   0   0   0    0   0    0    1    1
AC1   0   0   0   0   0   0   0   0    0   0    0    1    1
AC2   1   0   0   0   0   1   0   0    0   0    0    0    0
EE1   1   0   0   0   1   1   1   1    1   1    0    0    0
EE2   1   0   0   0   1   1   1   1    1   1    0    0    0
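For completeness, the rank-difference check behind these outcomes can be sketched as follows. The Nemenyi test declares two algorithms significantly different when their average ranks differ by at least the critical difference CD = q_α · sqrt(k(k+1)/(6N)). The helper below is illustrative (names are ours) and expects the tabulated critical value q_α, e.g. from the tables in Demsar (2006), as input.

```python
import math

def nemenyi_cd(k, n_datasets, q_alpha):
    """Critical difference for the Nemenyi post-hoc test: two of the k
    algorithms, ranked over n_datasets datasets, differ significantly
    when their average ranks differ by at least this value."""
    return q_alpha * math.sqrt(k * (k + 1) / (6.0 * n_datasets))

def differs(avg_rank_a, avg_rank_b, cd):
    # compare an average-rank difference against the critical difference
    return abs(avg_rank_a - avg_rank_b) >= cd
```

With k = 13 algorithms, the 15 ranked datasets (LST excluded) and an illustrative q_α = 3.0, CD ≈ 4.27: BL (average rank 11.600) then differs significantly from EE(S = 15) (2.333), while SMOTE (4.533) does not, matching the corresponding cells of the table.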


distribution function: p = 2 × min(Φ(z), 1 − Φ(z)). Holm's method (Holm 1979) compares k − 1 classifiers to a single control classifier and sorts the corresponding p-values in ascending order, so that p1 ≤ p2 ≤ ... ≤ pk−1. Each pi is subsequently compared to its associated confidence level^25 αcomp = α/(k − i). Holm starts by performing the check p1 < α/(k − 1) and rejects the null-hypothesis if the test passes. It then proceeds with a similar check for i = 2, and continues until a certain null-hypothesis cannot be rejected. The remaining hypotheses are retained.
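The step-down procedure can be sketched in a few lines; the function below is an illustrative reimplementation (names are ours), not the code used in the experiments.

```python
def holm_test(methods, p_values, alpha=0.05):
    """Holm's step-down procedure against a single control classifier.

    methods/p_values: the k-1 comparison methods with the unadjusted
    p-values of their tests against the control.  Returns the list of
    methods whose null-hypothesis of equal performance is rejected.
    """
    k = len(methods) + 1  # total number of classifiers, incl. the control
    ranked = sorted(zip(p_values, methods))  # p1 <= p2 <= ... <= p_{k-1}
    rejected = []
    for i, (p, name) in enumerate(ranked, start=1):
        if p < alpha / (k - i):  # compare p_i against alpha_comp = alpha/(k-i)
            rejected.append(name)
        else:
            break  # this and all remaining hypotheses are retained
    return rejected
```

Feeding in the twelve p-values of Table 7 reproduces the six rejections reported there, with Far Knn the first retained hypothesis.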

Table 7 shows the result of the Holm test at the α = 0.05 significance level with the BL algorithm as control classifier. The table lists each method according to sorted p-values and shows the z-, p- and αcomp-values. The significance column indicates whether the proposed method is significantly different from the BL, and coincidentally matches the result of the Nemenyi test. The p-values of the (Far Knn, RUS, CBU) undersampling methods, AB and AC (R = 28) are only slightly higher than the level required to reject the null-hypothesis. We included a critical significance level αcrit, corresponding to the lowest possible significance level upon which the method would be considered as significantly different from the BL (if α = αcrit then p = αcomp). In the case of AB, for example, if we were to choose a significance level α = 0.0764 (a confidence level of 92.36%), the associated p-value would be smaller than αcomp and we would proceed to conclude^26 that AB performs significantly different from the BL. To summarize: the null-hypothesis of equal performance between the BL and an alternative method proposed in Section 3 (except Cl Knn) is rejected with a confidence level of more than 91% according to Holm's test. Hence, the performance of the BL approach is significantly worse than that of each of the proposed over- and undersampling methods (except Cl Knn), cost-sensitive learning techniques (AC) and boosting variants (AB, EE).

Similar to the previous paragraph, Table 8 presents the results of Holm's test at the α = 0.05 significance level with EE(S = 15) as control classifier. The results coincidentally correspond with the result of the Nemenyi test and indicate that the oversampling methods (OSR, SMOTE and ADASYN), AC(R = RL) and EE(S = 10) are not statistically different from the EE(S = 15) technique at the 0.05 significance level. However, if we were to adapt the significance level to approximately α = 0.25 (75% confidence level), the results would conclude significance (except for EE(S = 10)). To summarize: the null-hypothesis of equal performance between EE(S = 15) and an alternative method proposed in Section 3 (except EE(S = 10)) is rejected with a confidence level of more than 75% according to Holm's test. Hence, the performance of the oversampling, undersampling, AB and AC approaches is significantly worse than that of EE(S = 15). We can therefore safely conclude EE to be the preferred method in terms of AUC performance when dealing with imbalanced behaviour data.

25 αcomp adjusts the value of α to compensate for multiple comparisons.
26 This is not entirely true, since Holm's method requires Far Knn to pass the test first before proceeding, which only occurs at the 0.088662 significance level.


Table 7 Holm test at the α = 0.05 significance level with BL as reference. The table shows the z test statistic with associated p-value and αcomp = α/(k − i), with k the number of algorithms (k = 13) and i corresponding to the position of the method in the sorted p-value vector. Since we already ranked the p-values in ascending order, i takes on the row number (e.g. OSR has i = 5). The column "significant" denotes if we can reject the null-hypothesis (significant: p < αcomp). αcrit corresponds with the smallest possible significance level where we would decide to reject the null-hypothesis (αcrit = α·p/αcomp).

             z          p          αcomp      significant  αcrit
EE(S = 15)   -6.51642   7.2E-11    0.004167   1            8.64E-10
EE(S = 10)   -5.86009   4.63E-09   0.004545   1            5.09E-08
SMOTE        -4.96936   6.72E-07   0.005      1            6.72E-06
ADASYN       -4.78183   1.74E-06   0.005556   1            1.56E-05
OSR          -4.64119   3.46E-06   0.00625    1            2.77E-05
AC(R = RL)   -4.35991   1.3E-05    0.007143   1            9.11E-05
Far Knn      -2.4378    0.014777   0.008333   0            0.088662
RUS          -2.41436   0.015763   0.01       0            0.078815
AB           -2.34404   0.019076   0.0125     0            0.076305
AC(R = 28)   -2.20339   0.027567   0.016667   0            0.082701
CBU          -2.13307   0.032919   0.025      0            0.065837
Cl Knn       0.609449   0.542227   0.05       0            0.542227

Table 8 Holm test at the α = 0.05 significance level with EE(S = 15) as reference.

             z          p          αcomp      significant  αcrit
Cl Knn       7.12587    1.03E-12   0.004167   1            1.24E-11
BL           6.516421   7.2E-11    0.004545   1            7.92E-10
CBU          4.383348   1.17E-05   0.005      1            0.000117
AC(R = 28)   4.313027   1.61E-05   0.005556   1            0.000145
AB           4.172384   3.01E-05   0.00625    1            0.000241
RUS          4.102063   4.09E-05   0.007143   1            0.000287
Far Knn      4.078623   4.53E-05   0.008333   1            0.000272
AC(R = RL)   2.156513   0.031044   0.01       0            0.155218
OSR          1.875229   0.060761   0.0125     0            0.243045
ADASYN       1.734587   0.082814   0.016667   0            0.248442
SMOTE        1.547064   0.121848   0.025      0            0.243696
EE(S = 10)   0.65633    0.511612   0.05       0            0.511612

4.6.2 Timing-wise comparison

The performance of each of the proposed algorithms in terms of training times depends highly upon the chosen parameters. The β and βu parameters control the number of training instances and are an important contribution to the subsequent SVM learning times for each of the over- and undersampling techniques. The number of boosting iterations t ∈ [0, T] is also a major factor in each of the boosting variants proposed in Section 3.3. Furthermore, the chosen regularization parameter C highly influences the training times of the SVM inducer: higher C-values result in larger training times (Fan et al 2008). Not only do the chosen hyperparameters affect learning times; data characteristics such as the number of minority instances, the imbalance level, the number of features, the amount of overlapping, etc. also have a major effect.


In comparing each of the methods outlined in Section 3, we make use of a methodology similar to the one presented previously (see Section 4.6.1). For each of the ten folds, a suitable parameter combination is selected based on optimal validation set AUC-performance. These chosen parameters are applied in determining the timings of each method. The final results indicate the average timings across ten folds. Hence, we show the required amount of time for each method when using the parameter combination resulting in the best performance. Note that this might, for instance, result in using β = 13 in case of OSR, or using only t = 5 boosting iterations in EE. Computational timings are shown in Table 9.

As one can observe from the aforementioned table, CBU is the slowest method; it relies on computationally intensive clustering procedures. The synthetic SMOTE and ADASYN oversampling approaches are also very time consuming. They both rely on computationally expensive nearest neighbour computations, in conjunction with a synthetic sample generation procedure that requires a single pass over all minority training instances.

The RUS method shows, as expected, superior results in terms of computational timings. The other undersampling methods perform quite well, though they are slower than BL and OSR due to their inherent nearest neighbour computations.

The boosting variants seem to perform worse than the BL, OSR and each of the undersampling methods. Since EE was the best method in terms of AUC-performance (see Section 4.6.1), we focus on this learner in this paragraph. For the medium sized datasets (all but CRF and Bank), OSR outperforms EE(S = 15); yet for the large datasets (CRF and Bank), the opposite is true. A SVM is known to have a quadratic computational complexity in the number of training instances (Suykens et al 2002; García and Lozano 2007). For this reason, we consider OSR (or any other oversampling technique) inappropriate for large datasets. Since with EE each subset has a size limited to twice the minority class training size, we expect increasing benefits of the EE-technique for larger and more imbalanced behaviour data. As Junque de Fortuny et al (2014a) have observed, larger behaviour data, in terms of the number of instances or features, also contributes to an increase in predictive power. Hence, we expect to see this type of big behaviour data more often in real-life applications, where computational resources can become a major constraint. Because the EE-technique is easily parallelizable, by simultaneously performing the boosting process for each of the independent subsets, we added an extra row in Table 9: EE par corresponds with the time needed for EE(S = 15) divided by 15, indicating the time that would be required for EE if each subset were trained in parallel. We can see that EE par outperforms OSR in 8 out of 16 cases, precisely for the datasets that show a high imbalance level (e.g. p = 1 or p = []).
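The parallelization argument can be made concrete with a small sketch (illustrative code, not the experimental implementation): each of the S subsets keeps all minority instances plus an equally large random draw from the majority class, the S learners are trained independently in parallel, and their scores are averaged. The `train` callable is a stand-in for the AdaBoost-with-SVM procedure used in the paper.

```python
import random
from concurrent.futures import ThreadPoolExecutor

def balanced_subset(majority, minority, rng):
    # subset size = 2 * |minority|: all minority instances plus an
    # equally sized random undersample of the majority class
    return rng.sample(majority, len(minority)) + list(minority)

def easy_ensemble(majority, minority, train, n_subsets=15, seed=0):
    """Train one base learner per balanced subset, in parallel, and
    return a scorer that averages the individual model scores."""
    subsets = [balanced_subset(majority, minority, random.Random(seed + s))
               for s in range(n_subsets)]
    with ThreadPoolExecutor() as pool:  # the S runs are fully independent
        models = list(pool.map(train, subsets))
    return lambda x: sum(m(x) for m in models) / n_subsets
```

Because the subsets never communicate, the wall-clock cost of the ensemble drops to roughly that of a single subset when S workers are available, which is exactly what the EE par row approximates.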

We conclude our comparison of the proposed methods in Figure 5, which shows the average ranks in terms of computational timings (see Table 9) and AUC performance for each of the available data sources. Regarding the ranks with respect to the latter, we added an extra row EE par, having the same AUC-performance as EE(S = 15) for each dataset in Table 5. Ranks are subsequently determined according to the procedure outlined in Section 4.6.1. Additionally, Figure 18 in Appendix D shows the result for the large and highly imbalanced CRF and Bank datasets. We expect the latter type of data to be of major interest for reasons already indicated.


Table 9 Computational timings (in seconds) averaged across ten folds, using the parameter combination resulting in the highest validation set AUC-performance. First and second best performances are emphasized in boldface (third and fourth best are underlined). EE par is the time required for EE(S = 15) divided by 15.

             Mov G(p = 1)   Mov G(p = 25)   Mov Th(p = [])   Yahoo A(p = 1)
BL           0.032889       0.056697        0.558563         0.026922
OSR          0.055043       0.062802        0.99009          0.044421
SMOTE        0.218821       0.937057        3.841482         0.057726
ADASYN       0.284688       1.802399        5.191265         0.087694
RUS          0.011431       0.025383        0.155224         0.007991
CL Knn       0.046599       0.599846        0.989914         0.037182
Far Knn      0.039887       0.80072         0.683023         0.027788
CBU          1.034111       10.60173        6.822839         1.692477
AB           0.169792       0.841443        3.460246         0.139251
AC(R = 28)   0.471994       2.996585        1.086907         0.366555
AC(R = RL)   0.53376        1.179542        6.065177         0.209015
EE(S = 10)   0.117226       6.065145        1.17995          0.148973
EE(S = 15)   0.20474        7.173737        2.119991         0.180365
EE par       0.013649       0.478249        0.141333         0.012024

             Yahoo A(p = 25)   Yahoo G(p = 1)   Yahoo G(p = 25)   TaFeng(p = 1)
BL           0.092954          0.011915         0.044164          0.026728
OSR          0.027887          0.013241         0.047206          0.040919
SMOTE        1.062686          0.056153         0.883698          0.219553
ADASYN       2.050993          0.079073         1.733367          0.306618
RUS          0.048471          0.003234         0.033423          0.002916
CL Knn       0.84391           0.025404         0.502515          0.092167
Far Knn      0.664124          0.026576         0.500206          0.080159
CBU          15.69442          1.287221         13.55035          2.467279
AB           0.445546          0.078777         0.169977          0.114619
AC(R = 28)   1.034044          0.321723         0.515953          0.926178
AC(R = RL)   0.706215          0.226741         0.112949          0.610233
EE(S = 10)   1.026577          0.100331         1.527146          0.058052
EE(S = 15)   1.607596          0.077483         2.472582          0.10538
EE par       0.107173          0.005166         0.164839          0.007025

             TaFeng(p = 25)   Book(p = 1)   Book(p = 25)   LST(p = 1)
BL           0.032033         0.080035      0.318093       0.652045
OSR          0.032414         0.132927      0.092757       0.87152
SMOTE        5.089283         3.409418      11.43444       4.987705
ADASYN       8.148419         3.689661      12.25441       6.840083
RUS          0.020457         0.022713      0.031972       0.432839
CL Knn       1.713731         0.400873      3.711648       2.508374
Far Knn      1.539437         0.379086      3.988552       2.511037
CBU          26.42686         4.198663      46.31987       []
AB           0.713265         0.61719       1.238585       2.466151
AC(R = 28)   1.234647         1.666131      2.330635       1.451671
AC(R = RL)   0.279047         0.860346      0.197053       1.23763
EE(S = 10)   2.484502         2.145747      7.177484       0.524066
EE(S = 15)   3.363971         2.480066      11.21945       0.784111
EE par       0.224265         0.165338      0.747963       0.052274


Table 9 Continued. Additionally, an extra column indicating average ranks is included (lower ranks are preferred).

             Adver(p = [])   Adver(p = 1)   CRF(p = [])   Bank(p = [])
BL           0.010953        0.002796       0.725911      70.89334
OSR          0.012178        0.006166       3.685813      179.7481
SMOTE        0.123112        0.017764       5.633862      []
ADASYN       0.183767        0.021728       5.768669      []
RUS          0.012115        0.00204        0.147392      52.47441
CL Knn       0.061324        0.005568       1.106755      73.73282
Far Knn      0.079078        0.007069       1.110379      97.59619
CBU          3.378235        3.236754       []            []
AB           0.069199        0.103518       1.153196      83.08618
AC(R = 28)   0.193092        0.068905       2.047434      71.70548
AC(R = RL)   0.107652        0.037963       1.387174      106.3466
EE(S = 10)   0.138485        0.085686       0.198656      24.95117
EE(S = 15)   0.185136        0.139121       0.285345      36.40107
EE par       0.012342        0.009275       0.019023      2.426738

             Average Rank [pos]
BL           2.94 [2]
OSR          4.19 [4]
SMOTE        9.59 [11]
ADASYN       10.91 [13]
RUS          1.38 [1]
CL Knn       6.5 [5]
Far Knn      6.56 [6]
CBU          14 [14]
AB           8.06 [7]
AC(R = 28)   10.81 [12]
AC(R = RL)   9.25 [9]
EE(S = 10)   8.25 [8]
EE(S = 15)   9.56 [10]
EE par       3 [3]

Timings are a substantial concern with respect to these large datasets. As can be observed, the EE-technique thrives for these big and highly imbalanced data, even in its non-parallel form.


[Figure 5: scatter plot of average rank Time (x-axis) versus average rank AUC (y-axis) for BL, OSR, SMOTE, ADASYN, RUS, Cl Knn, Far Knn, CBU, AB, AC(R = 28), AC(R = RL), EE(S = 10), EE(S = 15) and EE par.]

Fig 5 Average rank AUC versus average rank Time (see Table 9) across all datasets from Table 2 (excluding Flickr and Kdd). Regarding the AUC rank, an extra row EE par is added, having the same AUC-value as EE(S = 15). Ranks for each dataset are subsequently obtained via the procedure outlined in Section 4.6.1. Points occurring in the upper-right region are preferred.

5 The effect of the chosen base learner

In this short section, we investigate the influence of the chosen base learner. More specifically, we ask ourselves whether some base inducers can more easily cope with the imbalanced learning issue, thereby verifying the conclusion that EE is a suitable technique to improve upon the baseline performance. We rely on several popular classification methods for behaviour data: two regularization based techniques (SVM, LR) and two heuristic types of approaches (Naive Bayes (NB) and Behavioural Similarity (BeSim)). The next paragraphs briefly introduce these methods. Note that source codes and dataset results are provided in our on-line repository.

LR constructs a linear model that maximizes the likelihood of the observed data by imposing a logistic model for the parametric probability model P(Y|X). Ng and Jordan (2002) provide a thorough introduction to the topic^27 and note that, in its plain form, LR can suffer from overfitting, especially when the input data is very high dimensional. Therefore, we resort to regularized logistic regression. We opt for L2-regularized LR, since L1-regularization corresponds with a natural way of imposing feature selection (Ng 2004). The LIBLINEAR toolbox (Fan et al 2008) is used to obtain these L2-regularized LR models.

27 For an accessible introduction, see the chapter on "Generative and Discriminative Classifiers: Naive Bayes and Logistic Regression" provided at the following link: http://www.cs.cmu.edu/~tom/NewChapters.html
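To make the regularized objective concrete, here is a minimal gradient-descent sketch of L2-regularized logistic regression in the LIBLINEAR-style formulation, min_w ½‖w‖² + C Σ_i log(1 + exp(−y_i wᵀx_i)). It is for illustration only (the experiments use the LIBLINEAR solver itself), and the step size and iteration count are arbitrary choices.

```python
import numpy as np

def train_l2_logreg(X, y, C=1.0, lr=0.1, n_iter=500):
    """L2-regularized logistic regression, solved with plain gradient
    descent on 0.5*||w||^2 + C * sum_i log(1 + exp(-y_i * w.x_i)).
    X: dense (n, d) array; y: labels in {-1, +1}."""
    n, d = X.shape
    w = np.zeros(d)
    for _ in range(n_iter):
        margins = y * (X @ w)
        sigma = 1.0 / (1.0 + np.exp(margins))   # probability of misranking
        grad = w - C * (X.T @ (sigma * y))      # regularizer + loss gradient
        w -= lr * grad
    return w

def predict_score(w, X):
    return X @ w  # sign gives the class, magnitude a confidence
```

Note that smaller C yields a weaker, more heavily regularized learner; this is precisely the knob that the boosting experiments in this paper exploit to control the strength of the base learner.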

Imbalanced classification in sparse and large behaviour datasets 39

NB (Ng and Jordan 2002) relies on the Bayes rule, which expresses the posterior P(Y|X) as the product of the likelihood P(X|Y) and the prior P(Y), divided by the evidence P(X). The naivety of the approach originates from assuming conditional independence of the features. We impose a multivariate event model regarding the distribution P(X|Y). Junque de Fortuny et al (2014a) provide an efficient NB implementation for dealing with large and sparse binary behaviour data.
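As a sketch of this multivariate (Bernoulli) event model on sparse binary rows, the code below stores only the active feature indices per instance. It is an illustrative reimplementation under that sparse-representation assumption, not the optimized implementation of Junque de Fortuny et al (2014a); all names are ours.

```python
import math
from collections import defaultdict

def train_bernoulli_nb(rows, labels, n_features, alpha=1.0):
    """Multivariate Bernoulli NB for sparse binary data.
    `rows` is a list of sets of active feature indices.
    Returns log-priors and Laplace-smoothed P(x_j = 1 | y = c)."""
    counts = {c: defaultdict(int) for c in set(labels)}
    class_n = defaultdict(int)
    for row, c in zip(rows, labels):
        class_n[c] += 1
        for j in row:
            counts[c][j] += 1
    log_prior = {c: math.log(class_n[c] / len(rows)) for c in class_n}
    p_on = {c: {j: (counts[c][j] + alpha) / (class_n[c] + 2 * alpha)
                for j in range(n_features)} for c in class_n}
    return log_prior, p_on

def score(row, log_prior, p_on, c, n_features):
    """Log joint probability of `row` under class c (higher is better)."""
    s = log_prior[c]
    for j in range(n_features):
        p = p_on[c][j]
        s += math.log(p) if j in row else math.log(1.0 - p)
    return s

# Toy data: class 1 users touch items {0, 1}, class 0 users touch {2, 3}.
rows = [{0}, {0, 1}, {2}, {2, 3}]
labels = [1, 1, 0, 0]
log_prior, p_on = train_bernoulli_nb(rows, labels, n_features=4)
```

Note that training touches only the non-zero entries, which is what makes the event model attractive for behaviour data; only the smoothing denominator depends on the full dimensionality.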

The BeSim approach conceptually corresponds to projecting the bigraph to a unigraph using the procedure outlined in Section 3.2. Next, the weighted-vote relational neighbour classifier (Macskassy and Provost 2007) is used to infer unknown labels/scores through a weighted probability estimation using the known labels of neighbouring nodes. A very fast and scalable implementation of this approach (the SW-transformation) is presented in detail in Stankova et al (2015).
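The idea can be sketched as follows: aggregate label mass per bottom node (item), then score each unlabelled top node (user) by the label-weighted similarity to labelled users, with shared-item counts as weights. This is a simplified weighted-vote pass in the spirit of the SW-transformation's linear-in-the-edges computation, not the exact implementation of Stankova et al (2015); all names are illustrative.

```python
from collections import defaultdict

def besim_scores(user_items, labels):
    """Weighted-vote relational neighbour on an implicitly projected
    bigraph. `user_items` maps user -> set of item ids; `labels` maps
    a subset of users to 0/1. Unlabelled users get a score in [0, 1]."""
    pos_mass = defaultdict(float)  # positive-label mass per item
    tot_mass = defaultdict(float)  # labelled-degree per item
    for u, items in user_items.items():
        if u in labels:
            for i in items:
                pos_mass[i] += labels[u]
                tot_mass[i] += 1.0
    scores = {}
    for u, items in user_items.items():
        if u in labels:
            continue
        num = sum(pos_mass[i] for i in items)
        den = sum(tot_mass[i] for i in items)
        scores[u] = num / den if den else 0.0
    return scores

# Toy bigraph: 'a' (positive) shares item 1 with 'c'; 'b' (negative)
# shares item 3 with 'd'.
user_items = {"a": {1, 2}, "b": {3}, "c": {1}, "d": {3}}
labels = {"a": 1, "b": 0}
scores = besim_scores(user_items, labels)
```

Because both passes only iterate over edges of labelled and target users, the cost is linear in the number of edges, which is what makes this family of methods scale to large sparse behaviour data.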

In Table 10 we compare the performance of the baseline (BL) application of the methods with their EE-version (with S = 15, T = 20). It is quite striking to see the EE-version dominating the baseline version when looking at each method individually. This suggests that the imbalanced learning issue has an impact on each of the aforementioned techniques. Focussing on the regularization-based approaches, we observe the baseline LR to be less influenced by the imbalance than the SVM method. Their EE-versions (EE SVM and EE LR) can boost the baseline performance in almost all cases and have the highest average ranks amongst all the proposed methods. As already emphasized, the regularization-based approaches offer an added element of flexibility in the sense that the strength of the underlying base learner can be controlled.

The heuristic approaches NB and BeSim do not offer this flexibility. Furthermore, it is not directly possible to integrate instance weights during the training phase of the base learner. During the boosting process, we therefore sample from the distribution Dt to generate (unweighted) examples and subsequently train a BeSim/NB learner. The results in Table 10 indicate BL NB to be the worst performing algorithm amongst all methods tried. Due to the conditional independence assumption, one could regard this as a weak learner, which is therefore suitable for use in a boosting process. Indeed, the EE NB approach is able to significantly improve upon its performance. We found this to be the slowest EE-version28 (e.g. taking more than 24 hours to process a single fold of the Bank dataset, versus several minutes for the other versions). The average rank estimate of EE NB is rather pessimistic, since we assigned the lowest performance across all methods in case of missing results. BeSim can be considered a stronger learner than NB; its possible boost in performance is typically lower than that of NB. Having said this, we found EE BeSim to improve upon BL BeSim for all datasets (except Adver(p = 1)).

28 This statement only relates to the EE-version. In the baseline case, the timings of the NB-version are roughly of the same order of magnitude as SVM and LR (though this heavily depends on the chosen regularization parameter C). With EE, each subset has a size limited to twice the minority class training size (a relatively low number compared to the size of the majority class). This works in favour of SVM/LR (e.g. SVM has a computational complexity of O(m^2), with m the number of training instances). Also note that the NB-version is indeed optimized for sparse data. Internally, the NB-implementation writes a fairly large model file to a specified directory (sizes of 1 GB can be encountered there). This writing operation slows down the training stage (especially in case of EE, where S x T = 15 x 20 = 300 model files are constructed).

40 Jellis Vanhoeyveld David Martens

Table 10 Average tenfold AUC test set performance of the BL compared to its EE-version (S = 15, T = 20) for SVM, LR, BeSim and NB. In case the EE NB implementation crashed, we put a x in the table. If it is too slow (> 24 hours per fold), we place a []. Additionally, average ranks are assigned to each method. In case EE NB has missing results for a particular dataset, we ensure it has the lowest rank across all other algorithms.

           Mov G(p = 1)   Mov G(p = 25)  Mov Th(p = [])  Yahoo A(p = 1)

BL SVM     71.6 (2.62)    81.41 (1.32)   79.77 (5.33)    56.49 (3.37)
EE SVM     76.12 (2.88)   85.13 (1.86)   86.43 (5.86)    59.74 (2.96)
BL LR      71.02 (2.09)   84.39 (1.84)   83.14 (4.17)    57.84 (2.39)
EE LR      76.69 (2.92)   85.03 (1.98)   86.3 (5.37)     59.79 (2.62)
BL BeSim   76.1 (3.58)    81.3 (2.92)    82.81 (6.6)     56.27 (2.73)
EE BeSim   76.31 (3.71)   81.37 (2.9)    85.02 (6.28)    57.7 (1.71)
BL NB      70.26 (5.84)   77.01 (2.54)   70.48 (10.14)   52.56 (2.09)
EE NB      75.93 (2.83)   85.56 (2.01)   86.91 (4.15)    57.55 (2.73)

           Yahoo A(p = 25) Yahoo G(p = 1) Yahoo G(p = 25) TaFeng(p = 1)

BL SVM     61.61 (2.48)   66.84 (3.66)   78.82 (1.39)    55.75 (1.6)
EE SVM     66.38 (3.16)   73.48 (2.32)   80.55 (1.55)    61.13 (1.83)
BL LR      66.27 (2.96)   69.82 (1.93)   80.45 (1.59)    58.91 (2.31)
EE LR      66.22 (3.28)   73.08 (2.14)   80.53 (1.56)    61.43 (2.32)
BL BeSim   64.54 (2.02)   68.89 (2.49)   79.55 (1.96)    57.89 (1.18)
EE BeSim   65.25 (2.23)   71.18 (2.91)   80.04 (1.85)    59.36 (1.47)
BL NB      65 (1.65)      63.33 (2.56)   78.89 (1.64)    54.61 (1.2)
EE NB      66.6 (2.79)    70.99 (2.88)   81.01 (1.3)     59.01 (1.84)

           TaFeng(p = 25) Book(p = 1)    Book(p = 25)    LST(p = 1)

BL SVM     66.94 (1.34)   52.6 (1.29)    60.08 (0.71)    99.99 (0.01)
EE SVM     70.4 (1.3)     55.34 (1.28)   65.4 (0.51)     99.98 (0.01)
BL LR      69.24 (1.3)    55.34 (1.27)   63.84 (0.75)    99.99 (0.01)
EE LR      70.28 (1.28)   55.49 (1.49)   65.41 (0.63)    99.97 (0.02)
BL BeSim   67.49 (1.23)   55.19 (1.27)   63.7 (0.63)     99.99 (0.01)
EE BeSim   68 (1.21)      55.21 (1.15)   64.38 (0.42)    99.99 (0)
BL NB      65.21 (1.64)   52.93 (0.9)    59.75 (0.47)    98.69 (0.3)
EE NB      70.72 (1.15)   x              63.46 (0.61)    99.92 (0.04)

           Adver(p = [])  Adver(p = 1)   CRF(p = [])     Bank(p = [])

BL SVM     96.37 (1.94)   91.18 (2.97)   64.36 (18.97)   66.82 (0.88)
EE SVM     97.63 (1.35)   93.3 (2.14)    86.35 (9.99)    71.54 (0.76)
BL LR      97.19 (1.44)   88.51 (1.93)   81.87 (19.63)   71.43 (0.72)
EE LR      97.57 (0.96)   93.02 (2.06)   86.84 (9.62)    71.77 (0.62)
BL BeSim   97.26 (1.12)   95.38 (1.35)   86.91 (9.36)    67.85 (0.67)
EE BeSim   97.38 (1.04)   93.83 (1.35)   87.02 (10.43)   70.41 (0.55)
BL NB      93.75 (1.9)    93.37 (1.9)    87.24 (9.38)    67.83 (0.63)
EE NB      94.04 (1.75)   x              x               []

           Flickr(p = 01) Kdd(p = 05)    Average Rank

BL SVM     74.92 (0.17)   74.53 (0.05)   6.44 [7]
EE SVM     79.86 (0.13)   80.98 (0.05)   2.39 [1]
BL LR      79.03 (0.11)   81.29 (0.04)   4.28 [4]
EE LR      79.85 (0.13)   80.75 (0.05)   2.61 [2]
BL BeSim   74.62 (0.13)   74.95 (0)      5.11 [6]
EE BeSim   76.4 (0.13)    77.55 (0.03)   3.61 [3]
BL NB      81.36 (0.1)    74.29 (0.05)   6.5 [8]
EE NB      []             []             5.06 [5]


6 Conclusions and future research directions

We investigated the effects of sampling and cost-sensitive learning techniques on the problem of learning from imbalanced behaviour data. This type of big data has fundamentally different properties and internal structure than traditional data and is characterized by sparseness and very large dimensions. This is a new research topic which enables benefits across a wide variety of application domains, such as fraud detection, targeted advertising, predictive policing, churn prediction, default prediction, etc. To enable future developments in this field, we provide an on-line repository containing implementations, datasets and results, available at the following link: http://www.applieddatamining.com/cms/?q=software

Oversampling techniques need to be adapted in order to cope with this kind of data. We proposed new versions of the traditional SMOTE and ADASYN techniques, which differ in the way synthetic samples are generated and similarity computations are performed. The plain random oversampling (OSR) and the synthetic approaches are competitive methods in terms of AUC and do not seem to suffer from overfitting, as traditional studies report. Though the synthetic approaches have comparable AUC-performances to OSR, they are useless in practice due to their computationally expensive procedures. The OSR method is very fast for small and medium-sized datasets, yet becomes inappropriate for larger datasets because the training times of the underlying base learner increase drastically.

With respect to undersampling methodologies, we employed a random undersampling technique (RUS) and Knn-based informed approaches. The latter type selects majority class instances that are closest (Cl Knn) or farthest (Far Knn) from the minority class examples and allows one to investigate the effect of noise/outliers on SVM performance. We observed that these odd instances have a severe performance degrading effect. RUS and Far Knn approaches can significantly improve performance with respect to the baseline, thanks to their abilities to reduce noise/outlier levels and to put more emphasis on the minority class. Having said this, they usually perform worse than any of the oversampling approaches. This is in sheer contrast to the results obtained from traditional studies and is heavily related to the intrinsic nature of behaviour data. Timing-wise, RUS is the fastest method. The informed approaches are slower than RUS due to the required nearest neighbour computations. Additionally, a cluster-based undersampling (CBU) technique was proposed to alleviate the within-class imbalance. For low undersampling rates, the performance of CBU is superior to plain RUS, whereas for high undersampling rates the opposite is true. Overall, no substantial gain in performance is obtained in comparison with a simple RUS approach. Furthermore, the method is the slowest among all techniques considered and relies on computationally expensive clustering heuristics.

We studied the effects of boosting, where we made use of a very specific linear SVM and logistic regression combination that uses confidence-rated predictions, instead of a more popular plain "discrete" boosting algorithm (with weak hypotheses having values in {-1, 1}). Our experiments clearly indicated the regularization constant C in the SVM formulation to act as a "weakness" indicator. Indeed, higher C-values cause stronger learners and shouldn't be used in the boosting process, which matches the analysis of studies dealing with traditional data. Strong learners can


already quite accurately distinguish majority class from minority class behaviours. The boosting process will focus on the hard-to-learn instances, among which are noise/outliers, and this can result in overfitting behaviour. We also observed that boosting can usually outperform the plain baseline, indicating that a combination of weak learners can achieve higher AUC-performance than a single strong learner.
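The weight dynamics driving this focus on hard instances can be illustrated with a single confidence-rated update in the style of Schapire and Singer; the function name and toy margins below are ours, and real-valued hypothesis outputs are assumed.

```python
import math

def boost_round(weights, margins):
    """One confidence-rated boosting update:
    D_{t+1}(i) = D_t(i) * exp(-y_i * h_t(x_i)) / Z_t,
    where margins[i] = y_i * h_t(x_i) and h_t is real-valued.
    Correctly scored instances (positive margin) lose mass;
    misclassified instances gain it."""
    new_w = [w * math.exp(-m) for w, m in zip(weights, margins)]
    z = sum(new_w)  # normaliser Z_t
    return [w / z for w in new_w]

weights = [0.25, 0.25, 0.25, 0.25]
margins = [1.0, 1.0, -1.0, 1.0]  # instance 2 is misclassified
new_weights = boost_round(weights, margins)
```

With a strong base learner, almost all margins are large and positive, so the residual weight piles up on the few hard (often noisy) instances, which is the overfitting mechanism described above.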

EasyEnsemble (EE) integrates random undersampling with a boosting process. By sampling several balanced subsets from the training data and combining these, we can emphasize the minority class, achieve noise/outlier reduction effects and simultaneously explore the majority class space efficiently. Our version differs from the original formulation because we employ a different type of underlying weak learner (SVM and LR combination) that is fed to a confidence-rated boosting algorithm (instead of the discrete version). We observed superior AUC-performance with respect to all of the aforementioned techniques. Furthermore, the method is very fast, since each subset can be trained in parallel and has a size that is only twice as large as the minority class training set. We studied the effect of the number of subsets and noted that a limited amount of these (O(10)) is already sufficient to obtain satisfactory performance; increasing this number further yields only minor benefits.
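A minimal sketch of this subset-and-combine scheme is given below, with a toy centroid-distance scorer standing in for the paper's boosted SVM/LR base learner; all names and parameters are illustrative.

```python
import random

def easy_ensemble_scores(X_min, X_maj, X_test, train_scorer, S=10, seed=0):
    """EasyEnsemble sketch: draw S balanced subsets, train one model per
    subset, average the decision scores. Each subset keeps all minority
    instances plus an equally sized random draw from the majority class,
    so every model trains on only 2*|minority| rows while the ensemble
    as a whole explores the majority class space."""
    rng = random.Random(seed)
    ensemble = []
    for _ in range(S):
        maj_sample = rng.sample(X_maj, len(X_min))
        ensemble.append(train_scorer(X_min, maj_sample))
    return [sum(m(x) for m in ensemble) / S for x in X_test]

def centroid_scorer(pos, neg):
    """Toy base learner: score by projection on the centroid difference."""
    d = len(pos[0])
    cp = [sum(x[j] for x in pos) / len(pos) for j in range(d)]
    cn = [sum(x[j] for x in neg) / len(neg) for j in range(d)]
    w = [cp[j] - cn[j] for j in range(d)]
    return lambda x: sum(w[j] * x[j] for j in range(d))

X_min = [[1.0], [1.2]]                      # minority class
X_maj = [[-1.0], [-1.1], [-0.9], [-1.2]]    # majority class
scores = easy_ensemble_scores(X_min, X_maj, [[1.0], [-1.0]],
                              centroid_scorer, S=5)
```

Since the S subset models are independent, the loop over subsets is trivially parallelizable, which is the source of the speed advantage noted above.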

Additionally, a statistical comparison of each of the proposed methods was conducted across a wide variety of datasets. Holm's significance test indicates that each of the proposed techniques (oversampling, undersampling (except Cl Knn), cost-sensitive learning and boosting variants) can significantly outperform the plain baseline performance (each of the null-hypotheses of equal performance gets rejected with a confidence level of at least 91%). Comparison of the techniques with EE showed that the latter can significantly outperform all of them (each of the null-hypotheses of equal performance gets rejected with a confidence level of at least 75%). A timing-wise comparison showed EE to be the preferred method (after RUS) for large and highly imbalanced datasets. A parallel implementation would make EE the fastest method, even for medium-sized datasets that show a high level of imbalance.

The influence of the chosen base learner was investigated by comparing the baseline to its EE-version for the SVM, LR, NB and BeSim inducers. Each of the EE-versions is able to improve upon the baseline performance for nearly all datasets, where base learners of the regularization type are superior because of their ability to control the strength of the learner.

Though we conducted an extensive benchmark study, there are still plenty of future research directions. First, and most important, there is an urgent need to develop a more general theoretical framework to completely understand the effect of imbalance and why certain methods outperform others. Secondly, other types of mechanisms could be investigated in this setting: advanced sampling (Guo et al 2008) set-ups perform sampling based on preliminary classifications; one-class learning is a recognition-based methodology where a discriminative boundary is learned around the training examples of a single class alone, and Raskutti and Kowalczyk (2004) demonstrated the superiority of the recognition-based one-class SVMs over the discriminative two-class SVMs; other types of ensemble learning strategies (e.g. bagging and stacking) could be investigated; Batista et al (2004) note that combining over- and undersampling can be beneficial when data are highly imbalanced or show very few minority class instances; the ideas behind Extended Nearest Neighbour (Tang and


He 2015) could guide us in the sampling process. As opposed to Knn, "it considers not only who are the nearest neighbours of a certain sample, but also who consider the particular sample as their nearest neighbour (Tang and He 2015)". The method assigns a class membership to an unknown sample that maximizes the intra-class coherence. Finally, more advanced techniques do exist that can provide a K (the number of nearest neighbours) faster or with a (slightly) better performance.

Another issue relates to the lack of comprehensibility of the EasyEnsemble technique in its present form. Because we currently integrate Logistic Regression in the boosting process, the final ensemble classifier is non-linear and therefore incomprehensible. It might be interesting to investigate the numerical boosting technique proposed in Schapire and Singer (1999), where each weak learner is allowed to output real values (instead of values in [-1, 1]). In that case, we would be able to use a plain linear SVM as a weak learner, without the Logistic Regression component. The final ensemble would be a linear model, which is surely easier to comprehend. Also, the instance-based approach of Martens and Provost (2014) could be used to explain individual predictions of the non-linear EE-technique.

A Oversampling experiments

Table 11 Oversampling experiments. Results show the average tenfold AUC test set performance with respect to increasing β-values (β = [β1, β2, β3, β4] = [0, 1/3, 2/3, 1]). Additionally, standard deviations are shown between brackets. Optimal performances for each method are highlighted in boldface. See Section 4.2 for parameter settings.

Columns: β1, β2, β3, β4

Mov G(p = 1)
OSR    71.6 (2.62)   74.37 (2.04)  73.6 (1.84)   74.73 (2.45)
SMOTE  71.6 (2.62)   75.08 (2.18)  76.02 (2.14)  76.48 (2.3)
ADASYN 71.6 (2.62)   75.16 (1.92)  75.93 (2.08)  76.47 (2.29)

Mov G(p = 25)
OSR    81.41 (1.32)  83.49 (1.81)  83.84 (1.96)  83.91 (2.04)
SMOTE  81.41 (1.32)  83.32 (1.97)  83.59 (2.04)  83.76 (2.11)
ADASYN 81.41 (1.32)  83.61 (1.82)  84.02 (1.97)  83.69 (1.96)

Mov Th(p = [])
OSR    79.77 (5.33)  85.3 (4.66)   83.16 (4.5)   84.59 (5.69)
SMOTE  79.77 (5.33)  84.18 (6.51)  85.58 (5.97)  84.33 (5.75)
ADASYN 79.77 (5.33)  84.11 (6.77)  85.86 (6.06)  85.36 (5.13)

Yahoo A(p = 1)
OSR    55.92 (2.97)  58.66 (3.27)  59.99 (2.28)  59.74 (1.78)
SMOTE  55.92 (2.97)  59.76 (2.62)  59.74 (2.67)  59.43 (2.4)
ADASYN 55.92 (2.97)  59.54 (2.53)  59.55 (2.94)  59.56 (2.22)

Yahoo A(p = 25)
OSR    61.68 (2.42)  64.19 (3.17)  65.08 (3.26)  64.67 (2.1)
SMOTE  61.68 (2.42)  65.46 (3.63)  65.33 (3.23)  64.52 (2.98)
ADASYN 61.68 (2.42)  65.04 (3.74)  65.41 (3.47)  64.4 (2.21)

Yahoo G(p = 1)
OSR    66.84 (3.66)  72.18 (2.36)  73.11 (2.7)   72.49 (3.41)
SMOTE  66.84 (3.66)  72.65 (2.85)  73.27 (3.36)  73.37 (3.56)
ADASYN 66.84 (3.66)  72.87 (2.83)  73.18 (3.2)   73.39 (3.59)

Yahoo G(p = 25)
OSR    78.82 (1.39)  78.78 (2.02)  78.74 (1.86)  78.35 (1.47)
SMOTE  78.82 (1.39)  79.23 (1.57)  79.1 (1.2)    79.03 (1.89)
ADASYN 78.82 (1.39)  79.12 (1.43)  79.23 (1.37)  79.51 (2.01)

TaFeng(p = 1)
OSR    55.75 (1.6)   59.23 (1.96)  60 (1.68)     61.04 (2.36)
SMOTE  55.75 (1.6)   60.26 (1.95)  61.49 (1.8)   61.13 (1.52)
ADASYN 55.75 (1.6)   60.26 (1.9)   61.44 (1.85)  61.16 (1.5)

TaFeng(p = 25)
OSR    66.94 (1.34)  68.84 (1.21)  66.99 (1.42)  67.7 (1.41)
SMOTE  66.94 (1.34)  68.47 (1.5)   67.07 (1.15)  66.65 (0.81)
ADASYN 66.94 (1.34)  68.62 (1.38)  67.85 (1.6)   66.91 (1.39)

Book(p = 1)
OSR    52.6 (1.29)   53.61 (0.94)  55.41 (1.75)  55.87 (1.44)
SMOTE  52.6 (1.29)   54.77 (0.99)  54.91 (0.8)   54.36 (0.98)
ADASYN 52.6 (1.29)   54.86 (1.13)  55.06 (0.73)  54.54 (0.92)

Book(p = 25)
OSR    60.08 (0.71)  61.05 (0.94)  61.82 (1.12)  64.62 (0.57)
SMOTE  60.08 (0.71)  62.6 (0.73)   60.95 (0.68)  63 (0.8)
ADASYN 60.08 (0.71)  62.33 (0.73)  60.77 (0.85)  63.04 (0.58)

LST(p = 1)
OSR    99.99 (0.01)  99.99 (0.01)  99.99 (0.01)  99.99 (0.01)
SMOTE  99.99 (0.01)  99.99 (0.01)  99.99 (0.01)  99.99 (0.01)
ADASYN 99.99 (0.01)  99.99 (0.01)  99.99 (0.01)  99.99 (0.01)

Adver(p = [])
OSR    96.61 (1.82)  97.31 (1.65)  97.07 (1.84)  97.07 (1.79)
SMOTE  96.61 (1.82)  96.91 (1.66)  97.19 (1.65)  97.07 (1.91)
ADASYN 96.61 (1.82)  97.1 (1.7)    97.08 (1.87)  97.07 (1.88)

Adver(p = 1)
OSR    90.93 (3.02)  91.27 (3.03)  92.66 (2.82)  93.29 (1.97)
SMOTE  90.93 (3.02)  92.51 (2.03)  92.96 (2.14)  93.53 (1.81)
ADASYN 90.93 (3.02)  92.22 (2.33)  92.7 (2.36)   93.88 (1.73)

CRF(p = [])
OSR    64.06 (16.43) 80.82 (12.94) 81.28 (12.27) 81.91 (11.28)
SMOTE  64.06 (16.43) 78.64 (16.86) 82.52 (13.74) 79.32 (16.26)
ADASYN 64.06 (16.43) 78.95 (16.72) 81.19 (16.32) 79.31 (16.19)

Bank(p = [])
OSR    66.82 (0.88)  70.1 (0.74)   71.39 (0.8)   71.47 (0.8)
SMOTE
ADASYN


B Undersampling experiments

Table 12 Undersampling experiments. Results show the average tenfold AUC test set performance with respect to increasing βu-values (βu = [βu1, βu2, βu3, βu4, βu5] = [0, 1/4, 1/2, 3/4, 1]). Optimal performances (boldface) are highlighted for each imbalance ratio. Additionally, standard deviations are shown between brackets. CL K represents the "Closest Knn" technique; CL T represents the "Closest tot sim" technique (similar for Far K and Far T; see Section 3.2). Parameter settings are given in Section 4.2.

Columns: βu1, βu2, βu3, βu4, βu5

Mov G(p = 1)
RUS   71.6(2.6)  71.83(2.6) 72.54(2.5) 72.39(3.1) 70.61(3.5)
Cl K  71.6(2.6)  71.4(2)    70.96(1.9) 70.43(2.4) 69.05(4.1)
CL T  71.6(2.6)  70.28(2.5) 66.74(2)   66.8(2.1)  68.18(3.6)
Far K 71.6(2.6)  72.36(2.7) 71.26(3.4) 66.57(5.2) 53.5(3.5)
Far T 71.6(2.6)  72.22(2.8) 71.63(3.6) 64.28(5.3) 50.88(4.4)
CBU   72.55(2.6) 73.28(2.6) 73.12(2.6) 73.84(2.5) 73(3.1)

Mov G(p = 25)
RUS   81.41(1.3) 81.36(1.3) 81.78(1.7) 82.05(1.7) 81.6(2.1)
Cl K  81.41(1.3) 80.86(1.2) 80.95(1.6) 79.73(2.3) 77.95(2.3)
CL T  81.41(1.3) 79.9(1.2)  78.21(1.4) 77.87(1.5) 77.76(2.3)
Far K 81.41(1.3) 80.9(1.5)  78.17(1.8) 74.25(2.4) 69.79(3.2)
Far T 81.41(1.3) 80.86(1.5) 77.2(2.4)  71.16(2.7) 62.4(2.8)
CBU   81.53(1.4) 81.64(1.3) 81.29(1.6) 81.28(2.1) 80.34(2.7)

Mov Th(p = [])
RUS   79.77(5.3) 80.32(5.8) 81.57(5.5) 81.86(6.6) 81.26(6.2)
Cl K  79.77(5.3) 79.25(4.5) 78.07(5)   76.25(6.5) 62.46(8.5)
CL T  79.77(5.3) 78.4(4.4)  72.41(3.5) 64.66(4.5) 60.37(7.3)
Far K 79.77(5.3) 84.54(5)   83.64(6.4) 80.02(7.3) 56.82(10.3)
Far T 79.77(5.3) 85.03(5.7) 82.68(6.8) 75.61(9.2) 56.77(10.9)
CBU   80.11(5.8) 81.17(6)   81.08(6.5) 84.17(5.1) 80.96(6.9)

Yahoo A(p = 1)
RUS   55.92(3)   55.57(3.4) 56.44(3)   55.83(3.4) 56.37(3.3)
Cl K  55.92(3)   55.67(2.4) 53.12(2)   50.57(1.8) 53.79(3.5)
CL T  55.92(3)   55.69(2.1) 53.35(2.2) 50.31(2.2) 52.35(3.3)
Far K 55.92(3)   57.35(2.2) 56.92(1.1) 56.95(2.3) 51.18(2)
Far T 55.92(3)   56.93(2.4) 54.74(1.9) 57.01(1.8) 51.18(2)
CBU   58.21(2.6) 58.45(3.3) 58.31(3.5) 58.39(3.5) 56.09(2.6)

Yahoo A(p = 25)
RUS   61.68(2.4) 62.9(2.9)  63.62(3.6) 63.75(3.1) 63.19(1.9)
Cl K  61.68(2.4) 61.14(2.1) 57.62(1.6) 54.02(1.8) 51.48(1.4)
CL T  61.68(2.4) 60.89(2.8) 58.11(1.4) 54.4(2.1)  51.76(1.4)
Far K 61.68(2.4) 63.96(3)   62.62(2.2) 59.61(1.5) 56.25(1.6)
Far T 61.68(2.4) 63.71(2.4) 59.72(1.6) 57.27(1.1) 54.47(1.1)
CBU   62.46(2.6) 61.85(1.4) 61.78(2.2) 59.94(3)   60.1(4)

Yahoo G(p = 1)
RUS   66.84(3.7) 67.85(3.2) 68.36(3.2) 68.23(4)   69.9(4.2)
Cl K  66.84(3.7) 66.71(2.8) 64.3(3.6)  61.98(3.9) 61.15(1.9)
CL T  66.84(3.7) 65.79(2.7) 63.55(3.3) 59.21(3.5) 61.08(2.4)
Far K 66.84(3.7) 66.76(4.1) 63.84(3.4) 65.16(2)   48.5(2.9)
Far T 66.84(3.7) 66.95(4.1) 63.48(2.9) 65.16(2)   48.48(2.9)
CBU   69.68(4.1) 70.59(3.2) 70.64(3.7) 70.2(2.9)  63.35(3.6)

Yahoo G(p = 25)
RUS   78.82(1.4) 78.91(1.6) 78.97(1.6) 78.61(1.6) 77.82(2.1)
Cl K  78.82(1.4) 77.26(1.5) 72.52(1.5) 67.86(2)   65.07(2.7)
CL T  78.82(1.4) 76.83(1)   71.99(1.8) 67.15(2.3) 61.1(2.7)
Far K 78.82(1.4) 78.26(2.2) 74.69(2.7) 67.22(2.1) 60.72(2.3)
Far T 78.82(1.4) 77.68(2.6) 72.44(3)   64.94(2.4) 59.6(2)
CBU   75.25(3.2) 75.22(2.4) 74.69(2.3) 73.07(2.4) 70.69(2.4)

TaFeng(p = 1)
RUS   55.75(1.6) 56.1(1.6)  56.26(1.7) 57.23(1.7) 59.25(2.2)
Cl K  55.75(1.6) 55.68(1.6) 55.58(1.5) 55.08(1.1) 51.05(1.5)
CL T  55.75(1.6) 55.67(1.6) 54.47(1.6) 47.53(1.6) 49.3(1.1)
Far K 55.75(1.6) 58.99(1.2) 59.47(1.1) 60.04(1.2) 56.31(1)
Far T 55.75(1.6) 58.92(1.3) 59.25(1.3) 58.58(1.1) 56.31(1)
CBU   57.8(1)    58.47(1.1) 58.15(0.9) 58.87(1.4) 57.65(1.6)

TaFeng(p = 25)
RUS   66.94(1.3) 67.44(1.3) 68.1(1.4)  68.27(1.4) 66.13(1.2)
Cl K  66.94(1.3) 66.13(1.4) 63.39(1.2) 59.83(1.3) 56.94(0.7)
CL T  66.94(1.3) 66.38(1.5) 62.89(1.6) 57.46(1.3) 54.56(1.3)
Far K 66.94(1.3) 68.06(1.4) 66.43(1.6) 64.46(1.5) 63.35(1.3)
Far T 66.94(1.3) 64.31(1.1) 62.69(1)   61.27(1.1) 59.03(1)
CBU   64.81(1.2) 64.15(1.1) 64.13(1.2) 63.88(0.8) 63.46(0.8)

Book(p = 1)
RUS   52.6(1.3)  52.79(0.9) 53.46(0.8) 53.89(0.9) 54.05(0.9)
Cl K  52.6(1.3)  52.56(1.2) 52.52(1.3) 52.39(1.1) 53.09(1.1)
CL T  52.6(1.3)  52.56(1.2) 52.52(1.3) 52.39(1.1) 53.05(0.7)
Far K 52.6(1.3)  55.21(1.2) 56.21(1.8) 56.14(1.2) 53.06(1)
Far T 52.6(1.3)  55.21(1.2) 56.21(1.8) 56.14(1.2) 53.06(1)
CBU   54.28(0.9) 53.77(1)   53.33(1.1) 53.34(0.9) 52.84(0.8)

Book(p = 25)
RUS   60.08(0.7) 60.13(0.6) 60.4(0.8)  60.33(0.8) 63.28(0.8)
Cl K  60.08(0.7) 59.96(0.7) 60.13(0.8) 59.96(1)   59.28(0.7)
CL T  60.08(0.7) 59.96(0.7) 60.13(0.8) 60.29(0.4) 54.5(0.9)
Far K 60.08(0.7) 63.29(1)   64.19(0.8) 57.3(1.1)  55.66(1.1)
Far T 60.08(0.7) 62.14(0.5) 58.27(0.6) 56.37(1)   55.66(1.1)
CBU   54.82(0.9) 54.67(0.9) 54.71(0.9) 54.66(1)   54.78(0.9)

LST(p = 1)
RUS   99.99(0)   99.99(0)   99.99(0)   99.98(0)   99.99(0)
Cl K  99.99(0)   99.99(0)   99.99(0)   99.99(0)   99.99(0)
CL T  99.99(0)   99.99(0)   99.99(0)   99.99(0)   99.98(0)
Far K 99.99(0)   99.98(0)   99.98(0)   99.98(0)   99.98(0)
Far T 99.99(0)   99.98(0)   99.98(0)   99.98(0)   99.98(0)
CBU   []         []         []         []         []

Adver(p = [])
RUS   96.61(1.8) 96.32(1.8) 96.63(1.4) 97.12(2.1) 96.22(1.6)
Cl K  96.61(1.8) 96.44(1.5) 96.14(1.5) 96.04(2)   94.8(2.5)
CL T  96.61(1.8) 95.87(2.1) 94.32(1.9) 93.01(2.2) 90.72(2.3)
Far K 96.61(1.8) 96.53(1.4) 95.76(2)   94.39(1.8) 90.49(3.1)
Far T 96.61(1.8) 96.54(1.5) 95.67(1.9) 94.54(1.8) 89.3(2.8)
CBU   96.85(2.3) 96.85(2.3) 97.05(1.5) 96.6(1.6)  96.06(2.1)

Adver(p = 1)
RUS   90.93(3)   91.53(3.1) 92.37(3.4) 91.9(2.9)  91.93(2.2)
Cl K  90.93(3)   90.64(3)   89.87(3.9) 90.21(3.6) 89.18(2)
CL T  90.93(3)   89.7(3.5)  88.55(3.4) 85.76(3.3) 88.2(2.3)
Far K 90.93(3)   93.8(2.3)  92.4(2.6)  88.73(3.4) 85.51(4)
Far T 90.93(3)   93.62(2.4) 93.2(2.2)  88.41(3.6) 85.51(4)
CBU   93.22(2.4) 93.76(2.5) 93.89(2.6) 93.52(2.7) 91.27(2)

CRF(p = [])
RUS   64.06(16.4) 63.28(15.9) 67.98(17.4) 66.95(21.9) 87.73(8.8)
Cl K  64.06(16.4) 62.44(16.6) 62.34(16.9) 71.37(13.8) 78.22(17.7)
CL T  64.06(16.4) 62.44(16.6) 62.34(16.9) 71.37(13.8) 62.67(22.9)
Far K 64.06(16.4) 83.8(14.2)  83.93(14.8) 84.49(13.7) 86.11(9.7)
Far T 64.06(16.4) 83.8(14.2)  83.93(14.8) 84.49(13.7) 86.11(9.7)
CBU   []          []          []          []          []

Bank(p = [])
RUS   66.82(0.9) 67.02(0.9) 67.37(0.8) 67.99(0.6) 69.5(1)
Cl K  66.82(0.9) 66.17(0.7) 65.24(0.6) 64.86(0.6) 58.53(1.1)
CL T  66.82(0.9) 64.92(1.1) 60.69(0.9) 56.33(0.8) 52.87(0.7)
Far K 66.82(0.9) 66.95(0.6) 66.19(0.6) 64.42(0.6) 58.25(1.1)
Far T 66.82(0.9) 67.16(0.6) 64.2(0.8)  59.67(1)   58.25(1.1)
CBU   []         []         []         []         []


C Boosting experiments

This section presents the results of each of the boosting experiments from Section 4.5 on the remaining data sources. Each of the figures below depicts the average tenfold AUC-performance on test data (with µ = 100) with respect to the number of boosting iterations for (left) Adaboost (AB), AdaCost (AC) and EasyEnsemble (EE), with C chosen according to highest validation set AUC-performance (over all possible boosting rounds), and (right) AB and EE (with S = 15) with varying C-levels.

[Figure: two panels (a) and (b) plotting AUC test against the number of boosting rounds T; tick values and legends omitted (panel contents as described in the preamble of Appendix C).]

Fig 6 Mov G(p = 1) dataset

[Figure: two panels (a) and (b) plotting AUC test against the number of boosting rounds T; tick values and legends omitted.]

Fig 7 Mov Th(p = []) dataset


[Figure: two panels (a) and (b) plotting AUC test against the number of boosting rounds T; tick values and legends omitted.]

Fig 8 Yahoo A(p = 1) dataset

[Figure: two panels (a) and (b) plotting AUC test against the number of boosting rounds T; tick values and legends omitted.]

Fig 9 Yahoo A(p = 25) dataset

[Figure: two panels (a) and (b) plotting AUC test against the number of boosting rounds T; tick values and legends omitted.]

Fig 10 Yahoo G(p = 1) dataset


[Figure: two panels (a) and (b) plotting AUC test against the number of boosting rounds T; tick values and legends omitted.]

Fig 11 Yahoo G(p = 25) dataset

[Figure: two panels (a) and (b) plotting AUC test against the number of boosting rounds T; tick values and legends omitted.]

Fig 12 TaFeng(p = 1) dataset

[Figure: two panels (a) and (b) plotting AUC test against the number of boosting rounds T; tick values and legends omitted.]

Fig 13 Book(p = 1) dataset


[Figure: two panels (a) and (b) plotting AUC test against the number of boosting rounds T; tick values and legends omitted.]

Fig 14 LST(p = 1) dataset

[Figure: two panels (a) and (b) plotting AUC test against the number of boosting rounds T; tick values and legends omitted.]

Fig 15 Adver(p = []) dataset

[Figure: two panels (a) and (b) plotting AUC test against the number of boosting rounds T; tick values and legends omitted.]

Fig 16 Adver(p = 1) dataset


[Figure: two panels (a) and (b) plotting AUC test against the number of boosting rounds T; tick values and legends omitted.]

Fig 17 CRF(p = []) dataset

D Final Comparison

[Figure: scatter of average rank AUC versus average rank Time for BL, OSR, SMOTE, ADASYN, RUS, Cl Knn, Far Knn, CBU, AB, AC(R=28), AC(R=RL), EE(S=10), EE(S=15) and EE par; tick values and legend layout omitted.]

Fig 18 Average rank AUC versus average rank Time (see Table 9) across the large datasets from Table 2 (CRF and Bank). Regarding the former, an extra row EE par is added, having the same AUC-value as EE(S = 15). Ranks for each dataset are subsequently obtained via the procedure outlined in Section 4.6.1. Points occurring in the upper-right region are preferred.


References

Akbani R Kwek S Japkowicz N (2004) Applying support vector machines to imbalanced datasetsIn Machine Learning ECML 2004 15th European Conference on Machine Learning Pisa ItalySeptember 20-24 2004 Proceedings Springer Berlin Heidelberg Berlin Heidelberg pp 39ndash50 DOI101007978-3-540-30115-8 7

Ali A Shamsuddin SM Ralescu AL (2015) Classification with class imbalance problem A review Inter-national Journal of Advances in Soft Computing and its Applications 7(3)176ndash204

Alzahrani T Horadam KJ (2016) Community detection in bipartite networks Algorithms and case studiesIn Complex Systems and Networks Dynamics Controls and Applications Springer Berlin Heidel-berg Berlin Heidelberg pp 25ndash50 DOI 101007978-3-662-47824-0 2

Bachner J (2013) Predictive policing Preventing crime with data and analytics IBM Center for the Busi-ness of Government

Baesens B Van Gestel T Viaene S Stepanova M Suykens J Vanthienen J (2003) Benchmarking state-of-the-art classification algorithms for credit scoring Journal of the Operational Research Society54(6)627ndash635 DOI 101057palgravejors2601545

Barandela R Snchez J Garca V Rangel E (2003) Strategies for learning in class imbalance problemsPattern Recognition 36(3)849 ndash 851 DOI httpsdoiorg101016S0031-3203(02)00257-1

Barber MJ (2007) Modularity and community detection in bipartite networks Physical Review E76066102 DOI 101103PhysRevE76066102

Barua S Islam MM Yao X Murase K (2014) MWMOTEndashmajority weighted minority oversamplingtechnique for imbalanced data set learning IEEE Transactions on Knowledge and Data Engineer-ing 26(2)405ndash425 DOI 101109TKDE2012232

Batista GEAPA Prati RC Monard MC (2004) A study of the behavior of several methods for balancingmachine learning training data SIGKDD Explor Newsl 6(1)20ndash29 DOI 10114510077301007735

Beckett SJ (2016) Improved community detection in weighted bipartite networks Royal Society OpenScience 3(1) DOI 101098rsos140536

Bekkar M Djemaa HK Alitouche TA (2013) Evaluation measures for models assessment over imbalanceddata sets Journal of Information Engineering and Applications 3(10)27ndash38

Bhattacharyya S Jha S Tharakunnel K Westland JC (2011) Data mining for credit card fraud A compara-tive study Decision Support Systems 50(3)602 ndash 613 DOI httpsdoiorg101016jdss201008008

Blondel VD Guillaume JL Lambiotte R Lefebvre E (2008) Fast unfolding of communities in large net-works Journal of Statistical Mechanics Theory and Experiment 2008(10)P10008

Breiman L, Friedman J, Stone CJ, Olshen RA (1984) Classification and regression trees. Taylor & Francis

Brozovsky L, Petricek V (2007) Recommender system for online dating service. In: Proceedings of Znalosti 2007 Conference, VSB Ostrava

Cha M, Mislove A, Gummadi KP (2009) A measurement-driven analysis of information propagation in the Flickr social network. In: Proceedings of the 18th International Conference on World Wide Web, ACM, New York, NY, USA, WWW '09, pp 721–730, DOI 10.1145/1526709.1526806

Chawla NV (2005) Data mining for imbalanced datasets: an overview. In: Data mining and knowledge discovery handbook, Springer US, Boston, MA, pp 853–867

Chawla NV, Bowyer KW, Hall LO, Kegelmeyer WP (2002) SMOTE: synthetic minority over-sampling technique. Journal of Artificial Intelligence Research 16:321–357

Chawla NV, Lazarevic A, Hall LO, Bowyer KW (2003) SMOTEBoost: improving prediction of the minority class in boosting. In: Knowledge Discovery in Databases: PKDD 2003, Springer Berlin Heidelberg, Berlin, Heidelberg, pp 107–119

Chawla NV, Japkowicz N, Kotcz A (2004) Editorial: special issue on learning from imbalanced data sets. SIGKDD Explor Newsl 6(1):1–6, DOI 10.1145/1007730.1007733

Chen M, Mao S, Liu Y (2014) Big data: a survey. Mobile Networks and Applications 19(2):171–209, DOI 10.1007/s11036-013-0489-0

Chyi YM (2003) Classification analysis techniques for skewed class distribution problems. Master thesis, Department of Information Management, National Sun Yat-Sen University

Demšar J (2006) Statistical comparisons of classifiers over multiple data sets. Journal of Machine Learning Research 7(Jan):1–30

Dhillon IS (2001) Co-clustering documents and words using bipartite spectral graph partitioning. In: Proceedings of the Seventh ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, ACM, New York, NY, USA, KDD '01, pp 269–274, DOI 10.1145/502512.502550

54 Jellis Vanhoeyveld David Martens

Drummond C, Holte RC (2003) C4.5, class imbalance, and cost sensitivity: why under-sampling beats over-sampling. In: Proceedings of the ICML '03 Workshop on Learning from Imbalanced Datasets

Fan RE, Chang KW, Hsieh CJ, Wang XR, Lin CJ (2008) LIBLINEAR: a library for large linear classification. Journal of Machine Learning Research 9:1871–1874

Fan W, Stolfo SJ, Zhang J, Chan PK (1999) AdaCost: misclassification cost-sensitive boosting. In: Proceedings of the Sixteenth International Conference on Machine Learning, Morgan Kaufmann Publishers Inc, San Francisco, CA, USA, ICML '99, pp 97–105

Fawcett T (2006) An introduction to ROC analysis. Pattern Recognition Letters 27(8):861–874, DOI 10.1016/j.patrec.2005.10.010

Finch H (2005) Comparison of distance measures in cluster analysis with dichotomous data. Journal of Data Science 3(1):85–100

Fortunato S (2010) Community detection in graphs. Physics Reports 486(3–5):75–174, DOI 10.1016/j.physrep.2009.11.002

Junque de Fortuny E, Martens D, Provost F (2014a) Predictive modeling with big data: is bigger really better? Big Data 1(4):215–226, DOI 10.1089/big.2013.0037

Junque de Fortuny E, Stankova M, Moeyersoms J, Minnaert B, Provost F, Martens D (2014b) Corporate residence fraud detection. In: Proceedings of the 20th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, ACM, New York, NY, USA, KDD '14, pp 1650–1659, DOI 10.1145/2623330.2623333

Frasca M, Bertoni A, Re M, Valentini G (2013) A neural network algorithm for semi-supervised node label learning from unbalanced data. Neural Networks 43:84–98, DOI 10.1016/j.neunet.2013.01.021

Friedman M (1937) The use of ranks to avoid the assumption of normality implicit in the analysis of variance. Journal of the American Statistical Association 32(200):675–701

García E, Lozano F (2007) Boosting support vector machines. In: Machine Learning and Data Mining in Pattern Recognition, 5th International Conference, MLDM 2007, Leipzig, Germany, July 18-20, Post Proceedings, IBaI publishing, pp 153–167

Goldstein M, Uchida S (2016) A comparative evaluation of unsupervised anomaly detection algorithms for multivariate data. PLOS ONE 11(4):1–31, DOI 10.1371/journal.pone.0152173

González PC, Velásquez JD (2013) Characterization and detection of taxpayers with false invoices using data mining techniques. Expert Systems with Applications 40(5):1427–1436, DOI 10.1016/j.eswa.2012.08.051

Guimerà R, Sales-Pardo M, Amaral LAN (2007) Module identification in bipartite and directed networks. Physical Review E 76:036102, DOI 10.1103/PhysRevE.76.036102

Guo H, Viktor HL (2004) Learning from imbalanced data sets with boosting and data generation: the DataBoost-IM approach. SIGKDD Explor Newsl 6(1):30–39, DOI 10.1145/1007730.1007736

Guo X, Yin Y, Dong C, Yang G, Zhou G (2008) On the class imbalance problem. In: 2008 Fourth International Conference on Natural Computation, IEEE, vol 4, pp 192–201, DOI 10.1109/ICNC.2008.871

Han H, Wang WY, Mao BH (2005) Borderline-SMOTE: a new over-sampling method in imbalanced data sets learning. In: Advances in Intelligent Computing, Springer Berlin Heidelberg, Berlin, Heidelberg, pp 878–887

He H, Garcia EA (2009) Learning from imbalanced data. IEEE Transactions on Knowledge and Data Engineering 21(9):1263–1284, DOI 10.1109/TKDE.2008.239

He H, Bai Y, Garcia EA, Li S (2008) ADASYN: adaptive synthetic sampling approach for imbalanced learning. In: 2008 IEEE International Joint Conference on Neural Networks (IEEE World Congress on Computational Intelligence), IEEE, pp 1322–1328, DOI 10.1109/IJCNN.2008.4633969

Holm S (1979) A simple sequentially rejective multiple test procedure. Scandinavian Journal of Statistics 6(2):65–70

Hsu CW, Lin CJ (2002) A comparison of methods for multiclass support vector machines. IEEE Transactions on Neural Networks 13(2):415–425, DOI 10.1109/72.991427

Huang A (2008) Similarity measures for text document clustering. In: Proceedings of the Sixth New Zealand Computer Science Research Student Conference (NZCSRSC2008), Christchurch, New Zealand, pp 49–56

Iman RL, Davenport JM (1980) Approximations of the critical region of the Friedman statistic. Communications in Statistics-Theory and Methods 9(6):571–595

Jo T, Japkowicz N (2004) Class imbalances versus small disjuncts. ACM SIGKDD Explor Newsl 6(1):40–49, DOI 10.1145/1007730.1007737

Imbalanced classification in sparse and large behaviour datasets 55

Jutla IS, Jeub LG, Mucha PJ (2011–2016) A generalized Louvain method for community detection implemented in MATLAB. URL http://netwiki.amath.unc.edu/GenLouvain

Kubat M, Matwin S (1997) Addressing the curse of imbalanced training sets: one-sided selection. In: Proceedings of the Fourteenth International Conference on Machine Learning, Morgan Kaufmann Publishers Inc, San Francisco, CA, USA, pp 179–186

Lancichinetti A, Fortunato S (2009) Community detection algorithms: a comparative analysis. Physical Review E 80:056117, DOI 10.1103/PhysRevE.80.056117

Larremore DB, Clauset A, Jacobs AZ (2014) Efficiently inferring community structure in bipartite networks. Physical Review E: Statistical, Nonlinear, and Soft Matter Physics 90:012805, DOI 10.1103/PhysRevE.90.012805

Li J, Fine JP (2010) Weighted area under the receiver operating characteristic curve and its application to gene selection. Journal of the Royal Statistical Society: Series C (Applied Statistics) 59(4):673–692, DOI 10.1111/j.1467-9876.2010.00713.x

Li X, Wang L, Sung E (2008) AdaBoost with SVM-based component classifiers. Engineering Applications of Artificial Intelligence 21(5):785–795, DOI 10.1016/j.engappai.2007.07.001

Lichman M (2013) UCI machine learning repository. URL http://archive.ics.uci.edu/ml

Liu W, Chawla S, Cieslak DA, Chawla NV (2010) A robust decision tree algorithm for imbalanced data sets. In: Proceedings of the Tenth SIAM International Conference on Data Mining, SIAM, Philadelphia, vol 10, pp 766–777

Liu XY, Wu J, Zhou ZH (2009) Exploratory undersampling for class-imbalance learning. IEEE Transactions on Systems, Man, and Cybernetics, Part B (Cybernetics) 39(2):539–550, DOI 10.1109/TSMCB.2008.2007853

Luts J, Ojeda F, Van de Plas R, De Moor B, Van Huffel S, Suykens JA (2010) A tutorial on support vector machine-based methods for classification problems in chemometrics. Analytica Chimica Acta 665(2):129–145, DOI 10.1016/j.aca.2010.03.030

Macskassy SA, Provost F (2007) Classification in networked data: a toolkit and a univariate case study. Journal of Machine Learning Research 8(May):935–983

Martens D, Provost F (2014) Explaining data-driven document classifications. MIS Quarterly 38(1):73–100

Martens D, Provost F, Clark J, Junque de Fortuny E (2016) Mining massive fine-grained behavior data to improve predictive analytics. MIS Quarterly 40(4):869–888

Mazurowski MA, Habas PA, Zurada JM, Lo JY, Baker JA, Tourassi GD (2008) Training neural network classifiers for medical decision making: the effects of imbalanced datasets on classification performance. Neural Networks 21(2–3):427–436, DOI 10.1016/j.neunet.2007.12.031

Mease D, Wyner AJ, Buja A (2007) Boosted classification trees and class probability/quantile estimation. Journal of Machine Learning Research 8:409–439

Nemenyi P (1963) Distribution-free multiple comparisons. Dissertation, Princeton University

Newman MEJ, Girvan M (2004) Finding and evaluating community structure in networks. Physical Review E 69:026113, DOI 10.1103/PhysRevE.69.026113

Ng AY (2004) Feature selection, L1 vs. L2 regularization, and rotational invariance. In: Proceedings of the Twenty-first International Conference on Machine Learning, ACM, New York, NY, USA, ICML '04, pp 78–, DOI 10.1145/1015330.1015435

Ng AY, Jordan MI (2002) On discriminative vs. generative classifiers: a comparison of logistic regression and naive Bayes. In: Advances in Neural Information Processing Systems 14, MIT Press, pp 841–848

Ngai E, Hu Y, Wong Y, Chen Y, Sun X (2011) The application of data mining techniques in financial fraud detection: a classification framework and an academic review of literature. Decision Support Systems 50(3):559–569, DOI 10.1016/j.dss.2010.08.006

Platt JC (1999) Probabilistic outputs for support vector machines and comparisons to regularized likelihood methods. In: Advances in Large-Margin Classifiers, MIT Press, pp 61–74

Porter MA, Onnela JP, Mucha PJ (2009) Communities in networks. Notices of the American Mathematical Society 56(9):1082–1097

Provost F, Fawcett T (2013) Data Science for Business: what you need to know about data mining and data-analytic thinking. O'Reilly Media, Inc

Provost F, Dalessandro B, Hook R, Zhang X, Murray A (2009) Audience selection for on-line brand advertising: privacy-friendly social network targeting. In: Proceedings of the 15th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, ACM, New York, NY, USA, KDD '09, pp 707–716, DOI 10.1145/1557019.1557098


Raskutti B, Kowalczyk A (2004) Extreme re-balancing for SVMs: a case study. SIGKDD Explor Newsl 6(1):60–69, DOI 10.1145/1007730.1007739

Rosvall M, Bergstrom CT (2008) Maps of random walks on complex networks reveal community structure. Proceedings of the National Academy of Sciences 105(4):1118–1123, DOI 10.1073/pnas.0706851105

Schapire RE (1999) A brief introduction to boosting. In: Proceedings of the 16th International Joint Conference on Artificial Intelligence - Volume 2, Morgan Kaufmann Publishers Inc, San Francisco, CA, USA, IJCAI '99, pp 1401–1406

Schapire RE, Singer Y (1999) Improved boosting algorithms using confidence-rated predictions. Machine Learning 37(3):297–336, DOI 10.1023/A:1007614523901

Shmueli G (2017) Analyzing behavioral big data: methodological, practical, ethical, and moral issues. Quality Engineering 29(1):57–74, DOI 10.1080/08982112.2016.1210979

Sobhani P, Viktor H, Matwin S (2015) Learning from imbalanced data using ensemble methods and cluster-based undersampling. In: New Frontiers in Mining Complex Patterns: Third International Workshop, NFMCP 2014, Held in Conjunction with ECML-PKDD 2014, Nancy, France, September 19, 2014, Revised Selected Papers, Springer International Publishing, Cham, pp 69–83, DOI 10.1007/978-3-319-17876-9_5

Stankova M (2016) Classification within network data with a bipartite structure. Dissertation, University of Antwerp

Stankova M, Martens D, Provost F (2015) Classification over bipartite graphs through projection. Working Papers 2015001, University of Antwerp, Faculty of Applied Economics

Sun Y, Kamel MS, Wong AK, Wang Y (2007) Cost-sensitive boosting for classification of imbalanced data. Pattern Recognition 40(12):3358–3378, DOI 10.1016/j.patcog.2007.04.009

Suykens JA, Van Gestel T, De Brabanter J, De Moor B, Vandewalle J (2002) Least squares support vector machines. World Scientific, Singapore

Tang B, He H (2015) ENN: extended nearest neighbor method for pattern recognition [research frontier]. IEEE Computational Intelligence Magazine 10(3):52–60, DOI 10.1109/MCI.2015.2437512

Tang Y, Zhang YQ, Chawla NV, Krasser S (2009) SVMs modeling for highly imbalanced classification. IEEE Transactions on Systems, Man, and Cybernetics, Part B (Cybernetics) 39(1):281–288, DOI 10.1109/TSMCB.2008.2002909

Tobback E, Moeyersoms J, Stankova M, Martens D (2016) Bankruptcy prediction for SMEs using relational data. Working Paper 2016004, University of Antwerp, Faculty of Applied Economics

Verbeke W, Dejaeger K, Martens D, Hur J, Baesens B (2012) New insights into churn prediction in the telecommunication sector: a profit driven data mining approach. European Journal of Operational Research 218(1):211–229, DOI 10.1016/j.ejor.2011.09.031

Veropoulos K, Campbell I, Cristianini N (1999) Controlling the sensitivity of support vector machines. In: Proceedings of the International Joint Conference on Artificial Intelligence, Stockholm, Sweden (IJCAI99), pp 55–60

Whitrow C, Hand DJ, Juszczak P, Weston D, Adams NM (2009) Transaction aggregation as a strategy for credit card fraud detection. Data Mining and Knowledge Discovery 18(1):30–55, DOI 10.1007/s10618-008-0116-z

Wickramaratna J, Holden SB, Buxton BF (2001) Performance degradation in boosting. In: Proceedings of the Second International Workshop on Multiple Classifier Systems, Springer, London, UK, MCS '01, pp 11–21

Yen SJ, Lee YS (2009) Cluster-based under-sampling approaches for imbalanced data distributions. Expert Systems with Applications 36(3, Part 1):5718–5727, DOI 10.1016/j.eswa.2008.06.108

Yu HF, Lo HY, Hsieh HP, Lou JK, McKenzie TG, Chou JW, Chung PH, Ho CH, Chang CF, Wei YH, et al (2010) Feature engineering and classifier ensemble for KDD Cup 2010. In: Proceedings of the KDD Cup 2010 Workshop, pp 1–16

Zha H, He X, Ding C, Simon H, Gu M (2001) Bipartite graph partitioning and data clustering. In: Proceedings of the Tenth International Conference on Information and Knowledge Management, ACM, New York, NY, USA, CIKM '01, pp 25–32, DOI 10.1145/502585.502591

Zhang J, Mani I (2003) kNN approach to unbalanced data distributions: a case study involving information extraction. In: Proceedings of the ICML 2003 Workshop on Learning from Imbalanced Datasets, Washington DC

Ziegler CN, McNee SM, Konstan JA, Lausen G (2005) Improving recommendation lists through topic diversification. In: Proceedings of the 14th International Conference on World Wide Web, ACM, New York, NY, USA, WWW '05, pp 22–32, DOI 10.1145/1060745.1060754


Table 5 Continued. Additionally, an average rank column is added, showing the mean rank of each algorithm across all datasets. Note that we excluded the LST dataset for this purpose.

             Adver(p = [])       Adver(p = 1)        CRF(p = [])           Bank(p = [])

BL           96.61 (1.82) [0]    90.93 (3.02) [0]    64.06 (16.43) [0]     66.82 (0.88) [0]
OSR          96.93 (1.91) [0.3]  93.3 (2.02) [2.4]   80.74 (12.93) [16.7]  71.39 (0.79) [4.6]
SMOTE        97.05 (1.66) [0.4]  93.35 (2.01) [2.4]  78.7 (16.56) [14.6]   []
ADASYN       96.91 (1.95) [0.3]  93.46 (2.21) [2.5]  78.87 (16.71) [14.8]  []
RUS          96.81 (1.87) [0.2]  92.38 (2.51) [1.5]  83.98 (5.99) [19.9]   69.41 (1.19) [2.6]
Cl Knn       96.4 (1.48) [-0.2]  89.73 (3.42) [-1.2] 76.63 (16.19) [12.6]  66.17 (0.72) [-0.6]
Far Knn      95.77 (1.81) [-0.8] 93.88 (1.78) [3]    83.75 (13.11) [19.7]  66.95 (0.56) [0.1]
CBU          97.15 (1.88) [0.5]  94.18 (2.3) [3.3]   []                    []
AB           97.34 (2.18) [0.7]  91.39 (3.23) [0.5]  77.62 (15.15) [13.6]  66.82 (0.88) [0]
AC(R = 28)   97.44 (1.93) [0.8]  91 (3.35) [0.1]     68.31 (14.93) [4.2]   67.67 (0.71) [0.9]
AC(R = R_L)  97.46 (1.71) [0.8]  93.51 (2.17) [2.6]  85.08 (9.77) [21]     70.7 (0.8) [3.9]
EE(S = 10)   97.64 (1.35) [1]    92.97 (2.75) [2]    86.18 (10.17) [22.1]  71.46 (0.81) [4.6]
EE(S = 15)   97.63 (1.35) [1]    93.3 (2.14) [2.4]   86.35 (9.99) [22.3]   71.54 (0.76) [4.7]

             Average Rank

BL           11.600
OSR           5.000
SMOTE         4.533
ADASYN        4.800
RUS           8.167
Cl Knn       12.467
Far Knn       8.133
CBU           8.567
AB            8.267
AC(R = 28)    8.467
AC(R = R_L)   5.400
EE(S = 10)    3.267
EE(S = 15)    2.333

Table 6 Nemenyi test outcomes at the α = 0.05 significance level. A value of 1 indicates the null-hypothesis is rejected and thus finds the two algorithms significantly different (a value of 0 means we accept the null-hypothesis). Methods are listed in the same order as Table 5 and are more concisely represented.

      BL  RO  SM  AD  RU  Cl  Fa  CBU  AB  AC1  AC2  EE1  EE2

BL     0   1   1   1   0   0   0   0    0   0    1    1    1
RO     1   0   0   0   0   1   0   0    0   0    0    0    0
SM     1   0   0   0   0   1   0   0    0   0    0    0    0
AD     1   0   0   0   0   1   0   0    0   0    0    0    0
RU     0   0   0   0   0   0   0   0    0   0    0    1    1
Cl     0   1   1   1   0   0   0   0    0   0    1    1    1
Fa     0   0   0   0   0   0   0   0    0   0    0    1    1
CBU    0   0   0   0   0   0   0   0    0   0    0    1    1
AB     0   0   0   0   0   0   0   0    0   0    0    1    1
AC1    0   0   0   0   0   0   0   0    0   0    0    1    1
AC2    1   0   0   0   0   1   0   0    0   0    0    0    0
EE1    1   0   0   0   1   1   1   1    1   1    0    0    0
EE2    1   0   0   0   1   1   1   1    1   1    0    0    0


distribution function: p = 2 × min(Φ(z), 1 − Φ(z)). Holm's method (Holm 1979) compares k − 1 classifiers to a single control classifier and sorts the corresponding p-values in ascending order, so that p_1 ≤ p_2 ≤ ... ≤ p_{k−1}. Each p_i is subsequently compared to its associated significance level25 αcomp = α/(k − i). Holm starts by performing the check p_1 < α/(k − 1) and rejects the null-hypothesis if the test passes. It then proceeds with a similar check for i = 2, and continues until a certain null-hypothesis cannot be rejected. The remaining hypotheses are retained.
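The step-down rule just described is straightforward to operationalize. The sketch below is our own minimal illustration of Holm's procedure with made-up p-values, not the values reported in this paper:

```python
def holm_test(p_values, alpha=0.05):
    """Holm's step-down procedure: compare k - 1 methods against a control.

    p_values: dict mapping method name -> unadjusted p-value.
    Returns the set of methods whose null hypothesis is rejected.
    """
    k = len(p_values) + 1                        # total classifiers, incl. the control
    ranked = sorted(p_values.items(), key=lambda kv: kv[1])
    rejected = set()
    for i, (method, p) in enumerate(ranked, start=1):
        if p < alpha / (k - i):                  # compare p_i to alpha / (k - i)
            rejected.add(method)
        else:
            break                                # first failure: retain all remaining hypotheses
    return rejected

# Illustrative (made-up) p-values for three methods versus a control:
print(holm_test({"A": 0.001, "B": 0.02, "C": 0.30}))  # rejects A and B at alpha = 0.05
```

Note that the loop stops at the first retained hypothesis, so a method with a small p-value can still be retained if a method ranked before it fails its check; this is exactly the situation footnote 26 alludes to.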

Table 7 shows the result of the Holm test at the α = 0.05 significance level, with the BL algorithm as control classifier. The table lists each method according to sorted p-values and shows the z-, p- and αcomp-values. The significance column indicates whether the proposed method is significantly different from the BL, and coincidentally matches the result of the Nemenyi test. The p-values of the (Far Knn, RUS, CBU) undersampling methods, AB and AC (R = 28) are only slightly higher than the level required to reject the null-hypothesis. We included a critical significance level αcrit, corresponding to the lowest possible significance level at which the method would be considered significantly different from the BL (if α = αcrit, then p = αcomp). In the case of AB, for example, if we were to choose a significance level α = 0.0764 (a confidence level of 92.36%), the associated p-value would be smaller than αcomp and we would proceed to conclude26 that AB performs significantly differently from the BL. To summarize: the null-hypothesis of equal performance between the BL and an alternative method proposed in Section 3 (except Cl Knn) is rejected with a confidence level of more than 91% according to Holm's test. Hence, the performance of the BL approach is significantly worse than that of each of the proposed over- and undersampling methods (except Cl Knn), cost-sensitive learning techniques (AC) and boosting variants (AB, EE).

Similar to the previous paragraph, Table 8 presents the results of Holm's test at the α = 0.05 significance level, with EE(S = 15) as control classifier. The results coincidentally correspond with the result of the Nemenyi test and indicate that the oversampling (OSR, SMOTE and ADASYN), AC(R = R_L) and EE(S = 10) methods are not statistically different from the EE(S = 15) technique at the 0.05 significance level. However, if we were to adapt the significance level to approximately α = 0.25 (a 75% confidence level), the results would conclude significance (except for EE(S = 10)). To summarize: the null-hypothesis of equal performance between EE(S = 15) and an alternative method proposed in Section 3 (except EE(S = 10)) is rejected with a confidence level of more than 75% according to Holm's test. Hence, the performance of the oversampling, undersampling, AB and AC approaches is significantly worse than that of EE(S = 15). We can therefore safely conclude EE to be the preferred method in terms of AUC performance when dealing with imbalanced behaviour data.

25 αcomp adjusts the value of α to compensate for multiple comparisons.
26 This is not entirely true, since Holm's method requires Far Knn to pass the test first before proceeding, which only occurs at the 0.088662 significance level.


Table 7 Holm test at the α = 0.05 significance level with BL as reference. The table shows the z test statistic with associated p-value, and αcomp = α/(k − i), with k the number of algorithms (k = 13) and i corresponding to the position of the method in the sorted p-value vector. Since we already ranked the p-values in ascending order, i takes on the row number (e.g. OSR has i = 5). The column 'significant' denotes whether we can reject the null-hypothesis (significant: p < αcomp). αcrit corresponds with the smallest possible significance level at which we would decide to reject the null-hypothesis (αcrit = α·p/αcomp).

              z          p         αcomp     significant  αcrit

EE(S = 15)   -6.51642   7.2E-11   0.004167   1            8.64E-10
EE(S = 10)   -5.86009   4.63E-09  0.004545   1            5.09E-08
SMOTE        -4.96936   6.72E-07  0.005      1            6.72E-06
ADASYN       -4.78183   1.74E-06  0.005556   1            1.56E-05
OSR          -4.64119   3.46E-06  0.00625    1            2.77E-05
AC(R = R_L)  -4.35991   1.3E-05   0.007143   1            9.11E-05
Far Knn      -2.4378    0.014777  0.008333   0            0.088662
RUS          -2.41436   0.015763  0.01       0            0.078815
AB           -2.34404   0.019076  0.0125     0            0.076305
AC(R = 28)   -2.20339   0.027567  0.016667   0            0.082701
CBU          -2.13307   0.032919  0.025      0            0.065837
Cl Knn        0.609449  0.542227  0.05       0            0.542227

Table 8 Holm test at the α = 0.05 significance level with EE(S = 15) as reference.

              z         p         αcomp     significant  αcrit

Cl Knn       7.12587   1.03E-12  0.004167   1            1.24E-11
BL           6.516421  7.2E-11   0.004545   1            7.92E-10
CBU          4.383348  1.17E-05  0.005      1            0.000117
AC(R = 28)   4.313027  1.61E-05  0.005556   1            0.000145
AB           4.172384  3.01E-05  0.00625    1            0.000241
RUS          4.102063  4.09E-05  0.007143   1            0.000287
Far Knn      4.078623  4.53E-05  0.008333   1            0.000272
AC(R = R_L)  2.156513  0.031044  0.01       0            0.155218
OSR          1.875229  0.060761  0.0125     0            0.243045
ADASYN       1.734587  0.082814  0.016667   0            0.248442
SMOTE        1.547064  0.121848  0.025      0            0.243696
EE(S = 10)   0.65633   0.511612  0.05       0            0.511612

4.6.2 Timing-wise comparison

The performance of each of the proposed algorithms in terms of training times depends highly upon the chosen parameters. The β and β_u parameters control the number of training instances and are an important contribution to the subsequent SVM learning times for each of the over- and undersampling techniques. The number of boosting iterations t ∈ [0, T] is also a major factor in each of the boosting variants proposed in Section 3.3. Furthermore, the chosen regularization parameter C highly influences the training times of the SVM inducer: higher C-values result in larger training times (Fan et al 2008). Not only do the chosen hyperparameters affect learning times; data characteristics such as the number of minority instances, the imbalance level, the number of features, the amount of overlapping, etc. have a major effect as well.


In comparing each of the methods outlined in Section 3, we make use of a similar methodology as previously presented (see Section 4.6.1). For each of the ten folds, a suitable parameter combination is selected based on optimal validation set AUC-performance. These chosen parameters are applied in determining the timings of each method. The final results indicate the average timings across ten folds. Hence, we show the required amount of time for each method when using the parameter combination resulting in the best performance. Note that this might, for instance, result in using β = 13 in case of OSR or using only t = 5 boosting iterations in EE. Computational timings are shown in Table 9.

As one can observe from the aforementioned table, CBU is the slowest method, as it relies on computationally intensive clustering procedures. The synthetic SMOTE and ADASYN oversampling approaches are very time consuming methods: they both rely on computationally expensive nearest neighbour computations in conjunction with a synthetic sample generation procedure that requires a single pass over all minority training instances.
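The nearest-neighbour search dominates this cost. Purely as an illustration of SMOTE's two steps (our own bare-bones simplification with dense arrays, Euclidean distances and a uniform neighbour choice, not the implementation benchmarked here):

```python
import random
import numpy as np

def smote(X_min, n_synthetic, k=5, rng=None):
    """Generate synthetic minority samples by interpolating between each
    minority instance and one of its k nearest minority neighbours."""
    rng = rng or random.Random(0)
    n = len(X_min)
    # Pairwise Euclidean distances within the minority class (the expensive part).
    d = np.linalg.norm(X_min[:, None, :] - X_min[None, :, :], axis=2)
    neighbours = np.argsort(d, axis=1)[:, 1:k + 1]       # skip self at index 0
    synthetic = []
    for _ in range(n_synthetic):
        i = rng.randrange(n)
        j = rng.choice(list(neighbours[i]))
        gap = rng.random()                               # interpolation factor in [0, 1)
        synthetic.append(X_min[i] + gap * (X_min[j] - X_min[i]))
    return np.vstack(synthetic)

X_min = np.array([[0.0, 0.0], [1.0, 0.0], [0.0, 1.0], [1.0, 1.0]])
print(smote(X_min, n_synthetic=3, k=2).shape)  # (3, 2)
```

ADASYN follows the same template but biases the choice of the seed instance i towards minority examples surrounded by many majority neighbours, which adds a further neighbourhood computation.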

The RUS method shows, as expected, superior results in terms of computational timings. Other undersampling methods perform quite well, though they are slower than BL and OSR due to their inherent nearest neighbour computations.

The boosting variants seem to perform worse than the BL, OSR and each of the undersampling methods. Since EE was the best method in terms of AUC-performance (see Section 4.6.1), we focus on this learner in this paragraph. For the medium sized datasets (all but CRF and Bank), OSR outperforms EE(S = 15). Yet for the large datasets (CRF and Bank), the opposite is true. A SVM is known to have a quadratic computational complexity in the number of training instances (Suykens et al 2002; García and Lozano 2007). For this reason, we consider OSR (or any other oversampling technique) inappropriate for large datasets. Since with EE each subset has a size limited to twice the minority class training size, we expect increasing benefits of the EE-technique for larger and more imbalanced behaviour data. As Junque de Fortuny et al (2014a) have observed, larger behaviour data, in terms of the number of instances or features, also contributes to an increase in predictive power. Hence, we expect to see this type of big behaviour data more often in real-life applications, where computational resources can become a major constraint. Because the EE-technique is easily parallelizable, by simultaneously performing the boosting process for each of the independent subsets, we added an extra row in Table 9: EE par corresponds with the time needed for EE(S = 15) divided by 15, indicating the time that would be required for EE if each subset were trained in parallel. We can see that EE par outperforms OSR in 8 out of 16 cases, precisely for the datasets that show a high imbalance level (e.g. p = 1 or p = []).
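The parallelization argument can be made concrete with a small structural sketch. This is not our implementation: the train function below is a stand-in for the per-subset boosting process, and the toy 1-D data is purely illustrative:

```python
from concurrent.futures import ThreadPoolExecutor
import random

def balanced_subsets(minority, majority, S, rng=None):
    """Draw S independent subsets, each pairing the full minority class with
    an equally sized random sample of the majority class (subset size = 2x minority)."""
    rng = rng or random.Random(0)
    return [minority + rng.sample(majority, len(minority)) for _ in range(S)]

def easy_ensemble_score(minority, majority, S, train, x):
    """Train one learner per subset in parallel and average their scores on x."""
    subsets = balanced_subsets(minority, majority, S)
    with ThreadPoolExecutor(max_workers=S) as pool:
        models = list(pool.map(train, subsets))       # the embarrassingly parallel step
    return sum(model(x) for model in models) / S

# Toy 1-D data; 'train' fits a trivial mean-threshold scorer per subset.
minority = [2.0, 2.2, 1.8]
majority = [-1.0, -0.5, 0.0, -2.0, -1.5, 0.5]
train = lambda subset: (lambda x, t=sum(subset) / len(subset): float(x > t))
print(easy_ensemble_score(minority, majority, S=3, train=train, x=3.0))  # 1.0
```

Because the S subsets share no state, the wall-clock time of the parallel version is roughly the time of one boosted subset, which is what the EE par row approximates by dividing by 15.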

We conclude our comparison of the proposed methods in Figure 5, which shows the average ranks in terms of computational timings (see Table 9) and AUC performance for each of the available data sources. Regarding the ranks with respect to the latter, we added an extra row EE par, having the same AUC-performance as EE(S = 15) for each dataset in Table 5. Ranks are subsequently determined according to the procedure outlined in Section 4.6.1. Additionally, Figure 18 in Appendix D shows the result for the large and highly imbalanced CRF and Bank datasets. We expect the latter type of data to be of major interest for reasons already indicated.


Table 9 Computational timings (in seconds), averaged across ten folds, using the parameter combination resulting in the highest validation set AUC-performance. First and second best performances are emphasized in boldface (third and fourth best are underlined). EE par is the time required for EE(S = 15) divided by 15.

             Mov G(p = 1)  Mov G(p = 25)  Mov Th(p = [])  Yahoo A(p = 1)

BL           0.032889      0.056697       0.558563        0.026922
OSR          0.055043      0.062802       0.99009         0.044421
SMOTE        0.218821      0.937057       3.841482        0.057726
ADASYN       0.284688      1.802399       5.191265        0.087694
RUS          0.011431      0.025383       0.155224        0.007991
Cl Knn       0.046599      0.599846       0.989914        0.037182
Far Knn      0.039887      0.80072        0.683023        0.027788
CBU          1.034111      10.60173       6.822839        1.692477
AB           0.169792      0.841443       3.460246        0.139251
AC(R = 28)   0.471994      2.996585       1.086907        0.366555
AC(R = R_L)  0.53376       1.179542       6.065177        0.209015
EE(S = 10)   0.117226      6.065145       1.17995         0.148973
EE(S = 15)   0.20474       7.173737       2.119991        0.180365

EE par       0.013649      0.478249       0.141333        0.012024

             Yahoo A(p = 25)  Yahoo G(p = 1)  Yahoo G(p = 25)  TaFeng(p = 1)

BL           0.092954         0.011915        0.044164         0.026728
OSR          0.027887         0.013241        0.047206         0.040919
SMOTE        1.062686         0.056153        0.883698         0.219553
ADASYN       2.050993         0.079073        1.733367         0.306618
RUS          0.048471         0.003234        0.033423         0.002916
Cl Knn       0.84391          0.025404        0.502515         0.092167
Far Knn      0.664124         0.026576        0.500206         0.080159
CBU          15.69442         1.287221        13.55035         2.467279
AB           0.445546         0.078777        0.169977         0.114619
AC(R = 28)   1.034044         0.321723        0.515953         0.926178
AC(R = R_L)  0.706215         0.226741        0.112949         0.610233
EE(S = 10)   1.026577         0.100331        1.527146         0.058052
EE(S = 15)   1.607596         0.077483        2.472582         0.10538

EE par       0.107173         0.005166        0.164839         0.007025

             TaFeng(p = 25)  Book(p = 1)  Book(p = 25)  LST(p = 1)

BL           0.032033        0.080035     0.318093      0.652045
OSR          0.032414        0.132927     0.092757      0.87152
SMOTE        5.089283        3.409418     11.43444      4.987705
ADASYN       8.148419        3.689661     12.25441      6.840083
RUS          0.020457        0.022713     0.031972      0.432839
Cl Knn       1.713731        0.400873     3.711648      2.508374
Far Knn      1.539437        0.379086     3.988552      2.511037
CBU          26.42686        4.198663     46.31987      []
AB           0.713265        0.61719      1.238585      2.466151
AC(R = 28)   1.234647        1.666131     2.330635      1.451671
AC(R = R_L)  0.279047        0.860346     0.197053      1.23763
EE(S = 10)   2.484502        2.145747     7.177484      0.524066
EE(S = 15)   3.363971        2.480066     11.21945      0.784111

EE par       0.224265        0.165338     0.747963      0.052274


Table 9 Continued. Additionally, an extra column indicating average ranks is included (lower ranks are preferred).

             Adver(p = [])  Adver(p = 1)  CRF(p = [])  Bank(p = [])

BL           0.010953       0.002796      0.725911     70.89334
OSR          0.012178       0.006166      3.685813     179.7481
SMOTE        0.123112       0.017764      5.633862     []
ADASYN       0.183767       0.021728      5.768669     []
RUS          0.012115       0.00204       0.147392     52.47441
Cl Knn       0.061324       0.005568      1.106755     73.73282
Far Knn      0.079078       0.007069      1.110379     97.59619
CBU          3.378235       3.236754      []           []
AB           0.069199       0.103518      1.153196     83.08618
AC(R = 28)   0.193092       0.068905      2.047434     71.70548
AC(R = R_L)  0.107652       0.037963      1.387174     106.3466
EE(S = 10)   0.138485       0.085686      0.198656     24.95117
EE(S = 15)   0.185136       0.139121      0.285345     36.40107

EE par       0.012342       0.009275      0.019023     2.426738

             Average Rank [pos]

BL            2.94 [2]
OSR           4.19 [4]
SMOTE         9.59 [11]
ADASYN       10.91 [13]
RUS           1.38 [1]
Cl Knn        6.5 [5]
Far Knn       6.56 [6]
CBU          14 [14]
AB            8.06 [7]
AC(R = 28)   10.81 [12]
AC(R = R_L)   9.25 [9]
EE(S = 10)    8.25 [8]
EE(S = 15)    9.56 [10]

EE par        3 [3]

Timings are a substantial concern with respect to these large datasets. As can be observed, the EE-technique thrives for these big and highly imbalanced data, even in its non-parallel form.



Fig. 5 Average rank AUC versus average rank Time (see Table 9) across all datasets from Table 2 (excluding Flickr and Kdd). Regarding the AUC rank, an extra row EE par is added, having the same AUC-value as EE(S = 15). Ranks for each dataset are subsequently obtained via the procedure outlined in Section 4.6.1. Points occurring in the upper-right region are preferred.

5 The effect of the chosen base learner

In this short section, we investigate the influence of the chosen base learner. More specifically, we ask ourselves whether some base inducers can more easily cope with the imbalanced learning issue, and in doing so we verify the conclusion that EE is a suitable technique to elevate upon the baseline performance. We rely on several popular classification methods for behaviour data: two regularization based techniques (SVM, LR) and two heuristic types of approaches (Naive Bayes (NB), Behavioural Similarity (BeSim)). The next paragraphs briefly introduce these methods. Note that source codes and dataset results are provided in our on-line repository.

LR constructs a linear model that maximizes the likelihood of the observed data by imposing a logistic model for the parametric probability model P(Y|X). Ng and Jordan (2002) provide a thorough introduction to the topic27 and note that in its plain form, LR can suffer from overfitting, especially when the input data is very high dimensional. Therefore, we resort to regularized logistic regression. We opt for L2-regularized LR, since L1-regularization corresponds with a natural way of imposing feature selection (Ng 2004). The LIBLINEAR toolbox (Fan et al 2008) is used to obtain these L2-regularized LR models.
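LIBLINEAR minimizes this objective with specialized solvers; purely to make the L2-regularized objective concrete, the following self-contained sketch fits it with plain gradient descent on a tiny, made-up binary matrix (illustrative data, not one of our datasets):

```python
import math

def train_l2_logreg(X, y, lam=0.1, lr=0.5, epochs=200):
    """L2-regularized logistic regression via plain gradient descent.
    X: list of binary feature lists; y: labels in {0, 1}; lam: L2 penalty strength."""
    w = [0.0] * len(X[0])
    b = 0.0
    n = len(X)
    for _ in range(epochs):
        gw, gb = [lam * wj for wj in w], 0.0       # gradient of the L2 penalty term
        for xi, yi in zip(X, y):
            p = 1.0 / (1.0 + math.exp(-(sum(wj * xj for wj, xj in zip(w, xi)) + b)))
            err = (p - yi) / n                     # gradient of the average log-loss
            gw = [gj + err * xj for gj, xj in zip(gw, xi)]
            gb += err
        w = [wj - lr * gj for wj, gj in zip(w, gw)]
        b -= lr * gb
    return w, b

# Toy behaviour matrix: rows = individuals, columns = fine-grained binary actions.
X = [[1, 0, 1, 0], [1, 1, 0, 0], [0, 0, 1, 1], [0, 1, 0, 1]]
y = [1, 1, 0, 0]
w, b = train_l2_logreg(X, y)
preds = [1 if sum(wj * xj for wj, xj in zip(w, xi)) + b > 0 else 0 for xi in X]
print(preds)  # [1, 1, 0, 0]
```

The penalty term lam keeps the weights small, which is what counters overfitting in the very high dimensional setting; LIBLINEAR exposes the same trade-off through its C parameter (C plays the role of 1/lam).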

27 For an accessible introduction, see the chapter on "Generative and Discriminative Classifiers: Naive Bayes and Logistic Regression" provided at the following link: http://www.cs.cmu.edu/~tom/NewChapters.html

Imbalanced classification in sparse and large behaviour datasets 39

NB (Ng and Jordan 2002) relies on the use of the Bayes rule, which expresses the posterior P(Y|X) as the product of the likelihood P(X|Y) and the prior P(Y), divided by the evidence P(X). The naivety of the approach originates from assuming conditional independence of the features. We impose a multivariate event model regarding the distribution P(X|Y). Junque de Fortuny et al (2014a) provide an efficient NB implementation for dealing with large and sparse binary behaviour data.
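A minimal sketch of this multivariate (Bernoulli) event model on sparse binary rows, with Laplace smoothing (our toy code, not the optimized implementation of Junque de Fortuny et al):

```python
import math
from collections import Counter

def train_bernoulli_nb(rows, labels, n_features, alpha=1.0):
    """Multivariate Bernoulli Naive Bayes with Laplace smoothing.
    rows: list of sets of active (value-1) feature indices (sparse input)."""
    classes = sorted(set(labels))
    prior, cond = {}, {}
    for c in classes:
        docs = [r for r, l in zip(rows, labels) if l == c]
        prior[c] = math.log(len(docs) / len(rows))
        counts = Counter(i for r in docs for i in r)
        # P(feature j active | class c), smoothed
        cond[c] = [(counts[j] + alpha) / (len(docs) + 2 * alpha)
                   for j in range(n_features)]
    return prior, cond

def nb_predict(model, row, n_features):
    prior, cond = model
    scores = {}
    for c in prior:
        s = prior[c]
        for j in range(n_features):  # both active and inactive features count
            p = cond[c][j]
            s += math.log(p) if j in row else math.log(1.0 - p)
        scores[c] = s
    return max(scores, key=scores.get)

rows = [{0, 1}, {0}, {2, 3}, {3}]  # features 0/1 signal class 1, 2/3 class 0
labels = [1, 1, 0, 0]
model = train_bernoulli_nb(rows, labels, n_features=4)
```

An efficient implementation would precompute the sum of log(1 - p) per class and only correct it for the active features, which is what makes NB fast on sparse data.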

The BeSim approach conceptually corresponds to projecting the bigraph to a unigraph using the procedure outlined in Section 3.2. Next, the weighted-vote relational neighbour classifier (Macskassy and Provost 2007) is used to infer unknown labels/scores through a weighted probability estimation using the known labels of neighbouring nodes. A very fast and scalable implementation of this approach (the SW-transformation) is presented in detail in Stankova et al (2015).
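The weighted-vote relational neighbour step on the projected unigraph can be sketched as follows (our simplification; the SW-transformation of Stankova et al (2015) is a much faster reformulation of the same estimate):

```python
def wvrn_score(adj, labels, node):
    """Weighted-vote relational neighbour: the positive-class score of a
    node is the weight-averaged label of its neighbours with known labels.
    adj maps a node to a list of (neighbour, weight) pairs; labels holds
    the known labels (1 = positive, 0 = negative)."""
    num = den = 0.0
    for nb, w in adj.get(node, []):
        if nb in labels:
            num += w * labels[nb]
            den += w
    return num / den if den > 0 else 0.5  # prior fallback for isolated nodes

# Tiny projected unigraph: node 'x' is mostly linked to known positives.
adj = {'x': [('a', 2.0), ('b', 1.0), ('c', 1.0)]}
labels = {'a': 1, 'b': 1, 'c': 0}
```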

In Table 10 we compare the performances of the baseline (BL) application of the methods with their EE-version (with S = 15, T = 20). It is quite striking to see the EE-version dominating the baseline version when looking at each method individually. This suggests that the imbalanced learning issue has an impact on each of the aforementioned techniques. Focussing on the regularization-based approaches, we observe the baseline LR to be less influenced by the imbalance than the SVM method. Their EE-versions (EE SVM and EE LR) can boost the baseline performance in almost all cases and have the highest average ranks amongst all the proposed methods. As already emphasized, the regularization-based approaches offer an added element of flexibility in the sense that the strength of the underlying base learner can be controlled.

The heuristic approaches NB and BeSim do not offer this flexibility. Furthermore, it is not directly possible to integrate instance weights during the training phase of the base learner. During the boosting process, we therefore sample from the distribution Dt to generate (unweighted) examples and subsequently train a BeSim/NB learner. The results in Table 10 indicate BL NB to be the worst performing algorithm amongst all methods tried. Due to the conditional independence assumption, one could regard this as a weak learner, which makes it suitable for use in a boosting process. Indeed, the EE NB approach is able to significantly improve upon its performance. We found this to be the slowest EE-version28 (e.g. taking more than 24 hours to process a single fold of the Bank dataset, versus several minutes for the other versions). The average rank estimate of EE NB is rather pessimistic, since we assigned the lowest performance across all methods in case of missing results. BeSim can be considered a stronger learner than NB; its possible boost in performance is typically lower than that of NB. Having said this, we found EE BeSim to improve upon BL BeSim for all datasets (except Adver(p = 1)).

28 This statement only relates to the EE-version. In the baseline case, the timings of the NB-version are roughly of the same order of magnitude as SVM and LR (though this heavily depends on the chosen regularization parameter C). With EE, each subset has a size limited to twice the minority class training size (a relatively low number compared to the size of the majority class). This works in favour of SVM/LR (e.g. SVM has a computational complexity of O(m²), with m the number of training instances). Also note that the NB-version is indeed optimized for sparse data. Internally, the NB-implementation writes a fairly large model file to a specified directory (sizes of 1 GB can be encountered there). This writing operation slows down the training stage (especially in case of EE, where S × T = 15 × 20 = 300 model files are constructed).
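Generating unweighted examples from the boosting distribution Dt amounts to inverse-transform sampling over the cumulative weights; a small stdlib sketch (function and parameter names are ours):

```python
import bisect
import random

def sample_from_distribution(weights, n, seed=0):
    """Draw n example indices i.i.d. from the (unnormalized) boosting
    weight distribution D_t via inverse-transform sampling, yielding an
    unweighted training sample for learners such as NB or BeSim."""
    rng = random.Random(seed)
    cum, total = [], 0.0
    for w in weights:
        total += w
        cum.append(total)  # cumulative weights, strictly below total at the end
    return [bisect.bisect_right(cum, rng.random() * total) for _ in range(n)]
```

Examples with larger boosting weight are drawn proportionally more often, so the unweighted sample presented to NB/BeSim approximates training under Dt.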

40 Jellis Vanhoeyveld David Martens

Table 10 Average tenfold AUC test set performance of the BL compared to its EE-version (S = 15, T = 20) for SVM, LR, BeSim and NB. In case the EE NB implementation crashed, we put a × in the table. If it is too slow (> 24 hours per fold), we place a []. Additionally, average ranks are assigned to each method. In case EE NB has missing results for a particular dataset, we ensure it has the lowest rank across all other algorithms

          Mov G(p = 1)   Mov G(p = 25)  Mov Th(p = [])  Yahoo A(p = 1)
BL SVM    71.6 (2.62)    81.41 (1.32)   79.77 (5.33)    56.49 (3.37)
EE SVM    76.12 (2.88)   85.13 (1.86)   86.43 (5.86)    59.74 (2.96)
BL LR     71.02 (2.09)   84.39 (1.84)   83.14 (4.17)    57.84 (2.39)
EE LR     76.69 (2.92)   85.03 (1.98)   86.3 (5.37)     59.79 (2.62)
BL BeSim  76.1 (3.58)    81.3 (2.92)    82.81 (6.6)     56.27 (2.73)
EE BeSim  76.31 (3.71)   81.37 (2.9)    85.02 (6.28)    57.7 (1.71)
BL NB     70.26 (5.84)   77.01 (2.54)   70.48 (10.14)   52.56 (2.09)
EE NB     75.93 (2.83)   85.56 (2.01)   86.91 (4.15)    57.55 (2.73)

          Yahoo A(p = 25)  Yahoo G(p = 1)  Yahoo G(p = 25)  TaFeng(p = 1)
BL SVM    61.61 (2.48)     66.84 (3.66)    78.82 (1.39)     55.75 (1.6)
EE SVM    66.38 (3.16)     73.48 (2.32)    80.55 (1.55)     61.13 (1.83)
BL LR     66.27 (2.96)     69.82 (1.93)    80.45 (1.59)     58.91 (2.31)
EE LR     66.22 (3.28)     73.08 (2.14)    80.53 (1.56)     61.43 (2.32)
BL BeSim  64.54 (2.02)     68.89 (2.49)    79.55 (1.96)     57.89 (1.18)
EE BeSim  65.25 (2.23)     71.18 (2.91)    80.04 (1.85)     59.36 (1.47)
BL NB     65 (1.65)        63.33 (2.56)    78.89 (1.64)     54.61 (1.2)
EE NB     66.6 (2.79)      70.99 (2.88)    81.01 (1.3)      59.01 (1.84)

          TaFeng(p = 25)  Book(p = 1)   Book(p = 25)  LST(p = 1)
BL SVM    66.94 (1.34)    52.6 (1.29)   60.08 (0.71)  99.99 (0.01)
EE SVM    70.4 (1.3)      55.34 (1.28)  65.4 (0.51)   99.98 (0.01)
BL LR     69.24 (1.3)     55.34 (1.27)  63.84 (0.75)  99.99 (0.01)
EE LR     70.28 (1.28)    55.49 (1.49)  65.41 (0.63)  99.97 (0.02)
BL BeSim  67.49 (1.23)    55.19 (1.27)  63.7 (0.63)   99.99 (0.01)
EE BeSim  68 (1.21)       55.21 (1.15)  64.38 (0.42)  99.99 (0)
BL NB     65.21 (1.64)    52.93 (0.9)   59.75 (0.47)  98.69 (0.3)
EE NB     70.72 (1.15)    ×             63.46 (0.61)  99.92 (0.04)

          Adver(p = [])  Adver(p = 1)  CRF(p = [])    Bank(p = [])
BL SVM    96.37 (1.94)   91.18 (2.97)  64.36 (18.97)  66.82 (0.88)
EE SVM    97.63 (1.35)   93.3 (2.14)   86.35 (9.99)   71.54 (0.76)
BL LR     97.19 (1.44)   88.51 (1.93)  81.87 (19.63)  71.43 (0.72)
EE LR     97.57 (0.96)   93.02 (2.06)  86.84 (9.62)   71.77 (0.62)
BL BeSim  97.26 (1.12)   95.38 (1.35)  86.91 (9.36)   67.85 (0.67)
EE BeSim  97.38 (1.04)   93.83 (1.35)  87.02 (10.43)  70.41 (0.55)
BL NB     93.75 (1.9)    93.37 (1.9)   87.24 (9.38)   67.83 (0.63)
EE NB     94.04 (1.75)   ×             ×              []

          Flickr(p = 0.1)  Kdd(p = 0.5)  Average Rank
BL SVM    74.92 (0.17)     74.53 (0.05)  6.44 [7]
EE SVM    79.86 (0.13)     80.98 (0.05)  2.39 [1]
BL LR     79.03 (0.11)     81.29 (0.04)  4.28 [4]
EE LR     79.85 (0.13)     80.75 (0.05)  2.61 [2]
BL BeSim  74.62 (0.13)     74.95 (0)     5.11 [6]
EE BeSim  76.4 (0.13)      77.55 (0.03)  3.61 [3]
BL NB     81.36 (0.1)      74.29 (0.05)  6.5 [8]
EE NB     []               []            5.06 [5]


6 Conclusions and future research directions

We investigated the effects of sampling and cost-sensitive learning techniques on the problem of learning from imbalanced behaviour data. This type of big data has fundamentally different properties and internal structure than traditional data and is characterized by sparseness and very large dimensions. This is a new research topic which enables benefits across a wide variety of application domains such as fraud detection, targeted advertising, predictive policing, churn prediction, default prediction, etc. To enable future developments in this field, we provide an on-line repository containing implementations, datasets and results, available at the following link: http://www.applieddatamining.com/cms/?q=software

Oversampling techniques need to be adapted in order to cope with this kind of data. We proposed new versions of the traditional SMOTE and ADASYN techniques, which differ in the way synthetic samples are generated and similarity computations are performed. The plain random oversampling (OSR) and the synthetic approaches are competitive methods in terms of AUC and do not seem to suffer from overfitting, as traditional studies report. Though the synthetic approaches have comparable AUC-performances to OSR, they are useless in practice due to their computationally expensive procedures. The OSR method is very fast for small and medium-sized datasets, yet becomes inappropriate for larger datasets because the training times of the underlying base learner increase drastically.
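Plain random oversampling is the simplest of these schemes; a minimal sketch (our code; `beta` loosely mirrors the β-notation of Appendix A, with beta=1 producing a balanced set):

```python
import random

def random_oversample(X, y, beta=1.0, seed=0):
    """Random oversampling (OSR sketch): duplicate randomly chosen
    minority examples (label 1) until the minority/majority ratio
    reaches beta."""
    rng = random.Random(seed)
    pos = [i for i, l in enumerate(y) if l == 1]
    neg = [i for i, l in enumerate(y) if l == 0]
    n_extra = max(0, int(beta * len(neg)) - len(pos))  # duplicates needed
    idx = list(range(len(y))) + [rng.choice(pos) for _ in range(n_extra)]
    return [X[i] for i in idx], [y[i] for i in idx]

X = [[i] for i in range(12)]
y = [1, 1] + [0] * 10          # 2 minority vs. 10 majority instances
Xb, yb = random_oversample(X, y, beta=1.0)
```

Note the practical drawback observed in the text: the resampled training set grows with the majority class size, which is what drives up the base learner's training time on large datasets.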

With respect to undersampling methodologies, we employed a random undersampling technique (RUS) and Knn-based informed approaches. The latter type selects majority class instances that are closest (Cl Knn) or farthest (Far Knn) from the minority class examples, and allows one to investigate the effect of noise/outliers on SVM performance. We observed that these odd instances have a severe performance degrading effect. RUS and Far Knn approaches can significantly improve performance with respect to the baseline, thanks to their ability to reduce noise/outlier levels and to put more emphasis on the minority class. Having said this, they usually perform worse than any of the oversampling approaches. This is in sheer contrast to the results obtained from traditional studies and is heavily related to the intrinsic nature of behaviour data. Timing-wise, RUS is the fastest method. The informed approaches are slower than RUS due to the required nearest neighbour computations. Additionally, a cluster-based undersampling (CBU) technique was proposed to alleviate the within-class imbalance. For low undersampling rates, the performance of CBU is superior to plain RUS, whereas for high undersampling rates the opposite is true. Overall, no substantial gain in performance is obtained in comparison with a simple RUS approach. Furthermore, the method is the slowest among all techniques considered and relies on computationally expensive clustering heuristics.
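For contrast with the oversampling sketch above, random undersampling keeps the full minority class and discards majority instances instead (our code; `beta_u` loosely mirrors the βu-notation of Appendix B):

```python
import random

def random_undersample(X, y, beta_u=1.0, seed=0):
    """Random undersampling (RUS sketch): keep every minority example
    (label 1) and a random subset of the majority class so that the
    minority/majority ratio becomes beta_u (beta_u=1 -> balanced)."""
    rng = random.Random(seed)
    pos = [i for i, l in enumerate(y) if l == 1]
    neg = [i for i, l in enumerate(y) if l == 0]
    keep = rng.sample(neg, min(len(neg), int(len(pos) / beta_u)))
    idx = pos + keep
    return [X[i] for i in idx], [y[i] for i in idx]

X = [[i] for i in range(12)]
y = [1, 1] + [0] * 10
Xu, yu = random_undersample(X, y, beta_u=1.0)
```

The resulting training set shrinks with the minority class size, which is why RUS is the fastest of the sampling schemes considered.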

We studied the effects of boosting, where we made use of a very specific linear SVM and logistic regression combination that uses confidence-rated predictions instead of a more popular plain "discrete" boosting algorithm (with weak hypotheses having values in {-1, 1}). Our experiments clearly indicated the regularization constant C in the SVM formulation to act as a "weakness" indicator. Indeed, higher C-values cause stronger learners and shouldn't be used in the boosting process, which matches the analysis of studies dealing with traditional data. Strong learners can


already quite accurately distinguish majority class from minority class behaviours. The boosting process will focus on the hard-to-learn instances, among which are noise/outliers, and this can result in overfitting behaviour. We also observed that boosting can usually outperform the plain baseline, indicating that a combination of weak learners can achieve higher AUC-performance than a single strong learner.

EasyEnsemble (EE) integrates random undersampling with a boosting process. By sampling several balanced subsets from the training data and combining these, we can emphasize the minority class, achieve noise/outlier reduction effects and simultaneously explore the majority class space efficiently. Our version differs from the original formulation because we employ a different type of underlying weak learner (SVM and LR combination) that is fed to a confidence-rated boosting algorithm (instead of the discrete version). We observed superior AUC-performance with respect to all of the aforementioned techniques. Furthermore, the method is very fast, since each subset can be trained in parallel and has a size that is twice as large as the minority class training set. We studied the effect of the number of subsets and noted that a limited amount of these (O(10)) is already sufficient to obtain satisfactory performances; increasing this number further causes only minor benefits.
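The core of EasyEnsemble can be sketched in a few lines (our simplification: the members here are trained directly rather than boosted, and the base learner below is a deliberately trivial stand-in):

```python
import random

def easy_ensemble(X, y, train_fn, S=10, seed=0):
    """EasyEnsemble sketch: train one model per balanced subset (all
    minority examples plus an equally sized random draw from the majority
    class) and average the member scores at prediction time. train_fn is
    any base-learner factory returning a scoring function."""
    rng = random.Random(seed)
    pos = [i for i, l in enumerate(y) if l == 1]
    neg = [i for i, l in enumerate(y) if l == 0]
    members = []
    for _ in range(S):
        sub = pos + rng.sample(neg, min(len(neg), len(pos)))  # 2x minority size
        members.append(train_fn([X[i] for i in sub], [y[i] for i in sub]))
    return lambda x: sum(m(x) for m in members) / len(members)

# Trivial base learner: score = closeness to the minority-class mean.
def centroid_learner(Xs, ys):
    c = sum(x[0] for x, l in zip(Xs, ys) if l == 1) / sum(ys)
    return lambda x: -abs(x[0] - c)

X = [[0.1], [0.2], [5.0], [5.1], [5.2], [5.3], [4.9], [4.8]]
y = [1, 1, 0, 0, 0, 0, 0, 0]
score = easy_ensemble(X, y, centroid_learner, S=5)
```

Because each subset is independent and small (twice the minority class size), the S members can be trained in parallel, which is the source of the speed advantage discussed above.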

Additionally, a statistical comparison of each of the proposed methods was conducted across a wide variety of datasets. Holm's significance test indicates that each of the proposed techniques (oversampling, undersampling except Cl Knn, cost-sensitive learning and boosting variants) can significantly outperform the plain baseline performance (each of the null-hypotheses of equal performance gets rejected with a confidence level of at least 91%). Comparison of the techniques with EE showed that the latter can significantly outperform all of them (each of the null-hypotheses of equal performance gets rejected with a confidence level of at least 75%). A timing-wise comparison showed EE to be the preferred method (after RUS) for large and highly imbalanced datasets. A parallel implementation would make EE the fastest method, even for medium-sized datasets that show a high level of imbalance.
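The Holm step-down correction used in these comparisons is straightforward to compute; a small stdlib sketch (our own helper, not the paper's code):

```python
def holm_adjust(pvals):
    """Holm's step-down correction: sort p-values ascending, multiply the
    i-th smallest by (m - i), and enforce monotonicity. A hypothesis is
    rejected at level alpha when its adjusted p-value falls below alpha."""
    m = len(pvals)
    order = sorted(range(m), key=lambda i: pvals[i])
    adj, running = [0.0] * m, 0.0
    for rank, i in enumerate(order):
        running = max(running, min(1.0, (m - rank) * pvals[i]))
        adj[i] = running  # adjusted p-value reported in original position
    return adj
```

For example, raw p-values [0.01, 0.04, 0.03] become approximately [0.03, 0.06, 0.06] after adjustment, so only the first hypothesis survives at alpha = 0.05.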

The influence of the chosen base learner was investigated by comparing the baseline to its EE-version for the SVM, LR, NB and BeSim inducers. Each of the EE-versions is able to improve upon the baseline performance for nearly all datasets, where base learners of the regularization type are superior because of their ability to control the strength of the learner.

Though we conducted an extensive benchmark study, there are still plenty of future research directions. First, and most important, there is an urgent need to develop a more general theoretical framework to completely understand the effect of imbalance and why certain methods outperform others. Secondly, other types of mechanisms could be investigated in this setting: advanced sampling (Guo et al 2008) set-ups perform sampling based on preliminary classifications; one-class learning is a recognition-based methodology where a discriminative boundary is learned around the training examples of a single class alone, and Raskutti and Kowalczyk (2004) demonstrated the superiority of the recognition-based one-class SVMs over the discriminative two-class SVMs; other types of ensemble learning strategies (e.g. bagging and stacking) could be investigated; Batista et al (2004) note that combining over- and undersampling can be beneficial when data are highly imbalanced or show very few minority class instances; the ideas behind Extended Nearest Neighbour (Tang and


He 2015) could guide us in the sampling process. As opposed to Knn, "it considers not only who are the nearest neighbours of a certain sample, but also who consider the particular sample as their nearest neighbour" (Tang and He 2015). The method assigns a class membership to an unknown sample that maximizes the intra-class coherence. Finally, more advanced techniques do exist that can provide a K (the number of nearest neighbours) faster or with a (slightly) better performance.

Another issue relates to the lack of comprehensibility of the EasyEnsemble technique in its present form. Because we currently integrate Logistic Regression in the boosting process, the final ensemble classifier is non-linear and therefore incomprehensible. It might be interesting to investigate the numerical boosting technique proposed in Schapire and Singer (1999), where each weak learner is allowed to output real values (instead of values in [-1, 1]). In that case, we would be able to use a plain linear SVM as a weak learner, without the Logistic Regression component. The final ensemble would be a linear model, which is surely easier to comprehend. Also, the instance-based approach of Martens and Provost (2014) could be used to explain individual predictions of the non-linear EE-technique.

A Oversampling experiments

Table 11 Oversampling experiments. Results show the average tenfold AUC test set performance with respect to increasing β-values (β = [β1, β2, β3, β4] = [0, 1/3, 2/3, 1]). Additionally, standard deviations are shown between brackets. Optimal performances for each method are highlighted in boldface. See Section 4.2 for parameter settings

(columns: β1, β2, β3, β4)

Mov G(p = 1)
OSR     71.6 (2.62)   74.37 (2.04)  73.6 (1.84)   74.73 (2.45)
SMOTE   71.6 (2.62)   75.08 (2.18)  76.02 (2.14)  76.48 (2.3)
ADASYN  71.6 (2.62)   75.16 (1.92)  75.93 (2.08)  76.47 (2.29)

Mov G(p = 25)
OSR     81.41 (1.32)  83.49 (1.81)  83.84 (1.96)  83.91 (2.04)
SMOTE   81.41 (1.32)  83.32 (1.97)  83.59 (2.04)  83.76 (2.11)
ADASYN  81.41 (1.32)  83.61 (1.82)  84.02 (1.97)  83.69 (1.96)

Mov Th(p = [])
OSR     79.77 (5.33)  85.3 (4.66)   83.16 (4.5)   84.59 (5.69)
SMOTE   79.77 (5.33)  84.18 (6.51)  85.58 (5.97)  84.33 (5.75)
ADASYN  79.77 (5.33)  84.11 (6.77)  85.86 (6.06)  85.36 (5.13)

Yahoo A(p = 1)
OSR     55.92 (2.97)  58.66 (3.27)  59.99 (2.28)  59.74 (1.78)
SMOTE   55.92 (2.97)  59.76 (2.62)  59.74 (2.67)  59.43 (2.4)
ADASYN  55.92 (2.97)  59.54 (2.53)  59.55 (2.94)  59.56 (2.22)

Yahoo A(p = 25)
OSR     61.68 (2.42)  64.19 (3.17)  65.08 (3.26)  64.67 (2.1)
SMOTE   61.68 (2.42)  65.46 (3.63)  65.33 (3.23)  64.52 (2.98)
ADASYN  61.68 (2.42)  65.04 (3.74)  65.41 (3.47)  64.4 (2.21)

Yahoo G(p = 1)
OSR     66.84 (3.66)  72.18 (2.36)  73.11 (2.7)   72.49 (3.41)
SMOTE   66.84 (3.66)  72.65 (2.85)  73.27 (3.36)  73.37 (3.56)
ADASYN  66.84 (3.66)  72.87 (2.83)  73.18 (3.2)   73.39 (3.59)

Yahoo G(p = 25)
OSR     78.82 (1.39)  78.78 (2.02)  78.74 (1.86)  78.35 (1.47)
SMOTE   78.82 (1.39)  79.23 (1.57)  79.1 (1.2)    79.03 (1.89)
ADASYN  78.82 (1.39)  79.12 (1.43)  79.23 (1.37)  79.51 (2.01)

TaFeng(p = 1)
OSR     55.75 (1.6)   59.23 (1.96)  60 (1.68)     61.04 (2.36)
SMOTE   55.75 (1.6)   60.26 (1.95)  61.49 (1.8)   61.13 (1.52)
ADASYN  55.75 (1.6)   60.26 (1.9)   61.44 (1.85)  61.16 (1.5)

TaFeng(p = 25)
OSR     66.94 (1.34)  68.84 (1.21)  66.99 (1.42)  67.7 (1.41)
SMOTE   66.94 (1.34)  68.47 (1.5)   67.07 (1.15)  66.65 (0.81)
ADASYN  66.94 (1.34)  68.62 (1.38)  67.85 (1.6)   66.91 (1.39)

Book(p = 1)
OSR     52.6 (1.29)   53.61 (0.94)  55.41 (1.75)  55.87 (1.44)
SMOTE   52.6 (1.29)   54.77 (0.99)  54.91 (0.8)   54.36 (0.98)
ADASYN  52.6 (1.29)   54.86 (1.13)  55.06 (0.73)  54.54 (0.92)

Book(p = 25)
OSR     60.08 (0.71)  61.05 (0.94)  61.82 (1.12)  64.62 (0.57)
SMOTE   60.08 (0.71)  62.6 (0.73)   60.95 (0.68)  63 (0.8)
ADASYN  60.08 (0.71)  62.33 (0.73)  60.77 (0.85)  63.04 (0.58)

LST(p = 1)
OSR     99.99 (0.01)  99.99 (0.01)  99.99 (0.01)  99.99 (0.01)
SMOTE   99.99 (0.01)  99.99 (0.01)  99.99 (0.01)  99.99 (0.01)
ADASYN  99.99 (0.01)  99.99 (0.01)  99.99 (0.01)  99.99 (0.01)

Adver(p = [])
OSR     96.61 (1.82)  97.31 (1.65)  97.07 (1.84)  97.07 (1.79)
SMOTE   96.61 (1.82)  96.91 (1.66)  97.19 (1.65)  97.07 (1.91)
ADASYN  96.61 (1.82)  97.1 (1.7)    97.08 (1.87)  97.07 (1.88)

Adver(p = 1)
OSR     90.93 (3.02)  91.27 (3.03)  92.66 (2.82)  93.29 (1.97)
SMOTE   90.93 (3.02)  92.51 (2.03)  92.96 (2.14)  93.53 (1.81)
ADASYN  90.93 (3.02)  92.22 (2.33)  92.7 (2.36)   93.88 (1.73)

CRF(p = [])
OSR     64.06 (16.43)  80.82 (12.94)  81.28 (12.27)  81.91 (11.28)
SMOTE   64.06 (16.43)  78.64 (16.86)  82.52 (13.74)  79.32 (16.26)
ADASYN  64.06 (16.43)  78.95 (16.72)  81.19 (16.32)  79.31 (16.19)

Bank(p = [])
OSR     66.82 (0.88)  70.1 (0.74)  71.39 (0.8)  71.47 (0.8)
SMOTE
ADASYN


B Undersampling experiments

Table 12 Undersampling experiments. Results show the average tenfold AUC test set performance with respect to increasing βu-values (βu = [βu1, βu2, βu3, βu4, βu5] = [0, 1/4, 1/2, 3/4, 1]). Optimal performances (boldface) are highlighted for each imbalance ratio. Additionally, standard deviations are shown between brackets. Cl K represents the "Closest Knn" technique; Cl T represents the "Closest tot sim" technique (similar for Far K and Far T, see Section 3.2). Parameter settings are given in Section 4.2

(columns: βu1, βu2, βu3, βu4, βu5)

Mov G(p = 1)
RUS    71.6 (2.6)   71.83 (2.6)  72.54 (2.5)  72.39 (3.1)  70.61 (3.5)
Cl K   71.6 (2.6)   71.4 (2)     70.96 (1.9)  70.43 (2.4)  69.05 (4.1)
Cl T   71.6 (2.6)   70.28 (2.5)  66.74 (2)    66.8 (2.1)   68.18 (3.6)
Far K  71.6 (2.6)   72.36 (2.7)  71.26 (3.4)  66.57 (5.2)  53.5 (3.5)
Far T  71.6 (2.6)   72.22 (2.8)  71.63 (3.6)  64.28 (5.3)  50.88 (4.4)
CBU    72.55 (2.6)  73.28 (2.6)  73.12 (2.6)  73.84 (2.5)  73 (3.1)

Mov G(p = 25)
RUS    81.41 (1.3)  81.36 (1.3)  81.78 (1.7)  82.05 (1.7)  81.6 (2.1)
Cl K   81.41 (1.3)  80.86 (1.2)  80.95 (1.6)  79.73 (2.3)  77.95 (2.3)
Cl T   81.41 (1.3)  79.9 (1.2)   78.21 (1.4)  77.87 (1.5)  77.76 (2.3)
Far K  81.41 (1.3)  80.9 (1.5)   78.17 (1.8)  74.25 (2.4)  69.79 (3.2)
Far T  81.41 (1.3)  80.86 (1.5)  77.2 (2.4)   71.16 (2.7)  62.4 (2.8)
CBU    81.53 (1.4)  81.64 (1.3)  81.29 (1.6)  81.28 (2.1)  80.34 (2.7)

Mov Th(p = [])
RUS    79.77 (5.3)  80.32 (5.8)  81.57 (5.5)  81.86 (6.6)  81.26 (6.2)
Cl K   79.77 (5.3)  79.25 (4.5)  78.07 (5)    76.25 (6.5)  62.46 (8.5)
Cl T   79.77 (5.3)  78.4 (4.4)   72.41 (3.5)  64.66 (4.5)  60.37 (7.3)
Far K  79.77 (5.3)  84.54 (5)    83.64 (6.4)  80.02 (7.3)  56.82 (10.3)
Far T  79.77 (5.3)  85.03 (5.7)  82.68 (6.8)  75.61 (9.2)  56.77 (10.9)
CBU    80.11 (5.8)  81.17 (6)    81.08 (6.5)  84.17 (5.1)  80.96 (6.9)

Yahoo A(p = 1)
RUS    55.92 (3)    55.57 (3.4)  56.44 (3)    55.83 (3.4)  56.37 (3.3)
Cl K   55.92 (3)    55.67 (2.4)  53.12 (2)    50.57 (1.8)  53.79 (3.5)
Cl T   55.92 (3)    55.69 (2.1)  53.35 (2.2)  50.31 (2.2)  52.35 (3.3)
Far K  55.92 (3)    57.35 (2.2)  56.92 (1.1)  56.95 (2.3)  51.18 (2)
Far T  55.92 (3)    56.93 (2.4)  54.74 (1.9)  57.01 (1.8)  51.18 (2)
CBU    58.21 (2.6)  58.45 (3.3)  58.31 (3.5)  58.39 (3.5)  56.09 (2.6)

Yahoo A(p = 25)
RUS    61.68 (2.4)  62.9 (2.9)   63.62 (3.6)  63.75 (3.1)  63.19 (1.9)
Cl K   61.68 (2.4)  61.14 (2.1)  57.62 (1.6)  54.02 (1.8)  51.48 (1.4)
Cl T   61.68 (2.4)  60.89 (2.8)  58.11 (1.4)  54.4 (2.1)   51.76 (1.4)
Far K  61.68 (2.4)  63.96 (3)    62.62 (2.2)  59.61 (1.5)  56.25 (1.6)
Far T  61.68 (2.4)  63.71 (2.4)  59.72 (1.6)  57.27 (1.1)  54.47 (1.1)
CBU    62.46 (2.6)  61.85 (1.4)  61.78 (2.2)  59.94 (3)    60.1 (4)

Yahoo G(p = 1)
RUS    66.84 (3.7)  67.85 (3.2)  68.36 (3.2)  68.23 (4)    69.9 (4.2)
Cl K   66.84 (3.7)  66.71 (2.8)  64.3 (3.6)   61.98 (3.9)  61.15 (1.9)
Cl T   66.84 (3.7)  65.79 (2.7)  63.55 (3.3)  59.21 (3.5)  61.08 (2.4)
Far K  66.84 (3.7)  66.76 (4.1)  63.84 (3.4)  65.16 (2)    48.5 (2.9)
Far T  66.84 (3.7)  66.95 (4.1)  63.48 (2.9)  65.16 (2)    48.48 (2.9)
CBU    69.68 (4.1)  70.59 (3.2)  70.64 (3.7)  70.2 (2.9)   63.35 (3.6)

Yahoo G(p = 25)
RUS    78.82 (1.4)  78.91 (1.6)  78.97 (1.6)  78.61 (1.6)  77.82 (2.1)
Cl K   78.82 (1.4)  77.26 (1.5)  72.52 (1.5)  67.86 (2)    65.07 (2.7)
Cl T   78.82 (1.4)  76.83 (1)    71.99 (1.8)  67.15 (2.3)  61.1 (2.7)
Far K  78.82 (1.4)  78.26 (2.2)  74.69 (2.7)  67.22 (2.1)  60.72 (2.3)
Far T  78.82 (1.4)  77.68 (2.6)  72.44 (3)    64.94 (2.4)  59.6 (2)
CBU    75.25 (3.2)  75.22 (2.4)  74.69 (2.3)  73.07 (2.4)  70.69 (2.4)

TaFeng(p = 1)
RUS    55.75 (1.6)  56.1 (1.6)   56.26 (1.7)  57.23 (1.7)  59.25 (2.2)
Cl K   55.75 (1.6)  55.68 (1.6)  55.58 (1.5)  55.08 (1.1)  51.05 (1.5)
Cl T   55.75 (1.6)  55.67 (1.6)  54.47 (1.6)  47.53 (1.6)  49.3 (1.1)
Far K  55.75 (1.6)  58.99 (1.2)  59.47 (1.1)  60.04 (1.2)  56.31 (1)
Far T  55.75 (1.6)  58.92 (1.3)  59.25 (1.3)  58.58 (1.1)  56.31 (1)
CBU    57.8 (1)     58.47 (1.1)  58.15 (0.9)  58.87 (1.4)  57.65 (1.6)

TaFeng(p = 25)
RUS    66.94 (1.3)  67.44 (1.3)  68.1 (1.4)   68.27 (1.4)  66.13 (1.2)
Cl K   66.94 (1.3)  66.13 (1.4)  63.39 (1.2)  59.83 (1.3)  56.94 (0.7)
Cl T   66.94 (1.3)  66.38 (1.5)  62.89 (1.6)  57.46 (1.3)  54.56 (1.3)
Far K  66.94 (1.3)  68.06 (1.4)  66.43 (1.6)  64.46 (1.5)  63.35 (1.3)
Far T  66.94 (1.3)  64.31 (1.1)  62.69 (1)    61.27 (1.1)  59.03 (1)
CBU    64.81 (1.2)  64.15 (1.1)  64.13 (1.2)  63.88 (0.8)  63.46 (0.8)

Book(p = 1)
RUS    52.6 (1.3)   52.79 (0.9)  53.46 (0.8)  53.89 (0.9)  54.05 (0.9)
Cl K   52.6 (1.3)   52.56 (1.2)  52.52 (1.3)  52.39 (1.1)  53.09 (1.1)
Cl T   52.6 (1.3)   52.56 (1.2)  52.52 (1.3)  52.39 (1.1)  53.05 (0.7)
Far K  52.6 (1.3)   55.21 (1.2)  56.21 (1.8)  56.14 (1.2)  53.06 (1)
Far T  52.6 (1.3)   55.21 (1.2)  56.21 (1.8)  56.14 (1.2)  53.06 (1)
CBU    54.28 (0.9)  53.77 (1)    53.33 (1.1)  53.34 (0.9)  52.84 (0.8)

Book(p = 25)
RUS    60.08 (0.7)  60.13 (0.6)  60.4 (0.8)   60.33 (0.8)  63.28 (0.8)
Cl K   60.08 (0.7)  59.96 (0.7)  60.13 (0.8)  59.96 (1)    59.28 (0.7)
Cl T   60.08 (0.7)  59.96 (0.7)  60.13 (0.8)  60.29 (0.4)  54.5 (0.9)
Far K  60.08 (0.7)  63.29 (1)    64.19 (0.8)  57.3 (1.1)   55.66 (1.1)
Far T  60.08 (0.7)  62.14 (0.5)  58.27 (0.6)  56.37 (1)    55.66 (1.1)
CBU    54.82 (0.9)  54.67 (0.9)  54.71 (0.9)  54.66 (1)    54.78 (0.9)

LST(p = 1)
RUS    99.99 (0)  99.99 (0)  99.99 (0)  99.98 (0)  99.99 (0)
Cl K   99.99 (0)  99.99 (0)  99.99 (0)  99.99 (0)  99.99 (0)
Cl T   99.99 (0)  99.99 (0)  99.99 (0)  99.99 (0)  99.98 (0)
Far K  99.99 (0)  99.98 (0)  99.98 (0)  99.98 (0)  99.98 (0)
Far T  99.99 (0)  99.98 (0)  99.98 (0)  99.98 (0)  99.98 (0)
CBU    []         []         []         []         []

Adver(p = [])
RUS    96.61 (1.8)  96.32 (1.8)  96.63 (1.4)  97.12 (2.1)  96.22 (1.6)
Cl K   96.61 (1.8)  96.44 (1.5)  96.14 (1.5)  96.04 (2)    94.8 (2.5)
Cl T   96.61 (1.8)  95.87 (2.1)  94.32 (1.9)  93.01 (2.2)  90.72 (2.3)
Far K  96.61 (1.8)  96.53 (1.4)  95.76 (2)    94.39 (1.8)  90.49 (3.1)
Far T  96.61 (1.8)  96.54 (1.5)  95.67 (1.9)  94.54 (1.8)  89.3 (2.8)
CBU    96.85 (2.3)  96.85 (2.3)  97.05 (1.5)  96.6 (1.6)   96.06 (2.1)

Adver(p = 1)
RUS    90.93 (3)    91.53 (3.1)  92.37 (3.4)  91.9 (2.9)   91.93 (2.2)
Cl K   90.93 (3)    90.64 (3)    89.87 (3.9)  90.21 (3.6)  89.18 (2)
Cl T   90.93 (3)    89.7 (3.5)   88.55 (3.4)  85.76 (3.3)  88.2 (2.3)
Far K  90.93 (3)    93.8 (2.3)   92.4 (2.6)   88.73 (3.4)  85.51 (4)
Far T  90.93 (3)    93.62 (2.4)  93.2 (2.2)   88.41 (3.6)  85.51 (4)
CBU    93.22 (2.4)  93.76 (2.5)  93.89 (2.6)  93.52 (2.7)  91.27 (2)

CRF(p = [])
RUS    64.06 (16.4)  63.28 (15.9)  67.98 (17.4)  66.95 (21.9)  87.73 (8.8)
Cl K   64.06 (16.4)  62.44 (16.6)  62.34 (16.9)  71.37 (13.8)  78.22 (17.7)
Cl T   64.06 (16.4)  62.44 (16.6)  62.34 (16.9)  71.37 (13.8)  62.67 (22.9)
Far K  64.06 (16.4)  83.8 (14.2)   83.93 (14.8)  84.49 (13.7)  86.11 (9.7)
Far T  64.06 (16.4)  83.8 (14.2)   83.93 (14.8)  84.49 (13.7)  86.11 (9.7)
CBU    []            []            []            []            []

Bank(p = [])
RUS    66.82 (0.9)  67.02 (0.9)  67.37 (0.8)  67.99 (0.6)  69.5 (1)
Cl K   66.82 (0.9)  66.17 (0.7)  65.24 (0.6)  64.86 (0.6)  58.53 (1.1)
Cl T   66.82 (0.9)  64.92 (1.1)  60.69 (0.9)  56.33 (0.8)  52.87 (0.7)
Far K  66.82 (0.9)  66.95 (0.6)  66.19 (0.6)  64.42 (0.6)  58.25 (1.1)
Far T  66.82 (0.9)  67.16 (0.6)  64.2 (0.8)   59.67 (1)    58.25 (1.1)
CBU    []           []           []           []           []


C Boosting experiments

This section presents the results of each of the boosting experiments from Section 4.5 on the remaining data sources. Each of the figures below depicts the average tenfold AUC-performance on test data (with µ = 100) with respect to the number of boosting iterations for (left) Adaboost (AB), AdaCost (AC) and EasyEnsemble (EE), with C chosen according to the highest validation set AUC-performance (over all possible boosting rounds), and (right) AB and EE (with S = 15) with varying C-levels.


Fig 6 Mov G(p = 1) dataset


Fig 7 Mov Th(p = []) dataset


Fig 8 Yahoo A(p = 1) dataset


Fig 9 Yahoo A(p = 25) dataset


Fig 10 Yahoo G(p = 1) dataset


Fig 11 Yahoo G(p = 25) dataset


Fig 12 TaFeng(p = 1) dataset


Fig 13 Book(p = 1) dataset


Fig 14 LST(p = 1) dataset


Fig 15 Adver(p = []) dataset


Fig 16 Adver(p = 1) dataset


Fig 17 CRF(p = []) dataset

D Final Comparison

[Figure 18: scatter plot of average rank AUC (0-14) versus average rank Time (0-18) for BL, OSR, SMOTE, ADASYN, RUS, Cl Knn, Far Knn, CBU, AB, AC(R=28), AC(R=RL), EE(S=10), EE(S=15) and EE par]

Fig 18 Average rank AUC versus average rank Time (see Table 9) across the large datasets from Table 2 (CRF and Bank). Regarding the former, an extra row EE par is added, having the same AUC-value as EE(S = 15). Ranks for each dataset are subsequently obtained via the procedure outlined in Section 4.6.1. Points occurring in the upper-right region are preferred


References

Akbani R Kwek S Japkowicz N (2004) Applying support vector machines to imbalanced datasets. In: Machine Learning: ECML 2004, 15th European Conference on Machine Learning, Pisa, Italy, September 20-24, 2004, Proceedings. Springer Berlin Heidelberg, Berlin, Heidelberg, pp 39-50. DOI 10.1007/978-3-540-30115-8_7

Ali A Shamsuddin SM Ralescu AL (2015) Classification with class imbalance problem: A review. International Journal of Advances in Soft Computing and its Applications 7(3):176-204

Alzahrani T Horadam KJ (2016) Community detection in bipartite networks: Algorithms and case studies. In: Complex Systems and Networks: Dynamics, Controls and Applications. Springer Berlin Heidelberg, Berlin, Heidelberg, pp 25-50. DOI 10.1007/978-3-662-47824-0_2

Bachner J (2013) Predictive policing: Preventing crime with data and analytics. IBM Center for the Business of Government

Baesens B Van Gestel T Viaene S Stepanova M Suykens J Vanthienen J (2003) Benchmarking state-of-the-art classification algorithms for credit scoring. Journal of the Operational Research Society 54(6):627-635. DOI 10.1057/palgrave.jors.2601545

Barandela R Sánchez J García V Rangel E (2003) Strategies for learning in class imbalance problems. Pattern Recognition 36(3):849-851. DOI 10.1016/S0031-3203(02)00257-1

Barber MJ (2007) Modularity and community detection in bipartite networks. Physical Review E 76:066102. DOI 10.1103/PhysRevE.76.066102

Barua S Islam MM Yao X Murase K (2014) MWMOTE - majority weighted minority oversampling technique for imbalanced data set learning. IEEE Transactions on Knowledge and Data Engineering 26(2):405-425. DOI 10.1109/TKDE.2012.232

Batista GEAPA Prati RC Monard MC (2004) A study of the behavior of several methods for balancing machine learning training data. SIGKDD Explor Newsl 6(1):20-29. DOI 10.1145/1007730.1007735

Beckett SJ (2016) Improved community detection in weighted bipartite networks. Royal Society Open Science 3(1). DOI 10.1098/rsos.140536

Bekkar M Djemaa HK Alitouche TA (2013) Evaluation measures for models assessment over imbalanced data sets. Journal of Information Engineering and Applications 3(10):27-38

Bhattacharyya S Jha S Tharakunnel K Westland JC (2011) Data mining for credit card fraud: A comparative study. Decision Support Systems 50(3):602-613. DOI 10.1016/j.dss.2010.08.008

Blondel VD Guillaume JL Lambiotte R Lefebvre E (2008) Fast unfolding of communities in large networks. Journal of Statistical Mechanics: Theory and Experiment 2008(10):P10008

Breiman L Friedman J Stone CJ Olshen RA (1984) Classification and regression trees. Taylor & Francis
Brozovsky L Petricek V (2007) Recommender system for online dating service. In: Proceedings of Znalosti 2007 Conference, VSB Ostrava
Cha M Mislove A Gummadi KP (2009) A measurement-driven analysis of information propagation in the Flickr social network. In: Proceedings of the 18th International Conference on World Wide Web, ACM, New York, NY, USA, WWW '09, pp 721-730. DOI 10.1145/1526709.1526806

Chawla NV (2005) Data mining for imbalanced datasets An overview In Data mining and knowledgediscovery handbook Springer US Boston MA pp 853ndash867

Chawla NV Bowyer KW Hall LO Kegelmeyer WP (2002) SMOTE synthetic minority over-samplingtechnique Journal of artificial intelligence research 16321ndash357

Chawla NV Lazarevic A Hall LO Bowyer KW (2003) Smoteboost Improving prediction of the minorityclass in boosting In Knowledge Discovery in Databases PKDD 2003 Springer Berlin HeidelbergBerlin Heidelberg pp 107ndash119

Chawla NV Japkowicz N Kotcz A (2004) Editorial Special issue on learning from imbalanced data setsSIGKDD Explor Newsl 6(1)1ndash6 DOI 10114510077301007733

Chen M Mao S Liu Y (2014) Big data A survey Mobile Networks and Applications 19(2)171ndash209DOI 101007s11036-013-0489-0

Chyi YM (2003) Classification analysis techniques for skewed class distribution problems Master thesisDepartment of Information management National Sun Yat-Sen University

Demsar J (2006) Statistical comparisons of classifiers over multiple data sets Journal of Machine LearningResearch 7(Jan)1ndash30

Dhillon IS (2001) Co-clustering documents and words using bipartite spectral graph partitioning In Pro-ceedings of the Seventh ACM SIGKDD International Conference on Knowledge Discovery and DataMining ACM New York NY USA KDD rsquo01 pp 269ndash274 DOI 101145502512502550

54 Jellis Vanhoeyveld David Martens

Drummond C, Holte RC (2003) C4.5, class imbalance, and cost sensitivity: why under-sampling beats over-sampling. In: Proceedings of the ICML '03 Workshop on Learning from Imbalanced Datasets

Fan RE, Chang KW, Hsieh CJ, Wang XR, Lin CJ (2008) LIBLINEAR: A library for large linear classification. Journal of Machine Learning Research 9:1871–1874

Fan W, Stolfo SJ, Zhang J, Chan PK (1999) AdaCost: Misclassification cost-sensitive boosting. In: Proceedings of the Sixteenth International Conference on Machine Learning, Morgan Kaufmann Publishers Inc, San Francisco, CA, USA, ICML '99, pp 97–105

Fawcett T (2006) An introduction to ROC analysis. Pattern Recognition Letters 27(8):861–874, DOI 10.1016/j.patrec.2005.10.010

Finch H (2005) Comparison of distance measures in cluster analysis with dichotomous data. Journal of Data Science 3(1):85–100

Fortunato S (2010) Community detection in graphs. Physics Reports 486(3–5):75–174, DOI 10.1016/j.physrep.2009.11.002

Junque de Fortuny E, Martens D, Provost F (2014a) Predictive modeling with big data: is bigger really better? Big Data 1(4):215–226, DOI 10.1089/big.2013.0037

Junque de Fortuny E, Stankova M, Moeyersoms J, Minnaert B, Provost F, Martens D (2014b) Corporate residence fraud detection. In: Proceedings of the 20th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, ACM, New York, NY, USA, KDD '14, pp 1650–1659, DOI 10.1145/2623330.2623333

Frasca M, Bertoni A, Re M, Valentini G (2013) A neural network algorithm for semi-supervised node label learning from unbalanced data. Neural Networks 43:84–98, DOI 10.1016/j.neunet.2013.01.021

Friedman M (1937) The use of ranks to avoid the assumption of normality implicit in the analysis of variance. Journal of the American Statistical Association 32(200):675–701

García E, Lozano F (2007) Boosting support vector machines. In: Machine Learning and Data Mining in Pattern Recognition, 5th International Conference, MLDM 2007, Leipzig, Germany, July 18-20, Post Proceedings, IBaI publishing, pp 153–167

Goldstein M, Uchida S (2016) A comparative evaluation of unsupervised anomaly detection algorithms for multivariate data. PLOS ONE 11(4):1–31, DOI 10.1371/journal.pone.0152173

González PC, Velásquez JD (2013) Characterization and detection of taxpayers with false invoices using data mining techniques. Expert Systems with Applications 40(5):1427–1436, DOI 10.1016/j.eswa.2012.08.051

Guimerà R, Sales-Pardo M, Amaral LAN (2007) Module identification in bipartite and directed networks. Physical Review E 76:036102, DOI 10.1103/PhysRevE.76.036102

Guo H, Viktor HL (2004) Learning from imbalanced data sets with boosting and data generation: The DataBoost-IM approach. SIGKDD Explor Newsl 6(1):30–39, DOI 10.1145/1007730.1007736

Guo X, Yin Y, Dong C, Yang G, Zhou G (2008) On the class imbalance problem. In: 2008 Fourth International Conference on Natural Computation, IEEE, vol 4, pp 192–201, DOI 10.1109/ICNC.2008.871

Han H, Wang WY, Mao BH (2005) Borderline-SMOTE: A new over-sampling method in imbalanced data sets learning. In: Advances in Intelligent Computing, Springer Berlin Heidelberg, Berlin, Heidelberg, pp 878–887

He H, Garcia EA (2009) Learning from imbalanced data. IEEE Transactions on Knowledge and Data Engineering 21(9):1263–1284, DOI 10.1109/TKDE.2008.239

He H, Bai Y, Garcia EA, Li S (2008) ADASYN: Adaptive synthetic sampling approach for imbalanced learning. In: 2008 IEEE International Joint Conference on Neural Networks (IEEE World Congress on Computational Intelligence), IEEE, pp 1322–1328, DOI 10.1109/IJCNN.2008.4633969

Holm S (1979) A simple sequentially rejective multiple test procedure. Scandinavian Journal of Statistics 6(2):65–70

Hsu CW, Lin CJ (2002) A comparison of methods for multiclass support vector machines. IEEE Transactions on Neural Networks 13(2):415–425, DOI 10.1109/72.991427

Huang A (2008) Similarity measures for text document clustering. In: Proceedings of the Sixth New Zealand Computer Science Research Student Conference (NZCSRSC2008), Christchurch, New Zealand, pp 49–56

Iman RL, Davenport JM (1980) Approximations of the critical region of the Friedman statistic. Communications in Statistics - Theory and Methods 9(6):571–595

Jo T, Japkowicz N (2004) Class imbalances versus small disjuncts. ACM SIGKDD Explor Newsl 6(1):40–49, DOI 10.1145/1007730.1007737

Imbalanced classification in sparse and large behaviour datasets 55

Jutla IS, Jeub LG, Mucha PJ (2011–2016) A generalized Louvain method for community detection implemented in MATLAB. URL http://netwiki.amath.unc.edu/GenLouvain

Kubat M, Matwin S (1997) Addressing the curse of imbalanced training sets: One-sided selection. In: Proceedings of the Fourteenth International Conference on Machine Learning, Morgan Kaufmann Publishers Inc, San Francisco, CA, USA, pp 179–186

Lancichinetti A, Fortunato S (2009) Community detection algorithms: A comparative analysis. Physical Review E 80:056117, DOI 10.1103/PhysRevE.80.056117

Larremore DB, Clauset A, Jacobs AZ (2014) Efficiently inferring community structure in bipartite networks. Physical Review E: Statistical, Nonlinear, and Soft Matter Physics 90:012805, DOI 10.1103/PhysRevE.90.012805

Li J, Fine JP (2010) Weighted area under the receiver operating characteristic curve and its application to gene selection. Journal of the Royal Statistical Society: Series C (Applied Statistics) 59(4):673–692, DOI 10.1111/j.1467-9876.2010.00713.x

Li X, Wang L, Sung E (2008) AdaBoost with SVM-based component classifiers. Engineering Applications of Artificial Intelligence 21(5):785–795, DOI 10.1016/j.engappai.2007.07.001

Lichman M (2013) UCI machine learning repository. URL http://archive.ics.uci.edu/ml

Liu W, Chawla S, Cieslak DA, Chawla NV (2010) A robust decision tree algorithm for imbalanced data sets. In: Proceedings of the Tenth SIAM International Conference on Data Mining, SIAM, Philadelphia, vol 10, pp 766–777

Liu XY, Wu J, Zhou ZH (2009) Exploratory undersampling for class-imbalance learning. IEEE Transactions on Systems, Man, and Cybernetics, Part B (Cybernetics) 39(2):539–550, DOI 10.1109/TSMCB.2008.2007853

Luts J, Ojeda F, Van de Plas R, De Moor B, Van Huffel S, Suykens JA (2010) A tutorial on support vector machine-based methods for classification problems in chemometrics. Analytica Chimica Acta 665(2):129–145, DOI 10.1016/j.aca.2010.03.030

Macskassy SA, Provost F (2007) Classification in networked data: A toolkit and a univariate case study. Journal of Machine Learning Research 8(May):935–983

Martens D, Provost F (2014) Explaining data-driven document classifications. MIS Quarterly 38(1):73–100

Martens D, Provost F, Clark J, Junque de Fortuny E (2016) Mining massive fine-grained behavior data to improve predictive analytics. MIS Quarterly 40(4):869–888

Mazurowski MA, Habas PA, Zurada JM, Lo JY, Baker JA, Tourassi GD (2008) Training neural network classifiers for medical decision making: The effects of imbalanced datasets on classification performance. Neural Networks 21(2–3):427–436, DOI 10.1016/j.neunet.2007.12.031

Mease D, Wyner AJ, Buja A (2007) Boosted classification trees and class probability/quantile estimation. Journal of Machine Learning Research 8:409–439

Nemenyi P (1963) Distribution-free Multiple Comparisons. Dissertation, Princeton University

Newman MEJ, Girvan M (2004) Finding and evaluating community structure in networks. Physical Review E 69:026113, DOI 10.1103/PhysRevE.69.026113

Ng AY (2004) Feature selection, L1 vs. L2 regularization, and rotational invariance. In: Proceedings of the Twenty-first International Conference on Machine Learning, ACM, New York, NY, USA, ICML '04, pp 78–, DOI 10.1145/1015330.1015435

Ng AY, Jordan MI (2002) On discriminative vs. generative classifiers: A comparison of logistic regression and naive bayes. In: Advances in Neural Information Processing Systems 14, MIT Press, pp 841–848

Ngai E, Hu Y, Wong Y, Chen Y, Sun X (2011) The application of data mining techniques in financial fraud detection: A classification framework and an academic review of literature. Decision Support Systems 50(3):559–569, DOI 10.1016/j.dss.2010.08.006

Platt JC (1999) Probabilistic outputs for support vector machines and comparisons to regularized likelihood methods. In: Advances in Large-Margin Classifiers, MIT Press, pp 61–74

Porter MA, Onnela JP, Mucha PJ (2009) Communities in networks. Notices of the American Mathematical Society 56(9):1082–1097

Provost F, Fawcett T (2013) Data Science for Business: What you need to know about data mining and data-analytic thinking. O'Reilly Media, Inc

Provost F, Dalessandro B, Hook R, Zhang X, Murray A (2009) Audience selection for on-line brand advertising: Privacy-friendly social network targeting. In: Proceedings of the 15th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, ACM, New York, NY, USA, KDD '09, pp 707–716, DOI 10.1145/1557019.1557098


Raskutti B, Kowalczyk A (2004) Extreme re-balancing for SVMs: A case study. SIGKDD Explor Newsl 6(1):60–69, DOI 10.1145/1007730.1007739

Rosvall M, Bergstrom CT (2008) Maps of random walks on complex networks reveal community structure. Proceedings of the National Academy of Sciences 105(4):1118–1123, DOI 10.1073/pnas.0706851105

Schapire RE (1999) A brief introduction to boosting. In: Proceedings of the 16th International Joint Conference on Artificial Intelligence - Volume 2, Morgan Kaufmann Publishers Inc, San Francisco, CA, USA, IJCAI '99, pp 1401–1406

Schapire RE, Singer Y (1999) Improved boosting algorithms using confidence-rated predictions. Machine Learning 37(3):297–336, DOI 10.1023/A:1007614523901

Shmueli G (2017) Analyzing behavioral big data: Methodological, practical, ethical, and moral issues. Quality Engineering 29(1):57–74, DOI 10.1080/08982112.2016.1210979

Sobhani P, Viktor H, Matwin S (2015) Learning from imbalanced data using ensemble methods and cluster-based undersampling. In: New Frontiers in Mining Complex Patterns: Third International Workshop, NFMCP 2014, Held in Conjunction with ECML-PKDD 2014, Nancy, France, September 19, 2014, Revised Selected Papers, Springer International Publishing, Cham, pp 69–83, DOI 10.1007/978-3-319-17876-9_5

Stankova M (2016) Classification within network data with a bipartite structure. Dissertation, University of Antwerp

Stankova M, Martens D, Provost F (2015) Classification over bipartite graphs through projection. Working Papers 2015001, University of Antwerp, Faculty of Applied Economics

Sun Y, Kamel MS, Wong AK, Wang Y (2007) Cost-sensitive boosting for classification of imbalanced data. Pattern Recognition 40(12):3358–3378, DOI 10.1016/j.patcog.2007.04.009

Suykens JA, Van Gestel T, De Brabanter J, De Moor B, Vandewalle J (2002) Least squares support vector machines. World Scientific, Singapore

Tang B, He H (2015) ENN: Extended nearest neighbor method for pattern recognition [research frontier]. IEEE Computational Intelligence Magazine 10(3):52–60, DOI 10.1109/MCI.2015.2437512

Tang Y, Zhang YQ, Chawla NV, Krasser S (2009) SVMs modeling for highly imbalanced classification. IEEE Transactions on Systems, Man, and Cybernetics, Part B (Cybernetics) 39(1):281–288, DOI 10.1109/TSMCB.2008.2002909

Tobback E, Moeyersoms J, Stankova M, Martens D (2016) Bankruptcy prediction for SMEs using relational data. Working Paper 2016004, University of Antwerp, Faculty of Applied Economics

Verbeke W, Dejaeger K, Martens D, Hur J, Baesens B (2012) New insights into churn prediction in the telecommunication sector: A profit driven data mining approach. European Journal of Operational Research 218(1):211–229, DOI 10.1016/j.ejor.2011.09.031

Veropoulos K, Campbell I, Cristianini N (1999) Controlling the sensitivity of support vector machines. In: Proceedings of the International Joint Conference on Artificial Intelligence, Stockholm, Sweden (IJCAI99), pp 55–60

Whitrow C, Hand DJ, Juszczak P, Weston D, Adams NM (2009) Transaction aggregation as a strategy for credit card fraud detection. Data Mining and Knowledge Discovery 18(1):30–55, DOI 10.1007/s10618-008-0116-z

Wickramaratna J, Holden SB, Buxton BF (2001) Performance degradation in boosting. In: Proceedings of the Second International Workshop on Multiple Classifier Systems, Springer, London, UK, MCS '01, pp 11–21

Yen SJ, Lee YS (2009) Cluster-based under-sampling approaches for imbalanced data distributions. Expert Systems with Applications 36(3, Part 1):5718–5727, DOI 10.1016/j.eswa.2008.06.108

Yu HF, Lo HY, Hsieh HP, Lou JK, McKenzie TG, Chou JW, Chung PH, Ho CH, Chang CF, Wei YH, et al (2010) Feature engineering and classifier ensemble for KDD Cup 2010. In: Proceedings of the KDD Cup 2010 Workshop, pp 1–16

Zha H, He X, Ding C, Simon H, Gu M (2001) Bipartite graph partitioning and data clustering. In: Proceedings of the Tenth International Conference on Information and Knowledge Management, ACM, New York, NY, USA, CIKM '01, pp 25–32, DOI 10.1145/502585.502591

Zhang J, Mani I (2003) kNN approach to unbalanced data distributions: A case study involving information extraction. In: Proceedings of the ICML '2003 Workshop on Learning from Imbalanced Datasets, Washington DC

Ziegler CN, McNee SM, Konstan JA, Lausen G (2005) Improving recommendation lists through topic diversification. In: Proceedings of the 14th International Conference on World Wide Web, ACM, New York, NY, USA, WWW '05, pp 22–32, DOI 10.1145/1060745.1060754


distribution function: p = 2 × min(Φ(z), 1 − Φ(z)). Holm's method (Holm 1979) compares k − 1 classifiers to a single control classifier and sorts the corresponding p-values in ascending order, so that p1 ≤ p2 ≤ ... ≤ p(k−1). Each pi is subsequently compared to its associated significance level [25] αcomp = α/(k − i). Holm starts by performing the check p1 < α/(k − 1) and rejects the null hypothesis if the test passes. It then proceeds with a similar check for i = 2 and continues until a certain null hypothesis cannot be rejected. The remaining hypotheses are retained.
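The step-down procedure is mechanical enough to sketch in a few lines. The following is an illustration only, not the authors' code: the `holm_test` helper is a hypothetical name, and scipy's normal CDF is used for the z-to-p conversion described above.

```python
from scipy.stats import norm

def holm_test(z_values, alpha=0.05):
    """Holm's step-down procedure against a single control classifier.

    z_values: dict mapping method name -> z statistic of the rank difference
    with the control (the control itself is not in the dict).
    Returns the set of methods whose null hypothesis is rejected.
    """
    k = len(z_values) + 1                       # number of algorithms, control included
    # two-sided p-value: p = 2 * min(Phi(z), 1 - Phi(z))
    pvals = {m: 2 * min(norm.cdf(z), 1 - norm.cdf(z)) for m, z in z_values.items()}
    ranked = sorted(pvals.items(), key=lambda kv: kv[1])   # p1 <= p2 <= ...
    rejected = set()
    for i, (method, p) in enumerate(ranked, start=1):
        if p < alpha / (k - i):                 # compare p_i with alpha/(k - i)
            rejected.add(method)
        else:
            break                               # retain this and all remaining hypotheses
    return rejected
```

Note the early `break`: once one hypothesis survives, all larger p-values are retained as well, which is exactly why a method can only be declared significant if every method with a smaller p-value passed its own check first.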

Table 7 shows the result of the Holm test at the α = 0.05 significance level with the BL algorithm as control classifier. The table lists each method according to sorted p-values and shows the z-, p- and αcomp-values. The significance column indicates whether the proposed method is significantly different from the BL and coincidentally matches the result of the Nemenyi test. The p-values of the (Far Knn, RUS, CBU) undersampling methods, AB and AC (R = 28) are only slightly higher than the level required to reject the null hypothesis. We included a critical significance level αcrit, corresponding to the lowest possible significance level at which the method would be considered significantly different from the BL (if α = αcrit then p = αcomp). In the case of AB, for example, if we were to choose a significance level α = 0.0764 (a confidence level of 92.36%), the associated p-value would be smaller than αcomp and we would proceed to conclude [26] that AB performs significantly different from the BL. To summarize: the null hypothesis of equal performance between the BL and an alternative method proposed in Section 3 (except Cl Knn) is rejected with a confidence level of more than 91% according to Holm's test. Hence, the performance of the BL approach is significantly worse than that of each of the proposed over- and undersampling methods (except Cl Knn), cost-sensitive learning techniques (AC) and boosting variants (AB, EE).

Similar to the previous paragraph, Table 8 presents the results of Holm's test at the α = 0.05 significance level with EE(S = 15) as control classifier. The results coincidentally correspond with the result of the Nemenyi test and indicate that the oversampling (OSR, SMOTE and ADASYN), AC(R = RL) and EE(S = 10) methods are not statistically different from the EE(S = 15) technique at the 0.05 significance level. However, if we were to adapt the significance level to approximately α = 0.25 (75% confidence level), the results would conclude significance (except for EE(S = 10)). To summarize: the null hypothesis of equal performance between EE(S = 15) and an alternative method proposed in Section 3 (except EE(S = 10)) is rejected with a confidence level of more than 75% according to Holm's test. Hence, the performance of the oversampling, undersampling, AB and AC approaches is significantly worse than that of EE(S = 15). We can therefore safely conclude EE to be the preferred method in terms of AUC performance when dealing with imbalanced behaviour data.

[25] αcomp adjusts the value of α to compensate for multiple comparisons.
[26] This is not entirely true, since Holm's method requires Far Knn to pass the test first before proceeding, which only occurs at the 0.088662 significance level.


Table 7 Holm test at the α = 0.05 significance level with BL as reference. The table shows the z test statistic with associated p-value, and αcomp = α/(k − i), with k the number of algorithms (k = 13) and i corresponding to the position of the method in the sorted p-value vector. Since we already ranked the p-values in ascending order, i takes on the row number (e.g. OSR has i = 5). The column "significant" denotes if we can reject the null hypothesis (significant: p < αcomp). αcrit corresponds with the smallest possible significance level where we would decide to reject the null hypothesis (αcrit = α·p/αcomp).

              z          p          αcomp      significant   αcrit
EE(S = 15)    -6.51642   7.2E-11    0.004167   1             8.64E-10
EE(S = 10)    -5.86009   4.63E-09   0.004545   1             5.09E-08
SMOTE         -4.96936   6.72E-07   0.005      1             6.72E-06
ADASYN        -4.78183   1.74E-06   0.005556   1             1.56E-05
OSR           -4.64119   3.46E-06   0.00625    1             2.77E-05
AC(R = RL)    -4.35991   1.3E-05    0.007143   1             9.11E-05
Far Knn       -2.4378    0.014777   0.008333   0             0.088662
RUS           -2.41436   0.015763   0.01       0             0.078815
AB            -2.34404   0.019076   0.0125     0             0.076305
AC(R = 28)    -2.20339   0.027567   0.016667   0             0.082701
CBU           -2.13307   0.032919   0.025      0             0.065837
Cl Knn         0.609449  0.542227   0.05       0             0.542227

Table 8 Holm test at the α = 0.05 significance level with EE(S = 15) as reference

              z          p          αcomp      significant   αcrit
Cl Knn         7.12587   1.03E-12   0.004167   1             1.24E-11
BL             6.516421  7.2E-11    0.004545   1             7.92E-10
CBU            4.383348  1.17E-05   0.005      1             0.000117
AC(R = 28)     4.313027  1.61E-05   0.005556   1             0.000145
AB             4.172384  3.01E-05   0.00625    1             0.000241
RUS            4.102063  4.09E-05   0.007143   1             0.000287
Far Knn        4.078623  4.53E-05   0.008333   1             0.000272
AC(R = RL)     2.156513  0.031044   0.01       0             0.155218
OSR            1.875229  0.060761   0.0125     0             0.243045
ADASYN         1.734587  0.082814   0.016667   0             0.248442
SMOTE          1.547064  0.121848   0.025      0             0.243696
EE(S = 10)     0.65633   0.511612   0.05       0             0.511612

4.6.2 Timing-wise comparison

The performance of each of the proposed algorithms in terms of training times depends highly upon the chosen parameters. The β and βu parameters control the number of training instances and are an important contribution to the subsequent SVM learning times for each of the over- and undersampling techniques. The number of boosting iterations t ∈ [0, T] is also a major factor in each of the boosting variants proposed in Section 3.3. Furthermore, the chosen regularization parameter C highly influences the training times of the SVM inducer: higher C-values result in larger training times (Fan et al 2008). Not only do the chosen hyperparameters affect learning times; data characteristics such as the number of minority instances, the imbalance level, the number of features, the amount of overlapping, etc. have a major effect as well.


In comparing each of the methods outlined in Section 3, we make use of a methodology similar to the one previously presented (see Section 4.6.1). For each of the ten folds, a suitable parameter combination is selected based on optimal validation set AUC-performance. These chosen parameters are applied in determining the timings of each method. The final results indicate the average timings across ten folds. Hence, we show the required amount of time for each method when using the parameter combination resulting in the best performance. Note that this might, for instance, result in using β = 13 in case of OSR, or using only t = 5 boosting iterations in EE. Computational timings are shown in Table 9.

As one can observe from the aforementioned table, CBU is the slowest method, as it relies on computationally intensive clustering procedures. The synthetic SMOTE and ADASYN oversampling approaches are also very time-consuming: they both rely on computationally expensive nearest neighbour computations, in conjunction with a synthetic sample generation procedure that requires a single pass over all minority training instances.
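Both expensive ingredients are visible even in a minimal SMOTE-style sketch. This is illustrative only: the `smote_like` helper, dense arrays and Euclidean distances are our own simplifications, whereas the actual experiments operate on sparse binary behaviour data.

```python
import numpy as np

def smote_like(X_min, n_synthetic, k=5, rng=None):
    """Generate synthetic minority samples by interpolating between each
    minority seed and one of its k nearest minority neighbours."""
    rng = np.random.default_rng(rng)
    n = len(X_min)
    # Stage 1: nearest-neighbour search among minority instances (expensive).
    d = np.linalg.norm(X_min[:, None, :] - X_min[None, :, :], axis=2)
    np.fill_diagonal(d, np.inf)                    # exclude self-matches
    neighbours = np.argsort(d, axis=1)[:, :k]      # k nearest per instance
    # Stage 2: synthetic generation, one interpolation per new sample.
    synthetic = []
    for _ in range(n_synthetic):
        i = rng.integers(n)                        # pick a minority seed
        j = rng.choice(neighbours[i])              # one of its neighbours
        gap = rng.random()                         # interpolation factor in [0, 1]
        synthetic.append(X_min[i] + gap * (X_min[j] - X_min[i]))
    return np.array(synthetic)
```

Since every synthetic point lies on a segment between two minority instances, the generated samples stay inside the minority region; ADASYN differs mainly in how many samples each seed receives.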

The RUS method shows, as expected, superior results in terms of computational timings. The other undersampling methods perform quite well too, though they are slower than BL and OSR due to their inherent nearest neighbour computations.

The boosting variants seem to perform worse than the BL, OSR and each of the undersampling methods. Since EE was the best method in terms of AUC-performance (see Section 4.6.1), we focus on this learner in this paragraph. For the medium-sized datasets (all but CRF and Bank), OSR outperforms EE(S = 15). Yet for the large datasets (CRF and Bank), the opposite is true. A SVM is known to have a quadratic computational complexity in the number of training instances (Suykens et al 2002; García and Lozano 2007). For this reason, we consider OSR (or any other oversampling technique) inappropriate for large datasets. Since with EE each subset has a size limited to twice the minority class training size, we expect increasing benefits of the EE-technique for larger and more imbalanced behaviour data. As Junque de Fortuny et al (2014a) have observed, larger behaviour data, in terms of the number of instances or features, also contributes to an increase in predictive power. Hence, we expect to see this type of big behaviour data more often in real-life applications, where computational resources can become a major constraint. Because the EE-technique is easily parallelizable by simultaneously performing the boosting process for each of the independent subsets, we added an extra row in Table 9: EE par corresponds with the time needed for EE(S = 15) divided by 15, indicating the time that would be required for EE if each subset were trained in parallel. We can see that EE par can outperform OSR in 8 out of 16 cases, precisely for the datasets that show a high imbalance level (e.g. p = 1 or p = []).
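The parallelism comes for free because the S balanced subsets are drawn and trained independently. The following sketch shows the idea under our own simplifications: `fit` is a placeholder for the boosted SVM trained on each subset (not the paper's actual pipeline), and a thread pool stands in for genuine multi-core execution.

```python
import numpy as np
from concurrent.futures import ThreadPoolExecutor

def balanced_subsets(X, y, S, rng=None):
    """Draw S random subsets, each pairing ALL minority instances with an
    equally sized majority sample, i.e. subset size = 2 x minority size."""
    rng = np.random.default_rng(rng)
    min_idx = np.flatnonzero(y == 1)
    maj_idx = np.flatnonzero(y == 0)               # assumes |maj| >= |min|
    for _ in range(S):
        sampled_maj = rng.choice(maj_idx, size=len(min_idx), replace=False)
        idx = np.concatenate([min_idx, sampled_maj])
        yield X[idx], y[idx]

def easy_ensemble(X, y, S=15, fit=None, rng=0):
    """Fit one learner per balanced subset -- independently, hence trivially
    parallelizable -- and average their decision scores."""
    subsets = list(balanced_subsets(X, y, S, rng))
    with ThreadPoolExecutor() as pool:             # one task per subset
        models = list(pool.map(lambda s: fit(*s), subsets))
    return lambda X_new: np.mean([m(X_new) for m in models], axis=0)
```

Each task touches only 2 × (minority size) instances, which is why the per-subset cost stays low even when the majority class is huge.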

We conclude our comparison of the proposed methods in Figure 5, which shows the average ranks in terms of computational timings (see Table 9) and AUC performance for each of the available data sources. Regarding the ranks with respect to the latter, we added an extra row EE par, having the same AUC-performance as EE(S = 15) for each dataset in Table 5. Ranks are subsequently determined according to the procedure outlined in Section 4.6.1. Additionally, Figure 18 in Appendix D shows the result for the large and highly imbalanced CRF and Bank datasets. We expect the latter type of data to be of major interest, for reasons already indicated.


Table 9 Computational timings (in seconds) averaged across ten folds, using the parameter combination resulting in the highest validation set AUC-performance. First and second best performances are emphasized in boldface (third and fourth best are underlined). EE par is the time required for EE(S = 15) divided by 15.

            Mov G(p = 1)   Mov G(p = 25)   Mov Th(p = [])   Yahoo A(p = 1)
BL          0.032889       0.056697        0.558563         0.026922
OSR         0.055043       0.062802        0.99009          0.044421
SMOTE       0.218821       0.937057        3.841482         0.057726
ADASYN      0.284688       1.802399        5.191265         0.087694
RUS         0.011431       0.025383        0.155224         0.007991
CL Knn      0.046599       0.599846        0.989914         0.037182
Far Knn     0.039887       0.80072         0.683023         0.027788
CBU         1.034111       10.60173        6.822839         1.692477
AB          0.169792       0.841443        3.460246         0.139251
AC(R = 28)  0.471994       2.996585        1.086907         0.366555
AC(R = RL)  0.53376        1.179542        6.065177         0.209015
EE(S = 10)  0.117226       6.065145        1.17995          0.148973
EE(S = 15)  0.20474        7.173737        2.119991         0.180365
EE par      0.013649       0.478249        0.141333         0.012024

            Yahoo A(p = 25)  Yahoo G(p = 1)  Yahoo G(p = 25)  TaFeng(p = 1)
BL          0.092954         0.011915        0.044164         0.026728
OSR         0.027887         0.013241        0.047206         0.040919
SMOTE       1.062686         0.056153        0.883698         0.219553
ADASYN      2.050993         0.079073        1.733367         0.306618
RUS         0.048471         0.003234        0.033423         0.002916
CL Knn      0.84391          0.025404        0.502515         0.092167
Far Knn     0.664124         0.026576        0.500206         0.080159
CBU         15.69442         1.287221        13.55035         2.467279
AB          0.445546         0.078777        0.169977         0.114619
AC(R = 28)  1.034044         0.321723        0.515953         0.926178
AC(R = RL)  0.706215         0.226741        0.112949         0.610233
EE(S = 10)  1.026577         0.100331        1.527146         0.058052
EE(S = 15)  1.607596         0.077483        2.472582         0.10538
EE par      0.107173         0.005166        0.164839         0.007025

            TaFeng(p = 25)  Book(p = 1)  Book(p = 25)  LST(p = 1)
BL          0.032033        0.080035     0.318093      0.652045
OSR         0.032414        0.132927     0.092757      0.87152
SMOTE       5.089283        3.409418     11.43444      4.987705
ADASYN      8.148419        3.689661     12.25441      6.840083
RUS         0.020457        0.022713     0.031972      0.432839
CL Knn      1.713731        0.400873     3.711648      2.508374
Far Knn     1.539437        0.379086     3.988552      2.511037
CBU         26.42686        4.198663     46.31987      []
AB          0.713265        0.61719      1.238585      2.466151
AC(R = 28)  1.234647        1.666131     2.330635      1.451671
AC(R = RL)  0.279047        0.860346     0.197053      1.23763
EE(S = 10)  2.484502        2.145747     7.177484      0.524066
EE(S = 15)  3.363971        2.480066     11.21945      0.784111
EE par      0.224265        0.165338     0.747963      0.052274


Table 9 Continued. Additionally, an extra column indicating average ranks is included (lower ranks are preferred).

            Adver(p = [])  Adver(p = 1)  CRF(p = [])  Bank(p = [])
BL          0.010953       0.002796      0.725911     708.9334
OSR         0.012178       0.006166      3.685813     1797.481
SMOTE       0.123112       0.017764      5.633862     []
ADASYN      0.183767       0.021728      5.768669     []
RUS         0.012115       0.00204       0.147392     5.247441
CL Knn      0.061324       0.005568      1.106755     737.3282
Far Knn     0.079078       0.007069      1.110379     975.9619
CBU         3.378235       3.236754      []           []
AB          0.069199       0.103518      1.153196     830.8618
AC(R = 28)  0.193092       0.068905      2.047434     717.0548
AC(R = RL)  0.107652       0.037963      1.387174     1063.466
EE(S = 10)  0.138485       0.085686      0.198656     249.5117
EE(S = 15)  0.185136       0.139121      0.285345     364.0107
EE par      0.012342       0.009275      0.019023     24.26738

            Average Rank [pos]
BL          2.94   [2]
OSR         4.19   [4]
SMOTE       9.59   [11]
ADASYN      10.91  [13]
RUS         1.38   [1]
CL Knn      6.5    [5]
Far Knn     6.56   [6]
CBU         14     [14]
AB          8.06   [7]
AC(R = 28)  10.81  [12]
AC(R = RL)  9.25   [9]
EE(S = 10)  8.25   [8]
EE(S = 15)  9.56   [10]
EE par      3      [3]

Timings are a substantial concern with respect to these large datasets. As can be observed, the EE-technique thrives for this big and highly imbalanced data, even in its non-parallel form.

38 Jellis Vanhoeyveld David Martens

Fig 5 Average rank AUC versus average rank Time (see Table 9) across all datasets from Table 2 (excluding Flickr and Kdd). Regarding the AUC rank, an extra row EE par is added, having the same AUC-value as EE(S = 15). Ranks for each dataset are subsequently obtained via the procedure outlined in Section 4.6.1. Points occurring in the upper-right region are preferred.

5 The effect of the chosen base learner

In this short section, we investigate the influence of the chosen base learner. More specifically, we ask ourselves whether some base inducers can more easily cope with the imbalanced learning issue, and in doing so we verify the conclusion that EE is a suitable technique to improve upon the baseline performance. We rely on several popular classification methods for behaviour data: two regularization-based techniques (SVM, LR) and two heuristic types of approaches (Naive Bayes (NB) and Behavioural Similarity (BeSim)). The next paragraphs briefly introduce these methods. Note that source codes and dataset results are provided in our on-line repository.

LR constructs a linear model that maximizes the likelihood of the observed data by imposing a logistic model for the parametric probability model P(Y|X). Ng and Jordan (2002) provide a thorough introduction to the topic [27] and note that, in its plain form, LR can suffer from overfitting, especially when the input data is very high-dimensional. Therefore, we resort to regularized logistic regression. We opt for L2-regularized LR, since L1-regularization corresponds with a natural way of imposing feature selection (Ng 2004). The LIBLINEAR toolbox (Fan et al 2008) is used to obtain these L2-regularized LR models.

[27] For an accessible introduction, see the chapter on "Generative and Discriminative Classifiers: Naive Bayes and Logistic Regression" provided at the following link: http://www.cs.cmu.edu/~tom/NewChapters.html
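For readers working in Python, scikit-learn exposes the same LIBLINEAR solver; the snippet below is a minimal sketch (the toy behaviour matrix is invented for illustration, and the paper's own experiments use the LIBLINEAR toolbox directly).

```python
from scipy.sparse import csr_matrix
from sklearn.linear_model import LogisticRegression

# Toy sparse binary behaviour matrix: rows = individuals, columns = items
# acted upon (each cell is 1 if the individual showed that behaviour).
X = csr_matrix([[1, 0, 1, 0],
                [1, 1, 0, 0],
                [0, 0, 1, 1],
                [0, 1, 0, 1]])
y = [1, 1, 0, 0]

# L2-regularized LR solved by the LIBLINEAR backend; C controls the
# regularization strength (larger C = weaker regularization, longer training).
clf = LogisticRegression(penalty="l2", C=1.0, solver="liblinear").fit(X, y)
scores = clf.predict_proba(X)[:, 1]        # ranking scores, e.g. for AUC
```

The sparse `csr_matrix` input matters here: behaviour data of this kind has very large dimensions but very few non-zeros per row, so a dense representation would not fit in memory.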


NB (Ng and Jordan 2002) relies on the use of Bayes' rule, which expresses the posterior P(Y|X) as the product of the likelihood P(X|Y) and the prior P(Y), divided by the evidence P(X). The naivety of the approach originates from assuming conditional independence of the features. We impose a multivariate event model regarding the distribution P(X|Y). Junque de Fortuny et al (2014a) provide an efficient NB implementation for dealing with large and sparse binary behaviour data.
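Under the multivariate (Bernoulli) event model, P(X|Y) is a product of per-feature Bernoulli likelihoods, so absent features also contribute to the score. A hedged sketch using scikit-learn's `BernoulliNB` on the same kind of toy data (this is not the authors' optimized implementation):

```python
from scipy.sparse import csr_matrix
from sklearn.naive_bayes import BernoulliNB

# Toy sparse binary behaviour matrix (rows = individuals, columns = items).
X = csr_matrix([[1, 0, 1, 0],
                [1, 1, 0, 0],
                [0, 0, 1, 1],
                [0, 1, 0, 1]])
y = [1, 1, 0, 0]

# Multivariate event model with Laplace smoothing (alpha=1): each feature's
# presence AND absence enter the product P(X|Y) = prod_j P(x_j | Y).
nb = BernoulliNB(alpha=1.0).fit(X, y)
posteriors = nb.predict_proba(X)[:, 1]     # P(Y = 1 | X) per instance
```

This is what makes the model "weak" in the boosting sense discussed below: the conditional independence assumption is clearly violated in behaviour data, yet the resulting scores still rank instances usefully.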

The BeSim approach conceptually corresponds to projecting the bigraph to a unigraph, using the procedure outlined in Section 3.2. Next, the weighted-vote relational neighbour classifier (Macskassy and Provost 2007) is used to infer unknown labels/scores through a weighted probability estimation using the known labels of neighbouring nodes. A very fast and scalable implementation of this approach (the SW-transformation) is presented in detail in Stankova et al (2015).
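Conceptually, the two steps can be sketched as follows. This is a deliberately simplified illustration with co-occurrence counts as projection weights; the actual SW-transformation of Stankova et al (2015) uses specific weight functions and never materializes the projected matrix.

```python
import numpy as np

def besim_scores(X, y_known, mask_known):
    """Weighted-vote relational neighbour scores on the projected unigraph.

    X: binary individuals x items matrix; projecting the bigraph gives
    individual-individual weights W = X @ X.T (number of shared items here).
    Each node's score is the weight-normalized vote of its LABELLED neighbours.
    """
    W = X @ X.T                                  # bigraph -> unigraph projection
    np.fill_diagonal(W, 0)                       # no self-votes
    W = W[:, mask_known].astype(float)           # only labelled nodes vote
    totals = W.sum(axis=1)
    votes = W @ y_known                          # weighted sum of known labels
    return np.divide(votes, totals,
                     out=np.zeros_like(votes), where=totals > 0)
```

Nodes with no labelled neighbour get a score of zero here; a production implementation would fall back to the class prior instead.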

In Table 10, we compare the performances of the baseline (BL) application of the methods with their EE-version (with S = 15, T = 20). It is quite striking to see the EE-version dominating the baseline version when looking at each method individually. This suggests that the imbalanced learning issue has an impact on each of the aforementioned techniques. Focussing on the regularization-based approaches, we observe the baseline LR to be less influenced by the imbalance than the SVM method. Their EE-versions (EE SVM and EE LR) can boost the baseline performance in almost all cases and have the highest average ranks amongst all the proposed methods. As already emphasized, the regularization-based approaches offer an added element of flexibility, in the sense that the strength of the underlying base learner can be controlled.

The heuristic approaches NB and BeSim do not offer this flexibility. Furthermore, it is not directly possible to integrate instance weights during the training phase of the base learner. During the boosting process, we therefore sample from the distribution Dt to generate (unweighted) examples and train a BeSim/NB learner next. The results in Table 10 indicate BL NB to be the worst performing algorithm amongst all methods tried. Due to the conditional independence assumption, one could regard this as a weak learner, which is therefore suitable for use in a boosting process. Indeed, the EE NB approach is able to significantly improve upon its performance. We found this to be the slowest EE-version [28] (e.g. taking more than 24 hours to process a single fold of the Bank dataset, versus several minutes for the other versions). The average rank estimate of EE NB is rather pessimistic, since we assigned the lowest performance across all methods in case of missing results. BeSim can be considered a stronger learner compared to NB; its possible boost in performance is typically lower than that of NB. Having said this, we found EE BeSim to improve upon BL BeSim for all datasets (except Adver(p = 1)).

28 This statement only relates to the EE-version. In the baseline case, the timings of the NB-version are roughly of the same order of magnitude as SVM and LR (though this heavily depends on the chosen regularization parameter C). With EE, each subset has a size limited to twice the minority class training size (a relatively low number compared to the size of the majority class). This works in favour of SVM/LR (e.g. SVM has a computational complexity of O(m²), with m the number of training instances). Also note that the NB-version is indeed optimized for sparse data. Internally, the NB-implementation writes a fairly large model file to a specified directory (sizes of 1 GB can be encountered there). This writing operation slows down the training stage (especially in case of EE, where S × T = 15 × 20 = 300 model files are constructed).

40 Jellis Vanhoeyveld David Martens

Table 10 Average tenfold AUC test set performance of the BL compared to its EE-version (S = 15, T = 20) for SVM, LR, BeSim and NB. In case the EE NB implementation crashed, we put a × in the table; if it is too slow (> 24 hours per fold), we place a []. Additionally, average ranks are assigned to each method. In case EE NB has missing results for a particular dataset, we ensure it has the lowest rank across all other algorithms.

            Mov G(p = 1)    Mov G(p = 2.5)   Mov Th(p = [])   Yahoo A(p = 1)

BL SVM      71.6 (2.62)     81.41 (1.32)     79.77 (5.33)     56.49 (3.37)
EE SVM      76.12 (2.88)    85.13 (1.86)     86.43 (5.86)     59.74 (2.96)
BL LR       71.02 (2.09)    84.39 (1.84)     83.14 (4.17)     57.84 (2.39)
EE LR       76.69 (2.92)    85.03 (1.98)     86.3 (5.37)      59.79 (2.62)
BL BeSim    76.1 (3.58)     81.3 (2.92)      82.81 (6.6)      56.27 (2.73)
EE BeSim    76.31 (3.71)    81.37 (2.9)      85.02 (6.28)     57.7 (1.71)
BL NB       70.26 (5.84)    77.01 (2.54)     70.48 (10.14)    52.56 (2.09)
EE NB       75.93 (2.83)    85.56 (2.01)     86.91 (4.15)     57.55 (2.73)

            Yahoo A(p = 2.5)  Yahoo G(p = 1)  Yahoo G(p = 2.5)  TaFeng(p = 1)

BL SVM      61.61 (2.48)      66.84 (3.66)    78.82 (1.39)      55.75 (1.6)
EE SVM      66.38 (3.16)      73.48 (2.32)    80.55 (1.55)      61.13 (1.83)
BL LR       66.27 (2.96)      69.82 (1.93)    80.45 (1.59)      58.91 (2.31)
EE LR       66.22 (3.28)      73.08 (2.14)    80.53 (1.56)      61.43 (2.32)
BL BeSim    64.54 (2.02)      68.89 (2.49)    79.55 (1.96)      57.89 (1.18)
EE BeSim    65.25 (2.23)      71.18 (2.91)    80.04 (1.85)      59.36 (1.47)
BL NB       65 (1.65)         63.33 (2.56)    78.89 (1.64)      54.61 (1.2)
EE NB       66.6 (2.79)       70.99 (2.88)    81.01 (1.3)       59.01 (1.84)

            TaFeng(p = 2.5)  Book(p = 1)    Book(p = 2.5)  LST(p = 1)

BL SVM      66.94 (1.34)     52.6 (1.29)    60.08 (0.71)   99.99 (0.01)
EE SVM      70.4 (1.3)       55.34 (1.28)   65.4 (0.51)    99.98 (0.01)
BL LR       69.24 (1.3)      55.34 (1.27)   63.84 (0.75)   99.99 (0.01)
EE LR       70.28 (1.28)     55.49 (1.49)   65.41 (0.63)   99.97 (0.02)
BL BeSim    67.49 (1.23)     55.19 (1.27)   63.7 (0.63)    99.99 (0.01)
EE BeSim    68 (1.21)        55.21 (1.15)   64.38 (0.42)   99.99 (0.0)
BL NB       65.21 (1.64)     52.93 (0.9)    59.75 (0.47)   98.69 (0.3)
EE NB       70.72 (1.15)     ×              63.46 (0.61)   99.92 (0.04)

            Adver(p = [])   Adver(p = 1)   CRF(p = [])     Bank(p = [])

BL SVM      96.37 (1.94)    91.18 (2.97)   64.36 (18.97)   66.82 (0.88)
EE SVM      97.63 (1.35)    93.3 (2.14)    86.35 (9.99)    71.54 (0.76)
BL LR       97.19 (1.44)    88.51 (1.93)   81.87 (19.63)   71.43 (0.72)
EE LR       97.57 (0.96)    93.02 (2.06)   86.84 (9.62)    71.77 (0.62)
BL BeSim    97.26 (1.12)    95.38 (1.35)   86.91 (9.36)    67.85 (0.67)
EE BeSim    97.38 (1.04)    93.83 (1.35)   87.02 (10.43)   70.41 (0.55)
BL NB       93.75 (1.9)     93.37 (1.9)    87.24 (9.38)    67.83 (0.63)
EE NB       94.04 (1.75)    ×              ×               []

            Flickr(p = 0.1)  Kdd(p = 0.5)   Average Rank

BL SVM      74.92 (0.17)     74.53 (0.05)   6.44 [7]
EE SVM      79.86 (0.13)     80.98 (0.05)   2.39 [1]
BL LR       79.03 (0.11)     81.29 (0.04)   4.28 [4]
EE LR       79.85 (0.13)     80.75 (0.05)   2.61 [2]
BL BeSim    74.62 (0.13)     74.95 (0.0)    5.11 [6]
EE BeSim    76.4 (0.13)      77.55 (0.03)   3.61 [3]
BL NB       81.36 (0.1)      74.29 (0.05)   6.5 [8]
EE NB       []               []             5.06 [5]


6 Conclusions and future research directions

We investigated the effects of sampling and cost-sensitive learning techniques on the problem of learning from imbalanced behaviour data. This type of big data has fundamentally different properties and internal structure than traditional data and is characterized by sparseness and very large dimensions. This is a new research topic which enables benefits across a wide variety of application domains such as fraud detection, targeted advertising, predictive policing, churn prediction, default prediction, etc. To enable future developments in this field, we provide an on-line repository containing implementations, datasets and results, available at the following link: http://www.applieddatamining.com/cms/?q=software.

Oversampling techniques need to be adapted in order to cope with this kind of data. We proposed new versions of the traditional SMOTE and ADASYN techniques, which differ in the way synthetic samples are generated and similarity computations are performed. The plain random oversampling (OSR) and the synthetic approaches are competitive methods in terms of AUC and do not seem to suffer from the overfitting that traditional studies report. Though the synthetic approaches have AUC performances comparable to OSR, they are of little use in practice due to their computationally expensive procedures. The OSR method is very fast for small and medium sized datasets, yet becomes inappropriate for larger datasets because the training times of the underlying base learner increase drastically.
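As a point of reference, plain OSR amounts to duplicating randomly chosen minority instances until the desired balance is reached. The sketch below assumes full balancing and a binary 0/1 labelling; the function name is illustrative, not from the paper's repository.

```python
import random

def random_oversample_indices(y, seed=0):
    # Random oversampling (OSR): duplicate randomly chosen minority
    # indices until both classes are equally represented. Working on
    # indices avoids copying rows of a sparse behaviour matrix.
    rng = random.Random(seed)
    min_idx = [i for i, lab in enumerate(y) if lab == 1]
    maj_idx = [i for i, lab in enumerate(y) if lab == 0]
    extra = [rng.choice(min_idx) for _ in range(len(maj_idx) - len(min_idx))]
    return maj_idx + min_idx + extra

y = [0] * 90 + [1] * 10
idx = random_oversample_indices(y)
# the resampled index list now holds 90 majority and 90 minority entries
```

Because only indices are replicated, the cost sits entirely in the base learner, which explains why OSR training times blow up on large datasets.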

With respect to undersampling methodologies, we employed a random undersampling technique (RUS) and Knn-based informed approaches. The latter type selects majority class instances that are closest (Cl Knn) or farthest (Far Knn) from the minority class examples and allows one to investigate the effect of noise/outliers on SVM performance. We observed that these odd instances have a severe performance degrading effect. RUS and Far Knn approaches can significantly improve performance with respect to the baseline, thanks to their abilities to reduce noise/outlier levels and to put more emphasis on the minority class. Having said this, they usually perform worse than any of the oversampling approaches. This is in sheer contrast to the results obtained from traditional studies and is heavily related to the intrinsic nature of behaviour data. Timing-wise, RUS is the fastest method; the informed approaches are slower than RUS due to the required nearest neighbour computations. Additionally, a cluster-based undersampling (CBU) technique was proposed to alleviate the within-class imbalance. For low undersampling rates, the performance of CBU is superior to plain RUS, whereas for high undersampling rates the opposite is true. Overall, no substantial gain in performance is obtained in comparison with a simple RUS approach. Furthermore, the method is the slowest among all techniques considered and relies on computationally expensive clustering heuristics.
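For completeness, the RUS baseline in its fully balanced form can be sketched in a few lines; in the actual experiments the degree of undersampling is a tunable parameter (βu), which this simplified sketch fixes at full balance.

```python
import random

def random_undersample_indices(y, seed=0):
    # Random undersampling (RUS) down to a balanced set: keep every
    # minority index plus an equally sized random majority sample.
    rng = random.Random(seed)
    min_idx = [i for i, lab in enumerate(y) if lab == 1]
    maj_idx = [i for i, lab in enumerate(y) if lab == 0]
    return sorted(min_idx + rng.sample(maj_idx, k=len(min_idx)))

y = [0] * 95 + [1] * 5
idx = random_undersample_indices(y)
# idx holds all 5 minority indices and 5 randomly kept majority indices
```

The informed variants differ only in how the kept majority sample is chosen (nearest or farthest neighbours of the minority class instead of a random draw), which is where their extra cost comes from.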

We studied the effects of boosting, where we made use of a very specific linear SVM and logistic regression combination that uses confidence rated predictions instead of a more popular plain "discrete" boosting algorithm (with weak hypotheses having values in {−1, 1}). Our experiments clearly indicated the regularization constant C in the SVM formulation to act as a "weakness" indicator. Indeed, higher C-values cause stronger learners and shouldn't be used in the boosting process, which matches the analysis of studies dealing with traditional data. Strong learners can


already quite accurately distinguish majority class from minority class behaviours. The boosting process will focus on the hard to learn instances, among which are noise/outliers, and this can result in overfitting behaviour. We also observed that boosting can usually outperform the plain baseline, indicating that a combination of weak learners can achieve higher AUC-performance than a single strong learner.
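The confidence rated weight update can be illustrated with a self-contained sketch in the spirit of Schapire and Singer: weak hypotheses output real values, and the distribution update D(i) ∝ D(i)·exp(−yᵢ·hₜ(xᵢ)) shrinks the weight of confidently correct instances fastest. A one-feature threshold stump stands in here for the paper's SVM/LR weak learner; all names and the toy data are illustrative.

```python
import math

def best_stump(X, y, D, eps=1e-6):
    # Pick the threshold stump minimizing the criterion
    # Z = 2 * sum_s sqrt(W+_s * W-_s); each side s outputs the real
    # value 0.5 * ln(W+_s / W-_s), i.e. a confidence rated prediction.
    best = None
    for f in range(len(X[0])):
        vals = sorted({x[f] for x in X})
        for a, b in zip(vals, vals[1:]):
            t = (a + b) / 2.0
            w = {(s, lab): eps for s in (0, 1) for lab in (-1, 1)}
            for xi, yi, di in zip(X, y, D):
                w[(int(xi[f] > t), yi)] += di
            Z = 2 * sum(math.sqrt(w[(s, 1)] * w[(s, -1)]) for s in (0, 1))
            if best is None or Z < best[0]:
                c = {s: 0.5 * math.log(w[(s, 1)] / w[(s, -1)]) for s in (0, 1)}
                best = (Z, f, t, c)
    _, f, t, c = best
    return lambda x: c[int(x[f] > t)]

def boost(X, y, rounds=5):
    # Confidence rated boosting: D(i) is scaled by exp(-y_i * h_t(x_i)),
    # then renormalized, so hard instances accumulate weight.
    m = len(X)
    D = [1.0 / m] * m
    hyps = []
    for _ in range(rounds):
        h = best_stump(X, y, D)
        hyps.append(h)
        D = [d * math.exp(-yi * h(xi)) for d, xi, yi in zip(D, X, y)]
        Z = sum(D)
        D = [d / Z for d in D]
    return lambda x: sum(h(x) for h in hyps)

X = [[0.0], [1.0], [2.0], [3.0], [4.0], [5.0]]
y = [-1, -1, -1, 1, 1, 1]
H = boost(X, y)
# sign(H(x)) recovers the labels on this separable toy problem
```

A very accurate (strong) weak learner would drive almost all weight onto a handful of hard instances after one round, which is the mechanism behind the overfitting risk described above.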

EasyEnsemble (EE) integrates random undersampling with a boosting process. By sampling several balanced subsets from the training data and combining these, we can emphasize the minority class, achieve noise/outlier reduction effects and simultaneously explore the majority class space efficiently. Our version differs from the original formulation because we employ a different type of underlying weak learner (SVM and LR combination) that is fed to a confidence rated boosting algorithm (instead of the discrete version). We observed superior AUC-performance with respect to all of the aforementioned techniques. Furthermore, the method is very fast, since each subset can be trained in parallel and has a size that is twice as large as the minority class training set. We studied the effect of the number of subsets and noted that a limited amount of these (O(10)) is already sufficient to obtain satisfactory performances; increasing this number further causes only minor benefits.
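The subset-sampling skeleton of EE can be sketched as follows. The paper boosts an SVM/LR learner on each balanced subset; in this illustrative sketch a simple centroid-difference scorer stands in as the per-subset learner, and all names and toy data are assumptions for the example.

```python
import random

def easy_ensemble(X, y, S=15, seed=0):
    # EasyEnsemble sketch: draw S independent balanced subsets (all
    # minority instances plus an equally sized random majority sample),
    # train one learner per subset and average their decision scores.
    rng = random.Random(seed)
    min_idx = [i for i, lab in enumerate(y) if lab == 1]
    maj_idx = [i for i, lab in enumerate(y) if lab == 0]
    models = []
    for _ in range(S):
        sub = min_idx + rng.sample(maj_idx, k=min(len(min_idx), len(maj_idx)))
        pos = [X[i] for i in sub if y[i] == 1]
        neg = [X[i] for i in sub if y[i] == 0]
        # centroid-difference scorer standing in for the boosted SVM/LR
        w = [sum(p[j] for p in pos) / len(pos) - sum(n[j] for n in neg) / len(neg)
             for j in range(len(X[0]))]
        models.append(w)
    return lambda x: sum(sum(wj * xj for wj, xj in zip(w, x)) for w in models) / len(models)

# toy data: positives cluster around 2, negatives around 0
X = [[0.1], [0.0], [0.2], [-0.1], [0.05], [2.0], [2.1]]
y = [0, 0, 0, 0, 0, 1, 1]
score = easy_ensemble(X, y, S=5)
```

Each subset is independent of the others, which is why the S trainings parallelize trivially, and each has only twice the minority class size, which keeps the per-subset training cost low.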

Additionally, a statistical comparison of each of the proposed methods was conducted across a wide variety of datasets. Holm's significance test indicates that each of the proposed techniques (oversampling, undersampling except Cl Knn, cost sensitive learning and boosting variants) can significantly outperform the plain baseline performance (each of the null-hypotheses of equal performance gets rejected with a confidence level of at least 91%). Comparison of the techniques with EE showed that the latter can significantly outperform all of them (each of the null-hypotheses of equal performance gets rejected with a confidence level of at least 75%). A timing-wise comparison showed EE to be the preferred method (after RUS) for large and highly imbalanced datasets. A parallel implementation would make EE the fastest method even for medium sized datasets that show a high level of imbalance.

The influence of the chosen base learner was investigated by comparing the baseline to its EE-version for the SVM, LR, NB and BeSim inducers. Each of the EE-versions is able to improve upon the baseline performance for nearly all datasets, where base learners of the regularization type are superior because of their ability to control the strength of the learner.

Though we conducted an extensive benchmark study, there are still plenty of future research directions. First, and most important, there is an urgent need to develop a more general theoretical framework to completely understand the effect of imbalance and why certain methods outperform others. Secondly, other types of mechanisms could be investigated in this setting: advanced sampling (Guo et al 2008) set-ups perform sampling based on preliminary classifications; one-class learning is a recognition based methodology where a discriminative boundary is learned around the training examples of a single class alone, and Raskutti and Kowalczyk (2004) demonstrated the superiority of the recognition based one-class SVMs over the discriminative two-class SVMs; other types of ensemble learning strategies (e.g. bagging and stacking) could be investigated; Batista et al (2004) note that combining over- and undersampling can be beneficial when data are highly imbalanced or show very few minority class instances; the ideas behind Extended Nearest Neighbour (Tang and


He 2015) could guide us in the sampling process. As opposed to Knn, "it considers not only who are the nearest neighbours of a certain sample, but also who consider the particular sample as their nearest neighbour" (Tang and He 2015). The method assigns a class membership to an unknown sample that maximizes the intra-class coherence. Finally, more advanced techniques do exist that can determine K (the number of nearest neighbours) faster or with (slightly) better performance.

Another issue relates to the lack of comprehensibility of the EasyEnsemble technique in its present form. Because we currently integrate Logistic Regression in the boosting process, the final ensemble classifier is non-linear and therefore incomprehensible. It might be interesting to investigate the numerical boosting technique proposed in Schapire and Singer (1999), where each weak learner is allowed to output real values (instead of values in [−1, 1]). In that case, we would be able to use a plain linear SVM as a weak learner, without the Logistic Regression component. The final ensemble would be a linear model, which is surely easier to comprehend. Also, the instance-based approach of Martens and Provost (2014) could be used to explain individual predictions of the non-linear EE-technique.
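The comprehensibility argument rests on a simple observation: a weighted sum of linear scorers is itself linear. A minimal sketch (hypothetical weight vectors and coefficients, purely for illustration):

```python
def collapse_linear_ensemble(weights, alphas):
    # If every weak learner is linear, h_t(x) = w_t . x, then the boosted
    # ensemble sum_t alpha_t * h_t(x) is itself linear, with weight vector
    # sum_t alpha_t * w_t. This is what would make a pure linear-SVM
    # ensemble readable as a single model.
    dim = len(weights[0])
    return [sum(a * w[j] for a, w in zip(alphas, weights)) for j in range(dim)]

# two toy weak learners combined with equal coefficients
w_combined = collapse_linear_ensemble([[1.0, 0.0], [0.0, 2.0]], [0.5, 0.5])
```

With the Logistic Regression component in the loop, this collapse is no longer possible, which is precisely the comprehensibility cost noted above.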

A Oversampling experiments

Table 11 Oversampling experiments. Results show the average tenfold AUC test set performance with respect to increasing β-values (β = [β1, β2, β3, β4] = [0, 1/3, 2/3, 1]). Additionally, standard deviations are shown between brackets. Optimal performances for each method are highlighted in boldface. See Section 4.2 for parameter settings.

Mov G(p = 1)
          β1             β2             β3             β4
OSR       71.6 (2.62)    74.37 (2.04)   73.6 (1.84)    74.73 (2.45)
SMOTE     71.6 (2.62)    75.08 (2.18)   76.02 (2.14)   76.48 (2.3)
ADASYN    71.6 (2.62)    75.16 (1.92)   75.93 (2.08)   76.47 (2.29)

Mov G(p = 2.5)
          β1             β2             β3             β4
OSR       81.41 (1.32)   83.49 (1.81)   83.84 (1.96)   83.91 (2.04)
SMOTE     81.41 (1.32)   83.32 (1.97)   83.59 (2.04)   83.76 (2.11)
ADASYN    81.41 (1.32)   83.61 (1.82)   84.02 (1.97)   83.69 (1.96)

Mov Th(p = [])
          β1             β2             β3             β4
OSR       79.77 (5.33)   85.3 (4.66)    83.16 (4.5)    84.59 (5.69)
SMOTE     79.77 (5.33)   84.18 (6.51)   85.58 (5.97)   84.33 (5.75)
ADASYN    79.77 (5.33)   84.11 (6.77)   85.86 (6.06)   85.36 (5.13)

Yahoo A(p = 1)
          β1             β2             β3             β4
OSR       55.92 (2.97)   58.66 (3.27)   59.99 (2.28)   59.74 (1.78)
SMOTE     55.92 (2.97)   59.76 (2.62)   59.74 (2.67)   59.43 (2.4)
ADASYN    55.92 (2.97)   59.54 (2.53)   59.55 (2.94)   59.56 (2.22)

Yahoo A(p = 2.5)
          β1             β2             β3             β4
OSR       61.68 (2.42)   64.19 (3.17)   65.08 (3.26)   64.67 (2.1)
SMOTE     61.68 (2.42)   65.46 (3.63)   65.33 (3.23)   64.52 (2.98)
ADASYN    61.68 (2.42)   65.04 (3.74)   65.41 (3.47)   64.4 (2.21)

Yahoo G(p = 1)
          β1             β2             β3             β4
OSR       66.84 (3.66)   72.18 (2.36)   73.11 (2.7)    72.49 (3.41)
SMOTE     66.84 (3.66)   72.65 (2.85)   73.27 (3.36)   73.37 (3.56)
ADASYN    66.84 (3.66)   72.87 (2.83)   73.18 (3.2)    73.39 (3.59)

Yahoo G(p = 2.5)
          β1             β2             β3             β4
OSR       78.82 (1.39)   78.78 (2.02)   78.74 (1.86)   78.35 (1.47)
SMOTE     78.82 (1.39)   79.23 (1.57)   79.1 (1.2)     79.03 (1.89)
ADASYN    78.82 (1.39)   79.12 (1.43)   79.23 (1.37)   79.51 (2.01)

TaFeng(p = 1)
          β1             β2             β3             β4
OSR       55.75 (1.6)    59.23 (1.96)   60 (1.68)      61.04 (2.36)
SMOTE     55.75 (1.6)    60.26 (1.95)   61.49 (1.8)    61.13 (1.52)
ADASYN    55.75 (1.6)    60.26 (1.9)    61.44 (1.85)   61.16 (1.5)

TaFeng(p = 2.5)
          β1             β2             β3             β4
OSR       66.94 (1.34)   68.84 (1.21)   66.99 (1.42)   67.7 (1.41)
SMOTE     66.94 (1.34)   68.47 (1.5)    67.07 (1.15)   66.65 (0.81)
ADASYN    66.94 (1.34)   68.62 (1.38)   67.85 (1.6)    66.91 (1.39)

Book(p = 1)
          β1             β2             β3             β4
OSR       52.6 (1.29)    53.61 (0.94)   55.41 (1.75)   55.87 (1.44)
SMOTE     52.6 (1.29)    54.77 (0.99)   54.91 (0.8)    54.36 (0.98)
ADASYN    52.6 (1.29)    54.86 (1.13)   55.06 (0.73)   54.54 (0.92)

Book(p = 2.5)
          β1             β2             β3             β4
OSR       60.08 (0.71)   61.05 (0.94)   61.82 (1.12)   64.62 (0.57)
SMOTE     60.08 (0.71)   62.6 (0.73)    60.95 (0.68)   63 (0.8)
ADASYN    60.08 (0.71)   62.33 (0.73)   60.77 (0.85)   63.04 (0.58)

LST(p = 1)
          β1             β2             β3             β4
OSR       99.99 (0.01)   99.99 (0.01)   99.99 (0.01)   99.99 (0.01)
SMOTE     99.99 (0.01)   99.99 (0.01)   99.99 (0.01)   99.99 (0.01)
ADASYN    99.99 (0.01)   99.99 (0.01)   99.99 (0.01)   99.99 (0.01)

Adver(p = [])
          β1             β2             β3             β4
OSR       96.61 (1.82)   97.31 (1.65)   97.07 (1.84)   97.07 (1.79)
SMOTE     96.61 (1.82)   96.91 (1.66)   97.19 (1.65)   97.07 (1.91)
ADASYN    96.61 (1.82)   97.1 (1.7)     97.08 (1.87)   97.07 (1.88)

Adver(p = 1)
          β1             β2             β3             β4
OSR       90.93 (3.02)   91.27 (3.03)   92.66 (2.82)   93.29 (1.97)
SMOTE     90.93 (3.02)   92.51 (2.03)   92.96 (2.14)   93.53 (1.81)
ADASYN    90.93 (3.02)   92.22 (2.33)   92.7 (2.36)    93.88 (1.73)

CRF(p = [])
          β1              β2              β3              β4
OSR       64.06 (16.43)   80.82 (12.94)   81.28 (12.27)   81.91 (11.28)
SMOTE     64.06 (16.43)   78.64 (16.86)   82.52 (13.74)   79.32 (16.26)
ADASYN    64.06 (16.43)   78.95 (16.72)   81.19 (16.32)   79.31 (16.19)

Bank(p = [])
          β1             β2             β3             β4
OSR       66.82 (0.88)   70.1 (0.74)    71.39 (0.8)    71.47 (0.8)
SMOTE
ADASYN


B Undersampling experiments

Table 12 Undersampling experiments. Results show the average tenfold AUC test set performance with respect to increasing βu-values (βu = [βu1, βu2, βu3, βu4, βu5] = [0, 1/4, 1/2, 3/4, 1]). Optimal performances (boldface) are highlighted for each imbalance ratio. Additionally, standard deviations are shown between brackets. Cl K represents the "Closest Knn" technique; Cl T represents the "Closest tot sim" technique (similar for Far K and Far T; see Section 3.2). Parameter settings are given in Section 4.2.

Mov G(p = 1)
        βu1            βu2            βu3            βu4            βu5
RUS     71.6 (2.6)     71.83 (2.6)    72.54 (2.5)    72.39 (3.1)    70.61 (3.5)
Cl K    71.6 (2.6)     71.4 (2.0)     70.96 (1.9)    70.43 (2.4)    69.05 (4.1)
Cl T    71.6 (2.6)     70.28 (2.5)    66.74 (2.0)    66.8 (2.1)     68.18 (3.6)
Far K   71.6 (2.6)     72.36 (2.7)    71.26 (3.4)    66.57 (5.2)    53.5 (3.5)
Far T   71.6 (2.6)     72.22 (2.8)    71.63 (3.6)    64.28 (5.3)    50.88 (4.4)
CBU     72.55 (2.6)    73.28 (2.6)    73.12 (2.6)    73.84 (2.5)    73 (3.1)

Mov G(p = 2.5)
        βu1            βu2            βu3            βu4            βu5
RUS     81.41 (1.3)    81.36 (1.3)    81.78 (1.7)    82.05 (1.7)    81.6 (2.1)
Cl K    81.41 (1.3)    80.86 (1.2)    80.95 (1.6)    79.73 (2.3)    77.95 (2.3)
Cl T    81.41 (1.3)    79.9 (1.2)     78.21 (1.4)    77.87 (1.5)    77.76 (2.3)
Far K   81.41 (1.3)    80.9 (1.5)     78.17 (1.8)    74.25 (2.4)    69.79 (3.2)
Far T   81.41 (1.3)    80.86 (1.5)    77.2 (2.4)     71.16 (2.7)    62.4 (2.8)
CBU     81.53 (1.4)    81.64 (1.3)    81.29 (1.6)    81.28 (2.1)    80.34 (2.7)

Mov Th(p = [])
        βu1            βu2            βu3            βu4            βu5
RUS     79.77 (5.3)    80.32 (5.8)    81.57 (5.5)    81.86 (6.6)    81.26 (6.2)
Cl K    79.77 (5.3)    79.25 (4.5)    78.07 (5.0)    76.25 (6.5)    62.46 (8.5)
Cl T    79.77 (5.3)    78.4 (4.4)     72.41 (3.5)    64.66 (4.5)    60.37 (7.3)
Far K   79.77 (5.3)    84.54 (5.0)    83.64 (6.4)    80.02 (7.3)    56.82 (10.3)
Far T   79.77 (5.3)    85.03 (5.7)    82.68 (6.8)    75.61 (9.2)    56.77 (10.9)
CBU     80.11 (5.8)    81.17 (6.0)    81.08 (6.5)    84.17 (5.1)    80.96 (6.9)

Yahoo A(p = 1)
        βu1            βu2            βu3            βu4            βu5
RUS     55.92 (3)      55.57 (3.4)    56.44 (3)      55.83 (3.4)    56.37 (3.3)
Cl K    55.92 (3)      55.67 (2.4)    53.12 (2)      50.57 (1.8)    53.79 (3.5)
Cl T    55.92 (3)      55.69 (2.1)    53.35 (2.2)    50.31 (2.2)    52.35 (3.3)
Far K   55.92 (3)      57.35 (2.2)    56.92 (1.1)    56.95 (2.3)    51.18 (2)
Far T   55.92 (3)      56.93 (2.4)    54.74 (1.9)    57.01 (1.8)    51.18 (2)
CBU     58.21 (2.6)    58.45 (3.3)    58.31 (3.5)    58.39 (3.5)    56.09 (2.6)

Yahoo A(p = 2.5)
        βu1            βu2            βu3            βu4            βu5
RUS     61.68 (2.4)    62.9 (2.9)     63.62 (3.6)    63.75 (3.1)    63.19 (1.9)
Cl K    61.68 (2.4)    61.14 (2.1)    57.62 (1.6)    54.02 (1.8)    51.48 (1.4)
Cl T    61.68 (2.4)    60.89 (2.8)    58.11 (1.4)    54.4 (2.1)     51.76 (1.4)
Far K   61.68 (2.4)    63.96 (3)      62.62 (2.2)    59.61 (1.5)    56.25 (1.6)
Far T   61.68 (2.4)    63.71 (2.4)    59.72 (1.6)    57.27 (1.1)    54.47 (1.1)
CBU     62.46 (2.6)    61.85 (1.4)    61.78 (2.2)    59.94 (3)      60.1 (4)

Yahoo G(p = 1)
        βu1            βu2            βu3            βu4            βu5
RUS     66.84 (3.7)    67.85 (3.2)    68.36 (3.2)    68.23 (4)      69.9 (4.2)
Cl K    66.84 (3.7)    66.71 (2.8)    64.3 (3.6)     61.98 (3.9)    61.15 (1.9)
Cl T    66.84 (3.7)    65.79 (2.7)    63.55 (3.3)    59.21 (3.5)    61.08 (2.4)
Far K   66.84 (3.7)    66.76 (4.1)    63.84 (3.4)    65.16 (2)      48.5 (2.9)
Far T   66.84 (3.7)    66.95 (4.1)    63.48 (2.9)    65.16 (2)      48.48 (2.9)
CBU     69.68 (4.1)    70.59 (3.2)    70.64 (3.7)    70.2 (2.9)     63.35 (3.6)

Yahoo G(p = 2.5)
        βu1            βu2            βu3            βu4            βu5
RUS     78.82 (1.4)    78.91 (1.6)    78.97 (1.6)    78.61 (1.6)    77.82 (2.1)
Cl K    78.82 (1.4)    77.26 (1.5)    72.52 (1.5)    67.86 (2)      65.07 (2.7)
Cl T    78.82 (1.4)    76.83 (1)      71.99 (1.8)    67.15 (2.3)    61.1 (2.7)
Far K   78.82 (1.4)    78.26 (2.2)    74.69 (2.7)    67.22 (2.1)    60.72 (2.3)
Far T   78.82 (1.4)    77.68 (2.6)    72.44 (3)      64.94 (2.4)    59.6 (2)
CBU     75.25 (3.2)    75.22 (2.4)    74.69 (2.3)    73.07 (2.4)    70.69 (2.4)

TaFeng(p = 1)
        βu1            βu2            βu3            βu4            βu5
RUS     55.75 (1.6)    56.1 (1.6)     56.26 (1.7)    57.23 (1.7)    59.25 (2.2)
Cl K    55.75 (1.6)    55.68 (1.6)    55.58 (1.5)    55.08 (1.1)    51.05 (1.5)
Cl T    55.75 (1.6)    55.67 (1.6)    54.47 (1.6)    47.53 (1.6)    49.3 (1.1)
Far K   55.75 (1.6)    58.99 (1.2)    59.47 (1.1)    60.04 (1.2)    56.31 (1)
Far T   55.75 (1.6)    58.92 (1.3)    59.25 (1.3)    58.58 (1.1)    56.31 (1)
CBU     57.8 (1)       58.47 (1.1)    58.15 (0.9)    58.87 (1.4)    57.65 (1.6)

TaFeng(p = 2.5)
        βu1            βu2            βu3            βu4            βu5
RUS     66.94 (1.3)    67.44 (1.3)    68.1 (1.4)     68.27 (1.4)    66.13 (1.2)
Cl K    66.94 (1.3)    66.13 (1.4)    63.39 (1.2)    59.83 (1.3)    56.94 (0.7)
Cl T    66.94 (1.3)    66.38 (1.5)    62.89 (1.6)    57.46 (1.3)    54.56 (1.3)
Far K   66.94 (1.3)    68.06 (1.4)    66.43 (1.6)    64.46 (1.5)    63.35 (1.3)
Far T   66.94 (1.3)    64.31 (1.1)    62.69 (1)      61.27 (1.1)    59.03 (1)
CBU     64.81 (1.2)    64.15 (1.1)    64.13 (1.2)    63.88 (0.8)    63.46 (0.8)

Book(p = 1)
        βu1            βu2            βu3            βu4            βu5
RUS     52.6 (1.3)     52.79 (0.9)    53.46 (0.8)    53.89 (0.9)    54.05 (0.9)
Cl K    52.6 (1.3)     52.56 (1.2)    52.52 (1.3)    52.39 (1.1)    53.09 (1.1)
Cl T    52.6 (1.3)     52.56 (1.2)    52.52 (1.3)    52.39 (1.1)    53.05 (0.7)
Far K   52.6 (1.3)     55.21 (1.2)    56.21 (1.8)    56.14 (1.2)    53.06 (1)
Far T   52.6 (1.3)     55.21 (1.2)    56.21 (1.8)    56.14 (1.2)    53.06 (1)
CBU     54.28 (0.9)    53.77 (1)      53.33 (1.1)    53.34 (0.9)    52.84 (0.8)

Book(p = 2.5)
        βu1            βu2            βu3            βu4            βu5
RUS     60.08 (0.7)    60.13 (0.6)    60.4 (0.8)     60.33 (0.8)    63.28 (0.8)
Cl K    60.08 (0.7)    59.96 (0.7)    60.13 (0.8)    59.96 (1)      59.28 (0.7)
Cl T    60.08 (0.7)    59.96 (0.7)    60.13 (0.8)    60.29 (0.4)    54.5 (0.9)
Far K   60.08 (0.7)    63.29 (1)      64.19 (0.8)    57.3 (1.1)     55.66 (1.1)
Far T   60.08 (0.7)    62.14 (0.5)    58.27 (0.6)    56.37 (1)      55.66 (1.1)
CBU     54.82 (0.9)    54.67 (0.9)    54.71 (0.9)    54.66 (1)      54.78 (0.9)

LST(p = 1)
        βu1            βu2            βu3            βu4            βu5
RUS     99.99 (0)      99.99 (0)      99.99 (0)      99.98 (0)      99.99 (0)
Cl K    99.99 (0)      99.99 (0)      99.99 (0)      99.99 (0)      99.99 (0)
Cl T    99.99 (0)      99.99 (0)      99.99 (0)      99.99 (0)      99.98 (0)
Far K   99.99 (0)      99.98 (0)      99.98 (0)      99.98 (0)      99.98 (0)
Far T   99.99 (0)      99.98 (0)      99.98 (0)      99.98 (0)      99.98 (0)
CBU     []             []             []             []             []

Adver(p = [])
        βu1            βu2            βu3            βu4            βu5
RUS     96.61 (1.8)    96.32 (1.8)    96.63 (1.4)    97.12 (2.1)    96.22 (1.6)
Cl K    96.61 (1.8)    96.44 (1.5)    96.14 (1.5)    96.04 (2)      94.8 (2.5)
Cl T    96.61 (1.8)    95.87 (2.1)    94.32 (1.9)    93.01 (2.2)    90.72 (2.3)
Far K   96.61 (1.8)    96.53 (1.4)    95.76 (2)      94.39 (1.8)    90.49 (3.1)
Far T   96.61 (1.8)    96.54 (1.5)    95.67 (1.9)    94.54 (1.8)    89.3 (2.8)
CBU     96.85 (2.3)    96.85 (2.3)    97.05 (1.5)    96.6 (1.6)     96.06 (2.1)

Adver(p = 1)
        βu1            βu2            βu3            βu4            βu5
RUS     90.93 (3)      91.53 (3.1)    92.37 (3.4)    91.9 (2.9)     91.93 (2.2)
Cl K    90.93 (3)      90.64 (3)      89.87 (3.9)    90.21 (3.6)    89.18 (2)
Cl T    90.93 (3)      89.7 (3.5)     88.55 (3.4)    85.76 (3.3)    88.2 (2.3)
Far K   90.93 (3)      93.8 (2.3)     92.4 (2.6)     88.73 (3.4)    85.51 (4)
Far T   90.93 (3)      93.62 (2.4)    93.2 (2.2)     88.41 (3.6)    85.51 (4)
CBU     93.22 (2.4)    93.76 (2.5)    93.89 (2.6)    93.52 (2.7)    91.27 (2)

CRF(p = [])
        βu1             βu2             βu3             βu4             βu5
RUS     64.06 (16.4)    63.28 (15.9)    67.98 (17.4)    66.95 (21.9)    87.73 (8.8)
Cl K    64.06 (16.4)    62.44 (16.6)    62.34 (16.9)    71.37 (13.8)    78.22 (17.7)
Cl T    64.06 (16.4)    62.44 (16.6)    62.34 (16.9)    71.37 (13.8)    62.67 (22.9)
Far K   64.06 (16.4)    83.8 (14.2)     83.93 (14.8)    84.49 (13.7)    86.11 (9.7)
Far T   64.06 (16.4)    83.8 (14.2)     83.93 (14.8)    84.49 (13.7)    86.11 (9.7)
CBU     []              []              []              []              []

Bank(p = [])
        βu1            βu2            βu3            βu4            βu5
RUS     66.82 (0.9)    67.02 (0.9)    67.37 (0.8)    67.99 (0.6)    69.5 (1)
Cl K    66.82 (0.9)    66.17 (0.7)    65.24 (0.6)    64.86 (0.6)    58.53 (1.1)
Cl T    66.82 (0.9)    64.92 (1.1)    60.69 (0.9)    56.33 (0.8)    52.87 (0.7)
Far K   66.82 (0.9)    66.95 (0.6)    66.19 (0.6)    64.42 (0.6)    58.25 (1.1)
Far T   66.82 (0.9)    67.16 (0.6)    64.2 (0.8)     59.67 (1)      58.25 (1.1)
CBU     []             []             []             []             []


C Boosting experiments

This section presents the results of each of the boosting experiments from Section 4.5 on the remaining data sources. Each of the figures below depicts the average tenfold AUC-performance on test data (with µ = 100) with respect to the number of boosting iterations for (left) Adaboost (AB), AdaCost (AC) and EasyEnsemble (EE), with C chosen according to the highest validation set AUC-performance (over all possible boosting rounds), and (right) AB and EE (with S = 15) at varying C-levels.

[Figures 6-17, panels (a) and (b), are omitted here. Each plots AUC on test data against the number of boosting rounds T, with legend entries AB, AC R2, AC R8, AC RD, EE S5, EE S10, EE S15 and BL in panel (a), and AB and EE at C = 1e-07, 1e-05, 0.001, 0.1 plus BL in panel (b).]

Fig. 6 Mov G(p = 1) dataset
Fig. 7 Mov Th(p = []) dataset
Fig. 8 Yahoo A(p = 1) dataset
Fig. 9 Yahoo A(p = 2.5) dataset
Fig. 10 Yahoo G(p = 1) dataset
Fig. 11 Yahoo G(p = 2.5) dataset
Fig. 12 TaFeng(p = 1) dataset
Fig. 13 Book(p = 1) dataset
Fig. 14 LST(p = 1) dataset
Fig. 15 Adver(p = []) dataset
Fig. 16 Adver(p = 1) dataset
Fig. 17 CRF(p = []) dataset

D Final Comparison

[Figure 18 omitted: scatter of average rank AUC (x-axis) versus average rank Time (y-axis) for BL, OSR, SMOTE, ADASYN, RUS, Cl Knn, Far Knn, CBU, AB, AC(R=28), AC(R=RL), EE(S=10), EE(S=15) and EE par.]

Fig. 18 Average rank AUC versus average rank Time (see Table 9) across the large datasets from Table 2 (CRF and Bank). Regarding the former, an extra row EE par is added, having the same AUC-value as EE(S = 15). Ranks for each dataset are subsequently obtained via the procedure outlined in Section 4.6.1. Points occurring in the upper-right region are preferred.


References

Akbani R Kwek S Japkowicz N (2004) Applying support vector machines to imbalanced datasetsIn Machine Learning ECML 2004 15th European Conference on Machine Learning Pisa ItalySeptember 20-24 2004 Proceedings Springer Berlin Heidelberg Berlin Heidelberg pp 39ndash50 DOI101007978-3-540-30115-8 7

Ali A Shamsuddin SM Ralescu AL (2015) Classification with class imbalance problem A review Inter-national Journal of Advances in Soft Computing and its Applications 7(3)176ndash204

Alzahrani T Horadam KJ (2016) Community detection in bipartite networks Algorithms and case studiesIn Complex Systems and Networks Dynamics Controls and Applications Springer Berlin Heidel-berg Berlin Heidelberg pp 25ndash50 DOI 101007978-3-662-47824-0 2

Bachner J (2013) Predictive policing Preventing crime with data and analytics IBM Center for the Busi-ness of Government

Baesens B Van Gestel T Viaene S Stepanova M Suykens J Vanthienen J (2003) Benchmarking state-of-the-art classification algorithms for credit scoring Journal of the Operational Research Society54(6)627ndash635 DOI 101057palgravejors2601545

Barandela R Snchez J Garca V Rangel E (2003) Strategies for learning in class imbalance problemsPattern Recognition 36(3)849 ndash 851 DOI httpsdoiorg101016S0031-3203(02)00257-1

Barber MJ (2007) Modularity and community detection in bipartite networks Physical Review E76066102 DOI 101103PhysRevE76066102

Barua S Islam MM Yao X Murase K (2014) MWMOTEndashmajority weighted minority oversamplingtechnique for imbalanced data set learning IEEE Transactions on Knowledge and Data Engineer-ing 26(2)405ndash425 DOI 101109TKDE2012232

Batista GEAPA Prati RC Monard MC (2004) A study of the behavior of several methods for balancingmachine learning training data SIGKDD Explor Newsl 6(1)20ndash29 DOI 10114510077301007735

Beckett SJ (2016) Improved community detection in weighted bipartite networks Royal Society OpenScience 3(1) DOI 101098rsos140536

Bekkar M Djemaa HK Alitouche TA (2013) Evaluation measures for models assessment over imbalanceddata sets Journal of Information Engineering and Applications 3(10)27ndash38

Bhattacharyya S Jha S Tharakunnel K Westland JC (2011) Data mining for credit card fraud A compara-tive study Decision Support Systems 50(3)602 ndash 613 DOI httpsdoiorg101016jdss201008008

Blondel VD Guillaume JL Lambiotte R Lefebvre E (2008) Fast unfolding of communities in large net-works Journal of Statistical Mechanics Theory and Experiment 2008(10)P10008

Breiman L Friedman J Stone CJ Olshen RA (1984) Classification and regression trees Taylor amp FrancisBrozovsky L Petricek V (2007) Recommender system for online dating service In Proceedings of

Znalosti 2007 Conference VSB OstravaCha M Mislove A Gummadi KP (2009) A measurement-driven analysis of information propagation in

the Flickr social network In Proceedings of the 18th International Conference on World Wide WebACM New York NY USA WWW rsquo09 pp 721ndash730 DOI 10114515267091526806

Chawla NV (2005) Data mining for imbalanced datasets An overview In Data mining and knowledgediscovery handbook Springer US Boston MA pp 853ndash867

Chawla NV Bowyer KW Hall LO Kegelmeyer WP (2002) SMOTE synthetic minority over-samplingtechnique Journal of artificial intelligence research 16321ndash357

Chawla NV Lazarevic A Hall LO Bowyer KW (2003) Smoteboost Improving prediction of the minorityclass in boosting In Knowledge Discovery in Databases PKDD 2003 Springer Berlin HeidelbergBerlin Heidelberg pp 107ndash119

Chawla NV Japkowicz N Kotcz A (2004) Editorial Special issue on learning from imbalanced data setsSIGKDD Explor Newsl 6(1)1ndash6 DOI 10114510077301007733

Chen M Mao S Liu Y (2014) Big data A survey Mobile Networks and Applications 19(2)171ndash209DOI 101007s11036-013-0489-0

Chyi YM (2003) Classification analysis techniques for skewed class distribution problems Master thesisDepartment of Information management National Sun Yat-Sen University

Demsar J (2006) Statistical comparisons of classifiers over multiple data sets Journal of Machine LearningResearch 7(Jan)1ndash30

Dhillon IS (2001) Co-clustering documents and words using bipartite spectral graph partitioning In Pro-ceedings of the Seventh ACM SIGKDD International Conference on Knowledge Discovery and DataMining ACM New York NY USA KDD rsquo01 pp 269ndash274 DOI 101145502512502550

54 Jellis Vanhoeyveld David Martens

Drummond C, Holte RC (2003) C4.5, class imbalance, and cost sensitivity: why under-sampling beats over-sampling. In: Proceedings of the ICML '03 Workshop on Learning from Imbalanced Datasets

Fan RE, Chang KW, Hsieh CJ, Wang XR, Lin CJ (2008) LIBLINEAR: A library for large linear classification. Journal of Machine Learning Research 9:1871–1874

Fan W, Stolfo SJ, Zhang J, Chan PK (1999) AdaCost: Misclassification cost-sensitive boosting. In: Proceedings of the Sixteenth International Conference on Machine Learning, Morgan Kaufmann Publishers Inc, San Francisco, CA, USA, ICML '99, pp 97–105

Fawcett T (2006) An introduction to ROC analysis. Pattern Recognition Letters 27(8):861–874, DOI https://doi.org/10.1016/j.patrec.2005.10.010

Finch H (2005) Comparison of distance measures in cluster analysis with dichotomous data. Journal of Data Science 3(1):85–100

Fortunato S (2010) Community detection in graphs. Physics Reports 486(3–5):75–174, DOI https://doi.org/10.1016/j.physrep.2009.11.002

Junque de Fortuny E, Martens D, Provost F (2014a) Predictive modeling with big data: is bigger really better? Big Data 1(4):215–226, DOI 10.1089/big.2013.0037

Junque de Fortuny E, Stankova M, Moeyersoms J, Minnaert B, Provost F, Martens D (2014b) Corporate residence fraud detection. In: Proceedings of the 20th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, ACM, New York, NY, USA, KDD '14, pp 1650–1659, DOI 10.1145/2623330.2623333

Frasca M, Bertoni A, Re M, Valentini G (2013) A neural network algorithm for semi-supervised node label learning from unbalanced data. Neural Networks 43:84–98, DOI https://doi.org/10.1016/j.neunet.2013.01.021

Friedman M (1937) The use of ranks to avoid the assumption of normality implicit in the analysis of variance. Journal of the American Statistical Association 32(200):675–701

García E, Lozano F (2007) Boosting support vector machines. In: Machine Learning and Data Mining in Pattern Recognition, 5th International Conference, MLDM 2007, Leipzig, Germany, July 18-20, Post Proceedings, IBaI publishing, pp 153–167

Goldstein M, Uchida S (2016) A comparative evaluation of unsupervised anomaly detection algorithms for multivariate data. PLOS ONE 11(4):1–31, DOI 10.1371/journal.pone.0152173

González PC, Velásquez JD (2013) Characterization and detection of taxpayers with false invoices using data mining techniques. Expert Systems with Applications 40(5):1427–1436, DOI https://doi.org/10.1016/j.eswa.2012.08.051

Guimerà R, Sales-Pardo M, Amaral LAN (2007) Module identification in bipartite and directed networks. Physical Review E 76:036102, DOI 10.1103/PhysRevE.76.036102

Guo H, Viktor HL (2004) Learning from imbalanced data sets with boosting and data generation: The DataBoost-IM approach. SIGKDD Explor Newsl 6(1):30–39, DOI 10.1145/1007730.1007736

Guo X, Yin Y, Dong C, Yang G, Zhou G (2008) On the class imbalance problem. In: 2008 Fourth International Conference on Natural Computation, IEEE, vol 4, pp 192–201, DOI 10.1109/ICNC.2008.871

Han H, Wang WY, Mao BH (2005) Borderline-SMOTE: A new over-sampling method in imbalanced data sets learning. In: Advances in Intelligent Computing, Springer Berlin Heidelberg, Berlin, Heidelberg, pp 878–887

He H, Garcia EA (2009) Learning from imbalanced data. IEEE Transactions on Knowledge and Data Engineering 21(9):1263–1284, DOI 10.1109/TKDE.2008.239

He H, Bai Y, Garcia EA, Li S (2008) ADASYN: Adaptive synthetic sampling approach for imbalanced learning. In: 2008 IEEE International Joint Conference on Neural Networks (IEEE World Congress on Computational Intelligence), IEEE, pp 1322–1328, DOI 10.1109/IJCNN.2008.4633969

Holm S (1979) A simple sequentially rejective multiple test procedure. Scandinavian Journal of Statistics 6(2):65–70

Hsu CW, Lin CJ (2002) A comparison of methods for multiclass support vector machines. IEEE Transactions on Neural Networks 13(2):415–425, DOI 10.1109/72.991427

Huang A (2008) Similarity measures for text document clustering. In: Proceedings of the Sixth New Zealand Computer Science Research Student Conference (NZCSRSC2008), Christchurch, New Zealand, pp 49–56

Iman RL, Davenport JM (1980) Approximations of the critical region of the Friedman statistic. Communications in Statistics-Theory and Methods 9(6):571–595

Jo T, Japkowicz N (2004) Class imbalances versus small disjuncts. ACM SIGKDD Explor Newsl 6(1):40–49, DOI 10.1145/1007730.1007737

Imbalanced classification in sparse and large behaviour datasets 55

Jutla IS, Jeub LG, Mucha PJ (2011-2016) A generalized Louvain method for community detection implemented in MATLAB. URL http://netwiki.amath.unc.edu/GenLouvain

Kubat M, Matwin S (1997) Addressing the curse of imbalanced training sets: One-sided selection. In: Proceedings of the Fourteenth International Conference on Machine Learning, Morgan Kaufmann Publishers Inc, San Francisco, CA, USA, pp 179–186

Lancichinetti A, Fortunato S (2009) Community detection algorithms: A comparative analysis. Physical Review E 80:056117, DOI 10.1103/PhysRevE.80.056117

Larremore DB, Clauset A, Jacobs AZ (2014) Efficiently inferring community structure in bipartite networks. Physical Review E: Statistical, Nonlinear, and Soft Matter Physics 90:012805, DOI 10.1103/PhysRevE.90.012805

Li J, Fine JP (2010) Weighted area under the receiver operating characteristic curve and its application to gene selection. Journal of the Royal Statistical Society: Series C (Applied Statistics) 59(4):673–692, DOI 10.1111/j.1467-9876.2010.00713.x

Li X, Wang L, Sung E (2008) AdaBoost with SVM-based component classifiers. Engineering Applications of Artificial Intelligence 21(5):785–795, DOI https://doi.org/10.1016/j.engappai.2007.07.001

Lichman M (2013) UCI machine learning repository. URL http://archive.ics.uci.edu/ml

Liu W, Chawla S, Cieslak DA, Chawla NV (2010) A robust decision tree algorithm for imbalanced data sets. In: Proceedings of the Tenth SIAM International Conference on Data Mining, SIAM, Philadelphia, vol 10, pp 766–777

Liu XY, Wu J, Zhou ZH (2009) Exploratory undersampling for class-imbalance learning. IEEE Transactions on Systems, Man, and Cybernetics, Part B (Cybernetics) 39(2):539–550, DOI 10.1109/TSMCB.2008.2007853

Luts J, Ojeda F, Van de Plas R, De Moor B, Van Huffel S, Suykens JA (2010) A tutorial on support vector machine-based methods for classification problems in chemometrics. Analytica Chimica Acta 665(2):129–145, DOI https://doi.org/10.1016/j.aca.2010.03.030

Macskassy SA, Provost F (2007) Classification in networked data: A toolkit and a univariate case study. Journal of Machine Learning Research 8(May):935–983

Martens D, Provost F (2014) Explaining data-driven document classifications. MIS Quarterly 38(1):73–100

Martens D, Provost F, Clark J, Junque de Fortuny E (2016) Mining massive fine-grained behavior data to improve predictive analytics. MIS Quarterly 40(4):869–888

Mazurowski MA, Habas PA, Zurada JM, Lo JY, Baker JA, Tourassi GD (2008) Training neural network classifiers for medical decision making: The effects of imbalanced datasets on classification performance. Neural Networks 21(2–3):427–436, DOI https://doi.org/10.1016/j.neunet.2007.12.031

Mease D, Wyner AJ, Buja A (2007) Boosted classification trees and class probability/quantile estimation. Journal of Machine Learning Research 8:409–439

Nemenyi P (1963) Distribution-free Multiple Comparisons. Dissertation, Princeton University

Newman MEJ, Girvan M (2004) Finding and evaluating community structure in networks. Physical Review E 69:026113, DOI 10.1103/PhysRevE.69.026113

Ng AY (2004) Feature selection, L1 vs L2 regularization, and rotational invariance. In: Proceedings of the Twenty-first International Conference on Machine Learning, ACM, New York, NY, USA, ICML '04, pp 78–, DOI 10.1145/1015330.1015435

Ng AY, Jordan MI (2002) On discriminative vs generative classifiers: A comparison of logistic regression and naive bayes. In: Advances in Neural Information Processing Systems 14, MIT Press, pp 841–848

Ngai E, Hu Y, Wong Y, Chen Y, Sun X (2011) The application of data mining techniques in financial fraud detection: A classification framework and an academic review of literature. Decision Support Systems 50(3):559–569, DOI https://doi.org/10.1016/j.dss.2010.08.006

Platt JC (1999) Probabilistic outputs for support vector machines and comparisons to regularized likelihood methods. In: Advances in Large-Margin Classifiers, MIT Press, pp 61–74

Porter MA, Onnela JP, Mucha PJ (2009) Communities in networks. Notices of the American Mathematical Society 56(9):1082–1097

Provost F, Fawcett T (2013) Data Science for Business: What you need to know about data mining and data-analytic thinking. O'Reilly Media Inc

Provost F, Dalessandro B, Hook R, Zhang X, Murray A (2009) Audience selection for on-line brand advertising: Privacy-friendly social network targeting. In: Proceedings of the 15th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, ACM, New York, NY, USA, KDD '09, pp 707–716, DOI 10.1145/1557019.1557098


Raskutti B, Kowalczyk A (2004) Extreme re-balancing for SVMs: A case study. SIGKDD Explor Newsl 6(1):60–69, DOI 10.1145/1007730.1007739

Rosvall M, Bergstrom CT (2008) Maps of random walks on complex networks reveal community structure. Proceedings of the National Academy of Sciences 105(4):1118–1123, DOI 10.1073/pnas.0706851105

Schapire RE (1999) A brief introduction to boosting. In: Proceedings of the 16th International Joint Conference on Artificial Intelligence - Volume 2, Morgan Kaufmann Publishers Inc, San Francisco, CA, USA, IJCAI'99, pp 1401–1406

Schapire RE, Singer Y (1999) Improved boosting algorithms using confidence-rated predictions. Machine Learning 37(3):297–336, DOI 10.1023/A:1007614523901

Shmueli G (2017) Analyzing behavioral big data: Methodological, practical, ethical, and moral issues. Quality Engineering 29(1):57–74, DOI 10.1080/08982112.2016.1210979

Sobhani P, Viktor H, Matwin S (2015) Learning from imbalanced data using ensemble methods and cluster-based undersampling. In: New Frontiers in Mining Complex Patterns: Third International Workshop, NFMCP 2014, Held in Conjunction with ECML-PKDD 2014, Nancy, France, September 19, 2014, Revised Selected Papers, Springer International Publishing, Cham, pp 69–83, DOI 10.1007/978-3-319-17876-9_5

Stankova M (2016) Classification within network data with a bipartite structure. Dissertation, University of Antwerp

Stankova M, Martens D, Provost F (2015) Classification over bipartite graphs through projection. Working Papers 2015001, University of Antwerp, Faculty of Applied Economics

Sun Y, Kamel MS, Wong AK, Wang Y (2007) Cost-sensitive boosting for classification of imbalanced data. Pattern Recognition 40(12):3358–3378, DOI https://doi.org/10.1016/j.patcog.2007.04.009

Suykens JA, Van Gestel T, De Brabanter J, De Moor B, Vandewalle J (2002) Least Squares Support Vector Machines. World Scientific, Singapore

Tang B, He H (2015) ENN: Extended nearest neighbor method for pattern recognition [research frontier]. IEEE Computational Intelligence Magazine 10(3):52–60, DOI 10.1109/MCI.2015.2437512

Tang Y, Zhang YQ, Chawla NV, Krasser S (2009) SVMs modeling for highly imbalanced classification. IEEE Transactions on Systems, Man, and Cybernetics, Part B (Cybernetics) 39(1):281–288, DOI 10.1109/TSMCB.2008.2002909

Tobback E, Moeyersoms J, Stankova M, Martens D (2016) Bankruptcy prediction for SMEs using relational data. Working Paper 2016004, University of Antwerp, Faculty of Applied Economics

Verbeke W, Dejaeger K, Martens D, Hur J, Baesens B (2012) New insights into churn prediction in the telecommunication sector: A profit driven data mining approach. European Journal of Operational Research 218(1):211–229, DOI https://doi.org/10.1016/j.ejor.2011.09.031

Veropoulos K, Campbell I, Cristianini N (1999) Controlling the sensitivity of support vector machines. In: Proceedings of the International Joint Conference on Artificial Intelligence, Stockholm, Sweden (IJCAI99), pp 55–60

Whitrow C, Hand DJ, Juszczak P, Weston D, Adams NM (2009) Transaction aggregation as a strategy for credit card fraud detection. Data Mining and Knowledge Discovery 18(1):30–55, DOI 10.1007/s10618-008-0116-z

Wickramaratna J, Holden SB, Buxton BF (2001) Performance degradation in boosting. In: Proceedings of the Second International Workshop on Multiple Classifier Systems, Springer, London, UK, MCS '01, pp 11–21

Yen SJ, Lee YS (2009) Cluster-based under-sampling approaches for imbalanced data distributions. Expert Systems with Applications 36(3, Part 1):5718–5727, DOI https://doi.org/10.1016/j.eswa.2008.06.108

Yu HF, Lo HY, Hsieh HP, Lou JK, McKenzie TG, Chou JW, Chung PH, Ho CH, Chang CF, Wei YH, et al (2010) Feature engineering and classifier ensemble for KDD Cup 2010. In: Proceedings of the KDD Cup 2010 Workshop, pp 1–16

Zha H, He X, Ding C, Simon H, Gu M (2001) Bipartite graph partitioning and data clustering. In: Proceedings of the Tenth International Conference on Information and Knowledge Management, ACM, New York, NY, USA, CIKM '01, pp 25–32, DOI 10.1145/502585.502591

Zhang J, Mani I (2003) kNN approach to unbalanced data distributions: A case study involving information extraction. In: Proceedings of the ICML'2003 Workshop on Learning from Imbalanced Datasets, Washington DC

Ziegler CN, McNee SM, Konstan JA, Lausen G (2005) Improving recommendation lists through topic diversification. In: Proceedings of the 14th International Conference on World Wide Web, ACM, New York, NY, USA, WWW '05, pp 22–32, DOI 10.1145/1060745.1060754



Table 7 Holm test at the α = 0.05 significance level with BL as reference. The table shows the z test statistic with associated p-value, αcomp = α/(k − i) with k the number of algorithms (k = 13) and i corresponding to the position of the method in the sorted p-value vector. Since we already ranked the p-values in ascending order, i takes on the row number (e.g. OSR has i = 5). The column "significant" denotes if we can reject the null-hypothesis (significant: p < αcomp). αcrit corresponds with the smallest possible significance level where we would decide to reject the null-hypothesis (αcrit = α · p/αcomp)

             z          p          αcomp     significant  αcrit
EE(S = 15)   -6.51642   7.2E-11    0.004167  1            8.64E-10
EE(S = 10)   -5.86009   4.63E-09   0.004545  1            5.09E-08
SMOTE        -4.96936   6.72E-07   0.005     1            6.72E-06
ADASYN       -4.78183   1.74E-06   0.005556  1            1.56E-05
OSR          -4.64119   3.46E-06   0.00625   1            2.77E-05
AC(R = RL)   -4.35991   1.3E-05    0.007143  1            9.11E-05
Far Knn      -2.4378    0.014777   0.008333  0            0.088662
RUS          -2.41436   0.015763   0.01      0            0.078815
AB           -2.34404   0.019076   0.0125    0            0.076305
AC(R = 28)   -2.20339   0.027567   0.016667  0            0.082701
CBU          -2.13307   0.032919   0.025     0            0.065837
Cl Knn       0.609449   0.542227   0.05      0            0.542227

Table 8 Holm test at the α = 0.05 significance level with EE(S = 15) as reference

             z          p          αcomp     significant  αcrit
Cl Knn       7.12587    1.03E-12   0.004167  1            1.24E-11
BL           6.516421   7.2E-11    0.004545  1            7.92E-10
CBU          4.383348   1.17E-05   0.005     1            0.000117
AC(R = 28)   4.313027   1.61E-05   0.005556  1            0.000145
AB           4.172384   3.01E-05   0.00625   1            0.000241
RUS          4.102063   4.09E-05   0.007143  1            0.000287
Far Knn      4.078623   4.53E-05   0.008333  1            0.000272
AC(R = RL)   2.156513   0.031044   0.01      0            0.155218
OSR          1.875229   0.060761   0.0125    0            0.243045
ADASYN       1.734587   0.082814   0.016667  0            0.248442
SMOTE        1.547064   0.121848   0.025     0            0.243696
EE(S = 10)   0.65633    0.511612   0.05      0            0.511612
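The Holm correction used in Tables 7 and 8 is straightforward to reproduce. The sketch below is a hypothetical helper (not the authors' code): it sorts the pairwise p-values in ascending order and computes, for row i, the threshold αcomp = α/(k − i) and the critical level αcrit = α · p/αcomp:

```python
def holm_test(p_values, alpha=0.05):
    """Holm step-down procedure for k-1 pairwise comparisons against a
    reference method. p_values maps method name -> p-value of the z-test.
    Returns (name, p, alpha_comp, reject, alpha_crit) tuples, sorted by p."""
    k = len(p_values) + 1  # number of algorithms = comparisons + reference
    results = []
    ranked = sorted(p_values.items(), key=lambda kv: kv[1])
    for i, (name, p) in enumerate(ranked, start=1):
        alpha_comp = alpha / (k - i)  # Holm-adjusted significance threshold
        results.append((name, p, alpha_comp, p < alpha_comp,
                        alpha * p / alpha_comp))
    return results
```

With twelve comparisons (k = 13), the smallest p-value is tested against 0.05/12 ≈ 0.004167, matching the first row of Table 7.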

4.6.2 Timing-wise comparison

The performance of each of the proposed algorithms in terms of training times depends highly upon the chosen parameters. The β and βu parameters control the number of training instances and are an important contribution to the subsequent SVM learning times for each of the over- and undersampling techniques. The number of boosting iterations t ∈ [0, T] is also a major factor in each of the boosting variants proposed in Section 3.3. Furthermore, the chosen regularization parameter C highly influences the training times of the SVM inducer: higher C-values result in larger training times (Fan et al 2008). Not only do the chosen hyperparameters affect learning times, the data characteristics, such as the number of minority instances, the imbalance level, the number of features, the amount of overlapping etc., have a major effect.


In comparing each of the methods outlined in Section 3, we make use of a similar methodology as previously presented (see Section 4.6.1). For each of the ten folds, a suitable parameter combination is selected based on optimal validation set AUC-performance. These chosen parameters are applied in determining the timings of each method. The final results indicate the average timings across ten folds. Hence, we show the required amount of time for each method when using the parameter combination resulting in the best performance. Note that this might for instance result in using β = 13 in case of OSR or using only t = 5 boosting iterations in EE. Computational timings are shown in Table 9.

As one can observe from the aforementioned table, CBU is the slowest method and relies on computationally intensive clustering procedures. The synthetic SMOTE and ADASYN oversampling approaches are very time-consuming methods. They both rely on computationally expensive nearest neighbour computations in conjunction with a synthetic sample generation procedure that requires a single pass over all minority training instances.

The RUS method shows, as expected, superior results in terms of computational timings. Other undersampling methods perform quite well, though they are slower than BL and OSR due to their inherent nearest neighbour computations.

The boosting variants seem to perform worse than the BL, OSR and each of the undersampling methods. Since EE was the best method in terms of AUC-performance (see Section 4.6.1), we focus on this learner in this paragraph. For the medium sized datasets (all but CRF and Bank), OSR outperforms EE(S = 15), yet for the large datasets (CRF and Bank) the opposite is true. A SVM is known to have a quadratic computational complexity in the number of training instances (Suykens et al 2002; García and Lozano 2007). For this reason, we consider OSR (or any other oversampling technique) as inappropriate for large datasets. Since with EE each subset has a size limited to twice the minority class training size, we expect increasing benefits of the EE-technique for larger and more imbalanced behaviour data. As Junque de Fortuny et al (2014a) have observed, larger behaviour data in terms of the number of instances or features also contributes to an increase in predictive power. Hence, we expect to see this type of big behaviour data more often in real-life applications, where computational resources can become a major constraint. Because the EE-technique is easily parallelizable by simultaneously performing the boosting process for each of the independent subsets, we added an extra row in Table 9: EE par corresponds with the time needed for EE(S = 15) divided by 15, indicating the time that would be required for EE if each subset were trained in parallel. We can see that EE par can outperform OSR in 8 out of 16 cases, precisely for the datasets that show a high imbalance level (e.g. p = 1 or p = []).
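The parallelization argument can be made concrete. The sketch below is illustrative only (the helper names and the trivial scorer are our own; the actual method boosts an SVM/LR learner on each subset): every subset consists of the full minority class plus an equally sized random draw from the majority class, the S subsets are trained independently (here in a thread pool), and their decision scores are averaged:

```python
import random
from concurrent.futures import ThreadPoolExecutor

def make_balanced_subsets(majority, minority, n_subsets, seed=0):
    # Each subset keeps all minority instances plus a random majority sample
    # of the same size, so |subset| = 2 * |minority|.
    rng = random.Random(seed)
    return [minority + rng.sample(majority, len(minority))
            for _ in range(n_subsets)]

def easy_ensemble_score(majority, minority, train_fn, x, n_subsets=15):
    # Train one learner per balanced subset (in parallel, since the subsets
    # are independent) and average their decision scores for instance x.
    subsets = make_balanced_subsets(majority, minority, n_subsets)
    with ThreadPoolExecutor() as pool:
        models = list(pool.map(train_fn, subsets))
    return sum(model(x) for model in models) / len(models)
```

Because each subset is small (twice the minority class size), the per-subset training cost stays low even when the majority class is huge, which is exactly the regime discussed above.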

We conclude our comparison of the proposed methods in Figure 5, which shows the average ranks in terms of computational timings (see Table 9) and AUC performance for each of the available data sources. Regarding the ranks with respect to the latter, we added an extra row EE par, having the same AUC-performance as EE(S = 15) for each dataset in Table 5. Ranks are subsequently determined according to the procedure outlined in Section 4.6.1. Additionally, Figure 18 in Appendix D shows the result for the large and highly imbalanced CRF and Bank datasets. We expect the latter type of data to be of major interest for reasons already indicated.


Table 9 Computational timings (in seconds) averaged across ten folds, using the parameter combination resulting in the highest validation set AUC-performance. First and second best performances are emphasized in boldface (third and fourth best are underlined). EE par is the time required for EE(S = 15) divided by 15

             Mov G(p = 1)  Mov G(p = 25)  Mov Th(p = [])  Yahoo A(p = 1)
BL           0032889       0056697        0558563         0026922
OSR          0055043       0062802        099009          0044421
SMOTE        0218821       0937057        3841482         0057726
ADASYN       0284688       1802399        5191265         0087694
RUS          0011431       0025383        0155224         0007991
CL Knn       0046599       0599846        0989914         0037182
Far Knn      0039887       080072         0683023         0027788
CBU          1034111       1060173        6822839         1692477
AB           0169792       0841443        3460246         0139251
AC(R = 28)   0471994       2996585        1086907         0366555
AC(R = RL)   053376        1179542        6065177         0209015
EE(S = 10)   0117226       6065145        117995          0148973
EE(S = 15)   020474        7173737        2119991         0180365
EE par       0013649       0478249        0141333         0012024

             Yahoo A(p = 25)  Yahoo G(p = 1)  Yahoo G(p = 25)  TaFeng(p = 1)
BL           0092954          0011915         0044164          0026728
OSR          0027887          0013241         0047206          0040919
SMOTE        1062686          0056153         0883698          0219553
ADASYN       2050993          0079073         1733367          0306618
RUS          0048471          0003234         0033423          0002916
CL Knn       084391           0025404         0502515          0092167
Far Knn      0664124          0026576         0500206          0080159
CBU          1569442          1287221         1355035          2467279
AB           0445546          0078777         0169977          0114619
AC(R = 28)   1034044          0321723         0515953          0926178
AC(R = RL)   0706215          0226741         0112949          0610233
EE(S = 10)   1026577          0100331         1527146          0058052
EE(S = 15)   1607596          0077483         2472582          010538
EE par       0107173          0005166         0164839          0007025

             TaFeng(p = 25)  Book(p = 1)  Book(p = 25)  LST(p = 1)
BL           0032033         0080035      0318093       0652045
OSR          0032414         0132927      0092757       087152
SMOTE        5089283         3409418      1143444       4987705
ADASYN       8148419         3689661      1225441       6840083
RUS          0020457         0022713      0031972       0432839
CL Knn       1713731         0400873      3711648       2508374
Far Knn      1539437         0379086      3988552       2511037
CBU          2642686         4198663      4631987       []
AB           0713265         061719       1238585       2466151
AC(R = 28)   1234647         1666131      2330635       1451671
AC(R = RL)   0279047         0860346      0197053       123763
EE(S = 10)   2484502         2145747      7177484       0524066
EE(S = 15)   3363971         2480066      1121945       0784111
EE par       0224265         0165338      0747963       0052274


Table 9 Continued. Additionally, an extra column indicating average ranks is included (lower ranks are preferred)

             Adver(p = [])  Adver(p = 1)  CRF(p = [])  Bank(p = [])
BL           0010953        0002796       0725911      7089334
OSR          0012178        0006166       3685813      1797481
SMOTE        0123112        0017764       5633862      []
ADASYN       0183767        0021728       5768669      []
RUS          0012115        000204        0147392      5247441
CL Knn       0061324        0005568       1106755      7373282
Far Knn      0079078        0007069       1110379      9759619
CBU          3378235        3236754       []           []
AB           0069199        0103518       1153196      8308618
AC(R = 28)   0193092        0068905       2047434      7170548
AC(R = RL)   0107652        0037963       1387174      1063466
EE(S = 10)   0138485        0085686       0198656      2495117
EE(S = 15)   0185136        0139121       0285345      3640107
EE par       0012342        0009275       0019023      2426738

             Average Rank [pos]
BL           2.94 [2]
OSR          4.19 [4]
SMOTE        9.59 [11]
ADASYN       10.91 [13]
RUS          1.38 [1]
CL Knn       6.5 [5]
Far Knn      6.56 [6]
CBU          14 [14]
AB           8.06 [7]
AC(R = 28)   10.81 [12]
AC(R = RL)   9.25 [9]
EE(S = 10)   8.25 [8]
EE(S = 15)   9.56 [10]
EE par       3 [3]

Timings are a substantial concern with respect to these large datasets. As can be observed, the EE-technique thrives for these big and highly imbalanced data, even in its non-parallel form.


[Figure 5 appears here: scatter plot of average rank Time (y-axis) versus average rank AUC (x-axis); legend: BL, OSR, SMOTE, ADASYN, RUS, Cl Knn, Far Knn, CBU, AB, AC(R = 28), AC(R = RL), EE(S = 10), EE(S = 15), EE par]

Fig 5 Average rank AUC versus average rank Time (see Table 9) across all datasets from Table 2 (excluding Flickr and Kdd). Regarding the AUC rank, an extra row EE par is added, having the same AUC-value as EE(S = 15). Ranks for each dataset are subsequently obtained via the procedure outlined in Section 4.6.1. Points occurring in the upper-right region are preferred

5 The effect of the chosen base learner

In this short section, we investigate the influence of the chosen base learner. More specifically, we ask ourselves whether some base inducers can more easily cope with the imbalanced learning issue, and in doing so verify the conclusion that EE is a suitable technique to elevate upon the baseline performance. We rely on several popular classification methods for behaviour data: two regularization based techniques (SVM, LR) and two heuristic types of approaches (Naive Bayes (NB), Behavioural Similarity (BeSim)). The next paragraphs briefly introduce these methods. Note that source codes and dataset results are provided in our on-line repository.

LR constructs a linear model that maximizes the likelihood of the observed data by imposing a logistic model for the parametric probability model P(Y|X). Ng and Jordan (2002) provide a thorough introduction to the topic27 and note that in its plain form LR can suffer from overfitting, especially when the input data is very high dimensional. Therefore, we resort to regularized logistic regression. We opt for L2-regularized LR, since L1-regularization corresponds with a natural way of imposing feature selection (Ng 2004). The LIBLINEAR toolbox (Fan et al 2008) is used to obtain these L2-regularized LR models.

27 For an accessible introduction, see the chapter on "Generative and Discriminative Classifiers: Naive Bayes and Logistic Regression" provided at the following link: http://www.cs.cmu.edu/~tom/NewChapters.html
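To make the objective explicit: L2-regularised LR, as solved by LIBLINEAR, minimises ½‖w‖² + C Σᵢ log(1 + exp(−yᵢ wᵀxᵢ)) with yᵢ ∈ {−1, +1}. The dense gradient-descent step below is purely illustrative (our own sketch; LIBLINEAR itself uses far more efficient solvers such as trust-region Newton methods):

```python
import math

def lr_l2_step(X, y, w, C, lr=0.1):
    # One gradient-descent step on 0.5*||w||^2 + C * sum_i log(1 + exp(-y_i * w.x_i)).
    grad = list(w)  # gradient of the regularisation term 0.5*||w||^2 is w itself
    for xi, yi in zip(X, y):
        margin = yi * sum(wj * xj for wj, xj in zip(w, xi))
        coef = -yi * C / (1.0 + math.exp(margin))  # derivative of the logistic loss
        grad = [gj + coef * xj for gj, xj in zip(grad, xi)]
    return [wj - lr * gj for wj, gj in zip(w, grad)]
```

The regularisation term keeps ‖w‖ small, which is what counters the overfitting risk mentioned above; C trades this off against the data fit.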


NB (Ng and Jordan 2002) relies on the use of the Bayes rule, which expresses the posterior P(Y|X) as the product of the likelihood P(X|Y) and the prior P(Y), divided by the evidence P(X). The naivety of the approach originates from assuming conditional independence of the features. We impose a multivariate event model regarding the distribution P(X|Y). Junque de Fortuny et al (2014a) provide an efficient NB implementation for dealing with large and sparse binary behaviour data.
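Under the multivariate (Bernoulli) event model, a class is scored by multiplying, over all features, the probability of each feature being active or inactive given the class. The toy sketch below is our own illustration with Laplace smoothing, not the optimized implementation of Junque de Fortuny et al (2014a), which additionally exploits sparseness:

```python
import math
from collections import defaultdict

def train_bernoulli_nb(rows, labels, n_features, alpha=1.0):
    """Multivariate Bernoulli NB for sparse binary rows (lists of active
    feature ids). Conditional independence is the 'naive' assumption."""
    counts = {c: defaultdict(int) for c in set(labels)}  # per-class feature counts
    totals = {c: 0 for c in counts}                      # per-class instance counts
    for row, c in zip(rows, labels):
        totals[c] += 1
        for f in row:
            counts[c][f] += 1
    def log_posterior(row, c):
        lp = math.log(totals[c] / len(rows))  # log prior P(Y = c)
        active = set(row)
        for f in range(n_features):
            p = (counts[c][f] + alpha) / (totals[c] + 2 * alpha)  # P(x_f = 1 | c)
            lp += math.log(p if f in active else 1.0 - p)
        return lp  # unnormalised: the evidence P(X) is constant across classes
    return log_posterior
```

Note the loop over all n_features per prediction; it is exactly this dense pass that sparse-aware implementations avoid for behaviour data with millions of features.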

The BeSim approach conceptually corresponds to projecting the bigraph to a unigraph using the procedure outlined in Section 3.2. Next, the weighted-vote relational neighbour classifier (Macskassy and Provost 2007) is used to infer unknown labels/scores through a weighted probability estimation using the known labels of neighbouring nodes. A very fast and scalable implementation of this approach (the SW-transformation) is presented in detail in Stankova et al (2015).
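On the projected unigraph, the weighted-vote relational neighbour classifier reduces to a weighted average of the known neighbour labels. A compact sketch (hypothetical data layout of our own; the SW-transformation of Stankova et al (2015) obtains this kind of score directly from the bipartite structure, far more efficiently):

```python
def wvrn_scores(adjacency, labels):
    # adjacency: node -> {neighbour: edge weight}; labels: node -> 0/1 for
    # labelled nodes. Score = weighted mean of the labelled neighbours' labels.
    scores = {}
    for node, nbrs in adjacency.items():
        known = [(w, labels[n]) for n, w in nbrs.items() if n in labels]
        total = sum(w for w, _ in known)
        # Fall back to an uninformative 0.5 when no labelled neighbour exists.
        scores[node] = (sum(w * y for w, y in known) / total) if total else 0.5
    return scores
```

A node connected to a positively labelled neighbour with weight 1 and a negatively labelled one with weight 3 thus receives score 1/4, reflecting the weighted vote.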

In Table 10, we compare the performances of the baseline (BL) application of the methods with their EE-version (with S = 15, T = 20). It is quite striking to see the EE-version dominating the baseline version when looking at each method individually. This suggests that the imbalanced learning issue has an impact for each of the aforementioned techniques. Focussing on the regularization based approaches, we observe the baseline LR to be less influenced by the imbalance than the SVM method. Their EE-versions (EE SVM and EE LR) can boost the baseline performance in almost all cases and have the highest average ranks amongst all the proposed methods. As already emphasized, the regularization based approaches offer an added element of flexibility in the sense that the strength of the underlying base learner can be controlled.

The heuristic approaches NB and BeSim do not offer this flexibility. Furthermore, it is not directly possible to integrate instance weights during the training phase of the base learner. During the boosting process, we therefore sample from the distribution Dt to generate (unweighted) examples and train a BeSim/NB learner next. The results in Table 10 indicate BL NB to be the worst performing algorithm amongst all methods tried. Due to the conditional independence assumption, one could regard this as a weak learner, which makes it suitable to be used in a boosting process. Indeed, the EE NB approach is able to significantly improve upon its performance. We found this to be the slowest EE-version28 (e.g. taking more than 24 hours to process a single fold of the Bank dataset, versus several minutes for the other versions). The average rank estimate of EE NB is rather pessimistic, since we assigned the lowest performance across all methods in case of missing results. BeSim can be considered as a stronger learner compared to NB; its possible boost in performance is typically lower than that of NB. Having said this, we found EE BeSim to improve upon BL BeSim for all datasets (except Adver(p = 1)).

28 This statement only relates to the EE-version. In the baseline case, the timings of the NB-version are roughly of the same order of magnitude as SVM and LR (though this heavily depends on the chosen regularization parameter C). With EE, each subset has a size limited to twice the minority class training size (a relatively low number compared to the size of the majority class). This works in favour of SVM/LR (e.g. SVM has a computational complexity of O(m²) with m the number of training instances). Also note that the NB-version is indeed optimized for sparse data. Internally, the NB-implementation writes a fairly large model file to a specified directory (sizes of 1 GB can be encountered there). This writing operation slows down the training stage (especially in case of EE, where S × T = 15 × 20 = 300 model files are constructed).


Table 10 Average tenfold AUC test set performance of the BL compared to its EE-version (S = 15, T = 20) for SVM, LR, BeSim and NB. In case the EE NB implementation crashed, we put a × in the table. If it is too slow (> 24 hours per fold), we place a []. Additionally, average ranks are assigned to each method. In case EE NB has missing results for a particular dataset, we ensure it has the lowest rank across all other algorithms

           Mov G(p = 1)   Mov G(p = 25)  Mov Th(p = [])  Yahoo A(p = 1)
BL SVM     71.6 (2.62)    81.41 (1.32)   79.77 (5.33)    56.49 (3.37)
EE SVM     76.12 (2.88)   85.13 (1.86)   86.43 (5.86)    59.74 (2.96)
BL LR      71.02 (2.09)   84.39 (1.84)   83.14 (4.17)    57.84 (2.39)
EE LR      76.69 (2.92)   85.03 (1.98)   86.3 (5.37)     59.79 (2.62)
BL BeSim   76.1 (3.58)    81.3 (2.92)    82.81 (6.6)     56.27 (2.73)
EE BeSim   76.31 (3.71)   81.37 (2.9)    85.02 (6.28)    57.7 (1.71)
BL NB      70.26 (5.84)   77.01 (2.54)   70.48 (10.14)   52.56 (2.09)
EE NB      75.93 (2.83)   85.56 (2.01)   86.91 (4.15)    57.55 (2.73)

           Yahoo A(p = 25)  Yahoo G(p = 1)  Yahoo G(p = 25)  TaFeng(p = 1)
BL SVM     61.61 (2.48)     66.84 (3.66)    78.82 (1.39)     55.75 (1.6)
EE SVM     66.38 (3.16)     73.48 (2.32)    80.55 (1.55)     61.13 (1.83)
BL LR      66.27 (2.96)     69.82 (1.93)    80.45 (1.59)     58.91 (2.31)
EE LR      66.22 (3.28)     73.08 (2.14)    80.53 (1.56)     61.43 (2.32)
BL BeSim   64.54 (2.02)     68.89 (2.49)    79.55 (1.96)     57.89 (1.18)
EE BeSim   65.25 (2.23)     71.18 (2.91)    80.04 (1.85)     59.36 (1.47)
BL NB      65 (1.65)        63.33 (2.56)    78.89 (1.64)     54.61 (1.2)
EE NB      66.6 (2.79)      70.99 (2.88)    81.01 (1.3)      59.01 (1.84)

           TaFeng(p = 25)  Book(p = 1)   Book(p = 25)  LST(p = 1)
BL SVM     66.94 (1.34)    52.6 (1.29)   60.08 (0.71)  99.99 (0.01)
EE SVM     70.4 (1.3)      55.34 (1.28)  65.4 (0.51)   99.98 (0.01)
BL LR      69.24 (1.3)     55.34 (1.27)  63.84 (0.75)  99.99 (0.01)
EE LR      70.28 (1.28)    55.49 (1.49)  65.41 (0.63)  99.97 (0.02)
BL BeSim   67.49 (1.23)    55.19 (1.27)  63.7 (0.63)   99.99 (0.01)
EE BeSim   68 (1.21)       55.21 (1.15)  64.38 (0.42)  99.99 (0)
BL NB      65.21 (1.64)    52.93 (0.9)   59.75 (0.47)  98.69 (0.3)
EE NB      70.72 (1.15)    ×             63.46 (0.61)  99.92 (0.04)

           Adver(p = [])  Adver(p = 1)  CRF(p = [])    Bank(p = [])
BL SVM     96.37 (1.94)   91.18 (2.97)  64.36 (18.97)  66.82 (0.88)
EE SVM     97.63 (1.35)   93.3 (2.14)   86.35 (9.99)   71.54 (0.76)
BL LR      97.19 (1.44)   88.51 (1.93)  81.87 (19.63)  71.43 (0.72)
EE LR      97.57 (0.96)   93.02 (2.06)  86.84 (9.62)   71.77 (0.62)
BL BeSim   97.26 (1.12)   95.38 (1.35)  86.91 (9.36)   67.85 (0.67)
EE BeSim   97.38 (1.04)   93.83 (1.35)  87.02 (10.43)  70.41 (0.55)
BL NB      93.75 (1.9)    93.37 (1.9)   87.24 (9.38)   67.83 (0.63)
EE NB      94.04 (1.75)   ×             ×              []

           Flickr(p = 01)  Kdd(p = 05)   Average Rank
BL SVM     74.92 (0.17)    74.53 (0.05)  6.44 [7]
EE SVM     79.86 (0.13)    80.98 (0.05)  2.39 [1]
BL LR      79.03 (0.11)    81.29 (0.04)  4.28 [4]
EE LR      79.85 (0.13)    80.75 (0.05)  2.61 [2]
BL BeSim   74.62 (0.13)    74.95 (0)     5.11 [6]
EE BeSim   76.4 (0.13)     77.55 (0.03)  3.61 [3]
BL NB      81.36 (0.1)     74.29 (0.05)  6.5 [8]
EE NB      []              []            5.06 [5]


6 Conclusions and future research directions

We investigated the effects of sampling and cost-sensitive learning techniques on the problem of learning from imbalanced behaviour data. This type of big data has fundamentally different properties and internal structure than traditional data, and is characterized by sparseness and very large dimensions. This is a new research topic which enables benefits across a wide variety of application domains, such as fraud detection, targeted advertising, predictive policing, churn prediction, default prediction etc. To enable future developments in this field, we provide an on-line repository containing implementations, datasets and results, available at the following link: http://www.applieddatamining.com/cms/?q=software

Oversampling techniques need to be adapted in order to cope with this kind of data. We proposed new versions of the traditional SMOTE and ADASYN techniques, which differ in the way synthetic samples are generated and similarity computations are performed. The plain random oversampling (OSR) and the synthetic approaches are competitive methods in terms of AUC and do not seem to suffer from overfitting, as traditional studies report. Though the synthetic approaches have comparable AUC-performances in comparison to OSR, they are useless in practice due to their computationally expensive procedures. The OSR method is very fast for small and medium sized datasets, yet becomes inappropriate for larger datasets because the training times of the underlying base learner increase drastically.

With respect to undersampling methodologies, we employed a random undersampling technique (RUS) and Knn-based informed approaches. The latter type selects majority class instances that are closest (Cl Knn) or farthest (Far Knn) from the minority class examples and allows one to investigate the effect of noise/outliers on SVM performance. We observed that these odd instances have a severe performance degrading effect. RUS and Far Knn approaches can significantly improve performance with respect to the baseline, thanks to their abilities to reduce noise/outlier levels and to put more emphasis on the minority class. Having said this, they usually perform worse than any of the oversampling approaches. This is in sheer contrast to the results obtained from traditional studies and is heavily related to the intrinsic nature of behaviour data. Timing-wise, RUS is the fastest method. The informed approaches are slower than RUS due to the required nearest neighbour computations. Additionally, a cluster-based undersampling (CBU) technique was proposed to alleviate the within-class imbalance. For low undersampling rates, the performance of CBU is superior to plain RUS, whereas for high undersampling rates the opposite is true. Overall, no substantial gain in performance is obtained in comparison with a simple RUS approach. Furthermore, the method is the slowest among all techniques considered and relies on computationally expensive clustering heuristics.

We studied the effects of boosting, where we made use of a very specific linear SVM and logistic regression combination that uses confidence rated predictions, instead of a more popular plain "discrete" boosting algorithm (with weak hypotheses having values in {-1, 1}). Our experiments clearly indicated that the regularization constant C in the SVM formulation acts as a "weakness" indicator. Indeed, higher C-values cause stronger learners and shouldn't be used in the boosting process, which matches the analysis of studies dealing with traditional data. Strong learners can

42 Jellis Vanhoeyveld David Martens

already quite accurately distinguish majority class from minority class behaviours. The boosting process will focus on the hard to learn instances, among which are noise/outliers, and this can result in overfitting behaviour. We also observed that boosting can usually outperform the plain baseline, indicating that a combination of weak learners can achieve higher AUC-performance than a single strong learner.
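The confidence-rated boosting scheme referred to above (Schapire and Singer 1999) can be sketched in a few lines. A domain-partitioning decision stump stands in for the paper's SVM/LR weak learner; the stump, the smoothing constant and all names are our own illustration. Each weak hypothesis h outputs a real confidence, the instance weights update as D_i ∝ D_i · exp(-y_i h(x_i)), and the ensemble is the sum of the weak hypotheses.

```python
import math

def stump_learner(X, y, D):
    """Domain-partitioning weak learner in the style of Schapire and Singer
    (1999): pick the threshold minimizing the normalizer Z, output the
    smoothed per-branch confidence 0.5 * ln(W_b^+ / W_b^-)."""
    eps = 1e-4                                   # smoothing constant (assumption)
    best = None
    for t in sorted(set(X)):
        W = {(s, c): eps for s in (0, 1) for c in (-1, 1)}
        for d, x, yi in zip(D, X, y):
            W[(int(x >= t), yi)] += d
        Z = 2 * sum(math.sqrt(W[(s, 1)] * W[(s, -1)]) for s in (0, 1))
        if best is None or Z < best[0]:
            conf = {s: 0.5 * math.log(W[(s, 1)] / W[(s, -1)]) for s in (0, 1)}
            best = (Z, t, conf)
    _, t, conf = best
    return lambda x, t=t, conf=conf: conf[int(x >= t)]

def boost_confidence_rated(X, y, rounds, learn):
    """Real-valued AdaBoost: weights update as D_i *= exp(-y_i * h(x_i));
    the ensemble is the sum of the real-valued weak hypotheses."""
    n = len(X)
    D = [1.0 / n] * n
    hs = []
    for _ in range(rounds):
        h = learn(X, y, D)                       # weak learner fit on weights D
        hs.append(h)
        D = [d * math.exp(-yi * h(xi)) for d, xi, yi in zip(D, X, y)]
        Z = sum(D)
        D = [d / Z for d in D]
    return lambda x: sum(h(x) for h in hs)       # sign(.) gives the class

F = boost_confidence_rated([1, 2, 3, 4], [-1, -1, 1, 1], rounds=3, learn=stump_learner)
print(F(1) < 0, F(4) > 0)                        # → True True
```

The smoothing constant plays the role of a weakness control here, analogous to a small C in the SVM weak learner: without it, a perfectly separating stump would output unbounded confidences and dominate the ensemble.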

EasyEnsemble (EE) integrates random undersampling with a boosting process. By sampling several balanced subsets from the training data and combining these, we can emphasize the minority class, achieve noise/outlier reduction effects and simultaneously explore the majority class space efficiently. Our version differs from the original formulation because we employ a different type of underlying weak learner (SVM and LR combination) that is fed to a confidence rated boosting algorithm (instead of the discrete version). We observed superior AUC-performance with respect to all of the aforementioned techniques. Furthermore, the method is very fast, since each subset can be trained in parallel and has a size that is twice as large as the minority class training set. We studied the effect of the number of subsets and noted that a limited amount of these (O(10)) is already sufficient to obtain satisfactory performances; increasing this number further brings only minor benefits.
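The EasyEnsemble recipe just described can be sketched as follows: draw S balanced subsets (all minority rows plus an equally sized random sample of the majority, so each subset is twice the minority size), train one model per subset (trivially parallelizable), and average the scores. A toy nearest-mean scorer stands in for the paper's boosted SVM/LR learner; all names are illustrative assumptions.

```python
import random

def easy_ensemble(X, y, train, S=10, seed=0):
    """Draw S balanced subsets (all minority rows plus an equally sized
    random sample of the majority), train an independent model on each,
    and average their scores. Each subset has size 2 * |minority|."""
    rng = random.Random(seed)
    min_i = [i for i, l in enumerate(y) if l == 1]
    maj_i = [i for i, l in enumerate(y) if l == 0]
    models = []
    for _ in range(S):
        sub = min_i + rng.sample(maj_i, len(min_i))
        models.append(train([X[i] for i in sub], [y[i] for i in sub]))
    return lambda x: sum(m(x) for m in models) / len(models)

def nearest_mean(Xs, ys):
    """Toy stand-in for the boosted weak learner: score by distance to class means."""
    m1 = sum(x for x, l in zip(Xs, ys) if l == 1) / sum(ys)
    m0 = sum(x for x, l in zip(Xs, ys) if l == 0) / (len(ys) - sum(ys))
    return lambda x: abs(x - m0) - abs(x - m1)   # positive → minority side

X = [0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 4.8, 5.2]    # 6 majority near 0, 2 minority near 5
y = [0, 0, 0, 0, 0, 0, 1, 1]
score = easy_ensemble(X, y, nearest_mean, S=5)
print(score(5.0) > 0, score(0.2) < 0)            # → True True
```

Because the S training calls are independent, the loop body is exactly the part that a parallel implementation would distribute across workers.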

Additionally, a statistical comparison of each of the proposed methods was conducted across a wide variety of datasets. Holm's significance test indicates that each of the proposed techniques (oversampling, undersampling except Cl Knn, cost-sensitive learning and boosting variants) can significantly outperform the plain baseline performance (each of the null-hypotheses of equal performance gets rejected with a confidence level of at least 91%). Comparison of the techniques with EE showed that the latter can significantly outperform all of them (each of the null-hypotheses of equal performance gets rejected with a confidence level of at least 75%). A timing-wise comparison showed EE to be the preferred method (after RUS) for large and highly imbalanced datasets. A parallel implementation would make EE the fastest method, even for medium-sized datasets that show a high level of imbalance.
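Holm's step-down procedure used for these comparisons (Holm 1979) can be sketched in a few lines: sort the k p-values ascending, compare the i-th smallest against α/(k - i + 1), and stop rejecting at the first failure. The implementation below is our own illustrative sketch.

```python
def holm(pvals, alpha=0.05):
    """Holm's step-down multiple-test procedure: sort p-values ascending,
    compare the i-th smallest against alpha / (k - i + 1), and stop at
    the first non-rejection. Returns a reject/accept flag per hypothesis."""
    k = len(pvals)
    order = sorted(range(k), key=lambda i: pvals[i])
    reject = [False] * k
    for rank, i in enumerate(order):
        if pvals[i] <= alpha / (k - rank):       # k - rank = k - i + 1 with 1-based i
            reject[i] = True
        else:
            break                                # all larger p-values are accepted too
    return reject

print(holm([0.001, 0.04, 0.03], alpha=0.05))     # → [True, False, False]
```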

The influence of the chosen base learner was investigated by comparing the baseline to its EE-version for the SVM, LR, NB and BeSim inducers. Each of the EE-versions is able to improve upon the baseline performance for nearly all datasets, where base learners of the regularization type are superior because of their abilities to control the strength of the learner.

Though we conducted an extensive benchmark study, there are still plenty of future research directions. First and most important, there is an urgent need to develop a more general theoretical framework to completely understand the effect of imbalance and why certain methods outperform others. Secondly, other types of mechanisms could be investigated in this setting: advanced sampling (Guo et al 2008) set-ups perform sampling based on preliminary classifications; one-class learning is a recognition based methodology where a discriminative boundary is learned around the training examples of a single class alone, and Raskutti and Kowalczyk (2004) demonstrated the superiority of the recognition based one-class SVMs over the discriminative two-class SVMs; other types of ensemble learning strategies (e.g. bagging and stacking) could be investigated; Batista et al (2004) note that combining over- and undersampling can be beneficial when data are highly imbalanced or show very few minority class instances; the ideas behind Extended Nearest Neighbour (Tang and He 2015) could guide us in the sampling process. As opposed to Knn, "it considers not only who are the nearest neighbours of a certain sample, but also who consider the particular sample as their nearest neighbour (Tang and He 2015)". The method assigns a class membership to an unknown sample that maximizes the intra-class coherence. Finally, more advanced techniques do exist that can provide a K (the number of nearest neighbours) faster or with a (slightly) better performance.

Another issue relates to the lack of comprehensibility of the EasyEnsemble technique in its present form. Because we currently integrate Logistic Regression in the boosting process, the final ensemble classifier is non-linear and therefore incomprehensible. It might be interesting to investigate the numerical boosting technique proposed in Schapire and Singer (1999), where each weak learner is allowed to output real values (instead of values in [-1, 1]). In that case, we would be able to use a plain linear SVM as a weak learner without the Logistic Regression component. The final ensemble would be a linear model, which is surely easier to comprehend. Also, the instance-based approach of Martens and Provost (2014) could be used to explain individual predictions of the non-linear EE-technique.
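The comprehensibility argument rests on a simple fact: a sum of linear hypotheses is itself linear. The toy sketch below (our own naming; models represented as (weights, intercept) pairs) collapses an ensemble of linear weak learners into the single equivalent linear model.

```python
def merge_linear(models):
    """Collapse a list of linear hypotheses f_j(x) = w_j . x + b_j into the
    single linear model (sum_j w_j, sum_j b_j) implementing their sum."""
    dim = len(models[0][0])
    w = [sum(m[0][j] for m in models) for j in range(dim)]
    b = sum(m[1] for m in models)
    return w, b

ensemble = [([1.0, 0.0], 0.5), ([0.5, 1.0], -0.5)]
print(merge_linear(ensemble))                    # → ([1.5, 1.0], 0.0)
```

A single weight vector like this can be inspected directly, which is what makes a linear ensemble easier to comprehend than the SVM-plus-LR combination.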

A Oversampling experiments

Table 11 Oversampling experiments. Results show the average tenfold AUC test set performance with respect to increasing β-values (β = [β1, β2, β3, β4] = [0, 1/3, 2/3, 1]). Additionally, standard deviations are shown between brackets. Optimal performances for each method are highlighted in boldface. See Section 4.2 for parameter settings.

MOV G(p = 1)
         β1             β2             β3             β4
OSR      71.6 (2.62)    74.37 (2.04)   73.6 (1.84)    74.73 (2.45)
SMOTE    71.6 (2.62)    75.08 (2.18)   76.02 (2.14)   76.48 (2.3)
ADASYN   71.6 (2.62)    75.16 (1.92)   75.93 (2.08)   76.47 (2.29)

MOV G(p = 25)
         β1             β2             β3             β4
OSR      81.41 (1.32)   83.49 (1.81)   83.84 (1.96)   83.91 (2.04)
SMOTE    81.41 (1.32)   83.32 (1.97)   83.59 (2.04)   83.76 (2.11)
ADASYN   81.41 (1.32)   83.61 (1.82)   84.02 (1.97)   83.69 (1.96)

Mov Th(p = [])
         β1             β2             β3             β4
OSR      79.77 (5.33)   85.3 (4.66)    83.16 (4.5)    84.59 (5.69)
SMOTE    79.77 (5.33)   84.18 (6.51)   85.58 (5.97)   84.33 (5.75)
ADASYN   79.77 (5.33)   84.11 (6.77)   85.86 (6.06)   85.36 (5.13)

Yahoo A(p = 1)
         β1             β2             β3             β4
OSR      55.92 (2.97)   58.66 (3.27)   59.99 (2.28)   59.74 (1.78)
SMOTE    55.92 (2.97)   59.76 (2.62)   59.74 (2.67)   59.43 (2.4)
ADASYN   55.92 (2.97)   59.54 (2.53)   59.55 (2.94)   59.56 (2.22)

Yahoo A(p = 25)
         β1             β2             β3             β4
OSR      61.68 (2.42)   64.19 (3.17)   65.08 (3.26)   64.67 (2.1)
SMOTE    61.68 (2.42)   65.46 (3.63)   65.33 (3.23)   64.52 (2.98)
ADASYN   61.68 (2.42)   65.04 (3.74)   65.41 (3.47)   64.4 (2.21)

Yahoo G(p = 1)
         β1             β2             β3             β4
OSR      66.84 (3.66)   72.18 (2.36)   73.11 (2.7)    72.49 (3.41)
SMOTE    66.84 (3.66)   72.65 (2.85)   73.27 (3.36)   73.37 (3.56)
ADASYN   66.84 (3.66)   72.87 (2.83)   73.18 (3.2)    73.39 (3.59)

Yahoo G(p = 25)
         β1             β2             β3             β4
OSR      78.82 (1.39)   78.78 (2.02)   78.74 (1.86)   78.35 (1.47)
SMOTE    78.82 (1.39)   79.23 (1.57)   79.1 (1.2)     79.03 (1.89)
ADASYN   78.82 (1.39)   79.12 (1.43)   79.23 (1.37)   79.51 (2.01)

TaFeng(p = 1)
         β1             β2             β3             β4
OSR      55.75 (1.6)    59.23 (1.96)   60 (1.68)      61.04 (2.36)
SMOTE    55.75 (1.6)    60.26 (1.95)   61.49 (1.8)    61.13 (1.52)
ADASYN   55.75 (1.6)    60.26 (1.9)    61.44 (1.85)   61.16 (1.5)

TaFeng(p = 25)
         β1             β2             β3             β4
OSR      66.94 (1.34)   68.84 (1.21)   66.99 (1.42)   67.7 (1.41)
SMOTE    66.94 (1.34)   68.47 (1.5)    67.07 (1.15)   66.65 (0.81)
ADASYN   66.94 (1.34)   68.62 (1.38)   67.85 (1.6)    66.91 (1.39)

Book(p = 1)
         β1             β2             β3             β4
OSR      52.6 (1.29)    53.61 (0.94)   55.41 (1.75)   55.87 (1.44)
SMOTE    52.6 (1.29)    54.77 (0.99)   54.91 (0.8)    54.36 (0.98)
ADASYN   52.6 (1.29)    54.86 (1.13)   55.06 (0.73)   54.54 (0.92)

Book(p = 25)
         β1             β2             β3             β4
OSR      60.08 (0.71)   61.05 (0.94)   61.82 (1.12)   64.62 (0.57)
SMOTE    60.08 (0.71)   62.6 (0.73)    60.95 (0.68)   63 (0.8)
ADASYN   60.08 (0.71)   62.33 (0.73)   60.77 (0.85)   63.04 (0.58)

LST(p = 1)
         β1             β2             β3             β4
OSR      99.99 (0.01)   99.99 (0.01)   99.99 (0.01)   99.99 (0.01)
SMOTE    99.99 (0.01)   99.99 (0.01)   99.99 (0.01)   99.99 (0.01)
ADASYN   99.99 (0.01)   99.99 (0.01)   99.99 (0.01)   99.99 (0.01)

Adver(p = [])
         β1             β2             β3             β4
OSR      96.61 (1.82)   97.31 (1.65)   97.07 (1.84)   97.07 (1.79)
SMOTE    96.61 (1.82)   96.91 (1.66)   97.19 (1.65)   97.07 (1.91)
ADASYN   96.61 (1.82)   97.1 (1.7)     97.08 (1.87)   97.07 (1.88)

Adver(p = 1)
         β1             β2             β3             β4
OSR      90.93 (3.02)   91.27 (3.03)   92.66 (2.82)   93.29 (1.97)
SMOTE    90.93 (3.02)   92.51 (2.03)   92.96 (2.14)   93.53 (1.81)
ADASYN   90.93 (3.02)   92.22 (2.33)   92.7 (2.36)    93.88 (1.73)

CRF(p = [])
         β1             β2             β3             β4
OSR      64.06 (16.43)  80.82 (12.94)  81.28 (12.27)  81.91 (11.28)
SMOTE    64.06 (16.43)  78.64 (16.86)  82.52 (13.74)  79.32 (16.26)
ADASYN   64.06 (16.43)  78.95 (16.72)  81.19 (16.32)  79.31 (16.19)

Bank(p = [])
         β1             β2             β3             β4
OSR      66.82 (0.88)   70.1 (0.74)    71.39 (0.8)    71.47 (0.8)
SMOTE    []             []             []             []
ADASYN   []             []             []             []


B Undersampling experiments

Table 12 Undersampling experiments. Results show the average tenfold AUC test set performance with respect to increasing βu-values (βu = [βu1, βu2, βu3, βu4, βu5] = [0, 1/4, 1/2, 3/4, 1]). Optimal performances (boldface) are highlighted for each imbalance ratio. Additionally, standard deviations are shown between brackets. Cl K represents the "Closest Knn" technique and Cl T the "Closest tot sim" technique (similar for Far K and Far T, see Section 3.2). Parameter settings are given in Section 4.2.

Mov G(p = 1)
       βu1           βu2           βu3           βu4           βu5
RUS    71.6 (2.6)    71.83 (2.6)   72.54 (2.5)   72.39 (3.1)   70.61 (3.5)
Cl K   71.6 (2.6)    71.4 (2)      70.96 (1.9)   70.43 (2.4)   69.05 (4.1)
Cl T   71.6 (2.6)    70.28 (2.5)   66.74 (2)     66.8 (2.1)    68.18 (3.6)
Far K  71.6 (2.6)    72.36 (2.7)   71.26 (3.4)   66.57 (5.2)   53.5 (3.5)
Far T  71.6 (2.6)    72.22 (2.8)   71.63 (3.6)   64.28 (5.3)   50.88 (4.4)
CBU    72.55 (2.6)   73.28 (2.6)   73.12 (2.6)   73.84 (2.5)   73 (3.1)

Mov G(p = 25)
       βu1           βu2           βu3           βu4           βu5
RUS    81.41 (1.3)   81.36 (1.3)   81.78 (1.7)   82.05 (1.7)   81.6 (2.1)
Cl K   81.41 (1.3)   80.86 (1.2)   80.95 (1.6)   79.73 (2.3)   77.95 (2.3)
Cl T   81.41 (1.3)   79.9 (1.2)    78.21 (1.4)   77.87 (1.5)   77.76 (2.3)
Far K  81.41 (1.3)   80.9 (1.5)    78.17 (1.8)   74.25 (2.4)   69.79 (3.2)
Far T  81.41 (1.3)   80.86 (1.5)   77.2 (2.4)    71.16 (2.7)   62.4 (2.8)
CBU    81.53 (1.4)   81.64 (1.3)   81.29 (1.6)   81.28 (2.1)   80.34 (2.7)

Mov Th(p = [])
       βu1           βu2           βu3           βu4           βu5
RUS    79.77 (5.3)   80.32 (5.8)   81.57 (5.5)   81.86 (6.6)   81.26 (6.2)
Cl K   79.77 (5.3)   79.25 (4.5)   78.07 (5)     76.25 (6.5)   62.46 (8.5)
Cl T   79.77 (5.3)   78.4 (4.4)    72.41 (3.5)   64.66 (4.5)   60.37 (7.3)
Far K  79.77 (5.3)   84.54 (5)     83.64 (6.4)   80.02 (7.3)   56.82 (10.3)
Far T  79.77 (5.3)   85.03 (5.7)   82.68 (6.8)   75.61 (9.2)   56.77 (10.9)
CBU    80.11 (5.8)   81.17 (6)     81.08 (6.5)   84.17 (5.1)   80.96 (6.9)

Yahoo A(p = 1)
       βu1           βu2           βu3           βu4           βu5
RUS    55.92 (3)     55.57 (3.4)   56.44 (3)     55.83 (3.4)   56.37 (3.3)
Cl K   55.92 (3)     55.67 (2.4)   53.12 (2)     50.57 (1.8)   53.79 (3.5)
Cl T   55.92 (3)     55.69 (2.1)   53.35 (2.2)   50.31 (2.2)   52.35 (3.3)
Far K  55.92 (3)     57.35 (2.2)   56.92 (1.1)   56.95 (2.3)   51.18 (2)
Far T  55.92 (3)     56.93 (2.4)   54.74 (1.9)   57.01 (1.8)   51.18 (2)
CBU    58.21 (2.6)   58.45 (3.3)   58.31 (3.5)   58.39 (3.5)   56.09 (2.6)

Yahoo A(p = 25)
       βu1           βu2           βu3           βu4           βu5
RUS    61.68 (2.4)   62.9 (2.9)    63.62 (3.6)   63.75 (3.1)   63.19 (1.9)
Cl K   61.68 (2.4)   61.14 (2.1)   57.62 (1.6)   54.02 (1.8)   51.48 (1.4)
Cl T   61.68 (2.4)   60.89 (2.8)   58.11 (1.4)   54.4 (2.1)    51.76 (1.4)
Far K  61.68 (2.4)   63.96 (3)     62.62 (2.2)   59.61 (1.5)   56.25 (1.6)
Far T  61.68 (2.4)   63.71 (2.4)   59.72 (1.6)   57.27 (1.1)   54.47 (1.1)
CBU    62.46 (2.6)   61.85 (1.4)   61.78 (2.2)   59.94 (3)     60.1 (4)

Yahoo G(p = 1)
       βu1           βu2           βu3           βu4           βu5
RUS    66.84 (3.7)   67.85 (3.2)   68.36 (3.2)   68.23 (4)     69.9 (4.2)
Cl K   66.84 (3.7)   66.71 (2.8)   64.3 (3.6)    61.98 (3.9)   61.15 (1.9)
Cl T   66.84 (3.7)   65.79 (2.7)   63.55 (3.3)   59.21 (3.5)   61.08 (2.4)
Far K  66.84 (3.7)   66.76 (4.1)   63.84 (3.4)   65.16 (2)     48.5 (2.9)
Far T  66.84 (3.7)   66.95 (4.1)   63.48 (2.9)   65.16 (2)     48.48 (2.9)
CBU    69.68 (4.1)   70.59 (3.2)   70.64 (3.7)   70.2 (2.9)    63.35 (3.6)

Yahoo G(p = 25)
       βu1           βu2           βu3           βu4           βu5
RUS    78.82 (1.4)   78.91 (1.6)   78.97 (1.6)   78.61 (1.6)   77.82 (2.1)
Cl K   78.82 (1.4)   77.26 (1.5)   72.52 (1.5)   67.86 (2)     65.07 (2.7)
Cl T   78.82 (1.4)   76.83 (1)     71.99 (1.8)   67.15 (2.3)   61.1 (2.7)
Far K  78.82 (1.4)   78.26 (2.2)   74.69 (2.7)   67.22 (2.1)   60.72 (2.3)
Far T  78.82 (1.4)   77.68 (2.6)   72.44 (3)     64.94 (2.4)   59.6 (2)
CBU    75.25 (3.2)   75.22 (2.4)   74.69 (2.3)   73.07 (2.4)   70.69 (2.4)

TaFeng(p = 1)
       βu1           βu2           βu3           βu4           βu5
RUS    55.75 (1.6)   56.1 (1.6)    56.26 (1.7)   57.23 (1.7)   59.25 (2.2)
Cl K   55.75 (1.6)   55.68 (1.6)   55.58 (1.5)   55.08 (1.1)   51.05 (1.5)
Cl T   55.75 (1.6)   55.67 (1.6)   54.47 (1.6)   47.53 (1.6)   49.3 (1.1)
Far K  55.75 (1.6)   58.99 (1.2)   59.47 (1.1)   60.04 (1.2)   56.31 (1)
Far T  55.75 (1.6)   58.92 (1.3)   59.25 (1.3)   58.58 (1.1)   56.31 (1)
CBU    57.8 (1)      58.47 (1.1)   58.15 (0.9)   58.87 (1.4)   57.65 (1.6)

TaFeng(p = 25)
       βu1           βu2           βu3           βu4           βu5
RUS    66.94 (1.3)   67.44 (1.3)   68.1 (1.4)    68.27 (1.4)   66.13 (1.2)
Cl K   66.94 (1.3)   66.13 (1.4)   63.39 (1.2)   59.83 (1.3)   56.94 (0.7)
Cl T   66.94 (1.3)   66.38 (1.5)   62.89 (1.6)   57.46 (1.3)   54.56 (1.3)
Far K  66.94 (1.3)   68.06 (1.4)   66.43 (1.6)   64.46 (1.5)   63.35 (1.3)
Far T  66.94 (1.3)   64.31 (1.1)   62.69 (1)     61.27 (1.1)   59.03 (1)
CBU    64.81 (1.2)   64.15 (1.1)   64.13 (1.2)   63.88 (0.8)   63.46 (0.8)

Book(p = 1)
       βu1           βu2           βu3           βu4           βu5
RUS    52.6 (1.3)    52.79 (0.9)   53.46 (0.8)   53.89 (0.9)   54.05 (0.9)
Cl K   52.6 (1.3)    52.56 (1.2)   52.52 (1.3)   52.39 (1.1)   53.09 (1.1)
Cl T   52.6 (1.3)    52.56 (1.2)   52.52 (1.3)   52.39 (1.1)   53.05 (0.7)
Far K  52.6 (1.3)    55.21 (1.2)   56.21 (1.8)   56.14 (1.2)   53.06 (1)
Far T  52.6 (1.3)    55.21 (1.2)   56.21 (1.8)   56.14 (1.2)   53.06 (1)
CBU    54.28 (0.9)   53.77 (1)     53.33 (1.1)   53.34 (0.9)   52.84 (0.8)

Book(p = 25)
       βu1           βu2           βu3           βu4           βu5
RUS    60.08 (0.7)   60.13 (0.6)   60.4 (0.8)    60.33 (0.8)   63.28 (0.8)
Cl K   60.08 (0.7)   59.96 (0.7)   60.13 (0.8)   59.96 (1)     59.28 (0.7)
Cl T   60.08 (0.7)   59.96 (0.7)   60.13 (0.8)   60.29 (0.4)   54.5 (0.9)
Far K  60.08 (0.7)   63.29 (1)     64.19 (0.8)   57.3 (1.1)    55.66 (1.1)
Far T  60.08 (0.7)   62.14 (0.5)   58.27 (0.6)   56.37 (1)     55.66 (1.1)
CBU    54.82 (0.9)   54.67 (0.9)   54.71 (0.9)   54.66 (1)     54.78 (0.9)

LST(p = 1)
       βu1           βu2           βu3           βu4           βu5
RUS    99.99 (0)     99.99 (0)     99.99 (0)     99.98 (0)     99.99 (0)
Cl K   99.99 (0)     99.99 (0)     99.99 (0)     99.99 (0)     99.99 (0)
Cl T   99.99 (0)     99.99 (0)     99.99 (0)     99.99 (0)     99.98 (0)
Far K  99.99 (0)     99.98 (0)     99.98 (0)     99.98 (0)     99.98 (0)
Far T  99.99 (0)     99.98 (0)     99.98 (0)     99.98 (0)     99.98 (0)
CBU    []            []            []            []            []

Adver(p = [])
       βu1           βu2           βu3           βu4           βu5
RUS    96.61 (1.8)   96.32 (1.8)   96.63 (1.4)   97.12 (2.1)   96.22 (1.6)
Cl K   96.61 (1.8)   96.44 (1.5)   96.14 (1.5)   96.04 (2)     94.8 (2.5)
Cl T   96.61 (1.8)   95.87 (2.1)   94.32 (1.9)   93.01 (2.2)   90.72 (2.3)
Far K  96.61 (1.8)   96.53 (1.4)   95.76 (2)     94.39 (1.8)   90.49 (3.1)
Far T  96.61 (1.8)   96.54 (1.5)   95.67 (1.9)   94.54 (1.8)   89.3 (2.8)
CBU    96.85 (2.3)   96.85 (2.3)   97.05 (1.5)   96.6 (1.6)    96.06 (2.1)

Adver(p = 1)
       βu1           βu2           βu3           βu4           βu5
RUS    90.93 (3)     91.53 (3.1)   92.37 (3.4)   91.9 (2.9)    91.93 (2.2)
Cl K   90.93 (3)     90.64 (3)     89.87 (3.9)   90.21 (3.6)   89.18 (2)
Cl T   90.93 (3)     89.7 (3.5)    88.55 (3.4)   85.76 (3.3)   88.2 (2.3)
Far K  90.93 (3)     93.8 (2.3)    92.4 (2.6)    88.73 (3.4)   85.51 (4)
Far T  90.93 (3)     93.62 (2.4)   93.2 (2.2)    88.41 (3.6)   85.51 (4)
CBU    93.22 (2.4)   93.76 (2.5)   93.89 (2.6)   93.52 (2.7)   91.27 (2)

CRF(p = [])
       βu1           βu2           βu3           βu4           βu5
RUS    64.06 (16.4)  63.28 (15.9)  67.98 (17.4)  66.95 (21.9)  87.73 (8.8)
Cl K   64.06 (16.4)  62.44 (16.6)  62.34 (16.9)  71.37 (13.8)  78.22 (17.7)
Cl T   64.06 (16.4)  62.44 (16.6)  62.34 (16.9)  71.37 (13.8)  62.67 (22.9)
Far K  64.06 (16.4)  83.8 (14.2)   83.93 (14.8)  84.49 (13.7)  86.11 (9.7)
Far T  64.06 (16.4)  83.8 (14.2)   83.93 (14.8)  84.49 (13.7)  86.11 (9.7)
CBU    []            []            []            []            []

Bank(p = [])
       βu1           βu2           βu3           βu4           βu5
RUS    66.82 (0.9)   67.02 (0.9)   67.37 (0.8)   67.99 (0.6)   69.5 (1)
Cl K   66.82 (0.9)   66.17 (0.7)   65.24 (0.6)   64.86 (0.6)   58.53 (1.1)
Cl T   66.82 (0.9)   64.92 (1.1)   60.69 (0.9)   56.33 (0.8)   52.87 (0.7)
Far K  66.82 (0.9)   66.95 (0.6)   66.19 (0.6)   64.42 (0.6)   58.25 (1.1)
Far T  66.82 (0.9)   67.16 (0.6)   64.2 (0.8)    59.67 (1)     58.25 (1.1)
CBU    []            []            []            []            []


C Boosting experiments

This section presents the results of each of the boosting experiments from Section 4.5 on the remaining data sources. Each of the figures below depicts the average tenfold AUC-performance on test data (with µ = 100) with respect to the number of boosting iterations for (left) Adaboost (AB), AdaCost (AC) and EasyEnsemble (EE) with C chosen according to the highest validation set AUC-performance (over all possible boosting rounds), and (right) AB and EE (with S = 15) with varying C-levels.

Fig 6 Mov G(p = 1) dataset [figure: AUC test vs. T, panels (a) and (b)]

Fig 7 Mov Th(p = []) dataset [figure: AUC test vs. T, panels (a) and (b)]

Fig 8 Yahoo A(p = 1) dataset [figure: AUC test vs. T, panels (a) and (b)]

Fig 9 Yahoo A(p = 25) dataset [figure: AUC test vs. T, panels (a) and (b)]

Fig 10 Yahoo G(p = 1) dataset [figure: AUC test vs. T, panels (a) and (b)]

Fig 11 Yahoo G(p = 25) dataset [figure: AUC test vs. T, panels (a) and (b)]

Fig 12 TaFeng(p = 1) dataset [figure: AUC test vs. T, panels (a) and (b)]

Fig 13 Book(p = 1) dataset [figure: AUC test vs. T, panels (a) and (b)]

Fig 14 LST(p = 1) dataset [figure: AUC test vs. T, panels (a) and (b)]

Fig 15 Adver(p = []) dataset [figure: AUC test vs. T, panels (a) and (b)]

Fig 16 Adver(p = 1) dataset [figure: AUC test vs. T, panels (a) and (b)]

Fig 17 CRF(p = []) dataset [figure: AUC test vs. T, panels (a) and (b)]

D Final Comparison

[Figure: scatter of Average Rank Time (y-axis) versus Average Rank AUC (x-axis); legend: BL, OSR, SMOTE, ADASYN, RUS, Cl Knn, Far Knn, CBU, AB, AC(R=28), AC(R=RL), EE(S=10), EE(S=15), EE par]

Fig 18 Average rank AUC versus average rank Time (see Table 9) across the large datasets from Table 2 (CRF and Bank). Regarding the former, an extra row EE par is added, having the same AUC-value as EE(S = 15). Ranks for each dataset are subsequently obtained via the procedure outlined in Section 4.6.1. Points occurring in the upper-right region are preferred.


References

Akbani R, Kwek S, Japkowicz N (2004) Applying support vector machines to imbalanced datasets. In: Machine Learning: ECML 2004, 15th European Conference on Machine Learning, Pisa, Italy, September 20-24, 2004, Proceedings, Springer Berlin Heidelberg, Berlin, Heidelberg, pp 39-50. DOI 10.1007/978-3-540-30115-8_7
Ali A, Shamsuddin SM, Ralescu AL (2015) Classification with class imbalance problem: A review. International Journal of Advances in Soft Computing and its Applications 7(3):176-204
Alzahrani T, Horadam KJ (2016) Community detection in bipartite networks: Algorithms and case studies. In: Complex Systems and Networks: Dynamics, Controls and Applications, Springer Berlin Heidelberg, Berlin, Heidelberg, pp 25-50. DOI 10.1007/978-3-662-47824-0_2
Bachner J (2013) Predictive policing: Preventing crime with data and analytics. IBM Center for the Business of Government
Baesens B, Van Gestel T, Viaene S, Stepanova M, Suykens J, Vanthienen J (2003) Benchmarking state-of-the-art classification algorithms for credit scoring. Journal of the Operational Research Society 54(6):627-635. DOI 10.1057/palgrave.jors.2601545
Barandela R, Sánchez J, García V, Rangel E (2003) Strategies for learning in class imbalance problems. Pattern Recognition 36(3):849-851. DOI 10.1016/S0031-3203(02)00257-1
Barber MJ (2007) Modularity and community detection in bipartite networks. Physical Review E 76:066102. DOI 10.1103/PhysRevE.76.066102
Barua S, Islam MM, Yao X, Murase K (2014) MWMOTE - majority weighted minority oversampling technique for imbalanced data set learning. IEEE Transactions on Knowledge and Data Engineering 26(2):405-425. DOI 10.1109/TKDE.2012.232
Batista GEAPA, Prati RC, Monard MC (2004) A study of the behavior of several methods for balancing machine learning training data. SIGKDD Explor Newsl 6(1):20-29. DOI 10.1145/1007730.1007735
Beckett SJ (2016) Improved community detection in weighted bipartite networks. Royal Society Open Science 3(1). DOI 10.1098/rsos.140536
Bekkar M, Djemaa HK, Alitouche TA (2013) Evaluation measures for models assessment over imbalanced data sets. Journal of Information Engineering and Applications 3(10):27-38
Bhattacharyya S, Jha S, Tharakunnel K, Westland JC (2011) Data mining for credit card fraud: A comparative study. Decision Support Systems 50(3):602-613. DOI 10.1016/j.dss.2010.08.008
Blondel VD, Guillaume JL, Lambiotte R, Lefebvre E (2008) Fast unfolding of communities in large networks. Journal of Statistical Mechanics: Theory and Experiment 2008(10):P10008
Breiman L, Friedman J, Stone CJ, Olshen RA (1984) Classification and regression trees. Taylor & Francis
Brozovsky L, Petricek V (2007) Recommender system for online dating service. In: Proceedings of Znalosti 2007 Conference, VSB Ostrava
Cha M, Mislove A, Gummadi KP (2009) A measurement-driven analysis of information propagation in the Flickr social network. In: Proceedings of the 18th International Conference on World Wide Web, ACM, New York, NY, USA, WWW '09, pp 721-730. DOI 10.1145/1526709.1526806
Chawla NV (2005) Data mining for imbalanced datasets: An overview. In: Data Mining and Knowledge Discovery Handbook, Springer US, Boston, MA, pp 853-867
Chawla NV, Bowyer KW, Hall LO, Kegelmeyer WP (2002) SMOTE: synthetic minority over-sampling technique. Journal of Artificial Intelligence Research 16:321-357
Chawla NV, Lazarevic A, Hall LO, Bowyer KW (2003) SMOTEBoost: Improving prediction of the minority class in boosting. In: Knowledge Discovery in Databases: PKDD 2003, Springer Berlin Heidelberg, Berlin, Heidelberg, pp 107-119
Chawla NV, Japkowicz N, Kotcz A (2004) Editorial: Special issue on learning from imbalanced data sets. SIGKDD Explor Newsl 6(1):1-6. DOI 10.1145/1007730.1007733
Chen M, Mao S, Liu Y (2014) Big data: A survey. Mobile Networks and Applications 19(2):171-209. DOI 10.1007/s11036-013-0489-0
Chyi YM (2003) Classification analysis techniques for skewed class distribution problems. Master thesis, Department of Information Management, National Sun Yat-Sen University
Demšar J (2006) Statistical comparisons of classifiers over multiple data sets. Journal of Machine Learning Research 7(Jan):1-30
Dhillon IS (2001) Co-clustering documents and words using bipartite spectral graph partitioning. In: Proceedings of the Seventh ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, ACM, New York, NY, USA, KDD '01, pp 269-274. DOI 10.1145/502512.502550


Drummond C, Holte RC (2003) C4.5, class imbalance, and cost sensitivity: why under-sampling beats over-sampling. In: Proceedings of the ICML '03 Workshop on Learning from Imbalanced Datasets
Fan RE, Chang KW, Hsieh CJ, Wang XR, Lin CJ (2008) LIBLINEAR: A library for large linear classification. Journal of Machine Learning Research 9:1871-1874
Fan W, Stolfo SJ, Zhang J, Chan PK (1999) AdaCost: Misclassification cost-sensitive boosting. In: Proceedings of the Sixteenth International Conference on Machine Learning, Morgan Kaufmann Publishers Inc., San Francisco, CA, USA, ICML '99, pp 97-105
Fawcett T (2006) An introduction to ROC analysis. Pattern Recognition Letters 27(8):861-874. DOI 10.1016/j.patrec.2005.10.010
Finch H (2005) Comparison of distance measures in cluster analysis with dichotomous data. Journal of Data Science 3(1):85-100
Fortunato S (2010) Community detection in graphs. Physics Reports 486(3-5):75-174. DOI 10.1016/j.physrep.2009.11.002
Junqué de Fortuny E, Martens D, Provost F (2014a) Predictive modeling with big data: is bigger really better? Big Data 1(4):215-226. DOI 10.1089/big.2013.0037
Junqué de Fortuny E, Stankova M, Moeyersoms J, Minnaert B, Provost F, Martens D (2014b) Corporate residence fraud detection. In: Proceedings of the 20th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, ACM, New York, NY, USA, KDD '14, pp 1650-1659. DOI 10.1145/2623330.2623333
Frasca M, Bertoni A, Re M, Valentini G (2013) A neural network algorithm for semi-supervised node label learning from unbalanced data. Neural Networks 43:84-98. DOI 10.1016/j.neunet.2013.01.021
Friedman M (1937) The use of ranks to avoid the assumption of normality implicit in the analysis of variance. Journal of the American Statistical Association 32(200):675-701
García E, Lozano F (2007) Boosting support vector machines. In: Machine Learning and Data Mining in Pattern Recognition, 5th International Conference, MLDM 2007, Leipzig, Germany, July 18-20, Post Proceedings, IBaI publishing, pp 153-167
Goldstein M, Uchida S (2016) A comparative evaluation of unsupervised anomaly detection algorithms for multivariate data. PLOS ONE 11(4):1-31. DOI 10.1371/journal.pone.0152173
González PC, Velásquez JD (2013) Characterization and detection of taxpayers with false invoices using data mining techniques. Expert Systems with Applications 40(5):1427-1436. DOI 10.1016/j.eswa.2012.08.051
Guimerà R, Sales-Pardo M, Amaral LAN (2007) Module identification in bipartite and directed networks. Physical Review E 76:036102. DOI 10.1103/PhysRevE.76.036102
Guo H, Viktor HL (2004) Learning from imbalanced data sets with boosting and data generation: The DataBoost-IM approach. SIGKDD Explor Newsl 6(1):30-39. DOI 10.1145/1007730.1007736
Guo X, Yin Y, Dong C, Yang G, Zhou G (2008) On the class imbalance problem. In: 2008 Fourth International Conference on Natural Computation, IEEE, vol 4, pp 192-201. DOI 10.1109/ICNC.2008.871
Han H, Wang WY, Mao BH (2005) Borderline-SMOTE: A new over-sampling method in imbalanced data sets learning. In: Advances in Intelligent Computing, Springer Berlin Heidelberg, Berlin, Heidelberg, pp 878-887
He H, Garcia EA (2009) Learning from imbalanced data. IEEE Transactions on Knowledge and Data Engineering 21(9):1263-1284. DOI 10.1109/TKDE.2008.239
He H, Bai Y, Garcia EA, Li S (2008) ADASYN: Adaptive synthetic sampling approach for imbalanced learning. In: 2008 IEEE International Joint Conference on Neural Networks (IEEE World Congress on Computational Intelligence), IEEE, pp 1322-1328. DOI 10.1109/IJCNN.2008.4633969
Holm S (1979) A simple sequentially rejective multiple test procedure. Scandinavian Journal of Statistics 6(2):65-70
Hsu CW, Lin CJ (2002) A comparison of methods for multiclass support vector machines. IEEE Transactions on Neural Networks 13(2):415-425. DOI 10.1109/72.991427
Huang A (2008) Similarity measures for text document clustering. In: Proceedings of the Sixth New Zealand Computer Science Research Student Conference (NZCSRSC2008), Christchurch, New Zealand, pp 49-56
Iman RL, Davenport JM (1980) Approximations of the critical region of the Friedman statistic. Communications in Statistics - Theory and Methods 9(6):571-595
Jo T, Japkowicz N (2004) Class imbalances versus small disjuncts. ACM SIGKDD Explor Newsl 6(1):40-49. DOI 10.1145/1007730.1007737


Jutla IS, Jeub LG, Mucha PJ (2011-2016) A generalized Louvain method for community detection implemented in MATLAB. URL http://netwiki.amath.unc.edu/GenLouvain
Kubat M, Matwin S (1997) Addressing the curse of imbalanced training sets: One-sided selection. In: Proceedings of the Fourteenth International Conference on Machine Learning, Morgan Kaufmann Publishers Inc., San Francisco, CA, USA, pp 179-186
Lancichinetti A, Fortunato S (2009) Community detection algorithms: A comparative analysis. Physical Review E 80:056117. DOI 10.1103/PhysRevE.80.056117
Larremore DB, Clauset A, Jacobs AZ (2014) Efficiently inferring community structure in bipartite networks. Physical Review E: Statistical, Nonlinear, and Soft Matter Physics 90:012805. DOI 10.1103/PhysRevE.90.012805
Li J, Fine JP (2010) Weighted area under the receiver operating characteristic curve and its application to gene selection. Journal of the Royal Statistical Society: Series C (Applied Statistics) 59(4):673-692. DOI 10.1111/j.1467-9876.2010.00713.x
Li X, Wang L, Sung E (2008) AdaBoost with SVM-based component classifiers. Engineering Applications of Artificial Intelligence 21(5):785-795. DOI 10.1016/j.engappai.2007.07.001
Lichman M (2013) UCI machine learning repository. URL http://archive.ics.uci.edu/ml
Liu W, Chawla S, Cieslak DA, Chawla NV (2010) A robust decision tree algorithm for imbalanced data sets. In: Proceedings of the Tenth SIAM International Conference on Data Mining, SIAM, Philadelphia, vol 10, pp 766-777
Liu XY, Wu J, Zhou ZH (2009) Exploratory undersampling for class-imbalance learning. IEEE Transactions on Systems, Man, and Cybernetics, Part B (Cybernetics) 39(2):539-550. DOI 10.1109/TSMCB.2008.2007853
Luts J, Ojeda F, Van de Plas R, De Moor B, Van Huffel S, Suykens JA (2010) A tutorial on support vector machine-based methods for classification problems in chemometrics. Analytica Chimica Acta 665(2):129-145. DOI 10.1016/j.aca.2010.03.030
Macskassy SA, Provost F (2007) Classification in networked data: A toolkit and a univariate case study. Journal of Machine Learning Research 8(May):935-983
Martens D, Provost F (2014) Explaining data-driven document classifications. MIS Quarterly 38(1):73-100
Martens D, Provost F, Clark J, Junqué de Fortuny E (2016) Mining massive fine-grained behavior data to improve predictive analytics. MIS Quarterly 40(4):869-888
Mazurowski MA, Habas PA, Zurada JM, Lo JY, Baker JA, Tourassi GD (2008) Training neural network classifiers for medical decision making: The effects of imbalanced datasets on classification performance. Neural Networks 21(2-3):427-436. DOI 10.1016/j.neunet.2007.12.031
Mease D, Wyner AJ, Buja A (2007) Boosted classification trees and class probability/quantile estimation. Journal of Machine Learning Research 8:409-439
Nemenyi P (1963) Distribution-free Multiple Comparisons. Dissertation, Princeton University
Newman MEJ, Girvan M (2004) Finding and evaluating community structure in networks. Physical Review E 69:026113. DOI 10.1103/PhysRevE.69.026113
Ng AY (2004) Feature selection, L1 vs. L2 regularization, and rotational invariance. In: Proceedings of the Twenty-first International Conference on Machine Learning, ACM, New York, NY, USA, ICML '04, pp 78-. DOI 10.1145/1015330.1015435
Ng AY, Jordan MI (2002) On discriminative vs. generative classifiers: A comparison of logistic regression and naive Bayes. In: Advances in Neural Information Processing Systems 14, MIT Press, pp 841-848
Ngai E, Hu Y, Wong Y, Chen Y, Sun X (2011) The application of data mining techniques in financial fraud detection: A classification framework and an academic review of literature. Decision Support Systems 50(3):559-569. DOI 10.1016/j.dss.2010.08.006
Platt JC (1999) Probabilistic outputs for support vector machines and comparisons to regularized likelihood methods. In: Advances in Large-Margin Classifiers, MIT Press, pp 61-74
Porter MA, Onnela JP, Mucha PJ (2009) Communities in networks. Notices of the American Mathematical Society 56(9):1082-1097
Provost F, Fawcett T (2013) Data Science for Business: What you need to know about data mining and data-analytic thinking. O'Reilly Media, Inc.
Provost F, Dalessandro B, Hook R, Zhang X, Murray A (2009) Audience selection for on-line brand advertising: Privacy-friendly social network targeting. In: Proceedings of the 15th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, ACM, New York, NY, USA, KDD '09, pp 707-716. DOI 10.1145/1557019.1557098

56 Jellis Vanhoeyveld David Martens

Raskutti B, Kowalczyk A (2004) Extreme re-balancing for SVMs: A case study. SIGKDD Explorations Newsletter 6(1):60–69, DOI 10.1145/1007730.1007739

Rosvall M, Bergstrom CT (2008) Maps of random walks on complex networks reveal community structure. Proceedings of the National Academy of Sciences 105(4):1118–1123, DOI 10.1073/pnas.0706851105

Schapire RE (1999) A brief introduction to boosting. In: Proceedings of the 16th International Joint Conference on Artificial Intelligence - Volume 2, Morgan Kaufmann Publishers Inc., San Francisco, CA, USA, IJCAI '99, pp 1401–1406

Schapire RE, Singer Y (1999) Improved boosting algorithms using confidence-rated predictions. Machine Learning 37(3):297–336, DOI 10.1023/A:1007614523901

Shmueli G (2017) Analyzing behavioral big data: Methodological, practical, ethical, and moral issues. Quality Engineering 29(1):57–74, DOI 10.1080/08982112.2016.1210979

Sobhani P, Viktor H, Matwin S (2015) Learning from imbalanced data using ensemble methods and cluster-based undersampling. In: New Frontiers in Mining Complex Patterns: Third International Workshop, NFMCP 2014, Held in Conjunction with ECML-PKDD 2014, Nancy, France, September 19, 2014, Revised Selected Papers, Springer International Publishing, Cham, pp 69–83, DOI 10.1007/978-3-319-17876-9_5

Stankova M (2016) Classification within network data with a bipartite structure. Dissertation, University of Antwerp

Stankova M, Martens D, Provost F (2015) Classification over bipartite graphs through projection. Working Papers 2015001, University of Antwerp, Faculty of Applied Economics

Sun Y, Kamel MS, Wong AK, Wang Y (2007) Cost-sensitive boosting for classification of imbalanced data. Pattern Recognition 40(12):3358–3378, DOI 10.1016/j.patcog.2007.04.009

Suykens JA, Van Gestel T, De Brabanter J, De Moor B, Vandewalle J (2002) Least Squares Support Vector Machines. World Scientific, Singapore

Tang B, He H (2015) ENN: Extended nearest neighbor method for pattern recognition [research frontier]. IEEE Computational Intelligence Magazine 10(3):52–60, DOI 10.1109/MCI.2015.2437512

Tang Y, Zhang YQ, Chawla NV, Krasser S (2009) SVMs modeling for highly imbalanced classification. IEEE Transactions on Systems, Man, and Cybernetics, Part B (Cybernetics) 39(1):281–288, DOI 10.1109/TSMCB.2008.2002909

Tobback E, Moeyersoms J, Stankova M, Martens D (2016) Bankruptcy prediction for SMEs using relational data. Working Paper 2016004, University of Antwerp, Faculty of Applied Economics

Verbeke W, Dejaeger K, Martens D, Hur J, Baesens B (2012) New insights into churn prediction in the telecommunication sector: A profit driven data mining approach. European Journal of Operational Research 218(1):211–229, DOI 10.1016/j.ejor.2011.09.031

Veropoulos K, Campbell I, Cristianini N (1999) Controlling the sensitivity of support vector machines. In: Proceedings of the International Joint Conference on Artificial Intelligence, Stockholm, Sweden (IJCAI99), pp 55–60

Whitrow C, Hand DJ, Juszczak P, Weston D, Adams NM (2009) Transaction aggregation as a strategy for credit card fraud detection. Data Mining and Knowledge Discovery 18(1):30–55, DOI 10.1007/s10618-008-0116-z

Wickramaratna J, Holden SB, Buxton BF (2001) Performance degradation in boosting. In: Proceedings of the Second International Workshop on Multiple Classifier Systems, Springer, London, UK, MCS '01, pp 11–21

Yen SJ, Lee YS (2009) Cluster-based under-sampling approaches for imbalanced data distributions. Expert Systems with Applications 36(3, Part 1):5718–5727, DOI 10.1016/j.eswa.2008.06.108

Yu HF, Lo HY, Hsieh HP, Lou JK, McKenzie TG, Chou JW, Chung PH, Ho CH, Chang CF, Wei YH, et al (2010) Feature engineering and classifier ensemble for KDD Cup 2010. In: Proceedings of the KDD Cup 2010 Workshop, pp 1–16

Zha H, He X, Ding C, Simon H, Gu M (2001) Bipartite graph partitioning and data clustering. In: Proceedings of the Tenth International Conference on Information and Knowledge Management, ACM, New York, NY, USA, CIKM '01, pp 25–32, DOI 10.1145/502585.502591

Zhang J, Mani I (2003) KNN approach to unbalanced data distributions: A case study involving information extraction. In: Proceedings of the ICML'2003 Workshop on Learning from Imbalanced Datasets, Washington, DC

Ziegler CN, McNee SM, Konstan JA, Lausen G (2005) Improving recommendation lists through topic diversification. In: Proceedings of the 14th International Conference on World Wide Web, ACM, New York, NY, USA, WWW '05, pp 22–32, DOI 10.1145/1060745.1060754


Imbalanced classification in sparse and large behaviour datasets 35

In comparing each of the methods outlined in Section 3, we make use of a methodology similar to the one presented previously (see Section 4.6.1). For each of the ten folds, a suitable parameter combination is selected based on optimal validation set AUC-performance. These chosen parameters are applied in determining the timings of each method. The final results indicate the average timings across the ten folds. Hence, we show the required amount of time for each method when using the parameter combination resulting in the best performance. Note that this might, for instance, result in using β = 1/3 in case of OSR, or in using only t = 5 boosting iterations in EE. Computational timings are shown in Table 9.
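The timing protocol above can be sketched as follows; `train` and `validate` are hypothetical callables standing in for fitting a method on one fold and computing its validation-set AUC, and are not part of the paper's own code:

```python
import time
import statistics

def timed_best_param_runs(folds, param_grid, train, validate):
    """For each fold, select the parameter combination with the best
    validation-set AUC, then time a training run that uses only the
    winning combination; report the mean timing across folds."""
    timings = []
    for fold in folds:
        best = max(param_grid, key=lambda p: validate(train(fold, p), fold))
        t0 = time.perf_counter()
        train(fold, best)                      # timed run with chosen params
        timings.append(time.perf_counter() - t0)
    return statistics.mean(timings)
```

The model-selection passes are deliberately excluded from the timed run, matching the table's definition of "time for the best-performing parameter combination".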

As one can observe from the aforementioned table, CBU is the slowest method, as it relies on computationally intensive clustering procedures. The synthetic SMOTE and ADASYN oversampling approaches are very time-consuming methods as well: they both rely on computationally expensive nearest neighbour computations, in conjunction with a synthetic sample generation procedure that requires a single pass over all minority training instances.

The RUS method shows, as expected, superior results in terms of computational timings. The other undersampling methods perform quite well, though they are slower than BL and OSR due to their inherent nearest neighbour computations.

The boosting variants seem to perform worse than BL, OSR and each of the undersampling methods. Since EE was the best method in terms of AUC-performance (see Section 4.6.1), we focus on this learner in this paragraph. For the medium sized datasets (all but CRF and Bank), OSR outperforms EE(S = 15), yet for the large datasets (CRF and Bank) the opposite is true. An SVM is known to have a quadratic computational complexity in the number of training instances (Suykens et al 2002; García and Lozano 2007). For this reason, we consider OSR (or any other oversampling technique) as inappropriate for large datasets. Since with EE each subset has a size limited to twice the minority class training size, we expect increasing benefits of the EE-technique for larger and more imbalanced behaviour data. As Junque de Fortuny et al (2014a) have observed, larger behaviour data, in terms of the number of instances or features, also contributes to an increase in predictive power. Hence, we expect to see this type of big behaviour data more often in real-life applications, where computational resources can become a major constraint. Because the EE-technique is easily parallelizable, by simultaneously performing the boosting process for each of the independent subsets, we added an extra row in Table 9: EE par corresponds with the time needed for EE(S = 15) divided by 15, indicating the time that would be required for EE if each subset were trained in parallel. We can see that EE par can outperform OSR in 8 out of 16 cases, precisely for the datasets that show a high imbalance level (e.g. p = 1 or p = []).
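The subset-level parallelism behind the EE par row can be sketched as follows. Here `train_score` is a hypothetical stand-in for the boosted SVM/LR learner trained on one balanced subset; the toy centroid scorer in the usage example below is an assumption, not the paper's learner:

```python
import numpy as np
from concurrent.futures import ThreadPoolExecutor

def easy_ensemble_scores(X_min, X_maj, X_test, train_score, S=15, seed=0):
    """Draw S balanced subsets (the full minority class paired with an
    equally sized random majority sample), train one learner per subset
    in parallel -- the subsets are independent -- and average the scores."""
    rng = np.random.default_rng(seed)
    subsets = []
    for _ in range(S):
        idx = rng.choice(len(X_maj), size=len(X_min), replace=False)
        X = np.vstack([X_min, X_maj[idx]])
        y = np.r_[np.ones(len(X_min)), np.zeros(len(X_min))]
        subsets.append((X, y))
    with ThreadPoolExecutor() as pool:
        scorers = list(pool.map(lambda s: train_score(*s), subsets))
    return np.mean([f(X_test) for f in scorers], axis=0)
```

Because each subset holds only 2× the minority class size, the quadratic cost of the base SVM applies to small training sets, which is exactly why EE scales to the CRF and Bank datasets.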

We conclude our comparison of the proposed methods in Figure 5, which shows the average ranks in terms of computational timings (see Table 9) and AUC-performance for each of the available data sources. Regarding the ranks with respect to the latter, we added an extra row EE par, having the same AUC-performance as EE(S = 15) for each dataset in Table 5. Ranks are subsequently determined according to the procedure outlined in Section 4.6.1. Additionally, Figure 18 in Appendix D shows the result for the large and highly imbalanced CRF and Bank datasets. We expect the latter type of data to be of major interest for reasons already indicated.


Table 9  Computational timings (in seconds) averaged across ten folds, using the parameter combination resulting in the highest validation set AUC-performance. First and second best performances are emphasized in boldface (third and fourth best are underlined). EE par is the time required for EE(S = 15) divided by 15.

            Mov G(p=1)   Mov G(p=25)  Mov Th(p=[])  Yahoo A(p=1)
BL          0.032889     0.056697     0.558563      0.026922
OSR         0.055043     0.062802     0.99009       0.044421
SMOTE       0.218821     0.937057     3.841482      0.057726
ADASYN      0.284688     1.802399     5.191265      0.087694
RUS         0.011431     0.025383     0.155224      0.007991
CL Knn      0.046599     0.599846     0.989914      0.037182
Far Knn     0.039887     0.80072      0.683023      0.027788
CBU         1.034111     1.060173     6.822839      1.692477
AB          0.169792     0.841443     3.460246      0.139251
AC(R=28)    0.471994     2.996585     1.086907      0.366555
AC(R=RL)    0.53376      1.179542     6.065177      0.209015
EE(S=10)    0.117226     6.065145     1.17995       0.148973
EE(S=15)    0.20474      7.173737     2.119991      0.180365
EE par      0.013649     0.478249     0.141333      0.012024

            Yahoo A(p=25)  Yahoo G(p=1)  Yahoo G(p=25)  TaFeng(p=1)
BL          0.092954       0.011915      0.044164       0.026728
OSR         0.027887       0.013241      0.047206       0.040919
SMOTE       1.062686       0.056153      0.883698       0.219553
ADASYN      2.050993       0.079073      1.733367       0.306618
RUS         0.048471       0.003234      0.033423       0.002916
CL Knn      0.84391        0.025404      0.502515       0.092167
Far Knn     0.664124       0.026576      0.500206       0.080159
CBU         1.569442       1.287221      1.355035       2.467279
AB          0.445546       0.078777      0.169977       0.114619
AC(R=28)    1.034044       0.321723      0.515953       0.926178
AC(R=RL)    0.706215       0.226741      0.112949       0.610233
EE(S=10)    1.026577       0.100331      1.527146       0.058052
EE(S=15)    1.607596       0.077483      2.472582       0.10538
EE par      0.107173       0.005166      0.164839       0.007025

            TaFeng(p=25)  Book(p=1)   Book(p=25)  LST(p=1)
BL          0.032033      0.080035    0.318093    0.652045
OSR         0.032414      0.132927    0.092757    0.87152
SMOTE       5.089283      3.409418    11.43444    4.987705
ADASYN      8.148419      3.689661    12.25441    6.840083
RUS         0.020457      0.022713    0.031972    0.432839
CL Knn      1.713731      0.400873    3.711648    2.508374
Far Knn     1.539437      0.379086    3.988552    2.511037
CBU         2.642686      4.198663    4.631987    []
AB          0.713265      0.61719     1.238585    2.466151
AC(R=28)    1.234647      1.666131    2.330635    1.451671
AC(R=RL)    0.279047      0.860346    0.197053    1.23763
EE(S=10)    2.484502      2.145747    7.177484    0.524066
EE(S=15)    3.363971      2.480066    11.21945    0.784111
EE par      0.224265      0.165338    0.747963    0.052274


Table 9  Continued. Additionally, an extra column indicating average ranks is included (lower ranks are preferred).

            Adver(p=[])  Adver(p=1)  CRF(p=[])  Bank(p=[])
BL          0.010953     0.002796    0.725911   70.89334
OSR         0.012178     0.006166    3.685813   179.7481
SMOTE       0.123112     0.017764    5.633862   []
ADASYN      0.183767     0.021728    5.768669   []
RUS         0.012115     0.00204     0.147392   52.47441
CL Knn      0.061324     0.005568    1.106755   73.73282
Far Knn     0.079078     0.007069    1.110379   97.59619
CBU         3.378235     3.236754    []         []
AB          0.069199     0.103518    1.153196   83.08618
AC(R=28)    0.193092     0.068905    2.047434   71.70548
AC(R=RL)    0.107652     0.037963    1.387174   106.3466
EE(S=10)    0.138485     0.085686    0.198656   24.95117
EE(S=15)    0.185136     0.139121    0.285345   36.40107
EE par      0.012342     0.009275    0.019023   2.426738

            Average Rank [pos]
BL          2.94   [2]
OSR         4.19   [4]
SMOTE       9.59   [11]
ADASYN      10.91  [13]
RUS         1.38   [1]
CL Knn      6.5    [5]
Far Knn     6.56   [6]
CBU         14     [14]
AB          8.06   [7]
AC(R=28)    10.81  [12]
AC(R=RL)    9.25   [9]
EE(S=10)    8.25   [8]
EE(S=15)    9.56   [10]
EE par      3      [3]

Timings are a substantial concern with respect to these large datasets. As can be observed, the EE-technique thrives for these big and highly imbalanced data, even in its non-parallel form.


[Figure 5 here: scatter plot of average rank Time against average rank AUC for the methods BL, OSR, SMOTE, ADASYN, RUS, Cl Knn, Far Knn, CBU, AB, AC(R=28), AC(R=RL), EE(S=10), EE(S=15) and EE par.]

Fig. 5  Average rank AUC versus average rank Time (see Table 9) across all datasets from Table 2 (excluding Flickr and Kdd). Regarding the AUC rank, an extra row EE par is added, having the same AUC-value as EE(S = 15). Ranks for each dataset are subsequently obtained via the procedure outlined in Section 4.6.1. Points occurring in the upper-right region are preferred.

5 The effect of the chosen base learner

In this short section we investigate the influence of the chosen base learner. More specifically, we ask ourselves whether some base inducers can more easily cope with the imbalanced learning issue, and in doing so we verify the conclusion that EE is a suitable technique to improve upon the baseline performance. We rely on several popular classification methods for behaviour data: two regularization based techniques (SVM, LR) and two heuristic types of approaches (Naive Bayes (NB), Behavioural Similarity (BeSim)). The next paragraphs briefly introduce these methods. Note that source codes and dataset results are provided in our on-line repository.

LR constructs a linear model that maximizes the likelihood of the observed data by imposing a logistic model for the parametric probability model P(Y|X). Ng and Jordan (2002) provide a thorough introduction on the topic²⁷ and note that, in its plain form, LR can suffer from overfitting, especially when the input data is very high dimensional. Therefore, we resort to regularized logistic regression. We opt for L2-regularized LR, since L1-regularization corresponds with a natural way of imposing feature selection (Ng 2004). The LIBLINEAR toolbox (Fan et al 2008) is used to obtain these L2-regularized LR models.
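For illustration, a minimal dense-data sketch of the L2-regularized LR objective fitted by plain gradient descent; the paper itself relies on the LIBLINEAR solver, and labels are assumed here to be in {−1, +1}:

```python
import numpy as np

def fit_l2_logreg(X, y, C=1.0, lr=0.1, iters=500):
    """Minimal L2-regularised logistic regression via batch gradient descent.
    Minimises  (1/2)||w||^2 + C * sum_i log(1 + exp(-y_i w.x_i)),
    the LIBLINEAR-style objective, with labels y in {-1, +1}."""
    w = np.zeros(X.shape[1])
    for _ in range(iters):
        margins = y * (X @ w)
        # gradient: regulariser w minus C-weighted sum of misfit terms
        grad = w - C * (X.T @ (y / (1.0 + np.exp(margins))))
        w -= lr * grad
    return w

def predict_score(X, w):
    """Posterior probability P(y = +1 | x) under the logistic model."""
    return 1.0 / (1.0 + np.exp(-(X @ w)))
```

Larger C weakens the regularizer and yields a stronger (more flexible) learner, mirroring the role C plays for the SVM later in this section.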

27 For an accessible introduction, see the chapter on "Generative and Discriminative Classifiers: Naive Bayes and Logistic Regression" provided at the following link: http://www.cs.cmu.edu/~tom/NewChapters.html


NB (Ng and Jordan 2002) relies on the use of Bayes' rule, which expresses the posterior P(Y|X) as the product of the likelihood P(X|Y) and the prior P(Y), divided by the evidence P(X). The naivety of the approach originates from assuming conditional independence of the features. We impose a multivariate event model regarding the distribution P(X|Y). Junque de Fortuny et al (2014a) provide an efficient NB implementation for dealing with large and sparse binary behaviour data.
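A minimal sketch of the multivariate (Bernoulli) event model on binary features, with Laplace smoothing; the implementation used in the paper is additionally engineered for large sparse matrices, which this toy version is not:

```python
import numpy as np

def train_bernoulli_nb(X, y, alpha=1.0):
    """Multivariate (Bernoulli) event model: each binary feature is an
    independent coin conditioned on the class; alpha is Laplace smoothing."""
    priors, cond = {}, {}
    for c in np.unique(y):
        Xc = X[y == c]
        priors[c] = np.log(len(Xc) / len(X))
        cond[c] = (Xc.sum(axis=0) + alpha) / (len(Xc) + 2 * alpha)  # P(x_j=1|c)
    return priors, cond

def nb_log_posterior(x, priors, cond):
    """Unnormalised log posterior log P(c) + sum_j log P(x_j | c)."""
    return {c: priors[c]
               + np.sum(x * np.log(cond[c]) + (1 - x) * np.log(1 - cond[c]))
            for c in priors}
```

The conditional independence assumption is exactly what makes this a weak learner, which is why boosting it (EE NB) pays off later in Table 10.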

The BeSim approach conceptually corresponds to projecting the bigraph to a unigraph using the procedure outlined in Section 3.2. Next, the weighted-vote relational neighbour classifier (Macskassy and Provost 2007) is used to infer unknown labels/scores through a weighted probability estimation using the known labels of neighbouring nodes. A very fast and scalable implementation of this approach (the SW-transformation) is presented in detail in Stankova et al (2015).
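A toy sketch of the underlying idea, using plain co-occurrence counts as projection weights; this is an assumption for illustration only, since the actual SW-transformation uses specific weight functions and never materializes the full user-by-user similarity matrix:

```python
import numpy as np

def besim_scores(A, labels, known):
    """BeSim sketch on a binary user-by-item matrix A: project the bigraph
    to a unigraph whose edge weights are shared-item counts, then take a
    weighted vote over the known labels of neighbouring users."""
    S = A @ A.T                      # co-occurrence similarity, users x users
    np.fill_diagonal(S, 0)           # no self-votes
    W = S[:, known]                  # weights towards labelled users only
    votes = W @ labels[known]
    total = W.sum(axis=1)
    # users with no labelled neighbours fall back to an uninformative 0.5
    return np.divide(votes, total, out=np.full(len(A), 0.5), where=total > 0)
```

The weighted vote is a probability estimate: the score of a user is the similarity-weighted fraction of its labelled neighbours that are positive.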

In Table 10 we compare the performances of the baseline (BL) application of the methods with their EE-version (with S = 15, T = 20). It is quite striking to see the EE-version dominating the baseline version when looking at each method individually. This suggests that the imbalanced learning issue has an impact for each of the aforementioned techniques. Focussing on the regularization based approaches, we observe the baseline LR to be less influenced by the imbalance than the SVM method. Their EE-versions (EE SVM and EE LR) can boost the baseline performance in almost all cases and have the highest average ranks amongst all the proposed methods. As already emphasized, the regularization based approaches offer an added element of flexibility, in the sense that the strength of the underlying base learner can be controlled.

The heuristic approaches NB and BeSim do not offer this flexibility. Furthermore, it is not directly possible to integrate instance weights during the training phase of the base learner. During the boosting process, we therefore sample from the distribution Dt to generate (unweighted) examples and train a BeSim/NB learner next. The results in Table 10 indicate BL NB to be the worst performing algorithm amongst all methods tried. Due to the conditional independence assumption, one could regard this as a weak learner, which is therefore suitable to be used in a boosting process. Indeed, the EE NB approach is able to significantly improve upon its performance. We found this to be the slowest EE-version²⁸ (e.g. taking more than 24 hours to process a single fold of the Bank dataset, versus several minutes for the other versions). The average rank estimate of EE NB is rather pessimistic, since we assigned the lowest performance across all methods in case of missing results. BeSim can be considered as a stronger learner compared to NB. Its possible boost in performance is typically lower than that of NB. Having said this, we found EE BeSim to improve upon BL BeSim for all datasets (except Adver(p = 1)).

28 This statement only relates to the EE-version. In the baseline case, the timings of the NB-version are roughly of the same order of magnitude as SVM and LR (though this heavily depends on the chosen regularization parameter C). With EE, each subset has a size limited to twice the minority class training size (a relatively low number compared to the size of the majority class). This works in favour of SVM/LR (e.g. SVM has a computational complexity of O(m²), with m the number of training instances). Also note that the NB-version is indeed optimized for sparse data. Internally, the NB-implementation writes a fairly large model file to a specified directory (sizes of 1 GB can be encountered there). This writing operation slows down the training stage (especially in case of EE, where S × T = 15 × 20 = 300 model files are constructed).


Table 10  Average tenfold AUC test set performance of the BL compared to its EE-version (S = 15, T = 20) for SVM, LR, BeSim and NB. In case the EE NB implementation crashed, we put a × in the table. If it is too slow (> 24 hours per fold), we place a []. Additionally, average ranks are assigned to each method. In case EE NB has missing results for a particular dataset, we ensure it has the lowest rank across all other algorithms.

            Mov G(p=1)    Mov G(p=25)   Mov Th(p=[])   Yahoo A(p=1)
BL SVM      71.6  (2.62)  81.41 (1.32)  79.77 (5.33)   56.49 (3.37)
EE SVM      76.12 (2.88)  85.13 (1.86)  86.43 (5.86)   59.74 (2.96)
BL LR       71.02 (2.09)  84.39 (1.84)  83.14 (4.17)   57.84 (2.39)
EE LR       76.69 (2.92)  85.03 (1.98)  86.3  (5.37)   59.79 (2.62)
BL BeSim    76.1  (3.58)  81.3  (2.92)  82.81 (6.6)    56.27 (2.73)
EE BeSim    76.31 (3.71)  81.37 (2.9)   85.02 (6.28)   57.7  (1.71)
BL NB       70.26 (5.84)  77.01 (2.54)  70.48 (10.14)  52.56 (2.09)
EE NB       75.93 (2.83)  85.56 (2.01)  86.91 (4.15)   57.55 (2.73)

            Yahoo A(p=25)  Yahoo G(p=1)  Yahoo G(p=25)  TaFeng(p=1)
BL SVM      61.61 (2.48)   66.84 (3.66)  78.82 (1.39)   55.75 (1.6)
EE SVM      66.38 (3.16)   73.48 (2.32)  80.55 (1.55)   61.13 (1.83)
BL LR       66.27 (2.96)   69.82 (1.93)  80.45 (1.59)   58.91 (2.31)
EE LR       66.22 (3.28)   73.08 (2.14)  80.53 (1.56)   61.43 (2.32)
BL BeSim    64.54 (2.02)   68.89 (2.49)  79.55 (1.96)   57.89 (1.18)
EE BeSim    65.25 (2.23)   71.18 (2.91)  80.04 (1.85)   59.36 (1.47)
BL NB       65    (1.65)   63.33 (2.56)  78.89 (1.64)   54.61 (1.2)
EE NB       66.6  (2.79)   70.99 (2.88)  81.01 (1.3)    59.01 (1.84)

            TaFeng(p=25)  Book(p=1)     Book(p=25)    LST(p=1)
BL SVM      66.94 (1.34)  52.6  (1.29)  60.08 (0.71)  99.99 (0.01)
EE SVM      70.4  (1.3)   55.34 (1.28)  65.4  (0.51)  99.98 (0.01)
BL LR       69.24 (1.3)   55.34 (1.27)  63.84 (0.75)  99.99 (0.01)
EE LR       70.28 (1.28)  55.49 (1.49)  65.41 (0.63)  99.97 (0.02)
BL BeSim    67.49 (1.23)  55.19 (1.27)  63.7  (0.63)  99.99 (0.01)
EE BeSim    68    (1.21)  55.21 (1.15)  64.38 (0.42)  99.99 (0)
BL NB       65.21 (1.64)  52.93 (0.9)   59.75 (0.47)  98.69 (0.3)
EE NB       70.72 (1.15)  ×             63.46 (0.61)  99.92 (0.04)

            Adver(p=[])   Adver(p=1)    CRF(p=[])      Bank(p=[])
BL SVM      96.37 (1.94)  91.18 (2.97)  64.36 (18.97)  66.82 (0.88)
EE SVM      97.63 (1.35)  93.3  (2.14)  86.35 (9.99)   71.54 (0.76)
BL LR       97.19 (1.44)  88.51 (1.93)  81.87 (19.63)  71.43 (0.72)
EE LR       97.57 (0.96)  93.02 (2.06)  86.84 (9.62)   71.77 (0.62)
BL BeSim    97.26 (1.12)  95.38 (1.35)  86.91 (9.36)   67.85 (0.67)
EE BeSim    97.38 (1.04)  93.83 (1.35)  87.02 (10.43)  70.41 (0.55)
BL NB       93.75 (1.9)   93.37 (1.9)   87.24 (9.38)   67.83 (0.63)
EE NB       94.04 (1.75)  ×             ×              []

            Flickr(p=0.1)  Kdd(p=0.5)    Average Rank
BL SVM      74.92 (0.17)   74.53 (0.05)  6.44 [7]
EE SVM      79.86 (0.13)   80.98 (0.05)  2.39 [1]
BL LR       79.03 (0.11)   81.29 (0.04)  4.28 [4]
EE LR       79.85 (0.13)   80.75 (0.05)  2.61 [2]
BL BeSim    74.62 (0.13)   74.95 (0)     5.11 [6]
EE BeSim    76.4  (0.13)   77.55 (0.03)  3.61 [3]
BL NB       81.36 (0.1)    74.29 (0.05)  6.5  [8]
EE NB       []             []            5.06 [5]


6 Conclusions and future research directions

We investigated the effects of sampling and cost-sensitive learning techniques on the problem of learning from imbalanced behaviour data. This type of big data has fundamentally different properties and internal structure than traditional data, and is characterized by sparseness and very large dimensions. This is a new research topic which enables benefits across a wide variety of application domains, such as fraud detection, targeted advertising, predictive policing, churn prediction, default prediction, etc. To enable future developments in this field, we provide an on-line repository containing implementations, datasets and results, available at the following link: http://www.applieddatamining.com/cms?q=software

Oversampling techniques need to be adapted in order to cope with this kind of data. We proposed new versions of the traditional SMOTE and ADASYN techniques, which differ in the way synthetic samples are generated and similarity computations are performed. The plain random oversampling (OSR) and the synthetic approaches are competitive methods in terms of AUC and do not seem to suffer from overfitting, as traditional studies report. Though the synthetic approaches have comparable AUC-performances in comparison to OSR, they are useless in practice due to their computationally expensive procedures. The OSR method is very fast for small and medium sized datasets, yet for larger datasets it becomes inappropriate, because the training times of the underlying base learner increase drastically.
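For reference, the classical SMOTE recipe can be sketched as below. The adapted variants in the paper differ in how similarities are computed and how synthetic values are generated for binary features; the use of cosine similarity here is an assumption chosen because it suits sparse behaviour data, not a detail taken from the paper:

```python
import numpy as np

def smote_like(X_min, n_new, k=3, seed=0):
    """Classical SMOTE sketch: pick a random minority instance, pick one of
    its k nearest minority neighbours (cosine similarity here), and create
    a synthetic point by interpolating between the two."""
    rng = np.random.default_rng(seed)
    Xn = X_min / np.linalg.norm(X_min, axis=1, keepdims=True)
    sims = Xn @ Xn.T
    np.fill_diagonal(sims, -np.inf)          # exclude self-matches
    nn = np.argsort(-sims, axis=1)[:, :k]    # k nearest neighbours per row
    synth = np.empty((n_new, X_min.shape[1]))
    for s in range(n_new):
        i = rng.integers(len(X_min))
        j = nn[i, rng.integers(k)]
        lam = rng.random()                   # interpolation factor in [0, 1)
        synth[s] = X_min[i] + lam * (X_min[j] - X_min[i])
    return synth
```

The all-pairs similarity matrix built here is exactly the expensive step that makes the synthetic approaches impractical on large behaviour data.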

With respect to undersampling methodologies, we employed a random undersampling technique (RUS) and Knn-based informed approaches. The latter type selects majority class instances that are closest (Cl Knn) or farthest (Far Knn) from the minority class examples, and allows one to investigate the effect of noise/outliers on SVM performance. We observed that these odd instances have a severe performance degrading effect. RUS and Far Knn approaches can significantly improve performance with respect to the baseline, thanks to their abilities to reduce noise/outlier levels and to put more emphasis on the minority class. Having said this, they usually perform worse than any of the oversampling approaches. This is in sheer contrast to the results obtained from traditional studies and is heavily related to the intrinsic nature of behaviour data. Timing-wise, RUS is the fastest method. The informed approaches are slower than RUS due to the required nearest neighbour computations. Additionally, a cluster-based undersampling (CBU) technique was proposed to alleviate the within-class imbalance. For low undersampling rates, the performance of CBU is superior to plain RUS, whereas for high undersampling rates the opposite is true. Overall, no substantial gain in performance is obtained in comparison with a simple RUS approach. Furthermore, the method is the slowest among all techniques considered and relies on computationally expensive clustering heuristics.
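The "farthest Knn" selection can be sketched as follows; Euclidean distance on a dense toy matrix is an assumption for illustration, whereas the paper works with similarity measures suited to sparse behaviour data:

```python
import numpy as np

def far_knn_undersample(X_maj, X_min, n_keep, k=3):
    """Informed undersampling sketch: rank each majority instance by its
    mean distance to its k nearest minority neighbours and keep the n_keep
    farthest ones, dropping majority points that overlap the minority
    region (likely noise/outliers)."""
    d = np.linalg.norm(X_maj[:, None, :] - X_min[None, :, :], axis=2)
    knn_dist = np.sort(d, axis=1)[:, :k].mean(axis=1)
    keep = np.argsort(-knn_dist)[:n_keep]    # farthest-first selection
    return X_maj[keep]
```

Flipping the sort order ("closest-first") yields the Cl Knn variant, which retains exactly the overlapping instances and explains its performance degrading effect.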

We studied the effects of boosting, where we made use of a very specific linear SVM and logistic regression combination that uses confidence rated predictions, instead of a more popular plain "discrete" boosting algorithm (with weak hypotheses having values in {−1, +1}). Our experiments clearly indicated the regularization constant C in the SVM formulation to act as a "weakness" indicator. Indeed, higher C-values cause stronger learners and shouldn't be used in the boosting process, which matches with the analysis of studies dealing with traditional data. Strong learners can already quite accurately distinguish majority class from minority class behaviours. The boosting process will focus on the hard to learn instances, among which are noise/outliers, and this can result in overfitting behaviour. We also observed that boosting can usually outperform the plain baseline, indicating that a combination of weak learners can achieve a higher AUC-performance than a single strong learner.
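A sketch of such a confidence-rated boosting loop; `weak_fit` is a hypothetical stand-in for the paper's weighted SVM plus logistic regression pair (a weighted ridge fit is used here), returning a real-valued scoring function:

```python
import numpy as np

def confidence_boost(X, y, weak_fit, T=20):
    """Confidence-rated boosting (after Schapire and Singer): each weak
    hypothesis h_t outputs a real-valued score; instance weights are
    multiplied by exp(-y * h_t(x)), so confidently correct examples lose
    weight and mistakes gain weight.  The ensemble score is sum_t h_t(x)."""
    D = np.full(len(y), 1.0 / len(y))        # uniform initial distribution
    hypotheses = []
    for _ in range(T):
        h = weak_fit(X, y, D)                # returns a callable score fn
        D *= np.exp(-y * h(X))
        D /= D.sum()                         # renormalize the distribution
        hypotheses.append(h)
    return lambda Xt: sum(h(Xt) for h in hypotheses)

def weak_fit(X, y, D, reg=1.0):
    """Weighted ridge fit as a deliberately weak real-valued learner;
    larger reg keeps the hypothesis weaker, the role a small C plays
    for the SVM in the paper."""
    A = X.T @ (D[:, None] * X) + reg * np.eye(X.shape[1])
    w = np.linalg.solve(A, X.T @ (D * y))
    return lambda Xt: Xt @ w
```

Raising `reg` (or lowering C in the SVM) shrinks each hypothesis, which keeps the reweighting gentle and lets the ensemble generalize rather than chase noisy instances.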

EasyEnsemble (EE) integrates random undersampling with a boosting process. By sampling several balanced subsets from the training data and combining these, we can emphasize the minority class, achieve noise/outlier reduction effects, and simultaneously explore the majority class space efficiently. Our version differs from the original formulation because we employ a different type of underlying weak learner (an SVM and LR combination) that is fed to a confidence rated boosting algorithm (instead of the discrete version). We observed superior AUC-performance with respect to all of the aforementioned techniques. Furthermore, the method is very fast, since each subset can be trained in parallel and has a size that is twice as large as the minority class training set. We studied the effect of the number of subsets and noted that a limited amount of these (O(10)) is already sufficient to obtain satisfactory performances; increasing this number further causes only minor benefits.

Additionally, a statistical comparison of each of the proposed methods was conducted across a wide variety of datasets. Holm's significance test indicates that each of the proposed techniques (oversampling, undersampling (except Cl Knn), cost sensitive learning and boosting variants) can significantly outperform the plain baseline performance (each of the null-hypotheses of equal performance gets rejected with a confidence level of at least 91%). Comparison of the techniques with EE showed that the latter can significantly outperform all of them (each of the null-hypotheses of equal performance gets rejected with a confidence level of at least 75%). A timing-wise comparison showed EE to be the preferred method (after RUS) for large and highly imbalanced datasets. A parallel implementation would make EE the fastest method, even for medium sized datasets that show a high level of imbalance.

The influence of the chosen base learner was investigated by comparing the baseline to its EE-version for the SVM, LR, NB and BeSim inducers. Each of the EE-versions is able to improve upon the baseline performance for nearly all datasets, where base learners of the regularization type are superior because of their ability to control the strength of the learner.

Though we conducted an extensive benchmark study, there are still plenty of future research directions. First, and most important, there is an urgent need to develop a more general theoretical framework to completely understand the effect of imbalance and why certain methods outperform others. Secondly, other types of mechanisms could be investigated in this setting: advanced sampling (Guo et al 2008) set-ups perform sampling based on preliminary classifications; one-class learning is a recognition based methodology where a discriminative boundary is learned around the training examples of a single class alone, and Raskutti and Kowalczyk (2004) demonstrated the superiority of the recognition based one-class SVMs over the discriminative two-class SVMs; other types of ensemble learning strategies (e.g. bagging and stacking) could be investigated; Batista et al (2004) note that combining over- and undersampling can be beneficial when data are highly imbalanced or show very few minority class instances; the ideas behind Extended Nearest Neighbour (Tang and He 2015) could guide us in the sampling process. As opposed to Knn, "it considers not only who are the nearest neighbours of a certain sample, but also who consider the particular sample as their nearest neighbour" (Tang and He 2015). The method assigns a class membership to an unknown sample that maximizes the intra-class coherence. Finally, more advanced techniques exist that can determine the K nearest neighbours faster or with (slightly) better performance.

Another issue relates to the lack of comprehensibility of the EasyEnsemble technique in its present form. Because we currently integrate logistic regression in the boosting process, the final ensemble classifier is non-linear and therefore incomprehensible. It might be interesting to investigate the numerical boosting technique proposed in Schapire and Singer (1999), where each weak learner is allowed to output real values (instead of values in [−1, 1]). In that case, we would be able to use a plain linear SVM as a weak learner, without the logistic regression component. The final ensemble would be a linear model, which is surely easier to comprehend. Also, the instance-based approach of Martens and Provost (2014) could be used to explain individual predictions of the non-linear EE-technique.
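The comprehensibility argument rests on a simple identity: a sum of linear hypotheses is itself linear, so such an ensemble collapses to a single (w, b) pair. A minimal illustration:

```python
import numpy as np

def collapse_linear_ensemble(weights, intercepts):
    """If every weak hypothesis is linear, h_t(x) = w_t . x + b_t, then the
    boosted score F(x) = sum_t h_t(x) equals one linear model with
    w = sum_t w_t and b = sum_t b_t."""
    return np.sum(weights, axis=0), float(np.sum(intercepts))
```

This is precisely why dropping the (non-linear) logistic component would leave a single interpretable linear model.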

A Oversampling experiments

Table 11 Oversampling experiments Results show the average tenfold AUC test set performance withrespect to increasing β -values (β = [β1β2β3β4] = [013231]) Additionally standard deviationsare shown between brackets Optimal performances for each method are highlighted in boldface SeeSection 42 for parameter settings

MOV G(p = 1)β1 β2 β3 β4

OSR 716 (262) 7437 (204) 736 (184) 7473 (245)SMOTE 716 (262) 7508 (218) 7602 (214) 7648 (23)

ADASYN 716 (262) 7516 (192) 7593 (208) 7647 (229)MOV G(p = 25)

β1 β2 β3 β4OSR 8141 (132) 8349 (181) 8384 (196) 8391 (204)

SMOTE 8141 (132) 8332 (197) 8359 (204) 8376 (211)ADASYN 8141 (132) 8361 (182) 8402 (197) 8369 (196)

Mov Th(p = [])β1 β2 β3 β4

OSR 7977 (533) 853 (466) 8316 (45) 8459 (569)SMOTE 7977 (533) 8418 (651) 8558 (597) 8433 (575)

ADASYN 7977 (533) 8411 (677) 8586 (606) 8536 (513)Yahoo A(p = 1)

β1 β2 β3 β4OSR 5592 (297) 5866 (327) 5999 (228) 5974 (178)

SMOTE 5592 (297) 5976 (262) 5974 (267) 5943 (24)ADASYN 5592 (297) 5954 (253) 5955 (294) 5956 (222)

Yahoo A(p = 25)β1 β2 β3 β4

OSR 6168 (242) 6419 (317) 6508 (326) 6467 (21)SMOTE 6168 (242) 6546 (363) 6533 (323) 6452 (298)

ADASYN 6168 (242) 6504 (374) 6541 (347) 644 (221)Continues on next page

44 Jellis Vanhoeyveld David Martens

Yahoo G(p = 1)
         β1            β2            β3            β4
OSR      66.84 (3.66)  72.18 (2.36)  73.11 (2.7)   72.49 (3.41)
SMOTE    66.84 (3.66)  72.65 (2.85)  73.27 (3.36)  73.37 (3.56)
ADASYN   66.84 (3.66)  72.87 (2.83)  73.18 (3.2)   73.39 (3.59)

Yahoo G(p = 25)
         β1            β2            β3            β4
OSR      78.82 (1.39)  78.78 (2.02)  78.74 (1.86)  78.35 (1.47)
SMOTE    78.82 (1.39)  79.23 (1.57)  79.1 (1.2)    79.03 (1.89)
ADASYN   78.82 (1.39)  79.12 (1.43)  79.23 (1.37)  79.51 (2.01)

TaFeng(p = 1)
         β1            β2            β3            β4
OSR      55.75 (1.6)   59.23 (1.96)  60 (1.68)     61.04 (2.36)
SMOTE    55.75 (1.6)   60.26 (1.95)  61.49 (1.8)   61.13 (1.52)
ADASYN   55.75 (1.6)   60.26 (1.9)   61.44 (1.85)  61.16 (1.5)

TaFeng(p = 25)
         β1            β2            β3            β4
OSR      66.94 (1.34)  68.84 (1.21)  66.99 (1.42)  67.7 (1.41)
SMOTE    66.94 (1.34)  68.47 (1.5)   67.07 (1.15)  66.65 (0.81)
ADASYN   66.94 (1.34)  68.62 (1.38)  67.85 (1.6)   66.91 (1.39)

Book(p = 1)
         β1            β2            β3            β4
OSR      52.6 (1.29)   53.61 (0.94)  55.41 (1.75)  55.87 (1.44)
SMOTE    52.6 (1.29)   54.77 (0.99)  54.91 (0.8)   54.36 (0.98)
ADASYN   52.6 (1.29)   54.86 (1.13)  55.06 (0.73)  54.54 (0.92)

Book(p = 25)
         β1            β2            β3            β4
OSR      60.08 (0.71)  61.05 (0.94)  61.82 (1.12)  64.62 (0.57)
SMOTE    60.08 (0.71)  62.6 (0.73)   60.95 (0.68)  63 (0.8)
ADASYN   60.08 (0.71)  62.33 (0.73)  60.77 (0.85)  63.04 (0.58)

LST(p = 1)
         β1            β2            β3            β4
OSR      99.99 (0.01)  99.99 (0.01)  99.99 (0.01)  99.99 (0.01)
SMOTE    99.99 (0.01)  99.99 (0.01)  99.99 (0.01)  99.99 (0.01)
ADASYN   99.99 (0.01)  99.99 (0.01)  99.99 (0.01)  99.99 (0.01)

Adver(p = [])
         β1            β2            β3            β4
OSR      96.61 (1.82)  97.31 (1.65)  97.07 (1.84)  97.07 (1.79)
SMOTE    96.61 (1.82)  96.91 (1.66)  97.19 (1.65)  97.07 (1.91)
ADASYN   96.61 (1.82)  97.1 (1.7)    97.08 (1.87)  97.07 (1.88)

Adver(p = 1)
         β1            β2            β3            β4
OSR      90.93 (3.02)  91.27 (3.03)  92.66 (2.82)  93.29 (1.97)
SMOTE    90.93 (3.02)  92.51 (2.03)  92.96 (2.14)  93.53 (1.81)
ADASYN   90.93 (3.02)  92.22 (2.33)  92.7 (2.36)   93.88 (1.73)

CRF(p = [])
         β1            β2            β3            β4
OSR      64.06 (16.43) 80.82 (12.94) 81.28 (12.27) 81.91 (11.28)
SMOTE    64.06 (16.43) 78.64 (16.86) 82.52 (13.74) 79.32 (16.26)
ADASYN   64.06 (16.43) 78.95 (16.72) 81.19 (16.32) 79.31 (16.19)

Bank(p = [])
         β1            β2            β3            β4
OSR      66.82 (0.88)  70.1 (0.74)   71.39 (0.8)   71.47 (0.8)
SMOTE    []            []            []            []
ADASYN   []            []            []            []
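The three oversampling schemes compared above all enlarge the minority class; SMOTE (and its adaptive variant ADASYN) does so by interpolating between a minority instance and one of its k nearest minority neighbours. A minimal sketch of that interpolation step is given below. It is illustrative only: the paper applies these methods to sparse, high-dimensional behaviour data, and the function names here are our own.

```python
import random

def euclidean(a, b):
    # Plain Euclidean distance between two equal-length vectors.
    return sum((x - y) ** 2 for x, y in zip(a, b)) ** 0.5

def smote(minority, n_synthetic, k=5, seed=0):
    """Generate n_synthetic points: each is an interpolation between a
    randomly sampled minority instance and one of its k nearest
    minority-class neighbours."""
    rng = random.Random(seed)
    synthetic = []
    for _ in range(n_synthetic):
        base = rng.choice(minority)
        neighbours = sorted((p for p in minority if p is not base),
                            key=lambda p: euclidean(base, p))[:k]
        nb = rng.choice(neighbours)
        gap = rng.random()  # interpolation factor in [0, 1)
        synthetic.append(tuple(b + gap * (n - b) for b, n in zip(base, nb)))
    return synthetic

minority = [(0.0, 0.0), (1.0, 0.0), (0.0, 1.0), (1.0, 1.0)]
new_points = smote(minority, n_synthetic=4, k=2)
```

Because each synthetic point lies on a segment between two existing minority instances, oversampling in this way stays inside the minority region rather than duplicating points exactly (as OSR does).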

Imbalanced classification in sparse and large behaviour datasets 45

B Undersampling experiments

Table 12 Undersampling experiments. Results show the average tenfold AUC test set performance with respect to increasing βu-values (βu = [βu1, βu2, βu3, βu4, βu5] = [0, 1/4, 1/2, 3/4, 1]). Optimal performances (boldface) are highlighted for each imbalance ratio. Additionally, standard deviations are shown between brackets. CL K represents the "Closest Knn" technique; CL T represents the "Closest tot sim" technique (similar for Far K and Far T, see Section 3.2). Parameter settings are given in Section 4.2.

Mov G(p = 1)
         βu1          βu2          βu3          βu4          βu5
RUS      71.6 (2.6)   71.83 (2.6)  72.54 (2.5)  72.39 (3.1)  70.61 (3.5)
Cl K     71.6 (2.6)   71.4 (2)     70.96 (1.9)  70.43 (2.4)  69.05 (4.1)
CL T     71.6 (2.6)   70.28 (2.5)  66.74 (2)    66.8 (2.1)   68.18 (3.6)
Far K    71.6 (2.6)   72.36 (2.7)  71.26 (3.4)  66.57 (5.2)  53.5 (3.5)
Far T    71.6 (2.6)   72.22 (2.8)  71.63 (3.6)  64.28 (5.3)  50.88 (4.4)
CBU      72.55 (2.6)  73.28 (2.6)  73.12 (2.6)  73.84 (2.5)  73 (3.1)

Mov G(p = 25)
         βu1          βu2          βu3          βu4          βu5
RUS      81.41 (1.3)  81.36 (1.3)  81.78 (1.7)  82.05 (1.7)  81.6 (2.1)
Cl K     81.41 (1.3)  80.86 (1.2)  80.95 (1.6)  79.73 (2.3)  77.95 (2.3)
CL T     81.41 (1.3)  79.9 (1.2)   78.21 (1.4)  77.87 (1.5)  77.76 (2.3)
Far K    81.41 (1.3)  80.9 (1.5)   78.17 (1.8)  74.25 (2.4)  69.79 (3.2)
Far T    81.41 (1.3)  80.86 (1.5)  77.2 (2.4)   71.16 (2.7)  62.4 (2.8)
CBU      81.53 (1.4)  81.64 (1.3)  81.29 (1.6)  81.28 (2.1)  80.34 (2.7)

Mov Th(p = [])
         βu1          βu2          βu3          βu4          βu5
RUS      79.77 (5.3)  80.32 (5.8)  81.57 (5.5)  81.86 (6.6)  81.26 (6.2)
Cl K     79.77 (5.3)  79.25 (4.5)  78.07 (5)    76.25 (6.5)  62.46 (8.5)
CL T     79.77 (5.3)  78.4 (4.4)   72.41 (3.5)  64.66 (4.5)  60.37 (7.3)
Far K    79.77 (5.3)  84.54 (5)    83.64 (6.4)  80.02 (7.3)  56.82 (10.3)
Far T    79.77 (5.3)  85.03 (5.7)  82.68 (6.8)  75.61 (9.2)  56.77 (10.9)
CBU      80.11 (5.8)  81.17 (6)    81.08 (6.5)  84.17 (5.1)  80.96 (6.9)

Yahoo A(p = 1)
         βu1          βu2          βu3          βu4          βu5
RUS      55.92 (3)    55.57 (3.4)  56.44 (3)    55.83 (3.4)  56.37 (3.3)
Cl K     55.92 (3)    55.67 (2.4)  53.12 (2)    50.57 (1.8)  53.79 (3.5)
CL T     55.92 (3)    55.69 (2.1)  53.35 (2.2)  50.31 (2.2)  52.35 (3.3)
Far K    55.92 (3)    57.35 (2.2)  56.92 (1.1)  56.95 (2.3)  51.18 (2)
Far T    55.92 (3)    56.93 (2.4)  54.74 (1.9)  57.01 (1.8)  51.18 (2)
CBU      58.21 (2.6)  58.45 (3.3)  58.31 (3.5)  58.39 (3.5)  56.09 (2.6)

Yahoo A(p = 25)
         βu1          βu2          βu3          βu4          βu5
RUS      61.68 (2.4)  62.9 (2.9)   63.62 (3.6)  63.75 (3.1)  63.19 (1.9)
Cl K     61.68 (2.4)  61.14 (2.1)  57.62 (1.6)  54.02 (1.8)  51.48 (1.4)
CL T     61.68 (2.4)  60.89 (2.8)  58.11 (1.4)  54.4 (2.1)   51.76 (1.4)
Far K    61.68 (2.4)  63.96 (3)    62.62 (2.2)  59.61 (1.5)  56.25 (1.6)
Far T    61.68 (2.4)  63.71 (2.4)  59.72 (1.6)  57.27 (1.1)  54.47 (1.1)
CBU      62.46 (2.6)  61.85 (1.4)  61.78 (2.2)  59.94 (3)    60.1 (4)


Yahoo G(p = 1)
         βu1          βu2          βu3          βu4          βu5
RUS      66.84 (3.7)  67.85 (3.2)  68.36 (3.2)  68.23 (4)    69.9 (4.2)
Cl K     66.84 (3.7)  66.71 (2.8)  64.3 (3.6)   61.98 (3.9)  61.15 (1.9)
CL T     66.84 (3.7)  65.79 (2.7)  63.55 (3.3)  59.21 (3.5)  61.08 (2.4)
Far K    66.84 (3.7)  66.76 (4.1)  63.84 (3.4)  65.16 (2)    48.5 (2.9)
Far T    66.84 (3.7)  66.95 (4.1)  63.48 (2.9)  65.16 (2)    48.48 (2.9)
CBU      69.68 (4.1)  70.59 (3.2)  70.64 (3.7)  70.2 (2.9)   63.35 (3.6)

Yahoo G(p = 25)
         βu1          βu2          βu3          βu4          βu5
RUS      78.82 (1.4)  78.91 (1.6)  78.97 (1.6)  78.61 (1.6)  77.82 (2.1)
Cl K     78.82 (1.4)  77.26 (1.5)  72.52 (1.5)  67.86 (2)    65.07 (2.7)
CL T     78.82 (1.4)  76.83 (1)    71.99 (1.8)  67.15 (2.3)  61.1 (2.7)
Far K    78.82 (1.4)  78.26 (2.2)  74.69 (2.7)  67.22 (2.1)  60.72 (2.3)
Far T    78.82 (1.4)  77.68 (2.6)  72.44 (3)    64.94 (2.4)  59.6 (2)
CBU      75.25 (3.2)  75.22 (2.4)  74.69 (2.3)  73.07 (2.4)  70.69 (2.4)

TaFeng(p = 1)
         βu1          βu2          βu3          βu4          βu5
RUS      55.75 (1.6)  56.1 (1.6)   56.26 (1.7)  57.23 (1.7)  59.25 (2.2)
Cl K     55.75 (1.6)  55.68 (1.6)  55.58 (1.5)  55.08 (1.1)  51.05 (1.5)
CL T     55.75 (1.6)  55.67 (1.6)  54.47 (1.6)  47.53 (1.6)  49.3 (1.1)
Far K    55.75 (1.6)  58.99 (1.2)  59.47 (1.1)  60.04 (1.2)  56.31 (1)
Far T    55.75 (1.6)  58.92 (1.3)  59.25 (1.3)  58.58 (1.1)  56.31 (1)
CBU      57.8 (1)     58.47 (1.1)  58.15 (0.9)  58.87 (1.4)  57.65 (1.6)

TaFeng(p = 25)
         βu1          βu2          βu3          βu4          βu5
RUS      66.94 (1.3)  67.44 (1.3)  68.1 (1.4)   68.27 (1.4)  66.13 (1.2)
Cl K     66.94 (1.3)  66.13 (1.4)  63.39 (1.2)  59.83 (1.3)  56.94 (0.7)
CL T     66.94 (1.3)  66.38 (1.5)  62.89 (1.6)  57.46 (1.3)  54.56 (1.3)
Far K    66.94 (1.3)  68.06 (1.4)  66.43 (1.6)  64.46 (1.5)  63.35 (1.3)
Far T    66.94 (1.3)  64.31 (1.1)  62.69 (1)    61.27 (1.1)  59.03 (1)
CBU      64.81 (1.2)  64.15 (1.1)  64.13 (1.2)  63.88 (0.8)  63.46 (0.8)

Book(p = 1)
         βu1          βu2          βu3          βu4          βu5
RUS      52.6 (1.3)   52.79 (0.9)  53.46 (0.8)  53.89 (0.9)  54.05 (0.9)
Cl K     52.6 (1.3)   52.56 (1.2)  52.52 (1.3)  52.39 (1.1)  53.09 (1.1)
CL T     52.6 (1.3)   52.56 (1.2)  52.52 (1.3)  52.39 (1.1)  53.05 (0.7)
Far K    52.6 (1.3)   55.21 (1.2)  56.21 (1.8)  56.14 (1.2)  53.06 (1)
Far T    52.6 (1.3)   55.21 (1.2)  56.21 (1.8)  56.14 (1.2)  53.06 (1)
CBU      54.28 (0.9)  53.77 (1)    53.33 (1.1)  53.34 (0.9)  52.84 (0.8)

Book(p = 25)
         βu1          βu2          βu3          βu4          βu5
RUS      60.08 (0.7)  60.13 (0.6)  60.4 (0.8)   60.33 (0.8)  63.28 (0.8)
Cl K     60.08 (0.7)  59.96 (0.7)  60.13 (0.8)  59.96 (1)    59.28 (0.7)
CL T     60.08 (0.7)  59.96 (0.7)  60.13 (0.8)  60.29 (0.4)  54.5 (0.9)
Far K    60.08 (0.7)  63.29 (1)    64.19 (0.8)  57.3 (1.1)   55.66 (1.1)
Far T    60.08 (0.7)  62.14 (0.5)  58.27 (0.6)  56.37 (1)    55.66 (1.1)
CBU      54.82 (0.9)  54.67 (0.9)  54.71 (0.9)  54.66 (1)    54.78 (0.9)


LST(p = 1)
         βu1          βu2          βu3          βu4          βu5
RUS      99.99 (0)    99.99 (0)    99.99 (0)    99.98 (0)    99.99 (0)
Cl K     99.99 (0)    99.99 (0)    99.99 (0)    99.99 (0)    99.99 (0)
CL T     99.99 (0)    99.99 (0)    99.99 (0)    99.99 (0)    99.98 (0)
Far K    99.99 (0)    99.98 (0)    99.98 (0)    99.98 (0)    99.98 (0)
Far T    99.99 (0)    99.98 (0)    99.98 (0)    99.98 (0)    99.98 (0)
CBU      []           []           []           []           []

Adver(p = [])
         βu1          βu2          βu3          βu4          βu5
RUS      96.61 (1.8)  96.32 (1.8)  96.63 (1.4)  97.12 (2.1)  96.22 (1.6)
Cl K     96.61 (1.8)  96.44 (1.5)  96.14 (1.5)  96.04 (2)    94.8 (2.5)
CL T     96.61 (1.8)  95.87 (2.1)  94.32 (1.9)  93.01 (2.2)  90.72 (2.3)
Far K    96.61 (1.8)  96.53 (1.4)  95.76 (2)    94.39 (1.8)  90.49 (3.1)
Far T    96.61 (1.8)  96.54 (1.5)  95.67 (1.9)  94.54 (1.8)  89.3 (2.8)
CBU      96.85 (2.3)  96.85 (2.3)  97.05 (1.5)  96.6 (1.6)   96.06 (2.1)

Adver(p = 1)
         βu1          βu2          βu3          βu4          βu5
RUS      90.93 (3)    91.53 (3.1)  92.37 (3.4)  91.9 (2.9)   91.93 (2.2)
Cl K     90.93 (3)    90.64 (3)    89.87 (3.9)  90.21 (3.6)  89.18 (2)
CL T     90.93 (3)    89.7 (3.5)   88.55 (3.4)  85.76 (3.3)  88.2 (2.3)
Far K    90.93 (3)    93.8 (2.3)   92.4 (2.6)   88.73 (3.4)  85.51 (4)
Far T    90.93 (3)    93.62 (2.4)  93.2 (2.2)   88.41 (3.6)  85.51 (4)
CBU      93.22 (2.4)  93.76 (2.5)  93.89 (2.6)  93.52 (2.7)  91.27 (2)

CRF(p = [])
         βu1          βu2          βu3          βu4          βu5
RUS      64.06 (16.4) 63.28 (15.9) 67.98 (17.4) 66.95 (21.9) 87.73 (8.8)
Cl K     64.06 (16.4) 62.44 (16.6) 62.34 (16.9) 71.37 (13.8) 78.22 (17.7)
CL T     64.06 (16.4) 62.44 (16.6) 62.34 (16.9) 71.37 (13.8) 62.67 (22.9)
Far K    64.06 (16.4) 83.8 (14.2)  83.93 (14.8) 84.49 (13.7) 86.11 (9.7)
Far T    64.06 (16.4) 83.8 (14.2)  83.93 (14.8) 84.49 (13.7) 86.11 (9.7)
CBU      []           []           []           []           []

Bank(p = [])
         βu1          βu2          βu3          βu4          βu5
RUS      66.82 (0.9)  67.02 (0.9)  67.37 (0.8)  67.99 (0.6)  69.5 (1)
Cl K     66.82 (0.9)  66.17 (0.7)  65.24 (0.6)  64.86 (0.6)  58.53 (1.1)
CL T     66.82 (0.9)  64.92 (1.1)  60.69 (0.9)  56.33 (0.8)  52.87 (0.7)
Far K    66.82 (0.9)  66.95 (0.6)  66.19 (0.6)  64.42 (0.6)  58.25 (1.1)
Far T    66.82 (0.9)  67.16 (0.6)  64.2 (0.8)   59.67 (1)    58.25 (1.1)
CBU      []           []           []           []           []
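The undersampling variants above differ only in how the retained majority instances are selected: RUS samples them uniformly at random, while the informed Knn variants rank majority instances by their distance to the minority class and keep either the closest or the farthest ones (Section 3.2). An illustrative sketch of this selection logic follows; it is our own simplification, not the paper's similarity-based implementation for sparse behaviour data.

```python
import random

def knn_distance(x, minority, k=3):
    """Mean distance from x to its k nearest minority instances."""
    d = sorted(sum((a - b) ** 2 for a, b in zip(x, m)) ** 0.5 for m in minority)
    return sum(d[:k]) / k

def undersample(majority, minority, n_keep, mode="random", k=3, seed=0):
    if mode == "random":            # RUS: uniform random selection
        return random.Random(seed).sample(majority, n_keep)
    # Informed selection: rank majority instances by proximity to minority.
    ranked = sorted(majority, key=lambda x: knn_distance(x, minority, k))
    if mode == "closest":           # keep majority instances nearest the minority
        return ranked[:n_keep]
    return ranked[-n_keep:]         # "farthest": keep the most distant ones

majority = [(float(i), 0.0) for i in range(10)]
minority = [(0.0, 0.0)]
closest = undersample(majority, minority, 3, mode="closest", k=1)
farthest = undersample(majority, minority, 3, mode="farthest", k=1)
```

The degradation of the "closest" variants visible in Table 12 is consistent with this design: the retained borderline instances include exactly the odd-behaviour examples the paper identifies as harmful.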


C Boosting experiments

This section presents the results of each of the boosting experiments from Section 4.5 on the remaining data sources. Each of the figures below depicts the average tenfold AUC-performance on test data (with micro = 100) with respect to the number of boosting iterations T for (left) Adaboost (AB), AdaCost (AC) and EasyEnsemble (EE), with C chosen according to the highest validation set AUC-performance (over all possible boosting rounds), and (right) AB and EE (with S = 15) with varying C-levels.
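The EasyEnsemble (EE) method evaluated in these experiments draws S balanced subsets of the majority class, trains a boosted learner on each, and combines the resulting hypotheses. The skeleton below shows only that subset-sampling and score-averaging structure; the inner boosted SVM of the paper is replaced by a trivial class-mean scorer, and all names are our own, so this is a sketch rather than the authors' implementation.

```python
import random

def train_scorer(pos, neg):
    """Stand-in for the boosted base learner: score = squared distance to
    the negative-class mean minus squared distance to the positive-class mean."""
    def mean(xs):
        return [sum(c) / len(xs) for c in zip(*xs)]
    mp, mn = mean(pos), mean(neg)
    def score(x):
        dp = sum((a - b) ** 2 for a, b in zip(x, mp))
        dn = sum((a - b) ** 2 for a, b in zip(x, mn))
        return dn - dp          # higher = more positive-like
    return score

def easy_ensemble(pos, neg, S=5, seed=0):
    """Draw S random majority subsets of size |pos|, train a scorer on each
    balanced set, and average the S scores at prediction time."""
    rng = random.Random(seed)
    scorers = [train_scorer(pos, rng.sample(neg, len(pos))) for _ in range(S)]
    return lambda x: sum(s(x) for s in scorers) / len(scorers)

pos = [(1.0, 1.0), (0.9, 1.1)]
neg = [(0.0, 0.0), (0.1, -0.1), (-0.1, 0.1), (0.0, 0.2)]
f = easy_ensemble(pos, neg, S=3)
```

Because each learner only sees a subset twice the minority size, the majority space is explored across subsets while each individual training problem stays small and balanced.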

[Figure: panel (a) plots AUC test against T = 0-30 for AB, AC(R2), AC(R8), AC(RD), EE(S5), EE(S10), EE(S15) and BL; panel (b) plots AUC test against T for AB and EE with C = 1e-07, 1e-05, 0.001 and 0.1, plus BL.]

Fig 6 Mov G(p = 1) dataset

[Figure: panels (a) and (b) as described at the start of this appendix.]

Fig 7 Mov Th(p = []) dataset


[Figure: panels (a) and (b) as described at the start of this appendix.]

Fig 8 Yahoo A(p = 1) dataset

[Figure: panels (a) and (b) as described at the start of this appendix.]

Fig 9 Yahoo A(p = 25) dataset

[Figure: panels (a) and (b) as described at the start of this appendix.]

Fig 10 Yahoo G(p = 1) dataset


[Figure: panels (a) and (b) as described at the start of this appendix.]

Fig 11 Yahoo G(p = 25) dataset

[Figure: panels (a) and (b) as described at the start of this appendix.]

Fig 12 TaFeng(p = 1) dataset

[Figure: panels (a) and (b) as described at the start of this appendix.]

Fig 13 Book(p = 1) dataset


[Figure: panels (a) and (b) as described at the start of this appendix.]

Fig 14 LST(p = 1) dataset

[Figure: panels (a) and (b) as described at the start of this appendix.]

Fig 15 Adver(p = []) dataset

[Figure: panels (a) and (b) as described at the start of this appendix.]

Fig 16 Adver(p = 1) dataset


[Figure: panels (a) and (b) as described at the start of this appendix.]

Fig 17 CRF(p = []) dataset

D Final Comparison

[Figure: scatter plot of average rank AUC (x-axis, 0-14) versus average rank Time (y-axis, 0-18) for BL, OSR, SMOTE, ADASYN, RUS, Cl Knn, Far Knn, CBU, AB, AC(R=28), AC(R=RL), EE(S=10), EE(S=15) and EE par.]

Fig 18 Average rank AUC versus average rank Time (see Table 9) across the large datasets from Table 2 (CRF and Bank). Regarding the former, an extra row EE par is added, having the same AUC-value as EE(S = 15). Ranks for each dataset are subsequently obtained via the procedure outlined in Section 4.6.1. Points occurring in the upper-right region are preferred.




Table 9 Computational timings (in seconds) averaged across ten folds, using the parameter combination resulting in the highest validation set AUC-performance. First and second best performances are emphasized in boldface (third and fourth best are underlined). EE par is the time required for EE(S = 15) divided by 15.

           Mov G(p = 1)  Mov G(p = 25)  Mov Th(p = [])  Yahoo A(p = 1)
BL         0.032889      0.056697       0.558563        0.026922
OSR        0.055043      0.062802       0.99009         0.044421
SMOTE      0.218821      0.937057       3.841482        0.057726
ADASYN     0.284688      1.802399       5.191265        0.087694
RUS        0.011431      0.025383       0.155224        0.007991
CL Knn     0.046599      0.599846       0.989914        0.037182
Far Knn    0.039887      0.80072        0.683023        0.027788
CBU        1.034111      1.060173       6.822839        1.692477
AB         0.169792      0.841443       3.460246        0.139251
AC(R=28)   0.471994      2.996585       1.086907        0.366555
AC(R=RL)   0.53376       1.179542       6.065177        0.209015
EE(S=10)   0.117226      6.065145       1.17995         0.148973
EE(S=15)   0.20474       7.173737       2.119991        0.180365
EE par     0.013649      0.478249       0.141333        0.012024

           Yahoo A(p = 25)  Yahoo G(p = 1)  Yahoo G(p = 25)  TaFeng(p = 1)
BL         0.092954         0.011915        0.044164         0.026728
OSR        0.027887         0.013241        0.047206         0.040919
SMOTE      1.062686         0.056153        0.883698         0.219553
ADASYN     2.050993         0.079073        1.733367         0.306618
RUS        0.048471         0.003234        0.033423         0.002916
CL Knn     0.84391          0.025404        0.502515         0.092167
Far Knn    0.664124         0.026576        0.500206         0.080159
CBU        1.569442         1.287221        1.355035         2.467279
AB         0.445546         0.078777        0.169977         0.114619
AC(R=28)   1.034044         0.321723        0.515953         0.926178
AC(R=RL)   0.706215         0.226741        0.112949         0.610233
EE(S=10)   1.026577         0.100331        1.527146         0.058052
EE(S=15)   1.607596         0.077483        2.472582         0.10538
EE par     0.107173         0.005166        0.164839         0.007025

           TaFeng(p = 25)  Book(p = 1)  Book(p = 25)  LST(p = 1)
BL         0.032033        0.080035     0.318093      0.652045
OSR        0.032414        0.132927     0.092757      0.87152
SMOTE      5.089283        3.409418     1.143444      4.987705
ADASYN     8.148419        3.689661     1.225441      6.840083
RUS        0.020457        0.022713     0.031972      0.432839
CL Knn     1.713731        0.400873     3.711648      2.508374
Far Knn    1.539437        0.379086     3.988552      2.511037
CBU        2.642686        4.198663     4.631987      []
AB         0.713265        0.61719      1.238585      2.466151
AC(R=28)   1.234647        1.666131     2.330635      1.451671
AC(R=RL)   0.279047        0.860346     0.197053      1.23763
EE(S=10)   2.484502        2.145747     7.177484      0.524066
EE(S=15)   3.363971        2.480066     11.21945      0.784111
EE par     0.224265        0.165338     0.747963      0.052274

Imbalanced classification in sparse and large behaviour datasets 37

Table 9 Continued. Additionally, an extra column indicating average ranks is included (lower ranks are preferred).

            Adver(p = [])  Adver(p = 1)  CRF(p = [])  Bank(p = [])
BL          0.010953       0.002796      0.725911     7.089334
OSR         0.012178       0.006166      3.685813     1.797481
SMOTE       0.123112       0.017764      5.633862     []
ADASYN      0.183767       0.021728      5.768669     []
RUS         0.012115       0.00204       0.147392     5.247441
CL Knn      0.061324       0.005568      1.106755     7.373282
Far Knn     0.079078       0.007069      1.110379     9.759619
CBU         3.378235       3.236754      []           []
AB          0.069199       0.103518      1.153196     8.308618
AC(R = 28)  0.193092       0.068905      2.047434     7.170548
AC(R = RL)  0.107652       0.037963      1.387174     1.063466
EE(S = 10)  0.138485       0.085686      0.198656     24.95117
EE(S = 15)  0.185136       0.139121      0.285345     36.40107
EE par      0.012342       0.009275      0.019023     2.426738

            Average Rank [pos]
BL          2.94   [2]
OSR         4.19   [4]
SMOTE       9.59   [11]
ADASYN      10.91  [13]
RUS         1.38   [1]
CL Knn      6.5    [5]
Far Knn     6.56   [6]
CBU         14     [14]
AB          8.06   [7]
AC(R = 28)  10.81  [12]
AC(R = RL)  9.25   [9]
EE(S = 10)  8.25   [8]
EE(S = 15)  9.56   [10]
EE par      3      [3]

Timings are a substantial concern with respect to these large datasets. As can be observed, the EE technique thrives for these big and highly imbalanced data, even in its non-parallel form.

38 Jellis Vanhoeyveld David Martens

Fig. 5 Average rank AUC versus average rank Time (see Table 9) across all datasets from Table 2 (excluding Flickr and Kdd). Regarding the AUC rank, an extra row EE par is added having the same AUC-value as EE(S = 15). Ranks for each dataset are subsequently obtained via the procedure outlined in Section 4.6.1. Points occurring in the upper-right region are preferred.

5 The effect of the chosen base learner

In this short section we investigate the influence of the chosen base learner. More specifically, we ask ourselves whether some base inducers can more easily cope with the imbalanced learning issue and, in doing so, verify the conclusion that EE is a suitable technique to improve upon the baseline performance. We rely on several popular classification methods for behaviour data: two regularization-based techniques (SVM, LR) and two heuristic approaches (Naive Bayes (NB) and Behavioural Similarity (BeSim)). The next paragraphs briefly introduce these methods. Note that source codes and dataset results are provided in our on-line repository.

LR constructs a linear model that maximizes the likelihood of the observed data by imposing a logistic model for the parametric probability model P(Y|X). Ng and Jordan (2002) provide a thorough introduction to the topic (see footnote 27) and note that in its plain form LR can suffer from overfitting, especially when the input data is very high-dimensional. Therefore, we resort to regularized logistic regression. We opt for L2-regularized LR, since L1-regularization corresponds with a natural way of imposing feature selection (Ng 2004). The LIBLINEAR toolbox (Fan et al. 2008) is used to obtain these L2-regularized LR models.

27 For an accessible introduction, see the chapter on "Generative and Discriminative Classifiers: Naive Bayes and Logistic Regression" provided at the following link: http://www.cs.cmu.edu/~tom/NewChapters.html
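To make the setup concrete, here is a minimal sketch of fitting such an L2-regularized LR model on sparse binary behaviour data with scikit-learn's LIBLINEAR backend (the toy matrix, labels and C-value are illustrative choices of ours, not the paper's actual data or tuned parameters):

```python
import numpy as np
from scipy.sparse import csr_matrix
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
# Toy behaviour matrix: 200 instances x 1000 binary features, ~1% non-zero.
X = csr_matrix(rng.random((200, 1000)) < 0.01, dtype=np.float64)
y = rng.integers(0, 2, size=200)  # toy class labels

# L2-regularized logistic regression solved by LIBLINEAR; C controls the
# regularization strength (smaller C = stronger regularization).
clf = LogisticRegression(penalty="l2", C=1.0, solver="liblinear")
clf.fit(X, y)
scores = clf.decision_function(X)  # real-valued confidence scores
```

The sparse `csr_matrix` input is what makes this practical for behaviour data: memory use and training cost scale with the number of non-zero entries rather than with the full dimensionality.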


NB (Ng and Jordan 2002) relies on the use of Bayes' rule, which expresses the posterior P(Y|X) as the product of the likelihood P(X|Y) and the prior P(Y), divided by the evidence P(X). The naivety of the approach originates from assuming conditional independence of the features. We impose a multivariate event model regarding the distribution P(X|Y). Junque de Fortuny et al. (2014a) provide an efficient NB implementation for dealing with large and sparse binary behaviour data.
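For binary behaviour data, the multivariate event model corresponds to the Bernoulli NB variant; a small sketch follows (toy data and our own smoothing choice — this is not the implementation of Junque de Fortuny et al.):

```python
import numpy as np
from scipy.sparse import csr_matrix
from sklearn.naive_bayes import BernoulliNB

rng = np.random.default_rng(1)
X = csr_matrix(rng.random((300, 500)) < 0.02, dtype=np.float64)  # sparse binary data
y = rng.integers(0, 2, size=300)  # toy labels

# Multivariate (Bernoulli) event model: P(X|Y) factorizes over the features,
# each modelled as an independent Bernoulli variable given the class.
nb = BernoulliNB(alpha=1.0)  # Laplace smoothing
nb.fit(X, y)
posterior = nb.predict_proba(X)[:, 1]  # P(Y = 1 | X) via Bayes' rule
```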

The BeSim approach conceptually corresponds to projecting the bigraph to a unigraph using the procedure outlined in Section 3.2. Next, the weighted-vote relational neighbour classifier (Macskassy and Provost 2007) is used to infer unknown labels/scores through a weighted probability estimation using the known labels of neighbouring nodes. A very fast and scalable implementation of this approach (the SW-transformation) is presented in detail in Stankova et al. (2015).
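A simplified sketch of this idea (not the SW-transformation implementation of Stankova et al.): with the bigraph stored as a sparse users × items matrix A, the projected-unigraph weights are the entries of A Aᵀ, so a weighted vote over neighbours' known labels can be computed without ever materializing the projection:

```python
import numpy as np
from scipy.sparse import csr_matrix

rng = np.random.default_rng(2)
A = csr_matrix(rng.random((100, 400)) < 0.03, dtype=np.float64)  # users x items bigraph
y = (rng.random(100) < 0.1).astype(np.float64)  # known labels (1 = class of interest)

# (A @ A.T)[u, v] counts the items shared by users u and v. The weighted vote
# A @ (A.T @ y) therefore sums neighbours' labels, weighted by shared items,
# using only sparse matrix-vector products.
votes = A @ (A.T @ y)
weight = A @ (A.T @ np.ones(A.shape[0]))  # total neighbour weight per user
scores = np.divide(votes, weight, out=np.zeros_like(votes), where=weight > 0)
```

Normalizing by the total neighbour weight turns the vote into a score in [0, 1], analogous to the weighted probability estimation of the relational neighbour classifier.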

In Table 10 we compare the performances of the baseline (BL) application of the methods with their EE-version (with S = 15, T = 20). It is quite striking to see the EE-version dominating the baseline version when looking at each method individually. This suggests that the imbalanced learning issue has an impact on each of the aforementioned techniques. Focusing on the regularization-based approaches, we observe the baseline LR to be less influenced by the imbalance than the SVM method. Their EE-versions (EE SVM and EE LR) can boost the baseline performance in almost all cases and have the highest average ranks amongst all the proposed methods. As already emphasized, the regularization-based approaches offer an added element of flexibility in the sense that the strength of the underlying base learner can be controlled.

The heuristic approaches NB and BeSim do not offer this flexibility. Furthermore, it is not directly possible to integrate instance weights during the training phase of the base learner. During the boosting process, we therefore sample from the distribution Dt to generate (unweighted) examples and subsequently train a BeSim/NB learner. The results in Table 10 indicate BL NB to be the worst performing algorithm amongst all methods tried. Due to the conditional independence assumption, one could regard this as a weak learner, which makes it suitable to be used in a boosting process. Indeed, the EE NB approach is able to significantly improve upon its performance. We found this to be the slowest EE-version (see footnote 28), e.g. taking more than 24 hours to process a single fold of the Bank dataset, versus several minutes for the other versions. The average rank estimate of EE NB is rather pessimistic, since we assigned the lowest performance across all methods in case of missing results. BeSim can be considered a stronger learner compared to NB. Its possible boost in performance is typically lower than that of NB. Having said this, we found EE BeSim to improve upon BL BeSim for all datasets (except Adver(p = 1)).
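The resampling step mentioned above can be sketched as a generic weighted bootstrap (the distribution below is random, for illustration only):

```python
import numpy as np

rng = np.random.default_rng(3)
n = 1000
D_t = rng.random(n)
D_t /= D_t.sum()  # boosting distribution over the n training instances

# Draw an unweighted sample according to D_t; a base learner without native
# instance-weight support (NB, BeSim) is then trained on these indices.
idx = rng.choice(n, size=n, replace=True, p=D_t)
```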

28 This statement only relates to the EE-version. In the baseline case, the timings of the NB-version are roughly of the same order of magnitude as SVM and LR (though this heavily depends on the chosen regularization parameter C). With EE, each subset has a size limited to twice the minority class training size (a relatively low number compared to the size of the majority class). This works in favour of SVM/LR (e.g. SVM has a computational complexity of O(m²), with m the number of training instances). Also note that the NB-version is indeed optimized for sparse data. Internally, the NB-implementation writes a fairly large model file to a specified directory (sizes of 1 GB can be encountered there). This writing operation slows down the training stage (especially in case of EE, where S × T = 15 × 20 = 300 model files are constructed).


Table 10 Average tenfold AUC test set performance of the BL compared to its EE-version (S = 15, T = 20) for SVM, LR, BeSim and NB. In case the EE NB implementation crashed, we put a × in the table. If it is too slow (> 24 hours per fold), we place a []. Additionally, average ranks are assigned to each method. In case EE NB has missing results for a particular dataset, we ensure it has the lowest rank across all other algorithms.

            Mov G(p = 1)   Mov G(p = 25)   Mov Th(p = [])   Yahoo A(p = 1)
BL SVM      71.6 (2.62)    81.41 (1.32)    79.77 (5.33)     56.49 (3.37)
EE SVM      76.12 (2.88)   85.13 (1.86)    86.43 (5.86)     59.74 (2.96)
BL LR       71.02 (2.09)   84.39 (1.84)    83.14 (4.17)     57.84 (2.39)
EE LR       76.69 (2.92)   85.03 (1.98)    86.3 (5.37)      59.79 (2.62)
BL BeSim    76.1 (3.58)    81.3 (2.92)     82.81 (6.6)      56.27 (2.73)
EE BeSim    76.31 (3.71)   81.37 (2.9)     85.02 (6.28)     57.7 (1.71)
BL NB       70.26 (5.84)   77.01 (2.54)    70.48 (10.14)    52.56 (2.09)
EE NB       75.93 (2.83)   85.56 (2.01)    86.91 (4.15)     57.55 (2.73)

            Yahoo A(p = 25)  Yahoo G(p = 1)  Yahoo G(p = 25)  TaFeng(p = 1)
BL SVM      61.61 (2.48)     66.84 (3.66)    78.82 (1.39)     55.75 (1.6)
EE SVM      66.38 (3.16)     73.48 (2.32)    80.55 (1.55)     61.13 (1.83)
BL LR       66.27 (2.96)     69.82 (1.93)    80.45 (1.59)     58.91 (2.31)
EE LR       66.22 (3.28)     73.08 (2.14)    80.53 (1.56)     61.43 (2.32)
BL BeSim    64.54 (2.02)     68.89 (2.49)    79.55 (1.96)     57.89 (1.18)
EE BeSim    65.25 (2.23)     71.18 (2.91)    80.04 (1.85)     59.36 (1.47)
BL NB       65 (1.65)        63.33 (2.56)    78.89 (1.64)     54.61 (1.2)
EE NB       66.6 (2.79)      70.99 (2.88)    81.01 (1.3)      59.01 (1.84)

            TaFeng(p = 25)  Book(p = 1)   Book(p = 25)  LST(p = 1)
BL SVM      66.94 (1.34)    52.6 (1.29)   60.08 (0.71)  99.99 (0.01)
EE SVM      70.4 (1.3)      55.34 (1.28)  65.4 (0.51)   99.98 (0.01)
BL LR       69.24 (1.3)     55.34 (1.27)  63.84 (0.75)  99.99 (0.01)
EE LR       70.28 (1.28)    55.49 (1.49)  65.41 (0.63)  99.97 (0.02)
BL BeSim    67.49 (1.23)    55.19 (1.27)  63.7 (0.63)   99.99 (0.01)
EE BeSim    68 (1.21)       55.21 (1.15)  64.38 (0.42)  99.99 (0)
BL NB       65.21 (1.64)    52.93 (0.9)   59.75 (0.47)  98.69 (0.3)
EE NB       70.72 (1.15)    ×             63.46 (0.61)  99.92 (0.04)

            Adver(p = [])  Adver(p = 1)  CRF(p = [])    Bank(p = [])
BL SVM      96.37 (1.94)   91.18 (2.97)  64.36 (18.97)  66.82 (0.88)
EE SVM      97.63 (1.35)   93.3 (2.14)   86.35 (9.99)   71.54 (0.76)
BL LR       97.19 (1.44)   88.51 (1.93)  81.87 (19.63)  71.43 (0.72)
EE LR       97.57 (0.96)   93.02 (2.06)  86.84 (9.62)   71.77 (0.62)
BL BeSim    97.26 (1.12)   95.38 (1.35)  86.91 (9.36)   67.85 (0.67)
EE BeSim    97.38 (1.04)   93.83 (1.35)  87.02 (10.43)  70.41 (0.55)
BL NB       93.75 (1.9)    93.37 (1.9)   87.24 (9.38)   67.83 (0.63)
EE NB       94.04 (1.75)   ×             ×              []

            Flickr(p = 01)  Kdd(p = 05)   Average Rank
BL SVM      74.92 (0.17)    74.53 (0.05)  6.44 [7]
EE SVM      79.86 (0.13)    80.98 (0.05)  2.39 [1]
BL LR       79.03 (0.11)    81.29 (0.04)  4.28 [4]
EE LR       79.85 (0.13)    80.75 (0.05)  2.61 [2]
BL BeSim    74.62 (0.13)    74.95 (0)     5.11 [6]
EE BeSim    76.4 (0.13)     77.55 (0.03)  3.61 [3]
BL NB       81.36 (0.1)     74.29 (0.05)  6.5 [8]
EE NB       []              []            5.06 [5]


6 Conclusions and future research directions

We investigated the effects of sampling and cost-sensitive learning techniques on the problem of learning from imbalanced behaviour data. This type of big data has fundamentally different properties and internal structure than traditional data and is characterized by sparseness and very large dimensions. This is a new research topic which enables benefits across a wide variety of application domains such as fraud detection, targeted advertising, predictive policing, churn prediction, default prediction, etc. To enable future developments in this field, we provide an on-line repository containing implementations, datasets and results, available at the following link: http://www.applieddatamining.com/cms/?q=software

Oversampling techniques need to be adapted in order to cope with this kind of data. We proposed new versions of the traditional SMOTE and ADASYN techniques, which differ in the way synthetic samples are generated and similarity computations are performed. The plain random oversampling (OSR) and the synthetic approaches are competitive methods in terms of AUC and do not seem to suffer from overfitting, as traditional studies report. Though the synthetic approaches have comparable AUC-performances to OSR, they are useless in practice due to their computationally expensive procedures. The OSR method is very fast for small and medium sized datasets, yet becomes inappropriate for larger datasets because the training times of the underlying base learner increase drastically.
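For reference, plain random oversampling amounts to nothing more than duplicating randomly drawn minority indices (toy labels below; β = 1, i.e. a fully balanced result, chosen purely for illustration):

```python
import numpy as np

rng = np.random.default_rng(5)
y = (np.arange(200) % 10 == 0).astype(int)  # toy labels: 20 minority, 180 majority
minority = np.flatnonzero(y == 1)
majority = np.flatnonzero(y == 0)

# OSR: resample the minority class with replacement until both classes
# contain the same number of training instances.
extra = rng.choice(minority, size=majority.size - minority.size, replace=True)
idx = np.concatenate([np.arange(y.size), extra])  # indices of the enlarged training set
```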

With respect to undersampling methodologies, we employed a random undersampling technique (RUS) and Knn-based informed approaches. The latter type selects majority class instances that are closest (Cl Knn) or farthest (Far Knn) from the minority class examples and allows one to investigate the effect of noise/outliers on SVM performance. We observed that these odd instances have a severe performance degrading effect. RUS and Far Knn approaches can significantly improve performance with respect to the baseline thanks to their abilities to reduce noise/outlier levels and to put more emphasis on the minority class. Having said this, they usually perform worse than any of the oversampling approaches. This is in sheer contrast to the results obtained from traditional studies and is heavily related to the intrinsic nature of behaviour data. Timing-wise, RUS is the fastest method. The informed approaches are slower than RUS due to the required nearest neighbour computations. Additionally, a cluster-based undersampling (CBU) technique was proposed to alleviate the within-class imbalance. For low undersampling rates, the performance of CBU is superior to plain RUS, whereas for high undersampling rates the opposite is true. Overall, no substantial gain in performance is obtained in comparison with a simple RUS approach. Furthermore, the method is the slowest among all techniques considered and relies on computationally expensive clustering heuristics.

We studied the effects of boosting, where we made use of a very specific linear SVM and logistic regression combination that uses confidence rated predictions instead of a more popular plain "discrete" boosting algorithm (with weak hypotheses having values in {−1, 1}). Our experiments clearly indicated the regularization constant C in the SVM formulation to act as a "weakness" indicator. Indeed, higher C-values cause stronger learners and shouldn't be used in the boosting process, which matches the analysis of studies dealing with traditional data. Strong learners can already quite accurately distinguish majority class from minority class behaviours. The boosting process will focus on the hard to learn instances, among which are noise/outliers, and this can result in overfitting behaviour. We also observed that boosting can usually outperform the plain baseline, indicating that a combination of weak learners can achieve higher AUC-performance than a single strong learner.

EasyEnsemble (EE) integrates random undersampling with a boosting process. By sampling several balanced subsets from the training data and combining these, we can emphasize the minority class, achieve noise/outlier reduction effects and simultaneously explore the majority class space efficiently. Our version differs from the original formulation because we employ a different type of underlying weak learner (SVM and LR combination) that is fed to a confidence rated boosting algorithm (instead of the discrete version). We observed superior AUC-performance with respect to all of the aforementioned techniques. Furthermore, the method is very fast, since each subset can be trained in parallel and has a size that is twice as large as the minority class training set. We studied the effect of the number of subsets and noted that a limited amount of these (O(10)) is already sufficient to obtain satisfactory performances; increasing this number further causes only minor benefits.
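A schematic sketch of the EE idea (balanced subsets plus aggregation); for brevity we train a single L2-regularized LR per subset instead of the boosted SVM/LR combination used in the paper, and the toy data is our own:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

def easy_ensemble_scores(X, y, S=10, seed=0):
    """Average the confidence scores of one learner per balanced subset."""
    rng = np.random.default_rng(seed)
    minority = np.flatnonzero(y == 1)
    majority = np.flatnonzero(y == 0)
    scores = np.zeros(X.shape[0])
    for _ in range(S):
        # Each subset: all minority instances plus an equally large random
        # draw from the majority class (subset size = 2 x |minority|).
        sub = np.concatenate(
            [minority, rng.choice(majority, size=minority.size, replace=False)]
        )
        clf = LogisticRegression(penalty="l2", C=1.0, solver="liblinear")
        clf.fit(X[sub], y[sub])  # independent of the other subsets: parallelizable
        scores += clf.decision_function(X)
    return scores / S

rng = np.random.default_rng(4)
X = rng.normal(size=(500, 20))
y = (np.arange(500) % 20 == 0).astype(int)  # 25 minority among 500
ens_scores = easy_ensemble_scores(X, y, S=5)
```

Because every subset is small and the S members are mutually independent, the wall-clock cost of the ensemble is essentially that of a single subset when trained in parallel.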

Additionally, a statistical comparison of each of the proposed methods was conducted across a wide variety of datasets. Holm's significance test indicates that each of the proposed techniques (oversampling, undersampling (except Cl Knn), cost-sensitive learning and boosting variants) can significantly outperform the plain baseline performance (each of the null-hypotheses of equal performance gets rejected with a confidence level of at least 91%). Comparison of the techniques with EE showed that the latter can significantly outperform all of them (each of the null-hypotheses of equal performance gets rejected with a confidence level of at least 75%). A timing-wise comparison showed EE to be the preferred method (after RUS) for large and highly imbalanced datasets. A parallel implementation would make EE the fastest method even for medium sized datasets that show a high level of imbalance.

The influence of the chosen base learner was investigated by comparing the baseline to its EE-version for the SVM, LR, NB and BeSim inducers. Each of the EE-versions is able to improve upon the baseline performance for nearly all datasets, where base learners of the regularization type are superior because of their ability to control the strength of the learner.

Though we conducted an extensive benchmark study, there are still plenty of future research directions. First and most important, there is an urgent need to develop a more general theoretical framework to completely understand the effect of imbalance and why certain methods outperform others. Secondly, other types of mechanisms could be investigated in this setting: advanced sampling (Guo et al. 2008) set-ups perform sampling based on preliminary classifications; one-class learning is a recognition based methodology where a discriminative boundary is learned around the training examples of a single class alone, and Raskutti and Kowalczyk (2004) demonstrated the superiority of the recognition based one-class SVMs over the discriminative two-class SVMs; other types of ensemble learning strategies (e.g. bagging and stacking) could be investigated; Batista et al. (2004) note that combining over- and undersampling can be beneficial when data are highly imbalanced or show very few minority class instances; the ideas behind Extended Nearest Neighbour (Tang and He 2015) could guide us in the sampling process. As opposed to Knn, "it considers not only who are the nearest neighbours of a certain sample, but also who consider the particular sample as their nearest neighbour (Tang and He 2015)". The method assigns a class membership to an unknown sample that maximizes the intra-class coherence. Finally, more advanced techniques do exist that can provide a K (the number of nearest neighbours) faster or with a (slightly) better performance.

Another issue relates to the lack of comprehensibility of the EasyEnsemble technique in its present form. Because we currently integrate Logistic Regression in the boosting process, the final ensemble classifier is non-linear and therefore incomprehensible. It might be interesting to investigate the numerical boosting technique proposed in Schapire and Singer (1999), where each weak learner is allowed to output real values (instead of values in [−1, 1]). In that case we would be able to use a plain linear SVM as a weak learner, without the Logistic Regression component. The final ensemble would be a linear model, which is surely easier to comprehend. Also, the instance-based approach of Martens and Provost (2014) could be used to explain individual predictions of the non-linear EE-technique.

A Oversampling experiments

Table 11 Oversampling experiments. Results show the average tenfold AUC test set performance with respect to increasing β-values (β = [β1, β2, β3, β4] = [0, 1/3, 2/3, 1]). Additionally, standard deviations are shown between brackets. Optimal performances for each method are highlighted in boldface. See Section 4.2 for parameter settings.

Mov G(p = 1)
         β1             β2             β3             β4
OSR      71.6 (2.62)    74.37 (2.04)   73.6 (1.84)    74.73 (2.45)
SMOTE    71.6 (2.62)    75.08 (2.18)   76.02 (2.14)   76.48 (2.3)
ADASYN   71.6 (2.62)    75.16 (1.92)   75.93 (2.08)   76.47 (2.29)

Mov G(p = 25)
         β1             β2             β3             β4
OSR      81.41 (1.32)   83.49 (1.81)   83.84 (1.96)   83.91 (2.04)
SMOTE    81.41 (1.32)   83.32 (1.97)   83.59 (2.04)   83.76 (2.11)
ADASYN   81.41 (1.32)   83.61 (1.82)   84.02 (1.97)   83.69 (1.96)

Mov Th(p = [])
         β1             β2             β3             β4
OSR      79.77 (5.33)   85.3 (4.66)    83.16 (4.5)    84.59 (5.69)
SMOTE    79.77 (5.33)   84.18 (6.51)   85.58 (5.97)   84.33 (5.75)
ADASYN   79.77 (5.33)   84.11 (6.77)   85.86 (6.06)   85.36 (5.13)

Yahoo A(p = 1)
         β1             β2             β3             β4
OSR      55.92 (2.97)   58.66 (3.27)   59.99 (2.28)   59.74 (1.78)
SMOTE    55.92 (2.97)   59.76 (2.62)   59.74 (2.67)   59.43 (2.4)
ADASYN   55.92 (2.97)   59.54 (2.53)   59.55 (2.94)   59.56 (2.22)

Yahoo A(p = 25)
         β1             β2             β3             β4
OSR      61.68 (2.42)   64.19 (3.17)   65.08 (3.26)   64.67 (2.1)
SMOTE    61.68 (2.42)   65.46 (3.63)   65.33 (3.23)   64.52 (2.98)
ADASYN   61.68 (2.42)   65.04 (3.74)   65.41 (3.47)   64.4 (2.21)

Yahoo G(p = 1)
         β1             β2             β3             β4
OSR      66.84 (3.66)   72.18 (2.36)   73.11 (2.7)    72.49 (3.41)
SMOTE    66.84 (3.66)   72.65 (2.85)   73.27 (3.36)   73.37 (3.56)
ADASYN   66.84 (3.66)   72.87 (2.83)   73.18 (3.2)    73.39 (3.59)

Yahoo G(p = 25)
         β1             β2             β3             β4
OSR      78.82 (1.39)   78.78 (2.02)   78.74 (1.86)   78.35 (1.47)
SMOTE    78.82 (1.39)   79.23 (1.57)   79.1 (1.2)     79.03 (1.89)
ADASYN   78.82 (1.39)   79.12 (1.43)   79.23 (1.37)   79.51 (2.01)

TaFeng(p = 1)
         β1             β2             β3             β4
OSR      55.75 (1.6)    59.23 (1.96)   60 (1.68)      61.04 (2.36)
SMOTE    55.75 (1.6)    60.26 (1.95)   61.49 (1.8)    61.13 (1.52)
ADASYN   55.75 (1.6)    60.26 (1.9)    61.44 (1.85)   61.16 (1.5)

TaFeng(p = 25)
         β1             β2             β3             β4
OSR      66.94 (1.34)   68.84 (1.21)   66.99 (1.42)   67.7 (1.41)
SMOTE    66.94 (1.34)   68.47 (1.5)    67.07 (1.15)   66.65 (0.81)
ADASYN   66.94 (1.34)   68.62 (1.38)   67.85 (1.6)    66.91 (1.39)

Book(p = 1)
         β1             β2             β3             β4
OSR      52.6 (1.29)    53.61 (0.94)   55.41 (1.75)   55.87 (1.44)
SMOTE    52.6 (1.29)    54.77 (0.99)   54.91 (0.8)    54.36 (0.98)
ADASYN   52.6 (1.29)    54.86 (1.13)   55.06 (0.73)   54.54 (0.92)

Book(p = 25)
         β1             β2             β3             β4
OSR      60.08 (0.71)   61.05 (0.94)   61.82 (1.12)   64.62 (0.57)
SMOTE    60.08 (0.71)   62.6 (0.73)    60.95 (0.68)   63 (0.8)
ADASYN   60.08 (0.71)   62.33 (0.73)   60.77 (0.85)   63.04 (0.58)

LST(p = 1)
         β1             β2             β3             β4
OSR      99.99 (0.01)   99.99 (0.01)   99.99 (0.01)   99.99 (0.01)
SMOTE    99.99 (0.01)   99.99 (0.01)   99.99 (0.01)   99.99 (0.01)
ADASYN   99.99 (0.01)   99.99 (0.01)   99.99 (0.01)   99.99 (0.01)

Adver(p = [])
         β1             β2             β3             β4
OSR      96.61 (1.82)   97.31 (1.65)   97.07 (1.84)   97.07 (1.79)
SMOTE    96.61 (1.82)   96.91 (1.66)   97.19 (1.65)   97.07 (1.91)
ADASYN   96.61 (1.82)   97.1 (1.7)     97.08 (1.87)   97.07 (1.88)

Adver(p = 1)
         β1             β2             β3             β4
OSR      90.93 (3.02)   91.27 (3.03)   92.66 (2.82)   93.29 (1.97)
SMOTE    90.93 (3.02)   92.51 (2.03)   92.96 (2.14)   93.53 (1.81)
ADASYN   90.93 (3.02)   92.22 (2.33)   92.7 (2.36)    93.88 (1.73)

CRF(p = [])
         β1              β2              β3              β4
OSR      64.06 (16.43)   80.82 (12.94)   81.28 (12.27)   81.91 (11.28)
SMOTE    64.06 (16.43)   78.64 (16.86)   82.52 (13.74)   79.32 (16.26)
ADASYN   64.06 (16.43)   78.95 (16.72)   81.19 (16.32)   79.31 (16.19)

Bank(p = [])
         β1             β2             β3             β4
OSR      66.82 (0.88)   70.1 (0.74)    71.39 (0.8)    71.47 (0.8)
SMOTE    []             []             []             []
ADASYN   []             []             []             []


B Undersampling experiments

Table 12 Undersampling experiments. Results show the average tenfold AUC test set performance with respect to increasing βu-values (βu = [βu1, βu2, βu3, βu4, βu5] = [0, 1/4, 1/2, 3/4, 1]). Optimal performances (boldface) are highlighted for each imbalance ratio. Additionally, standard deviations are shown between brackets. CL K represents the "Closest Knn" technique; CL T represents the "Closest tot sim" technique (similar for Far K and Far T; see Section 3.2). Parameter settings are given in Section 4.2.

Mov G(p = 1)
        βu1           βu2           βu3           βu4           βu5
RUS     71.6 (2.6)    71.83 (2.6)   72.54 (2.5)   72.39 (3.1)   70.61 (3.5)
Cl K    71.6 (2.6)    71.4 (2)      70.96 (1.9)   70.43 (2.4)   69.05 (4.1)
CL T    71.6 (2.6)    70.28 (2.5)   66.74 (2)     66.8 (2.1)    68.18 (3.6)
Far K   71.6 (2.6)    72.36 (2.7)   71.26 (3.4)   66.57 (5.2)   53.5 (3.5)
Far T   71.6 (2.6)    72.22 (2.8)   71.63 (3.6)   64.28 (5.3)   50.88 (4.4)
CBU     72.55 (2.6)   73.28 (2.6)   73.12 (2.6)   73.84 (2.5)   73 (3.1)

Mov G(p = 25)
        βu1           βu2           βu3           βu4           βu5
RUS     81.41 (1.3)   81.36 (1.3)   81.78 (1.7)   82.05 (1.7)   81.6 (2.1)
Cl K    81.41 (1.3)   80.86 (1.2)   80.95 (1.6)   79.73 (2.3)   77.95 (2.3)
CL T    81.41 (1.3)   79.9 (1.2)    78.21 (1.4)   77.87 (1.5)   77.76 (2.3)
Far K   81.41 (1.3)   80.9 (1.5)    78.17 (1.8)   74.25 (2.4)   69.79 (3.2)
Far T   81.41 (1.3)   80.86 (1.5)   77.2 (2.4)    71.16 (2.7)   62.4 (2.8)
CBU     81.53 (1.4)   81.64 (1.3)   81.29 (1.6)   81.28 (2.1)   80.34 (2.7)

Mov Th(p = [])
        βu1           βu2           βu3           βu4           βu5
RUS     79.77 (5.3)   80.32 (5.8)   81.57 (5.5)   81.86 (6.6)   81.26 (6.2)
Cl K    79.77 (5.3)   79.25 (4.5)   78.07 (5)     76.25 (6.5)   62.46 (8.5)
CL T    79.77 (5.3)   78.4 (4.4)    72.41 (3.5)   64.66 (4.5)   60.37 (7.3)
Far K   79.77 (5.3)   84.54 (5)     83.64 (6.4)   80.02 (7.3)   56.82 (10.3)
Far T   79.77 (5.3)   85.03 (5.7)   82.68 (6.8)   75.61 (9.2)   56.77 (10.9)
CBU     80.11 (5.8)   81.17 (6)     81.08 (6.5)   84.17 (5.1)   80.96 (6.9)

Yahoo A(p = 1)
        βu1           βu2           βu3           βu4           βu5
RUS     55.92 (3)     55.57 (3.4)   56.44 (3)     55.83 (3.4)   56.37 (3.3)
Cl K    55.92 (3)     55.67 (2.4)   53.12 (2)     50.57 (1.8)   53.79 (3.5)
CL T    55.92 (3)     55.69 (2.1)   53.35 (2.2)   50.31 (2.2)   52.35 (3.3)
Far K   55.92 (3)     57.35 (2.2)   56.92 (1.1)   56.95 (2.3)   51.18 (2)
Far T   55.92 (3)     56.93 (2.4)   54.74 (1.9)   57.01 (1.8)   51.18 (2)
CBU     58.21 (2.6)   58.45 (3.3)   58.31 (3.5)   58.39 (3.5)   56.09 (2.6)

Yahoo A(p = 25)
        βu1           βu2           βu3           βu4           βu5
RUS     61.68 (2.4)   62.9 (2.9)    63.62 (3.6)   63.75 (3.1)   63.19 (1.9)
Cl K    61.68 (2.4)   61.14 (2.1)   57.62 (1.6)   54.02 (1.8)   51.48 (1.4)
CL T    61.68 (2.4)   60.89 (2.8)   58.11 (1.4)   54.4 (2.1)    51.76 (1.4)
Far K   61.68 (2.4)   63.96 (3)     62.62 (2.2)   59.61 (1.5)   56.25 (1.6)
Far T   61.68 (2.4)   63.71 (2.4)   59.72 (1.6)   57.27 (1.1)   54.47 (1.1)
CBU     62.46 (2.6)   61.85 (1.4)   61.78 (2.2)   59.94 (3)     60.1 (4)

Yahoo G(p = 1)
        βu1           βu2           βu3           βu4           βu5
RUS     66.84 (3.7)   67.85 (3.2)   68.36 (3.2)   68.23 (4)     69.9 (4.2)
Cl K    66.84 (3.7)   66.71 (2.8)   64.3 (3.6)    61.98 (3.9)   61.15 (1.9)
CL T    66.84 (3.7)   65.79 (2.7)   63.55 (3.3)   59.21 (3.5)   61.08 (2.4)
Far K   66.84 (3.7)   66.76 (4.1)   63.84 (3.4)   65.16 (2)     48.5 (2.9)
Far T   66.84 (3.7)   66.95 (4.1)   63.48 (2.9)   65.16 (2)     48.48 (2.9)
CBU     69.68 (4.1)   70.59 (3.2)   70.64 (3.7)   70.2 (2.9)    63.35 (3.6)

Yahoo G(p = 25)
        βu1           βu2           βu3           βu4           βu5
RUS     78.82 (1.4)   78.91 (1.6)   78.97 (1.6)   78.61 (1.6)   77.82 (2.1)
Cl K    78.82 (1.4)   77.26 (1.5)   72.52 (1.5)   67.86 (2)     65.07 (2.7)
CL T    78.82 (1.4)   76.83 (1)     71.99 (1.8)   67.15 (2.3)   61.1 (2.7)
Far K   78.82 (1.4)   78.26 (2.2)   74.69 (2.7)   67.22 (2.1)   60.72 (2.3)
Far T   78.82 (1.4)   77.68 (2.6)   72.44 (3)     64.94 (2.4)   59.6 (2)
CBU     75.25 (3.2)   75.22 (2.4)   74.69 (2.3)   73.07 (2.4)   70.69 (2.4)

TaFeng(p = 1)
        βu1           βu2           βu3           βu4           βu5
RUS     55.75 (1.6)   56.1 (1.6)    56.26 (1.7)   57.23 (1.7)   59.25 (2.2)
Cl K    55.75 (1.6)   55.68 (1.6)   55.58 (1.5)   55.08 (1.1)   51.05 (1.5)
CL T    55.75 (1.6)   55.67 (1.6)   54.47 (1.6)   47.53 (1.6)   49.3 (1.1)
Far K   55.75 (1.6)   58.99 (1.2)   59.47 (1.1)   60.04 (1.2)   56.31 (1)
Far T   55.75 (1.6)   58.92 (1.3)   59.25 (1.3)   58.58 (1.1)   56.31 (1)
CBU     57.8 (1)      58.47 (1.1)   58.15 (0.9)   58.87 (1.4)   57.65 (1.6)

TaFeng(p = 25)
        βu1           βu2           βu3           βu4           βu5
RUS     66.94 (1.3)   67.44 (1.3)   68.1 (1.4)    68.27 (1.4)   66.13 (1.2)
Cl K    66.94 (1.3)   66.13 (1.4)   63.39 (1.2)   59.83 (1.3)   56.94 (0.7)
CL T    66.94 (1.3)   66.38 (1.5)   62.89 (1.6)   57.46 (1.3)   54.56 (1.3)
Far K   66.94 (1.3)   68.06 (1.4)   66.43 (1.6)   64.46 (1.5)   63.35 (1.3)
Far T   66.94 (1.3)   64.31 (1.1)   62.69 (1)     61.27 (1.1)   59.03 (1)
CBU     64.81 (1.2)   64.15 (1.1)   64.13 (1.2)   63.88 (0.8)   63.46 (0.8)

Book(p = 1)
        βu1           βu2           βu3           βu4           βu5
RUS     52.6 (1.3)    52.79 (0.9)   53.46 (0.8)   53.89 (0.9)   54.05 (0.9)
Cl K    52.6 (1.3)    52.56 (1.2)   52.52 (1.3)   52.39 (1.1)   53.09 (1.1)
CL T    52.6 (1.3)    52.56 (1.2)   52.52 (1.3)   52.39 (1.1)   53.05 (0.7)
Far K   52.6 (1.3)    55.21 (1.2)   56.21 (1.8)   56.14 (1.2)   53.06 (1)
Far T   52.6 (1.3)    55.21 (1.2)   56.21 (1.8)   56.14 (1.2)   53.06 (1)
CBU     54.28 (0.9)   53.77 (1)     53.33 (1.1)   53.34 (0.9)   52.84 (0.8)

Book(p = 25)
        βu1           βu2           βu3           βu4           βu5
RUS     60.08 (0.7)   60.13 (0.6)   60.4 (0.8)    60.33 (0.8)   63.28 (0.8)
Cl K    60.08 (0.7)   59.96 (0.7)   60.13 (0.8)   59.96 (1)     59.28 (0.7)
CL T    60.08 (0.7)   59.96 (0.7)   60.13 (0.8)   60.29 (0.4)   54.5 (0.9)
Far K   60.08 (0.7)   63.29 (1)     64.19 (0.8)   57.3 (1.1)    55.66 (1.1)
Far T   60.08 (0.7)   62.14 (0.5)   58.27 (0.6)   56.37 (1)     55.66 (1.1)
CBU     54.82 (0.9)   54.67 (0.9)   54.71 (0.9)   54.66 (1)     54.78 (0.9)

LST(p = 1)
        βu1           βu2           βu3           βu4           βu5
RUS     99.99 (0)     99.99 (0)     99.99 (0)     99.98 (0)     99.99 (0)
Cl K    99.99 (0)     99.99 (0)     99.99 (0)     99.99 (0)     99.99 (0)
CL T    99.99 (0)     99.99 (0)     99.99 (0)     99.99 (0)     99.98 (0)
Far K   99.99 (0)     99.98 (0)     99.98 (0)     99.98 (0)     99.98 (0)
Far T   99.99 (0)     99.98 (0)     99.98 (0)     99.98 (0)     99.98 (0)
CBU     []            []            []            []            []

Adver(p = [])
        βu1           βu2           βu3           βu4           βu5
RUS     96.61 (1.8)   96.32 (1.8)   96.63 (1.4)   97.12 (2.1)   96.22 (1.6)
Cl K    96.61 (1.8)   96.44 (1.5)   96.14 (1.5)   96.04 (2)     94.8 (2.5)
CL T    96.61 (1.8)   95.87 (2.1)   94.32 (1.9)   93.01 (2.2)   90.72 (2.3)
Far K   96.61 (1.8)   96.53 (1.4)   95.76 (2)     94.39 (1.8)   90.49 (3.1)
Far T   96.61 (1.8)   96.54 (1.5)   95.67 (1.9)   94.54 (1.8)   89.3 (2.8)
CBU     96.85 (2.3)   96.85 (2.3)   97.05 (1.5)   96.6 (1.6)    96.06 (2.1)

Adver(p = 1)
        βu1           βu2           βu3           βu4           βu5
RUS     90.93 (3)     91.53 (3.1)   92.37 (3.4)   91.9 (2.9)    91.93 (2.2)
Cl K    90.93 (3)     90.64 (3)     89.87 (3.9)   90.21 (3.6)   89.18 (2)
CL T    90.93 (3)     89.7 (3.5)    88.55 (3.4)   85.76 (3.3)   88.2 (2.3)
Far K   90.93 (3)     93.8 (2.3)    92.4 (2.6)    88.73 (3.4)   85.51 (4)
Far T   90.93 (3)     93.62 (2.4)   93.2 (2.2)    88.41 (3.6)   85.51 (4)
CBU     93.22 (2.4)   93.76 (2.5)   93.89 (2.6)   93.52 (2.7)   91.27 (2)

CRF(p = [])
        βu1            βu2            βu3            βu4            βu5
RUS     64.06 (16.4)   63.28 (15.9)   67.98 (17.4)   66.95 (21.9)   87.73 (8.8)
Cl K    64.06 (16.4)   62.44 (16.6)   62.34 (16.9)   71.37 (13.8)   78.22 (17.7)
CL T    64.06 (16.4)   62.44 (16.6)   62.34 (16.9)   71.37 (13.8)   62.67 (22.9)
Far K   64.06 (16.4)   83.8 (14.2)    83.93 (14.8)   84.49 (13.7)   86.11 (9.7)
Far T   64.06 (16.4)   83.8 (14.2)    83.93 (14.8)   84.49 (13.7)   86.11 (9.7)
CBU     []             []             []             []             []

Bank(p = [])
        βu1           βu2           βu3           βu4           βu5
RUS     66.82 (0.9)   67.02 (0.9)   67.37 (0.8)   67.99 (0.6)   69.5 (1)
Cl K    66.82 (0.9)   66.17 (0.7)   65.24 (0.6)   64.86 (0.6)   58.53 (1.1)
CL T    66.82 (0.9)   64.92 (1.1)   60.69 (0.9)   56.33 (0.8)   52.87 (0.7)
Far K   66.82 (0.9)   66.95 (0.6)   66.19 (0.6)   64.42 (0.6)   58.25 (1.1)
Far T   66.82 (0.9)   67.16 (0.6)   64.2 (0.8)    59.67 (1)     58.25 (1.1)
CBU     []            []            []            []            []


C Boosting experiments

This section presents the results of each of the boosting experiments from Section 4.5 on the remaining data sources. Each of the figures below depicts the average tenfold AUC-performance on test data (with µ = 100) with respect to the number of boosting iterations for (left) AdaBoost (AB), AdaCost (AC) and EasyEnsemble (EE), with C chosen according to the highest validation set AUC-performance (over all possible boosting rounds), and (right) AB and EE (with S = 15) with varying C-levels.

Fig. 6 Mov G(p = 1) dataset

Fig. 7 Mov Th(p = []) dataset


Fig. 8 Yahoo A(p = 1) dataset

Fig. 9 Yahoo A(p = 25) dataset

Fig. 10 Yahoo G(p = 1) dataset


Fig. 11 Yahoo G(p = 25) dataset

Fig. 12 TaFeng(p = 1) dataset

Fig. 13 Book(p = 1) dataset


Fig. 14 LST(p = 1) dataset

Fig. 15 Adver(p = []) dataset

Fig. 16 Adver(p = 1) dataset


Fig. 17 CRF(p = []) dataset

D Final Comparison

Fig. 18 Average rank AUC versus average rank Time (see Table 9) across the large datasets from Table 2 (CRF and Bank). Regarding the former, an extra row EE par is added having the same AUC-value as EE(S = 15). Ranks for each dataset are subsequently obtained via the procedure outlined in Section 4.6.1. Points occurring in the upper-right region are preferred.


References

Akbani R, Kwek S, Japkowicz N (2004) Applying support vector machines to imbalanced datasets. In: Machine Learning: ECML 2004, 15th European Conference on Machine Learning, Pisa, Italy, September 20-24, 2004, Proceedings. Springer Berlin Heidelberg, Berlin, Heidelberg, pp 39–50. DOI 10.1007/978-3-540-30115-8_7

Ali A, Shamsuddin SM, Ralescu AL (2015) Classification with class imbalance problem: A review. International Journal of Advances in Soft Computing and its Applications 7(3):176–204

Alzahrani T, Horadam KJ (2016) Community detection in bipartite networks: Algorithms and case studies. In: Complex Systems and Networks: Dynamics, Controls and Applications. Springer Berlin Heidelberg, Berlin, Heidelberg, pp 25–50. DOI 10.1007/978-3-662-47824-0_2

Bachner J (2013) Predictive policing: Preventing crime with data and analytics. IBM Center for the Business of Government

Baesens B, Van Gestel T, Viaene S, Stepanova M, Suykens J, Vanthienen J (2003) Benchmarking state-of-the-art classification algorithms for credit scoring. Journal of the Operational Research Society 54(6):627–635. DOI 10.1057/palgrave.jors.2601545

Barandela R, Sánchez J, García V, Rangel E (2003) Strategies for learning in class imbalance problems. Pattern Recognition 36(3):849–851. DOI 10.1016/S0031-3203(02)00257-1

Barber MJ (2007) Modularity and community detection in bipartite networks. Physical Review E 76:066102. DOI 10.1103/PhysRevE.76.066102

Barua S, Islam MM, Yao X, Murase K (2014) MWMOTE – majority weighted minority oversampling technique for imbalanced data set learning. IEEE Transactions on Knowledge and Data Engineering 26(2):405–425. DOI 10.1109/TKDE.2012.232

Batista GEAPA, Prati RC, Monard MC (2004) A study of the behavior of several methods for balancing machine learning training data. SIGKDD Explor Newsl 6(1):20–29. DOI 10.1145/1007730.1007735

Beckett SJ (2016) Improved community detection in weighted bipartite networks. Royal Society Open Science 3(1). DOI 10.1098/rsos.140536

Bekkar M, Djemaa HK, Alitouche TA (2013) Evaluation measures for models assessment over imbalanced data sets. Journal of Information Engineering and Applications 3(10):27–38

Bhattacharyya S, Jha S, Tharakunnel K, Westland JC (2011) Data mining for credit card fraud: A comparative study. Decision Support Systems 50(3):602–613. DOI 10.1016/j.dss.2010.08.008

Blondel VD, Guillaume JL, Lambiotte R, Lefebvre E (2008) Fast unfolding of communities in large networks. Journal of Statistical Mechanics: Theory and Experiment 2008(10):P10008

Breiman L, Friedman J, Stone CJ, Olshen RA (1984) Classification and regression trees. Taylor & Francis

Brozovsky L, Petricek V (2007) Recommender system for online dating service. In: Proceedings of Znalosti 2007 Conference, VSB Ostrava

Cha M, Mislove A, Gummadi KP (2009) A measurement-driven analysis of information propagation in the Flickr social network. In: Proceedings of the 18th International Conference on World Wide Web, ACM, New York, NY, USA, WWW '09, pp 721–730. DOI 10.1145/1526709.1526806

Chawla NV (2005) Data mining for imbalanced datasets: An overview. In: Data Mining and Knowledge Discovery Handbook. Springer US, Boston, MA, pp 853–867

Chawla NV, Bowyer KW, Hall LO, Kegelmeyer WP (2002) SMOTE: Synthetic minority over-sampling technique. Journal of Artificial Intelligence Research 16:321–357

Chawla NV, Lazarevic A, Hall LO, Bowyer KW (2003) SMOTEBoost: Improving prediction of the minority class in boosting. In: Knowledge Discovery in Databases: PKDD 2003. Springer Berlin Heidelberg, Berlin, Heidelberg, pp 107–119

Chawla NV, Japkowicz N, Kotcz A (2004) Editorial: Special issue on learning from imbalanced data sets. SIGKDD Explor Newsl 6(1):1–6. DOI 10.1145/1007730.1007733

Chen M, Mao S, Liu Y (2014) Big data: A survey. Mobile Networks and Applications 19(2):171–209. DOI 10.1007/s11036-013-0489-0

Chyi YM (2003) Classification analysis techniques for skewed class distribution problems. Master thesis, Department of Information Management, National Sun Yat-Sen University

Demšar J (2006) Statistical comparisons of classifiers over multiple data sets. Journal of Machine Learning Research 7(Jan):1–30

Dhillon IS (2001) Co-clustering documents and words using bipartite spectral graph partitioning. In: Proceedings of the Seventh ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, ACM, New York, NY, USA, KDD '01, pp 269–274. DOI 10.1145/502512.502550


Drummond C, Holte RC (2003) C4.5, class imbalance, and cost sensitivity: Why under-sampling beats over-sampling. In: Proceedings of the ICML '03 Workshop on Learning from Imbalanced Datasets

Fan RE, Chang KW, Hsieh CJ, Wang XR, Lin CJ (2008) LIBLINEAR: A library for large linear classification. Journal of Machine Learning Research 9:1871–1874

Fan W, Stolfo SJ, Zhang J, Chan PK (1999) AdaCost: Misclassification cost-sensitive boosting. In: Proceedings of the Sixteenth International Conference on Machine Learning, Morgan Kaufmann Publishers Inc, San Francisco, CA, USA, ICML '99, pp 97–105

Fawcett T (2006) An introduction to ROC analysis. Pattern Recognition Letters 27(8):861–874. DOI 10.1016/j.patrec.2005.10.010

Finch H (2005) Comparison of distance measures in cluster analysis with dichotomous data. Journal of Data Science 3(1):85–100

Fortunato S (2010) Community detection in graphs. Physics Reports 486(3-5):75–174. DOI 10.1016/j.physrep.2009.11.002

Junque de Fortuny E, Martens D, Provost F (2014a) Predictive modeling with big data: Is bigger really better? Big Data 1(4):215–226. DOI 10.1089/big.2013.0037

Junque de Fortuny E, Stankova M, Moeyersoms J, Minnaert B, Provost F, Martens D (2014b) Corporate residence fraud detection. In: Proceedings of the 20th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, ACM, New York, NY, USA, KDD '14, pp 1650–1659. DOI 10.1145/2623330.2623333

Frasca M, Bertoni A, Re M, Valentini G (2013) A neural network algorithm for semi-supervised node label learning from unbalanced data. Neural Networks 43:84–98. DOI 10.1016/j.neunet.2013.01.021

Friedman M (1937) The use of ranks to avoid the assumption of normality implicit in the analysis of variance. Journal of the American Statistical Association 32(200):675–701

García E, Lozano F (2007) Boosting support vector machines. In: Machine Learning and Data Mining in Pattern Recognition, 5th International Conference, MLDM 2007, Leipzig, Germany, July 18-20, Post Proceedings. IBaI publishing, pp 153–167

Goldstein M, Uchida S (2016) A comparative evaluation of unsupervised anomaly detection algorithms for multivariate data. PLOS ONE 11(4):1–31. DOI 10.1371/journal.pone.0152173

González PC, Velásquez JD (2013) Characterization and detection of taxpayers with false invoices using data mining techniques. Expert Systems with Applications 40(5):1427–1436. DOI 10.1016/j.eswa.2012.08.051

Guimerà R, Sales-Pardo M, Amaral LAN (2007) Module identification in bipartite and directed networks. Physical Review E 76:036102. DOI 10.1103/PhysRevE.76.036102

Guo H, Viktor HL (2004) Learning from imbalanced data sets with boosting and data generation: The DataBoost-IM approach. SIGKDD Explor Newsl 6(1):30–39. DOI 10.1145/1007730.1007736

Guo X, Yin Y, Dong C, Yang G, Zhou G (2008) On the class imbalance problem. In: 2008 Fourth International Conference on Natural Computation, IEEE, vol 4, pp 192–201. DOI 10.1109/ICNC.2008.871

Han H, Wang WY, Mao BH (2005) Borderline-SMOTE: A new over-sampling method in imbalanced data sets learning. In: Advances in Intelligent Computing. Springer Berlin Heidelberg, Berlin, Heidelberg, pp 878–887

He H, Garcia EA (2009) Learning from imbalanced data. IEEE Transactions on Knowledge and Data Engineering 21(9):1263–1284. DOI 10.1109/TKDE.2008.239

He H, Bai Y, Garcia EA, Li S (2008) ADASYN: Adaptive synthetic sampling approach for imbalanced learning. In: 2008 IEEE International Joint Conference on Neural Networks (IEEE World Congress on Computational Intelligence), IEEE, pp 1322–1328. DOI 10.1109/IJCNN.2008.4633969

Holm S (1979) A simple sequentially rejective multiple test procedure. Scandinavian Journal of Statistics 6(2):65–70

Hsu CW, Lin CJ (2002) A comparison of methods for multiclass support vector machines. IEEE Transactions on Neural Networks 13(2):415–425. DOI 10.1109/72.991427

Huang A (2008) Similarity measures for text document clustering. In: Proceedings of the Sixth New Zealand Computer Science Research Student Conference (NZCSRSC2008), Christchurch, New Zealand, pp 49–56

Iman RL, Davenport JM (1980) Approximations of the critical region of the Friedman statistic. Communications in Statistics - Theory and Methods 9(6):571–595

Jo T, Japkowicz N (2004) Class imbalances versus small disjuncts. SIGKDD Explor Newsl 6(1):40–49. DOI 10.1145/1007730.1007737


Jutla IS, Jeub LG, Mucha PJ (2011-2016) A generalized Louvain method for community detection implemented in MATLAB. URL http://netwiki.amath.unc.edu/GenLouvain

Kubat M, Matwin S (1997) Addressing the curse of imbalanced training sets: One-sided selection. In: Proceedings of the Fourteenth International Conference on Machine Learning, Morgan Kaufmann Publishers Inc, San Francisco, CA, USA, pp 179–186

Lancichinetti A, Fortunato S (2009) Community detection algorithms: A comparative analysis. Physical Review E 80:056117. DOI 10.1103/PhysRevE.80.056117

Larremore DB, Clauset A, Jacobs AZ (2014) Efficiently inferring community structure in bipartite networks. Physical Review E 90:012805. DOI 10.1103/PhysRevE.90.012805

Li J, Fine JP (2010) Weighted area under the receiver operating characteristic curve and its application to gene selection. Journal of the Royal Statistical Society: Series C (Applied Statistics) 59(4):673–692. DOI 10.1111/j.1467-9876.2010.00713.x

Li X, Wang L, Sung E (2008) AdaBoost with SVM-based component classifiers. Engineering Applications of Artificial Intelligence 21(5):785–795. DOI 10.1016/j.engappai.2007.07.001

Lichman M (2013) UCI machine learning repository. URL http://archive.ics.uci.edu/ml

Liu W, Chawla S, Cieslak DA, Chawla NV (2010) A robust decision tree algorithm for imbalanced data sets. In: Proceedings of the Tenth SIAM International Conference on Data Mining, SIAM, Philadelphia, vol 10, pp 766–777

Liu XY, Wu J, Zhou ZH (2009) Exploratory undersampling for class-imbalance learning. IEEE Transactions on Systems, Man, and Cybernetics, Part B (Cybernetics) 39(2):539–550. DOI 10.1109/TSMCB.2008.2007853

Luts J, Ojeda F, Van de Plas R, De Moor B, Van Huffel S, Suykens JA (2010) A tutorial on support vector machine-based methods for classification problems in chemometrics. Analytica Chimica Acta 665(2):129–145. DOI 10.1016/j.aca.2010.03.030

Macskassy SA, Provost F (2007) Classification in networked data: A toolkit and a univariate case study. Journal of Machine Learning Research 8(May):935–983

Martens D, Provost F (2014) Explaining data-driven document classifications. MIS Quarterly 38(1):73–100

Martens D, Provost F, Clark J, Junque de Fortuny E (2016) Mining massive fine-grained behavior data to improve predictive analytics. MIS Quarterly 40(4):869–888

Mazurowski MA, Habas PA, Zurada JM, Lo JY, Baker JA, Tourassi GD (2008) Training neural network classifiers for medical decision making: The effects of imbalanced datasets on classification performance. Neural Networks 21(2-3):427–436. DOI 10.1016/j.neunet.2007.12.031

Mease D, Wyner AJ, Buja A (2007) Boosted classification trees and class probability/quantile estimation. Journal of Machine Learning Research 8:409–439

Nemenyi P (1963) Distribution-free multiple comparisons. Dissertation, Princeton University

Newman MEJ, Girvan M (2004) Finding and evaluating community structure in networks. Physical Review E 69:026113. DOI 10.1103/PhysRevE.69.026113

Ng AY (2004) Feature selection, L1 vs. L2 regularization, and rotational invariance. In: Proceedings of the Twenty-first International Conference on Machine Learning, ACM, New York, NY, USA, ICML '04, p 78. DOI 10.1145/1015330.1015435

Ng AY, Jordan MI (2002) On discriminative vs. generative classifiers: A comparison of logistic regression and naive Bayes. In: Advances in Neural Information Processing Systems 14, MIT Press, pp 841–848

Ngai E, Hu Y, Wong Y, Chen Y, Sun X (2011) The application of data mining techniques in financial fraud detection: A classification framework and an academic review of literature. Decision Support Systems 50(3):559–569. DOI 10.1016/j.dss.2010.08.006

Platt JC (1999) Probabilistic outputs for support vector machines and comparisons to regularized likelihood methods. In: Advances in Large-Margin Classifiers, MIT Press, pp 61–74

Porter MA, Onnela JP, Mucha PJ (2009) Communities in networks. Notices of the American Mathematical Society 56(9):1082–1097

Provost F, Fawcett T (2013) Data Science for Business: What You Need to Know About Data Mining and Data-Analytic Thinking. O'Reilly Media, Inc

Provost F, Dalessandro B, Hook R, Zhang X, Murray A (2009) Audience selection for on-line brand advertising: Privacy-friendly social network targeting. In: Proceedings of the 15th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, ACM, New York, NY, USA, KDD '09, pp 707–716. DOI 10.1145/1557019.1557098


Raskutti B, Kowalczyk A (2004) Extreme re-balancing for SVMs: A case study. SIGKDD Explor Newsl 6(1):60–69. DOI 10.1145/1007730.1007739

Rosvall M, Bergstrom CT (2008) Maps of random walks on complex networks reveal community structure. Proceedings of the National Academy of Sciences 105(4):1118–1123. DOI 10.1073/pnas.0706851105

Schapire RE (1999) A brief introduction to boosting. In: Proceedings of the 16th International Joint Conference on Artificial Intelligence - Volume 2, Morgan Kaufmann Publishers Inc, San Francisco, CA, USA, IJCAI '99, pp 1401–1406

Schapire RE, Singer Y (1999) Improved boosting algorithms using confidence-rated predictions. Machine Learning 37(3):297–336. DOI 10.1023/A:1007614523901

Shmueli G (2017) Analyzing behavioral big data: Methodological, practical, ethical, and moral issues. Quality Engineering 29(1):57–74. DOI 10.1080/08982112.2016.1210979

Sobhani P, Viktor H, Matwin S (2015) Learning from imbalanced data using ensemble methods and cluster-based undersampling. In: New Frontiers in Mining Complex Patterns: Third International Workshop, NFMCP 2014, Held in Conjunction with ECML-PKDD 2014, Nancy, France, September 19, 2014, Revised Selected Papers. Springer International Publishing, Cham, pp 69–83. DOI 10.1007/978-3-319-17876-9_5

Stankova M (2016) Classification within network data with a bipartite structure. Dissertation, University of Antwerp

Stankova M, Martens D, Provost F (2015) Classification over bipartite graphs through projection. Working Papers 2015001, University of Antwerp, Faculty of Applied Economics

Sun Y, Kamel MS, Wong AK, Wang Y (2007) Cost-sensitive boosting for classification of imbalanced data. Pattern Recognition 40(12):3358–3378. DOI 10.1016/j.patcog.2007.04.009

Suykens JA, Van Gestel T, De Brabanter J, De Moor B, Vandewalle J (2002) Least Squares Support Vector Machines. World Scientific, Singapore

Tang B, He H (2015) ENN: Extended nearest neighbor method for pattern recognition [research frontier]. IEEE Computational Intelligence Magazine 10(3):52–60. DOI 10.1109/MCI.2015.2437512

Tang Y, Zhang YQ, Chawla NV, Krasser S (2009) SVMs modeling for highly imbalanced classification. IEEE Transactions on Systems, Man, and Cybernetics, Part B (Cybernetics) 39(1):281–288. DOI 10.1109/TSMCB.2008.2002909

Tobback E, Moeyersoms J, Stankova M, Martens D (2016) Bankruptcy prediction for SMEs using relational data. Working Paper 2016004, University of Antwerp, Faculty of Applied Economics

Verbeke W, Dejaeger K, Martens D, Hur J, Baesens B (2012) New insights into churn prediction in the telecommunication sector: A profit driven data mining approach. European Journal of Operational Research 218(1):211–229. DOI 10.1016/j.ejor.2011.09.031

Veropoulos K, Campbell I, Cristianini N (1999) Controlling the sensitivity of support vector machines. In: Proceedings of the International Joint Conference on Artificial Intelligence, Stockholm, Sweden (IJCAI99), pp 55–60

Whitrow C, Hand DJ, Juszczak P, Weston D, Adams NM (2009) Transaction aggregation as a strategy for credit card fraud detection. Data Mining and Knowledge Discovery 18(1):30–55. DOI 10.1007/s10618-008-0116-z

Wickramaratna J, Holden SB, Buxton BF (2001) Performance degradation in boosting. In: Proceedings of the Second International Workshop on Multiple Classifier Systems, Springer, London, UK, MCS '01, pp 11–21

Yen SJ, Lee YS (2009) Cluster-based under-sampling approaches for imbalanced data distributions. Expert Systems with Applications 36(3, Part 1):5718–5727. DOI 10.1016/j.eswa.2008.06.108

Yu HF, Lo HY, Hsieh HP, Lou JK, McKenzie TG, Chou JW, Chung PH, Ho CH, Chang CF, Wei YH, et al (2010) Feature engineering and classifier ensemble for KDD Cup 2010. In: Proceedings of the KDD Cup 2010 Workshop, pp 1–16

Zha H, He X, Ding C, Simon H, Gu M (2001) Bipartite graph partitioning and data clustering. In: Proceedings of the Tenth International Conference on Information and Knowledge Management, ACM, New York, NY, USA, CIKM '01, pp 25–32. DOI 10.1145/502585.502591

Zhang J, Mani I (2003) kNN approach to unbalanced data distributions: A case study involving information extraction. In: Proceedings of the ICML 2003 Workshop on Learning from Imbalanced Datasets, Washington DC

Ziegler CN, McNee SM, Konstan JA, Lausen G (2005) Improving recommendation lists through topic diversification. In: Proceedings of the 14th International Conference on World Wide Web, ACM, New York, NY, USA, WWW '05, pp 22–32. DOI 10.1145/1060745.1060754



Table 9 (continued). Additionally, an extra column indicating average ranks is included (lower ranks are preferred).

            Adver(p = [])  Adver(p = 1)  CRF(p = [])  Bank(p = [])

BL          0010953        0002796       0725911      7089334
OSR         0012178        0006166       3685813      1797481
SMOTE       0123112        0017764       5633862      []
ADASYN      0183767        0021728       5768669      []
RUS         0012115        000204        0147392      5247441
CL Knn      0061324        0005568       1106755      7373282
Far Knn     0079078        0007069       1110379      9759619
CBU         3378235        3236754       []           []
AB          0069199        0103518       1153196      8308618
AC(R = 28)  0193092        0068905       2047434      7170548
AC(R = RL)  0107652        0037963       1387174      1063466
EE(S = 10)  0138485        0085686       0198656      2495117
EE(S = 15)  0185136        0139121       0285345      3640107
EE par      0012342        0009275       0019023      2426738

            Average Rank [pos]

BL          2.94   [2]
OSR         4.19   [4]
SMOTE       9.59   [11]
ADASYN      10.91  [13]
RUS         1.38   [1]
CL Knn      6.5    [5]
Far Knn     6.56   [6]
CBU         14     [14]
AB          8.06   [7]
AC(R = 28)  10.81  [12]
AC(R = RL)  9.25   [9]
EE(S = 10)  8.25   [8]
EE(S = 15)  9.56   [10]
EE par      3      [3]

Timings are a substantial concern with respect to these large datasets. As can be observed, the EE-technique thrives on these big and highly imbalanced data, even in its non-parallel form.


[Figure: scatter of average rank AUC (horizontal axis) versus average rank Time (vertical axis) for BL, OSR, SMOTE, ADASYN, RUS, Cl Knn, Far Knn, CBU, AB, AC(R = 28), AC(R = RL), EE(S = 10), EE(S = 15) and EE par.]

Fig 5 Average rank AUC versus average rank Time (see Table 9) across all datasets from Table 2 (excluding Flickr and Kdd). Regarding the AUC rank, an extra row EE par is added, having the same AUC-value as EE(S = 15). Ranks for each dataset are subsequently obtained via the procedure outlined in Section 4.6.1. Points occurring in the upper-right region are preferred.

5 The effect of the chosen base learner

In this short section we investigate the influence of the chosen base learner. More specifically, we ask ourselves whether some base inducers can more easily cope with the imbalanced learning issue, and in doing so verify the conclusion that EE is a suitable technique to improve upon the baseline performance. We rely on several popular classification methods for behaviour data: two regularization-based techniques (SVM, LR) and two heuristic approaches (Naive Bayes (NB) and Behavioural Similarity (BeSim)). The next paragraphs briefly introduce these methods. Note that source code and dataset results are provided in our on-line repository.

LR constructs a linear model that maximizes the likelihood of the observed data by imposing a logistic model for the parametric probability model P(Y|X). Ng and Jordan (2002) provide a thorough introduction to the topic (see footnote 27) and note that, in its plain form, LR can suffer from overfitting, especially when the input data is very high-dimensional. Therefore, we resort to regularized logistic regression. We opt for L2-regularized LR, since L1-regularization corresponds with a natural way of imposing feature selection (Ng 2004). The LIBLINEAR toolbox (Fan et al 2008) is used to obtain these L2-regularized LR models.

27 For an accessible introduction, see the chapter on "Generative and Discriminative Classifiers: Naive Bayes and Logistic Regression" provided at the following link: http://www.cs.cmu.edu/~tom/NewChapters.html
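To make the objective concrete, the following minimal sketch fits an L2-regularized logistic regression by plain gradient descent on a toy binary dataset. It is a standalone illustration only: the experiments in the paper rely on the LIBLINEAR toolbox, and the learning rate, regularization value and toy data below are hypothetical choices.

```python
import math

def train_l2_lr(X, y, lam=0.1, lr=0.5, epochs=200):
    """Minimize sum_i log(1 + exp(-y_i * w.x_i)) + lam/2 * ||w||^2, y in {-1,+1}."""
    d, n = len(X[0]), len(X)
    w = [0.0] * d
    for _ in range(epochs):
        grad = [lam * wj for wj in w]          # gradient of the L2 penalty
        for xi, yi in zip(X, y):
            s = sum(wj * xj for wj, xj in zip(w, xi))
            p = 1.0 / (1.0 + math.exp(yi * s))  # sigma(-yi * s)
            for j in range(d):
                grad[j] -= p * yi * xi[j]       # gradient of the logistic loss
        w = [wj - lr * gj / n for wj, gj in zip(w, grad)]
    return w

# toy behaviour-like binary features: feature 0 marks positives,
# feature 2 occurs mostly with negatives
X = [[1, 0, 1], [1, 1, 0], [0, 1, 1], [0, 0, 1]]
y = [+1, +1, -1, -1]
w = train_l2_lr(X, y)
```

The regularization strength lam plays the role of 1/C in the LIBLINEAR parameterization: larger lam shrinks the weights and yields a "weaker" model.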


NB (Ng and Jordan 2002) relies on the use of Bayes' rule, which expresses the posterior P(Y|X) as the product of the likelihood P(X|Y) and the prior P(Y), divided by the evidence P(X). The naivety of the approach originates from assuming conditional independence of the features. We impose a multivariate event model regarding the distribution P(X|Y). Junque de Fortuny et al (2014a) provide an efficient NB implementation for dealing with large and sparse binary behaviour data.
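The following standalone sketch (not the implementation of Junque de Fortuny et al) illustrates the multivariate Bernoulli event model on sparse binary rows represented as sets of active feature ids; the Laplace smoothing parameter alpha and the toy data are assumptions.

```python
import math
from collections import defaultdict

def train_bernoulli_nb(rows, labels, n_features, alpha=1.0):
    """Multivariate Bernoulli NB; rows are sets of active feature ids."""
    classes = sorted(set(labels))
    prior, cond = {}, {}
    for c in classes:
        idx = [i for i, l in enumerate(labels) if l == c]
        prior[c] = math.log(len(idx) / len(labels))
        counts = defaultdict(int)
        for i in idx:
            for f in rows[i]:
                counts[f] += 1
        # Laplace-smoothed estimate of P(feature active | class)
        cond[c] = {f: (counts[f] + alpha) / (len(idx) + 2 * alpha)
                   for f in range(n_features)}
    return prior, cond

def log_posterior(row, c, prior, cond):
    # full loop over all features shown for clarity; an efficient sparse
    # implementation precomputes the "all absent" sum per class and then
    # corrects only the terms of the active features
    lp = prior[c]
    for f, p in cond[c].items():
        lp += math.log(p if f in row else 1.0 - p)
    return lp

rows = [{0, 2}, {0, 1}, {1, 3}, {3}]
labels = [1, 1, 0, 0]
prior, cond = train_bernoulli_nb(rows, labels, n_features=4)
```

The sparse-correction trick mentioned in the comment is what makes NB feasible on behaviour data with millions of features.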

The BeSim approach conceptually corresponds to projecting the bigraph to a unigraph using the procedure outlined in Section 3.2. Next, the weighted-vote relational neighbour classifier (Macskassy and Provost 2007) is used to infer unknown labels/scores through a weighted probability estimation using the known labels of neighbouring nodes. A very fast and scalable implementation of this approach (the SW-transformation) is presented in detail in Stankova et al (2015).
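A minimal sketch of the weighted-vote idea on a projected bigraph is given below. It uses unit edge weights, whereas the actual SW-transformation of Stankova et al (2015) derives the weights from the projection, so this is only a conceptual illustration; the node names and toy edges are hypothetical.

```python
from collections import defaultdict

def wvrn_scores(edges, known_labels):
    """Weighted-vote relational neighbour scoring on the unigraph obtained by
    projecting a bigraph: two bottom nodes are neighbours whenever they share
    a top node (each shared top node contributes a unit-weight vote here)."""
    by_top, by_bottom = defaultdict(set), defaultdict(set)
    for b, t in edges:
        by_top[t].add(b)
        by_bottom[b].add(t)
    scores = {}
    for b, tops in by_bottom.items():
        if b in known_labels:
            continue
        w_sum = vote = 0.0
        for t in tops:
            for nb in by_top[t]:
                if nb != b and nb in known_labels:
                    w_sum += 1.0
                    vote += known_labels[nb]
        scores[b] = vote / w_sum if w_sum else 0.0
    return scores

# hypothetical toy bigraph: bottom nodes are users, top nodes are products;
# user "u" shares products with labelled users f1, f2 (label 1) and ok1 (label 0)
edges = [("f1", "p1"), ("f1", "p2"), ("f2", "p2"),
         ("ok1", "p3"), ("u", "p2"), ("u", "p3")]
scores = wvrn_scores(edges, {"f1": 1, "f2": 1, "ok1": 0})
```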

In Table 10 we compare the performances of the baseline (BL) application of the methods with their EE-version (with S = 15, T = 20). It is quite striking to see the EE-version dominating the baseline version when looking at each method individually. This suggests that the imbalanced learning issue has an impact on each of the aforementioned techniques. Focussing on the regularization-based approaches, we observe the baseline LR to be less influenced by the imbalance than the SVM method. Their EE-versions (EE SVM and EE LR) can boost the baseline performance in almost all cases and have the highest average ranks amongst all the proposed methods. As already emphasized, the regularization-based approaches offer an added element of flexibility, in the sense that the strength of the underlying base learner can be controlled.

The heuristic approaches NB and BeSim do not offer this flexibility. Furthermore, it is not directly possible to integrate instance weights during the training phase of the base learner. During the boosting process, we therefore sample from the distribution Dt to generate (unweighted) examples and subsequently train a BeSim/NB learner. The results in Table 10 indicate BL NB to be the worst performing algorithm amongst all methods tried. Due to the conditional independence assumption, one could regard this as a weak learner, which makes it suitable for use in a boosting process. Indeed, the EE NB approach is able to significantly improve upon its performance. We found this to be the slowest EE-version (see footnote 28), e.g., taking more than 24 hours to process a single fold of the Bank dataset, versus several minutes for the other versions. The average rank estimate of EE NB is rather pessimistic, since we assigned the lowest performance across all methods in case of missing results. BeSim can be considered a stronger learner than NB, and its possible boost in performance is typically lower than that of NB. Having said this, we found EE BeSim to improve upon BL BeSim for all datasets (except Adver(p = 1)).

28 This statement only relates to the EE-version. In the baseline case, the timings of the NB-version are roughly of the same order of magnitude as SVM and LR (though this heavily depends on the chosen regularization parameter C). With EE, each subset has a size limited to twice the minority class training size (a relatively low number compared to the size of the majority class). This works in favour of SVM/LR (e.g., SVM has a computational complexity of O(m^2), with m the number of training instances). Also note that the NB-version is indeed optimized for sparse data. Internally, the NB-implementation writes a fairly large model file to a specified directory (sizes of 1 GB can be encountered there). This writing operation slows down the training stage (especially in case of EE, where S × T = 15 × 20 = 300 model files are constructed).
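The resampling step used for base learners that cannot handle instance weights can be sketched as follows; the function name and the toy distribution are illustrative.

```python
import random

def resample_from_dt(examples, dt, m, seed=0):
    """Draw m unweighted training examples i.i.d. from the boosting
    distribution Dt, so that a weight-agnostic base learner (e.g. BeSim or
    NB) can still be trained inside the boosting loop."""
    rng = random.Random(seed)
    return rng.choices(examples, weights=dt, k=m)

# a degenerate toy distribution that puts all mass on the first example
sample = resample_from_dt(["a", "b"], [1.0, 0.0], 50)
```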


Table 10 Average tenfold AUC test set performance of the BL compared to its EE-version (S = 15, T = 20) for SVM, LR, BeSim and NB. In case the EE NB implementation crashed, we put a × in the table. If it is too slow (> 24 hours per fold), we place a []. Additionally, average ranks are assigned to each method. In case EE NB has missing results for a particular dataset, we ensure it has the lowest rank across all other algorithms.

            Mov G(p = 1)   Mov G(p = 25)  Mov Th(p = [])  Yahoo A(p = 1)

BL SVM      71.6 (2.62)    81.41 (1.32)   79.77 (5.33)    56.49 (3.37)
EE SVM      76.12 (2.88)   85.13 (1.86)   86.43 (5.86)    59.74 (2.96)
BL LR       71.02 (2.09)   84.39 (1.84)   83.14 (4.17)    57.84 (2.39)
EE LR       76.69 (2.92)   85.03 (1.98)   86.3 (5.37)     59.79 (2.62)
BL BeSim    76.1 (3.58)    81.3 (2.92)    82.81 (6.6)     56.27 (2.73)
EE BeSim    76.31 (3.71)   81.37 (2.9)    85.02 (6.28)    57.7 (1.71)
BL NB       70.26 (5.84)   77.01 (2.54)   70.48 (10.14)   52.56 (2.09)
EE NB       75.93 (2.83)   85.56 (2.01)   86.91 (4.15)    57.55 (2.73)

            Yahoo A(p = 25)  Yahoo G(p = 1)  Yahoo G(p = 25)  TaFeng(p = 1)

BL SVM      61.61 (2.48)   66.84 (3.66)   78.82 (1.39)    55.75 (1.6)
EE SVM      66.38 (3.16)   73.48 (2.32)   80.55 (1.55)    61.13 (1.83)
BL LR       66.27 (2.96)   69.82 (1.93)   80.45 (1.59)    58.91 (2.31)
EE LR       66.22 (3.28)   73.08 (2.14)   80.53 (1.56)    61.43 (2.32)
BL BeSim    64.54 (2.02)   68.89 (2.49)   79.55 (1.96)    57.89 (1.18)
EE BeSim    65.25 (2.23)   71.18 (2.91)   80.04 (1.85)    59.36 (1.47)
BL NB       65 (1.65)      63.33 (2.56)   78.89 (1.64)    54.61 (1.2)
EE NB       66.6 (2.79)    70.99 (2.88)   81.01 (1.3)     59.01 (1.84)

            TaFeng(p = 25)  Book(p = 1)   Book(p = 25)   LST(p = 1)

BL SVM      66.94 (1.34)   52.6 (1.29)    60.08 (0.71)   99.99 (0.01)
EE SVM      70.4 (1.3)     55.34 (1.28)   65.4 (0.51)    99.98 (0.01)
BL LR       69.24 (1.3)    55.34 (1.27)   63.84 (0.75)   99.99 (0.01)
EE LR       70.28 (1.28)   55.49 (1.49)   65.41 (0.63)   99.97 (0.02)
BL BeSim    67.49 (1.23)   55.19 (1.27)   63.7 (0.63)    99.99 (0.01)
EE BeSim    68 (1.21)      55.21 (1.15)   64.38 (0.42)   99.99 (0)
BL NB       65.21 (1.64)   52.93 (0.9)    59.75 (0.47)   98.69 (0.3)
EE NB       70.72 (1.15)   ×              63.46 (0.61)   99.92 (0.04)

            Adver(p = [])  Adver(p = 1)   CRF(p = [])    Bank(p = [])

BL SVM      96.37 (1.94)   91.18 (2.97)   64.36 (18.97)  66.82 (0.88)
EE SVM      97.63 (1.35)   93.3 (2.14)    86.35 (9.99)   71.54 (0.76)
BL LR       97.19 (1.44)   88.51 (1.93)   81.87 (19.63)  71.43 (0.72)
EE LR       97.57 (0.96)   93.02 (2.06)   86.84 (9.62)   71.77 (0.62)
BL BeSim    97.26 (1.12)   95.38 (1.35)   86.91 (9.36)   67.85 (0.67)
EE BeSim    97.38 (1.04)   93.83 (1.35)   87.02 (10.43)  70.41 (0.55)
BL NB       93.75 (1.9)    93.37 (1.9)    87.24 (9.38)   67.83 (0.63)
EE NB       94.04 (1.75)   ×              ×              []

            Flickr(p = 0.1)  Kdd(p = 0.5)  Average Rank

BL SVM      74.92 (0.17)   74.53 (0.05)   6.44 [7]
EE SVM      79.86 (0.13)   80.98 (0.05)   2.39 [1]
BL LR       79.03 (0.11)   81.29 (0.04)   4.28 [4]
EE LR       79.85 (0.13)   80.75 (0.05)   2.61 [2]
BL BeSim    74.62 (0.13)   74.95 (0)      5.11 [6]
EE BeSim    76.4 (0.13)    77.55 (0.03)   3.61 [3]
BL NB       81.36 (0.1)    74.29 (0.05)   6.5 [8]
EE NB       []             []             5.06 [5]


6 Conclusions and future research directions

We investigated the effects of sampling and cost-sensitive learning techniques on the problem of learning from imbalanced behaviour data. This type of big data has fundamentally different properties and internal structure than traditional data, and is characterized by sparseness and very large dimensions. This is a new research topic which enables benefits across a wide variety of application domains, such as fraud detection, targeted advertising, predictive policing, churn prediction, default prediction, etc. To enable future developments in this field, we provide an on-line repository containing implementations, datasets and results, available at the following link: http://www.applieddatamining.com/cms?q=software

Oversampling techniques need to be adapted in order to cope with this kind of data. We proposed new versions of the traditional SMOTE and ADASYN techniques, which differ in the way synthetic samples are generated and similarity computations are performed. The plain random oversampling (OSR) and the synthetic approaches are competitive methods in terms of AUC and do not seem to suffer from overfitting, as traditional studies report. Though the synthetic approaches have comparable AUC-performances to OSR, they are useless in practice due to their computationally expensive procedures. The OSR method is very fast for small and medium sized datasets, yet becomes inappropriate for larger datasets because the training times of the underlying base learner increase drastically.
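The OSR step itself is simple; a minimal sketch follows, assuming rows are stored sparsely as sets of active features (all names are illustrative).

```python
import random

def random_oversample(pos_rows, neg_rows, seed=0):
    """OSR: duplicate randomly drawn minority (pos) rows until both classes
    contain the same number of instances."""
    rng = random.Random(seed)
    extra = [rng.choice(pos_rows) for _ in range(len(neg_rows) - len(pos_rows))]
    return pos_rows + extra, neg_rows

pos, neg = random_oversample([{0}, {1}, {0, 1}], [{2}] * 10)
```

Note that the duplicates enlarge the training set, which is exactly why OSR becomes expensive on large datasets: the base learner is trained on up to twice the majority class size.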

With respect to undersampling methodologies, we employed a random undersampling technique (RUS) and Knn-based informed approaches. The latter type selects majority class instances that are closest (Cl Knn) or farthest (Far Knn) from the minority class examples, and allows one to investigate the effect of noise/outliers on SVM performance. We observed that these odd instances have a severe performance degrading effect. RUS and Far Knn approaches can significantly improve performance with respect to the baseline, thanks to their abilities to reduce noise/outlier levels and to put more emphasis on the minority class. Having said this, they usually perform worse than any of the oversampling approaches. This is in sheer contrast to the results obtained from traditional studies, and is heavily related to the intrinsic nature of behaviour data. Timing-wise, RUS is the fastest method. The informed approaches are slower than RUS due to the required nearest neighbour computations. Additionally, a cluster-based undersampling (CBU) technique was proposed to alleviate the within-class imbalance. For low undersampling rates, the performance of CBU is superior to plain RUS, whereas for high undersampling rates the opposite is true. Overall, no substantial gain in performance is obtained in comparison with a simple RUS approach. Furthermore, the method is the slowest among all techniques considered and relies on computationally expensive clustering heuristics.
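The RUS step can be sketched as follows; the ratio parameter name beta is a hypothetical choice standing in for whatever undersampling rate is desired.

```python
import random

def random_undersample(pos_rows, neg_rows, beta=1.0, seed=0):
    """RUS: keep all minority (pos) rows and a uniformly random subset of the
    majority (neg) class of size beta * |pos| (without replacement)."""
    rng = random.Random(seed)
    k = min(len(neg_rows), int(beta * len(pos_rows)))
    return list(pos_rows), rng.sample(list(neg_rows), k)

pos, neg = random_undersample(range(10), range(1000), beta=2.0)
```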

We studied the effects of boosting, where we made use of a very specific linear SVM and logistic regression combination that uses confidence-rated predictions instead of a more popular plain "discrete" boosting algorithm (with weak hypotheses having values in {-1, 1}). Our experiments clearly indicated the regularization constant C in the SVM formulation to act as a "weakness" indicator. Indeed, higher C-values cause stronger learners and shouldn't be used in the boosting process, which matches the analysis of studies dealing with traditional data. Strong learners can already quite accurately distinguish majority class from minority class behaviours. The boosting process will focus on the hard to learn instances, among which are noise/outliers, and this can result in overfitting behaviour. We also observed that boosting can usually outperform the plain baseline, indicating that a combination of weak learners can achieve higher AUC-performance than a single strong learner.
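The confidence-rated boosting scheme of Schapire and Singer (1999) can be sketched with a simple stump-based weak learner standing in for the paper's SVM/LR combination; the toy data and the stump learner are illustrative assumptions.

```python
import math

def confidence_stump(X, y, D):
    """Weak learner: pick the binary feature whose split minimizes the
    normalizer Z under weights D, and return the confidence-rated
    prediction 0.5*ln(W+/W-) for each branch (Schapire & Singer 1999)."""
    best, eps = None, 1e-9
    for j in range(len(X[0])):
        w = {(b, c): eps for b in (0, 1) for c in (-1, 1)}
        for xi, yi, di in zip(X, y, D):
            w[(xi[j], yi)] += di
        Z = 2 * sum(math.sqrt(w[(b, 1)] * w[(b, -1)]) for b in (0, 1))
        if best is None or Z < best[0]:
            conf = {b: 0.5 * math.log(w[(b, 1)] / w[(b, -1)]) for b in (0, 1)}
            best = (Z, j, conf)
    _, j, conf = best
    return lambda x, j=j, conf=conf: conf[x[j]]

def boost(X, y, T=10):
    """Confidence-rated AdaBoost: D_{t+1}(i) ~ D_t(i) * exp(-y_i h_t(x_i))."""
    n = len(X)
    D = [1.0 / n] * n
    hs = []
    for _ in range(T):
        h = confidence_stump(X, y, D)
        hs.append(h)
        D = [di * math.exp(-yi * h(xi)) for di, xi, yi in zip(D, X, y)]
        s = sum(D)
        D = [di / s for di in D]
    return lambda x: sum(h(x) for h in hs)

X = [[1, 0], [1, 1], [0, 1], [0, 0]]
y = [1, 1, -1, -1]
H = boost(X, y, T=5)
```

A "weak" stump here plays the role that a small-C linear SVM plays in the paper: its real-valued confidence is folded directly into the ensemble score, so no separate alpha coefficient is needed.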

EasyEnsemble (EE) integrates random undersampling with a boosting process. By sampling several balanced subsets from the training data and combining these, we can emphasize the minority class, achieve noise/outlier reduction effects and simultaneously explore the majority class space efficiently. Our version differs from the original formulation because we employ a different type of underlying weak learner (SVM and LR combination) that is fed to a confidence rated boosting algorithm (instead of the discrete version). We observed superior AUC-performance with respect to all of the aforementioned techniques. Furthermore, the method is very fast since each subset can be trained in parallel and has a size that is twice as large as the minority class training set. We studied the effect of the number of subsets and noted that a limited amount of these (O(10)) is already sufficient to obtain satisfactory performances; increasing this number further causes only minor benefits.
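The subset-sampling skeleton of EE can be sketched as below, again as our own minimal illustration under stated assumptions: `fit_and_boost` stands in for the boosted SVM+LR learner of the paper, labels are assumed to be +1 (minority) and -1 (majority), and member scores are simply averaged.

```python
import numpy as np

def easy_ensemble(X, y, fit_and_boost, S=10, rng=None):
    """EasyEnsemble sketch: draw S balanced subsets (all minority
    instances plus an equally sized random majority sample), run a
    boosting process on each, and average the resulting scores.
    Subsets are independent, so the loop is trivially parallelizable."""
    if rng is None:
        rng = np.random.default_rng(0)
    min_idx = np.flatnonzero(y == 1)
    maj_idx = np.flatnonzero(y == -1)
    members = []
    for _ in range(S):
        sub = rng.choice(maj_idx, size=len(min_idx), replace=False)
        idx = np.concatenate([min_idx, sub])   # subset size = 2 * |minority|
        members.append(fit_and_boost(X[idx], y[idx]))
    return lambda Xq: np.mean([m(Xq) for m in members], axis=0)
```

Each subset has size twice the minority class, which explains the low training cost, while different majority samples across subsets provide the majority-space exploration mentioned above.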

Additionally, a statistical comparison of each of the proposed methods was conducted across a wide variety of datasets. Holm's significance test indicates that each of the proposed techniques (oversampling, undersampling except Cl Knn, cost-sensitive learning and boosting variants) can significantly outperform the plain baseline performance (each of the null-hypotheses of equal performance gets rejected with a confidence level of at least 91%). Comparison of the techniques with EE showed that the latter can significantly outperform all of them (each of the null-hypotheses of equal performance gets rejected with a confidence level of at least 75%). A timing-wise comparison showed EE to be the preferred method (after RUS) for large and highly imbalanced datasets. A parallel implementation would make EE the fastest method, even for medium sized datasets that show a high level of imbalance.
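Holm's step-down procedure used here is short enough to state in full; the sketch below is a generic implementation of the procedure, with illustrative (hypothetical) p-values in the usage note.

```python
def holm_reject(pvalues, alpha=0.05):
    """Holm's step-down procedure: sort the m p-values ascending and
    compare the i-th smallest (0-based rank i) against alpha / (m - i);
    the first failure retains that hypothesis and all larger ones."""
    m = len(pvalues)
    order = sorted(range(m), key=lambda i: pvalues[i])
    reject = [False] * m
    for rank, i in enumerate(order):
        if pvalues[i] <= alpha / (m - rank):
            reject[i] = True
        else:
            break                    # all remaining hypotheses retained
    return reject
```

For example, `holm_reject([0.01, 0.04, 0.03, 0.005])` tests 0.005 against 0.05/4, then 0.01 against 0.05/3, then stops at 0.03 > 0.05/2, rejecting only the first and last hypotheses.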

The influence of the chosen base learner was investigated by comparing the baseline to its EE-version for the SVM, LR, NB and BeSim inducers. Each of the EE-versions is able to improve upon the baseline performance for nearly all datasets, where base learners of the regularization type are superior because of their abilities to control the strength of the learner.

Though we conducted an extensive benchmark study, there are still plenty of future research directions. First, and most important, there is an urgent need to develop a more general theoretical framework to completely understand the effect of imbalance and why certain methods outperform others. Secondly, other types of mechanisms could be investigated in this setting: advanced sampling (Guo et al 2008) set-ups perform sampling based on preliminary classifications; one-class learning is a recognition based methodology where a discriminative boundary is learned around the training examples of a single class alone, and Raskutti and Kowalczyk (2004) demonstrated the superiority of the recognition based one-class SVMs over the discriminative two-class SVMs; other types of ensemble learning strategies (e.g. bagging and stacking) could be investigated; Batista et al (2004) note that combining over- and undersampling can be beneficial when data are highly imbalanced or show very few minority class instances; the ideas behind Extended Nearest Neighbour (Tang and

Imbalanced classification in sparse and large behaviour datasets 43

He 2015) could guide us in the sampling process. As opposed to Knn, "it considers not only who are the nearest neighbours of a certain sample, but also who consider the particular sample as their nearest neighbour" (Tang and He 2015). The method assigns a class membership to an unknown sample that maximizes the intra-class coherence. Finally, more advanced techniques exist that can provide K (the number of nearest neighbours) faster or with (slightly) better performance.
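The "reverse" neighbour view that ENN adds on top of plain Knn can be made concrete with a toy sketch (our own illustration of the reverse-nearest-neighbour relation only, not the full ENN classifier of Tang and He):

```python
import numpy as np

def reverse_nearest_neighbours(X, i):
    """Indices of samples whose single nearest neighbour is sample i,
    i.e. the samples that 'consider i as their nearest neighbour'."""
    d = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=2)
    np.fill_diagonal(d, np.inf)            # exclude self-matches
    nn = d.argmin(axis=1)                  # each sample's nearest neighbour
    return np.flatnonzero(nn == i)
```

Note the relation is not symmetric: a sample can have many reverse nearest neighbours or none at all, which is precisely the extra information ENN exploits.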

Another issue relates to the lack of comprehensibility of the EasyEnsemble technique in its present form. Because we currently integrate Logistic Regression in the boosting process, the final ensemble classifier is non-linear and therefore incomprehensible. It might be interesting to investigate the numerical boosting technique proposed in Schapire and Singer (1999), where each weak learner is allowed to output real values (instead of values in [-1, 1]). In that case, we would be able to use a plain linear SVM as a weak learner, without the Logistic Regression component. The final ensemble would be a linear model, which is surely easier to comprehend. Also, the instance-based approach of Martens and Provost (2014) could be used to explain individual predictions of the non-linear EE-technique.
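The observation that an ensemble of linear weak learners collapses back into a single linear model can be checked directly. The sketch below is an illustration of that algebraic point, not the paper's method; the ensemble weights `a_t` are assumed given.

```python
import numpy as np

def combine_linear_learners(weights, coefs, intercepts):
    """If every weak learner is linear, f_t(x) = w_t . x + b_t, then the
    weighted ensemble sum_t a_t * f_t(x) is itself linear with coefficient
    vector sum_t a_t * w_t and intercept sum_t a_t * b_t, so the final
    model stays as interpretable as a single linear classifier."""
    W = np.asarray(coefs)                   # shape (T, n_features)
    a = np.asarray(weights)                 # ensemble weights a_t
    return a @ W, float(a @ np.asarray(intercepts))
```

This is why dropping the Logistic Regression component would make the boosted ensemble comprehensible: the combination step reduces to a weighted sum of SVM coefficient vectors.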

A Oversampling experiments

Table 11 Oversampling experiments. Results show the average tenfold AUC test set performance with respect to increasing β-values (β = [β1, β2, β3, β4] = [0, 1/3, 2/3, 1]). Additionally, standard deviations are shown between brackets. Optimal performances for each method are highlighted in boldface. See Section 4.2 for parameter settings.

(columns: β1, β2, β3, β4)

Mov G(p = 1)
OSR     71.6 (2.62)   74.37 (2.04)  73.6 (1.84)   74.73 (2.45)
SMOTE   71.6 (2.62)   75.08 (2.18)  76.02 (2.14)  76.48 (2.3)
ADASYN  71.6 (2.62)   75.16 (1.92)  75.93 (2.08)  76.47 (2.29)

Mov G(p = 25)
OSR     81.41 (1.32)  83.49 (1.81)  83.84 (1.96)  83.91 (2.04)
SMOTE   81.41 (1.32)  83.32 (1.97)  83.59 (2.04)  83.76 (2.11)
ADASYN  81.41 (1.32)  83.61 (1.82)  84.02 (1.97)  83.69 (1.96)

Mov Th(p = [])
OSR     79.77 (5.33)  85.3 (4.66)   83.16 (4.5)   84.59 (5.69)
SMOTE   79.77 (5.33)  84.18 (6.51)  85.58 (5.97)  84.33 (5.75)
ADASYN  79.77 (5.33)  84.11 (6.77)  85.86 (6.06)  85.36 (5.13)

Yahoo A(p = 1)
OSR     55.92 (2.97)  58.66 (3.27)  59.99 (2.28)  59.74 (1.78)
SMOTE   55.92 (2.97)  59.76 (2.62)  59.74 (2.67)  59.43 (2.4)
ADASYN  55.92 (2.97)  59.54 (2.53)  59.55 (2.94)  59.56 (2.22)

Yahoo A(p = 25)
OSR     61.68 (2.42)  64.19 (3.17)  65.08 (3.26)  64.67 (2.1)
SMOTE   61.68 (2.42)  65.46 (3.63)  65.33 (3.23)  64.52 (2.98)
ADASYN  61.68 (2.42)  65.04 (3.74)  65.41 (3.47)  64.4 (2.21)

Yahoo G(p = 1)
OSR     66.84 (3.66)  72.18 (2.36)  73.11 (2.7)   72.49 (3.41)
SMOTE   66.84 (3.66)  72.65 (2.85)  73.27 (3.36)  73.37 (3.56)
ADASYN  66.84 (3.66)  72.87 (2.83)  73.18 (3.2)   73.39 (3.59)

Yahoo G(p = 25)
OSR     78.82 (1.39)  78.78 (2.02)  78.74 (1.86)  78.35 (1.47)
SMOTE   78.82 (1.39)  79.23 (1.57)  79.1 (1.2)    79.03 (1.89)
ADASYN  78.82 (1.39)  79.12 (1.43)  79.23 (1.37)  79.51 (2.01)

TaFeng(p = 1)
OSR     55.75 (1.6)   59.23 (1.96)  60 (1.68)     61.04 (2.36)
SMOTE   55.75 (1.6)   60.26 (1.95)  61.49 (1.8)   61.13 (1.52)
ADASYN  55.75 (1.6)   60.26 (1.9)   61.44 (1.85)  61.16 (1.5)

TaFeng(p = 25)
OSR     66.94 (1.34)  68.84 (1.21)  66.99 (1.42)  67.7 (1.41)
SMOTE   66.94 (1.34)  68.47 (1.5)   67.07 (1.15)  66.65 (0.81)
ADASYN  66.94 (1.34)  68.62 (1.38)  67.85 (1.6)   66.91 (1.39)

Book(p = 1)
OSR     52.6 (1.29)   53.61 (0.94)  55.41 (1.75)  55.87 (1.44)
SMOTE   52.6 (1.29)   54.77 (0.99)  54.91 (0.8)   54.36 (0.98)
ADASYN  52.6 (1.29)   54.86 (1.13)  55.06 (0.73)  54.54 (0.92)

Book(p = 25)
OSR     60.08 (0.71)  61.05 (0.94)  61.82 (1.12)  64.62 (0.57)
SMOTE   60.08 (0.71)  62.6 (0.73)   60.95 (0.68)  63 (0.8)
ADASYN  60.08 (0.71)  62.33 (0.73)  60.77 (0.85)  63.04 (0.58)

LST(p = 1)
OSR     99.99 (0.01)  99.99 (0.01)  99.99 (0.01)  99.99 (0.01)
SMOTE   99.99 (0.01)  99.99 (0.01)  99.99 (0.01)  99.99 (0.01)
ADASYN  99.99 (0.01)  99.99 (0.01)  99.99 (0.01)  99.99 (0.01)

Adver(p = [])
OSR     96.61 (1.82)  97.31 (1.65)  97.07 (1.84)  97.07 (1.79)
SMOTE   96.61 (1.82)  96.91 (1.66)  97.19 (1.65)  97.07 (1.91)
ADASYN  96.61 (1.82)  97.1 (1.7)    97.08 (1.87)  97.07 (1.88)

Adver(p = 1)
OSR     90.93 (3.02)  91.27 (3.03)  92.66 (2.82)  93.29 (1.97)
SMOTE   90.93 (3.02)  92.51 (2.03)  92.96 (2.14)  93.53 (1.81)
ADASYN  90.93 (3.02)  92.22 (2.33)  92.7 (2.36)   93.88 (1.73)

CRF(p = [])
OSR     64.06 (16.43) 80.82 (12.94) 81.28 (12.27) 81.91 (11.28)
SMOTE   64.06 (16.43) 78.64 (16.86) 82.52 (13.74) 79.32 (16.26)
ADASYN  64.06 (16.43) 78.95 (16.72) 81.19 (16.32) 79.31 (16.19)

Bank(p = [])
OSR     66.82 (0.88)  70.1 (0.74)   71.39 (0.8)   71.47 (0.8)
SMOTE   []            []            []            []
ADASYN  []            []            []            []


B Undersampling experiments

Table 12 Undersampling experiments. Results show the average tenfold AUC test set performance with respect to increasing βu-values (βu = [βu1, βu2, βu3, βu4, βu5] = [0, 1/4, 1/2, 3/4, 1]). Optimal performances (boldface) are highlighted for each imbalance ratio. Additionally, standard deviations are shown between brackets. Cl K represents the "Closest Knn" technique; Cl T represents the "Closest tot sim" technique (similar for Far K and Far T, see Section 3.2). Parameter settings are given in Section 4.2.

(columns: βu1, βu2, βu3, βu4, βu5)

Mov G(p = 1)
RUS    71.6 (2.6)   71.83 (2.6)  72.54 (2.5)  72.39 (3.1)  70.61 (3.5)
Cl K   71.6 (2.6)   71.4 (2)     70.96 (1.9)  70.43 (2.4)  69.05 (4.1)
Cl T   71.6 (2.6)   70.28 (2.5)  66.74 (2)    66.8 (2.1)   68.18 (3.6)
Far K  71.6 (2.6)   72.36 (2.7)  71.26 (3.4)  66.57 (5.2)  53.5 (3.5)
Far T  71.6 (2.6)   72.22 (2.8)  71.63 (3.6)  64.28 (5.3)  50.88 (4.4)
CBU    72.55 (2.6)  73.28 (2.6)  73.12 (2.6)  73.84 (2.5)  73 (3.1)

Mov G(p = 25)
RUS    81.41 (1.3)  81.36 (1.3)  81.78 (1.7)  82.05 (1.7)  81.6 (2.1)
Cl K   81.41 (1.3)  80.86 (1.2)  80.95 (1.6)  79.73 (2.3)  77.95 (2.3)
Cl T   81.41 (1.3)  79.9 (1.2)   78.21 (1.4)  77.87 (1.5)  77.76 (2.3)
Far K  81.41 (1.3)  80.9 (1.5)   78.17 (1.8)  74.25 (2.4)  69.79 (3.2)
Far T  81.41 (1.3)  80.86 (1.5)  77.2 (2.4)   71.16 (2.7)  62.4 (2.8)
CBU    81.53 (1.4)  81.64 (1.3)  81.29 (1.6)  81.28 (2.1)  80.34 (2.7)

Mov Th(p = [])
RUS    79.77 (5.3)  80.32 (5.8)  81.57 (5.5)  81.86 (6.6)  81.26 (6.2)
Cl K   79.77 (5.3)  79.25 (4.5)  78.07 (5)    76.25 (6.5)  62.46 (8.5)
Cl T   79.77 (5.3)  78.4 (4.4)   72.41 (3.5)  64.66 (4.5)  60.37 (7.3)
Far K  79.77 (5.3)  84.54 (5)    83.64 (6.4)  80.02 (7.3)  56.82 (10.3)
Far T  79.77 (5.3)  85.03 (5.7)  82.68 (6.8)  75.61 (9.2)  56.77 (10.9)
CBU    80.11 (5.8)  81.17 (6)    81.08 (6.5)  84.17 (5.1)  80.96 (6.9)

Yahoo A(p = 1)
RUS    55.92 (3)    55.57 (3.4)  56.44 (3)    55.83 (3.4)  56.37 (3.3)
Cl K   55.92 (3)    55.67 (2.4)  53.12 (2)    50.57 (1.8)  53.79 (3.5)
Cl T   55.92 (3)    55.69 (2.1)  53.35 (2.2)  50.31 (2.2)  52.35 (3.3)
Far K  55.92 (3)    57.35 (2.2)  56.92 (1.1)  56.95 (2.3)  51.18 (2)
Far T  55.92 (3)    56.93 (2.4)  54.74 (1.9)  57.01 (1.8)  51.18 (2)
CBU    58.21 (2.6)  58.45 (3.3)  58.31 (3.5)  58.39 (3.5)  56.09 (2.6)

Yahoo A(p = 25)
RUS    61.68 (2.4)  62.9 (2.9)   63.62 (3.6)  63.75 (3.1)  63.19 (1.9)
Cl K   61.68 (2.4)  61.14 (2.1)  57.62 (1.6)  54.02 (1.8)  51.48 (1.4)
Cl T   61.68 (2.4)  60.89 (2.8)  58.11 (1.4)  54.4 (2.1)   51.76 (1.4)
Far K  61.68 (2.4)  63.96 (3)    62.62 (2.2)  59.61 (1.5)  56.25 (1.6)
Far T  61.68 (2.4)  63.71 (2.4)  59.72 (1.6)  57.27 (1.1)  54.47 (1.1)
CBU    62.46 (2.6)  61.85 (1.4)  61.78 (2.2)  59.94 (3)    60.1 (4)

Yahoo G(p = 1)
RUS    66.84 (3.7)  67.85 (3.2)  68.36 (3.2)  68.23 (4)    69.9 (4.2)
Cl K   66.84 (3.7)  66.71 (2.8)  64.3 (3.6)   61.98 (3.9)  61.15 (1.9)
Cl T   66.84 (3.7)  65.79 (2.7)  63.55 (3.3)  59.21 (3.5)  61.08 (2.4)
Far K  66.84 (3.7)  66.76 (4.1)  63.84 (3.4)  65.16 (2)    48.5 (2.9)
Far T  66.84 (3.7)  66.95 (4.1)  63.48 (2.9)  65.16 (2)    48.48 (2.9)
CBU    69.68 (4.1)  70.59 (3.2)  70.64 (3.7)  70.2 (2.9)   63.35 (3.6)

Yahoo G(p = 25)
RUS    78.82 (1.4)  78.91 (1.6)  78.97 (1.6)  78.61 (1.6)  77.82 (2.1)
Cl K   78.82 (1.4)  77.26 (1.5)  72.52 (1.5)  67.86 (2)    65.07 (2.7)
Cl T   78.82 (1.4)  76.83 (1)    71.99 (1.8)  67.15 (2.3)  61.1 (2.7)
Far K  78.82 (1.4)  78.26 (2.2)  74.69 (2.7)  67.22 (2.1)  60.72 (2.3)
Far T  78.82 (1.4)  77.68 (2.6)  72.44 (3)    64.94 (2.4)  59.6 (2)
CBU    75.25 (3.2)  75.22 (2.4)  74.69 (2.3)  73.07 (2.4)  70.69 (2.4)

TaFeng(p = 1)
RUS    55.75 (1.6)  56.1 (1.6)   56.26 (1.7)  57.23 (1.7)  59.25 (2.2)
Cl K   55.75 (1.6)  55.68 (1.6)  55.58 (1.5)  55.08 (1.1)  51.05 (1.5)
Cl T   55.75 (1.6)  55.67 (1.6)  54.47 (1.6)  47.53 (1.6)  49.3 (1.1)
Far K  55.75 (1.6)  58.99 (1.2)  59.47 (1.1)  60.04 (1.2)  56.31 (1)
Far T  55.75 (1.6)  58.92 (1.3)  59.25 (1.3)  58.58 (1.1)  56.31 (1)
CBU    57.8 (1)     58.47 (1.1)  58.15 (0.9)  58.87 (1.4)  57.65 (1.6)

TaFeng(p = 25)
RUS    66.94 (1.3)  67.44 (1.3)  68.1 (1.4)   68.27 (1.4)  66.13 (1.2)
Cl K   66.94 (1.3)  66.13 (1.4)  63.39 (1.2)  59.83 (1.3)  56.94 (0.7)
Cl T   66.94 (1.3)  66.38 (1.5)  62.89 (1.6)  57.46 (1.3)  54.56 (1.3)
Far K  66.94 (1.3)  68.06 (1.4)  66.43 (1.6)  64.46 (1.5)  63.35 (1.3)
Far T  66.94 (1.3)  64.31 (1.1)  62.69 (1)    61.27 (1.1)  59.03 (1)
CBU    64.81 (1.2)  64.15 (1.1)  64.13 (1.2)  63.88 (0.8)  63.46 (0.8)

Book(p = 1)
RUS    52.6 (1.3)   52.79 (0.9)  53.46 (0.8)  53.89 (0.9)  54.05 (0.9)
Cl K   52.6 (1.3)   52.56 (1.2)  52.52 (1.3)  52.39 (1.1)  53.09 (1.1)
Cl T   52.6 (1.3)   52.56 (1.2)  52.52 (1.3)  52.39 (1.1)  53.05 (0.7)
Far K  52.6 (1.3)   55.21 (1.2)  56.21 (1.8)  56.14 (1.2)  53.06 (1)
Far T  52.6 (1.3)   55.21 (1.2)  56.21 (1.8)  56.14 (1.2)  53.06 (1)
CBU    54.28 (0.9)  53.77 (1)    53.33 (1.1)  53.34 (0.9)  52.84 (0.8)

Book(p = 25)
RUS    60.08 (0.7)  60.13 (0.6)  60.4 (0.8)   60.33 (0.8)  63.28 (0.8)
Cl K   60.08 (0.7)  59.96 (0.7)  60.13 (0.8)  59.96 (1)    59.28 (0.7)
Cl T   60.08 (0.7)  59.96 (0.7)  60.13 (0.8)  60.29 (0.4)  54.5 (0.9)
Far K  60.08 (0.7)  63.29 (1)    64.19 (0.8)  57.3 (1.1)   55.66 (1.1)
Far T  60.08 (0.7)  62.14 (0.5)  58.27 (0.6)  56.37 (1)    55.66 (1.1)
CBU    54.82 (0.9)  54.67 (0.9)  54.71 (0.9)  54.66 (1)    54.78 (0.9)

LST(p = 1)
RUS    99.99 (0)    99.99 (0)    99.99 (0)    99.98 (0)    99.99 (0)
Cl K   99.99 (0)    99.99 (0)    99.99 (0)    99.99 (0)    99.99 (0)
Cl T   99.99 (0)    99.99 (0)    99.99 (0)    99.99 (0)    99.98 (0)
Far K  99.99 (0)    99.98 (0)    99.98 (0)    99.98 (0)    99.98 (0)
Far T  99.99 (0)    99.98 (0)    99.98 (0)    99.98 (0)    99.98 (0)
CBU    []           []           []           []           []

Adver(p = [])
RUS    96.61 (1.8)  96.32 (1.8)  96.63 (1.4)  97.12 (2.1)  96.22 (1.6)
Cl K   96.61 (1.8)  96.44 (1.5)  96.14 (1.5)  96.04 (2)    94.8 (2.5)
Cl T   96.61 (1.8)  95.87 (2.1)  94.32 (1.9)  93.01 (2.2)  90.72 (2.3)
Far K  96.61 (1.8)  96.53 (1.4)  95.76 (2)    94.39 (1.8)  90.49 (3.1)
Far T  96.61 (1.8)  96.54 (1.5)  95.67 (1.9)  94.54 (1.8)  89.3 (2.8)
CBU    96.85 (2.3)  96.85 (2.3)  97.05 (1.5)  96.6 (1.6)   96.06 (2.1)

Adver(p = 1)
RUS    90.93 (3)    91.53 (3.1)  92.37 (3.4)  91.9 (2.9)   91.93 (2.2)
Cl K   90.93 (3)    90.64 (3)    89.87 (3.9)  90.21 (3.6)  89.18 (2)
Cl T   90.93 (3)    89.7 (3.5)   88.55 (3.4)  85.76 (3.3)  88.2 (2.3)
Far K  90.93 (3)    93.8 (2.3)   92.4 (2.6)   88.73 (3.4)  85.51 (4)
Far T  90.93 (3)    93.62 (2.4)  93.2 (2.2)   88.41 (3.6)  85.51 (4)
CBU    93.22 (2.4)  93.76 (2.5)  93.89 (2.6)  93.52 (2.7)  91.27 (2)

CRF(p = [])
RUS    64.06 (16.4) 63.28 (15.9) 67.98 (17.4) 66.95 (21.9) 87.73 (8.8)
Cl K   64.06 (16.4) 62.44 (16.6) 62.34 (16.9) 71.37 (13.8) 78.22 (17.7)
Cl T   64.06 (16.4) 62.44 (16.6) 62.34 (16.9) 71.37 (13.8) 62.67 (22.9)
Far K  64.06 (16.4) 83.8 (14.2)  83.93 (14.8) 84.49 (13.7) 86.11 (9.7)
Far T  64.06 (16.4) 83.8 (14.2)  83.93 (14.8) 84.49 (13.7) 86.11 (9.7)
CBU    []           []           []           []           []

Bank(p = [])
RUS    66.82 (0.9)  67.02 (0.9)  67.37 (0.8)  67.99 (0.6)  69.5 (1)
Cl K   66.82 (0.9)  66.17 (0.7)  65.24 (0.6)  64.86 (0.6)  58.53 (1.1)
Cl T   66.82 (0.9)  64.92 (1.1)  60.69 (0.9)  56.33 (0.8)  52.87 (0.7)
Far K  66.82 (0.9)  66.95 (0.6)  66.19 (0.6)  64.42 (0.6)  58.25 (1.1)
Far T  66.82 (0.9)  67.16 (0.6)  64.2 (0.8)   59.67 (1)    58.25 (1.1)
CBU    []           []           []           []           []


C Boosting experiments

This section presents the results of each of the boosting experiments from Section 4.5 on the remaining data sources. Each of the figures below depicts the average tenfold AUC-performance on test data (with μ = 100) with respect to the number of boosting iterations for (left) Adaboost (AB), AdaCost (AC) and EasyEnsemble (EE), with C chosen according to highest validation set AUC-performance (over all possible boosting rounds), and (right) AB and EE (with S = 15) with varying C-levels.

[Figures 6-17 show, for each dataset, two panels of average tenfold AUC test performance versus the number of boosting iterations T: (a) AB, AC R2, AC R8, AC RD, EE S5, EE S10, EE S15 and BL; (b) AB and EE for C in {1e-07, 1e-05, 0.001, 0.1}, with BL as reference.]

Fig. 6 Mov G(p = 1) dataset
Fig. 7 Mov Th(p = []) dataset
Fig. 8 Yahoo A(p = 1) dataset
Fig. 9 Yahoo A(p = 25) dataset
Fig. 10 Yahoo G(p = 1) dataset
Fig. 11 Yahoo G(p = 25) dataset
Fig. 12 TaFeng(p = 1) dataset
Fig. 13 Book(p = 1) dataset
Fig. 14 LST(p = 1) dataset
Fig. 15 Adver(p = []) dataset
Fig. 16 Adver(p = 1) dataset
Fig. 17 CRF(p = []) dataset

D Final Comparison

[Figure: scatter of average rank AUC (x-axis, 0-14) versus average rank Time (y-axis, 0-18) for BL, OSR, SMOTE, ADASYN, RUS, Cl Knn, Far Knn, CBU, AB, AC(R=2), AC(R=8), AC(R=RL), EE(S=10), EE(S=15) and EE par.]

Fig. 18 Average rank AUC versus average rank Time (see Table 9) across the large datasets from Table 2 (CRF and Bank). Regarding the former, an extra row EE par is added, having the same AUC-value as EE(S = 15). Ranks for each dataset are subsequently obtained via the procedure outlined in Section 4.6.1. Points occurring in the upper-right region are preferred.


References

Akbani R Kwek S Japkowicz N (2004) Applying support vector machines to imbalanced datasetsIn Machine Learning ECML 2004 15th European Conference on Machine Learning Pisa ItalySeptember 20-24 2004 Proceedings Springer Berlin Heidelberg Berlin Heidelberg pp 39ndash50 DOI101007978-3-540-30115-8 7

Ali A Shamsuddin SM Ralescu AL (2015) Classification with class imbalance problem A review Inter-national Journal of Advances in Soft Computing and its Applications 7(3)176ndash204

Alzahrani T Horadam KJ (2016) Community detection in bipartite networks Algorithms and case studiesIn Complex Systems and Networks Dynamics Controls and Applications Springer Berlin Heidel-berg Berlin Heidelberg pp 25ndash50 DOI 101007978-3-662-47824-0 2

Bachner J (2013) Predictive policing Preventing crime with data and analytics IBM Center for the Busi-ness of Government

Baesens B Van Gestel T Viaene S Stepanova M Suykens J Vanthienen J (2003) Benchmarking state-of-the-art classification algorithms for credit scoring Journal of the Operational Research Society54(6)627ndash635 DOI 101057palgravejors2601545

Barandela R Snchez J Garca V Rangel E (2003) Strategies for learning in class imbalance problemsPattern Recognition 36(3)849 ndash 851 DOI httpsdoiorg101016S0031-3203(02)00257-1

Barber MJ (2007) Modularity and community detection in bipartite networks Physical Review E76066102 DOI 101103PhysRevE76066102

Barua S Islam MM Yao X Murase K (2014) MWMOTEndashmajority weighted minority oversamplingtechnique for imbalanced data set learning IEEE Transactions on Knowledge and Data Engineer-ing 26(2)405ndash425 DOI 101109TKDE2012232

Batista GEAPA Prati RC Monard MC (2004) A study of the behavior of several methods for balancingmachine learning training data SIGKDD Explor Newsl 6(1)20ndash29 DOI 10114510077301007735

Beckett SJ (2016) Improved community detection in weighted bipartite networks Royal Society OpenScience 3(1) DOI 101098rsos140536

Bekkar M Djemaa HK Alitouche TA (2013) Evaluation measures for models assessment over imbalanceddata sets Journal of Information Engineering and Applications 3(10)27ndash38

Bhattacharyya S Jha S Tharakunnel K Westland JC (2011) Data mining for credit card fraud A compara-tive study Decision Support Systems 50(3)602 ndash 613 DOI httpsdoiorg101016jdss201008008

Blondel VD Guillaume JL Lambiotte R Lefebvre E (2008) Fast unfolding of communities in large net-works Journal of Statistical Mechanics Theory and Experiment 2008(10)P10008

Breiman L Friedman J Stone CJ Olshen RA (1984) Classification and regression trees Taylor amp FrancisBrozovsky L Petricek V (2007) Recommender system for online dating service In Proceedings of

Znalosti 2007 Conference VSB OstravaCha M Mislove A Gummadi KP (2009) A measurement-driven analysis of information propagation in

the Flickr social network In Proceedings of the 18th International Conference on World Wide WebACM New York NY USA WWW rsquo09 pp 721ndash730 DOI 10114515267091526806

Chawla NV (2005) Data mining for imbalanced datasets An overview In Data mining and knowledgediscovery handbook Springer US Boston MA pp 853ndash867

Chawla NV Bowyer KW Hall LO Kegelmeyer WP (2002) SMOTE synthetic minority over-samplingtechnique Journal of artificial intelligence research 16321ndash357

Chawla NV Lazarevic A Hall LO Bowyer KW (2003) Smoteboost Improving prediction of the minorityclass in boosting In Knowledge Discovery in Databases PKDD 2003 Springer Berlin HeidelbergBerlin Heidelberg pp 107ndash119

Chawla NV Japkowicz N Kotcz A (2004) Editorial Special issue on learning from imbalanced data setsSIGKDD Explor Newsl 6(1)1ndash6 DOI 10114510077301007733

Chen M Mao S Liu Y (2014) Big data A survey Mobile Networks and Applications 19(2)171ndash209DOI 101007s11036-013-0489-0

Chyi YM (2003) Classification analysis techniques for skewed class distribution problems Master thesisDepartment of Information management National Sun Yat-Sen University

Demsar J (2006) Statistical comparisons of classifiers over multiple data sets Journal of Machine LearningResearch 7(Jan)1ndash30

Dhillon IS (2001) Co-clustering documents and words using bipartite spectral graph partitioning In Pro-ceedings of the Seventh ACM SIGKDD International Conference on Knowledge Discovery and DataMining ACM New York NY USA KDD rsquo01 pp 269ndash274 DOI 101145502512502550

54 Jellis Vanhoeyveld David Martens

Drummond C Holte RC (2003) C45 class imbalance and cost sensitivity why under-sampling beatsover-sampling In Proceedings of the ICML rsquo03 Workshop on Learning from Imbalanced Datasets

Fan RE Chang KW Hsieh CJ Wang XR Lin CJ (2008) LIBLINEAR A library for large linear classifi-cation Journal of Machine Learning Research 91871ndash1874

Fan W Stolfo SJ Zhang J Chan PK (1999) AdaCost Misclassification cost-sensitive boosting In Pro-ceedings of the Sixteenth International Conference on Machine Learning Morgan Kaufmann Pub-lishers Inc San Francisco CA USA ICML rsquo99 pp 97ndash105

Fawcett T (2006) An introduction to ROC analysis Pattern Recognition Letters 27(8)861 ndash 874 DOIhttpsdoiorg101016jpatrec200510010

Finch H (2005) Comparison of distance measures in cluster analysis with dichotomous data Journal ofData Science 3(1)85ndash100

Fortunato S (2010) Community detection in graphs Physics Reports 486(3 5)75 ndash 174 DOI httpsdoiorg101016jphysrep200911002

Junque de Fortuny E Martens D Provost F (2014a) Predictive modeling with big data is bigger reallybetter Big Data 1(4)215ndash226 DOI 101089big20130037

Junque de Fortuny E Stankova M Moeyersoms J Minnaert B Provost F Martens D (2014b) Corporateresidence fraud detection In Proceedings of the 20th ACM SIGKDD International Conference onKnowledge Discovery and Data Mining ACM New York NY USA KDD rsquo14 pp 1650ndash1659 DOI10114526233302623333

Frasca M Bertoni A Re M Valentini G (2013) A neural network algorithm for semi-supervised node labellearning from unbalanced data Neural Networks 4384 ndash 98 DOI httpsdoiorg101016jneunet201301021

Friedman M (1937) The use of ranks to avoid the assumption of normality implicit in the analysis ofvariance Journal of the american statistical association 32(200)675ndash701

Garcıa E Lozano F (2007) Boosting support vector machines In Machine Learning and Data Mining inPattern Recognition 5th International Conference MLDM 2007 Leipzig Germany July 18-20 PostProceedings IBaI publishing pp 153ndash167

Goldstein M Uchida S (2016) A comparative evaluation of unsupervised anomaly detection algorithmsfor multivariate data PLOS ONE 11(4)1ndash31 DOI 101371journalpone0152173

Gonzlez PC Velsquez JD (2013) Characterization and detection of taxpayers with false invoices usingdata mining techniques Expert Systems with Applications 40(5)1427 ndash 1436 DOI httpsdoiorg101016jeswa201208051

Guimera R Sales-Pardo M Amaral LAN (2007) Module identification in bipartite and directed networksPhysical Review E 76036102 DOI 101103PhysRevE76036102

Guo H Viktor HL (2004) Learning from imbalanced data sets with boosting and data generation TheDataBoost-IM approach SIGKDD Explor Newsl 6(1)30ndash39 DOI 10114510077301007736

Guo X Yin Y Dong C Yang G Zhou G (2008) On the class imbalance problem In 2008 Fourth Interna-tional Conference on Natural Computation IEEE vol 4 pp 192ndash201 DOI 101109ICNC2008871

Han H Wang WY Mao BH (2005) Borderline-SMOTE A new over-sampling method in imbalanced datasets learning In Advances in Intelligent Computing Springer Berlin Heidelberg Berlin Heidelbergpp 878ndash887

He H Garcia EA (2009) Learning from imbalanced data IEEE Transactions on Knowledge and DataEngineering 21(9)1263ndash1284 DOI 101109TKDE2008239

He H Bai Y Garcia EA Li S (2008) ADASYN Adaptive synthetic sampling approach for imbalancedlearning In 2008 IEEE International Joint Conference on Neural Networks (IEEE World Congresson Computational Intelligence) IEEE pp 1322ndash1328 DOI 101109IJCNN20084633969

Holm S (1979) A simple sequentially rejective multiple test procedure Scandinavian journal of statistics6(2)65ndash70

Hsu CW Lin CJ (2002) A comparison of methods for multiclass support vector machines IEEE Transac-tions on Neural Networks 13(2)415ndash425 DOI 10110972991427

Huang A (2008) Similarity measures for text document clustering In Proceedings of the sixth new zealandcomputer science research student conference (NZCSRSC2008) Christchurch New Zealand pp 49ndash56

Iman RL Davenport JM (1980) Approximations of the critical region of the Friedman statistic Commu-nications in Statistics-Theory and Methods 9(6)571ndash595

Jo T Japkowicz N (2004) Class imbalances versus small disjuncts ACM SIGKDD Explor Newsl 6(1)40ndash49 DOI 10114510077301007737

Imbalanced classification in sparse and large behaviour datasets 55

Jutla IS Jeub LG Mucha PJ (2011-2016) A generalized louvain method for community detection imple-mented in MATLAB URL httpnetwikiamathunceduGenLouvain

Kubat M Matwin S (1997) Addressing the curse of imbalanced training sets One-sided selection InProceedings of the Fourteenth International Conference on Machine Learning Morgan KaufmannPublishers Inc San Francisco CA USA pp 179ndash186

Lancichinetti A Fortunato S (2009) Community detection algorithms A comparative analysis PhysicalReview E 80056117 DOI 101103PhysRevE80056117

Larremore DB Clauset A Jacobs AZ (2014) Efficiently inferring community structure in bipartitenetworks Physical Review E Statistical Nonlinear and Soft Matter Physics 90012805 DOI101103PhysRevE90012805

Li J Fine JP (2010) Weighted area under the receiver operating characteristic curve and its application togene selection Journal of the Royal Statistical Society Series C (Applied Statistics) 59(4)673ndash692DOI 101111j1467-9876201000713x

Li X Wang L Sung E (2008) AdaBoost with SVM-based component classifiers Engineering Applicationsof Artificial Intelligence 21(5)785ndash795 DOI httpsdoiorg101016jengappai200707001

Lichman M (2013) UCI machine learning repository URL httparchiveicsuciedumlLiu W Chawla S Cieslak DA Chawla NV (2010) A robust decision tree algorithm for imbalanced data

sets In Proceedings of the tenth SIAM international conference on data mining SIAM Philadelphiavol 10 pp 766ndash777

Liu XY Wu J Zhou ZH (2009) Exploratory undersampling for class-imbalance learning IEEE Transac-tions on Systems Man and Cybernetics Part B (Cybernetics) 39(2)539ndash550 DOI 101109TSMCB20082007853

Luts J Ojeda F Van de Plas R De Moor B Van Huffel S Suykens JA (2010) A tutorial on supportvector machine-based methods for classification problems in chemometrics Analytica Chimica Acta665(2)129ndash145 DOI httpsdoiorg101016jaca201003030

Macskassy SA Provost F (2007) Classification in networked data A toolkit and a univariate case studyJournal of Machine Learning Research 8(May)935ndash983

Martens D Provost F (2014) Explaining data-driven document classifications MIS Quarterly 38(1)73ndash100

Martens D Provost F Clark J Junque de Fortuny E (2016) Mining massive fine-grained behavior data toimprove predictive analytics MIS Quarterly 40(4)869ndash888

Mazurowski MA Habas PA Zurada JM Lo JY Baker JA Tourassi GD (2008) Training neural networkclassifiers for medical decision making The effects of imbalanced datasets on classification perfor-mance Neural Networks 21(23)427 ndash 436 DOI httpsdoiorg101016jneunet200712031

Mease D Wyner AJ Buja A (2007) Boosted classification trees and class probabilityquantile estimationJournal of Machine Learning Research 8409ndash439

Nemenyi P (1963) Distribution-free Multiple Comparisons Dissertation Princeton UniversityNewman MEJ Girvan M (2004) Finding and evaluating community structure in networks Physical Re-

view E 69026113 DOI 101103PhysRevE69026113Ng AY (2004) Feature selection L1 vs L2 regularization and rotational invariance In Proceedings of the

Twenty-first International Conference on Machine Learning ACM New York NY USA ICML rsquo04pp 78ndash DOI 10114510153301015435

Ng AY Jordan MI (2002) On discriminative vs generative classifiers A comparison of logistic regressionand naive bayes In Advances in Neural Information Processing Systems 14 MIT Press pp 841ndash848

Ngai E Hu Y Wong Y Chen Y Sun X (2011) The application of data mining techniques in financial frauddetection A classification framework and an academic review of literature Decision Support Systems50(3)559 ndash 569 DOI httpsdoiorg101016jdss201008006

Platt JC (1999) Probabilistic outputs for support vector machines and comparisons to regularized likeli-hood methods In Advances in Large-Margin Classifiers MIT Press pp 61ndash74

Porter MA Onnela JP Mucha PJ (2009) Communities in networks Notices of the American MathematicalSociety 56(9)1082ndash1097

Provost F Fawcett T (2013) Data Science for Business What you need to know about data mining anddata-analytic thinking OrsquoReilly Media Inc

Provost F Dalessandro B Hook R Zhang X Murray A (2009) Audience selection for on-line brandadvertising Privacy-friendly social network targeting In Proceedings of the 15th ACM SIGKDDInternational Conference on Knowledge Discovery and Data Mining ACM New York NY USAKDD rsquo09 pp 707ndash716 DOI 10114515570191557098

56 Jellis Vanhoeyveld David Martens

Raskutti B, Kowalczyk A (2004) Extreme re-balancing for SVMs: a case study. SIGKDD Explorations Newsletter 6(1):60-69, DOI 10.1145/1007730.1007739

Rosvall M, Bergstrom CT (2008) Maps of random walks on complex networks reveal community structure. Proceedings of the National Academy of Sciences 105(4):1118-1123, DOI 10.1073/pnas.0706851105

Schapire RE (1999) A brief introduction to boosting. In: Proceedings of the 16th International Joint Conference on Artificial Intelligence - Volume 2, Morgan Kaufmann Publishers Inc., San Francisco, CA, USA, IJCAI'99, pp 1401-1406

Schapire RE, Singer Y (1999) Improved boosting algorithms using confidence-rated predictions. Machine Learning 37(3):297-336, DOI 10.1023/A:1007614523901

Shmueli G (2017) Analyzing behavioral big data: methodological, practical, ethical, and moral issues. Quality Engineering 29(1):57-74, DOI 10.1080/08982112.2016.1210979

Sobhani P, Viktor H, Matwin S (2015) Learning from imbalanced data using ensemble methods and cluster-based undersampling. In: New Frontiers in Mining Complex Patterns: Third International Workshop, NFMCP 2014, Held in Conjunction with ECML-PKDD 2014, Nancy, France, September 19, 2014, Revised Selected Papers, Springer International Publishing, Cham, pp 69-83, DOI 10.1007/978-3-319-17876-9_5

Stankova M (2016) Classification within network data with a bipartite structure. Dissertation, University of Antwerp

Stankova M, Martens D, Provost F (2015) Classification over bipartite graphs through projection. Working Papers 2015001, University of Antwerp, Faculty of Applied Economics

Sun Y, Kamel MS, Wong AK, Wang Y (2007) Cost-sensitive boosting for classification of imbalanced data. Pattern Recognition 40(12):3358-3378, DOI 10.1016/j.patcog.2007.04.009

Suykens JA, Van Gestel T, De Brabanter J, De Moor B, Vandewalle J (2002) Least Squares Support Vector Machines. World Scientific, Singapore

Tang B, He H (2015) ENN: extended nearest neighbor method for pattern recognition [research frontier]. IEEE Computational Intelligence Magazine 10(3):52-60, DOI 10.1109/MCI.2015.2437512

Tang Y, Zhang YQ, Chawla NV, Krasser S (2009) SVMs modeling for highly imbalanced classification. IEEE Transactions on Systems, Man, and Cybernetics, Part B (Cybernetics) 39(1):281-288, DOI 10.1109/TSMCB.2008.2002909

Tobback E, Moeyersoms J, Stankova M, Martens D (2016) Bankruptcy prediction for SMEs using relational data. Working Paper 2016004, University of Antwerp, Faculty of Applied Economics

Verbeke W, Dejaeger K, Martens D, Hur J, Baesens B (2012) New insights into churn prediction in the telecommunication sector: a profit driven data mining approach. European Journal of Operational Research 218(1):211-229, DOI 10.1016/j.ejor.2011.09.031

Veropoulos K, Campbell I, Cristianini N (1999) Controlling the sensitivity of support vector machines. In: Proceedings of the International Joint Conference on Artificial Intelligence, Stockholm, Sweden (IJCAI99), pp 55-60

Whitrow C, Hand DJ, Juszczak P, Weston D, Adams NM (2009) Transaction aggregation as a strategy for credit card fraud detection. Data Mining and Knowledge Discovery 18(1):30-55, DOI 10.1007/s10618-008-0116-z

Wickramaratna J, Holden SB, Buxton BF (2001) Performance degradation in boosting. In: Proceedings of the Second International Workshop on Multiple Classifier Systems, Springer, London, UK, MCS '01, pp 11-21

Yen SJ, Lee YS (2009) Cluster-based under-sampling approaches for imbalanced data distributions. Expert Systems with Applications 36(3, Part 1):5718-5727, DOI 10.1016/j.eswa.2008.06.108

Yu HF, Lo HY, Hsieh HP, Lou JK, McKenzie TG, Chou JW, Chung PH, Ho CH, Chang CF, Wei YH, et al (2010) Feature engineering and classifier ensemble for KDD Cup 2010. In: Proceedings of the KDD Cup 2010 Workshop, pp 1-16

Zha H, He X, Ding C, Simon H, Gu M (2001) Bipartite graph partitioning and data clustering. In: Proceedings of the Tenth International Conference on Information and Knowledge Management, ACM, New York, NY, USA, CIKM '01, pp 25-32, DOI 10.1145/502585.502591

Zhang J, Mani I (2003) kNN approach to unbalanced data distributions: a case study involving information extraction. In: Proceedings of the ICML'2003 Workshop on Learning from Imbalanced Datasets, Washington DC

Ziegler CN, McNee SM, Konstan JA, Lausen G (2005) Improving recommendation lists through topic diversification. In: Proceedings of the 14th International Conference on World Wide Web, ACM, New York, NY, USA, WWW '05, pp 22-32, DOI 10.1145/1060745.1060754



[Figure: scatter plot of average rank AUC (horizontal axis, roughly 0-14) against average rank Time (vertical axis, roughly 0-18) for BL, OSR, SMOTE, ADASYN, RUS, Cl Knn, Far Knn, CBU, AB, AC(R=2), AC(R=8), AC(R=RL), EE(S=10), EE(S=15) and EE par]

Fig 5 Average rank AUC versus average rank Time (see Table 9) across all datasets from Table 2 (excluding Flickr and Kdd). Regarding the AUC rank, an extra row EE par is added, having the same AUC-value as EE(S = 15). Ranks for each dataset are subsequently obtained via the procedure outlined in Section 4.6.1. Points occurring in the upper-right region are preferred.

5 The effect of the chosen base learner

In this short section we investigate the influence of the chosen base learner. More specifically, we ask ourselves whether some base inducers can more easily cope with the imbalanced learning issue, in doing so verifying the conclusion that EE is a suitable technique to elevate upon the baseline performance. We rely on several popular classification methods for behaviour data: two regularization based techniques (SVM, LR) and two heuristic types of approaches (Naive Bayes (NB), Behavioural Similarity (BeSim)). The next paragraphs briefly introduce these methods. Note that source codes and dataset results are provided in our on-line repository.

LR constructs a linear model that maximizes the likelihood of the observed data by imposing a logistic model for the parametric probability model P(Y|X). Ng and Jordan (2002) provide a thorough introduction on the topic27 and note that in its plain form LR can suffer from overfitting, especially when the input data is very high dimensional. Therefore, we resort to regularized logistic regression. We opt for L2-regularized LR, since L1-regularization corresponds with a natural way of imposing feature selection (Ng 2004). The LIBLINEAR toolbox (Fan et al 2008) is used to obtain these L2-regularized LR models.
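
As an illustration only, the following sketch fits an L2-regularized LR model on a randomly generated sparse binary matrix; scikit-learn's liblinear solver is used here as a stand-in for the LIBLINEAR toolbox, and all data and parameter values are made up:

```python
import numpy as np
from scipy.sparse import random as sparse_random
from sklearn.linear_model import LogisticRegression

# Hypothetical sparse binary behaviour matrix: 1000 users x 5000 items.
rng = np.random.default_rng(0)
X = sparse_random(1000, 5000, density=0.01, format="csr", random_state=0)
X.data[:] = 1.0                      # binarize: a user either took the action or not
y = rng.integers(0, 2, size=1000)    # made-up labels, illustration only

# L2-regularized LR; the 'liblinear' solver wraps the LIBLINEAR library.
# C is the inverse regularization strength (smaller C = stronger regularization).
clf = LogisticRegression(penalty="l2", C=1.0, solver="liblinear").fit(X, y)
scores = clf.predict_proba(X)[:, 1]  # class-1 probabilities, usable for AUC ranking
```

The sparse CSR representation matters here: behaviour data of this shape would be prohibitively large in dense form.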

27 For an accessible introduction, see the chapter on "Generative and Discriminative Classifiers: Naive Bayes and Logistic Regression" provided at the following link: http://www.cs.cmu.edu/~tom/NewChapters.html


NB (Ng and Jordan 2002) relies on the use of Bayes' rule, which expresses the posterior P(Y|X) as the product of the likelihood P(X|Y) and the prior P(Y), divided by the evidence P(X). The naivety of the approach originates from assuming conditional independence of the features. We impose a multivariate event model regarding the distribution P(X|Y). Junque de Fortuny et al (2014a) provide an efficient NB implementation for dealing with large and sparse binary behaviour data.
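
A minimal sketch of the multivariate (Bernoulli) event model on a toy binary behaviour matrix; this uses scikit-learn's BernoulliNB rather than the dedicated sparse implementation referenced above, and the data are invented for illustration:

```python
import numpy as np
from scipy.sparse import csr_matrix
from sklearn.naive_bayes import BernoulliNB

# Toy binary behaviour matrix (rows: users, columns: items acted upon).
X = csr_matrix(np.array([[1, 0, 1, 0],
                         [1, 1, 0, 0],
                         [0, 0, 1, 1],
                         [0, 1, 0, 1]]))
y = np.array([1, 1, 0, 0])

# Multivariate Bernoulli event model: one P(x_j = 1 | y) per feature,
# combined under the conditional independence assumption.
nb = BernoulliNB(alpha=1.0).fit(X, y)
posteriors = nb.predict_proba(X)[:, 1]
predictions = nb.predict(X)   # recovers the labels on this separable toy set
```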

The BeSim approach conceptually corresponds to projecting the bigraph to a unigraph using the procedure outlined in Section 3.2. Next, the weighted-vote relational neighbour classifier (Macskassy and Provost 2007) is used to infer unknown labels/scores through a weighted probability estimation using the known labels of neighbouring nodes. A very fast and scalable implementation of this approach (the SW-transformation) is presented in detail in Stankova et al (2015).
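
The projection-plus-weighted-vote idea can be sketched as follows. This is a simplified stand-in for the SW-transformation, assuming a toy bigraph with unit co-occurrence weights; unlabelled nodes without labelled neighbours fall back to a neutral 0.5 score:

```python
import numpy as np
from scipy.sparse import csr_matrix

# Toy bigraph: rows are users, columns are items (1 = user touched the item).
A = csr_matrix(np.array([[1, 1, 0],
                         [1, 0, 0],
                         [0, 1, 1],
                         [0, 0, 1]], dtype=float))
y_known = np.array([1.0, 0.0, np.nan, np.nan])   # labels of the first two users

# Project the bigraph to a user-by-user graph: W[i, j] = number of shared items.
W = (A @ A.T).toarray()
np.fill_diagonal(W, 0.0)

# Weighted-vote relational neighbour: score = weighted mean of the known
# labels of neighbouring nodes.
labeled = ~np.isnan(y_known)
num = W[:, labeled] @ y_known[labeled]
den = W[:, labeled].sum(axis=1)
scores = np.divide(num, den, out=np.full(len(num), 0.5), where=den > 0)
```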

In Table 10 we compare the performances of the baseline (BL) application of the methods with their EE-version (with S = 15, T = 20). It is quite striking to see the EE-version dominating the baseline version when looking at each method individually. This suggests that the imbalanced learning issue has an impact for each of the aforementioned techniques. Focussing on the regularization based approaches, we observe the baseline LR to be less influenced by the imbalance than the SVM method. Their EE-versions (EE SVM and EE LR) can boost the baseline performance in almost all cases and have the highest average ranks amongst all the proposed methods. As already emphasized, the regularization based approaches offer an added element of flexibility, in the sense that the strength of the underlying base learner can be controlled.

The heuristic approaches NB and BeSim do not offer this flexibility. Furthermore, it is not directly possible to integrate instance weights during the training phase of the base learner. During the boosting process, we therefore sample from the distribution Dt to generate (unweighted) examples and train a BeSim/NB learner next. The results in Table 10 indicate BL NB to be the worst performing algorithm amongst all methods tried. Due to the conditional independence assumption, one could regard this as a weak learner, which is therefore suitable to be used in a boosting process. Indeed, the EE NB approach is able to significantly improve upon its performance. We found this to be the slowest EE-version28 (e.g. taking more than 24 hours to process a single fold of the Bank dataset, versus several minutes for the other versions). The average rank estimate of EE NB is rather pessimistic, since we assigned the lowest performance across all methods in case of missing results. BeSim can be considered a stronger learner compared to NB. Its possible boost in performance is typically lower than that of NB. Having said this, we found EE BeSim to improve upon BL BeSim for all datasets (except Adver(p = 1)).
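
Sampling unweighted examples from the boosting distribution Dt amounts to a weighted draw with replacement; a sketch with made-up weights:

```python
import numpy as np

rng = np.random.default_rng(42)
n = 8
D_t = rng.random(n)
D_t /= D_t.sum()          # boosting distribution D_t over the n training instances

# Weighted draw with replacement: weight-incapable learners (NB, BeSim) then
# see high-weight instances more often in the resulting unweighted sample.
idx = rng.choice(n, size=n, replace=True, p=D_t)
# In a real pipeline: X_sample, y_sample = X_train[idx], y_train[idx]
```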

28 This statement only relates to the EE-version. In the baseline case, the timings of the NB-version are roughly of the same order of magnitude as SVM and LR (though this heavily depends on the chosen regularization parameter C). With EE, each subset has a size limited to twice the minority class training size (a relatively low number compared to the size of the majority class). This works in favour of SVM/LR (e.g. SVM has a computational complexity of O(m²), with m the number of training instances). Also note that the NB-version is indeed optimized for sparse data. Internally, the NB-implementation writes a fairly large model file to a specified directory (sizes of 1 GB can be encountered there). This writing operation slows down the training stage (especially in case of EE, where S × T = 15 × 20 = 300 model files are constructed).


Table 10 Average tenfold AUC test set performance of the BL compared to its EE-version (S = 15, T = 20) for SVM, LR, BeSim and NB. In case the EE NB implementation crashed, we put a × in the table. If it is too slow (> 24 hours per fold), we place a []. Additionally, average ranks are assigned to each method. In case EE NB has missing results for a particular dataset, we ensure it has the lowest rank across all other algorithms.

          Mov G(p = 1)   Mov G(p = 25)  Mov Th(p = [])  Yahoo A(p = 1)
BL SVM    71.6 (2.62)    81.41 (1.32)   79.77 (5.33)    56.49 (3.37)
EE SVM    76.12 (2.88)   85.13 (1.86)   86.43 (5.86)    59.74 (2.96)
BL LR     71.02 (2.09)   84.39 (1.84)   83.14 (4.17)    57.84 (2.39)
EE LR     76.69 (2.92)   85.03 (1.98)   86.3 (5.37)     59.79 (2.62)
BL BeSim  76.1 (3.58)    81.3 (2.92)    82.81 (6.6)     56.27 (2.73)
EE BeSim  76.31 (3.71)   81.37 (2.9)    85.02 (6.28)    57.7 (1.71)
BL NB     70.26 (5.84)   77.01 (2.54)   70.48 (10.14)   52.56 (2.09)
EE NB     75.93 (2.83)   85.56 (2.01)   86.91 (4.15)    57.55 (2.73)

          Yahoo A(p = 25)  Yahoo G(p = 1)  Yahoo G(p = 25)  TaFeng(p = 1)
BL SVM    61.61 (2.48)     66.84 (3.66)    78.82 (1.39)     55.75 (1.6)
EE SVM    66.38 (3.16)     73.48 (2.32)    80.55 (1.55)     61.13 (1.83)
BL LR     66.27 (2.96)     69.82 (1.93)    80.45 (1.59)     58.91 (2.31)
EE LR     66.22 (3.28)     73.08 (2.14)    80.53 (1.56)     61.43 (2.32)
BL BeSim  64.54 (2.02)     68.89 (2.49)    79.55 (1.96)     57.89 (1.18)
EE BeSim  65.25 (2.23)     71.18 (2.91)    80.04 (1.85)     59.36 (1.47)
BL NB     65.0 (1.65)      63.33 (2.56)    78.89 (1.64)     54.61 (1.2)
EE NB     66.6 (2.79)      70.99 (2.88)    81.01 (1.3)      59.01 (1.84)

          TaFeng(p = 25)  Book(p = 1)   Book(p = 25)  LST(p = 1)
BL SVM    66.94 (1.34)    52.6 (1.29)   60.08 (0.71)  99.99 (0.01)
EE SVM    70.4 (1.3)      55.34 (1.28)  65.4 (0.51)   99.98 (0.01)
BL LR     69.24 (1.3)     55.34 (1.27)  63.84 (0.75)  99.99 (0.01)
EE LR     70.28 (1.28)    55.49 (1.49)  65.41 (0.63)  99.97 (0.02)
BL BeSim  67.49 (1.23)    55.19 (1.27)  63.7 (0.63)   99.99 (0.01)
EE BeSim  68.0 (1.21)     55.21 (1.15)  64.38 (0.42)  99.99 (0.0)
BL NB     65.21 (1.64)    52.93 (0.9)   59.75 (0.47)  98.69 (0.3)
EE NB     70.72 (1.15)    ×             63.46 (0.61)  99.92 (0.04)

          Adver(p = [])  Adver(p = 1)  CRF(p = [])    Bank(p = [])
BL SVM    96.37 (1.94)   91.18 (2.97)  64.36 (18.97)  66.82 (0.88)
EE SVM    97.63 (1.35)   93.3 (2.14)   86.35 (9.99)   71.54 (0.76)
BL LR     97.19 (1.44)   88.51 (1.93)  81.87 (19.63)  71.43 (0.72)
EE LR     97.57 (0.96)   93.02 (2.06)  86.84 (9.62)   71.77 (0.62)
BL BeSim  97.26 (1.12)   95.38 (1.35)  86.91 (9.36)   67.85 (0.67)
EE BeSim  97.38 (1.04)   93.83 (1.35)  87.02 (10.43)  70.41 (0.55)
BL NB     93.75 (1.9)    93.37 (1.9)   87.24 (9.38)   67.83 (0.63)
EE NB     94.04 (1.75)   ×             ×              []

          Flickr(p = 0.1)  Kdd(p = 0.5)  Average Rank
BL SVM    74.92 (0.17)     74.53 (0.05)  6.44 [7]
EE SVM    79.86 (0.13)     80.98 (0.05)  2.39 [1]
BL LR     79.03 (0.11)     81.29 (0.04)  4.28 [4]
EE LR     79.85 (0.13)     80.75 (0.05)  2.61 [2]
BL BeSim  74.62 (0.13)     74.95 (0.0)   5.11 [6]
EE BeSim  76.4 (0.13)      77.55 (0.03)  3.61 [3]
BL NB     81.36 (0.1)      74.29 (0.05)  6.5 [8]
EE NB     []               []            5.06 [5]


6 Conclusions and future research directions

We investigated the effects of sampling and cost-sensitive learning techniques on the problem of learning from imbalanced behaviour data. This type of big data has fundamentally different properties and internal structure than traditional data, and is characterized by sparseness and very large dimensions. This is a new research topic which enables benefits across a wide variety of application domains, such as fraud detection, targeted advertising, predictive policing, churn prediction, default prediction, etc. To enable future developments in this field, we provide an on-line repository containing implementations, datasets and results, available at the following link: http://www.applieddatamining.com/cms/?q=software

Oversampling techniques need to be adapted in order to cope with this kind of data. We proposed new versions of the traditional SMOTE and ADASYN techniques, which differ in the way synthetic samples are generated and similarity computations are performed. The plain random oversampling (OSR) and the synthetic approaches are competitive methods in terms of AUC and do not seem to suffer from overfitting, as traditional studies report. Though the synthetic approaches have comparable AUC-performances in comparison to OSR, they are useless in practice due to their computationally expensive procedures. The OSR method is very fast for small and medium sized datasets, yet becomes inappropriate for larger datasets because the training times of the underlying base learner increase drastically.
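
For reference, the interpolation idea behind SMOTE-style synthetic oversampling can be sketched as below; this is a generic dense-data illustration, not the adapted sparse version proposed in the paper:

```python
import numpy as np

def smote_like(X_min, n_new, k=5, seed=0):
    """Hypothetical SMOTE-style oversampling: each synthetic sample is a
    random interpolation between a minority instance and one of its k
    nearest minority neighbours (Euclidean distance, dense toy data)."""
    rng = np.random.default_rng(seed)
    d = ((X_min[:, None, :] - X_min[None, :, :]) ** 2).sum(-1)
    np.fill_diagonal(d, np.inf)             # never pick yourself as neighbour
    nn = np.argsort(d, axis=1)[:, :k]       # k nearest minority neighbours
    out = []
    for _ in range(n_new):
        i = rng.integers(len(X_min))        # random minority seed instance
        j = nn[i, rng.integers(k)]          # one of its k nearest neighbours
        lam = rng.random()                  # interpolation factor in [0, 1)
        out.append(X_min[i] + lam * (X_min[j] - X_min[i]))
    return np.array(out)

X_min = np.random.default_rng(1).normal(size=(20, 3))
synth = smote_like(X_min, n_new=10, k=5)
```

The pairwise distance matrix built here is exactly what becomes too expensive on large behaviour data, which is why the paper reports the synthetic approaches to be impractical at scale.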

With respect to undersampling methodologies, we employed a random undersampling technique (RUS) and Knn-based informed approaches. The latter type selects majority class instances that are closest (Cl Knn) or farthest (Far Knn) from the minority class examples, and allows one to investigate the effect of noise/outliers on SVM performance. We observed that these odd instances have a severe performance degrading effect. RUS and Far Knn approaches can significantly improve performance with respect to the baseline, thanks to their abilities to reduce noise/outlier levels and to put more emphasis on the minority class. Having said this, they usually perform worse than any of the oversampling approaches. This is in sheer contrast to the results obtained from traditional studies and is heavily related to the intrinsic nature of behaviour data. Timing-wise, RUS is the fastest method. The informed approaches are slower than RUS due to the required nearest neighbour computations. Additionally, a cluster-based undersampling (CBU) technique was proposed to alleviate the within-class imbalance. For low undersampling rates, the performance of CBU is superior to plain RUS, whereas for high undersampling rates the opposite is true. Overall, no substantial gain in performance is obtained in comparison with a simple RUS approach. Furthermore, the method is the slowest among all techniques considered and relies on computationally expensive clustering heuristics.
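
A plain RUS step only needs a uniform draw from the majority class; a minimal sketch (the function name and target ratio are illustrative):

```python
import numpy as np

def random_undersample(y, ratio=1.0, seed=0):
    """Keep all minority instances and a random subset of the majority
    class so that |majority| = ratio * |minority|. Returns kept indices."""
    rng = np.random.default_rng(seed)
    minority = np.flatnonzero(y == 1)
    majority = np.flatnonzero(y == 0)
    n_keep = min(len(majority), int(ratio * len(minority)))
    kept_majority = rng.choice(majority, size=n_keep, replace=False)
    return np.sort(np.concatenate([minority, kept_majority]))

y = np.array([0] * 90 + [1] * 10)
idx = random_undersample(y, ratio=2.0)   # keeps 10 minority + 20 majority indices
```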

We studied the effects of boosting, where we made use of a very specific linear SVM and logistic regression combination that uses confidence rated predictions, instead of a more popular plain "discrete" boosting algorithm (with weak hypotheses having values in {−1,1}). Our experiments clearly indicated the regularization constant C in the SVM formulation to act as a "weakness" indicator. Indeed, higher C-values cause stronger learners and shouldn't be used in the boosting process, which matches with the analysis of studies dealing with traditional data. Strong learners can
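
One confidence-rated weight update (in the style of Schapire and Singer 1999) can be sketched as follows, with a made-up real-valued weak hypothesis and a fixed step size for illustration (in practice alpha is chosen analytically per round):

```python
import numpy as np

# One confidence-rated boosting round on six instances with labels in {-1, +1}.
rng = np.random.default_rng(0)
y = rng.choice([-1, 1], size=6)
h = rng.normal(scale=0.5, size=6)      # real-valued weak predictions f_t(x_i)
D = np.full(6, 1 / 6)                  # current instance distribution D_t

alpha = 0.5                            # illustrative fixed step size
D_next = D * np.exp(-alpha * y * h)    # up-weight instances with y_i * f_t(x_i) < 0
D_next /= D_next.sum()                 # renormalize to a probability distribution
```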


already quite accurately distinguish majority class from minority class behaviours. The boosting process will focus on the hard to learn instances, among which are noise/outliers, and this can result in overfitting behaviour. We also observed that boosting can usually outperform the plain baseline, indicating that a combination of weak learners can achieve higher AUC-performance than a single strong learner.

EasyEnsemble (EE) integrates random undersampling with a boosting process. By sampling several balanced subsets from the training data and combining these, we can emphasize the minority class, achieve noise/outlier reduction effects and simultaneously explore the majority class space efficiently. Our version differs from the original formulation because we employ a different type of underlying weak learner (SVM and LR combination) that is fed to a confidence rated boosting algorithm (instead of the discrete version). We observed superior AUC-performance with respect to all of the aforementioned techniques. Furthermore, the method is very fast, since each subset can be trained in parallel and has a size that is twice as large as the minority class training set. We studied the effect of the number of subsets and noted that a limited amount of these (O(10)) is already sufficient to obtain satisfactory performances; increasing this number further causes only minor benefits.
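
A skeleton of the EE idea, illustration only: each balanced subset holds all minority instances plus an equal-sized random majority draw, and the subset models' scores are averaged (the paper's version additionally boosts an SVM/LR learner within each subset):

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

def easy_ensemble_scores(X, y, X_test, S=10, seed=0):
    """Illustrative EasyEnsemble skeleton: one learner per balanced subset
    (all minority instances + an equal-sized random majority draw), scores
    averaged over the S subsets."""
    rng = np.random.default_rng(seed)
    minority = np.flatnonzero(y == 1)
    majority = np.flatnonzero(y == 0)
    scores = np.zeros(X_test.shape[0])
    for _ in range(S):
        sub = np.concatenate(
            [minority, rng.choice(majority, size=len(minority), replace=False)])
        clf = LogisticRegression(solver="liblinear").fit(X[sub], y[sub])
        scores += clf.predict_proba(X_test)[:, 1]
    return scores / S

# Made-up imbalanced toy data (about 10% minority).
rng = np.random.default_rng(1)
X = rng.normal(size=(200, 5))
y = (rng.random(200) < 0.1).astype(int)
X[y == 1] += 1.5                        # shift the minority class a bit
s = easy_ensemble_scores(X, y, X, S=5)
```

Each of the S subset models is independent, which is what makes the parallel implementation mentioned above straightforward.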

Additionally, a statistical comparison of each of the proposed methods was conducted across a wide variety of datasets. Holm's significance test indicates that each of the proposed techniques (oversampling, undersampling (except Cl Knn), cost sensitive learning and boosting variants) can significantly outperform the plain baseline performance (each of the null-hypotheses of equal performance gets rejected with a confidence level of at least 91%). Comparison of the techniques with EE showed that the latter can significantly outperform all of them (each of the null-hypotheses of equal performance gets rejected with a confidence level of at least 75%). A timing-wise comparison showed EE to be the preferred method (after RUS) for large and highly imbalanced datasets. A parallel implementation would make EE the fastest method, even for medium sized datasets that show a high level of imbalance.

The influence of the chosen base learner was investigated by comparing the baseline to its EE-version for the SVM, LR, NB and BeSim inducers. Each of the EE-versions is able to improve upon the baseline performance for nearly all datasets, where base learners of the regularization type are superior because of their ability to control the strength of the learner.

Though we conducted an extensive benchmark study, there are still plenty of future research directions. First and most important, there is an urgent need to develop a more general theoretical framework to completely understand the effect of imbalance and why certain methods outperform others. Secondly, other types of mechanisms could be investigated in this setting: advanced sampling (Guo et al 2008) set-ups perform sampling based on preliminary classifications; one-class learning is a recognition based methodology where a discriminative boundary is learned around the training examples of a single class alone, and Raskutti and Kowalczyk (2004) demonstrated the superiority of the recognition based one-class SVMs over the discriminative two-class SVMs; other types of ensemble learning strategies (e.g. bagging and stacking) could be investigated; Batista et al (2004) note that combining over- and undersampling can be beneficial when data are highly imbalanced or show very few minority class instances; the ideas behind Extended Nearest Neighbour (Tang and


He 2015) could guide us in the sampling process. As opposed to Knn, "it considers not only who are the nearest neighbours of a certain sample, but also who consider the particular sample as their nearest neighbour" (Tang and He 2015). The method assigns a class membership to an unknown sample that maximizes the intra-class coherence. Finally, more advanced techniques do exist that can provide K (the number of nearest neighbours) faster or with slightly better performance.

Another issue relates to the lack of comprehensibility of the EasyEnsemble technique in its present form. Because we currently integrate Logistic Regression in the boosting process, the final ensemble classifier is non-linear and therefore incomprehensible. It might be interesting to investigate the numerical boosting technique proposed in Schapire and Singer (1999), where each weak learner is allowed to output real values (instead of values in [−1,1]). In that case, we would be able to use a plain linear SVM as a weak learner, without the Logistic Regression component. The final ensemble would be a linear model, which is surely easier to comprehend. Also, the instance-based approach of Martens and Provost (2014) could be used to explain individual predictions of the non-linear EE-technique.

A Oversampling experiments

Table 11 Oversampling experiments. Results show the average tenfold AUC test set performance with respect to increasing β-values (β = [β1, β2, β3, β4] = [0, 1/3, 2/3, 1]). Additionally, standard deviations are shown between brackets. Optimal performances for each method are highlighted in boldface. See Section 4.2 for parameter settings.

(columns: β1, β2, β3, β4)

MOV G(p = 1)
OSR     71.6 (2.62)   74.37 (2.04)  73.6 (1.84)   74.73 (2.45)
SMOTE   71.6 (2.62)   75.08 (2.18)  76.02 (2.14)  76.48 (2.3)
ADASYN  71.6 (2.62)   75.16 (1.92)  75.93 (2.08)  76.47 (2.29)

MOV G(p = 25)
OSR     81.41 (1.32)  83.49 (1.81)  83.84 (1.96)  83.91 (2.04)
SMOTE   81.41 (1.32)  83.32 (1.97)  83.59 (2.04)  83.76 (2.11)
ADASYN  81.41 (1.32)  83.61 (1.82)  84.02 (1.97)  83.69 (1.96)

Mov Th(p = [])
OSR     79.77 (5.33)  85.3 (4.66)   83.16 (4.5)   84.59 (5.69)
SMOTE   79.77 (5.33)  84.18 (6.51)  85.58 (5.97)  84.33 (5.75)
ADASYN  79.77 (5.33)  84.11 (6.77)  85.86 (6.06)  85.36 (5.13)

Yahoo A(p = 1)
OSR     55.92 (2.97)  58.66 (3.27)  59.99 (2.28)  59.74 (1.78)
SMOTE   55.92 (2.97)  59.76 (2.62)  59.74 (2.67)  59.43 (2.4)
ADASYN  55.92 (2.97)  59.54 (2.53)  59.55 (2.94)  59.56 (2.22)

Yahoo A(p = 25)
OSR     61.68 (2.42)  64.19 (3.17)  65.08 (3.26)  64.67 (2.1)
SMOTE   61.68 (2.42)  65.46 (3.63)  65.33 (3.23)  64.52 (2.98)
ADASYN  61.68 (2.42)  65.04 (3.74)  65.41 (3.47)  64.4 (2.21)

Yahoo G(p = 1)
OSR     66.84 (3.66)  72.18 (2.36)  73.11 (2.7)   72.49 (3.41)
SMOTE   66.84 (3.66)  72.65 (2.85)  73.27 (3.36)  73.37 (3.56)
ADASYN  66.84 (3.66)  72.87 (2.83)  73.18 (3.2)   73.39 (3.59)

Yahoo G(p = 25)
OSR     78.82 (1.39)  78.78 (2.02)  78.74 (1.86)  78.35 (1.47)
SMOTE   78.82 (1.39)  79.23 (1.57)  79.1 (1.2)    79.03 (1.89)
ADASYN  78.82 (1.39)  79.12 (1.43)  79.23 (1.37)  79.51 (2.01)

TaFeng(p = 1)
OSR     55.75 (1.6)   59.23 (1.96)  60.0 (1.68)   61.04 (2.36)
SMOTE   55.75 (1.6)   60.26 (1.95)  61.49 (1.8)   61.13 (1.52)
ADASYN  55.75 (1.6)   60.26 (1.9)   61.44 (1.85)  61.16 (1.5)

TaFeng(p = 25)
OSR     66.94 (1.34)  68.84 (1.21)  66.99 (1.42)  67.7 (1.41)
SMOTE   66.94 (1.34)  68.47 (1.5)   67.07 (1.15)  66.65 (0.81)
ADASYN  66.94 (1.34)  68.62 (1.38)  67.85 (1.6)   66.91 (1.39)

Book(p = 1)
OSR     52.6 (1.29)   53.61 (0.94)  55.41 (1.75)  55.87 (1.44)
SMOTE   52.6 (1.29)   54.77 (0.99)  54.91 (0.8)   54.36 (0.98)
ADASYN  52.6 (1.29)   54.86 (1.13)  55.06 (0.73)  54.54 (0.92)

Book(p = 25)
OSR     60.08 (0.71)  61.05 (0.94)  61.82 (1.12)  64.62 (0.57)
SMOTE   60.08 (0.71)  62.6 (0.73)   60.95 (0.68)  63.0 (0.8)
ADASYN  60.08 (0.71)  62.33 (0.73)  60.77 (0.85)  63.04 (0.58)

LST(p = 1)
OSR     99.99 (0.01)  99.99 (0.01)  99.99 (0.01)  99.99 (0.01)
SMOTE   99.99 (0.01)  99.99 (0.01)  99.99 (0.01)  99.99 (0.01)
ADASYN  99.99 (0.01)  99.99 (0.01)  99.99 (0.01)  99.99 (0.01)

Adver(p = [])
OSR     96.61 (1.82)  97.31 (1.65)  97.07 (1.84)  97.07 (1.79)
SMOTE   96.61 (1.82)  96.91 (1.66)  97.19 (1.65)  97.07 (1.91)
ADASYN  96.61 (1.82)  97.1 (1.7)    97.08 (1.87)  97.07 (1.88)

Adver(p = 1)
OSR     90.93 (3.02)  91.27 (3.03)  92.66 (2.82)  93.29 (1.97)
SMOTE   90.93 (3.02)  92.51 (2.03)  92.96 (2.14)  93.53 (1.81)
ADASYN  90.93 (3.02)  92.22 (2.33)  92.7 (2.36)   93.88 (1.73)

CRF(p = [])
OSR     64.06 (16.43)  80.82 (12.94)  81.28 (12.27)  81.91 (11.28)
SMOTE   64.06 (16.43)  78.64 (16.86)  82.52 (13.74)  79.32 (16.26)
ADASYN  64.06 (16.43)  78.95 (16.72)  81.19 (16.32)  79.31 (16.19)

Bank(p = [])
OSR     66.82 (0.88)  70.1 (0.74)  71.39 (0.8)  71.47 (0.8)
SMOTE
ADASYN


B Undersampling experiments

Table 12 Undersampling experiments. Results show the average tenfold AUC test set performance with respect to increasing βu-values (βu = [βu1, βu2, βu3, βu4, βu5] = [0, 1/4, 1/2, 3/4, 1]). Optimal performances (boldface) are highlighted for each imbalance ratio. Additionally, standard deviations are shown between brackets. Cl K represents the "Closest Knn" technique; Cl T represents the "Closest tot sim" technique (similar for Far K and Far T, see Section 3.2). Parameter settings are given in Section 4.2.

(columns: βu1, βu2, βu3, βu4, βu5)

Mov G(p = 1)
RUS    71.6 (2.6)   71.83 (2.6)  72.54 (2.5)  72.39 (3.1)  70.61 (3.5)
Cl K   71.6 (2.6)   71.4 (2.0)   70.96 (1.9)  70.43 (2.4)  69.05 (4.1)
Cl T   71.6 (2.6)   70.28 (2.5)  66.74 (2.0)  66.8 (2.1)   68.18 (3.6)
Far K  71.6 (2.6)   72.36 (2.7)  71.26 (3.4)  66.57 (5.2)  53.5 (3.5)
Far T  71.6 (2.6)   72.22 (2.8)  71.63 (3.6)  64.28 (5.3)  50.88 (4.4)
CBU    72.55 (2.6)  73.28 (2.6)  73.12 (2.6)  73.84 (2.5)  73.0 (3.1)

Mov G(p = 25)
RUS    81.41 (1.3)  81.36 (1.3)  81.78 (1.7)  82.05 (1.7)  81.6 (2.1)
Cl K   81.41 (1.3)  80.86 (1.2)  80.95 (1.6)  79.73 (2.3)  77.95 (2.3)
Cl T   81.41 (1.3)  79.9 (1.2)   78.21 (1.4)  77.87 (1.5)  77.76 (2.3)
Far K  81.41 (1.3)  80.9 (1.5)   78.17 (1.8)  74.25 (2.4)  69.79 (3.2)
Far T  81.41 (1.3)  80.86 (1.5)  77.2 (2.4)   71.16 (2.7)  62.4 (2.8)
CBU    81.53 (1.4)  81.64 (1.3)  81.29 (1.6)  81.28 (2.1)  80.34 (2.7)

Mov Th(p = [])
RUS    79.77 (5.3)  80.32 (5.8)  81.57 (5.5)  81.86 (6.6)  81.26 (6.2)
Cl K   79.77 (5.3)  79.25 (4.5)  78.07 (5.0)  76.25 (6.5)  62.46 (8.5)
Cl T   79.77 (5.3)  78.4 (4.4)   72.41 (3.5)  64.66 (4.5)  60.37 (7.3)
Far K  79.77 (5.3)  84.54 (5.0)  83.64 (6.4)  80.02 (7.3)  56.82 (10.3)
Far T  79.77 (5.3)  85.03 (5.7)  82.68 (6.8)  75.61 (9.2)  56.77 (10.9)
CBU    80.11 (5.8)  81.17 (6.0)  81.08 (6.5)  84.17 (5.1)  80.96 (6.9)

Yahoo A(p = 1)
RUS    55.92 (3.0)  55.57 (3.4)  56.44 (3.0)  55.83 (3.4)  56.37 (3.3)
Cl K   55.92 (3.0)  55.67 (2.4)  53.12 (2.0)  50.57 (1.8)  53.79 (3.5)
Cl T   55.92 (3.0)  55.69 (2.1)  53.35 (2.2)  50.31 (2.2)  52.35 (3.3)
Far K  55.92 (3.0)  57.35 (2.2)  56.92 (1.1)  56.95 (2.3)  51.18 (2.0)
Far T  55.92 (3.0)  56.93 (2.4)  54.74 (1.9)  57.01 (1.8)  51.18 (2.0)
CBU    58.21 (2.6)  58.45 (3.3)  58.31 (3.5)  58.39 (3.5)  56.09 (2.6)

Yahoo A(p = 25)
RUS    61.68 (2.4)  62.9 (2.9)   63.62 (3.6)  63.75 (3.1)  63.19 (1.9)
Cl K   61.68 (2.4)  61.14 (2.1)  57.62 (1.6)  54.02 (1.8)  51.48 (1.4)
Cl T   61.68 (2.4)  60.89 (2.8)  58.11 (1.4)  54.4 (2.1)   51.76 (1.4)
Far K  61.68 (2.4)  63.96 (3.0)  62.62 (2.2)  59.61 (1.5)  56.25 (1.6)
Far T  61.68 (2.4)  63.71 (2.4)  59.72 (1.6)  57.27 (1.1)  54.47 (1.1)
CBU    62.46 (2.6)  61.85 (1.4)  61.78 (2.2)  59.94 (3.0)  60.1 (4.0)

Yahoo G(p = 1)
RUS    66.84 (3.7)  67.85 (3.2)  68.36 (3.2)  68.23 (4.0)  69.9 (4.2)
Cl K   66.84 (3.7)  66.71 (2.8)  64.3 (3.6)   61.98 (3.9)  61.15 (1.9)
Cl T   66.84 (3.7)  65.79 (2.7)  63.55 (3.3)  59.21 (3.5)  61.08 (2.4)
Far K  66.84 (3.7)  66.76 (4.1)  63.84 (3.4)  65.16 (2.0)  48.5 (2.9)
Far T  66.84 (3.7)  66.95 (4.1)  63.48 (2.9)  65.16 (2.0)  48.48 (2.9)
CBU    69.68 (4.1)  70.59 (3.2)  70.64 (3.7)  70.2 (2.9)   63.35 (3.6)

Yahoo G(p = 25)
RUS    78.82 (1.4)  78.91 (1.6)  78.97 (1.6)  78.61 (1.6)  77.82 (2.1)
Cl K   78.82 (1.4)  77.26 (1.5)  72.52 (1.5)  67.86 (2.0)  65.07 (2.7)
Cl T   78.82 (1.4)  76.83 (1.0)  71.99 (1.8)  67.15 (2.3)  61.1 (2.7)
Far K  78.82 (1.4)  78.26 (2.2)  74.69 (2.7)  67.22 (2.1)  60.72 (2.3)
Far T  78.82 (1.4)  77.68 (2.6)  72.44 (3.0)  64.94 (2.4)  59.6 (2.0)
CBU    75.25 (3.2)  75.22 (2.4)  74.69 (2.3)  73.07 (2.4)  70.69 (2.4)

TaFeng(p = 1)
RUS    55.75 (1.6)  56.1 (1.6)   56.26 (1.7)  57.23 (1.7)  59.25 (2.2)
Cl K   55.75 (1.6)  55.68 (1.6)  55.58 (1.5)  55.08 (1.1)  51.05 (1.5)
Cl T   55.75 (1.6)  55.67 (1.6)  54.47 (1.6)  47.53 (1.6)  49.3 (1.1)
Far K  55.75 (1.6)  58.99 (1.2)  59.47 (1.1)  60.04 (1.2)  56.31 (1.0)
Far T  55.75 (1.6)  58.92 (1.3)  59.25 (1.3)  58.58 (1.1)  56.31 (1.0)
CBU    57.8 (1.0)   58.47 (1.1)  58.15 (0.9)  58.87 (1.4)  57.65 (1.6)

TaFeng(p = 25)
RUS    66.94 (1.3)  67.44 (1.3)  68.1 (1.4)   68.27 (1.4)  66.13 (1.2)
Cl K   66.94 (1.3)  66.13 (1.4)  63.39 (1.2)  59.83 (1.3)  56.94 (0.7)
Cl T   66.94 (1.3)  66.38 (1.5)  62.89 (1.6)  57.46 (1.3)  54.56 (1.3)
Far K  66.94 (1.3)  68.06 (1.4)  66.43 (1.6)  64.46 (1.5)  63.35 (1.3)
Far T  66.94 (1.3)  64.31 (1.1)  62.69 (1.0)  61.27 (1.1)  59.03 (1.0)
CBU    64.81 (1.2)  64.15 (1.1)  64.13 (1.2)  63.88 (0.8)  63.46 (0.8)

Book(p = 1)
RUS    52.6 (1.3)   52.79 (0.9)  53.46 (0.8)  53.89 (0.9)  54.05 (0.9)
Cl K   52.6 (1.3)   52.56 (1.2)  52.52 (1.3)  52.39 (1.1)  53.09 (1.1)
Cl T   52.6 (1.3)   52.56 (1.2)  52.52 (1.3)  52.39 (1.1)  53.05 (0.7)
Far K  52.6 (1.3)   55.21 (1.2)  56.21 (1.8)  56.14 (1.2)  53.06 (1.0)
Far T  52.6 (1.3)   55.21 (1.2)  56.21 (1.8)  56.14 (1.2)  53.06 (1.0)
CBU    54.28 (0.9)  53.77 (1.0)  53.33 (1.1)  53.34 (0.9)  52.84 (0.8)

Book(p = 25)
RUS    60.08 (0.7)  60.13 (0.6)  60.4 (0.8)   60.33 (0.8)  63.28 (0.8)
Cl K   60.08 (0.7)  59.96 (0.7)  60.13 (0.8)  59.96 (1.0)  59.28 (0.7)
Cl T   60.08 (0.7)  59.96 (0.7)  60.13 (0.8)  60.29 (0.4)  54.5 (0.9)
Far K  60.08 (0.7)  63.29 (1.0)  64.19 (0.8)  57.3 (1.1)   55.66 (1.1)
Far T  60.08 (0.7)  62.14 (0.5)  58.27 (0.6)  56.37 (1.0)  55.66 (1.1)
CBU    54.82 (0.9)  54.67 (0.9)  54.71 (0.9)  54.66 (1.0)  54.78 (0.9)

LST(p = 1)
RUS    99.99 (0.0)  99.99 (0.0)  99.99 (0.0)  99.98 (0.0)  99.99 (0.0)
Cl K   99.99 (0.0)  99.99 (0.0)  99.99 (0.0)  99.99 (0.0)  99.99 (0.0)
Cl T   99.99 (0.0)  99.99 (0.0)  99.99 (0.0)  99.99 (0.0)  99.98 (0.0)
Far K  99.99 (0.0)  99.98 (0.0)  99.98 (0.0)  99.98 (0.0)  99.98 (0.0)
Far T  99.99 (0.0)  99.98 (0.0)  99.98 (0.0)  99.98 (0.0)  99.98 (0.0)
CBU    []           []           []           []           []

Adver(p = [])
RUS    96.61 (1.8)  96.32 (1.8)  96.63 (1.4)  97.12 (2.1)  96.22 (1.6)
Cl K   96.61 (1.8)  96.44 (1.5)  96.14 (1.5)  96.04 (2.0)  94.8 (2.5)
Cl T   96.61 (1.8)  95.87 (2.1)  94.32 (1.9)  93.01 (2.2)  90.72 (2.3)
Far K  96.61 (1.8)  96.53 (1.4)  95.76 (2.0)  94.39 (1.8)  90.49 (3.1)
Far T  96.61 (1.8)  96.54 (1.5)  95.67 (1.9)  94.54 (1.8)  89.3 (2.8)
CBU    96.85 (2.3)  96.85 (2.3)  97.05 (1.5)  96.6 (1.6)   96.06 (2.1)

Adver(p = 1)
RUS    90.93 (3.0)  91.53 (3.1)  92.37 (3.4)  91.9 (2.9)   91.93 (2.2)
Cl K   90.93 (3.0)  90.64 (3.0)  89.87 (3.9)  90.21 (3.6)  89.18 (2.0)
Cl T   90.93 (3.0)  89.7 (3.5)   88.55 (3.4)  85.76 (3.3)  88.2 (2.3)
Far K  90.93 (3.0)  93.8 (2.3)   92.4 (2.6)   88.73 (3.4)  85.51 (4.0)
Far T  90.93 (3.0)  93.62 (2.4)  93.2 (2.2)   88.41 (3.6)  85.51 (4.0)
CBU    93.22 (2.4)  93.76 (2.5)  93.89 (2.6)  93.52 (2.7)  91.27 (2.0)

CRF(p = [])
RUS    64.06 (16.4)  63.28 (15.9)  67.98 (17.4)  66.95 (21.9)  87.73 (8.8)
Cl K   64.06 (16.4)  62.44 (16.6)  62.34 (16.9)  71.37 (13.8)  78.22 (17.7)
Cl T   64.06 (16.4)  62.44 (16.6)  62.34 (16.9)  71.37 (13.8)  62.67 (22.9)
Far K  64.06 (16.4)  83.8 (14.2)   83.93 (14.8)  84.49 (13.7)  86.11 (9.7)
Far T  64.06 (16.4)  83.8 (14.2)   83.93 (14.8)  84.49 (13.7)  86.11 (9.7)
CBU    []            []            []            []            []

Bank(p = [])
RUS    66.82 (0.9)  67.02 (0.9)  67.37 (0.8)  67.99 (0.6)  69.5 (1.0)
Cl K   66.82 (0.9)  66.17 (0.7)  65.24 (0.6)  64.86 (0.6)  58.53 (1.1)
Cl T   66.82 (0.9)  64.92 (1.1)  60.69 (0.9)  56.33 (0.8)  52.87 (0.7)
Far K  66.82 (0.9)  66.95 (0.6)  66.19 (0.6)  64.42 (0.6)  58.25 (1.1)
Far T  66.82 (0.9)  67.16 (0.6)  64.2 (0.8)   59.67 (1.0)  58.25 (1.1)
CBU    []           []           []           []           []


C Boosting experiments

This section presents the results of each of the boosting experiments from Section 4.5 on the remaining data sources. Each of the figures below depicts the average tenfold AUC-performance on test data (with µ = 100) with respect to the number of boosting iterations for (left) Adaboost (AB), AdaCost (AC) and EasyEnsemble (EE), with C chosen according to the highest validation set AUC-performance (over all possible boosting rounds), and (right) AB and EE (with S = 15) with varying C-levels.

[Figure: panels (a) and (b), average tenfold test AUC versus number of boosting rounds T]

Fig 6 Mov G(p = 1) dataset

[Figure: panels (a) and (b), average tenfold test AUC versus number of boosting rounds T]

Fig 7 Mov Th(p = []) dataset

Imbalanced classification in sparse and large behaviour datasets 49

[Figure: average tenfold test AUC versus the number of boosting rounds T. Panel (a): AB, AC (R2, R8, RD), EE (S = 5, 10, 15) and BL. Panel (b): AB and EE at C = 1e-07, 1e-05, 0.001 and 0.1, and BL.]

Fig. 8 Yahoo A(p = 1) dataset

[Figure: average tenfold test AUC versus the number of boosting rounds T. Panel (a): AB, AC (R2, R8, RD), EE (S = 5, 10, 15) and BL. Panel (b): AB and EE at C = 1e-07, 1e-05, 0.001 and 0.1, and BL.]

Fig. 9 Yahoo A(p = 25) dataset

[Figure: average tenfold test AUC versus the number of boosting rounds T. Panel (a): AB, AC (R2, R8, RD), EE (S = 5, 10, 15) and BL. Panel (b): AB and EE at C = 1e-07, 1e-05, 0.001 and 0.1, and BL.]

Fig. 10 Yahoo G(p = 1) dataset


[Figure: average tenfold test AUC versus the number of boosting rounds T. Panel (a): AB, AC (R2, R8, RD), EE (S = 5, 10, 15) and BL. Panel (b): AB and EE at C = 1e-07, 1e-05, 0.001 and 0.1, and BL.]

Fig. 11 Yahoo G(p = 25) dataset

[Figure: average tenfold test AUC versus the number of boosting rounds T. Panel (a): AB, AC (R2, R8, RD), EE (S = 5, 10, 15) and BL. Panel (b): AB and EE at C = 1e-07, 1e-05, 0.001 and 0.1, and BL.]

Fig. 12 TaFeng(p = 1) dataset

[Figure: average tenfold test AUC versus the number of boosting rounds T. Panel (a): AB, AC (R2, R8, RD), EE (S = 5, 10, 15) and BL. Panel (b): AB and EE at C = 1e-07, 1e-05, 0.001 and 0.1, and BL.]

Fig. 13 Book(p = 1) dataset


[Figure: average tenfold test AUC versus the number of boosting rounds T. Panel (a): AB, AC (R2, R8, RD), EE (S = 5, 10, 15) and BL. Panel (b): AB and EE at C = 1e-07, 1e-05, 0.001 and 0.1, and BL.]

Fig. 14 LST(p = 1) dataset

[Figure: average tenfold test AUC versus the number of boosting rounds T. Panel (a): AB, AC (R2, R8, RD), EE (S = 5, 10, 15) and BL. Panel (b): AB and EE at C = 1e-07, 1e-05, 0.001 and 0.1, and BL.]

Fig. 15 Adver(p = []) dataset

[Figure: average tenfold test AUC versus the number of boosting rounds T. Panel (a): AB, AC (R2, R8, RD), EE (S = 5, 10, 15) and BL. Panel (b): AB and EE at C = 1e-07, 1e-05, 0.001 and 0.1, and BL.]

Fig. 16 Adver(p = 1) dataset


[Figure: average tenfold test AUC versus the number of boosting rounds T. Panel (a): AB, AC (R2, R8, RD), EE (S = 5, 10, 15) and BL. Panel (b): AB and EE at C = 1e-07, 1e-05, 0.001 and 0.1, and BL.]

Fig. 17 CRF(p = []) dataset

D Final Comparison

[Figure: average rank AUC (horizontal axis) versus average rank Time (vertical axis) for BL, OSR, SMOTE, ADASYN, RUS, Cl Knn, Far Knn, CBU, AB, the AC variants, EE(S = 10), EE(S = 15) and EE par.]

Fig. 18 Average rank AUC versus average rank Time (see Table 9) across the large datasets from Table 2 (CRF and Bank). Regarding the former, an extra row EE par is added, having the same AUC-value as EE(S = 15). Ranks for each dataset are subsequently obtained via the procedure outlined in Section 4.6.1. Points occurring in the upper-right region are preferred.


References

Akbani R Kwek S Japkowicz N (2004) Applying support vector machines to imbalanced datasets. In: Machine Learning: ECML 2004, 15th European Conference on Machine Learning, Pisa, Italy, September 20-24, 2004, Proceedings, Springer Berlin Heidelberg, Berlin, Heidelberg, pp 39–50. DOI 10.1007/978-3-540-30115-8_7

Ali A Shamsuddin SM Ralescu AL (2015) Classification with class imbalance problem A review Inter-national Journal of Advances in Soft Computing and its Applications 7(3)176ndash204

Alzahrani T Horadam KJ (2016) Community detection in bipartite networks Algorithms and case studiesIn Complex Systems and Networks Dynamics Controls and Applications Springer Berlin Heidel-berg Berlin Heidelberg pp 25ndash50 DOI 101007978-3-662-47824-0 2

Bachner J (2013) Predictive policing Preventing crime with data and analytics IBM Center for the Busi-ness of Government

Baesens B Van Gestel T Viaene S Stepanova M Suykens J Vanthienen J (2003) Benchmarking state-of-the-art classification algorithms for credit scoring Journal of the Operational Research Society54(6)627ndash635 DOI 101057palgravejors2601545

Barandela R Sánchez J García V Rangel E (2003) Strategies for learning in class imbalance problems. Pattern Recognition 36(3):849–851. DOI 10.1016/S0031-3203(02)00257-1

Barber MJ (2007) Modularity and community detection in bipartite networks Physical Review E76066102 DOI 101103PhysRevE76066102

Barua S Islam MM Yao X Murase K (2014) MWMOTEndashmajority weighted minority oversamplingtechnique for imbalanced data set learning IEEE Transactions on Knowledge and Data Engineer-ing 26(2)405ndash425 DOI 101109TKDE2012232

Batista GEAPA Prati RC Monard MC (2004) A study of the behavior of several methods for balancingmachine learning training data SIGKDD Explor Newsl 6(1)20ndash29 DOI 10114510077301007735

Beckett SJ (2016) Improved community detection in weighted bipartite networks Royal Society OpenScience 3(1) DOI 101098rsos140536

Bekkar M Djemaa HK Alitouche TA (2013) Evaluation measures for models assessment over imbalanceddata sets Journal of Information Engineering and Applications 3(10)27ndash38

Bhattacharyya S Jha S Tharakunnel K Westland JC (2011) Data mining for credit card fraud A compara-tive study Decision Support Systems 50(3)602 ndash 613 DOI httpsdoiorg101016jdss201008008

Blondel VD Guillaume JL Lambiotte R Lefebvre E (2008) Fast unfolding of communities in large net-works Journal of Statistical Mechanics Theory and Experiment 2008(10)P10008

Breiman L Friedman J Stone CJ Olshen RA (1984) Classification and regression trees. Taylor & Francis

Brozovsky L Petricek V (2007) Recommender system for online dating service. In: Proceedings of Znalosti 2007 Conference, VSB Ostrava

Cha M Mislove A Gummadi KP (2009) A measurement-driven analysis of information propagation in the Flickr social network. In: Proceedings of the 18th International Conference on World Wide Web, ACM, New York, NY, USA, WWW '09, pp 721–730. DOI 10.1145/1526709.1526806

Chawla NV (2005) Data mining for imbalanced datasets An overview In Data mining and knowledgediscovery handbook Springer US Boston MA pp 853ndash867

Chawla NV Bowyer KW Hall LO Kegelmeyer WP (2002) SMOTE synthetic minority over-samplingtechnique Journal of artificial intelligence research 16321ndash357

Chawla NV Lazarevic A Hall LO Bowyer KW (2003) Smoteboost Improving prediction of the minorityclass in boosting In Knowledge Discovery in Databases PKDD 2003 Springer Berlin HeidelbergBerlin Heidelberg pp 107ndash119

Chawla NV Japkowicz N Kotcz A (2004) Editorial Special issue on learning from imbalanced data setsSIGKDD Explor Newsl 6(1)1ndash6 DOI 10114510077301007733

Chen M Mao S Liu Y (2014) Big data A survey Mobile Networks and Applications 19(2)171ndash209DOI 101007s11036-013-0489-0

Chyi YM (2003) Classification analysis techniques for skewed class distribution problems Master thesisDepartment of Information management National Sun Yat-Sen University

Demsar J (2006) Statistical comparisons of classifiers over multiple data sets Journal of Machine LearningResearch 7(Jan)1ndash30

Dhillon IS (2001) Co-clustering documents and words using bipartite spectral graph partitioning In Pro-ceedings of the Seventh ACM SIGKDD International Conference on Knowledge Discovery and DataMining ACM New York NY USA KDD rsquo01 pp 269ndash274 DOI 101145502512502550


Drummond C Holte RC (2003) C45 class imbalance and cost sensitivity why under-sampling beatsover-sampling In Proceedings of the ICML rsquo03 Workshop on Learning from Imbalanced Datasets

Fan RE Chang KW Hsieh CJ Wang XR Lin CJ (2008) LIBLINEAR A library for large linear classifi-cation Journal of Machine Learning Research 91871ndash1874

Fan W Stolfo SJ Zhang J Chan PK (1999) AdaCost Misclassification cost-sensitive boosting In Pro-ceedings of the Sixteenth International Conference on Machine Learning Morgan Kaufmann Pub-lishers Inc San Francisco CA USA ICML rsquo99 pp 97ndash105

Fawcett T (2006) An introduction to ROC analysis Pattern Recognition Letters 27(8)861 ndash 874 DOIhttpsdoiorg101016jpatrec200510010

Finch H (2005) Comparison of distance measures in cluster analysis with dichotomous data Journal ofData Science 3(1)85ndash100

Fortunato S (2010) Community detection in graphs Physics Reports 486(3 5)75 ndash 174 DOI httpsdoiorg101016jphysrep200911002

Junque de Fortuny E Martens D Provost F (2014a) Predictive modeling with big data is bigger reallybetter Big Data 1(4)215ndash226 DOI 101089big20130037

Junque de Fortuny E Stankova M Moeyersoms J Minnaert B Provost F Martens D (2014b) Corporateresidence fraud detection In Proceedings of the 20th ACM SIGKDD International Conference onKnowledge Discovery and Data Mining ACM New York NY USA KDD rsquo14 pp 1650ndash1659 DOI10114526233302623333

Frasca M Bertoni A Re M Valentini G (2013) A neural network algorithm for semi-supervised node labellearning from unbalanced data Neural Networks 4384 ndash 98 DOI httpsdoiorg101016jneunet201301021

Friedman M (1937) The use of ranks to avoid the assumption of normality implicit in the analysis ofvariance Journal of the american statistical association 32(200)675ndash701

García E Lozano F (2007) Boosting support vector machines. In: Machine Learning and Data Mining in Pattern Recognition, 5th International Conference, MLDM 2007, Leipzig, Germany, July 18-20, Post Proceedings, IBaI publishing, pp 153–167

Goldstein M Uchida S (2016) A comparative evaluation of unsupervised anomaly detection algorithmsfor multivariate data PLOS ONE 11(4)1ndash31 DOI 101371journalpone0152173

González PC Velásquez JD (2013) Characterization and detection of taxpayers with false invoices using data mining techniques. Expert Systems with Applications 40(5):1427–1436. DOI 10.1016/j.eswa.2012.08.051

Guimera R Sales-Pardo M Amaral LAN (2007) Module identification in bipartite and directed networksPhysical Review E 76036102 DOI 101103PhysRevE76036102

Guo H Viktor HL (2004) Learning from imbalanced data sets with boosting and data generation TheDataBoost-IM approach SIGKDD Explor Newsl 6(1)30ndash39 DOI 10114510077301007736

Guo X Yin Y Dong C Yang G Zhou G (2008) On the class imbalance problem In 2008 Fourth Interna-tional Conference on Natural Computation IEEE vol 4 pp 192ndash201 DOI 101109ICNC2008871

Han H Wang WY Mao BH (2005) Borderline-SMOTE A new over-sampling method in imbalanced datasets learning In Advances in Intelligent Computing Springer Berlin Heidelberg Berlin Heidelbergpp 878ndash887

He H Garcia EA (2009) Learning from imbalanced data IEEE Transactions on Knowledge and DataEngineering 21(9)1263ndash1284 DOI 101109TKDE2008239

He H Bai Y Garcia EA Li S (2008) ADASYN Adaptive synthetic sampling approach for imbalancedlearning In 2008 IEEE International Joint Conference on Neural Networks (IEEE World Congresson Computational Intelligence) IEEE pp 1322ndash1328 DOI 101109IJCNN20084633969

Holm S (1979) A simple sequentially rejective multiple test procedure Scandinavian journal of statistics6(2)65ndash70

Hsu CW Lin CJ (2002) A comparison of methods for multiclass support vector machines IEEE Transac-tions on Neural Networks 13(2)415ndash425 DOI 10110972991427

Huang A (2008) Similarity measures for text document clustering In Proceedings of the sixth new zealandcomputer science research student conference (NZCSRSC2008) Christchurch New Zealand pp 49ndash56

Iman RL Davenport JM (1980) Approximations of the critical region of the Friedman statistic Commu-nications in Statistics-Theory and Methods 9(6)571ndash595

Jo T Japkowicz N (2004) Class imbalances versus small disjuncts ACM SIGKDD Explor Newsl 6(1)40ndash49 DOI 10114510077301007737


Jutla IS Jeub LG Mucha PJ (2011-2016) A generalized Louvain method for community detection implemented in MATLAB. URL http://netwiki.amath.unc.edu/GenLouvain

Kubat M Matwin S (1997) Addressing the curse of imbalanced training sets One-sided selection InProceedings of the Fourteenth International Conference on Machine Learning Morgan KaufmannPublishers Inc San Francisco CA USA pp 179ndash186

Lancichinetti A Fortunato S (2009) Community detection algorithms A comparative analysis PhysicalReview E 80056117 DOI 101103PhysRevE80056117

Larremore DB Clauset A Jacobs AZ (2014) Efficiently inferring community structure in bipartitenetworks Physical Review E Statistical Nonlinear and Soft Matter Physics 90012805 DOI101103PhysRevE90012805

Li J Fine JP (2010) Weighted area under the receiver operating characteristic curve and its application togene selection Journal of the Royal Statistical Society Series C (Applied Statistics) 59(4)673ndash692DOI 101111j1467-9876201000713x

Li X Wang L Sung E (2008) AdaBoost with SVM-based component classifiers Engineering Applicationsof Artificial Intelligence 21(5)785ndash795 DOI httpsdoiorg101016jengappai200707001

Lichman M (2013) UCI machine learning repository. URL http://archive.ics.uci.edu/ml

Liu W Chawla S Cieslak DA Chawla NV (2010) A robust decision tree algorithm for imbalanced data sets. In: Proceedings of the tenth SIAM international conference on data mining, SIAM, Philadelphia, vol 10, pp 766–777

Liu XY Wu J Zhou ZH (2009) Exploratory undersampling for class-imbalance learning IEEE Transac-tions on Systems Man and Cybernetics Part B (Cybernetics) 39(2)539ndash550 DOI 101109TSMCB20082007853

Luts J Ojeda F Van de Plas R De Moor B Van Huffel S Suykens JA (2010) A tutorial on supportvector machine-based methods for classification problems in chemometrics Analytica Chimica Acta665(2)129ndash145 DOI httpsdoiorg101016jaca201003030

Macskassy SA Provost F (2007) Classification in networked data A toolkit and a univariate case studyJournal of Machine Learning Research 8(May)935ndash983

Martens D Provost F (2014) Explaining data-driven document classifications MIS Quarterly 38(1)73ndash100

Martens D Provost F Clark J Junque de Fortuny E (2016) Mining massive fine-grained behavior data toimprove predictive analytics MIS Quarterly 40(4)869ndash888

Mazurowski MA Habas PA Zurada JM Lo JY Baker JA Tourassi GD (2008) Training neural networkclassifiers for medical decision making The effects of imbalanced datasets on classification perfor-mance Neural Networks 21(23)427 ndash 436 DOI httpsdoiorg101016jneunet200712031

Mease D Wyner AJ Buja A (2007) Boosted classification trees and class probabilityquantile estimationJournal of Machine Learning Research 8409ndash439

Nemenyi P (1963) Distribution-free Multiple Comparisons. Dissertation, Princeton University

Newman MEJ Girvan M (2004) Finding and evaluating community structure in networks. Physical Review E 69:026113. DOI 10.1103/PhysRevE.69.026113

Ng AY (2004) Feature selection, L1 vs. L2 regularization, and rotational invariance. In: Proceedings of the Twenty-first International Conference on Machine Learning, ACM, New York, NY, USA, ICML '04, p 78. DOI 10.1145/1015330.1015435

Ng AY Jordan MI (2002) On discriminative vs generative classifiers A comparison of logistic regressionand naive bayes In Advances in Neural Information Processing Systems 14 MIT Press pp 841ndash848

Ngai E Hu Y Wong Y Chen Y Sun X (2011) The application of data mining techniques in financial frauddetection A classification framework and an academic review of literature Decision Support Systems50(3)559 ndash 569 DOI httpsdoiorg101016jdss201008006

Platt JC (1999) Probabilistic outputs for support vector machines and comparisons to regularized likeli-hood methods In Advances in Large-Margin Classifiers MIT Press pp 61ndash74

Porter MA Onnela JP Mucha PJ (2009) Communities in networks Notices of the American MathematicalSociety 56(9)1082ndash1097

Provost F Fawcett T (2013) Data Science for Business What you need to know about data mining anddata-analytic thinking OrsquoReilly Media Inc

Provost F Dalessandro B Hook R Zhang X Murray A (2009) Audience selection for on-line brandadvertising Privacy-friendly social network targeting In Proceedings of the 15th ACM SIGKDDInternational Conference on Knowledge Discovery and Data Mining ACM New York NY USAKDD rsquo09 pp 707ndash716 DOI 10114515570191557098


Raskutti B Kowalczyk A (2004) Extreme re-balancing for SVMs A case study SIGKDD Explor Newsl6(1)60ndash69 DOI 10114510077301007739

Rosvall M Bergstrom CT (2008) Maps of random walks on complex networks reveal community structureProceedings of the National Academy of Sciences 105(4)1118ndash1123 DOI 101073pnas0706851105

Schapire RE (1999) A brief introduction to boosting In Proceedings of the 16th International Joint Con-ference on Artificial Intelligence - Volume 2 Morgan Kaufmann Publishers Inc San Francisco CAUSA IJCAIrsquo99 pp 1401ndash1406

Schapire RE Singer Y (1999) Improved boosting algorithms using confidence-rated predictions Machinelearning 37(3)297ndash336 DOI 101023A1007614523901

Shmueli G (2017) Analyzing behavioral big data Methodological practical ethical and moral issuesQuality Engineering 29(1)57ndash74 DOI 1010800898211220161210979

Sobhani P Viktor H Matwin S (2015) Learning from imbalanced data using ensemble methods andcluster-based undersampling In New Frontiers in Mining Complex Patterns Third InternationalWorkshop NFMCP 2014 Held in Conjunction with ECML-PKDD 2014 Nancy France Septem-ber 19 2014 Revised Selected Papers Springer International Publishing Cham pp 69ndash83 DOI101007978-3-319-17876-9 5

Stankova M (2016) Classification within network data with a bipartite structure Dissertation Universityof Antwerp

Stankova M Martens D Provost F (2015) Classification over bipartite graphs through projection WorkingPapers 2015001 University of Antwerp Faculty of Applied Economics

Sun Y Kamel MS Wong AK Wang Y (2007) Cost-sensitive boosting for classification of imbalanceddata Pattern Recognition 40(12)3358 ndash 3378 DOI httpsdoiorg101016jpatcog200704009

Suykens JA Van Gestel T De Brabanter J De Moor B Vandewalle J Suykens J Van Gestel T (2002)Least squares support vector machines World Scientific Singapore

Tang B He H (2015) Enn Extended nearest neighbor method for pattern recognition [research frontier]IEEE Computational Intelligence Magazine 10(3)52ndash60 DOI 101109MCI20152437512

Tang Y Zhang YQ Chawla NV Krasser S (2009) SVMs modeling for highly imbalanced classificationIEEE Transactions on Systems Man and Cybernetics Part B (Cybernetics) 39(1)281ndash288 DOI101109TSMCB20082002909

Tobback E Moeyersoms J Stankova M Martens D (2016) Bankruptcy prediction for SMEs using rela-tional data Working Paper 2016004 University of Antwerp Faculty of Applied Economics

Verbeke W Dejaeger K Martens D Hur J Baesens B (2012) New insights into churn prediction in thetelecommunication sector A profit driven data mining approach European Journal of OperationalResearch 218(1)211 ndash 229 DOI httpsdoiorg101016jejor201109031

Veropoulos K Campbell I Cristianini N (1999) Controlling the sensitivity of support vector machinesIn Proceedings of the International Joint Conference on Artificial Intelligence Stockholm Sweden(IJCAI99) pp 55 ndash 60

Whitrow C Hand DJ Juszczak P Weston D Adams NM (2009) Transaction aggregation as a strategyfor credit card fraud detection Data Mining and Knowledge Discovery 18(1)30ndash55 DOI 101007s10618-008-0116-z

Wickramaratna J Holden SB Buxton BF (2001) Performance degradation in boosting In Proceedings ofthe Second International Workshop on Multiple Classifier Systems Springer London UK MCS rsquo01pp 11ndash21

Yen SJ Lee YS (2009) Cluster-based under-sampling approaches for imbalanced data distributions ExpertSystems with Applications 36(3 Part 1)5718 ndash 5727 DOI httpsdoiorg101016jeswa200806108

Yu HF Lo HY Hsieh HP Lou JK McKenzie TG Chou JW Chung PH Ho CH Chang CF Wei YH et al(2010) Feature engineering and classifier ensemble for kdd cup 2010 In Proceedings of the KDDCup 2010 Workshop pp 1ndash16

Zha H He X Ding C Simon H Gu M (2001) Bipartite graph partitioning and data clustering In Pro-ceedings of the Tenth International Conference on Information and Knowledge Management ACMNew York NY USA CIKM rsquo01 pp 25ndash32 DOI 101145502585502591

Zhang J Mani I (2003) Knn approach to unbalanced data distributions A case study involving informationextraction In Proceedings of the ICMLrsquo2003 Workshop on Learning from Imbalanced DatasetsWashington DC

Ziegler CN McNee SM Konstan JA Lausen G (2005) Improving recommendation lists through topicdiversification In Proceedings of the 14th International Conference on World Wide Web ACM NewYork NY USA WWW rsquo05 pp 22ndash32 DOI 10114510607451060754



NB (Ng and Jordan 2002) relies on the use of the Bayes rule, which expresses the posterior P(Y|X) as the product of the likelihood P(X|Y) and the prior P(Y), divided by the evidence P(X). The naivety of the approach originates from assuming conditional independence of the features. We impose a multivariate event model regarding the distribution P(X|Y). Junque de Fortuny et al (2014a) provide an efficient NB implementation for dealing with large and sparse binary behaviour data.
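The multivariate Bernoulli event model can be sketched in a few lines. This is a minimal illustration on a sparse binary matrix, not the optimized implementation of Junque de Fortuny et al (2014a); the function names are ours.

```python
import numpy as np
from scipy import sparse

def bernoulli_nb_fit(X, y, alpha=1.0):
    """Multivariate Bernoulli NB on a sparse binary matrix X (n x d)."""
    classes = np.unique(y)
    priors, theta = [], []
    for c in classes:
        Xc = X[y == c]
        # Laplace-smoothed probability that feature j is active in class c
        theta.append((np.asarray(Xc.sum(axis=0)).ravel() + alpha) /
                     (Xc.shape[0] + 2 * alpha))
        priors.append(Xc.shape[0] / X.shape[0])
    return classes, np.array(priors), np.array(theta)

def bernoulli_nb_scores(X, classes, priors, theta):
    """Log-posterior (up to the constant evidence term) for every class."""
    log_t, log_1mt = np.log(theta), np.log1p(-theta)
    # sum_j [x_j log t_j + (1 - x_j) log(1 - t_j)], one sparse product overall
    scores = X @ (log_t - log_1mt).T + log_1mt.sum(axis=1) + np.log(priors)
    return np.asarray(scores)
```

Because scoring reduces to a single sparse matrix product, the cost scales with the number of non-zero entries rather than with the full dimensionality, which is what makes NB attractive for behaviour data.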

The BeSim approach conceptually corresponds to projecting the bigraph to a unigraph using the procedure outlined in Section 3.2. Next, the weighted-vote relational neighbour classifier (Macskassy and Provost 2007) is used to infer unknown labels/scores through a weighted probability estimation using the known labels of neighbouring nodes. A very fast and scalable implementation of this approach (the SW-transformation) is presented in detail in Stankova et al (2015).
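As an illustration, a minimal weighted-vote relational neighbour scorer over the bipartite adjacency matrix might look as follows. We use the raw co-occurrence count as link weight, which is a simplification: Stankova et al (2015) discuss several weight functions for the SW-transformation.

```python
import numpy as np
from scipy import sparse

def wvrn_scores(A, labels):
    """Weighted-vote relational neighbour over a bipartite projection.

    A      : sparse (n_nodes x n_items) binary adjacency matrix.
    labels : 1 (minority), 0 (majority) or -1 (unknown) per node.
    Link weight = number of shared items (simplified SW weight)."""
    known = labels >= 0
    W = (A @ A[known].T).toarray().astype(float)   # shared-item counts
    votes = W @ labels[known]                      # weighted positive votes
    total = W.sum(axis=1)
    with np.errstate(invalid="ignore", divide="ignore"):
        s = np.where(total > 0, votes / total, 0.0)
    # a node with no labeled neighbours gets score 0 here; wvRN would
    # typically back off to the class prior instead
    return s
```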

In Table 10 we compare the performances of the baseline (BL) application of the methods with their EE-version (with S = 15, T = 20). It is quite striking to see the EE-version dominating the baseline version when looking at each method individually. This suggests that the imbalanced learning issue has an impact on each of the aforementioned techniques. Focussing on the regularization-based approaches, we observe the baseline LR to be less influenced by the imbalance than the SVM method. Their EE-versions (EE SVM and EE LR) can boost the baseline performance in almost all cases and have the highest average ranks amongst all the proposed methods. As already emphasized, the regularization-based approaches offer an added element of flexibility in the sense that the strength of the underlying base learner can be controlled.

The heuristic approaches NB and BeSim do not offer this flexibility. Furthermore, it is not directly possible to integrate instance weights during the training phase of the base learner. During the boosting process, we therefore sample from the distribution Dt to generate (unweighted) examples and subsequently train a BeSim/NB learner. The results in Table 10 indicate BL NB to be the worst performing algorithm amongst all methods tried. Due to the conditional independence assumption, one could regard this as a weak learner, which makes it suitable for use in a boosting process. Indeed, the EE NB approach is able to significantly improve upon its performance. We found this to be the slowest EE-version28 (e.g. taking more than 24 hours to process a single fold of the Bank dataset, versus several minutes for the other versions). The average rank estimate of EE NB is rather pessimistic, since we assigned the lowest performance across all methods in case of missing results. BeSim can be considered a stronger learner than NB; its possible boost in performance is typically lower than that of NB. Having said this, we found EE BeSim to improve upon BL BeSim for all datasets (except Adver(p = 1)).
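The weight-free workaround described here amounts to weighted resampling. A sketch, with `Dt` the boosting distribution over the training instances:

```python
import numpy as np

def resample_from_dt(X, y, Dt, rng=None):
    """Draw an unweighted training set according to the boosting
    distribution Dt, so that base learners without instance-weight
    support (here NB / BeSim) can still be used inside boosting."""
    rng = np.random.default_rng(rng)
    idx = rng.choice(len(y), size=len(y), replace=True, p=Dt)
    return X[idx], y[idx]
```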

28 This statement only relates to the EE-version. In the baseline case, the timings of the NB-version are roughly of the same order of magnitude as SVM and LR (though this heavily depends on the chosen regularization parameter C). With EE, each subset has a size limited to twice the minority class training size (a relatively low number compared to the size of the majority class). This works in favour of SVM/LR (e.g. SVM has a computational complexity of O(m^2), with m the number of training instances). Also note that the NB-version is indeed optimized for sparse data. Internally, the NB-implementation writes a fairly large model file to a specified directory (sizes of 1 GB can be encountered there). This writing operation slows down the training stage (especially in case of EE, where S × T = 15 × 20 = 300 model files are constructed).

40 Jellis Vanhoeyveld David Martens

Table 10 Average tenfold AUC test set performance of the BL compared to its EE-version (S = 15, T = 20) for SVM, LR, BeSim and NB. In case the EE NB implementation crashed, we put a × in the table. If it is too slow (> 24 hours per fold), we place a []. Additionally, average ranks are assigned to each method. In case EE NB has missing results for a particular dataset, we ensure it has the lowest rank across all other algorithms.

           Mov G(p = 1)   Mov G(p = 25)  Mov Th(p = []) Yahoo A(p = 1)

BL SVM     716 (262)      8141 (132)     7977 (533)     5649 (337)
EE SVM     7612 (288)     8513 (186)     8643 (586)     5974 (296)
BL LR      7102 (209)     8439 (184)     8314 (417)     5784 (239)
EE LR      7669 (292)     8503 (198)     863 (537)      5979 (262)
BL BeSim   761 (358)      813 (292)      8281 (66)      5627 (273)
EE BeSim   7631 (371)     8137 (29)      8502 (628)     577 (171)
BL NB      7026 (584)     7701 (254)     7048 (1014)    5256 (209)
EE NB      7593 (283)     8556 (201)     8691 (415)     5755 (273)

           Yahoo A(p = 25) Yahoo G(p = 1) Yahoo G(p = 25) TaFeng(p = 1)

BL SVM     6161 (248)     6684 (366)     7882 (139)     5575 (16)
EE SVM     6638 (316)     7348 (232)     8055 (155)     6113 (183)
BL LR      6627 (296)     6982 (193)     8045 (159)     5891 (231)
EE LR      6622 (328)     7308 (214)     8053 (156)     6143 (232)
BL BeSim   6454 (202)     6889 (249)     7955 (196)     5789 (118)
EE BeSim   6525 (223)     7118 (291)     8004 (185)     5936 (147)
BL NB      65 (165)       6333 (256)     7889 (164)     5461 (12)
EE NB      666 (279)      7099 (288)     8101 (13)      5901 (184)

           TaFeng(p = 25) Book(p = 1)    Book(p = 25)   LST(p = 1)

BL SVM     6694 (134)     526 (129)      6008 (071)     9999 (001)
EE SVM     704 (13)       5534 (128)     654 (051)      9998 (001)
BL LR      6924 (13)      5534 (127)     6384 (075)     9999 (001)
EE LR      7028 (128)     5549 (149)     6541 (063)     9997 (002)
BL BeSim   6749 (123)     5519 (127)     637 (063)      9999 (001)
EE BeSim   68 (121)       5521 (115)     6438 (042)     9999 (0)
BL NB      6521 (164)     5293 (09)      5975 (047)     9869 (03)
EE NB      7072 (115)     ×              6346 (061)     9992 (004)

           Adver(p = [])  Adver(p = 1)   CRF(p = [])    Bank(p = [])

BL SVM     9637 (194)     9118 (297)     6436 (1897)    6682 (088)
EE SVM     9763 (135)     933 (214)      8635 (999)     7154 (076)
BL LR      9719 (144)     8851 (193)     8187 (1963)    7143 (072)
EE LR      9757 (096)     9302 (206)     8684 (962)     7177 (062)
BL BeSim   9726 (112)     9538 (135)     8691 (936)     6785 (067)
EE BeSim   9738 (104)     9383 (135)     8702 (1043)    7041 (055)
BL NB      9375 (19)      9337 (19)      8724 (938)     6783 (063)
EE NB      9404 (175)     ×              ×              []

           Flickr(p = 01) Kdd(p = 05)    Average Rank

BL SVM     7492 (017)     7453 (005)     644 [7]
EE SVM     7986 (013)     8098 (005)     239 [1]
BL LR      7903 (011)     8129 (004)     428 [4]
EE LR      7985 (013)     8075 (005)     261 [2]
BL BeSim   7462 (013)     7495 (0)       511 [6]
EE BeSim   764 (013)      7755 (003)     361 [3]
BL NB      8136 (01)      7429 (005)     65 [8]
EE NB      []             []             506 [5]


6 Conclusions and future research directions

We investigated the effects of sampling and cost-sensitive learning techniques on the problem of learning from imbalanced behaviour data. This type of big data has fundamentally different properties and internal structure from traditional data and is characterized by sparseness and very large dimensions. This is a new research topic which enables benefits across a wide variety of application domains, such as fraud detection, targeted advertising, predictive policing, churn prediction and default prediction. To enable future developments in this field, we provide an on-line repository containing implementations, datasets and results, available at the following link: http://www.applieddatamining.com/cms/?q=software

Oversampling techniques need to be adapted in order to cope with this kind of data. We proposed new versions of the traditional SMOTE and ADASYN techniques, which differ in the way synthetic samples are generated and similarity computations are performed. The plain random oversampling (OSR) and the synthetic approaches are competitive methods in terms of AUC and do not seem to suffer from overfitting, as traditional studies report. Though the synthetic approaches have comparable AUC-performances in comparison to OSR, they are useless in practice due to their computationally expensive procedures. The OSR method is very fast for small and medium sized datasets, yet becomes inappropriate for larger datasets because the training times of the underlying base learner increase drastically.

With respect to undersampling methodologies, we employed a random undersampling technique (RUS) and Knn-based informed approaches. The latter type selects majority class instances that are closest (Cl Knn) or farthest (Far Knn) from the minority class examples and allows one to investigate the effect of noise/outliers on SVM performance. We observed that these odd instances have a severe performance degrading effect. RUS and Far Knn approaches can significantly improve performance with respect to the baseline, thanks to their ability to reduce noise/outlier levels and to put more emphasis on the minority class. Having said this, they usually perform worse than any of the oversampling approaches. This is in sheer contrast to the results obtained from traditional studies and is heavily related to the intrinsic nature of behaviour data. Timing-wise, RUS is the fastest method. The informed approaches are slower than RUS due to the required nearest neighbour computations. Additionally, a cluster-based undersampling (CBU) technique was proposed to alleviate the within-class imbalance. For low undersampling rates, the performance of CBU is superior to plain RUS, whereas for high undersampling rates the opposite is true. Overall, no substantial gain in performance is obtained in comparison with a simple RUS approach. Furthermore, the method is the slowest among all techniques considered and relies on computationally expensive clustering heuristics.
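The Far Knn idea can be sketched as follows, using cosine similarity (a natural choice for sparse binary behaviour data). The function name and the parameter `beta` (fraction of majority instances retained) are ours; Cl Knn would keep the closest instances instead.

```python
import numpy as np
from scipy import sparse

def far_knn_undersample(X, y, beta, k=5):
    """Keep the fraction beta of majority instances (y == 0) lying
    farthest from the minority class, measured by the mean cosine
    similarity to their k most similar minority neighbours."""
    maj = np.where(y == 0)[0]
    mino = np.where(y == 1)[0]
    # L2-normalize rows so that dot products become cosine similarities
    norms = np.asarray(np.sqrt(X.multiply(X).sum(axis=1))).ravel()
    Xn = sparse.diags(1.0 / np.maximum(norms, 1e-12)) @ X
    sims = np.asarray((Xn[maj] @ Xn[mino].T).todense())
    k = min(k, len(mino))
    closeness = np.sort(sims, axis=1)[:, -k:].mean(axis=1)
    keep_n = max(1, int(round(beta * len(maj))))
    far = maj[np.argsort(closeness)[:keep_n]]   # smallest closeness = farthest
    keep = np.concatenate([far, mino])
    return X[keep], y[keep]
```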

We studied the effects of boosting, where we made use of a very specific linear SVM and logistic regression combination that uses confidence-rated predictions instead of a more popular plain "discrete" boosting algorithm (with weak hypotheses taking values in {-1, 1}). Our experiments clearly indicated the regularization constant C in the SVM formulation to act as a "weakness" indicator. Indeed, higher C-values produce stronger learners and should not be used in the boosting process, which matches the analysis of studies dealing with traditional data. Strong learners can

42 Jellis Vanhoeyveld David Martens

already quite accurately distinguish majority class from minority class behavioursThe boosting process will focus on the hard to learn instances among which arenoiseoutliers and this can result in overfitting behaviour We also observed that boost-ing can usually outperform the plain baseline indicating that a combination of weaklearners can achieve higher AUC-performance than a single strong learner
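The interplay between regularization strength and boosting can be sketched as follows. This is our own illustration: a weighted L2-regularised logistic scorer stands in for the paper's SVM+LR weak learner, with λ playing the role of 1/C (large λ means a weak learner), and the weight update follows the confidence-rated scheme of Schapire and Singer.

```python
import numpy as np

def fit_weak_linear(X, y, w, lam, steps=200, lr=0.1):
    """Weighted L2-regularised logistic scorer; a large `lam` (~ 1/C)
    shrinks the coefficients and keeps the learner weak."""
    beta = np.zeros(X.shape[1])
    for _ in range(steps):
        p = 1.0 / (1.0 + np.exp(-X @ beta))
        grad = X.T @ (w * (p - (y == 1))) + lam * beta
        beta -= lr * grad
    return beta

def confidence_rated_boost(X, y, T=10, lam=5.0):
    """Each round fits a real-valued scorer h_t(x) = x . beta_t and reweights
    instances by exp(-y_i * h_t(x_i)), emphasising the hard-to-learn ones."""
    n = len(y)
    w = np.full(n, 1.0 / n)
    ys = np.where(y == 1, 1.0, -1.0)
    betas = []
    for _ in range(T):
        beta = fit_weak_linear(X, y, w, lam)
        h = np.clip(X @ beta, -2.0, 2.0)   # clipped confidence-rated output
        w *= np.exp(-ys * h)
        w /= w.sum()
        betas.append(beta)
    return betas

def ensemble_score(X, betas):
    return sum(X @ b for b in betas)
```

Lowering `lam` (i.e., raising C) yields individually stronger scorers whose reweighting concentrates quickly on a few residual instances — the overfitting mechanism described above.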

EasyEnsemble (EE) integrates random undersampling with a boosting process. By sampling several balanced subsets from the training data and combining these, we can emphasize the minority class, achieve noise/outlier reduction effects and simultaneously explore the majority class space efficiently. Our version differs from the original formulation because we employ a different type of underlying weak learner (SVM and LR combination) that is fed to a confidence-rated boosting algorithm (instead of the discrete version). We observed superior AUC-performance with respect to all of the aforementioned techniques. Furthermore, the method is very fast since each subset can be trained in parallel and has a size that is twice as large as the minority class training set. We studied the effect of the number of subsets and noted that a limited amount of these (O(10)) is already sufficient to obtain satisfactory performances; increasing this number further causes only minor benefits.
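The subset-sampling core of EE can be sketched as follows. To keep the example short, a plain unregularised logistic regression replaces the boosted SVM+LR learner used in the paper; the sampling and score-averaging structure is the point here.

```python
import numpy as np

def fit_logit(X, y, steps=300, lr=0.3):
    """Tiny gradient-descent logistic regression (stand-in base learner)."""
    beta = np.zeros(X.shape[1])
    for _ in range(steps):
        p = 1.0 / (1.0 + np.exp(-X @ beta))
        beta -= lr * X.T @ (p - y) / len(y)
    return beta

def easy_ensemble(X, y, S=10, seed=0):
    """Sample S balanced subsets (all minority + equally many random majority
    instances), train a base learner on each, and keep the fitted models."""
    rng = np.random.default_rng(seed)
    mino, maj = np.where(y == 1)[0], np.where(y == 0)[0]
    betas = []
    for _ in range(S):                       # each subset could be trained in parallel
        sub = rng.choice(maj, size=len(mino), replace=False)
        idx = np.concatenate([mino, sub])
        betas.append(fit_logit(X[idx], y[idx]))
    return betas

def score(X, betas):
    """Average the S scores: every model sees all minority data,
    while together they cover much of the majority class space."""
    return np.mean([X @ b for b in betas], axis=0)
```

Each subset has size 2 × (minority count), which is why training stays cheap even on large, highly imbalanced data.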

Additionally, a statistical comparison of each of the proposed methods was conducted across a wide variety of datasets. Holm's significance test indicates that each of the proposed techniques (oversampling, undersampling except Cl Knn, cost-sensitive learning and boosting variants) can significantly outperform the plain baseline performance (each of the null-hypotheses of equal performance gets rejected with a confidence level of at least 91%). Comparison of the techniques with EE showed that the latter can significantly outperform all of them (each of the null-hypotheses of equal performance gets rejected with a confidence level of at least 75%). A timing-wise comparison showed EE to be the preferred method (after RUS) for large and highly imbalanced datasets. A parallel implementation would make EE the fastest method even for medium-sized datasets that show a high level of imbalance.
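Holm's step-down procedure itself is simple to state in code (a generic sketch of the correction, not tied to the rank statistics used in the paper):

```python
def holm(pvals, alpha=0.05):
    """Holm's step-down procedure: sort p-values ascending and compare the
    i-th smallest against alpha / (m - i); stop rejecting at the first failure."""
    m = len(pvals)
    order = sorted(range(m), key=lambda i: pvals[i])
    reject = [False] * m
    for rank, i in enumerate(order):
        if pvals[i] <= alpha / (m - rank):
            reject[i] = True
        else:
            break            # all larger p-values are retained as well
    return reject
```

Because the threshold tightens only as hypotheses are rejected, Holm's procedure controls the family-wise error rate while being uniformly more powerful than a plain Bonferroni correction.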

The influence of the chosen base learner was investigated by comparing the baseline to its EE-version for the SVM, LR, NB and BeSim inducers. Each of the EE-versions is able to improve upon the baseline performance for nearly all datasets, where base learners of the regularization type are superior because of their abilities to control the strength of the learner.

Though we conducted an extensive benchmark study, there are still plenty of future research directions. First and most important, there is an urgent need to develop a more general theoretical framework to completely understand the effect of imbalance and why certain methods outperform others. Secondly, other types of mechanisms could be investigated in this setting: advanced sampling (Guo et al 2008) set-ups perform sampling based on preliminary classifications; one-class learning is a recognition-based methodology where a discriminative boundary is learned around the training examples of a single class alone, and Raskutti and Kowalczyk (2004) demonstrated the superiority of the recognition-based one-class SVMs over the discriminative two-class SVMs; other types of ensemble learning strategies (e.g. bagging and stacking) could be investigated; Batista et al (2004) note that combining over- and undersampling can be beneficial when data are highly imbalanced or show very few minority class instances; the ideas behind Extended Nearest Neighbour (Tang and He 2015) could guide us in the sampling process. As opposed to Knn, "it considers not only who are the nearest neighbours of a certain sample, but also who consider the particular sample as their nearest neighbour (Tang and He 2015)". The method assigns a class membership to an unknown sample that maximizes the intra-class coherence. Finally, more advanced techniques do exist that can provide a K (the number of nearest neighbours) faster or with a (slightly) better performance.

Imbalanced classification in sparse and large behaviour datasets 43

Another issue relates to the lack of comprehensibility of the EasyEnsemble technique in its present form. Because we currently integrate Logistic Regression in the boosting process, the final ensemble classifier is non-linear and therefore incomprehensible. It might be interesting to investigate the numerical boosting technique proposed in Schapire and Singer (1999), where each weak learner is allowed to output real values (instead of values in [−1, 1]). In that case, we would be able to use a plain linear SVM as a weak learner without the Logistic Regression component. The final ensemble would be a linear model, which is surely easier to comprehend. Also, the instance-based approach of Martens and Provost (2014) could be used to explain individual predictions of the non-linear EE-technique.
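The comprehensibility argument rests on the fact that a weighted sum of linear models is again linear: the boosted ensemble sum_t a_t (w_t . x + b_t) collapses into a single coefficient vector and intercept. A small sketch (the helper name is ours, purely illustrative):

```python
import numpy as np

def collapse_linear_ensemble(weights, coef_list, intercepts):
    """Collapse a weighted ensemble of linear scorers a_t * (w_t . x + b_t)
    into one equivalent linear model (W, b), so the ensemble of plain linear
    SVM weak learners would remain directly interpretable."""
    a = np.array(weights, dtype=float)
    W = np.array(coef_list, dtype=float)     # shape (T, d): one row per weak learner
    return a @ W, float(a @ np.array(intercepts, dtype=float))
```

The collapsed coefficients can then be inspected feature by feature, exactly as for a single linear SVM.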

A Oversampling experiments

Table 11 Oversampling experiments. Results show the average tenfold AUC test set performance with respect to increasing β-values (β = [β1, β2, β3, β4] = [0, 1/3, 2/3, 1]). Additionally, standard deviations are shown between brackets. Optimal performances for each method are highlighted in boldface. See Section 4.2 for parameter settings.

Mov G(p = 1)
         β1             β2             β3             β4
OSR      71.6 (2.62)    74.37 (2.04)   73.6 (1.84)    74.73 (2.45)
SMOTE    71.6 (2.62)    75.08 (2.18)   76.02 (2.14)   76.48 (2.3)
ADASYN   71.6 (2.62)    75.16 (1.92)   75.93 (2.08)   76.47 (2.29)

Mov G(p = 25)
OSR      81.41 (1.32)   83.49 (1.81)   83.84 (1.96)   83.91 (2.04)
SMOTE    81.41 (1.32)   83.32 (1.97)   83.59 (2.04)   83.76 (2.11)
ADASYN   81.41 (1.32)   83.61 (1.82)   84.02 (1.97)   83.69 (1.96)

Mov Th(p = [])
OSR      79.77 (5.33)   85.3 (4.66)    83.16 (4.5)    84.59 (5.69)
SMOTE    79.77 (5.33)   84.18 (6.51)   85.58 (5.97)   84.33 (5.75)
ADASYN   79.77 (5.33)   84.11 (6.77)   85.86 (6.06)   85.36 (5.13)

Yahoo A(p = 1)
OSR      55.92 (2.97)   58.66 (3.27)   59.99 (2.28)   59.74 (1.78)
SMOTE    55.92 (2.97)   59.76 (2.62)   59.74 (2.67)   59.43 (2.4)
ADASYN   55.92 (2.97)   59.54 (2.53)   59.55 (2.94)   59.56 (2.22)

Yahoo A(p = 25)
OSR      61.68 (2.42)   64.19 (3.17)   65.08 (3.26)   64.67 (2.1)
SMOTE    61.68 (2.42)   65.46 (3.63)   65.33 (3.23)   64.52 (2.98)
ADASYN   61.68 (2.42)   65.04 (3.74)   65.41 (3.47)   64.4 (2.21)

Yahoo G(p = 1)
OSR      66.84 (3.66)   72.18 (2.36)   73.11 (2.7)    72.49 (3.41)
SMOTE    66.84 (3.66)   72.65 (2.85)   73.27 (3.36)   73.37 (3.56)
ADASYN   66.84 (3.66)   72.87 (2.83)   73.18 (3.2)    73.39 (3.59)

Yahoo G(p = 25)
OSR      78.82 (1.39)   78.78 (2.02)   78.74 (1.86)   78.35 (1.47)
SMOTE    78.82 (1.39)   79.23 (1.57)   79.1 (1.2)     79.03 (1.89)
ADASYN   78.82 (1.39)   79.12 (1.43)   79.23 (1.37)   79.51 (2.01)

TaFeng(p = 1)
OSR      55.75 (1.6)    59.23 (1.96)   60.0 (1.68)    61.04 (2.36)
SMOTE    55.75 (1.6)    60.26 (1.95)   61.49 (1.8)    61.13 (1.52)
ADASYN   55.75 (1.6)    60.26 (1.9)    61.44 (1.85)   61.16 (1.5)

TaFeng(p = 25)
OSR      66.94 (1.34)   68.84 (1.21)   66.99 (1.42)   67.7 (1.41)
SMOTE    66.94 (1.34)   68.47 (1.5)    67.07 (1.15)   66.65 (0.81)
ADASYN   66.94 (1.34)   68.62 (1.38)   67.85 (1.6)    66.91 (1.39)

Book(p = 1)
OSR      52.6 (1.29)    53.61 (0.94)   55.41 (1.75)   55.87 (1.44)
SMOTE    52.6 (1.29)    54.77 (0.99)   54.91 (0.8)    54.36 (0.98)
ADASYN   52.6 (1.29)    54.86 (1.13)   55.06 (0.73)   54.54 (0.92)

Book(p = 25)
OSR      60.08 (0.71)   61.05 (0.94)   61.82 (1.12)   64.62 (0.57)
SMOTE    60.08 (0.71)   62.6 (0.73)    60.95 (0.68)   63.0 (0.8)
ADASYN   60.08 (0.71)   62.33 (0.73)   60.77 (0.85)   63.04 (0.58)

LST(p = 1)
OSR      99.99 (0.01)   99.99 (0.01)   99.99 (0.01)   99.99 (0.01)
SMOTE    99.99 (0.01)   99.99 (0.01)   99.99 (0.01)   99.99 (0.01)
ADASYN   99.99 (0.01)   99.99 (0.01)   99.99 (0.01)   99.99 (0.01)

Adver(p = [])
OSR      96.61 (1.82)   97.31 (1.65)   97.07 (1.84)   97.07 (1.79)
SMOTE    96.61 (1.82)   96.91 (1.66)   97.19 (1.65)   97.07 (1.91)
ADASYN   96.61 (1.82)   97.1 (1.7)     97.08 (1.87)   97.07 (1.88)

Adver(p = 1)
OSR      90.93 (3.02)   91.27 (3.03)   92.66 (2.82)   93.29 (1.97)
SMOTE    90.93 (3.02)   92.51 (2.03)   92.96 (2.14)   93.53 (1.81)
ADASYN   90.93 (3.02)   92.22 (2.33)   92.7 (2.36)    93.88 (1.73)

CRF(p = [])
OSR      64.06 (16.43)  80.82 (12.94)  81.28 (12.27)  81.91 (11.28)
SMOTE    64.06 (16.43)  78.64 (16.86)  82.52 (13.74)  79.32 (16.26)
ADASYN   64.06 (16.43)  78.95 (16.72)  81.19 (16.32)  79.31 (16.19)

Bank(p = [])
OSR      66.82 (0.88)   70.1 (0.74)    71.39 (0.8)    71.47 (0.8)
SMOTE    []             []             []             []
ADASYN   []             []             []             []


B Undersampling experiments

Table 12 Undersampling experiments. Results show the average tenfold AUC test set performance with respect to increasing βu-values (βu = [βu1, βu2, βu3, βu4, βu5] = [0, 1/4, 1/2, 3/4, 1]). Optimal performances (boldface) are highlighted for each imbalance ratio. Additionally, standard deviations are shown between brackets. Cl K represents the "Closest Knn" technique and Cl T the "Closest tot sim" technique (similarly for Far K and Far T; see Section 3.2). Parameter settings are given in Section 4.2.

Mov G(p = 1)
        βu1           βu2           βu3           βu4           βu5
RUS     71.6 (2.6)    71.83 (2.6)   72.54 (2.5)   72.39 (3.1)   70.61 (3.5)
Cl K    71.6 (2.6)    71.4 (2)      70.96 (1.9)   70.43 (2.4)   69.05 (4.1)
Cl T    71.6 (2.6)    70.28 (2.5)   66.74 (2)     66.8 (2.1)    68.18 (3.6)
Far K   71.6 (2.6)    72.36 (2.7)   71.26 (3.4)   66.57 (5.2)   53.5 (3.5)
Far T   71.6 (2.6)    72.22 (2.8)   71.63 (3.6)   64.28 (5.3)   50.88 (4.4)
CBU     72.55 (2.6)   73.28 (2.6)   73.12 (2.6)   73.84 (2.5)   73 (3.1)

Mov G(p = 25)
RUS     81.41 (1.3)   81.36 (1.3)   81.78 (1.7)   82.05 (1.7)   81.6 (2.1)
Cl K    81.41 (1.3)   80.86 (1.2)   80.95 (1.6)   79.73 (2.3)   77.95 (2.3)
Cl T    81.41 (1.3)   79.9 (1.2)    78.21 (1.4)   77.87 (1.5)   77.76 (2.3)
Far K   81.41 (1.3)   80.9 (1.5)    78.17 (1.8)   74.25 (2.4)   69.79 (3.2)
Far T   81.41 (1.3)   80.86 (1.5)   77.2 (2.4)    71.16 (2.7)   62.4 (2.8)
CBU     81.53 (1.4)   81.64 (1.3)   81.29 (1.6)   81.28 (2.1)   80.34 (2.7)

Mov Th(p = [])
RUS     79.77 (5.3)   80.32 (5.8)   81.57 (5.5)   81.86 (6.6)   81.26 (6.2)
Cl K    79.77 (5.3)   79.25 (4.5)   78.07 (5)     76.25 (6.5)   62.46 (8.5)
Cl T    79.77 (5.3)   78.4 (4.4)    72.41 (3.5)   64.66 (4.5)   60.37 (7.3)
Far K   79.77 (5.3)   84.54 (5)     83.64 (6.4)   80.02 (7.3)   56.82 (10.3)
Far T   79.77 (5.3)   85.03 (5.7)   82.68 (6.8)   75.61 (9.2)   56.77 (10.9)
CBU     80.11 (5.8)   81.17 (6)     81.08 (6.5)   84.17 (5.1)   80.96 (6.9)

Yahoo A(p = 1)
RUS     55.92 (3)     55.57 (3.4)   56.44 (3)     55.83 (3.4)   56.37 (3.3)
Cl K    55.92 (3)     55.67 (2.4)   53.12 (2)     50.57 (1.8)   53.79 (3.5)
Cl T    55.92 (3)     55.69 (2.1)   53.35 (2.2)   50.31 (2.2)   52.35 (3.3)
Far K   55.92 (3)     57.35 (2.2)   56.92 (1.1)   56.95 (2.3)   51.18 (2)
Far T   55.92 (3)     56.93 (2.4)   54.74 (1.9)   57.01 (1.8)   51.18 (2)
CBU     58.21 (2.6)   58.45 (3.3)   58.31 (3.5)   58.39 (3.5)   56.09 (2.6)

Yahoo A(p = 25)
RUS     61.68 (2.4)   62.9 (2.9)    63.62 (3.6)   63.75 (3.1)   63.19 (1.9)
Cl K    61.68 (2.4)   61.14 (2.1)   57.62 (1.6)   54.02 (1.8)   51.48 (1.4)
Cl T    61.68 (2.4)   60.89 (2.8)   58.11 (1.4)   54.4 (2.1)    51.76 (1.4)
Far K   61.68 (2.4)   63.96 (3)     62.62 (2.2)   59.61 (1.5)   56.25 (1.6)
Far T   61.68 (2.4)   63.71 (2.4)   59.72 (1.6)   57.27 (1.1)   54.47 (1.1)
CBU     62.46 (2.6)   61.85 (1.4)   61.78 (2.2)   59.94 (3)     60.1 (4)

Yahoo G(p = 1)
RUS     66.84 (3.7)   67.85 (3.2)   68.36 (3.2)   68.23 (4)     69.9 (4.2)
Cl K    66.84 (3.7)   66.71 (2.8)   64.3 (3.6)    61.98 (3.9)   61.15 (1.9)
Cl T    66.84 (3.7)   65.79 (2.7)   63.55 (3.3)   59.21 (3.5)   61.08 (2.4)
Far K   66.84 (3.7)   66.76 (4.1)   63.84 (3.4)   65.16 (2)     48.5 (2.9)
Far T   66.84 (3.7)   66.95 (4.1)   63.48 (2.9)   65.16 (2)     48.48 (2.9)
CBU     69.68 (4.1)   70.59 (3.2)   70.64 (3.7)   70.2 (2.9)    63.35 (3.6)

Yahoo G(p = 25)
RUS     78.82 (1.4)   78.91 (1.6)   78.97 (1.6)   78.61 (1.6)   77.82 (2.1)
Cl K    78.82 (1.4)   77.26 (1.5)   72.52 (1.5)   67.86 (2)     65.07 (2.7)
Cl T    78.82 (1.4)   76.83 (1)     71.99 (1.8)   67.15 (2.3)   61.1 (2.7)
Far K   78.82 (1.4)   78.26 (2.2)   74.69 (2.7)   67.22 (2.1)   60.72 (2.3)
Far T   78.82 (1.4)   77.68 (2.6)   72.44 (3)     64.94 (2.4)   59.6 (2)
CBU     75.25 (3.2)   75.22 (2.4)   74.69 (2.3)   73.07 (2.4)   70.69 (2.4)

TaFeng(p = 1)
RUS     55.75 (1.6)   56.1 (1.6)    56.26 (1.7)   57.23 (1.7)   59.25 (2.2)
Cl K    55.75 (1.6)   55.68 (1.6)   55.58 (1.5)   55.08 (1.1)   51.05 (1.5)
Cl T    55.75 (1.6)   55.67 (1.6)   54.47 (1.6)   47.53 (1.6)   49.3 (1.1)
Far K   55.75 (1.6)   58.99 (1.2)   59.47 (1.1)   60.04 (1.2)   56.31 (1)
Far T   55.75 (1.6)   58.92 (1.3)   59.25 (1.3)   58.58 (1.1)   56.31 (1)
CBU     57.8 (1)      58.47 (1.1)   58.15 (0.9)   58.87 (1.4)   57.65 (1.6)

TaFeng(p = 25)
RUS     66.94 (1.3)   67.44 (1.3)   68.1 (1.4)    68.27 (1.4)   66.13 (1.2)
Cl K    66.94 (1.3)   66.13 (1.4)   63.39 (1.2)   59.83 (1.3)   56.94 (0.7)
Cl T    66.94 (1.3)   66.38 (1.5)   62.89 (1.6)   57.46 (1.3)   54.56 (1.3)
Far K   66.94 (1.3)   68.06 (1.4)   66.43 (1.6)   64.46 (1.5)   63.35 (1.3)
Far T   66.94 (1.3)   64.31 (1.1)   62.69 (1)     61.27 (1.1)   59.03 (1)
CBU     64.81 (1.2)   64.15 (1.1)   64.13 (1.2)   63.88 (0.8)   63.46 (0.8)

Book(p = 1)
RUS     52.6 (1.3)    52.79 (0.9)   53.46 (0.8)   53.89 (0.9)   54.05 (0.9)
Cl K    52.6 (1.3)    52.56 (1.2)   52.52 (1.3)   52.39 (1.1)   53.09 (1.1)
Cl T    52.6 (1.3)    52.56 (1.2)   52.52 (1.3)   52.39 (1.1)   53.05 (0.7)
Far K   52.6 (1.3)    55.21 (1.2)   56.21 (1.8)   56.14 (1.2)   53.06 (1)
Far T   52.6 (1.3)    55.21 (1.2)   56.21 (1.8)   56.14 (1.2)   53.06 (1)
CBU     54.28 (0.9)   53.77 (1)     53.33 (1.1)   53.34 (0.9)   52.84 (0.8)

Book(p = 25)
RUS     60.08 (0.7)   60.13 (0.6)   60.4 (0.8)    60.33 (0.8)   63.28 (0.8)
Cl K    60.08 (0.7)   59.96 (0.7)   60.13 (0.8)   59.96 (1)     59.28 (0.7)
Cl T    60.08 (0.7)   59.96 (0.7)   60.13 (0.8)   60.29 (0.4)   54.5 (0.9)
Far K   60.08 (0.7)   63.29 (1)     64.19 (0.8)   57.3 (1.1)    55.66 (1.1)
Far T   60.08 (0.7)   62.14 (0.5)   58.27 (0.6)   56.37 (1)     55.66 (1.1)
CBU     54.82 (0.9)   54.67 (0.9)   54.71 (0.9)   54.66 (1)     54.78 (0.9)

LST(p = 1)
RUS     99.99 (0)     99.99 (0)     99.99 (0)     99.98 (0)     99.99 (0)
Cl K    99.99 (0)     99.99 (0)     99.99 (0)     99.99 (0)     99.99 (0)
Cl T    99.99 (0)     99.99 (0)     99.99 (0)     99.99 (0)     99.98 (0)
Far K   99.99 (0)     99.98 (0)     99.98 (0)     99.98 (0)     99.98 (0)
Far T   99.99 (0)     99.98 (0)     99.98 (0)     99.98 (0)     99.98 (0)
CBU     []            []            []            []            []

Adver(p = [])
RUS     96.61 (1.8)   96.32 (1.8)   96.63 (1.4)   97.12 (2.1)   96.22 (1.6)
Cl K    96.61 (1.8)   96.44 (1.5)   96.14 (1.5)   96.04 (2)     94.8 (2.5)
Cl T    96.61 (1.8)   95.87 (2.1)   94.32 (1.9)   93.01 (2.2)   90.72 (2.3)
Far K   96.61 (1.8)   96.53 (1.4)   95.76 (2)     94.39 (1.8)   90.49 (3.1)
Far T   96.61 (1.8)   96.54 (1.5)   95.67 (1.9)   94.54 (1.8)   89.3 (2.8)
CBU     96.85 (2.3)   96.85 (2.3)   97.05 (1.5)   96.6 (1.6)    96.06 (2.1)

Adver(p = 1)
RUS     90.93 (3)     91.53 (3.1)   92.37 (3.4)   91.9 (2.9)    91.93 (2.2)
Cl K    90.93 (3)     90.64 (3)     89.87 (3.9)   90.21 (3.6)   89.18 (2)
Cl T    90.93 (3)     89.7 (3.5)    88.55 (3.4)   85.76 (3.3)   88.2 (2.3)
Far K   90.93 (3)     93.8 (2.3)    92.4 (2.6)    88.73 (3.4)   85.51 (4)
Far T   90.93 (3)     93.62 (2.4)   93.2 (2.2)    88.41 (3.6)   85.51 (4)
CBU     93.22 (2.4)   93.76 (2.5)   93.89 (2.6)   93.52 (2.7)   91.27 (2)

CRF(p = [])
RUS     64.06 (16.4)  63.28 (15.9)  67.98 (17.4)  66.95 (21.9)  87.73 (8.8)
Cl K    64.06 (16.4)  62.44 (16.6)  62.34 (16.9)  71.37 (13.8)  78.22 (17.7)
Cl T    64.06 (16.4)  62.44 (16.6)  62.34 (16.9)  71.37 (13.8)  62.67 (22.9)
Far K   64.06 (16.4)  83.8 (14.2)   83.93 (14.8)  84.49 (13.7)  86.11 (9.7)
Far T   64.06 (16.4)  83.8 (14.2)   83.93 (14.8)  84.49 (13.7)  86.11 (9.7)
CBU     []            []            []            []            []

Bank(p = [])
RUS     66.82 (0.9)   67.02 (0.9)   67.37 (0.8)   67.99 (0.6)   69.5 (1)
Cl K    66.82 (0.9)   66.17 (0.7)   65.24 (0.6)   64.86 (0.6)   58.53 (1.1)
Cl T    66.82 (0.9)   64.92 (1.1)   60.69 (0.9)   56.33 (0.8)   52.87 (0.7)
Far K   66.82 (0.9)   66.95 (0.6)   66.19 (0.6)   64.42 (0.6)   58.25 (1.1)
Far T   66.82 (0.9)   67.16 (0.6)   64.2 (0.8)    59.67 (1)     58.25 (1.1)
CBU     []            []            []            []            []


C Boosting experiments

This section presents the results of each of the boosting experiments from Section 4.5 on the remaining data sources. Each of the figures below depicts the average tenfold AUC-performance on test data (with µ = 100) with respect to the number of boosting iterations for (left) Adaboost (AB), AdaCost (AC) and EasyEnsemble (EE), with C chosen according to the highest validation set AUC-performance (over all possible boosting rounds), and (right) AB and EE (with S = 15) at varying C-levels.

[Figure: (a) test AUC versus boosting rounds T for AB, AC R2, AC R8, AC RD, EE S5, EE S10, EE S15 and BL; (b) test AUC versus T for AB and EE with C = 1e-07, 1e-05, 0.001, 0.1, and BL]

Fig 6 Mov G(p = 1) dataset

[Figure: (a) test AUC versus boosting rounds T for AB, AC R2, AC R8, AC RD, EE S5, EE S10, EE S15 and BL; (b) test AUC versus T for AB and EE with C = 1e-07, 1e-05, 0.001, 0.1, and BL]

Fig 7 Mov Th(p = []) dataset


[Figure: (a) test AUC versus boosting rounds T for AB, AC R2, AC R8, AC RD, EE S5, EE S10, EE S15 and BL; (b) test AUC versus T for AB and EE with C = 1e-07, 1e-05, 0.001, 0.1, and BL]

Fig 8 Yahoo A(p = 1) dataset

[Figure: (a) test AUC versus boosting rounds T for AB, AC R2, AC R8, AC RD, EE S5, EE S10, EE S15 and BL; (b) test AUC versus T for AB and EE with C = 1e-07, 1e-05, 0.001, 0.1, and BL]

Fig 9 Yahoo A(p = 25) dataset

[Figure: (a) test AUC versus boosting rounds T for AB, AC R2, AC R8, AC RD, EE S5, EE S10, EE S15 and BL; (b) test AUC versus T for AB and EE with C = 1e-07, 1e-05, 0.001, 0.1, and BL]

Fig 10 Yahoo G(p = 1) dataset


[Figure: (a) test AUC versus boosting rounds T for AB, AC R2, AC R8, AC RD, EE S5, EE S10, EE S15 and BL; (b) test AUC versus T for AB and EE with C = 1e-07, 1e-05, 0.001, 0.1, and BL]

Fig 11 Yahoo G(p = 25) dataset

[Figure: (a) test AUC versus boosting rounds T for AB, AC R2, AC R8, AC RD, EE S5, EE S10, EE S15 and BL; (b) test AUC versus T for AB and EE with C = 1e-07, 1e-05, 0.001, 0.1, and BL]

Fig 12 TaFeng(p = 1) dataset

[Figure: (a) test AUC versus boosting rounds T for AB, AC R2, AC R8, AC RD, EE S5, EE S10, EE S15 and BL; (b) test AUC versus T for AB and EE with C = 1e-07, 1e-05, 0.001, 0.1, and BL]

Fig 13 Book(p = 1) dataset


[Figure: (a) test AUC versus boosting rounds T for AB, AC R2, AC R8, AC RD, EE S5, EE S10, EE S15 and BL; (b) test AUC versus T for AB and EE with C = 1e-07, 1e-05, 0.001, 0.1, and BL]

Fig 14 LST(p = 1) dataset

[Figure: (a) test AUC versus boosting rounds T for AB, AC R2, AC R8, AC RD, EE S5, EE S10, EE S15 and BL; (b) test AUC versus T for AB and EE with C = 1e-07, 1e-05, 0.001, 0.1, and BL]

Fig 15 Adver(p = []) dataset

[Figure: (a) test AUC versus boosting rounds T for AB, AC R2, AC R8, AC RD, EE S5, EE S10, EE S15 and BL; (b) test AUC versus T for AB and EE with C = 1e-07, 1e-05, 0.001, 0.1, and BL]

Fig 16 Adver(p = 1) dataset


[Figure: (a) test AUC versus boosting rounds T for AB, AC R2, AC R8, AC RD, EE S5, EE S10, EE S15 and BL; (b) test AUC versus T for AB and EE with C = 1e-07, 1e-05, 0.001, 0.1, and BL]

Fig 17 CRF(p = []) dataset

D Final Comparison

[Figure: scatter plot of average rank AUC (x-axis) versus average rank Time (y-axis) for BL, OSR, SMOTE, ADASYN, RUS, Cl Knn, Far Knn, CBU, AB, AC(R=28), AC(R=RL), EE(S=10), EE(S=15) and EE par]

Fig 18 Average rank AUC versus average rank Time (see Table 9) across the large datasets from Table 2 (CRF and Bank). Regarding the former, an extra row EE par is added having the same AUC-value as EE(S = 15). Ranks for each dataset are subsequently obtained via the procedure outlined in Section 4.6.1. Points occurring in the upper-right region are preferred.


References

Akbani R Kwek S Japkowicz N (2004) Applying support vector machines to imbalanced datasets. In: Machine Learning: ECML 2004, 15th European Conference on Machine Learning, Pisa, Italy, September 20-24, 2004, Proceedings, Springer Berlin Heidelberg, Berlin, Heidelberg, pp 39-50, DOI 10.1007/978-3-540-30115-8_7

Ali A Shamsuddin SM Ralescu AL (2015) Classification with class imbalance problem: A review. International Journal of Advances in Soft Computing and its Applications 7(3):176-204

Alzahrani T Horadam KJ (2016) Community detection in bipartite networks: Algorithms and case studies. In: Complex Systems and Networks: Dynamics, Controls and Applications, Springer Berlin Heidelberg, Berlin, Heidelberg, pp 25-50, DOI 10.1007/978-3-662-47824-0_2

Bachner J (2013) Predictive policing: Preventing crime with data and analytics. IBM Center for the Business of Government

Baesens B Van Gestel T Viaene S Stepanova M Suykens J Vanthienen J (2003) Benchmarking state-of-the-art classification algorithms for credit scoring. Journal of the Operational Research Society 54(6):627-635, DOI 10.1057/palgrave.jors.2601545

Barandela R Sánchez JS García V Rangel E (2003) Strategies for learning in class imbalance problems. Pattern Recognition 36(3):849-851, DOI 10.1016/S0031-3203(02)00257-1

Barber MJ (2007) Modularity and community detection in bipartite networks. Physical Review E 76:066102, DOI 10.1103/PhysRevE.76.066102

Barua S Islam MM Yao X Murase K (2014) MWMOTE - majority weighted minority oversampling technique for imbalanced data set learning. IEEE Transactions on Knowledge and Data Engineering 26(2):405-425, DOI 10.1109/TKDE.2012.232

Batista GEAPA Prati RC Monard MC (2004) A study of the behavior of several methods for balancing machine learning training data. SIGKDD Explor Newsl 6(1):20-29, DOI 10.1145/1007730.1007735

Beckett SJ (2016) Improved community detection in weighted bipartite networks. Royal Society Open Science 3(1), DOI 10.1098/rsos.140536

Bekkar M Djemaa HK Alitouche TA (2013) Evaluation measures for models assessment over imbalanced data sets. Journal of Information Engineering and Applications 3(10):27-38

Bhattacharyya S Jha S Tharakunnel K Westland JC (2011) Data mining for credit card fraud: A comparative study. Decision Support Systems 50(3):602-613, DOI 10.1016/j.dss.2010.08.008

Blondel VD Guillaume JL Lambiotte R Lefebvre E (2008) Fast unfolding of communities in large networks. Journal of Statistical Mechanics: Theory and Experiment 2008(10):P10008

Breiman L Friedman J Stone CJ Olshen RA (1984) Classification and regression trees. Taylor & Francis

Brozovsky L Petricek V (2007) Recommender system for online dating service. In: Proceedings of Znalosti 2007 Conference, VSB Ostrava

Cha M Mislove A Gummadi KP (2009) A measurement-driven analysis of information propagation in the Flickr social network. In: Proceedings of the 18th International Conference on World Wide Web, ACM, New York, NY, USA, WWW '09, pp 721-730, DOI 10.1145/1526709.1526806

Chawla NV (2005) Data mining for imbalanced datasets: An overview. In: Data Mining and Knowledge Discovery Handbook, Springer US, Boston, MA, pp 853-867

Chawla NV Bowyer KW Hall LO Kegelmeyer WP (2002) SMOTE: synthetic minority over-sampling technique. Journal of Artificial Intelligence Research 16:321-357

Chawla NV Lazarevic A Hall LO Bowyer KW (2003) SMOTEBoost: Improving prediction of the minority class in boosting. In: Knowledge Discovery in Databases: PKDD 2003, Springer Berlin Heidelberg, Berlin, Heidelberg, pp 107-119

Chawla NV Japkowicz N Kotcz A (2004) Editorial: Special issue on learning from imbalanced data sets. SIGKDD Explor Newsl 6(1):1-6, DOI 10.1145/1007730.1007733

Chen M Mao S Liu Y (2014) Big data: A survey. Mobile Networks and Applications 19(2):171-209, DOI 10.1007/s11036-013-0489-0

Chyi YM (2003) Classification analysis techniques for skewed class distribution problems. Master thesis, Department of Information Management, National Sun Yat-Sen University

Demšar J (2006) Statistical comparisons of classifiers over multiple data sets. Journal of Machine Learning Research 7(Jan):1-30

Dhillon IS (2001) Co-clustering documents and words using bipartite spectral graph partitioning. In: Proceedings of the Seventh ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, ACM, New York, NY, USA, KDD '01, pp 269-274, DOI 10.1145/502512.502550


Drummond C Holte RC (2003) C4.5, class imbalance, and cost sensitivity: why under-sampling beats over-sampling. In: Proceedings of the ICML '03 Workshop on Learning from Imbalanced Datasets

Fan RE Chang KW Hsieh CJ Wang XR Lin CJ (2008) LIBLINEAR: A library for large linear classification. Journal of Machine Learning Research 9:1871-1874

Fan W Stolfo SJ Zhang J Chan PK (1999) AdaCost: Misclassification cost-sensitive boosting. In: Proceedings of the Sixteenth International Conference on Machine Learning, Morgan Kaufmann Publishers Inc, San Francisco, CA, USA, ICML '99, pp 97-105

Fawcett T (2006) An introduction to ROC analysis. Pattern Recognition Letters 27(8):861-874, DOI 10.1016/j.patrec.2005.10.010

Finch H (2005) Comparison of distance measures in cluster analysis with dichotomous data. Journal of Data Science 3(1):85-100

Fortunato S (2010) Community detection in graphs. Physics Reports 486(3-5):75-174, DOI 10.1016/j.physrep.2009.11.002

Junque de Fortuny E Martens D Provost F (2014a) Predictive modeling with big data: is bigger really better? Big Data 1(4):215-226, DOI 10.1089/big.2013.0037

Junque de Fortuny E Stankova M Moeyersoms J Minnaert B Provost F Martens D (2014b) Corporate residence fraud detection. In: Proceedings of the 20th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, ACM, New York, NY, USA, KDD '14, pp 1650-1659, DOI 10.1145/2623330.2623333

Frasca M Bertoni A Re M Valentini G (2013) A neural network algorithm for semi-supervised node label learning from unbalanced data. Neural Networks 43:84-98, DOI 10.1016/j.neunet.2013.01.021

Friedman M (1937) The use of ranks to avoid the assumption of normality implicit in the analysis of variance. Journal of the American Statistical Association 32(200):675-701

García E Lozano F (2007) Boosting support vector machines. In: Machine Learning and Data Mining in Pattern Recognition, 5th International Conference, MLDM 2007, Leipzig, Germany, July 18-20, Post Proceedings, IBaI publishing, pp 153-167

Goldstein M Uchida S (2016) A comparative evaluation of unsupervised anomaly detection algorithms for multivariate data. PLOS ONE 11(4):1-31, DOI 10.1371/journal.pone.0152173

González PC Velásquez JD (2013) Characterization and detection of taxpayers with false invoices using data mining techniques. Expert Systems with Applications 40(5):1427-1436, DOI 10.1016/j.eswa.2012.08.051

Guimera R Sales-Pardo M Amaral LAN (2007) Module identification in bipartite and directed networks. Physical Review E 76:036102, DOI 10.1103/PhysRevE.76.036102

Guo H Viktor HL (2004) Learning from imbalanced data sets with boosting and data generation: The DataBoost-IM approach. SIGKDD Explor Newsl 6(1):30-39, DOI 10.1145/1007730.1007736

Guo X Yin Y Dong C Yang G Zhou G (2008) On the class imbalance problem. In: 2008 Fourth International Conference on Natural Computation, IEEE, vol 4, pp 192-201, DOI 10.1109/ICNC.2008.871

Han H Wang WY Mao BH (2005) Borderline-SMOTE: A new over-sampling method in imbalanced data sets learning. In: Advances in Intelligent Computing, Springer Berlin Heidelberg, Berlin, Heidelberg, pp 878-887

He H Garcia EA (2009) Learning from imbalanced data. IEEE Transactions on Knowledge and Data Engineering 21(9):1263-1284, DOI 10.1109/TKDE.2008.239

He H Bai Y Garcia EA Li S (2008) ADASYN: Adaptive synthetic sampling approach for imbalanced learning. In: 2008 IEEE International Joint Conference on Neural Networks (IEEE World Congress on Computational Intelligence), IEEE, pp 1322-1328, DOI 10.1109/IJCNN.2008.4633969

Holm S (1979) A simple sequentially rejective multiple test procedure. Scandinavian Journal of Statistics 6(2):65-70

Hsu CW Lin CJ (2002) A comparison of methods for multiclass support vector machines. IEEE Transactions on Neural Networks 13(2):415-425, DOI 10.1109/72.991427

Huang A (2008) Similarity measures for text document clustering. In: Proceedings of the Sixth New Zealand Computer Science Research Student Conference (NZCSRSC2008), Christchurch, New Zealand, pp 49-56

Iman RL Davenport JM (1980) Approximations of the critical region of the Friedman statistic. Communications in Statistics - Theory and Methods 9(6):571-595

Jo T Japkowicz N (2004) Class imbalances versus small disjuncts. ACM SIGKDD Explor Newsl 6(1):40-49, DOI 10.1145/1007730.1007737


Jutla IS Jeub LG Mucha PJ (2011-2016) A generalized Louvain method for community detection implemented in MATLAB. URL http://netwiki.amath.unc.edu/GenLouvain

Kubat M Matwin S (1997) Addressing the curse of imbalanced training sets: One-sided selection. In: Proceedings of the Fourteenth International Conference on Machine Learning, Morgan Kaufmann Publishers Inc, San Francisco, CA, USA, pp 179-186

Lancichinetti A Fortunato S (2009) Community detection algorithms: A comparative analysis. Physical Review E 80:056117, DOI 10.1103/PhysRevE.80.056117

Larremore DB Clauset A Jacobs AZ (2014) Efficiently inferring community structure in bipartite networks. Physical Review E: Statistical, Nonlinear, and Soft Matter Physics 90:012805, DOI 10.1103/PhysRevE.90.012805

Li J Fine JP (2010) Weighted area under the receiver operating characteristic curve and its application to gene selection. Journal of the Royal Statistical Society: Series C (Applied Statistics) 59(4):673-692, DOI 10.1111/j.1467-9876.2010.00713.x

Li X Wang L Sung E (2008) AdaBoost with SVM-based component classifiers. Engineering Applications of Artificial Intelligence 21(5):785-795, DOI 10.1016/j.engappai.2007.07.001

Lichman M (2013) UCI machine learning repository. URL http://archive.ics.uci.edu/ml

Liu W Chawla S Cieslak DA Chawla NV (2010) A robust decision tree algorithm for imbalanced data sets. In: Proceedings of the Tenth SIAM International Conference on Data Mining, SIAM, Philadelphia, vol 10, pp 766-777

Liu XY Wu J Zhou ZH (2009) Exploratory undersampling for class-imbalance learning. IEEE Transactions on Systems, Man, and Cybernetics, Part B (Cybernetics) 39(2):539-550, DOI 10.1109/TSMCB.2008.2007853

Luts J Ojeda F Van de Plas R De Moor B Van Huffel S Suykens JA (2010) A tutorial on support vector machine-based methods for classification problems in chemometrics. Analytica Chimica Acta 665(2):129-145, DOI 10.1016/j.aca.2010.03.030

Macskassy SA Provost F (2007) Classification in networked data: A toolkit and a univariate case study. Journal of Machine Learning Research 8(May):935-983

Martens D Provost F (2014) Explaining data-driven document classifications. MIS Quarterly 38(1):73-100

Martens D Provost F Clark J Junque de Fortuny E (2016) Mining massive fine-grained behavior data to improve predictive analytics. MIS Quarterly 40(4):869-888

Mazurowski MA Habas PA Zurada JM Lo JY Baker JA Tourassi GD (2008) Training neural network classifiers for medical decision making: The effects of imbalanced datasets on classification performance. Neural Networks 21(2-3):427-436, DOI 10.1016/j.neunet.2007.12.031

Mease D Wyner AJ Buja A (2007) Boosted classification trees and class probability/quantile estimation. Journal of Machine Learning Research 8:409-439

Nemenyi P (1963) Distribution-free Multiple Comparisons. Dissertation, Princeton University

Newman MEJ Girvan M (2004) Finding and evaluating community structure in networks. Physical Review E 69:026113, DOI 10.1103/PhysRevE.69.026113

Ng AY (2004) Feature selection, L1 vs. L2 regularization, and rotational invariance. In: Proceedings of the Twenty-first International Conference on Machine Learning, ACM, New York, NY, USA, ICML '04, p 78, DOI 10.1145/1015330.1015435

Ng AY Jordan MI (2002) On discriminative vs. generative classifiers: A comparison of logistic regression and naive Bayes. In: Advances in Neural Information Processing Systems 14, MIT Press, pp 841-848

Ngai E Hu Y Wong Y Chen Y Sun X (2011) The application of data mining techniques in financial fraud detection: A classification framework and an academic review of literature. Decision Support Systems 50(3):559-569, DOI 10.1016/j.dss.2010.08.006

Platt JC (1999) Probabilistic outputs for support vector machines and comparisons to regularized likelihood methods. In: Advances in Large-Margin Classifiers, MIT Press, pp 61-74

Porter MA Onnela JP Mucha PJ (2009) Communities in networks. Notices of the American Mathematical Society 56(9):1082-1097

Provost F Fawcett T (2013) Data Science for Business: What you need to know about data mining and data-analytic thinking. O'Reilly Media, Inc

Provost F Dalessandro B Hook R Zhang X Murray A (2009) Audience selection for on-line brand advertising: Privacy-friendly social network targeting. In: Proceedings of the 15th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, ACM, New York, NY, USA, KDD '09, pp 707-716, DOI 10.1145/1557019.1557098

56 Jellis Vanhoeyveld David Martens

Raskutti B, Kowalczyk A (2004) Extreme re-balancing for SVMs: a case study. SIGKDD Explor Newsl 6(1):60–69. DOI 10.1145/1007730.1007739

Rosvall M, Bergstrom CT (2008) Maps of random walks on complex networks reveal community structure. Proceedings of the National Academy of Sciences 105(4):1118–1123. DOI 10.1073/pnas.0706851105

Schapire RE (1999) A brief introduction to boosting. In: Proceedings of the 16th International Joint Conference on Artificial Intelligence - Volume 2, Morgan Kaufmann Publishers Inc., San Francisco, CA, USA, IJCAI '99, pp 1401–1406

Schapire RE, Singer Y (1999) Improved boosting algorithms using confidence-rated predictions. Machine Learning 37(3):297–336. DOI 10.1023/A:1007614523901

Shmueli G (2017) Analyzing behavioral big data: methodological, practical, ethical, and moral issues. Quality Engineering 29(1):57–74. DOI 10.1080/08982112.2016.1210979

Sobhani P, Viktor H, Matwin S (2015) Learning from imbalanced data using ensemble methods and cluster-based undersampling. In: New Frontiers in Mining Complex Patterns: Third International Workshop, NFMCP 2014, Held in Conjunction with ECML-PKDD 2014, Nancy, France, September 19, 2014, Revised Selected Papers, Springer International Publishing, Cham, pp 69–83. DOI 10.1007/978-3-319-17876-9_5

Stankova M (2016) Classification within network data with a bipartite structure. Dissertation, University of Antwerp

Stankova M, Martens D, Provost F (2015) Classification over bipartite graphs through projection. Working Papers 2015001, University of Antwerp, Faculty of Applied Economics

Sun Y, Kamel MS, Wong AK, Wang Y (2007) Cost-sensitive boosting for classification of imbalanced data. Pattern Recognition 40(12):3358–3378. DOI 10.1016/j.patcog.2007.04.009

Suykens JA, Van Gestel T, De Brabanter J, De Moor B, Vandewalle J (2002) Least squares support vector machines. World Scientific, Singapore

Tang B, He H (2015) ENN: extended nearest neighbor method for pattern recognition [research frontier]. IEEE Computational Intelligence Magazine 10(3):52–60. DOI 10.1109/MCI.2015.2437512

Tang Y, Zhang YQ, Chawla NV, Krasser S (2009) SVMs modeling for highly imbalanced classification. IEEE Transactions on Systems, Man, and Cybernetics, Part B (Cybernetics) 39(1):281–288. DOI 10.1109/TSMCB.2008.2002909

Tobback E, Moeyersoms J, Stankova M, Martens D (2016) Bankruptcy prediction for SMEs using relational data. Working Paper 2016004, University of Antwerp, Faculty of Applied Economics

Verbeke W, Dejaeger K, Martens D, Hur J, Baesens B (2012) New insights into churn prediction in the telecommunication sector: a profit driven data mining approach. European Journal of Operational Research 218(1):211–229. DOI 10.1016/j.ejor.2011.09.031

Veropoulos K, Campbell I, Cristianini N (1999) Controlling the sensitivity of support vector machines. In: Proceedings of the International Joint Conference on Artificial Intelligence, Stockholm, Sweden (IJCAI99), pp 55–60

Whitrow C, Hand DJ, Juszczak P, Weston D, Adams NM (2009) Transaction aggregation as a strategy for credit card fraud detection. Data Mining and Knowledge Discovery 18(1):30–55. DOI 10.1007/s10618-008-0116-z

Wickramaratna J, Holden SB, Buxton BF (2001) Performance degradation in boosting. In: Proceedings of the Second International Workshop on Multiple Classifier Systems, Springer, London, UK, MCS '01, pp 11–21

Yen SJ, Lee YS (2009) Cluster-based under-sampling approaches for imbalanced data distributions. Expert Systems with Applications 36(3, Part 1):5718–5727. DOI 10.1016/j.eswa.2008.06.108

Yu HF, Lo HY, Hsieh HP, Lou JK, McKenzie TG, Chou JW, Chung PH, Ho CH, Chang CF, Wei YH, et al (2010) Feature engineering and classifier ensemble for KDD Cup 2010. In: Proceedings of the KDD Cup 2010 Workshop, pp 1–16

Zha H, He X, Ding C, Simon H, Gu M (2001) Bipartite graph partitioning and data clustering. In: Proceedings of the Tenth International Conference on Information and Knowledge Management, ACM, New York, NY, USA, CIKM '01, pp 25–32. DOI 10.1145/502585.502591

Zhang J, Mani I (2003) KNN approach to unbalanced data distributions: a case study involving information extraction. In: Proceedings of the ICML'2003 Workshop on Learning from Imbalanced Datasets, Washington, DC

Ziegler CN, McNee SM, Konstan JA, Lausen G (2005) Improving recommendation lists through topic diversification. In: Proceedings of the 14th International Conference on World Wide Web, ACM, New York, NY, USA, WWW '05, pp 22–32. DOI 10.1145/1060745.1060754



Table 10 Average tenfold AUC test set performance of the BL compared to its EE-version (S = 15, T = 20) for SVM, LR, BeSim and NB. In case the EE NB implementation crashed, we put a × in the table. If it is too slow (> 24 hours per fold), we place a []. Additionally, average ranks are assigned to each method. In case EE NB has missing results for a particular dataset, we ensure it has the lowest rank across all other algorithms.

            Mov G(p = 1)     Mov G(p = 25)    Mov Th(p = [])   Yahoo A(p = 1)

BL SVM      71.6 (2.62)      81.41 (1.32)     79.77 (5.33)     56.49 (3.37)
EE SVM      76.12 (2.88)     85.13 (1.86)     86.43 (5.86)     59.74 (2.96)
BL LR       71.02 (2.09)     84.39 (1.84)     83.14 (4.17)     57.84 (2.39)
EE LR       76.69 (2.92)     85.03 (1.98)     86.3 (5.37)      59.79 (2.62)
BL BeSim    76.1 (3.58)      81.3 (2.92)      82.81 (6.6)      56.27 (2.73)
EE BeSim    76.31 (3.71)     81.37 (2.9)      85.02 (6.28)     57.7 (1.71)
BL NB       70.26 (5.84)     77.01 (2.54)     70.48 (10.14)    52.56 (2.09)
EE NB       75.93 (2.83)     85.56 (2.01)     86.91 (4.15)     57.55 (2.73)

            Yahoo A(p = 25)  Yahoo G(p = 1)   Yahoo G(p = 25)  TaFeng(p = 1)

BL SVM      61.61 (2.48)     66.84 (3.66)     78.82 (1.39)     55.75 (1.6)
EE SVM      66.38 (3.16)     73.48 (2.32)     80.55 (1.55)     61.13 (1.83)
BL LR       66.27 (2.96)     69.82 (1.93)     80.45 (1.59)     58.91 (2.31)
EE LR       66.22 (3.28)     73.08 (2.14)     80.53 (1.56)     61.43 (2.32)
BL BeSim    64.54 (2.02)     68.89 (2.49)     79.55 (1.96)     57.89 (1.18)
EE BeSim    65.25 (2.23)     71.18 (2.91)     80.04 (1.85)     59.36 (1.47)
BL NB       65 (1.65)        63.33 (2.56)     78.89 (1.64)     54.61 (1.2)
EE NB       66.6 (2.79)      70.99 (2.88)     81.01 (1.3)      59.01 (1.84)

            TaFeng(p = 25)   Book(p = 1)      Book(p = 25)     LST(p = 1)

BL SVM      66.94 (1.34)     52.6 (1.29)      60.08 (0.71)     99.99 (0.01)
EE SVM      70.4 (1.3)       55.34 (1.28)     65.4 (0.51)      99.98 (0.01)
BL LR       69.24 (1.3)      55.34 (1.27)     63.84 (0.75)     99.99 (0.01)
EE LR       70.28 (1.28)     55.49 (1.49)     65.41 (0.63)     99.97 (0.02)
BL BeSim    67.49 (1.23)     55.19 (1.27)     63.7 (0.63)      99.99 (0.01)
EE BeSim    68 (1.21)        55.21 (1.15)     64.38 (0.42)     99.99 (0)
BL NB       65.21 (1.64)     52.93 (0.9)      59.75 (0.47)     98.69 (0.3)
EE NB       70.72 (1.15)     ×                63.46 (0.61)     99.92 (0.04)

            Adver(p = [])    Adver(p = 1)     CRF(p = [])      Bank(p = [])

BL SVM      96.37 (1.94)     91.18 (2.97)     64.36 (18.97)    66.82 (0.88)
EE SVM      97.63 (1.35)     93.3 (2.14)      86.35 (9.99)     71.54 (0.76)
BL LR       97.19 (1.44)     88.51 (1.93)     81.87 (19.63)    71.43 (0.72)
EE LR       97.57 (0.96)     93.02 (2.06)     86.84 (9.62)     71.77 (0.62)
BL BeSim    97.26 (1.12)     95.38 (1.35)     86.91 (9.36)     67.85 (0.67)
EE BeSim    97.38 (1.04)     93.83 (1.35)     87.02 (10.43)    70.41 (0.55)
BL NB       93.75 (1.9)      93.37 (1.9)      87.24 (9.38)     67.83 (0.63)
EE NB       94.04 (1.75)     ×                ×                []

            Flickr(p = 01)   Kdd(p = 05)      Average Rank

BL SVM      74.92 (0.17)     74.53 (0.05)     6.44 [7]
EE SVM      79.86 (0.13)     80.98 (0.05)     2.39 [1]
BL LR       79.03 (0.11)     81.29 (0.04)     4.28 [4]
EE LR       79.85 (0.13)     80.75 (0.05)     2.61 [2]
BL BeSim    74.62 (0.13)     74.95 (0)        5.11 [6]
EE BeSim    76.4 (0.13)      77.55 (0.03)     3.61 [3]
BL NB       81.36 (0.1)      74.29 (0.05)     6.5 [8]
EE NB       []               []               5.06 [5]

Imbalanced classification in sparse and large behaviour datasets 41

6 Conclusions and future research directions

We investigated the effects of sampling and cost-sensitive learning techniques on the problem of learning from imbalanced behaviour data. This type of big data has fundamentally different properties and internal structure than traditional data and is characterized by sparseness and very large dimensions. This is a new research topic which enables benefits across a wide variety of application domains such as fraud detection, targeted advertising, predictive policing, churn prediction, default prediction, etc. To enable future developments in this field, we provide an on-line repository containing implementations, datasets and results, available at the following link: http://www.applieddatamining.com/cms?q=software.

Oversampling techniques need to be adapted in order to cope with this kind of data. We proposed new versions of the traditional SMOTE and ADASYN techniques, which differ in the way synthetic samples are generated and similarity computations are performed. The plain random oversampling (OSR) and the synthetic approaches are competitive methods in terms of AUC and do not seem to suffer from overfitting, as traditional studies report. Though the synthetic approaches have comparable AUC-performances in comparison to OSR, they are useless in practice due to their computationally expensive procedures. The OSR method is very fast for small and medium sized datasets, yet for larger datasets it becomes inappropriate because the training times of the underlying base learner increase drastically.
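For concreteness, the plain random oversampling step can be sketched in a few lines of numpy. The function name, interface and the use of label 1 for the minority class are our own illustrative choices; `beta` plays the role of the oversampling ratio β used in the experiments (minority size after sampling equals `beta` times the majority size):

```python
import numpy as np

def random_oversample(X, y, beta=1.0, seed=0):
    """OSR sketch: duplicate randomly chosen minority rows (label 1)
    until the minority count reaches beta times the majority count."""
    rng = np.random.default_rng(seed)
    min_idx = np.flatnonzero(y == 1)
    maj_idx = np.flatnonzero(y == 0)
    extra = int(beta * len(maj_idx)) - len(min_idx)
    if extra <= 0:  # already at (or past) the requested ratio
        return X, y
    dup = rng.choice(min_idx, size=extra, replace=True)
    keep = np.concatenate([maj_idx, min_idx, dup])
    return X[keep], y[keep]
```

Because rows are only duplicated and no synthetic points are generated, the procedure preserves sparsity, which is one reason it scales better than SMOTE-style generation on behaviour data.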

With respect to undersampling methodologies, we employed a random undersampling technique (RUS) and Knn-based informed approaches. The latter type selects majority class instances that are closest (Cl Knn) or farthest (Far Knn) from the minority class examples and allows one to investigate the effect of noise/outliers on SVM performance. We observed that these odd instances have a severe performance degrading effect. RUS and Far Knn approaches can significantly improve performance with respect to the baseline, thanks to their abilities to reduce noise/outlier levels and to put more emphasis on the minority class. Having said this, they usually perform worse than any of the oversampling approaches. This is in sheer contrast to the results obtained from traditional studies and is heavily related to the intrinsic nature of behaviour data. Timing-wise, RUS is the fastest method. The informed approaches are slower than RUS due to the required nearest neighbour computations. Additionally, a cluster-based undersampling (CBU) technique was proposed to alleviate the within-class imbalance. For low undersampling rates, the performance of CBU is superior to plain RUS, whereas for high undersampling rates the opposite is true. Overall, no substantial gain in performance is obtained in comparison with a simple RUS approach. Furthermore, the method is the slowest among all techniques considered and relies on computationally expensive clustering heuristics.
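A minimal sketch of both flavours follows, assuming binary labels with 1 as the minority class. The `mode="far"` branch is a simplified stand-in for the Far Knn idea: it ranks majority instances by cosine similarity to the minority centroid instead of performing true nearest-neighbour computations, so only the keep-the-farthest structure mirrors the paper's method:

```python
import numpy as np

def undersample(X, y, beta_u=0.5, mode="random", seed=0):
    """Keep a beta_u fraction of the majority class (label 0).
    mode="random" is RUS; mode="far" discards the majority rows
    closest to the minority class first (noise/outlier candidates)."""
    rng = np.random.default_rng(seed)
    maj = np.flatnonzero(y == 0)
    mino = np.flatnonzero(y == 1)
    keep_n = max(1, int(beta_u * len(maj)))
    if mode == "random":
        kept_maj = rng.choice(maj, size=keep_n, replace=False)
    else:  # "far": cosine similarity to the minority centroid
        c = X[mino].mean(axis=0)
        sim = (X[maj] @ c) / (np.linalg.norm(X[maj], axis=1)
                              * np.linalg.norm(c) + 1e-12)
        kept_maj = maj[np.argsort(sim)[:keep_n]]  # lowest similarity = farthest
    keep = np.sort(np.concatenate([kept_maj, mino]))
    return X[keep], y[keep]
```

The "closest" variants discussed above correspond to reversing the sort order, which is exactly what exposes the noisy majority instances and their performance degrading effect.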

We studied the effects of boosting, where we made use of a very specific linear SVM and logistic regression combination that uses confidence rated predictions instead of a more popular plain "discrete" boosting algorithm (with weak hypotheses having values in {−1,1}). Our experiments clearly indicated the regularization constant C in the SVM formulation to act as a "weakness" indicator. Indeed, higher C-values cause stronger learners and shouldn't be used in the boosting process, which matches with the analysis of studies dealing with traditional data. Strong learners can already quite accurately distinguish majority class from minority class behaviours. The boosting process will focus on the hard to learn instances, among which are noise/outliers, and this can result in overfitting behaviour. We also observed that boosting can usually outperform the plain baseline, indicating that a combination of weak learners can achieve higher AUC-performance than a single strong learner.
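The confidence-rated update at the heart of this scheme can be sketched as follows. Only the boosting loop should be read as the Schapire-Singer style update; the weak learner below is a weighted-centroid scorer we substitute for the paper's regularized SVM/LR combination (where a small C keeps the learner weak):

```python
import numpy as np

def fit_weak(X, y, w):
    """Assumed weak learner: weighted class-centroid difference,
    squashed into (-1, 1); a stand-in for a weakly regularized SVM/LR."""
    d = ((X[y == 1] * w[y == 1, None]).sum(0) / w[y == 1].sum()
         - (X[y == -1] * w[y == -1, None]).sum(0) / w[y == -1].sum())
    return lambda Xq: np.tanh(Xq @ d)

def confidence_rated_boost(X, y, T=10):
    """Each weak hypothesis outputs a real value; instance weights are
    multiplied by exp(-y*h(x)) so mistakes gain weight, and the final
    ensemble simply sums the real-valued scores."""
    w = np.full(len(y), 1.0 / len(y))
    hs = []
    for _ in range(T):
        h = fit_weak(X, y, w)
        w *= np.exp(-y * h(X))  # y in {-1, +1}
        w /= w.sum()
        hs.append(h)
    return lambda Xq: sum(h(Xq) for h in hs)
```

A strong learner (high C) would leave almost no weight to redistribute after the first round, which is why the weakness of the base learner matters for the boosting process.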

EasyEnsemble (EE) integrates random undersampling with a boosting process. By sampling several balanced subsets from the training data and combining these, we can emphasize the minority class, achieve noise/outlier reduction effects and simultaneously explore the majority class space efficiently. Our version differs from the original formulation because we employ a different type of underlying weak learner (SVM and LR combination) that is fed to a confidence rated boosting algorithm (instead of the discrete version). We observed superior AUC-performance with respect to all of the aforementioned techniques. Furthermore, the method is very fast since each subset can be trained in parallel and has a size that is twice as large as the minority class training set. We studied the effect of the number of subsets and noted that a limited amount of these (O(10)) is already sufficient to obtain satisfactory performances; increasing this number further causes only minor benefits.
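The subset-sampling skeleton of EE can be sketched as follows. Names and the per-subset learner (a plain centroid scorer instead of the boosted SVM/LR chain) are our simplifications, so only the balanced-sampling and score-averaging structure mirrors the method:

```python
import numpy as np

def easy_ensemble(X, y, S=15, seed=0):
    """Draw S balanced subsets: all minority rows (label 1) plus an
    equally sized random majority sample, so each subset is roughly
    twice the minority size.  Fit a scorer on each subset and average."""
    rng = np.random.default_rng(seed)
    mino = np.flatnonzero(y == 1)
    maj = np.flatnonzero(y == 0)
    directions = []
    for _ in range(S):
        sub = rng.choice(maj, size=min(len(mino), len(maj)), replace=False)
        # per-subset "learner": minority vs. sampled-majority centroid difference
        directions.append(X[mino].mean(axis=0) - X[sub].mean(axis=0))
    w = np.mean(directions, axis=0)
    return lambda Xq: Xq @ w  # higher score = more minority-like
```

Since the S subsets are independent, the loop is trivially parallelizable, which is what makes the method so fast in practice.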

Additionally, statistical comparison of each of the proposed methods was conducted across a wide variety of datasets. Holm's significance test indicates that each of the proposed techniques (oversampling, undersampling except Cl Knn, cost-sensitive learning and boosting variants) can significantly outperform the plain baseline performance (each of the null-hypotheses of equal performance gets rejected with a confidence level of at least 91%). Comparison of the techniques with EE showed that the latter can significantly outperform all of them (each of the null-hypotheses of equal performance gets rejected with a confidence level of at least 75%). A timing-wise comparison showed EE to be the preferred method (after RUS) for large and highly imbalanced datasets. A parallel implementation would make EE the fastest method even for medium sized datasets that show a high level of imbalance.
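The Holm step-down rule used for these comparisons takes only a few lines; this sketch (function name ours) starts from precomputed p-values, one per null hypothesis of equal performance, and returns which of them are rejected at level alpha:

```python
def holm_reject(pvals, alpha=0.05):
    """Holm step-down: compare the i-th smallest p-value (i = 0..m-1)
    against alpha/(m-i); at the first failure, stop and retain all
    remaining null hypotheses."""
    m = len(pvals)
    order = sorted(range(m), key=lambda i: pvals[i])
    reject = [False] * m
    for rank, i in enumerate(order):
        if pvals[i] <= alpha / (m - rank):
            reject[i] = True
        else:
            break
    return reject
```

Compared with a plain Bonferroni correction (alpha/m for every hypothesis), the step-down thresholds become progressively less strict, which is why Holm's procedure is uniformly more powerful.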

The influence of the chosen base learner was investigated by comparing the baseline to its EE-version for the SVM, LR, NB and BeSim inducers. Each of the EE-versions is able to improve upon the baseline performance for nearly all datasets, where base learners of the regularization type are superior because of their abilities to control the strength of the learner.

Though we conducted an extensive benchmark study, there are still plenty of future research directions. First and most important, there is an urgent need to develop a more general theoretical framework to completely understand the effect of imbalance and why certain methods outperform others. Secondly, other types of mechanisms could be investigated in this setting: advanced sampling (Guo et al 2008) set-ups perform sampling based on preliminary classifications; one-class learning is a recognition based methodology where a discriminative boundary is learned around the training examples of a single class alone, and Raskutti and Kowalczyk (2004) demonstrated the superiority of the recognition based one-class SVMs over the discriminative two-class SVMs; other types of ensemble learning strategies (e.g. bagging and stacking) could be investigated; Batista et al (2004) note that combining over- and undersampling can be beneficial when data are highly imbalanced or show very few minority class instances; the ideas behind Extended Nearest Neighbour (Tang and He 2015) could guide us in the sampling process. As opposed to Knn, "it considers not only who are the nearest neighbours of a certain sample, but also who consider the particular sample as their nearest neighbour" (Tang and He 2015). The method assigns a class membership to an unknown sample that maximizes the intra-class coherence. Finally, more advanced techniques do exist that can provide a K (the number of nearest neighbours) faster or with a (slightly) better performance.

Another issue relates to the lack of comprehensibility of the EasyEnsemble technique in its present form. Because we currently integrate Logistic Regression in the boosting process, the final ensemble classifier is non-linear and therefore incomprehensible. It might be interesting to investigate the numerical boosting technique proposed in Schapire and Singer (1999), where each weak learner is allowed to output real values (instead of values in [−1,1]). In that case, we would be able to use a plain linear SVM as a weak learner without the Logistic Regression component. The final ensemble would be a linear model, which is surely easier to comprehend. Also, the instance-based approach of Martens and Provost (2014) could be used to explain individual predictions of the non-linear EE-technique.

A Oversampling experiments

Table 11 Oversampling experiments. Results show the average tenfold AUC test set performance with respect to increasing β-values (β = [β1, β2, β3, β4] = [0, 1/3, 2/3, 1]). Additionally, standard deviations are shown between brackets. Optimal performances for each method are highlighted in boldface. See Section 4.2 for parameter settings.

MOV G(p = 1)      β1             β2             β3             β4
OSR               71.6 (2.62)    74.37 (2.04)   73.6 (1.84)    74.73 (2.45)
SMOTE             71.6 (2.62)    75.08 (2.18)   76.02 (2.14)   76.48 (2.3)
ADASYN            71.6 (2.62)    75.16 (1.92)   75.93 (2.08)   76.47 (2.29)

MOV G(p = 25)     β1             β2             β3             β4
OSR               81.41 (1.32)   83.49 (1.81)   83.84 (1.96)   83.91 (2.04)
SMOTE             81.41 (1.32)   83.32 (1.97)   83.59 (2.04)   83.76 (2.11)
ADASYN            81.41 (1.32)   83.61 (1.82)   84.02 (1.97)   83.69 (1.96)

Mov Th(p = [])    β1             β2             β3             β4
OSR               79.77 (5.33)   85.3 (4.66)    83.16 (4.5)    84.59 (5.69)
SMOTE             79.77 (5.33)   84.18 (6.51)   85.58 (5.97)   84.33 (5.75)
ADASYN            79.77 (5.33)   84.11 (6.77)   85.86 (6.06)   85.36 (5.13)

Yahoo A(p = 1)    β1             β2             β3             β4
OSR               55.92 (2.97)   58.66 (3.27)   59.99 (2.28)   59.74 (1.78)
SMOTE             55.92 (2.97)   59.76 (2.62)   59.74 (2.67)   59.43 (2.4)
ADASYN            55.92 (2.97)   59.54 (2.53)   59.55 (2.94)   59.56 (2.22)

Yahoo A(p = 25)   β1             β2             β3             β4
OSR               61.68 (2.42)   64.19 (3.17)   65.08 (3.26)   64.67 (2.1)
SMOTE             61.68 (2.42)   65.46 (3.63)   65.33 (3.23)   64.52 (2.98)
ADASYN            61.68 (2.42)   65.04 (3.74)   65.41 (3.47)   64.4 (2.21)

Yahoo G(p = 1)    β1             β2             β3             β4
OSR               66.84 (3.66)   72.18 (2.36)   73.11 (2.7)    72.49 (3.41)
SMOTE             66.84 (3.66)   72.65 (2.85)   73.27 (3.36)   73.37 (3.56)
ADASYN            66.84 (3.66)   72.87 (2.83)   73.18 (3.2)    73.39 (3.59)

Yahoo G(p = 25)   β1             β2             β3             β4
OSR               78.82 (1.39)   78.78 (2.02)   78.74 (1.86)   78.35 (1.47)
SMOTE             78.82 (1.39)   79.23 (1.57)   79.1 (1.2)     79.03 (1.89)
ADASYN            78.82 (1.39)   79.12 (1.43)   79.23 (1.37)   79.51 (2.01)

TaFeng(p = 1)     β1             β2             β3             β4
OSR               55.75 (1.6)    59.23 (1.96)   60 (1.68)      61.04 (2.36)
SMOTE             55.75 (1.6)    60.26 (1.95)   61.49 (1.8)    61.13 (1.52)
ADASYN            55.75 (1.6)    60.26 (1.9)    61.44 (1.85)   61.16 (1.5)

TaFeng(p = 25)    β1             β2             β3             β4
OSR               66.94 (1.34)   68.84 (1.21)   66.99 (1.42)   67.7 (1.41)
SMOTE             66.94 (1.34)   68.47 (1.5)    67.07 (1.15)   66.65 (0.81)
ADASYN            66.94 (1.34)   68.62 (1.38)   67.85 (1.6)    66.91 (1.39)

Book(p = 1)       β1             β2             β3             β4
OSR               52.6 (1.29)    53.61 (0.94)   55.41 (1.75)   55.87 (1.44)
SMOTE             52.6 (1.29)    54.77 (0.99)   54.91 (0.8)    54.36 (0.98)
ADASYN            52.6 (1.29)    54.86 (1.13)   55.06 (0.73)   54.54 (0.92)

Book(p = 25)      β1             β2             β3             β4
OSR               60.08 (0.71)   61.05 (0.94)   61.82 (1.12)   64.62 (0.57)
SMOTE             60.08 (0.71)   62.6 (0.73)    60.95 (0.68)   63 (0.8)
ADASYN            60.08 (0.71)   62.33 (0.73)   60.77 (0.85)   63.04 (0.58)

LST(p = 1)        β1             β2             β3             β4
OSR               99.99 (0.01)   99.99 (0.01)   99.99 (0.01)   99.99 (0.01)
SMOTE             99.99 (0.01)   99.99 (0.01)   99.99 (0.01)   99.99 (0.01)
ADASYN            99.99 (0.01)   99.99 (0.01)   99.99 (0.01)   99.99 (0.01)

Adver(p = [])     β1             β2             β3             β4
OSR               96.61 (1.82)   97.31 (1.65)   97.07 (1.84)   97.07 (1.79)
SMOTE             96.61 (1.82)   96.91 (1.66)   97.19 (1.65)   97.07 (1.91)
ADASYN            96.61 (1.82)   97.1 (1.7)     97.08 (1.87)   97.07 (1.88)

Adver(p = 1)      β1             β2             β3             β4
OSR               90.93 (3.02)   91.27 (3.03)   92.66 (2.82)   93.29 (1.97)
SMOTE             90.93 (3.02)   92.51 (2.03)   92.96 (2.14)   93.53 (1.81)
ADASYN            90.93 (3.02)   92.22 (2.33)   92.7 (2.36)    93.88 (1.73)

CRF(p = [])       β1             β2             β3             β4
OSR               64.06 (16.43)  80.82 (12.94)  81.28 (12.27)  81.91 (11.28)
SMOTE             64.06 (16.43)  78.64 (16.86)  82.52 (13.74)  79.32 (16.26)
ADASYN            64.06 (16.43)  78.95 (16.72)  81.19 (16.32)  79.31 (16.19)

Bank(p = [])      β1             β2             β3             β4
OSR               66.82 (0.88)   70.1 (0.74)    71.39 (0.8)    71.47 (0.8)
SMOTE             []
ADASYN            []


B Undersampling experiments

Table 12 Undersampling experiments. Results show the average tenfold AUC test set performance with respect to increasing βu-values (βu = [βu1, βu2, βu3, βu4, βu5] = [0, 1/4, 1/2, 3/4, 1]). Optimal performances (boldface) are highlighted for each imbalance ratio. Additionally, standard deviations are shown between brackets. Cl K represents the "Closest Knn" technique; Cl T represents the "Closest tot sim" technique (similar for Far K and Far T, see Section 3.2). Parameter settings are given in Section 4.2.

Mov G(p = 1)      βu1            βu2            βu3            βu4            βu5
RUS               71.6 (2.6)     71.83 (2.6)    72.54 (2.5)    72.39 (3.1)    70.61 (3.5)
Cl K              71.6 (2.6)     71.4 (2)       70.96 (1.9)    70.43 (2.4)    69.05 (4.1)
Cl T              71.6 (2.6)     70.28 (2.5)    66.74 (2)      66.8 (2.1)     68.18 (3.6)
Far K             71.6 (2.6)     72.36 (2.7)    71.26 (3.4)    66.57 (5.2)    53.5 (3.5)
Far T             71.6 (2.6)     72.22 (2.8)    71.63 (3.6)    64.28 (5.3)    50.88 (4.4)
CBU               72.55 (2.6)    73.28 (2.6)    73.12 (2.6)    73.84 (2.5)    73 (3.1)

Mov G(p = 25)     βu1            βu2            βu3            βu4            βu5
RUS               81.41 (1.3)    81.36 (1.3)    81.78 (1.7)    82.05 (1.7)    81.6 (2.1)
Cl K              81.41 (1.3)    80.86 (1.2)    80.95 (1.6)    79.73 (2.3)    77.95 (2.3)
Cl T              81.41 (1.3)    79.9 (1.2)     78.21 (1.4)    77.87 (1.5)    77.76 (2.3)
Far K             81.41 (1.3)    80.9 (1.5)     78.17 (1.8)    74.25 (2.4)    69.79 (3.2)
Far T             81.41 (1.3)    80.86 (1.5)    77.2 (2.4)     71.16 (2.7)    62.4 (2.8)
CBU               81.53 (1.4)    81.64 (1.3)    81.29 (1.6)    81.28 (2.1)    80.34 (2.7)

Mov Th(p = [])    βu1            βu2            βu3            βu4            βu5
RUS               79.77 (5.3)    80.32 (5.8)    81.57 (5.5)    81.86 (6.6)    81.26 (6.2)
Cl K              79.77 (5.3)    79.25 (4.5)    78.07 (5)      76.25 (6.5)    62.46 (8.5)
Cl T              79.77 (5.3)    78.4 (4.4)     72.41 (3.5)    64.66 (4.5)    60.37 (7.3)
Far K             79.77 (5.3)    84.54 (5)      83.64 (6.4)    80.02 (7.3)    56.82 (10.3)
Far T             79.77 (5.3)    85.03 (5.7)    82.68 (6.8)    75.61 (9.2)    56.77 (10.9)
CBU               80.11 (5.8)    81.17 (6)      81.08 (6.5)    84.17 (5.1)    80.96 (6.9)

Yahoo A(p = 1)    βu1            βu2            βu3            βu4            βu5
RUS               55.92 (3)      55.57 (3.4)    56.44 (3)      55.83 (3.4)    56.37 (3.3)
Cl K              55.92 (3)      55.67 (2.4)    53.12 (2)      50.57 (1.8)    53.79 (3.5)
Cl T              55.92 (3)      55.69 (2.1)    53.35 (2.2)    50.31 (2.2)    52.35 (3.3)
Far K             55.92 (3)      57.35 (2.2)    56.92 (1.1)    56.95 (2.3)    51.18 (2)
Far T             55.92 (3)      56.93 (2.4)    54.74 (1.9)    57.01 (1.8)    51.18 (2)
CBU               58.21 (2.6)    58.45 (3.3)    58.31 (3.5)    58.39 (3.5)    56.09 (2.6)

Yahoo A(p = 25)   βu1            βu2            βu3            βu4            βu5
RUS               61.68 (2.4)    62.9 (2.9)     63.62 (3.6)    63.75 (3.1)    63.19 (1.9)
Cl K              61.68 (2.4)    61.14 (2.1)    57.62 (1.6)    54.02 (1.8)    51.48 (1.4)
Cl T              61.68 (2.4)    60.89 (2.8)    58.11 (1.4)    54.4 (2.1)     51.76 (1.4)
Far K             61.68 (2.4)    63.96 (3)      62.62 (2.2)    59.61 (1.5)    56.25 (1.6)
Far T             61.68 (2.4)    63.71 (2.4)    59.72 (1.6)    57.27 (1.1)    54.47 (1.1)
CBU               62.46 (2.6)    61.85 (1.4)    61.78 (2.2)    59.94 (3)      60.1 (4)

Yahoo G(p = 1)    βu1            βu2            βu3            βu4            βu5
RUS               66.84 (3.7)    67.85 (3.2)    68.36 (3.2)    68.23 (4)      69.9 (4.2)
Cl K              66.84 (3.7)    66.71 (2.8)    64.3 (3.6)     61.98 (3.9)    61.15 (1.9)
Cl T              66.84 (3.7)    65.79 (2.7)    63.55 (3.3)    59.21 (3.5)    61.08 (2.4)
Far K             66.84 (3.7)    66.76 (4.1)    63.84 (3.4)    65.16 (2)      48.5 (2.9)
Far T             66.84 (3.7)    66.95 (4.1)    63.48 (2.9)    65.16 (2)      48.48 (2.9)
CBU               69.68 (4.1)    70.59 (3.2)    70.64 (3.7)    70.2 (2.9)     63.35 (3.6)

Yahoo G(p = 25)   βu1            βu2            βu3            βu4            βu5
RUS               78.82 (1.4)    78.91 (1.6)    78.97 (1.6)    78.61 (1.6)    77.82 (2.1)
Cl K              78.82 (1.4)    77.26 (1.5)    72.52 (1.5)    67.86 (2)      65.07 (2.7)
Cl T              78.82 (1.4)    76.83 (1)      71.99 (1.8)    67.15 (2.3)    61.1 (2.7)
Far K             78.82 (1.4)    78.26 (2.2)    74.69 (2.7)    67.22 (2.1)    60.72 (2.3)
Far T             78.82 (1.4)    77.68 (2.6)    72.44 (3)      64.94 (2.4)    59.6 (2)
CBU               75.25 (3.2)    75.22 (2.4)    74.69 (2.3)    73.07 (2.4)    70.69 (2.4)

TaFeng(p = 1)     βu1            βu2            βu3            βu4            βu5
RUS               55.75 (1.6)    56.1 (1.6)     56.26 (1.7)    57.23 (1.7)    59.25 (2.2)
Cl K              55.75 (1.6)    55.68 (1.6)    55.58 (1.5)    55.08 (1.1)    51.05 (1.5)
Cl T              55.75 (1.6)    55.67 (1.6)    54.47 (1.6)    47.53 (1.6)    49.3 (1.1)
Far K             55.75 (1.6)    58.99 (1.2)    59.47 (1.1)    60.04 (1.2)    56.31 (1)
Far T             55.75 (1.6)    58.92 (1.3)    59.25 (1.3)    58.58 (1.1)    56.31 (1)
CBU               57.8 (1)       58.47 (1.1)    58.15 (0.9)    58.87 (1.4)    57.65 (1.6)

TaFeng(p = 25)    βu1            βu2            βu3            βu4            βu5
RUS               66.94 (1.3)    67.44 (1.3)    68.1 (1.4)     68.27 (1.4)    66.13 (1.2)
Cl K              66.94 (1.3)    66.13 (1.4)    63.39 (1.2)    59.83 (1.3)    56.94 (0.7)
Cl T              66.94 (1.3)    66.38 (1.5)    62.89 (1.6)    57.46 (1.3)    54.56 (1.3)
Far K             66.94 (1.3)    68.06 (1.4)    66.43 (1.6)    64.46 (1.5)    63.35 (1.3)
Far T             66.94 (1.3)    64.31 (1.1)    62.69 (1)      61.27 (1.1)    59.03 (1)
CBU               64.81 (1.2)    64.15 (1.1)    64.13 (1.2)    63.88 (0.8)    63.46 (0.8)

Book(p = 1)       βu1            βu2            βu3            βu4            βu5
RUS               52.6 (1.3)     52.79 (0.9)    53.46 (0.8)    53.89 (0.9)    54.05 (0.9)
Cl K              52.6 (1.3)     52.56 (1.2)    52.52 (1.3)    52.39 (1.1)    53.09 (1.1)
Cl T              52.6 (1.3)     52.56 (1.2)    52.52 (1.3)    52.39 (1.1)    53.05 (0.7)
Far K             52.6 (1.3)     55.21 (1.2)    56.21 (1.8)    56.14 (1.2)    53.06 (1)
Far T             52.6 (1.3)     55.21 (1.2)    56.21 (1.8)    56.14 (1.2)    53.06 (1)
CBU               54.28 (0.9)    53.77 (1)      53.33 (1.1)    53.34 (0.9)    52.84 (0.8)

Book(p = 25)      βu1            βu2            βu3            βu4            βu5
RUS               60.08 (0.7)    60.13 (0.6)    60.4 (0.8)     60.33 (0.8)    63.28 (0.8)
Cl K              60.08 (0.7)    59.96 (0.7)    60.13 (0.8)    59.96 (1)      59.28 (0.7)
Cl T              60.08 (0.7)    59.96 (0.7)    60.13 (0.8)    60.29 (0.4)    54.5 (0.9)
Far K             60.08 (0.7)    63.29 (1)      64.19 (0.8)    57.3 (1.1)     55.66 (1.1)
Far T             60.08 (0.7)    62.14 (0.5)    58.27 (0.6)    56.37 (1)      55.66 (1.1)
CBU               54.82 (0.9)    54.67 (0.9)    54.71 (0.9)    54.66 (1)      54.78 (0.9)

LST(p = 1)        βu1            βu2            βu3            βu4            βu5
RUS               99.99 (0)      99.99 (0)      99.99 (0)      99.98 (0)      99.99 (0)
Cl K              99.99 (0)      99.99 (0)      99.99 (0)      99.99 (0)      99.99 (0)
Cl T              99.99 (0)      99.99 (0)      99.99 (0)      99.99 (0)      99.98 (0)
Far K             99.99 (0)      99.98 (0)      99.98 (0)      99.98 (0)      99.98 (0)
Far T             99.99 (0)      99.98 (0)      99.98 (0)      99.98 (0)      99.98 (0)
CBU               []             []             []             []             []

Adver(p = [])     βu1            βu2            βu3            βu4            βu5
RUS               96.61 (1.8)    96.32 (1.8)    96.63 (1.4)    97.12 (2.1)    96.22 (1.6)
Cl K              96.61 (1.8)    96.44 (1.5)    96.14 (1.5)    96.04 (2)      94.8 (2.5)
Cl T              96.61 (1.8)    95.87 (2.1)    94.32 (1.9)    93.01 (2.2)    90.72 (2.3)
Far K             96.61 (1.8)    96.53 (1.4)    95.76 (2)      94.39 (1.8)    90.49 (3.1)
Far T             96.61 (1.8)    96.54 (1.5)    95.67 (1.9)    94.54 (1.8)    89.3 (2.8)
CBU               96.85 (2.3)    96.85 (2.3)    97.05 (1.5)    96.6 (1.6)     96.06 (2.1)

Adver(p = 1)      βu1            βu2            βu3            βu4            βu5
RUS               90.93 (3)      91.53 (3.1)    92.37 (3.4)    91.9 (2.9)     91.93 (2.2)
Cl K              90.93 (3)      90.64 (3)      89.87 (3.9)    90.21 (3.6)    89.18 (2)
Cl T              90.93 (3)      89.7 (3.5)     88.55 (3.4)    85.76 (3.3)    88.2 (2.3)
Far K             90.93 (3)      93.8 (2.3)     92.4 (2.6)     88.73 (3.4)    85.51 (4)
Far T             90.93 (3)      93.62 (2.4)    93.2 (2.2)     88.41 (3.6)    85.51 (4)
CBU               93.22 (2.4)    93.76 (2.5)    93.89 (2.6)    93.52 (2.7)    91.27 (2)

CRF(p = [])       βu1            βu2            βu3            βu4            βu5
RUS               64.06 (16.4)   63.28 (15.9)   67.98 (17.4)   66.95 (21.9)   87.73 (8.8)
Cl K              64.06 (16.4)   62.44 (16.6)   62.34 (16.9)   71.37 (13.8)   78.22 (17.7)
Cl T              64.06 (16.4)   62.44 (16.6)   62.34 (16.9)   71.37 (13.8)   62.67 (22.9)
Far K             64.06 (16.4)   83.8 (14.2)    83.93 (14.8)   84.49 (13.7)   86.11 (9.7)
Far T             64.06 (16.4)   83.8 (14.2)    83.93 (14.8)   84.49 (13.7)   86.11 (9.7)
CBU               []             []             []             []             []

Bank(p = [])      βu1            βu2            βu3            βu4            βu5
RUS               66.82 (0.9)    67.02 (0.9)    67.37 (0.8)    67.99 (0.6)    69.5 (1)
Cl K              66.82 (0.9)    66.17 (0.7)    65.24 (0.6)    64.86 (0.6)    58.53 (1.1)
Cl T              66.82 (0.9)    64.92 (1.1)    60.69 (0.9)    56.33 (0.8)    52.87 (0.7)
Far K             66.82 (0.9)    66.95 (0.6)    66.19 (0.6)    64.42 (0.6)    58.25 (1.1)
Far T             66.82 (0.9)    67.16 (0.6)    64.2 (0.8)     59.67 (1)      58.25 (1.1)
CBU               []             []             []             []             []


C Boosting experiments

This section presents the results of each of the boosting experiments from Section 4.5 on the remaining data sources. Each of the figures below depicts the average tenfold AUC-performance on test data (with µ = 100) with respect to the number of boosting iterations for (left) Adaboost (AB), AdaCost (AC) and EasyEnsemble (EE), with C chosen according to highest validation set AUC-performance (over all possible boosting rounds), and (right) AB and EE (with S = 15) with varying C-levels.

[Figure: average tenfold AUC test performance versus the number of boosting iterations T; panels (a) and (b) as described above]

Fig 6 Mov G(p = 1) dataset

[Figure: average tenfold AUC test performance versus the number of boosting iterations T; panels (a) and (b) as described above]

Fig 7 Mov Th(p = []) dataset


[Figure: average tenfold AUC test performance versus the number of boosting iterations T; panels (a) and (b) as described above]

Fig 8 Yahoo A(p = 1) dataset

[Figure: average tenfold AUC test performance versus the number of boosting iterations T; panels (a) and (b) as described above]

Fig 9 Yahoo A(p = 25) dataset

[Figure: average tenfold AUC test performance versus the number of boosting iterations T; panels (a) and (b) as described above]

Fig 10 Yahoo G(p = 1) dataset


[Figure: average tenfold AUC test performance versus the number of boosting iterations T; panels (a) and (b) as described above]

Fig 11 Yahoo G(p = 25) dataset

[Figure: average tenfold AUC test performance versus the number of boosting iterations T; panels (a) and (b) as described above]

Fig 12 TaFeng(p = 1) dataset

[Figure: average tenfold AUC test performance versus the number of boosting iterations T; panels (a) and (b) as described above]

Fig 13 Book(p = 1) dataset


[Figure: average tenfold AUC test performance versus the number of boosting iterations T; panels (a) and (b) as described above]

Fig 14 LST(p = 1) dataset

[Figure: average tenfold AUC test performance versus the number of boosting iterations T; panels (a) and (b) as described above]

Fig 15 Adver(p = []) dataset

[Figure: average tenfold AUC test performance versus the number of boosting iterations T; panels (a) and (b) as described above]

Fig 16 Adver(p = 1) dataset


[Figure: average tenfold AUC test performance versus the number of boosting iterations T; panels (a) and (b) as described above]

Fig 17 CRF(p = []) dataset

D Final Comparison

[Figure: average rank AUC versus average rank Time; methods shown: BL, OSR, SMOTE, ADASYN, RUS, Cl Knn, Far Knn, CBU, AB, AC(R=28), AC(R=RL), EE(S=10), EE(S=15) and EE par]

Fig 18 Average rank AUC versus average rank Time (see Table 9) across the large datasets from Table 2 (CRF and Bank). Regarding the former, an extra row EE par is added, having the same AUC-value as EE(S = 15). Ranks for each dataset are subsequently obtained via the procedure outlined in Section 4.6.1. Points occurring in the upper-right region are preferred.


References

Akbani R, Kwek S, Japkowicz N (2004) Applying support vector machines to imbalanced datasets. In: Machine Learning: ECML 2004, 15th European Conference on Machine Learning, Pisa, Italy, September 20-24, 2004, Proceedings, Springer Berlin Heidelberg, Berlin, Heidelberg, pp 39–50, DOI 10.1007/978-3-540-30115-8_7

Ali A, Shamsuddin SM, Ralescu AL (2015) Classification with class imbalance problem: A review. International Journal of Advances in Soft Computing and its Applications 7(3):176–204

Alzahrani T, Horadam KJ (2016) Community detection in bipartite networks: Algorithms and case studies. In: Complex Systems and Networks: Dynamics, Controls and Applications, Springer Berlin Heidelberg, Berlin, Heidelberg, pp 25–50, DOI 10.1007/978-3-662-47824-0_2

Bachner J (2013) Predictive policing: Preventing crime with data and analytics. IBM Center for the Business of Government

Baesens B, Van Gestel T, Viaene S, Stepanova M, Suykens J, Vanthienen J (2003) Benchmarking state-of-the-art classification algorithms for credit scoring. Journal of the Operational Research Society 54(6):627–635, DOI 10.1057/palgrave.jors.2601545

Barandela R, Sánchez J, García V, Rangel E (2003) Strategies for learning in class imbalance problems. Pattern Recognition 36(3):849–851, DOI 10.1016/S0031-3203(02)00257-1

Barber MJ (2007) Modularity and community detection in bipartite networks. Physical Review E 76:066102, DOI 10.1103/PhysRevE.76.066102

Barua S, Islam MM, Yao X, Murase K (2014) MWMOTE – majority weighted minority oversampling technique for imbalanced data set learning. IEEE Transactions on Knowledge and Data Engineering 26(2):405–425, DOI 10.1109/TKDE.2012.232

Batista GEAPA, Prati RC, Monard MC (2004) A study of the behavior of several methods for balancing machine learning training data. SIGKDD Explor Newsl 6(1):20–29, DOI 10.1145/1007730.1007735

Beckett SJ (2016) Improved community detection in weighted bipartite networks. Royal Society Open Science 3(1), DOI 10.1098/rsos.140536

Bekkar M, Djemaa HK, Alitouche TA (2013) Evaluation measures for models assessment over imbalanced data sets. Journal of Information Engineering and Applications 3(10):27–38

Bhattacharyya S, Jha S, Tharakunnel K, Westland JC (2011) Data mining for credit card fraud: A comparative study. Decision Support Systems 50(3):602–613, DOI 10.1016/j.dss.2010.08.008

Blondel VD, Guillaume JL, Lambiotte R, Lefebvre E (2008) Fast unfolding of communities in large networks. Journal of Statistical Mechanics: Theory and Experiment 2008(10):P10008

Breiman L, Friedman J, Stone CJ, Olshen RA (1984) Classification and regression trees. Taylor & Francis

Brozovsky L, Petricek V (2007) Recommender system for online dating service. In: Proceedings of Znalosti 2007 Conference, VSB Ostrava

Cha M, Mislove A, Gummadi KP (2009) A measurement-driven analysis of information propagation in the Flickr social network. In: Proceedings of the 18th International Conference on World Wide Web, ACM, New York, NY, USA, WWW '09, pp 721–730, DOI 10.1145/1526709.1526806

Chawla NV (2005) Data mining for imbalanced datasets: An overview. In: Data Mining and Knowledge Discovery Handbook, Springer US, Boston, MA, pp 853–867

Chawla NV, Bowyer KW, Hall LO, Kegelmeyer WP (2002) SMOTE: synthetic minority over-sampling technique. Journal of Artificial Intelligence Research 16:321–357

Chawla NV, Lazarevic A, Hall LO, Bowyer KW (2003) SMOTEBoost: Improving prediction of the minority class in boosting. In: Knowledge Discovery in Databases: PKDD 2003, Springer Berlin Heidelberg, Berlin, Heidelberg, pp 107–119

Chawla NV, Japkowicz N, Kotcz A (2004) Editorial: Special issue on learning from imbalanced data sets. SIGKDD Explor Newsl 6(1):1–6, DOI 10.1145/1007730.1007733

Chen M, Mao S, Liu Y (2014) Big data: A survey. Mobile Networks and Applications 19(2):171–209, DOI 10.1007/s11036-013-0489-0

Chyi YM (2003) Classification analysis techniques for skewed class distribution problems. Master thesis, Department of Information Management, National Sun Yat-Sen University

Demšar J (2006) Statistical comparisons of classifiers over multiple data sets. Journal of Machine Learning Research 7(Jan):1–30

Dhillon IS (2001) Co-clustering documents and words using bipartite spectral graph partitioning. In: Proceedings of the Seventh ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, ACM, New York, NY, USA, KDD '01, pp 269–274, DOI 10.1145/502512.502550

54 Jellis Vanhoeyveld David Martens

Drummond C, Holte RC (2003) C4.5, class imbalance, and cost sensitivity: why under-sampling beats over-sampling. In: Proceedings of the ICML '03 Workshop on Learning from Imbalanced Datasets

Fan RE, Chang KW, Hsieh CJ, Wang XR, Lin CJ (2008) LIBLINEAR: A library for large linear classification. Journal of Machine Learning Research 9:1871–1874

Fan W, Stolfo SJ, Zhang J, Chan PK (1999) AdaCost: Misclassification cost-sensitive boosting. In: Proceedings of the Sixteenth International Conference on Machine Learning, Morgan Kaufmann Publishers Inc, San Francisco, CA, USA, ICML '99, pp 97–105

Fawcett T (2006) An introduction to ROC analysis. Pattern Recognition Letters 27(8):861–874, DOI 10.1016/j.patrec.2005.10.010

Finch H (2005) Comparison of distance measures in cluster analysis with dichotomous data. Journal of Data Science 3(1):85–100

Fortunato S (2010) Community detection in graphs. Physics Reports 486(3-5):75–174, DOI 10.1016/j.physrep.2009.11.002

Junqué de Fortuny E, Martens D, Provost F (2014a) Predictive modeling with big data: is bigger really better? Big Data 1(4):215–226, DOI 10.1089/big.2013.0037

Junqué de Fortuny E, Stankova M, Moeyersoms J, Minnaert B, Provost F, Martens D (2014b) Corporate residence fraud detection. In: Proceedings of the 20th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, ACM, New York, NY, USA, KDD '14, pp 1650–1659, DOI 10.1145/2623330.2623333

Frasca M, Bertoni A, Re M, Valentini G (2013) A neural network algorithm for semi-supervised node label learning from unbalanced data. Neural Networks 43:84–98, DOI 10.1016/j.neunet.2013.01.021

Friedman M (1937) The use of ranks to avoid the assumption of normality implicit in the analysis of variance. Journal of the American Statistical Association 32(200):675–701

García E, Lozano F (2007) Boosting support vector machines. In: Machine Learning and Data Mining in Pattern Recognition, 5th International Conference, MLDM 2007, Leipzig, Germany, July 18-20, Post Proceedings, IBaI publishing, pp 153–167

Goldstein M, Uchida S (2016) A comparative evaluation of unsupervised anomaly detection algorithms for multivariate data. PLOS ONE 11(4):1–31, DOI 10.1371/journal.pone.0152173

González PC, Velásquez JD (2013) Characterization and detection of taxpayers with false invoices using data mining techniques. Expert Systems with Applications 40(5):1427–1436, DOI 10.1016/j.eswa.2012.08.051

Guimerà R, Sales-Pardo M, Amaral LAN (2007) Module identification in bipartite and directed networks. Physical Review E 76:036102, DOI 10.1103/PhysRevE.76.036102

Guo H, Viktor HL (2004) Learning from imbalanced data sets with boosting and data generation: The DataBoost-IM approach. SIGKDD Explor Newsl 6(1):30–39, DOI 10.1145/1007730.1007736

Guo X, Yin Y, Dong C, Yang G, Zhou G (2008) On the class imbalance problem. In: 2008 Fourth International Conference on Natural Computation, IEEE, vol 4, pp 192–201, DOI 10.1109/ICNC.2008.871

Han H, Wang WY, Mao BH (2005) Borderline-SMOTE: A new over-sampling method in imbalanced data sets learning. In: Advances in Intelligent Computing, Springer Berlin Heidelberg, Berlin, Heidelberg, pp 878–887

He H, Garcia EA (2009) Learning from imbalanced data. IEEE Transactions on Knowledge and Data Engineering 21(9):1263–1284, DOI 10.1109/TKDE.2008.239

He H, Bai Y, Garcia EA, Li S (2008) ADASYN: Adaptive synthetic sampling approach for imbalanced learning. In: 2008 IEEE International Joint Conference on Neural Networks (IEEE World Congress on Computational Intelligence), IEEE, pp 1322–1328, DOI 10.1109/IJCNN.2008.4633969

Holm S (1979) A simple sequentially rejective multiple test procedure. Scandinavian Journal of Statistics 6(2):65–70

Hsu CW, Lin CJ (2002) A comparison of methods for multiclass support vector machines. IEEE Transactions on Neural Networks 13(2):415–425, DOI 10.1109/72.991427

Huang A (2008) Similarity measures for text document clustering. In: Proceedings of the Sixth New Zealand Computer Science Research Student Conference (NZCSRSC2008), Christchurch, New Zealand, pp 49–56

Iman RL, Davenport JM (1980) Approximations of the critical region of the Friedman statistic. Communications in Statistics - Theory and Methods 9(6):571–595

Jo T, Japkowicz N (2004) Class imbalances versus small disjuncts. ACM SIGKDD Explor Newsl 6(1):40–49, DOI 10.1145/1007730.1007737


Jutla IS, Jeub LG, Mucha PJ (2011-2016) A generalized Louvain method for community detection implemented in MATLAB. URL http://netwiki.amath.unc.edu/GenLouvain

Kubat M, Matwin S (1997) Addressing the curse of imbalanced training sets: One-sided selection. In: Proceedings of the Fourteenth International Conference on Machine Learning, Morgan Kaufmann Publishers Inc, San Francisco, CA, USA, pp 179–186

Lancichinetti A, Fortunato S (2009) Community detection algorithms: A comparative analysis. Physical Review E 80:056117, DOI 10.1103/PhysRevE.80.056117

Larremore DB, Clauset A, Jacobs AZ (2014) Efficiently inferring community structure in bipartite networks. Physical Review E 90:012805, DOI 10.1103/PhysRevE.90.012805

Li J, Fine JP (2010) Weighted area under the receiver operating characteristic curve and its application to gene selection. Journal of the Royal Statistical Society: Series C (Applied Statistics) 59(4):673–692, DOI 10.1111/j.1467-9876.2010.00713.x

Li X, Wang L, Sung E (2008) AdaBoost with SVM-based component classifiers. Engineering Applications of Artificial Intelligence 21(5):785–795, DOI 10.1016/j.engappai.2007.07.001

Lichman M (2013) UCI machine learning repository. URL http://archive.ics.uci.edu/ml

Liu W, Chawla S, Cieslak DA, Chawla NV (2010) A robust decision tree algorithm for imbalanced data sets. In: Proceedings of the Tenth SIAM International Conference on Data Mining, SIAM, Philadelphia, vol 10, pp 766–777

Liu XY, Wu J, Zhou ZH (2009) Exploratory undersampling for class-imbalance learning. IEEE Transactions on Systems, Man, and Cybernetics, Part B (Cybernetics) 39(2):539–550, DOI 10.1109/TSMCB.2008.2007853

Luts J, Ojeda F, Van de Plas R, De Moor B, Van Huffel S, Suykens JA (2010) A tutorial on support vector machine-based methods for classification problems in chemometrics. Analytica Chimica Acta 665(2):129–145, DOI 10.1016/j.aca.2010.03.030

Macskassy SA, Provost F (2007) Classification in networked data: A toolkit and a univariate case study. Journal of Machine Learning Research 8(May):935–983

Martens D, Provost F (2014) Explaining data-driven document classifications. MIS Quarterly 38(1):73–100

Martens D, Provost F, Clark J, Junqué de Fortuny E (2016) Mining massive fine-grained behavior data to improve predictive analytics. MIS Quarterly 40(4):869–888

Mazurowski MA, Habas PA, Zurada JM, Lo JY, Baker JA, Tourassi GD (2008) Training neural network classifiers for medical decision making: The effects of imbalanced datasets on classification performance. Neural Networks 21(2-3):427–436, DOI 10.1016/j.neunet.2007.12.031

Mease D, Wyner AJ, Buja A (2007) Boosted classification trees and class probability/quantile estimation. Journal of Machine Learning Research 8:409–439

Nemenyi P (1963) Distribution-free Multiple Comparisons. Dissertation, Princeton University

Newman MEJ, Girvan M (2004) Finding and evaluating community structure in networks. Physical Review E 69:026113, DOI 10.1103/PhysRevE.69.026113

Ng AY (2004) Feature selection, L1 vs. L2 regularization, and rotational invariance. In: Proceedings of the Twenty-first International Conference on Machine Learning, ACM, New York, NY, USA, ICML '04, pp 78–, DOI 10.1145/1015330.1015435

Ng AY, Jordan MI (2002) On discriminative vs. generative classifiers: A comparison of logistic regression and naive Bayes. In: Advances in Neural Information Processing Systems 14, MIT Press, pp 841–848

Ngai E, Hu Y, Wong Y, Chen Y, Sun X (2011) The application of data mining techniques in financial fraud detection: A classification framework and an academic review of literature. Decision Support Systems 50(3):559–569, DOI 10.1016/j.dss.2010.08.006

Platt JC (1999) Probabilistic outputs for support vector machines and comparisons to regularized likelihood methods. In: Advances in Large-Margin Classifiers, MIT Press, pp 61–74

Porter MA, Onnela JP, Mucha PJ (2009) Communities in networks. Notices of the American Mathematical Society 56(9):1082–1097

Provost F, Fawcett T (2013) Data Science for Business: What you need to know about data mining and data-analytic thinking. O'Reilly Media, Inc

Provost F, Dalessandro B, Hook R, Zhang X, Murray A (2009) Audience selection for on-line brand advertising: Privacy-friendly social network targeting. In: Proceedings of the 15th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, ACM, New York, NY, USA, KDD '09, pp 707–716, DOI 10.1145/1557019.1557098


Raskutti B, Kowalczyk A (2004) Extreme re-balancing for SVMs: A case study. SIGKDD Explor Newsl 6(1):60–69, DOI 10.1145/1007730.1007739

Rosvall M, Bergstrom CT (2008) Maps of random walks on complex networks reveal community structure. Proceedings of the National Academy of Sciences 105(4):1118–1123, DOI 10.1073/pnas.0706851105

Schapire RE (1999) A brief introduction to boosting. In: Proceedings of the 16th International Joint Conference on Artificial Intelligence - Volume 2, Morgan Kaufmann Publishers Inc, San Francisco, CA, USA, IJCAI '99, pp 1401–1406

Schapire RE, Singer Y (1999) Improved boosting algorithms using confidence-rated predictions. Machine Learning 37(3):297–336, DOI 10.1023/A:1007614523901

Shmueli G (2017) Analyzing behavioral big data: Methodological, practical, ethical, and moral issues. Quality Engineering 29(1):57–74, DOI 10.1080/08982112.2016.1210979

Sobhani P, Viktor H, Matwin S (2015) Learning from imbalanced data using ensemble methods and cluster-based undersampling. In: New Frontiers in Mining Complex Patterns: Third International Workshop, NFMCP 2014, Held in Conjunction with ECML-PKDD 2014, Nancy, France, September 19, 2014, Revised Selected Papers, Springer International Publishing, Cham, pp 69–83, DOI 10.1007/978-3-319-17876-9_5

Stankova M (2016) Classification within network data with a bipartite structure. Dissertation, University of Antwerp

Stankova M, Martens D, Provost F (2015) Classification over bipartite graphs through projection. Working Papers 2015001, University of Antwerp, Faculty of Applied Economics

Sun Y, Kamel MS, Wong AK, Wang Y (2007) Cost-sensitive boosting for classification of imbalanced data. Pattern Recognition 40(12):3358–3378, DOI 10.1016/j.patcog.2007.04.009

Suykens JA, Van Gestel T, De Brabanter J, De Moor B, Vandewalle J (2002) Least squares support vector machines. World Scientific, Singapore

Tang B, He H (2015) ENN: Extended nearest neighbor method for pattern recognition [research frontier]. IEEE Computational Intelligence Magazine 10(3):52–60, DOI 10.1109/MCI.2015.2437512

Tang Y, Zhang YQ, Chawla NV, Krasser S (2009) SVMs modeling for highly imbalanced classification. IEEE Transactions on Systems, Man, and Cybernetics, Part B (Cybernetics) 39(1):281–288, DOI 10.1109/TSMCB.2008.2002909

Tobback E, Moeyersoms J, Stankova M, Martens D (2016) Bankruptcy prediction for SMEs using relational data. Working Paper 2016004, University of Antwerp, Faculty of Applied Economics

Verbeke W, Dejaeger K, Martens D, Hur J, Baesens B (2012) New insights into churn prediction in the telecommunication sector: A profit driven data mining approach. European Journal of Operational Research 218(1):211–229, DOI 10.1016/j.ejor.2011.09.031

Veropoulos K, Campbell I, Cristianini N (1999) Controlling the sensitivity of support vector machines. In: Proceedings of the International Joint Conference on Artificial Intelligence, Stockholm, Sweden (IJCAI99), pp 55–60

Whitrow C, Hand DJ, Juszczak P, Weston D, Adams NM (2009) Transaction aggregation as a strategy for credit card fraud detection. Data Mining and Knowledge Discovery 18(1):30–55, DOI 10.1007/s10618-008-0116-z

Wickramaratna J, Holden SB, Buxton BF (2001) Performance degradation in boosting. In: Proceedings of the Second International Workshop on Multiple Classifier Systems, Springer, London, UK, MCS '01, pp 11–21

Yen SJ, Lee YS (2009) Cluster-based under-sampling approaches for imbalanced data distributions. Expert Systems with Applications 36(3, Part 1):5718–5727, DOI 10.1016/j.eswa.2008.06.108

Yu HF, Lo HY, Hsieh HP, Lou JK, McKenzie TG, Chou JW, Chung PH, Ho CH, Chang CF, Wei YH, et al (2010) Feature engineering and classifier ensemble for KDD Cup 2010. In: Proceedings of the KDD Cup 2010 Workshop, pp 1–16

Zha H, He X, Ding C, Simon H, Gu M (2001) Bipartite graph partitioning and data clustering. In: Proceedings of the Tenth International Conference on Information and Knowledge Management, ACM, New York, NY, USA, CIKM '01, pp 25–32, DOI 10.1145/502585.502591

Zhang J, Mani I (2003) Knn approach to unbalanced data distributions: A case study involving information extraction. In: Proceedings of the ICML 2003 Workshop on Learning from Imbalanced Datasets, Washington, DC

Ziegler CN, McNee SM, Konstan JA, Lausen G (2005) Improving recommendation lists through topic diversification. In: Proceedings of the 14th International Conference on World Wide Web, ACM, New York, NY, USA, WWW '05, pp 22–32, DOI 10.1145/1060745.1060754



6 Conclusions and future research directions

We investigated the effects of sampling and cost-sensitive learning techniques on the problem of learning from imbalanced behaviour data. This type of big data has fundamentally different properties and internal structure than traditional data and is characterized by sparseness and very large dimensions. This is a new research topic which enables benefits across a wide variety of application domains such as fraud detection, targeted advertising, predictive policing, churn prediction, default prediction, etc. To enable future developments in this field, we provide an on-line repository containing implementations, datasets and results, available at the following link: http://www.applieddatamining.com/cms?q=software

Oversampling techniques need to be adapted in order to cope with this kind of data. We proposed new versions of the traditional SMOTE and ADASYN techniques, which differ in the way synthetic samples are generated and similarity computations are performed. The plain random oversampling (OSR) and the synthetic approaches are competitive methods in terms of AUC and do not seem to suffer from overfitting, as traditional studies report. Though the synthetic approaches have comparable AUC-performances to OSR, they are useless in practice due to their computationally expensive procedures. The OSR method is very fast for small and medium sized datasets, yet it becomes inappropriate for larger datasets because the training times of the underlying base learner increase drastically.
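To make the OSR mechanism concrete, a minimal sketch on a sparse behaviour matrix follows. This is our own illustration, not the authors' released code; the interpretation of β as a linear interpolation between the original class ratio and full balance is an assumption based on the paper's setup, and `random_oversample` is a name we introduce.

```python
import numpy as np
from scipy.sparse import csr_matrix, vstack

def random_oversample(X, y, beta=1.0, seed=0):
    """Plain random oversampling (OSR): duplicate randomly chosen minority
    rows of the sparse matrix X until the minority count has moved a
    fraction beta of the way towards full balance (beta = 1 -> balanced)."""
    rng = np.random.default_rng(seed)
    y = np.asarray(y)
    min_idx = np.flatnonzero(y == 1)
    maj_idx = np.flatnonzero(y == 0)
    target = int(len(min_idx) + beta * (len(maj_idx) - len(min_idx)))
    extra = rng.choice(min_idx, size=target - len(min_idx), replace=True)
    X_new = vstack([X, X[extra]], format="csr")   # duplicated rows stay sparse
    y_new = np.concatenate([y, np.ones(len(extra), dtype=y.dtype)])
    return X_new, y_new

X = csr_matrix(np.eye(6))                 # 6 instances, 6 sparse features
y = np.array([1, 0, 0, 0, 0, 0])          # 1 minority vs. 5 majority
X_os, y_os = random_oversample(X, y, beta=1.0)
```

Since only row indices are duplicated, the cost is independent of the (very large) feature dimension, which is why OSR remains fast on sparse behaviour data.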

With respect to undersampling methodologies, we employed a random undersampling technique (RUS) and Knn-based informed approaches. The latter type selects majority class instances that are closest (Cl Knn) or farthest (Far Knn) from the minority class examples and allows one to investigate the effect of noise/outliers on SVM performance. We observed that these odd instances have a severe performance degrading effect. RUS and Far Knn approaches can significantly improve performance with respect to the baseline thanks to their abilities to reduce noise/outlier levels and to put more emphasis on the minority class. Having said this, they usually perform worse than any of the oversampling approaches. This is in sheer contrast to the results obtained from traditional studies and is heavily related to the intrinsic nature of behaviour data. Timing-wise, RUS is the fastest method. The informed approaches are slower than RUS due to the required nearest neighbour computations. Additionally, a cluster-based undersampling (CBU) technique was proposed to alleviate the within-class imbalance. For low undersampling rates the performance of CBU is superior to plain RUS, whereas for high undersampling rates the opposite is true. Overall, no substantial gain in performance is obtained in comparison with a simple RUS approach. Furthermore, the method is the slowest among all techniques considered and relies on computationally expensive clustering heuristics.
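The "farthest" informed-undersampling idea can be sketched as follows. This is a simplified illustration under our own assumptions: the paper's Far Knn ranks majority instances via nearest-neighbour computations, which we replace here by a mean cosine similarity to the whole minority class for brevity; `far_undersample` is a hypothetical name.

```python
import numpy as np

def far_undersample(X, y, keep):
    """'Far'-style informed undersampling, simplified: rank majority
    instances by mean cosine similarity to the minority class and retain
    only the `keep` farthest ones (the closest majority instances are
    the noisy/odd ones that degrade SVM performance)."""
    norm = lambda A: A / (np.linalg.norm(A, axis=1, keepdims=True) + 1e-12)
    min_idx = np.flatnonzero(y == 1)
    maj_idx = np.flatnonzero(y == 0)
    sims = (norm(X[maj_idx]) @ norm(X[min_idx]).T).mean(axis=1)
    kept = maj_idx[np.argsort(sims)[:keep]]       # lowest similarity first
    sel = np.concatenate([min_idx, kept])
    return X[sel], y[sel]

X = np.array([[1.0, 0.0], [0.9, 0.1],              # minority
              [0.0, 1.0], [0.1, 0.9], [0.8, 0.2]]) # majority
y = np.array([1, 1, 0, 0, 0])
X_us, y_us = far_undersample(X, y, keep=2)  # drops [0.8, 0.2], the majority
                                            # point closest to the minority
```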

We studied the effects of boosting, where we made use of a very specific linear SVM and logistic regression combination that uses confidence-rated predictions instead of a more popular plain "discrete" boosting algorithm (with weak hypotheses having values in {-1,1}). Our experiments clearly indicated the regularization constant C in the SVM formulation to act as a "weakness" indicator. Indeed, higher C-values cause stronger learners and shouldn't be used in the boosting process, which matches the analysis of studies dealing with traditional data. Strong learners can already quite accurately distinguish majority class from minority class behaviours. The boosting process will focus on the hard to learn instances, among which are noise/outliers, and this can result in overfitting behaviour. We also observed that boosting can usually outperform the plain baseline, indicating that a combination of weak learners can achieve higher AUC-performance than a single strong learner.
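The confidence-rated weight update at the heart of this scheme (Schapire and Singer 1999) can be sketched in a few lines. The toy scores below stand in for the SVM/LR weak learner and are not the authors' implementation; `boost_round` is a name we introduce.

```python
import numpy as np

def boost_round(weights, y, f_scores):
    """One confidence-rated boosting round (Schapire and Singer, 1999):
    the weak hypothesis outputs real-valued scores f(x) rather than
    labels in {-1,+1}; examples with negative margin y*f(x) gain weight."""
    margins = y * f_scores                       # real-valued margins
    r = np.sum(weights * margins)
    alpha = 0.5 * np.log((1 + r) / (1 - r))      # confidence-rated step size
    weights = weights * np.exp(-alpha * margins)
    return weights / weights.sum(), alpha

w = np.full(4, 0.25)
y = np.array([+1, +1, -1, -1])
f = np.array([0.8, -0.2, -0.9, 0.4])             # weak-learner scores in [-1, 1]
w, alpha = boost_round(w, y, f)
# the misclassified examples (indices 1 and 3) now carry more weight
```

The update makes the overfitting risk noted above visible: noisy examples keep a negative margin round after round, so their weights grow geometrically, and a strong (high-C) weak learner leaves little room for the reweighting to help.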

EasyEnsemble (EE) integrates random undersampling with a boosting process. By sampling several balanced subsets from the training data and combining these, we can emphasize the minority class, achieve noise/outlier reduction effects and simultaneously explore the majority class space efficiently. Our version differs from the original formulation because we employ a different type of underlying weak learner (SVM and LR combination) that is fed to a confidence-rated boosting algorithm (instead of the discrete version). We observed superior AUC-performance with respect to all of the aforementioned techniques. Furthermore, the method is very fast, since each subset can be trained in parallel and has a size that is twice as large as the minority class training set. We studied the effect of the number of subsets and noted that a limited amount of these (O(10)) is already sufficient to obtain satisfactory performances; increasing this number further causes only minor benefits.
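The subset-sampling and combination steps of EE can be sketched as follows. This is a minimal illustration under stated assumptions: a nearest-centroid scorer stands in for the paper's boosted SVM/LR weak learner, and all function names are ours.

```python
import numpy as np

def fit_weak(Xs, ys):
    """Toy stand-in for the boosted SVM/LR member: a nearest-centroid
    scorer. Returns a function mapping X to decision scores."""
    d = Xs[ys == 1].mean(axis=0) - Xs[ys == 0].mean(axis=0)
    return lambda X: X @ d

def easy_ensemble_fit(X, y, n_subsets=10, seed=0):
    """EasyEnsemble sketch: draw several balanced subsets (all minority
    examples plus an equal-sized random draw of the majority) and train
    one learner per subset; each subset is only ~2x the minority size,
    and the loop is trivially parallelizable."""
    rng = np.random.default_rng(seed)
    min_idx = np.flatnonzero(y == 1)
    maj_idx = np.flatnonzero(y == 0)
    models = []
    for _ in range(n_subsets):
        sub = np.concatenate([min_idx,
                              rng.choice(maj_idx, size=len(min_idx), replace=False)])
        models.append(fit_weak(X[sub], y[sub]))
    return models

def easy_ensemble_score(models, X):
    """Combine the members by averaging their decision scores."""
    return np.mean([m(X) for m in models], axis=0)

X = np.array([[1.0, 1.0], [0.9, 1.1],                    # minority cluster
              [0.0, 0.1], [0.1, 0.0], [0.0, 0.0],
              [0.05, 0.05], [0.1, 0.1], [0.0, 0.05]])    # majority cluster
y = np.array([1, 1, 0, 0, 0, 0, 0, 0])
models = easy_ensemble_fit(X, y, n_subsets=5, seed=1)
scores = easy_ensemble_score(models, X)   # minority points score highest
```

Because every draw of the majority is independent, different subsets see different majority regions, which is how the ensemble explores the majority class space while each member trains on a tiny balanced sample.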

Additionally, a statistical comparison of each of the proposed methods was conducted across a wide variety of datasets. Holm's significance test indicates that each of the proposed techniques (oversampling, undersampling except Cl Knn, cost-sensitive learning and boosting variants) can significantly outperform the plain baseline performance (each of the null-hypotheses of equal performance gets rejected with a confidence level of at least 91%). Comparison of the techniques with EE showed that the latter can significantly outperform all of them (each of the null-hypotheses of equal performance gets rejected with a confidence level of at least 75%). A timing-wise comparison showed EE to be the preferred method (after RUS) for large and highly imbalanced datasets. A parallel implementation would make EE the fastest method even for medium sized datasets that show a high level of imbalance.
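Holm's step-down procedure referred to here is simple to state: sort the m p-values ascending and compare the i-th smallest against α/(m−i); once one comparison fails, all larger p-values are retained as well. A generic sketch (our illustration, not tied to the paper's exact pairwise comparisons):

```python
def holm_reject(p_values, alpha=0.05):
    """Holm (1979) sequentially rejective test: returns a per-hypothesis
    rejection flag controlling the family-wise error rate at alpha."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])  # ascending p-values
    reject = [False] * m
    for rank, i in enumerate(order):
        if p_values[i] <= alpha / (m - rank):            # step-down threshold
            reject[i] = True
        else:
            break                                        # retain this and all larger
    return reject

# e.g. comparing four methods against a baseline
print(holm_reject([0.01, 0.04, 0.03, 0.20], alpha=0.05))
# → [True, False, False, False]
```

Note the step-down structure: 0.01 passes its threshold 0.05/4, but 0.03 fails 0.05/3, so 0.04 and 0.20 are retained without further testing.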

The influence of the chosen base learner was investigated by comparing the baseline to its EE-version for the SVM, LR, NB and BeSim inducers. Each of the EE-versions is able to improve upon the baseline performance for nearly all datasets, where base learners of the regularization type are superior because of their abilities to control the strength of the learner.

Though we conducted an extensive benchmark study, there are still plenty of future research directions. First and most important, there is an urgent need to develop a more general theoretical framework to completely understand the effect of imbalance and why certain methods outperform others. Secondly, other types of mechanisms could be investigated in this setting: advanced sampling (Guo et al 2008) set-ups perform sampling based on preliminary classifications; one-class learning is a recognition based methodology where a discriminative boundary is learned around the training examples of a single class alone, and Raskutti and Kowalczyk (2004) demonstrated the superiority of the recognition based one-class SVMs over the discriminative two-class SVMs; other types of ensemble learning strategies (e.g. bagging and stacking) could be investigated; Batista et al (2004) note that combining over- and undersampling can be beneficial when data are highly imbalanced or show very few minority class instances; the ideas behind Extended Nearest Neighbour (Tang and He 2015) could guide us in the sampling process. As opposed to Knn, "it considers not only who are the nearest neighbours of a certain sample, but also who consider the particular sample as their nearest neighbour" (Tang and He 2015). The method assigns a class membership to an unknown sample that maximizes the intra-class coherence. Finally, more advanced techniques do exist that can provide a K (the number of nearest neighbours) faster or with a (slightly) better performance.

Another issue relates to the lack of comprehensibility of the EasyEnsemble technique in its present form. Because we currently integrate Logistic Regression in the boosting process, the final ensemble classifier is non-linear and therefore incomprehensible. It might be interesting to investigate the numerical boosting technique proposed in Schapire and Singer (1999), where each weak learner is allowed to output real values (instead of values in {-1,1}). In that case, we would be able to use a plain linear SVM as a weak learner without the Logistic Regression component. The final ensemble would be a linear model, which is surely easier to comprehend. Also, the instance-based approach of Martens and Provost (2014) could be used to explain individual predictions of the non-linear EE-technique.
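The claim that such an ensemble would collapse into a single linear model is easy to verify: with linear weak learners f_t(x) = w_t·x + b_t and round weights α_t, the ensemble sum Σ_t α_t f_t(x) equals (Σ_t α_t w_t)·x + Σ_t α_t b_t. A small numeric check with illustrative values of our own choosing:

```python
import numpy as np

# weighted ensemble of linear weak learners: F(x) = sum_t alpha_t * (w_t . x + b_t)
alphas = np.array([0.7, 0.5, 0.3])                     # boosting round weights
W = np.array([[1.0, -2.0], [0.5, 0.5], [-1.0, 1.0]])   # one weight vector per round
b = np.array([0.1, -0.2, 0.0])

# collapse into one linear model: w* = sum alpha_t w_t, b* = sum alpha_t b_t
w_star = alphas @ W
b_star = alphas @ b

x = np.array([2.0, 1.0])
ensemble = np.sum(alphas * (W @ x + b))
single = w_star @ x + b_star
assert np.isclose(ensemble, single)    # identical decision values
```

The collapsed weight vector w* can then be inspected feature by feature, which is exactly the comprehensibility benefit argued for above.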

A Oversampling experiments

Table 11 Oversampling experiments. Results show the average tenfold AUC test set performance with respect to increasing β-values (β = [β1, β2, β3, β4] = [0, 1/3, 2/3, 1]). Additionally, standard deviations are shown between brackets. Optimal performances for each method are highlighted in boldface. See Section 4.2 for parameter settings.

Mov G(p = 1)
         β1            β2            β3            β4
OSR      71.6 (2.62)   74.37 (2.04)  73.6 (1.84)   74.73 (2.45)
SMOTE    71.6 (2.62)   75.08 (2.18)  76.02 (2.14)  76.48 (2.3)
ADASYN   71.6 (2.62)   75.16 (1.92)  75.93 (2.08)  76.47 (2.29)

Mov G(p = 25)
         β1            β2            β3            β4
OSR      81.41 (1.32)  83.49 (1.81)  83.84 (1.96)  83.91 (2.04)
SMOTE    81.41 (1.32)  83.32 (1.97)  83.59 (2.04)  83.76 (2.11)
ADASYN   81.41 (1.32)  83.61 (1.82)  84.02 (1.97)  83.69 (1.96)

Mov Th(p = [])
         β1            β2            β3            β4
OSR      79.77 (5.33)  85.3 (4.66)   83.16 (4.5)   84.59 (5.69)
SMOTE    79.77 (5.33)  84.18 (6.51)  85.58 (5.97)  84.33 (5.75)
ADASYN   79.77 (5.33)  84.11 (6.77)  85.86 (6.06)  85.36 (5.13)

Yahoo A(p = 1)
         β1            β2            β3            β4
OSR      55.92 (2.97)  58.66 (3.27)  59.99 (2.28)  59.74 (1.78)
SMOTE    55.92 (2.97)  59.76 (2.62)  59.74 (2.67)  59.43 (2.4)
ADASYN   55.92 (2.97)  59.54 (2.53)  59.55 (2.94)  59.56 (2.22)

Yahoo A(p = 25)
         β1            β2            β3            β4
OSR      61.68 (2.42)  64.19 (3.17)  65.08 (3.26)  64.67 (2.1)
SMOTE    61.68 (2.42)  65.46 (3.63)  65.33 (3.23)  64.52 (2.98)
ADASYN   61.68 (2.42)  65.04 (3.74)  65.41 (3.47)  64.4 (2.21)

Yahoo G(p = 1)
         β1            β2            β3            β4
OSR      66.84 (3.66)  72.18 (2.36)  73.11 (2.7)   72.49 (3.41)
SMOTE    66.84 (3.66)  72.65 (2.85)  73.27 (3.36)  73.37 (3.56)
ADASYN   66.84 (3.66)  72.87 (2.83)  73.18 (3.2)   73.39 (3.59)

Yahoo G(p = 25)
         β1            β2            β3            β4
OSR      78.82 (1.39)  78.78 (2.02)  78.74 (1.86)  78.35 (1.47)
SMOTE    78.82 (1.39)  79.23 (1.57)  79.1 (1.2)    79.03 (1.89)
ADASYN   78.82 (1.39)  79.12 (1.43)  79.23 (1.37)  79.51 (2.01)

TaFeng(p = 1)
         β1            β2            β3            β4
OSR      55.75 (1.6)   59.23 (1.96)  60 (1.68)     61.04 (2.36)
SMOTE    55.75 (1.6)   60.26 (1.95)  61.49 (1.8)   61.13 (1.52)
ADASYN   55.75 (1.6)   60.26 (1.9)   61.44 (1.85)  61.16 (1.5)

TaFeng(p = 25)
         β1            β2            β3            β4
OSR      66.94 (1.34)  68.84 (1.21)  66.99 (1.42)  67.7 (1.41)
SMOTE    66.94 (1.34)  68.47 (1.5)   67.07 (1.15)  66.65 (0.81)
ADASYN   66.94 (1.34)  68.62 (1.38)  67.85 (1.6)   66.91 (1.39)

Book(p = 1)
         β1            β2            β3            β4
OSR      52.6 (1.29)   53.61 (0.94)  55.41 (1.75)  55.87 (1.44)
SMOTE    52.6 (1.29)   54.77 (0.99)  54.91 (0.8)   54.36 (0.98)
ADASYN   52.6 (1.29)   54.86 (1.13)  55.06 (0.73)  54.54 (0.92)

Book(p = 25)
         β1            β2            β3            β4
OSR      60.08 (0.71)  61.05 (0.94)  61.82 (1.12)  64.62 (0.57)
SMOTE    60.08 (0.71)  62.6 (0.73)   60.95 (0.68)  63 (0.8)
ADASYN   60.08 (0.71)  62.33 (0.73)  60.77 (0.85)  63.04 (0.58)

LST(p = 1)
         β1            β2            β3            β4
OSR      99.99 (0.01)  99.99 (0.01)  99.99 (0.01)  99.99 (0.01)
SMOTE    99.99 (0.01)  99.99 (0.01)  99.99 (0.01)  99.99 (0.01)
ADASYN   99.99 (0.01)  99.99 (0.01)  99.99 (0.01)  99.99 (0.01)

Adver(p = [])
         β1            β2            β3            β4
OSR      96.61 (1.82)  97.31 (1.65)  97.07 (1.84)  97.07 (1.79)
SMOTE    96.61 (1.82)  96.91 (1.66)  97.19 (1.65)  97.07 (1.91)
ADASYN   96.61 (1.82)  97.1 (1.7)    97.08 (1.87)  97.07 (1.88)

Adver(p = 1)
         β1            β2            β3            β4
OSR      90.93 (3.02)  91.27 (3.03)  92.66 (2.82)  93.29 (1.97)
SMOTE    90.93 (3.02)  92.51 (2.03)  92.96 (2.14)  93.53 (1.81)
ADASYN   90.93 (3.02)  92.22 (2.33)  92.7 (2.36)   93.88 (1.73)

CRF(p = [])
         β1            β2             β3             β4
OSR      64.06 (16.43) 80.82 (12.94)  81.28 (12.27)  81.91 (11.28)
SMOTE    64.06 (16.43) 78.64 (16.86)  82.52 (13.74)  79.32 (16.26)
ADASYN   64.06 (16.43) 78.95 (16.72)  81.19 (16.32)  79.31 (16.19)

Bank(p = [])
         β1            β2            β3            β4
OSR      66.82 (0.88)  70.1 (0.74)   71.39 (0.8)   71.47 (0.8)
SMOTE    []            []            []            []
ADASYN   []            []            []            []


B Undersampling experiments

Table 12 Undersampling experiments. Results show the average tenfold AUC test set performance with respect to increasing βu-values (βu = [βu1, βu2, βu3, βu4, βu5] = [0, 1/4, 1/2, 3/4, 1]). Optimal performances (boldface) are highlighted for each imbalance ratio. Additionally, standard deviations are shown between brackets. Cl K represents the "Closest Knn" technique; Cl T represents the "Closest tot sim" technique (similar for Far K and Far T, see Section 3.2). Parameter settings are given in Section 4.2.

Mov G(p = 1)βu1 βu2 βu3 βu4 βu5

RUS 716(26) 7183(26) 7254(25) 7239(31) 7061(35)Cl K 716(26) 714(2) 7096(19) 7043(24) 6905(41)CL T 716(26) 7028(25) 6674(2) 668(21) 6818(36)Far K 716(26) 7236(27) 7126(34) 6657(52) 535(35)Far T 716(26) 7222(28) 7163(36) 6428(53) 5088(44)CBU 7255(26) 7328(26) 7312(26) 7384(25) 73(31)

Mov G(p = 25)βu1 βu2 βu3 βu4 βu5

RUS 8141(13) 8136(13) 8178(17) 8205(17) 816(21)Cl K 8141(13) 8086(12) 8095(16) 7973(23) 7795(23)CL T 8141(13) 799(12) 7821(14) 7787(15) 7776(23)Far K 8141(13) 809(15) 7817(18) 7425(24) 6979(32)Far T 8141(13) 8086(15) 772(24) 7116(27) 624(28)CBU 8153(14) 8164(13) 8129(16) 8128(21) 8034(27)

Mov Th(p = [])βu1 βu2 βu3 βu4 βu5

RUS 7977(53) 8032(58) 8157(55) 8186(66) 8126(62)Cl K 7977(53) 7925(45) 7807(5) 7625(65) 6246(85)CL T 7977(53) 784(44) 7241(35) 6466(45) 6037(73)Far K 7977(53) 8454(5) 8364(64) 8002(73) 5682(103)Far T 7977(53) 8503(57) 8268(68) 7561(92) 5677(109)CBU 8011(58) 8117(6) 8108(65) 8417(51) 8096(69)

Yahoo A(p = 1)βu1 βu2 βu3 βu4 βu5

RUS 5592(3) 5557(34) 5644(3) 5583(34) 5637(33)Cl K 5592(3) 5567(24) 5312(2) 5057(18) 5379(35)CL T 5592(3) 5569(21) 5335(22) 5031(22) 5235(33)Far K 5592(3) 5735(22) 5692(11) 5695(23) 5118(2)Far T 5592(3) 5693(24) 5474(19) 5701(18) 5118(2)CBU 5821(26) 5845(33) 5831(35) 5839(35) 5609(26)

Yahoo A(p = 25)βu1 βu2 βu3 βu4 βu5

RUS 6168(24) 629(29) 6362(36) 6375(31) 6319(19)Cl K 6168(24) 6114(21) 5762(16) 5402(18) 5148(14)CL T 6168(24) 6089(28) 5811(14) 544(21) 5176(14)Far K 6168(24) 6396(3) 6262(22) 5961(15) 5625(16)Far T 6168(24) 6371(24) 5972(16) 5727(11) 5447(11)CBU 6246(26) 6185(14) 6178(22) 5994(3) 601(4)Continues on next page

46 Jellis Vanhoeyveld David Martens

Table 12 continued

Yahoo G(p = 1)
        βu1          βu2          βu3          βu4          βu5
RUS     66.84 (3.7)  67.85 (3.2)  68.36 (3.2)  68.23 (4)    69.9 (4.2)
Cl K    66.84 (3.7)  66.71 (2.8)  64.3 (3.6)   61.98 (3.9)  61.15 (1.9)
CL T    66.84 (3.7)  65.79 (2.7)  63.55 (3.3)  59.21 (3.5)  61.08 (2.4)
Far K   66.84 (3.7)  66.76 (4.1)  63.84 (3.4)  65.16 (2)    48.5 (2.9)
Far T   66.84 (3.7)  66.95 (4.1)  63.48 (2.9)  65.16 (2)    48.48 (2.9)
CBU     69.68 (4.1)  70.59 (3.2)  70.64 (3.7)  70.2 (2.9)   63.35 (3.6)

Yahoo G(p = 25)
        βu1          βu2          βu3          βu4          βu5
RUS     78.82 (1.4)  78.91 (1.6)  78.97 (1.6)  78.61 (1.6)  77.82 (2.1)
Cl K    78.82 (1.4)  77.26 (1.5)  72.52 (1.5)  67.86 (2)    65.07 (2.7)
CL T    78.82 (1.4)  76.83 (1)    71.99 (1.8)  67.15 (2.3)  61.1 (2.7)
Far K   78.82 (1.4)  78.26 (2.2)  74.69 (2.7)  67.22 (2.1)  60.72 (2.3)
Far T   78.82 (1.4)  77.68 (2.6)  72.44 (3)    64.94 (2.4)  59.6 (2)
CBU     75.25 (3.2)  75.22 (2.4)  74.69 (2.3)  73.07 (2.4)  70.69 (2.4)

TaFeng(p = 1)
        βu1          βu2          βu3          βu4          βu5
RUS     55.75 (1.6)  56.1 (1.6)   56.26 (1.7)  57.23 (1.7)  59.25 (2.2)
Cl K    55.75 (1.6)  55.68 (1.6)  55.58 (1.5)  55.08 (1.1)  51.05 (1.5)
CL T    55.75 (1.6)  55.67 (1.6)  54.47 (1.6)  47.53 (1.6)  49.3 (1.1)
Far K   55.75 (1.6)  58.99 (1.2)  59.47 (1.1)  60.04 (1.2)  56.31 (1)
Far T   55.75 (1.6)  58.92 (1.3)  59.25 (1.3)  58.58 (1.1)  56.31 (1)
CBU     57.8 (1)     58.47 (1.1)  58.15 (0.9)  58.87 (1.4)  57.65 (1.6)

TaFeng(p = 25)
        βu1          βu2          βu3          βu4          βu5
RUS     66.94 (1.3)  67.44 (1.3)  68.1 (1.4)   68.27 (1.4)  66.13 (1.2)
Cl K    66.94 (1.3)  66.13 (1.4)  63.39 (1.2)  59.83 (1.3)  56.94 (0.7)
CL T    66.94 (1.3)  66.38 (1.5)  62.89 (1.6)  57.46 (1.3)  54.56 (1.3)
Far K   66.94 (1.3)  68.06 (1.4)  66.43 (1.6)  64.46 (1.5)  63.35 (1.3)
Far T   66.94 (1.3)  64.31 (1.1)  62.69 (1)    61.27 (1.1)  59.03 (1)
CBU     64.81 (1.2)  64.15 (1.1)  64.13 (1.2)  63.88 (0.8)  63.46 (0.8)

Book(p = 1)
        βu1          βu2          βu3          βu4          βu5
RUS     52.6 (1.3)   52.79 (0.9)  53.46 (0.8)  53.89 (0.9)  54.05 (0.9)
Cl K    52.6 (1.3)   52.56 (1.2)  52.52 (1.3)  52.39 (1.1)  53.09 (1.1)
CL T    52.6 (1.3)   52.56 (1.2)  52.52 (1.3)  52.39 (1.1)  53.05 (0.7)
Far K   52.6 (1.3)   55.21 (1.2)  56.21 (1.8)  56.14 (1.2)  53.06 (1)
Far T   52.6 (1.3)   55.21 (1.2)  56.21 (1.8)  56.14 (1.2)  53.06 (1)
CBU     54.28 (0.9)  53.77 (1)    53.33 (1.1)  53.34 (0.9)  52.84 (0.8)

Book(p = 25)
        βu1          βu2          βu3          βu4          βu5
RUS     60.08 (0.7)  60.13 (0.6)  60.4 (0.8)   60.33 (0.8)  63.28 (0.8)
Cl K    60.08 (0.7)  59.96 (0.7)  60.13 (0.8)  59.96 (1)    59.28 (0.7)
CL T    60.08 (0.7)  59.96 (0.7)  60.13 (0.8)  60.29 (0.4)  54.5 (0.9)
Far K   60.08 (0.7)  63.29 (1)    64.19 (0.8)  57.3 (1.1)   55.66 (1.1)
Far T   60.08 (0.7)  62.14 (0.5)  58.27 (0.6)  56.37 (1)    55.66 (1.1)
CBU     54.82 (0.9)  54.67 (0.9)  54.71 (0.9)  54.66 (1)    54.78 (0.9)

Imbalanced classification in sparse and large behaviour datasets 47

Table 12 continued

LST(p = 1)
        βu1          βu2          βu3          βu4          βu5
RUS     99.99 (0)    99.99 (0)    99.99 (0)    99.98 (0)    99.99 (0)
Cl K    99.99 (0)    99.99 (0)    99.99 (0)    99.99 (0)    99.99 (0)
CL T    99.99 (0)    99.99 (0)    99.99 (0)    99.99 (0)    99.98 (0)
Far K   99.99 (0)    99.98 (0)    99.98 (0)    99.98 (0)    99.98 (0)
Far T   99.99 (0)    99.98 (0)    99.98 (0)    99.98 (0)    99.98 (0)
CBU     []           []           []           []           []

Adver(p = [])
        βu1          βu2          βu3          βu4          βu5
RUS     96.61 (1.8)  96.32 (1.8)  96.63 (1.4)  97.12 (2.1)  96.22 (1.6)
Cl K    96.61 (1.8)  96.44 (1.5)  96.14 (1.5)  96.04 (2)    94.8 (2.5)
CL T    96.61 (1.8)  95.87 (2.1)  94.32 (1.9)  93.01 (2.2)  90.72 (2.3)
Far K   96.61 (1.8)  96.53 (1.4)  95.76 (2)    94.39 (1.8)  90.49 (3.1)
Far T   96.61 (1.8)  96.54 (1.5)  95.67 (1.9)  94.54 (1.8)  89.3 (2.8)
CBU     96.85 (2.3)  96.85 (2.3)  97.05 (1.5)  96.6 (1.6)   96.06 (2.1)

Adver(p = 1)
        βu1          βu2          βu3          βu4          βu5
RUS     90.93 (3)    91.53 (3.1)  92.37 (3.4)  91.9 (2.9)   91.93 (2.2)
Cl K    90.93 (3)    90.64 (3)    89.87 (3.9)  90.21 (3.6)  89.18 (2)
CL T    90.93 (3)    89.7 (3.5)   88.55 (3.4)  85.76 (3.3)  88.2 (2.3)
Far K   90.93 (3)    93.8 (2.3)   92.4 (2.6)   88.73 (3.4)  85.51 (4)
Far T   90.93 (3)    93.62 (2.4)  93.2 (2.2)   88.41 (3.6)  85.51 (4)
CBU     93.22 (2.4)  93.76 (2.5)  93.89 (2.6)  93.52 (2.7)  91.27 (2)

CRF(p = [])
        βu1           βu2           βu3           βu4           βu5
RUS     64.06 (16.4)  63.28 (15.9)  67.98 (17.4)  66.95 (21.9)  87.73 (8.8)
Cl K    64.06 (16.4)  62.44 (16.6)  62.34 (16.9)  71.37 (13.8)  78.22 (17.7)
CL T    64.06 (16.4)  62.44 (16.6)  62.34 (16.9)  71.37 (13.8)  62.67 (22.9)
Far K   64.06 (16.4)  83.8 (14.2)   83.93 (14.8)  84.49 (13.7)  86.11 (9.7)
Far T   64.06 (16.4)  83.8 (14.2)   83.93 (14.8)  84.49 (13.7)  86.11 (9.7)
CBU     []            []            []            []            []

Bank(p = [])
        βu1          βu2          βu3          βu4          βu5
RUS     66.82 (0.9)  67.02 (0.9)  67.37 (0.8)  67.99 (0.6)  69.5 (1)
Cl K    66.82 (0.9)  66.17 (0.7)  65.24 (0.6)  64.86 (0.6)  58.53 (1.1)
CL T    66.82 (0.9)  64.92 (1.1)  60.69 (0.9)  56.33 (0.8)  52.87 (0.7)
Far K   66.82 (0.9)  66.95 (0.6)  66.19 (0.6)  64.42 (0.6)  58.25 (1.1)
Far T   66.82 (0.9)  67.16 (0.6)  64.2 (0.8)   59.67 (1)    58.25 (1.1)
CBU     []           []           []           []           []


C Boosting experiments

This section presents the results of each of the boosting experiments from Section 4.5 on the remaining data sources. Each of the figures below depicts the average tenfold AUC-performance on test data (with µ = 100) with respect to the number of boosting iterations for (left) Adaboost (AB), AdaCost (AC) and EasyEnsemble (EE), with C chosen according to the highest validation set AUC-performance (over all possible boosting rounds), and (right) AB and EE (with S = 15) at varying C-levels.

Fig 6 Mov G(p = 1) dataset. [Two-panel figure, test AUC versus boosting rounds T (0-30): (a) AB, AC (R = 2), AC (R = 8), AC (RD), EE (S = 5), EE (S = 10), EE (S = 15) and the baseline BL; (b) AB and EE (S = 15) for C = 1e-07, 1e-05, 0.001 and 0.1, with BL.]

Fig 7 Mov Th(p = []) dataset. [Two-panel figure, test AUC versus boosting rounds T; same panel layout as Fig 6.]


Fig 8 Yahoo A(p = 1) dataset. [Two-panel figure, test AUC versus boosting rounds T; same panel layout as Fig 6.]

Fig 9 Yahoo A(p = 25) dataset. [Two-panel figure, test AUC versus boosting rounds T; same panel layout as Fig 6.]

Fig 10 Yahoo G(p = 1) dataset. [Two-panel figure, test AUC versus boosting rounds T; same panel layout as Fig 6.]


Fig 11 Yahoo G(p = 25) dataset. [Two-panel figure, test AUC versus boosting rounds T; same panel layout as Fig 6.]

Fig 12 TaFeng(p = 1) dataset. [Two-panel figure, test AUC versus boosting rounds T; same panel layout as Fig 6.]

Fig 13 Book(p = 1) dataset. [Two-panel figure, test AUC versus boosting rounds T; same panel layout as Fig 6.]


Fig 14 LST(p = 1) dataset. [Two-panel figure, test AUC versus boosting rounds T; same panel layout as Fig 6.]

Fig 15 Adver(p = []) dataset. [Two-panel figure, test AUC versus boosting rounds T; same panel layout as Fig 6.]

Fig 16 Adver(p = 1) dataset. [Two-panel figure, test AUC versus boosting rounds T; same panel layout as Fig 6.]


Fig 17 CRF(p = []) dataset. [Two-panel figure, test AUC versus boosting rounds T; same panel layout as Fig 6.]

D Final Comparison

[Scatter plot: average rank AUC (0-14, horizontal axis) versus average rank Time (0-18, vertical axis), with one point per method: BL, OSR, SMOTE, ADASYN, RUS, Cl Knn, Far Knn, CBU, AB, AC (R = 2, 8), AC (R = RL), EE (S = 10), EE (S = 15) and EE par.]

Fig 18 Average rank AUC versus average rank Time (see Table 9) across the large datasets from Table 2 (CRF and Bank). Regarding the former, an extra row EE par is added, having the same AUC-value as EE(S = 15). Ranks for each dataset are subsequently obtained via the procedure outlined in Section 4.6.1. Points occurring in the upper-right region are preferred.


References

Akbani R, Kwek S, Japkowicz N (2004) Applying support vector machines to imbalanced datasets. In: Machine Learning: ECML 2004, 15th European Conference on Machine Learning, Pisa, Italy, September 20-24, 2004, Proceedings, Springer Berlin Heidelberg, Berlin, Heidelberg, pp 39–50. DOI 10.1007/978-3-540-30115-8_7

Ali A Shamsuddin SM Ralescu AL (2015) Classification with class imbalance problem A review Inter-national Journal of Advances in Soft Computing and its Applications 7(3)176ndash204

Alzahrani T Horadam KJ (2016) Community detection in bipartite networks Algorithms and case studiesIn Complex Systems and Networks Dynamics Controls and Applications Springer Berlin Heidel-berg Berlin Heidelberg pp 25ndash50 DOI 101007978-3-662-47824-0 2

Bachner J (2013) Predictive policing Preventing crime with data and analytics IBM Center for the Busi-ness of Government

Baesens B Van Gestel T Viaene S Stepanova M Suykens J Vanthienen J (2003) Benchmarking state-of-the-art classification algorithms for credit scoring Journal of the Operational Research Society54(6)627ndash635 DOI 101057palgravejors2601545

Barandela R Snchez J Garca V Rangel E (2003) Strategies for learning in class imbalance problemsPattern Recognition 36(3)849 ndash 851 DOI httpsdoiorg101016S0031-3203(02)00257-1

Barber MJ (2007) Modularity and community detection in bipartite networks Physical Review E76066102 DOI 101103PhysRevE76066102

Barua S Islam MM Yao X Murase K (2014) MWMOTEndashmajority weighted minority oversamplingtechnique for imbalanced data set learning IEEE Transactions on Knowledge and Data Engineer-ing 26(2)405ndash425 DOI 101109TKDE2012232

Batista GEAPA Prati RC Monard MC (2004) A study of the behavior of several methods for balancingmachine learning training data SIGKDD Explor Newsl 6(1)20ndash29 DOI 10114510077301007735

Beckett SJ (2016) Improved community detection in weighted bipartite networks Royal Society OpenScience 3(1) DOI 101098rsos140536

Bekkar M Djemaa HK Alitouche TA (2013) Evaluation measures for models assessment over imbalanceddata sets Journal of Information Engineering and Applications 3(10)27ndash38

Bhattacharyya S Jha S Tharakunnel K Westland JC (2011) Data mining for credit card fraud A compara-tive study Decision Support Systems 50(3)602 ndash 613 DOI httpsdoiorg101016jdss201008008

Blondel VD Guillaume JL Lambiotte R Lefebvre E (2008) Fast unfolding of communities in large net-works Journal of Statistical Mechanics Theory and Experiment 2008(10)P10008

Breiman L, Friedman J, Stone CJ, Olshen RA (1984) Classification and regression trees. Taylor & Francis

Brozovsky L, Petricek V (2007) Recommender system for online dating service. In: Proceedings of Znalosti 2007 Conference, VSB Ostrava

Cha M, Mislove A, Gummadi KP (2009) A measurement-driven analysis of information propagation in the Flickr social network. In: Proceedings of the 18th International Conference on World Wide Web, ACM, New York, NY, USA, WWW '09, pp 721–730. DOI 10.1145/1526709.1526806

Chawla NV (2005) Data mining for imbalanced datasets An overview In Data mining and knowledgediscovery handbook Springer US Boston MA pp 853ndash867

Chawla NV Bowyer KW Hall LO Kegelmeyer WP (2002) SMOTE synthetic minority over-samplingtechnique Journal of artificial intelligence research 16321ndash357

Chawla NV Lazarevic A Hall LO Bowyer KW (2003) Smoteboost Improving prediction of the minorityclass in boosting In Knowledge Discovery in Databases PKDD 2003 Springer Berlin HeidelbergBerlin Heidelberg pp 107ndash119

Chawla NV Japkowicz N Kotcz A (2004) Editorial Special issue on learning from imbalanced data setsSIGKDD Explor Newsl 6(1)1ndash6 DOI 10114510077301007733

Chen M Mao S Liu Y (2014) Big data A survey Mobile Networks and Applications 19(2)171ndash209DOI 101007s11036-013-0489-0

Chyi YM (2003) Classification analysis techniques for skewed class distribution problems Master thesisDepartment of Information management National Sun Yat-Sen University

Demsar J (2006) Statistical comparisons of classifiers over multiple data sets Journal of Machine LearningResearch 7(Jan)1ndash30

Dhillon IS (2001) Co-clustering documents and words using bipartite spectral graph partitioning In Pro-ceedings of the Seventh ACM SIGKDD International Conference on Knowledge Discovery and DataMining ACM New York NY USA KDD rsquo01 pp 269ndash274 DOI 101145502512502550


Drummond C Holte RC (2003) C45 class imbalance and cost sensitivity why under-sampling beatsover-sampling In Proceedings of the ICML rsquo03 Workshop on Learning from Imbalanced Datasets

Fan RE Chang KW Hsieh CJ Wang XR Lin CJ (2008) LIBLINEAR A library for large linear classifi-cation Journal of Machine Learning Research 91871ndash1874

Fan W Stolfo SJ Zhang J Chan PK (1999) AdaCost Misclassification cost-sensitive boosting In Pro-ceedings of the Sixteenth International Conference on Machine Learning Morgan Kaufmann Pub-lishers Inc San Francisco CA USA ICML rsquo99 pp 97ndash105

Fawcett T (2006) An introduction to ROC analysis Pattern Recognition Letters 27(8)861 ndash 874 DOIhttpsdoiorg101016jpatrec200510010

Finch H (2005) Comparison of distance measures in cluster analysis with dichotomous data Journal ofData Science 3(1)85ndash100

Fortunato S (2010) Community detection in graphs Physics Reports 486(3 5)75 ndash 174 DOI httpsdoiorg101016jphysrep200911002

Junque de Fortuny E Martens D Provost F (2014a) Predictive modeling with big data is bigger reallybetter Big Data 1(4)215ndash226 DOI 101089big20130037

Junque de Fortuny E Stankova M Moeyersoms J Minnaert B Provost F Martens D (2014b) Corporateresidence fraud detection In Proceedings of the 20th ACM SIGKDD International Conference onKnowledge Discovery and Data Mining ACM New York NY USA KDD rsquo14 pp 1650ndash1659 DOI10114526233302623333

Frasca M Bertoni A Re M Valentini G (2013) A neural network algorithm for semi-supervised node labellearning from unbalanced data Neural Networks 4384 ndash 98 DOI httpsdoiorg101016jneunet201301021

Friedman M (1937) The use of ranks to avoid the assumption of normality implicit in the analysis ofvariance Journal of the american statistical association 32(200)675ndash701

Garcıa E Lozano F (2007) Boosting support vector machines In Machine Learning and Data Mining inPattern Recognition 5th International Conference MLDM 2007 Leipzig Germany July 18-20 PostProceedings IBaI publishing pp 153ndash167

Goldstein M Uchida S (2016) A comparative evaluation of unsupervised anomaly detection algorithmsfor multivariate data PLOS ONE 11(4)1ndash31 DOI 101371journalpone0152173

Gonzlez PC Velsquez JD (2013) Characterization and detection of taxpayers with false invoices usingdata mining techniques Expert Systems with Applications 40(5)1427 ndash 1436 DOI httpsdoiorg101016jeswa201208051

Guimera R Sales-Pardo M Amaral LAN (2007) Module identification in bipartite and directed networksPhysical Review E 76036102 DOI 101103PhysRevE76036102

Guo H Viktor HL (2004) Learning from imbalanced data sets with boosting and data generation TheDataBoost-IM approach SIGKDD Explor Newsl 6(1)30ndash39 DOI 10114510077301007736

Guo X Yin Y Dong C Yang G Zhou G (2008) On the class imbalance problem In 2008 Fourth Interna-tional Conference on Natural Computation IEEE vol 4 pp 192ndash201 DOI 101109ICNC2008871

Han H Wang WY Mao BH (2005) Borderline-SMOTE A new over-sampling method in imbalanced datasets learning In Advances in Intelligent Computing Springer Berlin Heidelberg Berlin Heidelbergpp 878ndash887

He H Garcia EA (2009) Learning from imbalanced data IEEE Transactions on Knowledge and DataEngineering 21(9)1263ndash1284 DOI 101109TKDE2008239

He H Bai Y Garcia EA Li S (2008) ADASYN Adaptive synthetic sampling approach for imbalancedlearning In 2008 IEEE International Joint Conference on Neural Networks (IEEE World Congresson Computational Intelligence) IEEE pp 1322ndash1328 DOI 101109IJCNN20084633969

Holm S (1979) A simple sequentially rejective multiple test procedure Scandinavian journal of statistics6(2)65ndash70

Hsu CW Lin CJ (2002) A comparison of methods for multiclass support vector machines IEEE Transac-tions on Neural Networks 13(2)415ndash425 DOI 10110972991427

Huang A (2008) Similarity measures for text document clustering In Proceedings of the sixth new zealandcomputer science research student conference (NZCSRSC2008) Christchurch New Zealand pp 49ndash56

Iman RL Davenport JM (1980) Approximations of the critical region of the Friedman statistic Commu-nications in Statistics-Theory and Methods 9(6)571ndash595

Jo T Japkowicz N (2004) Class imbalances versus small disjuncts ACM SIGKDD Explor Newsl 6(1)40ndash49 DOI 10114510077301007737


Jutla IS, Jeub LG, Mucha PJ (2011-2016) A generalized Louvain method for community detection implemented in MATLAB. URL http://netwiki.amath.unc.edu/GenLouvain

Kubat M Matwin S (1997) Addressing the curse of imbalanced training sets One-sided selection InProceedings of the Fourteenth International Conference on Machine Learning Morgan KaufmannPublishers Inc San Francisco CA USA pp 179ndash186

Lancichinetti A Fortunato S (2009) Community detection algorithms A comparative analysis PhysicalReview E 80056117 DOI 101103PhysRevE80056117

Larremore DB Clauset A Jacobs AZ (2014) Efficiently inferring community structure in bipartitenetworks Physical Review E Statistical Nonlinear and Soft Matter Physics 90012805 DOI101103PhysRevE90012805

Li J Fine JP (2010) Weighted area under the receiver operating characteristic curve and its application togene selection Journal of the Royal Statistical Society Series C (Applied Statistics) 59(4)673ndash692DOI 101111j1467-9876201000713x

Li X Wang L Sung E (2008) AdaBoost with SVM-based component classifiers Engineering Applicationsof Artificial Intelligence 21(5)785ndash795 DOI httpsdoiorg101016jengappai200707001

Lichman M (2013) UCI machine learning repository. URL http://archive.ics.uci.edu/ml

Liu W, Chawla S, Cieslak DA, Chawla NV (2010) A robust decision tree algorithm for imbalanced data sets. In: Proceedings of the tenth SIAM international conference on data mining, SIAM, Philadelphia, vol 10, pp 766–777

Liu XY Wu J Zhou ZH (2009) Exploratory undersampling for class-imbalance learning IEEE Transac-tions on Systems Man and Cybernetics Part B (Cybernetics) 39(2)539ndash550 DOI 101109TSMCB20082007853

Luts J Ojeda F Van de Plas R De Moor B Van Huffel S Suykens JA (2010) A tutorial on supportvector machine-based methods for classification problems in chemometrics Analytica Chimica Acta665(2)129ndash145 DOI httpsdoiorg101016jaca201003030

Macskassy SA Provost F (2007) Classification in networked data A toolkit and a univariate case studyJournal of Machine Learning Research 8(May)935ndash983

Martens D Provost F (2014) Explaining data-driven document classifications MIS Quarterly 38(1)73ndash100

Martens D Provost F Clark J Junque de Fortuny E (2016) Mining massive fine-grained behavior data toimprove predictive analytics MIS Quarterly 40(4)869ndash888

Mazurowski MA Habas PA Zurada JM Lo JY Baker JA Tourassi GD (2008) Training neural networkclassifiers for medical decision making The effects of imbalanced datasets on classification perfor-mance Neural Networks 21(23)427 ndash 436 DOI httpsdoiorg101016jneunet200712031

Mease D Wyner AJ Buja A (2007) Boosted classification trees and class probabilityquantile estimationJournal of Machine Learning Research 8409ndash439

Nemenyi P (1963) Distribution-free Multiple Comparisons. Dissertation, Princeton University

Newman MEJ, Girvan M (2004) Finding and evaluating community structure in networks. Physical Review E 69:026113. DOI 10.1103/PhysRevE.69.026113

Ng AY (2004) Feature selection, L1 vs. L2 regularization, and rotational invariance. In: Proceedings of the Twenty-first International Conference on Machine Learning, ACM, New York, NY, USA, ICML '04, pp 78–. DOI 10.1145/1015330.1015435

Ng AY Jordan MI (2002) On discriminative vs generative classifiers A comparison of logistic regressionand naive bayes In Advances in Neural Information Processing Systems 14 MIT Press pp 841ndash848

Ngai E Hu Y Wong Y Chen Y Sun X (2011) The application of data mining techniques in financial frauddetection A classification framework and an academic review of literature Decision Support Systems50(3)559 ndash 569 DOI httpsdoiorg101016jdss201008006

Platt JC (1999) Probabilistic outputs for support vector machines and comparisons to regularized likeli-hood methods In Advances in Large-Margin Classifiers MIT Press pp 61ndash74

Porter MA Onnela JP Mucha PJ (2009) Communities in networks Notices of the American MathematicalSociety 56(9)1082ndash1097

Provost F Fawcett T (2013) Data Science for Business What you need to know about data mining anddata-analytic thinking OrsquoReilly Media Inc

Provost F Dalessandro B Hook R Zhang X Murray A (2009) Audience selection for on-line brandadvertising Privacy-friendly social network targeting In Proceedings of the 15th ACM SIGKDDInternational Conference on Knowledge Discovery and Data Mining ACM New York NY USAKDD rsquo09 pp 707ndash716 DOI 10114515570191557098


Raskutti B Kowalczyk A (2004) Extreme re-balancing for SVMs A case study SIGKDD Explor Newsl6(1)60ndash69 DOI 10114510077301007739

Rosvall M Bergstrom CT (2008) Maps of random walks on complex networks reveal community structureProceedings of the National Academy of Sciences 105(4)1118ndash1123 DOI 101073pnas0706851105

Schapire RE (1999) A brief introduction to boosting In Proceedings of the 16th International Joint Con-ference on Artificial Intelligence - Volume 2 Morgan Kaufmann Publishers Inc San Francisco CAUSA IJCAIrsquo99 pp 1401ndash1406

Schapire RE Singer Y (1999) Improved boosting algorithms using confidence-rated predictions Machinelearning 37(3)297ndash336 DOI 101023A1007614523901

Shmueli G (2017) Analyzing behavioral big data Methodological practical ethical and moral issuesQuality Engineering 29(1)57ndash74 DOI 1010800898211220161210979

Sobhani P Viktor H Matwin S (2015) Learning from imbalanced data using ensemble methods andcluster-based undersampling In New Frontiers in Mining Complex Patterns Third InternationalWorkshop NFMCP 2014 Held in Conjunction with ECML-PKDD 2014 Nancy France Septem-ber 19 2014 Revised Selected Papers Springer International Publishing Cham pp 69ndash83 DOI101007978-3-319-17876-9 5

Stankova M (2016) Classification within network data with a bipartite structure Dissertation Universityof Antwerp

Stankova M Martens D Provost F (2015) Classification over bipartite graphs through projection WorkingPapers 2015001 University of Antwerp Faculty of Applied Economics

Sun Y Kamel MS Wong AK Wang Y (2007) Cost-sensitive boosting for classification of imbalanceddata Pattern Recognition 40(12)3358 ndash 3378 DOI httpsdoiorg101016jpatcog200704009

Suykens JA Van Gestel T De Brabanter J De Moor B Vandewalle J Suykens J Van Gestel T (2002)Least squares support vector machines World Scientific Singapore

Tang B He H (2015) Enn Extended nearest neighbor method for pattern recognition [research frontier]IEEE Computational Intelligence Magazine 10(3)52ndash60 DOI 101109MCI20152437512

Tang Y Zhang YQ Chawla NV Krasser S (2009) SVMs modeling for highly imbalanced classificationIEEE Transactions on Systems Man and Cybernetics Part B (Cybernetics) 39(1)281ndash288 DOI101109TSMCB20082002909

Tobback E Moeyersoms J Stankova M Martens D (2016) Bankruptcy prediction for SMEs using rela-tional data Working Paper 2016004 University of Antwerp Faculty of Applied Economics

Verbeke W Dejaeger K Martens D Hur J Baesens B (2012) New insights into churn prediction in thetelecommunication sector A profit driven data mining approach European Journal of OperationalResearch 218(1)211 ndash 229 DOI httpsdoiorg101016jejor201109031

Veropoulos K Campbell I Cristianini N (1999) Controlling the sensitivity of support vector machinesIn Proceedings of the International Joint Conference on Artificial Intelligence Stockholm Sweden(IJCAI99) pp 55 ndash 60

Whitrow C Hand DJ Juszczak P Weston D Adams NM (2009) Transaction aggregation as a strategyfor credit card fraud detection Data Mining and Knowledge Discovery 18(1)30ndash55 DOI 101007s10618-008-0116-z

Wickramaratna J Holden SB Buxton BF (2001) Performance degradation in boosting In Proceedings ofthe Second International Workshop on Multiple Classifier Systems Springer London UK MCS rsquo01pp 11ndash21

Yen SJ Lee YS (2009) Cluster-based under-sampling approaches for imbalanced data distributions ExpertSystems with Applications 36(3 Part 1)5718 ndash 5727 DOI httpsdoiorg101016jeswa200806108

Yu HF Lo HY Hsieh HP Lou JK McKenzie TG Chou JW Chung PH Ho CH Chang CF Wei YH et al(2010) Feature engineering and classifier ensemble for kdd cup 2010 In Proceedings of the KDDCup 2010 Workshop pp 1ndash16

Zha H He X Ding C Simon H Gu M (2001) Bipartite graph partitioning and data clustering In Pro-ceedings of the Tenth International Conference on Information and Knowledge Management ACMNew York NY USA CIKM rsquo01 pp 25ndash32 DOI 101145502585502591

Zhang J Mani I (2003) Knn approach to unbalanced data distributions A case study involving informationextraction In Proceedings of the ICMLrsquo2003 Workshop on Learning from Imbalanced DatasetsWashington DC

Ziegler CN McNee SM Konstan JA Lausen G (2005) Improving recommendation lists through topicdiversification In Proceedings of the 14th International Conference on World Wide Web ACM NewYork NY USA WWW rsquo05 pp 22ndash32 DOI 10114510607451060754



already quite accurately distinguish majority class from minority class behaviours. The boosting process will focus on the hard-to-learn instances, among which are noise/outliers, and this can result in overfitting behaviour. We also observed that boosting can usually outperform the plain baseline, indicating that a combination of weak learners can achieve a higher AUC-performance than a single strong learner.
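The reweighting step that drives this focus on hard instances can be sketched as follows. This is a generic discrete AdaBoost update with toy values, not the paper's exact SVM-based learner:

```python
import numpy as np

def adaboost_reweight(weights, y_true, y_pred):
    """One AdaBoost round: upweight misclassified (hard) instances.

    weights: current instance weights (sum to 1); y_true, y_pred in {-1, +1}.
    Returns (alpha, new_weights)."""
    miss = (y_true != y_pred)
    eps = np.sum(weights[miss])             # weighted error of the weak learner
    alpha = 0.5 * np.log((1 - eps) / eps)   # this learner's vote in the ensemble
    new_w = weights * np.exp(-alpha * y_true * y_pred)
    return alpha, new_w / new_w.sum()       # renormalise to a distribution

# Four instances, one misclassified: its weight grows, the rest shrink.
w = np.full(4, 0.25)
y = np.array([1, 1, -1, -1])
pred = np.array([1, 1, -1, 1])              # last instance is "hard"
alpha, w_new = adaboost_reweight(w, y, pred)
print(alpha > 0, w_new[3] > w_new[0])       # True True
```

Iterating this update is what concentrates later rounds on noisy or outlying points, which explains the overfitting risk noted above.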

EasyEnsemble (EE) integrates random undersampling with a boosting process. By sampling several balanced subsets from the training data and combining these, we can emphasize the minority class, achieve noise/outlier reduction effects and simultaneously explore the majority class space efficiently. Our version differs from the original formulation because we employ a different type of underlying weak learner (SVM and LR combination) that is fed to a confidence-rated boosting algorithm (instead of the discrete version). We observed superior AUC-performance with respect to all of the aforementioned techniques. Furthermore, the method is very fast since each subset can be trained in parallel and has a size that is twice as large as the minority class training set. We studied the effect of the number of subsets and noted that a limited amount of these (O(10)) is already sufficient to obtain satisfactory performances; increasing this number further causes only minor benefits.
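A minimal sketch of this subset-sample-and-combine structure is given below. A simple class-mean-difference scorer stands in for the paper's boosted SVM/LR weak learners; all names and toy data are illustrative assumptions:

```python
import numpy as np

rng = np.random.default_rng(0)

def easy_ensemble_scores(X, y, X_test, n_subsets=10, fit=None):
    """EasyEnsemble skeleton: train one model per balanced subset, average scores.

    y in {0, 1} with class 1 the minority. `fit(X, y)` must return a scoring
    function; the default is a stand-in, not the authors' learner."""
    if fit is None:
        def fit(Xs, ys):
            d = Xs[ys == 1].mean(axis=0) - Xs[ys == 0].mean(axis=0)
            return lambda Xt: Xt @ d
    min_idx = np.flatnonzero(y == 1)
    maj_idx = np.flatnonzero(y == 0)
    scores = np.zeros(len(X_test))
    for _ in range(n_subsets):
        # Each subset: all minority instances plus an equally sized random
        # draw from the majority, so its size is twice the minority class.
        sub = np.concatenate(
            [min_idx, rng.choice(maj_idx, size=len(min_idx), replace=False)])
        scores += fit(X[sub], y[sub])(X_test)
    return scores / n_subsets

# Toy imbalanced data: 5 minority points near (2,2), 50 majority near (0,0).
X = np.vstack([rng.normal(2, 0.3, (5, 2)), rng.normal(0, 0.3, (50, 2))])
y = np.array([1] * 5 + [0] * 50)
s = easy_ensemble_scores(X, y, np.array([[2.0, 2.0], [0.0, 0.0]]))
print(s[0] > s[1])  # minority-like point scores higher -> True
```

The loop body is independent across subsets, which is exactly why the method parallelizes so well.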

Additionally, a statistical comparison of each of the proposed methods was conducted across a wide variety of datasets. Holm's significance test indicates that each of the proposed techniques (oversampling, undersampling except Cl Knn, cost-sensitive learning and boosting variants) can significantly outperform the plain baseline performance (each of the null-hypotheses of equal performance gets rejected with a confidence level of at least 91%). Comparison of the techniques with EE showed that the latter can significantly outperform all of them (each of the null-hypotheses of equal performance gets rejected with a confidence level of at least 75%). A timing-wise comparison showed EE to be the preferred method (after RUS) for large and highly imbalanced datasets. A parallel implementation would make EE the fastest method even for medium-sized datasets that show a high level of imbalance.
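Holm's step-down procedure itself is straightforward to state in code; the p-values below are illustrative, not the paper's:

```python
import numpy as np

def holm_reject(p_values, alpha=0.1):
    """Holm's step-down procedure: compare the i-th smallest p-value
    against alpha / (m - i); once one test fails, all later ones fail too."""
    m = len(p_values)
    order = np.argsort(p_values)
    reject = np.zeros(m, dtype=bool)
    for i, idx in enumerate(order):
        if p_values[idx] <= alpha / (m - i):
            reject[idx] = True
        else:
            break  # step-down: stop at the first non-rejection
    return reject

# Illustrative p-values for m = 4 method-vs-baseline comparisons.
p = np.array([0.001, 0.02, 0.04, 0.3])
print(holm_reject(p, alpha=0.1))  # [ True  True  True False]
```

Compared with a plain Bonferroni correction (alpha / m for every test), the step-down schedule is uniformly more powerful while still controlling the family-wise error rate.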

The influence of the chosen base learner was investigated by comparing the baseline to its EE-version for the SVM, LR, NB and BeSim inducers. Each of the EE-versions is able to improve upon the baseline performance for nearly all datasets, where base learners of the regularization type are superior because of their ability to control the strength of the learner.

Though we conducted an extensive benchmark study, there are still plenty of future research directions. First and most important, there is an urgent need to develop a more general theoretical framework to completely understand the effect of imbalance and why certain methods outperform others. Secondly, other types of mechanisms could be investigated in this setting: advanced sampling (Guo et al 2008) set-ups perform sampling based on preliminary classifications; one-class learning is a recognition-based methodology where a discriminative boundary is learned around the training examples of a single class alone, and Raskutti and Kowalczyk (2004) demonstrated the superiority of the recognition-based one-class SVMs over the discriminative two-class SVMs; other types of ensemble learning strategies (e.g. bagging and stacking) could be investigated; Batista et al (2004) note that combining over- and undersampling can be beneficial when data are highly imbalanced or show very few minority class instances; the ideas behind Extended Nearest Neighbour (Tang and He 2015) could guide us in the sampling process. As opposed to Knn, "it considers not only who are the nearest neighbours of a certain sample, but also who consider the particular sample as their nearest neighbour (Tang and He 2015)". The method assigns a class membership to an unknown sample that maximizes the intra-class coherence. Finally, more advanced techniques exist that can provide a K (the number of nearest neighbours) faster or with (slightly) better performance.

Another issue relates to the lack of comprehensibility of the EasyEnsemble technique in its present form. Because we currently integrate Logistic Regression in the boosting process, the final ensemble classifier is non-linear and therefore incomprehensible. It might be interesting to investigate the numerical boosting technique proposed in Schapire and Singer (1999), where each weak learner is allowed to output real values (instead of values in [−1,1]). In that case, we would be able to use a plain linear SVM as a weak learner without the Logistic Regression component. The final ensemble would be a linear model, which is surely easier to comprehend. Also, the instance-based approach of Martens and Provost (2014) could be used to explain individual predictions of the non-linear EE-technique.
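Why such an ensemble stays comprehensible can be seen directly: a weighted sum of linear scorers is itself linear, so the whole ensemble collapses into one weight vector. A toy illustration (all values are arbitrary, not from the paper):

```python
import numpy as np

# A boosted ensemble of linear scorers f_t(x) = W[t] @ x + b[t] with votes a[t]
# collapses into one linear model: F(x) = (sum_t a[t] W[t]) @ x + sum_t a[t] b[t].
rng = np.random.default_rng(1)
W = rng.normal(size=(3, 5))        # 3 linear weak learners, 5 features
b = rng.normal(size=3)
a = np.array([0.5, 0.3, 0.2])      # boosting votes

x = rng.normal(size=5)
ensemble = sum(a[t] * (W[t] @ x + b[t]) for t in range(3))
w_single, b_single = a @ W, a @ b  # one comprehensible linear model
print(np.isclose(ensemble, w_single @ x + b_single))  # True
```

With thresholded weak learners (outputs in {−1, +1}) this collapse is impossible, which is precisely why the real-valued variant of Schapire and Singer matters for interpretability.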

A Oversampling experiments

Table 11 Oversampling experiments. Results show the average tenfold AUC test set performance with respect to increasing β-values (β = [β1, β2, β3, β4] = [0, 1/3, 2/3, 1]). Additionally, standard deviations are shown between brackets. Optimal performances for each method are highlighted in boldface. See Section 4.2 for parameter settings.

MOV G(p = 1)β1 β2 β3 β4

OSR 716 (262) 7437 (204) 736 (184) 7473 (245)SMOTE 716 (262) 7508 (218) 7602 (214) 7648 (23)

ADASYN 716 (262) 7516 (192) 7593 (208) 7647 (229)MOV G(p = 25)

β1 β2 β3 β4OSR 8141 (132) 8349 (181) 8384 (196) 8391 (204)

SMOTE 8141 (132) 8332 (197) 8359 (204) 8376 (211)ADASYN 8141 (132) 8361 (182) 8402 (197) 8369 (196)

Mov Th(p = [])β1 β2 β3 β4

OSR 7977 (533) 853 (466) 8316 (45) 8459 (569)SMOTE 7977 (533) 8418 (651) 8558 (597) 8433 (575)

ADASYN 7977 (533) 8411 (677) 8586 (606) 8536 (513)Yahoo A(p = 1)

β1 β2 β3 β4OSR 5592 (297) 5866 (327) 5999 (228) 5974 (178)

SMOTE 5592 (297) 5976 (262) 5974 (267) 5943 (24)ADASYN 5592 (297) 5954 (253) 5955 (294) 5956 (222)

Yahoo A(p = 25)β1 β2 β3 β4

OSR 6168 (242) 6419 (317) 6508 (326) 6467 (21)SMOTE 6168 (242) 6546 (363) 6533 (323) 6452 (298)

ADASYN 6168 (242) 6504 (374) 6541 (347) 644 (221)Continues on next page

44 Jellis Vanhoeyveld David Martens

Table 11 continued

Yahoo G(p = 1)   β1            β2            β3            β4
OSR              66.84 (3.66)  72.18 (2.36)  73.11 (2.7)   72.49 (3.41)
SMOTE            66.84 (3.66)  72.65 (2.85)  73.27 (3.36)  73.37 (3.56)
ADASYN           66.84 (3.66)  72.87 (2.83)  73.18 (3.2)   73.39 (3.59)

Yahoo G(p = 25)  β1            β2            β3            β4
OSR              78.82 (1.39)  78.78 (2.02)  78.74 (1.86)  78.35 (1.47)
SMOTE            78.82 (1.39)  79.23 (1.57)  79.1 (1.2)    79.03 (1.89)
ADASYN           78.82 (1.39)  79.12 (1.43)  79.23 (1.37)  79.51 (2.01)

TaFeng(p = 1)    β1            β2            β3            β4
OSR              55.75 (1.6)   59.23 (1.96)  60 (1.68)     61.04 (2.36)
SMOTE            55.75 (1.6)   60.26 (1.95)  61.49 (1.8)   61.13 (1.52)
ADASYN           55.75 (1.6)   60.26 (1.9)   61.44 (1.85)  61.16 (1.5)

TaFeng(p = 25)   β1            β2            β3            β4
OSR              66.94 (1.34)  68.84 (1.21)  66.99 (1.42)  67.7 (1.41)
SMOTE            66.94 (1.34)  68.47 (1.5)   67.07 (1.15)  66.65 (0.81)
ADASYN           66.94 (1.34)  68.62 (1.38)  67.85 (1.6)   66.91 (1.39)

Book(p = 1)      β1            β2            β3            β4
OSR              52.6 (1.29)   53.61 (0.94)  55.41 (1.75)  55.87 (1.44)
SMOTE            52.6 (1.29)   54.77 (0.99)  54.91 (0.8)   54.36 (0.98)
ADASYN           52.6 (1.29)   54.86 (1.13)  55.06 (0.73)  54.54 (0.92)

Book(p = 25)     β1            β2            β3            β4
OSR              60.08 (0.71)  61.05 (0.94)  61.82 (1.12)  64.62 (0.57)
SMOTE            60.08 (0.71)  62.6 (0.73)   60.95 (0.68)  63 (0.8)
ADASYN           60.08 (0.71)  62.33 (0.73)  60.77 (0.85)  63.04 (0.58)

LST(p = 1)       β1            β2            β3            β4
OSR              99.99 (0.01)  99.99 (0.01)  99.99 (0.01)  99.99 (0.01)
SMOTE            99.99 (0.01)  99.99 (0.01)  99.99 (0.01)  99.99 (0.01)
ADASYN           99.99 (0.01)  99.99 (0.01)  99.99 (0.01)  99.99 (0.01)

Adver(p = [])    β1            β2            β3            β4
OSR              96.61 (1.82)  97.31 (1.65)  97.07 (1.84)  97.07 (1.79)
SMOTE            96.61 (1.82)  96.91 (1.66)  97.19 (1.65)  97.07 (1.91)
ADASYN           96.61 (1.82)  97.1 (1.7)    97.08 (1.87)  97.07 (1.88)

Adver(p = 1)     β1            β2            β3            β4
OSR              90.93 (3.02)  91.27 (3.03)  92.66 (2.82)  93.29 (1.97)
SMOTE            90.93 (3.02)  92.51 (2.03)  92.96 (2.14)  93.53 (1.81)
ADASYN           90.93 (3.02)  92.22 (2.33)  92.7 (2.36)   93.88 (1.73)

CRF(p = [])      β1             β2             β3             β4
OSR              64.06 (16.43)  80.82 (12.94)  81.28 (12.27)  81.91 (11.28)
SMOTE            64.06 (16.43)  78.64 (16.86)  82.52 (13.74)  79.32 (16.26)
ADASYN           64.06 (16.43)  78.95 (16.72)  81.19 (16.32)  79.31 (16.19)

Bank(p = [])     β1            β2            β3            β4
OSR              66.82 (0.88)  70.1 (0.74)   71.39 (0.8)   71.47 (0.8)
SMOTE
ADASYN
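For readers unfamiliar with the oversamplers compared above, the interpolation step behind the SMOTE rows can be sketched as follows. This is a simplified, illustrative reconstruction (Euclidean neighbours, dense list-of-lists data), not the implementation used for these experiments; parameter names are assumptions:

```python
import random

# SMOTE-style interpolation sketch: each synthetic sample lies on the
# segment between a minority instance and one of its k nearest minority
# neighbours.

def smote(minority, n_synthetic, k=2, seed=0):
    rng = random.Random(seed)
    out = []
    for _ in range(n_synthetic):
        x = rng.choice(minority)
        # k nearest minority neighbours of x (excluding x itself)
        neigh = sorted((m for m in minority if m is not x),
                       key=lambda m: sum((a - b) ** 2 for a, b in zip(x, m)))[:k]
        n = rng.choice(neigh)
        g = rng.random()  # interpolation gap in [0, 1)
        out.append([a + g * (b - a) for a, b in zip(x, n)])
    return out

minority = [[1.0, 1.0], [2.0, 1.0], [1.5, 2.0]]
synth = smote(minority, n_synthetic=4)
assert len(synth) == 4
# convex combinations stay inside the minority bounding box
assert all(1.0 <= p[0] <= 2.0 and 1.0 <= p[1] <= 2.0 for p in synth)
```

ADASYN differs mainly in allocating more synthetic samples to minority instances with many majority-class neighbours; the interpolation itself is the same.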

Imbalanced classification in sparse and large behaviour datasets 45

B Undersampling experiments

Table 12 Undersampling experiments. Results show the average tenfold AUC test set performance with respect to increasing βu-values (βu = [βu1, βu2, βu3, βu4, βu5] = [0, 1/4, 1/2, 3/4, 1]). Optimal performances (boldface) are highlighted for each imbalance ratio. Additionally, standard deviations are shown between brackets. Cl K represents the "Closest Knn" technique, Cl T represents the "Closest tot sim" technique (similar for Far K and Far T, see Section 3.2). Parameter settings are given in Section 4.2.

Mov G(p = 1)     βu1          βu2          βu3          βu4          βu5
RUS              71.6 (2.6)   71.83 (2.6)  72.54 (2.5)  72.39 (3.1)  70.61 (3.5)
Cl K             71.6 (2.6)   71.4 (2)     70.96 (1.9)  70.43 (2.4)  69.05 (4.1)
Cl T             71.6 (2.6)   70.28 (2.5)  66.74 (2)    66.8 (2.1)   68.18 (3.6)
Far K            71.6 (2.6)   72.36 (2.7)  71.26 (3.4)  66.57 (5.2)  53.5 (3.5)
Far T            71.6 (2.6)   72.22 (2.8)  71.63 (3.6)  64.28 (5.3)  50.88 (4.4)
CBU              72.55 (2.6)  73.28 (2.6)  73.12 (2.6)  73.84 (2.5)  73 (3.1)

Mov G(p = 25)    βu1          βu2          βu3          βu4          βu5
RUS              81.41 (1.3)  81.36 (1.3)  81.78 (1.7)  82.05 (1.7)  81.6 (2.1)
Cl K             81.41 (1.3)  80.86 (1.2)  80.95 (1.6)  79.73 (2.3)  77.95 (2.3)
Cl T             81.41 (1.3)  79.9 (1.2)   78.21 (1.4)  77.87 (1.5)  77.76 (2.3)
Far K            81.41 (1.3)  80.9 (1.5)   78.17 (1.8)  74.25 (2.4)  69.79 (3.2)
Far T            81.41 (1.3)  80.86 (1.5)  77.2 (2.4)   71.16 (2.7)  62.4 (2.8)
CBU              81.53 (1.4)  81.64 (1.3)  81.29 (1.6)  81.28 (2.1)  80.34 (2.7)

Mov Th(p = [])   βu1          βu2          βu3          βu4          βu5
RUS              79.77 (5.3)  80.32 (5.8)  81.57 (5.5)  81.86 (6.6)  81.26 (6.2)
Cl K             79.77 (5.3)  79.25 (4.5)  78.07 (5)    76.25 (6.5)  62.46 (8.5)
Cl T             79.77 (5.3)  78.4 (4.4)   72.41 (3.5)  64.66 (4.5)  60.37 (7.3)
Far K            79.77 (5.3)  84.54 (5)    83.64 (6.4)  80.02 (7.3)  56.82 (10.3)
Far T            79.77 (5.3)  85.03 (5.7)  82.68 (6.8)  75.61 (9.2)  56.77 (10.9)
CBU              80.11 (5.8)  81.17 (6)    81.08 (6.5)  84.17 (5.1)  80.96 (6.9)

Yahoo A(p = 1)   βu1          βu2          βu3          βu4          βu5
RUS              55.92 (3)    55.57 (3.4)  56.44 (3)    55.83 (3.4)  56.37 (3.3)
Cl K             55.92 (3)    55.67 (2.4)  53.12 (2)    50.57 (1.8)  53.79 (3.5)
Cl T             55.92 (3)    55.69 (2.1)  53.35 (2.2)  50.31 (2.2)  52.35 (3.3)
Far K            55.92 (3)    57.35 (2.2)  56.92 (1.1)  56.95 (2.3)  51.18 (2)
Far T            55.92 (3)    56.93 (2.4)  54.74 (1.9)  57.01 (1.8)  51.18 (2)
CBU              58.21 (2.6)  58.45 (3.3)  58.31 (3.5)  58.39 (3.5)  56.09 (2.6)

Yahoo A(p = 25)  βu1          βu2          βu3          βu4          βu5
RUS              61.68 (2.4)  62.9 (2.9)   63.62 (3.6)  63.75 (3.1)  63.19 (1.9)
Cl K             61.68 (2.4)  61.14 (2.1)  57.62 (1.6)  54.02 (1.8)  51.48 (1.4)
Cl T             61.68 (2.4)  60.89 (2.8)  58.11 (1.4)  54.4 (2.1)   51.76 (1.4)
Far K            61.68 (2.4)  63.96 (3)    62.62 (2.2)  59.61 (1.5)  56.25 (1.6)
Far T            61.68 (2.4)  63.71 (2.4)  59.72 (1.6)  57.27 (1.1)  54.47 (1.1)
CBU              62.46 (2.6)  61.85 (1.4)  61.78 (2.2)  59.94 (3)    60.1 (4)


Table 12 continued

Yahoo G(p = 1)   βu1          βu2          βu3          βu4          βu5
RUS              66.84 (3.7)  67.85 (3.2)  68.36 (3.2)  68.23 (4)    69.9 (4.2)
Cl K             66.84 (3.7)  66.71 (2.8)  64.3 (3.6)   61.98 (3.9)  61.15 (1.9)
Cl T             66.84 (3.7)  65.79 (2.7)  63.55 (3.3)  59.21 (3.5)  61.08 (2.4)
Far K            66.84 (3.7)  66.76 (4.1)  63.84 (3.4)  65.16 (2)    48.5 (2.9)
Far T            66.84 (3.7)  66.95 (4.1)  63.48 (2.9)  65.16 (2)    48.48 (2.9)
CBU              69.68 (4.1)  70.59 (3.2)  70.64 (3.7)  70.2 (2.9)   63.35 (3.6)

Yahoo G(p = 25)  βu1          βu2          βu3          βu4          βu5
RUS              78.82 (1.4)  78.91 (1.6)  78.97 (1.6)  78.61 (1.6)  77.82 (2.1)
Cl K             78.82 (1.4)  77.26 (1.5)  72.52 (1.5)  67.86 (2)    65.07 (2.7)
Cl T             78.82 (1.4)  76.83 (1)    71.99 (1.8)  67.15 (2.3)  61.1 (2.7)
Far K            78.82 (1.4)  78.26 (2.2)  74.69 (2.7)  67.22 (2.1)  60.72 (2.3)
Far T            78.82 (1.4)  77.68 (2.6)  72.44 (3)    64.94 (2.4)  59.6 (2)
CBU              75.25 (3.2)  75.22 (2.4)  74.69 (2.3)  73.07 (2.4)  70.69 (2.4)

TaFeng(p = 1)    βu1          βu2          βu3          βu4          βu5
RUS              55.75 (1.6)  56.1 (1.6)   56.26 (1.7)  57.23 (1.7)  59.25 (2.2)
Cl K             55.75 (1.6)  55.68 (1.6)  55.58 (1.5)  55.08 (1.1)  51.05 (1.5)
Cl T             55.75 (1.6)  55.67 (1.6)  54.47 (1.6)  47.53 (1.6)  49.3 (1.1)
Far K            55.75 (1.6)  58.99 (1.2)  59.47 (1.1)  60.04 (1.2)  56.31 (1)
Far T            55.75 (1.6)  58.92 (1.3)  59.25 (1.3)  58.58 (1.1)  56.31 (1)
CBU              57.8 (1)     58.47 (1.1)  58.15 (0.9)  58.87 (1.4)  57.65 (1.6)

TaFeng(p = 25)   βu1          βu2          βu3          βu4          βu5
RUS              66.94 (1.3)  67.44 (1.3)  68.1 (1.4)   68.27 (1.4)  66.13 (1.2)
Cl K             66.94 (1.3)  66.13 (1.4)  63.39 (1.2)  59.83 (1.3)  56.94 (0.7)
Cl T             66.94 (1.3)  66.38 (1.5)  62.89 (1.6)  57.46 (1.3)  54.56 (1.3)
Far K            66.94 (1.3)  68.06 (1.4)  66.43 (1.6)  64.46 (1.5)  63.35 (1.3)
Far T            66.94 (1.3)  64.31 (1.1)  62.69 (1)    61.27 (1.1)  59.03 (1)
CBU              64.81 (1.2)  64.15 (1.1)  64.13 (1.2)  63.88 (0.8)  63.46 (0.8)

Book(p = 1)      βu1          βu2          βu3          βu4          βu5
RUS              52.6 (1.3)   52.79 (0.9)  53.46 (0.8)  53.89 (0.9)  54.05 (0.9)
Cl K             52.6 (1.3)   52.56 (1.2)  52.52 (1.3)  52.39 (1.1)  53.09 (1.1)
Cl T             52.6 (1.3)   52.56 (1.2)  52.52 (1.3)  52.39 (1.1)  53.05 (0.7)
Far K            52.6 (1.3)   55.21 (1.2)  56.21 (1.8)  56.14 (1.2)  53.06 (1)
Far T            52.6 (1.3)   55.21 (1.2)  56.21 (1.8)  56.14 (1.2)  53.06 (1)
CBU              54.28 (0.9)  53.77 (1)    53.33 (1.1)  53.34 (0.9)  52.84 (0.8)

Book(p = 25)     βu1          βu2          βu3          βu4          βu5
RUS              60.08 (0.7)  60.13 (0.6)  60.4 (0.8)   60.33 (0.8)  63.28 (0.8)
Cl K             60.08 (0.7)  59.96 (0.7)  60.13 (0.8)  59.96 (1)    59.28 (0.7)
Cl T             60.08 (0.7)  59.96 (0.7)  60.13 (0.8)  60.29 (0.4)  54.5 (0.9)
Far K            60.08 (0.7)  63.29 (1)    64.19 (0.8)  57.3 (1.1)   55.66 (1.1)
Far T            60.08 (0.7)  62.14 (0.5)  58.27 (0.6)  56.37 (1)    55.66 (1.1)
CBU              54.82 (0.9)  54.67 (0.9)  54.71 (0.9)  54.66 (1)    54.78 (0.9)


Table 12 continued

LST(p = 1)       βu1          βu2          βu3          βu4          βu5
RUS              99.99 (0)    99.99 (0)    99.99 (0)    99.98 (0)    99.99 (0)
Cl K             99.99 (0)    99.99 (0)    99.99 (0)    99.99 (0)    99.99 (0)
Cl T             99.99 (0)    99.99 (0)    99.99 (0)    99.99 (0)    99.98 (0)
Far K            99.99 (0)    99.98 (0)    99.98 (0)    99.98 (0)    99.98 (0)
Far T            99.99 (0)    99.98 (0)    99.98 (0)    99.98 (0)    99.98 (0)
CBU              []           []           []           []           []

Adver(p = [])    βu1          βu2          βu3          βu4          βu5
RUS              96.61 (1.8)  96.32 (1.8)  96.63 (1.4)  97.12 (2.1)  96.22 (1.6)
Cl K             96.61 (1.8)  96.44 (1.5)  96.14 (1.5)  96.04 (2)    94.8 (2.5)
Cl T             96.61 (1.8)  95.87 (2.1)  94.32 (1.9)  93.01 (2.2)  90.72 (2.3)
Far K            96.61 (1.8)  96.53 (1.4)  95.76 (2)    94.39 (1.8)  90.49 (3.1)
Far T            96.61 (1.8)  96.54 (1.5)  95.67 (1.9)  94.54 (1.8)  89.3 (2.8)
CBU              96.85 (2.3)  96.85 (2.3)  97.05 (1.5)  96.6 (1.6)   96.06 (2.1)

Adver(p = 1)     βu1          βu2          βu3          βu4          βu5
RUS              90.93 (3)    91.53 (3.1)  92.37 (3.4)  91.9 (2.9)   91.93 (2.2)
Cl K             90.93 (3)    90.64 (3)    89.87 (3.9)  90.21 (3.6)  89.18 (2)
Cl T             90.93 (3)    89.7 (3.5)   88.55 (3.4)  85.76 (3.3)  88.2 (2.3)
Far K            90.93 (3)    93.8 (2.3)   92.4 (2.6)   88.73 (3.4)  85.51 (4)
Far T            90.93 (3)    93.62 (2.4)  93.2 (2.2)   88.41 (3.6)  85.51 (4)
CBU              93.22 (2.4)  93.76 (2.5)  93.89 (2.6)  93.52 (2.7)  91.27 (2)

CRF(p = [])      βu1           βu2           βu3           βu4           βu5
RUS              64.06 (16.4)  63.28 (15.9)  67.98 (17.4)  66.95 (21.9)  87.73 (8.8)
Cl K             64.06 (16.4)  62.44 (16.6)  62.34 (16.9)  71.37 (13.8)  78.22 (17.7)
Cl T             64.06 (16.4)  62.44 (16.6)  62.34 (16.9)  71.37 (13.8)  62.67 (22.9)
Far K            64.06 (16.4)  83.8 (14.2)   83.93 (14.8)  84.49 (13.7)  86.11 (9.7)
Far T            64.06 (16.4)  83.8 (14.2)   83.93 (14.8)  84.49 (13.7)  86.11 (9.7)
CBU              []            []            []            []            []

Bank(p = [])     βu1          βu2          βu3          βu4          βu5
RUS              66.82 (0.9)  67.02 (0.9)  67.37 (0.8)  67.99 (0.6)  69.5 (1)
Cl K             66.82 (0.9)  66.17 (0.7)  65.24 (0.6)  64.86 (0.6)  58.53 (1.1)
Cl T             66.82 (0.9)  64.92 (1.1)  60.69 (0.9)  56.33 (0.8)  52.87 (0.7)
Far K            66.82 (0.9)  66.95 (0.6)  66.19 (0.6)  64.42 (0.6)  58.25 (1.1)
Far T            66.82 (0.9)  67.16 (0.6)  64.2 (0.8)   59.67 (1)    58.25 (1.1)
CBU              []           []           []           []           []
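The distance-guided undersamplers compared in Table 12 can be illustrated with a small sketch. The code below is one plausible reading of the "Closest Knn" selection (rank majority instances by mean cosine similarity to their K nearest minority neighbours and discard the closest ones first, cosine being a natural choice for sparse behaviour data); it is an illustrative reconstruction, not the authors' exact procedure:

```python
# "Closest Knn"-style undersampling sketch (illustrative assumptions).

def cosine(u, v):
    num = sum(a * b for a, b in zip(u, v))
    den = (sum(a * a for a in u) ** 0.5) * (sum(b * b for b in v) ** 0.5)
    return num / den if den else 0.0

def closest_knn_undersample(majority, minority, k, n_remove):
    """Drop the n_remove majority points most similar to the minority class."""
    scores = []
    for i, x in enumerate(majority):
        sims = sorted((cosine(x, m) for m in minority), reverse=True)
        scores.append((sum(sims[:k]) / k, i))
    scores.sort(reverse=True)              # most similar to minority first
    drop = {i for _, i in scores[:n_remove]}
    return [x for i, x in enumerate(majority) if i not in drop]

majority = [[1, 0, 0], [0, 1, 0], [1, 1, 0], [0, 0, 1]]
minority = [[1, 1, 1], [0, 1, 1]]
kept = closest_knn_undersample(majority, minority, k=2, n_remove=2)
assert len(kept) == 2
```

The "Far" variants would reverse the ranking, and the "tot sim" variants would average over the whole minority class instead of the K nearest neighbours.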


C Boosting experiments

This section presents the results of each of the boosting experiments from Section 4.5 on the remaining data sources. Each of the figures below depicts the average tenfold AUC-performance on test data (with µ = 100) with respect to the number of boosting iterations for (left) Adaboost (AB), AdaCost (AC) and EasyEnsemble (EE), with C chosen according to the highest validation set AUC-performance (over all possible boosting rounds), and (right) AB and EE (with S = 15) with varying C-levels.
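The EE curves in these figures come from the EasyEnsemble scheme described in the abstract: draw S balanced subsets of the majority class, train one learner per subset, and average the member scores. A minimal sketch of that idea follows; the toy threshold learner stands in for the boosted linear-SVM base learner actually used, and all names are illustrative:

```python
import random

# EasyEnsemble sketch: S balanced majority subsets, one model each,
# averaged scores. Label 1 = minority (positive) class.

def easy_ensemble(majority, minority, S, train, seed=0):
    rng = random.Random(seed)
    models = []
    for _ in range(S):
        subset = rng.sample(majority, len(minority))   # balanced subset
        models.append(train(subset + minority,
                            [0] * len(subset) + [1] * len(minority)))
    def score(x):                                      # average member scores
        return sum(m(x) for m in models) / len(models)
    return score

def train_stump(X, y):
    """Toy 1-D learner: threshold at the midpoint of the class means."""
    m0 = sum(v for v, t in zip(X, y) if t == 0) / y.count(0)
    m1 = sum(v for v, t in zip(X, y) if t == 1) / y.count(1)
    thr = (m0 + m1) / 2.0
    return lambda v: 1.0 if v > thr else 0.0

majority = [0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8]
minority = [2.0, 2.2]
scorer = easy_ensemble(majority, minority, S=5, train=train_stump)
assert scorer(2.1) > scorer(0.05)
```

Since each subset is only twice the minority class size and the S members are independent, the loop parallelizes trivially, which is the speed advantage noted in the abstract.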

[Plots not reproduced in this version; each of Figs. 6-17 contains the panels (a) and (b) described above, with the boosting iteration count T on the horizontal axis and test AUC on the vertical axis.]

Fig 6 Mov G(p = 1) dataset
Fig 7 Mov Th(p = []) dataset
Fig 8 Yahoo A(p = 1) dataset
Fig 9 Yahoo A(p = 25) dataset
Fig 10 Yahoo G(p = 1) dataset
Fig 11 Yahoo G(p = 25) dataset
Fig 12 TaFeng(p = 1) dataset
Fig 13 Book(p = 1) dataset
Fig 14 LST(p = 1) dataset
Fig 15 Adver(p = []) dataset
Fig 16 Adver(p = 1) dataset
Fig 17 CRF(p = []) dataset

D Final Comparison

[Plot not reproduced in this version; axes: average rank AUC (horizontal) versus average rank Time (vertical); legend: BL, OSR, SMOTE, ADASYN, RUS, Cl Knn, Far Knn, CBU, AB, AC(R=28), AC(R=RL), EE(S=10), EE(S=15), EE par.]

Fig 18 Average rank AUC versus average rank Time (see Table 9) across the large datasets from Table 2 (CRF and Bank). Regarding the former, an extra row EE par is added, having the same AUC-value as EE(S = 15). Ranks for each dataset are subsequently obtained via the procedure outlined in Section 4.6.1. Points occurring in the upper-right region are preferred.


References

Akbani R Kwek S Japkowicz N (2004) Applying support vector machines to imbalanced datasetsIn Machine Learning ECML 2004 15th European Conference on Machine Learning Pisa ItalySeptember 20-24 2004 Proceedings Springer Berlin Heidelberg Berlin Heidelberg pp 39ndash50 DOI101007978-3-540-30115-8 7

Ali A Shamsuddin SM Ralescu AL (2015) Classification with class imbalance problem A review Inter-national Journal of Advances in Soft Computing and its Applications 7(3)176ndash204

Alzahrani T Horadam KJ (2016) Community detection in bipartite networks Algorithms and case studiesIn Complex Systems and Networks Dynamics Controls and Applications Springer Berlin Heidel-berg Berlin Heidelberg pp 25ndash50 DOI 101007978-3-662-47824-0 2

Bachner J (2013) Predictive policing Preventing crime with data and analytics IBM Center for the Busi-ness of Government

Baesens B Van Gestel T Viaene S Stepanova M Suykens J Vanthienen J (2003) Benchmarking state-of-the-art classification algorithms for credit scoring Journal of the Operational Research Society54(6)627ndash635 DOI 101057palgravejors2601545

Barandela R Snchez J Garca V Rangel E (2003) Strategies for learning in class imbalance problemsPattern Recognition 36(3)849 ndash 851 DOI httpsdoiorg101016S0031-3203(02)00257-1

Barber MJ (2007) Modularity and community detection in bipartite networks Physical Review E76066102 DOI 101103PhysRevE76066102

Barua S Islam MM Yao X Murase K (2014) MWMOTEndashmajority weighted minority oversamplingtechnique for imbalanced data set learning IEEE Transactions on Knowledge and Data Engineer-ing 26(2)405ndash425 DOI 101109TKDE2012232

Batista GEAPA Prati RC Monard MC (2004) A study of the behavior of several methods for balancingmachine learning training data SIGKDD Explor Newsl 6(1)20ndash29 DOI 10114510077301007735

Beckett SJ (2016) Improved community detection in weighted bipartite networks Royal Society OpenScience 3(1) DOI 101098rsos140536

Bekkar M Djemaa HK Alitouche TA (2013) Evaluation measures for models assessment over imbalanceddata sets Journal of Information Engineering and Applications 3(10)27ndash38

Bhattacharyya S Jha S Tharakunnel K Westland JC (2011) Data mining for credit card fraud A compara-tive study Decision Support Systems 50(3)602 ndash 613 DOI httpsdoiorg101016jdss201008008

Blondel VD Guillaume JL Lambiotte R Lefebvre E (2008) Fast unfolding of communities in large net-works Journal of Statistical Mechanics Theory and Experiment 2008(10)P10008

Breiman L, Friedman J, Stone CJ, Olshen RA (1984) Classification and regression trees. Taylor & Francis

Brozovsky L, Petricek V (2007) Recommender system for online dating service. In: Proceedings of Znalosti 2007 Conference, VSB Ostrava

Cha M, Mislove A, Gummadi KP (2009) A measurement-driven analysis of information propagation in the Flickr social network. In: Proceedings of the 18th International Conference on World Wide Web, ACM, New York, NY, USA, WWW '09, pp 721-730, DOI 10.1145/1526709.1526806

Chawla NV (2005) Data mining for imbalanced datasets An overview In Data mining and knowledgediscovery handbook Springer US Boston MA pp 853ndash867

Chawla NV Bowyer KW Hall LO Kegelmeyer WP (2002) SMOTE synthetic minority over-samplingtechnique Journal of artificial intelligence research 16321ndash357

Chawla NV Lazarevic A Hall LO Bowyer KW (2003) Smoteboost Improving prediction of the minorityclass in boosting In Knowledge Discovery in Databases PKDD 2003 Springer Berlin HeidelbergBerlin Heidelberg pp 107ndash119

Chawla NV Japkowicz N Kotcz A (2004) Editorial Special issue on learning from imbalanced data setsSIGKDD Explor Newsl 6(1)1ndash6 DOI 10114510077301007733

Chen M Mao S Liu Y (2014) Big data A survey Mobile Networks and Applications 19(2)171ndash209DOI 101007s11036-013-0489-0

Chyi YM (2003) Classification analysis techniques for skewed class distribution problems Master thesisDepartment of Information management National Sun Yat-Sen University

Demsar J (2006) Statistical comparisons of classifiers over multiple data sets Journal of Machine LearningResearch 7(Jan)1ndash30

Dhillon IS (2001) Co-clustering documents and words using bipartite spectral graph partitioning In Pro-ceedings of the Seventh ACM SIGKDD International Conference on Knowledge Discovery and DataMining ACM New York NY USA KDD rsquo01 pp 269ndash274 DOI 101145502512502550


Drummond C Holte RC (2003) C45 class imbalance and cost sensitivity why under-sampling beatsover-sampling In Proceedings of the ICML rsquo03 Workshop on Learning from Imbalanced Datasets

Fan RE Chang KW Hsieh CJ Wang XR Lin CJ (2008) LIBLINEAR A library for large linear classifi-cation Journal of Machine Learning Research 91871ndash1874

Fan W Stolfo SJ Zhang J Chan PK (1999) AdaCost Misclassification cost-sensitive boosting In Pro-ceedings of the Sixteenth International Conference on Machine Learning Morgan Kaufmann Pub-lishers Inc San Francisco CA USA ICML rsquo99 pp 97ndash105

Fawcett T (2006) An introduction to ROC analysis Pattern Recognition Letters 27(8)861 ndash 874 DOIhttpsdoiorg101016jpatrec200510010

Finch H (2005) Comparison of distance measures in cluster analysis with dichotomous data Journal ofData Science 3(1)85ndash100

Fortunato S (2010) Community detection in graphs Physics Reports 486(3 5)75 ndash 174 DOI httpsdoiorg101016jphysrep200911002

Junque de Fortuny E Martens D Provost F (2014a) Predictive modeling with big data is bigger reallybetter Big Data 1(4)215ndash226 DOI 101089big20130037

Junque de Fortuny E Stankova M Moeyersoms J Minnaert B Provost F Martens D (2014b) Corporateresidence fraud detection In Proceedings of the 20th ACM SIGKDD International Conference onKnowledge Discovery and Data Mining ACM New York NY USA KDD rsquo14 pp 1650ndash1659 DOI10114526233302623333

Frasca M Bertoni A Re M Valentini G (2013) A neural network algorithm for semi-supervised node labellearning from unbalanced data Neural Networks 4384 ndash 98 DOI httpsdoiorg101016jneunet201301021

Friedman M (1937) The use of ranks to avoid the assumption of normality implicit in the analysis ofvariance Journal of the american statistical association 32(200)675ndash701

Garcıa E Lozano F (2007) Boosting support vector machines In Machine Learning and Data Mining inPattern Recognition 5th International Conference MLDM 2007 Leipzig Germany July 18-20 PostProceedings IBaI publishing pp 153ndash167

Goldstein M Uchida S (2016) A comparative evaluation of unsupervised anomaly detection algorithmsfor multivariate data PLOS ONE 11(4)1ndash31 DOI 101371journalpone0152173

Gonzlez PC Velsquez JD (2013) Characterization and detection of taxpayers with false invoices usingdata mining techniques Expert Systems with Applications 40(5)1427 ndash 1436 DOI httpsdoiorg101016jeswa201208051

Guimera R Sales-Pardo M Amaral LAN (2007) Module identification in bipartite and directed networksPhysical Review E 76036102 DOI 101103PhysRevE76036102

Guo H Viktor HL (2004) Learning from imbalanced data sets with boosting and data generation TheDataBoost-IM approach SIGKDD Explor Newsl 6(1)30ndash39 DOI 10114510077301007736

Guo X Yin Y Dong C Yang G Zhou G (2008) On the class imbalance problem In 2008 Fourth Interna-tional Conference on Natural Computation IEEE vol 4 pp 192ndash201 DOI 101109ICNC2008871

Han H Wang WY Mao BH (2005) Borderline-SMOTE A new over-sampling method in imbalanced datasets learning In Advances in Intelligent Computing Springer Berlin Heidelberg Berlin Heidelbergpp 878ndash887

He H Garcia EA (2009) Learning from imbalanced data IEEE Transactions on Knowledge and DataEngineering 21(9)1263ndash1284 DOI 101109TKDE2008239

He H Bai Y Garcia EA Li S (2008) ADASYN Adaptive synthetic sampling approach for imbalancedlearning In 2008 IEEE International Joint Conference on Neural Networks (IEEE World Congresson Computational Intelligence) IEEE pp 1322ndash1328 DOI 101109IJCNN20084633969

Holm S (1979) A simple sequentially rejective multiple test procedure Scandinavian journal of statistics6(2)65ndash70

Hsu CW Lin CJ (2002) A comparison of methods for multiclass support vector machines IEEE Transac-tions on Neural Networks 13(2)415ndash425 DOI 10110972991427

Huang A (2008) Similarity measures for text document clustering In Proceedings of the sixth new zealandcomputer science research student conference (NZCSRSC2008) Christchurch New Zealand pp 49ndash56

Iman RL Davenport JM (1980) Approximations of the critical region of the Friedman statistic Commu-nications in Statistics-Theory and Methods 9(6)571ndash595

Jo T Japkowicz N (2004) Class imbalances versus small disjuncts ACM SIGKDD Explor Newsl 6(1)40ndash49 DOI 10114510077301007737


Jutla IS, Jeub LG, Mucha PJ (2011-2016) A generalized Louvain method for community detection implemented in MATLAB. URL http://netwiki.amath.unc.edu/GenLouvain

Kubat M Matwin S (1997) Addressing the curse of imbalanced training sets One-sided selection InProceedings of the Fourteenth International Conference on Machine Learning Morgan KaufmannPublishers Inc San Francisco CA USA pp 179ndash186

Lancichinetti A Fortunato S (2009) Community detection algorithms A comparative analysis PhysicalReview E 80056117 DOI 101103PhysRevE80056117

Larremore DB Clauset A Jacobs AZ (2014) Efficiently inferring community structure in bipartitenetworks Physical Review E Statistical Nonlinear and Soft Matter Physics 90012805 DOI101103PhysRevE90012805

Li J Fine JP (2010) Weighted area under the receiver operating characteristic curve and its application togene selection Journal of the Royal Statistical Society Series C (Applied Statistics) 59(4)673ndash692DOI 101111j1467-9876201000713x

Li X Wang L Sung E (2008) AdaBoost with SVM-based component classifiers Engineering Applicationsof Artificial Intelligence 21(5)785ndash795 DOI httpsdoiorg101016jengappai200707001

Lichman M (2013) UCI machine learning repository. URL http://archive.ics.uci.edu/ml

Liu W, Chawla S, Cieslak DA, Chawla NV (2010) A robust decision tree algorithm for imbalanced data sets. In: Proceedings of the tenth SIAM international conference on data mining, SIAM, Philadelphia, vol 10, pp 766-777

Liu XY Wu J Zhou ZH (2009) Exploratory undersampling for class-imbalance learning IEEE Transac-tions on Systems Man and Cybernetics Part B (Cybernetics) 39(2)539ndash550 DOI 101109TSMCB20082007853

Luts J Ojeda F Van de Plas R De Moor B Van Huffel S Suykens JA (2010) A tutorial on supportvector machine-based methods for classification problems in chemometrics Analytica Chimica Acta665(2)129ndash145 DOI httpsdoiorg101016jaca201003030

Macskassy SA Provost F (2007) Classification in networked data A toolkit and a univariate case studyJournal of Machine Learning Research 8(May)935ndash983

Martens D Provost F (2014) Explaining data-driven document classifications MIS Quarterly 38(1)73ndash100

Martens D Provost F Clark J Junque de Fortuny E (2016) Mining massive fine-grained behavior data toimprove predictive analytics MIS Quarterly 40(4)869ndash888

Mazurowski MA Habas PA Zurada JM Lo JY Baker JA Tourassi GD (2008) Training neural networkclassifiers for medical decision making The effects of imbalanced datasets on classification perfor-mance Neural Networks 21(23)427 ndash 436 DOI httpsdoiorg101016jneunet200712031

Mease D Wyner AJ Buja A (2007) Boosted classification trees and class probabilityquantile estimationJournal of Machine Learning Research 8409ndash439

Nemenyi P (1963) Distribution-free Multiple Comparisons. Dissertation, Princeton University

Newman MEJ, Girvan M (2004) Finding and evaluating community structure in networks. Physical Review E 69:026113, DOI 10.1103/PhysRevE.69.026113

Ng AY (2004) Feature selection, L1 vs. L2 regularization, and rotational invariance. In: Proceedings of the Twenty-first International Conference on Machine Learning, ACM, New York, NY, USA, ICML '04, pp 78-, DOI 10.1145/1015330.1015435

Ng AY Jordan MI (2002) On discriminative vs generative classifiers A comparison of logistic regressionand naive bayes In Advances in Neural Information Processing Systems 14 MIT Press pp 841ndash848

Ngai E Hu Y Wong Y Chen Y Sun X (2011) The application of data mining techniques in financial frauddetection A classification framework and an academic review of literature Decision Support Systems50(3)559 ndash 569 DOI httpsdoiorg101016jdss201008006

Platt JC (1999) Probabilistic outputs for support vector machines and comparisons to regularized likeli-hood methods In Advances in Large-Margin Classifiers MIT Press pp 61ndash74

Porter MA Onnela JP Mucha PJ (2009) Communities in networks Notices of the American MathematicalSociety 56(9)1082ndash1097

Provost F Fawcett T (2013) Data Science for Business What you need to know about data mining anddata-analytic thinking OrsquoReilly Media Inc

Provost F Dalessandro B Hook R Zhang X Murray A (2009) Audience selection for on-line brandadvertising Privacy-friendly social network targeting In Proceedings of the 15th ACM SIGKDDInternational Conference on Knowledge Discovery and Data Mining ACM New York NY USAKDD rsquo09 pp 707ndash716 DOI 10114515570191557098


Raskutti B Kowalczyk A (2004) Extreme re-balancing for SVMs A case study SIGKDD Explor Newsl6(1)60ndash69 DOI 10114510077301007739

Rosvall M Bergstrom CT (2008) Maps of random walks on complex networks reveal community structureProceedings of the National Academy of Sciences 105(4)1118ndash1123 DOI 101073pnas0706851105

Schapire RE (1999) A brief introduction to boosting In Proceedings of the 16th International Joint Con-ference on Artificial Intelligence - Volume 2 Morgan Kaufmann Publishers Inc San Francisco CAUSA IJCAIrsquo99 pp 1401ndash1406

Schapire RE Singer Y (1999) Improved boosting algorithms using confidence-rated predictions Machinelearning 37(3)297ndash336 DOI 101023A1007614523901

Shmueli G (2017) Analyzing behavioral big data Methodological practical ethical and moral issuesQuality Engineering 29(1)57ndash74 DOI 1010800898211220161210979

Sobhani P Viktor H Matwin S (2015) Learning from imbalanced data using ensemble methods andcluster-based undersampling In New Frontiers in Mining Complex Patterns Third InternationalWorkshop NFMCP 2014 Held in Conjunction with ECML-PKDD 2014 Nancy France Septem-ber 19 2014 Revised Selected Papers Springer International Publishing Cham pp 69ndash83 DOI101007978-3-319-17876-9 5

Stankova M (2016) Classification within network data with a bipartite structure Dissertation Universityof Antwerp

Stankova M Martens D Provost F (2015) Classification over bipartite graphs through projection WorkingPapers 2015001 University of Antwerp Faculty of Applied Economics

Sun Y Kamel MS Wong AK Wang Y (2007) Cost-sensitive boosting for classification of imbalanceddata Pattern Recognition 40(12)3358 ndash 3378 DOI httpsdoiorg101016jpatcog200704009

Suykens JA Van Gestel T De Brabanter J De Moor B Vandewalle J Suykens J Van Gestel T (2002)Least squares support vector machines World Scientific Singapore

Tang B He H (2015) Enn Extended nearest neighbor method for pattern recognition [research frontier]IEEE Computational Intelligence Magazine 10(3)52ndash60 DOI 101109MCI20152437512

Tang Y Zhang YQ Chawla NV Krasser S (2009) SVMs modeling for highly imbalanced classificationIEEE Transactions on Systems Man and Cybernetics Part B (Cybernetics) 39(1)281ndash288 DOI101109TSMCB20082002909

Tobback E Moeyersoms J Stankova M Martens D (2016) Bankruptcy prediction for SMEs using rela-tional data Working Paper 2016004 University of Antwerp Faculty of Applied Economics

Verbeke W Dejaeger K Martens D Hur J Baesens B (2012) New insights into churn prediction in thetelecommunication sector A profit driven data mining approach European Journal of OperationalResearch 218(1)211 ndash 229 DOI httpsdoiorg101016jejor201109031

Veropoulos K Campbell I Cristianini N (1999) Controlling the sensitivity of support vector machinesIn Proceedings of the International Joint Conference on Artificial Intelligence Stockholm Sweden(IJCAI99) pp 55 ndash 60

Whitrow C Hand DJ Juszczak P Weston D Adams NM (2009) Transaction aggregation as a strategyfor credit card fraud detection Data Mining and Knowledge Discovery 18(1)30ndash55 DOI 101007s10618-008-0116-z

Wickramaratna J Holden SB Buxton BF (2001) Performance degradation in boosting In Proceedings ofthe Second International Workshop on Multiple Classifier Systems Springer London UK MCS rsquo01pp 11ndash21

Yen SJ Lee YS (2009) Cluster-based under-sampling approaches for imbalanced data distributions ExpertSystems with Applications 36(3 Part 1)5718 ndash 5727 DOI httpsdoiorg101016jeswa200806108

Yu HF Lo HY Hsieh HP Lou JK McKenzie TG Chou JW Chung PH Ho CH Chang CF Wei YH et al(2010) Feature engineering and classifier ensemble for kdd cup 2010 In Proceedings of the KDDCup 2010 Workshop pp 1ndash16

Zha H He X Ding C Simon H Gu M (2001) Bipartite graph partitioning and data clustering In Pro-ceedings of the Tenth International Conference on Information and Knowledge Management ACMNew York NY USA CIKM rsquo01 pp 25ndash32 DOI 101145502585502591

Zhang J Mani I (2003) Knn approach to unbalanced data distributions A case study involving informationextraction In Proceedings of the ICMLrsquo2003 Workshop on Learning from Imbalanced DatasetsWashington DC

Ziegler CN McNee SM Konstan JA Lausen G (2005) Improving recommendation lists through topicdiversification In Proceedings of the 14th International Conference on World Wide Web ACM NewYork NY USA WWW rsquo05 pp 22ndash32 DOI 10114510607451060754

  • Introduction
  • Preliminaries
  • Methods
  • Results and discussion
  • The effect of the chosen base learner
  • Conclusions and future research directions
  • Oversampling experiments
  • Undersampling experiments
  • Boosting experiments
  • Final Comparison
Page 44: This item is the archived peer-reviewed author-version of datasets Jellis Vanhoeyveld David Martens Received: date / Accepted: date Abstract Recent years have witnessed a growing number

Imbalanced classification in sparse and large behaviour datasets 43

He 2015) could guide us in the sampling process As opposed to Knn ldquoit considersnot only who are the nearest neighbours of a certain sample but also who considerthe particular sample as their nearest neighbour (Tang and He 2015)rdquo The methodassigns a class membership to an unknown sample that maximizes the intra-class co-herence finally more advanced techniques do exist that can provide a K (the numberof nearest neighbours) faster or with a (slight) better performance

Another issue relates to the lack of comprehensibility of the EasyEnsemble tech-nique in its present form Because we currently integrate Logistic Regression in theboosting process the final ensemble classifier is non-linear and therefore incompre-hensible It might be interesting to investigate the numerical boosting technique pro-posed in Schapire and Singer (1999) where each weak learner is allowed to outputreal values (instead of values in [minus11]) In that case we would be able to use a plainlinear SVM as a weak learner without the Logistic Regression component The fi-nal ensemble would be a linear model which is surely easier to comprehend Alsothe instance-based approach of Martens and Provost (2014) could be used to explainindividual predictions of the non-linear EE-technique

A Oversampling experiments

Table 11 Oversampling experiments. Results show the average tenfold AUC test set performance with respect to increasing β-values (β = [β1, β2, β3, β4] = [0, 1/3, 2/3, 1]). Additionally, standard deviations are shown between brackets. Optimal performances for each method are highlighted in boldface. See Section 4.2 for parameter settings.

Mov G(p = 1)
         β1             β2             β3             β4
OSR      71.6 (2.62)    74.37 (2.04)   73.6 (1.84)    74.73 (2.45)
SMOTE    71.6 (2.62)    75.08 (2.18)   76.02 (2.14)   76.48 (2.3)
ADASYN   71.6 (2.62)    75.16 (1.92)   75.93 (2.08)   76.47 (2.29)

Mov G(p = 25)
         β1             β2             β3             β4
OSR      81.41 (1.32)   83.49 (1.81)   83.84 (1.96)   83.91 (2.04)
SMOTE    81.41 (1.32)   83.32 (1.97)   83.59 (2.04)   83.76 (2.11)
ADASYN   81.41 (1.32)   83.61 (1.82)   84.02 (1.97)   83.69 (1.96)

Mov Th(p = [])
         β1             β2             β3             β4
OSR      79.77 (5.33)   85.3 (4.66)    83.16 (4.5)    84.59 (5.69)
SMOTE    79.77 (5.33)   84.18 (6.51)   85.58 (5.97)   84.33 (5.75)
ADASYN   79.77 (5.33)   84.11 (6.77)   85.86 (6.06)   85.36 (5.13)

Yahoo A(p = 1)
         β1             β2             β3             β4
OSR      55.92 (2.97)   58.66 (3.27)   59.99 (2.28)   59.74 (1.78)
SMOTE    55.92 (2.97)   59.76 (2.62)   59.74 (2.67)   59.43 (2.4)
ADASYN   55.92 (2.97)   59.54 (2.53)   59.55 (2.94)   59.56 (2.22)

Yahoo A(p = 25)
         β1             β2             β3             β4
OSR      61.68 (2.42)   64.19 (3.17)   65.08 (3.26)   64.67 (2.1)
SMOTE    61.68 (2.42)   65.46 (3.63)   65.33 (3.23)   64.52 (2.98)
ADASYN   61.68 (2.42)   65.04 (3.74)   65.41 (3.47)   64.4 (2.21)

44 Jellis Vanhoeyveld David Martens

Table 11 continued

Yahoo G(p = 1)
         β1             β2             β3             β4
OSR      66.84 (3.66)   72.18 (2.36)   73.11 (2.7)    72.49 (3.41)
SMOTE    66.84 (3.66)   72.65 (2.85)   73.27 (3.36)   73.37 (3.56)
ADASYN   66.84 (3.66)   72.87 (2.83)   73.18 (3.2)    73.39 (3.59)

Yahoo G(p = 25)
         β1             β2             β3             β4
OSR      78.82 (1.39)   78.78 (2.02)   78.74 (1.86)   78.35 (1.47)
SMOTE    78.82 (1.39)   79.23 (1.57)   79.1 (1.2)     79.03 (1.89)
ADASYN   78.82 (1.39)   79.12 (1.43)   79.23 (1.37)   79.51 (2.01)

TaFeng(p = 1)
         β1             β2             β3             β4
OSR      55.75 (1.6)    59.23 (1.96)   60.0 (1.68)    61.04 (2.36)
SMOTE    55.75 (1.6)    60.26 (1.95)   61.49 (1.8)    61.13 (1.52)
ADASYN   55.75 (1.6)    60.26 (1.9)    61.44 (1.85)   61.16 (1.5)

TaFeng(p = 25)
         β1             β2             β3             β4
OSR      66.94 (1.34)   68.84 (1.21)   66.99 (1.42)   67.7 (1.41)
SMOTE    66.94 (1.34)   68.47 (1.5)    67.07 (1.15)   66.65 (0.81)
ADASYN   66.94 (1.34)   68.62 (1.38)   67.85 (1.6)    66.91 (1.39)

Book(p = 1)
         β1             β2             β3             β4
OSR      52.6 (1.29)    53.61 (0.94)   55.41 (1.75)   55.87 (1.44)
SMOTE    52.6 (1.29)    54.77 (0.99)   54.91 (0.8)    54.36 (0.98)
ADASYN   52.6 (1.29)    54.86 (1.13)   55.06 (0.73)   54.54 (0.92)

Book(p = 25)
         β1             β2             β3             β4
OSR      60.08 (0.71)   61.05 (0.94)   61.82 (1.12)   64.62 (0.57)
SMOTE    60.08 (0.71)   62.6 (0.73)    60.95 (0.68)   63.0 (0.8)
ADASYN   60.08 (0.71)   62.33 (0.73)   60.77 (0.85)   63.04 (0.58)

LST(p = 1)
         β1             β2             β3             β4
OSR      99.99 (0.01)   99.99 (0.01)   99.99 (0.01)   99.99 (0.01)
SMOTE    99.99 (0.01)   99.99 (0.01)   99.99 (0.01)   99.99 (0.01)
ADASYN   99.99 (0.01)   99.99 (0.01)   99.99 (0.01)   99.99 (0.01)

Adver(p = [])
         β1             β2             β3             β4
OSR      96.61 (1.82)   97.31 (1.65)   97.07 (1.84)   97.07 (1.79)
SMOTE    96.61 (1.82)   96.91 (1.66)   97.19 (1.65)   97.07 (1.91)
ADASYN   96.61 (1.82)   97.1 (1.7)     97.08 (1.87)   97.07 (1.88)

Adver(p = 1)
         β1             β2             β3             β4
OSR      90.93 (3.02)   91.27 (3.03)   92.66 (2.82)   93.29 (1.97)
SMOTE    90.93 (3.02)   92.51 (2.03)   92.96 (2.14)   93.53 (1.81)
ADASYN   90.93 (3.02)   92.22 (2.33)   92.7 (2.36)    93.88 (1.73)

CRF(p = [])
         β1             β2             β3             β4
OSR      64.06 (16.43)  80.82 (12.94)  81.28 (12.27)  81.91 (11.28)
SMOTE    64.06 (16.43)  78.64 (16.86)  82.52 (13.74)  79.32 (16.26)
ADASYN   64.06 (16.43)  78.95 (16.72)  81.19 (16.32)  79.31 (16.19)

Bank(p = [])
         β1             β2             β3             β4
OSR      66.82 (0.88)   70.1 (0.74)    71.39 (0.8)    71.47 (0.8)
SMOTE
ADASYN


B Undersampling experiments

Table 12 Undersampling experiments. Results show the average tenfold AUC test set performance with respect to increasing βu-values (βu = [βu1, βu2, βu3, βu4, βu5] = [0, 1/4, 1/2, 3/4, 1]). Optimal performances (boldface) are highlighted for each imbalance ratio. Additionally, standard deviations are shown between brackets. CL K represents the "Closest Knn" technique, CL T represents the "Closest tot sim" technique (similar for Far K and Far T, see Section 3.2). Parameter settings are given in Section 4.2.

Mov G(p = 1)
         βu1           βu2           βu3           βu4           βu5
RUS      71.6 (2.6)    71.83 (2.6)   72.54 (2.5)   72.39 (3.1)   70.61 (3.5)
Cl K     71.6 (2.6)    71.4 (2.0)    70.96 (1.9)   70.43 (2.4)   69.05 (4.1)
CL T     71.6 (2.6)    70.28 (2.5)   66.74 (2.0)   66.8 (2.1)    68.18 (3.6)
Far K    71.6 (2.6)    72.36 (2.7)   71.26 (3.4)   66.57 (5.2)   53.5 (3.5)
Far T    71.6 (2.6)    72.22 (2.8)   71.63 (3.6)   64.28 (5.3)   50.88 (4.4)
CBU      72.55 (2.6)   73.28 (2.6)   73.12 (2.6)   73.84 (2.5)   73.0 (3.1)

Mov G(p = 25)
         βu1           βu2           βu3           βu4           βu5
RUS      81.41 (1.3)   81.36 (1.3)   81.78 (1.7)   82.05 (1.7)   81.6 (2.1)
Cl K     81.41 (1.3)   80.86 (1.2)   80.95 (1.6)   79.73 (2.3)   77.95 (2.3)
CL T     81.41 (1.3)   79.9 (1.2)    78.21 (1.4)   77.87 (1.5)   77.76 (2.3)
Far K    81.41 (1.3)   80.9 (1.5)    78.17 (1.8)   74.25 (2.4)   69.79 (3.2)
Far T    81.41 (1.3)   80.86 (1.5)   77.2 (2.4)    71.16 (2.7)   62.4 (2.8)
CBU      81.53 (1.4)   81.64 (1.3)   81.29 (1.6)   81.28 (2.1)   80.34 (2.7)

Mov Th(p = [])
         βu1           βu2           βu3           βu4           βu5
RUS      79.77 (5.3)   80.32 (5.8)   81.57 (5.5)   81.86 (6.6)   81.26 (6.2)
Cl K     79.77 (5.3)   79.25 (4.5)   78.07 (5.0)   76.25 (6.5)   62.46 (8.5)
CL T     79.77 (5.3)   78.4 (4.4)    72.41 (3.5)   64.66 (4.5)   60.37 (7.3)
Far K    79.77 (5.3)   84.54 (5.0)   83.64 (6.4)   80.02 (7.3)   56.82 (10.3)
Far T    79.77 (5.3)   85.03 (5.7)   82.68 (6.8)   75.61 (9.2)   56.77 (10.9)
CBU      80.11 (5.8)   81.17 (6.0)   81.08 (6.5)   84.17 (5.1)   80.96 (6.9)

Yahoo A(p = 1)
         βu1           βu2           βu3           βu4           βu5
RUS      55.92 (3.0)   55.57 (3.4)   56.44 (3.0)   55.83 (3.4)   56.37 (3.3)
Cl K     55.92 (3.0)   55.67 (2.4)   53.12 (2.0)   50.57 (1.8)   53.79 (3.5)
CL T     55.92 (3.0)   55.69 (2.1)   53.35 (2.2)   50.31 (2.2)   52.35 (3.3)
Far K    55.92 (3.0)   57.35 (2.2)   56.92 (1.1)   56.95 (2.3)   51.18 (2.0)
Far T    55.92 (3.0)   56.93 (2.4)   54.74 (1.9)   57.01 (1.8)   51.18 (2.0)
CBU      58.21 (2.6)   58.45 (3.3)   58.31 (3.5)   58.39 (3.5)   56.09 (2.6)

Yahoo A(p = 25)
         βu1           βu2           βu3           βu4           βu5
RUS      61.68 (2.4)   62.9 (2.9)    63.62 (3.6)   63.75 (3.1)   63.19 (1.9)
Cl K     61.68 (2.4)   61.14 (2.1)   57.62 (1.6)   54.02 (1.8)   51.48 (1.4)
CL T     61.68 (2.4)   60.89 (2.8)   58.11 (1.4)   54.4 (2.1)    51.76 (1.4)
Far K    61.68 (2.4)   63.96 (3.0)   62.62 (2.2)   59.61 (1.5)   56.25 (1.6)
Far T    61.68 (2.4)   63.71 (2.4)   59.72 (1.6)   57.27 (1.1)   54.47 (1.1)
CBU      62.46 (2.6)   61.85 (1.4)   61.78 (2.2)   59.94 (3.0)   60.1 (4.0)

Table 12 continued

Yahoo G(p = 1)
         βu1           βu2           βu3           βu4           βu5
RUS      66.84 (3.7)   67.85 (3.2)   68.36 (3.2)   68.23 (4.0)   69.9 (4.2)
Cl K     66.84 (3.7)   66.71 (2.8)   64.3 (3.6)    61.98 (3.9)   61.15 (1.9)
CL T     66.84 (3.7)   65.79 (2.7)   63.55 (3.3)   59.21 (3.5)   61.08 (2.4)
Far K    66.84 (3.7)   66.76 (4.1)   63.84 (3.4)   65.16 (2.0)   48.5 (2.9)
Far T    66.84 (3.7)   66.95 (4.1)   63.48 (2.9)   65.16 (2.0)   48.48 (2.9)
CBU      69.68 (4.1)   70.59 (3.2)   70.64 (3.7)   70.2 (2.9)    63.35 (3.6)

Yahoo G(p = 25)
         βu1           βu2           βu3           βu4           βu5
RUS      78.82 (1.4)   78.91 (1.6)   78.97 (1.6)   78.61 (1.6)   77.82 (2.1)
Cl K     78.82 (1.4)   77.26 (1.5)   72.52 (1.5)   67.86 (2.0)   65.07 (2.7)
CL T     78.82 (1.4)   76.83 (1.0)   71.99 (1.8)   67.15 (2.3)   61.1 (2.7)
Far K    78.82 (1.4)   78.26 (2.2)   74.69 (2.7)   67.22 (2.1)   60.72 (2.3)
Far T    78.82 (1.4)   77.68 (2.6)   72.44 (3.0)   64.94 (2.4)   59.6 (2.0)
CBU      75.25 (3.2)   75.22 (2.4)   74.69 (2.3)   73.07 (2.4)   70.69 (2.4)

TaFeng(p = 1)
         βu1           βu2           βu3           βu4           βu5
RUS      55.75 (1.6)   56.1 (1.6)    56.26 (1.7)   57.23 (1.7)   59.25 (2.2)
Cl K     55.75 (1.6)   55.68 (1.6)   55.58 (1.5)   55.08 (1.1)   51.05 (1.5)
CL T     55.75 (1.6)   55.67 (1.6)   54.47 (1.6)   47.53 (1.6)   49.3 (1.1)
Far K    55.75 (1.6)   58.99 (1.2)   59.47 (1.1)   60.04 (1.2)   56.31 (1.0)
Far T    55.75 (1.6)   58.92 (1.3)   59.25 (1.3)   58.58 (1.1)   56.31 (1.0)
CBU      57.8 (1.0)    58.47 (1.1)   58.15 (0.9)   58.87 (1.4)   57.65 (1.6)

TaFeng(p = 25)
         βu1           βu2           βu3           βu4           βu5
RUS      66.94 (1.3)   67.44 (1.3)   68.1 (1.4)    68.27 (1.4)   66.13 (1.2)
Cl K     66.94 (1.3)   66.13 (1.4)   63.39 (1.2)   59.83 (1.3)   56.94 (0.7)
CL T     66.94 (1.3)   66.38 (1.5)   62.89 (1.6)   57.46 (1.3)   54.56 (1.3)
Far K    66.94 (1.3)   68.06 (1.4)   66.43 (1.6)   64.46 (1.5)   63.35 (1.3)
Far T    66.94 (1.3)   64.31 (1.1)   62.69 (1.0)   61.27 (1.1)   59.03 (1.0)
CBU      64.81 (1.2)   64.15 (1.1)   64.13 (1.2)   63.88 (0.8)   63.46 (0.8)

Book(p = 1)
         βu1           βu2           βu3           βu4           βu5
RUS      52.6 (1.3)    52.79 (0.9)   53.46 (0.8)   53.89 (0.9)   54.05 (0.9)
Cl K     52.6 (1.3)    52.56 (1.2)   52.52 (1.3)   52.39 (1.1)   53.09 (1.1)
CL T     52.6 (1.3)    52.56 (1.2)   52.52 (1.3)   52.39 (1.1)   53.05 (0.7)
Far K    52.6 (1.3)    55.21 (1.2)   56.21 (1.8)   56.14 (1.2)   53.06 (1.0)
Far T    52.6 (1.3)    55.21 (1.2)   56.21 (1.8)   56.14 (1.2)   53.06 (1.0)
CBU      54.28 (0.9)   53.77 (1.0)   53.33 (1.1)   53.34 (0.9)   52.84 (0.8)

Book(p = 25)
         βu1           βu2           βu3           βu4           βu5
RUS      60.08 (0.7)   60.13 (0.6)   60.4 (0.8)    60.33 (0.8)   63.28 (0.8)
Cl K     60.08 (0.7)   59.96 (0.7)   60.13 (0.8)   59.96 (1.0)   59.28 (0.7)
CL T     60.08 (0.7)   59.96 (0.7)   60.13 (0.8)   60.29 (0.4)   54.5 (0.9)
Far K    60.08 (0.7)   63.29 (1.0)   64.19 (0.8)   57.3 (1.1)    55.66 (1.1)
Far T    60.08 (0.7)   62.14 (0.5)   58.27 (0.6)   56.37 (1.0)   55.66 (1.1)
CBU      54.82 (0.9)   54.67 (0.9)   54.71 (0.9)   54.66 (1.0)   54.78 (0.9)

Table 12 continued

LST(p = 1)
         βu1           βu2           βu3           βu4           βu5
RUS      99.99 (0.0)   99.99 (0.0)   99.99 (0.0)   99.98 (0.0)   99.99 (0.0)
Cl K     99.99 (0.0)   99.99 (0.0)   99.99 (0.0)   99.99 (0.0)   99.99 (0.0)
CL T     99.99 (0.0)   99.99 (0.0)   99.99 (0.0)   99.99 (0.0)   99.98 (0.0)
Far K    99.99 (0.0)   99.98 (0.0)   99.98 (0.0)   99.98 (0.0)   99.98 (0.0)
Far T    99.99 (0.0)   99.98 (0.0)   99.98 (0.0)   99.98 (0.0)   99.98 (0.0)
CBU      []            []            []            []            []

Adver(p = [])
         βu1           βu2           βu3           βu4           βu5
RUS      96.61 (1.8)   96.32 (1.8)   96.63 (1.4)   97.12 (2.1)   96.22 (1.6)
Cl K     96.61 (1.8)   96.44 (1.5)   96.14 (1.5)   96.04 (2.0)   94.8 (2.5)
CL T     96.61 (1.8)   95.87 (2.1)   94.32 (1.9)   93.01 (2.2)   90.72 (2.3)
Far K    96.61 (1.8)   96.53 (1.4)   95.76 (2.0)   94.39 (1.8)   90.49 (3.1)
Far T    96.61 (1.8)   96.54 (1.5)   95.67 (1.9)   94.54 (1.8)   89.3 (2.8)
CBU      96.85 (2.3)   96.85 (2.3)   97.05 (1.5)   96.6 (1.6)    96.06 (2.1)

Adver(p = 1)
         βu1           βu2           βu3           βu4           βu5
RUS      90.93 (3.0)   91.53 (3.1)   92.37 (3.4)   91.9 (2.9)    91.93 (2.2)
Cl K     90.93 (3.0)   90.64 (3.0)   89.87 (3.9)   90.21 (3.6)   89.18 (2.0)
CL T     90.93 (3.0)   89.7 (3.5)    88.55 (3.4)   85.76 (3.3)   88.2 (2.3)
Far K    90.93 (3.0)   93.8 (2.3)    92.4 (2.6)    88.73 (3.4)   85.51 (4.0)
Far T    90.93 (3.0)   93.62 (2.4)   93.2 (2.2)    88.41 (3.6)   85.51 (4.0)
CBU      93.22 (2.4)   93.76 (2.5)   93.89 (2.6)   93.52 (2.7)   91.27 (2.0)

CRF(p = [])
         βu1           βu2           βu3           βu4           βu5
RUS      64.06 (16.4)  63.28 (15.9)  67.98 (17.4)  66.95 (21.9)  87.73 (8.8)
Cl K     64.06 (16.4)  62.44 (16.6)  62.34 (16.9)  71.37 (13.8)  78.22 (17.7)
CL T     64.06 (16.4)  62.44 (16.6)  62.34 (16.9)  71.37 (13.8)  62.67 (22.9)
Far K    64.06 (16.4)  83.8 (14.2)   83.93 (14.8)  84.49 (13.7)  86.11 (9.7)
Far T    64.06 (16.4)  83.8 (14.2)   83.93 (14.8)  84.49 (13.7)  86.11 (9.7)
CBU      []            []            []            []            []

Bank(p = [])
         βu1           βu2           βu3           βu4           βu5
RUS      66.82 (0.9)   67.02 (0.9)   67.37 (0.8)   67.99 (0.6)   69.5 (1.0)
Cl K     66.82 (0.9)   66.17 (0.7)   65.24 (0.6)   64.86 (0.6)   58.53 (1.1)
CL T     66.82 (0.9)   64.92 (1.1)   60.69 (0.9)   56.33 (0.8)   52.87 (0.7)
Far K    66.82 (0.9)   66.95 (0.6)   66.19 (0.6)   64.42 (0.6)   58.25 (1.1)
Far T    66.82 (0.9)   67.16 (0.6)   64.2 (0.8)    59.67 (1.0)   58.25 (1.1)
CBU      []            []            []            []            []


C Boosting experiments

This section presents the results of each of the boosting experiments from Section 4.5 on the remaining data sources. Each of the figures below depicts the average tenfold AUC-performance on test data (with µ = 100) with respect to the number of boosting iterations for (left) Adaboost (AB), AdaCost (AC) and EasyEnsemble (EE), with C chosen according to highest validation set AUC-performance (over all possible boosting rounds), and (right) AB and EE (with S = 15) with varying C-levels.

[Figs. 6–17: average tenfold AUC test performance versus the number of boosting iterations T. Panel (a) of each figure compares AB, AC (R2, R8, RD) and EE (S5, S10, S15) against the baseline BL; panel (b) compares AB and EE for C = 1e-07, 1e-05, 0.001 and 0.1 against BL.]

Fig 6 Mov G(p = 1) dataset
Fig 7 Mov Th(p = []) dataset
Fig 8 Yahoo A(p = 1) dataset
Fig 9 Yahoo A(p = 25) dataset
Fig 10 Yahoo G(p = 1) dataset
Fig 11 Yahoo G(p = 25) dataset
Fig 12 TaFeng(p = 1) dataset
Fig 13 Book(p = 1) dataset
Fig 14 LST(p = 1) dataset
Fig 15 Adver(p = []) dataset
Fig 16 Adver(p = 1) dataset
Fig 17 CRF(p = []) dataset

D Final Comparison

[Fig. 18: scatter plot of Average Rank Time (y-axis, 0–18) against Average Rank AUC (x-axis, 0–14) for BL, OSR, SMOTE, ADASYN, RUS, Cl Knn, Far Knn, CBU, AB, AC(R=28), AC(R=RL), EE(S=10), EE(S=15) and EE par.]

Fig 18 Average rank AUC versus average rank Time (see Table 9) across the large datasets from Table 2 (CRF and Bank). Regarding the former, an extra row EE par is added, having the same AUC-value as EE(S = 15). Ranks for each dataset are subsequently obtained via the procedure outlined in Section 4.6.1. Points occurring in the upper-right region are preferred.


References

Akbani R, Kwek S, Japkowicz N (2004) Applying support vector machines to imbalanced datasets. In: Machine Learning: ECML 2004, 15th European Conference on Machine Learning, Pisa, Italy, September 20-24, 2004, Proceedings. Springer Berlin Heidelberg, Berlin, Heidelberg, pp 39–50. DOI 10.1007/978-3-540-30115-8_7

Ali A, Shamsuddin SM, Ralescu AL (2015) Classification with class imbalance problem: A review. International Journal of Advances in Soft Computing and its Applications 7(3):176–204

Alzahrani T, Horadam KJ (2016) Community detection in bipartite networks: Algorithms and case studies. In: Complex Systems and Networks: Dynamics, Controls and Applications. Springer Berlin Heidelberg, Berlin, Heidelberg, pp 25–50. DOI 10.1007/978-3-662-47824-0_2

Bachner J (2013) Predictive policing: Preventing crime with data and analytics. IBM Center for the Business of Government

Baesens B, Van Gestel T, Viaene S, Stepanova M, Suykens J, Vanthienen J (2003) Benchmarking state-of-the-art classification algorithms for credit scoring. Journal of the Operational Research Society 54(6):627–635. DOI 10.1057/palgrave.jors.2601545

Barandela R, Sánchez J, García V, Rangel E (2003) Strategies for learning in class imbalance problems. Pattern Recognition 36(3):849–851. DOI 10.1016/S0031-3203(02)00257-1

Barber MJ (2007) Modularity and community detection in bipartite networks. Physical Review E 76:066102. DOI 10.1103/PhysRevE.76.066102

Barua S, Islam MM, Yao X, Murase K (2014) MWMOTE–majority weighted minority oversampling technique for imbalanced data set learning. IEEE Transactions on Knowledge and Data Engineering 26(2):405–425. DOI 10.1109/TKDE.2012.232

Batista GEAPA, Prati RC, Monard MC (2004) A study of the behavior of several methods for balancing machine learning training data. SIGKDD Explor Newsl 6(1):20–29. DOI 10.1145/1007730.1007735

Beckett SJ (2016) Improved community detection in weighted bipartite networks. Royal Society Open Science 3(1). DOI 10.1098/rsos.140536

Bekkar M, Djemaa HK, Alitouche TA (2013) Evaluation measures for models assessment over imbalanced data sets. Journal of Information Engineering and Applications 3(10):27–38

Bhattacharyya S, Jha S, Tharakunnel K, Westland JC (2011) Data mining for credit card fraud: A comparative study. Decision Support Systems 50(3):602–613. DOI 10.1016/j.dss.2010.08.008

Blondel VD, Guillaume JL, Lambiotte R, Lefebvre E (2008) Fast unfolding of communities in large networks. Journal of Statistical Mechanics: Theory and Experiment 2008(10):P10008

Breiman L, Friedman J, Stone CJ, Olshen RA (1984) Classification and regression trees. Taylor & Francis

Brozovsky L, Petricek V (2007) Recommender system for online dating service. In: Proceedings of Znalosti 2007 Conference, VSB Ostrava

Cha M, Mislove A, Gummadi KP (2009) A measurement-driven analysis of information propagation in the Flickr social network. In: Proceedings of the 18th International Conference on World Wide Web, ACM, New York, NY, USA, WWW '09, pp 721–730. DOI 10.1145/1526709.1526806

Chawla NV (2005) Data mining for imbalanced datasets: An overview. In: Data Mining and Knowledge Discovery Handbook. Springer US, Boston, MA, pp 853–867

Chawla NV, Bowyer KW, Hall LO, Kegelmeyer WP (2002) SMOTE: synthetic minority over-sampling technique. Journal of Artificial Intelligence Research 16:321–357

Chawla NV, Lazarevic A, Hall LO, Bowyer KW (2003) SMOTEBoost: Improving prediction of the minority class in boosting. In: Knowledge Discovery in Databases: PKDD 2003. Springer Berlin Heidelberg, Berlin, Heidelberg, pp 107–119

Chawla NV, Japkowicz N, Kotcz A (2004) Editorial: Special issue on learning from imbalanced data sets. SIGKDD Explor Newsl 6(1):1–6. DOI 10.1145/1007730.1007733

Chen M, Mao S, Liu Y (2014) Big data: A survey. Mobile Networks and Applications 19(2):171–209. DOI 10.1007/s11036-013-0489-0

Chyi YM (2003) Classification analysis techniques for skewed class distribution problems. Master thesis, Department of Information Management, National Sun Yat-Sen University

Demsar J (2006) Statistical comparisons of classifiers over multiple data sets. Journal of Machine Learning Research 7(Jan):1–30

Dhillon IS (2001) Co-clustering documents and words using bipartite spectral graph partitioning. In: Proceedings of the Seventh ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, ACM, New York, NY, USA, KDD '01, pp 269–274. DOI 10.1145/502512.502550


Drummond C, Holte RC (2003) C4.5, class imbalance, and cost sensitivity: why under-sampling beats over-sampling. In: Proceedings of the ICML '03 Workshop on Learning from Imbalanced Datasets

Fan RE, Chang KW, Hsieh CJ, Wang XR, Lin CJ (2008) LIBLINEAR: A library for large linear classification. Journal of Machine Learning Research 9:1871–1874

Fan W, Stolfo SJ, Zhang J, Chan PK (1999) AdaCost: Misclassification cost-sensitive boosting. In: Proceedings of the Sixteenth International Conference on Machine Learning, Morgan Kaufmann Publishers Inc, San Francisco, CA, USA, ICML '99, pp 97–105

Fawcett T (2006) An introduction to ROC analysis. Pattern Recognition Letters 27(8):861–874. DOI 10.1016/j.patrec.2005.10.010

Finch H (2005) Comparison of distance measures in cluster analysis with dichotomous data. Journal of Data Science 3(1):85–100

Fortunato S (2010) Community detection in graphs. Physics Reports 486(3-5):75–174. DOI 10.1016/j.physrep.2009.11.002

Junque de Fortuny E, Martens D, Provost F (2014a) Predictive modeling with big data: is bigger really better? Big Data 1(4):215–226. DOI 10.1089/big.2013.0037

Junque de Fortuny E, Stankova M, Moeyersoms J, Minnaert B, Provost F, Martens D (2014b) Corporate residence fraud detection. In: Proceedings of the 20th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, ACM, New York, NY, USA, KDD '14, pp 1650–1659. DOI 10.1145/2623330.2623333

Frasca M, Bertoni A, Re M, Valentini G (2013) A neural network algorithm for semi-supervised node label learning from unbalanced data. Neural Networks 43:84–98. DOI 10.1016/j.neunet.2013.01.021

Friedman M (1937) The use of ranks to avoid the assumption of normality implicit in the analysis of variance. Journal of the American Statistical Association 32(200):675–701

García E, Lozano F (2007) Boosting support vector machines. In: Machine Learning and Data Mining in Pattern Recognition, 5th International Conference, MLDM 2007, Leipzig, Germany, July 18-20, Post Proceedings. IBaI publishing, pp 153–167

Goldstein M, Uchida S (2016) A comparative evaluation of unsupervised anomaly detection algorithms for multivariate data. PLOS ONE 11(4):1–31. DOI 10.1371/journal.pone.0152173

González PC, Velásquez JD (2013) Characterization and detection of taxpayers with false invoices using data mining techniques. Expert Systems with Applications 40(5):1427–1436. DOI 10.1016/j.eswa.2012.08.051

Guimera R, Sales-Pardo M, Amaral LAN (2007) Module identification in bipartite and directed networks. Physical Review E 76:036102. DOI 10.1103/PhysRevE.76.036102

Guo H, Viktor HL (2004) Learning from imbalanced data sets with boosting and data generation: The DataBoost-IM approach. SIGKDD Explor Newsl 6(1):30–39. DOI 10.1145/1007730.1007736

Guo X, Yin Y, Dong C, Yang G, Zhou G (2008) On the class imbalance problem. In: 2008 Fourth International Conference on Natural Computation, IEEE, vol 4, pp 192–201. DOI 10.1109/ICNC.2008.871

Han H, Wang WY, Mao BH (2005) Borderline-SMOTE: A new over-sampling method in imbalanced data sets learning. In: Advances in Intelligent Computing. Springer Berlin Heidelberg, Berlin, Heidelberg, pp 878–887

He H, Garcia EA (2009) Learning from imbalanced data. IEEE Transactions on Knowledge and Data Engineering 21(9):1263–1284. DOI 10.1109/TKDE.2008.239

He H, Bai Y, Garcia EA, Li S (2008) ADASYN: Adaptive synthetic sampling approach for imbalanced learning. In: 2008 IEEE International Joint Conference on Neural Networks (IEEE World Congress on Computational Intelligence), IEEE, pp 1322–1328. DOI 10.1109/IJCNN.2008.4633969

Holm S (1979) A simple sequentially rejective multiple test procedure. Scandinavian Journal of Statistics 6(2):65–70

Hsu CW, Lin CJ (2002) A comparison of methods for multiclass support vector machines. IEEE Transactions on Neural Networks 13(2):415–425. DOI 10.1109/72.991427

Huang A (2008) Similarity measures for text document clustering. In: Proceedings of the Sixth New Zealand Computer Science Research Student Conference (NZCSRSC2008), Christchurch, New Zealand, pp 49–56

Iman RL, Davenport JM (1980) Approximations of the critical region of the Friedman statistic. Communications in Statistics - Theory and Methods 9(6):571–595

Jo T, Japkowicz N (2004) Class imbalances versus small disjuncts. ACM SIGKDD Explor Newsl 6(1):40–49. DOI 10.1145/1007730.1007737


Jutla IS, Jeub LG, Mucha PJ (2011-2016) A generalized Louvain method for community detection implemented in MATLAB. URL http://netwiki.amath.unc.edu/GenLouvain

Kubat M, Matwin S (1997) Addressing the curse of imbalanced training sets: One-sided selection. In: Proceedings of the Fourteenth International Conference on Machine Learning, Morgan Kaufmann Publishers Inc, San Francisco, CA, USA, pp 179–186

Lancichinetti A, Fortunato S (2009) Community detection algorithms: A comparative analysis. Physical Review E 80:056117. DOI 10.1103/PhysRevE.80.056117

Larremore DB, Clauset A, Jacobs AZ (2014) Efficiently inferring community structure in bipartite networks. Physical Review E: Statistical, Nonlinear, and Soft Matter Physics 90:012805. DOI 10.1103/PhysRevE.90.012805

Li J, Fine JP (2010) Weighted area under the receiver operating characteristic curve and its application to gene selection. Journal of the Royal Statistical Society: Series C (Applied Statistics) 59(4):673–692. DOI 10.1111/j.1467-9876.2010.00713.x

Li X, Wang L, Sung E (2008) AdaBoost with SVM-based component classifiers. Engineering Applications of Artificial Intelligence 21(5):785–795. DOI 10.1016/j.engappai.2007.07.001

Lichman M (2013) UCI machine learning repository. URL http://archive.ics.uci.edu/ml

Liu W, Chawla S, Cieslak DA, Chawla NV (2010) A robust decision tree algorithm for imbalanced data sets. In: Proceedings of the Tenth SIAM International Conference on Data Mining, SIAM, Philadelphia, vol 10, pp 766–777

Liu XY, Wu J, Zhou ZH (2009) Exploratory undersampling for class-imbalance learning. IEEE Transactions on Systems, Man, and Cybernetics, Part B (Cybernetics) 39(2):539–550. DOI 10.1109/TSMCB.2008.2007853

Luts J, Ojeda F, Van de Plas R, De Moor B, Van Huffel S, Suykens JA (2010) A tutorial on support vector machine-based methods for classification problems in chemometrics. Analytica Chimica Acta 665(2):129–145. DOI 10.1016/j.aca.2010.03.030

Macskassy SA, Provost F (2007) Classification in networked data: A toolkit and a univariate case study. Journal of Machine Learning Research 8(May):935–983

Martens D, Provost F (2014) Explaining data-driven document classifications. MIS Quarterly 38(1):73–100

Martens D, Provost F, Clark J, Junque de Fortuny E (2016) Mining massive fine-grained behavior data to improve predictive analytics. MIS Quarterly 40(4):869–888

Mazurowski MA, Habas PA, Zurada JM, Lo JY, Baker JA, Tourassi GD (2008) Training neural network classifiers for medical decision making: The effects of imbalanced datasets on classification performance. Neural Networks 21(2-3):427–436. DOI 10.1016/j.neunet.2007.12.031

Mease D, Wyner AJ, Buja A (2007) Boosted classification trees and class probability/quantile estimation. Journal of Machine Learning Research 8:409–439

Nemenyi P (1963) Distribution-free Multiple Comparisons. Dissertation, Princeton University

Newman MEJ, Girvan M (2004) Finding and evaluating community structure in networks. Physical Review E 69:026113. DOI 10.1103/PhysRevE.69.026113

Ng AY (2004) Feature selection, L1 vs. L2 regularization, and rotational invariance. In: Proceedings of the Twenty-first International Conference on Machine Learning, ACM, New York, NY, USA, ICML '04, pp 78–. DOI 10.1145/1015330.1015435

Ng AY, Jordan MI (2002) On discriminative vs. generative classifiers: A comparison of logistic regression and naive Bayes. In: Advances in Neural Information Processing Systems 14, MIT Press, pp 841–848

Ngai E, Hu Y, Wong Y, Chen Y, Sun X (2011) The application of data mining techniques in financial fraud detection: A classification framework and an academic review of literature. Decision Support Systems 50(3):559–569. DOI 10.1016/j.dss.2010.08.006

Platt JC (1999) Probabilistic outputs for support vector machines and comparisons to regularized likelihood methods. In: Advances in Large-Margin Classifiers, MIT Press, pp 61–74

Porter MA, Onnela JP, Mucha PJ (2009) Communities in networks. Notices of the American Mathematical Society 56(9):1082–1097

Provost F, Fawcett T (2013) Data Science for Business: What you need to know about data mining and data-analytic thinking. O'Reilly Media, Inc

Provost F, Dalessandro B, Hook R, Zhang X, Murray A (2009) Audience selection for on-line brand advertising: Privacy-friendly social network targeting. In: Proceedings of the 15th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, ACM, New York, NY, USA, KDD '09, pp 707–716. DOI 10.1145/1557019.1557098


Raskutti B, Kowalczyk A (2004) Extreme re-balancing for SVMs: A case study. SIGKDD Explor Newsl 6(1):60–69. DOI 10.1145/1007730.1007739

Rosvall M, Bergstrom CT (2008) Maps of random walks on complex networks reveal community structure. Proceedings of the National Academy of Sciences 105(4):1118–1123. DOI 10.1073/pnas.0706851105

Schapire RE (1999) A brief introduction to boosting. In: Proceedings of the 16th International Joint Conference on Artificial Intelligence - Volume 2, Morgan Kaufmann Publishers Inc, San Francisco, CA, USA, IJCAI'99, pp 1401–1406

Schapire RE, Singer Y (1999) Improved boosting algorithms using confidence-rated predictions. Machine Learning 37(3):297–336. DOI 10.1023/A:1007614523901

Shmueli G (2017) Analyzing behavioral big data: Methodological, practical, ethical, and moral issues. Quality Engineering 29(1):57–74. DOI 10.1080/08982112.2016.1210979

Sobhani P, Viktor H, Matwin S (2015) Learning from imbalanced data using ensemble methods and cluster-based undersampling. In: New Frontiers in Mining Complex Patterns: Third International Workshop, NFMCP 2014, Held in Conjunction with ECML-PKDD 2014, Nancy, France, September 19, 2014, Revised Selected Papers. Springer International Publishing, Cham, pp 69–83. DOI 10.1007/978-3-319-17876-9_5

Stankova M (2016) Classification within network data with a bipartite structure. Dissertation, University of Antwerp

Stankova M, Martens D, Provost F (2015) Classification over bipartite graphs through projection. Working Papers 2015001, University of Antwerp, Faculty of Applied Economics

Sun Y, Kamel MS, Wong AK, Wang Y (2007) Cost-sensitive boosting for classification of imbalanced data. Pattern Recognition 40(12):3358–3378. DOI 10.1016/j.patcog.2007.04.009

Suykens JA, Van Gestel T, De Brabanter J, De Moor B, Vandewalle J (2002) Least squares support vector machines. World Scientific, Singapore

Tang B, He H (2015) ENN: Extended nearest neighbor method for pattern recognition [research frontier]. IEEE Computational Intelligence Magazine 10(3):52–60. DOI 10.1109/MCI.2015.2437512

Tang Y, Zhang YQ, Chawla NV, Krasser S (2009) SVMs modeling for highly imbalanced classification. IEEE Transactions on Systems, Man, and Cybernetics, Part B (Cybernetics) 39(1):281–288. DOI 10.1109/TSMCB.2008.2002909

Tobback E, Moeyersoms J, Stankova M, Martens D (2016) Bankruptcy prediction for SMEs using relational data. Working Paper 2016004, University of Antwerp, Faculty of Applied Economics

Verbeke W, Dejaeger K, Martens D, Hur J, Baesens B (2012) New insights into churn prediction in the telecommunication sector: A profit driven data mining approach. European Journal of Operational Research 218(1):211–229. DOI 10.1016/j.ejor.2011.09.031

Veropoulos K, Campbell I, Cristianini N (1999) Controlling the sensitivity of support vector machines. In: Proceedings of the International Joint Conference on Artificial Intelligence, Stockholm, Sweden (IJCAI99), pp 55–60

Whitrow C, Hand DJ, Juszczak P, Weston D, Adams NM (2009) Transaction aggregation as a strategy for credit card fraud detection. Data Mining and Knowledge Discovery 18(1):30–55. DOI 10.1007/s10618-008-0116-z

Wickramaratna J, Holden SB, Buxton BF (2001) Performance degradation in boosting. In: Proceedings of the Second International Workshop on Multiple Classifier Systems, Springer, London, UK, MCS '01, pp 11–21

Yen SJ, Lee YS (2009) Cluster-based under-sampling approaches for imbalanced data distributions. Expert Systems with Applications 36(3, Part 1):5718–5727. DOI 10.1016/j.eswa.2008.06.108

Yu HF, Lo HY, Hsieh HP, Lou JK, McKenzie TG, Chou JW, Chung PH, Ho CH, Chang CF, Wei YH, et al (2010) Feature engineering and classifier ensemble for KDD Cup 2010. In: Proceedings of the KDD Cup 2010 Workshop, pp 1–16

Zha H, He X, Ding C, Simon H, Gu M (2001) Bipartite graph partitioning and data clustering. In: Proceedings of the Tenth International Conference on Information and Knowledge Management, ACM, New York, NY, USA, CIKM '01, pp 25–32. DOI 10.1145/502585.502591

Zhang J, Mani I (2003) kNN approach to unbalanced data distributions: A case study involving information extraction. In: Proceedings of the ICML'2003 Workshop on Learning from Imbalanced Datasets, Washington, DC

Ziegler CN, McNee SM, Konstan JA, Lausen G (2005) Improving recommendation lists through topic diversification. In: Proceedings of the 14th International Conference on World Wide Web, ACM, New York, NY, USA, WWW '05, pp 22–32. DOI 10.1145/1060745.1060754

  • Introduction
  • Preliminaries
  • Methods
  • Results and discussion
  • The effect of the chosen base learner
  • Conclusions and future research directions
  • Oversampling experiments
  • Undersampling experiments
  • Boosting experiments
  • Final Comparison

44 Jellis Vanhoeyveld David Martens

Table 11 continued

Yahoo G(p = 1)
         β1           β2           β3           β4
OSR      6684 (366)   7218 (236)   7311 (27)    7249 (341)
SMOTE    6684 (366)   7265 (285)   7327 (336)   7337 (356)
ADASYN   6684 (366)   7287 (283)   7318 (32)    7339 (359)

Yahoo G(p = 25)
         β1           β2           β3           β4
OSR      7882 (139)   7878 (202)   7874 (186)   7835 (147)
SMOTE    7882 (139)   7923 (157)   791 (12)     7903 (189)
ADASYN   7882 (139)   7912 (143)   7923 (137)   7951 (201)

TaFeng(p = 1)
         β1           β2           β3           β4
OSR      5575 (16)    5923 (196)   60 (168)     6104 (236)
SMOTE    5575 (16)    6026 (195)   6149 (18)    6113 (152)
ADASYN   5575 (16)    6026 (19)    6144 (185)   6116 (15)

TaFeng(p = 25)
         β1           β2           β3           β4
OSR      6694 (134)   6884 (121)   6699 (142)   677 (141)
SMOTE    6694 (134)   6847 (15)    6707 (115)   6665 (081)
ADASYN   6694 (134)   6862 (138)   6785 (16)    6691 (139)

Book(p = 1)
         β1           β2           β3           β4
OSR      526 (129)    5361 (094)   5541 (175)   5587 (144)
SMOTE    526 (129)    5477 (099)   5491 (08)    5436 (098)
ADASYN   526 (129)    5486 (113)   5506 (073)   5454 (092)

Book(p = 25)
         β1           β2           β3           β4
OSR      6008 (071)   6105 (094)   6182 (112)   6462 (057)
SMOTE    6008 (071)   626 (073)    6095 (068)   63 (08)
ADASYN   6008 (071)   6233 (073)   6077 (085)   6304 (058)

LST(p = 1)
         β1           β2           β3           β4
OSR      9999 (001)   9999 (001)   9999 (001)   9999 (001)
SMOTE    9999 (001)   9999 (001)   9999 (001)   9999 (001)
ADASYN   9999 (001)   9999 (001)   9999 (001)   9999 (001)

Adver(p = [])
         β1           β2           β3           β4
OSR      9661 (182)   9731 (165)   9707 (184)   9707 (179)
SMOTE    9661 (182)   9691 (166)   9719 (165)   9707 (191)
ADASYN   9661 (182)   971 (17)     9708 (187)   9707 (188)

Adver(p = 1)
         β1           β2           β3           β4
OSR      9093 (302)   9127 (303)   9266 (282)   9329 (197)
SMOTE    9093 (302)   9251 (203)   9296 (214)   9353 (181)
ADASYN   9093 (302)   9222 (233)   927 (236)    9388 (173)

CRF(p = [])
         β1            β2            β3            β4
OSR      6406 (1643)   8082 (1294)   8128 (1227)   8191 (1128)
SMOTE    6406 (1643)   7864 (1686)   8252 (1374)   7932 (1626)
ADASYN   6406 (1643)   7895 (1672)   8119 (1632)   7931 (1619)

Bank(p = [])
         β1           β2           β3           β4
OSR      6682 (088)   701 (074)    7139 (08)    7147 (08)
SMOTE
ADASYN
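Table 11 compares the oversampling techniques OSR, SMOTE and ADASYN at increasing oversampling degrees β. As a reading aid, the interpolation step at the heart of SMOTE can be sketched in a few lines of NumPy; this is an illustrative dense-array sketch with a hypothetical helper name (`smote_sample`), not the implementation used in the experiments:

```python
import numpy as np

def smote_sample(X_min, n_new, k=5, rng=None):
    """Generate n_new synthetic minority samples by interpolating between
    each random seed point and one of its k nearest minority neighbours
    (a minimal SMOTE-style sketch for dense arrays)."""
    rng = np.random.default_rng(rng)
    n = len(X_min)
    # pairwise Euclidean distances within the minority class
    d = np.linalg.norm(X_min[:, None, :] - X_min[None, :, :], axis=2)
    np.fill_diagonal(d, np.inf)
    k = min(k, n - 1)
    nn = np.argsort(d, axis=1)[:, :k]          # k nearest neighbours per point
    seeds = rng.integers(0, n, size=n_new)     # random seed points
    picks = nn[seeds, rng.integers(0, k, size=n_new)]
    gaps = rng.random((n_new, 1))              # interpolation factors in [0, 1)
    return X_min[seeds] + gaps * (X_min[picks] - X_min[seeds])

X_min = np.array([[0.0, 0.0], [1.0, 0.0], [0.0, 1.0]])
X_new = smote_sample(X_min, n_new=4, k=2, rng=0)
print(X_new.shape)  # (4, 2)
```

Synthetic points always lie on segments between a minority seed and one of its k nearest minority neighbours, which is why SMOTE tends to fill in, rather than replicate, the minority region.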

Imbalanced classification in sparse and large behaviour datasets 45

B Undersampling experiments

Table 12 Undersampling experiments. Results show the average tenfold AUC test set performance with respect to increasing βu-values (βu = [βu1, βu2, βu3, βu4, βu5] = [0, 1/4, 1/2, 3/4, 1]). Optimal performances (boldface) are highlighted for each imbalance ratio. Additionally, standard deviations are shown between brackets. Cl K represents the "Closest Knn" technique; Cl T represents the "Closest tot sim" technique (similarly for Far K and Far T; see Section 3.2). Parameter settings are given in Section 4.2.

Mov G(p = 1)
        βu1        βu2        βu3        βu4        βu5
RUS     716(26)    7183(26)   7254(25)   7239(31)   7061(35)
Cl K    716(26)    714(2)     7096(19)   7043(24)   6905(41)
Cl T    716(26)    7028(25)   6674(2)    668(21)    6818(36)
Far K   716(26)    7236(27)   7126(34)   6657(52)   535(35)
Far T   716(26)    7222(28)   7163(36)   6428(53)   5088(44)
CBU     7255(26)   7328(26)   7312(26)   7384(25)   73(31)

Mov G(p = 25)
        βu1        βu2        βu3        βu4        βu5
RUS     8141(13)   8136(13)   8178(17)   8205(17)   816(21)
Cl K    8141(13)   8086(12)   8095(16)   7973(23)   7795(23)
Cl T    8141(13)   799(12)    7821(14)   7787(15)   7776(23)
Far K   8141(13)   809(15)    7817(18)   7425(24)   6979(32)
Far T   8141(13)   8086(15)   772(24)    7116(27)   624(28)
CBU     8153(14)   8164(13)   8129(16)   8128(21)   8034(27)

Mov Th(p = [])
        βu1        βu2        βu3        βu4        βu5
RUS     7977(53)   8032(58)   8157(55)   8186(66)   8126(62)
Cl K    7977(53)   7925(45)   7807(5)    7625(65)   6246(85)
Cl T    7977(53)   784(44)    7241(35)   6466(45)   6037(73)
Far K   7977(53)   8454(5)    8364(64)   8002(73)   5682(103)
Far T   7977(53)   8503(57)   8268(68)   7561(92)   5677(109)
CBU     8011(58)   8117(6)    8108(65)   8417(51)   8096(69)

Yahoo A(p = 1)
        βu1        βu2        βu3        βu4        βu5
RUS     5592(3)    5557(34)   5644(3)    5583(34)   5637(33)
Cl K    5592(3)    5567(24)   5312(2)    5057(18)   5379(35)
Cl T    5592(3)    5569(21)   5335(22)   5031(22)   5235(33)
Far K   5592(3)    5735(22)   5692(11)   5695(23)   5118(2)
Far T   5592(3)    5693(24)   5474(19)   5701(18)   5118(2)
CBU     5821(26)   5845(33)   5831(35)   5839(35)   5609(26)

Yahoo A(p = 25)
        βu1        βu2        βu3        βu4        βu5
RUS     6168(24)   629(29)    6362(36)   6375(31)   6319(19)
Cl K    6168(24)   6114(21)   5762(16)   5402(18)   5148(14)
Cl T    6168(24)   6089(28)   5811(14)   544(21)    5176(14)
Far K   6168(24)   6396(3)    6262(22)   5961(15)   5625(16)
Far T   6168(24)   6371(24)   5972(16)   5727(11)   5447(11)
CBU     6246(26)   6185(14)   6178(22)   5994(3)    601(4)

Yahoo G(p = 1)
        βu1        βu2        βu3        βu4        βu5
RUS     6684(37)   6785(32)   6836(32)   6823(4)    699(42)
Cl K    6684(37)   6671(28)   643(36)    6198(39)   6115(19)
Cl T    6684(37)   6579(27)   6355(33)   5921(35)   6108(24)
Far K   6684(37)   6676(41)   6384(34)   6516(2)    485(29)
Far T   6684(37)   6695(41)   6348(29)   6516(2)    4848(29)
CBU     6968(41)   7059(32)   7064(37)   702(29)    6335(36)

Yahoo G(p = 25)
        βu1        βu2        βu3        βu4        βu5
RUS     7882(14)   7891(16)   7897(16)   7861(16)   7782(21)
Cl K    7882(14)   7726(15)   7252(15)   6786(2)    6507(27)
Cl T    7882(14)   7683(1)    7199(18)   6715(23)   611(27)
Far K   7882(14)   7826(22)   7469(27)   6722(21)   6072(23)
Far T   7882(14)   7768(26)   7244(3)    6494(24)   596(2)
CBU     7525(32)   7522(24)   7469(23)   7307(24)   7069(24)

TaFeng(p = 1)
        βu1        βu2        βu3        βu4        βu5
RUS     5575(16)   561(16)    5626(17)   5723(17)   5925(22)
Cl K    5575(16)   5568(16)   5558(15)   5508(11)   5105(15)
Cl T    5575(16)   5567(16)   5447(16)   4753(16)   493(11)
Far K   5575(16)   5899(12)   5947(11)   6004(12)   5631(1)
Far T   5575(16)   5892(13)   5925(13)   5858(11)   5631(1)
CBU     578(1)     5847(11)   5815(09)   5887(14)   5765(16)

TaFeng(p = 25)
        βu1        βu2        βu3        βu4        βu5
RUS     6694(13)   6744(13)   681(14)    6827(14)   6613(12)
Cl K    6694(13)   6613(14)   6339(12)   5983(13)   5694(07)
Cl T    6694(13)   6638(15)   6289(16)   5746(13)   5456(13)
Far K   6694(13)   6806(14)   6643(16)   6446(15)   6335(13)
Far T   6694(13)   6431(11)   6269(1)    6127(11)   5903(1)
CBU     6481(12)   6415(11)   6413(12)   6388(08)   6346(08)

Book(p = 1)
        βu1        βu2        βu3        βu4        βu5
RUS     526(13)    5279(09)   5346(08)   5389(09)   5405(09)
Cl K    526(13)    5256(12)   5252(13)   5239(11)   5309(11)
Cl T    526(13)    5256(12)   5252(13)   5239(11)   5305(07)
Far K   526(13)    5521(12)   5621(18)   5614(12)   5306(1)
Far T   526(13)    5521(12)   5621(18)   5614(12)   5306(1)
CBU     5428(09)   5377(1)    5333(11)   5334(09)   5284(08)

Book(p = 25)
        βu1        βu2        βu3        βu4        βu5
RUS     6008(07)   6013(06)   604(08)    6033(08)   6328(08)
Cl K    6008(07)   5996(07)   6013(08)   5996(1)    5928(07)
Cl T    6008(07)   5996(07)   6013(08)   6029(04)   545(09)
Far K   6008(07)   6329(1)    6419(08)   573(11)    5566(11)
Far T   6008(07)   6214(05)   5827(06)   5637(1)    5566(11)
CBU     5482(09)   5467(09)   5471(09)   5466(1)    5478(09)

LST(p = 1)
        βu1        βu2        βu3        βu4        βu5
RUS     9999(0)    9999(0)    9999(0)    9998(0)    9999(0)
Cl K    9999(0)    9999(0)    9999(0)    9999(0)    9999(0)
Cl T    9999(0)    9999(0)    9999(0)    9999(0)    9998(0)
Far K   9999(0)    9998(0)    9998(0)    9998(0)    9998(0)
Far T   9999(0)    9998(0)    9998(0)    9998(0)    9998(0)
CBU     []         []         []         []         []

Adver(p = [])
        βu1        βu2        βu3        βu4        βu5
RUS     9661(18)   9632(18)   9663(14)   9712(21)   9622(16)
Cl K    9661(18)   9644(15)   9614(15)   9604(2)    948(25)
Cl T    9661(18)   9587(21)   9432(19)   9301(22)   9072(23)
Far K   9661(18)   9653(14)   9576(2)    9439(18)   9049(31)
Far T   9661(18)   9654(15)   9567(19)   9454(18)   893(28)
CBU     9685(23)   9685(23)   9705(15)   966(16)    9606(21)

Adver(p = 1)
        βu1        βu2        βu3        βu4        βu5
RUS     9093(3)    9153(31)   9237(34)   919(29)    9193(22)
Cl K    9093(3)    9064(3)    8987(39)   9021(36)   8918(2)
Cl T    9093(3)    897(35)    8855(34)   8576(33)   882(23)
Far K   9093(3)    938(23)    924(26)    8873(34)   8551(4)
Far T   9093(3)    9362(24)   932(22)    8841(36)   8551(4)
CBU     9322(24)   9376(25)   9389(26)   9352(27)   9127(2)

CRF(p = [])
        βu1         βu2         βu3         βu4         βu5
RUS     6406(164)   6328(159)   6798(174)   6695(219)   8773(88)
Cl K    6406(164)   6244(166)   6234(169)   7137(138)   7822(177)
Cl T    6406(164)   6244(166)   6234(169)   7137(138)   6267(229)
Far K   6406(164)   838(142)    8393(148)   8449(137)   8611(97)
Far T   6406(164)   838(142)    8393(148)   8449(137)   8611(97)
CBU     []          []          []          []          []

Bank(p = [])
        βu1        βu2        βu3        βu4        βu5
RUS     6682(09)   6702(09)   6737(08)   6799(06)   695(1)
Cl K    6682(09)   6617(07)   6524(06)   6486(06)   5853(11)
Cl T    6682(09)   6492(11)   6069(09)   5633(08)   5287(07)
Far K   6682(09)   6695(06)   6619(06)   6442(06)   5825(11)
Far T   6682(09)   6716(06)   642(08)    5967(1)    5825(11)
CBU     []         []         []         []         []
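In Table 12, the first column (βu1 = 0) applies no undersampling, which is why all techniques coincide there. Purely as an illustration, random undersampling (RUS) at a given βu-level can be sketched as below, under the assumption that βu denotes the fraction of the majority-class surplus that is discarded (βu = 1 then yields a fully balanced sample); the paper's exact parameter settings are those of Section 4.2:

```python
import numpy as np

def random_undersample(y, beta_u, rng=None):
    """Return the indices kept after removing a fraction beta_u of the
    majority-class surplus. y holds binary labels: 1 = minority, 0 = majority.
    The beta_u interpretation is an assumption for illustration only."""
    rng = np.random.default_rng(rng)
    maj = np.flatnonzero(y == 0)
    mino = np.flatnonzero(y == 1)
    surplus = len(maj) - len(mino)
    n_keep = len(maj) - int(round(beta_u * surplus))
    kept_maj = rng.choice(maj, size=n_keep, replace=False)
    return np.sort(np.concatenate([kept_maj, mino]))

y = np.array([0] * 90 + [1] * 10)
idx = random_undersample(y, beta_u=1.0, rng=0)
print(len(idx))  # 20  (10 kept majority + 10 minority at full balance)
```

At intermediate βu-levels the majority class is only partially thinned, which matches the gradual AUC changes visible across the βu columns above.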


C Boosting experiments

This section presents the results of each of the boosting experiments from Section 4.5 on the remaining data sources. Each of the figures below depicts the average tenfold AUC-performance on test data (with μ = 100) with respect to the number of boosting iterations for (left) AdaBoost (AB), AdaCost (AC) and EasyEnsemble (EE), with C chosen according to the highest validation set AUC-performance (over all possible boosting rounds), and (right) AB and EE (with S = 15) with varying C-levels.
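As a companion to these figures, the overall structure of EasyEnsemble, in which S balanced subsets each feed a learner whose decision scores are averaged, can be sketched as follows. This is a hedged illustration: `fit_score` is a hypothetical stand-in for the AdaBoost process applied to each subset in the paper, and the toy scorer below is not the SVM base learner used in the experiments:

```python
import numpy as np

def easy_ensemble_scores(X, y, fit_score, S=15, rng=None):
    """EasyEnsemble sketch: draw S balanced subsets (all minority examples
    plus an equally sized random majority sample), fit a scorer on each,
    and average the resulting decision scores over the S members."""
    rng = np.random.default_rng(rng)
    maj = np.flatnonzero(y == 0)
    mino = np.flatnonzero(y == 1)
    scores = np.zeros(len(X))
    for _ in range(S):
        sub = np.concatenate([mino, rng.choice(maj, size=len(mino), replace=False)])
        scorer = fit_score(X[sub], y[sub])
        scores += scorer(X)
    return scores / S

# toy scorer: negative distance to the minority-class mean (illustration only)
def fit_score(Xs, ys):
    mu = Xs[ys == 1].mean(axis=0)
    return lambda X: -np.linalg.norm(X - mu, axis=1)

rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0, 1, (90, 2)), rng.normal(3, 1, (10, 2))])
y = np.array([0] * 90 + [1] * 10)
s = easy_ensemble_scores(X, y, fit_score, S=5, rng=0)
print(s.shape)  # (100,)
```

Because every subset is only twice the minority size, each member is cheap to train and the S members can be fitted in parallel, which is the source of the speed advantage noted in the abstract.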

[Each of Figs. 6-17: panel (a) plots the average tenfold test AUC against the number of boosting rounds T (0-30) for AB, AC (R = 2), AC (R = 8), AC (RD), EE (S = 5), EE (S = 10), EE (S = 15) and the baseline BL; panel (b) plots the test AUC against T for AB and EE at C ∈ {1e-07, 1e-05, 0.001, 0.1}, together with BL.]

Fig 6 Mov G(p = 1) dataset

Fig 7 Mov Th(p = []) dataset

Fig 8 Yahoo A(p = 1) dataset

Fig 9 Yahoo A(p = 25) dataset

Fig 10 Yahoo G(p = 1) dataset

Fig 11 Yahoo G(p = 25) dataset

Fig 12 TaFeng(p = 1) dataset

Fig 13 Book(p = 1) dataset

Fig 14 LST(p = 1) dataset

Fig 15 Adver(p = []) dataset

Fig 16 Adver(p = 1) dataset

Fig 17 CRF(p = []) dataset

D Final Comparison

[Fig 18: scatter of average rank AUC (x-axis) versus average rank Time (y-axis), with legend entries BL, OSR, SMOTE, ADASYN, RUS, Cl Knn, Far Knn, CBU, AB, AC(R=28), AC(R=RL), EE(S=10), EE(S=15) and EE par.]

Fig 18 Average rank AUC versus average rank Time (see Table 9) across the large datasets from Table 2 (CRF and Bank). Regarding the former, an extra row EE par is added, having the same AUC-value as EE(S = 15). Ranks for each dataset are subsequently obtained via the procedure outlined in Section 4.6.1. Points occurring in the upper-right region are preferred.
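The average ranks plotted here follow the usual Friedman-style procedure: rank the methods on each dataset (rank 1 = best, ties receive the average of the tied ranks) and average these ranks over datasets. A minimal sketch, with an illustrative `average_ranks` helper:

```python
import numpy as np

def average_ranks(perf):
    """perf: (n_datasets, n_methods) matrix of AUC values. Rank the methods
    per dataset (rank 1 = best, ties averaged) and average over datasets."""
    n_d, n_m = perf.shape
    ranks = np.empty_like(perf, dtype=float)
    for i in range(n_d):
        order = (-perf[i]).argsort()           # best method first
        r = np.empty(n_m)
        r[order] = np.arange(1, n_m + 1)       # provisional ranks 1..n_m
        for v in np.unique(perf[i]):           # average ranks over ties
            tie = perf[i] == v
            r[tie] = r[tie].mean()
        ranks[i] = r
    return ranks.mean(axis=0)

perf = np.array([[0.70, 0.75, 0.72],
                 [0.68, 0.71, 0.71]])
print(average_ranks(perf))  # method-wise average ranks: 3.0, 1.25, 1.75
```

The same procedure applied to training times (with rank 1 = fastest) yields the second axis of the plot.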


References

Akbani R, Kwek S, Japkowicz N (2004) Applying support vector machines to imbalanced datasets. In: Machine Learning: ECML 2004, 15th European Conference on Machine Learning, Pisa, Italy, September 20-24, 2004, Proceedings. Springer Berlin Heidelberg, pp 39-50. DOI 10.1007/978-3-540-30115-8_7

Ali A, Shamsuddin SM, Ralescu AL (2015) Classification with class imbalance problem: A review. International Journal of Advances in Soft Computing and its Applications 7(3):176-204

Alzahrani T, Horadam KJ (2016) Community detection in bipartite networks: Algorithms and case studies. In: Complex Systems and Networks: Dynamics, Controls and Applications. Springer Berlin Heidelberg, pp 25-50. DOI 10.1007/978-3-662-47824-0_2

Bachner J (2013) Predictive policing: Preventing crime with data and analytics. IBM Center for the Business of Government

Baesens B, Van Gestel T, Viaene S, Stepanova M, Suykens J, Vanthienen J (2003) Benchmarking state-of-the-art classification algorithms for credit scoring. Journal of the Operational Research Society 54(6):627-635. DOI 10.1057/palgrave.jors.2601545

Barandela R, Sánchez JS, García V, Rangel E (2003) Strategies for learning in class imbalance problems. Pattern Recognition 36(3):849-851. DOI 10.1016/S0031-3203(02)00257-1

Barber MJ (2007) Modularity and community detection in bipartite networks. Physical Review E 76:066102. DOI 10.1103/PhysRevE.76.066102

Barua S, Islam MM, Yao X, Murase K (2014) MWMOTE - majority weighted minority oversampling technique for imbalanced data set learning. IEEE Transactions on Knowledge and Data Engineering 26(2):405-425. DOI 10.1109/TKDE.2012.232

Batista GEAPA, Prati RC, Monard MC (2004) A study of the behavior of several methods for balancing machine learning training data. SIGKDD Explorations Newsletter 6(1):20-29. DOI 10.1145/1007730.1007735

Beckett SJ (2016) Improved community detection in weighted bipartite networks. Royal Society Open Science 3(1). DOI 10.1098/rsos.140536

Bekkar M, Djemaa HK, Alitouche TA (2013) Evaluation measures for models assessment over imbalanced data sets. Journal of Information Engineering and Applications 3(10):27-38

Bhattacharyya S, Jha S, Tharakunnel K, Westland JC (2011) Data mining for credit card fraud: A comparative study. Decision Support Systems 50(3):602-613. DOI 10.1016/j.dss.2010.08.008

Blondel VD, Guillaume JL, Lambiotte R, Lefebvre E (2008) Fast unfolding of communities in large networks. Journal of Statistical Mechanics: Theory and Experiment 2008(10):P10008

Breiman L, Friedman J, Stone CJ, Olshen RA (1984) Classification and regression trees. Taylor & Francis

Brozovsky L, Petricek V (2007) Recommender system for online dating service. In: Proceedings of Znalosti 2007 Conference, VSB Ostrava

Cha M, Mislove A, Gummadi KP (2009) A measurement-driven analysis of information propagation in the Flickr social network. In: Proceedings of the 18th International Conference on World Wide Web, ACM, New York, NY, USA, WWW '09, pp 721-730. DOI 10.1145/1526709.1526806

Chawla NV (2005) Data mining for imbalanced datasets: An overview. In: Data Mining and Knowledge Discovery Handbook, Springer US, Boston, MA, pp 853-867

Chawla NV, Bowyer KW, Hall LO, Kegelmeyer WP (2002) SMOTE: synthetic minority over-sampling technique. Journal of Artificial Intelligence Research 16:321-357

Chawla NV, Lazarevic A, Hall LO, Bowyer KW (2003) SMOTEBoost: Improving prediction of the minority class in boosting. In: Knowledge Discovery in Databases: PKDD 2003, Springer Berlin Heidelberg, pp 107-119

Chawla NV, Japkowicz N, Kotcz A (2004) Editorial: Special issue on learning from imbalanced data sets. SIGKDD Explorations Newsletter 6(1):1-6. DOI 10.1145/1007730.1007733

Chen M, Mao S, Liu Y (2014) Big data: A survey. Mobile Networks and Applications 19(2):171-209. DOI 10.1007/s11036-013-0489-0

Chyi YM (2003) Classification analysis techniques for skewed class distribution problems. Master thesis, Department of Information Management, National Sun Yat-Sen University

Demšar J (2006) Statistical comparisons of classifiers over multiple data sets. Journal of Machine Learning Research 7(Jan):1-30

Dhillon IS (2001) Co-clustering documents and words using bipartite spectral graph partitioning. In: Proceedings of the Seventh ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, ACM, New York, NY, USA, KDD '01, pp 269-274. DOI 10.1145/502512.502550

Drummond C, Holte RC (2003) C4.5, class imbalance, and cost sensitivity: why under-sampling beats over-sampling. In: Proceedings of the ICML '03 Workshop on Learning from Imbalanced Datasets

Fan RE, Chang KW, Hsieh CJ, Wang XR, Lin CJ (2008) LIBLINEAR: A library for large linear classification. Journal of Machine Learning Research 9:1871-1874

Fan W, Stolfo SJ, Zhang J, Chan PK (1999) AdaCost: Misclassification cost-sensitive boosting. In: Proceedings of the Sixteenth International Conference on Machine Learning, Morgan Kaufmann Publishers Inc, San Francisco, CA, USA, ICML '99, pp 97-105

Fawcett T (2006) An introduction to ROC analysis. Pattern Recognition Letters 27(8):861-874. DOI 10.1016/j.patrec.2005.10.010

Finch H (2005) Comparison of distance measures in cluster analysis with dichotomous data. Journal of Data Science 3(1):85-100

Fortunato S (2010) Community detection in graphs. Physics Reports 486(3-5):75-174. DOI 10.1016/j.physrep.2009.11.002

Junqué de Fortuny E, Martens D, Provost F (2014a) Predictive modeling with big data: is bigger really better? Big Data 1(4):215-226. DOI 10.1089/big.2013.0037

Junqué de Fortuny E, Stankova M, Moeyersoms J, Minnaert B, Provost F, Martens D (2014b) Corporate residence fraud detection. In: Proceedings of the 20th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, ACM, New York, NY, USA, KDD '14, pp 1650-1659. DOI 10.1145/2623330.2623333

Frasca M, Bertoni A, Re M, Valentini G (2013) A neural network algorithm for semi-supervised node label learning from unbalanced data. Neural Networks 43:84-98. DOI 10.1016/j.neunet.2013.01.021

Friedman M (1937) The use of ranks to avoid the assumption of normality implicit in the analysis of variance. Journal of the American Statistical Association 32(200):675-701

García E, Lozano F (2007) Boosting support vector machines. In: Machine Learning and Data Mining in Pattern Recognition, 5th International Conference, MLDM 2007, Leipzig, Germany, July 18-20, Post Proceedings, IBaI publishing, pp 153-167

Goldstein M, Uchida S (2016) A comparative evaluation of unsupervised anomaly detection algorithms for multivariate data. PLOS ONE 11(4):1-31. DOI 10.1371/journal.pone.0152173

González PC, Velásquez JD (2013) Characterization and detection of taxpayers with false invoices using data mining techniques. Expert Systems with Applications 40(5):1427-1436. DOI 10.1016/j.eswa.2012.08.051

Guimerà R, Sales-Pardo M, Amaral LAN (2007) Module identification in bipartite and directed networks. Physical Review E 76:036102. DOI 10.1103/PhysRevE.76.036102

Guo H, Viktor HL (2004) Learning from imbalanced data sets with boosting and data generation: The DataBoost-IM approach. SIGKDD Explorations Newsletter 6(1):30-39. DOI 10.1145/1007730.1007736

Guo X, Yin Y, Dong C, Yang G, Zhou G (2008) On the class imbalance problem. In: 2008 Fourth International Conference on Natural Computation, IEEE, vol 4, pp 192-201. DOI 10.1109/ICNC.2008.871

Han H, Wang WY, Mao BH (2005) Borderline-SMOTE: A new over-sampling method in imbalanced data sets learning. In: Advances in Intelligent Computing, Springer Berlin Heidelberg, pp 878-887

He H, Garcia EA (2009) Learning from imbalanced data. IEEE Transactions on Knowledge and Data Engineering 21(9):1263-1284. DOI 10.1109/TKDE.2008.239

He H, Bai Y, Garcia EA, Li S (2008) ADASYN: Adaptive synthetic sampling approach for imbalanced learning. In: 2008 IEEE International Joint Conference on Neural Networks (IEEE World Congress on Computational Intelligence), IEEE, pp 1322-1328. DOI 10.1109/IJCNN.2008.4633969

Holm S (1979) A simple sequentially rejective multiple test procedure. Scandinavian Journal of Statistics 6(2):65-70

Hsu CW, Lin CJ (2002) A comparison of methods for multiclass support vector machines. IEEE Transactions on Neural Networks 13(2):415-425. DOI 10.1109/72.991427

Huang A (2008) Similarity measures for text document clustering. In: Proceedings of the Sixth New Zealand Computer Science Research Student Conference (NZCSRSC2008), Christchurch, New Zealand, pp 49-56

Iman RL, Davenport JM (1980) Approximations of the critical region of the Friedman statistic. Communications in Statistics - Theory and Methods 9(6):571-595

Jo T, Japkowicz N (2004) Class imbalances versus small disjuncts. ACM SIGKDD Explorations Newsletter 6(1):40-49. DOI 10.1145/1007730.1007737

Jutla IS, Jeub LG, Mucha PJ (2011-2016) A generalized Louvain method for community detection implemented in MATLAB. URL http://netwiki.amath.unc.edu/GenLouvain

Kubat M, Matwin S (1997) Addressing the curse of imbalanced training sets: One-sided selection. In: Proceedings of the Fourteenth International Conference on Machine Learning, Morgan Kaufmann Publishers Inc, San Francisco, CA, USA, pp 179-186

Lancichinetti A, Fortunato S (2009) Community detection algorithms: A comparative analysis. Physical Review E 80:056117. DOI 10.1103/PhysRevE.80.056117

Larremore DB, Clauset A, Jacobs AZ (2014) Efficiently inferring community structure in bipartite networks. Physical Review E 90:012805. DOI 10.1103/PhysRevE.90.012805

Li J, Fine JP (2010) Weighted area under the receiver operating characteristic curve and its application to gene selection. Journal of the Royal Statistical Society: Series C (Applied Statistics) 59(4):673-692. DOI 10.1111/j.1467-9876.2010.00713.x

Li X, Wang L, Sung E (2008) AdaBoost with SVM-based component classifiers. Engineering Applications of Artificial Intelligence 21(5):785-795. DOI 10.1016/j.engappai.2007.07.001

Lichman M (2013) UCI machine learning repository. URL http://archive.ics.uci.edu/ml

Liu W, Chawla S, Cieslak DA, Chawla NV (2010) A robust decision tree algorithm for imbalanced data sets. In: Proceedings of the Tenth SIAM International Conference on Data Mining, SIAM, Philadelphia, vol 10, pp 766-777

Liu XY, Wu J, Zhou ZH (2009) Exploratory undersampling for class-imbalance learning. IEEE Transactions on Systems, Man, and Cybernetics, Part B (Cybernetics) 39(2):539-550. DOI 10.1109/TSMCB.2008.2007853

Luts J, Ojeda F, Van de Plas R, De Moor B, Van Huffel S, Suykens JA (2010) A tutorial on support vector machine-based methods for classification problems in chemometrics. Analytica Chimica Acta 665(2):129-145. DOI 10.1016/j.aca.2010.03.030

Macskassy SA, Provost F (2007) Classification in networked data: A toolkit and a univariate case study. Journal of Machine Learning Research 8(May):935-983

Martens D, Provost F (2014) Explaining data-driven document classifications. MIS Quarterly 38(1):73-100

Martens D, Provost F, Clark J, Junqué de Fortuny E (2016) Mining massive fine-grained behavior data to improve predictive analytics. MIS Quarterly 40(4):869-888

Mazurowski MA, Habas PA, Zurada JM, Lo JY, Baker JA, Tourassi GD (2008) Training neural network classifiers for medical decision making: The effects of imbalanced datasets on classification performance. Neural Networks 21(2-3):427-436. DOI 10.1016/j.neunet.2007.12.031

Mease D, Wyner AJ, Buja A (2007) Boosted classification trees and class probability/quantile estimation. Journal of Machine Learning Research 8:409-439

Nemenyi P (1963) Distribution-free Multiple Comparisons. Dissertation, Princeton University

Newman MEJ, Girvan M (2004) Finding and evaluating community structure in networks. Physical Review E 69:026113. DOI 10.1103/PhysRevE.69.026113

Ng AY (2004) Feature selection, L1 vs. L2 regularization, and rotational invariance. In: Proceedings of the Twenty-first International Conference on Machine Learning, ACM, New York, NY, USA, ICML '04, pp 78-. DOI 10.1145/1015330.1015435

Ng AY, Jordan MI (2002) On discriminative vs. generative classifiers: A comparison of logistic regression and naive Bayes. In: Advances in Neural Information Processing Systems 14, MIT Press, pp 841-848

Ngai E, Hu Y, Wong Y, Chen Y, Sun X (2011) The application of data mining techniques in financial fraud detection: A classification framework and an academic review of literature. Decision Support Systems 50(3):559-569. DOI 10.1016/j.dss.2010.08.006

Platt JC (1999) Probabilistic outputs for support vector machines and comparisons to regularized likelihood methods. In: Advances in Large-Margin Classifiers, MIT Press, pp 61-74

Porter MA, Onnela JP, Mucha PJ (2009) Communities in networks. Notices of the American Mathematical Society 56(9):1082-1097

Provost F, Fawcett T (2013) Data Science for Business: What you need to know about data mining and data-analytic thinking. O'Reilly Media, Inc

Provost F, Dalessandro B, Hook R, Zhang X, Murray A (2009) Audience selection for on-line brand advertising: Privacy-friendly social network targeting. In: Proceedings of the 15th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, ACM, New York, NY, USA, KDD '09, pp 707-716. DOI 10.1145/1557019.1557098

Raskutti B, Kowalczyk A (2004) Extreme re-balancing for SVMs: A case study. SIGKDD Explorations Newsletter 6(1):60-69. DOI 10.1145/1007730.1007739

Rosvall M, Bergstrom CT (2008) Maps of random walks on complex networks reveal community structure. Proceedings of the National Academy of Sciences 105(4):1118-1123. DOI 10.1073/pnas.0706851105

Schapire RE (1999) A brief introduction to boosting. In: Proceedings of the 16th International Joint Conference on Artificial Intelligence - Volume 2, Morgan Kaufmann Publishers Inc, San Francisco, CA, USA, IJCAI '99, pp 1401-1406

Schapire RE, Singer Y (1999) Improved boosting algorithms using confidence-rated predictions. Machine Learning 37(3):297-336. DOI 10.1023/A:1007614523901

Shmueli G (2017) Analyzing behavioral big data: Methodological, practical, ethical, and moral issues. Quality Engineering 29(1):57-74. DOI 10.1080/08982112.2016.1210979

Sobhani P, Viktor H, Matwin S (2015) Learning from imbalanced data using ensemble methods and cluster-based undersampling. In: New Frontiers in Mining Complex Patterns: Third International Workshop, NFMCP 2014, Held in Conjunction with ECML-PKDD 2014, Nancy, France, September 19, 2014, Revised Selected Papers, Springer International Publishing, Cham, pp 69-83. DOI 10.1007/978-3-319-17876-9_5

Stankova M (2016) Classification within network data with a bipartite structure. Dissertation, University of Antwerp

Stankova M, Martens D, Provost F (2015) Classification over bipartite graphs through projection. Working Papers 2015001, University of Antwerp, Faculty of Applied Economics

Sun Y, Kamel MS, Wong AK, Wang Y (2007) Cost-sensitive boosting for classification of imbalanced data. Pattern Recognition 40(12):3358-3378. DOI 10.1016/j.patcog.2007.04.009

Suykens JA, Van Gestel T, De Brabanter J, De Moor B, Vandewalle J (2002) Least squares support vector machines. World Scientific, Singapore

Tang B, He H (2015) ENN: Extended nearest neighbor method for pattern recognition [research frontier]. IEEE Computational Intelligence Magazine 10(3):52-60. DOI 10.1109/MCI.2015.2437512

Tang Y, Zhang YQ, Chawla NV, Krasser S (2009) SVMs modeling for highly imbalanced classification. IEEE Transactions on Systems, Man, and Cybernetics, Part B (Cybernetics) 39(1):281-288. DOI 10.1109/TSMCB.2008.2002909

Tobback E, Moeyersoms J, Stankova M, Martens D (2016) Bankruptcy prediction for SMEs using relational data. Working Paper 2016004, University of Antwerp, Faculty of Applied Economics

Verbeke W, Dejaeger K, Martens D, Hur J, Baesens B (2012) New insights into churn prediction in the telecommunication sector: A profit driven data mining approach. European Journal of Operational Research 218(1):211-229. DOI 10.1016/j.ejor.2011.09.031

Veropoulos K, Campbell I, Cristianini N (1999) Controlling the sensitivity of support vector machines. In: Proceedings of the International Joint Conference on Artificial Intelligence, Stockholm, Sweden (IJCAI99), pp 55-60

Whitrow C, Hand DJ, Juszczak P, Weston D, Adams NM (2009) Transaction aggregation as a strategy for credit card fraud detection. Data Mining and Knowledge Discovery 18(1):30-55. DOI 10.1007/s10618-008-0116-z

Wickramaratna J, Holden SB, Buxton BF (2001) Performance degradation in boosting. In: Proceedings of the Second International Workshop on Multiple Classifier Systems, Springer, London, UK, MCS '01, pp 11-21

Yen SJ, Lee YS (2009) Cluster-based under-sampling approaches for imbalanced data distributions. Expert Systems with Applications 36(3, Part 1):5718-5727. DOI 10.1016/j.eswa.2008.06.108

Yu HF, Lo HY, Hsieh HP, Lou JK, McKenzie TG, Chou JW, Chung PH, Ho CH, Chang CF, Wei YH, et al (2010) Feature engineering and classifier ensemble for KDD Cup 2010. In: Proceedings of the KDD Cup 2010 Workshop, pp 1-16

Zha H, He X, Ding C, Simon H, Gu M (2001) Bipartite graph partitioning and data clustering. In: Proceedings of the Tenth International Conference on Information and Knowledge Management, ACM, New York, NY, USA, CIKM '01, pp 25-32. DOI 10.1145/502585.502591

Zhang J, Mani I (2003) kNN approach to unbalanced data distributions: A case study involving information extraction. In: Proceedings of the ICML 2003 Workshop on Learning from Imbalanced Datasets, Washington, DC

Ziegler CN, McNee SM, Konstan JA, Lausen G (2005) Improving recommendation lists through topic diversification. In: Proceedings of the 14th International Conference on World Wide Web, ACM, New York, NY, USA, WWW '05, pp 22-32. DOI 10.1145/1060745.1060754

  • Introduction
  • Preliminaries
  • Methods
  • Results and discussion
  • The effect of the chosen base learner
  • Conclusions and future research directions
  • Oversampling experiments
  • Undersampling experiments
  • Boosting experiments
  • Final Comparison
Page 46: This item is the archived peer-reviewed author-version of datasets Jellis Vanhoeyveld David Martens Received: date / Accepted: date Abstract Recent years have witnessed a growing number

Imbalanced classification in sparse and large behaviour datasets 45

B Undersampling experiments

Table 12 Undersampling experiments Results show the average tenfold AUC test set performance withrespect to increasing βu-values

(βu = [βu1βu2βu3βu4βu5] = [01412341]

) Optimal perfor-

mances (boldface) are highlighted for each imbalance ratio Additionaly standard deviations are shownbetween brackets CL K represents the ldquoClosest Knnrdquo technique CL T represents the ldquoClosest tot simrdquotechnique (similar for Far K and Far T see Section 32) Parameter settings are given in Section 42

Mov G(p = 1)βu1 βu2 βu3 βu4 βu5

RUS 716(26) 7183(26) 7254(25) 7239(31) 7061(35)Cl K 716(26) 714(2) 7096(19) 7043(24) 6905(41)CL T 716(26) 7028(25) 6674(2) 668(21) 6818(36)Far K 716(26) 7236(27) 7126(34) 6657(52) 535(35)Far T 716(26) 7222(28) 7163(36) 6428(53) 5088(44)CBU 7255(26) 7328(26) 7312(26) 7384(25) 73(31)

Mov G(p = 25)
        βu1          βu2          βu3          βu4          βu5
RUS     81.41(1.3)   81.36(1.3)   81.78(1.7)   82.05(1.7)   81.6(2.1)
Cl K    81.41(1.3)   80.86(1.2)   80.95(1.6)   79.73(2.3)   77.95(2.3)
CL T    81.41(1.3)   79.9(1.2)    78.21(1.4)   77.87(1.5)   77.76(2.3)
Far K   81.41(1.3)   80.9(1.5)    78.17(1.8)   74.25(2.4)   69.79(3.2)
Far T   81.41(1.3)   80.86(1.5)   77.2(2.4)    71.16(2.7)   62.4(2.8)
CBU     81.53(1.4)   81.64(1.3)   81.29(1.6)   81.28(2.1)   80.34(2.7)

Mov Th(p = [])
        βu1          βu2          βu3          βu4          βu5
RUS     79.77(5.3)   80.32(5.8)   81.57(5.5)   81.86(6.6)   81.26(6.2)
Cl K    79.77(5.3)   79.25(4.5)   78.07(5)     76.25(6.5)   62.46(8.5)
CL T    79.77(5.3)   78.4(4.4)    72.41(3.5)   64.66(4.5)   60.37(7.3)
Far K   79.77(5.3)   84.54(5)     83.64(6.4)   80.02(7.3)   56.82(10.3)
Far T   79.77(5.3)   85.03(5.7)   82.68(6.8)   75.61(9.2)   56.77(10.9)
CBU     80.11(5.8)   81.17(6)     81.08(6.5)   84.17(5.1)   80.96(6.9)

Yahoo A(p = 1)
        βu1          βu2          βu3          βu4          βu5
RUS     55.92(3)     55.57(3.4)   56.44(3)     55.83(3.4)   56.37(3.3)
Cl K    55.92(3)     55.67(2.4)   53.12(2)     50.57(1.8)   53.79(3.5)
CL T    55.92(3)     55.69(2.1)   53.35(2.2)   50.31(2.2)   52.35(3.3)
Far K   55.92(3)     57.35(2.2)   56.92(1.1)   56.95(2.3)   51.18(2)
Far T   55.92(3)     56.93(2.4)   54.74(1.9)   57.01(1.8)   51.18(2)
CBU     58.21(2.6)   58.45(3.3)   58.31(3.5)   58.39(3.5)   56.09(2.6)

Yahoo A(p = 25)
        βu1          βu2          βu3          βu4          βu5
RUS     61.68(2.4)   62.9(2.9)    63.62(3.6)   63.75(3.1)   63.19(1.9)
Cl K    61.68(2.4)   61.14(2.1)   57.62(1.6)   54.02(1.8)   51.48(1.4)
CL T    61.68(2.4)   60.89(2.8)   58.11(1.4)   54.4(2.1)    51.76(1.4)
Far K   61.68(2.4)   63.96(3)     62.62(2.2)   59.61(1.5)   56.25(1.6)
Far T   61.68(2.4)   63.71(2.4)   59.72(1.6)   57.27(1.1)   54.47(1.1)
CBU     62.46(2.6)   61.85(1.4)   61.78(2.2)   59.94(3)     60.1(4)

46 Jellis Vanhoeyveld David Martens

Table 12 continued

Yahoo G(p = 1)
        βu1          βu2          βu3          βu4          βu5
RUS     66.84(3.7)   67.85(3.2)   68.36(3.2)   68.23(4)     69.9(4.2)
Cl K    66.84(3.7)   66.71(2.8)   64.3(3.6)    61.98(3.9)   61.15(1.9)
CL T    66.84(3.7)   65.79(2.7)   63.55(3.3)   59.21(3.5)   61.08(2.4)
Far K   66.84(3.7)   66.76(4.1)   63.84(3.4)   65.16(2)     48.5(2.9)
Far T   66.84(3.7)   66.95(4.1)   63.48(2.9)   65.16(2)     48.48(2.9)
CBU     69.68(4.1)   70.59(3.2)   70.64(3.7)   70.2(2.9)    63.35(3.6)

Yahoo G(p = 25)
        βu1          βu2          βu3          βu4          βu5
RUS     78.82(1.4)   78.91(1.6)   78.97(1.6)   78.61(1.6)   77.82(2.1)
Cl K    78.82(1.4)   77.26(1.5)   72.52(1.5)   67.86(2)     65.07(2.7)
CL T    78.82(1.4)   76.83(1)     71.99(1.8)   67.15(2.3)   61.1(2.7)
Far K   78.82(1.4)   78.26(2.2)   74.69(2.7)   67.22(2.1)   60.72(2.3)
Far T   78.82(1.4)   77.68(2.6)   72.44(3)     64.94(2.4)   59.6(2)
CBU     75.25(3.2)   75.22(2.4)   74.69(2.3)   73.07(2.4)   70.69(2.4)

TaFeng(p = 1)
        βu1          βu2          βu3          βu4          βu5
RUS     55.75(1.6)   56.1(1.6)    56.26(1.7)   57.23(1.7)   59.25(2.2)
Cl K    55.75(1.6)   55.68(1.6)   55.58(1.5)   55.08(1.1)   51.05(1.5)
CL T    55.75(1.6)   55.67(1.6)   54.47(1.6)   47.53(1.6)   49.3(1.1)
Far K   55.75(1.6)   58.99(1.2)   59.47(1.1)   60.04(1.2)   56.31(1)
Far T   55.75(1.6)   58.92(1.3)   59.25(1.3)   58.58(1.1)   56.31(1)
CBU     57.8(1)      58.47(1.1)   58.15(0.9)   58.87(1.4)   57.65(1.6)

TaFeng(p = 25)
        βu1          βu2          βu3          βu4          βu5
RUS     66.94(1.3)   67.44(1.3)   68.1(1.4)    68.27(1.4)   66.13(1.2)
Cl K    66.94(1.3)   66.13(1.4)   63.39(1.2)   59.83(1.3)   56.94(0.7)
CL T    66.94(1.3)   66.38(1.5)   62.89(1.6)   57.46(1.3)   54.56(1.3)
Far K   66.94(1.3)   68.06(1.4)   66.43(1.6)   64.46(1.5)   63.35(1.3)
Far T   66.94(1.3)   64.31(1.1)   62.69(1)     61.27(1.1)   59.03(1)
CBU     64.81(1.2)   64.15(1.1)   64.13(1.2)   63.88(0.8)   63.46(0.8)

Book(p = 1)
        βu1          βu2          βu3          βu4          βu5
RUS     52.6(1.3)    52.79(0.9)   53.46(0.8)   53.89(0.9)   54.05(0.9)
Cl K    52.6(1.3)    52.56(1.2)   52.52(1.3)   52.39(1.1)   53.09(1.1)
CL T    52.6(1.3)    52.56(1.2)   52.52(1.3)   52.39(1.1)   53.05(0.7)
Far K   52.6(1.3)    55.21(1.2)   56.21(1.8)   56.14(1.2)   53.06(1)
Far T   52.6(1.3)    55.21(1.2)   56.21(1.8)   56.14(1.2)   53.06(1)
CBU     54.28(0.9)   53.77(1)     53.33(1.1)   53.34(0.9)   52.84(0.8)

Book(p = 25)
        βu1          βu2          βu3          βu4          βu5
RUS     60.08(0.7)   60.13(0.6)   60.4(0.8)    60.33(0.8)   63.28(0.8)
Cl K    60.08(0.7)   59.96(0.7)   60.13(0.8)   59.96(1)     59.28(0.7)
CL T    60.08(0.7)   59.96(0.7)   60.13(0.8)   60.29(0.4)   54.5(0.9)
Far K   60.08(0.7)   63.29(1)     64.19(0.8)   57.3(1.1)    55.66(1.1)
Far T   60.08(0.7)   62.14(0.5)   58.27(0.6)   56.37(1)     55.66(1.1)
CBU     54.82(0.9)   54.67(0.9)   54.71(0.9)   54.66(1)     54.78(0.9)


Table 12 continued

LST(p = 1)
        βu1          βu2          βu3          βu4          βu5
RUS     99.99(0)     99.99(0)     99.99(0)     99.98(0)     99.99(0)
Cl K    99.99(0)     99.99(0)     99.99(0)     99.99(0)     99.99(0)
CL T    99.99(0)     99.99(0)     99.99(0)     99.99(0)     99.98(0)
Far K   99.99(0)     99.98(0)     99.98(0)     99.98(0)     99.98(0)
Far T   99.99(0)     99.98(0)     99.98(0)     99.98(0)     99.98(0)
CBU     []           []           []           []           []

Adver(p = [])
        βu1          βu2          βu3          βu4          βu5
RUS     96.61(1.8)   96.32(1.8)   96.63(1.4)   97.12(2.1)   96.22(1.6)
Cl K    96.61(1.8)   96.44(1.5)   96.14(1.5)   96.04(2)     94.8(2.5)
CL T    96.61(1.8)   95.87(2.1)   94.32(1.9)   93.01(2.2)   90.72(2.3)
Far K   96.61(1.8)   96.53(1.4)   95.76(2)     94.39(1.8)   90.49(3.1)
Far T   96.61(1.8)   96.54(1.5)   95.67(1.9)   94.54(1.8)   89.3(2.8)
CBU     96.85(2.3)   96.85(2.3)   97.05(1.5)   96.6(1.6)    96.06(2.1)

Adver(p = 1)
        βu1          βu2          βu3          βu4          βu5
RUS     90.93(3)     91.53(3.1)   92.37(3.4)   91.9(2.9)    91.93(2.2)
Cl K    90.93(3)     90.64(3)     89.87(3.9)   90.21(3.6)   89.18(2)
CL T    90.93(3)     89.7(3.5)    88.55(3.4)   85.76(3.3)   88.2(2.3)
Far K   90.93(3)     93.8(2.3)    92.4(2.6)    88.73(3.4)   85.51(4)
Far T   90.93(3)     93.62(2.4)   93.2(2.2)    88.41(3.6)   85.51(4)
CBU     93.22(2.4)   93.76(2.5)   93.89(2.6)   93.52(2.7)   91.27(2)

CRF(p = [])
        βu1          βu2          βu3          βu4          βu5
RUS     64.06(16.4)  63.28(15.9)  67.98(17.4)  66.95(21.9)  87.73(8.8)
Cl K    64.06(16.4)  62.44(16.6)  62.34(16.9)  71.37(13.8)  78.22(17.7)
CL T    64.06(16.4)  62.44(16.6)  62.34(16.9)  71.37(13.8)  62.67(22.9)
Far K   64.06(16.4)  83.8(14.2)   83.93(14.8)  84.49(13.7)  86.11(9.7)
Far T   64.06(16.4)  83.8(14.2)   83.93(14.8)  84.49(13.7)  86.11(9.7)
CBU     []           []           []           []           []

Bank(p = [])
        βu1          βu2          βu3          βu4          βu5
RUS     66.82(0.9)   67.02(0.9)   67.37(0.8)   67.99(0.6)   69.5(1)
Cl K    66.82(0.9)   66.17(0.7)   65.24(0.6)   64.86(0.6)   58.53(1.1)
CL T    66.82(0.9)   64.92(1.1)   60.69(0.9)   56.33(0.8)   52.87(0.7)
Far K   66.82(0.9)   66.95(0.6)   66.19(0.6)   64.42(0.6)   58.25(1.1)
Far T   66.82(0.9)   67.16(0.6)   64.2(0.8)    59.67(1)     58.25(1.1)
CBU     []           []           []           []           []
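As a rough illustration of how the RUS rows in Table 12 come about (a sketch, not the authors' code): βu is taken here to interpolate linearly between no undersampling (βu = 0) and a fully balanced majority sample (βu = 1), and class label 0 is assumed to be the majority class; both choices are illustrative assumptions.

```python
import numpy as np

def random_undersample(X, y, beta_u, rng=None):
    """Random undersampling (RUS): drop majority instances at random.

    beta_u = 0 keeps the full training set; beta_u = 1 keeps exactly as many
    majority as minority instances (linear interpolation in between).
    The majority class is assumed to carry label 0 -- an illustrative choice.
    """
    rng = np.random.default_rng(rng)
    maj = np.flatnonzero(y == 0)
    mino = np.flatnonzero(y == 1)
    n_keep = round(len(maj) - beta_u * (len(maj) - len(mino)))
    kept_maj = rng.choice(maj, size=n_keep, replace=False)
    idx = np.sort(np.concatenate([kept_maj, mino]))
    return X[idx], y[idx]

# toy example: 8 majority vs 2 minority instances, fully balanced at beta_u = 1
X = np.arange(20, dtype=float).reshape(10, 2)
y = np.array([0] * 8 + [1] * 2)
X_s, y_s = random_undersample(X, y, beta_u=1.0, rng=0)
```

At βu = 1 the sample above contains two majority and two minority instances; intermediate βu values correspond to the columns βu2-βu4 of Table 12.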


C Boosting experiments

This section presents the results of each of the boosting experiments from Section 4.5 on the remaining data sources. Each of the figures below depicts the average tenfold AUC-performance on test data (with µ = 100) with respect to the number of boosting iterations for (left) Adaboost (AB), AdaCost (AC) and EasyEnsemble (EE), with C chosen according to the highest validation set AUC-performance (over all possible boosting rounds), and (right) AB and EE (with S = 15) at varying C-levels.
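The EasyEnsemble idea evaluated in these figures (several random balanced subsets whose hypotheses are combined) can be sketched as follows; the inner boosting loop is omitted for brevity, and the linear SVM base learner, the score averaging and the 0/1 class coding are illustrative assumptions rather than the paper's exact setup:

```python
import numpy as np
from sklearn.svm import LinearSVC

def easy_ensemble_scores(X, y, X_test, S=15, C=1e-5, seed=0):
    """Fit one linear SVM per random balanced subset and average the scores.

    Each subset holds all minority instances (label 1) plus an equally large
    random draw from the majority class (label 0). The inner AdaBoost loop of
    EasyEnsemble is left out here for brevity."""
    rng = np.random.default_rng(seed)
    maj, mino = np.flatnonzero(y == 0), np.flatnonzero(y == 1)
    scores = np.zeros(len(X_test))
    for _ in range(S):
        sub = np.concatenate([rng.choice(maj, size=len(mino), replace=False), mino])
        clf = LinearSVC(C=C).fit(X[sub], y[sub])
        scores += clf.decision_function(X_test)
    return scores / S

# toy check: well-separated classes should receive well-separated scores
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0, 1, (40, 2)), rng.normal(4, 1, (10, 2))])
y = np.array([0] * 40 + [1] * 10)
scores = easy_ensemble_scores(X, y, X, S=5, C=1.0, seed=1)
```

Because each subset is only twice the minority class size, the S fits are cheap and trivially parallelizable, which is the speed argument made in the conclusions.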

[Figure: (a) test AUC vs. boosting rounds T for AB, AC (R2, R8, RD), EE (S = 5, 10, 15) and BL; (b) test AUC vs. T for AB and EE with C = 1e-07, 1e-05, 0.001, 0.1, and BL.]

Fig 6 Mov G(p = 1) dataset

[Figure: (a) test AUC vs. boosting rounds T for AB, AC (R2, R8, RD), EE (S = 5, 10, 15) and BL; (b) test AUC vs. T for AB and EE with C = 1e-07, 1e-05, 0.001, 0.1, and BL.]

Fig 7 Mov Th(p = []) dataset


[Figure: (a) test AUC vs. boosting rounds T for AB, AC (R2, R8, RD), EE (S = 5, 10, 15) and BL; (b) test AUC vs. T for AB and EE with C = 1e-07, 1e-05, 0.001, 0.1, and BL.]

Fig 8 Yahoo A(p = 1) dataset

[Figure: (a) test AUC vs. boosting rounds T for AB, AC (R2, R8, RD), EE (S = 5, 10, 15) and BL; (b) test AUC vs. T for AB and EE with C = 1e-07, 1e-05, 0.001, 0.1, and BL.]

Fig 9 Yahoo A(p = 25) dataset

[Figure: (a) test AUC vs. boosting rounds T for AB, AC (R2, R8, RD), EE (S = 5, 10, 15) and BL; (b) test AUC vs. T for AB and EE with C = 1e-07, 1e-05, 0.001, 0.1, and BL.]

Fig 10 Yahoo G(p = 1) dataset


[Figure: (a) test AUC vs. boosting rounds T for AB, AC (R2, R8, RD), EE (S = 5, 10, 15) and BL; (b) test AUC vs. T for AB and EE with C = 1e-07, 1e-05, 0.001, 0.1, and BL.]

Fig 11 Yahoo G(p = 25) dataset

[Figure: (a) test AUC vs. boosting rounds T for AB, AC (R2, R8, RD), EE (S = 5, 10, 15) and BL; (b) test AUC vs. T for AB and EE with C = 1e-07, 1e-05, 0.001, 0.1, and BL.]

Fig 12 TaFeng(p = 1) dataset

[Figure: (a) test AUC vs. boosting rounds T for AB, AC (R2, R8, RD), EE (S = 5, 10, 15) and BL; (b) test AUC vs. T for AB and EE with C = 1e-07, 1e-05, 0.001, 0.1, and BL.]

Fig 13 Book(p = 1) dataset


[Figure: (a) test AUC vs. boosting rounds T for AB, AC (R2, R8, RD), EE (S = 5, 10, 15) and BL; (b) test AUC vs. T for AB and EE with C = 1e-07, 1e-05, 0.001, 0.1, and BL.]

Fig 14 LST(p = 1) dataset

[Figure: (a) test AUC vs. boosting rounds T for AB, AC (R2, R8, RD), EE (S = 5, 10, 15) and BL; (b) test AUC vs. T for AB and EE with C = 1e-07, 1e-05, 0.001, 0.1, and BL.]

Fig 15 Adver(p = []) dataset

[Figure: (a) test AUC vs. boosting rounds T for AB, AC (R2, R8, RD), EE (S = 5, 10, 15) and BL; (b) test AUC vs. T for AB and EE with C = 1e-07, 1e-05, 0.001, 0.1, and BL.]

Fig 16 Adver(p = 1) dataset


[Figure: (a) test AUC vs. boosting rounds T for AB, AC (R2, R8, RD), EE (S = 5, 10, 15) and BL; (b) test AUC vs. T for AB and EE with C = 1e-07, 1e-05, 0.001, 0.1, and BL.]

Fig 17 CRF(p = []) dataset

D Final Comparison

[Figure: scatter plot of average rank AUC (x-axis) versus average rank Time (y-axis), with points for BL, OS, RS, SMOTE, ADASYN, RUS, Cl Knn, Far Knn, CBU, AB, AC(R=28), AC(R=RL), EE(S=10), EE(S=15) and EE par.]

Fig 18 Average rank AUC versus average rank Time (see Table 9) across the large datasets from Table 2 (CRF and Bank). Regarding the former, an extra row EE par is added, having the same AUC-value as EE(S = 15). Ranks for each dataset are subsequently obtained via the procedure outlined in Section 4.6.1. Points occurring in the upper-right region are preferred.
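The average ranks plotted here follow the usual Friedman-style procedure of Section 4.6.1: per dataset, rank the methods (rank 1 = best, ties receive the mean rank) and average over datasets. A minimal sketch with made-up AUC values (not taken from the paper):

```python
import numpy as np

def average_ranks(auc):
    """auc: (n_datasets, n_methods) array, higher is better.
    Returns one average rank per method; tied values share the mean rank."""
    n_datasets, n_methods = auc.shape
    ranks = np.empty_like(auc, dtype=float)
    for d in range(n_datasets):
        order = (-auc[d]).argsort(kind="stable")      # best method first
        r = np.empty(n_methods)
        r[order] = np.arange(1, n_methods + 1)
        for v in np.unique(auc[d]):                   # average ranks of ties
            tie = auc[d] == v
            r[tie] = r[tie].mean()
        ranks[d] = r
    return ranks.mean(axis=0)

# hypothetical AUCs: 3 datasets x 3 methods (say RUS, CBU, EE)
auc = np.array([[0.72, 0.74, 0.78],
                [0.63, 0.62, 0.66],
                [0.55, 0.58, 0.57]])
avg = average_ranks(auc)   # EE obtains the best (lowest) average rank here
```

The same per-dataset ranks feed the Friedman test and the post-hoc comparisons reported in the paper.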


References

Akbani R, Kwek S, Japkowicz N (2004) Applying support vector machines to imbalanced datasets. In: Machine Learning: ECML 2004, 15th European Conference on Machine Learning, Pisa, Italy, September 20-24, 2004, Proceedings. Springer Berlin Heidelberg, Berlin, Heidelberg, pp 39–50. DOI 10.1007/978-3-540-30115-8_7

Ali A, Shamsuddin SM, Ralescu AL (2015) Classification with class imbalance problem: a review. International Journal of Advances in Soft Computing and its Applications 7(3):176–204

Alzahrani T, Horadam KJ (2016) Community detection in bipartite networks: algorithms and case studies. In: Complex Systems and Networks: Dynamics, Controls and Applications. Springer Berlin Heidelberg, Berlin, Heidelberg, pp 25–50. DOI 10.1007/978-3-662-47824-0_2

Bachner J (2013) Predictive policing: preventing crime with data and analytics. IBM Center for the Business of Government

Baesens B, Van Gestel T, Viaene S, Stepanova M, Suykens J, Vanthienen J (2003) Benchmarking state-of-the-art classification algorithms for credit scoring. Journal of the Operational Research Society 54(6):627–635. DOI 10.1057/palgrave.jors.2601545

Barandela R, Sánchez J, García V, Rangel E (2003) Strategies for learning in class imbalance problems. Pattern Recognition 36(3):849–851. DOI 10.1016/S0031-3203(02)00257-1

Barber MJ (2007) Modularity and community detection in bipartite networks. Physical Review E 76:066102. DOI 10.1103/PhysRevE.76.066102

Barua S, Islam MM, Yao X, Murase K (2014) MWMOTE–majority weighted minority oversampling technique for imbalanced data set learning. IEEE Transactions on Knowledge and Data Engineering 26(2):405–425. DOI 10.1109/TKDE.2012.232

Batista GEAPA, Prati RC, Monard MC (2004) A study of the behavior of several methods for balancing machine learning training data. SIGKDD Explor Newsl 6(1):20–29. DOI 10.1145/1007730.1007735

Beckett SJ (2016) Improved community detection in weighted bipartite networks. Royal Society Open Science 3(1). DOI 10.1098/rsos.140536

Bekkar M, Djemaa HK, Alitouche TA (2013) Evaluation measures for models assessment over imbalanced data sets. Journal of Information Engineering and Applications 3(10):27–38

Bhattacharyya S, Jha S, Tharakunnel K, Westland JC (2011) Data mining for credit card fraud: a comparative study. Decision Support Systems 50(3):602–613. DOI 10.1016/j.dss.2010.08.008

Blondel VD, Guillaume JL, Lambiotte R, Lefebvre E (2008) Fast unfolding of communities in large networks. Journal of Statistical Mechanics: Theory and Experiment 2008(10):P10008

Breiman L, Friedman J, Stone CJ, Olshen RA (1984) Classification and regression trees. Taylor & Francis

Brozovsky L, Petricek V (2007) Recommender system for online dating service. In: Proceedings of Znalosti 2007 Conference, VSB Ostrava

Cha M, Mislove A, Gummadi KP (2009) A measurement-driven analysis of information propagation in the Flickr social network. In: Proceedings of the 18th International Conference on World Wide Web, ACM, New York, NY, USA, WWW '09, pp 721–730. DOI 10.1145/1526709.1526806

Chawla NV (2005) Data mining for imbalanced datasets: an overview. In: Data Mining and Knowledge Discovery Handbook. Springer US, Boston, MA, pp 853–867

Chawla NV, Bowyer KW, Hall LO, Kegelmeyer WP (2002) SMOTE: synthetic minority over-sampling technique. Journal of Artificial Intelligence Research 16:321–357

Chawla NV, Lazarevic A, Hall LO, Bowyer KW (2003) SMOTEBoost: improving prediction of the minority class in boosting. In: Knowledge Discovery in Databases: PKDD 2003. Springer Berlin Heidelberg, Berlin, Heidelberg, pp 107–119

Chawla NV, Japkowicz N, Kotcz A (2004) Editorial: special issue on learning from imbalanced data sets. SIGKDD Explor Newsl 6(1):1–6. DOI 10.1145/1007730.1007733

Chen M, Mao S, Liu Y (2014) Big data: a survey. Mobile Networks and Applications 19(2):171–209. DOI 10.1007/s11036-013-0489-0

Chyi YM (2003) Classification analysis techniques for skewed class distribution problems. Master thesis, Department of Information Management, National Sun Yat-Sen University

Demšar J (2006) Statistical comparisons of classifiers over multiple data sets. Journal of Machine Learning Research 7(Jan):1–30

Dhillon IS (2001) Co-clustering documents and words using bipartite spectral graph partitioning. In: Proceedings of the Seventh ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, ACM, New York, NY, USA, KDD '01, pp 269–274. DOI 10.1145/502512.502550


Drummond C, Holte RC (2003) C4.5, class imbalance, and cost sensitivity: why under-sampling beats over-sampling. In: Proceedings of the ICML '03 Workshop on Learning from Imbalanced Datasets

Fan RE, Chang KW, Hsieh CJ, Wang XR, Lin CJ (2008) LIBLINEAR: a library for large linear classification. Journal of Machine Learning Research 9:1871–1874

Fan W, Stolfo SJ, Zhang J, Chan PK (1999) AdaCost: misclassification cost-sensitive boosting. In: Proceedings of the Sixteenth International Conference on Machine Learning, Morgan Kaufmann Publishers Inc., San Francisco, CA, USA, ICML '99, pp 97–105

Fawcett T (2006) An introduction to ROC analysis. Pattern Recognition Letters 27(8):861–874. DOI 10.1016/j.patrec.2005.10.010

Finch H (2005) Comparison of distance measures in cluster analysis with dichotomous data. Journal of Data Science 3(1):85–100

Fortunato S (2010) Community detection in graphs. Physics Reports 486(3-5):75–174. DOI 10.1016/j.physrep.2009.11.002

Junqué de Fortuny E, Martens D, Provost F (2014a) Predictive modeling with big data: is bigger really better? Big Data 1(4):215–226. DOI 10.1089/big.2013.0037

Junqué de Fortuny E, Stankova M, Moeyersoms J, Minnaert B, Provost F, Martens D (2014b) Corporate residence fraud detection. In: Proceedings of the 20th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, ACM, New York, NY, USA, KDD '14, pp 1650–1659. DOI 10.1145/2623330.2623333

Frasca M, Bertoni A, Re M, Valentini G (2013) A neural network algorithm for semi-supervised node label learning from unbalanced data. Neural Networks 43:84–98. DOI 10.1016/j.neunet.2013.01.021

Friedman M (1937) The use of ranks to avoid the assumption of normality implicit in the analysis of variance. Journal of the American Statistical Association 32(200):675–701

García E, Lozano F (2007) Boosting support vector machines. In: Machine Learning and Data Mining in Pattern Recognition, 5th International Conference, MLDM 2007, Leipzig, Germany, July 18-20, Post Proceedings. IBaI Publishing, pp 153–167

Goldstein M, Uchida S (2016) A comparative evaluation of unsupervised anomaly detection algorithms for multivariate data. PLOS ONE 11(4):1–31. DOI 10.1371/journal.pone.0152173

González PC, Velásquez JD (2013) Characterization and detection of taxpayers with false invoices using data mining techniques. Expert Systems with Applications 40(5):1427–1436. DOI 10.1016/j.eswa.2012.08.051

Guimerà R, Sales-Pardo M, Amaral LAN (2007) Module identification in bipartite and directed networks. Physical Review E 76:036102. DOI 10.1103/PhysRevE.76.036102

Guo H, Viktor HL (2004) Learning from imbalanced data sets with boosting and data generation: the DataBoost-IM approach. SIGKDD Explor Newsl 6(1):30–39. DOI 10.1145/1007730.1007736

Guo X, Yin Y, Dong C, Yang G, Zhou G (2008) On the class imbalance problem. In: 2008 Fourth International Conference on Natural Computation, IEEE, vol 4, pp 192–201. DOI 10.1109/ICNC.2008.871

Han H, Wang WY, Mao BH (2005) Borderline-SMOTE: a new over-sampling method in imbalanced data sets learning. In: Advances in Intelligent Computing. Springer Berlin Heidelberg, Berlin, Heidelberg, pp 878–887

He H, Garcia EA (2009) Learning from imbalanced data. IEEE Transactions on Knowledge and Data Engineering 21(9):1263–1284. DOI 10.1109/TKDE.2008.239

He H, Bai Y, Garcia EA, Li S (2008) ADASYN: adaptive synthetic sampling approach for imbalanced learning. In: 2008 IEEE International Joint Conference on Neural Networks (IEEE World Congress on Computational Intelligence), IEEE, pp 1322–1328. DOI 10.1109/IJCNN.2008.4633969

Holm S (1979) A simple sequentially rejective multiple test procedure. Scandinavian Journal of Statistics 6(2):65–70

Hsu CW, Lin CJ (2002) A comparison of methods for multiclass support vector machines. IEEE Transactions on Neural Networks 13(2):415–425. DOI 10.1109/72.991427

Huang A (2008) Similarity measures for text document clustering. In: Proceedings of the Sixth New Zealand Computer Science Research Student Conference (NZCSRSC2008), Christchurch, New Zealand, pp 49–56

Iman RL, Davenport JM (1980) Approximations of the critical region of the Friedman statistic. Communications in Statistics - Theory and Methods 9(6):571–595

Jo T, Japkowicz N (2004) Class imbalances versus small disjuncts. ACM SIGKDD Explor Newsl 6(1):40–49. DOI 10.1145/1007730.1007737


Jutla IS, Jeub LG, Mucha PJ (2011–2016) A generalized Louvain method for community detection implemented in MATLAB. URL http://netwiki.amath.unc.edu/GenLouvain

Kubat M, Matwin S (1997) Addressing the curse of imbalanced training sets: one-sided selection. In: Proceedings of the Fourteenth International Conference on Machine Learning, Morgan Kaufmann Publishers Inc., San Francisco, CA, USA, pp 179–186

Lancichinetti A, Fortunato S (2009) Community detection algorithms: a comparative analysis. Physical Review E 80:056117. DOI 10.1103/PhysRevE.80.056117

Larremore DB, Clauset A, Jacobs AZ (2014) Efficiently inferring community structure in bipartite networks. Physical Review E: Statistical, Nonlinear, and Soft Matter Physics 90:012805. DOI 10.1103/PhysRevE.90.012805

Li J, Fine JP (2010) Weighted area under the receiver operating characteristic curve and its application to gene selection. Journal of the Royal Statistical Society: Series C (Applied Statistics) 59(4):673–692. DOI 10.1111/j.1467-9876.2010.00713.x

Li X, Wang L, Sung E (2008) AdaBoost with SVM-based component classifiers. Engineering Applications of Artificial Intelligence 21(5):785–795. DOI 10.1016/j.engappai.2007.07.001

Lichman M (2013) UCI machine learning repository. URL http://archive.ics.uci.edu/ml

Liu W, Chawla S, Cieslak DA, Chawla NV (2010) A robust decision tree algorithm for imbalanced data sets. In: Proceedings of the Tenth SIAM International Conference on Data Mining, SIAM, Philadelphia, vol 10, pp 766–777

Liu XY, Wu J, Zhou ZH (2009) Exploratory undersampling for class-imbalance learning. IEEE Transactions on Systems, Man, and Cybernetics, Part B (Cybernetics) 39(2):539–550. DOI 10.1109/TSMCB.2008.2007853

Luts J, Ojeda F, Van de Plas R, De Moor B, Van Huffel S, Suykens JA (2010) A tutorial on support vector machine-based methods for classification problems in chemometrics. Analytica Chimica Acta 665(2):129–145. DOI 10.1016/j.aca.2010.03.030

Macskassy SA, Provost F (2007) Classification in networked data: a toolkit and a univariate case study. Journal of Machine Learning Research 8(May):935–983

Martens D, Provost F (2014) Explaining data-driven document classifications. MIS Quarterly 38(1):73–100

Martens D, Provost F, Clark J, Junqué de Fortuny E (2016) Mining massive fine-grained behavior data to improve predictive analytics. MIS Quarterly 40(4):869–888

Mazurowski MA, Habas PA, Zurada JM, Lo JY, Baker JA, Tourassi GD (2008) Training neural network classifiers for medical decision making: the effects of imbalanced datasets on classification performance. Neural Networks 21(2-3):427–436. DOI 10.1016/j.neunet.2007.12.031

Mease D, Wyner AJ, Buja A (2007) Boosted classification trees and class probability/quantile estimation. Journal of Machine Learning Research 8:409–439

Nemenyi P (1963) Distribution-free multiple comparisons. Dissertation, Princeton University

Newman MEJ, Girvan M (2004) Finding and evaluating community structure in networks. Physical Review E 69:026113. DOI 10.1103/PhysRevE.69.026113

Ng AY (2004) Feature selection, L1 vs. L2 regularization, and rotational invariance. In: Proceedings of the Twenty-first International Conference on Machine Learning, ACM, New York, NY, USA, ICML '04, pp 78–. DOI 10.1145/1015330.1015435

Ng AY, Jordan MI (2002) On discriminative vs. generative classifiers: a comparison of logistic regression and naive Bayes. In: Advances in Neural Information Processing Systems 14, MIT Press, pp 841–848

Ngai E, Hu Y, Wong Y, Chen Y, Sun X (2011) The application of data mining techniques in financial fraud detection: a classification framework and an academic review of literature. Decision Support Systems 50(3):559–569. DOI 10.1016/j.dss.2010.08.006

Platt JC (1999) Probabilistic outputs for support vector machines and comparisons to regularized likelihood methods. In: Advances in Large-Margin Classifiers, MIT Press, pp 61–74

Porter MA, Onnela JP, Mucha PJ (2009) Communities in networks. Notices of the American Mathematical Society 56(9):1082–1097

Provost F, Fawcett T (2013) Data Science for Business: what you need to know about data mining and data-analytic thinking. O'Reilly Media, Inc.

Provost F, Dalessandro B, Hook R, Zhang X, Murray A (2009) Audience selection for on-line brand advertising: privacy-friendly social network targeting. In: Proceedings of the 15th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, ACM, New York, NY, USA, KDD '09, pp 707–716. DOI 10.1145/1557019.1557098


Raskutti B, Kowalczyk A (2004) Extreme re-balancing for SVMs: a case study. SIGKDD Explor Newsl 6(1):60–69. DOI 10.1145/1007730.1007739

Rosvall M, Bergstrom CT (2008) Maps of random walks on complex networks reveal community structure. Proceedings of the National Academy of Sciences 105(4):1118–1123. DOI 10.1073/pnas.0706851105

Schapire RE (1999) A brief introduction to boosting. In: Proceedings of the 16th International Joint Conference on Artificial Intelligence - Volume 2, Morgan Kaufmann Publishers Inc., San Francisco, CA, USA, IJCAI '99, pp 1401–1406

Schapire RE, Singer Y (1999) Improved boosting algorithms using confidence-rated predictions. Machine Learning 37(3):297–336. DOI 10.1023/A:1007614523901

Shmueli G (2017) Analyzing behavioral big data: methodological, practical, ethical, and moral issues. Quality Engineering 29(1):57–74. DOI 10.1080/08982112.2016.1210979

Sobhani P, Viktor H, Matwin S (2015) Learning from imbalanced data using ensemble methods and cluster-based undersampling. In: New Frontiers in Mining Complex Patterns: Third International Workshop, NFMCP 2014, Held in Conjunction with ECML-PKDD 2014, Nancy, France, September 19, 2014, Revised Selected Papers. Springer International Publishing, Cham, pp 69–83. DOI 10.1007/978-3-319-17876-9_5

Stankova M (2016) Classification within network data with a bipartite structure. Dissertation, University of Antwerp

Stankova M, Martens D, Provost F (2015) Classification over bipartite graphs through projection. Working Papers 2015001, University of Antwerp, Faculty of Applied Economics

Sun Y, Kamel MS, Wong AK, Wang Y (2007) Cost-sensitive boosting for classification of imbalanced data. Pattern Recognition 40(12):3358–3378. DOI 10.1016/j.patcog.2007.04.009

Suykens JA, Van Gestel T, De Brabanter J, De Moor B, Vandewalle J (2002) Least squares support vector machines. World Scientific, Singapore

Tang B, He H (2015) ENN: extended nearest neighbor method for pattern recognition [research frontier]. IEEE Computational Intelligence Magazine 10(3):52–60. DOI 10.1109/MCI.2015.2437512

Tang Y, Zhang YQ, Chawla NV, Krasser S (2009) SVMs modeling for highly imbalanced classification. IEEE Transactions on Systems, Man, and Cybernetics, Part B (Cybernetics) 39(1):281–288. DOI 10.1109/TSMCB.2008.2002909

Tobback E, Moeyersoms J, Stankova M, Martens D (2016) Bankruptcy prediction for SMEs using relational data. Working Paper 2016004, University of Antwerp, Faculty of Applied Economics

Verbeke W, Dejaeger K, Martens D, Hur J, Baesens B (2012) New insights into churn prediction in the telecommunication sector: a profit driven data mining approach. European Journal of Operational Research 218(1):211–229. DOI 10.1016/j.ejor.2011.09.031

Veropoulos K, Campbell I, Cristianini N (1999) Controlling the sensitivity of support vector machines. In: Proceedings of the International Joint Conference on Artificial Intelligence, Stockholm, Sweden (IJCAI99), pp 55–60

Whitrow C, Hand DJ, Juszczak P, Weston D, Adams NM (2009) Transaction aggregation as a strategy for credit card fraud detection. Data Mining and Knowledge Discovery 18(1):30–55. DOI 10.1007/s10618-008-0116-z

Wickramaratna J, Holden SB, Buxton BF (2001) Performance degradation in boosting. In: Proceedings of the Second International Workshop on Multiple Classifier Systems, Springer, London, UK, MCS '01, pp 11–21

Yen SJ, Lee YS (2009) Cluster-based under-sampling approaches for imbalanced data distributions. Expert Systems with Applications 36(3, Part 1):5718–5727. DOI 10.1016/j.eswa.2008.06.108

Yu HF, Lo HY, Hsieh HP, Lou JK, McKenzie TG, Chou JW, Chung PH, Ho CH, Chang CF, Wei YH, et al (2010) Feature engineering and classifier ensemble for KDD Cup 2010. In: Proceedings of the KDD Cup 2010 Workshop, pp 1–16

Zha H, He X, Ding C, Simon H, Gu M (2001) Bipartite graph partitioning and data clustering. In: Proceedings of the Tenth International Conference on Information and Knowledge Management, ACM, New York, NY, USA, CIKM '01, pp 25–32. DOI 10.1145/502585.502591

Zhang J, Mani I (2003) kNN approach to unbalanced data distributions: a case study involving information extraction. In: Proceedings of the ICML'2003 Workshop on Learning from Imbalanced Datasets, Washington DC

Ziegler CN, McNee SM, Konstan JA, Lausen G (2005) Improving recommendation lists through topic diversification. In: Proceedings of the 14th International Conference on World Wide Web, ACM, New York, NY, USA, WWW '05, pp 22–32. DOI 10.1145/1060745.1060754

  • Introduction
  • Preliminaries
  • Methods
  • Results and discussion
  • The effect of the chosen base learner
  • Conclusions and future research directions
  • Oversampling experiments
  • Undersampling experiments
  • Boosting experiments
  • Final Comparison
Page 47: This item is the archived peer-reviewed author-version of datasets Jellis Vanhoeyveld David Martens Received: date / Accepted: date Abstract Recent years have witnessed a growing number

46 Jellis Vanhoeyveld David Martens

Table 12 continuedYahoo G(p = 1)

βu1 βu2 βu3 βu4 βu5RUS 6684(37) 6785(32) 6836(32) 6823(4) 699(42)Cl K 6684(37) 6671(28) 643(36) 6198(39) 6115(19)CL T 6684(37) 6579(27) 6355(33) 5921(35) 6108(24)Far K 6684(37) 6676(41) 6384(34) 6516(2) 485(29)Far T 6684(37) 6695(41) 6348(29) 6516(2) 4848(29)CBU 6968(41) 7059(32) 7064(37) 702(29) 6335(36)

Yahoo G(p = 25)βu1 βu2 βu3 βu4 βu5

RUS 7882(14) 7891(16) 7897(16) 7861(16) 7782(21)Cl K 7882(14) 7726(15) 7252(15) 6786(2) 6507(27)CL T 7882(14) 7683(1) 7199(18) 6715(23) 611(27)Far K 7882(14) 7826(22) 7469(27) 6722(21) 6072(23)Far T 7882(14) 7768(26) 7244(3) 6494(24) 596(2)CBU 7525(32) 7522(24) 7469(23) 7307(24) 7069(24)

TaFeng(p = 1)βu1 βu2 βu3 βu4 βu5

RUS 5575(16) 561(16) 5626(17) 5723(17) 5925(22)Cl K 5575(16) 5568(16) 5558(15) 5508(11) 5105(15)CL T 5575(16) 5567(16) 5447(16) 4753(16) 493(11)Far K 5575(16) 5899(12) 5947(11) 6004(12) 5631(1)Far T 5575(16) 5892(13) 5925(13) 5858(11) 5631(1)CBU 578(1) 5847(11) 5815(09) 5887(14) 5765(16)

TaFeng(p = 25)βu1 βu2 βu3 βu4 βu5

RUS 6694(13) 6744(13) 681(14) 6827(14) 6613(12)Cl K 6694(13) 6613(14) 6339(12) 5983(13) 5694(07)CL T 6694(13) 6638(15) 6289(16) 5746(13) 5456(13)Far K 6694(13) 6806(14) 6643(16) 6446(15) 6335(13)Far T 6694(13) 6431(11) 6269(1) 6127(11) 5903(1)CBU 6481(12) 6415(11) 6413(12) 6388(08) 6346(08)

Book(p = 1)βu1 βu2 βu3 βu4 βu5

RUS 526(13) 5279(09) 5346(08) 5389(09) 5405(09)Cl K 526(13) 5256(12) 5252(13) 5239(11) 5309(11)CL T 526(13) 5256(12) 5252(13) 5239(11) 5305(07)Far K 526(13) 5521(12) 5621(18) 5614(12) 5306(1)Far T 526(13) 5521(12) 5621(18) 5614(12) 5306(1)CBU 5428(09) 5377(1) 5333(11) 5334(09) 5284(08)

Book(p = 25)βu1 βu2 βu3 βu4 βu5

RUS 6008(07) 6013(06) 604(08) 6033(08) 6328(08)Cl K 6008(07) 5996(07) 6013(08) 5996(1) 5928(07)CL T 6008(07) 5996(07) 6013(08) 6029(04) 545(09)Far K 6008(07) 6329(1) 6419(08) 573(11) 5566(11)Far T 6008(07) 6214(05) 5827(06) 5637(1) 5566(11)CBU 5482(09) 5467(09) 5471(09) 5466(1) 5478(09)Continues on next page

Imbalanced classification in sparse and large behaviour datasets 47

Table 12 continuedLST(p = 1)

βu1 βu2 βu3 βu4 βu5RUS 9999(0) 9999(0) 9999(0) 9998(0) 9999(0)Cl K 9999(0) 9999(0) 9999(0) 9999(0) 9999(0)CL T 9999(0) 9999(0) 9999(0) 9999(0) 9998(0)Far K 9999(0) 9998(0) 9998(0) 9998(0) 9998(0)Far T 9999(0) 9998(0) 9998(0) 9998(0) 9998(0)CBU [] [] [] [] []

Adver(p = [])βu1 βu2 βu3 βu4 βu5

RUS 9661(18) 9632(18) 9663(14) 9712(21) 9622(16)Cl K 9661(18) 9644(15) 9614(15) 9604(2) 948(25)CL T 9661(18) 9587(21) 9432(19) 9301(22) 9072(23)Far K 9661(18) 9653(14) 9576(2) 9439(18) 9049(31)Far T 9661(18) 9654(15) 9567(19) 9454(18) 893(28)CBU 9685(23) 9685(23) 9705(15) 966(16) 9606(21)

Adver(p = 1)βu1 βu2 βu3 βu4 βu5

RUS 9093(3) 9153(31) 9237(34) 919(29) 9193(22)Cl K 9093(3) 9064(3) 8987(39) 9021(36) 8918(2)CL T 9093(3) 897(35) 8855(34) 8576(33) 882(23)Far K 9093(3) 938(23) 924(26) 8873(34) 8551(4)Far T 9093(3) 9362(24) 932(22) 8841(36) 8551(4)CBU 9322(24) 9376(25) 9389(26) 9352(27) 9127(2)

CRF(p = [])βu1 βu2 βu3 βu4 βu5

RUS 6406(164) 6328(159) 6798(174) 6695(219) 8773(88)Cl K 6406(164) 6244(166) 6234(169) 7137(138) 7822(177)CL T 6406(164) 6244(166) 6234(169) 7137(138) 6267(229)Far K 6406(164) 838(142) 8393(148) 8449(137) 8611(97)Far T 6406(164) 838(142) 8393(148) 8449(137) 8611(97)CBU [] [] [] [] []

Bank(p = [])βu1 βu2 βu3 βu4 βu5

RUS 6682(09) 6702(09) 6737(08) 6799(06) 695(1)Cl K 6682(09) 6617(07) 6524(06) 6486(06) 5853(11)CL T 6682(09) 6492(11) 6069(09) 5633(08) 5287(07)Far K 6682(09) 6695(06) 6619(06) 6442(06) 5825(11)Far T 6682(09) 6716(06) 642(08) 5967(1) 5825(11)CBU [] [] [] [] []

48 Jellis Vanhoeyveld David Martens

C Boosting experiments

This section presents the results of each of the boosting experiments from Section 45 on the remainingdata sources Each of the figures below depicts the average tenfold AUC-performance on test data (withmicro = 100) with respect to the number of boosting iterations for (left) Adaboost (AB) AdaCost (AC) andEasyEnsemble (EE) with C chosen according to highest validation set AUC-performance (over all possibleboosting rounds) and (right) AB and EE (with S = 15) with varying C-levels

[Two-panel figure: (a) test AUC versus boosting round T (0 to 30) for AB, AC(R=2), AC(R=8), AC(RD), EE(S=5), EE(S=10), EE(S=15) and the baseline BL; (b) test AUC versus T for AB and EE with C ∈ {1e-07, 1e-05, 0.001, 0.1}, and BL.]

Fig. 6 Mov G(p = 1) dataset

[Two-panel figure: (a) test AUC versus boosting round T for AB, the AC and EE variants, and BL; (b) test AUC versus T for AB and EE at varying C-levels, and BL.]

Fig. 7 Mov Th(p = []) dataset

Imbalanced classification in sparse and large behaviour datasets 49

[Two-panel figure: (a) test AUC versus boosting round T for AB, the AC and EE variants, and BL; (b) test AUC versus T for AB and EE at varying C-levels, and BL.]

Fig. 8 Yahoo A(p = 1) dataset

[Two-panel figure: (a) test AUC versus boosting round T for AB, the AC and EE variants, and BL; (b) test AUC versus T for AB and EE at varying C-levels, and BL.]

Fig. 9 Yahoo A(p = 25) dataset

[Two-panel figure: (a) test AUC versus boosting round T for AB, the AC and EE variants, and BL; (b) test AUC versus T for AB and EE at varying C-levels, and BL.]

Fig. 10 Yahoo G(p = 1) dataset


[Two-panel figure: (a) test AUC versus boosting round T for AB, the AC and EE variants, and BL; (b) test AUC versus T for AB and EE at varying C-levels, and BL.]

Fig. 11 Yahoo G(p = 25) dataset

[Two-panel figure: (a) test AUC versus boosting round T for AB, the AC and EE variants, and BL; (b) test AUC versus T for AB and EE at varying C-levels, and BL.]

Fig. 12 TaFeng(p = 1) dataset

[Two-panel figure: (a) test AUC versus boosting round T for AB, the AC and EE variants, and BL; (b) test AUC versus T for AB and EE at varying C-levels, and BL.]

Fig. 13 Book(p = 1) dataset


[Two-panel figure: (a) test AUC versus boosting round T for AB, the AC and EE variants, and BL; (b) test AUC versus T for AB and EE at varying C-levels, and BL.]

Fig. 14 LST(p = 1) dataset

[Two-panel figure: (a) test AUC versus boosting round T for AB, the AC and EE variants, and BL; (b) test AUC versus T for AB and EE at varying C-levels, and BL.]

Fig. 15 Adver(p = []) dataset

[Two-panel figure: (a) test AUC versus boosting round T for AB, the AC and EE variants, and BL; (b) test AUC versus T for AB and EE at varying C-levels, and BL.]

Fig. 16 Adver(p = 1) dataset


[Two-panel figure: (a) test AUC versus boosting round T for AB, the AC and EE variants, and BL; (b) test AUC versus T for AB and EE at varying C-levels, and BL.]

Fig. 17 CRF(p = []) dataset

D Final Comparison

[Figure: scatter plot of average rank AUC (x-axis, 0 to 14) versus average rank Time (y-axis, 0 to 18) for BL, OSR, SMOTE, ADASYN, RUS, Cl Knn, Far Knn, CBU, AB, AC(R=28), AC(R=RL), EE(S=10), EE(S=15) and EE par.]

Fig. 18 Average rank AUC versus average rank Time (see Table 9) across the large datasets from Table 2 (CRF and Bank). Regarding the former, an extra row EE par is added, having the same AUC-value as EE(S = 15). Ranks for each dataset are subsequently obtained via the procedure outlined in Section 4.6.1. Points occurring in the upper-right region are preferred.
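The average-rank procedure behind Fig. 18 can be sketched as follows (an illustrative reconstruction, not the paper's code): for every dataset, rank the methods on the chosen metric (rank 1 = best), then average each method's rank over all datasets.

```python
import numpy as np

def average_ranks(scores):
    """scores: (n_datasets, n_methods) array, higher is better.
    Returns the per-method average rank (1 = best); tied methods
    share the mean of the rank positions they occupy."""
    n_datasets, n_methods = scores.shape
    ranks = np.empty_like(scores, dtype=float)
    for d in range(n_datasets):
        order = np.argsort(-scores[d])            # best method first
        r = np.empty(n_methods)
        r[order] = np.arange(1, n_methods + 1)
        for v in np.unique(scores[d]):            # resolve ties
            tie = scores[d] == v
            if tie.sum() > 1:
                r[tie] = r[tie].mean()
        ranks[d] = r
    return ranks.mean(axis=0)
```

For a metric where lower is better (such as running time), negate the matrix before calling, so that the fastest method again receives rank 1.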


Fan RE Chang KW Hsieh CJ Wang XR Lin CJ (2008) LIBLINEAR A library for large linear classifi-cation Journal of Machine Learning Research 91871ndash1874

Fan W Stolfo SJ Zhang J Chan PK (1999) AdaCost Misclassification cost-sensitive boosting In Pro-ceedings of the Sixteenth International Conference on Machine Learning Morgan Kaufmann Pub-lishers Inc San Francisco CA USA ICML rsquo99 pp 97ndash105

Fawcett T (2006) An introduction to ROC analysis Pattern Recognition Letters 27(8)861 ndash 874 DOIhttpsdoiorg101016jpatrec200510010

Finch H (2005) Comparison of distance measures in cluster analysis with dichotomous data Journal ofData Science 3(1)85ndash100

Fortunato S (2010) Community detection in graphs Physics Reports 486(3 5)75 ndash 174 DOI httpsdoiorg101016jphysrep200911002

Junque de Fortuny E Martens D Provost F (2014a) Predictive modeling with big data is bigger reallybetter Big Data 1(4)215ndash226 DOI 101089big20130037

Junque de Fortuny E Stankova M Moeyersoms J Minnaert B Provost F Martens D (2014b) Corporateresidence fraud detection In Proceedings of the 20th ACM SIGKDD International Conference onKnowledge Discovery and Data Mining ACM New York NY USA KDD rsquo14 pp 1650ndash1659 DOI10114526233302623333

Frasca M Bertoni A Re M Valentini G (2013) A neural network algorithm for semi-supervised node labellearning from unbalanced data Neural Networks 4384 ndash 98 DOI httpsdoiorg101016jneunet201301021

Friedman M (1937) The use of ranks to avoid the assumption of normality implicit in the analysis ofvariance Journal of the american statistical association 32(200)675ndash701

Garcıa E Lozano F (2007) Boosting support vector machines In Machine Learning and Data Mining inPattern Recognition 5th International Conference MLDM 2007 Leipzig Germany July 18-20 PostProceedings IBaI publishing pp 153ndash167

Goldstein M Uchida S (2016) A comparative evaluation of unsupervised anomaly detection algorithmsfor multivariate data PLOS ONE 11(4)1ndash31 DOI 101371journalpone0152173

Gonzlez PC Velsquez JD (2013) Characterization and detection of taxpayers with false invoices usingdata mining techniques Expert Systems with Applications 40(5)1427 ndash 1436 DOI httpsdoiorg101016jeswa201208051

Guimera R Sales-Pardo M Amaral LAN (2007) Module identification in bipartite and directed networksPhysical Review E 76036102 DOI 101103PhysRevE76036102

Guo H Viktor HL (2004) Learning from imbalanced data sets with boosting and data generation TheDataBoost-IM approach SIGKDD Explor Newsl 6(1)30ndash39 DOI 10114510077301007736

Guo X Yin Y Dong C Yang G Zhou G (2008) On the class imbalance problem In 2008 Fourth Interna-tional Conference on Natural Computation IEEE vol 4 pp 192ndash201 DOI 101109ICNC2008871

Han H Wang WY Mao BH (2005) Borderline-SMOTE A new over-sampling method in imbalanced datasets learning In Advances in Intelligent Computing Springer Berlin Heidelberg Berlin Heidelbergpp 878ndash887

He H Garcia EA (2009) Learning from imbalanced data IEEE Transactions on Knowledge and DataEngineering 21(9)1263ndash1284 DOI 101109TKDE2008239

He H Bai Y Garcia EA Li S (2008) ADASYN Adaptive synthetic sampling approach for imbalancedlearning In 2008 IEEE International Joint Conference on Neural Networks (IEEE World Congresson Computational Intelligence) IEEE pp 1322ndash1328 DOI 101109IJCNN20084633969

Holm S (1979) A simple sequentially rejective multiple test procedure Scandinavian journal of statistics6(2)65ndash70

Hsu CW Lin CJ (2002) A comparison of methods for multiclass support vector machines IEEE Transac-tions on Neural Networks 13(2)415ndash425 DOI 10110972991427

Huang A (2008) Similarity measures for text document clustering In Proceedings of the sixth new zealandcomputer science research student conference (NZCSRSC2008) Christchurch New Zealand pp 49ndash56

Iman RL Davenport JM (1980) Approximations of the critical region of the Friedman statistic Commu-nications in Statistics-Theory and Methods 9(6)571ndash595

Jo T Japkowicz N (2004) Class imbalances versus small disjuncts ACM SIGKDD Explor Newsl 6(1)40ndash49 DOI 10114510077301007737

Imbalanced classification in sparse and large behaviour datasets 55

Jutla IS Jeub LG Mucha PJ (2011-2016) A generalized louvain method for community detection imple-mented in MATLAB URL httpnetwikiamathunceduGenLouvain

Kubat M Matwin S (1997) Addressing the curse of imbalanced training sets One-sided selection InProceedings of the Fourteenth International Conference on Machine Learning Morgan KaufmannPublishers Inc San Francisco CA USA pp 179ndash186

Lancichinetti A Fortunato S (2009) Community detection algorithms A comparative analysis PhysicalReview E 80056117 DOI 101103PhysRevE80056117

Larremore DB Clauset A Jacobs AZ (2014) Efficiently inferring community structure in bipartitenetworks Physical Review E Statistical Nonlinear and Soft Matter Physics 90012805 DOI101103PhysRevE90012805

Li J Fine JP (2010) Weighted area under the receiver operating characteristic curve and its application togene selection Journal of the Royal Statistical Society Series C (Applied Statistics) 59(4)673ndash692DOI 101111j1467-9876201000713x

Li X Wang L Sung E (2008) AdaBoost with SVM-based component classifiers Engineering Applicationsof Artificial Intelligence 21(5)785ndash795 DOI httpsdoiorg101016jengappai200707001

Lichman M (2013) UCI machine learning repository URL httparchiveicsuciedumlLiu W Chawla S Cieslak DA Chawla NV (2010) A robust decision tree algorithm for imbalanced data

sets In Proceedings of the tenth SIAM international conference on data mining SIAM Philadelphiavol 10 pp 766ndash777

Liu XY Wu J Zhou ZH (2009) Exploratory undersampling for class-imbalance learning IEEE Transac-tions on Systems Man and Cybernetics Part B (Cybernetics) 39(2)539ndash550 DOI 101109TSMCB20082007853

Luts J Ojeda F Van de Plas R De Moor B Van Huffel S Suykens JA (2010) A tutorial on supportvector machine-based methods for classification problems in chemometrics Analytica Chimica Acta665(2)129ndash145 DOI httpsdoiorg101016jaca201003030

Macskassy SA Provost F (2007) Classification in networked data A toolkit and a univariate case studyJournal of Machine Learning Research 8(May)935ndash983

Martens D Provost F (2014) Explaining data-driven document classifications MIS Quarterly 38(1)73ndash100

Martens D Provost F Clark J Junque de Fortuny E (2016) Mining massive fine-grained behavior data toimprove predictive analytics MIS Quarterly 40(4)869ndash888

Mazurowski MA Habas PA Zurada JM Lo JY Baker JA Tourassi GD (2008) Training neural networkclassifiers for medical decision making The effects of imbalanced datasets on classification perfor-mance Neural Networks 21(23)427 ndash 436 DOI httpsdoiorg101016jneunet200712031

Mease D Wyner AJ Buja A (2007) Boosted classification trees and class probabilityquantile estimationJournal of Machine Learning Research 8409ndash439

Nemenyi P (1963) Distribution-free Multiple Comparisons Dissertation Princeton UniversityNewman MEJ Girvan M (2004) Finding and evaluating community structure in networks Physical Re-

view E 69026113 DOI 101103PhysRevE69026113Ng AY (2004) Feature selection L1 vs L2 regularization and rotational invariance In Proceedings of the

Twenty-first International Conference on Machine Learning ACM New York NY USA ICML rsquo04pp 78ndash DOI 10114510153301015435

Ng AY Jordan MI (2002) On discriminative vs generative classifiers A comparison of logistic regressionand naive bayes In Advances in Neural Information Processing Systems 14 MIT Press pp 841ndash848

Ngai E Hu Y Wong Y Chen Y Sun X (2011) The application of data mining techniques in financial frauddetection A classification framework and an academic review of literature Decision Support Systems50(3)559 ndash 569 DOI httpsdoiorg101016jdss201008006

Platt JC (1999) Probabilistic outputs for support vector machines and comparisons to regularized likeli-hood methods In Advances in Large-Margin Classifiers MIT Press pp 61ndash74

Porter MA Onnela JP Mucha PJ (2009) Communities in networks Notices of the American MathematicalSociety 56(9)1082ndash1097

Provost F Fawcett T (2013) Data Science for Business What you need to know about data mining anddata-analytic thinking OrsquoReilly Media Inc

Provost F Dalessandro B Hook R Zhang X Murray A (2009) Audience selection for on-line brandadvertising Privacy-friendly social network targeting In Proceedings of the 15th ACM SIGKDDInternational Conference on Knowledge Discovery and Data Mining ACM New York NY USAKDD rsquo09 pp 707ndash716 DOI 10114515570191557098

56 Jellis Vanhoeyveld David Martens

Raskutti B Kowalczyk A (2004) Extreme re-balancing for SVMs A case study SIGKDD Explor Newsl6(1)60ndash69 DOI 10114510077301007739

Rosvall M Bergstrom CT (2008) Maps of random walks on complex networks reveal community structureProceedings of the National Academy of Sciences 105(4)1118ndash1123 DOI 101073pnas0706851105

Schapire RE (1999) A brief introduction to boosting In Proceedings of the 16th International Joint Con-ference on Artificial Intelligence - Volume 2 Morgan Kaufmann Publishers Inc San Francisco CAUSA IJCAIrsquo99 pp 1401ndash1406

Schapire RE Singer Y (1999) Improved boosting algorithms using confidence-rated predictions Machinelearning 37(3)297ndash336 DOI 101023A1007614523901

Shmueli G (2017) Analyzing behavioral big data Methodological practical ethical and moral issuesQuality Engineering 29(1)57ndash74 DOI 1010800898211220161210979

Sobhani P Viktor H Matwin S (2015) Learning from imbalanced data using ensemble methods andcluster-based undersampling In New Frontiers in Mining Complex Patterns Third InternationalWorkshop NFMCP 2014 Held in Conjunction with ECML-PKDD 2014 Nancy France Septem-ber 19 2014 Revised Selected Papers Springer International Publishing Cham pp 69ndash83 DOI101007978-3-319-17876-9 5

Stankova M (2016) Classification within network data with a bipartite structure Dissertation Universityof Antwerp

Stankova M Martens D Provost F (2015) Classification over bipartite graphs through projection WorkingPapers 2015001 University of Antwerp Faculty of Applied Economics

Sun Y Kamel MS Wong AK Wang Y (2007) Cost-sensitive boosting for classification of imbalanceddata Pattern Recognition 40(12)3358 ndash 3378 DOI httpsdoiorg101016jpatcog200704009

Suykens JA Van Gestel T De Brabanter J De Moor B Vandewalle J Suykens J Van Gestel T (2002)Least squares support vector machines World Scientific Singapore

Tang B He H (2015) Enn Extended nearest neighbor method for pattern recognition [research frontier]IEEE Computational Intelligence Magazine 10(3)52ndash60 DOI 101109MCI20152437512

Tang Y Zhang YQ Chawla NV Krasser S (2009) SVMs modeling for highly imbalanced classificationIEEE Transactions on Systems Man and Cybernetics Part B (Cybernetics) 39(1)281ndash288 DOI101109TSMCB20082002909

Tobback E Moeyersoms J Stankova M Martens D (2016) Bankruptcy prediction for SMEs using rela-tional data Working Paper 2016004 University of Antwerp Faculty of Applied Economics

Verbeke W Dejaeger K Martens D Hur J Baesens B (2012) New insights into churn prediction in thetelecommunication sector A profit driven data mining approach European Journal of OperationalResearch 218(1)211 ndash 229 DOI httpsdoiorg101016jejor201109031

Veropoulos K Campbell I Cristianini N (1999) Controlling the sensitivity of support vector machinesIn Proceedings of the International Joint Conference on Artificial Intelligence Stockholm Sweden(IJCAI99) pp 55 ndash 60

Whitrow C Hand DJ Juszczak P Weston D Adams NM (2009) Transaction aggregation as a strategyfor credit card fraud detection Data Mining and Knowledge Discovery 18(1)30ndash55 DOI 101007s10618-008-0116-z

Wickramaratna J Holden SB Buxton BF (2001) Performance degradation in boosting In Proceedings ofthe Second International Workshop on Multiple Classifier Systems Springer London UK MCS rsquo01pp 11ndash21

Yen SJ Lee YS (2009) Cluster-based under-sampling approaches for imbalanced data distributions ExpertSystems with Applications 36(3 Part 1)5718 ndash 5727 DOI httpsdoiorg101016jeswa200806108

Yu HF Lo HY Hsieh HP Lou JK McKenzie TG Chou JW Chung PH Ho CH Chang CF Wei YH et al(2010) Feature engineering and classifier ensemble for kdd cup 2010 In Proceedings of the KDDCup 2010 Workshop pp 1ndash16

Zha H He X Ding C Simon H Gu M (2001) Bipartite graph partitioning and data clustering In Pro-ceedings of the Tenth International Conference on Information and Knowledge Management ACMNew York NY USA CIKM rsquo01 pp 25ndash32 DOI 101145502585502591

Zhang J Mani I (2003) Knn approach to unbalanced data distributions A case study involving informationextraction In Proceedings of the ICMLrsquo2003 Workshop on Learning from Imbalanced DatasetsWashington DC

Ziegler CN McNee SM Konstan JA Lausen G (2005) Improving recommendation lists through topicdiversification In Proceedings of the 14th International Conference on World Wide Web ACM NewYork NY USA WWW rsquo05 pp 22ndash32 DOI 10114510607451060754


48 Jellis Vanhoeyveld David Martens

C Boosting experiments

This section presents the results of each of the boosting experiments from Section 4.5 on the remaining data sources. Each of the figures below depicts the average tenfold AUC-performance on test data (with µ = 100) with respect to the number of boosting iterations for (left) AdaBoost (AB), AdaCost (AC) and EasyEnsemble (EE), with C chosen according to the highest validation set AUC-performance (over all possible boosting rounds), and (right) AB and EE (with S = 15) at varying C-levels.
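The EasyEnsemble scheme evaluated in these experiments (S balanced random subsets of the training data, each fed to a boosting process, with the resulting hypotheses averaged) can be sketched as below. This is our own minimal illustration: for self-containment it boosts decision stumps rather than the regularized linear SVM base learners used in the paper, and the function names and synthetic data are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)

def fit_stump(X, y, w):
    """Weighted best axis-aligned decision stump; labels y in {-1, +1}."""
    best, best_err = None, np.inf
    for j in range(X.shape[1]):
        for thr in np.unique(X[:, j]):
            for pol in (1, -1):
                pred = pol * np.where(X[:, j] >= thr, 1, -1)
                err = w[pred != y].sum()
                if err < best_err:
                    best_err, best = err, (j, thr, pol)
    return best, best_err

def adaboost(X, y, T=10):
    """Discrete AdaBoost over decision stumps; returns a weighted stump list."""
    w = np.full(len(y), 1.0 / len(y))
    model = []
    for _ in range(T):
        (j, thr, pol), err = fit_stump(X, y, w)
        err = min(max(err, 1e-12), 1 - 1e-12)      # guard log() at 0 or 1
        alpha = 0.5 * np.log((1 - err) / err)
        pred = pol * np.where(X[:, j] >= thr, 1, -1)
        w = w * np.exp(-alpha * y * pred)          # up-weight misclassified points
        w = w / w.sum()
        model.append((j, thr, pol, alpha))
    return model

def decision_score(model, X):
    """Real-valued boosted score; sign gives the predicted class."""
    s = np.zeros(len(X))
    for j, thr, pol, alpha in model:
        s += alpha * pol * np.where(X[:, j] >= thr, 1, -1)
    return s

def easy_ensemble_score(X, y, X_eval, S=5, T=10):
    """EasyEnsemble: boost S balanced random subsets, average their scores."""
    pos = np.flatnonzero(y == 1)    # minority class, kept in full
    neg = np.flatnonzero(y == -1)   # majority class, undersampled per subset
    score = np.zeros(len(X_eval))
    for _ in range(S):
        sub = np.concatenate([pos, rng.choice(neg, size=len(pos), replace=False)])
        score += decision_score(adaboost(X[sub], y[sub], T=T), X_eval)
    return score / S

def auc(y, s):
    """Rank-based AUC: P(random positive outscores a random negative)."""
    p, n = s[y == 1], s[y == -1]
    diff = p[:, None] - n[None, :]
    return (diff > 0).mean() + 0.5 * (diff == 0).mean()

# Illustrative imbalanced data: 5% minority shifted away from the majority cloud.
X = np.vstack([rng.normal(2.0, 1.0, (30, 2)), rng.normal(0.0, 1.0, (570, 2))])
y = np.concatenate([np.ones(30), -np.ones(570)])
scores = easy_ensemble_score(X, y, X, S=5, T=10)
```

Because each subset is only twice the minority class size, the S boosting runs are cheap and independent, which is what makes the procedure parallelizable.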

[Panels (a) and (b): test AUC versus boosting round T. Legend (a): AB, AC (R = 2), AC (R = 8), AC (RD), EE (S = 5), EE (S = 10), EE (S = 15), BL. Legend (b): AB and EE at C = 1e-07, 1e-05, 0.001, 0.1; BL]

Fig 6 Mov G(p = 1) dataset

[Panels (a) and (b) as in Fig 6: test AUC versus boosting round T]

Fig 7 Mov Th(p = []) dataset

Imbalanced classification in sparse and large behaviour datasets 49

[Panels (a) and (b) as in Fig 6: test AUC versus boosting round T]

Fig 8 Yahoo A(p = 1) dataset

[Panels (a) and (b) as in Fig 6: test AUC versus boosting round T]

Fig 9 Yahoo A(p = 25) dataset

[Panels (a) and (b) as in Fig 6: test AUC versus boosting round T]

Fig 10 Yahoo G(p = 1) dataset


[Panels (a) and (b) as in Fig 6: test AUC versus boosting round T]

Fig 11 Yahoo G(p = 25) dataset

[Panels (a) and (b) as in Fig 6: test AUC versus boosting round T]

Fig 12 TaFeng(p = 1) dataset

[Panels (a) and (b) as in Fig 6: test AUC versus boosting round T]

Fig 13 Book(p = 1) dataset


[Panels (a) and (b) as in Fig 6: test AUC versus boosting round T]

Fig 14 LST(p = 1) dataset

[Panels (a) and (b) as in Fig 6: test AUC versus boosting round T]

Fig 15 Adver(p = []) dataset

[Panels (a) and (b) as in Fig 6: test AUC versus boosting round T]

Fig 16 Adver(p = 1) dataset


[Panels (a) and (b) as in Fig 6: test AUC versus boosting round T]

Fig 17 CRF(p = []) dataset

D Final Comparison

[Scatter of average rank Time (vertical axis) versus average rank AUC (horizontal axis). Legend: BL, OS, RS, SMOTE, ADASYN, RUS, Cl Knn, Far Knn, CBU, AB, AC(R=28), AC(R=RL), EE(S=10), EE(S=15), EE par]

Fig 18 Average rank AUC versus average rank Time (see Table 9) across the large datasets from Table 2 (CRF and Bank). Regarding the former, an extra row EE par is added, having the same AUC-value as EE(S = 15). Ranks for each dataset are subsequently obtained via the procedure outlined in Section 4.6.1. Points occurring in the upper-right region are preferred.
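The per-dataset ranking behind this comparison (Friedman-style average ranks with ties averaged, in the spirit of Demšar 2006) can be sketched as follows. This is our own illustration; the function name, the toy numbers, and the convention that rank 1 is best are assumptions, not the paper's exact procedure.

```python
import numpy as np

def average_ranks(scores):
    """Average rank per method across datasets (rank 1 = best, ties averaged).

    scores: array of shape (n_datasets, n_methods); higher is better, as for
    AUC. For a cost criterion such as computation time, negate it first.
    """
    n_datasets, n_methods = scores.shape
    ranks = np.empty_like(scores, dtype=float)
    for i in range(n_datasets):
        order = np.argsort(-scores[i])          # descending: best method first
        r = np.empty(n_methods)
        r[order] = np.arange(1, n_methods + 1)  # provisional ranks 1..m
        for v in np.unique(scores[i]):          # tied methods share the mean rank
            tied = scores[i] == v
            r[tied] = r[tied].mean()
        ranks[i] = r
    return ranks.mean(axis=0)

# Toy example: three methods on two datasets.
avg = average_ranks(np.array([[0.90, 0.80, 0.70],
                              [0.85, 0.90, 0.60]]))
```

On the toy table the first two methods each win once, so they share an average rank of 1.5 while the third method averages 3.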


References

Akbani R, Kwek S, Japkowicz N (2004) Applying support vector machines to imbalanced datasets. In: Machine Learning: ECML 2004, 15th European Conference on Machine Learning, Pisa, Italy, September 20–24, 2004, Proceedings. Springer Berlin Heidelberg, Berlin, Heidelberg, pp 39–50. DOI 10.1007/978-3-540-30115-8_7

Ali A, Shamsuddin SM, Ralescu AL (2015) Classification with class imbalance problem: a review. International Journal of Advances in Soft Computing and its Applications 7(3):176–204

Alzahrani T, Horadam KJ (2016) Community detection in bipartite networks: algorithms and case studies. In: Complex Systems and Networks: Dynamics, Controls and Applications. Springer Berlin Heidelberg, Berlin, Heidelberg, pp 25–50. DOI 10.1007/978-3-662-47824-0_2

Bachner J (2013) Predictive policing: preventing crime with data and analytics. IBM Center for the Business of Government

Baesens B, Van Gestel T, Viaene S, Stepanova M, Suykens J, Vanthienen J (2003) Benchmarking state-of-the-art classification algorithms for credit scoring. Journal of the Operational Research Society 54(6):627–635. DOI 10.1057/palgrave.jors.2601545

Barandela R, Sánchez J, García V, Rangel E (2003) Strategies for learning in class imbalance problems. Pattern Recognition 36(3):849–851. DOI 10.1016/S0031-3203(02)00257-1

Barber MJ (2007) Modularity and community detection in bipartite networks. Physical Review E 76:066102. DOI 10.1103/PhysRevE.76.066102

Barua S, Islam MM, Yao X, Murase K (2014) MWMOTE: majority weighted minority oversampling technique for imbalanced data set learning. IEEE Transactions on Knowledge and Data Engineering 26(2):405–425. DOI 10.1109/TKDE.2012.232

Batista GEAPA, Prati RC, Monard MC (2004) A study of the behavior of several methods for balancing machine learning training data. SIGKDD Explor Newsl 6(1):20–29. DOI 10.1145/1007730.1007735

Beckett SJ (2016) Improved community detection in weighted bipartite networks. Royal Society Open Science 3(1). DOI 10.1098/rsos.140536

Bekkar M, Djemaa HK, Alitouche TA (2013) Evaluation measures for models assessment over imbalanced data sets. Journal of Information Engineering and Applications 3(10):27–38

Bhattacharyya S, Jha S, Tharakunnel K, Westland JC (2011) Data mining for credit card fraud: a comparative study. Decision Support Systems 50(3):602–613. DOI 10.1016/j.dss.2010.08.008

Blondel VD, Guillaume JL, Lambiotte R, Lefebvre E (2008) Fast unfolding of communities in large networks. Journal of Statistical Mechanics: Theory and Experiment 2008(10):P10008

Breiman L, Friedman J, Stone CJ, Olshen RA (1984) Classification and regression trees. Taylor & Francis

Brozovsky L, Petricek V (2007) Recommender system for online dating service. In: Proceedings of Znalosti 2007 Conference, VSB Ostrava

Cha M, Mislove A, Gummadi KP (2009) A measurement-driven analysis of information propagation in the Flickr social network. In: Proceedings of the 18th International Conference on World Wide Web. ACM, New York, NY, USA, WWW '09, pp 721–730. DOI 10.1145/1526709.1526806

Chawla NV (2005) Data mining for imbalanced datasets: an overview. In: Data Mining and Knowledge Discovery Handbook. Springer US, Boston, MA, pp 853–867

Chawla NV, Bowyer KW, Hall LO, Kegelmeyer WP (2002) SMOTE: synthetic minority over-sampling technique. Journal of Artificial Intelligence Research 16:321–357

Chawla NV, Lazarevic A, Hall LO, Bowyer KW (2003) SMOTEBoost: improving prediction of the minority class in boosting. In: Knowledge Discovery in Databases: PKDD 2003. Springer Berlin Heidelberg, Berlin, Heidelberg, pp 107–119

Chawla NV, Japkowicz N, Kotcz A (2004) Editorial: special issue on learning from imbalanced data sets. SIGKDD Explor Newsl 6(1):1–6. DOI 10.1145/1007730.1007733

Chen M, Mao S, Liu Y (2014) Big data: a survey. Mobile Networks and Applications 19(2):171–209. DOI 10.1007/s11036-013-0489-0

Chyi YM (2003) Classification analysis techniques for skewed class distribution problems. Master thesis, Department of Information Management, National Sun Yat-Sen University

Demšar J (2006) Statistical comparisons of classifiers over multiple data sets. Journal of Machine Learning Research 7(Jan):1–30

Dhillon IS (2001) Co-clustering documents and words using bipartite spectral graph partitioning. In: Proceedings of the Seventh ACM SIGKDD International Conference on Knowledge Discovery and Data Mining. ACM, New York, NY, USA, KDD '01, pp 269–274. DOI 10.1145/502512.502550


Drummond C, Holte RC (2003) C4.5, class imbalance, and cost sensitivity: why under-sampling beats over-sampling. In: Proceedings of the ICML '03 Workshop on Learning from Imbalanced Datasets

Fan RE, Chang KW, Hsieh CJ, Wang XR, Lin CJ (2008) LIBLINEAR: a library for large linear classification. Journal of Machine Learning Research 9:1871–1874

Fan W, Stolfo SJ, Zhang J, Chan PK (1999) AdaCost: misclassification cost-sensitive boosting. In: Proceedings of the Sixteenth International Conference on Machine Learning. Morgan Kaufmann Publishers Inc, San Francisco, CA, USA, ICML '99, pp 97–105

Fawcett T (2006) An introduction to ROC analysis. Pattern Recognition Letters 27(8):861–874. DOI 10.1016/j.patrec.2005.10.010

Finch H (2005) Comparison of distance measures in cluster analysis with dichotomous data. Journal of Data Science 3(1):85–100

Fortunato S (2010) Community detection in graphs. Physics Reports 486(3–5):75–174. DOI 10.1016/j.physrep.2009.11.002

Junqué de Fortuny E, Martens D, Provost F (2014a) Predictive modeling with big data: is bigger really better? Big Data 1(4):215–226. DOI 10.1089/big.2013.0037

Junqué de Fortuny E, Stankova M, Moeyersoms J, Minnaert B, Provost F, Martens D (2014b) Corporate residence fraud detection. In: Proceedings of the 20th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining. ACM, New York, NY, USA, KDD '14, pp 1650–1659. DOI 10.1145/2623330.2623333

Frasca M, Bertoni A, Re M, Valentini G (2013) A neural network algorithm for semi-supervised node label learning from unbalanced data. Neural Networks 43:84–98. DOI 10.1016/j.neunet.2013.01.021

Friedman M (1937) The use of ranks to avoid the assumption of normality implicit in the analysis of variance. Journal of the American Statistical Association 32(200):675–701

García E, Lozano F (2007) Boosting support vector machines. In: Machine Learning and Data Mining in Pattern Recognition, 5th International Conference, MLDM 2007, Leipzig, Germany, July 18–20, Post Proceedings. IBaI publishing, pp 153–167

Goldstein M, Uchida S (2016) A comparative evaluation of unsupervised anomaly detection algorithms for multivariate data. PLOS ONE 11(4):1–31. DOI 10.1371/journal.pone.0152173

González PC, Velásquez JD (2013) Characterization and detection of taxpayers with false invoices using data mining techniques. Expert Systems with Applications 40(5):1427–1436. DOI 10.1016/j.eswa.2012.08.051

Guimerà R, Sales-Pardo M, Amaral LAN (2007) Module identification in bipartite and directed networks. Physical Review E 76:036102. DOI 10.1103/PhysRevE.76.036102

Guo H, Viktor HL (2004) Learning from imbalanced data sets with boosting and data generation: the DataBoost-IM approach. SIGKDD Explor Newsl 6(1):30–39. DOI 10.1145/1007730.1007736

Guo X, Yin Y, Dong C, Yang G, Zhou G (2008) On the class imbalance problem. In: 2008 Fourth International Conference on Natural Computation. IEEE, vol 4, pp 192–201. DOI 10.1109/ICNC.2008.871

Han H, Wang WY, Mao BH (2005) Borderline-SMOTE: a new over-sampling method in imbalanced data sets learning. In: Advances in Intelligent Computing. Springer Berlin Heidelberg, Berlin, Heidelberg, pp 878–887

He H, Garcia EA (2009) Learning from imbalanced data. IEEE Transactions on Knowledge and Data Engineering 21(9):1263–1284. DOI 10.1109/TKDE.2008.239

He H, Bai Y, Garcia EA, Li S (2008) ADASYN: adaptive synthetic sampling approach for imbalanced learning. In: 2008 IEEE International Joint Conference on Neural Networks (IEEE World Congress on Computational Intelligence). IEEE, pp 1322–1328. DOI 10.1109/IJCNN.2008.4633969

Holm S (1979) A simple sequentially rejective multiple test procedure. Scandinavian Journal of Statistics 6(2):65–70

Hsu CW, Lin CJ (2002) A comparison of methods for multiclass support vector machines. IEEE Transactions on Neural Networks 13(2):415–425. DOI 10.1109/72.991427

Huang A (2008) Similarity measures for text document clustering. In: Proceedings of the Sixth New Zealand Computer Science Research Student Conference (NZCSRSC2008), Christchurch, New Zealand, pp 49–56

Iman RL, Davenport JM (1980) Approximations of the critical region of the Friedman statistic. Communications in Statistics-Theory and Methods 9(6):571–595

Jo T, Japkowicz N (2004) Class imbalances versus small disjuncts. ACM SIGKDD Explor Newsl 6(1):40–49. DOI 10.1145/1007730.1007737


Jutla IS, Jeub LG, Mucha PJ (2011–2016) A generalized Louvain method for community detection implemented in MATLAB. URL http://netwiki.amath.unc.edu/GenLouvain

Kubat M, Matwin S (1997) Addressing the curse of imbalanced training sets: one-sided selection. In: Proceedings of the Fourteenth International Conference on Machine Learning. Morgan Kaufmann Publishers Inc, San Francisco, CA, USA, pp 179–186

Lancichinetti A, Fortunato S (2009) Community detection algorithms: a comparative analysis. Physical Review E 80:056117. DOI 10.1103/PhysRevE.80.056117

Larremore DB, Clauset A, Jacobs AZ (2014) Efficiently inferring community structure in bipartite networks. Physical Review E: Statistical, Nonlinear, and Soft Matter Physics 90:012805. DOI 10.1103/PhysRevE.90.012805

Li J, Fine JP (2010) Weighted area under the receiver operating characteristic curve and its application to gene selection. Journal of the Royal Statistical Society: Series C (Applied Statistics) 59(4):673–692. DOI 10.1111/j.1467-9876.2010.00713.x

Li X, Wang L, Sung E (2008) AdaBoost with SVM-based component classifiers. Engineering Applications of Artificial Intelligence 21(5):785–795. DOI 10.1016/j.engappai.2007.07.001

Lichman M (2013) UCI machine learning repository. URL http://archive.ics.uci.edu/ml

Liu W, Chawla S, Cieslak DA, Chawla NV (2010) A robust decision tree algorithm for imbalanced data sets. In: Proceedings of the Tenth SIAM International Conference on Data Mining. SIAM, Philadelphia, vol 10, pp 766–777

Liu XY, Wu J, Zhou ZH (2009) Exploratory undersampling for class-imbalance learning. IEEE Transactions on Systems, Man, and Cybernetics, Part B (Cybernetics) 39(2):539–550. DOI 10.1109/TSMCB.2008.2007853

Luts J, Ojeda F, Van de Plas R, De Moor B, Van Huffel S, Suykens JA (2010) A tutorial on support vector machine-based methods for classification problems in chemometrics. Analytica Chimica Acta 665(2):129–145. DOI 10.1016/j.aca.2010.03.030

Macskassy SA, Provost F (2007) Classification in networked data: a toolkit and a univariate case study. Journal of Machine Learning Research 8(May):935–983

Martens D, Provost F (2014) Explaining data-driven document classifications. MIS Quarterly 38(1):73–100

Martens D, Provost F, Clark J, Junqué de Fortuny E (2016) Mining massive fine-grained behavior data to improve predictive analytics. MIS Quarterly 40(4):869–888

Mazurowski MA, Habas PA, Zurada JM, Lo JY, Baker JA, Tourassi GD (2008) Training neural network classifiers for medical decision making: the effects of imbalanced datasets on classification performance. Neural Networks 21(2–3):427–436. DOI 10.1016/j.neunet.2007.12.031

Mease D, Wyner AJ, Buja A (2007) Boosted classification trees and class probability/quantile estimation. Journal of Machine Learning Research 8:409–439

Nemenyi P (1963) Distribution-free multiple comparisons. Dissertation, Princeton University

Newman MEJ, Girvan M (2004) Finding and evaluating community structure in networks. Physical Review E 69:026113. DOI 10.1103/PhysRevE.69.026113

Ng AY (2004) Feature selection, L1 vs. L2 regularization, and rotational invariance. In: Proceedings of the Twenty-first International Conference on Machine Learning. ACM, New York, NY, USA, ICML '04, pp 78–. DOI 10.1145/1015330.1015435

Ng AY, Jordan MI (2002) On discriminative vs. generative classifiers: a comparison of logistic regression and naive Bayes. In: Advances in Neural Information Processing Systems 14. MIT Press, pp 841–848

Ngai E, Hu Y, Wong Y, Chen Y, Sun X (2011) The application of data mining techniques in financial fraud detection: a classification framework and an academic review of literature. Decision Support Systems 50(3):559–569. DOI 10.1016/j.dss.2010.08.006

Platt JC (1999) Probabilistic outputs for support vector machines and comparisons to regularized likelihood methods. In: Advances in Large-Margin Classifiers. MIT Press, pp 61–74

Porter MA, Onnela JP, Mucha PJ (2009) Communities in networks. Notices of the American Mathematical Society 56(9):1082–1097

Provost F, Fawcett T (2013) Data Science for Business: what you need to know about data mining and data-analytic thinking. O'Reilly Media, Inc

Provost F, Dalessandro B, Hook R, Zhang X, Murray A (2009) Audience selection for on-line brand advertising: privacy-friendly social network targeting. In: Proceedings of the 15th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining. ACM, New York, NY, USA, KDD '09, pp 707–716. DOI 10.1145/1557019.1557098

Raskutti B, Kowalczyk A (2004) Extreme re-balancing for SVMs: a case study. SIGKDD Explor Newsl 6(1):60–69. DOI 10.1145/1007730.1007739

Rosvall M, Bergstrom CT (2008) Maps of random walks on complex networks reveal community structure. Proceedings of the National Academy of Sciences 105(4):1118–1123. DOI 10.1073/pnas.0706851105

Schapire RE (1999) A brief introduction to boosting In Proceedings of the 16th International Joint Con-ference on Artificial Intelligence - Volume 2 Morgan Kaufmann Publishers Inc San Francisco CAUSA IJCAIrsquo99 pp 1401ndash1406

Schapire RE Singer Y (1999) Improved boosting algorithms using confidence-rated predictions Machinelearning 37(3)297ndash336 DOI 101023A1007614523901

Shmueli G (2017) Analyzing behavioral big data Methodological practical ethical and moral issuesQuality Engineering 29(1)57ndash74 DOI 1010800898211220161210979

Sobhani P Viktor H Matwin S (2015) Learning from imbalanced data using ensemble methods andcluster-based undersampling In New Frontiers in Mining Complex Patterns Third InternationalWorkshop NFMCP 2014 Held in Conjunction with ECML-PKDD 2014 Nancy France Septem-ber 19 2014 Revised Selected Papers Springer International Publishing Cham pp 69ndash83 DOI101007978-3-319-17876-9 5

Stankova M (2016) Classification within network data with a bipartite structure Dissertation Universityof Antwerp

Stankova M Martens D Provost F (2015) Classification over bipartite graphs through projection WorkingPapers 2015001 University of Antwerp Faculty of Applied Economics

Sun Y Kamel MS Wong AK Wang Y (2007) Cost-sensitive boosting for classification of imbalanceddata Pattern Recognition 40(12)3358 ndash 3378 DOI httpsdoiorg101016jpatcog200704009

Suykens JA Van Gestel T De Brabanter J De Moor B Vandewalle J Suykens J Van Gestel T (2002)Least squares support vector machines World Scientific Singapore

Tang B He H (2015) Enn Extended nearest neighbor method for pattern recognition [research frontier]IEEE Computational Intelligence Magazine 10(3)52ndash60 DOI 101109MCI20152437512

Tang Y Zhang YQ Chawla NV Krasser S (2009) SVMs modeling for highly imbalanced classificationIEEE Transactions on Systems Man and Cybernetics Part B (Cybernetics) 39(1)281ndash288 DOI101109TSMCB20082002909

Tobback E Moeyersoms J Stankova M Martens D (2016) Bankruptcy prediction for SMEs using rela-tional data Working Paper 2016004 University of Antwerp Faculty of Applied Economics

Verbeke W Dejaeger K Martens D Hur J Baesens B (2012) New insights into churn prediction in thetelecommunication sector A profit driven data mining approach European Journal of OperationalResearch 218(1)211 ndash 229 DOI httpsdoiorg101016jejor201109031

Veropoulos K Campbell I Cristianini N (1999) Controlling the sensitivity of support vector machinesIn Proceedings of the International Joint Conference on Artificial Intelligence Stockholm Sweden(IJCAI99) pp 55 ndash 60

Whitrow C Hand DJ Juszczak P Weston D Adams NM (2009) Transaction aggregation as a strategyfor credit card fraud detection Data Mining and Knowledge Discovery 18(1)30ndash55 DOI 101007s10618-008-0116-z

Wickramaratna J Holden SB Buxton BF (2001) Performance degradation in boosting In Proceedings ofthe Second International Workshop on Multiple Classifier Systems Springer London UK MCS rsquo01pp 11ndash21

Yen SJ Lee YS (2009) Cluster-based under-sampling approaches for imbalanced data distributions ExpertSystems with Applications 36(3 Part 1)5718 ndash 5727 DOI httpsdoiorg101016jeswa200806108

Yu HF Lo HY Hsieh HP Lou JK McKenzie TG Chou JW Chung PH Ho CH Chang CF Wei YH et al(2010) Feature engineering and classifier ensemble for kdd cup 2010 In Proceedings of the KDDCup 2010 Workshop pp 1ndash16

Zha H He X Ding C Simon H Gu M (2001) Bipartite graph partitioning and data clustering In Pro-ceedings of the Tenth International Conference on Information and Knowledge Management ACMNew York NY USA CIKM rsquo01 pp 25ndash32 DOI 101145502585502591

Zhang J Mani I (2003) Knn approach to unbalanced data distributions A case study involving informationextraction In Proceedings of the ICMLrsquo2003 Workshop on Learning from Imbalanced DatasetsWashington DC

Ziegler CN McNee SM Konstan JA Lausen G (2005) Improving recommendation lists through topicdiversification In Proceedings of the 14th International Conference on World Wide Web ACM NewYork NY USA WWW rsquo05 pp 22ndash32 DOI 10114510607451060754

  • Introduction
  • Preliminaries
  • Methods
  • Results and discussion
  • The effect of the chosen base learner
  • Conclusions and future research directions
  • Oversampling experiments
  • Undersampling experiments
  • Boosting experiments
  • Final Comparison
Page 50: This item is the archived peer-reviewed author-version of datasets Jellis Vanhoeyveld David Martens Received: date / Accepted: date Abstract Recent years have witnessed a growing number

Imbalanced classification in sparse and large behaviour datasets 49

[Figures 8-17: in each figure, panel (a) plots test AUC against the number of boosting iterations T (0-30) for AB, AC(R = 2), AC(R = 8), AC(RD), EE(S = 5), EE(S = 10), EE(S = 15) and BL; panel (b) plots test AUC against T for AB and EE with C ∈ {1e-07, 1e-05, 0.001, 0.1}, together with BL. Only the captions are reproduced here.]

Fig 8 Yahoo A(p = 1) dataset
Fig 9 Yahoo A(p = 25) dataset
Fig 10 Yahoo G(p = 1) dataset
Fig 11 Yahoo G(p = 25) dataset
Fig 12 TaFeng(p = 1) dataset
Fig 13 Book(p = 1) dataset
Fig 14 LST(p = 1) dataset
Fig 15 Adver(p = []) dataset
Fig 16 Adver(p = 1) dataset
Fig 17 CRF(p = []) dataset

D Final Comparison

[Figure 18: scatter plot of average rank AUC (x-axis, 0-14) versus average rank Time (y-axis, 0-18) for BL, OS, RS, SMOTE, ADASYN, RUS, Cl Knn, Far Knn, CBU, AB, AC(R=28), AC(R=RL), EE(S=10), EE(S=15) and EE par.]

Fig 18 Average rank AUC versus average rank Time (see Table 9) across the large datasets from Table 2 (CRF and Bank). Regarding the former, an extra row EE par is added, having the same AUC-value as EE(S = 15). Ranks for each dataset are subsequently obtained via the procedure outlined in Section 4.6.1. Points occurring in the upper-right region are preferred.
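The average-rank comparison underlying Fig 18 follows the usual Friedman-test setup (Demšar 2006): each method is ranked per dataset (rank 1 = best score), and these ranks are then averaged across datasets. A minimal numpy sketch of that ranking step, using made-up AUC values rather than the paper's Table 9 results:

```python
import numpy as np

# Hypothetical AUC scores: rows = datasets, columns = methods
# (illustrative values only, not the paper's results).
auc = np.array([
    [0.62, 0.65, 0.71],
    [0.58, 0.66, 0.69],
    [0.74, 0.70, 0.77],
])

# Per dataset, rank the methods so that rank 1 = highest AUC,
# then average the ranks over datasets (Friedman-style).
ranks = (-auc).argsort(axis=1).argsort(axis=1) + 1
avg_rank_auc = ranks.mean(axis=0)
```

Note that argsort-of-argsort assigns distinct ranks to tied scores in order of appearance; `scipy.stats.rankdata`, which assigns average ranks to ties, matches the standard Friedman procedure more closely.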


References

Akbani R Kwek S Japkowicz N (2004) Applying support vector machines to imbalanced datasets. In: Machine Learning: ECML 2004, 15th European Conference on Machine Learning, Pisa, Italy, September 20-24, 2004, Proceedings. Springer Berlin Heidelberg, pp 39-50. DOI 10.1007/978-3-540-30115-8_7

Ali A Shamsuddin SM Ralescu AL (2015) Classification with class imbalance problem: A review. International Journal of Advances in Soft Computing and its Applications 7(3):176-204

Alzahrani T Horadam KJ (2016) Community detection in bipartite networks: Algorithms and case studies. In: Complex Systems and Networks: Dynamics, Controls and Applications. Springer Berlin Heidelberg, pp 25-50. DOI 10.1007/978-3-662-47824-0_2

Bachner J (2013) Predictive policing: Preventing crime with data and analytics. IBM Center for the Business of Government

Baesens B Van Gestel T Viaene S Stepanova M Suykens J Vanthienen J (2003) Benchmarking state-of-the-art classification algorithms for credit scoring. Journal of the Operational Research Society 54(6):627-635. DOI 10.1057/palgrave.jors.2601545

Barandela R Sánchez J García V Rangel E (2003) Strategies for learning in class imbalance problems. Pattern Recognition 36(3):849-851. DOI 10.1016/S0031-3203(02)00257-1

Barber MJ (2007) Modularity and community detection in bipartite networks. Physical Review E 76:066102. DOI 10.1103/PhysRevE.76.066102

Barua S Islam MM Yao X Murase K (2014) MWMOTE - majority weighted minority oversampling technique for imbalanced data set learning. IEEE Transactions on Knowledge and Data Engineering 26(2):405-425. DOI 10.1109/TKDE.2012.232

Batista GEAPA Prati RC Monard MC (2004) A study of the behavior of several methods for balancing machine learning training data. SIGKDD Explor Newsl 6(1):20-29. DOI 10.1145/1007730.1007735

Beckett SJ (2016) Improved community detection in weighted bipartite networks. Royal Society Open Science 3(1). DOI 10.1098/rsos.140536

Bekkar M Djemaa HK Alitouche TA (2013) Evaluation measures for models assessment over imbalanced data sets. Journal of Information Engineering and Applications 3(10):27-38

Bhattacharyya S Jha S Tharakunnel K Westland JC (2011) Data mining for credit card fraud: A comparative study. Decision Support Systems 50(3):602-613. DOI 10.1016/j.dss.2010.08.008

Blondel VD Guillaume JL Lambiotte R Lefebvre E (2008) Fast unfolding of communities in large networks. Journal of Statistical Mechanics: Theory and Experiment 2008(10):P10008

Breiman L Friedman J Stone CJ Olshen RA (1984) Classification and regression trees. Taylor & Francis

Brozovsky L Petricek V (2007) Recommender system for online dating service. In: Proceedings of Znalosti 2007 Conference, VSB Ostrava

Cha M Mislove A Gummadi KP (2009) A measurement-driven analysis of information propagation in the Flickr social network. In: Proceedings of the 18th International Conference on World Wide Web, ACM, New York, NY, USA, WWW '09, pp 721-730. DOI 10.1145/1526709.1526806

Chawla NV (2005) Data mining for imbalanced datasets: An overview. In: Data Mining and Knowledge Discovery Handbook. Springer US, Boston, MA, pp 853-867

Chawla NV Bowyer KW Hall LO Kegelmeyer WP (2002) SMOTE: synthetic minority over-sampling technique. Journal of Artificial Intelligence Research 16:321-357

Chawla NV Lazarevic A Hall LO Bowyer KW (2003) SMOTEBoost: Improving prediction of the minority class in boosting. In: Knowledge Discovery in Databases: PKDD 2003. Springer Berlin Heidelberg, pp 107-119

Chawla NV Japkowicz N Kotcz A (2004) Editorial: Special issue on learning from imbalanced data sets. SIGKDD Explor Newsl 6(1):1-6. DOI 10.1145/1007730.1007733

Chen M Mao S Liu Y (2014) Big data: A survey. Mobile Networks and Applications 19(2):171-209. DOI 10.1007/s11036-013-0489-0

Chyi YM (2003) Classification analysis techniques for skewed class distribution problems. Master thesis, Department of Information Management, National Sun Yat-Sen University

Demšar J (2006) Statistical comparisons of classifiers over multiple data sets. Journal of Machine Learning Research 7(Jan):1-30

Dhillon IS (2001) Co-clustering documents and words using bipartite spectral graph partitioning. In: Proceedings of the Seventh ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, ACM, New York, NY, USA, KDD '01, pp 269-274. DOI 10.1145/502512.502550

Drummond C Holte RC (2003) C4.5, class imbalance, and cost sensitivity: why under-sampling beats over-sampling. In: Proceedings of the ICML '03 Workshop on Learning from Imbalanced Datasets

Fan RE Chang KW Hsieh CJ Wang XR Lin CJ (2008) LIBLINEAR: A library for large linear classification. Journal of Machine Learning Research 9:1871-1874

Fan W Stolfo SJ Zhang J Chan PK (1999) AdaCost: Misclassification cost-sensitive boosting. In: Proceedings of the Sixteenth International Conference on Machine Learning, Morgan Kaufmann Publishers Inc., San Francisco, CA, USA, ICML '99, pp 97-105

Fawcett T (2006) An introduction to ROC analysis. Pattern Recognition Letters 27(8):861-874. DOI 10.1016/j.patrec.2005.10.010

Finch H (2005) Comparison of distance measures in cluster analysis with dichotomous data. Journal of Data Science 3(1):85-100

Fortunato S (2010) Community detection in graphs. Physics Reports 486(3-5):75-174. DOI 10.1016/j.physrep.2009.11.002

Junque de Fortuny E Martens D Provost F (2014a) Predictive modeling with big data: is bigger really better? Big Data 1(4):215-226. DOI 10.1089/big.2013.0037

Junque de Fortuny E Stankova M Moeyersoms J Minnaert B Provost F Martens D (2014b) Corporate residence fraud detection. In: Proceedings of the 20th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, ACM, New York, NY, USA, KDD '14, pp 1650-1659. DOI 10.1145/2623330.2623333

Frasca M Bertoni A Re M Valentini G (2013) A neural network algorithm for semi-supervised node label learning from unbalanced data. Neural Networks 43:84-98. DOI 10.1016/j.neunet.2013.01.021

Friedman M (1937) The use of ranks to avoid the assumption of normality implicit in the analysis of variance. Journal of the American Statistical Association 32(200):675-701

García E Lozano F (2007) Boosting support vector machines. In: Machine Learning and Data Mining in Pattern Recognition, 5th International Conference, MLDM 2007, Leipzig, Germany, July 18-20, Post Proceedings. IBaI publishing, pp 153-167

Goldstein M Uchida S (2016) A comparative evaluation of unsupervised anomaly detection algorithms for multivariate data. PLOS ONE 11(4):1-31. DOI 10.1371/journal.pone.0152173

González PC Velásquez JD (2013) Characterization and detection of taxpayers with false invoices using data mining techniques. Expert Systems with Applications 40(5):1427-1436. DOI 10.1016/j.eswa.2012.08.051

Guimerà R Sales-Pardo M Amaral LAN (2007) Module identification in bipartite and directed networks. Physical Review E 76:036102. DOI 10.1103/PhysRevE.76.036102

Guo H Viktor HL (2004) Learning from imbalanced data sets with boosting and data generation: The DataBoost-IM approach. SIGKDD Explor Newsl 6(1):30-39. DOI 10.1145/1007730.1007736

Guo X Yin Y Dong C Yang G Zhou G (2008) On the class imbalance problem. In: 2008 Fourth International Conference on Natural Computation, IEEE, vol 4, pp 192-201. DOI 10.1109/ICNC.2008.871

Han H Wang WY Mao BH (2005) Borderline-SMOTE: A new over-sampling method in imbalanced data sets learning. In: Advances in Intelligent Computing. Springer Berlin Heidelberg, pp 878-887

He H Garcia EA (2009) Learning from imbalanced data. IEEE Transactions on Knowledge and Data Engineering 21(9):1263-1284. DOI 10.1109/TKDE.2008.239

He H Bai Y Garcia EA Li S (2008) ADASYN: Adaptive synthetic sampling approach for imbalanced learning. In: 2008 IEEE International Joint Conference on Neural Networks (IEEE World Congress on Computational Intelligence), IEEE, pp 1322-1328. DOI 10.1109/IJCNN.2008.4633969

Holm S (1979) A simple sequentially rejective multiple test procedure. Scandinavian Journal of Statistics 6(2):65-70

Hsu CW Lin CJ (2002) A comparison of methods for multiclass support vector machines. IEEE Transactions on Neural Networks 13(2):415-425. DOI 10.1109/72.991427

Huang A (2008) Similarity measures for text document clustering. In: Proceedings of the Sixth New Zealand Computer Science Research Student Conference (NZCSRSC2008), Christchurch, New Zealand, pp 49-56

Iman RL Davenport JM (1980) Approximations of the critical region of the Friedman statistic. Communications in Statistics - Theory and Methods 9(6):571-595

Jo T Japkowicz N (2004) Class imbalances versus small disjuncts. ACM SIGKDD Explor Newsl 6(1):40-49. DOI 10.1145/1007730.1007737

Jutla IS Jeub LG Mucha PJ (2011-2016) A generalized Louvain method for community detection implemented in MATLAB. URL http://netwiki.amath.unc.edu/GenLouvain

Kubat M Matwin S (1997) Addressing the curse of imbalanced training sets: One-sided selection. In: Proceedings of the Fourteenth International Conference on Machine Learning, Morgan Kaufmann Publishers Inc., San Francisco, CA, USA, pp 179-186

Lancichinetti A Fortunato S (2009) Community detection algorithms: A comparative analysis. Physical Review E 80:056117. DOI 10.1103/PhysRevE.80.056117

Larremore DB Clauset A Jacobs AZ (2014) Efficiently inferring community structure in bipartite networks. Physical Review E: Statistical, Nonlinear, and Soft Matter Physics 90:012805. DOI 10.1103/PhysRevE.90.012805

Li J Fine JP (2010) Weighted area under the receiver operating characteristic curve and its application to gene selection. Journal of the Royal Statistical Society: Series C (Applied Statistics) 59(4):673-692. DOI 10.1111/j.1467-9876.2010.00713.x

Li X Wang L Sung E (2008) AdaBoost with SVM-based component classifiers. Engineering Applications of Artificial Intelligence 21(5):785-795. DOI 10.1016/j.engappai.2007.07.001

Lichman M (2013) UCI machine learning repository. URL http://archive.ics.uci.edu/ml

Liu W Chawla S Cieslak DA Chawla NV (2010) A robust decision tree algorithm for imbalanced data sets. In: Proceedings of the Tenth SIAM International Conference on Data Mining, SIAM, Philadelphia, vol 10, pp 766-777

Liu XY Wu J Zhou ZH (2009) Exploratory undersampling for class-imbalance learning. IEEE Transactions on Systems, Man, and Cybernetics, Part B (Cybernetics) 39(2):539-550. DOI 10.1109/TSMCB.2008.2007853

Luts J Ojeda F Van de Plas R De Moor B Van Huffel S Suykens JA (2010) A tutorial on support vector machine-based methods for classification problems in chemometrics. Analytica Chimica Acta 665(2):129-145. DOI 10.1016/j.aca.2010.03.030

Macskassy SA Provost F (2007) Classification in networked data: A toolkit and a univariate case study. Journal of Machine Learning Research 8(May):935-983

Martens D Provost F (2014) Explaining data-driven document classifications. MIS Quarterly 38(1):73-100

Martens D Provost F Clark J Junque de Fortuny E (2016) Mining massive fine-grained behavior data to improve predictive analytics. MIS Quarterly 40(4):869-888

Mazurowski MA Habas PA Zurada JM Lo JY Baker JA Tourassi GD (2008) Training neural network classifiers for medical decision making: The effects of imbalanced datasets on classification performance. Neural Networks 21(2-3):427-436. DOI 10.1016/j.neunet.2007.12.031

Mease D Wyner AJ Buja A (2007) Boosted classification trees and class probability/quantile estimation. Journal of Machine Learning Research 8:409-439

Nemenyi P (1963) Distribution-free Multiple Comparisons. Dissertation, Princeton University

Newman MEJ Girvan M (2004) Finding and evaluating community structure in networks. Physical Review E 69:026113. DOI 10.1103/PhysRevE.69.026113

Ng AY (2004) Feature selection, L1 vs. L2 regularization, and rotational invariance. In: Proceedings of the Twenty-first International Conference on Machine Learning, ACM, New York, NY, USA, ICML '04, pp 78-. DOI 10.1145/1015330.1015435

Ng AY Jordan MI (2002) On discriminative vs. generative classifiers: A comparison of logistic regression and naive Bayes. In: Advances in Neural Information Processing Systems 14, MIT Press, pp 841-848

Ngai E Hu Y Wong Y Chen Y Sun X (2011) The application of data mining techniques in financial fraud detection: A classification framework and an academic review of literature. Decision Support Systems 50(3):559-569. DOI 10.1016/j.dss.2010.08.006

Platt JC (1999) Probabilistic outputs for support vector machines and comparisons to regularized likelihood methods. In: Advances in Large-Margin Classifiers, MIT Press, pp 61-74

Porter MA Onnela JP Mucha PJ (2009) Communities in networks. Notices of the American Mathematical Society 56(9):1082-1097

Provost F Fawcett T (2013) Data Science for Business: What you need to know about data mining and data-analytic thinking. O'Reilly Media, Inc.

Provost F Dalessandro B Hook R Zhang X Murray A (2009) Audience selection for on-line brand advertising: Privacy-friendly social network targeting. In: Proceedings of the 15th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, ACM, New York, NY, USA, KDD '09, pp 707-716. DOI 10.1145/1557019.1557098

Raskutti B Kowalczyk A (2004) Extreme re-balancing for SVMs: A case study. SIGKDD Explor Newsl 6(1):60-69. DOI 10.1145/1007730.1007739

Rosvall M Bergstrom CT (2008) Maps of random walks on complex networks reveal community structure. Proceedings of the National Academy of Sciences 105(4):1118-1123. DOI 10.1073/pnas.0706851105

Schapire RE (1999) A brief introduction to boosting. In: Proceedings of the 16th International Joint Conference on Artificial Intelligence - Volume 2, Morgan Kaufmann Publishers Inc., San Francisco, CA, USA, IJCAI '99, pp 1401-1406

Schapire RE Singer Y (1999) Improved boosting algorithms using confidence-rated predictions. Machine Learning 37(3):297-336. DOI 10.1023/A:1007614523901

Shmueli G (2017) Analyzing behavioral big data: Methodological, practical, ethical, and moral issues. Quality Engineering 29(1):57-74. DOI 10.1080/08982112.2016.1210979

Sobhani P Viktor H Matwin S (2015) Learning from imbalanced data using ensemble methods and cluster-based undersampling. In: New Frontiers in Mining Complex Patterns: Third International Workshop, NFMCP 2014, Held in Conjunction with ECML-PKDD 2014, Nancy, France, September 19, 2014, Revised Selected Papers. Springer International Publishing, Cham, pp 69-83. DOI 10.1007/978-3-319-17876-9_5

Stankova M (2016) Classification within network data with a bipartite structure. Dissertation, University of Antwerp

Stankova M Martens D Provost F (2015) Classification over bipartite graphs through projection. Working Papers 2015001, University of Antwerp, Faculty of Applied Economics

Sun Y Kamel MS Wong AK Wang Y (2007) Cost-sensitive boosting for classification of imbalanced data. Pattern Recognition 40(12):3358-3378. DOI 10.1016/j.patcog.2007.04.009

Suykens JA Van Gestel T De Brabanter J De Moor B Vandewalle J (2002) Least Squares Support Vector Machines. World Scientific, Singapore

Tang B He H (2015) ENN: Extended nearest neighbor method for pattern recognition [research frontier]. IEEE Computational Intelligence Magazine 10(3):52-60. DOI 10.1109/MCI.2015.2437512

Tang Y Zhang YQ Chawla NV Krasser S (2009) SVMs modeling for highly imbalanced classification. IEEE Transactions on Systems, Man, and Cybernetics, Part B (Cybernetics) 39(1):281-288. DOI 10.1109/TSMCB.2008.2002909

Tobback E Moeyersoms J Stankova M Martens D (2016) Bankruptcy prediction for SMEs using relational data. Working Paper 2016004, University of Antwerp, Faculty of Applied Economics

Verbeke W Dejaeger K Martens D Hur J Baesens B (2012) New insights into churn prediction in the telecommunication sector: A profit driven data mining approach. European Journal of Operational Research 218(1):211-229. DOI 10.1016/j.ejor.2011.09.031

Veropoulos K Campbell I Cristianini N (1999) Controlling the sensitivity of support vector machines. In: Proceedings of the International Joint Conference on Artificial Intelligence, Stockholm, Sweden (IJCAI99), pp 55-60

Whitrow C Hand DJ Juszczak P Weston D Adams NM (2009) Transaction aggregation as a strategy for credit card fraud detection. Data Mining and Knowledge Discovery 18(1):30-55. DOI 10.1007/s10618-008-0116-z

Wickramaratna J Holden SB Buxton BF (2001) Performance degradation in boosting. In: Proceedings of the Second International Workshop on Multiple Classifier Systems, Springer, London, UK, MCS '01, pp 11-21

Yen SJ Lee YS (2009) Cluster-based under-sampling approaches for imbalanced data distributions. Expert Systems with Applications 36(3, Part 1):5718-5727. DOI 10.1016/j.eswa.2008.06.108

Yu HF Lo HY Hsieh HP Lou JK McKenzie TG Chou JW Chung PH Ho CH Chang CF Wei YH et al (2010) Feature engineering and classifier ensemble for KDD Cup 2010. In: Proceedings of the KDD Cup 2010 Workshop, pp 1-16

Zha H He X Ding C Simon H Gu M (2001) Bipartite graph partitioning and data clustering. In: Proceedings of the Tenth International Conference on Information and Knowledge Management, ACM, New York, NY, USA, CIKM '01, pp 25-32. DOI 10.1145/502585.502591

Zhang J Mani I (2003) kNN approach to unbalanced data distributions: A case study involving information extraction. In: Proceedings of the ICML 2003 Workshop on Learning from Imbalanced Datasets, Washington DC

Ziegler CN McNee SM Konstan JA Lausen G (2005) Improving recommendation lists through topic diversification. In: Proceedings of the 14th International Conference on World Wide Web, ACM, New York, NY, USA, WWW '05, pp 22-32. DOI 10.1145/1060745.1060754


Blondel VD Guillaume JL Lambiotte R Lefebvre E (2008) Fast unfolding of communities in large net-works Journal of Statistical Mechanics Theory and Experiment 2008(10)P10008

Breiman L Friedman J Stone CJ Olshen RA (1984) Classification and regression trees Taylor amp FrancisBrozovsky L Petricek V (2007) Recommender system for online dating service In Proceedings of

Znalosti 2007 Conference VSB OstravaCha M Mislove A Gummadi KP (2009) A measurement-driven analysis of information propagation in

the Flickr social network In Proceedings of the 18th International Conference on World Wide WebACM New York NY USA WWW rsquo09 pp 721ndash730 DOI 10114515267091526806

Chawla NV (2005) Data mining for imbalanced datasets An overview In Data mining and knowledgediscovery handbook Springer US Boston MA pp 853ndash867

Chawla NV Bowyer KW Hall LO Kegelmeyer WP (2002) SMOTE synthetic minority over-samplingtechnique Journal of artificial intelligence research 16321ndash357

Chawla NV Lazarevic A Hall LO Bowyer KW (2003) Smoteboost Improving prediction of the minorityclass in boosting In Knowledge Discovery in Databases PKDD 2003 Springer Berlin HeidelbergBerlin Heidelberg pp 107ndash119

Chawla NV Japkowicz N Kotcz A (2004) Editorial Special issue on learning from imbalanced data setsSIGKDD Explor Newsl 6(1)1ndash6 DOI 10114510077301007733

Chen M Mao S Liu Y (2014) Big data A survey Mobile Networks and Applications 19(2)171ndash209DOI 101007s11036-013-0489-0

Chyi YM (2003) Classification analysis techniques for skewed class distribution problems Master thesisDepartment of Information management National Sun Yat-Sen University

Demsar J (2006) Statistical comparisons of classifiers over multiple data sets Journal of Machine LearningResearch 7(Jan)1ndash30

Dhillon IS (2001) Co-clustering documents and words using bipartite spectral graph partitioning In Pro-ceedings of the Seventh ACM SIGKDD International Conference on Knowledge Discovery and DataMining ACM New York NY USA KDD rsquo01 pp 269ndash274 DOI 101145502512502550

54 Jellis Vanhoeyveld David Martens

Drummond C Holte RC (2003) C45 class imbalance and cost sensitivity why under-sampling beatsover-sampling In Proceedings of the ICML rsquo03 Workshop on Learning from Imbalanced Datasets

Fan RE Chang KW Hsieh CJ Wang XR Lin CJ (2008) LIBLINEAR A library for large linear classifi-cation Journal of Machine Learning Research 91871ndash1874

Fan W Stolfo SJ Zhang J Chan PK (1999) AdaCost Misclassification cost-sensitive boosting In Pro-ceedings of the Sixteenth International Conference on Machine Learning Morgan Kaufmann Pub-lishers Inc San Francisco CA USA ICML rsquo99 pp 97ndash105

Fawcett T (2006) An introduction to ROC analysis Pattern Recognition Letters 27(8)861 ndash 874 DOIhttpsdoiorg101016jpatrec200510010

Finch H (2005) Comparison of distance measures in cluster analysis with dichotomous data Journal ofData Science 3(1)85ndash100

Fortunato S (2010) Community detection in graphs Physics Reports 486(3 5)75 ndash 174 DOI httpsdoiorg101016jphysrep200911002

Junque de Fortuny E Martens D Provost F (2014a) Predictive modeling with big data is bigger reallybetter Big Data 1(4)215ndash226 DOI 101089big20130037

Junque de Fortuny E Stankova M Moeyersoms J Minnaert B Provost F Martens D (2014b) Corporateresidence fraud detection In Proceedings of the 20th ACM SIGKDD International Conference onKnowledge Discovery and Data Mining ACM New York NY USA KDD rsquo14 pp 1650ndash1659 DOI10114526233302623333

Frasca M Bertoni A Re M Valentini G (2013) A neural network algorithm for semi-supervised node labellearning from unbalanced data Neural Networks 4384 ndash 98 DOI httpsdoiorg101016jneunet201301021

Friedman M (1937) The use of ranks to avoid the assumption of normality implicit in the analysis ofvariance Journal of the american statistical association 32(200)675ndash701

Garcıa E Lozano F (2007) Boosting support vector machines In Machine Learning and Data Mining inPattern Recognition 5th International Conference MLDM 2007 Leipzig Germany July 18-20 PostProceedings IBaI publishing pp 153ndash167

Goldstein M Uchida S (2016) A comparative evaluation of unsupervised anomaly detection algorithmsfor multivariate data PLOS ONE 11(4)1ndash31 DOI 101371journalpone0152173

Gonzlez PC Velsquez JD (2013) Characterization and detection of taxpayers with false invoices usingdata mining techniques Expert Systems with Applications 40(5)1427 ndash 1436 DOI httpsdoiorg101016jeswa201208051

Guimera R Sales-Pardo M Amaral LAN (2007) Module identification in bipartite and directed networksPhysical Review E 76036102 DOI 101103PhysRevE76036102

Guo H Viktor HL (2004) Learning from imbalanced data sets with boosting and data generation TheDataBoost-IM approach SIGKDD Explor Newsl 6(1)30ndash39 DOI 10114510077301007736

Guo X Yin Y Dong C Yang G Zhou G (2008) On the class imbalance problem In 2008 Fourth Interna-tional Conference on Natural Computation IEEE vol 4 pp 192ndash201 DOI 101109ICNC2008871

Han H Wang WY Mao BH (2005) Borderline-SMOTE A new over-sampling method in imbalanced datasets learning In Advances in Intelligent Computing Springer Berlin Heidelberg Berlin Heidelbergpp 878ndash887

He H Garcia EA (2009) Learning from imbalanced data IEEE Transactions on Knowledge and DataEngineering 21(9)1263ndash1284 DOI 101109TKDE2008239

He H Bai Y Garcia EA Li S (2008) ADASYN Adaptive synthetic sampling approach for imbalancedlearning In 2008 IEEE International Joint Conference on Neural Networks (IEEE World Congresson Computational Intelligence) IEEE pp 1322ndash1328 DOI 101109IJCNN20084633969

Holm S (1979) A simple sequentially rejective multiple test procedure Scandinavian journal of statistics6(2)65ndash70

Hsu CW Lin CJ (2002) A comparison of methods for multiclass support vector machines IEEE Transac-tions on Neural Networks 13(2)415ndash425 DOI 10110972991427

Huang A (2008) Similarity measures for text document clustering In Proceedings of the sixth new zealandcomputer science research student conference (NZCSRSC2008) Christchurch New Zealand pp 49ndash56

Iman RL Davenport JM (1980) Approximations of the critical region of the Friedman statistic Commu-nications in Statistics-Theory and Methods 9(6)571ndash595

Jo T Japkowicz N (2004) Class imbalances versus small disjuncts ACM SIGKDD Explor Newsl 6(1)40ndash49 DOI 10114510077301007737

Imbalanced classification in sparse and large behaviour datasets 55

Jutla IS Jeub LG Mucha PJ (2011-2016) A generalized louvain method for community detection imple-mented in MATLAB URL httpnetwikiamathunceduGenLouvain

Kubat M Matwin S (1997) Addressing the curse of imbalanced training sets One-sided selection InProceedings of the Fourteenth International Conference on Machine Learning Morgan KaufmannPublishers Inc San Francisco CA USA pp 179ndash186

Lancichinetti A Fortunato S (2009) Community detection algorithms A comparative analysis PhysicalReview E 80056117 DOI 101103PhysRevE80056117

Larremore DB Clauset A Jacobs AZ (2014) Efficiently inferring community structure in bipartitenetworks Physical Review E Statistical Nonlinear and Soft Matter Physics 90012805 DOI101103PhysRevE90012805

Li J Fine JP (2010) Weighted area under the receiver operating characteristic curve and its application togene selection Journal of the Royal Statistical Society Series C (Applied Statistics) 59(4)673ndash692DOI 101111j1467-9876201000713x

Li X Wang L Sung E (2008) AdaBoost with SVM-based component classifiers Engineering Applicationsof Artificial Intelligence 21(5)785ndash795 DOI httpsdoiorg101016jengappai200707001

Lichman M (2013) UCI machine learning repository URL httparchiveicsuciedumlLiu W Chawla S Cieslak DA Chawla NV (2010) A robust decision tree algorithm for imbalanced data

sets In Proceedings of the tenth SIAM international conference on data mining SIAM Philadelphiavol 10 pp 766ndash777

Liu XY Wu J Zhou ZH (2009) Exploratory undersampling for class-imbalance learning IEEE Transac-tions on Systems Man and Cybernetics Part B (Cybernetics) 39(2)539ndash550 DOI 101109TSMCB20082007853

Luts J Ojeda F Van de Plas R De Moor B Van Huffel S Suykens JA (2010) A tutorial on supportvector machine-based methods for classification problems in chemometrics Analytica Chimica Acta665(2)129ndash145 DOI httpsdoiorg101016jaca201003030

Macskassy SA Provost F (2007) Classification in networked data A toolkit and a univariate case studyJournal of Machine Learning Research 8(May)935ndash983

Martens D Provost F (2014) Explaining data-driven document classifications MIS Quarterly 38(1)73ndash100

Martens D Provost F Clark J Junque de Fortuny E (2016) Mining massive fine-grained behavior data toimprove predictive analytics MIS Quarterly 40(4)869ndash888

Mazurowski MA Habas PA Zurada JM Lo JY Baker JA Tourassi GD (2008) Training neural networkclassifiers for medical decision making The effects of imbalanced datasets on classification perfor-mance Neural Networks 21(23)427 ndash 436 DOI httpsdoiorg101016jneunet200712031

Mease D Wyner AJ Buja A (2007) Boosted classification trees and class probabilityquantile estimationJournal of Machine Learning Research 8409ndash439

Nemenyi P (1963) Distribution-free Multiple Comparisons Dissertation Princeton UniversityNewman MEJ Girvan M (2004) Finding and evaluating community structure in networks Physical Re-

view E 69026113 DOI 101103PhysRevE69026113Ng AY (2004) Feature selection L1 vs L2 regularization and rotational invariance In Proceedings of the

Twenty-first International Conference on Machine Learning ACM New York NY USA ICML rsquo04pp 78ndash DOI 10114510153301015435

Ng AY Jordan MI (2002) On discriminative vs generative classifiers A comparison of logistic regressionand naive bayes In Advances in Neural Information Processing Systems 14 MIT Press pp 841ndash848

Ngai E Hu Y Wong Y Chen Y Sun X (2011) The application of data mining techniques in financial frauddetection A classification framework and an academic review of literature Decision Support Systems50(3)559 ndash 569 DOI httpsdoiorg101016jdss201008006

Platt JC (1999) Probabilistic outputs for support vector machines and comparisons to regularized likeli-hood methods In Advances in Large-Margin Classifiers MIT Press pp 61ndash74

Porter MA Onnela JP Mucha PJ (2009) Communities in networks Notices of the American MathematicalSociety 56(9)1082ndash1097

Provost F Fawcett T (2013) Data Science for Business What you need to know about data mining anddata-analytic thinking OrsquoReilly Media Inc

Provost F Dalessandro B Hook R Zhang X Murray A (2009) Audience selection for on-line brandadvertising Privacy-friendly social network targeting In Proceedings of the 15th ACM SIGKDDInternational Conference on Knowledge Discovery and Data Mining ACM New York NY USAKDD rsquo09 pp 707ndash716 DOI 10114515570191557098

56 Jellis Vanhoeyveld David Martens

Raskutti B Kowalczyk A (2004) Extreme re-balancing for SVMs A case study SIGKDD Explor Newsl6(1)60ndash69 DOI 10114510077301007739

Rosvall M Bergstrom CT (2008) Maps of random walks on complex networks reveal community structureProceedings of the National Academy of Sciences 105(4)1118ndash1123 DOI 101073pnas0706851105

Schapire RE (1999) A brief introduction to boosting In Proceedings of the 16th International Joint Con-ference on Artificial Intelligence - Volume 2 Morgan Kaufmann Publishers Inc San Francisco CAUSA IJCAIrsquo99 pp 1401ndash1406

Schapire RE Singer Y (1999) Improved boosting algorithms using confidence-rated predictions Machinelearning 37(3)297ndash336 DOI 101023A1007614523901

Shmueli G (2017) Analyzing behavioral big data Methodological practical ethical and moral issuesQuality Engineering 29(1)57ndash74 DOI 1010800898211220161210979

Sobhani P Viktor H Matwin S (2015) Learning from imbalanced data using ensemble methods andcluster-based undersampling In New Frontiers in Mining Complex Patterns Third InternationalWorkshop NFMCP 2014 Held in Conjunction with ECML-PKDD 2014 Nancy France Septem-ber 19 2014 Revised Selected Papers Springer International Publishing Cham pp 69ndash83 DOI101007978-3-319-17876-9 5

Stankova M (2016) Classification within network data with a bipartite structure Dissertation Universityof Antwerp

Stankova M Martens D Provost F (2015) Classification over bipartite graphs through projection WorkingPapers 2015001 University of Antwerp Faculty of Applied Economics

Sun Y Kamel MS Wong AK Wang Y (2007) Cost-sensitive boosting for classification of imbalanceddata Pattern Recognition 40(12)3358 ndash 3378 DOI httpsdoiorg101016jpatcog200704009

Suykens JA Van Gestel T De Brabanter J De Moor B Vandewalle J Suykens J Van Gestel T (2002)Least squares support vector machines World Scientific Singapore

Tang B He H (2015) Enn Extended nearest neighbor method for pattern recognition [research frontier]IEEE Computational Intelligence Magazine 10(3)52ndash60 DOI 101109MCI20152437512

Tang Y Zhang YQ Chawla NV Krasser S (2009) SVMs modeling for highly imbalanced classificationIEEE Transactions on Systems Man and Cybernetics Part B (Cybernetics) 39(1)281ndash288 DOI101109TSMCB20082002909

Tobback E Moeyersoms J Stankova M Martens D (2016) Bankruptcy prediction for SMEs using rela-tional data Working Paper 2016004 University of Antwerp Faculty of Applied Economics

Verbeke W Dejaeger K Martens D Hur J Baesens B (2012) New insights into churn prediction in thetelecommunication sector A profit driven data mining approach European Journal of OperationalResearch 218(1)211 ndash 229 DOI httpsdoiorg101016jejor201109031

Veropoulos K Campbell I Cristianini N (1999) Controlling the sensitivity of support vector machinesIn Proceedings of the International Joint Conference on Artificial Intelligence Stockholm Sweden(IJCAI99) pp 55 ndash 60

Whitrow C Hand DJ Juszczak P Weston D Adams NM (2009) Transaction aggregation as a strategyfor credit card fraud detection Data Mining and Knowledge Discovery 18(1)30ndash55 DOI 101007s10618-008-0116-z

Wickramaratna J Holden SB Buxton BF (2001) Performance degradation in boosting In Proceedings ofthe Second International Workshop on Multiple Classifier Systems Springer London UK MCS rsquo01pp 11ndash21

Yen SJ Lee YS (2009) Cluster-based under-sampling approaches for imbalanced data distributions ExpertSystems with Applications 36(3 Part 1)5718 ndash 5727 DOI httpsdoiorg101016jeswa200806108

Yu HF Lo HY Hsieh HP Lou JK McKenzie TG Chou JW Chung PH Ho CH Chang CF Wei YH et al(2010) Feature engineering and classifier ensemble for kdd cup 2010 In Proceedings of the KDDCup 2010 Workshop pp 1ndash16

Zha H He X Ding C Simon H Gu M (2001) Bipartite graph partitioning and data clustering In Pro-ceedings of the Tenth International Conference on Information and Knowledge Management ACMNew York NY USA CIKM rsquo01 pp 25ndash32 DOI 101145502585502591

Zhang J Mani I (2003) Knn approach to unbalanced data distributions A case study involving informationextraction In Proceedings of the ICMLrsquo2003 Workshop on Learning from Imbalanced DatasetsWashington DC

Ziegler CN McNee SM Konstan JA Lausen G (2005) Improving recommendation lists through topicdiversification In Proceedings of the 14th International Conference on World Wide Web ACM NewYork NY USA WWW rsquo05 pp 22ndash32 DOI 10114510607451060754

  • Introduction
  • Preliminaries
  • Methods
  • Results and discussion
  • The effect of the chosen base learner
  • Conclusions and future research directions
  • Oversampling experiments
  • Undersampling experiments
  • Boosting experiments
  • Final Comparison
Page 52: This item is the archived peer-reviewed author-version of datasets Jellis Vanhoeyveld David Martens Received: date / Accepted: date Abstract Recent years have witnessed a growing number

Imbalanced classification in sparse and large behaviour datasets 51

T0 5 10 15 20 25 30

AU

Cte

st

9989

999

9991

9992

9993

9994

9995

9996

9997

9998

9999

ABAC

R2AC

R8

ACRD

EES5

EES10

EES15

BL

(a)

T0 5 10 15 20 25 30

AU

Cte

st

997

9975

998

9985

999

9995

100

AB C = 1e-07AB C = 1e-05AB C = 0001AB C = 01EE C = 1e-07EE C = 1e-05EE C = 0001EE C = 01BL

(b)

Fig 14 LST(p = 1) dataset

T0 5 10 15 20 25 30

AU

Cte

st

955

96

965

97

975

98

ABAC

R2AC

R8

ACRD

EES5

EES10

EES15

BL

(a)

T0 5 10 15 20 25 30

AU

Cte

st

75

80

85

90

95

100

AB C = 1e-07AB C = 1e-05AB C = 0001AB C = 01EE C = 1e-07EE C = 1e-05EE C = 0001EE C = 01BL

(b)

Fig 15 Adver(p = []) dataset

T0 5 10 15 20 25 30

AU

Cte

st

80

82

84

86

88

90

92

94

ABAC

R2AC

R8

ACRD

EES5

EES10

EES15

BL

(a)

T0 5 10 15 20 25 30

AU

Cte

st

55

60

65

70

75

80

85

90

95

AB C = 1e-07AB C = 1e-05AB C = 0001AB C = 01EE C = 1e-07EE C = 1e-05EE C = 0001EE C = 01BL

(b)

Fig 16 Adver(p = 1) dataset

52 Jellis Vanhoeyveld David Martens

T0 5 10 15 20 25 30

AU

Cte

st

45

50

55

60

65

70

75

80

85

90

ABAC

R2AC

R8

ACRD

EES5

EES10

EES15

BL

(a)

T0 5 10 15 20 25 30

AU

Cte

st

45

50

55

60

65

70

75

80

85

90

AB C = 1e-07AB C = 1e-05AB C = 0001AB C = 01EE C = 1e-07EE C = 1e-05EE C = 0001EE C = 01BL

(b)

Fig 17 CRF(p = []) dataset

D Final Comparison

Average Rank AUC02468101214

Ave

rage

Ran

k T

ime

0

2

4

6

8

10

12

14

16

18

BLOSRSMOTEADASYNRUS

Cl KnnFar KnnCBUABAC(R=28)

AC(R=RL)

EE(S=10)EE(S=15)EE

par

Fig 18 Average rank AUC versus average rank Time (see Table 9) across the large datasets from Table 2(CRF and Bank) Regarding the former an extra row EE par is added having the same AUC-value asEE(S = 15) Ranks for each dataset are subsequently obtained via the procedure outlined in Section 461Points occurring in the upper-right region are preferred

Imbalanced classification in sparse and large behaviour datasets 53

References

Akbani R Kwek S Japkowicz N (2004) Applying support vector machines to imbalanced datasetsIn Machine Learning ECML 2004 15th European Conference on Machine Learning Pisa ItalySeptember 20-24 2004 Proceedings Springer Berlin Heidelberg Berlin Heidelberg pp 39ndash50 DOI101007978-3-540-30115-8 7

Ali A Shamsuddin SM Ralescu AL (2015) Classification with class imbalance problem A review Inter-national Journal of Advances in Soft Computing and its Applications 7(3)176ndash204

Alzahrani T Horadam KJ (2016) Community detection in bipartite networks Algorithms and case studiesIn Complex Systems and Networks Dynamics Controls and Applications Springer Berlin Heidel-berg Berlin Heidelberg pp 25ndash50 DOI 101007978-3-662-47824-0 2

Bachner J (2013) Predictive policing Preventing crime with data and analytics IBM Center for the Busi-ness of Government

Baesens B Van Gestel T Viaene S Stepanova M Suykens J Vanthienen J (2003) Benchmarking state-of-the-art classification algorithms for credit scoring Journal of the Operational Research Society54(6)627ndash635 DOI 101057palgravejors2601545

Barandela R Snchez J Garca V Rangel E (2003) Strategies for learning in class imbalance problemsPattern Recognition 36(3)849 ndash 851 DOI httpsdoiorg101016S0031-3203(02)00257-1

Barber MJ (2007) Modularity and community detection in bipartite networks Physical Review E76066102 DOI 101103PhysRevE76066102

Barua S Islam MM Yao X Murase K (2014) MWMOTEndashmajority weighted minority oversamplingtechnique for imbalanced data set learning IEEE Transactions on Knowledge and Data Engineer-ing 26(2)405ndash425 DOI 101109TKDE2012232

Batista GEAPA Prati RC Monard MC (2004) A study of the behavior of several methods for balancingmachine learning training data SIGKDD Explor Newsl 6(1)20ndash29 DOI 10114510077301007735

Beckett SJ (2016) Improved community detection in weighted bipartite networks Royal Society OpenScience 3(1) DOI 101098rsos140536

Bekkar M Djemaa HK Alitouche TA (2013) Evaluation measures for models assessment over imbalanceddata sets Journal of Information Engineering and Applications 3(10)27ndash38

Bhattacharyya S Jha S Tharakunnel K Westland JC (2011) Data mining for credit card fraud A compara-tive study Decision Support Systems 50(3)602 ndash 613 DOI httpsdoiorg101016jdss201008008

Blondel VD Guillaume JL Lambiotte R Lefebvre E (2008) Fast unfolding of communities in large net-works Journal of Statistical Mechanics Theory and Experiment 2008(10)P10008

Breiman L Friedman J Stone CJ Olshen RA (1984) Classification and regression trees Taylor amp FrancisBrozovsky L Petricek V (2007) Recommender system for online dating service In Proceedings of

Znalosti 2007 Conference VSB OstravaCha M Mislove A Gummadi KP (2009) A measurement-driven analysis of information propagation in

the Flickr social network In Proceedings of the 18th International Conference on World Wide WebACM New York NY USA WWW rsquo09 pp 721ndash730 DOI 10114515267091526806

Chawla NV (2005) Data mining for imbalanced datasets An overview In Data mining and knowledgediscovery handbook Springer US Boston MA pp 853ndash867

Chawla NV Bowyer KW Hall LO Kegelmeyer WP (2002) SMOTE synthetic minority over-samplingtechnique Journal of artificial intelligence research 16321ndash357

Chawla NV Lazarevic A Hall LO Bowyer KW (2003) Smoteboost Improving prediction of the minorityclass in boosting In Knowledge Discovery in Databases PKDD 2003 Springer Berlin HeidelbergBerlin Heidelberg pp 107ndash119

Chawla NV Japkowicz N Kotcz A (2004) Editorial Special issue on learning from imbalanced data setsSIGKDD Explor Newsl 6(1)1ndash6 DOI 10114510077301007733

Chen M Mao S Liu Y (2014) Big data A survey Mobile Networks and Applications 19(2)171ndash209DOI 101007s11036-013-0489-0

Chyi YM (2003) Classification analysis techniques for skewed class distribution problems Master thesisDepartment of Information management National Sun Yat-Sen University

Demsar J (2006) Statistical comparisons of classifiers over multiple data sets Journal of Machine LearningResearch 7(Jan)1ndash30

Dhillon IS (2001) Co-clustering documents and words using bipartite spectral graph partitioning In Pro-ceedings of the Seventh ACM SIGKDD International Conference on Knowledge Discovery and DataMining ACM New York NY USA KDD rsquo01 pp 269ndash274 DOI 101145502512502550

54 Jellis Vanhoeyveld David Martens

Drummond C Holte RC (2003) C45 class imbalance and cost sensitivity why under-sampling beatsover-sampling In Proceedings of the ICML rsquo03 Workshop on Learning from Imbalanced Datasets

Fan RE Chang KW Hsieh CJ Wang XR Lin CJ (2008) LIBLINEAR A library for large linear classifi-cation Journal of Machine Learning Research 91871ndash1874

Fan W Stolfo SJ Zhang J Chan PK (1999) AdaCost Misclassification cost-sensitive boosting In Pro-ceedings of the Sixteenth International Conference on Machine Learning Morgan Kaufmann Pub-lishers Inc San Francisco CA USA ICML rsquo99 pp 97ndash105

Fawcett T (2006) An introduction to ROC analysis Pattern Recognition Letters 27(8)861 ndash 874 DOIhttpsdoiorg101016jpatrec200510010

Finch H (2005) Comparison of distance measures in cluster analysis with dichotomous data Journal ofData Science 3(1)85ndash100

Fortunato S (2010) Community detection in graphs Physics Reports 486(3 5)75 ndash 174 DOI httpsdoiorg101016jphysrep200911002

Junque de Fortuny E Martens D Provost F (2014a) Predictive modeling with big data is bigger reallybetter Big Data 1(4)215ndash226 DOI 101089big20130037

Junque de Fortuny E Stankova M Moeyersoms J Minnaert B Provost F Martens D (2014b) Corporateresidence fraud detection In Proceedings of the 20th ACM SIGKDD International Conference onKnowledge Discovery and Data Mining ACM New York NY USA KDD rsquo14 pp 1650ndash1659 DOI10114526233302623333

Frasca M Bertoni A Re M Valentini G (2013) A neural network algorithm for semi-supervised node labellearning from unbalanced data Neural Networks 4384 ndash 98 DOI httpsdoiorg101016jneunet201301021

Friedman M (1937) The use of ranks to avoid the assumption of normality implicit in the analysis ofvariance Journal of the american statistical association 32(200)675ndash701

Garcıa E Lozano F (2007) Boosting support vector machines In Machine Learning and Data Mining inPattern Recognition 5th International Conference MLDM 2007 Leipzig Germany July 18-20 PostProceedings IBaI publishing pp 153ndash167

Goldstein M Uchida S (2016) A comparative evaluation of unsupervised anomaly detection algorithmsfor multivariate data PLOS ONE 11(4)1ndash31 DOI 101371journalpone0152173

Gonzlez PC Velsquez JD (2013) Characterization and detection of taxpayers with false invoices usingdata mining techniques Expert Systems with Applications 40(5)1427 ndash 1436 DOI httpsdoiorg101016jeswa201208051

Guimera R Sales-Pardo M Amaral LAN (2007) Module identification in bipartite and directed networksPhysical Review E 76036102 DOI 101103PhysRevE76036102

Guo H Viktor HL (2004) Learning from imbalanced data sets with boosting and data generation TheDataBoost-IM approach SIGKDD Explor Newsl 6(1)30ndash39 DOI 10114510077301007736

Guo X Yin Y Dong C Yang G Zhou G (2008) On the class imbalance problem In 2008 Fourth Interna-tional Conference on Natural Computation IEEE vol 4 pp 192ndash201 DOI 101109ICNC2008871

Han H Wang WY Mao BH (2005) Borderline-SMOTE A new over-sampling method in imbalanced datasets learning In Advances in Intelligent Computing Springer Berlin Heidelberg Berlin Heidelbergpp 878ndash887

He H Garcia EA (2009) Learning from imbalanced data IEEE Transactions on Knowledge and DataEngineering 21(9)1263ndash1284 DOI 101109TKDE2008239

He H Bai Y Garcia EA Li S (2008) ADASYN Adaptive synthetic sampling approach for imbalancedlearning In 2008 IEEE International Joint Conference on Neural Networks (IEEE World Congresson Computational Intelligence) IEEE pp 1322ndash1328 DOI 101109IJCNN20084633969

Holm S (1979) A simple sequentially rejective multiple test procedure Scandinavian journal of statistics6(2)65ndash70

Hsu CW Lin CJ (2002) A comparison of methods for multiclass support vector machines IEEE Transac-tions on Neural Networks 13(2)415ndash425 DOI 10110972991427

Huang A (2008) Similarity measures for text document clustering In Proceedings of the sixth new zealandcomputer science research student conference (NZCSRSC2008) Christchurch New Zealand pp 49ndash56

Iman RL Davenport JM (1980) Approximations of the critical region of the Friedman statistic Commu-nications in Statistics-Theory and Methods 9(6)571ndash595

Jo T Japkowicz N (2004) Class imbalances versus small disjuncts ACM SIGKDD Explor Newsl 6(1)40ndash49 DOI 10114510077301007737

Imbalanced classification in sparse and large behaviour datasets 55

Jutla IS Jeub LG Mucha PJ (2011-2016) A generalized louvain method for community detection imple-mented in MATLAB URL httpnetwikiamathunceduGenLouvain

Kubat M Matwin S (1997) Addressing the curse of imbalanced training sets One-sided selection InProceedings of the Fourteenth International Conference on Machine Learning Morgan KaufmannPublishers Inc San Francisco CA USA pp 179ndash186

Lancichinetti A Fortunato S (2009) Community detection algorithms A comparative analysis PhysicalReview E 80056117 DOI 101103PhysRevE80056117

Larremore DB Clauset A Jacobs AZ (2014) Efficiently inferring community structure in bipartitenetworks Physical Review E Statistical Nonlinear and Soft Matter Physics 90012805 DOI101103PhysRevE90012805

Li J Fine JP (2010) Weighted area under the receiver operating characteristic curve and its application togene selection Journal of the Royal Statistical Society Series C (Applied Statistics) 59(4)673ndash692DOI 101111j1467-9876201000713x

Li X Wang L Sung E (2008) AdaBoost with SVM-based component classifiers Engineering Applicationsof Artificial Intelligence 21(5)785ndash795 DOI httpsdoiorg101016jengappai200707001

Lichman M (2013) UCI machine learning repository URL httparchiveicsuciedumlLiu W Chawla S Cieslak DA Chawla NV (2010) A robust decision tree algorithm for imbalanced data

sets In Proceedings of the tenth SIAM international conference on data mining SIAM Philadelphiavol 10 pp 766ndash777


  • Introduction
  • Preliminaries
  • Methods
  • Results and discussion
  • The effect of the chosen base learner
  • Conclusions and future research directions
  • Oversampling experiments
  • Undersampling experiments
  • Boosting experiments
  • Final Comparison

[Figure: two panels plotting test AUC (y-axis, approx. 45–90) against the number of boosting iterations T (x-axis, 0–30). Panel (a) compares AB, AC R2, AC R8, AC RD, EE S = 5, EE S = 10, EE S = 15 and the baseline BL; panel (b) compares AB and EE for C ∈ {1e-07, 1e-05, 0.001, 0.1} against BL.]

Fig 17 CRF(p = []) dataset

D Final Comparison

[Figure: scatter plot of average rank Time (y-axis, 0–18) versus average rank AUC (x-axis, 0–14) for BL, OS, RS, SMOTE, ADASYN, RUS, Cl Knn, Far Knn, CBU, AB, AC(R=2), AC(R=8), AC(R=RL), EE(S=10), EE(S=15) and EE par.]

Fig 18 Average rank AUC versus average rank Time (see Table 9) across the large datasets from Table 2 (CRF and Bank). Regarding the former, an extra row EE par is added, having the same AUC value as EE(S = 15). Ranks for each dataset are subsequently obtained via the procedure outlined in Section 4.6.1. Points occurring in the upper-right region are preferred.
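The average ranks in Fig 18 follow the usual Friedman-test convention (Demšar 2006): on each dataset the methods are ranked by AUC (rank 1 = best, ties receive the mean of the tied ranks), and those per-dataset ranks are then averaged. A minimal sketch of that ranking step — the `aucs` values below are invented for illustration and are not the paper's results:

```python
def average_ranks(scores):
    """scores[d][m]: AUC of method m on dataset d (higher is better).
    Returns the average rank per method (rank 1 = best); tied scores
    share the mean of the ranks they span."""
    n_methods = len(scores[0])
    totals = [0.0] * n_methods
    for row in scores:
        # sort method indices by descending score
        order = sorted(range(n_methods), key=lambda m: -row[m])
        ranks = [0.0] * n_methods
        i = 0
        while i < n_methods:
            # extend j over the run of methods tied with position i
            j = i
            while j + 1 < n_methods and row[order[j + 1]] == row[order[i]]:
                j += 1
            mean_rank = (i + j) / 2 + 1  # ranks are 1-based
            for k in range(i, j + 1):
                ranks[order[k]] = mean_rank
            i = j + 1
        totals = [t + r for t, r in zip(totals, ranks)]
    return [t / len(scores) for t in totals]

# three hypothetical datasets, three methods (think BL, AB, EE)
aucs = [[0.70, 0.80, 0.85],
        [0.65, 0.82, 0.82],
        [0.72, 0.78, 0.90]]
print(average_ranks(aucs))  # method 0 is worst on every dataset -> average rank 3.0
```

With these made-up scores, methods 1 and 2 tie on the second dataset and each receive rank 1.5 there, which is exactly how a method such as EE par can share an average-rank value with EE(S = 15) in the figure.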

Imbalanced classification in sparse and large behaviour datasets

References

Akbani R, Kwek S, Japkowicz N (2004) Applying support vector machines to imbalanced datasets. In: Machine Learning: ECML 2004, 15th European Conference on Machine Learning, Pisa, Italy, September 20-24, 2004, Proceedings, Springer Berlin Heidelberg, pp 39–50. DOI 10.1007/978-3-540-30115-8_7

Ali A, Shamsuddin SM, Ralescu AL (2015) Classification with class imbalance problem: A review. International Journal of Advances in Soft Computing and its Applications 7(3):176–204

Alzahrani T, Horadam KJ (2016) Community detection in bipartite networks: Algorithms and case studies. In: Complex Systems and Networks: Dynamics, Controls and Applications, Springer Berlin Heidelberg, pp 25–50. DOI 10.1007/978-3-662-47824-0_2

Bachner J (2013) Predictive policing: Preventing crime with data and analytics. IBM Center for the Business of Government

Baesens B, Van Gestel T, Viaene S, Stepanova M, Suykens J, Vanthienen J (2003) Benchmarking state-of-the-art classification algorithms for credit scoring. Journal of the Operational Research Society 54(6):627–635. DOI 10.1057/palgrave.jors.2601545

Barandela R, Sánchez J, García V, Rangel E (2003) Strategies for learning in class imbalance problems. Pattern Recognition 36(3):849–851. DOI 10.1016/S0031-3203(02)00257-1

Barber MJ (2007) Modularity and community detection in bipartite networks. Physical Review E 76:066102. DOI 10.1103/PhysRevE.76.066102

Barua S, Islam MM, Yao X, Murase K (2014) MWMOTE: majority weighted minority oversampling technique for imbalanced data set learning. IEEE Transactions on Knowledge and Data Engineering 26(2):405–425. DOI 10.1109/TKDE.2012.232

Batista GEAPA, Prati RC, Monard MC (2004) A study of the behavior of several methods for balancing machine learning training data. SIGKDD Explor Newsl 6(1):20–29. DOI 10.1145/1007730.1007735

Beckett SJ (2016) Improved community detection in weighted bipartite networks. Royal Society Open Science 3(1). DOI 10.1098/rsos.140536

Bekkar M, Djemaa HK, Alitouche TA (2013) Evaluation measures for models assessment over imbalanced data sets. Journal of Information Engineering and Applications 3(10):27–38

Bhattacharyya S, Jha S, Tharakunnel K, Westland JC (2011) Data mining for credit card fraud: A comparative study. Decision Support Systems 50(3):602–613. DOI 10.1016/j.dss.2010.08.008

Blondel VD, Guillaume JL, Lambiotte R, Lefebvre E (2008) Fast unfolding of communities in large networks. Journal of Statistical Mechanics: Theory and Experiment 2008(10):P10008

Breiman L, Friedman J, Stone CJ, Olshen RA (1984) Classification and regression trees. Taylor & Francis

Brozovsky L, Petricek V (2007) Recommender system for online dating service. In: Proceedings of Znalosti 2007 Conference, VSB, Ostrava

Cha M, Mislove A, Gummadi KP (2009) A measurement-driven analysis of information propagation in the Flickr social network. In: Proceedings of the 18th International Conference on World Wide Web, ACM, New York, NY, USA, WWW '09, pp 721–730. DOI 10.1145/1526709.1526806

Chawla NV (2005) Data mining for imbalanced datasets: An overview. In: Data mining and knowledge discovery handbook, Springer US, Boston, MA, pp 853–867

Chawla NV, Bowyer KW, Hall LO, Kegelmeyer WP (2002) SMOTE: synthetic minority over-sampling technique. Journal of Artificial Intelligence Research 16:321–357

Chawla NV, Lazarevic A, Hall LO, Bowyer KW (2003) SMOTEBoost: Improving prediction of the minority class in boosting. In: Knowledge Discovery in Databases: PKDD 2003, Springer Berlin Heidelberg, pp 107–119

Chawla NV, Japkowicz N, Kotcz A (2004) Editorial: Special issue on learning from imbalanced data sets. SIGKDD Explor Newsl 6(1):1–6. DOI 10.1145/1007730.1007733

Chen M, Mao S, Liu Y (2014) Big data: A survey. Mobile Networks and Applications 19(2):171–209. DOI 10.1007/s11036-013-0489-0

Chyi YM (2003) Classification analysis techniques for skewed class distribution problems. Master thesis, Department of Information Management, National Sun Yat-Sen University

Demšar J (2006) Statistical comparisons of classifiers over multiple data sets. Journal of Machine Learning Research 7(Jan):1–30

Dhillon IS (2001) Co-clustering documents and words using bipartite spectral graph partitioning. In: Proceedings of the Seventh ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, ACM, New York, NY, USA, KDD '01, pp 269–274. DOI 10.1145/502512.502550

Drummond C, Holte RC (2003) C4.5, class imbalance, and cost sensitivity: why under-sampling beats over-sampling. In: Proceedings of the ICML '03 Workshop on Learning from Imbalanced Datasets

Fan RE, Chang KW, Hsieh CJ, Wang XR, Lin CJ (2008) LIBLINEAR: A library for large linear classification. Journal of Machine Learning Research 9:1871–1874

Fan W, Stolfo SJ, Zhang J, Chan PK (1999) AdaCost: Misclassification cost-sensitive boosting. In: Proceedings of the Sixteenth International Conference on Machine Learning, Morgan Kaufmann Publishers Inc., San Francisco, CA, USA, ICML '99, pp 97–105

Fawcett T (2006) An introduction to ROC analysis. Pattern Recognition Letters 27(8):861–874. DOI 10.1016/j.patrec.2005.10.010

Finch H (2005) Comparison of distance measures in cluster analysis with dichotomous data. Journal of Data Science 3(1):85–100

Fortunato S (2010) Community detection in graphs. Physics Reports 486(3–5):75–174. DOI 10.1016/j.physrep.2009.11.002

Junque de Fortuny E, Martens D, Provost F (2014a) Predictive modeling with big data: is bigger really better? Big Data 1(4):215–226. DOI 10.1089/big.2013.0037

Junque de Fortuny E, Stankova M, Moeyersoms J, Minnaert B, Provost F, Martens D (2014b) Corporate residence fraud detection. In: Proceedings of the 20th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, ACM, New York, NY, USA, KDD '14, pp 1650–1659. DOI 10.1145/2623330.2623333

Frasca M, Bertoni A, Re M, Valentini G (2013) A neural network algorithm for semi-supervised node label learning from unbalanced data. Neural Networks 43:84–98. DOI 10.1016/j.neunet.2013.01.021

Friedman M (1937) The use of ranks to avoid the assumption of normality implicit in the analysis of variance. Journal of the American Statistical Association 32(200):675–701

García E, Lozano F (2007) Boosting support vector machines. In: Machine Learning and Data Mining in Pattern Recognition, 5th International Conference, MLDM 2007, Leipzig, Germany, July 18-20, Post Proceedings, IBaI publishing, pp 153–167

Goldstein M, Uchida S (2016) A comparative evaluation of unsupervised anomaly detection algorithms for multivariate data. PLOS ONE 11(4):1–31. DOI 10.1371/journal.pone.0152173

González PC, Velásquez JD (2013) Characterization and detection of taxpayers with false invoices using data mining techniques. Expert Systems with Applications 40(5):1427–1436. DOI 10.1016/j.eswa.2012.08.051

Guimerà R, Sales-Pardo M, Amaral LAN (2007) Module identification in bipartite and directed networks. Physical Review E 76:036102. DOI 10.1103/PhysRevE.76.036102

Guo H, Viktor HL (2004) Learning from imbalanced data sets with boosting and data generation: The DataBoost-IM approach. SIGKDD Explor Newsl 6(1):30–39. DOI 10.1145/1007730.1007736

Guo X, Yin Y, Dong C, Yang G, Zhou G (2008) On the class imbalance problem. In: 2008 Fourth International Conference on Natural Computation, IEEE, vol 4, pp 192–201. DOI 10.1109/ICNC.2008.871

Han H, Wang WY, Mao BH (2005) Borderline-SMOTE: A new over-sampling method in imbalanced data sets learning. In: Advances in Intelligent Computing, Springer Berlin Heidelberg, pp 878–887

He H, Garcia EA (2009) Learning from imbalanced data. IEEE Transactions on Knowledge and Data Engineering 21(9):1263–1284. DOI 10.1109/TKDE.2008.239

He H, Bai Y, Garcia EA, Li S (2008) ADASYN: Adaptive synthetic sampling approach for imbalanced learning. In: 2008 IEEE International Joint Conference on Neural Networks (IEEE World Congress on Computational Intelligence), IEEE, pp 1322–1328. DOI 10.1109/IJCNN.2008.4633969

Holm S (1979) A simple sequentially rejective multiple test procedure. Scandinavian Journal of Statistics 6(2):65–70

Hsu CW, Lin CJ (2002) A comparison of methods for multiclass support vector machines. IEEE Transactions on Neural Networks 13(2):415–425. DOI 10.1109/72.991427

Huang A (2008) Similarity measures for text document clustering. In: Proceedings of the Sixth New Zealand Computer Science Research Student Conference (NZCSRSC2008), Christchurch, New Zealand, pp 49–56

Iman RL, Davenport JM (1980) Approximations of the critical region of the Friedman statistic. Communications in Statistics-Theory and Methods 9(6):571–595

Jo T, Japkowicz N (2004) Class imbalances versus small disjuncts. ACM SIGKDD Explor Newsl 6(1):40–49. DOI 10.1145/1007730.1007737

Jutla IS, Jeub LG, Mucha PJ (2011–2016) A generalized Louvain method for community detection implemented in MATLAB. URL http://netwiki.amath.unc.edu/GenLouvain

Kubat M, Matwin S (1997) Addressing the curse of imbalanced training sets: One-sided selection. In: Proceedings of the Fourteenth International Conference on Machine Learning, Morgan Kaufmann Publishers Inc., San Francisco, CA, USA, pp 179–186

Lancichinetti A, Fortunato S (2009) Community detection algorithms: A comparative analysis. Physical Review E 80:056117. DOI 10.1103/PhysRevE.80.056117

Larremore DB, Clauset A, Jacobs AZ (2014) Efficiently inferring community structure in bipartite networks. Physical Review E: Statistical, Nonlinear, and Soft Matter Physics 90:012805. DOI 10.1103/PhysRevE.90.012805

Li J, Fine JP (2010) Weighted area under the receiver operating characteristic curve and its application to gene selection. Journal of the Royal Statistical Society: Series C (Applied Statistics) 59(4):673–692. DOI 10.1111/j.1467-9876.2010.00713.x

Li X, Wang L, Sung E (2008) AdaBoost with SVM-based component classifiers. Engineering Applications of Artificial Intelligence 21(5):785–795. DOI 10.1016/j.engappai.2007.07.001

Lichman M (2013) UCI machine learning repository. URL http://archive.ics.uci.edu/ml

Liu W, Chawla S, Cieslak DA, Chawla NV (2010) A robust decision tree algorithm for imbalanced data sets. In: Proceedings of the Tenth SIAM International Conference on Data Mining, SIAM, Philadelphia, vol 10, pp 766–777

Liu XY, Wu J, Zhou ZH (2009) Exploratory undersampling for class-imbalance learning. IEEE Transactions on Systems, Man, and Cybernetics, Part B (Cybernetics) 39(2):539–550. DOI 10.1109/TSMCB.2008.2007853

Luts J, Ojeda F, Van de Plas R, De Moor B, Van Huffel S, Suykens JA (2010) A tutorial on support vector machine-based methods for classification problems in chemometrics. Analytica Chimica Acta 665(2):129–145. DOI 10.1016/j.aca.2010.03.030

Macskassy SA, Provost F (2007) Classification in networked data: A toolkit and a univariate case study. Journal of Machine Learning Research 8(May):935–983

Martens D, Provost F (2014) Explaining data-driven document classifications. MIS Quarterly 38(1):73–100

Martens D, Provost F, Clark J, Junque de Fortuny E (2016) Mining massive fine-grained behavior data to improve predictive analytics. MIS Quarterly 40(4):869–888

Mazurowski MA, Habas PA, Zurada JM, Lo JY, Baker JA, Tourassi GD (2008) Training neural network classifiers for medical decision making: The effects of imbalanced datasets on classification performance. Neural Networks 21(2–3):427–436. DOI 10.1016/j.neunet.2007.12.031

Mease D, Wyner AJ, Buja A (2007) Boosted classification trees and class probability/quantile estimation. Journal of Machine Learning Research 8:409–439

Nemenyi P (1963) Distribution-free Multiple Comparisons. Dissertation, Princeton University

Newman MEJ, Girvan M (2004) Finding and evaluating community structure in networks. Physical Review E 69:026113. DOI 10.1103/PhysRevE.69.026113

Ng AY (2004) Feature selection, L1 vs. L2 regularization, and rotational invariance. In: Proceedings of the Twenty-first International Conference on Machine Learning, ACM, New York, NY, USA, ICML '04, p 78. DOI 10.1145/1015330.1015435

Ng AY, Jordan MI (2002) On discriminative vs. generative classifiers: A comparison of logistic regression and naive Bayes. In: Advances in Neural Information Processing Systems 14, MIT Press, pp 841–848

Ngai E, Hu Y, Wong Y, Chen Y, Sun X (2011) The application of data mining techniques in financial fraud detection: A classification framework and an academic review of literature. Decision Support Systems 50(3):559–569. DOI 10.1016/j.dss.2010.08.006

Platt JC (1999) Probabilistic outputs for support vector machines and comparisons to regularized likelihood methods. In: Advances in Large-Margin Classifiers, MIT Press, pp 61–74

Porter MA, Onnela JP, Mucha PJ (2009) Communities in networks. Notices of the American Mathematical Society 56(9):1082–1097

Provost F, Fawcett T (2013) Data Science for Business: What you need to know about data mining and data-analytic thinking. O'Reilly Media, Inc.

Provost F, Dalessandro B, Hook R, Zhang X, Murray A (2009) Audience selection for on-line brand advertising: Privacy-friendly social network targeting. In: Proceedings of the 15th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, ACM, New York, NY, USA, KDD '09, pp 707–716. DOI 10.1145/1557019.1557098

Raskutti B, Kowalczyk A (2004) Extreme re-balancing for SVMs: A case study. SIGKDD Explor Newsl 6(1):60–69. DOI 10.1145/1007730.1007739

Rosvall M, Bergstrom CT (2008) Maps of random walks on complex networks reveal community structure. Proceedings of the National Academy of Sciences 105(4):1118–1123. DOI 10.1073/pnas.0706851105

Schapire RE (1999) A brief introduction to boosting. In: Proceedings of the 16th International Joint Conference on Artificial Intelligence - Volume 2, Morgan Kaufmann Publishers Inc., San Francisco, CA, USA, IJCAI'99, pp 1401–1406

Schapire RE, Singer Y (1999) Improved boosting algorithms using confidence-rated predictions. Machine Learning 37(3):297–336. DOI 10.1023/A:1007614523901

Shmueli G (2017) Analyzing behavioral big data: Methodological, practical, ethical, and moral issues. Quality Engineering 29(1):57–74. DOI 10.1080/08982112.2016.1210979

Sobhani P, Viktor H, Matwin S (2015) Learning from imbalanced data using ensemble methods and cluster-based undersampling. In: New Frontiers in Mining Complex Patterns: Third International Workshop, NFMCP 2014, Held in Conjunction with ECML-PKDD 2014, Nancy, France, September 19, 2014, Revised Selected Papers, Springer International Publishing, Cham, pp 69–83. DOI 10.1007/978-3-319-17876-9_5

Stankova M (2016) Classification within network data with a bipartite structure. Dissertation, University of Antwerp

Stankova M, Martens D, Provost F (2015) Classification over bipartite graphs through projection. Working Papers 2015001, University of Antwerp, Faculty of Applied Economics

Sun Y, Kamel MS, Wong AK, Wang Y (2007) Cost-sensitive boosting for classification of imbalanced data. Pattern Recognition 40(12):3358–3378. DOI 10.1016/j.patcog.2007.04.009

Suykens JA, Van Gestel T, De Brabanter J, De Moor B, Vandewalle J (2002) Least squares support vector machines. World Scientific, Singapore

Tang B, He H (2015) ENN: Extended nearest neighbor method for pattern recognition [research frontier]. IEEE Computational Intelligence Magazine 10(3):52–60. DOI 10.1109/MCI.2015.2437512

Tang Y, Zhang YQ, Chawla NV, Krasser S (2009) SVMs modeling for highly imbalanced classification. IEEE Transactions on Systems, Man, and Cybernetics, Part B (Cybernetics) 39(1):281–288. DOI 10.1109/TSMCB.2008.2002909

Tobback E, Moeyersoms J, Stankova M, Martens D (2016) Bankruptcy prediction for SMEs using relational data. Working Paper 2016004, University of Antwerp, Faculty of Applied Economics

Verbeke W, Dejaeger K, Martens D, Hur J, Baesens B (2012) New insights into churn prediction in the telecommunication sector: A profit driven data mining approach. European Journal of Operational Research 218(1):211–229. DOI 10.1016/j.ejor.2011.09.031

Veropoulos K, Campbell I, Cristianini N (1999) Controlling the sensitivity of support vector machines. In: Proceedings of the International Joint Conference on Artificial Intelligence, Stockholm, Sweden (IJCAI99), pp 55–60

Whitrow C, Hand DJ, Juszczak P, Weston D, Adams NM (2009) Transaction aggregation as a strategy for credit card fraud detection. Data Mining and Knowledge Discovery 18(1):30–55. DOI 10.1007/s10618-008-0116-z

Wickramaratna J, Holden SB, Buxton BF (2001) Performance degradation in boosting. In: Proceedings of the Second International Workshop on Multiple Classifier Systems, Springer, London, UK, MCS '01, pp 11–21

Yen SJ, Lee YS (2009) Cluster-based under-sampling approaches for imbalanced data distributions. Expert Systems with Applications 36(3, Part 1):5718–5727. DOI 10.1016/j.eswa.2008.06.108

Yu HF, Lo HY, Hsieh HP, Lou JK, McKenzie TG, Chou JW, Chung PH, Ho CH, Chang CF, Wei YH, et al (2010) Feature engineering and classifier ensemble for KDD Cup 2010. In: Proceedings of the KDD Cup 2010 Workshop, pp 1–16

Zha H, He X, Ding C, Simon H, Gu M (2001) Bipartite graph partitioning and data clustering. In: Proceedings of the Tenth International Conference on Information and Knowledge Management, ACM, New York, NY, USA, CIKM '01, pp 25–32. DOI 10.1145/502585.502591

Zhang J, Mani I (2003) KNN approach to unbalanced data distributions: A case study involving information extraction. In: Proceedings of the ICML'2003 Workshop on Learning from Imbalanced Datasets, Washington, DC

Ziegler CN, McNee SM, Konstan JA, Lausen G (2005) Improving recommendation lists through topic diversification. In: Proceedings of the 14th International Conference on World Wide Web, ACM, New York, NY, USA, WWW '05, pp 22–32. DOI 10.1145/1060745.1060754


Li J Fine JP (2010) Weighted area under the receiver operating characteristic curve and its application togene selection Journal of the Royal Statistical Society Series C (Applied Statistics) 59(4)673ndash692DOI 101111j1467-9876201000713x

Li X Wang L Sung E (2008) AdaBoost with SVM-based component classifiers Engineering Applicationsof Artificial Intelligence 21(5)785ndash795 DOI httpsdoiorg101016jengappai200707001

Lichman M (2013) UCI machine learning repository URL httparchiveicsuciedumlLiu W Chawla S Cieslak DA Chawla NV (2010) A robust decision tree algorithm for imbalanced data

sets In Proceedings of the tenth SIAM international conference on data mining SIAM Philadelphiavol 10 pp 766ndash777

Liu XY Wu J Zhou ZH (2009) Exploratory undersampling for class-imbalance learning IEEE Transac-tions on Systems Man and Cybernetics Part B (Cybernetics) 39(2)539ndash550 DOI 101109TSMCB20082007853

Luts J Ojeda F Van de Plas R De Moor B Van Huffel S Suykens JA (2010) A tutorial on supportvector machine-based methods for classification problems in chemometrics Analytica Chimica Acta665(2)129ndash145 DOI httpsdoiorg101016jaca201003030

Macskassy SA Provost F (2007) Classification in networked data A toolkit and a univariate case studyJournal of Machine Learning Research 8(May)935ndash983

Martens D Provost F (2014) Explaining data-driven document classifications MIS Quarterly 38(1)73ndash100

Martens D Provost F Clark J Junque de Fortuny E (2016) Mining massive fine-grained behavior data toimprove predictive analytics MIS Quarterly 40(4)869ndash888

Mazurowski MA Habas PA Zurada JM Lo JY Baker JA Tourassi GD (2008) Training neural networkclassifiers for medical decision making The effects of imbalanced datasets on classification perfor-mance Neural Networks 21(23)427 ndash 436 DOI httpsdoiorg101016jneunet200712031

Mease D Wyner AJ Buja A (2007) Boosted classification trees and class probabilityquantile estimationJournal of Machine Learning Research 8409ndash439

Nemenyi P (1963) Distribution-free Multiple Comparisons Dissertation Princeton UniversityNewman MEJ Girvan M (2004) Finding and evaluating community structure in networks Physical Re-

view E 69026113 DOI 101103PhysRevE69026113Ng AY (2004) Feature selection L1 vs L2 regularization and rotational invariance In Proceedings of the

Twenty-first International Conference on Machine Learning ACM New York NY USA ICML rsquo04pp 78ndash DOI 10114510153301015435

Ng AY Jordan MI (2002) On discriminative vs generative classifiers A comparison of logistic regressionand naive bayes In Advances in Neural Information Processing Systems 14 MIT Press pp 841ndash848

Ngai E Hu Y Wong Y Chen Y Sun X (2011) The application of data mining techniques in financial frauddetection A classification framework and an academic review of literature Decision Support Systems50(3)559 ndash 569 DOI httpsdoiorg101016jdss201008006

Platt JC (1999) Probabilistic outputs for support vector machines and comparisons to regularized likeli-hood methods In Advances in Large-Margin Classifiers MIT Press pp 61ndash74

Porter MA Onnela JP Mucha PJ (2009) Communities in networks Notices of the American MathematicalSociety 56(9)1082ndash1097

Provost F Fawcett T (2013) Data Science for Business What you need to know about data mining anddata-analytic thinking OrsquoReilly Media Inc

Provost F Dalessandro B Hook R Zhang X Murray A (2009) Audience selection for on-line brandadvertising Privacy-friendly social network targeting In Proceedings of the 15th ACM SIGKDDInternational Conference on Knowledge Discovery and Data Mining ACM New York NY USAKDD rsquo09 pp 707ndash716 DOI 10114515570191557098

56 Jellis Vanhoeyveld David Martens

Raskutti B Kowalczyk A (2004) Extreme re-balancing for SVMs A case study SIGKDD Explor Newsl6(1)60ndash69 DOI 10114510077301007739

Rosvall M Bergstrom CT (2008) Maps of random walks on complex networks reveal community structureProceedings of the National Academy of Sciences 105(4)1118ndash1123 DOI 101073pnas0706851105

Schapire RE (1999) A brief introduction to boosting In Proceedings of the 16th International Joint Con-ference on Artificial Intelligence - Volume 2 Morgan Kaufmann Publishers Inc San Francisco CAUSA IJCAIrsquo99 pp 1401ndash1406

Schapire RE Singer Y (1999) Improved boosting algorithms using confidence-rated predictions Machinelearning 37(3)297ndash336 DOI 101023A1007614523901

Shmueli G (2017) Analyzing behavioral big data Methodological practical ethical and moral issuesQuality Engineering 29(1)57ndash74 DOI 1010800898211220161210979

Sobhani P Viktor H Matwin S (2015) Learning from imbalanced data using ensemble methods andcluster-based undersampling In New Frontiers in Mining Complex Patterns Third InternationalWorkshop NFMCP 2014 Held in Conjunction with ECML-PKDD 2014 Nancy France Septem-ber 19 2014 Revised Selected Papers Springer International Publishing Cham pp 69ndash83 DOI101007978-3-319-17876-9 5

Stankova M (2016) Classification within network data with a bipartite structure Dissertation Universityof Antwerp

Stankova M Martens D Provost F (2015) Classification over bipartite graphs through projection WorkingPapers 2015001 University of Antwerp Faculty of Applied Economics

Sun Y Kamel MS Wong AK Wang Y (2007) Cost-sensitive boosting for classification of imbalanceddata Pattern Recognition 40(12)3358 ndash 3378 DOI httpsdoiorg101016jpatcog200704009

Suykens JA Van Gestel T De Brabanter J De Moor B Vandewalle J Suykens J Van Gestel T (2002)Least squares support vector machines World Scientific Singapore

Tang B He H (2015) Enn Extended nearest neighbor method for pattern recognition [research frontier]IEEE Computational Intelligence Magazine 10(3)52ndash60 DOI 101109MCI20152437512

Tang Y Zhang YQ Chawla NV Krasser S (2009) SVMs modeling for highly imbalanced classificationIEEE Transactions on Systems Man and Cybernetics Part B (Cybernetics) 39(1)281ndash288 DOI101109TSMCB20082002909

Tobback E Moeyersoms J Stankova M Martens D (2016) Bankruptcy prediction for SMEs using rela-tional data Working Paper 2016004 University of Antwerp Faculty of Applied Economics

Verbeke W Dejaeger K Martens D Hur J Baesens B (2012) New insights into churn prediction in thetelecommunication sector A profit driven data mining approach European Journal of OperationalResearch 218(1)211 ndash 229 DOI httpsdoiorg101016jejor201109031

Veropoulos K Campbell I Cristianini N (1999) Controlling the sensitivity of support vector machinesIn Proceedings of the International Joint Conference on Artificial Intelligence Stockholm Sweden(IJCAI99) pp 55 ndash 60

Whitrow C Hand DJ Juszczak P Weston D Adams NM (2009) Transaction aggregation as a strategyfor credit card fraud detection Data Mining and Knowledge Discovery 18(1)30ndash55 DOI 101007s10618-008-0116-z

Wickramaratna J Holden SB Buxton BF (2001) Performance degradation in boosting In Proceedings ofthe Second International Workshop on Multiple Classifier Systems Springer London UK MCS rsquo01pp 11ndash21

Yen SJ Lee YS (2009) Cluster-based under-sampling approaches for imbalanced data distributions ExpertSystems with Applications 36(3 Part 1)5718 ndash 5727 DOI httpsdoiorg101016jeswa200806108

Yu HF Lo HY Hsieh HP Lou JK McKenzie TG Chou JW Chung PH Ho CH Chang CF Wei YH et al(2010) Feature engineering and classifier ensemble for kdd cup 2010 In Proceedings of the KDDCup 2010 Workshop pp 1ndash16

Zha H He X Ding C Simon H Gu M (2001) Bipartite graph partitioning and data clustering In Pro-ceedings of the Tenth International Conference on Information and Knowledge Management ACMNew York NY USA CIKM rsquo01 pp 25ndash32 DOI 101145502585502591

Zhang J Mani I (2003) Knn approach to unbalanced data distributions A case study involving informationextraction In Proceedings of the ICMLrsquo2003 Workshop on Learning from Imbalanced DatasetsWashington DC

Ziegler CN McNee SM Konstan JA Lausen G (2005) Improving recommendation lists through topicdiversification In Proceedings of the 14th International Conference on World Wide Web ACM NewYork NY USA WWW rsquo05 pp 22ndash32 DOI 10114510607451060754

  • Introduction
  • Preliminaries
  • Methods
  • Results and discussion
  • The effect of the chosen base learner
  • Conclusions and future research directions
  • Oversampling experiments
  • Undersampling experiments
  • Boosting experiments
  • Final Comparison
Page 55: This item is the archived peer-reviewed author-version of datasets Jellis Vanhoeyveld David Martens Received: date / Accepted: date Abstract Recent years have witnessed a growing number

54 Jellis Vanhoeyveld David Martens

Drummond C, Holte RC (2003) C4.5, class imbalance, and cost sensitivity: why under-sampling beats over-sampling. In: Proceedings of the ICML '03 Workshop on Learning from Imbalanced Datasets

Fan RE, Chang KW, Hsieh CJ, Wang XR, Lin CJ (2008) LIBLINEAR: A library for large linear classification. Journal of Machine Learning Research 9:1871–1874

Fan W, Stolfo SJ, Zhang J, Chan PK (1999) AdaCost: Misclassification cost-sensitive boosting. In: Proceedings of the Sixteenth International Conference on Machine Learning, Morgan Kaufmann Publishers Inc., San Francisco, CA, USA, ICML '99, pp 97–105

Fawcett T (2006) An introduction to ROC analysis. Pattern Recognition Letters 27(8):861–874. DOI 10.1016/j.patrec.2005.10.010

Finch H (2005) Comparison of distance measures in cluster analysis with dichotomous data. Journal of Data Science 3(1):85–100

Fortunato S (2010) Community detection in graphs. Physics Reports 486(3–5):75–174. DOI 10.1016/j.physrep.2009.11.002

Junque de Fortuny E, Martens D, Provost F (2014a) Predictive modeling with big data: is bigger really better? Big Data 1(4):215–226. DOI 10.1089/big.2013.0037

Junque de Fortuny E, Stankova M, Moeyersoms J, Minnaert B, Provost F, Martens D (2014b) Corporate residence fraud detection. In: Proceedings of the 20th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, ACM, New York, NY, USA, KDD '14, pp 1650–1659. DOI 10.1145/2623330.2623333

Frasca M, Bertoni A, Re M, Valentini G (2013) A neural network algorithm for semi-supervised node label learning from unbalanced data. Neural Networks 43:84–98. DOI 10.1016/j.neunet.2013.01.021

Friedman M (1937) The use of ranks to avoid the assumption of normality implicit in the analysis of variance. Journal of the American Statistical Association 32(200):675–701

García E, Lozano F (2007) Boosting support vector machines. In: Machine Learning and Data Mining in Pattern Recognition, 5th International Conference, MLDM 2007, Leipzig, Germany, July 18–20, Post Proceedings, IBaI Publishing, pp 153–167

Goldstein M, Uchida S (2016) A comparative evaluation of unsupervised anomaly detection algorithms for multivariate data. PLOS ONE 11(4):1–31. DOI 10.1371/journal.pone.0152173

González PC, Velásquez JD (2013) Characterization and detection of taxpayers with false invoices using data mining techniques. Expert Systems with Applications 40(5):1427–1436. DOI 10.1016/j.eswa.2012.08.051

Guimera R, Sales-Pardo M, Amaral LAN (2007) Module identification in bipartite and directed networks. Physical Review E 76:036102. DOI 10.1103/PhysRevE.76.036102

Guo H, Viktor HL (2004) Learning from imbalanced data sets with boosting and data generation: The DataBoost-IM approach. SIGKDD Explor Newsl 6(1):30–39. DOI 10.1145/1007730.1007736

Guo X, Yin Y, Dong C, Yang G, Zhou G (2008) On the class imbalance problem. In: 2008 Fourth International Conference on Natural Computation, IEEE, vol 4, pp 192–201. DOI 10.1109/ICNC.2008.871

Han H, Wang WY, Mao BH (2005) Borderline-SMOTE: A new over-sampling method in imbalanced data sets learning. In: Advances in Intelligent Computing, Springer Berlin Heidelberg, Berlin, Heidelberg, pp 878–887

He H, Garcia EA (2009) Learning from imbalanced data. IEEE Transactions on Knowledge and Data Engineering 21(9):1263–1284. DOI 10.1109/TKDE.2008.239

He H, Bai Y, Garcia EA, Li S (2008) ADASYN: Adaptive synthetic sampling approach for imbalanced learning. In: 2008 IEEE International Joint Conference on Neural Networks (IEEE World Congress on Computational Intelligence), IEEE, pp 1322–1328. DOI 10.1109/IJCNN.2008.4633969

Holm S (1979) A simple sequentially rejective multiple test procedure. Scandinavian Journal of Statistics 6(2):65–70

Hsu CW, Lin CJ (2002) A comparison of methods for multiclass support vector machines. IEEE Transactions on Neural Networks 13(2):415–425. DOI 10.1109/72.991427

Huang A (2008) Similarity measures for text document clustering. In: Proceedings of the Sixth New Zealand Computer Science Research Student Conference (NZCSRSC2008), Christchurch, New Zealand, pp 49–56

Iman RL, Davenport JM (1980) Approximations of the critical region of the Friedman statistic. Communications in Statistics - Theory and Methods 9(6):571–595

Jo T, Japkowicz N (2004) Class imbalances versus small disjuncts. ACM SIGKDD Explor Newsl 6(1):40–49. DOI 10.1145/1007730.1007737

Imbalanced classification in sparse and large behaviour datasets 55

Jutla IS, Jeub LG, Mucha PJ (2011–2016) A generalized Louvain method for community detection implemented in MATLAB. URL http://netwiki.amath.unc.edu/GenLouvain

Kubat M, Matwin S (1997) Addressing the curse of imbalanced training sets: One-sided selection. In: Proceedings of the Fourteenth International Conference on Machine Learning, Morgan Kaufmann Publishers Inc., San Francisco, CA, USA, pp 179–186

Lancichinetti A, Fortunato S (2009) Community detection algorithms: A comparative analysis. Physical Review E 80:056117. DOI 10.1103/PhysRevE.80.056117

Larremore DB, Clauset A, Jacobs AZ (2014) Efficiently inferring community structure in bipartite networks. Physical Review E: Statistical, Nonlinear, and Soft Matter Physics 90:012805. DOI 10.1103/PhysRevE.90.012805

Li J, Fine JP (2010) Weighted area under the receiver operating characteristic curve and its application to gene selection. Journal of the Royal Statistical Society: Series C (Applied Statistics) 59(4):673–692. DOI 10.1111/j.1467-9876.2010.00713.x

Li X, Wang L, Sung E (2008) AdaBoost with SVM-based component classifiers. Engineering Applications of Artificial Intelligence 21(5):785–795. DOI 10.1016/j.engappai.2007.07.001

Lichman M (2013) UCI Machine Learning Repository. URL http://archive.ics.uci.edu/ml

Liu W, Chawla S, Cieslak DA, Chawla NV (2010) A robust decision tree algorithm for imbalanced data sets. In: Proceedings of the Tenth SIAM International Conference on Data Mining, SIAM, Philadelphia, vol 10, pp 766–777

Liu XY, Wu J, Zhou ZH (2009) Exploratory undersampling for class-imbalance learning. IEEE Transactions on Systems, Man, and Cybernetics, Part B (Cybernetics) 39(2):539–550. DOI 10.1109/TSMCB.2008.2007853

Luts J, Ojeda F, Van de Plas R, De Moor B, Van Huffel S, Suykens JA (2010) A tutorial on support vector machine-based methods for classification problems in chemometrics. Analytica Chimica Acta 665(2):129–145. DOI 10.1016/j.aca.2010.03.030

Macskassy SA, Provost F (2007) Classification in networked data: A toolkit and a univariate case study. Journal of Machine Learning Research 8(May):935–983

Martens D, Provost F (2014) Explaining data-driven document classifications. MIS Quarterly 38(1):73–100

Martens D, Provost F, Clark J, Junque de Fortuny E (2016) Mining massive fine-grained behavior data to improve predictive analytics. MIS Quarterly 40(4):869–888

Mazurowski MA, Habas PA, Zurada JM, Lo JY, Baker JA, Tourassi GD (2008) Training neural network classifiers for medical decision making: The effects of imbalanced datasets on classification performance. Neural Networks 21(2–3):427–436. DOI 10.1016/j.neunet.2007.12.031

Mease D, Wyner AJ, Buja A (2007) Boosted classification trees and class probability/quantile estimation. Journal of Machine Learning Research 8:409–439

Nemenyi P (1963) Distribution-free Multiple Comparisons. Dissertation, Princeton University

Newman MEJ, Girvan M (2004) Finding and evaluating community structure in networks. Physical Review E 69:026113. DOI 10.1103/PhysRevE.69.026113

Ng AY (2004) Feature selection, L1 vs. L2 regularization, and rotational invariance. In: Proceedings of the Twenty-first International Conference on Machine Learning, ACM, New York, NY, USA, ICML '04, p 78. DOI 10.1145/1015330.1015435

Ng AY, Jordan MI (2002) On discriminative vs. generative classifiers: A comparison of logistic regression and naive Bayes. In: Advances in Neural Information Processing Systems 14, MIT Press, pp 841–848

Ngai E, Hu Y, Wong Y, Chen Y, Sun X (2011) The application of data mining techniques in financial fraud detection: A classification framework and an academic review of literature. Decision Support Systems 50(3):559–569. DOI 10.1016/j.dss.2010.08.006

Platt JC (1999) Probabilistic outputs for support vector machines and comparisons to regularized likelihood methods. In: Advances in Large-Margin Classifiers, MIT Press, pp 61–74

Porter MA, Onnela JP, Mucha PJ (2009) Communities in networks. Notices of the American Mathematical Society 56(9):1082–1097

Provost F, Fawcett T (2013) Data Science for Business: What You Need to Know About Data Mining and Data-Analytic Thinking. O'Reilly Media, Inc.

Provost F, Dalessandro B, Hook R, Zhang X, Murray A (2009) Audience selection for on-line brand advertising: Privacy-friendly social network targeting. In: Proceedings of the 15th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, ACM, New York, NY, USA, KDD '09, pp 707–716. DOI 10.1145/1557019.1557098

56 Jellis Vanhoeyveld David Martens

Raskutti B, Kowalczyk A (2004) Extreme re-balancing for SVMs: A case study. SIGKDD Explor Newsl 6(1):60–69. DOI 10.1145/1007730.1007739

Rosvall M, Bergstrom CT (2008) Maps of random walks on complex networks reveal community structure. Proceedings of the National Academy of Sciences 105(4):1118–1123. DOI 10.1073/pnas.0706851105

Schapire RE (1999) A brief introduction to boosting. In: Proceedings of the 16th International Joint Conference on Artificial Intelligence - Volume 2, Morgan Kaufmann Publishers Inc., San Francisco, CA, USA, IJCAI'99, pp 1401–1406

Schapire RE, Singer Y (1999) Improved boosting algorithms using confidence-rated predictions. Machine Learning 37(3):297–336. DOI 10.1023/A:1007614523901

Shmueli G (2017) Analyzing behavioral big data: Methodological, practical, ethical, and moral issues. Quality Engineering 29(1):57–74. DOI 10.1080/08982112.2016.1210979

Sobhani P, Viktor H, Matwin S (2015) Learning from imbalanced data using ensemble methods and cluster-based undersampling. In: New Frontiers in Mining Complex Patterns: Third International Workshop, NFMCP 2014, Held in Conjunction with ECML-PKDD 2014, Nancy, France, September 19, 2014, Revised Selected Papers, Springer International Publishing, Cham, pp 69–83. DOI 10.1007/978-3-319-17876-9_5

Stankova M (2016) Classification within network data with a bipartite structure. Dissertation, University of Antwerp

Stankova M, Martens D, Provost F (2015) Classification over bipartite graphs through projection. Working Papers 2015001, University of Antwerp, Faculty of Applied Economics

Sun Y, Kamel MS, Wong AK, Wang Y (2007) Cost-sensitive boosting for classification of imbalanced data. Pattern Recognition 40(12):3358–3378. DOI 10.1016/j.patcog.2007.04.009

Suykens JA, Van Gestel T, De Brabanter J, De Moor B, Vandewalle J (2002) Least Squares Support Vector Machines. World Scientific, Singapore

Tang B, He H (2015) ENN: Extended nearest neighbor method for pattern recognition [research frontier]. IEEE Computational Intelligence Magazine 10(3):52–60. DOI 10.1109/MCI.2015.2437512

Tang Y, Zhang YQ, Chawla NV, Krasser S (2009) SVMs modeling for highly imbalanced classification. IEEE Transactions on Systems, Man, and Cybernetics, Part B (Cybernetics) 39(1):281–288. DOI 10.1109/TSMCB.2008.2002909

Tobback E, Moeyersoms J, Stankova M, Martens D (2016) Bankruptcy prediction for SMEs using relational data. Working Paper 2016004, University of Antwerp, Faculty of Applied Economics

Verbeke W, Dejaeger K, Martens D, Hur J, Baesens B (2012) New insights into churn prediction in the telecommunication sector: A profit driven data mining approach. European Journal of Operational Research 218(1):211–229. DOI 10.1016/j.ejor.2011.09.031

Veropoulos K, Campbell I, Cristianini N (1999) Controlling the sensitivity of support vector machines. In: Proceedings of the International Joint Conference on Artificial Intelligence, Stockholm, Sweden (IJCAI99), pp 55–60

Whitrow C, Hand DJ, Juszczak P, Weston D, Adams NM (2009) Transaction aggregation as a strategy for credit card fraud detection. Data Mining and Knowledge Discovery 18(1):30–55. DOI 10.1007/s10618-008-0116-z

Wickramaratna J, Holden SB, Buxton BF (2001) Performance degradation in boosting. In: Proceedings of the Second International Workshop on Multiple Classifier Systems, Springer, London, UK, MCS '01, pp 11–21

Yen SJ, Lee YS (2009) Cluster-based under-sampling approaches for imbalanced data distributions. Expert Systems with Applications 36(3, Part 1):5718–5727. DOI 10.1016/j.eswa.2008.06.108

Yu HF, Lo HY, Hsieh HP, Lou JK, McKenzie TG, Chou JW, Chung PH, Ho CH, Chang CF, Wei YH, et al (2010) Feature engineering and classifier ensemble for KDD Cup 2010. In: Proceedings of the KDD Cup 2010 Workshop, pp 1–16

Zha H, He X, Ding C, Simon H, Gu M (2001) Bipartite graph partitioning and data clustering. In: Proceedings of the Tenth International Conference on Information and Knowledge Management, ACM, New York, NY, USA, CIKM '01, pp 25–32. DOI 10.1145/502585.502591

Zhang J, Mani I (2003) kNN approach to unbalanced data distributions: A case study involving information extraction. In: Proceedings of the ICML'2003 Workshop on Learning from Imbalanced Datasets, Washington DC

Ziegler CN, McNee SM, Konstan JA, Lausen G (2005) Improving recommendation lists through topic diversification. In: Proceedings of the 14th International Conference on World Wide Web, ACM, New York, NY, USA, WWW '05, pp 22–32. DOI 10.1145/1060745.1060754

  • Introduction
  • Preliminaries
  • Methods
  • Results and discussion
  • The effect of the chosen base learner
  • Conclusions and future research directions
  • Oversampling experiments
  • Undersampling experiments
  • Boosting experiments
  • Final Comparison
Page 56: This item is the archived peer-reviewed author-version of datasets Jellis Vanhoeyveld David Martens Received: date / Accepted: date Abstract Recent years have witnessed a growing number

Imbalanced classification in sparse and large behaviour datasets 55

Jutla IS Jeub LG Mucha PJ (2011-2016) A generalized louvain method for community detection imple-mented in MATLAB URL httpnetwikiamathunceduGenLouvain

Kubat M Matwin S (1997) Addressing the curse of imbalanced training sets One-sided selection InProceedings of the Fourteenth International Conference on Machine Learning Morgan KaufmannPublishers Inc San Francisco CA USA pp 179ndash186

Lancichinetti A Fortunato S (2009) Community detection algorithms A comparative analysis PhysicalReview E 80056117 DOI 101103PhysRevE80056117

Larremore DB Clauset A Jacobs AZ (2014) Efficiently inferring community structure in bipartitenetworks Physical Review E Statistical Nonlinear and Soft Matter Physics 90012805 DOI101103PhysRevE90012805

Li J Fine JP (2010) Weighted area under the receiver operating characteristic curve and its application togene selection Journal of the Royal Statistical Society Series C (Applied Statistics) 59(4)673ndash692DOI 101111j1467-9876201000713x

Li X Wang L Sung E (2008) AdaBoost with SVM-based component classifiers Engineering Applicationsof Artificial Intelligence 21(5)785ndash795 DOI httpsdoiorg101016jengappai200707001

Lichman M (2013) UCI machine learning repository URL httparchiveicsuciedumlLiu W Chawla S Cieslak DA Chawla NV (2010) A robust decision tree algorithm for imbalanced data

sets In Proceedings of the tenth SIAM international conference on data mining SIAM Philadelphiavol 10 pp 766ndash777

Liu XY Wu J Zhou ZH (2009) Exploratory undersampling for class-imbalance learning IEEE Transac-tions on Systems Man and Cybernetics Part B (Cybernetics) 39(2)539ndash550 DOI 101109TSMCB20082007853

Luts J Ojeda F Van de Plas R De Moor B Van Huffel S Suykens JA (2010) A tutorial on supportvector machine-based methods for classification problems in chemometrics Analytica Chimica Acta665(2)129ndash145 DOI httpsdoiorg101016jaca201003030

Macskassy SA Provost F (2007) Classification in networked data A toolkit and a univariate case studyJournal of Machine Learning Research 8(May)935ndash983

Martens D Provost F (2014) Explaining data-driven document classifications MIS Quarterly 38(1)73ndash100

Martens D Provost F Clark J Junque de Fortuny E (2016) Mining massive fine-grained behavior data toimprove predictive analytics MIS Quarterly 40(4)869ndash888

Mazurowski MA Habas PA Zurada JM Lo JY Baker JA Tourassi GD (2008) Training neural networkclassifiers for medical decision making The effects of imbalanced datasets on classification perfor-mance Neural Networks 21(23)427 ndash 436 DOI httpsdoiorg101016jneunet200712031

Mease D Wyner AJ Buja A (2007) Boosted classification trees and class probabilityquantile estimationJournal of Machine Learning Research 8409ndash439

Nemenyi P (1963) Distribution-free Multiple Comparisons Dissertation Princeton UniversityNewman MEJ Girvan M (2004) Finding and evaluating community structure in networks Physical Re-

view E 69026113 DOI 101103PhysRevE69026113Ng AY (2004) Feature selection L1 vs L2 regularization and rotational invariance In Proceedings of the

Twenty-first International Conference on Machine Learning ACM New York NY USA ICML rsquo04pp 78ndash DOI 10114510153301015435

Ng AY Jordan MI (2002) On discriminative vs generative classifiers A comparison of logistic regressionand naive bayes In Advances in Neural Information Processing Systems 14 MIT Press pp 841ndash848

Ngai E Hu Y Wong Y Chen Y Sun X (2011) The application of data mining techniques in financial frauddetection A classification framework and an academic review of literature Decision Support Systems50(3)559 ndash 569 DOI httpsdoiorg101016jdss201008006

Platt JC (1999) Probabilistic outputs for support vector machines and comparisons to regularized likeli-hood methods In Advances in Large-Margin Classifiers MIT Press pp 61ndash74

Porter MA Onnela JP Mucha PJ (2009) Communities in networks Notices of the American MathematicalSociety 56(9)1082ndash1097

Provost F Fawcett T (2013) Data Science for Business What you need to know about data mining anddata-analytic thinking OrsquoReilly Media Inc

Provost F Dalessandro B Hook R Zhang X Murray A (2009) Audience selection for on-line brandadvertising Privacy-friendly social network targeting In Proceedings of the 15th ACM SIGKDDInternational Conference on Knowledge Discovery and Data Mining ACM New York NY USAKDD rsquo09 pp 707ndash716 DOI 10114515570191557098

56 Jellis Vanhoeyveld David Martens

Raskutti B Kowalczyk A (2004) Extreme re-balancing for SVMs A case study SIGKDD Explor Newsl6(1)60ndash69 DOI 10114510077301007739

Rosvall M Bergstrom CT (2008) Maps of random walks on complex networks reveal community structureProceedings of the National Academy of Sciences 105(4)1118ndash1123 DOI 101073pnas0706851105

Schapire RE (1999) A brief introduction to boosting In Proceedings of the 16th International Joint Con-ference on Artificial Intelligence - Volume 2 Morgan Kaufmann Publishers Inc San Francisco CAUSA IJCAIrsquo99 pp 1401ndash1406

Schapire RE Singer Y (1999) Improved boosting algorithms using confidence-rated predictions Machinelearning 37(3)297ndash336 DOI 101023A1007614523901

Shmueli G (2017) Analyzing behavioral big data Methodological practical ethical and moral issuesQuality Engineering 29(1)57ndash74 DOI 1010800898211220161210979


Jellis Vanhoeyveld, David Martens

Raskutti B, Kowalczyk A (2004) Extreme re-balancing for SVMs: a case study. SIGKDD Explorations Newsletter 6(1):60–69. DOI 10.1145/1007730.1007739

Rosvall M, Bergstrom CT (2008) Maps of random walks on complex networks reveal community structure. Proceedings of the National Academy of Sciences 105(4):1118–1123. DOI 10.1073/pnas.0706851105

Schapire RE (1999) A brief introduction to boosting. In: Proceedings of the 16th International Joint Conference on Artificial Intelligence - Volume 2, Morgan Kaufmann Publishers Inc., San Francisco, CA, USA, IJCAI'99, pp 1401–1406

Schapire RE, Singer Y (1999) Improved boosting algorithms using confidence-rated predictions. Machine Learning 37(3):297–336. DOI 10.1023/A:1007614523901

Shmueli G (2017) Analyzing behavioral big data: methodological, practical, ethical, and moral issues. Quality Engineering 29(1):57–74. DOI 10.1080/08982112.2016.1210979

Sobhani P, Viktor H, Matwin S (2015) Learning from imbalanced data using ensemble methods and cluster-based undersampling. In: New Frontiers in Mining Complex Patterns: Third International Workshop, NFMCP 2014, Held in Conjunction with ECML-PKDD 2014, Nancy, France, September 19, 2014, Revised Selected Papers, Springer International Publishing, Cham, pp 69–83. DOI 10.1007/978-3-319-17876-9_5

Stankova M (2016) Classification within network data with a bipartite structure. Dissertation, University of Antwerp

Stankova M, Martens D, Provost F (2015) Classification over bipartite graphs through projection. Working Papers 2015001, University of Antwerp, Faculty of Applied Economics

Sun Y, Kamel MS, Wong AK, Wang Y (2007) Cost-sensitive boosting for classification of imbalanced data. Pattern Recognition 40(12):3358–3378. DOI 10.1016/j.patcog.2007.04.009

Suykens JA, Van Gestel T, De Brabanter J, De Moor B, Vandewalle J (2002) Least squares support vector machines. World Scientific, Singapore

Tang B, He H (2015) ENN: extended nearest neighbor method for pattern recognition [research frontier]. IEEE Computational Intelligence Magazine 10(3):52–60. DOI 10.1109/MCI.2015.2437512

Tang Y, Zhang YQ, Chawla NV, Krasser S (2009) SVMs modeling for highly imbalanced classification. IEEE Transactions on Systems, Man, and Cybernetics, Part B (Cybernetics) 39(1):281–288. DOI 10.1109/TSMCB.2008.2002909

Tobback E, Moeyersoms J, Stankova M, Martens D (2016) Bankruptcy prediction for SMEs using relational data. Working Paper 2016004, University of Antwerp, Faculty of Applied Economics

Verbeke W, Dejaeger K, Martens D, Hur J, Baesens B (2012) New insights into churn prediction in the telecommunication sector: a profit driven data mining approach. European Journal of Operational Research 218(1):211–229. DOI 10.1016/j.ejor.2011.09.031

Veropoulos K, Campbell I, Cristianini N (1999) Controlling the sensitivity of support vector machines. In: Proceedings of the International Joint Conference on Artificial Intelligence, Stockholm, Sweden (IJCAI99), pp 55–60

Whitrow C, Hand DJ, Juszczak P, Weston D, Adams NM (2009) Transaction aggregation as a strategy for credit card fraud detection. Data Mining and Knowledge Discovery 18(1):30–55. DOI 10.1007/s10618-008-0116-z

Wickramaratna J, Holden SB, Buxton BF (2001) Performance degradation in boosting. In: Proceedings of the Second International Workshop on Multiple Classifier Systems, Springer, London, UK, MCS '01, pp 11–21

Yen SJ, Lee YS (2009) Cluster-based under-sampling approaches for imbalanced data distributions. Expert Systems with Applications 36(3, Part 1):5718–5727. DOI 10.1016/j.eswa.2008.06.108

Yu HF, Lo HY, Hsieh HP, Lou JK, McKenzie TG, Chou JW, Chung PH, Ho CH, Chang CF, Wei YH, et al (2010) Feature engineering and classifier ensemble for KDD Cup 2010. In: Proceedings of the KDD Cup 2010 Workshop, pp 1–16

Zha H, He X, Ding C, Simon H, Gu M (2001) Bipartite graph partitioning and data clustering. In: Proceedings of the Tenth International Conference on Information and Knowledge Management, ACM, New York, NY, USA, CIKM '01, pp 25–32. DOI 10.1145/502585.502591

Zhang J, Mani I (2003) KNN approach to unbalanced data distributions: a case study involving information extraction. In: Proceedings of the ICML'2003 Workshop on Learning from Imbalanced Datasets, Washington, DC

Ziegler CN, McNee SM, Konstan JA, Lausen G (2005) Improving recommendation lists through topic diversification. In: Proceedings of the 14th International Conference on World Wide Web, ACM, New York, NY, USA, WWW '05, pp 22–32. DOI 10.1145/1060745.1060754

  • Introduction
  • Preliminaries
  • Methods
  • Results and discussion
  • The effect of the chosen base learner
  • Conclusions and future research directions
  • Oversampling experiments
  • Undersampling experiments
  • Boosting experiments
  • Final Comparison