DEPARTMENT OF ENGINEERING MANAGEMENT

A Benchmarking Study of Classification Techniques for

Behavioral Data

Sofie De Cnudde, David Martens, Theodoros Evgeniou & Foster Provost

UNIVERSITY OF ANTWERP Faculty of Applied Economics

City Campus

Prinsstraat 13, B.226

B-2000 Antwerp

Tel. +32 (0)3 265 40 32

Fax +32 (0)3 265 47 99

www.uantwerpen.be


FACULTY OF APPLIED ECONOMICS

DEPARTMENT OF ENGINEERING MANAGEMENT

A Benchmarking Study of Classification Techniques for Behavioral Data

Sofie De Cnudde, David Martens, Theodoros Evgeniou & Foster Provost

RESEARCH PAPER 2017-005 APRIL 2017

University of Antwerp, City Campus, Prinsstraat 13, B-2000 Antwerp, Belgium

Research Administration – room B.226

phone: (32) 3 265 40 32

fax: (32) 3 265 47 99

e-mail: [email protected]

The research papers from the Faculty of Applied Economics

are also available at www.repec.org

(Research Papers in Economics - RePEc)

D/2017/1169/005


A Benchmarking Study of Classification Techniques for Behavioral Data

Sofie De Cnudde [email protected]
David Martens [email protected]

Theodoros Evgeniou [email protected]
Foster Provost [email protected]

Abstract

The predictive power in ubiquitous big, behavioral data has been emphasized by previous academic research. The ultra-high dimensional and sparse characteristics, however, pose significant challenges for state-of-the-art classification techniques. Moreover, no consensus exists regarding a feasible trade-off between classification performance and computational complexity. This work provides a contribution in this direction through a systematic benchmarking study. Forty-three fine-grained behavioral data sets are analyzed with 11 classification techniques. Statistical performance comparisons enriched with learning curve analyses demonstrate two important findings. Firstly, an inherent AUC-time trade-off becomes clear, making the choice of an appropriate classifier dependent on time restrictions and data set characteristics. Logistic regression achieves the best AUC, but requires the longest computation time. Also, L2 regularization proves better than sparse L1 regularization. An attractive trade-off is found in a similarity-based technique called PSN. Secondly, the results illustrate that significant value lies in collecting and analyzing even more data, both in the instance and in the feature dimension, contrasting findings on traditional data. The results of this study provide guidance for researchers and practitioners in the selection of appropriate classification techniques, sample sizes and data features, while also providing focus for scalable algorithm design in the face of large, behavioral data.

1 Introduction

The prevalence of big data has been cultivated by the expansion of data collection and storage possibilities (Boyd and Crawford, 2012). With more of life's activities being recorded and quantified, and the rapid diffusion of data-generating sources such as mobile phones, the rate and volume of information flow is exploding at an ever-increasing pace (Wu et al., 2014). The proliferation of this so-called big data, generally characterized by an expansion in volume, velocity and variety (Laney, 2001), has led to advances in several fields. Examples can be found in e-commerce, e-government, politics, science and technology, e-health, security and public safety (Chen et al., 2012). The rise of big data has, however, also brought along challenges with respect to the storage and access of large volumes of data, the semantic interpretation of large, high-dimensional data, privacy, and algorithm design in the face of complex and dynamic data characteristics (Wu et al., 2014).



One particular manifestation of big data originates from behavioral data (Chen et al., 2009). Following the definition of Shmueli (2016), big behavioral data models human behavior through actions and/or interactions of people. These form a testimony of a person's behavior captured through fine-grained modular features. Customer transactions with a bank, web surfers' visiting behavior, mobile phone users' visited locations and Facebook like data are just a few examples. This behavior can be modeled through a high-level view of the presence or absence of an individual action (binary). More detail can be added by modeling the strength or the frequency of an action (numeric). The set of all possible behavioral aspects (features) an entity can take on is enormous (such as the set of all possible GPS coordinates one can visit), resulting in ultra-high dimensional data. Moreover, the limit on so-called behavioral capital (Junqué de Fortuny et al., 2013) implies that among all possible actions, a person can take on only a limited number. This results in highly sparse data. The high dimensionality and sparsity stand in stark contrast to traditional sociodemographic features or summarizing behavioral features such as RFM (recency, frequency, monetary) values.

Behavioral data has been analyzed in the academic literature and has demonstrated promising results in predictive analysis. It can be telling of a person's personality traits (Kosinski et al., 2013), his interest in banking products (Martens et al., 2016), his interest in a news article (Liu et al., 2010), his interest in a (mobile) ad (Li and Du 2012; Perlich et al. 2014), his tendency to churn (Verbeke et al., 2014), his credit default behavior (De Cnudde et al., 2015) or his tendency to commit fraudulent activities (Fawcett and Provost 1997; Junqué de Fortuny et al. 2014).

In spite of the growing presence of big behavioral data (Yang and Wu, 2006) and the numerous studies clearly demonstrating its potential for predictive purposes, this data poses significant challenges for traditional state-of-the-art data mining techniques (Provost and Kolluri 1999; Brain and Webb 2002; Dalessandro 2013). One such challenge is captured by the curse of dimensionality (Donoho, 2000), which states that a large feature dimension results in highly sparse and highly scattered data, making it very difficult for classifiers to capture a general trend. Although high in complexity, these data sets grasp very detailed and rich information, which proves indispensable in a predictive context (Clark and Provost, 2016). Thus, the question emerges as to whether and to what extent widely-used, traditional classifiers can cope with this complex and rich type of data (Brain and Webb 2002; Wu et al. 2014).

We investigate the existing literature performing predictive analyses on behavioral data. The goal is to obtain an understanding of which techniques are used and why, and which problems are encountered during analysis. We focus on literature specifically analyzing fine-grained, high-dimensional and sparse behavioral data and synthesize the employed classification techniques in Table 1 (marked with an 'X' in the appropriate column). While constructing the table, the following rules were applied. If no explicit mention is made of the number of instances n and/or the number of features m, we denote this with a question mark. In case a paper analyzes several data sets, the largest is shown. For each data set, a bold 'X' represents the best-performing technique. At the bottom, the table shows the total number of occurrences of each technique, along with the number of times it performed best among the used techniques (also shown in bold). Note that when only one technique is analyzed in a paper, it is also denoted in boldface. For linear support vector machines and logistic regression, the type of regularization is indicated; when the authors did not specify which type was used, an 'X' is put in the middle column.



Study | n | m | Techniques used
Abramson and Aha (2012) | 6,925 | 1,000 | X X X
De Bock and Van den Poel (2010) | 5,719 | 1,821 | X X X
De Cnudde and Martens (2015) | 177,761 | 2,448 | X X
Liu et al. (2012) | 300,000 | 3,000 | X X X
Li et al. (2015) | 9,489 | 4,368 | X
Lee and Chung (2014) | 6,040 | 6,040 | X
Ma et al. (2009a) | 30,000 | 30,597 | X X X X
Thomas et al. (2011) | 1,200,000 | 98,900 | X
Goel et al. (2012) | 250,000 | 100,000 | X
Junqué de Fortuny et al. (2014) | 858,703 | 108,753 | X X X
Chen et al. (2009) | 500,000,000 | 150,000 | X
Bannur (2011) | 60,000 | 153,918 | X X X
Chen et al. (2011) | 400,000 | 2,909,162 | X
Martens et al. (2016) | 1,200,000 | 3,200,000 | X X X
Alrajeh et al. (2014) | 2,156,517 | 3,231,961 | X
Kazemian and Ahmed (2015) | 2,156,517 | 3,231,961 | X X X X
Ma et al. (2009b) | 2,156,517 | 3,231,961 | X X X
Shah (2014) | 2,156,517 | 3,231,961 | X
De Cnudde et al. (2015) | 5,000 | 4,122,418 | X X
Yu et al. (2010) | 8,407,752 | 20,216,830 | X X
Stankova et al. (2014) | 8,407,752 | 20,216,830 | X X
Agarwal et al. (2014) | 2,300,000,000 | 16,777,216 | X
Pandey et al. (2011) | 40,000,000 | ? | X
Perlich et al. (2014) | ? | ? | X X

Number of wins: PSN 4, NB 4, RBF-SVM 3, LIN-SVM 11, LR-BGD 7, LR-SGD 2, RF 1, Tree 0, Boost 0, kNN 0, LPR 1
Total count: PSN 4, NB 7, RBF-SVM 4, LIN-SVM 17, LR-BGD 7, LR-SGD 3, RF 1, Tree 1, Boost 1, kNN 1, LPR 1

Table 1: Overview of behavioral predictive literature. n is the number of instances, m is the number of features. (Abbreviations of the techniques: PSN = pseudo social network, NB = naive Bayes, RBF-SVM = support vector machine with RBF kernel, LIN-SVM = linear support vector machine, LR-BGD = logistic regression with batch gradient descent, LR-SGD = logistic regression with stochastic gradient descent, RF = random forest, Tree = decision tree, Boost = boosting method, kNN = k-nearest neighbor, LPR = logistic Poisson regression.)


From Table 1, it can be observed that some consensus seems to exist in prior work regarding the best selection: linear SVMs are most frequently used, along with L2 regularization. The papers specifically mention linear SVMs as fast and adequate in very high-dimensional contexts. Next, naive Bayes and logistic regression are most frequently used. For naive Bayes, many papers mention its speed and its performance on textual data as benefits. Logistic regression, interestingly, performs at least as well as or better than LIN-SVM in all reported comparisons. Lastly, the RBF-SVM is capable of finding non-linear patterns; most papers, however, criticize this technique for its lack of scalability. The last five techniques in Table 1 are used too seldom and are never compared with the more frequently used techniques; therefore they are not included in our analysis (see Table 2). Trees, for example, are not adept at handling high dimensionality: overfitting occurs and complex interactions between the features are ignored due to the division of the training space into mutually exclusive subspaces. In the search for nearest neighbors, the high dimensionality drastically magnifies the neighborhood search space for kNN, which greatly impedes the quest for similar data points. Finally, roughly half the papers use only one technique to analyze the data, while the other half employ several techniques followed by a comparison of the results. Most of these studies are performed from a data-centric perspective, starting from one or two data sets and selecting appropriate classification techniques. This is done either to demonstrate the predictive power present in a data set, to compare existing techniques or to benchmark a self-developed technique against state-of-the-art classifiers. However, most papers do not provide clear-cut explanations as to why a certain technique is chosen over others for analysis.

Existing research also demonstrates three main approaches taken to cope with the high dimensionality of the data (Provost and Kolluri 1999; Brain and Webb 2002; Yang and Wu 2006): (1) scaling up the classifiers (for example, Tsang et al. 2005; Collobert et al. 2006; Nie et al. 2014), (2) scaling down the data dimensionality (for example, Chang et al. 2010; De Bock and Van den Poel 2010; Tan et al. 2014), and (3) a hybrid approach (Kumar et al., 2014).

Combining these two findings from our literature study, i.e. the lack of consensus regarding which classification technique to employ in a behavioral data context, and the fact that classification algorithms were not particularly designed for and are not naturally fitted to such data, leads to gaps regarding both (1) the selection of an appropriate technique and (2) its robustness for predictive purposes. This is where this paper attempts to contribute. Regarding the first gap, benchmarking studies are a useful means to compare the performance of a collection of techniques. Comparing these techniques in a systematic manner results in statistically sound conclusions on the one hand, and on the other hand leads to practical guidelines directing data miners to an appropriate classification technique suited to their needs. We present a benchmarking study in the tradition of Forman (2003) and Fernández-Delgado et al. (2014), and follow the advice of Demšar (2006), among others. In the past, large-scale benchmarking studies of data mining algorithms have been performed (for example, Michie et al. 2009; King et al. 1995; Lim et al. 2000; Meyer et al. 2002; Fernández-Delgado et al. 2014). Also, benchmarking is often carried out between two or more techniques, investigating when which technique performs better (for example, Langley et al. 1992; Ralaivola and d'Alché-Buc 2001; Huang et al. 2003; Perlich et al. 2003). However, to our knowledge, no comprehensive comparative study has yet been done focusing specifically on massive, sparse behavioral data, which are becoming increasingly common in applications of machine learning.


MN-NB: Multinomial naive Bayes
MV-NB: Multivariate naive Bayes
LA-SVM-L2: Least absolute errors support vector machine with a linear kernel and L2 regularization
LS-SVM-L1: Least square errors support vector machine with a linear kernel and L1 regularization
LS-SVM-L2: Least square errors support vector machine with a linear kernel and L2 regularization
PSN: Relational classifier with bigraphs
LR-BGD-L1: Batch gradient descent logistic regression with L1 regularization
LR-BGD-L2: Batch gradient descent logistic regression with L2 regularization
LR-SGD-L1: Stochastic gradient descent logistic regression with L1 regularization
LR-SGD-L2: Stochastic gradient descent logistic regression with L2 regularization
RBF-SVM: Support vector machine with Gaussian kernel

Table 2: The classification techniques studied.


Regarding the second gap, we also study the resilience of the classification algorithms under varying training set sizes. This is done with learning curves, which investigate the impact of data size on classification performance (Perlich et al., 2003). From this analysis, conclusions can be drawn regarding the extent to which the techniques scale up in terms of performance for increasing data set size, in both the instance and the feature dimension. There is value in understanding if and when more data leads to better predictive performance, for example for organizations deciding on their investment in collecting and storing even more data. A starting point for learning curve analysis on behavioral data was given in Junqué de Fortuny et al. (2013), which we expand in a systematic manner. Importantly, these curves show strikingly different behavior from learning curve studies on more traditional data (Perlich et al., 2003).

In summary, the contributions of this paper are as follows:

I We perform a comparative analysis of state-of-the-art classification techniques on behavioral data sets. Both the predictive and the computational performance of the techniques are compared for significant differences. Subsequently, recommendations are formulated aimed at guiding analysts' choice of a predictive technique when confronted with behavioral data.

II We also assess the predictive value of behavioral data depending on two different data modeling schemes. An analysis is performed regarding the (un)importance of the strength of a behavioral action (binary vs. numeric data) for the analyzed techniques. Hence, guidance is offered regarding how to model behavioral data so as to reach optimal performance.

III The third contribution is a learning curve analysis of the classification techniques along different dimensions, such that performance patterns become clear under changing data set sizes. The results of this analysis lead to a clear view regarding the relevance of collecting more data from a predictive performance viewpoint.

The remainder of this work is organized as follows. The following section first gives an overview of the behavioral data sets analyzed in this study, and then sets out the measures of performance for classification techniques. Section 3 provides insight into the set-up of the benchmarking study, thus addressing the reproducibility of this work. The results are presented and discussed in Section 4. Finally, we conclude with some general remarks and further research avenues in Section 5. In Appendix A, details of the classification techniques are given, along with information regarding implementation and computational complexity.

2 Components of the Benchmarking Study

2.1 Data

First, a notation is established which will be used throughout this work. A behavioral data set X consists of n data points x_i, with i = 1, ..., n and x_i ∈ R^m. The high-dimensional x_i represent the behavior of an instance i through fine-grained behavioral features j. When modeling behavior in a binary manner, x_{i,j} ∈ {0, 1}. Binary behavior can also be enriched with more detailed information, in which case x_{i,j} ∈ N. This information might refer to frequency (for example in the case of visiting behavior) or preference (for example in the case of rating data). In this classification setting, Y models the target variable to be predicted and is a vector of size n with y_i ∈ {−1, +1}.
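To make this notation concrete, the following is a minimal sketch (in Python with SciPy, which the paper does not prescribe; all names and values are illustrative) of how such a sparse behavioral data set is typically stored:

    import numpy as np
    from scipy.sparse import csr_matrix

    # Three instances (users) and five behavioral features (e.g., locations).
    # Row i holds the fine-grained behavior x_i; most entries are zero (sparse).
    rows = [0, 0, 1, 2, 2]      # instance indices i
    cols = [1, 3, 0, 2, 4]      # feature indices j
    vals = [1, 1, 1, 1, 1]      # binary behavior: x_ij in {0, 1}
    X = csr_matrix((vals, (rows, cols)), shape=(3, 5))

    y = np.array([+1, -1, +1])  # binary target, y_i in {-1, +1}

    # For numeric behavior (frequency or preference), vals would hold counts
    # or ratings instead of ones.
    print(X.toarray())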

In search of a representative collection of real-life behavioral data sets, the following online data repositories were consulted: the UCI Machine Learning Repository [1], Yahoo Labs [2], the Stanford Large Network Dataset Collection [3], KDnuggets [4], Kaggle [5], Amazon Web Services data sets [6], the Koblenz network collection (KONECT) [7] and the Max Planck Institute for Software Systems [8]. Also, proprietary data sets from prior studies are used, which were obtained from the online world, the public sector and the banking sector. When available, both binary and numeric behavior are used.

In total, 43 behavioral data sets are used, originating from 17 real-world problems. The MovieLens data set [9] contains movie-rating data from users. Based on these ratings, a prediction is made concerning the gender and age of a user. Two versions are available: one with 100,000 non-zero (or active) elements and one with 1,000,000 active elements. The latter is also used to predict the genre of the movies based on users' ratings; eighteen data sets are constructed in order to translate this multi-class problem into binary problems. The Internet Advertising data set (Lichman, 2013) predicts whether an image is an advertisement based on image features. Yahoo Labs [10] makes available the YahooMovies data set, which contains movie-rating data analogous to the MovieLens data set; here, also the gender and age of the users are predicted. The Ecommerce data set originates from the PAKDD 2015 challenge, with the goal of predicting gender based on product viewing data on an e-commerce website [11].

[1] See http://archive.ics.uci.edu/ml.
[2] See http://webscope.sandbox.yahoo.com.
[3] See http://snap.stanford.edu/data.
[4] See http://www.kdnuggets.com.
[5] See http://www.kaggle.com.
[6] See http://aws.amazon.com.
[7] See http://konect.uni-koblenz.de.
[8] See http://socialnetworks.mpi-sws.org/datasets.html.
[9] See http://grouplens.org.
[10] See http://webscope.sandbox.yahoo.com.


Next, the TaFeng data set contains shopping transactions of users, and the goal is to predict the users' age (Huang et al., 2005). In the BookCrossing data set, books are rated by members of the BookCrossing community, and based on these ratings the age of the user is predicted (Ziegler et al., 2005). The LibimSeTi data set contains ratings of dating profiles by users of the dating service LibimSeTi (Brozovsky and Petricek, 2007); based on these profile ratings, the gender of the user is inferred. The KDD Cup 2015 challenge aspires to predict MOOC dropout on the online learning platform XuetangX based on prior online course behavior. The A-Card data sets consist of user-visiting behavior from a city loyalty card, on which three predictions are made (De Cnudde and Martens, 2015). First, cashout prediction consists of predicting whether a user will trade collected points for a benefit. Second, an assertion is made with respect to the user becoming inactive, which is referred to as defect prediction. Third, for each user and five locations, a prediction is made whether that location will be visited in the near future. The Fraud data set consists of transactional information concerning payments between Belgian and foreign companies, and attempts to predict whether a company is involved in fraudulent activities (Junqué de Fortuny et al., 2014). In Martens et al. (2016), the Banking data set is constructed by collecting debit transactions from customers of a bank; with this payment data, a prediction is made concerning the possible purchase of a financial product offered by the bank. In Ma et al. (2009b), the URL data set is constructed; based on features of URLs, an assertion is made whether a URL is malicious or not. The goal of the KDDa data set (Yu et al., 2010) from the 2010 KDD Cup challenge is to predict the performance of students on an algebraic test based on their past performance. In the Flickr data set, the transactions consist of users tagging pictures as their 'favorite', and the goal is to predict the number of comments a picture receives (Cha et al., 2009). For the proprietary Car data set, predictions regarding interest in a car advertisement are made based on users' web visiting behavior.

Table 3 summarizes some general characteristics of the data sets. Judging from this summary, a great variety of data sets is present in terms of size (both in the instance and in the feature dimension), the nature of the predicted variable, the n-m relation (n ≫ m, n ≪ m and n ≈ m) and the balance b. Since these are real-life data sets, in most cases the distribution of the classes is unbalanced (Yang and Wu, 2006). The Fraud data set shows the highest imbalance, as the number of fraudulent organizations in comparison to the number of non-fraudulent organizations usually is very low (Liu et al., 2007).

As mentioned and as demonstrated in Table 3, the sparsity ρ of behavioral data sets is extreme due to limited behavioral capital (Junqué de Fortuny et al., 2013). Figures 1 and 2 show the probability distributions of the number of features per instance (Figure 1) and the number of instances per feature (Figure 2), which we refer to here as the sparsity distributions. These distributions confirm the limited behavioral capital claim. It is clear from the sparsity distributions for the instances (Figure 1) that most instances have a low number of active (non-zero) features. From the tail of the distributions, it can be observed that instances with a large number of active features are much less frequent. Conversely, looking at the sparsity distributions of the features in Figure 2, the probability of a feature being present in many instances' behaviors is also low. This makes sense when looking at the Banking data set, for example: the majority of users have payment transactions with only a small fraction of all possible payment receivers, and the majority of payment receivers have payment relations with only a small fraction of all clients of the bank.

[11] See https://knowledgepit.fedcsis.org.


Data set | Target variable | Binary | Numeric | n | m | m̄ | ρ | b
MovieLens100k | age | X | X | 943 | 1,682 | 100,000 | 93.6953% | 42.31
MovieLens100k | gender | X | X | 943 | 1,682 | 100,000 | 93.6953% | 28.95
MovieLens1m | age | X | X | 6,040 | 3,883 | 1,000,209 | 95.7353% | 43.36
MovieLens1m | gender | X | X | 6,040 | 3,883 | 1,000,209 | 95.7353% | 28.29
Yahoo Movies | age | X | X | 7,642 | 106,363 | 221,330 | 99.9727% | 21.09
Yahoo Movies | gender | X | X | 7,642 | 106,363 | 221,330 | 99.9727% | 28.87
MovieLens10m | action | X | X | 10,681 | 69,878 | 10,000,053 | 98.6602% | 13.79
MovieLens10m | adventure | X | X | 10,681 | 69,878 | 10,000,053 | 98.6602% | 5.36
MovieLens10m | animation | X | X | 10,681 | 69,878 | 10,000,053 | 98.6602% | 1.51
MovieLens10m | children | X | X | 10,681 | 69,878 | 10,000,053 | 98.6602% | 1.75
MovieLens10m | comedy | X | X | 10,681 | 69,878 | 10,000,053 | 98.6602% | 28.29
MovieLens10m | crime | X | X | 10,681 | 69,878 | 10,000,053 | 98.6602% | 5.50
MovieLens10m | documentary | X | X | 10,681 | 69,878 | 10,000,053 | 98.6602% | 4.09
MovieLens10m | drama | X | X | 10,681 | 69,878 | 10,000,053 | 98.6602% | 29.57
MovieLens10m | fantasy | X | X | 10,681 | 69,878 | 10,000,053 | 98.6602% | 0.43
MovieLens10m | film noir | X | X | 10,681 | 69,878 | 10,000,053 | 98.6602% | 0.24
MovieLens10m | horror | X | X | 10,681 | 69,878 | 10,000,053 | 98.6602% | 5.10
MovieLens10m | musical | X | X | 10,681 | 69,878 | 10,000,053 | 98.6602% | 0.41
MovieLens10m | mystery | X | X | 10,681 | 69,878 | 10,000,053 | 98.6602% | 0.43
MovieLens10m | romance | X | X | 10,681 | 69,878 | 10,000,053 | 98.6602% | 0.56
MovieLens10m | sci-fi | X | X | 10,681 | 69,878 | 10,000,053 | 98.6602% | 0.66
MovieLens10m | thriller | X | X | 10,681 | 69,878 | 10,000,053 | 98.6602% | 1.23
MovieLens10m | war | X | X | 10,681 | 69,878 | 10,000,053 | 98.6602% | 0.19
MovieLens10m | western | X | X | 10,681 | 69,878 | 10,000,053 | 98.6602% | 0.86
Ecommerce | gender | X | | 15,000 | 21,880 | 33,455 | 99.9898% | 21.98
TaFeng | age | X | | 31,640 | 23,719 | 723,449 | 99.9036% | 39.67
BookCrossing | age | X | X | 167,175 | 337,921 | 838,364 | 99.9985% | 29.04
LibimSeTi | gender | X | X | 137,806 | 220,970 | 15,656,500 | 99.9486% | 44.53
KDD2015 | MOOC dropout | X | X | 120,542 | 5,891 | 1,919,150 | 99.7300% | 20.71
A-Card | cashout | X | X | 177,761 | 2,448 | 435,244 | 99.9000% | 6.71
A-Card | defect | X | X | 177,761 | 2,448 | 435,244 | 99.9000% | 13.20
A-Card | Permeke | X | X | 177,761 | 2,448 | 435,244 | 99.9000% | 7.29
A-Card | Wezenberg | X | X | 177,761 | 2,448 | 435,244 | 99.9000% | 2.18
A-Card | MAS | X | X | 177,761 | 2,448 | 435,244 | 99.9000% | 1.82
A-Card | Roma | X | X | 177,761 | 2,448 | 435,244 | 99.9000% | 0.96
A-Card | Zoo | X | X | 177,761 | 2,448 | 435,244 | 99.9000% | 0.85
Fraud | fraudulent | X | | 858,131 | 107,345 | 1,955,912 | 99.9979% | 0.0064
Banking | interest in product | X | | 1,204,726 | 3,192,554 | 20,914,516 | 99.9995% | 0.35
KDDa | task performance | X | | 8,407,752 | 20,216,830 | 305,613,510 | 99.9998% | 14.70
Car | interest in ad | X | | 9,108,905 | 2,936,810 | 65,464,708 | 99.9998% | 0.70
Flickr | comments | X | | 11,195,144 | 497,472 | 34,645,469 | 99.9994% | 27.05

Table 3: General characteristics of the data sets (ordered by ascending n): the target variable being predicted, whether binary and numeric versions are available, the number of instances n, the number of features m, the number of active elements m̄, the sparsity ρ defined as ρ = 1 − m̄/(n · m), and the balance b (the percentage of positive instances in the target variable).
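As an illustration of the quantities in Table 3, the active-element count, sparsity and balance can be computed directly from a sparse data matrix. A minimal sketch in Python (the randomly generated matrix is a stand-in, not one of the paper's data sets):

    import numpy as np
    from scipy.sparse import random as sparse_random

    # Toy stand-in for a behavioral data set, in the notation of Table 3:
    # n instances, m features, m_bar active (non-zero) elements.
    X = sparse_random(1000, 5000, density=0.001, format="csr")
    y = np.random.choice([-1, +1], size=1000, p=[0.9, 0.1])

    n, m = X.shape
    m_bar = X.nnz                      # number of active elements
    rho = 1 - m_bar / (n * m)          # sparsity, as defined in Table 3
    b = 100 * np.mean(y == +1)         # balance: % of positive instances

    print(f"n={n}, m={m}, active={m_bar}, sparsity={rho:.4%}, balance={b:.2f}%")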


[Figure 1 omitted: 17 log-log panels (MovieLens100k, InternetAdvertising, MovieLens1m, YahooMovies, MovieLens10m, Ecommerce, TaFeng, BookCrossing, LibimSeTi, KDD2015, A-Card, Fraud, Banking, URL, KDDa, Car, Flickr), each plotting probability against the number of features per instance.]

Figure 1: Sparsity distributions for the number of features per instance.

[Figure 2 omitted: 13 log-log panels (MovieLens100k, InternetAdvertising, MovieLens1m, YahooMovies, MovieLens10m, Ecommerce, TaFeng, BookCrossing, LibimSeTi, KDD2015, A-Card, Fraud, Flickr), each plotting probability against the number of instances per feature.]

Figure 2: Sparsity distributions for the number of instances per feature.


2.2 Performance Measures

2.2.1 Area under ROC-curve (AUC)

Accuracy is a fairly intuitive and often-used measure of performance in comparative studies (King et al. 1995; Lim et al. 2000): it expresses the percentage of correctly predicted instances (Fawcett, 2006). However, it is influenced by class imbalance. Since the bulk of the data sets in this study come from real-life classification tasks and demonstrate imbalance (Table 3) (Vanhoeyveld and Martens, 2017), accuracy is not an adequate measure (Provost et al., 1998).

Therefore, the AUC measure is used, which refers to the Area Under the ROC Curve. The ROC (Receiver Operating Characteristic) space is used to plot the performance of classifiers in terms of the true positive rate (TPR) and the false positive rate (FPR), on the Y-axis and the X-axis respectively. The curve is constructed by ranking the classifier's prediction scores for the data points in the test set in descending order while iteratively lowering the threshold for classifying an instance as positive. The AUC value is a summarizing scalar representing the area under this performance curve (Fawcett, 2006). It thus expresses the model's ability to rank instances in descending order of their prediction score or, in other words, the probability that the classifier ranks a randomly chosen positive instance higher than a randomly chosen negative instance. AUC normally takes values in [0, 1], but here we express it as a percentage: an AUC of 50 corresponds to a model performing no better than random, and a perfect model has an AUC of 100.
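Following this rank interpretation, AUC can be computed directly from the prediction scores. A minimal sketch (Python with NumPy; illustrative, not the paper's implementation):

    import numpy as np

    def auc(scores, labels):
        """AUC as the probability that a randomly chosen positive instance
        is scored higher than a randomly chosen negative one (ties count 1/2)."""
        scores = np.asarray(scores, dtype=float)
        labels = np.asarray(labels)
        pos = scores[labels == +1]
        neg = scores[labels == -1]
        # Compare every positive score with every negative score.
        wins = (pos[:, None] > neg[None, :]).sum()
        ties = (pos[:, None] == neg[None, :]).sum()
        return 100 * (wins + 0.5 * ties) / (len(pos) * len(neg))  # as a percentage

    print(auc([0.9, 0.8, 0.3, 0.1], [+1, -1, +1, -1]))  # 75.0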

2.2.2 Statistical Significance Test

Two statistical tests are used in order to single out an algorithm or a group of algorithms as better or best performing: the Wilcoxon signed-rank test and the Friedman test, both recommended by Demšar (2006). The former compares two treatments over a collection of data sets (used here to contrast binary versus numeric data); the latter compares a collection of treatments (used here to compare all classifiers).

The Wilcoxon signed-rank test is a non-parametric test which first computes the absolute differences in performance between two treatments over a collection of data sets. These differences are ranked and summarized in two variables R+ and R−, representing the sums of ranks on which the second treatment, respectively the first treatment, performs better. The lowest value T = min(R+, R−) is compared to a Wilcoxon critical value. If T is equal to or lower than this critical value, the null hypothesis stating that the two treatments perform equally can be rejected and a significant difference is found.
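As a sketch of how this test is applied here (assuming Python with SciPy; the paired AUC values are made up for illustration):

    import numpy as np
    from scipy.stats import wilcoxon

    # Paired AUCs of one classifier under two treatments (e.g., binary vs.
    # numeric versions of the same data sets); values are illustrative.
    auc_binary  = np.array([74.5, 80.1, 66.3, 91.2, 59.8, 83.4])
    auc_numeric = np.array([73.9, 81.0, 64.8, 90.7, 61.2, 82.9])

    # Wilcoxon signed-rank test on the paired differences; the statistic
    # is T = min(R+, R-) as defined above.
    stat, p = wilcoxon(auc_binary, auc_numeric)
    print(f"T = {stat:.1f}, p = {p:.3f}")  # reject H0 (equal performance) if p < alpha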

In the Friedman test, the performance values of the methods are ranked for each data set separately. The average rank of each algorithm is calculated as

$$AR_j = \frac{1}{N} \sum_i r_i^j,$$

with $r_i^j$ the rank of the j-th algorithm on the i-th data set. The Friedman statistic is defined as

$$\chi_F^2 = \frac{12N}{K(K+1)} \left[ \sum_j AR_j^2 - \frac{K(K+1)^2}{4} \right],$$


with N the number of data sets and K the number of algorithms. Iman and Davenport (1980) state that this χ² approximation results in an overly conservative statistic with too small a critical region, and present an updated approximation

$$F_F = \frac{(N-1)\,\chi_F^2}{N(K-1) - \chi_F^2},$$

distributed according to an F-distribution with (K − 1) and (K − 1)(N − 1) degrees of freedom. This value is compared to a critical value of the F-distribution at significance level α, resulting in either accepting or rejecting the null hypothesis that all algorithms are equivalent. In the latter case, the Nemenyi post-hoc test is performed. This test defines a critical difference

$$CD = q_\alpha \sqrt{\frac{K(K+1)}{6N}},$$

with $q_\alpha$ a critical value based on the Studentized range statistic divided by $\sqrt{2}$. Two classifiers demonstrate significantly different performance if their average ranks differ by more than the critical difference.
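The full procedure, from per-data-set ranks to the Nemenyi critical difference, can be sketched as follows (Python with NumPy/SciPy; the AUC matrix is an illustrative placeholder, and q_alpha is taken from Demšar's (2006) table for K = 3 at α = 0.05):

    import numpy as np
    from scipy.stats import rankdata, f as f_dist

    # Rows = data sets (N), columns = algorithms (K); AUC values are made up.
    aucs = np.array([[74.5, 73.9, 70.2],
                     [80.1, 81.0, 79.5],
                     [66.3, 64.8, 67.0],
                     [91.2, 90.7, 89.9]])
    N, K = aucs.shape

    # Rank per data set (rank 1 = best, i.e., highest AUC).
    ranks = np.apply_along_axis(lambda row: rankdata(-row), 1, aucs)
    AR = ranks.mean(axis=0)                    # average rank AR_j per algorithm

    chi2_F = 12 * N / (K * (K + 1)) * (np.sum(AR ** 2) - K * (K + 1) ** 2 / 4)
    F_F = (N - 1) * chi2_F / (N * (K - 1) - chi2_F)   # Iman-Davenport correction
    p = f_dist.sf(F_F, K - 1, (K - 1) * (N - 1))

    # Nemenyi critical difference; 2.343 is the two-tailed critical value
    # q_alpha for K = 3 at alpha = 0.05 from Demsar (2006).
    CD = 2.343 * np.sqrt(K * (K + 1) / (6 * N))
    print(f"F_F = {F_F:.3f}, p = {p:.3f}, CD = {CD:.3f}, AR = {AR}")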

2.2.3 Learning Curves

Learning curves show performance variations of learning algorithms as a function of the size of the training set. The goal is to gain insight into the performance generalization of an algorithm with respect to data set size (Perlich et al., 2003). The performance values in terms of AUC (which assume independence of data size) now gain a substantial level of detail, as performance is compared across techniques and across data set sizes (Fernández-Delgado et al. 2014; Macia and Bernadó-Mansilla 2014).

Concretely, a learning curve plots performance as a function of the training set size, generally on a logarithmic scale. AUC is used as the performance measure, and the training set size is varied separately in the instance and feature dimensions. For the instance dimension, increasing samples are drawn uniformly at random from the original training data. For the feature dimension, learning curves are built in two ways. First, increasing samples of the features are drawn uniformly at random; these learning curves assess performance variations over the number of features, regardless of their predictive value. Secondly, the information value of each feature is determined, and learning curves are built by taking increasing feature samples in order of descending information value. This approach enables us to relate performance variations to the importance of the features, and to assess the relevance of many fine-grained features for predictive performance. The information value of a feature can be assessed by a plethora of metrics (Forman 2003; Guyon and Elisseeff 2003). We employ the information gain metric here, as it is a fairly quick and accurate way to determine the value of individual features (Forman 2003). Information gain models the reduction of entropy in the predictive variable brought along by the presence of a feature f:

$$\text{Information Gain}(f) = H(Y) - H(Y \mid f),$$

with Y the classification values of a training set, f a feature, and H(Y) the entropy of Y.
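A minimal sketch of this computation for a single binary feature column (Python with NumPy; illustrative, not the paper's implementation):

    import numpy as np

    def entropy(y):
        """Shannon entropy H(Y) of a label vector."""
        _, counts = np.unique(y, return_counts=True)
        p = counts / counts.sum()
        return -np.sum(p * np.log2(p))

    def information_gain(f, y):
        """IG(f) = H(Y) - H(Y|f) for one binary feature column f."""
        ig = entropy(y)
        for value in (0, 1):
            mask = (f == value)
            if mask.any():
                ig -= mask.mean() * entropy(y[mask])  # weighted conditional entropy
        return ig

    f = np.array([1, 1, 0, 0, 1, 0])
    y = np.array([+1, +1, -1, -1, +1, -1])
    print(information_gain(f, y))  # 1.0: the feature perfectly separates the classes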


3 Experimental Set-Up

Before training the model, a third of the training set (sampled uniformly at random) is set aside as a validation set for parameter selection. Model selection with grid search is performed to find an optimal value for the regularization parameter C (for logistic regression and the SVMs) and the kernel parameter γ for the RBF-SVM (Hsu et al., 2003). An initial grid ([2^−5, 2^−3, ..., 2^15] for C and [2^−15, 2^−13, ..., 2^3] for γ) (Hsu et al., 2003) is explored; that is, models are constructed with the grid parameter values and tested on the validation set. Based on the best-performing model, a new grid is built around the best value. These grids are iteratively refined (up to three times), and the best resulting value is finally used to build the classification model on the training set.
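A minimal sketch of this iterative refinement (Python; `validation_auc` is a hypothetical stand-in for training a model with a given C and scoring it on the validation set, and is not part of the paper):

    import numpy as np

    def refine_grid(validation_auc, grid, rounds=3):
        """Grid search as described above: explore a grid of C values, then
        rebuild a finer grid (in log2 space) around the best one, up to
        `rounds` times."""
        best = None
        for _ in range(rounds):
            scores = {C: validation_auc(C) for C in grid}  # train + score per C
            best = max(scores, key=scores.get)
            exp = np.log2(best)
            grid = 2.0 ** np.linspace(exp - 1, exp + 1, num=5)
        return best

    # Initial grid from Hsu et al. (2003): C in [2^-5, 2^-3, ..., 2^15].
    initial_C = 2.0 ** np.arange(-5, 16, 2)

    # Dummy validation curve peaking near C = 2^3.1, for illustration only.
    peak = lambda C: -(np.log2(C) - 3.1) ** 2
    print(refine_grid(peak, initial_C))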

Since multivariate naive Bayes and PSN expect binary data, the numeric information is modeled in a binary fashion through unary encoding. This so-called thermometer code translates non-negative, numeric feature values x_{i,j} with maximum range R into x_{i,j} ones followed by (R − x_{i,j}) zeros (a feature with value 2 out of 5 is translated into 5 features as follows: 11000). In theory, this increases the dependency between the features and violates the naive Bayes independence assumption. However, the approach has been shown to result in good predictions even with dependent features (Hand and Yu, 2001). This unary expansion results in higher-dimensional data sets for the numeric analysis of both PSN and MV-NB.
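A minimal sketch of this encoding for one numeric feature column (Python with NumPy; illustrative):

    import numpy as np

    def thermometer_encode(column, R):
        """Unary (thermometer) encoding as described above: a value v with
        maximum range R becomes v ones followed by (R - v) zeros."""
        column = np.asarray(column, dtype=int)
        # Row i gets ones in the first column[i] positions of its R slots.
        return (np.arange(R)[None, :] < column[:, None]).astype(int)

    ratings = [2, 5, 0]                 # e.g., movie ratings out of 5
    print(thermometer_encode(ratings, R=5))
    # [[1 1 0 0 0]
    #  [1 1 1 1 1]
    #  [0 0 0 0 0]]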

Dependence of the results on data sampling is a relevant issue in benchmark study design, impacting the reliability of the study (Fernández-Delgado et al. 2014; Macia and Bernadó-Mansilla 2014). Therefore, k-fold cross-validation is used, which determines k disjoint partitions by sampling uniformly at random from the entire data set, using each partition once as test set and the remaining k − 1 as training set. Commonly, k is set to 10, which has been shown sufficient to reduce bias and variance (Kohavi, 1995). The folds are identical for all classifier executions, so that sound comparisons across classifiers can be made. Moreover, this fulfills the necessary conditions for stable results as mentioned in Demšar (2006); the statistical tests demand 'reliable estimates of the classifier's performance'.
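A minimal sketch of fixing the folds once and reusing them across classifiers (Python with scikit-learn, which the paper does not prescribe; the data and classifiers are illustrative stand-ins, with BernoulliNB implementing the multivariate Bernoulli event model of MV-NB):

    import numpy as np
    from sklearn.model_selection import KFold
    from sklearn.metrics import roc_auc_score
    from sklearn.linear_model import LogisticRegression
    from sklearn.naive_bayes import BernoulliNB

    # Toy binary behavioral data as a stand-in for the paper's data sets.
    rng = np.random.default_rng(0)
    X = (rng.random((200, 50)) < 0.05).astype(int)
    y = rng.choice([0, 1], size=200)

    # Determine the 10 folds once and reuse them for every classifier,
    # so the per-fold AUCs are directly comparable across methods.
    folds = list(KFold(n_splits=10, shuffle=True, random_state=42).split(X))

    for name, clf in [("LR-L2", LogisticRegression(penalty="l2", max_iter=1000)),
                      ("MV-NB", BernoulliNB())]:
        aucs = [roc_auc_score(y[te], clf.fit(X[tr], y[tr]).predict_proba(X[te])[:, 1])
                for tr, te in folds]
        print(name, 100 * np.mean(aucs))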

The learning curves are built as follows (a code sketch follows the list):

1. For the learning curves in the instance dimension: for each of the ten cross-validation folds, repeatedly take random subsamples of the training set with an increasing n_l value (n_l ∈ {1, ..., n}).

2. For the learning curves in the feature dimension (random features): for each of the ten cross-validation folds, repeatedly take random feature subsamples of the training set with an increasing m_l value (m_l ∈ {1, ..., m}). Adjust the corresponding test set according to the selected features.

3. For the learning curves in the feature dimension (feature selection): for each of the ten cross-validation folds, repeatedly take feature subsamples of the training set in order of descending information value with an increasing m_l value (m_l ∈ {1, ..., m}). Adjust the corresponding test set according to the selected features.
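For instance, step 1 might look like the following sketch for a single fold (Python with scikit-learn; the classifier, subsample sizes and data are illustrative stand-ins, not the paper's set-up):

    import numpy as np
    from sklearn.linear_model import LogisticRegression
    from sklearn.metrics import roc_auc_score

    def instance_learning_curve(X_tr, y_tr, X_te, y_te, sizes, rng):
        """Learning curve in the instance dimension for one CV fold:
        AUC as a function of an increasing random training subsample n_l."""
        curve = []
        for n_l in sizes:
            idx = rng.choice(len(y_tr), size=n_l, replace=False)  # uniform subsample
            clf = LogisticRegression(max_iter=1000).fit(X_tr[idx], y_tr[idx])
            auc = roc_auc_score(y_te, clf.predict_proba(X_te)[:, 1])
            curve.append((n_l, 100 * auc))
        return curve

    rng = np.random.default_rng(0)
    X = rng.random((2000, 30))
    y = (X[:, 0] + 0.3 * rng.random(2000) > 0.65).astype(int)
    X_tr, y_tr, X_te, y_te = X[:1500], y[:1500], X[1500:], y[1500:]
    print(instance_learning_curve(X_tr, y_tr, X_te, y_te,
                                  sizes=[50, 100, 500, 1500], rng=rng))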

The analyses were performed on an Intel i7 processor with 4 physical cores, a 3.40 GHz clock rate and 16 GB of RAM.


Among the investigated techniques, RBF-SVM scales between O(n²) and O(n³) in time complexity, as mentioned in Appendix A. This clearly is not scalable to the sizes of many of these data sets. Therefore, for the largest dimensions (starting from BookCrossing in Table 3), the entire data set could not be used when comparing AUC and time performance; a random subsample of size 2^15 is used as a proxy for these data sets.

4 Experimental Results

As many results arise from the performed analyses, we highlight the most important and relevant ones and attempt to place them in a general context, referring to and comparing with existing literature. We first discuss the results that are independent of training size variations, followed by the analysis of the learning curves.

4.1 Performance Analysis

The analysis of performances is divided into the following parts: (1) comparison of AUC on all data sets, (2) statistical comparison of the classification and time performance, and (3) comparison of the effect of binary versus numeric data on AUC and time.

4.1.1 Comparison of AUC on All Data Sets

Table 4 and Table 5 report the AUC values for all binary and numeric data sets, respectively (ordered by ascending maximum AUC value). For each data set, the best AUC is denoted in boldface. Also, the average rank per technique is shown, where for each data set rank 1 is given to the best and rank 11 to the worst performing technique. For each algorithm, the number of times it performs best is also given. The results for RBF-SVM are reported somewhat separately because the technique is not always run on the entire data set.

For the binary data sets in Table 4, LR-BGD-L2 and PSN perform best. Moreover, they seem to be complementary to one another: when one of them performs best, the other belongs to the worst performing techniques. We attribute this to the level of imbalance and confirm this in a further meta-analysis. MV-NB performs better than MN-NB, which can be explained by their event models. The multivariate model assumes each feature is generated by independent Boolean draws and models the presence and absence of features. The multinomial model, however, models frequencies of features and assumes the features to be drawn independently and with replacement from the collection of all features. The latter is less suited to this binary context (Junqué de Fortuny et al., 2013), as the results demonstrate here. Also, for binary and numeric text data, we find that the multivariate model is mostly used with the former and the multinomial one with the latter (McCallum and Nigam, 1998). The techniques optimized with L2 regularization (LA-SVM-L2, LS-SVM-L2, LR-BGD-L2 and LR-SGD-L2) perform better than their counterparts with L1 regularization. This confirms the findings of Zhu et al. (2003) and Bannur (2011). Moreover, Ng (2004) theoretically shows that due to the rotationally-invariant nature of L2 regularization, it is better suited than its L1 counterpart in a context with many relevant features. This implies that many features in the high-dimensional behavioral context contribute to the prediction, which harms the performance of the feature-selection-like L1 regularization. This is similar to the findings of Clark and Provost (2016).


Data set MN-NB MV-NB LA-SVM-L2 LS-SVM-L1 LS-SVM-L2 PSN LR-BGD-L1 LR-BGD-L2 LR-SGD-L1 LR-SGD-L2 RBF-SVM
BookCrossing 56.17 56.05 56.16 54.46 56.24 53.54 54.78 56.25 55.10 55.13 53.48
YahooMovies_age 59.55 65.20 63.06 63.82 64.57 65.28 63.54 64.84 61.52 61.55 71.94
banking 54.79 68.17 54.25 53.40 54.62 67.08 53.73 54.95 66.28 67.24 53.45
TaFeng 71.92 70.58 70.57 69.52 71.19 69.86 69.80 71.36 65.77 66.05 63.04
MovieLens_crime 74.59 72.30 75.09 75.07 75.83 76.33 75.16 76.13 72.52 73.07 68.34
A-Card_Goto_MAS 63.73 76.50 68.58 63.36 63.27 75.51 64.93 64.97 75.55 76.50 64.78
MovieLens_adventure 73.6 59.72 73.52 73.79 73.94 76.57 73.79 73.98 68.86 67.64 70.07
fraud 76.27 77.16 57.13 50.41 55.68 76.87 52.66 55.48 72.75 77.11 69.09
MovieLens_100k_gender 74.48 69.63 74.73 73.02 77.09 74.71 72.82 77.21 73.78 74.49 75.75
Car 57.71 77.52 69.98 77.14 74.21 71.06 75.53 77.70 72.31 72.09 55.83
MovieLens_fantasy 61.99 56.13 61.35 60.43 62.27 78.3 61.23 62.22 61.00 61.80 69.85
MovieLens_romance 65.43 73.06 70.16 70.37 69.44 79.56 71.25 68.20 69.31 69.28 61.81
Ecommerce 77.24 55.85 79.54 75.31 79.60 68.30 76.02 79.62 79.26 79.26 71.95
YahooMovies_gender 73.81 79.30 79.92 79.08 80.30 80.18 79.14 80.43 75.36 75.44 78.38
MovieLens_mystery 59.86 57.32 69.61 69.85 69.70 81.07 70.32 68.03 61.95 62.80 69.26
A-Card_Goto_Permeke 77.47 80.84 61.13 64.5 64.34 78.27 63.14 63.14 82.35 82.35 81.86
MovieLens_children 80.97 73.02 79.10 79.78 79.54 83.7 79.83 79.7 76.07 75.05 80.75
MovieLens_drama 73.65 65.60 82.18 82.76 83.27 76.52 82.84 83.74 79.35 79.17 70.07
MovieLens_thriller 70.00 73.27 74.53 75.16 75.03 84.04 76.75 75.08 72.97 73.45 70.31
kdd2015 70.79 81.14 77.95 79.19 79.25 83.73 79.84 79.89 84.43 83.71 81.57
kdda 78.91 78.32 81.17 84.44 83.92 79.82 84.33 85.50 82.28 78.88 70.15
MovieLens_1m_gender 80.54 76.84 84.05 83.83 84.83 81.33 84.17 85.20 80.55 80.56 82.57
A-Card_Goto_Wezenberg 79.74 84.57 53.25 56.88 57.00 84.92 55.19 55.21 85.36 85.40 79.37
A-Card_defect 51.41 85.50 76.53 75.49 75.14 78.68 70.93 75.39 81.68 81.72 65.65
flickr 77.77 85.99 73.48 76.77 76.22 76.63 77.08 76.97 84.38 84.25 80.15
MovieLens_comedy 77.02 77.82 84.45 85.03 85.84 78.21 85.29 86.13 82.19 82.09 74.82
A-Card_Goto_Roma 70.78 86.60 60.12 56.52 56.59 86.32 58.03 57.96 85.37 85.38 67.79
MovieLens_action 82.07 66.30 85.90 86.48 86.53 82.52 86.76 86.89 83.53 83.37 84.33
A-Card_Goto_Zoo 71.71 86.58 60.34 55.73 55.70 87.05 57.71 57.64 85.30 85.74 72.56
MovieLens_100k_age 79.43 77.20 87.09 84.17 87.71 80.21 85.08 87.95 84.10 83.33 81.61
MovieLens_animation 84.83 70.00 87.29 87.53 87.16 85.91 88.04 87.10 80.75 80.55 84.86
MovieLens_scifi 76.73 68.08 79.08 78.75 78.85 88.50 80.42 79.66 73.54 73.94 69.64
MovieLens_documentary 83.55 71.40 87.94 87.45 88.44 87.82 87.98 88.57 85.76 85.90 79.95
MovieLens_musical 80.50 72.05 81.25 79.61 80.85 90.34 79.43 80.53 75.84 76.08 73.31
MovieLens_1m_age 81.96 78.79 90.34 89.80 90.43 83.16 89.92 90.81 87.30 87.13 83.25
MovieLens_western 84.22 89.67 84.58 86.00 85.24 91.37 85.72 85.82 84.94 84.84 69.12
MovieLens_horror 90.88 88.80 91.14 91.01 91.08 91.08 91.46 91.27 90.01 89.81 88.62
A-Card_cashout 55.90 91.54 74.01 70.87 70.35 83.34 71.12 70.96 90.86 90.88 90.60
MovieLens_filmnoir 79.50 71.82 76.40 78.94 76.07 92.90 78.57 78.60 68.33 69.10 80.17
MovieLens_war 72.86 70.61 81.18 78.82 81.71 95.23 79.92 80.74 79.80 79.83 77.53
InternetAdvertising 95.57 93.75 96.08 96.42 97.20 97.12 96.76 97.57 96.30 96.31 94.46
LibimSeTi 99.64 99.65 99.68 99.69 99.68 78.97 99.69 99.69 99.65 99.65 99.66
url 97.78 96.98 99.81 99.93 99.81 97.47 99.77 99.63 99.78 99.63 97.17

Average ranking 7.76 7.18 5.77 6.15 4.90 4.43 5.16 4.09 6.58 6.26 7.62
Number of wins 1 7 0 2 0 13 3 15 2 2 1

Table 4: Predictive performance of the models in terms of AUC for the binary data sets (highest achieved performance for a data set indicated in boldface).


Data set MN-NB MV-NB LA-SVM-L2 LS-SVM-L1 LS-SVM-L2 PSN LR-BGD-L1 LR-BGD-L2 LR-SGD-L1 LR-SGD-L2 RBF-SVM
BookCrossing 57.24 53.19 55.28 54.22 55.13 52.57 54.24 55.78 54.36 54.46 52.16
YahooMovies_age 64.45 65.39 64.10 64.40 64.72 65.20 64.44 65.45 60.94 61.19 58.83
MovieLens_crime 74.62 66.09 71.79 73.09 72.51 72.18 73.44 73.21 72.88 73.98 72.93
MovieLens_fantasy 71.55 51.13 63.90 62.32 62.72 75.61 60.09 60.35 56.68 59.53 71.81
A-Card_Goto_MAS 63.60 76.51 66.27 61.09 60.97 76.23 62.60 62.44 70.53 62.68 65.48
MovieLens_adventure 74.20 59.19 72.24 72.32 72.78 76.88 72.11 72.68 62.36 66.56 74.31
MovieLens_100k_gender 75.57 74.56 75.87 73.37 76.16 75.98 75.07 78.72 76.24 76.64 77.03
MovieLens_mystery 75.28 54.44 64.32 67.26 63.97 79.02 69.94 64.86 56.48 59.21 69.44
MovieLens_romance 71.68 55.80 67.62 65.41 67.85 79.19 62.69 66.23 57.86 59.28 66.14
A-Card_Goto_Permeke 77.67 80.71 65.18 68.96 67.70 79.13 69.62 67.38 81.55 79.63 81.76
MovieLens_drama 72.72 69.15 79.82 80.85 79.94 75.91 81.07 81.75 75.41 75.57 71.33
YahooMovies_gender 78.09 80.69 80.64 79.34 79.71 82.00 79.62 80.97 74.65 74.74 80.31
MovieLens_1m_gender 79.99 78.79 81.25 82.65 80.35 80.59 82.48 83.23 70.23 79.44 79.42
MovieLens_thriller 78.51 57.15 74.23 72.5 73.74 83.97 74.26 73.98 64.18 64.17 79.95
MovieLens_children 81.58 69.25 78.70 77.89 78.51 83.97 76.88 78.83 73.87 78.14 80.80
MovieLens_comedy 76.51 77.30 83.03 83.44 83.59 78.77 83.57 84.36 79.50 79.81 73.28
kdd2015 62.46 83.96 67.88 67.65 64.95 84.59 66.66 69.06 73.06 80.55 80.87
A-Card_Goto_Wezenberg 76.67 84.08 56.55 57.64 57.39 85.07 57.79 57.81 79.20 72.28 80.16
A-Card_defect 52.67 85.64 72.06 74.09 71.61 81.35 74.04 73.85 76.08 61.26 67.20
MovieLens_action 81.88 67.20 83.25 85.01 83.24 82.14 84.75 86.12 81.27 81.36 84.38
MovieLens_animation 84.64 67.53 86.06 85.86 85.62 84.17 86.13 85.45 75.16 77.10 82.97
A-Card_Goto_Roma 69.85 86.64 60.05 57.72 58.35 86.59 58.25 58.46 76.17 69.47 70.47
MovieLens_scifi 84.72 56.43 78.16 76.39 77.66 87.02 77.27 77.23 72.51 72.50 76.97
MovieLens_documentary 79.39 81.20 86.35 86.45 86.78 87.41 86.91 87.10 78.65 82.07 81.05
A-Card_Goto_Zoo 67.62 86.49 56.81 55.73 57.30 87.26 55.77 55.82 72.78 76.07 75.34
MovieLens_1m_age 80.80 78.86 85.86 87.80 85.18 82.95 87.79 87.71 84.53 84.84 77.57
MovieLens_100k_age 79.53 80.33 85.58 82.27 85.10 81.19 83.49 88.52 83.55 83.45 83.84
MovieLens_musical 88.45 57.83 80.13 79.24 80.28 89.57 76.83 80.22 61.82 72.64 79.84
MovieLens_horror 89.31 88.30 88.14 88.40 88.42 90.72 89.56 89.3 87.74 87.76 90.03
MovieLens_filmnoir 89.54 71.08 72.07 74.73 71.93 90.78 71.78 71.73 64.06 68.11 81.21
A-Card_cashout 53.49 91.48 71.30 66.40 66.68 86.91 66.37 66.23 86.96 80.32 91.57
MovieLens_western 89.04 66.81 82.92 83.61 83.71 91.68 83.64 84.00 74.31 81.05 87.46
MovieLens_war 83.20 65.55 75.83 75.82 78.06 94.06 82.07 73.68 65.28 67.73 73.97
LibimSeTi 99.64 99.65 99.68 99.69 99.68 78.97 99.69 99.68 99.65 99.65 98.78

Average ranking 5.73 7.70 5.97 6.39 5.79 3.58 5.66 4.76 7.72 7.36 5.55
Number of wins 2 3 1 2 0 16 1 8 0 0 2

Table 5: Predictive performance of the models in terms of AUC for the numeric data sets (highest achieved performance for a data set indicated in boldface).


Finally, the SGD variants of logistic regression demonstrate worse performance than BGD. Regarding RBF-SVM, we observe that even when using a sample for the larger data sets, RBF only performs worst in a minority of the cases.

In order to gain more insight into Table 4, Figure 3 shows a decision tree denoting which classifier performs best depending on extrinsic data characteristics. Linking classifier performance and data characteristics reveals each technique's specific domain of competence (Macià et al., 2013). These characteristics consist of the instance dimension n, the feature dimension m, the number of active elements m̄, the sparsity ρ, the balance b and the nature of the behavior (rating, location, transactional, interest). Note that only one of the eighteen MovieLens_genre data sets (and only one of the MovieLens100k, MovieLens1m, A-Card and YahooMovies data sets) is used to train the tree, to reduce overfitting. The tree conceptualizes the following findings:

• For small, imbalanced data sets, the PSN approach leads to higher classification performance (for example MovieLens_scifi, YahooMovies_age).

• For large, imbalanced data sets, MV-NB leads to higher classification performance (for example A-Card_defect, Flickr, Fraud, Banking).

• For balanced data sets, LR-BGD-L2 leads to higher classification performance (for example MovieLens100k_age, MovieLens1m_gender, InternetAdvertising, BookCrossing).

• For very large, balanced data sets, LS-SVM-L1 leads to higher classification performance (LibimSeTi, URL).

The data sets for which the fourth condition holds are data sets with a high signal-to-noise ratio (Perlich et al., 2003). For LibimSeTi and URL, the use of a small portion of the features (less than 1% and 32%, respectively) results in an AUC not significantly different from the AUC obtained with all features. Thus, in this feature-redundant case, sparse L1 regularization is beneficial.

[Figure 3 omitted: the tree splits first on the balance b at 28.05; for b < 28.05 it splits on the number of instances n at 96,381, leading to PSN (n < 96,381) or MV-NB (n ≥ 96,381); for b ≥ 28.05 it splits on the number of active elements m̄ at 8,328,350, leading to LR-BGD-L2 (m̄ < 8,328,350) or LS-SVM-L1 (m̄ ≥ 8,328,350).]

Figure 3: Decision tree visualizing the most discriminative data set characteristics for all classifiers for binary data sets.


The findings from the decision tree are emphasized by an additional logistic regression performed for each technique, regressing whether that technique performs best (1) or not (0) on the data set characteristics. Significant regression coefficients were found for MV-NB (instance dimension n, balance b), LS-SVM-L1 (number of active elements m̄, balance b) and LR-BGD-L2 (balance b). Note that both these meta-analyses are purely meant to understand where the different techniques perform better or worse in this study, not to present generalizable results.

Turning to the numeric data sets, Table 5 shows that PSN performs best, followed by LR-BGD-L2. Multivariate naive Bayes and LR-SGD perform worst. In contrast to the binary data sets, MN-NB overall has a lower (better) rank than MV-NB, which follows the intuition behind their underlying event models as stated above: MN-NB models frequency as opposed to presence/absence of features. Also, by running MV-NB on an expanded unary-encoded data set, dependence between features is created, violating the independence assumption. Analogous to the binary data sets, L2 regularization and BGD perform better than L1 regularization and SGD, respectively. RBF-SVM has a better score compared to the one for binary data sets; however, it does not perform among the best techniques despite its capability of capturing complex relations (Chang and Lin, 2011). No decision tree was built here, as only 8 representative numeric data sets are available.

4.1.2 Statistical Comparison of Classification and Time Performance

In order to extrapolate these findings to a larger population of behavioral data sets, ideally the data sets should form a random and representative sample of that population. However, the data collection consists of multiple multi-target problems (MovieLens, YahooMovies and A-Card). These are subsets of prediction problems with the same features but different targets (notably resulting in a consistency regarding their best-performing techniques). Two approaches were taken to address this. Firstly, we randomly selected one data set in each multi-target problem to represent the others. Secondly, we weighted the ranks of these multi-target data sets in order for all information to be present in the analysis. Both approaches lead to the same statistical conclusions, and we present the former. Concretely, the statistical test is performed with a sample size of 17 for the binary data and 8 for the numeric data.

The non-parametric Friedman test is performed at the α = 0.05 significance level. Following the graphical representation proposed by Demšar (2006), Figure 4 sets out the average ranks of the classification techniques both for AUC (dashed lines) and execution time (solid lines), for the binary data sets (top) and for the numeric ones (bottom). Horizontal connections between techniques denote groups of algorithms which show no significant performance differences. Note that the ranks in Figure 4 differ from the ranks in Table 4 and Table 5, since a different sample is used.
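The test and the accompanying critical-difference computation can be sketched as follows; the AUC matrix layout and the q_alpha value (the approximate studentized-range constant for eleven classifiers at alpha = 0.05) are our assumptions.

```python
import numpy as np
from scipy.stats import friedmanchisquare, rankdata

def friedman_cd(auc, q_alpha=3.219):
    """auc: (n_datasets, n_classifiers) AUC matrix; q_alpha: studentized-range
    constant from Demšar (2006), here the approximate value for k = 11 at alpha = 0.05."""
    n, k = auc.shape
    _, p_value = friedmanchisquare(*auc.T)            # one argument per classifier
    avg_ranks = rankdata(-auc, axis=1).mean(axis=0)   # rank 1 = best AUC per data set
    cd = q_alpha * np.sqrt(k * (k + 1) / (6.0 * n))   # Nemenyi critical difference
    return p_value, avg_ranks, cd
```

Two techniques are connected by a horizontal line in Figure 4 exactly when their average ranks differ by less than the critical difference.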

From Figure 4 (binary, top), we observe that LR-BGD-L2 performs better than LS-SVM-L1, MV-NB, MN-NB and RBF-SVM. Also, LS-SVM-L2 performs significantly better than RBF-SVM.

Although LR-BGD-L2 has the best classification performance, it is very slow in terms of computational efficiency. In contrast, MN-NB is quite fast, but unfortunately it performs quite poorly in terms of AUC. This is striking, as MN-NB is the second most used technique (with LR-BGD) in the relevant literature, as illustrated in Table 1. The likely reason for this is that NB has been shown to perform well in settings such as text mining even if the underlying independence assumptions do not hold (Hand and Yu, 2001).

[Figure: two rank diagrams (scale 11 to 1) for the binary (top) and numeric (bottom) samples; arrows mark each technique's average AUC and time rank.]

Figure 4: Statistically significant differences between the methods in terms of AUC (dashed lines) and time (solid lines) for a random sample of binary (top) and numeric (bottom) data sets. The horizontal lines depict groups of methods for which no significant difference was found. The captions by the arrows represent the rank and the name of the technique.


Since these behavioral data resemble (perhaps superficially) text data sets, one might conclude that naive Bayes would also work well here. However, when MN-NB is used with textual data, the feature values seldom model absolute frequencies of the words (as they do here). Typically, an inverse document frequency (IDF) measure is employed, favoring less frequently occurring words. This IDF philosophy is incorporated in the PSN method. Weighting features in this manner thus seems beneficial in a behavioral context. RBF-SVM achieves the worst AUC and is the slowest. The best performing method with respect to time is PSN. Overall, PSN achieves a very respectable AUC-time trade-off.

From Figure 4 (numeric, bottom), we cannot distinguish the techniques in terms of AUC; the small sample size of 8 is the main reason for this. Here, too, PSN and MN-NB are the fastest and the non-linear RBF-SVM the slowest. Also, L2 regularization and logistic regression are very time-consuming.

Figure 5 presents the Pareto front for both types of behavioral data, clearly demonstrating the multi-objective trade-off between AUC and time: if more computational resources are available (further right), better classification predictions are reached. Note that all techniques on the Pareto front for binary classification use L2 regularization. Further, the Pareto fronts confirm our previous findings. Clearly, classification techniques such as NB, PSN and SGD, which all make use of heuristic assumptions to simplify the classification process, are located furthest left in Figure 5. Making these assumptions (for example, that the features are conditionally independent in the case of NB) leads to lower runtime complexity, while at the same time giving up AUC. Only MV-NB forms an exception to this: its runtime of O(m · n) approaches that of the SVM techniques.
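The front itself is simple to extract from the (time rank, AUC rank) pairs; the sketch below assumes lower ranks are better on both axes, and the example rank values are placeholders.

```python
import numpy as np

def pareto_front(points):
    """points: (k, 2) array of (time rank, AUC rank), lower is better on both.
    Returns the indices of the non-dominated techniques."""
    front = []
    for i, p in enumerate(points):
        others = np.delete(points, i, axis=0)
        # dominated if some other technique is at least as good on both axes
        # and strictly better on at least one
        dominated = np.any((others <= p).all(axis=1) & (others < p).any(axis=1))
        if not dominated:
            front.append(i)
    return front

ranks = np.array([[1.0, 6.0], [8.0, 3.3], [2.0, 7.6]])  # e.g. PSN, LR-BGD-L2, MN-NB
print(pareto_front(ranks))  # [0, 1]: the third technique is dominated by the first
```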

4.1.3 Comparison of Binary and Numeric Behavioral Data

Binary and numeric data are contrasted with the Wilcoxon signed-rank test to determine if numeric behavioral information, which models strength of behavior, leads to better predictions. This leads to the finding that LS-SVM-L2, LA-SVM-L2, LS-SVM-L1, LR-SGD-L1 and LR-SGD-L2 in fact perform better for binary data sets. A tendency towards a similar result was found for LR-BGD, but without statistical support. The sparsity distributions in Figure 1 and Figure 2 demonstrate that, on the one hand, instances engage in few behaviors and, on the other hand, that a given behavior is present in only a small fraction of the instances. The mere presence or absence then seems to be sufficient evidence of an instance's class membership for these techniques.
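A hedged sketch of this paired comparison, with hypothetical AUC values standing in for one technique's results on the binary and numeric version of the same data sets:

```python
import numpy as np
from scipy.stats import wilcoxon

# Hypothetical paired AUCs (one entry per data set) for a single technique.
auc_binary = np.array([0.84, 0.79, 0.91, 0.66, 0.73])
auc_numeric = np.array([0.82, 0.78, 0.90, 0.67, 0.71])

stat, p_value = wilcoxon(auc_binary, auc_numeric)   # paired, non-parametric
better = "binary" if (auc_binary - auc_numeric).mean() > 0 else "numeric"
print(f"p = {p_value:.3f}; {better} representation ahead on average")
```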

Additional significant differences contrasting binary and numeric data on computational performance are the following. All support vector machines (LA-SVM-L2, LS-SVM-L1 and LS-SVM-L2) and both batch gradient descent logistic regression methods (LR-BGD-L1 and LR-BGD-L2) run faster when faced with binary behavioral data. In contrast, MN-NB runs faster when faced with numeric behavioral data.

4.1.4 Summary of Performance Analysis

We summarize the conclusions of the performance analysis as follows:

• Overall, LR-BGD-L2 yields the best generalization performance (AUC) for both binary and numeric data. However, in terms of computational efficiency, it performs worst.


• In general, L2 regularization performs better than L1 regularization, except in the rare cases when feature redundancy is present. Table 1 demonstrates that a presumption towards this empirical finding already exists in the literature, as L2 regularization is used more frequently. As a drawback, we find that L2 regularization takes more time.

• BGD optimization is slower than its SGD variant, while resulting in better generalization performance.

• RBF-SVM is the slowest method for both binary and numeric data. This is also stated in many papers as the reason why RBF-SVM is not considered an option for such high-dimensional data. PSN and MN-NB, on the other hand, build their models in the shortest amount of time, with the heuristic assumptions of PSN leading to the better results of the two.

• MV-NB performs better than MN-NB for binary data; the opposite holds for numeric data.

• Contrasting binary and numeric behavioral data, LIN-SVM and LR result in better predictions in less time for binary data. On the other hand, MN-NB achieves better AUC and run time for numeric data.

Linking these results to Contribution I of this work, we can thus conclude the following. LR-BGD-L2 performs best in terms of AUC on binary behavioral data sets. However, in a practical setting, its time complexity might render it impracticable. An attractive trade-off between performance and time is given by the PSN technique. Regarding Contribution II, for the discriminative linear SVM and logistic regression classifiers, the mere modeling of presence and absence of features (binary features) is superior to frequency attributes in terms of classification and computational performance.

4.2 Learning Curve Analysis

Figure 6, Figure 7 and Figure 8 show the learning curves in the instance dimension, in the feature dimension with random feature selection, and in the feature dimension with intelligent feature selection for the largest behavioral data sets. In order to provide clarity to the many learning curves, we structure the analysis to find general patterns and attempt to identify groups of similar behavior. Note that some learning curves deviate from the general trend, which is mostly due to the random sampling procedure leading to varying imbalance and sparsity levels.

In Appendix B, more learning curves for these dimensions can be found. Not all learning curves are shown; however, we have selected representative learning curves showing the main behaviors in the data sets.
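For concreteness, a minimal sketch of how one series of an instance-dimension learning curve can be generated, assuming a sparse matrix X with labels y and a fixed held-out test set; the choice of logistic regression here is illustrative.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score

def instance_learning_curve(X, y, X_test, y_test, exponents=range(4, 14)):
    """Train on random subsamples of 2^4, 2^5, ... instances; return test AUCs.
    (A stratified draw would guard against single-class subsamples.)"""
    rng = np.random.default_rng(0)
    aucs = []
    for e in exponents:
        size = min(2 ** e, X.shape[0])
        idx = rng.choice(X.shape[0], size=size, replace=False)
        clf = LogisticRegression(penalty="l2", max_iter=1000).fit(X[idx], y[idx])
        aucs.append(roc_auc_score(y_test, clf.predict_proba(X_test)[:, 1]))
    return aucs
```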

4.2.1 Instance Dimension Learning Curves

For the bulk of the learning curves, two groups of classifiers can be identified. The first group consists of the classifiers employing heuristics in classifying patterns: MV-NB, MN-NB, PSN, LR-SGD-L1 and LR-SGD-L2 (Group 1), and the second group consists of the exact techniques LA-SVM-L2, LS-SVM-L2, LS-SVM-L1, LR-BGD-L1 and LR-BGD-L2 (Group 2).


[Figure: AUC rank plotted against time rank for each technique, with the Pareto-optimal techniques marked; binary data sets on top, numeric data sets at the bottom.]

Figure 5: Pareto front for binary (top) and numeric (bottom) behavioral data.


The latter techniques overall show similar behavior, with L2 regularization often dominating L1 regularization. On a more detailed level, four cases can be identified depending on two dimensions: signal-from-noise separability (SNS) and imbalance. The exact signal-from-noise separability of a data set cannot be determined, so a proxy SNS is used in the form of the maximum AUC reached by the classification techniques analyzed here, analogous to Perlich et al. (2003). Two cases are distinguished: SNS ≤ 83% refers to lower signal separability, while SNS > 83% refers to higher signal separability, which is essentially the same split reached by Perlich et al. (2003). The second dimension denotes the imbalance of the target variable. A high imbalance is recorded if less than 5% of the labels are positive; otherwise imbalance is considered low.

Along these two dimensions, four cases can now be discussed. The first case is characterized by low imbalance and high separability. The most obvious illustration can be seen in URL and LibimSeTi. An instance sample of less than 1% is sufficient to reach an AUC not significantly different from the final performance. The techniques learn fast and the result is a concave-down learning curve. MV-NB is the only technique which requires considerably more instances to learn from these data, as demonstrated by the later occurrence of this shape. As the signal-from-noise separability decreases, but stays above 83%, the curves stay concave down. The second case is illustrated by MovieLens_scifi, MovieLens_thriller and MovieLens_western. Here, the separability is still high, but so is the imbalance. A concave-up learning curve can be observed, while the performance differences between Group 1 and Group 2 become smaller. Thirdly, when the separability is low and imbalance is high, a concave-up curve is observed and the Group 1 techniques demonstrate more robustness towards that imbalance (Fraud, Car and Banking). Indeed, theoretically SVMs are not able to generalize well under high imbalance, as a separator is learned which is biased towards the minority class (Liu et al. 2007; Wallace et al. 2011; Vanhoeyveld and Martens 2017). Lastly, when both separability and imbalance are low (TaFeng, BookCrossing and YahooMovies), again a concave-up/linear curve demonstrates a slow start-up in learning. Also, the performance of the Group 1 and Group 2 algorithms is comparable.

Across all learning curves, the findings are the following. The Group 1 techniques (or a subset of them) show the best learning ability when faced with little data. With increasing sample size, their learning capacity cannot grasp the increasing complexity, leading to higher bias (Hand and Yu 2001; Brain and Webb 2002). Moreover, the LR-SGD curves are often similar to one another.

In summary, in all cases except the first (low imbalance, high separability), the learning curves show concave-up/linear behavior, which implies that for these behavioral data sets more training instances keep on increasing the classification performance. Although there is an inherent limit on predictive performance and general trends are distinguished first (Junqué de Fortuny et al., 2013), the nuances learned later still contribute. This is in contrast to learning curves for traditional, non-behavioral data (such as Shavlik et al. 1991; Perlich et al. 2003; Martens et al. 2016), which mostly have concave-down shapes. In that case, generally, adding training samples leads to diminishing returns in AUC (Provost and Kolluri, 1999). The commonality of concave-up learning curves should lead practitioners to exercise caution when performing pilot studies on smaller data samples, as the observed performance may well not represent what is possible with large data sets.


Linking this to Contribution III of this work, it is apparent from these learning curves that, overall, adding more training instances leads to a better performing model. This reinforces what has been found in Junqué de Fortuny et al. (2013): more data indeed leads to better predictions.

4.2.2 Feature Dimension Learning Curves with Random Feature Selection

Since we have previously shown that behavioral data sets contain many relevant features, these learning curves should confirm those findings. Thus, including features at random should result in increasing predictive performance.

Analogous to the results of the instance learning curves, two groups can generally be identified: Group 1 consisting of MV-NB, MN-NB, LR-SGD-L1 and LR-SGD-L2, and Group 2 consisting of LA-SVM-L2, LS-SVM-L2, LS-SVM-L1, LR-BGD-L1 and LR-BGD-L2. It can be seen that all algorithms start roughly equally when faced with few features, and overall no classification technique is significantly better at handling fewer features.

When the data has low separability and no extreme imbalance (YahooMovies, TaFeng and BookCrossing), the learning curves are concave up and no dominance of either group can be distinguished. As imbalance increases for low-SNS data sets (Banking, Car and Fraud), Group 1 dominates Group 2 (analogous to the instance dimension). The latter also holds for highly separable data sets (MovieLens_scifi, MovieLens_thriller and Acard_Wezenberg), although the high SNS reshapes the end of the learning curves towards being concave down. In the case of high signal-from-noise separability with no extreme imbalance on large data sets (Flickr and KDDa), the techniques learn slowly, resulting in concave-up curves. In contrast, for smaller data sets (MovieLens1m, KDD2015 and MovieLens_horror) learning goes faster, and the resulting curves are linear or concave down.

A comparison of the learning curves in the instance and feature dimensions leads to the finding that performance convergence is more sensitive to the features than to the instances: the feature learning curves overall demonstrate concave-up behavior. This is strongly confirmed in the high-SNS data sets URL and LibimSeTi: the performance converges faster in the instance dimension than in the feature dimension. Following the findings in the previous section, the Group 2 algorithms and the SGD variants each demonstrate similar behavior. Regarding individual classifiers' robustness, the learning curves for MovieLens (comedy), MovieLens1m (age) and LibimSeTi hint that MV-NB learns more quickly in the feature than in the instance dimension: the bias component of its error is larger in the latter due to the smaller sampling size combined with having many features (Friedman, 1997).

The foremost conclusion is that adding more features leads to higher predictive performance (Contribution III). Moreover, since the shapes are concave up (and some linear), it is clear that many features provide significant, independent predictive evidence.

4.2.3 Feature Dimension Learning Curves with Intelligent Feature Selection

In analysing learning curves arising from intelligent feature selection and comparing them with those from random selection, we should be able to demonstrate that generally both curves are similar. This will strengthen the assumption that fine-grained behavioral features have low redundancy and that many, if not all, are informative.
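Such a curve only requires ranking the features by a univariate information measure before growing the subsets, as in the sketch below; the chi-squared score is our stand-in for the information-value criterion, and the data are synthetic.

```python
import numpy as np
from scipy.sparse import random as sparse_random
from sklearn.feature_selection import chi2

# Hypothetical sparse behavioral matrix; in the study X would be a real data set.
X = sparse_random(1000, 5000, density=0.01, format="csr", random_state=0)
y = np.random.default_rng(0).integers(0, 2, size=1000)

scores, _ = chi2(X, y)                     # univariate score per feature
scores = np.nan_to_num(scores)             # constant features get score 0
order = np.argsort(scores)[::-1]           # most informative features first
subsets = [X[:, order[:2 ** e]] for e in range(4, 13)]  # one subset per curve point
```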


[Figure: AUC versus training sample size for MovieLens (sci-fi, binary and numeric), KDDa, Banking, Fraud, Flickr, URL, Car and Internet Advertisement; one curve per classification technique.]

Figure 6: Learning curves in the instance dimension.


[Figure: AUC versus number of randomly sampled features for A-Card (Wezenberg, binary and numeric), KDDa, Banking, Fraud, Flickr, URL, Car and Internet Advertisement; one curve per classification technique.]

Figure 7: Learning curves in the feature dimension (random feature selection).


[Figure: AUC versus number of selected features for A-Card (cashout, binary and numeric), KDDa, Banking, Fraud, Flickr, URL, Car and Internet Advertisement; one curve per classification technique.]

Figure 8: Learning curves in the feature dimension (intelligent feature selection).


Unsurprisingly, it can be observed that the starting point with intelligent feature selection is higher in comparison to the starting point when adding random features. For the large behavioral data sets (very fine-grained, with more than 1 million features, such as Car, KDDa and Banking), the curves exhibit a similar concave-up shape when adding more features as when no feature selection is used. This prompts us to conclude that for these very large behavioral data sets, the features show low redundancy and very many are essential in predicting the target variable. For smaller behavioral data sets (such as Flickr, BookCrossing, TaFeng and Ecommerce), the curves change from concave up to linear. Hence, discriminative informative value is present in the features, although each still contributes to better predictions. These findings are also in line with results from text classification, which is likewise characterized by many relevant features (Joachims, 1998). For the other data sets, the curves demonstrate concave-down learning behavior. Adding the most informative features first leads to a significant performance increase. The remaining, less discriminative features result in diminished increases and in some cases even decrease the AUC. This is the case for the linear support vector machines in a highly imbalanced setting (Acard_Permeke, Acard_Wezenberg), to which these techniques are highly sensitive (Forman, 2003).

Referring back to Contribution III, using more, albeit less informative, features still leads to higher predictive value, especially in the case of very fine-grained features. This implies that care should be taken with preprocessing techniques such as feature selection in the context of very fine-grained behavioral data.

5 Conclusion

The academic literature regarding big behavioral data provides substantial evidence of its predictive power in a wide variety of fields. However, not all state-of-the-art classification techniques are suitable for the high-dimensional and sparse characteristics of these data sets. Through a systematic comparative benchmarking study, this paper investigated the performance of these classifiers on large, sparse behavioral data.

The first contribution consists of finding a well-performing method both in terms of AUC and computational complexity. The results, however, indicate that an AUC-time trade-off is inherent to the problem, as the Pareto front clearly illustrates: given more time, one can choose to achieve higher classification performance. In terms of AUC, logistic regression with L2 regularization leads to significantly better results. Unfortunately, it attains this result at a high computational cost. Relating these results to the techniques used in the academic literature (Table 1), linear support vector machines are most frequently used, while our results indicate that logistic regression would perform better. The propensity in the literature towards the use of L2 regularization is supported by our findings. This suggests that each behavioral feature captures a different fine-grained aspect of an instance's behavior. This low redundancy is also demonstrated by the learning curves in the feature dimension when adding features dependent on their information value. In terms of computational complexity, PSN and MN-NB stand out with their significantly low run time. MN-NB is commonly used (Table 1) due to its successful application in text analysis. Despite its speed, however, its underlying assumptions do not lead to high-quality predictions in this behavioral setting. PSN appears to have a much better AUC-time trade-off.

On a more fine-grained level, a tree classifier and a logistic regression are learned on the results to explore the competence domain of the best-performing classifiers. These meta-analyses are to be interpreted with caution due to the restricted sample size. It is found that as the number of instances in an imbalanced sample increases, MV-NB performs best. If the sample is heavily unbalanced for small data sets, PSN becomes the method of choice. As confirmed in the learning curves, the Group 1 techniques (MV-NB, MN-NB, PSN, LR-SGD-L1, LR-SGD-L2) indeed perform better in a highly imbalanced setting. In low-imbalance data sets, the Group 2 techniques (LA-SVM-L2, LS-SVM-L1, LS-SVM-L2, LR-BGD-L1 and LR-BGD-L2) have higher AUC.

The second contribution is to determine the predictive power that lies in the behavioral modeling scheme (binary or numeric). The discriminative Group 2 techniques perform better when the data merely models presence/absence of features, in contrast to data enriched with behavioral strength. Thus, the mere presence of behavior apparently highlights that part of an instance's behavior sufficiently. This obviously reduces the data collection effort.

By systematically comparing these classification techniques in a benchmarking study, we have formally investigated what is correctly or incorrectly presumed by previous behavioral data analysis studies. The conclusions can point researchers and practitioners towards a unifying direction for both future behavioral data research and future technique optimization research.

Limitations of this analysis originate from the limited public availability of behavioral data sets. This results in a small sample for significance testing. However, we worked to make it as broad a sample as possible. One avenue of future research therefore consists of updating this benchmark with even more behavioral data sets as these become available. Especially for numeric data sets, this could lead to stronger conclusions. Moreover, sound meta-analyses could then strengthen the relations between data set characteristics and choice of classifier. A second avenue for further research is scaling up the well-performing L2-regularized techniques in terms of computational complexity on very sparse data (Dalessandro et al., 2014), most importantly for LR-BGD-L2. It would also be interesting to explore whether fast, heuristic predictions by, e.g., the PSN technique could be used to speed up the training phase of more complex, slower classifiers such as LR-BGD-L2 or RBF-SVM in order to combine the best of both worlds (Dalessandro et al. 2014; Junqué de Fortuny et al. 2015).

With respect to the third contribution, a learning curve analysis is performed which shows that better performance continues to be observed as more data (both in the instance and the feature dimension) is used. In contrast to non-behavioral instance learning curves, the curves are generally linear/concave up. This implies that performance still increases when adding more training data, even at very large data set sizes, which is only marginally the case for traditional, non-behavioral features. Very fine-grained, large data sets which demonstrate very low redundancy in the features show no dependence on the informative value of features, as demonstrated by the concave-up curves when adding informative features in descending order. For smaller, less fine-grained data sets, a higher redundancy between the features is present, with a higher sensitivity to more behavioral features. Hence, it is very valuable to collect as much data as possible in this behavioral setting, both regarding more instances and regarding more modular behavioral aspects.

Moreover, this shows that traditional learning curve analysis might be misleading. For example, Provost and Fawcett (2013) suggest that investing in more training data probably is not worthwhile once learning curves show that generalization performance has leveled off. This advice was indeed supported by traditional learning curve analyses, where one seldom witnesses learning curves that look poor for significant stretches and then suddenly turn steeply up. However, here we see this pattern repeatedly, and thus researchers and practitioners should be given different advice for data such as these, as incorrect conclusions will be drawn if they do not sample enough. For future research, defining and measuring the behavioral characteristics of data sets, resulting in a framework, could also prove helpful in determining causes for different generalization patterns of classification techniques.

As a final conclusion, it is apparent that the predictive analysis of big behavioral data significantly differs from the analysis of traditional (even big) data. The results of this study should be taken into account in the general predictive analysis of this kind of data.

A Classification Techniques

A classification technique takes a data set X along with values Y of the target variable for each of the instances x_i in X, and attempts to learn a function h(x) = ŷ as an approximation of the true value y. The classifier builds a predictive model based on a training set (X_train, Y_train). The trained model is then used to predict the y values of new, unseen data points belonging to a test set.

A.1 Naive Bayes

The naive Bayes classifier is a generative classifier using Bayes' rule to build a predictive model

$$p(y\,|\,\mathbf{x}) = \frac{p(y)\,p(\mathbf{x}\,|\,y)}{p(\mathbf{x})}.$$

Since the denominator does not depend on the class variable y, it is not taken into account. Then, making use of the naive assumption that features are mutually conditionally independent, the above equation can be rewritten as follows and forms the probability model used by the naive Bayes classifier:

$$p(y\,|\,\mathbf{x}) \propto p(y) \prod_{j=1}^{m} p(x_j\,|\,y). \tag{1}$$

In order to determine p(x|y), an underlying event model is assumed for the generation of the features. Considering the binomial and multinomial character of the distributions of the behavioral features, the multivariate and the multinomial event models are considered suitable.

A multinomial event model has proven successful in text classification, an area also characterized by high dimensionality (McCallum and Nigam, 1998), and this model defines the conditional probability as

$$p(\mathbf{x}\,|\,y) = \frac{\left(\sum_{j=1}^{m} x_j\right)!}{\prod_{j=1}^{m} x_j!}\; \prod_{j=1}^{m} p(x_j\,|\,y)^{x_j}.$$


A multinomial distribution implies that the features result from independent draws from the collection of all features. It does not take absent features into account, which is computationally beneficial in a sparse context. The training time complexity of its implementation consists of calculating a vector of feature weights for each class and results in O(m) time.

The multivariate event model defines the conditional probability as

$$p(\mathbf{x}\,|\,y) = \prod_{j=1}^{m} p(x_j\,|\,y)^{x_j}\,\left(1 - p(x_j\,|\,y)\right)^{1-x_j}.$$

Theoretically, this event model lends itself excellently to binary data: a feature is present with probability p(x_j|y) and absent with probability 1 − p(x_j|y). However, since the absence of features is explicitly modeled, its implementation is not naturally tailored to sparse data. Therefore, an efficient sparse implementation presented by Junqué de Fortuny et al. (2013) is used. This implementation takes advantage of the assumption that the features are binary and transforms Equation 1 into

$$\log p(y\,|\,\mathbf{x}) \propto \log p(y) + \sum_{j\,|\,x_j=1} \log p(x_j=1\,|\,y) + \sum_{j\,|\,x_j=0} \log\left(p(y) - p(x_j=1\,|\,y)\right).$$

This transformation results in an O(m̄ · n) time complexity, in contrast to O(m · n), with m̄ the number of active elements.
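A minimal sketch of this sparse scoring idea, assuming binary features: the absent-feature terms are folded into a per-class constant so that only active elements are touched. This uses the standard log p(x_j=0|y) complement rather than the count-based reformulation above, but the complexity argument is the same.

```python
import numpy as np

def mvnb_log_scores(X, log_prior, log_p1, log_p0):
    """X: binary csr_matrix (n x m); log_prior[c] = log p(y=c);
    log_p1[c, j] = log p(x_j=1|y=c); log_p0[c, j] = log p(x_j=0|y=c)."""
    const = log_prior + log_p0.sum(axis=1)   # score as if every feature were absent
    adjust = (log_p1 - log_p0).T             # (m x classes) correction per active feature
    return X @ adjust + const                # touches only the non-zero elements of X
```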

A.2 Logistic Regression

In logistic regression, the target function h(x) = w^T x is transformed with the logistic function, with w a vector of weights corresponding to the dimensions of X. This transformation models a probabilistic estimate as to whether a test instance belongs to the positive class. The logistic regression model is thus defined as

$$p(y\,|\,\mathbf{x}) = \frac{1}{1 + e^{-y\,\mathbf{w}^T\mathbf{x}}}.$$

When training the logistic regression model, the function

$$\min_{\mathbf{w}}\; R + C \sum_{i=1}^{n} \log\left(1 + e^{-y_i \mathbf{w}^T \mathbf{x}_i}\right) \tag{2}$$

is optimized, where R is the regularization term to prevent overfitting. With L1 regularization, R equals ||w||_1; with L2 regularization, R is (1/2)||w||_2^2. The former zeroes out low-valued coefficients, which results in natural feature selection (Ng, 2004). The latter, in contrast, favors very small, non-zero weight values. The regularization is controlled by a parameter C which models a trade-off between the complexity of the model (first term) and the minimization of the training error (second term). Minimizing the training error too aggressively might result in a complex model with lower generalizability, which the regularization term attempts to correct.

In the search for an optimal w, Equation 2 can be solved with batch gradient descent using Newton's method (LR-BGD variants) or with stochastic gradient descent (LR-SGD variants) (Bottou, 2010). The Liblinear package implements logistic regression with a trust region Newton method (Fan et al. 2008; Lin et al. 2008). Iteratively, a subset of the region of the objective function is approximated and subsequently expanded or shrunk depending on the quality of the approximation. This is done in O(m · c) time, where c is the number of iterations needed until convergence. Stochastic gradient descent is scalable towards larger dimensions since it approximates the true gradient of w by calculating the gradient over one random training instance. This approximation is seen as a proxy for the real gradient and is used in subsequent steps of the algorithm. While the execution time decreases, convergence towards an optimum will clearly be slower. Vowpal Wabbit, a widely used analysis tool for big data, solves the LR-SGD variants with stochastic gradient descent in O(n) time (Langford et al., 2007).
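In scikit-learn terms, the four logistic regression variants can be sketched as follows; the solver choices and parameter values are illustrative, not those of the study (which relies on Liblinear and Vowpal Wabbit directly).

```python
from sklearn.linear_model import LogisticRegression, SGDClassifier

lr_bgd_l2 = LogisticRegression(penalty="l2", C=1.0, solver="lbfgs")      # batch, L2
lr_bgd_l1 = LogisticRegression(penalty="l1", C=1.0, solver="liblinear")  # batch, L1
# SGDClassifier with logistic loss approximates the gradient one instance at a time
# (the loss is named "log" in older scikit-learn releases).
lr_sgd_l2 = SGDClassifier(loss="log_loss", penalty="l2", alpha=1e-4)
lr_sgd_l1 = SGDClassifier(loss="log_loss", penalty="l1", alpha=1e-4)
```

Note that scikit-learn's alpha plays the inverse role of C in Equation 2: a larger alpha means stronger regularization.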

A.3 Support Vector Machine

The support vector machine (SVM) (Cortes and Vapnik, 1995) is a discriminative binary classifier that is very suitable for high-dimensional data. An SVM finds a hyperplane that maximally separates the closest points of each of two classes, called support vectors (SVs). In maximally separating the SVs, the SVM aims for high generalizability and low variance. If the data is separable, the hard margin SVM seeks a hyperplane of the form

$$\mathbf{w}^T\mathbf{x} + b = 0,$$

with w the weight vector normal to the hyperplane and b a bias. New test points are classified on one of the sides of this hyperplane, i.e.

$$\begin{cases} \mathbf{w}^T\mathbf{x}_i + b \ge +1 & \text{if } y_i = +1,\\ \mathbf{w}^T\mathbf{x}_i + b \le -1 & \text{if } y_i = -1. \end{cases}$$

When faced with non-linearly separable data, a non-linear function θ(x), called a kernel, is used to project the data points to a high-dimensional feature space where the points are linearly separable.

In general, the goal of the support vector machine is to solve the objective function

$$\min_{\mathbf{w}, b, \boldsymbol{\xi}}\; R + C \sum_{i=1}^{n} \xi_i, \qquad \text{s.t. } y_i\left(\mathbf{w}^T\theta(\mathbf{x}_i) + b\right) \ge 1 - \xi_i,\quad \xi_i \ge 0,$$

with C once again a trade-off parameter between complexity (first term) and error rate (second term), and ξ_i (i = 1, ..., n) slack variables representing the loss function. In words, the goal is to minimize the training error (second term), while allowing for misclassifications (first term), regulated by the trade-off parameter C. Three parameters are to be defined in the above equation: the regularization term R, the loss function ξ and the kernel function θ.

The first parameter can be defined following L1 or L2 regularization. In the first case, R = ||w||_1; in the second case, R = (1/2)||w||_2^2. As mentioned before, L1 regularization results in sparse outputs. For the loss function ξ, the L1-norm and the L2-norm are considered here. Selecting the L1-norm as loss function, the sum of the absolute differences is minimized: ξ_i = max(0, 1 − y_i w^T x_i). When using the L2-norm as loss function, the square of the errors is minimized: ξ_i = max(0, 1 − y_i w^T x_i)^2. Since the second loss function minimizes the squared errors, it is more sensitive to outliers. Regarding the kernel, two options are explored: a linear kernel and an RBF kernel. A linear SVM uses a linear function as the kernel θ(x). It is often stated in the literature that with high-dimensional data a projection to a higher-dimensional feature space to find a hyperplane will come at too high a computational cost and will not improve classification performance (Hsu et al. 2003; Yu et al. 2010). RBF-SVM, on the other hand, uses a non-linear kernel and is capable of capturing complex interactions in the data (Chang and Lin, 2011). An RBF-SVM operates with a Gaussian kernel and takes the form

$$K(\mathbf{x}_i, \mathbf{x}'_j) = e^{-\gamma\,||\mathbf{x}_i - \mathbf{x}'_j||^2},$$

with x_i and x'_j two samples for which the Gaussian kernel determines the similarity in the new high-dimensional space, guided by the parameter γ. The parameter γ controls the standard deviation of the Gaussian at each point: the higher the value of γ, the lower the influence of the SVs, which decreases bias but increases variance. The Liblinear package (Fan et al., 2008) is used for the implementations of the different variants of linear SVM (i.e. LS-SVM-L2, LS-SVM-L1, LA-SVM-L2). It uses a coordinate gradient descent method solving the optimization problem in O(n) time. For the RBF-SVM, the Libsvm package (Chang and Lin, 2011) is used, which leads to a training time complexity of O(n²)–O(n³).
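The SVM variants of this study map onto standard toolkits roughly as follows; the parameter values are illustrative.

```python
from sklearn.svm import LinearSVC, SVC

ls_svm_l2 = LinearSVC(penalty="l2", loss="squared_hinge", C=1.0)       # L2-reg., L2-loss
la_svm_l2 = LinearSVC(penalty="l2", loss="hinge", C=1.0)               # L2-reg., L1-loss
ls_svm_l1 = LinearSVC(penalty="l1", loss="squared_hinge", dual=False)  # sparse weights
rbf_svm = SVC(kernel="rbf", gamma=0.1, C=1.0)                          # non-linear kernel
```

LinearSVC wraps Liblinear and SVC wraps Libsvm, matching the packages named above.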

A.4 Relational Classification with Pseudo Social Networks

In this approach, the data is transformed to a similarity network (pseudo social network, PSN) between the instances (Stankova et al. 2014; Martens et al. 2016). The network is denoted 'pseudo' as no true social network is implied: two instances are connected if they are similar regarding behaviors they have engaged in. Based on this similarity, predictions are made using traditional relational classifiers.

Concretely, first, weights are calculated with a top-node function for each feature based on its degree (Stankova et al., 2014). We employ the tangens hyperbolicum, which defines the weight s_m for a feature m as

$$s_m = \tanh\left(\frac{1}{d_m}\right),$$

with d_m the degree of node m, such that features with a low degree receive a higher weight. Then, the pseudo social network is built by connecting instances, weighing their edges based on their shared features. The feature weights are aggregated into edge weights w_ij between nodes i and j through an instance node function. The sum of shared nodes function simply sums the feature weights s_m of the shared features of instances i and j as

$$w_{ij} = \sum_{m \in N(i) \cap N(j)} s_m,$$

with N(i) the features demonstrated by instance i. Now, relational classifiers are used. These classifiers infer unknown labels through the network structure and the labels of connected nodes. We use the weighted-vote relational neighbour classifier (Macskassy and Provost, 2007), which labels a node through a weighted probability estimation using the known labels of connected nodes. Formally, the classifier calculates

$$P(l_i = c\,|\,N(i)) = \frac{1}{Z} \sum_{j \in N(i)} w_{ij}\, P(l_j = c\,|\,N(j)), \tag{3}$$

with l_i the label of node i, N(i) the instance nodes connected to node i and Z the number of connected nodes.

In Stankova et al. (2014), a highly scalable version of the combination of the sum of shared nodes instance node function with the weighted-vote relational classifier is deduced, resulting in a fast linear model over the feature nodes, referred to as the SW-transformation. This fast, scalable variant with O(m) runtime complexity lends itself excellently to the context of sparse, high-dimensional data and translates Equation 3 into

$$P(l_i = c\,|\,N(i)) = \frac{1}{Z} \sum_{m\,|\,x_{im} \neq 0} n_{s_m} \times s_m,$$

where n_{s_m} = |{j : x_{jm} = 1 and y_j = 1}|, s_m is the weight of top node m and N(i) the instance nodes connected to node i. For a full account of this method, we refer to Stankova et al. (2014).
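A rough sketch of the SW-transformation score on a binary matrix follows; the normalization by the instance degree is our simplification of the exact Z of Stankova et al. (2014).

```python
import numpy as np
from scipy.sparse import csr_matrix

def sw_scores(X, y):
    """X: binary csr_matrix (n x m); y: {0,1} labels of the training instances."""
    d = np.asarray(X.sum(axis=0)).ravel()               # degree d_m of each feature node
    s = np.tanh(1.0 / np.maximum(d, 1))                 # top-node weights s_m
    n_pos = np.asarray(X[y == 1].sum(axis=0)).ravel()   # n_{s_m}: positives per feature
    raw = X @ (n_pos * s)                               # sum over each instance's features
    z = np.maximum(np.asarray(X.sum(axis=1)).ravel(), 1)  # simplified normalizer Z
    return raw / z
```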

B Learning Curves


[Figure: AUC versus training sample size for MovieLens1m (age and gender) and YahooMovies (age and gender), each in binary and numeric form; one curve per classification technique.]

Figure 9: Learning curves in the instance dimension.


[Figure: AUC versus training sample size for Internet Advertisement, Ecommerce, TaFeng, KDD2015 (binary), LibimSeTi (binary and numeric) and BookCrossing (binary and numeric); one curve per classification technique.]

Figure 10: Learning curves in the instance dimension.


[Figure: AUC versus training sample size for MovieLens (action, comedy, crime and horror), each in binary and numeric form; one curve per classification technique.]

Figure 11: Learning curves in the instance dimension.


[Figure: AUC versus training sample size for MovieLens (thriller and western) and A-Card (cashout and Permeke), each in binary and numeric form; one curve per classification technique.]

Figure 12: Learning curves in the instance dimension.


[Figure: AUC versus number of randomly sampled features for MovieLens1m (age and gender) and YahooMovies (age and gender), each in binary and numeric form; one curve per classification technique.]

Figure 13: Learning curves in the feature dimension (random feature selection).


[Figure: AUC versus number of randomly sampled features for Internet Advertisement, Ecommerce, TaFeng, KDD2015 (binary), LibimSeTi (binary and numeric) and BookCrossing (binary and numeric); one curve per classification technique.]

Figure 14: Learning curves in the feature dimension (random feature selection).


[Figure: AUC versus number of randomly sampled features for MovieLens (action, comedy, crime and horror), each in binary and numeric form; one curve per classification technique.]

Figure 15: Learning curves in the feature dimension (random feature selection).


[Figure: AUC versus number of randomly sampled features for MovieLens (thriller and sci-fi) and A-Card (cashout and Permeke), each in binary and numeric form; one curve per classification technique.]

Figure 16: Learning curves in the feature dimension (random feature selection).


[Figure: AUC versus number of selected features for MovieLens1m (age and gender) and YahooMovies (age and gender), each in binary and numeric form; one curve per classification technique.]

Figure 17: Learning curves in the feature dimension (intelligent feature selection).

B Learning Curves 43

[Figure 18: Learning curves in the features dimension (intelligent feature selection). Panels: Internet Advertisement, Ecommerce, TaFeng, KDD2015, LibimSeTi (binary), LibimSeTi (numeric), BookCrossing (binary) and BookCrossing (numeric); x-axis: feature sample size (log scale); y-axis: AUC. Legend: same ten classifiers as in Figure 15.]

[Figure 19: Learning curves in the features dimension (intelligent feature selection). Panels: MovieLens (action), MovieLens (comedy), MovieLens (documentary) and MovieLens (horror), each as a binary and a numeric variant; x-axis: feature sample size (log scale); y-axis: AUC. Legend: same ten classifiers as in Figure 15.]

[Figure 20: Learning curves in the features dimension (intelligent feature selection). Panels: MovieLens (drama), MovieLens (sci−fi), A−Card (Wezenberg) and A−Card (Permeke), each as a binary and a numeric variant; x-axis: feature sample size (log scale); y-axis: AUC. Legend: same ten classifiers as in Figure 15.]
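Figures 17 through 20 replace the random draw with intelligent feature selection: features are ranked on the training data and the top-k are retained at each grid point. The sketch below, reusing X_tr, y_tr, X_te and y_te from the previous sketch, uses a chi-squared relevance score as an assumed stand-in for the ranking criterion; only the selection step changes.

from sklearn.feature_selection import chi2

# Rank features once on the training fold; chi2 requires non-negative input,
# which the binary presence/absence matrix above satisfies.
scores, _ = chi2(X_tr, y_tr)
scores = np.nan_to_num(scores)    # all-zero columns produce NaN scores
order = np.argsort(scores)[::-1]  # highest-scoring features first

for k in [2 ** e for e in range(4, 17)]:
    cols = order[:k]  # top-k ranked features instead of a random draw
    clf = LogisticRegression(penalty="l2", max_iter=1000)
    clf.fit(X_tr[:, cols], y_tr)
    auc = roc_auc_score(y_te, clf.predict_proba(X_te[:, cols])[:, 1])
    print(f"{k:6d} top-ranked features: AUC = {100 * auc:.1f}")

Ranking only on the training fold keeps the selection step from leaking test-set information into the curve.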
