
Benchmarking cross-project defect prediction approaches with costs metrics

Steffen Herbold, University of Goettingen, Institute of Computer Science

Göttingen, Germany, [email protected]

ABSTRACT
Defect prediction can be a powerful tool to guide the use of quality assurance resources. In recent years, many researchers focused on the problem of Cross-Project Defect Prediction (CPDP), i.e., the creation of prediction models based on training data from other projects. However, only a few of the published papers evaluate the cost efficiency of predictions, i.e., if they save costs if they are used to guide quality assurance efforts. Within this paper, we provide a benchmark of 26 CPDP approaches based on cost metrics. Our benchmark shows that trivially assuming everything as defective is on average better than CPDP under cost considerations. Moreover, we show that our ranking of approaches using cost metrics is uncorrelated to a ranking based on metrics that do not directly consider costs. These findings show that we must put more effort into evaluating the actual benefits of CPDP, as the current state of the art of CPDP can actually be beaten by a trivial approach in cost-oriented evaluations.

CCS CONCEPTS
• Software and its engineering → Software defect analysis; • General and reference → Experimentation;

KEYWORDS
defect prediction, cross-project, cost metrics

ACM Reference Format:
Steffen Herbold. 2018. Benchmarking cross-project defect prediction approaches with costs metrics. In Proceedings of , , , 12 pages. https://doi.org/

1 INTRODUCTION
Software defect prediction has been under investigation in our community for many years. The reason for this is the huge cost saving potential of accurately predicting in which parts of software defects are located. This information can be used to guide quality assurance efforts and ideally have fewer post-release defects with less effort. Since one major problem of defect prediction approaches is the availability of training data, researchers turned towards Cross-Project Defect Prediction (CPDP), i.e., the prediction of defects based on training data from other projects. A recent benchmark by Herbold et al. [24] compared the performance of 24 approaches suggested by researchers between 2008 and 2015. However, the benchmark has one major limitation: it does not consider the impact on costs directly. Instead, Herbold et al. used the machine learning metrics AUC, F-measure, G-measure, and MCC. From a cost perspective, these metrics only make sense if all software entities that are predicted as defective get additional quality assurance attention. Even then, the cost of missing defects would be assumed as equal to the cost of additional review effort. However, according to the literature, post-release defects cost 15-50 times more than additional quality assurance effort [36]. Rahman et al. [55] already found that the results regarding CPDP may be very different if considered from a cost-oriented perspective. Whether this finding by Rahman et al. translates to the majority of the CPDP literature is unclear, because a cost-sensitive evaluation is missing.

We close this gap with this paper. Our contribution is an adoption of the benchmark from Herbold et al. [24] for cost metrics to determine a cost-sensitive ranking of CPDP approaches. We compare 26 CPDP approaches published between 2008 and 2016, as well as three baselines. Our results show that a trivial baseline approach that considers all code as defective is almost never significantly outperformed by CPDP. Only one CPDP approach, proposed by Liu et al. [40], performs better for one out of three cost metrics, but even this is only the case on one of the two data sets we use. For the other data set, the performance is not statistically significantly better. For the other two cost metrics, no CPDP approach performs statistically significantly better than the trivial baseline. Moreover, there is no correlation between the ranking of CPDP approaches based on cost metrics and the ranking produced by Herbold et al. [24]. Hence, cost metrics should always be considered in addition to other metrics, because otherwise we do not evaluate if proposed models do what they should: reduce costs.

The remainder of this paper is structured as follows. We discuss related work in Section 2. Then, we introduce our benchmark methodology including the research questions, data used, performance metrics, and statistical evaluation in Section 3. Afterwards, we present our results in Section 4, followed by the discussion in Section 5 and the threats to validity in Section 6. Finally, we conclude the paper in Section 7.

2 RELATED WORK
We split our discussion of the related work into two parts. First, we discuss the related work on defect prediction benchmarks. Second, we discuss the related work on CPDP.

2.1 Defect prediction benchmarks
Our benchmark on cost aspects is influenced by four defect prediction benchmarks from the literature. Lessmann et al. [39] and Ghotra et al. [18] evaluated the impact of different classifiers on Within-Project Defect Prediction (WPDP).


Table 1: Related work on CPDP included in the benchmark. Acronyms are defined following the authors and reference in the first column.

Khoshgoftaar08: Khoshgoftaar et al. [35] proposed majority voting of multiple classifiers trained for each product in the training data.
Watanabe08: Watanabe et al. [72] proposed standardization based on mean values of the target product.
Turhan09: Turhan et al. [67] proposed a log-transformation and nearest neighbor relevancy filtering.
Zimmermann09: Zimmermann et al. [77] proposed a decision tree to select suitable training data.
CamargoCruz09: Camargo Cruz and Ochimizu [6] proposed a log-transformation and standardization based on the median of the target product.
Liu10: Liu et al. [40] proposed an S-expression tree created by a genetic program.
Menzies11: Menzies et al. [43, 44] proposed local models for different regions of the training data through clustering.
Ma12: Ma et al. [41] proposed data weighting using the concept of gravitation.
Peters12: Peters and Menzies [51] proposed an approach for data privacy using a randomized transformation called MORPH.
Uchigaki12: Uchigaki et al. [70] proposed an ensemble of univariate logistic regression models built for each attribute separately.
Canfora13: Canfora et al. [7, 8] proposed a multi-objective genetic program to build a logistic regression model that optimizes costs and the number of defects detected.
Peters13: Peters et al. [52] proposed relevancy filtering using conditional probabilities combined with MORPH data privatization.
Herbold13: Herbold [21] proposed relevancy filtering using distributional characteristics of products.
ZHe13: Z. He et al. [20] proposed attribute selection and relevancy filtering using separability between training and target products.
Panichella14: Panichella et al. [50] proposed the CODEP meta classifier over the results of multiple classification models.
Ryu14: Ryu et al. [57] proposed similarity based resampling and boosting.
PHe15: P. He et al. [19] proposed feature selection based on how often the metrics are used for classification models built using the training data.
Peters15: Peters et al. [53] proposed LACE2 as an extension with further privacy of CLIFF and MORPH.
Kawata15: Kawata et al. [33] proposed relevancy filtering using DBSCAN clustering.
YZhang15: Y. Zhang et al. [76] proposed the ensemble classifiers average voting, maximum voting, boosting and bagging.
Amasaki15: Amasaki et al. [1] proposed feature selection and relevancy filtering based on minimal metric distances between training and target data.
Ryu15: Ryu et al. [58] proposed relevancy filtering based on string distances and LASER classification.
Nam15: Nam and Kim [46] proposed unsupervised defect prediction based on the median of attributes.
Tantithamthavorn16: Tantithamthavorn et al. [63] proposed to use hyper parameter optimization for classifiers.
FZhang16: F. Zhang et al. [75] proposed unsupervised defect prediction based on spectral clustering.
Hosseini16: Hosseini et al. [27, 28] proposed a genetic program and nearest neighbor relevancy filtering to select training data.

D'Ambros et al. [14] compared different kinds of metrics and classification models for WPDP. Herbold et al. [24] compared CPDP approaches on multiple data sets using multiple performance metrics.

The benchmarks by Lessmann et al., D'Ambros et al., and Herbold et al. followed Demšar's guidelines [15] and make use of the Friedman test [16] with the post-hoc Nemenyi test [48]. Herbold et al. extended this concept using rankscores, which allow the combination of results from multiple data sets and performance metrics into a single ranking. Ghotra et al. use a different statistical procedure based on ANOVA [16] and the Scott-Knott test [60].

Our benchmark design is similar to the design used by Herbold et al. [24]. However, there are two major differences: 1) we focus on different research questions, i.e., the performance using cost metrics, whereas Herbold et al. focused on machine learning metrics that do not take costs into account; and 2) we extended the statistical evaluation with a recently proposed effect size correction, taking pattern from ScottKnottESD proposed by Tantithamthavorn et al. [64].

2.2 Cross-project defect prediction
The scope of our benchmark is approaches that predict software defects in a target product using metric data collected from other projects. However, our benchmark does not cover the complete body of CPDP research. Specifically, we do not address the following:

• Mixed-project defect prediction, i.e., approaches that require labelled training data from the target product, e.g., [10, 59, 68, 69, 73].

• Heterogeneous defect prediction, i.e., approaches that work with different metric sets for the training and target products, e.g., [29, 45].


• Just-in-time defect prediction, i.e., defect prediction for specific commits, e.g., [17, 32].

• Approaches that require project context factors, e.g., [74].

Additionally, we excluded two works that use transfer component analysis [49], by Nam et al. [47] and by Jing et al. [30]. Transfer component analysis has major scalability issues due to a very large eigenvalue problem that needs to be solved. Herbold et al. [22] already determined that they could only compute results in less than one day for the smaller data sets for the approach by Nam et al. [47]. We tried to resolve this problem by using a scientific compute cluster, where we had access to nodes with up to 64 cores and 256 GB of memory. We were still not able to compute results for a large data set with 17681 instances (see footnote 1) before hitting the execution time limit for computational jobs of 48 hours.

This leaves us with 26 approaches that were published through 29 publications, listed in Table 1. For each of these approaches, we define an acronym, which we will use hereafter to refer to the approach, and give a short description of the approach. The list in Table 1 is mostly consistent with the benchmark from Herbold et al. [24] on CPDP. Additionally, we extended the work by Herbold et al. with three further replications of approaches that were published in 2016, i.e., Tantithamthavorn16, FZhang16, and Hosseini16.

3 BENCHMARK METHODOLOGY
We now describe the methodology of our benchmark, including our research questions, the data we used, the baselines and classifiers, the performance metrics, and the statistical analysis.

3.1 Research Questions
We address the following six research questions within this study.
RQ1: Does it matter if we use defect counts or binary labels for cost metrics?
RQ2: Which approach performs best if quality assurance is applied according to the prediction?
RQ3: Which approach performs best if additional quality assurance can only be applied to a small portion of the code?
RQ4: Which approach performs best independent of the prediction threshold?
RQ5: Which approach performs best overall in terms of costs?
RQ6: Is the overall ranking based on cost metrics different from the overall ranking based on the AUC, F-measure, G-measure, and MCC?

With RQ1, we address the question if binary labels as defective/non-defective are sufficient, or if defect counts are required to compare costs. Binary labels carry less information and should lead to less accurate results. For example, you save more costs if you prevent two defects instead of one. The question is, does this really matter, i.e., do the values of performance metrics change significantly? This question is especially interesting, as not all publicly available defect prediction data sets provide defect counts. In case the impact of defect counts is large, we may only use data sets that provide this information for our benchmark.

With research questions RQ2-RQ4 we consider different quality assurance scenarios. RQ2 explores the cost savings if someone trusts the defect prediction model completely, i.e., applies quality assurance measures exactly according to the prediction of the model. RQ3 considers the case where the defect prediction model is used to identify a small portion of the code for additional quality assurance, i.e., a setting with a limited quality assurance budget. RQ2 and RQ3 have specific prediction thresholds, i.e., a certain amount of the predicted defects are considered. With RQ4, we provide a threshold independent view on costs, which is valuable for selecting an approach if the amount of effort to be invested is unclear beforehand. To provide a general purpose ranking, we throw all the considerations from RQ2-RQ4 together for RQ5 and evaluate which approach performs best if all criteria are considered. Thus, RQ5 provides an evaluation that accounts for different application and cost scenarios. With RQ6 we address the question if the usually used machine learning metrics are sufficient to estimate a cost-sensitive ranking, i.e., if they produce a similar or a different ranking from directly using cost metrics.

Footnote 1: JURECZKO as defined in Section 3.2.

Table 2: Summary of used data sets.

Name       #Products  #Instances  #Defective
JURECZKO   62         17681       6062
AEEEM      5          5371        893

3.2 Data
We use data from two different defect prediction data sets from the literature for our benchmark, listed in Table 2. These data sets are a subset of the data sets that Herbold et al. [24] used in their benchmark. The other three data sets that Herbold et al. used could not be used for different reasons. The MDP and RELINK data were infeasible due to our results regarding RQ1 (see Section 4). The NETGENE data does not contain the size of artifacts, which is required for cost-sensitive evaluations (see footnote 2).

We can only give a brief summary of both data sets due to space restrictions. Full lists of the software metrics, products contained, etc. can be found in the literature at the cited references for each data set or summarized in the benchmark by Herbold et al. [24].

The first data set was donated by Jureczko and Madeyski [31] (see footnote 3) and consists of 48 product releases of 15 open source projects, 27 product releases of six proprietary projects and 17 academic products that were implemented by students. For each of these releases, 20 static product metrics for Java classes, as well as the number of defects, are part of the data. Taking pattern from Herbold et al. [22], we use 62 of the products. We do not use the 27 proprietary products to avoid threats to the validity of our results due to mixing proprietary and open source software. Moreover, three of the academic products contain fewer than five defective instances, which is too few for reasonable analysis with machine learning. In the following, we will refer to this data set as JURECZKO.

The second data set was published by D'Ambros et al. [13] (see footnote 4) and consists of five software releases from different projects. For each of these releases, 71 software metrics for Java classes are available, which include static product metrics and process metrics, like weighted churn and linearly decayed entropy, as well as the number of defects. In the following, we refer to this data set as AEEEM, taking pattern from Nam et al. [47].

Footnote 2: According to the paper by Herzig et al. [26], complexity and size should be included in the data. There are archives called complexity_diff.tar.gz available for each product in the data. However, the archives seem to contain the differences for complexity and size metrics for each transaction identified by a revision hash. We did not find absolute values for the size, which would be required for cost metrics.
Footnote 3: The data is publicly available online: http://openscience.us/repo/defect/ck/ (last checked: 2017-08-25).
Footnote 4: The data is publicly available online: http://bug.inf.usi.ch/ (last checked: 2017-08-25).

3.3 Baselines and Classifiers
Because we adopt Herbold et al.'s [24] benchmark methodology for using cost metrics, our choices for baselines and classifiers are nearly identical. We adopt three performance baselines from Herbold et al. [24] that define naïve approaches for classification models: ALL, which takes all available training data as is, RANDOM, which randomly classifies instances as defective with a probability of 0.5, and FIX, which classifies all instances as defective. We do not adopt the baseline CV for 10x10 cross-validation, because cross-validation is not implementable in practice and is known to overestimate the performance of WPDP [62]. This may skew rankings, as approaches may be outperformed by something that overestimates performance. Moreover, cross-validation is an estimator for WPDP performance and, therefore, out of scope of our benchmark for CPDP models.
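To make these baselines concrete, the following sketch shows how they could be implemented. This is a minimal illustration with names and interfaces of our own choosing, not code from the replication kit.

import numpy as np

class FixBaseline:
    # FIX: classify every instance as defective.
    def predict(self, X):
        return np.ones(len(X), dtype=int)

class RandomBaseline:
    # RANDOM: classify each instance as defective with probability 0.5.
    def __init__(self, seed=42):
        self.rng = np.random.default_rng(seed)
    def predict(self, X):
        return (self.rng.random(len(X)) < 0.5).astype(int)

class AllBaseline:
    # ALL: train a classifier on all available cross-project data as is.
    def __init__(self, classifier):
        self.classifier = classifier  # any classifier object with fit/predict
    def fit(self, X_train, y_train):
        self.classifier.fit(X_train, y_train)
        return self
    def predict(self, X):
        return self.classifier.predict(X)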

We use a C4.5 decision tree (DT) [54], logistic regression (LR) [12], naïve Bayes (NB) [56], random forest (RF) [3], RBF network (NET) [4, 9], and a support vector machine with radial basis function kernel (SVM) [71] for all approaches that did not propose a classifier, but rather a treatment of the training data or something similar. These are the sixteen approaches Koshgoftaar08, Watanabe08, Turhan09, Zimmermann09, CamargoCruz09, Ma12, Peters12, Peters13, Herbold13, ZHe13, PHe15, Peters15, Kawata15, Amasaki15, Ryu15, and Nam15, as well as the baseline ALL.

For Menzies11, we use the WHICH classifier, which was used in the original publication, together with the six classifiers from above (DT, LR, NB, RF, NET, and SVM), i.e., a total of seven classifiers. For Tantithamthavorn16 we apply the proposed hyper parameter optimization to NB, RF, and SVM with the same parameters as suggested by Tantithamthavorn et al. [63]. For DT, NET, and LR, no hyper parameters to optimize are contained in the caret R package [38] suggested by Tantithamthavorn et al. Additionally, we train a C5.0 decision tree [37] with hyper parameter optimization, because it performs best in the evaluation by Tantithamthavorn et al. We refer to these optimized classifiers as NBCARET, RFCARET, SVMCARET, and C50CARET.

The remaining eight approaches directly propose a classification scheme, which we use:

• genetic program (GP) for Liu10;
• logistic ensemble (LE) for Uchigaki12;
• MODEP for Canfora13;
• CODEP with Logistic Regression (CODEP-LR) and CODEP with a Bayesian Network (CODEP-BN) for Panichella14;
• the value-cognitive boosted SVM (VCBSVM) for Ryu14;
• average voting (AVGVOTE), maximum voting (MAXVOTE), bagging with a C4.5 Decision Tree (BAG-DT), bagging with Naïve Bayes (BAG-NB), boosting with a C4.5 Decision Tree (BOOST-DT), and boosting with Naïve Bayes (BOOST-NB) for YZhang15;
• spectral clustering (SC) for FZhang16; and
• search-based selection (SBS) for Hosseini16.

The MODEP classifier by Canfora13 requires either a constraint with a desired recall, or a desired cost objective. Herbold et al. [24] decided to use a recall of 0.7 as the constraint. In our benchmark, we sample different values for recall and use the values 0.1 to 1.0 in steps of 0.1. We denote the different recall constraints after the classifier name using the percentage, e.g., MODEP10 for the constraint recall=0.1.

To deal with randomization, we repeat all approaches that contain random components 10 times and then use the mean value of these repetitions for comparison with the other approaches. Following Herbold et al. [24], these are Liu10, Canfora13, Menzies11, Peters12, Peters13, ZHe13, Peters15, as well as the baseline RANDOM. Moreover, two of the three approaches that we added to the benchmark contain random components, i.e., Hosseini16 because of the random test data splits and the genetic program, and Tantithamthavorn16 because of the cross-validation for the hyper parameter optimization.

3.4 Performance Metrics
To properly evaluate our research questions, we require metrics that measure the cost for different settings. In order to not re-invent the wheel, we scanned the literature and found fitting metrics for all our research questions.

For RQ2, we need a metric that can be used to evaluate the costs if one follows the classification achieved with a defect prediction model, i.e., to apply quality assurance to everything that is predicted as defective and nothing else. We use the metric NECM_{C_ratio}, which is defined as

NECM_{C_{ratio}} = \frac{fp + C_{ratio} \cdot fn}{tp + fp + tn + fn}    (1)

This metric was, e.g., used by Liu et al. [40] and Khoshgoftaar et al. [35] and measures the costs resulting from overhead in quality assurance effort through false positive predictions versus the costs due to missed defects through false negative predictions. C_ratio is used to define the difference in cost for false positive and false negative predictions. Khoshgoftaar et al. and Liu et al. both use 15, 20, and 25 as values for C_ratio. The cost of missing defects may be 15-50 times higher than that of additional quality assurance measures according to the literature [36]. For our benchmark, we use C_ratio = 15, i.e., the most conservative cost scenario, where reviews are relatively expensive in comparison to the saved costs of finding a defect through the prediction model.
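As an illustration of Equation (1), the following minimal sketch computes NECM_{C_ratio} from binary predictions; the function name and interface are our own and not part of the replication kit.

import numpy as np

def necm(y_true, y_pred, cost_ratio=15.0):
    # Normalized expected cost of misclassification (Equation 1).
    # y_true: 1 for defective, 0 for non-defective; y_pred: predicted labels.
    # cost_ratio: cost of a missed defect relative to a false positive review.
    y_true = np.asarray(y_true)
    y_pred = np.asarray(y_pred)
    fp = np.sum((y_pred == 1) & (y_true == 0))
    fn = np.sum((y_pred == 0) & (y_true == 1))
    return (fp + cost_ratio * fn) / len(y_true)

# NECM_15 for a toy product with six instances, one false positive and two missed defects
print(necm([1, 0, 0, 1, 0, 1], [1, 1, 0, 0, 0, 0], cost_ratio=15))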

To evaluate RQ3, i.e., the costs when only a small part of the code shall be reviewed, we use the metric RelB20%, defined as the percentage of defects found when inspecting 20% of the code. Thus, the defect prediction model is used to rank all code entities. Then, the entities are considered starting with the highest ranked entity until 20% of the code is covered. This metric is an adoption of the metric NofB20% used by Y. Zhang et al. [76]. The only difference is that Y. Zhang et al. consider absolute numbers, whereas we consider the percentage. This difference is required due to the diversity in the size of software products. If we do not remove the strong impact of the project size from the values for this metric and use absolutes instead of ratios, our statistical analysis would be strongly influenced by the size of the products and not measure the actual defect detection capability.

The metrics NECM15 and RelB20% evaluate the CPDP models in such a way that a fixed amount of code is considered for additional quality assurance. For NECM15, we follow the classification, i.e., we consider the scenario that is most likely according to the predictions of the CPDP model. For RelB20%, we stop at 20% of the code, regardless of the scores of the classification model. This is related to the notion of thresholds for classification in the machine learning world: a learned prediction model has a scoring function as output and everything above a certain threshold is then classified as defective. While there are good reasons to use the strategies by which the thresholds are picked for NECM15 and RelB20%, these thresholds are still magic numbers. To evaluate RQ4, we use the threshold independent metric AUCEC, which is defined as the area under the curve of review effort versus number of defects found [55]. This is a threshold independent metric that analyzes the performance of a defect likelihood ranking produced by a classifier for all possible thresholds. Thus, the value of AUCEC is not a measure for a single prediction model with a fixed threshold like NECM15 and RelB20%, but instead for a family of prediction models with all possible threshold values. Regarding costs, this means that AUCEC evaluates the costs for all possible amounts of code to which additional quality assurance is applied, starting from applying no quality assurance at all and stopping with applying quality assurance to the complete product.

To evaluate RQ1 and RQ5, we use the metrics NECM15, RelB20%, and AUCEC together. For RQ6, we compare the findings of RQ5 to the results if we use the benchmark criteria from Herbold et al. [22]. Thus, we determine how different the ranking from RQ5 is from a ranking using the metrics AUC, F-measure, G-measure, and MCC. The definition and reasons for selecting these metrics can be found in the benchmark by Herbold et al. [22].

We use actual defect counts for the number of true positives and false negatives to compute the above metrics. Thus, if we have an instance with two defects, it carries twice the weight for the above performance metrics. The only exception to this is RQ1, where we compare using binary labels to using defect counts. With binary labels, we do not care about the number of defects in a class and just label it as defective or non-defective, meaning that classes with five defects have the same weight as classes with one defect.

In case of ties, i.e., two instances with the same score according to a prediction model, we use the size of the instance as tie breaker and say that the smaller instances get additional quality assurance first. This tie-breaking strategy was proposed by Rahman et al. [55].
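The following sketch illustrates how RelB20% and AUCEC could be computed from a defect likelihood ranking, including the size-based tie breaking and the weighting by defect counts described above. The function names, the linear interpolation at the 20% boundary, and the toy values are our own simplifications, not the exact implementation used in the benchmark.

import numpy as np

def effort_curve(scores, sizes, defects):
    # Order instances by descending score, breaking ties by ascending size,
    # and return cumulative fractions of reviewed code and found defects.
    order = np.lexsort((np.asarray(sizes), -np.asarray(scores, dtype=float)))
    sizes = np.asarray(sizes, dtype=float)[order]
    defects = np.asarray(defects, dtype=float)[order]
    effort = np.cumsum(sizes) / sizes.sum()      # fraction of code reviewed
    found = np.cumsum(defects) / defects.sum()   # fraction of defects found
    return np.insert(effort, 0, 0.0), np.insert(found, 0, 0.0)

def relb(scores, sizes, defects, budget=0.2):
    # RelB20%: fraction of defects found when reviewing `budget` of the code.
    effort, found = effort_curve(scores, sizes, defects)
    return float(np.interp(budget, effort, found))

def aucec(scores, sizes, defects):
    # AUCEC: area under the curve of review effort versus defects found.
    effort, found = effort_curve(scores, sizes, defects)
    return float(np.trapz(found, effort))

scores = [0.9, 0.9, 0.4, 0.1]   # classifier scores (higher = more likely defective)
sizes = [100, 50, 200, 650]     # lines of code per instance
defects = [2, 1, 0, 1]          # defect counts used as weights
print(relb(scores, sizes, defects), aucec(scores, sizes, defects))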

3.5 Statistical Analysis
We use a threshold of α = 0.995 for all statistical tests instead of the usually used α = 0.95. Hence, we require that p-value < 0.005 instead of the usual p-value < 0.05 to reject a null hypothesis. We decided to follow the recent suggestion made by a broad array of researchers from different domains to raise the threshold for significance [2]. The reason for this higher threshold is to reduce the likelihood that approaches are falsely detected as significantly different, even though they are not.

For RQ1, we compare the mean performance of each approach achieved using binary labels for calculating metric values with the performance achieved using defect counts for calculating the metrics. We say that there is a difference if the mean value is statistically significantly different and the effect size is non-negligible. For the testing of statistical significance, we use the non-parametric Mann-Whitney-U test [42]. In case the difference is statistically significant, we measure the effect size using Cohen's d [11]. According to Cohen, the effect size is negligible for d < 0.2, small for 0.2 ≤ d < 0.5, medium for 0.5 ≤ d < 0.8, and large for d ≥ 0.8. We used Levene's test [5] to test if the homoscedasticity assumption of Cohen's d is fulfilled. In case we find that binary labels lead to statistically significantly different results with a non-negligible effect size, data sets with binary labels instead of defect counts should not be used for cost-sensitive evaluations.
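A minimal sketch of this comparison with SciPy is shown below; it assumes two arrays containing the metric values of one approach computed with binary labels and with defect counts, respectively.

import numpy as np
from scipy.stats import mannwhitneyu, levene

def cohens_d(a, b):
    # Cohen's d with a pooled standard deviation.
    a, b = np.asarray(a, dtype=float), np.asarray(b, dtype=float)
    pooled = np.sqrt(((len(a) - 1) * a.var(ddof=1) + (len(b) - 1) * b.var(ddof=1))
                     / (len(a) + len(b) - 2))
    return (a.mean() - b.mean()) / pooled

def compare_label_types(metric_binary, metric_counts, alpha=0.005):
    # RQ1-style comparison: significance via Mann-Whitney U, then effect size.
    _, p_levene = levene(metric_binary, metric_counts)  # homoscedasticity check
    _, p = mannwhitneyu(metric_binary, metric_counts, alternative='two-sided')
    d = abs(cohens_d(metric_binary, metric_counts)) if p < alpha else 0.0
    return p, d, p_levene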

For the research questions RQ2-RQ6, we require a ranking of multiple approaches and, thus, a more complex statistical testing procedure. Our first step was to determine if we can use ANOVA [16] in combination with a Scott-Knott test [60], which is a popular choice in recent defect prediction literature that compares multiple approaches to each other, e.g., [18, 23, 63, 64]. The advantage of ANOVA and Scott-Knott is a clear and non-overlapping ranking of results. However, ANOVA has the strong assumptions that all populations follow a normal distribution and are homoscedastic. We used the Shapiro-Wilk test [61] to test if the performance values are normally distributed and used Levene's test [5] to test for homoscedasticity. Unfortunately, both conditions are frequently violated by the data.

Therefore, we decided to use the less powerful but non-parametric Friedman test [16] with the post-hoc Nemenyi test [48] instead. The Nemenyi test compares the distances between the mean ranks of multiple pair-wise comparisons between all approaches on all products of a data set. The main drawback of this test is that the ranks between approaches may be overlapping. To deal with this issue, we follow the strategy suggested by Herbold et al. [25] to create non-overlapping groups of statistically significantly different results. Herbold et al. suggest to start with the best ranked approach and always create a new group if the difference in ranking between two subsequently ranked approaches is greater than the critical distance. At the cost of discriminatory power of the test, this ensures that the resulting groups are non-overlapping and statistically significantly different.

Moreover, we took pattern from Tantithamthavorn et al.'s modification of the Scott-Knott test [64] and adopted the proposed effect size correction. This means that we use Cohen's d to measure the effect size between two subsequently ranked groups and merge them if the effect size is negligible, i.e., d < 0.2.
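To make the grouping step concrete, the sketch below opens a new group whenever the gap between the mean ranks of two subsequently ranked approaches exceeds the Nemenyi critical distance, and then merges neighboring groups with a negligible Cohen's d. The critical distance is taken as an input, the helper names are ours, and the Friedman test is assumed to have indicated an overall difference beforehand.

import numpy as np

def rank_groups(mean_ranks, results, critical_distance, d_threshold=0.2):
    # mean_ranks: dict approach -> mean rank over all products (lower is better)
    # results: dict approach -> array of performance values, used for Cohen's d
    ordered = sorted(mean_ranks, key=mean_ranks.get)
    groups = [[ordered[0]]]
    for prev, curr in zip(ordered, ordered[1:]):
        # open a new group if the rank gap exceeds the critical distance
        if mean_ranks[curr] - mean_ranks[prev] > critical_distance:
            groups.append([curr])
        else:
            groups[-1].append(curr)
    # merge subsequently ranked groups with negligible effect size
    merged = [groups[0]]
    for group in groups[1:]:
        a = np.concatenate([np.asarray(results[name], dtype=float) for name in merged[-1]])
        b = np.concatenate([np.asarray(results[name], dtype=float) for name in group])
        pooled = np.sqrt(((len(a) - 1) * a.var(ddof=1) + (len(b) - 1) * b.var(ddof=1))
                         / (len(a) + len(b) - 2))
        if abs(a.mean() - b.mean()) / pooled < d_threshold:
            merged[-1].extend(group)
        else:
            merged.append(group)
    return merged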

Table 3: Mean performance with binary labels and defect counts, as well as the p-value of the Mann-Whitney-U test.

Dataset    Metric    Binary Label  Defect Counts  p-value (d)
AEEEM      NECM15    1.92          2.25           0.012
AEEEM      RelB20%   0.28          0.29           0.288
AEEEM      AUCEC     0.56          0.57           0.236
JURECZKO   NECM15    3.44          4.10           <0.001 (0.52)
JURECZKO   RelB20%   0.23          0.21           0.631
JURECZKO   AUCEC     0.53          0.54           0.439

The final step of the statistical analysis is the generation of the rankscore from the ranking. The rankscore was introduced by Herbold et al. [24] to deal with the problem of different group sizes that occur when ranks of groups of approaches are created. Not every group will have the same number of approaches. This means that the number of the group-ranking becomes a bad estimator for the performance of the group. Basically, it is a difference if you are in the second-ranked group and there is one approach in the first-ranked group or there are ten approaches in the first-ranked group. The rankscore takes care of this problem by transforming the ranks into a normalized representation based on the percentage of approaches that are on higher ranks, i.e.,

rankscore = 1 - \frac{\#\{approaches ranked higher\}}{\#\{approaches\} - 1}.

For example, a rankscore = 1 is perfect, meaning that no approach is ranked higher; a rankscore = 0.7 would mean that 30% of approaches are ranked higher.
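A direct translation of this definition, assuming that the approaches are already partitioned into ranked groups and that approaches in better groups count as ranked higher:

def rankscore(groups, approach):
    # groups: list of groups ordered from best to worst; approaches within a
    # group share the same rank
    n_total = sum(len(group) for group in groups)
    n_higher = 0
    for group in groups:
        if approach in group:
            return 1.0 - n_higher / (n_total - 1)
        n_higher += len(group)
    raise ValueError('unknown approach')

# 10 approaches in three groups; 'D' has three approaches ranked higher
groups = [['A', 'B', 'C'], ['D'], ['E', 'F', 'G', 'H', 'I', 'J']]
print(rankscore(groups, 'D'))  # 1 - 3/9 = 0.67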

We apply this statistical evaluation procedure to each combination of data set and performance measure. For research questions RQ2-RQ4, we then evaluate the mean rankscore on both data sets for the metrics NECM15, RelB20%, and AUCEC, respectively. For RQ5, we evaluate the mean rankscore on both data sets and for all three cost metrics. For RQ6, we evaluate the mean rankscore on both data sets for the machine learning metrics AUC, F-measure, G-measure, and MCC. To evaluate the relationship between the rankings produced with cost metrics and machine learning metrics for RQ6, we evaluate the correlation between both using Kendall's τ [34]. Kendall's τ is a non-parametric correlation measure between ranks of results, i.e., the ordering of results produced, which is exactly what we are interested in.
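For the correlation check in RQ6, a minimal example with SciPy and made-up rankscore values could look as follows:

from scipy.stats import kendalltau

# hypothetical mean rankscores of the same six approaches under cost metrics
# and under the machine learning metrics used by Herbold et al.
cost_scores = [0.99, 0.97, 0.89, 0.81, 0.77, 0.43]
ml_scores = [0.22, 0.28, 0.74, 0.92, 0.64, 0.57]
tau, p_value = kendalltau(cost_scores, ml_scores)
print(tau, p_value)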

4 RESULTS
We now present the results of our benchmark. A replication kit that provides the complete source code and data required for the replication of our results, a tutorial on how to use the replication kit, as well as additional visualizations, including plots that list all classifiers and not only the best classifiers, and critical distance diagrams for the Nemenyi tests, are available online (see footnote 5).

RQ1: Does it matter if we use defect counts or binary labels for cost metrics?
Table 3 shows the mean values for all cost metrics on the two data sets where defect counts are available. We also report the p-values determined by the Mann-Whitney-U test, and in case of significance, i.e., if p-value < 0.005, we also report the value for Cohen's d in parentheses. We observe statistically significant differences for NECM15 on the JURECZKO data. Levene's test showed that the results are homoscedastic. The effect size is d = 0.52, i.e., medium.

Footnote 5: Reference to the replication kit removed due to double blind review. It will be uploaded to Zenodo in case of acceptance and evaluated through the artifact evaluation. We provide the visualizations and the statistical analysis code as supplemental material to this submission.

Table 4: Mean results for NECM15 with rankscore in parenthesis.

Approach                JURECZKO      AEEEM
ALL-LR                  4.7 (0.29)    1.31 (0.73)
Amasaki15-NB            2.9 (0.57)    1.56 (0.54)
CamargoCruz09-NB        2.92 (0.57)   1.49 (0.54)
Canfora13-MODEP100      0.55 (1)      0.76 (0.92)
FZhang16-SC             3.65 (0.43)   2.35 (0.08)
Herbold13-NET           1.44 (0.86)   0.91 (0.77)
Hosseini16-SBS          2.72 (0.57)   1.63 (0.27)
Kawata15-NET            4.18 (0.43)   1.61 (0.5)
Koshgoftaar08-LR        3.99 (0.43)   1.13 (0.88)
Liu10-GP                1.3 (1)       0.82 (1)
Ma12-NET                3.16 (0.57)   1.27 (0.73)
Menzies11-NB            4.48 (0.43)   2.07 (0.12)
Nam15-DT                1.94 (0.86)   1.11 (0.73)
Nam15-RF                1.94 (0.86)   1.11 (0.73)
Panichella14-CODEP-BN   3.66 (0.57)   1.45 (0.54)
Peters12-LR             4.04 (0.43)   1.41 (0.54)
Peters13-LR             4.04 (0.43)   1.41 (0.54)
Peters15-NB             2.88 (0.57)   1.73 (0.42)
PHe15-LR                5.36 (0.29)   1.21 (0.62)
RANDOM                  3.84 (0.43)   2.48 (0.08)
Ryu14-VCBSVM            2.34 (0.57)   1.71 (0.5)
Ryu15-NB                3.11 (0.57)   3.27 (0.04)
Tantitham.16-NBCARET    3.82 (0.43)   1.97 (0.12)
FIX                     0.53 (1)      0.72 (0.96)
Turhan09-LR             3.88 (0.43)   1.19 (0.73)
Uchigaki12-LE           1.56 (0.86)   3 (0.04)
Watanabe08-NET          2.94 (0.57)   1.57 (0.42)
YZhang15-MAXVOTE        3.49 (0.43)   1.34 (0.65)
ZHe13-NET               2.71 (0.57)   0.87 (1)
Zimmermann09-LR         3.14 (0.57)   1.57 (0.54)

Answer RQ1: For the metric NECM15, defect counts yield significantly different results with a medium effect size in comparison to using binary labels. Therefore, only data with defect counts should be used for evaluations using NECM15. Consequently, we conclude that only data with defect counts should be used for benchmarking with cost metrics and that we may not use the MDP and RELINK data used by Herbold et al. [24] for our benchmark.

RQ2: Which approach performs best if quality assurance is applied according to the prediction?
Figure 1(a) shows the best approaches ranked using NECM15 by their mean rankscore over both data sets. The mean value of NECM15 and the rankscore for each of these best approaches are listed in Table 4. The best ranking approach is Liu10-GP with a perfect mean rankscore of 1, i.e., for both data sets no approach is significantly better. The trivial baseline FIX, i.e., predicting all code as defective, is a close second with a mean rankscore of 0.981. Thus, only one approach beats the trivial baseline FIX for this performance metric. Another approach, Canfora13-MODEP100, is very close to FIX with a rankscore of 0.962.

Figure 1: Mean rankscore over all data sets. The black diamonds depict the mean rankscore, the gray points in the background are the rankscores over which the mean is taken. We list only the result achieved with the best classifier for each approach. The panels show the ranking of approaches (approach versus mean rankscore) for: (a) results with NECM15; (b) results with RelB20%; (c) results with AUCEC; (d) results with NECM15, RelB20%, and AUCEC.

If we look at the actual values of NECM15 and not the rankscore, FIX actually has better mean values than Liu10-GP on both data sets, even though its ranking is worse on the AEEEM data (see footnote 6). This anomaly is possible due to the nature of the Nemenyi test. The Nemenyi test is based on the ranking in pair-wise comparisons of all approaches, not on the mean value. Table 5 shows the NECM15 value and ranking of both Liu10-GP and FIX on each product in the AEEEM data. The equinox and the lucene products are most interesting. On equinox, FIX has an advantage of 0.4 over Liu10-GP with respect to the metric NECM15. However, the difference in ranks to Liu10-GP is only three. On lucene, Liu10-GP has only an advantage of 0.12 over FIX with respect to the metric NECM15, but the difference in ranks to FIX is 27. Thus, while this anomaly seems counter intuitive, from a ranking perspective and also for the statistical test, it is correct. Such effects are the reason why checking for assumptions and choosing appropriate statistical tests is important, as these effects are due to the heteroscedasticity of the data. This also shows that pure comparisons of characteristics like mean or median values are not sufficient for the ranking of multiple approaches.

Footnote 6: There are multiple such anomalies in our results. We checked them and they are all due to the effect we describe above and not due to problems with the statistical analysis.

Table 5: Detailed results for NECM15 on the AEEEM data for Liu10-GP and FIX.

Product   Liu10-GP NECM15   Liu10-GP Rank   FIX NECM15   FIX Rank
eclipse   0.70              11              0.68         9
equinox   0.84              5               0.44         2
lucene    0.75              21              0.87         38
mylyn     1.03              6               0.83         2
pde       0.78              1               0.79         5
mean      0.82              8.8             0.72         11.2

Answer RQ2: Liu10-GP yields the best cost performance assuming missed defects are 15 times more expensive than additional quality assurance costs through false positive predictions. The other 25 approaches perform worse in terms of expected costs than trivially assuming that all code is defective, although at least Canfora13-MODEP is very close to this trivial baseline.

Figure 2: Mean rankscore over all data sets for the metrics AUC, F-measure, G-measure, and MCC. The black diamonds depict the mean rankscore, the gray points in the background the rankscores over which the mean is taken. We list only the results for the classifiers that are listed in Figure 1(d), i.e., those performing best according to the metrics NECM15, RelB20%, and AUCEC. The panel shows the ranking of approaches (approach versus mean rankscore) using AUC, F-measure, G-measure, and MCC.

RQ3: Which approach performs best if additional quality assurance can only be applied to a small portion of the code?
Figure 1(b) shows the best approaches ranked using RelB20% by their mean rankscore over both data sets. The mean value of RelB20% and the rankscore for each of these best approaches are listed in Table 6. There are actually ten approaches with a perfect rankscore of 1, i.e., Zimmermann09-SVM, Watanabe08-SVM, Turhan09-SVM, Ryu15-SVM, PHe15-SVM, Peters15-SVM, Ma12-SVM, CamargoCruz09-SVM, Amasaki15-SVM, and the trivial baseline FIX. Notably, all of the CPDP approaches use the SVM as classifier. Herbold et al. [24] found in their benchmark that unless the bias towards non-defective instances is treated, SVMs often yield trivial or nearly trivial classifiers. We checked the raw results and found that this applies here, too. All of these SVMs are either trivial classifiers predicting only one class, or nearly trivial. Thus, they are nearly the same as the trivial baseline FIX. Since we use the size of entities as tie-breaker, this means that ranking code by size, starting with the smallest instances until 20% of the code is covered, is more efficient than actual CPDP. When we investigate the actual values of RelB20%, we can conclude that we find on average 32%-35% of the defects that way, depending on the data set and which of the top ranked approaches is used.

Table 6: Mean results for RelB20% with rankscore in parenthesis.

Approach                JURECZKO     AEEEM
ALL-SVM                 0.33 (1)     0.3 (0.82)
Amasaki15-SVM           0.34 (1)     0.35 (1)
CamargoCruz09-SVM       0.34 (1)     0.35 (1)
Canfora13-MODEP100      0.33 (1)     0.32 (0.94)
FZhang16-SC             0.26 (0.8)   0.28 (0.71)
Herbold13-LR            0.25 (0.8)   0.36 (1)
Herbold13-NET           0.26 (0.8)   0.32 (1)
Herbold13-SVM           0.24 (0.8)   0.33 (1)
Hosseini16-SBS          0.12 (0.2)   0.29 (0.71)
Kawata15-SVM            0.34 (1)     0.3 (0.82)
Koshgoftaar08-SVM       0.34 (1)     0.32 (0.94)
Liu10-GP                0.27 (0.8)   0.31 (0.94)
Ma12-SVM                0.33 (1)     0.32 (1)
Menzies11-SVM           0.3 (0.8)    0.31 (0.82)
Nam15-DT                0.24 (0.8)   0.29 (0.82)
Nam15-RF                0.24 (0.8)   0.29 (0.82)
Panichella14-CODEP-LR   0.3 (0.8)    0.28 (0.82)
Peters12-SVM            0.33 (1)     0.32 (0.94)
Peters13-SVM            0.33 (1)     0.32 (0.94)
Peters15-SVM            0.34 (1)     0.33 (1)
PHe15-SVM               0.34 (1)     0.32 (1)
RANDOM                  0.26 (0.8)   0.25 (0.35)
Ryu14-VCBSVM            0.24 (0.8)   0.31 (0.94)
Ryu15-SVM               0.33 (1)     0.32 (1)
Tantitham.16-SVMCARET   0.2 (0.4)    0.29 (0.82)
FIX                     0.34 (1)     0.32 (1)
Turhan09-SVM            0.32 (1)     0.33 (1)
Uchigaki12-LE           0.11 (0.2)   0.2 (0.06)
Watanabe08-SVM          0.32 (1)     0.33 (1)
YZhang15-BAG-NB         0.11 (0.2)   0.28 (0.82)
ZHe13-LR                0.23 (0.8)   0.33 (1)
Zimmermann09-SVM        0.34 (1)     0.33 (1)

Answer RQ3: Trivial or nearly trivial predictions perform best if only a small portion of the code undergoes additional quality assurance.

RQ4: Which approach performs best independent of the prediction threshold?
Figure 1(c) shows the best approaches ranked using AUCEC by their mean rankscore over both data sets. The mean value of AUCEC and the rankscore for each of these best approaches are listed in Table 7. The results are very similar to the results from RQ3, i.e., we have a large group of approaches with a perfect rankscore of 1, all of which use SVM as classifiers, including the baseline FIX. Thus, starting with small code entities is again the best strategy.

Answer RQ4: Trivial or nearly trivial predictions perform best without a fixed prediction threshold.


Table 7: Mean results for AUCEC with rankscore in parenthesis.

Approach                JURECZKO     AEEEM
ALL-SVM                 0.63 (1)     0.61 (0.95)
Amasaki15-SVM           0.63 (1)     0.63 (1)
CamargoCruz09-SVM       0.63 (1)     0.63 (1)
Canfora13-MODEP100      0.62 (1)     0.61 (0.95)
FZhang16-SC             0.55 (0.6)   0.57 (0.74)
Herbold13-LR            0.54 (0.6)   0.63 (1)
Herbold13-SVM           0.54 (0.6)   0.62 (1)
Hosseini16-SBS          0.48 (0.2)   0.54 (0.63)
Kawata15-SVM            0.63 (1)     0.61 (0.95)
Koshgoftaar08-SVM       0.63 (1)     0.62 (1)
Liu10-GP                0.57 (0.8)   0.6 (0.79)
Ma12-SVM                0.63 (1)     0.61 (1)
Menzies11-SVM           0.62 (1)     0.62 (1)
Nam15-SVM               0.55 (0.6)   0.62 (1)
Panichella14-CODEP-LR   0.6 (1)      0.6 (0.95)
Peters12-SVM            0.63 (1)     0.6 (0.95)
Peters13-SVM            0.63 (1)     0.6 (0.95)
Peters15-SVM            0.62 (1)     0.62 (1)
PHe15-SVM               0.63 (1)     0.61 (1)
RANDOM                  0.57 (0.6)   0.55 (0.74)
Ryu14-VCBSVM            0.54 (0.6)   0.6 (0.89)
Ryu15-SVM               0.62 (1)     0.61 (1)
Tantitham.16-SVMCARET   0.55 (0.6)   0.62 (1)
FIX                     0.63 (1)     0.61 (1)
Turhan09-SVM            0.63 (1)     0.61 (0.95)
Uchigaki12-LE           0.46 (0.2)   0.49 (0.11)
Watanabe08-SVM          0.62 (1)     0.62 (1)
YZhang15-MAXVOTE        0.53 (0.6)   0.54 (0.42)
ZHe13-LR                0.55 (0.6)   0.61 (1)
ZHe13-NB                0.52 (0.6)   0.6 (1)
ZHe13-SVM               0.54 (0.6)   0.62 (1)
Zimmermann09-SVM        0.62 (1)     0.62 (1)

RQ5: Which approach performs best overall?
Figure 1(d) shows the best approaches ranked using the three metrics NECM15, RelB20%, and AUCEC over both data sets. Given the results for RQ2-RQ4, it is not surprising that the best ranking approach is the trivial baseline FIX with a nearly perfect mean rankscore of 0.994. The best CPDP approaches are Canfora13-MODEP100 with a mean rankscore of 0.969 and Liu10-GP with a mean rankscore of 0.888.

Answer RQ5: No CPDP approach outperforms our trivial baseline on average over three performance metrics. The best performing CPDP approaches are Canfora13-MODEP100 and Liu10-GP.

RQ6: Is the overall ranking based on cost metrics different from the overall ranking based on the AUC, F-measure, G-measure, and MCC?
Figure 2 shows the rankscores of the best ranked approaches from RQ5, but ranked with AUC, F-measure, G-measure, and MCC instead, i.e., the metrics used by Herbold et al. [24, 25] for ranking with machine learning metrics that do not consider costs. In case the rankings are correlated, we would expect that the rankscores are roughly sorted in descending order from top to bottom. However, this is clearly not the case. The two top ranking approaches with cost metrics, i.e., FIX and Canfora13-MODEP, both have very low rankscores with the metrics used by Herbold et al.; the next ranking approaches are much better with at least mediocre rankscores. Please note that the rankscore values here are not the same as determined by Herbold et al. [24, 25], because we only use two of the five data sets for this comparison. We confirmed this visual observation using Kendall's τ as correlation measure. We observe almost no correlation of τ = −0.047 between both rankings.

Answer RQ6: The cost-sensitive ranking is completely different from the ranking based on AUC, F-measure, G-measure, and MCC. Thus, machine learning metrics are unsuited to predict the cost efficiency of approaches, and cost metrics are likewise unsuited to predict the performance measured with machine learning metrics.

5 DISCUSSION
Our results cast a relatively devastating light on the cost efficiency of CPDP. Based on our results, it just seems better to follow a trivial approach and assume everything as equally likely to contain defects. What we find notable is that most of the state of the art of CPDP has ignored cost metrics. Only Khoshgoftaar08, Liu10, Uchigaki12, Canfora13, Panichella14, and YZhang15, i.e., six out of 26 approaches, used any effort or cost related metrics when evaluating their work. Only two of these approaches, i.e., Liu10 and Canfora13, optimize for costs. These are also the two best ranked approaches after the trivial baseline if we use all three cost metrics. Liu10 weights false negatives fifteen times stronger than false positives, i.e., puts a strong incentive on identifying defective instances in comparison to misidentifying instances as defective. This incentive means that they optimize the metric NECM15. As a result, Liu10-GP is better ranked than the trivial baseline, which is the only time this happened for all approaches and metrics in our benchmark. Canfora13 uses two objectives for optimization: the recall (see footnote 7) and the effort in lines of code considered. This is similar to AUCEC, but not sufficient to outperform our trivial baseline even for that metric. Thus, optimizing for cost directly seems to be vital for CPDP if the created models should be cost efficient.

Another interesting aspect of our findings is the lack of correlation between the ranking using cost metrics and the ranking using machine learning metrics. One would expect that good classification models in terms of machine learning metrics also perform well under cost considerations. We believe that this correlation is missing because the performance of the CPDP models is too weak, especially the precision, i.e., the percentage of instances predicted as defective that are actually defective, but also the recall. According to the benchmark from Herbold et al. [25], only very few predictions achieve a recall ≥ 0.7 and a precision ≥ 0.5 at the same time. In other words, finding 70% of the defects with at least 50% of the predicted instances being actually defective is almost impossible with the state of the art of CPDP. Herbold et al. actually found that the baseline FIX is one of the best approaches when it comes to these criteria. Thus, this trivial baseline, which ranks badly in the benchmark by Herbold et al., performs well in our cost setting, because it has a very high recall. We think that this explains the lack of correlation. If the general performance of CPDP models were higher, we believe it is likely that the results would be correlated to costs.
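The following sketch makes the recall argument explicit: a baseline that predicts every instance as defective always reaches a recall of 1.0, while its precision equals the defect ratio of the data set. The label vector is made up for illustration.

```python
# Sketch: recall and precision of the trivial FIX baseline, which predicts
# every instance as defective. The labels are made up for illustration.

def recall_precision(actual, predicted):
    tp = sum(a and p for a, p in zip(actual, predicted))
    fn = sum(a and not p for a, p in zip(actual, predicted))
    fp = sum(not a and p for a, p in zip(actual, predicted))
    recall = tp / (tp + fn) if tp + fn else 0.0
    precision = tp / (tp + fp) if tp + fp else 0.0
    return recall, precision

actual = [True] * 20 + [False] * 80   # 20% of the instances are defective
fix = [True] * len(actual)            # FIX: flag everything as defective

print(recall_precision(actual, fix))  # (1.0, 0.2)
```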

Both the machine learning metric benchmark by Herbold et al. [24] and our cost-sensitive benchmark share the trait that relatively naïve baselines outperform many approaches from the state of the art and that the best ranking CPDP approach is relatively old. In Herbold et al.'s [24] benchmark, the baseline is ALL, i.e., using all data without any treatment for classification. In our case it is the trivial baseline FIX. Moreover, in both benchmarks approaches that are over five years old often perform better than most newer approaches. While most researchers implement comparisons to baselines, other approaches from the state of the art are only seldom replicated. Furthermore, the amount of data is important for the generalization of results. For example, Canfora et al. [8] actually compare their results with the same trivial baseline we do. The difference between their work and ours is that we use more data for our analysis: they use a subset of 10 products from the JURECZKO data, whereas we use 62 products. Together, this shows that diligent comparisons to the state of the art and baselines are vital, but also that using large data sets is mandatory to reach firm conclusions. Otherwise, many approaches will be published that only marginally advance the state of the art, or that do so only under very specific conditions, but not in a broader context. Please note that while we use 67 products in total, this is still only a small amount of data in comparison to the thousands of Java projects on GitHub alone that provide sufficient data in issue tracking systems and commit comments to be used for defect prediction studies. Therefore, we believe that this sample size is a major threat to the validity of this benchmark and of defect prediction research in general. The problem of small sample sizes was already identified as a threat to the validity of repository mining studies in general [65, 66]. Without scaling up our sample sizes, we may run into a serious replication crisis.

6 THREATS TO VALIDITY
Our benchmark has several threats to its validity. We distinguish between internal validity, construct validity, and external validity.

6.1 Internal Validity
Because we replicated only existing work, we do not see any internal threats to the validity of our results.

6.2 Construct Validity
The benchmark's construction influences the results. Threats to the validity of the construction include an unsuitable choice of performance metrics for the research questions, unsuitable statistical tests, noisy or mislabeled data [31], as well as defects in our implementations. To address these threats, we based our metrics and statistical tests on the literature, used multiple data sets to mitigate the impact of noise and mislabeling, and tested all implementations.

Moreover, we performed a sanity check against the results from Herbold et al. [22], assuming that the construct of that benchmark is valid.

6.3 External Validity
The biggest threat to the external validity is the sample size of software products. Within this benchmark, we consider 67 products, which is large for defect prediction studies, but small in relation to the overall number of software projects. Therefore, we cannot firmly conclude that we found general effects and not random effects that only hold for the data we used.

7 CONCLUSION
Within this paper, we present a benchmark on CPDP using effort and cost metrics. We replicated 26 approaches from the state of the art published between 2008 and 2016. Our results show that a trivial approach that predicts everything as defective performs better than the state of the art for CPDP under cost considerations. The two best CPDP approaches were proposed by Liu et al. [40] and Canfora et al. [7, 8] and are close to the trivial predictions in performance. These are also the only two approaches that directly optimize costs. We suspect that the generally insufficient performance of CPDP models, which was already determined in another benchmark by Herbold et al. [24], is the reason for the bad performance of CPDP in a cost-sensitive setting. It seems that optimizing directly for costs and not for the performance of the prediction model is currently the only way to produce relatively cost-efficient CPDP models.

In our future work, we will build on the findings of this benchmark and use the gained insights to advance the state of the art. We plan to define a general CPDP framework that will allow a better selection of optimization criteria for approaches. We want to see if it is possible to build a wrapper around the defect prediction models that makes them optimize for costs, e.g., by injecting other performance metrics into the machine learning algorithms as optimization criteria, or by manipulating the training data such that performance estimations, e.g., based on the error, are more similar to actual costs. We hope to advance the state of the art this way to be more cost-efficient, such that CPDP becomes a significant improvement in comparison to assuming everything is defective. In parallel to this, we will collect more defect prediction data in order to scale up the sample size, allow better conclusions about the generalizability of defect prediction results, and reduce this major threat to the validity of defect prediction research.
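As one possible direction, not the planned framework itself, the sketch below shows how a cost ratio could be injected into an off-the-shelf classifier through class weights; scikit-learn and the 15:1 ratio are used purely for illustration.

```python
# Sketch: injecting a 15:1 misclassification cost ratio into a standard
# classifier via class weights. This is an illustration with scikit-learn,
# not the CPDP framework planned as future work; the data is synthetic.
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
X_train = rng.normal(size=(200, 5))              # made-up software metrics
y_train = (rng.random(200) < 0.2).astype(int)    # ~20% defective (label 1)

# Treat missing a defective instance (class 1) as 15 times as costly as
# flagging a clean one, mirroring a NECM15-style cost ratio.
model = LogisticRegression(class_weight={0: 1, 1: 15}, max_iter=1000)
model.fit(X_train, y_train)

X_new = rng.normal(size=(5, 5))
print(model.predict(X_new))                      # 1 = predicted defective
```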

ACKNOWLEDGMENTS
The author would like to thank GWDG for the access to the scientific compute cluster used for the training and evaluation of thousands of defect prediction models.

REFERENCES
[1] S. Amasaki, K. Kawata, and T. Yokogawa. 2015. Improving Cross-Project Defect Prediction Methods with Data Simplification. In Software Engineering and Advanced Applications (SEAA), 2015 41st Euromicro Conference on. 96–103. https://doi.org/10.1109/SEAA.2015.25

[2] D. J. Benjamin, J. O. Berger, M. Johannesson, B. A. Nosek, E. Wagenmakers, Dustin Tingley, et al. 2017. Redefine Statistical Significance. Nature Human Behaviour (2017).


[3] Leo Breiman. 2001. Random Forests. Mach. Learn. 45, 1 (Oct. 2001), 5–32. https://doi.org/10.1023/A:1010933404324

[4] David S Broomhead and David Lowe. 1988. Radial basis functions, multi-variablefunctional interpolation and adaptive networks. Technical Report. DTIC Docu-ment.

[5] Morton B. Brown and Alan B. Forsythe. 1974. Robust Tests for the Equality ofVariances. J. Amer. Statist. Assoc. 69, 346 (1974), 364–367. http://www.jstor.org/stable/2285659

[6] Ana Erika Camargo Cruz and Koichiro Ochimizu. 2009. Towards logistic regres-sion models for predicting fault-prone code across software projects. In Proc. 3rdInt. Symp. on Empirical Softw. Eng. and Measurement (ESEM). IEEE ComputerSociety, 4. https://doi.org/10.1109/ESEM.2009.5316002

[7] Gerardo Canfora, Andrea De Lucia, Massimiliano Di Penta, Rocco Oliveto, Annibale Panichella, and Sebastiano Panichella. 2013. Multi-Objective Cross-Project Defect Prediction. In Proc. 6th IEEE Int. Conf. Softw. Testing, Verification and Validation (ICST).

[8] Gerardo Canfora, Andrea De Lucia, Massimiliano Di Penta, Rocco Oliveto, Annibale Panichella, and Sebastiano Panichella. 2015. Defect prediction as a multiobjective optimization problem. Software Testing, Verification and Reliability 25, 4 (2015), 426–459. https://doi.org/10.1002/stvr.1570

[9] Rich Caruana and Alexandru Niculescu-Mizil. 2006. An Empirical Comparisonof Supervised Learning Algorithms. In Proceedings of the 23rd International Con-ference on Machine Learning (ICML ’06). ACM, New York, NY, USA, 161–168.https://doi.org/10.1145/1143844.1143865

[10] Lin Chen, Bin Fang, Zhaowei Shang, and Yuanyan Tang. 2015. Negative samplesreduction in cross-company software defects prediction. Information and SoftwareTechnology 62 (2015), 67 – 77. https://doi.org/10.1016/j.infsof.2015.01.014

[11] J. Cohen. 1988. Statistical Power Analysis for the Behavioral Sciences. LawrenceErlbaum Associates.

[12] David R. Cox. 1958. The regression analysis of binary sequences (with discussion).J Roy Stat Soc B 20 (1958), 215–242.

[13] Marco D’Ambros, Michele Lanza, and Romain Robbes. 2010. An ExtensiveComparison of Bug Prediction Approaches. In Proceedings of the 7th IEEEWorkingConference on Mining Software Repositories (MSR). IEEE Computer Society.

[14] Marco D’Ambros, Michele Lanza, and Romain Robbes. 2012. EvaluatingDefect Prediction Approaches: A Benchmark and an Extensive Comparison.Empirical Softw. Engg. 17, 4-5 (Aug. 2012), 531–577. https://doi.org/10.1007/s10664-011-9173-9

[15] Janez Demšar. 2006. Statistical Comparisons of Classifiers over Multiple DataSets. J. Mach. Learn. Res. 7 (Dec. 2006), 1–30. http://dl.acm.org/citation.cfm?id=1248547.1248548

[16] Milton Friedman. 1940. A Comparison of Alternative Tests of Significance forthe Problem of m Rankings. The Annals of Mathematical Statistics 11, 1 (1940),86–92. http://www.jstor.org/stable/2235971

[17] Takafumi Fukushima, Yasutaka Kamei, Shane McIntosh, Kazuhiro Yamashita, andNaoyasu Ubayashi. 2014. An Empirical Study of Just-in-time Defect PredictionUsing Cross-project Models. In Proceedings of the 11th Working Conference onMining Software Repositories (MSR 2014). ACM, New York, NY, USA, 172–181.https://doi.org/10.1145/2597073.2597075

[18] B. Ghotra, S. McIntosh, and A. E. Hassan. 2015. Revisiting the Impact of Classifi-cation Techniques on the Performance of Defect Prediction Models. In SoftwareEngineering (ICSE), 2015 IEEE/ACM 37th IEEE International Conference on, Vol. 1.789–800. https://doi.org/10.1109/ICSE.2015.91

[19] Peng He, Bing Li, Xiao Liu, Jun Chen, and Yutao Ma. 2015. An empirical study onsoftware defect prediction with a simplified metric set. Information and SoftwareTechnology 59 (2015), 170 – 190. https://doi.org/10.1016/j.infsof.2014.11.006

[20] Zhimin He, Fayola Peters, Tim Menzies, and Yang Yang. 2013. Learning fromOpen-Source Projects: An Empirical Study on Defect Prediction. In Proc. 7th Int.Symp. on Empirical Softw. Eng. and Measurement (ESEM).

[21] Steffen Herbold. 2013. Training data selection for cross-project defect prediction.In Proc. 9th Int. Conf. on Predictive Models in Softw. Eng. (PROMISE). ACM, Article6, 10 pages. https://doi.org/10.1145/2499393.2499395

[22] Steffen Herbold. 2017. A systematic mapping study on cross-project defect prediction. CoRR abs/1705.06429 (2017). arXiv:1705.06429 https://arxiv.org/abs/1705.06429

[23] Steffen Herbold, Alexander Trautsch, and Jens Grabowski. 2016. Global vs. localmodels for cross-project defect prediction. Empirical Software Engineering (2016),1–37. https://doi.org/10.1007/s10664-016-9468-y

[24] S. Herbold, A. Trautsch, and J. Grabowski. 2017. A Comparative Study to Benchmark Cross-project Defect Prediction Approaches. IEEE Transactions on Software Engineering PP, 99 (2017), 1–1. https://doi.org/10.1109/TSE.2017.2724538

[25] S. Herbold, A. Trautsch, and J. Grabowski. 2017. Correction of "A Comparative Study to Benchmark Cross-project Defect Prediction". CoRR abs/1707.09281 (2017). arXiv:1707.09281 https://arxiv.org/abs/1707.09281

[26] K. Herzig, S. Just, A. Rau, and A. Zeller. 2013. Predicting defects using change ge-nealogies. In Software Reliability Engineering (ISSRE), 2013 IEEE 24th InternationalSymposium on. 118–127. https://doi.org/10.1109/ISSRE.2013.6698911

[27] Seyedrebvar Hosseini, Burak Turhan, and Mika Mäntylä. 2016. Search Based Training Data Selection For Cross Project Defect Prediction. In Proceedings of the 12th International Conference on Predictive Models and Data Analytics in Software Engineering (PROMISE 2016). ACM, New York, NY, USA, Article 3, 10 pages. https://doi.org/10.1145/2972958.2972964

[28] Seyedrebvar Hosseini, Burak Turhan, and Mika Mäntylä. 2017. A benchmark study on the effectiveness of search-based data selection and feature selection for cross project defect prediction. Information and Software Technology (2017). https://doi.org/10.1016/j.infsof.2017.06.004

[29] Xiaoyuan Jing, Fei Wu, Xiwei Dong, Fumin Qi, and Baowen Xu. 2015. Heteroge-neous Cross-company Defect Prediction by Unified Metric Representation andCCA-based Transfer Learning. In Proceedings of the 2015 10th Joint Meeting onFoundations of Software Engineering (ESEC/FSE 2015). ACM, New York, NY, USA,496–507. https://doi.org/10.1145/2786805.2786813

[30] X. Y. Jing, F. Wu, X. Dong, and B. Xu. 2017. An Improved SDA Based Defect Pre-diction Framework for Both Within-Project and Cross-Project Class-ImbalanceProblems. IEEE Transactions on Software Engineering 43, 4 (April 2017), 321–339.https://doi.org/10.1109/TSE.2016.2597849

[31] Marian Jureczko and Lech Madeyski. 2010. Towards identifying software project clusters with regard to defect prediction. In Proc. 6th Int. Conf. on Predictive Models in Softw. Eng. (PROMISE). ACM, Article 9, 10 pages. https://doi.org/10.1145/1868328.1868342

[32] Yasutaka Kamei, Takafumi Fukushima, Shane McIntosh, Kazuhiro Yamashita,Naoyasu Ubayashi, and Ahmed E. Hassan. 2015. Studying just-in-time defectprediction using cross-project models. Empirical Software Engineering (2015),1–35. https://doi.org/10.1007/s10664-015-9400-x

[33] K. Kawata, S. Amasaki, and T. Yokogawa. 2015. Improving Relevancy FilterMethods for Cross-Project Defect Prediction. In Applied Computing and In-formation Technology/2nd International Conference on Computational Scienceand Intelligence (ACIT-CSI), 2015 3rd International Conference on. 2–7. https://doi.org/10.1109/ACIT-CSI.2015.104

[34] Maurice G. Kendall. 1948. Rank correlation methods. Griffin, Lon-don. http://gso.gbv.de/DB=2.1/CMD?ACT=SRCHA&SRT=YOP&IKT=1016&TRM=ppn+18489199X&sourceid=fbw_bibsonomy

[35] Taghi M. Khoshgoftaar, Pierre Rebours, and Naeem Seliya. 2008. Software qualityanalysis by combining multiple projects and learners. Software Quality Journal17, 1 (2008), 25–49. https://doi.org/10.1007/s11219-008-9058-3

[36] Taghi M. Khoshgoftaar and Naeem Seliya. 2004. Comparative Assessment of Software Quality Classification Techniques: An Empirical Case Study. Empirical Software Engineering 9, 3 (01 Sep 2004), 229–257. https://doi.org/10.1023/B:EMSE.0000027781.18360.9b

[37] Max Kuhn, Steve Weston, Nathan Coulter, and Mark Culp. C code for C5.0 byR. Quinlan. 2015. C50: C5.0 Decision Trees and Rule-Based Models. https://CRAN.R-project.org/package=C50 R package version 0.1.0-24.

[38] Max Kuhn, Jed Wing, Steve Weston, Andre Williams, Chris Keefer, Allan Engel-hardt, Tony Cooper, Zachary Mayer, Brenton Kenkel, the R Core Team, MichaelBenesty, Reynald Lescarbeau, Andrew Ziem, Luca Scrucca, Yuan Tang, CanCandan, and Tyler Hunt. 2016. caret: Classification and Regression Training.https://CRAN.R-project.org/package=caret R package version 6.0-73.

[39] S. Lessmann, B. Baesens, C. Mues, and S. Pietsch. 2008. Benchmarking Classifica-tion Models for Software Defect Prediction: A Proposed Framework and NovelFindings. IEEE Transactions on Software Engineering 34, 4 (July 2008), 485–496.https://doi.org/10.1109/TSE.2008.35

[40] Yi Liu, T.M. Khoshgoftaar, and N. Seliya. 2010. Evolutionary Optimization of Software Quality Modeling with Multiple Repositories. Software Engineering, IEEE Transactions on 36, 6 (Nov 2010), 852–864. https://doi.org/10.1109/TSE.2010.51

[41] Ying Ma, Guangchun Luo, Xue Zeng, and Aiguo Chen. 2012. Transfer learningfor cross-company software defect prediction. Inf. Softw. Technology 54, 3 (2012),248 – 256. https://doi.org/10.1016/j.infsof.2011.09.007

[42] H. B. Mann and D. R. Whitney. 1947. On a Test of Whether one of Two RandomVariables is Stochastically Larger than the Other. The Ann. of Math. Stat. 18, 1(1947), pp. 50–60.

[43] T. Menzies, A Butcher, D. Cok, A Marcus, L. Layman, F. Shull, B. Turhan, andT. Zimmermann. 2013. Local versus Global Lessons for Defect Prediction andEffort Estimation. IEEE Trans. Softw. Eng. 39, 6 (June 2013), 822–834. https://doi.org/10.1109/TSE.2012.83

[44] T. Menzies, A. Butcher, A. Marcus, T. Zimmermann, and D. Cok. 2011. Local vs.global models for effort estimation and defect prediction. In Proc. 26th IEEE/ACMInt. Conf. on Automated Softw. Eng. (ASE). IEEE Computer Society. https://doi.org/10.1109/ASE.2011.6100072

[45] J. Nam, W. Fu, S. Kim, T. Menzies, and L. Tan. 2017. Heterogeneous DefectPrediction. IEEE Transactions on Software Engineering PP, 99 (2017), 1–1. https://doi.org/10.1109/TSE.2017.2720603

[46] Jaechang Nam and Sunghun Kim. 2015. CLAMI: Defect Prediction on Unla-beled Datasets. In Automated Software Engineering (ASE), 2015 30th IEEE/ACMInternational Conference on. 452–463. https://doi.org/10.1109/ASE.2015.56


[47] Jaechang Nam, S.J. Pan, and Sunghun Kim. 2013. Transfer defect learning. InSoftware Engineering (ICSE), 2013 35th International Conference on. 382–391. https://doi.org/10.1109/ICSE.2013.6606584

[48] P.B. Nemenyi. 1963. Distribution-free Multiple Comparison. Ph.D. Dissertation.Princeton University.

[49] Sinno Jialin Pan, I.W. Tsang, J.T. Kwok, and Qiang Yang. 2011. Domain Adaptationvia Transfer Component Analysis. Neural Networks, IEEE Transactions on 22, 2(Feb 2011), 199–210. https://doi.org/10.1109/TNN.2010.2091281

[50] A. Panichella, R. Oliveto, and A. De Lucia. 2014. Cross-project defect predictionmodels: L’Union fait la force. In Software Maintenance, Reengineering and ReverseEngineering (CSMR-WCRE), 2014 Software Evolution Week - IEEE Conference on.164–173. https://doi.org/10.1109/CSMR-WCRE.2014.6747166

[51] F. Peters and T. Menzies. 2012. Privacy and utility for defect prediction: Ex-periments with MORPH. In Software Engineering (ICSE), 2012 34th InternationalConference on. 189–199. https://doi.org/10.1109/ICSE.2012.6227194

[52] F. Peters, T. Menzies, L. Gong, and H. Zhang. 2013. Balancing Privacy and Utilityin Cross-Company Defect Prediction. Software Engineering, IEEE Transactions on39, 8 (Aug 2013), 1054–1068. https://doi.org/10.1109/TSE.2013.6

[53] F. Peters, T. Menzies, and L. Layman. 2015. LACE2: Better Privacy-PreservingData Sharing for Cross Project Defect Prediction. In Software Engineering (ICSE),2015 IEEE/ACM 37th IEEE International Conference on, Vol. 1. 801–811. https://doi.org/10.1109/ICSE.2015.92

[54] J. Ross Quinlan. 1993. C4.5: Programs for Machine Learning. Morgan KaufmannPublishers Inc., San Francisco, CA, USA.

[55] Foyzur Rahman, Daryl Posnett, and Premkumar Devanbu. 2012. Recalling the“imprecision” of cross-project defect prediction. In Proc. ACM SIGSOFT 20th Int.Symp. Found. Softw. Eng. (FSE). ACM, Article 61, 11 pages. https://doi.org/10.1145/2393596.2393669

[56] Stuart Jonathan Russell, Peter Norvig, John F Canny, Jitendra M Malik, andDouglas D Edwards. 2003. Artificial intelligence: a modern approach. Vol. 2.Prentice hall Upper Saddle River.

[57] Duksan Ryu, Okjoo Choi, and Jongmoon Baik. 2014. Value-cognitive boostingwith a support vector machine for cross-project defect prediction. Empirical Soft-ware Engineering 21, 1 (2014), 43–71. https://doi.org/10.1007/s10664-014-9346-4

[58] Duksan Ryu, Jong-In Jang, and Jongmoon Baik. 2015. A Hybrid Instance Se-lection Using Nearest-Neighbor for Cross-Project Defect Prediction. Journal ofComputer Science and Technology 30, 5 (2015), 969–980. https://doi.org/10.1007/s11390-015-1575-5

[59] Duksan Ryu, Jong-In Jang, and Jongmoon Baik. 2015a. A transfer cost-sensitiveboosting approach for cross-project defect prediction. Software Quality Journal(2015a), 1–38. https://doi.org/10.1007/s11219-015-9287-1

[60] A. J. Scott and M. Knott. 1974. A Cluster Analysis Method for Grouping Meansin the Analysis of Variance. Biometrics 30, 3 (1974), pp. 507–512.

[61] Sam S. Shapiro and Martin B. Wilk. 1965. An Analysis of Variance Test forNormality (Complete Samples). Biometrika 52, 3/4 (1965), 591–611.

[62] Ming Tan, Lin Tan, Sashank Dara, and Caleb Mayeux. 2015. Online DefectPrediction for Imbalanced Data. In Proceedings of the 37th International Conferenceon Software Engineering - Volume 2 (ICSE ’15). IEEE Press, Piscataway, NJ, USA,99–108. http://dl.acm.org/citation.cfm?id=2819009.2819026

[63] Chakkrit Tantithamthavorn, ShaneMcIntosh, Ahmed E. Hassan, and Kenichi Mat-sumoto. 2016. Automated Parameter Optimization of Classification Techniquesfor Defect Prediction Models. In Proceedings of the 38th International Conferenceon Software Engineering. ACM. https://doi.org/10.1145/2884781.2884857

[64] Chakkrit Tantithamthavorn, Shane McIntosh, Ahmed E. Hassan, and Kenichi andMatsumoto. 2017. An Empirical Comparison of Model Validation Techniques forDefect Prediction Models. IEEE Transactions on Software Engineering 43, 1 (2017),1–18. https://doi.org/doi.ieeecomputersociety.org/10.1109/TSE.2016.2584050

[65] Fabian Trautsch, Steffen Herbold, Philip Makedonski, and Jens Grabowski. 2016. Adressing Problems with External Validity of Repository Mining Studies Through a Smart Data Platform. In Proceedings of the 13th International Conference on Mining Software Repositories (MSR ’16). ACM, New York, NY, USA, 97–108. https://doi.org/10.1145/2901739.2901753

[66] Fabian Trautsch, Steffen Herbold, Philip Makedonski, and Jens Grabowski. 2017. Addressing problems with replicability and validity of repository mining studies through a smart data platform. Empirical Software Engineering (08 Aug 2017). https://doi.org/10.1007/s10664-017-9537-x

[67] Burak Turhan, Tim Menzies, Ayşe B. Bener, and Justin Di Stefano. 2009. On the relative value of cross-company and within-company data for defect prediction. Empirical Softw. Eng. 14 (2009), 540–578. Issue 5. https://doi.org/10.1007/s10664-008-9103-7

[68] Burak Turhan, Ayse Tosun Misirli, and Ayse Bener. 2013. Empirical evaluationof the effects of mixed project data on learning defect predictors. Information &Software Technology 55, 6 (2013), 1101–1118. https://doi.org/10.1016/j.infsof.2012.10.003

[69] B. Turhan, A. Tosun, and A. Bener. 2011. Empirical Evaluation of Mixed-Project Defect Prediction Models. In Software Engineering and Advanced Applications (SEAA), 2011 37th EUROMICRO Conference on. 396–403. https://doi.org/10.1109/SEAA.2011.59

[70] S. Uchigaki, S. Uchida, K. Toda, and A. Monden. 2012. An Ensemble Approach of Simple Regression Models to Cross-Project Fault Prediction. In Software Engineering, Artificial Intelligence, Networking and Parallel Distributed Computing (SNPD), 2012 13th ACIS International Conference on. 476–481. https://doi.org/10.1109/SNPD.2012.34

[71] Tony van Gestel, JohanA.K. Suykens, Bart Baesens, Stijn Viaene, Jan Vanthienen,Guido Dedene, Bart de Moor, and Joos Vandewalle. 2004. Benchmarking LeastSquares Support Vector Machine Classifiers. Machine Learning 54, 1 (2004), 5–32.

[72] Shinya Watanabe, Haruhiko Kaiya, and Kenji Kaijiri. 2008. Adapting a faultprediction model to allow inter language reuse. In Proc. 4th Int. Workshop onPredictor Models in Softw. Eng. (PROMISE). ACM, 6. https://doi.org/10.1145/1370788.1370794

[73] X. Xia, D. Lo, S. J. Pan, N. Nagappan, and X. Wang. 2016. HYDRA: MassivelyCompositional Model for Cross-Project Defect Prediction. IEEE Transactions onSoftware Engineering 42, 10 (Oct 2016), 977–998. https://doi.org/10.1109/TSE.2016.2543218

[74] Feng Zhang, Audris Mockus, Iman Keivanloo, and Ying Zou. 2015. Towards build-ing a universal defect prediction model with rank transformed predictors. Empiri-cal Software Engineering (2015), 1–39. https://doi.org/10.1007/s10664-015-9396-2

[75] Feng Zhang, Quan Zheng, Ying Zou, and Ahmed E. Hassan. 2016. Cross-projectDefect Prediction Using a Connectivity-based Unsupervised Classifier. In Pro-ceedings of the 38th International Conference on Software Engineering (ICSE ’16).ACM, New York, NY, USA, 309–320. https://doi.org/10.1145/2884781.2884839

[76] Yun Zhang, D. Lo, Xin Xia, and Jianling Sun. 2015. An Empirical Study ofClassifier Combination for Cross-Project Defect Prediction. In Computer Softwareand Applications Conference (COMPSAC), 2015 IEEE 39th Annual, Vol. 2. 264–269.https://doi.org/10.1109/COMPSAC.2015.58

[77] Thomas Zimmermann, Nachiappan Nagappan, Harald Gall, Emanuel Giger, andBrendan Murphy. 2009. Cross-project defect prediction: a large scale experimenton data vs. domain vs. process. In Proc. the 7th Joint Meet. Eur. Softw. Eng. Conf.(ESEC) and the ACM SIGSOFT Symp. Found. Softw. Eng. (FSE). ACM, 91–100.https://doi.org/10.1145/1595696.1595713

