Heterogeneous Defect Prediction (ESEC/FSE 2015)


Heterogeneous Defect Prediction
ESEC/FSE 2015

September 3, 2015
Jaechang Nam and Sunghun Kim

Department of Computer Science and Engineering

HKUST

2

Software Defect Prediction

[Diagram: a prediction model is trained on Project A's labeled instances (buggy or clean, each described by metric values) and then used to predict unlabeled instances. Legend: metric value; buggy-labeled instance; clean-labeled instance; ?: unlabeled instance]

Related Work: Munson@TSE`92, Basili@TSE`95, Menzies@TSE`07, Hassan@ICSE`09, Bird@FSE`11, D'Ambros@EMSE`12, Lee@FSE`11, ...

Within-Project Defect Prediction (WPDP)

3

What if labeled instances do not exist?

[Diagram: Project X has only an unlabeled dataset (?: unlabeled instance; metric values), so no prediction model can be trained. Examples: new projects, projects lacking historical data]

4

Existing Solutions?

[Diagram: (New) Project X with an unlabeled dataset (?: unlabeled instance; metric values)]

5

Cross-Project Defect Prediction (CPDP)

[Diagram: a model is trained on Project A (source, labeled buggy/clean instances) and used to predict the unlabeled dataset of Project X (target). Legend: metric value; buggy-labeled instance; clean-labeled instance; ?: unlabeled instance]

Related Work: Watanabe@PROMISE`08, Turhan@EMSE`09, Zimmermann@FSE`09, Ma@IST`12, Zhang@MSR`14, Panichella@WCRE`14, Canfora@STVR`15

Challenge
• CPDP requires the same metric set (same feature space) in source and target.
• Heterogeneous metrics between source and target are not handled.

6

Motivation

[Diagram: training a model on Project A (source) and testing on Project C (target) fails when the two projects have heterogeneous metric sets (different feature spaces or different domains)]

Possible to reuse all the existing defect datasets for CPDP!

Heterogeneous Defect Prediction (HDP)

7

Key Idea
• Consistent defect-proneness tendency of metrics
  – Defect prediction metrics measure the complexity of software and its development process, e.g.:
    • The number of developers touching a source code file (Bird@FSE`11)
    • The number of methods in a class (D'Ambros@EMSE`12)
    • The number of operands (Menzies@TSE`08)
  – More complexity implies more defect-proneness (Rahman@ICSE`13)
• Distributions between source and target should be the same to build a strong prediction model.

Match source and target metrics that have similar distributions

8

Heterogeneous Defect Prediction (HDP)

- Overview -

[Diagram: Source: Project A has metrics X1–X4 and Buggy/Clean labels; Target: Project B has metrics Y1–Y7 and only unlabeled instances. Steps: (1) Metric Selection on the source, (2) Metric Matching between the selected source metrics and the target metrics, (3) Build the cross-prediction model on the matched source data (training) and Predict the target (test)]

9

Metric Selection
• Why? (Guyon@JMLR`03)
  – Select informative metrics
    • Remove redundant and irrelevant metrics
  – Decrease the complexity of metric matching combinations
• Feature Selection Approaches (Gao@SPE`11, Shivaji@TSE`13)
  – Gain Ratio
  – Chi-square
  – Relief-F
  – Significance attribute evaluation
(see the sketch below)
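The selection step can be approximated with any standard feature-selection ranker. Below is a minimal sketch using scikit-learn's chi-square scorer (one of the approaches listed above); the 15% keep-rate, the helper name select_metrics, and the use of scikit-learn rather than Weka are illustrative assumptions, not the paper's exact setup.

import numpy as np
from sklearn.feature_selection import SelectPercentile, chi2

def select_metrics(X_source, y_source, percentile=15):
    """Keep only the top-ranked source metrics before metric matching."""
    # chi2 assumes non-negative metric values (true for typical count metrics).
    selector = SelectPercentile(chi2, percentile=percentile)
    selector.fit(X_source, y_source)
    kept = selector.get_support(indices=True)  # indices of the selected metrics
    return X_source[:, kept], kept

# Toy example with random data standing in for a source project:
X = np.random.randint(0, 50, size=(200, 20)).astype(float)  # 200 instances, 20 metrics
y = np.random.randint(0, 2, size=200)                        # 1 = buggy, 0 = clean
X_selected, kept_idx = select_metrics(X, y)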

10

Metric Matching

[Diagram: candidate matches between source metrics (X1, X2) and target metrics (Y1, Y2), each edge labeled with a matching score (0.8, 0.4, 0.5, 0.3)]

* We can apply different cutoff values for the matching scores.
* It is possible that there is no matching at all.

11

Compute Matching Score
• KSAnalyzer: use the p-value of the Kolmogorov-Smirnov test (Massey@JASA`51)
• Matching score M of the i-th source and j-th target metrics: Mij = pij
(a minimal sketch follows below)
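A minimal sketch of how KSAnalyzer-style matching scores could be computed and used, assuming SciPy's two-sample KS test for the p-values; the one-to-one assignment via linear_sum_assignment and the helper name match_metrics are illustrative choices, since the slide does not spell out the matching algorithm itself.

import numpy as np
from scipy.stats import ks_2samp
from scipy.optimize import linear_sum_assignment

def match_metrics(X_source, X_target, cutoff=0.05):
    """Pair source and target metrics by KS-test p-value matching scores."""
    n_src, n_tgt = X_source.shape[1], X_target.shape[1]
    M = np.zeros((n_src, n_tgt))
    for i in range(n_src):
        for j in range(n_tgt):
            # Mij = p-value of the two-sample KS test between metric distributions
            M[i, j] = ks_2samp(X_source[:, i], X_target[:, j]).pvalue
    # One-to-one pairing that maximizes the total matching score
    # (an assumption of this sketch).
    rows, cols = linear_sum_assignment(-M)
    # Drop pairs below the cutoff; the result may be empty (no matching at all).
    return [(i, j, M[i, j]) for i, j in zip(rows, cols) if M[i, j] > cutoff]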

12

Heterogeneous Defect Prediction
- Overview -

[Same overview diagram as before: metric selection on the source (Project A), metric matching with the target (Project B), then build (training) and predict (test) with the cross-prediction model]

EVALUATION

13

14

Baselines
• WPDP
• CPDP-CM (Turhan@EMSE`09, Ma@IST`12, He@IST`14)
  – Cross-project defect prediction using only the common metrics between source and target datasets
• CPDP-IFS (He@CoRR`14)
  – Cross-project defect prediction on an Imbalanced Feature Set (i.e., heterogeneous metric sets)
  – Uses 16 distributional characteristics of each instance's metric values as features (e.g., mean, std, maximum, ...); see the sketch below
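A minimal sketch of the CPDP-IFS feature construction described above: each instance is represented by distributional characteristics of its own metric values, so source and target share a common feature space regardless of their original metrics. Only a handful of the 16 characteristics are shown; the exact set used in He@CoRR`14 is not reproduced here.

import numpy as np

def distributional_features(X):
    """Represent each instance (row) by summary statistics of its metric values."""
    return np.column_stack([
        X.mean(axis=1),                 # mean
        X.std(axis=1),                  # standard deviation
        X.max(axis=1),                  # maximum
        X.min(axis=1),                  # minimum
        np.median(X, axis=1),           # median
        np.percentile(X, 25, axis=1),   # first quartile
        np.percentile(X, 75, axis=1),   # third quartile
    ])

# Source and target now live in the same (small) feature space:
# Z_source = distributional_features(X_source); Z_target = distributional_features(X_target)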

15

Research Questions (RQs)
• RQ1: Is heterogeneous defect prediction comparable to WPDP?
• RQ2: Is heterogeneous defect prediction comparable to CPDP-CM?
• RQ3: Is heterogeneous defect prediction comparable to CPDP-IFS?

16

Benchmark Datasets

Group     Dataset        # of instances: All   Buggy (%)      # of metrics  Granularity
AEEEM     EQ             325                   129 (39.7%)    61            Class
AEEEM     JDT            997                   206 (20.7%)    61            Class
AEEEM     LC             399                   64 (9.36%)     61            Class
AEEEM     ML             1862                  245 (13.2%)    61            Class
AEEEM     PDE            1492                  209 (14.0%)    61            Class
MORPH     ant-1.3        125                   20 (16.0%)     20            Class
MORPH     arc            234                   27 (11.5%)     20            Class
MORPH     camel-1.0      339                   13 (3.8%)      20            Class
MORPH     poi-1.5        237                   141 (75.0%)    20            Class
MORPH     redaktor       176                   27 (15.3%)     20            Class
MORPH     skarbonka      45                    9 (20.0%)      20            Class
MORPH     tomcat         858                   77 (9.0%)      20            Class
MORPH     velocity-1.4   196                   147 (75.0%)    20            Class
MORPH     xalan-2.4      723                   110 (15.2%)    20            Class
MORPH     xerces-1.2     440                   71 (16.1%)     20            Class
ReLink    Apache         194                   98 (50.5%)     26            File
ReLink    Safe           56                    22 (39.3%)     26            File
ReLink    ZXing          399                   118 (29.6%)    26            File
NASA      cm1            327                   42 (12.8%)     37            Function
NASA      mw1            253                   27 (10.7%)     37            Function
NASA      pc1            705                   61 (8.7%)      37            Function
NASA      pc3            1077                  134 (12.4%)    37            Function
NASA      pc4            1458                  178 (12.2%)    37            Function
SOFTLAB   ar1            121                   9 (7.4%)       29            Function
SOFTLAB   ar3            63                    8 (12.7%)      29            Function
SOFTLAB   ar4            107                   20 (18.7%)     29            Function
SOFTLAB   ar5            36                    8 (22.2%)      29            Function
SOFTLAB   ar6            101                   15 (14.9%)     29            Function

600 prediction combinations in total!

17

Experimental Settings
• Logistic Regression
• HDP vs. WPDP, CPDP-CM, and CPDP-IFS
(a minimal sketch of this protocol follows below)

[Diagram: each target project (e.g., Project A) is split into a 50% training set and a 50% test set, repeated 1000 times (×1000); WPDP trains on the training half, while CPDP-CM, CPDP-IFS, and HDP train on the other projects (Project 1 ... Project n); all models are evaluated on the test half]
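A minimal sketch of the repeated 50/50 split used for the within-project baseline, assuming scikit-learn's logistic regression and AUC; the number of repetitions shown, the helper name wpdp_aucs, and the stratified splitting are illustrative assumptions (the paper repeats the split 1000 times).

import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split

def wpdp_aucs(X, y, repetitions=100, seed=0):
    """Repeated random 50/50 splits of one project, recording the AUC each time."""
    rng = np.random.RandomState(seed)
    aucs = []
    for _ in range(repetitions):
        X_tr, X_te, y_tr, y_te = train_test_split(
            X, y, test_size=0.5, stratify=y, random_state=rng.randint(10**6))
        model = LogisticRegression(max_iter=1000).fit(X_tr, y_tr)
        aucs.append(roc_auc_score(y_te, model.predict_proba(X_te)[:, 1]))
    return np.array(aucs)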

18

Evaluation Measures
• False Positive Rate = FP / (TN + FP)
• True Positive Rate = Recall = TP / (TP + FN)
• AUC (Area Under the receiver operating characteristic Curve)
(a minimal sketch follows below)

[ROC plot: true positive rate vs. false positive rate, both ranging from 0 to 1]
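A minimal sketch computing the three measures above from one prediction run, assuming scikit-learn; the helper name evaluate is illustrative.

from sklearn.metrics import confusion_matrix, roc_auc_score

def evaluate(y_true, y_pred_label, y_pred_prob):
    """FPR, TPR (recall), and AUC for a single prediction run."""
    tn, fp, fn, tp = confusion_matrix(y_true, y_pred_label).ravel()
    fpr = fp / (tn + fp)   # False Positive Rate
    tpr = tp / (tp + fn)   # True Positive Rate = Recall
    auc = roc_auc_score(y_true, y_pred_prob)
    return fpr, tpr, auc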

19

Evaluation Measures
• Win/Tie/Loss (Valentini@ICML`03, Li@JASE`12, Kocaguneli@TSE`13)
  – Wilcoxon signed-rank test (p<0.05) over the 1000 prediction results
  – Win: # of prediction combinations where HDP outperforms the baseline with statistical significance (p<0.05)
  – Tie: # of prediction combinations with no statistically significant difference (p≥0.05)
  – Loss: # of prediction combinations where the baseline outperforms HDP with statistical significance (p<0.05)
(a minimal sketch of this decision follows below)
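A minimal sketch of the Win/Tie/Loss decision for a single prediction combination, assuming two arrays of AUC values from the repeated runs and SciPy's Wilcoxon signed-rank test; deciding the direction by the median AUC is an assumption of this sketch.

import numpy as np
from scipy.stats import wilcoxon

def win_tie_loss(hdp_aucs, baseline_aucs, alpha=0.05):
    """Classify one prediction combination as win, tie, or loss for HDP."""
    _, p = wilcoxon(hdp_aucs, baseline_aucs)
    if p >= alpha:
        return "tie"
    return "win" if np.median(hdp_aucs) > np.median(baseline_aucs) else "loss"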

RESULT

20

Prediction Results in median AUC

Target         WPDP   CPDP-CM  CPDP-IFS  HDPKS (cutoff=0.05)
EQ             0.583  0.776    0.461     0.783
JDT            0.795  0.781    0.543     0.767
LC             0.575  0.636    0.584     0.655
ML             0.734  0.651    0.557     0.692*
PDE            0.684  0.682    0.566     0.717
ant-1.3        0.670  0.611    0.500     0.701
arc            0.670  0.611    0.523     0.701
camel-1.0      0.550  0.590    0.500     0.639
poi-1.5        0.707  0.676    0.606     0.537
redaktor       0.744  0.500    0.500     0.537
skarbonka      0.569  0.736    0.528     0.694*
tomcat         0.778  0.746    0.640     0.818
velocity-1.4   0.725  0.609    0.500     0.391
xalan-2.4      0.755  0.658    0.499     0.751
xerces-1.2     0.624  0.453    0.500     0.489
Apache         0.714  0.689    0.635     0.717*
Safe           0.706  0.749    0.616     0.818*
ZXing          0.605  0.619    0.530     0.650*
cm1            0.653  0.622    0.551     0.717*
mw1            0.612  0.584    0.614     0.727
pc1            0.787  0.675    0.564     0.752*
pc3            0.794  0.665    0.500     0.738*
pc4            0.900  0.773    0.589     0.682*
ar1            0.582  0.464    0.500     0.734*
ar3            0.574  0.862    0.682     0.823*
ar4            0.657  0.588    0.575     0.816*
ar5            0.804  0.875    0.585     0.911*
ar6            0.654  0.611    0.527     0.640
All            0.657  0.636    0.555     0.724*

HDPKS: Heterogeneous defect prediction using KSAnalyzer

22

Win/Tie/Loss Results

                Against WPDP     Against CPDP-CM    Against CPDP-IFS
Target          W     T     L    W     T     L      W     T     L
EQ              4     0     0    2     2     0      4     0     0
JDT             0     0     5    3     0     2      5     0     0
LC              6     0     1    3     3     1      3     1     3
ML              0     0     6    4     2     0      6     0     0
PDE             3     0     2    2     0     3      5     0     0
ant-1.3         6     0     1    6     0     1      5     0     2
arc             3     1     0    3     0     1      4     0     0
camel-1.0       3     0     2    3     0     2      4     0     1
poi-1.5         2     0     2    3     0     1      2     0     2
redaktor        0     0     4    2     0     2      3     0     1
skarbonka       11    0     0    4     0     7      9     0     2
tomcat          2     0     0    1     1     0      2     0     0
velocity-1.4    0     0     3    0     0     3      0     0     3
xalan-2.4       0     0     1    1     0     0      1     0     0
xerces-1.2      0     0     3    3     0     0      1     0     2
Apache          6     0     5    8     1     2      9     0     2
Safe            14    0     3    12    0     5      15    0     2
ZXing           8     0     0    6     0     2      7     0     1
cm1             7     1     2    8     0     2      9     0     1
mw1             5     0     1    4     0     2      4     0     2
pc1             1     0     5    5     0     1      6     0     0
pc3             0     0     7    7     0     0      7     0     0
pc4             0     0     7    2     0     5      7     0     0
ar1             14    0     1    14    0     1      11    0     4
ar3             15    0     0    5     0     10     10    2     3
ar4             16    0     0    14    1     1      15    0     1
ar5             14    0     4    14    0     4      16    0     2
ar6             7     1     7    8     4     3      12    0     3
Total           147   3     72   147   14    61     182   3     35
%               66.2% 1.4%  32.4% 66.2% 6.3%  27.5%  82.0% 1.3%  16.7%

23

Matched Metrics (Win)

[Metric value distributions. Source metric: RFC, the number of methods invoked by a class; target metric: the number of operands]

Matching Score = 0.91

AUC = 0.946 (ant-1.3 → ar5)

24

Matched Metrics (Loss)

[Metric value distributions. Source metric: LOC; target metric: average number of LOC in a method]

Matching Score = 0.13

AUC = 0.391 (Safe → velocity-1.4)

25

Different Feature Selections (median AUCs, Win/Tie/Loss)

               Against WPDP     Against CPDP-CM    Against CPDP-IFS   HDP
Approach       AUC    Win%      AUC    Win%        AUC    Win%        AUC
Gain Ratio     0.657  63.7%     0.645  63.2%       0.536  80.2%       0.720
Chi-Square     0.657  64.7%     0.651  66.4%       0.556  82.3%       0.727
Significance   0.657  66.2%     0.636  66.2%       0.553  82.0%       0.724
Relief-F       0.670  57.0%     0.657  63.1%       0.543  80.5%       0.709
None           0.657  47.3%     0.624  50.3%       0.536  66.3%       0.663

26

Results in Different Cutoffs

         Against WPDP    Against CPDP-CM   Against CPDP-IFS   HDP      Target
Cutoff   AUC    Win%     AUC    Win%       AUC    Win%        AUC      Coverage
0.05     0.657  66.2%    0.636  66.2%      0.553  82.4%       0.724*   100%
0.90     0.657  100%     0.761  71.4%      0.624  100%        0.852*   21%

27

Conclusion
• HDP
  – Potential for CPDP across datasets with different metric sets
• Future work
  – Filtering out noisy metric matching
  – Determining the best probability threshold

28

Q&A
THANK YOU!