An Industrial Case Study of Automatically Identifying
Performance Regression-Causes
Thanh H. D. Nguyen, Meiyappan Nagappan, Ahmed E. Hassan
Mohamed Nasser, Parminder Flora
1
2
Performance is key in day-to-day cloud-based software
3
Performance regressions are caused by changes to the software
Version 1 Version 2
4
Performance is measured using resource counters
Time
Counter Value
CPU % Memory Disk IO Network IO
5
Performance regressions are found during load testing
Apply the same load
Version 2
Version 1
Probable causes
Compare the counters
Performance Engineer
Counters for each version: CPU %, Memory usage, Disk IO, Network IO
6
Probable Causes derived from Industrial Case Study
Probable Cause | %
Added frequently executed DB query or mismatched DB indices | 30.54
Added frequently accessed fields and objects | 30.18
Added frequently executed logic | 16.67
Symptom of regression is detected (e.g., response time increased) but no regression-cause can be determined | 16.67
Added blocking I/O access | 5.55
7
Leveraging a repository for Regression-cause analysis
Baseline Counters | Target Counters | Probable Cause
Counters from Version 1 | Counters from Version 2 | Frequently executed logic
Counters from Version 2 | Counters from Version 3 | Mismatched DB indices
… | … | …
Counters from Version n-1 | Counters from Version n | Frequently executed logic
Performance Data for many different Probable Causes!
8
Mining performance regression repositories
Performance Regression Repository
Train Model
Model
Counters from Version n
Counters from Version n+1
Predicted Probable Cause
Evaluate Prediction
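The train-then-predict flow on this slide can be sketched in a few lines. This is only an illustration, not the study's implementation: it uses a plain 1-nearest-neighbour lookup as a stand-in learner, and the repository entries, the counter names, and the per-counter summary numbers are all hypothetical.

```python
import math

# Hypothetical repository: one entry per (baseline, target) version pair.
# Each entry holds one summary number per counter plus the known cause.
COUNTERS = ("cpu", "memory", "disk_io", "network_io")
REPOSITORY = [
    ({"cpu": 0.40, "memory": 0.05, "disk_io": 0.02, "network_io": 0.03},
     "Frequently executed logic"),
    ({"cpu": 0.10, "memory": 0.04, "disk_io": 0.45, "network_io": 0.05},
     "Mismatched DB indices"),
    ({"cpu": 0.35, "memory": 0.08, "disk_io": 0.03, "network_io": 0.02},
     "Frequently executed logic"),
]


def distance(a, b):
    """Euclidean distance between two per-counter summaries."""
    return math.sqrt(sum((a[c] - b[c]) ** 2 for c in COUNTERS))


def predict_cause(repository, features):
    """Return the cause of the most similar past regression (1-NN)."""
    _, cause = min(repository, key=lambda entry: distance(entry[0], features))
    return cause


# Hypothetical summary for Version n vs Version n+1.
new_run = {"cpu": 0.12, "memory": 0.05, "disk_io": 0.50, "network_io": 0.04}
print(predict_cause(REPOSITORY, new_run))
```

With these made-up numbers the new run sits closest to the disk-heavy entry, so the sketch reports "Mismatched DB indices". Any classifier that maps a feature vector to a label could take the place of `predict_cause`.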
9
But input to the model cannot be raw counter data
Time
Counter Value
Load test on Machine 1
Load test on Machine 2
Same pattern, but different values
Use control charts
Control Line, Upper Control Line, Lower Control Line
Total = 7
Violations = 3
Violation Ratio = 3/7
Same Violation Ratio = 3/7 for both load tests
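A minimal sketch of the violation ratio follows. The slides do not say how the control lines are derived, so the choice here (centre = baseline mean, limits = mean ± 3 standard deviations) is an assumption, as are the function names and the CPU % samples. The second machine simply reports every value doubled, to mimic "same pattern, different values".

```python
from statistics import mean, pstdev


def control_limits(baseline, k=3.0):
    """Return (lower, centre, upper) control lines from baseline samples.

    Assumption: centre is the baseline mean and the limits sit k
    population standard deviations away from it.
    """
    centre = mean(baseline)
    spread = pstdev(baseline)
    return centre - k * spread, centre, centre + k * spread


def violation_ratio(baseline, target, k=3.0):
    """Fraction of target samples that fall outside the control limits."""
    lower, _, upper = control_limits(baseline, k)
    violations = sum(1 for value in target if value < lower or value > upper)
    return violations / len(target)


# Hypothetical CPU % samples from a load test on machine 1.
baseline_m1 = [40, 42, 41, 39, 40, 43, 41]
target_m1 = [41, 55, 40, 58, 42, 60, 41]

# Machine 2: same pattern, but every value is twice as large.
baseline_m2 = [2 * v for v in baseline_m1]
target_m2 = [2 * v for v in target_m1]

print(violation_ratio(baseline_m1, target_m1))
print(violation_ratio(baseline_m2, target_m2))
```

Three of the seven target samples fall outside the limits on both machines, so both calls give 3/7 even though the raw values differ. That scale independence is what makes the ratio usable as model input.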
10
Case Study Subjects
Open-source: Web app, small set of users
Not open-source: Communication, large set of users
11
Case Study Methodology
Apply the same load
Version 2
Version 1
Probable causes
Counters for each version: CPU %, Memory usage, Disk IO, Network IO
Inject Fault
Version 1 + Injected Fault
Can we find the probable cause of injected fault?
12
All machine learners perform 3-7 times better than a random predictor
[Figure: bar chart of Accuracy (0% to 80%) per predictor]
Predictors: Random, J48, RandomTree, LMT, RandomForest, BayesNet, N.Bayes, N.Bayes Multinomial, DecisionTable, PART, JRip, LWL, IBk, KStar, SimpleLogistic, Logistic, SMO, MultilayerPerceptron
Families: Decision tree, Bayes, Rule, Lazy, Logistic, Neural net
Results hold for both case studies, but a different machine learner is better in each
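The comparison against a random predictor can be sketched as below. The slides do not define the random predictor, so a uniform guess among the distinct causes is assumed here. The labels and predictions are hypothetical, chosen only to show the arithmetic.

```python
def accuracy(predicted, actual):
    """Fraction of predictions that match the actual cause."""
    correct = sum(1 for p, a in zip(predicted, actual) if p == a)
    return correct / len(actual)


def random_baseline_accuracy(actual):
    """Expected accuracy of guessing uniformly among the distinct causes.

    Assumption: the random predictor picks each cause with equal chance.
    """
    return 1 / len(set(actual))


# Hypothetical injected faults and a learner's predictions for them.
actual = ["db", "fields", "logic", "io", "db", "fields", "logic", "io"]
predicted = ["db", "fields", "logic", "io", "db", "logic", "logic", "db"]

learner = accuracy(predicted, actual)
baseline = random_baseline_accuracy(actual)
print(learner, baseline, learner / baseline)
```

Here the learner gets 6 of 8 right (0.75) against a baseline of 0.25 for four causes, a factor of 3, which is the kind of ratio the "3-7 times better" claim refers to.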
13