A HUMAN STUDY OF FAULT LOCALIZATION ACCURACY
Zachary P. Fry, Westley Weimer
University of Virginia
September 16, 2010
SOFTWARE MAINTENANCE
Maintenance can account for the majority of the software lifecycle
Locating defects in code is a considerable challenge
What if we knew how easy it was to locate faults in a code base beforehand?
  Engineer systems to make bug finding easier
  Concentrate on problem areas
Could we develop a model that measures this? How would we gather a data set?
PROBLEM – FAULT LOCALIZATION
We treat fault localization as the task of determining if a program or code fragment contains a defect and, if so, locating the line where that defect resides
Research question: Which factors contribute to a human’s ability to detect and locate defects?
PROBLEM – FAULT LOCALIZATION
We examine four categories of defect and code characteristics:
  Error type
  Surface and syntactical features
  Control flow and contextual features
  Abstraction
Which of these affect humans’ abilities to locate defects in code?
OUTLINE
Motivation
Structure of Model
Human Study
Evaluation of Model
Conclusions
MOTIVATION: AN EXAMPLE
/** Move a single disk from src to dest. */
public static void hanoi1(int src, int dest) {
    System.out.println(src + " => " + dest);
}

/** Move two disks from src to dest, making use of a spare peg. */
public static void hanoi2(int src, int dest, int spare) {
    hanoi1(src, dest);
    System.out.println(src + " => " + dest);
    hanoi1(spare, dest);
}

/** Move three disks from src to dest, making use of a spare peg. */
public static void hanoi3(int src, int dest, int spare) {
    hanoi2(src, spare, dest);
    System.out.println(src + " => " + dest);
    hanoi2(spare, dest, src);
}
Seeded defect: in hanoi2, the first call should be hanoi1(src, spare); rather than hanoi1(src, dest);
33% of participants correctly located the defect
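For reference, here is a runnable sketch of the version-1 code with the seeded defect reverted, so that hanoi2 starts with hanoi1(src, spare). To make the output checkable, it records each move in a list named moves instead of printing it; that list and the class name HanoiV1 are illustrative additions, not part of the study materials.

```java
import java.util.ArrayList;
import java.util.List;

/** Defect-free Towers of Hanoi (version 1); moves are recorded instead of printed. */
class HanoiV1 {
    // Recorded moves such as "1 => 3"; stands in for System.out.println in the slide code.
    static final List<String> moves = new ArrayList<>();

    /** Move a single disk from src to dest. */
    static void hanoi1(int src, int dest) {
        moves.add(src + " => " + dest);
    }

    /** Move two disks from src to dest, making use of a spare peg. */
    static void hanoi2(int src, int dest, int spare) {
        hanoi1(src, spare);             // corrected line: park the small disk on the spare peg
        moves.add(src + " => " + dest); // move the large disk straight to dest
        hanoi1(spare, dest);            // put the small disk back on top of it
    }

    /** Move three disks from src to dest, making use of a spare peg. */
    static void hanoi3(int src, int dest, int spare) {
        hanoi2(src, spare, dest);
        moves.add(src + " => " + dest);
        hanoi2(spare, dest, src);
    }

    public static void main(String[] args) {
        hanoi3(1, 3, 2);
        moves.forEach(System.out::println);
    }
}
```

Calling hanoi3(1, 3, 2) yields the standard seven-move solution 1 => 3, 1 => 2, 3 => 2, 1 => 3, 2 => 1, 2 => 3, 1 => 3. With the seeded line hanoi1(src, dest), hanoi2 instead emits src => dest twice and then tries to move a disk off the still-empty spare peg.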
TOWERS OF HANOI – VERSION 2
More complex control flow
  if/else statement
  recursion
Rich commenting
Descriptive identifiers
53% of participants correctly located the fault
/*******************************************
  Performs the initial call to moveTower to solve the puzzle.
  Moves the disks from tower 1 to tower 3 using tower 2.
********************************************/
public void solve () {
    moveTower (totalDisks, 1, 3, 2);
}

/*******************************************
  Moves the specified number of disks from one tower to another
  by moving a subtower of n-1 disks out of the way, moving one
  disk, then moving the subtower back. Base case of 1 disk.
********************************************/
private void moveTower (int numDisks, int start, int end, int temp) {
    if (numDisks == 1)
        moveTower(numDisks-1, temp, end, start);
    else {
        moveTower (numDisks-1, start, temp, end);
        moveOneDisk (start, end);
        moveTower (numDisks-1, temp, end, start);
    }
}

/*******************************************
  Prints instructions to move one disk from the specified start
  tower to the specified end tower.
*******************************************/
private void moveOneDisk (int start, int end) {
    System.out.println ("Move one disk from " + start + " to " + end);
}
Seeded defect: the base case should call moveOneDisk (start, end); rather than moveTower(numDisks-1, temp, end, start);
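A runnable sketch of the version-2 code with its base case restored to moveOneDisk (start, end). Instructions are collected in a list named moves rather than printed so the result can be checked; the constructor and that list are illustrative additions, since the slide does not show how totalDisks is set.

```java
import java.util.ArrayList;
import java.util.List;

/** Defect-free version 2: recursive moveTower with the base case restored. */
class TowersOfHanoi {
    private final int totalDisks;
    // Recorded instructions; stands in for the slide's System.out.println in moveOneDisk.
    final List<String> moves = new ArrayList<>();

    TowersOfHanoi(int totalDisks) {
        this.totalDisks = totalDisks;
    }

    /** Moves the disks from tower 1 to tower 3 using tower 2. */
    public void solve() {
        moveTower(totalDisks, 1, 3, 2);
    }

    /** Moves numDisks disks from start to end, using temp as the spare tower. */
    private void moveTower(int numDisks, int start, int end, int temp) {
        if (numDisks == 1) {
            moveOneDisk(start, end);                   // corrected base case
        } else {
            moveTower(numDisks - 1, start, temp, end); // move the subtower out of the way
            moveOneDisk(start, end);                   // move the largest disk
            moveTower(numDisks - 1, temp, end, start); // move the subtower back on top
        }
    }

    /** Records an instruction to move one disk from start to end. */
    private void moveOneDisk(int start, int end) {
        moves.add("Move one disk from " + start + " to " + end);
    }

    public static void main(String[] args) {
        TowersOfHanoi towers = new TowersOfHanoi(3);
        towers.solve();
        towers.moves.forEach(System.out::println);
    }
}
```

For n disks the corrected method emits 2^n − 1 instructions (7 for three disks). With the seeded base case, moveTower(1, …) calls moveTower(0, …), which falls into the else branch and keeps decrementing numDisks without ever reaching 1, so the program recurses until it throws a StackOverflowError instead of printing any instruction.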
MODEL – OVERVIEW
We desire a model of human fault localization accuracy that, given source code as input, can predict the likelihood that a human will be able to accurately locate faults within it
We hypothesize that features relevant to such a model will fall into four categories: fault type, syntax, context, and abstraction
  Existing work tends to focus on only one of these areas at a time
Linear regression – trained on human study data
  Ease of analysis
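The slides say only that the model is a linear regression trained on the human study data. Below is a minimal sketch of that idea, assuming ordinary least squares on a numeric feature matrix (one row per code sample, one column per feature) solved through the normal equations; the class LinearModel and its methods are hypothetical, and the study's actual toolkit and fitting procedure are not stated.

```java
/** Minimal ordinary-least-squares sketch: fits y ≈ b0 + b1*x1 + ... + bk*xk. */
class LinearModel {
    /**
     * Returns coefficients [b0, b1, ..., bk] by solving the normal equations
     * (X^T X) b = X^T y with Gaussian elimination and partial pivoting.
     * rows[i] holds the k feature values of sample i; y[i] is its observed accuracy.
     */
    static double[] fit(double[][] rows, double[] y) {
        int n = rows.length, p = rows[0].length + 1;  // +1 for the intercept column
        double[][] a = new double[p][p + 1];          // augmented matrix [X^T X | X^T y]
        for (int i = 0; i < n; i++) {
            double[] x = new double[p];
            x[0] = 1.0;                               // intercept term
            System.arraycopy(rows[i], 0, x, 1, p - 1);
            for (int r = 0; r < p; r++) {
                for (int c = 0; c < p; c++) a[r][c] += x[r] * x[c];
                a[r][p] += x[r] * y[i];
            }
        }
        // Forward elimination with partial pivoting.
        for (int col = 0; col < p; col++) {
            int pivot = col;
            for (int r = col + 1; r < p; r++)
                if (Math.abs(a[r][col]) > Math.abs(a[pivot][col])) pivot = r;
            double[] tmp = a[col]; a[col] = a[pivot]; a[pivot] = tmp;
            for (int r = col + 1; r < p; r++) {
                double f = a[r][col] / a[col][col];
                for (int c = col; c <= p; c++) a[r][c] -= f * a[col][c];
            }
        }
        // Back substitution.
        double[] b = new double[p];
        for (int r = p - 1; r >= 0; r--) {
            double s = a[r][p];
            for (int c = r + 1; c < p; c++) s -= a[r][c] * b[c];
            b[r] = s / a[r][r];
        }
        return b;
    }

    /** Predicted value for one sample x under coefficients b. */
    static double predict(double[] b, double[] x) {
        double s = b[0];
        for (int j = 0; j < x.length; j++) s += b[j + 1] * x[j];
        return s;
    }
}
```

A single linear model keeps the analysis simple: each coefficient reads directly as the change in predicted accuracy per unit of one feature. With many features (92 here), a QR-based solver from a numerical library would be sturdier than the normal equations, which can be ill-conditioned.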
DEFECT FEATURES
Error type
Adapted and expanded existing Knight taxonomy
Sampled from consecutive Mozilla bugs to obtain types and distribution
We consider 17 total types of single-line defects
Missing statement
Uninitialized variable
Extra assignment
Incorrect type
Incorrect constant
Incorrect parameter
Negated conditional
Incorrect method call
Incorrect variable
…
MODEL – CODE FEATURES
Code-based features
  Most measured automatically, some manually
  92 total
Syntax
  Block nesting level
  Number of method calls
  Num of local vars
  Num of var declarations
  Num of var uses
  Avg line length
  …
Context
  Avg/Max CFG in-edges
  Avg/Max CFG out-edges
  Avg CFG path length
  Num of CFG edges
  Num of CFG leaves
  Ratio of “ifs” to “elses”
  …
Abstraction
  Num of array-based structures
  Uses underlying data structure
  Implements a heap
  Implements a tree
  Implements reheap
  …
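Several context features come straight from a method's control-flow graph (CFG). A sketch, assuming the CFG is given as a successor list and that a "leaf" is a node with no outgoing edges (the slides define neither); it computes the edge count, leaf count, and maximum in- and out-degree, plus McCabe's cyclomatic complexity M = E − N + 2 for a single connected graph, the metric behind the cyclomatic-complexity baseline. The class name CfgFeatures is hypothetical.

```java
import java.util.Arrays;

/** Sketch: a few "context" features computed from a control-flow graph. */
class CfgFeatures {
    final int edges;      // total number of CFG edges (E)
    final int leaves;     // nodes with no successors (assumed definition of "leaf")
    final int maxIn;      // largest in-degree of any node
    final int maxOut;     // largest out-degree of any node
    final int cyclomatic; // McCabe's M = E - N + 2 for one connected graph

    /** succ[v] lists the successor nodes of node v; nodes are numbered 0..n-1. */
    CfgFeatures(int[][] succ) {
        int n = succ.length;
        int[] inDegree = new int[n];
        int e = 0, leafCount = 0, mo = 0;
        for (int v = 0; v < n; v++) {
            e += succ[v].length;
            mo = Math.max(mo, succ[v].length);
            if (succ[v].length == 0) leafCount++;
            for (int w : succ[v]) inDegree[w]++;
        }
        edges = e;
        leaves = leafCount;
        maxOut = mo;
        maxIn = Arrays.stream(inDegree).max().orElse(0);
        cyclomatic = e - n + 2;
    }
}
```

For an if/else diamond (0 → 1, 0 → 2, 1 → 3, 2 → 3) this gives 4 edges, 1 leaf, maximum in- and out-degree 2, and M = 2, matching the single decision point.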
HUMAN STUDY – PARTICIPANT SELECTION
215 fourth year students and volunteers from the internet (crowdsourcing)
Monetary reward given for completion to encourage best effort
Subset                  Average Accuracy    Number of Participants
All                     46.3%               65
Accuracy > 40%          55.2%               46
Experience > 4 years    51.5%               34
Experience = 4 years    46.7%               17
Experience < 4 years    33.4%               14
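The three experience rows partition the 65 participants (34 + 17 + 14 = 65), so their participant-weighted mean should reproduce the overall average. A short check, assuming each reported accuracy is the mean over that subset's participants; the class SubsetCheck is illustrative.

```java
import java.util.Locale;

/** Checks that the experience subsets reproduce the "All" average accuracy. */
class SubsetCheck {
    /** Participant-weighted mean of subset accuracies (in percent). */
    static double weightedMean(double[] accuracy, int[] count) {
        double sum = 0;
        int total = 0;
        for (int i = 0; i < accuracy.length; i++) {
            sum += accuracy[i] * count[i];
            total += count[i];
        }
        return sum / total;
    }

    public static void main(String[] args) {
        double[] acc = {51.5, 46.7, 33.4}; // > 4, = 4, < 4 years of experience
        int[] n = {34, 17, 14};            // 34 + 17 + 14 = 65 participants
        System.out.printf(Locale.ROOT, "%.1f%%%n", weightedMean(acc, n)); // prints 46.3%
    }
}
```

The weighted sum is 3012.5 over 65 participants, about 46.35%, which rounds to the reported 46.3%. The Accuracy > 40% row is a filter rather than part of this partition, so it is left out.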
HUMAN STUDY – CODE SELECTION
Five textbooks
Three sets of code features to vary or control:
  Syntax and Surface
  Control flow and Contextual
  Abstraction
Provides similar concepts but differing presentations and/or implementations
45 Java files total
HUMAN STUDY – FAULT SEEDING
Types and distribution based on Mozilla bugs
All faults selected are limited to one line for simplicity
Random seeding
  Zero or one bug per file
  Type chosen based on distribution
  All possible sites enumerated and one is randomly chosen
  Fault seeded manually, based on actual bugs if possible
20-line search-space windows
  To further control for code length and facilitate quick and accurate responses
  Randomly chosen around the seeded fault location
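The random seeding steps can be sketched as three choices: a defect type drawn in proportion to its observed frequency, one site drawn uniformly from the enumerated candidates, and a 20-line window placed at a random offset that still contains the fault. The weights, the candidate-line list, and the class FaultSeeder are hypothetical inputs for illustration; the actual edit was applied by hand, and the slides do not say how windows were clamped at file boundaries.

```java
import java.util.List;
import java.util.Map;
import java.util.Random;

/** Sketch of the seeding choices; type weights and candidate sites are hypothetical inputs. */
class FaultSeeder {
    static final int WINDOW = 20; // lines shown to participants

    /** Draws a defect type with probability proportional to its weight. */
    static String pickType(Map<String, Double> weights, Random rng) {
        double total = weights.values().stream().mapToDouble(Double::doubleValue).sum();
        double r = rng.nextDouble() * total;
        String last = null;
        for (Map.Entry<String, Double> e : weights.entrySet()) {
            last = e.getKey();
            r -= e.getValue();
            if (r < 0) return last;
        }
        return last; // guards against floating-point round-off
    }

    /** Picks one of the enumerated candidate lines (1-based) uniformly at random. */
    static int pickSite(List<Integer> candidateLines, Random rng) {
        return candidateLines.get(rng.nextInt(candidateLines.size()));
    }

    /**
     * Returns the first line of a WINDOW-line window that contains faultLine,
     * at a random offset, clamped to the file (assumes fileLength >= WINDOW).
     */
    static int pickWindowStart(int faultLine, int fileLength, Random rng) {
        int lo = Math.max(1, faultLine - WINDOW + 1);
        int hi = Math.min(faultLine, fileLength - WINDOW + 1);
        return lo + rng.nextInt(hi - lo + 1);
    }
}
```

Clamping keeps every window a full 20 lines, which matches the stated goal of controlling for code length.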
HUMAN STUDY – PROTOCOL
Each participant sees 30 consecutive files and is asked:
  Is there a bug in this code?
  If so, on what line does the bug occur?
  How difficult do you feel this code is to understand (1-5)?
Participants cannot execute or automatically search the code – only manual inspection is permitted
EVALUATION
Three separate experiments
1. Examines defect type as related to fault localization accuracy
   Are certain bugs harder to find?
2. Examines syntactical, contextual, and abstraction features as related to fault localization accuracy
   Does our model correlate with actual human ability to locate faults better than existing baselines?
3. Analysis of individual features
   What features contribute the most towards humans’ ability to locate defects in source code?
EVALUATION – EXPERIMENT 1
Goal: relate fault type to fault localization accuracy
EVALUATION – EXPERIMENT 2
Goal: measure how accurately our model predicts the ease of human fault localization
Two versions of our model
  All features vs. only those that are measured automatically
Baselines
  Code readability (syntactic and surface features)
  Cyclomatic complexity (contextual features)
  “Textbook difficulty” (chapter number in the textbook)
10-fold cross validation to mitigate over-fitting
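10-fold cross validation holds out each tenth of the data in turn, fits the model on the other nine tenths, and scores the held-out part, so every prediction comes from a model that never saw that sample. A sketch of the index splitting, assuming samples are shuffled once with a fixed seed; the class KFold is hypothetical, and the slides do not say what constitutes one sample.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.Random;

/** Sketch of k-fold cross-validation index splitting (k = 10 on the slide). */
class KFold {
    /** Shuffles 0..n-1 and deals the indices round-robin into k folds. */
    static List<List<Integer>> folds(int n, int k, long seed) {
        List<Integer> idx = new ArrayList<>();
        for (int i = 0; i < n; i++) idx.add(i);
        Collections.shuffle(idx, new Random(seed));
        List<List<Integer>> folds = new ArrayList<>();
        for (int f = 0; f < k; f++) folds.add(new ArrayList<>());
        for (int i = 0; i < n; i++) folds.get(i % k).add(idx.get(i));
        return folds;
    }

    /** Training indices for round `held`: every fold except the held-out one. */
    static List<Integer> trainingSet(List<List<Integer>> folds, int held) {
        List<Integer> train = new ArrayList<>();
        for (int f = 0; f < folds.size(); f++)
            if (f != held) train.addAll(folds.get(f));
        return train;
    }
}
```

In each of the k rounds, the model would be fit on trainingSet(folds, round) and evaluated on folds.get(round); the pooled held-out predictions are then compared with the observed human data.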
EVALUATION – EXPERIMENT 2
Our model greatly outperforms the baselines
Automatic-only model does only slightly worse than the full model
EVALUATION – EXPERIMENT 2
Perceived difficulty is a concrete measure of understandability
Fault localization accuracy is correlated with understandability
While the baselines do comparatively better here, our model correlates in a similar fashion
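Experiment 2 compares the model and the baselines by how well their outputs correlate with observed human measures. The slides do not name the coefficient; this sketch assumes Pearson's r, which measures linear agreement between two equal-length series and ranges from −1 to 1. The class Correlation is hypothetical.

```java
/** Pearson correlation coefficient between two equal-length series. */
class Correlation {
    static double pearson(double[] x, double[] y) {
        if (x.length != y.length || x.length < 2)
            throw new IllegalArgumentException("need two equal-length series of size >= 2");
        double mx = 0, my = 0;
        for (int i = 0; i < x.length; i++) { mx += x[i]; my += y[i]; }
        mx /= x.length;
        my /= y.length;
        double sxy = 0, sxx = 0, syy = 0; // centered cross- and self-products
        for (int i = 0; i < x.length; i++) {
            double dx = x[i] - mx, dy = y[i] - my;
            sxy += dx * dy;
            sxx += dx * dx;
            syy += dy * dy;
        }
        return sxy / Math.sqrt(sxx * syy);
    }
}
```

For x = (1, 2, 3, 4, 5) and y = (2, 4, 5, 4, 5), r = 6/√60 ≈ 0.775. If the study instead used a rank correlation, the same code applied to ranks gives Spearman's coefficient (ignoring ties).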
EVALUATION – FEATURE ANALYSIS
ANOVA of features with respect to human accuracy
(type) – Feature                                       F    Pr(F)    Dir
abs – uses abstraction: array                      130.9  < 0.001    -
abs – provides abstraction: queue                   54.1  < 0.001    +
syn – ratio of constant to variable assignments     40.4  < 0.001    +
syn – avg block nesting level                       38.9  < 0.001    -
abs – provides abstraction: heap                    28.3  < 0.001    +
syn – max global variables                          25.6  < 0.001    +
abs – uses abstraction: linked list                 25.6  < 0.001    -
syn – ratio simple to constant conditional          20.6  < 0.001    -
cfg – max CFG out-edges per node                    10.0    0.002    -
cfg – avg CFG in-edges per node                      5.8    0.016    +
…
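An ANOVA F value compares variation in accuracy between groups with variation within them; a larger F means the grouping explains more of the variance, and Pr(F) is the upper-tail probability of F under an F distribution with (k − 1, n − k) degrees of freedom. A one-way sketch, assuming human-accuracy observations are grouped by feature value (for example, files that use an array versus those that do not); the slides do not state the exact ANOVA design, the p-value step is omitted, and the class OneWayAnova is hypothetical.

```java
/** One-way ANOVA F statistic: between-group mean square over within-group mean square. */
class OneWayAnova {
    static double fStatistic(double[][] groups) {
        int k = groups.length, n = 0;
        double grand = 0;
        for (double[] g : groups) {
            for (double v : g) grand += v;
            n += g.length;
        }
        grand /= n; // grand mean over all observations
        double ssBetween = 0, ssWithin = 0;
        for (double[] g : groups) {
            double mean = 0;
            for (double v : g) mean += v;
            mean /= g.length;
            ssBetween += g.length * (mean - grand) * (mean - grand);
            for (double v : g) ssWithin += (v - mean) * (v - mean);
        }
        double msBetween = ssBetween / (k - 1); // df = k - 1
        double msWithin = ssWithin / (n - k);   // df = n - k
        return msBetween / msWithin;
    }
}
```

For groups {1, 2, 3} and {4, 5, 6}, the between-group sum of squares is 13.5 on 1 degree of freedom and the within-group sum is 4 on 4 degrees of freedom, so F = 13.5.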
CONCLUSION
We present a human study of 65 participants based on concrete fault localization tasks
We analyze the effect that the type of defect has on humans’ ability to locate faults
Based on the source code, we analyze the correlation of surface, control flow, and abstraction features with humans’ ability to locate faults
We present a model of human fault localization accuracy based on these features that correlates with human accuracy at least four times more than corresponding baselines
Questions?