State coverage: an empirical analysis based on a user study
Dries Vanoverberghe, Emma Eyckmans, and Frank Piessens
Software Validation Metrics
• Software defects after product release are expensive
  – NIST 2002: $60 billion annually
  – MS Security bulletins: around 40/year, at $100k to $1M each
• Validating software (testing)
  – Reduce # defects before release
  – But not without a cost
• Make a tradeoff:
  – Estimate remaining # defects
=> Software validation metrics
Example: Code coverage
• Fraction of statements/basic blocks that are executed by the test suite
• Principle:
  – not executed => no defects discovered
• Hypothesis:
  – not executed => more likely contains a defect (see the sketch below)
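A minimal sketch of how statement coverage is counted (hypothetical Java, not from the studied application):

    // Three statements in total.
    int max(int a, int b) {
        if (a > b) {    // executed by max(2, 1)
            return a;   // executed by max(2, 1)
        }
        return b;       // not executed by a suite that only calls max(2, 1)
    }
    // A suite containing only assertEquals(2, max(2, 1)) executes 2 of 3
    // statements (~67% statement coverage); a defect in 'return b' cannot
    // be discovered by this suite.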
Example: Code coverage
• High statement coverage
  – No defects? Not necessarily
  – Different paths through the same statements may remain untested
• Structural coverage metrics:
  – e.g. path coverage, data flow coverage, …
  – Measure the degree of exploration
• Automatic tool assistance
  – Metrics evaluate tools rather than human effort
Problem statement
• Exploration is not sufficient
  – Tests also need to check requirements
  – Evaluate the completeness of the test oracle
• Impossible to automate:
  – Requirements would have to be guessed
  – Evaluation is critical!
• No good metrics available
State coverage
• Evaluate the strength of assertions
• Idea:
  – State updates must be checked by assertions (see the sketch below)
• Hypothesis:
  – Unchecked state update => more likely a defect
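To illustrate the idea, a minimal JUnit-style sketch; class and method names are hypothetical, not taken from the studied application:

    @Test
    public void addAppointment() {
        Calendar cal = new Calendar();
        cal.add(new Appointment("dentist"));  // state update: assigns to cal's fields
        // Checked state update: the assertion reads the updated state.
        assertEquals(1, cal.size());
        // If cal.add() also assigned a 'lastModified' field that no assertion
        // ever reads, that update would remain unchecked: a wrong value there
        // goes unnoticed. State coverage flags exactly such updates.
    }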
State coverage
• Complements code coverage
  – No replacement for it
• Metrics also assist developers
  – Code coverage => reachability of statements?
  – State coverage => invariant established by reachable statements?
State coverage
• Metric:
  – state coverage = (# state updates read in assertions) / (total # state updates)
• State update:
  – Assignment to fields of objects
  – Return values, local variables, … also possible
• Computation:
  – Runtime monitor (sketched below)
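A minimal sketch of such a runtime monitor (all names are illustrative; a real tool would invoke these hooks via instrumentation, e.g. bytecode rewriting):

    import java.util.HashSet;
    import java.util.Set;

    class StateCoverageMonitor {
        // Tracks state updates by field identifier; counting dynamic
        // update occurrences instead works the same way.
        private final Set<String> updated = new HashSet<>();
        private final Set<String> readInAssertions = new HashSet<>();

        // Hook: called on every assignment to an object field.
        void onStateUpdate(String fieldId) { updated.add(fieldId); }

        // Hook: called when an assertion reads a field's value.
        void onAssertionRead(String fieldId) {
            if (updated.contains(fieldId)) readInAssertions.add(fieldId);
        }

        // state coverage = # updates read in assertions / total # updates
        double coverage() {
            return updated.isEmpty() ? 0.0
                 : (double) readInAssertions.size() / updated.size();
        }
    }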
Design of experiment
• Existing evaluation:
  – Correlation with mutation adequacy (Koster et al.)
  – Case study by an expert user
• Goal:
  – Directly analyze correlation with 'real' defects
  – Average users rather than experts
Hypotheses
• Hypothesis 1:
  – When increasing state coverage (without increasing exploration), the number of discovered defects increases
  – Similar to the existing case study
• Hypothesis 2:
  – State coverage and the number of discovered defects are correlated
  – A much stronger claim
Structure of experiment
• Base program:
  – Small calendar management system
  – Result of a software design course
  – Existing test suite
  – Presence of software defects unknown
Structure of experiment
• Phase 1: case study
  – Extend the test suite to find defects
    • First increase code coverage
    • Then increase state coverage
  – Dry run of the experiment
    • Simplified application
    • Injected additional defects
Structure of experiment
• Phase 2: controlled user study
  – Create a new test suite
    • First increase code coverage
    • Then increase state coverage
  – Commit after each detected defect
Threats to validity
• Internal validity
  – Two sessions: no differences observed
  – Learning effect: subjects were familiar with the environment before the experiment
• External validity
  – Choice of application
  – Choice of faults
  – Subjects are students
Results
• Phase 1: case study
  – No additional defects discovered
  – No confirmation for hypothesis 1
  – Potential reasons:
    • Mostly structural faults
    • Non-structural faults were obvious
• Phase 2: controlled user study
  – No confirmation for hypothesis 1
[Figure: Code coverage (%) plotted against # detected faults, one series per user (users 1–13)]
[Figure: State coverage plotted against # detected faults, one series per user (users 1–13)]
Potential causes
• Frequency of logical faults
  – 3/20 incorrect state updates – only 1/14 discovered!
  – 5/14 are detected by assertions
  – Focusing on these 5 faults:
    • Higher state coverage (42% vs. 34%) for classes that detect at least one of these 5
  – How common are logical faults?
Potential causes
• Logical faults too obvious
  – Subjects had already discovered them while increasing code coverage
• State coverage is not monotonic
  – Adding new tests may decrease state coverage
  – Always relative to exploration (worked example below)
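A worked example of the non-monotonicity (numbers illustrative): a suite whose assertions read 8 of its 10 executed state updates has 80% state coverage. Adding a test that executes 5 new state updates but asserts on only 1 of them yields 9/15 = 60%: coverage drops even though the suite became strictly stronger, because the denominator grows with exploration.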
[Figure: Absolute state coverage (# covered state updates) plotted against # detected faults, one series per user (users 1–13)]
Conclusions
• The experiment fails to confirm the hypotheses
  – How frequent are logical faults?
  – Combine state coverage with code coverage?
    • Or compare test suites with similar code coverage
• But state coverage is also:
  – Simple
  – Efficient
Questions?