Software Verification:
Testing vs. Model Checking
A Comparative Evaluation of the State of the Art
Thomas Lemberger
Joint work with Dirk Beyer, LMU Munich, Germany
Null Hypothesis:
- Testing is better at finding bugs than model checking.
- Testing is faster than model checking.
- Testing is more precise than model checking.
- Testing is easier to use than model checking.
Thomas Lemberger LMU Munich, Germany 2 / 23
Terminology
- Testing:
  - Execute a finite set of test cases on the program
  - Observe compliance with/violation of the specification
  - Focus: test-case generation
- Model checking:
  - Formally describe the possible program states
  - Prove compliance with/violation of the specification
  - Abstraction is important
- Automated!
Scope
- Single, sequential programs
- Whitebox programs
- Task: bug finding
Comparability
Test-case generators:
- Different conventions for program input, e.g.:
  - klee_make_symbolic(&x, sizeof(x), "x");
  - CREST_int(x);
  - x = input();
  - x = parse(fgets(...));
  - input(&x, sizeof(x), "x")
- Different output formats for test cases, e.g.:
  - a Klee KTEST file (simple.bc, symbolic object __VERIFIER_nondet_int, raw bytes)
  - plain value lists such as 1, -5, 3 and 1, -5, 0
  - "Test inputs: [42, 107]"
  - opaque text (e.g. "zsd;as@d") or raw hex data (0xF203 0x0003 0xF203 0x0003)
- Different or no test executors (e.g. klee-replay for Klee; none for most others)
Comparability
Model checkers:
- Established standard for input programs, e.g.:
    x = __VERIFIER_nondet_int();
- Established standard for the output format of the result:
    FALSE / UNKNOWN / TRUE
⇒ Adjust test-case generators to the standards of model checkers
Framework: TBF
TBF: Test-based falsifier
- Apply test-case generators to model-checker standards
- Create, execute, and observe tests
- Only variable: the test-case generation tool
- Specification: never call __VERIFIER_error
- Disclaimer: a comparison of tools, not techniques
TBF Architecture
Input Program → Preprocessor → Prepared Program → Test-Case Generator → Test Cases → Test-Vector Extractor → Test Vectors → Harness Generator → Harness → Test Executor → Verdict
Preprocessor example: the standard input convention
    int x = __VERIFIER_nondet_int();
is rewritten for the target tool (here Klee) into
    int x; klee_make_symbolic(&x, sizeof(x), "x");
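A minimal sketch of such a preprocessing step in Python; the regex and the restriction to Klee's convention are simplifications for illustration, not TBF's actual implementation:

```python
import re

def prepare_for_klee(source: str) -> str:
    """Rewrite SV-COMP-style nondeterministic inputs into Klee's
    input convention: declare the variable, then mark it symbolic."""
    pattern = re.compile(
        r"(?P<type>\w+)\s+(?P<var>\w+)\s*=\s*__VERIFIER_nondet_\w+\(\)\s*;")

    def rewrite(m: re.Match) -> str:
        t, v = m.group("type"), m.group("var")
        # int x = __VERIFIER_nondet_int();
        #   -> int x; klee_make_symbolic(&x, sizeof(x), "x");
        return f'{t} {v}; klee_make_symbolic(&{v}, sizeof({v}), "{v}");'

    return pattern.sub(rewrite, source)
```

For example, prepare_for_klee('int x = __VERIFIER_nondet_int();') yields the Klee-specific declaration shown on the slide.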
Test-Case Generator example: from the prepared program, Klee produces a tool-specific test case, e.g. a KTEST file (simple.bc, symbolic object __VERIFIER_nondet_int, raw input bytes).
Test-Vector Extractor example: the KTEST file is reduced to a plain test vector, e.g. < 0, 3, 5 >.
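The extraction step can be sketched as follows; the line-based "name: value" input format here is hypothetical, chosen for readability (real formats, such as Klee's binary KTEST files, need format-specific parsers):

```python
def extract_vector(test_case: str) -> list[int]:
    """Parse a tool-specific test case into a plain input vector.
    Assumes one 'name: value' assignment per line (hypothetical format)."""
    vector = []
    for line in test_case.splitlines():
        line = line.strip()
        if not line:
            continue
        # keep only the value; the input's name/order is positional
        _, _, value = line.partition(":")
        vector.append(int(value))
    return vector
```

For instance, a three-line test case "x0: 0 / x1: 3 / x2: 5" becomes the vector [0, 3, 5].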
Harness Generator example: for a program containing
    ... int x = __VERIFIER_nondet_int(); ...
the generated harness provides
    int __VERIFIER_nondet_int() {
        return (int) parse(input());
    }
    void __VERIFIER_error() {
        fprintf(stderr, "Err\n");
        exit(1);
    }
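A harness generator can be sketched as a small templating function that emits such C stubs for each input function the program uses; the parse/input helpers in the emitted code are the placeholders from the slide, not a complete implementation:

```python
# C stub reading one value from the test vector (per input function)
HARNESS_TEMPLATE = """\
{type} __VERIFIER_nondet_{name}() {{
    return ({type}) parse(input());
}}
"""

# C stub signaling a specification violation to the test executor
ERROR_STUB = """\
void __VERIFIER_error() {
    fprintf(stderr, "Err\\n");
    exit(1);
}
"""

def generate_harness(used_inputs: list[tuple[str, str]]) -> str:
    """Emit C harness code: one input stub per (C type, name) pair
    used by the program, plus the error function."""
    parts = [HARNESS_TEMPLATE.format(type=t, name=n) for t, n in used_inputs]
    parts.append(ERROR_STUB)
    return "\n".join(parts)
```

Calling generate_harness([("int", "int")]) produces the two C functions shown above.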
Test Executor:
    for vec in test_vectors:
        stderr = run(prog, harness, vec)
        if "Err" in stderr:
            return FALSE
    return UNKNOWN
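The executor loop above can be made concrete as runnable Python; the command-line interface of the compiled program+harness binary (inputs fed line by line on stdin) is an assumption for this sketch:

```python
import subprocess

def run_tests(executable, test_vectors, run=None):
    """Execute each test vector against the compiled program+harness.
    The harness prints "Err" to stderr when __VERIFIER_error is reached."""
    if run is None:
        def run(exe, vec):
            # Assumed interface: one input value per stdin line
            proc = subprocess.run(
                [exe],
                input="\n".join(str(v) for v in vec),
                capture_output=True, text=True)
            return proc.stderr
    for vec in test_vectors:
        if "Err" in run(executable, vec):
            return "FALSE"   # a test reached __VERIFIER_error: bug found
    return "UNKNOWN"         # no test triggered the error call
```

The `run` parameter isolates the process-spawning detail, so the verdict logic mirrors the slide's pseudocode exactly.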
Considered Tools

Tool         Technique
AFL-fuzz     Greybox fuzzing
Crest-ppc    Concolic execution, search-based
CPATiger     Model-checking-based testing, based on CPAchecker
FShell       Model-checking-based testing, based on Cbmc
Klee         Symbolic execution, search-based
PRtest       Random testing

Cbmc         Bounded model checking
CPA-Seq      Explicit-state, predicate abstraction, k-induction
Esbmc-incr   Bounded model checking, incremental loop bound
Esbmc-kInd   Bounded model checking, k-induction
Experiment Setup
- Benchmark tool: BenchExec
- Limits:
  - 2 CPUs
  - 15 GB of memory
  - 15 min of CPU time
- Benchmark set:
  - Openly available: https://github.com/sosy-lab/sv-benchmarks
  - Largest available benchmark set
  - C programs
  - 1490 tasks with a known bug
  - 4203 tasks without a bug
Experiments
1. Bug-finding capabilities: consider the 1490 tasks with a bug
2. Precision: consider the 4203 tasks without a bug
3. Validity: comparison with the existing Klee-replay
1. Bug-Finding Capabilities I

(t = tester, m = model checker; 1490 tasks with a bug, of which 1115 are compilable)

Tool             Found (of 1490)   Found (of 1115 compilable)   Median CPU time (s)
AFL-fuzz (t)          605                605                        11
CPATiger (t)           57                 57                         4.5
Crest-ppc (t)         376                376                         3.4
FShell (t)            236                236                         6.2
Klee (t)              826                826                         3.6
PRtest (t)            292                292                         3.6
Cbmc (m)              830                779                         1.4
CPA-seq (m)           889                819                        15
Esbmc-incr (m)        949                830                         1.9
Esbmc-kInd (m)        844                761                         2.3
Union testers         887                887                         -
Union MC             1092                930                         -
Union all            1176               1014                         -

- Model checkers find more bugs
- Model checkers don't need stubs
- Model checkers are comparable in speed
1. Bug-Finding Capabilities II

[Quantile plot: CPU time (s, log scale 0.1-1000) over the n-th fastest correct result (0-1400), one line per tool: AFL-fuzz(t), CPATiger(t), Crest(t), FShell(t), KLEE(t), PRTest(t), CBMC(m), CPA-seq(m), ESBMC-incr(m), ESBMC-kInd(m)]
Time Performance
- CPU time of Klee(t) / AFL-fuzz(t) vs. ESBMC-incr(m) on solvable tasks
[Two scatter plots: CPU time for ESBMC-incr(m) (s) over CPU time for KLEE(t) (s), and over CPU time for AFL-fuzz(t) (s); both axes log scale, 0.1-1000]
⇒ Time performance is task-specific
2. Precision
- 4203 tasks without a bug
- Testers: no false alarms
- Model checkers: negligible false alarms (worst: Esbmc-incr, with 6)
3. Validity
Comparison of TBF with Klee-replay:
- Klee-replay is specific to the Klee test-case format
- Same concept as TBF
- Comparable performance
[Scatter plot: CPU time for KLEE + klee-replay (s) over CPU time for TBF with KLEE (s); both axes log scale, 0.1-1000]
Conclusion I
TBF:
- makes 5 existing test-case generators comparable
- allows easy integration of new generators
- automatically transforms generated test cases into executable tests
Conclusion II
Can we confirm our null hypothesis?
- Testing is better at finding bugs than model checking. ✗
- Testing is faster than model checking. ✗
- Testing is more precise than model checking. ✓
- Testing is easier to use than model checking. ✗