Software Verification:
Testing vs. Model Checking
A Comparative Evaluation of the State of the Art
Thomas Lemberger
Joint work with Dirk Beyer, LMU Munich, Germany
Null Hypothesis:
- Testing is better at finding bugs than model checking.
- Testing is faster than model checking.
- Testing is more precise than model checking.
- Testing is easier to use than model checking.
Thomas Lemberger LMU Munich, Germany 2 / 23
Terminology
- Testing:
  - Execute a finite set of test cases on the program
  - Observe compliance with/violation of the specification
  - Focus: test-case generation
- Model checking:
  - Formally describe the possible program states
  - Prove compliance with/violation of the specification
  - Abstraction is important
- Automated!
Scope
- Single, sequential programs
- Whitebox programs
- Task: bug finding
Comparability
Test-case generators:
- Different conventions for program input, e.g.:
  - klee_make_symbolic(&x, sizeof(x), "x");
  - CREST_int(x);
  - x = input();
  - x = parse(fgets(...));
  - input(&x, sizeof(x), "x")
- Different output formats for test cases, e.g.:
  - a Klee KTEST file (simple.bc, symbolic object __VERIFIER_nondet_int, raw bytes)
  - plain value lists such as 1, -5, 3 and 1, -5, 0
  - "Test inputs: [42, 107]"
  - opaque text (e.g. "zsd;as@d") or raw hex data (0xF203 0x0003 0xF203 0x0003)
- Different or no test executors (e.g. klee-replay for Klee; none for most others)
Comparability
Model checkers:
- Established standard for input programs, e.g.:
    x = __VERIFIER_nondet_int();
- Established standard for the output format of the result:
    FALSE / UNKNOWN / TRUE
⇒ Adjust test-case generators to the standards of model checkers
Framework: TBF
TBF: Test-based falsifier
- Apply test-case generators to model-checker standards
- Create, execute, and observe tests
- Only variable: the test-case generation tool
- Specification: never call __VERIFIER_error
- Disclaimer: a comparison of tools, not techniques
TBF Architecture
Input Program → Preprocessor → Prepared Program → Test-Case Generator → Test Cases → Test-Vector Extractor → Test Vectors → Harness Generator → Harness → Test Executor → Verdict
Preprocessor example: the standard input convention
    int x = __VERIFIER_nondet_int();
is rewritten for the target tool (here Klee) into
    int x; klee_make_symbolic(&x, sizeof(x), "x");
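A minimal sketch of such a preprocessing step in Python; the regex and the restriction to Klee's convention are simplifications for illustration, not TBF's actual implementation:

```python
import re

def prepare_for_klee(source: str) -> str:
    """Rewrite SV-COMP-style nondeterministic inputs into Klee's
    input convention: declare the variable, then mark it symbolic."""
    pattern = re.compile(
        r"(?P<type>\w+)\s+(?P<var>\w+)\s*=\s*__VERIFIER_nondet_\w+\(\)\s*;")

    def rewrite(m: re.Match) -> str:
        t, v = m.group("type"), m.group("var")
        # int x = __VERIFIER_nondet_int();
        #   -> int x; klee_make_symbolic(&x, sizeof(x), "x");
        return f'{t} {v}; klee_make_symbolic(&{v}, sizeof({v}), "{v}");'

    return pattern.sub(rewrite, source)
```

For example, prepare_for_klee('int x = __VERIFIER_nondet_int();') yields the Klee-specific declaration shown on the slide.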
Test-Case Generator example: from the prepared program, Klee produces a tool-specific test case, e.g. a KTEST file (simple.bc, symbolic object __VERIFIER_nondet_int, raw input bytes).
Test-Vector Extractor example: the KTEST file is reduced to a plain test vector, e.g. < 0, 3, 5 >.
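The extraction step can be sketched as follows; the line-based "name: value" input format here is hypothetical, chosen for readability (real formats, such as Klee's binary KTEST files, need format-specific parsers):

```python
def extract_vector(test_case: str) -> list[int]:
    """Parse a tool-specific test case into a plain input vector.
    Assumes one 'name: value' assignment per line (hypothetical format)."""
    vector = []
    for line in test_case.splitlines():
        line = line.strip()
        if not line:
            continue
        # keep only the value; the input's name/order is positional
        _, _, value = line.partition(":")
        vector.append(int(value))
    return vector
```

For instance, a three-line test case "x0: 0 / x1: 3 / x2: 5" becomes the vector [0, 3, 5].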
Harness Generator example: for a program containing
    ... int x = __VERIFIER_nondet_int(); ...
the generated harness provides
    int __VERIFIER_nondet_int() {
        return (int) parse(input());
    }
    void __VERIFIER_error() {
        fprintf(stderr, "Err\n");
        exit(1);
    }
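A harness generator can be sketched as a small templating function that emits such C stubs for each input function the program uses; the parse/input helpers in the emitted code are the placeholders from the slide, not a complete implementation:

```python
# C stub reading one value from the test vector (per input function)
HARNESS_TEMPLATE = """\
{type} __VERIFIER_nondet_{name}() {{
    return ({type}) parse(input());
}}
"""

# C stub signaling a specification violation to the test executor
ERROR_STUB = """\
void __VERIFIER_error() {
    fprintf(stderr, "Err\\n");
    exit(1);
}
"""

def generate_harness(used_inputs: list[tuple[str, str]]) -> str:
    """Emit C harness code: one input stub per (C type, name) pair
    used by the program, plus the error function."""
    parts = [HARNESS_TEMPLATE.format(type=t, name=n) for t, n in used_inputs]
    parts.append(ERROR_STUB)
    return "\n".join(parts)
```

Calling generate_harness([("int", "int")]) produces the two C functions shown above.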
Test Executor:
    for vec in test_vectors:
        stderr = run(prog, harness, vec)
        if "Err" in stderr:
            return FALSE
    return UNKNOWN
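The executor loop above can be made concrete as runnable Python; the command-line interface of the compiled program+harness binary (inputs fed line by line on stdin) is an assumption for this sketch:

```python
import subprocess

def run_tests(executable, test_vectors, run=None):
    """Execute each test vector against the compiled program+harness.
    The harness prints "Err" to stderr when __VERIFIER_error is reached."""
    if run is None:
        def run(exe, vec):
            # Assumed interface: one input value per stdin line
            proc = subprocess.run(
                [exe],
                input="\n".join(str(v) for v in vec),
                capture_output=True, text=True)
            return proc.stderr
    for vec in test_vectors:
        if "Err" in run(executable, vec):
            return "FALSE"   # a test reached __VERIFIER_error: bug found
    return "UNKNOWN"         # no test triggered the error call
```

The `run` parameter isolates the process-spawning detail, so the verdict logic mirrors the slide's pseudocode exactly.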
Considered Tools

Tool         Technique
AFL-fuzz     Greybox fuzzing
Crest-ppc    Concolic execution, search-based
CPATiger     Model-checking-based testing, based on CPAchecker
FShell       Model-checking-based testing, based on Cbmc
Klee         Symbolic execution, search-based
PRtest       Random testing

Cbmc         Bounded model checking
CPA-Seq      Explicit-state, predicate abstraction, k-induction
Esbmc-incr   Bounded model checking, incremental loop bound
Esbmc-kInd   Bounded model checking, k-induction
Experiment Setup
- Benchmark tool: BenchExec
- Limits:
  - 2 CPUs
  - 15 GB of memory
  - 15 min of CPU time
- Benchmark set:
  - Openly available: https://github.com/sosy-lab/sv-benchmarks
  - Largest available benchmark set
  - C programs
  - 1490 tasks with a known bug
  - 4203 tasks without a bug
Experiments
1. Bug-finding capabilities: consider the 1490 tasks with a bug
2. Precision: consider the 4203 tasks without a bug
3. Validity: comparison with the existing Klee-replay
1. Bug-Finding Capabilities I

(t = tester, m = model checker; 1490 tasks with a bug, of which 1115 are compilable)

Tool             Found (of 1490)   Found (of 1115 compilable)   Median CPU time (s)
AFL-fuzz (t)          605                605                        11
CPATiger (t)           57                 57                         4.5
Crest-ppc (t)         376                376                         3.4
FShell (t)            236                236                         6.2
Klee (t)              826                826                         3.6
PRtest (t)            292                292                         3.6
Cbmc (m)              830                779                         1.4
CPA-seq (m)           889                819                        15
Esbmc-incr (m)        949                830                         1.9
Esbmc-kInd (m)        844                761                         2.3
Union testers         887                887                         -
Union MC             1092                930                         -
Union all            1176               1014                         -

- Model checkers find more bugs
- Model checkers don't need stubs
- Model checkers are comparable in speed
1. Bug-Finding Capabilities II

[Quantile plot: CPU time (s, log scale 0.1-1000) over the n-th fastest correct result (0-1400), one line per tool: AFL-fuzz(t), CPATiger(t), Crest(t), FShell(t), KLEE(t), PRTest(t), CBMC(m), CPA-seq(m), ESBMC-incr(m), ESBMC-kInd(m)]
Time Performance
- CPU time of Klee(t) / AFL-fuzz(t) vs. ESBMC-incr(m) on solvable tasks
[Two scatter plots: CPU time for ESBMC-incr(m) (s) over CPU time for KLEE(t) (s), and over CPU time for AFL-fuzz(t) (s); both axes log scale, 0.1-1000]
⇒ Time performance is task-specific
2. Precision
- 4203 tasks without a bug
- Testers: no false alarms
- Model checkers: negligible false alarms (worst: Esbmc-incr, with 6)
3. Validity
Comparison of TBF with Klee-replay:
- Klee-replay is specific to the Klee test-case format
- Same concept as TBF
- Comparable performance
[Scatter plot: CPU time for KLEE + klee-replay (s) over CPU time for TBF with KLEE (s); both axes log scale, 0.1-1000]
Conclusion I
TBF:
- makes 5 existing test-case generators comparable
- allows easy integration of new generators
- automatically transforms generated test cases into executable tests
Conclusion II
Can we confirm our null hypothesis?
- Testing is better at finding bugs than model checking. ✗
- Testing is faster than model checking. ✗
- Testing is more precise than model checking. ✓
- Testing is easier to use than model checking. ✗