Segmented Symbolic Analysis

Wei Le, Rochester Institute of Technology

Motivation

• Symbolic analysis has many important applications in software tools [Sen, Marinov, Agha ‘05] [Godefroid, Klarlund, Sen ‘05] [Le, Soffa ’08] [Chipounov, Kuznetsov, Candea ‘12]

• Compared to testing with concrete input: better coverage

• Compared to other static techniques: more precise

• Will continue being a powerful tool due to improved scalability [Chipounov, Kuznetsov, Candea ‘12]

Challenges of Symbolic Analysis

• Loops: can have a statically unknown bound

• Library calls: the source code of a library is typically not available at compile time

Previous Solutions

• Loops: only a very small state space is covered

– Iterate once [Cadar, Dunbar, Engler ‘08] [Chipounov, Kuznetsov, Candea ‘12]

– Report unknown [Xie, Chou, Engler ‘03]

– Pattern matching [Saxena, Poosankam, McCamant, Song ‘09]

• Library calls: imprecise and require manual effort

– A concrete value [Sen, Marinov, Agha ‘05] [Godefroid, Klarlund, Sen ‘05]

– Manually constructed models (e.g., simplified C implementation) [Bush, Pincus, Sielaff ‘00] [Chipounov, Kuznetsov, Candea ‘12]

Segmented Symbolic Analysis - Insights

• Code is not uniformly easy to analyze

• We should leverage the structural and semantic relations between statements to partition a program and apply different analyses accordingly

• The capabilities of static analysis are limited; we should introduce dynamic analysis to supply information that a pure static symbolic analyzer is slow or unable to produce

Overall Approach

• Perform symbolic analysis

• When an unknown occurs, identify the code segments that cause it

• Construct unit tests and automatically generate inputs

• Run tests, perform dynamic inference to generate symbolic rules and symbolic values (transfer functions)

• Resume symbolic analysis using inferred rules

Novelty of the Work

• Weave static and dynamic analyses on demand on a concurrent framework

• Dynamic analysis is fully automatic (it runs on code segments, not on the entire program)

• Aggregated information from multiple runs: regression analysis

1. Programs mostly consist of linear operations [Knuth ’71] [Halbwachs, Proy, Roumanoff ‘97]

2. Determining program properties often only requires linear constraints [Halbwachs, Proy, Roumanoff ‘97] [Xie, Chou, Engler ‘03]

3. We assume that linear relations can characterize relevant behavior of small code segments

Overview using an Example

[Figure: control-flow graph of the example program, annotated with the results of three analyses]

Graph nodes:

struct stat s; char filename[32]; char* temp = argv[1]; int i = 0
*temp != '\0'   (loop condition)
filename[i] = *temp++   (loop body)
strcat(filename, ", ")
t == 0
t = _stat64i32(filename, &s)   (library call)

Query: 32 > Len(filename)+1

Traditional SA: Library Unknown

Traditional SA with Library Models: Loop Unknown

Segmented SA:
Len(filename') = Len(filename)
Len(filename') = Len(temp)
32 > Len(temp)+1
32 > Len(argv[1])+1
Buffer Overflow

Unit Test to Infer the Loop

//initialize with test inputs
char* temp = _GenChars(test_buf);
char* filename = _GenChars(test_buf);

//code segment for the loop
int i = 0;
while (*temp != '\0') {
    filename[i] = *temp++;
    i++;
}

//output Len(filename)
char* _result = _GenChars(g_buf);
int _rint = strlen(filename);
itoa(_rint, _result, 10);
fputs(_result, fp);

// cleanup…

Reduce to Regression Analysis

Test   Test Input             Test Input Transformed for RA   Output
       temp      filename     Len(temp)    Len(filename)      Len(filename')
1      acde      piidaf       4            6                  4
2      tazipad   qdd          7            3                  7
3      ad        dafdalfll    2            9                  2
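A minimal sketch of how the table above could be turned into a symbolic rule. It assumes ordinary least squares solved through the normal equations; the slides do not say which solver is used, and the function name fit_two_inputs is hypothetical. The two explanatory variables are Len(temp) and Len(filename), and the response is Len(filename').

```c
#include <assert.h>
#include <stddef.h>

/* Absolute value helper (avoids depending on the math library). */
static double abs_d(double v) { return v < 0 ? -v : v; }

/* Fit y = c[0] + c[1]*x1 + c[2]*x2 by ordinary least squares.
   Hypothetical helper, not taken from the original tool.
   Returns 0 on success, -1 if the normal equations are singular. */
int fit_two_inputs(const double *x1, const double *x2, const double *y,
                   size_t n, double c[3])
{
    double a[3][4] = {{0}};  /* augmented normal equations [X^T X | X^T y] */
    assert(n >= 3);          /* need at least as many runs as unknowns */

    for (size_t k = 0; k < n; k++) {
        const double row[3] = {1.0, x1[k], x2[k]};
        for (int i = 0; i < 3; i++) {
            for (int j = 0; j < 3; j++)
                a[i][j] += row[i] * row[j];
            a[i][3] += row[i] * y[k];
        }
    }

    /* Gauss-Jordan elimination with partial pivoting. */
    for (int col = 0; col < 3; col++) {
        int piv = col;
        for (int r = col + 1; r < 3; r++)
            if (abs_d(a[r][col]) > abs_d(a[piv][col]))
                piv = r;
        if (abs_d(a[piv][col]) < 1e-9)
            return -1;       /* inputs do not determine a unique rule */
        for (int j = 0; j < 4; j++) {
            double t = a[col][j];
            a[col][j] = a[piv][j];
            a[piv][j] = t;
        }
        for (int r = 0; r < 3; r++) {
            if (r == col)
                continue;
            double f = a[r][col] / a[col][col];
            for (int j = col; j < 4; j++)
                a[r][j] -= f * a[col][j];
        }
    }

    for (int i = 0; i < 3; i++)
        c[i] = a[i][3] / a[i][i];
    return 0;
}
```

With the three rows of the table (Len(temp) = 4, 7, 2; Len(filename) = 6, 3, 9; Len(filename') = 4, 7, 2), the fit yields coefficients close to (0, 1, 0), which reads as the rule Len(filename') = Len(temp).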

Internal Design and Components

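[Figure: the Helium framework. Symbolic Analysis & Partition Program for Unknown processes queries (q), each in the state Solving, Solved, or Unknown. For an unknown it sends a Request to Dynamic Inference On Demand, which consists of the Test Synthesizer, the Inference Engine, and the Inference Repository, and which Responds with New Rules or Not Found.]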
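The request/respond exchange with the inference repository can be pictured as a lookup-or-infer step. The sketch below is only an illustration under stated assumptions: the types Rule and Repository, the functions request_rule and toy_infer, and the segment key "copy_loop" are all hypothetical, and the code is single-threaded whereas the slides describe a concurrent framework.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

#define MAX_RULES 16

/* Hypothetical record: one inferred rule per code-segment key. */
typedef struct {
    const char *segment;  /* key identifying the code segment */
    const char *rule;     /* inferred symbolic rule, kept as text */
} Rule;

typedef struct {
    Rule   rules[MAX_RULES];
    size_t count;
} Repository;

/* Stands in for "synthesize a unit test, run it, infer a rule".
   Returns NULL when no rule can be inferred. */
typedef const char *(*InferFn)(const char *segment);

/* Respond to a request: reuse a stored rule, or infer and store a new one.
   Returns NULL if the segment stays unknown. The key pointer is stored
   as-is, so it must outlive the repository (string literals do). */
const char *request_rule(Repository *repo, const char *segment, InferFn infer)
{
    assert(repo != NULL && segment != NULL && infer != NULL);

    for (size_t i = 0; i < repo->count; i++)
        if (strcmp(repo->rules[i].segment, segment) == 0)
            return repo->rules[i].rule;      /* found in the repository */

    const char *rule = infer(segment);       /* not found: dynamic inference */
    if (rule != NULL && repo->count < MAX_RULES) {
        repo->rules[repo->count].segment = segment;
        repo->rules[repo->count].rule = rule;
        repo->count++;                       /* new rule kept for later requests */
    }
    return rule;
}

/* Toy inference for illustration only: it knows a single loop segment. */
int infer_calls = 0;

const char *toy_infer(const char *segment)
{
    infer_calls++;
    if (strcmp(segment, "copy_loop") == 0)
        return "Len(filename') = Len(temp)";
    return NULL;
}
```

Requesting "copy_loop" twice triggers inference only once; the second request is answered from the repository. A segment the toy inference does not know stays unknown and is not stored.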

Components on the Helium Framework

• Static component:

- Perform demand-driven, path-sensitive symbolic analysis

- Isolate the code segment that causes unknown

- Determine the environment for the code segment

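Interaction Protocol

[Figure: Symbolic Analysis sends a Request (V: Inquiry, E: Env, C: Code) to Dynamic Inference. Dynamic Inference builds a Unit Test from the Code, runs it on Test Input, performs Inference on the Test Output, and Responds with a Transfer Func.]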

Test Synthesizer

Construct a Unit Test from Program Segment

[Figure: steps from Code Segment to unit test]

• Select Code Segment

• Determine Test Input Variables

• Determine Test Output Variables

• Construct Runnable Code

Inference via Regression

[Figure: inference pipeline]

• Input: Data for Explanatory Variables, Data for Response Variables

• Input Transformation

• Model Selection: Simple, Multiple, Polynomial Linear; Piecewise Linear

• Output: Linear Symbolic Rules

Dynamic Inference as Regression Analysis

Y = X0 + a1 X1 + a2 X2 + … + an Xn

Explanatory Models for Representing Code Semantics

(Suppose a is an output variable and b, c, d are input variables)

Model               Example
Constant            a = 0
Simple Linear       a = b
Multiple Linear     a = 2*b + c
Polynomial Linear   a = b^2 + c*d
Piecewise Linear    if b > 0 then a = b, else a = 3
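To make the model families concrete, here is a small sketch of how the example rules could be evaluated once their coefficients are known. It assumes a rule is stored as an intercept followed by one coefficient per (possibly transformed) input. All function names are hypothetical and not taken from the tool.

```c
#include <assert.h>
#include <stddef.h>

/* Evaluate a linear rule: coef[0] + coef[1]*x[0] + ... + coef[n]*x[n-1].
   coef[0] plays the role of X0 in the regression equation. */
double eval_linear(const double *coef, const double *x, size_t n)
{
    assert(coef != NULL && (x != NULL || n == 0));
    double y = coef[0];
    for (size_t i = 0; i < n; i++)
        y += coef[i + 1] * x[i];
    return y;
}

/* Multiple linear: a = 2*b + c */
double model_multiple(double b, double c)
{
    const double coef[3] = {0.0, 2.0, 1.0};
    const double x[2] = {b, c};
    return eval_linear(coef, x, 2);
}

/* Polynomial linear: a = b^2 + c*d is linear in the coefficients once the
   inputs are transformed into the features b*b and c*d. */
double model_polynomial(double b, double c, double d)
{
    const double coef[3] = {0.0, 1.0, 1.0};
    const double features[2] = {b * b, c * d};
    return eval_linear(coef, features, 2);
}

/* Piecewise linear: if b > 0 then a = b, else a = 3.
   Each piece is a linear rule guarded by a condition on the input. */
double model_piecewise(double b)
{
    const double when_positive[2] = {0.0, 1.0};  /* a = b */
    const double otherwise[2]     = {3.0, 0.0};  /* a = 3 */
    return eval_linear(b > 0 ? when_positive : otherwise, &b, 1);
}
```

The polynomial case shows why these models still fit a linear regression: the non-linearity lives in the transformed inputs, not in the coefficients being estimated.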

Experimental Setup

1. Implementation: Phoenix and Disolver, analyzing C/C++/C#

• A traditional symbolic analysis that gives up on loops and library calls

• Segmented symbolic analysis

• Applications of both symbolic analyses to detect infeasible paths and buffer overflows

2. Research Questions:

• Can we find useful symbolic rules and values?

• Are we improving the detection capabilities for infeasible paths and buffer overflows?

• What are the capabilities of segmented symbolic analysis?

• Is the technique still scalable and practical?

Experimental Results: Compare the two

(SA: traditional symbolic analysis; S-SA: segmented symbolic analysis)

Program     Overflow       Unknown        Infeasible     Unknown
            SA    S-SA     SA    S-SA     SA    S-SA     SA    S-SA
wu-ftpd     0     3        7     5        1     2        4     3
sendmail    0     3        18    16       1     1        6     6
polymorph   2     7        6     2        3     4        5     4
gzip        1     5        25    21       9     11       24    22
grep        1     1        6     6        14    15       19    17
tightvnc    0     0        12    11       5     5        34    32
putty       0     1        60    54       30    31       72    70
snort       0     13       53    45       59    67       147   124

Dynamic Inference for Buffer Overflow

Program     Segments       Runnable       Analyzable     Inferred Rules
            Loop  Lib      Loop  Lib      Loop  Lib
wu-ftpd     6     35       0     35       0     33       112
sendmail    7     26       7     25       7     16       79
polymorph   0     19       0     18       0     17       62
gzip        5     63       3     61       3     57       197
grep        2     7        0     5        0     5        11
tightvnc    8     11       0     5        0     2        4
putty       18    42       10    29       10    22       82
snort       37    47       11    40       9     25       148

Performance

Program     Size     Symbolic Analysis      Segmented Symbolic Analysis
            (kloc)   T-inf      T-buf       T-inf       Thread-inf   T-buf       Thread-buf
wu-ftpd     0.4      0.7 s      1.5 s       2.7 s       6            523.8 s     206
sendmail    0.9      1.0 s      1.8 s       11.2 s      26           228.9 s     166
polymorph   0.9      2.2 s      1.0 s       3.9 s       6            143.3 s     96
gzip        5.1      358.7 s    3.0 s       1679.3 s    271          508.6 s     341
grep        16.9     21.1 s     3.4 s       470.1 s     71           79.9 s      46
tightvnc    45.4     490.5 s    18.4 s      1149.9 s    126          596.6 s     96
putty       60.1     331.4 s    81.4 s      508.4 s     101          1213.6 s    281
snort       98.8     124.6 s    465.6 s     2009.4 s    651          1472.3 s    421

Experimental Summary

• Improved the detection capabilities: 5 times more buffer overflows

• Inferred 1135 models

• 2/3 of the loops are eligible based on size, 29.3% yield runnable unit tests, and models are inferred for 23.8% of the loops

• Unit tests for 81.4% of the library calls are runnable, and models are inferred for 70.4% of the library calls

• Scalability is still practical

• We can handle loops that traditional symbolic analysis cannot

Capabilities of Segmented Symbolic Analysis

Lib Yes             Example
String              strcpy, strcat, strlen, strncpy, strdup
File Systems        chdir, getcwd, rename, unlink, stat
I/O                 printf, fgets, fgetc, read
Misc                perror, utime, inet_addr, atoi

Lib No              Example
String Content      strrchr, getenv
Compiler Unknown    malloc
Network             recv, gethostbyname
Interactive Input   getchar

Loop No             Example
Complex loop        nested loop
Network             recv
Interactive input   getchar
Invalid context     Invalid loop index

Loops We can Handle

//loop handled by segmented symbolic analysis
for (p = name; *p != '\0'; p++) {
    if (isascii((int)*p) && isupper((int)*p)) {
        *p = tolower(*p);
        tryagain = TRUE;
    }
}

Loops We cannot Handle Yet

for (n = 7; n >= 8 - pfburh->r.w % 8; n--) {
    rcSource[i++] = rcolors[m_netbuf[y * bytesPerRow + x] >> n & 1];
}

Related Work

• Various symbolic analyses for bug finding, debugging [Sen, Marinov, Agha ‘05] [Godefroid, Klarlund, Sen ‘05] [Le, Soffa ’08] [Chipounov, Kuznetsov, Candea ‘12]

• Hybrid symbolic analysis [Sen, Marinov, Agha ‘05] [Godefroid, Klarlund, Sen ‘05] [Chipounov, Kuznetsov, Candea ‘12]

• Dynamic invariants discovery [Ernst, Czeisler, Griswold, Notkin ‘00]

Conclusions

A novel hybrid technique that flexibly weaves static and dynamic analyses on demand to maximize their ability to discover program semantic information

Addressed the two key challenges: 1) partitioning a program to construct valid unit tests, and 2) mapping the problem of discovering symbolic relations between program variables to regression analysis.

Fully automatic and can be generally applied for determining different program properties and for different programs.

Thank you and Questions?
