Post on 05-Jan-2016
Segmented Symbolic Analysis
Wei Le, Rochester Institute of Technology
Motivation
• Symbolic analysis has many important applications in software tools [Sen, Marinov, Agha ‘05] [Godefroid, Klarlund, Sen ‘05] [Le, Soffa ’08]
[Chipounov, Kuznetsov, Candea ‘12]
• Compared to testing with concrete input: better coverage
• Compared to other static techniques: more precise
• Will continue being a powerful tool due to improved scalability [Chipounov, Kuznetsov, Candea ‘12]
Challenges of Symbolic Analysis
• Loops: can have a statically unknown bound
• Library calls: the source code of a library is typically not available at compile time
Previous Solutions
• Loops - very small state space is covered
– Iterate once [Cadar, Dunbar, Engler ‘08] [Chipounov, Kuznetsov, Candea ‘12]
– Report unknown [Xie, Chou, Engler ‘03]
– Pattern matching [Saxena, Poosankam, McCamant, Song ‘00]
• Library calls - imprecise and manual effort
– A concrete value [Sen, Marinov, Agha ‘05] [Godefroid, Klarlund, Sen ‘05]
– Manually constructed models (e.g., simplified C implementation) [Bush, Pincus, Sielaff ‘00] [Chipounov, Kuznetsov, Candea ‘12]
Segmented Symbolic Analysis - Insights
• Code is not uniformly easy to analyze
• We should leverage the structural and semantic relations between statements to partition a program and apply different analyses accordingly
• The capabilities of static analysis are limited; we should introduce dynamic analysis to supply information that a pure static symbolic analyzer is slow or unable to produce
Overall Approach
• Perform symbolic analysis
• When an unknown occurs, identify code segments that cause unknown
• Construct unit tests and automatically generate inputs
• Run tests, perform dynamic inference to generate symbolic rules and symbolic values (transfer functions)
• Resume symbolic analysis using inferred rules
Novelty of the Work
• Weave static and dynamic analyses on demand on a concurrent framework
• Dynamic analysis is fully automatic (not running the entire program but on code segments)
• Aggregated information from multiple runs: regression analysis
1. Programs mostly consist of linear operations [Knuth ’71] [Halbwachs, Proy, Roumanoff ‘97]
2. Determining program properties often only requires linear constraints [Halbwachs, Proy, Roumanoff ’97] [Xie, Chou, Engler ‘03]
3. We assume that linear relations can characterize relevant behavior of small code segments
Overview using an Example
[Figure: control-flow graph of the example, nodes 1 to 10, branch edges labeled yes/no, with the loop and the library call marked as segments]
Code in the graph:
struct stat s
char filename[32]
char* temp = argv[1]
int i = 0
*temp != '\0' (Loop)
filename[i] = *temp++
i++
t = _stat64i32(filename, &s) (Library)
t == 0
strcat(filename, ", ")
Query: 32 > Len(filename)+1
• Traditional SA: Library Unknown
• Traditional SA with Library Models: Loop Unknown
• Segmented SA:
– Library rule: Len(filename') = Len(filename), so the query stays 32 > Len(filename)+1
– Loop rule: Len(filename') = Len(temp), so the query becomes 32 > Len(temp)+1
– 32 > Len(argv[1])+1
– Buffer Overflow
Unit Test to Infer the Loop
//initialize with test inputs
char* temp = _GenChars(test_buf);
char* filename = _GenChars(test_buf);
//code segment for the loop
int i = 0;
while (*temp != '\0') { filename[i] = *temp++; i++; }
//output Len(filename)
char* _result = _GenChars(g_buf);
int _rint = strlen(filename);
itoa(_rint, _result, 10);
fputs(_result, fp);
// cleanup…
Reduce to Regression Analysis
        Test Input               Test Input Transformed for RA     Output
Test    temp       filename      Len(temp)    Len(filename)        Len(filename')
1       acde       piidaf        4            6                    4
2       tazipad    qdd           7            3                    7
3       ad         dafdalfll     2            9                    2
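As a rough illustration of how the three test runs above could be turned into a symbolic rule, the sketch below fits Len(filename') = a0 + a1*Len(temp) + a2*Len(filename) to the table. With three observations and three unknowns it solves the system exactly by Gaussian elimination; this stands in for a real regression, which would normally use more runs. The function names `solve3` and `infer_loop_rule` are hypothetical and not part of the tool described in the slides.

```c
/* Absolute value helper, so the sketch does not depend on libm. */
static double absd(double v) { return v < 0 ? -v : v; }

/* Solve the 3x3 system A * coef = y by Gauss-Jordan elimination with
   partial pivoting. Returns 0 on success, -1 if A is (nearly) singular. */
int solve3(double A[3][3], double y[3], double coef[3])
{
    double M[3][4];
    for (int r = 0; r < 3; r++) {
        for (int c = 0; c < 3; c++) M[r][c] = A[r][c];
        M[r][3] = y[r];
    }
    for (int col = 0; col < 3; col++) {
        /* pick the row with the largest entry in this column */
        int piv = col;
        for (int r = col + 1; r < 3; r++)
            if (absd(M[r][col]) > absd(M[piv][col])) piv = r;
        if (absd(M[piv][col]) < 1e-12) return -1;
        if (piv != col)
            for (int c = 0; c < 4; c++) {
                double t = M[col][c]; M[col][c] = M[piv][c]; M[piv][c] = t;
            }
        /* eliminate this column from every other row */
        for (int r = 0; r < 3; r++) {
            if (r == col) continue;
            double f = M[r][col] / M[col][col];
            for (int c = col; c < 4; c++) M[r][c] -= f * M[col][c];
        }
    }
    for (int r = 0; r < 3; r++) coef[r] = M[r][3] / M[r][r];
    return 0;
}

/* Fit Len(filename') = a0 + a1*Len(temp) + a2*Len(filename)
   to the three test runs in the table (assumed to be the only data). */
int infer_loop_rule(double coef[3])
{
    double A[3][3] = { {1, 4, 6}, {1, 7, 3}, {1, 2, 9} };  /* 1, Len(temp), Len(filename) */
    double y[3]    = { 4, 7, 2 };                           /* Len(filename') */
    return solve3(A, y, coef);
}
```

For this data the fitted coefficients come out as a0 = 0, a1 = 1, a2 = 0 (up to rounding), which corresponds to the rule Len(filename') = Len(temp) used in the example.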
Internal Design and Components
[Figure: the Helium framework. Queries (q) are in Solving, Solved, or Unknown states. "Symbolic Analysis & Partition Program for Unknown" sends a Request to the Inference Repository; when a rule is Not Found, the Test Synthesizer and Inference Engine produce New Rules (Dynamic Inference On Demand), and the repository Responds.]
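One way to read the request/respond flow is as a lookup in the inference repository that falls back to dynamic inference when no rule is found. The sketch below shows that reading only; the types, the fixed-size table, the `infer_fn` callback, and `demo_infer` are all hypothetical, and the real framework is concurrent, which this single-threaded sketch does not model.

```c
#include <string.h>

#define REPO_CAP 64
#define KEY_LEN  64
#define RULE_LEN 128

typedef struct { char key[KEY_LEN]; char rule[RULE_LEN]; } repo_entry;
typedef struct { repo_entry entries[REPO_CAP]; size_t count; } repository;

/* Hypothetical callback standing in for test synthesis plus inference:
   writes a rule into out and returns 0, or returns -1 if none is inferred. */
typedef int (*infer_fn)(const char *segment_key, char *out, size_t out_len);

/* Return the stored rule for a segment, or NULL if it is not found. */
const char *repo_lookup(const repository *r, const char *key)
{
    for (size_t i = 0; i < r->count; i++)
        if (strcmp(r->entries[i].key, key) == 0) return r->entries[i].rule;
    return NULL;
}

/* Answer a request: reuse a stored rule, otherwise infer one and store it. */
const char *answer_request(repository *r, const char *key, infer_fn infer)
{
    const char *hit = repo_lookup(r, key);
    if (hit) return hit;
    if (r->count == REPO_CAP || strlen(key) >= KEY_LEN) return NULL;
    repo_entry *e = &r->entries[r->count];
    if (infer(key, e->rule, RULE_LEN) != 0) return NULL;
    strcpy(e->key, key);
    r->count++;
    return e->rule;
}

/* Example callback: counts its calls and returns the loop rule from the example. */
static int demo_calls = 0;
int demo_infer(const char *segment_key, char *out, size_t out_len)
{
    (void)segment_key;
    const char *rule = "Len(filename') = Len(temp)";
    if (strlen(rule) >= out_len) return -1;
    strcpy(out, rule);
    demo_calls++;
    return 0;
}
```

Calling `answer_request` twice with the same segment key runs `demo_infer` only once; the second request is served from the repository.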
Components on the Helium Framework
• Static component:
- Perform demand-driven, path-sensitive symbolic analysis
- Isolate the code segment that causes the unknown
- Determine the environment for the code segment
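[Figure: Symbolic Analysis sends a Request carrying V: Inquiry, E: Env, and C: Code. Dynamic Inference combines the Code and Test Input into a Unit Test, runs it to obtain Output, applies Inference, and Responds with a Transfer Func.]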
Interaction Protocol
Test Synthesizer
Construct a Unit Test from Program Segment
Code Segment
Determine Test Input Variables
Determine Test Output Variables
Construct Runnable Code
Select Code Segment
Inference via Regression
[Figure: Data for Explanatory Variables and Data for Response Variables pass through Input Transformation and Model Selection (Simple, Multiple, Polynomial Linear; Piecewise Linear) to yield Linear Symbolic Rules]
Dynamic Inference as Regression Analysis
Y = a0 + a1 X1 + a2 X2 + … + an Xn
Explanatory Models for Representing Code Semantics
(SUPPOSE a: OUTPUT VAR, b, c, d: INPUT VARS)
Models               Examples
Constant             a = 0
Simple Linear        a = b
Multiple Linear      a = 2*b + c
Polynomial Linear    a = b^2 + c*d
Piecewise Linear     if b > 0 a = b, else a = 3
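The first two rows of the model list can be illustrated with a small model-selection sketch: try the constant model first, then a simple linear model a = slope*b + intercept. This is an assumption about how selection might proceed, not the tool's actual procedure; it uses exact integer matching instead of statistical regression, and the names `model`, `model_kind`, and `select_model` are hypothetical.

```c
#include <stddef.h>

typedef enum { MODEL_NONE, MODEL_CONSTANT, MODEL_SIMPLE_LINEAR } model_kind;

typedef struct {
    model_kind kind;
    long slope;      /* used by MODEL_SIMPLE_LINEAR only */
    long intercept;  /* the constant value, or the intercept of the line */
} model;

/* Pick the simplest model that reproduces every observation (b[i], a[i]).
   Only integer slopes are considered, to keep the sketch exact. */
model select_model(const long *b, const long *a, size_t n)
{
    model m = { MODEL_NONE, 0, 0 };
    if (n == 0) return m;

    /* Constant: every output equals the first one. */
    size_t i;
    for (i = 1; i < n && a[i] == a[0]; i++) ;
    if (i == n) { m.kind = MODEL_CONSTANT; m.intercept = a[0]; return m; }

    /* Simple linear: fix the line with two samples that differ in b. */
    size_t j;
    for (j = 1; j < n && b[j] == b[0]; j++) ;
    if (j == n) return m;            /* same input, different outputs */
    long db = b[j] - b[0], da = a[j] - a[0];
    if (da % db != 0) return m;      /* slope would not be an integer */
    long slope = da / db, intercept = a[0] - slope * b[0];

    /* Accept the line only if all samples lie on it. */
    for (i = 0; i < n; i++)
        if (a[i] != slope * b[i] + intercept) return m;
    m.kind = MODEL_SIMPLE_LINEAR;
    m.slope = slope;
    m.intercept = intercept;
    return m;
}
```

With the loop example's data, inputs Len(temp) = {4, 7, 2} and outputs {4, 7, 2} select the simple linear model with slope 1 and intercept 0, while inputs Len(filename) = {6, 3, 9} with the same outputs select no model.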
Experimental Setup
1. Implementation - Phoenix and Disolver, analyzing C/C++/C#
• A traditional symbolic analysis that gives up on loops and library calls
• Segmented symbolic analysis
• Applications of both symbolic analyses to detect infeasible paths and buffer overflows
2. Research Questions:
• Can we find useful symbolic rules and values?
• Are we improving the detection capabilities for infeasible paths and buffer overflows?
• What are the capabilities of segmented symbolic analysis?
• Is the technique still scalable and practical?
Experimental Results: Compare the two
Program      Overflow      Unknown       Infeasible    Unknown
             SA    S-SA    SA    S-SA    SA    S-SA    SA    S-SA
wu-ftpd      0     3       7     5       1     2       4     3
sendmail     0     3       18    16      1     1       6     6
polymorph    2     7       6     2       3     4       5     4
gzip         1     5       25    21      9     11      24    22
grep         1     1       6     6       14    15      19    17
tightvnc     0     0       12    11      5     5       34    32
putty        0     1       60    54      30    31      72    70
snort        0     13      53    45      59    67      147   124
Dynamic Inference for Buffer Overflow
Program      Segments      Runnable      Analyzable    Inferred Rules
             Loop  Lib     Loop  Lib     Loop  Lib
wu-ftpd      6     35      0     35      0     33      112
sendmail     7     26      7     25      7     16      79
polymorph    0     19      0     18      0     17      62
gzip         5     63      3     61      3     57      197
grep         2     7       0     5       0     5       11
tightvnc     8     11      0     5       0     2       4
putty        18    42      10    29      10    22      82
snort        37    47      11    40      9     25      148
Performance
Program      Size      Symbolic Analysis      Segmented Symbolic Analysis
             (kloc)    T-inf      T-buf       T-inf       Thread-inf   T-buf       Thread-buf
wu-ftpd      0.4       0.7 s      1.5 s       2.7 s       6            523.8 s     206
sendmail     0.9       1.0 s      1.8 s       11.2 s      26           228.9 s     166
polymorph    0.9       2.2 s      1.0 s       3.9 s       6            143.3 s     96
gzip         5.1       358.7 s    3.0 s       1679.3 s    271          508.6 s     341
grep         16.9      21.1 s     3.4 s       470.1 s     71           79.9 s      46
tightvnc     45.4      490.5 s    18.4 s      1149.9 s    126          596.6 s     96
putty        60.1      331.4 s    81.4 s      508.4 s     101          1213.6 s    281
snort        98.8      124.6 s    465.6 s     2009.4 s    651          1472.3 s    421
Experimental Summary
• Improved the detection capabilities: 5 times more buffer overflows
• Inferred 1135 models
• 2/3 of the loops are eligible by size, 29.3% yield runnable unit tests, and models are inferred for 23.8% of the loops
• Unit tests for 81.4% of the library calls are runnable, and models are inferred for 70.4% of the library calls
• Scalability is still practical
• We can handle loops that traditional symbolic analysis cannot
Capabilities of Segmented Symbolic Analysis
Lib Yes              Example
String               strcpy, strcat, strlen, strncpy, strdup
File Systems         chdir, getcwd, rename, unlink, stat
I/O                  printf, fgets, fgetc, read
Misc                 perror, utime, inet_addr, atoi

Lib No               Example
String Content       strrchr, getenv
Compiler Unknown     malloc
Network              recv, gethostbyname
Interactive Input    getchar

Loop No              Example
Complex loop         nested loop
Network              recv
Interactive input    getchar
Invalid context      Invalid loop index
Loops We can Handle
//loop handled by segmented symbolic analysis
for (p = name; *p != '\0'; p++) {
    if (isascii((int)*p) && isupper((int)*p)) {
        *p = tolower(*p);
        tryagain = TRUE;
    }
}
Loops We cannot Handle Yet
for (n = 7; n >= 8 - pfburh->r.w % 8; n--) {
    rcSource[i++] = rcolors[m_netbuf[y * bytesPerRow + x] >> n & 1];
}
Related Work
• Various symbolic analyses for bug finding, debugging [Sen, Marinov, Agha ‘05] [Godefroid, Klarlund, Sen ‘05] [Le, Soffa ’08] [Chipounov, Kuznetsov, Candea ‘12]
• Hybrid symbolic analysis [Sen, Marinov, Agha ‘05] [Godefroid, Klarlund, Sen ‘05] [Chipounov, Kuznetsov, Candea ‘12]
• Dynamic invariants discovery [Ernst, Czeisler, Griswold, Notkin ‘00]
Conclusions
• A novel hybrid technique that flexibly weaves static and dynamic analyses on demand, using each where it is most capable of discovering program semantic information
• Addressed the two key challenges: 1) partitioning a program to construct valid unit tests, and 2) mapping the problem of discovering symbolic relations between program variables to regression analysis
• Fully automatic and generally applicable to determining different program properties and to different programs
Thank you and Questions?