Code Analysis
Zhengong (才振功), 2011-07-27
Agenda
• Software Analysis Overview
• Methodology & Methods
• Tools Demo
• Project Candidates & Discussion
Software Analysis for What?
• Scenario 1
Company B buys another company, C, to meet market needs. C has a running system with a small maintenance group, and B wants to expand the business built on that system. However, the system documents are outdated and the original developers have left. Moreover, the system's platform no longer gets enough support from its providers. On the other hand, developing a replacement system would bring great cost and risk, because of the lack of comprehension of the system and its business logic...
Software Analysis for What?
• Scenario 2
Project A has been in progress for several months, and a new developer, D, joins as a replacement for an original member who has left. D needs to understand the source code, documents and other materials before starting development...
• Scenario 3
A bug is reported by QA or a customer. As a developer, how do you locate the bug and fix it in a given time? On the other hand, the customer hopes to add a new function: where should it be added, and which source code must be modified?
What Is Software Analysis?
• Definition
Software analysis is a process or action to validate, verify or locate software features (or constraints), manually or automatically
• Similar terms
▫ Program comprehension / reverse engineering
• Scope
▫ Development phase: track the progress, predict the development actions, eliminate defects and program changes, etc.
▫ Maintenance phase: program comprehension and software maintenance
▫ Reuse phase: analyze and reuse the available parts
The Goals
• Program comprehension
▫ Functionality
▫ Architecture
• Feature location
▫ Locate the bug
▫ Locate where to add new functions
• Code review
▫ Coding styles
▫ Program optimization
In a word, identify the code architecture and map source code to abstract models
Objects for Analysis
• Source code (code analysis)
• Models
▫ Requirements
▫ Design models
▫ Software architecture
• Documents, including requirements, design, test, etc.
• Comments
Problems With Code Analysis
[Figure: Source Code → Compile → Link → Application System, within the Compilation Environment; the Code Analysis Tool draws on "Syntactic Data for Code Analysis" and "Syntactic & Semantic Data for Code Analysis".]
1. Business domain vs application domain
2. Source code vs abstract business
3. Tools are costly
1. Physical actions vs logic
2. Structured programs vs unstructured semantic data
Methodology and Methods
• Static analysis
• Dynamic analysis
• Hybrid approaches
Static Analysis
• Basic approaches
▫ Control flow analysis
▫ Data flow analysis
▫ Information flow analysis
▫ Symbolic execution
▫ Slice analysis
▫ Clone analysis
▫ Syntax analysis
▫ Type analysis
Static Analysis
▫ Range checking
▫ Structure analysis
▫ Alias analysis
▫ Pointer analysis
• Formal approaches
▫ Model checking
▫ Theorem proving
Not limited to these; there are more methods.
Control Flow Analysis
• Goals: construct the CFG (control flow graph)
▫ Analyze the execution paths
▫ Abstract the code structure
▫ Locate dead code
▫ Evaluate loops and recursion
• Methods
▫ Sequence diagram
▫ Call graph
▫ Structure analysis
▫ Program slice
Control Flow Analysis
• Example
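As one illustration of the "locate dead code" goal, the sketch below treats a CFG as a map from each node to its successors and reports the nodes that cannot be reached from the entry. The class name `CfgReachability`, the node labels and the sample graph are hypothetical, chosen only for this example.

```java
import java.util.*;

/** Minimal CFG sketch: nodes are statement labels, edges are possible control transfers. */
final class CfgReachability {
    /** Returns the nodes that cannot be reached from the entry node (dead-code candidates). */
    static Set<String> unreachable(Map<String, List<String>> cfg, String entry) {
        Set<String> seen = new HashSet<>();
        Deque<String> work = new ArrayDeque<>();
        work.push(entry);
        while (!work.isEmpty()) {
            String node = work.pop();
            if (!seen.add(node)) continue;               // already visited
            for (String succ : cfg.getOrDefault(node, List.of())) work.push(succ);
        }
        Set<String> dead = new TreeSet<>(cfg.keySet());
        dead.removeAll(seen);
        return dead;
    }

    /** Hypothetical CFG for: a; if (c) { b; return; } else { d; return; } e; */
    static Map<String, List<String>> sampleCfg() {
        Map<String, List<String>> cfg = new LinkedHashMap<>();
        cfg.put("a", List.of("if(c)"));
        cfg.put("if(c)", List.of("b", "d"));
        cfg.put("b", List.of("exit"));
        cfg.put("d", List.of("exit"));
        cfg.put("e", List.of("exit"));                   // no edge leads to e
        cfg.put("exit", List.of());
        return cfg;
    }

    public static void main(String[] args) {
        System.out.println(unreachable(sampleCfg(), "a"));
    }
}
```

Both branches of the sample return, so statement `e` has no incoming edge and is the only node reported.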
Data Flow Analysis
• Goals: evaluate the definition and use of variables in each statement
▫ Variable definition
▫ Input should not be re-assigned
▫ Output should be assigned
▫ Proper use of global variables
• DFA: usually starts with CFA
▫ Forward analysis: reaching definitions
▫ Backward analysis: live variables, eliminating dead code
Classical Data-flow Problems
• Reaching definitions (Reach)
• Live uses of variables (Live)
• Def-use chains, built from Reach, and the dual use-def chains, built from Live, play a role in many optimizations
• Sets of variables
▫ Gen(N) = set of variables defined by node N
▫ Kill(N) = set of variables killed by node N
▫ In(N) = set of variables flowing in from the predecessor nodes
▫ Forward order: Out(N) = Gen(N) ∪ (In(N) − Kill(N))
Reaching Definitions
• Definition: a statement that may change the value of a variable (e.g., x = i+5)
• A definition of a variable x at node k reaches node n if there is a path from k to n that contains no other definition of x.
[Figure: node k with a definition x = …, node n with a use … = x, and another definition x = …]
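A minimal sketch of reaching definitions as an iterative forward analysis, using Out(N) = Gen(N) ∪ (In(N) − Kill(N)) with In(N) taken as the union of Out over the predecessors. It assumes that Gen and Kill hold labelled definitions (d1 to d4) rather than variable names, and it encodes the sum/i loop program from the slicing example by hand; the class and method names are hypothetical.

```java
import java.util.*;

final class ReachingDefinitions {
    /** One CFG node: its predecessors plus the definitions it generates and kills. */
    static final class Node {
        final List<Integer> preds; final Set<String> gen; final Set<String> kill;
        Node(List<Integer> preds, Set<String> gen, Set<String> kill) {
            this.preds = preds; this.gen = gen; this.kill = kill;
        }
    }

    /** Iterates Out(N) = Gen(N) ∪ (In(N) − Kill(N)) to a fixed point; returns In(N) per node. */
    static Map<Integer, Set<String>> solve(Map<Integer, Node> cfg) {
        Map<Integer, Set<String>> in = new TreeMap<>(), out = new TreeMap<>();
        for (Integer n : cfg.keySet()) { in.put(n, new TreeSet<>()); out.put(n, new TreeSet<>()); }
        boolean changed = true;
        while (changed) {
            changed = false;
            for (Map.Entry<Integer, Node> e : cfg.entrySet()) {
                Node node = e.getValue();
                Set<String> newIn = new TreeSet<>();
                for (int p : node.preds) newIn.addAll(out.get(p));   // In(N) = union of Out(pred)
                Set<String> newOut = new TreeSet<>(newIn);
                newOut.removeAll(node.kill);
                newOut.addAll(node.gen);
                in.put(e.getKey(), newIn);
                if (!newOut.equals(out.get(e.getKey()))) { out.put(e.getKey(), newOut); changed = true; }
            }
        }
        return in;
    }

    /** Hand-built CFG of the loop program: sum = 0; i = 1; while (i < 11) {...} printf... */
    static Map<Integer, Node> loopExample() {
        Map<Integer, Node> cfg = new TreeMap<>();
        cfg.put(1, new Node(List.of(), Set.of("d1"), Set.of("d3")));   // d1: sum = 0
        cfg.put(2, new Node(List.of(1), Set.of("d2"), Set.of("d4")));  // d2: i = 1
        cfg.put(3, new Node(List.of(2, 5), Set.of(), Set.of()));       // while (i < 11)
        cfg.put(4, new Node(List.of(3), Set.of("d3"), Set.of("d1")));  // d3: sum = sum + i
        cfg.put(5, new Node(List.of(4), Set.of("d4"), Set.of("d2")));  // d4: i = i + 1
        cfg.put(6, new Node(List.of(3), Set.of(), Set.of()));          // printf(sum)
        cfg.put(7, new Node(List.of(6), Set.of(), Set.of()));          // printf(i)
        return cfg;
    }
}
```

Because the loop may run zero or more times, all four definitions reach the two printf nodes, while only d2, d3 and d4 reach `i = i + 1` (d1 is killed by `sum = sum + i`).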
Live Uses of Variables
• Use: the appearance of a variable as an operand of a 3-address statement (e.g., y=x+4)
• A use of a variable x at node n is live on exit from k if there is a path from k to n that contains no definition of x.
[Figure: node k with a definition x = …, node n with a use … = x, and another definition x = …]
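Live variables can be sketched as the backward counterpart of the forward equation: In(N) = Use(N) ∪ (Out(N) − Def(N)), where Out(N) is the union of In over the successors. The code below is an illustrative sketch under that assumption; the Use and Def sets and the hand-built CFG of the sum/i loop program are written out manually, and all names are hypothetical.

```java
import java.util.*;

final class LiveVariables {
    /** One CFG node: successors, variables it uses, and variables it defines. */
    static final class Node {
        final List<Integer> succs; final Set<String> use; final Set<String> def;
        Node(List<Integer> succs, Set<String> use, Set<String> def) {
            this.succs = succs; this.use = use; this.def = def;
        }
    }

    /** Backward fixed point of In(N) = Use(N) ∪ (Out(N) − Def(N)); returns the live-in sets. */
    static Map<Integer, Set<String>> liveIn(Map<Integer, Node> cfg) {
        Map<Integer, Set<String>> in = new TreeMap<>();
        for (Integer n : cfg.keySet()) in.put(n, new TreeSet<>());
        boolean changed = true;
        while (changed) {
            changed = false;
            for (Map.Entry<Integer, Node> e : cfg.entrySet()) {
                Node node = e.getValue();
                Set<String> live = new TreeSet<>();
                for (int s : node.succs) live.addAll(in.get(s));   // Out(N) = union of In(succ)
                live.removeAll(node.def);
                live.addAll(node.use);
                if (!live.equals(in.get(e.getKey()))) { in.put(e.getKey(), live); changed = true; }
            }
        }
        return in;
    }

    /** Hand-built CFG of the loop program used in the slicing example. */
    static Map<Integer, Node> loopExample() {
        Map<Integer, Node> cfg = new TreeMap<>();
        cfg.put(1, new Node(List.of(2), Set.of(), Set.of("sum")));            // sum = 0
        cfg.put(2, new Node(List.of(3), Set.of(), Set.of("i")));              // i = 1
        cfg.put(3, new Node(List.of(4, 6), Set.of("i"), Set.of()));           // while (i < 11)
        cfg.put(4, new Node(List.of(5), Set.of("sum", "i"), Set.of("sum")));  // sum = sum + i
        cfg.put(5, new Node(List.of(3), Set.of("i"), Set.of("i")));           // i = i + 1
        cfg.put(6, new Node(List.of(7), Set.of("sum"), Set.of()));            // printf(sum)
        cfg.put(7, new Node(List.of(), Set.of("i"), Set.of()));               // printf(i)
        return cfg;
    }
}
```

An empty live-in set at the first node means no variable is read before it is assigned. A definition whose variable is not live afterwards would be a dead-code candidate.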
Def-use Relations
• A use-def chain links a use to a definition that reaches that use
• A def-use chain links a definition to a use that it reaches
[Figure: node k with a definition x = …, node n with a use … = x, and another definition x = …]
Optimizations Enabled
•Dead code elimination (Def-use)
•Loop invariant code motion (Use-def)
•Constant propagation (Use-def)
•Strength reduction (Use-def)
•Copy propagation (Def-use)
Information Flow Analysis
• Goals
▫ Trace the dependencies from output to input
▫ Validate the dependencies against the initial constraints
• IFA methods
▫ Intra-procedural analysis
▫ Inter-procedural analysis
Example:
X := A + B;
Y := D – C;
if X > 0 then
    Z := Y + 1;
end if;
Here:
X depends on A & B
Y depends on C & D
Z depends on A, B, C, & D, and implicitly on Z's initial value
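The dependencies in this example can be derived mechanically. The sketch below propagates dependency sets through assignments: operands contribute explicit (data) flow, guard variables contribute implicit (control) flow, and a guarded assignment also keeps a dependency on the target's previous value. The class `InfoFlow` and its methods are hypothetical names for illustration, and the analysis is intra-procedural only.

```java
import java.util.*;

final class InfoFlow {
    /** Maps each assigned variable to the set of inputs it currently depends on. */
    private final Map<String, Set<String>> deps = new TreeMap<>();

    /** A variable that has not been assigned yet depends only on its own initial value. */
    private Set<String> of(String v) {
        return deps.getOrDefault(v, new TreeSet<>(Set.of(v)));
    }

    /** target := f(uses), executed only under the given guard variables (may be empty). */
    void assign(String target, List<String> uses, List<String> guards) {
        Set<String> result = new TreeSet<>();
        for (String u : uses) result.addAll(of(u));          // explicit (data) flow
        for (String g : guards) result.addAll(of(g));        // implicit (control) flow
        if (!guards.isEmpty()) result.addAll(of(target));    // old value survives if guard is false
        deps.put(target, result);
    }

    Set<String> dependsOn(String v) { return of(v); }

    /** Encodes: X := A + B; Y := D - C; if X > 0 then Z := Y + 1; end if; */
    static Set<String> slideExample() {
        InfoFlow flow = new InfoFlow();
        flow.assign("X", List.of("A", "B"), List.of());
        flow.assign("Y", List.of("D", "C"), List.of());
        flow.assign("Z", List.of("Y"), List.of("X"));
        return flow.dependsOn("Z");
    }

    public static void main(String[] args) {
        System.out.println(slideExample());
    }
}
```

For Z the result contains A, B, C, D and Z itself, where Z stands for its initial value.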
Symbolic Execution
• Goals
▫ Verify properties of a program by algebraic manipulation of the source text, without requiring a formal specification
• Methods
▫ Typically the program is "executed" statically by performing back-substitution
▫ Converts sequential logic into a set of parallel assignments in which output values are expressed in terms of input values
Previous example:
X := A + B;
Y := D – C;
if X > 0 then
    Z := Y + 1;
end if;
A + B <= 0:
X = A + B
Y = D – C
Z = not defined
A + B > 0:
X = A + B
Y = D – C
Z = D – C + 1
Slicing Analysis
• Goals
▫ Extract the source code related to the concern, i.e. the slice
• Method
▫ Obtain the concern-related variable
▫ Analyze the related statements and predicates to form a slice
▫ Analyze the slice to comprehend the program
• Analysis approach
▫ Data flow analysis
▫ Dependency analysis
Slicing Analysis
• Example
#include <stdio.h>
int main() {
    int sum = 0;
    int i = 1;
    while (i < 11) {
        sum = sum + i;
        i = i + 1;
    }
    printf("%d\n", sum);
    printf("%d\n", i);
}
Backward Slice
Backward slice with respect to printf("%d\n", i)
Slice Extraction
int main() {
    int i = 1;
    while (i < 11) {
        i = i + 1;
    }
    printf("%d\n", i);
}
Forward Slice
Forward slice with respect to sum = 0: the slice contains sum = 0 together with the statements it may affect, here sum = sum + i and printf("%d\n", sum).
Control Flow Graph
[Figure: CFG of the example program. Enter → sum = 0 → i = 1 → while (i < 11); T branch → sum = sum + i → i = i + 1 → back to while (i < 11); F branch → printf(sum) → printf(i).]
Flow Dependence Graph
[Figure: flow dependence graph of the example program over the nodes Enter, sum = 0, i = 1, while (i < 11), sum = sum + i, i = i + 1, printf(sum), printf(i).]
Flow dependence p → q: the value of a variable assigned at p may be used at q.
Control Dependence Graph
[Figure: control dependence graph of the example program over the same nodes, with edges labeled T.]
Control dependence p →T q: q is reached from p if condition p is true (T), not otherwise. Similar for false (F).
Program Dependence Graph (PDG)
[Figure: PDG of the example program, combining control dependence edges (labeled T) and flow dependence edges over the nodes Enter, sum = 0, i = 1, while (i < 11), sum = sum + i, i = i + 1, printf(sum), printf(i).]
Opposite Order, Same PDG
Swapping the two initializations (int i = 1; before int sum = 0;) produces the same PDG.
Backward Slice
[Figure, steps (1) to (4): the backward slice is computed on the PDG by following dependence edges backward from the slicing criterion and marking every node reached.]
Slice Extraction
int main() {
    int i = 1;
    while (i < 11) {
        i = i + 1;
    }
    printf("%d\n", i);
}
[Figure: the PDG restricted to the slice nodes Enter, i = 1, while (i < 11), i = i + 1 and printf(i).]
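Backward slicing on a PDG amounts to reverse reachability over dependence edges. The sketch below is one way to express that. The edge list is an assumption: it was derived by hand from the example program (control dependence from Enter and from the while predicate, flow dependence from each assignment to its possible uses), not copied from the original figures, and the class `PdgSlicer` is a hypothetical name.

```java
import java.util.*;

final class PdgSlicer {
    /** node -> nodes it depends on (sources of its incoming control and flow dependence edges). */
    private final Map<String, Set<String>> dependsOn = new HashMap<>();

    /** Records dependence edges from -> to, i.e. each `to` depends on `from`. */
    void edge(String from, String... tos) {
        for (String to : tos) dependsOn.computeIfAbsent(to, k -> new HashSet<>()).add(from);
    }

    /** Collects every node from which the criterion is reachable along dependence edges. */
    Set<String> backwardSlice(String criterion) {
        Set<String> slice = new TreeSet<>();
        Deque<String> work = new ArrayDeque<>(List.of(criterion));
        while (!work.isEmpty()) {
            String n = work.pop();
            if (slice.add(n)) work.addAll(dependsOn.getOrDefault(n, Set.of()));
        }
        return slice;
    }

    /** Hand-derived PDG of the sum/i loop program (assumed edge set). */
    static PdgSlicer example() {
        PdgSlicer g = new PdgSlicer();
        String w = "while (i < 11)";
        // Control dependence
        g.edge("Enter", "sum = 0", "i = 1", w, "printf(sum)", "printf(i)");
        g.edge(w, "sum = sum + i", "i = i + 1", w);
        // Flow dependence
        g.edge("sum = 0", "sum = sum + i", "printf(sum)");
        g.edge("sum = sum + i", "sum = sum + i", "printf(sum)");
        g.edge("i = 1", w, "sum = sum + i", "i = i + 1", "printf(i)");
        g.edge("i = i + 1", w, "sum = sum + i", "i = i + 1", "printf(i)");
        return g;
    }

    public static void main(String[] args) {
        System.out.println(example().backwardSlice("printf(i)"));
    }
}
```

With printf(i) as the criterion, the nodes collected are Enter, i = 1, while (i < 11), i = i + 1 and printf(i), matching the extracted slice. A forward slice would follow the same edges in the opposite direction.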
Clone Analysis
• A code clone is a code fragment in source files that is identical or similar to another
▫ Clone pair: two fragments that are clones of each other
▫ Clone class: a set of fragments in which any two form a clone pair
• Code clones are one of the factors that make software maintenance more difficult
▫ If a fault is found in a code clone, the pros and cons of applying the modification to all of its clones must be considered
Clone Analysis
• Refactorings for cloned code
▫ Extract method
▫ Pull up method
• Tools
▫ CCFinder
▫ Gemini
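To make the idea of a clone class concrete, here is a deliberately simple line-based detector: it normalizes whitespace, slides a fixed-size window over the lines, and reports windows that occur more than once. This is only an illustrative sketch and is not how CCFinder or Gemini work; the class name and the window-size parameter are assumptions.

```java
import java.util.*;

final class LineCloneFinder {
    /** Maps each repeated window of `window` normalized lines to the 1-based lines where it starts. */
    static Map<String, List<Integer>> findClones(List<String> lines, int window) {
        Map<String, List<Integer>> seen = new LinkedHashMap<>();
        for (int i = 0; i + window <= lines.size(); i++) {
            StringBuilder key = new StringBuilder();
            for (int j = i; j < i + window; j++) {
                // Normalize: trim and collapse runs of whitespace so layout changes do not matter.
                key.append(lines.get(j).trim().replaceAll("\\s+", " ")).append('\n');
            }
            seen.computeIfAbsent(key.toString(), k -> new ArrayList<>()).add(i + 1);
        }
        seen.values().removeIf(starts -> starts.size() < 2);   // keep only repeated windows
        return seen;
    }

    /** Two methods that share a three-line tail, with different spacing in the second copy. */
    static List<String> sample() {
        return List.of(
            "void methodA(int i){",
            "    methodZ();",
            "    System.out.println(\"name:\" + name);",
            "    System.out.println(\"amount:\" + i);",
            "}",
            "void methodB(int i){",
            "    methodY();",
            "  System.out.println(\"name:\" + name);",
            "  System.out.println(\"amount:\"  +  i);",
            "}");
    }

    public static void main(String[] args) {
        System.out.println(findClones(sample(), 3).values());
    }
}
```

On the sample, the only repeated three-line window starts at lines 3 and 8, which is the fragment an Extract Method refactoring would factor out.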
Extract Method
Before:
void methodA(int i) {
    methodZ();
    System.out.println("name:" + name);
    System.out.println("amount:" + i);
}
void methodB(int i) {
    methodY();
    System.out.println("name:" + name);
    System.out.println("amount:" + i);
}
After:
void methodA(int i) {
    methodZ();
    methodC(i);
}
void methodB(int i) {
    methodY();
    methodC(i);
}
void methodC(int i) {
    System.out.println("name:" + name);
    System.out.println("amount:" + i);
}
Pull Up Method
[Figure: before, class B and class C (subclasses of class A) each define method A; after, method A is defined once in class A.]
Syntax Analysis
• Goals
▫ Construct the AST (abstract syntax tree)
▫ Validate the AST against the BNF grammar
• Methods
▫ Bottom-up: operator-precedence methods
▫ Top-down: recursive descent
▫ Context-free parsing, such as LR (left-to-right), etc.
Syntax analysis is also the foundation of compiling and of other analysis approaches.
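The top-down, recursive approach can be sketched with a small recursive-descent parser. The grammar below (expr, term, factor over integers with + - * / and parentheses) is an assumption chosen for illustration, and the AST is rendered as a prefix-notation string rather than as tree objects to keep the example short.

```java
/**
 * Recursive-descent sketch for the assumed grammar:
 *   expr   := term (('+' | '-') term)*
 *   term   := factor (('*' | '/') factor)*
 *   factor := NUMBER | '(' expr ')'
 */
final class ExprParser {
    private final String src;
    private int pos = 0;

    private ExprParser(String src) { this.src = src.replace(" ", ""); }

    /** Parses the whole input and returns the AST in prefix form. */
    static String parse(String src) {
        ExprParser p = new ExprParser(src);
        String ast = p.expr();
        if (p.pos != p.src.length()) throw new IllegalArgumentException("unexpected input at " + p.pos);
        return ast;
    }

    private String expr() {
        String left = term();
        while (peek() == '+' || peek() == '-') {
            char op = src.charAt(pos++);
            left = "(" + op + " " + left + " " + term() + ")";
        }
        return left;
    }

    private String term() {
        String left = factor();
        while (peek() == '*' || peek() == '/') {
            char op = src.charAt(pos++);
            left = "(" + op + " " + left + " " + factor() + ")";
        }
        return left;
    }

    private String factor() {
        if (peek() == '(') {
            pos++;                                   // consume '('
            String inner = expr();
            if (peek() != ')') throw new IllegalArgumentException("expected ) at " + pos);
            pos++;                                   // consume ')'
            return inner;
        }
        int start = pos;
        while (Character.isDigit(peek())) pos++;
        if (start == pos) throw new IllegalArgumentException("expected number at " + pos);
        return src.substring(start, pos);
    }

    /** Current character, or '\0' at end of input. */
    private char peek() { return pos < src.length() ? src.charAt(pos) : '\0'; }
}
```

Each grammar rule becomes one method, so operator precedence falls out of the call structure: `ExprParser.parse("1 + 2 * 3")` nests the multiplication inside the addition.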
Type Analysis
• Goals
▫ Locate type errors
• Methods
▫ Most are based on static analysis, to eliminate type errors and verify software quality
▫ Some are based on dynamic analysis
Pointer Analysis
• Goals
▫ Find the locations to which a pointer may point
▫ Lies at the heart of many program optimization and verification problems
• Pointer analysis is undecidable in static analysis
▫ There exist many conservative approximations
▫ Smaller points-to sets → more precision
• Factors
▫ Flow sensitivity
▫ Context sensitivity
▫ Etc.
Alias Analysis
• Why?
▫ More accurate memory dependence analysis and data flow analysis
▫ More aggressive optimization and scheduling. Without alias analysis, data flow analysis, optimization and scheduling have to be conservative.
• Example
r1 = arr[1];
arr[2] = r2;
r3 = arr[1];
val = r3 + arr[3];
(The store to arr[2] cannot alias arr[1], so the second load of arr[1] can reuse r1.)
Alias Analysis
• Challenges
▫ Formal parameters
▫ Function pointers
▫ Struct & union
▫ Type casts
• Alias analysis: computes pairs of pointers that may point to the same memory location
▫ Used primarily by older pointer analyses for C
▫ Can be computed using a points-to analysis:
may-alias(v1, v2) if points-to(v1) ∩ points-to(v2) ≠ Ø
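The may-alias rule translates directly into a set-intersection test. The sketch below assumes the points-to sets have already been computed and are given as a map from pointer name to abstract locations; the class name and the sample sets are hypothetical.

```java
import java.util.*;

final class MayAlias {
    /** v1 and v2 may alias if their points-to sets share at least one abstract location. */
    static boolean mayAlias(Map<String, Set<String>> pointsTo, String v1, String v2) {
        return !Collections.disjoint(
            pointsTo.getOrDefault(v1, Set.of()),
            pointsTo.getOrDefault(v2, Set.of()));
    }

    /** Hypothetical points-to result: p -> {a, b}, q -> {b}, r -> {c}. */
    static Map<String, Set<String>> sample() {
        return Map.of("p", Set.of("a", "b"), "q", Set.of("b"), "r", Set.of("c"));
    }
}
```

With the sample sets, p and q may alias (both can point to b), while p and r cannot. The smaller the points-to sets, the fewer spurious alias pairs this test reports.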
Alias Analysis
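• Example
class Quad { uint32 ulow; uint32 uhigh; };
class qpart { ushort c, d, a, b; };
Quad quad;
qpart *s = (qpart *) &quad;
[Figure: memory layout in which the fields c, d, a, b of qpart overlay ulow and uhigh of Quad.]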
Range Checking
• Goals
▫ Ensure data values lie within the specified ranges
▫ Ensure data maintains the specified accuracy
• Methods
▫ Overflow and underflow analysis
▫ Range checking analysis
▫ Array bounds checking
▫ Rounding error analysis
Discrete static bounds can often be checked automatically
Checking is straightforward for enumeration types
Absence of overflow for real types can be demanding
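One common way to check discrete bounds statically is interval arithmetic: each variable is abstracted to a [lo, hi] range and operations are applied to the ranges. The following is a minimal sketch of that idea, not a description of any particular tool. The `Interval` class is hypothetical, and it assumes the operand bounds are themselves int-sized so that `long` arithmetic cannot overflow inside the analysis.

```java
/** Closed integer interval [lo, hi] used as an abstract value for a variable. */
final class Interval {
    final long lo, hi;

    Interval(long lo, long hi) {
        if (lo > hi) throw new IllegalArgumentException("empty interval");
        this.lo = lo; this.hi = hi;
    }

    /** Range of x + y when x is in this interval and y is in other. */
    Interval add(Interval other) { return new Interval(lo + other.lo, hi + other.hi); }

    /** True if every value in the interval lies within [min, max]. */
    boolean within(long min, long max) { return min <= lo && hi <= max; }

    /** Overflow check: can the value always be stored in a 32-bit int? */
    boolean fitsInt() { return within(Integer.MIN_VALUE, Integer.MAX_VALUE); }

    /** Array bounds check: is every value a valid index for an array of the given length? */
    boolean validIndexFor(int length) { return within(0, length - 1L); }
}
```

For example, if `i` is known to lie in [0, 9] and `j` in [0, 5], then `i + j` lies in [0, 14], which is a safe index for an array of length 15 but not for one of length 10.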
Structure Analysis
• Goals
▫ How artifacts build into higher-level artifacts
▫ How artifacts depend on each other
▫ Visualization
• Methods
▫ Dependency analysis
▫ Impact analysis
• Tools
▫ STAN, a structure analysis tool for Java
▫ IBM Rational Rose
▫ MS Visio
Structure Analysis
• Directed dependency graph
Model Checking
• Goals
▫ Verify the system models against the requirements
• Methods
▫ State transitions
▫ Modal / temporal logics
▫ Define and validate the mathematical problem "can the state transitions satisfy the logic?"
• Potential problem: abstracting models from code may lose some information
• Tools: SLAM, Java PathFinder2, etc.
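The core of explicit-state model checking can be sketched as a search over the reachable states that tests a property in each one. The code below handles only safety invariants (properties of single states), not full temporal logic, and is unrelated to how SLAM or Java PathFinder are implemented; the class name and the generic signature are assumptions.

```java
import java.util.*;
import java.util.function.Function;
import java.util.function.Predicate;

final class TinyModelChecker {
    /** Explores all states reachable from init; returns a state violating the invariant, if any. */
    static <S> Optional<S> findViolation(S init, Function<S, List<S>> next, Predicate<S> invariant) {
        Set<S> visited = new HashSet<>();
        Deque<S> frontier = new ArrayDeque<>(List.of(init));
        while (!frontier.isEmpty()) {
            S state = frontier.poll();
            if (!visited.add(state)) continue;                 // state already explored
            if (!invariant.test(state)) return Optional.of(state);
            frontier.addAll(next.apply(state));                // enqueue successor states
        }
        return Optional.empty();                               // invariant holds in every reachable state
    }

    /** Hypothetical model: a counter that starts at 0 and steps by 2 modulo 10. */
    static List<Integer> step(Integer s) { return List.of((s + 2) % 10); }
}
```

In the counter model only even values are reachable, so the invariant "counter is not 5" holds, while "counter is not 6" yields state 6 as a counterexample. The visited set is what keeps the search finite when transitions form cycles.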
Theorem Proving
• Objectives
▫ Theoretical proof of the system logic
• Features
▫ Based on formal or mathematical approaches
▫ Most complex and most precise
▫ Relies on manual transition and configuration
• Tools: ESC, ESC/Java
Pros and Cons of Static Analysis
• Pros
▫ Can cover all the code and paths
▫ Prior knowledge is not mandatory, so it is applicable to unfamiliar code
▫ Supports both complete and partial comprehension
• Cons
▫ The precision is affected by dynamic features
▫ Relies on programming languages and coding styles
▫ Some data dependencies are too complicated
Dynamic Analysis
• Dynamic tracing
• Off-line validation
• Online detecting
Dynamic Tracing
• Output-based
▫ Use the system output, logs, etc.
• Code instrumentation
▫ Source code instrumentation
▫ Binary code instrumentation
▫ Interceptor for the communication between caller and callee
• Through platform interfaces
▫ Use the development platform, such as the OS, JVM, etc.
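The interceptor style of tracing can be illustrated with the JDK's dynamic proxies, which sit between caller and callee for any interface. This is a minimal sketch: the class `TracingInterceptor` and the trace format are hypothetical, and a production version would also unwrap `InvocationTargetException` so callers see the original exception.

```java
import java.lang.reflect.Proxy;
import java.util.List;

final class TracingInterceptor {
    /** Wraps target so that every call made through interface type is recorded in trace. */
    @SuppressWarnings("unchecked")
    static <T> T trace(Class<T> type, T target, List<String> trace) {
        return (T) Proxy.newProxyInstance(
            TracingInterceptor.class.getClassLoader(),
            new Class<?>[] { type },
            (proxy, method, args) -> {
                trace.add("enter " + method.getName());
                try {
                    return method.invoke(target, args);       // forward the call to the real object
                } finally {
                    trace.add("exit " + method.getName());    // recorded even if the call throws
                }
            });
    }
}
```

A caller would wrap an object once, for example `Runnable r = TracingInterceptor.trace(Runnable.class, task, log);`, and then use `r` as usual. Neither the caller's nor the callee's source code has to change, which is the main appeal over source instrumentation.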
Off-line Validation
• Input generation
▫ Static analysis first
▫ Goal-driven input generation
• Constraint description
▫ Describe the constraints using a linear sequence diagram
• Execution trace analysis
▫ Plenty of internal and output data at runtime
▫ Trace generation from these data
▫ Verify the trace against the constraints
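As a sketch of the last step, assume the constraint is a linear sequence of events that must occur in the trace in the same relative order (other events may be interleaved). Under that assumption, verifying a recorded trace reduces to a subsequence check. The class name and event names below are hypothetical.

```java
import java.util.List;

final class TraceChecker {
    /** True if the events of expected occur in trace in the same relative order. */
    static boolean satisfies(List<String> trace, List<String> expected) {
        int next = 0;                                    // index of the next expected event
        for (String event : trace) {
            if (next < expected.size() && event.equals(expected.get(next))) next++;
        }
        return next == expected.size();
    }

    /** Hypothetical recorded trace of a file-processing run. */
    static List<String> sampleTrace() {
        return List.of("open", "read", "log", "read", "close");
    }
}
```

The sample trace satisfies the constraint open, read, close, but not close, open. Richer constraints (forbidden events, timing) would need a more expressive checker.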
Online Detecting
• In-line
▫ The monitor runs in the same space as the system
▫ High efficiency, quick response
▫ May affect the system itself
• Out-line
▫ The monitor runs in an independent space
▫ Can deal with multiple outputs
▫ Low efficiency
Online Detecting
• Features
▫ The input should not affect the normal execution of the system
▫ The monitor is always part of the system
Pros and Cons of Dynamic Analysis
• Pros
▫ The running results are more trustworthy
▫ Incremental comprehension by adding test cases
▫ High precision
• Cons
▫ Prior knowledge is required to design test cases
▫ Test cases are often not thorough
▫ Cannot filter out dead code
▫ Requires the ability to run the program
Hybrid Analysis
• Dynamic Analysis (DA) + Information Retrieval (IR)
• DA + Impact Analysis
• DA + Web Mining
• DA + IR + dependency analysis
• IR + BRCG
• Analysis + Test
etc.
IR+BRCG
• Approach overview
IR+BRCG
• A BRCG example
Tool Demo
• Instrumentation
▫ Bytecode instrumentation: IBM BIPTK
• Information Retrieval
▫ Lucene 3.0
• Static analysis
▫ PMD
▫ Architexa @ http://www.architexa.com/
▫ Code Analysis Plugin @ http://sourceforge.net/projects/cap4e/
▫ AppPerfect @ http://www.appperfect.com/download/files/index.html
Projects
• Project candidates
▫ JDK
▫ JUnit
▫ Log4j
▫ Struts 2.0
▫ Spring
▫ Hibernate
• Deliverables
▫ Analysis report, including at least the architecture, a flow chart or sequence diagram, and a function description
Thank You !