Our Goal
Source code → Program Analyzer → Security bugs
The program analyzer must be able to understand program properties (e.g., can a variable be NULL at a particular program point?)
It must perform control and data flow analysis
Do we need to implement control and data flow analysis from scratch?
• Most modern compilers already perform several types of such analysis for code optimization
• We can hook into different layers of analysis and customize them
• We still need to understand the details
• LLVM (http://llvm.org/) is a highly customizable and modular compiler framework
• Users can write LLVM passes to perform different types of analysis
• Clang static analyzer can find several types of bugs
• Can instrument code for dynamic analysis
Compiler Overview
• Abstract Syntax Tree: source code is parsed to produce an AST
• Control Flow Graph: the AST is transformed into a CFG
• Data Flow Analysis: operates on the CFG
The Structure of a Compiler
Source code (stream of characters)
  → scanner → stream of tokens
  → parser → Abstract Syntax Tree (AST)
  → checker → AST with annotations (types, declarations)
  → code gen → Machine/bytecode
Syntactic Analysis
• Input: sequence of tokens from the scanner
• Output: abstract syntax tree
• Actually,
  • the parser first builds a parse tree
  • the AST is then built by translating the parse tree
  • the parse tree is rarely built explicitly; it is only determined by, say, how the parser pushes stuff onto its stack
Adopted from UC Berkeley: Prof. Bodik, CS164 Lecture 5
Example
• Source code: 4*(2+3)
• Parser input: NUM(4) TIMES LPAR NUM(2) PLUS NUM(3) RPAR
• Parser output (AST):
        *
       / \
  NUM(4)  +
         / \
    NUM(2)  NUM(3)
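The scanner and parser steps for this example can be sketched in a few lines of Python. This is only an illustration, not the course's implementation: the token names come from the slide, while the grammar (expr, term, factor), the nested-tuple AST shape, and the helper names `scan` and `parse` are assumptions made for the sketch.

```python
import re

# Token names follow the slide; the regexes and the SKIP rule are assumptions.
TOKEN_SPEC = [("NUM", r"\d+"), ("TIMES", r"\*"), ("PLUS", r"\+"),
              ("LPAR", r"\("), ("RPAR", r"\)"), ("SKIP", r"\s+")]
TOKEN_RE = re.compile("|".join(f"(?P<{name}>{pat})" for name, pat in TOKEN_SPEC))

def scan(source):
    """Scanner: turn a stream of characters into a list of (kind, text) tokens."""
    tokens, pos = [], 0
    while pos < len(source):
        m = TOKEN_RE.match(source, pos)
        if m is None:
            raise SyntaxError(f"unexpected character {source[pos]!r}")
        if m.lastgroup != "SKIP":
            tokens.append((m.lastgroup, m.group()))
        pos = m.end()
    return tokens

def parse(tokens):
    """Recursive-descent parser: returns an AST made of nested tuples."""
    pos = 0

    def peek():
        return tokens[pos][0] if pos < len(tokens) else None

    def take(kind):
        nonlocal pos
        if peek() != kind:
            raise SyntaxError(f"expected {kind}, found {peek()}")
        text = tokens[pos][1]
        pos += 1
        return text

    def expr():      # expr := term (PLUS term)*
        node = term()
        while peek() == "PLUS":
            take("PLUS")
            node = ("+", node, term())
        return node

    def term():      # term := factor (TIMES factor)*
        node = factor()
        while peek() == "TIMES":
            take("TIMES")
            node = ("*", node, factor())
        return node

    def factor():    # factor := NUM | LPAR expr RPAR
        if peek() == "LPAR":
            take("LPAR")
            node = expr()   # the parentheses leave no node in the AST
            take("RPAR")
            return node
        return ("NUM", int(take("NUM")))

    tree = expr()
    if peek() is not None:
        raise SyntaxError(f"trailing token {peek()}")
    return tree

print(scan("4*(2+3)"))
print(parse(scan("4*(2+3)")))
```

Note that LPAR and RPAR appear in the token stream but not in the resulting tree, which is the difference between the AST and the parse tree.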
Parse tree for the example: 4*(2+3)
[Figure: parse tree for 4*(2+3). Leaves are tokens: NUM(4) TIMES LPAR NUM(2) PLUS NUM(3) RPAR. Interior nodes are EXPR.]
Another example
• Source code: if (x==y) {a=1;}
• Parser input: IF LPAR ID EQ ID RPAR LBR ID AS INT SEMI RBR
• Parser output (AST):
       IF-THEN
       /     \
     ==       =
    /  \     /  \
  ID    ID  ID   INT
Parse tree for example: if (x==y) {a=1;}
[Figure: parse tree for if (x==y) {a=1;}. Leaves are tokens: IF LPAR ID == ID RPAR LBR ID = INT SEMI RBR. Interior nodes: EXPR, EXPR, STMT, BLOCK, STMT.]
Parse Tree
• Representation of grammars in a tree-like form.
• Is a one-to-one mapping from the grammar to a tree form.
A parse tree pictorially shows how the start symbol of a grammar derives a string in the language. … Dragon Book
C statement: return a+2
A very formal representation that strictly shows how the parser understands the statement return a+2;
Abstract Syntax Tree (AST)
• Simplified syntactic representations of the source code, most often expressed by the data structures of the language used for implementation
• Without showing the whole syntactic clutter, an AST represents the parsed string in a structured way, discarding all information that may be important for parsing the string but isn't needed for analyzing it.
ASTs differ from parse trees because superficial distinctions of form, unimportant for translation, do not appear in syntax trees. … Dragon Book
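One way to see that "superficial distinctions of form" vanish from an AST is to compare the trees for two spellings of the same expression. The sketch below uses Python's own `ast` module as a stand-in parser (an assumption made for illustration; the slides discuss a C statement, and a C front end such as Clang would be the analogous tool).

```python
import ast

# Two spellings of the same expression: the second carries extra parentheses,
# which matter to the parser but not to later analysis.
plain = ast.parse("a + 2", mode="eval")
cluttered = ast.parse("((a) + (2))", mode="eval")

# ast.dump omits source positions by default, so it shows only the tree shape.
print(ast.dump(plain))
print(ast.dump(plain) == ast.dump(cluttered))
```

Both inputs produce the same tree: a single binary-operation node whose children are the name `a` and the constant `2`, with no trace of the parentheses.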
Disadvantages of ASTs
• AST has many similar forms
  • E.g., for, while, repeat...until
  • E.g., if, ?:, switch
• Expressions in an AST may be complex, nested
  • (x * y) + (z > 5 ? 12 * z : z + 20)
• Want a simpler representation for analysis
  • ...at least, for dataflow analysis
int x = 1 // what's the value of x?
// AST traversal can give the answer, right?
What about int x; x = 1; or int x = 0; x += 1;?
Representing Control Flow
Adopted from UPenn CIS 570: Modern Programming Language Implementation (Autumn 2006)
High-level representation
– Control flow is implicit in an AST
Low-level representation
– Use a control-flow graph (CFG)
– Nodes represent statements (low-level linear IR)
– Edges represent explicit flow of control
What Is Control-Flow Analysis?
 1      a := 0
 2      b := a * b
 3  L1: c := b / d
 4      if c < x goto L2
 5      e := b / c
 6      f := e + 1
 7  L2: g := f
 8      h := t - g
 9      if e > 0 goto L3
10      goto L1
11  L3: return
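[Figure: control-flow graph of the code above; nodes are numbered by their first statement]
  Node 1:  a := 0; b := a * b → node 3
  Node 3:  c := b / d; c < x? → Yes: node 7, No: node 5
  Node 5:  e := b / c; f := e + 1 → node 7
  Node 7:  g := f; h := t - g; if e > 0? → Yes: node 11, No: node 10
  Node 10: goto → node 3
  Node 11: return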
Basic Blocks
• A basic block is a sequence of straight line code that can be entered only at the beginning and exited only at the end
[Figure: example basic block: g := f; h := t - g; if e > 0?]
Building basic blocks
– Identify leaders
  – The first instruction in a procedure, or
  – The target of any branch, or
  – An instruction immediately following a branch (implicit target)
– Gobble all subsequent instructions until the next leader
Basic Block Example
 1      a := 0
 2      b := a * b
 3  L1: c := b / d
 4      if c < x goto L2
 5      e := b / c
 6      f := e + 1
 7  L2: g := f
 8      h := t - g
 9      if e > 0 goto L3
10      goto L1
11  L3: return
Leaders?
– {1, 3, 5, 7, 10, 11}
Blocks?
– {1, 2}
– {3, 4}
– {5, 6}
– {7, 8, 9}
– {10}
– {11}
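The leader rules can be checked mechanically against this example. The sketch below is one possible encoding, not a prescribed one: the tuple layout `(label, text, branch_target)` and the function names `find_leaders` and `basic_blocks` are assumptions introduced for illustration.

```python
# Each instruction: (label or None, text, branch target label or None).
# Instructions are numbered from 1, as on the slide.
PROGRAM = [
    (None, "a := 0", None),             # 1
    (None, "b := a * b", None),         # 2
    ("L1", "c := b / d", None),         # 3
    (None, "if c < x goto L2", "L2"),   # 4
    (None, "e := b / c", None),         # 5
    (None, "f := e + 1", None),         # 6
    ("L2", "g := f", None),             # 7
    (None, "h := t - g", None),         # 8
    (None, "if e > 0 goto L3", "L3"),   # 9
    (None, "goto L1", "L1"),            # 10
    ("L3", "return", None),             # 11
]

def find_leaders(program):
    """Apply the three leader rules and return sorted instruction numbers."""
    label_to_line = {label: i
                     for i, (label, _, _) in enumerate(program, start=1)
                     if label}
    leaders = {1}                                    # first instruction
    for i, (_, _, target) in enumerate(program, start=1):
        if target is not None:
            leaders.add(label_to_line[target])       # target of a branch
            if i < len(program):
                leaders.add(i + 1)                   # instruction after a branch
    return sorted(leaders)

def basic_blocks(program):
    """Gobble instructions from each leader up to (not including) the next."""
    leaders = find_leaders(program)
    bounds = leaders + [len(program) + 1]
    return [list(range(start, end)) for start, end in zip(bounds, bounds[1:])]

print(find_leaders(PROGRAM))
print(basic_blocks(PROGRAM))
```

The sketch treats only `goto` forms as branches; `return` needs no special handling here because it is the last instruction.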
Building a CFG From Basic Blocks
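[Figure: CFG built from the basic blocks; nodes are numbered by their leader]
  Node 1:  a := 0; b := a * b → node 3
  Node 3:  c := b / d; c < x? → Yes: node 7, No: node 5
  Node 5:  e := b / c; f := e + 1 → node 7
  Node 7:  g := f; h := t - g; if e > 0? → Yes: node 11, No: node 10
  Node 10: goto → node 3
  Node 11: return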
Construction
– Each CFG node represents a basic block
– There is an edge from node i to j if
  – The last statement of block i branches to the first statement of j, or
  – Block i does not end with an unconditional branch and is immediately followed in program order by block j (fall through)
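The two edge rules can be sketched as follows. The block table below hand-encodes the running example; the `kind` tags ("fall", "cond", "jump", "ret") and the name `build_cfg` are assumptions made for this sketch rather than anything defined in the slides.

```python
# Basic blocks of the running example, keyed by leader line number.
# Each entry: (lines, kind of last statement, leader of the branch target).
#   "fall" = no branch, "cond" = conditional branch,
#   "jump" = unconditional branch, "ret" = return.
BLOCKS = {
    1:  ([1, 2],    "fall", None),
    3:  ([3, 4],    "cond", 7),
    5:  ([5, 6],    "fall", None),
    7:  ([7, 8, 9], "cond", 11),
    10: ([10],      "jump", 3),
    11: ([11],      "ret",  None),
}

def build_cfg(blocks):
    """Return a successor list for every block, using the two edge rules."""
    order = sorted(blocks)                      # program order
    edges = {leader: [] for leader in order}
    for index, leader in enumerate(order):
        _, kind, target = blocks[leader]
        # Rule 1: the last statement branches to the first statement of j.
        if kind in ("cond", "jump"):
            edges[leader].append(target)
        # Rule 2: no unconditional branch at the end, so control may fall
        # through to the next block in program order.
        if kind in ("fall", "cond") and index + 1 < len(order):
            edges[leader].append(order[index + 1])
    return edges

for leader, succs in build_cfg(BLOCKS).items():
    print(leader, "->", succs)
```

A conditional branch contributes both a branch edge and a fall-through edge, which is why nodes 3 and 7 have two successors.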
Looping
[Figure: a loop in a CFG. Labels: preheader, entry edge, head, tail, back edge, exit edges.]
Why? Back edges indicate that we might need to traverse the CFG more than once for data flow analysis
Looping
Not all loops have preheaders
– Sometimes it is useful to create them
Without a preheader node
– There can be multiple entry edges
With a single preheader node
– There is only one entry edge
Dominators
• d dom i if all paths from entry to node i include d
• Strict dominator (d sdom i)
  • If d dom i, but d != i
• Immediate dominator (a idom b)
  • a sdom b and there does not exist any node c such that a != c, c != b, a dom c, c dom b
• Post dominator (p pdom i)
  • If every possible path from i to exit includes p
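These definitions can be turned into a small computation. The sketch below uses the standard iterative formulation, dom(n) = {n} ∪ ⋂ dom(p) over predecessors p, which is one common way to compute dominators and is not taken from the slides. The CFG is the running example from earlier in the deck; the function names are assumptions, and every node is assumed reachable from the entry.

```python
# Running-example CFG as successor lists (nodes named by leader line).
CFG = {1: [3], 3: [5, 7], 5: [7], 7: [10, 11], 10: [3], 11: []}

def dominators(cfg, entry):
    """Map each node to the set of nodes that dominate it (itself included)."""
    preds = {n: set() for n in cfg}
    for n, succs in cfg.items():
        for s in succs:
            preds[s].add(n)
    # Start from "everything dominates everything" and shrink to a fixed point.
    dom = {n: set(cfg) for n in cfg}
    dom[entry] = {entry}
    changed = True
    while changed:
        changed = False
        for n in cfg:
            if n == entry:
                continue
            new = set(cfg)
            for p in preds[n]:
                new &= dom[p]          # d must dominate every predecessor
            new.add(n)                 # every node dominates itself
            if new != dom[n]:
                dom[n] = new
                changed = True
    return dom

def immediate_dominators(dom, entry):
    """idom(n): the strict dominator of n closest to n."""
    idom = {}
    for n, ds in dom.items():
        if n == entry:
            continue
        strict = ds - {n}
        # Strict dominators of n form a chain; the closest one has the
        # largest dominator set of its own.
        idom[n] = max(strict, key=lambda d: len(dom[d]))
    return idom

DOM = dominators(CFG, 1)
print(DOM)
print(immediate_dominators(DOM, 1))
```

Post-dominators could be obtained the same way by running `dominators` on the reversed graph with the exit node as the entry.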
Identifying Natural Loops and Dominators
• Back edge
  • A back edge of a natural loop is one whose target dominates its source
• Natural loop
  • The natural loop of a back edge (m→n), where n dominates m, is the set of nodes x such that n dominates x and there is a path from x to m not containing n
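A minimal sketch of both definitions, applied to the running-example CFG from earlier in the deck: back edges are found with a dominator computation, and the loop body is collected by walking predecessors backwards from m without passing through n. The helper names are assumptions, and the iterative dominator routine is a standard formulation rather than something given in the slides.

```python
CFG = {1: [3], 3: [5, 7], 5: [7], 7: [10, 11], 10: [3], 11: []}

def predecessors(cfg):
    preds = {n: set() for n in cfg}
    for n, succs in cfg.items():
        for s in succs:
            preds[s].add(n)
    return preds

def dominators(cfg, entry):
    """Iterative dominator sets; assumes every node is reachable from entry."""
    preds = predecessors(cfg)
    dom = {n: set(cfg) for n in cfg}
    dom[entry] = {entry}
    changed = True
    while changed:
        changed = False
        for n in cfg:
            if n == entry:
                continue
            new = set(cfg)
            for p in preds[n]:
                new &= dom[p]
            new.add(n)
            if new != dom[n]:
                dom[n], changed = new, True
    return dom

def back_edges(cfg, entry):
    """Edges m -> n whose target n dominates their source m."""
    dom = dominators(cfg, entry)
    return [(m, n) for m in cfg for n in cfg[m] if n in dom[m]]

def natural_loop(cfg, back_edge):
    """Nodes that can reach m without passing through n, plus n itself."""
    m, n = back_edge
    preds = predecessors(cfg)
    loop = {n}                 # the header blocks the backward walk
    stack = []
    if m not in loop:
        loop.add(m)
        stack.append(m)
    while stack:
        x = stack.pop()
        for p in preds[x]:
            if p not in loop:
                loop.add(p)
                stack.append(p)
    return loop

for edge in back_edges(CFG, 1):
    print(edge, sorted(natural_loop(CFG, edge)))
```

For the running example the only back edge is 10→3, and its natural loop contains nodes 3, 5, 7 and 10.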
Reducibility
• A CFG is reducible (well-structured) if we can partition its edges into two disjoint sets, the forward edges and the back edges, such that
  – The forward edges form an acyclic graph in which every node can be reached from the entry node
  – The back edges consist only of edges whose targets dominate their sources
• Structured control-flow constructs give rise to reducible CFGs
Value of reducibility:
– Dominance is useful in identifying loops
– Simplifies code transformations (every loop has a single header)
– Permits interval analysis
Handling Irreducible CFGs
• Node splitting
  • Can turn irreducible CFGs into reducible CFGs
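[Figure: an irreducible CFG with nodes a, b, c, d, e, and the reducible CFG obtained after node splitting, with nodes a, b, c, d, dʹ, e.]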
General idea
– Reduce the graph (iteratively remove self edges, merge nodes with a single predecessor)
– More than one node remaining => irreducible
– Split any multi-parent node and start over
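The reduction step (without the splitting) can be sketched as a reducibility test. The function name `is_reducible` and both test graphs are assumptions for illustration: the first is the running-example CFG from earlier in the deck, and the second is a textbook-style three-node graph with a two-entry cycle, not the graph from the figure. Every node is assumed reachable from the entry.

```python
def is_reducible(cfg, entry):
    """Collapse the graph by removing self edges and merging single-pred nodes.

    Returns True when the graph shrinks to a single node.
    """
    succ = {n: set(s) for n, s in cfg.items()}   # work on a copy
    changed = True
    while changed and len(succ) > 1:
        changed = False
        # Remove self edges.
        for n in succ:
            if n in succ[n]:
                succ[n].discard(n)
                changed = True
        # Merge one node that has exactly one predecessor into that predecessor.
        for n in list(succ):
            if n == entry:
                continue
            preds = [p for p in succ if n in succ[p]]
            if len(preds) == 1:
                p = preds[0]
                succ[p].discard(n)
                succ[p] |= succ.pop(n)   # p inherits n's outgoing edges
                changed = True
                break                    # clean up self edges before merging again
    return len(succ) == 1

# Running example: a loop with a single header.
STRUCTURED = {1: [3], 3: [5, 7], 5: [7], 7: [10, 11], 10: [3], 11: []}
# Hypothetical graph: the cycle b <-> c can be entered at b or at c.
TWO_ENTRY_CYCLE = {"a": ["b", "c"], "b": ["c"], "c": ["b"]}

print(is_reducible(STRUCTURED, 1))
print(is_reducible(TWO_ENTRY_CYCLE, "a"))
```

When the test fails, node splitting duplicates a multi-parent node (one copy per predecessor) and the reduction is retried.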
Why go through all this trouble?
• Modern languages provide structured control flow
  – Shouldn't the compiler remember this information rather than throw it away and then re-compute it?
• Answers?
  – We may want to work on the binary code
  – Most modern languages still provide a goto statement
  – Languages typically provide multiple types of loops. This analysis lets us treat them all uniformly
  – We may want a compiler with multiple front ends for multiple languages; rather than translating each language to a CFG, translate each language to a canonical IR and then to a CFG