
Callgraph properties of executables and generative mechanisms

Daniel Bilar ∗

Wellesley College, Department of Computer Science, Wellesley, MA 02481, USA. E-mail: [email protected]

This paper examines the callgraphs of 120 malicious and 280 non-malicious executables. Pareto models were fitted to the in-degree, out-degree and basic block count distributions, and a statistically significant difference is shown for the derived power law exponent. Generative mechanisms are discussed, and a two-step optimization process based on resource constraints and robustness tradeoffs (the PLR/HOT mechanism) is sketched to account for the structure of the executable.

Keywords: Executables, Callgraph, HOT process, Graph-structural fingerprint, Malware, PLR model

1. Motivation

All commercial antivirus (AV) products rely on signature matching, the bulk of which consists of strict byte sequence pattern matching. For modern, evolving polymorphic and metamorphic malware, this approach is unsatisfactory. Clementi recently checked fifteen state-of-the-art, updated AV scanners against ten highly polymorphic malware samples and found false negative rates from 0-90%, with an average of 48% [12]. This development was already predicted in 2001 [54]. Polymorphic malware contain decryption routines which decrypt encrypted constant parts of the malware body. The malware can mutate its decryptors in subsequent generations, thereby complicating signature-based detection approaches. The decrypted body, however, remains constant. Metamorphic malware generally do not use encryption, but are able to mutate their body in subsequent generations using various techniques, such as junk insertion, semantic NOPs, code transposition, equivalent instruction substitution and register reassignments [11][52].

*Corresponding author: Daniel Bilar, Department of Computer Science, Wellesley College

The net result of these techniques is a shrinking usable "constant base" for strict signature-based detection approaches.

Since signature-based approaches are quite fast (but show little tolerance for metamorphic and polymorphic code) and heuristics such as emulation are more resilient (but quite slow and may hinge on environmental triggers), a detection approach that combines the best of both worlds would be desirable. This is the philosophy behind a structural fingerprint. Structural fingerprints are statistical in nature, and as such are positioned as 'fuzzier' metrics between static signatures and dynamic heuristics. The structural fingerprint investigated in this paper for differentiation purposes is based on some properties of the executable's callgraph. I also propose a generative mechanism for the callgraph topology.

2. Generating the callgraph

The primary tools used are described in more detail in the appendix.

2.1. Samples

For non-malicious software, henceforth called 'goodware', sampling followed a two-step process: I inventoried all PEs (the primary 32-bit Windows file format) on a Microsoft XP Home SP2 laptop, extracted 300 samples uniformly at random, and discarded overly large and small files, yielding 280 samples. For malicious software (malware), seven classes of interest were fixed: backdoors, hacking tools, DoS, trojans, exploits, viruses, and worms. The worm class was further divided into Peer-to-Peer (P2P), Internet Relay Chat/Instant Messenger (IRC/IM), Email and Network worm subclasses.



Fig. 1. Graph structures of an executable: (a) callgraph; (b) control flow graph (CFG); (c) basic block.

For a non-specialist introduction to malicious software, see [51]; for a canonical reference, see [53]. Each class (subclass) contained at least 15 samples. Since AV vendors were hesitant for liability reasons to provide samples, I gathered them from herm1t's (underground) collection and identified compiler and (potential) packer metadata using PEiD. Practically all malware samples were identified as having been compiled with MS C++ 5.0/6.0, MS Visual Basic 5.0/6.0 or LCC, and about a dozen samples were packed with various versions of UPX (an executable compression program). Malware was run through best-of-breed, updated open- and closed-source AV products, yielding false negative rates of 32% (open-source) and 2% (closed-source), respectively. Overall file sizes for both mal- and goodware ranged from Θ(10kB) to Θ(1MB). A preliminary file size distribution investigation yielded a log-normal distribution; for a putative explanation of the underlying generative process, see [38] and [31]. All 400 samples were loaded into the de-facto industry standard disassembler (IDA Pro [24]), inter- and intra-procedurally parsed and augmented with symbolic meta-information gleaned programmatically from the binary via FLIRT signatures (when applicable). I exported the identified structures via IDAPython into a MySQL database. These structures were subsequently parsed by a disassembly visualization tool (BinNavi [16]) to generate and investigate the callgraph.
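The two-step goodware sampling described above can be sketched in a few lines of Python. This is an illustrative sketch only: the root path, the use of file extensions as a stand-in for proper PE-header detection, and the size bounds are assumptions, not the paper's exact criteria.

import os, random

def inventory_pes(root):
    """Inventory PE files (here crudely, by extension) under a directory tree."""
    pes = []
    for dirpath, _, filenames in os.walk(root):
        for name in filenames:
            if name.lower().endswith((".exe", ".dll")):
                pes.append(os.path.join(dirpath, name))
    return pes

all_pes = inventory_pes("C:\\")                              # step 1: inventory all PEs
sample = random.sample(all_pes, min(300, len(all_pes)))      # step 2: 300 uniform random samples

# discard overly small and overly large files (bounds are illustrative assumptions)
MIN_SIZE, MAX_SIZE = 10 * 1024, 5 * 1024 * 1024
final = [p for p in sample if MIN_SIZE <= os.path.getsize(p) <= MAX_SIZE]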

2.2. Callgraph

Following [17], we treat an executable as a graph of graphs. This follows the intuition that in any procedural language, the source code is structured into functions, each of which can be viewed as a flowchart, i.e. a directed graph which we call a flowgraph. These functions call each other, thus creating a larger graph where each node is a function and the edges are calls-to relations between the functions. We call this larger graph the callgraph. We recover this structure by disassembling the executable into individual instructions. We distinguish between short and far branch instructions: short branches do not save a return address, while far branches do. Intuitively, short branches are normally used to pass control around within one function of the program, while far branches are used to call other functions. A sequence of instructions that is contiguous (i.e. has no branches jumping into its middle and ends at a branch instruction) is called a basic block. We consider the graph formed by taking each basic block as a node and each short branch as an edge. The connected components in this directed graph correspond to the flowgraphs of the functions in the source code. For each connected component in this graph, we create a node in the callgraph. For each far branch in the connected component, we add an edge to the node corresponding to the connected component this branch targets. Fig. 1 illustrates these concepts.
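As a minimal illustration of the construction just described (and not the IDA Pro/BinNavi tooling actually used in the paper), the sketch below groups basic blocks into flowgraphs via short branches and then derives callgraph nodes from connected components and callgraph edges from far branches. Block names and branch lists are hypothetical toy inputs.

import networkx as nx

# Toy disassembly output: basic block ids, short branches (intra-procedural,
# no return address saved) and far branches (calls, return address saved).
blocks = ["b0", "b1", "b2", "b3", "b4"]
short_branches = [("b0", "b1"), ("b1", "b2"), ("b3", "b4")]
far_branches = [("b1", "b3")]

# Flowgraph level: nodes are basic blocks, edges are short branches.
flow = nx.DiGraph()
flow.add_nodes_from(blocks)
flow.add_edges_from(short_branches)

# Each weakly connected component corresponds to one function (one flowgraph).
block_to_func = {}
for i, comp in enumerate(nx.weakly_connected_components(flow)):
    for b in comp:
        block_to_func[b] = "f%d" % i

# Callgraph level: one node per component, one edge per far branch, drawn from
# the caller's component to the component containing the branch target.
callgraph = nx.DiGraph()
callgraph.add_nodes_from(set(block_to_func.values()))
for src, dst in far_branches:
    callgraph.add_edge(block_to_func[src], block_to_func[dst])

print(list(callgraph.edges()))   # e.g. [('f0', 'f1')] for the toy input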

Formally, denote a callgraph CG as CG = G(V, E), where G(·) stands for 'graph'. Let V = ⋃ F, where F ∈ {normal, import, library, thunk}. This just says that each function in CG is either a 'library' function (from an external library statically linked in), an 'import' function (dynamically imported from a dynamic library), a 'thunk' function (mostly one-line wrapper functions used for calling convention or type conversion) or a 'normal' function (which can be viewed as the executable's own function). The following metrics were programmatically collected from CG:


– |V| is the number of nodes in CG, i.e. the function count of the callgraph.

– For any f ∈ V, let f = G(V_f, E_f), where b ∈ V_f is a block of code, i.e. each node in the callgraph is itself a graph, a flowgraph, and each node of the flowgraph is a basic block.

– Define IC : B → ℕ, where B is the set of blocks of code and IC(b) is the number of instructions in b. We denote this function shorthand as |b|_IC, the number of instructions in basic block b.

– We extend this notation |·|_IC to elements of V by defining |f|_IC = Σ_{b ∈ V_f} |b|_IC. This gives us the total number of instructions in a node of the callgraph, i.e. in a function.

– Let d+_G(f), d−_G(f) and d^bb_G(f) denote the indegree, outdegree and basic block count of a function, respectively.
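Continuing the toy representation from the sketch above (again, not the paper's collection code), the following lines show how these metrics, |V|, |f|_IC, d+(f), d−(f) and d^bb(f), could be computed for a small hypothetical callgraph in which each function node carries its basic blocks and their instruction counts.

import networkx as nx

cg = nx.DiGraph()
cg.add_node("f0", blocks={"b0": 12, "b1": 7, "b2": 3})   # per-block |b|_IC
cg.add_node("f1", blocks={"b3": 5, "b4": 9})
cg.add_edge("f0", "f1")

num_functions = cg.number_of_nodes()                      # |V|

metrics = {}
for f, data in cg.nodes(data=True):
    metrics[f] = {
        "instr_count": sum(data["blocks"].values()),      # |f|_IC = sum over b of |b|_IC
        "indegree": cg.in_degree(f),                      # d+(f)
        "outdegree": cg.out_degree(f),                    # d-(f)
        "bb_count": len(data["blocks"]),                  # d^bb(f)
    }

print(num_functions, metrics["f0"])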

Fig. 2. Correlation coefficient r_in,out, shown as scatterplots of p-value vs. r per executable. (a) Malware: no statistically significant correlation between indegree and outdegree of callgraph nodes for ≈85% of executables; the 15% of executables with p-values ≤ 0.05 (8% with p-values ≤ 0.01) show weak to very weak correlation. (b) Goodware: no statistically significant correlation between in- and outdegree of callgraph nodes for ≈94% of executables; the 6% with p-values ≤ 0.05 (3% with p-values ≤ 0.01) show weak to very weak correlation.

Table 1. Correlation and IQR for instruction count

class      metric    Θ(10)     Θ(100)    Θ(1000)
Goodware   r          0.05     -0.017    -0.0366
           IQR       12        44        36
Malware    r          0.08      0.0025    0.0317
           IQR        8        45        28

2.3. Correlations

I calculated the correlation between indegree and outdegree of functions. Prior analyses of static class collaboration networks [45][41] suggest an anti-correlation, characterizing some functions as sources or sinks. I found no significant correlation between indegree and outdegree of functions in the disassembled executables (Fig. 2). Intuitively, correlation is unlikely to occur except in the '0 outdegree' case (the BinNavi toolset does not generate the flowgraph for imported functions, i.e. an imported function automatically has outdegree 0, but will be called from many other functions). Additionally, I size-blocked both sample groups into three function count blocks, with block criteria chosen as Θ(10), Θ(100) and Θ(1000) function counts, to investigate a correlation between instruction count in functions and complexity of the executable (with function count as a proxy). Again, I found no correlation at significance level ≤ 0.001. Coefficient values and the IQR for instruction counts (a spread measure, the difference between the 75th and the 25th percentiles of the sample) are given in Table 1. The first result corroborates previous findings; the second result, at the phenomenological level, agrees with the 'refactoring' model in [41], which posits that excessively long functions tend to be decomposed into smaller functions. Remarkably, the spread is quite low, on the order of a few dozen instructions. I will discuss models further in Section 4.
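A minimal sketch of the two computations in this subsection, written with Python/SciPy rather than the Matlab tooling listed in the appendix; the per-function arrays are hypothetical stand-ins for a single executable.

import numpy as np
from scipy import stats

indeg = np.array([1, 1, 2, 0, 5, 1, 3, 0, 1, 2])             # per-function indegrees
outdeg = np.array([2, 0, 1, 4, 0, 3, 1, 2, 0, 1])            # per-function outdegrees
instr = np.array([12, 40, 7, 95, 23, 31, 8, 66, 15, 27])     # per-function instruction counts

r, p = stats.pearsonr(indeg, outdeg)                         # correlation coefficient and p-value
iqr = np.percentile(instr, 75) - np.percentile(instr, 25)    # spread: 75th minus 25th percentile

print("r=%.3f  p=%.3f  IQR=%.1f" % (r, p, iqr))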

2.4. Function types

Each point in the scatterplots in Fig. 3 represents three metrics for one individual executable: function count, and the proportions of normal functions, static library + dynamic import functions, and thunks. Proportions for an individual executable add up to 1. The six subgraphs are read as follows, using Fig. 3(b) as an example: the x-axis denotes the proportion of 'normal' functions and the y-axis the proportion of 'thunk' functions in the binaries.


Fig. 3. Scatterplots of function type proportions, color-coded by function count (10^2-10^4): (a) goodware, normal vs. libraries+imports; (b) goodware, normal vs. thunks; (c) goodware, thunks vs. libraries+imports; (d) malware, normal vs. libraries+imports; (e) malware, normal vs. thunks; (f) malware, thunks vs. libraries+imports.

The color of each point indicates |V|, which may serve as a rough proxy for the executable's size. The dark red point at (x, y) = (0.87, 0.007) is endnote.exe, as it is the only goodware binary with a function count of Θ(10^4).

Most thunks are wrappers around imports; hence in small executables, a larger proportion of the functions will be thunks. The same holds for libraries: the larger the executable, the smaller the percentage of library functions. This is heavily influenced by the choice of dynamic vs. static linking. The thunk/library plot, listed for completeness, does not give much information, confirming the intuition that the two are independent of each other, mostly due to compiler behavior.

2.5. α fitting with Hill estimator

Taking our cue from [44], who surveyed empirical studies of technological, social, and biological networks, I hypothesize that the discrete distributions of d+(f), d−(f) and d^bb(f) follow a truncated power law of the form P_{d*(f)}(m) ∼ m^{α_{d*(f)}} e^{−m/k_c}, where k_c indicates the end of the power law regime. As shorthand, I write α_{d*(f)} for the respective metrics as α_indeg, α_outdeg and α_bb.

Figs. 4(a) and 4(b) show, pars pro toto, the fitting procedure for our 400 samples. The plot is an empirical complementary cumulative density function plot (ECCDF). The x-axis shows the indegree, and the y-axis shows P[X > x], the probability that a function in endnote.exe has indegree greater than x. If P[X > x] can be shown to fit a Pareto distribution, we can extract the power law exponent for the PMF P_{d*(f)}(m) from the CDF fit (see [1] and, more extensively, [42] for the relationship between Pareto, power law and Zipf distributions). Parsing Fig. 4(a): blue points denote the data points (functions) and two descriptive statistics (median and maximum value) for the indegree distribution of endnote.exe. We see that for endnote.exe, 80% of functions have indegree 1, 2% have indegree > 10, and roughly 1% have indegree > 20. The fitted distribution is shown in magenta, together with the parameters α = 1.97 and k_c = 1415.83. Although tempting, simply 'eyeballing' Pareto CDFs for the requisite linearity on a log-log scale [23][6] is not enough.
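The ECCDF construction and the truncated Pareto fit described above can be sketched as follows. This is an illustrative reimplementation on synthetic heavy-tailed data, not the author's Matlab fitting code; for simplicity the truncated form C x^(-k) exp(-x/k_c) is fitted directly to the empirical CCDF.

import numpy as np
from scipy.optimize import curve_fit

def eccdf(samples):
    """Return (x, P[X > x]) for a positive integer-valued sample."""
    x = np.sort(np.unique(samples))
    p = np.array([(samples > v).mean() for v in x])
    return x, p

def truncated_powerlaw(x, c, k, kc):
    return c * x ** (-k) * np.exp(-x / kc)

rng = np.random.default_rng(0)
indeg = np.rint(rng.pareto(1.5, 2000) + 1).astype(int)    # synthetic heavy-tailed 'indegrees'

x, p = eccdf(indeg)
mask = p > 0                                              # drop the zero tail point
popt, _ = curve_fit(truncated_powerlaw, x[mask], p[mask],
                    p0=(1.0, 1.5, 100.0), bounds=(1e-9, np.inf))
c_hat, k_hat, kc_hat = popt
print("fitted k=%.2f  kc=%.1f" % (k_hat, kc_hat))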


Fig. 4. Pareto fitting of ECCDFs to C x^{−k} e^{−x/k_c}, shown with Hill estimator inset. (a) Goodware sample: fitting of α_indeg and k_c for endnote.exe (G81), numfunc = 10339; α = 1.9716, k_c = 1415.8378, median = 0, data max = 1497. (b) Malware sample: fitting of α_bb and k_c for Exploit.Win32.MsSqlHack (G48), numfunc = 170; α = 1.6073, k_c = 50.9956, median = 3.5, data max = 184.

Following [38] on philosophy and [47] on methodology, I calculate the Hill estimator α, whose asymptotic normality is then used to compute a 95% confidence interval (CI). This is shown in the inset and serves as a self-consistency check of the Pareto model, estimating the parameter α as a function of the number of observations. As the number of observations i increases, a model that is consistent along the data should show roughly CI_i ⊇ CI_{i+1}. For an insightful exposé and more recent procedures to estimate Pareto tails, see [60][19].
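The Hill plot used here as a self-consistency check can be sketched as follows: the Hill estimate of α and its asymptotic 95% CI as a function of the number of upper order statistics k, run on synthetic Pareto data. This is an illustrative implementation, not the paper's code.

import numpy as np

def hill_plot(samples):
    """Return arrays (k, alpha_k, ci_low, ci_high) for k = 1..n-1."""
    x = np.sort(np.asarray(samples, dtype=float))[::-1]   # descending order statistics
    x = x[x > 0]
    logs = np.log(x)
    ks, alphas, lo, hi = [], [], [], []
    for k in range(1, len(x)):
        gamma = np.mean(logs[:k] - logs[k])   # mean log-excess over the (k+1)-th largest value
        if gamma <= 0:
            continue
        alpha = 1.0 / gamma                   # Hill estimate of the tail index
        se = alpha / np.sqrt(k)               # asymptotic standard error (delta method)
        ks.append(k); alphas.append(alpha)
        lo.append(alpha - 1.96 * se); hi.append(alpha + 1.96 * se)
    return np.array(ks), np.array(alphas), np.array(lo), np.array(hi)

rng = np.random.default_rng(1)
data = rng.pareto(2.0, 5000) + 1              # Pareto sample with true tail index 2
k, a, lo, hi = hill_plot(data)
print("alpha estimate at k=500: %.2f" % a[k == 500][0])

A consistent Pareto model shows estimates and intervals that roughly stabilize and nest as k grows; a jittery 'Hill Horror Plot' shows no such stabilization.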

To tentatively corroborate the consistency of our posited Pareto model, 30 (goodware) and 21 (malware) indegree, outdegree and basic block ECCDF plots were uniformly sampled into three function count blocks, with block criteria chosen as Θ(10), Θ(100) and Θ(1000) function counts, yielding a sampling coverage of 10% (goodware) and 17% (malware).

Table 2. α distribution fitting and testing

class   Basic Block       Indegree         Outdegree
GW      N(1.634, 0.3)     N(2.02, 0.3)     N(1.69, 0.307)
MW      N(1.7, 0.3)       N(2.08, 0.45)    N(1.68, 0.35)
t       2.57              1.04             -0.47

Visual inspection indicates that for malware, the model seemed more consistent for outdegree than for indegree at all function sizes. For basic block count, the consistency tends to be better for smaller executables. I see this tendency for goodware as well, with the observation that outdegree was most consistent in size block Θ(100), less so for Θ(10) and Θ(1000). For both malware and goodware, indegree seemed the least consistent; quite a few samples exhibited a so-called 'Hill Horror Plot' [47], where the α's and the corresponding CIs are very jittery.

The fitted power-law exponents α_indeg, α_outdeg and α_bb, together with the individual executables' callgraph size, are shown in Fig. 5. For both classes, the ranges extend over α_indeg ≈ [1.5, 3], α_outdeg ≈ [1.1, 2.5] and α_bb ≈ [1.1, 2.1], with a slightly greater spread for malware.

2.6. Testing for difference

I now check whether there are any statistically significant differences between the (α, k_c) fits for goodware and malware. Following the procedures in [61], I find α_indeg, α_outdeg and α_bb distributed approximately normally. The exponential cutoff parameters k_c are lognormally distributed. Applying a standard two-tailed t-test (Table 2), I find at significance level 0.05 (t_critical = 1.97) only µ(α_bb, malware) ≥ µ(α_bb, goodware).
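The two difference tests used here and for the cutoff parameters k_c just below (a two-tailed two-sample t-test and a Wilcoxon rank-sum test) can be sketched with SciPy as follows; the arrays are synthetic stand-ins loosely generated from the distribution parameters in Table 2, not the paper's data.

import numpy as np
from scipy import stats

rng = np.random.default_rng(2)
alpha_bb_gw = rng.normal(1.634, 0.30, 280)           # stand-in fitted alpha_bb, goodware
alpha_bb_mw = rng.normal(1.700, 0.30, 120)           # stand-in fitted alpha_bb, malware
kc_bb_gw = rng.lognormal(np.log(59.1), 0.7, 280)     # stand-in lognormal cutoffs, goodware
kc_bb_mw = rng.lognormal(np.log(54.2), 0.7, 120)     # stand-in lognormal cutoffs, malware

t, p_t = stats.ttest_ind(alpha_bb_mw, alpha_bb_gw)   # two-tailed two-sample t-test
z, p_w = stats.ranksums(kc_bb_mw, kc_bb_gw)          # Wilcoxon rank-sum test

print("t=%.2f (p=%.3f)   z=%.2f (p=%.3f)" % (t, p_t, z, p_w))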

For the basic blocks, k_c ≈ LogN(59.1, 52) (goodware) and k_c ≈ LogN(54.2, 44) (malware), and µ(k_c(bb, malware)) = µ(k_c(bb, goodware)) was rejected via a Wilcoxon rank sum test with z = 13.4. The steeper slope of malware's α_bb implies that functions in malware tend to have a lower basic block count. This can be accounted for by the fact that malware tends to be simpler than most applications and operates without much interaction, hence fewer branches and fewer basic blocks. Malware tends to have limited functionality and to operate independently of input from the user and the operating environment. Also, malware is usually not compiled with aggressive compiler optimization settings enabled (which would lead to more inlining and thus increase the basic block count of the individual functions).


Fig. 5. Scatterplots of the fitted power-law exponents α, color-coded by function count (10^2-10^4): (a) goodware, α_indeg vs. α_outdeg; (b) goodware, α_indeg vs. α_bb; (c) goodware, α_outdeg vs. α_bb; (d) malware, α_indeg vs. α_outdeg; (e) malware, α_indeg vs. α_bb; (f) malware, α_outdeg vs. α_bb.

It may also be possible that malware authors tend to break functions into simpler components than 'regular' programmers. The smaller cutoff point for malware seems to corroborate this as well, in that the power law relationship holds over a shorter range. However, this explanation should be regarded as speculative pending further investigation.

3. Related work

Analyses of non-graph-based structural features of executables were undertaken by [30][5][58][48][4][49]. Li et al. [30] used statistical 1-gram analysis of binary byte values to generate a fingerprint (a 'fileprint') for file type identification and classification purposes. At the semantically richer opcode level, Bilar [5] investigated and statistically compared opcode frequency distributions of malicious and non-malicious executables. Weber et al. [58] start from the assumption that compiled binaries exhibit homogeneities with respect to several structural features such as instruction frequencies, instruction patterns, memory access, jump/call distances, entropy metrics and byte-type probabilities, and that tampering by malware would disturb these statistical homogeneities.

Bayer [4] and Ries [48] run malware dynamically in a sandbox, record security-relevant Win32 API calls, and construct a syscall-based behavioral fingerprint for malware identification and classification purposes. Rozinov locates calls to the Win32 API in the binary itself: while Ries and Bayer record the malware's system calls during execution, Rozinov [49] statically disassembles and simplifies the malware binary via slicing, scans for Win32 API calls and constructs an FSA signature for later detection purposes.

A simple but effective graph-based signature set to characterize statically disassembled binaries was proposed by Flake [20]. For the purposes of similarity analysis, he assigned to each function a 3-tuple consisting of basic block count, branch count, and call count. These sets were used to compare malware variants and localize changes; an in-depth discussion of the involved procedures can be found in [17].


Fig. 6. Basic block differences in the CFG under compiler optimization regimes: (a) CFG without loop unrolling; (b) CFG with loop unrolling.

For the purposes of worm detection, Kruegel [27] extracts control flow graphs from executable code in network streams, augments them with a colouring scheme, and identifies k-connected subgraphs that are subsequently used as structural fingerprints.

Power-law relationships were reported in [56][41][57][10]. Valverde et al. [56][57] measured undirected graph properties of static class relationships for the Java Development Framework 1.2 and a racing computer game, ProRally 2002. They found α_JDK ≈ 2.5-2.65 for the two largest connected components (N1 = 1376, N2 = 1364) and α_game ≈ 2.85 ± 1.1 for the game (N = 1989). In the context of studying the time series evolution of C/C++ compile-time "#include" dependency graphs, α_in ≈ 0.97-1.22 and an exponential outdegree distribution are reported. This asymmetry is not explained. Focusing on the properties of directed graphs, Potanin et al. [45] examined the binary heap during execution and took a snapshot of 60 graphs from 35 programs written in Java, Self, C++ and Lisp. They concluded that the distributions of incoming and outgoing object references followed a power law with α_in ≈ 2.5 and α_out ≈ 3.


Myers [41] embarked on an extensive and careful analysis of six large collaboration networks (three C++ static class diagrams and three C callgraphs) and collected data on in/outdegree distribution, degree correlation, clustering, and the complexity evolution of individual classes through time. He found roughly similar results for the callgraphs, α_in ≈ α_out ≈ 2.5, and noted that it was more likely to find a function with many incoming links than with many outgoing ones. More recently, Chatzigeorgiou et al. [10] applied algebraic methods to identify, among other structures, heavily loaded 'manager' classes with high in- and outdegree in three static OO class graphs. Along the lines of network motif [35] or graphlet [46] detection in biological networks, they propose a similarity-measure algorithm to detect Design Patterns [22], best-practice high-level design structures whose presence manifests itself in the form of tell-tale subgraphs.

4. Emergent Complexity vs Engineered/Evolved Systems

There is no dearth of graph-inducing models and processes. For a historical sketch together with the seminal papers, the reader is referred to [43]; for a shorter, more up-to-date synopsis on power laws and distinctive generative mechanisms, including so-called HOT (Highly Optimized Tolerance [8]) ones, see [42]. Variations of the Yule process, pithily summarized as a 'rich-get-richer' scheme, are popular. The physicist Barabasi rediscovered and recoined the process as 'preferential attachment' [3], although its discovery antedates him by at least forty years (its origins lie in explaining biological taxa). In some quarters of the physics community, power laws have also been taken as a signature of emergent complexity posited by critical phenomena such as phase transitions and chaotic bifurcation points [9]. The models derived from such a framework are mathematically compelling and very elegant in their generality; with little more than a couple of parameter adjustments, they are able, at some phenomenological level, to generate graphs whose aggregate statistics (sometimes provably) exhibit power-law distributions. Although these models offer a relatively simple, mathematically tractable approximation of some features of the system under study, I think that HOT models, with their emphasis on evolved and engineered complexity through feedback, tradeoffs between objective functions, and resource constraints situated in a probabilistic environment, are a more natural and appropriate framework for representing the majority of real-life systems. I illustrate the pitfalls of a narrow focus on power law metrics without proper consideration of real-life domain specifications, demands and constraints with Fig. 7 from [15]: note that the degree sequence in subfigure e) is the same for all subfigures a)-d), yet the topological structure of subfigures a)-d) is vastly different. Along these lines, domain experts have argued against 'emergent' complexity models in the cases of Internet router [29] and river stream [26] structures.

Within the framework of a Monte Carlo simulation, Myers' [41] 'refactoring model' process was phenomenologically able to reproduce key features of the in- and outdegree distributions of the callgraphs he investigated; similarly Valverde [57] with his GNC model. Interestingly, he at first takes a distinctively HOT-esque tack in asking whether "large-scale patterns are due to external constraints, path-dependent processes and specific functionalities", but then nevertheless proceeds to propose a stochastic attachment model premised on code evolution through code duplication. While it is possible to reproduce the features of callgraphs using stochastic models, I would argue that the callgraph features seen here are the natural signature of two distinct, domain-specific HOT optimization processes: one involving human designers and the other, code compilers. HOT mechanisms are processes that induce highly structured, complex systems (like an executable) by trying to optimally allocate resources to limit event losses in a probabilistic environment.

4.1. Human design and coding as HOT mechanism

The first domain-specific mechanism that induces a cost-optimized, resource-constrained structure on the executable is the human element. Humans using various best-practice software development techniques [28][22] have to juggle competing concerns throughout the design and coding stages: evolvability vs. specificity of the system, functionality vs. code size, source readability vs. development time, debugging time vs. time-to-market, just to name a few conflicting objective functions and resource constraints.


Humans design and implement programs against a set of constraints, taking the probability of the event space into consideration implicitly (rarely explicitly), indirectly through the choice of programming language (typed, OO, procedural, functional, etc.) and directly through the design choice of data structures and control flow. The majority of programs are designed for average (or even optimal) operating environments and deal very badly with exceptional conditions like random inputs or low resources, which are the corresponding analogue of PLR system perturbation by rare events. The most common attack technique for years has been exploiting input validation vulnerabilities, accounting for over half of the published software vulnerabilities and over eighty percent of faults leading to successful penetration attacks. Miller et al. tested Unix, Windows and OS X utilities [34][33] by subjecting them, in the simplest case, to random keyboard input, and report crash failure rates of 25%-40%, 24%, and 7%, respectively. More recently, Whittaker et al. [59] describe a dozen practical attack techniques targeting resources against which the executable was constrained (primarily by the human designer), among them memory, disk space and network availability conditions. Effects of so-called "Reduction of Quality" attacks against optimizing control systems have also been studied by [37][36].
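The random-input robustness testing attributed to Miller et al. can be sketched as below, assuming a Unix-like system and some stdin-reading target utility; the target path, input size and iteration count are arbitrary illustrative choices, not anything from the cited studies.

import random
import subprocess

TARGET = ["/usr/bin/sort"]        # hypothetical target utility reading stdin
TRIALS = 100
crashes = 0
for _ in range(TRIALS):
    junk = bytes(random.getrandbits(8) for _ in range(1024))   # random 'keyboard' input
    proc = subprocess.run(TARGET, input=junk,
                          stdout=subprocess.DEVNULL, stderr=subprocess.DEVNULL)
    if proc.returncode < 0:       # negative return code: terminated by a signal (crash)
        crashes += 1
print("crash rate: %d%%" % (100 * crashes // TRIALS))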

4.2. Compiler as HOT mechanism

The second domain-specific mechanism that induces a cost-optimized, resource-constrained structure on the executable is the compiler. The compiler functions as a HOT process. Cost functions here include memory footprint, execution cycles, and power consumption minimization, whereas the constraints typically involve register and cache line allocation, opcode sequence selection, number/stages of pipelines, and ALU and FPU utilization. The interactions between at least 40+ optimization mechanisms (in themselves a network graph [39]) are so complex that meta-optimization techniques [25] have been developed to heuristically choose a subset from the bewildering possibilities. Although the callgraph is largely invariant under most optimization regimes, the more aggressive mechanisms can have a marked effect on graph structure. Fig. 6(a) shows the CFG of some source code produced by the Intel C++ Compiler 9.1 under standard optimization regimes. The yellow sections are loop structures. Fig. 6(b) shows the same source code compiled under a more aggressive inlining regime, in which the compiler unrolled the loops into an assortment of switch statements, vastly increasing the number of basic blocks.

4.3. Example of PLR mechanism and effect

The HOT mechanism inducing the structure of the executable's callgraph can be formulated as a Probability, Loss, Resource (PLR) problem, which in its simplest form can be viewed as a generalization of Shannon source coding for data compression [32]. The reader is referred to [14] for details; I just give a sketch of the general formulation and a motivating example:

min J                                  (1)

subject to  Σ_i r_i ≤ R                (2)

where

J = Σ_i p_i l_i                        (3)

l_i = f(r_i)                           (4)

1 ≤ i ≤ N                              (5)

We have a set of N events (Eq. 5), occurring i.i.d. with probability p_i and incurring loss l_i (Eq. 3); the sum-product of these is our objective function to be minimized (Eq. 1). Resources r_i are hedged against losses l_i (Eq. 4), subject to the resource bound R (Eq. 2).
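A minimal numeric sketch of the PLR program in Eqs. (1)-(5), instantiated with the generalized source-coding loss l_i = -log(r_i) (one standard instantiation from the HOT/source-coding literature); the probabilities, budget and solver choice below are illustrative assumptions, not taken from the paper.

import numpy as np
from scipy.optimize import minimize

p = np.array([0.5, 0.25, 0.125, 0.125])    # event probabilities p_i (Eq. 5, iid events)
R = 1.0                                     # total resource budget (Eq. 2)

def expected_loss(r):
    return np.sum(p * (-np.log(r)))         # J = sum_i p_i l_i with l_i = -log r_i (Eqs. 1, 3, 4)

cons = ({"type": "ineq", "fun": lambda r: R - np.sum(r)},)   # sum_i r_i <= R
bnds = [(1e-6, None)] * len(p)                               # r_i > 0

res = minimize(expected_loss, x0=np.full(len(p), R / len(p)),
               bounds=bnds, constraints=cons)
print(res.x)   # optimum allocates resources proportionally to the probabilities: r_i = p_i * R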

I will demonstrate the applicability of this PLR model with the following short C program, adapted from [21]:

#include <stdlib.h>
#include <stdio.h>
#include <string.h>

int provePequalsNP() {
    /* Next paper.. */
}

int bof() {
    char buffer[8];           /* an 8 byte character buffer */
    strcpy(buffer, gets());   /* get input from the user */
    /* may not return if buffer overflowed */
    return 42;
}

int main(int argc, char **argv) {
    bof();                    /* call bof() function */
    /* execution may never reach the next
       function because of the overflow */
    provePequalsNP();
    return 1000000;           /* exit with Clay prize */
}


I assume here that the uncertain, probabilistic environment is just the user. She is asked for input in gets(); this represents the event. In the C code, the human designer specified an 8 byte buffer (char buffer[8]), and the compiler dutifully allocates the minimum buffer needed for 8 bytes (this is the resource r). The constrained resource r is the variable buffer. The loss associated with the user input event is really a step function: as long as the user satisfies the assumption of the designer, the 'loss' is constant, and can be seen (simplified) as just the 'normal' loss incurred in proper continuation of control flow. Put differently, as long as user input is ≤ 8 bytes, the resource r is minimally sufficient to ensure normal control flow continuation. If, however, the user decides to input 'Superfragilisticexpialidocious' (which was implicitly assumed to be an unlikely/impossible event by the human designer in the code declaration), the loss l takes a huge jump: a catastrophic loss ensues, since strcpy(buffer, gets()) overflows buffer. The improbable event breaches the resource, and now control flow may be rerouted, the process crashed, or shellcode executed via a stack overflow (or, in our example, fame remains elusive). This is a classic buffer overflow attack and the essence of hacking in general: violating assumptions by 'breaking through' the associated resource allocated explicitly (input validation) and implicitly (race condition attacks, for instance) by the programmer and the compiler.

What could have prevented this catastrophic loss? A type-safe language such as Java or C# rather than C, more resources in terms of buffer space, and more code in terms of bounds checking on the human designer's side would theoretically have worked. In practice, for a variety of reasons, programmers write unsafe, buggy code. Recently, compiler guard techniques [13] have been developed to make these types of system perturbation attacks against allocated resources harder to execute or easier to detect; in turn, attacks against these compiler guard techniques have been developed [2].

5. Conclusion

I started by analyzing the callgraph structure of 120 malicious and 280 non-malicious executables, extracting descriptive graph metrics to assess whether statistically relevant differences could be found. Malware tends to have a lower basic block count, implying a simpler structure (less interaction, fewer branches, limited functionality). The metrics under investigation were fitted relatively successfully to a Pareto model. I claim that the process inducing such a graph structure fits the mechanism of a Probability, Loss, Resource optimization process, a mechanism based on objective function tradeoffs and resource constraints, shown in other settings to produce power laws as well. In the case of the callgraph, the primary optimizer is the human designer, although under aggressive optimization regimes the compiler will alter the callgraph as well. It has been suggested that these optimization processes are the norm, even the driving force, for various physical, biological, ecological and engineered systems [18][50]. I share this particular outlook. An executable may be yet another example of engineered complexity that is robust towards known uncertainties, yet fragile towards perturbations that are not anticipated.

Appendix

The goodware samples were indexed, collected and their metadata identified using Index Your Files - Revolution! 3.1, Advanced Data Catalog 1.51 and PEiD 0.94, all freely available from www.softpedia.com. The executables' callgraph generation and structural identification was done with IDA Pro 5 and a pre-release of BinNavi 1.2, commercially available at www.datarescue.com and www.sabre-security.com, respectively. Programming was done with Python 2.4, freely available at www.python.org. Graphs were generated with, and some analytical tools provided by, Matlab 7.3, commercially available at www.matlab.com.

I would like to thank Thomas Dullien (SABRE GmbH), without whose superb tools and contributions this paper could not have been written; he serves as the inspiration for this line of research. Furthermore, I would like to thank Walter Willinger (AT&T Research), Ero Carrera (SABRE), Jason Geffner (Microsoft Research), Scott Anderson, Franklyn Turbak, Mark Sheldon, Randy Shull, Frederic Moisy (Université Paris-Sud 11), Mukhtar Ullah (Universität Rostock), and David Alderson (Naval Postgraduate School) for their helpful comments, suggestions, and explanations.


Fig. 7. The degree sequence in e), following a power law, is identical for all graphs a)-d). (Panels a)-d) show networks annotated with link speed and router speed in Gbps; panel e) plots node degree vs. node rank.)

References

[1] L. Adamic and B. Huberman. Zipf's law and the internet. Glottometrics, 3:143-150, 2002.
[2] S. Alexander. Defeating compiler-level buffer overflow protection. ;login:, 30(3):59-71, June 2005.
[3] A. L. Barabasi. Mean field theory for scale-free random networks. Physica A: Statistical Mechanics and its Applications, 272:173-187, Oct 1999, cond-mat/9907068.
[4] U. Bayer, C. Kruegel, and E. Kirda. TTAnalyze: A tool for analyzing malware. In EICAR '06: Proceedings of the 15th Annual Conference of the European Institute for Computer Antivirus Research, Hamburg (Germany), April 2006.
[5] D. Bilar. Fingerprinting malicious code through statistical opcode analysis. In ICGeS '07: Third International Conference on Global E-Security, London (UK), April 2007. (Submitted).
[6] N. Boccara. Modeling Complex Systems (Graduate Texts in Contemporary Physics), page 325. In [7], 1st edition, November 2004.
[7] N. Boccara. Modeling Complex Systems (Graduate Texts in Contemporary Physics). Springer, 1st edition, November 2004.
[8] J. M. Carlson and J. Doyle. Highly optimized tolerance: A mechanism for power laws in designed systems. Physical Review E, 60(2):1412+, 1999.
[9] J. M. Carlson and J. Doyle. Complexity and robustness. Proceedings of the National Academy of Sciences, 99 Suppl 1:2538-2545, February 2002.
[10] A. Chatzigeorgiou, N. Tsantalis, and G. Stephanides. Application of graph theory to OO software engineering. In WISER '06: Proceedings of the 2006 International Workshop on Interdisciplinary Software Engineering Research, pages 29-36, New York, NY, USA, 2006. ACM Press.
[11] M. Christodorescu and S. Jha. Static analysis of executables to detect malicious patterns. In Security '03: Proceedings of the 12th USENIX Security Symposium, pages 169-186. USENIX Association, August 2003.
[12] A. Clementi. Anti-virus comparative no. 11. Technical report, Kompetenzzentrum IT, Innsbruck (Austria), August 2006. http://www.av-comparatives.org/seiten/ergebnisse/report11.pdf.
[13] C. Cowan, C. Pu, D. Maier, J. Walpole, P. Bakke, S. Beattie, A. Grier, P. Wagle, Q. Zhang, and H. Hinton. StackGuard: Automatic adaptive detection and prevention of buffer-overflow attacks. In Proc. 7th USENIX Security Conference, pages 63-78, San Antonio, Texas, January 1998.
[14] J. Doyle and J. M. Carlson. Power laws, highly optimized tolerance, and generalized source coding. Physical Review Letters, 84(24):5656-5659, June 2000.
[15] J. C. Doyle, D. L. Alderson, L. Li, S. Low, M. Roughan, S. Shalunov, R. Tanaka, and W. Willinger. The "robust yet fragile" nature of the Internet. Proceedings of the National Academy of Sciences, 102(41):14497-14502, 2005.
[16] T. Dullien. BinNavi v1.2. http://www.sabre-security.com/products/binnavi.html, 2006.
[17] T. Dullien and R. Rolles. Graph-based comparison of executable objects. In SSTIC '05: Symposium sur la Sécurité des Technologies de l'Information et des Communications, Rennes, France, June 2005.
[18] I. Ekeland. The Best of All Possible Worlds: Mathematics and Destiny. University of Chicago Press, October 2006.
[19] Z. Fan. Estimation problems for distributions with heavy tails. PhD thesis, Georg-August-Universität zu Göttingen, 2001.
[20] H. Flake. Compare, Port, Navigate. Black Hat Europe 2005 Briefings and Training, March 2005.
[21] J. C. Foster, V. Osipov, N. Bhalla, and N. Heinen. Buffer Overflow Attacks. Syngress, 2005.


[22] E. Gamma, R. Helm, R. Johnson, and J. Vlissides. Design patterns: Abstraction and reuse of object-oriented design. Lecture Notes in Computer Science, 707:406-431, 1993.
[23] M. L. Goldstein, S. A. Morris, and G. G. Yen. Problems with fitting to the power-law distribution. European Journal of Physics B, 41(2):255-258, September 2004, cond-mat/0402322.
[24] I. Guilfanov. IDA Pro v5.0.0.879. http://www.datarescue.com/idabase/, 2006.
[25] M. Haneda, P. M. W. Knijnenburg, and H. A. G. Wijshoff. Optimizing general purpose compiler optimization. In CF '05: Proceedings of the 2nd Conference on Computing Frontiers, pages 180-188, New York, NY, USA, 2005. ACM Press.
[26] J. W. Kirchner. Statistical inevitability of Horton's laws and the apparent randomness of stream channel networks. Geology, 21:591-594, 1993.
[27] C. Krugel, E. Kirda, D. Mutz, W. Robertson, and G. Vigna. Polymorphic worm detection using structural information of executables. In Valdes and Zamboni [55], pages 207-226.
[28] J. Lakos. Large-Scale C++ Software Design. Addison Wesley Longman Publishing Co., Inc., Redwood City, CA, USA, 1996.
[29] L. Li, D. Alderson, W. Willinger, and J. Doyle. A first-principles approach to understanding the Internet's router-level topology. In SIGCOMM '04: Proceedings of the 2004 Conference on Applications, Technologies, Architectures, and Protocols for Computer Communications, pages 3-14, New York, NY, USA, 2004. ACM Press.
[30] W.-J. Li, K. Wang, S. Stolfo, and B. Herzog. Fileprints: Identifying file types by n-gram analysis. In SMC '05: Proceedings of the Sixth Annual IEEE Information Assurance Workshop on Systems, Man and Cybernetics, pages 64-71, West Point (NY), June 2005.
[31] E. Limpert, W. A. Stahel, and M. Abbt. Log-normal distributions across the sciences: Keys and clues. BioScience, 51(5):341-352, May 2001.
[32] M. Manning, J. M. Carlson, and J. Doyle. Highly optimized tolerance and power laws in dense and sparse resource regimes. Physical Review E (Statistical, Nonlinear, and Soft Matter Physics), 72(1):016108-016125, July 2005, physics/0504136.
[33] B. P. Miller, G. Cooksey, and F. Moore. An empirical study of the robustness of MacOS applications using random testing. In RT '06: Proceedings of the 1st International Workshop on Random Testing, pages 46-54, New York, NY, USA, 2006. ACM Press.
[34] B. P. Miller, L. Fredriksen, and B. So. An empirical study of the reliability of UNIX utilities. Communications of the ACM, 33(12):32-44, 1990.
[35] R. Milo, S. Shen-Orr, S. Itzkovitz, N. Kashtan, D. Chklovskii, and U. Alon. Network motifs: Simple building blocks of complex networks. Science, 298(5594):824-827, 2002.
[36] M. Guirguis, A. Bestavros, and I. Matta. Reduction of quality (RoQ) attacks on dynamic load balancers: Vulnerability assessment and design tradeoffs. In Infocom '07: Proceedings of the 26th IEEE International Conference on Computer Communication, Anchorage (AK), May 2007 (to appear).
[37] M. Guirguis, A. Bestavros, I. Matta, and Y. Zhang. Adversarial exploits of end-systems adaptation dynamics. Journal of Parallel and Distributed Computing, 2007 (to appear).
[38] M. Mitzenmacher. Dynamic models for file sizes and double Pareto distributions. Internet Mathematics, 1(3):305-334, 2004.
[39] S. S. Muchnick. Advanced Compiler Design and Implementation, pages 326-327. In [40], 1998.
[40] S. S. Muchnick. Advanced Compiler Design and Implementation. Morgan Kaufmann Publishers Inc., San Francisco (CA), 1998.
[41] C. Myers. Software systems as complex networks: Structure, function, and evolvability of software collaboration graphs. Physical Review E (Statistical, Nonlinear, and Soft Matter Physics), 68(4):046116, 2003.
[42] M. Newman. Power laws, Pareto distributions and Zipf's law. Contemporary Physics, 46(5):323-351, September 2005.
[43] M. Newman, A.-L. Barabasi, and D. J. Watts. The Structure and Dynamics of Networks (Princeton Studies in Complexity). Princeton University Press, April 2006.
[44] M. E. J. Newman. The structure and function of complex networks. SIAM Review, 45:167, 2003.
[45] A. Potanin, J. Noble, M. Frean, and R. Biddle. Scale-free geometry in OO programs. Communications of the ACM, 48(5):99-103, 2005.
[46] N. Przulj. Biological network comparison using graphlet degree distribution. In Proceedings of the 2006 European Conference on Computational Biology, ECCB '06, Oxford, UK, 2006. Oxford University Press.
[47] S. Resnick. Heavy tail modeling and teletraffic data. Annals of Statistics, 25(5):1805-1869, 1997.
[48] C. Ries. Automated identification of malicious code variants. BA CS Honors Thesis, May 2005.
[49] K. Rozinov. Efficient static analysis of executables for detecting malicious behaviour. MSc Honors Thesis, May 2005.
[50] E. D. Schneider and D. Sagan. Into the Cool: Energy Flow, Thermodynamics, and Life. University of Chicago Press, June 2005.
[51] E. Skoudis and L. Zeltser. Malware: Fighting Malicious Code. Prentice Hall PTR, Upper Saddle River (NJ), 2003.
[52] P. Szor. The Art of Computer Virus Research and Defense, pages 252-293. In [53], February 2005.
[53] P. Szor. The Art of Computer Virus Research and Defense. Addison-Wesley Professional, Upper Saddle River (NJ), February 2005.


[54] P. Szor and P. Ferrie. Hunting for metamorphic. In VB '01: Proceedings of the 11th Virus Bulletin Conference, September 2001.
[55] A. Valdes and D. Zamboni, editors. Recent Advances in Intrusion Detection, 8th International Symposium, RAID 2005, Seattle, WA, USA, September 7-9, 2005, Revised Papers, volume 3858 of Lecture Notes in Computer Science. Springer, 2006.
[56] S. Valverde, R. Ferrer Cancho, and R. V. Sole. Scale-free networks from optimal design. Europhysics Letters, 60:512-517, November 2002, cond-mat/0204344.
[57] S. Valverde and R. V. Sole. Logarithmic growth dynamics in software networks. Europhysics Letters, 72:5-12, November 2005, physics/0511064.
[58] M. Weber, M. Schmid, M. Schatz, and D. Geyer. A toolkit for detecting and analyzing malicious software. In ACSAC '02: Proceedings of the 18th Annual Computer Security Applications Conference, Washington (DC), 2002.
[59] J. Whittaker and H. Thompson. How to Break Software Security. Addison Wesley (Pearson Education), June 2003.
[60] W. Willinger, D. Alderson, J. C. Doyle, and L. Li. More normal than normal: Scaling distributions and complex systems. In WSC '04: Proceedings of the 36th Conference on Winter Simulation, pages 130-141. Winter Simulation Conference, 2004.
[61] G. T. Wu, S. L. Twomey, and R. E. Thiers. Statistical evaluation of method-comparison data. Clinical Chemistry, 21(3):315-320, March 1975.

