Paradyn Project
Paradyn / Dyninst Week
Madison, Wisconsin
May 2-4, 2011
Where Did This Code Come From?
Nathan Rosenblum
Recovering the Provenance of Program Binaries
Who Wrote This Code?
Toolchain Provenance
GCC 4.2.x, unoptimized, C++
ICC 10.x, optimized, C
[mixed]
I C++
Why provenance?
Debugging remote deployments: compiler bug? subtle incompatibility?
Forensics: reverse engineering; tools, obfuscations?; decompiling
linux-vdso.so.1, libpthread.so.0, libasound.so.2, libdl.so.2, libstdc++.so.6, libm.so.6, libgcc_s.so.1, libc.so.6, /lib64/ld-linux-x86-64.so.2, librt.so.1
Outline
PROVENANCE STUDIES
SYSTEM DESIGN
MODELING PROGRAM PROVENANCE
BINARY CODE ABSTRACTIONS
System overview
[Diagram: TARGET BINARY (program) → BINARY ANALYSIS TOOL; TRAINING DATA (ICC, MSVS) → LEARNING FRAMEWORK]
representation + feature selection
discovering evidence of provenance
Binary code model
program binary: … 55 89 e5 83 ec 2c 57 56 53 8b 45 0c 8b 00 a3 90 a3 05 08 85 c0 74 2b 83 c4 …
code: ⟨ mov [imm], rax ; sub [imm], rax ⟩ ⟨ push ebp ; * ; mov esp, ebp ⟩
Control Flow Graph: layout, block contents
Call Graph and External Libraries (e.g. fprintf)
Graphlets
code element nodes (e.g. basic blocks)
typed edges (branch, call, etc.)
node colors
Ex: instruction summary graphlets
Color bit field over 14 instruction categories (e.g. arithmetic, privileged instruction)
2^14 possible colors, sparse in practice
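A minimal sketch of how a 14-bit node color could be computed, assuming each instruction category owns one bit and a node's color is the OR of the categories it contains. Only "arithmetic" and "privileged" come from the slide; the other twelve category names and the bit order are hypothetical placeholders.

```python
# Hypothetical category names: the slide names only "arithmetic" and
# "privileged instruction"; the other twelve are placeholders.
CATEGORIES = [
    "arithmetic", "privileged", "logic", "move", "stack", "branch", "call",
    "return", "compare", "floating_point", "string", "system", "nop", "other",
]
CATEGORY_BIT = {name: 1 << i for i, name in enumerate(CATEGORIES)}


def node_color(instruction_categories):
    """OR together one bit per instruction category present in the node."""
    color = 0
    for category in instruction_categories:
        color |= CATEGORY_BIT[category]
    return color


# One basic block, summarized by the category of each instruction.
block = ["stack", "move", "arithmetic", "arithmetic", "call"]
color = node_color(block)
print(f"{color:014b}")
```

Because repeated categories set the same bit, the color records which categories occur, not how often, which is one reason the 2^14 color space can stay sparse.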
Modeling approach
some amount of code (basic block, function, whole program) → feature vector
“decompiles to <push ebp,...”
contains 27 occurrences of “”
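The mapping from code to a feature vector can be sketched as a simple count over a fixed vocabulary. This is an illustration only: the feature names, the `feature_vector` helper and the vocabulary below are hypothetical, and the count of 27 just mirrors the number on the slide.

```python
from collections import Counter


def feature_vector(observed_features, vocabulary):
    """Count how often each vocabulary feature occurs in a piece of code."""
    counts = Counter(observed_features)
    return [counts[feature] for feature in vocabulary]


# Hypothetical vocabulary mixing an idiom, a byte n-gram and a library call.
vocabulary = ["idiom:push ebp|*|mov esp,ebp", "ngram:8d45f8", "lib:fprintf"]

# Features observed in some unit of code (basic block, function, or program).
observed = ["idiom:push ebp|*|mov esp,ebp"] * 27 + ["lib:fprintf"] * 2
vector = feature_vector(observed, vocabulary)
print(vector)
```

The same function applies at any granularity; only the span of code from which `observed` is collected changes.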
Compiler toolchain
source language: C++, C, F77
optimized / not optimized
versions: 3.4, 4.2, 4.4; 2003, 2005, 2008; 10, 11
Toolchain details [ISSTA 2011]
compiler family: GNU, Intel, Microsoft
source language: C, C++, Fortran
version: [several]
optimization level: low, high
Model as Conditional Random Field over functions
labels: language, family, optimization, version
features: instruction sequence features, summary graphlet features
Evaluation
Language, Compiler, Optimization, Version
Functions Individually (SVM); Linear CRF
.987 .971 .616 .910 .998 .993 .910 .999
same label likely
statistical dependencies
MSVC code generation changes little between versions
Program authorship
for(int i=0; i<sz;++i){ // etc
std::vector<int>::iterator it = foo.begin();
while(it != foo.end()) { // etc
I C++
Long-range control flow
Summary graphlets (basic blocks) ⇒ supergraphlets (merged instruction summaries)
Interprocedural graphlets
FPRINTF, FOPEN, [local]
Unique “color” for external functions
Anonymous internal functions
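The coloring rule above can be sketched on a toy call graph. This is a simplification under stated assumptions: it counts colored call edges rather than full multi-node graphlets, and the function names, the `external` set and the helper `call_edge_features` are hypothetical.

```python
from collections import Counter


def call_edge_features(call_edges, external):
    """Count caller/callee color pairs in a call graph.

    External functions keep their own name as a unique color; every
    internal function collapses to the anonymous color "[local]".
    """
    def color(function_name):
        return function_name if function_name in external else "[local]"

    return Counter((color(caller), color(callee)) for caller, callee in call_edges)


# Hypothetical call graph: (caller, callee) pairs.
edges = [
    ("main", "parse_args"),
    ("parse_args", "fopen"),
    ("main", "fprintf"),
    ("log_error", "fprintf"),
]
features = call_edge_features(edges, external={"fopen", "fprintf"})
print(features)
```

Anonymizing internal functions keeps the features comparable across programs, since internal names are program-specific (and usually absent from stripped binaries), while library calls are shared.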
Program-author dataset
Ideal: 1. Author labels 2. Parallel corpus 3. Linguistic homogeneity
several contest years: 8-16 programs per contestant, C and C++ programs
(CS 537): C programs, some provided/template code
Author attribution
391,056 N-grams, 54,705 idioms, 37,358 graphlets, 117,997 supergraphlets, 8,062 call graphlets, 152 library calls
1,900 features
20 programmers

         CJ 2009   CJ 2010   CS 537
Exact    77.8%     76.8%     38.4%
Top-5    94.7%     93.7%     84.3%

1. CS537 has much less data
2. Template code + instructor guidance confound results
Students have less distinctive styles?
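To make the "Exact" and "Top-5" rows concrete, here is a minimal sketch of how such accuracies are commonly computed, assuming the classifier returns a ranked list of candidate authors per program. The `top_k_accuracy` helper and the toy rankings are hypothetical and unrelated to the reported numbers.

```python
def top_k_accuracy(ranked_candidates, true_authors, k):
    """Fraction of programs whose true author is among the top k candidates."""
    hits = sum(
        truth in ranked[:k]
        for ranked, truth in zip(ranked_candidates, true_authors)
    )
    return hits / len(true_authors)


# Hypothetical ranked predictions for three programs, best guess first.
rankings = [
    ["alice", "bob", "carol"],
    ["bob", "alice", "carol"],
    ["carol", "bob", "alice"],
]
truth = ["alice", "alice", "alice"]

exact = top_k_accuracy(rankings, truth, k=1)  # "Exact" corresponds to k = 1
top2 = top_k_accuracy(rankings, truth, k=2)
print(exact, top2)
```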
Summary
questions
Backup slides follow
Program provenance
System: glibc static code, library imports
Link & post-link: whole-program optimization, rewriting tools, obfuscation tools
Compiler: family, version, optimization level, source language
Authorship
Instruction-level features
IDIOMS: ‹mov [imm], rax ; sub [imm], rax›, ‹push ebp ; * ; mov esp, ebp›
single-instruction wildcard, opcode class abstraction, hidden immediates
N-GRAMS: 4-grams <4889c2be>, <018b45f8>; 3-grams <8d45f8>
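Both feature families can be sketched in a few lines. This is an illustration under assumptions: n-grams are taken as sliding windows over raw bytes, idioms are matched against instruction strings that have already had immediates abstracted to `[imm]`, and the helper names are hypothetical. The byte string is the example sequence shown elsewhere in these slides.

```python
from collections import Counter


def byte_ngrams(code, n):
    """Count every length-n window of raw bytes, keyed by its hex string."""
    return Counter(code[i:i + n].hex() for i in range(len(code) - n + 1))


def count_idiom(instructions, idiom):
    """Count positions where the idiom matches; "*" matches any one instruction."""
    width = len(idiom)
    return sum(
        all(p == "*" or p == ins for p, ins in zip(idiom, instructions[i:i + width]))
        for i in range(len(instructions) - width + 1)
    )


code = bytes.fromhex(
    "55 89 e5 83 ec 2c 57 56 53 8b 45 0c 8b 00 a3 90 a3 05 08 85 c0 74 2b 83 c4"
)
trigrams = byte_ngrams(code, 3)

# Hypothetical, already-abstracted instruction stream.
instructions = ["push ebp", "push edi", "mov esp, ebp", "sub [imm], esp"]
matches = count_idiom(instructions, ["push ebp", "*", "mov esp, ebp"])
print(trigrams["5589e5"], matches)
```

The wildcard lets one idiom cover a function prologue even when the compiler schedules an unrelated instruction between the two fixed ones.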
Digression: finding code
program binary (code): … 55 89 e5 83 ec 2c 57 56 53 8b 45 0c 8b 00 a3 90 a3 05 08 85 c0 74 2b 83 c4 …
decoded from the first byte: push ebp ; mov esp, ebp ; sub 0x2c, esp ; ...
decoded from a different offset: in 0x83, eax ; in [dx], al ; sub 0x57, al ; ...
[AAAI 2008] Model compiler-specific “function entry points”; compute max-likelihood labels; F1 from .86 to .99 depending on compiler
Byte-sequence model [PASTE 2010]
program binary: GCC, GCC, ICC, ICC
Compiler labels modeled as CRF...
sequence labels y_{i-1}, y_i, …, y_j, y_{j+1} ∈ {icc, gcc, ..., data}
… 55 89 e5 83 ec 2c 57 56 53 8b 45 0c 8b 00 a3 90 a3 05 08 85 c0 74 2b 83 c4 …
Digression: Conditional Random Fields
Linear chain CRF: exact inference tractable
[Equation annotations: labels, evidence, weights (learned), feature functions]
Idiom feature function:
$$f_u(x_i) = \begin{cases} 1 & \text{if } x_i \text{ decompiles to idiom } u \\ 0 & \text{otherwise} \end{cases}$$
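Exact inference in a linear-chain model is typically done with the Viterbi algorithm. The sketch below assumes a score that sums per-position idiom weights and pairwise transition weights; the labels, idiom names and weight values are hypothetical, and learning the weights is out of scope.

```python
def viterbi(observations, labels, emit_score, trans_score):
    """Return the highest-scoring label sequence for a linear chain."""
    # best[i][y]: best score of any labeling of positions 0..i that ends in y
    best = [{y: emit_score(y, observations[0]) for y in labels}]
    back = []
    for x in observations[1:]:
        scores, pointers = {}, {}
        for y in labels:
            prev = max(labels, key=lambda p: best[-1][p] + trans_score(p, y))
            scores[y] = best[-1][prev] + trans_score(prev, y) + emit_score(y, x)
            pointers[y] = prev
        best.append(scores)
        back.append(pointers)
    # Follow back-pointers from the best final label.
    path = [max(labels, key=lambda y: best[-1][y])]
    for pointers in reversed(back):
        path.append(pointers[path[-1]])
    return path[::-1]


# Hypothetical learned weights: idiom "A" suggests gcc, idiom "B" suggests icc.
idiom_weight = {("gcc", "A"): 1.0, ("icc", "B"): 1.0}


def emit(label, idiom):
    return idiom_weight.get((label, idiom), 0.0)


def trans(prev_label, label):
    # Neighboring positions are likely to share a label.
    return 2.0 if prev_label == label else 0.0


labels_out = viterbi(["A", "B", "A", "A"], ["gcc", "icc"], emit, trans)
print(labels_out)
```

In this toy run the single "B" observation is outvoted by the transition weights, so the whole sequence receives one label; that smoothing effect is what the chain structure adds over labeling each position independently.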
Byte-sequence model [PASTE 2010]
94% accuracy labeling mixed-compiler sequences
+18% accuracy increase in function finding
Feature selection
(condor)
training programs
Distance metrics
distance metric
Mahalanobis distance
Equivalently:
How do we get A?
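A minimal sketch of the Mahalanobis distance, assuming the standard parameterization d_A(x, y) = sqrt((x - y)^T A (x - y)) with A positive semi-definite. Under that assumption A can be written as L^T L, and the distance equals the ordinary Euclidean distance after mapping both points through L. The matrix L below is an arbitrary example, not a learned metric.

```python
import math


def matvec(matrix, vector):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(m * v for m, v in zip(row, vector)) for row in matrix]


def mahalanobis(x, y, A):
    """sqrt((x - y)^T A (x - y)) for a positive semi-definite A."""
    diff = [a - b for a, b in zip(x, y)]
    return math.sqrt(sum(d * ad for d, ad in zip(diff, matvec(A, diff))))


def euclidean_after_map(x, y, L):
    """Euclidean distance between L x and L y."""
    return math.dist(matvec(L, x), matvec(L, y))


# Arbitrary example transform; A = L^T L is positive semi-definite by construction.
L = [[2.0, 0.0], [1.0, 1.0]]
n = len(L)
A = [[sum(L[k][i] * L[k][j] for k in range(n)) for j in range(n)] for i in range(n)]

x, y = [1.0, 2.0], [0.0, 0.0]
d_metric = mahalanobis(x, y, A)
d_mapped = euclidean_after_map(x, y, L)
print(d_metric, d_mapped)
```

Seen this way, choosing A amounts to choosing a linear re-weighting of the feature space in which programs by the same author should land close together.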
Style clustering
Programs, no training data
Conclude: X
Transfer learning
Alice Bob ? ?
Large-margin Nearest Neighbors (LMNN), Weinberger, Saul 2009
semi-definite program ☹
one-time cost ☺
Component models
semi-open world provenance
component sharing (e.g. command and control)
programmer movement between groups
mixture of styles
style vs. functionality?
infinite mixture models
interpreting style clusters
Social code networks
program binaries