Where Did This Code Come From?

Page 1: Where Did This Code Come From?

Paradyn Project

Paradyn / Dyninst Week

Madison, Wisconsin

May 2-4, 2011

Where Did This Code Come From?

Nathan Rosenblum

Recovering the Provenance of Program Binaries

Page 2: Where Did This Code Come From?

Who Wrote This Code?

[Figure: a program binary shown as a bit string]

Toolchain Provenance

GCC 4.2.x, unoptimized, C++

Page 3: Where Did This Code Come From?

[Figure: a program binary shown as a bit string]

Toolchain Provenance

ICC 10.x, optimized, C

[mixed]

I C++

Page 4: Where Did This Code Come From?

Why provenance?

Debugging remote deployments
  compiler bug?
  subtle incompatibility?

Forensics
  reverse engineering
  tools, obfuscations?
  decompiling

[Figure: shared-library dependencies of a binary: linux-vdso.so.1, libpthread.so.0, libasound.so.2, libdl.so.2, libstdc++.so.6, libm.so.6, libgcc_s.so.1, libc.so.6, /lib64/ld-linux-x86-64.so.2, librt.so.1]

Page 5: Where Did This Code Come From?

Outline

PROVENANCE STUDIES

SYSTEM DESIGN

MODELING PROGRAM PROVENANCE

BINARY CODE ABSTRACTIONS

Page 6: Where Did This Code Come From?

System overview

[Figure: system pipeline, with the components below]

TARGET BINARY (program)

BINARY ANALYSIS TOOL

TRAINING DATA

LEARNING FRAMEWORK

ICC

MSVS

representation + feature selection

discovering evidence of provenance

Page 7: Where Did This Code Come From?

Binary code model

program binary
… 55 89 e5 83 ec 2c 57 56 53 8b 45 0c 8b 00 a3 90 a3 05 08 85 c0 74 2b 83 c4 …

code
⟨ mov [imm], rax ; sub [imm], rax ⟩
⟨ push ebp ; * ; mov esp, ebp ⟩

Control Flow Graph
layout, block contents

Call Graph
fprintf

External Libraries

Page 8: Where Did This Code Come From?

Graphlets

code element nodes (e.g. basic blocks)

typed edges (branch, call, etc.)

node colors

Ex: instruction summary graphlets

Color bit field: 2^14 possible colors

14 instruction categories (e.g. arithmetic, privileged instruction)

sparse in practice
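The color bit field on the graphlets slide can be sketched as one bit per instruction category present in a node, which gives 2^14 possible values for 14 categories. This is a minimal sketch, not the authors' implementation: the slides name only "arithmetic" and "privileged instruction", so the other twelve category names and the helper `block_color` are hypothetical.

```python
# Hypothetical category list: only "arithmetic" and "privileged" come from
# the slides; the remaining names are placeholders to reach 14 categories.
CATEGORIES = [
    "arithmetic", "privileged", "branch", "call", "return", "compare",
    "load", "store", "move", "logic", "shift", "stack", "float", "nop",
]
assert len(CATEGORIES) == 14

# One bit position per category.
BIT = {name: 1 << i for i, name in enumerate(CATEGORIES)}


def block_color(categories_in_block):
    """Return the 14-bit color of a node: a bit is set if the category occurs."""
    color = 0
    for name in categories_in_block:
        color |= BIT[name]
    return color


def color_names(color):
    """Decode a color back into the category names it contains."""
    return [name for name in CATEGORIES if color & BIT[name]]


if __name__ == "__main__":
    c = block_color(["arithmetic", "move", "arithmetic", "call"])
    print(f"{c:014b}", color_names(c))
```

Because a color records only which categories occur, repeated instructions of one category do not change it, which is one reason the observed set of colors can stay sparse.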

Page 9: Where Did This Code Come From?

Modeling approach

some amount of code (basic block, function, whole program)

feature vector

“decompiles to <push ebp,...”

contains 27 occurrences of “[figure]”
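The step from "some amount of code" to a feature vector can be sketched as counting how often each idiom occurs in an instruction sequence, with `*` standing for any single instruction. This is an illustrative sketch only: the instruction strings, the list-of-counts layout, and the function names are assumptions, not the representation used in the talk.

```python
WILDCARD = "*"  # matches exactly one instruction


def matches_at(instrs, start, idiom):
    """True if the idiom matches the instruction list beginning at start."""
    if start + len(idiom) > len(instrs):
        return False
    window = instrs[start:start + len(idiom)]
    return all(pat == WILDCARD or pat == ins for pat, ins in zip(idiom, window))


def count_idiom(instrs, idiom):
    """Number of positions at which the idiom matches."""
    return sum(matches_at(instrs, i, idiom) for i in range(len(instrs)))


def feature_vector(instrs, idioms):
    """One count per idiom, in the order the idioms are given."""
    return [count_idiom(instrs, idiom) for idiom in idioms]


if __name__ == "__main__":
    code = [
        "push ebp", "push edi", "mov esp, ebp", "sub [imm], esp",
        "push ebp", "nop", "mov esp, ebp",
    ]
    idioms = [("push ebp", "*", "mov esp, ebp"), ("sub [imm], esp",)]
    print(feature_vector(code, idioms))
```

The same counting applies whether the unit of code is a basic block, a function, or a whole program; only the length of `instrs` changes.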

Page 10: Where Did This Code Come From?

Compiler toolchain

source language: C++, C, F77

optimized, not optimized

versions: 3.4, 4.2, 4.4; 2003, 2005, 2008; 10, 11

Page 11: Where Did This Code Come From?

Toolchain details [ISSTA 2011]

compiler family: GNU, Intel, Microsoft
source language: C, C++, Fortran
version: [several]
optimization level: low, high

Model as Conditional Random Field

functions
labels: language, family, optimization, version

Instruction sequence features
Summary graphlet features

Page 12: Where Did This Code Come From?

Evaluation

[Table residue. Rows: Language, Compiler, Optimization, Version. Columns: Functions Individually (SVM), Linear CRF. Values in extracted order: .987, .971, .616, .910, .998, .993, .910, .999]

same label likely

statistical dependencies

MSVC code generation changes little between versions

Page 13: Where Did This Code Come From?

Program authorship

for (int i = 0; i < sz; ++i) { // etc

std::vector<int>::iterator it = foo.begin();
while (it != foo.end()) { // etc

I C++

Page 14: Where Did This Code Come From?


Long-range control flow

Summary graphlets

basic blocks

supergraphlets

merged instruction summaries


Page 15: Where Did This Code Come From?

Interprocedural graphlets

Unique “color” for external functions (e.g. FPRINTF, FOPEN)

Anonymous internal functions ([local])
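The coloring rule for interprocedural graphlets can be sketched as a relabeling of call-graph nodes: an external library function keeps its own name as a color, while every internal function collapses to the anonymous color `[local]`. This is a sketch under assumptions: the edge-list representation and the function names `sub_401000` and `sub_401080` are hypothetical.

```python
LOCAL = "[local]"  # shared color for all internal (anonymous) functions


def color_call_graph(call_edges, external):
    """Relabel each (caller, callee) edge with node colors.

    External functions get a unique color (their upper-cased name);
    internal functions all receive the same anonymous color.
    """
    def color(fn):
        return fn.upper() if fn in external else LOCAL

    return [(color(caller), color(callee)) for caller, callee in call_edges]


if __name__ == "__main__":
    edges = [
        ("sub_401000", "fopen"),
        ("sub_401000", "sub_401080"),
        ("sub_401080", "fprintf"),
    ]
    for edge in color_call_graph(edges, external={"fopen", "fprintf"}):
        print(edge)
```

Dropping internal names keeps the features from depending on symbol names, which are usually absent from stripped binaries anyway.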

Page 16: Where Did This Code Come From?

Program-author dataset

Ideal:
1. Author labels
2. Parallel corpus
3. Linguistic homogeneity

Contest programs
  several contest years
  8-16 programs per contestant
  C and C++ programs

Course programs (CS 537)
  C programs
  some provided/template code

Page 17: Where Did This Code Come From?

Author attribution

391,056 N-grams
54,705 idioms
37,358 graphlets
117,997 supergraphlets
8,062 call graphlets
152 library calls

1,900 features

20 programmers

         CJ 2009   CJ 2010   CS 537
Exact    77.8%     76.8%     38.4%
Top-5    94.7%     93.7%     84.3%

1. CS537 has much less data
2. Template code + instructor guidance confound results

Students have less distinctive styles?
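The "Exact" and "Top-5" rows are the usual top-k accuracy measures: a prediction counts as correct if the true author appears among the first k ranked candidates, with k = 1 for "Exact". A minimal sketch of that computation, with made-up author names and rankings rather than the talk's data:

```python
def top_k_accuracy(ranked_predictions, true_labels, k):
    """Fraction of programs whose true author is among the top k candidates.

    ranked_predictions[i] is a list of candidate authors for program i,
    ordered from most to least likely.
    """
    hits = sum(
        1
        for ranked, truth in zip(ranked_predictions, true_labels)
        if truth in ranked[:k]
    )
    return hits / len(true_labels)


if __name__ == "__main__":
    # Illustrative rankings for three programs, all written by "alice".
    ranked = [
        ["alice", "bob", "carol"],
        ["bob", "alice", "carol"],
        ["carol", "bob", "alice"],
    ]
    truth = ["alice", "alice", "alice"]
    for k in (1, 2, 3):
        print(k, top_k_accuracy(ranked, truth, k))
```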

Page 18: Where Did This Code Come From?

Summary


Page 19: Where Did This Code Come From?

questions

Page 20: Where Did This Code Come From?


Backup slides follow

Page 21: Where Did This Code Come From?

Program provenance

System
  glibc static code
  library imports

Link & post-link
  whole-program optimization
  rewriting tools
  obfuscation tools

Compiler
  family
  version
  optimization level
  source language

Authorship

Page 22: Where Did This Code Come From?

Instruction-level features

IDIOMS
  ‹mov [imm], rax ; sub [imm], rax›
  ‹push ebp ; * ; mov esp, ebp›
  single-instruction wildcard
  opcode class abstraction
  hidden immediates

N-GRAMS
  4-grams: <4889c2be>, <018b45f8>
  3-grams: <8d45f8>
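Byte n-gram features like those above can be sketched as a sliding window over the raw code bytes, counting each distinct window. The function name and the hex-string keys are illustrative choices; the sample bytes are taken from the byte sequence shown elsewhere in these slides.

```python
from collections import Counter


def byte_ngrams(code: bytes, n: int) -> Counter:
    """Count every length-n byte window in code, keyed by its hex string."""
    return Counter(code[i:i + n].hex() for i in range(len(code) - n + 1))


if __name__ == "__main__":
    code = bytes.fromhex("5589e583ec2c5756538b450c8b00")
    print(byte_ngrams(code, 3).most_common(3))
    print(byte_ngrams(code, 4).most_common(3))
```

Unlike idioms, n-grams need no disassembly, so they can be extracted even when instruction boundaries are uncertain.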

Page 23: Where Did This Code Come From?

Digression: finding code

program binary
… 55 89 e5 83 ec 2c 57 56 53 8b 45 0c 8b 00 a3 90 a3 05 08 85 c0 74 2b 83 c4 …

code
push ebp
mov esp, ebp
sub 0x2c, esp
...

in 0x83,eax
in [dx],al
sub 0x57,al
...

[AAAI 2008]
Model compiler-specific “function entry points”
Compute max-likelihood labels
F1 from .86 to .99 depending on compiler

Page 24: Where Did This Code Come From?

Byte-sequence model [PASTE 2010]

program binary
… 55 89 e5 83 ec 2c 57 56 53 8b 45 0c 8b 00 a3 90 a3 05 08 85 c0 74 2b 83 c4 …

[Figure: byte ranges labeled GCC, GCC, ICC, ICC]

Compiler labels modeled as CRF...

… y_{i-1}, y_i, …, y_j, y_{j+1} …
sequence labels ∈ {icc, gcc, ..., data}

Page 25: Where Did This Code Come From?

Digression: Conditional Random Fields

[Equation figure with callouts: weights (learned), labels, evidence, feature functions]

Linear chain CRF

exact inference tractable

Idiom feature function

\[ f_u(x_i) = \begin{cases} 1 & \text{if } x_i \text{ decompiles to idiom } u \\ 0 & \text{otherwise} \end{cases} \]
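"Exact inference tractable" for a linear-chain CRF usually refers to dynamic programming such as the Viterbi algorithm, which finds the highest-scoring label sequence in time linear in the sequence length. The sketch below assumes the learned weights and feature functions have already been folded into two score tables, `emission[t][k]` (evidence at position t for label k) and `transition[j][k]` (label j followed by label k). Both tables and all numbers are made up for illustration.

```python
def viterbi(emission, transition):
    """Return the highest-scoring label index sequence for a linear chain.

    emission[t][k]  : score of label k at position t
    transition[j][k]: score of moving from label j to label k
    """
    n_labels = len(transition)
    score = list(emission[0])   # best score of any path ending in each label
    back = []                   # back[t-1][k]: best predecessor of k at t
    for t in range(1, len(emission)):
        new_score, pointers = [], []
        for k in range(n_labels):
            best_j = max(range(n_labels),
                         key=lambda j: score[j] + transition[j][k])
            pointers.append(best_j)
            new_score.append(score[best_j] + transition[best_j][k]
                             + emission[t][k])
        score = new_score
        back.append(pointers)
    # Trace the best path backwards from the best final label.
    path = [max(range(n_labels), key=lambda k: score[k])]
    for pointers in reversed(back):
        path.append(pointers[path[-1]])
    path.reverse()
    return path


if __name__ == "__main__":
    labels = ["gcc", "icc"]
    emission = [[2.0, 0.0], [0.4, 0.6], [2.0, 0.0]]
    same_label_likely = [[1.0, 0.0], [0.0, 1.0]]
    print([labels[k] for k in viterbi(emission, same_label_likely)])
```

With the transition table favoring equal neighboring labels, the weakly "icc" middle position is pulled to "gcc"; with an all-zero transition table each position would be decided on its own evidence.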

Page 26: Where Did This Code Come From?

Byte-sequence model [PASTE 2010]

program binary
… 55 89 e5 83 ec 2c 57 56 53 8b 45 0c 8b 00 a3 90 a3 05 08 85 c0 74 2b 83 c4 …

… y_{i-1}, y_i, …, y_j, y_{j+1} …
sequence labels ∈ {icc, gcc, ..., data}

94% accuracy labeling mixed-compiler sequences

+18% accuracy increase in function finding

Page 27: Where Did This Code Come From?


Feature selection

(condor)

training programs

Page 28: Where Did This Code Come From?


Distance metrics

distance metric

Mahalanobis distance

Equivalently:

How do we get A?
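The slide's equations did not survive extraction. In its standard form, a Mahalanobis-style distance with a positive semi-definite matrix A is d_A(x, y) = sqrt((x - y)^T A (x - y)); when A = L^T L, this equals the ordinary Euclidean distance between Lx and Ly. The sketch below checks that equivalence numerically. It assumes this standard form is what the slide showed, and the matrix L is an arbitrary illustrative choice.

```python
import math


def mat_vec(M, v):
    """Matrix-vector product for a list-of-rows matrix."""
    return [sum(m * x for m, x in zip(row, v)) for row in M]


def gram(L):
    """Return A = L^T L."""
    cols = list(zip(*L))
    return [[sum(a * b for a, b in zip(ci, cj)) for cj in cols] for ci in cols]


def mahalanobis(x, y, A):
    """sqrt((x - y)^T A (x - y)) for a positive semi-definite A."""
    d = [a - b for a, b in zip(x, y)]
    return math.sqrt(sum(di * adi for di, adi in zip(d, mat_vec(A, d))))


def transformed_euclidean(x, y, L):
    """Euclidean distance after mapping both points through L."""
    return math.dist(mat_vec(L, x), mat_vec(L, y))


if __name__ == "__main__":
    L = [[2.0, 0.0], [1.0, 1.0]]
    A = gram(L)
    x, y = [1.0, 2.0], [0.0, 0.0]
    print(A, mahalanobis(x, y, A), transformed_euclidean(x, y, L))
```

Learning A (or equivalently L) from labeled examples is the question the slide ends on.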

Page 29: Where Did This Code Come From?

Style clustering

Programs, no training data

[Figure: a collection of unlabeled program binaries]

Conclude: X

Page 30: Where Did This Code Come From?

Transfer learning

[Figure: program binaries labeled Alice and Bob, alongside binaries from unknown authors (?)]

Large-margin Nearest Neighbors (LMNN)
Weinberger, Saul 2009

semi-definite program ☹
one-time cost ☺
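Once a metric has been learned, using it is cheap: map every feature vector through the learned linear transform and run ordinary nearest-neighbor classification. The sketch below shows only that application step. It does not implement LMNN training, the matrix `L` is hand-picked rather than learned, and the two-dimensional feature vectors and author names are made up.

```python
import math
from collections import Counter


def transform(L, v):
    """Map a feature vector through the linear transform L."""
    return [sum(m * x for m, x in zip(row, v)) for row in L]


def knn_author(query, labeled, L, k=1):
    """Majority author among the k nearest labeled programs under metric L."""
    q = transform(L, query)
    nearest = sorted(
        labeled, key=lambda item: math.dist(q, transform(L, item[0]))
    )[:k]
    votes = Counter(author for _, author in nearest)
    return votes.most_common(1)[0][0]


if __name__ == "__main__":
    labeled = [
        ([0.0, 0.0], "alice"), ([0.2, 5.0], "alice"),
        ([1.0, 0.1], "bob"), ([1.1, 4.0], "bob"),
    ]
    query = [0.9, 4.9]
    identity = [[1.0, 0.0], [0.0, 1.0]]
    # Hypothetical learned transform that down-weights the second feature.
    learned = [[1.0, 0.0], [0.0, 0.1]]
    print(knn_author(query, labeled, identity), knn_author(query, labeled, learned))
```

In this toy data the second feature varies widely within each author, so shrinking it changes which neighbor is closest; that is the kind of effect a learned metric is meant to capture.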

Page 31: Where Did This Code Come From?

Component models

semi-open world provenance
  component sharing (e.g. command and control)
  programmer movement between groups

mixture of styles

style vs. functionality?

infinite mixture models

interpreting style clusters

Page 32: Where Did This Code Come From?


Social code networks

program binaries
