Post on 19-Apr-2021

Neural Reverse Engineering of Stripped Binaries

Yaniv David, Uri Alon, Eran Yahav
Technion, Israel

Reverse Engineering (RE) Binaries: What, Why & How?

RE – What & Why?

Malware?

Bug? find & fix it

RE – How? Disassemblers

call getaddrinfo
mov rax, [rbp-30h]
mov rdx, [rbp-50h]
mov rdx, cs:688588d
mov [rax], rdx
mov rax, [rbp-30h]
mov rdx, [rbp-580h]
mov [rax+8], rdx
mov rax, [rbp-30h]
call strerror
sub rdx, rax
idiv [rbp-28h]
call setsockopt
mov rdx, [rax]
mov eax, [rbx+40h]
cdqe

No Names

No Types

RE – How? Modern Disassemblers

Where to start?

Progress in Other Domains

Progress in the Source Code Domain

http://jsnice.org

https://code2vec.org - code2vec: Learning Distributed Representations of Code

Un-Stripping Procedure Names

Start at the right place

Translate: Assembly Procedure → English

Sequence-To-Sequence (seq2seq) Models

• A basic approach:
  • LSTM encoder
  • LSTM decoder

Example pair: "how are you" / "cómo estás"

• LSTM with attention & Transformers are state of the art for seq2seq tasks (machine translation, speech recognition, etc.)

Binary Syntax Is Very Local

call getaddrinfo
mov rax, [rbp-30h]
mov rdx, [rbp-50h]
mov rdx, cs:688588d
mov [rax], rdx
mov rax, [rbp-30h]
mov rdx, [rbp-580h]
mov [rax+8], rdx
mov rax, [rbp-30h]
call strerror
sub rdx, rax
idiv [rbp-28h]
call setsockopt
mov rdx, [rax]
mov eax, [rbx+40h]
cdqe

• Global offsets are local to the executable

• Register allocation is local to the instruction / basic block (BB)

• Stack offsets are local to the procedure

Finding Prediction Anchors

call getaddrinfo
mov rdx, cs:qword_68858
mov rax, [rbp-30h]
mov rdx, [rbp-50h]
mov [rax], rdx
mov rax, [rbp-30h]
mov rdx, [rbp-580h]
mov [rax+8], rdx
mov rax, [rbp-30h]
call strerror
sub rdx, rax
idiv [rbp-28h]
call setsockopt
mov rdx, [rax]
mov eax, [rbp+var3C]
cdqe

call getaddrinfo…

call strerror…

call setsockopt…

Not enough data and context

Focus On Calls

Combine binary program analysis with machine learning to find a sweet-spot
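As a rough illustration of "Focus On Calls", the sketch below keeps only the call targets of a disassembly listing and discards everything else. The function name `extract_calls` and the one-instruction-per-line text format are assumptions made for this example, not part of Nero.

```python
def extract_calls(listing):
    """Return the targets of all `call` instructions, in listing order."""
    targets = []
    for line in listing.splitlines():
        parts = line.strip().split(None, 1)
        # Compare only the mnemonic case-insensitively; keep the target as written.
        if len(parts) == 2 and parts[0].lower() == "call":
            targets.append(parts[1].strip())
    return targets


# Shortened version of the listing on the slide.
LISTING = """\
call getaddrinfo
mov rdx, cs:qword_68858
mov rax, [rbp-30h]
call strerror
sub rdx, rax
idiv [rbp-28h]
call setsockopt
mov rdx, [rax]
cdqe
"""

anchors = extract_calls(LISTING)
print(anchors)
```

The bare call names are stable across compilations, but as the slide notes they carry little context on their own, which motivates the augmentation that follows.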

Augmented Call Sites as Learning Features

Using API Calls

…call getaddrinfo

…call strerror

…call setsockopt

…setsockopt(rdi,rsi,rdx,rcx,r8)

API calls → Reconstructed API call sites (using calling conventions + library information)
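The reconstruction step can be sketched as follows, assuming the x86-64 System V calling convention (the register names on the slides match it) and a small, hypothetical `API_ARITY` table standing in for real library information. Neither the table nor `reconstruct_call_site` comes from the talk.

```python
# System V AMD64 passes the first six integer/pointer arguments in these registers.
SYSV_ARG_REGS = ("rdi", "rsi", "rdx", "rcx", "r8", "r9")

# Hypothetical library information: parameter count per imported procedure.
API_ARITY = {"getaddrinfo": 4, "setsockopt": 5, "socket": 3, "strerror": 1}


def reconstruct_call_site(target):
    """Render `target` with one argument register per known parameter."""
    arity = API_ARITY.get(target)
    if arity is None:
        return f"{target}(…)"  # no library information: leave arguments unknown
    if arity > len(SYSV_ARG_REGS):
        raise ValueError("stack-passed arguments are not modelled in this sketch")
    return f"{target}({','.join(SYSV_ARG_REGS[:arity])})"


print(reconstruct_call_site("setsockopt"))
```

A real implementation would also need to handle floating-point and stack-passed arguments, variadic procedures, and other calling conventions.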

Augmenting Call Sites

setsockopt(rdi,rsi,rdx,rcx,r8)

call socket(...)
mov [rbp-58h], rax
mov rax, [rbp-58h]
mov rdi, rax

mov rsi, 1

mov r8, 4

In C: setsockopt(sock_var, 1, …, …, 4)

Augmenting Call Sites

Using concrete or abstracted values:

1. Concrete value (Integer, Enum, String)

2. ARG – procedure argument

3. GLOBAL – pointer to a global variable

4. RET – a return value from a call

5. STACK – pointer to stack memory

(Ordered from most to least informative.)

Pointer-Aware Slicing of Call Site Args

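getaddrinfo(rdi,rsi,rdx,rcx)

[Figure: backward slice of the first argument (rdi), starting at the call site and visiting
  mov rdi, rax
  mov rax, [rbp-68h]
  mov [rbp-68h], rdi
Each step is annotated with the tracked value sets V(·) and pointer sets P(·): V(rdi), P([rdi]), V(rax), P([rax]), V(rbp), P([rbp-68h]); ∅ marks an empty set.]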

Augmenting Call Site Arguments

getaddrinfo(rdi,rsi,rdx,rcx)

[Figure: the same slice (mov rdi, rax; mov rax, [rbp-68h]; mov [rbp-68h], rdi) with each tracked location replaced by its abstract value: STACK, ARG, ARG | ∅, STACK | ARG.]

getaddrinfo(ARG,rsi,rdx,rcx)

[Figure: after slicing, the first argument (rdi, set by mov rdi, rax) resolves to ARG; the intermediate sets are STACK, ARG | ∅ and STACK | ARG.]
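To make the argument abstraction concrete, here is a deliberately simplified sketch. It walks a straight-line instruction list backward and maps a register to a concrete value, ARG, RET, STACK, GLOBAL or ∅. It is not the pointer-aware slicing shown on the slides: it matches memory operands by exact text, handles only `mov`, `lea` and `call`, and assumes the System V AMD64 convention. The tuple encoding and the name `abstract_arg` are assumptions for illustration.

```python
SYSV_ARG_REGS = ("rdi", "rsi", "rdx", "rcx", "r8", "r9")
CALLER_SAVED = set(SYSV_ARG_REGS) | {"rax", "r10", "r11"}


def abstract_arg(instrs, reg):
    """Abstract the value held in `reg` after executing `instrs` (a list of (op, dst, src))."""
    loc = reg  # the location whose origin we are currently chasing
    for op, dst, src in reversed(instrs):
        if op == "call":
            if loc == "rax":
                return "RET"   # value is the result of this call
            if loc in CALLER_SAVED:
                return "∅"     # register was clobbered by the call: unknown
            continue           # memory operands are assumed to survive the call
        if dst != loc:
            continue
        if op == "lea":
            if src.startswith(("[rbp", "[rsp")):
                return "STACK"
            return "GLOBAL" if src.startswith("[rip") else "∅"
        if op == "mov":
            if isinstance(src, int):
                return str(src)  # concrete immediate
            loc = src            # keep following the data flow backward
        else:
            return "∅"
    # Reached the procedure entry: an argument register still holds a procedure argument.
    return "ARG" if loc in SYSV_ARG_REGS else "∅"


# The getaddrinfo example: rdi is spilled to the stack, reloaded, and passed on.
GETADDRINFO_SLICE = [
    ("mov", "[rbp-68h]", "rdi"),
    ("mov", "rax", "[rbp-68h]"),
    ("mov", "rdi", "rax"),
]

# The setsockopt example: rdi is the return value of socket(...).
SETSOCKOPT_SLICE = [
    ("call", "socket", None),
    ("mov", "[rbp-58h]", "rax"),
    ("mov", "rax", "[rbp-58h]"),
    ("mov", "rdi", "rax"),
    ("mov", "rsi", 1),
    ("mov", "r8", 4),
]

print(abstract_arg(GETADDRINFO_SLICE, "rdi"))
print([abstract_arg(SETSOCKOPT_SLICE, r) for r in ("rdi", "rsi", "r8")])
```

Unlike the analysis in the talk, this sketch returns a single abstract value rather than a set such as STACK | ARG, and it does not follow pointers through memory.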

Augmented Control Flow Graph

Before augmentation (CFG nodes):
…call …
…call socket
…call printf
…call setsockopt
…call close
…call printf

After augmentation (ACFG nodes):
setsockopt(RET,0,10,STK,4)
socket(2,1,0)
printf(GLOBAL,…)
close(…)
printf(GLOBAL,…)
...

Useful for training seq2seq or GNN models

Extracting Paths From the ACFG

Extract simple paths (no loops)

setsockopt(RET,0,10,STK,4)

socket(2,1,0)

printf(GLOBAL,…)

close(…) ...

printf(GLOBAL,…)

setsockopt(RET,1,2,STK,4)

getaddrinfo(ARG,ARG,STK,STK)

socket(…)

bind(…)

listen(…)

memset(STK,0,48)
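Extracting simple paths can be sketched as a depth-first walk that never revisits a node. The graph below reuses node labels from the slides, but its edges are hypothetical (the transcript does not preserve them), and `simple_paths` is an illustrative name rather than Nero's implementation.

```python
def simple_paths(succ, entry):
    """Enumerate loop-free paths from `entry`; a path ends when no unvisited successor remains."""
    paths = []

    def dfs(node, path):
        path = path + [node]
        unvisited = [s for s in succ.get(node, []) if s not in path]
        if not unvisited:
            paths.append(path)
        for s in unvisited:
            dfs(s, path)

    dfs(entry, [])
    return paths


# Node labels taken from the slides; the edges are an assumption for this example.
LABELS = {
    0: "socket(2,1,0)",
    1: "setsockopt(RET,0,10,STK,4)",
    2: "printf(GLOBAL,…)",
    3: "close(…)",
}
SUCC = {0: [1, 2], 1: [3], 2: [3], 3: [0]}  # 3 -> 0 is a back edge (a loop)

for path in simple_paths(SUCC, 0):
    print(" -> ".join(LABELS[n] for n in path))
```

The number of simple paths can grow exponentially with the size of the graph, so a practical extractor would likely cap the number or length of paths it keeps.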

Our Approach: [Set-Of-Seq]-To-Seq

setsockopt(RET,1,2,STK,4)

getaddrinfo(ARG,ARG,STK,STK)

socket(…)

bind(…)

listen(…)

memset(STK,0,48)

Output: create server socket

Evaluation

Implementation: Nero

Evaluation Corpus

• GNU software repository

• Remove duplications

• 67,246 labeled procedures

• Two settings: Strip, and Strip & Obfuscate APIs

• 8:1:1 package-based split
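A package-based split keeps all procedures of one package in the same partition, so near-identical code from a single project cannot leak from training into testing. The sketch below assumes the 8:1:1 ratio means train:validation:test and applies it to packages; the slides do not say whether the ratio counts packages or procedures, and `split_by_package` is a hypothetical helper.

```python
import random


def split_by_package(procedures, ratios=(8, 1, 1), seed=0):
    """Split (package, procedure) pairs so that each package lands in exactly one partition."""
    packages = sorted({pkg for pkg, _ in procedures})
    random.Random(seed).shuffle(packages)  # seeded for reproducibility

    total = sum(ratios)
    cut_train = len(packages) * ratios[0] // total
    cut_valid = len(packages) * (ratios[0] + ratios[1]) // total

    assignment = {}
    for i, pkg in enumerate(packages):
        if i < cut_train:
            assignment[pkg] = "train"
        elif i < cut_valid:
            assignment[pkg] = "validation"
        else:
            assignment[pkg] = "test"

    splits = {"train": [], "validation": [], "test": []}
    for pkg, name in procedures:
        splits[assignment[pkg]].append((pkg, name))
    return splits


# Toy corpus: 10 packages with 2 procedures each.
corpus = [(f"pkg{i}", f"proc{j}") for i in range(10) for j in range(2)]
splits = split_by_package(corpus)
print({name: len(items) for name, items in splits.items()})
```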

Executable Obfuscation Types

• String encoding/encryption

• Code obfuscations (opaque predicates, etc.)

• Commercial (known) / home-made packers

• Header manipulation => API calls not visible

Simulating Header Manipulation

• Zeroing '.dynstr' removes imported libraries & procedure names
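One way such zeroing could be simulated is sketched below: parse the section headers of a 64-bit little-endian ELF image with `struct`, find the section named `.dynstr`, and overwrite its bytes with zeros. This is an assumption about how the simulation might be done, not the authors' tooling; `zero_section` is a hypothetical helper, and it ignores corner cases such as extended section indices.

```python
import struct


def zero_section(data, name=b".dynstr"):
    """Return a copy of a 64-bit little-endian ELF image with section `name` zero-filled."""
    buf = bytearray(data)
    if buf[:4] != b"\x7fELF" or buf[4] != 2 or buf[5] != 1:
        raise ValueError("expected a 64-bit little-endian ELF image")

    # ELF64 header fields: e_shoff at 0x28; e_shentsize, e_shnum, e_shstrndx from 0x3A.
    (e_shoff,) = struct.unpack_from("<Q", buf, 0x28)
    e_shentsize, e_shnum, e_shstrndx = struct.unpack_from("<HHH", buf, 0x3A)

    def section_header(index):
        base = e_shoff + index * e_shentsize
        (sh_name,) = struct.unpack_from("<I", buf, base)                 # offset into .shstrtab
        sh_offset, sh_size = struct.unpack_from("<QQ", buf, base + 0x18)  # file offset and size
        return sh_name, sh_offset, sh_size

    _, strtab_offset, _ = section_header(e_shstrndx)
    for index in range(e_shnum):
        sh_name, sh_offset, sh_size = section_header(index)
        start = strtab_offset + sh_name
        end = buf.index(b"\0", start)  # section names are NUL-terminated
        if bytes(buf[start:end]) == name:
            buf[sh_offset:sh_offset + sh_size] = bytes(sh_size)
            return bytes(buf)
    raise KeyError(f"section {name!r} not found")
```

Applied to the bytes of a dynamically linked executable, the result has the same length but an all-zero '.dynstr', so a disassembler can no longer resolve imported names from it. Such a file is meant for analysis experiments only, since the dynamic loader needs that table to run it.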

Evaluation Results

                          Stripped                Stripped & Obfuscated API Calls
Model                     Prec    Rec     F1      Prec    Rec     F1
LSTM-text                 22.32   21.16   21.72   15.46   14.00   14.70
Transformer-text          25.45   15.97   19.64   18.41   12.24   14.70
Debin [He et al. 2018]    34.86   32.54   33.66   32.10   28.76   30.09
Nero-LSTM                 39.94   38.89   39.40   39.12   31.40   34.83
Nero-Transformer          41.54   38.64   40.04   36.50   32.25   34.24

"Debin: Predicting Debug Information in Stripped Binaries", CCS'18

Ablation Study

Components                               Prec    Rec     F1
Only Calls → LSTM                        23.45   24.56   24.04
Augmented Call Sites → LSTM              36.05   31.77   33.77
Paths → Only Calls → LSTM                29.84   24.08   26.65
Paths → Augmented Call Sites → LSTM      39.94   38.89   39.40

Prediction Examples

Model                     Prediction
Ground Truth              read file check new watcher get user groups install signal handlers
Debin [He et al. 2018]    bt open read index display signal setup
LSTM-text                 <unk> check opt close stdin <unk>
Transformer-text          ipmi disable coredump <unk> config file ipmi regfree
Nero-LSTM                 vfs read file check file get ip groups install handlers
Nero-Transformer          read file system list check state get user groups install signal

Qualitative Evaluation

Error Type                          Package     Ground Truth          Predicted Name
Programmers vs. English Language    wget        i18n_initialize       i18n_init
                                    direvent    split_cfg_path        split_config_path
                                    gzip        add_env_opt           add_option
Data Structure Name Missing         gtypist     get_best_speed        get_list_item
                                    wget        ftp_parse_winnt_ls    parse_tree
                                    gzip        abort_gzip_signal     fatal_signal_handler
Verb Replaced                       units       read_units            parse
                                    findutils   share_file_fopen      add_file
                                    mcsim       display_help          show_help

Measured F1 is actually a lower bound
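The lower-bound remark can be illustrated with a small scoring sketch. It assumes precision, recall and F1 are computed over the sub-words of a name (split on underscores, case-insensitive); the slides do not spell out the exact metric, so treat `subtoken_prf` as an illustrative stand-in.

```python
from collections import Counter


def subtokens(name):
    """Split a procedure name into lowercase sub-words."""
    return [t for t in name.lower().split("_") if t]


def subtoken_prf(truth, predicted):
    """Precision, recall and F1 over the multiset of sub-words."""
    t, p = Counter(subtokens(truth)), Counter(subtokens(predicted))
    true_pos = sum((t & p).values())  # multiset intersection
    precision = true_pos / sum(p.values()) if p else 0.0
    recall = true_pos / sum(t.values()) if t else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1


# "cfg" and "config" mean the same to a human reader but do not match as tokens.
print(subtoken_prf("split_cfg_path", "split_config_path"))
print(subtoken_prf("display_help", "show_help"))
```

Under this kind of exact matching, predictions such as split_config_path or show_help are penalized even though a reverse engineer would find them just as useful as the ground truth.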

Takeaway Messages

Use Augmented Call Sites as Learning Features

setsockopt(rdi,rsi,rdx,rcx,r8)

call socket(...)
mov [rbp-58h], rax
mov rax, [rbp-58h]
mov rdi, rax

mov rsi, 1

mov r8, 4

In C: setsockopt(sock_var, 1, …, …, 4)

Translate: Assembly Procedure → English