Neural Reverse Engineering of Stripped Binaries
Yaniv David, Uri Alon, Eran Yahav
Technion, Israel
Reverse Engineering (RE) Binaries: What, Why & How?
RE – What & Why?
Malware?
Bug? find & fix it
RE – How? Disassemblers
4
call getaddrinfomov rax, [rbp-30h]mov rdx, [rbp-50h]mov rdx, cs:688588dmov [rax], rdxmov rax, [rbp-30h]mov rdx, [rbp-580h]mov [rax+8], rdxmov rax, [rbp-30h]call strerrorsub rdx, raxidiv [rbp-28h]call setsockoptmov rdx, [rax]mov eax, [rbx+40h]cdqe
RE – How? Disassemblers
5
call getaddrinfomov rax, [rbp-30h]mov rdx, [rbp-50h]mov rdx, cs:688588dmov [rax], rdxmov rax, [rbp-30h]mov rdx, [rbp-580h]mov [rax+8], rdxmov rax, [rbp-30h]call strerrorsub rdx, raxidiv [rbp-28h]call setsockoptmov rdx, [rax]mov eax, [rbx+40h]cdqe
No Names
No Types
RE – How? Modern Disassemblers
Where to start?
Progress in Other Domains
Progress in the Source Code Domain
https://code2vec.org - code2vec: Learning Distributed Representations of Code
Un-Stripping Procedure Names
Start at the right place
Translate: Assembly Procedure → English
Sequence-To-Sequence (seq2seq) Models
• A basic approach:
  • LSTM encoder
  • LSTM decoder
[Figure: encoder-decoder example translating "cómo estás" into "how are you"]
• LSTM with attention & Transformers are state of the art for seq2seq tasks (machine translation, speech recognition, etc.)
Binary Syntax Is Very Local
call getaddrinfo
mov rax, [rbp-30h]
mov rdx, [rbp-50h]
mov rdx, cs:688588d
mov [rax], rdx
mov rax, [rbp-30h]
mov rdx, [rbp-580h]
mov [rax+8], rdx
mov rax, [rbp-30h]
call strerror
sub rdx, rax
idiv [rbp-28h]
call setsockopt
mov rdx, [rax]
mov eax, [rbx+40h]
cdqe
Global offsets local to executable
Register allocation is local to instruction/BB
Stack offsets local to procedure
Finding Prediction Anchors
call getaddrinfo
mov rdx, cs:qword_68858
mov rax, [rbp-30h]
mov rdx, [rbp-50h]
mov [rax], rdx
mov rax, [rbp-30h]
mov rdx, [rbp-580h]
mov [rax+8], rdx
mov rax, [rbp-30h]
call strerror
sub rdx, rax
idiv [rbp-28h]
call setsockopt
mov rdx, [rax]
mov eax, [rbp+var3C]
cdqe
call getaddrinfo…
call strerror…
call setsockopt…
Not enough data and context
Focus On Calls
Combine binary program analysis with machine learning to find a sweet spot
Augmented Call Sites as Learning Features
Using API Calls
…call getaddrinfo
…call strerror
…call setsockopt
…setsockopt(rdi,rsi,rdx,rcx,r8)
API calls → reconstructed API call sites, using calling conventions + library information
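A minimal sketch of how a bare call might be turned into a reconstructed call site, assuming the System V AMD64 convention (integer arguments passed in rdi, rsi, rdx, rcx, r8, r9). The `API_ARITY` table and the `reconstruct_call_site` helper are hypothetical stand-ins for real library information and are not taken from Nero.

```python
from typing import Dict

# System V AMD64 integer-argument registers, in argument order.
SYSV_ARG_REGS = ["rdi", "rsi", "rdx", "rcx", "r8", "r9"]

# Hypothetical arity table; a real tool would derive this from
# library headers or debug information for the imported libraries.
API_ARITY: Dict[str, int] = {
    "getaddrinfo": 4,
    "setsockopt": 5,
    "strerror": 1,
}


def reconstruct_call_site(callee: str) -> str:
    """Pair a callee name with the registers that carry its arguments."""
    arity = API_ARITY.get(callee)
    if arity is None:
        # Unknown prototype: keep the call, leave the arguments open.
        return f"{callee}(...)"
    return f"{callee}({','.join(SYSV_ARG_REGS[:arity])})"


print(reconstruct_call_site("setsockopt"))
```

The register names are only placeholders at this stage; the augmentation step replaces them with concrete or abstracted values.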
Augmenting Call Sites
setsockopt(rdi,rsi,rdx,rcx,r8)
call socket(...)
mov [rbp-58h], rax
mov rax, [rbp-58h]
mov rdi, rax
mov rsi, 1
mov r8, 4
In C: setsockopt(sock_var, 1, …, …, 4)
Augmenting Call Sites
Using concrete or abstracted values:
1. Concrete value (Integer, Enum, String)
2. ARG – procedure argument
3. GLOBAL – pointer to a global variable
4. RET – a return value from a call
5. STACK – pointer to stack memory
Less Informative
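One way to read the ordering above is as a priority: when several candidate values reach the same argument, keep the most informative one. The sketch below assumes exactly that reading; `pick_most_informative` is a hypothetical helper, and `None` stands in for an unresolved (∅) candidate.

```python
from typing import Iterable, Optional, Union

Value = Union[int, str]

# Abstract tags in the order listed above, most informative first.
ABSTRACT_TAGS = ("ARG", "GLOBAL", "RET", "STACK")


def rank(value: Value) -> int:
    """Lower rank means more informative; concrete values rank first."""
    if isinstance(value, str) and value in ABSTRACT_TAGS:
        return 1 + ABSTRACT_TAGS.index(value)
    return 0  # concrete integer, enum, or string


def pick_most_informative(candidates: Iterable[Optional[Value]]) -> Optional[Value]:
    """Drop unresolved candidates, then keep the best-ranked one."""
    known = [c for c in candidates if c is not None]
    return min(known, key=rank) if known else None


print(pick_most_informative(["STACK", "ARG"]))
```

Under this assumption a concrete constant such as 4 always wins over an abstract tag, and STACK is used only when nothing better is available.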
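Pointer-Aware Slicing of Call Site Args
getaddrinfo(rdi,rsi,rdx,rcx)
mov rdi, rax
mov rax, [rbp-68h]
mov [rbp-68h], rdi
[Figure: slice of the rdi argument through the three instructions above, annotated with V(rdi), V(rax), V(rbp), P([rdi]), P([rax]) and P([rbp-68h]); ∅ marks an empty set]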
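The idea of following an argument backward to its origin can be illustrated with a much simpler tracer than the pointer-aware slicing on the slide. The sketch below is an assumption-laden simplification: it handles only straight-line `mov` instructions, ignores pointers and aliasing, and uses hypothetical names (`trace_origin`, `abstract_origin`).

```python
from typing import List, Tuple

Mov = Tuple[str, str]  # (destination, source) of a mov instruction

# System V AMD64 integer-argument registers (assumed calling convention).
SYSV_ARG_REGS = {"rdi", "rsi", "rdx", "rcx", "r8", "r9"}


def trace_origin(movs: List[Mov], location: str) -> str:
    """Walk the instructions backward from the call; each time the
    tracked location is written, continue tracking its source."""
    for dst, src in reversed(movs):
        if dst == location:
            location = src
    return location


def abstract_origin(origin: str) -> str:
    """An argument register that is still live at procedure entry
    is the procedure's own argument."""
    return "ARG" if origin in SYSV_ARG_REGS else origin


# Instructions preceding `call getaddrinfo`, in program order.
movs = [
    ("[rbp-68h]", "rdi"),
    ("rax", "[rbp-68h]"),
    ("rdi", "rax"),
]
print(abstract_origin(trace_origin(movs, "rdi")))
```

Here rdi at the call comes from rax, which comes from the stack slot [rbp-68h], which was filled from rdi at procedure entry, so the first argument is labeled ARG.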
Augmenting Call Site Arguments
getaddrinfo(rdi,rsi,rdx,rcx)
mov rdi, rax
mov rax, [rbp-68h]
mov [rbp-68h], rdi
[Figure: the slice of the rdi argument with its nodes labeled by candidate values: STACK, ARG, ARG | ∅, STACK | ARG, ARG | ∅]
Augmenting Call Site Arguments
getaddrinfo(ARG,rsi,rdx,rcx)
mov rdi, rax
[Figure: the same slice with candidate labels STACK, ARG, ARG | ∅, STACK | ARG, ARG | ∅; the rdi argument is resolved to ARG]
Augmented Control Flow Graph
…call …
…
…call socket
…
…call printf
…
…call setsockopt
…
…call close
…call printf
…
setsockopt(RET,0,10,STK,4)
socket(2,1,0)
printf(GLOBAL,…)
close(…)
...
printf(GLOBAL,…)
Useful for training seq2seq or GNN models
...
Extracting Paths From the ACFG
Extract simple paths (no loops)
setsockopt(RET,0,10,STK,4)
socket(2,1,0)
printf(GLOBAL,…)
close(…) ...
printf(GLOBAL,…)
setsockopt(RET,1,2,STK,4)
getaddrinfo(ARG,ARG,STK,STK)
socket(…)
bind(…)
listen(…)
memset(STK,0,48)
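Extracting simple paths can be sketched as a depth-first walk that never revisits a node already on the current path. The graph encoding (a dict from node to successors) and the `simple_paths` helper are assumptions made for illustration; the edges below are hypothetical, since the slide figure does not preserve them.

```python
from typing import Dict, List


def simple_paths(graph: Dict[str, List[str]], entry: str) -> List[List[str]]:
    """Enumerate maximal loop-free paths starting at `entry`."""
    paths: List[List[str]] = []

    def dfs(node: str, prefix: List[str]) -> None:
        path = prefix + [node]
        # Skip successors already on the path: that would close a loop.
        successors = [s for s in graph.get(node, []) if s not in path]
        if not successors:
            paths.append(path)
            return
        for succ in successors:
            dfs(succ, path)

    dfs(entry, [])
    return paths


# Hypothetical ACFG fragment; node labels are augmented call sites.
acfg = {
    "socket(2,1,0)": ["setsockopt(RET,0,10,STK,4)"],
    "setsockopt(RET,0,10,STK,4)": ["printf(GLOBAL,…)", "close(…)"],
    "printf(GLOBAL,…)": ["socket(2,1,0)"],  # back edge, ignored
    "close(…)": [],
}
for path in simple_paths(acfg, "socket(2,1,0)"):
    print(" -> ".join(path))
```

The number of simple paths can grow exponentially with graph size, so a practical pipeline would likely cap the number or length of extracted paths.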
Our Approach: [Set-Of-Seq]-To-Seq
setsockopt(RET,1,2,STK,4)
getaddrinfo(ARG,ARG,STK,STK)
socket(…)
bind(…)
listen(…)
memset(STK,0,48)
server create socket
Evaluation
Implementation: Nero
Evaluation Corpus
GNU software repository
Remove duplicates
67,246 labeled procedures
Strip
Strip & obfuscate APIs
8:1:1 Package-Based Split
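A package-based split keeps every procedure from one package in the same set, so near-identical code from one project cannot leak from training into test. The sketch below assumes the 8:1:1 ratio is applied over packages; `package_split` and the toy corpus are hypothetical.

```python
import random
from typing import Dict, List, Tuple


def package_split(
    procs_by_package: Dict[str, List[str]], seed: int = 0
) -> Tuple[List[str], List[str], List[str]]:
    """Split procedures into train/validation/test by whole packages."""
    packages = sorted(procs_by_package)
    random.Random(seed).shuffle(packages)  # deterministic for a given seed
    n = len(packages)
    n_train, n_val = n * 8 // 10, n // 10
    groups = (
        packages[:n_train],
        packages[n_train:n_train + n_val],
        packages[n_train + n_val:],
    )
    train, val, test = (
        [proc for pkg in group for proc in procs_by_package[pkg]]
        for group in groups
    )
    return train, val, test


# Toy corpus: 10 packages with 2 procedures each.
corpus = {f"pkg{i}": [f"pkg{i}:f", f"pkg{i}:g"] for i in range(10)}
train, val, test = package_split(corpus)
print(len(train), len(val), len(test))
```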
Executable Obfuscation Types
• String encoding/encryption
• Code obfuscations (opaque predicates, etc.)
• Commercial (known) / home-made packers
• Header manipulation => API calls not visible
Simulating Header Manipulation
• Zeroing ’.dynstr’ removes imported libraries & procedure names
Evaluation Results
                         Stripped               Stripped & Obfuscated API Calls
Model                    Prec   Rec    F1       Prec   Rec    F1
LSTM-text                22.32  21.16  21.72    15.46  14.00  14.70
Transformer-text         25.45  15.97  19.64    18.41  12.24  14.70
Debin [He et al. 2018]   34.86  32.54  33.66    32.10  28.76  30.09
Nero-LSTM                39.94  38.89  39.40    39.12  31.40  34.83
Nero-Transformer         41.54  38.64  40.04    36.50  32.25  34.24
[He et al. 2018] "Debin: Predicting Debug Information in Stripped Binaries", CCS'18
Ablation Study
Components                             Prec   Rec    F1
Only Calls → LSTM                      23.45  24.56  24.04
Augmented Call Sites → LSTM            36.05  31.77  33.77
Paths → Only Calls → LSTM              29.84  24.08  26.65
Paths → Augmented Call Sites → LSTM    39.94  38.89  39.40
Prediction Examples
Model                    Prediction
Ground Truth             read file check new watcher get user groups install signal handlers
Debin [He et al. 2018]   bt open read index display signal setup
LSTM-text                <unk> check opt close stdin <unk>
Transformer-text         ipmi disable coredump <unk> config file ipmi regfree
Nero-LSTM                vfs read file check file get ip groups install handlers
Nero-Transformer         read file system list check state get user groups install signal
Qualitative Evaluation
Error Type                        Package    Ground Truth         Predicted Name
Programmers vs. English Language  wget       i18n_initialize      i18n_init
                                  direvent   split_cfg_path       split_config_path
                                  gzip       add_env_opt          add_option
Data Structure Name Missing       gtypist    get_best_speed       get_list_item
                                  wget       ftp_parse_winnt_ls   parse_tree
                                  gzip       abort_gzip_signal    fatal_signal_handler
Verb Replaced                     units      read_units           parse
                                  findutils  share_file_fopen     add_file
                                  mcsim      display_help         show_help
Measured F1 is actually a lower bound
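The lower-bound remark can be made concrete with a small scoring sketch. The slides do not spell out the metric, so this assumes precision, recall, and F1 are computed over exact matches of name subtokens (split on underscores); `subtokens` and `precision_recall_f1` are hypothetical helpers.

```python
import re
from collections import Counter
from typing import Tuple


def subtokens(name: str) -> Counter:
    """Split a procedure name into lowercase subtokens."""
    return Counter(t for t in re.split(r"[_\W]+", name.lower()) if t)


def precision_recall_f1(truth: str, predicted: str) -> Tuple[float, float, float]:
    """Score a predicted name against the ground truth, subtoken by subtoken."""
    t, p = subtokens(truth), subtokens(predicted)
    overlap = sum((t & p).values())  # multiset intersection
    if overlap == 0:
        return 0.0, 0.0, 0.0
    precision = overlap / sum(p.values())
    recall = overlap / sum(t.values())
    return precision, recall, 2 * precision * recall / (precision + recall)


# "init" is a reasonable stand-in for "initialize", yet it scores as a miss.
print(precision_recall_f1("i18n_initialize", "i18n_init"))  # (0.5, 0.5, 0.5)
print(precision_recall_f1("display_help", "show_help"))     # (0.5, 0.5, 0.5)
```

Under exact matching, abbreviations and synonyms count as errors even when a human reader would accept the prediction, which is why the measured score understates the quality of the predictions.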
Takeaway Messages
Use Augmented Call Sites as Learning Features
setsockopt(rdi,rsi,rdx,rcx,r8)
call socket(...)
mov [rbp-58h], rax
mov rax, [rbp-58h]
mov rdi, rax
mov rsi, 1
mov r8, 4
In C: setsockopt(sock_var, 1, …, …, 4)
Translate: Assembly Procedure → English