1
Simple Substitution Distance and Metamorphic Detection
Simple Substitution Distance
Gayathri Shanmugam
Richard M. Low
Mark Stamp
2
The Idea
Metamorphic malware “mutates” with each infection
Measuring software similarity is a possible means of detection
But, how to measure similarity?
  o Much relevant previous work
Here, a novel distance measure is considered
3
Simple Substitution Distance
We treat each metamorphic copy as if it is an “encrypted” version of a “base” virus
  o Where the “cipher” is a simple substitution
Why simple substitution?
  o Easy to work with, fast algorithm to solve
Why might this work?
  o Simple substitution “cryptanalysis” tends to yield results that match family statistics
  o Accounts for modifications to files similar to some common metamorphic techniques
4
Motivation
Given a simple substitution ciphertext where plaintext is English…
  o If we cryptanalyze using English language statistics, we expect a good score
  o If we cryptanalyze using, say, French language statistics, we expect a not-so-good score
We can obtain opcode statistics for a metamorphic family
  o Using simple substitution cryptanalysis, a virus of same family should score well…
  o …but, a benign exe should not score as well
  o Assuming statistics of these families differ
5
Metamorphic Techniques
Many possible morphing strategies
Here, briefly consider
  o Register swapping
  o Garbage code insertion
  o Equivalent substitution
  o Transposition
  o Formal grammar mutation
At a high level --- substitution, transposition, insertion, and deletion
6
Register Swap
Register swapping
  o E.g., replace EBX register with EAX, provided EAX not in use
Very simple and used in some of first metamorphic malware
Not very effective
  o Why not?
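One plausible answer to "why not?" is that register swapping changes operands only, so the opcode sequence is untouched. The sketch below illustrates this on a toy instruction list; the helper `opcodes` and the instructions themselves are hypothetical, not taken from any real malware.

```python
def opcodes(instructions):
    """Return the mnemonic (first token) of each assembly instruction."""
    return [ins.split()[0] for ins in instructions]

# Toy code fragment (hypothetical).
original = ["MOV EBX,1", "ADD EBX,ECX", "PUSH EBX"]

# The same fragment after swapping EBX for EAX, assuming EAX was not in use.
swapped = [ins.replace("EBX", "EAX") for ins in original]

print(swapped)                                 # operands differ from the original
print(opcodes(original) == opcodes(swapped))   # True: opcode sequence is unchanged
```

Any detector that looks only at opcodes, or at signatures with wildcards in the operand positions, sees the two versions as identical.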
7
Garbage Insertion
Garbage code insertion
Two cases:
  o Dead code --- inserted, but not executed
      We can simply JMP over dead code
  o Do-nothing instructions --- executed, but have no effect on program
      Like NOP or ADD EAX,0
Relatively easy to implement
Effective at breaking signature detection
8
Code Substitution
Equivalent instruction substitution
  o For example, can replace SUB EAX,EAX with XOR EAX,EAX
Does not need to be 1 for 1 substitution
  o That is, can include insertion/deletion
Unlimited number of substitutions
Very effective
Somewhat difficult to implement
9
Transposition
Transposition
  o Reorder instructions that have no dependency
For example,
    MOV R1,R2      becomes      ADD R3,R4
    ADD R3,R4                   MOV R1,R2
Can be highly effective
But, can be difficult to implement
  o Sometimes applied only to subroutines
10
Formal Grammar Mutation
Formal grammar mutation
View morphing engine as non-deterministic automaton
  o Allow transitions between any symbols
  o Apply formal grammar rules
Obtain many variants, high variation
Really just a formalization of other approaches, not a separate technique
11
Previous Work
Easy to prove that “good” metamorphic code is immune to signature detection
  o Why?
But, many successes detecting hacker-produced metamorphic malware…
  o HMM/PHMM/machine learning
  o Graph-based techniques
  o Statistics (chi-squared, naïve Bayes)
  o Structural entropy
  o Linear algebraic techniques
12
This Research
Measure similarity using “simple substitution distance”
We “decrypt” suspect file using statistics from a metamorphic family
  o If decryption is good, we classify it as a member of the same metamorphic family
  o If decryption is poor, we classify it as NOT a member of the given metamorphic family
13
Simple Substitution Cipher
Simple substitution is one of the oldest and simplest means of encryption
A fixed key used to substitute letters
  o For example, Caesar’s cipher, substitute letter 3 positions ahead in alphabet
  o In general, any permutation can be key
Simple substitution cryptanalysis?
  o Statistical analysis of ciphertext
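A minimal sketch of the cipher described above, with the key stored as a plaintext-to-ciphertext mapping. The function names (`caesar_key`, `encrypt`, `decrypt`) and the sample message are illustrative choices, not part of the original slides.

```python
import string

ALPHABET = string.ascii_uppercase

def caesar_key(shift=3):
    """Caesar key: each letter maps to the letter `shift` positions ahead."""
    return {p: ALPHABET[(i + shift) % 26] for i, p in enumerate(ALPHABET)}

def encrypt(plaintext, key):
    """Apply a simple substitution; characters not in the key pass through."""
    return "".join(key.get(ch, ch) for ch in plaintext)

def decrypt(ciphertext, key):
    """Invert the substitution by reversing the key mapping."""
    inverse = {c: p for p, c in key.items()}
    return "".join(inverse.get(ch, ch) for ch in ciphertext)

key = caesar_key(3)
ciphertext = encrypt("ATTACKATDAWN", key)
print(ciphertext)                # DWWDFNDWGDZQ
print(decrypt(ciphertext, key))  # ATTACKATDAWN
```

A general simple substitution uses the same `encrypt` and `decrypt`; only the key changes, from a shift to an arbitrary permutation of the alphabet.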
14
Simple Substitution Cryptanalysis
Suppose you observe the ciphertext
PBFPVYFBQXZTYFPBFEQJHDXXQVAPTPQJKTOYQWIPBVWLXTOXBTFXQWAXBVCXQWAXFQJVWLEQNTOZQGGQLFXQWAKVWLXQWAEBIPBFXFQVXGTVJVWLBTPQWAEBFPBFHCVLXBQUFEVWLXGDPEQVPQGVPPBFTIXPFHXZHVFAGFOTHFEFBQUFTDHZBQPOTHXTYFTODXQHFTDPTOGHFQPBQWAQJJTODXQHFOQPWTBDHHIXQVAPBFZQHCFWPFHPBFIPBQWKFABVYYDZBOTHPBQPQJTQOTOGHFQAPBFEQJHDXXQVAVXEBQPEFZBVFOJIWFFACFCCFHQWAUVWFLQHGFXVAFXQHFUFHILTTAVWAFFAWTEVOITDHFHFQAITIXPFHXAFQHEFZQWGFLVWPTOFFA
Analyze frequency counts…
Likely that ciphertext “F” represents “E”
  o And so on, at least for common letters
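The frequency-count step can be sketched as follows. `ENGLISH_ORDER` is an assumed ranking of English letters by frequency (published rankings differ slightly by corpus), `initial_key` is a hypothetical helper, and `sample` is just a prefix of the ciphertext shown above.

```python
from collections import Counter

# Assumed English letter ranking, most frequent first (varies by corpus).
ENGLISH_ORDER = "ETAOINSRHDLCUMWFGYPBVKJXQZ"

def initial_key(ciphertext):
    """Map the i-th most frequent ciphertext letter to the i-th most frequent English letter."""
    counts = Counter(ch for ch in ciphertext if ch.isalpha())
    ranked = [c for c, _ in counts.most_common()]
    return dict(zip(ranked, ENGLISH_ORDER))

sample = "PBFPVYFBQXZTYFPBFEQJHDXXQVAPTPQJKTOYQWIPBVWLXTOXBTFXQWAXBVCXQWAXFQJVWLEQ"
guess = initial_key(sample)
putative = "".join(guess[c] for c in sample)
print(putative)  # a first, usually imperfect, putative plaintext
```

The guess is typically right only for the most common letters, which is why a refinement step is needed afterwards.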
15
Simple Substitution Cryptanalysis
Can even automate attack
  1. Make initial guess for key using frequency counts
  2. Compute oldScore
  3. Modify key by swapping adjacent elements
  4. Compute newScore
  5. If newScore > oldScore then oldScore = newScore
  6. Else unswap elements
  7. Goto 3
How to compute score?
  o Number of dictionary words in putative plaintext?
  o Much better to use English digraph statistics
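The steps above can be sketched as a small hill climb. Several details are assumptions, since the slide does not fix them: the key is a list aligned with the ciphertext alphabet, the loop stops after a full pass with no improvement, and `toy_score` (counting two common English digraphs) stands in for real digraph statistics.

```python
def decrypt(ciphertext, key, alphabet):
    """key[i] is the putative plaintext letter for ciphertext letter alphabet[i]."""
    table = dict(zip(alphabet, key))
    return "".join(table[c] for c in ciphertext)

def hill_climb(ciphertext, key, alphabet, score):
    key = list(key)                                              # step 1: initial guess
    old_score = score(decrypt(ciphertext, key, alphabet))        # step 2
    improved = True
    while improved:          # assumed stopping rule: a full pass with no gain
        improved = False
        for i in range(len(key) - 1):
            key[i], key[i + 1] = key[i + 1], key[i]              # step 3: swap neighbors
            new_score = score(decrypt(ciphertext, key, alphabet))  # step 4
            if new_score > old_score:                            # step 5: keep the swap
                old_score = new_score
                improved = True
            else:                                                # step 6: unswap
                key[i], key[i + 1] = key[i + 1], key[i]
    return key, old_score

def toy_score(text):
    """Stand-in score: occurrences of two common English digraphs."""
    return text.count("TH") + text.count("HE")

best_key, best = hill_climb("XYXYZ", ["H", "T", "E"], "XYZ", toy_score)
print(best_key, best)  # ['T', 'H', 'E'] 3
```

Note that every candidate key triggers a full decryption of the ciphertext before it can be scored.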
16
Jakobsen’s Algorithm
Method on previous slide can be slow
  o Why?
Jakobsen’s algorithm uses similar idea, but fast and efficient
  o Ciphertext is only decrypted once
  o So algorithm is (essentially) independent of length of message
  o Then, only matrix manipulations required
17
Jakobsen’s Algorithm: Swapping
Assume plaintext is English, 26 letters
Let K = k1,k2,k3,…,k26 be putative key
  o And let “|” represent “swap”
Then we swap elements as follows
Also, we restart this swapping schedule from the beginning whenever score improves
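The swap table itself is missing from this text. The sketch below assumes the schedule described in Jakobsen's paper: first all pairs of key elements at distance 1 (k1|k2, k2|k3, …), then all pairs at distance 2, and so on up to k1|k26. The generator name `swap_schedule` and the 0-based indices are illustrative.

```python
def swap_schedule(n=26):
    """Yield index pairs (i, j): all pairs at distance 1, then distance 2, and so on."""
    for distance in range(1, n):
        for i in range(n - distance):
            yield i, i + distance

schedule = list(swap_schedule(26))
print(schedule[:3])   # [(0, 1), (1, 2), (2, 3)]
print(schedule[-1])   # (0, 25)
print(len(schedule))  # 325
```

One full pass visits every unordered pair of key positions exactly once, which is 26 choose 2 = 325 swaps.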
18
Jakobsen’s Algorithm: Swapping
Minimum number of swaps is 26 choose 2, or 325
Maximum is unbounded
Each swap requires a score computation
Average number of swaps?
Experimentally
  o Ciphertext of length 500, average 1050 swaps
  o Ciphertext of length 8000, avg just 630 swaps
So, work (number of swaps) depends on length of ciphertext
  o More ciphertext, better scores, fewer swaps
19
Jakobsen’s Algorithm: Scoring
Let D = {dij} be digraph distribution corresponding to putative key K
Let E = {eij} be digraph distribution of English language
These matrices are 26 x 26
Compute score as
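The score formula is missing from this text. The sketch below assumes the one used in Jakobsen's paper, the sum of |dij - eij| over all i and j, so a lower score means a better match. The helper names and the two-letter toy alphabet are illustrative only.

```python
def digraph_matrix(text, alphabet):
    """Relative frequency of each adjacent symbol pair, as a square matrix."""
    index = {ch: i for i, ch in enumerate(alphabet)}
    n = len(alphabet)
    counts = [[0.0] * n for _ in range(n)]
    for a, b in zip(text, text[1:]):
        counts[index[a]][index[b]] += 1
    total = max(len(text) - 1, 1)
    return [[c / total for c in row] for row in counts]

def distance(D, E):
    """Assumed score: sum of |d_ij - e_ij|; 0 means identical distributions."""
    return sum(abs(d - e)
               for row_d, row_e in zip(D, E)
               for d, e in zip(row_d, row_e))

D = digraph_matrix("ABAB", "AB")   # digraphs AB, BA, AB
E = digraph_matrix("AABB", "AB")   # digraphs AA, AB, BB
print(distance(D, D))  # 0.0
print(distance(D, E))  # about 1.333
```

With a distance like this, the comparison in a hill climb is reversed: a swap is kept when the new score is smaller.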
20
Jakobsen’s Algorithm
So far, nothing fancy here
  o Could see all of this in a CS 265 assignment
Jakobsen’s trick: Determine new D matrix from old D without decrypting
How to do so?
  o It turns out that swapping elements of K swaps corresponding rows and columns of D
See example on next slides…
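The row-and-column claim can be checked numerically. This sketch uses the 10-letter alphabet and ciphertext from the swapping example in these slides; the key layout (key[i] is the ciphertext letter that decrypts to ALPHABET[i]) and the frequency-ranked starting key are assumptions made for illustration.

```python
from collections import Counter

ALPHABET = "ETAOINSRHD"
CIPHERTEXT = ("TNDEODRHISOADDRTEDOAHENSINEOAR"
              "DTTDTINDDRNEDNTTTDDISRETEEEEEAA")

def decrypt(ciphertext, key):
    """key[i] is the ciphertext letter that decrypts to ALPHABET[i]."""
    table = {c: ALPHABET[i] for i, c in enumerate(key)}
    return "".join(table[c] for c in ciphertext)

def digraph_counts(text):
    index = {ch: i for i, ch in enumerate(ALPHABET)}
    D = [[0] * len(ALPHABET) for _ in ALPHABET]
    for a, b in zip(text, text[1:]):
        D[index[a]][index[b]] += 1
    return D

def swap_rows_cols(D, i, j):
    """Copy of D with rows i, j exchanged and columns i, j exchanged."""
    perm = list(range(len(D)))
    perm[i], perm[j] = perm[j], perm[i]
    return [[D[perm[r]][perm[c]] for c in range(len(D))] for r in range(len(D))]

key = [c for c, _ in Counter(CIPHERTEXT).most_common()]  # frequency-ranked guess
D = digraph_counts(decrypt(CIPHERTEXT, key))

new_key = key[:]
new_key[0], new_key[1] = new_key[1], new_key[0]          # swap first 2 key elements
D_decrypt = digraph_counts(decrypt(CIPHERTEXT, new_key))  # slow way: decrypt again
D_swapped = swap_rows_cols(D, 0, 1)                       # fast way: no decryption
print(D_decrypt == D_swapped)  # True
```

The fast path touches only the matrix, so its cost depends on the alphabet size and not on the ciphertext length.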
21
Swapping Example
To simplify, suppose 10 letter alphabet
    E, T, A, O, I, N, S, R, H, D
Suppose you are given the ciphertext
    TNDEODRHISOADDRTEDOAHENSINEOAR
    DTTDTINDDRNEDNTTTDDISRETEEEEEAA
Frequency counts given by
22
Swapping Example
We choose the putative key K given here
The corresponding putative plaintext is
    AOETRENDSHRIEENATE
    RIDTOHSOTRINEAAEAS
    OEENOTEOAAAEESHNA
    TTTTTII
Corresponding digraph distribution D is
23
Swapping Example
Suppose we swap first 2 elements of K
Then decrypt using new K
And compute digraph matrix for new K
Previous key K
New key K
24
Swapping Example
Old D matrix vs new D matrix
What do you notice?
So what’s the point here?
This is good!
25
Jakobsen’s Algorithm
26
Proposed Similarity Score
Extract opcode sequences from collection of viruses
  o All viruses from same metamorphic family
Determine n most common opcodes
  o Symbol n+1 used for all “other” opcodes
Use resulting digraph statistics to form matrix E = {eij}
  o Note that matrix is (n+1) x (n+1)
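A minimal sketch of building E from a family's opcode sequences. The opcode lists are made-up toy data, the function names are hypothetical, and normalizing counts to relative frequencies is an assumption (the slides list normalization as a tested parameter without giving details).

```python
from collections import Counter

def top_opcodes(sequences, n):
    """The n most common opcodes across all family files, most frequent first."""
    counts = Counter(op for seq in sequences for op in seq)
    return [op for op, _ in counts.most_common(n)]

def family_matrix(sequences, common):
    """(n+1) x (n+1) digraph matrix; index n collects all "other" opcodes."""
    n = len(common)
    index = {op: i for i, op in enumerate(common)}
    counts = [[0.0] * (n + 1) for _ in range(n + 1)]
    total = 0
    for seq in sequences:
        symbols = [index.get(op, n) for op in seq]
        for a, b in zip(symbols, symbols[1:]):   # digraphs never span two files
            counts[a][b] += 1
            total += 1
    if total == 0:
        return counts
    return [[c / total for c in row] for row in counts]

# Toy "family" of two files (hypothetical data).
family = [["mov", "add", "mov", "push", "call"],
          ["mov", "mov", "add", "jmp", "mov"]]
common = top_opcodes(family, n=2)
E = family_matrix(family, common)
print(common)          # ['mov', 'add']
print(len(E), len(E[0]))  # 3 3
```

Here `push`, `call`, and `jmp` all fall into the third "other" symbol, so the matrix stays small regardless of how many distinct opcodes appear.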
27
Scoring a File
Given an executable we want to score
Extract its opcode sequence
Use opcode digraph stats to get D = {dij}
  o This matrix also (n+1) x (n+1)
Initial “key” K chosen to match monograph stats of virus family
  o Most frequent opcode in exe maps to most frequent opcode in virus family, etc.
Score based on distance between D and E
  o “Decrypt” D and score how closely it matches E
  o Jakobsen’s algorithm used for “decryption”
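The initial key selection can be sketched as a rank-to-rank mapping. The family ranking, the exe opcode sequence, the `OTHER` label, and both function names are hypothetical values chosen for illustration.

```python
from collections import Counter

def initial_key(exe_opcodes, family_common):
    """Map the exe's i-th most frequent opcode to the family's i-th most frequent opcode."""
    n = len(family_common)
    ranked = [op for op, _ in Counter(exe_opcodes).most_common(n)]
    return dict(zip(ranked, family_common))

def decrypt(exe_opcodes, key, other="OTHER"):
    """Rewrite the exe's opcodes as family symbols; unmapped opcodes become `other`."""
    return [key.get(op, other) for op in exe_opcodes]

family_common = ["mov", "add", "push"]   # hypothetical family ranking, n = 3
exe = ["push", "xor", "push", "mov", "xor", "push", "nop", "xor", "push", "mov"]

key = initial_key(exe, family_common)
print(key)                # {'push': 'mov', 'xor': 'add', 'mov': 'push'}
print(decrypt(exe, key))
```

The "decrypted" sequence is what the digraph matrix D would be computed from before the swapping stage refines the key.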
28
Example
Suppose only 5 common opcodes in family viruses (in descending frequency)
Extract following sequence from an exe
Initial “key” is
And “decrypt” is
29
Example
Given “decrypt”
Form D matrix
After swap…
  o And so on…
30
Scoring Algorithm
31
Quantifying Success
Consider these 2 scatterplots of scores
Which is better (and why)?
32
ROC Curves
Plot true positive rate vs false positive rate
  o As “threshold” varies
Curve nearer 45-degree line is bad
Curve nearer upper-left is good
33
ROC Curves
Use ROC curves to quantify success
Area under the ROC curve (AUC)
  o Probability that randomly chosen positive instance scores higher than a randomly chosen negative instance
AUC of 1.0 implies ideal detection
AUC of 0.5 means classification is no better than flipping a coin
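The probabilistic reading of AUC can be computed directly by comparing every positive score with every negative score. This is a minimal sketch with made-up scores; counting ties as one half is a common convention assumed here, and the function name `auc` is illustrative.

```python
def auc(positive_scores, negative_scores):
    """Fraction of (positive, negative) pairs where the positive scores higher.

    Ties count as half a win (assumed convention).
    """
    wins = 0.0
    for p in positive_scores:
        for n in negative_scores:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(positive_scores) * len(negative_scores))

print(auc([0.9, 0.8, 0.7], [0.1, 0.2, 0.3]))  # 1.0, perfect separation
print(auc([0.9, 0.4], [0.5, 0.1]))            # 0.75
print(auc([0.5, 0.5], [0.5, 0.5]))            # 0.5, indistinguishable
```

If the score is a distance, where lower means more similar to the family, the scores would need to be negated (or the comparison reversed) before applying this function.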
34
Parameter Selection
Tested the following parameters
  o Opcode matrix size
  o Scoring function
  o Normalization
  o Swapping strategy
None significant, except matrix size
  o So we only give results for matrix size here
35
Opcode Matrix Size
Obtained following results
So, ironically, we use 26 x 26 matrix
36
Test Data
Tested the following metamorphic families
  o G2 --- known to be weak
  o NGVCK --- highly metamorphic
  o MWOR --- highly metamorphic and stealthy
MWOR “padding ratios” of 0.5 to 4.0
For G2 and NGVCK
  o 50 files tested, Cygwin utilities for benign files
For each MWOR padding ratio
  o 100 files tested, Linux utilities for benign files
5-fold cross validation in each experiment
37
NGVCK and G2 Graphs
38
MWOR Score Graphs
39
MWOR ROC Curves
40
MWOR AUC Statistics
41
Efficiency
42
Conclusions
+ Simple substitution score, good results for challenging metamorphic viruses
+ Scoring is fast and efficient
+ Applicable to other types of malware
- Requires opcodes
43
References
G. Shanmugam, R.M. Low, and M. Stamp, Simple substitution distance and metamorphic detection, Journal of Computer Virology and Hacking Techniques, 9(3):159-170, 2013