
Chi-Squared Distance and Metamorphic Virus Detection

Annie H. Toderici∗ and Mark Stamp†

Abstract

Metamorphic malware changes its internal structure with each generation, while maintaining its original behavior. Current commercial antivirus software generally scans for known malware signatures; therefore, it is not able to detect metamorphic malware that sufficiently morphs its internal structure.

Machine learning methods such as hidden Markov models (HMM) have shown promise for detecting hacker-produced metamorphic malware. However, previous research has shown that it is possible to evade HMM-based detection by carefully morphing with content from benign files. In this paper, we combine HMM detection with a statistical technique based on the chi-squared test to build an improved detection method. We discuss our technique in detail and provide experimental evidence to support our claim of improved detection.

Keywords: metamorphic malware, chi-squared statistics, hidden Markov models, malware detection

1 Introduction

Malicious software attacks can cause extensive financial damage. For example, MyDoom, a spam-mailing malware, caused an estimated $38 billion in damage, while the damage due to Conficker, a password-stealing botnet, has been estimated at $9.1 billion [7].

Antivirus software aims to detect malware. Signature detection is the approach most commonly used by commercial antivirus products [4]. This technique relies on a database of known signatures for viruses and other malware, and attempts to match these signatures against files on a user’s computer. If a match is found, the file is likely infected by the corresponding malware. Traditionally, this method has been effective for detecting most malware. The major weakness of signature detection is that it cannot detect previously unknown viruses.

∗ Department of Computer Science, San Jose State University

† Department of Computer Science, San Jose State University


As a way to evade signature detection, malware writers employ code obfuscation methods [23]. Metamorphic viruses apply code obfuscation techniques at each generation and, consequently, a well-designed metamorphic virus cannot be detected using standard signatures [6].

Previous research has shown that machine learning methods such as hidden Markov models (HMMs) are effective at detecting hacker-produced metamorphic viruses [29]. However, such a detection strategy can be defeated by inserting a sufficient amount of code from benign files into each virus; at some point, the HMM classifier cannot reliably distinguish such a virus from a benign file. An experimental metamorphic virus generator was previously developed to exploit this strategy [16]. It was found that the HMM-based approach is very robust when random changes are made to the viral code, or when small blocks of code are copied from benign files into the viruses. However, the HMM technique is relatively fragile when code is copied from benign files in the form of contiguous blocks (e.g., entire subroutines). The motivation for the research presented here is to improve on this weakness in the HMM detector.

In this paper, we analyze the utility of a statistical chi-squared test (or $\chi^2$ test) for malware detection, as suggested by the theoretical framework in [11]. Our results show that the chi-squared statistic, computed on instruction opcode frequencies, can significantly improve virus detection: whether code is copied from benign files in the form of many small segments or a contiguous block has little effect on the chi-squared statistic. We show that by combining a chi-squared test with an HMM detector, we can improve on the results obtained by either when used individually.

This paper is organized as follows. Section 2 gives relevant background information on malware, malware detection, code obfuscation, hidden Markov models, and the chi-squared statistical test. In Section 3, we discuss our proposed detection techniques, while Section 4 covers the performance criteria we use to quantify our results. Then, in Section 5 we present our experimental results. Finally, conclusions and suggestions for future work are given in Section 6.

2 Background

In this section, we briefly discuss background material that is relevant to the research in this paper. First, we cover malware, with the focus on metamorphic malware and code obfuscation techniques. Then we consider malware detection, with the emphasis on a machine learning technique based on hidden Markov models. Finally, we discuss a chi-squared statistical test and its potential application to malware detection.

2.1 Malware

As the name suggests, malware is software designed specifically for malicious purposes. Malware is often classified into various categories, including virus, worm, trojan horse, spyware, adware, and botnet.


A virus is often defined as malware that relies on passive propagation, whereas a worm uses active means [21]. That is, a worm actively propagates itself (typically, via a network), while a virus requires outside assistance (e.g., an infected USB key inserted into a computer). However, others define a virus as parasitic malware, in contrast to a worm that is stand-alone code [4]. Here, we use the term “virus” generically to refer to malware.

Next, we consider encrypted, polymorphic, and metamorphic viruses. These categories can be viewed as a hierarchy, employing increasingly sophisticated strategies designed to evade signature detection.

2.2 Encrypted Viruses

Encrypted viruses encrypt their body using a different key at every infection. While encryption is an effective means of evading signature detection, an encrypted virus must include a plaintext decryptor routine. Therefore, it is possible to detect this class of viruses by analyzing the decryptor to obtain a signature.

2.3 Polymorphic Viruses

Polymorphic viruses are encrypted viruses that obfuscate their decryption code by mutating it. In practice, there are usually a relatively small number of decryptors, making signature detection a viable option. In addition, if a part of the code looks “suspicious,” we can execute it in a virtual machine. If the suspicious code is a polymorphic virus, it will decrypt itself, at which point standard signature detection can succeed.

2.4 Metamorphic Viruses

Metamorphic viruses change their internal structure at each generation. Unlike polymorphic viruses, the entire virus is morphed while still maintaining its original behavior. If the morphing is sufficient, no common signature is available and hence signature-based antivirus software cannot reliably detect well-designed metamorphic viruses. Note that no encryption is necessary for a metamorphic virus.

Although it is difficult to detect metamorphic viruses, fortunately, it has proven equally difficult for malware writers to implement them. One difficulty is that the malware must mutate sufficiently, yet the size of the code must not increase uncontrollably. Another concern is that the viruses must be sufficiently similar to benign code to avoid detection by similarity-based or heuristic methods. Malware writers have not yet successfully overcome these obstacles [29, 30]. In fact, most metamorphic generators introduce very little metamorphism, and those that do introduce more produce variants that are easily distinguished from benign code. Nevertheless, it is possible to produce relatively strong, practical metamorphic generators [16, 19, 20], so metamorphic detection is a worthy research problem.


2.4.1 Virus Obfuscation Techniques

Code obfuscation techniques are applied when creating metamorphic viruses. These techniques can be used to create a vast number of distinct copies that have the same behavior but different internal structure. In this section, we briefly discuss some elementary code morphing techniques.

Register Renaming: Register renaming is one of the oldest and simplest techniques used in metamorphic generators. For example, Figure 1 provides a code snippet where the following substitutions have been made:

eax −→ ebx

ebx −→ ecx

ecx −→ eax

In spite of its simplicity, this technique does change the binary pattern in morphed executable files. However, register renaming does not affect the opcode sequence. Furthermore, it is relatively easy to detect register renaming by using signatures with wildcard strings [24].

MOV eax, 4             MOV ebx, 4
ADD ebx, eax    −→     ADD ecx, ebx
SUB ecx, 1             SUB eax, 1

Figure 1: Register renaming example.
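The renaming in Figure 1 can be reproduced with a few lines of Python. This is only an illustrative sketch: the helper names `rename_registers` and `opcodes` are hypothetical, and instructions are treated as plain strings rather than decoded machine code. It shows that the operands change while the opcode sequence stays the same.

```python
import re

# Substitutions from the text: eax -> ebx, ebx -> ecx, ecx -> eax.
RENAME = {"eax": "ebx", "ebx": "ecx", "ecx": "eax"}
_REGISTER = re.compile(r"\b(eax|ebx|ecx)\b")


def rename_registers(lines):
    """Apply all substitutions simultaneously (one regex pass per line),
    so that eax -> ebx is not followed by an unintended ebx -> ecx."""
    return [_REGISTER.sub(lambda m: RENAME[m.group(1)], line) for line in lines]


def opcodes(lines):
    """Return the opcode (first token) of each instruction string."""
    return [line.split()[0] for line in lines]


original = ["MOV eax, 4", "ADD ebx, eax", "SUB ecx, 1"]
morphed = rename_registers(original)

print(morphed)                                # operands differ from the original
print(opcodes(original) == opcodes(morphed))  # opcode sequence is unchanged
```

Because the opcode sequence is untouched, any detector that works on opcodes alone is indifferent to this particular morphing step.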

Equivalent Instruction Replacement: The instruction sets of modern processors have numerous equivalent instructions (or groups of instructions). For example, MOV eax, 0 is equivalent to both SUB eax, eax and XOR eax, eax. Figure 2 illustrates an example where a single instruction is equivalent to a sequence of instructions.

MOV eax, 1    −→    SUB eax, eax
                    ADD eax, 1

Figure 2: Equivalent instructions substitution.

Instruction Reordering: Instruction reordering consists of transposing instructions that do not depend on the output of previous instructions. When instructions are reordered, signatures involving the instructions can be broken, but code execution is unaffected. Figure 3 shows an example of instruction reordering.


MOV eax, 4             ADD ecx, 1
ADD ecx, 1      −→     MOV ebx, 0
MOV ebx, 0             MOV eax, 4

Figure 3: Instruction reordering.

Junk Code: Junk code is any code that has no effect on program execution. Junk code might be executed, with no effect on the program, or it might be “dead code” that is never executed. Junk code is often inserted randomly throughout the body of a metamorphic virus during the morphing process. The intention is that such junk code will break up signatures.

A trivial example of junk code is the NOP instruction, which does nothing to affect the CPU state. Other examples of junk code include MOV eax, eax; ADD eax, 0; and SUB eax, 0.

2.5 Malware Detection

In this section, we briefly discuss signature detection and heuristic detection. Other detection strategies are used, but these are the most common today.

2.5.1 Signature Detection

Commercial antivirus software typically uses signature detection to identify malicious files. A signature is created by analyzing the binary code of a virus, and selecting a sequence of bits that is, ideally, unique to that virus [21]. The signature must be long enough so that it is unlikely to appear in uninfected files. For example, the Chernobyl/CIH virus can be detected using the signature [28]

E800 0000 005B 8D4B 4251 5050 0F01 4C24 FE5B 83C3 1CFA 8B2B .

String matching algorithms are applied when scanning for virus signatures. Examples of such algorithms include Aho-Corasick, Veldman, and Wu-Manber [4]. The Aho-Corasick algorithm scans for exact matches, so a slight variation will escape detection [1]. On the other hand, Veldman and Wu-Manber allow for the use of wildcard search strings [23].
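The difference between exact and wildcard matching can be illustrated with a small sketch. It uses Python's general-purpose regular expression engine on byte strings, not Aho-Corasick, Veldman, or Wu-Manber, and the `??` wildcard token and the helper name `compile_signature` are assumptions made for illustration. The wildcard position below is hypothetical and is not taken from any real CIH signature database.

```python
import re


def compile_signature(sig):
    """Compile a hex signature into a bytes regex.

    Whitespace is ignored. Each pair of hex digits matches that exact byte;
    the (hypothetical) token '??' matches any single byte.
    """
    hexstr = "".join(sig.split())
    parts = []
    for i in range(0, len(hexstr), 2):
        pair = hexstr[i:i + 2]
        parts.append(b"." if pair == "??" else re.escape(bytes.fromhex(pair)))
    # DOTALL lets '.' match every byte value, including 0x0A.
    return re.compile(b"".join(parts), re.DOTALL)


# Signature quoted in the text for Chernobyl/CIH.
CIH = "E800 0000 005B 8D4B 4251 5050 0F01 4C24 FE5B 83C3 1CFA 8B2B"
# Same signature with one byte replaced by a wildcard (illustrative only).
CIH_WILD = "E800 0000 00?? 8D4B 4251 5050 0F01 4C24 FE5B 83C3 1CFA 8B2B"

body = bytes.fromhex("".join(CIH.split()))
variant = body[:5] + b"\x5A" + body[6:]   # a copy with a single byte changed

exact = compile_signature(CIH)
wild = compile_signature(CIH_WILD)

print(bool(exact.search(b"\x90" + body)))     # exact signature finds the original
print(bool(exact.search(b"\x90" + variant)))  # but misses the one-byte variant
print(bool(wild.search(b"\x90" + variant)))   # the wildcard signature still matches
```

A production scanner would match many signatures at once, which is where the multi-pattern algorithms named above become important.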

Signature scanning is relatively easy and efficient, but the virus database needs to be kept up to date. The crucial weakness of signature detection is that signatures are unlikely to detect previously unseen malware, including metamorphic variants.

2.5.2 Heuristic Detection

Heuristic analysis can be used to detect unknown viruses and variants of known viruses. Heuristic analysis can be based on a static or dynamic approach, or a combination of the two. An example of static heuristic analysis would be to look for opcode sequences that match a general pattern found in viral code. For dynamic analysis, the code might be executed in a virtual machine to watch for suspicious behavior. An example of such suspicious behavior is opening an executable file for writing [4].

2.6 Hidden Markov Models

Machine learning can be defined as computer algorithms that improve through experiments [17]. Examples of such techniques include Naïve Bayes [18], decision trees [15], hidden Markov models [22], and many other statistical learning methods [11].

A hidden Markov model (HMM) is a statistical modeling method that has been used in speech recognition, bioinformatics, mouse gesture recognition, credit card fraud detection, and computer virus detection research. It is widely used because it is simple and computationally efficient [22].

The popularity of HMMs for virus detection stems from the fact that a program can be represented as a sequence of instructions. The CPU executes the instructions one at a time, which implies that programs can be treated as time series, which is an ideal situation for an HMM.

For a Markov model of order one, we assume that the sequential data can be modeled based solely on the current state, with no memory; what happens next in the sequence depends only on the current state. In a typical Markov chain, the states are fully observed. In the case of a hidden Markov model, as its name implies, the states are not directly observed, as they are “hidden.” We can only estimate these states while observing sequences of data [22].

2.6.1 HMM Notation

To describe an HMM, we will use the notation summarized in Table 1. The main components of the HMM are the state transition probability matrix A, the observation probability matrix B (which gives the likelihood of an observation given the state), the initial state distribution π, the observation sequence O, and the hidden states X. We denote the HMM model as λ = (A, B, π). The matrices A, B, and π are row stochastic, that is, each row is a probability distribution. The probability $\pi_{x_i}$ is the initial probability of starting in state $X_i$, while $a_{x_i, x_{i+1}}$ is the probability of transitioning from state $X_i$ to state $X_{i+1}$. Finally, $b_{x_i}(O_i)$ is the probability of observing $O_i$ when the underlying Markov process is in state $X_i$.

Figure 4 illustrates a generic HMM. The state transitions in Figure 4 are placed above the dashed line to indicate that they are “hidden.” Some information about the transitions can be deduced by analyzing the observations, since the observation sequence is related to the states of the Markov process by the matrix B.

The utility of HMMs derives largely from the fact that there are efficient algorithms to solve the following three problems.


Table 1: HMM symbols and their meanings.

Symbol   Description
T        The length of the observation sequence
N        The number of states in the model
M        The number of distinct observation symbols
O        The observation sequence, $O = \{O_0, O_1, \ldots, O_{T-1}\}$
X        The hidden states, $X = \{X_0, X_1, \ldots, X_{T-1}\}$
A        The state transition probability matrix
B        The observation probability matrix
π        The initial state distribution
λ        The hidden Markov model: λ = (A, B, π)

Figure 4: Hidden Markov Model [22].

1. Given a model λ and an observed sequence O, we can determine P(O|λ). That is, we can score a sequence of observations against a given model.

2. Given a model λ and an observed sequence O, we can find the most likely hidden state sequence $\{X_0, X_1, \ldots, X_{T-1}\}$.

3. Given an observation sequence O, we can train a model λ to maximize the probability of O. That is, we can train a model to fit the data.

In this research, we apply the solution to the third problem to train an HMM to fit opcode sequences extracted from a metamorphic family. Then we use the solution to problem one to score files (based on extracted opcode sequences) and classify each as either belonging to the metamorphic family or not.
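As an illustration of problem one (scoring), the following is a minimal sketch of the forward algorithm with per-step scaling, which returns log P(O|λ). The function name `log_likelihood` and the toy two-state model are assumptions made for illustration; they are not the model or the code used in the experiments.

```python
import math


def log_likelihood(A, B, pi, O):
    """Scaled forward algorithm: returns log P(O | lambda), lambda = (A, B, pi).

    A[i][j]: probability of moving from state i to state j (N x N).
    B[i][k]: probability of observing symbol k while in state i (N x M).
    pi[i]:   probability of starting in state i.
    O:       non-empty observation sequence, given as symbol indices.
    """
    N = len(pi)
    alpha = [pi[i] * B[i][O[0]] for i in range(N)]
    log_prob = 0.0
    for t in range(len(O)):
        if t > 0:
            alpha = [
                sum(alpha[i] * A[i][j] for i in range(N)) * B[j][O[t]]
                for j in range(N)
            ]
        scale = sum(alpha)
        if scale == 0.0:
            return float("-inf")   # the sequence is impossible under this model
        alpha = [a / scale for a in alpha]   # rescale to avoid underflow
        log_prob += math.log(scale)          # accumulate the log of each scale
    return log_prob


# Toy model with two hidden states and three observation symbols (made up).
A = [[0.7, 0.3], [0.4, 0.6]]
B = [[0.1, 0.4, 0.5], [0.7, 0.2, 0.1]]
pi = [0.6, 0.4]
O = [0, 1, 0, 2]

print(log_likelihood(A, B, pi, O))
```

Scaling matters in practice because opcode sequences contain thousands of observations, and the unscaled product of probabilities would underflow long before the end of the sequence.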

For more information on HMMs in general, including detailed examples and pseudo-code, see [22]. For additional information on the application of HMMs to the malware detection problem, see, for example, [16, 30].


2.7 Chi-Squared Distance

In this section, we first discuss chi-squared tests in general terms. Then we consider the use of such a test for malware detection.

2.7.1 CSD Notation

Suppose that Y is a statistical variable from a distribution under observation. Our goal is to estimate the main characteristics of the probability distribution P of Y. Let $Y_1, Y_2, \ldots, Y_n$ be a random sample of elements from this distribution. These samples reveal some information about the unknown parameter, say, θ, of the probability distribution P.

Let $f(Y_1, Y_2, \ldots, Y_n)$ denote a function that is used to estimate θ; we refer to f as an estimator function. An estimator function can be used to compute the probability distribution of a given sample.

The parameter space from which θ is drawn can be arbitrary, but will typically depend on the problem and the choice of f. For example, we might have $\theta \in \mathbb{N}$, where $\mathbb{N}$ is the set of natural numbers, or $\theta \in \mathbb{R}^k$, for some k, where $\mathbb{R}$ is the set of real numbers. For example, if θ represents the parameters of a normal distribution, then θ will have two real-valued dimensions, corresponding to the mean µ and variance $\sigma^2$. For the purpose of malware analysis, we restrict the parameter space to k-dimensional vectors with elements from the set of natural numbers, that is, $\theta \in \mathbb{N}^k$. The general form of P is assumed to be known.

Statistical testing is used to decide which hypothesis best fits an observed sequence of samples $(Y_1, Y_2, \ldots, Y_n)$. In statistical testing, initial hypotheses are proposed which are either accepted or rejected after measuring the likelihood of these hypotheses with respect to the probabilistic law P of Y.

The initial hypothesis is denoted $H_0$ and is referred to as the null hypothesis. The alternative hypothesis is denoted as $H_1$. The tests that we use will either accept or reject the null hypothesis. To construct such a test, we determine an estimator that separates possible test values into disjoint sets consisting of an acceptance region and a rejection region. Then, given a sample, its estimator value is computed and compared with the threshold to determine whether we accept or reject the null hypothesis.

There are two types of errors associated with this detection problem.

• A type I error occurs when we reject the null hypothesis, although the null hypothesis is correct. The probability of such an error is denoted as α and it corresponds to the false positive rate.

• A type II error occurs when we accept the null hypothesis, although the null hypothesis is false. This probability corresponds to the false negative rate.


2.7.2 Statistical Test for Malware Detection

To formalize the virus detection framework, we adopt notation from Chess and White [5]. A given antivirus system uses a detection algorithm D, which it applies to a program p. The goal of the algorithm is to determine whether p is infected by a particular virus V. Then D(p) should return true if and only if the program p is infected by the virus V. Filiol and Josse [11] further expand on this approach by providing a statistical framework to describe the detection process.

For malware analysis, we consider opcode instruction frequencies. We model the statistics of viruses from a given metamorphic family. Ideally, the spectrum of instructions from these family viruses would be obtained by analyzing instruction frequencies of all possible family viruses. However, for any realistic metamorphic generator, an exact spectrum is impossible to compute. As an approximation, we rely on instruction frequencies from a representative set of family viruses.

The related problem of modeling compilers is considered in [3]. Most compilers use a relatively small subset of all possible instructions, different compilers use different subsets, and the instructions that are common between the two will appear with different frequencies. We can make use of such observations to develop an estimator function which can be used to classify whether a given executable was generated with a particular compiler. This is analogous to the metamorphic detection problem, where a metamorphic generator can be considered a type of “compiler.”

The formal definition of the spectrum of a program is

$$\mathrm{spectrum}(C) = (I_i, n_i)_{1 \le i \le c} \tag{1}$$

where, in our case, C represents a particular metamorphic generator, $I_i$ represents the ith instruction, and $n_i$ is the frequency of instruction i. Here, c denotes the total number of unique instructions that may be output by the generator. For the 80x86 architecture, the total number of possible instructions is 501 [14], but, for a given metamorphic generator, only a relatively small subset of these instructions are likely to occur.

In our experiments, the spectrum in equation (1) represents the expected value of opcodes from a typical family virus. Consequently, for any family virus, we expect to observe instruction i at approximately frequency $n_i$.

Given an executable file that we want to classify, we first compute its spectrum. That is, we compute the instruction frequencies observed in this file; we denote the observed frequency for instruction i as $\tilde{n}_i$, to distinguish it from the expected frequency $n_i$. Then we rely on a statistical test to determine whether we should accept or reject the null hypothesis.
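Computing a spectrum amounts to counting opcodes. The sketch below assumes that each file has already been reduced to a list of opcode mnemonics; the helper names are hypothetical, and pooling counts over several family samples is only one plausible way to approximate the expected spectrum from a representative set.

```python
from collections import Counter


def spectrum(opcodes):
    """Map each distinct opcode in one file to its raw count."""
    return Counter(opcodes)


def family_spectrum(samples):
    """Pool opcode counts over a set of family samples.

    Pooling is an assumption of this sketch, standing in for the expected
    spectrum obtained from a representative set of family viruses.
    """
    pooled = Counter()
    for opcodes in samples:
        pooled.update(opcodes)
    return pooled


def normalize(counts):
    """Convert raw counts to relative frequencies that sum to 1."""
    total = sum(counts.values())
    return {op: n / total for op, n in counts.items()}


# Made-up opcode sequences standing in for disassembled files.
samples = [["MOV", "PUSH", "MOV", "POP"], ["PUSH", "PUSH", "MOV", "POP"]]
expected = family_spectrum(samples)
observed = spectrum(["MOV", "POP", "POP"])

print(expected)
print(normalize(observed))
```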

To make this process concrete, we need to specify the null hypothesis, an estimator function, and the decision threshold. The null hypothesis and alternative hypothesis are specified as

$$H_0 : \tilde{n}_i = n_i, \quad 1 \le i \le c$$

$$H_1 : \tilde{n}_i \ne n_i, \quad 1 \le i \le c$$


respectively. Note that the null hypothesis states that the observed frequencies match the expected frequencies. If these frequencies are the same, then the suspected file is likely a family virus, and the null hypothesis will be accepted. However, if the frequencies are significantly different, then the alternative hypothesis will be accepted, which means that the suspect file is classified as benign.

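However, we cannot expect an exact match of any given spectrum to the expected spectral values. Filiol and Josse [11] suggest using Pearson’s $\chi^2$ statistical test. This test, which we denote as $D^2$, is given by

$$D^2 = \sum_{i=1}^{c} \frac{(\tilde{n}_i - n_i)^2}{n_i},$$

where $\tilde{n}_i$ is the observed frequency and $n_i$ is the expected frequency of instruction i.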

Pearson’s $\chi^2$ statistic is commonly used to determine whether the difference between the expected and observed data is significant. The decision threshold is obtained by comparing the estimator value given by $D^2$ to the $\chi^2(\alpha, c-1)$ distribution, that is, a chi-squared distribution with $c-1$ degrees of freedom and a type I error rate of α. Typically, the type I error rate is set to 0.05, which means that the test tolerates no more than 5% of the virus files being misclassified as benign.

Using Pearson’s $\chi^2$ statistic, the null hypothesis and the alternative hypothesis are

$$\begin{aligned} H_0 &: D^2 \le \chi^2(\alpha, c-1) \\ H_1 &: D^2 > \chi^2(\alpha, c-1) \end{aligned} \tag{2}$$

The bottom line here is that the $\chi^2$ statistical test gives us a practical means to determine whether an observed spectrum matches a given distribution.

To summarize, the following steps need to be performed to implement Pearson’s $\chi^2$ test for malware detection.

1. Specify the null hypothesis H0 and the alternative hypothesis H1.

2. Choose a significance level (we use α = 0.05).

3. Compute the estimator value $D^2$ given the opcode frequencies in the file under consideration (that is, $\tilde{n}_i$ for i = 1, 2, ..., c) and the spectrum that corresponds to the null hypothesis (that is, $n_i$ for i = 1, 2, ..., c).

4. Determine whether to accept or reject the null hypothesis $H_0$ based on equation (2).

2.7.3 Example

To illustrate the statistical test discussed above, we consider a simplified problem involving only three instructions $I_{i_1}$, $I_{i_2}$, $I_{i_3}$. Suppose that for the family viruses under consideration, instruction $I_{i_1}$ has frequency $n_{i_1}$, and similarly $I_{i_2}$ and $I_{i_3}$ have frequencies $n_{i_2}$ and $n_{i_3}$. The spectrum for this example appears in Table 2.


Table 2: Example family virus spectrum.

Instruction   Opcode   Frequency $n_i$
$i_1$         MOV      7
$i_2$         PUSH     10
$i_3$         POP      3

Table 3: Example frequencies of suspect program.

Instruction   Opcode   Frequency $\tilde{n}_i$
$i_1$         MOV      6
$i_3$         POP      11

The possible observations for this compiler are MOV, PUSH, and POP, and the corresponding frequencies are 7, 10, and 3. Since we can view the distribution of interest as a histogram over these three instructions, the parameter space of θ is $\mathbb{N}^3$.

Suppose that we have a suspect file that may or may not be a family virus. Given the instructions and the frequencies in Table 3, we would like to perform the $\chi^2$ test so as to classify the file as a family virus or benign.

The null hypothesis $H_0$ is that the file is a family virus, which we accept provided that the estimator function $D^2$ yields a score less than or equal to the $\chi^2$ value. That is,

$$D^2 = \sum_{i=1}^{c} \frac{(\tilde{n}_i - n_i)^2}{n_i} \le \chi^2(\alpha, c-1).$$

When computing the estimator $D^2$, we use the opcode frequency counts for each instruction. However, since $\chi^2$ is a probability distribution, the frequencies are normalized before performing this test; normalization is done by simply dividing the count of an instruction by the total number of instructions. The normalized values for the spectrum in Table 2 are (MOV, 0.35), (PUSH, 0.5), and (POP, 0.15). The normalized values for the file under consideration (see Table 3) are (MOV, 0.353) and (POP, 0.647). Therefore,

$$D^2 = \frac{(0.353 - 0.35)^2}{0.35} + \frac{(0.0 - 0.5)^2}{0.5} + \frac{(0.647 - 0.15)^2}{0.15} \cong 2.1467.$$

We compare $D^2 = 2.1467$ to $\chi^2(0.05, 2) = 5.991$. Since $D^2 \le \chi^2(0.05, 2)$, we accept the null hypothesis, that is, we classify the file as a family virus.
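The worked example can be checked numerically. The following sketch recomputes $D^2$ from the counts in Tables 2 and 3; the function name `chi_squared_distance` is hypothetical, and the critical value 5.991 is simply the one quoted in the text. Because the code uses unrounded normalized frequencies, its result (about 2.147) differs from the 2.1467 above in the fourth decimal place, since that figure uses values rounded to three decimals.

```python
def chi_squared_distance(expected, observed):
    """Pearson chi-squared distance between two opcode count tables.

    Both tables are normalized to relative frequencies first. The sum runs
    over the opcodes of the expected (family) spectrum; an opcode missing
    from the observed file contributes with observed frequency 0.
    """
    exp_total = sum(expected.values())
    obs_total = sum(observed.values())
    d2 = 0.0
    for opcode, count in expected.items():
        e = count / exp_total
        o = observed.get(opcode, 0) / obs_total
        d2 += (o - e) ** 2 / e
    return d2


family = {"MOV": 7, "PUSH": 10, "POP": 3}   # Table 2
suspect = {"MOV": 6, "POP": 11}             # Table 3
CRITICAL_VALUE = 5.991                      # chi^2(0.05, 2), as quoted in the text

d2 = chi_squared_distance(family, suspect)
is_family_virus = d2 <= CRITICAL_VALUE      # accept H0 when D^2 is below the threshold

print(round(d2, 4), is_family_virus)
```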

3 Proposed Virus Detector

We propose a metamorphic detection method that combines an HMM detector (as developed in [30] and further extended and analyzed in [16]) with a chi-squared distance (CSD) estimator, as discussed above in Section 2.7. We refer to this combination of HMM and CSD as a hybrid method.

Experiments show that the HMM detector performs extremely well in detecting viruses morphed with short sequences of code, whether the code is randomly selected or carefully chosen to be statistically similar to benign code [9]. The same high level of detection is achieved even when the short sequences of code are taken directly from benign files [16]. However, when long contiguous blocks (e.g., subroutines) from benign files are inserted into morphed viruses, the HMM detector fails at a relatively low percentage of such code [16].

Intuitively, the performance of our CSD estimator should not be affected by the length of the code blocks copied from benign files; only the total amount of such code should matter. Our goal is to experimentally verify this intuition and develop a hybrid model that will outperform both the HMM and CSD detectors.

Below, we consider ways to combine the HMM and CSD scores to obtain a score for the hybrid model. But first, we convert each of these scores into probabilities.

Let $P_{\mathrm{HMM}}(X)$ be the probability that corresponds to an HMM score for X and let $P_{\chi^2}(X)$ be the probability that corresponds to a CSD score for X. We can directly compute $P_{\mathrm{HMM}}(X)$ from the score; see [22] for the details. To compute a probability for our CSD score, we use the fact that it is related to the $\chi^2$ distribution and make use of the $\chi^2$ cumulative distribution function (CDF). We can write

$$P_{\chi^2}(X) = P(Y < D^2) = \mathrm{CDF}_{\chi^2}(D^2) = \frac{1}{\Gamma(k/2)}\,\gamma(k/2,\, D^2/2)$$

where Γ is the well-known gamma function¹

$$\Gamma(z) = \int_0^{\infty} t^{z-1} e^{-t}\, dt$$

and k represents the degrees of freedom, which in our case is one less than the number of unique instructions encountered in the training phase. The function $\gamma(s, x)$ is the lower incomplete gamma function, which is given by

$$\gamma(s, x) = \int_0^{x} t^{s-1} e^{-t}\, dt.$$
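The $\chi^2$ CDF above can be evaluated without external libraries. The sketch below uses the standard power series for the lower incomplete gamma function; the function names and the tolerance parameters are assumptions of this example, and the series is only intended for the moderate arguments that arise in a setting like this one (a numerical library would normally be used instead).

```python
import math


def regularized_lower_gamma(s, x, tol=1e-12, max_terms=10_000):
    """Return gamma(s, x) / Gamma(s) for s > 0, using the power series

        gamma(s, x) = x^s * e^(-x) * sum_{n >= 0} x^n / (s (s+1) ... (s+n)).
    """
    if x <= 0.0:
        return 0.0
    term = 1.0 / s          # n = 0 term of the series
    total = term
    for n in range(1, max_terms):
        term *= x / (s + n)
        total += term
        if term < tol * total:
            break
    # Work in log space so that x^s, e^(-x) and Gamma(s) do not overflow.
    log_value = s * math.log(x) - x - math.lgamma(s) + math.log(total)
    return min(1.0, math.exp(log_value))


def chi2_cdf(d2, k):
    """CDF of the chi-squared distribution with k degrees of freedom at d2."""
    return regularized_lower_gamma(k / 2.0, d2 / 2.0)


# With 2 degrees of freedom, the 0.05 critical value quoted earlier (5.991)
# should correspond to a CDF value very close to 0.95.
print(chi2_cdf(5.991, 2))
```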

With these probabilities in hand, the probability for a hybrid model could be computed as

$$P(X) = P_{\mathrm{HMM}}(X) \cdot P_{\chi^2}(X).$$

¹ The Γ function can be viewed as a generalization of the factorial function; for positive integers n, we have Γ(n) = (n − 1)!.


To allow for different weightings of the component probabilities, we define

$$P_{\mathrm{hybrid}}(X) = P_{\mathrm{HMM}}^{\,w_1}(X) \cdot P_{\chi^2}^{\,w_2}(X)$$

where $w_1$ and $w_2$ are to be determined. Finally, for numerical stability, we transform $P_{\mathrm{hybrid}}(X)$ to its corresponding log-likelihood form

$$\log P_{\mathrm{hybrid}}(X) = w_1 \cdot \log P_{\mathrm{HMM}}(X) + w_2 \cdot \log P_{\chi^2}(X). \tag{3}$$

The values we used for $w_1$ and $w_2$ in equation (3) were determined by a grid search over the range 0 to 1. For this search, we conducted 100 experiments with $w_1$ and $w_2$ varying independently, on a logarithmic scale. For each case, we computed results analogous to those described in Section 4. From this process, we found that our best results were obtained using the values

$$w_1 = 10^{-8} \quad \text{and} \quad w_2 = 10^{-9}$$

and, consequently, these weights are used for all experiments discussed in the next section.
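Equation (3) translates directly into code. In the sketch below, the function name, the `floor` guard against log(0), and the particular 10 × 10 grid are assumptions made for illustration; the text states only that 100 weight pairs on a logarithmic scale were tried, not which ones.

```python
import math


def hybrid_log_score(p_hmm, p_chi2, w1=1e-8, w2=1e-9, floor=1e-300):
    """Equation (3): w1 * log P_HMM(X) + w2 * log P_chi2(X).

    `floor` is an assumption of this sketch; it keeps log() defined when a
    component probability underflows to zero.
    """
    return w1 * math.log(max(p_hmm, floor)) + w2 * math.log(max(p_chi2, floor))


# Hypothetical logarithmic grid over (0, 1]: 10 values per weight, 100 pairs.
candidate_weights = [10.0 ** -e for e in range(10)]
weight_grid = [(w1, w2) for w1 in candidate_weights for w2 in candidate_weights]

# Score one made-up pair of component probabilities under every weight pair.
scores = {(w1, w2): hybrid_log_score(0.5, 0.25, w1, w2) for w1, w2 in weight_grid}
print(len(scores))
```

In a real grid search, each weight pair would be evaluated by the classification accuracy it yields on held-out data rather than by a single score.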

4 Validation and Datasets

In this section, we discuss the validation method used and the metrics that we employ to compare our experimental results. Then we briefly cover the datasets used in our experiments.

4.1 Cross-Validation

Cross-validation, or rotational estimation, is used to enhance the statistical validity of a limited data set [12]. This approach enables us to have sufficient training data, while also obtaining meaningful results in the testing phase.

Since testing data must be disjoint from training data, a limited data set implies that only a relatively small amount of data may be available for testing. Hence, we may have insufficient data to obtain meaningful test results. With cross-validation, the training data and the testing data are selected so that they do not intersect, and the experiment is repeated multiple times on different subsets of data. This results in many more test cases, which will tend to improve the overall reliability of the results.

The approach that we employ here is five-fold cross-validation, where the data is divided into five equal subsets. Then four subsets are used for training with the remaining subset reserved for testing. This process is repeated five times, with a different subset reserved for testing each time; each such selection is referred to as a “fold.” For each fold, the evaluation performance is recorded and all folds are used to estimate the performance of a particular experiment.
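A five-fold split of this kind can be sketched as follows. The helper name `k_fold_splits`, the shuffling step, and the fixed seed are assumptions of this example rather than details taken from the experiments; the file names are placeholders.

```python
import random


def k_fold_splits(items, k=5, seed=0):
    """Yield (train, test) lists for k-fold cross-validation.

    Items are shuffled once, dealt into k folds, and each fold is used
    exactly once as the test set while the other k - 1 folds form the
    training set.
    """
    shuffled = list(items)
    random.Random(seed).shuffle(shuffled)
    folds = [shuffled[i::k] for i in range(k)]
    for i in range(k):
        test = folds[i]
        train = [x for j, fold in enumerate(folds) if j != i for x in fold]
        yield train, test


# Placeholder names for a set of 200 files.
files = [f"virus_{n:03d}" for n in range(200)]
for fold_number, (train, test) in enumerate(k_fold_splits(files), start=1):
    print(fold_number, len(train), len(test))
```

With 200 items and five folds, each fold trains on 160 items and tests on the remaining 40, and every item is tested exactly once.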

Next, we discuss the metrics we use to measure the accuracy of our classifiers.


4.2 Evaluation Metrics

Here, we briefly discuss false positive and false negative rates, and explain how we combine these into a measure of accuracy. We also consider ROC curves and their role in evaluating our metamorphic detection technique.

4.2.1 Accuracy Measure

There are four possible outcomes for detection, namely, true positive (TP), false positive (FP), true negative (TN), and false negative (FN). A detection is considered a true positive when a virus is correctly classified as a virus, whereas it is a true negative when a benign file is correctly marked as benign. Of course, TP and TN are desirable outcomes. A false positive occurs when a benign file is mistakenly classified as a virus. Similarly, a false negative occurs when a virus file is not detected as a virus, but instead is classified as benign. Table 4 shows these four possible outcomes.

Table 4: Possible outcomes for detection.

                         Predicted Class
                         Virus             Benign
Actual Class   Virus     True Positive     False Negative
               Benign    False Positive    True Negative

Short of perfect detection, there is an inherent tradeoff between the FP and FN rates. As an aside, we note that commercial antivirus makers generally try to avoid false positives, although this leads to significantly higher false negative rates. Users tend to notice (and be very unhappy) whenever an uninfected file is identified as malicious, but they tend to be more forgiving of (or may not even notice) an occasional false negative [4]. In any case, we would like to avoid both false positives and false negatives.

The overall success rate, or accuracy rate, is the fraction of correct classifications obtained from the total number of files tested. The accuracy rate is given by

$$\text{Accuracy Rate} = \frac{TP + TN}{TP + TN + FP + FN}. \tag{4}$$

The error rate is one minus this accuracy rate, that is,

$$\text{Error Rate} = 1 - \text{Accuracy Rate} = 1 - \frac{TP + TN}{TP + TN + FP + FN}. \tag{5}$$

Since we performed five-fold cross-validation, we have a total of five models for each experiment. Each of these folds will yield an accuracy rate. Therefore, we use the mean of the accuracy values obtained from the five folds, which we refer to as the mean maximum accuracy (MMA) rate. This rate is computed as

$$\text{MMA} = \frac{1}{5} \sum_{i=1}^{5} \text{Accuracy Rate}_i \tag{6}$$

where $\text{Accuracy Rate}_i$ denotes the accuracy rate for the ith fold in the five-fold cross-validation.
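Equations (4) through (6) can be written as three short functions. The function names and the per-fold outcome counts below are made up purely to exercise the formulas; they are not experimental results.

```python
def accuracy_rate(tp, tn, fp, fn):
    """Equation (4): fraction of all tested files that were classified correctly."""
    return (tp + tn) / (tp + tn + fp + fn)


def error_rate(tp, tn, fp, fn):
    """Equation (5): one minus the accuracy rate."""
    return 1.0 - accuracy_rate(tp, tn, fp, fn)


def mean_accuracy(fold_counts):
    """Equation (6): mean of the per-fold accuracy rates."""
    rates = [accuracy_rate(*counts) for counts in fold_counts]
    return sum(rates) / len(rates)


# Hypothetical (TP, TN, FP, FN) counts for five folds.
folds = [(38, 8, 0, 2), (40, 7, 1, 0), (39, 8, 0, 1), (40, 8, 0, 0), (37, 6, 2, 3)]

print(accuracy_rate(*folds[0]))
print(error_rate(*folds[0]))
print(mean_accuracy(folds))
```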

4.2.2 Receiver Operating Characteristic

The receiver operating characteristic (ROC) was developed for applications related to signal detection [10]. However, ROC curves have gained popularity in the machine learning community as a tool for evaluating the performance of algorithms [2, 25].

ROC curves are typically represented as two-dimensional plots. For the virus detection problem under consideration here, we let the x-axis represent the false positive rate and the y-axis represent the true positive rate. Figure 5 illustrates an example with several ROC curves plotted on the same axes.

Figure 5: Example ROC curves.

An algorithm that performs random classification with 50% accuracy will generate a diagonal line in the ROC space. Results closer to the top left area of the graph represent improved classification with a higher true positive rate. For example, the blue diamond line in Figure 5 represents perfect classification, i.e., a 100% true positive rate with a 0% false positive rate. The line indicated by red squares corresponds to a classification algorithm that achieves about a 78% true positive rate with a 10% false positive rate. The green triangle line indicates essentially random classification. In Section 5, we give ROC curves for an HMM-based metamorphic detector and our proposed CSD detector.

Figure 6: Process for obtaining opcode sequences.

Figure 7: Process for generating metamorphic viruses.
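To make the construction of an ROC curve concrete, the sketch below sweeps a threshold over a set of detector scores and records the (false positive rate, true positive rate) pair at each step. It assumes that higher scores indicate a more virus-like file; the function name `roc_points` and the sample scores are hypothetical.

```python
def roc_points(scores, labels):
    """Return (FPR, TPR) points as the threshold is lowered.

    labels[i] is True for a virus. Files with tied scores are
    processed together so that each threshold yields one point.
    """
    positives = sum(labels)
    negatives = len(labels) - positives
    if positives == 0 or negatives == 0:
        raise ValueError("need at least one virus and one benign file")

    ranked = sorted(zip(scores, labels), key=lambda pair: -pair[0])
    points = [(0.0, 0.0)]  # threshold above every score
    tp = fp = 0
    i = 0
    while i < len(ranked):
        j = i
        while j < len(ranked) and ranked[j][0] == ranked[i][0]:
            if ranked[j][1]:
                tp += 1
            else:
                fp += 1
            j += 1
        points.append((fp / negatives, tp / positives))
        i = j
    return points

# Hypothetical scores for two viruses and two benign files.
points = roc_points([0.9, 0.8, 0.7, 0.4], [True, False, True, False])
print(points)

# A perfect detector passes through the top left corner (0, 1).
perfect = roc_points([0.9, 0.8, 0.2, 0.1], [True, True, False, False])
```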

4.3 Datasets

The basic malware dataset used in this research consists of 200 Next Generation Virus Construction Kit, or NGVCK, metamorphic family viruses [27]. For our representative benign files, we selected 40 Cygwin utility files [8]. The NGVCK files were selected since previous research has shown that these viruses are highly metamorphic, and they have served as the basis for previous HMM-based metamorphic detection [16, 29].

Each executable file was disassembled using IDA Pro [13] and opcode sequences were extracted for use in our experiments. Figure 6 illustrates the process.
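The opcode extraction step can be illustrated with a small sketch. The listing format below is a simplified, hypothetical one (one instruction per line, labels ending in a colon, comments after a semicolon); it is not the actual output format of IDA Pro, and the function name `extract_opcodes` is an assumption made for illustration.

```python
def extract_opcodes(listing):
    """Return the sequence of opcode mnemonics from a simplified listing."""
    opcodes = []
    for line in listing.splitlines():
        line = line.split(";", 1)[0].strip()  # drop trailing comments
        if not line or line.endswith(":"):    # skip blank lines and labels
            continue
        # The mnemonic is the first token; operands are discarded.
        opcodes.append(line.split()[0].lower())
    return opcodes

# Hypothetical disassembly fragment.
listing = """
start:
    push ebp        ; save frame pointer
    mov  ebp, esp
    sub  esp, 8
    CALL helper
    ret
"""
opcodes = extract_opcodes(listing)
print(opcodes)
```

Only the mnemonics are retained, so the fragment above reduces to the sequence push, mov, sub, call, ret.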

For many of our experiments, the opcode sequences from the NGVCK virus files were further morphed, using a metamorphic generator similar to that in [16]. Figure 7 illustrates the basic steps taken to generate the various metamorphic viruses.


Table 5: Experiments.

Training Dataset
Original NGVCK viruses
NGVCK morphed with 10% dead code
NGVCK morphed with 10% subroutine code

While the NGVCK files can be detected using an HMM-based approach [29], the generator in [16] is able to defeat the HMM detector. Since our goal is to improve on the HMM detection results, these tools provide us with the material needed to compare our CSD detector and hybrid approach with the HMM detector, as well as to validate our HMM detector based on previous work.

5 Experimental Results

In this section, we present the results of our experiments using an HMM-based detector, a chi-squared distance (CSD) estimator, and our proposed hybrid virus detector. As discussed in the previous section, we use ROC curves and the mean maximum accuracy (MMA) rates to evaluate the performance of each of the three detectors.

Here, we refer to short segments of benign code that are interspersed throughout the file as “dead code.” For the case where a long contiguous block is used, we refer to the inserted code as “subroutine code.” For each of the three experiments listed in Table 5, the parameters for our metamorphic generator were set to generate increased dead code and increased subroutine code, both in increments of 10%, up to a maximum of 40%. With five levels (0% through 40%) for each parameter, this yields 25 combinations of parameter values per experiment, with each combination employing 200 distinct metamorphic virus files (and 40 benign files), and each of these using five-fold cross-validation.
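The distinction between dead code and subroutine code, and the resulting grid of 25 test settings, can be sketched as follows. This is a simplified illustration and not the metamorphic generator used in our experiments: it assumes that the insertion percentage is measured relative to the original virus length, it inserts dead code one opcode at a time, and all function and variable names are hypothetical.

```python
import random
from itertools import product

def insert_dead_code(virus, benign, fraction, rng):
    """Scatter benign opcodes at random positions throughout the virus."""
    budget = round(fraction * len(virus))
    morphed = list(virus)
    for _ in range(budget):
        position = rng.randint(0, len(morphed))  # bounds are inclusive
        morphed.insert(position, rng.choice(benign))
    return morphed

def insert_subroutine(virus, benign, fraction, rng):
    """Insert one contiguous block of benign opcodes into the virus."""
    budget = min(round(fraction * len(virus)), len(benign))
    start = rng.randint(0, len(benign) - budget)
    block = benign[start:start + budget]
    position = rng.randint(0, len(virus))
    return virus[:position] + block + virus[position:]

# Five levels for each parameter give the 5 x 5 = 25 combinations.
levels = [0.0, 0.1, 0.2, 0.3, 0.4]
grid = list(product(levels, levels))

# Hypothetical opcode sequences: 100 virus opcodes, 50 benign opcodes.
rng = random.Random(0)
virus = ["mov", "push", "call", "pop"] * 25
benign = ["add", "sub", "cmp", "jmp", "lea"] * 10
with_dead = insert_dead_code(virus, benign, 0.1, rng)
with_block = insert_subroutine(virus, benign, 0.1, rng)
print(len(grid), len(with_dead), len(with_block))
```

Both morphed sequences contain the same amount of inserted benign code; they differ only in whether that code is dispersed or contiguous.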

All experiments presented here rely on the datasets and procedures discussed in Section 4. Additional experimental results can be found in [26].

5.1 Training on NGVCK Viruses

In this experiment, we used NGVCK metamorphic virus files as our base viruses. That is, for each training set, we used the NGVCK viruses without further morphing. For viruses in the test sets, various levels of morphing (dead code insertion and/or subroutine insertion) were applied.

Table 6 contains MMA results for the HMM detector, the CSD estimator, and our hybrid model. For each row in the table, the score in boldface is the best detection rate for the specified level of morphing. For this case, relevant ROC curves appear in Figure 8.


For this test case, we observe that the HMM results are extremely good, with the hybrid results being only marginally better. The CSD results are comparable to the HMM at lower levels of morphing, but at high levels of morphing, the HMM is superior. We also note that CSD tends to perform somewhat better, relative to the HMM, at higher subroutine insertion rates, although it is still below the level of the HMM.

5.2 Training Set Morphed with 10% Dead Code

For this experiment, the training dataset consists of NGVCK virus files that were further morphed by inserting 10% dead code. This experiment simulates the case where the base viruses are more highly morphed, with the additional morphing consisting of dead code selected from normal files. The purpose of inserting such code would be to evade statistical-based detection strategies. Table 7 summarizes the MMA scores obtained by the HMM detector, the CSD estimator, and the hybrid model. For this case, relevant ROC curves appear in Figure 9.

In this case, the HMM results are somewhat stronger, relative to the CSD, than in the previous experiment. As in the previous experiment, we see that the hybrid model offers some improvement over the HMM classifier.

5.3 Training Set Morphed with 10% Subroutine Code

As in the previous experiment, the base metamorphic viruses were morphed by an additional 10% of code, with the morphing code taken from benign files. However, in this case, entire subroutines were extracted from benign files. This represents the situation where the metamorphic viruses are more highly morphed than NGVCK, with the additional morphing consisting of contiguous blocks of code from benign files. The results for this experiment are given in Table 8. For this case, relevant ROC curves appear in Figure 10.

For this experiment, the HMM detection rates are poor, which is consistent with previous research [16]. But, the CSD performs well and the hybrid results improve (slightly) on the CSD.

5.4 Discussion

As expected, for the CSD detector, it makes little difference whether the benign code is dispersed throughout the file (dead code), or inserted as long contiguous blocks (subroutine code). But, this is not the case for the HMM, which fails at relatively low percentages when the inserted benign code is in the form of contiguous blocks.
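The insensitivity of a chi-squared measure to the placement of inserted code follows from the fact that it depends only on opcode counts, not on opcode order. The sketch below illustrates this with the textbook Pearson chi-squared statistic. It is not necessarily identical to the CSD estimator defined earlier in this paper, the expected frequencies are hypothetical, and opcodes absent from the expected distribution are simply ignored as a simplification.

```python
from collections import Counter

def chi_squared_statistic(opcodes, expected_freq):
    """Pearson chi-squared statistic between observed opcode counts
    and the counts expected under a reference distribution."""
    n = len(opcodes)
    observed = Counter(opcodes)
    statistic = 0.0
    for opcode, probability in expected_freq.items():
        expected = n * probability
        if expected > 0:
            statistic += (observed[opcode] - expected) ** 2 / expected
    return statistic

# Hypothetical reference distribution of opcode frequencies.
expected_freq = {"mov": 0.5, "push": 0.25, "add": 0.25}

# The same two benign "add" opcodes, dispersed versus contiguous.
dispersed = ["mov", "add", "mov", "push", "add", "push"]
contiguous = ["mov", "mov", "add", "add", "push", "push"]

stat_dispersed = chi_squared_statistic(dispersed, expected_freq)
stat_contiguous = chi_squared_statistic(contiguous, expected_freq)
print(stat_dispersed, stat_contiguous)
```

Both sequences contain the same multiset of opcodes, so the two statistics are identical. An HMM, in contrast, models transitions between consecutive opcodes and is therefore affected by where the benign code is placed.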

These results show that the HMM and CSD detectors provide significantly different statistical measures. Most importantly from the perspective of metamorphic detection, we have shown that it is possible to combine these two detectors to obtain a hybrid detector that is stronger than either individual detector.


6 Conclusion

In this paper, we considered a hybrid metamorphic detection strategy that employs both a machine learning (HMM) component and a statistical analysis (CSD) component. We showed that our hybrid detector generally outperforms either individual technique. Our hybrid approach overcomes a significant weakness in HMM-based metamorphic detection that was identified in previous research [16, 29].

Consistent with previous research, we found that the HMM detector performs well when benign code is inserted in small blocks, but does much worse when the morphing consists of contiguous blocks. As expected, the CSD estimator had similar performance in both cases. The overall high level performance of the CSD detector was somewhat surprising, although the hybrid approach is superior in most cases. But, given the simplicity of the CSD detector and the fact that its detection rates are generally close to those of the hybrid model, the CSD detector might be preferable in practice.

Future work could include the investigation of more statistical models and evaluations of their performance. It might also be worthwhile to investigate other methods of combining two (or more) scores into a hybrid model. Also, additional tests with other metamorphic generators and morphing strategies could prove interesting.

References

[1] A. V. Aho and M. J. Corasick, Efficient string matching: An aid to bibliographic search, Communications of the ACM, Vol. 18, pp. 333–340, 1975.

[2] K. Ataman, W. N. Street, and Y. Zhang, Learning to rank by maximizing AUC with linear programming, IEEE Technical Report, 2006.

[3] T. H. Austin, E. Filiol, S. Josse, and M. Stamp, Exploring hidden Markov models for virus analysis: A semantic approach, submitted for publication, 2012.

[4] J. Aycock, Computer Viruses and Malware, Springer, 2006.

[5] D. Chess and S. White, An undetectable computer virus, Virus Bulletin Conference, 2000.

[6] J. Borello and L. Me, Code obfuscation techniques for metamorphic viruses, Journal in Computer Virology, Vol. 4, No. 3, pp. 211–220, August 2008.

[7] F. Coulter and K. Eichorn, A good decade for cybercrime, McAfee, Inc., Technical Report, 2011.

[8] Cygwin, September 2011, [online], at http://www.cygwin.com

[9] P. Desai and M. Stamp, A highly metamorphic virus generator, International Journal of Multimedia Intelligence and Security, Vol. 1, No. 4, pp. 402–427, 2010.

[10] J. Egan, Signal Detection Theory and ROC Analysis, Academic Press, Inc., 1975.

[11] E. Filiol and S. Josse, A statistical model for undecidable viral detection, Journal in Computer Virology, Vol. 3, No. 1, pp. 65–74, April 2007.

[12] S. Geisser, Predictive Inference: An Introduction, Chapman and Hall, 1993.

[13] IDA Pro, Interactive disassembler, 2011, [online], at http://www.hex-rays.com/products/ida/index.shtml

[14] Intel, Intel® Architecture Software Developer’s Manual, Volume 2, Instruction Set Reference Manual, October 2011.

[15] J. Kolter and M. Maloof, Learning to detect malicious executables in the wild, Proceedings of KDD ’04, 2004.

[16] D. Lin and M. Stamp, Hunting for undetectable metamorphic viruses, Journal in Computer Virology, Vol. 7, No. 3, pp. 201–214, August 2011.

[17] T. Mitchell, Machine Learning, McGraw Hill, 1997.

[18] M. Schultz, E. Eskin, and E. Zadok, Data mining methods for detection of new malicious executables, Proceedings of IEEE International Conference on Data Mining, 2001.

[19] S. Madenur Sridhara, Metamorphic worm that carries its own morphing engine, Master’s Projects, Paper 240, 2012, http://scholarworks.sjsu.edu/etd_projects/240

[20] S. Madenur Sridhara and M. Stamp, Metamorphic worm that carries its own morphing engine, submitted for publication.

[21] M. Stamp, Information Security: Principles and Practice, 2nd edition, Wiley, 2011.

[22] M. Stamp, A revealing introduction to hidden Markov models, [online], at www.cs.sjsu.edu/faculty/RUA/HMM.pdf

[23] P. Szor, The Art of Computer Virus Research and Defense, Addison-Wesley Professional, 2005.

[24] P. Szor and P. Ferrie, Hunting for metamorphic, Virus Bulletin, pp. 123–144, 2001.

[25] S. Thrun, L. K. Saul, and B. Scholkopf, Eds., AUC Optimization vs. Error Rate Minimization, MIT Press, 2004.

[26] A. H. Toderici, Chi-squared distance and metamorphic virus detection, Master’s Thesis, Department of Computer Science, San Jose State University, May 2012.

[27] Vx heavens, [online], at http://www.vx.netlux.org/

[28] R. Wang, Flash in the pan? Virus Bulletin, July 1998.

[29] W. Wong, Analysis and detection of metamorphic computer viruses, Master’s Thesis, Department of Computer Science, San Jose State University, May 2006.

[30] W. Wong and M. Stamp, Hunting for metamorphic engines, Journal in Computer Virology, Vol. 2, No. 3, pp. 211–229, December 2006.


Table 6: MMA results training on NGVCK viruses.

Dead Code   Subroutine   HMM       CSD       Hybrid
0%          0%           99.75%    99.50%    99.75%
0%          10%          87.50%    85.81%    88.96%
0%          20%          82.00%    81.58%    84.20%
0%          30%          79.75%    80.35%    81.69%
0%          40%          77.25%    76.84%    78.94%
10%         0%           92.00%    90.81%    93.48%
10%         10%          86.00%    83.87%    88.47%
10%         20%          83.00%    79.60%    84.96%
10%         30%          78.50%    76.90%    83.46%
10%         40%          77.50%    74.89%    80.19%
20%         0%           91.50%    86.39%    92.98%
20%         10%          85.50%    82.66%    88.47%
20%         20%          82.75%    78.91%    85.71%
20%         30%          80.25%    75.91%    82.20%
20%         40%          77.75%    74.41%    79.95%
30%         0%           90.75%    81.91%    92.22%
30%         10%          86.50%    79.16%    88.47%
30%         20%          82.25%    76.47%    85.71%
30%         30%          81.50%    74.16%    82.20%
30%         40%          79.50%    73.21%    82.20%
40%         0%           90.25%    77.82%    91.47%
40%         10%          85.75%    76.00%    89.22%
40%         20%          83.00%    73.97%    84.96%
40%         30%          81.25%    73.23%    84.21%
40%         40%          80.50%    71.46%    82.20%
Average                  84.09%    79.43%    86.25%


Figure 8: ROC curves for the HMM and CSD detectors, trained on NGVCK viruses, tested on viruses morphed with dead code or subroutine code. Panels: (a) HMM: dead code; (b) HMM: subroutine code; (c) CSD: dead code; (d) CSD: subroutine code.


Table 7: MMA results training on NGVCK morphed 10% dead code.


Dead Code   Subroutine   HMM       CSD       Hybrid
0%          0%
0%          10%          87.75%    80.75%    88.46%
0%          20%          81.75%    76.00%    83.95%
0%          30%          80.00%    73.50%    81.18%
0%          40%          76.75%    71.50%    77.18%
10%         0%           99.75%    95.75%    99.24%
10%         10%          90.75%    86.75%    90.47%
10%         20%          85.75%    80.00%    86.46%
10%         30%          81.75%    77.25%    83.46%
10%         40%          80.00%    75.25%    81.20%
20%         0%           99.75%    94.00%    98.49%
20%         10%          90.25%    87.25%    91.73%
20%         20%          86.25%    81.75%    88.21%
20%         30%          82.75%    78.25%    86.71%
20%         40%          81.00%    76.00%    84.20%
30%         0%           100.00%   93.00%    98.24%
30%         10%          93.50%    87.25%    93.47%
30%         20%          88.75%    82.75%    91.22%
30%         30%          84.00%    78.00%    87.46%
30%         40%          83.00%    77.75%    86.96%
40%         0%           100.00%   92.50%    97.99%
40%         10%          93.25%    88.00%    93.47%
40%         20%          87.75%    82.75%    90.97%
40%         30%          85.25%    81.50%    89.22%
40%         40%          83.75%    78.50%    88.46%
Average                  88.12%    82.72%    89.40%




Figure 9: ROC curves for HMM and CSD detectors, trained on viruses morphed with 10% dead code, tested on viruses morphed with dead code or subroutine code.


Table 8: MMA results training on NGVCK morphed 10% subroutine code.


Dead Code   Subroutine   HMM       CSD       Hybrid
0%          0%
0%          10%          60.50%    92.25%    94.97%
0%          20%          59.25%    90.00%    92.47%
0%          30%          58.25%    88.50%    90.72%
0%          40%          57.00%    86.75%    88.71%
10%         0%           61.75%    95.00%    97.23%
10%         10%          60.50%    92.50%    96.22%
10%         20%          59.50%    90.00%    93.72%
10%         30%          58.25%    88.75%    92.22%
10%         40%          59.00%    88.50%    91.22%
20%         0%           65.75%    92.00%    95.23%
20%         10%          64.25%    90.25%    94.23%
20%         20%          64.00%    89.50%    92.72%
20%         30%          62.25%    88.75%    91.96%
20%         40%          59.75%    88.25%    90.97%
30%         0%           68.75%    89.50%    91.96%
30%         10%          68.00%    89.75%    92.21%
30%         20%          67.00%    90.25%    92.46%
30%         30%          64.75%    88.50%    90.71%
30%         40%          63.75%    88.25%    90.96%
40%         0%           71.25%    87.50%    90.45%
40%         10%          70.25%    88.75%    91.46%
40%         20%          69.25%    88.75%    91.46%
40%         30%          66.00%    87.75%    90.96%
40%         40%          65.25%    87.25%    89.71%
Average                  63.46%    89.74%    92.48%




Figure 10: ROC curves for HMM and CSD detectors, trained on viruses morphed with 10% subroutine code, and tested on viruses morphed with dead code or subroutine code.

