Survey in Static Detection of Malware

Date post: 02-Dec-2014
Upload: silvio-cesare

Survey in Static Detection of Malware

Silvio Cesare

School of Information Technology

Deakin University

Burwood, Victoria 3125, Australia

<[email protected]>

ABSTRACT

Malware continues to be a significant problem facing computer

use in today’s world. Historically Antivirus software has

employed the use of static signatures to detect instances of known

malware. Signature based detection has fallen out of favour to

many, and detection techniques based on identifying malicious

program behavior are now part of the Antivirus toolkit. However,

static approaches to malware detection have been heavily

researched and can employ modern fingerprints that significantly

improve on the simple string signatures used in the past. Instance-based learning can allow the detection of an entire family of

malware variants based on a single signature of static features.

Statistical machine learning can turn the features extracted into a

predictive Antivirus system able to detect novel and previously

unseen malware samples. This paper surveys the approaches and

techniques used in static malware detection.

Keywords

Malware classification.

1. INTRODUCTION

Malicious software is a significant problem that threatens the

security of users on the internet. Today, malware is created by

criminal gangs for the purposes of financial gain. These criminals employ malware to steal credit card information to commit fraud, or to obtain illegal use of a computer to launch spam campaigns. A simple approach often used by criminals is to have innocent users open a malicious E-Mail attachment.

To protect users from malware, detection of the threat before it is

allowed to execute its malicious intent is a necessity. Behaviour

blocking is a useful approach, but relying solely on the dynamic

behaviour of a program may allow unwanted actions to be

performed before the malware is detected. Running a program in a

virtual machine or isolated sandbox to detect its intent is not

always effective. Dynamic analysis can never reason about all

potential behaviours. If the malware performs differently while

being analysed, or can detect the analysis itself, then the malware has a high probability of escaping detection. Static analysis and detection provide a further option in the arsenal of defences.

Static signature based detection has been a dominant feature in

Antivirus. Because of performance constraints, the most widely

used signature is a string containing patterns of the raw file

content [1, 2]. This allows for a string search [3] to quickly

identify patterns associated with known malware. However, these

patterns can easily be invalidated because minor changes to the

malware source code have significant effects on the malware’s

raw content. Thus, traditional signatures can prove ineffective

when dealing with unknown variants.

Modern approaches to signature generation involve less fragile

and more versatile fingerprints. Program features are extracted

that enable a more robust representation to detect an entire family

of malware variants. Machine learning and statistical

classification using those same program features can allow the

detection of novel and unknown malware not belonging to

previously identified families.

Static program analysis is undecidable for many problems

concerning binaries, and a transformation of a compiled program

known as code packing is often used by malware authors to hide

the intent of the malware and make analysis more difficult. The

packing process encrypts, compresses, or obfuscates the malware.

The original unobfuscated code is restored at run time, or in the

case of instruction virtualization, a byte code representing the

original code is executed. In most cases, unpacking is a

requirement for effective static malware classification and use of

signatures. Automated unpacking has been partly successful but

for those cases where it cannot be achieved, it is sometimes better

to mark those programs as likely to be malicious. Thus, even with

packed samples, static detection of malware can still be an

effective tool.

1.1 Structure of the Paper

The format of this paper is as follows: Section 2 describes the

taxonomy of program features useful to malware classification.

Section 3 compares those features. Section 4 describes the

approaches that can be employed in static malware classification

and Section 5 describes the specific techniques, organised by feature.

Section 6 identifies future trends. Finally, Section 7 concludes the

survey.

2. TAXONOMY OF STATIC PROGRAM FEATURES

Malware classification and detection involves the extraction of

features which are subsequently used to characterize the malware.

Features may be extracted dynamically or statically. Dynamic

approaches to malware classification involve monitoring

execution of the programs and extracting features based on their

behaviour. Static approaches extract features without program

execution.

2.1 Object File Header Attributes

The object file header contains attributes which are often custom

written during link editing and binary rewriting.


2.2 Bytes

One of the simplest features that can be extracted from a program is

the raw byte level content of the malware executable file [4]. An

alternative source of content comes from the individual program

sections in the binary, including the code and data segments.

2.3 Instructions

An executable program is constructed of code and data. The code

is represented as assembly language. Extracting the assembly is

the process of disassembling. The instruction level content of a

program can represent a more resilient form than the byte level

content if the instructions are considered by their type or

mnemonic representation [5].

2.4 Basic Blocks

A basic block is a straight line sequence of code without an

intervening control transfer instruction [6]. The basic block may

be treated at the byte level, or at the instruction level.

Additionally, data dependency within the basic block may be

examined to construct a directed acyclic graph [7]. The basic

blocks may also be grouped to form a set, or they may have

additional structure imposed by the control flow graph.
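As an illustration of the definition above, the following Python sketch splits a linear instruction listing into basic blocks using the classic leader rule. The tuple layout, the `BRANCHES` set and the function name are assumptions made for this example; the sample listing is loosely modelled on the loop in Figure 1, and call instructions are assumed not to terminate a block.

```python
# Hypothetical instruction model: (address, mnemonic, branch_target or None).
BRANCHES = {"jmp", "jle", "je", "jne", "ret"}  # assumed subset of control transfers

def split_basic_blocks(instructions):
    """Group a linear listing into basic blocks using the leader rule."""
    # Leaders: the first instruction, every branch target, and every
    # instruction that directly follows a control transfer.
    leaders = {instructions[0][0]}
    for i, (_, mnemonic, target) in enumerate(instructions):
        if mnemonic in BRANCHES:
            if target is not None:
                leaders.add(target)
            if i + 1 < len(instructions):
                leaders.add(instructions[i + 1][0])
    blocks, current = [], []
    for ins in instructions:
        if ins[0] in leaders and current:
            blocks.append(current)
            current = []
        current.append(ins)
    if current:
        blocks.append(current)
    return blocks

program = [
    (0x401146, "movl", None),
    (0x40114D, "jmp", 0x40115F),
    (0x40114F, "movl", None),
    (0x401156, "call", None),
    (0x40115B, "addl", None),
    (0x40115F, "cmpl", None),
    (0x401163, "jle", 0x40114F),
    (0x401165, "add", None),
    (0x40116D, "ret", None),
]
blocks = split_basic_blocks(program)
```

Each resulting block can then be treated as a byte string, an instruction sequence, or a node in the control flow graph.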

2.5 Control Flow Graphs

The control flow graph is a directed graph, where the nodes are

basic blocks [8]. The edges in the graph represent the possible

control flow of the associated procedure. The control flow graph

represents the intra-procedural control flow. A program may be

considered a set of control flow graphs, or the control flow graphs

may have additional structure as dictated by the call graph.

Alternatively, control flow graphs may represent inter-procedural

and intra-procedural control flow in a single graph. In this case,

the graph represents the inter-procedural control flow graph.

It is possible to construct alternative or abstracted representations

of the control flow graph. Loop nest trees, dominator trees, and

control dependency graphs can also be constructed [7] which are

different ways of representing control flow.

2.6 Call Graph

Call graphs, like control flow graphs, model the possible

execution paths and control flow in a program [9]. The call graph

is a directed graph representing the inter-procedural control flow.

Like the control flow graph, alternative or abstracted

representations are possible such as dominator trees.

2.7 API Calls

Programs interface with the underlying operating system and libraries. The invocation of an API function from a known library can often be identified statically [10]. The API call sequence gives insight into the behaviour of the program.

2.8 Data Flow

The data flow of a program represents the set of possible values

data may hold during program execution [11]. Many types of data

flow analyses exist, including live variable analysis, reaching

definitions, and value-set analysis. Each analysis looks at a

particular property of the data at specific program points.

Modelling the data flow requires that the control flow be

successfully identified. A simpler model of data dependencies can be built within a single basic block, as described in the basic block feature section.

2.9 Procedure Dependence Graph

A procedure dependence graph combines the control

dependencies and data dependencies of a procedure into a single

graph.

2.10 System Dependence Graph

The system dependence graph is a collection of procedure

dependence graphs, one for each procedure in the program.

3. COMPARISON OF STATIC PROGRAM FEATURES

Malware may be polymorphic, but some static program features are known to be invariant under particular polymorphic techniques.

Byte and instruction level program features perform poorly when

faced with the polymorphic variations and mutations.

Recompiling source code using different compile time options

may result in syntactic changes including variable renaming, and

instruction substitution. Code normalization can sometimes

reverse the effects of syntactic polymorphism and can work in

practice but is not based on a sound technique. Additionally, the byte and instruction stream may change when minor semantic alterations are made to the malware source code.

    Bytes                    Instructions
    8d 4c 24 04              lea 0x4(%esp),%ecx
    83 e4 f0                 and $0xfffffff0,%esp
    ff 71 fc                 pushl -0x4(%ecx)
    55                       push %ebp
    89 e5                    mov %esp,%ebp
    51                       push %ecx
    83 ec 24                 sub $0x24,%esp
    e8 6a 00 00 00           call 4011b0 <___main>
    c7 45 f8 00 00 00 00     movl $0x0,-0x8(%ebp)
    eb 10                    jmp 40115f <_main+0x2f>
    c7 04 24 a0 20 40 00     movl $0x4020a0,(%esp)
    e8 5d 00 00 00           call 4011b8 <_puts>
    83 45 f8 01              addl $0x1,-0x8(%ebp)
    83 7d f8 09              cmpl $0x9,-0x8(%ebp)
    7e ea                    jle 40114f <_main+0x1f>
    83 c4 24                 add $0x24,%esp
    59                       pop %ecx
    5d                       pop %ebp
    8d 61 fc                 lea -0x4(%ecx),%esp
    c3                       ret

    Basic blocks:

    movl $0x4020a0,(%esp)
    call 4011b8 <_puts>
    addl $0x1,-0x8(%ebp)

    lea 0x4(%esp),%ecx
    and $0xfffffff0,%esp
    pushl -0x4(%ecx)
    push %ebp
    mov %esp,%ebp
    push %ecx
    sub $0x24,%esp
    call 4011b0 <___main>
    movl $0x0,-0x8(%ebp)
    jmp 40115f <_main+0x2f>

    add $0x24,%esp
    pop %ecx
    pop %ebp
    lea -0x4(%ecx),%esp
    ret

    cmpl $0x9,-0x8(%ebp)
    jle 40114f <_main+0x1f>

Figure 1. Instructions and basic blocks.

The advantage of byte level content as a program feature is that it does not depend on accurate static analysis of the program's semantics or structure. If the instruction stream is

used, additional challenges are presented because it is known that

perfect disassembly of an unknown image is undecidable on the

x86 platform [12].

To avoid the problems of syntactic polymorphism, higher level

abstractions of the program can be used. The control flow features

including control flow graphs and call graphs are considered more

invariant in polymorphic malware than byte and instruction level

content [8]. However, opaque predicates - conditions that always

evaluate to the same result but are hard to determine statically -

may result in these features being altered. The detection of opaque

predicates has been investigated, but it is not evident that this is

entirely satisfactory, and a sound method of detection against all

unknown predicates is not possible. For example, it is known that some of the constructions used to build predicates rest only on strong conjectures that they always evaluate to the same result. This implies that an automated approach to prove that such a predicate is constant is hard.

The presence of pointers and indirection in assembly language

also presents problems for static analyses, which may not have the

precision required to construct a control flow graph or call graph

with the degree of accuracy required for malware classification.

For all its disadvantages, control flow has been shown to be an

effective feature that is invariant in most current malware.

The use of API calls is another approach to solve the syntactic

polymorphism problem. This approach has problems with

malware that obscures the use of those calls, as is the case of the

stolen bytes technique [13] introduced by code packing tools.

Data flow analysis is another high level abstraction, but in the presence of pointers it is hampered by the same precision problems that static analyses must face.

The procedure and system dependence graphs have similar

problems with pointers and indirection even when data

dependencies of pointers are ignored. The dependence graphs are

also dependent on accurate modelling of the instruction sequence.

This avoids problems such as register reassignment because the

data dependency is represented as a graph. The problem occurs

with the modelled instructions used in the data dependencies

which may be polymorphic and variant. Polymorphism is not

handled effectively in this situation although code normalization

may help.

4. STATIC APPROACHES TO MALWARE CLASSIFICATION

Malware classification is the process of determining if an

unknown binary belongs to the class of malicious programs or the

class of benign programs.

4.1 Statistical Classification

A data mining approach to malware detection is to employ

statistical classification. Each classification algorithm constructs a

model, using machine learning, to represent the benign and

malicious classes. In this approach, a labeled training set is

required to build the class models during a process of supervised

learning. Many statistical classification algorithms exist including

Naive Bayes, Neural Networks, and Support Vector Machines.

The key to statistical classification is to represent the malicious

and benign samples in an appropriate manner to enable the

classification algorithms to work effectively. Feature extraction is

an important component of effective classification, and an

associated feature vector that can accurately represent the

invariant characteristics in the training sets and query samples is

highly desirable.
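To make the supervised learning step concrete, the following Python sketch trains a small multinomial Naive Bayes classifier with Laplace smoothing on labeled feature lists. It is a minimal illustration, not the method of any surveyed system; the function names and the toy feature names (Windows API names used as stand-ins for extracted features) are assumptions.

```python
import math
from collections import Counter

def train_naive_bayes(samples):
    """samples: list of (feature_list, label) pairs from a labeled training set."""
    class_counts = Counter()
    feature_counts = {}
    vocabulary = set()
    for features, label in samples:
        class_counts[label] += 1
        feature_counts.setdefault(label, Counter()).update(features)
        vocabulary.update(features)
    return class_counts, feature_counts, vocabulary

def classify(model, features):
    """Return the label with the highest log posterior for the query features."""
    class_counts, feature_counts, vocabulary = model
    total = sum(class_counts.values())
    best_label, best_score = None, -math.inf
    for label, count in class_counts.items():
        score = math.log(count / total)  # class prior
        denom = sum(feature_counts[label].values()) + len(vocabulary)
        for f in features:
            if f in vocabulary:  # features never seen in training are ignored
                score += math.log((feature_counts[label][f] + 1) / denom)
        if score > best_score:
            best_label, best_score = label, score
    return best_label

# Toy training set; real systems would use thousands of samples.
training = [
    (["CreateRemoteThread", "WriteProcessMemory"], "malicious"),
    (["WriteProcessMemory", "VirtualAllocEx"], "malicious"),
    (["CreateFile", "ReadFile"], "benign"),
    (["ReadFile", "CloseHandle"], "benign"),
]
model = train_naive_bayes(training)
```

The same interface applies whichever feature is chosen; only the contents of the feature lists change.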

[Figure 2. Control flow graph (left) and call graph (right). The control flow graph connects the four basic blocks of Figure 1; the call graph contains the nodes Proc_0 to Proc_4.]

4.2 Instance-Based Learning

Instance-based learning is a related and traditionally popular approach that can be employed wherein the query program is classified by identifying a high similarity to a known instance of

malware in the training set. Traditional Antivirus utilises this

approach when it performs signature based detection. The key

component to perform classification using instance-based learning

is a distance or similarity function between the objects

representing samples and queries. For a distance function to be

effective between objects, the objects must be modeled by a

limited set of features that capture the invariant characteristics of

the malicious and benign programs. In some cases, the distance

function is replaced with a test for equality. However, testing only

for equality reduces the effectiveness of the classification process

when dealing with malware variants. Instance-based learning can

additionally identify high similarity to benign or white-listed

samples, depending on the aims of the classification.
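The role of the similarity function can be sketched as follows. This Python example uses Jaccard similarity over feature sets and a fixed threshold; both choices, and all names, are assumptions for illustration rather than the design of a particular Antivirus product.

```python
def jaccard(a, b):
    """Similarity between two feature sets, from 0.0 (disjoint) to 1.0 (equal)."""
    a, b = set(a), set(b)
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)

def classify_by_instance(query, known_malware, threshold=0.5):
    """Return (name, similarity) of the closest known instance, or
    (None, similarity) when nothing is similar enough."""
    best_name, best_sim = None, 0.0
    for name, features in known_malware.items():
        sim = jaccard(query, features)
        if sim > best_sim:
            best_name, best_sim = name, sim
    if best_sim >= threshold:
        return best_name, best_sim
    return None, best_sim

# Hypothetical signature database: one feature set per known family.
known = {"family_a": {"f1", "f2", "f3", "f4"}}
```

Replacing `jaccard` with a test for equality reduces this to exact signature matching, which is why variants with small changes would then be missed.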

4.3 The Similarity Search Used in Instance-Based Learning

A search of a database to find similar, but not necessarily identical

objects to a query is known as a similarity search. The similarity

search is a central aspect of instance-based learning when applied

to malware detection and classification using a large number of

malware signatures and training instances.

Distance functions that have the properties of a metric allow the use of Metric Access Methods. A similarity

search using metric access methods performs faster than

exhaustive linear search and enables significantly larger databases

without being restricted by an equivalent increase in running time.

Metric access methods may use either static or dynamic databases.

In dynamic Metric Access Methods, dynamic database operations,

such as object insertion, can be effectively performed with

reasonable performance expectations.
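One well known dynamic metric access method is the BK-tree, sketched below in Python. The survey does not name a specific structure, so this choice is an assumption; the example also assumes signatures are fixed-width integer fingerprints compared with the Hamming distance, which is a metric.

```python
def hamming(a: int, b: int) -> int:
    """Number of differing bits between two integer fingerprints."""
    return bin(a ^ b).count("1")

class BKTree:
    """Dynamic metric tree: children are indexed by distance to the parent."""

    def __init__(self, distance):
        self.distance = distance
        self.root = None  # each node is [item, {distance: child_node}]

    def insert(self, item):
        if self.root is None:
            self.root = [item, {}]
            return
        node = self.root
        while True:
            d = self.distance(item, node[0])
            if d == 0:
                return  # item already stored
            child = node[1].get(d)
            if child is None:
                node[1][d] = [item, {}]
                return
            node = child

    def search(self, query, radius):
        """Return sorted (distance, item) pairs within radius of query."""
        results = []
        stack = [self.root] if self.root is not None else []
        while stack:
            item, children = stack.pop()
            d = self.distance(query, item)
            if d <= radius:
                results.append((d, item))
            # Triangle inequality: only subtrees whose edge label k satisfies
            # d - radius <= k <= d + radius can contain a match.
            for k, child in children.items():
                if d - radius <= k <= d + radius:
                    stack.append(child)
        return sorted(results)

fingerprints = [0b0000, 0b0001, 0b0011, 0b1111, 0b1000]
tree = BKTree(hamming)
for fp in fingerprints:
    tree.insert(fp)
```

The pruning step is what lets a range query skip parts of the database instead of comparing the query against every stored signature.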

5. CLASSIFICATION APPROACHES BY FEATURE

5.1 Object File Header Attributes

Identifying object file discrepancies to detect malware has

sometimes been used by commercial Antivirus to detect

suspicious binaries. peHash [14] proposed a similar technique by

hashing object file features to cluster malware. The advantage of

this approach is that it is highly efficient. The disadvantage is that

using object file attributes predominantly identifies those

attributes of the packing tool used when packing the malware. The

concept is very similar to the techniques used when identifying

code packing using object file features. Classifying a sample as

malicious, based on the packer, is not necessarily accurate. It is

not necessarily true that the presence of code packing indicates

the malicious intent of a program.

5.2 Byte Level Approaches

5.2.1 File Hashing

The simplest approach to malware detection is hashing the

contents of the file and comparing that hash against a blacklist.

This approach is widely used in commercial Antivirus. The

disadvantage of using this approach on its own is that it fails to detect malware that has incurred any byte level

alterations. However, the blacklisting of specific and unaltered

malware instances is a useful technique that is easily and

efficiently implemented.
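A minimal Python sketch of hash blacklisting is shown below. SHA-256 is an assumed choice of hash function and the helper names are hypothetical; the survey does not prescribe either.

```python
import hashlib

def sha256_hex(data: bytes) -> str:
    """Hex digest of the whole file content."""
    return hashlib.sha256(data).hexdigest()

def is_blacklisted(data: bytes, blacklist) -> bool:
    """Exact-match lookup of the file hash in a set of known-bad hashes."""
    return sha256_hex(data) in blacklist

# Illustrative blacklist built from one stand-in sample.
sample = b"example malicious payload"
blacklist = {sha256_hex(sample)}
```

Appending a single byte to `sample` changes the digest completely, which illustrates why this technique cannot detect altered instances.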

5.2.2 Kolmogorov Complexity

Kolmogorov complexity is a theoretical measure of the descriptive complexity, or minimum string length in a

universal description language, required to represent an object or

set of data. It is a theoretical measure that is not computable. To

estimate the Kolmogorov complexity, an object may be

compressed and concatenated with the associated decompression

routine, to give the approximate minimum string length to

describe the object. The observation, when this theory is related to

malware, is that similar malware have similar measures of

Kolmogorov complexity. This form of analysis occurs on the malware's raw file or section content.

Estimating Kolmogorov complexity was proposed in peHash [14]

by identifying the compression ratio of a malicious sample that

was subsequently used for clustering malware families. Another

measure of similarity related to Kolmogorov complexity is the

Normalized Compression Distance (NCD). The NCD was used in

[15] to cluster worms into families. This approach, like peHash

[14], was not used to classify samples as being benign or

malicious, but to cluster malicious samples only.
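The Normalized Compression Distance is commonly defined as NCD(x, y) = (C(xy) - min(C(x), C(y))) / max(C(x), C(y)), where C is the compressed length. The Python sketch below approximates C with zlib, which is an assumption of this example and not necessarily the compressor used in [15].

```python
import zlib

def compressed_size(data: bytes) -> int:
    """Approximate C(x) by the length of the zlib-compressed data."""
    return len(zlib.compress(data, 9))

def ncd(x: bytes, y: bytes) -> float:
    """Normalized Compression Distance: near 0 for similar inputs,
    near 1 for unrelated inputs."""
    cx, cy = compressed_size(x), compressed_size(y)
    cxy = compressed_size(x + y)
    return (cxy - min(cx, cy)) / max(cx, cy)
```

Because zlib only looks back 32 KiB, this approximation degrades on large files; a compressor with a larger window would be needed for whole executables.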

It was the observation in [16] that malware and benign programs

can be classified according to a likeness to a compression model

for each of the malicious and benign classes. In this research, it

was proposed that two compression models be constructed from two training sets, one of malicious samples, and one of benign

samples. To classify a query sample as being malicious or benign,

the number of bits required to encode the query was calculated for

each compression model. The query was classified by identifying

the class that requires the least data to encode the query.
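The idea of classifying by encoding cost can be approximated with a general purpose compressor. In the Python sketch below, the cost of a query under a class model is estimated as the extra compressed bytes needed when the query is appended to that class's training corpus. This is a rough stand-in for the compression models of [16], and the corpora, names and use of zlib are assumptions.

```python
import zlib

def encoding_cost(model_corpus: bytes, query: bytes) -> int:
    """Extra compressed bytes needed for the query given the class corpus."""
    base = len(zlib.compress(model_corpus, 9))
    return len(zlib.compress(model_corpus + query, 9)) - base

def classify(query: bytes, malicious_corpus: bytes, benign_corpus: bytes) -> str:
    """Pick the class whose model encodes the query most cheaply."""
    cost_malicious = encoding_cost(malicious_corpus, query)
    cost_benign = encoding_cost(benign_corpus, query)
    return "malicious" if cost_malicious < cost_benign else "benign"

# Toy corpora standing in for concatenated training samples.
malicious_corpus = b"decrypt_payload inject_thread hide_process " * 50
benign_corpus = b"open_window draw_button save_document " * 50
```

A query that shares long substrings with one corpus is encoded almost for free under that model, so the cheaper model indicates the more likely class.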

5.2.3 String Signatures

Static string signatures have been the dominant technique used in

traditional Antivirus. String signatures represent patterns in a

malware’s raw content used to uniquely identify it. In [1, 2] it was

proposed to automatically extract the string signatures used by the

detection system. All candidate string signatures of a fixed size are extracted from the malware, and those that result in

high similarity to a corpus of benign programs are removed from

the signature candidate list.

String signatures may use fast string matching algorithms to detect

malicious instances [3]. Wildcards and regular expressions

present extensions of string matching that can be used in the

detection of malware variants. String signatures are efficient, and

more effective than file hashing, but can be ineffective when faced

with polymorphic malware variants.
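A wildcard signature can be sketched in Python by translating a hex pattern into a regular expression over bytes. The `??` wildcard notation and the function name are assumptions of this example; the sample pattern matches the `movl $imm,(%esp)` and `call` bytes from Figure 1 while ignoring the immediate address.

```python
import re

def compile_signature(pattern: str):
    """Translate a hex signature such as 'c7 04 24 ?? ?? ?? ?? e8' into a
    compiled bytes regex, where '??' matches any single byte."""
    parts = []
    for token in pattern.split():
        if token == "??":
            parts.append(b".")
        else:
            parts.append(re.escape(bytes.fromhex(token)))
    # DOTALL so that the wildcard also matches the newline byte 0x0a.
    return re.compile(b"".join(parts), re.DOTALL)

signature = compile_signature("c7 04 24 ?? ?? ?? ?? e8")
original = bytes.fromhex("83ec24c70424a0204000e85d000000")
relocated = bytes.fromhex("83ec24c70424b0304100e85d000000")
```

The wildcard lets one signature match both `original` and `relocated`, but an instruction substitution inside the fixed bytes would still defeat it.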

5.2.4 Malware Normalization

To improve the effectiveness of string based signatures, malware

normalization has been proposed. Normalizing malware before

passing it to Antivirus software was investigated by

Christodorescu et al. in [17]. Static analysis was carried out to

eliminate unnecessary control flow as indicated by superfluous

unconditional branches. Semantic nops were also removed from

the malware by using decision procedures. At this point the

malware was passed, now in a more canonical form, to Antivirus

software.

Another approach to the code normalization problem was to

rewrite sequences of code using compiler optimisation techniques

[18]. Expression propagation, dead code elimination, and expression simplification using algebraic identities were used. The intuition is that the process of an optimising compiler removes the redundancy of the original code and improves the terseness, resulting in a normalized representation.

An approach using term rewriting was proposed in [19] where

rewrite rules were constructed to model the malware

transformations that occur during polymorphic and metamorphic

mutation. From these, a normalizing rule-set was constructed that

could rewrite the malware to a canonical or near canonical

representation.

5.3 Instruction Distributions

5.3.1 Opcode Distributions

Instruction level content of malware can provide a more resilient

representation than the byte level data. This is especially true if

the instruction arguments or operands are ignored leaving only the

opcodes to be examined. To determine the instructions and

opcodes, a disassembly of the malware is required. A

classification technique using the statistical distribution of

opcodes as a predictor of malware was proposed in [5]. The

investigation found that rarely occurring opcodes were a stronger predictor of malware than frequently occurring

opcodes. The disadvantage of opcode distributions is that

polymorphism can change the distributions.
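Computing an opcode distribution from a disassembly is straightforward, as the Python sketch below shows. It assumes one instruction per line with the mnemonic as the first token (as in Figure 1); the function name is hypothetical.

```python
from collections import Counter

def opcode_distribution(disassembly_lines):
    """Relative frequency of each mnemonic, ignoring all operands."""
    mnemonics = [line.split()[0] for line in disassembly_lines if line.strip()]
    counts = Counter(mnemonics)
    total = sum(counts.values())
    return {op: n / total for op, n in counts.items()}

listing = ["push %ebp", "mov %esp,%ebp", "push %ecx", "ret"]
distribution = opcode_distribution(listing)
```

The resulting dictionary can serve as a feature vector for a statistical classifier.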

5.3.2 Byte and Instruction Level N-grams

An approach to classify malware using evolutionary trees and

phylogeny based on n-grams and n-perms was proposed by Karim

et al. in [20]. Their approach first established a similarity between malware samples, after which phylogeny, which generates evolutionary trees, could be used for taxonomy. N-grams of byte and

instruction level content were extracted as features from each

binary. These vectors were compared to establish a similarity

using a variety of metrics. One such metric was cosine similarity.

N-perms extended the concept of n-grams to group permutations

of each n-gram as a single feature. N-grams are more resistant to

polymorphic changes than string based signatures, but are not

highly effective when faced with techniques such as instruction

substitution or register reassignment.
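The difference between n-grams and n-perms can be sketched in Python as follows. Sequences are compared with cosine similarity over feature counts; the function names and the toy opcode sequences are assumptions of this example.

```python
import math
from collections import Counter

def ngrams(seq, n):
    """Count every window of n consecutive items."""
    return Counter(tuple(seq[i:i + n]) for i in range(len(seq) - n + 1))

def nperms(seq, n):
    """Like n-grams, but all orderings of a window count as one feature."""
    return Counter(tuple(sorted(seq[i:i + n])) for i in range(len(seq) - n + 1))

def cosine_similarity(a, b):
    """Cosine of the angle between two sparse count vectors."""
    dot = sum(a[k] * b[k] for k in a.keys() & b.keys())
    norm = math.sqrt(sum(v * v for v in a.values())) * \
           math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0

original = ["push", "mov", "sub", "call", "add", "ret"]
reordered = ["mov", "push", "sub", "call", "add", "ret"]  # first two swapped
sim_ngram = cosine_similarity(ngrams(original, 2), ngrams(reordered, 2))
sim_nperm = cosine_similarity(nperms(original, 2), nperms(reordered, 2))
```

Swapping two adjacent instructions lowers the n-gram similarity more than the n-perm similarity, which is the robustness to reordering that n-perms aim for.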

5.3.3 N-gram Analysis and Machine Learning

N-gram analysis using machine learning was proposed by Perdisci

et al to classify malware in McBoost [21]. A similar method also

involving machine learning and classification was proposed in

[4]. The classification was performed using a similar algorithm to

how McBoost used n-gram analysis to detect packed binaries.

The most informative n-grams of a training set were used in

representing the occurrence of those n-grams in each binary as a

vector. Those vectors were used to train a statistical classifier.

Automated unpacking was performed on the malware, if

necessary, using a method similar to the Renovo unpacking

system.

5.4 Basic Blocks

5.4.1 Edit Distances

The edit distance between basic blocks was used to classify

malware by Gheorghescu in [6]. The edit distance describes the

number of insertions, deletions and substitutions to convert one

string to another. To classify malware using edit distance, each

malware was statically disassembled and the basic blocks

extracted. These basic blocks were then considered as strings.

Classification then proceeded to build a similarity between the

malware's basic blocks and the binary being examined using a

similarity ratio based on the edit distance.
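The edit distance and a derived similarity ratio can be sketched in Python as below. The ratio used here, one minus the distance divided by the longer length, is an assumed normalisation and may differ from the one used in [6].

```python
def edit_distance(a, b):
    """Levenshtein distance between two sequences (strings or lists)."""
    prev = list(range(len(b) + 1))
    for i, x in enumerate(a, 1):
        cur = [i]
        for j, y in enumerate(b, 1):
            cur.append(min(prev[j] + 1,              # deletion
                           cur[j - 1] + 1,           # insertion
                           prev[j - 1] + (x != y)))  # substitution or match
        prev = cur
    return prev[-1]

def similarity_ratio(a, b):
    """1.0 for identical sequences, approaching 0.0 as they diverge."""
    longest = max(len(a), len(b))
    return 1.0 if longest == 0 else 1 - edit_distance(a, b) / longest

# Basic blocks represented as mnemonic sequences (illustrative).
block_a = ["movl", "call", "addl"]
block_b = ["movl", "call", "incl"]
```

Treating each block as a sequence of mnemonics rather than raw bytes makes the comparison insensitive to operand changes, at the cost of requiring disassembly.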

The main disadvantage with this approach is that minor changes

to the malware source code can result in significant changes to

almost all basic blocks. Changes in compiler configuration and

optimisations can equally result in large changes.

5.4.2 Membership Testing - Inverted Index and Bloom Filters

Gheorghescu also proposed the use of finding identical features in

basic blocks shared between malware [6]. The inverted index and

bloom filters provided for faster searching of exact matches in a

database. An inverted index is an associative mapping between

the content and the source of that content. A bloom filter allows

for fast set membership queries with allowable false positives and

guaranteed no false negatives. To make this an approximate search, feature extraction occurred on the basic block. Bloom

filters were shown to perform fast enough to be performed on a

desktop system, but not fast enough for desktop Antivirus. The

disadvantage of this approach is even more pronounced than

using the edit distance between basic blocks. Minor changes to

the malware source or compiler configuration can change almost

all basic blocks.
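Both data structures can be sketched briefly in Python. The filter size, the number of hash functions and the SHA-256 based hashing scheme are assumptions of this example, as are the sample names and block contents.

```python
import hashlib
from collections import defaultdict

class BloomFilter:
    """Set membership with possible false positives but no false negatives."""

    def __init__(self, size_bits=1024, num_hashes=3):
        self.size = size_bits
        self.num_hashes = num_hashes
        self.bits = bytearray((size_bits + 7) // 8)

    def _positions(self, item: bytes):
        # Derive several bit positions by prefixing the item with an index.
        for i in range(self.num_hashes):
            digest = hashlib.sha256(bytes([i]) + item).digest()
            yield int.from_bytes(digest[:8], "big") % self.size

    def add(self, item: bytes):
        for p in self._positions(item):
            self.bits[p // 8] |= 1 << (p % 8)

    def __contains__(self, item: bytes):
        return all(self.bits[p // 8] & (1 << (p % 8))
                   for p in self._positions(item))

def build_inverted_index(samples):
    """Map each basic block to the set of samples that contain it."""
    index = defaultdict(set)
    for name, blocks in samples.items():
        for block in blocks:
            index[block].add(name)
    return index

samples = {
    "sample_a": [b"\x55\x89\xe5", b"\xc3"],
    "sample_b": [b"\x55\x89\xe5"],
}
index = build_inverted_index(samples)
bloom = BloomFilter()
for blocks in samples.values():
    for block in blocks:
        bloom.add(block)
```

The Bloom filter answers only whether a block has been seen, whereas the inverted index also identifies which samples share it.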

5.5 Control Flow Graphs

5.5.1 Whole Program Control Flow Graph Isomorphism Recognition Using Tree Automata

A fast approach to detecting whole program control flow graph

isomorphism and subgraph isomorphism was proposed in [22].

This approach constructed a spanning tree based structure from

the control flow graph, and then built a tree automaton for graph

recognition. This approach appears to have reasonable

performance. However, this technique is not effective at detecting

malware variants that alter the control flow or have semantic

changes.

5.5.2 Common k-subgraphs

Decomposing control flow graphs into subgraphs was proposed by Kruegel et al. in [8] to classify polymorphic worms. The control flow graphs were decomposed into the set of all subgraphs of fixed size k, where k is the number of nodes in the subgraph. The k-subgraphs were subsequently transformed into their canonical labeled form. The adjacency matrix of the canonically labeled graph was transformed into a string. This string represented

the k-subgraph feature of the control flow being analysed. Worm

detection and classification occurs through identifying the

prevalence of k-subgraph features between worm like executable

content and unclassified executable programs. The performance

of this system was reasonable. Because the classification only

occurs on network streams identified as potential worms, it is hard

to determine the accuracy of the classification when applied to a

larger set of malware.
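
The feature extraction can be illustrated with a small Python sketch. It is a brute-force stand-in and not the method of [8]: it assumes induced, weakly connected subgraphs, and it finds the canonical form by trying every node ordering, which is only practical for very small k. The function names and the example graphs are hypothetical.

```python
from itertools import combinations, permutations


def is_weakly_connected(nodes, edges):
    """Check connectivity of the induced subgraph, ignoring edge direction."""
    nodes = set(nodes)
    neighbours = {n: set() for n in nodes}
    for u, v in edges:
        if u in nodes and v in nodes:
            neighbours[u].add(v)
            neighbours[v].add(u)
    start = next(iter(nodes))
    seen, stack = {start}, [start]
    while stack:
        for nxt in neighbours[stack.pop()]:
            if nxt not in seen:
                seen.add(nxt)
                stack.append(nxt)
    return seen == nodes


def canonical_string(nodes, edges):
    """Smallest adjacency-matrix bit string over all orderings of the nodes."""
    edge_set = set(edges)
    best = None
    for order in permutations(nodes):
        bits = "".join(
            "1" if (u, v) in edge_set else "0" for u in order for v in order
        )
        if best is None or bits < best:
            best = bits
    return best


def k_subgraph_features(edges, k):
    """Set of canonical strings for every connected k-node induced subgraph."""
    nodes = sorted({n for edge in edges for n in edge})
    features = set()
    for subset in combinations(nodes, k):
        if is_weakly_connected(subset, edges):
            features.add(canonical_string(subset, edges))
    return features


# Two control flow graphs with the same shape but different node labels.
cfg_a = [(0, 1), (1, 2), (2, 0), (2, 3)]
cfg_b = [(10, 11), (11, 12), (12, 10), (12, 13)]
# A simple chain shares only some features with cfg_a.
chain = [(0, 1), (1, 2), (2, 3)]

shared = k_subgraph_features(cfg_a, 3) & k_subgraph_features(chain, 3)
```

Because the canonical string does not depend on node labels, `cfg_a` and `cfg_b` yield the same feature set, while `chain` shares only its path-shaped subgraph with `cfg_a`. The number of shared features is the kind of prevalence measure the classification relies on.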

5.5.3 Structured Control Flow Graphs Using Decompilation

An approach that decompiles the control flow into a high level, source code like representation was proposed in [23]. Two control flow graphs are compared using the string edit distance between their decompiled sources. The similarities of the individual control flow graphs are accumulated to give a similarity that takes the entire program into account. A related approach is to decompose the decompiled strings into q-grams [24].
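
Both comparisons operate on strings, so they are easy to sketch. The Python below is a minimal illustration, assuming a made-up string encoding of structured control flow (`W` for a loop, `I` and `E` for the branches of a conditional, `R` for a return); the actual grammar and similarity thresholds used in [23] and [24] are not reproduced here.

```python
def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance computed row by row."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1, curr[j - 1] + 1, prev[j - 1] + cost))
        prev = curr
    return prev[-1]


def similarity(a: str, b: str) -> float:
    """Normalise the edit distance into a similarity between 0 and 1."""
    longest = max(len(a), len(b))
    return 1.0 if longest == 0 else 1.0 - edit_distance(a, b) / longest


def q_grams(s: str, q: int = 3) -> set:
    """Set of all substrings of length q."""
    return {s[i:i + q] for i in range(len(s) - q + 1)}


# Hypothetical decompiled control flow of one procedure in two variants.
original = "W{I{}E{}}R"
variant = "W{I{}}R"

score = similarity(original, variant)
common = q_grams(original) & q_grams(variant)
```

The edit distance tolerates small structural changes between variants, and the q-gram sets turn the same strings into fixed-size features that can be compared or indexed without a pairwise alignment.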

5.6 Call Graphs

5.6.1 Whole Program Context-Free Control Flow

It was proposed in [25] that the inter-procedural control flow information could be represented as a context-free grammar with a limited loss of information. A string could represent the grammar, and string equality could be used to show equivalence between grammars and the inter-procedural control flow they represent. The advantage of this approach is that string based representations allow fast searches of a malware database using a dictionary search. The disadvantage of the approach investigated in this research is that it did not employ approximate matching of the inter-procedural control flow. For polymorphic malware variants that alter the control flow through source code modification, an approximate match is necessary for detection of the malware.

5.6.2 Flowgraph Based Classification using Fixed Points

Carrera proposed an approximate flowgraph matching algorithm in [9] by identifying fixed points in the flowgraphs and successively matching surrounding nodes in the graph. Carrera built a similarity index between malware samples and used this to build phylogeny trees for taxonomy. Dullien and Rolles expanded the approximate graph comparison algorithm in [26] to identify identical nodes between call graphs and control flow graphs. Their algorithm worked by identifying nodes, or fixed points, between binaries that have uniquely identifiable features. Features for a node in the call graph include the number of basic blocks, the number of control flow edges, and the number of subfunction calls. Carrera also proposed an estimation of control flow graph isomorphism based on string equality, using a string signature of the graph that represents a graph traversal. Once a set of fixed points was known, their neighbouring nodes could be examined. Iteratively identifying neighbours that share common and unique features allowed greater parts of the flowgraph to be identified. The advantage of this approach is that it allows for moderately fast pair-wise comparison between graphs. However, the approach does not perform efficiently for a database of graphs and is not fast enough for desktop Antivirus use.
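
The fixed-point idea can be sketched as follows. This is a simplified illustration, not the algorithm of [9] or [26]: it assumes each function is described by a feature tuple (basic blocks, control flow edges, subfunction calls), that callee lists contain no duplicates, and that propagation only follows outgoing call edges. All function and variable names are hypothetical.

```python
from collections import Counter


def unique_by_feature(nodes, features):
    """Return {feature: node} for features held by exactly one of the nodes."""
    counts = Counter(features[n] for n in nodes)
    return {features[n]: n for n in nodes if counts[features[n]] == 1}


def match_pairs(nodes_a, nodes_b, feat_a, feat_b):
    """Pair nodes whose feature tuple is unique on both sides."""
    unique_a = unique_by_feature(nodes_a, feat_a)
    unique_b = unique_by_feature(nodes_b, feat_b)
    return [(unique_a[f], unique_b[f]) for f in unique_a if f in unique_b]


def match_call_graphs(calls_a, feat_a, calls_b, feat_b):
    # Initial fixed points: features that are unique across each whole graph.
    matched = dict(match_pairs(list(feat_a), list(feat_b), feat_a, feat_b))
    worklist = list(matched.items())
    while worklist:
        a, b = worklist.pop()
        used_b = set(matched.values())
        # Only compare the still unmatched callees of an already matched pair.
        cand_a = [n for n in calls_a.get(a, []) if n not in matched]
        cand_b = [n for n in calls_b.get(b, []) if n not in used_b]
        for na, nb in match_pairs(cand_a, cand_b, feat_a, feat_b):
            matched[na] = nb
            worklist.append((na, nb))
    return matched


# Feature tuple: (basic blocks, control flow edges, subfunction calls).
feat_a = {"main": (5, 6, 2), "init": (2, 1, 1), "run": (7, 9, 1),
          "util1": (1, 0, 0), "util2": (1, 0, 0)}
calls_a = {"main": ["init", "run"], "init": ["util1"], "run": ["util2"]}

feat_b = {"sub_1": (5, 6, 2), "sub_2": (2, 1, 1), "sub_3": (7, 9, 1),
          "sub_4": (1, 0, 0), "sub_5": (1, 0, 0)}
calls_b = {"sub_1": ["sub_2", "sub_3"], "sub_2": ["sub_4"], "sub_3": ["sub_5"]}

result = match_call_graphs(calls_a, feat_a, calls_b, feat_b)
```

In this example `util1` and `util2` have identical features, so neither is a fixed point for the whole graph. They are matched only once their callers have been paired, because each is then unique within the much smaller neighbourhood being compared.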

5.6.3 Approximating the Graph Edit Distance

An alternative approach to approximate graph matching was proposed in the SMIT system [27]. SMIT used bipartite graph matching and the Hungarian (Munkres) algorithm to find matching nodes between the two call graphs being compared in O(N^3) running time. The strength of the matching algorithm was that it could be used as an approximation to the graph edit distance. The graph edit distance between two graphs is the minimum number of edit operations needed to convert one graph into the other. The graph edit distance gives a sound basis for similarity and dissimilarity between graphs. The graph edit distance is known to have the properties of a metric, which allows the use of metric access methods to search a database of objects.
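
The assignment-based approximation can be sketched in Python. This is an illustration under several assumptions and not SMIT's implementation: the node matching cost is based only on in and out degrees, every edit operation costs 1, and the optimal assignment is found by brute force over permutations as a stand-in for the Hungarian algorithm (a polynomial-time solver such as SciPy's `linear_sum_assignment` would replace that step for graphs of realistic size). All names are hypothetical.

```python
from itertools import permutations


def degree_profile(nodes, edges):
    """Return {node: (out_degree, in_degree)} for a directed graph."""
    out_deg = {n: 0 for n in nodes}
    in_deg = {n: 0 for n in nodes}
    for u, v in edges:
        out_deg[u] += 1
        in_deg[v] += 1
    return {n: (out_deg[n], in_deg[n]) for n in nodes}


def approx_graph_edit_distance(nodes_a, edges_a, nodes_b, edges_b):
    prof_a = degree_profile(nodes_a, edges_a)
    prof_b = degree_profile(nodes_b, edges_b)
    size = max(len(nodes_a), len(nodes_b))
    # Pad the smaller side with None so the bipartite assignment is square.
    left = list(nodes_a) + [None] * (size - len(nodes_a))
    right = list(nodes_b) + [None] * (size - len(nodes_b))

    def cost(u, v):
        if u is None:  # v has no counterpart: node insertion plus its edges
            return 1 + sum(prof_b[v])
        if v is None:  # u has no counterpart: node deletion plus its edges
            return 1 + sum(prof_a[u])
        (out_u, in_u), (out_v, in_v) = prof_a[u], prof_b[v]
        return abs(out_u - out_v) + abs(in_u - in_v)

    # Brute-force minimum-cost assignment (small graphs only).
    best = min(
        permutations(range(size)),
        key=lambda p: sum(cost(left[i], right[p[i]]) for i in range(size)),
    )
    mapping = {left[i]: right[best[i]] for i in range(size) if left[i] is not None}

    # Count the edit operations implied by the chosen node mapping.
    target_edges = set(edges_b)
    mapped_edges = {(mapping[u], mapping[v]) for u, v in edges_a}
    node_ops = abs(len(nodes_a) - len(nodes_b))
    edge_deletions = sum(
        1 for u, v in edges_a if (mapping[u], mapping[v]) not in target_edges
    )
    edge_insertions = len(target_edges - mapped_edges)
    return node_ops + edge_deletions + edge_insertions


# Hypothetical call graphs.
nodes_a, edges_a = ["main", "f", "g"], [("main", "f"), ("main", "g")]
nodes_b, edges_b = ["m", "x", "y"], [("m", "x"), ("m", "y"), ("x", "y")]
nodes_c, edges_c = ["m", "x"], [("m", "x")]

d_ab = approx_graph_edit_distance(nodes_a, edges_a, nodes_b, edges_b)
d_ac = approx_graph_edit_distance(nodes_a, edges_a, nodes_c, edges_c)
```

The returned value counts the edits implied by one particular node mapping, so it is an upper bound on the true graph edit distance rather than the exact value. That trade is what makes the comparison tractable, since computing the exact distance is far more expensive than solving one assignment problem.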

5.7 API Calls

API calls are a feature used in several malware classification and detection systems. In the SAVE system [28], it was proposed to disassemble the malware image and then to extract the API call sequence as it appeared in the disassembled output. The API call sequence was used to construct a vector. Similarity between vectors was measured using the cosine similarity measure, the Jaccard index [29], and Pearson's correlation measure.
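
Two of these measures can be sketched briefly in Python. The example assumes a simple bag-of-calls vector (one count per API name) rather than SAVE's actual sequence encoding, which is not reproduced here; the call lists are made up for illustration.

```python
import math
from collections import Counter


def cosine_similarity(calls_a, calls_b):
    """Cosine of the angle between the two API call count vectors."""
    va, vb = Counter(calls_a), Counter(calls_b)
    dot = sum(va[name] * vb[name] for name in va.keys() & vb.keys())
    norm_a = math.sqrt(sum(c * c for c in va.values()))
    norm_b = math.sqrt(sum(c * c for c in vb.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def jaccard_index(calls_a, calls_b):
    """Size of the intersection over the size of the union of the call sets."""
    sa, sb = set(calls_a), set(calls_b)
    union = sa | sb
    return len(sa & sb) / len(union) if union else 1.0


# Hypothetical API call sequences extracted from two disassembled samples.
sample_a = ["CreateFileA", "WriteFile", "CloseHandle", "WriteFile"]
sample_b = ["CreateFileA", "WriteFile", "RegSetValueExA"]

cos = cosine_similarity(sample_a, sample_b)
jac = jaccard_index(sample_a, sample_b)
```

Cosine similarity takes call frequency into account, while the Jaccard index only looks at which calls are present, so the two can rank the same pair of samples differently.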

An alternative to using vectors to represent API call sequences was proposed in the IMDS malware detection system [10]. IMDS employed the data mining technique known as association mining. Association mining was able to associate sequences of API calls with a class, allowing query samples to be classified as benign or malicious.

5.8 Data Flow

Combining data flow analysis and control flow analysis was proposed in [11, 30]. Annotations were made to the control flow graphs to incorporate abstractions of the instructions and data flow. These annotated flowgraphs were compared to signatures, or automata, that described the malware. If the malware signature was present in the query program, a malware instance had been detected. In [31], value set analysis was used as a specific data flow analysis to identify fixed points that were subsequently used to construct signatures.

6. TRENDS

Malware obfuscation has been increasingly addressed by researchers, and deobfuscation will continue to be developed and incorporated into malware detection systems. These deobfuscation techniques have increasingly borrowed from formal program analyses in an attempt to make sound analyses possible with regard to their given constraints.

Malware classification has employed statistical techniques to detect unknown malware. We believe research will continue using this approach and new features will be developed that can more accurately characterize malware. Instance-based learning will also be developed, with particular research opportunity in working with large scale datasets.

Static program features have been extracted at increasing levels of abstraction, and we expect this to continue in future research. Abstraction has the benefit of being resistant to lower level polymorphic changes. The performance of these research systems has not been fully investigated, and we expect that future research opportunity lies in making classification systems practical for industrial and widespread use.

7. CONCLUSION

Detecting malware before it is allowed to execute is an important feature of Antivirus and system security. Static analysis techniques allow features to be extracted from programs, which allows machine learning to identify variants of malware and novel samples. Malware packing, which hides the code from analysis, remains the main sticking point for static detection, and it can be hard to reverse all packers automatically. If unpacking is achievable, the problem of malware detection using static analysis is quite feasible, and we expect the accuracy and efficiency of such systems will continue to improve as research continues.

REFERENCES

[1] K. Griffin, S. Schneider, X. Hu, and T. Chiueh, "Automatic Generation of String Signatures for Malware Detection," in Recent Advances in Intrusion Detection: 12th International Symposium, RAID 2009, Saint-Malo, France, 2009.

[2] J. O. Kephart and W. C. Arnold, "Automatic extraction of computer virus signatures," in 4th Virus Bulletin International Conference, 1994, pp. 178-184.

[3] A. V. Aho and M. J. Corasick, "Efficient string matching: an aid to bibliographic search," Communications of the ACM, vol. 18, p. 340, 1975.

[4] J. Z. Kolter and M. A. Maloof, "Learning to detect malicious executables in the wild," in International Conference on Knowledge Discovery and Data Mining, 2004, pp. 470-478.

[5] D. Bilar, "Opcodes as predictor for malware," International Journal of Electronic Security and Digital Forensics, vol. 1, pp. 156-168, 2007.

[6] M. Gheorghescu, "An automated virus classification system," in Virus Bulletin Conference, 2005, pp. 294-300.

[7] A. V. Aho, R. Sethi, and J. D. Ullman, Compilers: principles, techniques, and tools. Reading, MA: Addison-Wesley, 1986.

[8] C. Kruegel, E. Kirda, D. Mutz, W. Robertson, and G. Vigna, "Polymorphic worm detection using structural information of executables," Lecture Notes in Computer Science, vol. 3858, p. 207, 2006.

[9] E. Carrera and G. Erdélyi, "Digital genome mapping - advanced binary malware analysis," in Virus Bulletin Conference, 2004, pp. 187-197.

[10] Y. Ye, D. Wang, T. Li, and D. Ye, "IMDS: intelligent malware detection system," in Proceedings of the 13th ACM SIGKDD international conference on Knowledge discovery and data mining, 2007.

[11] M. Christodorescu, S. Jha, S. A. Seshia, D. Song, and R. E. Bryant, "Semantics-aware malware detection," in Proceedings of the 2005 IEEE Symposium on Security and Privacy (S&P 2005), Oakland, California, USA, 2005.

[12] R. N. Horspool and N. Marovac, "An approach to the problem of detranslation of computer programs," The Computer Journal, vol. 23, pp. 223-229, 1979.

[13] L. Boehne, "Pandora’s Bochs: Automatic Unpacking of Malware," University of Mannheim, 2008.

[14] G. Wicherski, "peHash: A Novel Approach to Fast Malware Clustering," in Usenix Workshop on Large-Scale Exploits and Emergent Threats (LEET'09), Boston, MA, USA, 2009.

[15] S. Wehner, "Analyzing worms and network traffic using compression," Journal of Computer Security, vol. 15, pp. 303-320, 2007.

[16] Y. Zhou and W. M. Inge, "Malware detection using adaptive data compression," in Proceedings of the 1st ACM Workshop on AISec (AISec '08), 2008, pp. 53-60.

[17] M. Christodorescu, J. Kinder, S. Jha, S. Katzenbeisser, and H. Veith, "Malware normalization," University of Wisconsin, Madison, Wisconsin, USA, Technical Report #1539, 2005.

[18] D. Bruschi, L. Martignoni, and M. Monga, "Using code normalization for fighting self-mutating malware," presented at the Proceedings of International Symposium on Secure Software Engineering, 2006.

[19] A. Walenstein, R. Mathur, M. R. Chouchane, and A. Lakhotia, "Normalizing Metamorphic Malware Using Term Rewriting," presented at the Proceedings of the Sixth IEEE International Workshop on Source Code Analysis and Manipulation, 2006.

[20] M. E. Karim, A. Walenstein, A. Lakhotia, and L. Parida, "Malware phylogeny generation using permutations of code," Journal in Computer Virology, vol. 1, pp. 13-23, 2005.

[21] R. Perdisci, A. Lanzi, and W. Lee, "McBoost: Boosting Scalability in Malware Collection and Analysis Using Statistical Classification of Executables," in Proceedings of the 2008 Annual Computer Security Applications Conference, 2008, pp. 301-310.

[22] G. Bonfante, M. Kaczmarek, and J. Y. Marion, "Morphological Detection of Malware," in International Conference on Malicious and Unwanted Software, IEEE, Alexandria VA, USA, 2008, pp. 1-8.

[23] S. Cesare and Y. Xiang, "Classification of Malware Using Structured Control Flow," in 8th Australasian Symposium on Parallel and Distributed Computing (AusPDC 2010), 2010.

[24] S. Cesare and Y. Xiang, "Malware Variant Detection Using Similarity Search over Sets of Control Flow Graphs," in IEEE Trustcom, 2011.

[25] G. R. Thompson and L. A. Flynn, "Polymorphic malware detection and identification via context-free grammar homomorphism," Bell Labs Technical Journal, vol. 12, pp. 139-147, 2007.

[26] T. Dullien and R. Rolles, "Graph-based comparison of Executable Objects (English Version)," in SSTIC, 2005.

[27] X. Hu, T. Chiueh, and K. G. Shin, "Large-Scale Malware Indexing Using Function-Call Graphs," in Computer and Communications Security, Chicago, Illinois, USA, pp. 611-620.

[28] A. H. Sung, J. Xu, P. Chavez, and S. Mukkamala, "Static analyzer of vicious executables (SAVE)," 2004, pp. 326-334.

[29] G. Salton and M. J. McGill, Introduction to modern information retrieval. New York: McGraw-Hill, 1983.

[30] M. Christodorescu and S. Jha, "Static analysis of executables to detect malicious patterns," presented at the Proceedings of the 12th USENIX Security Symposium, 2003.

[31] F. Leder, B. Steinbock, and P. Martini, "Classification and Detection of Metamorphic Malware using Value Set Analysis," in Proc. of 4th International Conference on Malicious and Unwanted Software (Malware 2009), Montreal, Canada, 2009.

