Survey in Static Detection of Malware
Silvio Cesare
School of Information Technology, Deakin University
Burwood, Victoria 3125, Australia
ABSTRACT
Malware continues to be a significant problem facing computer
use in today’s world. Historically, Antivirus software has
used static signatures to detect instances of known
malware. Signature based detection has fallen out of favour with
many, and detection techniques based on identifying malicious
program behavior are now part of the Antivirus toolkit. However,
static approaches to malware detection have been heavily
researched and can employ modern fingerprints that significantly
improve on the simple string signatures used in the past. Instance-
based learning can allow the detection of an entire family of
malware variants based on a single signature of static features.
Statistical machine learning can turn the features extracted into a
predictive Antivirus system able to detect novel and previously
unseen malware samples. This paper surveys the approaches and
techniques used in static malware detection.
Keywords
Malware classification.
1. INTRODUCTION
Malicious software is a significant problem that threatens the
security of users on the internet. Today, malware is created by
criminal gangs for the purposes of financial gain. These criminals
employ malware to steal credit card
information to commit fraud, or to obtain illegal use of a computer
to launch spam campaigns. A simple approach often used by
criminals is to have innocent users open a malicious e-mail
attachment.
To protect users from malware, detection of the threat before it is
allowed to execute its malicious intent is a necessity. Behaviour
blocking is a useful approach, but relying solely on the dynamic
behaviour of a program may allow unwanted actions to be
performed before the malware is detected. Running a program in a
virtual machine or isolated sandbox to detect its intent is not
always effective. Dynamic analysis can never reason about all
potential behaviours. If the malware performs differently while
being analysed, or can detect the analysis itself, then the malware
is likely to escape detection. Static analysis and
detection provide a possible solution in the arsenal of defences.
Static signature based detection has been a dominant feature in
Antivirus. Because of performance constraints, the most widely
used signature is a string containing patterns of the raw file
content [1, 2]. This allows for a string search [3] to quickly
identify patterns associated with known malware. However, these
patterns can easily be invalidated because minor changes to the
malware source code have significant effects on the malware’s
raw content. Thus, traditional signatures can prove ineffective
when dealing with unknown variants.
Modern approaches to signature generation involve less fragile
and more versatile fingerprints. Program features are extracted
that enable a more robust representation to detect an entire family
of malware variants. Machine learning and statistical
classification using those same program features can allow the
detection of novel and unknown malware not belonging to
previously identified families.
Static program analysis is undecidable for many problems
concerning binaries, and a transformation of a compiled program
known as code packing is often used by malware authors to hide
the intent of the malware and make analysis more difficult. The
packing process encrypts, compresses, or obfuscates the malware.
The original unobfuscated code is restored at run time, or in the
case of instruction virtualization, a byte code representing the
original code is executed. In most cases, unpacking is a
requirement for effective static malware classification and use of
signatures. Automated unpacking has been partly successful but
for those cases where it cannot be achieved, it is sometimes better
to mark those programs as likely to be malicious. Thus, even with
packed samples, static detection of malware can still be an
effective tool.
1.1 Structure of the Paper
The format of this paper is as follows: Section 2 describes the
taxonomy of program features useful to malware classification.
Section 3 compares those features. Section 4 describes the
approaches that can be employed in static malware classification
and Section 5 describes the specific techniques, organised by feature.
Section 6 identifies future trends. Finally, Section 7 concludes the
survey.
2. TAXONOMY OF STATIC PROGRAM FEATURES
Malware classification and detection involves the extraction of
features which are subsequently used to characterize the malware.
Features may be extracted dynamically or statically. Dynamic
approaches to malware classification involve monitoring
execution of the programs and extracting features based on their
behaviour. Static approaches extract features without program
execution.
2.1 Object File Header Attributes
The object file header contains attributes which are often custom
written during link editing and binary rewriting.
2.2 Bytes
One of the simplest features that can be extracted from a program is
the raw byte level content of the malware executable file [4]. An
alternative source of content comes from the individual program
sections in the binary, including the code and data segments.
2.3 Instructions
An executable program is constructed of code and data. The code
is represented as assembly language. Extracting the assembly is
the process of disassembling. The instruction level content of a
program can represent a more resilient form than the byte level
content if the instructions are considered by their type or
mnemonic representation [5].
2.4 Basic Blocks
A basic block is a straight line sequence of code without an
intervening control transfer instruction [6]. The basic block may
be treated at the byte level, or at the instruction level.
Additionally, data dependency within the basic block may be
examined to construct a directed acyclic graph [7]. The basic
blocks may also be grouped to form a set, or they may have
additional structure imposed by the control flow graph.
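As an illustration, the following minimal Python sketch splits the instruction sequence of Figure 1 into basic blocks by finding block leaders. It rests on some assumptions that are not part of the survey: calls are not treated as block terminators (which matches the grouping shown in Figure 1), the start address of _main is inferred as 0x401130 from the label <_main+0x1f> at 0x40114f, and the names TERMINATORS, assign_addresses, and split_basic_blocks are hypothetical.

```python
# Control transfers that end a basic block. Following Figure 1, calls are
# NOT treated as block terminators here (an assumption of this sketch).
TERMINATORS = {"jmp", "jle", "ret"}

MAIN = 0x401130  # inferred from Figure 1: <_main+0x1f> is 0x40114f

# (length in bytes, mnemonic, branch target or None), in address order.
FIGURE_1 = [
    (4, "lea", None), (3, "and", None), (3, "pushl", None), (1, "push", None),
    (2, "mov", None), (1, "push", None), (3, "sub", None), (5, "call", None),
    (7, "movl", None), (2, "jmp", MAIN + 0x2F),
    (7, "movl", None), (5, "call", None), (4, "addl", None),
    (4, "cmpl", None), (2, "jle", MAIN + 0x1F),
    (3, "add", None), (1, "pop", None), (1, "pop", None),
    (3, "lea", None), (1, "ret", None),
]


def assign_addresses(base, encoded):
    """Turn (length, mnemonic, target) triples into (address, mnemonic, target)."""
    instructions, address = [], base
    for length, mnemonic, target in encoded:
        instructions.append((address, mnemonic, target))
        address += length
    return instructions


def split_basic_blocks(instructions):
    """Split an address-ordered instruction list at block leaders."""
    leaders = {instructions[0][0]}
    for index, (_, mnemonic, target) in enumerate(instructions):
        if mnemonic in TERMINATORS:
            if target is not None:
                leaders.add(target)  # a branch target starts a block
            if index + 1 < len(instructions):
                leaders.add(instructions[index + 1][0])  # so does the fall-through
    blocks = []
    for address, mnemonic, _ in instructions:
        if address in leaders:
            blocks.append([])
        blocks[-1].append((address, mnemonic))
    return blocks


blocks = split_basic_blocks(assign_addresses(MAIN, FIGURE_1))
for block in blocks:
    print(hex(block[0][0]), [mnemonic for _, mnemonic in block])
```

The sketch works at the mnemonic level only. A real disassembler would also have to resolve indirect branches, which this example does not attempt.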
2.5 Control Flow Graphs
The control flow graph is a directed graph, where the nodes are
basic blocks [8]. The edges in the graph represent the possible
control flow of the associated procedure. The control flow graph
represents the intra-procedural control flow. A program may be
considered a set of control flow graphs, or the control flow graphs
may have additional structure as dictated by the call graph.
Alternatively, control flow graphs may represent inter-procedural
and intra-procedural control flow in a single graph. In this case,
the graph represents the inter-procedural control flow graph.
It is possible to construct alternative or abstracted representations
of the control flow graph. Loop nest trees, dominator trees, and
control dependency graphs can also be constructed [7] which are
different ways of representing control flow.
2.6 Call Graph
Call graphs, like control flow graphs, model the possible
execution paths and control flow in a program [9]. The call graph
is a directed graph representing the inter-procedural control flow.
Like the control flow graph, alternative or abstracted
representations are possible such as dominator trees.
2.7 API Calls
Programs interface with the underlying operating system and
libraries. The invocation of an API function from a known
library can often be identified statically [10]. The API call
sequence gives insight into the behaviour of the program.
2.8 Data Flow
The data flow of a program represents the set of possible values
data may hold during program execution [11]. Many types of data
flow analyses exist, including live variable analysis, reaching
definitions, and value-set analysis. Each analysis looks at a
particular property of the data at specific program points.
Modelling the data flow requires that the control flow be
successfully identified. A simpler model of data dependencies can
be constructed within a basic block, as described in Section 2.4.
2.9 Procedure Dependence Graph
A procedure dependence graph combines the control
dependencies and data dependencies of a procedure into a single
graph.
2.10 System Dependence Graph
The system dependence graph is a collection of procedure
dependence graphs, one for each procedure in the program.
3. COMPARISON OF STATIC PROGRAM FEATURES
Malware may be polymorphic, but certain static program features are
known to be invariant under different polymorphic techniques.
Byte and instruction level program features perform poorly when
faced with polymorphic variations and mutations.
Recompiling source code using different compile time options
may result in syntactic changes including variable renaming, and
instruction substitution. Code normalization can sometimes
reverse the effects of syntactic polymorphism and can work in
practice but is not based on a sound technique. Additionally, the
byte and instruction stream may change when minor semantic
alterations are made to the malware source code.

Bytes                   Instructions
8d 4c 24 04             lea    0x4(%esp),%ecx
83 e4 f0                and    $0xfffffff0,%esp
ff 71 fc                pushl  -0x4(%ecx)
55                      push   %ebp
89 e5                   mov    %esp,%ebp
51                      push   %ecx
83 ec 24                sub    $0x24,%esp
e8 6a 00 00 00          call   4011b0 <___main>
c7 45 f8 00 00 00 00    movl   $0x0,-0x8(%ebp)
eb 10                   jmp    40115f <_main+0x2f>
c7 04 24 a0 20 40 00    movl   $0x4020a0,(%esp)
e8 5d 00 00 00          call   4011b8 <_puts>
83 45 f8 01             addl   $0x1,-0x8(%ebp)
83 7d f8 09             cmpl   $0x9,-0x8(%ebp)
7e ea                   jle    40114f <_main+0x1f>
83 c4 24                add    $0x24,%esp
59                      pop    %ecx
5d                      pop    %ebp
8d 61 fc                lea    -0x4(%ecx),%esp
c3                      ret

Basic blocks:

movl   $0x4020a0,(%esp)
call   4011b8 <_puts>
addl   $0x1,-0x8(%ebp)

lea    0x4(%esp),%ecx
and    $0xfffffff0,%esp
pushl  -0x4(%ecx)
push   %ebp
mov    %esp,%ebp
push   %ecx
sub    $0x24,%esp
call   4011b0 <___main>
movl   $0x0,-0x8(%ebp)
jmp    40115f <_main+0x2f>

add    $0x24,%esp
pop    %ecx
pop    %ebp
lea    -0x4(%ecx),%esp
ret

cmpl   $0x9,-0x8(%ebp)
jle    40114f <_main+0x1f>

Figure 1. Instructions and basic blocks.
The advantage of byte level content as a program feature is that
it does not depend on accurate static analysis of the program's
semantics or structure. If the instruction stream is
used, additional challenges are presented because it is known that
perfect disassembly of an unknown image is undecidable on the
x86 platform [12].
To avoid the problems of syntactic polymorphism, higher level
abstractions of the program can be used. The control flow features
including control flow graphs and call graphs are considered more
invariant in polymorphic malware than byte and instruction level
content [8]. However, opaque predicates - conditions that always
evaluate to the same result but are hard to determine statically -
may result in these features being altered. The detection of opaque
predicates has been investigated, but it is not evident that this is
entirely satisfactory, and a sound method of detection against all
unknown predicates is not possible. For example, some
predicates are constructed from mathematical conjectures that are
strongly believed, but not proven, to always evaluate to the same result.
This implies that automatically proving such a predicate is constant is
hard.
The presence of pointers and indirection in assembly language
also presents problems to static analyses, which may not have the
precision required to construct a control flow graph or call graph
with the degree of accuracy required for malware classification.
For all its disadvantages, control flow has been shown to be an
effective feature that is invariant in most current malware.
The use of API calls is another approach to solve the syntactic
polymorphism problem. This approach has problems with
malware that obscures the use of those calls, as is the case of the
stolen bytes technique [13] introduced by code packing tools.
Data flow analysis is another high level abstraction, but in the
presence of pointers it is subject to the same precision problems
that other static analyses face.
The procedure and system dependence graphs have similar
problems with pointers and indirection even when data
dependencies of pointers are ignored. The dependence graphs
also depend on accurate modelling of the instruction sequence.
Representing the data dependencies as a graph avoids problems
such as register reassignment. The problem instead lies
with the modelled instructions used in the data dependencies,
which may be polymorphic and variant. Polymorphism is not
handled effectively in this situation, although code normalization
may help.
4. STATIC APPROACHES TO MALWARE CLASSIFICATION
Malware classification is the process of determining if an
unknown binary belongs to the class of malicious programs or the
class of benign programs.
4.1 Statistical Classification
A data mining approach to malware detection is to employ
statistical classification. Each classification algorithm constructs a
model, using machine learning, to represent the benign and
malicious classes. In this approach, a labeled training set is
required to build the class models during a process of supervised
learning. Many statistical classification algorithms exist including
Naive Bayes, Neural Networks, and Support Vector Machines.
The key to statistical classification is to represent the malicious
and benign samples in an appropriate manner to enable the
classification algorithms to work effectively. Feature extraction is
an important component of effective classification, and an
associated feature vector that can accurately represent the
invariant characteristics in the training sets and query samples is
highly desirable.
Figure 2. Control flow graph (left) and call graph (right). The control flow graph nodes are the basic blocks of Figure 1; the call graph nodes are Proc_0 to Proc_4.
4.2 Instance-Based Learning
Instance-based learning is a related and traditionally popular
approach that can be employed wherein the query program is
classified by identifying a high similarity to a known instance of
malware in the training set. Traditional Antivirus utilises this
approach when it performs signature based detection. The key
component to perform classification using instance-based learning
is a distance or similarity function between the objects
representing samples and queries. For a distance function to be
effective between objects, the objects must be modeled by a
limited set of features that capture the invariant characteristics of
the malicious and benign programs. In some cases, the distance
function is replaced with a test for equality. However, testing only
for equality reduces the effectiveness of the classification process
when dealing with malware variants. Instance-based learning can
additionally identify high similarity to benign or white-listed
samples, depending on the aims of the classification.
4.3 The Similarity Search Used in Instance-Based Learning
A search of a database to find similar, but not necessarily identical
objects to a query is known as a similarity search. The similarity
search is a central aspect of instance-based learning when applied
to malware detection and classification using a large number of
malware signatures and training instances.
When the distance function between objects has the properties of a
metric, the database can be searched using metric access methods. A similarity
search using metric access methods performs faster than an
exhaustive linear search and enables significantly larger databases
without an equivalent increase in running time.
Metric access methods may use either static or dynamic databases.
In dynamic metric access methods, database operations
such as object insertion can be performed with
reasonable performance.
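To make the idea concrete, here is a minimal Python sketch of one dynamic metric access method, a BK-tree. The survey does not name this structure, so it is chosen here purely as an illustration. The Hamming distance over equal-length strings stands in for whatever metric a real system would use, and the class name, function names, and sample signatures are all hypothetical.

```python
def hamming(a: str, b: str) -> int:
    """Hamming distance; a metric on strings of equal length."""
    return sum(x != y for x, y in zip(a, b))


class BKTree:
    """A BK-tree: each child edge is labelled with its distance to the parent."""

    def __init__(self, distance):
        self.distance = distance
        self.root = None  # a node is a pair (item, {edge_distance: child_node})

    def add(self, item):
        """Dynamic insertion: walk down the edges that match the item's distance."""
        if self.root is None:
            self.root = (item, {})
            return
        node = self.root
        while True:
            d = self.distance(item, node[0])
            child = node[1].get(d)
            if child is None:
                node[1][d] = (item, {})
                return
            node = child

    def search(self, query, radius):
        """Return (distance, item) pairs within the radius, sorted by distance."""
        results = []
        stack = [self.root] if self.root is not None else []
        while stack:
            item, children = stack.pop()
            d = self.distance(query, item)
            if d <= radius:
                results.append((d, item))
            # The triangle inequality rules out subtrees outside [d - r, d + r].
            for edge, child in children.items():
                if d - radius <= edge <= d + radius:
                    stack.append(child)
        return sorted(results)


signatures = ["aaaa", "aaab", "abbb", "bbbb", "aabb"]
tree = BKTree(hamming)
for signature in signatures:
    tree.add(signature)
print(tree.search("aaaa", 1))
```

The pruning step is what lets such a search skip distance computations that an exhaustive linear scan would have to perform.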
5. CLASSIFICATION APPROACHES BY FEATURE
5.1 Object File Header Attributes
Identifying object file discrepancies to detect malware has
sometimes been used by commercial Antivirus to detect
suspicious binaries. peHash [14] proposed a similar technique by
hashing object file features to cluster malware. The advantage of
this approach is that it is highly efficient. The disadvantage is that
using object file attributes predominantly identifies those
attributes of the packing tool used when packing the malware. The
concept is very similar to the techniques used when identifying
code packing using object file features. Classifying a sample as
malicious, based on the packer, is not necessarily accurate. It is
not necessarily true that the presence of code packing indicates
the malicious intent of a program.
5.2 Byte Level Approaches
5.2.1 File Hashing
The simplest approach to malware detection is hashing the
contents of the file and comparing that hash against a blacklist.
This approach is widely used in commercial Antivirus. The
disadvantage of using this approach on its own is that it
fails to detect malware that has undergone any byte level
alteration. However, the blacklisting of specific and unaltered
malware instances is a useful technique that is easily and
efficiently implemented.
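A minimal Python sketch of hash-based blacklisting follows. The choice of SHA-256, the sample payload, and the names file_digest, BLACKLIST, and is_blacklisted are assumptions made for illustration and are not taken from any particular Antivirus product.

```python
import hashlib


def file_digest(content: bytes) -> str:
    """Hash the raw file content (SHA-256 is an assumed choice of hash)."""
    return hashlib.sha256(content).hexdigest()


# Hypothetical blacklist containing the digest of one known malicious file.
KNOWN_MALWARE = b"known malicious payload"
BLACKLIST = {file_digest(KNOWN_MALWARE)}


def is_blacklisted(content: bytes, blacklist=BLACKLIST) -> bool:
    """Exact-match lookup: any byte level change produces a different digest."""
    return file_digest(content) in blacklist


print(is_blacklisted(KNOWN_MALWARE))         # the unaltered instance
print(is_blacklisted(KNOWN_MALWARE + b"!"))  # a one-byte variant
```

The second lookup shows the weakness described above: a single appended byte is enough to evade the blacklist.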
5.2.2 Kolmogorov Complexity
Kolmogorov complexity is a theoretical measure of the
descriptive complexity of an object: the minimum string length, in a
universal description language, required to represent the object or
set of data. It is not computable. To
estimate the Kolmogorov complexity, an object may be
compressed and concatenated with the associated decompression
routine, to give the approximate minimum string length to
describe the object. The observation, when this theory is related to
malware, is that similar malware have similar measures of
Kolmogorov complexity. This form of analysis occurs on the
malware's raw file or section content.
Estimating Kolmogorov complexity was proposed in peHash [14]
by identifying the compression ratio of a malicious sample that
was subsequently used for clustering malware families. Another
measure of similarity related to Kolmogorov complexity is the
Normalized Compression Distance (NCD). The NCD was used in
[15] to cluster worms into families. This approach, like peHash
[14], was not used to classify samples as being benign or
malicious, but to cluster malicious samples only.
It was observed in [16] that malware and benign programs
can be classified according to a likeness to a compression model
for each of the malicious and benign classes. In this research, it
was proposed that two compression models be constructed from
two training sets, one of malicious samples, and one of benign
samples. To classify a query sample as being malicious or benign,
the number of bits required to encode the query was calculated for
each compression model. The query was classified by identifying
the class that requires the least data to encode the query.
5.2.3 String Signatures
Static string signatures have been the dominant technique used in
traditional Antivirus. String signatures represent patterns in a
malware’s raw content used to uniquely identify it. In [1, 2] it was
proposed to automatically extract the string signatures used by the
detection system. The set of all candidate string signatures of
a fixed size is extracted from the malware, and those that result in
high similarity to a corpus of benign programs are removed from
the signature candidate list.
String signatures may use fast string matching algorithms to detect
malicious instances [3]. Wildcards and regular expressions
present extensions of string matching that can be used in the
detection of malware variants. String signatures are efficient, and
more effective than file hashing, but can be ineffective when faced
with polymorphic malware variants.
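As a small illustration of wildcard signatures, the Python sketch below compiles a hex pattern with ?? wildcards into a regular expression over raw bytes. The ?? notation, the function name compile_signature, and the example signature (a call with a wildcarded 4-byte offset followed by bytes taken from Figure 1) are assumptions for this example and not the signature format of any specific product.

```python
import re


def compile_signature(pattern: str) -> "re.Pattern[bytes]":
    """Compile a hex signature such as 'e8 ?? ?? ?? ?? c7' into a bytes regex."""
    parts = []
    for token in pattern.split():
        if token == "??":
            parts.append(b".")  # wildcard: any single byte
        else:
            parts.append(re.escape(bytes([int(token, 16)])))
    # DOTALL so that the wildcard also matches the byte 0x0a.
    return re.compile(b"".join(parts), re.DOTALL)


# A call with any relative offset, followed by the start of 'movl $0x0,-0x8(%ebp)'.
signature = compile_signature("e8 ?? ?? ?? ?? c7 45 f8")

original = bytes.fromhex("83ec24e86a000000c745f800000000")
relocated = bytes.fromhex("83ec24e8990a0000c745f800000000")  # different call offset

print(bool(signature.search(original)), bool(signature.search(relocated)))
```

The wildcards let one signature match both byte sequences even though the call offset differs, which a plain string search or file hash would miss.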
5.2.4 Malware Normalization
To improve the effectiveness of string based signatures, malware
normalization has been proposed. Normalizing malware before
passing it to Antivirus software was investigated by
Christodorescu et al. in [17]. Static analysis was carried out to
eliminate unnecessary control flow as indicated by superfluous
unconditional branches. Semantic nops were also removed from
the malware by using decision procedures. At this point the
malware was passed, now in a more canonical form, to Antivirus
software.
Another approach to the code normalization problem was to
rewrite sequences of code using compiler optimisation techniques
[18]. Expression propagation, dead code elimination, and
expression simplification using algebraic identities were used. The
intuition is that an optimising compiler
removes the redundancy of the original code and improves its
terseness, resulting in a normalized representation.
An approach using term rewriting was proposed in [19] where
rewrite rules were constructed to model the malware
transformations that occur during polymorphic and metamorphic
mutation. From these, a normalizing rule-set was constructed that
could rewrite the malware to a canonical or near canonical
representation.
5.3 Instruction Distributions
5.3.1 Opcode Distributions
Instruction level content of malware can provide a more resilient
representation than the byte level data. This is especially true if
the instruction arguments or operands are ignored leaving only the
opcodes to be examined. To determine the instructions and
opcodes, a disassembly of the malware is required. A
classification technique using the statistical distribution of
opcodes as a predictor of malware was proposed in [5]. The
investigation found that rarely occurring opcodes were a stronger
predictor of malware than frequently occurring
opcodes. The disadvantage of opcode distributions is that
polymorphism can change the distributions.
5.3.2 Byte and Instruction Level N-grams
An approach to classify malware using evolutionary trees and
phylogeny based on n-grams and n-perms was proposed by Karim
et al. in [20]. Their approach first established a similarity between
malware samples. Phylogeny, which generates evolutionary
trees, could then be used for taxonomy. N-grams of byte and
instruction level content were extracted as feature vectors from each
binary. These vectors were compared to establish a similarity
using a variety of metrics. One such metric was cosine similarity.
N-perms extended the concept of n-grams to group permutations
of each n-gram as a single feature. N-grams are more resistant to
polymorphic changes than string based signatures, but are not
highly effective when faced with techniques such as instruction
substitution or register reassignment.
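The n-gram comparison can be sketched in Python as follows. This is a minimal illustration and not the method of [20]: it uses byte level 3-grams only, the bytes are those listed in Figure 1, and the variant (a changed call offset), the unrelated input, and the function names are assumptions.

```python
import math
from collections import Counter


def ngrams(data: bytes, n: int = 3) -> Counter:
    """Count every overlapping byte n-gram in the data."""
    return Counter(data[i:i + n] for i in range(len(data) - n + 1))


def cosine_similarity(a: Counter, b: Counter) -> float:
    """Cosine of the angle between two sparse n-gram count vectors."""
    dot = sum(count * b[gram] for gram, count in a.items())
    norm_a = math.sqrt(sum(count * count for count in a.values()))
    norm_b = math.sqrt(sum(count * count for count in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


# The byte content shown in Figure 1.
original = bytes.fromhex(
    "8d 4c 24 04 83 e4 f0 ff 71 fc 55 89 e5 51 83 ec 24 e8 6a 00 00 00"
    " c7 45 f8 00 00 00 00 eb 10 c7 04 24 a0 20 40 00 e8 5d 00 00 00"
    " 83 45 f8 01 83 7d f8 09 7e ea 83 c4 24 59 5d 8d 61 fc c3"
)
# Hypothetical variant: the same code with a different call offset.
variant = original.replace(bytes.fromhex("e8 6a"), bytes.fromhex("e8 7b"))
unrelated = bytes(range(64))

print(cosine_similarity(ngrams(original), ngrams(variant)))
print(cosine_similarity(ngrams(original), ngrams(unrelated)))
```

A single changed byte disturbs only the three n-grams that overlap it, so the variant stays close to the original while the unrelated input shares no 3-grams at all.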
5.3.3 N-gram Analysis and Machine Learning
N-gram analysis using machine learning was proposed by Perdisci
et al. to classify malware in McBoost [21]. A similar method also
involving machine learning and classification was proposed in
[4]. The classification was performed using an algorithm similar to
the n-gram analysis that McBoost used to detect packed binaries.
The most informative n-grams of a training set were selected, and
the occurrence of those n-grams in each binary was represented as a
vector. Those vectors were used to train a statistical classifier.
Automated unpacking was performed on the malware, if
necessary, using a method similar to the Renovo unpacking
system.
5.4 Basic Blocks
5.4.1 Edit Distances
The edit distance between basic blocks was used to classify
malware by Gheorghescu in [6]. The edit distance describes the
number of insertions, deletions and substitutions to convert one
string to another. To classify malware using edit distance, each
malware was statically disassembled and the basic blocks
extracted. These basic blocks were then considered as strings.
Classification then proceeded to build a similarity between the
malware's basic blocks and the binary being examined using a
similarity ratio based on the edit distance.
The main disadvantage with this approach is that minor changes
to the malware source code can result in significant changes to
almost all basic blocks. Changes in compiler configuration and
optimisations can equally result in large changes.
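The edit distance itself can be sketched in Python with the standard dynamic programming recurrence. The similarity ratio shown here (one minus the distance divided by the longer length) is an assumed normalisation and not necessarily the ratio used in [6]. The example treats each basic block as a sequence of mnemonics, and the function names are hypothetical.

```python
def edit_distance(a, b) -> int:
    """Levenshtein distance between two sequences (strings or lists)."""
    previous = list(range(len(b) + 1))
    for i, x in enumerate(a, start=1):
        current = [i]
        for j, y in enumerate(b, start=1):
            current.append(min(
                previous[j] + 1,             # deletion
                current[j - 1] + 1,          # insertion
                previous[j - 1] + (x != y),  # substitution, free on a match
            ))
        previous = current
    return previous[-1]


def similarity_ratio(a, b) -> float:
    """Assumed normalisation: 1.0 for identical sequences, 0.0 for no overlap."""
    longest = max(len(a), len(b))
    return 1.0 if longest == 0 else 1.0 - edit_distance(a, b) / longest


# One basic block from Figure 1 and a hypothetical variant with an inserted nop.
block = ["movl", "call", "addl"]
block_variant = ["movl", "call", "nop", "addl"]

print(edit_distance(block, block_variant), similarity_ratio(block, block_variant))
```

Comparing mnemonics rather than raw bytes is a design choice of this sketch. It makes the distance insensitive to operand changes but not to inserted or reordered instructions.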
5.4.2 Membership Testing - Inverted Index and Bloom Filters
Gheorghescu also proposed finding identical
basic block features shared between malware [6]. The inverted index and
Bloom filters provided faster searching for exact matches in a
database. An inverted index is an associative mapping between
the content and the source of that content. A Bloom filter allows
fast set membership queries that may return false positives but
never false negatives. To make this an approximate
search, feature extraction was performed on the basic blocks. Bloom
filters were shown to be fast enough to run on a
desktop system, but not fast enough for desktop Antivirus. The
disadvantage of this approach is even more pronounced than
using the edit distance between basic blocks. Minor changes to
the malware source or compiler configuration can change almost
all basic blocks.
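Both data structures can be sketched briefly in Python. This is an illustrative sketch only: the bit array size, the number of hash functions, the use of salted SHA-256 digests, the sample names, and the basic block feature strings are all assumptions and are not taken from [6].

```python
import hashlib
from collections import defaultdict


class BloomFilter:
    """Set membership with possible false positives and no false negatives."""

    def __init__(self, size_bits: int = 1024, hash_count: int = 3):
        self.size_bits = size_bits
        self.hash_count = hash_count
        self.bits = bytearray((size_bits + 7) // 8)

    def _positions(self, item: bytes):
        # Derive hash_count bit positions from salted SHA-256 digests.
        for seed in range(self.hash_count):
            digest = hashlib.sha256(bytes([seed]) + item).digest()
            yield int.from_bytes(digest[:8], "big") % self.size_bits

    def add(self, item: bytes) -> None:
        for position in self._positions(item):
            self.bits[position // 8] |= 1 << (position % 8)

    def __contains__(self, item: bytes) -> bool:
        return all(self.bits[position // 8] & (1 << (position % 8))
                   for position in self._positions(item))


def build_inverted_index(samples: dict) -> dict:
    """Map each basic block feature to the set of samples that contain it."""
    index = defaultdict(set)
    for name, features in samples.items():
        for feature in features:
            index[feature].add(name)
    return index


# Hypothetical samples described by basic block features (mnemonic strings).
samples = {
    "sample_a": [b"movl call addl", b"cmpl jle"],
    "sample_b": [b"cmpl jle", b"add pop pop lea ret"],
}
index = build_inverted_index(samples)
known_blocks = BloomFilter()
for features in samples.values():
    for feature in features:
        known_blocks.add(feature)

print(sorted(index[b"cmpl jle"]), b"cmpl jle" in known_blocks)
```

The Bloom filter answers only whether a block may have been seen before, while the inverted index also reports which samples it came from.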
5.5 Control Flow Graphs
5.5.1 Whole Program Control Flow Graph Isomorphism Recognition Using Tree Automata
A fast approach to detecting whole program control flow graph
isomorphism and subgraph isomorphism was proposed in [22].
This approach constructed a spanning tree based structure from
the control flow graph, and then built a tree automaton for graph
recognition. This approach appears to have reasonable
performance. However, this technique is not effective at detecting
malware variants that alter the control flow or have semantic
changes.
5.5.2 Common k-subgraphs
Decomposing control flow graphs into subgraphs was proposed
by Kruegel et al. in [8] to classify polymorphic worms. The
control flow graphs were decomposed into the set of all subgraphs
of fixed size k, where k is the number of nodes in the subgraph.
The k-subgraphs were subsequently transformed into their
canonical labeled form. The adjacency matrix of the canonically
labeled graph was transformed into a string. This string represented
the k-subgraph feature of the control flow being analysed. Worm
detection and classification occurs through identifying the
prevalence of k-subgraph features between worm-like executable
content and unclassified executable programs. The performance
of this system was reasonable. Because the classification only
occurs on network streams identified as potential worms, it is hard
to determine the accuracy of the classification when applied to a
larger set of malware.
5.5.3 Structured Control Flow Graphs Using Decompilation
An approach that decompiles the control flow into a high level,
source-code-like representation was proposed in [23]. Comparing
two control flow graphs is performed by using the string edit
distance on their decompiled sources. The similarities of each
control flow graph are accumulated to give a similarity taking into
account the entire program. A related approach is to decompose
the decompiled strings into q-grams [24].
5.6 Call Graphs
5.6.1 Whole Program Context-Free Control Flow
It was proposed in [25] that the inter-procedural control flow
information could be represented as a context free grammar with a
limited loss of information. A string could represent the grammar,
and string equality could be used to show equivalence between
grammars and the inter-procedural control flow they represented. The
advantage of this approach is that string based representations
allow for fast searches in a malware database using a dictionary
search. The disadvantage of the approach investigated in this
research is that it did not employ approximate matching of the
inter-procedural control flow. For polymorphic malware variants
that alter the control flow through source code modification, an
approximate match is necessary for detection of the malware.
5.6.2 Flowgraph Based Classification using Fixed Points
Carrera proposed an approximate flowgraph matching algorithm
in [9] by identifying fixed points in the flowgraphs and
successively matching surrounding nodes in the graph. Carrera
built a similarity index between malware and used this to build
phylogeny trees for taxonomy. Dullien and Rolles expanded the
approximate graph comparison algorithm in [26] to identify
identical nodes between call graphs and control flow graphs.
Their algorithm worked by identifying nodes, or fixed points,
between binaries that have uniquely identifiable features.
Features for a node in the call graph include the number of basic
blocks, control flow edges, and number of subfunction calls.
Carrera also proposed an estimation of a control flow graph
isomorphism based on string equality and a string signature of the
graph representing a graph traversal. Once a set of fixed points
was known, their neighbouring nodes could be examined.
Identifying neighbours sharing common and unique features
iteratively allowed greater parts of the flowgraph to be identified.
The advantage of this approach is that it allows for moderately
fast pair-wise comparison between graphs. However, the approach
does not perform efficiently for a database of graphs and is not
fast enough for desktop Antivirus use.
5.6.3 Approximating the Graph Edit Distance
An alternative approach to approximate graph matching was
proposed in the SMIT system [27]. SMIT employed
bipartite graphs and the Hungarian (or Munkres) algorithm to find
matching nodes between two call graphs being compared in O(N^3)
running time. The strength of their matching algorithm was that
it could be used as an approximation to the graph edit
distance. The graph edit distance between two graphs is the
minimum number of edit operations needed to convert one graph to the other. The
graph edit distance gives a sound basis for similarity and
dissimilarity between graphs. The graph edit distance is known to
have the properties of a metric which allows the use of metric
access methods to search a database of objects.
5.7 API Calls
API calls are a feature used in several malware classification and
detection systems. In the SAVE system [28], it was proposed to
disassemble the malware image and then to extract the API
calls in the sequence in which they appeared in the disassembled output. The API
call sequence was used to construct a vector. Similarity between
vectors was measured using the cosine similarity measure, the
Jaccard index [29], and Pearson's correlation measure.
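Of these measures, the Jaccard index is the simplest to illustrate. The Python sketch below computes it over the sets of API names called by two programs. Treating the call sequence as a set discards ordering, which is a simplification made here and not the vector construction used in SAVE. The Windows API names and the function name jaccard_index are chosen only as examples.

```python
def jaccard_index(a: set, b: set) -> float:
    """Size of the intersection divided by the size of the union."""
    if not a and not b:
        return 1.0  # convention assumed here: two empty sets are identical
    return len(a & b) / len(a | b)


# Hypothetical API calls extracted from two disassembled programs.
calls_known = {"CreateFileA", "WriteFile", "RegSetValueExA"}
calls_query = {"WriteFile", "RegSetValueExA", "connect"}

print(jaccard_index(calls_known, calls_query))
```

Two of the four distinct API names are shared, so the index for this pair is 0.5.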
An alternative approach to using vectors to represent API call
sequences was proposed in the IMDS malware detection system
[10]. IMDS’s approach employed the use of the data mining
technique known as association mining. Association mining was
able to associate sequences of API calls to classify query samples
as benign or malicious.
5.8 Data Flow
Combining data flow analysis and control flow analysis was
proposed in [11, 30]. Annotations were made to the control flow
graphs to incorporate abstractions of the instructions and data
flow. These annotated flowgraphs were compared to signatures, or
automata, that described the malware. If the malware signature
was present in the query program, a malware instance had been
detected. In [31], value set analysis was used as a specific data
flow analysis to identify fixed points that were subsequently used
to construct signatures.
6. TRENDS
Malware obfuscation has been increasingly addressed by
researchers, and deobfuscation will continue to be developed and
incorporated into malware detection systems. These deobfuscation
techniques have increasingly borrowed from formal program
analyses in an attempt to make sound analyses possible within
their given constraints.
Malware classification has employed statistical techniques to
detect unknown malware. We believe research will continue using
this approach and new features will be developed that can more
accurately characterize malware. Instance-based learning will also
be developed further, with particular research opportunities in
working with large-scale datasets.
Static program features have been extracted at increasing levels of
abstraction, and we expect this to continue in future research.
Abstraction has the benefit of being resistant to lower-level
polymorphic changes. The performance of these research systems
has not been fully investigated, and we expect that future research
opportunity lies in making classification systems practical for
industrial and widespread use.
7. CONCLUSION
Detecting malware before it is allowed to execute is an important
feature of Antivirus and system security. Static analysis
techniques extract features from programs, which allows
machine learning to identify variants of malware and novel
samples. Malware packing, which hides the code from analysis,
remains the main sticking point for static detection, and it can be
hard to reverse all packers automatically. If unpacking is
achievable, malware detection using static analysis is quite
feasible, and we expect the accuracy and efficiency of such
systems to continue to improve as research continues.
REFERENCES
[1] K. Griffin, S. Schneider, X. Hu, and T. Chiueh,
"Automatic Generation of String Signatures for
Malware Detection," in Recent Advances in Intrusion
Detection: 12th International Symposium, RAID 2009,
Saint-Malo, France, 2009.
[2] J. O. Kephart and W. C. Arnold, "Automatic extraction
of computer virus signatures," in 4th Virus Bulletin
International Conference, 1994, pp. 178-184.
[3] A. V. Aho and M. J. Corasick, "Efficient string
matching: an aid to bibliographic search,"
Communications of the ACM, vol. 18, p. 340, 1975.
[4] J. Z. Kolter and M. A. Maloof, "Learning to detect
malicious executables in the wild," in International
Conference on Knowledge Discovery and Data Mining,
2004, pp. 470-478.
[5] D. Bilar, "Opcodes as predictor for malware,"
International Journal of Electronic Security and Digital
Forensics, vol. 1, pp. 156-168, 2007.
[6] M. Gheorghescu, "An automated virus classification
system," in Virus Bulletin Conference, 2005, pp. 294-
300.
[7] A. V. Aho, R. Sethi, and J. D. Ullman, Compilers:
principles, techniques, and tools. Reading, MA:
Addison-Wesley, 1986.
[8] C. Kruegel, E. Kirda, D. Mutz, W. Robertson, and G.
Vigna, "Polymorphic worm detection using structural
information of executables," Lecture notes in computer
science, vol. 3858, p. 207, 2006.
[9] E. Carrera and G. Erdélyi, "Digital genome mapping–
advanced binary malware analysis," in Virus Bulletin
Conference, 2004, pp. 187-197.
[10] Y. Ye, D. Wang, T. Li, and D. Ye, "IMDS: intelligent
malware detection system," in Proceedings of the 13th
ACM SIGKDD international conference on Knowledge
discovery and data mining, 2007.
[11] M. Christodorescu, S. Jha, S. A. Seshia, D. Song, and
R. E. Bryant, "Semantics-aware malware detection," in
Proceedings of the 2005 IEEE Symposium on Security
and Privacy (S&P 2005), Oakland, California, USA,
2005.
[12] R. N. Horspool and N. Marovac, "An approach to the
problem of detranslation of computer programs," The
Computer Journal, vol. 23, pp. 223-229, 1979.
[13] L. Boehne, "Pandora’s Bochs: Automatic Unpacking of
Malware," University of Mannheim, 2008.
[14] G. Wicherski, "peHash: A Novel Approach to Fast
Malware Clustering," in Usenix Workshop on Large-
Scale Exploits and Emergent Threats (LEET'09),
Boston, MA, USA, 2009.
[15] S. Wehner, "Analyzing worms and network traffic using
compression," Journal of Computer Security, vol. 15,
pp. 303-320, 2007.
[16] Y. Zhou and W. M. Inge, "Malware detection using
adaptive data compression," in Proceedings of the 1st
ACM Workshop on AISec (AISec '08),
2008, pp. 53-60.
[17] M. Christodorescu, J. Kinder, S. Jha, S. Katzenbeisser,
and H. Veith, "Malware normalization," University of
Wisconsin, Madison, Wisconsin, USA Technical Report
#1539, 2005.
[18] D. Bruschi, L. Martignoni, and M. Monga, "Using code
normalization for fighting self-mutating malware,"
presented at the Proceedings of International
Symposium on Secure Software Engineering, 2006.
[19] A. Walenstein, R. Mathur, M. R. Chouchane, and A.
Lakhotia, "Normalizing Metamorphic Malware Using Term
Rewriting," presented at the Proceedings of the Sixth
IEEE International Workshop on Source Code Analysis
and Manipulation, 2006.
[20] M. E. Karim, A. Walenstein, A. Lakhotia, and L. Parida,
"Malware phylogeny generation using permutations of
code," Journal in Computer Virology, vol. 1, pp. 13-23,
2005.
[21] R. Perdisci, A. Lanzi, and W. Lee, "McBoost: Boosting
Scalability in Malware Collection and Analysis Using
Statistical Classification of Executables," in
Proceedings of the 2008 Annual Computer Security
Applications Conference, 2008, pp. 301-310.
[22] G. Bonfante, M. Kaczmarek, and J. Y. Marion,
"Morphological Detection of Malware," in International
Conference on Malicious and Unwanted Software,
IEEE, Alexandria VA, USA, 2008, pp. 1-8.
[23] S. Cesare and Y. Xiang, "Classification of Malware
Using Structured Control Flow," in 8th Australasian
Symposium on Parallel and Distributed Computing
(AusPDC 2010), 2010.
[24] S. Cesare and Y. Xiang, "Malware Variant Detection
Using Similarity Search over Sets of Control Flow
Graphs," in IEEE Trustcom, 2011.
[25] G. R. Thompson and L. A. Flynn, "Polymorphic malware
detection and identification via context-free grammar
homomorphism," Bell Labs Technical Journal, vol. 12,
pp. 139-147, 2007.
[26] T. Dullien and R. Rolles, "Graph-based comparison of
Executable Objects (English Version)," in SSTIC, 2005.
[27] X. Hu, T. Chiueh, and K. G. Shin, "Large-Scale
Malware Indexing Using Function-Call Graphs," in
Computer and Communications Security, Chicago,
Illinois, USA, pp. 611-620.
[28] A. H. Sung, J. Xu, P. Chavez, and S. Mukkamala,
"Static analyzer of vicious executables (save)," 2004,
pp. 326-334.
[29] G. Salton and M. J. McGill, Introduction to modern
information retrieval: McGraw-Hill New York, 1983.
[30] M. Christodorescu and S. Jha, "Static analysis of
executables to detect malicious patterns," presented at
the Proceedings of the 12th USENIX Security
Symposium, 2003.
[31] F. Leder, B. Steinbock, and P. Martini, "Classification
and Detection of Metamorphic Malware using Value
Set Analysis," in Proc. of 4th International Conference
on Malicious and Unwanted Software (Malware 2009),
Montreal, Canada, 2009.