

Malware detection by behavioural sequential patterns

Traditional signature-based malware detection techniques are the basis of many industrial anti-virus products. These techniques are preferred because of their high detection rate versus low false alarms. Because false alarms can cause loss of useful and harmless information and executables, products with high false alarm rates have never been acceptable as standalone security products. Even though signature-based techniques do not suffer from these pitfalls, these approaches do have a significant drawback – their weakness against new, unseen and polymorphic/metamorphic malware.

Mansour Ahmadi, Young Researchers and Elite Club, Shiraz Branch, Iran; Ashkan Sami, Hossein Rahimi, Babak Yadegari, Shiraz University, Iran

For many years, malware has been the subject of intensive study by researchers in industry and academia. Malware production, while not being an organised business, has reached a level where automatic malicious code generators/engines are easily found. These tools are able to exploit multiple techniques for countering anti-virus (AV) protections, from aggressive AV killing to passive evasive behaviours in any arbitrary malicious code or executable. The development of such techniques has led to easier creation of malicious executables. Consequently, an unprecedented prevalence of new and unseen malware is being observed. Reports suggested a global, annual economic loss due to malware exceeding $13bn in 2007.1


So, using signature-based techniques to immunise our computing resources is not sufficient to fight newly created and/or polymorphic/metamorphic malware. Even if users regularly update the signature database for their installed security software, it is quite likely that an attack will arrive while there is still no signature in the threat database to detect it.

Heuristic efforts

To address the weaknesses of the signature-based method, there have been many heuristic-based efforts in both the static and dynamic analysis sectors. Some of these works have used static analysis of unchangeable executable characteristics or distance-based signature matching.2,3,4 Others have used dynamic analysis, with techniques such as graph-based signatures and instruction sequence mining.5,6 Because of the uncertainty in these approaches, the challenge is to improve the models by extracting more semantics, or to gain a better detection rate with a low false-alarm rate.

Although API calls are commonly analysed in existing anti-virus systems and sandboxes, our work is the first to use iterative pattern mining to detect malware. Iterative patterns, in contrast to simple sequential patterns, hold information on repetitive occurrences of items. We used iterative patterns because ‘iteration’ is an inevitable characteristic of computer programs, including malicious ones. The iterative patterns of API calls can be the result of both conditional iteration and recursive subroutines used by a programmer. Additionally, repetitive actions on data sequences are used by malware writers, most notably the well-known loops that perform decryption/encryption and infection. Because iterative patterns have a greater potential than other sequence-based substructures to carry useful information and semantics about computer programs, our work has included experiments on how they can help to detect malware.

Logging calls

Another feature of our work is the way we log the API calls of running programs. To extract API call traces, we used hooking techniques with a tool named WinAPIOverride32.7 In contrast to many debugging tools, this method does not insert breakpoints into the program’s code and avoids running the program in single-stepping mode. Such actions by a debugger are signs of a debugger’s presence and are easily detected by many malicious programs at runtime.8 This detection results in termination of malicious behaviour as soon as the program under investigation finds itself being debugged. Our tool instead uses API call hooking, a technique akin to sniffing the calls a program makes to the Windows API without explicitly altering the way the program runs.

To give an overall view of the approach, our work can be divided into five major steps. First, we gathered malware samples. Then we ran each PE in a controlled environment such as VMware or QEMU and captured its interactions with the operating system API.9,10 After that, a dataset was built from the API call logs gathered in the previous step. In the next phase, we looked for iterative API patterns that occur more often than a minimum support threshold in the dataset. Finally, the patterns were used as features in an initial dataset. Considering the huge number of initial features, we also applied feature selection algorithms to achieve better performance. After building the datasets with the selected features and appending a class label to each instance, we ran multiple classifiers, such as Random Forest and SVM, on them. These runs have led to a best accuracy of 95% while keeping the detection rate as high as 98.4%.

Background and related work

There are two major approaches, static and dynamic analysis, in the field of malware detection.

Static analysis is a process of analysing software without actually running it. The great benefit of static analysis approaches is their efficiency compared to dynamic methods. But there are various anti-analysis techniques, such as code packing, anti-debugging, control-flow and entry point obfuscation, that make static analysis less than ideal.11,12 Malware can easily use these anti-AV methods to mislead disassembler and static analyser tools.13 In addition, analysers may analyse parts of the code that are never actually run. Some static binary analysis techniques (Christodorescu et al) have been introduced to detect malware or identify the types of malware based on signature distance.14 However, polymorphic and metamorphic malware can change their structure using self-modification at runtime. Some methods (Sami et al and Ye et al) have improved the static approaches and have specifically addressed the polymorphism and metamorphism problem. These techniques use association rule mining and classification algorithms such as decision trees and SVM to detect malware based on API calls. The dataset that they use is a combination of malware and benign executables: each record is the set of API calls extracted from a malicious or benign PE, with a class label indicating which type it is. Although these methods improve the efficiency of malware detection, inserting fake API calls and using APIs that are only resolved and called at runtime reduce the effectiveness of their detection systems. The inherent limitations of static analysis persuaded us to introduce a dynamic detection system that addresses these problems.

Compared to static techniques, dynamic techniques analyse the code during runtime, and this can immunise the analysis process against many obfuscation techniques and even self-modifying programs. Two types of work are closely related to our approach – those of Clemens Kolbitsch et al and Jianyong Dai et al. In Jianyong Dai et al, the authors extract a sequence of instructions from both malicious and benign programs and use frequent blocks of assembly instructions as features in the dataset. Finally, they use a classification algorithm, achieving an accuracy of 91.9%. The dataset contains 635 software samples, of which 267 are malicious and the others are benign. The failure point of this method arises when malware is metamorphic – it can replace assembly instructions and mislead the classifier into extracting legitimate-looking information. Other approaches presented in Clemens Kolbitsch et al use dependency graphs.15 In this approach, a dependency graph is built based on observation of malware behaviour. The graph nodes represent API calls and the edges show the dependency between API parameters: the input and output parameters of subsequent API calls establish the dependency. The behavioural graph is used as the signature of each PE. Despite their claim, this method is not even a working signature-based method, because it fails to detect samples it has already observed when constructing its signature database. It is also known that subgraph isomorphism is an NP-hard problem. Moreover, it is not an effective method compared to ours, because the average detection rate on 263 malware samples is 63%, which is much lower.

Polymorphic malware

Polymorphism is the most common method used by many types of malware to make detection complicated. The functionality of polymorphic code is the same as the original code, but the appearance is varied for each mutation.

We describe here some polymorphism methods (Table 1) such as renaming the variables, replacing and reordering the statements, replacing the controls, inserting junk code, subroutine outlining and spaghetti code. In the variable renaming method, a variable name is changed throughout the code – a simple and effective method. In statement replacing, each line can be replaced by an equivalent expression, and in statement reordering malware can change the order of the code lines.

Figure 1: System overview.

Original code:
int x, y, c;
c = 1; x = 0; y = 6;
while (c < y) { x = x + y; }

Variable renaming:
int m, n, o;
o = 1; m = 0; n = 6;
while (o < n) { m = m + n; }

Statement reordering:
int x, y, c;
y = 6; x = 0; c = 1;
while (c < y) { x = x + y; }

Statement replacing:
int x, y, c;
c = 1 * 1; x = 0 * c; y = 6 / c;
while (c < y) { x = x + y; }

Junk code insertion:
int x, y, c;
c = 1; x = 0; y = 6;
bool v = true;
if (v) { while (c < y) { x = x + y; } }

Spaghetti code:
int x, y, c;
goto K;
J: while (c < y) { x = x + y; }
goto L;
K: c = 1; x = 0; y = 6;
goto J;
L:

Table 1: Polymorphism techniques.


Inserting irrelevant code is another useful approach. In this method, junk code is added to the main code and does not affect the logic of the executable. In the spaghetti code method, goto jump commands are used to confuse the virus scanner, although the execution order is kept the same as in the original code. Metamorphism is another obfuscation method like polymorphism. Although polymorphic malware creates a unique decrypter for obfuscating the original code, metamorphic malware morphs the original code itself so that manual tracing is impossible. As these examples show, detecting malicious code by matching or mining assembly code is not an effective method, so we use the executables’ API call patterns to make malware incapable of evading our proposed method.

Our system

Our proposed method uses the following five steps:
• Capture malware behaviour.
• Extract the API sequence from logs for each PE.
• Feature extraction by iterative patterns.
• Feature selection by Fisher score.
• Use a classifier for prediction.

In order to monitor the behaviour of executables, a controlled environment such as a virtual machine is required, because executing malware on an ordinary system is dangerous. Virtual machines such as VMware or VirtualBox can create snapshots to preserve the initial state of the installed operating system. We run each malware sample and, after capturing its interactions with the operating system, we restore the virtual machine to the previous safe snapshot. Restoring the virtual machine is vital to our analysis, since we want to make sure that the malware has not caused damage that would disrupt the following PE execution. In other words, executing malware may affect system storage and other system parameters, and this could consequently skew the results of later malware executions. So it is important to return to the operating system’s initial state.
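As a minimal sketch of this revert-and-run step (our own illustration, not the paper’s tooling), the snippet below drives VMware’s vmrun utility from Python. The .vmx path and the snapshot name ‘clean’ are hypothetical placeholders, and the exact vmrun invocation can differ between VMware products.

import subprocess

VMX = r"C:\VMs\analysis-guest\analysis-guest.vmx"   # hypothetical guest VM
SNAPSHOT = "clean"                                  # hypothetical snapshot name

def revert_and_start(vmx: str, snapshot: str) -> None:
    # Roll the guest back to the pristine snapshot taken before any sample ran.
    subprocess.run(["vmrun", "-T", "ws", "revertToSnapshot", vmx, snapshot], check=True)
    # Boot the reverted guest so the next sample can be executed and monitored.
    subprocess.run(["vmrun", "-T", "ws", "start", vmx], check=True)

if __name__ == "__main__":
    revert_and_start(VMX, SNAPSHOT)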

When a binary file runs, it invokes different functions, and this can be logged and evaluated to determine the executable’s behaviour. The trace log can be pre-processed into a suitable input for further analysis. We consider a sequence of API calls as the executable’s behaviour, so we first log the malware’s API calls. For this purpose we did not use debugging tools: there are many methods by which malware can detect debuggers and so evade the debugging process.16


We used API monitoring software named WinAPIOverride32, which monitors both calls to Microsoft Windows APIs and internal functions. First, it injects a DLL into the target process, then the DLL initialises interprocess communication with WinAPIOverride32 (see Figure 2). The role of the injected DLL is to set up hooks, load DLLs and monitoring files, and make function calls in the target process. A hook is a point in the message-handling mechanism of a computer system at which a program can install a function and monitor the messages transmitted among processes. Usually a hook system is composed of at least two main parts, a hook server and a driver. By means of the hook server, the hook driver is injected into the targeted processes. WinAPIOverride32 can also process certain kinds of messages before they reach the targeted Windows procedure. To perform hooking on a function call, we need the function’s address. This can be obtained directly by performing a LoadLibrary API call with the DLL name, or by reading the address table of the executable located in its header.

Figure 2: Using API monitoring software WINAPIOverride32.


To create the hook, the program has to modify the initial assembly instructions at the specified function address. For example, GetProcAddress on the MessageBoxA function in User32.dll returns 0x77d3add7 on some versions of Windows XP. Therefore, to set up the hook, WinAPIOverride32 just has to replace the first bytes at address 0x77d3add7 with ‘call OurHandlerAddr’ using the WriteProcessMemory API function.
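To illustrate just the address-resolution part of this step (a sketch of our own for a Windows host, and not part of WinAPIOverride32), the following Python/ctypes fragment resolves the in-process address of MessageBoxA via LoadLibrary and GetProcAddress; a hooking tool would then patch the bytes at that address.

import ctypes

kernel32 = ctypes.WinDLL("kernel32", use_last_error=True)
kernel32.LoadLibraryW.restype = ctypes.c_void_p
kernel32.LoadLibraryW.argtypes = [ctypes.c_wchar_p]
kernel32.GetProcAddress.restype = ctypes.c_void_p
kernel32.GetProcAddress.argtypes = [ctypes.c_void_p, ctypes.c_char_p]

hmod = kernel32.LoadLibraryW("user32.dll")             # make sure the DLL is loaded
addr = kernel32.GetProcAddress(hmod, b"MessageBoxA")   # address the hook would patch
print(hex(addr))
# A hooking tool overwrites the first bytes at this address with a jump/call to
# its own handler (in another process this is done via WriteProcessMemory).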

The benefit of WinAPIOverride32 is that it can break the targeted application before or after a function call is invoked (Figure 3), allowing memory or registers to be changed. In addition, it is able to make direct calls to functions inside the targeted application. Another advantage of this software is that it can hide itself under the name of another system process, such as Explorer, so malware cannot recognise its presence as a monitoring tool.

Preprocessing and feature extraction

Before applying classification algorithms, we must first extract relevant data from the executables’ logs and make it suitable for the mining algorithms. We replace each API name with a unique number and build an elementary dataset. After gathering the sequences, we remove the executables that made fewer than 10 calls to the Windows API. Such short traces result from the unsuccessful execution of some executables, for example because of an unrecognised file format or a missing library such as a DLL or OCX. Each sequence in our sequence database is a series of API calls associated with a label indicating whether the executable is benign or malicious. Unlike relational data, a malware behaviour, which is a sequence of single APIs, has no predefined feature vector, so in the sequence database feature extraction is an important part of the data mining process. Selecting each API as a feature is a good start, but considering a set of APIs is more descriptive and helpful. Existing studies have shown the effectiveness of frequent pattern-based classification and the relationship between discriminative metrics and frequency.17 So, after observing repetitive API patterns for each malware sample, we decided to use a frequent iterative pattern algorithm for extracting features.18,19 In contrast to frequent itemsets and traditional sequence mining, iterative patterns capture both the temporal order within a trace and repetitive behaviours in a sequence. A pattern instance is defined as follows.
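A minimal preprocessing sketch (our illustration, not the authors’ code) that maps API names to integer identifiers, drops traces with fewer than 10 calls and attaches the class label might look like this:

from typing import Dict, List, Tuple

def encode_traces(traces: List[Tuple[List[str], int]],
                  min_calls: int = 10) -> Tuple[List[Tuple[List[int], int]], Dict[str, int]]:
    # traces: (list of API names, label) pairs; label 0 = malware, as in the paper.
    api_ids: Dict[str, int] = {}
    encoded = []
    for calls, label in traces:
        if len(calls) < min_calls:          # unsuccessful or truncated execution
            continue
        seq = [api_ids.setdefault(api, len(api_ids) + 1) for api in calls]
        encoded.append((seq, label))
    return encoded, api_ids

# Example: one hypothetical malware trace.
example = [(["RegCreateKeyExA"] + ["LoadLibraryA"] * 10 + ["GetModuleFileNameA"], 0)]
apidb, api_mapping = encode_traces(example)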

Definition 1 (Pattern Instance): Given a pattern P<e1, e2, …, en>, a consecutive series of events SB (sb1, sb2, …, sbn) in an API call sequence S in the API call database (APIDB) is an instance of P if it matches the following quantified regular expression (QRE):

e1; [-e1, …, en]*; e2; [-e1, …, en]*; …; [-e1, …, en]*; en

A QRE is very similar to a standard regular expression, with ‘;’ as the concatenation operator, ‘[-]’ as the exclusion operator (eg, [-P, S] means any event except P and S) and ‘*’ as the standard Kleene star. For example, consider the following database:

Figure 3: The API hooking process (after Jacquelin Potier, with permission).

Id Sequence

S1 4 2 1 6 2 1 6 2 3 5

S2 4 2 1 4 2 2 2 1 2


Assume that we want to find the instances of the pattern P = <1, 2>. The set of instances of P, denoted Inst(P), is the set of triples {(1, 3, 5), (1, 6, 8), (2, 3, 5), (2, 8, 9)}. Each element of Inst(P) is a triple in which the first component is the sequence ID, the second is the start index and the third is the end index.
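As a small self-contained sketch (our illustration, not the authors’ mining code), the following Python function enumerates pattern instances according to the QRE above and reproduces Inst(P) for this example:

from typing import List, Tuple

def find_instances(db: List[List[int]], pattern: List[int]) -> List[Tuple[int, int, int]]:
    events = set(pattern)
    instances = []
    for sid, seq in enumerate(db, start=1):              # 1-based sequence IDs
        for start in range(len(seq)):
            if seq[start] != pattern[0]:
                continue
            k, pos = 1, start + 1
            while pos < len(seq) and k < len(pattern):
                if seq[pos] == pattern[k]:
                    k += 1                               # matched the next pattern event
                elif seq[pos] in events:
                    break                                # an excluded event kills this instance
                pos += 1
            if k == len(pattern):
                instances.append((sid, start + 1, pos))  # 1-based start and end indices
    return instances

db = [[4, 2, 1, 6, 2, 1, 6, 2, 3, 5],
      [4, 2, 1, 4, 2, 2, 2, 1, 2]]
print(find_instances(db, [1, 2]))
# -> [(1, 3, 5), (1, 6, 8), (2, 3, 5), (2, 8, 9)]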

We are now going to define frequent iterative patterns.

Definition 2 (Frequent Iterative Pattern). For an API call database (APIDB), an iterative pattern P is frequent if its instances occur above a certain threshold of minimum support in APIDB.

Furthermore, an iterative pattern is closed if the following conditions hold.

Definition 3 (Closed Iterative Pattern). A frequent iterative pattern P is closed if there exists no super-sequence Q such that:
1. P and Q have the same support.
2. Every instance of P corresponds to a unique instance of Q, denoted Inst(P) ≈ Inst(Q).

An instance of P, (seqP; startP; endP), corresponds to an instance of Q, (seqQ; startQ; endQ), if seqP = seqQ, startP ≥ startQ and endP ≤ endQ, where seq identifies a record (sequence) in the APIDB.

Although mining closed frequent iterative patterns helps us to reduce the number of produced patterns, there is still a huge number of closed patterns, which makes it inefficient to build and run a classifier. Additionally, there are API calls that are generated automatically by compilers and interpreters for many languages. Such API calls do not really reflect a programmer’s intent and can be considered as noise. To achieve our goal of a robust classification model, we need to minimise the noise included in the dataset. In order to remove such noisy patterns from the APIDB, we use closed unique iterative patterns, defined below and mined with Algorithm 1 (see box).

Definition 4 (Closed Unique Pattern). A frequent pattern P is a closed unique pattern if P contains no repeated constituent events, and there exists no super-sequence Q such that:
1. P and Q have the same support.
2. Every instance of P corresponds to a unique instance of Q.
3. Q contains no constituent events that repeat.

Consider the following example APIDB with three sequences (S1, S2 and S3, listed in the table shown alongside Algorithm 1).

With minimum support set to 66%, <3 5 4 1 2> is obtained as a closed unique pattern: it has instances in two of the three sequences (S1 and S2) and contains no repeated elements. In order to improve our detection engine, both single APIs and iterative API patterns are considered as features.
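The following self-contained sketch (ours, not the authors’ miner) checks that <3 5 4 1 2> indeed has an instance in two of the three example sequences, giving a support of roughly 66% when, as in this example, support is counted over sequences:

def has_instance(seq, pattern):
    # True if seq contains an instance of pattern under the QRE of Definition 1.
    events = set(pattern)
    for start, e in enumerate(seq):
        if e != pattern[0]:
            continue
        k, pos = 1, start + 1
        while pos < len(seq) and k < len(pattern):
            if seq[pos] == pattern[k]:
                k += 1
            elif seq[pos] in events:
                break
            pos += 1
        if k == len(pattern):
            return True
    return False

apidb = [[1, 2, 2, 2, 2, 3, 5, 4, 1, 2, 2],    # S1
         [3, 5, 4, 1, 2, 2, 2, 2, 3],          # S2
         [3, 1, 1, 2, 3, 4, 2, 2, 1]]          # S3
pattern = [3, 5, 4, 1, 2]
support = sum(has_instance(s, pattern) for s in apidb) / len(apidb)
print(round(support, 2))   # 0.67, which meets the 66% minimum support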

As the introduced system is a learning-based system, it does not need to be updated in the same way as anti-virus products. Nevertheless, updating the model can improve its detection power. The base iterative pattern algorithm is not updatable: if the model must be updated, the algorithm has to be re-run on the whole database to extract the updated frequent patterns. So, in order to adapt it to our purpose as an updatable anti-virus engine, we changed it into an updatable iterative pattern algorithm (Algorithm 1) that reduces the time needed to extract new patterns. The algorithm has two steps:
1. For each previously found pattern, its support is recomputed on the updated database, and the pattern is kept if it still meets the minimum support threshold.
2. For each new PE sequence, patterns are extracted from that sequence. A new pattern that already exists among the previous patterns is ignored; otherwise it is added to the set of patterns. Finally, the closedness and uniqueness properties are checked for the newly obtained patterns.

Feature selection and classification

For building robust learning models, selecting a subset of relevant features is required; that is, we select the features with the most distinguishing power.

Algorithm 1: Update mined closed unique iterative patterns
Procedure: Update Iterative Patterns
Inputs:
  APIDB: PE API sequence database
  PrevPat: set of mined closed unique patterns
  minsup: minimum support threshold

// STEP 1) Update previous patterns
for every pat in PrevPat
  for every instance in {Inst(pat) in APIDB}
    sup(pat)++;
Let UpdatedPat = UpdatedPat U {pat | sup(pat) ≥ minsup}

// STEP 2) Mine new closed unique iterative patterns and filter them
Let FqEv = {p | (|p| = 1) and (sup(p) ≥ minsup)}
for every e in FqEv
  NewPat = Call GrowRec(e, APIDB, minsup, FqEv)[19]
  Let UpdatedPat = UpdatedPat U {NewPat | NewPat ∉ UpdatedPat}
for every pat in UpdatedPat
  if (pat is closed unique)
    Output pat

Example APIDB for the closed unique pattern example:

Id Sequence

S1 1 2 2 2 2 3 5 4 1 2 2

S2 3 5 4 1 2 2 2 2 3

S3 3 1 1 2 3 4 2 2 1


For each malware, we extract a feature vector from its behaviour. Assume MB and Fv are, respectively, the representation of a malware behaviour and the feature vector for each malware, and let Fs be the feature set extracted in the previous step. Then:
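The mapping can be stated as follows (our own notation, filling in the formula consistently with the description of Table 2 below: single APIs contribute their frequency in the sequence and iterative patterns their in-sequence support):

$$Fv(MB) = \langle\, v(f, MB) \,\rangle_{f \in Fs}, \qquad
v(f, MB) =
\begin{cases}
\mathrm{freq}(f, MB) & \text{if } f \text{ is a single API} \\
\mathrm{sup}(f, MB) & \text{if } f \text{ is an iterative pattern}
\end{cases}$$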

For example, Table 2 describes the API sequence, representative numbers and feature vector for Trojan.Win32.Tunneler. The first row shows the API sequence that we gathered from monitoring. As described in the preprocessing step, we replaced each API with a number for the feature extraction step, shown in the second row of the table. Because this sequence is generated from a malware run, we add a zero label at the end of the sequence to indicate its category. In the feature vector, we consider both single APIs and iterative patterns, with the class label at the beginning. The value of a single API is its frequency in the sequence, and for iterative patterns it is their support in the sequence. In the feature vector row, after the label, we separate each feature and its value with ‘:’. This is the format of LIBSVM.20 For example, the GetModuleFileNameA API occurs twice in the sequence, so we show it as ‘5:2’. After the single APIs, we add the frequent closed unique patterns. Then we select the features that are most discriminative between malware and benign samples. A popular statistical measure, the Fisher score (David Lo et al), indicates the discriminative power of a feature. This score is:
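A standard two-class form of the Fisher score, stated here as our own formulation of the usual definition, where $\mu_c(f)$ and $\sigma_c^2(f)$ are the mean and variance of feature $f$ within class $c$, $n_c$ is the number of samples in class $c$ and $\mu(f)$ is the overall mean, is:

$$F(f) = \frac{\sum_{c \in \{\mathrm{mal},\,\mathrm{ben}\}} n_c \left(\mu_c(f) - \mu(f)\right)^2}{\sum_{c \in \{\mathrm{mal},\,\mathrm{ben}\}} n_c\, \sigma_c^2(f)}$$

A feature with a high Fisher score takes values that differ strongly between the malware and benign classes relative to the within-class variance.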

Some of the closed frequent API patterns among the 615 patterns that were extracted for a minimum support of 0.4, and that have the highest Fisher scores, are as follows:
1. LoadLibraryA & RegOpenKeyExA & LocalFree & RegCreateKeyExA
2. RegCreateKeyExA & CreateWindowExA & GetWindowThreadProcessId
3. GetModuleFileNameA & CreateWindowExA & GetWindowThreadProcessId
4. GetVersion & CreateWindowExA & GetWindowThreadProcessId
5. GetModuleFileNameA & GetVersion & GetSystemMetrics


These are sorted by Fisher score in descending order and illustrate patterns that are discriminative between malware and benign samples. Previous malware detection research based on static analysis and API call categorisation (Ashkan Sami et al) determined the most discriminative categories of API functions. To make the discussion more comprehensive, we restate the top four discriminative API categories here:
1. File management.
2. Process and thread.
3. Console.
4. Registry.

The relation between these discriminative API call categories and the discriminative iterative unique patterns extracted by this work is fairly intuitive. After applying Fisher score feature selection, we apply a second feature selection step with the CfsSubsetEval evaluator on all features, as implemented in WEKA.21
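A compact sketch of the Fisher-score ranking stage (our own numpy illustration; the subsequent CfsSubsetEval step from WEKA is not reproduced here) could look like this:

import numpy as np

def fisher_scores(X: np.ndarray, y: np.ndarray) -> np.ndarray:
    # X: instances x features matrix of feature values; y: binary class labels.
    mu = X.mean(axis=0)
    num = np.zeros(X.shape[1])
    den = np.zeros(X.shape[1])
    for c in np.unique(y):
        Xc = X[y == c]
        nc = Xc.shape[0]
        num += nc * (Xc.mean(axis=0) - mu) ** 2     # between-class scatter
        den += nc * Xc.var(axis=0)                  # within-class scatter
    return num / np.maximum(den, 1e-12)             # guard against zero variance

def top_k_features(X: np.ndarray, y: np.ndarray, k: int = 30) -> np.ndarray:
    # Indices of the k features with the highest Fisher score.
    return np.argsort(fisher_scores(X, y))[::-1][:k]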

API Sequence: RegCreateKeyExA, LoadLibraryA, LoadLibraryA, LoadLibraryA, LoadLibraryA, LoadLibraryA, LoadLibraryA, LoadLibraryA, LoadLibraryA, LoadLibraryA, LoadLibraryA, GetModuleFileNameA, GetModuleFileNameA, RegOpenKeyExA, RegOpenKeyExA, RegOpenKeyExA, CloseHandle, CloseHandle, CloseHandle, CloseHandle, LocalFree, LocalFree, ExitProcess, LocalFree

Representative Number: 4 6 6 6 6 6 6 6 6 6 6 5 5 2 2 2 16 16 16 16 3 3 21 3 Label: 0(Malware)

Feature Vector: 0(Malware) 2:3 3:3 4:1 5:2 6:10 16:4 21:1 pat(2,3):1 pat(2,16):1 pat(16,3):1 pat(6,5):1 pat(6,2):1 pat(6,3):1 pat(6,16):1 pat(5,16):1 pat(5,2):1 pat(5,3):1 pat(4,5):1 pat(4,6):1 pat(4,2):1 pat(4,16):1 pat(4,3):1

Table 2: Trojan.Win32.Tunneler behaviour.
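The single-API part of this LIBSVM line can be reproduced with a few lines of Python (our illustration; pattern features such as pat(2,3) would be appended in the same index:value format with their in-sequence support):

from collections import Counter

# Representative-number sequence for Trojan.Win32.Tunneler from Table 2.
sequence = [4, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 5, 5, 2, 2, 2, 16, 16, 16, 16, 3, 3, 21, 3]
label = 0  # 0 = malware

counts = Counter(sequence)
features = " ".join(f"{api}:{counts[api]}" for api in sorted(counts))
print(f"{label} {features}")
# -> 0 2:3 3:3 4:1 5:2 6:10 16:4 21:1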

S    #F    #SF  TP     FP     MR    AvR   AUC    ACC    RMSE
0.1  4380  27   0.950  0.100  98.0  95.0  0.980  94.96  0.1902
0.2  1865  27   0.948  0.097  97.5  94.8  0.977  94.78  0.1992
0.3  996   23   0.946  0.104  97.6  94.6  0.983  94.60  0.1923
0.4  703   22   0.951  0.104  98.4  95.1  0.983  95.05  0.1866
0.5  532   19   0.943  0.119  98.1  94.3  0.984  94.33  0.2010
0.6  442   22   0.939  0.110  96.9  93.9  0.980  93.88  0.2055
0.7  394   22   0.939  0.114  97.1  93.9  0.971  93.88  0.2156
0.8  349   20   0.938  0.129  97.9  93.8  0.979  93.79  0.2116
0.9  312   22   0.945  0.106  97.6  94.5  0.980  94.51  0.2040

Table 3: The experimental results of our system for different minimum supports. S: minimum support for iterative patterns; #F: number of patterns plus single APIs; #SF: number of selected features; TP: true positive rate; FP: false positive rate; MR: malware recall (detection rate, percentage); AvR: average recall (percentage); AUC: area under curve; ACC: correctly classified instances (accuracy, percentage); RMSE: root mean squared error.


Our detection engine uses a classifier for building a model. We use Random Forest to train and test the classifier.22


Experiments

We monitored five important Windows libraries – kernel32.dll, advapi32.dll, user32.dll, ws2_32.dll and wininet.dll. Kernel32.dll provides access to the fundamental resources available to a Windows system, such as file systems, devices, processes and threads.23 This module can help to detect malware that replicates itself or creates multiple processes. Advapi32.dll provides access to functionality additional to the kernel, such as the Windows registry, shutting down or restarting the system, starting, stopping and creating Windows services, and managing user accounts. This module is helpful for malware that creates customised services, uses the registry to read critical data, or reboots the system without asking the user. User32.dll provides the functionality to create and manage screen windows and most basic controls, such as buttons and scrollbars, to receive mouse and keyboard input, and other functionality associated with the GUI part of Windows. This library is helpful for detecting those malware that don’t use the user interface (most malware don’t, because they want to stay hidden). Ws2_32.dll and wininet.dll respectively contain the Windows socket APIs and the Internet-related functions used by network and Internet applications to manage their connections. Most trojans use network connections to send information to their servers.


Table 3 shows the results obtained by Random Forest. To evaluate the performance of our system, we use 10-fold cross-validation. Our training dataset contains 806 malware and 306 benign samples. It is divided into 10 sub-samples of equal size; each time, nine of them are used as training data and the remaining one is used for testing. Table 3 presents the average over the 10 runs of the cross-validation. In the feature extraction step, we used nine different minimum support thresholds for extracting frequent iterative API patterns.
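A minimal evaluation sketch in Python, using scikit-learn rather than the WEKA setup described above (the dataset path ‘apidb.libsvm’ is a hypothetical placeholder), might look like this:

from sklearn.datasets import load_svmlight_file
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import StratifiedKFold, cross_val_score

# Load a LIBSVM-format dataset of the kind built in the feature extraction step.
X, y = load_svmlight_file("apidb.libsvm")   # hypothetical file name

clf = RandomForestClassifier(n_estimators=100, random_state=0)
cv = StratifiedKFold(n_splits=10, shuffle=True, random_state=0)

acc = cross_val_score(clf, X, y, cv=cv, scoring="accuracy")
auc = cross_val_score(clf, X, y, cv=cv, scoring="roc_auc")
print(f"accuracy: {acc.mean():.3f}  AUC: {auc.mean():.3f}")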

Comparing the results shows that we achieve very good detection rates with low false alarms for all of the minimum support thresholds (Figure 4).

Figure 4: Correlation of FA rate and minimum support.

Conclusion and future work

Due to the rapid growth of malware production and the potential harm malware causes, demand for an automated, intelligent system for malware detection is growing. Although anti-virus products that detect malicious code based on their signatures are claimed to be effective, they have been shown to be ineffective against unknown and newly established attacks. In this paper, we have proposed a novel dynamic malware detection system based on mining the API sequences and iterative patterns extracted from an executable’s trace of API calls. We have presented a framework to analyse and detect malicious behaviour and introduced the concept of iterative pattern mining in this field. In addition, we have demonstrated a hooking technique that can overcome the shortcomings of debugging techniques in controlled environments for malware detection.

To continue our research, we will extract more complicated and informative behavioural patterns from malware in both static and dynamic settings. We believe this combination offers much more.

Acknowledgment

We would like to thank Jacquelin Potier for his useful comments and guidance. The preliminary version of this paper won the first prize at the Kaspersky Lab Asia Pacific & MEA Cup 2011 (http://www.kasperskyasia.com/news_20110317_01).

About the authors

Mansour Ahmadi obtained his BS in Applied Mathematics from Sistan Baloochestan University, Iran and his MS in software engineering from Islamic Azad University, Arak, Iran. He worked on malware detection as his MS thesis under the supervision of Dr Ashkan Sami and is currently a security researcher.

Dr Ashkan Sami obtained his BS from Virginia Tech, Blacksburg, VA, US, his MS from Shiraz University, Iran and his PhD from Tohoku University, Japan. He is interested in data mining, software quality and security. He has been a member of technical committees for several international conferences such as PAKDD, ADMA, HumanCon, and Future Tech and has published more than 40 conference papers and nearly 10 journal papers. He is an associate member of IEEE and was among the founding members of Shiraz University CERT.

Hossein Rahimi obtained his BS in Computer Science from Shiraz University, Iran, and he is an MS student at Dalhousie University, Canada.

Babak Yadegari obtained his BS in Computer Science from Shiraz University, Iran and he is a PhD student at the University of Arizona, US.

References

1. ‘Malware Report: The Economic Impact of Viruses, Spyware, Adware, Botnets, and Other Malicious Code’. Computer Economics, (2007).
2. Ashkan Sami, Babak Yadegari, Hossein Rahimi, Naser Peiravian. ‘Malware Detection Based On Mining API Calls’. ACM Symposium on Applied Computing, Switzerland, (2010).
3. Yanfang Ye, Dingding Wang, Tao Li, Dongyi Ye, Qingshan Jiang. ‘An intelligent pe-malware detection system based on association mining’. Journal in Computer Virology, (2008).
4. Mihai Christodorescu, Somesh Jha. ‘Static analysis of executables to detect malicious patterns’. USENIX Security Symposium, (2003).
5. Clemens Kolbitsch, Paolo Milani Comparetti, Christopher Kruegel, Engin Kirda, Xiaoyong Zhou, XiaoFeng Wang. ‘Effective and Efficient Malware Detection at the End Host’. 18th USENIX Security Symposium, (2009).
6. Jianyong Dai, Ratan Guha, Joohan Lee. ‘Efficient Virus Detection Using Dynamic Instruction Sequences’. Journal of Computers, (2009).
7. Jacquelin Potier. ‘WinAPIOverride32’. Accessed 2010. http://jacquelin.potier.free.fr/winapioverride32/.
8. John Aycock. ‘Computer Viruses and Malware’. Springer, (2006).
9. Fabrice Bellard. ‘QEMU, a fast and portable dynamic translator’. USENIX Annual Technical Conference, (2005).
10. VMware. Accessed 2010. www.vmware.com.
11. The Unpacker Archive. Accessed 2010. www.woodmann.com/crackz/Tools/Unpckarc.zip.
12. Cullen Linn, Saumya Debray. ‘Obfuscation of executable code to improve resistance to static disassembly’. ACM Conference on Computer and Communications Security, (2003).
13. IDA Pro Disassembler and Debugger. Accessed 2011. www.datarescue.com/idabase/.
14. Mihai Christodorescu, Somesh Jha, Sanjit A. Seshia, Dawn Song, Randal E. Bryant. ‘Semantics-aware malware detection’. IEEE Symposium on Security and Privacy, (2005).
15. Keehyung Kim, Byung-Ro Moon. ‘Malware Detection based on Dependency Graph using Hybrid Genetic Algorithm’. GECCO, USA, (2010).
16. Nicolas Falliere. ‘Windows Anti-Debug Reference’. Accessed 2007. www.symantec.com/connect/articles/windows-anti-debug-reference.
17. Hong Cheng, Xifeng Yan, Jiawei Han, Chih-Wei Hsu. ‘Discriminative frequent pattern analysis for effective classification’. ICDE, (2007).
18. David Lo, Siau-Cheng Khoo, Chao Liu. ‘Efficient mining of iterative patterns for software specification discovery’. KDD, (2007).
19. David Lo, Hong Cheng, Jiawei Han, Siau-Cheng Khoo, Chengnian Sun. ‘Classification of Software Behaviors for Failure Detection: A Discriminative Pattern Mining Approach’. KDD, Paris, France, (2009).
20. Chih-Chung Chang, Chih-Jen Lin. ‘LIBSVM: A library for support vector machines’. ACM Transactions on Intelligent Systems and Technology, (2011).
21. Weka 3: Data Mining open source software. Accessed 2008. www.cs.waikato.ac.nz/ml/weka/.
22. Leo Breiman, Adele Cutler. ‘Random Forests’. Machine Learning, volume 45, (2001): 5-32.
23. Microsoft. ‘Overview of the Windows API’. Accessed 2010. http://msdn.microsoft.com/en-us/library/Aa383723.

