First International Congress on Technology, Communication and Knowledge (ICTCK 2014)
November, 26-27, 2014 - Mashhad Branch, Islamic Azad University, Mashhad, Iran
978-1-4799-8021-5/14/$31.00 ©2014 IEEE
Malware Detection Using Hidden Markov Model based on Markov Blanket Feature Selection Method

Bassir Pechaz
Imam Reza University
Faculty of computer engineering
Mashhad, Iran
Majid Vafaie Jahan
Islamic Azad University
Faculty of computer engineering
Mashhad, Iran
Mehrdad Jalali
Islamic Azad University
Faculty of computer engineering
Mashhad, Iran
jalali@mshdiau.ac.ir
Abstract—In general, all malicious code that can potentially harm a single computer or a network of computers is categorized as malware. With great progress in virus development kits, the variety of malware appearing today, and the growing number of network users, malware spreads rapidly through all aspects of computer systems. The main approach to detecting malware today is signature-based; with the rise of metamorphic malware, however, these techniques have lost much of their detection power. In this research, a new approach to malware detection is introduced by combining machine learning methods with the n-gram model and statistical analysis. Using the Markov blanket method for feature selection reduced the number of features by approximately 86% on average. Sequences were then produced to train a hidden Markov model. The trained HMM showed high accuracy, about 90%, in detecting and classifying malware and benign files.

Keywords: malware detection, hidden markov model, n-gram, markov blanket, machine learning
I. INTRODUCTION
Malware, or malicious software, is software developed to harm a single computer or a network of computers [1]. Hence, malware covers a wide range of malicious software. The term “computer virus” was first introduced by Dr. Frederick Cohen in 1984 [2]. Considering malware behavior and variety, five classes of malware were selected for this research: Backdoors, Rootkits, Trojan Horses, Viruses, and Worms.
II. RELATED WORKS
In theory, malware detection is categorized as a hard problem [3]. Spinellis [4] analyzed many computer malware samples and proved that detecting malware is an NP-complete problem.
All approaches presented so far for detecting malware can be divided into two general groups: signature-based and non-signature-based. Non-signature-based approaches are usually based on data mining techniques and machine learning methods [5].
Bergeron [6] introduced several methods based on disassembling files into their bit sequences to find behavioral patterns. Christodorescu [7] recommended a method based on the Control Flow Graph for detecting malware and later enhanced it by combining it with semantic models [8]. Kephart [9] examined an ANN classification algorithm on byte sequences generated from disassembling boot-sector malware.
Bilar [10] showed that operation codes (opCodes) can be an appropriate feature for classifying files as benign or malware. Building on Bilar's research, Santos [11] used opCodes as features, computed opCode frequencies for malware and benign files, and later improved this work by adding n-gram models [12].
In recent research by Park [13], a new method was introduced based on the malware behavior graph: the system-call graph of a suspect file is mapped and then compared to a standard malware system-call graph.
Wong [14] considered metamorphic malware, disassembled the samples into their opCodes, and obtained 416 unique opCodes. He generated training sequences of length between 66,000 and 67,000 based on the unique opCodes and used them to train an HMM.
In this paper, five classes of malware were collected, the Markov blanket method was used for feature selection, and by reducing the number of unique opCodes, an efficient method is introduced for classifying malware and benign files.
III. CONCEPTS
A. Entropy
Entropy is a quantity that can be defined for any probability distribution function, and it can easily be extended to a measure of the mutual information of random variables [15]. The entropy of a random variable is a criterion of that variable's uncertainty; in other words, entropy determines the average amount of information needed to describe the random variable.
Definition 1 – the entropy of a discrete random variable X is defined as (1):

H(X) = − Σ_{x∈X} p(x) log₂ p(x)   (1)

From (1) it is obvious that the entropy of X is a function of the probability distribution of X and does not depend on the actual values taken by X.
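As an illustration, the entropy in (1) can be estimated from sample frequencies. This is a minimal sketch; the function name is ours, not from the paper:

```python
import math
from collections import Counter

def entropy(samples):
    """H(X) = -sum over x of p(x) * log2 p(x), with p estimated by frequency."""
    counts = Counter(samples)
    n = len(samples)
    return -sum((c / n) * math.log2(c / n) for c in counts.values())

# A fair coin has the maximal entropy of 1 bit.
print(entropy(["H", "T", "H", "T"]))  # 1.0
```

A constant sequence, by contrast, has zero entropy, matching the intuition that it carries no uncertainty.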
Definition 2 – the joint entropy of a pair of discrete random variables (X, Y) with a joint distribution p(x, y) is defined as (2):

H(X, Y) = − Σ_{x∈X} Σ_{y∈Y} p(x, y) log₂ p(x, y)   (2)
Definition 3 – if (X, Y) ~ p(x, y), then the conditional entropy H(Y|X) is defined as (3):

H(Y|X) = Σ_{x∈X} p(x) H(Y|X = x) = − Σ_{x∈X} p(x) Σ_{y∈Y} p(y|x) log₂ p(y|x)   (3)
Definition 4 – mutual information (MI) is a quantity measuring the information that one random variable contains about another random variable. Mutual information is defined as (4):

I(X, Y) = H(X) − H(X|Y)   (4)

Mutual information is a symmetric quantity, so:

I(X, Y) = H(Y) − H(Y|X)   (5)

Therefore, the amount of information that X contains about Y is the same as the amount of information that Y contains about X. Figure 1 illustrates the relation between entropy and MI.
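Equations (4) and (5) can be sanity-checked numerically via the equivalent identity I(X, Y) = H(X) + H(Y) − H(X, Y), estimated from paired samples (a sketch; the helper names are ours):

```python
import math
from collections import Counter

def H(samples):
    """Sample-frequency estimate of entropy, as in (1)."""
    n = len(samples)
    return -sum((c / n) * math.log2(c / n) for c in Counter(samples).values())

def mutual_information(xs, ys):
    """I(X, Y) = H(X) + H(Y) - H(X, Y); symmetric in X and Y, as in (4)-(5)."""
    return H(xs) + H(ys) - H(list(zip(xs, ys)))

x = [0, 0, 1, 1]
print(mutual_information(x, x))             # 1.0  (I(X, X) = H(X))
print(mutual_information(x, [0, 1, 0, 1]))  # 0.0  (independent in this sample)
```

The symmetry of (4)-(5) is immediate from this form, since H(X, Y) does not distinguish the two variables.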
Figure 1. Relation between entropy and MI
B. Markov blanket
The Markov blanket is a feature selection method based on entropy. There are several algorithms for finding a Markov blanket subset, among them HITON, MMMB, MMHC, and IAMB [16]. Another algorithm was introduced by Yu [17]; in this research we use Yu's method.
Assume F is the set of features, G is a subset of F, and f_G are the values of the elements of G. The goal of any feature selection method can be expressed by (6):

P(C|G = f_G) ≅ P(C|F = f)   (6)
There are two main approaches to selecting features: considering each feature individually, and considering subsets of the total feature set.

When each feature is considered individually, the main goal is to assign a weight to each feature and finally select the features with the greatest weights. However, an irrelevant feature may receive the same weight as a relevant one, so selecting the best features is very challenging.
In the subset approach, features are selected based on their relations and correlations with each other. Figure 2 illustrates the subset-based algorithm [17].

Figure 2. Flow chart of the subset-based feature selection algorithm
In the subset generation step, a candidate best subset of features is produced according to the search strategy, and in the evaluation step it is compared with the previous one. If the new subset satisfies a certain criterion, it replaces the previous one. These two steps are repeated until no better subset of features is found.
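The generate-and-evaluate loop of Figure 2 can be sketched generically; the scoring function and neighbourhood used in the toy run are illustrative assumptions, not from the paper:

```python
def subset_search(neighbours, evaluate, initial):
    """Keep replacing the current best subset with a better candidate
    until the stopping criterion (no improvement) is met."""
    best, best_score = initial, evaluate(initial)
    improved = True
    while improved:
        improved = False
        for cand in neighbours(best):  # subset generation step
            score = evaluate(cand)     # subset evaluation step
            if score > best_score:     # replacement criterion satisfied
                best, best_score, improved = cand, score, True
    return best

# Toy run: features {0, 1, 2}, where only {0, 2} are useful.
full = {0, 1, 2}
score = lambda s: len(s & {0, 2}) - len(s - {0, 2})
step = lambda s: [s - {e} for e in s] + [s | {e} for e in full - s]
print(subset_search(step, score, set()))  # {0, 2}
```

Any concrete method of this family plugs in its own search strategy (`neighbours`) and evaluation criterion (`evaluate`).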
There are two approaches to finding correlation between data: linear and nonlinear methods. In several cases there is no linear correlation between data even though a nonlinear correlation still exists. Most nonlinear correlation methods are based on entropy.

According to MI, variable Y depends more on X than on Z if and only if I(X, Y) > I(Z, Y). The value of MI always lies in [0, 1]: a value of 1 means that given one feature, the other is fully predictable, whereas a value of 0 means there is no relevance between X and Y.
Symmetric Uncertainty (SU) is a quantity for normalizing MI. SU is defined as (7):

SU(X, Y) = 2 [ I(X, Y) / (H(X) + H(Y)) ]   (7)
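Equation (7) can be sketched on top of the sample-based entropy estimate; the helper names are ours:

```python
import math
from collections import Counter

def H(samples):
    """Sample-frequency estimate of entropy, as in (1)."""
    n = len(samples)
    return -sum((c / n) * math.log2(c / n) for c in Counter(samples).values())

def symmetric_uncertainty(xs, ys):
    """SU(X, Y) = 2 * I(X, Y) / (H(X) + H(Y)), normalised to [0, 1]."""
    hx, hy = H(xs), H(ys)
    if hx + hy == 0:
        return 0.0  # both variables are constant; no shared information
    mi = hx + hy - H(list(zip(xs, ys)))  # I(X, Y) via the joint entropy
    return 2 * mi / (hx + hy)

x = [0, 0, 1, 1]
print(symmetric_uncertainty(x, x))             # 1.0 (fully predictable)
print(symmetric_uncertainty(x, [0, 1, 0, 1]))  # 0.0 (no relevance)
```

The division by H(X) + H(Y) is what confines the score to [0, 1], removing MI's bias toward high-entropy features.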
C. Markov Blanket set
The Markov blanket set was introduced by Koller in 1996 [18]. Assume F_i is given and m_i ⊂ F with F_i ∉ m_i. m_i is a Markov blanket set for F_i if and only if

P(F \ {F_i, m_i}, C | F_i, m_i) = P(F \ {F_i, m_i}, C | m_i)

In other words, the existence of a Markov blanket set for a particular feature F_i indicates that F_i is redundant and can be removed. Put simply, a Markov blanket exists only for features that have no relation with class C and cannot determine the class, and such features can therefore be removed.
D. Introducing Yu's method
Definition 5 – the probability distribution of class C given the feature set F is denoted P(C|F). S_i = F \ {F_i} means feature F_i is removed from set F.

Definition 6 – strong relevance: feature F_i is strongly relevant to set F if and only if

P(C|S_i, F_i) ≠ P(C|S_i)

Definition 7 – weak relevance: feature F_i is weakly relevant to set F if and only if

P(C|S_i, F_i) = P(C|S_i) and ∃ S_i′ ⊂ S_i such that P(C|S_i′, F_i) ≠ P(C|S_i′)

Definition 8 – feature F_i is completely irrelevant to set F if and only if

∀ S_i′ ⊂ S_i, P(C|S_i′, F_i) = P(C|S_i′)
If a strongly relevant feature is removed from the selected set, the power to detect the related class C is reduced; in other words, strongly relevant features are necessary for detecting class C. Selecting or deselecting a weakly relevant feature, depending on the selection of other features, can affect detection power. Irrelevant features can be removed with no effect on detection power. Hence, the optimal set contains all strongly relevant features and some weakly relevant features.
Yu introduced two definitions of correlation: individual correlation and combination correlation.

Definition 9 – the correlation between F_i and class C is called individual correlation.

Definition 10 – the correlation between two features F_i and F_j (i ≠ j) is called combination correlation.

Features with high individual correlation give more information about class C. For two features F_i and F_j with equal individual correlation, the next step is to calculate the combination correlation of F_i and F_j and determine whether F_j is redundant given F_i.
For two features F_i and F_j (i ≠ j), F_i forms an approximate Markov blanket for F_j if and only if SU_{i,c} ≥ SU_{j,c} and SU_{i,j} ≥ SU_{j,c}.

Assume F_j forms an approximate Markov blanket for F_k, so F_k can be removed given F_j. On the other hand, F_i forms an approximate Markov blanket for F_j, so F_j can be removed given F_i. In the end, it may seem that F_k was removed on the basis of no remaining feature. To avoid this kind of phenomenon, Yu introduced the predominant feature concept.
Definition 12 – predominant feature: a feature is predominant if and only if there is no approximate Markov blanket set for it. To find predominant features, it is only necessary to sort the features in descending order of their individual correlation value; the feature with the greatest individual correlation value is a predominant feature.
E. RBF algorithm
The RBF algorithm finds predominant features and removes redundant features based on them. RBF stops when no predominant feature remains; the remaining features form the Markov blanket set. The main property of RBF is that no special threshold needs to be declared.

Finding and selecting relevant features is done in O(N), where N is the number of features in the original set. Finding predominant features is also done in O(N), so overall RBF finds the Markov blanket set in O(N²) time.
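Under the definitions above, the redundancy-elimination core of this kind of algorithm can be sketched as follows. The SU tables here are hypothetical inputs; in practice they would be computed from the opCode data via (7):

```python
def rbf_select(features, su_class, su_pair):
    """Process features in descending order of SU with the class; keep a
    feature only if no already-kept (more relevant) feature forms an
    approximate Markov blanket for it."""
    kept = []
    for f in sorted(features, key=lambda f: su_class[f], reverse=True):
        # f is covered if some kept g has SU(g, f) >= SU(f, class).
        if all(su_pair[(g, f)] < su_class[f] for g in kept):
            kept.append(f)  # f is predominant so far
    return kept

# Hypothetical SU values for three opCodes.
su_class = {"mov": 0.9, "push": 0.5, "xor": 0.3}
su_pair = {("mov", "push"): 0.8, ("mov", "xor"): 0.1, ("push", "xor"): 0.2}
# "push" is blanketed by "mov" (0.8 >= 0.5) and dropped; "xor" survives.
print(rbf_select(su_class.keys(), su_class, su_pair))  # ['mov', 'xor']
```

No threshold appears anywhere: each feature's own SU with the class serves as its removal criterion, which is the property the text highlights.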
F. Validation methods
In this research three measures were chosen for validating the results: precision, sensitivity, and accuracy.

TP1 counts vectors that are in the positive class and are correctly labeled positive. FP2 counts negative vectors wrongly labeled positive. TN3 counts vectors that are in the negative class and are correctly labeled negative. Finally, FN4 counts positive vectors wrongly labeled negative.
Precision measures the preference of the method for determining the positive class (8) [19]:

Precision = TP / (TP + FP)   (8)
Sensitivity measures how many of the truly positive vectors were predicted correctly (9) [19]:

Sensitivity = TP / (TP + FN)   (9)
Accuracy measures the total number of correct predictions over the total number of vectors (10) [19]:

Accuracy = (TP + TN) / (TP + FP + TN + FN)   (10)
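Equations (8)-(10) amount to the following; the confusion-matrix counts in the example are made up for illustration:

```python
def metrics(tp, fp, tn, fn):
    """Precision (8), sensitivity (9) and accuracy (10) from the confusion matrix."""
    precision = tp / (tp + fp)                   # (8)
    sensitivity = tp / (tp + fn)                 # (9)
    accuracy = (tp + tn) / (tp + fp + tn + fn)   # (10)
    return precision, sensitivity, accuracy

# Hypothetical: 90 malware caught, 5 benign wrongly flagged,
# 95 benign passed, 10 malware missed.
p, s, a = metrics(tp=90, fp=5, tn=95, fn=10)
print(round(p, 3), round(s, 3), round(a, 3))  # 0.947 0.9 0.925
```

With malware as the positive class, sensitivity is the fraction of malware actually detected, while precision is the fraction of malware alarms that are genuine.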
1 True positive
2 False positive
3 True negative
4 False negative
IV. EMPIRICAL RESULTS
The current research consists of several steps: collecting benign and malware files; categorizing them into two groups, malware and benign; checking the files to ensure they accord with their labels; disassembling the files and extracting their opCodes; purifying the extracted opCodes; cross-validating the files, choosing one chunk for testing and the rest for training; selecting the best opCodes as features; forming n-gram sequences of opCodes; and finally training the Hidden Markov Model, classifying the test files, and validating the developed model.
In the first step, five groups of malware that are the main challenge for today's computer networks were selected: Backdoors, Rootkits, Trojan horses, Viruses, and Worms. All malware files were collected from the VX Heavens5 website. Benign files were collected from two popular operating systems: Windows 7 and 8, and Ubuntu Linux.
In the next step, after ensuring that the files were correctly labeled, all files were disassembled. Selecting relevant opCodes required a list of opCodes, which was collected from the Intel6 reference website.
To ensure that both classes of files have the same volume, an equal volume of files was selected for each class. Table 1 shows the number of files and the minimum, maximum, average, and total volume of the selected files.
TABLE I. NUMBER, MINIMUM, MAXIMUM, AVERAGE AND TOTAL VOLUME OF SELECTED FILES

| Class | Number of files | Min size | Max size | Average size | Total size |
| backdoors | 95 | 92.9 KB | 757 KB | 280.25 KB | 26 MB |
| rootkits | 451 | 6.85 KB | 185 KB | 15.89 KB | 7 MB |
| Trojan horses | 235 | 18.9 KB | 797 KB | 113.29 KB | 26 MB |
| viruses | 165 | 29.8 KB | 1.01 MB | 161.35 KB | 26 MB |
| worms | 261 | 19.3 KB | 4.57 MB | 102 KB | 26 MB |
| benign files (for non-rootkit classes) | 128 | 62.5 KB | 705 KB | 208 KB | 26 MB |
| benign files (for rootkit class) | 66 | 62.5 KB | 199 KB | 108.6 KB | 26 MB |
Kohavi [20] indicated that 10-fold cross validation gives better results; accordingly, all files in all groups were distributed into 10 chunks.
After applying the RBF algorithm and finding the relevant opCodes, n-gram sequences were formed based on the selected opCodes. In this research 2-, 3-, 4-, and 5-grams were used. Figure 3 compares the number of selected opCodes with the number of all unique opCodes in each malware group.
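The n-gram sequences can be formed with a simple sliding window over each file's opCode stream (a sketch; the example opCodes are illustrative):

```python
def ngrams(opcodes, n):
    """Slide a window of length n over the opCode sequence of one file."""
    return [tuple(opcodes[i:i + n]) for i in range(len(opcodes) - n + 1)]

seq = ["mov", "push", "call", "pop", "ret"]
print(ngrams(seq, 2))
# [('mov', 'push'), ('push', 'call'), ('call', 'pop'), ('pop', 'ret')]
```

A sequence of L opCodes yields L − n + 1 overlapping n-grams, so larger n both enlarges the feature alphabet and slightly shortens the training sequences.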
5 www.vxheavens.com
6 http://ref.x86asm.net/coder32.html
Figure 3. Comparing amount of selected features and total features
Figure 4 illustrates the steps of applying the RBF algorithm to the extracted opCodes.

Finally, the generated sequences of opCodes were used to train a Hidden Markov Model with 15 states for each group of malware. Then the test chunk of files was used for validating the trained model.
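Classification with the trained models reduces to likelihood scoring: a test file's sequence is scored under each family's HMM and assigned to the best-scoring class. A minimal forward-algorithm sketch, with a toy two-state model rather than the paper's 15-state ones:

```python
import math

def forward_log_likelihood(obs, start, trans, emit):
    """log P(obs | model) via the forward algorithm (unscaled; toy sizes only)."""
    n = len(start)
    # alpha[s] = P(observations so far, current state = s)
    alpha = [start[s] * emit[s][obs[0]] for s in range(n)]
    for o in obs[1:]:
        alpha = [sum(alpha[s] * trans[s][t] for s in range(n)) * emit[t][o]
                 for t in range(n)]
    return math.log(sum(alpha))

start = [1.0, 0.0]                     # initial state distribution
trans = [[0.9, 0.1], [0.1, 0.9]]       # state transition matrix
emit = [[0.8, 0.2], [0.2, 0.8]]        # P(symbol | state)
print(forward_log_likelihood([0, 0], start, trans, emit))
```

A production implementation would use log-space or scaled recursions for long opCode sequences; this unscaled version is only meant to show the scoring step.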
V. RESULTS
In this research the Hidden Markov Model was used as the classifier. After applying the RBF algorithm, selecting the relevant opCodes, reducing the total number of opCodes, and decreasing the length of the n-gram sequences, the HMM could be trained. Training sequences were generated for 2- to 5-grams. Figures 5-9 illustrate the precision, accuracy, and sensitivity of the trained HMM for each group of malware. In this research, malware is considered the positive class.
(Data of Figure 3 — total vs. selected unique opCodes per group: Backdoor 379/22, Rootkit 347/105, Trojan horse 384/32, Virus 415/60, Worm 394/43.)
Figure 4. Steps for applying RBF algorithm on extracted opCodes
Figure 5. Precision, accuracy and sensitivity in 2-5 gram for Backdoor
Figure 6. Precision, accuracy and sensitivity in 2-5 gram for Rootkit
Figure 7. Precision, accuracy and sensitivity in 2-5 gram for Trojan Horse
Figure 8. Precision, accuracy and sensitivity in 2-5 gram for Virus
Figure 9. Precision, accuracy and sensitivity in 2-5 gram for Worm
VI. CONCLUSION
In the current paper, by using the Markov blanket as the feature selection method, reducing the total number of opCodes, decreasing the length of the training sequences, and using an HMM as the classifier, we could classify malware and benign files. Using the RBF algorithm to find the relevant opCodes reduced the original number of opCodes by 86% on average for the five groups of malware.

It is obvious from the diagrams that as the gram size increases, the number of unique features increases and the classification accuracy increases. However, the increase in the number of
(Steps shown in Figure 4: input — labeled benign and malware file sets; number each opCode and compute its frequency in the two file classes; calculate the entropy of each opCode; calculate the conditional entropy of each opCode with respect to the two file classes; calculate MI; apply the RBF algorithm to find the Markov blanket set; output — the optimal set of opCodes.)
unique features affected both the precision and the sensitivity of the trained HMM, reducing them.
An important point to note is that with the presented method there is no need to execute files in a real computer system to observe their behavior, which helps us avoid malicious files before they affect our systems. Using an HMM as the classifier and combining it with n-gram models has a great impact on the results, improving them to up to 90% on average.
REFERENCES
[1] B. Potter and G. Day, "The effectiveness of antimalware
tools," Computer fraud and security, pp. 12-13., 2009.
[2] P. Szor, The art of computer virus research and defense,
Addison Wesley Professional, 2005.
[3] F. Cohen, "Computer viruses: Theory and experiments," Computers and Security, pp. 22-35, 1987.
[4] D. Spinellis, "Reliable Identification of Bounded-Length Viruses Is NP-Complete," IEEE Transactions on Information Theory, vol. 49, 2003.
[5] I. Yoo and V. Nitshe, "Non signature based virus
detection: Toward establishing unknown virus detection
technique using som," J. Comput. Virol 2(3), 2006.
[6] J. Bergeron and M. Debbabi, "Static analysis of binary code to isolate malicious behaviors," Proceedings of the 1999 Workshop on Enabling Technologies: Infrastructure for Collaborative Enterprises, 1999.
[7] M. Christodorescu and S. Jha, "Static analysis of executables to detect malicious patterns," Proceedings of the 12th USENIX Security Symposium, 2003.
[8] M. Christodorescu, S. Jha, S. Seshia, D. Song and R. Bryant, "Semantics-aware malware detection," Proceedings of the 2005 IEEE Symposium on Security and Privacy, 2005.
[9] J. Kephart, G. Sorkin, W. Arnold, D. Chess, G. Tesauro
and S. White, "Biologically inspired defenses against
computer viruses," Proceedings of the 14th IJCAI,
Montreal, 1995.
[10] D. Bilar, "Opcodes as predictor for malware," International Journal of Electronic Security and Digital Forensics, 2007.
[11] I. Santos, F. Brezo, J. Nieves, Y. Penya and B. Sanz, "Opcode-sequence-based malware detection," Engineering Secure Software and Systems, LNCS 5965, 2010.
[12] I. Santos, F. Brezo, X. Ugarte and P. Bringas, "OpCode
sequences as representation of executables for data-
mining-based unknown malware detection," Information
Sciences 231, pp. 64-82, 2013.
[13] Y. Park, D. S. Reeves and M. Stamp, "Deriving common malware behavior through graph clustering," Computers and Security, pp. 419-430, 2013.
[14] W. Wong, "Analysis and detection of metamorphic
computer viruses," San Jose State University, 2006.
[15] T. Cover and J. Thomas, Elements of information
theory, Wiley and Sons Inc, 1991.
[16] H. Borchani, C. Bielza, P. Martin and P. Larranga,
"Markov blanket based approach for learning multi-
dimensional Bayesian network classifier: An application
to predict the European quality of life-5 dimension (EQ-
5D) from the 39-item Parkinson’s disease questionnaire
(PDQ-39)," Journal of Biomedical Informatics, p. 1175–
1184, 2012.
[17] L. Yu and H. Liu, "Efficient feature selection via
analysis of relevance and redundancy," Journal machine
learning research, pp. 1205-1224, 2004.
[18] K. Sharma, Bioinformatics, sequence alignment and
markov models, McGraw-Hill Publication, 2009.
[19] D. Powers, "Evaluation: From precision, recall and f-
measure to ROC, Informedness, Markedness &
Correlation," Journal of machine learning technologies,
pp. 27-83, 2011.
[20] R. Kohavi, "A study of cross-validation and bootstrap
for accuracy estimation and model selection,"
International joint conference on artificial intelligence
(IJCAI), 1995.