First International Congress on Technology, Communication and Knowledge (ICTCK 2014)
November, 26-27, 2014 - Mashhad Branch, Islamic Azad University, Mashhad, Iran
978-1-4799-8021-5/14/$31.00 ©2014 IEEE
Malware Detection Using Hidden Markov Model based on Markov Blanket Feature Selection Method

Bassir Pechaz
Imam Reza University
Faculty of computer engineering
Mashhad, Iran
Majid Vafaie Jahan
Islamic Azad University
Faculty of computer engineering
Mashhad, Iran
Mehrdad Jalali
Islamic Azad University
Faculty of computer engineering
Mashhad, Iran
jalali@mshdiau.ac.ir
Abstract—In general, all malicious code that can potentially harm a single computer or a network of computers is categorized as malware. With great progress in virus development kits, the variety of malware appearing today, and the growing number of network users, malware spreads rapidly through all aspects of computer systems. The main approach to detecting malware today is signature-based; with the rise of metamorphic malware, however, these techniques have lost much of their detection power. In this research, a new approach to malware detection is introduced by combining machine learning methods with the n-gram model and statistical analysis. Using the Markov blanket method for feature selection reduced the number of features by approximately 86% on average. Sequences were then produced to train a hidden Markov model. The trained HMM showed high accuracy, about 90%, in detecting and classifying malware and benign files.

Keywords: malware detection, hidden markov model, n-gram, markov blanket, machine learning
I. INTRODUCTION
Malware, or malicious software, is software developed to harm a single computer or a network of computers [1]. Hence, malware covers a wide range of malicious software. The term “computer virus” was first introduced by Dr. Frederick Cohen in 1984 [2]. Considering malware behavior and variety, five classes of malware were selected for this research: Backdoors, Rootkits, Trojan Horses, Viruses, and Worms.
II. RELATED WORKS
In theory, malware detection is categorized as a hard problem [3]. Spinellis [4] analyzed many computer malware samples and proved that detecting malware is an NP-complete problem.
All approaches presented so far for detecting malware can be divided into two general groups: signature-based and non-signature-based. Non-signature-based approaches are usually based on data mining techniques and machine learning methods [5].
Bergeron [6] introduced several methods based on disassembling files into their bit sequences to find behavioral patterns. Christodorescu [7] recommended a method based on the Control Flow Graph for detecting malware and later enhanced it by combining it with semantic models [8]. Kephart [9] examined an ANN classification algorithm on byte sequences generated from disassembling boot-sector malware.
Bilar [10] showed that operation codes (opCodes) can be an appropriate feature for classifying files as benign or malware. Building on Bilar's research, Santos [11] used opCodes as features, computed opCode frequencies for malware and benign files, and later improved this work by adding n-gram models [12].
In recent research by Park [13], a new method was introduced based on the malware behavior graph: the system-call graph of a suspect file is mapped and then compared to a standard malware system-call graph.
Wong [14] considered metamorphic malware, disassembled the samples into their opCodes, and obtained 416 unique opCodes. He generated training sequences of length between 66,000 and 67,000 based on the unique opCodes and used them to train an HMM.
In this paper, five classes of malware were collected, the Markov blanket method was used for feature selection, and by reducing the number of unique opCodes, an efficient method is introduced for classifying malware and benign files.
III. CONCEPTS
A. Entropy
Entropy is a quantity that can be defined for any probability distribution function, and it can easily be extended to a measure of the mutual information of random variables [15]. The entropy of a random variable is a criterion of that variable's uncertainty; in other words, entropy determines the average amount of information needed to describe the random variable.
Definition 1 – the entropy of a discrete random variable X is defined as (1):

H(X) = − Σ_{x∈X} p(x) log₂ p(x)   (1)

From (1) it is obvious that the entropy of X is a function of the probability distribution of X and does not depend on the actual values taken by X.
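As an illustration, the entropy in (1) can be estimated from sample frequencies. This is a minimal sketch; the function name is ours, not from the paper:

```python
import math
from collections import Counter

def entropy(samples):
    """H(X) = -sum over x of p(x) * log2 p(x), with p estimated by frequency."""
    counts = Counter(samples)
    n = len(samples)
    return -sum((c / n) * math.log2(c / n) for c in counts.values())

# A fair coin has the maximal entropy of 1 bit.
print(entropy(["H", "T", "H", "T"]))  # 1.0
```

A constant sequence, by contrast, has zero entropy, matching the intuition that it carries no uncertainty.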
Definition 2 – the joint entropy of a pair of discrete random variables (X, Y) with a joint distribution p(x, y) is defined as (2):

H(X, Y) = − Σ_{x∈X} Σ_{y∈Y} p(x, y) log₂ p(x, y)   (2)
Definition 3 – if (X, Y) ~ p(x, y), then the conditional entropy H(Y|X) is defined as (3):

H(Y|X) = Σ_{x∈X} p(x) H(Y|X = x) = − Σ_{x∈X} p(x) Σ_{y∈Y} p(y|x) log₂ p(y|x)   (3)
Definition 4 – mutual information (MI) is a quantity measuring the information that one random variable contains about another random variable. Mutual information is defined as (4):

I(X, Y) = H(X) − H(X|Y)   (4)

Mutual information is a symmetric quantity, so:

I(X, Y) = H(Y) − H(Y|X)   (5)

Therefore, the amount of information that X contains about Y is the same as the amount of information that Y contains about X. Figure 1 illustrates the relation between entropy and MI.
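Equations (4) and (5) can be sanity-checked numerically via the equivalent identity I(X, Y) = H(X) + H(Y) − H(X, Y), estimated from paired samples (a sketch; the helper names are ours):

```python
import math
from collections import Counter

def H(samples):
    """Sample-frequency estimate of entropy, as in (1)."""
    n = len(samples)
    return -sum((c / n) * math.log2(c / n) for c in Counter(samples).values())

def mutual_information(xs, ys):
    """I(X, Y) = H(X) + H(Y) - H(X, Y); symmetric in X and Y, as in (4)-(5)."""
    return H(xs) + H(ys) - H(list(zip(xs, ys)))

x = [0, 0, 1, 1]
print(mutual_information(x, x))             # 1.0  (I(X, X) = H(X))
print(mutual_information(x, [0, 1, 0, 1]))  # 0.0  (independent in this sample)
```

The symmetry of (4)-(5) is immediate from this form, since H(X, Y) does not distinguish the two variables.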
Figure 1. Relation between entropy and MI
B. Markov blanket
The Markov blanket is a feature selection method based on entropy. There are several algorithms for finding a Markov blanket subset, among them HITON, MMMB, MMHC, and IAMB [16]. Another algorithm was introduced by Yu [17]; in this research we use Yu's method.
Assume F is the set of features, G is a subset of F, and f_G are the values of the elements of G. The goal of any feature selection method can be expressed by (6):

P(C|G = f_G) ≅ P(C|F = f)   (6)
There are two main approaches to selecting features: considering each feature individually, and considering subsets of the total feature set.

When each feature is considered individually, the main goal is to assign a weight to each feature and finally select the features with the greatest weights. However, an irrelevant feature may receive the same weight as a relevant one, so selecting the best features is very challenging.
In the subset approach, features are selected based on their relations and correlations with each other. Figure 2 illustrates the subset-based algorithm [17].

Figure 2. Flow chart of the subset-based feature selection algorithm
In the subset generation step, a candidate best subset of features is produced according to the search strategy, and in the evaluation step it is compared with the previous one. If the new subset satisfies a certain criterion, it replaces the previous one. These two steps are repeated until no better subset of features is found.
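The generate-and-evaluate loop of Figure 2 can be sketched generically; the scoring function and neighbourhood used in the toy run are illustrative assumptions, not from the paper:

```python
def subset_search(neighbours, evaluate, initial):
    """Keep replacing the current best subset with a better candidate
    until the stopping criterion (no improvement) is met."""
    best, best_score = initial, evaluate(initial)
    improved = True
    while improved:
        improved = False
        for cand in neighbours(best):  # subset generation step
            score = evaluate(cand)     # subset evaluation step
            if score > best_score:     # replacement criterion satisfied
                best, best_score, improved = cand, score, True
    return best

# Toy run: features {0, 1, 2}, where only {0, 2} are useful.
full = {0, 1, 2}
score = lambda s: len(s & {0, 2}) - len(s - {0, 2})
step = lambda s: [s - {e} for e in s] + [s | {e} for e in full - s]
print(subset_search(step, score, set()))  # {0, 2}
```

Any concrete method of this family plugs in its own search strategy (`neighbours`) and evaluation criterion (`evaluate`).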
There are two approaches to finding correlation between data: linear and nonlinear methods. In several cases there is no linear correlation between data even though a nonlinear correlation still exists. Most nonlinear correlation methods are based on entropy.

According to MI, variable Y depends more on X than on Z if and only if I(X, Y) > I(Z, Y). The value of MI always lies in [0, 1]: a value of 1 means that given one feature, the other is fully predictable, whereas a value of 0 means there is no relevance between X and Y.
Symmetric Uncertainty (SU) is a quantity for normalizing MI. SU is defined as (7):

SU(X, Y) = 2 [ I(X, Y) / (H(X) + H(Y)) ]   (7)
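Equation (7) can be sketched on top of the sample-based entropy estimate; the helper names are ours:

```python
import math
from collections import Counter

def H(samples):
    """Sample-frequency estimate of entropy, as in (1)."""
    n = len(samples)
    return -sum((c / n) * math.log2(c / n) for c in Counter(samples).values())

def symmetric_uncertainty(xs, ys):
    """SU(X, Y) = 2 * I(X, Y) / (H(X) + H(Y)), normalised to [0, 1]."""
    hx, hy = H(xs), H(ys)
    if hx + hy == 0:
        return 0.0  # both variables are constant; no shared information
    mi = hx + hy - H(list(zip(xs, ys)))  # I(X, Y) via the joint entropy
    return 2 * mi / (hx + hy)

x = [0, 0, 1, 1]
print(symmetric_uncertainty(x, x))             # 1.0 (fully predictable)
print(symmetric_uncertainty(x, [0, 1, 0, 1]))  # 0.0 (no relevance)
```

The division by H(X) + H(Y) is what confines the score to [0, 1], removing MI's bias toward high-entropy features.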
C. Markov Blanket set
The Markov blanket set was introduced by Koller in 1996 [18]. Assume F_i is given and m_i ⊂ F with F_i ∉ m_i. m_i is a Markov blanket set for F_i if and only if

P(F \ {F_i, m_i}, C | F_i, m_i) = P(F \ {F_i, m_i}, C | m_i)

In other words, the existence of a Markov blanket set for a particular feature F_i indicates that F_i is redundant and can be removed. Put simply, a Markov blanket exists only for features that have no relation with class C and cannot determine the class, and such features can therefore be removed.
D. Introducing Yu's method
Definition 5 – the probability distribution of class C given the feature set F is denoted P(C|F). S_i = F \ {F_i} means feature F_i is removed from set F.

Definition 6 – strong relevance: feature F_i is strongly relevant to set F if and only if

P(C|S_i, F_i) ≠ P(C|S_i)

Definition 7 – weak relevance: feature F_i is weakly relevant to set F if and only if

P(C|S_i, F_i) = P(C|S_i) and ∃ S_i′ ⊂ S_i such that P(C|S_i′, F_i) ≠ P(C|S_i′)

Definition 8 – feature F_i is completely irrelevant to set F if and only if

∀ S_i′ ⊂ S_i, P(C|S_i′, F_i) = P(C|S_i′)
If a strongly relevant feature is removed from the selected set, the power to detect the related class C is reduced; in other words, strongly relevant features are necessary for detecting class C. Selecting or deselecting a weakly relevant feature, depending on the selection of other features, can affect detection power. Irrelevant features can be removed with no effect on detection power. Hence, the optimal set contains all strongly relevant features and some weakly relevant features.
Yu introduced two definitions of correlation: individual correlation and combination correlation.

Definition 9 – the correlation between F_i and class C is called individual correlation.

Definition 10 – the correlation between two features F_i and F_j (i ≠ j) is called combination correlation.

Features with high individual correlation give more information about class C. For two features F_i and F_j with equal individual correlation, the next step is to calculate the combination correlation of F_i and F_j and determine whether F_j is redundant given F_i.
For two features F_i and F_j (i ≠ j), F_i forms an approximate Markov blanket for F_j if and only if SU_{i,c} ≥ SU_{j,c} and SU_{i,j} ≥ SU_{j,c}.

Assume F_j forms an approximate Markov blanket for F_k, so F_k can be removed given F_j. On the other hand, F_i forms an approximate Markov blanket for F_j, so F_j can be removed given F_i. In the end, it may seem that F_k was removed on the basis of no remaining feature. To avoid this kind of phenomenon, Yu introduced the predominant feature concept.
Definition 12 – predominant feature: a feature is predominant if and only if there is no approximate Markov blanket set for it. To find predominant features, it is only necessary to sort the features in descending order of their individual correlation value; the feature with the greatest individual correlation value is a predominant feature.
E. RBF algorithm
The RBF algorithm finds predominant features and removes redundant features based on them. RBF stops when no predominant feature remains; the remaining features form the Markov blanket set. The main property of RBF is that no special threshold needs to be declared.

Finding and selecting relevant features is done in O(N), where N is the number of features in the original set. Finding predominant features is also done in O(N), so overall RBF finds the Markov blanket set in O(N²) time.
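Under the definitions above, the redundancy-elimination core of this kind of algorithm can be sketched as follows. The SU tables here are hypothetical inputs; in practice they would be computed from the opCode data via (7):

```python
def rbf_select(features, su_class, su_pair):
    """Process features in descending order of SU with the class; keep a
    feature only if no already-kept (more relevant) feature forms an
    approximate Markov blanket for it."""
    kept = []
    for f in sorted(features, key=lambda f: su_class[f], reverse=True):
        # f is covered if some kept g has SU(g, f) >= SU(f, class).
        if all(su_pair[(g, f)] < su_class[f] for g in kept):
            kept.append(f)  # f is predominant so far
    return kept

# Hypothetical SU values for three opCodes.
su_class = {"mov": 0.9, "push": 0.5, "xor": 0.3}
su_pair = {("mov", "push"): 0.8, ("mov", "xor"): 0.1, ("push", "xor"): 0.2}
# "push" is blanketed by "mov" (0.8 >= 0.5) and dropped; "xor" survives.
print(rbf_select(su_class.keys(), su_class, su_pair))  # ['mov', 'xor']
```

No threshold appears anywhere: each feature's own SU with the class serves as its removal criterion, which is the property the text highlights.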
F. Validation methods
In this research three measures were chosen for validating the results: precision, sensitivity, and accuracy.

TP1 counts vectors that are in the positive class and are correctly labeled positive. FP2 counts negative vectors wrongly labeled positive. TN3 counts vectors that are in the negative class and are correctly labeled negative. Finally, FN4 counts positive vectors wrongly labeled negative.
Precision measures the preference of the method for determining the positive class (8) [19]:

Precision = TP / (TP + FP)   (8)
Sensitivity measures how many of the truly positive vectors were predicted correctly (9) [19]:

Sensitivity = TP / (TP + FN)   (9)
Accuracy measures the total number of correct predictions over the total number of vectors (10) [19]:

Accuracy = (TP + TN) / (TP + FP + TN + FN)   (10)
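Equations (8)-(10) amount to the following; the confusion-matrix counts in the example are made up for illustration:

```python
def metrics(tp, fp, tn, fn):
    """Precision (8), sensitivity (9) and accuracy (10) from the confusion matrix."""
    precision = tp / (tp + fp)                   # (8)
    sensitivity = tp / (tp + fn)                 # (9)
    accuracy = (tp + tn) / (tp + fp + tn + fn)   # (10)
    return precision, sensitivity, accuracy

# Hypothetical: 90 malware caught, 5 benign wrongly flagged,
# 95 benign passed, 10 malware missed.
p, s, a = metrics(tp=90, fp=5, tn=95, fn=10)
print(round(p, 3), round(s, 3), round(a, 3))  # 0.947 0.9 0.925
```

With malware as the positive class, sensitivity is the fraction of malware actually detected, while precision is the fraction of malware alarms that are genuine.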
1 True positive
2 False positive
3 True negative
4 False negative
IV. EMPIRICAL RESULTS
The current research consists of several steps: collecting benign and malware files; categorizing them into two groups, malware and benign; checking the files to ensure they accord with their labels; disassembling the files and extracting their opCodes; purifying the extracted opCodes; cross-validating the files, choosing one chunk for testing and the rest for training; selecting the best opCodes as features; forming n-gram sequences of opCodes; and finally training the Hidden Markov Model, classifying the test files, and validating the developed model.
In the first step, five groups of malware that are the main challenge for today's computer networks were selected: Backdoors, Rootkits, Trojan horses, Viruses, and Worms. All malware files were collected from the VX Heavens5 website. Benign files were collected from two popular operating systems: Windows 7 and 8, and Ubuntu Linux.
In the next step, after ensuring that the files were correctly labeled, all files were disassembled. Selecting relevant opCodes required a list of opCodes, which was collected from the Intel6 reference website.
To ensure that both classes of files have the same volume, an equal volume of files was selected for each class. Table 1 shows the number of files and the minimum, maximum, average, and total volume of the selected files.
TABLE I. NUMBER, MINIMUM, MAXIMUM, AVERAGE AND TOTAL VOLUME OF SELECTED FILES

| Class | Number of files | Min size | Max size | Average size | Total size |
| backdoors | 95 | 92.9 KB | 757 KB | 280.25 KB | 26 MB |
| rootkits | 451 | 6.85 KB | 185 KB | 15.89 KB | 7 MB |
| Trojan horses | 235 | 18.9 KB | 797 KB | 113.29 KB | 26 MB |
| viruses | 165 | 29.8 KB | 1.01 MB | 161.35 KB | 26 MB |
| worms | 261 | 19.3 KB | 4.57 MB | 102 KB | 26 MB |
| benign files (for non-rootkit classes) | 128 | 62.5 KB | 705 KB | 208 KB | 26 MB |
| benign files (for rootkit class) | 66 | 62.5 KB | 199 KB | 108.6 KB | 26 MB |
Kohavi [20] indicated that 10-fold cross validation gives better results; accordingly, all files in all groups were distributed into 10 chunks.
After applying the RBF algorithm and finding the relevant opCodes, n-gram sequences were formed based on the selected opCodes. In this research 2-, 3-, 4-, and 5-grams were used. Figure 3 compares the number of selected opCodes with the number of all unique opCodes in each malware group.
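The n-gram sequences can be formed with a simple sliding window over each file's opCode stream (a sketch; the example opCodes are illustrative):

```python
def ngrams(opcodes, n):
    """Slide a window of length n over the opCode sequence of one file."""
    return [tuple(opcodes[i:i + n]) for i in range(len(opcodes) - n + 1)]

seq = ["mov", "push", "call", "pop", "ret"]
print(ngrams(seq, 2))
# [('mov', 'push'), ('push', 'call'), ('call', 'pop'), ('pop', 'ret')]
```

A sequence of L opCodes yields L − n + 1 overlapping n-grams, so larger n both enlarges the feature alphabet and slightly shortens the training sequences.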
5 www.vxheavens.com
6 http://ref.x86asm.net/coder32.html
Figure 3. Comparing amount of selected features and total features
Figure 4 illustrates the steps of applying the RBF algorithm to the extracted opCodes.

Finally, the generated sequences of opCodes were used to train a Hidden Markov Model with 15 states for each group of malware. Then the test chunk of files was used for validating the trained model.
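Classification with the trained models reduces to likelihood scoring: a test file's sequence is scored under each family's HMM and assigned to the best-scoring class. A minimal forward-algorithm sketch, with a toy two-state model rather than the paper's 15-state ones:

```python
import math

def forward_log_likelihood(obs, start, trans, emit):
    """log P(obs | model) via the forward algorithm (unscaled; toy sizes only)."""
    n = len(start)
    # alpha[s] = P(observations so far, current state = s)
    alpha = [start[s] * emit[s][obs[0]] for s in range(n)]
    for o in obs[1:]:
        alpha = [sum(alpha[s] * trans[s][t] for s in range(n)) * emit[t][o]
                 for t in range(n)]
    return math.log(sum(alpha))

start = [1.0, 0.0]                     # initial state distribution
trans = [[0.9, 0.1], [0.1, 0.9]]       # state transition matrix
emit = [[0.8, 0.2], [0.2, 0.8]]        # P(symbol | state)
print(forward_log_likelihood([0, 0], start, trans, emit))
```

A production implementation would use log-space or scaled recursions for long opCode sequences; this unscaled version is only meant to show the scoring step.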
V. RESULTS
In this research the Hidden Markov Model was used as the classifier. After applying the RBF algorithm, selecting the relevant opCodes, reducing the total number of opCodes, and decreasing the length of the n-gram sequences, the HMM could be trained. Training sequences were generated for 2- to 5-grams. Figures 5-9 illustrate the precision, accuracy, and sensitivity of the trained HMM for each group of malware. In this research, malware is considered the positive class.
(Data of Figure 3 — total vs. selected unique opCodes per group: Backdoor 379/22, Rootkit 347/105, Trojan horse 384/32, Virus 415/60, Worm 394/43.)
Figure 4. Steps for applying RBF algorithm on extracted opCodes
Figure 5. Precision, accuracy and sensitivity in 2-5 gram for Backdoor
Figure 6. Precision, accuracy and sensitivity in 2-5 gram for Rootkit
Figure 7. Precision, accuracy and sensitivity in 2-5 gram for Trojan Horse
Figure 8. Precision, accuracy and sensitivity in 2-5 gram for Virus
Figure 9. Precision, accuracy and sensitivity in 2-5 gram for Worm
VI. CONCLUSION
In the current paper, by using the Markov blanket as the feature selection method, reducing the total number of opCodes, decreasing the length of the training sequences, and using an HMM as the classifier, we could classify malware and benign files. Using the RBF algorithm to find the relevant opCodes reduced the original number of opCodes by 86% on average for the five groups of malware.

It is obvious from the diagrams that as the gram size increases, the number of unique features increases and the classification accuracy increases. However, the increase in the number of
(Steps shown in Figure 4: input — labeled benign and malware file sets; number each opCode and compute its frequency in the two file classes; calculate the entropy of each opCode; calculate the conditional entropy of each opCode with respect to the two file classes; calculate MI; apply the RBF algorithm to find the Markov blanket set; output — the optimal set of opCodes.)
unique features affected both the precision and the sensitivity of the trained HMM, reducing them.
An important point to note is that with the presented method there is no need to execute files in a real computer system to observe their behavior, which helps us avoid malicious files before they affect our systems. Using an HMM as the classifier and combining it with n-gram models has a great impact on the results, improving them to up to 90% on average.
REFERENCES
[1] B. Potter and G. Day, "The effectiveness of antimalware
tools," Computer fraud and security, pp. 12-13., 2009.
[2] P. Szor, The art of computer virus research and defense,
Addison Wesley Professional, 2005.
[3] F. Cohen, "Computer viruses: Theory and experiments," Computers and Security, pp. 22-35, 1987.
[4] D. Spinellis, "Reliable Identification of Bounded-Length Viruses Is NP-Complete," IEEE Transactions on Information Theory, vol. 49, 2003.
[5] I. Yoo and V. Nitshe, "Non signature based virus
detection: Toward establishing unknown virus detection
technique using som," J. Comput. Virol 2(3), 2006.
[6] J. Bergeron and M. Debbabi, "Static analysis of binary code to isolate malicious behaviors," Proceedings of the 1999 Workshop on Enabling Technologies: Infrastructure for Collaborative Enterprises, 1999.
[7] M. Christodorescu and S. Jha, "Static analysis of executables to detect malicious patterns," Proceedings of the 12th USENIX Security Symposium, 2003.
[8] M. Christodorescu, S. Jha, S. Seshia, D. Song and R. Bryant, "Semantics-aware malware detection," Proceedings of the 2005 IEEE Symposium on Security and Privacy, 2005.
[9] J. Kephart, G. Sorkin, W. Arnold, D. Chess, G. Tesauro
and S. White, "Biologically inspired defenses against
computer viruses," Proceedings of the 14th IJCAI,
Montreal, 1995.
[10] D. Bilar, "Opcodes as predictor for malware," International Journal of Electronic Security and Digital Forensics, 2007.
[11] I. Santos, F. Brezo, J. Nieves, Y. Penya and B. Sanz, "Opcode-sequence-based malware detection," Engineering Secure Software and Systems, LNCS 5965, 2010.
[12] I. Santos, F. Brezo, X. Ugarte and P. Bringas, "OpCode
sequences as representation of executables for data-
mining-based unknown malware detection," Information
Sciences 231, pp. 64-82, 2013.
[13] Y. Park, D. S. Reeves and M. Stamp, "Deriving common malware behavior through graph clustering," Computers and Security, pp. 419-430, 2013.
[14] W. Wong, "Analysis and detection of metamorphic
computer viruses," San Jose State University, 2006.
[15] T. Cover and J. Thomas, Elements of information
theory, Wiley and Sons Inc, 1991.
[16] H. Borchani, C. Bielza, P. Martin and P. Larranga,
"Markov blanket based approach for learning multi-
dimensional Bayesian network classifier: An application
to predict the European quality of life-5 dimension (EQ-
5D) from the 39-item Parkinson’s disease questionnaire
(PDQ-39)," Journal of Biomedical Informatics, p. 1175–
1184, 2012.
[17] L. Yu and H. Liu, "Efficient feature selection via
analysis of relevance and redundancy," Journal machine
learning research, pp. 1205-1224, 2004.
[18] K. Sharma, Bioinformatics, sequence alignment and
markov models, McGraw-Hill Publication, 2009.
[19] D. Powers, "Evaluation: From precision, recall and f-
measure to ROC, Informedness, Markedness &
Correlation," Journal of machine learning technologies,
pp. 27-83, 2011.
[20] R. Kohavi, "A study of cross-validation and bootstrap
for accuracy estimation and model selection,"
International joint conference on artificial intelligence
(IJCAI), 1995.