Date post: 13-Aug-2015
Upload: sherin-bennet
Malware Detection using n-grams and Evaluation using Machine Learning Algorithms

11MSE0195-SHERIN JOSEPHIN B

Abstract

• Computer security is a major concern today. The term malware denotes malicious software that compromises computer security.

• Most anti-virus software fails to detect new viruses. Using n-grams as file signatures can help detect unknown malware and reduce the false positive ratio.

• The dataset is further optimized using a feature selection algorithm. The final Feature Vector Table obtained from feature selection and dimension reduction will be compared and evaluated using various machine learning algorithms.

Aim and Scope

• The aim of this project is to detect malware files using n-gram analysis and to evaluate the detection using machine learning algorithms.

• Because many antivirus products fail to detect new viruses, n-grams are used as a model to detect malware files efficiently.

• This project will focus on developing a better tool to detect malware files, taking space complexity into consideration.

• This technique is currently used in industry, where securing data is a primary focus. Anti-virus software such as Kaspersky and K7 uses this technique to detect malware files.

LITERATURE SURVEY

8. “Static Malware Detection with Segmented Sandboxing”
   Abstract: This study takes the best parts of the static and dynamic detection approaches; the combined approach, called “Segmented Sandboxing”, is applied to detect malware files.
   Techniques: 1. segmented sandboxing
   Advantages: 1. higher detection rate (compared with previous data)

9. “N grams based file signature for malware detection”
   Abstract: This study proposes the use of n-grams as file signatures in order to detect unknown malware.
   Techniques: 1. n-grams
   Advantages: 1. low false positive ratio

10. “A Hybrid Model to Detect Malicious Executables”
   Abstract: This paper proposes a feature set, called the hybrid feature set, which is given to a support vector machine that classifies files as malware or benign.
   Techniques: 1. n-grams 2. SVM
   Advantages: 1. high accuracy 2. low false positive rate

11. “Detection of New Malicious Code Using N-grams Signatures”
   Abstract: This paper describes n-gram analysis that classifies files as malware or benign.
   Techniques: 1. n-grams
   Advantages: 1. efficient 2. scalable 3. practical solutions

Architecture

Detailed Design

Module Description

MODULE 1: Dataset preparation
- Executable files (benign or malware) are disassembled using a disassembler.
- The assembly code is parsed, and the opcode sequence is collected in the dataset.

MODULE 2: Create Feature Vector Table (FVT) by n-gram extraction
- The dataset is split into training data and testing data.
- The training data is used for n-gram extraction.
- The extracted n-grams are stored in a table called the Feature Vector Table (FVT).
- The Feature Vector Table consists of the opcode, its frequency count and its respective class.

MODULE 3: Employing Feature Reduction Algorithm
- PCA

MODULE 4: Classification using Machine Learning Algorithm
- J48, Support Vector Machine (SVM) and Random Forest
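Modules 1 and 2 can be illustrated with a minimal Python sketch. It is not the project's actual code: the assembly layout (one instruction per line, `;` comments, labels ending in `:`), the function names `parse_opcodes`, `extract_ngrams` and `build_fvt`, and the two tiny samples are all assumptions made for illustration.

```python
from collections import Counter


def parse_opcodes(asm_text):
    """Return the opcode (mnemonic) sequence from disassembly text.

    Assumed layout: one instruction per line, ';' starts a comment,
    and a line ending in ':' is a label. Real disassemblers differ.
    """
    opcodes = []
    for line in asm_text.splitlines():
        line = line.split(";", 1)[0].strip()  # drop trailing comments
        if not line or line.endswith(":"):    # skip blank lines and labels
            continue
        opcodes.append(line.split()[0].lower())
    return opcodes


def extract_ngrams(opcodes, n):
    """Slide a window of length n over the opcode sequence."""
    return [tuple(opcodes[i:i + n]) for i in range(len(opcodes) - n + 1)]


def build_fvt(samples, n):
    """Build rows of (n-gram, frequency count, class).

    samples is a list of (opcode_sequence, class_label) pairs.
    """
    counts = Counter()
    for opcodes, label in samples:
        for gram in extract_ngrams(opcodes, n):
            counts[(gram, label)] += 1
    return [(gram, freq, label) for (gram, label), freq in sorted(counts.items())]


# Made-up samples for illustration only.
benign_asm = """
start:
    push ebp        ; set up frame
    mov ebp, esp
    pop ebp
    ret
"""
malware_asm = """
    xor eax, eax
    push eax
    xor eax, eax
    push eax
"""

samples = [
    (parse_opcodes(benign_asm), "benign"),
    (parse_opcodes(malware_asm), "malware"),
]
fvt = build_fvt(samples, 2)
for row in fvt:
    print(row)
```

Each printed row holds an opcode 2-gram, its frequency count and its class, mirroring the three FVT columns described in Module 2. Changing the second argument of `build_fvt` to 3 or 4 gives the 3-gram and 4-gram tables.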

UML Design

• USE CASE DIAGRAM

• CLASS DIAGRAM

• SEQUENCE DIAGRAM

• ACTIVITY DIAGRAM

• STATE CHART DIAGRAM

USE CASE DIAGRAM

CLASS DIAGRAM

Sequence Diagram

Activity Diagram

State Chart Diagram

Results and Discussion

           With PCA   Without PCA
2 grams    8          216
3 grams    9          256
4 grams    8          256
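Module 3 names PCA as the feature reduction step but the slides do not show how it was run. The sketch below is one possible pure-Python version, using power iteration with deflation on the covariance matrix; this is a technique chosen here for a dependency-free example, not necessarily what the project used. The function name `pca` and the sample frequency rows are hypothetical.

```python
import math


def pca(rows, k, iterations=500):
    """Project rows onto their top-k principal components.

    Uses power iteration plus deflation, which is adequate for a small
    illustration; a numerical library would be preferable in practice.
    Requires at least two rows.
    """
    n, d = len(rows), len(rows[0])
    means = [sum(r[j] for r in rows) / n for j in range(d)]
    centered = [[r[j] - means[j] for j in range(d)] for r in rows]
    # Sample covariance matrix (d x d).
    cov = [[sum(r[i] * r[j] for r in centered) / (n - 1) for j in range(d)]
           for i in range(d)]

    components = []
    for _ in range(k):
        v = [1.0 / math.sqrt(d)] * d  # fixed start vector (an assumption)
        for _ in range(iterations):
            w = [sum(cov[i][j] * v[j] for j in range(d)) for i in range(d)]
            norm = math.sqrt(sum(x * x for x in w))
            if norm < 1e-12:          # no variance left in this direction
                break
            v = [x / norm for x in w]
        eigenvalue = sum(v[i] * cov[i][j] * v[j]
                         for i in range(d) for j in range(d))
        components.append(v)
        # Deflation: remove the variance explained by this component.
        cov = [[cov[i][j] - eigenvalue * v[i] * v[j] for j in range(d)]
               for i in range(d)]

    projected = [[sum(r[j] * c[j] for j in range(d)) for c in components]
                 for r in centered]
    return components, projected


# Made-up n-gram frequency rows: the second feature is twice the first
# and the third is constant, so one component captures all the variance.
rows = [[1, 2, 5], [2, 4, 5], [3, 6, 5], [4, 8, 5]]
components, projected = pca(rows, k=1)
print(components[0])
print(projected)
```

In this toy input three feature columns collapse to one, which is the same kind of reduction the module description attributes to PCA.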

With Feature Selection Algorithm

2-grams         Random Forest   SVM      J48
Classified      95%             82.50%   88%
Misclassified   12.30%          82.50%   36.40%
Precision       95.00%          68.10%   86.90%

Performance Table for 2-grams
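The charts for these results are labelled TPR, FPR and Precision. As a reference for how those three rates are conventionally computed, here is a short Python sketch using the standard confusion-matrix definitions. The function names and the label lists are made up for illustration and are not results from this project; treating "malware" as the positive class is also an assumption.

```python
def confusion_counts(actual, predicted, positive="malware"):
    """Count true/false positives and negatives for a two-class problem."""
    tp = fp = tn = fn = 0
    for a, p in zip(actual, predicted):
        if p == positive:
            if a == positive:
                tp += 1
            else:
                fp += 1
        else:
            if a == positive:
                fn += 1
            else:
                tn += 1
    return tp, fp, tn, fn


def rates(tp, fp, tn, fn):
    """Return TPR, FPR and precision, guarding against empty denominators."""
    return {
        "TPR": tp / (tp + fn) if tp + fn else 0.0,
        "FPR": fp / (fp + tn) if fp + tn else 0.0,
        "Precision": tp / (tp + fp) if tp + fp else 0.0,
    }


# Hypothetical labels, not data from the slides.
actual = ["malware", "malware", "malware", "benign", "benign"]
predicted = ["malware", "malware", "benign", "malware", "benign"]

counts = confusion_counts(actual, predicted)
metrics = rates(*counts)
print(counts, metrics)
```

Under these definitions TPR and FPR are computed over different denominators (actual malware and actual benign files respectively), so the two values need not add up to 100%.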

Graphic view for 2-grams

[Chart: TPR, FPR and Precision for Random Forest, SVM and J48; axis from 0% to 100%]

Performance Table for 3-grams

3-grams         Random Forest   SVM      J48
Classified      92%             94.70%   84%
Misclassified   52.10%          34.70%   53.20%
Precision       92.80%          95.00%   84.20%

Graphic view for 3-grams

[Chart: TPR, FPR and Precision for Random Forest, SVM and J48; axis from 0% to 100%]

Normal Code and Obfuscated Code

Disassembling the executables

Parser

N-grams extraction

Opcode and its frequency and class

Data set

Before PCA

After Feature Selection- PCA

Classification

Thank You..

