NEURLUX: Dynamic Malware Analysis using Neural Networks
CHANI JINDAL
STATIC vs DYNAMIC ANALYSIS
● Static Analysis
● Dynamic Analysis
● Why Sandbox? “malware cannot avoid leaving a behavioural footprint” (2018)
Static Analysis vs Dynamic Analysis

Static Analysis
● Pros
○ Comprehensive signatures can be created
○ Small error rate with byte-sequence matches
● Cons
○ Difficult to thoroughly understand behavior, especially if obfuscated or packed
○ Network requests are not observed
○ Time to prevention is ultimately long (lifecycle)
○ Cost (human / machine)

Dynamic Analysis
● Pros
○ All behavior is observed
○ Thorough understanding of the sample
● Cons
○ Needs an execution environment and resources
○ Different execution environment may change behavior
HISTORY and RELATED WORK
● Machine Learning (ML)
○ raw bytecode as a greyscale image of the executable -> 2D array (2011)
○ feature extraction and feature engineering necessary
○ shallow learning techniques, not scalable
● Neural Networks and Image Classification (2018)
○ convolutional network
○ improvement on basic ML models
○ adversarial model
● A multi-level deep learning system for malware detection (2019)
○ deep learning architecture, still focused on static PEs
● Dynamic-behavior-based NN (2018)
○ focus on machine activity: read/write file counts
LIMITATIONS OF PREVIOUS WORK
● Image classification scrutinized
○ adversarial attacks
● Focus only on a single dynamic feature, such as API sequence
○ lack of feature coverage
● Are feature extraction and feature engineering necessary?
● Is there a precedent for dynamic deep learning?
DATASETS
● We have two datasets:

                                  MALICIOUS   BENIGN
PRIVATE DATASET (real world)      13,760      13,760
EMBER DATASET                     21,000      21,000

● Generation of behavioral reports from these datasets
DATA COLLECTION - SANDBOXES
● Orchestrate Cuckoo -- the open-source sandbox
○ 20+ headless VMs, all running Windows with full network access
● The orchestration effort is a potential answer to why this is an unexplored space
REPORT FORMAT
● Term Frequency - Inverse Document Frequency (TF-IDF)
● Feature Hashing (“the hashing trick”)
NATURAL LANGUAGE PROCESSING
● Bag of Words (BOW)
○ loss of spatial locality
● N-grams
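These text representations can be sketched in a few lines of plain Python. This is a minimal sketch on toy API-call tokens; the token names and vector size are illustrative, not taken from the NEURLUX pipeline:

```python
import math
import zlib
from collections import Counter

def bag_of_words(tokens):
    """Count token occurrences; word order (spatial locality) is lost."""
    return Counter(tokens)

def ngrams(tokens, n):
    """Contiguous n-token windows recover some local ordering."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

def tf_idf(docs):
    """Weight a term by its frequency in a doc, discounted by how many docs contain it."""
    df = Counter(t for d in docs for t in set(d))
    return [{t: c / len(d) * math.log(len(docs) / df[t])
             for t, c in Counter(d).items()} for d in docs]

def hashed_features(tokens, dim=8):
    """The 'hashing trick': bucket tokens into a fixed-size vector, no vocabulary needed."""
    vec = [0] * dim
    for t in tokens:
        vec[zlib.crc32(t.encode()) % dim] += 1
    return vec

calls = ["CreateFile", "WriteFile", "CreateFile", "RegSetValue"]
print(bag_of_words(calls)["CreateFile"])   # 2
print(ngrams(calls, 2)[0])                 # ('CreateFile', 'WriteFile')
print(sum(hashed_features(calls)))         # 4: one bucket increment per token
```

A term that appears in every report gets a TF-IDF weight of zero, which is exactly the discounting the slide refers to.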
BASELINE = STATE OF THE ART
Word Embeddings
Words that have the same meaning should have a similar representation.
● A way to learn to map a set of words or phrases in a vocabulary to vectors of numerical values.
● Computations with one-hot encoded vectors are inefficient because most values in a one-hot vector are 0 (sparse).
● Dense representation of words and their relative meanings.
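The lookup itself can be sketched in numpy. The vocabulary, dimensions, and random weights below are illustrative; in a real model the embedding matrix is learned during training:

```python
import numpy as np

vocab = {"PeerDist": 0, "CacheMgr": 1, "Defender": 2}
rng = np.random.default_rng(0)

# Embedding matrix: vocabulary_size x embedding_dimension (learned in training).
embedding = rng.normal(size=(len(vocab), 3))

def embed(word):
    """A dense row of the embedding matrix replaces a sparse one-hot vector."""
    return embedding[vocab[word]]

# Multiplying a one-hot vector by the matrix selects the same row,
# but the direct lookup skips all the wasted zero multiplications.
one_hot = np.eye(len(vocab))[vocab["PeerDist"]]
assert np.allclose(one_hot @ embedding, embed("PeerDist"))
```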
Word Embeddings
[Figure: example embedding vectors, e.g. PeerDist -> (0.7, 0.4, 0.5), CacheMgr -> (0.2, -0.1, 0.1). Words with similar context tend to have collinear vectors. Embedding matrix size = vocabulary_size × embedding_dimension.]
Convolutional Neural Networks
● S = length of input
● D = dimension of word vector
● Sentence matrix = S × D
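One convolutional filter sliding over the S × D sentence matrix can be sketched as follows (numpy, random toy values; a real model learns many filters of several widths):

```python
import numpy as np

S, D, k = 6, 4, 3                     # input length, word-vector dim, filter width
rng = np.random.default_rng(1)
sentence = rng.normal(size=(S, D))    # the S x D sentence matrix
filt = rng.normal(size=(k, D))        # a filter spans k words and the full width D

# Slide the filter down the word axis; each window yields one activation.
feature_map = np.array([np.sum(sentence[i:i + k] * filt)
                        for i in range(S - k + 1)])
pooled = feature_map.max()            # max-over-time pooling -> one scalar per filter
assert feature_map.shape == (S - k + 1,)
```

Because the filter spans the full embedding dimension, convolution happens only along the word axis, unlike 2-D image convolutions.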
HOW DO THEY FIND SIMILARITIES?
[Figure: embedding vectors for tokens CurrVersion, PeerDist, CacheMgr, Defender, spynet, windows; a convolutional filter slides over the word embeddings, and related tokens score high similarities (e.g. 0.9, 0.84).]
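Similarity scores like those in the figure are typically cosine similarities between embedding vectors; a quick numpy check, reusing the illustrative vectors from the earlier slide:

```python
import numpy as np

def cosine(u, v):
    """Cosine similarity: 1.0 for collinear vectors, 0.0 for orthogonal ones."""
    return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v)))

peer_dist = np.array([0.7, 0.4, 0.5])
cache_mgr = np.array([0.2, -0.1, 0.1])

# Scaling a vector does not change its direction, so similarity stays at 1.
assert abs(cosine(peer_dist, 2 * peer_dist) - 1.0) < 1e-9
print(round(cosine(peer_dist, cache_mgr), 2))
```

This is why "words with similar context tend to have collinear vectors" translates directly into high similarity scores.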
LSTM
● Networks with loops in them, which allow information to persist.
● Cell states are like conveyor belts.
● The LSTM cell has the ability to remove or add information to the cell state, carefully regulated by structures called gates.
● Gates are a way to optionally let information through.
● Input from the BiLSTM: the meaning behind sequences, with a vector corresponding to each word.
● Word vector -> sentence vector
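A single LSTM step can be sketched in numpy to make the gates concrete (random weights and illustrative sizes; real layers are trained, and here the model runs them bidirectionally):

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def lstm_step(x, h, c, W, b):
    """One time step: gates decide what to forget, what to add, what to emit."""
    z = W @ np.concatenate([x, h]) + b
    f, i, o, g = np.split(z, 4)
    f, i, o = sigmoid(f), sigmoid(i), sigmoid(o)   # forget/input/output gates in (0, 1)
    c = f * c + i * np.tanh(g)                     # cell state: the "conveyor belt"
    h = o * np.tanh(c)                             # gated output = new hidden state
    return h, c

d_in, d_h = 3, 2
rng = np.random.default_rng(2)
W, b = rng.normal(size=(4 * d_h, d_in + d_h)), np.zeros(4 * d_h)
h = c = np.zeros(d_h)
for x in rng.normal(size=(5, d_in)):   # feed a 5-step input sequence
    h, c = lstm_step(x, h, c, W, b)
assert h.shape == (d_h,) and np.all(np.abs(h) < 1.0)
```

Note how the cell state `c` is only ever scaled (forget gate) or added to (input gate), which is what lets information persist across many steps.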
PROPOSED METHODS
FEATURE COUNTS
● Common to both report formats -> formatted in terms of parent-child processes
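One way to sketch that shared parent-child structure (the field names here are illustrative, not the exact Cuckoo report schema):

```python
from collections import defaultdict

# Toy process records as they might appear in a behavioral report.
procs = [
    {"pid": 100, "ppid": 1,   "name": "explorer.exe"},
    {"pid": 200, "ppid": 100, "name": "sample.exe"},
    {"pid": 300, "ppid": 200, "name": "cmd.exe"},
    {"pid": 301, "ppid": 200, "name": "powershell.exe"},
]

# Group flat records into a parent -> children mapping.
children = defaultdict(list)
for p in procs:
    children[p["ppid"]].append(p["name"])

print(children[200])   # ['cmd.exe', 'powershell.exe']
```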
FEATURE BASED TEXT CLASSIFICATION
RAW - CNN
WHAT DOES THIS MEAN?
● We don’t encode any information about the file format into the model
● This can be adapted to new file formats
● Time-to-solution is reduced: no feature engineering!
PROBLEMS?
● Reports are large
● Variable length
● Lots of information
Evaluation
DATASET
BEST: INDIVIDUAL FEATURES + RAW DATA
MALWARE FAMILY
BEST: RAW_DATA
DATASET + REPORT
BEST: STILL RUNNING??
REPORT FORMAT
BEST: INDIVIDUAL FEATURES
UNKNOWN
Embedding Visualization
Results
CONCLUSION
1. RAW IS THE GAWD
2. FEATURES CAN BE GIVEN A SHOT
3. COMBINATION OF CNN + LSTM + ATTENTION DOES BETTER THAN JUST CNN
4. THIS IS A NOVEL APPROACH!
Discussion
● Detect a previously unseen family
○ VirusTotal experiment
○ Correlations between malware families
● Adversarial learning
● Can you obfuscate runtime behavior?
Future Work
➔ Try adversarial attacks on our model
➔ Current training relies on accurate and broad data
◆ resiliency to data
➔ Compare with image classification
➔ More models to try:
◆ Cleaning of reports, document classification on the entire report.
ENSEMBLE MODEL
➔ Integrated Stacking Ensemble Model on all features.
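An integrated stacking ensemble can be sketched as: each per-feature model emits a score, and the stacked scores feed a small meta-layer. This numpy sketch uses random stand-in models; in the real system the base models would be the trained per-feature networks and the meta-layer would be learned:

```python
import numpy as np

rng = np.random.default_rng(3)

def base_model(weights):
    """Stand-in per-feature classifier emitting a malicious-probability score."""
    return lambda x: 1.0 / (1.0 + np.exp(-(x @ weights)))

models = [base_model(rng.normal(size=4)) for _ in range(3)]   # one per feature type
views = [rng.normal(size=(10, 4)) for _ in range(3)]          # one feature view each

# Stacking: base-model predictions become the input of a meta-classifier.
stacked = np.column_stack([m(v) for m, v in zip(models, views)])  # 10 samples x 3 scores
meta_w = np.array([0.5, 0.3, 0.2])   # meta-layer weights (learned in practice)
final = stacked @ meta_w             # ensemble score per sample
assert stacked.shape == (10, 3) and np.all((final > 0) & (final < 1))
```

The meta-layer can learn which feature type to trust most, rather than weighting all models equally as plain averaging would.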
Refs
● L. Nataraj, S. Karthikeyan, G. Jacob, B. Manjunath, "Malware Images: Visualization and Automatic Classification," Proceedings of the 8th International Symposium on Visualization for Cyber Security, p. 4, 2011. (image ML)
● M. Kalash, M. Rochan, N. Mohammed, N. D. B. Bruce, Y. Wang and F. Iqbal, "Malware Classification with Deep Convolutional Neural Networks," 2018 9th IFIP International Conference on New Technologies, Mobility and Security (NTMS), Paris, 2018, pp. 1-5. (image NN)
● P. Burnap, R. French, F. Turner, K. Jones, "Malware Classification Using Self Organising Feature Maps and Machine Activity Data," Computers & Security 73 (2018), pp. 399-410. (quote)