NEURLUX: Dynamic Malware Analysis using Neural Networks
CHANI JINDAL
STATIC vs DYNAMIC ANALYSIS
● Static Analysis
● Dynamic Analysis
● Why Sandbox? “malware cannot avoid leaving a behavioural footprint” (2018)
Static Analysis vs Dynamic Analysis

Static Analysis
● Pros
○ Comprehensive signatures can be created
○ Small error rate with byte-sequence matches
● Cons
○ Difficult to thoroughly understand behavior, especially if obfuscated or packed
○ Network requests are not observed
○ Time to prevention is ultimately long (lifecycle)
○ Cost (human / machine)

Dynamic Analysis
● Pros
○ All behavior is observed
○ Thorough understanding of the sample
● Cons
○ Needs an execution environment and resources
○ Different execution environment may change behavior
HISTORY and RELATED WORK
● Machine Learning (ML)
○ raw bytecode as a greyscale image of the executable -> 2D array (2011)
○ feature extraction and feature engineering necessary
○ shallow learning techniques, not scalable
● Neural Networks and Image Classification (2018)
○ convolutional network
○ improvement on basic ML models
○ adversarial model
● A multi-level deep learning system for malware detection (2019)
○ deep learning architecture, still focused on static PEs
● Dynamic-behavior-based NN (2018)
○ focus on machine activity: read/write file counts
LIMITATIONS OF PREVIOUS WORK
● Image classification scrutinized
○ adversarial attacks
● Focus only on a single dynamic feature, such as API sequence
○ lack of feature coverage
● Are feature extraction and feature engineering necessary?
● Is there a precedent for dynamic deep learning?
DATASETS
● We have two datasets:

                                  MALICIOUS   BENIGN
PRIVATE DATASET (real world)      13,760      13,760
EMBER DATASET                     21,000      21,000

● Generation of behavioral reports from these datasets
DATA COLLECTION - SANDBOXES
● Orchestrate Cuckoo -- the open-source sandbox
○ 20+ headless VMs, all running Windows with full network access
● The orchestration effort is a potential answer to why this is an unexplored space
REPORT FORMAT
● Term Frequency - Inverse Document Frequency (TF-IDF)
● Feature Hashing (“the hashing trick”)
NATURAL LANGUAGE PROCESSING
● Bag of Words (BOW)
○ loss of spatial locality
● N-grams
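These text representations can be sketched in a few lines of plain Python. This is a minimal sketch on toy API-call tokens; the token names and vector size are illustrative, not taken from the NEURLUX pipeline:

```python
import math
import zlib
from collections import Counter

def bag_of_words(tokens):
    """Count token occurrences; word order (spatial locality) is lost."""
    return Counter(tokens)

def ngrams(tokens, n):
    """Contiguous n-token windows recover some local ordering."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

def tf_idf(docs):
    """Weight a term by its frequency in a doc, discounted by how many docs contain it."""
    df = Counter(t for d in docs for t in set(d))
    return [{t: c / len(d) * math.log(len(docs) / df[t])
             for t, c in Counter(d).items()} for d in docs]

def hashed_features(tokens, dim=8):
    """The 'hashing trick': bucket tokens into a fixed-size vector, no vocabulary needed."""
    vec = [0] * dim
    for t in tokens:
        vec[zlib.crc32(t.encode()) % dim] += 1
    return vec

calls = ["CreateFile", "WriteFile", "CreateFile", "RegSetValue"]
print(bag_of_words(calls)["CreateFile"])   # 2
print(ngrams(calls, 2)[0])                 # ('CreateFile', 'WriteFile')
print(sum(hashed_features(calls)))         # 4: one bucket increment per token
```

A term that appears in every report gets a TF-IDF weight of zero, which is exactly the discounting the slide refers to.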
BASELINE = STATE OF THE ART
Word Embeddings
Words that have the same meaning should have a similar representation.
● A way to learn to map a set of words or phrases in a vocabulary to vectors of numerical values.
● Computations with one-hot encoded vectors are inefficient because most values in a one-hot vector are 0 (sparse).
● Dense representation of words and their relative meanings.
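The lookup itself can be sketched in numpy. The vocabulary, dimensions, and random weights below are illustrative; in a real model the embedding matrix is learned during training:

```python
import numpy as np

vocab = {"PeerDist": 0, "CacheMgr": 1, "Defender": 2}
rng = np.random.default_rng(0)

# Embedding matrix: vocabulary_size x embedding_dimension (learned in training).
embedding = rng.normal(size=(len(vocab), 3))

def embed(word):
    """A dense row of the embedding matrix replaces a sparse one-hot vector."""
    return embedding[vocab[word]]

# Multiplying a one-hot vector by the matrix selects the same row,
# but the direct lookup skips all the wasted zero multiplications.
one_hot = np.eye(len(vocab))[vocab["PeerDist"]]
assert np.allclose(one_hot @ embedding, embed("PeerDist"))
```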
Word Embeddings
[Figure: example embedding vectors, e.g. PeerDist -> (0.7, 0.4, 0.5), CacheMgr -> (0.2, -0.1, 0.1). Words with similar context tend to have collinear vectors. Embedding matrix size = vocabulary_size × embedding_dimension.]
Convolutional Neural Networks
● S = length of input
● D = dimension of word vector
● Sentence matrix = S × D
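One convolutional filter sliding over the S × D sentence matrix can be sketched as follows (numpy, random toy values; a real model learns many filters of several widths):

```python
import numpy as np

S, D, k = 6, 4, 3                     # input length, word-vector dim, filter width
rng = np.random.default_rng(1)
sentence = rng.normal(size=(S, D))    # the S x D sentence matrix
filt = rng.normal(size=(k, D))        # a filter spans k words and the full width D

# Slide the filter down the word axis; each window yields one activation.
feature_map = np.array([np.sum(sentence[i:i + k] * filt)
                        for i in range(S - k + 1)])
pooled = feature_map.max()            # max-over-time pooling -> one scalar per filter
assert feature_map.shape == (S - k + 1,)
```

Because the filter spans the full embedding dimension, convolution happens only along the word axis, unlike 2-D image convolutions.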
HOW DO THEY FIND SIMILARITIES?
[Figure: embedding vectors for tokens CurrVersion, PeerDist, CacheMgr, Defender, spynet, windows; a convolutional filter slides over the word embeddings, and related tokens score high similarities (e.g. 0.9, 0.84).]
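Similarity scores like those in the figure are typically cosine similarities between embedding vectors; a quick numpy check, reusing the illustrative vectors from the earlier slide:

```python
import numpy as np

def cosine(u, v):
    """Cosine similarity: 1.0 for collinear vectors, 0.0 for orthogonal ones."""
    return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v)))

peer_dist = np.array([0.7, 0.4, 0.5])
cache_mgr = np.array([0.2, -0.1, 0.1])

# Scaling a vector does not change its direction, so similarity stays at 1.
assert abs(cosine(peer_dist, 2 * peer_dist) - 1.0) < 1e-9
print(round(cosine(peer_dist, cache_mgr), 2))
```

This is why "words with similar context tend to have collinear vectors" translates directly into high similarity scores.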
LSTM
● Networks with loops in them, which allow information to persist.
● Cell states are like conveyor belts.
● The LSTM cell has the ability to remove or add information to the cell state, carefully regulated by structures called gates.
● Gates are a way to optionally let information through.
● Input from the BiLSTM: the meaning behind sequences, with a vector corresponding to each word.
● Word vector -> sentence vector
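A single LSTM step can be sketched in numpy to make the gates concrete (random weights and illustrative sizes; real layers are trained, and here the model runs them bidirectionally):

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def lstm_step(x, h, c, W, b):
    """One time step: gates decide what to forget, what to add, what to emit."""
    z = W @ np.concatenate([x, h]) + b
    f, i, o, g = np.split(z, 4)
    f, i, o = sigmoid(f), sigmoid(i), sigmoid(o)   # forget/input/output gates in (0, 1)
    c = f * c + i * np.tanh(g)                     # cell state: the "conveyor belt"
    h = o * np.tanh(c)                             # gated output = new hidden state
    return h, c

d_in, d_h = 3, 2
rng = np.random.default_rng(2)
W, b = rng.normal(size=(4 * d_h, d_in + d_h)), np.zeros(4 * d_h)
h = c = np.zeros(d_h)
for x in rng.normal(size=(5, d_in)):   # feed a 5-step input sequence
    h, c = lstm_step(x, h, c, W, b)
assert h.shape == (d_h,) and np.all(np.abs(h) < 1.0)
```

Note how the cell state `c` is only ever scaled (forget gate) or added to (input gate), which is what lets information persist across many steps.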
PROPOSED METHODS
FEATURE COUNTS
● Common to both report formats -> formatted in terms of parent-child processes
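One way to sketch that shared parent-child structure (the field names here are illustrative, not the exact Cuckoo report schema):

```python
from collections import defaultdict

# Toy process records as they might appear in a behavioral report.
procs = [
    {"pid": 100, "ppid": 1,   "name": "explorer.exe"},
    {"pid": 200, "ppid": 100, "name": "sample.exe"},
    {"pid": 300, "ppid": 200, "name": "cmd.exe"},
    {"pid": 301, "ppid": 200, "name": "powershell.exe"},
]

# Group flat records into a parent -> children mapping.
children = defaultdict(list)
for p in procs:
    children[p["ppid"]].append(p["name"])

print(children[200])   # ['cmd.exe', 'powershell.exe']
```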
FEATURE BASED TEXT CLASSIFICATION
RAW - CNN
WHAT DOES THIS MEAN?
● We don’t encode any information about the file format into the model
● This can be adapted to new file formats
● Time-to-solution is reduced: no feature engineering!
PROBLEMS?
● Reports are large
● Variable length
● Lots of information
Evaluation
DATASET
BEST: INDIVIDUAL FEATURES + RAW DATA
MALWARE FAMILY
BEST: RAW_DATA
DATASET + REPORT
BEST: STILL RUNNING??
REPORT FORMAT
BEST: INDIVIDUAL FEATURES
UNKNOWN
Embedding Visualization
Results
CONCLUSION
1. RAW IS THE GAWD
2. FEATURES CAN BE GIVEN A SHOT
3. COMBINATION OF CNN + LSTM + ATTENTION DOES BETTER THAN JUST CNN
4. THIS IS A NOVEL APPROACH!
Discussion
● Detect a previously unseen family
○ VirusTotal experiment
○ Correlations between malware families
● Adversarial learning
● Can you obfuscate runtime behavior?
Future Work
➔ Try adversarial attacks on our model
➔ Current training relies on accurate and broad data
◆ resiliency to data
➔ Compare with image classification
➔ More models to try:
◆ Cleaning of reports, document classification on the entire report.
ENSEMBLE MODEL
➔ Integrated Stacking Ensemble Model on all features.
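An integrated stacking ensemble can be sketched as: each per-feature model emits a score, and the stacked scores feed a small meta-layer. This numpy sketch uses random stand-in models; in the real system the base models would be the trained per-feature networks and the meta-layer would be learned:

```python
import numpy as np

rng = np.random.default_rng(3)

def base_model(weights):
    """Stand-in per-feature classifier emitting a malicious-probability score."""
    return lambda x: 1.0 / (1.0 + np.exp(-(x @ weights)))

models = [base_model(rng.normal(size=4)) for _ in range(3)]   # one per feature type
views = [rng.normal(size=(10, 4)) for _ in range(3)]          # one feature view each

# Stacking: base-model predictions become the input of a meta-classifier.
stacked = np.column_stack([m(v) for m, v in zip(models, views)])  # 10 samples x 3 scores
meta_w = np.array([0.5, 0.3, 0.2])   # meta-layer weights (learned in practice)
final = stacked @ meta_w             # ensemble score per sample
assert stacked.shape == (10, 3) and np.all((final > 0) & (final < 1))
```

The meta-layer can learn which feature type to trust most, rather than weighting all models equally as plain averaging would.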
Refs
● L. Nataraj, S. Karthikeyan, G. Jacob, B. Manjunath, "Malware Images: Visualization and Automatic Classification," Proceedings of the 8th International Symposium on Visualization for Cyber Security, p. 4, 2011. (image ML)
● M. Kalash, M. Rochan, N. Mohammed, N. D. B. Bruce, Y. Wang and F. Iqbal, "Malware Classification with Deep Convolutional Neural Networks," 2018 9th IFIP International Conference on New Technologies, Mobility and Security (NTMS), Paris, 2018, pp. 1-5. (image NN)
● P. Burnap, R. French, F. Turner, K. Jones, "Malware Classification Using Self Organising Feature Maps and Machine Activity Data," Computers & Security 73 (2018), pp. 399-410. (quote)