Copyright 2011 Trend Micro Inc.
MALWARE CLASSIFICATION
Amber Zhang, Bharath Chandrasekhar
July, 2016
Solution summary
• Model selection: Xgboost
  – Performance evaluation:
    • ROC score (Trend testing set): 0.9975
    • Accuracy (5-fold cross validation): 0.987
• Attribute selection: 504 attributes
• Training time and resource usage
  – 2 hours, 4 GB of memory to train the model
  – Predicting 30,000 testing files in ~10 minutes
  – 727 GB of raw data and generated features on disk
Problem statement
• Binary classification
  – Training data comes in [data, label] pairs
  – A label is either “malicious” or “not malicious”
• 50,000 training samples were given
• 30,000 testing samples were given
• Goal: develop a robust and re-trainable model to predict whether a PE program is malicious or not
  – Continue training the existing model to further boost model performance
  – Prediction time cost: a light-weight model is preferred at the same model performance
  – More features does not mean a better model
Data Format
• Portable executable files are broken into:
  – Assembly source code (.asm file)
  – Imports file: a list of all function names imported by the program
  – Sections file: information on each program's assembly code sections
  – Info file: general information about the program
  – Strings file: a list of all literal strings in the program
Project Flow and Resource Usage
Feed nGram opcode features to construct the initial model (randomly sampled 10,000 data points)
→ Extract and add new features → Re-train model → Evaluate model (cross validation)
→ Dimension reduction: select top features (iterating on the full dataset)
→ Compare, train and tune the model on the full dataset

Resource usage:
• ~12 hours to scan through all training data
• ~2 hours, ~7 GB of RAM to run random forest
• ~4 hours, ~4 GB of RAM to run random forest
• ~40 GB+ of extracted files on disk; ~1 hour to train the model
Initial Model
290 Features: nGram opcode counts
1. Scan through the .asm assembly source code file and extract all 1-4 gram opcodes
2. Select frequent opcode patterns
  – For one-gram opcodes: if inside a loop, count 10 times more
  – A pattern is frequent if it appears in the file more than 100 times
3. Run random forest to select the opcode features with top importance (based on reduction in info gain):
  – Top 200 one-gram, top 30 two-gram, top 30 three-gram, top 30 four-gram
  – In total, 290 nGram opcode count features were selected
Loop patterns in assembly source code:
• One-gram opcodes: mov, mul, shr, lea…
• Two-gram opcodes: mov_mul, mul_shr, shr_lea…
• Three-, four-, five-gram patterns analogously

Anti-debug pattern:
    mov eax, fs:[30h]
    mov eax, byte [eax+2]
    test eax, eax
    jne @DebuggerDetected
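A minimal sketch of the n-gram counting in step 1, assuming the opcode stream has already been tokenized out of the .asm file (the 10x loop weighting and the 100-occurrence frequency filter are left out):

```python
from collections import Counter

def opcode_ngram_counts(opcodes, n_max=4):
    """Count all 1- to n_max-gram opcode patterns in a tokenized opcode sequence."""
    counts = Counter()
    for n in range(1, n_max + 1):
        for i in range(len(opcodes) - n + 1):
            counts["_".join(opcodes[i:i + n])] += 1
    return counts

# Hypothetical opcode stream tokenized from an .asm file
ops = ["mov", "mul", "shr", "lea", "mov", "mul"]
counts = opcode_ngram_counts(ops)
```

The resulting sparse counts (e.g. counts["mov_mul"]) become the candidate features that random forest importance then prunes.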
Feature importance: loss in info gain
Adding more features
8 Features: memory declaration count
• Count assembler memory-declaration keywords: DD, DW, DB…
• e.g. 7 DD declarations → 7*4 bytes
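A sketch of such a keyword count, assuming MASM-style .asm text; the keyword set below is illustrative rather than the author's exact list of 8:

```python
import re

# Illustrative keyword set; the slide's full list of 8 keywords is not spelled out
DECL_KEYWORDS = ("db", "dw", "dd", "dq")

def decl_counts(asm_text):
    """Count assembler data-declaration keywords (DB, DW, DD, ...) in .asm text."""
    counts = {k: 0 for k in DECL_KEYWORDS}
    for tok in re.findall(r"\b(?:db|dw|dd|dq)\b", asm_text, flags=re.IGNORECASE):
        counts[tok.lower()] += 1
    return counts

counts = decl_counts("buf db 4 dup(0)\nval dd 7\nptr dd 0")
```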
127 Features: section hex dump single byte count
• Count occurrences of each single byte value in the section hex dump files
  – e.g. 128 × cc, 200 × ff, 568 × 6a
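A sketch of the byte counting, assuming each hex dump line starts with an address token followed by two-character byte values ("??" placeholders for unresolved bytes are skipped):

```python
from collections import Counter

def single_byte_counts(dump_lines):
    """Count byte values in a section hex dump, skipping the address column
    and any '??' placeholders."""
    counts = Counter()
    for line in dump_lines:
        tokens = line.split()
        for tok in tokens[1:]:          # tokens[0] is the address
            if len(tok) == 2 and tok != "??":
                counts[tok.lower()] += 1
    return counts

counts = single_byte_counts(["00401000 CC FF 6A CC", "00401010 6A 6A ?? 00"])
```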
58 other features: info, DLL imports, sections

10 Info fields:
• File Entropy
• Size of Stack Reserve
• File Size
• Size of Image
• Loop Count
• Size of Code
• Size of Initialized Data
• Number of Sections
• Size of Header
• String Count

5 Derived fields (newly created):
• Loop Count to File Size
• Code Size to File Size
• String Count to File Size
• Loop Count to Code Size
• String Count to Code Size

26 DLL import counts:
• KERNEL32.dll, USER32.dll, ADVAPI32.dll, GDI32.dll, MSVBVM60.DLL, SHELL32.dll, ntdll.dll, ole32.dll, COMCTL32.dll, OLEAUT32.dll, …

17 Section fields:
• .text_entropy, .rsrc_entropy, .rsrc_vSize, .data_entropy, .text_rSize, .text_vSize, .rsrc_rSize, .data_vSize, .data_rSize, .rdata_entropy, …
21 Features: .asm image
Figure: .asm files rendered as images (Normal vs. Malicious vs. Malicious); pixel density changes across the file
• Chunk each file into 10 parts (10% of the file each) to abstract the density changes
21 Features: .asm image
• Read the .asm assembly source code in binary mode
• Extract:
  – The first 800 bytes
  – Average pixel density in each region
  – Standard deviation of the pixel density in each region
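The extraction above can be sketched as follows. The slide counts 21 features in total; this sketch returns 20 (10 region means plus 10 standard deviations), since the exact 21-feature split is not spelled out here:

```python
import numpy as np

def asm_image_features(raw, n_bytes=800, n_chunks=10):
    """Treat the first n_bytes of an .asm file (read in binary mode) as pixel
    values, split them into n_chunks regions, and return the per-region mean
    and standard deviation of pixel density."""
    pixels = np.frombuffer(raw[:n_bytes], dtype=np.uint8).astype(float)
    regions = np.array_split(pixels, n_chunks)
    means = [r.mean() for r in regions]
    stds = [r.std() for r in regions]
    return means + stds

# Stand-in for open("sample.asm", "rb").read()
features = asm_image_features(bytes(range(256)) * 4)
```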
Accuracy improved as extracting new features
Randomly sampled 10,000 data → extract new features → select attributes → 10-fold cross validation → keep predictive attributes
Intuition: the more the better
• High training error would mean underfitting
  – Training error reaches 0.0000 using only 1,000 training samples, so the model does not underfit
• High testing error means overfitting
  – Avoiding overfitting became the main challenge
=> Reduce the dimension of the feature set as long as the model does not underfit the data
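Importance-based dimension reduction of the kind used here can be sketched with a random forest on synthetic stand-in data (the feature matrix and the cutoff k are illustrative):

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Synthetic stand-in for the malware feature matrix
X, y = make_classification(n_samples=500, n_features=50, n_informative=10,
                           random_state=0)

rf = RandomForestClassifier(n_estimators=100, random_state=0)
rf.fit(X, y)

# Keep only the k most important columns (dimension reduction)
k = 20
top_idx = np.argsort(rf.feature_importances_)[::-1][:k]
X_reduced = X[:, top_idx]
```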
Selecting and tuning the model
Some models have inherent biases
Xgboost: a gradient boosting algorithm
• Objective function: minimize L + Ω
• Use gradient descent to find the optimum
• Parameters (30+ parameters); values used:
  – max_depth = 3
  – min_child_weight = 4
  – n_estimators = 3000
  – learning_rate = 0.1
  – subsample = 0.8
  – gamma = 0
  – colsample_bytree = 0.8
• Resample a subset of the data and of the attribute columns each time a new tree is built
• Reasons for choosing Xgboost: parallel processing, regularization to avoid overfitting, tree pruning, handles missing values
Xgboost: tuning intuition vs computation complexity
Objective function: minimize L + Ω
• L: controls predictive power (the training loss)
• Ω: controls simplicity of the model
• For each tree, Ω = γ·T + (1/2)·λ·Σ_j w_j², where T is the number of tree leaves and w_j is the score on the j-th leaf
Xgboost: tuning intuition vs computation complexity
Objective function: minimize L + Ω
• L: controls predictive power
• Ω: controls simplicity of the model (via gamma and lambda)
• n_estimators: number of trees
• max_depth: maximum depth of a tree
• min_child_weight: minimum score required in a child in order to split the node
• subsample: randomly sample a subset of the data
• colsample_bytree: randomly sample a subset of the features
ROC score: Trend 30,000 testing set
A note on evaluation metric: ROC
• What if the training data contains 98% non-malicious files? Plain accuracy would look good even for a useless model.
• ROC: you can choose the threshold that catches almost all true positives
  – AUC lies in [0, 1]; 0.5 is a random guess
  – Evaluates the model at all thresholds (the accuracy metric is tied to one specific probability cut)
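The threshold-free nature of ROC AUC can be illustrated on a small imbalanced toy set (the scores below are made up):

```python
import numpy as np
from sklearn.metrics import roc_auc_score

# 8 benign (0) vs 2 malicious (1) samples, mimicking class imbalance.
# AUC measures ranking quality over all thresholds.
y_true  = np.array([0, 0, 0, 0, 0, 0, 0, 0, 1, 1])
y_score = np.array([0.10, 0.20, 0.05, 0.50, 0.10, 0.20, 0.15, 0.25, 0.90, 0.40])

# 15 of the 16 positive/negative pairs are ranked correctly
auc = roc_auc_score(y_true, y_score)
```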
Potential further analysis: meta learner
Architecture: sub-learners (Logistic Regression, Neural Nets, SVM) feed a meta learner (voting, geometric mean, weighted average, etc.)

• Each sub-learner outputs probabilities for the binary classification
  – Different models
  – Different subsets of features
• The meta learner aggregates the results
  – Voting
  – Weighted average
  – Geometric mean, etc.
• Reduces bias!
  – Strong-feature bias
  – Model bias
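The aggregation step can be sketched as follows; the sub-learner probabilities and weights are made-up stand-ins:

```python
import numpy as np

# Hypothetical per-sample probabilities from three sub-learners
# (e.g. logistic regression, a neural net, an SVM)
p_lr  = np.array([0.9, 0.2, 0.6])
p_nn  = np.array([0.8, 0.1, 0.7])
p_svm = np.array([0.7, 0.3, 0.5])
stacked = np.vstack([p_lr, p_nn, p_svm])

# The three simple meta-learner aggregations named on the slide
vote      = (stacked > 0.5).sum(axis=0) >= 2                      # majority vote
weighted  = np.average(stacked, axis=0, weights=[0.5, 0.3, 0.2])  # weighted average
geometric = stacked.prod(axis=0) ** (1.0 / 3.0)                   # geometric mean
```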
Solution summary
• Model selection: Xgboost
  – Performance evaluation:
    • ROC score (Trend testing set): 0.9975
    • Accuracy (5-fold cross validation): 0.987
  – Parameters:
    • max_depth = 3, min_child_weight = 4, n_estimators = 3000, learning_rate = 0.1, subsample = 0.8, gamma = 0, colsample_bytree = 0.8
• Attribute selection: 504 attributes
  – Top 290 nGram opcode counts, top 127 single byte counts, top 8 memory declaration keyword counts, top 15 info fields, top 26 DLL import function counts, top 17 section info, top 11 asm image density, top 10 asm image statistics
• Training time and resource usage
  – 2 hours, 4 GB of memory to train the model
  – Predicting 30,000 testing files in ~10 minutes
  – 727 GB of raw data and generated features on disk