Justin Sahs and Prof. Latifur Khan
A MACHINE LEARNING APPROACH TO ANDROID MALWARE DETECTION
The Problem
Smartphones represent a significant and growing proportion of computing devices
Android in particular is the fastest growing smartphone platform, and has 52.5% market share*
The power of the Android platform allows for applications providing a variety of services, including sensitive services like banking
This power can also be leveraged by malware
*http://www.gartner.com/it/page.jsp?id=1848514
The Problem (cont.)
Tremendous growth in the Android Market, from 2,300 applications in March 2009 to 400,000 applications by January 2012, has also attracted significant growth in malware for Android.
Trend Micro, a global leader in antivirus, has predicted that Android malware will grow to 129,000 samples by December 2012.
Anyone and everyone can develop Android applications and host them on the Android Market. Online markets do not have a process to check Android applications for malware.
On February 2, 2012, Google added a new security feature to its Android Market to fight malware: it scans every new submission and current apps for anomalous behavior.* This new system does not apply to alternative markets.
[Figure: various Android markets]
*Google Android Bouncer
The Problem (cont.)
Smartphones are becoming increasingly ubiquitous. A report from Gartner shows that over 100 million smartphones were sold in the first quarter of 2011, an increase of 85% over the first quarter of 2010*.
Malware often disguises itself as a normal application
Malware can cause financial loss and theft of private information
Users need robust malware detection software
*http://www.gartner.com/it/page.jsp?id=1689814
Static Analysis: Data Mining Approach
Malware Application Detection
[Diagram: Training Apps → Training → One-Class SVM Model; Testing Apps → Testing → Prediction]
Feature Extraction
We use an open source library called Androguard to extract features from applications:
• Permissions
• Control Flow Graphs (one for every method in the application)
We use these extracted features to train a machine learning-based classifier
Feature set is not homogeneous (bit-vector, string and graph representations)
Feature Extraction: Acquiring Applications
APK files are Android package files. Applications that are packaged as APKs can be installed on any compatible Android device.
Benign APK files were harvested from the official Android Market using the android-market-api,* in addition to a collection of known malware
We used 2,172 APK files in our analysis.
*http://code.google.com/p/android-market-api/
Background: Structure of an Android Application
• AndroidManifest.xml: contains permissions and other metadata
• META-INF/: contains application signing information
• assets/: contains any auxiliary files; the Android framework does not generate IDs for assets, which are accessed through the AssetManager API
• classes.dex: the compiled program code
• res/: contains auxiliary files (resources) with IDs generated by the Android framework
• resources.arsc: contains compiled XML files and resources
Classification
Classification: Permissions
• Built-in permissions: access to hardware and certain parts of the Android API. Based on a list of 121 standard built-in permissions, we construct a 121-bit vector, with a 1 for each requested permission, and a 0 otherwise.
• Non-standard permissions: mainly access to other applications’ APIs. We split the strings into three sections: a prefix (usually “com” or “org”), a section of organization and product identifiers, and the permission name, ignoring instances of the strings “android” and “permission,” which are ubiquitous.
Classification: Permissions (example)
Requested Permissions:

Built-in:
android.permission.WRITE_EXTERNAL_STORAGE
android.permission.CALL_PHONE
android.permission.EXPAND_STATUS_BAR
android.permission.GET_TASKS
android.permission.READ_CONTACTS
android.permission.SET_WALLPAPER
android.permission.SET_WALLPAPER_HINTS
android.permission.VIBRATE
android.permission.WRITE_SETTINGS
android.permission.READ_PHONE_STATE
android.permission.ACCESS_NETWORK_STATE
android.permission.WRITE_APN_SETTINGS
android.permission.RECEIVE_SMS
android.permission.RECEIVE_MMS
android.permission.RECEIVE_WAP_PUSH
android.permission.INTERNET
android.permission.SEND_SMS
android.permission.READ_SMS
android.permission.WRITE_SMS

Non-standard:
com.android.launcher.permission.INSTALL_SHORTCUT
com.android.launcher.permission.UNINSTALL_SHORTCUT
com.android.launcher.permission.READ_SETTINGS
com.android.launcher.permission.WRITE_SETTINGS
android.permission.GLOBAL_SEARCH_CONTROL

Represented as a bit vector:
00000100 00000000 00000000 00100000 00000000 00010000 01000000 10000000 00000100 00101000 01110001 00000000 00011000 00000101 00100001 1

And three sets of strings:
[“com”],[“launcher”],[“CONTROL”, “GLOBAL”, “INSTALL”, “READ”, “SEARCH”, “SETTINGS”, “SHORTCUT”]
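The permission features above can be sketched in Python. `BUILT_IN` is a hypothetical stand-in for the real ordered list of 121 standard permissions, and the exact string-splitting heuristics are our assumptions:

```python
# Minimal sketch of the two permission features. BUILT_IN stands in for
# the real ordered list of 121 standard permissions (abbreviated here).
BUILT_IN = ["android.permission.INTERNET",
            "android.permission.SEND_SMS",
            "android.permission.CAMERA"]

def permission_features(requested):
    # 1 for each requested built-in permission, 0 otherwise
    bits = [1 if p in requested else 0 for p in BUILT_IN]
    prefixes, orgs, names = set(), set(), set()
    for p in requested:
        if p in BUILT_IN:
            continue
        # drop the ubiquitous "android" and "permission" components
        parts = [s for s in p.split(".") if s not in ("android", "permission")]
        prefixes.add(parts[0])              # usually "com" or "org"
        orgs.update(parts[1:-1])            # organization/product identifiers
        names.update(parts[-1].split("_"))  # words of the permission name
    return bits, (prefixes, orgs, names)

req = {"android.permission.INTERNET",
       "com.android.launcher.permission.INSTALL_SHORTCUT"}
bits, strings = permission_features(req)
print(bits)     # [1, 0, 0]
print(strings)  # prefixes {'com'}, orgs {'launcher'}, names {'INSTALL', 'SHORTCUT'}
```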
Classification: Control Flow Graphs (CFGs)
Constructed from the compiled bytecode of the application
Each method can be represented as a graph:
• Nodes represent contiguous sequences of non-jump instructions
• Edges represent jumps (goto, if, loops, etc.)
CFGs encode the behavior of the methods they represent, and are therefore a potential source of discriminating information
The actual bytecode is often obfuscated, either by the compiler for optimization or deliberately to prevent reverse engineering or detection
We perform reduction on the extracted CFGs to counteract obfuscation
Classification: CFG Reduction
We reduce graphs according to three rules:
1) Contiguous instruction blocks are merged
2) Unconditional jumps are merged with their target
3) Contiguous conditional jumps that share a destination are merged.
[Diagrams illustrating rules (1), (2) and (3)]
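Rule (1) can be sketched as a fixpoint of pairwise merges. The dict-of-sets graph is a hypothetical stand-in for the real Androguard CFG structures:

```python
def reduce_linear_chains(succ):
    """Rule (1) sketch: merge node u into its unique successor v whenever
    v has u as its only predecessor (the two blocks are contiguous).
    `succ` maps each node id to the set of its successors."""
    succ = {n: set(vs) for n, vs in succ.items()}

    def preds(v):
        return [u for u, outs in succ.items() if v in outs]

    changed = True
    while changed:
        changed = False
        for u in list(succ):
            outs = succ[u]
            if len(outs) == 1:
                (v,) = outs
                if v != u and preds(v) == [u]:
                    succ[u] = succ.pop(v)  # u absorbs v's successors
                    changed = True
                    break
    return succ

# A straight-line chain 0 -> 1 -> 2 collapses to a single block:
print(reduce_linear_chains({0: {1}, 1: {2}, 2: set()}))  # {0: set()}
```

A diamond (two conditional branches rejoining) is left alone by this rule, since the join node has two predecessors.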
Classification: Training and Testing
Once we have extracted our four feature representations, we use them to train a One-Class Support Vector Machine (1C-SVM)
The 1C-SVM is designed to detect test examples that are significantly different from the training data. We have far more examples of benign applications than malware to train on.
One Class SVM (1C-SVM)
A Support Vector Machine (SVM) finds the maximum-margin separating hyperplane between the positive and negative training examples in some feature space, i.e. it maximizes the distance between the hyperplane and the closest examples from each class.
The SVM uses comparison functions called kernels to map each extracted feature into a high-dimensional feature space. The linear separation of the data in the feature space may correspond to a very non-linear separation of the original data. Each kernel takes two feature representations as input and outputs a number that measures similarity.
Features and Kernels
[Diagram: feature representations mapped to their kernels, e.g. String Kernel, Graph Kernel]
• The Set Kernel applies some other kernel to each pair of elements from the two input sets, e.g. the String Kernel if the elements are strings
Classification: Training and Testing (cont.)
We use a data mining library, scikit-learn (http://scikit-learn.org/), which implements a convenient wrapper around the popular LIBSVM (http://www.csie.ntu.edu.tw/~cjlin/libsvm/)
The use of an SVM requires specialized functions called kernels that are used to compare features between applications. We implement these kernels ourselves.
Kernels
We have three feature representations (bit-vectors, strings, and graphs)
We have three kernels, each of which takes two feature representations as input, and outputs a measure of similarity
1) A bit-vector kernel that counts the number of equivalent bits. Example: let our two bit-vectors be
<0 0 1 1 1 0 1>
<1 0 1 0 1 0 1>
Then we have 5 matching bits, so a kernel value of 5.
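This kernel is a one-liner; a minimal sketch:

```python
def bitvector_kernel(a, b):
    """Count the positions where the two bit-vectors agree."""
    return sum(x == y for x, y in zip(a, b))

print(bitvector_kernel([0, 0, 1, 1, 1, 0, 1],
                       [1, 0, 1, 0, 1, 0, 1]))  # 5
```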
Kernels (cont.)
2) A kernel over strings that counts the number of common subsequences between two strings, weighted by length. Length is measured by the distance between the first and last elements in both strings. For example, the strings “abc” and “bxc” have as common subsequences “b”, “c”, and “bc”, which have lengths 1, 1 and 2+3=5, respectively.
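A brute-force sketch for short strings follows. The slide does not fully specify how the length weighting enters the kernel value, so we assume a standard Lodhi-style decay, lam raised to the combined span:

```python
from itertools import combinations

def subsequence_spans(s):
    """Map each non-empty subsequence of s to the spans
    (last index - first index + 1) of all its occurrences."""
    spans = {}
    for r in range(1, len(s) + 1):
        for idx in combinations(range(len(s)), r):
            sub = "".join(s[i] for i in idx)
            spans.setdefault(sub, []).append(idx[-1] - idx[0] + 1)
    return spans

def string_kernel(s, t, lam=0.5):
    """Sketch of a gap-weighted common-subsequence kernel: for every
    common subsequence, sum lam**(span in s + span in t) over pairs of
    occurrences. (The exact weighting used in the talk is an assumption.)"""
    a, b = subsequence_spans(s), subsequence_spans(t)
    return sum(lam ** (ls + lt)
               for sub in a.keys() & b.keys()
               for ls in a[sub] for lt in b[sub])

# "abc" and "bxc" share "b" (spans 1,1), "c" (spans 1,1) and "bc" (spans 2,3):
print(string_kernel("abc", "bxc"))  # 0.25 + 0.25 + 0.5**5 = 0.53125
```

This enumeration is exponential in string length; it is for illustration only, whereas practical subsequence kernels use dynamic programming.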
Kernels (cont.)
3) A graph kernel that iteratively relabels the nodes of each graph based on that node’s children’s labels
Original labels are based on the instructions present in each node
The generated labels are counted to generate a vector
The kernel returns the dot product of these two vectors
Example: Graph A has nodes labeled 2, 1 and 4 (node 2 points to nodes 1 and 4, and node 1 points to node 4); Graph B is the same, with one additional node labeled 4.
[Diagram: the two original graphs]
Kernels (cont.)
[Diagram: iterative relabeling of Graphs A and B]
Original labels: A = {2, 1, 4}; B = {2, 1, 4, 4}
Iteration 1: A = {214, 14, 4}; B = {214, 14, 4, 4}
Iteration 2: A = {214144, 144, 4}; B = {214144, 144, 4, 4}
Iteration 3: A = {2141441444, 1444, 4}; B = {2141441444, 1444, 4, 4}
Kernels (cont.)
Labels and count vectors:

Label:   1  2  4  14  144  214  1444  214144  2141441444
Graph A: 1  1  4  1   1    1    1     1       1
Graph B: 1  1  8  1   1    1    1     1       1

The dot product of these two vectors is 1·1 + 1·1 + 4·8 + 1·1 + 1·1 + 1·1 + 1·1 + 1·1 + 1·1 = 40
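The relabel-and-count procedure can be sketched as follows. The dict-based graph representation and node names are our assumptions; the example reproduces the Graphs A and B above:

```python
from collections import Counter

def wl_graph_kernel(succ_a, labels_a, succ_b, labels_b, iterations=3):
    """Sketch of the relabeling graph kernel: each node's label is
    repeatedly extended with its children's labels, every label seen at
    any iteration is counted, and the kernel is the dot product of the
    two count vectors. `succ` maps node -> list of children, `labels`
    maps node -> initial string label."""
    def label_counts(succ, labels):
        labels = dict(labels)
        counts = Counter(labels.values())
        for _ in range(iterations):
            labels = {n: labels[n] + "".join(labels[c] for c in succ[n])
                      for n in labels}
            counts.update(labels.values())
        return counts

    ca = label_counts(succ_a, labels_a)
    cb = label_counts(succ_b, labels_b)
    return sum(ca[l] * cb[l] for l in ca.keys() & cb.keys())

# Graph A: node 2 -> {1, 4}, node 1 -> {4}; Graph B adds a second node "4".
succ_a = {"n2": ["n1", "n4"], "n1": ["n4"], "n4": []}
lab_a  = {"n2": "2", "n1": "1", "n4": "4"}
succ_b = {"n2": ["n1", "n4"], "n1": ["n4"], "n4": [], "n5": []}
lab_b  = {"n2": "2", "n1": "1", "n4": "4", "n5": "4"}
print(wl_graph_kernel(succ_a, lab_a, succ_b, lab_b))  # 40
```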
Kernels (cont.)
Additionally, we have a kernel over sets, which applies some other kernel, k0, over the elements of each set. It applies the element kernel to every pair of elements in the two sets, and exponentiates these values, so that the better matches (higher values) are emphasized.
Then we feed the sets of strings from the non-standard permissions feature and the sets of graphs from the CFG feature into this with the string kernel and graph kernel, respectively
Kernels (cont.)
Each of these kernel values is normalized; the normalized values are then summed to form the final kernel value
One such value is calculated for every pair of training examples, generating a kernel matrix
The kernel matrix is used to train the 1C-SVM
During testing, one value is calculated for each pair of training and testing examples.
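A sketch of this pipeline with scikit-learn's precomputed-kernel interface. The kernel matrices here are random stand-ins, and the cosine-style normalization is our assumption (the talk does not specify which normalization is used):

```python
import numpy as np
from sklearn.svm import OneClassSVM

def normalize(K):
    """Cosine-style kernel normalization: K[i,j] / sqrt(K[i,i] * K[j,j])."""
    d = np.sqrt(np.diag(K))
    return K / np.outer(d, d)

# Hypothetical per-feature kernel matrices over n training apps:
n = 5
rng = np.random.default_rng(0)
mats = []
for _ in range(3):                        # one matrix per feature kernel
    A = rng.normal(size=(n, n))
    mats.append(A @ A.T + n * np.eye(n))  # symmetric positive definite
K = sum(normalize(Ki) for Ki in mats)     # final kernel: sum of normalized ones

clf = OneClassSVM(kernel="precomputed", nu=0.1)
clf.fit(K)                                # train on the kernel matrix
print(clf.predict(K))                     # +1 = inlier-like, -1 = anomalous
```

At test time the matrix passed to `predict` has one row per test example and one column per training example, matching the pair-wise evaluation described above.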
Experimental Results
We tested our system with 2081 benign applications and 91 malicious applications
The system correctly classifies approximately 90% of malware, but only correctly classifies approximately 50% of benign applications
We also tested against each of the individual features alone
Background: Measures of Quality
We examine several measures of quality:
• True Positive Rate (aka Recall): the proportion of actual malware that our model classifies as malware
• False Negative Rate: the proportion of actual malware that our model classifies as benign; the “miss” rate
• True Negative Rate: the proportion of actual benign applications that our model classifies as benign
• False Positive Rate: the proportion of actual benign applications that our model classifies as malware; the “false alarm” rate
• Precision: the proportion of malware-classified applications that are actually malware
• F1: the harmonic mean of precision and recall; this gives a measure of quality between precision and recall, closer to the worse of the two
• F2: like F1, but with recall weighted twice as much as precision
• F½: like F1, but with precision weighted twice as much as recall
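These measures can all be computed from the raw confusion-matrix counts. The counts in the example are hypothetical, chosen only to roughly match the reported ~90% true positive and ~50% true negative rates:

```python
def quality_measures(tp, fn, tn, fp):
    """Compute the measures above from confusion-matrix counts
    (malware is the positive class)."""
    recall    = tp / (tp + fn)          # true positive rate
    tnr       = tn / (tn + fp)          # true negative rate
    precision = tp / (tp + fp)

    def f_beta(beta):                   # general F-measure
        b2 = beta * beta
        return (1 + b2) * precision * recall / (b2 * precision + recall)

    return {"TPR": recall, "FNR": 1 - recall, "TNR": tnr, "FPR": 1 - tnr,
            "precision": precision, "F1": f_beta(1),
            "F2": f_beta(2), "F0.5": f_beta(0.5)}

# Hypothetical counts: 91 malware (82 caught), 2081 benign (1040 passed)
m = quality_measures(tp=82, fn=9, tn=1040, fp=1041)
print({k: round(v, 3) for k, v in m.items()})
```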
Experimental Results (cont.)
Note: The downward trend in precision and F-measures is due to the increasing benign sample size and fixed malware sample size
Conclusions and Future Work
The high true positive rate is promising, but the low true negative rate shows much room for improvement.
There are a number of areas ripe for future investigation:
• Additional features from static analysis or even dynamic analysis
• New and better kernels and feature representations
• Alternative models such as the Semi-Supervised SVM, Kernel PCA or probabilistic models