A Comprehensive Approach for Malicious Javascript Detection
EJ Jung
12/18/09
with Peter Likarish, Insoon Jo
in The 4th International Conference on Malicious and Unwanted Software (MALWARE 2009)
Why javascript?
>60% of Internet attacks target web applications [sans09]
• SQL injection, cross-site scripting (XSS)

XSS is the most prevalent bug on the web
• drive-by downloads, malicious advertisements, ...
• take over the user's browser using JavaScript

Cross-site request forgery (CSRF)
– forces the browser to execute commands without the user's consent
What has been done before?
Blacklist-based approaches
• profiles from known malicious JavaScripts
• domain names and URLs of known bad websites
• most scanners adopt this

Sandbox-based approaches
• run in a virtual machine and check the state change
• honey* approaches to find new malware

Limited-capability approaches
• run with limited function calls
• only use a subset of JavaScript
Limitations
Blacklist-based approaches
• zero-day vulnerabilities
• cannot respond promptly to new attacks

Sandbox-based approaches
• delay before execution
• an imperfect sandbox might leak

Limited-capability approaches
• compatibility issues
Good and bad javascripts
Clue: obfuscation! >90% of the malicious scripts in our dataset are obfuscated
De-obfuscation?
Why not de-obfuscate and then check against a blacklist?
• complete de-obfuscation is extremely difficult
• we do use partial de-obfuscation for URL extraction
• still vulnerable to 0-day attacks

We only need to know that obfuscation exists for detection

Can benign code be obfuscated too?
• copyright protection, tamper-proofing, protection against reverse-engineering
• other features reduce false positives
Our approach
Comprehensive framework consisting of
• a targeted web crawler
• URL extraction & feedback
• JavaScript classifiers

Classifier benefits
• mitigate 0-day vulnerability
• smaller delay
• compatibility with legacy code
Preliminaries on classifiers
Classifiers "learn" from a training set how to classify
• is this script benign or malicious?
• probabilistic analysis, decision trees, rule induction, hyperplanes, ...

Example classifier: Naive Bayes
• widely used in spam filtering
Classifier evaluation
Confusion matrix [thanks to Prof. Press]
Precision/NPP
Precision
• if the classifier says malicious, how much can we trust this decision?
• precision = tp/(tp+fp)
• the higher the precision, the tougher we can be on the positives

Negative Predictive Power (NPP)
• if the classifier says benign, how much can we trust this decision?
• NPP = tn/(tn+fn)
• the higher the NPP, the less risk in letting this script run
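The two measures above follow directly from confusion-matrix counts. A minimal sketch; the counts used here are hypothetical examples, not results from the paper's experiments:

```python
# Precision and NPP computed from confusion-matrix counts, exactly as
# defined above. The example counts below are made up for illustration.

def precision(tp: int, fp: int) -> float:
    """Of the scripts flagged malicious, what fraction really are?"""
    return tp / (tp + fp)

def npp(tn: int, fn: int) -> float:
    """Of the scripts labeled benign, what fraction really are?"""
    return tn / (tn + fn)

# Hypothetical run: 19 true positives, 2 false positives,
# 24,000 true negatives, 3 false negatives.
print(round(precision(19, 2), 3))  # 0.905
print(round(npp(24_000, 3), 5))    # 0.99988
```

Note the asymmetry: with tens of thousands of benign scripts and only a handful of malicious ones, NPP stays near 1.0 even when precision varies widely.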
How to get good classifiers?
Given the word "stock" in an email, what is the probability that the email is spam?
• we can compute these probabilities from a sample set of emails
• the closer the sample set is to the real Internet, the better the classifier gets -> hence the importance of the crawler
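The "stock" question above is just Bayes' rule applied to counts from a labeled sample. A minimal sketch; the email counts below are invented for illustration, not drawn from any real corpus:

```python
# Bayes' rule for the "stock" example: P(spam | word) from counts in a
# hypothetical labeled sample of emails (all counts are made up).

def p_spam_given_word(spam_with_word: int, spam_total: int,
                      ham_with_word: int, ham_total: int) -> float:
    p_spam = spam_total / (spam_total + ham_total)
    p_word_given_spam = spam_with_word / spam_total
    p_word_given_ham = ham_with_word / ham_total
    # total probability of seeing the word at all
    p_word = p_word_given_spam * p_spam + p_word_given_ham * (1 - p_spam)
    return p_word_given_spam * p_spam / p_word

# Hypothetical sample: 400 spam emails (120 contain "stock"),
# 600 ham emails (30 contain "stock").
print(round(p_spam_given_word(120, 400, 30, 600), 3))  # 0.8
```

The same computation underlies the JavaScript classifier, with script features (obfuscation indicators, keywords) in place of words.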
Targeted Crawls
Based on Heritrix, an open-source crawler
Initial seeds from popular and blacklisted domains

Alexa top 500
• the 500 websites with the most traffic
• may include some malicious scripts, but mostly benign

Blacklisted domains
• malekal.com, malwareurl.com

Feedback from newly found malicious scripts
• extract URLs from redirections and downloads
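The feedback step amounts to pulling URLs out of (partially de-obfuscated) script text and re-seeding the crawler with them. A hedged sketch of one way this could look; the regex, function name, and sample script are illustrative assumptions, not the paper's implementation:

```python
# Illustrative URL extraction for crawler feedback. The regex is a
# simplification and the example domain is hypothetical.
import re

URL_RE = re.compile(r"https?://[\w.-]+(?:/[\w./?&=%-]*)?")

def extract_urls(script_text: str) -> list[str]:
    """Return unique URLs found in a script, preserving order."""
    seen = []
    for url in URL_RE.findall(script_text):
        if url not in seen:
            seen.append(url)
    return seen

sample = 'window.location="http://evil.example.com/drop.js";'
print(extract_urls(sample))  # ['http://evil.example.com/drop.js']
```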
Crawled scripts
Dates             Initial seeds             # pages downloaded   # unique scripts
Jan. 26 ~ Feb. 3  Alexa 500                 9,028,469            ~63 million
Jun. 2 ~ 16       827 blacklisted domains   163,938              24,269
Jul. 16 ~ Aug. 1  559 blacklisted domains   79,696               7,602
Training set: 50,000 benign + 66 malicious scripts from Feb~Mar 2009
65 out of 66 obfuscated
Is this training set good?
10-fold cross validation by 5,000 increments

Classifier   Precision (stdev)   Recall (stdev)   NPP (stdev)
NaiveBayes   0.808 (0.11)        0.659 (0.18)     0.996 (0.0023)
REPTree      0.884 (0.12)        0.769 (0.17)     0.997 (0.0022)
SVM          0.920 (0.14)        0.742 (0.17)     0.997 (0.0021)
RIPPER       0.882 (0.17)        0.787 (0.21)     0.997 (0.0027)
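The evaluation procedure above splits the labeled scripts into 10 folds, trains on 9, tests on the held-out fold, and averages the metrics. A minimal sketch of the fold machinery; the stand-in classifier (flag anything containing "eval") and the toy dataset are ours for illustration, not the paper's NaiveBayes/REPTree/SVM/RIPPER models or corpus:

```python
# 10-fold cross validation sketch with a trivial stand-in classifier.
import random

def k_fold_indices(n, k=10, seed=0):
    """Yield (train_idx, test_idx) pairs for k-fold cross validation."""
    idx = list(range(n))
    random.Random(seed).shuffle(idx)
    folds = [idx[i::k] for i in range(k)]
    for i in range(k):
        test = folds[i]
        train = [j for f in folds[:i] + folds[i + 1:] for j in f]
        yield train, test

# Toy dataset: (script, is_malicious) -- made up for illustration.
data = [("eval(unescape('%61'))", True)] * 5 + [("var x = 1;", False)] * 45

def classify(script):
    """Stand-in classifier: flag any script that contains 'eval'."""
    return "eval" in script

precisions = []
for train, test in k_fold_indices(len(data)):
    tp = sum(1 for i in test if classify(data[i][0]) and data[i][1])
    fp = sum(1 for i in test if classify(data[i][0]) and not data[i][1])
    if tp + fp:  # precision undefined when nothing was flagged
        precisions.append(tp / (tp + fp))
print(sum(precisions) / len(precisions))  # 1.0 on this toy data
```

A real run would train the model on each `train` split rather than using a fixed rule, which is where the per-fold variance behind the stdev columns comes from.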
Feature extraction
Identify commonly observed features of malicious JavaScript
• manually added features (obfuscation)
• 50 reserved JavaScript keywords

Important features
• human readability (obfuscation)
  – >70% alphabetical, 60% > vowels > 20%, <15 characters long, <=2 repetitions
• eval
  – obfuscation and hiding malicious code
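The readability thresholds above can be sketched as a per-token check. The thresholds come from the slide, but the function name and the reading of "<=2 repetitions" (no character occurring more than twice) are our assumptions, not the paper's exact implementation:

```python
# Hedged sketch of the human-readability heuristics: >70% alphabetical,
# vowel ratio between 20% and 60%, fewer than 15 characters, and
# (our interpretation of "<=2 repetitions") no character more than twice.
from collections import Counter

def looks_readable(token: str) -> bool:
    if not token or len(token) >= 15:
        return False
    alpha = sum(c.isalpha() for c in token) / len(token)
    letters = [c for c in token.lower() if c.isalpha()]
    if not letters:
        return False
    vowel_ratio = sum(c in "aeiou" for c in letters) / len(letters)
    max_repeat = max(Counter(token).values())
    return alpha > 0.7 and 0.2 < vowel_ratio < 0.6 and max_repeat <= 2

print(looks_readable("document"))  # True  (ordinary identifier)
print(looks_readable("x9$Qz_p0"))  # False (fails alphabetical and vowel checks)
```

Obfuscated scripts tend to be full of tokens that fail these checks, which is what makes readability a strong classifier feature.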
Feature evaluation
Scatterplots: good vs. bad = red vs. blue
Helpful features
Detection in the real world
Test data
• 2 weeks' worth of data from malwaredomains.com
• 24,269 unique scripts by MD5
• 22 malicious scripts found by the classifiers
  – all obfuscated
• 2 found by the latest virus scanner
Classifier                        #found   #mal   Precision
NaiveBayes                        19       17     89.5%
REPTree (decision tree learner)   21       19     90.4%
SVM                               22       19     86.3%
Ripper (inductive rule learner)   28       19     67.9%
Future work
Correlation among malicious domains
• more effective domain-based blacklists

Language-model classifiers

Resilience testing
• feedback from newly found malicious scripts
• sustain the classifiers' accuracy

Combine with other features
• HTTP and connection information [Seifert08]

Recall testing with blacklists