Verma - ICISS 2014
Reasoning Mining NLP Defense
Rakesh M. Verma ([email protected])
ReMiND Laboratory
Catching Classical and Hijack-based Phishing Attacks
12/20/2014
Digital Identity and Phishing
Classical Phishing Attacks
Send email containing
• Bad link, and
• Loss, urgency, or incentive
Plant a link
• Internet forums
• Social networks
• Chat or bulletin boards
Hijack-based Attacks
Hijack a legal server and plant a phishing page
Install malware and when user types a legal target URL interpose a phishing page
Note: The URL in the address bar is legal
Motivation for Phishing
Phishing causes losses of time, productivity and money that run to billions of dollars.
Despite advances and research in phishing protection, the number of phishing victims is increasing every year.
Source: Gartner, Anti-Phishing Working Group, 2014.
Phishing Detection Dimensions
Web site and address (URL)
Web site only
(e.g. “Account quota exceeded”)
This Paper
Evolving Phishing Trends
Phishing patterns are constantly evolving.
We therefore want to detect phishing based on the fundamental characteristics of a phishing website rather than on any one pattern.
Characteristics of Phishing Website
URL
Content
Behavior
URL Characteristics
• Disguise URL with Targets (APWG: 45 - 50%)
• Top Level Domain (TLD) gets misplaced
Content Characteristics
• Images and styles are loaded from external sources (the target site) to mimic its appearance.
• Page Contents (Text) resemble target site
• Unencrypted sessions
Behavior Characteristics
Objective
Distinguish characteristics of classical and hijack-based phishing sites
Develop an algorithm for detection
Approach
1. Develop Algorithm
• To detect characteristics
2. Test Algorithm
• Dataset from PhishTank, Alexa and DMOZ
3. Evaluate Algorithm
• Against Google Safe Browsing (GSB) Phishing detection
DEVELOPING THE ALGORITHM
Algorithm
Decision Model
• URL Classifiers
• Content Classifiers
• Behavior Classifier
The Classification Classes: Phishing, Legitimate
URL Classifiers
URL is checked for presence of a target domain and an extra top-level domain (TLD) at a non-TLD place
U1 - Targets in URL
• Targets are whitelisted domains (n=5000) and some popular targets identified in security blogs (n=50), for a total of 5050
• Applies a regular expression on the URL to detect targets
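A minimal sketch of how a U1-style check could look. The three-entry `TARGETS` list is a hypothetical stand-in for the 5050 targets, the function names are illustrative, and the rule "a target name is suspicious when it appears outside the URL's own registered domain" is an assumption, since the slides do not spell out the exact regular expression.

```python
import re
from urllib.parse import urlparse

# Hypothetical stand-in for the 5050-entry target list.
TARGETS = ["paypal", "ebay", "bankofamerica"]
TARGET_RE = re.compile("|".join(re.escape(t) for t in TARGETS), re.IGNORECASE)

def registered_domain(host):
    # Simplification (assumption): the last two labels form the registered domain.
    return ".".join(host.split(".")[-2:])

def has_target_in_url(url):
    """Return True when a target name shows up outside the URL's own domain."""
    parsed = urlparse(url)
    host = (parsed.hostname or "").lower()
    own = registered_domain(host)
    # Subdomains, path and query: everything except the registered domain.
    outside = host[: len(host) - len(own)] + parsed.path + "?" + parsed.query
    for match in TARGET_RE.finditer(outside):
        # A site mentioning its own name is not flagged.
        if match.group(0).lower() not in own:
            return True
    return False

print(has_target_in_url("http://paypal.com.secure-login.example.net/signin"))
print(has_target_in_url("https://www.paypal.com/signin"))
```

The first URL is flagged because "paypal" appears in a subdomain of example.net; the second is not, because the target name is the site's own domain.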
URL Classifiers
U2 - Misplaced Top-Level Domains (TLDs)
• 7 most targeted TLDs (.com, .net, .org, .gov, .edu, .info, .biz)
• Applies a regular expression to detect additional TLDs
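One way a U2-style check could be sketched, assuming that a TLD counts as misplaced when a dot-prefixed TLD token appears anywhere other than the final label of the host name. The pattern and the name `has_misplaced_tld` are illustrative and are not taken from the talk.

```python
import re
from urllib.parse import urlparse

# The seven TLDs listed on the slide.
TLDS = ("com", "net", "org", "gov", "edu", "info", "biz")
# ".com" etc. not followed by a letter or digit, so ".community" does not match.
TLD_TOKEN = re.compile(r"\.(?:%s)(?![a-z0-9])" % "|".join(TLDS))

def has_misplaced_tld(url):
    """Return True when a TLD token appears outside the real TLD position."""
    parsed = urlparse(url.lower())
    host = parsed.hostname or ""
    labels = host.split(".")
    # Drop the final label (the real TLD), then look in the rest of the host
    # and in the path.
    candidate = ".".join(labels[:-1]) + parsed.path
    return bool(TLD_TOKEN.search(candidate))

print(has_misplaced_tld("http://paypal.com.evil.example.net/login"))
print(has_misplaced_tld("https://www.example.com/login"))
```

The first URL is flagged because ".com" sits in the middle of the host name; the second has its only TLD in the expected place.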
URL Classifiers
U3 - General Characteristics of URL
Detects a phishing URL based on the following features:
• Length of the domain
• Number of @ symbols, hyphens, punctuation symbols, top-level domains, target words, suspicious words
• Whether or not the URL is an IP address, and
• Euclidean and Kolmogorov-Smirnov (KS) distances between the distribution of characters in the URL and the distribution of characters in standard English text
Development:
• Used the PART algorithm to set optimal thresholds for the features.
• Dataset: 10600 random Alexa URLs with 9640 PhishTank URLs
• 10-fold cross-validation: TPR = 94.66%, FPR = 2.04%
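The two distance features can be illustrated with a short sketch. It assumes the distributions are taken over the letters a-z only and uses approximate, commonly quoted English letter frequencies; both choices, and all names below, are assumptions rather than the authors' exact setup.

```python
import math
import string

# Approximate relative frequencies (percent) of letters in English text.
ENGLISH_PCT = {
    "a": 8.17, "b": 1.49, "c": 2.78, "d": 4.25, "e": 12.70, "f": 2.23,
    "g": 2.02, "h": 6.09, "i": 6.97, "j": 0.15, "k": 0.77, "l": 4.03,
    "m": 2.41, "n": 6.75, "o": 7.51, "p": 1.93, "q": 0.10, "r": 5.99,
    "s": 6.33, "t": 9.06, "u": 2.76, "v": 0.98, "w": 2.36, "x": 0.15,
    "y": 1.97, "z": 0.07,
}
_TOTAL = sum(ENGLISH_PCT.values())
ENGLISH = [ENGLISH_PCT[c] / _TOTAL for c in string.ascii_lowercase]

def letter_distribution(text):
    """Relative frequency of each letter a-z in text (other characters ignored)."""
    counts = [0] * 26
    for ch in text.lower():
        if ch in ENGLISH_PCT:
            counts[ord(ch) - ord("a")] += 1
    n = sum(counts)
    return [c / n for c in counts] if n else [0.0] * 26

def euclidean_distance(p, q):
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))

def ks_distance(p, q):
    """Largest gap between the two cumulative distributions (a-z order)."""
    cp = cq = gap = 0.0
    for a, b in zip(p, q):
        cp += a
        cq += b
        gap = max(gap, abs(cp - cq))
    return gap

url_dist = letter_distribution("http://paypal.com.evil.example.net/login")
print(euclidean_distance(url_dist, ENGLISH), ks_distance(url_dist, ENGLISH))
```

Both distances are 0 for text that matches the English distribution exactly and grow as the URL's character mix departs from it.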
Content Classifier
C1 - More Redirection: Ratio of Internal Links to Total Links in source
• Anchor tags, Link tags
• Script tags, Image tags
C2 - Copy Detection: Compare given page with targets
• Terms (words)
• Top Terms (~21)
• Random Top Terms (~11)
• IDs used for tags
C3 - Unsecure Password Handling
• Checks SSL on submit page and result page
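The C1 ratio can be sketched with the standard-library HTML parser. Treating a link as internal when it resolves to the same host as the page is an assumption, and the names `LinkCollector` and `internal_link_ratio` are illustrative. No decision threshold is shown because the slides do not give one.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin, urlparse

class LinkCollector(HTMLParser):
    # Tag -> attribute that carries the link, for the four tag types on the slide.
    LINK_ATTR = {"a": "href", "link": "href", "script": "src", "img": "src"}

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        wanted = self.LINK_ATTR.get(tag)
        if wanted:
            for name, value in attrs:
                if name == wanted and value:
                    self.links.append(value)

def internal_link_ratio(page_url, html):
    """Ratio of same-host links to all links, or None if the page has no links."""
    collector = LinkCollector()
    collector.feed(html)
    if not collector.links:
        return None
    page_host = urlparse(page_url).hostname
    internal = sum(
        1 for link in collector.links
        if urlparse(urljoin(page_url, link)).hostname == page_host
    )
    return internal / len(collector.links)

sample = (
    '<a href="/help">Help</a>'
    '<img src="https://www.paypal.com/logo.png">'
    '<link href="https://www.paypal.com/site.css">'
    '<script src="https://www.paypal.com/app.js"></script>'
)
print(internal_link_ratio("http://example.net/login", sample))
```

In the sample page only one of four links stays on example.net, so the ratio is low, which is the pattern C1 looks for.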
Behavior Classifier
B1 - Real-time Form Analysis
• Extracts action URL from forms with password fields
• Analyzes contents of action URL page
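The first B1 step, extracting the action URL of forms that contain a password field, might be sketched as follows. The class and function names are hypothetical, and the second step (fetching and analyzing the action page) is left out because the slides do not describe it in detail.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin

class PasswordFormParser(HTMLParser):
    """Collects the action attribute of every form that has a password input."""

    def __init__(self):
        super().__init__()
        self.actions = []
        self._in_form = False
        self._action = ""
        self._has_password = False

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "form":
            self._in_form = True
            self._action = attrs.get("action") or ""
            self._has_password = False
        elif tag == "input" and self._in_form:
            if (attrs.get("type") or "").lower() == "password":
                self._has_password = True

    def handle_endtag(self, tag):
        if tag == "form" and self._in_form:
            if self._has_password:
                self.actions.append(self._action)
            self._in_form = False

def password_form_actions(page_url, html):
    parser = PasswordFormParser()
    parser.feed(html)
    parser.close()
    # Resolve relative actions; an empty action resolves to the page itself.
    return [urljoin(page_url, action) for action in parser.actions]

page = (
    '<form action="search.php"><input type="text" name="q"></form>'
    '<form action="post.php"><input type="text" name="user">'
    '<input type="password" name="pass"></form>'
)
print(password_form_actions("http://example.net/a/login.html", page))
```

Only the second form is reported, since the search form has no password field.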
TESTING OF ALGORITHM
Testing of Algorithm
The algorithm was applied to a dataset drawn from PhishTank, Alexa and DMOZ.
The data was preprocessed before the algorithm was applied.
Dataset
• Phishing Set: 17200 PhishTank URLs
• Legitimate Set: 17200 DMOZ URLs
• Whitelist: Top-5000 domains from Alexa
Preprocessing
Remove URLs redirecting to any URL already in the dataset
Remove offline (including 404 response) and other inaccessible URLs (timeout > 10 seconds)
If response is 200, read final landing page URL and HTML contents.
• Check landing URL against whitelist
• Check for password input field in body
Metrics
                    Classified as Phishing    Classified as Legitimate
Phishing pages      TP                        FN
Legitimate pages    FP                        TN
True Positive Rate (TPR) = TP / (TP + FN)
False Positive Rate (FPR) = FP / (FP + TN)
Precision (PR) = TP / (TP + FP)
F1 Score = 2 * PR * TPR / (PR + TPR)
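The four metrics follow directly from the confusion-matrix counts. A minimal sketch using the standard definitions; the function name and the counts in the example call are illustrative and are not results from the talk.

```python
def detection_metrics(tp, fn, fp, tn):
    """Compute TPR, FPR, precision and F1 from confusion-matrix counts."""
    tpr = tp / (tp + fn)          # share of phishing pages that were caught
    fpr = fp / (fp + tn)          # share of legitimate pages wrongly flagged
    precision = tp / (tp + fp)    # share of flagged pages that are phishing
    f1 = 2 * precision * tpr / (precision + tpr)
    return {"TPR": tpr, "FPR": fpr, "PR": precision, "F1": f1}

# Illustrative counts only.
print(detection_metrics(tp=90, fn=10, fp=5, tn=95))
```

Because F1 is the harmonic mean of precision and TPR, a classifier cannot score well on it by maximizing only one of the two.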
Algorithm
URL Classifier
• U1 - Targets in URL
• U2 - Misplaced TLD
• U3 - General Characteristics of URL
Content Classifier
• C1 - More Redirection
• C2 - Copy Detection
• C3 - Unsecure Password Handling
Behavioral Classifier
• B1 - Real-time Form Analysis
Models
The URL is passed to each classifier, and each one answers Yes or No:
• U1 - Target in URL
• U2 - Misplaced TLD
• U3 - Gen. Characteristics of URL
• C1 - More Redirection
• C2 - Copy Detection
• C3 - Unsecure Pwd. Handling
• B1 - Realtime Form Analysis
Combination    Phishing URL Condition
OR             (U1 OR U2 OR U3) OR (C1 OR C2 OR C3 OR B1)
AND            (U1 OR U2 OR U3) AND (C1 OR C2 OR C3 OR B1)
Potential      Yes >= 2
Site only      (C1 OR C2 OR C3 OR B1)
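The four combination models can be expressed over the seven Yes/No answers. This sketch assumes each classifier result is already available as a boolean, and it reads the Potential condition as "at least two classifiers answer Yes", which is an interpretation of the slide. The function name `combine` is illustrative.

```python
def combine(votes):
    """votes maps U1, U2, U3, C1, C2, C3, B1 to True (Yes) or False (No)."""
    url = votes["U1"] or votes["U2"] or votes["U3"]
    site = votes["C1"] or votes["C2"] or votes["C3"] or votes["B1"]
    yes_count = sum(bool(v) for v in votes.values())
    return {
        "or": url or site,
        "and": url and site,
        "potential": yes_count >= 2,   # assumed reading of "Yes >= 2"
        "site_only": site,
    }

example = {"U1": True, "U2": False, "U3": False,
           "C1": False, "C2": False, "C3": False, "B1": False}
print(combine(example))
```

With a single URL classifier answering Yes, only the Or model flags the page, which matches the Or model having the highest detection rate and the highest false positive rate.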
Performance of Classifiers on the dataset
Results
               Search Based Filtering = OFF       Search Based Filtering = ON
Combination    TPR     FPR    PR      F-score     TPR     FPR    PR      F-score
Or             99.97   3.50   88.25   93.75       93.37   0.54   97.84   95.55
And            87.64   1.80   92.76   90.13       82.30   0.22   98.98   89.88
Pot.           97.94   2.48   91.24   94.47       91.55   0.36   98.52   94.91
Site only      99.31   3.44   88.37   93.52       92.84   0.53   97.88   95.30
Discussion
The Or combination effectively combines URL and content based classifiers and achieves the highest detection rate of 99.97% with an FPR of 3.5% (search-based filtering off).
With the Potential scheme and search-based filtering on, the FPR drops to 0.36% with a TPR of 91.55%.
The And combination has the lowest FPR (0.22%) with a detection rate of 82.30% (search-based filtering on).
With search-based filtering on, the Site only method has an FPR of 0.53%, just below the Or combination's 0.54%, and the second highest detection rate (92.84%).
Advantages of the Approach
Can be used effectively in a zero-hour environment
Can handle hijack-based attacks, because it includes behavioral analysis
Independent of the content language
EVALUATION OF ALGORITHM
Existing Methods
Related phishing algorithms
• Blacklisting
• Xiang et al. - hierarchical adaptive probabilistic approach
• CANTINA
• CANTINA+
• Google Safe Browsing
Good performance, but could not be compared with our algorithm
• Closed source
• No API
So the publicly available Google Safe Browsing was used for evaluation.
Google Safe Browsing
Large-scale automatic phishing website detection
Analyzes both URL and content
Claims accuracy of 90% and FPR of 0.1%
Direct Comparison
                        Search Based Filtering = OFF       Search Based Filtering = ON
Model   Combination     TPR     FPR    PR      F-score     TPR     FPR    PR      F-score
Ours    1               99.97   3.50   88.25   93.75       93.37   0.54   97.84   95.55
Ours    2               87.64   1.80   92.76   90.13       82.30   0.22   98.98   89.88
Ours    3               97.94   2.48   91.24   94.47       91.55   0.36   98.52   94.91
GSB: TPR 51.46, FPR 0.03, PR 99.80, F-score 67.91
Security Analysis
If phishers get hold of this work, they might adapt to hide from the detection techniques.
Buying a genuine domain or SSL certificate, or using a self-signed certificate or OpenSSL, can hamper some of the classifiers, but it adds to phishers' effort and reduces their profit.
If phishers somehow manage to get a good page rank and a higher position in search results, they can escape detection.
They can change the behavior of the page to hide, but this could alarm users, and responsible users will report the URL.
Conclusion
Efficient algorithms based on the fundamental characteristics of phishing websites were developed.
The algorithms have efficacy comparable to or better than other established phishing detection algorithms.
A novel approach to handling hijack-based attacks was presented.
Future Work
Improve the Behavior classifier to include other phishing website behaviors.
Deploy as a browser extension to test in-field performance.
Thank You
Questions?
Hijack-Based Phishing Attacks
Agency for the Safety of Aerial Navigation in Africa and Madagascar (ASECNA)
April 2014
Redirected to PayPal