
Verma - ICISS 2014

Reasoning Mining NLP Defense

Rakesh M. Verma (rverma@uh.edu)

ReMiND Laboratory

Catching Classical and Hijack-based Phishing Attacks

12/20/2014

Digital Identity and Phishing

Classical Phishing Attacks

Send email containing
• Bad link, and
• Loss, urgency, or incentive

Plant a link
• Internet forums
• Social networks
• Chat or bulletin boards


Hijack-based Attacks

Hijack a legal server and plant a phishing page

Install malware and when user types a legal target URL interpose a phishing page

Note: The URL in the address bar is legal


Motivation for Phishing

Phishing causes losses of time, productivity, and money that run to billions of dollars.

Despite advances and research in phishing protection, the number of phishing victims is increasing every year.

Source: Gartner, Anti-Phishing Working Group, 2014.

Phishing Detection Dimensions

Web site and address (URL)

Web site only

(e.g. “Account quota exceeded”)


This Paper

Evolving Phishing Trends

Phishing patterns are constantly evolving.

So we want to detect phishing patterns based on the fundamental characteristics of a phishing website.

Characteristics of a Phishing Website

Content

Behavior

URL Characteristics

• Disguise URL with Targets (APWG: 45 - 50%)
• Top Level Domain (TLD) gets misplaced

Content Characteristics

• External sources of images and styles from the target site, to mimic its appearance
• Page contents (text) resemble the target site
• Unencrypted sessions

Behavior Characteristics

Objective

Distinguish characteristics of classical and hijack-based phishing sites

Develop an algorithm for detection

Approach

• Develop Algorithm
  • To detect characteristics
• Test Algorithm
  • Dataset from PhishTank, Alexa and DMOZ
• Evaluate Algorithm
  • Against Google Safe Browsing (GSB) Phishing detection

DEVELOPING THE ALGORITHM

Algorithm

Decision Model
URL Classifiers
Content Classifiers
Behavior Classifier

The Classification Classes: Phishing, Legitimate

URL Classifiers

URL is checked for presence of a target domain and an extra top-level domain (TLD) at a non-TLD place

U1 - Targets in URL

Targets are whitelisted domains (n=5000) and some popular targets identified in security blogs (n=50), for a total of 5050.

Applies a regular expression to the URL to detect targets
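To make the U1 check concrete, here is a minimal sketch of a target-in-URL test. The target list, the function name u1_targets_in_url, and the rule "a target name appears in a URL that is not hosted on that target" are all assumptions for illustration; the slides do not give the actual regular expression or the 5050-entry list.

```python
import re
from urllib.parse import urlsplit

# Hypothetical stand-in for the 5050-entry target list described above.
TARGETS = ["paypal.com", "ebay.com", "bankofamerica.com"]

# One alternation over the target names (first label only), an assumed design.
TARGET_RE = re.compile(
    "|".join(re.escape(t.split(".")[0]) for t in TARGETS), re.IGNORECASE
)


def u1_targets_in_url(url: str) -> bool:
    """Return True if a target name appears in a URL not hosted on that target."""
    host = (urlsplit(url).hostname or "").lower()
    # A URL on the target itself (or one of its subdomains) is not a disguise.
    if any(host == t or host.endswith("." + t) for t in TARGETS):
        return False
    return TARGET_RE.search(url) is not None


print(u1_targets_in_url("http://paypal.com.example.net/login"))
print(u1_targets_in_url("https://www.paypal.com/signin"))
```

Substring matching of this kind is deliberately loose and will flag benign URLs that happen to contain a target name, which is one reason a check like this would be combined with other classifiers.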

URL Classifiers

U2 - Misplaced Top-Level Domains (TLDs)

7 most targeted TLDs (.com, .net, .org, .gov, .edu, .info, .biz)

Applies regular expression to detect additional
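A misplaced-TLD check of this kind could look roughly like the sketch below. The regular expression and the function name u2_misplaced_tld are assumptions, not the authors' pattern: the sketch strips the real (final) host label and then looks for one of the seven TLD strings elsewhere in the host or in the path.

```python
import re
from urllib.parse import urlsplit

TLDS = ("com", "net", "org", "gov", "edu", "info", "biz")

# A dot followed by one of the TLDs, then a delimiter or the end of the string.
MISPLACED_RE = re.compile(
    r"\.(?:%s)(?=[./\-]|$)" % "|".join(TLDS), re.IGNORECASE
)


def u2_misplaced_tld(url: str) -> bool:
    """Return True if a TLD string appears somewhere other than the real TLD."""
    parts = urlsplit(url)
    host = parts.hostname or ""
    # Drop the final label, which is the genuine TLD of the host.
    head, _, _real_tld = host.rpartition(".")
    return bool(MISPLACED_RE.search(head) or MISPLACED_RE.search(parts.path))


print(u2_misplaced_tld("http://www.paypal.com.evil.ru/"))
print(u2_misplaced_tld("http://www.example.com/index.html"))
```

This sketch ignores the query string and would also flag a file named, say, notes.org in the path; a production pattern would need to be tuned against real data.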

URL Classifiers

U3 - General Characteristics of URL

Detects phishing URLs based on the following features:

Length of the domain

Number of @ symbols, hyphens, punctuation symbols, top-level domains, target words, suspicious words

Whether or not the URL is an IP address, and

Euclidean and Kolmogorov-Smirnov (KS) distances between the distribution of characters in the URL and the distribution of characters in standard English text
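The two distance features in the list above can be sketched as follows. This is an illustration under stated assumptions: only the letters a to z are counted, the English frequencies are commonly cited approximate values (not the reference distribution used in the paper), and the KS distance is taken as the largest gap between the two cumulative distributions in alphabetical order.

```python
import math
import string
from collections import Counter

# Approximate letter frequencies (percent) in English text; assumed values.
_ENGLISH_PERCENT = {
    "a": 8.17, "b": 1.49, "c": 2.78, "d": 4.25, "e": 12.70, "f": 2.23,
    "g": 2.02, "h": 6.09, "i": 6.97, "j": 0.15, "k": 0.77, "l": 4.03,
    "m": 2.41, "n": 6.75, "o": 7.51, "p": 1.93, "q": 0.10, "r": 5.99,
    "s": 6.33, "t": 9.06, "u": 2.76, "v": 0.98, "w": 2.36, "x": 0.15,
    "y": 1.97, "z": 0.07,
}
_TOTAL = sum(_ENGLISH_PERCENT.values())
# Normalized so the reference distribution sums to 1.
ENGLISH = [_ENGLISH_PERCENT[c] / _TOTAL for c in string.ascii_lowercase]


def char_distribution(text: str) -> list:
    """Relative frequency of each letter a..z in text (non-letters ignored)."""
    letters = [c for c in text.lower() if c in string.ascii_lowercase]
    counts = Counter(letters)
    total = len(letters)
    return [counts[c] / total if total else 0.0 for c in string.ascii_lowercase]


def euclidean_distance(p: list, q: list) -> float:
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))


def ks_distance(p: list, q: list) -> float:
    """Largest absolute gap between the two cumulative distributions."""
    cum_p = cum_q = gap = 0.0
    for a, b in zip(p, q):
        cum_p += a
        cum_q += b
        gap = max(gap, abs(cum_p - cum_q))
    return gap


url_dist = char_distribution("http://xz9q-login.example.net/qqzx")
print(euclidean_distance(url_dist, ENGLISH), ks_distance(url_dist, ENGLISH))
```

The intuition is that machine-generated or obfuscated URLs tend to have a character distribution further from English than human-chosen ones; any threshold on these distances would have to be learned from data.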

Development:

Used the PART algorithm to set optimal thresholds for the features.

Dataset: 10600 random Alexa URLs and 9640 PhishTank URLs

10-fold cross-validation: TPR = 94.66%, FPR = 2.04%

Content Classifier

C1 – More Redirection: Ratio of internal links to total links in source
• Anchor tags, Link tags
• Script tags, Image tags

C2 – Copy Detection: Compare given page with targets
• Terms (words)
• Top Terms (~21)
• Random Top Terms (~11)
• IDs used for tags

C3 – Unsecure Password Handling
• Checks SSL on submit page and result page
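As an illustration of the C1 feature, the sketch below computes the ratio of internal links to total links over anchor, link, script, and image tags using only the Python standard library. The class and function names are hypothetical, and treating "internal" as "same hostname as the page" is an assumption; the slides give neither the definition nor the decision threshold.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin, urlsplit


class LinkCollector(HTMLParser):
    """Collects link targets from <a>, <link>, <script>, and <img> tags."""

    ATTR_FOR_TAG = {"a": "href", "link": "href", "script": "src", "img": "src"}

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        attr = self.ATTR_FOR_TAG.get(tag)
        if attr:
            value = dict(attrs).get(attr)
            if value:
                self.links.append(value)


def internal_link_ratio(html: str, page_url: str):
    """Return internal links / total links, or None if the page has no links."""
    collector = LinkCollector()
    collector.feed(html)
    if not collector.links:
        return None
    page_host = urlsplit(page_url).hostname
    internal = sum(
        1
        for link in collector.links
        # Relative links are resolved against the page URL before comparing.
        if urlsplit(urljoin(page_url, link)).hostname == page_host
    )
    return internal / len(collector.links)


sample = (
    '<a href="/home">home</a>'
    '<img src="http://paypal.com/logo.png">'
    '<link href="http://paypal.com/s.css">'
    '<script src="http://paypal.com/a.js"></script>'
)
print(internal_link_ratio(sample, "http://evil.example/login"))
```

A page that pulls most of its images, styles, and scripts from another site yields a low ratio, which is the "more redirection" signal named by C1.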

Behavior Classifier

B1 – Real-time Form Analysis
• Extracts action URL from forms with password fields
• Analyzes contents of action URL page
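The first step of B1, extracting the action URL from forms that contain a password field, might be sketched as below. The parser class and function name are hypothetical, and the second step (fetching and analyzing the action URL page) is not shown because the slides do not describe how that analysis is done.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin


class PasswordFormParser(HTMLParser):
    """Records the action attribute of every form that has a password input."""

    def __init__(self):
        super().__init__()
        self._in_form = False
        self._action = ""
        self._has_password = False
        self.actions = []

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "form":
            self._in_form = True
            self._action = attrs.get("action") or ""
            self._has_password = False
        elif tag == "input" and self._in_form:
            if (attrs.get("type") or "").lower() == "password":
                self._has_password = True

    def handle_endtag(self, tag):
        if tag == "form" and self._in_form:
            if self._has_password:
                self.actions.append(self._action)
            self._in_form = False


def password_form_actions(html: str, page_url: str) -> list:
    """Absolute action URLs of forms with a password field."""
    parser = PasswordFormParser()
    parser.feed(html)
    # An empty action resolves to the page itself, as browsers do.
    return [urljoin(page_url, action) for action in parser.actions]


page = (
    '<form action="http://collector.example/post.php">'
    '<input type="text" name="u"><input type="password" name="p"></form>'
    '<form action="/search"><input type="text" name="q"></form>'
)
print(password_form_actions(page, "http://hijacked.example/login"))
```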

TESTING OF ALGORITHM

Testing of Algorithm

Algorithm applied on dataset from PhishTank, Alexa and DMOZ

Preprocessing of data was done before algorithm was applied.

Dataset

Phishing Set: 17200 PhishTank URLs

Legitimate Set: 17200 DMOZ

Whitelist: Top-5000 domains from Alexa

Preprocessing

Remove URLs redirecting to any URL already in the dataset

Remove offline (including 404 response) and other inaccessible URLs (timeout > 10 seconds)

If response is 200, read final landing page URL and HTML contents.
• Check landing URL against whitelist
• Check for password input field in body
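The preprocessing steps can be sketched as a filter over already-fetched results, which keeps the example free of network access. The record layout (url, status, final_url, html), the function name preprocess, and the choice to record the whitelist and password-field checks as flags are assumptions; the slides do not say what is done with the outcome of those two checks.

```python
import re
from urllib.parse import urlsplit

# Loose pattern for an <input> whose type is "password" (an assumed heuristic).
PASSWORD_RE = re.compile(r"<input[^>]*type\s*=\s*[\"']?password", re.IGNORECASE)


def preprocess(records: list, whitelist: list) -> list:
    """Filter fetched records and annotate the survivors.

    Each record is a dict with keys url, status, final_url, html.
    A status of None stands for a timeout or other fetch failure.
    """
    originals = {r["url"] for r in records}
    seen_landings = set()
    kept = []
    for r in records:
        if r["status"] != 200:
            continue  # offline, 404, timed out, or otherwise inaccessible
        landing = r["final_url"]
        redirected = landing != r["url"]
        # Drop URLs that redirect to a URL already in the dataset.
        if (redirected and landing in originals) or landing in seen_landings:
            continue
        seen_landings.add(landing)
        host = (urlsplit(landing).hostname or "").lower()
        kept.append({
            "url": r["url"],
            "final_url": landing,
            "whitelisted": any(host == d or host.endswith("." + d) for d in whitelist),
            "has_password_field": bool(PASSWORD_RE.search(r["html"])),
        })
    return kept


records = [
    {"url": "http://a.example/", "status": 200, "final_url": "http://a.example/",
     "html": '<input type="password" name="p">'},
    {"url": "http://b.example/r", "status": 200, "final_url": "http://a.example/", "html": ""},
    {"url": "http://c.example/", "status": 404, "final_url": "http://c.example/", "html": ""},
    {"url": "http://d.example/", "status": None, "final_url": "http://d.example/", "html": ""},
    {"url": "https://www.paypal.com/", "status": 200, "final_url": "https://www.paypal.com/",
     "html": "<p>hi</p>"},
]
result = preprocess(records, ["paypal.com"])
print([r["url"] for r in result])
```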

Metrics

                    Classified as Phishing    Classified as Legitimate
Phishing pages      TP                        FN
Legitimate pages    FP                        TN

True Positive Rate (TPR) = TP / (TP + FN)

False Positive Rate (FPR) = FP / (FP + TN)

Precision (PR) = TP / (TP + FP)

F1 Score = 2 × PR × TPR / (PR + TPR)
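For reference, the four metrics can be computed from the confusion-matrix counts with the standard definitions, as in the short sketch below. The function name and the example counts are illustrative only and are not taken from the paper's data.

```python
def metrics(tp: int, fn: int, fp: int, tn: int) -> dict:
    """Standard rates from confusion-matrix counts (phishing = positive class)."""
    tpr = tp / (tp + fn)            # share of phishing pages caught
    fpr = fp / (fp + tn)            # share of legitimate pages flagged
    pr = tp / (tp + fp)             # share of flagged pages that are phishing
    f1 = 2 * pr * tpr / (pr + tpr)  # harmonic mean of precision and TPR
    return {"TPR": tpr, "FPR": fpr, "PR": pr, "F1": f1}


# Made-up counts, only to exercise the function.
m = metrics(tp=90, fn=10, fp=5, tn=95)
print({name: round(value, 4) for name, value in m.items()})
```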

Algorithm

URL Classifier
• U1 - Targets in URL
• U2 - Misplaced TLD
• U3 - General Characteristics of URL

Content Classifier
• C1 - More Redirection
• C2 - Copy Detection
• C3 - Unsecure Password Handling

Behavioral Classifier
• B1 - Real-time Form Analysis

Models

[Flowchart: a URL is passed through U1 – Target in URL, U2 – Misplaced TLD, U3 – Gen. Characteristics of URL, C1 – More Redirection, C2 – Copy Detection, C3 – Unsecure Pwd. Handling, and B1 – Realtime Form Analysis, each with Yes and No outcomes]

Combination    Phishing URL Condition
OR             (U1 OR U2 OR U3) OR (C1 OR C2 OR C3 OR B1)
AND            (U1 OR U2 OR U3) AND (C1 OR C2 OR C3 OR B1)
Potential      Yes >= 2
Site only      (C1 OR C2 OR C3 OR
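The OR and AND combinations listed above reduce to simple Boolean logic over the seven classifier outputs, sketched below. The function name combine and the dictionary layout are hypothetical. The Potential and Site only schemes are left out because their conditions are incomplete in this transcript.

```python
def combine(outputs: dict, mode: str) -> bool:
    """Combine classifier votes (True = looks like phishing).

    outputs maps "U1", "U2", "U3", "C1", "C2", "C3", "B1" to booleans.
    """
    url_vote = outputs["U1"] or outputs["U2"] or outputs["U3"]
    site_vote = outputs["C1"] or outputs["C2"] or outputs["C3"] or outputs["B1"]
    if mode == "or":
        return url_vote or site_vote
    if mode == "and":
        return url_vote and site_vote
    raise ValueError("unknown combination: " + mode)


# A page flagged only by the misplaced-TLD check.
votes = {"U1": False, "U2": True, "U3": False,
         "C1": False, "C2": False, "C3": False, "B1": False}
print(combine(votes, "or"), combine(votes, "and"))
```

The example shows why the two schemes trade off differently: a single URL-side vote is enough for OR, while AND also requires a content or behavior vote.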

Performance of Classifiers on the dataset

Results

               Search Based Filtering = OFF       Search Based Filtering = ON
Combination    TPR     FPR    PR      F-score     TPR     FPR    PR      F-score
Or             99.97   3.50   88.25   93.75       93.37   0.54   97.84   95.55
And            87.64   1.80   92.76   90.13       82.30   0.22   98.98   89.88
Pot.           97.94   2.48   91.24   94.47       91.55   0.36   98.52   94.91
Site only      99.31   3.44   88.37   93.52       92.84   0.53   97.88   95.30

Discussion

The Or combination effectively combines URL- and content-based classifiers and achieves the highest detection rate, 99.97%, with an FPR of 3.5%.

With the Potential scheme and search-based filtering, the FPR drops to 0.36% at a TPR of 91.55%.

The And combination has the lowest FPR (0.22%), with a detection rate of 82.30%.

With search-based filtering, the Site only method has an FPR of 0.53% (third lowest) and the second-highest detection rate (92.84%).

Advantages of the Approach

Can be used effectively in a zero-hour environment

Can handle hijack-based attacks, because it includes behavioral analysis

Independent of content language

EVALUATION OF ALGORITHM

Existing Methods

Related phishing algorithms
• Blacklisting
• Xiang et al. - hierarchical adaptive probabilistic approach
• CANTINA
• CANTINA+
• Google Safe Browsing

Good performance, but could not compare with my algorithm
• Closed source
• No API

So the publicly available Google Safe Browsing was used for evaluation.

Google Safe Browsing

Large-scale automatic phishing website detection

Analyzes both URL and content

Claims accuracy of 90% and FPR of 0.1%

Direct Comparison

               Search Based Filtering = OFF       Search Based Filtering = ON
Combination    TPR     FPR    PR      F-score     TPR     FPR    PR      F-score
1              99.97   3.50   88.25   93.75       93.37   0.54   97.84   95.55
2              87.64   1.80   92.76   90.13       82.30   0.22   98.98   89.88
3              97.94   2.48   91.24   94.47       91.55   0.36   98.52   94.91

GSB: TPR 51.46, FPR 0.03, PR 99.80, F-score 67.91

Security Analysis

If phishers get hold of this work, they might adapt to evade the detection techniques.

Buying a genuine domain or SSL certificate, or using a self-signed certificate or OpenSSL, can hamper some of the classifiers, but it adds to the phishers' effort and reduces their profit.

If phishers somehow manage to get a good page rank and a higher position in search results, they can escape detection.

They can change the behavior of the page to hide it, but this could alarm users, and responsible users will report the URL.

Conclusion

Efficient algorithms based on the fundamental characteristics of phishing websites were developed.

The algorithms have efficacy comparable to or better than other established phishing detection algorithms.

A novel approach to handling hijack-based attacks.

Future Work

Improve the Behavior classifier to include other phishing website behaviors.

Deploy as a browser extension to test in-field performance.

Thank You

Questions?

Hijack-Based Phishing Attacks

Agency for the Safety of Aerial Navigation in Africa and Madagascar (ASECNA)

April 2014

Redirected to PayPal
