CLOAK OF VISIBILITY: DETECTING
WHEN MACHINES BROWSE A
DIFFERENT WEB
CIS 601: Graduate Seminar
Prof. S. S. Chung
Presented by:
Amol Chaudhari
CSU ID 2682329
AGENDA
• About
• Introduction
• Contributions
• Background
• How Common is Cloaking
• Perspective of Cloaking
• Cloaking Techniques
• Detecting Cloaking
• Performance and Evaluation
ABOUT
• Paper Title
Cloak of Visibility: Detecting When Machines Browse a
Different Web
• Publication
2016 IEEE Symposium on Security and Privacy
• Authors
Luca Invernizzi, Kurt Thomas, and other Google
researchers
WHAT IS CLOAKING?
Hypothetically Technically
WHAT IS CLOAKING?
• A search engine optimization (SEO) technique.
• The content presented to the search engine spider differs from
that presented to the user's browser.
• Done by delivering content based on the IP address or the
User-Agent HTTP header of the user requesting the page.
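As a concrete illustration, the IP/User-Agent switch described above can be sketched as follows. The token list, IP prefix, and page names here are hypothetical placeholders, not values from any real cloaking kit:

```python
# Minimal sketch of cloaking switch logic: serve benign content to known
# crawlers, monetized content to everyone else. Lists are illustrative only.

CRAWLER_UA_TOKENS = ["googlebot", "bingbot"]   # assumed crawler UA substrings
CRAWLER_IP_PREFIXES = ["66.249."]              # assumed crawler IP range

def select_content(ip: str, user_agent: str) -> str:
    ua = user_agent.lower()
    # Crawler identified by User-Agent string -> show the clean page.
    if any(tok in ua for tok in CRAWLER_UA_TOKENS):
        return "benign_page.html"
    # Crawler identified by source IP -> show the clean page.
    if any(ip.startswith(p) for p in CRAWLER_IP_PREFIXES):
        return "benign_page.html"
    # Organic visitor -> show the profit-generating page.
    return "monetized_page.html"
```

Real kits layer many more signals on top of this (rDNS, JavaScript fingerprints, referrer checks), but the core decision is this kind of branch.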
CONTRIBUTIONS
• Provides the first broad study of blackhat cloaking
techniques and the companies affected
• Builds a distributed crawler and classifier that detects
and bypasses mobile, search, and ads cloaking, with
95.5% accuracy and a false positive rate of 0.9%
• Measures the most prominent search and ads
cloaking techniques in the wild, finding that 4.9% of ads
and 11.7% of search results cloak against Google's
generic crawler
• Determines the minimum set of capabilities required
of security crawlers to contend with cloaking today
[Figure: example of a cloaking site]
BACKGROUND
• Web Cloaking Incentives (with bad actors)
I) Search results
Servers manipulate fake or compromised pages to appear
attractive to crawlers (bots), while organic visitors are guided to
(illegal) profit-generating content pages
II) Advertisements
Miscreants pay advertising networks to display their URLs.
They rely on cloaking to avoid ad policies that strictly prohibit
dietary scams, trademark-infringing goods, or any form of deceptive
advertisement, including malware
III) Drive-by Downloads
Miscreants compromise popular websites and load the pages
with drive-by exploits.
HOW COMMON IS CLOAKING
• Wang et al. estimated that 2.2% of Google
searches for trending keywords contained
cloaked results.
• 61% of those results were related to certain
merchandise.
• 31% of results for pharmaceutical keywords
advertised in spam emails led to cloaked
content.
PERSPECTIVE OF CLOAKING
• Software packages for cloaking range from $167 to
$13,188.
• The authors analyzed the cloaking packages in order to
understand:
1. Fingerprinting capabilities
2. Switch logic for displaying targeted content
3. Other built-in SEO capabilities
i. Content spinning
ii. Keyword stuffing
• Software analysis
The packages are written in various languages: C++, Perl, JavaScript, PHP
CLOAKING TECHNIQUES
• Network Fingerprinting
• Browser Fingerprinting
• Contextual Fingerprinting
NETWORK FINGERPRINTING
• IP Address
The blacklist of IP addresses contained 54,166 unique IPs tied to popular
search engines and crawlers at the time of analysis.
• Reverse DNS
• When a bot appears from a non-blacklisted IP, 4 of the 10 cloaking services
perform an rDNS lookup of the visitor's IP.
• The software compares the rDNS record against a list of domain
strings belonging to crawler operators (Google, Yahoo, etc.)
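A minimal sketch of that rDNS check, assuming a hypothetical suffix list (real kits ship their own lists of crawler domains):

```python
import socket

# Assumed crawler rDNS suffixes; illustrative, not an exhaustive real list.
BOT_RDNS_SUFFIXES = (".googlebot.com", ".google.com",
                     ".search.msn.com", ".crawl.yahoo.net")

def rdns_matches_bot(hostname: str) -> bool:
    """Compare an rDNS record against known crawler domain suffixes."""
    return hostname.lower().endswith(BOT_RDNS_SUFFIXES)

def visitor_is_bot(ip: str) -> bool:
    """rDNS lookup of the visitor's IP, as 4 of the 10 kits do."""
    try:
        hostname, _, _ = socket.gethostbyaddr(ip)
    except OSError:
        return False          # no PTR record: treat as an organic visitor
    return rdns_matches_bot(hostname)
```

Crawler operators such as Google actually recommend this forward-confirmed rDNS check as the legitimate way to verify their bots, which is why cloaking kits can reuse it so reliably.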
BROWSER FINGERPRINTING
• Well-behaved search and advertisement crawlers announce
their presence with a special User-Agent string documented on the
operator's website.
Ex. Google's googlebot, Microsoft's bingbot.
The cloaking services universally rely on user-agent
comparisons and block Google, Bing, Yahoo, etc.
CONTEXTUAL FINGERPRINTING
• This technique prevents crawlers from harvesting
URLs and visiting them outside the context they first
appeared in.
REDIRECTION TECHNIQUES
• Stopped at a doorway page
• Redirected to an entirely different domain
DETECTING CLOAKING
DETECTING CLOAKING
• Crawl URLs from the Alexa Top Million, Google
Search, and Google Ads to scan for cloaking
• Fetch each of those URLs via multiple browser and
network profiles to trigger any cloaking logic
• Compare the similarity of the content returned by
each crawl
• Feed the resulting metrics into a classifier that
detects divergent content indicative of cloaking
URL SELECTION
• Collect the URLs.
• Split the dataset into 2 parts: one for training a
classifier based on labeled data (Table 1), and a second
fed into the classifier to analyze cloaking in the wild (Table 2)
[Table 1: labeled training dataset]   [Table 2: unlabeled analysis dataset]
BROWSER CONFIGURATION
• Crawl each URL with 11 different browser and network
configurations in an attempt to trigger any cloaking logic.
• Repeat each crawl 3 times to remove noise added by network
errors.
• 3 platforms used: Chrome on Desktop, Chrome on Android, and
basic HTTP fetches
CONTEXT & NETWORK SELECTION
• Keywords
Non-ads-based URLs
• Filter the page's HTML down to its visible text and select the
top 3 most frequent words
Ads-based URLs
• Rely on the keywords the advertiser bids on for
targeting (gathered from Google AdWords)
• Network
Google's network, AT&T or Verizon mobile, a Google Cloud
datacenter, and residential IPs
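The non-ads keyword step above can be sketched roughly as follows; the visible-text filter here is a crude regex stand-in for the paper's actual HTML processing:

```python
import re
from collections import Counter

def top_keywords(html: str, k: int = 3) -> list[str]:
    """Pick the k most frequent words from a page's visible text (sketch)."""
    # Drop script/style blocks, which are not visible text.
    text = re.sub(r"<(script|style)[^>]*>.*?</\1>", " ", html,
                  flags=re.S | re.I)
    # Strip the remaining tags, keeping only their text content.
    text = re.sub(r"<[^>]+>", " ", text)
    # Tokenize; ignore very short tokens as likely noise.
    words = re.findall(r"[a-z]{3,}", text.lower())
    return [w for w, _ in Counter(words).most_common(k)]
```

These top words are what the crawler then feeds to search as the "context" in which the URL is revisited.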
SIMILARITY FEATURES
• Content Similarity
• Detect entirely distinct content by estimating the similarity
of the documents' data.
• Clean the data, tokenize the content with a sliding window,
calculate a 64-bit simhash over all tokens, and compute the
Hamming distance between the two simhashes.
• A high Hamming distance between two simhashes indicates the
two documents are distinct.
• Screenshot Similarity
• Visual differences in the layout and media presented to browsing
profiles of the same window dimensions
• Element Similarity
• Extract the set of embedded images per document and calculate the
difference in media content (Jaccard similarity coefficient)
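The content and element features above can be sketched as follows. This is a textbook simhash/Jaccard implementation under my own choice of hash function and window size, not the paper's exact code:

```python
import hashlib

def shingles(words, n=4):
    """Sliding-window token n-grams, as in the content-similarity step."""
    return [" ".join(words[i:i + n]) for i in range(len(words) - n + 1)]

def simhash64(tokens):
    """64-bit simhash: per-bit votes from token hashes, keep the majority."""
    votes = [0] * 64
    for tok in tokens:
        h = int.from_bytes(hashlib.md5(tok.encode()).digest()[:8], "big")
        for i in range(64):
            votes[i] += 1 if (h >> i) & 1 else -1
    return sum(1 << i for i in range(64) if votes[i] > 0)

def hamming(a: int, b: int) -> int:
    """Differing bits between two simhashes; high distance = distinct docs."""
    return bin(a ^ b).count("1")

def jaccard(a: set, b: set) -> float:
    """Jaccard coefficient over embedded-media sets (element similarity)."""
    return len(a & b) / len(a | b) if a | b else 1.0
```

Identical documents get Hamming distance 0; near-duplicates stay within a few bits, while unrelated pages land near the random baseline of ~32 bits.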
CLASSIFICATION
• Used Extremely Randomized Trees
- An ensemble, non-linear, supervised learning
model
- Candidate features and thresholds are
selected entirely at random
- All features normalized into the range (0, 1)
- Relies on ten-fold cross-validation
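A minimal sketch of this classification step with scikit-learn's ExtraTreesClassifier and ten-fold cross-validation. The feature matrix is synthetic stand-in data (cloaked pages given artificially low similarity scores), not the paper's real features:

```python
import numpy as np
from sklearn.ensemble import ExtraTreesClassifier
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(0)
# 200 samples, 6 similarity features already normalized into (0, 1).
# Toy assumption: cloaked pages (label 1) score lower on similarity.
X_clean = rng.uniform(0.6, 1.0, size=(100, 6))
X_cloak = rng.uniform(0.0, 0.4, size=(100, 6))
X = np.vstack([X_clean, X_cloak])
y = np.array([0] * 100 + [1] * 100)

# Extremely Randomized Trees: splits chosen at random among candidates.
clf = ExtraTreesClassifier(n_estimators=100, random_state=0)
scores = cross_val_score(clf, X, y, cv=10)   # ten-fold cross-validation
```

On real, noisy features the folds disagree far more than on this cleanly separable toy data; the paper reports its accuracy from exactly this kind of cross-validated evaluation.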
SYSTEM SPECIFICATION
• Google Compute Engine, with crawling and featurization
distributed among 20 Ubuntu machines
• The scheduler is built on top of Celery backed by Redis
• Featurization and classification rely on scikit-learn
and Pandas
• All network requests are captured and logged via mitmproxy
• Mobile networks used: AT&T and Verizon prepaid plans
PERFORMANCE AND EVALUATION
• The overall accuracy of the system is 95.5%
• It correctly labels 99.1% of Alexa URLs as
non-cloaked, with a false positive rate of 0.9%
[Figures: accuracy when training on a single feature class vs. training on all but one class]
CONCLUSION
• In this work, the authors explored the cloaking arms race
playing out between security crawlers and miscreants
• They compared and classified the content returned for
94,946 labeled URLs, arriving at a system that
accurately detected cloaking 95.5% of the time with a
false positive rate of 0.9%.
• In the process, they exposed a gap between current
blackhat practices and the broader set of
fingerprinting techniques known within the research
community, which may yet be deployed.
THANK YOU