CLOAK OF VISIBILITY: DETECTING
WHEN MACHINES BROWSE A
DIFFERENT WEB
CIS 601: Graduate Seminar
Prof. S. S. Chung
Presented by:
Amol Chaudhari
CSU ID 2682329
AGENDA
• About
• Introduction
• Contributions
• Background
• How Common is Cloaking
• Perspective of Cloaking
• Cloaking Techniques
• Detecting Cloaking
• Performance and Evaluation
ABOUT
• Paper Title
Cloak of Visibility: Detecting When Machines Browse a
Different Web
• Publication
2016 IEEE Symposium on Security and Privacy
• Authors
Luca Invernizzi, Kurt Thomas, and other Google
researchers
WHAT IS CLOAKING?
Hypothetically Technically
WHAT IS CLOAKING?
• A search engine optimization (SEO) technique.
• The content presented to the search engine spider differs from
that presented to the user's browser.
• Done by delivering content based on the IP address or the
User-Agent HTTP header of the user requesting the page.
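As a concrete illustration, the IP/User-Agent switch described above can be sketched as follows. The token list, IP prefix, and page names here are hypothetical placeholders, not values from any real cloaking kit:

```python
# Minimal sketch of cloaking switch logic: serve benign content to known
# crawlers, monetized content to everyone else. Lists are illustrative only.

CRAWLER_UA_TOKENS = ["googlebot", "bingbot"]   # assumed crawler UA substrings
CRAWLER_IP_PREFIXES = ["66.249."]              # assumed crawler IP range

def select_content(ip: str, user_agent: str) -> str:
    ua = user_agent.lower()
    # Crawler identified by User-Agent string -> show the clean page.
    if any(tok in ua for tok in CRAWLER_UA_TOKENS):
        return "benign_page.html"
    # Crawler identified by source IP -> show the clean page.
    if any(ip.startswith(p) for p in CRAWLER_IP_PREFIXES):
        return "benign_page.html"
    # Organic visitor -> show the profit-generating page.
    return "monetized_page.html"
```

Real kits layer many more signals on top of this (rDNS, JavaScript fingerprints, referrer checks), but the core decision is this kind of branch.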
CONTRIBUTIONS
• Provides the first broad study of blackhat cloaking
techniques and the companies affected
• Builds a distributed crawler and classifier that detects
and bypasses mobile, search, and ads cloaking, with
95.5% accuracy and a false positive rate of 0.9%
• Measures the most prominent search and ads
cloaking techniques in the wild, finding that 4.9% of ads
and 11.7% of search results cloak against Google's
generic crawler
• Determines the minimum set of capabilities required
of security crawlers to contend with cloaking today
[Figure: example of a cloaking site]
BACKGROUND
• Web Cloaking Incentives (with bad actors)
I) Search results
Servers manipulate fake or compromised pages to appear
attractive to crawlers (bots), while organic visitors are guided to
(illegal) profit-generating content pages
II) Advertisements
Miscreants pay advertising networks to display their URLs.
They rely on cloaking to avoid ad policies that strictly prohibit
dietary scams, trademark-infringing goods, or any form of deceptive
advertisement, including malware
III) Drive-by Downloads
Miscreants compromise popular websites and load the pages
with drive-by exploits.
HOW COMMON IS CLOAKING
• Wang et al. estimated that 2.2% of Google
searches for trending keywords contained
cloaked results.
• 61% of those results were related to certain
merchandise.
• 31% of results for pharmaceutical keywords
advertised in spam emails led to cloaked
content.
PERSPECTIVE OF CLOAKING
• Software packages for cloaking range from $167 to
$13,188.
• The authors analyzed the cloaking packages in order to
understand:
1. Fingerprinting capabilities
2. Switch logic for displaying targeted content
3. Other built-in SEO capabilities
i. Content spinning
ii. Keyword stuffing
• Software analysis
The packages are written in various languages: C++, Perl, JavaScript, PHP
CLOAKING TECHNIQUES
• Network Fingerprinting
• Browser Fingerprinting
• Contextual Fingerprinting
NETWORK FINGERPRINTING
• IP Address
The blacklist of IP addresses contained 54,166 unique IPs tied to popular
search engines and crawlers at the time of analysis.
• Reverse DNS
• When a bot appears from a non-blacklisted IP, 4 of the 10 cloaking services
perform an rDNS lookup of the visitor's IP.
• The software compares the rDNS record against a list of domain
strings belonging to crawler operators (Google, Yahoo, etc.)
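A minimal sketch of that rDNS check, assuming a hypothetical suffix list (real kits ship their own lists of crawler domains):

```python
import socket

# Assumed crawler rDNS suffixes; illustrative, not an exhaustive real list.
BOT_RDNS_SUFFIXES = (".googlebot.com", ".google.com",
                     ".search.msn.com", ".crawl.yahoo.net")

def rdns_matches_bot(hostname: str) -> bool:
    """Compare an rDNS record against known crawler domain suffixes."""
    return hostname.lower().endswith(BOT_RDNS_SUFFIXES)

def visitor_is_bot(ip: str) -> bool:
    """rDNS lookup of the visitor's IP, as 4 of the 10 kits do."""
    try:
        hostname, _, _ = socket.gethostbyaddr(ip)
    except OSError:
        return False          # no PTR record: treat as an organic visitor
    return rdns_matches_bot(hostname)
```

Crawler operators such as Google actually recommend this forward-confirmed rDNS check as the legitimate way to verify their bots, which is why cloaking kits can reuse it so reliably.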
BROWSER FINGERPRINTING
• Well-behaved search and advertisement crawlers announce
their presence with a special User-Agent string documented on the
operator's website.
Ex. Google's googlebot, Microsoft's bingbot.
The cloaking services universally rely on user-agent
comparisons and block Google, Bing, Yahoo, etc.
CONTEXTUAL FINGERPRINTING
• This technique prevents crawlers from harvesting
URLs and visiting them outside the context they first
appeared in.
REDIRECTION TECHNIQUES
• Stopped at a doorway page
• Redirected to an entirely different domain
DETECTING CLOAKING
DETECTING CLOAKING
• Crawl URLs from the Alexa Top Million, Google
Search, and Google Ads to scan for cloaking
• Fetch each of those URLs via multiple browser and
network profiles to trigger any cloaking logic
• Compare the similarity of the content returned by
each crawl
• Feed the resulting metrics into a classifier that
detects divergent content indicative of cloaking
URL SELECTION
• Collect the URLs.
• Split the dataset into 2 parts: one for training a
classifier based on labeled data (Table 1), and a second
fed into the classifier to analyze cloaking in the wild (Table 2)
[Table 1: labeled training dataset]   [Table 2: unlabeled analysis dataset]
BROWSER CONFIGURATION
• Crawl each URL with 11 different browser and network
configurations in an attempt to trigger any cloaking logic.
• Repeat each crawl 3 times to remove noise added by network
errors.
• 3 platforms used: Chrome on Desktop, Chrome on Android, and
basic HTTP fetches
CONTEXT & NETWORK SELECTION
• Keywords
Non-ads-based URLs
• Filter the page's HTML down to its visible text and select the
top 3 most frequent words
Ads-based URLs
• Rely on the keywords the advertiser bids on for
targeting (gathered from Google AdWords)
• Network
Google's network, AT&T or Verizon mobile, a Google Cloud
datacenter, and residential IPs
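The non-ads keyword step above can be sketched roughly as follows; the visible-text filter here is a crude regex stand-in for the paper's actual HTML processing:

```python
import re
from collections import Counter

def top_keywords(html: str, k: int = 3) -> list[str]:
    """Pick the k most frequent words from a page's visible text (sketch)."""
    # Drop script/style blocks, which are not visible text.
    text = re.sub(r"<(script|style)[^>]*>.*?</\1>", " ", html,
                  flags=re.S | re.I)
    # Strip the remaining tags, keeping only their text content.
    text = re.sub(r"<[^>]+>", " ", text)
    # Tokenize; ignore very short tokens as likely noise.
    words = re.findall(r"[a-z]{3,}", text.lower())
    return [w for w, _ in Counter(words).most_common(k)]
```

These top words are what the crawler then feeds to search as the "context" in which the URL is revisited.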
SIMILARITY FEATURES
• Content Similarity
• Detect entirely distinct content by estimating the similarity
of the documents' data.
• Clean the data, tokenize the content with a sliding window,
calculate a 64-bit simhash over all tokens, and compute the
Hamming distance between the two simhashes.
• A high Hamming distance between two simhashes indicates the
two documents are distinct.
• Screenshot Similarity
• Visual differences in the layout and media presented to browsing
profiles of the same window dimensions
• Element Similarity
• Extract the set of embedded images per document and calculate the
difference in media content (Jaccard similarity coefficient)
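The content and element features above can be sketched as follows. This is a textbook simhash/Jaccard implementation under my own choice of hash function and window size, not the paper's exact code:

```python
import hashlib

def shingles(words, n=4):
    """Sliding-window token n-grams, as in the content-similarity step."""
    return [" ".join(words[i:i + n]) for i in range(len(words) - n + 1)]

def simhash64(tokens):
    """64-bit simhash: per-bit votes from token hashes, keep the majority."""
    votes = [0] * 64
    for tok in tokens:
        h = int.from_bytes(hashlib.md5(tok.encode()).digest()[:8], "big")
        for i in range(64):
            votes[i] += 1 if (h >> i) & 1 else -1
    return sum(1 << i for i in range(64) if votes[i] > 0)

def hamming(a: int, b: int) -> int:
    """Differing bits between two simhashes; high distance = distinct docs."""
    return bin(a ^ b).count("1")

def jaccard(a: set, b: set) -> float:
    """Jaccard coefficient over embedded-media sets (element similarity)."""
    return len(a & b) / len(a | b) if a | b else 1.0
```

Identical documents get Hamming distance 0; near-duplicates stay within a few bits, while unrelated pages land near the random baseline of ~32 bits.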
CLASSIFICATION
• Used Extremely Randomized Trees
- An ensemble, non-linear, supervised learning
model
- Candidate features and thresholds are
selected entirely at random
- All features normalized into the range (0, 1)
- Relies on ten-fold cross-validation
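A minimal sketch of this classification step with scikit-learn's ExtraTreesClassifier and ten-fold cross-validation. The feature matrix is synthetic stand-in data (cloaked pages given artificially low similarity scores), not the paper's real features:

```python
import numpy as np
from sklearn.ensemble import ExtraTreesClassifier
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(0)
# 200 samples, 6 similarity features already normalized into (0, 1).
# Toy assumption: cloaked pages (label 1) score lower on similarity.
X_clean = rng.uniform(0.6, 1.0, size=(100, 6))
X_cloak = rng.uniform(0.0, 0.4, size=(100, 6))
X = np.vstack([X_clean, X_cloak])
y = np.array([0] * 100 + [1] * 100)

# Extremely Randomized Trees: splits chosen at random among candidates.
clf = ExtraTreesClassifier(n_estimators=100, random_state=0)
scores = cross_val_score(clf, X, y, cv=10)   # ten-fold cross-validation
```

On real, noisy features the folds disagree far more than on this cleanly separable toy data; the paper reports its accuracy from exactly this kind of cross-validated evaluation.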
SYSTEM SPECIFICATION
• Google Compute Engine, with crawling and featurization
distributed among 20 Ubuntu machines
• The scheduler is built on top of Celery backed by Redis
• Featurization and classification rely on scikit-learn
and Pandas
• All network requests are captured and logged via mitmproxy
• Mobile networks used: AT&T and Verizon prepaid plans
PERFORMANCE AND EVALUATION
• The overall accuracy of the system is 95.5%
• It correctly labels 99.1% of Alexa URLs as
non-cloaked, with a false positive rate of 0.9%
[Figures: accuracy when training on a single feature class vs. training on all but one class]
CONCLUSION
• In this work, the authors explored the cloaking arms race
playing out between security crawlers and miscreants
• They compared and classified the content returned for
94,946 labeled URLs, arriving at a system that
accurately detected cloaking 95.5% of the time with a
false positive rate of 0.9%.
• In the process, they exposed a gap between current
blackhat practices and the broader set of
fingerprinting techniques known within the research
community, which may yet be deployed.
THANK YOU