
Tyler McConville: Mind Controlling Crawl Bots for Success

Date post: 21-Apr-2017
Upload: nav43
Transcript
Page 1: Tyler McConville: Mind Controlling Crawl Bots for Success

Mind Control
MANIPULATING BOTS FOR SUCCESS

Advanced SEO SUMMIT
1492450200

Page 2

Why should you care?
What you get out of this talk

- Ability to understand how crawl bots function and view your website
- Ability to decrypt and understand bot behaviour and triggers
- Prioritize pages and content, and identify roadblocks

Page 3

[~]$ whoami
NAV43\Tyler McConville

- CEO & Co-Founder of NAV43
- Technical Search Engine Optimizer
- 7-year SERP journey
- Survivor of 2013

Those Who Have Fallen… In the SERPs

Page 4

Agenda
- A very short understanding of how Google works
- A background on crawl bot duties
- Tracking those f#@kers!
- The secret data
- Key takeaways
- Questions, anyone?

Page 5

Disclaimer

This is NOT an exploit resource
- It’s just an understanding from tests ヽ ༼ຈ͜ل ຈ ノ༽
- …and some implementation-specific oddities

Google has done nothing [especially] wrong
- To the contrary, their bots are quite organized

Modifying your server and misdirecting Google bots can be damaging
- If not done right. You have been forewarned.

Duplicating this is NOT guaranteed to rank you
- I’m looking at you. Understand the concepts before implementing them.

Page 6

Background
Google, Crawl Bots, Bot Behaviour.

Page 7

The now infamous Google Crawl

[ Initial Connection ]

E.G.: BASIC WEB CRAWL SIMPLIFIED

TL;DR

Page 8

Google Bots and Duties

Google bots: a brief explanation
- Crawler -> a discovery program for Google!
- A bot that mines “meta-data” and organizes “relationship” mapping
- Technically a robust scraper running within a cluster
- Spoiler: Google crawl bots aren’t intelligent!

List of Google crawl bots and functions
- Desktop: standard website scraper
- Smartphone: standard scraper with mobile rules
- Image: image check & meta-data gatherer
- Video: video render & processing
- News
- App
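The bot types listed above can usually be told apart by their User-Agent strings. The sketch below is a minimal illustration of that idea, not the speaker's tooling. The function name `classify_googlebot` is hypothetical, and the matched tokens are assumptions based on the User-Agent strings Google has published for its crawlers. A User-Agent can be spoofed, so treat the result as a hint only.

```python
def classify_googlebot(user_agent: str) -> str:
    """Map a User-Agent string to one of the crawler types from the slide.

    The tokens below are assumed from Google's published crawler
    User-Agent strings; they are not taken from the talk itself.
    """
    ua = user_agent.lower()
    # Check the most specific tokens first, because the smartphone and
    # desktop crawlers both carry the plain "googlebot/" token.
    if "googlebot-image" in ua:
        return "image"
    if "googlebot-video" in ua:
        return "video"
    if "googlebot-news" in ua:
        return "news"
    if "adsbot-google-mobile-apps" in ua:
        return "app"
    if "googlebot/" in ua:
        # The smartphone crawler presents itself as a mobile browser.
        return "smartphone" if "mobile" in ua else "desktop"
    return "other"


if __name__ == "__main__":
    print(classify_googlebot(
        "Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)"
    ))
```

Anything that falls through to `"other"` is either a regular visitor or a crawler this sketch does not know about.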

Page 9

Removing the Distortion

What we want to do
- Isolate Google crawlers from users!

What to actually do
- Mine server logs and compile a repository
- Recommended: `An upper limit of 2 months is suggested for crawl logs`
- When shipping logs, send them in an encrypted state to the ELK stack on the network
- Basically, to keep your info.. yours; a logical first step...

What implementation will also do
- Store symmetric crawl schedules (so you can find the otherwise random actions...)
- Give real-time feedback on crawl errors and application issues...

“Find bots, you must”

123.123.123.123 - - [26/Apr/2000:00:23:48 -0400] "GET /pics/wpaper.gif HTTP/1.0" 200 6248 "http://www.jafsoft.com/asctortf/" "Mozilla/4.05 (Macintosh; I; PPC)"

123.123.123.123 - - [26/Apr/2000:00:23:47 -0400] "GET /asctortf/ HTTP/1.0" 200 8130 "http://search.netscape.com/Computers/Data_Formats/Document/Text/RTF" "Mozilla/4.05 (Macintosh; I; PPC)"

123.123.123.123 - - [26/Apr/2000:00:23:48 -0400] "GET /pics/5star2000.gif HTTP/1.0" 200 4005 "http://www.jafsoft.com/asctortf/" "Mozilla/4.05 (Macintosh; I; PPC)"

123.123.123.123 - - [26/Apr/2000:00:23:50 -0400] "GET /pics
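The sample entries above follow Apache's combined log format. As a rough sketch of how such lines could be mined for bot traffic, the code below parses one line with a regular expression and keeps only entries whose User-Agent mentions Googlebot. The names `COMBINED_LOG`, `parse_line` and `googlebot_hits` are hypothetical, and the substring check on the agent is an assumption, not something the slides specify.

```python
import re
from typing import Iterable, Iterator, Optional

# Apache combined log format:
# host ident user [time] "request" status bytes "referrer" "user-agent"
COMBINED_LOG = re.compile(
    r'(?P<ip>\S+) \S+ \S+ \[(?P<timestamp>[^\]]+)\] '
    r'"(?P<method>\S+) (?P<url>\S+)(?: (?P<protocol>[^"]*))?" '
    r'(?P<status>\d{3}) (?P<bytes>\d+|-) '
    r'"(?P<referrer>[^"]*)" "(?P<agent>[^"]*)"'
)


def parse_line(line: str) -> Optional[dict]:
    """Return the fields of one log line, or None if it is malformed or truncated."""
    match = COMBINED_LOG.match(line)
    if match is None:
        return None
    entry = match.groupdict()
    entry["status"] = int(entry["status"])
    # Apache writes "-" when no body bytes were sent.
    entry["bytes"] = 0 if entry["bytes"] == "-" else int(entry["bytes"])
    return entry


def googlebot_hits(lines: Iterable[str]) -> Iterator[dict]:
    """Yield parsed entries whose User-Agent claims to be a Google crawler."""
    for line in lines:
        entry = parse_line(line)
        if entry and "googlebot" in entry["agent"].lower():
            yield entry
```

Feeding it the entries above would yield nothing, since they all come from a Macintosh browser; a truncated line simply parses to `None` and is skipped.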

Page 10

Shipping Logs.. WOOHOO

MUST FIND: GBOT SERVER: Here’s Everything, BABY.

*Command to slice the server log file for shipping: `split -l 200000 logfile.log`

MAX 10MB!!

WOOT!
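Splitting by line count does not by itself guarantee a size limit, because log lines vary in length. If "MAX 10MB" is read as a per-file cap (an assumption on my part), a byte-budget split is one alternative. The sketch below cuts only at line ends; the function name `split_log` and the `.000`, `.001` suffix scheme are hypothetical.

```python
from pathlib import Path


def split_log(path, max_bytes=10 * 1024 * 1024):
    """Split a log into numbered parts of at most max_bytes, cutting on line ends.

    A single line longer than max_bytes is written to its own oversize
    part rather than being cut in the middle.
    """
    source = Path(path)
    parts = []
    chunk, size = [], 0

    def flush():
        nonlocal chunk, size
        if chunk:
            part = source.with_name(f"{source.name}.{len(parts):03d}")
            part.write_bytes(b"".join(chunk))
            parts.append(part)
            chunk, size = [], 0

    with source.open("rb") as handle:
        for line in handle:
            # Start a new part when this line would exceed the budget.
            if size + len(line) > max_bytes:
                flush()
            chunk.append(line)
            size += len(line)
    flush()
    return parts
```

Calling `split_log("logfile.log")` would write `logfile.log.000`, `logfile.log.001` and so on next to the original file, which is left untouched.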

Page 11

ELK Stack
by the docs

Diagram based on: https://logz.io/learn/complete-guide-elk-stack/

Google Bot

Webserver

Visualization Dashboard

Page 12

So what just happened?

- Client connects: “Every time a client (or bot) connects to the server through Apache, a server entry is formed in the Apache access log.”
- “When this is complete, the log shipper will send the sliced log files into Logstash, where they will be processed and then placed into Elasticsearch.”*
- Elasticsearch: “…provides a distributed, multitenant-capable full-text search engine with an HTTP web interface and schema-free JSON documents…”
- Kibana: “This is the final step to visualizing the data within Elasticsearch so that we can sort and compile the data we need. […] This is where it gets interesting, we will live here.” *

*Elasticsearch quoted from here: https://qbox.io/blog/what-is-elasticsearch

Page 13

Background Summary

We’re Looking Here

For These

Because of That

Apache.log

Page 14

Mission

We want to be able to see data trends and crawl patterns from Google bots navigating the webserver.

We want to gather any contextual information that we can use for forensic purposes, regardless of whether or not we can accomplish the above.

We (as an adversary) want to be able to map Google bots’ direct actions and compile data trends, so that we can predict and guide the data a Google bot digests in a crawl.

- We want to do this without manipulating Google bots.. too much. ;)
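One simple way to start on the first goal, seeing crawl trends, is to count bot requests per day. The sketch below assumes the log entries have already been parsed into dicts with hypothetical `timestamp` and `agent` keys, and that timestamps use Apache's default format; none of these names come from the talk.

```python
from collections import Counter
from datetime import datetime

# Apache's default timestamp layout, e.g. "26/Apr/2000:00:23:48 -0400".
# Month names are parsed with the current locale, assumed to be English.
LOG_TIME_FORMAT = "%d/%b/%Y:%H:%M:%S %z"


def daily_crawl_counts(entries):
    """Count requests per (day, agent) pair.

    The day is taken in the log's own UTC offset, not converted to UTC.
    """
    counts = Counter()
    for entry in entries:
        day = datetime.strptime(entry["timestamp"], LOG_TIME_FORMAT).date()
        counts[(day.isoformat(), entry["agent"])] += 1
    return counts
```

Plotting these counts over a few weeks is one way to make a crawl schedule visible; a dashboard such as Kibana can produce the same view without custom code.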

Page 15

Secr3ts

Page 16

The Observed Data

- Requests
- User Agent
- Server Response
- IP Address
- Source ID*
- Exits

Page 17

Organization! We got ’em…all.

WICKED!

Page 18

The Data itself.
- JSON format / organized and saved
- Required for future processing
- Oh.. and it can be real time ;)

Approach Premise:
- Organize server logs into line items
- Requests are relatively small
- Server logs are larger and cluttered
- … they are a “log” after all.
- Create relationships based off JSON requests

FindAES from: http://jessekornblum.com/tools/

Page 19

Grouping Data

_New_Search

4 Agent

4 Time

4/8 Response [“code”]

4/8 Request [“URL”]

CSslUserContext

Looks complicated? Let’s go deeper into this at a later date.

_LOG_ITEM

4 IP Address

4 Request

4 Server Response

4/8 Referral

4 Timestamp

4/8 Agent

_Relationship_MAP

4 Request URL

4 SERVER Response

4 Bot Type

... ...

4 Bytes downloaded

? Referral link

? Exit
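The `_LOG_ITEM` and `_Relationship_MAP` groupings above can be read as a record type plus a per-URL aggregate. The sketch below is one possible interpretation, not the speaker's schema: the class `LogItem`, the helpers `to_json` and `build_relationship_map`, and the field names are all hypothetical. The "Exit" field is left out because the slide marks it as uncertain.

```python
import json
from collections import defaultdict
from dataclasses import asdict, dataclass


@dataclass
class LogItem:
    """One line item, with fields named after the _LOG_ITEM slide."""
    ip_address: str
    request: str          # requested URL
    server_response: int  # HTTP status code
    referral: str
    timestamp: str
    agent: str
    bytes_downloaded: int = 0


def to_json(item: LogItem) -> str:
    """Serialize one line item as a JSON document for later processing."""
    return json.dumps(asdict(item), sort_keys=True)


def build_relationship_map(items):
    """Group line items by requested URL, in the spirit of _Relationship_MAP."""
    relationships = defaultdict(
        lambda: {"responses": set(), "agents": set(),
                 "referrals": set(), "bytes_downloaded": 0}
    )
    for item in items:
        node = relationships[item.request]
        node["responses"].add(item.server_response)
        node["agents"].add(item.agent)
        # "-" is how Apache logs a missing referrer.
        if item.referral and item.referral != "-":
            node["referrals"].add(item.referral)
        node["bytes_downloaded"] += item.bytes_downloaded
    return dict(relationships)
```

Each key of the resulting map is a URL, and its value summarizes which bots requested it, what the server answered, and where the bot came from.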

Page 20

The Results

These functions do three things:
- Isolate drop-off or dead spots
- Return deep error reporting
- Check natural flow of crawl bot types

A Complete Website Google Crawl Map

*There are many ways to visualize this.
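The first two results can be approximated with very little code once the bot requests are available. The sketch below assumes a known list of site URLs (for example from a sitemap) and a list of `(url, status)` pairs taken from bot log entries. The function name `crawl_report` and the choice to treat any status of 400 or above as an error are assumptions.

```python
def crawl_report(site_urls, crawled):
    """Summarize dead spots and errors from bot requests.

    site_urls: every URL the site is expected to expose.
    crawled:   iterable of (url, status) pairs from bot log entries.
    """
    seen = set()
    errors = {}
    for url, status in crawled:
        seen.add(url)
        if status >= 400:
            errors.setdefault(url, set()).add(status)
    # A dead spot here means a known URL that no bot ever requested.
    dead_spots = sorted(set(site_urls) - seen)
    return {"dead_spots": dead_spots, "errors": errors}
```

A URL that shows up under `errors` but not in `site_urls` is also worth a look, since it usually means something still links to a page that no longer exists.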

Page 21

So how do we manipulate?

GO MY MINIONS!

Scrape.. scrape

Page 22

Pre-Planned crawl routes
- Bots are not smart
- As bots will always follow a given path that’s laid out for them, we are always in control.
- Silos are not dead. Instead they are topically focused.
- Provide a crawl path that makes sense for a user navigating your funnel. Outline relationships and target dead areas.

Bottom line:
Create absolute relationship links between pages for bots.
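Because a bot can only follow the links it is given, the internal link graph determines which pages it can reach and how many hops that takes. A breadth-first search over the graph is a common way to check this. The sketch below assumes the graph is available as a dict mapping each page to the pages it links to; the function names `crawl_depths` and `unreachable_pages` are hypothetical.

```python
from collections import deque


def crawl_depths(links, start):
    """Return the minimum number of link hops from start to each reachable page."""
    depths = {start: 0}
    queue = deque([start])
    while queue:
        page = queue.popleft()
        for target in links.get(page, ()):
            if target not in depths:
                depths[target] = depths[page] + 1
                queue.append(target)
    return depths


def unreachable_pages(links, start):
    """List pages in the graph that no chain of links from start leads to."""
    every_page = set(links) | {t for targets in links.values() for t in targets}
    return sorted(every_page - set(crawl_depths(links, start)))
```

Pages with a large depth, or pages that turn up as unreachable, are candidates for the "dead areas" the slide asks you to target with new internal links.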

Page 23

Reactive Server Actions
- Preprogrammed response actions
- You can program your server to react to requests and craft responses accordingly, e.g. by using 304 response codes.
- Address incorrect internal referral sources.
- Provide the crawl bot with unique meta-data within headers and avoid silly server errors.

Bottom line:
You control your server and how it behaves.
Google bots are just here for the ride.
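The 304 response mentioned above is the standard answer to a conditional GET: if the page has not changed since the date in the request's `If-Modified-Since` header, the server can reply `304 Not Modified` with no body. The sketch below shows only that decision as a pure function; the name `conditional_get` is hypothetical, and wiring it into a real server or framework is left out.

```python
from datetime import datetime, timezone
from email.utils import format_datetime, parsedate_to_datetime


def conditional_get(if_modified_since, last_modified):
    """Return (status, headers) for a GET request.

    if_modified_since: raw header value from the request, or None.
    last_modified:     timezone-aware datetime of the page's last change.
    """
    # HTTP dates are in GMT with one-second resolution.
    last_modified = last_modified.astimezone(timezone.utc).replace(microsecond=0)
    headers = {"Last-Modified": format_datetime(last_modified, usegmt=True)}
    if if_modified_since:
        try:
            since = parsedate_to_datetime(if_modified_since)
        except (TypeError, ValueError):
            since = None  # unparseable header: fall through to a full response
        if since is not None and since.tzinfo is not None and last_modified <= since:
            return 304, headers
    return 200, headers
```

Most web servers already do this for static files, so a check like this mainly matters for dynamically generated pages where the application knows the last change time.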

Page 24

The Takeaways

Page 25

You CAN track crawl bots and form trends!

Page 26

Crawl bots are just scrapers!

OOOOO!!

Page 27

Fin.

Page 28

Questions?
