Web Crawling: Introduction to Information Retrieval (INF 141), Donald J. Patterson. Content adapted from Hinrich Schütze, http://www.informationretrieval.org
Transcript
Page 1: Web Crawling (lopes/teaching/cs221W12/slides/Lecture05.pdf)

Web Crawling
Introduction to Information Retrieval
INF 141
Donald J. Patterson

Content adapted from Hinrich Schütze, http://www.informationretrieval.org

Page 2:

• Introduction

• URL Frontier

• Robust Crawling

• DNS

Overview

Web Crawling Outline

Page 3:

The Web

Web Spider

Indices

Ad Indices

flickr:crankyT

Indexer

The User

Search Results

Introduction

Page 4:

The basic crawl algorithm

• Initialize a queue of URLs (“seed” URLs)

• Repeat

• Remove a URL from the queue

• Fetch associated page

• Parse and analyze page

• Store representation of page

• Extract URLs from page and add to queue
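The loop above can be sketched in Python. This is a minimal, single-machine sketch: `fetch`, `parse_links`, and `store` are injected stand-ins, and the `seen` set (not on this slide) prevents re-queueing the same URL; real crawlers add the politeness and robustness discussed in later slides.

```python
from collections import deque
from urllib.parse import urljoin

def crawl(seed_urls, fetch, parse_links, store, max_pages=100):
    """Basic crawl loop; fetch/parse_links/store are injected callables
    so the sketch works without real network access."""
    frontier = deque(seed_urls)        # queue of URLs ("seed" URLs)
    seen = set(seed_urls)              # avoid re-queueing the same URL
    while frontier and max_pages > 0:
        url = frontier.popleft()       # remove a URL from the queue
        page = fetch(url)              # fetch the associated page
        if page is None:
            continue
        store(url, page)               # store a representation of the page
        for link in parse_links(url, page):   # extract URLs from the page
            link = urljoin(url, link)  # resolve relative links
            if link not in seen:
                seen.add(link)
                frontier.append(link)  # add to queue
        max_pages -= 1
```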

Introduction

Page 5:

Crawling the web

Introduction

Seed Pages

Web Spider

Crawled Pages

URL Frontier

The Rest of the Web

Page 6:

Basic Algorithm is not reality...

• Real web crawling requires multiple machines

• All steps distributed on different computers

• Even non-adversarial pages pose problems

• Latency and bandwidth to remote servers vary

• Webmasters have opinions about crawling their turf

• How “deep” in a URL should you go?

• Site mirrors and duplicate pages

• Politeness

• Don’t hit a server too often
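The "don't hit a server too often" point can be sketched as a per-host gate; the 2-second default delay is an assumed figure, not a standard.

```python
import time
from urllib.parse import urlparse

class PolitenessGate:
    """Block until at least `min_delay` seconds have passed since the
    last request to the same host (the default delay is an assumption)."""
    def __init__(self, min_delay=2.0):
        self.min_delay = min_delay
        self.last_hit = {}             # host -> monotonic time of last request

    def wait(self, url):
        host = urlparse(url).netloc
        due = self.last_hit.get(host, 0.0) + self.min_delay
        now = time.monotonic()
        if due > now:
            time.sleep(due - now)      # don't hit the server too often
        self.last_hit[host] = time.monotonic()
```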

Introduction

Page 7:

Basic Algorithm is not reality...

• Adversarial Web Pages

• Spam Pages

• Spider Traps

Introduction

Page 8:

Minimum Characteristics for a Web Crawler

• Be Polite:

• Respect implicit and explicit terms on website

• Crawl pages you’re allowed to

• Respect “robots.txt” (more on this coming up)

• Be Robust

• Handle traps and spam gracefully

Introduction

Page 9:

Desired Characteristics for a Web Crawler

• Be a distributed system

• Run on multiple machines

• Be scalable

• Adding more machines allows you to crawl faster

• Be Efficient

• Fully utilize available processing and bandwidth

• Focus on “Quality” Pages

• Crawl good information first
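Crawling good information first implies the frontier behaves like a priority queue rather than a plain FIFO. A sketch, with the quality score taken as given (real crawlers derive it from link analysis, site reputation, and similar signals):

```python
import heapq

class PriorityFrontier:
    """URL frontier that yields the highest-quality URL first."""
    def __init__(self):
        self._heap = []
        self._count = 0                # tie-breaker: preserve insertion order

    def add(self, url, quality):
        # heapq is a min-heap, so negate quality to pop the best URL first
        heapq.heappush(self._heap, (-quality, self._count, url))
        self._count += 1

    def pop(self):
        return heapq.heappop(self._heap)[2]
```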

Introduction

Page 10:

Desired Characteristics for a Web Crawler

• Support Continuous Operation

• Fetch fresh copies of previously crawled pages

• Be Extensible

• Be able to adapt to new data formats, protocols, etc.

• Today it’s AJAX, tomorrow it’s Silverlight, then....

Introduction

Page 11:

Updated Crawling picture

URL Frontier

Seed Pages

Spider Thread

Crawled Pages

URL Frontier (“priority queue”)

The Rest of the Web

Page 12:

• Frontier Queue might have multiple pages from the same host

• These need to be load balanced (“politeness”)

• All crawl threads should be kept busy
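One way to sketch this load balancing is one FIFO queue per host, handed out round-robin, so no single host dominates and threads stay busy. Class and method names are illustrative; a production frontier (e.g. Mercator's) also adds timed per-host politeness.

```python
from collections import deque
from urllib.parse import urlparse

class HostBalancedFrontier:
    """One FIFO queue per host; next_url() rotates through hosts so
    consecutive picks come from different hosts when possible."""
    def __init__(self):
        self.queues = {}               # host -> deque of URLs
        self.hosts = deque()           # round-robin rotation of hosts

    def add(self, url):
        host = urlparse(url).netloc
        if host not in self.queues:
            self.queues[host] = deque()
            self.hosts.append(host)
        self.queues[host].append(url)

    def next_url(self):
        while self.hosts:
            host = self.hosts[0]
            if self.queues[host]:
                self.hosts.rotate(-1)  # move this host to the back
                return self.queues[host].popleft()
            self.hosts.popleft()       # drop exhausted host
            del self.queues[host]
        return None
```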

URL Frontier

Page 13:

Politeness?

• It is easy enough for a website to block a crawler

• Explicit Politeness

• “Robots Exclusion Standard”

• Defined by a “robots.txt” file maintained by a webmaster

• Specifies which portions of the site may be crawled.

• Irrelevant, private, or other data can be excluded.

• Voluntary compliance by crawlers.

• Rules are matched against URL paths (simple prefix/wildcard patterns, not full regular expressions)

URL Frontier

Page 14:

Politeness?

• Explicit Politeness

• “Sitemaps”

• Introduced by Google, but open standard

• XML based

• Allows webmasters to give hints to web crawlers:

• Location of pages (URL islands)

• Relative importance of pages

• Update frequency of pages

• Sitemap location listed in robots.txt
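A sketch of reading those hints with Python's standard XML parser, assuming a document in the sitemaps.org 0.9 schema:

```python
import xml.etree.ElementTree as ET

# Namespace defined by the sitemaps.org protocol
NS = {"sm": "http://www.sitemaps.org/schemas/sitemap/0.9"}

def parse_sitemap(xml_text):
    """Return (url, changefreq, priority) tuples from a sitemap document."""
    root = ET.fromstring(xml_text)
    entries = []
    for url_el in root.findall("sm:url", NS):
        loc = url_el.findtext("sm:loc", default="", namespaces=NS)
        freq = url_el.findtext("sm:changefreq", default=None, namespaces=NS)
        prio = url_el.findtext("sm:priority", default=None, namespaces=NS)
        entries.append((loc, freq, prio))
    return entries
```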

URL Frontier

Page 15:

Politeness?

• Implicit Politeness

• Even without an explicit specification, avoid hitting any site too often

• Crawling costs the host bandwidth and computing resources.

URL Frontier

Page 19:

Robots.txt - Exclusion

URL Frontier

• Protocol for giving spiders (“robots”) limited access to a website

• Source: http://www.robotstxt.org/wc/norobots.html

• Website announces what is okay and not okay to crawl:

• Located at http://www.myurl.com/robots.txt

• This file holds the restrictions
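Python's standard library can interpret such a file. A sketch using urllib.robotparser, with a made-up inline rules file and crawler name so it runs offline (normally you would point it at the site's robots.txt URL and call read()):

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt contents, parsed inline for the sketch
rules = """\
User-agent: *
Disallow: /private/
Crawl-delay: 5
"""
rp = RobotFileParser()
rp.parse(rules.splitlines())

print(rp.can_fetch("MyCrawler", "http://www.myurl.com/index.html"))  # True
print(rp.can_fetch("MyCrawler", "http://www.myurl.com/private/x"))   # False
print(rp.crawl_delay("MyCrawler"))                                   # 5
```

Compliance is still voluntary: nothing stops a crawler from ignoring the answer.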

Page 20:

Robots.txt Example

URL Frontier

• http://www.ics.uci.edu/robots.txt

Page 21:

Sitemaps - Inclusion

URL Frontier

• https://www.google.com/webmasters/tools/docs/en/protocol.html#sitemapXMLExample

Page 22:

• Introduction

• URL Frontier

• Robust Crawling

• DNS

Overview

Web Crawling Outline

Page 23:

A Robust Crawl Architecture

Robust Crawling

WWW

DNS

Fetch

Parse

Seen?

Doc Fingerprints

URL Filter

Robots.txt

Duplicate Elimination

URL Index

URL Frontier Queue

Page 24:

Processing Steps in Crawling

Robust Crawling

• Pick a URL from the frontier (how to prioritize?)

• Fetch the document (DNS lookup)

• Parse the document

• Extract Links

• Check for duplicate content

• If not a duplicate, add to index

• For each extracted link

• Make sure it passes filter (robots.txt)

• Make sure it isn’t already in the URL frontier
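These steps can be sketched together, using a content hash for the duplicate check. All collaborators are injected placeholders, and SHA-1 stands in for whatever fingerprint scheme a real crawler uses:

```python
import hashlib

def process(url, frontier, seen_fingerprints, frontier_set, fetch, parse,
            url_filter, index):
    """One pass over a frontier URL: fetch, dedup by content
    fingerprint, index, then filter and enqueue extracted links."""
    page = fetch(url)                  # fetch the document (incl. DNS lookup)
    fingerprint = hashlib.sha1(page.encode()).hexdigest()
    if fingerprint in seen_fingerprints:
        return False                   # duplicate content: skip
    seen_fingerprints.add(fingerprint)
    index[url] = page                  # not a duplicate: add to index
    for link in parse(page):           # extract links
        # make sure it passes the filter and isn't already in the frontier
        if url_filter(link) and link not in frontier_set:
            frontier_set.add(link)
            frontier.append(link)
    return True
```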

Page 25:

Domain Name Server (DNS)

• A lookup service on the internet

• Given a URL, retrieve its IP address

• www.djp3.net -> 69.17.116.124

• This service is provided by a distributed set of servers

• Latency can be high

• Even seconds

• Common OS implementations of DNS lookup are blocking

• One request at a time

• Solution:

• Caching

• Batch requests
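Both mitigations can be sketched with the standard library: an LRU cache in front of the blocking lookup, plus a thread pool so many blocking lookups run in parallel (a stand-in for true asynchronous or batched DNS resolution):

```python
import socket
from concurrent.futures import ThreadPoolExecutor
from functools import lru_cache

@lru_cache(maxsize=10_000)
def resolve(host):
    """Cached lookup; socket.gethostbyname is a blocking call."""
    return socket.gethostbyname(host)

def resolve_batch(hosts, workers=20):
    """Run many blocking lookups in parallel threads."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return dict(zip(hosts, pool.map(resolve, hosts)))
```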

Page 26:

DNS: dig +trace www.djp3.net

1. Ask the root name server, {A}.ROOT-SERVERS.NET (198.41.0.4): “Where is www.djp3.net?” Answer: ask 192.5.6.30.

2. Ask the .net name server, {A}.GTLD-SERVERS.net (192.5.6.30). Answer: ask 72.1.140.145.

3. Ask the djp3.net name server, ns1.speakeasy.net (72.1.140.145). Answer: use 69.17.116.124.

4. Ask www.djp3.net (69.17.116.124): “Give me a web page.”

Page 27:

DNS: What really happens

The User asks “Give me www.djp3.net”. The request passes through a chain of caches before any network lookup: the Firefox DNS cache, then the OS DNS resolver (which consults the OS DNS cache and the host table), then the OS-specified DNS server (ns1.ics.uci.edu), which checks its own DNS cache before acting as a client to the authoritative name servers.

Page 33:

Class Exercise

DNS

• Calculate how long it would take to completely fill a DNS cache.

• How many active hosts are there?

• What is an average lookup time?

• Do the math.

http://www.flickr.com/photos/lurie/298967218/
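One possible back-of-envelope answer, with both inputs assumed (roughly a billion active hosts, about 100 ms per uncached lookup):

```python
# Both figures below are assumptions for the exercise, not measurements.
hosts = 1_000_000_000      # order-of-magnitude count of active hosts
avg_lookup_s = 0.1         # ~100 ms per uncached DNS lookup
days = hosts * avg_lookup_s / 86_400
print(f"about {days:,.0f} days of sequential lookups")  # about 1,157 days
```

Years of sequential lookups: this is exactly why caching and parallel (batched) requests matter.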

Page 34:

Recommended