Mining the Web Crawling the Web
Transcript
Page 1: Mining the Web Crawling the Web. Mining the Web2 Schedule  Search engine requirements  Components overview  Specific modules: the crawler  Purpose.

Mining the Web

Crawling the Web

Page 2: Mining the Web Crawling the Web. Mining the Web2 Schedule  Search engine requirements  Components overview  Specific modules: the crawler  Purpose.

Mining the Web 2

Schedule

Search engine requirements
Components overview
Specific modules: the crawler
  Purpose
  Implementation
  Performance metrics

Page 3: Mining the Web Crawling the Web. Mining the Web2 Schedule  Search engine requirements  Components overview  Specific modules: the crawler  Purpose.

Mining the Web 3

What does it do?

Processes users' queries
Finds pages with related information
Returns a list of resources

Is it really that simple?

Page 4: Mining the Web Crawling the Web. Mining the Web2 Schedule  Search engine requirements  Components overview  Specific modules: the crawler  Purpose.

Mining the Web 4

What does it do?

Processes users' queries
  How is a query represented?
Finds pages with related information
Returns a list of resources

Is it really that simple?

Page 5: Mining the Web Crawling the Web. Mining the Web2 Schedule  Search engine requirements  Components overview  Specific modules: the crawler  Purpose.

Mining the Web 5

What does it do?

Processes users' queries
Finds pages with related information
  How do we find pages?
  Where in the Web do we look?
  How do we match query and documents?
Returns a list of resources

Is it really that simple?

Page 6: Mining the Web Crawling the Web. Mining the Web2 Schedule  Search engine requirements  Components overview  Specific modules: the crawler  Purpose.

Mining the Web 6

What does it do?

Processes users' queries
Finds pages with related information
Returns a list of resources
  In what order?
  How are the pages ranked?

Is it really that simple?

Page 7: Mining the Web Crawling the Web. Mining the Web2 Schedule  Search engine requirements  Components overview  Specific modules: the crawler  Purpose.

Mining the Web 7

What does it do?

Processes users' queries
Finds pages with related information
Returns a list of resources

Is it really that simple?
  Limited resources
  Time/quality tradeoff

Page 8: Mining the Web Crawling the Web. Mining the Web2 Schedule  Search engine requirements  Components overview  Specific modules: the crawler  Purpose.

Mining the Web 8

Search Engine Structure

General Design
Crawling
Storage
Indexing
Ranking

Page 9: Mining the Web Crawling the Web. Mining the Web2 Schedule  Search engine requirements  Components overview  Specific modules: the crawler  Purpose.

Mining the Web 9

Search Engine Structure

(Architecture diagram.) Crawlers, steered by a Crawl Control module, feed a Page Repository. The Indexer and the Collection Analysis module build the indexes (text, structure, utility). The Query Engine, together with the Ranking module, answers user queries against these indexes and returns results.

Page 10: Mining the Web Crawling the Web. Mining the Web2 Schedule  Search engine requirements  Components overview  Specific modules: the crawler  Purpose.

Mining the Web 10

Is it an IR system?

The Web is
  used by millions
  contains lots of information
  link based
  incoherent
  changes rapidly
  distributed

Traditional information retrieval was built with the exact opposite in mind

Page 11: Mining the Web Crawling the Web. Mining the Web2 Schedule  Search engine requirements  Components overview  Specific modules: the crawler  Purpose.

Mining the Web 11

Web Dynamics

Size
  ~10 billion publicly indexable pages
  ~10 kB/page, roughly 100 TB in total
  doubles every 18 months

Dynamics
  33% of pages change weekly
  8% new pages every week
  25% new links every week

Page 12: Mining the Web Crawling the Web. Mining the Web2 Schedule  Search engine requirements  Components overview  Specific modules: the crawler  Purpose.

Mining the Web 12

Weekly change

Fetterly, Manasse, Najork, Wiener 2003

Page 13: Mining the Web Crawling the Web. Mining the Web2 Schedule  Search engine requirements  Components overview  Specific modules: the crawler  Purpose.

Mining the Web 13

Collecting “all” Web pages

For searching, for classifying, for mining, etc

Problems: No catalog of all accessible URLs on the Web

Volume, latency, duplication, dynamicity, etc.

Page 14: Mining the Web Crawling the Web. Mining the Web2 Schedule  Search engine requirements  Components overview  Specific modules: the crawler  Purpose.

Mining the Web 14

The Crawler A program that downloads and stores web pages: Starts off by placing an initial set of URLs, S0, in a queue, where all URLs to be retrieved are kept and prioritized.

From this queue, the crawler gets a URL (in some order), downloads the page, extracts any URLs in the downloaded page, and puts the new URLs in the queue.

This process is repeated until the crawler decides to stop.
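As a concrete illustration, here is a minimal sketch of this queue-driven loop in Python; the seed URLs, the FIFO order, the regex-based link extraction, and the page limit are illustrative simplifications, not part of the original slides.

```python
# Minimal sketch of the queue-driven crawl loop described above.
# Seed URLs, FIFO ordering, regex link extraction and the page limit are
# illustrative simplifications only.
from collections import deque
from urllib.parse import urljoin
from urllib.request import urlopen
import re

def crawl(seed_urls, max_pages=100):
    frontier = deque(seed_urls)      # S0: URLs still to be retrieved
    seen = set(seed_urls)            # avoid re-queueing URLs we already know
    pages = {}
    while frontier and len(pages) < max_pages:
        url = frontier.popleft()     # get the next URL (FIFO order here)
        try:
            html = urlopen(url, timeout=10).read().decode("utf-8", "replace")
        except Exception:
            continue                 # skip pages that fail to download
        pages[url] = html            # store the downloaded page
        for href in re.findall(r'href="([^"#]+)"', html):
            link = urljoin(url, href)            # resolve relative links
            if link.startswith("http") and link not in seen:
                seen.add(link)
                frontier.append(link)            # put new URLs in the queue
    return pages

# e.g. crawl(["https://example.com/"], max_pages=10)
```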

Page 15: Mining the Web Crawling the Web. Mining the Web2 Schedule  Search engine requirements  Components overview  Specific modules: the crawler  Purpose.

Mining the Web 15

Crawling Issues How to crawl?

Quality: “Best” pages first Efficiency: Avoid duplication (or near duplication) Etiquette: Robots.txt, Server load concerns

How much to crawl? How much to index? Coverage: How big is the Web? How much do we cover? Relative Coverage: How much do competitors have?

How often to crawl? Freshness: How much has changed? How much has really changed? (why is this a different question?)

Page 16: Mining the Web Crawling the Web. Mining the Web2 Schedule  Search engine requirements  Components overview  Specific modules: the crawler  Purpose.

Mining the Web 16

Before discussing crawling policies…

Some implementation issues

Page 17: Mining the Web Crawling the Web. Mining the Web2 Schedule  Search engine requirements  Components overview  Specific modules: the crawler  Purpose.

Mining the Web 17

HTML (HyperText Markup Language) lets the author
  specify layout and typeface
  embed diagrams
  create hyperlinks

A hyperlink is expressed as an anchor tag with an HREF attribute. HREF names another page using a Uniform Resource Locator (URL):

URL = protocol field ("HTTP") + server hostname ("www.cse.iitb.ac.in") + file path (e.g. /, the `root' of the published file system).

Page 18: Mining the Web Crawling the Web. Mining the Web2 Schedule  Search engine requirements  Components overview  Specific modules: the crawler  Purpose.

Mining the Web 18

HTTP (hypertext transport protocol)

Built on top of the Transport Control Protocol (TCP)

Steps (from the client end):
  Resolve the server host name to an Internet address (IP)
    uses the Domain Name Server (DNS); DNS is a distributed database of name-to-IP mappings maintained at a set of known servers
  Contact the server using TCP
    connect to the default HTTP port (80) on the server
    send the HTTP request header (e.g.: GET)
    fetch the response header
      MIME (Multipurpose Internet Mail Extensions): a meta-data standard for email and Web content transfer
    fetch the HTML page
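For illustration, the same client-side steps can be traced at socket level. A minimal sketch assuming a plain-HTTP server on port 80; the host name is a placeholder.

```python
# Sketch of the client-side steps: DNS resolution, TCP connection to port 80,
# sending a GET request header, and reading the response header and page.
# Assumes a plain-HTTP server; "example.com" is a placeholder host.
import socket

host = "example.com"
ip = socket.gethostbyname(host)                 # resolve host name via DNS
sock = socket.create_connection((ip, 80))       # TCP connection to HTTP port 80
request = (f"GET / HTTP/1.1\r\nHost: {host}\r\n"
           "Connection: close\r\n\r\n")
sock.sendall(request.encode("ascii"))           # send the HTTP request header

response = b""
while True:                                     # read response header + body
    chunk = sock.recv(4096)
    if not chunk:
        break
    response += chunk
sock.close()

header, _, body = response.partition(b"\r\n\r\n")
print(header.decode("iso-8859-1").splitlines()[0])   # e.g. "HTTP/1.1 200 OK"
```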

Page 19: Mining the Web Crawling the Web. Mining the Web2 Schedule  Search engine requirements  Components overview  Specific modules: the crawler  Purpose.

Mining the Web 19

Crawling procedure: simple in principle

Great deal of engineering goes into industry-strength crawlers

Industry crawlers crawl a substantial fraction of the Web

E.g.: Google, Yahoo

No guarantee that all accessible Web pages will be located

The crawler may never halt: pages will be added continually even as it is running.

Page 20: Mining the Web Crawling the Web. Mining the Web2 Schedule  Search engine requirements  Components overview  Specific modules: the crawler  Purpose.

Mining the Web 20

Crawling overheadsDelays involved in

Resolving the host name in the URL to an IP address using DNS

Connecting a socket to the server and sending the request

Receiving the requested page in response

Solution: overlap the above delays by fetching many pages at the same time

Page 21: Mining the Web Crawling the Web. Mining the Web2 Schedule  Search engine requirements  Components overview  Specific modules: the crawler  Purpose.

Mining the Web 21

Anatomy of a crawler

Page fetching is done by (logical) threads
  starts with DNS resolution
  finishes when the entire page has been fetched

Each page is
  stored in compressed form to disk/tape
  scanned for outlinks

The work pool of outlinks is used to maintain network utilization without overloading it
  dealt with by the load manager

Continue until the crawler has collected a sufficient number of pages.

Page 22: Mining the Web Crawling the Web. Mining the Web2 Schedule  Search engine requirements  Components overview  Specific modules: the crawler  Purpose.

Mining the Web 22

Typical anatomy of a large-scale crawler.

Page 23: Mining the Web Crawling the Web. Mining the Web2 Schedule  Search engine requirements  Components overview  Specific modules: the crawler  Purpose.

Mining the Web 23

Large-scale crawlers: performance and reliability considerations

Need to fetch many pages at the same time
  to utilize the network bandwidth
  a single page fetch may involve several seconds of network latency

Highly concurrent and parallelized DNS lookups

Multi-processing or multi-threading: impractical at this scale
Use of asynchronous sockets instead
  explicit encoding of the state of a fetch context in a data structure
  polling sockets to check for completion of network transfers

Care in URL extraction
  eliminating duplicates to reduce redundant fetches
  avoiding "spider traps"

Page 24: Mining the Web Crawling the Web. Mining the Web2 Schedule  Search engine requirements  Components overview  Specific modules: the crawler  Purpose.

Mining the Web 24

DNS caching, pre-fetching and resolution

A customized DNS component with:
  a custom client for address resolution
  a caching server
  a prefetching client

Page 25: Mining the Web Crawling the Web. Mining the Web2 Schedule  Search engine requirements  Components overview  Specific modules: the crawler  Purpose.

Mining the Web 25

Custom client for address resolution

Tailored for concurrent handling of multiple outstanding requests

Allows issuing many resolution requests together, polling at a later time for completion of individual requests

Facilitates load distribution among many DNS servers.
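A rough way to approximate such a client with standard-library tools is to issue many resolutions concurrently and collect whichever completes first; the host names and pool size below are illustrative, and a thread pool stands in for a truly asynchronous DNS client.

```python
# Sketch: issue many DNS resolution requests together and collect results as
# they complete. A thread pool around the standard resolver stands in for a
# custom asynchronous DNS client; hosts and pool size are illustrative.
import socket
from concurrent.futures import ThreadPoolExecutor, as_completed

hosts = ["example.com", "example.org", "example.net"]

def resolve(host):
    return host, socket.gethostbyname(host)

with ThreadPoolExecutor(max_workers=50) as pool:
    futures = [pool.submit(resolve, h) for h in hosts]   # issue requests together
    for fut in as_completed(futures):                    # poll for completions
        try:
            host, ip = fut.result()
            print(host, "->", ip)
        except socket.gaierror:
            pass                                         # resolution failed
```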

Page 26: Mining the Web Crawling the Web. Mining the Web2 Schedule  Search engine requirements  Components overview  Specific modules: the crawler  Purpose.

Mining the Web 26

Caching server With a large cache, persistent across DNS restarts

Residing largely in memory if possible.

Page 27: Mining the Web Crawling the Web. Mining the Web2 Schedule  Search engine requirements  Components overview  Specific modules: the crawler  Purpose.

Mining the Web 27

Prefetching client: steps

  Parse a page that has just been fetched
  Extract host names from HREF targets
  Make DNS resolution requests to the caching server

Usually implemented using UDP (User Datagram Protocol)
  connectionless, packet-based communication protocol
  does not guarantee packet delivery

Does not wait for resolution to be completed.

Page 28: Mining the Web Crawling the Web. Mining the Web2 Schedule  Search engine requirements  Components overview  Specific modules: the crawler  Purpose.

Mining the Web 28

Multiple concurrent fetches

Managing multiple concurrent connections
  a single download may take several seconds
  open many socket connections to different HTTP servers simultaneously

Multi-CPU machines are of little help: crawling performance is limited by network and disk

Two approaches
  multi-threading
  non-blocking sockets with event handlers

Page 29: Mining the Web Crawling the Web. Mining the Web2 Schedule  Search engine requirements  Components overview  Specific modules: the crawler  Purpose.

Mining the Web 29

Multi-threading

Threads can be
  physical threads of control provided by the operating system (e.g.: pthreads), or
  concurrent processes

A fixed number of threads is allocated in advance.

Programming paradigm (per thread, using blocking system calls; see the sketch below):
  create a client socket
  connect the socket to the HTTP service on a server
  send the HTTP request header
  read the socket (recv) until no more characters are available
  close the socket
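A minimal sketch of this fixed-thread-pool paradigm; the number of threads, the seed URLs, and the HTTP/1.0 simplification are assumptions for illustration.

```python
# Sketch of the fixed-thread-pool paradigm: each worker repeatedly takes a URL
# from a shared queue, opens a blocking socket, sends a GET request, and reads
# until the server closes the connection. Thread count and URLs are illustrative.
import socket
import threading
import queue
from urllib.parse import urlsplit

work = queue.Queue()                 # shared frontier (thread-safe)
results = {}
lock = threading.Lock()              # mutual exclusion for the shared results

def fetch_worker():
    while True:
        url = work.get()
        if url is None:              # sentinel: no more work
            break
        parts = urlsplit(url)
        try:
            s = socket.create_connection((parts.hostname, parts.port or 80),
                                         timeout=10)
            request = (f"GET {parts.path or '/'} HTTP/1.0\r\n"
                       f"Host: {parts.hostname}\r\n\r\n")
            s.sendall(request.encode())
            data = b""
            while chunk := s.recv(4096):   # blocking recv until connection closes
                data += chunk
            s.close()
            with lock:
                results[url] = data
        except OSError:
            pass                     # skip pages that fail
        finally:
            work.task_done()

threads = [threading.Thread(target=fetch_worker) for _ in range(8)]
for t in threads:
    t.start()
for u in ["http://example.com/", "http://example.org/"]:   # illustrative URLs
    work.put(u)
work.join()
for _ in threads:
    work.put(None)                   # shut the workers down
```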

Page 30: Mining the Web Crawling the Web. Mining the Web2 Schedule  Search engine requirements  Components overview  Specific modules: the crawler  Purpose.

Mining the Web 30

Multi-threading: Problems

Performance penalty of mutual exclusion
  concurrent access to shared data structures

Slow disk seeks
  a great deal of interleaved, random input-output on disk
  due to concurrent modification of the document repository by multiple threads

Page 31: Mining the Web Crawling the Web. Mining the Web2 Schedule  Search engine requirements  Components overview  Specific modules: the crawler  Purpose.

Mining the Web 31

Non-blocking sockets and event handlers

Non-blocking sockets

connect, send or recv call returns immediately without waiting for the network operation to complete.

Poll the status of the network operation separately
  the "select" system call
    lets the application suspend until more data can be read from or written to the socket
    times out after a pre-specified deadline
  the monitor polls several sockets at the same time

More efficient memory management
  code that completes processing is not interrupted by other completions
  no need for locks and semaphores on the work pool
  only complete pages are appended to the log
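A minimal sketch of the select()-based alternative; the hosts, the timeout, and the single-shot GET are illustrative, and error handling is omitted.

```python
# Sketch of the select()-based approach: several non-blocking sockets are
# polled in a single loop, so no locks are needed. Hosts, the 5-second
# deadline, and the one-shot GET request are illustrative; errors are ignored.
import select
import socket

hosts = ["example.com", "example.org"]
socks, buffers = {}, {}

for host in hosts:
    s = socket.socket()
    s.setblocking(False)                    # connect/send/recv return immediately
    try:
        s.connect((host, 80))
    except BlockingIOError:
        pass                                # connection completes in the background
    socks[s] = host
    buffers[s] = b""

pending_send = set(socks)
while socks:
    readable, writable, _ = select.select(list(socks), list(pending_send), [], 5.0)
    for s in writable:                      # connected: the socket is writable
        host = socks[s]
        s.send(f"GET / HTTP/1.0\r\nHost: {host}\r\n\r\n".encode())
        pending_send.discard(s)
    for s in readable:
        chunk = s.recv(4096)
        if chunk:
            buffers[s] += chunk             # partial data; keep polling
        else:                               # transfer complete, process the page
            print(socks[s], "fetched", len(buffers[s]), "bytes")
            s.close()
            del socks[s]
    if not readable and not writable:
        break                               # timed out after the deadline
```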

Page 32: Mining the Web Crawling the Web. Mining the Web2 Schedule  Search engine requirements  Components overview  Specific modules: the crawler  Purpose.

Mining the Web 32

Link extraction and normalization

Goal: obtain a canonical form of each URL (URL processing and filtering)

Avoid multiple fetches of pages known by different URLs, caused by:
  many IP addresses per host, for load balancing on large sites
  mirrored contents / contents on the same file system ("proxy pass")
  different host names mapped to a single IP address (need to publish many logical sites)
  relative URLs, which need to be interpreted w.r.t. a base URL

Page 33: Mining the Web Crawling the Web. Mining the Web2 Schedule  Search engine requirements  Components overview  Specific modules: the crawler  Purpose.

Mining the Web 33

Canonical URL

Formed by
  using a standard string for the protocol
  canonicalizing the host name
  adding an explicit port number
  normalizing and cleaning up the path
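A sketch of one possible canonicalization routine; the exact rules (default ports, trailing slashes, case handling) vary between crawlers and are assumptions here.

```python
# Sketch of one possible canonicalization: lower-case scheme and host, explicit
# port, normalized path, no fragment. Exact rules differ between crawlers.
import posixpath
from urllib.parse import urljoin, urlsplit, urlunsplit

def canonical_url(url, base=None):
    if base:                                    # interpret relative URLs w.r.t. a base
        url = urljoin(base, url)
    parts = urlsplit(url)
    scheme = parts.scheme.lower() or "http"     # standard string for the protocol
    host = (parts.hostname or "").lower()       # canonical host name
    port = parts.port or (443 if scheme == "https" else 80)   # explicit port
    path = posixpath.normpath(parts.path or "/")               # clean up the path
    if parts.path.endswith("/") and path != "/":
        path += "/"
    return urlunsplit((scheme, f"{host}:{port}", path, parts.query, ""))

# canonical_url("HTTP://www.CSE.iitb.ac.in/a/./b/../c")
#   -> "http://www.cse.iitb.ac.in:80/a/c"
```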

Page 34: Mining the Web Crawling the Web. Mining the Web2 Schedule  Search engine requirements  Components overview  Specific modules: the crawler  Purpose.

Mining the Web 34

Robot exclusion

Check whether the server prohibits crawling a normalized URL
  the robots.txt file in the HTTP root directory of the server specifies a list of path prefixes which crawlers should not attempt to fetch

Meant for crawlers only
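For example, the standard-library robots.txt parser can perform this check; the host and the crawler's user-agent name below are placeholders.

```python
# Example of checking a normalized URL against the server's robots.txt with the
# standard-library parser; the host and user-agent name are placeholders.
from urllib.robotparser import RobotFileParser

rp = RobotFileParser("http://example.com/robots.txt")
rp.read()                                   # fetch and parse robots.txt
if rp.can_fetch("MyCrawler", "http://example.com/some/path.html"):
    print("allowed to fetch")
else:
    print("path prefix is disallowed for crawlers")
```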

Page 35: Mining the Web Crawling the Web. Mining the Web2 Schedule  Search engine requirements  Components overview  Specific modules: the crawler  Purpose.

Mining the Web 35

Eliminating already-visited URLs

Check whether a URL has already been fetched
  before adding a new URL to the work pool
  needs to be very quick; achieved by computing an MD5 hash function on the URL
  exploits spatio-temporal locality of access

Two-level hash function
  most significant bits (say, 24) derived by hashing the host name plus port
  lower-order bits (say, 40) derived by hashing the path
  the concatenated bits are used as a key in a B-tree

Qualifying URLs are added to the frontier of the crawl, and their hash values are added to the B-tree.
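A sketch of such a two-level key; the 24/40-bit split follows the slide, while the in-memory set standing in for the on-disk B-tree is a simplification.

```python
# Sketch of the two-level key: high-order bits from the host (plus port),
# low-order bits from the path, so URLs from one site stay close together in
# the B-tree. The in-memory set standing in for the on-disk B-tree is a
# simplification.
import hashlib
from urllib.parse import urlsplit

def visited_key(url):
    parts = urlsplit(url)
    host = f"{parts.hostname}:{parts.port or 80}"
    path = parts.path or "/"
    host_bits = int.from_bytes(hashlib.md5(host.encode()).digest(), "big") >> (128 - 24)
    path_bits = int.from_bytes(hashlib.md5(path.encode()).digest(), "big") >> (128 - 40)
    return (host_bits << 40) | path_bits     # 64-bit key: host hash, then path hash

seen = set()                                 # stand-in for the on-disk B-tree
def already_visited(url):
    key = visited_key(url)
    if key in seen:
        return True
    seen.add(key)
    return False
```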

Page 36: Mining the Web Crawling the Web. Mining the Web2 Schedule  Search engine requirements  Components overview  Specific modules: the crawler  Purpose.

Mining the Web 36

Spider traps

Protect the crawler from crashing on
  ill-formed HTML (e.g.: a page with 68 kB of null characters)
  misleading sites
    an indefinite number of pages dynamically generated by CGI scripts
    paths of arbitrary depth created using soft directory links and path remapping features in the HTTP server

Page 37: Mining the Web Crawling the Web. Mining the Web2 Schedule  Search engine requirements  Components overview  Specific modules: the crawler  Purpose.

Mining the Web 37

Spider Traps: Solutions

No automatic technique can be foolproof

Check for URL length

Guard module
  prepare regular crawl statistics
  add dominating sites to the guard module
  disable crawling of active content such as CGI form queries
  eliminate URLs with non-textual data types

Page 38: Mining the Web Crawling the Web. Mining the Web2 Schedule  Search engine requirements  Components overview  Specific modules: the crawler  Purpose.

Mining the Web 38

Avoiding repeated expansion of links on duplicate pages

Reduce redundancy in crawls: duplicate detection of mirrored Web pages and sites

Detecting exact duplicates
  check against MD5 digests of stored pages
  represent a relative link v (relative to aliases u1 and u2) as the tuples (h(u1), v) and (h(u2), v)

Detecting near-duplicates
  even a single altered character will completely change the digest!
  e.g.: date of update, name and email of the site administrator

Page 39: Mining the Web Crawling the Web. Mining the Web2 Schedule  Search engine requirements  Components overview  Specific modules: the crawler  Purpose.

Mining the Web 39

Load monitor

Keeps track of various system statistics:
  recent performance of the wide area network (WAN) connection (e.g.: latency and bandwidth estimates)
  an operator-provided/estimated upper bound on open sockets for the crawler
  the current number of active sockets

Page 40: Mining the Web Crawling the Web. Mining the Web2 Schedule  Search engine requirements  Components overview  Specific modules: the crawler  Purpose.

Mining the Web 40

Thread manager

Responsible for
  choosing units of work from the frontier
  scheduling the issue of network resources
  distributing these requests over multiple ISPs if appropriate

Uses statistics from the load monitor

Page 41: Mining the Web Crawling the Web. Mining the Web2 Schedule  Search engine requirements  Components overview  Specific modules: the crawler  Purpose.

Mining the Web 41

Per-server work queues

To guard against denial of service (DoS) attacks, servers limit the speed or frequency of responses to any fixed client IP address

To avoid being mistaken for a DoS attack
  limit the number of active requests to a given server IP address at any time
  maintain a queue of requests for each server
  use the HTTP/1.1 persistent socket capability

Distribute attention relatively evenly between a large number of sites

Access locality vs. politeness dilemma

Page 42: Mining the Web Crawling the Web. Mining the Web2 Schedule  Search engine requirements  Components overview  Specific modules: the crawler  Purpose.

Mining the Web 42

Crawling Issues How to crawl?

Quality: “Best” pages first Efficiency: Avoid duplication (or near duplication)

Etiquette: Robots.txt, Server load concerns How much to crawl? How much to index?

Coverage: How big is the Web? How much do we cover?

Relative Coverage: How much do competitors have? How often to crawl?

Freshness: How much has changed? How much has really changed? (why is this a different question?)

Page 43: Mining the Web Crawling the Web. Mining the Web2 Schedule  Search engine requirements  Components overview  Specific modules: the crawler  Purpose.

Mining the Web 43

Crawl Order

Want best pages first

Potential quality measures:
  final in-degree
  final PageRank

Crawl heuristics:
  Breadth-First Search (BFS)
  partial in-degree
  partial PageRank
  random walk

Page 44: Mining the Web Crawling the Web. Mining the Web2 Schedule  Search engine requirements  Components overview  Specific modules: the crawler  Purpose.

Mining the Web

Breadth-First Crawl

Basic idea:
  start at a set of known URLs
  explore in "concentric circles" around these URLs: start pages, distance-one pages, distance-two pages, ...

Used by broad web search engines; balances load between servers

Page 45: Mining the Web Crawling the Web. Mining the Web2 Schedule  Search engine requirements  Components overview  Specific modules: the crawler  Purpose.

Mining the Web 45

Web Wide Crawl (328M pages) [Najo01]

BFS crawling brings in high-quality pages early in the crawl

Page 46: Mining the Web Crawling the Web. Mining the Web2 Schedule  Search engine requirements  Components overview  Specific modules: the crawler  Purpose.

Mining the Web 46

(Chart: overlap with the best x% of pages by in-degree, as a function of the fraction x% of pages crawled under ordering metric O(u).) Stanford WebBase crawl (179K pages) [Cho98]

Page 47: Mining the Web Crawling the Web. Mining the Web2 Schedule  Search engine requirements  Components overview  Specific modules: the crawler  Purpose.

Mining the Web 47

Queue of URLs to be fetched

What constraints dictate which queued URL is fetched next?

Politeness – don’t hit a server too often, even from different threads of your spider

How far into a site you have crawled already: for most sites, stay at ≤ 5 levels of the URL hierarchy

Which URLs are most promising for building a high-quality corpus This is a graph traversal problem: Given a directed graph you’ve partially visited, where do you visit next?

Page 48: Mining the Web Crawling the Web. Mining the Web2 Schedule  Search engine requirements  Components overview  Specific modules: the crawler  Purpose.

Mining the Web 48

Where do we crawl next?

A complex scheduling optimization problem, subject to constraints
  plus operational constraints (e.g., keeping all machines load-balanced)

Scientific study: limited to specific aspects
  Which ones?
  What do we measure?

What are the compromises in distributed crawling?

Page 49: Mining the Web Crawling the Web. Mining the Web2 Schedule  Search engine requirements  Components overview  Specific modules: the crawler  Purpose.

Mining the Web 49

Page selection
  Importance metric
  Web crawler model
  Crawler method for choosing the page to download

Page 50: Mining the Web Crawling the Web. Mining the Web2 Schedule  Search engine requirements  Components overview  Specific modules: the crawler  Purpose.

Mining the Web 50

Importance Metrics

Given a page P, define how "good" that page is

Several metric types:
  interest driven
  popularity driven
  location driven
  combined

Page 51: Mining the Web Crawling the Web. Mining the Web2 Schedule  Search engine requirements  Components overview  Specific modules: the crawler  Purpose.

Mining the Web 51

Interest Driven

Define a driving query Q; find the textual similarity between P and Q
Define a word vocabulary t1…tn
Define a vector for P and for Q:
  Vp, Vq = <w1,…,wn>
  wi = 0 if ti does not appear in the document
  wi = IDF(ti) = 1 / (number of pages containing ti) otherwise

Importance: IS(P) = Vp · Vq (cosine product)

Finding IDF requires going over the entire Web; estimate IDF from the pages already visited, to calculate IS'(P)
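A sketch of how IS'(P) could be estimated; the binary term weighting, the simple tokenizer, the cosine normalization, and the IDF estimate from already-crawled pages are simplifying assumptions.

```python
# Sketch of estimating IS'(P): binary term presence weighted by an IDF that is
# estimated from the pages crawled so far, combined with cosine similarity.
# The tokenizer and weighting scheme are simplifying assumptions.
import math
import re

def tokens(text):
    return set(re.findall(r"[a-z]+", text.lower()))

def is_score(page_text, query_text, crawled_pages):
    def idf(term):                            # estimated from already-visited pages
        df = sum(1 for p in crawled_pages if term in tokens(p))
        return 1.0 / df if df else 0.0
    page_terms, query_terms = tokens(page_text), tokens(query_text)
    vocab = page_terms | query_terms
    vp = {t: (idf(t) if t in page_terms else 0.0) for t in vocab}
    vq = {t: (idf(t) if t in query_terms else 0.0) for t in vocab}
    dot = sum(vp[t] * vq[t] for t in vocab)
    norm = (math.sqrt(sum(v * v for v in vp.values()))
            * math.sqrt(sum(v * v for v in vq.values())))
    return dot / norm if norm else 0.0
```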

Page 52: Mining the Web Crawling the Web. Mining the Web2 Schedule  Search engine requirements  Components overview  Specific modules: the crawler  Purpose.

Mining the Web 52

Popularity Driven

How popular a page is:
  backlink count IB(P) = the number of pages containing a link to P
  estimate from previous crawls: IB'(P)
  a more sophisticated metric, e.g. PageRank: IR(P)

Page 53: Mining the Web Crawling the Web. Mining the Web2 Schedule  Search engine requirements  Components overview  Specific modules: the crawler  Purpose.

Mining the Web 53

Location Driven

IL(P): a function of the URL of P
  words appearing in the URL
  number of "/" in the URL

Easily evaluated; requires no data from previous crawls

Page 54: Mining the Web Crawling the Web. Mining the Web2 Schedule  Search engine requirements  Components overview  Specific modules: the crawler  Purpose.

Mining the Web 54

Combined Metrics

IC(P): a function of several other metrics

Allows using local metrics for first stage and estimated metrics for second stage

IC(P) = a*IS(P) + b*IB(P) + c*IL(P)

Page 55: Mining the Web Crawling the Web. Mining the Web2 Schedule  Search engine requirements  Components overview  Specific modules: the crawler  Purpose.

Mining the Web 55

Crawling Issues How to crawl?

Quality: “Best” pages first Efficiency: Avoid duplication (or near duplication)

Etiquette: Robots.txt, Server load concerns How much to crawl? How much to index?

Coverage: How big is the Web? How much do we cover?

Relative Coverage: How much do competitors have? How often to crawl?

Freshness: How much has changed? How much has really changed? (why is this a different question?)

Page 56: Mining the Web Crawling the Web. Mining the Web2 Schedule  Search engine requirements  Components overview  Specific modules: the crawler  Purpose.

Mining the Web 56

Crawler Models

A crawler
  tries to visit more important pages first
  only has estimates of the importance metrics
  can only download a limited amount

How well does a crawler perform?
  Crawl and Stop
  Crawl and Stop with Threshold

Page 57: Mining the Web Crawling the Web. Mining the Web2 Schedule  Search engine requirements  Components overview  Specific modules: the crawler  Purpose.

Mining the Web 57

Crawl and Stop

A crawler stops after visiting K pages

A perfect crawler
  visits the pages with ranks R1,…,Rk
  these are called the top pages

A real crawler
  visits only M < K top pages
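One way to turn this into a single number, following the usual formulation of this model (stated here as an assumption rather than a quote from the slides):

```latex
% Crawl-and-stop performance: the fraction of the K top pages that the crawler
% actually visits. A perfect crawler achieves 1 (i.e. 100%), a real crawler M/K.
P_{\mathrm{CS}} \;=\; \frac{M}{K}
```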

Page 58: Mining the Web Crawling the Web. Mining the Web2 Schedule  Search engine requirements  Components overview  Specific modules: the crawler  Purpose.

Mining the Web 58

Crawl and Stop with Threshold

A crawler stops after visiting T top pages

Top pages are pages with a metric higher than G

A crawler continues until T threshold is reached

Page 59: Mining the Web Crawling the Web. Mining the Web2 Schedule  Search engine requirements  Components overview  Specific modules: the crawler  Purpose.

Mining the Web 59

Ordering Metrics

The crawler's queue is prioritized according to an ordering metric

The ordering metric is based on an importance metric
  location metrics: used directly
  popularity metrics: via estimates from previous crawls
  similarity metrics: via estimates from the anchor text

Page 60: Mining the Web Crawling the Web. Mining the Web2 Schedule  Search engine requirements  Components overview  Specific modules: the crawler  Purpose.

Mining the Web 60

Focused Crawling (Chakrabarti)

Distributed federation of focused crawlers

A supervised topic classifier controls the priority of the unvisited frontier

Trained on document samples from Web directory (Dmoz)

Page 61: Mining the Web Crawling the Web. Mining the Web2 Schedule  Search engine requirements  Components overview  Specific modules: the crawler  Purpose.

Mining the Web 61

Motivation Let’s relax the problem space: “Focus” on a restricted target space of Web pages that may be of some “type” (e.g., homepages) that may be of some “topic” (CS, quantum physics)

The “focused” crawling effort would use much less resources, be more timely, be more qualified for indexing & searching purposes

Page 62: Mining the Web Crawling the Web. Mining the Web2 Schedule  Search engine requirements  Components overview  Specific modules: the crawler  Purpose.

Mining the Web 62

Motivation

Goal: design and implement a focused Web crawler that would
  gather only pages on a particular "topic" (or class)
  use effective heuristics while choosing the next page to download

Page 63: Mining the Web Crawling the Web. Mining the Web2 Schedule  Search engine requirements  Components overview  Specific modules: the crawler  Purpose.

Mining the Web 63

Focused crawling“A focused crawler seeks and acquires [...] pages on a specific set of topics representing a relatively narrow segment of the Web.” (Soumen Chakrabarti)

The underlying paradigm is Best-First Search instead of the Breadth-First Search

Page 64: Mining the Web Crawling the Web. Mining the Web2 Schedule  Search engine requirements  Components overview  Specific modules: the crawler  Purpose.

Mining the Web 64

Breadth vs. Best First Search

Page 65: Mining the Web Crawling the Web. Mining the Web2 Schedule  Search engine requirements  Components overview  Specific modules: the crawler  Purpose.

Mining the Web 65

Two fundamental questions

Q1: How to decide whether a downloaded page is on-topic, or not?

Q2: How to choose the next page to visit?

Page 66: Mining the Web Crawling the Web. Mining the Web2 Schedule  Search engine requirements  Components overview  Specific modules: the crawler  Purpose.

Mining the Web 66

Chakrabarti’s crawler

Chakrabarti's focused crawler
  A1: determines page relevance using a text classifier
  A2: adds URLs to a max-priority queue with their parent page's score and visits them in descending order

What is original here is the use of a text classifier.

Page 67: Mining the Web Crawling the Web. Mining the Web2 Schedule  Search engine requirements  Components overview  Specific modules: the crawler  Purpose.

Mining the Web 67

Page relevance Testing the classifier

User determines focus topics Crawler calls the classifier and obtains a score for each downloaded page

Classifier returns a sorted list of classes and scores

(A 80%, B 10%, C 7%, D 1%,...)

The classifier determines the page relevance!

Page 68: Mining the Web Crawling the Web. Mining the Web2 Schedule  Search engine requirements  Components overview  Specific modules: the crawler  Purpose.

Mining the Web 68

Visit order

The radius-1 hypothesis: if page u is an on-topic example and u links to v, then the probability that v is on-topic is higher than the probability that a randomly chosen Web page is on-topic.

Page 69: Mining the Web Crawling the Web. Mining the Web2 Schedule  Search engine requirements  Components overview  Specific modules: the crawler  Purpose.

Mining the Web 69

Visit order: case 1

Hard-focus crawling: if a downloaded page is off-topic, stop following hyperlinks from this page.

Assume the target is class B, and for page P the classifier gives: A 80%, B 10%, C 7%, D 1%,...

Do not follow P’s links at all!

Page 70: Mining the Web Crawling the Web. Mining the Web2 Schedule  Search engine requirements  Components overview  Specific modules: the crawler  Purpose.

Mining the Web 70

Visit order: case 2

Soft-focus crawling: obtains a page’s relevance score (a score on the page’s relevance to the target topic)

assigns this score to every URL extracted from this particular page, and adds to the priority queue

Example: A 80%, B 10%, C 7%, D 1%,... Insert P's links with score 0.10 into the priority queue (see the sketch below).
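A sketch of this priority-queue discipline; the classifier, link extractor, and fetch function are placeholders to be supplied by a real crawler.

```python
# Sketch of soft-focus crawling with a max-priority queue: every URL extracted
# from a page inherits that page's relevance score and the best-scored URL is
# visited next. classify(), extract_links() and fetch() are placeholders.
import heapq

def classify(page_text):
    return 0.10                         # placeholder relevance score for the target class

def extract_links(page_text):
    return []                           # placeholder link extractor

def soft_focus_crawl(seeds, fetch, max_pages=1000):
    pq = [(-1.0, url) for url in seeds]     # max-priority queue via negated scores
    heapq.heapify(pq)
    seen, fetched = set(seeds), []
    while pq and len(fetched) < max_pages:
        _, url = heapq.heappop(pq)          # best-scored URL first
        page = fetch(url)
        fetched.append(url)
        score = classify(page)              # relevance of the parent page
        for link in extract_links(page):
            if link not in seen:
                seen.add(link)
                heapq.heappush(pq, (-score, link))  # child inherits parent's score
    return fetched
```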

Page 71: Mining the Web Crawling the Web. Mining the Web2 Schedule  Search engine requirements  Components overview  Specific modules: the crawler  Purpose.

Mining the Web 71

Basic Focused Crawler

Page 72: Mining the Web Crawling the Web. Mining the Web2 Schedule  Search engine requirements  Components overview  Specific modules: the crawler  Purpose.

Mining the Web 72

Comparisons

Start the baseline crawler from the URLs in one topic
Fetch up to 20,000-25,000 pages
For each pair of fetched pages (u,v), add an item to the training set of the apprentice
Train the apprentice
Start the enhanced crawler from the same set of pages
Fetch about the same number of pages

Page 73: Mining the Web Crawling the Web. Mining the Web2 Schedule  Search engine requirements  Components overview  Specific modules: the crawler  Purpose.

Mining the Web 73

Results

Page 74: Mining the Web Crawling the Web. Mining the Web2 Schedule  Search engine requirements  Components overview  Specific modules: the crawler  Purpose.

Mining the Web 74

Controversy

Chakrabarti claims the focused crawler is superior to breadth-first crawling

Suel claims the contrary, and that the argument was based on experiments with poorly performing crawlers

Page 75: Mining the Web Crawling the Web. Mining the Web2 Schedule  Search engine requirements  Components overview  Specific modules: the crawler  Purpose.

Mining the Web 75

Crawling Issues How to crawl?

Quality: “Best” pages first Efficiency: Avoid duplication (or near duplication)

Etiquette: Robots.txt, Server load concerns How much to crawl? How much to index?

Coverage: How big is the Web? How much do we cover?

Relative Coverage: How much do competitors have? How often to crawl?

Freshness: How much has changed? How much has really changed? (why is this a different question?)

Page 76: Mining the Web Crawling the Web. Mining the Web2 Schedule  Search engine requirements  Components overview  Specific modules: the crawler  Purpose.

Mining the Web 76

Determining page changes

"Expires" HTTP response header
  for pages that come with an expiry date

Otherwise we need to guess whether revisiting the page will yield a modified version
  maintain a score reflecting the probability that the page has been modified
  the crawler fetches URLs in decreasing order of score

Assumption: the recent past predicts the future

Page 77: Mining the Web Crawling the Web. Mining the Web2 Schedule  Search engine requirements  Components overview  Specific modules: the crawler  Purpose.

Mining the Web 77

Estimating page change rates

Brewington and Cybenko, and Cho: algorithms for maintaining a crawl in which most pages are fresher than a specified epoch

Prerequisite: the average interval at which the crawler checks for changes is smaller than the inter-modification time of a page

Small-scale intermediate crawler runs to monitor fast-changing sites
  e.g.: current news, weather, etc.
  intermediate indices are patched into the master index

Page 78: Mining the Web Crawling the Web. Mining the Web2 Schedule  Search engine requirements  Components overview  Specific modules: the crawler  Purpose.

Mining the Web 78

Refresh StrategyCrawlers can refresh only a certain amount of pages in a period of time.

The page download resource can be allocated in many ways

The proportional refresh policy allocates the resource proportionally to the pages' change rates.

Page 79: Mining the Web Crawling the Web. Mining the Web2 Schedule  Search engine requirements  Components overview  Specific modules: the crawler  Purpose.

Mining the Web 79

Average Change Interval

(Chart: fraction of pages vs. average change interval, binned as ≤1 day, 1 day-1 week, 1 week-1 month, 1 month-4 months, >4 months.)

Page 80: Mining the Web Crawling the Web. Mining the Web2 Schedule  Search engine requirements  Components overview  Specific modules: the crawler  Purpose.

Mining the Web 80

Change Interval – By Domain

(Chart: fraction of pages vs. average change interval, broken down by domain (.com, .net, .org, .edu, .gov), with the same interval bins as above.)

Page 81: Mining the Web Crawling the Web. Mining the Web2 Schedule  Search engine requirements  Components overview  Specific modules: the crawler  Purpose.

Mining the Web 81

Modeling Web Evolution

Page changes are modeled as a Poisson process with rate λ. If T is the time to the next event (change), its density is f_T(t) = λ e^(−λt), for t > 0.
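For completeness, the density above implies the standard change probability used by refresh policies (a direct consequence of the model, not stated on the slide):

```latex
% With exponential inter-change times, the probability that a page has changed
% within t time units of the last download is
\Pr[\text{change within } t] \;=\; \int_0^{t} \lambda e^{-\lambda s}\,ds \;=\; 1 - e^{-\lambda t}
```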

Page 82: Mining the Web Crawling the Web. Mining the Web2 Schedule  Search engine requirements  Components overview  Specific modules: the crawler  Purpose.

Mining the Web 82

Change interval for pages that change every 10 days on average

(Chart: fraction of changes with a given interval vs. interval in days; the observed distribution is compared against the Poisson model.)

Page 83: Mining the Web Crawling the Web. Mining the Web2 Schedule  Search engine requirements  Components overview  Specific modules: the crawler  Purpose.

Mining the Web 83

Change Metrics: Freshness

Freshness of element ei at time t is
  F(ei; t) = 1 if ei is up-to-date at time t, 0 otherwise

Freshness of the (web) database S at time t is
  F(S; t) = (1/N) Σ_{i=1..N} F(ei; t)

(Assume "equal importance" of pages.)

Page 84: Mining the Web Crawling the Web. Mining the Web2 Schedule  Search engine requirements  Components overview  Specific modules: the crawler  Purpose.

Mining the Web 84

Change Metrics: Age

Age of element ei at time t is
  A(ei; t) = 0 if ei is up-to-date at time t, t − (modification time of ei) otherwise

Age of the (web) database S at time t is
  A(S; t) = (1/N) Σ_{i=1..N} A(ei; t)

(Assume "equal importance" of pages.)

Page 85: Mining the Web Crawling the Web. Mining the Web2 Schedule  Search engine requirements  Components overview  Specific modules: the crawler  Purpose.

Mining the Web 85

Example: the collection contains 2 pages
  E1 changes 9 times a day
  E2 changes once a day

Simplified change model
  the day is split into 9 equal intervals, and E1 changes once in each interval
  E2 changes once during the entire day
  the only unknown is when the pages change within the intervals

The crawler can download only one page a day.
Our goal is to maximize the freshness.

Page 86: Mining the Web Crawling the Web. Mining the Web2 Schedule  Search engine requirements  Components overview  Specific modules: the crawler  Purpose.

Mining the Web 86

Example (2)

Page 87: Mining the Web Crawling the Web. Mining the Web2 Schedule  Search engine requirements  Components overview  Specific modules: the crawler  Purpose.

Mining the Web 87

Example (3) Which page do we refresh?

If we refresh E2 at midday
  if E2 changes in the first half of the day and we refresh at midday, it remains fresh for the remaining half of the day
  50% chance of a 0.5-day freshness increase, 50% chance of no increase
  expected freshness increase: 0.25 day

If we refresh E1 at midday
  if E1 changes in the first half of its current interval and we refresh at midday (the middle of that interval), it remains fresh for the remaining half of the interval = 1/18 of a day
  50% chance of a 1/18-day freshness increase, 50% chance of no increase
  expected freshness increase: 1/36 day

Page 88: Mining the Web Crawling the Web. Mining the Web2 Schedule  Search engine requirements  Components overview  Specific modules: the crawler  Purpose.

Mining the Web 88

Example (4)

This gives a nice estimate, but things are more complex in real life
  we are not sure that a page will change within an interval
  we also have to worry about age

Using a Poisson model shows a uniform policy always performs better than a proportional one.

Page 89: Mining the Web Crawling the Web. Mining the Web2 Schedule  Search engine requirements  Components overview  Specific modules: the crawler  Purpose.

Mining the Web 89

Example (5)

Studies have found the best policy for a similar example
  assume page changes follow a Poisson process
  assume 5 pages, which change 1, 2, 3, 4, 5 times a day

Page 90: Mining the Web Crawling the Web. Mining the Web2 Schedule  Search engine requirements  Components overview  Specific modules: the crawler  Purpose.

Distributed Crawling

Page 91: Mining the Web Crawling the Web. Mining the Web2 Schedule  Search engine requirements  Components overview  Specific modules: the crawler  Purpose.

Mining the Web 91

Approaches
  Centralized / parallel crawler
  Distributed
  P2P

Page 92: Mining the Web Crawling the Web. Mining the Web2 Schedule  Search engine requirements  Components overview  Specific modules: the crawler  Purpose.

Mining the Web 92

Distributed Crawlers

A distributed crawler consists of multiple crawling processes communicating via local network (intra-site distributed crawler) or Internet (distributed crawler) http://www2002.org/CDROM/refereed/108/index.html

Setting: we have a number of c-procs (c-proc = crawling process)

Goal: we wish to crawl the best pages with minimum overhead

Page 93: Mining the Web Crawling the Web. Mining the Web2 Schedule  Search engine requirements  Components overview  Specific modules: the crawler  Purpose.

Mining the Web 93

Crawler-process distribution

Crawling processes can run
  on the same local network (central parallel crawler), or
  at geographically distant locations (distributed crawler)

Page 94: Mining the Web Crawling the Web. Mining the Web2 Schedule  Search engine requirements  Components overview  Specific modules: the crawler  Purpose.

Mining the Web 94

Distributed model

Crawlers may be running in diverse geographic locations
Periodically update a master index
  incremental update, so this is "cheap"
  compression, differential update, etc.

Focus on communication overhead during the crawl

Page 95: Mining the Web Crawling the Web. Mining the Web2 Schedule  Search engine requirements  Components overview  Specific modules: the crawler  Purpose.

Mining the Web 95

Issues and benefits

Issues:
  overlap: minimize multiple downloads of the same pages
  quality: depends on the crawl strategy
  communication bandwidth: minimize it

Benefits:
  scalability: needed for large-scale web crawls
  costs: use of cheaper machines
  network-load dispersion and reduction: divide the Web into regions and crawl only the nearest pages

Page 96: Mining the Web Crawling the Web. Mining the Web2 Schedule  Search engine requirements  Components overview  Specific modules: the crawler  Purpose.

Mining the Web 96

Coordination A parallel crawler consists of

multiple crawling processes communicating via local network (intra-site parallel crawler) or Internet (distributed crawler)

Page 97: Mining the Web Crawling the Web. Mining the Web2 Schedule  Search engine requirements  Components overview  Specific modules: the crawler  Purpose.

Mining the Web 97

Coordination

1. Independent: no coordination; every process follows its extracted links

2. Dynamic assignment: a central coordinator dynamically divides the Web into small partitions and assigns each partition to a process

3. Static assignment: the Web is partitioned and assigned without a central coordinator before the crawl starts

Page 98: Mining the Web Crawling the Web. Mining the Web2 Schedule  Search engine requirements  Components overview  Specific modules: the crawler  Purpose.

Mining the Web 98

c-procs crawling the web

(Diagram: each c-proc keeps track of URLs crawled and URLs in its queues; the question is which c-proc gets a newly discovered URL.)

Communication: by URLs passed between c-procs.

Page 99: Mining the Web Crawling the Web. Mining the Web2 Schedule  Search engine requirements  Components overview  Specific modules: the crawler  Purpose.

Mining the Web 99

Static assignment

Links from one partition to another (inter-partition links) can be handled in:

1. Firewall mode: a process does not follow any inter-partition link

2. Cross-over mode: a process also follows inter-partition links, and so also discovers more pages in its partition

3. Exchange mode: processes exchange inter-partition URLs; this mode needs communication

(Example figure: pages a-i split between Partition 1 and Partition 2, with some links crossing the partition boundary.)

Page 100: Mining the Web Crawling the Web. Mining the Web2 Schedule  Search engine requirements  Components overview  Specific modules: the crawler  Purpose.

Mining the Web 100

Classification of parallel crawlers

If exchange mode is used, communication can be limited by:
  batch communication: every process collects some URLs and sends them in a batch
  replication: the k most popular URLs are replicated at each process and are not exchanged (known from a previous crawl or determined on the fly)

Some ways to partition the Web (see the sketch below):
  URL-hash based: many inter-partition links
  site-hash based: reduces the inter-partition links
  hierarchical: .com domain, .net domain, ...
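A sketch of site-hash-based routing; the number of c-procs and the use of MD5 are illustrative choices, not prescribed by the slides.

```python
# Sketch of site-hash-based static partitioning: the owning crawling process is
# a function of the host name only, so links within one site never leave their
# partition. NUM_PROCS and the use of MD5 are illustrative choices.
import hashlib
from urllib.parse import urlsplit

NUM_PROCS = 4

def owner(url):
    host = urlsplit(url).hostname or ""
    h = int.from_bytes(hashlib.md5(host.encode()).digest()[:8], "big")
    return h % NUM_PROCS                     # index of the responsible c-proc

def route(url, my_id, local_queue, outgoing):
    dest = owner(url)
    if dest == my_id:
        local_queue.append(url)              # intra-partition link: follow locally
    else:
        outgoing.setdefault(dest, []).append(url)   # exchange mode: batch and send later
```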

Page 101: Mining the Web Crawling the Web. Mining the Web2 Schedule  Search engine requirements  Components overview  Specific modules: the crawler  Purpose.

Mining the Web 101

Static assignment: comparison

Mode        Coverage  Overlap  Quality  Communication
Firewall    Bad       Good     Bad      Good
Cross-over  Good      Bad      Bad      Good
Exchange    Good      Good     Good     Bad

Page 102: Mining the Web Crawling the Web. Mining the Web2 Schedule  Search engine requirements  Components overview  Specific modules: the crawler  Purpose.

Mining the Web 102

UBI Crawler [2002, Boldi, Codenotti, Santini, Vigna] Features:

Full distribution: identical agents / no central coordinator

Balanced, locally computable assignment:
  each URL is assigned to one agent
  each agent can compute the URL assignment locally
  the distribution of URLs is balanced

Scalability:
  the number of crawled pages per second and per agent is independent of the number of agents

Fault tolerance:
  URLs are not statically distributed
  a distributed reassignment protocol is not reasonable

Page 103: Mining the Web Crawling the Web. Mining the Web2 Schedule  Search engine requirements  Components overview  Specific modules: the crawler  Purpose.

Mining the Web 103

UBI Crawler: Assignment

Assignment function
  A: set of agent identifiers
  L ⊆ A: set of alive agents
  m: total number of hosts
  δ_L: assigns each host h to an alive agent, δ_L(h) ∈ L

Requirements:

Balance: each agent should be responsible for approximately the same number of hosts:
  |δ_L^{-1}(a)| ≈ m / |L|  for every a ∈ L

Contravariance: if the number of agents grows, the portion of the Web crawled by each agent must shrink:
  L ⊆ L'  ⇒  δ_{L'}^{-1}(a) ⊆ δ_L^{-1}(a)
  L ⊆ L' ∧ δ_{L'}(h) ∈ L  ⇒  δ_{L'}(h) = δ_L(h)

Page 104: Mining the Web Crawling the Web. Mining the Web2 Schedule  Search engine requirements  Components overview  Specific modules: the crawler  Purpose.

Mining the Web 104

Consistent Hashing

Each bucket (agent) is replicated k times and each replica is mapped randomly onto the unit circle.
Hashing a key: compute a point on the unit circle and find the nearest replica.

Example: L = {a,b}, L' = {a,b,c}, k = 3, hosts = {0,1,..,9}

  With agents L = {a,b}:
    δ_L^{-1}(a) = {1,4,5,6,8,9}
    δ_L^{-1}(b) = {0,2,3,7}

  With agents L' = {a,b,c}:
    δ_{L'}^{-1}(a) = {4,5,6,8}
    δ_{L'}^{-1}(b) = {0,2,7}
    δ_{L'}^{-1}(c) = {1,3,9}

This realizes both requirements:
  Balancing: via the hash function and random number generator
  Contravariance: L ⊆ L' ⇒ δ_{L'}^{-1}(a) ⊆ δ_L^{-1}(a), and L ⊆ L' ∧ δ_{L'}(h) ∈ L ⇒ δ_{L'}(h) = δ_L(h)
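A sketch of consistent hashing with k replicas per agent, illustrating the balance and contravariance properties on a toy example; the 64-bit integer ring, the MD5 hash, and k = 3 are assumptions for illustration.

```python
# Sketch of consistent hashing with k replicas per agent, as used for the
# UbiCrawler-style assignment. The 64-bit integer ring, MD5, and k = 3 are
# assumptions; the assertion checks the contravariance property on toy data.
import bisect
import hashlib

def point(label):
    """Deterministic point on the 'unit circle' (here: a 64-bit integer ring)."""
    return int.from_bytes(hashlib.md5(label.encode()).digest()[:8], "big")

class ConsistentHash:
    def __init__(self, agents, k=3):
        # each agent gets k replicas mapped pseudo-randomly onto the ring
        self.ring = sorted((point(f"{a}#{i}"), a) for a in agents for i in range(k))
        self.keys = [p for p, _ in self.ring]

    def assign(self, host):
        i = bisect.bisect(self.keys, point(host)) % len(self.ring)
        return self.ring[i][1]               # agent owning the nearest replica

hosts = [f"host{i}" for i in range(10)]      # stand-ins for the hosts 0..9
old = ConsistentHash(["a", "b"])             # L  = {a, b}
new = ConsistentHash(["a", "b", "c"])        # L' = {a, b, c}
# Contravariance: any host that a or b still owns under L' was already theirs under L.
assert all(old.assign(h) == new.assign(h)
           for h in hosts if new.assign(h) in {"a", "b"})
```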

Page 105: Mining the Web Crawling the Web. Mining the Web2 Schedule  Search engine requirements  Components overview  Specific modules: the crawler  Purpose.

Mining the Web 105

UBI Crawler: fault tolerance

Up to now: no metrics for estimating the fault tolerance of distributed crawlers

Each agent has its own view of the set of alive agents (views can differ), but two agents will never dispatch the same host to two different agents.

Agents can be added dynamically in a self-stabilizing way

(Diagram: agents a, b, c, d, one of which has died; its hosts are taken over by the surviving agents.)

Page 106: Mining the Web Crawling the Web. Mining the Web2 Schedule  Search engine requirements  Components overview  Specific modules: the crawler  Purpose.

Mining the Web 106

Evaluation metrics (1)

1. Overlap:

  Overlap = (N − I) / I
  N: total number of fetched pages
  I: number of distinct fetched pages
  Goal: minimize the overlap

2. Coverage:

  Coverage = I / U
  U: total number of Web pages
  Goal: maximize the coverage

Page 107: Mining the Web Crawling the Web. Mining the Web2 Schedule  Search engine requirements  Components overview  Specific modules: the crawler  Purpose.

Mining the Web 107

Evaluation metrics (2)

3. Communication overhead:

  Overhead = M / P
  M: number of exchanged messages (URLs)
  P: number of downloaded pages
  Goal: minimize the overhead

4. Quality:

  Quality = (1/N) Σ_i PageRank(p_i)   (or backlink count, or comparison with an oracle crawler)
  Goal: maximize the quality

Page 108: Mining the Web Crawling the Web. Mining the Web2 Schedule  Search engine requirements  Components overview  Specific modules: the crawler  Purpose.

Mining the Web 108

Experiments

40M-URL graph – Stanford WebBase

Open Directory (dmoz.org) URLs as seeds

Should be considered a small Web

Page 109: Mining the Web Crawling the Web. Mining the Web2 Schedule  Search engine requirements  Components overview  Specific modules: the crawler  Purpose.

Mining the Web 109

Firewall mode coverage

The price of crawling in firewall mode

Page 110: Mining the Web Crawling the Web. Mining the Web2 Schedule  Search engine requirements  Components overview  Specific modules: the crawler  Purpose.

Mining the Web 110

Crossover mode overlap

Demanding coverage drives up overlap

Page 111: Mining the Web Crawling the Web. Mining the Web2 Schedule  Search engine requirements  Components overview  Specific modules: the crawler  Purpose.

Mining the Web 111

Exchange mode communication

Communication overhead sublinear

Per downloaded URL

Page 112: Mining the Web Crawling the Web. Mining the Web2 Schedule  Search engine requirements  Components overview  Specific modules: the crawler  Purpose.

Mining the Web 112

Cho's conclusions

With < 4 crawling processes running in parallel, firewall mode provides good coverage

Firewall mode is not appropriate when:
  > 4 crawling processes are used
  only a small subset of the Web is downloaded and the quality of the downloaded pages is important

Exchange mode
  consumes < 1% of the network bandwidth for URL exchanges
  maximizes the quality of the downloaded pages
  by replicating 10,000 - 100,000 popular URLs, communication overhead is reduced by about 40%

