Mining the Web
Crawling the Web
Mining the Web 2
Schedule
Search engine requirements
Components overview
Specific modules: the crawler
  Purpose, implementation, performance metrics
Mining the Web 3
What does it do?
Processes users' queries
Finds pages with related information
Returns a list of resources
Is it really that simple?
Mining the Web 4
What does it do?
Processes users' queries
  How is a query represented?
Finds pages with related information
Returns a list of resources
Is it really that simple?
Mining the Web 5
What does it do?
Processes users' queries
Finds pages with related information
  How do we find pages? Where in the web do we look? How do we match query and documents?
Returns a list of resources
Is it really that simple?
Mining the Web 6
What does it do?
Processes users' queries
Finds pages with related information
Returns a list of resources
  In what order? How are the pages ranked?
Is it really that simple?
Mining the Web 7
What does it do?
Processes users' queries
Finds pages with related information
Returns a list of resources
Is it really that simple?
  Limited resources
  Time/quality tradeoff
Mining the Web 8
Search Engine Structure
General design: crawling, storage, indexing, ranking
Mining the Web 9
Search Engine Structure
[Diagram: crawlers, directed by crawl control, fill the page repository; the indexer and collection analysis modules build the indexes (text, structure, utility); the query engine and ranking module answer user queries and return results.]
Mining the Web 10
Is it an IR system?
The web is
  used by millions
  contains lots of information
  link based
  incoherent
  changes rapidly
  distributed
Traditional information retrieval was built with the exact opposite in mind
Mining the Web 11
Web Dynamics
Size
  ~10 billion public indexable pages
  10 kB / page → 100 TB
  Doubles every 18 months
Dynamics
  33% change weekly
  8% new pages every week
  25% new links every week
Mining the Web 12
Weekly change
Fetterly, Manasse, Najork, Wiener 2003
Mining the Web 13
Collecting “all” Web pages
For searching, for classifying, for mining, etc.
Problems: no catalog of all accessible URLs on the Web
Volume, latency, duplication, dynamicity, etc.
Mining the Web 14
The Crawler: a program that downloads and stores web pages. It starts off by placing an initial set of URLs, S0, in a queue, where all URLs to be retrieved are kept and prioritized.
From this queue, the crawler gets a URL (in some order), downloads the page, extracts any URLs in the downloaded page, and puts the new URLs in the queue.
This process is repeated until the crawler decides to stop.
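A minimal sketch of this loop in Python (the fetch and link-extraction steps are deliberately simplified; politeness, robots.txt checks, and error handling discussed later are omitted):

```python
# Minimal single-threaded crawler loop: seed queue, fetch, extract links, repeat.
# A sketch only -- real crawlers add politeness, robots.txt checks, and error handling.
from collections import deque
from urllib.parse import urljoin
from urllib.request import urlopen
import re

def crawl(seed_urls, max_pages=100):
    frontier = deque(seed_urls)          # S0: initial set of URLs, kept in a queue
    seen = set(seed_urls)                # avoid re-queueing known URLs
    pages = {}
    while frontier and len(pages) < max_pages:
        url = frontier.popleft()         # pick the next URL (FIFO = breadth-first)
        try:
            html = urlopen(url, timeout=10).read().decode("utf-8", "replace")
        except Exception:
            continue                     # skip pages that fail to download
        pages[url] = html                # store the downloaded page
        for href in re.findall(r'href="([^"]+)"', html):
            absolute = urljoin(url, href)          # resolve relative URLs
            if absolute.startswith("http") and absolute not in seen:
                seen.add(absolute)
                frontier.append(absolute)          # put new URLs in the queue
    return pages
```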
Mining the Web 15
Crawling Issues
How to crawl?
  Quality: "best" pages first
  Efficiency: avoid duplication (or near duplication)
  Etiquette: robots.txt, server load concerns
How much to crawl? How much to index?
  Coverage: how big is the Web? How much do we cover?
  Relative coverage: how much do competitors have?
How often to crawl?
  Freshness: how much has changed? How much has really changed? (Why is this a different question?)
Mining the Web 16
Before discussing crawling policies…
Some implementation issues
Mining the Web 17
HTML (HyperText Markup Language)
Lets the author
  specify layout and typeface
  embed diagrams
  create hyperlinks
A hyperlink is expressed as an anchor tag with an HREF attribute; HREF names another page using a Uniform Resource Locator (URL):
URL = protocol field ("HTTP") + server hostname ("www.cse.iitb.ac.in") + file path (/, the 'root' of the published file system)
Mining the Web 18
HTTP (HyperText Transfer Protocol)
Built on top of the Transmission Control Protocol (TCP)
Steps (from the client end):
  Resolve the server host name to an Internet address (IP)
    Use the Domain Name System (DNS): a distributed database of name-to-IP mappings maintained at a set of known servers
  Contact the server using TCP
    Connect to the default HTTP port (80) on the server
    Send the HTTP request header (e.g.: GET)
    Fetch the response header
      MIME (Multipurpose Internet Mail Extensions): a meta-data standard for email and Web content transfer
    Fetch the HTML page
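The same steps, sketched with a raw socket in Python (the hostname and path are illustrative; a real client would also handle redirects, chunked transfer, and non-default ports):

```python
# Sketch of the client-side steps: DNS resolution, TCP connect, HTTP request, response.
import socket

host, path = "www.cse.iitb.ac.in", "/"    # illustrative host and path
ip = socket.gethostbyname(host)            # 1. resolve the host name via DNS
sock = socket.create_connection((ip, 80), timeout=10)   # 2. TCP connect to port 80
request = (f"GET {path} HTTP/1.0\r\n"      # 3. send the HTTP request header
           f"Host: {host}\r\n\r\n")
sock.sendall(request.encode("ascii"))
response = b""
while chunk := sock.recv(4096):            # 4. read the response header + HTML page
    response += chunk
sock.close()
header, _, body = response.partition(b"\r\n\r\n")
print(header.decode("iso-8859-1").splitlines()[0])   # status line, e.g. "HTTP/1.1 200 OK"
```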
Mining the Web 19
Crawling procedure: simple in principle
A great deal of engineering goes into industry-strength crawlers
Industry crawlers crawl a substantial fraction of the Web
  E.g.: Google, Yahoo
No guarantee that all accessible Web pages will be located
The crawler may never halt: pages will be added continually even as it is running
Mining the Web 20
Crawling overheads
Delays involved in
  Resolving the host name in the URL to an IP address using DNS
  Connecting a socket to the server and sending the request
  Receiving the requested page in response
Solution: overlap the above delays by fetching many pages at the same time
Mining the Web 21
Anatomy of a crawler
Page fetching by (logical) threads
  Starts with DNS resolution
  Finishes when the entire page has been fetched
Each page
  stored in compressed form to disk/tape
  scanned for outlinks
Work pool of outlinks
  maintain network utilization without overloading it
  dealt with by the load manager
Continue till the crawler has collected a sufficient number of pages
Mining the Web 22
Typical anatomy of a large-scale crawler.
Mining the Web 23
Large-scale crawlers: performance and reliability considerations
Need to fetch many pages at the same time
  to utilize the network bandwidth
  a single page fetch may involve several seconds of network latency
Highly concurrent and parallelized DNS lookups
Multi-processing or multi-threading: impractical at this low level
Use of asynchronous sockets
  explicit encoding of the state of a fetch context in a data structure
  polling sockets to check for completion of network transfers
Care in URL extraction
  eliminating duplicates to reduce redundant fetches
  avoiding "spider traps"
Mining the Web 24
DNS caching, pre-fetching and resolution
A customized DNS component with:
  a custom client for address resolution
  a caching server
  a prefetching client
Mining the Web 25
Custom client for address resolution
Tailored for concurrent handling of multiple outstanding requests
Allows issuing many resolution requests together and polling at a later time for completion of individual requests
Facilitates load distribution among many DNS servers.
Mining the Web 26
Caching server With a large cache, persistent across DNS restarts
Residing largely in memory if possible.
Mining the Web 27
Prefetching client
Steps:
  Parse a page that has just been fetched
  Extract host names from HREF targets
  Make DNS resolution requests to the caching server
Usually implemented using UDP (User Datagram Protocol)
  connectionless, packet-based communication protocol
  does not guarantee packet delivery
Does not wait for resolution to be completed
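A rough approximation of the caching and prefetching idea in Python (a thread pool and getaddrinfo-style resolution stand in for the custom UDP client; all names here are illustrative):

```python
# Sketch: cache DNS answers and issue prefetch requests without waiting for them.
import socket
from concurrent.futures import ThreadPoolExecutor
from urllib.parse import urlparse

dns_cache = {}                                   # host -> IP, kept in memory
resolver = ThreadPoolExecutor(max_workers=32)    # stands in for the caching server

def _resolve(host):
    try:
        dns_cache[host] = socket.gethostbyname(host)
    except socket.gaierror:
        dns_cache[host] = None                   # remember failures too

def prefetch_dns(outlinks):
    """Fire-and-forget resolution for hosts seen in a page's HREF targets."""
    for url in outlinks:
        host = urlparse(url).hostname
        if host and host not in dns_cache:
            resolver.submit(_resolve, host)      # do not wait for completion
```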
Mining the Web 28
Multiple concurrent fetches
Managing multiple concurrent connections
  A single download may take several seconds
  Open many socket connections to different HTTP servers simultaneously
Multi-CPU machines not useful: crawling performance limited by network and disk
Two approaches
  using multi-threading
  using non-blocking sockets with event handlers
Mining the Web 29
Multi-threading
Threads
  physical threads of control provided by the operating system (e.g. pthreads), OR
  concurrent processes
Fixed number of threads allocated in advance
Programming paradigm (using blocking system calls):
  create a client socket
  connect the socket to the HTTP service on a server
  send the HTTP request header
  read the socket (recv) until no more characters are available
  close the socket
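A sketch of one such worker thread in Python, using blocking calls exactly as listed above (the frontier and page-store objects are assumed to be thread-safe, e.g. queue.Queue):

```python
# Sketch of a blocking fetch worker; several of these run as concurrent threads.
import socket
import threading
from urllib.parse import urlparse

def fetch_worker(frontier, store):
    """frontier.get() and store.put() are assumed thread-safe (e.g. queue.Queue)."""
    while True:
        url = frontier.get()                       # blocks until work is available
        parts = urlparse(url)
        try:
            sock = socket.create_connection((parts.hostname, parts.port or 80),
                                            timeout=10)          # create + connect
            sock.sendall(f"GET {parts.path or '/'} HTTP/1.0\r\n"
                         f"Host: {parts.hostname}\r\n\r\n".encode())  # send header
            data = b""
            while chunk := sock.recv(4096):        # blocking recv until EOF
                data += chunk
            sock.close()
            store.put((url, data))                 # hand the page to the repository
        except OSError:
            pass                                   # drop failed fetches
        finally:
            frontier.task_done()

# e.g. start a fixed number of threads allocated in advance:
# for _ in range(64):
#     threading.Thread(target=fetch_worker, args=(frontier, store), daemon=True).start()
```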
Mining the Web 30
Multi-threading: problems
Performance penalty from mutual exclusion (concurrent access to shared data structures)
Slow disk seeks: a great deal of interleaved, random input-output on disk, due to concurrent modification of the document repository by multiple threads
Mining the Web 31
Non-blocking sockets and event handlers
Non-blocking sockets
  connect, send or recv calls return immediately without waiting for the network operation to complete
  poll the status of the network operation separately
"select" system call
  lets the application suspend until more data can be read from or written to a socket, timing out after a pre-specified deadline
  monitors several sockets at the same time
More efficient memory management: code that completes processing a page is not interrupted by other completions
No need for locks and semaphores on the pool; only append complete pages to the log
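A condensed sketch of this event-driven alternative using select() in Python (only the polling skeleton; connecting the sockets, writing requests, and parsing pages are elided, and the per-socket state dict is an assumed structure):

```python
# Sketch: many non-blocking sockets polled with select(); one thread, no locks.
import select
import socket

def poll_fetches(pending):
    """pending: dict mapping an already-connected non-blocking socket -> fetch state."""
    completed = []
    while pending:
        # wait until some sockets are readable, with a 5-second timeout
        readable, _, _ = select.select(list(pending), [], [], 5.0)
        for sock in readable:
            state = pending[sock]
            chunk = sock.recv(4096)            # returns immediately: data is ready
            if chunk:
                state["data"] += chunk         # partial page; keep the fetch context
            else:                              # empty read = transfer complete
                sock.close()
                completed.append(state)        # append the finished page to the log
                del pending[sock]
    return completed
```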
Mining the Web 32
Link extraction and normalization
Goal: obtain a canonical form of each URL
URL processing and filtering
Avoid multiple fetches of pages known by different URLs
  many IP addresses for one host, for load balancing on large sites
  mirrored contents / contents on the same file system ("proxy pass")
  mapping of different host names to a single IP address (need to publish many logical sites)
  relative URLs: need to be interpreted w.r.t. a base URL
Mining the Web 33
Canonical URL
Formed by
  using a standard string for the protocol
  canonicalizing the host name
  adding an explicit port number
  normalizing and cleaning up the path
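A sketch of such canonicalization with Python's urllib (the exact normalization rules vary between crawlers):

```python
# Sketch: reduce a URL to a canonical form before de-duplication.
from urllib.parse import urlsplit, urlunsplit

DEFAULT_PORTS = {"http": 80, "https": 443}

def canonical_url(url):
    parts = urlsplit(url.strip())
    scheme = parts.scheme.lower()                             # standard protocol string
    host = parts.hostname.lower() if parts.hostname else ""   # canonical host name
    port = parts.port or DEFAULT_PORTS.get(scheme, "")        # explicit port number
    path = parts.path or "/"                                  # clean up the path
    while "//" in path:
        path = path.replace("//", "/")
    return urlunsplit((scheme, f"{host}:{port}", path, parts.query, ""))

# e.g. canonical_url("HTTP://Example.COM//a//b") -> "http://example.com:80/a/b"
```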
Mining the Web 34
Robot exclusion
Check whether the server prohibits crawling a normalized URL
  The robots.txt file in the HTTP root directory of the server specifies a list of path prefixes which crawlers should not attempt to fetch
  Meant for crawlers only
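Python's standard library ships a parser for this file; a brief sketch of the check (the user-agent name is illustrative):

```python
# Sketch: honor the server's robots.txt before fetching a normalized URL.
from urllib.robotparser import RobotFileParser
from urllib.parse import urlsplit, urlunsplit

def allowed(url, agent="MyCrawler"):
    parts = urlsplit(url)
    robots_url = urlunsplit((parts.scheme, parts.netloc, "/robots.txt", "", ""))
    rp = RobotFileParser(robots_url)
    rp.read()                        # fetch and parse the path-prefix rules
    return rp.can_fetch(agent, url)  # False if a disallowed prefix matches

# In practice the parsed robots.txt is cached per host rather than re-read per URL.
```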
Mining the Web 35
Eliminating already-visited URLs
Checking if a URL has already been fetched
  before adding a new URL to the work pool
  needs to be very quick
  achieved by computing an MD5 hash function on the URL
Exploiting spatio-temporal locality of access: a two-level hash function
  most significant bits (say, 24) derived by hashing the host name plus port
  lower-order bits (say, 40) derived by hashing the path
  concatenated bits used as a key in a B-tree
Qualifying URLs added to the frontier of the crawl; hash values added to the B-tree
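A sketch of the two-level key in Python (the bit widths follow the 24 + 40 split mentioned above; an in-memory set stands in for the on-disk B-tree):

```python
# Sketch: two-level MD5 key so URLs from the same host get nearby keys.
import hashlib
from urllib.parse import urlsplit

def url_key(url):
    parts = urlsplit(url)
    host = f"{parts.hostname}:{parts.port or 80}"
    host_bits = int(hashlib.md5(host.encode()).hexdigest(), 16) >> (128 - 24)
    path_bits = int(hashlib.md5(parts.path.encode()).hexdigest(), 16) >> (128 - 40)
    return (host_bits << 40) | path_bits        # 64-bit key: host bits then path bits

seen_keys = set()                               # stands in for the B-tree

def already_visited(url):
    key = url_key(url)
    if key in seen_keys:
        return True
    seen_keys.add(key)                          # new URL: record it and crawl it
    return False
```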
Mining the Web 36
Spider traps
Protecting the crawler from crashing on
  ill-formed HTML (e.g.: a page with 68 kB of null characters)
  misleading sites
    an indefinite number of pages dynamically generated by CGI scripts
    paths of arbitrary depth created using soft directory links and path remapping features in the HTTP server
Mining the Web 37
Spider traps: solutions
No automatic technique can be foolproof
Check for URL length
Guards
  prepare regular crawl statistics
  add dominating sites to the guard module
  disable crawling of active content such as CGI form queries
  eliminate URLs with non-textual data types
Mining the Web 38
Avoiding repeated expansion of links on duplicate pages
Reduce redundancy in crawls: duplicate detection
  mirrored Web pages and sites
Detecting exact duplicates
  checking against MD5 digests of stored URLs
  representing a relative link v (relative to aliases u1 and u2) as tuples (h(u1), v) and (h(u2), v)
Detecting near-duplicates
  even a single altered character will completely change the digest!
  e.g.: date of update / name and email of the site administrator
Mining the Web 39
Load monitor
Keeps track of various system statistics
  recent performance of the wide area network (WAN) connection, e.g. latency and bandwidth estimates
  operator-provided/estimated upper bound on open sockets for a crawler
  current number of active sockets
Mining the Web 40
Thread manager
Responsible for
  choosing units of work from the frontier
  scheduling the issue of network resources
  distributing these requests over multiple ISPs if appropriate
Uses statistics from the load monitor
Mining the Web 41
Per-server work queues
Denial of service (DoS) protection: servers limit the speed or frequency of responses to any fixed client IP address
Avoiding triggering DoS defenses
  limit the number of active requests to a given server IP address at any time
  maintain a queue of requests for each server
  use the HTTP/1.1 persistent socket capability
Distribute attention relatively evenly between a large number of sites
Access locality vs. politeness dilemma
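A sketch of per-server queues with a minimum delay between requests to the same host (the 2-second delay is an arbitrary illustrative value):

```python
# Sketch: one queue per server plus a "next allowed time" so no host is hammered.
import time
from collections import defaultdict, deque
from urllib.parse import urlsplit

PER_HOST_DELAY = 2.0                       # illustrative politeness interval (seconds)
host_queues = defaultdict(deque)           # host -> queue of pending URLs
next_allowed = defaultdict(float)          # host -> earliest time we may contact it

def enqueue(url):
    host_queues[urlsplit(url).hostname].append(url)

def next_url():
    """Pick a URL from some host whose politeness delay has expired."""
    now = time.monotonic()
    for host, queue in host_queues.items():
        if queue and now >= next_allowed[host]:
            next_allowed[host] = now + PER_HOST_DELAY
            return queue.popleft()
    return None                            # nothing is currently eligible
```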
Mining the Web 42
Crawling Issues
How to crawl?
  Quality: "best" pages first
  Efficiency: avoid duplication (or near duplication)
  Etiquette: robots.txt, server load concerns
How much to crawl? How much to index?
  Coverage: how big is the Web? How much do we cover?
  Relative coverage: how much do competitors have?
How often to crawl?
  Freshness: how much has changed? How much has really changed? (Why is this a different question?)
Mining the Web 43
Crawl Order
Want best pages first
Potential quality measures:
  final in-degree
  final PageRank
Crawl heuristics:
  breadth-first search (BFS)
  partial in-degree
  partial PageRank
  random walk
Mining the Web
Breadth-First Crawl
Basic idea:
  start at a set of known URLs
  explore in "concentric circles" around these URLs
start pages → distance-one pages → distance-two pages
Used by broad web search engines; balances load between servers
Mining the Web 45
Web Wide Crawl (328M pages) [Najo01]
BFS crawling brings in high-quality pages early in the crawl
[Figure: overlap with the best x% by in-degree vs. the fraction x% crawled by ordering O(u) — Stanford WebBase (179K pages) [Cho98]]
Mining the Web 47
Queue of URLs to be fetched
What constraints dictate which queued URL is fetched next?
Politeness: don't hit a server too often, even from different threads of your spider
How far into a site you've crawled already: for most sites, stay at ≤ 5 levels of the URL hierarchy
Which URLs are most promising for building a high-quality corpus?
This is a graph traversal problem: given a directed graph you've partially visited, where do you visit next?
Mining the Web 48
Where do we crawl next?
A complex scheduling optimization problem, subject to the constraints above
Plus operational constraints (e.g., keeping all machines load-balanced)
Scientific study: limited to specific aspects
  Which ones? What do we measure?
What are the compromises in distributed crawling?
Mining the Web 49
Page selection
Importance metric
Web crawler model
Crawler method for choosing the page to download
Mining the Web 50
Importance Metrics
Given a page P, define how "good" that page is
Several metric types:
  interest driven
  popularity driven
  location driven
  combined
Mining the Web 51
Interest Driven
Define a driving query Q
Find textual similarity between P and Q
  Define a word vocabulary t1…tn
  Define a vector for P and for Q: Vp, Vq = <w1,…,wn>
    wi = 0 if ti does not appear in the document
    wi = IDF(ti) = 1 / number of pages containing ti otherwise
  Importance: IS(P) = Vp · Vq (cosine product)
Finding IDF requires going over the entire web
  Estimate IDF from pages already visited, to calculate IS'(P)
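A sketch of the estimated score IS'(P) in Python, with IDF computed from the pages crawled so far (tokenization and weighting are simplified):

```python
# Sketch: IS'(P) -- textual similarity of page P to the driving query Q,
# with IDF estimated from the set of already-visited pages.
import math
import re

def terms(text):
    return set(re.findall(r"[a-z]+", text.lower()))

def interest_score(page_text, query_text, visited_pages):
    """visited_pages: texts of pages fetched so far (used for the IDF estimate)."""
    df = {}                                             # term -> document frequency
    for doc in visited_pages:
        for t in terms(doc):
            df[t] = df.get(t, 0) + 1

    def weight(t):
        return 1.0 / df[t] if t in df else 0.0          # wi = 1 / pages containing ti

    p_terms, q_terms = terms(page_text), terms(query_text)
    dot = sum(weight(t) ** 2 for t in p_terms & q_terms)          # Vp . Vq
    norm_p = math.sqrt(sum(weight(t) ** 2 for t in p_terms)) or 1.0
    norm_q = math.sqrt(sum(weight(t) ** 2 for t in q_terms)) or 1.0
    return dot / (norm_p * norm_q)                      # cosine of the two vectors
```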
Mining the Web 52
Popularity Driven
How popular a page is:
  Backlink count: IB(P) = the number of pages containing a link to P
  Estimate from previous crawls: IB'(P)
  More sophisticated metric, e.g. PageRank: IR(P)
Mining the Web 53
Location Driven
IL(P): a function of the URL to P
  words appearing in the URL
  number of "/" in the URL
Easily evaluated; requires no data from previous crawls
Mining the Web 54
Combined Metrics
IC(P): a function of several other metrics
Allows using local metrics for a first stage and estimated metrics for a second stage
IC(P) = a·IS(P) + b·IB(P) + c·IL(P)
Mining the Web 55
Crawling Issues
How to crawl?
  Quality: "best" pages first
  Efficiency: avoid duplication (or near duplication)
  Etiquette: robots.txt, server load concerns
How much to crawl? How much to index?
  Coverage: how big is the Web? How much do we cover?
  Relative coverage: how much do competitors have?
How often to crawl?
  Freshness: how much has changed? How much has really changed? (Why is this a different question?)
Mining the Web 56
Crawler Models
A crawler
  tries to visit more important pages first
  only has estimates of importance metrics
  can only download a limited amount
How well does a crawler perform?
  Crawl and Stop
  Crawl and Stop with Threshold
Mining the Web 57
Crawl and Stop
A crawler stops after visiting K pages
A perfect crawler visits pages with ranks R1,…,RK; these are called the top pages
A real crawler visits only M < K top pages
Mining the Web 58
Crawl and Stop with Threshold
A crawler stops after visiting T top pages
  Top pages are pages with an importance metric higher than a threshold G
  The crawler continues until the threshold T is reached
Mining the Web 59
Ordering Metrics
The crawler's queue is prioritized according to an ordering metric
The ordering metric is based on an importance metric
  location metrics: used directly
  popularity metrics: via estimates from previous crawls
  similarity metrics: via estimates from anchor text
Mining the Web 60
Focused Crawling (Chakrabarti)
Distributed federation of focused crawlers
Supervised topic classifier controls the priority of the unvisited frontier
Trained on document samples from a Web directory (Dmoz)
Mining the Web 61
Motivation
Let's relax the problem space: "focus" on a restricted target space of Web pages
  that may be of some "type" (e.g., homepages)
  that may be of some "topic" (CS, quantum physics)
The "focused" crawling effort would use far fewer resources, be more timely, and be more suitable for indexing and searching purposes
Mining the Web 62
Motivation
Goal: design and implement a focused Web crawler that would
  gather only pages on a particular "topic" (or class)
  use effective heuristics while choosing the next page to download
Mining the Web 63
Focused crawling“A focused crawler seeks and acquires [...] pages on a specific set of topics representing a relatively narrow segment of the Web.” (Soumen Chakrabarti)
The underlying paradigm is Best-First Search instead of Breadth-First Search
Mining the Web 64
Breadth vs. Best First Search
Mining the Web 65
Two fundamental questions
Q1: How to decide whether a downloaded page is on-topic, or not?
Q2: How to choose the next page to visit?
Mining the Web 66
Chakrabarti's focused crawler
  A1: determines page relevance using a text classifier
  A2: adds URLs to a max-priority queue with their parent page's score and visits them in descending order
What is original is the use of a text classifier!
Mining the Web 67
Page relevance: testing the classifier
The user determines the focus topics
The crawler calls the classifier and obtains a score for each downloaded page
The classifier returns a sorted list of classes and scores, e.g. (A 80%, B 10%, C 7%, D 1%, ...)
The classifier determines the page relevance!
Mining the Web 68
Visit order
The radius-1 hypothesis: if page u is an on-topic example and u links to v, then the probability that v is on-topic is higher than the probability that a randomly chosen Web page is on-topic.
Mining the Web 69
Visit order: case 1
Hard-focus crawling: if a downloaded page is off-topic, stop following hyperlinks from this page
  Assume the target is class B, and for page P the classifier gives: A 80%, B 10%, C 7%, D 1%, ...
  Do not follow P's links at all!
Mining the Web 70
Visit order: case 2
Soft-focus crawling:
  obtains a page's relevance score (a score on the page's relevance to the target topic)
  assigns this score to every URL extracted from this particular page, and adds them to the priority queue
  Example: A 80%, B 10%, C 7%, D 1%, ... → insert P's links with score 0.10 into the priority queue
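A sketch of this soft-focus ordering with a max-priority queue in Python (heapq is a min-heap, so scores are negated; fetch and classify are assumed helpers returning page HTML and a relevance score in [0,1]):

```python
# Sketch: soft-focus crawling -- children inherit the parent's relevance score
# and are visited in descending order of that score.
import heapq
import re

frontier = []                                   # max-priority queue of (-score, url)

def extract_links(html):
    return re.findall(r'href="(http[^"]+)"', html)

def add_outlinks(parent_score, outlinks):
    for url in outlinks:
        heapq.heappush(frontier, (-parent_score, url))   # negate: highest score first

def crawl_step(fetch, classify):
    """fetch(url) -> html and classify(html) -> relevance score are assumed helpers."""
    if not frontier:
        return None
    neg_score, url = heapq.heappop(frontier)    # most promising URL so far
    html = fetch(url)
    score = classify(html)                      # e.g. 0.10 if target class B gets 10%
    add_outlinks(score, extract_links(html))    # children inherit this page's score
    return url, score
```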
Mining the Web 71
Basic Focused Crawler
Mining the Web 72
Comparisons
Start the baseline crawler from the URLs in one topic
Fetch up to 20,000-25,000 pages
For each pair of fetched pages (u,v), add an item to the training set of the apprentice
Train the apprentice
Start the enhanced crawler from the same set of pages
Fetch about the same number of pages
Mining the Web 73
Results
Mining the Web 74
Controversy
Chakrabarti claims the focused crawler is superior to breadth-first crawling
Suel claims the contrary, and that the argument was based on experiments with poorly performing crawlers
Mining the Web 75
Crawling Issues
How to crawl?
  Quality: "best" pages first
  Efficiency: avoid duplication (or near duplication)
  Etiquette: robots.txt, server load concerns
How much to crawl? How much to index?
  Coverage: how big is the Web? How much do we cover?
  Relative coverage: how much do competitors have?
How often to crawl?
  Freshness: how much has changed? How much has really changed? (Why is this a different question?)
Mining the Web 76
Determining page changes
The "Expires" HTTP response header: for pages that come with an expiry date
Otherwise need to guess whether revisiting the page will yield a modified version
  maintain a score reflecting the probability of the page being modified
  the crawler fetches URLs in decreasing order of score
Assumption: the recent past predicts the future
Mining the Web 77
Estimating page change rates
Brewington and Cybenko; Cho
  Algorithms for maintaining a crawl in which most pages are fresher than a specified epoch
  Prerequisite: the average interval at which the crawler checks for changes is smaller than the inter-modification time of a page
Small-scale intermediate crawler runs to monitor fast-changing sites
  E.g.: current news, weather, etc.
  Intermediate indices patched into the master index
Mining the Web 78
Refresh Strategy
Crawlers can refresh only a certain number of pages in a period of time
The page download resource can be allocated in many ways
The proportional refresh policy allocates the resource proportionally to the pages' change rates
Mining the Web 79
Average Change Interval
[Figure: fraction of pages (0.00-0.35) vs. average change interval, binned as ≤ 1 day, 1 day-1 week, 1 week-1 month, 1 month-4 months, > 4 months]
Mining the Web 80
Change Interval - By Domain
[Figure: fraction of pages (0-0.6) vs. average change interval, same bins, broken down by domain: .com, .net, .org, .edu, .gov]
Mining the Web 81
Modeling Web Evolution
Poisson process with rate λ; T is the time to the next event (change):
  f_T(t) = λ e^(-λt)   (t > 0)
Mining the Web 82
[Figure: change intervals for pages that change every 10 days on average — fraction of changes with a given interval (in days), compared with the Poisson model]
Mining the Web 83
Change Metrics: Freshness
Freshness of element ei at time t:
  F(ei; t) = 1 if ei is up-to-date at time t, 0 otherwise
Freshness of the web database S (of N elements) at time t:
  F(S; t) = (1/N) Σ_{i=1..N} F(ei; t)
(Assume "equal importance" of pages)
Mining the Web 84
Change Metrics: Age
Age of element ei at time t:
  A(ei; t) = 0 if ei is up-to-date at time t, t − (modification time of ei) otherwise
Age of the web database S (of N elements) at time t:
  A(S; t) = (1/N) Σ_{i=1..N} A(ei; t)
(Assume "equal importance" of pages)
Mining the Web 85
Example
The collection contains 2 pages
  E1 changes 9 times a day; E2 changes once a day
Simplified change model
  The day is split into 9 equal intervals, and E1 changes once in each interval
  E2 changes once during the entire day
  The only unknown is when the pages change within their intervals
The crawler can download only one page a day
Our goal is to maximize the freshness
Mining the Web 86
Example (2)
Mining the Web 87
Example (3): which page do we refresh?
If we refresh E2 at midday
  If E2 changes in the first half of the day and we refresh at midday, it remains fresh for the remaining half of the day
  50% chance of a 0.5-day freshness increase, 50% chance of no increase → expected freshness increase of 0.25 day
If we refresh E1 at midday
  If E1 changes in the first half of its interval and we refresh at midday (which is the middle of the interval), it remains fresh for the remaining half of the interval = 1/18 of a day
  50% chance of a 1/18-day freshness increase, 50% chance of no increase → expected freshness increase of 1/36 day
Mining the Web 88
Example (4)
This gives a nice estimate, but things are more complex in real life
  Not sure that a page will change within an interval
  Have to worry about age
Using a Poisson model, one can show that a uniform refresh policy always performs better than a proportional one.
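A small Monte Carlo sketch of that comparison (the change rates echo the example above, the refresh budget of one page per day is the same; changes are approximated as a Poisson process on a fine time grid):

```python
# Sketch: compare uniform vs. proportional refresh for pages with different
# Poisson change rates, measuring time-averaged freshness.
import random

def avg_freshness(change_rates, refresh_rates, days=1000, steps_per_day=100):
    dt = 1.0 / steps_per_day
    fresh_time, total_time = 0.0, 0.0
    for lam, r in zip(change_rates, refresh_rates):
        refresh_interval = 1.0 / r if r > 0 else float("inf")
        next_refresh, changed, t = refresh_interval, False, 0.0
        while t < days:
            if random.random() < lam * dt:       # page changes in this small step
                changed = True
            if t >= next_refresh:                # crawler re-downloads the page
                changed = False
                next_refresh += refresh_interval
            fresh_time += dt * (not changed)     # count time the copy is up-to-date
            total_time += dt
            t += dt
    return fresh_time / total_time

rates = [9.0, 1.0]                 # E1 changes 9 times/day, E2 once/day
budget = 1.0                       # the crawler can refresh one page per day in total
uniform = [budget / 2, budget / 2]
proportional = [budget * r / sum(rates) for r in rates]
print("uniform:     ", round(avg_freshness(rates, uniform), 3))
print("proportional:", round(avg_freshness(rates, proportional), 3))
```

Running this sketch, the uniform allocation yields higher average freshness than the proportional one, in line with the claim above.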
Mining the Web 89
Example (5)
Studies have found the best policy for a similar example
  Assume page changes follow a Poisson process
  Assume 5 pages, which change 1, 2, 3, 4, 5 times a day
Distributed Crawling
Mining the Web 91
Approaches
  Centralized
  Parallel crawler
  Distributed
  P2P
Mining the Web 92
Distributed Crawlers
A distributed crawler consists of multiple crawling processes communicating via local network (intra-site distributed crawler) or Internet (distributed crawler) http://www2002.org/CDROM/refereed/108/index.html
Setting: we have a number of c-procs (c-proc = crawling process)
Goal: we wish to crawl the best pages with minimum overhead
Mining the Web 93
Crawler-process distribution
  on the same local network → central / parallel crawler
  at geographically distant locations → distributed crawler
Mining the Web 94
Distributed model
Crawlers may be running in diverse geographic locations
  Periodically update a master index
  Incremental update, so this is "cheap" (compression, differential update, etc.)
Focus on communication overhead during the crawl
Mining the Web 95
Issues and benefits
Issues:
  overlap: minimize multiply-downloaded pages
  quality: depends on the crawl strategy
  communication bandwidth: minimize it
Benefits:
  scalability: needed for large-scale web crawls
  costs: use of cheaper machines
  network-load dispersion and reduction: by dividing the web into regions and crawling only the nearest pages
Mining the Web 96
Coordination
A parallel crawler consists of multiple crawling processes communicating via a local network (intra-site parallel crawler) or the Internet (distributed crawler)
Mining the Web 97
Coordination
1. Independent: no coordination; every process follows its extracted links
2. Dynamic assignment: a central coordinator dynamically divides the web into small partitions and assigns each partition to a process
3. Static assignment: the web is partitioned and assigned without a central coordinator before the crawl starts
Mining the Web 98
c-procs crawling the web
[Diagram: URLs crawled vs. URLs in queues — which c-proc gets this URL?]
Communication: by URLs passed between c-procs
Mining the Web 99
Static assignment
Links from one partition to another (inter-partition links) can be handled in:
1. Firewall mode: a process does not follow any inter-partition link
2. Cross-over mode: a process also follows inter-partition links and thereby discovers more pages in its partition
3. Exchange mode: processes exchange inter-partition URLs; this mode needs communication
[Diagram: example graph with nodes a-i split between Partition 1 and Partition 2]
Mining the Web 100
Classification of parallel crawlers
If exchange mode is used, communication can be limited by:
  Batch communication: every process collects some URLs and sends them in a batch
  Replication: the k most popular URLs are replicated at each process and are not exchanged (determined from a previous crawl or on the fly)
Some ways to partition the Web:
  URL-hash based: many inter-partition links
  Site-hash based: reduces the inter-partition links
  Hierarchical: .com domain, .net domain, ...
Mining the Web 101
Static assignment: comparison

Mode        Coverage  Overlap  Quality  Communication
Firewall    Bad       Good     Bad      Good
Cross-over  Good      Bad      Bad      Good
Exchange    Good      Good     Good     Bad
Mining the Web 102
UBI Crawler [2002, Boldi, Codenotti, Santini, Vigna]
Features:
  Full distribution: identical agents / no central coordinator
  Balanced, locally computable assignment:
    each URL is assigned to one agent
    each agent can compute the URL assignment locally
    the distribution of URLs is balanced
  Scalability: the number of crawled pages per second and per agent is independent of the number of agents
  Fault tolerance:
    URLs are not statically distributed
    a distributed reassignment protocol is not reasonable
Mining the Web 103
UBI Crawler: Assignment Function
A: set of agent identifiers
L ⊆ A: set of alive agents
m: total number of hosts
δ_L: assigns each host h to an alive agent: δ_L(h) ∈ L
Requirements:
  Balance: each agent should be responsible for approximately the same number of hosts:
    |δ_L^(-1)(a)| ≈ m / |L|
  Contravariance: if the number of agents grows, the portion of the web crawled by each agent must shrink:
    L ⊆ L'  ⇒  δ_{L'}^(-1)(a) ⊆ δ_L^(-1)(a)
    (equivalently: L ⊆ L' and δ_{L'}(h) ∈ L  ⇒  δ_L(h) = δ_{L'}(h))
Mining the Web 104
Consistent Hashing
Each bucket is replicated k times and each replica is mapped randomly onto the unit circle
Hashing a key: compute a point on the unit circle and find the nearest replica
Example: L = {a,b}, L' = {a,b,c}, k = 3, hosts = {0,1,...,9}
[Diagram: replicas of agents a, b (and later c) placed on the unit circle together with hosts 0-9]
  δ_L^(-1)(a) = {1,4,5,6,8,9}, δ_L^(-1)(b) = {0,2,3,7}
  δ_{L'}^(-1)(a) = {4,5,6,8}, δ_{L'}^(-1)(b) = {0,2,7}, δ_{L'}^(-1)(c) = {1,3,9}
Contravariance: L ⊆ L'  ⇒  δ_{L'}^(-1)(a) ⊆ δ_L^(-1)(a); L ⊆ L' and δ_{L'}(h) ∈ L  ⇒  δ_L(h) = δ_{L'}(h)
Balancing: via the hash function and random number generator
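A compact sketch of such an assignment in Python (MD5 stands in for the hash function; the unit circle is modelled as the MD5 output space, and agent/host names are illustrative):

```python
# Sketch: consistent hashing of hosts onto agents, with k replicas per agent.
# Adding or removing an agent only moves the hosts nearest its replicas.
import bisect
import hashlib

K = 3                                            # replicas per agent (bucket)

def _point(s):
    return int(hashlib.md5(s.encode()).hexdigest(), 16)   # position on the "circle"

class Assignment:
    def __init__(self, agents):
        self.ring = sorted((_point(f"{a}#{i}"), a) for a in agents for i in range(K))

    def agent_for(self, host):
        points = [p for p, _ in self.ring]
        idx = bisect.bisect(points, _point(host)) % len(self.ring)  # nearest replica
        return self.ring[idx][1]

# e.g. contravariance: hosts keep their agent unless a new agent's replica is nearer.
before = Assignment(["a", "b"])
after = Assignment(["a", "b", "c"])
hosts = [f"host{i}" for i in range(10)]
moved = [h for h in hosts if before.agent_for(h) != after.agent_for(h)]
print("reassigned hosts:", moved)        # only hosts now claimed by agent c move
```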
Mining the Web 105
UBI Crawler: fault tolerance
Up to now: no metrics for estimating the fault tolerance of distributed crawlers
Each agent has its own view of the set of alive agents (views can be different), but two agents will never dispatch a host to two different agents
Agents can be added dynamically in a self-stabilizing way
[Diagram: agents a, b, c, d on the ring; one agent has died and its hosts are redistributed]
Mining the Web 106
Evaluation metrics (1)
1. Overlap:
   Overlap = (N − I) / N
   N: total number of fetched pages; I: number of distinct fetched pages
   → minimize the overlap
2. Coverage:
   Coverage = I / U
   U: total number of Web pages
   → maximize the coverage
Mining the Web 107
Evaluation metrics (2)
3. Communication overhead:
   Overhead = M / P
   M: number of exchanged messages (URLs); P: number of downloaded pages
   → minimize the overhead
4. Quality:
   Quality = (1/N) Σ_i PageRank(p_i)   (backlink count / oracle crawler)
   → maximize the quality
Mining the Web 108
Experiments40M URL graph – Stanford Webbase
Open Directory (dmoz.org) URLs as seeds
Should be considered a small Web
Mining the Web 109
Firewall mode coverage
The price of crawling in firewall mode
Mining the Web 110
Crossover mode overlap
Demanding coverage drives up overlap
Mining the Web 111
Exchange mode communication
Communication overhead is sublinear
  (per downloaded URL)
Mining the Web 112
Cho's conclusions
With < 4 crawling processes running in parallel, firewall mode provides good coverage
Firewall mode is not appropriate when:
  > 4 crawling processes are used
  only a small subset of the Web is downloaded and the quality of the downloaded pages is important
Exchange mode
  consumes < 1% of network bandwidth for URL exchanges
  maximizes the quality of the downloaded pages
By replicating 10,000 - 100,000 popular URLs, communication overhead is reduced by 40%