distributed web crawlers 1

Implementation

• All of the following experiments were conducted on 40M web pages downloaded with Stanford's WebBase crawler in December 1999, over a period of two weeks.

• The web image projected from this crawl might be biased, but it represents the pages a parallel crawler would fetch.

distributed web crawlers 2

Firewall Mode & Coverage

• Firewall mode:

– Each c-proc collects pages only from its predetermined partition and follows only intra-partition links (a sketch of this rule appears below).

– This mode has minimal communication overhead, but it may have quality and coverage problems.
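A minimal sketch of the firewall rule, not taken from the paper: it assumes hypothetical fetch, extract_links, and partition_of helpers and simply never follows or communicates a link that falls outside the c-proc's own partition.

```python
def crawl_firewall(seed_urls, my_partition, partition_of, fetch, extract_links):
    """Firewall mode: crawl only pages of my partition, follow only
    intra-partition links; inter-partition links are silently dropped."""
    frontier = list(seed_urls)
    seen = set(seed_urls)
    while frontier:
        url = frontier.pop()
        page = fetch(url)                       # download one page
        for link in extract_links(page):
            # the firewall: links to other partitions are never followed
            # and never sent to another c-proc
            if partition_of(link) == my_partition and link not in seen:
                seen.add(link)
                frontier.append(link)
```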

distributed web crawlers 3

Firewall Mode & Coverage

• We consider the 40M pages as the entire web.

• We use site-hash based partitioning (sketched below).

• Each c-proc was given five random seed sites from its own partition (5n sites for the overall crawler).
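A minimal sketch of site-hash based partitioning as used in these experiments: the partition of a URL is derived from its host name only, so all pages of a site fall into the same partition. Using hashlib instead of Python's built-in hash is my own choice, so the assignment is stable across independent c-proc processes; bound to a fixed n (e.g. with functools.partial), this can serve as the partition_of helper assumed in the sketch above.

```python
import hashlib
from urllib.parse import urlparse

def site_hash_partition(url: str, n_partitions: int) -> int:
    """Site-hash partitioning: hash only the host name, so every page of a
    site is owned by the same c-proc (unlike URL-hash partitioning)."""
    host = urlparse(url).netloc.lower()
    digest = hashlib.md5(host.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % n_partitions

# hypothetical URLs: both map to the same partition because they share a host
assert (site_hash_partition("http://www.example.com/a.html", 4)
        == site_hash_partition("http://www.example.com/b/c.html", 4))
```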

distributed web crawlers 4

Results

distributed web crawlers 5

Results (2)

distributed web crawlers 6

Conclusions

• When a small number of c-procs run in parallel, this mode provides good coverage, and the crawler may start with a relatively small number of seed URLs.

• This mode is not a good choice when coverage is important, especially when many c-procs run in parallel.

distributed web crawlers 7

Example

• Suppose we want to download 1B pages over one month, with a 10 Mbps Internet link for each c-proc's machine:

– We need to download 10^9 pages × 10^4 bytes/page = 10^13 bytes.

– The required download rate is 34 Mbps, so we need 4 c-procs; from Fig. 4 we conclude that coverage will be about 80%.

– If we only have a week, we need a download rate of 140 Mbps, i.e. 14 c-procs, which will cover only 50% (the arithmetic is checked in the sketch below).

distributed web crawlers 8

Cross-over & Overlap

• This mode may yield improved coverage, since a c-proc follows inter-partition links when it runs out of links in its own partition.

• This mode also incurs overlap, because a page can be downloaded by several c-procs.

• => The crawler increases coverage at the expense of overlap (a sketch of the fallback rule appears below).
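A minimal sketch of the cross-over rule, under the same assumptions as the firewall sketch above: inter-partition links are parked in a second queue and followed only when the intra-partition frontier runs dry, which is exactly where the overlap comes from.

```python
from collections import deque

def crawl_crossover(seed_urls, my_partition, partition_of, fetch, extract_links):
    intra = deque(seed_urls)   # links inside my own partition
    inter = deque()            # inter-partition links, used only as a fallback
    seen = set(seed_urls)
    while intra or inter:
        # prefer my own partition; cross over only when it is exhausted
        url = intra.popleft() if intra else inter.popleft()
        for link in extract_links(fetch(url)):
            if link in seen:
                continue
            seen.add(link)
            (intra if partition_of(link) == my_partition else inter).append(link)
```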

distributed web crawlers 9

Cross-over & Overlap

• We consider the 40M pages as the entire web.

• We use site-hash based partitioning.

• Each c-proc was given five random seed sites from its own partition (5n sites for the overall crawler).

• We measure overlap at various coverage points.

distributed web crawlers 10

Results

distributed web crawlers 11

Conclusions

• While this mode is much better than an independent crawl, it still incurs quite significant overlap. For example, with 4 c-procs running, the overlap reaches almost 2.5 in order to obtain coverage close to 1. For this reason, this mode is not recommended unless coverage is important and no communication between c-procs is available.

distributed web crawlers 12

Exchange Mode & Communication

• In this section we study the communication overhead of an exchange-mode crawler and how to reduce it by replication.

• We split the 40M pages into n partitions based on the site-hash value and run n c-procs in exchange mode (a minimal sketch of the URL exchange follows below).
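A minimal sketch of the URL exchange, with a hypothetical send_to(owner, url) transport standing in for whatever channel connects the c-procs: links owned by another partition are not crawled locally but forwarded to the owning c-proc, which is the communication overhead measured in this section.

```python
def route_links(discovered_links, my_partition, n_partitions,
                site_hash_partition, enqueue_local, send_to):
    """Exchange mode: keep intra-partition links, forward the rest
    to the c-proc that owns their partition."""
    for url in discovered_links:
        owner = site_hash_partition(url, n_partitions)
        if owner == my_partition:
            enqueue_local(url)       # crawl it ourselves
        else:
            send_to(owner, url)      # this transfer is the overhead measured here
```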

distributed web crawlers 13

Results

distributed web crawlers 14

Conclusions

• The site-hash based partitioning scheme significantly reduces communication overhead compared with the URL-hash based scheme. On average we need to transfer less than 10% of the discovered links (up to about one link per page).

distributed web crawlers 15

Conclusions (2)

• The network bandwidth used for URL exchange is relatively small. An average URL is about 40 bytes long, while an average page is about 10 KB, so this transfer consumes only about 0.4% of the total network bandwidth.

distributed web crawlers 16

Conclusions (3)

• The overhead of this exchange is nevertheless quite significant, because each transmission goes through the TCP/IP network stack on both sides and incurs two switches between kernel and user mode (the batching sketch below shows one way to amortize this cost).
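One common way to amortize this per-message cost is to buffer the exchanged URLs and send them in batches, which is what the later slides call batch communication. A minimal sketch, with the same hypothetical send_to transport, now carrying a whole list of URLs per message; the batch size of 10,000 is an arbitrary illustration, not a value from the paper.

```python
from collections import defaultdict

class BatchSender:
    """Buffer outgoing URLs per destination c-proc and send one message
    per flush instead of one TCP round-trip per URL."""
    def __init__(self, send_to, batch_size=10_000):
        self.send_to = send_to
        self.batch_size = batch_size
        self.buffers = defaultdict(list)

    def enqueue(self, owner, url):
        self.buffers[owner].append(url)
        if len(self.buffers[owner]) >= self.batch_size:
            self.flush(owner)

    def flush(self, owner):
        if self.buffers[owner]:
            self.send_to(owner, self.buffers[owner])   # one message, many URLs
            self.buffers[owner] = []
```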

distributed web crawlers 17

Reducing Overhead by Replication

distributed web crawlers 18

Conclusions

• Based on this result, replicating the 10,000-100,000 most popular URLs in each c-proc gives the best results: it minimizes communication overhead while keeping the replication overhead low (see the sketch below).
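A minimal sketch of the replication idea as I read it: each c-proc keeps a local copy of the most popular URLs (assumed to be computed offline, e.g. from a previous crawl) and skips the exchange for links that fall in that set, so those links never cost any communication.

```python
def make_link_router(popular_urls, my_partition, n_partitions,
                     site_hash_partition, enqueue_local, send_to):
    """Exchange mode with replication: links to replicated popular URLs
    are never sent over the network."""
    replicated = set(popular_urls)   # 10k-100k most popular URLs, copied to every c-proc

    def route(url):
        if url in replicated:
            return                   # every c-proc already knows this URL
        owner = site_hash_partition(url, n_partitions)
        if owner == my_partition:
            enqueue_local(url)
        else:
            send_to(owner, url)

    return route
```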

distributed web crawlers 19

Quality & Batch Communication

• In this section we study the quality issue: as mentioned, a parallel crawler can be worse than a single-process crawler if every c-proc makes its decisions based solely on its own local information.

distributed web crawlers 20

Quality & Batch Communication (2)

• Throughout this section we regard a page's importance I(p) as the number of backlinks it has (a minimal sketch follows below).

– This is the most common importance metric.

– Each c-proc's view of I(p) obviously depends on how often the c-procs exchange backlink information.
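A minimal sketch of this importance metric, counting backlinks from (source, destination) link pairs; in the parallel setting each c-proc only sees the links it has discovered itself, which is why the exchange frequency matters for quality.

```python
from collections import Counter

def backlink_counts(links):
    """I(p) = number of pages linking to p, from (src, dst) pairs."""
    return Counter(dst for _src, dst in links)

# hypothetical toy data: a c-proc that has seen only these local links
# underestimates I(p) for pages whose backlinks live in other partitions
local_links = [("a", "c"), ("b", "c"), ("a", "b")]
print(backlink_counts(local_links))   # Counter({'c': 2, 'b': 1})
```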

distributed web crawlers 21

Quality at Different Exchange Rates

distributed web crawlers 22

Conclusions

• As the number of c-procs increases, quality degrades unless they exchange backlink messages often.

• The quality of firewall mode is worse than that of a single-process crawler when downloading a small fraction of the pages; however, there is no difference when downloading larger fractions.

distributed web crawlers 23

Quality and Communication Overhead

distributed web crawlers 24

Conclusions

• Communication overhead does not increase linearly.

• A large number of URL exchanges is not necessary for achieving high quality, especially when downloading a large portion of the web (Fig. 9).

distributed web crawlers 25

Final Example

• Say we plan to operate a medium-scale search engine covering 20% of the web (240M pages). We plan to refresh the index once a month, and our machines have a 1 Mbps connection to the Internet.

– We need about 7.44 Mbps of download bandwidth, so we have to run at least 8 c-procs in parallel (the arithmetic is checked in the sketch below).
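The same back-of-the-envelope arithmetic as in the earlier example, again assuming ~10 KB pages and a 30-day month; this gives roughly 7.4 Mbps, in line with the slide's 7.44 Mbps, and 8 machines with 1 Mbps links each.

```python
import math

pages = 240e6            # 20% of the web, per the slide
bytes_per_page = 1e4     # ~10 KB average page (slide 15)
seconds = 30 * 86_400    # one month, assumed to be 30 days

mbps = pages * bytes_per_page * 8 / seconds / 1e6
print(round(mbps, 2), "Mbps ->", math.ceil(mbps / 1.0), "c-procs at 1 Mbps each")
# ~7.41 Mbps -> 8 c-procs
```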

distributed web crawlers 26

Related charts

distributed web crawlers 27

Final Conclusions

• When a small number of c-procs run in parallel, firewall mode provides good coverage. Given the simplicity of this mode, it is a good option to consider unless:

– More than 4 c-procs are required (Fig. 4).

– Only a small subset of the web is required and quality is important (Fig. 9).

distributed web crawlers 28

Final Conclusions (2)

• An exchange-mode crawler consumes little network bandwidth and minimizes overhead when batch communication is used. Quality is maximized even with fewer than 100 URL exchanges.

• Replicating the 10,000-100,000 most popular URLs reduces communication overhead by roughly 40%; further replication contributes little (Fig. 8).

distributed web crawlers 29

References

• Junghoo Cho and Hector Garcia-Molina. Parallel Crawlers. October 2001.

• Mike Burner. Crawling Towards Eternity. Web Techniques Magazine, May 1998.

