Authors: Muhammad Atif Qureshi, Arjumand Younus, Francisco Rojas
International Conference on Information Science and Applications 2010
Introduction
Implementation Alternatives
Crawler Architecture
Implications
Conclusion
Analyzing Web Crawler as Feed Forward Engine for Efficient Solution to Search Problem in the Minimum Amount of Time through a Distributed Framework
Background
Motivation
Problem Statement
Contributions
Web Crawler
Description: a program that downloads web pages recursively by following links starting from a seed set of pages
The backbone of a search engine's data repository
Competing factors among search engines: coverage of the Internet and throughput of a complete download
[ Introduction ]
A web crawler needs a highly optimized system architecture with the ability to:
Download a large number of web pages per second
Be robust against crashes
Be manageable and considerate of resources and web servers
Most prior work focuses on "improving the strategy for web crawlers" [LLWL08] [SS02]
Our focus is to provide a convincing analysis of the web crawler from a systems viewpoint
[ Introduction ]
Description: analysis of web crawling from a systems perspective
Issues:
Threads vs. events
Distributed implementation
Prevention of DDoS attacks
The web crawler as a feed-forward engine for the next phases of the search engine
[ Introduction ]
First-ever threads-vs.-events debate from the web crawler's perspective
MapReduce architecture for a distributed web crawler implementation
Implications toward the birth of an operating system for Internet-based applications, e.g. web crawlers
[ Implementation Alternatives ]
Threads vs. Events
Performance Evaluation of Threads vs. Events
Problems with threads:
Large memory footprint
Context-switch overhead
Cache and TLB misses
Expensive synchronization mechanisms
Problems with events:
Add to programmers' difficulty
Debugging is troublesome
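The contrast between the two models can be sketched in code. The following is a minimal illustration, not the paper's implementation: a thread-pool fetcher (one pooled thread blocks per in-flight download) versus an event-driven fetcher (a single thread multiplexes all downloads). The URLs and the simulated `fetch` latency are placeholders standing in for real HTTP requests.

```python
import asyncio
import time
from concurrent.futures import ThreadPoolExecutor

# Hypothetical URL list; a real crawler would read these from its frontier.
URLS = [f"http://example.com/page{i}" for i in range(20)]

def fetch_blocking(url):
    """Simulated blocking download (stand-in for a real HTTP request)."""
    time.sleep(0.01)  # pretend network latency
    return f"<html>{url}</html>"

def crawl_with_threads(urls, pool_size=8):
    """Thread-based crawler: each in-flight download occupies a pooled thread,
    paying for per-thread stacks, context switches, and synchronization."""
    with ThreadPoolExecutor(max_workers=pool_size) as pool:
        return list(pool.map(fetch_blocking, urls))

async def fetch_async(url):
    """Simulated non-blocking download running on the event loop."""
    await asyncio.sleep(0.01)
    return f"<html>{url}</html>"

def crawl_with_events(urls):
    """Event-driven crawler: one thread multiplexes every connection,
    avoiding per-connection stacks at the price of harder control flow."""
    async def main():
        return await asyncio.gather(*(fetch_async(u) for u in urls))
    return asyncio.run(main())

if __name__ == "__main__":
    pages_t = crawl_with_threads(URLS)
    pages_e = crawl_with_events(URLS)
    print(f"threads downloaded {len(pages_t)} pages, "
          f"events downloaded {len(pages_e)} pages")
```

The trade-off on the slide shows up directly: the threaded version is straight-line code but multiplies per-connection cost, while the event-driven version keeps one stack but inverts the control flow, which is what makes debugging troublesome.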
[ Implementation Alternatives ]
Environment:
CPU: Intel Pentium 4 Core 2 Duo, 3 GHz
RAM: 3.2 GB
OS: Linux 2.6.28-11-generic
Experiments:
1st experiment: comparison of crawler throughput with varying pool size
2nd experiment: comparison of crawler throughput with varying seed-URL size
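The first experiment can be mimicked with a small throughput harness. This is a hypothetical sketch, not the authors' benchmark: a fixed simulated network latency replaces real downloads, and throughput (pages per second) is measured while the thread-pool size varies.

```python
import time
from concurrent.futures import ThreadPoolExecutor

def fetch(url, latency=0.01):
    """Simulated page download with a fixed network latency."""
    time.sleep(latency)
    return url

def throughput(pool_size, n_urls=100):
    """Pages downloaded per second for a given thread-pool size."""
    urls = [f"http://example.com/{i}" for i in range(n_urls)]
    start = time.perf_counter()
    with ThreadPoolExecutor(max_workers=pool_size) as pool:
        list(pool.map(fetch, urls))  # drain the iterator to wait for all fetches
    return n_urls / (time.perf_counter() - start)

if __name__ == "__main__":
    # Throughput rises with pool size until CPU/synchronization costs dominate.
    for size in (1, 10, 50, 100):
        print(f"pool={size:3d}  throughput={throughput(size):8.1f} pages/s")
```

With latency-bound fetches, throughput grows almost linearly with pool size at first; in a real crawler the curve flattens as memory footprint and context-switch overhead take over, which is what the varying-pool-size experiment probes.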
[ Implementation Alternatives ]
[ Implementation Alternatives ]
The number of seed URLs was kept constant at 1,000
[ Implementation Alternatives ]
Pool size was kept constant at 200
High-Level View of MapReduce Usage
High-Level Distributed Design with MapReduce
Prevention of DDoS Attacks
[ Crawler Architecture ]
The distributed implementation was done with our own version of the MapReduce [DG04] library.
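How MapReduce maps onto crawling can be sketched as one toy round, under stated assumptions: this is not the authors' library, the in-memory `WEB` dict stands in for real HTTP fetches, and the function names are invented for illustration. The map phase "downloads" a page and emits (host, outlink) pairs; the reduce phase groups by host and deduplicates, producing the next frontier.

```python
from itertools import groupby
from urllib.parse import urlparse

# Toy in-memory web: URL -> outgoing links (stand-in for real downloads).
WEB = {
    "http://a.com": ["http://a.com/a", "http://b.com"],
    "http://b.com": ["http://c.net", "http://a.com"],
    "http://c.net": ["http://a.com"],
}

def map_fetch(url):
    """Map phase: fetch a page and emit (host, outlink) pairs."""
    for link in WEB.get(url, []):
        yield urlparse(link).netloc, link

def reduce_dedupe(pairs):
    """Reduce phase: group links by host and keep each one once,
    yielding the per-host frontier for the next crawl round."""
    frontier = {}
    for host, group in groupby(sorted(pairs), key=lambda kv: kv[0]):
        frontier[host] = sorted({link for _, link in group})
    return frontier

def crawl_round(seed_urls):
    """One MapReduce round over the current frontier."""
    pairs = [kv for url in seed_urls for kv in map_fetch(url)]
    return reduce_dedupe(pairs)

if __name__ == "__main__":
    print(crawl_round(["http://a.com", "http://b.com"]))
```

Partitioning the reduce phase by host is a natural fit for crawling: it co-locates all URLs of one server on one worker, which also gives that worker a single point of control for politeness, the concern the next slide turns to.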
[ Crawler Architecture]
Target server: yahoo.com
Same crawling machines
Simultaneous and continuing connections
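The scenario above (many crawling machines opening simultaneous, sustained connections to one server such as yahoo.com) is indistinguishable from a DDoS attack from the target's side. A common safeguard, sketched here as an assumption rather than the paper's mechanism, is a per-host politeness gate that enforces a minimum delay between successive requests to the same host.

```python
import time
from urllib.parse import urlparse

class PolitenessGate:
    """Enforce a minimum delay between successive requests to the same host,
    so the crawler never hammers one server with back-to-back connections."""

    def __init__(self, min_delay=1.0):
        self.min_delay = min_delay
        self.last_hit = {}  # host -> timestamp of the most recent request

    def wait(self, url):
        """Block until it is polite to contact url's host, then record the hit."""
        host = urlparse(url).netloc
        now = time.monotonic()
        # First contact with a host never waits (default puts it in the past).
        due = self.last_hit.get(host, now - self.min_delay) + self.min_delay
        if due > now:
            time.sleep(due - now)
        self.last_hit[host] = time.monotonic()

if __name__ == "__main__":
    gate = PolitenessGate(min_delay=0.05)
    for url in ["http://yahoo.com/a", "http://yahoo.com/b", "http://other.com/x"]:
        gate.wait(url)  # second yahoo.com hit sleeps ~0.05 s; other.com does not
        print("fetched", url)
```

In the distributed setting this only works if all URLs of a host land on the same machine, which is exactly what the MapReduce partitioning by host provides.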
[ Crawler Architecture ]
Push order (right side) | URL     | Pop priority (left side)
1                       | a.com   | 1
2                       | a.com/a | 7
3                       | 1.a.com | 5
4                       | b.com   | 2
5                       | c.net   | 3
6                       | 1.b.com | 6
7                       | c.com   | 4
[ Crawler Architecture ]
IMPLICATIONS
Observations during the implementation of feed-forward mechanisms in the web crawler:
An exokernel-based approach is favorable for the web crawler
Priority-queue control
The filesystem should not provide consistency guarantees
Indexing and dictionary concepts should be supported by the filesystem
SEARCH ENGINE OPERATING SYSTEM
[DG04] Dean, J. and Ghemawat, S., "MapReduce: Simplified Data Processing on Large Clusters," in Proc. 6th Int'l Symposium on Operating Systems Design and Implementation (OSDI), San Francisco, CA, USA, 2004, pp. 137-150.
[LLWL08] Lee, H.-T., Leonard, D., Wang, X., and Loguinov, D., "IRLbot: Scaling to 6 Billion Pages and Beyond," in Proc. 17th Int'l Conf. on World Wide Web (WWW), Beijing, China, Apr. 2008.
[SS02] Shkapenyuk, V. and Suel, T., "Design and Implementation of a High-Performance Distributed Web Crawler," in Proc. 18th Int'l Conf. on Data Engineering (ICDE), San Jose, CA, USA, Feb. 2002.