Authors: Muhammad Atif Qureshi, Arjumand Younus, Francisco Rojas
International Conference on Information Science and Applications 2010
Introduction
Implementation Alternatives
Crawler Architecture
Implications
Conclusion
Analyzing Web Crawler as Feed Forward Engine for Efficient Solution to Search Problem in the Minimum Amount of Time through a Distributed Framework
Background
Motivation
Problem Statement
Contributions
Web Crawler
Description: a program that downloads web pages recursively by following links starting from a seed set of pages
The backbone of a search engine's data repository
Competing factors among search engines: coverage of the Internet and throughput of a complete download
[ Introduction ]
A web crawler needs a highly optimized system architecture with the ability to:
Download a large number of web pages per second
Be robust against crashes
Be manageable and considerate of resources and web servers
Most prior work focuses on "improving the strategy for web crawlers" [LLWL08] [SS02]
Our focus is to provide a convincing analysis of the web crawler from a systems viewpoint
[ Introduction ]
Description: analysis of web crawling from a systems perspective
Issues:
Threads vs. events
Distributed implementation
Prevention of DDoS attacks
The web crawler as a feed-forward engine for the next phases of the search engine
[ Introduction ]
First-ever threads-vs.-events debate from the web crawler's perspective
MapReduce architecture for a distributed web crawler implementation
Implications toward the birth of an operating system for Internet-based applications, e.g. web crawlers
[ Implementation Alternatives ]
Threads vs. Events
Performance Evaluation of Threads vs. Events
Problems with threads:
Large memory footprint
Context-switch overhead
Cache and TLB misses
Expensive synchronization mechanisms
Problems with events:
Add to programmers' difficulty
Debugging is troublesome
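The contrast between the two models can be sketched in code. The following is a minimal illustration, not the paper's implementation: a thread-pool fetcher (one pooled thread blocks per in-flight download) versus an event-driven fetcher (a single thread multiplexes all downloads). The URLs and the simulated `fetch` latency are placeholders standing in for real HTTP requests.

```python
import asyncio
import time
from concurrent.futures import ThreadPoolExecutor

# Hypothetical URL list; a real crawler would read these from its frontier.
URLS = [f"http://example.com/page{i}" for i in range(20)]

def fetch_blocking(url):
    """Simulated blocking download (stand-in for a real HTTP request)."""
    time.sleep(0.01)  # pretend network latency
    return f"<html>{url}</html>"

def crawl_with_threads(urls, pool_size=8):
    """Thread-based crawler: each in-flight download occupies a pooled thread,
    paying for per-thread stacks, context switches, and synchronization."""
    with ThreadPoolExecutor(max_workers=pool_size) as pool:
        return list(pool.map(fetch_blocking, urls))

async def fetch_async(url):
    """Simulated non-blocking download running on the event loop."""
    await asyncio.sleep(0.01)
    return f"<html>{url}</html>"

def crawl_with_events(urls):
    """Event-driven crawler: one thread multiplexes every connection,
    avoiding per-connection stacks at the price of harder control flow."""
    async def main():
        return await asyncio.gather(*(fetch_async(u) for u in urls))
    return asyncio.run(main())

if __name__ == "__main__":
    pages_t = crawl_with_threads(URLS)
    pages_e = crawl_with_events(URLS)
    print(f"threads downloaded {len(pages_t)} pages, "
          f"events downloaded {len(pages_e)} pages")
```

The trade-off on the slide shows up directly: the threaded version is straight-line code but multiplies per-connection cost, while the event-driven version keeps one stack but inverts the control flow, which is what makes debugging troublesome.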
[ Implementation Alternatives ]
Environment:
CPU: Intel Pentium 4 Core 2 Duo, 3 GHz
RAM: 3.2 GB
OS: Linux 2.6.28-11-generic
Experiments:
1st experiment: comparison of crawler throughput with varying pool size
2nd experiment: comparison of crawler throughput with varying seed-URL size
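The first experiment can be mimicked with a small throughput harness. This is a hypothetical sketch, not the authors' benchmark: a fixed simulated network latency replaces real downloads, and throughput (pages per second) is measured while the thread-pool size varies.

```python
import time
from concurrent.futures import ThreadPoolExecutor

def fetch(url, latency=0.01):
    """Simulated page download with a fixed network latency."""
    time.sleep(latency)
    return url

def throughput(pool_size, n_urls=100):
    """Pages downloaded per second for a given thread-pool size."""
    urls = [f"http://example.com/{i}" for i in range(n_urls)]
    start = time.perf_counter()
    with ThreadPoolExecutor(max_workers=pool_size) as pool:
        list(pool.map(fetch, urls))  # drain the iterator to wait for all fetches
    return n_urls / (time.perf_counter() - start)

if __name__ == "__main__":
    # Throughput rises with pool size until CPU/synchronization costs dominate.
    for size in (1, 10, 50, 100):
        print(f"pool={size:3d}  throughput={throughput(size):8.1f} pages/s")
```

With latency-bound fetches, throughput grows almost linearly with pool size at first; in a real crawler the curve flattens as memory footprint and context-switch overhead take over, which is what the varying-pool-size experiment probes.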
[ Implementation Alternatives ]
[ Implementation Alternatives ]
The number of seed URLs was kept constant at 1,000
[ Implementation Alternatives ]
Pool size was kept constant at 200
High-Level View of MapReduce Usage
High-Level Distributed Design with MapReduce
Prevention of DDoS Attacks
[ Crawler Architecture ]
The distributed implementation was done with our own version of the MapReduce [DG04] library.
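How MapReduce maps onto crawling can be sketched as one toy round, under stated assumptions: this is not the authors' library, the in-memory `WEB` dict stands in for real HTTP fetches, and the function names are invented for illustration. The map phase "downloads" a page and emits (host, outlink) pairs; the reduce phase groups by host and deduplicates, producing the next frontier.

```python
from itertools import groupby
from urllib.parse import urlparse

# Toy in-memory web: URL -> outgoing links (stand-in for real downloads).
WEB = {
    "http://a.com": ["http://a.com/a", "http://b.com"],
    "http://b.com": ["http://c.net", "http://a.com"],
    "http://c.net": ["http://a.com"],
}

def map_fetch(url):
    """Map phase: fetch a page and emit (host, outlink) pairs."""
    for link in WEB.get(url, []):
        yield urlparse(link).netloc, link

def reduce_dedupe(pairs):
    """Reduce phase: group links by host and keep each one once,
    yielding the per-host frontier for the next crawl round."""
    frontier = {}
    for host, group in groupby(sorted(pairs), key=lambda kv: kv[0]):
        frontier[host] = sorted({link for _, link in group})
    return frontier

def crawl_round(seed_urls):
    """One MapReduce round over the current frontier."""
    pairs = [kv for url in seed_urls for kv in map_fetch(url)]
    return reduce_dedupe(pairs)

if __name__ == "__main__":
    print(crawl_round(["http://a.com", "http://b.com"]))
```

Partitioning the reduce phase by host is a natural fit for crawling: it co-locates all URLs of one server on one worker, which also gives that worker a single point of control for politeness, the concern the next slide turns to.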
[ Crawler Architecture]
Target server: yahoo.com
Same crawling machines
Simultaneous and continuing connections
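The scenario above (many crawling machines opening simultaneous, sustained connections to one server such as yahoo.com) is indistinguishable from a DDoS attack from the target's side. A common safeguard, sketched here as an assumption rather than the paper's mechanism, is a per-host politeness gate that enforces a minimum delay between successive requests to the same host.

```python
import time
from urllib.parse import urlparse

class PolitenessGate:
    """Enforce a minimum delay between successive requests to the same host,
    so the crawler never hammers one server with back-to-back connections."""

    def __init__(self, min_delay=1.0):
        self.min_delay = min_delay
        self.last_hit = {}  # host -> timestamp of the most recent request

    def wait(self, url):
        """Block until it is polite to contact url's host, then record the hit."""
        host = urlparse(url).netloc
        now = time.monotonic()
        # First contact with a host never waits (default puts it in the past).
        due = self.last_hit.get(host, now - self.min_delay) + self.min_delay
        if due > now:
            time.sleep(due - now)
        self.last_hit[host] = time.monotonic()

if __name__ == "__main__":
    gate = PolitenessGate(min_delay=0.05)
    for url in ["http://yahoo.com/a", "http://yahoo.com/b", "http://other.com/x"]:
        gate.wait(url)  # second yahoo.com hit sleeps ~0.05 s; other.com does not
        print("fetched", url)
```

In the distributed setting this only works if all URLs of a host land on the same machine, which is exactly what the MapReduce partitioning by host provides.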
[ Crawler Architecture ]
Push order (right side) | URL     | Pop priority (left side)
1                       | a.com   | 1
2                       | a.com/a | 7
3                       | 1.a.com | 5
4                       | b.com   | 2
5                       | c.net   | 3
6                       | 1.b.com | 6
7                       | c.com   | 4
[ Crawler Architecture ]
IMPLICATIONS
Observations during the implementation of feed-forward mechanisms in the web crawler:
An exokernel-based approach is favorable for the web crawler
Priority-queue control
The filesystem should not provide consistency guarantees
Indexing and dictionary concepts should be supported by the filesystem
SEARCH ENGINE OPERATING SYSTEM
[DG04] Dean, J. and Ghemawat, S., "MapReduce: Simplified Data Processing on Large Clusters," in Proc. 6th Int'l Symposium on Operating Systems Design and Implementation (OSDI), San Francisco, CA, USA, 2004, pp. 137-150.
[LLWL08] Lee, H.-T., Leonard, D., Wang, X., and Loguinov, D., "IRLbot: Scaling to 6 Billion Pages and Beyond," in Proc. 17th Int'l Conf. on World Wide Web (WWW), Beijing, China, Apr. 2008.
[SS02] Shkapenyuk, V. and Suel, T., "Design and Implementation of a High-Performance Distributed Web Crawler," in Proc. 18th Int'l Conf. on Data Engineering (ICDE), San Jose, CA, USA, Feb. 2002.