
    WEB SPIDER

    A Focused Crawler

    Acknowledgements

It would truly be unfair not to show our gratefulness to all those who helped

    us complete this project. We would like to show our deepest gratitude to our

    project guide Dr. Anupam Agarwal, without whom this project would not

    have been possible. It was he who motivated us for this cause and was

    always present with his precious guidance and ideas besides being extremely

supportive and understanding at all times. This fueled our enthusiasm even further and encouraged us to boldly step into what was a totally dark and

    unexplored expanse before us.

    We would also like to thank our batch mates and seniors who were ready

    with a positive comment all the time, whether it was an off-hand comment to


encourage us or a constructive piece of criticism. Their positive as well as critical comments were of great help in giving the project its present form.

    Abstract

The World Wide Web, having over 350 million pages, continues to grow

    rapidly at a million pages per day. About 600 GB of text changes every

    month. Such growth and flux poses basic limits of scale for today's generic

    crawlers and search engines. In spite of using high-end multiprocessors and

    exquisitely crafted crawling software, the largest crawls cover only 30-40%

    of the web, and refreshes take weeks to a month. With such unprecedented

    scaling challenges for general-purpose crawlers and search engines, we

propose a hypertext resource discovery system called a Focused Crawler. The goal of a focused crawler is to selectively seek out pages that are relevant to a pre-defined set of topics. The topics are specified not using keywords, but using exemplary documents.

    To achieve such goal-directed crawling, we

    evaluate the relevance of a hypertext document with respect to the focus

    topics thereby discarding the irrelevant pages and focusing on the hyperlinks

of relevant pages only. Focused crawling thus steadily acquires relevant pages while standard crawling quickly loses its way. Therefore it is

    very effective for building high-quality collections of Web documents on

    specific topics, using modest desktop hardware.


    Contents

Student Declaration
Supervisor Recommendation
Acknowledgement
Abstract
List of figures used

Chapter 1: Introduction
1.1 Objective
1.2 Motivation
1.3 Problem Definition

Chapter 2: Literature Survey
2.1 Literature survey
2.2 Previous Work

Chapter 3: Project Model
3.1 Basic Architecture
3.2 Crawler Policies
3.3 Issues

Chapter 4: Algorithm Implementation
4.1 Outline
4.2 Parsing and Stemming
4.3 Threshold calculation
4.4 Document Frequency
4.5 Robots.txt

Chapter 5: Discussion and Results
5.1 Retrieval of relevant pages only
5.2 Multithreading
5.3 Crawl space reduction
5.4 Reduction of server overload
5.5 Robustness of Acquisition
5.6 Snapshots

Chapter 6: Conclusion
6.1 Conclusion
6.2 Challenges and Future work

Appendices
Appendix A: Term Vector Model
Appendix B: Basic Authentication Scheme
Appendix C: Term Frequency-Inverse Document Frequency

References
Technical references
Other references

    List of figures:

Fig 1.1: Performance of an unfocused crawler
Fig 1.2: Performance of focused crawler
Fig 2.1: Basic Components of the crawler
Fig 2.2: Integration of crawler, classifier and distiller
Fig 2.3: Domain of focused web crawler
Fig 3.1: Simple Crawler Configuration
Fig 3.2: Control Flow of a Crawler Frontier
Fig 4.1: Basic functioning of crawl frontier
Fig 5.1: Comparison Analysis
Fig 5.2: Crawl Space reduction
Fig 5.3: Snapshot 1
Fig 5.4: Snapshot 2


    Chapter I

    Introduction

    This section covers:

    Objective

    Motivation

    Problem definition


    1.1: Objective

To build a customized, multithreaded, focused crawler that crawls the web based on the relevance of each web page, thus reducing the crawl space.

    1.2: Motivation

    The World Wide Web has grown from a few thousand pages in 1993 to more than two

    billion pages at present. It continues to grow rapidly at a million pages per day.

    About 600 GB of text changes every month. Due to this explosion in size, web search

    engines are becoming increasingly important as the primary means of locating relevant

    information [2]. Such search engines rely on massive collections of web pages that are

    acquired with the help of web crawlers, which traverse the web by following hyperlinks

    and storing downloaded pages in a large database that is later indexed for efficient

    execution of user queries. Many researchers have looked at web search technology over

    the last few years, including crawling strategies, storage, indexing, ranking techniques,

    and a significant amount of work on the structural analysis of the web and web graph.


    In spite of using high-end multiprocessors and exquisitely crafted crawling software, the

    largest crawls cover only 30-40% of the web, and refreshes take weeks to a month. The

    overwhelming engineering challenges are in part due to the one-size-fits-all philosophy:

    the crawler trying to cater to every possible query.

    Serious web users adopt the strategy of filtering by relevance and quality. The growth of

    the web matters little to a physicist if at most a few dozen pages dealing with quantum

    electrodynamics are added or updated per week. Seasoned users also rarely roam

    aimlessly; they have bookmarked sites important to them, and their primary need is to

    expand and maintain a community around these examples while preserving the quality. A

    focused crawler selectively seeks out pages that are relevant to a pre-defined set of topics.

It is crucial that the harvest rate of the focused crawler, i.e. the fraction of page fetches that are relevant to the user's interest, be high; otherwise it would be easier to crawl the whole web and bucket the results into topics as a post-processing step.


    Fig 1.1: Performance of unfocused crawler[10]

    Fig 1.2: Performance of focused crawler[10]

As we see in the case of the focused crawler (Fig 1.2), the fraction of page fetches that are relevant to the user's interest is very high when compared to that of the unfocused crawler (Fig 1.1). The crawl space in the case of a focused crawler can also be reduced to a large extent as compared to a normal crawler.

    1.3: Problem Definition


Our project is an ambitious attempt to build a customized, multithreaded, focused crawler that crawls the web based on the relevance of each web page. The approach should concentrate specifically on a particular domain.

    In order to achieve the objectives it should be able to perform the following:

Efficient Preprocessing: This involves the preprocessing of the input documents. We aim to provide efficient parsing and stemming of pages. Initially, the user will be required to provide a set of example pages along with his search query. These example pages will be parsed, removing all the stop words, and finally the text will be stemmed.

Knowledge Retrieval: To provide efficient retrieval of information-containing words. Once the text has been stemmed, the information-containing words will be picked out; these form the information that the crawler carries with it.

Crawling: To build a crawler that starts from a root node or URL, called the seed. As the crawler visits these URLs, it will identify all the hyperlinks in the page and add them to the list of URLs to visit, called the crawl frontier. URLs from the frontier will then be recursively visited.

Retrieving relevant pages: We aim to retrieve only those pages which are closely related to the corresponding query. In our case we will deal with the most relevant pages. This will reduce the burden on the user of scanning through all the retrieved pages to find the pages of his interest.


    Chapter II

    Literature Survey

    This section covers:

    Literature survey

    Background and Previous work


    2.1: Literature survey

    2.1.1: Basic Crawler

    The rapid growth of the World-Wide Web poses unprecedented scaling challenges for

    general-purpose crawlers and search engines. We want to implement a new hypertext

    resource discovery system called a Focused Crawler. The goal of a focused crawler is

to selectively seek out pages that are relevant to a pre-defined set of topics. The topics are

    specified not using keywords, but using exemplary documents. Rather than collecting and

    indexing all accessible Web documents to be able to answer all possible ad-hoc queries, a

    focused crawler analyzes its crawl boundary to find the links that are likely to be most

    relevant for the crawl, and avoids irrelevant regions of the Web. This leads to significant

    savings in hardware and network resources, and helps keep the crawl more up-to-date.

    To achieve such goal-directed crawling, we will design two hypertext

    mining programs that guide our crawler: a classifier that evaluates the relevance of a

    hypertext document with respect to the focus topics, and a distiller that identifies

    hypertext nodes that are great access points to many relevant pages within a few links. [7]

Extensive focused-crawling experiments using several topics at different levels of specificity have been reported. Focused crawling acquires relevant pages steadily while standard crawling quickly loses its way, even though both are started from the same root set.

    Focused crawling is robust against large perturbations in the starting set of URLs. It

    discovers largely overlapping sets of resources in spite of these perturbations. It is also

capable of exploring and discovering valuable resources that are dozens of links away


    from the start set, while carefully pruning the millions of pages that may lie within this

    same radius.[5]

As a result, it is highly efficient compared to normal crawlers. Normal crawlers work well for some time after they start crawling but then lose their path, which is their biggest disadvantage relative to focused crawlers. Our anecdotes suggest that focused crawling is very effective for building high-quality collections of Web documents on specific topics,

    using modest desktop hardware.[3]

    Fig 2.1: Basic Components of the crawler[2]

    The focused crawler has three main components: a classifier which makes relevance

    judgments on pages crawled to decide on link expansion, a distiller which determines a

    measure of centrality of crawled pages to determine visit priorities, and a crawler with

    dynamically reconfigurable priority controls which is governed by the classifier and

    distiller.[2]


    Its block diagram can be shown as

Fig 2.2: Focused crawler showing how crawler, classifier and distiller are integrated. [1]

    2.1.2: Classification

    Relevance is enforced on the focused crawler using a hypertext classifier. We assume that

the category taxonomy induces a hierarchical partition on Web documents. (In real life, documents are often judged to belong to multiple categories.) The aim is to acquire useful pages, not merely to eliminate irrelevant pages. Human judgment, although subjective and even erroneous,

    would be best for measuring relevance. Clearly, even for an experimental crawler that

    acquires only ten thousand pages per hour, this is impossible. Therefore we use our

    classifier to estimate the relevance of the crawl graph. It is to be noted carefully that we


    are not, for instance, training and testing the classifier on the same set of documents, or

checking the classifier's earlier evaluation of a document using the classifier itself. Just as

    human judgment is prone to variation and error, a statistical program can make mistakes.

    Based on such imperfect recommendation, we choose to or not to expand pages. Later,

    when a page that was chosen is visited, we evaluate its relevance, and thus the value of

    that decision.[8]

    2.1.3: Distillation

    Relevance is not the only attribute used to evaluate a page while crawling. A long essay

    very relevant to the topic but without links is only a finishing point in the crawl. A good

    strategy for the crawler is to identify hubs: pages that are almost exclusively a collection

    of links to authoritative resources that are relevant to the topic.* Social network analysis

    is concerned with the properties of graphs formed between entities such as people,

    organizations, papers, etc., through coauthoring, citations, mentoring, paying,

    telephoning, infecting, etc. Prestige is an important attribute of nodes in a social network,

    especially in the context of academic papers and Web documents. The number of

citations to a paper is a reasonable but crude measure of its prestige. Also, many hubs are

    multi-topic in nature, e.g., a published bookmark file pointing to sports car sites and

    photography sites.[4]

    2.1.4: Integration with the crawler

    The crawler has one watchdog thread and many worker threads. The watchdog is in

    charge of checking out new work from the crawl frontier, which is stored on disk. New

    work is passed to workers using shared memory buffers. Workers save details of newly

    explored pages in private per-worker disk structures. In bulk-synchronous fashion,

    workers are stopped, and their results are collected and integrated into the central pool of

    work.[4]
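A minimal sketch of this watchdog/worker division of labour, assuming an in-memory queue instead of the on-disk frontier and a placeholder fetch_links helper (this is an illustration, not the system described in [4]):

import threading
import queue

frontier = queue.Queue()   # URLs waiting to be crawled (stands in for the disk frontier)
results = queue.Queue()    # newly discovered URLs reported back by the workers

def fetch_links(url):
    # Placeholder: download `url` and return the hyperlinks found on the page.
    return []

def worker():
    while True:
        url = frontier.get()
        if url is None:            # sentinel value: no more work for this worker
            frontier.task_done()
            break
        for link in fetch_links(url):
            results.put(link)      # per-worker results, merged later by the watchdog
        frontier.task_done()

def watchdog(seed_urls, num_workers=4):
    threads = [threading.Thread(target=worker) for _ in range(num_workers)]
    for t in threads:
        t.start()
    for url in seed_urls:
        frontier.put(url)          # hand out new work to the workers
    frontier.join()                # bulk-synchronous step: wait until all URLs are processed
    for _ in threads:
        frontier.put(None)         # stop the workers
    for t in threads:
        t.join()
    merged = []                    # integrate worker results into the central pool of work
    while not results.empty():
        merged.append(results.get())
    return merged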


    While it is fairly easy to build a slow crawler that downloads a few pages

    per second for a short period of time, building a high-performance system that can

    download hundreds of millions of pages over several weeks presents a number of

challenges in system design, I/O and network efficiency, and robustness and

    manageability.[2]

    * Refer Appendix B

    Perhaps the most crucial evaluation of focused crawling is to measure the rate at which

    relevant pages are acquired, and how effectively irrelevant pages are filtered off from the

    crawl. This harvest ratio must be high, otherwise the focused crawler would spend a lot

    of time merely eliminating irrelevant pages, and it may be better to use an ordinary

    crawler instead! It would be good to judge the relevance of the crawl by human

    inspection, even though it is subjective and inconsistent. But this is not possible for the

    hundreds of thousands of pages our system crawled. Therefore we have to take recourse

    to running an automatic classifier over the collected pages. Specifically, we can use our

    classifier. It may appear that using the same classifier to guide the crawler and judge the

    relevance of crawled pages is flawed methodology, but it is not so. We are evaluating not

    the classifier but the basic crawling heuristic that neighbors of highly relevant pages tend

    to be relevant.


    Fig 2.3: Domain of focused web crawler[11]

    The unfocused crawler starts out from the same set of dozens of highly relevant links as

    the focused crawler, but is completely lost within the next hundred page fetches: the

relevance goes quickly to zero. In contrast, the focused crawl keeps up a healthy pace of acquiring relevant pages over thousands of pages, in spite of some short-range rate fluctuations, which is expected. On average, between a third and half of all page

    fetches result in success over the first several thousand fetches, and there seems to be no

    sign of stagnation.

Crawling the Web, in a certain way, resembles watching the sky on a clear night: what we see reflects the state of the stars at different times, as their light has travelled different distances. What a Web crawler gets is not a snapshot of the Web, because it does not represent the Web at any given instant of time. The last pages crawled are probably represented very accurately, but the first pages that were downloaded have a high probability of having been changed. [6]


    2.2: Previous work

    The following is a list of published crawler architectures for general-purpose crawlers

    (excluding focused Web crawlers), with a brief description that includes the names given

    to the different components and outstanding features:

2.2.1: RBSE was the first published web crawler. It was based on two programs: the first program, "spider", maintains a queue in a relational database, and the second program, "mite", is a modified www ASCII browser that downloads the pages from the Web. It was

    presented at the First International Conference on the World Wide Web, Geneva,

    Switzerland.[12]

    2.2.2: Google Crawler is described in some detail, but the reference is only about an

early version of its architecture, which was written in C++ and Python. The crawler was integrated with the indexing process, because text parsing was done for full-text indexing and also for URL extraction. There is a URL server that sends lists of URLs to be fetched by several crawling processes. During parsing, the URLs found were passed to a URL server that checked whether the URL had been previously seen. If not, the URL was added to

    the queue of the URL server.[16]

    2.2.3: Mercator is a distributed, modular web crawler written in Java. Its modularity

    arises from the usage of interchangeable "protocol modules" and "processing modules".

Protocol modules are related to how to acquire the Web pages (e.g. by HTTP), and

    processing modules are related to how to process Web pages. The standard processing

    module just parses the pages and extracts new URLs, but other processing modules can

    be used to index the text of the pages, or to gather statistics from the Web.[15]


    2.2.4: WebRACE is a crawling and caching module implemented in Java, and used as a

    part of a more generic system called eRACE. The system receives requests from users for

    downloading Web pages, so the crawler acts in part as a smart proxy server. The system

    also handles requests for "subscriptions" to Web pages that must be monitored: when the

    pages change, they must be downloaded by the crawler and the subscriber must be

    notified. The most outstanding feature of WebRACE is that, while most crawlers start

    with a set of "seed" URLs, WebRACE is continuously receiving new starting URLs to

    crawl from.[18]

    2.2.5: Ubicrawler is a distributed crawler written in Java, and it has no central process. It

    is composed of a number of identical "agents"; and the assignment function is calculated

    using consistent hashing of the host names. There is zero overlap, meaning that no page

    is crawled twice, unless a crawling agent crashes (then, another agent must re-crawl the

    pages from the failing agent). The crawler is designed to achieve high scalability and to

    be tolerant to failures.[13]

    2.2.6: Some Open-source crawlers [11]

DataparkSearch
GNU Wget
Heritrix
HTTrack
Methabot
Nutch
WebSPHINX
Sherlock Holmes
YaCy


    Chapter III

    Project Model


    This section covers:

    Basic Architecture

    Crawler Policies

    Issues

    3.1: Basic Architecture

    In this project we will develop a web crawler that will start with a list of URLs to visit,

    called the seeds. As the crawler visits these URLs, it identifies all the hyperlinks in the

    page and adds them to the list of URLs to visit, called the crawl frontier. URLs from the

    frontier are recursively visited according to a set of policies.

    While it is fairly easy to build a slow crawler that downloads a few pages per second for a

    short period of time, building a high-performance system that can download hundreds of

    millions of pages over several weeks presents a number of challenges in system design,

    I/O and network efficiency, and robustness and manageability.


    3.1.1: Basic Concept -

A web crawler, also known as a Web spider or Web robot, is a program or automated script which browses the World Wide Web in a methodical, automated manner. Web crawlers are mainly used to create a copy of all the visited pages for later processing by a search engine that will index the downloaded pages to provide fast searches. Crawlers

    can also be used for automating maintenance tasks on a Web site, such as checking links

    or validating HTML code.

    Web crawlers start by parsing a specified web page, noting any hypertext links on that

    page that point to other web pages. They then parse those pages for new links, and so on,

    recursively. Web-crawler software doesn't actually move around to different computers

    on the Internet, as viruses or intelligent agents do. A crawler resides on a single machine.

    The crawler simply sends HTTP requests for documents to other machines on the Internet

    just as a web browser does when the user clicks on links. All the crawler really does is to

    automate the process of following links. [2]


    3.1.2: Architecture -

The input to the focused crawler is the search query of the user. Also, a set of example pages relating to the query has to be given to the crawler. A series of parses are done on these example pages to finally extract the information-containing words. These information-containing words are given as input to the crawler, which carries them along with it. Based on this information, the crawler calculates the relevance of an encountered page, and only if the relevance is satisfactory will the page be stored for further crawling. [4]

    Fig 3.1: Simple Crawler Configuration [4]


The architecture can be classified into two major components: the crawling system and the crawling application. The crawling system itself consists of several specialized components, in particular a crawl manager, a downloader and a DNS resolver.

    The crawl manager is responsible for receiving the URL input stream from

the application. After loading the URLs of a request file, the manager queries the DNS resolvers for the IP addresses of the servers, unless a recent address is already cached. The manager then requests the file robots.txt in the web server's root directory, unless it

    already has a recent copy of the file. A downloader is a high-performance asynchronous

    HTTP client capable of downloading hundreds of web pages in parallel, while a DNS

    resolver is an optimized stub DNS resolver that forwards queries to local DNS servers.[6]

    Finally, after parsing the robots files and removing excluded URLs, the requested URLs

    are sent in batches to the downloader. The manager later notifies the application of the

    pages that have been downloaded and are available for processing.

The crawling application starts out by giving a URL to the crawl manager. The application then parses each downloaded page for hyperlinks, checks

    whether these URLs have already been encountered before, and if not, sends them to the

    manager in batches of a few hundred or thousand.[9] The downloaded files are then

    forwarded to a storage manager for compression and storage in a repository.

    3.1.3: Control flow

As the crawler gets the relevant pages, it retrieves their URLs and makes a list of them, from which it takes the URLs one by one and downloads each web page. The

    downloaded page is changed to a text file for simplicity. This text file is parsed removing

    all the stop words from it and stemming the remaining words using Porter Stemmer.

    Then its relevance is tested. If relevant, the URLs present on the page are extracted and

    added to the list of URLs for further crawling.
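The control flow just described can be sketched in Python as follows; the stop list, the stemmer and the relevance test are simplified placeholders standing in for the components detailed in Chapter 4, not the project's actual code:

import urllib.request
from html.parser import HTMLParser

STOP_WORDS = {"a", "of", "the", "i", "it", "you", "and"}

def remove_stop_words(words):
    return [w for w in words if w.lower() not in STOP_WORDS]

def crude_stem(word):
    # Placeholder for the Porter Stemmer: strip a few common suffixes only.
    for suffix in ("ing", "ed", "ly"):
        if word.endswith(suffix) and len(word) > len(suffix) + 2:
            return word[: -len(suffix)]
    return word

def is_relevant(stemmed_words):
    # Placeholder relevance test: require overlap with an assumed set of topic words.
    topic = {"crawl", "spider", "search"}
    return bool(topic.intersection(stemmed_words))

class PageParser(HTMLParser):
    # Collects the hyperlinks and the visible text of one HTML page.
    def __init__(self):
        super().__init__()
        self.links, self.text = [], []
    def handle_starttag(self, tag, attrs):
        if tag == "a":
            self.links.extend(v for k, v in attrs if k == "href" and v)
    def handle_data(self, data):
        self.text.append(data)

def crawl_step(url, frontier):
    html = urllib.request.urlopen(url, timeout=10).read().decode("utf-8", "ignore")
    parser = PageParser()
    parser.feed(html)
    # Convert the page to plain text, remove stop words, stem the remaining words.
    words = " ".join(parser.text).split()
    stemmed = [crude_stem(w) for w in remove_stop_words(words)]
    # Only if the page is relevant are its hyperlinks added for further crawling.
    if is_relevant(stemmed):
        frontier.extend(parser.links)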


    Fig 3.2: Control Flow of a Crawler Frontier


    3.2: Crawler policies

    There are three important characteristics of the Web that generate a scenario in which

Web crawling is very difficult: its large volume, its fast rate of change, and dynamic page generation, which together yield a wide variety of possible crawlable URLs.

    The large volume implies that the crawler can only download a fraction of the Web

    pages within a given time, so it needs to prioritize all of its downloads. The high rate of

    change implies that by the time the crawler is downloading the last pages from a site, it is

    very likely that new pages have been added to the site, or that pages have already been

    updated or even deleted.

    The recent increase in the number of pages being generated by server-side scripting

languages has also created difficulty: endless combinations of HTTP GET parameters exist, only a small selection of which will actually return unique content. For example, a simple online photo gallery may offer three options to users, as specified through HTTP GET parameters. If there exist four ways to sort images, three choices of thumbnail size, two file formats, and an option to disable user-provided content, then that same set of content can be accessed with forty-eight different URLs, all of which will

    be present on the site. This mathematical combination creates a problem for crawlers, as

    they must sort through endless combinations of relatively minor scripted changes in order

    to retrieve unique content.

    The behavior of a Web crawler is the outcome of a combination of policies:

    A selection policy that states which pages to download.

    A re-visit policy that states when to check for changes to the pages.

    A politeness policy that states how to avoid overloading websites.

    A parallelization policy that states how to coordinate distributed web crawlers.


    3.2.1: Selection policy -

    Given the current size of the Web, even large search engines cover only a portion of the

publicly available Internet; a recent study showed that no search engine indexes more

    than 16% of the Web. As a crawler always downloads just a fraction of the Web pages, it

    is highly desirable that the downloaded fraction contains the most relevant pages, and not

    just a random sample of the Web. [2]

    This requires a metric of importance for prioritizing Web pages. The importance of a

    page is a function of its intrinsic quality, its popularity in terms of links or visits, and

    even of its URL (the latter is the case of vertical search engines restricted to a single top-

    level domain, or search engines restricted to a fixed Web site). Designing a good

    selection policy has an added difficulty: it must work with partial information, as the

    complete set of Web pages is not known during crawling.

Crawling can be combined with different strategies. The ordering metrics can be breadth-first, backlink-count and partial PageRank calculations. One of the conclusions was that if the crawler wants to download pages with high PageRank early during the crawling process, then the partial PageRank strategy is better, followed by breadth-first and backlink-count. However, these results are for just a single domain.

Even so, the breadth-first strategy is generally considered better than PageRank ordering. The explanation for this is simple: it has been shown that the most important pages have many links to them from numerous hosts, and those links will be found early, regardless of the host or page on which the crawl originates.

    3.2.2: Re-visit policy -

    The Web has a very dynamic nature, and crawling a fraction of the Web can take a really

    long time, usually measured in weeks or months. By the time a Web crawler has finished

    its crawl, many events could have happened. These events can include creations, updates

    and deletions. [2]


    From the search engine's point of view, there is a cost associated with not detecting an

    event, and thus having an outdated copy of a resource. The most used cost functions are

    freshness and age.

    Freshness: This is a binary measure that indicates whether the local copy is

    accurate or not.

    Age: This is a measure that indicates how outdated the local copy is.
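In the crawling literature these two measures are commonly formalized as follows (a standard formulation, stated here for reference rather than taken from this report), for a page p in the local collection at time t:

F_p(t) = \begin{cases} 1 & \text{if the local copy of } p \text{ is identical to the live copy at time } t \\ 0 & \text{otherwise} \end{cases}

A_p(t) = \begin{cases} 0 & \text{if } p \text{ has not been modified since it was downloaded} \\ t - \text{(last modification time of } p\text{)} & \text{otherwise} \end{cases}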

    The objective of the crawler is to keep the average freshness of pages in its collection as

    high as possible, or to keep the average age of pages as low as possible. These objectives

    are not equivalent: in the first case, the crawler is just concerned with how many pages

    are out-dated, while in the second case, the crawler is concerned with how old the local

    copies of pages are.

    Two simple re-visiting policies are:

    Uniform policy: This involves re-visiting all pages in the collection with the same

    frequency, regardless of their rates of change.

    Proportional policy: This involves re-visiting more often the pages that change

    more frequently. The visiting frequency is directly proportional to the (estimated)

    change frequency.

    In terms of average freshness, the uniform policy outperforms the proportional policy in

    both a simulated Web and a real Web crawl. The explanation for this result comes from

    the fact that, when a page changes too often, the crawler will waste time by trying to re-

    crawl it too fast and still will not be able to keep its copy of the page fresh.

    3.2.3: Politeness policy -

    Crawlers can retrieve data much quicker and in greater depth than human searchers, so

    they can have a crippling impact on the performance of a site. Needless to say if a single

    crawler is performing multiple requests per second and downloading large files, a server

    would have a hard time keeping up with requests from multiple crawlers. [2]

    The use of Web crawlers is useful for a number of tasks, but comes with a price for the

    general community. The costs of using Web crawlers include:


    Network resources, as crawlers require considerable bandwidth and operate with a high

    degree of parallelism during a long period of time.

    Server overload, especially if the frequency of accesses to a given server is too high.

    Poorly written crawlers, which can crash servers or routers, or which download pages

    they cannot handle.

    Personal crawlers that, if deployed by too many users, can disrupt networks and Web

    servers.

    A partial solution to these problems is the robots exclusion protocol, also known as the

    robots.txt protocol that is a standard for administrators to indicate which parts of their

    Web servers should not be accessed by crawlers. This standard does not include a

    suggestion for the interval of visits to the same server, even though this interval is the

most effective way of avoiding server overload. Recently, commercial search engines like Ask Jeeves, MSN and Yahoo have been able to use an extra "Crawl-delay:" parameter in the robots.txt file to indicate the number of seconds to delay between requests. However, if

    pages were downloaded at this rate from a website with more than 100,000 pages over a

    perfect connection with zero latency and infinite bandwidth, it would take more than 2

    months to download only that entire website; also, only a fraction of the resources from

    that Web server would be used. This does not seem acceptable.

Normally one uses 10 seconds as an interval between accesses, and some crawlers use 15 seconds as the default. Some even follow an adaptive politeness policy: if it took t

    seconds to download a document from a given server, the crawler waits for 10t seconds

    before downloading the next page.
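A small sketch of such a politeness rule, combining an assumed default interval, an optional Crawl-delay value and the adaptive 10t heuristic described above (the combination is an illustration, not a prescribed standard):

import time

DEFAULT_DELAY = 10.0   # seconds; the commonly used default interval between accesses

def polite_delay(last_download_seconds, crawl_delay=None):
    adaptive = 10.0 * last_download_seconds              # the 10t adaptive rule
    floor = crawl_delay if crawl_delay is not None else DEFAULT_DELAY
    return max(floor, adaptive)

# Example: the last page took 0.8 s to download and robots.txt declared Crawl-delay: 5
delay = polite_delay(0.8, crawl_delay=5)                 # 8.0 seconds in this example
time.sleep(delay)                                        # wait before the next request to that server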

    3.2.4: Parallelization policy -

    A parallel crawler is a crawler that runs multiple processes in parallel. The goal is to

    maximize the download rate while minimizing the overhead from parallelization and to

    avoid repeated downloads of the same page. To avoid downloading the same page more

    than once, the crawling system requires a policy for assigning the new URLs discovered


    during the crawling process, as the same URL can be found by two different crawling

processes. The basic policies are as follows:

    Dynamic Assignment: With this type of policy, a central server assigns new URLs to

different crawlers dynamically. This allows the central server to, for instance,

    dynamically balance the load of each crawler.

    With dynamic assignment, typically the systems can also add or remove downloader

    processes. The central server may become the bottleneck, so most of the workload must

    be transferred to the distributed crawling processes for large crawls.

There are two configurations of crawling architectures with dynamic assignment: [2]

    A small crawler configuration, in which there is a central DNS resolver and

    central queues per Web site, and distributed downloaders.

    A large crawler configuration, in which the DNS resolver and the queues are also

    distributed.

    Static Assignment: With this type of policy, there is a fixed rule stated from the

    beginning of the crawl that defines how to assign new URLs to the crawlers.

    For static assignment, a hashing function can be used to transform URLs (or, even better,

    complete website names) into a number that corresponds to the index of the

    corresponding crawling process. As there are external links that will go from a Web site

    assigned to one crawling process to a website assigned to a different crawling process,

    some exchange of URLs must occur.[14]
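A small illustrative sketch of static assignment (simple modulo hashing of the host name, so every crawling process makes the same ownership decision without central coordination; the names here are placeholders):

import hashlib
from urllib.parse import urlparse

def assign_process(url, num_processes):
    host = urlparse(url).netloc.lower()
    digest = hashlib.md5(host.encode("utf-8")).hexdigest()
    return int(digest, 16) % num_processes

# Both URLs of the same site map to the same crawling process.
assert assign_process("http://example.com/a", 4) == assign_process("http://example.com/b", 4)

Note that plain modulo hashing re-assigns most hosts whenever the number of processes changes; the consistent hashing discussed below is what avoids that.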

To reduce the overhead due to the exchange of URLs between crawling processes, the exchange should be done in batch, several URLs at a time, and the most cited URLs in

    the collection should be known by all crawling processes before the crawl (e.g.: using

    data from a previous crawl) .

    An effective assignment function must have three main properties: each crawling process

    should get approximately the same number of hosts (balancing property), if the number


    of crawling processes grows, the number of hosts assigned to each process must shrink

    (contra-variance property), and the assignment must be able to add and remove crawling

processes dynamically. Consistent hashing, which replicates the buckets so that adding or removing a bucket does not require re-hashing of the whole table, achieves all of the desired properties. Crawling is an effective process synchronization tool between the users and the search engine.

    3.3: Issues

    3.3.1: How to Re-visit web pages

The optimal method to re-visit the web and keep the average freshness of pages high is to ignore the pages that change too often.

    The approaches could be: [2]

    Re-visiting all pages in the collection with the same frequency, regardless of their

    rates of change.

    Re-visiting more often the pages that change more frequently.

    In both cases, the repeated crawling order of pages can be done either at random or with a

    fixed order.

    The re-visiting methods considered here regard all pages as homogeneous in terms of

    quality ("all pages on the Web are worth the same"), something that is not a realisticscenario

    3.3.2: How to avoid overloading websites

    Crawlers can retrieve data much quicker and in greater depth than human searchers, so

    they can have a crippling impact on the performance of a site. Needless to say if a single


    crawler is performing multiple requests per second and/or downloading large files, a

    server would have a hard time keeping up with requests from multiple crawlers.

    The use of Web crawler is useful for a number of tasks, but comes with a price for the

    general community.

    The costs of using Web crawlers include: [1]

    Network resources , as crawlers require considerable bandwidth and operate with a

    high degree of parallelism during a long period of time.

    Server overload , especially if the frequency of accesses to a given server is too

    high.

    Poorly written crawlers , which can crash servers or routers, or which download

    pages they cannot handle.

    Personal crawlers that, if deployed by too many users, can disrupt networks and

    Web servers.

    To resolve this problem we can use robots exclusion protocol, also known as the

    robots.txt protocol.

    The robots exclusion standard or robots.txt protocol is a convention to prevent

    cooperating web spiders and other web robots from accessing all or part of a website. We

can specify the top-level directory of a web site in a file called robots.txt, and this will prevent crawlers from accessing that directory. The protocol uses simple substring comparisons to match the patterns defined in the robots.txt file. So, while using the robots.txt file, we need to make sure that a final '/' character is appended to the directory path [17]; otherwise, files with names starting with that substring will be matched rather than the directory.


    Chapter IV

    Algorithm Implementation

    This section covers:

    Outline

    Parsing and Stemming

    Threshold calculation

    Document Frequency

    Robots.txt


    4.1: Outline

The input to the focused crawler is the search query of the user in the form of example pages. Parsing is done and words are retrieved. These information-containing words are given as input to the crawler, which carries them along with it. Based on this information, the crawler calculates the relevance of an encountered page, and only if the relevance is satisfactory will the page be stored for further crawling.


    Fig 4.1: Basic functioning of crawl frontier

The pseudo-code summary of the crawler implementation:

Ask the user to specify the starting URL and the type of file the crawler should look for.
Add the URL to the (initially empty) list of URLs to search.
While the list of URLs to search is not empty
{
    Take the first URL from the list of URLs
    Mark this URL as already searched
    If the URL protocol is not HTTP then
        skip this URL and go back to the while loop
    If a robots.txt file exists on the site and it disallows this URL then
        skip this URL and go back to the while loop
    Open the URL
    If the opened URL is not an HTML file then
        skip this URL and go back to the while loop
    Iterate over the HTML file
    While the HTML text contains another link
    {
        If a robots.txt file exists on the linked site and it disallows the link then
            skip this link and continue with the next one
        If the linked URL is an HTML file then
            If the URL is not already marked as searched then
                add it to the list of URLs to search
        Else if the file type is the one the user requested then
            add it to the list of files found
    }
}

    4.2: Parsing and Stemming

    4.2.1: Parsing (more formally syntactical analysis) is the process of analyzing a sequence

    of tokens to determine its grammatical structure with respect to a given formal grammar.

    A parser is the component of a compiler that carries out this task.

    Parsing transforms input text into a data structure, usually a tree, which is suitable for

    later processing and which captures the implied hierarchy of the input. Lexical analysis


    creates tokens from a sequence of input characters and it is these tokens that are

    processed by a parser to build a data structure such as parse tree or abstract syntax trees.

Parsing is also an earlier term for the diagramming of sentences in the grammar of natural language, and is still used to diagram the grammar of inflected languages, such as the

    Romance languages or Latin.

    4.2.2: Removal of Stop Words

    Firstly, the web page is converted into a text file for convenience. An initial parse

    removes all the stop words from the file. Stop words are words which are filtered out

    prior to, or after, processing of natural language data (text). Some of the most frequently

used stop words include "a", "of", "the", "I", "it", "you", and "and". These are generally

    regarded as 'functional words' which do not carry meaning (are not as important for

    communication).[16] The assumption is that, when assessing the contents of the web

    page, the meaning can be conveyed more clearly, or interpreted more easily, by ignoring

    the functional words. A Stop List is maintained in a separate text file and all the words of

    that file are removed from the file being parsed.

    4.2.3: Stemming

    Next, a Porter Stemmer is run on this file. Stemming is the process for reducing inflected

(or sometimes derived) words to their stem, base or root form, generally a written word

    form. The stem need not be identical to the morphological root of the word; it is usually

    sufficient that related words map to the same stem, even if this stem is not in itself a valid

    root. A stemmer for English, for example, should identify the string "cats" (and possibly

    "catlike", "catty" etc.) as based on the root "cat", and "stemmer", "stemming", "stemmed"

    as based on "stem". Porter Stemmer is one algorithm for doing this process effectively.

    Some examples of the rules include:


    if the word ends in 'ed', remove the 'ed'

    if the word ends in 'ing', remove the 'ing'

    if the word ends in 'ly', remove the 'ly'

    Suffix stripping approaches enjoy the benefit of being much simpler to maintain than brute force algorithms, assuming the maintainer is sufficiently knowledgeable in the

    challenges of linguistics and morphology and encoding suffix stripping rules. Suffix

    stripping algorithms are sometimes regarded as crude given the poor performance when

    dealing with exceptional relations (like 'ran' and 'run'). The solutions produced by suffix

stripping algorithms are limited to those lexical categories which have well-known suffixes with few exceptions. This, however, is a problem, as not all parts of speech have such a well-formulated set of rules. Lemmatization attempts to improve upon this challenge. [16]

When this is done for all the example pages, we perform a frequency analysis of each word, noting the number of pages in which it has appeared, and then select the words that are

    most likely to carry the information content of the page. These words are given to the

    crawler.
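A hedged sketch of this preprocessing step, assuming the third-party NLTK library is available for its Porter Stemmer; the stop list shown is only a tiny illustrative subset of the roughly 650 stop words the project employs:

from nltk.stem import PorterStemmer

STOP_WORDS = {"a", "of", "the", "i", "it", "you", "and"}
stemmer = PorterStemmer()

def preprocess(text):
    # Remove stop words, then stem each remaining word.
    words = [w.lower() for w in text.split() if w.lower() not in STOP_WORDS]
    return [stemmer.stem(w) for w in words]

print(preprocess("The crawler keeps crawling and stemming the pages"))
# -> ['crawler', 'keep', 'crawl', 'stem', 'page']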

    4.3: Threshold Calculation

    4.3.1: Identification of Information carrying words -

First, the frequency of each word in each of the example pages is found. The number of pages in which it appears is also kept track of. Based on these two criteria, we select our information-containing words.*


    Fixing Threshold:-

    We have used the Vector Space Method to fix the threshold from the initial set of pages.

    Then we used the following formula to find out the relevance of a particular page.

Relevance = (No. of information words whose frequency is at least the mean frequency) / (Total no. of information words)

    4.3.2: Vector space model -

    It is an algebraic model used for information filtering, information retrieval, indexing and

    relevancy rankings. It represents natural language documents (or any objects, in general)

    in a formal manner through the use of vectors (of identifiers, such as, for example, index

    terms) in a multi-dimensional linear space. Its first use was in the SMART Information

    Retrieval System. Documents are represented as vectors of index terms (keywords). The

    set of terms is a predefined collection of terms, for example the set of all unique words

    occurring in the document corpus.*

    *Refer Appendix A
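As an illustration only, a page's relevance can be scored by the cosine similarity between its term vector and the vector built from the information-containing words of the example pages; the word lists and the 0.4 threshold below are assumptions made for the sake of the example:

import math
from collections import Counter

def cosine_similarity(words_a, words_b):
    a, b = Counter(words_a), Counter(words_b)
    common = set(a) & set(b)
    dot = sum(a[t] * b[t] for t in common)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

page_terms = ["focus", "crawler", "web", "crawler"]
example_terms = ["web", "crawler", "search", "crawler", "topic"]
is_relevant = cosine_similarity(page_terms, example_terms) >= 0.4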

    4.4: Document frequency

    4.4.1: Term frequency in the given document is simply the number of times a given

    term appears in that document. This count is usually normalized to prevent a bias towards

    longer documents (which may have a higher term frequency regardless of the actual

    importance of that term in the document) to give a measure of the importance of the term

    ti within the particular document. [10]
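In the usual notation (a standard textbook formulation, given here for reference rather than quoted from the report), the term frequency of term t_i in a document is

\mathrm{tf}_i = \frac{n_i}{\sum_{k} n_k}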


    Where ni is the number of occurrences of the considered term, and the denominator is the

    number of occurrences of all terms.

    4.4.2: The inverse document frequency is a measure of the general importance of the

    term (obtained by dividing the number of all documents by the number of documents

    containing the term, and then taking the logarithm of that quotient) .*

    A high weight in tfidf is reached by a high term frequency (in the given document) and

    a low document frequency of the term in the whole collection of documents; the weight

    hence tends to filter out common terms.
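For reference, the textbook forms of these quantities are (assuming standard notation, with |D| the total number of documents in the collection):

\mathrm{idf}_i = \log \frac{|D|}{|\{\, d : t_i \in d \,\}|} \qquad \mathrm{tf\mbox{-}idf}_i = \mathrm{tf}_i \times \mathrm{idf}_i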

    * Refer Appendix C

    4.5: Robots.txt :-

    The robots exclusion standard, also known as the Robots Exclusion Protocol or robots.txt

    protocol is a convention to prevent cooperating web spiders and other web robots from

    accessing all or part of a website which is, otherwise, publicly viewable. Robots are often

    used by search engines to categorize and archive web sites, or by webmasters to


    proofread source code. A robots.txt file on a website will function as a request that

    specified robots ignore specified files or directories in their search. This might be, for

    example, out of a preference for privacy from search engine results, or the belief that the

    content of the selected directories might be misleading or irrelevant to the categorization

    of the site as a whole, or out of a desire that an application only operate on certain data.

    The protocol, however, is purely advisory. It relies on the cooperation of the web robot,

    so that marking an area of a site out of bounds with robots.txt does not guarantee privacy.

    Some web site administrators have tried to use the robots file to make private

    parts of a website invisible to the rest of the world, but the file is necessarily publicly

    available and its content is easily checked by anyone with a web browser.

    An example robots.txt file

    # robots.txt for http://somehost.com/

    User-agent: *

    Disallow: /cgi-bin/

    Disallow: /registration

    Disallow: /login
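A short sketch using Python's standard urllib.robotparser module shows how a crawler can honour such a file before fetching a URL; somehost.com is the same placeholder host used in the example above, and the expected results assume the server returns exactly that file:

from urllib import robotparser

rp = robotparser.RobotFileParser()
rp.set_url("http://somehost.com/robots.txt")
rp.read()                                                    # download and parse robots.txt

print(rp.can_fetch("*", "http://somehost.com/index.html"))   # expected: True
print(rp.can_fetch("*", "http://somehost.com/cgi-bin/app"))  # expected: False (disallowed)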

    Chapter V


    Discussion and Results

    This section covers:

    Retrieval of relevant pages

    Multithreading

    Crawl Space reduction

    Server overload

    Robustness of Acquisition

    Snapshots

    5.1: Retrieval of the relevant pages only

    Relevant pages are those which are closely related to the input document to the crawler.

Our focused crawler achieves a relevance of downloaded web pages of up to 80-85%, while for a normal crawler it is only up to 20-25%.


Moreover, the number of relevant pages downloaded is 50-100 per hour, as compared to a normal crawler, which keeps downloading pages most of which are irrelevant.

    Fig 5.1: Comparison Analysis

    5.2: Multithreading

This issue has been dealt with successfully: on receiving more than one hyperlink from a file, a number of parallel threads are generated that work together to download and parse the pages.


    5.3: Crawl space reduction

    Crawl space is the number of pages visited by the crawler on the web. Our focused

crawler reduces the crawl space to a great extent, as it visits the hyperlinks of only the relevant pages, thus pruning most of the web tree.

    Fig 5.2: Crawl Space reduction

    5.4: Reduction of server overload

We have used the robots exclusion protocol, also known as the robots.txt protocol, to prevent the web spider from accessing all or part of a website.


    5.5: Robustness of acquisition

    Web Spider has the ability to ramp up to and maintain a healthy acquisition rate without

being too sensitive to the start set.

    5.6: Snapshots

Fig 5.3: When the page is downloaded, it is parsed and stemmed. The frequency of the

    words in the page is calculated and its relevance is checked using the cosine similarity

    method.


    Fig 5.4: Initially the URL http://www.cert.org/research/papers.html is given as input to

    Web Spider. The links it visits are shown above, of which only those found on

    relevant pages are downloaded.


    Chapter VI

    Conclusion

    This section covers:

    Conclusion

    Challenges and Future Work


    6.1: Conclusion

    The objectives of our project, as set out in the problem definition, have been fully

    achieved. Web Spider, the customised multithreaded focused crawler, is ready to be used

    with all of its functionality running properly.

    We have achieved a substantial reduction in crawl space. The download rate varies from

    50 to 100 pages an hour, and the downloaded pages are the ones most relevant to

    the user's input document. We have honoured the Robots Exclusion Protocol and have

    achieved a healthy acquisition rate without being too sensitive to the start document. The

    project runs successfully on modest desktop hardware.

    The process of developing Web Spider was highly instructive and enjoyable. We learned

    deep concepts of information retrieval and their practical implementation, and are proud

    to have completed the project to a satisfactory level.

    6.2: Challenges and future work

    6.2.1: Challenges

    Server-side checking: Web Spider in its present form downloads all the URLs

    found on a relevant page and discards the irrelevant ones after downloading. It is

    a challenge for us to implement a server-side check, i.e. evaluating the URLs on the

    server side and downloading only the relevant ones.

    Distributed web crawler: Our project presently works on a single system; to make

    it scalable, it is a challenge to turn it into a distributed system with many parallel crawlers running.

    6.2.2: Future work

    Extending the project to file formats on the web other than HTML and plain text.

    Ranking the downloaded pages by priority; a page having high cosine similarity

    with the example pages carries high priority.

    Increasing the harvest rate. Presently relevant pages are downloaded at a

    rate of 50-100 pages per hour; a better focused-crawling strategy could increase

    this rate.

    Implementing better preprocessing algorithms. We have presently employed up to

    650 stop words and implemented the Porter Stemmer algorithm. Using more stop

    words and a better stemming algorithm may further enhance the crawler's

    performance.


    Appendices

    This section covers:

    Term Vector Model

    Basic Authentication Scheme

    Term Frequency-Inverse Document Frequency


    Appendix A: Term vector model

    Term vector model is an algebraic model used for information filtering, information

    retrieval, indexing and relevancy rankings. It represents natural language documents (or

    any objects, in general) in a formal manner through the use of vectors (of identifiers, such

    as, for example, index terms) in a multi-dimensional linear space. Its first use was in the

    SMART Information Retrieval System.

    Documents are represented as vectors of index terms (keywords). The set of terms is a

    predefined collection of terms, for example the set of all unique words occurring in the

    document corpus.

    Relevancy rankings of documents in a keyword search can be calculated, using the

    assumptions of document similarity theory, by comparing the deviation of angles

    between each document vector and the original query vector, where the query is

    represented as the same kind of vector as the documents.

    In practice, it is easier to calculate the cosine of the angle between the vectors instead of

    the angle itself:

    cos θ = (d · q) / (||d|| ||q||)

    where d · q is the dot product of the document and query vectors and ||d||, ||q|| are their

    lengths. A cosine value of zero means that the query and document vectors are orthogonal and

    have no match (i.e. no query term occurs in the document being considered).
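    For illustration, a sketch of this computation over term-weight vectors stored as maps (an assumption about the representation, not necessarily the one used in the project) is:

    import java.util.Map;

    // Sketch of the cosine measure over two term-weight vectors stored as
    // maps from term to weight.
    public class CosineSimilarity {
        public static double cosine(Map<String, Double> d, Map<String, Double> q) {
            double dot = 0.0, normD = 0.0, normQ = 0.0;
            for (Map.Entry<String, Double> e : d.entrySet()) {
                normD += e.getValue() * e.getValue();
                Double w = q.get(e.getKey());
                if (w != null) dot += e.getValue() * w;    // shared terms only
            }
            for (double w : q.values()) normQ += w * w;
            if (normD == 0.0 || normQ == 0.0) return 0.0;  // empty vector: no match
            return dot / (Math.sqrt(normD) * Math.sqrt(normQ));
        }
    }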

    Assumptions and Limitations of the Vector Space Model

    The Vector Space Model has the following limitations:

    Long documents are poorly represented because they have poor similarity values (a small scalar product and a large dimensionality)

    Search keywords must precisely match document terms; word substrings might

    result in a "false positive match"


    Semantic sensitivity; documents with similar context but different term

    vocabulary won't be associated, resulting in a "false negative match".


    Appendix B: Basic authentication scheme

    In the context of an HTTP transaction, the basic authentication scheme is a method

    designed to allow a web browser, or other client program, to provide credentials in the

    form of a user name and password when making a request. Although the scheme is

    easily implemented, it relies on the assumption that the connection between the client and

    server computers is secure and can be trusted. Specifically, the credentials are passed as

    plaintext and could be intercepted easily. The scheme also provides no protection for the

    information passed back from the server.

    To prevent the user name and password being read directly by a person, they are encoded

    as a sequence of base-64 characters before transmission. For example, the user name

    "Aladdin" and password "open sesame" would be combined as "Aladdin:open sesame"

    which is equivalent to QWxhZGRpbjpvcGVuIHNlc2FtZQ== when encoded in base-64.

    Little effort is required to translate the encoded string back into the user name and

    password, and many popular security tools will decode the strings "on the fly", so an

    encrypted connection should always be used to prevent interception.

    One advantage of the basic authentication scheme is that it is supported by almost all

    popular web browsers. It is rarely used on normal Internet web sites but may sometimes

    be used by small, private systems. A later mechanism, digest access authentication, was

    developed in order to replace the basic authentication scheme and enable credentials to be

    passed in a relatively secure manner over an otherwise insecure channel.

    Example

    Here is a typical transaction between an HTTP client and an HTTP server running on the

    local machine (localhost). It comprises the following steps.

    The client asks for a page that requires authentication but does not provide a user name

    and password. Typically this is because the user simply entered the address or followed a

    link to the page.


    The server responds with the 401 response code and provides the authentication realm.

    At this point, the client will present the authentication realm (typically a description of

    the computer or system being accessed) to the user and prompt for a user name and

    password. The user may decide to cancel at this point.

    Once a user name and password have been supplied, the client re-sends the same request

    but includes the authentication header.

    In this example, the server accepts the authentication and the page is returned. If the user

    name is invalid or the password incorrect, the server might return the 401 response code

    and the client would prompt the user again.
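    The sketch below shows how a client program, such as a crawler, might attach the Authorization header once the credentials are known; it uses the standard Java Base64 and HttpURLConnection classes, and the class and method names are illustrative:

    import java.net.HttpURLConnection;
    import java.net.URL;
    import java.nio.charset.StandardCharsets;
    import java.util.Base64;

    // Sketch: encode "user:password" in base-64 and send it in the Authorization
    // header, as in the Aladdin/open sesame example above. Error handling omitted.
    public class BasicAuthRequest {
        public static int fetchStatus(String address, String user, String password) throws Exception {
            String credentials = user + ":" + password;
            String encoded = Base64.getEncoder()
                    .encodeToString(credentials.getBytes(StandardCharsets.UTF_8));
            HttpURLConnection conn = (HttpURLConnection) new URL(address).openConnection();
            conn.setRequestProperty("Authorization", "Basic " + encoded);
            return conn.getResponseCode();   // 200 on success, 401 if the credentials are rejected
        }
    }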


    Appendix C: Term frequency-inverse document frequency

    The tf-idf weight (term frequency-inverse document frequency) is a weight often used

    in information retrieval and text mining. This weight is a statistical measure used to

    evaluate how important a word is to a document in a collection or corpus. The importance

    increases proportionally to the number of times a word appears in the document but is

    offset by the frequency of the word in the corpus. Variations of the tf-idf weighting

    scheme are often used by search engines to score and rank a document's relevance given

    a user query. In addition to tf-idf weighting, Internet search engines use link-analysis-based

    ranking to determine the order in which the scored documents are presented to the

    user.

    The term frequency in the given document is simply the number of times a given term

    appears in that document. This count is usually normalized to prevent a bias towards

    longer documents (which may have a higher raw count regardless of the actual importance of that term in the document) to give a measure of the importance of the term

    ti within the particular document:

    tf_i = n_i / Σ_k n_k

    where n_i is the number of occurrences of the considered term, and the denominator is the

    number of occurrences of all terms in the document.

    The inverse document frequency is a measure of the general importance of the term,

    obtained by dividing the number of all documents by the number of documents

    containing the term, and then taking the logarithm of that quotient:

    idf_i = log( |D| / |{d : t_i appears in d}| )


    with

    |D| : total number of documents in the corpus

    |{d : t_i appears in d}| : number of documents in which the term t_i appears (that is, documents where n_i ≠ 0).

    The tf-idf weight of a term is then the product tf-idf_i = tf_i × idf_i.

    Numeric example

    There are many different formulas used to calculate tf-idf. The term frequency (TF) is

    the number of times the word appears in a document divided by the total number of

    words in the document. If a document contains 100 words and the word cow appears

    3 times, then the term frequency of the word cow in that document is 0.03 (3/100). One

    way of calculating document frequency (DF) is to determine how many documents

    contain the word cow and divide that by the total number of documents in the collection. So if

    cow appears in 1,000 documents out of a total of 10,000,000, then the document

    frequency is 0.0001 (1,000/10,000,000). The final tf-idf score is then calculated by

    dividing the term frequency by the document frequency. For our example, the tf-idf score

    for cow in the collection would be 300 (0.03/0.0001). An alternative, as in the formula

    above, is to take the logarithm of the inverse of the document frequency instead.
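    A sketch of this variant of the calculation, assuming documents are represented as term-frequency maps, is given below:

    import java.util.List;
    import java.util.Map;

    // Sketch of the tf-idf variant walked through above: tf is the term count
    // divided by the words in the document, df is the fraction of documents
    // containing the term, and the score is tf / df. Taking the logarithm of
    // 1/df instead gives the logarithmic variant.
    public class TfIdf {
        public static double score(String term, Map<String, Integer> doc,
                                   List<Map<String, Integer>> corpus) {
            int totalWords = doc.values().stream().mapToInt(Integer::intValue).sum();
            if (totalWords == 0) return 0.0;
            double tf = doc.getOrDefault(term, 0) / (double) totalWords;

            long containing = corpus.stream().filter(d -> d.containsKey(term)).count();
            if (containing == 0) return 0.0;               // term absent from the corpus
            double df = containing / (double) corpus.size();

            return tf / df;   // e.g. 0.03 / 0.0001 = 300 for the "cow" example
        }
    }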

    Applications in Vector Space Model

    The tf-idf weighting scheme is often used in the vector space model together with cosine

    similarity to determine the similarity between two documents.


    References

    This section covers:

    Technical references

    Other references


    Technical references

    [1] Soumen Chakrabarti, Martin van den Berg, Byron Dom, "Focused crawling: a new approach to topic-specific Web resource discovery", The Eighth International World Wide Web Conference, Toronto, 1999. Published by Elsevier Science B.V., 1999.

    [2] Vladislav Shkapenyuk, Torsten Suel, "Design and Implementation of a High-Performance Distributed Web Crawler", Proceedings of the 18th International Conference on Data Engineering (ICDE '02), 1063-6382/02, IEEE, 2002.

    [3] Ke Hu, Wing Shing Wong, "A probabilistic model for intelligent Web crawlers", Computer Software and Applications Conference, 2003, Proceedings of the 27th Annual International Conference, pages 278-282, ISSN 0730-3157, IEEE, 2003.

    [4] C. Castillo, "Effective Web Crawling", PhD thesis, University of Chile, 2004.

    [5] Padmini Srinivasan, Gautam Pant, "Learning to Crawl: Comparing Classification Schemes", ACM Transactions on Information Systems (TOIS), Volume 23, Issue 4, pages 430-462, ISSN 1046-8188, ACM Press, 2005.

    [6] P. Ipeirotis, A. Ntoulas, J. Cho, L. Gravano, "Modeling and managing content in text databases", Proceedings of the 21st IEEE International Conference, pages 606-617, ISBN 0-7695-2285-8, ISSN 1084-4627, IEEE, 2005.

    [7] R. Baeza-Yates, C. Castillo, M. Marin, A. Rodriguez, "Crawling a Country: Better Strategies than Breadth-First for Web Page Ordering", Proceedings of the Industrial and Practical Experience track of the 14th Conference on World Wide Web, pages 864-872, Chiba, Japan, ACM Press, 2005.

    [8] Gautam Pant, Padmini Srinivasan, "Link Contexts in Classifier-Guided Topical Crawlers", IEEE Transactions on Knowledge and Data Engineering, Vol. 18, No. 1, January 2006, IEEE, 2006.

    [9] M. Jamali, H. Sayyadi, B. B. Hariri, H. Abolhassani, "A Method for Focused Crawling Using Combination of Link Structure and Content", Web Intelligence 2006 (WI 2006), IEEE/WIC/ACM International Conference, December 2006, pages 753-756, ISBN 0-7695-2747-7, IEEE, 2006.


    Other References

    [10] http://www.devbistro.com/articles/Misc/Effective-Web-Crawler

    [11] http://en.wikipedia.org/wiki/Web_crawler

    [12] http://www.depspid.net/

    [13] http://www-db.stanford.edu/~backrub/google.html

    [14] http://www.webtechniques.com/archives/1997/05/burner/

    [15] http://www.ils.unc.edu/keyes/java/porter/

    [16] http://www.dcs.gla.ac.uk/idom/ir_resources/linguistic_utils/stop_words

    [17] http://combine.it.lth.se/

    [18] http://www.cse.iitb.ac.in/~soumen/focus/
