Crawling
Ida Mele
Nutch
• Apache Nutch is an open source Java implementation of a search engine.
• We can use Nutch for crawling a portion of the Web.
• Useful links:
  – http://nutch.apache.org/
  – http://wiki.apache.org/nutch/NutchTutorial
  – http://nutch.apache.org/apidocs-2.1/index.html
• Advantages of Nutch:
  – Understanding.
    • We have the source code, and we can use it to see how a large search engine works.
    • Nutch has been built using ideas from academia and industry, and it is very useful for researchers who want to try out new search algorithms.
  – Transparency.
    • The details of the ranking algorithms used by commercial search engines are secret, and there are usually economic reasons behind the ranked list of results.
    • The Nutch implementation is transparent: we know how the ranking algorithms work, and we can trust the fairness of the final rankings.
  – Extensibility.
    • Nutch is a platform for adding search to heterogeneous collections of information.
    • It allows us to customize the search interface.
    • We can extend the out-of-the-box functionality through the plugin mechanism.
Nutch vs. Lucene
• Nutch is built on top of Lucene.
• Apache Lucene is a Java library for text indexing and searching.
• Lucene provides high-performance, full-featured text search and supports any application that requires full-text search, but it is not a crawler.
Architecture
• Nutch can be divided into two pieces:
  – the crawler, which fetches pages and turns them into an inverted index;
  – the searcher, which answers users' search queries.
• The index is the interface between the crawler and the searcher.
• The crawler and searcher systems can be on separate hardware platforms.
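To make the role of the index more concrete, the following is a minimal sketch of an inverted index built with standard shell tools. It only illustrates the data structure (term to list of documents); it is not how Nutch or Lucene build their index, and the helper name build_inverted_index and the sample file names are hypothetical.

```shell
# Toy inverted index: maps each term to the list of files that contain it.
# build_inverted_index is a hypothetical helper, not part of Nutch or Lucene.
build_inverted_index() {
  awk '
    {
      for (i = 1; i <= NF; i++) {
        term = tolower($i)
        # Record each (term, file) pair only once.
        if (!((term, FILENAME) in seen)) {
          seen[term, FILENAME] = 1
          if (term in postings) postings[term] = postings[term] " " FILENAME
          else postings[term] = FILENAME
        }
      }
    }
    END {
      for (term in postings) print term ": " postings[term]
    }
  ' "$@" | LC_ALL=C sort
}

# Usage with two tiny "pages" (hypothetical sample data):
#   printf 'nutch crawls the web\n' > page1.txt
#   printf 'lucene indexes the web\n' > page2.txt
#   build_inverted_index page1.txt page2.txt
```

A searcher only needs to look up the query terms in this structure, which is why the index can act as the interface between the two pieces.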
• The crawler and searcher systems can be scaled independently. Example: for a highly trafficked search page that searches a relatively modest set of sites, we may use a modest crawler infrastructure and invest more substantial resources in the searcher.
Crawler System
• The crawler system is driven by the Nutch tool called crawl, and by other related tools that build and maintain the data structures.
• The data structures are:
  – the web database;
  – a set of segments;
  – the index.
WebDB
• The web database (WebDB) is a data structure for mirroring the structure and properties of the web graph being crawled.
• It stores two types of entities:
  – Page. It is indexed by its URL and by the MD5 hash of its contents. Other stored information: the number of outlinks, fetch information, and the score of the page.
  – Link. It represents the connection between the source page and the target page.
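As an illustration of why a content hash is a useful key, the sketch below computes the MD5 digest of some content with the md5sum utility. It assumes GNU coreutils is installed; content_md5 is a hypothetical helper and does not reproduce how Nutch computes or stores its hashes.

```shell
# Sketch: identify page content by its MD5 hash.
# Assumes the md5sum utility from GNU coreutils is available
# (content_md5 is a hypothetical helper name).
content_md5() {
  # Hash standard input and keep only the hexadecimal digest.
  md5sum | cut -d ' ' -f 1
}

# Two URLs that serve byte-identical content produce the same digest,
# which is how duplicate content can be recognised:
#   printf 'same page body' | content_md5
#   printf 'same page body' | content_md5
```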
Segment
• The segment is a collection of pages that are fetched and indexed by the crawler in a run.
• The fetchlist is a list of URLs to fetch, and it is generated from the WebDB.
• The fetcher output is the data retrieved from the pages in the fetchlist.
• Every segment has a limited lifespan: it becomes obsolete once all of its pages have been re-fetched (30 days is the default re-fetch interval).
Index
• Inverted index of all of the pages the system has retrieved.
• The index is created by merging all of the individual segment indexes.
• Nutch uses Lucene to build the index. Note that Lucene also has a concept of segment, but it is different:
  – in Lucene, an index segment is a portion of the index;
  – in Nutch, a segment is a fetched and indexed portion of the WebDB.
Crawling
• Nutch can operate at one of three different scales:
  – local filesystem;
  – intranet;
  – Web.
• The three scales have different characteristics. Example: crawling the local filesystem is more reliable than crawling at the other two scales.
• To crawl billions of pages from the web, we have to:
  – define the seed set, i.e., the set of pages we start with;
  – decide how many crawlers to use and how to partition the work among them;
  – decide how often to re-crawl;
  – cope with broken links, unresponsive sites, and unintelligible or duplicate content.
• The crawling process is basically a cycle made of three steps:
  1. the crawler generates a set of fetchlists from the WebDB (generate);
  2. a set of fetchers downloads the content from the Web (fetch);
  3. the crawler updates the WebDB with the new links that were found (update).
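The cycle can be tried out without Nutch on a simulated web. The sketch below is only a toy model under stated assumptions: the web is a tab-separated file of "source, target" pairs, the WebDB is a plain sorted file of known URLs, and toy_crawl and its work files are hypothetical names.

```shell
# Toy simulation of the generate/fetch/update cycle on a link table.
# Nothing here calls Nutch: toy_crawl, the work files, and the
# tab-separated "source<TAB>target" format are assumptions for illustration.
toy_crawl() {
  links=$1   # file with one "source<TAB>target" pair per line (the simulated web)
  seeds=$2   # file with one seed URL per line
  depth=$3   # number of generate/fetch/update rounds
  work=$(mktemp -d)

  LC_ALL=C sort -u "$seeds" > "$work/db"   # known URLs (stand-in for the WebDB)
  : > "$work/fetched"                      # URLs already fetched

  round=1
  while [ "$round" -le "$depth" ]; do
    # generate: URLs that are known but not yet fetched
    LC_ALL=C comm -23 "$work/db" "$work/fetched" > "$work/fetchlist"
    [ -s "$work/fetchlist" ] || break

    # fetch: "download" each page, i.e. look up its out-links
    awk -F '\t' 'NR == FNR { want[$1] = 1; next } $1 in want { print $2 }' \
      "$work/fetchlist" "$links" > "$work/newlinks"
    LC_ALL=C sort -u "$work/fetched" "$work/fetchlist" > "$work/fetched.tmp"
    mv "$work/fetched.tmp" "$work/fetched"

    # update: add the newly discovered URLs to the database
    LC_ALL=C sort -u "$work/db" "$work/newlinks" > "$work/db.tmp"
    mv "$work/db.tmp" "$work/db"

    round=$((round + 1))
  done

  cat "$work/fetched"   # print the URLs fetched within the given depth
  rm -rf "$work"
}
```

Calling toy_crawl with a larger depth reaches pages that are more links away from the seeds, which is the effect of repeating the cycle.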
Nutch observes:
• Politeness: URLs with the same host are always assigned to the same fetchlist, so that a web site is not overloaded with requests from multiple fetchers in rapid succession.
• The Robots Exclusion Protocol, which allows site owners to control which parts of their site may be crawled.
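The politeness rule can be sketched as a partitioning of URLs by host. The example below is an illustration only: assign_fetchlists is a hypothetical helper, and the round-robin assignment of hosts to fetchlists is an assumption, not Nutch's actual partitioning logic.

```shell
# Sketch of host-based partitioning: every URL of a host goes to the same fetchlist.
# assign_fetchlists and the round-robin host assignment are assumptions for
# illustration; the actual partitioning logic of Nutch is not reproduced here.
assign_fetchlists() {
  n=$1   # number of fetchlists
  # Reads URLs on standard input, prints "<fetchlist number><TAB><URL>".
  awk -v n="$n" '
    {
      # Extract the host: strip the scheme, then everything from the first "/".
      host = $0
      sub(/^[a-z]+:\/\//, "", host)
      sub(/\/.*$/, "", host)
      # The first time a host is seen, give it the next fetchlist in turn.
      if (!(host in list)) list[host] = hosts++ % n
      print list[host] "\t" $0
    }
  '
}

# Usage (hypothetical URLs):
#   printf 'http://a.example/1\nhttp://b.example/x\nhttp://a.example/2\n' | assign_fetchlists 2
```

Since a fetchlist is processed by one fetcher, keeping a host inside a single fetchlist means that only one fetcher ever contacts that host.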
Crawling: low-level tools
• Crawling is done by the crawl tool of Nutch, which is a front end to lower-level tools.
• The crawl tool can be used to get started with crawling websites, but then we need to use the lower-level tools to perform re-crawls and other maintenance on the data structures built during the initial crawl.
• We can use the lower-level tools in sequence:
  1. Create a new WebDB (admin db -create).
  2. Inject root URLs into the WebDB (inject).
  3. Generate a fetchlist from the WebDB in a new segment (generate).
  4. Fetch content from URLs in the fetchlist (fetch).
  5. Update the WebDB with links from fetched pages (updatedb).
  6. Repeat steps 3-5 until the required depth is reached.
  7. Update segments with scores and links from the WebDB (updatesegs).
  8. Index the fetched pages (index).
  9. Eliminate duplicate content, and duplicate URLs, from the indexes (dedup).
  10. Merge the indexes into a single index for searching (merge).
• We create a new WebDB (step 1), and we populate it with some seed URLs (step 2).
• Then we use the generate/fetch/update cycle (steps 3-6).
• When this cycle has finished, the crawler goes on to create an index (steps 7-10):
  – each segment is indexed independently (step 8);
  – the duplicate pages are removed (step 9);
  – the individual indexes are combined into a single index (step 10).
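To see how the depth affects the sequence of steps, the sketch below prints the order of the tool invocations for a given depth. It uses only the tool names listed in the steps and deliberately omits their arguments, which depend on the Nutch version; crawl_plan is a hypothetical helper that runs nothing.

```shell
# Prints the order in which the lower-level tools are invoked for a given depth.
# Only the tool names listed in the steps are used; their arguments are
# deliberately omitted, and crawl_plan itself is a hypothetical helper.
crawl_plan() {
  depth=$1
  echo "admin db -create"   # step 1
  echo "inject"             # step 2
  i=1
  while [ "$i" -le "$depth" ]; do   # steps 3-6: one round per level of depth
    echo "generate"
    echo "fetch"
    echo "updatedb"
    i=$((i + 1))
  done
  echo "updatesegs"         # step 7
  echo "index"              # step 8 (applied to each segment)
  echo "dedup"              # step 9
  echo "merge"              # step 10
}

# Usage:
#   crawl_plan 2
```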
Running a crawl with Nutch
• Download and unpack a Nutch distribution (for example apache-nutch-1.1-bin.zip).
• Make sure that the environment variable NUTCH_JAVA_HOME or JAVA_HOME is set properly, so that Nutch knows where Java is installed:
  – Run the following command, or add it to the .bashrc file (%pathJava stands for the Java installation path):
    export NUTCH_JAVA_HOME=%pathJava
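Before running Nutch it can help to check which Java location will be picked up. The following is a small sketch, assuming that NUTCH_JAVA_HOME takes precedence over JAVA_HOME when both are set; resolve_java_home is a hypothetical helper and is not part of the Nutch distribution.

```shell
# Hypothetical helper: report the Java location Nutch is expected to use,
# assuming NUTCH_JAVA_HOME is preferred and JAVA_HOME is the fallback.
resolve_java_home() {
  if [ -n "${NUTCH_JAVA_HOME:-}" ]; then
    printf '%s\n' "$NUTCH_JAVA_HOME"
  elif [ -n "${JAVA_HOME:-}" ]; then
    printf '%s\n' "$JAVA_HOME"
  else
    echo "Neither NUTCH_JAVA_HOME nor JAVA_HOME is set" >&2
    return 1
  fi
}

# Usage:
#   resolve_java_home || echo "set one of the two variables first"
```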
Nutch configuration
• All of Nutch's configuration files are in the conf subdirectory of the Nutch distribution.
• The main configuration file is conf/nutch-default.xml. It contains the default settings, and should not be modified.
• To change a setting we can create or update the conf/nutch-site.xml file.
• Add your agent name in the value field of the http.agent.name property in the file conf/nutch-site.xml. For example, we can use the name Sapienza University:
<property>
  <name>http.agent.name</name>
  <value>Sapienza University</value>
  <description>
    HTTP 'User-Agent' request header. MUST NOT be empty - please set this to a single word uniquely related to your organization.
  </description>
</property>
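The property has to sit inside the root configuration element of the file. The sketch below writes a minimal complete file that overrides only http.agent.name. It writes to a scratch path (the OUT variable and its default file name are assumptions) so that an existing conf/nutch-site.xml is not overwritten.

```shell
# Sketch: write a minimal nutch-site.xml that overrides only http.agent.name.
# It writes to a scratch file (the OUT path is an assumption) so that an
# existing conf/nutch-site.xml is not overwritten; copy it into conf/ once checked.
OUT=${OUT:-./nutch-site.example.xml}
cat > "$OUT" <<'EOF'
<?xml version="1.0"?>
<configuration>
  <property>
    <name>http.agent.name</name>
    <value>Sapienza University</value>
  </property>
</configuration>
EOF
```

Any setting not listed in this file keeps the default value from conf/nutch-default.xml.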
Url filter
• The crawl tool uses a filter to decide which URLs can go into the WebDB (steps 2 and 5).
• This can be used to restrict the crawl to the URLs that match any given pattern, specified by regular expressions.
• For example, if we want to restrict the crawl to the DIS domain, we have to update the configuration file conf/crawl-urlfilter.txt.
• Open the file conf/crawl-urlfilter.txt and replace the line:
  +^http://([a-z0-9]*\.)*MY.DOMAIN.NAME/
  with:
  +^http://([a-z0-9]*\.)*dis.uniroma1.it/
• The file conf/crawl-urlfilter.txt will contain:
  # accept hosts in MY.DOMAIN.NAME
  #+^http://([a-z0-9]*\.)*MY.DOMAIN.NAME/
  +^http://([a-z0-9]*\.)*dis.uniroma1.it/
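Before starting a crawl, the pattern can be tried on a few URLs with grep. This is a sketch under the assumption that grep's extended regular expressions behave like Nutch's regex engine for this simple pattern; matches_filter and the sample URLs are hypothetical.

```shell
# Sketch: try the accept pattern on a few URLs with grep -E before crawling.
# The leading "+" of the filter line means "accept" and is not part of the regex.
# matches_filter is a hypothetical helper; grep's extended regular expressions
# are only assumed to agree with Nutch's regex engine for this simple pattern.
PATTERN='^http://([a-z0-9]*\.)*dis.uniroma1.it/'

matches_filter() {
  printf '%s\n' "$1" | grep -Eq "$PATTERN"
}

for url in \
  'http://www.dis.uniroma1.it/' \
  'http://dis.uniroma1.it/airo/index.php' \
  'http://www.example.com/'
do
  if matches_filter "$url"; then echo "accept $url"; else echo "reject $url"; fi
done
```

Note that the dots in dis.uniroma1.it are not escaped, so they match any character; writing dis\.uniroma1\.it would make the pattern stricter.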
Example
• Create a file called urls that contains the root URLs.
• These URLs will be used to populate the initial fetchlist.
• For example, if we want to start from the home page of the department, we will use:
  echo 'http://www.dis.uniroma1.it' > urls
• We run the crawler with:
  bin/nutch crawl urls -dir mycrawl -depth 5 > mycrawl.log
  where:
  – urls is the name of the file with the seed URLs;
  – mycrawl is the name of the output directory;
  – 5 is the depth of the crawl;
  – mycrawl.log is the name of the log file.
Results of the Crawl
• The directory mycrawl contains the following subdirectories:
  – crawldb
  – linkdb
  – segments
  – index
  – indexes
Results of the Crawl: readdb
• The readdb tool parses the WebDB and displays portions of it in human-readable form.
  – The stats option displays the number of pages and links:
    bin/nutch readdb mycrawl/crawldb -stats > stats.txt
    Use:
    more stats.txt
  – The dump option gives the dump of the pages. Each page appears in a separate block, with one field per line. The ID field is the MD5 hash of the page contents. There is also information about when each page should next be fetched (by default, after 30 days), and the page scores. We issue the command:
    bin/nutch readdb mycrawl/crawldb -dump mydump
    and then use:
    more mydump/part-00000
  – The readdb tool also supports extraction of an individual page or link by URL or MD5 hash. For example, to examine the information about the page http://www.dis.uniroma1.it/airo/index.php we issue the command:
    bin/nutch readdb mycrawl/crawldb -url http://www.dis.uniroma1.it/airo/index.php
Results of the Crawl: readlinkdb
• The readlinkdb tool can be used to create the dump of the link structure (the graph):
  bin/nutch readlinkdb mycrawl/linkdb/ -dump mylinks
  We can read the in-links by using:
  more mylinks/part-00000
• Note that it gives us just the list of the in-links. For the out-links we have to merge the segments and read the result.
Results of the Crawl: readseg
• The crawl creates a few segments in timestamped subdirectories, one for each generate/fetch/update cycle.
• The readseg tool is the segment reader.
  – The list option gives a summary of all of the generated segments:
    bin/nutch readseg -list -dir mycrawl/segments/
  – The dump option gives a dump of a given segment:
    bin/nutch readseg -dump mycrawl/segments/20111204150100/ dump_seg1
    Use:
    more dump_seg1/dump
• Note that the name of a segment is given by the date and time at which the segment was created. For instance, 20111204150100 is the name of the segment created on 2011-12-04 at 15:01:00.
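The naming rule can be checked with a short sketch that rearranges the 14 digits of a segment name into a readable timestamp. segment_time is a hypothetical helper, not a Nutch tool.

```shell
# Sketch: turn a segment name such as 20111204150100 into a readable timestamp.
# segment_time is a hypothetical helper; it only rearranges the 14 digits.
segment_time() {
  printf '%s\n' "$1" |
    sed -E 's/^([0-9]{4})([0-9]{2})([0-9]{2})([0-9]{2})([0-9]{2})([0-9]{2})$/\1-\2-\3 \4:\5:\6/'
}

segment_time 20111204150100   # prints 2011-12-04 15:01:00
```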
Results of the Crawl: mergesegs
• We have seen that the readlinkdb tool can be used to get the list of in-links.
• To get the out-links, we need to merge the segments and read the result.
• We use the mergesegs tool:
  bin/nutch mergesegs whole-segments -dir mycrawl/segments/*
• Then we can use the dump option of the readseg tool on the result of the merge:
  bin/nutch readseg -dump whole-segments/20111204174133/ dump-outlinks
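The name of the merged segment (20111204174133 in the command above) is a timestamp, so it differs on every run. A minimal sketch for looking it up instead of typing it by hand follows; latest_segment is a hypothetical helper, and it assumes the directory contains only segment subdirectories.

```shell
# Sketch: look up the name of the merged segment instead of typing it by hand.
# latest_segment is a hypothetical helper. Because segment names are timestamps
# of equal length, the lexicographically last name is also the most recent one.
latest_segment() {
  # $1 is a directory that contains only segment subdirectories.
  ls -1 "$1" | LC_ALL=C sort | tail -n 1
}

# Usage after mergesegs (not executed here, since it needs a real crawl):
#   seg=$(latest_segment whole-segments)
#   bin/nutch readseg -dump "whole-segments/$seg" dump-outlinks
```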
Exercise
• Install and configure Nutch.
• Create the file with the seed set (for example, urls).
• Update the conf/crawl-urlfilter.txt file.
• Decide the depth of the crawl, and crawl a portion of the web using the crawl tool.
• Download NutchGraph.jar and add it to the directory containing all the libraries.
• Update the set-classpath.sh file.
• Set the classpath.
• Create the file with the in-links using the following commands:
  – bin/nutch readlinkdb mycrawl/linkdb/ -dump mylinks
  – egrep -v $'^$' mylinks/part-00000 > inlinks.txt
• Merge the segments:
  bin/nutch mergesegs whole-segments -dir mycrawl/segments/*
• Use readseg to read the segments and create the file with the out-links:
  – bin/nutch readseg -dump whole-segments/20111204174133/ dump-outlinks
  – cat dump-outlinks/dump | egrep 'URL|toUrl' > outlinks.txt
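The two egrep filters above can be tried on made-up input to see what they keep. In this sketch the sample lines only imitate the general shape of a dump and are an assumption: the real output of readlinkdb and readseg contains more fields and may differ between Nutch versions. The helper names are hypothetical.

```shell
# Sketch of the two filters on made-up input. The sample lines only imitate the
# general shape of a dump; the real output of readlinkdb and readseg has more
# fields and may differ between Nutch versions.
drop_empty_lines() { grep -v '^$'; }          # same effect as: egrep -v $'^$'
keep_link_lines()  { grep -E 'URL|toUrl'; }   # same effect as: egrep 'URL|toUrl'

# The two filters are chained here only to keep the demonstration short.
printf 'URL:: http://a.example/\n\nTitle: A\noutlink: toUrl: http://b.example/ anchor: B\n' |
  drop_empty_lines | keep_link_lines
```

The first filter removes the blank lines that separate records, and the second keeps only the lines that name a page or one of its out-links.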
• Print the in-links and out-links in the links.txt file by issuing the following commands:
  – java nutchGraph.PrintInlinks inlinks.txt > links.txt
  – java nutchGraph.PrintOutlinks outlinks.txt >> links.txt
• Remove the duplicates:
  LANG=C sort links.txt | uniq > cleaned-links.txt
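The duplicate removal can be tested on a tiny link list first. The sketch below uses LC_ALL=C instead of LANG=C, which is a substitution on my part: LC_ALL overrides LANG when both are set, so it forces byte-order sorting in more environments. dedupe_links and the sample data are hypothetical.

```shell
# Sketch: duplicate removal on a tab-separated link list.
# uniq only collapses adjacent duplicates, which is why the input is sorted first.
dedupe_links() {
  # LC_ALL=C forces byte-order sorting even if the locale variable LC_ALL is set,
  # which would otherwise override LANG; "sort -u" is an equivalent shortcut.
  LC_ALL=C sort "$1" | uniq
}

# Usage (made-up data, one "source<TAB>target" pair per line):
#   printf 'b\tc\na\tb\nb\tc\n' > sample-links.txt
#   dedupe_links sample-links.txt
```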
• Create the map of URLs with the following commands:
  – cut -f1 links.txt > url-list.txt
  – cut -f2 links.txt >> url-list.txt
  – LANG=C sort url-list.txt | uniq > sorted-url-list.txt
  – java -Xmx2G it.unimi.dsi.util.FrontCodedStringList -u -r 32 umap.fcl < sorted-url-list.txt
  – java -Xmx2G it.unimi.dsi.sux4j.mph.MWHCFunction umap.mph sorted-url-list.txt
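The first three commands collect every URL that appears as a source or as a target, and the Java tools then build compact structures over that sorted list. The sketch below imitates the idea on a small scale by numbering the sorted URLs from 0. It is only an illustration of mapping URLs to node identifiers: url_ids is a hypothetical helper and does not reproduce FrontCodedStringList or MWHCFunction.

```shell
# Sketch: build the sorted list of distinct URLs from a two-column link file and
# number them from 0. url_ids is a hypothetical helper: the numbering only
# imitates the idea of mapping each URL to a node id and is not the MWHCFunction.
url_ids() {
  # $1 is a tab-separated file: source URL, target URL.
  { cut -f1 "$1"; cut -f2 "$1"; } | LC_ALL=C sort -u |
    awk '{ printf "%d\t%s\n", NR - 1, $0 }'
}

# Usage (made-up data):
#   printf 'http://a/\thttp://b/\nhttp://b/\thttp://c/\n' > sample-links.txt
#   url_ids sample-links.txt
```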
• Create the graph:
  – java -Xmx2G nutchGraph.PrintEdges cleaned-links.txt umap.mph > webgraph.dat
  – numNodes=$(wc -l < sorted-url-list.txt)
  – java -Xmx2G nutchGraph.IncidenceList2Webgraph $numNodes webgraph
  – java -Xmx2G it.unimi.dsi.webgraph.BVGraph -g ASCIIGraph webgraph webgraph
Indexing
• Once the crawling operation is completed, we have the graph and the indexed pages.
• Remember that Nutch uses Lucene for the indexing phase.
• If we want to use MG4J, we can use wget to collect the pages fetched during the crawl:
  wget -i sorted-url-list.txt
• Then we can use MG4J for indexing and querying the resulting collection of web documents.
[Figure: overview of the pipeline. Labels: WEB, Nutch, db, readdb, Parser, link structure, graph.txt, ASCIIGraph, BVGraph, PageRank, PR, getfiles, files, MG4J, Query, Rank.]
Homework
• Repeat the exercise using a different seed set.
• Create the corresponding webgraph.
• Compute the PageRank for the nodes of the webgraph.
• Plot the distribution of the PageRank values.