Large Scale Crawling with Apache Nutch and Friends


Description

Presented by Julien Nioche, Director, DigitalPebble. This session will give an overview of Apache Nutch. I will describe its main components and how it fits with other Apache projects such as Hadoop, SOLR, Tika or HBase. The second part of the presentation will focus on the latest developments in Nutch, the differences between the 1.x and 2.x branches, and what we can expect to see in Nutch in the future. This session will cover many practical aspects and should be a good starting point for crawling on a large scale with Apache Nutch and SOLR.

Transcript

Large Scale Crawling with Apache Nutch and friends...

Julien Nioche – julien@digitalpebble.com

LUCENE/SOLR REVOLUTION EU 2013

2 / 43

About myself

DigitalPebble Ltd, Bristol (UK)
Specialised in Text Engineering :
– Web Crawling
– Natural Language Processing
– Information Retrieval
– Machine Learning

Strong focus on Open Source & the Apache ecosystem
VP of Apache Nutch
User | Contributor | Committer on :
– Tika
– SOLR, Lucene
– GATE, UIMA
– Mahout
– Behemoth

3 / 43

Outline
– Overview
– Installation and setup
– Main steps
– Nutch 2.x
– Future developments

4 / 43

Nutch?

“Distributed framework for large scale web crawling” (but it does not have to be large scale at all)

Based on Apache Hadoop

Apache TLP since May 2010

Indexing and Search by SOLR

5 / 43

A bit of history

2002/2003 : Started by Doug Cutting & Mike Cafarella

2005 : MapReduce implementation in Nutch

– 2006 : Hadoop sub-project of Lucene @Apache

2006/7 : Parser and MimeType in Tika

– 2008 : Tika sub-project of Lucene @Apache

May 2010 : TLP project at Apache

Sept 2010 : Storage abstraction in Nutch 2.x
– 2012 : Gora TLP @Apache

6 / 43

Recent Releases

[Timeline: releases from 06/09 to 06/13 — 1.x branch (trunk) : 1.0, 1.1, 1.2, 1.3, 1.4, 1.5.1, 1.6, 1.7 ; 2.x branch : 2.0, 2.1, 2.2.1]

7 / 43

Why use Nutch?

Features
– Index with SOLR / ES / CloudSearch
– PageRank implementation
– Loads of existing plugins
– Can easily be extended / customised

Usual reasons
– Open source with a business-friendly license, mature, community, ...

Scalability
– Tried and tested on a very large scale
– Standard Hadoop

8 / 43

Use cases

Crawl for search
– Generic or vertical
– Index and Search with SOLR et al.
– Single node to large clusters on the Cloud

… but also
– Data Mining
– NLP (e.g. Sentiment Analysis)
– ML
– with MAHOUT / UIMA / GATE
– Use Behemoth as glueware (https://github.com/DigitalPebble/behemoth)

9 / 43

Customer cases

[Chart: customer cases plotted by Specificity (Verticality) vs Size]

BetterJobs.com (CareerBuilder)
– Single server
– Aggregates content from job portals
– Extracts and normalizes structure (description, requirements, locations)
– ~2M pages total
– Feeds SOLR index

SimilarPages.com
– Large cluster on Amazon EC2 (up to 400 nodes)
– Fetched & parsed 3 billion pages
– 10+ billion pages in crawlDB (~100TB data)
– 200+ million lists of similarities
– No indexing / search involved

10 / 43

CommonCrawl (http://commoncrawl.org/)

Open repository of web crawl data
– 2012 dataset : 3.83 billion docs
– ARC files on Amazon S3

Using Nutch 1.7
– A few modifications to Nutch code : https://github.com/Aloisius/nutch
– Next release imminent

11 / 43

Outline
– Overview
– Installation and setup
– Main steps
– Nutch 2.x
– Future developments

12 / 43

Installation

http://nutch.apache.org/downloads.html

1.7 => src and bin distributions
2.2.1 => src only

'ant clean runtime'
– runtime/local => local mode (test and debug)
– runtime/deploy => job jar for Hadoop + scripts

Binary distribution for 1.x == runtime/local
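
A minimal build sketch, assuming the 1.7 source distribution (file names are illustrative):

  # unpack the source release and build both runtimes
  tar xzf apache-nutch-1.7-src.tar.gz
  cd apache-nutch-1.7
  ant clean runtime
  ls runtime/        # local/ (test & debug) and deploy/ (Hadoop job jar + scripts)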

13 / 43

Configuration and resources

Changes in $NUTCH_HOME/conf
– Need recompiling with 'ant runtime'
– Local mode => can be made directly in runtime/local/conf

Specify configuration in nutch-site.xml
– Leave nutch-default alone!

At least :

<property>
  <name>http.agent.name</name>
  <value>WhateverNameDescribesMyMightyCrawler</value>
</property>

14 / 43

Running it!

bin/crawl script : typical sequence of steps

bin/nutch : individual Nutch commands
– inject / generate / fetch / parse / updatedb …

Local mode : great for testing and debugging

Recommended : deploy + Hadoop (pseudo-)distributed mode
– Parallelism
– MapReduce UI to monitor crawl, check logs, counters
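
As a rough sketch (argument order may differ slightly between versions), the all-in-one script and a couple of individual commands look like:

  # all-in-one : seed dir, crawl dir, SOLR URL, number of rounds
  bin/crawl urls/ crawl/ http://localhost:8983/solr/ 2

  # or drive individual steps by hand
  bin/nutch inject crawl/crawldb urls/
  bin/nutch readdb crawl/crawldb -stats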

15 / 43

Monitor Crawl with MapReduce UI

16 / 43

Counters and logs

17 / 43

Overview

Installation and setup

Main steps

Nutch 2.x

Future developments

Outline

18 / 43

Typical Nutch Steps

1) Inject → populates CrawlDB from seed list

2) Generate → Selects URLs to fetch into a segment

3) Fetch → Fetches URLs from segment

4) Parse → Parses content (text + metadata)

5) UpdateDB → Updates CrawlDB (new URLs, new status...)

6) InvertLinks → Builds the Webgraph (LinkDB)

7) Index → Send docs to [SOLR | ES | CloudSearch | … ]

Sequence of batch operations

Or use the all-in-one crawl script

Repeat steps 2 to 7

Same in 1.x and 2.x
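
A hedged example of one full round in local mode using the individual commands (paths and the SOLR URL are illustrative):

  bin/nutch inject crawl/crawldb urls/
  bin/nutch generate crawl/crawldb crawl/segments -topN 1000
  s=`ls -d crawl/segments/2* | tail -1`      # newest segment
  bin/nutch fetch $s
  bin/nutch parse $s
  bin/nutch updatedb crawl/crawldb $s
  bin/nutch invertlinks crawl/linkdb -dir crawl/segments
  bin/nutch solrindex http://localhost:8983/solr/ crawl/crawldb -linkdb crawl/linkdb $s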

19 / 43

Main steps from a data perspective

[Diagram: Seed List → CrawlDB → Segment → LinkDB]

Segment subdirectories :
– crawl_generate/
– crawl_fetch/
– content/
– crawl_parse/
– parse_data/
– parse_text/

20 / 43

Frontier expansion

Manual “discovery”
– Adding new URLs by hand, “seeding”

Automatic discovery of new resources (frontier expansion)
– Not all outlinks are equally useful - control
– Requires content parsing and link extraction

[Diagram: crawl frontier expanding outwards from the seed over iterations i = 1, 2, 3]

[Slide courtesy of A. Bialecki]

21 / 43

An extensible framework

Endpoints
– Protocol
– Parser
– HtmlParseFilter (a.k.a. ParseFilter in Nutch 2.x)
– ScoringFilter (used in various places)
– URLFilter (ditto)
– URLNormalizer (ditto)
– IndexingFilter
– IndexWriter (NEW IN 1.7!)

Plugins
– Activated with the parameter 'plugin.includes'
– Implement one or more endpoints (see the URLFilter sketch below)
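
To illustrate the endpoint mechanism, a minimal sketch of a custom URLFilter (class and package names are hypothetical; a real plugin would also need a plugin.xml descriptor and an entry in 'plugin.includes'):

  package com.example.nutch;                    // hypothetical package

  import org.apache.hadoop.conf.Configuration;
  import org.apache.nutch.net.URLFilter;

  public class NoQueryStringFilter implements URLFilter {
    private Configuration conf;

    // Return the URL to keep it, or null to reject it.
    public String filter(String urlString) {
      return urlString.contains("?") ? null : urlString;
    }

    public void setConf(Configuration conf) { this.conf = conf; }
    public Configuration getConf() { return conf; }
  }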

22 / 43

Features

Fetcher
– Multi-threaded fetcher
– Queues URLs per hostname / domain / IP
– Limits the number of URLs per round of fetching
– Default values are polite but can be made more aggressive (see the configuration sketch below)

Crawl Strategy
– Breadth-first but can be depth-first
– Configurable via custom ScoringFilters

Scoring
– OPIC (On-line Page Importance Calculation) by default
– LinkRank
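
As an illustration, a few of the fetcher/generator politeness knobs can be overridden in nutch-site.xml (values below are just examples; check nutch-default.xml for the authoritative names and defaults):

  <property>
    <name>fetcher.server.delay</name>
    <value>5.0</value>          <!-- seconds between requests to the same queue -->
  </property>
  <property>
    <name>fetcher.queue.mode</name>
    <value>byHost</value>       <!-- or byDomain / byIP -->
  </property>
  <property>
    <name>generate.max.count</name>
    <value>100</value>          <!-- cap URLs per host/domain in one fetch round -->
  </property>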

23 / 43

Features (cont.)

Protocols
– http, file, ftp, https
– Respects robots.txt directives

Scheduling
– Fixed or adaptive

URL filters
– Regex, FSA, TLD, prefix, suffix (see the regex-urlfilter.txt sketch below)

URL normalisers
– Default, regex
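
For example, the regex URL filter reads its rules from conf/regex-urlfilter.txt; a short sketch (the host is illustrative, the first matching pattern wins, '+' accepts and '-' rejects):

  # reject common binary extensions
  -\.(gif|jpg|png|zip|gz)$
  # keep URLs on one site
  +^https?://www\.example\.com/
  # reject everything else
  -.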

24 / 43

Features (cont.)

Other plugins
– CreativeCommons
– Feeds
– Language Identification
– Rel tags
– Arbitrary Metadata

Pluggable indexing
– SOLR | ES etc...

Parsing with Apache Tika
– Hundreds of formats supported
– But some legacy parsers as well

25 / 43

Indexing

Apache SOLR
– schema.xml in conf/
– SOLR 3.4
– JIRA issue for SOLRCloud : https://issues.apache.org/jira/browse/NUTCH-1377

ElasticSearch
– Version 0.90.1

AWS CloudSearch
– WIP : https://issues.apache.org/jira/browse/NUTCH-1517

Easy to build your own
– Text, DB, etc...

26 / 43

Typical Nutch document

Some of the fields (IndexingFilters in plugins or core code)
– url
– content
– title
– anchor
– site
– boost
– digest
– segment
– host
– type

Configurable ones
– meta tags (keywords, description etc...)
– arbitrary metadata
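
Purely as an illustration, a hypothetical document sent to the indexer might carry values along these lines (URL and values are made up):

  url     : http://www.example.com/page.html
  title   : Example page
  content : ... extracted plain text ...
  host    : www.example.com
  site    : www.example.com
  type    : text/html
  digest  : 1d6a3b9e...   (content hash)
  boost   : 1.0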

27 / 43

Outline
– Overview
– Installation and setup
– Main steps
– Nutch 2.x
– Future developments

28 / 43

NUTCH 2.x

2.0 released in July 2012

2.2.1 in July 2013

Same core features as 1.x
– MapReduce, Tika, delegation to SOLR, etc...

Moved to a 'big table'-like architecture
– Wealth of NoSQL projects in the last few years

Abstraction over storage layer → Apache GORA

29 / 43

Apache GORA

http://gora.apache.org/

ORM for NoSQL databases
– and limited SQL support + file-based storage

Serialization with Apache AVRO

Object-to-datastore mappings (backend-specific)

DataStore implementations (as of GORA 0.3) :
● Accumulo
● Cassandra
● HBase
● Avro
● DynamoDB
● SQL (broken)

30 / 43

AVRO Schema => Java code

{"name": "WebPage", "type": "record", "namespace": "org.apache.nutch.storage", "fields": [ {"name": "baseUrl", "type": ["null", "string"] }, {"name": "status", "type": "int"}, {"name": "fetchTime", "type": "long"}, {"name": "prevFetchTime", "type": "long"}, {"name": "fetchInterval", "type": "int"}, {"name": "retriesSinceFetch", "type": "int"}, {"name": "modifiedTime", "type": "long"}, {"name": "protocolStatus", "type": { "name": "ProtocolStatus", "type": "record", "namespace": "org.apache.nutch.storage", "fields": [ {"name": "code", "type": "int"}, {"name": "args", "type": {"type": "array", "items": "string"}}, {"name": "lastModified", "type": "long"} ] }},[…]

31 / 43

Mapping file (backend specific – Hbase)

<gora-orm>
  <table name="webpage">
    <family name="p" maxVersions="1"/> <!-- This can also have params like compression, bloom filters -->
    <family name="f" maxVersions="1"/>
    <family name="s" maxVersions="1"/>
    <family name="il" maxVersions="1"/>
    <family name="ol" maxVersions="1"/>
    <family name="h" maxVersions="1"/>
    <family name="mtdt" maxVersions="1"/>
    <family name="mk" maxVersions="1"/>
  </table>
  <class table="webpage" keyClass="java.lang.String" name="org.apache.nutch.storage.WebPage">
    <!-- fetch fields -->
    <field name="baseUrl" family="f" qualifier="bas"/>
    <field name="status" family="f" qualifier="st"/>
    <field name="prevFetchTime" family="f" qualifier="pts"/>
    <field name="fetchTime" family="f" qualifier="ts"/>
    <field name="fetchInterval" family="f" qualifier="fi"/>
    <field name="retriesSinceFetch" family="f" qualifier="rsf"/>

32 / 43

DataStore operations

Basic operations
– get(K key)
– put(K key, T obj)
– delete(K key)

Querying
– execute(Query<K, T> query) → Result<K, T>
– deleteByQuery(Query<K, T> query)

Wrappers for Apache Hadoop
– GoraInput|OutputFormat
– GoraRecordReader|Writer
– GoraMapper|Reducer
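
A minimal sketch of the DataStore API from third-party code, assuming the standard Nutch WebPage schema and a configured backend (the key shown uses the reversed-host form Nutch stores URLs under):

  import org.apache.gora.store.DataStore;
  import org.apache.gora.store.DataStoreFactory;
  import org.apache.hadoop.conf.Configuration;
  import org.apache.nutch.storage.WebPage;

  public class WebPageReader {
    public static void main(String[] args) throws Exception {
      Configuration conf = new Configuration();
      DataStore<String, WebPage> store =
          DataStoreFactory.getDataStore(String.class, WebPage.class, conf);
      WebPage page = store.get("com.example.www:http/");   // get(K key)
      if (page != null) {
        System.out.println("status: " + page.getStatus());
      }
      store.close();
    }
  }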

33 / 43

GORA in Nutch

AVRO schema provided and Java code pre-generated

Mapping files provided for backends

– can be modified if necessary

Need to rebuild to get the dependencies for the backend
– hence the source-only distribution of Nutch 2.x

http://wiki.apache.org/nutch/Nutch2Tutorial
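
Following the Nutch2Tutorial, selecting HBase as the backend roughly means enabling the gora-hbase dependency in ivy/ivy.xml, rebuilding with 'ant runtime', and pointing Nutch and GORA at the store (a hedged sketch; property names per the tutorial, check your version):

  # conf/gora.properties
  gora.datastore.default=org.apache.gora.hbase.store.HBaseStore

  <!-- conf/nutch-site.xml -->
  <property>
    <name>storage.data.store.class</name>
    <value>org.apache.gora.hbase.store.HBaseStore</value>
  </property>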

34 / 43

Benefits

Storage still distributed and replicated

… but one big table

– status, metadata, content, text → one place

– no more segments

Resume-able fetch and parse steps

Easier interaction with other resources

– Third-party code just needs to use GORA and the schema

Simplifies the Nutch code

Potentially faster (e.g. update step)

35 / 43

Drawbacks

More stuff to install and configure
– Higher hardware requirements

Current performance :-(
– http://digitalpebble.blogspot.co.uk/2013/09/nutch-fight-17-vs-221.html
– N2 + HBase : 2.7x slower than 1.x
– N2 + Cassandra : 4.4x slower than 1.x
– due mostly to the GORA layer : not inherent to HBase or Cassandra
– https://issues.apache.org/jira/browse/GORA-119 → filtered scans
– Not all backends provide data locality!

Not as stable as Nutch 1.x

36 / 43

2.x Work in progress

Stabilise backend implementations
– GORA-HBase most reliable

Synchronize features with 1.x
– e.g. missing LinkRank equivalent (GSOC 2013 – use Apache Giraph)
– No pluggable indexers yet (NUTCH-1568)

Filter-enabled scans
– GORA-119 => no need to de-serialize the whole dataset

37 / 43

Outline
– Overview
– Installation and setup
– Main steps
– Nutch 2.x
– Future developments

38 / 43

Future

New functionalities
– Support for SOLRCloud
– Sitemaps (from the CrawlerCommons library)
– Canonical tag
– Generic deduplication (NUTCH-656)

1.x and 2.x to coexist in parallel
– 2.x not yet a replacement for 1.x

Move to the new MapReduce API
– Use Nutch on Hadoop 2.x

39 / 43

More delegation

Great deal done in recent years (SOLR, Tika)

Share code with crawler-commons (http://code.google.com/p/crawler-commons/)
– Fetcher / protocol handling
– URL normalisation / filtering

Delegate PageRank-like computations to a graph library
– Apache Giraph
– Should be more efficient + less code to maintain
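
As a taste of what crawler-commons already provides, a hedged sketch of its robots.txt parsing API (URL and agent name are illustrative):

  import crawlercommons.robots.BaseRobotRules;
  import crawlercommons.robots.SimpleRobotRulesParser;

  public class RobotsCheck {
    public static void main(String[] args) {
      byte[] content = "User-agent: *\nDisallow: /private/".getBytes();
      SimpleRobotRulesParser parser = new SimpleRobotRulesParser();
      BaseRobotRules rules = parser.parseContent(
          "http://www.example.com/robots.txt", content, "text/plain", "MyCrawler");
      // prints false : /private/ is disallowed for all agents
      System.out.println(rules.isAllowed("http://www.example.com/private/page"));
    }
  }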

40 / 43

Longer term

Hadoop 2.x & YARN

Convergence of batch and streaming
– Storm / Samza / Storm-YARN / …

End of 100% batch operations ?
– Fetch and parse as streaming ?
– Always be fetching
– Generate / update / pagerank remain batch

See https://github.com/DigitalPebble/storm-crawler

41 / 43

Where to find out more?

Project page : http://nutch.apache.org/
Wiki : http://wiki.apache.org/nutch/
Mailing lists :
– user@nutch.apache.org
– dev@nutch.apache.org

Chapter in 'Hadoop: The Definitive Guide' (T. White)
– Understanding Hadoop is essential anyway...

Support / consulting :
– http://wiki.apache.org/nutch/Support

42 / 43

Questions

?

43 / 43