Bixo - a webcrawler toolkit

Ken Krugler, Stefan Groschupf

Tambako the [email protected]

Friday, May 22, 2009

Agenda

Overview

Background

Motivation

Goals

Status

Differences

Architecture

Data life cycle

Robust Testing

Resources

Overview

Primary users will be companies extracting data from the web (not search)

Interested in subset of the web

Typically part of larger data processing system

Motivation - tech

No good solution available

We need a toolkit

Missing from Nutch et al.

Easy to integrate

Easy to extend

Easy to understand

API vs CLI

Pluggable I/O

Avoid common problems

Spider traps & link farms

Slow servers

Hanging crawls
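
Spider traps and link farms, as listed on this slide, can be guarded against with simple URL heuristics. The following is a minimal Python sketch, not Bixo code: the limits and the names `looks_like_trap` and `HostBudget` are hypothetical and chosen only for illustration. It flags URLs whose paths are very deep or repetitive (a common spider-trap symptom) and caps how many URLs are accepted per host.

```python
from collections import Counter
from urllib.parse import urlsplit

# Hypothetical limits; real values depend on the sites being crawled.
MAX_PATH_DEPTH = 8
MAX_SEGMENT_REPEATS = 2
MAX_PAGES_PER_HOST = 1000


def looks_like_trap(url: str) -> bool:
    """Heuristic: a very deep path, or one path segment repeated many times."""
    segments = [s for s in urlsplit(url).path.split("/") if s]
    if len(segments) > MAX_PATH_DEPTH:
        return True
    counts = Counter(segments)
    return any(n > MAX_SEGMENT_REPEATS for n in counts.values())


class HostBudget:
    """Caps the number of URLs accepted per host during one crawl."""

    def __init__(self, limit: int = MAX_PAGES_PER_HOST):
        self.limit = limit
        self.seen = Counter()

    def allow(self, url: str) -> bool:
        host = urlsplit(url).hostname or ""
        if self.seen[host] >= self.limit:
            return False
        self.seen[host] += 1
        return True
```

Slow servers and hanging crawls are usually handled separately, with per-request timeouts and an overall time limit for the fetch phase.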

Motivation - EMI

Screen scrape, data extraction

Artist websites, e.g. concert dates

Many pages from large sites

Just crawl, no index

One of many inputs into Business Intelligence

Integration in larger BI system (Cascading-based)

Motivation - Share This

Focused index for key partners

Data analysis and mining of 100M pages

Integration into existing log analysis and data mining systems (Cascading-based)

Low IT/Ops support requirements

Goals

Fulfill key motivating requirements

OSS project with business-friendly license

Focus on vertical crawling, leverage other projects

Efficient execution in EC2/cloud environment

Grow OSS community

Current Status

We already do crawls in EC2

2 sponsored developers, since March 2009

MIT license

Todo:

Improve robots.txt handling

Bug fixes and many improvements

Website & documentation

A CLI for easy testing.
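
For readers unfamiliar with what robots.txt handling involves, here is a minimal Python sketch using the standard library's `urllib.robotparser`. It is not Bixo's implementation; the robots.txt content, the user agent name `example-crawler`, and the helper `make_parser` are made up for illustration. It shows the two checks a polite crawler typically makes: whether a URL may be fetched, and how long to wait between requests.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt content, as a site might serve it.
ROBOTS_TXT = """\
User-agent: *
Disallow: /private/
Crawl-delay: 5
"""


def make_parser(robots_txt: str) -> RobotFileParser:
    """Build a parser from robots.txt text that has already been fetched."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser


parser = make_parser(ROBOTS_TXT)
allowed = parser.can_fetch("example-crawler", "http://example.com/tour-dates")
blocked = parser.can_fetch("example-crawler", "http://example.com/private/x")
delay = parser.crawl_delay("example-crawler")
```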

Differences (from Nutch)

Toolkit versus system - building blocks, not plugins

Workflow focus, versus system where you set conf and run a command

More emphasis on instrumentation - monitoring, error handling

No search serving

Vertical crawl, not intranet or whole web

HTTP(S) only, not ftp, etc.

Differences (from Hadoop)

Not much, which is a good thing

Generates lots of data - want to store in S3, want to minimize writes

Heavy user of DNS server - extra set up for caching server

Fetch phase is unusual Cascading topology

Hadoop Intro

Open Source map reduce system

Execution layer - map reduce

Mapper, Reducer Tasks

Storage layer - (distributed) file system

Local FS, HDFS, S3, etc

Scales from single node to thousands
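
To make the mapper/reducer split concrete, here is a tiny in-memory Python sketch of the map reduce model. It is not Hadoop code and uses no Hadoop API; `run_map_reduce`, `mapper`, and `reducer` are illustrative names. The example counts URLs per host: the mapper emits (host, 1) pairs, the framework groups pairs by key, and the reducer sums each group.

```python
from collections import defaultdict
from urllib.parse import urlsplit


def mapper(url):
    """Map step: emit one (key, value) pair per input record."""
    yield urlsplit(url).hostname, 1


def reducer(host, counts):
    """Reduce step: combine all values that share a key."""
    return host, sum(counts)


def run_map_reduce(records, mapper, reducer):
    """Single-process stand-in for the map, shuffle, and reduce phases."""
    groups = defaultdict(list)
    for record in records:
        for key, value in mapper(record):
            groups[key].append(value)  # "shuffle": group values by key
    return dict(reducer(key, values) for key, values in sorted(groups.items()))


urls = [
    "http://a.example/1",
    "http://b.example/1",
    "http://a.example/2",
]
pages_per_host = run_map_reduce(urls, mapper, reducer)
```

In Hadoop the same two functions would run as many parallel tasks, with the grouping done across machines.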

Cascading Intro

Data processing can be hard with Hadoop

Cascading extends Hadoop

Provides simple data processing API

Reusable (unix) pipe based concept

Sources and Sinks separated

HDFS, HBase, JDBC, Aster etc.

Assemble Pipes, Source and Sink in a Flow

GPL or OEM, though might change
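
The pipe idea, with sources and sinks kept separate from the processing steps, can be sketched in a few lines of Python. This is a conceptual illustration only and does not use or mirror the Cascading API; `run_flow` and the two step functions are hypothetical names. A flow reads records from a source, passes each through the steps of a pipe in order, and writes whatever comes out to a sink.

```python
def strip_whitespace(line):
    """Step that emits exactly one output per input."""
    yield line.strip()


def drop_blank(line):
    """Step that filters: emits nothing for empty input."""
    if line:
        yield line


def run_flow(source, pipe, sink):
    """Connect a source, a list of steps (the pipe), and a sink."""
    for record in source:
        outputs = [record]
        for step in pipe:
            # Each step may emit zero, one, or many records per input.
            outputs = [out for item in outputs for out in step(item)]
        sink.extend(outputs)
    return sink


source = ["  http://a.example/ ", "", "http://b.example/\n"]
result = run_flow(source, [strip_whitespace, drop_blank], [])
```

Because the pipe never touches the source or sink directly, the same steps can be reused with a different storage backend.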

Architecture

[Diagram: layers labeled Hadoop, Cascading, Bixo pipes, and your Java / your Groovy / your Jython, with input and output; runs on a single JVM, a server, or a cluster]

Data life cycle

Inject URLs in URL DB

Select URLs from URL DB - based on recrawl policy, or partner/domain, or type, etc.

Normalize URLs

Score URLs

Group URLs

Fetch

Save content

and/or update URL DB

and/or analyze/parse content

Notice nothing about indexing, pushing out index, serving up index.

Meta data fully supported
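
The "Normalize URLs" step exists so that different spellings of the same address map to one URL DB entry. A minimal Python sketch of such a normalizer follows. The specific rules (lowercase scheme and host, drop default ports, drop fragments, use "/" for an empty path) are common choices assumed here for illustration, not a description of Bixo's normalizer, and `normalize_url` is a hypothetical name.

```python
from urllib.parse import urlsplit, urlunsplit

DEFAULT_PORTS = {"http": 80, "https": 443}


def normalize_url(url: str) -> str:
    """Return a canonical form of an http(s) URL.

    Note: any user:password part of the URL is dropped by this sketch.
    """
    parts = urlsplit(url.strip())
    scheme = parts.scheme.lower()
    host = (parts.hostname or "").lower()
    port = parts.port
    # Keep the port only when it differs from the scheme's default.
    if port is None or DEFAULT_PORTS.get(scheme) == port:
        netloc = host
    else:
        netloc = f"{host}:{port}"
    path = parts.path or "/"
    # The fragment is never sent to the server, so it is removed.
    return urlunsplit((scheme, netloc, path, parts.query, ""))
```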

Architecture - Pipes

[Diagram: url pipe, fetch pipe, parse pipe, update url db pipe]

Import Url Pipe

[Diagram: Import SubAssembly with an Each step (URL Normalizing) and an IUrlFilter, between a Source and a Sink; data labels: URL DB, URLs]

Fetch Pipe

[Diagram: Fetch SubAssembly with steps Each (URL Domain Map), Each (URL Scoring), GroupBy (URL Grouping), Every (Fetching); components: GroupingKeyGenerator, IHttpFetcher, ScoreGenerator; Source: URLs; Sink: Pages & Status]
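
The stages named on this slide (score, group, then fetch per group) can be sketched in plain Python. This is an illustration under stated assumptions, not Bixo's code: grouping by host name, scoring by path depth, and the `max_per_group` cap are all hypothetical choices, and the fetcher is passed in as a function so that no network access is needed.

```python
from collections import defaultdict
from urllib.parse import urlsplit


def grouping_key(url):
    """Assumed grouping rule: one group per host name."""
    return urlsplit(url).hostname or ""


def score(url):
    """Assumed scoring rule: shallower paths score higher."""
    depth = len([s for s in urlsplit(url).path.split("/") if s])
    return 1.0 / (1 + depth)


def fetch_grouped(urls, fetcher, max_per_group=2):
    """Group URLs, keep the best-scored ones per group, fetch each one."""
    groups = defaultdict(list)
    for url in urls:
        groups[grouping_key(url)].append(url)
    results = {}
    for key in sorted(groups):
        best_first = sorted(groups[key], key=score, reverse=True)
        for url in best_first[:max_per_group]:
            results[url] = fetcher(url)
    return results
```

Grouping before fetching is what lets a crawler apply per-server limits, since all URLs for one server are handled together.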

Parse Pipe

[Diagram: Parse SubAssembly with step Each (URL Domain Map); component: IParser; Source: Pages; Sink: ParsedText & OutLinks]
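
The parse step turns a fetched page into parsed text and outlinks. A minimal Python sketch using the standard library's `html.parser` shows the idea; it is not Bixo's IParser, and `LinkAndTextParser` and `parse_page` are hypothetical names. Relative links are resolved against the page URL so they can be fed back into the URL DB.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin


class LinkAndTextParser(HTMLParser):
    """Collects visible text fragments and absolute outlinks."""

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.outlinks = []
        self.text_parts = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            href = dict(attrs).get("href")
            if href:
                self.outlinks.append(urljoin(self.base_url, href))

    def handle_data(self, data):
        # Simplification: script and style contents are not filtered out.
        if data.strip():
            self.text_parts.append(data.strip())


def parse_page(base_url, html):
    """Return (parsed_text, outlinks) for one page."""
    parser = LinkAndTextParser(base_url)
    parser.feed(html)
    parser.close()
    return " ".join(parser.text_parts), parser.outlinks
```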

Update Pipe

[Diagram: Update DB SubAssembly with steps Each (URL Normalizing), GroupBy (URL Grouping), Every (URL Selection); component: IUrlFilter; Source: URLs; Sink: URL DB; label: LastUpdated]
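
One plausible reading of the group-then-select steps on this slide is that records for the same URL are grouped and the most recently updated one is kept. The Python sketch below illustrates that reading only; the slide does not spell out the selection rule, so treat the rule, the record shape (URL mapped to a last-updated timestamp), and the name `update_url_db` as assumptions.

```python
def update_url_db(url_db, new_records):
    """Merge (url, last_updated) records into a copy of the URL DB.

    Assumed selection rule: for each URL keep the newest timestamp.
    """
    merged = dict(url_db)
    for url, last_updated in new_records:
        current = merged.get(url)
        if current is None or last_updated > current:
            merged[url] = last_updated
    return merged


url_db = {"http://a.example/": 100}
new_records = [
    ("http://a.example/", 250),  # newer than the stored entry: replaces it
    ("http://a.example/", 200),  # older than 250: ignored
    ("http://b.example/", 50),   # not seen before: added
]
updated = update_url_db(url_db, new_records)
```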

Output

[Diagram: MultiSinkTap with two sinks; labels: Each (URL Status), Each (URL Content), IndexScheme, Lucene Index]

Robust testing

Unit tests

Jetty with special request handlers

wrong content type

slow responses

wrong header

WebGraph test platform

test/simulate URL discovery

Looping/URL DB updates

page rank calcs, etc.

Wikipedia

large amount of data that can be "crawled" via local setup

http://webgraph.dsi.unimi.it/
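
The idea of testing a fetcher against a local server with deliberately misbehaving handlers can be sketched with Python's standard library in place of Jetty. This is an analogy, not the Bixo test suite: the handler paths `/slow` and `/wrong-type`, the delays, and the helper names are hypothetical. The server runs on a free local port, and the fetch helper returns `None` on timeout so a test can assert on it.

```python
import http.client
import socket
import threading
import time
from http.server import BaseHTTPRequestHandler, ThreadingHTTPServer

BODY = b"<html><body>hello</body></html>"


class MisbehavingHandler(BaseHTTPRequestHandler):
    """Serves HTML, but misbehaves on specific paths."""

    def do_GET(self):
        if self.path == "/slow":
            time.sleep(0.5)  # simulate a slow server
        wrong = self.path == "/wrong-type"
        content_type = "image/png" if wrong else "text/html"
        try:
            self.send_response(200)
            self.send_header("Content-Type", content_type)
            self.send_header("Content-Length", str(len(BODY)))
            self.end_headers()
            self.wfile.write(BODY)
        except (BrokenPipeError, ConnectionResetError):
            pass  # the client gave up waiting

    def log_message(self, format, *args):
        pass  # keep test output quiet


def start_test_server():
    """Start the server on a free local port in a background thread."""
    server = ThreadingHTTPServer(("127.0.0.1", 0), MisbehavingHandler)
    threading.Thread(target=server.serve_forever, daemon=True).start()
    return server


def fetch(port, path, timeout):
    """Return (content_type, body), or None if the server is too slow."""
    conn = http.client.HTTPConnection("127.0.0.1", port, timeout=timeout)
    try:
        conn.request("GET", path)
        response = conn.getresponse()
        return response.getheader("Content-Type"), response.read()
    except socket.timeout:
        return None
    finally:
        conn.close()
```

A test would start the server, call `fetch` with a short timeout against `/slow`, and check that the crawler-side code gives up instead of hanging.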

Resources

Web: http://bixo.101tec.com/

List: http://groups.yahoo.com/group/bixo-dev

Sources: https://github.com/emi/bixo/tree

Bugtracking: http://oss.101tec.com/jira/browse/bixo

Scale Unlimited, Inc.

Ken Krugler, Stefan Groschupf 

[email protected]

hans [email protected]

