
Mining a Large Web Corpus

Description:
This talk was given at the IIPC General Assembly in Paris in May 2014. It introduces the distributed, parallel extraction framework provided by the Web Data Commons project. The framework is publicly accessible and tailored for the Amazon Web Services stack. The presentation also includes an excerpt of the datasets that were extracted from over 100 TB of crawl data and are likewise available at http://webdatacommons.org.
Transcript

Slide 1

International Internet Preservation Consortium

General Assembly 2014, Paris

Mining a Large Web Corpus

Robert Meusel

Christian Bizer


Slide 2

The Common Crawl


Slide 3

Hyperlink Graphs

Knowledge about the structure of the Web can be used to improve crawling strategies, to help SEO experts or to understand social phenomena.


Slide 4

HTML-embedded Data on the Web

Several million websites semantically mark up the content of their HTML pages.

Markup Syntaxes

Microformats

RDFa

Microdata

Data snippets within info boxes


Slide 5

Relational HTML Tables

HTML tables contain semi-structured data that can be used to build up or extend knowledge bases such as DBpedia.

• Cafarella, et al.: WebTables: Exploring the Power of Tables on the Web. VLDB 2008.

In a corpus of 14B raw tables, 154M are "good" relations (1.1%).


Slide 6

The Web Data Commons Project

Has developed an Amazon-based framework for extracting data from large web crawls

Capable of running on any cloud infrastructure

Has applied this framework to the Common Crawl data

Adaptable to other crawls

Results and framework are publicly available

http://webdatacommons.org

Goal: Offer an easy-to-use, cost-efficient, distributed extraction framework for large web crawls, as well as datasets extracted from the crawls.


Slide 7

Extraction Framework

[Architecture diagram: a master, an AWS SQS queue, multiple AWS EC2 instances, and AWS S3 storage; the diagram marks each step as automated or manual.]

1: Fill queue
2: Launch instances
3: Request file-reference
4: Download file
5: Extract & upload
6: Collect results
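To make steps 3-5 concrete, a worker's main loop might look like the sketch below. This is a hedged illustration, not the actual WDC code: the queue URL, bucket name, and the extractAndUpload helper are hypothetical placeholders, while the AWS SDK for Java (v1) calls themselves are real.

import com.amazonaws.services.s3.AmazonS3;
import com.amazonaws.services.s3.AmazonS3ClientBuilder;
import com.amazonaws.services.s3.model.S3Object;
import com.amazonaws.services.sqs.AmazonSQS;
import com.amazonaws.services.sqs.AmazonSQSClientBuilder;
import com.amazonaws.services.sqs.model.Message;

public class WorkerLoop {

    public static void main(String[] args) {
        AmazonSQS sqs = AmazonSQSClientBuilder.defaultClient();
        AmazonS3 s3 = AmazonS3ClientBuilder.defaultClient();
        // Queue URL and bucket name below are hypothetical placeholders.
        String queueUrl = "https://sqs.us-east-1.amazonaws.com/123456789012/wdc-tasks";

        while (true) {
            // Step 3: request a file reference from the queue.
            for (Message msg : sqs.receiveMessage(queueUrl).getMessages()) {
                String fileKey = msg.getBody(); // one crawl file per message
                // Step 4: download the file from S3.
                S3Object file = s3.getObject("example-crawl-bucket", fileKey);
                // Step 5: extract data and upload the results (placeholder).
                extractAndUpload(file, s3);
                // Delete the message so no other worker repeats the task.
                sqs.deleteMessage(queueUrl, msg.getReceiptHandle());
            }
        }
    }

    private static void extractAndUpload(S3Object file, AmazonS3 s3) {
        // Placeholder for the per-file extraction logic (see the next slide).
    }
}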


Slide 8

Extraction Worker

[Diagram: a .(w)arc file is downloaded from AWS S3, passed through a filter and one or more workers of the WDC extractor, and the output file is uploaded back to AWS S3.]

Worker:

• Written in Java
• Processes one page at a time
• Independent from other files and workers

Filter:

• Reduces runtime
• Mime-type filter
• Regex detection of content or meta-information
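A minimal sketch of what such a per-page worker with its filter could look like; the class, the marker regular expression, and the extractTriples helper are illustrative assumptions rather than the actual WDC classes:

import java.util.Collections;
import java.util.List;
import java.util.regex.Pattern;

// Hypothetical per-page worker: it holds no shared state, so many instances
// can run in parallel on different files without any coordination.
public class PageWorker {

    // Cheap regex prefilter: only pages containing a hint of structured
    // markup are handed to the expensive extractor. The pattern is an
    // illustrative assumption.
    private static final Pattern MARKUP_HINT =
            Pattern.compile("itemscope|typeof=|vcard", Pattern.CASE_INSENSITIVE);

    public List<String> process(String url, String mimeType, String html) {
        if (!"text/html".equals(mimeType)) {
            return Collections.emptyList();   // mime-type filter
        }
        if (!MARKUP_HINT.matcher(html).find()) {
            return Collections.emptyList();   // regex filter: skip unpromising pages
        }
        return extractTriples(url, html);     // the real extraction would go here
    }

    private List<String> extractTriples(String url, String html) {
        return Collections.emptyList();       // placeholder
    }
}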


Slide 9

Web Data Commons – Extraction Framework

Written in Java

Mainly tailored for Amazon Web Services

Fault tolerant and cheap

300 USD to extract 17 billion RDF statements from 44 TB

Easily customizable

Only the worker has to be adapted

The worker is a single method that processes one file at a time

Scaling is automated by the framework

Open source code:

https://www.assembla.com/code/commondata/

Alternative: a Hadoop version, which can run on any Hadoop cluster without Amazon Web Services.
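As a taste of that alternative, the per-file worker logic maps naturally onto a Hadoop mapper. In the hedged sketch below, each input line is assumed to carry one file reference; everything except the Hadoop Mapper API itself is hypothetical.

import java.io.IOException;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

public class ExtractionMapper extends Mapper<LongWritable, Text, Text, Text> {

    @Override
    protected void map(LongWritable key, Text value, Context context)
            throws IOException, InterruptedException {
        String fileReference = value.toString().trim(); // e.g. an S3 key or HDFS path
        // processFile is a placeholder for the download-filter-extract logic.
        for (String record : processFile(fileReference)) {
            context.write(new Text(fileReference), new Text(record));
        }
    }

    private Iterable<String> processFile(String ref) {
        return java.util.Collections.emptyList(); // actual extraction goes here
    }
}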


Slide 10

Extracted Datasets

Hyperlink Graph

HTML-embedded Data

Relational HTML Tables



Slide 11

Hyperlink Graph

Extracted from the Common Crawl 2012 Dataset

Over 3.5 billion pages connected by over 128 billion links

Graph files: 386 GB

http://webdatacommons.org/hyperlinkgraph/

http://wwwranking.webdatacommons.org/


Slide 12

Hyperlink Graph

Degrees do not follow a power-law

Detection of Spam pages

Further insights:
WWW'14: Graph Structure in the Web – Revisited (Meusel et al.)
WebSci'14: The Graph Structure of the Web Aggregated by Pay-Level Domain (Lehmberg et al.)

Discovery of how the global structure of the World Wide Web evolves over time.
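Findings like these can be reproduced from the published graph files. The sketch below is a minimal illustration that assumes a plain-text edge list with one "source target" pair of integer node IDs per line (the actual WDC file layout may differ); an in-memory map like this only works for small excerpts, not the full 3.5-billion-node graph.

import java.io.BufferedReader;
import java.io.FileReader;
import java.io.IOException;
import java.util.HashMap;
import java.util.Map;

public class DegreeDistribution {

    public static void main(String[] args) throws IOException {
        // Out-degree per node, computed from the edge list.
        Map<Long, Long> outDegree = new HashMap<>();
        try (BufferedReader in = new BufferedReader(new FileReader(args[0]))) {
            String line;
            while ((line = in.readLine()) != null) {
                String[] parts = line.split("\\s+"); // "source target" node IDs
                outDegree.merge(Long.parseLong(parts[0]), 1L, Long::sum);
            }
        }
        // Histogram: how many nodes have out-degree k.
        Map<Long, Long> histogram = new HashMap<>();
        for (long degree : outDegree.values()) {
            histogram.merge(degree, 1L, Long::sum);
        }
        // Plotting this on log-log axes shows whether it follows a power law.
        histogram.forEach((k, n) -> System.out.println(k + "\t" + n));
    }
}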


Slide 13

Hyperlink Graph

Discovery of important and interesting sites using different popularity rankings or website categorization libraries

[Figure: websites connected by at least half a million links]


Slide 14

HTML-embedded Data

More and more websites semantically mark up the content of their HTML pages.

Markup syntaxes:
RDFa
Microformats
Microdata


Slide 15

Websites containing Structured Data (2013)

1.8 million websites (pay-level domains, PLDs) out of 12.8 million provide Microformat, Microdata or RDFa data (13.9%)

585 million of the 2.2 billion pages contain Microformat, Microdata or RDFa data (26.3%).

Web Data Commons - Microformat, Microdata, RDFa Corpus

17 billion RDF triples from Common Crawl 2013

Next release will be in winter 2014

http://webdatacommons.org/structureddata/
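To give an impression of what the extraction yields, here is a minimal sketch using Apache Any23, an open-source library that parses RDFa, Microdata, and Microformats into RDF triples; the sample HTML snippet and URI are made up for the demo.

import java.io.ByteArrayOutputStream;
import org.apache.any23.Any23;
import org.apache.any23.source.DocumentSource;
import org.apache.any23.source.StringDocumentSource;
import org.apache.any23.writer.NTriplesWriter;
import org.apache.any23.writer.TripleHandler;

public class MicrodataExample {

    public static void main(String[] args) throws Exception {
        Any23 runner = new Any23();
        // A tiny Microdata snippet; content and URI are invented examples.
        String html = "<div itemscope itemtype=\"http://schema.org/Person\">"
                    + "<span itemprop=\"name\">Jane Doe</span></div>";
        DocumentSource source =
                new StringDocumentSource(html, "http://example.org/page");
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        TripleHandler handler = new NTriplesWriter(out);
        try {
            runner.extract(source, handler); // runs all registered extractors
        } finally {
            handler.close();
        }
        System.out.println(out.toString("UTF-8")); // triples in N-Triples format
    }
}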


Slide 16

Top Classes Microdata (2013)

• schema = Schema.org
• dv = Google's Rich Snippet Vocabulary


Slide 17

HTML Tables

• Cafarella, et al.: WebTables: Exploring the Power of Tables on the Web. VLDB 2008.

• Crestan, Pantel: Web-Scale Table Census and Classification. WSDM 2011.

Cafarella (2008): in a corpus of 14B raw tables, 154M are "good" relations (1.1%).

Classification precision: 70-80%


Slide 18

WDC - Web Tables Corpus

Large corpus of relational Web tables for public download

Extracted from Common Crawl 2012 (3.3 billion pages)

147 million relational tables, selected out of 11.2 B raw tables (1.3%)

The download includes the HTML pages of the tables (1 TB zipped)

Table Statistics

Heterogeneity: Very high.

http://webdatacommons.org/webtables/

            Min   Max      Average   Median
Attributes    2    2,368      3.49        3
Data rows     1   70,068     12.41        6


Slide 19

Attribute Statistics

28,000,000 different attribute labels

WDC - Web Tables Corpus

Attribute       #Tables
name          4,600,000
price         3,700,000
date          2,700,000
artist        2,100,000
location      1,200,000
year          1,000,000
manufacturer    375,000
country         340,000
isbn             99,000
area             95,000
population       86,000

Subject Attribute Values

1.74 billion rows

253,000,000 different subject labels

Value               #Rows
usa               135,000
germany            91,000
greece             42,000
new york           59,000
london             37,000
athens             11,000
david beckham       3,000
ronaldinho          1,200
oliver kahn           710
twist and shout     2,000
yellow submarine    1,400
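Frequency tables like the two above come from a simple count over the corpus. Below is a hedged sketch that hypothetically assumes each table is stored as a CSV file whose first row holds the attribute labels; the real corpus format differs.

import java.io.BufferedReader;
import java.io.FileReader;
import java.io.IOException;
import java.util.HashMap;
import java.util.Map;

public class AttributeCounter {

    public static void main(String[] args) throws IOException {
        // Counts in how many tables each attribute label occurs.
        Map<String, Long> counts = new HashMap<>();
        for (String tableFile : args) {
            try (BufferedReader in = new BufferedReader(new FileReader(tableFile))) {
                String header = in.readLine(); // first row = attribute labels
                if (header == null) {
                    continue;
                }
                for (String label : header.split(",")) {
                    counts.merge(label.trim().toLowerCase(), 1L, Long::sum);
                }
            }
        }
        // Print the ten most frequent attribute labels.
        counts.entrySet().stream()
              .sorted(Map.Entry.<String, Long>comparingByValue().reversed())
              .limit(10)
              .forEach(e -> System.out.println(e.getKey() + "\t" + e.getValue()));
    }
}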


Slide 20

Conclusion

Three factors are necessary to work with web-scale data:

Availability of crawls: thanks to Common Crawl, this data is available.

Availability of cheap, easy-to-use infrastructures: Amazon or other on-demand cloud services.

Easy-to-adopt, scalable extraction frameworks: the Web Data Commons framework, or standard tools like Pig. Costs have to be evaluated per task, but the WDC framework has turned out to be the cheaper option.


Slide 21

Questions

Please visit our website: www.webdatacommons.org

The data and the framework are available for free download.

Web Data Commons is supported by:

