Large-Scale Analysis of Web Pages - on a Startup Budget?
Hannes Mühleisen, Web-Based Systems Group
AWS Summit 2012 | Berlin
Our Starting Point
• Websites now embed structured data in HTML
• Various Vocabularies possible
• schema.org, Open Graph protocol, ...
• Various Encoding Formats possible
• μFormats, RDFa, Microdata

Question: How are Vocabularies and Formats used?
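As a minimal sketch of what such embedded data looks like, here is a hypothetical schema.org Microdata snippet and a toy extractor built on Python's standard-library HTML parser (a real extractor would also track `itemscope` nesting and multi-valued properties):

```python
from html.parser import HTMLParser

# Hypothetical sample markup using the schema.org vocabulary in Microdata syntax.
SAMPLE = """
<div itemscope itemtype="http://schema.org/Product">
  <span itemprop="name">Acme Anvil</span>
  <span itemprop="price">99.95</span>
</div>
"""

class MicrodataSketch(HTMLParser):
    """Collects itemprop name/value pairs from a single item."""

    def __init__(self):
        super().__init__()
        self._prop = None   # itemprop attribute currently awaiting its text
        self.items = {}     # property name -> text content

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if "itemprop" in attrs:
            self._prop = attrs["itemprop"]

    def handle_data(self, data):
        if self._prop and data.strip():
            self.items[self._prop] = data.strip()
            self._prop = None

p = MicrodataSketch()
p.feed(SAMPLE)
print(p.items)  # {'name': 'Acme Anvil', 'price': '99.95'}
```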
Web Indices
• To answer our question, we need access to raw Web data.
• However, maintaining Web indices is insanely expensive
• Re-Crawling, Storage, currently ~50 B pages (Google)
• Google and Bing have indices, but do not let outsiders in
Common Crawl
• Non-Profit Organization
• Runs crawler and provides HTML dumps
• Available data:
• Index 02-12: 1.7 B URLs (21 TB)
• Index 09/12: 2.8 B URLs (29 TB)
• Available on AWS Public Data Sets
Why AWS?
• Now that we have a web crawl, how do we run our analysis?
• Unpacking and DOM-Parsing on 50 TB? (CPU-heavy!)
• Preliminary analysis: 1 GB / hour / CPU possible
• 8-CPU Desktop: 8 months
• 64-CPU Server: 1 month
• 100 8-CPU EC2-Instances: ~ 3 days
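The estimates above can be reproduced with a quick back-of-the-envelope calculation, assuming the ~50 TB dataset and the measured ~1 GB / hour / CPU throughput:

```python
# Rough wall-clock estimates from the slide's assumptions:
# ~50 TB of crawl data, ~1 GB per hour per CPU.
DATASET_GB = 50_000
GB_PER_CPU_HOUR = 1.0

def wall_clock_days(cpus: int) -> float:
    return DATASET_GB / GB_PER_CPU_HOUR / cpus / 24

print(round(wall_clock_days(8)))    # 8-CPU desktop: ~260 days (~8 months)
print(round(wall_clock_days(64)))   # 64-CPU server: ~33 days (~1 month)
print(round(wall_clock_days(800)))  # 100 8-CPU instances: ~3 days
```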
[Figure: Common Crawl dataset size compared with the data processed in one hour by a single CPU, a 1000 € PC, a 5000 € server, and 17 € worth of EC2 instances]
AWS Setup
• Data Input: Read Index Splits from S3
• Job Coordination: SQS Message Queue
• Workers: 100 EC2 Spot Instances (c1.xlarge, ~0.17 € / h)
• Result Output: Write to S3
• Logging: SDB
[Diagram: CC input splits in S3 → task IDs in SQS → EC2 workers → result files in the WDC S3 bucket]
• Each input file queued in SQS
• EC2 Workers take tasks from SQS
• Workers read and write S3 buckets
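The "each input file queued" step amounts to one SQS message per split; the key layout below is hypothetical:

```python
import json

def task_message(split_id: int) -> str:
    # Hypothetical key layout for the crawl's input splits.
    return json.dumps({"key": f"crawl/split-{split_id:05d}.arc.gz"})

# With boto3, the submitter would run, for every split:
#   queue.send_message(MessageBody=task_message(i))
print(task_message(42))  # {"key": "crawl/split-00042.arc.gz"}
```

File-per-message granularity keeps coordination trivial: there is no master process to fail, and adding capacity is just starting more workers on the same queue.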
Results - Types of Data
[Figure: Entity count (log scale) per type, for Microdata 02/2012, RDFa 02/2012, RDFa 2009/2010, and Microdata 2009/2010]
2012 Microdata Breakdown:
• Website Structure 23 %
• Products, Reviews 19 %
• Movies, Music, ... 15 %
• Geodata 8 %
• People, Organizations 7 %
• Available data largely determined by major player support
• “If Google consumes it, we will publish it”
Results - Formats
• URLs with embedded Data: +6%
• Microdata +14% (schema.org?)
• RDFa +26% (Facebook?)
[Figure: Percentage of URLs per format (RDFa, Microdata, geo, hcalendar, hcard, hreview, XFN), 2009/2010 vs. 02-2012]
Results - Extracted Data
• Extracted data available for download at
• www.webdatacommons.org
• Formats: RDF (~90 GB) and CSV Tables for Microformats (!)
• Have a look!
AWS Costs
• Ca. 5500 Machine-Hours were required
• 1100 € billed by AWS for that
• Cost for other services negligible *
• * At first, we underestimated SDB cost
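The bill is easy to sanity-check against the earlier slides (assuming the ~0.17 €/h spot price from the AWS Setup slide, with storage, queueing, and logging lumped into the remainder):

```python
# Sanity check of the ~1100 € bill; spot price per the AWS Setup slide.
machine_hours = 5500
spot_price_eur = 0.17

ec2_cost = machine_hours * spot_price_eur
print(round(ec2_cost))         # ~935 € for EC2 spot capacity alone
print(round(1100 - ec2_cost))  # ~165 € left for S3, SQS, and SDB
```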
Takeaways
• Web Data Commons now publishes the largest available set of structured data extracted from Web pages
• Large-Scale Web Analysis now possible with Common Crawl datasets
• AWS great for massive ad-hoc computing power and complexity reduction
• Choose your architecture wisely and test by experiment; for us, EMR was too expensive.
Thank You!
Web Resources: http://webdatacommons.org | http://hannes.muehleisen.org
Questions? Want to hire me?