Focused Crawling for Structured Data
Robert Meusel, Peter Mika, and Roi Blanco
Markup Languages in HTML Pages
<html>
…
<body>
…
<div id="main-section" class="performance left" data-sku="M17242_580">
<h1> Predator Instinct FG Fußballschuh
</h1>
<div>
<meta content="EUR">
<span data-sale-price="219.95">219,95</span>
…
</body>
</html>
HTML pages directly embed markup languages to annotate items using different vocabularies
<html>
…
<body>
…
<div id="main-section" class="performance left" data-sku="M17242_580"
itemscope itemtype="http://schema.org/Product">
<h1 itemprop="name"> Predator Instinct FG Fußballschuh
</h1>
<div itemscope itemtype="http://schema.org/Offer"
itemprop="offers">
<meta itemprop="priceCurrency" content="EUR">
<span itemprop="price" data-sale-price="219.95">219,95</span>
…
</body>
</html>
_:node1 <http://www.w3.org/1999/02/22-rdf-syntax-ns#type> <http://schema.org/Product> .
_:node1 <http://schema.org/Product/name> "Predator Instinct FG Fußballschuh"@de .
_:node1 <http://www.w3.org/1999/02/22-rdf-syntax-ns#type> <http://schema.org/Offer> .
_:node1 <http://schema.org/Offer/price> "219,95"@de .
_:node1 <http://schema.org/Offer/priceCurrency> "EUR" .
…
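As a rough illustration of how such statements could be derived from an annotated page, the following sketch collects itemprop values with Python's standard html.parser. The class name ItempropCollector and the flat (property, value) output are assumptions made for this example. It is not the extractor used by the authors, and it neither tracks nested items nor emits RDF.

```python
from html.parser import HTMLParser


class ItempropCollector(HTMLParser):
    """Hypothetical helper: collects (itemprop, value) pairs, ignoring item nesting."""

    def __init__(self):
        super().__init__()
        self.pairs = []       # collected (property, value) tuples
        self._pending = None  # itemprop still waiting for its text content

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        prop = attrs.get("itemprop")
        if prop is None:
            return
        if "content" in attrs:
            # e.g. <meta itemprop="priceCurrency" content="EUR">
            self.pairs.append((prop, attrs["content"]))
        elif "itemscope" not in attrs:
            # The value is the element's text; nested items are skipped.
            self._pending = prop

    def handle_data(self, data):
        text = data.strip()
        if self._pending and text:
            self.pairs.append((self._pending, text))
            self._pending = None


html = """
<div itemscope itemtype="http://schema.org/Product">
  <h1 itemprop="name"> Predator Instinct FG Fußballschuh </h1>
  <div itemscope itemtype="http://schema.org/Offer" itemprop="offers">
    <meta itemprop="priceCurrency" content="EUR">
    <span itemprop="price" data-sale-price="219.95">219,95</span>
  </div>
</div>
"""

collector = ItempropCollector()
collector.feed(html)
pairs = collector.pairs
```

The resulting pairs hold the name, currency, and price values that appear as literals in the extracted statements.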
Meusel, Mika, Blanco: Focused Crawling for Structured Data @ CIKM 2014, Shanghai
Deployment of Markup Languages
14% of all sites use markup languages to annotate their data (status 2013) [Meusel2014]
• Broad topical variety, from Articles through Products to Recipes [Bizer2013]
• Multiple strong drivers pushing the deployment
• Search engine companies' Schema.org initiative
• Open Graph Protocol used by Facebook
Motivation
• Existing datasets/crawls do not focus on structured data
• Common Crawl Foundation uses PageRank and Breadth-First Search
• Datasets extracted from these corpora, such as the WebDataCommons corpus, are likely to miss large amounts of data [Meusel2014]
• Structured information
• Hundreds of millions of pages
• Up-to-date information
• Publicly available
Main Idea
• Adapting the idea of focused crawling
• Similarities:
• Evaluation of content based on an objective function
• Differences:
• Typically focused by topic, not quality/amount of data collected
• Because of that, typically no direct feedback about crawled pages available
In our setting, feedback can be incorporated directly into the system to improve the classification of newly discovered URLs.
Online Learning for Focused Crawling
• Capability to incorporate real-time feedback
• Improves performance
• Adapts to concept drift
• Possible features
• URL-based features; mainly tokens from the URL-String itself
• Features describing information from the parent(s) of the URL
• Features describing information from the siblings of the URL
• Free open-source software available (e.g. Massive Online Analysis Library by Bifet et al.)
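To make the idea of online learning on URL-based features concrete, here is a minimal sketch in Python. It swaps in a plain online perceptron rather than a classifier from the Massive Online Analysis library mentioned above, and the names url_tokens and OnlinePerceptron as well as the example URLs are assumptions made for illustration only.

```python
import re


def url_tokens(url):
    """Split a URL string into lowercase alphanumeric tokens."""
    return [t for t in re.split(r"[^a-z0-9]+", url.lower()) if t]


class OnlinePerceptron:
    """Hypothetical binary linear classifier, updated one example at a time."""

    def __init__(self):
        self.weights = {}  # token -> weight
        self.bias = 0.0

    def score(self, tokens):
        return self.bias + sum(self.weights.get(t, 0.0) for t in tokens)

    def predict(self, tokens):
        return self.score(tokens) > 0

    def update(self, tokens, relevant):
        # Mistake-driven update: weights change only when the prediction was wrong.
        if self.predict(tokens) != relevant:
            step = 1.0 if relevant else -1.0
            self.bias += step
            for t in tokens:
                self.weights[t] = self.weights.get(t, 0.0) + step


# Feedback arrives page by page: (URL, page contained markup?)
feedback = [
    ("http://shop.example/product/123", True),
    ("http://shop.example/about", False),
    ("http://shop.example/product/456", True),
]

clf = OnlinePerceptron()
for url, has_markup in feedback:
    clf.update(url_tokens(url), has_markup)
```

Because every crawled page yields one update, the model can follow changes in the crawl frontier without retraining on a stored batch.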
Exploration vs. Exploitation
• Decision/Classification is based on gathered knowledge
• Knowledge can be incomplete
• Crawled too few pages
• Knowledge can become invalid
• Reaching a part of the Web with different behavior
Selecting the page with the highest confidence for supporting our objective might not always be the best choice.
Bandit-Based Selection
• Bin each URL to the host it belongs to
• Each host represents one bandit
• Calculate the expected score for each bandit based on a scoring function
• Select the degree of randomness λ
• λ between 0 and 1
• For each turn draw a random number z
• z > λ: select the bandit with highest score
• else: select a random bandit
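The selection rule above matches what is commonly called an epsilon-greedy strategy, with λ in the role of epsilon. A minimal sketch, assuming the per-host scores have already been computed; the function name select_host, the rng parameter, and the example hosts are illustrative assumptions:

```python
import random


def select_host(scores, lam, rng=random):
    """Pick the next host (bandit) to crawl from.

    scores: dict mapping host -> expected score
    lam:    degree of randomness, between 0 and 1
    """
    z = rng.random()  # uniform random number in [0, 1)
    if z > lam:
        # Exploit: take the bandit with the highest score.
        return max(scores, key=scores.get)
    # Explore: take a random bandit.
    return rng.choice(list(scores))


scores = {"a.example": 0.2, "b.example": 0.9, "c.example": 0.5}
next_host = select_host(scores, lam=0.2)
```

With λ close to 0 the crawler almost always exploits the best-scoring host, while λ = 1 makes every selection random.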
Scoring Functions
Incorporate knowledge in score calculation for bandit/host:
• Best Score (Pure classification-based selection)
• Negative Absolute Bad
• Success Rate
• Absolute Good · Best Score
• Success Rate · Best Score
• Thompson Sampling
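The slide names the scoring functions without defining them. The sketch below shows one plausible reading of each name; every formula here is an assumption, including the score of 0.0 for hosts without crawled pages and the Beta(good + 1, bad + 1) draw used for Thompson sampling. HostStats and its fields are hypothetical.

```python
import random
from dataclasses import dataclass


@dataclass
class HostStats:
    good: int = 0            # crawled pages of this host that met the objective
    bad: int = 0             # crawled pages of this host that did not
    best_score: float = 0.0  # highest classifier confidence among queued URLs


def negative_absolute_bad(h):
    return -h.bad


def success_rate(h):
    total = h.good + h.bad
    return h.good / total if total else 0.0  # assumed default for unseen hosts


def absolute_good_times_best(h):
    return h.good * h.best_score


def success_rate_times_best(h):
    return success_rate(h) * h.best_score


def thompson_sample(h, rng=random):
    # Draw from a Beta posterior over the host's success probability.
    return rng.betavariate(h.good + 1, h.bad + 1)


host = HostStats(good=3, bad=1, best_score=0.8)
rate = success_rate(host)
```

Best Score alone corresponds to reading h.best_score directly, which ignores the host's crawl history.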
System Workflow
[Figure: system workflow. Components: Online Classifier, Bandits, Crawler, URL Parser, Semantic Parser. Edge labels: Seeds, URLs, Classified URL, URL, HTML Page, Feedback.]
Setup for Experiments
• Data originates from the Common Crawl Corpus 2012
• including over 3.5 billion HTML pages
• Extracted a subset of 5.5 million linked pages
• Including 450k different hosts
• Identified all pages within the subset containing at least one markup language (using the WebDataCommons corpus)
• 27.5% of all pages
Experiment Description
Measure: Number of relevant pages retrieved within the first 1 million pages crawled.
1. Online vs. batch-based classification with 100K, 250K, and 1M pages
2. Pure online classification vs. enhanced with bandit-based selection (λ=0)
3. Improvements with different λ
4. Improvements with decaying λ
Results: Online vs. Offline
• Both methods outperform Breadth-First Search (BFS)
• Static approach: 340K
• Adaptive approach: 539K
[Figure: percentage of relevant pages (y-axis) over fetched web pages (x-axis)]
Results: Pure Online Classification vs. +Bandit-based
• Success-rate-based scoring functions show the most promising results
• Negative absolute bad scoring performs like BFS
• Success rate function: 628K
• Pure online-classification: 539K
[Figure: percentage of relevant pages (y-axis) over fetched web pages (x-axis)]
Results: λ > 0
• Including randomness does not seem to have an effect
• A beneficial effect of λ > 0 is visible, e.g., for the success rate function within the first 400K crawled pages
[Figure: percentage of relevant pages (y-axis) over fetched web pages (x-axis)]
Results: Decaying λ
Decaying λ over time means reducing the randomness as more pages are crawled.
• Success rate function with decaying λ = 0.5: 673K
• Static λ: 628K
[Figure: percentage of relevant pages (y-axis) over fetched web pages (x-axis)]
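The slides do not state which decay schedule was used. As a hedged illustration only, the sketch below assumes an exponential decay in which λ halves after a fixed number of crawled pages; the function name decayed_lambda and the half_life parameter are hypothetical.

```python
def decayed_lambda(initial, pages_crawled, half_life=100_000):
    """Assumed schedule: the randomness halves every `half_life` crawled pages."""
    return initial * 0.5 ** (pages_crawled / half_life)


# Starting from 0.5, the randomness shrinks as the crawl progresses.
schedule = [decayed_lambda(0.5, n) for n in (0, 100_000, 200_000)]
```

Early in the crawl this favors exploration, when the gathered knowledge is still incomplete, and later shifts toward exploiting the best-scoring hosts.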
Adaptation to a More Specific Objective
• General objective is narrowed down to pages that:
• Make use of the markup language Microdata and
• Include at least five marked-up statements
• Example:
1. A page including information about a movie
2. The movie has the name Se7en
3. with a rating of 8.7 out of 10
4. and it was released in 1995
5. This information is maintained by imdb.com
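The narrowed objective can be expressed as a simple check over the statements extracted from a page. In this sketch the tuple layout (predicate, value, syntax), the function name meets_objective, and the predicate names are assumptions chosen for illustration:

```python
def meets_objective(statements, min_statements=5):
    """Return True if a page carries at least `min_statements` Microdata statements.

    statements: list of (predicate, value, syntax) tuples (assumed layout)
    """
    microdata = [s for s in statements if s[2] == "microdata"]
    return len(microdata) >= min_statements


# The five statements of the movie example; predicate names are illustrative.
movie_page = [
    ("type", "Movie", "microdata"),
    ("name", "Se7en", "microdata"),
    ("ratingValue", "8.7", "microdata"),
    ("datePublished", "1995", "microdata"),
    ("publisher", "imdb.com", "microdata"),
]

is_relevant = meets_objective(movie_page)
```

A page with only four such statements, or with statements in a different markup language, would not count as relevant under this objective.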
Results: Adaptation to a More Specific Objective
• 3.5% of pages include such information
• In general, our approach shows beneficial effects
• Static λ = 0.2: 120K
• Decaying λ = 0.5: 108K
[Figure: percentage of relevant pages (y-axis) over fetched web pages (x-axis)]
Conclusion
• 26% improvement over the pure online classification-based selection strategy for the general objective
• 66% improvement for the more specific objective
• Success-rate-based scoring functions show the most promising results for both objectives
Open Challenges
• Extend the approach so that results from one bandit can be exploited for the other bandits (contextual bandits)
• Introduce a more fine-grained grading of the crawled pages (multi-class problem)
• Take into account the quality of the gathered information (besides richness)
• Adapt the process to traditional topical focused crawling
• Publish code and data to the community
More Information
• Paper accepted at ACM International Conference on Information and Knowledge Management in Shanghai, China
• ACM Digital Library: Focused Crawling for Structured Data
• Detailed Descriptions and Source Code:
• Anthelion Webpage
• Datasets:
• Common Crawl Foundation Corpora
• WebDataCommons Corpora