
University of Mannheim – Bizer: Web Data Integration – HWS2018 (Version: 30.8.2018) Slide 1

Web Data Integration

Types of Structured Data on the Web


Topology of the Web Today

[Figure: three regions of the Web: the Classic Document Web, the Web of Data, and the Deep Web (via APIs and forms)]


Outline

1. Data Catalogs and Marketplaces

2. Web APIs

3. Linked Data

4. HTML-embedded Data
   1. RDFa, Microdata, Microformats
   2. HTML Tables and Templates
   3. Wikipedia as Data Source

5. References


1. Data Catalogs and Marketplaces

The Web traditionally contains structured data in various formats:
• CSV files, Excel worksheets
• XML documents, SQL dumps

Data Catalogs and Data Marketplaces
• collect and host data sets plus metadata
• provide free or payment-based access to the data sets

Examples
• data.gov.uk, data.gov, publicdata.eu: thousands of public sector data sets
• Factual, Dun & Bradstreet, DataStreamX: commercial data marketplaces

List of Data Catalogs
• http://data.wu.ac.at/portalwatch/portalslist


2. Web 2.0 Applications and Web APIs

A multitude of Web-based applications enable users to share information.

These applications form separate data spaces that are only partly accessible via the Web.
• HTML interfaces
• Web APIs


Example: Facebook

Users (September 2012)
• 1 billion monthly active users
• including 600 million mobile users
• 140.3 billion friend connections
• 1.13 trillion likes since launch in February 2009
• 219 billion photos uploaded
• 17 billion location-tagged posts, including check-ins

Data Volume
• over 100 Petabyte
• including profile data, communication, usage logs, ...

Sources
• https://s3.amazonaws.com/OneBillionFB/Facebook+1+Billion+Stats.docx
• http://www.technologyreview.com/featuredstory/428150/what-facebook-knows/


Web APIs

Provide limited access to the collected data
• restricted to specific queries (canned queries)
• restricted number of queries / number of results

ProgrammableWeb API Catalog
• lists over 17,000 Web APIs (2017)
• lists over 6,800 mashups
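
The two restrictions above can be illustrated with a small simulation. This is only a sketch under stated assumptions: `CannedQueryApi`, its query names, the page size, and the quota are all hypothetical and do not model any real Web API. It shows how a client that pages through results ends up with only part of the underlying data once the quota is used up.

```python
class CannedQueryApi:
    """Toy stand-in for a Web API (hypothetical, not a real service)."""

    def __init__(self, records, page_size=2, quota=3):
        self._records = records
        self._page_size = page_size      # restricted number of results per call
        self._remaining = quota          # restricted number of calls
        # Only these predefined ("canned") queries are supported.
        self._queries = {
            "by_city": lambda rec, value: rec["city"] == value,
            "by_name": lambda rec, value: rec["name"] == value,
        }

    def query(self, name, value, page=0):
        if name not in self._queries:
            raise ValueError(f"unsupported query: {name}")
        if self._remaining == 0:
            raise RuntimeError("rate limit exceeded")
        self._remaining -= 1
        hits = [r for r in self._records if self._queries[name](r, value)]
        start = page * self._page_size
        return hits[start:start + self._page_size]


def fetch_all(api, name, value):
    """Page through a canned query until results or quota run out."""
    results, page = [], 0
    while True:
        try:
            batch = api.query(name, value, page)
        except RuntimeError:
            break  # quota used up: the client only sees part of the data
        if not batch:
            break
        results.extend(batch)
        page += 1
    return results


RECORDS = [{"name": f"user{i}", "city": "Berlin"} for i in range(10)]
api = CannedQueryApi(RECORDS)
partial = fetch_all(api, "by_city", "Berlin")
print(len(partial), "of", len(RECORDS), "records retrieved")
```

With these assumed limits the client retrieves fewer records than the provider holds, which is the sense in which access is "limited".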


Most Popular Web APIs

Popularity = Number of Mashups using an API


Mashups are based on a fixed set of data sources

[Figure: a mashup drawing on four separate Web APIs, A to D]

• Web APIs expose proprietary interfaces.
• Not indexable by generic Web crawlers
• No automatic discovery of additional data sources
• No single global data space


Web APIs slice the Web into Data Silos

Image: Bob Jagensdorf, http://flickr.com/photos/darwinbell/, CC-BY


3. Alternative Approach: Linked Data

[Figure: data sources A to E publishing RDF and connected by RDF links]

Extend the Web with a single global data graph
• by using RDF to publish structured data on the Web
• by setting links between data items within different data sources.


Entities are identified with HTTP URIs

[Figure: RDF graph with the following triples]
pd:cygri rdf:type foaf:Person
pd:cygri foaf:name "Richard Cyganiak"
pd:cygri foaf:based_near dbpedia:Berlin

HTTP URIs take the role of global primary keys.

pd:cygri = http://richard.cyganiak.de/foaf.rdf#cygri
dbpedia:Berlin = http://dbpedia.org/resource/Berlin
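
To make the "global primary key" idea concrete, the following minimal sketch expands the prefixed names from this slide into full HTTP URIs. The `pd:` and `dbpedia:` expansions come from the slide; the `foaf:` and `rdf:` namespace URIs are the standard ones. The `expand` helper and the tuple representation of triples are illustrative assumptions, not part of any RDF library.

```python
# Prefix table: pd: and dbpedia: as given on the slide,
# foaf: and rdf: are the standard namespace URIs.
PREFIXES = {
    "pd": "http://richard.cyganiak.de/foaf.rdf#",
    "dbpedia": "http://dbpedia.org/resource/",
    "foaf": "http://xmlns.com/foaf/0.1/",
    "rdf": "http://www.w3.org/1999/02/22-rdf-syntax-ns#",
}


def expand(term):
    """Expand a prefixed name to a full URI; leave literals untouched."""
    prefix, sep, local = term.partition(":")
    if sep and prefix in PREFIXES:
        return PREFIXES[prefix] + local
    return term


# The triples of the example graph as (subject, predicate, object).
TRIPLES = [
    ("pd:cygri", "rdf:type", "foaf:Person"),
    ("pd:cygri", "foaf:name", "Richard Cyganiak"),
    ("pd:cygri", "foaf:based_near", "dbpedia:Berlin"),
]

expanded = [tuple(expand(t) for t in triple) for triple in TRIPLES]
for triple in expanded:
    print(triple)
```

Because every data source that talks about Berlin can use the same URI, records from different sources can be joined on it, much like rows sharing a primary key.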


URIs can be looked up on the Web

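[Figure: the RDF graph extended with data obtained by looking up dbpedia:Berlin]
pd:cygri rdf:type foaf:Person
pd:cygri foaf:name "Richard Cyganiak"
pd:cygri foaf:based_near dbpedia:Berlin
dbpedia:Berlin dp:population 3.405.259
dbpedia:Berlin skos:subject dp:Cities_in_Germany

By following RDF links applications can
• navigate the global data graph
• discover new data sources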
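
The follow-your-nose behaviour can be sketched as a breadth-first traversal. To stay runnable offline, the sketch replaces real HTTP lookups with an in-memory dictionary (`FAKE_WEB`), so `lookup` is an assumption standing in for URI dereferencing; the category URI is likewise made up for illustration.

```python
from collections import deque
from urllib.parse import urlparse

CYGRI = "http://richard.cyganiak.de/foaf.rdf#cygri"
BERLIN = "http://dbpedia.org/resource/Berlin"
CITIES = "http://dbpedia.org/resource/Category:Cities_in_Germany"  # illustrative URI

# Toy stand-in for the Web: URI -> triples returned when the URI is looked up.
FAKE_WEB = {
    CYGRI: [
        (CYGRI, "foaf:name", "Richard Cyganiak"),
        (CYGRI, "foaf:based_near", BERLIN),
    ],
    BERLIN: [
        (BERLIN, "dp:population", "3405259"),
        (BERLIN, "skos:subject", CITIES),
    ],
}


def lookup(uri):
    """Hypothetical dereferencing step; a real client would issue an HTTP GET."""
    return FAKE_WEB.get(uri, [])


def crawl(seed, max_lookups=10):
    """Follow RDF links breadth-first, collecting triples and visited hosts."""
    seen, queue, triples = set(), deque([seed]), []
    while queue and len(seen) < max_lookups:
        uri = queue.popleft()
        if uri in seen:
            continue
        seen.add(uri)
        for s, p, o in lookup(uri):
            triples.append((s, p, o))
            # Objects that are HTTP URIs are links worth following.
            if o.startswith("http://") and o not in seen:
                queue.append(o)
    hosts = sorted({urlparse(u).netloc for u in seen})
    return triples, hosts


triples, hosts = crawl(CYGRI)
print(len(triples), "triples from hosts", hosts)
```

Starting from a single personal profile, the traversal reaches a second host without that host having been configured in advance, which is the discovery property that a fixed set of Web APIs lacks.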


The Marbles Hyperdata Browser


The SigMa Linked Data Search Engine


LOD Datasets on the Web: April 2014

Source: Max Schmachtenberg, Christian Bizer, Heiko Paulheim: Adoption of the Linked Data Best Practices in Different Topical Domains. In: 13th International Semantic Web Conference (ISWC2014).

More statistics: http://linkeddatacatalog.dws.informatik.uni-mannheim.de/state/

Dataset Catalogs:
• https://lod-cloud.net/datasets (as of 2018)
• http://linkeddatacatalog.dws.informatik.uni-mannheim.de/ (as of 2014)


Uptake in the Life Science Domain

Goals:
1. Connect life science datasets in order to support
   • biological knowledge discovery
   • drug discovery
2. Reuse results of previous integration efforts


Uptake in the Libraries Community

Institutions publishing Linked Data
• Library of Congress (subject headings)
• German National Library (PND dataset and subject headings)
• Swedish National Library (Libris catalog)
• Hungarian National Library (OPAC and digital library)
• Europeana Digital Library (4 million artifacts)
• Springer (metadata about conference proceedings)

Goals:
1. Interconnect resources between repositories (by topic, by location, by historical period, by ...)
2. Integrate library catalogs on global scale


RDF Links between LOD Datasets

https://lod-cloud.net/


Hands-on: How to get the Data?

Download the Billion Triples Challenge Dataset
• 4 billion RDF triples (52 GB gzipped, 1.1 TB uncompressed)
• crawled from the public Web of Linked Data in February/June 2014
• http://km.aifb.kit.edu/projects/btc-2014/

Use SPARQL endpoints of individual data sets
• Endpoint list: http://sparqles.ai.wu.ac.at/availability
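
As a rough illustration of the second option, the sketch below builds the request URL for a SPARQL query. It relies only on the fact that the SPARQL protocol passes the query text in a `query` parameter; the endpoint address `https://example.org/sparql` is a placeholder, and `build_sparql_url` is a hypothetical helper. No request is sent, so the snippet runs offline.

```python
from urllib.parse import urlencode, urlparse, parse_qs


def build_sparql_url(endpoint, query):
    """Return a GET URL that carries the query in the 'query' parameter."""
    return endpoint + "?" + urlencode({"query": query})


# Ask for up to ten property/value pairs describing one entity.
QUERY = (
    "SELECT ?p ?o WHERE { "
    "<http://dbpedia.org/resource/Berlin> ?p ?o "
    "} LIMIT 10"
)

# Placeholder endpoint; pick a real one from the endpoint list above.
url = build_sparql_url("https://example.org/sparql", QUERY)
print(url)

# The URL could then be fetched with urllib.request.urlopen, typically
# while asking for JSON results via an Accept header.
```

Public endpoints often impose timeouts and result limits, so for bulk access the dump download is usually the more reliable route.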


4. HTML-embedded Data

Microformats

Microdata

RDFa

1. Webpages traditionally contain structured data in the form of HTML tables as well as template data.

2. More and more Websites semantically markup the content of their HTML pages using standardized markup formats.


4.1 Microformats

Microformat effort dates back to 2003

Small set of fixed formats
• hCard: people, companies, organizations, and places
• XFN: relationships between people
• hCalendar: calendaring and events
• hListing: small-ads; classifieds
• hReview: reviews of products, businesses, events

Shortcoming of Microformats
• cannot represent arbitrary kinds of data, only what the fixed formats cover.

indexed by Google and Yahoo since 2009


RDFa

serialization format for embedding RDF data into HTML pages

proposed in 2004, W3C Recommendation in 2008

can be used together with any vocabulary

can assign URIs as global primary keys to entities


Open Graph Protocol

allows site owners to determine how entities are described in Facebook

relies on RDFa for embedding data into HTML pages

available since April 2010


Microdata

alternative technique for embedding structured data

proposed in 2009 by WHATWG as part of HTML5 work

tries to be simpler than RDFa (5 new attributes instead of 8)

<div itemscope itemtype="http://schema.org/Hotel">
  <span itemprop="name">Vienna Marriott Hotel</span>
  <span itemprop="address" itemscope itemtype="http://schema.org/PostalAddress">
    <span itemprop="streetAddress">Parkring 12a</span>
    <span itemprop="addressLocality">Vienna</span>
  </span>
  <div itemprop="aggregateRating" itemscope itemtype="http://schema.org/AggregateRating">
    <span itemprop="ratingValue"> 4 </span> stars-based on
    <span itemprop="reviewCount"> 250 </span> reviews.
  </div>
</div>
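
To show how such markup can be read back by a program, here is a deliberately simplified extraction sketch built on Python's standard `html.parser`. It is not a conforming Microdata parser: it assumes that every value-bearing element contains only text, it ignores the nesting of items, and the class name `MicrodataSketch` is made up for illustration.

```python
from html.parser import HTMLParser

HOTEL_HTML = """
<div itemscope itemtype="http://schema.org/Hotel">
  <span itemprop="name">Vienna Marriott Hotel</span>
  <span itemprop="address" itemscope itemtype="http://schema.org/PostalAddress">
    <span itemprop="streetAddress">Parkring 12a</span>
    <span itemprop="addressLocality">Vienna</span>
  </span>
  <div itemprop="aggregateRating" itemscope itemtype="http://schema.org/AggregateRating">
    <span itemprop="ratingValue"> 4 </span> stars-based on
    <span itemprop="reviewCount"> 250 </span> reviews.
  </div>
</div>
"""


class MicrodataSketch(HTMLParser):
    """Collects item types and flat (property, text) pairs."""

    def __init__(self):
        super().__init__()
        self.types = []     # values of itemtype attributes, in document order
        self.props = []     # (itemprop, text) for text-only elements
        self._prop = None   # property currently being read, if any
        self._buf = []

    def handle_starttag(self, tag, attrs):
        a = dict(attrs)
        if "itemtype" in a:
            self.types.append(a["itemtype"])
        # Elements with itemscope hold nested items, not plain text values.
        if "itemprop" in a and "itemscope" not in a:
            self._prop, self._buf = a["itemprop"], []

    def handle_data(self, data):
        if self._prop is not None:
            self._buf.append(data)

    def handle_endtag(self, tag):
        if self._prop is not None:
            self.props.append((self._prop, "".join(self._buf).strip()))
            self._prop = None


parser = MicrodataSketch()
parser.feed(HOTEL_HTML)
print(parser.types)
print(parser.props)
```

A production extractor would also track which property belongs to which item; the flat output here is only meant to show that the annotations are machine-readable.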


Schema.org

Since 2011, search engines have asked site owners to annotate their data in order to enrich search results.

675 Types: Event, Place, Local Business, Product, Review, Person

Encoding: Microdata, RDFa, JSON-LD
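
For comparison with the Microdata encoding, the hotel from the Microdata slide can also be written as JSON-LD. The sketch below is an illustration rather than an official example: it builds the description as a Python dictionary and wraps it in the `script` element commonly used to embed JSON-LD in a page. The helper name `to_script_tag` is an assumption.

```python
import json

# Same facts as in the Microdata example, expressed with Schema.org terms.
hotel = {
    "@context": "https://schema.org",
    "@type": "Hotel",
    "name": "Vienna Marriott Hotel",
    "address": {
        "@type": "PostalAddress",
        "streetAddress": "Parkring 12a",
        "addressLocality": "Vienna",
    },
    "aggregateRating": {
        "@type": "AggregateRating",
        "ratingValue": 4,
        "reviewCount": 250,
    },
}


def to_script_tag(data):
    """Serialize the description and wrap it for embedding in HTML."""
    body = json.dumps(data, indent=2)
    return '<script type="application/ld+json">\n' + body + "\n</script>"


snippet = to_script_tag(hotel)
print(snippet)
```

Unlike Microdata and RDFa, the JSON-LD block is kept separate from the visible HTML, so the annotations do not have to be interleaved with the page layout.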


Usage of Schema.org Data @ Google

• Data snippets within search results
• Data snippets within info boxes


Rich-Snippets Get More User Attention

Source: www.looktracker.com

Potential business incentive.


The Web Data Commons Project

• extracts all Microformat, Microdata, RDFa, and JSON-LD data from the Common Crawl
• analyzes and provides the extracted data for download

Statistics about some extraction runs:

CC Corpus | HTML pages | RDF triples
2017 | 3.1 billion | 38.2 billion
2016 | 3.1 billion | 44.2 billion
2014 | 2.0 billion | 20.4 billion
2012 | 3.0 billion | 7.3 billion

Extraction infrastructure:
• uses 100 machines on Amazon EC2 (100 spot instances of type c3.xlarge)
• approx. 2000 machine hours, 350 Euro

http://www.webdatacommons.org/structureddata/


Overall Adoption 2017

1.2 billion HTML pages out of the 3.2 billion pages provide semantic annotations (38%).

7.4 million pay-level-domains (PLDs) out of the 26 million pay-level-domains covered by the crawl provide annotations (28.4%).

Google, 2014*: 5 million websites provide Schema.org data.

* Guha in LDOW2014 Keynote
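
The adoption shares on this slide can be recomputed from the rounded counts. This small sketch does exactly that; `share` is just an illustrative helper. Because the inputs are rounded, the results (roughly 37.5% of pages and 28.5% of PLDs) differ slightly from the 38% and 28.4% on the slide, which were presumably derived from the exact counts.

```python
def share(part, total):
    """Return part/total as a percentage."""
    return 100.0 * part / total


# Rounded counts as given on the slide.
pages_with_annotations = 1.2e9
pages_total = 3.2e9
plds_with_annotations = 7.4e6
plds_total = 26e6

page_share = share(pages_with_annotations, pages_total)
pld_share = share(plds_with_annotations, plds_total)
print(f"{page_share:.1f}% of pages, {pld_share:.1f}% of PLDs")
```

The gap between the two shares suggests that annotating domains contribute more pages to the crawl than the average domain, although the slide itself does not state this.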


Development of the Adoption over Time

[Figure: number of PLDs using a specific format]


Most Popular Classes

[Figure: number of PLDs per class, shown separately for RDFa and Microdata]

Strong focus on Schema.org and Facebook (og:) vocabularies.


Topical Focus – Microdata 2014

Top Classes

Rank | Class | Instances (2014) | # PLDs (2014) | % PLDs (2014) | # PLDs (2013) | % PLDs (2013)
1 | schema:WebPage | 51,757,000 | 148,893 | 18.16% | 69,712 | 15.04%
2 | schema:Article | 54,972,000 | 88,700 | 10.82% | 65,930 | 14.22%
3 | schema:Blog | 3,787,000 | 110,663 | 13.50% | 64,709 | 13.96%
4 | schema:Product | 288,083,000 | 89,608 | 10.93% | 56,388 | 12.16%
5 | schema:PostalAddress | 48,804,000 | 101,086 | 12.33% | 52,446 | 11.31%
6 | dv:Breadcrumb | 269,088,000 | 76,894 | 9.38% | 44,187 | 9.53%
7 | schema:AggregateRating | 59,070,000 | 50,510 | 6.16% | 36,823 | 7.94%
8 | schema:Offer | 236,953,000 | 62,849 | 7.66% | 35,635 | 7.69%
9 | schema:LocalBusiness | 20,194,000 | 62,191 | 7.58% | 35,264 | 7.61%
10 | schema:BlogPosting | 11,458,000 | 65,397 | 7.98% | 32,056 | 6.92%
11 | schema:Organization | 101,769,000 | 52,733 | 6.43% | 24,255 | 5.23%
12 | schema:Person | 115,376,000 | 47,936 | 5.85% | 21,107 | 4.55%
13 | schema:ImageObject | 35,356,000 | 25,573 | 3.12% | 16,084 | 3.47%
14 | dv:Product | 12,411,000 | 16,003 | 1.95% | 13,844 | 2.99%
15 | schema:Review | 42,561,000 | 20,124 | 2.45% | 13,137 | 2.83%
16 | dv:Review-aggregate | 3,964,000 | 14,094 | 1.72% | 13,075 | 2.82%
17 | dv:Organization | 3,155,000 | 10,649 | 1.30% | 9,582 | 2.07%
18 | dv:Offer | 7,170,000 | 11,640 | 1.42% | 9,298 | 2.01%
19 | dv:Address | 2,138,000 | 9,674 | 1.18% | 8,866 | 1.91%
20 | dv:Rating | 1,732,000 | 9,367 | 1.14% | 8,360 | 1.80%

schema: = Schema.org
dv: = Google Rich Snippet Vocabulary (deprecated)

Topics:
• CMS and blog metadata
• products and offers
• ratings and reviews
• business listings
• address data
• ...and a massive long tail


Adoption by E-Commerce Websites

Distribution by Top-Level Domain

TLD | # PLDs
com | 38,344
co.uk | 3,605
net | 1,813
de | 1,333
pl | 1,273
com.br | 1,194
ru | 1,165
com.au | 1,062
nl | 1,002

Alexa Top-15 Shopping Sites (checked for schema:Product)
• Amazon.com
• Ebay.com
• NetFlix.com
• Amazon.co.uk
• Walmart.com
• etsy.com
• Ikea.com
• Bestbuy.com
• Homedepot.com
• Target.com
• Groupon.com
• Newegg.com
• Lowes.com
• Macys.com
• Nordstrom.com

Adoption by Top-15: 60 %


Properties used to Describe Products

Top 15 Properties | # PLDs | % PLDs
schema:Product/name | 78,292 | 87%
schema:Product/image | 59,445 | 66%
schema:Product/description | 58,228 | 65%
schema:Product/offers | 57,633 | 64%
schema:Offer/price | 54,290 | 61%
schema:Offer/availability | 36,789 | 41%
schema:Offer/priceCurrency | 30,610 | 34%
schema:Product/url | 23,723 | 26%
schema:Product/aggregateRating | 21,166 | 24%
schema:AggregateRating/ratingValue | 20,513 | 23%
schema:AggregateRating/reviewCount | 14,930 | 17%
schema:Product/manufacturer | 10,150 | 11%
schema:Product/brand | 9,739 | 11%
schema:Product/productID | 9,221 | 10%
schema:Product/sku | 7,955 | 9%


Adoption by Travel Websites

Top 15 Travel Websites (checked for schema:Hotel and for any class)
• Booking.com (uses DataVoc)
• TripAdvisor
• Expedia
• Agoda
• Hotels.com
• Kayak
• Priceline
• Travelocity
• Orbitz
• ChoiceHotels
• HolidayCheck
• ChoiceHotels
• InterContinental Hotels Group
• Marriott International
• Global Hyatt Corp.

Adoption: 73 %


Hands-on: How to get the Data?

http://www.webdatacommons.org/structureddata/

Only tip of the iceberg, as each website is only partly crawled.


There are hundreds of millions of high-quality tables on the Web and in Wikipedia.


4.2 HTML Tables

In a corpus of 14B raw tables, 154M are “good” relations (1.1%). Cafarella (2008)

• Cafarella, et al.: WebTables: Exploring the Power of Tables on the Web. VLDB 2008.
• Crestan, Pantel: Web-Scale Table Census and Classification. WSDM 2011.


Hands-on: Web Data Commons – Web Tables Corpus

Large public corpus of relational Web tables

extracted from Common Crawl 2015 (1.78 billion pages)

90 million relational tables selected out of 10.2 B raw tables (0.9%)

download includes the HTML pages of the tables (1TB zipped)

http://webdatacommons.org/webtables/

Alternative: Google Table Search https://research.google.com/tables


Attribute Statistics

Web Data Commons – Web Tables Corpus

28,000,000 different attribute labels

Attribute | # Tables
name | 4,600,000
price | 3,700,000
date | 2,700,000
artist | 2,100,000
location | 1,200,000
year | 1,000,000
manufacturer | 375,000
country | 340,000
isbn | 99,000
area | 95,000
population | 86,000

Subject Attribute Values

1.74 billion rows
253,000,000 different subject labels

Value | # Rows
usa | 135,000
germany | 91,000
greece | 42,000
new york | 59,000
london | 37,000
athens | 11,000
david beckham | 3,000
ronaldinho | 1,200
oliver kahn | 710
twist shout | 2,000
yellow submarine | 1,400


Exploiting the Template-Structure of HTML Pages

Many websites are generated from databases using HTML-templates.

Approaches to extract the data:
• Hand-written wrappers using XPath or regexes
• Wrapper induction using machine learning techniques (see Bing Liu: Web Data Mining book)

Problem:
• Wrappers are site-specific
• Thus the approach does not scale to large numbers of websites
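
A hand-written wrapper and its main weakness can be sketched with a regular expression. The three HTML snippets, the product names, and the template layouts below are invented for illustration; the wrapper works on both pages of the hypothetical site A because they share a template, and fails on site B because its template differs.

```python
import re

# Two pages generated from the same (hypothetical) template of site A.
PAGE_A1 = '<div class="product"><h1>Red Kettle</h1><span class="price">EUR 24.90</span></div>'
PAGE_A2 = '<div class="product"><h1>Blue Toaster</h1><span class="price">EUR 39.00</span></div>'

# A page from a different (hypothetical) site B with another template.
PAGE_B = '<table><tr><td>Name</td><td>Green Mixer</td></tr><tr><td>Price</td><td>59.00 EUR</td></tr></table>'

# Hand-written wrapper for site A: tied to its exact tag layout.
SITE_A_WRAPPER = re.compile(
    r'<h1>(?P<name>[^<]+)</h1>'
    r'<span class="price">EUR (?P<price>[0-9]+\.[0-9]{2})</span>'
)


def extract(wrapper, html):
    """Apply a wrapper to a page; return a record or None if it does not fit."""
    match = wrapper.search(html)
    if match is None:
        return None
    return {"name": match.group("name"), "price": float(match.group("price"))}


for page in (PAGE_A1, PAGE_A2, PAGE_B):
    print(extract(SITE_A_WRAPPER, page))
```

Every additional site needs its own wrapper, and each wrapper breaks when the site changes its template, which is why the approach does not scale.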


4.3 Wikipedia as Data Source

[Figure: a Wikipedia article annotated with its extractable parts]
• Title
• Description
• Cross-language links
• Geo-coordinates
• Images
• Infoboxes


Extracting Knowledge from Wikipedia


The DBpedia Knowledge Base - Release 2016

Describes 6.6 million things, out of which 5.5 million are classified in a consistent ontology using 760 classes and 2729 different properties
• 1,500,000 persons
• 840,000 places
• 260,000 organizations
• 139,000 music albums

Altogether 13 billion pieces of information (RDF triples)
• 1.7 billion were extracted from the English edition of Wikipedia
• 29,000,000 links to external web pages
• 50,000,000 external links into other RDF datasets

DBpedia Internationalization
• provides data from 125 Wikipedia language editions for download
• for 28 popular languages DBpedia provides cleaned infobox data


Hands-on: How to get DBpedia Data?

• Download Data Dumps
• Use SPARQL endpoint


Knowledge Graphs

Large cross-domain knowledge bases which aim to cover all “relevant” entities in the world.

Google Knowledge Graph
• development started 2012, builds on Freebase
• 570 million objects described by over 18 billion facts (2012)
• 1500 classes, 35,000 properties

Microsoft Satori Knowledge Base
• revealed to the public in mid-2013

Yahoo Knowledge Graph
• revealed to the public in early 2014

Knowledge Graphs employ RDF-style graph data models


Data Sources used to Build Knowledge Graphs

1. Wikipedia
   • infoboxes, category system, information extraction from text
2. Open license sources
   • e.g. CIA World Factbook, MusicBrainz, …
3. Commercial third-party data
   • e.g. IMDB, company listings, …
4. schema.org annotations in web pages
   • e.g. contact information for companies
   • e.g. logos of companies

Lots of effort is spent on data integration and manual data curation


Application of the Google Knowledge Graph

Enrich search results with knowledge cards and lists

Goal: Fulfil information need without having users navigate to other websites


Applications of the Google Knowledge Graph

1. Answer fact queries: “birthdate michael douglas”

2. Compare things: “compare eiffel tower vs empire state building”


Behind-the-Scenes Applications of KGs

Google
• uses its knowledge graph to identify entities in web pages (Entity Linking)
• Hummingbird ranking algorithm (deployed in 2013) uses the knowledge graph as background knowledge for ranking search results.

Yahoo
• uses its knowledge graph to “support applications across the company:
  • Web Search, Content Understanding
  • Recommendation, Personalization, Advertisement”*

Data Integration
• becomes matching data sources against knowledge graphs as intermediate schemata.

Various tasks become easier if you know all entities in the world.

*Source: Nicolas Torzec, Yahoo 2014


SEO Topic: How to influence Knowledge Graphs?

Source: http://searchengineland.com/leveraging-wikidata-gain-google-knowledge-graph-result-219706


Summary

In addition to text, the Web contains a vast amount of structured data.

The topics of the published data partly correlate with the publication methods used:
• Web APIs: User generated content, geographic data
• Schema.org data: E-commerce, local business, event, job data
• Linked Data: E-government data, library data, research data
• Wikipedia, HTML tables: General knowledge

The Web is the perfect playground for researching and applying Big Data integration techniques
• tough challenges concerning heterogeneity, volume, and data quality
• rewarding if challenges can be handled, e.g. web-scale queries and mining


5. References

Linked Data
• Tom Heath, Christian Bizer: Linked Data: Evolving the Web into a Global Data Space. Synthesis Lectures on the Semantic Web. Morgan & Claypool Publishers (also: free online version), 2011.

RDFa, Microdata and Microformats
• Christian Bizer, et al.: Deployment of RDFa, Microdata, and Microformats on the Web – A Quantitative Analysis. 12th International Semantic Web Conference, 2013.

Extracting HTML Table Data
• Michael Cafarella, Alon Halevy, et al.: WebTables: Exploring the Power of Tables on the Web. Proceedings of the VLDB Endowment, 2008.
• Petros Venetis, Alon Halevy, et al.: Recovering Semantics of Tables on the Web. Proceedings of the VLDB Endowment, 2011.

Wrapper Induction
• Bing Liu: Web Data Mining. Chapter 9. Springer, 2011.

DBpedia
• Jens Lehmann, et al.: DBpedia – A Large-scale, Multilingual Knowledge Base Extracted from Wikipedia. Semantic Web Journal, 2014.

