STI Summit 2011 - Global data integration and global data mining

Page 1: STI Summit 2011 - Global data integration and global data mining

STI Summit

July 6th, 2011, Riga, Latvia

Global Data Integration and Global Data Mining

Prof. Dr. Christian Bizer

Freie Universität Berlin

Germany

Christian Bizer: Global Data Integration – STI Summit, Riga (6/7/2011)

Page 2: STI Summit 2011 - Global data integration and global data mining

Outline

1. Topology of the Web of Data: What data is out there?

2. Global Data Integration: How to split the integration effort

3. Global Data Mining: The logical next step

Page 3: STI Summit 2011 - Global data integration and global data mining

Linked Data Deployment on the Web

Year  Datasets  Triples         Growth
2007  12        500.000.000
2008  45        2.000.000.000   300%
2009  95        6.726.000.000   236%
2010  203       26.930.509.703  300%
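The growth column can be reproduced from the triple counts. The sketch below assumes that "growth" means the increase relative to the previous year's count; the slide does not state the formula, and the function name is illustrative.

```python
# Triple counts per year, copied from the table above.
triples = {
    2007: 500_000_000,
    2008: 2_000_000_000,
    2009: 6_726_000_000,
    2010: 26_930_509_703,
}


def yearly_growth(counts):
    """Return {year: percent increase over the previous year}, rounded to whole percent."""
    years = sorted(counts)
    return {
        year: round((counts[year] - counts[prev]) / counts[prev] * 100)
        for prev, year in zip(years, years[1:])
    }


growth = yearly_growth(triples)
for year, percent in growth.items():
    print(year, f"{percent}%")
```

Under this assumption the computed values match the 300%, 236%, and 300% shown on the slide.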

Page 4: STI Summit 2011 - Global data integration and global data mining

Uptake in the Government Domain

The EU is starting to publish Linked Data (LOD2, LATC)

Various other national efforts

W3C eGovernment Interest Group

Page 5: STI Summit 2011 - Global data integration and global data mining

Uptake in the Libraries Community

Institutions publishing Linked Data:

Library of Congress (subject headings)

German National Library (PND dataset and subject headings)

Swedish National Library (Libris catalog)

Hungarian National Library (OPAC and Digital Library)

Europeana project just released data about 4 million artifacts

Growth of Library Linked Data (2009-2010): 1000%

W3C Library Linked Data Incubator Group

Goals:

1. Integrate library catalogs on a global scale.

2. Interconnect resources between repositories (by topic, by location, by historical period, by ...).

Page 6: STI Summit 2011 - Global data integration and global data mining

LOD data set statistics as of November 2010

Domain         Data Sets  Triples         Percent  RDF Links    Percent
Cross-domain   20         1,999,085,950   7.42     29,105,638   7.36
Geographic     16         5,904,980,833   21.93    16,589,086   4.19
Government     25         11,613,525,437  43.12    17,658,869   4.46
Media          26         2,453,898,811   9.11     50,374,304   12.74
Libraries      67         2,237,435,732   8.31     77,951,898   19.71
Life sciences  42         2,664,119,184   9.89     200,417,873  50.67
User Content   7          57,463,756      0.21     3,402,228    0.86
Total          203        26,930,509,703           395,499,896

LOD Cloud Data Catalog on CKAN: http://www.ckan.net/group/lodcloud

More statistics: http://www4.wiwiss.fu-berlin.de/lodcloud/state/
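The percentage columns follow directly from the raw counts. As a small check, the sketch below recomputes each domain's share of triples and of RDF links; the variable and function names are illustrative and not part of the original statistics.

```python
# (data sets, triples, RDF links) per domain, copied from the table above.
stats = {
    "Cross-domain": (20, 1_999_085_950, 29_105_638),
    "Geographic": (16, 5_904_980_833, 16_589_086),
    "Government": (25, 11_613_525_437, 17_658_869),
    "Media": (26, 2_453_898_811, 50_374_304),
    "Libraries": (67, 2_237_435_732, 77_951_898),
    "Life sciences": (42, 2_664_119_184, 200_417_873),
    "User Content": (7, 57_463_756, 3_402_228),
}

total_datasets = sum(datasets for datasets, _, _ in stats.values())
total_triples = sum(triples for _, triples, _ in stats.values())
total_links = sum(links for _, _, links in stats.values())


def shares(table):
    """Return {domain: (percent of all triples, percent of all RDF links)}."""
    return {
        domain: (
            round(100 * triples / total_triples, 2),
            round(100 * links / total_links, 2),
        )
        for domain, (_, triples, links) in table.items()
    }


for domain, (triple_share, link_share) in shares(stats).items():
    print(f"{domain:14} {triple_share:6.2f} {link_share:6.2f}")
```

Read side by side, the two shares show how unevenly links are spread: life sciences hold about half of all RDF links with under a tenth of the triples, while government data is the reverse.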

Page 7: STI Summit 2011 - Global data integration and global data mining

What are the big players doing?

Page 8: STI Summit 2011 - Global data integration and global data mining

Structured Data becomes an SEO Topic

Data Snippets

Query Answer

Page 9: STI Summit 2011 - Global data integration and global data mining

Result: Further growth …

Usage of RDFa increased 510% between March 2009 and October 2010.

430 million web pages contain RDFa.

Source: Yahoo, http://tripletalk.wordpress.com/2011/01/25/rdfa-deployment-across-the-web/

Page 10: STI Summit 2011 - Global data integration and global data mining

The Structural Continuum

The Web of Data is interwoven with the classic Web.

Unstructured text: HTML

Structured data:

RDFa embedded into HTML (Open Graph)

Microdata embedded into HTML (Schema.org)

Microformats embedded into HTML

Linked data: RDF/XML
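To make the middle of the continuum concrete, here is a minimal sketch of reading structured data embedded in an ordinary HTML page. It assumes the Open Graph convention of meta elements carrying property and content attributes; the page itself is made up, and the class name is illustrative.

```python
from html.parser import HTMLParser

# Hypothetical page: free text for humans plus Open Graph properties in the head.
PAGE = """
<html prefix="og: http://ogp.me/ns#">
  <head>
    <title>STI Summit 2011</title>
    <meta property="og:title" content="Global Data Integration and Global Data Mining" />
    <meta property="og:type" content="article" />
  </head>
  <body><p>Unstructured text lives here.</p></body>
</html>
"""


class OpenGraphExtractor(HTMLParser):
    """Collect og:* property/content pairs from <meta> elements."""

    def __init__(self):
        super().__init__()
        self.properties = {}

    def handle_starttag(self, tag, attrs):
        if tag != "meta":
            return
        attributes = dict(attrs)
        prop = attributes.get("property")
        if prop and prop.startswith("og:") and "content" in attributes:
            self.properties[prop] = attributes["content"]


extractor = OpenGraphExtractor()
extractor.feed(PAGE)
print(extractor.properties)
```

The same document serves both audiences: a browser renders the body text, while a crawler picks up the property/value pairs.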

Page 11: STI Summit 2011 - Global data integration and global data mining

Topology of the Web of Data

Page 12: STI Summit 2011 - Global data integration and global data mining

How to get the data?

Download the Billion Triples Challenge Dataset

2 billion triples (20GB gzipped)

crawled from the public Web of Linked Data in May/June 2011

http://challenge.semanticweb.org/

Download the Sindice Dump

12 billion triples (164GB gzipped, ~1.16TB uncompressed)

crawled from the public Web of Linked Data and

includes RDFa, Microformat, and wrapped API data

http://data.sindice.com/trec2011/download.html
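Dumps of this size are best processed as a stream rather than loaded into memory. The sketch below assumes a line-based serialization with one statement per line (N-Triples or N-Quads); the slide does not state the file format, and the sample statements are made up so that the code runs without a download.

```python
import gzip
import io
from collections import Counter

# Tiny stand-in for a dump file: three N-Quads statements.
SAMPLE = b"""\
<http://example.org/a> <http://xmlns.com/foaf/0.1/name> "Alice" <http://example.org/g1> .
<http://example.org/a> <http://xmlns.com/foaf/0.1/knows> <http://example.org/b> <http://example.org/g1> .
<http://example.org/b> <http://xmlns.com/foaf/0.1/name> "Bob" <http://example.org/g2> .
"""


def count_predicates(fileobj):
    """Stream a gzipped line-based RDF dump and count statements per predicate."""
    counts = Counter()
    with gzip.open(fileobj, mode="rt", encoding="utf-8") as lines:
        for line in lines:
            line = line.strip()
            if not line or line.startswith("#"):
                continue
            # Subject and predicate never contain whitespace, so the first two
            # tokens can be split off safely; the object may contain spaces.
            _subject, predicate, _rest = line.split(None, 2)
            counts[predicate] += 1
    return counts


dump = io.BytesIO(gzip.compress(SAMPLE))
predicate_counts = count_predicates(dump)
print(predicate_counts.most_common())
```

For a real dump, pass the path of the downloaded .gz file instead of the in-memory buffer. A full parser is needed as soon as objects or literals have to be interpreted; this sketch only looks at predicates.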

Page 13: STI Summit 2011 - Global data integration and global data mining

2. Global Data Integration

Applications hate heterogeneity!

The wild wild west

My little world

Page 14: STI Summit 2011 - Global data integration and global data mining

The Dataspace Vision

Alternative to classic data integration systems in order to cope with the growing number of data sources.

Properties of dataspaces:

no upfront investment into a global schema

rely on pay-as-you-go data integration

give best-effort answers to queries

Franklin, M., Halevy, A., and Maier, D.: From Databases to Dataspaces: A new Abstraction for Information Management, SIGMOD Rec. 2005.

Madhavan, J., et al.: Web-scale Data Integration: You Can Only Afford to Pay As You Go, CIDR 2007

Page 15: STI Summit 2011 - Global data integration and global data mining

Linked Data relies on Pay-as-You-Go Idea

for Identity Management

for Schema/Vocabulary Management

Page 16: STI Summit 2011 - Global data integration and global data mining

Publish Identity Links on the Web

Identity Link:

<http://www4.wiwiss.fu-berlin.de/is-group/resource/persons/Person4>
    owl:sameAs
    <http://dblp.l3s.de/d2r/resource/authors/Christian_Bizer> .

You publish links pointing at other data sources.

Somebody else publishes links pointing at your data source.
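Because owl:sameAs is symmetric and transitive, links published by different parties combine into clusters of identifiers for the same entity. One way a consumer might merge them is a union-find structure, sketched below; the second link and its example.org URI are made up for illustration, and the function name is an assumption.

```python
# owl:sameAs links as (subject, object) pairs. The first pair is the link
# shown on the slide; the second is a hypothetical third-party link.
SAME_AS = [
    ("http://www4.wiwiss.fu-berlin.de/is-group/resource/persons/Person4",
     "http://dblp.l3s.de/d2r/resource/authors/Christian_Bizer"),
    ("http://example.org/people/cbizer",
     "http://dblp.l3s.de/d2r/resource/authors/Christian_Bizer"),
]


def identity_clusters(links):
    """Group URIs into sets of co-referent identifiers using union-find."""
    parent = {}

    def find(uri):
        parent.setdefault(uri, uri)
        while parent[uri] != uri:
            parent[uri] = parent[parent[uri]]  # path halving
            uri = parent[uri]
        return uri

    for left, right in links:
        root_left, root_right = find(left), find(right)
        if root_left != root_right:
            parent[root_left] = root_right

    clusters = {}
    for uri in list(parent):
        clusters.setdefault(find(uri), set()).add(uri)
    return list(clusters.values())


clusters = identity_clusters(SAME_AS)
print(clusters)
```

Neither publisher needs to know about the other: the two independent links are enough for the consumer to treat all three URIs as one entity.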

Page 17: STI Summit 2011 - Global data integration and global data mining

Effort Distribution between Publisher and Consumer

Consumer data mines identity links

Effort Distribution

Publishers or third parties provide identity links

Page 18: STI Summit 2011 - Global data integration and global data mining

Vocabularies on the Web of Data

Everyone can use whatever vocabularies she likes to publish data on the Web.

Or invest effort and reuse common vocabularies:

Friend-of-a-Friend for describing people and their social network

SIOC for describing forums and blogs

SKOS for representing topic taxonomies

Organization Ontology for describing the structure of organizations

GoodRelations provides terms for describing products and business entities

Music Ontology for describing artists, albums, and performances

Review Vocabulary provides terms for representing reviews

Many Linked Data sources use a mixture of common and proprietary vocabulary terms.

Page 19: STI Summit 2011 - Global data integration and global data mining

Publish Vocabulary Links on the Web

Vocabulary Link:

<http://xmlns.com/foaf/0.1/Person>
    owl:equivalentClass
    <http://dbpedia.org/ontology/Person> .

Simple Mappings: RDFS, OWL

rdfs:subClassOf, rdfs:subPropertyOf

owl:equivalentClass, owl:equivalentProperty

Complex Mappings: R2R

provides value transformation functions

structural transformations
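A consumer can use a simple mapping such as the owl:equivalentClass link above to translate incoming data into its preferred vocabulary. The sketch below is an illustration only: it rewrites rdf:type statements and does not cover R2R-style value or structural transformations. The function name and the example.org data are assumptions.

```python
RDF_TYPE = "http://www.w3.org/1999/02/22-rdf-syntax-ns#type"

# Vocabulary link from the slide, read here as "source class -> target class".
# owl:equivalentClass is symmetric; the direction is the consumer's choice.
EQUIVALENT_CLASS = {
    "http://xmlns.com/foaf/0.1/Person": "http://dbpedia.org/ontology/Person",
}


def normalize_types(triples, class_map):
    """Rewrite rdf:type objects into the target vocabulary; keep other triples as they are."""
    result = []
    for subject, predicate, obj in triples:
        if predicate == RDF_TYPE:
            obj = class_map.get(obj, obj)
        result.append((subject, predicate, obj))
    return result


# Made-up source data using the FOAF vocabulary.
source = [
    ("http://example.org/people/alice", RDF_TYPE, "http://xmlns.com/foaf/0.1/Person"),
    ("http://example.org/people/alice", "http://xmlns.com/foaf/0.1/name", "Alice"),
]
normalized = normalize_types(source, EQUIVALENT_CLASS)
for triple in normalized:
    print(triple)
```

Property mappings (owl:equivalentProperty) could be handled the same way with a second dictionary applied to the predicate position.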

Page 20: STI Summit 2011 - Global data integration and global data mining

Deployment of Vocabulary Links

Source: Linked Open Vocabularies, http://labs.mondeca.com/dataset/lov

Page 21: STI Summit 2011 - Global data integration and global data mining

Effort Distribution between Publisher and Consumer

Consumer defines or data mines mappings

Effort Distribution

Publisher reuses vocabularies

Publisher or third party publishes mappings

Page 22: STI Summit 2011 - Global data integration and global data mining

Somebody-Pays-As-You-Go

The overall data integration effort is split between the data publisher, the data consumer and third parties.

Data Publisher

publishes data as RDF

sets identity links

reuses terms or publishes mappings

Third Parties

set identity links pointing at your data

publish mappings to the Web

Data Consumer

has to do the rest

using record linkage and schema matching techniques

Figure: Fix Overall Data Integration Effort, split into Publisher's Effort, Third Party Effort, and Consumer's Effort
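When nobody has published identity links, the consumer has to find them. A minimal record-linkage sketch follows, assuming that entities carry a textual label and that string similarity is an acceptable matching signal; the records, the threshold of 0.9, and the function names are all made up for illustration.

```python
from difflib import SequenceMatcher


def normalize(label):
    """Lowercase and unify separators so 'Christian_Bizer' matches 'Christian Bizer'."""
    return " ".join(label.replace("_", " ").lower().split())


def propose_links(source_a, source_b, threshold=0.9):
    """Return (uri_a, uri_b, score) for label pairs whose similarity reaches the threshold."""
    links = []
    for uri_a, label_a in source_a.items():
        for uri_b, label_b in source_b.items():
            score = SequenceMatcher(None, normalize(label_a), normalize(label_b)).ratio()
            if score >= threshold:
                links.append((uri_a, uri_b, score))
    return links


# Hypothetical records from two sources: URI -> label.
source_a = {
    "http://example.org/a/1": "Christian Bizer",
    "http://example.org/a/2": "Tom Heath",
}
source_b = {
    "http://example.org/b/x": "Christian_Bizer",
    "http://example.org/b/y": "Tim Berners-Lee",
}
candidate_links = propose_links(source_a, source_b)
print(candidate_links)
```

The pairwise comparison is quadratic in the number of records, so larger sources need a blocking step that only compares records sharing some cheap key.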

Page 23: STI Summit 2011 - Global data integration and global data mining

Research Directions

1. More research on pay-as-you-go data integration is needed.

2. More research on data mining mappings and identity resolution heuristics is needed.

Identity links make it easier to mine vocabulary links.

Vocabulary links make it easier to mine identity links.

3. More research on SPAM detection and data quality assessment is needed.
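One way to read "identity links make it easier to mine vocabulary links": if two sources describe the same entities, classes that co-occur on linked entities are candidates for a class mapping. The sketch below counts such co-occurrences. It is a simple heuristic chosen for illustration, not a method taken from the talk, and all identifiers in the sample data are made up.

```python
from collections import Counter


def candidate_class_mappings(same_as, types_a, types_b):
    """Count how often a class from source A and a class from source B
    appear on two entities that are joined by an identity link."""
    counts = Counter()
    for uri_a, uri_b in same_as:
        for class_a in types_a.get(uri_a, ()):
            for class_b in types_b.get(uri_b, ()):
                counts[(class_a, class_b)] += 1
    return counts


# Made-up data: two sources describing the same three entities.
same_as = [("a:1", "b:1"), ("a:2", "b:2"), ("a:3", "b:3")]
types_a = {"a:1": {"foaf:Person"}, "a:2": {"foaf:Person"}, "a:3": {"foaf:Organization"}}
types_b = {"b:1": {"dbo:Person"}, "b:2": {"dbo:Person"}, "b:3": {"dbo:Organisation"}}

mappings = candidate_class_mappings(same_as, types_a, types_b)
best = mappings.most_common(1)[0]
print(best)
```

A real system would normalize the counts and apply a confidence threshold before publishing anything as owl:equivalentClass or rdfs:subClassOf.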

Page 24: STI Summit 2011 - Global data integration and global data mining

LDIF – Linked Data Integration Framework

Combines vocabulary normalization and identity resolution

Currently only in-memory implementation

Next release: Hadoop-based implementation

http://www4.wiwiss.fu-berlin.de/bizer/ldif/

Figure labels: Normalize vocabularies, Identity Resolution

Page 25: STI Summit 2011 - Global data integration and global data mining

What can we do afterwards …

… build better entity search engines

Page 26: STI Summit 2011 - Global data integration and global data mining

3. Global Data Mining

Page 27: STI Summit 2011 - Global data integration and global data mining

Page 28: STI Summit 2011 - Global data integration and global data mining

Think about interesting questions …

… that you can answer based on the Web of Data

… that require

aggregation

summarization

classification

association rule mining

… combined with

text mining

sentiment analysis

Page 29: STI Summit 2011 - Global data integration and global data mining

Everybody has the tools to find the answers

Page 30: STI Summit 2011 - Global data integration and global data mining

Research Directions

1. More research on data space profiling is needed.

2. More research on global data mining is needed.

Google, Yahoo, Microsoft, Facebook will get there soon.

Page 31: STI Summit 2011 - Global data integration and global data mining

Semantic Web Challenge

Submission Statistics

Year  Open Track  Billion Triple Track
2008  13          9
2009  16          3
2010  14          4

Do something interesting with the Billion Triple Data and submit your results to the challenge by October 1st.

Present your results at the 10th International Semantic Web Conference (ISWC2011), October 2011, Koblenz, Germany.

Page 32: STI Summit 2011 - Global data integration and global data mining

Conclusions

The Web of Data is there:

Linked Data, Microdata, RDFa, Microformats

Upcoming research topics:

pay-as-you-go data integration

mapping discovery, schema clustering

identity resolution heuristics discovery

probabilistic data integration

data quality assessment

data space profiling

global data mining

Page 33: STI Summit 2011 - Global data integration and global data mining

Thanks!

References

Textbook: Tom Heath, Christian Bizer: Linked Data: Evolving the Web into a Global Data Space. http://linkeddatabook.com/

Christian Bizer, Tom Heath, Tim Berners-Lee: Linked Data – The Story So Far. http://tomheath.com/papers/bizer-heath-berners-lee-ijswis-linked-data.pdf