Date post: | 09-May-2015 |
The Web of Data as a Complex System
- First insight into its multi-scale network properties
Christophe Guéret, Shenghui Wang, and Stefan Schlobach
Department of Computer Science, Network Institute, Vrije Universiteit Amsterdam
Outline
• What is the Web of Data?
• How complex is the Web of Data?
• A new way of seeing the Web of Data
• What have we found?
• What are the challenges?
What is the Web of Data?
The Semantic Web is a web of data -- http://www.w3.org/2001/sw/
Linked Data is a sub-topic of the Semantic Web. The term Linked Data is used to describe a method of exposing, sharing, and connecting data via dereferenceable URIs on the Web.
-- http://en.wikipedia.org/wiki/Linked_Data
Linked Data is about using the Web to connect related data that wasn't previously linked, or using the Web to lower the barriers to linking data currently linked using other methods.
-- http://linkeddata.org/
Four principles of Linked Data
1. Use URIs to identify things.
2. Use HTTP URIs so that these things can be referred to and looked up ("dereferenced") by people and user agents.
3. Provide useful information about the thing when its URI is dereferenced, using standard formats such as RDF/XML.
4. Include links to other, related URIs in the exposed data to improve discovery of other related information on the Web.
-- Tim Berners-Lee
[Figure: graph built from the nodes and link types listed below]
Nodes:
http://dbpedia.org/resource/Amsterdam
http://dbpedia.org/resource/City
http://umbel.org/umbel/ne/wikipedia/Amsterdam
http://www.freebase.com/view/en/abraham_pais
Link types:
http://www.w3.org/1999/02/22-rdf-syntax-ns#type
http://www.w3.org/2002/07/owl#sameAs
http://dbpedia.org/ontology/birthPlace
An example of linked data
• Nodes are shared across statements
• The links have some meaning
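The example above can be written down as a handful of (subject, link, object) statements. The sketch below is only an illustration: the URIs come from the slide, but which node sits at each end of a link is an assumption, and `statements_per_node` is a hypothetical helper.

```python
# Illustrative only: the pairing of nodes and links is an assumption,
# because the slide lists the nodes and link types without their pairing.
RDF_TYPE = "http://www.w3.org/1999/02/22-rdf-syntax-ns#type"
OWL_SAME_AS = "http://www.w3.org/2002/07/owl#sameAs"
BIRTH_PLACE = "http://dbpedia.org/ontology/birthPlace"

AMSTERDAM = "http://dbpedia.org/resource/Amsterdam"

triples = [
    (AMSTERDAM, RDF_TYPE, "http://dbpedia.org/resource/City"),
    (AMSTERDAM, OWL_SAME_AS, "http://umbel.org/umbel/ne/wikipedia/Amsterdam"),
    ("http://www.freebase.com/view/en/abraham_pais", BIRTH_PLACE, AMSTERDAM),
]


def statements_per_node(triples):
    """Count in how many statements each node appears as subject or object."""
    counts = {}
    for subject, _link, obj in triples:
        for node in (subject, obj):
            counts[node] = counts.get(node, 0) + 1
    return counts


# Nodes that occur in more than one statement are the shared ones.
shared = {node for node, n in statements_per_node(triples).items() if n > 1}
print(shared)
```

Under these assumed pairings, the Amsterdam node is the only one shared across statements, which is what ties the three statements into one graph.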
Since 2006, people have been creating linked data
October 2007
July 2009
Evolution of the Web of Data
The WoD is a complex system!
• More than 260 extremely heterogeneous datasets
o general-purpose datasets, such as DBpedia
o domain-oriented datasets, such as Bio2RDF
o government data, music data, geological data, social network data, etc.
• Nearly 50 billion RDF triples
o Nearly 50 billion links within the datasets
o More than 800 million links between the datasets
• Embedded rich semantics in the data
o data points are typed
o links are typed
o links are what make the statements useful
[Figure: graph with the nodes Christophe, VU Amsterdam, Amsterdam and The Netherlands, connected by workIn and isLocatedIn links]
The links have explicit semantics, so a reasoning process can deduce additional, implicit links
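How implicit links arise can be sketched with a single rule. Everything here is an assumption made for illustration: the direction of the edges is a guess from the figure labels, and the rule "X workIn Y, Y isLocatedIn Z implies X workIn Z" is hypothetical, not a rule stated on the slides.

```python
# Assumed explicit statements; the edge directions are a guess.
explicit = {
    ("Christophe", "workIn", "VU Amsterdam"),
    ("VU Amsterdam", "isLocatedIn", "Amsterdam"),
    ("Amsterdam", "isLocatedIn", "The Netherlands"),
}


def infer_work_in(triples):
    """Apply the hypothetical rule
        X workIn Y, Y isLocatedIn Z  =>  X workIn Z
    repeatedly until no new statement appears, and return only the
    statements that were not in the input."""
    known = set(triples)
    while True:
        new = {
            (x, "workIn", z)
            for (x, p1, y) in known if p1 == "workIn"
            for (y2, p2, z) in known if p2 == "isLocatedIn" and y2 == y
        }
        if new <= known:
            return known - set(triples)
        known |= new


implicit = infer_work_in(explicit)
for statement in sorted(implicit):
    print(statement)
```

With these inputs the rule adds two workIn links that nobody asserted explicitly, so the network an agent can traverse is larger than the network that was published.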
People are trying to use the WoD
Billion triple challenges since 2008
"The specific goal of the Billion Triples Track is to demonstrate the scalability of applications as well as to encourage the development of applications that can deal with Web data. We stress that the goal of this is not to be a benchmarking effort between triple stores, but rather to demonstrate applications that can scale to a Web scale using realistic Web-quality data."
http://challenge.semanticweb.org/
The WoD itself should be robust
• Are there central hubs whose failure would lead to a loss of connectivity?
• The WoD is designed for automated agents, which are less able to recover from connectivity failures.
• The robustness of the WoD should be ensured
• Up to now, the WoD could be studied, searched and maintained like a classical database
Network analysis
A new way of seeing the WoD
What network analysis tells us
Consider the WoD as a network
Applying network analysis over the WoD
• Average path length
• Degree distribution
• Strongly connected components
• Degree centrality
• Betweenness centrality
• Closeness centrality
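Two of the measures listed above can be sketched in plain Python on a toy directed network. The node names and edges are made up, and the normalisation chosen for degree centrality is only one common convention; the slides do not say which definitions or tools were used.

```python
from collections import deque

# Toy directed network; node names and edges are made up for illustration.
edges = [("A", "B"), ("B", "A"), ("B", "C"), ("C", "A"), ("D", "A")]
nodes = sorted({n for edge in edges for n in edge})
successors = {n: set() for n in nodes}
for source, target in edges:
    successors[source].add(target)


def distances_from(start):
    """Breadth-first search: hop count from start to every reachable node."""
    dist = {start: 0}
    queue = deque([start])
    while queue:
        current = queue.popleft()
        for nxt in successors[current]:
            if nxt not in dist:
                dist[nxt] = dist[current] + 1
                queue.append(nxt)
    return dist


def average_path_length():
    """Mean shortest-path length over all ordered pairs that are connected."""
    lengths = [d for n in nodes for d in distances_from(n).values() if d > 0]
    return sum(lengths) / len(lengths)


def degree_centrality():
    """(in-degree + out-degree) / (number of nodes - 1).
    On a directed graph this value can exceed 1."""
    degree = {n: len(successors[n]) for n in nodes}
    for _source, target in edges:
        degree[target] += 1
    return {n: degree[n] / (len(nodes) - 1) for n in nodes}


print(average_path_length())
print(degree_centrality())
```

Betweenness centrality, closeness centrality and strongly connected components follow the same pattern (they are all derived from shortest paths or reachability) but are left out to keep the sketch short.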
Scales of observation of the WoD
1. Graph scale
Graph-scale WoD network
• Each dataset is a node
• Edges are weighted, directed connections between the datasets
o if there is at least one triple having a subject within dataset 1 and an object within dataset 2, then there is an edge between these two datasets.
o the number of such triples is the weight of the edge.
• 110 nodes with 350 edges
• Average path length is 2.16
• 50 components
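The edge rule above can be sketched as a counting step over triples. The resource names, the `dataset_of` lookup and the helper `dataset_edges` are hypothetical; skipping links that stay inside one dataset is also an assumption, based on the rule speaking of "dataset 1" and "dataset 2".

```python
from collections import Counter

# Hypothetical lookup from a resource to the dataset that contains it.
dataset_of = {
    "dbpedia:Amsterdam": "DBpedia",
    "dbpedia:City": "DBpedia",
    "geonames:place1": "Geonames",
    "geonames:place2": "Geonames",
    "dblp:author1": "DBLP",
}

# Made-up triples (subject, predicate, object).
triples = [
    ("dbpedia:Amsterdam", "rdf:type", "dbpedia:City"),
    ("dbpedia:Amsterdam", "owl:sameAs", "geonames:place1"),
    ("dbpedia:City", "rdfs:seeAlso", "geonames:place2"),
    ("dblp:author1", "foaf:based_near", "dbpedia:Amsterdam"),
]


def dataset_edges(triples, dataset_of):
    """Return {(source dataset, target dataset): weight}, where the weight
    is the number of triples whose subject lies in the source dataset and
    whose object lies in the target dataset."""
    weights = Counter()
    for subject, _predicate, obj in triples:
        source = dataset_of.get(subject)
        target = dataset_of.get(obj)
        if source is None or target is None:
            continue  # node not attributed to any known dataset
        if source != target:  # assumption: only links between datasets count
            weights[(source, target)] += 1
    return weights


weights = dataset_edges(triples, dataset_of)
print(dict(weights))
```

Because the edges are directed, ("DBpedia", "Geonames") and ("Geonames", "DBpedia") are counted separately.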
A degree of 7 is the critical point beyond which the network is no longer scale-free.
Top central nodes

Betweenness centrality
Node            Value
DBpedia         0.332
DBLP Berlin     0.108
DBLP (RKB)      0.100
DBLP Hannover   0.097
FOAF profiles   0.075

Closeness centrality
Node            Value
DBpedia         0.762
Geonames        0.614
Drug Bank       0.576
Linked MDB      0.544
Flickr wrappr   0.526

Degree centrality
Node            Value
DBpedia         0.505
UniProt         0.266
DBLP (RKB)      0.266
ACM (RKB)       0.229
GeneID          0.211
Every centrality has a specific meaning...
Scales of observation of the WoD
2. Triple scale
Triple-scale WoD network
• We took 10 million triples from the dataset crawled from the WoD and provided by the Billion Triple Challenge 2009
• This "BTC" network is defined as G=(V, (E, L)), where
o V is a set of nodes, and each node is a URI or a literal
o E is a set of edges
o L is a set of labels, each label characterising a relation between nodes
• We applied a few strategies to aggregate data for comparison.
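The definition of G above maps directly onto three sets built from a list of triples. This is a minimal sketch: `build_labelled_graph` is a hypothetical helper, and storing each edge as (source, target, label) is an assumption, since the definition does not say how edges and labels are attached to each other.

```python
def build_labelled_graph(triples):
    """Return (V, (E, L)) for a list of (subject, predicate, object) triples.

    V: nodes (URIs or literals), E: edges, L: labels.
    Assumption: an edge is stored together with its label."""
    V, E, L = set(), set(), set()
    for subject, predicate, obj in triples:
        V.update((subject, obj))
        E.add((subject, obj, predicate))
        L.add(predicate)
    return V, (E, L)


# Made-up sample: two statements about the same subject, one with a literal.
sample = [
    ("http://dbpedia.org/resource/Amsterdam",
     "http://www.w3.org/1999/02/22-rdf-syntax-ns#type",
     "http://dbpedia.org/resource/City"),
    ("http://dbpedia.org/resource/Amsterdam",
     "http://www.w3.org/2000/01/rdf-schema#label",
     "Amsterdam"),
]

V, (E, L) = build_labelled_graph(sample)
print(len(V), len(E), len(L))
```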
Network                   Nodes   Edges   Average path length   Components
BTC                       605K    860K    2.15                  602K
BTC aggregated            14K     31K     2.80                  7K
BTC aggregated + filter   37      91      1.88                  17
Triple-scale network and its aggregations
• BTC aggregated: triples are aggregated by the domain names
• BTC aggregated + filter: only domain names shared with the graph-scale network
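The two aggregation strategies can be sketched with the standard library's URL parser. The helper names and the sample triples are hypothetical, and two details are assumptions because the slides do not cover them: literals (which have no domain name) are dropped, and links that stay within one domain are kept.

```python
from urllib.parse import urlparse


def domain(node):
    """Domain name of a URI node, or None for literals and other non-URIs."""
    return urlparse(node).netloc or None


def aggregate_by_domain(triples, keep=None):
    """Collapse triples into edges between domain names.
    If keep is given, only edges whose two ends are both in keep survive,
    which mimics the 'aggregated + filter' strategy."""
    edges = set()
    for subject, _predicate, obj in triples:
        source, target = domain(subject), domain(obj)
        if source is None or target is None:
            continue  # assumption: literals are dropped
        if keep is not None and (source not in keep or target not in keep):
            continue
        edges.add((source, target))
    return edges


# Made-up sample triples.
sample = [
    ("http://dbpedia.org/resource/Amsterdam",
     "http://www.w3.org/1999/02/22-rdf-syntax-ns#type",
     "http://dbpedia.org/resource/City"),
    ("http://dbpedia.org/resource/Amsterdam",
     "http://www.w3.org/2002/07/owl#sameAs",
     "http://umbel.org/umbel/ne/wikipedia/Amsterdam"),
    ("http://dbpedia.org/resource/Amsterdam",
     "http://www.w3.org/2000/01/rdf-schema#label",
     "Amsterdam"),
]

aggregated = aggregate_by_domain(sample)
filtered = aggregate_by_domain(sample, keep={"dbpedia.org"})
print(aggregated, filtered)
```

Aggregation shrinks the network sharply because many URIs share one domain name, which is consistent with the drop from 605K to 14K nodes in the table.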
Degree distribution
[Figure panels: BTC; BTC aggregated]
Power-law distribution
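A degree distribution is simply a count of how many nodes have each degree. The sketch below uses a made-up star-shaped network and a hypothetical helper; it counts incoming and outgoing links together, which may differ from how the plotted distributions were computed.

```python
from collections import Counter


def degree_distribution(edges):
    """Return {degree: number of nodes with that degree},
    counting incoming and outgoing links together."""
    degree = Counter()
    for source, target in edges:
        degree[source] += 1
        degree[target] += 1
    return Counter(degree.values())


# Made-up star network: one hub linked to four other nodes.
star = [("hub", "a"), ("hub", "b"), ("hub", "c"), ("hub", "d")]
distribution = degree_distribution(star)
print(sorted(distribution.items()))
```

In a power-law distribution most nodes sit at the low-degree end and a few hubs sit far out in the tail, which is the shape this toy example exaggerates.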
Top central nodes:
The next steps
Open challenges
Ongoing research activities at VUA
Challenges:
• Existence of implicit links
“Semantic virus”
[Figure: the graph of Christophe, VU Amsterdam, Amsterdam and The Netherlands with its workIn and isLocatedIn links, extended with an Asia node and an additional isLocatedIn link]
Challenges:
• Multi-relation links
• FOAF (social networks + personal information)
• SIOC (relations characterising blogs)
• SWRC (describing research work)
• …
Different filters produce different networks
The centrality status of nodes changes w.r.t. the networks
• Dynamics
• Data will be continuously added and linked.
“sameAs” networks
Monitoring and Improving the WoD
• Linked data is meant to be browsed, jumping from one resource to another
• The presence of hubs is critical for the paths
• Create alternate paths to be used in case of failure
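The effect of a hub failure, and of adding alternate paths, can be sketched with a brute-force connectivity check. This is not the method of the paper cited below; the helpers are hypothetical, the toy network is made up, and links are followed in both directions as a simplifying assumption.

```python
from collections import deque


def reachable(adjacency, start, removed=None):
    """Nodes reachable from start when the node `removed` has failed."""
    seen = {start}
    queue = deque([start])
    while queue:
        current = queue.popleft()
        for nxt in adjacency.get(current, ()):
            if nxt != removed and nxt not in seen:
                seen.add(nxt)
                queue.append(nxt)
    return seen


def critical_hubs(edges):
    """Nodes whose failure disconnects the remaining nodes.
    Assumption: links are treated as undirected."""
    adjacency = {}
    for a, b in edges:
        adjacency.setdefault(a, set()).add(b)
        adjacency.setdefault(b, set()).add(a)
    nodes = set(adjacency)
    hubs = set()
    for candidate in nodes:
        rest = nodes - {candidate}
        if not rest:
            continue
        start = next(iter(rest))
        if reachable(adjacency, start, removed=candidate) != rest:
            hubs.add(candidate)
    return hubs


# Made-up network in which every path goes through one hub.
star = [("A", "Hub"), ("B", "Hub"), ("C", "Hub")]
# The same network with two alternate paths added.
repaired = star + [("A", "B"), ("B", "C")]

print(critical_hubs(star))
print(critical_hubs(repaired))
```

In the first network the hub is a single point of failure; after the two extra links are added, no single failure disconnects the rest.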
Guéret, Groth, van Harmelen, Schlobach, "Finding the Achilles Heel of the Web of Data: using network analysis for link-recommendation", ISWC2010 - To appear
We need to study more!
{cgueret, swang, schlobac}@few.vu.nl