Similarity on DBpedia
UIMR
PhD student: Samantha Lam
Supervisor: Conor Hayes
Similarity
How similar are the following films:
(Unsatisfactory) Answer: it depends!
DBpedia Graph
Films are nodes on DBpedia.
Some things about DBpedia:
Big, rich, dense Knowledge Base
→ 3.77m nodes, 400m edges (EN)
Lots of prior work (as we shall see...)
But very heterogeneous - vocabularies, categories
It is a graph
Similarity in general
Cognitive Science - Tversky (1977) - psychology - featural.
E.g. film: genre, language, director
Modelling of human thought and semantic relations: how do we relate things to each other? (Collins & Quillian, 1969)
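A minimal sketch of featural similarity in the spirit of Tversky's ratio model, assuming each film is described by a set of feature strings. The feature values and the function name `tversky` are hypothetical, chosen only for illustration. With `alpha = beta = 1` the score reduces to the Jaccard index.

```python
def tversky(a, b, alpha=1.0, beta=1.0):
    """Ratio-model similarity: shared features vs. weighted distinctive features."""
    a, b = set(a), set(b)
    common = len(a & b)
    denom = common + alpha * len(a - b) + beta * len(b - a)
    return common / denom if denom else 0.0


# Hypothetical feature sets (genre, language, director), not taken from DBpedia.
film_x = {"genre:animation", "language:Japanese", "director:A"}
film_y = {"genre:animation", "language:Japanese", "director:B"}
film_z = {"genre:science fiction", "language:English", "director:C"}

print(tversky(film_x, film_y))  # two shared features, one distinctive on each side
print(tversky(film_x, film_z))  # nothing shared
```

The weights `alpha` and `beta` make the measure asymmetric when they differ, which is one way "it depends" shows up in a feature-based view.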
Semantic
The notion of semantic networks is derived from the hierarchical semantic memory model [Collins & Quillian, 1969]
Semantic Similarity
Different techniques:
Word frequency: Latent semantic analysis (doesn’t actually use semantic net structure)
Rada (1989) - average shortest path length
Resnik (1999) - information content of the least common subsumer (lcs)
Unfortunately...
Word frequency: N/A
Often assumes hierarchical/tree structure of taxonomy/ontology. (Both Rada and Resnik assume the taxonomy is an is-A hierarchy)
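To make the two taxonomy-based ideas concrete, here is a rough sketch on a toy is-A tree. The concept names, the instance counts and all function names are hypothetical. The path measure counts edges through the lcs (edge counting in the spirit of Rada), and the second measure returns the information content of the lcs (in the spirit of Resnik). Both rely on the tree assumption noted above.

```python
import math

# Hypothetical is-A taxonomy: child -> parent (the root has parent None).
PARENT = {
    "Work": None,
    "Film": "Work",
    "Book": "Work",
    "AnimatedFilm": "Film",
    "LiveActionFilm": "Film",
}
# Hypothetical number of instances attached directly to each concept.
COUNT = {"Work": 0, "Film": 10, "Book": 50, "AnimatedFilm": 15, "LiveActionFilm": 25}


def ancestors(c):
    """Return [c, parent(c), ..., root]."""
    chain = []
    while c is not None:
        chain.append(c)
        c = PARENT[c]
    return chain


def lcs(a, b):
    """Least common subsumer: the deepest concept that subsumes both a and b."""
    anc_a = set(ancestors(a))
    for c in ancestors(b):
        if c in anc_a:
            return c
    raise ValueError("no common subsumer")


def path_length(a, b):
    """Number of is-A edges on the tree path between a and b."""
    c = lcs(a, b)
    return ancestors(a).index(c) + ancestors(b).index(c)


def information_content(c):
    """-log of the share of instances at or below c (assumes that share is non-zero)."""
    total = sum(COUNT.values())
    below = sum(n for k, n in COUNT.items() if c in ancestors(k))
    return -math.log(below / total)


def resnik(a, b):
    return information_content(lcs(a, b))


print(path_length("AnimatedFilm", "LiveActionFilm"), resnik("AnimatedFilm", "LiveActionFilm"))
print(path_length("AnimatedFilm", "Book"), resnik("AnimatedFilm", "Book"))
```

Neither function has an obvious meaning once a node can have many parents or typed links, which is the situation on DBpedia.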
Semantic Similarity
Remember, DBpedia is not as ‘neat’:
(Image source: http://www.visualdataweb.org/relfinder/)
On DBpedia/Wikipedia
Recent applications:
Gabrilovich & Markovitch (2007) - express text as a weighted vector of Wikipedia articles, Explicit Semantic Analysis (ESA)
Milne & Witten (2008) - the Wikipedia Link-based Measure - similarity of neighbours
Passant (2010) - Linked Data Semantic Distance ← uses paths!
Mirizzi et al. (2012) - use DBpedia for movie recommendation using a Vector Space Model
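As a rough illustration of a link-based distance, the sketch below follows the simplest (direct, unweighted) variant of Linked Data Semantic Distance as it is usually stated: one over one plus the number of direct links in each direction. The triples and function names are hypothetical, and the full measure also covers indirect and weighted links, which are not shown here.

```python
def direct_links(triples, a, b):
    """Number of distinct properties p for which the triple (a, p, b) exists."""
    return len({p for (s, p, o) in triples if s == a and o == b})


def ldsd_direct(triples, a, b):
    """Distance in (0, 1]: 1.0 means no direct link in either direction."""
    return 1.0 / (1 + direct_links(triples, a, b) + direct_links(triples, b, a))


# Hypothetical (subject, property, object) triples.
TRIPLES = {
    ("A", "p1", "B"),
    ("A", "p2", "B"),
    ("B", "p3", "A"),
    ("A", "p1", "C"),
}

print(ldsd_direct(TRIPLES, "A", "B"))  # three direct links in total
print(ldsd_direct(TRIPLES, "B", "C"))  # no direct link
```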
Similarity
Important:
Properties can be related to each other
type 1, e.g. influenced
node, e.g. director
type 2, e.g. collaborated with
node type 2, e.g. film
Network Similarity
Social Network Analysis
Established field - notions of influence, centrality, rank etc.
Often applied to small networks
Note: Ranking is often based on similarity
Network Similarity
Homogeneous network measures:
PageRank - Brin & Page (1998) - random surfer with teleportation
SimRank - Jeh & Widom (2002) - iteratively ‘inherits’ rank of neighbours
σact - Thiel & Berthold (2010) - node similarities from spreading activation with a decay factor
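A naive sketch of the SimRank iteration on a homogeneous graph, assuming the graph is given as a dict from each node to its list of in-neighbours. The graph, the decay constant `c = 0.8` and the iteration count are assumptions made for illustration. Two nodes are similar to the extent that their in-neighbours are similar, and every node has similarity 1 to itself.

```python
def simrank(in_links, c=0.8, iterations=10):
    """Return a dict mapping (u, v) to the SimRank score after a fixed number of rounds."""
    nodes = list(in_links)
    sim = {(u, v): 1.0 if u == v else 0.0 for u in nodes for v in nodes}
    for _ in range(iterations):
        new = {}
        for u in nodes:
            for v in nodes:
                if u == v:
                    new[(u, v)] = 1.0
                elif not in_links[u] or not in_links[v]:
                    new[(u, v)] = 0.0  # a node with no in-neighbours has nothing to inherit
                else:
                    total = sum(sim[(x, y)] for x in in_links[u] for y in in_links[v])
                    new[(u, v)] = c * total / (len(in_links[u]) * len(in_links[v]))
        sim = new
    return sim


# Hypothetical graph: "hub" links to both "x" and "y".
IN_LINKS = {"hub": [], "x": ["hub"], "y": ["hub"]}
scores = simrank(IN_LINKS)
print(scores[("x", "y")])
```

This version touches every pair of nodes in every round, so it is only practical on small graphs.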
Network Similarity
Heterogeneous network measures:
PathSim - Sun & Han (2009) - count instances of ‘meta-path’ (specific link pattern)
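A small sketch of the PathSim idea for one symmetric meta-path, here assumed to be film - category - film. The score is twice the number of meta-path instances between two films, normalised by the number of instances from each film back to itself. The film and category names and the function names are hypothetical.

```python
from collections import Counter


def path_count(links, x, y):
    """Number of meta-path instances x - category - y."""
    cx, cy = Counter(links[x]), Counter(links[y])
    return sum(cx[c] * cy[c] for c in cx)


def pathsim(links, x, y):
    denom = path_count(links, x, x) + path_count(links, y, y)
    return 2 * path_count(links, x, y) / denom if denom else 0.0


# Hypothetical film -> categories links.
LINKS = {
    "film1": ["catA", "catB"],
    "film2": ["catA", "catB", "catC"],
    "film3": ["catC"],
}

print(pathsim(LINKS, "film1", "film2"))
print(pathsim(LINKS, "film2", "film3"))
```

The meta-path is fixed by hand here; choosing or learning it is exactly the open question for DBpedia.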
Network Similarity
Applicability to DBpedia:
PageRank, SimRank - N/A - assume homogeneous links!
Spreading Activation - possible with constraints
Apply PathSim - but how to learn such meta-paths?
Another idea:
Count node-disjoint paths.
Why? View each path as one distinct ‘reason’.
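One possible reading of "spreading activation with constraints" is to let activation flow only along an allowed set of link types. The sketch below is a generic illustration under that assumption, not the σact measure of Thiel & Berthold. The triples, property names, decay value and step count are all hypothetical.

```python
def spread_activation(triples, seed, allowed, decay=0.5, steps=2):
    """Spread activation from seed along allowed properties, splitting energy evenly."""
    activation = {seed: 1.0}
    frontier = {seed: 1.0}
    for _ in range(steps):
        nxt = {}
        for node, energy in frontier.items():
            targets = [o for (s, p, o) in triples if s == node and p in allowed]
            for t in targets:
                nxt[t] = nxt.get(t, 0.0) + decay * energy / len(targets)
        for n, e in nxt.items():
            activation[n] = activation.get(n, 0.0) + e
        frontier = nxt
    return activation


# Hypothetical (subject, property, object) triples.
TRIPLES = [
    ("f1", "director", "d1"),
    ("f1", "externalLink", "x"),
    ("d1", "directorOf", "f2"),
    ("d1", "directorOf", "f3"),
]

act = spread_activation(TRIPLES, "f1", allowed={"director", "directorOf"})
print(act)
```

Nodes that end up with high activation can be read as similar to the seed, and the allowed set is where the heterogeneity of DBpedia links has to be handled.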
Similarity
         Totoro  GITS  Matrix
Totoro       44     1       0
GITS          1    35       2
Matrix        0     2      58
Totoro – GITS
Category:Anime films
GITS – Matrix
Category:Brain-computer interfacing in fiction
Matrix → Category:The Matrix (franchise) → Category:Media franchises ← GITS
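The off-diagonal entries above appear to be node-disjoint path counts. A minimal sketch of how such counts could be computed: split every node into an "in" and an "out" half joined by a unit-capacity arc, then run a unit-capacity max flow with BFS augmenting paths. The toy graph contains only the paths listed above and treats links as undirected. The function name and this setup are assumptions, not the method actually used for the table.

```python
from collections import defaultdict, deque


def count_node_disjoint_paths(edges, source, target):
    """Max number of source-target paths that share no intermediate node."""
    cap = defaultdict(int)  # residual capacities on the split graph
    adj = defaultdict(set)

    def add_arc(u, v):
        cap[(u, v)] += 1
        adj[u].add(v)
        adj[v].add(u)  # reverse arc, needed for the residual graph

    for n in {n for e in edges for n in e}:
        add_arc((n, "in"), (n, "out"))  # each node can carry at most one path
    for u, v in edges:  # undirected edge: usable in both directions
        add_arc((u, "out"), (v, "in"))
        add_arc((v, "out"), (u, "in"))

    # Flow leaves from the source's out half and arrives at the target's in half,
    # so the unit node capacity never restricts the two endpoints themselves.
    s, t = (source, "out"), (target, "in")
    flow = 0
    while True:
        parent = {s: None}
        queue = deque([s])
        while queue and t not in parent:
            u = queue.popleft()
            for v in adj[u]:
                if v not in parent and cap[(u, v)] > 0:
                    parent[v] = u
                    queue.append(v)
        if t not in parent:
            return flow
        v = t
        while parent[v] is not None:  # push one unit along the path found
            u = parent[v]
            cap[(u, v)] -= 1
            cap[(v, u)] += 1
            v = u
        flow += 1


EDGES = [
    ("Totoro", "Category:Anime films"),
    ("GITS", "Category:Anime films"),
    ("GITS", "Category:Brain-computer interfacing in fiction"),
    ("Matrix", "Category:Brain-computer interfacing in fiction"),
    ("Matrix", "Category:The Matrix (franchise)"),
    ("Category:The Matrix (franchise)", "Category:Media franchises"),
    ("GITS", "Category:Media franchises"),
]

print(count_node_disjoint_paths(EDGES, "Totoro", "GITS"))
print(count_node_disjoint_paths(EDGES, "GITS", "Matrix"))
```

On this toy graph Totoro and Matrix are still connected through GITS, so the unconstrained count for that pair is 1 rather than the 0 in the table. The slides do not state which constraint (for example a path-length limit) produced the 0.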
Similarity
How similar are the following films:
Answer: it still depends - on the path you take
Summary
Similarity, useful concept in many areas, hard to define
how are films similar?
DBpedia, richly linked KB
film information available here
→ Problem: How to define similarity on DBpedia?
Past methods - don’t exploit linkedness
Network analysis methods can aid this
Test trial with node-disjoint paths: GITS is more similar to Matrix than to Totoro
Ongoing/Future Work
Mining DBpedia as Network
Analyse structured and related data
Similarity as complement to – reasoning, retrieval, querying
Also useful in NLP, recommender systems, knowledge discovery
→ Examples: work we do in UIMR
Ioana Hulpus (2011/2012)
Graph-based topic analysis with the support of Linked Data
Benjamin Heitmann (2011/2012)
Spreading activation for cross-domain recommendation
Challenges/Discussion
Challenges:
Topology of DBpedia graph
Standard SNA measures for homogeneous networks, e.g. density, degree distribution - how to apply to DBpedia?
What does a path actually mean?
Which subgraphs to use?
How do metrics vary with different subgraphs, e.g. different ontologies/categories?
Scalability (not a problem, but a challenge)
Evaluation - how do we confirm something is similar?
Thanks for listening! Questions/Suggestions?