10/22/13 Heiko Paulheim 1
DBpediaNYD – A Silver Standard Benchmark Dataset for Semantic Relatedness in DBpedia
Heiko Paulheim
Motivation
• There are quite a few approaches to entity ranking/statement weighting on Linked Data
– and DBpedia in particular
• Examples:
– Franz et al. (2009) – Tensor Decomposition
– Meij et al. (2009) – Machine Learning
– Mirizzi et al. (2010) – Web Search Engines
– Mulay and Kumar (2011) – Machine Learning
– Hees et al. (2012) – Crowd Sourcing
– Nunes et al. (2012) – Social Network Analysis
Motivation
• However,
– none of those have been competitively evaluated
– none of those have been evaluated at large scale
• Evaluation with
– small private data sets
– user studies
• Approaches using Machine Learning
– require training data
– which is expensive to obtain
The Dataset
• Large-scale dataset (several thousand instances)
– statements with strengths
• Strength value: Normalized Google Distance
• f(x): number of search results containing x
• f(x,y): number of search results containing both x and y
• M: number of pages in search engine index
• NGD has been shown to correlate with human judgments of association strength
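The formula itself is not preserved in the slide text. As a minimal sketch, the function below uses the standard definition of Normalized Google Distance by Cilibrasi and Vitányi, expressed with the symbols f(x), f(y), f(x,y) and M listed above. The function name `ngd` and the hit counts in the usage note are illustrative assumptions, not values from the dataset.

```python
import math


def ngd(fx: int, fy: int, fxy: int, m: int) -> float:
    """Symmetric Normalized Google Distance from raw hit counts.

    fx, fy -- number of search results containing x, respectively y
    fxy    -- number of search results containing both x and y
    m      -- number of pages in the search engine index

    NGD(x, y) = (max(log f(x), log f(y)) - log f(x,y))
                / (log M - min(log f(x), log f(y)))
    """
    if min(fx, fy, fxy) <= 0:
        raise ValueError("hit counts must be positive to take logarithms")
    log_fx, log_fy, log_fxy = math.log(fx), math.log(fy), math.log(fxy)
    return (max(log_fx, log_fy) - log_fxy) / (math.log(m) - min(log_fx, log_fy))
```

With made-up counts, `ngd(1000, 100, 100, 10**6)` evaluates to roughly 0.25, and two terms that always occur together, e.g. `ngd(1000, 1000, 1000, 10**6)`, have distance 0. Smaller values mean a stronger association.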
The Dataset
• NGD is a symmetric value
– NYD dataset also contains asymmetric values
• Asymmetric Normalized Google Distance
• f(x): number of search results containing x
• f(x,y): number of search results containing both x and y
• M: number of pages in search engine index
Constructing the Dataset
• We sampled 10,000 statements
– with DBpedia resources as subject and object (e.g., no type statements, no literals)
– with dbpedia or dbpprop predicate
• ...and computed symmetric/asymmetric NGD
– using the labels as search strings
– using Yahoo BOSS
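The selection criteria above could be sketched roughly as follows. This is not the original sampling code: it assumes triples are available as tuples of URI strings, and it assumes that the "dbpedia" predicate prefix refers to the DBpedia ontology namespace. The names `is_candidate` and `sample_statements` are hypothetical.

```python
import random

# Namespace URIs; reading the slide's "dbpedia" predicate prefix as the
# DBpedia ontology namespace is an assumption.
RESOURCE_NS = "http://dbpedia.org/resource/"
PREDICATE_NS = (
    "http://dbpedia.org/ontology/",
    "http://dbpedia.org/property/",
)


def is_candidate(subject: str, predicate: str, obj: str) -> bool:
    """Keep statements that link two DBpedia resources via a DBpedia predicate.

    This drops rdf:type statements (foreign predicate namespace) and
    statements whose object is a literal rather than a resource URI.
    """
    return (
        subject.startswith(RESOURCE_NS)
        and obj.startswith(RESOURCE_NS)
        and predicate.startswith(PREDICATE_NS)
    )


def sample_statements(triples, k, seed=0):
    """Draw up to k candidate statements uniformly at random."""
    candidates = [t for t in triples if is_candidate(*t)]
    rng = random.Random(seed)  # fixed seed keeps the sample reproducible
    return rng.sample(candidates, min(k, len(candidates)))
```

For the dataset described here, `k` would be 10,000. The labels of the sampled subjects and objects would then serve as search strings.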
The Dataset
• Random sample of 10,000 statements
– i.e., 30,000 search engine calls (80c/1,000 → 24 USD)
• 3,058 pairs of resources had to be discarded
– f(x)<f(x,y) or f(y)<f(x,y)
– search engines sometimes don't count properly :-(
• Result:
– 6,942 weighted statements (symmetric)
– 13,884 weighted statements (asymmetric)
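The arithmetic on this slide can be retraced with a short sketch. It assumes that the three calls per statement are the queries for f(x), f(y) and f(x,y), and that each kept pair yields one symmetric and two directed weights. The helper names are hypothetical.

```python
CALLS_PER_STATEMENT = 3         # assumed: one query each for f(x), f(y), f(x,y)
PRICE_CENTS_PER_1000_CALLS = 80


def query_cost(n_statements: int):
    """Return (number of search calls, total cost in USD)."""
    calls = n_statements * CALLS_PER_STATEMENT
    cost_usd = calls * PRICE_CENTS_PER_1000_CALLS / 100_000
    return calls, cost_usd


def is_consistent(fx: int, fy: int, fxy: int) -> bool:
    """A pair is usable only if the joint count does not exceed either single count.

    Pages containing both x and y are a subset of the pages containing x,
    so f(x) < f(x,y) or f(y) < f(x,y) signals an unreliable hit count.
    """
    return fx >= fxy and fy >= fxy


calls, cost_usd = query_cost(10_000)   # 30,000 calls, 24 USD
symmetric = 10_000 - 3_058             # pairs left after discarding
asymmetric = 2 * symmetric             # one weight per direction
```

This reproduces the 6,942 symmetric and 13,884 asymmetric weighted statements reported above.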
The Dataset
• Example:
– dbpedia:John_Lennon and dbpedia:Yoko_Ono
• Distances:
– symmetric: 0.18
– John Lennon → Yoko Ono: 0.18
– Yoko Ono → John Lennon: 0.03
• Explanation:
– Yoko Ono is famous for being John Lennon's wife
• and most often mentioned in that context
– John Lennon is more famous for being a member of the Beatles
Example: the DBpedia FindRelated Service
• We trained two regression SVMs (LibSVM) based on DBpediaNYD
– one for symmetric, one for asymmetric
– service allows for finding the most related among the linked resources
• Example results:
• http://wiki.dbpedia.org/FindRelated
Conclusion and Outlook
• DBpediaNYD allows for large-scale evaluation
– rather a silver standard
– does not replace manually created gold standards
• Future work
– validate DBpediaNYD with users
– compare search engines
Something Completely Different
• Challenges enumerated in the workshop intro this morning
– “Logical inference on noisy data”
• Talk on “Type Inference on Noisy RDF Data”
– Was actually applied for DBpedia 3.9
– Friday, 3:15, Bayside 204A