DBpediaNYD - A Silver Standard Benchmark Dataset for Semantic Relatedness in DBpedia

10/22/13 Heiko Paulheim


Motivation

• There are quite a few approaches to entity ranking/statement weighting on Linked Data

– and DBpedia in particular

• Examples:

– Franz et al. (2009) – Tensor Decomposition

– Meij et al. (2009) – Machine Learning

– Mirizzi et al. (2010) – Web Search Engines

– Mulay and Kumar (2011) – Machine Learning

– Hees et al. (2012) – Crowd Sourcing

– Nunes et al. (2012) – Social Network Analysis


Motivation

• However,

– none of those have been competitively evaluated

– none of those have been evaluated at large scale

• Evaluation with

– small private data sets

– user studies

• Approaches using Machine Learning

– require training data

– which is expensive to obtain


The Dataset

• Large-scale dataset (several thousand instances)

– statements with strengths

• Strength value: Normalized Google Distance

– NGD(x,y) = (max{log f(x), log f(y)} - log f(x,y)) / (log M - min{log f(x), log f(y)})

• f(x): number of search results containing x

• f(x,y): number of search results containing both x and y

• M: number of pages in search engine index

• NGD has been shown to correlate with human strength associations
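
As a minimal sketch, the symmetric distance can be computed from the three hit counts and the index size. The function below assumes the standard NGD definition by Cilibrasi and Vitányi; the function name, argument names, and the sample counts are illustrative and do not come from the dataset.

```python
import math


def ngd(fx: int, fy: int, fxy: int, m: int) -> float:
    """Symmetric Normalized Google Distance from raw hit counts.

    fx, fy: number of search results containing x (resp. y)
    fxy:    number of search results containing both x and y
    m:      number of pages in the search engine index
    """
    if min(fx, fy, fxy) <= 0:
        raise ValueError("all hit counts must be positive")
    log_fx, log_fy, log_fxy = math.log(fx), math.log(fy), math.log(fxy)
    return (max(log_fx, log_fy) - log_fxy) / (math.log(m) - min(log_fx, log_fy))


# Hypothetical counts, for illustration only (not taken from DBpediaNYD).
print(ngd(fx=1_000_000, fy=200_000, fxy=150_000, m=10_000_000_000))
```

A value of 0 means the two terms always occur together; larger values mean the terms are less strongly associated.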


The Dataset

• NGD is a symmetric value

– NYD dataset also contains asymmetric values

• Asymmetric Normalized Google Distance

• f(x): number of search results containing x

• f(x,y): number of search results containing both x and y

• M: number of pages in search engine index


Constructing the Dataset

• We sampled 10,000 statements

– with DBpedia resources as subject and object (e.g., no type statements, no literals)

– with a dbpedia or dbpprop predicate

• ...and computed symmetric/asymmetric NGD

– using the labels as search strings

– using Yahoo BOSS
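
The selection criteria above can be sketched as a simple filter over triples. This is not the author's sampling code: it assumes that "dbpedia" and "dbpprop" refer to the DBpedia ontology and property namespaces, that triples are given as plain string tuples, and the helper names `is_candidate` and `sample_statements` are hypothetical.

```python
import random

# Assumed namespace URIs for resources and for the two predicate vocabularies.
RESOURCE_NS = "http://dbpedia.org/resource/"
PREDICATE_NS = ("http://dbpedia.org/ontology/", "http://dbpedia.org/property/")


def is_candidate(subject: str, predicate: str, obj: str) -> bool:
    """Keep statements that link two DBpedia resources via a DBpedia predicate."""
    return (
        subject.startswith(RESOURCE_NS)
        and obj.startswith(RESOURCE_NS)  # rules out literals and class URIs
        and predicate.startswith(PREDICATE_NS)  # rules out rdf:type etc.
    )


def sample_statements(triples, k, seed=0):
    """Draw up to k candidate statements uniformly at random."""
    candidates = [t for t in triples if is_candidate(*t)]
    return random.Random(seed).sample(candidates, min(k, len(candidates)))


# Illustrative triples only.
triples = [
    ("http://dbpedia.org/resource/John_Lennon",
     "http://dbpedia.org/ontology/spouse",
     "http://dbpedia.org/resource/Yoko_Ono"),
    ("http://dbpedia.org/resource/John_Lennon",
     "http://www.w3.org/1999/02/22-rdf-syntax-ns#type",
     "http://dbpedia.org/ontology/Person"),
    ("http://dbpedia.org/resource/John_Lennon",
     "http://dbpedia.org/property/birthDate",
     '"1940-10-09"'),
]
print(sample_statements(triples, k=10))
```

Only the first triple survives the filter: the second is a type statement and the third has a literal as its object.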


The Dataset

• Random sample of 10,000 statements

– i.e., 30,000 search engine calls (at 0.80 USD per 1,000 calls: 24 USD)

• 3,058 pairs of resources had to be discarded

– f(x) < f(x,y) or f(y) < f(x,y)

– search engines sometimes don't count properly :-(

• Result:

– 6,942 weighted statements (symmetric)

– 13,884 weighted statements (asymmetric)
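
The figures on this slide can be re-derived with a few lines of arithmetic. The sketch below assumes that each statement needs three queries (for f(x), f(y), and f(x,y)), which matches the 30,000 calls reported, and that each kept pair yields one weight per direction. The names are illustrative.

```python
def is_consistent(fx: int, fy: int, fxy: int) -> bool:
    """A joint count can never exceed either individual count."""
    return fxy <= fx and fxy <= fy


sampled = 10_000
queries = 3 * sampled              # assumed: f(x), f(y), f(x,y) per statement
cost_cents = queries * 80 // 1000  # 80 cents per 1,000 calls

discarded = 3_058                  # pairs with f(x) < f(x,y) or f(y) < f(x,y)
symmetric = sampled - discarded    # one weight per remaining pair
asymmetric = 2 * symmetric         # one weight per direction

print(queries, cost_cents / 100, symmetric, asymmetric)
```

A pair is discarded exactly when `is_consistent` returns `False` for its reported counts.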


The Dataset

• Example:

– dbpedia:John_Lennon and dbpedia:Yoko_Ono

• Distances:

– symmetric: 0.18

– John Lennon → Yoko Ono: 0.18

– Yoko Ono → John Lennon: 0.03

• Explanation:

– Yoko Ono is famous for being John Lennon's wife

• and most often mentioned in that context

– John Lennon is more famous for being a member of the Beatles


Example: the DBpedia FindRelated Service

• We trained two regression SVMs (LibSVM) based on DBpediaNYD

– one for symmetric, one for asymmetric

– service allows for finding the most related among the linked resources

• Example results:

• http://wiki.dbpedia.org/FindRelated


Conclusion and Outlook

• DBpediaNYD allows for large-scale evaluation

– rather a silver standard

– does not replace manually created gold standards

• Future work

– validate DBpediaNYD with users

– compare search engines


Something Completely Different

• Challenges enumerated in the workshop intro this morning

– “Logical inference on noisy data”

• Talk on “Type Inference on Noisy RDF Data”

– Was actually applied to DBpedia 3.9

– Friday, 3:15, Bayside 204A

