Date post: | 18-Jan-2016 |
Linked Data Profiling
Andrejs Abele
National University of Ireland, Galway
Supervisor: Paul Buitelaar
Overview
• Terminology
• Motivation
• My approach
• Evaluation
• Conclusion
• Future work
Terminology
• Linked Data is about using the Web to connect related data that was not previously linked.
• The Resource Description Framework (RDF) represents data as sets of subject-predicate-object triples, whose elements may be URIs or literals.
  Example: https://www.insight-centre.org/users/andrejs-ābele foaf:name “Andrejs Ābele”
• The Linked Open Data (LOD) Cloud is a collection of Linked Data resources that are open and freely available.
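As a minimal sketch, the example triple from the terminology can be held in a plain Python structure. The `Triple` type and the `literals` helper are illustrative names, not part of any RDF library, and treating a leading double quote as the marker of a literal is an assumption made for this example only.

```python
from typing import NamedTuple


class Triple(NamedTuple):
    subject: str    # a URI
    predicate: str  # a URI
    obj: str        # a URI or a literal


# FOAF namespace; the prefixed name foaf:name expands to FOAF + "name".
FOAF = "http://xmlns.com/foaf/0.1/"

triples = [
    Triple(
        "https://www.insight-centre.org/users/andrejs-ābele",
        FOAF + "name",
        '"Andrejs Ābele"',
    ),
]


def literals(graph):
    """Return the objects that are literals (assumed here to start with a quote)."""
    return [t.obj for t in graph if t.obj.startswith('"')]
```

Literals matter later in the talk, because the DBpedia-based method only works on datasets that contain them.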
Linked Open Data Cloud Diagram
[Figure: LOD Cloud Diagram, with datasets grouped into the domains Publications, Life Sciences, Cross-Domain, Social Networking, Geographic, Government, Media, User-Generated Content, and Linguistics]
Motivation
• Linked Data is hard for humans to understand
• Only a small number of datasets provide a human-readable overview or comprehensive metadata
• When adding a new dataset to the LOD cloud, connections have to be identified to as many other relevant LOD datasets as possible
• The LOD Cloud Diagram relies on human classification
Existing solutions for LD profiling
• Loupe [1]
• ProLOD++ [2]
• LOD Laundromat [3]
• LODStats [4]
• Aether [5]
• RDF-stats [6]
[1] http://demo.seco.tkk.fi/aether/#/
[2] https://www.hpi.uni-potsdam.de/naumann/sites/prolod++/#
[3] http://lodlaundromat.org/
[4] http://stats.lod2.eu/
[5] http://demo.seco.tkk.fi/aether/#/
[6] http://rdfstats.sourceforge.net/
Domain identification method using DBpedia
Topic Extraction → Domain Identification → Domain
Example
• Input: Bio2RDF-sgd
• Description: The Saccharomyces Genome Database (SGD) collects and organizes information about the molecular biology and genetics of the yeast Saccharomyces cerevisiae
1. Most frequent terms (sgd_vocabulary, query, proper, phenotype, experiment)
2. Literal containing one of the terms ("protein [sgd_vocabulary:protein]@en")
3. Identify DBpedia concept (http://dbpedia.org/resource/Protein)
4. Identify Category (http://dbpedia.org/resource/Category:Molecular_biology)
5. Identify the domain under which the category fits best (Biology => Life Sciences)
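The five steps of the example can be sketched roughly as follows. This is only an illustration under strong assumptions: the three lookup tables are toy stand-ins for queries against DBpedia, the term tokenizer and the label extraction (text before the bracketed vocabulary tag) are guesses at the preprocessing, and the category is mapped straight to the LOD domain, skipping the intermediate "Biology" step. All function and variable names are hypothetical.

```python
import re
from collections import Counter

# Toy stand-ins for DBpedia lookups (assumptions; a real system would query DBpedia).
CONCEPTS = {"protein": "http://dbpedia.org/resource/Protein"}
CATEGORIES = {
    "http://dbpedia.org/resource/Protein":
        "http://dbpedia.org/resource/Category:Molecular_biology",
}
DOMAINS = {
    "http://dbpedia.org/resource/Category:Molecular_biology": "Life Sciences",
}


def frequent_terms(literals, top_n=5):
    """Step 1: the most frequent lower-cased terms across all literals."""
    counts = Counter(
        word for lit in literals for word in re.findall(r"[a-z_]+", lit.lower())
    )
    return [word for word, _ in counts.most_common(top_n)]


def identify_domain(literals, top_n=5):
    """Steps 2-5: vote for a domain using literals that contain a frequent term."""
    terms = frequent_terms(literals, top_n)
    votes = Counter()
    for lit in literals:
        if not any(term in lit.lower() for term in terms):  # step 2
            continue
        label = lit.split("[")[0].strip().lower()  # text before the vocabulary tag
        concept = CONCEPTS.get(label)              # step 3
        category = CATEGORIES.get(concept)         # step 4
        domain = DOMAINS.get(category)             # step 5
        if domain:
            votes[domain] += 1
    return votes.most_common(1)[0][0] if votes else None


sample = [
    "protein [sgd_vocabulary:protein]@en",
    "phenotype [sgd_vocabulary:phenotype]@en",
]
print(identify_domain(sample))
```

Because every step is a lookup rather than a learned model, a sketch like this needs no training data, but it returns nothing for a dataset without literals.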
Datasets
LOD cloud datasets (annotated in LOD Cloud Diagram): 405 datasets, 9 domains
• Media (13)
• Linguistics (34)
• Publications (111)
• Social Networking (41)
• Geography (29)
• Government (65)
• Cross Domain (25)
• User Generated (52)
• Life Sciences (35)
Baseline approach
1. Extract URIs of properties and classes from datasets
2. Use classes and properties as features
3. Classify using Support Vector Machine classifier
4. Use Precision and Recall as metrics
Extended baseline
Enrich the data with human-annotated tags from Linked Open Vocabularies (http://lov.okfn.org/dataset/lov/)
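The first two baseline steps, turning the class and property URIs of each dataset into features, might look like the sketch below. Binary presence/absence encoding is an assumption (the slides do not say whether counts or weights were used), the two toy datasets are invented for illustration, and the helper names are hypothetical. The resulting vectors would then be handed to an SVM implementation from a machine-learning library.

```python
def build_vocabulary(datasets):
    """Map every class/property URI seen in any dataset to a column index."""
    uris = sorted({uri for dataset in datasets for uri in dataset})
    return {uri: index for index, uri in enumerate(uris)}


def to_feature_vector(dataset_uris, vocabulary):
    """Binary vector: 1 if the dataset uses the URI, 0 otherwise."""
    vector = [0] * len(vocabulary)
    for uri in dataset_uris:
        if uri in vocabulary:
            vector[vocabulary[uri]] = 1
    return vector


# Toy input: each dataset is the set of class and property URIs it uses.
datasets = [
    {"http://xmlns.com/foaf/0.1/Person", "http://xmlns.com/foaf/0.1/name"},
    {"http://purl.org/dc/terms/title", "http://xmlns.com/foaf/0.1/name"},
]

vocabulary = build_vocabulary(datasets)
X = [to_feature_vector(dataset, vocabulary) for dataset in datasets]
```

URIs that are not in the vocabulary are ignored, so a dataset built on an unseen, customized vocabulary yields an all-zero vector under this encoding.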
Precision and Recall for different domains using SVM
[Figure: bar chart of Precision and Recall (y-axis 0% to 100%) for each domain: Media, Linguistics, Publications, Social networking, Geography, Government, Cross domain, User generated, Life sciences]
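For reference, per-domain precision and recall can be computed from true and predicted labels as in this small sketch. The label lists are made up for illustration and are not results from the evaluation; returning 0.0 when a denominator is zero is a convention chosen here, not something stated in the slides.

```python
def precision_recall(true_labels, predicted_labels, domain):
    """Precision and recall for one domain, treating it as the positive class."""
    pairs = list(zip(true_labels, predicted_labels))
    tp = sum(1 for t, p in pairs if t == domain and p == domain)
    fp = sum(1 for t, p in pairs if t != domain and p == domain)
    fn = sum(1 for t, p in pairs if t == domain and p != domain)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall


# Invented labels, for illustration only.
true_labels = ["Media", "Media", "Government", "Media"]
predicted_labels = ["Media", "Government", "Government", "Media"]

for domain in ("Media", "Government"):
    print(domain, precision_recall(true_labels, predicted_labels, domain))
```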
Correctly Classified Instances
[Figure: bar chart of correctly classified instances (y-axis 0 to 0.8) for the feature sets Classes, Properties, and Classes + Properties, with series From Dataset, Dataset+LOV, and LOV]
Conclusion
• Does not require training
• Works with new and customized vocabularies
• Works only if datasets contain literals
• Cannot identify User-Generated Content and Cross-Domain
• Using only classes and properties, it is hard to improve results above 75%
Future Work
• Evaluate alternative classification algorithms
• Use Literals and URIs for classification
• Classify datasets in more specific subdomains
Thank you!