WWW.LEDS-PROJEKT.DE
LEDS
KNOWLEDGE EXTRACTION FROM HETEROGENEOUS
SEMI-STRUCTURED DATA SOURCES
MARTIN SEIDEL, MICHAEL KRUG, FRANK BURIAN, MARTIN GAEDKE
12. September 2016
CURRENT SITUATION
• knowledge in the Web often only available as weakly interlinked, heterogeneous, semi-structured data
→ no semantic classification
• how to link or merge data?
• how to do semantic queries?
→ not usable in a meaningful way
GOAL
Extraction of knowledge from semi-structured data
• knowledge in terms of semantic metadata
• semantically enriched data can then utilize the potential of Linked Data
→ provide an automatic process
THE KESEDA APPROACH
• Especially designed to work on JSON data
• Challenges when working with JSON data
→ no schema, only name-value pairs
→ any structure and depth possible
{
  "id": "krug",
  "firstName": "Michael",
  "lastName": "Krug",
  "title": "Dipl.-Inf.",
  "phone": "+49 371 531 39929",
  "email": "[email protected]",
  [...]
}
{
  "id": "2015-007",
  "title": "SmartComposition: ...",
  "author": [ "Michael Krug", "Martin Gaedke" ],
  "year": "2015",
  "type": "Conference Paper",
  "event": {
    "name": "24th International World Wide Web Conference",
    "url": "http://www.www2015.it/"
  },
  [...]
}
Arrays: "author"
Objects: "event"
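The two examples show why arrays and nested objects need separate handling when no schema is available. A minimal sketch of walking a JSON document of arbitrary depth and recording the kind of every node; it is written in Python rather than the prototype's Node.js, and the function name `walk` and the path notation are illustrative assumptions, not part of KESEDA:

```python
import json

def walk(node, path="$"):
    """Yield (path, kind) for every node in a parsed JSON document."""
    if isinstance(node, dict):
        yield path, "object"
        for key, value in node.items():
            yield from walk(value, f"{path}.{key}")
    elif isinstance(node, list):
        yield path, "array"
        for index, item in enumerate(node):
            yield from walk(item, f"{path}[{index}]")
    else:
        yield path, "value"

# Shortened version of the publication example from the slides.
doc = json.loads("""
{
  "id": "2015-007",
  "author": ["Michael Krug", "Martin Gaedke"],
  "event": {"name": "24th International World Wide Web Conference"}
}
""")

kinds = dict(walk(doc))
for path, kind in kinds.items():
    print(f"{kind:6} {path}")
```

Because the walk is recursive, it copes with any nesting depth without knowing the structure in advance.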
• multi-step algorithm
• work in existing JSON structure
• find and store various matches with different weights
• use additional information sources like API descriptions
• assign classes to objects with multiple properties
• link detected entities
1. Differentiation of input sources / formats
2. Preparation of data structure
3. Analysis of property labels
4. Analysis of property values
5. Mapping of classes
6. Generate JSON-LD document
7. Evaluation of results
PROTOTYPE
• prototype implemented in Node.js
• working with properties and classes from:
  • schema.org
  • foaf
  • dublincore
  • goodrelations
  • music ontology
• dictionaries for: first & last names, cities, streets, languages
• list of manually curated synonyms
• option to provide pre-defined mappings
• Web interface for
  • pre-configuration
    • mappings, synonyms, dictionaries
  • data upload
  • result analysis
    • statistics and browsing
[Figure: prototype, CONFIGURATION]
[Figure: prototype, RESULTS]
EVALUATION
Algorithm applied to datasets of
1) JSON array of people
2) JSON array of publications
a) Without custom pre-configuration
b) With custom pre-configuration
Initial Setup
• dictionary and structure pattern matching
• label → predicate string matching
• classes and properties: schema.org, foaf, dublincore, goodrelations
Custom Pre-Configuration
• set of label → predicate mappings (hand-picked for data context)
• list of known synonyms
• more structure patterns
1A) PEOPLE W/O CONFIG
1B) PEOPLE W/ CONFIG
2A) PUBLICATIONS W/O CONFIG
2B) PUBLICATIONS W/ CONFIG
SUMMARY
➙ Approach for extracting knowledge from semi-structured data
➙ by applying a multi-step algorithm
➙ to convert JSON data to RDF
➙ that assigns known classes to objects and maps their properties to S-P-O triples
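To make the S-P-O idea concrete, the following Python sketch flattens an already classified and mapped object into subject-predicate-object triples. The blank-node labels, the `rdf:type` shorthand, and the function name `to_triples` are assumptions for illustration; the slides do not describe how the prototype serializes its triples.

```python
from itertools import count

SCHEMA = "http://schema.org/"

def to_triples(entity, ids=None):
    """Flatten a nested, already-mapped entity into (subject, predicate, object) triples.

    Returns the blank node label of the entity and the list of triples.
    """
    ids = ids if ids is not None else count()
    subject = f"_:b{next(ids)}"
    result = []
    for key, value in entity.items():
        if key == "@type":
            result.append((subject, "rdf:type", SCHEMA + value))
            continue
        values = value if isinstance(value, list) else [value]
        for item in values:
            if isinstance(item, dict):
                # Nested objects become their own subject, linked from the parent.
                child_subject, child_triples = to_triples(item, ids)
                result.append((subject, SCHEMA + key, child_subject))
                result.extend(child_triples)
            else:
                result.append((subject, SCHEMA + key, item))
    return subject, result

person = {"@type": "Person", "givenName": "Michael", "familyName": "Krug"}
root, spo = to_triples(person)
for triple in spo:
    print(triple)
```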
OPEN CHALLENGES
• detect and reuse JSON structure patterns
• disambiguate values
• apply quality control to results
• improve scalability for large datasets
• research application of machine learning
THANK YOU
[email protected]
VSR.INFORMATIK.TU-CHEMNITZ.DE
THE KESEDA APPROACH
1. Differentiation of input sources / formats
• text, file, URL, API
• check for format
• optional conversion of XML to JSON
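A rough Python sketch of this first step, assuming the input has already been read into a string (from text, a file, a URL, or an API). The helper names `detect_format`, `xml_to_json`, and `load` are hypothetical, and the XML conversion is deliberately simplified: attributes are ignored and repeated child tags become lists.

```python
import json
import xml.etree.ElementTree as ET

def detect_format(text):
    """Guess the format of raw input text: 'json', 'xml', or 'unknown'."""
    stripped = text.lstrip()
    if stripped.startswith(("{", "[")):
        try:
            json.loads(stripped)
            return "json"
        except json.JSONDecodeError:
            return "unknown"
    if stripped.startswith("<"):
        try:
            ET.fromstring(stripped)
            return "xml"
        except ET.ParseError:
            return "unknown"
    return "unknown"

def xml_to_json(element):
    """Convert an XML element to nested dicts; repeated child tags become lists."""
    children = list(element)
    if not children:
        return (element.text or "").strip()
    result = {}
    for child in children:
        value = xml_to_json(child)
        if child.tag in result:
            if not isinstance(result[child.tag], list):
                result[child.tag] = [result[child.tag]]
            result[child.tag].append(value)
        else:
            result[child.tag] = value
    return result

def load(text):
    """Return parsed JSON data, converting XML input when necessary."""
    kind = detect_format(text)
    if kind == "json":
        return json.loads(text)
    if kind == "xml":
        root = ET.fromstring(text.lstrip())
        return {root.tag: xml_to_json(root)}
    raise ValueError("unsupported input format")

xml_input = "<person><firstName>Michael</firstName><lastName>Krug</lastName></person>"
print(load(xml_input))
```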
2. Preparation of data structure
• pre-process JSON tree to store matches and mappings
• keep original structure to preserve hierarchy for later relations
• detect arrays and objects for separate processing
• clean up: remove empty entries
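One way to picture this preparation step, sketched in Python: empty entries are pruned first, then every node is wrapped so that matches can be stored next to the original value without changing the hierarchy. The wrapper layout (`kind`, `matches`, `children`) is an assumption made for this example, not the prototype's internal format.

```python
EMPTY = (None, "", [], {})

def clean(node):
    """Recursively drop empty entries while keeping the original hierarchy."""
    if isinstance(node, dict):
        cleaned = {key: clean(value) for key, value in node.items()}
        return {key: value for key, value in cleaned.items() if value not in EMPTY}
    if isinstance(node, list):
        cleaned = [clean(value) for value in node]
        return [value for value in cleaned if value not in EMPTY]
    return node

def annotate(node):
    """Wrap each node so later steps can attach matches to it."""
    if isinstance(node, dict):
        return {"kind": "object", "matches": [],
                "children": {key: annotate(value) for key, value in node.items()}}
    if isinstance(node, list):
        return {"kind": "array", "matches": [],
                "children": [annotate(value) for value in node]}
    return {"kind": "value", "matches": [], "value": node}

raw = {
    "id": "krug",
    "title": "",
    "phone": None,
    "tags": [],
    "event": {"name": "24th International World Wide Web Conference", "url": ""},
}
tree = annotate(clean(raw))
print(sorted(tree["children"]))
```

Cleaning before annotating means an object that only contained empty values disappears as well, since it is empty after its children are pruned.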
3. Analysis of property labels
• string matching (substrings, prefixes, …)
• synonyms
• pre-defined mappings
• use metadata from API description, if available
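The label analysis could look roughly like the Python sketch below. The predicate list is a small excerpt of schema.org property names, while the synonym table, the pre-defined mapping, and all weights are invented for illustration; the slides do not give the prototype's actual values or matching order.

```python
# Hypothetical configuration: excerpt of known predicates plus made-up weights.
PREDICATES = ["givenName", "familyName", "email", "telephone", "name", "url"]
SYNONYMS = {"firstname": "givenName", "lastname": "familyName", "phone": "telephone"}
MAPPINGS = {"id": "identifier"}  # pre-defined label -> predicate mappings
WEIGHTS = {"mapping": 1.0, "exact": 0.9, "synonym": 0.8, "prefix": 0.6, "substring": 0.4}

def match_label(label):
    """Return (predicate, method, weight) candidates for a property label, best first."""
    key = label.lower()
    found = []
    if key in MAPPINGS:
        found.append((MAPPINGS[key], "mapping", WEIGHTS["mapping"]))
    if key in SYNONYMS:
        found.append((SYNONYMS[key], "synonym", WEIGHTS["synonym"]))
    for predicate in PREDICATES:
        p = predicate.lower()
        if key == p:
            found.append((predicate, "exact", WEIGHTS["exact"]))
        elif p.startswith(key) or key.startswith(p):
            found.append((predicate, "prefix", WEIGHTS["prefix"]))
        elif key in p or p in key:
            found.append((predicate, "substring", WEIGHTS["substring"]))
    return sorted(found, key=lambda match: match[2], reverse=True)

for label in ["id", "firstName", "email"]:
    print(label, match_label(label))
```

All candidates are kept with their weights rather than picking one immediately, so the later class mapping can still choose a lower-ranked match if it fits the chosen class better.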
4. Analysis of property values
• dictionaries
• structure patterns (uri, date, address, color…)
• data types (date, time, number, boolean…), lower weighted
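A simplified Python sketch of the value analysis, combining a dictionary lookup, structure patterns, and a lower-weighted data type check. The dictionary entries, regular expressions, and weights are assumptions chosen for this example and are far less thorough than real dictionaries or patterns would need to be.

```python
import re

# Hypothetical dictionary, patterns, and weights for illustration only.
FIRST_NAMES = {"michael", "martin", "frank"}
PATTERNS = {
    "url": re.compile(r"^https?://\S+$"),
    "email": re.compile(r"^[^@\s]+@[^@\s]+\.[^@\s]+$"),
    "date": re.compile(r"^\d{4}-\d{2}-\d{2}$"),
    "telephone": re.compile(r"^\+?[\d\s/()-]{6,}$"),
}
WEIGHTS = {"dictionary": 0.8, "pattern": 0.7, "datatype": 0.3}

def analyse_value(value):
    """Return (hint, method, weight) candidates for a property value, best first."""
    found = []
    if isinstance(value, bool):
        found.append(("boolean", "datatype", WEIGHTS["datatype"]))
    elif isinstance(value, (int, float)):
        found.append(("number", "datatype", WEIGHTS["datatype"]))
    elif isinstance(value, str):
        if value.lower() in FIRST_NAMES:
            found.append(("givenName", "dictionary", WEIGHTS["dictionary"]))
        for hint, pattern in PATTERNS.items():
            if pattern.match(value):
                found.append((hint, "pattern", WEIGHTS["pattern"]))
        if re.fullmatch(r"-?\d+(\.\d+)?", value):
            # Numbers stored as strings only count as a weak data type hint.
            found.append(("number", "datatype", WEIGHTS["datatype"]))
    return sorted(found, key=lambda match: match[2], reverse=True)

for value in ["Michael", "+49 371 531 39929", "http://www.www2015.it/", "2015"]:
    print(repr(value), analyse_value(value))
```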
5. Mapping of classes
• find class by number of matched properties
• select match that is most appropriate for chosen class
• take different weights into account
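The class selection can be sketched as a weighted vote: for each candidate class, sum the best weight of every label whose candidate predicates belong to that class, then keep the class with the highest total. The class table below is a hypothetical excerpt using schema.org property names, and the scoring rule is an assumption; the slides only state that the number of matched properties and their weights are considered.

```python
# Hypothetical excerpt of class -> property sets.
CLASSES = {
    "Person": {"givenName", "familyName", "email", "telephone", "jobTitle"},
    "ScholarlyArticle": {"name", "author", "datePublished", "identifier"},
    "Event": {"name", "url", "startDate", "location"},
}

def choose_class(candidates):
    """Pick the class whose properties best cover the weighted candidate matches.

    candidates maps each label to a list of (predicate, weight) pairs.
    Returns (class name, {label: chosen predicate}).
    """
    best_class, best_score, best_mapping = None, 0.0, {}
    for cls, properties in CLASSES.items():
        score, mapping = 0.0, {}
        for label, matches in candidates.items():
            fitting = [(weight, predicate) for predicate, weight in matches
                       if predicate in properties]
            if fitting:
                # Select the match that is most appropriate for this class.
                weight, predicate = max(fitting)
                score += weight
                mapping[label] = predicate
        if score > best_score:
            best_class, best_score, best_mapping = cls, score, mapping
    return best_class, best_mapping

candidates = {
    "firstName": [("givenName", 0.8), ("name", 0.4)],
    "lastName": [("familyName", 0.8), ("name", 0.4)],
    "email": [("email", 0.9)],
}
chosen_class, chosen_mapping = choose_class(candidates)
print(chosen_class, chosen_mapping)
```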
6. Generate JSON-LD document
• use matches and mappings
• link entities depending on JSON tree structure
• validation of output
• optional conversion to various RDF formats
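As an illustration of this step, the Python sketch below builds a JSON-LD document from the publication example, nesting the `event` object as a linked entity because it is nested in the JSON tree. The `CLASSES` and `PREDICATES` tables stand in for the results of the earlier steps and are hypothetical; in particular, `recordedAt` is only a placeholder choice for linking the publication to its event. Arrays of objects are not handled here, and `validate` is a minimal sanity check rather than real JSON-LD validation.

```python
import json

CONTEXT = "http://schema.org/"

# Hypothetical results of the earlier steps, keyed by path ("" is the root object).
CLASSES = {"": "ScholarlyArticle", "event": "Event"}
PREDICATES = {
    "id": "identifier",
    "title": "name",
    "author": "author",
    "year": "datePublished",
    "event": "recordedAt",  # placeholder predicate for the nested entity
    "event.name": "name",
    "event.url": "url",
}

def build(node, path=""):
    """Turn a JSON object into a JSON-LD node, linking nested objects as nested nodes."""
    entity = {"@type": CLASSES[path]} if path in CLASSES else {}
    for label, value in node.items():
        child_path = f"{path}.{label}" if path else label
        predicate = PREDICATES.get(child_path)
        if predicate is None:
            continue  # labels without a mapping are left out of the output
        entity[predicate] = build(value, child_path) if isinstance(value, dict) else value
    return entity

def to_jsonld(data):
    """Wrap the mapped root entity with an @context."""
    document = {"@context": CONTEXT}
    document.update(build(data))
    return document

def validate(document):
    """Minimal sanity check: the document has a context and a type."""
    return "@context" in document and "@type" in document

publication = {
    "id": "2015-007",
    "title": "SmartComposition: ...",
    "author": ["Michael Krug", "Martin Gaedke"],
    "year": "2015",
    "type": "Conference Paper",
    "event": {
        "name": "24th International World Wide Web Conference",
        "url": "http://www.www2015.it/",
    },
}
jsonld = to_jsonld(publication)
print(json.dumps(jsonld, indent=2))
```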
7. Evaluation of results
• manual or automatic comparison of actual vs. desired result to reweight matching components
• store correctly applied mappings for later reuse
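A small Python sketch of how such a feedback loop might work: the actual mapping is compared with a desired mapping, the weight of each matching method is nudged up or down depending on whether it produced a correct result, and the correct mappings are kept for reuse. The fixed step size, the precision measure, and the example data are assumptions; the slides do not specify how the prototype reweights its components.

```python
def evaluate(actual, desired):
    """Compare actual and desired label -> predicate mappings.

    Returns the set of correctly mapped labels and the precision over mapped labels.
    """
    correct = {label for label, predicate in actual.items()
               if desired.get(label) == predicate}
    precision = len(correct) / len(actual) if actual else 0.0
    return correct, precision

def reweight(weights, used_method, correct, step=0.05):
    """Nudge the weight of each matching method by `step`, clamped to [0, 1]."""
    updated = dict(weights)
    for label, method in used_method.items():
        delta = step if label in correct else -step
        updated[method] = round(min(1.0, max(0.0, updated[method] + delta)), 2)
    return updated

# Hypothetical run on the people example.
actual = {"firstName": "givenName", "title": "name", "phone": "telephone"}
desired = {"firstName": "givenName", "title": "honorificPrefix", "phone": "telephone"}
used_method = {"firstName": "synonym", "title": "substring", "phone": "synonym"}
weights = {"synonym": 0.8, "substring": 0.4}

correct, precision = evaluate(actual, desired)
new_weights = reweight(weights, used_method, correct)
reusable = {label: actual[label] for label in correct}  # stored for later reuse
print(sorted(correct), new_weights)
```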