KESeDa: Knowledge Extraction from Heterogeneous Semi-Structured Data Sources

WWW.LEDS-PROJEKT.DE

LEDS

KNOWLEDGE EXTRACTION FROM HETEROGENEOUS SEMI-STRUCTURED DATA SOURCES

MARTIN SEIDEL, MICHAEL KRUG, FRANK BURIAN, MARTIN GAEDKE

16. September 2016

CURRENT SITUATION

• knowledge in the Web often only available as weakly interlinked, heterogeneous, semi-structured data
→ no semantic classification
• how to link or merge data?
• how to do semantic queries?
→ not usable in a meaningful way


GOAL

Extraction of knowledge from semi-structured data
• knowledge in terms of semantic metadata
• semantically enriched data can then utilize the potential of Linked Data
→ provide an automatic process


THE KESEDA APPROACH

• Especially designed to work on JSON data
• Challenges when working with JSON data
→ no schema, only name-value pairs
→ any structure and depth possible
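To illustrate why "any structure and depth" is a challenge, the following minimal Python sketch walks an arbitrary parsed JSON document and lists every name-value pair together with its path. The function name `iter_paths` and the sample document are assumptions made for illustration; they are not taken from the KESeDa prototype.

```python
import json

def iter_paths(node, path=()):
    """Yield (path, value) for every leaf value in a parsed JSON document."""
    if isinstance(node, dict):
        # Objects: descend into each name-value pair.
        for key, value in node.items():
            yield from iter_paths(value, path + (key,))
    elif isinstance(node, list):
        # Arrays: descend into each element, using its index as path step.
        for index, item in enumerate(node):
            yield from iter_paths(item, path + (index,))
    else:
        # Strings, numbers, booleans, and null are leaves.
        yield path, node

# Hypothetical sample document, shaped like the publication example.
doc = json.loads('{"title": "T", "author": ["A", "B"], "event": {"name": "N"}}')
leaves = list(iter_paths(doc))
for path, value in leaves:
    print(path, value)
```

Without a schema, the path and the value are the only evidence available for deciding what a property means.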



{
  "id": "krug",
  "firstName": "Michael",
  "lastName": "Krug",
  "title": "Dipl.-Inf.",
  "phone": "+49 371 531 39929",
  "email": "[email protected]",
  [...]
}



{
  "id": "2015-007",
  "title": "SmartComposition: ...",
  "author": [ "Michael Krug", "Martin Gaedke" ],
  "year": "2015",
  "type": "Conference Paper",
  "event": {
    "name": "24th International World Wide Web Conference",
    "url": "http://www.www2015.it/"
  },
  [...]
}


Callouts on the example: Arrays (e.g. "author"), Objects (e.g. "event")


• multi-step algorithm
• work in existing JSON structure
• find and store various matches with different weights
• use additional information sources like API descriptions
• assign classes to objects with multiple properties
• link detected entities



1. Differentiation of input sources / formats
2. Preparation of data structure
3. Analysis of property labels
4. Analysis of property values
5. Mapping of classes
6. Generate JSON-LD document
7. Evaluation of results


PROTOTYPE

• prototype implemented in Node.js
• working with properties and classes from:
  • schema.org
  • foaf
  • dublincore
  • goodrelations
  • music ontology
• dictionaries for: first & last names, cities, streets, languages
• list of manually curated synonyms
• option to provide pre-defined mappings



• Web interface for
  • pre-configuration
    • mappings, synonyms, dictionaries
  • data upload
  • result analysis
    • statistics and browsing


[Screenshot: prototype, CONFIGURATION view]

[Screenshot: prototype, RESULTS view]

EVALUATION

Algorithm applied to datasets of

1) JSON array of people

2) JSON array of publications

a) Without custom pre-configuration

b) With custom pre-configuration


Initial Setup
• dictionary and structure pattern matching
• label → predicate string matching
• classes and properties: schema.org, foaf, dublincore, goodrelations

Custom Pre-Configuration
• set of label → predicate mappings (hand-picked for data context)
• list of known synonyms
• more structure patterns


1A) PEOPLE W/O CONFIG
[result charts not preserved in this transcript]

1B) PEOPLE W/ CONFIG
[result charts not preserved in this transcript]

2A) PUBLICATIONS W/O CONFIG
[result charts not preserved in this transcript]

2B) PUBLICATIONS W/ CONFIG
[result charts not preserved in this transcript]


SUMMARY

➙ Approach for extracting knowledge from semi-structured data
➙ by applying a multi-step algorithm
➙ to convert JSON data to RDF
➙ that assigns known classes to objects and maps their properties to S-P-O triples
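As a rough illustration of what mapping properties to S-P-O triples means, the Python sketch below flattens an already classified node into (subject, predicate, object) tuples. The blank-node naming scheme, the function name `to_triples`, and the sample node are assumptions for this example; the slides do not show how the prototype serializes triples.

```python
from itertools import count

def to_triples(node, counter=None):
    """Return (subject, triples) for a nested dict of predicate -> value."""
    counter = counter if counter is not None else count()
    subject = f"_:b{next(counter)}"  # hypothetical blank-node identifier
    triples = []
    for predicate, value in node.items():
        if predicate == "@type":
            # The assigned class becomes an rdf:type statement.
            triples.append((subject, "rdf:type", value))
            continue
        values = value if isinstance(value, list) else [value]
        for item in values:
            if isinstance(item, dict):
                # Nested objects become their own subjects, linked to the parent.
                child, child_triples = to_triples(item, counter)
                triples.append((subject, predicate, child))
                triples.extend(child_triples)
            else:
                triples.append((subject, predicate, item))
    return subject, triples

person = {
    "@type": "schema:Person",
    "schema:givenName": "Michael",
    "schema:familyName": "Krug",
}
subject, triples = to_triples(person)
for triple in triples:
    print(triple)
```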


OPEN CHALLENGES

• detect and reuse JSON structure patterns
• disambiguate values
• apply quality control to results
• improve scalability for large datasets
• research application of machine learning


THANK YOU

[email protected]

VSR.INFORMATIK.TU-CHEMNITZ.DE




1. Differentiation of input sources / formats

• text, file, URL, API
• check for format
• optional conversion of XML to JSON
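A minimal Python sketch of this step, assuming the input has already been read into a string: it checks whether the text parses as JSON or XML and offers a deliberately naive XML-to-JSON conversion. The function names and the conversion rules (attributes ignored, repeated tags collected into lists) are assumptions for illustration, not the prototype's behavior.

```python
import json
import xml.etree.ElementTree as ET

def detect_format(text):
    """Return 'json', 'xml', or 'unknown' for a raw text input."""
    try:
        json.loads(text)
        return "json"
    except ValueError:
        pass
    try:
        ET.fromstring(text)
        return "xml"
    except ET.ParseError:
        return "unknown"

def element_to_dict(element):
    """Naive conversion: child tags become keys, attributes are ignored."""
    children = list(element)
    if not children:
        return element.text
    result = {}
    for child in children:
        value = element_to_dict(child)
        if child.tag in result:
            # A repeated tag is collected into a list, like a JSON array.
            existing = result[child.tag]
            if not isinstance(existing, list):
                result[child.tag] = [existing]
            result[child.tag].append(value)
        else:
            result[child.tag] = value
    return result

xml_text = "<paper><author>A</author><author>B</author><year>2015</year></paper>"
if detect_format(xml_text) == "xml":
    print(json.dumps(element_to_dict(ET.fromstring(xml_text))))
```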



2. Preparation of data structure

• pre-process JSON tree to store matches and mappings
• keep original structure to preserve hierarchy for later relations
• detect arrays and objects for separate processing
• clean up: remove empty entries
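One way to picture this preparation step is to wrap every JSON value in an annotated node that keeps the original nesting, records whether it is an object, array, or plain value, and reserves a slot for matches found later. The node layout (`kind`, `children`, `items`, `matches`) and the definition of "empty" used here are assumptions for this sketch.

```python
def is_empty(value):
    """Treat null, empty strings, empty arrays, and empty objects as empty."""
    return value is None or value == "" or value == [] or value == {}

def prepare(node):
    """Wrap a parsed JSON value in an annotated node, dropping empty entries."""
    if isinstance(node, dict):
        children = {k: prepare(v) for k, v in node.items() if not is_empty(v)}
        return {"kind": "object", "children": children, "matches": []}
    if isinstance(node, list):
        items = [prepare(v) for v in node if not is_empty(v)]
        return {"kind": "array", "items": items, "matches": []}
    return {"kind": "value", "value": node, "matches": []}

# Hypothetical input with two empty entries that get cleaned up.
tree = prepare({"id": "krug", "fax": "", "tags": [], "author": ["A", None]})
print(sorted(tree["children"]))
```

Because the wrapper mirrors the input tree, parent-child relations remain available when entities are linked later.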



3. Analysis of property labels

• string matching (substrings, prefixes, …)

• synonyms

• pre-defined mappings
• use metadata from API description, if available
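The label analysis can be sketched as a function that collects weighted candidate predicates for a property label. The predicate list, the synonym table, and all weights below are illustrative assumptions; the slides only state that string matching, synonyms, and pre-defined mappings contribute matches with different weights.

```python
# Hypothetical vocabulary subset and synonym list (schema.org property names).
PREDICATES = ["schema:givenName", "schema:familyName", "schema:email", "schema:telephone"]
SYNONYMS = {
    "firstname": "schema:givenName",
    "lastname": "schema:familyName",
    "phone": "schema:telephone",
}

def normalize(label):
    """Lower-case a label and strip everything except letters and digits."""
    return "".join(ch for ch in label.lower() if ch.isalnum())

def match_label(label, predefined=None):
    """Return candidate (predicate, weight) pairs, best first."""
    key = normalize(label)
    candidates = {}

    def add(predicate, weight):
        candidates[predicate] = max(weight, candidates.get(predicate, 0.0))

    if predefined and label in predefined:
        add(predefined[label], 1.0)      # pre-defined mapping: assumed top weight
    if key in SYNONYMS:
        add(SYNONYMS[key], 0.8)          # synonym hit
    for predicate in PREDICATES:
        local = normalize(predicate.split(":", 1)[1])
        if key == local:
            add(predicate, 0.9)          # exact string match
        elif key and (key in local or local in key):
            add(predicate, 0.5)          # substring match
    return sorted(candidates.items(), key=lambda kv: -kv[1])

print(match_label("firstName"))
print(match_label("title", {"title": "schema:honorificPrefix"}))
```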



4. Analysis of property values

• dictionaries

• structure patterns (uri, date, address, color…)

• data types (date, time, number, boolean…), lower weighted
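The value analysis could look roughly like the Python sketch below, which checks a value against a dictionary, a few structure patterns, and basic data types, with data types weighted lowest. The tiny dictionary, the regular expressions, and the weights are stand-ins chosen for illustration, not the prototype's actual resources.

```python
import re

FIRST_NAMES = {"michael", "martin", "frank"}  # tiny stand-in dictionary

# Hypothetical structure patterns; real patterns would be more thorough.
PATTERNS = [
    ("uri", re.compile(r"^https?://\S+$")),
    ("date", re.compile(r"^\d{4}-\d{2}-\d{2}$")),
    ("color", re.compile(r"^#[0-9a-fA-F]{6}$")),
]

def analyze_value(value):
    """Return a list of (match, weight) pairs for a single property value."""
    matches = []
    if isinstance(value, bool):          # check bool before int: bool is an int subclass
        matches.append(("type:boolean", 0.3))
    elif isinstance(value, (int, float)):
        matches.append(("type:number", 0.3))
    elif isinstance(value, str):
        if value.lower() in FIRST_NAMES:
            matches.append(("dictionary:firstName", 0.7))
        for name, pattern in PATTERNS:
            if pattern.match(value):
                matches.append(("pattern:" + name, 0.6))
        if re.fullmatch(r"-?\d+(\.\d+)?", value):
            matches.append(("type:number", 0.3))  # numeric string, lowest weight
    return matches

print(analyze_value("http://www.www2015.it/"))
```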



5. Mapping of classes

• find class by number of matched properties

• select match that is most appropriate for chosen class

• take different weights into account
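A sketch of how the class mapping might work, under the assumption that each class is described by a set of predicates and that a class is scored by summing the best matching weight per property. The class table, the scoring rule, and the function name `map_class` are assumptions; the slides only say that the number of matched properties and their weights are taken into account.

```python
# Hypothetical class descriptions using schema.org terms.
CLASSES = {
    "schema:Person": {
        "schema:givenName", "schema:familyName", "schema:email", "schema:honorificPrefix",
    },
    "schema:ScholarlyArticle": {
        "schema:headline", "schema:author", "schema:datePublished",
    },
}

def map_class(candidates):
    """candidates: {label: [(predicate, weight), ...]} -> (class, {label: predicate})."""
    def score(properties):
        total = 0.0
        for matches in candidates.values():
            # Count only the best match that belongs to this class.
            total += max((w for p, w in matches if p in properties), default=0.0)
        return total

    best = max(CLASSES, key=lambda name: score(CLASSES[name]))
    mapping = {}
    for label, matches in candidates.items():
        fitting = [(p, w) for p, w in matches if p in CLASSES[best]]
        if fitting:
            # Pick the match most appropriate for the chosen class.
            mapping[label] = max(fitting, key=lambda pw: pw[1])[0]
    return best, mapping

candidates = {
    "firstName": [("schema:givenName", 0.8)],
    "lastName": [("schema:familyName", 0.8)],
    "title": [("schema:headline", 0.9), ("schema:honorificPrefix", 0.5)],
}
chosen, mapping = map_class(candidates)
print(chosen, mapping)
```

In this example "title" on its own would favor `schema:headline`, but once the object is classified as a person, the lower-weighted `schema:honorificPrefix` is the match that fits the chosen class.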



6. Generate JSON-LD document

• use matches and mappings

• link entities depending on JSON tree structure

• validation of output
• optional conversion to various RDF formats
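The following Python sketch shows one possible shape of the generated JSON-LD for the publication example, with the nested event linked to the paper because it is nested in the JSON tree. The chosen types, the label-to-predicate mappings, and in particular the linking predicate `ex:event` in a made-up `http://example.org/` namespace are assumptions; the slides do not show the prototype's actual output.

```python
import json

# Hypothetical context; "ex" is a made-up namespace for the linking predicate.
CONTEXT = {"schema": "http://schema.org/", "ex": "http://example.org/"}

def to_node(data, type_name, mapping):
    """Build a JSON-LD node from a JSON object and a label -> predicate mapping."""
    node = {"@type": type_name}
    for label, value in data.items():
        if label in mapping:
            node[mapping[label]] = value
    return node

publication = {
    "title": "SmartComposition: ...",
    "year": "2015",
    "event": {"name": "24th International World Wide Web Conference"},
}

# The nested object is converted first, then embedded in its parent node.
event_node = to_node(publication["event"], "schema:Event", {"name": "schema:name"})
paper_node = to_node(
    {**publication, "event": event_node},
    "schema:ScholarlyArticle",
    {"title": "schema:headline", "year": "schema:datePublished", "event": "ex:event"},
)
document = {"@context": CONTEXT, **paper_node}
print(json.dumps(document, indent=2))
```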



7. Evaluation of results

• manual or automatic comparison of actual vs. desired result to reweight matching components

• store correctly applied mappings for later reuse
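A minimal sketch of the automatic comparison, assuming both the actual and the desired result are available as label-to-predicate mappings. Precision and recall are used here as one plausible way to quantify the comparison; the slides do not name a metric, and the reweighting itself is left out.

```python
def evaluate(actual, desired):
    """Compare two label -> predicate mappings.

    Returns the correctly applied mappings plus precision and recall.
    """
    correct = {
        label: predicate
        for label, predicate in actual.items()
        if desired.get(label) == predicate
    }
    precision = len(correct) / len(actual) if actual else 0.0
    recall = len(correct) / len(desired) if desired else 0.0
    return correct, precision, recall

# Hypothetical results for the person example.
actual = {
    "firstName": "schema:givenName",
    "title": "schema:headline",
    "phone": "schema:telephone",
}
desired = {
    "firstName": "schema:givenName",
    "title": "schema:honorificPrefix",
    "phone": "schema:telephone",
    "email": "schema:email",
}

correct, precision, recall = evaluate(actual, desired)

# Correct mappings are kept so they can be reused as pre-defined mappings later.
known_mappings = {}
known_mappings.update(correct)
print(sorted(known_mappings), precision, recall)
```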

