
KESeDa: Knowledge Extraction from Heterogeneous Semi-Structured Data Sources


WWW.LEDS-PROJEKT.DE

LEDS

KNOWLEDGE EXTRACTION FROM HETEROGENEOUS SEMI-STRUCTURED DATA SOURCES

MARTIN SEIDEL, MICHAEL KRUG, FRANK BURIAN, MARTIN GAEDKE

16. September 2016

CURRENT SITUATION

• knowledge on the Web is often available only as weakly interlinked, heterogeneous, semi-structured data

➙ no semantic classification
• how to link or merge data?
• how to do semantic queries?

➙ not usable in a meaningful way

GOAL

Extraction of knowledge from semi-structured data
• knowledge in terms of semantic metadata
• semantically enriched data can then utilize the potential of Linked Data

➙ provide an automatic process

THE KESEDA APPROACH

• Designed specifically to work on JSON data

• Challenges when working with JSON data

➙ no schema, only name-value pairs

➙ any structure and depth possible

{
  "id": "krug",
  "firstName": "Michael",
  "lastName": "Krug",
  "title": "Dipl.-Inf.",
  "phone": "+49 371 531 39929",
  "email": "[email protected]",
  [...]
}

{
  "id": "2015-007",
  "title": "SmartComposition: ...",
  "author": ["Michael Krug", "Martin Gaedke"],
  "year": "2015",
  "type": "Conference Paper",
  "event": {
    "name": "24th International World Wide Web Conference",
    "url": "http://www.www2015.it/"
  },
  [...]
}

Arrays: "author"
Objects: "event"
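The two examples show that a property can hold a plain value, an array, or a nested object at any depth. A minimal Python sketch of a recursive walk that records the kind of each node is given here. The `walk` function and the `$`-style path notation are assumptions made for illustration and are not taken from the KESeDa prototype.

```python
import json

def walk(node, path="$"):
    """Yield (path, kind) for every node in a parsed JSON value."""
    if isinstance(node, dict):
        yield path, "object"
        for key, value in node.items():
            yield from walk(value, f"{path}.{key}")
    elif isinstance(node, list):
        yield path, "array"
        for index, item in enumerate(node):
            yield from walk(item, f"{path}[{index}]")
    else:
        yield path, "value"

# Shortened version of the publication example from the slide.
publication = json.loads("""
{
  "id": "2015-007",
  "author": ["Michael Krug", "Martin Gaedke"],
  "event": {"name": "24th International World Wide Web Conference"}
}
""")

kinds = dict(walk(publication))
for path, kind in kinds.items():
    print(path, kind)
```

Because the walk makes no assumption about property names or nesting depth, it works on any JSON input, which is the situation the approach has to handle.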

• multi-step algorithm
• work in existing JSON structure
• find and store various matches with different weights
• use additional information sources like API descriptions
• assign classes to objects with multiple properties
• link detected entities

1. Differentiation of input sources / formats
2. Preparation of data structure
3. Analysis of property labels
4. Analysis of property values
5. Mapping of classes
6. Generate JSON-LD document
7. Evaluation of results

PROTOTYPE

• prototype implemented in Node.js
• working with properties and classes from:
  • schema.org
  • foaf
  • dublincore
  • goodrelations
  • music ontology
• dictionaries for: first & last names, cities, streets, languages
• list of manually curated synonyms
• option to provide pre-defined mappings

• Web interface for
  • pre-configuration
    • mappings, synonyms, dictionaries
  • data upload
  • result analysis
    • statistics and browsing

[Figure: prototype, CONFIGURATION view]

[Figure: prototype, RESULTS view]

EVALUATION

Algorithm applied to datasets of

1) JSON array of people

2) JSON array of publications

a) Without custom pre-configuration

b) With custom pre-configuration

Initial Setup
• dictionary and structure pattern matching
• label ➙ predicate string matching
• classes and properties: schema.org, foaf, dublincore, goodrelations

Custom Pre-Configuration
• set of label ➙ predicate mappings (hand-picked for data context)
• list of known synonyms
• more structure patterns

1a) People, without custom pre-configuration [result figures]

1b) People, with custom pre-configuration [result figures]

2a) Publications, without custom pre-configuration [result figures]

2b) Publications, with custom pre-configuration [result figures]

SUMMARY

➙ Approach for extracting knowledge from semi-structured data
➙ by applying a multi-step algorithm
➙ to convert JSON data to RDF
➙ that assigns known classes to objects and maps their properties to S-P-O triples

OPEN CHALLENGES

• detect and reuse JSON structure patterns
• disambiguate values
• apply quality control to results
• improve scalability for large datasets
• research application of machine learning

THANK YOU

[email protected]

VSR.INFORMATIK.TU-CHEMNITZ.DE

WWW.LEDS-PROJEKT.DE

THE KESEDA APPROACH

1. Differentiation of input sources / formats

• text, file, URL, API
• check for format
• optional conversion of XML to JSON
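The format check and the optional XML-to-JSON conversion of step 1 could be sketched as follows. This is a minimal Python sketch under stated assumptions: the function names `load_input` and `element_to_dict` are hypothetical, XML attributes are ignored, and the prototype's actual conversion rules are not described in the slides.

```python
import json
import xml.etree.ElementTree as ET

def element_to_dict(element):
    """Convert an XML element to a dict or string (attributes are ignored here)."""
    children = list(element)
    if not children:
        return (element.text or "").strip()
    result = {}
    for child in children:
        value = element_to_dict(child)
        if child.tag in result:
            # Repeated tags become a list, mirroring a JSON array.
            if not isinstance(result[child.tag], list):
                result[child.tag] = [result[child.tag]]
            result[child.tag].append(value)
        else:
            result[child.tag] = value
    return result

def load_input(text):
    """Return (detected_format, data) for a JSON or XML string."""
    stripped = text.strip()
    try:
        return "json", json.loads(stripped)
    except json.JSONDecodeError:
        pass
    try:
        root = ET.fromstring(stripped)
    except ET.ParseError:
        raise ValueError("input is neither valid JSON nor valid XML") from None
    return "xml", {root.tag: element_to_dict(root)}

print(load_input('{"id": "krug"}'))
print(load_input("<person><name>Krug</name><phone>1</phone><phone>2</phone></person>"))
```

Trying the JSON parser first keeps the common case cheap; input that arrives as a file, URL, or API response would be read into a string before this function is called.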

2. Preparation of data structure

• pre-process JSON tree to store matches and mappings
• keep original structure to preserve hierarchy for later relations
• detect arrays and objects for separate processing
• clean up: remove empty entries
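One way to picture the preparation step is a recursive pass that wraps every node in an annotation record and drops empty entries. The Python sketch below is an assumption about how such a structure might look; the field names `kind`, `children`, `items`, and `matches` are hypothetical.

```python
def prepare(node):
    """Wrap a parsed JSON value in annotation nodes and drop empty entries."""
    if isinstance(node, dict):
        children = {}
        for key, value in node.items():
            child = prepare(value)
            if child is not None:
                children[key] = child
        if not children:
            return None
        return {"kind": "object", "children": children, "matches": []}
    if isinstance(node, list):
        items = [child for child in map(prepare, node) if child is not None]
        if not items:
            return None
        return {"kind": "array", "items": items, "matches": []}
    if node is None or node == "":
        return None
    # Values such as 0 or False are kept: they are not empty.
    return {"kind": "value", "value": node, "matches": []}

tree = prepare({"name": "Krug", "fax": "", "tags": [], "event": {"url": None}, "year": 0})
print(sorted(tree["children"]))
```

The original nesting is preserved in `children` and `items`, so later steps can still derive relations from the hierarchy, while the empty `matches` lists give the analysis steps a place to store their results.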

3. Analysis of property labels

• string matching (substrings, prefixes, …)
• synonyms
• pre-defined mappings
• use metadata from API description, if available
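The label analysis above can be illustrated with a small weighted matcher. This Python sketch is hypothetical throughout: the predicate list is a handful of schema.org property names, and the synonym table, pre-defined mapping, and weight values are invented for the example. Metadata from API descriptions is left out.

```python
import re

# Hypothetical vocabulary, synonyms, mappings, and weights for illustration only.
PREDICATES = ["givenName", "familyName", "email", "telephone", "name"]
SYNONYMS = {"phone": "telephone", "firstname": "givenName", "lastname": "familyName"}
PREDEFINED = {"title": "honorificPrefix"}
WEIGHTS = {"predefined": 1.0, "exact": 0.9, "synonym": 0.8, "substring": 0.5}

def normalize(label):
    """Lower-case a label and strip everything except letters and digits."""
    return re.sub(r"[^a-z0-9]", "", label.lower())

def match_label(label):
    """Return candidate (predicate, weight, source) tuples, heaviest first."""
    key = normalize(label)
    matches = []
    if label in PREDEFINED:
        matches.append((PREDEFINED[label], WEIGHTS["predefined"], "predefined"))
    if key in SYNONYMS:
        matches.append((SYNONYMS[key], WEIGHTS["synonym"], "synonym"))
    for predicate in PREDICATES:
        candidate = normalize(predicate)
        if key == candidate:
            matches.append((predicate, WEIGHTS["exact"], "exact"))
        elif key and (key in candidate or candidate in key):
            matches.append((predicate, WEIGHTS["substring"], "substring"))
    return sorted(matches, key=lambda match: match[1], reverse=True)

for label in ["firstName", "phone", "title", "email"]:
    print(label, match_label(label))
```

All candidates are kept rather than only the best one, since the class mapping step may later prefer a lower-weighted candidate that fits the chosen class.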

4. Analysis of property values

• dictionaries
• structure patterns (URI, date, address, color, …)
• data types (date, time, number, boolean, …), weighted lower
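The value analysis can be sketched in the same style: dictionary lookups and structure patterns yield stronger hints, while the bare data type yields a weaker one. The dictionary content, the two regular expressions, and the weights in this Python sketch are assumptions for illustration, not the prototype's actual data.

```python
import re

# Hypothetical dictionary, patterns, and weights for illustration only.
FIRST_NAMES = {"michael", "martin", "frank"}
PATTERNS = {
    "url": re.compile(r"^https?://\S+$"),
    "date": re.compile(r"^\d{4}-\d{2}-\d{2}$"),
}
WEIGHTS = {"dictionary": 0.7, "pattern": 0.6, "datatype": 0.2}

def match_value(value):
    """Return (hint, weight, source) tuples for one property value, heaviest first."""
    matches = []
    if isinstance(value, str):
        if value.lower() in FIRST_NAMES:
            matches.append(("givenName", WEIGHTS["dictionary"], "dictionary"))
        for hint, pattern in PATTERNS.items():
            if pattern.match(value):
                matches.append((hint, WEIGHTS["pattern"], "pattern"))
        matches.append(("string", WEIGHTS["datatype"], "datatype"))
    elif isinstance(value, bool):  # bool must be tested before int (bool subclasses int)
        matches.append(("boolean", WEIGHTS["datatype"], "datatype"))
    elif isinstance(value, (int, float)):
        matches.append(("number", WEIGHTS["datatype"], "datatype"))
    return sorted(matches, key=lambda match: match[1], reverse=True)

for value in ["Michael", "http://www.www2015.it/", 2015, True]:
    print(repr(value), match_value(value))
```

Giving the data type the lowest weight reflects that knowing a value is a string or a number says little about its meaning compared with a dictionary or pattern hit.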

5. Mapping of classes

• find class by number of matched properties
• select match that is most appropriate for chosen class
• take different weights into account
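A possible reading of the class mapping step is: score every known class by the weighted matches that fit its properties, pick the best class, then keep for each label the heaviest candidate that belongs to that class. The Python sketch below follows this reading; the class table is a small hypothetical subset and the scoring rule (sum of weights) is an assumption.

```python
# Hypothetical class definitions; the property sets are a small subset for illustration.
CLASSES = {
    "Person": {"givenName", "familyName", "email", "telephone"},
    "Organization": {"name", "email", "telephone", "url"},
}

def choose_class(candidates):
    """Pick the class whose properties collect the highest total match weight.

    `candidates` maps each JSON label to a list of (predicate, weight) options.
    Returns (class_name, {label: predicate}).
    """
    best_name, best_mapping, best_score = None, {}, 0.0
    for class_name, properties in CLASSES.items():
        mapping, score = {}, 0.0
        for label, options in candidates.items():
            # Keep only options that belong to this class, then take the heaviest.
            fitting = [option for option in options if option[0] in properties]
            if fitting:
                predicate, weight = max(fitting, key=lambda option: option[1])
                mapping[label] = predicate
                score += weight
        if score > best_score:
            best_name, best_mapping, best_score = class_name, mapping, score
    return best_name, best_mapping

candidates = {
    "firstName": [("givenName", 0.8), ("name", 0.5)],
    "lastName": [("familyName", 0.8), ("name", 0.5)],
    "phone": [("telephone", 0.8)],
}
chosen_class, chosen_mapping = choose_class(candidates)
print(chosen_class, chosen_mapping)
```

In this example both classes could absorb all three labels, so a pure count of matched properties would tie; the weights decide in favour of `Person`.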

6. Generate JSON-LD document

• use matches and mappings
• link entities depending on JSON tree structure
• validation of output
• optional conversion to various RDF formats
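Given a chosen class and a label-to-predicate mapping, generating the JSON-LD document is mostly a renaming pass in which nested JSON objects become nested, typed nodes. The Python sketch below assumes a hypothetical `typing` structure as the result of the earlier steps. The choice of `about` to link the article to the event is purely illustrative, and validation and conversion to other RDF formats are omitted.

```python
import json

def to_node(record, typing):
    """Rename mapped labels to predicates and type the node; recurse into nested objects."""
    node = {"@type": typing["class"]}
    nested_typings = typing.get("nested", {})
    for label, value in record.items():
        predicate = typing["mapping"].get(label)
        if predicate is None:
            continue  # unmapped labels are dropped in this sketch
        if label in nested_typings and isinstance(value, dict):
            # A nested JSON object becomes a linked entity of its own class.
            value = to_node(value, nested_typings[label])
        node[predicate] = value
    return node

def to_json_ld(record, typing, context="https://schema.org/"):
    """Build a JSON-LD document with a single vocabulary context."""
    return {"@context": context, **to_node(record, typing)}

record = {
    "id": "2015-007",
    "title": "SmartComposition: ...",
    "author": ["Michael Krug", "Martin Gaedke"],
    "event": {"name": "24th International World Wide Web Conference",
              "url": "http://www.www2015.it/"},
}
# Hypothetical outcome of the matching and class mapping steps.
typing = {
    "class": "ScholarlyArticle",
    "mapping": {"title": "name", "author": "author", "event": "about"},
    "nested": {"event": {"class": "Event", "mapping": {"name": "name", "url": "url"}}},
}
document = to_json_ld(record, typing)
print(json.dumps(document, indent=2))
```

The link between the publication and the event falls out of the JSON nesting itself, which is why the earlier preparation step keeps the original hierarchy.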

7. Evaluation of results

• manual or automatic comparison of actual vs. desired result to reweight matching components
• store correctly applied mappings for later reuse
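The slides do not say how the reweighting works. As one possible illustration, the Python sketch below compares an actual mapping with a desired one, nudges the weight of each matching component up or down by a fixed step, and collects the confirmed mappings for reuse. The function name `evaluate`, the step size, and the update rule are all assumptions.

```python
def evaluate(actual, desired, sources, weights, step=0.05):
    """Compare actual and desired label-to-predicate mappings.

    `sources` names the matching component that produced each actual mapping.
    Returns (accuracy, new_weights, confirmed); `confirmed` holds the correct
    mappings that could be stored for later reuse.
    """
    new_weights = dict(weights)
    confirmed = {}
    for label, predicate in actual.items():
        source = sources[label]
        if desired.get(label) == predicate:
            confirmed[label] = predicate
            new_weights[source] = min(1.0, new_weights[source] + step)
        else:
            new_weights[source] = max(0.0, new_weights[source] - step)
    accuracy = len(confirmed) / len(desired) if desired else 0.0
    return accuracy, new_weights, confirmed

accuracy, new_weights, confirmed = evaluate(
    actual={"firstName": "givenName", "title": "name"},
    desired={"firstName": "givenName", "title": "honorificPrefix"},
    sources={"firstName": "synonym", "title": "substring"},
    weights={"synonym": 0.8, "substring": 0.5},
)
print(accuracy, new_weights, confirmed)
```

The confirmed mappings could be fed back as pre-defined mappings for the label analysis in a later run.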

