
Michael Krug, Martin Seidel, Frank Burian and Martin Gaedke | KESeDa: Knowledge Extraction from...

Date post: 07-Jan-2017
Upload: semanticsconference
Transcript
Page 1: Michael Krug, Martin Seidel, Frank Burian and Martin Gaedke | KESeDa: Knowledge Extraction from Heterogeneous Semi-Structured Data Sources

WWW.LEDS-PROJEKT.DE

LEDS

KNOWLEDGE EXTRACTION FROM HETEROGENEOUS SEMI-STRUCTURED DATA SOURCES

MARTIN SEIDEL, MICHAEL KRUG, FRANK BURIAN, MARTIN GAEDKE

12. September 2016

Page 2: Michael Krug, Martin Seidel, Frank Burian and Martin Gaedke | KESeDa: Knowledge Extraction from Heterogeneous Semi-Structured Data Sources

CURRENT SITUATION

• knowledge on the Web is often only available as weakly interlinked, heterogeneous, semi-structured data
  → no semantic classification
• how to link or merge data?
• how to do semantic queries?
  → not usable in a meaningful way

Page 3: Michael Krug, Martin Seidel, Frank Burian and Martin Gaedke | KESeDa: Knowledge Extraction from Heterogeneous Semi-Structured Data Sources

GOAL

Extraction of knowledge from semi-structured data
• knowledge in terms of semantic metadata
• semantically enriched data can then utilize the potential of Linked Data
  → provide an automatic process

Page 4: Michael Krug, Martin Seidel, Frank Burian and Martin Gaedke | KESeDa: Knowledge Extraction from Heterogeneous Semi-Structured Data Sources

THE KESEDA APPROACH

Page 5: Michael Krug, Martin Seidel, Frank Burian and Martin Gaedke | KESeDa: Knowledge Extraction from Heterogeneous Semi-Structured Data Sources

THE KESEDA APPROACH

• Specifically designed to work on JSON data
• Challenges when working with JSON data
  → no schema, only name-value pairs
  → any structure and depth possible

Page 6: Michael Krug, Martin Seidel, Frank Burian and Martin Gaedke | KESeDa: Knowledge Extraction from Heterogeneous Semi-Structured Data Sources

THE KESEDA APPROACH

{
  "id": "krug",
  "firstName": "Michael",
  "lastName": "Krug",
  "title": "Dipl.-Inf.",
  "phone": "+49 371 531 39929",
  "email": "[email protected]",
  [...]
}

Page 7: Michael Krug, Martin Seidel, Frank Burian and Martin Gaedke | KESeDa: Knowledge Extraction from Heterogeneous Semi-Structured Data Sources

THE KESEDA APPROACH

{
  "id": "2015-007",
  "title": "SmartComposition: ...",
  "author": [ "Michael Krug", "Martin Gaedke" ],
  "year": "2015",
  "type": "Conference Paper",
  "event": {
    "name": "24th International World Wide Web Conference",
    "url": "http://www.www2015.it/"
  },
  [...]
}

Callouts on the slide: Arrays ("author"), Objects ("event")
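
Because such JSON has no schema and may nest arrays and objects to any depth, an analysis first has to discover the name-value pairs by walking the tree. The following is a minimal Python sketch of such a walk, not the KESeDa code (the prototype is implemented in Node.js); the function name `iter_pairs` and the path notation are assumptions made for illustration.

```python
from typing import Any, Iterator, Tuple


def iter_pairs(node: Any, path: str = "") -> Iterator[Tuple[str, Any]]:
    """Yield (path, value) for every leaf value of a parsed JSON tree."""
    if isinstance(node, dict):
        # Objects: descend into each property, extending the dotted path.
        for name, value in node.items():
            yield from iter_pairs(value, f"{path}.{name}" if path else name)
    elif isinstance(node, list):
        # Arrays: descend into each element, marking its index.
        for index, item in enumerate(node):
            yield from iter_pairs(item, f"{path}[{index}]")
    else:
        # Leaf: a string, number, boolean or null.
        yield path, node


record = {
    "title": "SmartComposition: ...",
    "author": ["Michael Krug", "Martin Gaedke"],
    "event": {"name": "24th International World Wide Web Conference"},
}
pairs = list(iter_pairs(record))
```

Each resulting pair keeps its position in the tree, which is what later steps would need in order to relate nested entities to their parents.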

Page 8: Michael Krug, Martin Seidel, Frank Burian and Martin Gaedke | KESeDa: Knowledge Extraction from Heterogeneous Semi-Structured Data Sources

THE KESEDA APPROACH

• multi-step algorithm
• work in existing JSON structure
• find and store various matches with different weights
• use additional information sources like API descriptions
• assign classes to objects with multiple properties
• link detected entities

Page 9: Michael Krug, Martin Seidel, Frank Burian and Martin Gaedke | KESeDa: Knowledge Extraction from Heterogeneous Semi-Structured Data Sources

THE KESEDA APPROACH

1. Differentiation of input sources / formats
2. Preparation of data structure
3. Analysis of property labels
4. Analysis of property values
5. Mapping of classes
6. Generate JSON-LD document
7. Evaluation of results

Page 10: Michael Krug, Martin Seidel, Frank Burian and Martin Gaedke | KESeDa: Knowledge Extraction from Heterogeneous Semi-Structured Data Sources

PROTOTYPE

Page 11: Michael Krug, Martin Seidel, Frank Burian and Martin Gaedke | KESeDa: Knowledge Extraction from Heterogeneous Semi-Structured Data Sources

PROTOTYPE

• prototype implemented in Node.js
• working with properties and classes from:
  • schema.org
  • foaf
  • dublincore
  • goodrelations
  • music ontology
• dictionaries for: first & last names, cities, streets, languages
• list of manually curated synonyms
• option to provide pre-defined mappings

Page 12: Michael Krug, Martin Seidel, Frank Burian and Martin Gaedke | KESeDa: Knowledge Extraction from Heterogeneous Semi-Structured Data Sources

PROTOTYPE

• Web interface for
  • pre-configuration
    • mappings, synonyms, dictionaries
  • data upload
  • result analysis
    • statistics and browsing

Page 13: Michael Krug, Martin Seidel, Frank Burian and Martin Gaedke | KESeDa: Knowledge Extraction from Heterogeneous Semi-Structured Data Sources

PROTOTYPE: CONFIGURATION

Page 14: Michael Krug, Martin Seidel, Frank Burian and Martin Gaedke | KESeDa: Knowledge Extraction from Heterogeneous Semi-Structured Data Sources

PROTOTYPE: RESULTS

Page 15: Michael Krug, Martin Seidel, Frank Burian and Martin Gaedke | KESeDa: Knowledge Extraction from Heterogeneous Semi-Structured Data Sources

EVALUATION

Page 16: Michael Krug, Martin Seidel, Frank Burian and Martin Gaedke | KESeDa: Knowledge Extraction from Heterogeneous Semi-Structured Data Sources

EVALUATION

Algorithm applied to datasets of
1) JSON array of people
2) JSON array of publications

a) Without custom pre-configuration
b) With custom pre-configuration

Page 17: Michael Krug, Martin Seidel, Frank Burian and Martin Gaedke | KESeDa: Knowledge Extraction from Heterogeneous Semi-Structured Data Sources

EVALUATION

Initial Setup
• dictionary and structure pattern matching
• label → predicate string matching
• classes and properties: schema.org, foaf, dublincore, goodrelations

Custom Pre-Configuration
• set of label → predicate mappings (hand-picked for data context)
• list of known synonyms
• more structure patterns

Page 18: Michael Krug, Martin Seidel, Frank Burian and Martin Gaedke | KESeDa: Knowledge Extraction from Heterogeneous Semi-Structured Data Sources

1A) PEOPLE W/O CONFIG

Page 19: Michael Krug, Martin Seidel, Frank Burian and Martin Gaedke | KESeDa: Knowledge Extraction from Heterogeneous Semi-Structured Data Sources

1A) PEOPLE W/O CONFIG

Page 20: Michael Krug, Martin Seidel, Frank Burian and Martin Gaedke | KESeDa: Knowledge Extraction from Heterogeneous Semi-Structured Data Sources

2A) PEOPLE W/ CONFIG

Page 21: Michael Krug, Martin Seidel, Frank Burian and Martin Gaedke | KESeDa: Knowledge Extraction from Heterogeneous Semi-Structured Data Sources

2A) PEOPLE W/ CONFIG

Page 22: Michael Krug, Martin Seidel, Frank Burian and Martin Gaedke | KESeDa: Knowledge Extraction from Heterogeneous Semi-Structured Data Sources

1B) PUBLICATIONS W/O CONFIG

Page 23: Michael Krug, Martin Seidel, Frank Burian and Martin Gaedke | KESeDa: Knowledge Extraction from Heterogeneous Semi-Structured Data Sources

1B) PUBLICATIONS W/O CONFIG

Page 24: Michael Krug, Martin Seidel, Frank Burian and Martin Gaedke | KESeDa: Knowledge Extraction from Heterogeneous Semi-Structured Data Sources

2B) PUBLICATIONS W/ CONFIG

Page 25: Michael Krug, Martin Seidel, Frank Burian and Martin Gaedke | KESeDa: Knowledge Extraction from Heterogeneous Semi-Structured Data Sources

2B) PUBLICATIONS W/ CONFIG

Page 26: Michael Krug, Martin Seidel, Frank Burian and Martin Gaedke | KESeDa: Knowledge Extraction from Heterogeneous Semi-Structured Data Sources

SUMMARY

Page 27: Michael Krug, Martin Seidel, Frank Burian and Martin Gaedke | KESeDa: Knowledge Extraction from Heterogeneous Semi-Structured Data Sources

SUMMARY

➙ Approach for extracting knowledge from semi-structured data
➙ by applying a multi-step algorithm
➙ to convert JSON data to RDF
➙ that assigns known classes to objects and maps their properties to S-P-O triples
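
To make the last point concrete, the sketch below shows in Python how one classified object could be flattened into subject-predicate-object triples. It is an illustration only: the function name `to_triples`, the blank-node label `_:pub1` and the prefixed names are assumptions, not output of the KESeDa prototype.

```python
from typing import Any, Dict, List, Tuple

Triple = Tuple[str, str, Any]


def to_triples(subject: str, rdf_type: str, properties: Dict[str, Any]) -> List[Triple]:
    """Flatten one classified object into S-P-O triples."""
    # The assigned class becomes an rdf:type statement.
    triples: List[Triple] = [(subject, "rdf:type", rdf_type)]
    for predicate, value in properties.items():
        # An array value yields one triple per element.
        values = value if isinstance(value, list) else [value]
        triples.extend((subject, predicate, item) for item in values)
    return triples


triples = to_triples(
    "_:pub1",
    "schema:ScholarlyArticle",
    {
        "schema:name": "SmartComposition: ...",
        "schema:author": ["Michael Krug", "Martin Gaedke"],
    },
)
```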

Page 28: Michael Krug, Martin Seidel, Frank Burian and Martin Gaedke | KESeDa: Knowledge Extraction from Heterogeneous Semi-Structured Data Sources

OPEN CHALLENGES

• detect and reuse JSON structure patterns
• disambiguate values
• apply quality control to results
• improve scalability for large datasets
• research application of machine learning

Page 29: Michael Krug, Martin Seidel, Frank Burian and Martin Gaedke | KESeDa: Knowledge Extraction from Heterogeneous Semi-Structured Data Sources

THANK YOU

[email protected]

VSR.INFORMATIK.TU-CHEMNITZ.DE

WWW.LEDS-PROJEKT.DE

Page 30: Michael Krug, Martin Seidel, Frank Burian and Martin Gaedke | KESeDa: Knowledge Extraction from Heterogeneous Semi-Structured Data Sources

LEDS

Page 31: Michael Krug, Martin Seidel, Frank Burian and Martin Gaedke | KESeDa: Knowledge Extraction from Heterogeneous Semi-Structured Data Sources

THE KESEDA APPROACH

1. Differentiation of input sources / formats
• text, file, URL, API
• check for format
• optional conversion of XML to JSON
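
A format check of this kind could look like the following Python sketch, which simply tries the standard-library JSON and XML parsers in turn. The function name `detect_format` and the returned labels are assumptions made for illustration; the slides do not say how the prototype performs the check.

```python
import json
import xml.etree.ElementTree as ET


def detect_format(text: str) -> str:
    """Return 'json', 'xml' or 'unknown' for a raw input string."""
    try:
        json.loads(text)
        return "json"
    except ValueError:  # json.JSONDecodeError is a subclass of ValueError
        pass
    try:
        ET.fromstring(text)
        return "xml"
    except ET.ParseError:
        return "unknown"
```

Input recognised as XML would then go through the optional conversion to JSON before the later steps run.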

Page 32: Michael Krug, Martin Seidel, Frank Burian and Martin Gaedke | KESeDa: Knowledge Extraction from Heterogeneous Semi-Structured Data Sources

THE KESEDA APPROACH

2. Preparation of data structure
• pre-process JSON tree to store matches and mappings
• keep original structure to preserve hierarchy for later relations
• detect arrays and objects for separate processing
• clean up: remove empty entries
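
The clean-up can be sketched in Python as a recursive pass that drops empty entries while leaving the remaining hierarchy untouched. What counts as "empty" is an assumption here (null, empty string, empty array, empty object); the slides do not define it, and the names `remove_empty` and `_is_empty` are hypothetical.

```python
from typing import Any


def _is_empty(value: Any) -> bool:
    """Assumed definition of an empty entry; 0 and False are kept."""
    return value is None or value == "" or value == [] or value == {}


def remove_empty(node: Any) -> Any:
    """Return a copy of the JSON tree without empty entries."""
    if isinstance(node, dict):
        # Clean children first so that objects emptied by the clean-up vanish too.
        cleaned = {name: remove_empty(value) for name, value in node.items()}
        return {name: value for name, value in cleaned.items() if not _is_empty(value)}
    if isinstance(node, list):
        cleaned_items = [remove_empty(item) for item in node]
        return [item for item in cleaned_items if not _is_empty(item)]
    return node
```

For example, `remove_empty({"id": "krug", "fax": "", "event": {"url": None}})` keeps only the `id` property, because the nested object becomes empty once its null value is dropped.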

Page 33: Michael Krug, Martin Seidel, Frank Burian and Martin Gaedke | KESeDa: Knowledge Extraction from Heterogeneous Semi-Structured Data Sources

THE KESEDA APPROACH

3. Analysis of property labels
• string matching (substrings, prefixes, …)
• synonyms
• pre-defined mappings
• use metadata from API description, if available
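
As an illustration of how these label heuristics might combine, the Python sketch below collects candidate predicates for one label and attaches a weight to each source of evidence. The predicate list is a small excerpt of schema.org property names; the synonym table, the pre-defined mapping and all weights are invented for the example and are not the values used by KESeDa.

```python
from typing import Dict, List, Tuple

# Hypothetical sample data for the sketch.
PREDICATES = ["givenName", "familyName", "email", "telephone", "name"]
SYNONYMS = {"firstname": "givenName", "lastname": "familyName", "phone": "telephone"}
PREDEFINED = {"title": "honorificPrefix"}


def match_label(label: str) -> List[Tuple[str, float]]:
    """Return candidate predicates for a property label, best match first."""
    key = label.lower()
    matches: Dict[str, float] = {}

    def add(predicate: str, weight: float) -> None:
        # Keep only the strongest evidence per predicate.
        matches[predicate] = max(weight, matches.get(predicate, 0.0))

    if label in PREDEFINED:            # pre-defined mapping: strongest evidence
        add(PREDEFINED[label], 1.0)
    if key in SYNONYMS:                # curated synonym
        add(SYNONYMS[key], 0.8)
    for predicate in PREDICATES:
        candidate = predicate.lower()
        if key == candidate:           # exact (case-insensitive) match
            add(predicate, 0.9)
        elif key in candidate or candidate in key:  # substring match
            add(predicate, 0.5)
    return sorted(matches.items(), key=lambda match: -match[1])
```

With this sample data, the label `firstName` yields `givenName` through the synonym table and, with a lower weight, `name` through substring matching. All candidates are kept so that the class-mapping step can choose among them.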

Page 34: Michael Krug, Martin Seidel, Frank Burian and Martin Gaedke | KESeDa: Knowledge Extraction from Heterogeneous Semi-Structured Data Sources

THE KESEDA APPROACH

4. Analysis of property values
• dictionaries
• structure patterns (uri, date, address, color, …)
• data types (date, time, number, boolean, …)
  • (lower weighted)
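
The value heuristics can be sketched in the same way. In the Python example below, the tiny first-name dictionary, the regular expressions, the hint labels and the weights are all assumptions chosen for illustration; only the ordering (data types weighted lowest) is taken from the slide.

```python
import re
from typing import Any, List, Tuple

FIRST_NAMES = {"michael", "martin", "frank"}  # hypothetical tiny dictionary
PATTERNS = {
    "uri": re.compile(r"^https?://\S+$"),
    "date": re.compile(r"^\d{4}-\d{2}-\d{2}$"),
}


def match_value(value: Any) -> List[Tuple[str, float]]:
    """Return weighted hints about what a property value might represent."""
    hints: List[Tuple[str, float]] = []
    if isinstance(value, str):
        if value.lower() in FIRST_NAMES:          # dictionary lookup
            hints.append(("dictionary:firstName", 0.8))
        for name, pattern in PATTERNS.items():    # structure patterns
            if pattern.match(value):
                hints.append((f"pattern:{name}", 0.6))
    # Plain data types are the weakest evidence, hence the lowest weight.
    if isinstance(value, bool):
        hints.append(("type:boolean", 0.2))
    elif isinstance(value, (int, float)):
        hints.append(("type:number", 0.2))
    return hints
```

The boolean check comes before the number check because Python treats `bool` as a subclass of `int`.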

Page 35: Michael Krug, Martin Seidel, Frank Burian and Martin Gaedke | KESeDa: Knowledge Extraction from Heterogeneous Semi-Structured Data Sources

THE KESEDA APPROACH

5. Mapping of classes
• find class by number of matched properties
• select match that is most appropriate for chosen class
• take different weights into account
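
One simple way to realise this is to score every candidate class by the summed weights of the matched properties it knows and pick the highest score. The Python sketch below does exactly that. The class-to-property table is a small hypothetical excerpt, and summing weights is an assumed scoring rule, since the slides do not give the actual formula.

```python
from typing import Dict, Optional, Tuple

# Hypothetical excerpt: class name -> properties known for that class.
CLASS_PROPERTIES = {
    "Person": {"givenName", "familyName", "email", "telephone"},
    "Organization": {"name", "email", "telephone", "address"},
}


def choose_class(matches: Dict[str, float]) -> Tuple[Optional[str], float]:
    """Pick the class whose known properties best cover the weighted matches.

    `matches` maps each matched predicate to the weight of its best match.
    """
    best_class, best_score = None, 0.0
    for class_name, properties in CLASS_PROPERTIES.items():
        score = sum(weight for predicate, weight in matches.items()
                    if predicate in properties)
        if score > best_score:
            best_class, best_score = class_name, score
    return best_class, best_score


chosen, score = choose_class({"givenName": 0.8, "familyName": 0.8, "email": 0.9})
```

Here `Person` wins because three matched properties support it, whereas `Organization` is supported only by `email`.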

Page 36: Michael Krug, Martin Seidel, Frank Burian and Martin Gaedke | KESeDa: Knowledge Extraction from Heterogeneous Semi-Structured Data Sources

THE KESEDA APPROACH

6. Generate JSON-LD document
• use matches and mappings
• link entities depending on JSON tree structure
• validation of output
• optional conversion to various RDF formats
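
For a single flat object, the generation step could be sketched in Python as follows. The function name `to_json_ld`, the use of an `@vocab` context pointing at schema.org, and the decision to drop unmapped labels are assumptions made for this example. Linking of nested entities, validation and RDF conversion are omitted.

```python
from typing import Any, Dict


def to_json_ld(record: Dict[str, Any], mapping: Dict[str, str], rdf_type: str,
               vocab: str = "http://schema.org/") -> Dict[str, Any]:
    """Build a JSON-LD node from a flat record and its label -> predicate mapping."""
    document: Dict[str, Any] = {"@context": {"@vocab": vocab}, "@type": rdf_type}
    for label, value in record.items():
        predicate = mapping.get(label)
        if predicate is not None:  # labels without a mapping are skipped here
            document[predicate] = value
    return document


person = to_json_ld(
    {"id": "krug", "firstName": "Michael", "lastName": "Krug"},
    {"firstName": "givenName", "lastName": "familyName"},
    "Person",
)
```

Serialising `person` with `json.dumps` yields a document that a JSON-LD processor can expand into triples using the schema.org vocabulary.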

Page 37: Michael Krug, Martin Seidel, Frank Burian and Martin Gaedke | KESeDa: Knowledge Extraction from Heterogeneous Semi-Structured Data Sources

THE KESEDA APPROACH

7. Evaluation of results
• manual or automatic comparison of actual vs. desired result to reweight matching components
• store correctly applied mappings for later reuse
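
An automatic comparison could be sketched in Python as below. Precision and recall are used here as assumed quality measures; the slides do not say which metrics the prototype computes or how the reweighting itself works. The function name `evaluate` and the sample mappings are hypothetical.

```python
from typing import Dict, Tuple


def evaluate(actual: Dict[str, str],
             desired: Dict[str, str]) -> Tuple[Dict[str, str], float, float]:
    """Compare produced label -> predicate mappings with the desired ones.

    Returns the correctly applied mappings plus precision and recall.
    """
    correct = {label: predicate for label, predicate in actual.items()
               if desired.get(label) == predicate}
    precision = len(correct) / len(actual) if actual else 0.0
    recall = len(correct) / len(desired) if desired else 0.0
    return correct, precision, recall


correct, precision, recall = evaluate(
    {"firstName": "givenName", "title": "name", "phone": "telephone"},
    {"firstName": "givenName", "title": "honorificPrefix",
     "phone": "telephone", "email": "email"},
)
```

The `correct` dictionary is the part that could be stored and fed back as pre-defined mappings for later runs.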

