Date posted: 15-Feb-2017
Uploaded by: periclesfp7
GRANT AGREEMENT: 601138 | SCHEME FP7 ICT 2011.4.3 Promoting and Enhancing Reuse of Information throughout the Content Lifecycle taking account of Evolving Semantics [Digital Preservation]
“This project has received funding from the European Union’s Seventh Framework Programme for research, technological development and demonstration under grant agreement no. 601138.”
DETECTING SEMANTIC DRIFT FOR ONTOLOGY MAINTENANCE
Sándor Darányi (University of Borås, Sweden)
Panos Mitzias (CERTH/ITI, Greece)
▶ Evolving Semantics & Digital Preservation
▶ The PERICLES Approach
▶ PERICLES Tools
  ◦ Somoclu
  ◦ SemaDrift
▶ Putting it All Together
  ◦ Data
  ◦ Workflow
  ◦ Sample Results
▶ Conclusions
Outline
▶ Schlieder (2010) gives three reasons why long-term digital preservation (LTDP) is paramount:
  ◦ Because hardware and software evolve (technology drift)
  ◦ Because language changes (semantic drift)
  ◦ Because the value systems underlying societies change (social value shifts)
▶ Apart from digital preservation (DP), formalizing change scenarios so that computers can manage them is also an active research topic in:
  ◦ Semantic Web, Knowledge Engineering & Management, Natural Language Processing, Document Engineering, Digital Humanities, Data Science
Evolving Semantics & Digital Preservation
▶ DP operates roughly on a 5 to 50 year horizon; recent advances in LTDP look at longer ranges
  ◦ 2000 years: the use of DNA for very long-term DP [Grass et al., 2015]
  ◦ 13.8 billion years: DNA combined with nanostructured glass storage [Kazansky et al., 2016]
▶ The ultimate question is the return on investment in DP and LTDP, should one lose access to already preserved content
  ◦ Currently proposed preventive measure: develop scalable methodologies of context-aware content interpretation by monitoring semantic vs. conceptual drifts
Evolving Semantics & LTDP Horizons
▶ We address evolving semantics from two perspectives:
  ◦ Change-sensitive ontologies necessitate logic
  ◦ Scalability and the distributed nature of content call for statistical processing
▶ These two major components complement and inform each other and become tools of the model-driven DP paradigm
  ◦ E.g. collection-specific domain ontologies and change monitoring options help appraisal
The PERICLES Approach
▶ Time-dependent content displacement in vector space affects categorization & retrieval
▶ Model such content dynamics on a vector field, by metaphorical use of physical concepts
▶ Somoclu (Self-Organizing Map Over a CLUster), using the ESOM (Emergent Self-Organizing Map) algorithm
  ◦ Fastest massively parallel open-source SOM implementation available, developed in PERICLES
PERICLES Tools - Somoclu
https://github.com/peterwittek/somoclu
▶ Reduces a high-dimensional space to a low-dimensional one (2-d)
▶ Preserves local topology
▶ Suitable for drift detection of feature/object locations
▶ After training, each data instance has a node (Best Matching Unit, BMU) on the map
▶ Intense colours on the map indicate large distances between the original data points
Somoclu and Self-Organizing Maps
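The two ideas on this slide, the Best Matching Unit and the distance-based colouring of the map, can be illustrated with a minimal pure-Python sketch. It does not use Somoclu itself: the 2x2 codebook is hand-picked rather than trained, and the helper names `best_matching_unit` and `u_matrix` are hypothetical, chosen only for this illustration.

```python
import math


def euclidean(a, b):
    """Euclidean distance between two equal-length vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def best_matching_unit(codebook, vector):
    """Return (row, col) of the map node whose weight vector is closest to `vector`."""
    cells = ((r, c) for r in range(len(codebook)) for c in range(len(codebook[0])))
    return min(cells, key=lambda rc: euclidean(codebook[rc[0]][rc[1]], vector))


def u_matrix(codebook):
    """Mean distance from each node to its 4-connected grid neighbours.

    High values correspond to the 'intense colours' on the map.
    """
    rows, cols = len(codebook), len(codebook[0])
    result = [[0.0] * cols for _ in range(rows)]
    for r in range(rows):
        for c in range(cols):
            neighbours = [
                (r + dr, c + dc)
                for dr, dc in ((-1, 0), (1, 0), (0, -1), (0, 1))
                if 0 <= r + dr < rows and 0 <= c + dc < cols
            ]
            total = sum(euclidean(codebook[r][c], codebook[nr][nc]) for nr, nc in neighbours)
            result[r][c] = total / len(neighbours)
    return result


# Toy 2x2 map over 3-dimensional data (assumed weights, not a trained map).
codebook = [
    [[0.0, 0.0, 0.0], [0.0, 0.0, 1.0]],
    [[1.0, 0.0, 0.0], [5.0, 5.0, 5.0]],
]
data = [[0.1, 0.0, 0.9], [4.0, 5.0, 6.0]]

bmus = [best_matching_unit(codebook, v) for v in data]
heights = u_matrix(codebook)
```

In this toy map the node at row 1, column 1 sits far from its neighbours in the original space, so its U-matrix value is much higher than that of the node at row 0, column 0.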
Drift Detection Workflow
Problem
▶ To measure semantic drift in ontologies across time & versions
▶ Related to ontology evolution, versioning, drift/shift/decay
PERICLES Tools - SemaDrift
▶ A suite of tools for measuring drift in ontologies across time/versions
  ◦ SemaDrift Library (API)
  ◦ SemaDrift Protégé Plugin (GUI desktop application)
  ◦ SemaDrift FX (GUI desktop application)
▶ Cross-domain, no prior programming knowledge required
▶ Apache License 2.0
▶ Two proof-of-concept use cases: Tate and OWL-S
PERICLES Tools - SemaDrift
SemaDrift Workflow
▶ Input: series of ontologies
▶ For all concepts: compare each concept to all concepts of the next ontology
  ◦ Collection of rdfs:labels > Concept-to-concept Label Drift
  ◦ Collection of property triples > Concept-to-concept Intensional Drift
  ◦ Collection of instance URIs > Concept-to-concept Extensional Drift
▶ Concept-to-concept Whole Drift: average of label, intensional and extensional drift
▶ Output: average Label, Intension and Extension Drift for the series (average of all concepts)
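The workflow above (collect labels, property triples and instance URIs per concept, compare each concept with all concepts of the next version, then average) can be sketched roughly as follows. This is not the SemaDrift API. It assumes that each aspect is compared with Jaccard set similarity and that the series average is taken over each concept's best match in the next version; both choices, the dictionary layout, and all function names are assumptions made for illustration. The scores are similarities, so lower values mean more drift.

```python
from statistics import mean

ASPECTS = ("labels", "properties", "instances")


def jaccard(a, b):
    """Jaccard similarity of two collections; two empty sets count as identical."""
    a, b = set(a), set(b)
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)


def concept_similarity(old, new):
    """Label, intensional and extensional scores plus their average ('whole')."""
    scores = {k: jaccard(old[k], new[k]) for k in ASPECTS}
    scores["whole"] = mean(scores[k] for k in ASPECTS)
    return scores


def compare_versions(old_onto, new_onto):
    """Compare each concept of one version to all concepts of the next version."""
    return {
        (o, n): concept_similarity(old_c, new_c)
        for o, old_c in old_onto.items()
        for n, new_c in new_onto.items()
    }


def series_average(old_onto, new_onto):
    """Average the scores of each old concept's best match (assumed strategy)."""
    pairs = compare_versions(old_onto, new_onto)
    best = [
        max((pairs[(o, n)] for n in new_onto), key=lambda s: s["whole"])
        for o in old_onto
    ]
    return {k: mean(s[k] for s in best) for k in ASPECTS + ("whole",)}


# Hypothetical two-version toy ontology: half of the instances move to a new concept.
v1 = {
    "Landscape": {
        "labels": ["landscape"],
        "properties": ["depicts", "hasPeriod"],
        "instances": ["a1", "a2", "a3", "a4"],
    },
}
v2 = {
    "Landscape": {
        "labels": ["landscape"],
        "properties": ["depicts", "hasPeriod"],
        "instances": ["a1", "a2"],
    },
    "Seascape": {
        "labels": ["seascape"],
        "properties": ["depicts"],
        "instances": ["a3", "a4"],
    },
}

pairs = compare_versions(v1, v2)
summary = series_average(v1, v2)
```

In the toy data the label and the properties of `Landscape` stay the same while half of its instances move away, so only the extensional score drops.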
SemaDrift GUI Desktop Applications
▶ Extracted offline
▶ In this scenario, extensional drift shows clearly
Morphing Chains
▶ Monitor feature (index term) drifts over time
▶ Apply threshold to at-risk (splitting/merging) index terms
▶ Extract index terms above threshold
▶ Ontology Creation (creation of a Digital Ecosystem Model, DEM)
  SOMOCLU > Propose least volatile terms to be included in the model
▶ Ontology Maintenance
  SOMOCLU+SemaDrift > Assess at-risk terminology, update model, alert user
▶ Appraisal
  SOMOCLU > Extract period-specific objects
Putting it All Together
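The thresholding step can be pictured with a small sketch. It assumes that each index term has one BMU grid position per epoch and that volatility is the total distance the BMU travels between consecutive epochs. The coordinates, the threshold value, and the function names are invented for illustration and are not taken from the PERICLES experiments.

```python
import math


def displacement(path):
    """Total Euclidean distance a term's BMU travels across consecutive epochs."""
    return sum(math.dist(p, q) for p, q in zip(path, path[1:]))


def split_by_volatility(bmu_paths, threshold):
    """Return (at-risk terms, most volatile first; stable terms, least volatile first)."""
    drift = {term: displacement(path) for term, path in bmu_paths.items()}
    at_risk = sorted((t for t in drift if drift[t] > threshold), key=drift.get, reverse=True)
    stable = sorted((t for t in drift if drift[t] <= threshold), key=drift.get)
    return at_risk, stable


# Hypothetical BMU positions of three index terms over three epochs.
bmu_paths = {
    "natural": [(2, 3), (2, 3), (6, 7)],
    "inland": [(9, 1), (8, 2), (6, 7)],
    "portrait": [(0, 0), (0, 1), (0, 1)],
}

at_risk, stable = split_by_volatility(bmu_paths, threshold=3.0)
```

Terms in `at_risk` would be candidates for ontology maintenance, while terms in `stable` correspond to the least volatile terms proposed for inclusion in the model.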
Tate Collection Statistics for Drift Analysis
▶ Catalogue available as open data for 69,202 artworks in JSON format (53,698 time-stamped)
▶ Indexed by Tate’s own hierarchical subject index (three levels, from general to specific index terms)
▶ Two acquisition peaks: 1796-1844 (33,625 artworks) and 1960-2009 (12,756 artworks), broken down into 10 five-year epochs each
▶ 46,381 artworks in the experiment
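A minimal sketch of how time-stamped artworks could be binned into five-year epochs is shown below. The function names and the sample years are hypothetical, and the sketch assumes each artwork carries a single year. Note that ten five-year epochs starting in 1796 run through 1845.

```python
def epoch_label(year, start, span=5):
    """Label of the epoch containing `year`, counting `span`-year bins from `start`."""
    low = start + ((year - start) // span) * span
    return f"{low}-{low + span - 1}"


def count_by_epoch(years, start, n_epochs=10, span=5):
    """Count years per epoch, ignoring years outside the covered period."""
    end = start + n_epochs * span  # exclusive upper bound
    counts = {}
    for year in years:
        if start <= year < end:
            label = epoch_label(year, start, span)
            counts[label] = counts.get(label, 0) + 1
    return counts


# Hypothetical acquisition years; 1850 falls outside the first peak and is skipped.
counts = count_by_epoch([1796, 1800, 1801, 1850], start=1796)

# The two peaks add up to the experiment size stated on the slide.
total_in_experiment = 33625 + 12756
```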
A Typical Tate Artefact: J.E. Millais’ Ophelia...
...and its subject index metadata (excerpt)
J.E.Millais’ Ophelia in the Domain Ontology
Semantic Drift Example: Concept Splitting in the Tate Collection
With semantic relations impacted...
Drift Detection at Work
...ontology relations also change
Another Type of Change in Conceptual Coherence: natural and inland merge
Epochs shown: 1796-1800 and 1801-1805
▶ Statistical analysis of scalable collections reveals content dislocations over time
▶ Such drifts are the norm, not the exception
▶ They influence future access to content through their impact on the efficiency of information retrieval and classification
▶ As a remedy, drift detection can trigger ontology maintenance and artefact (object) appraisal through designated workflows
Conclusions