Europeana and Enrichment
Antoine Isaac
Europeana PGA meeting
Sept 25th, 2014
Semantic extraction?
Recognizing and extracting named entities and keywords, analyzing the sentiment of a document, extracting facts and relation between those facts and named entities, categorizing documents, recognizing and extracting concepts and finally adding them as metadata or annotations.
Market study on technical options for semantic feature extraction
http://pro.europeana.eu/web/network/europeana-tech/-/wiki/Main/Market+study+on+technical+options+for+semantic+feature+extraction
Semantic enrichment?
In a linked data environment, enrichment refers to the creation of new links between the enriched resources and another data resource. […] link to controlled vocabularies or authority files (contextualization)
Automatic Enrichments with Controlled Vocabularies in Europeana: Challenges and Consequences
Stiller, Petras, Gäde, Isaac. Euromed 2014
Semantic enrichment?
• Analysis: the pre-enrichment phase focuses on the analysis of the metadata fields in the original resource descriptions, the selection of potential resources to be linked to and derives rules to match and link the original fields to the contextual resource.
• Linking: the process of automatically matching the values of the metadata fields to values of the contextual resources and adding contextual links (whose values are most often based on equivalent relationships) to the dataset.
• Augmentation: the process of selecting the values from the contextual resource to be added to the original object description. This might not only include (multilingual) synonyms of terms to be enriched but also further information, for example broader or narrower concepts.
Automatic Enrichments with Controlled Vocabularies in Europeana: Challenges and Consequences
Euromed 2014
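The linking and augmentation steps can be illustrated with a minimal sketch. Everything here is an assumption made for illustration: the `Concept` class, the toy `VOCABULARY`, the `example.org` URIs and the exact-match rule on normalized labels are hypothetical and do not describe Europeana's actual enrichment code. The choice of which fields to match (here `subject`) stands in for the rules derived in the analysis phase.

```python
from dataclasses import dataclass, field


@dataclass
class Concept:
    uri: str
    labels: dict                      # language code -> label
    broader: list = field(default_factory=list)


# Hypothetical contextual resource: a toy controlled vocabulary.
VOCABULARY = [
    Concept("http://example.org/concept/painting",
            {"en": "painting", "fr": "peinture", "de": "Malerei"},
            broader=["http://example.org/concept/visual-arts"]),
    Concept("http://example.org/concept/sculpture",
            {"en": "sculpture", "de": "Skulptur"}),
]


def normalize(value):
    """Syntactic clean-up only: trim whitespace and ignore case."""
    return value.strip().casefold()


def link(record, fields, vocabulary):
    """Linking: match field values against vocabulary labels.

    Returns a list of (field name, concept URI) pairs.
    """
    index = {}
    for concept in vocabulary:
        for label in concept.labels.values():
            index.setdefault(normalize(label), concept)
    links = []
    for name in fields:
        for value in record.get(name, []):
            concept = index.get(normalize(value))
            if concept is not None:
                links.append((name, concept.uri))
    return links


def augment(links, vocabulary):
    """Augmentation: select values from the linked concepts to add to the record."""
    by_uri = {concept.uri: concept for concept in vocabulary}
    extra = {"labels": set(), "broader": set()}
    for _, uri in links:
        concept = by_uri[uri]
        extra["labels"].update(concept.labels.values())
        extra["broader"].update(concept.broader)
    return extra


record = {"title": ["Landscape"], "subject": ["Peinture ", "unknown term"]}
links = link(record, ["subject"], VOCABULARY)
extra = augment(links, VOCABULARY)
```

In this sketch the link itself is only a URI; the multilingual labels and broader concepts are pulled in afterwards by `augment`, which keeps the two steps separable.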
Characteristics of enrichment
• Adding new data on top of existing data
– By contrast, normalization focuses on syntactic aspects, with no addition of new semantics
• (Semi-)automatic
– For manual enrichment, see discussion on Annotations
• Connecting to internal or external datasets
Where does it happen?
• Ingestion from providers
– Harvesting metadata and content
• Consolidating Europeana’s "master" database
– De-referencing
– Enrichment
• Leveraging data for search
– Augmenting Solr index
– Query enrichment and translation
Not de-referencing?
• In provider data, it is semantically equivalent to have a CHO with link or a CHO with link and contextual entity materialized next to it
• Just called « richer » (more structured, « semantic ») metadata given by providers
Not index augmentation?
• One semantic link can lead to different indexes
• Enrichment shouldn’t be considered to feed directly into application/tool-specific databases
– NB: it should be exchangeable
• Yet enrichment should be designed in coordination with what will happen later
– Augmentation is the post-prod of linking
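A minimal sketch of how one semantic link can feed several indexes while the enrichment itself stays exchangeable. The field names (`concept_uri`, `concept_label_<lang>`, `concept_broader_uri`) and the `example.org` URIs are hypothetical, chosen for illustration, and are not Europeana's actual Solr schema.

```python
def index_fields(link_uri, entity):
    """Derive several application-specific index fields from a single link."""
    fields = {"concept_uri": [link_uri]}                      # exact match / faceting
    for lang, label in sorted(entity["labels"].items()):
        fields[f"concept_label_{lang}"] = [label]             # per-language search
    fields["concept_broader_uri"] = list(entity.get("broader", []))  # hierarchical browsing
    return fields


# The exchangeable enrichment is just the link between object and concept.
enrichment = {
    "cho": "http://example.org/cho/1",
    "target": "http://example.org/concept/painting",
}

# Data about the linked entity, obtained separately (for example by de-referencing).
entity = {
    "labels": {"en": "painting", "fr": "peinture"},
    "broader": ["http://example.org/concept/visual-arts"],
}

doc = index_fields(enrichment["target"], entity)
```

The index document is derived after the fact, so a different application could build different fields from the same `enrichment` without re-running the linking.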
Not query enrichment/translation?
• Tools used may be the same (NLP)
• But the evaluation criteria change
• These enrichments are ‘lost’, not exchangeable
Ground material for enrichment
Metadata is the primary focus of most efforts
Content can also be used
• Extraction of visual features
– Text transcription
– Map alignment
– Image-based similarity (Ecreative)
• Extraction of audio features (ESounds)
Linking is king
• Object/object
– Cross-dataset de-duplication: equivalence/similarity links
– Other relations: derivation, part-of, FRBRization
– Clustering into hierarchical objects or collections
– NB: neglected, though Europeana can contribute something
• Object/Context
– Agents
– Concepts
– Places
– Periods and Events
– Documentation, e.g., Wikipedia articles
• Context/Context (vocabulary alignment)
– Matching concepts
Europeana enrichment
• Bringing multilingual, structured data
• Collaborative/strategy aspect
• Likely to interest providers (Einside)
Should we be interested in other kinds of enrichment?
• Non-semantic tagging with simple words
• Translation
• Named entity recognition
• Language detection for metadata fields
• Group editing, when not actioned by providers
Europeana-related projects in the picture
• Object/object
– De-duplication: equivalence/similarity links
– Other relations: derivation (ESounds), part-of, FRBRization (TEL)
– Clustering (EF-OCLC)
• Object/Context
– Agents
– Concepts (PATHS, EConnect, LOCloud, MIMO)
– Places (EConnect, LOCloud)
– Periods and Events (PATHS, ECloud)
– Documentation, e.g., Wikipedia articles (PATHS, LOCloud)
• Vocabulary alignment
– EConnect (Amalgame), EFG, EUScreen, ATHENAplus?, PartagePlus
• Non-semantic tagging with simple words
• Translation
• Named entity recognition
• Language detection for metadata fields
• Group editing, when not actioned by provider (ESounds)
Other categories?
Next steps?
• Agree on categories
• Agree on APIs for enrichment services
• Addressing post-processes for applications (Solr indexing)
• Evaluation
– Informativeness measure, completeness
• Showing it?
APIs for enrichment services
• Input: record, field, collection?
– Meta-enrichers
• Problem: API results often make assumptions about application needs and about which data elements are useful beyond the URI of the entity: they are APIs for enrichment + de-referencing
• Keeping track of provenance (data field, version of enrichment tool…)
• Example of Sounds music information retrieval
• Exchanging enrichment data. Cf EDMpaths
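As a hedged illustration of the points above, an enrichment service could return only the entity URI together with provenance, leaving de-referencing to a later step. The `Enrichment` class, its field names, the tool name and the JSON layout are all hypothetical; they sketch one possible exchange format, not an agreed API.

```python
import json
from dataclasses import dataclass, asdict


@dataclass(frozen=True)
class Enrichment:
    record_id: str       # the enriched record
    source_field: str    # provenance: which data field was matched
    source_value: str    # provenance: the original value that was matched
    target_uri: str      # the linked entity, URI only (no de-referenced data)
    tool: str            # provenance: enrichment tool name
    tool_version: str    # provenance: enrichment tool version


def to_exchange_format(enrichments):
    """Serialize enrichments to JSON so they can be exchanged between systems."""
    return json.dumps([asdict(e) for e in enrichments], sort_keys=True)


def from_exchange_format(payload):
    """Rebuild enrichments from the JSON produced by to_exchange_format."""
    return [Enrichment(**item) for item in json.loads(payload)]


example = Enrichment(
    record_id="http://example.org/cho/1",
    source_field="subject",
    source_value="Peinture",
    target_uri="http://example.org/concept/painting",
    tool="toy-enricher",        # hypothetical tool name
    tool_version="0.1",
)
payload = to_exchange_format([example])
restored = from_exchange_format(payload)
```

Keeping the record to a link plus provenance means a consumer can decide for itself which data elements to fetch from the target URI.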
Example: Europeana enrichment console prototype