Enrichment and Europeana

Europeana and Enrichment

Antoine Isaac

Europeana PGA meeting

Sept 25th, 2014

Semantic extraction?

Recognizing and extracting named entities and keywords, analyzing the sentiment of a document, extracting facts and relation between those facts and named entities, categorizing documents, recognizing and extracting concepts and finally adding them as metadata or annotations.

Market study on technical options for semantic feature extraction

http://pro.europeana.eu/web/network/europeana-tech/-/wiki/Main/Market+study+on+technical+options+for+semantic+feature+extraction

Semantic enrichment?

In a linked data environment, enrichment refers to the creation of new links between the enriched resources and another data resource. […] link to controlled vocabularies or authority files (contextualization)

Automatic Enrichments with Controlled Vocabularies in Europeana: Challenges and Consequences

Stiller, Petras, Gäde, Isaac. Euromed 2014

Semantic enrichment?

• Analysis: the pre-enrichment phase focuses on the analysis of the metadata fields in the original resource descriptions, the selection of potential resources to be linked to and derives rules to match and link the original fields to the contextual resource.

• Linking: the process of automatically matching the values of the metadata fields to values of the contextual resources and adding contextual links (whose values are most often based on equivalent relationships) to the dataset.

• Augmentation: the process of selecting the values from the contextual resource to be added to the original object description. This might not only include (multilingual) synonyms of terms to be enriched but also further information, for example broader or narrower concepts.

Automatic Enrichments with Controlled Vocabularies in Europeana: Challenges and Consequences

Euromed 2014
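The three phases above can be sketched in a few lines of Python. This is a minimal illustration only, not the Europeana implementation: the vocabulary, its example.org URIs, the field names and the exact-label matching rule are all assumptions made for the example.

```python
# Hypothetical in-memory vocabulary: URI -> labels per language and broader concept.
VOCABULARY = {
    "http://example.org/concept/painting": {
        "labels": {"en": "painting", "fr": "peinture", "de": "Malerei"},
        "broader": "http://example.org/concept/visual-arts",
    },
    "http://example.org/concept/visual-arts": {
        "labels": {"en": "visual arts", "fr": "arts visuels"},
        "broader": None,
    },
}

# Analysis: decide which metadata fields are matched (a rule chosen by hand here).
FIELDS_TO_ENRICH = ("subject", "type")


def build_label_index(vocabulary):
    """Map every case-folded label to the URI of its concept."""
    index = {}
    for uri, entry in vocabulary.items():
        for label in entry["labels"].values():
            index[label.casefold()] = uri
    return index


def link(record, label_index):
    """Linking: match field values against vocabulary labels, return (field, uri) pairs."""
    links = []
    for name in FIELDS_TO_ENRICH:
        for value in record.get(name, []):
            uri = label_index.get(value.strip().casefold())
            if uri is not None:
                links.append((name, uri))
    return links


def augment(links, vocabulary):
    """Augmentation: select multilingual labels and the broader concept for each link."""
    extra = []
    for name, uri in links:
        entry = vocabulary[uri]
        extra.append({
            "field": name,
            "uri": uri,
            "labels": dict(entry["labels"]),
            "broader": entry["broader"],
        })
    return extra


# Hypothetical record: only "subject" has a value that matches a vocabulary label.
record = {"title": ["Landscape with trees"], "subject": ["Peinture "], "type": ["image"]}
links = link(record, build_label_index(VOCABULARY))
augmented = augment(links, VOCABULARY)
```

Exact label matching is the simplest possible linking rule; real vocabularies have ambiguous labels, so a production linker would need disambiguation that this sketch leaves out.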

Characteristics of enrichment

• Adding new data on top of existing data
– Unlike normalization, which focuses on syntactic aspects and adds no new semantics

• (Semi-)automatic
– For manual enrichment, see discussion on Annotations

• Connecting to internal or external datasets

Where does it happen?

• Ingestion from providers
– Harvesting metadata and content

• Consolidating Europeana’s "master" database
– De-referencing
– Enrichment

• Leveraging data for search
– Augmenting Solr index
– Query enrichment and translation

Not de-referencing?

• In provider data, a CHO with a link is semantically equivalent to a CHO with a link and the contextual entity materialized next to it

• Just called « richer » (more structured, « semantic ») metadata given by providers
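A small sketch of why de-referencing adds no new link: materializing the contextual entity next to the object leaves the set of links unchanged. The record layout, the `creator` field, the example.org URIs and the local `ENTITY_STORE` (standing in for fetching the URI over the network) are assumptions for illustration.

```python
# Hypothetical local store standing in for a lookup of the linked URI.
ENTITY_STORE = {
    "http://example.org/agent/1": {"prefLabel": {"en": "Example Painter"}},
}


def dereference(record, store):
    """Return a copy of the record with its linked entities materialized next to it."""
    entities = {uri: store[uri] for uri in record["creator"] if uri in store}
    return {**record, "entities": entities}


def links_of(record):
    """The semantic links carried by a record, with or without materialized entities."""
    return set(record["creator"])


cho = {"id": "http://example.org/record/1", "creator": ["http://example.org/agent/1"]}
materialized = dereference(cho, ENTITY_STORE)
```

Both `cho` and `materialized` carry the same link; the second only saves the consumer a lookup.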

Not index augmentation?

• One semantic link can lead to different indexes

• Enrichment shouldn’t be thought of as feeding directly into application/tool-specific databases
– NB: it should be exchangeable

• Yet enrichment should be designed in coordination with what will happen later
– Augmentation is the post-prod of linking
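To illustrate how one semantic link can feed several indexes while staying exchangeable, here is a hedged sketch. The entity, its example.org URI and every index field name are hypothetical and do not reflect the actual Europeana Solr schema.

```python
# Hypothetical contextual entity, as it might look after de-referencing.
PLACE = {
    "uri": "http://example.org/place/paris",
    "labels": {"en": "Paris", "it": "Parigi"},
    "lat": 48.8566,
    "lon": 2.3522,
}

# The enrichment itself: an exchangeable link, independent of any application.
ENRICHMENT = {
    "record": "http://example.org/record/1",
    "field": "spatial",
    "target": PLACE["uri"],
}


def text_index_fields(entity):
    """Fields for a full-text index: one label field per language."""
    return {f"place_label_{lang}": label for lang, label in sorted(entity["labels"].items())}


def geo_index_fields(entity):
    """Fields for a map-based index: a single 'lat,lon' string."""
    return {"place_latlon": f"{entity['lat']},{entity['lon']}"}


def facet_index_fields(entity):
    """Fields for a facet index: the entity URI as an opaque key."""
    return {"place_uri": entity["uri"]}


# Three application-specific views derived from the same link.
text_fields = text_index_fields(PLACE)
geo_fields = geo_index_fields(PLACE)
facet_fields = facet_index_fields(PLACE)
```

Only `ENRICHMENT` would be exchanged between systems; each application is free to derive its own index fields from it.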

Not query enrichment/translation?

• Tools used may be the same (NLP)
• But the evaluation criteria change
• These enrichments are ‘lost’, not exchangeable

Ground material for enrichment

Metadata is the primary focus of most efforts

Content can also be used

• Extraction of visual features
– Text transcription
– Map alignment
– Image-based similarity (Ecreative)

• Extraction of audio features (ESounds)

Linking is king

• Object/object
– Cross-dataset de-duplication: equivalence/similarity links
– Other relations: derivation, part-of, FRBRization
– Clustering into hierarchical objects or collections
– NB: neglected, though Europeana can contribute something

• Object/Context
– Agents
– Concepts
– Places
– Periods and Events
– Documentation, e.g., Wikipedia articles

• Context/Context (vocabulary alignment)
– Matching concepts

Europeana enrichment

• Bringing multilingual, structured data
• Collaborative/strategy aspect
• Likely to interest providers (Einside)

Should we be interested in other kinds of enrichment?

• Non-semantic tagging with simple words
• Translation
• Named entity recognition
• Language detection for metadata fields
• Group editing, when not actioned by providers

Europeana-related projects in the picture

• Object/object
– De-duplication: equivalence/similarity links
– Other relations: derivation (ESounds), part-of, FRBRization (TEL)
– Clustering (EF-OCLC)

• Object/Context
– Agents
– Concepts (PATHS, EConnect, LOCloud, MIMO)
– Places (EConnect, LOCloud)
– Periods and Events (PATHS, ECloud)
– Documentation, e.g., Wikipedia articles (PATHS, LOCloud)

• Vocabulary alignment
– EConnect (Amalgame), EFG, EUScreen, ATHENAplus?, PartagePlus

• Non-semantic tagging with simple words
• Translation
• Named entity recognition
• Language detection for metadata fields
• Group editing, when not actioned by provider (Esounds)

Other categories?

Next steps?

• Agree on categories
• Agree on APIs for enrichment services
• Address post-processes for applications (Solr indexing)
• Evaluation
– Informativeness measure, completeness
• Showing it?

APIs for enrichment services

• Input: record, field, collection?
– Meta-enrichers

• Problem: API results often assume application needs and return data elements deemed useful beyond the URI of the entity. They are APIs for enrichment + de-referencing.

• Keeping track of provenance (data field, version of enrichment tool…)

• Example of Sounds music information retrieval
• Exchanging enrichment data. Cf. EDMpaths
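One possible shape for an enrichment service result that returns only the entity URI plus provenance, leaving de-referencing to a separate step. This is a sketch under assumptions: the `Enrichment` class, its field names and the JSON exchange format are hypothetical, not an agreed Europeana API.

```python
import json
from dataclasses import asdict, dataclass


@dataclass(frozen=True)
class Enrichment:
    """A single link plus the provenance needed to trace or redo it."""
    record_id: str      # record that was enriched
    source_field: str   # metadata field the value came from
    source_value: str   # original value that was matched
    target_uri: str     # URI of the linked entity, and nothing beyond it
    tool: str           # name of the enrichment tool (hypothetical)
    tool_version: str   # version of the enrichment tool


def to_exchange_json(enrichments):
    """Serialize enrichments to a JSON array that another system can load."""
    return json.dumps([asdict(e) for e in enrichments], sort_keys=True)


def from_exchange_json(text):
    """Rebuild Enrichment objects from the exchanged JSON array."""
    return [Enrichment(**item) for item in json.loads(text)]


example = Enrichment(
    record_id="http://example.org/record/1",
    source_field="subject",
    source_value="peinture",
    target_uri="http://example.org/concept/painting",
    tool="example-linker",
    tool_version="0.1",
)
payload = to_exchange_json([example])
restored = from_exchange_json(payload)
```

Keeping labels, coordinates and other entity data out of the result keeps the service about enrichment only; an application that needs them can de-reference `target_uri` itself.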

Example: Europeana enrichment console prototype

Antoine Isaac

[email protected]

@EuropeanaTech

Date post:	11-Jun-2015
Category:	Technology
Upload:	antoine-isaac