Metadata Provenance

Post on 30-Jun-2015


Two motivating scenarios for metametadata


1DCMI Metadata Provenance

Metadata Provenance
Two motivating scenarios for metametadata

Kai Eckert
Mannheim University Library

Michael Panzer
OCLC

DCMI Metadata Provenance
F2F Meeting and Workshop

October 20th, 2010
Pittsburgh, PA, USA

Metametadata

Provenance information outside of existing data models

"Transparent"

Potential use cases:

Whenever you have lots of legacy data in a model that does not support provenance.

Whenever new applications require information that cannot be expressed in the existing data model.

Need for Metametadata

Metadata are also data, so we need additional data about them.

→ Metametadata

Metadata about a whole metadata record, not for single statements:

Who created this metadata record?

When was this record created?

…

→ Metadata Provenance

Statements about (single) statements

Often proposed, but with only vague instructions on how to implement them.

Needed if metadata records are created by combining single statements from different sources.

Needed for storing arbitrary additional information about single statements that cannot easily be represented in the metadata format.

Metametadata vs. Model based provenance

Simple statement: Peter knows Paul.

Provenance information: This statement is made by Mary.

[Figure: the statement "Peter knows Paul"; on the metalevel, Mary "says" this statement.]

Data model extension

[Figure: Peter "has relation" Relation; Relation "has object" Paul, "has creator" Mary, and "has type" Knows Relation.]

Simple statement: Peter knows Paul.

Provenance information: This statement is made by Mary.
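The contrast between the metalevel approach and the data model extension can be sketched in a few lines of Python. This is only an illustration: the class names `Statement` and `Relation` and their fields are assumptions made for this sketch and do not come from the slides.

```python
from dataclasses import dataclass

# Metalevel approach: the base statement stays untouched; provenance is a
# second statement that refers to the first statement as a whole.
@dataclass(frozen=True)
class Statement:
    subject: object
    predicate: str
    obj: object

base = Statement("Peter", "knows", "Paul")
meta = Statement("Mary", "says", base)  # a statement about a statement

# Data model extension: the relation becomes an entity of its own that
# carries type, object and creator, so "Peter knows Paul" is no longer a
# single plain statement in the data.
@dataclass(frozen=True)
class Relation:
    rel_type: str
    obj: str
    creator: str

extended = {"Peter": [Relation("knows", "Paul", "Mary")]}

print(meta)
print(extended)
```

In the first variant an application that ignores `meta` still sees the plain statement; in the second variant every consumer has to understand the `Relation` entity.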

[Figure: both diagrams side by side. Data model extension: Peter, Relation, Paul, Mary, Knows Relation (has relation, has object, has creator, has type). Metalevel: Peter knows Paul; Mary says.]

Implementation in RDF

This should not be limited to RDF!

But it is a good example, and RDF currently has a high impact.

RDF provides no satisfying answer on how to express provenance information.

Different possible implementations, e.g.:

Reification

Named Graphs

Extended data models

...

RDF Reification

RDF supports statements about statements by means of reification, literally "objectification" (actually a "subjectification"...).

"The book is written by Goethe" is said by Kai.

How it is done in RDF:

Subject    Predicate      Object
ex:someID  rdf:type       rdf:Statement .
ex:someID  rdf:subject    "The book" .
ex:someID  rdf:predicate  ex:isWrittenBy .
ex:someID  rdf:object     "Goethe" .
ex:someID  ex:isSaidBy    "Kai" .
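A minimal Python sketch of the same pattern, using plain tuples instead of an RDF library. The helper name `reify` is hypothetical; the identifiers (`ex:someID`, `ex:isWrittenBy`, `ex:isSaidBy`) are the ones from the slide.

```python
def reify(statement_id, subject, predicate, obj):
    """Return the four triples that describe one statement as an rdf:Statement."""
    return [
        (statement_id, "rdf:type", "rdf:Statement"),
        (statement_id, "rdf:subject", subject),
        (statement_id, "rdf:predicate", predicate),
        (statement_id, "rdf:object", obj),
    ]

triples = reify("ex:someID", '"The book"', "ex:isWrittenBy", '"Goethe"')

# The provenance statement uses the reification node as its subject.
triples.append(("ex:someID", "ex:isSaidBy", '"Kai"'))

for s, p, o in triples:
    print(s, p, o, ".")
```

Note that the reification triples only describe the statement; they do not assert it. The plain triple has to be stored separately if it should also be part of the data.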

Simplified Presentation

Based on Notation 3 (RDF/N3)

   Subject  Predicate    Object
1  ex:p123  rdf:type     ex:person
2  ex:p123  ex:hasName   "Kai Eckert"
3  ex:p123  ex:worksFor  ex:unima

Example 1: A simple RDF example

Identification of statements by the line number:

4  #1       dc:creator   "Kai Eckert"

The subject of a statement is a reference to another statement. With this notation, we imply a reification.
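How the line-number notation could be expanded into explicit reification triples is sketched below. This is an assumption about one possible expansion, not a procedure given in the slides: the function name `expand` and the node names `ex:stmt<n>` are made up, and references to lines that are themselves metastatements are not handled.

```python
def expand(lines):
    """Expand '#n' subjects into explicit reification triples.

    `lines` is a list of (subject, predicate, object) tuples; line numbers
    start at 1. The node names 'ex:stmt<n>' are invented for this sketch.
    """
    out = []
    reified = set()
    for subject, predicate, obj in lines:
        if not subject.startswith("#"):
            out.append((subject, predicate, obj))  # ordinary statement
            continue
        n = int(subject[1:])
        node = f"ex:stmt{n}"
        if n not in reified:
            s, p, o = lines[n - 1]  # the statement being referenced
            out += [
                (node, "rdf:type", "rdf:Statement"),
                (node, "rdf:subject", s),
                (node, "rdf:predicate", p),
                (node, "rdf:object", o),
            ]
            reified.add(n)
        out.append((node, predicate, obj))  # the metastatement itself
    return out

lines = [
    ("ex:p123", "rdf:type", "ex:person"),
    ("ex:p123", "ex:hasName", '"Kai Eckert"'),
    ("ex:p123", "ex:worksFor", "ex:unima"),
    ("#1", "dc:creator", '"Kai Eckert"'),
]
expanded = expand(lines)
for triple in expanded:
    print(*triple)
```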

Scenario 1: Crosswalks

Crosswalks define rules for how metadata from one schema are represented in a different schema.

Problems:

Loss of information

Erroneous crosswalks

MARC field                                          Dublin Core element
260$c (Date of publication, distribution, etc.)  →  Date.Created
522 (Geographic Coverage Note)                   →  Coverage.Spatial
300$a (Physical Description)                     →  Format.Extent
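A crosswalk of this kind can be sketched as a lookup table. The three rules are the ones from the table above; the function name `apply_crosswalk` and the sample record (including its values and the unmapped field 500) are invented for illustration.

```python
# Rules from the slide: MARC field -> Dublin Core element.
CROSSWALK = {
    "260$c": "Date.Created",
    "522": "Coverage.Spatial",
    "300$a": "Format.Extent",
}

def apply_crosswalk(marc_record):
    """Map a MARC record to Dublin Core; return (mapped, lost) dictionaries."""
    mapped, lost = {}, {}
    for field, value in marc_record.items():
        if field in CROSSWALK:
            mapped[CROSSWALK[field]] = value
        else:
            lost[field] = value  # no rule: this information is lost
    return mapped, lost

# Invented sample record; field 500 has no rule in the table.
record = {"260$c": "2010", "300$a": "23 p.", "500": "Invented note"}
dc, lost = apply_crosswalk(record)
print(dc)
print(lost)
```

The `lost` dictionary makes the first problem visible: anything without a rule simply disappears from the target record unless it is kept somewhere else.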

Possibilities for Metametadata

Storage of additional information, which would be lost in the target format.

Identification of Crosswalks with version and the specific rule for every generated statement.

Which statements are generated by a specific rule?

Which rule is responsible for a specific (erroneous) statement?

Which data in the originating format was used to generate a specific statement?

Example 1: Crosswalk Data

    Subject          Predicate     Object
 1  ex:docbase/doc1  dc:title      "Example title"
 2  #1               ex:rule       16
 3  #1               ex:crosswalk  3
 4  #1               ex:origin     MARC:245
 5  ex:docbase/doc2  dc:title      "About finding a title"
 6  #5               ex:rule       16
 7  #5               ex:crosswalk  3
 8  #5               ex:origin     MARC:245
 9  ex:docbase/doc3  dc:title      "Lorem ipsum dolor"
10  #9               ex:rule       18
11  #9               ex:crosswalk  3
12  #9               ex:origin     MARC:245
13  #9               ex:origin     MARC:246
14  ex:docbase/doc4  dc:title      "Consetetur Sadipscing"
15  #14              ex:rule       19
16  #14              ex:crosswalk  6
17  #14              ex:origin     xml:/record/description

Example 4: Resulting RDF statements with additional Metametadata
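A conversion step could emit such rows while it runs. The sketch below reproduces rows 9 to 13 of the table; the function `convert` and its parameters are hypothetical and only show where the rule, crosswalk and origin information would be recorded.

```python
def convert(doc, title, rule, crosswalk, origins, start_line):
    """Emit one dc:title statement plus its metametadata rows.

    Rows use the line-number notation: '#<n>' refers to the statement
    that was emitted as line n.
    """
    rows = [(doc, "dc:title", f'"{title}"')]
    ref = f"#{start_line}"
    rows.append((ref, "ex:rule", rule))
    rows.append((ref, "ex:crosswalk", crosswalk))
    # A statement may be built from more than one source field.
    rows += [(ref, "ex:origin", origin) for origin in origins]
    return rows

rows = convert("ex:docbase/doc3", "Lorem ipsum dolor",
               rule=18, crosswalk=3,
               origins=["MARC:245", "MARC:246"], start_line=9)

for number, row in enumerate(rows, start=9):
    print(number, *row)
```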

Crosswalk Updates

Which statements are generated by a given rule and need to be regenerated after an update?

SELECT ?document ?field ?value
WHERE {
  ?t rdf:subject ?document .
  ?t rdf:predicate ?field .
  ?t rdf:object ?value .
  ?t ex:rule 16 .
  ?t ex:crosswalk 3 .
}

document         field                            value
ex:docbase/doc1  http://www.example.org/dc#title  "Example title"
ex:docbase/doc2  http://www.example.org/dc#title  "About finding a title"
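The same selection can be sketched in Python over an in-memory copy of the example data. The dictionary layout and the function name `generated_by` are assumptions for this sketch; a real system would run the SPARQL query against its RDF store.

```python
# One entry per generated statement, with its metametadata attached.
statements = [
    {"document": "ex:docbase/doc1", "field": "dc:title",
     "value": "Example title", "rule": 16, "crosswalk": 3},
    {"document": "ex:docbase/doc2", "field": "dc:title",
     "value": "About finding a title", "rule": 16, "crosswalk": 3},
    {"document": "ex:docbase/doc3", "field": "dc:title",
     "value": "Lorem ipsum dolor", "rule": 18, "crosswalk": 3},
    {"document": "ex:docbase/doc4", "field": "dc:title",
     "value": "Consetetur Sadipscing", "rule": 19, "crosswalk": 6},
]

def generated_by(statements, rule, crosswalk):
    """Return all statements produced by one rule of one crosswalk."""
    return [
        (s["document"], s["field"], s["value"])
        for s in statements
        if s["rule"] == rule and s["crosswalk"] == crosswalk
    ]

# Statements that must be regenerated after rule 16 of crosswalk 3 changes.
affected = generated_by(statements, rule=16, crosswalk=3)
for row in affected:
    print(*row)
```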

Crosswalk Debugging

Which rule is responsible for a given statement and what was the original data?

SELECT ?crosswalk ?rule ?origin
WHERE {
  ?t rdf:subject <ex:docbase/doc1> .
  ?t rdf:predicate dc:title .
  ?t rdf:object "Example title" .
  ?t ex:rule ?rule .
  ?t ex:crosswalk ?crosswalk .
  ?t ex:origin ?origin .
}

crosswalk  rule  origin
3          16    "MARC:245"
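The reverse lookup, from a suspicious statement back to its rule and source fields, might look like this in Python. The `provenance` index and the function `explain` are hypothetical names; the data is copied from the crosswalk example.

```python
# Index from a generated statement to its metametadata.
provenance = {
    ("ex:docbase/doc1", "dc:title", "Example title"):
        {"crosswalk": 3, "rule": 16, "origins": ["MARC:245"]},
    ("ex:docbase/doc3", "dc:title", "Lorem ipsum dolor"):
        {"crosswalk": 3, "rule": 18, "origins": ["MARC:245", "MARC:246"]},
}

def explain(triple):
    """Return (crosswalk, rule, origin) rows for one statement."""
    meta = provenance.get(triple)
    if meta is None:
        return []  # no metametadata recorded for this statement
    return [(meta["crosswalk"], meta["rule"], origin)
            for origin in meta["origins"]]

for row in explain(("ex:docbase/doc1", "dc:title", "Example title")):
    print(*row)
```

As in the SPARQL query, a statement built from several source fields yields one row per origin.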

Scenario 2: Different Sources for Metadata

Manual indexing is costly.

Many documents are not indexed at all or not searchable:

Journal articles

Externally owned documents

Working papers

Webpages

New sources for metadata?

New ways for document indexing

Automatic processes

Tagging

(Automatic) mapping of metadata from external sources

Problem: Lack of quality

How do you integrate these data from different sources without compromising the retrieval quality?

Possibilities for Metametadata

Storage of the source of single statements.

Storage of further source-specific information:

Weighting for automatically generated subject headings.

Number of users who tagged a document with a given tag.

The original subject heading in case of an automatic translation or mapping.

Can we use this additional information to improve document retrieval?

Example 2: Subject indexing

    Subject                Predicate   Object
 1  ex:docbase/doc1        dc:subject  ex:thes/sub20
 2  #1                     ex:source   ex:sources/autoindex1
 3  #1                     ex:rank     0.55
 4  ex:docbase/doc1        dc:subject  ex:thes/sub30
 5  #4                     ex:source   ex:sources/autoindex1
 6  #4                     ex:rank     0.8
 7  ex:docbase/doc1        dc:subject  ex:thes/sub30
 8  #7                     ex:source   ex:sources/pfeffer
 9  #7                     ex:rank     1.0
10  ex:docbase/doc1        dc:subject  ex:thes/sub40
11  #10                    ex:source   ex:sources/pfeffer
12  #10                    ex:rank     1.0
13  ex:sources/autoindex1  ex:type     ex:types/auto
14  ex:sources/pfeffer     ex:type     ex:types/manual

Example 7: Subject assignments by different sources

Backward compatibility

While there are four assignments for subject headings, the statement "ex:docbase/doc1 dc:subject ex:thes/sub30" is still one statement, regardless of the number of times you put it into your RDF store.

Important for applications that access the RDF data but do not handle RDF reification.

Your metadata remains valid; in particular, there are no duplicates.
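The reason is that an RDF graph is a set of triples. A short Python sketch with the four assignments from the example, modelling the graph as a plain `set` (an illustration only, not an RDF store):

```python
# The four subject assignments from the example, as plain triples.
assignments = [
    ("ex:docbase/doc1", "dc:subject", "ex:thes/sub20"),
    ("ex:docbase/doc1", "dc:subject", "ex:thes/sub30"),  # from autoindex1
    ("ex:docbase/doc1", "dc:subject", "ex:thes/sub30"),  # from pfeffer
    ("ex:docbase/doc1", "dc:subject", "ex:thes/sub40"),
]

# A set keeps each distinct triple once, as an RDF graph does.
graph = set(assignments)

print(len(assignments), "assignments,", len(graph), "distinct statements")
```

An application that reads only the plain triples sees each subject heading once; the information about who assigned it lives in the reification statements.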

Separating the sources

Which statements are made by a specific source (here: Pfeffer)?

SELECT ?document ?subject
WHERE {
  ?t rdf:subject ?document .
  ?t rdf:predicate dc:subject .
  ?t rdf:object ?subject .
  ?t ex:source <ex:sources/pfeffer> .
}

document         subject
ex:docbase/doc1  ex:thes/sub30
ex:docbase/doc1  ex:thes/sub40
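Separating the sources amounts to grouping the annotated statements by their `ex:source` value. A minimal Python sketch under that assumption; the tuple layout and the variable names are invented, the data is from the subject indexing example.

```python
from collections import defaultdict

# (document, subject heading, source) for each assignment in the example.
annotated = [
    ("ex:docbase/doc1", "ex:thes/sub20", "ex:sources/autoindex1"),
    ("ex:docbase/doc1", "ex:thes/sub30", "ex:sources/autoindex1"),
    ("ex:docbase/doc1", "ex:thes/sub30", "ex:sources/pfeffer"),
    ("ex:docbase/doc1", "ex:thes/sub40", "ex:sources/pfeffer"),
]

# Group the assignments by the source that made them.
by_source = defaultdict(list)
for document, subject, source in annotated:
    by_source[source].append((document, subject))

pfeffer = by_source["ex:sources/pfeffer"]
for row in pfeffer:
    print(*row)
```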

Extended queries

Use all manually created subject headings.

Use all subject headings with a rank > 0.7.

SELECT DISTINCT ?document ?subject
WHERE {
  ?t rdf:subject ?document .
  ?t rdf:predicate dc:subject .
  ?t rdf:object ?subject .
  ?t ex:source ?source .
  ?source ex:type ?type .
  ?t ex:rank ?rank .
  FILTER ( ?type = <ex:types/manual> || ?rank > 0.7 )
}

document         subject
ex:docbase/doc1  ex:thes/sub30
ex:docbase/doc1  ex:thes/sub40
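The filter logic of this query can be traced in a small Python sketch. The names `SOURCE_TYPE` and `select` and the tuple layout are assumptions for this illustration; the data and the threshold 0.7 are taken from the example.

```python
# Source types from lines 13 and 14 of the example.
SOURCE_TYPE = {
    "ex:sources/autoindex1": "auto",
    "ex:sources/pfeffer": "manual",
}

# (document, subject heading, source, rank) for each assignment.
annotated = [
    ("ex:docbase/doc1", "ex:thes/sub20", "ex:sources/autoindex1", 0.55),
    ("ex:docbase/doc1", "ex:thes/sub30", "ex:sources/autoindex1", 0.8),
    ("ex:docbase/doc1", "ex:thes/sub30", "ex:sources/pfeffer", 1.0),
    ("ex:docbase/doc1", "ex:thes/sub40", "ex:sources/pfeffer", 1.0),
]

def select(annotated, min_rank=0.7):
    """Keep manual assignments and automatic ones above min_rank, without repeats."""
    result = []
    for document, subject, source, rank in annotated:
        if SOURCE_TYPE[source] == "manual" or rank > min_rank:
            if (document, subject) not in result:  # mirrors SELECT DISTINCT
                result.append((document, subject))
    return result

selected = select(annotated)
for row in selected:
    print(*row)
```

The heading `ex:thes/sub20` is dropped because it comes only from the automatic indexer with a rank of 0.55; `ex:thes/sub30` qualifies twice but is reported once.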

Conclusion

Many applications in the library field can be realized with metametadata.

No change to the underlying data models is needed.

But:

Reification is not well accepted in the community.

Named graphs are not (yet) part of the RDF standard.

...

Existing approaches are usable, but users need more guidance on how to implement them.

Metametadata is not always the appropriate solution (metalevel complexity vs. data model complexity).