http://creativecommons.org/licenses/by-sa/3.0
Tracking changes through EARMARK: a theoretical perspective and an implementation
Silvio Peroni – [email protected]
Francesco Poggi – [email protected]
Fabio Vitali – [email protected]
1st International Workshop on Document Changes: Modeling, Detection, Storage and Visualization (http://diff.cs.unibo.it/dchanges2013/) @ DocEng 2013, Florence, Italy - September 10, 2013
Outline
• Documents and their changes in time through FRBR
• Change tracking and provenance data of XML documents
• Defining multi-hierarchy documents through EARMARK
• EARMARK Changes Ontology and its application
• Querying and byte-counting
• Conclusions
Documents do change in time
• Any creative act that produces a text
✦ starts from a particular draft made by someone at a certain time
✦ is then modified through consecutive revisions
✦ may end up being forked into different variants
✦ may be modified by additional editorial activities such as typo-fixing, shortening, restructuring, etc.
• Importance of keeping track of changes
✦ Computer Science: to show how programming code or computational models evolve throughout the natural lifecycle of software development
✦ Philology: to show how variant copies of the same book overlap in time and content
✦ Scientific Publishing: to understand the extent and quality of the modifications, which drive the final acceptance or rejection of a paper
✦ etc.
About (textual) documents and their changes
A document is more than the string it is composed of
• Alice produces a hand-written document on a piece of paper, composed by the string “Hello world.”
• Bob produces a digital document through a word-processor, composed by the string “Hello world.”
• Different documents, same string
A document refers to different strings in time
• Charles modifies Bob’s document, producing a new version composed by the string “Hey, hello world!”
• To a certain extent, the same document with different strings
Introducing FRBR for change tracking
• “Functional Requirements for Bibliographic Records (FRBR) is a conceptual entity-relationship model developed by the International Federation of Library Associations and Institutions (IFLA) that relates user tasks of retrieval and access in online library catalogues and bibliographic databases from a user’s perspective.”
• According to FRBR, “document” is an overloaded word that is better replaced by four distinct concepts:
✦ Work, coupled with the concept of identity
✦ Expression, to record the evolution in time
✦ Manifestation, which specifies the form and format
✦ Item, identifying the concrete object
from http://en.wikipedia.org/wiki/Functional_Requirements_for_Bibliographic_Records
FRBR Work layer: a distinct intellectual or artistic creation
• the abstract conceptualisation of a document
FRBR Expression layer: the intellectual or artistic realisation of a work in the form of alpha-numeric, musical, or choreographic notation, sound, image, object, movement, etc., or any combination of such forms
• the string of characters constituting the content of a document
• Expressions never change in time
• Bob’s “Hello world.” and Charles’ “Hey, hello world!” are both realisations of the same Work; along the time axis, Charles’ Expression is a revision of Bob’s
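The two layers can be illustrated with a minimal Python sketch. The class names, the `history` helper and the work title are hypothetical and only mirror the slide's vocabulary (realisation of, revision of); they are not part of FRBR or of any EARMARK tool.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Work:
    """Abstract conceptualisation of a document (the title is a made-up label)."""
    title: str


@dataclass(frozen=True)
class Expression:
    """An immutable realisation of a Work; a change yields a new Expression."""
    realisation_of: Work
    content: str
    author: str
    revision_of: Optional["Expression"] = None


def history(expression):
    """Follow the revision_of links back to the first draft, oldest first."""
    chain = []
    while expression is not None:
        chain.append(expression)
        expression = expression.revision_of
    return list(reversed(chain))


work = Work("greeting document")
bob = Expression(work, "Hello world.", "Bob")
charles = Expression(work, "Hey, hello world!", "Charles", revision_of=bob)
```

Because the dataclasses are frozen, an Expression cannot be edited in place, which matches the statement that Expressions never change in time.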
What happens to XML-based markup documents?
• Which changes do we have to keep track of among those that occur in XML documents, and which markup elements are, after edits and document changes,
✦ directly affected
✦ hierarchically affected
✦ completely unaffected
• Research questions:
✦ When a markup element E1 within a document version V1 changes in some way, e.g. by adding something to the text it contains, thereby generating document version V2, are the two instances, E1 in V1 and E2 in V2, to be considered actually the same element?
✦ If the aforementioned instances of E are to be considered different, is the difference meant to be propagated to their ancestor elements as well?
Applying FRBR to XML markup
Elements and text nodes as FRBR Expressions
• Alice’s version: <section> <p>Some <em>interesting</em> content.</p> <p>It was written by me.</p> </section>
• Bob’s version, revised from Alice’s (INS: the text “very” inside em): <section> <p>Some <em>very interesting</em> content.</p> <p>It was written by me.</p> </section>
• Charles’ version, revised from Bob’s (DEL: the second p): <section> <p>Some <em>very interesting</em> content.</p> </section>
• Diagram labels: INS (inserted text), DEL (deleted markup), NEW (element Expressions created by a revision); relations between Expressions: has part, revision of
Who, What, How and When
• An important part of change tracking operations involves keeping track of provenance information
✦ who made the modification
✦ what was modified
✦ how it was modified
✦ when it was modified
• How do we keep track of all these data in practice?
✦ XML-based languages use workarounds to implement overlapping markup
✦ Other solutions?
Provenance of the running example:
• Version made by Alice: June 19, 2013, at 13:45
• Version made by Bob: July 3, 2013, at 04:15
• Text inserted by Bob (“very”): July 3, 2013, at 04:15
• Version made by Charles: July 5, 2013, at 03:33
• Markup deleted by Charles: July 5, 2013, at 03:33
From theory to practice through EARMARK
The Extremely Annotational RDF Markup (EARMARK) is at the same time a markup meta-language and an ontology of (document) markup
✦ More expressive than XML: it allows markup structures to be organised as graphs
✦ It makes it easy to associate annotations, such as change tracking information, with document items: since an EARMARK document is a set of OWL assertions, all the markup items and text nodes are individuals of particular classes identified by an IRI
✦ Several tools are available:
- a Java API
- frameworks to convert XML documents into EARMARK ones and vice versa
more information at http://palindrom.es/phd/research/earmark
Example of EARMARK document
Linearised using Turtle
Source text: Some interesting content. It was written by me.
(full Turtle source of the document available at http://www.essepuntato.it/2013/dchanges)

# Textual content of the document
:content a earmark:StringDocuverse ;
    earmark:hasContent "Some interesting content. It was written by me."^^xsd:string .

# String 'Some '
:r1 a earmark:PointerRange ;
    earmark:refersTo :content ;
    earmark:begins "0"^^xsd:nonNegativeInteger ;
    earmark:ends "5"^^xsd:nonNegativeInteger .

# String 'interesting'
:r2 a earmark:PointerRange ...
# String ' content.'
:r3 a earmark:PointerRange ...
# String 'It was written by me.'
:r4 a earmark:PointerRange ...

# Element 'section'
:section a earmark:Element ;
    earmark:hasGeneralIdentifier "section"^^xsd:string ;
    co:firstItem [ a co:ListItem ;
        co:itemContent :p1 ;
        co:nextItem [ a co:ListItem ;
            co:itemContent :p2 ] ] .

# First element 'p'
:p1 a earmark:Element ;
    earmark:hasGeneralIdentifier "p"^^xsd:string ;
    co:firstItem [ a co:ListItem ;
        co:itemContent :r1 ;
        co:nextItem [ a co:ListItem ;
            co:itemContent :em ;
            co:nextItem [ a co:ListItem ;
                co:itemContent :r3 ] ] ] .

... and similarly for the other markup elements
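How a pointer range selects text from a docuverse can be sketched in a few lines of Python. This is an illustrative model only, not the EARMARK Java API: the classes and the `range_for` helper are hypothetical, and the offsets of r2, r3 and r4 (elided on the slide) are computed from the string rather than taken from the original Turtle source.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Docuverse:
    """Holds the textual content that ranges point into."""
    content: str


@dataclass(frozen=True)
class PointerRange:
    """A span of a docuverse delimited by two positions between characters."""
    refers_to: Docuverse
    begins: int
    ends: int

    def text(self) -> str:
        return self.refers_to.content[self.begins:self.ends]


def range_for(docuverse: Docuverse, fragment: str) -> PointerRange:
    """Hypothetical helper: build the range covering the first occurrence of fragment."""
    begin = docuverse.content.index(fragment)
    return PointerRange(docuverse, begin, begin + len(fragment))


content = Docuverse("Some interesting content. It was written by me.")
r1 = PointerRange(content, 0, 5)  # offsets given on the slide
r2 = range_for(content, "interesting")
r3 = range_for(content, " content.")
r4 = range_for(content, "It was written by me.")

# The first paragraph is r1, then the em element wrapping r2, then r3.
first_paragraph = r1.text() + r2.text() + r3.text()
```

Since markup items only reference ranges, the same docuverse can be shared by several hierarchies or versions without copying the text.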
EARMARK Changes Ontology
Extending EARMARK to manage change tracking information
• The EARMARK Changes Ontology (EChO) extends the EARMARK ontology and includes the OWL 2 DL implementation of FRBR (http://purl.org/spar/frbr) and the Provenance Ontology (http://www.w3.org/ns/prov#), so as to keep track of all the changes and provenance data related to different versions of the same document
✦ the EARMARK items (docuverses, ranges and markup items) to model the structure of the different document versions and to store them all within a single EARMARK document
✦ frbr:revisionOf to indicate that a markup item is a revision of another
✦ prov:wasDerivedFrom to indicate that a range is actually derived from another one defined in a previous version of the document
✦ prov:wasGeneratedBy (coupled with instances of echo:VersionCreation and echo:ItemInsertion) and prov:generatedAtTime to indicate that a particular markup item, a range or a whole document version has been created at a certain time
✦ prov:wasInvalidatedBy (coupled with instances of echo:VersionRemoval and echo:ItemDeletion) and prov:invalidatedAtTime to indicate that a particular markup item, a range or a whole document version has been deleted at a certain time
✦ prov:wasAssociatedWith to indicate the agent involved in the activity of generation/invalidation of a certain item
Version creation
• Who: Alice
• What: document version (implicitly identified by the document element of the markup document :section)
• How: creation
• When: June 19, 2013 at 13:45
:section
    # Provenance information
    prov:wasGeneratedBy :creation-by-alice ;
    prov:generatedAtTime "2013-06-19T13:45:00Z"^^xsd:dateTime .

# Activity of creation of a new version
:creation-by-alice a echo:VersionCreation ;
    prov:wasAssociatedWith :alice .
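The who/what/how/when pattern above is regular enough to be generated. The following Python sketch emits the same statements from the four pieces of information; the function name and its parameters are hypothetical, and prefix declarations are assumed to be supplied elsewhere.

```python
from datetime import datetime, timezone


def version_creation_turtle(item: str, activity: str, agent: str, when: datetime) -> str:
    """Return Turtle statements recording that `agent` created version `item` at `when`.

    Hypothetical helper: the local names are inserted as-is and are not validated.
    """
    stamp = when.astimezone(timezone.utc).strftime("%Y-%m-%dT%H:%M:%SZ")
    return (
        f":{item} prov:wasGeneratedBy :{activity} ;\n"
        f'    prov:generatedAtTime "{stamp}"^^xsd:dateTime .\n'
        f":{activity} a echo:VersionCreation ;\n"
        f"    prov:wasAssociatedWith :{agent} .\n"
    )


turtle = version_creation_turtle(
    item="section",                # what
    activity="creation-by-alice",  # how
    agent="alice",                 # who
    when=datetime(2013, 6, 19, 13, 45, tzinfo=timezone.utc),  # when
)
print(turtle)
```

Normalising the timestamp to UTC before formatting keeps the trailing Z of the xsd:dateTime literal truthful even if the caller passes a local time.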
Revision, insertion and deletion
• Bob’s revision of Alice’s version was made on July 3, 2013, at 04:15, and consists only of inserting the string “very ” as the first text node of the element em
• Charles’ revision of Bob’s version was made on July 5, 2013, at 03:33, and deletes the second p of Bob’s section

# Revision: element 'section' by Bob
:section-by-bob a earmark:Element ;
    earmark:hasGeneralIdentifier "section"^^xsd:string ;
    co:firstItem [ a co:ListItem ;
        co:itemContent :p1-by-bob ;
        co:nextItem [ a co:ListItem ;
            co:itemContent :p2 ] ] ;
    # relation with previous version
    frbr:revisionOf :section ;
    # provenance information
    prov:wasGeneratedBy :creation-by-bob ;
    prov:generatedAtTime "2013-07-03T04:15:00Z"^^xsd:dateTime .

:creation-by-bob a echo:VersionCreation ;
    prov:wasAssociatedWith :bob .

# Insertion: new content of the document
:content-by-bob a earmark:StringDocuverse ;
    earmark:hasContent "very "^^xsd:string .

# New string 'very '
:r5 a earmark:PointerRange ;
    earmark:refersTo :content-by-bob ;
    earmark:begins "0"^^xsd:nonNegativeInteger ;
    earmark:ends "5"^^xsd:nonNegativeInteger ;
    # provenance information
    prov:wasGeneratedBy :insertion-by-bob ;
    prov:generatedAtTime "2013-07-03T04:15:00Z"^^xsd:dateTime .

:insertion-by-bob a echo:ItemInsertion ;
    prov:wasAssociatedWith :bob .

# Deletion: second element 'p' of Bob's 'section'
:p2
    prov:wasInvalidatedBy :deletion-by-charles ;
    prov:invalidatedAtTime "2013-07-05T03:33:00Z"^^xsd:dateTime .

:deletion-by-charles a echo:ItemDeletion ;
    prov:wasAssociatedWith :charles .
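One way to read the invalidation data is as a filter over a version's children: an item stops being visible once its invalidation time has passed. The Python sketch below illustrates that reading under loud assumptions: the dictionary layout and the `children_at` helper are hypothetical, and the slides do not state that EChO tools reconstruct versions this way.

```python
from datetime import datetime, timezone

# Children of Bob's section, in document order (identifiers from the example).
section_by_bob = ["p1-by-bob", "p2"]

# prov:invalidatedAtTime values, keyed by item (only p2 was deleted, by Charles).
invalidated_at = {
    "p2": datetime(2013, 7, 5, 3, 33, tzinfo=timezone.utc),
}


def children_at(children, invalidated_at, moment):
    """Keep the children that have not yet been invalidated at `moment`."""
    return [
        child
        for child in children
        if child not in invalidated_at or invalidated_at[child] > moment
    ]


before_deletion = children_at(
    section_by_bob, invalidated_at, datetime(2013, 7, 4, tzinfo=timezone.utc)
)
after_deletion = children_at(
    section_by_bob, invalidated_at, datetime(2013, 7, 5, 3, 33, tzinfo=timezone.utc)
)
```

With this reading, the view at the time of Charles' revision contains only the first paragraph, while any earlier moment still shows both.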
Splitting ranges up
• Consider Daniel’s revision of Alice’s version, in which Daniel substitutes the string “me” in the second p with his name (i.e. the string “Daniel”)
• In EARMARK, this string substitution (i.e. a deletion plus an insertion) is possible by defining four new ranges
• We use prov:wasDerivedFrom statements between ranges to describe (at an abstract level) a more complex scenario of overlapping markup between the two versions
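The substitution can be sketched in Python as follows. The slide does not list the four ranges, so the split used here is an assumption: the text before “me”, the deleted “me”, the text after it (all three derived from the original range) and the inserted “Daniel” in a new docuverse. The `Range` class and `substitute` function are hypothetical names.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Range:
    """A span over a docuverse string; derived_from mimics prov:wasDerivedFrom."""
    docuverse: str
    begins: int
    ends: int
    derived_from: Optional["Range"] = None

    @property
    def text(self) -> str:
        return self.docuverse[self.begins:self.ends]


def substitute(original: Range, old: str, new: str):
    """Split `original` around the first occurrence of `old` and add a range for `new`."""
    cut_begin = original.begins + original.text.index(old)
    cut_end = cut_begin + len(old)
    before = Range(original.docuverse, original.begins, cut_begin, derived_from=original)
    deleted = Range(original.docuverse, cut_begin, cut_end, derived_from=original)
    after = Range(original.docuverse, cut_end, original.ends, derived_from=original)
    inserted = Range(new, 0, len(new))  # lives in a new docuverse, as "very " does
    return before, deleted, inserted, after


content = "Some interesting content. It was written by me."
sentence = "It was written by me."
start = content.index(sentence)
r4 = Range(content, start, start + len(sentence))

before, deleted, inserted, after = substitute(r4, "me", "Daniel")
daniel_paragraph = before.text + inserted.text + after.text
```

The original range is left untouched, so Alice's version remains readable while Daniel's paragraph is assembled from the new ranges.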
Querying the history of changes
• Since EARMARK is defined by means of Semantic Web technologies, we can use already implemented standards such as SPARQL 1.1 to query over the change tracking history of a certain EARMARK document
✦ Return a new EARMARK document that contains only Bob’s version

CONSTRUCT { ?other ?p ?o . ?version ?pv ?ov . ?docuverse ?pd ?od }
WHERE {
    {
        SELECT DISTINCT ?other ?version WHERE {
            {
                SELECT DISTINCT ?version WHERE {
                    ?version a earmark:Element ;
                        prov:wasGeneratedBy ?activity .
                    ?activity a echo:VersionCreation ;
                        prov:wasAssociatedWith :bob
                }
            }
            ?other (^co:itemContent?/^co:item)+ ?version
        }
    }
    ?version ?pv ?ov .
    ?other ?p ?o .
    OPTIONAL {
        ?other a earmark:PointerRange ;
            earmark:refersTo ?docuverse .
        ?docuverse ?pd ?od
    }
}

✦ Select the textual content of all paragraphs removed by Charles

SELECT DISTINCT ?range
WHERE {
    ?p a earmark:Element ;
        earmark:hasGeneralIdentifier "p"^^xsd:string ;
        co:item/co:itemContent ?range ;
        prov:wasInvalidatedBy ?activity .
    ?range a earmark:PointerRange .
    ?activity a echo:ItemDeletion ;
        prov:wasAssociatedWith :charles
}
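To make the logic of the second query concrete, here is a plain-Python imitation over a tiny in-memory triple set. It is not SPARQL and uses no RDF library: the triples are a hand-made fragment of the running example, and the hypothetical predicate `contains` stands in for the property path co:item/co:itemContent.

```python
# Hand-made fragment of the example; "contains" abbreviates co:item/co:itemContent.
triples = {
    ("p1", "rdf:type", "earmark:Element"),
    ("p1", "earmark:hasGeneralIdentifier", "p"),
    ("p1", "contains", "r1"),
    ("r1", "rdf:type", "earmark:PointerRange"),
    ("p2", "rdf:type", "earmark:Element"),
    ("p2", "earmark:hasGeneralIdentifier", "p"),
    ("p2", "contains", "r4"),
    ("r4", "rdf:type", "earmark:PointerRange"),
    ("p2", "prov:wasInvalidatedBy", "deletion-by-charles"),
    ("deletion-by-charles", "rdf:type", "echo:ItemDeletion"),
    ("deletion-by-charles", "prov:wasAssociatedWith", "charles"),
}


def objects(subject, predicate):
    """All objects o such that (subject, predicate, o) is in the triple set."""
    return {o for (s, p, o) in triples if s == subject and p == predicate}


def subjects(predicate, obj):
    """All subjects s such that (s, predicate, obj) is in the triple set."""
    return {s for (s, p, o) in triples if p == predicate and o == obj}


def ranges_removed_by(agent):
    """Ranges contained in 'p' elements invalidated by a deletion of `agent`."""
    found = set()
    for paragraph in subjects("earmark:hasGeneralIdentifier", "p"):
        if "earmark:Element" not in objects(paragraph, "rdf:type"):
            continue
        for activity in objects(paragraph, "prov:wasInvalidatedBy"):
            is_deletion = "echo:ItemDeletion" in objects(activity, "rdf:type")
            by_agent = agent in objects(activity, "prov:wasAssociatedWith")
            if is_deletion and by_agent:
                found |= {
                    r for r in objects(paragraph, "contains")
                    if "earmark:PointerRange" in objects(r, "rdf:type")
                }
    return found


removed = ranges_removed_by("charles")
```

Each nested condition corresponds to one triple pattern of the SELECT query, which is why the SPARQL form is so much shorter.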
Byte-counting EARMARK documents
• We used two documents
✦ The first document is composed of seven different versions, named after the “Seven Dwarfs” for recognizability and obtained by applying very common edits made by three authors
✦ The second document is composed of seven different versions, named after the weekdays and created by seven different authors when editing a very simple document
• We compared the size in bytes of consecutive versions of these documents in the OpenDocument and OpenXML formats, and in EARMARK linearised in six different formats: Turtle, RDF/XML, OWL/XML, N-Triples, HDT and Manchester Syntax
Results:
• Turtle seems the best linearisation format for EARMARK
• HDT (compressed) has performance similar to ODT and OOXML
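Byte-counting itself is simple to reproduce in principle. The toy Python sketch below measures the UTF-8 size of the same statements written with Turtle-style prefixes and with N-Triples-style full IRIs. It does not reproduce the measurements reported on the slide: the namespace IRIs are placeholders, the serialisers are hypothetical minimal ones, and the data is synthetic.

```python
# Placeholder namespaces, chosen for illustration only.
PREFIXES = {
    "earmark": "http://example.org/earmark#",
    "": "http://example.org/doc#",
}


def to_turtle(triples):
    """Prefix declarations once, then one abbreviated statement per line."""
    header = "".join(f"@prefix {name}: <{iri}> .\n" for name, iri in PREFIXES.items())
    body = "".join(f"{s} {p} {o} .\n" for s, p, o in triples)
    return header + body


def expand(term):
    """Turn a prefixed name such as ':r1' into a full IRI in angle brackets."""
    prefix, _, local = term.partition(":")
    return f"<{PREFIXES[prefix]}{local}>"


def to_ntriples(triples):
    """One statement per line with every IRI written out in full."""
    return "".join(f"{expand(s)} {expand(p)} {expand(o)} .\n" for s, p, o in triples)


def size_in_bytes(text):
    return len(text.encode("utf-8"))


# Synthetic data: one hundred ranges pointing at the same docuverse.
triples = [(f":r{i}", "earmark:refersTo", ":content") for i in range(100)]
turtle_size = size_in_bytes(to_turtle(triples))
ntriples_size = size_in_bytes(to_ntriples(triples))
print(turtle_size, ntriples_size)
```

The gap comes from repeating full IRIs in every statement, which is one plausible reason why prefix-based syntaxes stay compact as versions accumulate.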
Conclusions
• In this paper we presented a theoretical approach to track document changes based on FRBR and provenance data
• We proposed one possible implementation of it through EARMARK, a Semantic Web-aware meta-markup language that enables the definition of multiple overlapping markup hierarchies representing different versions of the same document
• We highlighted the main advantages and drawbacks in terms of querying and storing such EARMARK documents
• In the future we plan to extend EChO so as to enable the description of additional change tracking operations (e.g. swap, update)
• We also plan to experiment with the effective use of translation mechanisms to convert EARMARK documents with change tracking information into XML formats, e.g. ODT and OOXML
Thanks for your attention