Semantic Web 0 (2012) 1, IOS Press

Lifecycle models of data-centric systems and domains
The Abstract Data Lifecycle Model

Editor(s): Werner Kuhn, University of Münster, Germany
Solicited review(s): Tomi Kauppinen, University of Münster, Germany; Todd Pehle, Orbis Technologies, USA

Knud Möller ∗

DERI, National University of Ireland, Galway
E-mail:

Abstract. The Semantic Web, especially in the light of the current focus on its nature as a Web of Data, is a data-centric system, and arguably the largest such system in existence. Data is being created, published, exported, imported, used, transformed and re-used, by different parties and for different purposes. Together, these actions form a lifecycle of data on the Semantic Web. Understanding this lifecycle will help to better understand the nature of data on the SW, to explain paradigm shifts, to compare the functionality of different platforms, to aid the integration of previously disparate implementation efforts or to position various actors on the SW and relate them to each other. However, while conceptualisations of many aspects of the SW exist, no exhaustive data lifecycle has been proposed.

This paper proposes a data lifecycle model for the Semantic Web by first looking outward and performing an extensive survey of lifecycle models in other data-centric domains, such as digital libraries, multimedia, eLearning, knowledge and Web content management or ontology development. For each domain, an extensive list of models is taken from the literature, and each model is then described and analysed in terms of its different phases, actor roles and other characteristics. By contrasting and comparing the existing models, a meta vocabulary of lifecycle models for data-centric systems, the Abstract Data Lifecycle Model (ADLM), is developed. In particular, a common set of lifecycle phases, lifecycle features and lifecycle roles is established, as well as additional actor features and generic features of data and metadata. This vocabulary now provides a tool to describe each individual model, relate the models to each other, determine similarities and overlaps and eventually establish a new such model for the Semantic Web.

Keywords: data, data-centric, lifecycle, Semantic Web

1. Introduction

The Semantic Web, a web of data rather than documents, where automatic agents can make sense of information, aiding human users and preventing information overload, is still a young phenomenon, and at this point in time we cannot yet say exactly what shape and form it will take in the years to come. A crucial aspect during this development process is a common understanding of the anchor points of what this web of data sets out to be, regardless of specific technologies, languages or systems. Without this understanding, there is a danger that individual efforts will be incompatible, that there is a duplication of efforts or that the effort as a whole will derail. It is therefore necessary to establish a comprehensive conceptual model and architecture of the Semantic Web: “the architecture of any system is one of the primary aspects to consider during design and implementation thereof, and the [. . . ] architecture of the Semantic Web is thus crucial to its eventual realisation” [12]. This architecture and model will then provide the required anchor points and allow for a common understanding.

∗ Current affiliation: Kasabi, UK. E-mail:

So far, a number of basic foundations have been laid and are quite well understood: (i) A layer cake of languages and technologies has been proposed (e.g., [3]), and has since been changed and extended many times, to implement the vision of the Semantic Web. (ii) The Semantic Web is rooted in the World Wide Web, and so the technical infrastructure [21] of how communication takes place on the latter also applies to the former. (iii) Suggestions have been made to extend the architecture specification for the WWW, in order to incorporate ideas such as self-describing data [30] that underlie the Semantic Web.

While all these components are part of what we can eventually consider a complete conceptual model for the Semantic Web, other aspects have not yet been taken into account. One of the core assumptions about the Semantic Web is that it is a web of data. Therefore, one of the missing pieces must be to establish what constitutes data on the Semantic Web, what role it plays, and what its features are. Data is a living thing that moves through various stages, such as creation, publishing, use or termination; data is at the centre of the Semantic Web. This paper therefore contributes to the overall task of establishing a conceptual model of the Semantic Web by proposing an exhaustive lifecycle model of data on the Semantic Web. We are not aware of any definition of lifecycles specific to data; however, a generic definition of the term is “life cycle, n. [. . . ] In extended use: a course or evolution from a beginning, through development and productivity, to decay or ending.” [1].

The lifecycle model is established in three main steps: (i) In light of the fact that the Semantic Web is, at its heart, a data-centric system, we begin in Sect. 2 with an extensive survey of lifecycle models and similar relevant literature from other data-centric domains and systems. (ii) As a second step, in Sect. 3, we then move to a higher level of abstraction by distilling a meta-model and vocabulary of terms (phases, features, roles, etc.) from the survey. We call this model the Abstract Data Lifecycle Model (ADLM). In defining the ADLM, we also revisit the surveyed lifecycle models and classify them according to the abstract model. (iii) Finally, in Sect. 4, we turn our attention to the Semantic Web as a concrete example of a data-centric system and apply the ADLM to it, in effect proposing a descriptive lifecycle model of data for the Semantic Web.

2. Data lifecycles in data-centric domains

The idea of a lifecycle for data has been discussed for various domains in which the generation and use of data and metadata is of central importance. Examples of such data-centric domains are digital libraries, multimedia, eLearning, knowledge and Web content management systems or ontology development. In this section we survey examples from each domain.

2.1. Lifecycles for multimedia

[15] sets out to find a canonical description of the processes involved in (any) media production. For each of these processes, different tools are available to the user. However, since a unifying model of media production as a whole is not available, the inputs and outputs of the different processes are often not compatible, in particular regarding metadata, thus making it impossible to devise an integrated tool chain. The incentive for proposing a canonical model is therefore to facilitate the integration of processes and tools. To illustrate the kind of integration they have in mind, the authors make a comparison with UNIX pipes.

The authors define nine different processes, together with the inputs and outputs of each. In fact, the formal definition of inputs and outputs is what clearly sets Hardman’s model apart from all other models featured in this paper. Inputs and outputs can be complex objects, and the model is formalised by typing each component and identifying it with an id. The processes are (i) premeditate: decisions, inspirations and thought processes that take place prior to the creation of actual media content (e.g., who, where, why or what), (ii) capture: the process of physically capturing a piece of media content, (iii) archive: storing and indexing a media asset, alongside its annotations, (iv) annotate: adding arbitrary information to a particular media asset, (v) query: retrieving a media asset from an archive, (vi) message construction: defining the intended meaning of the media object currently in production, (vii) organise: composing a set of media assets into a larger document, (viii) publish: converting a document into a format ready for external use, and (ix) distribute: making a media document available to external users. An interpretation of the arrangement and interaction of these processes is shown in Fig. 1. The various kinds of xxxID in the figure denote named data objects which are part of the input or output of a process, while arrows indicate how the different processes can be linked through their inputs and outputs. Even though this is not conveyed in the figure, some of those data objects can themselves be complex objects. For example, an annotation referenced by an annID will contain a reference to an ontology or vocabulary (ontID), as well as a term from that vocabulary (attID). Similarly, an archived media asset (compID) will contain a reference to the physical media asset (medID).

Fig. 1. Processes in media production, from Hardman
[Recoverable figure labels: Message Construction; Author; medID, {capID}, {annID}; compID, archID; compID, {annID}; archID, query; messID, {compID}]

The production model described by Hardman is not necessarily a repetitive life-cycle, but rather a non-circular workflow. The publish and distribute processes are considered to be the end of the workflow: “Once a document structure is published it is no longer part of the process set.” However, the authors also acknowledge that there can be circular sequences: “In a number of cases, the results of a process can feedback into a different process in a different role.” An example of this is the idea that an annotation can itself be considered a media asset, which could be archived, annotated, etc. Similarly, a document as the output of the organise process can be considered a media asset and as such be archived into the system again. These kinds of transformations are indicated by the dotted arrows in the figure.
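The linking of processes through typed, identified inputs and outputs can be illustrated with a minimal Python sketch. It is not taken from [15]: the names `Artifact`, `capture`, `annotate`, `as_media` and `archive`, as well as all id values, are hypothetical, and only three of the nine processes are shown.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Artifact:
    """A typed, identified data object passed between processes (hypothetical)."""
    kind: str          # type of the object, e.g. "medID", "annID", "compID"
    id: str            # identifier of this object
    refs: tuple = ()   # ids of other objects this one refers to


def capture(med_id):
    # capture: physically record a piece of media content
    return Artifact("medID", med_id)


def annotate(target, ann_id, ont_id, att_id):
    # annotate: the annotation refers to its target, a vocabulary (ontID)
    # and a term from that vocabulary (attID)
    return Artifact("annID", ann_id, refs=(target.id, ont_id, att_id))


def as_media(artifact, med_id):
    # feed an output back in a different role, e.g. an annotation as a media asset
    return Artifact("medID", med_id, refs=(artifact.id,))


def archive(media, annotations, comp_id):
    # archive: the type check mimics the typing of inputs in the model
    if media.kind != "medID":
        raise TypeError(f"archive expects a medID, got {media.kind}")
    return Artifact("compID", comp_id, refs=(media.id, *(a.id for a in annotations)))


# Chaining the processes, in the spirit of a UNIX pipe
media = capture("med-1")
note = annotate(media, "ann-1", "ont-1", "att-1")
asset = archive(media, [note], "comp-1")
```

Because every process accepts and returns the same kind of typed object, the output of one step can be passed directly to the next, which is the integration property the comparison with UNIX pipes is meant to convey.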

[25] discusses the lifecycle of metadata in the context of multimedia systems (here focussing on the multimedia database management system CODAC¹). Metadata in this context relates to both the structure and format of media resources (e.g., resolution, sample rate, color scheme, etc.), as well as their content (what is shown in a video, what can be heard in an audio recording). Kosch characterises those two kinds of metadata as “metadata for content adaption” and “metadata for search and retrieval”, respectively (see Sect. 3.5).

¹, retrieved 31/10/2010

The lifecycle model presented by the authors is somewhat unusual in that it introduces so-called lifecycle spaces as one of its central concepts. The term denotes the different divisions of a multimedia system in which metadata has relevance. Three such spaces are identified: the content space, metadata space and user space. The content space is the main anchor point and is concerned with the multimedia data itself. It comprises a production/creation, postproduction, delivery (e.g., through streaming) and consumption stage. The metadata space, on the other hand, is divided into metadata production and metadata consumption. Metadata production takes place during content production/creation and postproduction, and both phases can produce content-related as well as structural metadata. Metadata consumption takes place during content delivery and content consumption. Finally, within the user space the authors identify content providers/producers, processing users and end users. Both content providers and processing users are involved in metadata production, and both can generate content-related as well as structural metadata. End users, on the other hand, are only involved in metadata consumption. This of course also means that the authors do not assume that end users contribute to the media’s metadata through means such as tagging. Figure 2 illustrates our own interpretation of how the different spaces are related.
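The relation between the three spaces can be written down as two small lookup tables. The following Python sketch follows the prose description only; the dictionary layout and the function `roles_active_in` are assumptions made for illustration and are not part of the CODAC system. In particular, joining roles and stages through the shared metadata activity is a simplification.

```python
# Content space: which metadata activity accompanies each content stage
METADATA_ACTIVITY_BY_CONTENT_STAGE = {
    "production/creation": "metadata production",
    "postproduction": "metadata production",
    "delivery": "metadata consumption",
    "consumption": "metadata consumption",
}

# User space: which metadata activity each role is involved in
METADATA_ACTIVITY_BY_ROLE = {
    "content provider/producer": "metadata production",
    "processing user": "metadata production",
    "end user": "metadata consumption",
}


def roles_active_in(content_stage):
    """Return the roles whose metadata activity matches that of the content stage."""
    activity = METADATA_ACTIVITY_BY_CONTENT_STAGE[content_stage]
    return sorted(
        role
        for role, role_activity in METADATA_ACTIVITY_BY_ROLE.items()
        if role_activity == activity
    )
```

Under these assumptions, `roles_active_in("delivery")` yields only the end user, which reflects the observation that end users do not produce metadata in this model.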

Even though the authors use the term “metadata lifecycle”, their model does not contain an immediately apparent circular flow. However, the CODAC system contains router and proxy elements which potentially reproduce metadata used to adapt multimedia data to the preferences or context of a particular user. According to the authors, this in effect closes the metadata lifecycle.

MPEG-7 is adopted as the sole metadata schema for CODAC, apparently with the implicit assumption that it suffices for the purpose of multimedia data. For this reason, the question of ontology creation or validation of a chosen terminology is not considered.

Fig. 2. Metadata lifecycle spaces in multimedia, from Kosch et al.

2.2. Lifecycles in eLearning

[5] investigates the lifecycle of data in the context of eLearning. Their approach is to first give an overview of a number of other lifecycle models for learning objects (LO), from which they then distil their own common terminology and model. The end result is an elaborate model describing a combined lifecycle of the actual LOs and their associated metadata.

The model consists of nine different phases, which are related to the lifecycle stages of the reviewed models. Those phases are: initiation, conception, realisation, classification, validation, diffusion, usage, feedback and termination. Strictly speaking, initiation takes place before any actual work on the LO begins, and is therefore outside of the cycle. Only in conception does the work begin in earnest: various data elements are defined, as well as their relations to other LOs. Realisation marks the point when the actual content of the LO is produced, while classification positions it in the context of a classification system such as the Dewey Decimal system. Validation is defined as a means of quality control by experts, which may lead to the LO being pushed back to the classification, realisation or even conception step. In diffusion, the LO is introduced into a concrete learning management system, where it becomes available to learners in the usage step. User reactions and comments are analysed during feedback, which may again lead to changes of the LO in terms of conception or realisation, or even to its complete termination, in case it is deemed unsuitable or unsuccessful.










Fig. 3. Metadata lifecycles for learning objects, from Catteau et al.







Fig. 4. Metadata lifecycles for learning objects, from Millard et al.

In Fig. 3, the thick arrows indicate the main flow of the proposed lifecycle model, while thin lines indicate changes after “unfavourable evaluation” in the validation or feedback stage. Furthermore, the authors identify three stages which involve changes to the LOs themselves (indicated by a book icon), while there are seven different stages which involve changes in the LO metadata (indicated by an RDF icon).
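The phases and their feedback loops can be read as a small state machine. The Python sketch below encodes the transitions as they are described in the prose of this section; it is an interpretation rather than a transcription of Fig. 3, and the names `TRANSITIONS` and `is_valid_path` are hypothetical.

```python
# Allowed transitions between phases, read off the prose description
# (an interpretation, not a transcription of the figure)
TRANSITIONS = {
    "initiation": {"conception"},
    "conception": {"realisation"},
    "realisation": {"classification"},
    "classification": {"validation"},
    # validation either passes the LO on or pushes it back
    "validation": {"diffusion", "classification", "realisation", "conception"},
    "diffusion": {"usage"},
    "usage": {"feedback"},
    # feedback may trigger rework or end the lifecycle
    "feedback": {"conception", "realisation", "termination"},
    "termination": set(),
}


def is_valid_path(path):
    """Return True if every consecutive pair of phases is an allowed transition."""
    return all(nxt in TRANSITIONS[cur] for cur, nxt in zip(path, path[1:]))
```

A path that runs from initiation through feedback to termination is accepted, as is one in which validation sends the LO back to realisation; a jump from usage directly to validation is rejected.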

Another, much simpler model is introduced in [32]. The goal of their “Knowledge Life Cycle” is to facilitate the lifting of learning objects and their metadata to a level of formal semantics. They propose four basic stages: knowledge acquisition, which is the conceptualisation of a knowledge domain with the help of domain experts; knowledge modelling, which is the formalisation of the conceptualisation in an ontology language; knowledge annotation, which is the annotation of learning objects with the ontology; and finally knowledge reuse, which is the usage of the annotated LOs in applications such as search, automatic composition of courses, personalisation, etc. Millard’s lifecycle model is circular in the sense that one can have several iterations, with an additional knowledge maintenance step in between reuse and acquisition (see Fig. 4). Apart from the difference in complexity, a crucial difference between Catteau’s model and Millard’s is that the former completely omits ontology creation, whereas the latter gives it a very prominent position (it takes up half of the cycle).

2.3. Lifecycles in digital libraries

Another domain for which data lifecycles are relevant is digital libraries. [6] describes a very detailed and elaborate 10-step methodology for setting up metadata systems for digital libraries (or more generically: collections of digital artefacts). The lifecycle model is divided into four general groups: Requirement Assessment and Content Analysis, System Requirement Specification, Metadata System, and Service and Evaluation. Each of those groups is again divided into a number of different steps, as shown in Fig. 5.

The Requirement Assessment and Content Analysis phase covers aspects such as defining the basic metadata needs of the task at hand (e.g., schedule, scope or function), as well as deep needs (ontologising the domain, defining search strategies, etc.) and a review of candidate metadata standards. Since the authors assume a situation where data from legacy systems will be migrated into the new system, another aspect of this phase is the analysis of the legacy system in terms of access options, suitability for transformation, etc. After this, the System Requirement Specification phase produces a detailed specification document (the Metadata Requirement Specification, or MRS), covering all aspects which were dealt with in the assessment phase. Also covered by this phase is an evaluation of both existing metadata platforms (software) for use in the desired target system and the possibility of developing such a system from scratch. In the Metadata System phase, two things take place: a best practice guideline for applying the MRS to specific metadata elements is prepared, and the system which was decided upon in the previous step is developed or set up. Finally, in the Service and Evaluation phase, a metadata service model (including a service mechanism, different user roles and their relationships) is put into place to ensure the smooth running of the system. Also, an evaluation of the system takes place, which can feed back into a new round of the cycle.

Fig. 5. Metadata lifecycle for digital libraries, from Chen et al.

If we compare Chen’s model to the other models discussed in the paper, it becomes obvious that it is not really a lifecycle model for data, but rather a lifecycle model for ontology (see Sect. 2.5) and systems development. All data and metadata in the system come from translating resources from legacy systems into a format defined by the new ontology. The creation of new instance data does not feature anywhere in the model.

In [39], a lifecycle model for ontology development which was previously defined in [38] is applied to the domain of digital libraries. This model also focusses on the ontology development aspect of digital libraries. While much simpler than [6] (including creation, evaluation, negotiation and versioning phases), it also specifically distinguishes between the automatic creation of ontologies by machines (ontology learning) and the manual, collaborative creation of ontologies by humans (see Sect. 3.4).

2.4. Lifecycles for knowledge and content management

Knowledge and content management represents the largest subset of data lifecycles discussed in this paper. The domain is defined very loosely, comprising both traditional knowledge management and lifecycles with a specific Web or even Semantic Web focus.

Fig. 6. Knowledge process, from Staab et al.

Fig. 7. Knowledge metaprocess, from Staab et al.

[42] proposes a metadata lifecycle for knowledge management (KM) in organisations. In fact, the authors define two such cycles, which are intertwined: the knowledge process and the orthogonal knowledge metaprocess. The former denotes creation, retrieval and use of actual knowledge (not simply in the form of documents, but as knowledge items). The knowledge metaprocess, on the other hand, denotes the process of planning and implementing an infrastructure and platform for knowledge management, including ontology creation and management. For both cycles, the authors assume that a software and hardware infrastructure is already in place.

Within the knowledge process (i.e., the lifecycle of data), the authors identify four different steps: Create, Capture, Retrieve/Access and Use, as depicted in Fig. 6. In addition, Import is identified as a fifth step, which allows new, external data (e.g. from documents) to be introduced. Both Create and Import are characterised by the fact that new knowledge items come into the system. Knowledge items have varying degrees of formality (from free text documents via templated documents to formal knowledge structures), and can thus be said to cover both data and metadata in the classical sense. In the Capture phase, the new knowledge items are then integrated into the system: this can mean indexing, additional annotation, establishing relations to other knowledge items, etc. Retrieve/Access usually involves querying and browsing the organisation’s knowledge base. In other words, this phase defines a typical search task of a knowledge worker, at the end of which they have retrieved a number of knowledge items. The authors stress that the job is not done once knowledge items are found: another important phase is Use, which can span aspects such as personalisation, pro-active access, integration with other applications or aggregation of separate knowledge items into a new whole. Aggregation and inference can lead to new knowledge and thus close the cycle.

The knowledge metaprocess is mostly concerned with ontology development for a particular organisation or use case. It consists of a Feasibility Study, Ontology kickoff, Refinement, Evaluation and Maintenance. The Feasibility Study takes place before the actual work on the ontology begins. Its purpose is to properly specify the use case at hand, define the organisational context and identify any problems which might appear along the way. The purpose of the kickoff phase is to produce a requirements document, which will serve as a guide throughout the project, elaborating on aspects such as goal, domain and scope, supported applications, available knowledge sources, users and external ontologies which could potentially be reused. In the Refinement phase, the ontology engineers then start to implement the specification. The suggested approach is to start with a simple baseline taxonomy and to proceed via a so-called “seed ontology” until one arrives at the final target ontology, which is expressed in a formal representation language. For the Evaluation phase, the designers refer back to the specification document to check their ontology, test it within the targeted applications and gather feedback from beta users. Finally, the Maintenance phase ensures that the ontology always stays up-to-date and reflects changes in the real world. In both the Evaluation and Maintenance phases, user input and feedback may lead to further refinement, as indicated by the loops in Fig. 7.

[14] introduces a framework for the description of generic metadata generation. She focusses on the different processes, people (roles) and tools involved. While not, strictly speaking, a complete lifecycle model, her framework still provides important input for at least part of such a model. In terms of processes, Greenberg establishes the very broad dichotomy of human metadata generation vs. automatic metadata generation. While the former was historically the only form of metadata generation, the latter is becoming increasingly important, especially considering the huge amounts of data on the Web and elsewhere today. In human metadata generation, the classes of persons proposed in the framework are professional metadata creators (e.g. professional librarians), technical metadata creators (e.g. library assistants), content creators (e.g. authors) and community or subject enthusiasts² (e.g. readers). While those classes can be arranged on a scale of decreasing professionalism, they are not necessarily disjoint. In terms of tools, the framework distinguishes between three different kinds: human beings, standards & documentation and devices. The latter are the technical means to actually capture metadata, and are further divided into templates, editors and generators. Templates are defined as simple forms to input metadata, while editors aid the user with links to documentation, syntactical help, etc. Generators, finally, produce metadata automatically and can often be found as components in editors.

[27] proposes a simple lifecycle model for knowledge management in organisations. The model comprises four rather high-level steps, which are discussed by suggesting the different information technologies (IT) which facilitate them: (i) Knowledge capture is defined as “the process by which knowledge is obtained and stored”. Facilitating ITs are database systems, data warehouses and document management systems. (ii) Knowledge development is the organisation and analysis of data “for strategic or tactical decision making”. Examples of ITs used in this step are data mining tools, OLAP (online analytical processing) and competitive intelligence systems. (iii) Knowledge sharing means distribution of knowledge and is enabled by group support systems and communications technology such as EDI (electronic data interchange), e-mail, voice mail, video conferencing and electronic bulletin boards. (iv) Knowledge utilisation, finally, is the use of knowledge “without computer knowledge” by end users in the organisation. ITs used in this step are vaguely defined as GUI-driven end user applications. Multimedia is also mentioned as an enabler technology.

² Greenberg relates an interesting account of a project at the Fine Arts Museum of San Francisco, in which voluntary community enthusiasts helped to assign keywords to a corpus of 20,000 images: an early, pre-Web 2.0 example of collaborative tagging.

In a general paper on Web content management (WCM) systems in organisations, [29] suggests a lifecycle model for WCM as a special case of KM. In her model, collection and delivery are two processes which are repeated iteratively for as long as the organisation is involved in WCM. Collection is the authoring or creation of content, while delivery means the publishing or deployment of data. In addition to those two processes, the workflow process and the control and administration process have an ongoing supportive function. The former enables collaboration and manages steps such as approval, development, etc., while the latter includes the identification of user roles, groups, management of the system, security and similar aspects.

On the fringes of knowledge management, [19] proposes a lifecycle model for digital curation, to be promoted by the UK Digital Curation Centre (DCC)³. It is one of the most complex models in this survey, distinguishing between three different types of lifecycle actions: full lifecycle actions, which take place continually; sequential actions, which conform more to the traditional lifecycle concept and form the bulk of the model; and occasional actions, which may take place only at certain moments in the lifecycle. Full lifecycle actions are mainly administrative and preparatory, while the latter two kinds of actions deal with the data directly. As full lifecycle actions, the DCC defines description and representation information, similar to ontology development phases of other models; preservation planning, which provides planning for the management and administration of the regular lifecycle actions; community watch and participation, which ensures involvement in related activities (e.g., standards development); and finally curate and preserve, which drives the actual management and administration of the lifecycle. Sequential actions are conceptualise, preceding the creation of data; create and receive, either creating new data or importing existing data; appraise and select, evaluating and selecting data for long-term curation; ingest, archiving data in a repository; preservation action, which includes data cleaning and validation or assigning preservation metadata; store, another form of archiving; access, use and reuse, making data accessible to potential users; and finally transform, meaning to create new versions of data already present in the lifecycle. The three occasional actions are dispose, which can take place after appraise and select in order to remove data from the lifecycle; reappraise, which is a loop back from preservation action to appraise and select; and finally migrate, which moves directly from preservation action to transform, skipping the intermediate actions.

³, retrieved 31/10/2010
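The interplay of sequential and occasional actions can be made concrete with a short Python sketch. It encodes only what is stated in the description above; the names `SEQUENTIAL`, `OCCASIONAL` and `next_actions` are hypothetical and do not come from the DCC model itself, and the full lifecycle actions are left out because they run continually rather than at a specific point.

```python
# Sequential actions in the order given in the text
SEQUENTIAL = [
    "conceptualise",
    "create and receive",
    "appraise and select",
    "ingest",
    "preservation action",
    "store",
    "access, use and reuse",
    "transform",
]

# Occasional actions as (action they may follow, action they lead to);
# None means the data leaves the lifecycle
OCCASIONAL = {
    "dispose": ("appraise and select", None),
    "reappraise": ("preservation action", "appraise and select"),
    "migrate": ("preservation action", "transform"),
}


def next_actions(current):
    """Return the next sequential action (if any) plus applicable occasional actions."""
    options = []
    position = SEQUENTIAL.index(current)
    if position + 1 < len(SEQUENTIAL):
        options.append(SEQUENTIAL[position + 1])
    options.extend(
        name for name, (source, _target) in OCCASIONAL.items() if source == current
    )
    return options
```

In this reading, preservation action is the main branching point of the lifecycle: it can be followed by store, by a reappraisal or by a migration.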

In [33], a very simple lifecycle called MAAME is proposed for semantic applications. The name is derived from the five phases which constitute the lifecycle: modelling, application, authoring, mining and evaluation. The modelling phase covers all elements of ontology modelling, whereas application denotes the process of defining how semantic information will be used in a concrete application, i.e., what the role of semantics is in that application. In authoring, instance data is created manually, while mining creates data automatically from other sources, through techniques such as natural language processing or machine learning. Finally, evaluation encompasses all aspects of feedback about the lifecycle, including the model, the application itself and its instance data. Mödritscher discusses the lifecycle itself only at a very high and abstract level, but then grounds it by mapping four different concrete applications to it.

[17] loosely defines a lifecycle for linked data [2], with the express purpose of categorising different R&D projects in a research group. To this end, six phases are used: data awareness, modelling, publishing, discovery, integration and use cases. The first of these phases should probably be seen separately, in that it does not concern a particular unit of data or a particular dataset, but rather the awareness of linked and open data in general.

2.5. Ontology lifecycles

A special case of data lifecycles are ontology lifecycles, i.e., lifecycle models that describe aspects such as creation, integration or reuse of ontologies (as opposed to instance data). In fact, some of the metadata lifecycles presented in the previous sections, e.g. Chen et al., turn out to be ontology lifecycles, while others at least include the concept of ontology creation prominently (Staab et al., Millard et al., Mödritscher, Hausenblas). A significant number of what can be considered lifecycle models for ontologies have been published under the label of methodologies for ontology development and often do not use the term “lifecycle” at all.

[10] gives a comprehensive overview of different ontology development methodologies based on their compliance with the IEEE Standard for Developing Software Life Cycle Processes (1074-1995) [41]. The methodologies are categorised as (i) building ontologies from scratch, (ii) re-engineering ontologies or (iii) collaborative methodologies. For (i), the authors include the Cyc methodology, Uschold and King, Grüninger and Fox, KACTUS, their own METHONTOLOGY [8,13] and the SENSUS-based methodology. For (ii), they outline an approach of their own, while for (iii), the CO4 and KA2 methodologies are included. Each methodology is evaluated in terms of whether or not it implements the different processes defined by the IEEE standard. Depending on how many processes are considered by a given methodology, it is given a maturity rating (the more processes, the more mature). The authors’ own METHONTOLOGY comes out as the “most mature approach”. Additionally, each approach is characterised according to the kind of lifecycle model it proposes (sequential, incremental or evolving), as presented in [8] (see also Sect. 3.1). [9] discusses how the (development) lifecycles of different ontologies can touch due to integration and reuse. Considering the “confluences and forking of life cycles”, the authors present a method to make the relations between ontologies explicit and define the steps necessary to integrate them. Other, more recent examples of ontology development lifecycles include work done in the Knowledge Web project⁴ (e.g., [38]) and the NeOn project⁵.

2.6. Lifecycles in databases

Databases are not an application domain like the ones presented in the previous sections, but rather an enabling technology. For this reason, they are only briefly mentioned here. Nevertheless, databases are an area where data lifecycles are of central importance, and the CRUD operations [23] in particular are relevant. The CRUD acronym denotes the four fundamental atomic operations common to persistent database systems, i.e., (i) create, (ii) read, (iii) update and (iv) delete. From the point of view of CRUD, these are all individual, low-level operations, rather than phases in a sequential lifecycle. However, there is still an implicit temporal element to them, in that each unit of data first needs to be created before it can be read, updated or deleted. Also, all four operations can be easily mapped to the lifecycle model proposed in Sect. 3. Indeed, it is easy to argue that CRUD does in fact represent a simple lifecycle for databases.

4, retrieved 31/10/2010

5, retrieved 31/10/2010

While CRUD originates from databases, where it is often used at the basic record level, the term is now also widely used to describe the functionality of applications at a higher level, even that of the user interface. In this interpretation, an application dealing with data entities of a particular type is only considered complete if its UI supports at least the four CRUD operations.

3. The Abstract Data Lifecycle Model

Looking at the survey of lifecycle models presented in Sect. 2, we can see that a significant number of such models have been proposed for different domains. Some models have been designed from scratch, while others have been based on surveys of previous models (in particular McKeever and Catteau et al.). However, none of the literature presented has looked beyond its respective domain and proposed a generic lifecycle model for (meta)data. On the one hand, this is understandable, considering that the individual research communities are often disjoint. On the other hand, even though it is not always possible to perform a direct one-to-one mapping between the different models, there are striking similarities among all of them. Each model has its own specific focus, and while each domain is concerned with a different kind of data, they all deal with the same thing: data. For this reason, it is feasible to devise a domain-agnostic (meta)data lifecycle model based on the previous section, which will help to compare and point out differences and similarities between different systems with respect to their data lifecycles. In this approach, the abstraction is derived from specific instances. The alternative approach, as e.g. adopted in [26] for the domain of case base maintenance, would be to begin with the abstraction and then use it to classify a selection of instances in the pertinent domain6.

6The bottom-up approach chosen in this paper is not in general better or worse than the top-down approach. However, it ensures a broader coverage and wider applicability, while the top-down approach may be more suitable for a restricted, specific use case.

The meta-model, which we call the Abstract Data Lifecycle Model (ADLM), consists of five parts: (i) a set of phases which generalise the steps and processes defined in the individual models in the survey (Fig. 8), (ii) a set of features which can be used to define lifecycle models further (overview in Tab. 2), (iii) a set of roles describing the different actors in the model (Fig. 8), (iv) a set of features describing the actors in the model and (v) a set of features describing the data and metadata found in a lifecycle. Finally, each of the lifecycle models in the survey will be categorised in terms of the meta-model.

3.1. Lifecycle phases

The choice of phases for the meta-model is based on the survey in Sect. 2. While some of the reviewed models go into a lot of detail and include domain-specific aspects, or aspects which pertain to specific implementation matters, other models are very high-level and generic. Both extremes are undesirable for the purpose of building a comprehensive meta-model, in the sense that they are either over- or under-specific. However, it is still possible to map all models covered in the survey to the meta-model.

Like the models covered in the survey, the ADLM has transitions between the different phases. In many cases, these transitions are possible in both directions, so that the phases can be passed through in many different orders, including recursively. While there is no explicit versioning phase or aspect to the ADLM, one implication of this recursiveness is that versioning of data is possible within the ADLM.

Ontology development In this phase the formal domain model for (meta)data is defined. As the previous sections show, the literature contains several lifecycle models which are dedicated solely to the domain of ontology development (notably Chen et al. and the methodologies in Sect. 2.5), whereas others do not consider ontology development at all. There are a few examples in which ontology development is seen as a component in a larger structure, such as Millard et al., in which knowledge acquisition and knowledge modelling are the two initial steps in the five-step cycle, or Staab et al., where the knowledge meta-process is one of the two cycles which make up the model as a whole. Nevertheless, the reason why ontology development is not featured at all in some of the literature is not that it is considered unimportant, but simply that it is considered to be something that has taken place a priori to the lifecycle of instance data. E.g., Kosch et al. simply assume that MPEG-7 will be used, while Catteau et al. suggest using the Dewey Decimal (DD) or a similar system. However, both MPEG-7 and the DD have obviously been designed at some stage and are in some form of lifecycle themselves. While it may be less important in some domains, ontology development is conceptually still a prerequisite to any data lifecycle in a data-centric system, which is why it is included in the ADLM. It is, however, placed outside the core part of the lifecycle. It can also be considered a lifecycle of its own, but one which must precede every other (meta)data lifecycle at some point.

Fig. 8. Phases and roles of the Abstract Data Lifecycle Model (ADLM). Labels recoverable from the figure: Ontology Development, External Use, Feedback, Termination (phases); End Users, Data Creators, Metadata Creators, Ontology Designers (roles).

If, in a particular lifecycle model, ontology development is included as a phase, this inevitably means that the lifecycle is not that of a particular piece of data or metadata, but instead of a complete system as a whole, which includes both ontology and instance data. This characteristic sets the ontology development phase apart from all other phases, which is reflected in Fig. 8 by using a dotted line for its outline.

Planning Planning is a phase which precedes the actual lifecycle of data and describes the moment when the intent to create data takes a concrete form, but before the data is created as part of the system. As Hardman puts it, planning is “outwith the system”, but it can already result in a number of facts which later may become part of the data, e.g. ownership, intent, etc. It is obvious that some form of planning must always take place, but only a few models in the literature name it explicitly. In Catteau et al. this phase is called initiation, while Hardman has two different processes which fall under planning: one is called premeditate and denotes the planning of creating a single unit of data, while message construction explicitly means the planning and intent of combining several pieces of data into a larger whole. From the point of view of the ADLM, both processes shall be considered the same (but possibly during different iterations).

Creation The creation phase defines the moment when new data or metadata is created in terms of the system at hand. This data is either genuinely new and has not existed in any other (formal) format before, or it has been imported from another system; the main point is that it did not exist within the lifecycle’s system before the creation phase. All models discussed above contain a creation step: the multimedia model in Kosch et al. calls it Production/Creation, while Hardman names it capture (here in the sense of “capturing a piece of media with a recording device”). Within eLearning, Millard et al. include a knowledge annotation step (which in fact spans the phases of data creation, archiving and data refinement, see below), whereas Catteau et al. split creation into the two separate conception and realisation steps. In knowledge management, Lee and Hong call this phase knowledge capture, McKeever collection and Staab et al. have create. The latter also distinguish explicitly between the creation of new data and the import of existing, but external data, i.e., the transformation of existing data from one format into the format and infrastructure of the lifecycle’s system. Finally, this phase in combination with the following archiving phase is represented by the create operation in the CRUD set of operations, since it also entails making the newly created data persistent in the system.

Archiving Archiving denotes the process of anchoring a piece of data within the system by means of indexing, cataloguing or a similar activity. Not all models covered in the survey feature anything that can be mapped to this phase. However, Hardman introduces an archive step, whereas Staab et al. suggest a capture step, which combines aspects of the archiving, refinement and publication phases (see below). In McKeever, archiving is covered by the supporting process control and administration.

Refinement The refinement phase covers all kinds of activities which make additions or changes to data that already exists within the system. In a very general sense, this can mean to annotate data, which incidentally is the name of one of the steps within the refinement phase in Hardman, the other one being organise, which means the combination of several data items into a larger whole. Of course, depending on whether data and metadata are distinguished, annotation as refinement could also be interpreted as another iteration of the creation phase: if all data in the lifecycle is considered equal, then adding new annotations to a piece of data is nothing other than creating new data. This is reflected in Hardman by the fact that the output of the annotate process can be new data that can enter the system. In Kosch et al., refinement is named post-production, a term typically used in the media domain, for which this model was made. Catteau et al. emphasise the act of classification as a specific kind of refinement. Both post-production and classification can be considered specialised, restricted kinds of annotation, i.e., refinement. On the other hand, Lee and Hong introduce the very generic knowledge development process, which denotes various kinds of data analysis activities. The knowledge annotation and capture steps in Millard et al. and Staab et al., respectively, both contain aspects of refinement, but are not limited to this phase. As with archiving, McKeever covers refinement as part of the control and administration process. In Greenberg, refinement is reflected as the general process of metadata generation and further distinguished into human metadata generation and automatic metadata generation7. In CRUD, the update operation entails a combination of refinement and archiving, since any changes made to data are immediately made persistent in the system.

Publication The publication phase is the moment when data is made accessible to the users either within or outside of the system, or both. Most of the lifecycle models in the survey have a dedicated slot for publication: for multimedia data in Kosch et al. this is called delivery, while Hardman splits this phase up into a publish step (which really is just preparing a piece of data for publication) and a distribute step (the actual act of publication). In eLearning, LOs are made available to the system in the diffusion step. In Staab et al., publication is part of the very broad capture step, while Lee and Hong reserve a dedicated slot with knowledge sharing and McKeever with the delivery step.

Access Access denotes the moment in the lifecycle when parties from either within or outside of the system gain access to the data in the system, e.g., by means of a query or through browsing. This phase is covered explicitly in Hardman by the query step, and in Staab et al. by the retrieve/access step. Kosch et al. and Catteau et al. also recognise access as a necessary step in the lifecycle, as consumption and usage, respectively, but do not distinguish it from actual use (see below). The read operation of CRUD is also an instance of access in terms of the ADLM.

7This distinction is extended to a general feature of lifecycle actors in the model proposed here; see Sect. 3.4.

External Use Whereas access merely means to retrieve data, external use implies that the user then performs some further actions with it, such as aggregation or export into other systems or software. It should be stressed that this phase explicitly means usage of the system’s data outside of it. If data is used and changed within the system, this is a case of refinement. As mentioned above, use falls together with access in both Kosch et al. and Catteau et al., whereas Staab et al. include a dedicated use step. Also, the model in Millard et al. includes a dedicated knowledge reuse step, and Lee and Hong suggest the term knowledge utilisation.

Feedback The feedback phase allows users of a system to comment on the data or metadata they have previously accessed and used. In a strict view, this only makes sense in centralised lifecycles, where there is some sort of authority the users can give feedback to. Nevertheless, the multimedia model in Kosch et al. does not provide for any kind of explicit feedback. In Catteau et al., different kinds of users (experts and end users) provide feedback at different levels in the feedback and validation steps. Staab et al. also provide feedback as evaluation, but only for the ontology which was devised in the knowledge meta-process. In the same sense, Chen et al. envisage a Service & Evaluation phase, which provides feedback on the ontology and overall performance of the system. Similarly, almost all dedicated ontology development methodologies provide the opportunity for feedback, usually in order to gain consensus within a community. The knowledge maintenance phase in Millard et al. has a similar purpose.

Termination Finally, the termination phase represents the moment when data is removed from the system. In other words, it is the “end of life” phase (a term taken from product lifecycles). Surprisingly, of all the models in the survey, only Catteau et al. provide a slot for the termination of data. Termination is, however, one of the four CRUD operations, i.e., the delete operation.
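The correspondences between CRUD and the ADLM phases stated in the paragraphs above can be collected in one place. The following is a minimal sketch in Python; the names `Phase`, `CRUD_TO_ADLM` and `phases_without_crud_counterpart` are illustrative assumptions and not part of the ADLM itself.

```python
from enum import Enum


class Phase(Enum):
    """The ten ADLM phases named in this section."""
    ONTOLOGY_DEVELOPMENT = "ontology development"
    PLANNING = "planning"
    CREATION = "creation"
    ARCHIVING = "archiving"
    REFINEMENT = "refinement"
    PUBLICATION = "publication"
    ACCESS = "access"
    EXTERNAL_USE = "external use"
    FEEDBACK = "feedback"
    TERMINATION = "termination"


# Mapping as described in the text: create and update both include
# archiving, because the result is made persistent immediately.
CRUD_TO_ADLM = {
    "create": (Phase.CREATION, Phase.ARCHIVING),
    "read": (Phase.ACCESS,),
    "update": (Phase.REFINEMENT, Phase.ARCHIVING),
    "delete": (Phase.TERMINATION,),
}


def phases_without_crud_counterpart():
    """Return the ADLM phases that no CRUD operation maps to."""
    covered = {p for phases in CRUD_TO_ADLM.values() for p in phases}
    return [p for p in Phase if p not in covered]
```

Read this way, CRUD covers only half of the phases; planning, publication, external use, feedback and ontology development have no counterpart among the four operations.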

3.2. Lifecycle features

There are a number of dichotomies and other characteristics that can aid in classifying different lifecycle models and which are useful in pointing out the differences and similarities between the various lifecycle models presented in Sect. 2.

Distinction data vs. metadata An important point in looking at data lifecycles is the question whether or not a distinction between data and metadata is made. Are both kinds of data first-class citizens? A number of models in the survey clearly make this distinction: in multimedia, Kosch et al. distinguish between multimedia resources and metadata about those. In eLearning, both Catteau et al. and Millard et al. consider learning objects and annotations on them as different elements in their lifecycle models. While McKeever does not discuss metadata in Web content management in detail, she does seem to distinguish the two when she states that part of the control and administration process is the “ability to specify metadata”. On the other hand, Hardman initially distinguishes between data (media objects) and metadata (annotations on media objects), but leaves the possibility open that annotations themselves can become first-class citizens (see Sect. 2.1). In the model in Staab et al., the distinction is also not so clear. Instead, all data in this model are knowledge items with a varying degree of formality, ranging from unstructured text over templated data to structured data objects. Finally, Lee and Hong only use the generic term “knowledge” when describing the data in their lifecycle. Since activities such as data mining can lead to new knowledge, we will assume that a clear distinction between data and metadata is not made in their model.

Prescriptive vs. descriptive The dichotomy of prescriptive vs. descriptive is typically used in linguistics, where it denotes the two different approaches to language in general, and grammar in particular. Prescriptive linguistics attempts to define a set of rules for the “correct” use of language, whereas descriptive linguistics tries to analyse a language in its actual use, in order to find rules and regularities within it (see e.g., [7]). Applied to the domain at hand, we use the term prescriptive for lifecycle models which try to establish a set of steps which are then suggested for use by others8, while a descriptive model will look at a given system and find a lifecycle in it. Using this terminology, many of the models presented in Sect. 2 can be called prescriptive, because they suggest to the reader a methodology of how the lifecycle of data or metadata should be handled. Hardman presents an interesting exception to this trend: “The processes should not be viewed as prepackaged, ready to be implemented by a programmer. Our goal is rather to analyse existing systems to identify functionality they provide”. Lee and Hong and McKeever both address an audience of decision makers or experts in a business position, rather than one of implementers and developers. For this reason, they simply report what they conceive to be lifecycle processes in knowledge and Web content management, rather than suggest a particular model to use, and should therefore be considered descriptive.

8In a very combative position paper, [28] make the point that, applied to the area of software development, prescriptive lifecycle models may in fact be considered a bad idea in many circumstances.

Table 1. Comparison of Metadata Lifecycles: Phases

Homogeneous vs. heterogeneous We will say that a lifecycle is homogeneous when the data in the system it describes is homogeneous, i.e., when its schema or ontology is known beforehand, and no data of unknown form or using unknown vocabulary terms will typically enter the cycle. In contrast, a lifecycle is heterogeneous when the data in the system it describes is heterogeneous, i.e., when its form or ontology is not known beforehand. Most lifecycle models presented in the survey are examples of homogeneous lifecycles, since they only deal with a specific kind of data or metadata. Kosch et al. is restricted to MPEG-7, while in Catteau et al. only a specific vocabulary for learning objects is used. Neither Staab et al. nor Chen et al. prescribe the vocabulary or ontology to be used for data and/or metadata in the system; instead, they assume that a previous ontology creation phase clearly defines which ontology will be used for any metadata in the system. The same is true for Millard et al., who assume that a domain ontology will be modelled in the first phases of the lifecycle, which will then be used to annotate the data in the system (in this case learning objects). Neither Lee and Hong nor McKeever explicitly define what the form of either data or metadata in the systems they describe should be. This means that a clear classification in terms of homogeneous and heterogeneous cannot be made for these two models. An exception is Hardman: while it is true that the data in this lifecycle is restricted to some form of media asset, no restrictions are given on the form or terminology used for annotations over this data. Also, even the restriction to media as data is softened somewhat in the sense that annotations are allowed to become media assets in their own right.

Open vs. closed This feature is related to the homo-/heterogeneous dichotomy. Open and closed describe whether a system allows arbitrary data from the outside, which was not initially meant for this particular system, to enter its lifecycle. In this sense, the model described in Staab et al. is an example of an open lifecycle, since it includes an explicit import stage for the inclusion of external data. All other models in the survey should be considered closed lifecycles, because they do not provide any import stage. At first sight, this statement does not seem to hold for Hardman, in the sense that the two planning steps premeditate and message construction take their inputs from outside the system. However, this kind of input actually only represents the intent or context of the author for becoming active in the system. To consider this as an example of an open lifecycle would render the feature meaningless, since all lifecycles would then have to be characterised as open.

Centralised vs. distributed This last dichotomy describes the physical nature of a system and the lifecycle of its data and metadata: if it resides in a single, centrally controlled infrastructure, we will call it centralised. If it is spread over a network with no single point of control, we will call it distributed. Most systems described in Sect. 2 are examples of centralised lifecycles. One might argue that, if users can access the system online, those systems are still distributed. However, since the circular flow of data resides in a centralised system, we will still consider them to be centralised. Again, Hardman is an exception: her lifecycle does not describe a concrete system at any one place, but rather the lifecycle of media data as it is processed by various tools. Where those tools reside, how they are connected and how the data flows between them is left open, therefore making the lifecycle a distributed one. Due to their vague nature, both Lee and Hong and McKeever could describe either a centralised or a distributed system.

Lifecycle type [8] and [10] suggest three general types of lifecycle models, based on how and under what circumstances a new iteration of the lifecycle is reached and how data in the system is changed: (i) sequential, (ii) incremental and (iii) evolving. A sequential model [40] (also referred to as the waterfall model) denotes a kind of lifecycle in which each phase or step can only be reached if the preceding one is completely finished. This also means that a new iteration of the cycle can only be started when all steps have been gone through. In contrast, the incremental model allows the start of a new iteration whenever the totality of data in the system is lifted to a new version. This may happen even before the cycle has been finished completely. Finally, the evolving model implies that data in the system can change at any time, meaning that new iterations can be started at any time. Consequently, it would also mean that several iterations of the lifecycle can be ongoing at the same time.
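The three lifecycle types differ mainly in the condition under which a new iteration may begin. As a rough illustration, the rule for each type can be written as a small predicate. This is a simplified sketch, not a definition taken from [8] or [10]; the names `LifecycleType` and `may_start_new_iteration` and the two boolean inputs are assumptions made for the example.

```python
from enum import Enum


class LifecycleType(Enum):
    SEQUENTIAL = "sequential"
    INCREMENTAL = "incremental"
    EVOLVING = "evolving"


def may_start_new_iteration(kind, cycle_completed, new_version_reached):
    """Decide whether a new iteration of the lifecycle may start.

    cycle_completed: all steps of the current iteration are finished.
    new_version_reached: the totality of data in the system has been
        lifted to a new version.
    """
    if kind is LifecycleType.SEQUENTIAL:
        # Waterfall: every step must have been gone through first.
        return cycle_completed
    if kind is LifecycleType.INCREMENTAL:
        # A new version of the data suffices, even mid-cycle.
        return new_version_reached
    # Evolving: data may change, and iterations may start, at any time.
    return True
```

For instance, with an unfinished cycle and a new data version, only the sequential type blocks a new iteration.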

Granularity Since every lifecycle operates on some kind of system, we can also consider how much of that system is affected by a single iteration of the cycle. By going through the individual steps of the cycle, are we manipulating all data in the system or only parts of it? The answer to this question defines the granularity of the lifecycle: a model where all data is affected in each iteration has a coarse granularity, whereas a model where only portions of the data are affected has a fine granularity. Most lifecycles in the survey should be considered fine-grained models, in the sense that they allow a single iteration for each data object in the system, be it a media asset, a learning object, a document, etc. However, those models which contain a dedicated ontology development phase in the beginning appear to be rather coarse-grained: we first define the domain ontology, then apply it to a corpus of data, then get feedback on the ontology and its application, then revise the ontology based on the feedback, etc. In the survey, examples of this kind of coarse-grained lifecycle are Millard et al. and Chen et al.

There also seems to be a connection between the granularity of a lifecycle and its type. A lifecycle of the evolving type would also tend to be fine-grained (otherwise the whole system would be in a constantly changing state, with each state existing alongside the others), whereas a sequential lifecycle would tend to be more coarse-grained.
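Taken together, the features of this section form a small record by which two lifecycle models can be compared. The sketch below is one possible encoding in Python. The class and field names are assumptions, `None` stands for "not determined in the discussion above", and the two example records only contain values that are stated explicitly in the text for Hardman and Staab et al.

```python
from dataclasses import dataclass, asdict
from typing import Optional


@dataclass(frozen=True)
class LifecycleFeatures:
    distinguishes_metadata: Optional[bool] = None
    prescriptive: Optional[bool] = None        # False means descriptive
    homogeneous: Optional[bool] = None         # False means heterogeneous
    open_to_external_data: Optional[bool] = None  # False means closed
    centralised: Optional[bool] = None         # False means distributed
    lifecycle_type: Optional[str] = None       # sequential/incremental/evolving
    granularity: Optional[str] = None          # coarse/fine


def differing_features(a, b):
    """Names of features on which both models are classified but disagree."""
    da, db = asdict(a), asdict(b)
    return [k for k in da
            if da[k] is not None and db[k] is not None and da[k] != db[k]]


# Values taken from the discussion above; everything else is left open.
HARDMAN = LifecycleFeatures(prescriptive=False, homogeneous=False,
                            open_to_external_data=False, centralised=False)
STAAB_ET_AL = LifecycleFeatures(homogeneous=True, open_to_external_data=True)
```

Calling `differing_features(HARDMAN, STAAB_ET_AL)` then lists the homogeneity and openness features, the two on which the text classifies both models differently.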

3.3. Lifecycle roles

Based on the survey in Sect. 2 we propose five different roles (see [37]) in this section, which can be played by actors in the ADLM. Together with the lifecycle phases and features, these roles will aid in describing and classifying various lifecycle models, and ultimately outline the lifecycle model for the Semantic Web. With respect to the relationship between actor and role, it should be noted that the same actor in any particular system to which the lifecycle is applied can play several roles at once. E.g., the same actor could, with their data creator hat on, first plan and create a data object, and then, with their administrator hat on, archive and publish the data.

In addition to the models in the survey, we also consider the roles established in a prescriptive proposal for the lifecycle of URIs on the Semantic Web [4], which are URI owner (authority to mint URIs), statement author (using URIs in statements) and consumer (reading and interpreting an RDF statement).

Each role will typically be involved in specific phases from the lifecycle meta-model, as shown in the overview in Fig. 8.

Ontology designers An actor who is involved in the ontology development phase of the lifecycle plays the role of an ontology designer. As argued previously, the ontology development phase can often be seen as a preceding lifecycle of its own. For the same reasons, when the ontology is considered data, the ontology designer role could be argued to comprise all other roles within it.

Data creators The role of data creator implies that the actor who performs it is creating the primary kind of data in the system at hand. Depending on the system, this could mean the creation of media assets, learning objects, documents, etc. In the literature, this kind of role has been given different names: Greenberg, in her discussion of processes, people and tools, calls this role content creator, whereas in Kosch et al., data creators are identified as content providers/producers in the tripartite user space. In Booth’s proposal, the most suitable mappings for the data creator role are URI owner and statement author. Within the lifecycle meta-model, data creators partake in the planning and creation phases.

Metadata creators Metadata creators are those actors who annotate the primary data in a system with metadata. This role is called processing user in the user space in Kosch et al. and statement author in the proposal by Booth. Greenberg uses the same term as the ADLM, but further distinguishes metadata creators according to their level of professionality. In the lifecycle model, we have chosen to express professionality as a role feature instead. Metadata creators also perform planning, after which they either create new metadata, or refine existing data or metadata.

Obviously, the distinction between data creators and metadata creators will become meaningless when a particular lifecycle does not distinguish between data and metadata. In that case, the two roles should be considered identical.

Administrators Administrators, in contrast to creators, handle data and metadata in the system without actually changing its shape or explicit meaning. In that sense, administrators are responsible for the archiving, publication and termination of data. Other activities, which are not covered in the meta-model, are those of the control and administration support process as outlined in McKeever (security, monitoring, etc.).

Table 2. Comparison of Metadata Lifecycles: Features

End users As the last role, end users are those actors who do not actively handle the data in the system, but instead passively receive and consume it. According to Kosch et al., end users are involved in “browsing, searching, and consuming” metadata and content. The equivalent in Booth is consumer. Within the meta-model, this means that this role is played in the access and external use phases, but also in the more active feedback phase, which allows the closing of the cycle. This gives the end user actor a pivotal role in the lifecycle, which is also reflected in the fact that they can change their role and become a data or metadata creator (by re-using the data). By doing this, they will effectively move the data they are consuming back into the refinement or archiving phase, thus leading to a new iteration of the lifecycle.
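The assignment of roles to phases described in this section can be summarised as a lookup table. The following Python sketch follows the wording of the role descriptions above; the identifiers `ROLE_PHASES` and `roles_for_phase` are illustrative assumptions, and the original Fig. 8 may show the assignment in more detail.

```python
# Phases in which each role typically takes part, per the role
# descriptions in this section.
ROLE_PHASES = {
    "ontology designer": ["ontology development"],
    "data creator": ["planning", "creation"],
    "metadata creator": ["planning", "creation", "refinement"],
    "administrator": ["archiving", "publication", "termination"],
    "end user": ["access", "external use", "feedback"],
}


def roles_for_phase(phase):
    """Return the roles that take part in the given phase."""
    return [role for role, phases in ROLE_PHASES.items() if phase in phases]
```

With this table, `roles_for_phase("planning")` yields both creator roles, which matches the remark that the two roles coincide when a lifecycle does not distinguish data from metadata.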

3.4. Actor features

Actor professionality Apart from the roles defined in the paragraphs above, each actor playing a role can also be categorised according to their level of professionality. Borrowing the suggestion made in Greenberg, we propose two levels of professionality, which can be seen as the extreme points on a vector of professionality: professional and community or subject enthusiasts. Greenberg only applies these attributes to metadata creators, whereas in this model they will be applied to actors performing any of the five roles. Professionality is a function of individual actors paired with a role. For example, an actor might be a community or subject enthusiast as a Data Creator, but professional as an Administrator.

Actor humanness Any of the roles defined for data lifecycles can be performed by actors which are either human or machine agents. This categorisation is also highlighted by Greenberg, but restricted to metadata generation. In the model proposed here, humanness can be applied to actors performing any role of the lifecycle. Situations which involve any kind of semi-automatic operation should be modelled with two actors: one human and one machine9.
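The two actor features can be expressed as a simple data structure. The sketch below is written in Python under the assumption that professionality is stored per role and humanness per actor; all names (`Professionality`, `Actor`, `semi_automatic_operation`) are hypothetical and chosen only for illustration.

```python
from dataclasses import dataclass, field
from enum import Enum
from typing import Dict


class Professionality(Enum):
    PROFESSIONAL = "professional"
    ENTHUSIAST = "community or subject enthusiast"


@dataclass
class Actor:
    name: str
    is_human: bool
    # Professionality is recorded per role, since it is a property of
    # the actor-role pairing rather than of the actor alone.
    professionality: Dict[str, Professionality] = field(default_factory=dict)


def semi_automatic_operation(human, machine):
    """Model a semi-automatic operation as a pair of two actors."""
    if not human.is_human or machine.is_human:
        raise ValueError("expected one human and one machine actor")
    return human, machine
```

An actor who is an enthusiast as a data creator but a professional as an administrator is then a single `Actor` with two entries in `professionality`.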

3.5. Metadata features

As a final area of classification, we will present some features that have been suggested for the description of metadata in the literature.

9Unless, of course, the actor is a cyborg. In this case, a human-machine category should be added to the model.

Authoritative vs. non-authoritative Where data and metadata are distinguished, metadata can be either authoritative or non-authoritative. Authoritative metadata will be used in preference over all other, non-authoritative metadata. The exact definition of both terms depends on the usage context; however, generally speaking metadata is considered authoritative if it comes directly from the author or source of a piece of data. E.g., authoritative metadata about a digital image would be the metadata that comes directly from the creator of the image. On the other hand, metadata is considered non-authoritative if it originates from any source other than the original author of the data. E.g., tags created by means of social tagging (i.e., by the community) would be considered non-authoritative metadata. For a discussion of authoritative metadata as part of Web architecture see [11]. As an example of the relevance of this feature, [20] propose the use of T-Box reasoning over authoritative statements only, to tackle scalability issues in Web-scale reasoning.
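The preference rule for authoritative metadata can be illustrated with a short Python sketch. It assumes the general definition given above (metadata is authoritative if it comes from the author of the data it describes); the `Statement` record, the `authors` lookup table and both function names are hypothetical, and a real system would define authority according to its own usage context.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Statement:
    subject: str   # identifier of the described data item
    prop: str      # metadata property, e.g. "title" or "tag"
    value: str
    source: str    # who made the statement


def is_authoritative(stmt, authors):
    """A statement is authoritative if its source authored the subject."""
    return authors.get(stmt.subject) == stmt.source


def preferred(statements, authors):
    """Use authoritative statements if any exist, otherwise fall back."""
    authoritative = [s for s in statements if is_authoritative(s, authors)]
    return authoritative if authoritative else list(statements)
```

For a digital image, a title supplied by its creator would thus be preferred over tags supplied by the community, while the community tags would still be used if nothing authoritative is available.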

Content-independent vs. content-dependent This dichotomy is introduced as part of a classification scheme for metadata in [22] and classifies metadata according to its content dependency. Examples of content-independent metadata are modification date or author of a document, a logo, or sensor information for an image. Content-dependent metadata, on the other hand, is directly related to or even derived from the content of the data it describes. Examples of this kind of metadata are document size or image resolution.

Direct content-based vs. content-descriptive A further classification of content-dependent metadata from Kashyap and Sheth describes how closely metadata is based on its data. Direct content-based metadata is directly derived from its data, such as a document full-text index or document vectors. Content-descriptive metadata, on the other hand, is an indirect description of data, such as textual descriptions or markup.

Domain-independent vs. domain-specific Drilling down further in their classification of metadata, Kashyap and Sheth suggest this dichotomy to describe the domain dependency of content-descriptive metadata. Metadata which can be used to describe other data regardless of its domain is called domain-independent. A typical example would be format-related markup such as HTML. In contrast, domain-specific metadata is applicable only depending on the subject domain of its data. A domain ontology or domain-specific markup are examples.

A similar distinction is discussed in [36]. The authors there divide the domain of metadata into structural metadata and content metadata, where the former describes the internal and external structure of its data (and is therefore content-independent), whereas the latter describes the content of its data (and is therefore content-specific). Kosch et al. also make this distinction by using the two opposing terms “metadata for content adaption” (structural) and “metadata for search and retrieval” (content). Occasionally, domain-independent metadata is further differentiated into structural and administrative metadata (e.g., [31]).
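The nesting of these metadata features (content basis applies only to content-dependent metadata, domain dependency only to content-descriptive metadata) can be sketched as a small validated record. This is a minimal sketch under stated assumptions: the class name, field names and string values are hypothetical, and the authoritative flags given to the examples are assumptions made purely for illustration.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class MetadataFeatures:
    name: str
    authoritative: bool
    content_dependent: bool
    # Only meaningful for content-dependent metadata: "direct" or "descriptive".
    content_basis: Optional[str] = None
    # Only meaningful for content-descriptive metadata: "independent" or "specific".
    domain: Optional[str] = None

    def __post_init__(self) -> None:
        if self.content_basis not in (None, "direct", "descriptive"):
            raise ValueError("content_basis must be 'direct' or 'descriptive'")
        if self.domain not in (None, "independent", "specific"):
            raise ValueError("domain must be 'independent' or 'specific'")
        if self.content_basis is not None and not self.content_dependent:
            raise ValueError("only content-dependent metadata has a content basis")
        if self.domain is not None and self.content_basis != "descriptive":
            raise ValueError("only content-descriptive metadata has a domain dependency")


# Examples from the text; the 'authoritative' values are assumptions.
modification_date = MetadataFeatures("modification date", True, False)
full_text_index = MetadataFeatures("full-text index", False, True, "direct")
html_markup = MetadataFeatures("HTML markup", True, True, "descriptive", "independent")
domain_markup = MetadataFeatures("domain-specific markup", True, True, "descriptive", "specific")
```

The checks in `__post_init__` encode the drill-down order of the classification, so an inconsistent combination such as content-independent metadata with a content basis is rejected.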

4. Applying the ADLM to the Semantic Web

In the following, we will revisit the different phases and features proposed in previous sections and discuss whether and how they are realised on the Semantic Web.

4.1. Semantic Web lifecycle phases

Ontology Development Ontologies are a central component of a self-describing, semantic Web. Based on the evidence found in the survey, the definition of the ontology development phase in the previous section stated that this phase should either be considered to be outside the lifecycle model (since it concerns a meta layer and therefore doesn’t directly consider the flow of data itself), or to be its own lifecycle altogether. While this is still true for the Semantic Web, there is, however, a slight modification: Vocabularies and ontologies on the SW on the one hand and primary data (RDF statements using the terms defined in the ontologies) on the other (or TBox and ABox in description logic terms) cannot be distinguished with respect to their form. Any document or set of statements can contain any mixture of definitions of ontological terms and instances of them. Therefore, data in the ontology development phase has to be interpreted in two ways: (i) Ontological data in the sense of a conceptualisation of a domain. In this sense, ontology development is a separate lifecycle, with its own set of phases and features, e.g., those defined in [8]. Indeed, we shall consider it to be a separate instance of the ADLM, preceding that of the primary (or instance) data. (ii) Ontological data in the sense of their nature as RDF statements. In this sense, ontological data is indistinguishable from other data on the Semantic Web and part of the same lifecycle. Which of the two interpretations applies depends on the specific requirements of the scenario for which the ADLM is used.

Planning With respect to planning, the lifecycle of data on the Semantic Web is no different from that of any other system. Planning must precede any creation or refinement of data. It does not touch the data itself, and is therefore outside of the system. Examples of planning include (i) the decision process that leads to the actual creation of data, (ii) defining a URI pattern for any new resources that shall be created, or (iii) setting up the necessary technical infrastructure that might be needed for it, such as servers or tools to transform source data from other formats into RDF.

Creation In terms of the ADLM, data creation means introducing data into the lifecycle’s system which did not exist within the system before. For the purpose of this paper, we will say that any data which is not a set of RDF statements does not qualify as data within the Semantic Web. In a looser interpretation of “Semantic Web”, one could of course argue that data in various other formats can be integrated into the Semantic Web, be it by extraction, transformation or reference. However, for the purpose of this discussion, such data will only be considered once it has been expressed in terms of the system, i.e., as RDF statements. Creation of data on the SW in our model therefore means the creation of new RDF statements.

Refinement Refinement means to change or make additions to existing data in some way. An example would be updating an RDF description of someone’s bibliography with references to new papers or articles. The process of ontology mapping should also be classified as refinement, in the sense that one makes additional statements about ontology terms. Since, within the model discussed in this paper, all data is RDF, and since no difference is made between data and metadata (see Sect. 4.2), refinement and creation on the Semantic Web are technically the same. Making additions to existing data could tentatively mean making additions to an RDF statement. This is not possible, however, since RDF statements are immutable (see footnote 10). Making additions therefore has to mean creating new statements about a resource that is part of a previously existing statement. If the focus is on an RDF graph (a set of statements), rather than an individual statement, making additions then means creating a new statement to add to the graph. This point of view also captures the notion of database updates, applied to RDF. In other words, updates on the SW are the addition of statements to existing RDF graphs. In both cases, statement-level and graph-level, refinement is therefore a special case of creation.

10 Changing an RDF statement would create a new one, which could not be explicitly related to the original one. There is no concept of versioning in RDF.

However, for some contexts it is still useful to keep a distinction between the two phases. When a user creates a description of a real world entity in RDF, this description will in most cases be a set of statements. Conceptually, the user might consider creating this set of statements as one act of creation, saying e.g. “I created an RDF description of Galway”. When, at a later stage, this RDF graph is extended by adding more statements about Galway, the user might consider this to be an act of refinement, clearly distinct from the initial creation, saying “I have refined the description of Galway”. In other words: at statement level, there is only creation, while at graph level, there are creation and refinement.
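The statement-level and graph-level views can be made concrete with immutable tuples and sets. The sketch below is an illustration under stated assumptions: statements are plain string triples, the prefixed names (`ex:Galway` and so on) are hypothetical, and the helper names `create` and `refine` are not part of the ADLM.

```python
from typing import FrozenSet, Iterable, Tuple

# A statement is an immutable (subject, predicate, object) tuple.
Statement = Tuple[str, str, str]
Graph = FrozenSet[Statement]


def create(statements: Iterable[Statement]) -> Graph:
    """Creation: introduce a new set of statements into the system."""
    return frozenset(statements)


def refine(graph: Graph, new_statements: Iterable[Statement]) -> Graph:
    """Graph-level refinement: add statements about resources that already
    occur in the graph. At statement level this is still pure creation,
    because no existing statement is modified."""
    known = {s for s, _, _ in graph} | {o for _, _, o in graph}
    additions = frozenset(new_statements)
    for s, _, o in additions:
        if s not in known and o not in known:
            raise ValueError(f"statement about unknown resource: {s}")
    return graph | additions


# "I created an RDF description of Galway"
galway = create({("ex:Galway", "rdf:type", "ex:City")})

# "I have refined the description of Galway"
galway_refined = refine(galway, {("ex:Galway", "ex:locatedIn", "ex:Ireland")})
```

Note that `refine` returns a new graph and leaves the original untouched, which mirrors the observation that changing data on the SW always amounts to creating new statements.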

Inferencing and reasoning, as central concepts of the Semantic Web, would be considered examples of creation and/or refinement: the application of inference rules implies creating new statements based on other, pre-existing statements. Since the new statements will most likely be related to the resources involved in the existing ones, inference will more often than not be an example of refinement.
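As an illustration of inference as refinement, the following sketch applies a single entailment rule (the RDFS subclass rule: if x has type C and C is a subclass of D, then x has type D) until no new statements appear. It is a minimal sketch, not a complete RDFS reasoner; the data and the `ex:` names are hypothetical.

```python
from typing import Set, Tuple

Statement = Tuple[str, str, str]

TYPE = "rdf:type"
SUBCLASS = "rdfs:subClassOf"


def infer_types(graph: Set[Statement]) -> Set[Statement]:
    """Return only the NEW statements entailed by the subclass rule."""
    closure = set(graph)
    while True:
        subclass_pairs = {(s, o) for s, p, o in closure if p == SUBCLASS}
        derived = {
            (x, TYPE, parent)
            for x, p, cls in closure if p == TYPE
            for child, parent in subclass_pairs if child == cls
        }
        new = derived - closure
        if not new:
            # Fixpoint reached: everything derivable is already present.
            return closure - set(graph)
        closure |= new


graph = {
    ("ex:Galway", TYPE, "ex:City"),
    ("ex:City", SUBCLASS, "ex:Place"),
    ("ex:Place", SUBCLASS, "ex:Thing"),
}
inferred = infer_types(graph)
```

Every inferred statement here has `ex:Galway` as its subject, a resource that was already part of the graph, so in ADLM terms the reasoner performs refinement rather than creation from scratch.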

Lifecycle aspects such as provenance and trust are essentially the addition of metadata to the lifecycle’s primary data, and should as such also be interpreted as a type of refinement. Alternatively, such metadata could have its own lifecycle, which is linked to the primary data lifecycle. This is similar to the way ontology development can be its own lifecycle.

Archiving The archiving phase in the SW data lifecycle constitutes making data persistent. In concrete terms, this can e.g. mean serialising data in a particular RDF serialisation format, or storing and indexing it in one of the many RDF databases available. In more general terms, archiving means to physically prepare the data in a way that allows it to be published (see next phase).

Technically, creating or refining data and archiving are closely related and hard to distinguish in the model. On the one hand, it could be argued that data, i.e., RDF statements, are created in a transient, non-persistent way, e.g., by simply thinking them or as an in-memory representation. However, there will hardly be any application that will facilitate creation but not archiving of some form, or an actor on the lifecycle that is responsible for the former, but not the latter. It is perfectly feasible to say that a user creates RDF data in a file, which is then loaded into a database by another user. However, this only means that data has been created and archived, and then archived a second time. The conclusion is that creation/refinement cannot occur on its own in any realistic scenario, whereas archiving can. It is still beneficial to distinguish between the two stages, e.g., when the model is applied for the analysis of a single application, in order to define a separation of concerns within it (which part of the code or UI is concerned with creation, and which is concerned with archiving).

Furthermore, there are cases where data is created and published immediately, without ever being archived. Examples of this are data wrappers which generate RDF on the fly from a source format (e.g., a GRDDL transformation), or inference engines which compute additional statements of an RDF graph based on RDFS or OWL entailment rules on the fly, i.e., without ever materialising those statements physically.
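A data wrapper of this kind can be sketched as a generator that turns source records into N-Triples lines only when a response is assembled, so that nothing is stored in between. The base URI, the property name and the record layout below are hypothetical, and the sketch assumes that record identifiers are already safe to use inside a URI.

```python
from typing import Dict, Iterable, Iterator

BASE = "http://example.org/"  # hypothetical namespace


def ntriples_literal(value: str) -> str:
    """Escape a plain string as an N-Triples literal."""
    escaped = (value.replace("\\", "\\\\")
                    .replace('"', '\\"')
                    .replace("\n", "\\n")
                    .replace("\r", "\\r"))
    return f'"{escaped}"'


def wrap(records: Iterable[Dict[str, str]]) -> Iterator[str]:
    """Creation: lazily derive RDF statements from non-RDF source records."""
    for record in records:
        subject = f"<{BASE}city/{record['id']}>"
        predicate = f"<{BASE}vocab/name>"
        yield f"{subject} {predicate} {ntriples_literal(record['name'])} ."


def publish(lines: Iterable[str]) -> str:
    """Publication: build the response body directly from the generator.
    No file or store is written, i.e. there is no archiving phase."""
    return "\n".join(lines) + "\n"


source = [{"id": "galway", "name": "Galway"}]
response_body = publish(wrap(source))
```

Writing the same lines to a file or loading them into a store before serving them would insert an archiving phase between creation and publication.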

Publication Publication of data on the Semantic Web means to make it available on the World-wide Web. In other words, it means making it available at an HTTP URI on a Web server. In the case of data which has previously been archived, this can either happen in the form of a file that is served, or an interface to an RDF database (such as a SPARQL endpoint). Alternatively, as described in the previous section, RDF data can be created and published in one seamless step.

Access From a technical perspective, the access phase in our model on the Semantic Web can be defined as the dereferencing of an HTTP URI by an agent, given that SW data has been published to and is served at that URI. Dereferencing here means to send an HTTP request to a server and receive an HTTP response in return. Access is the flip side of publication.

Access can occur in two different ways, which determine how the lifecycle is continued. If an agent accesses a URI on the Semantic Web and retrieves RDF data, this means it stays within the system. In this case, the data can then be refined (changed), archived and published again, possibly at a different location than before. Alternatively, an agent could also access a SW URI and receive an answer in a different format, such as an HTML page. In this case, the data would be exposed outside the system and enable external use.

External Use External use means the utilisation of data outside the system of the lifecycle. For the Semantic Web, this means to use RDF outside it (use within the Semantic Web means the flow of data through many of the other phases of the lifecycle: data is used on the SW when it is refined, when it is archived, published, or accessed). Examples of external use on the Semantic Web would be the aggregation of data from various sites in order to answer a particular query for an external agent, or the conversion and import of data into an external application, such as a desktop address book application. The latter scenario is described in [35].

Strictly speaking, even the rendering of a Web page in a browser derived from RDF data already constitutes a case of external use. However, in the case of an HTML page with embedded RDFa, an interesting situation arises, in which the same document is a hybrid of published Semantic Web data and external use as non-Semantic Web data. Also, it should be noted that external use can take place without the user actually realising it. E.g., a user could direct their Web browser at the URI [...]. Through content negotiation, this particular server would determine that the browser requires HTML and redirect to [...], which is an HTML document. For the user, it would not be obvious that they have just accessed and used SW data.
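The content negotiation step described above, and its consequence for the lifecycle, can be sketched as follows. This is a simplified illustration, not a complete implementation of HTTP content negotiation: wildcard media ranges are ignored, the `example.org` URIs are hypothetical stand-ins, and both the 303 status code and the Turtle default are assumptions of this example.

```python
from typing import List, Tuple

# Hypothetical representations of one resource.
VARIANTS = {
    "text/html": "http://example.org/page/galway",
    "text/turtle": "http://example.org/data/galway",
}


def parse_accept(header: str) -> List[Tuple[str, float]]:
    """Parse an Accept header into (media type, quality) pairs."""
    preferences = []
    for part in header.split(","):
        fields = [f.strip() for f in part.split(";")]
        media_type = fields[0].lower()
        if not media_type:
            continue
        quality = 1.0
        for field in fields[1:]:
            if field.lower().startswith("q="):
                try:
                    quality = float(field[2:])
                except ValueError:
                    quality = 0.0
        preferences.append((media_type, quality))
    return preferences


def negotiate(accept: str) -> Tuple[int, str, str]:
    """Pick the best supported variant; return (status, location, media type)."""
    best_type, best_q = None, 0.0
    for media_type, quality in parse_accept(accept):
        if media_type in VARIANTS and quality > best_q:
            best_type, best_q = media_type, quality
    if best_type is None:
        best_type = "text/turtle"  # assumed default when nothing matches
    return 303, VARIANTS[best_type], best_type


def lifecycle_effect(media_type: str) -> str:
    """RDF keeps the data inside the system; any other format is external use."""
    return "stays within the system" if media_type == "text/turtle" else "external use"


browser = negotiate("text/html,application/xhtml+xml;q=0.9,*/*;q=0.8")
crawler = negotiate("text/turtle")
```

A browser sending a typical Accept header ends up at the HTML variant, which is the unnoticed case of external use mentioned in the text, while an RDF-aware agent stays within the system.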

Feedback In the feedback phase, end users give comments on the data they have accessed and used. In the ADLM, it is suggested that feedback implies a system which has a central authority to which the feedback can be addressed. Following this notion, there would be no feedback on the Semantic Web as such, since there is no central authority. However, it is obvious that the Semantic Web as a system at large has many different sub-systems which are the providers of individual services or datasets. Therefore, while users of the Semantic Web could not give feedback to the overall system (just as Web users cannot give feedback to the Web as such), they could certainly give feedback to the central authority of each sub-system. This feedback can then trigger actions on the side of the service provider, such as the termination or refinement of data.

Termination Termination, finally, means to remove data from the system. In a distributed system like the Semantic Web (just like on the World-wide Web), where data is constantly crawled, indexed and duplicated by various players, it is not possible (or at least very difficult) to completely terminate any piece of data, i.e., remove it from the system without any trace. E.g., while the publishers of a particular dataset on the SW might decide to remove a number of RDF statements from their site, it is very likely that this data has previously been crawled by a different service, and will therefore still exist in some form after the termination.

Table 3. Features of the Semantic Web data lifecycle

Feature                            Value
Distinction Data vs. Metadata      no
Prescriptive vs. Descriptive       descriptive
Homogeneous vs. Heterogeneous      heterogeneous
Closed vs. Open                    open
Centralised vs. Distributed        distributed
Lifecycle Type                     evolving
Granularity                        fine

4.2. Semantic Web data lifecycle features

Wrapping up the discussion on applying the ADLM to the SW, this section will characterise the lifecycle of data on the Semantic Web in terms of the features defined in Sect. 3.2. As an overview, all features are summarised in Tab. 3.

Distinction data vs. metadata We have stated that, for the purpose of this paper, we only consider RDF statements as data. Furthermore, in RDF semantics [18] there is no conceptual difference between data and metadata. Consequently, the lifecycle model is characterised as having no distinction between data and metadata.

Prescriptive vs. descriptive The lifecycle model is descriptive, simply observing the properties of the Semantic Web as they are perceived by the authors of this paper, and classifying them in the terms of the ADLM. It is not prescriptive because it does not propose any rules for what the lifecycle of data on the Semantic Web should look like.

Homogeneous vs. heterogeneous The data in the lifecycle’s system is heterogeneous, because no assumption is made about the schema or ontology which defines the various resource types, such as classes, instances or properties. In fact, agents do not have to know any data’s schema. Data can even be completely schema-less and still be useful. One could assume the opposite standpoint and argue that the data is in fact homogeneous, because only RDF is considered. However, this only means that the data is homogeneous syntactically. It is still heterogeneous semantically, which is what is considered by this feature.

Closed vs. open With the restriction that any data entering the system first has to be expressed as RDF, the Semantic Web is an open system. Any data can be integrated, as anything can be said about anything [24], which means that there are no restrictions on the scope of domains or topics which can be discussed.

Centralised vs. distributed The underlying architecture of the Semantic Web is that of the World-wide Web, which has no place for any one central authority. Consequently, and since it does not add any such authority itself, the Semantic Web is a distributed system.

Lifecycle type The lifecycle type of the model proposed here is evolving, since there is no requirement for data to pass through the lifecycle completely before a new iteration can be started. In fact, there is no such thing as a “complete” iteration in the model. RDF data can be created, published, accessed, refined, archived, published, etc. These phases can be passed in many different orders and, in principle, an infinite number of times.

Granularity Finally, the Semantic Web data lifecycle is fine-grained, meaning that the cardinality of the set of statements considered by any particular instance of the lifecycle can be as small as 1. The lifecycle is equally applicable to individual statements and large graphs, which can both pass through any of the phases proposed in the ADLM.
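Since each ADLM instance is characterised by the same set of features, two instances can be compared mechanically. The sketch below records the Semantic Web values from Tab. 3 as a plain mapping and diffs them against a second profile. The key names, the helper function and, in particular, the second profile (a closed, centralised repository) are hypothetical and serve only to show the comparison.

```python
from typing import Dict, Tuple

FeatureProfile = Dict[str, object]

# Values as summarised for the Semantic Web in Tab. 3.
SEMANTIC_WEB: FeatureProfile = {
    "distinction_data_vs_metadata": False,
    "prescriptive_vs_descriptive": "descriptive",
    "homogeneous_vs_heterogeneous": "heterogeneous",
    "closed_vs_open": "open",
    "centralised_vs_distributed": "distributed",
    "lifecycle_type": "evolving",
    "granularity": "fine",
}

# Hypothetical second model, invented for this example only.
CLOSED_REPOSITORY: FeatureProfile = dict(
    SEMANTIC_WEB,
    distinction_data_vs_metadata=True,
    closed_vs_open="closed",
    centralised_vs_distributed="centralised",
)


def differing_features(a: FeatureProfile,
                       b: FeatureProfile) -> Dict[str, Tuple[object, object]]:
    """Return the features on which two lifecycle profiles disagree."""
    return {key: (a[key], b[key]) for key in a if a[key] != b[key]}


differences = differing_features(SEMANTIC_WEB, CLOSED_REPOSITORY)
```

The resulting mapping lists only the three features that were changed in the hypothetical profile, which is the kind of side-by-side description the shared vocabulary is meant to enable.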

4.3. Using the ADLM

Concluding the application of the ADLM to the Semantic Web, two examples shall illustrate how the model can be used in practice, by grounding discussion in a precise terminology.

Comparing applications A brief look at two cases of SW-enabled websites will illustrate how the ADLM can be applied to analyse differences in the approach to the publication of data on the Semantic Web. Version 7 of the Drupal content management system features built-in RDFa functionality, and so a means to publish data on the Semantic Web. The “Semantic Web Dog Food” (SWDF) site for conference data and metadata (details in [34]) also publishes RDF to the Semantic Web, and is based on Drupal. At first sight, the two implementations seem to be doing more or less the same thing. However, the SWDF approach to publication and the Drupal 7 RDFa functionality are fundamentally different: where SWDF uses Drupal to publish pre-existing RDF to both the Semantic Web (in its original RDF form) and the traditional eye-ball Web (as HTML), Drupal 7 transforms and publishes the non-RDF contents of its own database to the Semantic Web as RDFa. Using the ADLM, we can more precisely describe SWDF as covering the creation phase (conference data is created in file format), the archiving phase (data is archived in an RDF store) and the publication phase (data is published through Drupal). These phases are visited in sequence, as shown in Fig. 9. In Drupal 7, on the other hand, data is created on the fly from the Drupal database, and then injected as RDFa in a Drupal page. Again employing the ADLM’s terminology, we can explain that on the fly more precisely means that the publication phase follows the creation phase directly, with no intermediate archiving in play, as also shown in Fig. 9.

Fig. 9. Different approaches to publication on the SW
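The difference between the two publication approaches can be written down as two phase sequences. The following sketch is illustrative: the enumeration lists the phases discussed in Sect. 4.1, while the variable and function names are assumptions of this example.

```python
from enum import Enum
from typing import List


class Phase(Enum):
    ONTOLOGY_DEVELOPMENT = "ontology development"
    PLANNING = "planning"
    CREATION = "creation"
    REFINEMENT = "refinement"
    ARCHIVING = "archiving"
    PUBLICATION = "publication"
    ACCESS = "access"
    EXTERNAL_USE = "external use"
    FEEDBACK = "feedback"
    TERMINATION = "termination"


# SWDF: data is created as files, archived in an RDF store, then published.
SWDF: List[Phase] = [Phase.CREATION, Phase.ARCHIVING, Phase.PUBLICATION]

# Drupal 7: RDFa is created from the Drupal database and published directly.
DRUPAL_7: List[Phase] = [Phase.CREATION, Phase.PUBLICATION]


def publishes_on_the_fly(path: List[Phase]) -> bool:
    """True if publication follows creation directly, with no archiving between."""
    return any(
        current is Phase.CREATION and following is Phase.PUBLICATION
        for current, following in zip(path, path[1:])
    )


swdf_on_the_fly = publishes_on_the_fly(SWDF)
drupal_on_the_fly = publishes_on_the_fly(DRUPAL_7)
```

Under this reading, "on the fly" is no longer an informal description but a checkable property of a phase sequence: it holds for the Drupal 7 path and not for the SWDF path.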

In a less detailed way, [17] (see Sect. 2.4) also provides an example of how a lifecycle model (here a linked data lifecycle) can be used to classify and group applications, with the purpose of giving an overview of a broader field or the work done within a research group.

Explaining paradigm shifts [16] observes a (possible) paradigm shift from the (i) traditional publishing and usage of data to (ii) publishing and usage of data according to the principles of linked data. In both cases, data publishers use transformation templates to publish data from their internal databases to a webpage. However, in (i) the transformation is to traditional HTML (meaning that software agents have to employ screen scraping to re-discover structured data on the webpage), while in (ii) the transformation is to HTML+RDFa (meaning that software agents can take structured data directly from the page).

This paradigm shift can be nicely described using the ADLM: In the traditional scenario (i) the SW’s data lifecycle is only entered through the software agent’s screen scraper, which performs the creation phase of the lifecycle. On the other hand, in the linked data approach (ii) the lifecycle is already entered on the data publisher’s side, where the transformation template creates SW data, which is then published on the fly (as in the previous example). From there, a software agent is ready to simply access the data. In both cases, the software agent can then proceed to the archiving (e.g., a crawler), refinement (e.g., an aggregated query) or external use (e.g., visualisation in an infographic) phase.

5. Conclusion

The aim of this paper was two-fold: (i) to provide a wide overview of different approaches to lifecycle modelling of data-centric domains, and (ii) to establish a terminological framework and conceptual model for data and metadata lifecycles in data-centric domains. The overview was presented in the form of a survey of a significant number of different lifecycle models from the literature, ranging across different domains. The survey itself functions as a first port of call for researchers and developers aiming to design a lifecycle for a particular domain, by providing an overview of, and inspiration from, typical modelling approaches and practices.

Based on the survey, as well as additional literature, five areas of classification for data lifecycles have been identified and discussed, which together form the terminological framework and model, called the Abstract Data Lifecycle Model (ADLM): lifecycle phases, lifecycle features, lifecycle roles, actor features and metadata features. None of the individual models covered in the survey suffices as a generic model over any data-centric domain; the ADLM can fulfil this task. In addition to its primary aspects (phases, features and roles), the ADLM also enables the modelling of secondary aspects of data lifecycles, such as versioning (through repetition of the cycle) or provenance and trust (as metadata creation through the refinement phase).

The function of the ADLM as a meta model is to provide a means to classify, compare and relate other lifecycle models, as well as to provide the basis to construct new lifecycle models for other data-centric domains, applications or use cases. In the third part of the paper, this kind of use is illustrated by applying the model to the Semantic Web. This SW instance of the ADLM serves both to support the general purpose of defining a conceptual framework for the SW, and to show how application developers or researchers can use the framework to illustrate design issues or ground discussion in a common terminology.

With respect to other proposed lifecycles for the Semantic Web or Web of Data, no comprehensive proposal is known to the authors. However, a number of SW-related lifecycles have been discussed as part of Sect. 2.4 on lifecycles for knowledge and content management, as well as in Sect. 4.3 on using the ADLM.


