
Data & Knowledge Engineering 69 (2010) 836–865


RDFPROV: A relational RDF store for querying and managing scientific workflow provenance

Artem Chebotko a,⁎, Shiyong Lu b, Xubo Fei b, Farshad Fotouhi b

a Department of Computer Science, University of Texas-Pan American, 1201 West University Drive, Edinburg, TX 78539, USA
b Department of Computer Science, Wayne State University, 431 State Hall, 5143 Cass Avenue, Detroit, MI 48202, USA

Article info

⁎ Corresponding author. Tel.: +1 956 381 2577; fax: +1 956 384 5099. E-mail addresses: [email protected] (A. Chebotko), [email protected] (S. Lu), [email protected] (X. Fei), [email protected] (F. Fotouhi).

0169-023X/$ – see front matter © 2010 Elsevier B.V. All rights reserved. doi:10.1016/j.datak.2010.03.005

Abstract

Article history:
Received 12 October 2008
Received in revised form 8 March 2010
Accepted 11 March 2010
Available online 23 March 2010

Provenance metadata has become increasingly important to support scientific discovery reproducibility, result interpretation, and problem diagnosis in scientific workflow environments. The provenance management problem concerns the efficiency and effectiveness of the modeling, recording, representation, integration, storage, and querying of provenance metadata. Our approach to provenance management seamlessly integrates the interoperability, extensibility, and inference advantages of Semantic Web technologies with the storage and querying power of an RDBMS to meet the emerging requirements of scientific workflow provenance management. In this paper, we elaborate on the design of a relational RDF store, called RDFPROV, which is optimized for scientific workflow provenance querying and management. Specifically, we propose: i) two schema mapping algorithms to map an OWL provenance ontology to a relational database schema that is optimized for common provenance queries; ii) three efficient data mapping algorithms to map provenance RDF metadata to relational data according to the generated relational database schema; and iii) a schema-independent SPARQL-to-SQL translation algorithm that is optimized on-the-fly by using the type information of an instance available from the input provenance ontology and the statistics of the sizes of the tables in the database. Experimental results are presented to show that our algorithms are efficient and scalable. The comparison with two popular relational RDF stores, Jena and Sesame, and two commercial native RDF stores, AllegroGraph and BigOWLIM, showed that our optimizations result in improved performance and scalability for provenance metadata management. Finally, our case study for provenance management in a real-life biological simulation workflow showed the production quality and capability of the RDFPROV system. Although presented in the context of scientific workflow provenance management, many of our proposed techniques apply to general RDF data management as well.

© 2010 Elsevier B.V. All rights reserved.

Keywords: Provenance; Scientific workflow; Metadata management; Ontology; RDF; OWL; SPARQL-to-SQL translation; Query optimization; RDF store; RDBMS

1. Introduction

With recent advances in the development of Scientific Workflow Management Systems [34,39,46,69,70,80,123], scientists from various domains are able to automate their experiments using scientific workflows to achieve significant scientific discoveries via complex and distributed scientific computations. As a result, scientific workflow has emerged as a new field to address the new requirements from scientists [70,75]. One such important requirement is provenance management, which is essential for scientific workflows to support scientific discovery reproducibility, result interpretation, and problem diagnosis [3,20,96]. This support is enabled via provenance metadata that captures the origin and derivation history of a data product, including the original data sources, intermediate data products, and the steps that were applied to produce the data product. The provenance management problem concerns the efficiency and effectiveness of the modeling, recording, representation, integration, storage, and querying of provenance metadata.


While there is an ongoing community effort on standardizing provenance modeling via the Open Provenance Model (OPM) [3], it is still not clear which storage and query model is most suitable for provenance management. Recently, Semantic Web [16,94] technologies have been increasingly used for provenance management due to their flexibility and semantics support [48,50,65,88,121], such that provenance metadata is represented and captured via the Resource Description Framework (RDF) [111,114], RDF Schema (RDFS) [113], and the Web Ontology Language (OWL) [110], and queried using the SPARQL [115] query language. This technological suite, enhanced with Semantic Web inference support, was shown [88] to address the four functional requirements for provenance identified by the Open Provenance Model: (1) provenance information interoperability, (2) ease of application development, (3) precise description of provenance information, and (4) inference capability and digital representation of provenance. In addition, in our work, we choose a Semantic Web approach to provenance management due to several of its advantages. First, a flexible and extensible data model is needed for provenance representation, as what provenance information should be recorded can differ from one system to another and from one domain to another and can evolve over time; the RDF data model satisfies this requirement. Second, it is important to interpret and reason about provenance using domain knowledge via domain-specific provenance ontologies; therefore, an inference engine with support for user-defined inference rules is needed, as domain-specific provenance ontologies can contain various inference rules (such as "a peptide is derived from a protein") that cannot be known in advance, and domain-specific provenance ontologies can evolve rapidly over time. Third, provenance interoperability becomes more and more important due to the need to integrate provenance across different provenance models, domains, and organizations in collaborative scientific projects. The RDF model facilitates such integration and interoperability. Finally, as RDF serializes graphs, it is naturally suitable for the representation of provenance graphs with no further adaptation, even though the mapping does not have to be one-to-one (e.g., the OPM implementation as RDF/OWL by the Tupelo project [6]).

In this paper, we propose an approach to provenance management that seamlessly integrates the interoperability, extensibility, and inference advantages of Semantic Web technologies with the storage and querying power of an RDBMS to meet the emerging requirements of scientific workflow provenance management. Our motivation for using mature relational database technology is that the provenance metadata growth rate is potentially very high, since provenance is generated automatically for every scientific experiment. On the Semantic Web, large volumes of RDF data are managed with so-called RDF stores, and the majority of them, including Jena [118,119], Sesame [23], 3store [56,57], KAON [107], RStar [71], OpenLink Virtuoso [42], DLDB [81], RDFSuite [9,105], DBOWL [77], PARKA [101], and RDFBroker [100], use an RDBMS as a backend to manage RDF data. Although a general-purpose relational RDF store (see [15] for a survey) can be used for provenance metadata management, the following provenance-specific requirements bring about several optimization strategies for schema design, data mapping, and query mapping, enabling us to develop a provenance metadata management system that is more efficient and flexible than one that is simply based on an existing RDF store.

• As provenance metadata is generated incrementally, each time a scientific workflow executes, provenance systems should emphasize optimizations for efficient incremental data mapping. As we show in this work, one such optimization, a join-elimination strategy, can be developed for provenance based on the property that workflow definition metadata is generated before workflow execution metadata (see the sketch after this list).

• As the performance of provenance storage and that of provenance querying often conflict, it may be preferable for a provenance management system to trade data ingest performance for query performance. For example, for long-running scientific workflows, trading data ingest performance for query performance might be a good strategy.

• The identification of common provenance queries has the potential to lead to an optimized database schema design to support efficient provenance browsing, visualization, and analysis.

• Update and delete are not a concern of provenance management, since it works in an append-only fashion, similarly to log management. Therefore, we can apply some denormalization and redundancy strategies for database schema design, leading to improved query performance.
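To make the first bullet concrete, the following is a minimal sketch of the kind of join elimination such ordering can enable, written against the type(i,c) and $cSubject tables that appear later in the paper (Section 4 and Table 2). The URIs and the loader logic are hypothetical illustrations; the actual incremental data mapping algorithms are given in Section 5.

    -- Hypothetical sketch: placing an execution triple into the class-specific
    -- table TaskRunSubject(i,p,o) requires knowing the class of its subject.
    -- Without any ordering guarantee, a loader could resolve it with a lookup join:
    INSERT INTO TaskRunSubject (i, p, o)
    SELECT t.i, 'po:output', 'ex:d2'
    FROM type t
    WHERE t.i = 'ex:tr1' AND t.c = 'po:TaskRun';

    -- If the class of 'ex:tr1' is already known to the loader (because the
    -- metadata that establishes it was ingested earlier), the lookup join
    -- can be eliminated in favor of a direct insert:
    INSERT INTO TaskRunSubject (i, p, o) VALUES ('ex:tr1', 'po:output', 'ex:d2');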

These provenance-specific metadata properties cannot be assumed by a general-purpose RDF store, which precludes several interesting data management optimizations that would gain better performance for data ingest and querying. While conducting a case study for a real-life scientific workflow in the biological simulation field (see Section 7 for detailed information) to illustrate and verify the validity of our research, we observed that two popular general-purpose RDF stores, Jena and Sesame, could not completely satisfy the provenance management requirements of the workflow. While Sesame could not keep up with the data ingest rate, Jena could not do as well as Sesame on query performance. Both systems lacked support for some provenance queries.

Therefore, by exploiting the above provenance characteristics, we design a relational RDF store, called RDFPROV, which is optimized for scientific workflow provenance querying and management. RDFPROV has a three-layer architecture (see Fig. 1) that complies with the architectural requirements defined in the reference architecture for scientific workflow management systems [68]. The provenance model layer is responsible for managing provenance ontologies and rule-based inference to augment to-be-stored RDF datasets with new triples. The model mapping layer employs three mappings: (1) schema mapping to generate a relational database schema based on a provenance ontology, (2) data mapping to map RDF triples to relational tuples, and (3) query mapping to translate RDF queries expressed in the SPARQL language into relational queries expressed in the SQL language. These mappings bridge the provenance model layer and the relational model layer, where the latter is represented by a relational database management system that serves as an efficient relational provenance storage backend.


Fig. 1. An architecture of RDFPROV.


This paper elaborates on the design of RDFPROV and has the following main contributions: i) we propose two schema mapping algorithms to map a provenance ontology encoded in OWL to a relational database schema that is optimized for common provenance queries; ii) we propose three efficient data mapping algorithms to map provenance RDF metadata to relational data according to the generated relational database schema; and iii) we propose a schema-independent SPARQL-to-SQL translation algorithm that is optimized on-the-fly by using the type information of an instance available from the input provenance ontology and the statistics of the sizes of the tables in the database. At each design step, we contribute novel ideas that are not available in existing RDF stores, such as new kinds of relations for schema mapping, optimized incremental strategies for data mapping, and two query optimization techniques for query translation. When combined, our algorithms provide a competitive solution to the provenance management problem. We compare our techniques with the open-source relational RDF stores Jena [118,119] and Sesame [23] and the commercial native RDF stores AllegroGraph [1] and BigOWLIM [2] to show that our optimizations result in improved performance and scalability for Semantic Web enabled provenance metadata management. We also show how SPARQL can be extended with negation, aggregation, and set operations (e.g., division) to support additional important provenance queries. Last, but not least, we provide a case study for provenance management in the TangoInSilico [43] scientific workflow, exploring the production quality and capability of RDFPROV for this real-life provenance application.

1.1. Organization

The rest of the paper is organized as follows. Section 2 reviews related work. Section 3 discusses the provenance model layer of RDFPROV and introduces a sample provenance ontology. Sections 4, 5, and 6 present the model mapping layer of RDFPROV, elaborating on provenance ontology to database schema mapping, provenance metadata to relational data mapping, and SPARQL-to-SQL query translation, respectively. Section 7 provides our case study for provenance management in a real-life scientific workflow from the biological simulations field. Section 8 empirically compares RDFPROV with other relational and native RDF stores. Finally, Section 9 concludes the paper and discusses possible future work directions.

2. Related work

In this section, we discuss related work on scientific workflow provenance management, as well as RDBMS-based RDF storage and querying systems. At the end of the section, we discuss our research in the context of related work.

2.1. Storing and querying scientific workflow provenance

Provenance management has become an important functionality for most scientific workflow management systems; see [20,38,96] for surveys. The Kepler system [10,70] implements a provenance recorder to record information about a workflow run, including the context, data derivation history, workflow definition, and workflow evolution. The provenance recorder is parametric and customizable, allowing the user to choose different levels of granularity of provenance metadata for recording. Based on the provenance information, Kepler supports efficient workflow rerun for a slightly modified workflow. In addition, the provenance recorder implements the Read-Write-State-Reset (RWS) provenance model proposed by Bowers et al. [21] for pipelined scientific workflows. The RWS model records read, write, and state-reset events for each actor in a workflow run and stores them in a relational event log. An approach is proposed to reconstruct various dependency graphs from the event log for a workflow run to support a wide range of scientific provenance queries.


The myGrid/Taverna system [121,122] uses Semantic Web technologies for representing provenance metadata at four levels: process, data, organization, and knowledge. Two levels of ontologies are used. A domain-independent schema ontology is used to describe the classes of resources and the properties between them that are needed to represent the four levels of provenance. A domain ontology is used to classify various types of resources, such as data types, service types, and topics of interest for a particular domain. Taverna uses general-purpose RDF stores, such as Jena [118] and Sesame [23], to manage and query provenance.

The CombeChem [48,104], Mindswap [49,50], and VIEW [29,69] systems also use a Semantic Web approach for provenance collection and representation. While CombeChem and VIEW use relational RDF stores to manage provenance, Mindswap publishes workflow provenance on the Semantic Web.

The Swift [123] and Chimera [45] systems introduce a Virtual Data System (VDS) consisting of a set of relations to store the description of executable programs as transformations, their actual invocations as derivations, and inputs/outputs as data objects. These systems use provenance for tracking the data derivation history, on-demand data generation and re-generation, and data product validation.

The Wings-Pegasus system [64] uses an OWL ontology for semantic representation [65] of provenance generated during workflow instantiation and the Virtual Data System (VDS) provenance tracking catalog for provenance generated during workflow execution. As a result, workflow instantiation provenance can be queried using SPARQL and workflow execution provenance can be queried using SQL.

The VisTrails system [27,46] is the first to support provenance tracking of workflow evolution. In VisTrails, workflow evolution provenance is represented as a rooted tree, in which each node corresponds to a version of a workflow, and each edge corresponds to an update action that was applied to the parent workflow to create the child workflow. Therefore, a workflow evolution tree concisely represents all the workflow versions that a scientist has explored to produce the visualization products. In this way, VisTrails can support scientists in navigating through the space of workflows and parameter settings for an exploration process. VisTrails uses XML and relational database technologies for provenance management.

The REDUX system [14] uses the Windows Workflow Foundation (WinWF) as a workflow engine and introduces a layered provenance model. The REDUX provenance layers include abstract service descriptions, service instantiation descriptions, workflow data instantiation descriptions, and workflow execution descriptions. The system uses a relational database for provenance management.

The PReServ/PASOA system [51,52] supports the recording of interaction provenance, actor provenance, and input provenance with the provenance recording protocol, which specifies the messages that actors can asynchronously exchange with a provenance store to support provenance submission. PReServ uses a provenance management service that provides a common interface to enable different storage systems, such as file systems, relational databases, XML databases, and RDF stores, to serve as a provenance store.

The Karma system [97,98] records provenance along four dimensions: execution, location, time, and dataflow, and uses a publish–subscribe notification protocol for provenance collection. Karma uses XML and relational database technologies to store and query provenance.

The PASS [60] and ES3 [47] systems focus on provenance capture via operating system mechanisms. They use Berkeley DB and an XML database, respectively, for provenance management.

A summary of the storage and query capabilities of all the described systems is shown in Table 1. Many of these systems participated in the first provenance challenge [76], showcasing their storage and querying capabilities on a sample scientific workflow. Currently, provenance metadata integration across different systems is very complicated due to diverse provenance models and formats. The Open Provenance Model (OPM) community initiative [3] develops a standard vocabulary for scientific workflow provenance annotation that will enable interoperability among the systems.

Finally, data provenance is closely related to the data lineage problem [24,25,35,36] studied in the database community, which determines the source data that is used to produce a data item. However, in scientific workflows, datasets are not necessarily contained in a relational or XML database, and data processing cannot necessarily be accomplished by a database query. Therefore, existing approaches to the data lineage problem are not sufficient for solving the data provenance problem in scientific workflows.

Table 1
Storage and query capabilities of provenance management systems.

Workflow/provenance system | Provenance storage | Provenance query support
Kepler | File system | API
Taverna | RDF store | SPARQL
CombeChem | RDF store | SPARQL
Mindswap | RDF files | SPARQL
VIEW | RDF store | SPARQL
Swift/Chimera | RDBMS | SQL
Wings-Pegasus | RDF files + RDBMS | SPARQL + SQL
VisTrails | RDBMS + files | Visual QBE
REDUX | RDBMS | SQL
PReServ | Provenance store | Provenance management service
Karma | XML database | XQuery, XPath
PASS | Berkeley DB | Query tool
ES3 | XML database | XQuery, XPath


2.2. Storing and querying RDF data using relational RDF stores

In recent years, a number of relational RDF stores (see [15] for a survey) have been developed to support large-scale Semantic Web applications. To resolve the conflict between the graph-based RDF data model and the target relational data model, such systems must deal with various mappings between the two data models: schema mapping, data mapping, and query mapping (aka query translation). First, the schema mapping is used to generate a relational database schema that can store RDF data. Second, the data mapping is used to transform RDF triples into relational tuples and insert them into the database. Finally, the query mapping is used to translate a SPARQL query into an equivalent SQL query, which is evaluated by the relational engine, and its result is returned as the SPARQL query solution.
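As a concrete illustration of the three mappings under the simplest, schema-oblivious layout, consider the following sketch; the URIs, column types, and query are hypothetical and not taken from any particular system:

    -- Schema mapping: one generic relation holds all RDF triples.
    CREATE TABLE Triple (s VARCHAR(255), p VARCHAR(255), o VARCHAR(255));

    -- Data mapping: each RDF triple becomes one relational tuple.
    INSERT INTO Triple VALUES ('ex:tr1', 'rdf:type', 'po:TaskRun');
    INSERT INTO Triple VALUES ('ex:tr1', 'po:output', 'ex:d2');

    -- Query mapping: the SPARQL query
    --   SELECT ?d WHERE { ?r rdf:type po:TaskRun . ?r po:output ?d }
    -- becomes a self-join over Triple, one occurrence per triple pattern:
    SELECT t2.o AS d
    FROM Triple t1 JOIN Triple t2 ON t1.s = t2.s
    WHERE t1.p = 'rdf:type' AND t1.o = 'po:TaskRun' AND t2.p = 'po:output';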

Based on database schemas employed by existing relational RDF stores, we can classify them into four categories:

Schema-oblivious (also called generic or vertical): A single relation, e.g., Triple(s,p,o), is used to store RDF triples, such that attribute s stores the subject of a triple, p stores its predicate, and o stores its object. Schema-oblivious RDF stores include Jena [118,119], Sesame [23], 3store [56,57], KAON [107], RStar [71], and OpenLink Virtuoso [42]. This approach has no concerns about RDF schema or ontology evolution, since it employs a generic database representation.

Schema-aware (also called specific or binary): This approach usually employs an RDF schema or ontology to generate so-called property relations and class relations. A property relation, e.g., Property(s,o), is created for each property in an ontology and stores subjects s and objects o related by this property. A class relation, e.g., Class(i), is created for each class in an ontology and stores instances i of this class. An extension to the idea of property relations is a clustered property relation [117], e.g., Clustered(s,o1,o2,…,on), which stores subjects s and objects o1, o2, …, on related by n distinct properties (e.g., <s,p1,o1>, <s,p2,o2>, etc.). Representatives of schema-aware RDF stores are Jena [117–119], DLDB [81], RDFSuite [9,105], DBOWL [77], and PARKA [101]. Schema evolution for this approach should be carefully considered: the addition or deletion of a class/property in an ontology requires the addition or deletion of a relation/attribute/tuple in the database. The schema-aware approach in general yields better query performance than the schema-oblivious approach, as has been shown in several experimental studies [8,9,105]. In addition, the use of a column-oriented database to manage property relations (the vertically partitioned approach) has shown further improvements in query performance [7].

Data-driven: This approach uses RDF data, as opposed to an RDF schema or ontology, to generate the database schema. For example, in [40], a database schema is generated based on patterns found in RDF data using data mining techniques. The vertically partitioned approach can also be data-driven (e.g., as implemented in Sesame [23]): property relations are created when their instances are first seen in an RDF document during data mapping. The RDF store RDFBroker [100] implements signature relations, which are conceptually similar to clustered property relations, but are generated based on RDF data rather than RDF Schema information. There is a lack of experiments that compare the data-driven approach with other approaches (except for the vertically partitioned approach, which is available in both schema-aware and data-driven flavors). RDFBroker [100] reports improved in-memory query performance over Sesame and Jena for some test queries. Schema evolution for the data-driven approach, if supported, might be expensive.

Hybrid: This approach mixes features of the previous approaches. An example of a hybrid database schema (resulting from the schema-oblivious and schema-aware approaches) is presented in [105], where a schema-oblivious database representation, e.g., Triple(s,p,o), is partitioned into multiple relations based on the data type of object o, and a binary relation, e.g., Class(i,c), is introduced to store instances i of classes c. Theoharis et al. [105] report comparable query performance for the hybrid and schema-aware approaches.
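In DDL terms, the first two layouts differ roughly as sketched below; the relation and column names follow the examples in the text, while the VARCHAR sizing is our assumption:

    -- Schema-oblivious: a single generic relation.
    CREATE TABLE Triple (s VARCHAR(255), p VARCHAR(255), o VARCHAR(255));

    -- Schema-aware: one class relation per ontology class and one property
    -- relation per ontology property, e.g., for a class C and a property P:
    CREATE TABLE ClassC (i VARCHAR(255));                     -- instances i of class C
    CREATE TABLE PropertyP (s VARCHAR(255), o VARCHAR(255));  -- <s, P, o> pairs

    -- Clustered property relation: n properties of the same subject in one row.
    CREATE TABLE Clustered (s VARCHAR(255), o1 VARCHAR(255), o2 VARCHAR(255));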

Data mapping algorithms are hardly covered in the literature. As we show in this work, the problem of data mapping is important and has many previously unexplored optimization opportunities.

Inference support techniques employed by RDF stores can be classified as forward-chaining or backward-chaining. In forward-chaining, all inferences are precomputed and stored along with the explicit triples of an RDF graph. This enables fast query response and increased result completeness [53]; however, it complicates RDF data updates and consumes more storage space. In addition, when inference rules change, precomputed triples must be rematerialized [109]. Forward-chaining inference can be supported at the data mapping stage. In backward-chaining, inferences are computed dynamically for each query, which simplifies updates and avoids storage overhead, but results in worse query performance and scalability. This technique is bound by the main memory space required to compute inferences. Backward-chaining inference can be supported at the query mapping stage. Additional readings on inference for the Semantic Web include [4,11,66,74].

One of the most difficult mappings in relational RDF stores is the query mapping. Related literature on SPARQL-to-SQL query translation and on SPARQL query processing and optimization includes the following research works. Harris and Shadbolt [57] show how basic graph pattern expressions, as well as simple optional graph patterns, can be translated into relational algebra expressions.


Cyganiak [37] presents a relational algebra for SPARQL and outlines rules establishing equivalence between this algebra and SQL. Zemke [120] and Chebotko et al. [31] outline SPARQL-to-SQL translations that incorporate recent changes in the W3C SPARQL semantics [115] triggered by the compositional semantics proposed by Perez et al. [82,83]. In addition, to improve the evaluation performance of SPARQL optional graph patterns in a relational database, Chebotko et al. [30] propose a novel relational operator, called nested optional join, which shows better performance than conventional left outer join implementations. Based on the RDF-to-relational mappings and SPARQL-to-SQL translation presented in [31], Elliott et al. [41] translate additional SPARQL features and implement several query translation simplifications. Polleres [84] and Schenk [89] contribute the translation of SPARQL queries into Datalog. Anyanwu et al. [12] propose an extended SPARQL query language, called SPARQ2L, which supports subgraph extraction queries. Serfiotis et al. [93] study the containment and minimization problems of RDF query fragments using a logic framework that allows these problems to be reduced to their relational equivalents. Hartig and Heese [59] propose a SPARQL query graph model and pursue query rewriting based on this model. Harth and Decker [58] propose optimized index structures for RDF that can support efficient evaluation of select-project-join queries and can be implemented in a relational database. Udrea et al. [106] propose an in-memory index structure to store RDF graph regions defined by center nodes and their associated radii; the index helps to reduce the number of joins during SPARQL query evaluation. Weiss et al. [116] introduce a sextuple-indexing scheme that can support efficient querying of RDF data based on six types of indexes, one for each possible ordering of a subject, predicate, and object. Neumann and Weikum [78,79] propose a RISC-style engine for RDF, called RDF-3X, which features a single table for storing triples, comprehensive indexing on all possible permutations of the table columns, and advanced techniques for join order optimization via selectivity estimates and sideways information passing across different joins and index scans. Bernstein et al. [17] propose SPARQL query optimization techniques based on triple pattern selectivity estimation and evaluate them using an in-memory SPARQL query engine. Chong et al. [32] introduce an SQL table function into the Oracle database to query RDF data, such that the function can be combined with SQL statements for further processing. Ma et al. [72] elaborate on OWL/RDF data management in IBM DB2, presenting scalable storage schema, reasoning, indexing, and query processing techniques. Hung et al. [61] study the problem of RDF aggregate queries by extending an RDF query language with the GROUP BY clause and several aggregate functions. Schenk and Staab [90], Volz et al. [108], and Magkanaraki et al. [73] define RDF and SPARQL views for RDF data personalization and integration. Several research works [19,67,85,86] focus on accessing conventional relational databases using SPARQL, which requires SPARQL-to-SQL query translation. Distributed techniques for RDF data management and querying are discussed in [26,62,87,103]. Finally, [18,54,55,92] propose benchmarks for Semantic Web data management systems.

2.3. Our research in the context of related work

Among available scientific workflow management systems, several use general-purpose relational RDF stores to manage and query provenance. For example, Taverna [121,122] is known to use Sesame [23] and Jena [118,119] as a provenance storage backend. In this work, we present the relational RDF store RDFPROV, which is specifically optimized for provenance querying and management based on known provenance metadata characteristics (provenance immutability, incremental data loading, common query patterns, etc.). RDFPROV is used with the VIEW [68,69] system and is shown to have better production characteristics than Sesame and Jena for provenance metadata (see our case study in Section 7). RDFPROV comes with a provenance ontology that bears much similarity to OPM [3], although the two were developed in parallel and independently.

Although RDFPROV is intended for scientific workflow provenance, our work is also of interest to the Database and Semantic Web communities, as we cover many design aspects of a relational RDF store, including system architecture, database schema generation, schema evolution, rule-based inference, data indexing, data loading strategies, and query translation and optimization. Some of our novel findings with respect to existing RDF data management work are described below.

At the schema mapping stage, to support common patterns in provenance querying and browsing, we design a hybrid database schema that features a single relation (Triple(s,p,o)) from the schema-oblivious approach and class, property, class-subject, and class-object relations from the schema-aware approach. Relation ClassSubject(i,p,o) stores triples whose subjects are instances of a particular class in an ontology, and relation ClassObject(s,p,i) stores triples whose objects are instances of a particular class. These two relations are useful for queries that need to efficiently retrieve all information about an instance (object or subject) and are first introduced in RDFPROV. Moreover, we promote the idea of supporting multiple switchable schema mappings with different characteristics to satisfy varying data management needs. In particular, RDFPROV supports very fast data mapping for one schema and very efficient query processing for another, giving a choice of trading data ingest performance for query performance when desired.
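For instance, under this schema, retrieving everything known about a given task run, a query that a purely schema-oblivious store would answer by scanning the whole Triple relation, reduces to two single-table scans (the URI below is hypothetical):

    -- All triples in which 'ex:tr1' appears as subject or as object:
    SELECT i AS s, p, o FROM TaskRunSubject WHERE i = 'ex:tr1'
    UNION ALL
    SELECT s, p, i AS o FROM TaskRunObject WHERE i = 'ex:tr1';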

At the data mapping stage, which is insufficiently explored in the literature, we design several data ingest strategies with an emphasis on incremental data loading. Two of our RDF-to-relational data mappings are optimized specifically for provenance metadata, exploiting the fact that workflow definitions are always inserted before workflow execution provenance.

At the query mapping stage, we propose a new schema-independent SPARQL-to-SQL query translation algorithm that is essential to support multiple schema mappings in the system. We propose two optimizations for the translation:

(1) selection of the smallest table to query, based on the type information of an instance and the statistics of the sizes of the tables in the database, and

(2) elimination of redundancies in basic graph patterns, based on the semantics of the patterns and the database schema.

These query optimization techniques are novel and can be applied in general-purpose RDF stores.
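A small, hypothetical example shows both optimizations working together. For the basic graph pattern { ?r rdf:type po:TaskRun . ?r po:output ?d }, the ontology tells the translator that ?r ranges over TaskRun instances, so the rdf:type pattern is redundant with respect to the TaskRunSubject relation, and table statistics let the translator prefer that relation over the much larger Triple table:

    -- Naive translation: self-join over the generic Triple table.
    SELECT t2.o AS d
    FROM Triple t1 JOIN Triple t2 ON t1.s = t2.s
    WHERE t1.p = 'rdf:type' AND t1.o = 'po:TaskRun' AND t2.p = 'po:output';

    -- Optimized translation: redundant type pattern eliminated,
    -- smallest applicable table chosen.
    SELECT o AS d FROM TaskRunSubject WHERE p = 'po:output';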


Finally, we explore several performance characteristics of RDFPROV, such as schema mapping, data mapping, storage space consumption, and query response time, and compare them with the performance of Jena, Sesame, AllegroGraph, and BigOWLIM. Unlike other experimental studies that benchmark specific querying capabilities of RDF stores, such as extensional taxonomic queries [53], intensional taxonomic queries [33], or intensional and extensional taxonomic queries [105], our study targets real-life provenance queries and does not limit the query complexity. As a result, we conduct a benchmark of SPARQL expressivity and usability for scientific workflow provenance querying, extending SPARQL with negation, aggregation, and set operations to support additional important provenance queries.

3. Provenance model management

The provenance model layer (see Fig. 1) of the RDFPROV system is responsible for managing provenance ontologies and rule-based inference to augment to-be-stored RDF datasets with new triples. RDFPROV is not limited to any particular provenance model, such as OPM [3] or the model used in this paper, as long as the model can be captured via an ontology expressed in OWL DL. This layer features a provenance ontology repository which stores ontologies, associated inference rules, and ontology-to-ontology mappings that are available for provenance metadata acquisition. The repository is currently represented by a file system.

While OPM [3] may emerge as a provenance model standard in the future, in this paper we use a simplified sketch of our provenance model to provide an example. Note that the development of a full-fledged provenance model is not our focus in this paper; that task rather requires a collaborative community effort. In our sample model, a workflow consists of a set of workflow tasks, workflow inputs, workflow input parameters, workflow outputs, and data channels that connect them. Each task represents a computational or analytical step of a scientific workflow. A task has input ports and output ports that provide the communication interface to other tasks. Tasks are linked together into the workflow as a graph via data channels. During workflow execution, tasks communicate with each other by passing data via their ports through data channels. Finally, a task can have an arbitrary number of input parameters, which are used by an e-scientist to configure its dynamic execution behavior.

The provenance ontology, called PO, which can capture the semantic and structural description of workflows, tasks, and data objects, as well as their execution instances, is shown in Fig. 2. PO can support queries across workflow definitions, workflow runs, and data objects. Fig. 2 illustrates only an excerpt of PO, which sketches concepts for workflow definition, workflow execution, workflow evolution, and definition–execution relationships:

• Workflow definition (see Fig. 2(a)). It includes classes Workflow, Task, and DataObject. A DataObject models a data object passed to a particular data channel; it can be either a workflow input (relates to Workflow via property input), a workflow output (relates to Workflow via property output), a workflow input parameter (relates to Workflow via inputParameter), or an intermediate data product (relates to Workflow via partOf). Meanwhile, a data object can be either a task input (relates to Task via property input), a task output (relates to Task via output), or a task input parameter (relates to Task via inputParameter).

Fig. 2. An excerpt of a sample provenance ontology.


• Workflow execution (see Fig. 2(b)). It includes classes WorkflowRun, TaskRun, and DataObjectRun, which can be related to each other similarly via properties input, output, inputParameter, and partOf. Additionally, task runs can be organized into a task invocation dependency graph (directTaskDependency and transitiveTaskDependency), and data object runs into a data dependency graph (directDataDependency and transitiveDataDependency). Task invocation and data dependencies capture how tasks are triggered by each other via producing/consuming the same data object and how data objects are derived from each other via computational tasks.

• Workflow evolution (see Fig. 2(c)). A workflow that evolved from another workflow can be related to its parent via properties directWorkflowEvolution and transitiveWorkflowEvolution to model a workflow evolution graph.

• Definition–execution relationships. The instances of execution classes are related to the instances of the corresponding definition classes via property instanceOf to model that each workflow run executes a particular workflow, each task run corresponds to an execution of a particular workflow task, and each data object run corresponds to a particular data object produced by a particular task run.

Although the definition and execution components look similar, their purposes are totally different. A workflow definition is prospective, representing the plan for processing; a workflow run is retrospective, representing an actual execution of the workflow, which can be conditional or iterative. In addition, a workflow can be executed multiple times, each of which corresponds to a different workflow run, while the workflow definition stays the same (workflow evolution is considered as a different workflow definition).

Consider the sample workflow in Fig. 3(a). It consists of three tasks, two workflow inputs, one workflow input parameter, and two workflow outputs. An RDF graph that describes this workflow is drawn in Fig. 3(b), and an execution of this workflow produces the RDF graph in Fig. 3(c). Both RDF graphs are to be stored into our provenance database via the provenance model layer.
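Since Fig. 3 is not reproduced here, the following hypothetical fragment (all URIs are ours, not the paper's) illustrates how such definition and execution graphs look as tuples of the generic Triple(s,p,o) relation discussed in Section 2.2:

    -- Workflow definition fragment (in the style of Fig. 3(b)):
    INSERT INTO Triple VALUES ('ex:w1',  'rdf:type',      'po:Workflow');
    INSERT INTO Triple VALUES ('ex:d1',  'rdf:type',      'po:DataObject');
    INSERT INTO Triple VALUES ('ex:w1',  'po:input',      'ex:d1');
    -- Workflow execution fragment (in the style of Fig. 3(c)):
    INSERT INTO Triple VALUES ('ex:wr1', 'rdf:type',      'po:WorkflowRun');
    INSERT INTO Triple VALUES ('ex:wr1', 'po:instanceOf', 'ex:w1');
    INSERT INTO Triple VALUES ('ex:dr1', 'rdf:type',      'po:DataObjectRun');
    INSERT INTO Triple VALUES ('ex:wr1', 'po:input',      'ex:dr1');
    INSERT INTO Triple VALUES ('ex:dr1', 'po:instanceOf', 'ex:d1');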

Informally, it is possible to align our sample provenance model with the Open Provenance Model, as well as provide corresponding ontological mappings. For example, PO's Task(Run) corresponds to OPM's Process, DataObject(Run) corresponds to Artifact, input and inputParameter correspond to used, output corresponds to wasGeneratedBy, directTaskDependency and transitiveTaskDependency correspond to wasTriggeredBy and wasTriggeredBy*, directDataDependency and transitiveDataDependency correspond to wasDerivedFrom (or mayHaveBeenDerivedFrom) and wasDerivedFrom* (or mayHaveBeenDerivedFrom*), and so forth. There are also notions that may be present in only one model, such as workflow evolution in PO and agents (a catalyst of a process) in OPM.

An important advantage of using a formal ontology is the ability to derive new RDF descriptions using semantic inference. Of the two inference evaluation techniques, forward-chaining, in which all inferences are precomputed and stored, and backward-chaining, in which inferences are computed dynamically for each query, we adopt the first. Forward-chaining has previously been shown to provide faster query response and increased result completeness [53]; among its disadvantages are complicated data updates and larger space consumption, but the former is not an issue for immutable provenance metadata.

While OPM identifies inference as a key functionality for provenance management, very often only a few inference rules are needed for provenance queries (see the provenance challenge queries [76], for example). Therefore, in practice, instead of using a full-fledged DL A-Box reasoner, a provenance inference engine that supports a selective set of inference rules is usually sufficient and more efficient. This is the approach taken in our work. We implement our reasoner outside of the RDBMS, in the provenance model layer, to efficiently support a rather selective set of inference rules: (1) OWL semantics inference, which uses some A-Box inference rules for OWL constructs, such as rdfs:subClassOf, owl:TransitiveProperty, and owl:SymmetricProperty, and (2) provenance dependency graph inference, which uses our rules to derive various provenance graphs. However, we reserve the opportunity for a user to define additional inference rules for a specific application. We use a simple language to define inference rules, such that the antecedent and the consequent of a rule are specified as SPARQL basic graph patterns. If the antecedent matches triples in an RDF graph, then the bound variables are used in the consequent to infer new RDF triples, which are appended to the RDF graph. For example, our rule for deriving a data dependency graph is as follows.

?tr rdf:type TaskRun . ?tr input ?d1 . ?tr output ?d2 .
?d1 rdf:type DataObjectRun . ?d2 rdf:type DataObjectRun .
⟹ ?d2 directDataDependency ?d1 . ?d2 transitiveDataDependency ?d1 .

In this rule, if task run ?tr has input data object ?d1 and output data object ?d2, then the two triples ?d2 directDataDependency ?d1 and ?d2 transitiveDataDependency ?d1 are inferred, where ?d1 and ?d2 are substituted with their corresponding bindings, and the transitiveDataDependency property is defined as owl:TransitiveProperty, such that

?d3 transitiveDataDependency ?d2 . ?d2 transitiveDataDependency ?d1 .
⟹ ?d3 transitiveDataDependency ?d1 .

A task invocation dependency graph can be inferred based on a similar rule:

?tr1 rdf:type TaskRun . ?tr2 rdf:type TaskRun .
?tr1 output ?d . ?tr2 input ?d .
⟹ ?tr2 directTaskDependency ?tr1 . ?tr2 transitiveTaskDependency ?tr1 .
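RDFPROV evaluates such rules in its own reasoner, outside the RDBMS. Purely to illustrate the semantics of the first rule, its forward-chaining step could equivalently be expressed over the generic Triple(s,p,o) relation as below; the property URIs are abbreviated, and this SQL formulation is our sketch, not the paper's implementation:

    -- For every task run ?tr with input ?d1 and output ?d2 (both data object
    -- runs), materialize the inferred triple ?d2 directDataDependency ?d1.
    INSERT INTO Triple (s, p, o)
    SELECT tout.o, 'directDataDependency', tin.o
    FROM Triple ttr
    JOIN Triple tin  ON tin.s  = ttr.s  AND tin.p  = 'input'
    JOIN Triple tout ON tout.s = ttr.s  AND tout.p = 'output'
    JOIN Triple td1  ON td1.s  = tin.o  AND td1.p  = 'rdf:type' AND td1.o = 'DataObjectRun'
    JOIN Triple td2  ON td2.s  = tout.o AND td2.p  = 'rdf:type' AND td2.o = 'DataObjectRun'
    WHERE ttr.p = 'rdf:type' AND ttr.o = 'TaskRun';
    -- The transitiveDataDependency triples are derived analogously and then
    -- closed transitively according to the owl:TransitiveProperty rule above.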


Fig. 3. A sample workflow and its RDF graphs.


Finally, the workflow evolution graph can be inferred using an even simpler rule, because the directWorkflowEvolution relationship between two different workflows is given in a provenance dataset. Then,

?w2 directWorkflowEvolution ?w1 .
⟹ ?w2 transitiveWorkflowEvolution ?w1 .

For example, given the workflow run RDF graph in Fig. 3(c), the derived data dependency and task invocation dependency graphs are shown in Fig. 3(d). These RDF graphs, as well as the workflow evolution graph, are precomputed by an inference engine and are to be stored into our provenance database via the provenance model layer.

In addition to a general provenance ontology like PO or OPM, experimental data can be annotated using a scientific domain vocabulary like the Gene Ontology [5] to provide scientific workflow domain-specific metadata. One approach to achieve this is described in [88], where the authors propose to use specialized provenance services that can be integrated into a scientific workflow on demand. For example, while the core functionality of a scientific workflow management system records a data object instance consumed by some task, a specialized provenance service records the same information, but using domain terms, such as "protein complex" for the data object and "alignment" for the task. Furthermore, domain-specific inference rules can be defined for this kind of metadata. Both general and domain-specific metadata constitute scientific workflow provenance, and both are handled equivalently in this work.

4. Provenance ontology to database schema mapping

In this section, we elaborate on the first mapping of the model mapping layer and propose two provenance ontology to database schema mapping algorithms, using our provenance ontology as a running example.

4.1. Schema mapping algorithms

Database schema design is of decisive importance for supporting efficient processing of provenance queries. In general, generating an optimal schema for a given set of queries under certain time and space efficiency constraints is a hard problem. In addition, a database schema should be simple and flexible enough to support evolving provenance ontologies. Our schema design aims at supporting efficient processing of the following queries that are common in provenance retrieval and browsing: (1) retrieve RDF graph nodes of a given type, such as Workflow, WorkflowRun, and Task, (2) retrieve all the immediate neighboring nodes of a given node reachable through incoming or outgoing edges, (3) retrieve nodes that are directly related to a given node by some properties, such as input, output, and partOf, and (4) a combination of the above queries. While these queries are quite common and can be supported by a general-purpose relational RDF store, RDFPROV exploits the immutability of provenance to enable more efficient query evaluation. In particular, since efficient updates are not on the agenda, RDFPROV stores RDF triples redundantly in multiple relations, such that query types (1), (2), and (3) can be supported via an efficient scan of a single relation rather than joins of multiple relations.

Our two alternative schema mapping algorithms are presented in Fig. 4. SchemaMapping-V generates one table and four kinds of views from a given ontology with class-set C and property-set P, as described in the following. First, a single table Triple(s,p,o) that stores all RDF triples in the database is created. This table can be used to efficiently match triple patterns with all variables at the subject, predicate, and object positions. Second, for each $c ∈ C, a view $c(i) is specified to capture all instances of class $c. These views are intended for query type (1); e.g., a user can retrieve all workflows from view Workflow. Third, for each $c ∈ C, a view $cSubject(i,p,o) is created to capture all triples whose subjects are instances of class $c. Fourth, for each $c ∈ C, a view $cObject(s,p,i) is used to capture all triples whose objects are instances of class $c. The $cSubject(i,p,o) and $cObject(s,p,i) views support query type (2); e.g., all information associated with a particular task can be retrieved from TaskSubject(i,p,o) and TaskObject(s,p,i). Finally, for each $p ∈ P, a view $p(s,o) is created to capture all instances of property $p. These views are introduced to support query type (3); e.g., all inputs used by a workflow can be retrieved from view input.

SchemaMapping-T generates a similar database schema, but materializes all the views as tables. While table Triple(s,p,o) contains the complete set of RDF triples, it is the largest table and should be accessed only when there is not sufficient information to choose a smaller table. All the other tables redundantly store RDF triples as a result of the described partitioning of Triple(s,p,o). This redundant design provides several different access paths to answer a query, and the fastest access path can be selected for query processing. For both algorithms, SchemaMapping-V and SchemaMapping-T, the total number of generated relations (tables and views) is 1 + 3×|C| + |P|. Some of the relations generated by our algorithms for the provenance ontology (see Fig. 2) are shown in Fig. 5.
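As a sketch of the DDL that SchemaMapping-V would emit for one class (Workflow) and one property (input), consider the following; the exact SQL syntax and column types are our assumptions, and SchemaMapping-T would create the same relations as tables populated by the data mapping of Section 5:

    CREATE TABLE Triple (s VARCHAR(255), p VARCHAR(255), o VARCHAR(255));
    CREATE VIEW Workflow (i) AS
      SELECT s FROM Triple WHERE p = 'rdf:type' AND o = 'Workflow';
    CREATE VIEW WorkflowSubject (i, p, o) AS
      SELECT t.s, t.p, t.o FROM Triple t JOIN Workflow w ON t.s = w.i;
    CREATE VIEW WorkflowObject (s, p, i) AS
      SELECT t.s, t.p, t.o FROM Triple t JOIN Workflow w ON t.o = w.i;
    CREATE VIEW input (s, o) AS
      SELECT s, o FROM Triple WHERE p = 'input';

For a hypothetical ontology with |C| = 6 classes and |P| = 10 properties, this scheme yields 1 + 3×6 + 10 = 29 relations.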

To support efficient query processing over the proposed schemas, we create indexes on the database tables. We consider three alternative indexing strategies: B+-tree indexes, hash indexes, and a combination of both B+-tree and hash indexes. For the first strategy, B+-tree indexes are created on columns (s,p,o), (s,o), (o,p), and (p) of table Triple(s,p,o). Similar indexes are created on the columns of tables $cSubject(i,p,o) and $cObject(s,p,i). A single-column index and indexes on columns (s,o) and (o) are created for tables $c(i) and $p(s,o), respectively. The set of B+-tree indexes for each table exhaustively covers all possible finer-granularity equality queries, range queries, and partial key queries (e.g., a search for a value in column s of table Triple(s,p,o) can be supported by the index on (s,p,o)). This strategy uses four indexes for SchemaMapping-V and 4 + 9×|C| + 2×|P| indexes for SchemaMapping-T.
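As a sketch, the B+-tree index set for table Triple(s,p,o) under the first strategy could be created as follows (the index names are ours):

    CREATE INDEX idx_triple_spo ON Triple (s, p, o);
    CREATE INDEX idx_triple_so  ON Triple (s, o);
    CREATE INDEX idx_triple_op  ON Triple (o, p);
    CREATE INDEX idx_triple_p   ON Triple (p);
    -- Under SchemaMapping-T, analogous index sets are created for every
    -- $cSubject, $cObject, $c, and $p table; e.g., with hypothetical
    -- |C| = 6 and |P| = 10, this amounts to 4 + 9×6 + 2×10 = 78 indexes.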


Fig. 4. Algorithms SchemaMapping-V and SchemaMapping-T.


The second strategy uses only hash indexes, which may support even faster equality queries, but cannot support range and partial key queries. To cover all possible equality searches, hash indexes on (s), (s,p), (s,p,o), (p), (p,o), (o), and (o,s) should be created for table Triple(s,p,o), and similarly for $cSubject(i,p,o) and $cObject(s,p,i). A single-column index and indexes on columns (s), (s,o), and (o) are created for tables $c(i) and $p(s,o), respectively. This strategy uses seven indexes for SchemaMapping-V and 7 + 15×|C| + 3×|P| indexes for SchemaMapping-T.

Finally, the hybrid strategy can combine the benefits of hash-enabled equality searches and B+-tree-enabled range searches; it is inspired by the fact that range searches are commonly executed on objects, which can be numeric literals, rather than on subjects and predicates, which are represented by URIs. For example, we can use B+-tree indexes on tables Triple(s,p,o), $cSubject(i,p,o), and $p(s,o), but hash indexes on tables $cObject(s,p,i) and $c(i). It is also possible to use hash indexes for those $p(s,o) tables that are known to relate instances of ontological classes (aka object properties) and B+-tree indexes for those that relate a class instance with a literal (aka datatype properties).

Of these three strategies, we choose the first one for RDFPROV, as B+-tree indexes are sufficient for the purposes of this work; we leave the exploration of the other strategies as future work.

To complete the picture, we consider the problem of database schema and instance data change in the context of ontology evolution. In Table 2, we outline the effects on the database schemas generated by SchemaMapping-V and SchemaMapping-T, and on the corresponding instance data, of three basic ontology-change operations, namely add, delete, and rename, which can be applied to classes and properties in the ontology. Consider a class/property hierarchy represented by a tree whose nodes are classes/properties and whose edges are subclass/subproperty relationships. A new node can be added into such a tree as (1) a leaf, (2) the root, or (3) an inner node, such that it becomes a child of a node n1 and a parent of another node n2, where n1 is originally a parent of n2. Similarly, an existing node can be deleted from the tree at the same three positions; its children become children of its parent, and if the root is deleted, each of its children becomes the root of an independent hierarchy. The renaming operation does not affect the tree structure. As shown in Table 2, these operations can be easily supported at the database schema level. The effect on instance data can also be implemented quite straightforwardly, but it is computationally more expensive, especially for the SchemaMapping-T schema, whose materialized views must be maintained consistently. Some other operations, such as merging classes or splitting a class into several classes, can also be supported, but they are outside the scope of this work. More information on ontology evolution and reusability can be found in [44,99,102].

Fig. 5. Relations generated by SchemaMapping-V/SchemaMapping-T for PO.

Page 12: Data & Knowledge Engineeringshiyong/papers/dke10.pdf · metadata management. Finally, our case study for provenance management in a real-life biological simulation workflow showed

Table 2. Ontology-change operations and their effects on database schemas SchemaMapping-V and SchemaMapping-T and instance data.

Effects on schema SchemaMapping-V and its instance data (after forward-chaining inference):
• Add class $c. Schema effect: create views $c, $cSubject, and $cObject. Instance data effect: if $c′ is a subclass of $c, then instances of $c′ must be inferred as instances of $c in table Triple.
• Delete class $c. Schema effect: drop views $c, $cSubject, and $cObject. Instance data effect: delete the tuples in table Triple that define instances of $c; instances of $c remain instances of the superclass of $c.
• Rename class $c into $c′. Schema effect: drop the corresponding views for $c and create views for $c′. Instance data effect: update the tuples in table Triple that define instances of $c to become instances of $c′.
• Add property $p. Schema effect: create view $p. Instance data effect: if $p′ is a subproperty of $p, then instances of $p′ must be inferred as instances of $p in table Triple.
• Delete property $p. Schema effect: drop view $p. Instance data effect: delete the tuples in table Triple that define instances of $p; instances of $p remain instances of the superproperty of $p.
• Rename property $p into $p′. Schema effect: drop view $p and create view $p′. Instance data effect: update the tuples in table Triple that define instances of $p to become instances of $p′.

Effects on schema SchemaMapping-T and its instance data (after forward-chaining inference):
• Add class $c. Schema effect: create tables $c, $cSubject, and $cObject. Instance data effect: if $c′ is a subclass of $c, then instances of $c′ must be inferred as instances of $c in tables Triple and type; compute tuples for the new tables (see Section 5).
• Delete class $c. Schema effect: drop tables $c, $cSubject, and $cObject. Instance data effect: delete the tuples in tables Triple and type that define instances of $c; instances of $c remain instances of the superclass of $c.
• Rename class $c into $c′. Schema effect: rename the tables for $c into $c′, $c′Subject, and $c′Object. Instance data effect: update the tuples in tables Triple and type that define instances of $c to become instances of $c′.
• Add property $p. Schema effect: create table $p. Instance data effect: if $p′ is a subproperty of $p, then instances of $p′ must be inferred as instances of $p in tables Triple and $p.
• Delete property $p. Schema effect: drop table $p. Instance data effect: delete the tuples in tables Triple, $cSubject, and $cObject (for each class $c) that define instances of $p; instances of $p remain instances of the superproperty of $p.
• Rename property $p into $p′. Schema effect: rename table $p into $p′. Instance data effect: update the tuples in tables Triple, $cSubject, and $cObject (for each class $c) that define instances of $p to become instances of $p′.


The effect on instance data can also be implemented quite straightforwardly, but it is computationally more expensive, especially for the SchemaMapping-T schema, whose materialized views must be kept consistent. Some other operations, such as merging classes and splitting a class into several classes, can also be supported, but they are outside the scope of this work. More information on ontology evolution and reusability can be found in [44,99,102].
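As a concrete illustration, the rename-class operation under SchemaMapping-T could be scripted roughly as follows. This is a minimal sketch assuming MySQL-style statements and a Python DB-API cursor; it is not the exact RDFPROV implementation.

def rename_class_t(cursor, old, new):
    # Schema effect: rename the three per-class tables.
    cursor.execute(f"Rename Table {old} To {new}, "
                   f"{old}Subject To {new}Subject, "
                   f"{old}Object To {new}Object")
    # Instance-data effect: update the type triples in tables Triple and type.
    cursor.execute("Update Triple Set o = %s Where p = 'rdf:type' And o = %s",
                   (new, old))
    cursor.execute("Update type Set o = %s Where o = %s", (new, old))

The SchemaMapping-V variant differs only in its schema effect, dropping and recreating views instead of renaming tables.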

4.2. Schema mapping experiments

All the experiments reported in this paper were conducted on a PC with a 2.4 GHz Pentium IV CPU and 1024 MB of main memory running MS Windows XP Professional. All algorithms were implemented in C/C++, and MySQL 5.0 Community Edition was employed as the RDBMS. Our provenance server communicated with MySQL using the MySQL C API.

The datasets for the schema mapping, data mapping, and query performance experiments included our provenance ontology, five workflow definition documents, and 2000 workflow run provenance documents. All the workflows are different versions of the biology simulation workflow described in our case study (see Section 7) and are linked with each other via workflow evolution relationships. Workflow execution provenance documents were obtained by executing the workflows in the VIEW system [29,69]. VIEW collected provenance for each workflow run in a persistent log file and stored the RDF file into RDFPROV once the workflow execution was completed. A summary of the dataset characteristics is shown in Table 3. The usage of each individual dataset is discussed in the corresponding sections.

The performance of our schema mapping algorithms on PO is presented in Table 4. The reported times include the time required to process the ontology and the time to create the corresponding database schema in the RDBMS.

Table 3. Characteristics of the provenance ontology, workflow definition, and workflow execution documents.

Provenance ontology document:
• Representation language: OWL
• Number of classes: 31
• Number of properties: 41

Workflow definition document (1 out of 5):
• Representation language: RDF/N-Triples
• Number of triples before inference: 137
• Number of triples after inference: 166

Workflow execution document (1 out of 2000):
• Representation language: RDF/N-Triples
• Number of triples before inference: 387
• Number of triples after inference: 500

Page 13: Data & Knowledge Engineeringshiyong/papers/dke10.pdf · metadata management. Finally, our case study for provenance management in a real-life biological simulation workflow showed

Table 4. Performance of SchemaMapping-V and SchemaMapping-T on PO.

Algorithm       | # of tables created | # of views created | # of indexes created | Time (s)
SchemaMapping-V | 1                   | 134                | 4                    | 4.437
SchemaMapping-T | 135                 | 0                  | 365                  | 53.641


In our approach, schema mapping is only required to be performed once to store multiple provenance datasets. It can also be precomputed before a workflow is even defined or executed; therefore, its performance is much less important than that of data ingest and query processing. The reported times suggest that schema mapping takes seconds for a relatively small ontology like PO, with around a hundred classes and properties, and minutes for complex ontologies with hundreds or thousands of entries.

5. Provenance metadata to relational data mapping

In this section, we explore data ingest optimization strategies for the RDFPROV system and perform several experiments that compare our strategies with two existing general-purpose RDF stores.

5.1. Data mapping algorithms

The availability of two alternative database schemas in RDFPROV requires separate data mapping algorithms to ingest data into the system. While the SchemaMapping-V schema requires dealing with only one table, SchemaMapping-T is much more demanding and computationally expensive. On the other hand, the materialized views of SchemaMapping-T allow faster querying; therefore, trading data ingest performance for query performance can be a useful strategy for many long-running scientific workflows.

In the following, we present three data mapping algorithms that insert a new provenance dataset D, either of a workflow definition, a workflow run, or a workflow provenance dependency graph, into the database. The DataMapping-V algorithm that corresponds to SchemaMapping-V is trivial, as all we need to do is insert D into table Triple. For the database schema created by SchemaMapping-T, table Triple can be populated similarly, i.e., by simply inserting D into Triple. Let Triple′ be a temporary table for storing the triples of D. New tuples for $c can be calculated by $c′(i) ← Select s From Triple′ Where p = 'rdf:type' And o = '$c', and new tuples for $p can be calculated by $p′(s,o) ← Select s,o From Triple′ Where p = '$p'. The question is how we can calculate new tuples for tables $cSubject and $cObject for each class $c ∈ C.

One strategy, called brute-force, is to calculate Triple′, $c′, and $p′ and insert them into tables Triple, $c, and $p, respectively; then, delete the contents of $cSubject and $cObject and rematerialize these two tables as follows: $cSubject(i,p,o) ← Select s,p,o From Triple, $c Where s=i and $cObject(s,p,i) ← Select s,p,o From Triple, $c Where o=i. However, this strategy is expensive, since we have to recompute joins of Triple and $c, whose sizes grow over time.

A better strategy, called incremental, is based upon semi-naïve evaluation [13,91]. It calculates the new tuples for $cSubject and $cObject and then inserts them into $cSubject and $cObject, respectively. The new tuples are calculated as follows: $cSubject(i,p,o) ← (Select s,p,o From Triple′, $c′ Where s=i) Union (Select s,p,o From Triple, $c′ Where s=i) Union (Select s,p,o From Triple′, $c Where s=i) and $cObject(s,p,i) ← (Select s,p,o From Triple′, $c′ Where o=i) Union (Select s,p,o From Triple, $c′ Where o=i) Union (Select s,p,o From Triple′, $c Where o=i). In other words, we need to compute the union of three joins: (1) Triple′⋈$c′, (2) Triple⋈$c′, and (3) Triple′⋈$c. In contrast to the brute-force strategy, the incremental strategy requires computing joins of smaller tables.

Next, our optimized incremental strategy is similar to the incremental one, except that we do not need to compute the join Triple⋈$c′ when populating $cSubject and $cObject. This simplification is possible because provenance datasets are stored in order, such that a workflow definition is stored first, its workflow runs second, and its provenance graphs last. As a result, a to-be-stored dataset D may have an instance X whose type (X rdf:type class) is not defined in D but is defined in the triple-set stored in the database; the other way around can never be true. Therefore, the join of Triple and $c′ can return only tuples that are already in the database in $cSubject or in $cObject. Furthermore, since we are left with only two required joins, Triple′⋈$c′ and Triple′⋈$c, we can replace them by Triple′⋈($c′∪$c), or by inserting $c′ into $c and computing Triple′⋈$c.
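In SQL terms, the optimized incremental strategy for one class $c might look as follows. The sketch assumes that the new triples reside in a temporary table Triple_new (standing for Triple′); the statement shapes and names are our assumptions, not RDFPROV's exact code.

def ingest_class_incremental(cursor, c):
    # First fold the new class instances ($c') into $c, so that the pair of
    # joins Triple' x $c' and Triple' x $c collapses into the single join
    # Triple' x $c used below.
    cursor.execute(f"Insert Into {c} (i) Select s From Triple_new "
                   f"Where p = 'rdf:type' And o = '{c}'")
    # New tuples for $cSubject: new triples whose subject is a $c instance.
    cursor.execute(f"Insert Into {c}Subject (i, p, o) Select t.s, t.p, t.o "
                   f"From Triple_new t, {c} x Where t.s = x.i")
    # New tuples for $cObject: new triples whose object is a $c instance.
    cursor.execute(f"Insert Into {c}Object (s, p, i) Select t.s, t.p, t.o "
                   f"From Triple_new t, {c} x Where t.o = x.i")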

Finally, our optimized incremental in-memory strategy completely eliminates relational joins by calculating the new tuples for Triple, $c, $cSubject, $cObject, and $p in main memory, outside of the database engine. To achieve this, we introduce the notion of a type dictionary: an in-memory data structure that, given an instance URI, returns the set of types of this instance. We model the type dictionary T as a hash table whose keys are strings that represent URIs and whose values are sets of strings that represent classes in the ontology. If a key k is not found in T, then T[k] = ∅. Initially, T is retrieved from table type(s,o), which stores instances in column s and their types in column o; once in memory, T is maintained synchronously with table type. Given T and a new provenance dataset D, this strategy scans D twice: first, to update T with the new instances and types found in D, and second, to calculate new tuples for the database tables as follows. Let Triple′, $c′, $cSubject′, $cObject′, and $p′ be empty tuple-sets, for each class $c ∈ C and property $p ∈ P in the ontology. For each triple t (t.s, t.p, t.o) in D, (1) t is added to Triple′, (2) t.s is added to $c′ if t.p=rdf:type and t.o=$c, (3) t is added to $cSubject′ for each $c ∈ T[t.s], (4) t is added to $cObject′ for each $c ∈ T[t.o], and (5) (t.s, t.o) is added to $p′ if t.p=$p.

Page 14: Data & Knowledge Engineeringshiyong/papers/dke10.pdf · metadata management. Finally, our case study for provenance management in a real-life biological simulation workflow showed

Fig. 7. Algorithm DataMapping-TM.

Fig. 6. Algorithm DataMapping-T.


After the tuple-sets are computed, they are inserted into the corresponding tables in the database. This strategy has a time complexity of O(|D|).
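A minimal sketch of this in-memory strategy follows (the complete algorithm is DataMapping-TM in Fig. 7). The data structures are our illustrative choices: the type dictionary is modeled as a Python defaultdict of sets, mirroring table type(s,o).

from collections import defaultdict

def data_mapping_tm(D, T):
    # D is a list of (s, p, o) triples; T maps an instance URI to its types.
    # First scan: update T with the new instances and types found in D.
    for s, p, o in D:
        if p == "rdf:type":
            T[s].add(o)
    # Second scan: compute all tuple-sets in main memory, with no joins.
    triple_new = []                # tuples for table Triple
    c_new = defaultdict(set)       # new instances per class table $c
    c_subject = defaultdict(list)  # tuples per table $cSubject
    c_object = defaultdict(list)   # tuples per table $cObject
    p_new = defaultdict(list)      # tuples per property table $p
    for s, p, o in D:
        triple_new.append((s, p, o))
        if p == "rdf:type":
            c_new[o].add(s)
        for c in T.get(s, ()):
            c_subject[c].append((s, p, o))
        for c in T.get(o, ()):
            c_object[c].append((s, p, o))
        p_new[p].append((s, o))
    return triple_new, c_new, c_subject, c_object, p_new

Each scan touches every triple once, which is where the O(|D|) bound comes from; the computed tuple-sets are then bulk-inserted into the corresponding tables.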

Figs. 6 and 7 define the algorithms DataMapping-T and DataMapping-TM that implement the optimized incremental and optimized incremental in-memory strategies, respectively. Fig. 8 shows the result of calling these algorithms on a sample RDF dataset from Fig. 3(b) and gives a good example of a database instance that we need to query.

Note that the following three properties regarding the cardinalities of relations always hold: (1) |Triple| ≥ |$cSubject| ≥ |$c|, (2) |Triple| ≥ |$cObject| ≥ |$c|, and (3) |Triple| ≥ |$p|. To obtain other relationships between different table sizes, during the data mapping stage we also cache cardinality statistics for relations $p, $cSubject, and $cObject. This information can be used to select the smallest relation for query optimization, which we discuss later in this work.

5.2. Data mapping experiments

Before data mapping is performed, each RDF document that corresponds to a workflow definition or workflow run provenance is preprocessed by our inference engine, implemented outside of the relational database. The inference engine took less than 0.1 s when evaluated on sample datasets of workflow definitions and workflow runs with fewer than 400 triples (see Table 3 for details). The inferred triples constituted about 30% of the original datasets.

Page 15: Data & Knowledge Engineeringshiyong/papers/dke10.pdf · metadata management. Finally, our case study for provenance management in a real-life biological simulation workflow showed

Fig. 8. A workflow RDF graph stored into a relational database.


In the rest of our experiments with RDFPROV and other systems, we use the same RDF datasets that have already been appended with inferred triples. Since inference is performed beforehand and is the same for all systems, our reported times do not include inference time. In this way, we ensure a fair comparison of data mapping and querying performance and avoid the problem that different systems produce different entailments.

First, we experimentally verified that the optimized incremental and optimized incremental in-memory strategies were indeed faster alternatives to the brute-force and incremental strategies for the database schema generated by SchemaMapping-T. Since these strategies required computing relational joins, whose performance depended on the database size, it was sufficient to store five workflow definitions and 20,000 workflow runs into the database using one of the strategies and measure the time to ingest the 20,001st run. While the optimized incremental and optimized incremental in-memory strategies were approximately 20–30% faster than the incremental strategy at loading 500 triples, the brute-force strategy proved to be the slowest and an impractical solution (45 min to load the same 20,001st run). Therefore, in the rest of our experiments, we focused on the detailed exploration of the optimized incremental (DataMapping-T) and optimized incremental in-memory (DataMapping-TM) strategies for the SchemaMapping-T schema.

Page 16: Data & Knowledge Engineeringshiyong/papers/dke10.pdf · metadata management. Finally, our case study for provenance management in a real-life biological simulation workflow showed

Fig. 9. Performance of DataMapping-V, DataMapping-T, DataMapping-TM, Jena 2.5.2, and Sesame 1.2.6 on sequences of workflow runs.


Second, the algorithms DataMapping-V, DataMapping-T, and DataMapping-TM were evaluated by storing sequences of workflow runs into the databases with the corresponding schemas. In particular, we stored five workflow definitions into the database and measured the times to store sequences of 1, 20, 200, 2000, and 20,000 workflow runs. Each workflow run provenance contained 500 triples (after inferencing) in the N-Triples [105] format. The results are shown in Fig. 9, which presents two different views of the same experimental results. The algorithms showed good performance and scalability; in particular, DataMapping-V proved to be much faster than its two peers since it only populated one table. DataMapping-TM, in turn, benefited from our in-memory data mapping strategy and was significantly faster than DataMapping-T. In addition, we compared our algorithms to data mapping in the general-purpose RDF stores Jena 2.5.2 and Sesame 1.2.6 with the MySQL backend and inference turned off; Sesame 2 did not support a database backend at the time of comparison. Jena [112] showed worse performance than DataMapping-V and better performance than all the other systems. Sesame [23] showed the worst performance in this experiment and revealed worse scalability than the other approaches. In particular, Sesame took about one hour to load 2000 workflow runs and over 26 h to load 20,000 workflow runs; all the other systems showed linear scalability, requiring an approximately 10 times larger response time for a 10 times larger dataset.

Third, the algorithms were evaluated by storing a single workflow run into the database for a varying number of workflow runs already stored in the database. In particular, we stored five workflow definitions into the database and measured the times to store the 1st, 3rd, 21st, 201st, 2001st, and 20,001st workflow runs. The results are shown in Fig. 10, where we report performance for "cold" runs, with the MySQL server restarted before each trial, and for "warm" runs over MySQL with a "warm" cache. In both cases, DataMapping-V was the fastest and Jena showed the second best performance. DataMapping-TM was the slowest for the "cold" trials, except for the 20,001st workflow run, for which Sesame was slower, and Sesame was the slowest for the "warm" trials. All the approaches, except perhaps Sesame, proved efficient and scalable when run over MySQL with a "warm" cache, which is the most probable situation in real-life settings.

Page 17: Data & Knowledge Engineeringshiyong/papers/dke10.pdf · metadata management. Finally, our case study for provenance management in a real-life biological simulation workflow showed

Fig. 10. Performance of DataMapping-V, DataMapping-T, DataMapping-TM, Jena 2.5.2, and Sesame 1.2.6 on a single workflow run.


DataMapping-V showed stable performance of about 0.1 s per workflow run, Jena about 0.2–0.3 s, DataMapping-TM about 1.0–1.2 s, DataMapping-T about 1.3–1.7 s, and Sesame performed in the range of 1.7–5.4 s per workflow run.

Fourth, in Fig. 11, we report the disk space required to store 1, 20, 200, 2000, and 20,000 workflow runs for the database schemas generated by SchemaMapping-V, SchemaMapping-T, Jena, and Sesame. In the figure, the data length corresponds to the length of all tables in the database, and the data and index length corresponds to the length of all tables and indexes in the database. Since SchemaMapping-T generated a number of materialized views, it required approximately three times more disk space than SchemaMapping-V to store the same data. Jena and Sesame required less space than SchemaMapping-T and more space than SchemaMapping-V for our datasets. The indexes consumed much more space than the data. Overall, the data and index length for SchemaMapping-T was approximately two times larger than that for SchemaMapping-V for the same datasets. Due to the extensive use of indexes, both SchemaMapping-V and SchemaMapping-T consumed more disk space than Jena and Sesame.

Finally, we explored the performance of our algorithms on the database with no indexes created. Such settings can be used to achieve even faster data mapping of large provenance datasets, while the indexes can be created later for efficient query evaluation. DataMapping-V showed constant performance of 0.022 s and 0.16 s per workflow run for "warm" and "cold" trials, respectively. DataMapping-TM showed nearly constant performance of about 1.0 s and 2.5 s per workflow run for "warm" and "cold" trials, respectively. DataMapping-T (and similarly the brute-force and incremental strategies) appeared to be unsuitable for this purpose: its performance degraded quickly as the database grew, since it required computing relational joins, whose performance greatly depends on the availability of indexes.

6. Translation of SPARQL provenance queries into equivalent SQL queries

In this section, we present the last and most complex mapping from the model mapping layer of RDFPROV: query mapping. While we present a SPARQL-to-SQL translation algorithm for queries with basic graph patterns, our main focus is on two novel query optimization techniques, which (in part) became possible due to the redundant storage of provenance metadata motivated by the common provenance query patterns and provenance immutability.

Page 18: Data & Knowledge Engineeringshiyong/papers/dke10.pdf · metadata management. Finally, our case study for provenance management in a real-life biological simulation workflow showed

Fig. 11. Disk space required to store workflow runs in the MySQL database over schemas SchemaMapping-V, SchemaMapping-T, Jena 2.5.2, and Sesame 1.2.6.


6.1. Query translation algorithm

In our approach, SPARQL is the primary language for provenance querying. SPARQL is used to specify queries with basic, group, optional, and alternative graph patterns that are matched over RDF provenance graphs. For example, the following SPARQL query returns information that describes workflows that require user input (have input parameters):

Select ?w ?p ?o Where {?w rdf:type :Workflow. ?w ?p ?o. ?w :inputParameter ?x.}

In the Select clause, it specifies three variables, ?w for the workflow, ?p for the predicate, and ?o for the object, whose instantiations must be returned. In the Where clause, the query has a basic graph pattern consisting of three triple patterns: ?w rdf:type :Workflow to match instances of the Workflow type, ?w ?p ?o to match related instances, and ?w :inputParameter ?x to ensure that a workflow has at least one input parameter. To evaluate such a query over our relational provenance database, we design algorithms to translate a SPARQL query into an equivalent SQL query.

We first consider the problem of matching a triple pattern tp (tp.s, tp.p, tp.o) against the database. Two questions need to be answered: (1) Which relation should be used for tp? (2) Which relational attributes should be used for tp.s, tp.p, and tp.o? These questions are formulated in terms of two mapping functions, ρ and α, such that ρ(tp) returns the relation that stores all the triples that may match tp, while α(tp, s), α(tp, p), and α(tp, o) return the corresponding relational attributes for the subject, predicate, and object. The two mapping functions provide a foundation for schema-independent SPARQL-to-SQL translation, such that the relational schema design, which concerns ρ and α, is fully separated from the translation algorithm, which is parameterized by ρ and α. In addition, such an abstraction enables schema design optimization and query optimization. We will not pursue the schema design optimization issue in this work, but we illustrate two query optimization techniques below that use the type information of an instance.

Page 19: Data & Knowledge Engineeringshiyong/papers/dke10.pdf · metadata management. Finally, our case study for provenance management in a real-life biological simulation workflow showed


The first optimization is based on the following two questions: (1) Given an instance or variable X in a triple pattern tp from a SPARQL query q, can we determine the type (class in the ontology) of the instances that X may match? (2) If the answer to the first question is yes, let τ(X) be the type of X; then which relation should be chosen for ρ(tp) among τ(X), τ(X)Subject, τ(X)Object, and Triple? Intuitively, we would like to choose the relation with the smallest number of tuples.

For the first question, given an instance or variable X in a triple pattern tp from a SPARQL query q and our provenance ontology PO, τ(X) can be decided as follows:

\tau(X) =
\begin{cases}
c & \text{if } tp \text{ is of the form } X \text{ rdf:type } {:}c, \text{ where } c \in C \text{ is a class in } PO;\\
p & \text{if } tp.p = {:}p \text{ and } X = {:}p, \text{ where } p \in P \text{ is a property in } PO;\\
c & \text{if } \Big(\bigcap_{\forall p_d \in q} \mathrm{domain}(p_d)\Big) \cap \Big(\bigcap_{\forall p_r \in q} \mathrm{range}(p_r)\Big) = \{{:}c\}, \text{ where } p_d \text{ and } p_r \text{ are property}\\
  & \text{instances in some triple patterns } tp_d \text{ and } tp_r \text{ from } q, \text{ such that } tp_d.s = X,\ tp_d.p = p_d,\\
  & tp_r.o = X,\ tp_r.p = p_r, \text{ and } p_r \text{ is of type owl:ObjectProperty;}\\
\mathrm{undef} & \text{otherwise.}
\end{cases}

In other words, τ(X) is defined if the type of X is explicitly stated in q via the rdf:type property, or if X is a property instance, or it is computed as the intersection of the domain and range sets of all the properties that appear as predicates in q's triple patterns in which X is the subject or the object, respectively. If the result of the intersection is a set with one ontology class, then τ(X) is defined; otherwise, τ(X) is undefined.

For our sample query, τ(?w) = Workflow, as stated in the first triple pattern, τ(rdf:type) = type, and τ(:inputParameter) = inputParameter. The other instances and variables have an undefined value of τ; in particular, τ(?x) cannot be computed because the range of inputParameter contains several classes.

The answer to the second question, choosing the best relation for ρ(tp), is described in Fig. 12. Algorithm Calculate-ρ-α is used to compute ρ and α for each tp in a SPARQL query, such that ρ(tp) is assigned the smallest relation and the α values are directly decided by the relation schema of ρ(tp). The smallest relation is identified using the min function (line 08). When τ(tp.s), τ(tp.o), or τ(tp.p) is undefined, we assign |τ(tp.s)Subject| = +∞, |τ(tp.o)Object| = +∞, or |τ(tp.p)| = +∞, respectively.

For our sample query, we have ρ(?w rdf:type :Workflow) = Workflow (Case 1), ρ(?w ?p ?o) = WorkflowSubject (Case 2) because τ(?p) and τ(?o) are undefined and |WorkflowSubject| ≤ |Triple|, and ρ(?w :inputParameter ?x) = inputParameter (Case 4), assuming that |inputParameter| ≤ |WorkflowSubject|. α is assigned accordingly.
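A condensed sketch of this relation-selection logic is given below. Here tau stands for the τ computation described above, card for the cached cardinality statistics from Section 5, and classes for the set of class names in PO; the names are ours rather than those used in Fig. 12.

def calculate_rho(tp, tau, card, classes):
    # Pick the smallest relation for triple pattern tp = (s, p, o).
    INF = float("inf")
    s, p, o = tp
    # Case 1: "X rdf:type :c" is answered by the unary class relation c.
    if p == "rdf:type" and o in classes:
        return o
    # Otherwise choose the smallest among tau(s)Subject, tau(o)Object, the
    # property relation for p, and the catch-all Triple relation; an
    # undefined tau value behaves like a relation of size +infinity.
    candidates = {"Triple": card.get("Triple", INF)}
    if tau(s) is not None:
        candidates[tau(s) + "Subject"] = card.get(tau(s) + "Subject", INF)
    if tau(o) is not None:
        candidates[tau(o) + "Object"] = card.get(tau(o) + "Object", INF)
    if p != "rdf:type" and tau(p) is not None:
        candidates[tau(p)] = card.get(tau(p), INF)
    return min(candidates, key=candidates.get)

On the sample query this reproduces the choices above: WorkflowSubject for ?w ?p ?o and inputParameter for ?w :inputParameter ?x.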

The second optimization is the elimination of some redundant triple patterns from a basic graph pattern bgp in a SPARQL query. In particular, we eliminate tp of the form X rdf:type :c if X also appears in another triple pattern tp′ and ρ(tp′) = cSubject or ρ(tp′) = cObject. This is based on the fact that X rdf:type :c restricts X to match only instances of type :c; however, the same restriction is already in place when we match X over relational attribute α(tp′.s) of cSubject or α(tp′.o) of cObject.

In our sample query, ?w rdf:type :Workflow should be eliminated, because ρ(?w rdf:type :Workflow) = Workflow and ρ(?w ?p ?o) = WorkflowSubject.

Finally, we are ready to present our basic graph pattern translation algorithm BGPtoSQL in Fig. 13. BGPtoSQL takes bgp, ρ, and α and outputs an equivalent SQL query that evaluates bgp against the relational RDF database. Note that while the BGPtoSQL algorithm is schema-independent, it is parameterized with ρ and α, and the Calculate-ρ-α algorithm is schema-dependent and a prerequisite for the translation algorithm. The benefit of this design is that the SPARQL-to-SQL translation can be reused for a different database schema as long as Calculate-ρ-α is specified for that schema, which is in general a simpler task than writing a new translation algorithm. The main idea of BGPtoSQL is to retrieve all possible variable instantiations from the relations that correspond (based on ρ) to the triple patterns in bgp, restricting (1) relational attributes that correspond (based on α) to the same variable in different triple patterns to match the same values and (2) relational attributes that correspond (based on α) to instances or literals to match the values of those instances or literals, respectively.

Fig. 12. Algorithm Calculate-ρ-α.

Page 20: Data & Knowledge Engineeringshiyong/papers/dke10.pdf · metadata management. Finally, our case study for provenance management in a real-life biological simulation workflow showed

Fig. 13. Algorithm BGPtoSQL.


For our optimized query, with the basic graph pattern consisting of ?w ?p ?o and ?w :inputParameter ?x, the SQL From clause contains two relations (lines 05–06 in the algorithm), WorkflowSubject and inputParameter, with aliases t1 and t2, respectively. The hash h (lines 07–09) contains only one relational attribute for every variable, except for ?w, for which h(?w) = {t1.i, t2.s}; therefore, the SQL Where clause should ensure their equality, t1.i = t2.s (lines 10–12). The bgp contains one instance (:inputParameter); however, its α value is undefined (lines 13–14). The SQL Select clause projects each distinct variable in bgp (lines 15–16), resulting in the translated query (line 17):

Select t1.i As w, t1.p As p, t1.o As o, t2.o As x From WorkflowSubject t1, inputParameter t2 Where t1.i = t2.s

Finally, the SQL query equivalent of our SPARQL query is

Select w, p, o From (Select t1.i As w, t1.p As p, t1.o As o, t2.o As x From WorkflowSubject t1, inputParameter t2 Where t1.i = t2.s) t3
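The essence of BGPtoSQL can be captured in a short sketch. It omits details such as literal typing, assumes ρ and α are supplied as computed by Calculate-ρ-α (with α returning None when a term's restriction is already implied by the chosen relation), and uses our own names rather than those in Fig. 13.

def bgp_to_sql(bgp, rho, alpha):
    # bgp is a list of (s, p, o) triple patterns; variables start with '?'.
    from_items, where, h = [], [], {}
    for k, tp in enumerate(bgp, start=1):
        alias = f"t{k}"
        from_items.append(f"{rho(tp)} {alias}")   # one relation per pattern
        for pos, term in zip("spo", tp):
            attr = alpha(tp, pos)
            if attr is None:
                continue                          # implied by rho(tp)
            ref = f"{alias}.{attr}"
            if term.startswith("?"):
                h.setdefault(term, []).append(ref)    # the hash h
            else:
                where.append(f"{ref} = '{term.lstrip(':')}'")
    for refs in h.values():                       # same variable, same value
        where.extend(f"{refs[0]} = {r}" for r in refs[1:])
    select = ", ".join(f"{refs[0]} As {var[1:]}" for var, refs in h.items())
    sql = f"Select {select} From {', '.join(from_items)}"
    return sql + (f" Where {' And '.join(where)}" if where else "")

Applied to the optimized basic graph pattern above, this sketch yields exactly the inner query shown, with t1.i = t2.s as the only Where condition.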

The translation of some other SPARQL constructs (optional graph patterns, value constraints, etc.) into SQL is illustrated by example in the following section. The complete set of SPARQL-to-SQL translation rules and algorithms, as well as a discussion of their applicability (via RDF-to-Relational mappings ρ and α) to various database schemas employed by existing relational RDF stores, is available in [31].

6.2. Provenance queries

The first provenance challenge [76] defined several provenance queries for a sample scientific workflow from the functional magnetic resonance imaging field. SPARQL proved quite successful in expressing all the queries using only basic graph patterns and filtering (see http://twiki.ipaw.info/bin/view/Challenge/MINDSWAP). In this work, we aim to explore a wider range of provenance queries that, besides basic graph patterns and filtering, require the use of optional and alternative graph patterns, as well as some features that are not easily or directly supported by the SPARQL language. In Table 5, we present several such queries that our e-scientist collaborators are interested in. For each query expressed in English, we construct a SPARQL query and its SQL counterpart using our translation algorithms. While queries Q1–Q9 use standard SPARQL features, in queries Q10–Q12 we use our extensions to SPARQL, denoted as SPARQL+, which are summarized in the following:

• Negation. The NOT clause in Q10 requires its graph pattern to fail (for the whole query to succeed) under the same variable instantiations as in the containing pattern. Although negation can be expressed in pure SPARQL, as shown in Q9, the explicit NOT greatly improves the readability of a query.

• Aggregation. SQL's COUNT, MAX, MIN, AVG, and GROUP BY are useful constructs that SPARQL lacks. Fortunately, our approach can use the power of a relational database to implement them (see Q11).

• Set operations. Similarly to the UNION operation, a query may require other set operations, such as difference, intersection, and containment. In particular, Q12 implements the division operator.

Page 21: Data & Knowledge Engineeringshiyong/papers/dke10.pdf · metadata management. Finally, our case study for provenance management in a real-life biological simulation workflow showed

Table 5. Provenance queries.

Q1: Return all the tasks of a workflow :w1 that require user input (have input parameters).
SPARQL: Select Distinct ?t Where {?t :partOf :w1. ?t :inputParameter ?p.}
SQL: Select t From (Select t1.s As t, t2.o As p From partOf t1, inputParameter t2 Where t1.s=t2.s And t1.o='w1') t3

Q2: Return the data dependency graph for a workflow run :wr1.
SPARQL: Select ?d1 ?d2 Where {?d1 :partOf :wr1. ?d1 :directDataDependency ?d2.}
SQL: Select d1, d2 From (Select t1.s As d1, t2.o As d2 From partOf t1, directDataDependency t2 Where t1.s=t2.s And t1.o='wr1') t3

Q3: Return all information about a workflow run :wr1.
SPARQL: Select ?p ?o Where {:wr1 rdf:type :WorkflowRun. :wr1 ?p ?o.}
SQL: Select p, o From (Select t1.p As p, t1.o As o From WorkflowRunSubject t1 Where t1.i='wr1') t2

Q4: Return all the data objects that have been used to produce a data object :wr1_d5. In addition, return the task runs (if any) that have produced each such data object.
SPARQL: Select ?d ?tr Where {:wr1_d5 :transitiveDataDependency ?d. Optional {?tr :output ?d}.}
SQL: Select d, tr From (Select t1.o As d From transitiveDataDependency t1 Where t1.s='wr1_d5') t3 Natural Left Outer Join (Select t2.s As tr, t2.o As d From output t2) t4

Q5: Return all the workflow runs of a workflow :w1 whose input parameter "p" has a value in the range between 5 and 10.
SPARQL: Select ?wr Where {?wr :instanceOf :w1. ?wr :inputParameter ?p. ?p :title "p". ?p :dataValue ?v. Filter (?v >= 5 && ?v <= 10).}
SQL: Select wr From (Select t1.s As wr, t2.o As p, t4.o As v From instanceOf t1, inputParameter t2, title t3, dataValue t4 Where t1.s=t2.s And t2.o=t3.s And t1.o='w1' And t3.o='p' And (t4.o >= 5 And t4.o <= 10)) t5

Q6: Return all the workflow runs of workflows that directly evolved from a workflow :w1, where these runs have used an input dataset located at a URL "url".
SPARQL: Select ?wr Where {?wr :instanceOf ?w. ?w :directWorkflowEvolution :w1. ?wr :input ?d. ?d :locationURI "url"}
SQL: Select wr From (Select t1.s As wr, t1.o As w, t3.o As d From instanceOf t1, directWorkflowEvolution t2, input t3, locationURI t4 Where t1.s=t3.s And t1.o=t2.s And t3.o=t4.s And t2.o='w1' And t4.o='url') t5

Q7: Return all the workflow runs of a workflow :w1 that have used the results (final or intermediate) that have been generated by a run of a workflow :w2.
SPARQL: Select Distinct ?wr1 Where {?wr1 :instanceOf :w1. :w1 rdf:type :Workflow. ?wr1 :input ?d. ?wr2 :instanceOf :w2. :w2 rdf:type :Workflow. ?tr :partOf ?wr2. ?tr :output ?d.}
SQL: Select Distinct wr1 From (Select t1.s As wr1, t2.o As d, t3.s As wr2, t4.s As tr From WorkflowObject t1, input t2, WorkflowObject t3, partOf t4, output t5 Where t1.s=t2.s And t2.o=t5.o And t3.s=t4.o And t4.s=t5.s And t1.p='instanceOf' And t1.i='w1' And t3.p='instanceOf' And t3.i='w2') t6

Q8: Return all the datasets that have been used or produced by a task run :wr1_t1.
SPARQL: Select ?d Where {{:wr1_t1 :input ?d.} Union {:wr1_t1 :output ?d.}}
SQL: Select d From (Select t1.o As d From input t1 Where t1.s='wr1_t1') t3 Union Select d From (Select t2.o As d From output t2 Where t2.s='wr1_t1') t4

Q9: Return all the tasks of a workflow :w1 that require no user interaction.
SPARQL: Select ?t Where {?t rdf:type :Task. ?t :partOf :w1. Optional {?t :inputParameter ?p}. Filter (!bound(?p)).}
SQL: Select t From (Select t1.i As t From TaskSubject t1 Where t1.p='partOf' And t1.o='w1') t3 Natural Left Outer Join (Select t2.s As t, t2.o As p From inputParameter t2) t4 Where p Is Null

Q10: Return all the tasks of a workflow :w1 that require no user interaction.
SPARQL+: Select ?t Where {?t rdf:type :Task. ?t :partOf :w1. NOT {?t :inputParameter ?p.}.}
SQL: Select t From (Select t1.i As t From TaskSubject t1 Where t1.p='partOf' And t1.o='w1' And Not Exists (Select t2.s As t, t2.o As p From inputParameter t2 Where t2.s = t1.i)) t3

Q11: Return the number of the workflow runs in the system.
SPARQL+: Select COUNT(?wr) Where {?wr rdf:type :WorkflowRun.}
SQL: Select COUNT(wr) From (Select t1.i As wr From WorkflowRun t1) t2

Q12: Return all the datasets that have been used by all the workflow runs of a workflow :w1.
SPARQL+: Select ?d Where {{?wr1 :instanceOf :w1. ?wr1 :input ?d.} DIVIDE {?wr1 :instanceOf :w1.}}
SQL: Select Distinct d From (Select t1.s As wr, t2.o As d From instanceOf t1, input t2 Where t1.s=t2.s And t1.o='w1') r1 Where Not Exists (Select * From (Select t3.s As wr From instanceOf t3 Where t3.o='w1') r2 Where Not Exists (Select * From (Select t1.s As wr, t2.o As d From instanceOf t1, input t2 Where t1.s=t2.s And t1.o='w1') r3 Where r1.d=r3.d And r3.wr=r2.wr))


6.3. Query performance experiments

Our SPARQL-to-SQL translation proved to be very efficient, returning the SQL equivalents of the sample provenance queries in less than 0.01 s. The query response times for our two database schemas and two database size settings are reported in Fig. 14. In addition, we compared the performance of our approach to that of Jena 2.5.2 and Sesame 1.2.6 on queries Q1–Q9; since Sesame did not support SPARQL, we had to express the queries in the SeRQL [22] query language.

For 2000 workflow runs, or one million triples, Sesame was the fastest for Q1, Q2, Q4, Q5, Q6, and Q8. SchemaMapping-T was the fastest for Q3, Q7, and Q9; these three queries were translated into SQL using our optimizations, which reduced the number of relations and relational joins in the queries. Both Sesame and SchemaMapping-T were approximately 2–3 times faster than Jena. SchemaMapping-V was on average faster than Jena and showed the third best performance. For the SPARQL+ queries Q10, Q11, and Q12, SchemaMapping-T was significantly faster than SchemaMapping-V.

For 20,000 workflow runs, or 10 million triples, SchemaMapping-T was the fastest for Q2, Q3, Q6, Q7, and Q9, while Sesame was the fastest for Q1, Q4, Q5, and Q8. Similar to the previous results, SchemaMapping-V and Jena showed slower performance than SchemaMapping-T and Sesame.

Page 22: Data & Knowledge Engineeringshiyong/papers/dke10.pdf · metadata management. Finally, our case study for provenance management in a real-life biological simulation workflow showed

Fig. 14. Performance of provenance queries with database indexes.


SchemaMapping-V was on average faster than Jena. For all the SPARQL+ queries, SchemaMapping-T was significantly faster than SchemaMapping-V.

Queries Q10, Q11, and Q12 were only evaluated by RDFPROV, as they used the negation, aggregation, and division extensions of SPARQL implemented only in our system. Q12 proved to be the slowest (especially with schema SchemaMapping-V) of all 12 queries evaluated by RDFPROV, because its equivalent SQL query contains a correlated nested query, and correlated nested queries are known to be expensive [63].

From these two experiments, we observed that SchemaMapping-V, Jena, and Sesame scaled worse than SchemaMapping-T; e.g., the evaluation of Q7 suffered with the growth of the database size. The good Sesame performance is explained by the fact that RDF URIs and literals were substituted with integer IDs, facilitating faster indexes on numeric values. This design required mapping URIs/literals in a SPARQL query to integer IDs and vice versa, i.e., mapping the integer IDs returned in the query result back to URIs/literals. If such mappings were encoded in an SQL query via relational joins of Sesame's tables Triples, Resources, and Literals, the query response would have been unacceptably slow (e.g., Q1, when translated into SQL according to Sesame's schema, took several minutes for 10 million triples). To avoid expensive joins, Sesame had to cache the mappings in main memory, resulting in a memory-bound approach. When there is a huge number of URIs/literals in a dataset, the ID-to-URI/literal mapping may become problematic [77]. In contrast, SchemaMapping-T relied only on the database engine for query processing rather than on in-memory data structures. In conclusion, while SchemaMapping-T provides query performance comparable to Sesame, SchemaMapping-T enables faster data mapping and better scalability for both data mapping and query processing. Unfortunately, further experimentation with Sesame using a larger dataset is rather prohibitive due to its slow data mapping (see Figs. 9 and 10).

Page 23: Data & Knowledge Engineeringshiyong/papers/dke10.pdf · metadata management. Finally, our case study for provenance management in a real-life biological simulation workflow showed


Finally, we dropped all the database indexes and repeated the above experiments to get an idea of how the indexes and materialized tables affect query performance. The query response times for our two database schemas with indexes dropped and two database size settings are reported in Fig. 15. We observed that:

• All the queries showed slower performance without indexes (Fig. 15) when compared to the corresponding queries with indexes (Fig. 14). For many queries (e.g., for most queries over SchemaMapping-T), the performance difference was relatively small for the smaller database size; however, the difference became substantial for the larger dataset (e.g., Q5, Q6, and Q12 for both schemas and Q7 for SchemaMapping-V took over three minutes to execute). Thus, indexes proved to be very important for supporting scalable query processing in RDFPROV.

• All the queries over SchemaMapping-T were significantly faster than the corresponding queries over SchemaMapping-V. Therefore, partitioning and materialization proved beneficial for achieving improved query performance in RDFPROV.

7. Case study for the TangoInSilico workflow

In this section, we present a case study for a real-life scientific workflow in the biological simulation field, describing the relevant background, workflow definition, workflow execution plan, and provenance requirements and their implications for RDFPROV, as well as for Jena and Sesame.

Pheromones are known in biology as attractants for the opposite sex in many environments; however, little is known about the search strategies employed in responding to pheromones in the marine environment. The spawning behavior of males of the polychaete Nereis succinea is known to be triggered at close range by a high concentration of pheromone released by females. Since pheromone also causes acceleration of swimming and increased turning, in addition to eliciting ejaculation, scientists propose the hypothesis that these behaviors, elicited by low concentrations of pheromone, can be used by males to find females. To test this hypothesis, a simulation scientific workflow, called TangoInSilico [43], was developed using the VIEW system [29,69]. Fig. 16 shows this workflow, whose seven tasks are described in the following. Tasks Male Factory and Female Factory are used to generate any number of different kinds of male and female worms based on the parameters specified by a user.

Fig. 15. Performance of provenance queries without database indexes.

Page 24: Data & Knowledge Engineeringshiyong/papers/dke10.pdf · metadata management. Finally, our case study for provenance management in a real-life biological simulation workflow showed

Fig. 16. Biology simulation workflow TangoInSilico in VIEW.

Table 6. TangoInSilico provenance requirements and their fulfillment by RDFPROV, Jena, and Sesame.

• Data ingest of 500 triples every 3 s: Jena, yes (≈0.3 s); Sesame, first few weeks only, then >3 s; RDFPROV with SchemaMapping-V, yes (≈0.06 s); RDFPROV with SchemaMapping-T, yes (≈1.3 s).
• Provenance query support: Jena, partial (no division and aggregation); Sesame, partial (no division and aggregation); RDFPROV with SchemaMapping-V, full; RDFPROV with SchemaMapping-T, full.
• Query performance: Jena, good; Sesame, excellent; RDFPROV with SchemaMapping-V, good; RDFPROV with SchemaMapping-T, excellent.


Tasks Model Factory and Environment Factory are used to generate the experiment model and the environment model, respectively. Task Simulation calculates the movement of the worms step by step, with all intermediate data recorded for each step so that the output of the model can be used for future analysis. Finally, task Visualization is used to display the movement trajectories, and task Statistical Analysis is used to analyze the simulation results.

A scientist performs on average three sets of experiments with TangoInSilico every week during a couple of months; then, the workflow evolves. Each set of experiments tests five pheromone concentrations, 36 movement starting angles, and 20 other conditions, resulting in 5×36×20 = 3600 workflow runs. Each workflow run takes on average three seconds and generates provenance of around 500 RDF triples (after inference). After each set of experiments (3600 workflow runs), the scientist does 20–50 replays of the simulation visualization and asks a number of questions (provenance queries):

• What were the pheromone concentration, male and female starting angles, male and female positions, etc. (there are more than 20 different parameters) for a particular workflow run?

• Did a male catch a female in a particular workflow run?
• In how many runs (aggregation) did a male catch a female for a particular environment setting?
• In how many runs (aggregation) did a male fail (negation) to catch a female for a particular environment setting?
• Find all starting angles for which all (division) the runs were successful (a male caught a female) under a given concentration.

Based on the above information and our experiments with RDFPROV, Jena, and Sesame, we provide a summary (see Table 6) of how these systems can fulfill the requirements of provenance management for the TangoInSilico workflow. The three major requirements of data ingest, provenance query support, and query performance are fully supported only by RDFPROV. Sesame cannot keep up with the data ingest after provenance of a few million triples, corresponding to several sets of experiments, is accumulated. It is possible to organize a data queue for Sesame; however, this also implies that the scientist has to wait until all data from the queue is processed before provenance can be queried. Neither Jena nor Sesame can support provenance queries that require the COUNT and DIVIDE operations. Of the two database schemas supported by our system, RDFPROV with SchemaMapping-T is the best candidate to manage the provenance of TangoInSilico. An important advantage of RDFPROV is the ability to switch between the database schemas SchemaMapping-T and SchemaMapping-V. If later, for another scientific workflow, we need to support a much higher rate of data ingest, SchemaMapping-V can be used as a solution. Moreover, if data ingest for SchemaMapping-T becomes infeasible over time for various reasons (database size, workflow evolution), switching to SchemaMapping-V will only require creating views over one table of SchemaMapping-T, since the latter basically subsumes the former.
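For illustration, such a switch could be scripted roughly as follows, assuming that the materialized per-class and per-property tables are dropped first and that the view definitions mirror the tuple semantics given in Section 5; the names and statement shapes are ours, not RDFPROV's exact code.

def switch_to_schema_v(cursor, classes, properties):
    # Rebuild the SchemaMapping-V layout as views over the Triple table,
    # which SchemaMapping-T already maintains.
    for c in classes:
        for t in (f"{c}Subject", f"{c}Object", c):
            cursor.execute(f"Drop Table If Exists {t}")
        cursor.execute(f"Create View {c} As Select s As i From Triple "
                       f"Where p = 'rdf:type' And o = '{c}'")
        cursor.execute(f"Create View {c}Subject As Select x.i, t.p, t.o "
                       f"From Triple t, {c} x Where t.s = x.i")
        cursor.execute(f"Create View {c}Object As Select t.s, t.p, x.i "
                       f"From Triple t, {c} x Where t.o = x.i")
    for prop in properties:
        cursor.execute(f"Drop Table If Exists {prop}")
        cursor.execute(f"Create View {prop} As Select s, o From Triple "
                       f"Where p = '{prop}'")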

8. Comparison of RDFPROV with commercial RDF stores

The scope of our performance comparison and case study in the previous sections was limited to open-source RDF stores, such as Jena and Sesame, since they are free to use and distribute with non-commercial scientific workflow management systems, such as Taverna [121,122] and VIEW [68,69]. However, we are also interested in exploring whether RDFPROV is up to date with state-of-the-art commercial systems.

Page 25: Data & Knowledge Engineeringshiyong/papers/dke10.pdf · metadata management. Finally, our case study for provenance management in a real-life biological simulation workflow showed


In this section, we empirically compare RDFPROV with AllegroGraph [1] and BigOWLIM [2] in the context of scientific workflow provenance management. Both AllegroGraph and BigOWLIM are native RDF stores and therefore use proprietary data structures and algorithms to persist and query RDF datasets. Thus, in this experimental study, we consider them as black boxes with defined APIs.

We conducted two experiments for data mapping and query evaluation on a PC with a 2.4 GHz Pentium IV CPU and 1024 MB of main memory running MS Windows XP Professional. Since AllegroGraph and BigOWLIM are commercial RDF stores, we also included an instance of RDFPROV with a commercial database backend in the comparison, resulting in the following four instances:

(1) RDFPROV + MySQL 5.0 Community Edition (via the MySQL C API),
(2) RDFPROV + Oracle Enterprise Edition 9.0.1 (via ODBC),
(3) AllegroGraph 3.2,
(4) BigOWLIM 3.1.0.

The inference mechanisms were turned off in all the systems, and the same dataset (see Table 3) and test queries (see Table 5) were used in the experiments. All the stores, except AllegroGraph, used default indexing schemes. For AllegroGraph, the incremental indexing procedure (the indexNewTriples method) was called after the data mapping of each workflow definition or workflow execution document. Since this strategy resulted in many small index files, and there was a limit on how many such index chunks could be opened at the same time for query evaluation (exceeding the limit raised an exception), the threshold for automatic re-indexing of the triples was set (via the setUnmergedThreshold method) to 200. This way, when the number of index chunks exceeded 200, the new triples were automatically re-indexed into a new single unified index.

In our first experiment (see Fig. 17), RDFPROV, AllegroGraph 3.2, and BigOWLIM 3.1.0 were evaluated by storing sequences of workflow runs. We stored five workflow definitions into the database and measured the times to store sequences of 1, 20, 200, 2000, and 20,000 workflow runs. Each workflow run provenance contained 500 triples (after inferencing) in the N-Triples [112] format. For RDFPROV, the MySQL and Oracle backends, as well as the algorithms DataMapping-V, DataMapping-T, and DataMapping-TM, were used, resulting in six different combinations. RDFPROV with the Oracle backend showed faster data mapping than RDFPROV with the MySQL backend (except for the single workflow run case). The same performance ordering of DataMapping-V, DataMapping-TM, and DataMapping-T was observed for the two backends. DataMapping-V proved to be much faster than its two peers since it only populated one table. DataMapping-TM benefited from our in-memory data mapping strategy and was significantly faster than DataMapping-T. Overall, RDFPROV/Oracle/DataMapping-V delivered the fastest data mapping, outperforming the other approaches, as well as AllegroGraph and BigOWLIM. While AllegroGraph performed well on the sequences of 1 and 20 runs, it was the slowest to load 200 and 2000 workflow runs. AllegroGraph could not load 20,000 workflow runs, as this store is limited to handling only a few million triples on a 32-bit operating system. BigOWLIM outperformed the DataMapping-T and DataMapping-TM approaches, but was slower than DataMapping-V. For example, BigOWLIM loaded 20,000 workflow runs in 6318 s, which is approximately 3.3 times faster than RDFPROV/Oracle/DataMapping-TM and approximately 2.7 times slower than RDFPROV/Oracle/DataMapping-V.

Fig. 17. Data mapping performance of RDFProv, AllegroGraph 3.2, and BigOWLIM 3.1.0 on sequences of workflow runs.

Page 26: Data & Knowledge Engineeringshiyong/papers/dke10.pdf · metadata management. Finally, our case study for provenance management in a real-life biological simulation workflow showed


We do not report the performance of the systems at storing a single workflow run when they already store a varying number of workflow runs, because the native RDF stores did not show stable performance for this type of evaluation, nor is it clear how "cold" and "warm" runs can be measured in the native systems.

In our last experiment (see Fig. 18), RDFPROV (with MySQL/Oracle and SchemaMapping-V/SchemaMapping-T), AllegroGraph 3.2, and BigOWLIM 3.1.0 were evaluated by executing the test provenance queries (see Table 5) on the two stored datasets. Overall, RDFPROV showed better performance when coupled with Oracle rather than with MySQL, which could be due to the more sophisticated query optimization techniques and join algorithms supported by Oracle. RDFPROV/Oracle/SchemaMapping-T and BigOWLIM were frequently very close and outperformed the other competitors for most queries.

For 2000 workflow runs, or one million triples, RDFPROV/Oracle/SchemaMapping-T was the fastest for Q1, Q4, Q8, and Q9, and BigOWLIM was the fastest for Q5, Q6, and Q7. Both systems evaluated queries Q2 and Q3 in 15 ms. While AllegroGraph performed better than RDFPROV/MySQL for Q1, Q2, Q3, Q4, and Q6, its performance was significantly worse than all other approaches for Q5, Q7, and Q9.

For 20,000 workflow runs, or 10 million triples, RDFPROV/Oracle/SchemaMapping-T was the fastest for Q1, Q3, Q4, Q6, Q7, and Q9, and BigOWLIM was the fastest for Q2, Q5, and Q8. However, on many queries the difference between these two stores was insignificant, except perhaps for query Q6, for which BigOWLIM was outperformed by every other competitor. While RDFPROV/Oracle/SchemaMapping-V and RDFPROV/MySQL/SchemaMapping-T were not very far from the two leaders, RDFPROV/MySQL/SchemaMapping-V was significantly slower for Q4, Q7, and Q9.

Queries Q10, Q11, and Q12 were only evaluated by RDFPROV, as they used the negation, aggregation, and division extensions of SPARQL implemented in our system. RDFPROV with the Oracle backend performed faster on these queries. For query Q11 and the two database sizes, both RDFPROV/Oracle/SchemaMapping-T and RDFPROV/Oracle/SchemaMapping-V showed equivalent times of 16 ms.

Fig. 18. Performance of RDFProv, AllegroGraph 3.2, and BigOWLIM 3.1.0 for test provenance queries.

Page 27: Data & Knowledge Engineeringshiyong/papers/dke10.pdf · metadata management. Finally, our case study for provenance management in a real-life biological simulation workflow showed


In summary, we observed that RDFPROV with the non-commercial version of MySQL provided good performance and scalability with respect to the commercial RDF stores AllegroGraph and BigOWLIM. In particular, its data mapping outperformed AllegroGraph, and its DataMapping-V algorithm provided faster data loading than both AllegroGraph and BigOWLIM. This instance of RDFPROV also showed comparable query performance, outperforming AllegroGraph on the smaller dataset for queries Q5, Q6, and Q7, and outperforming BigOWLIM on the larger dataset for queries Q6, Q7, and Q9 when SchemaMapping-T was used. Furthermore, RDFPROV with the Oracle backend showed significant improvements in performance, giving the best response times for the majority of the test queries.

9. Conclusions and future work

In this work, we designed the relational RDF store RDFPROV, a Semantic Web driven system optimized for querying and managing scientific workflow provenance metadata. The architecture of RDFPROV seamlessly integrates the interoperability, extensibility, and reasoning advantages of Semantic Web technologies with the storage and querying power of an RDBMS. To support this integration, three model mappings are described in detail. Our schema mapping, data mapping, and SPARQL-to-SQL query translation algorithms are optimized to efficiently support (1) common provenance queries, (2) incremental data loading that exploits the order in which various provenance metadata are inserted, and (3) schema-independent query translation that is optimized on-the-fly by using the type information of an instance and the statistics of the sizes of the tables in the database. The RDFPROV system design provides two alternative database representations, SchemaMapping-V and SchemaMapping-T, enabling the flexibility to set up a provenance repository based on the needs of a specific scientific workflow. In particular, SchemaMapping-V supports very fast schema and data mappings, while SchemaMapping-T supports very efficient query processing. Our query translation allows transparent switching between these two representations. The experimental study showed that our algorithms are efficient and scalable. The comparison with the existing general-purpose RDF stores Jena, Sesame, AllegroGraph, and BigOWLIM showed that our optimizations bring improved efficiency and scalability to provenance metadata management. Finally, our case study for provenance management in the TangoInSilico scientific workflow showed the production quality and capability of the RDFPROV system.

Our provenance storage and querying techniques are orthogonal to the scientific workflow model. Therefore, we support the storage and querying of provenance generated from both long-duration and short-duration activities. However, one can take advantage of the characteristics of scientific workflows (long-duration or short-duration) to trade between data ingest performance and querying performance. For long-duration activities, in which provenance ingest performance is less important, we can choose a database schema with a slower data mapping strategy but a faster query response time. On the other hand, for short-duration activities, in which the provenance ingest rate becomes critical, a faster data mapping strategy can be chosen to speed up data ingest. As shown in our case study, such a tradeoff between data mapping performance and query performance might be a desirable feature for some scientific workflow applications.

In the future, we would like to continue exploring further optimizations for database schema design, data ingest, and querying, with a main focus on semantic query optimization. Our attention is also drawn to column-oriented databases [7,95], which can be customized for provenance management, and provenance reduction techniques [28], which can be used to decrease storage requirements via duplicate elimination and provenance inheritance. Finally, we would like to consider querying and managing scientific workflow provenance in distributed environments with multiple computing nodes to enable the processing of huge datasets with billions of triples.

References

[1] AllegroGraph RDFStore, 2010, Available from http://agraph.franz.com/allegrograph/.
[2] BigOWLIM – OWL Semantic Repository, 2010, Available from http://www.ontotext.com/owlim/big/index.html.
[3] Open Provenance Model, 2010, Available from http://openprovenance.org.
[4] RuleML: The RuleML Initiative website, 2010, Available from http://www.ruleml.org/.
[5] The Gene Ontology, 2010, Available from http://www.geneontology.org.
[6] Tupelo Semantic Content Repository, 2010, Available from http://tupeloproject.ncsa.uiuc.edu.
[7] D.J. Abadi, A. Marcus, S. Madden, K.J. Hollenbach, Scalable Semantic Web Data Management Using Vertical Partitioning, Proc. of the International Conference on Very Large Data Bases (VLDB), 2007, pp. 411–422.
[8] R. Agrawal, A. Somani, Y. Xu, Storage and Querying of E-commerce Data, Proc. of the International Conference on Very Large Data Bases (VLDB), 2001, pp. 149–158.
[9] S. Alexaki, V. Christophides, G. Karvounarakis, D. Plexousakis, On Storing Voluminous RDF Descriptions: The Case of Web Portal Catalogs, Proc. of the International Workshop on the Web and Databases (WebDB), 2001, pp. 43–48.
[10] I. Altintas, O. Barney, E. Jaeger-Frank, Provenance Collection Support in the Kepler Scientific Workflow System, Proc. of the International Provenance and Annotation Workshop (IPAW), 2006, pp. 118–132.
[11] G. Antoniou, A. Bikakis, N. Dimaresis, M. Genetzakis, G. Georgalis, G. Governatori, E. Karouzaki, N. Kazepis, D. Kosmadakis, M. Kritsotakis, G. Lilis, A. Papadogiannakis, P. Pediaditis, C. Terzakis, R. Theodosaki, D. Zeginis, Proof explanation for a nonmonotonic Semantic Web rules language, Data & Knowledge Engineering (DKE) 64 (3) (2008) 662–687.
[12] K. Anyanwu, A. Maduko, A. Sheth, SPARQ2L: Towards Support for Subgraph Extraction Queries in RDF Databases, Proc. of the International World Wide Web Conference (WWW), 2007, pp. 797–806.
[13] F. Bancilhon, R. Ramakrishnan, An Amateur's Introduction to Recursive Query Processing Strategies, Proc. of the SIGMOD International Conference on Management of Data, 1986, pp. 16–52.
[14] R.S. Barga, L.A. Digiampietri, Automatic capture and efficient storage of e-Science experiment provenance, Concurrency and Computation: Practice and Experience 20 (5) (2008) 419–429.
[15] D. Beckett, J. Grant, SWAD-Europe Deliverable 10.2: Mapping Semantic Web data with RDBMSes, Technical report, 2003, Available from http://www.w3.org/2001/sw/Europe/reports/scalable_rdbms_mapping_report/.
[16] T. Berners-Lee, J. Hendler, O. Lassila, The Semantic Web, Scientific American, May 2001.
[17] A. Bernstein, C. Kiefer, M. Stocker, OptARQ: a SPARQL optimization approach based on triple pattern selectivity estimation, Technical Report ifi-2007.03, March 2007, Available from http://www.ifi.uzh.ch/ddis/staff/goehring/btw/files/ifi-2007.03.pdf.
[18] C. Bizer, A. Schultz, The Berlin SPARQL benchmark, International Journal on Semantic Web and Information Systems 5 (2) (2009) 1–24.
[19] C. Bizer, A. Seaborne, D2RQ – Treating Non-RDF Databases as Virtual RDF Graphs, Proc. of the International Semantic Web Conference (ISWC), 2004, Poster presentation.
[20] R. Bose, J. Frew, Lineage retrieval for scientific data processing: a survey, ACM Computing Surveys 37 (1) (2005) 1–28.
[21] S. Bowers, T.M. McPhillips, B. Ludäscher, S. Cohen, S.B. Davidson, A Model for User-oriented Data Provenance in Pipelined Scientific Workflows, Proc. of the International Provenance and Annotation Workshop (IPAW), 2006.
[22] J. Broekstra, A. Kampman, SeRQL: a second generation RDF query language, Technical report, 2003, Available from http://www.w3.org/2001/sw/Europe/events/20031113-storage/positions/aduna.pdf.
[23] J. Broekstra, A. Kampman, F. van Harmelen, Sesame: A Generic Architecture for Storing and Querying RDF and RDF Schema, Proc. of the International Semantic Web Conference (ISWC), 2002, pp. 54–68.
[24] P. Buneman, A. Chapman, J. Cheney, Provenance Management in Curated Databases, Proc. of the SIGMOD International Conference on Management of Data, 2006, pp. 539–550.
[25] P. Buneman, S. Khanna, W.C. Tan, Why and Where: A Characterization of Data Provenance, Proc. of the International Conference on Database Theory (ICDT), 2001, pp. 316–330.
[26] M. Cai, M.R. Frank, B. Yan, R.M. MacGregor, A subscribable peer-to-peer RDF repository for distributed metadata management, Journal of Web Semantics 2 (2) (2004) 109–130.
[27] S.P. Callahan, J. Freire, E. Santos, C.E. Scheidegger, C.T. Silva, H.T. Vo, VisTrails: Visualization Meets Data Management, Proc. of the SIGMOD International Conference on Management of Data, 2006, pp. 745–747.
[28] A. Chapman, H.V. Jagadish, P. Ramanan, Efficient Provenance Storage, Proc. of the SIGMOD International Conference on Management of Data, 2008, pp. 993–1006.
[29] A. Chebotko, C. Lin, X. Fei, Z. Lai, S. Lu, J. Hua, F. Fotouhi, VIEW: A VIsual sciEntific Workflow management system, Proc. of the International Workshop on Scientific Workflows (SWF), 2007.
[30] A. Chebotko, S. Lu, M. Atay, F. Fotouhi, Efficient processing of RDF queries with nested optional graph patterns in an RDBMS, International Journal on Semantic Web and Information Systems (IJSWIS) 4 (4) (2008) 1–30.
[31] A. Chebotko, S. Lu, F. Fotouhi, Semantics preserving SPARQL-to-SQL translation, Data & Knowledge Engineering (DKE) 68 (10) (2009) 973–1000.
[32] E.I. Chong, S. Das, G. Eadon, J. Srinivasan, An Efficient SQL-based RDF Querying Scheme, Proc. of the International Conference on Very Large Data Bases (VLDB), 2005, pp. 1216–1227.
[33] V. Christophides, D. Plexousakis, M. Scholl, S. Tourtounis, On Labeling Schemes for the Semantic Web, Proc. of the International World Wide Web Conference (WWW), 2003, pp. 544–555.
[34] D. Churches, G. Gombas, A. Harrison, J. Maassen, C. Robinson, M. Shields, I. Taylor, I. Wang, Programming scientific and distributed workflow with Triana services, Concurrency and Computation: Practice and Experience 18 (10) (2006) 1021–1037.
[35] Y. Cui, J. Widom, Lineage tracing for general data warehouse transformations, VLDB Journal 12 (1) (2003) 41–58.
[36] Y. Cui, J. Widom, J. Wiener, Tracing the lineage of view data in a warehousing environment, ACM Transactions on Database Systems (TODS) 25 (2) (2000) 179–227.
[37] R. Cyganiak, A relational algebra for SPARQL, Technical Report HPL-2005-170, Hewlett-Packard Laboratories, 2005, Available from http://www.hpl.hp.com/techreports/2005/HPL-2005-170.html.
[38] S.B. Davidson, J. Freire, Provenance and Scientific Workflows: Challenges and Opportunities, Proc. of the SIGMOD International Conference on Management of Data, 2008, pp. 1345–1350.
[39] E. Deelman, G. Singh, M.-H. Su, J. Blythe, Y. Gil, C. Kesselman, G. Mehta, K. Vahi, G.B. Berriman, J. Good, A. Laity, J.C. Jacob, D.S. Katz, Pegasus: a framework for mapping complex scientific workflows onto distributed systems, Scientific Programming Journal 13 (3) (2005) 219–237.
[40] L. Ding, K. Wilkinson, C. Sayers, H. Kuno, Application Specific Schema Design for Storing Large RDF Datasets, Proc. of the International Workshop on Practical and Scalable Semantic Systems (PSSS), 2003.
[41] B. Elliott, E. Cheng, C. Thomas-Ogbuji, Z. Özsoyoglu, A Complete Translation from SPARQL into Efficient SQL, Proc. of the International Database Engineering and Applications Symposium (IDEAS), 2009, pp. 31–42.
[42] O. Erling, Implementing a SPARQL compliant RDF triple store using a SQL-ORDBMS, Technical report, OpenLink Software Virtuoso, 2001, Available from http://virtuoso.openlinksw.com/wiki/main/Main/VOSRDFWP.
[43] X. Fei, S. Lu, T. Breithaupt, J.D. Hardege, J.L. Ram, Modeling mate-finding behavior of the swarming polychaete, Nereis succinea, with TangoInSilico, a scientific workflow based simulation system for sexual searching, Invertebrate Reproduction and Development 52 (1–2) (2008) 69–80.
[44] G. Flouris, D. Manakanatas, H. Kondylakis, D. Plexousakis, G. Antoniou, Ontology change: classification and survey, Knowledge Engineering Review 23 (2) (2008).
[45] I. Foster, J. Vöckler, M. Wilde, Y. Zhao, Chimera: A Virtual Data System for Representing, Querying, and Automating Data Derivation, Proc. of the International Conference on Scientific and Statistical Database Management (SSDBM), 2002, pp. 37–46.
[46] J. Freire, C.T. Silva, S.P. Callahan, E. Santos, C.E. Scheidegger, H.T. Vo, Managing Rapidly-evolving Scientific Workflows, Proc. of the International Provenance and Annotation Workshop (IPAW), 2006.
[47] J. Frew, D. Metzger, P. Slaughter, Automatic capture and reconstruction of computational provenance, Concurrency and Computation: Practice and Experience 20 (5) (2008) 485–496.
[48] J.G. Frey, D. De Roure, K.R. Taylor, J.W. Essex, H.R. Mills, E. Zaluska, CombeChem: A Case Study in Provenance and Annotation Using the Semantic Web, Proc. of the International Provenance and Annotation Workshop (IPAW), 2006, pp. 270–277.
[49] J. Golbeck, Combining Provenance with Trust in Social Networks for Semantic Web Content Filtering, Proc. of the International Provenance and Annotation Workshop (IPAW), 2006, pp. 101–108.
[50] J. Golbeck, J. Hendler, A Semantic Web approach to the provenance challenge, Concurrency and Computation: Practice and Experience 20 (5) (2008) 431–439.
[51] P. Groth, S. Jiang, S. Miles, S. Munroe, V. Tan, S. Tsasakou, L. Moreau, An architecture for provenance systems – executive summary, Technical report, University of Southampton, February 2006.
[52] P. Groth, S. Miles, W. Fang, S.C. Wong, K.-P. Zauner, L. Moreau, Recording and Using Provenance in a Protein Compressibility Experiment, Proc. of the International Symposium on High Performance Distributed Computing (HPDC), 2005.
[53] Y. Guo, J. Heflin, Z. Pan, Benchmarking DAML+OIL Repositories, Proc. of the International Semantic Web Conference (ISWC), 2003, pp. 613–627.
[54] Y. Guo, Z. Pan, J. Heflin, LUBM: a benchmark for OWL knowledge base systems, Journal of Web Semantics 3 (2–3) (2005) 158–182.
[55] Y. Guo, A. Qasem, Z. Pan, J. Heflin, A requirements driven framework for benchmarking Semantic Web knowledge base systems, IEEE Transactions on Knowledge and Data Engineering 19 (2) (2007) 297–309.
[56] S. Harris, N. Gibbins, 3store: Efficient Bulk RDF Storage, Proc. of the International Workshop on Practical and Scalable Semantic Systems (PSSS), 2003, pp. 1–15.
[57] S. Harris, N. Shadbolt, SPARQL Query Processing with Conventional Relational Database Systems, Proc. of the International Workshop on Scalable Semantic Web Knowledge Base Systems (SSWS), 2005, pp. 235–244.
[58] A. Harth, S. Decker, Optimized Index Structures for Querying RDF from the Web, Proc. of the Latin American Web Congress (LA-WEB), 2005, pp. 71–80.
[59] O. Hartig, R. Heese, The SPARQL Query Graph Model for Query Optimization, Proc. of the European Semantic Web Conference (ESWC), 2007, pp. 564–578.
[60] D.A. Holland, M.I. Seltzer, U. Braun, K.-K. Muniswamy-Reddy, PASSing the provenance challenge, Concurrency and Computation: Practice and Experience 20 (5) (2008) 531–540.
[61] E. Hung, Y. Deng, V.S. Subrahmanian, RDF Aggregate Queries and Views, Proc. of the International Conference on Data Engineering (ICDE), 2005, pp. 717–728.
[62] M.F. Husain, P. Doshi, L. Khan, B.M. Thuraisingham, Storage and Retrieval of Large RDF Graph Using Hadoop and MapReduce, Proc. of the International Conference on Cloud Computing (CloudCom), 2009, pp. 680–686.
[63] M. Kifer, A. Bernstein, P.M. Lewis, Database Systems: An Application Oriented Approach, Addison-Wesley, 2006.
[64] J. Kim, E. Deelman, Y. Gil, G. Mehta, V. Ratnakar, Provenance trails in the Wings-Pegasus system, Concurrency and Computation: Practice and Experience 20 (5) (2008) 587–597.
[65] J. Kim, Y. Gil, V. Ratnakar, Semantic Metadata Generation for Large Scientific Workflows, Proc. of the International Semantic Web Conference (ISWC), 2006, pp. 357–370.
[66] E. Kontopoulos, N. Bassiliades, G. Antoniou, Deploying defeasible logic rule bases for the Semantic Web, Data & Knowledge Engineering (DKE) 66 (1) (2008) 116–146.
[67] C.P. de Laborda, S. Conrad, Bringing Relational Data into the Semantic Web Using SPARQL and Relational.OWL, Proc. of the ICDE Workshops, 2006, p. 55.
[68] C. Lin, S. Lu, X. Fei, A. Chebotko, D. Pai, Z. Lai, F. Fotouhi, J. Hua, A reference architecture for scientific workflow management systems and the VIEW SOA solution, IEEE Transactions on Services Computing 2 (1) (2009) 79–92.
[69] C. Lin, S. Lu, Z. Lai, A. Chebotko, X. Fei, J. Hua, F. Fotouhi, Service-oriented Architecture for VIEW: A Visual Scientific Workflow Management System, Proc. of the International Conference on Services Computing (SCC), 2008, pp. 335–342.
[70] B. Ludäscher, I. Altintas, C. Berkley, D. Higgins, E. Jaeger, M. Jones, E.A. Lee, J. Tao, Y. Zhao, Scientific workflow management and the Kepler system, Concurrency and Computation: Practice and Experience 18 (10) (2006) 1039–1065.
[71] L. Ma, Z. Su, Y. Pan, L. Zhang, T. Liu, RStar: An RDF Storage and Query System for Enterprise Resource Management, Proc. of the International Conference on Information and Knowledge Management (CIKM), 2004, pp. 484–491.
[72] L. Ma, C. Wang, J. Lu, F. Cao, Y. Pan, Y. Yu, Effective and Efficient Semantic Web Data Management over DB2, Proc. of the SIGMOD International Conference on Management of Data, 2008, pp. 1183–1194.
[73] A. Magkanaraki, V. Tannen, V. Christophides, D. Plexousakis, Viewing the Semantic Web through RVL lenses, Journal of Web Semantics 1 (4) (2004) 359–375.
[74] D.L. McGuinness, P.P. da Silva, Explaining answers from the Semantic Web: the Inference Web approach, Journal of Web Semantics 1 (4) (2004) 397–413.
[75] S. Miles, P. Groth, M. Branco, L. Moreau, The requirements of recording and using provenance in e-Science experiments, Journal of Grid Computing 5 (1) (2007) 1–25.
[76] L. Moreau, et al., Special issue: the first provenance challenge, Concurrency and Computation: Practice and Experience 20 (5) (2008) 409–418.
[77] S. Narayanan, T.M. Kurc, J.H. Saltz, DBOWL: towards extensional queries on a billion statements using relational databases, Technical Report OSUBMI_TR_2006_n03, Ohio State University, 2006, Available from http://bmi.osu.edu/resources/techreports/osubmi.tr.2006.n3.pdf.
[78] T. Neumann, G. Weikum, RDF-3X: A RISC-style Engine for RDF, Proceedings of the VLDB Endowment (PVLDB) 1 (1) (2008) 647–659.
[79] T. Neumann, G. Weikum, Scalable Join Processing on Very Large RDF Graphs, Proc. of the SIGMOD International Conference on Management of Data, 2009, pp. 627–640.
[80] T.M. Oinn, M. Addis, J. Ferris, D. Marvin, M. Senger, R.M. Greenwood, T. Carver, K. Glover, M.R. Pocock, A. Wipat, P. Li, Taverna: a tool for the composition and enactment of bioinformatics workflows, Bioinformatics 20 (17) (2004) 3045–3054.
[81] Z. Pan, J. Heflin, DLDB: Extending Relational Databases to Support Semantic Web Queries, Proc. of the International Workshop on Practical and Scalable Semantic Web Systems (PSSS), 2003, pp. 109–113.
[82] J. Perez, M. Arenas, C. Gutierrez, Semantics and Complexity of SPARQL, Proc. of the International Semantic Web Conference (ISWC), 2006, pp. 30–43.
[83] J. Perez, M. Arenas, C. Gutierrez, Semantics and complexity of SPARQL, ACM Transactions on Database Systems (TODS) 34 (3) (2009).
[84] A. Polleres, From SPARQL to Rules (and Back), Proc. of the International World Wide Web Conference (WWW), 2007, pp. 787–796.
[85] E. Prud'hommeaux, Optimal RDF access to relational databases, Technical report, 2004, Available from http://www.w3.org/2004/04/30-RDF-RDB-access/.
[86] E. Prud'hommeaux, Notes on adding SPARQL to MySQL, Technical report, 2005, Available from http://www.w3.org/2005/05/22-SPARQL-MySQL/.
[87] B. Quilitz, U. Leser, Querying Distributed RDF Data Sources with SPARQL, Proc. of the European Semantic Web Conference (ESWC), 2008, pp. 524–538.
[88] S.S. Sahoo, A. Sheth, C. Henson, Semantic provenance for eScience: managing the deluge of scientific data, IEEE Internet Computing 12 (4) (2008) 46–54.
[89] S. Schenk, A SPARQL Semantics Based on Datalog, Proc. of KI 2007, Annual German Conference on AI, 2007, pp. 160–174.
[90] S. Schenk, S. Staab, Networked Graphs: A Declarative Mechanism for SPARQL Rules, SPARQL Views and RDF Data Integration on the Web, Proc. of the International World Wide Web Conference (WWW), 2008, pp. 585–594.
[91] H. Schmidt, W. Kießling, U. Güntzer, R. Bayer, Compiling Exploratory and Goal-directed Deduction into Sloppy Delta-iteration, Proc. of the Symposium on Logic Programming (SLP), 1987, pp. 234–243.
[92] M. Schmidt, T. Hornung, G. Lausen, C. Pinkel, SP2Bench: A SPARQL Performance Benchmark, Proc. of the International Conference on Data Engineering (ICDE), 2009, pp. 222–233.
[93] G. Serfiotis, I. Koffina, V. Christophides, V. Tannen, Containment and Minimization of RDF/S Query Patterns, Proc. of the International Semantic Web Conference (ISWC), 2005, pp. 607–623.
[94] N. Shadbolt, T. Berners-Lee, W. Hall, The Semantic Web revisited, IEEE Intelligent Systems 21 (3) (2006) 96–101.
[95] L. Sidirourgos, R. Goncalves, M. Kersten, N. Nes, S. Manegold, Column-store Support for RDF Data Management: Not All Swans are White, Proc. of the International Conference on Very Large Data Bases (VLDB), 2008.
[96] Y. Simmhan, B. Plale, D. Gannon, A survey of data provenance in e-Science, SIGMOD Record 34 (3) (2005) 31–36.
[97] Y. Simmhan, B. Plale, D. Gannon, A Framework for Collecting Provenance in Data-centric Scientific Workflows, Proc. of the International Conference on Web Services (ICWS), 2006, pp. 427–436.
[98] Y. Simmhan, B. Plale, D. Gannon, Query capabilities of the Karma provenance framework, Concurrency and Computation: Practice and Experience 20 (5) (2008) 441–451.
[99] E. Simperl, Reusing ontologies on the Semantic Web: a feasibility study, Data & Knowledge Engineering (DKE) 68 (10) (2009) 905–925.
[100] M. Sintek, M. Kiesel, RDFBroker: A Signature-based High-performance RDF Store, Proc. of the European Semantic Web Conference (ESWC), 2006, pp. 363–377.
[101] K. Stoffel, M.G. Taylor, J.A. Hendler, Efficient Management of Very Large Ontologies, Proc. of the American Association for Artificial Intelligence Conference (AAAI), 1997, pp. 442–447.
[102] L. Stojanovic, Methods and Tools for Ontology Evolution, Ph.D. Dissertation, University of Karlsruhe, Germany, 2004, Available from digbib.ubka.uni-karlsruhe.de/volltexte/documents/1241.
[103] H. Stuckenschmidt, R. Vdovjak, J. Broekstra, G.-J. Houben, Towards distributed processing of RDF path queries, International Journal of Web Engineering and Technology 2 (2/3) (2005) 207–230.
[104] K.R. Taylor, R.J. Gledhill, J.W. Essex, J.G. Frey, S.W. Harris, D. De Roure, Bringing chemical data onto the Semantic Web, Journal of Chemical Information and Modeling 46 (3) (2006) 939–952.
[105] Y. Theoharis, V. Christophides, G. Karvounarakis, Benchmarking Database Representations of RDF/S Stores, Proc. of the International Semantic Web Conference (ISWC), 2005, pp. 685–701.
[106] O. Udrea, A. Pugliese, V.S. Subrahmanian, GRIN: A Graph Based RDF Index, Proc. of the American Association for Artificial Intelligence Conference (AAAI), 2007, pp. 1465–1470.
[107] R. Volz, D. Oberle, B. Motik, S. Staab, KAON SERVER – A Semantic Web Management System, Proc. of the International World Wide Web Conference (WWW), Alternate Tracks – Practice and Experience, 2003.
[108] R. Volz, D. Oberle, R. Studer, Implementing Views for Light-weight Web Ontologies, Proc. of the International Database Engineering and Applications Symposium (IDEAS), 2003, pp. 160–169.
[109] R. Volz, S. Staab, B. Motik, Incrementally maintaining materializations of ontologies stored in logic databases, Journal on Data Semantics 2 (2005) 1–34.
[110] W3C, OWL Web Ontology Language Reference, W3C Recommendation, in: M. Dean, G. Schreiber (Eds.), February 10, 2004, Available from http://www.w3.org/TR/2004/REC-owl-ref-20040210/.
[111] W3C, RDF Primer, W3C Recommendation, in: F. Manola, E. Miller (Eds.), February 10, 2004, Available from http://www.w3.org/TR/rdf-primer/.
[112] W3C, RDF Test Cases, W3C Recommendation, in: J. Grant, D. Beckett (Eds.), February 10, 2004, Available from http://www.w3.org/TR/rdf-testcases/.
[113] W3C, RDF Vocabulary Description Language 1.0: RDF Schema, W3C Recommendation, in: D. Brickley, R.V. Guha (Eds.), February 10, 2004, Available from http://www.w3.org/TR/2004/REC-rdf-schema-20040210/.
[114] W3C, Resource Description Framework (RDF): Concepts and Abstract Syntax, W3C Recommendation, in: G. Klyne, J.J. Carroll, B. McBride (Eds.), February 10, 2004, Available from http://www.w3.org/TR/2004/REC-rdf-concepts-20040210/.
[115] W3C, SPARQL Query Language for RDF, W3C Recommendation, in: E. Prud'hommeaux, A. Seaborne (Eds.), January 15, 2008, Available from http://www.w3.org/TR/2008/REC-rdf-sparql-query-20080115/.
[116] C. Weiss, P. Karras, A. Bernstein, Hexastore: Sextuple Indexing for Semantic Web Data Management, Proc. of the International Conference on Very Large Data Bases (VLDB), 2008.
[117] K. Wilkinson, Jena Property Table Implementation, Proc. of the International Workshop on Scalable Semantic Web Knowledge Base Systems (SSWS), 2006.
[118] K. Wilkinson, C. Sayers, H. Kuno, D. Reynolds, Efficient RDF Storage and Retrieval in Jena2, Proc. of the International Workshop on Semantic Web and Databases (SWDB), 2003, pp. 131–150.
[119] K. Wilkinson, C. Sayers, H.A. Kuno, D. Reynolds, L. Ding, Supporting scalable, persistent Semantic Web applications, IEEE Data Engineering Bulletin 26 (4) (2003) 33–39.
[120] F. Zemke, Converting SPARQL to SQL, Technical report, October 2006, Available from http://lists.w3.org/Archives/Public/public-rdf-dawg/2006OctDec/att-0058/sparql-to-sql.pdf.
[121] J. Zhao, C. Goble, R. Stevens, D. Turi, Mining Taverna's semantic web of provenance, Concurrency and Computation: Practice and Experience 20 (5) (2008) 463–472.
[122] J. Zhao, C. Wroe, C.A. Goble, R. Stevens, D. Quan, R.M. Greenwood, Using Semantic Web Technologies for Representing e-Science Provenance, Proc. of the International Semantic Web Conference (ISWC), 2004.
[123] Y. Zhao, M. Hategan, B. Clifford, I.T. Foster, G. von Laszewski, V. Nefedova, I. Raicu, T. Stef-Praun, M. Wilde, Swift: Fast, Reliable, Loosely Coupled Parallel Computation, Proc. of the International Workshop on Scientific Workflows (SWF), 2007, pp. 199–206.

Artem Chebotko received the PhD degree in computer science from Wayne State University in 2008, and MS and BS degrees in management information systems and computer science from Ukraine State Maritime Technical University in 2003 and 2001. He is currently an assistant professor in the Department of Computer Science, University of Texas-Pan American. His research interests include Semantic Web data management, scientific workflow provenance, XML databases, and relational databases. He has published around 30 papers in refereed international journals and conference proceedings. He currently serves as a program committee member of several international conferences and workshops on Semantic Web and scientific workflows. He is a member of the IEEE.

Shiyong Lu received the PhD degree in Computer Science from the State University of New York at Stony Brook in 2002, ME from the Institute of Computing Technology of the Chinese Academy of Sciences at Beijing in 1996, and BE from the University of Science and Technology of China at Hefei in 1993. He is currently an Associate Professor in the Department of Computer Science, Wayne State University, and the Director of the Scientific Workflow Research Laboratory (SWR Lab). His research interests include scientific workflows and databases. He has published more than 80 papers in refereed international journals and conference proceedings. He is the founder and currently a program co-chair of the IEEE International Workshop on Scientific Workflows (2007–2010), and an editorial board member for the International Journal of Semantic Web and Information Systems and the International Journal of Healthcare Information Systems and Informatics. He is a Senior Member of the IEEE.

Xubo Fei is a PhD student in the Department of Computer Science, Wayne State University. He is a member of the Scientific Workflow Research Laboratory (SWR Lab). His research interests include scientific workflows, cloud computing, and their applications in bioinformatics and biology simulation. He is a student member of the IEEE.

Farshad Fotouhi received the PhD degree in computer science from Michigan State University in 1988. In August 1988, he joined the faculty of Computer Science at Wayne State University, where he is currently a professor and the chair of the department. His major areas of research include XML databases, the Semantic Web, multimedia systems, and query optimization. He has published more than 100 papers in refereed journals and conference proceedings, and served as a program committee member of various database-related conferences. He is on the editorial boards of IEEE Multimedia Magazine and the International Journal on Semantic Web and Information Systems. He is a member of the IEEE.

