RESEARCH ARTICLE Open Access

S3QL: A distributed domain specific language for controlled semantic integration of life sciences data

Helena F Deus 1,2*, Miriã C Correa 3, Romesh Stanislaus 4, Maria Miragaia 5, Wolfgang Maass 6, Hermínia de Lencastre 5,7, Ronan Fox 1 and Jonas S Almeida 8

* Correspondence: helena.deus@deri.org
1 Digital Enterprise Research Institute, National University of Ireland at Galway, IDA Business Park, Lower Dangan, Galway, Ireland
Full list of author information is available at the end of the article

Deus et al. BMC Bioinformatics 2011, 12:285
http://www.biomedcentral.com/1471-2105/12/285

© 2011 Deus et al; licensee BioMed Central Ltd. This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.

Abstract

Background: The value and usefulness of data increases when it is explicitly interlinked with related data. This is the core principle of Linked Data. For life sciences researchers, harnessing the power of Linked Data to improve biological discovery is still challenged by a need to keep pace with rapidly evolving domains and requirements for collaboration and control as well as with the reference semantic web ontologies and standards. Knowledge organization systems (KOSs) can provide an abstraction for publishing biological discoveries as Linked Data without complicating transactions with contextual minutia such as provenance and access control. We have previously described the Simple Sloppy Semantic Database (S3DB) as an efficient model for creating knowledge organization systems using Linked Data best practices with explicit distinction between domain and instantiation and support for a permission control mechanism that automatically migrates between the two. In this report we present a domain specific language, the S3DB query language (S3QL), to operate on its underlying core model and facilitate management of Linked Data.

Results: Reflecting the data driven nature of our approach, S3QL has been implemented as an application programming interface for S3DB systems hosting biomedical data, and its syntax was subsequently generalized beyond the S3DB core model. This achievement is illustrated with the assembly of an S3QL query to manage entities from the Simple Knowledge Organization System. The illustrative use cases include gastrointestinal clinical trials, genomic characterization of cancer by The Cancer Genome Atlas (TCGA) and molecular epidemiology of infectious diseases.

Conclusions: S3QL was found to provide a convenient mechanism to represent context for interoperation between public and private datasets hosted at biomedical research institutions and linked data formalisms.

Keywords: S3DB, Linked Data, KOS, RDF, SPARQL, knowledge organization system, policy

Background

Knowledge engineering in the Life Sciences is challenged by the combination of high specificity and high heterogeneity of the data needed to represent and understand Biology's systemic puzzles. Despite the deluge of data that has invaded life sciences in the past decade [1], data-driven discovery in Biology is hindered by a lack of enough interlinked information to allow statistical algorithms to find the patterns that inform hypothesis-driven research [2,3]. Life Sciences research relies heavily on bioinformatics integration tools like Ensembl [4], the UCSC genome browser [5], Entrez Gene [6] or the gene ontology [7] because these offer researchers portals to a wealth of interlinked biological annotations within the context of their experimentally derived results, thus playing a lead role in advancing scientific discovery. The amount of time and effort required to develop and maintain such tools has prompted Linked Data approaches for data integration to become increasingly relevant in Health Care and Life Sciences (HCLS) domains [8-10]. Briefly stated, Linked Data can be described as a bottom-up solution for data integration: its focus is on creating a global Web of Data where typed links between data sources provide rich context and expressive reusable queries over aggregated and distributed heterogeneous datasets [8,11-13]. The architecture of that Web is

expected by its original architects [14] to require a representation of usage contexts that can be applied in the collaboration and controlled sharing of data. When this functionality is supported, as that report anticipates, "social machines" will be able to manage the simultaneous and conflicting views of data that fuel scientific debate. The S3DB knowledge organization system was designed to provide baseline support for that bottom-up process by addressing a recurrent need for controlled sharing of HCLS datasets [15,16]. This report describes a convention, the S3QL language, to query and manipulate it. It will also be demonstrated that S3QL provides a convenient mechanism to engage Linked Data in general.

1.1 Linked Data Best Practices

Linked data best practices set the stage for an interlingua of relational data and logic in the web [17] by the definition of core principles that can be summarized as: 1) information resources should be identified with HTTP universal resource identifiers (URIs); 2) information should be served against a URI in a standard semantic web format such as the Resource Description Framework (RDF); and 3) links should be established to information resources elsewhere [10]. For large datasets, it is convenient that a web service supporting SPARQL, the protocol and RDF query language, is also deployed [18]. Aggregation of data sources is available either by accessing metadata about the datasets as RDF [19,20] or through direct aggregation of RDF assertions in a single knowledgebase [21,22]. To ensure contextual consistency and reusability across datasets, data elements and descriptors are mapped using standard vocabularies, namespaces and ontologies [23-25].

1.2 Challenges involved in Publishing Primary Experimental Life Sciences Datasets as Linked Data

The value of linked data for life scientists lies primarily in the possibility to quickly discover information about proteins or genes of interest derived, for example, from a microarray or protein array experiment [26]. Life scientists involved in primary research still face significant challenges in harnessing the power of Linked Data to improve biological discovery. Part of the difficulty lies in the lack of adequate and user-friendly mechanisms to publish biological results as Linked Data prior to publication in scientific articles. Efforts in linking life sciences data typically focus on datasets which are already available in structured and annotated formats, i.e. after the researchers have analysed, correlated and manually annotated their results by browsing the literature or submitting their data to multiple web-based interfaces [27,28]. Current research [29,30] and our own experience in developing content management systems for health care and life sciences [26,31,32] have identified the need to go beyond those data sets by creating mechanisms for contextualizing linked life sciences data with attribution and version before it can be shared with a stable annotation. Advances have been made in that direction by other research efforts, such as the recent publication of VoiD as a W3C note [33].

The technological advancements that will make primary Life Sciences experimental results an integral part of the Web of Data are also thwarted by challenges which go beyond infrastructure and standards [30]. In particular, HCLS datasets often include data elements, such as those that could be used to identify individual patients, with stringent requirements for privacy and protection [34]. The typical approach to privatizing data has been to make it the responsibility of the data providers. Although this may provide a temporary solution for a small number of self-contained datasets, it quickly becomes unmanageable when datasets aggregate both public and sensitive data from multiple sources, each with its own requirements for privacy and access control [35].

One final common concern in Life Sciences is the need to enable data experts to edit and augment the data representation models; failure to support this flexibility has led in the past to misinterpretation of primary experimental data due to absence of critical contextual information [36,37].

1.3 Knowledge organization systems for Linked Data

In order to address the information management needs of Life Scientists, the practice of Linked Data standards must be coupled with the implementation of Knowledge Organization Systems (KOSs), a view also espoused by the W3C, where the Simple Knowledge Organization System (SKOS) has been recently proposed as a standard [38,39]. In previous work we proposed the design principles of a KOS, the Simple Sloppy Semantic Database (S3DB) [15,16,40]. The S3DB core model is, much like SKOS, task-independent and light-weight. Implementation of the S3DB Core Model and operators resulted in a prototype that has been validated and tested by Life Scientists to address pressing data management needs or, in particular, as a controlled Read-Write Linked Data system [31,41-43]. S3DB was shown to include the minimum set of features required to support the management of experimental and analytic results by Life Sciences experts while making use of Linked Data best practices such as HTTP URI, subject-predicate-object triples represented using RDF Schema, links to widely used ontologies suggested by NCBO ontology widgets [44] or new OWL classes created by the users, and a SPARQL endpoint [41]. Although complying with these practices is enough to cover the immediate query or "read" requirements of a Linked Data KOS, we found that data management or "write" operations, such as inserting, updating and deprecating data instances within a KOS, could be more efficiently addressed with the identification of the S3DB Query Language (S3QL), a Domain Specific Language (DSL) devised to abstract most of the details involved in managing interlinked, contextualized RDF statements.

S3QL is not meant as an alternative to SPARQL but rather as a complement: data management operations enabled by S3QL can also be formalized in SPARQL. However, the availability of a data management DSL that can be serialized to SPARQL provides an abstraction layer that can be intuitively used by domain experts. As such, DSLs can provide a solution for bridging the gap between the formalisms required by Linked Data best practices, such as SPARQL and RDFS, and the basic controlled read/write management requirements of HCLS experts [12,45,46]. DSLs optimize beyond general purpose languages in the identification of the domain in which a task belongs, drastically reducing the development time [47]. The task of adding a graph to a triple store is supported by most graph stores by means of the SPARQL 1.1 Update language [48]. To enable controlled "write" operations targeting the dataset, it would be useful to annotate, for example, the creator of a named graph, under which circumstances it was created and who has permission to modify it. Similarly, upon changes to the dataset, annotation of the modifier and a comment describing the change would be in the interest of the communities using the data. Many triple stores are in fact quad stores, to enable partial support of that requirement for contextual representation. The most common approach is to use a named graph, a set of triples identified by a URI [49], that indicates the source of a graph. The S3DB Query Language (S3QL) presented in this report was devised with the intent of automating Linked Data management by creating those contextual descriptors in a single S3QL transaction, including author, creation date and description of the data.

By making use of those contextual descriptors, we propose a method for fine grained permission control in S3QL that relies on s3db:operators [50], a class of functions with states that may be used as the predicate of an RDF triple between a user and a dataset with privacy requirements. These operators, described in [40] and made available for experimentation at [51], operate on the adjacency matrix defined by the nodes and edges of an RDF graph. They can be applied in a variety of scenarios such as optimizing queries or, as is the case with S3QL, to propagate permission assignments. In the latter case, an adjacency matrix includes both the edges between instances of S3DB entities and the transitions of permission on S3DB entities, such as the assertion that a User's permission on a Project propagates to its entities. Accordingly, by defining user permissions as states of s3db:operators, the core model's adjacency matrix is used to propagate the ability to control, view and modify S3DB entities.

We have found the target audience for S3QL to be both life sciences application developers, who use it through a RESTful application programming interface (API), and life sciences researchers, who use it through user interfaces for weaving the ontologies that best represent the critical contextual information in their experimental results. The applicability of S3QL to other linked data KOSs, such as the Simple Knowledge Organization System (SKOS) [39], is explored with an example, and the advantages of the solution proposed are discussed in three biomedical datasets with very different requirements for controlled operations: gastrointestinal clinical trials [42], cancer genomic characterization [41] and molecular epidemiology [52].

Methods

This section overviews the core model for S3DB, including the set of operators that enable fine grained permission control and the distributed infrastructure supporting S3QL. The principles defined here are implemented as a prototypical application available at http://s3db.org.

The S3DB Knowledge Organization Model

S3QL is a DSL to programmatically manipulate data as instances of entities defined in a KOS. One of the key features of the KOS defined using the S3DB core model [16] is the use of typed named graphs to separate the identification of the domain, the metadata describing the data, from its observational instantiation - the data itself. We have previously shown that this approach to representing RDF greatly facilitates the assembly of SPARQL and lowers the entry barrier for biomedical researchers interested in using Semantic Web Technologies to address their data management needs [41]. That separation is achieved by using the representation of domain as triples that are themselves the predicates of the statements that instantiate that domain (as detailed by Figure 2 in [40]). For example, the triple [Person hasAge Age], identified as R12 through a named graph of type s3db:rule, describes the domain, while the triple [John R12 26], identified by a named graph of type s3db:statement, instantiates that domain. Through the logic encoded in the RDF Schema definition of domain (rdfs:domain) and range (rdfs:range), the assertion that "John" is of type "Person" and that "26" is an "Age" is enabled in the S3DB KOS. S3DB's use of named graphs to describe the domain enables updates to the domain without affecting the consistency of its instantiation - in the example above, replacing "hasAge" with "hasAgeInYears" will not affect queries that have already been assembled using that property.

In the S3DB core [40], a meta-model for this data is also created with the specific objective of enabling propagation of operations, such as permission assignments, between the domain description and the data itself, described in the following section (see Figures 1, 2 and 3). In the example above, the two triples are respectively assigned to entities of type s3db:rule and s3db:statement, while "Person" is identified by a named graph of type s3db:collection and "John" is identified by a named graph of type s3db:item. The S3DB Core specifies three other entities which are specifically devised to enable knowledge organization and operator propagation: s3db:project entails a list of s3db:rule and s3db:collection and is typically applied in domain contextualization; s3db:deployment corresponds to the physical location of an S3DB system (its URL); and s3db:user is the subject of permission assignment operations. It is worth noting that, by making use of S3DB entities, blank nodes are avoided by assigning a unique alphanumeric identifier to every instance of an S3DB entity. The S3DB entities can also be identified using the first letter of their names, D, P, R, C, I, S or U, which will be used in subsequent examples to indicate, respectively, s3db:deployment, s3db:project, s3db:rule, s3db:collection, s3db:item, s3db:statement or s3db:user.
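The separation between domain (rules) and instantiation (statements) described above can be illustrated with a minimal Python sketch. The class and function names (`Rule`, `Statement`, `infer_types`) are hypothetical and are not part of the S3DB prototype; only the example triples [Person hasAge Age], R12 and [John R12 26] come from the text.

```python
from dataclasses import dataclass


@dataclass
class Rule:
    """Domain triple (an s3db:rule), e.g. [Person hasAge Age]."""
    rule_id: str
    subject: str    # collection playing the role of rdfs:domain
    predicate: str  # human-readable property label
    obj: str        # collection playing the role of rdfs:range


@dataclass(frozen=True)
class Statement:
    """Instance triple (an s3db:statement); its predicate is a rule identifier."""
    subject: str
    rule_id: str
    value: str


def infer_types(statement, rules):
    """Derive the types of subject and value from the rule's domain and range."""
    rule = rules[statement.rule_id]
    return {statement.subject: rule.subject, statement.value: rule.obj}


rules = {"R12": Rule("R12", "Person", "hasAge", "Age")}
stmt = Statement("John", "R12", "26")

types_before = infer_types(stmt, rules)

# Relabel the domain property; the statement refers to R12, not to the label,
# so it does not need to change.
rules["R12"].predicate = "hasAgeInYears"
types_after = infer_types(stmt, rules)
```

Because the statement points at the rule identifier rather than at the property label, relabelling the rule leaves both the statement and the inferred types unchanged, which is the behaviour the paragraph above attributes to the use of named graphs.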

Operators for Permission Control

The second key feature that makes S3DB appropriate for controlled management operations is support for permission control embedded in its core model. As described above and in [40], the hierarchy of permissions to view/edit entities in S3DB is modelled by an adjacency matrix, which is used as a transition matrix in the propagation of permission states. For a walkthrough of the propagation mechanism, see additional file 1. The s3db:operator states applied to the S3DB transition matrix modulate propagation by three core functions - merge, migrate and percolate. This behaviour for propagation of permission is described in detail in equation 5 of [40] and is reproduced here in Equation 1. The S3DB transition matrix (T) is defined by 12 s3db:relationships describing dependencies and inference rules between entities of the S3DB core model. The operator state vector (f) is used as the predicate of a triple established between an s3db:user and an entity of the S3DB core model. The JavaScript application at [51] can also be used to try out this set of propagation behaviours for s3db:operators, either on the S3DB transition matrix or with alternative adjacency matrixes.

[Figure 1: rail diagrams for the four S3QL operations (select, insert, update and delete), each combining an entity E, an EntityID and EntityAttributeValue (Ea) pairs inside <where>; shaded areas mark the operators View, Change and Use.]

Figure 1 S3QL language specification using rail diagrams. Rail diagrams are read from the left to the right - any string that can be composed following these diagrams is a valid S3QL query. Valid forms of E and Ea will vary according to the Core Model used in the KOS. For example, if the S3DB Core model is used, any entity in figure 2 can be used in place of E; upon choice of E, Ea is any attribute that can be attained by following a line from E.

$$
\begin{aligned}
f_{\mathrm{object},\,k+1} &= \mathrm{merge}\big(\big[\,f_{\mathrm{object},\,k},\ \mathrm{migrate}(T \times f_{\mathrm{subject},\,k})\,\big]\big) \\
l &= \mathrm{length}(f) \\
l = 1 &\rightarrow \mathrm{migrate}(f) = f = f[1] \\
l > 1 &\rightarrow \mathrm{migrate}(f) = f[2{:}l]
\end{aligned}
\qquad (1)
$$
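The migrate step of Equation 1 can be sketched in Python as follows. This is only an illustration under one reading of the equation, namely that a state vector of length 1 is kept as it is and a longer vector loses its first state on each migration; the merge function and the product with the transition matrix T are defined in [40] and are not reproduced here. The function name and the list representation of f are assumptions.

```python
def migrate(f):
    """Migrate an operator state vector f (a list of states), per Equation 1.

    Assumed reading: if f holds a single state it is returned unchanged;
    otherwise the first state is dropped (f[2:l] in 1-based notation).
    """
    if len(f) == 0:
        raise ValueError("state vector must not be empty")
    if len(f) == 1:
        return list(f)
    return list(f[1:])  # 0-based slice equivalent to f[2:l]


# Repeated migration of a three-state vector: the vector shrinks until a
# single state remains, and that state then persists.
steps = [["Y", "S", "N"]]
for _ in range(3):
    steps.append(migrate(steps[-1]))
```

Under this reading, the length l of the vector acts as a memory: the number of migrations that pass before the state stops changing.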

[Figure 2: diagram linking the entities deployment, user, project, collection, rule, item and statement (E) to their attributes (Ea); entity-specific attributes shown include foaf:mbox, s3db:project_id, s3db:collection_id, s3db:subject_id, s3db:predicate_id, s3db:object_id, s3db:object, s3db:item_id, s3db:rule_id and rdf:value.]

Figure 2 Entities in the S3DB Core Model and their attributes. A minimal set of common attributes was defined (left) for each of the S3DB entities using RDF Schema (rdfs) and Dublin core (dc) terminology - these are rdf:id, rdfs:label, dc:description, dc:created and dc:creator. Other attributes, which are specific to each of the S3DB entities (right), such as foaf:mbox for the entity User or s3db:project_id for the entity Collection, reflect the s3db:relationships described above and formalized in the S3DB conceptual model (figure 3).

The s3db:operators [40] have a scope and applicability in linked data beyond permission management. In S3QL we define three operator types for controlled management operations, one for each of the rights to view, change/edit or use instances of S3DB entities. The format used to assign permission was defined as a three character string where each operator occupies, respectively, the first, second or third position and may assume the value N, S or Y according to the level of permission intended: no permission (N), permission limited to the creator of the resource (S) or full permission (Y). For example, the permission assignment "YSN" specifies complete permission to view (Y) the subject entity, partial permission to change it (S) and no permission to use it (N). States may be defined as dominant by use of uppercase (Y, N or S) or recessive by use of lowercase characters (y, n or s). Dominant and recessive permissions are used to decide on the outcome of multiple permissions converging on the same entity (as detailed in [40]). Missing permission states, indicated by the dash character '-' (which has no lower or upper case), are also allowed, as well as a mechanism to succinctly specify transitions with variable memory length (l in equation 1). The propagation of permissions in the S3DB Core Model ensures that, for every entity and every user, two types of permission are defined: the assigned permission, or the permission state assigned directly to a user in an entity, and the effective permission, which is the result of the propagation of s3db:operators.
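The three character permission format can be decoded with a short Python sketch. It covers only the basic three character form (not the variable memory length mechanism), and the names `OperatorState` and `parse_permission`, as well as the level labels, are hypothetical choices made for illustration.

```python
from dataclasses import dataclass
from typing import Optional

RIGHTS = ("view", "change", "use")  # first, second and third positions
LEVELS = {"n": "none", "s": "creator-only", "y": "full"}


@dataclass(frozen=True)
class OperatorState:
    right: str
    level: str                 # "none", "creator-only", "full" or "missing"
    dominant: Optional[bool]   # True: uppercase, False: lowercase, None: '-'


def parse_permission(assignment):
    """Split a three character assignment such as 'YSN' into per-right states."""
    if len(assignment) != 3:
        raise ValueError("this sketch only handles three character assignments")
    states = {}
    for right, ch in zip(RIGHTS, assignment):
        if ch == "-":
            # A missing state has no case, so dominance is undefined.
            states[right] = OperatorState(right, "missing", None)
        elif ch.lower() in LEVELS:
            states[right] = OperatorState(right, LEVELS[ch.lower()], ch.isupper())
        else:
            raise ValueError(f"unknown permission state: {ch!r}")
    return states


ysn = parse_permission("YSN")
mixed = parse_permission("y-n")
```

How dominant and recessive states are resolved when several permissions converge on the same entity is specified in [40] and is deliberately left out of this sketch.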

Figure 3 The S3DB conceptual model. Five attributes (id, label, description, creator and created) and four methods (select, update, insert and delete) are common to all S3DB entities. In the current S3QL implementation, the label and description attributes are defined by the submitter of the data, whereas the id, created and creator attributes are automatically assigned by the system. Other dependencies were devised to comply with the definition of s3db:relationships.
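The split between submitter-defined and system-assigned attributes noted in the caption of Figure 3 can be sketched as below. This is a toy in-memory illustration, not the S3DB implementation: the `Entity` class, the `insert` helper and the counter used to produce identifiers are all assumptions.

```python
from dataclasses import dataclass, field
from datetime import datetime, timezone
from itertools import count

# Hypothetical stand-in for the system's identifier allocation.
_next_number = count(1)


@dataclass
class Entity:
    kind: str         # first letter of the entity type: D, U, P, C, R, I or S
    label: str        # defined by the submitter
    description: str  # defined by the submitter
    creator: str      # taken from the authenticated user, not from the query
    id: str = field(init=False)            # assigned by the system
    created: datetime = field(init=False)  # assigned by the system

    def __post_init__(self):
        self.id = f"{self.kind}{next(_next_number)}"
        self.created = datetime.now(timezone.utc)


def insert(kind, label, description, authenticated_user):
    """Create an entity; the caller supplies only label and description."""
    return Entity(kind, label, description, creator=authenticated_user)


project = insert("P", "Test", "an example project", authenticated_user="U1")
```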


Components of a Distributed System

One of the requirements for RDF-based knowledge management ecosystems is the availability of queries spanning across multiple SPARQL endpoints. Automation of distributed queries in systems supporting permission control, such as S3DB, is challenged by user authentication. In S3QL we propose addressing this through delegation to authentication authorities. As a result, a user (or usage) can be identified by a URI that is independent of the authorities that validate it. Whenever possible, it is recommended that authentication credentials be protected by use of OAuth [53].

Use of URIs and Internationalized Resource Identifiers (IRIs) to identify data elements is one of the core principles of Linked Data. However, many programming environments cannot easily handle URIs as element identifiers. Problems range from decreased processing speed to a need for encoding the URIs in web service exchanges. In anticipation of that class of problems, the URIs for entities in S3DB are interchangeable with alphanumeric identifiers formulated as the concatenation of one of D, U, P, C, R, I or S (referring to the S3DB entities described in The S3DB Knowledge Organization Model), identifying the entity, and a unique number. As an example, for a deployment located at URL http://q.s3db.org/demo, the alphanumeric P126 is resolvable to an entity of type Project with URI http://q.s3db.org/s3dbdemo/P126. To facilitate exchange of URIs in distinct deployments, the URI above could also be specified as D282P126, where "D282" is the alphanumeric identifier of the S3DB deployment located at URL http://q.s3db.org/demo. Every s3db:deployment is identified by a named graph in the form D[number]; for completeness, metadata pertaining to each s3db:deployment, such as the corresponding URL, is described using the vocabulary of interlinked datasets (VoiD) [19] and shared through a root location.
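The alphanumeric identifier scheme lends itself to a small parsing sketch in Python. The functions `parse_identifier` and `to_uri` are hypothetical, as is the `deployments` dictionary, which merely stands in for the deployment metadata shared through the root location; the base URLs used are placeholders rather than real S3DB endpoints.

```python
import re

ENTITY_TYPES = {
    "D": "Deployment", "U": "User", "P": "Project", "C": "Collection",
    "R": "Rule", "I": "Item", "S": "Statement",
}

# Optional deployment prefix (D + number), then entity letter + number.
_ID = re.compile(r"^(?:D(?P<deployment>\d+))?(?P<letter>[DUPCRIS])(?P<number>\d+)$")


def parse_identifier(identifier):
    """Return (deployment id or None, entity type, number) for e.g. 'D282P126'."""
    m = _ID.match(identifier)
    if m is None:
        raise ValueError(f"not an S3DB alphanumeric identifier: {identifier!r}")
    deployment = f"D{m['deployment']}" if m["deployment"] else None
    return deployment, ENTITY_TYPES[m["letter"]], int(m["number"])


def to_uri(identifier, deployments, default_base):
    """Resolve an identifier to a URI using a deployment-to-base-URL mapping."""
    deployment, _, _ = parse_identifier(identifier)
    local = identifier[len(deployment):] if deployment else identifier
    base = deployments[deployment] if deployment else default_base
    return base.rstrip("/") + "/" + local


# Placeholder mapping; in S3DB this information would come from VoiD metadata.
deployments = {"D282": "http://example.org/demo"}
uri = to_uri("D282P126", deployments, default_base="http://example.org/local")
```

A bare identifier such as P126 is resolved against the local deployment, while a prefixed one such as D282P126 carries enough information to be resolved from any deployment that knows the mapping.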

Availability and Documentation

The specification of the S3QL language has been made available at http://link.s3db.org/specs and one example of the output RDF is available at http://link.s3db.org/example. S3QL has been implemented through a REST application programming interface (API) for the S3DB prototype, which is publicly available at http://s3db.org. Both the prototype and its API were developed in PHP, with MySQL or PostgreSQL for data storage. Documentation about the S3DB implementation of S3QL as an API can be found at http://link.s3db.org/docs. S3QL queries may be tested at the demo implementation at http://link.s3db.org/s3qldemo and a translator for the compact notation is available at http://link.s3db.org/translate.

Results

S3QL Syntax

S3QL is a domain specific language devised to facilitate management operations such as "insert", "update" or "delete" on entities of a Linked Data KOS such as the S3DB core model described above. Its syntax, however, is only loosely tied to the S3DB Core Model and can easily be applied to a set of KOSs' core models, of which S3DB is one. The complete syntax of S3QL in its XML (eXtensible Markup Language) flavour is represented in the railroad diagram of Figure 1. The S3QL syntax includes three elements: the description of the operation, the target entity and the input parameters. Four basic operation descriptions were deemed necessary to fully support read/write operations: select, insert, update and delete. The actions of these operations mimic those of the structured query language (SQL) and target instances of entities (E) defined in the core model. Input parameters include the set of attributes defined for each of the entities, either in the alphanumeric form associated with entity instances (EntityId) or in the form of EntityAttributeValue (Ea). The values for Ea are determined upon choice of E; for an example using S3DB entities and attributes, see Figure 2. E may be replaced with any of the entities defined in the S3DB Core Model (Deployment, User, Project, Collection, Rule, Item or Statement); upon choice of E, valid forms of Ea include any of the attributes defined for E (e.g. rdf:id, rdfs:label, rdfs:comment). A table summarizing all available operations, targets and input parameters is made available at http://link.s3db.org/specs. The formal S3QL syntax is completed by enclosing the outcome of one of the diagrams in Figure 1 with the <S3QL> tag. For example, the following XML structure is a valid S3QL query for an operation of type insert, where the target is the S3DB entity Project and the input parameter, formulated as EntityAttributeValue, is "label = Test":

<S3QL><insert>project</insert><where><label>Test</label></where></S3QL>

The set of 12 s3db:relationships (see Table 1 of [40]) in the S3DB Model determines the organizational dependencies of S3DB entities. For example, s3db:PC is the s3db:relationship that specifies a dependency between an instance of a Collection (C) and an instance of a Project (P) (Figure 3). The S3QL syntax fulfils this constraint by assigning project_id (the identifier of an S3DB Project) as an attribute of a Collection. In this description of attributes associated with the S3DB core model we assume, as in other KOSs and in Linked Data in general, that there is no restriction on adding relationships beyond those described here. S3QL was identified as the minimal representation needed to interoperate with the S3DB core model, and therefore only those relationships are explored in this report.

The syntax diagram in Figure 1 generates XML, a standard widely used in web service implementations. That alternative often results in verbose queries that could easily be assembled from more compact notations. One example to consider is the form action(E | Ea = value). Here the symbol '|' should be interpreted, as in Bayesian inference, as a condition and be read "given that". The letter "E" corresponds to the first letter of any S3DB entity (D, P, R, C, I, S or U) and Ea is any of its attributes, as described in Figure 2. In this notation, the query insert(P | label = test) is equivalent to the example query above. That variant is also accepted by the S3DB prototype, and a converter from this syntax into complete S3QL/XML syntax was made available at http://link.s3db.org/translate. For further compactness of this alternative formulation, entity identifiers used as parameters may be replaced with their corresponding alphanumeric identifiers; for example, project_id = 156 may be replaced with P156. This alternative notation will be used in the subsequent examples.
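The correspondence between the compact notation and the XML form can be sketched in a few lines of Python. This is an illustrative sketch only, not the converter published at the translate URL: the function name `compact_to_xml` is hypothetical, the lowercase tag names for entities other than `project` are assumed by analogy with the single XML example, and only one `Ea = value` parameter is handled.

```python
import re
from xml.sax.saxutils import escape

# Entity letters listed in the text. Only "project" appears as an XML tag in
# the example query; the other lowercase names are assumed by analogy.
ENTITIES = {
    "D": "deployment", "U": "user", "P": "project", "C": "collection",
    "R": "rule", "I": "item", "S": "statement",
}
ACTIONS = {"select", "insert", "update", "delete"}

# action(E | Ea = value), restricted here to a single attribute/value pair
COMPACT = re.compile(r"\s*(\w+)\s*\(\s*([A-Z])\s*\|\s*(\w+)\s*=\s*(.+?)\s*\)\s*")


def compact_to_xml(query: str) -> str:
    """Rewrite a compact query such as insert(P | label = test) as S3QL/XML."""
    match = COMPACT.fullmatch(query)
    if match is None:
        raise ValueError(f"not a compact S3QL query: {query!r}")
    action, entity, attribute, value = match.groups()
    if action not in ACTIONS:
        raise ValueError(f"unknown action: {action}")
    if entity not in ENTITIES:
        raise ValueError(f"unknown entity: {entity}")
    return (
        f"<S3QL><{action}>{ENTITIES[entity]}</{action}>"
        f"<where><{attribute}>{escape(value)}</{attribute}></where></S3QL>"
    )


print(compact_to_xml("insert(P | label = test)"))
```

For the query used in the text, this prints `<S3QL><insert>project</insert><where><label>test</label></where></S3QL>`.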

S3QL Permission Control

Permission states are assigned using an S3QL query such as insert(U | U1, P157, permission_level = ysn), which includes the action insert, the target entity User and three input parameters: the identifier of the User (U1), the identifier of the entity (P157) and the permission assignment (ysn). Effectively, this results in the creation of the triple [U1 ysn P157], where the subject is of type s3db:user, the predicate is of type s3db:operator and the object is of type s3db:project. The inclusion of this triple in a dataset modulates the type of management operation that a user may perform. As described in the Methods section, each position in the permission assignment operator (ysn) encodes, respectively, permission to "view", "change" or "use" the object entity. In this example, the values y, s and n indicate, respectively, that the user has full permission to view the entity (y), permission to change its metadata only if the user created the entity (s) and no permission to insert child entities (n). Each S3QL operation is therefore tightly woven to one of the three operators: select is controlled by "view", update and delete are controlled by "change", and insert is controlled by "use" (shaded areas in Figure 1). The "use" operator encodes the ability of a user to create new relationships with the target entity, which is defined separately from the right to "change" it. For example, if a user (U1) is granted "y" as the effective permission to "change" an s3db:rule, then the metadata describing it may be altered. If, however, that same user is granted permission "n" to "use" that same Rule, the user is prevented from creating Statements using that Rule. Although "use" may be interpreted as being equivalent to "insert" or "append" in other systems, we have chosen to separate the term describing the operator "use" from the S3QL action "insert". The permission assigned at the dataset level then propagates in the S3DB transition matrix following the behaviour formalized in equation 1, thereby avoiding the need to assign permission to every user on every entity. It is worth noting that the DSL presented here is extensible beyond the 4 management actions (select, insert, update, delete) described. The s3db:operators that control permission on these actions are also extensible beyond "view", "change" and "use", and different implementations may support alternative states.

The permission control behaviour for S3QL operations can be illustrated through the use of Quadratus, an application available at http://q.s3db.org/quadratus that can be pointed at any S3DB deployment to assign permission states on S3DB entities to different users (Figure 4). Other use case scenarios are also explored in the S3QL specification at http://link.s3db.org/specs.
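The mapping between S3QL actions and the three operator positions described above can be made concrete with a small check. The sketch below is not taken from the S3DB code base: `is_allowed` and its `is_creator` argument are hypothetical names, and treating uppercase and lowercase states as granting the same rights is an assumption made for brevity.

```python
# Operator positions in a permission string, in the order given in the text.
OPERATORS = ("view", "change", "use")

# Which operator governs each S3QL action, as stated in the text.
OPERATOR_FOR_ACTION = {
    "select": "view",
    "update": "change",
    "delete": "change",
    "insert": "use",
}


def is_allowed(action: str, permission: str, is_creator: bool = False) -> bool:
    """Decide whether an action is permitted under an effective permission."""
    if len(permission) != len(OPERATORS):
        raise ValueError(f"expected three states, got {permission!r}")
    position = OPERATORS.index(OPERATOR_FOR_ACTION[action])
    # Assumption: case only affects propagation, not the right itself.
    state = permission[position].lower()
    if state == "y":
        return True           # full permission
    if state == "s":
        return is_creator     # only the creator of the entity
    if state == "n":
        return False          # no permission
    raise ValueError(f"unknown permission state: {state!r}")


# With "ysn": anyone may select, only the creator may update or delete,
# and nobody may insert child entities.
print(is_allowed("select", "ysn"))
print(is_allowed("update", "ysn", is_creator=True))
print(is_allowed("insert", "ysn"))
```

Keeping the action-to-operator table separate from the state check mirrors the point made in the text that both the set of actions and the set of operators are extensible.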

Global Dereferencing for Distributed Queries

A simple dereferencing system was devised for S3DB identifiers that relies on the identification of root deployments, i.e. S3DB systems where alphanumeric identifiers for S3DB deployments can be dereferenced to a URL. This simple mechanism enables complex transactions of controlled data. For an example of this behaviour see Figure 5, where the S3DB UID D327R172930, identifying an entity of type Rule (R172930) available in deployment D327, is being requested by a user registered in deployment D309. In order to retrieve the requested data, the URL of deployment D327 must first be resolved at a root deployment such as, for example, http://root.s3db.org.

The dereferencing mechanism is also applicable in more complex cases where the root of two deployments sharing data is not the same. Prepending the deployment identifier of the root to the UID, as in D1016666D327R172930, where D1016666 identifies the root deployment, would result in recursive URL resolution steps such as select(D.url | D1016666) prior to step 4 in Figure 5. This mechanism avoids broken links when S3DB deployments are moved to different URLs by enabling deployment metadata to be updated securely at the root using a public/private key encryption system.
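The structure of these identifiers lends itself to a short illustration. The following sketch assumes that a UID is a concatenation of letter-plus-number tokens in which every token but the last names a deployment, outermost root first; the function names are hypothetical and the queries are only assembled as strings, not sent anywhere.

```python
import re


def parse_uid(uid: str):
    """Split a UID such as D327R172930 into deployment ids and an entity id."""
    tokens = re.findall(r"[A-Z]\d+", uid)
    if not tokens or "".join(tokens) != uid:
        raise ValueError(f"malformed UID: {uid!r}")
    *deployments, entity = tokens
    if any(not token.startswith("D") for token in deployments):
        raise ValueError(f"only deployments may prefix an entity: {uid!r}")
    return deployments, entity


def resolution_queries(uid: str):
    """List the URL lookups needed before the entity itself can be requested."""
    deployments, _ = parse_uid(uid)
    return [f"select(D.url|{deployment})" for deployment in deployments]


print(parse_uid("D327R172930"))
print(resolution_queries("D1016666D327R172930"))
```

For the two UIDs quoted in the text this yields one lookup for D327 in the simple case, and a lookup for D1016666 followed by one for D327 when the root is prepended.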

Implementation and benchmarking

In the current prototype implementation, S3QL is submitted to S3DB deployments using either a GET or POST request and may include an optional authentication token (key). The REST specification [54] suggests separate HTTP methods according to the intention of the operation: often "GET" is used to retrieve data, "PUT" is used to submit data, "POST" is used to update data and "DELETE" is used to remove data. There are, however, many programming environments that implement only the REST "GET" method, including many popular computational statistics programming frameworks such as R and Matlab. Therefore, in order to fully explore the integrative potential of this read/write semantic web service, and to support operations beyond the 4 implemented, the S3DB prototype implementation of S3QL supports the "GET" method for all S3QL operations, with the parameters of the S3QL call appended to the URL. One drawback of relying on GET is the limit imposed by the browser on URL length. To address this potential problem, the S3DB prototype also supports the use of "POST" for S3QL calls.
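The choice between GET and POST described above can be sketched as follows. Everything specific in this example is an assumption: the endpoint URL is a placeholder, the parameter names `query` and `key` are guesses based on the translate URL and the "(key)" token mentioned in the text, and the 2000-character threshold merely stands in for whatever limit a given client imposes.

```python
from urllib.parse import urlencode

# Assumed limit; actual URL length limits vary between clients and servers.
MAX_URL_LENGTH = 2000


def build_request(endpoint: str, s3ql: str, key=None):
    """Return (method, url, body) for submitting an S3QL call."""
    params = {"query": s3ql}          # parameter name is an assumption
    if key is not None:
        params["key"] = key           # optional authentication token
    encoded = urlencode(params)
    url = f"{endpoint}?{encoded}"
    if len(url) <= MAX_URL_LENGTH:
        # Short calls travel entirely in the URL, so GET-only clients can use them.
        return "GET", url, None
    # Long calls fall back to POST with the parameters in the request body.
    return "POST", endpoint, encoded.encode("ascii")


# Hypothetical endpoint, used only to show the shape of the result.
method, url, body = build_request(
    "http://example.org/S3QL.php", "select(P|label=test)", key="abc"
)
print(method, url)
```

The returned tuple can be handed to any HTTP client; nothing is sent by this sketch itself.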

Figure 4 Quadratus, an interface to illustrate S3QL's permission control mechanism and its effect on S3QL queries. Projects, Collections, Rules, Items and Statements associated with the GI Clinical Trials Project are displayed with effective and assigned permissions. Collections and Rules retrieved using the S3QL queries select(C|P1196457) and select(R|P1196457) inherit the permission assignment in the Project, "ysn"; the assigned permission "N--" in Collection "Demographics" results in the effective permission "Nsn", inherited by all Rules and Items that have a relationship with that Collection, effectively preventing gi_user1 from accessing its data. The directed labelled graph of the propagation resolution is displayed on the right side of the application, illustrating the propagation mechanism.

Two further challenges needed to be addressed in the prototypical implementation of S3QL: 1) the need for a centralized root location to support dereferencing of deployment URIs when the condensed version is used (e.g. D282), and 2) distributed queries on REST systems required users to authenticate in multiple KOSs. The first challenge was addressed by configuring an S3DB deployment as the root location, available at http://root.s3db.org. Deployment metadata is submitted to this root deployment at configuration time using S3QL; data pertaining to each deployment can therefore be dereferenced to a URL using http://root.s3db.org/D[numeric]. A root deployment other than the default may be chosen during installation. To avoid overloading these root deployments with too many requests, a local 24 hour cache of all accessed deployments is kept in each S3DB deployment using the same strategy; if the URL is cached, it will not be requested from the root deployment. To address the second challenge, each s3db:deployment can store any number of authentication services supporting the HTTP, FTP or LDAP protocols. Once the user is authenticated, temporary surrogate tokens are issued with each query. When coupled with the user identifier in the format of a URI, these tokens effectively identify the user performing the query regardless of the S3DB deployment where the query is requested.

Screencasts illustrating processing time of data manipulation using S3QL are available at http://www.youtube.com/watch?v=2KZC6kI609s and http://www.youtube.com/watch?v=FJSYLCwBaPI.
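The 24 hour cache of deployment URLs mentioned in this section amounts to a time-limited lookup table placed in front of the root deployment. A minimal sketch is given below, assuming a caller-supplied `resolve` function that performs the actual query at the root; the class name, its arguments and the example URL are hypothetical and do not describe the S3DB prototype's internals.

```python
import time


class DeploymentUrlCache:
    """Remember deployment URLs for a limited time to spare the root deployment."""

    def __init__(self, resolve, ttl_seconds=24 * 3600, clock=time.time):
        self._resolve = resolve      # function: deployment id -> URL (asks the root)
        self._ttl = ttl_seconds      # 24 hours, as stated in the text
        self._clock = clock          # injectable so the cache can be tested
        self._entries = {}           # deployment id -> (url, time stored)

    def url_for(self, deployment_id: str) -> str:
        now = self._clock()
        entry = self._entries.get(deployment_id)
        if entry is not None and now - entry[1] < self._ttl:
            return entry[0]          # fresh entry: the root is not contacted
        url = self._resolve(deployment_id)
        self._entries[deployment_id] = (url, now)
        return url


# Stand-in resolver that records how often the root would have been contacted.
calls = []

def fake_root(deployment_id):
    calls.append(deployment_id)
    return f"http://example.org/{deployment_id}"

cache = DeploymentUrlCache(fake_root)
cache.url_for("D327")
cache.url_for("D327")
print(len(calls))
```

In this run the second lookup is served from the cache, so the stand-in root is contacted only once.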

Discussion

One of the major concerns in making use of Linked Data to improve health care and life sciences research is the need to ensure both the availability of contextual information about experimental datasets and the ability to protect the privacy of certain data elements which may identify an individual patient. Domain specific languages (DSLs) can ease the task of managing the contextual descriptors that would be necessary to implement permission control in RDF and, by doing so, could greatly accelerate the rate of adoption of Linked Data formalisms in the life sciences communities to improve scientific discovery. We have described S3QL, a DSL to perform read/write operations on entities of the S3DB Core Model. S3QL attempts to address the requirements of linking life sciences datasets that include both publishable and un-publishable data elements by 1) including contextual descriptors for every submitted data element and 2) making use of those descriptors to ensure permission control managed by the data experts themselves. This avoids the need to break a consolidated dataset into its public and private parts when the results are acceptable for publication.

Applying S3QL to the S3DB Core Model in a prototypical application benefited from the definition of loosely defined boundaries for RDF data, which enabled propagation of permission while avoiding the need to document a relationship for each data instance individually and for each user. The assembly of SPARQL queries is also facilitated by the identification of domain triples, using named graphs, from the data itself [41], and can be illustrated in the application at [55], where a subset of S3QL can be readily converted into the W3C standard SPARQL. Although the prototypical implementation of S3QL fits the definition of an API for S3DB systems, it is immediately apparent that the same notation could be easily and intuitively extended to other KOSs' core models. For example, pointing the tool at http://q.s3db.org/translate to a JavaScript Object Notation (JSON) representation of the SKOS core model (skos.js) instead of the default S3DB core model (s3db.js) results in a valid S3QL syntax that could easily be applied as an API for SKOS based systems. This is illustrated by applying the example query "select(C|prefLabel = animal)" using http://link.s3db.org/translate?core=skos.js&query=select(C|prefLabel=animal) to retrieve SKOS concepts labelled "animal". Adoption by life sciences application developers can be further smoothed by complete reliance on the REST protocol for data exchange and the availability of widely used formats such as JSON, XML or RDF/turtle.

Figure 5 The global S3QL dereferencing system. User U78 of deployment D309 issues a command to request all entities of type Statement where the attribute Rule_id corresponds to the value D327R172930, through the S3QL operation select(S.value|D327R172930) (1). If the URL corresponding to deployment D327 is not cached locally, or has not been validated in the past 24 h, a query is issued and executed at the root deployment to retrieve the corresponding URL: select(D.url|D327) (2, 3). Once the URL is returned, query (1) is re-issued as select(S.value|R172930) and executed at the URL for D327 (4). To validate the user, deployment D327 issues the command select(U.id|U76) at D309 using the key provided (5, 6) and returns the data only if U.id matches the value for user_id (7).

The applicability of S3QL to life sciences domains is illustrated here with three case studies: 1) in the domain of clinical trials, a project [42] that requires collaboration between departments with different interests; 2) in the domain of The Cancer Genome Atlas (TCGA) project, a multi-institutional effort that requires multiple authentication mechanisms and sharing of data among multiple institutions [41]; and 3) in the domain of molecular epidemiology, a project where non-public data stored in an S3DB deployment needs to be statistically integrated with data from a public repository. All described use cases shared one considerable requirement: the ability to include in the same dataset both published and unpublished results. As such, they required both the annotation of contextual descriptors of the data, enabled by S3QL, and the availability of controlled permission propagation, enabled by the S3DB model transition matrix. Future work in this effort may include the application of S3QL in a Knowledge Organization System based on SKOS terminology and the definition of a transition matrix for SKOS to enable controlled permission propagation.
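The idea that the same compact notation can be validated against different core models, as in the SKOS example discussed in this section, can be sketched by treating the core model as data. The dictionaries below are hypothetical and deliberately tiny: they list only entities and attributes that appear in queries quoted in the text and do not reproduce the contents of s3db.js or skos.js.

```python
import re

# Hypothetical, heavily trimmed core models: entity letter -> known attributes.
CORE_MODELS = {
    "s3db": {"P": {"label"}, "C": {"project_id"}},
    "skos": {"C": {"prefLabel"}},   # here C stands for a SKOS Concept
}

COMPACT = re.compile(
    r"\s*(select|insert|update|delete)\s*\(\s*([A-Z])\s*\|\s*(\w+)\s*=.*\)\s*"
)


def is_valid(query: str, core: str) -> bool:
    """Check that the entity and attribute of a query exist in a core model."""
    match = COMPACT.fullmatch(query)
    if match is None:
        return False
    _, entity, attribute = match.groups()
    return attribute in CORE_MODELS[core].get(entity, set())


print(is_valid("select(C|prefLabel = animal)", "skos"))
print(is_valid("select(C|prefLabel = animal)", "s3db"))
```

Under these assumed models the query from the text is accepted against the SKOS model and rejected against the S3DB one, because the parser is unchanged and only the table of entities and attributes differs.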

Gastro-intestinal Clinical Trials Use Case

As part of a collaboration with the Department of Gastrointestinal Medical Oncology at The University of Texas M. D. Anderson Cancer Center, an S3DB deployment was configured to host data from gastro-intestinal (GI) clinical trials. A schema was developed using S3DB Collections and Rules, and S3QL insert queries were used to submit data elements as Items and Statements (see Figure 6). Two permission propagation examples are illustrated, one of a restrictive nature and the other permissive, which can also be explored at http://q.s3db.org/quadratus (Figure 4). In this example, the simple mechanisms of propagation defined for S3QL support the complex social interaction that requires a fraction of the dataset to be shared with certain users but not with others. Contextual usage is therefore a function of the attributes of the data itself (e.g. its creator) and the user identification token that is submitted with every read/write operation.
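The restrictive example can be reproduced with a very small position-wise merge. This sketch covers only the behaviour visible in the figure captions, where an assigned state replaces the inherited one and "-" means that nothing was assigned for that position; it does not implement the propagation rules formalized in equation 1, and the function name and the "--y" assignment used for the permissive case are assumptions.

```python
def merge(inherited: str, assigned: str) -> str:
    """Combine an inherited permission with one assigned on a child entity."""
    if len(inherited) != 3 or len(assigned) != 3:
        raise ValueError("permissions are three characters: view, change, use")
    # "-" leaves the inherited state untouched; any other state replaces it.
    return "".join(
        parent if child == "-" else child
        for parent, child in zip(inherited, assigned)
    )


# Restrictive case from the Quadratus example: Project "ysn", Collection "N--".
print(merge("ysn", "N--"))

# Permissive case: granting "use" on a Collection; the "--y" string is assumed.
print(merge("ysn", "--y"))
```

The first call reproduces the effective permission "Nsn" quoted for the Demographics Collection.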

The Cancer Genome Atlas Use Case

The Cancer Genome Atlas (TCGA) is a pilot project to characterize several types of cancer by sequencing and genetically characterizing tumours for over 500 patients throughout multiple institutions. S3QL was used in this case study to produce an infrastructure that exposes the public portion of the TCGA datasets as a SPARQL endpoint [41]. This was possible because SPARQL is entailed by S3QL, but not the opposite. Specifically, SPARQL queries can be serialized to S3QL, but the opposite is not always possible, particularly as regards write and access control operations. The structure of the S3DB Core Model, which explicitly distinguishes domain from instantiation, enables SPARQL query patterns such as "?Patient :R390 ?cancerType" to be readily serialized into their S3QL equivalent, select(S|R390). Although this will not be further explored in this discussion, it is worth noting that the availability of this serialization allows for an intuitive syntax of SPARQL queries by patterning them on the description of the user-defined domain Rule, such as "Patient hasCancerType cancerType".
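The serialization of a single triple pattern into S3QL can be illustrated as follows. This is a deliberately narrow sketch rather than the serialization engine cited in the text: it assumes a whitespace-separated pattern whose predicate ends in a Rule identifier, it ignores the subject and object entirely, and the function name is hypothetical.

```python
import re


def triple_pattern_to_s3ql(pattern: str) -> str:
    """Serialize a pattern such as '?Patient :R390 ?cancerType' to S3QL."""
    parts = pattern.split()
    if len(parts) != 3:
        raise ValueError(f"expected subject, predicate and object: {pattern!r}")
    _, predicate, _ = parts
    # Keep only the local name, so ':R390' and 's3db:R390' both give 'R390'.
    rule = predicate.rsplit(":", 1)[-1]
    if re.fullmatch(r"R\d+", rule) is None:
        raise ValueError(f"predicate is not a Rule identifier: {predicate!r}")
    # Statements are the instances of a Rule, hence the target entity S.
    return f"select(S|{rule})"


print(triple_pattern_to_s3ql("?Patient :R390 ?cancerType"))
```

For the pattern quoted in the text this prints `select(S|R390)`.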

A Molecular Epidemiology Use Case

In this example, SPARQL was serialized to S3QL to support a computational statistics application. As a first step, an S3DB data store was deployed using S3QL to manage molecular epidemiology data related to strains of Staphylococcus aureus bacteria collected at the Instituto de Tecnologia Química e Biológica (ITQB) in Portugal. Specifically, the ITQB Staphylococcus reference database was devised with the purpose of managing Multilocus Sequence Typing (MLST) data; MLST is a typing method used to track the molecular epidemiology of infectious bacteria [54-57]. As a second step, we downloaded the public Staphylococcus aureus MLST profiles database at http://www.mlst.net and made it available through a SPARQL endpoint at http://q.s3db.org/mlst_sparql/endpoint.php. The process of integration of MLST profiles from the ITQB Staphylococcus database with the publicly available MLST profiles is illustrated in Figure 7. In this example, a federated SPARQL query is assembled to access both MLST sources: data stored in the S3DB deployment is retrieved by serializing the SPARQL query into S3QL and providing an authentication token to identify the user, as described in Figure 3. The assembled graph resulting from the federated SPARQL query (see additional file 2) can be imported into a statistical computing environment such as Matlab (Mathworks Inc). Using this methodology, it was possible to cluster strains from two different data sources with very different authentication requirements. The observation that some Portuguese strains (PT1, PT2, PT15 and PT21) that are not publicly shared cluster together with a group of public UK strains (UK17, UK16, UK11, UK14, UK15, UK13, UK12), and therefore may share a common ancestor, is enabled by the data integrated through S3QL.

Conclusions

Life sciences applications are set to greatly benefit from coupling Semantic Web Linked Data standards and KOSs. In the current report we illustrate data models from life sciences domains woven using the S3DB knowledge organization system. In line with the requirements for the emergence of evolvable "social machines", different perspectives on the data are made possible by a permission propagation mechanism controlled by contextual attributes of data elements, such as their creator. The operation of the S3DB KOS is mediated by the S3QL protocol described in this report, which exposes its Application Programming Interface for viewing, inserting, updating and removing data elements. Because S3QL is implemented with a distributed architecture, where URIs can be dereferenced into multiple S3DB deployments, domain experts can share data on their own deployments with users of other systems without the need for local accounts. Therefore S3QL's fine grained permission control, defined as instances of s3db:operators, enables domain experts to clearly specify the degree of permission that a user should have on a resource and how that permission should propagate in a distributed infrastructure. This is in contrast to the conventional approach of delegating permission management to the point of access. In the current SPARQL specification extension, various data sources can be queried simultaneously or sequentially. There is still no accepted convention for tying a query pattern to an authenticated user, probably because SPARQL engines would have no use for that information, as most have been created in the context of Linked Open Data efforts. The critical limitation in applying this solution to Health Care and Life Sciences lies in the ability to make use of contextual information both to determine the level of trust in the data and to enable controlled access to elements in a dataset without breaking it up and storing it in multiple systems. To address this requirement, S3QL was fitted with distributed control operational features that follow design criteria found desirable for biomedical applications. S3QL is not unique in its class; for example, the Linked Data API, which is being used by data.gov.uk, is an alternative DSL to manage linked data [58]. However, we believe that S3QL is closer to the technologies currently used by application developers and therefore may provide a more suitable middle layer between linked data formalisms and application development. It is argued that these features may assist and anticipate future extensions of semantic web provenance control formalisms.

Figure 6 Two use cases of permission propagation. Two users are granted full permission to view the GI Clinical Trials Project ("y"); however, neither of them can add new data ("n") nor edit existing data unless they were its original creator ("s"). One of the users (giuser1) is prevented from accessing any data with demographic elements such as, for example, the names of the Patients. In this case, an uppercase "N" is assigned to the right to view the Collection DemographicData, which will be merged with the inherited "y" to produce an effective permission level of "N" for the right to view (merge 1). For the second user (giuser2), permission is granted to use the Collection TissueData, indicating that the user can create new instances in that Collection (merge 2).
[Diagram labels: giuser1; giuser2; gimanager; GI Clinical Trials project; Patient collection; Demographic data collection; Tissue collection; Patient Name statements; Tissue analysis statements; Assign permission; merge 1 (restrictive); merge 2 (permissive)]

Additional material

Additional file 1: Walkthrough of S3DB's permission propagation: merge, migrate and percolate.

Additional file 2: Matrix of MLST profiles in Portugal and the United Kingdom.

Acknowledgements and funding

This work was supported in part by the Center for Clinical and Translational Sciences of the University of Alabama at Birmingham under contract no. 5UL1 RR025777-03 from NIH National Center for Research Resources, by the National Cancer Institute grant 1U24CA143883-01, by the European Union FP7 PNEUMOPATH project award and by the Portuguese Science and Technology foundation under contracts PTDC/EIA-EIA/105245/2008 and PTDC/EEA-ACR/69530/2006. HFD also thankfully acknowledges a PhD fellowship from the same foundation, award SFRH/BD/45963/2008, and the Science Foundation Ireland project Lion under Grant No. SFI/02/CE1/I131. The authors would also like to thank three anonymous reviewers whose comments and advice were extremely valuable for improving the manuscript.

Author details

1 Digital Enterprise Research Institute, National University of Ireland at Galway, IDA Business Park, Lower Dangan, Galway, Ireland.
2 Biomathematics, Instituto de Tecnologia Química e Biológica, Universidade Nova de Lisboa, Av. da República, Estação Agronómica Nacional, 2780-157 Oeiras, Portugal.
3 Laboratório Nacional de Computação Científica, Av. Getúlio Vargas 333, Quitandinha, 25651-075 Petrópolis, Brasil.
4 Sanofi Pasteur, 38 Sidney Street, Cambridge, MA 02139, USA.
5 Laboratory of Molecular Genetics, Instituto de Tecnologia Química e Biológica, Universidade Nova de Lisboa, Av. da República, Estação Agronómica Nacional, 2780-157 Oeiras, Portugal.
6 Research Center for Intelligent Media, Furtwangen University, Furtwangen, Germany.
7 Laboratory of Microbiology, The Rockefeller University, 10021 New York, USA.
8 Division of Informatics, Department of Pathology, University of Alabama at Birmingham, 619 South 19th Street, Birmingham, Alabama, USA.

Authors' contributions

HFD developed the S3QL domain specific language, implemented it in the S3DB prototype, validated it with examples and wrote the manuscript. MCC, RS, MM, HL, RF, WM and JSA tested and validated the language with examples and made suggestions which led to its improvement. All authors read and approved the final manuscript.

Figure 7 Workflow for the creation of a hierarchical cluster of MLST profiles from strains collected in Portugal and in the UK. Two sources are chosen to perform the MLST profile assembly: an S3DB deployment, which holds the ITQB Staphylococcus database, and the SPARQL endpoint configured to host the Staphylococcus aureus MLST database. Although data in the MLST SPARQL endpoint is publicly available, to access data in S3DB the user needs to provide an authentication token and a user_id, as well as assemble the S3QL queries to retrieve the data [select(S|R172930) and select(S|R167271)], which may also be formulated as SPARQL. A data structure is assembled that can finally be analysed using hierarchical clustering methods such as the ones made available by Matlab. The complete MLST dataset used to obtain the graph is provided in additional file 2.
[Diagram labels: Strains from Portugal; Strains from UK; SPARQL; S3DB system; Triple stores; S3QL + authentication; Statistical computing environment (e.g. Matlab, R)]


Competing interests

The authors declare that they have no competing interests.

Received: 14 April 2011. Accepted: 14 July 2011. Published: 14 July 2011.

References1 Bell G Hey T Szalay A Computer science Beyond the data deluge

Science (New York NY) 2009 3231297-82 Chiang AP Butte AJ Data-driven methods to discover molecular

determinants of serious adverse drug events Clinical pharmacology andtherapeutics 2009 85259-68

3 The end of theory the data deluge makes the scientific methodobsolete [httpwwwwiredcomsciencediscoveriesmagazine16-07pb_theory]

4 Hubbard T The Ensembl genome database project Nucleic Acids Research2002 3038-41

5 Karolchik D The UCSC Genome Browser Database Nucleic Acids Research2003 3151-54

6 Maglott D Ostell J Pruitt KD Tatusova T Entrez Gene gene-centeredinformation at NCBI Nucleic acids research 2005 33D54-8

7 Ashburner M Ball CA Blake JA et al Gene ontology tool for theunification of biology The Gene Ontology Consortium Nature genetics2000 2525-9

8 Bizer C Heath T Berners-Lee T Linked Data - The Story So FarInternational Journal on Semantic Web and Information Systems (IJSWIS)2009

9 Linked Data | Linked Data - Connect Distributed Data across the Web[httplinkeddataorg]

10 Linked data - Design issues [httpwwww3orgDesignIssuesLinkedDatahtml]

11 Vandervalk BP McCarthy EL Wilkinson MD Moby and Moby 2 creaturesof the deep (web) Briefings in bioinformatics 2009 10114-28

12 Where the semantic web stumbled linked data will succeed - OrsquoReillyRadar [httpradaroreillycom201011semantic-web-linked-datahtml]

13 Berners-Lee T Weitzner DJ Hall W et al A Framework for Web ScienceFoundations and Trendsreg in Web Science 2006 11-130

14 Hendler J Berners-Lee T From the Semantic Web to social machines Aresearch challenge for AI on the World Wide Web Artificial Intelligence2010 174156-161

15 Almeida JS Chen C Gorlitsky R et al Data integration gets ldquoSloppyrdquoNature biotechnology 2006 241070-1

16 Deus HF Stanislaus R Veiga DF et al A Semantic Web managementmodel for integrative biomedical informatics PloS one 2008 3e2946

17 Putting the Web back in Semantic Web [httpwwww3org2005Talks1110-iswc-tbl(1)]

18 SPARQL Query Language for RDF [httpwwww3orgTRrdf-sparql-query]

19 Alexander K Cyganiak R Hausenblas M Zhao J Describing LinkedDatasets On the Design and Usage of voiD the ldquo Vocabulary OfInterlinked Datasetsrdquo Linked Data on the Web Workshop (LDOW 09) inconjunction with 18th International World Wide Web Conference (WWW 09)2009

20 Cheung KH Frost HR Marshall MS et al A journey to Semantic Webquery federation in the life sciences BMC bioinformatics 2009 10(Suppl1)S10

21 A Prototype Knowledge Base for the Life Sciences [httpwwww3orgTRhcls-kb]

22 Belleau F Nolin M-A Tourigny N Rigault P Morissette J Bio2RDF towardsa mashup to build bioinformatics knowledge systems Journal ofbiomedical informatics 2008 41706-16

23 Smith B Ashburner M Rosse C et al The OBO Foundry coordinatedevolution of ontologies to support biomedical data integration Naturebiotechnology 2007 251251-5

24 Taylor CF Field D Sansone SA et al Promoting coherent minimumreporting guidelines for biological and biomedical investigations theMIBBI project Nature biotechnology 2008 26889-96

25 Noy NF Shah NH Whetzel PL et al BioPortal ontologies and integrateddata resources at the click of a mouse Nucleic acids research 2009 37W170-3

26 Deus HF Prud E Zhao J Marshall MS Samwald M Provenance ofMicroarray Experiments for a Better Understanding of ExperimentResults ISWC 2010 SWPM 2010

27 Stein LD Integrating biological databases Nature reviews Genetics 20034337-45

28 Goble C Stevens R State of the nation in data integration forbioinformatics Journal of Biomedical Informatics 2008 41687-693

29 Ludaumlscher B Altintas I Bowers S et al Scientific Process Automation andWorkflow Management In Scientific Data Management Edited byShoshani A Rotem D Chapman 2009

30 Nelson B Data sharing Empty archives Nature 2009 461160-331 Stanislaus R Chen C Franklin J Arthur J Almeida JS AGML Central web

based gel proteomic infrastructure Bioinformatics (Oxford England) 2005211754-7

32 Silva S Gouveia-Oliveira R Maretzek A et al EURISWEBndashWeb-basedepidemiological surveillance of antibiotic-resistant pneumococci in daycare centers BMC medical informatics and decision making 2003 39

33 Describing Linked Datasets with the VoiD Vocabulary [httpwwww3orgTR2011NOTE-void-20110303]

34 HIPAA Administrative Simplification Statute and Rules [httpwwwhhsgovocrprivacyhipaaadministrativeindexhtml]

35 Socially Aware Cloud Storage [httpwwww3orgDesignIssuesCloudStoragehtml]

36 Koslow SH Opinion Sharing primary data a threat or asset to discoveryNature reviews Neuroscience 2002 3311-3

37 Baggerly KA Coombes KR Deriving chemosensitivity from cell linesForensic bioinformatics and reproducible research in high-throughputbiology The Annals of Applied Statistics 2009 31309-1334

38 Hodge G Systems of Knowledge Organization for Digital Libraries BeyondTraditional Authority Files 2000

39 SKOS Simple Knowledge Organization System Reference [httpwwww3orgTRskos-reference]

40 Almeida JS Deus HF Maass W S3DB core a framework for RDFgeneration and management in bioinformatics infrastructures BMCbioinformatics 2010 11387

41 Deus HF Veiga DF Freire PR et al Exposing The Cancer Genome Atlas asa SPARQL endpoint Journal of Biomedical Informatics 2010 43998-1008

42 Correa MC Deus HF Vasconcelos AT et al AGUIA autonomous graphicaluser interface assembly for clinical trials semantic data services BMCmedical informatics and decision making 2010 10

43 Freire P Vilela M Deus H et al Exploratory analysis of the copy numberalterations in glioblastoma multiforme PloS one 2008 3e4076

44 NCBO Ontology Widgets [httpwwwbioontologyorgwikiindexphpNCBO_Widgets]

45 Bussler C Is Semantic Web Technology Taking the Wrong Turn IeeeInternet Computing 2008 1275-79

46 What people find hard about linked data [httpdynamicorangecom20101115what-people-find-hard-about-linked-data]

47 Raja A Lakshmanan D Domain Specific Languages International Journal ofComputer Applications 2010 199-105

48 SPARQL Update [httpwwww3orgTRsparql11-update]49 Carroll JJ Bizer C Hayes P Stickler P Named graphs provenance and

trust Proceedings of the 14th international conference on World Wide WebWWW 05 2005 14613

50 S3DB operator function states [httpcodegooglecomps3db-operator]51 S3DB Operators [https3db-operatorgooglecodecomhgpropagation

html]52 Deus HF Sousa MA de Carrico JA Lencastre H de Almeida JS Adapting

experimental ontologies for molecular epidemiology AMIA AnnualSymposium proceedings 2007 935

53 The OAuth 10 Protocol [httptoolsietforghtmlrfc5849]54 Francisco AP Bugalho M Ramirez M Carriccedilo JA Global optimal eBURST

analysis of multilocus typing data using a graphic matroid approachBMC bioinformatics 2009 10152

55 S3QL serialization engine [httpjss3dbgooglecodecomhgtranslatequickTranslatehtml]

56 Ippolito G Leone S Lauria FN Nicastri E Wenzel RP Methicillin-resistantStaphylococcus aureus the superbug International journal of infectiousdiseases 2010 14S7-S11

Deus et al. BMC Bioinformatics 2011, 12:285
http://www.biomedcentral.com/1471-2105/12/285

57. Harris SR, Feil EJ, Holden MTG, et al: Evolution of MRSA during hospital transmission and intercontinental spread. Science (New York, NY) 2010, 327:469-74.
58. Linked Data API. [http://code.google.com/p/linked-data-api]

doi:10.1186/1471-2105-12-285
Cite this article as: Deus et al.: S3QL: A distributed domain specific language for controlled semantic integration of life sciences data. BMC Bioinformatics 2011, 12:285.


  • Abstract
    • Background
    • Results
    • Conclusions
  • Background
    • 1.1 Linked Data Best Practices
    • 1.2 Challenges involved in Publishing Primary Experimental Life Sciences Datasets as Linked Data
    • 1.3 Knowledge organization systems for Linked Data
  • Methods
    • The S3DB Knowledge Organization Model
    • Operators for Permission Control
    • Components of a Distributed System
    • Availability and Documentation
  • Results
    • S3QL Syntax
    • S3QL Permission Control
    • Global Dereferencing for Distributed Queries
    • Implementation and benchmarking
  • Discussion
    • Gastro-intestinal Clinical Trials Use Case
    • The Cancer Genome Atlas Use Case
    • A Molecular Epidemiology Use Case
  • Conclusions
  • Acknowledgements and funding
  • Author details
  • Authors' contributions
  • Competing interests
  • References

expected by its original architects [14] to require a representation of usage contexts that can be applied in the collaboration and controlled sharing of data. When this functionality is supported, as that report anticipates, "social machines" will be able to manage the simultaneous and conflicting views of data that fuel scientific debate. The S3DB knowledge organization system was designed to provide baseline support for that bottom-up process by addressing a recurrent need for controlled sharing of HCLS datasets [15,16]. This report describes a convention, the S3QL language, to query and manipulate it. It will also be demonstrated that S3QL provides a convenient mechanism to engage Linked Data in general.

1.1 Linked Data Best Practices

Linked data best practices set the stage for an interlingua of relational data and logic in the web [17] by the definition of core principles that can be summarized as: 1) information resources should be identified with HTTP universal resource identifiers (URIs); 2) information should be served against a URI in a standard semantic web format such as the Resource Description Framework (RDF); and 3) links should be established to information resources elsewhere [10]. For large datasets, it is also convenient that a web service supporting SPARQL, the protocol and RDF query language, is deployed [18]. Aggregation of data sources is available either by accessing metadata about the datasets as RDF [19,20] or through direct aggregation of RDF assertions in a single knowledgebase [21,22]. To ensure contextual consistency and reusability across datasets, data elements and descriptors are mapped using standard vocabularies, namespaces and ontologies [23-25].

1.2 Challenges involved in Publishing Primary Experimental Life Sciences Datasets as Linked Data

The value of linked data for life scientists lies primarily in the possibility to quickly discover information about proteins or genes of interest derived, for example, from a microarray or protein array experiment [26]. Life scientists involved in primary research still face significant challenges in harnessing the power of Linked Data to improve biological discovery. Part of the difficulty lies in the lack of adequate and user-friendly mechanisms to publish biological results as Linked Data prior to publication in scientific articles. Efforts in linking life sciences data typically focus on datasets which are already available in structured and annotated formats, i.e. after the researchers have analysed, correlated and manually annotated their results by browsing the literature or submitting their data to multiple web-based interfaces [27,28]. Current research [29,30] and our own experience in developing content management systems for health care and life sciences [26,31,32] have identified the need to go beyond those data sets by creating mechanisms for contextualizing linked life sciences data with attribution and version before it can be shared with a stable annotation. Advances have been made in that direction by other research efforts, such as the recent publication of VoiD as a W3C note [33].

The technological advancements that will make primary Life Sciences experimental results an integral part of the Web of Data are also thwarted by challenges which go beyond infrastructure and standards [30]. In particular, HCLS datasets often include data elements, such as those that could be used to identify individual patients, with stringent requirements for privacy and protection [34]. The typical approach to privatizing data has been to make it the responsibility of the data providers. Although this may provide a temporary solution for a small number of self-contained datasets, it quickly becomes unmanageable when datasets aggregate both public and sensitive data from multiple sources, each with its own requirements for privacy and access control [35].

One final common concern in Life Sciences is the need to enable data experts to edit and augment the data representation models; failure to support this flexibility has led in the past to misinterpretation of primary experimental data due to absence of critical contextual information [36,37].

1.3 Knowledge organization systems for Linked Data

In order to address the information management needs of Life Scientists, the practice of Linked Data standards must be coupled with the implementation of Knowledge Organization Systems (KOSs), a view also espoused by the W3C, where the Simple Knowledge Organization System (SKOS) has been recently proposed as a standard [38,39]. In previous work we proposed the design principles of a KOS, the Simple Sloppy Semantic Database (S3DB) [15,16,40]. The S3DB core model is, much like SKOS, task-independent and light-weight. Implementation of the S3DB Core Model and operators resulted in a prototype that has been validated and tested by Life Scientists to address pressing data management needs or, in particular, as a controlled Read-Write Linked Data system [31,41-43]. S3DB was shown to include the minimum set of features required to support the management of experimental and analytic results by Life Sciences experts while making use of Linked Data best practices such as HTTP URIs, subject-predicate-object triples represented using RDF Schema, links to widely used ontologies suggested by NCBO ontology widgets [44] or new OWL classes created by the users, and a SPARQL endpoint [41]. Although complying with these practices is enough to cover the immediate query or "read" requirements of a Linked Data KOS, we found that data management or "write" operations, such as inserting, updating and deprecating data instances within a KOS, could be more efficiently addressed with the identification of the S3DB Query Language (S3QL), a Domain Specific Language (DSL) devised to abstract most of the details involved in managing interlinked, contextualized RDF statements.

S3QL is not meant as an alternative to SPARQL but rather as a complement: data management operations enabled by S3QL can also be formalized in SPARQL. However, the availability of a data management DSL that can be serialized to SPARQL provides an abstraction layer that can be intuitively used by domain experts. As such, DSLs can provide a solution for bridging the gap between the formalisms required by Linked Data best practices, such as SPARQL and RDFS, and the basic controlled read/write management requirements of HCLS experts [12,45,46]. DSLs optimize beyond general purpose languages in the identification of the domain in which a task belongs, drastically reducing the development time [47]. The task of adding a graph to a triple store is supported by most graph stores by means of the SPARQL 1.1 Update language [48]. To enable controlled "write" operations targeting the dataset, it would be useful to annotate, for example, the creator of a named graph, under which circumstances it was created and who has permission to modify it. Similarly, upon changes to the dataset, annotation of the modifier and a comment describing the change would be in the interest of the communities using the data. Many triple stores are in fact quad stores, to enable partial support of that requirement for contextual representation. The most common approach is to use a named graph, a set of triples identified by a URI [49] that indicates the source of a graph. The S3DB Query Language (S3QL) presented in this report was devised with the intent of automating Linked Data management by creating those contextual descriptors in a single S3QL transaction, including author, creation date and description of the data.

By making use of those contextual descriptors, we propose a method for fine grained permission control in S3QL that relies on s3db:operators [50], a class of functions with states that may be used as the predicate of an RDF triple between a user and a dataset with privacy requirements. These operators, described in [40] and made available for experimentation at [51], operate on the adjacency matrix defined by the nodes and edges of an RDF graph. They can be applied in a variety of scenarios, such as optimizing queries or, as is the case with S3QL, to propagate permission assignments. In the latter case, an adjacency matrix includes both the edges between instances of S3DB entities and the transitions of permission on S3DB entities such as, e.g., the assertion that a User's permission on a Project propagates to its entities. Accordingly, by defining user permissions as states of s3db:operators, the core model's adjacency matrix is used to propagate the ability to control, view and modify S3DB entities.

We have found the target audience for S3QL to be both life sciences application developers, who use it through a RESTful application programming interface (API), and life sciences researchers, who use it through user interfaces for weaving the ontologies that best represent the critical contextual information in their experimental results. The applicability of S3QL to other linked data KOSs, such as the Simple Knowledge Organization System (SKOS) [39], is explored with an example, and the advantages of the solution proposed are discussed in three biomedical datasets with very different requirements for controlled operations: gastrointestinal clinical trials [42], cancer genomic characterization [41] and molecular epidemiology [52].

Methods

This section overviews the core model for S3DB, including the set of operators that enable fine grained permission control, and the distributed infrastructure supporting S3QL. The principles defined here are implemented as a prototypical application available at http://s3db.org.

The S3DB Knowledge Organization Model

S3QL is a DSL to programmatically manipulate data as instances of entities defined in a KOS. One of the key features of the KOS defined using the S3DB core model [16] is the use of typed named graphs to separate the identification of the domain, the metadata describing the data, from its observational instantiation, the data itself. We have previously shown that this approach to representing RDF greatly facilitates the assembly of SPARQL and lowers the entry barrier for biomedical researchers interested in using Semantic Web Technologies to address their data management needs [41]. That separation is achieved by using the representation of domain as triples that are themselves the predicates of the statements that instantiate that domain (as detailed by Figure 2 in [40]). For example, the triple [Person hasAge Age], identified as R12 through a named graph of type s3db:rule, describes the domain, while the triple [John R12 26], identified by a named graph of type s3db:statement, instantiates that domain. Through the logic encoded in the RDF Schema definition of domain (rdfs:domain) and range (rdfs:range), the assertion that "John" is of type "Person" and that "26" is an "Age" is enabled in the S3DB KOS. S3DB's use of named graphs to describe the domain enables updates to the domain without affecting the consistency of its instantiation: in the example above, replacing "hasAge" with "hasAgeInYears" will not affect queries that have already been assembled using that property.

In the S3DB core [40], a meta-model for this data is also created with the specific objective of enabling propagation of operations, such as permission assignments, between the domain description and the data itself, described in the following section (see Figures 1, 2 and 3). In the example above, the two triples are respectively assigned to entities of type s3db:rule and s3db:statement, whereas "Person" is identified by a named graph of type s3db:collection and "John" is identified by a named graph of type s3db:item. The S3DB Core specifies three other entities, which are specifically devised to enable knowledge organization and operator propagation: s3db:project entails a list of s3db:rule and s3db:collection entities and is typically applied in domain contextualization; s3db:deployment corresponds to the physical location of an S3DB system (its URL); and s3db:user is the subject of permission assignment operations. It is worth noting that, by making use of S3DB entities, blank nodes are avoided by assigning a unique alphanumeric identifier to every instance of an S3DB entity. The S3DB entities can also be identified using the first letter of their names, D, P, R, C, I, S or U, which will be used in subsequent examples to indicate, respectively, s3db:deployment, s3db:project, s3db:rule, s3db:collection, s3db:item, s3db:statement or s3db:user.
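The separation between a rule (domain) and a statement (instantiation) can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the class and function names (Rule, Statement, inferred_types) are hypothetical and are not part of S3DB; only the example values (Person, hasAge, Age, R12, John, 26) come from the text.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Rule:
    """Domain triple, as held in a named graph of type s3db:rule (hypothetical class)."""
    uid: str      # e.g. "R12"
    subject: str  # collection label, e.g. "Person"
    verb: str     # e.g. "hasAge"
    obj: str      # e.g. "Age"


@dataclass(frozen=True)
class Statement:
    """Instance triple whose predicate is the identifier of a Rule (hypothetical class)."""
    item: str      # e.g. "John"
    rule_uid: str  # e.g. "R12"
    value: str     # e.g. "26"


def inferred_types(stmt, rules):
    """Return (type of the item, type of the value) implied by the rule's domain and range."""
    rule = rules[stmt.rule_uid]
    return rule.subject, rule.obj


rules = {"R12": Rule("R12", "Person", "hasAge", "Age")}
stmt = Statement("John", "R12", "26")
types_before = inferred_types(stmt, rules)

# Renaming the verb only touches the rule; the statement still points at R12.
rules["R12"] = Rule("R12", "Person", "hasAgeInYears", "Age")
types_after = inferred_types(stmt, rules)
```

Because the statement refers to the rule by identifier rather than by its label, the rename leaves the instance data untouched, which is the property the paragraph above describes.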

Figure 1 S3QL language specification using rail diagrams. [Rail diagrams for the delete, insert, update and select operations; shaded areas mark the View, Change and Use operators.] Rail diagrams are read from left to right; any string that can be composed following these diagrams is a valid S3QL query. Valid forms of E and Ea will vary according to the Core Model used in the KOS. For example, if the S3DB Core model is used, any entity in Figure 2 can be used in place of E; upon choice of E, Ea is any attribute that can be attained by following a line from E.

Operators for Permission Control

The second key feature that makes S3DB appropriate for controlled management operations is support for permission control embedded in its core model. As described above and in [40], the hierarchy of permissions to view/edit entities in S3DB is modelled by an adjacency matrix, which is used as a transition matrix in the propagation of permission states. For a walkthrough of the propagation mechanism see additional file 1. The s3db:operator states applied to the S3DB transition matrix modulate propagation by three core functions: merge, migrate and percolate. This behaviour for propagation of permission is described in detail in equation 5 of [40] and is reproduced here in Equation 1. The S3DB transition matrix (T) is defined by 12 s3db:relationships describing dependencies and inference rules between entities of the S3DB core model. The operator state vector (f) is used as the predicate of a triple established between an s3db:user and an entity of the S3DB core model. The JavaScript application at [51] can also be used to attempt this set of propagation behaviours for s3db:operators, either on the S3DB transition matrix or with alternative adjacency matrices.

$$
\begin{aligned}
f_{\mathrm{object},\,k+1} &= \mathrm{merge}\big(\big[\,f_{\mathrm{object},\,k},\ \mathrm{migrate}(T \times f_{\mathrm{subject},\,k})\,\big]\big) \\
l &= \mathrm{length}(f) \\
l = 1 &\rightarrow \mathrm{migrate}(f) = f = f[1] \\
l > 1 &\rightarrow \mathrm{migrate}(f) = f[2, \ldots, l]
\end{aligned}
\qquad (1)
$$
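The migrate step of Equation 1 can be sketched as follows. This is a minimal reading of the two cases given in the equation, assuming the state vector f is held as a Python list; the merge and percolate functions are defined in [40] and are deliberately not reproduced here.

```python
def migrate(f):
    """Apply the migrate step of Equation 1 to an operator state vector f.

    A vector of length 1 is returned unchanged; a longer vector drops its
    first element, which corresponds to f[2..l] in 1-based notation.
    Representing f as a list is an assumption made for this sketch.
    """
    if len(f) == 0:
        raise ValueError("state vector must not be empty")
    return list(f) if len(f) == 1 else list(f[1:])
```

For instance, a one-element vector migrates to itself, whereas a three-element vector migrates to its last two elements.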

Figure 2 Entities in the S3DB Core Model and its attributes. [Diagram of the entities deployment, user, project, collection, rule, item and statement (E) and their attributes (Ea).] A minimal set of common attributes was defined (left) for each of the S3DB entities using RDF Schema (rdfs) and Dublin core (dc) terminology: these are rdf:id, rdfs:label, dc:description, dc:created and dc:creator. Other attributes, which are specific to each of the S3DB entities (right), such as foaf:mbox for the entity User or s3db:project_id for the entity Collection, reflect the s3db:relationships described above and formalized in the S3DB conceptual model (Figure 3).

The s3db:operators [40] have a scope and applicability in linked data beyond permission management. In S3QL we define three operator types for controlled management operations, one for each of the rights to view, change/edit or use instances of S3DB entities. The format used to assign permission was defined as a three character string, where each operator occupies, respectively, the first, second or third position and may assume value N, S or Y according to the level of permission intended: no permission (N), permission limited to the creator of the resource (S) or full permission (Y). For example, the permission assignment "YSN" specifies complete permission to view (Y) the subject entity, partial permission to change it (S) and no permission to use it (N). States may be defined as dominant by use of uppercase (Y, N or S) or recessive by use of lowercase characters (y, n or s). Dominant and recessive permissions are used to decide on the outcome of multiple permissions converging on the same entity (as detailed in [40]). Missing permission states, indicated by the dash character '-' (which has no lower or upper case), are also allowed, as well as a mechanism to succinctly specify transitions with variable memory length (l in equation 1). The propagation of permissions in the S3DB Core Model ensures that, for every entity and every user, two types of permission are defined: the assigned permission, or the permission state assigned directly to a user in an entity, and the effective permission, which is the result of the propagation of s3db:operators.
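The three character permission format described above can be decoded with a short helper. This is a sketch under stated assumptions: the names State and parse_permission are hypothetical, only the plain three character form is handled (not the variable memory length notation), and the rules for resolving dominant against recessive states are left to [40].

```python
from typing import NamedTuple

OPERATORS = ("view", "change", "use")  # positions 1, 2 and 3 of the string
LEVELS = {"y": "full", "s": "creator-only", "n": "none"}


class State(NamedTuple):
    operator: str
    level: str      # "full", "creator-only", "none" or "missing"
    dominant: bool  # True for uppercase; False for lowercase and for '-'


def parse_permission(code):
    """Decode a three character permission string such as 'YSN' or 'N--'."""
    if len(code) != 3:
        raise ValueError("expected exactly three characters, got %r" % code)
    states = []
    for operator, char in zip(OPERATORS, code):
        if char == "-":
            # A dash marks a missing state and has no upper or lower case.
            states.append(State(operator, "missing", False))
        elif char.lower() in LEVELS:
            states.append(State(operator, LEVELS[char.lower()], char.isupper()))
        else:
            raise ValueError("unknown permission state %r" % char)
    return states
```

Decoding "YSN" yields full permission to view, creator-only permission to change and no permission to use, all three marked as dominant.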

Figure 3 The S3DB conceptual model. Five attributes (id, label, description, creator and created) and four methods (select, update, insert and delete) are common to all S3DB entities. In the current S3QL implementation, the label and description attributes are defined by the submitter of the data, whereas the id, created and creator attributes are automatically assigned by the system. Other dependencies were devised to comply with the definition of s3db:relationships.

Components of a Distributed System

One of the requirements for RDF-based knowledge management ecosystems is the availability of queries spanning across multiple SPARQL endpoints. Automation of distributed queries in systems supporting permission control, such as S3DB, is challenged by user authentication. In S3QL we propose addressing this through delegation to authentication authorities. As a result, a user (or usage) can be identified by a URI that is independent of the authorities that validate it. Whenever possible, it is recommended that authentication credentials be protected by use of OAuth [53].

Use of URIs and Internationalized Resource Identifiers (IRIs) to identify data elements is one of the core principles of Linked Data. However, many programming environments cannot easily handle URIs as element identifiers. Problems range from decreased processing speed to a need for encoding the URIs in web service exchanges. In anticipation of that class of problems, the URIs for entities in S3DB are interchangeable with alphanumeric identifiers formulated as the concatenation of one of D, U, P, C, R, I or S (referring to the S3DB entities described in The S3DB Knowledge Organization Model), identifying the entity, and a unique number. As an example, for a deployment located at URL http://q.s3db.org/demo, the alphanumeric P126 is resolvable to an entity of type Project with URI http://q.s3db.org/s3dbdemo/P126. To facilitate exchange of URIs in distinct deployments, the URI above could also be specified as D282P126, where "D282" is the alphanumeric identifier of the S3DB deployment located at URL http://q.s3db.org/demo. Every s3db:deployment is identified by a named graph in the form D[number]; for completeness, metadata pertaining to each s3db:deployment, such as the corresponding URL, is described using the vocabulary of interlinked datasets (VoiD) [19] and shared through a root location.
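The interchange between alphanumeric identifiers and URIs can be sketched as below. The function names (split_uid, to_uri), the example base URL and the rule that a URI is the deployment URL followed by a slash and the entity identifier are all assumptions made for illustration; the actual URI layout used by S3DB deployments may differ.

```python
import re

# One entity letter followed by a unique number, e.g. "P126" or "D282".
_TOKEN = re.compile(r"[DUPCRIS]\d+")


def split_uid(uid):
    """Split an identifier such as 'D282P126' into ['D282', 'P126']."""
    tokens = _TOKEN.findall(uid)
    if not tokens or "".join(tokens) != uid:
        raise ValueError("not an S3DB-style identifier: %r" % uid)
    return tokens


def to_uri(uid, deployment_urls, local_deployment):
    """Resolve an identifier to a URI (assumed layout: base URL + '/' + id).

    deployment_urls maps deployment identifiers to base URLs; identifiers
    without a leading deployment token are resolved against local_deployment.
    """
    tokens = split_uid(uid)
    if len(tokens) > 1 and tokens[0].startswith("D"):
        deployment, entity = tokens[0], "".join(tokens[1:])
    else:
        deployment, entity = local_deployment, uid
    return deployment_urls[deployment].rstrip("/") + "/" + entity
```

With a hypothetical registry such as {"D282": "http://example.org/demo"}, both "P126" (resolved locally) and "D282P126" map to the same URI.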

Availability and Documentation

The specification of the S3QL language has been made available at http://link.s3db.org/specs and one example of the output RDF is available at http://link.s3db.org/example. S3QL has been implemented through a REST application programming interface (API) for the S3DB prototype, which is publicly available at http://s3db.org. Both the prototype and its API were developed in PHP with MySQL or PostgreSQL for data storage. Documentation about the S3DB implementation of S3QL as an API can be found at http://link.s3db.org/docs. S3QL queries may be tested at the demo implementation at http://link.s3db.org/s3qldemo, and a translator for the compact notation is available at http://link.s3db.org/translate.

Results

S3QL Syntax

S3QL is a domain specific language devised for facilitating management operations such as "insert", "update" or "delete" using entities of a Linked Data KOS such as the S3DB core model described above. Its syntax, however, is only loosely tied to the S3DB Core Model and can easily be applied to a set of KOS core models in which S3DB is included. The complete syntax of S3QL in its XML (eXtensible Markup Language) flavour is represented in the railroad diagram of Figure 1. The S3QL syntax includes three elements: the description of the operation, the target entity and the input parameters. Four basic operation descriptions were deemed necessary to fully support read/write operations: select, insert, update and delete. The actions of these operations mimic those of the structured query language (SQL) and target instances of entities (E) defined in the core model. Input parameters include the set of attributes defined for each of the entities, either in the alphanumeric form associated with entity instances (EntityId) or in the form of EntityAttributeValue (Ea). The values for Ea are determined upon choice of E; for an example using S3DB entities and attributes, see Figure 2. E may be replaced with any of the entities defined in the S3DB Core Model (Deployment, User, Project, Collection, Rule, Item or Statement); upon choice of E, valid forms of Ea include any of the attributes defined for E (e.g. rdf:id, rdfs:label, rdfs:comment). A table summarizing all available operations, targets and input parameters is made available at http://link.s3db.org/specs. The formal S3QL syntax is completed by enclosing the outcome of one of the diagrams in Figure 1 with the <S3QL> tag. For example, the following XML structure is a valid S3QL query for an operation of type insert, where the target is the S3DB entity Project and the input parameter, formulated as EntityAttributeValue, is "label = Test":

<S3QL><insert>project</insert><where><label>Test</label></where></S3QL>

The set of 12 s3db:relationships (see Table 1 of [40]) in the S3DB Model determines the organizational dependencies of S3DB entities. For example, s3db:PC is the s3db:relationship that specifies a dependency between an instance of a Collection (C) and an instance of a Project (P) (Figure 3). The S3QL syntax fulfils this constraint by assigning project_id (the identifier of an S3DB Project) as an attribute of a Collection. In this description of attributes associated with the S3DB core model, we make use of the assumption, as in other KOSs and in Linked Data in general, that there is no restriction to adding relationships beyond those described here. S3QL was identified as the minimal representation to interoperate with the S3DB core model, and therefore only those relationships are explored in this report.

The syntax diagram in Figure 1 generates XML, a standard widely used in web service implementations. That alternative often results in verbose queries that could easily be assembled from more compact notations. One example to consider is the form action(E | Ea = value). Here the symbol '|' should be interpreted as in Bayesian inference, as a condition, and be read "given that". The letter "E" corresponds to the first letter of any S3DB entity (D, P, R, C, I, S or U) and Ea is any of its attributes, as described in Figure 2. In this notation, the query insert(P | label = test) is equivalent to the example query above. That particular variant is also accepted by the S3DB prototype, and a converter for this syntax into complete S3QL/XML syntax was made available at http://link.s3db.org/translate. For further compactness of this alternative formulation, entity identifiers used as parameters may be replaced with their corresponding alphanumeric identifiers; for example, project_id = 156 may be replaced with P156. This alternative notation will be used in the subsequent examples.
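The relationship between the compact notation and the XML form can be sketched with a small translator. This is not the converter published by the authors: it is a hypothetical illustration (the function name compact_to_xml is an assumption) that handles only insert queries whose conditions are written as attribute = value pairs, since that is the one form for which the text gives both notations.

```python
import re
import xml.etree.ElementTree as ET

ENTITIES = {"D": "deployment", "P": "project", "R": "rule", "C": "collection",
            "I": "item", "S": "statement", "U": "user"}

# Matches insert(E | conditions), where E is a single entity letter.
_COMPACT = re.compile(r"^\s*insert\s*\(\s*([DPRCISU])\s*\|\s*(.*?)\s*\)\s*$")


def compact_to_xml(query):
    """Translate e.g. 'insert(P | label = Test)' into its S3QL/XML form."""
    match = _COMPACT.match(query)
    if match is None:
        raise ValueError("unsupported compact query: %r" % query)
    letter, conditions = match.groups()
    root = ET.Element("S3QL")
    ET.SubElement(root, "insert").text = ENTITIES[letter]
    where = ET.SubElement(root, "where")
    for condition in conditions.split(","):
        name, sep, value = condition.partition("=")
        if not sep or not name.strip():
            raise ValueError("expected 'attribute = value', got %r" % condition)
        ET.SubElement(where, name.strip()).text = value.strip()
    return ET.tostring(root, encoding="unicode")
```

Applied to insert(P | label = Test), the sketch produces the same XML structure as the example query given in the text. Shorthand identifiers such as P156 and the select, update and delete forms would need additional cases.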

S3QL Permission Control

Permission states are assigned using an S3QL query such as insert(U | U1, P157, permission_level = ysn), which includes the action insert, the target entity User and three input parameters: identifier of the User (U1), identifier of the entity (P157) and permission assignment (ysn). Effectively, this will result in the creation of the triple [U1 ysn P157], where the subject is of type s3db:user, the predicate is of type s3db:operator and the object is of type s3db:project. The inclusion of this triple in a dataset will modulate the type of management operation that a user may perform. As described in the Methods section, each position in the permission assignment operator (ysn) encodes, respectively, for permission to "view", "change" or "use" the object entity. Values y, s and n indicate, respectively, that the user has full permission to view it (y), permission to change its metadata only if that user was the creator of the entity (s) and no permission to insert (n) child entities. Each S3QL operation is therefore tightly woven to one of the three operators: select is controlled by "view", update and delete are controlled by "change" and insert is controlled by "use" (shaded areas in Figure 1). The "use" operator encodes for the ability of a user to create new relationships with the target entity, which is defined separately from the right to "change" it. For example, if a user (U1) is granted "y" as the effective permission to "change" an s3db:rule, then the metadata describing it may be altered. If, however, that same user is granted permission "n" to "use" that same Rule, that user is prevented from creating Statements using that Rule. Although "use" may be interpreted as being equivalent to "insert" or "append" in other systems, we have chosen to separate the terms describing the operator "use" from the S3QL action "insert". The permission assigned at the dataset level will then propagate in the S3DB transition matrix following the behaviour formalized in equation 1, therefore avoiding the need to assign permission to every user on every entity. It is worth noting that the DSL presented here is extensible beyond the 4 management actions (select, insert, update, delete) described. The s3db:operators that control permission on these actions are also extensible beyond "view", "change" and "use", and different implementations may support alternative states.

The permission control behaviour for S3QL operations can be illustrated through the use of Quadratus, an application available at http://q.s3db.org/quadratus that can be pointed at any S3DB deployment to assign permission states on S3DB entities to different users (Figure 4). Other use case scenarios are also explored in the S3QL specification at http://link.s3db.org/specs.
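The mapping between S3QL actions and the three operators can be sketched as a simple check against an effective permission string. The function name is_allowed is hypothetical, and treating a missing state ('-') as a denial is an assumption of this sketch rather than documented S3DB behaviour; propagation and dominance are assumed to have been resolved already.

```python
# Position of the controlling operator in the three character string:
# select -> view (0), update and delete -> change (1), insert -> use (2).
OPERATOR_FOR_ACTION = {"select": 0, "update": 1, "delete": 1, "insert": 2}


def is_allowed(action, effective_permission, is_creator):
    """Decide whether an S3QL action is permitted on an entity.

    effective_permission is a three character string such as 'ysn';
    is_creator tells whether the requesting user created the entity.
    """
    state = effective_permission[OPERATOR_FOR_ACTION[action]].lower()
    if state == "y":
        return True
    if state == "s":
        return is_creator
    return False  # 'n', or a missing state '-' (assumed to deny)
```

With the effective permission "ysn" from the example above, select is always allowed, update and delete are allowed only for the creator of the entity, and insert is refused.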

Global Dereferencing for Distributed Queries

A simple dereferencing system was devised for S3DB identifiers that relies on the identification of root deployments, i.e. S3DB systems where alphanumeric identifiers for S3DB deployments can be dereferenced to URLs. This simple mechanism enables complex transactions of controlled data. For an example of this behaviour see Figure 5, where the S3DB UID D327R172930, identifying an entity of type Rule (R172930) available in deployment D327, is being requested by a user registered in deployment D309. In order to retrieve the requested data, the URL of deployment D327 must first be resolved at a root deployment such as, for example, http://root.s3db.org.

The dereferencing mechanism is also applicable in more complex cases where the root of two deployments sharing data is not the same. Prepending the deployment identifier of the root to the UIDs, such as, for example, D1016666D327R172930, where D1016666 identifies the root deployment, would result in recursive URL resolution steps such as select(D.url | D1016666) prior to step 4 in Figure 5. This mechanism avoids broken links when S3DB deployments are moved to different URLs by enabling deployment metadata to be updated securely at the root using a public/private key encryption system.
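The recursive resolution described above can be sketched as follows. Everything here is illustrative: the function name dereference, the lookup callable (which stands in for an S3QL select issued against a root deployment) and the URL layout of the result are assumptions, and the example hosts are placeholders.

```python
import re

_TOKEN = re.compile(r"[DUPCRIS]\d+")


def dereference(uid, lookup, default_root):
    """Resolve a UID such as 'D327R172930' to a URL.

    lookup(root_url, deployment_id) must return the URL registered for
    deployment_id at the given root. Each leading deployment token that is
    followed by another deployment token names a further root to consult.
    """
    tokens = _TOKEN.findall(uid)
    if "".join(tokens) != uid or len(tokens) < 2 or tokens[0][0] != "D":
        raise ValueError("expected a deployment-qualified UID, got %r" % uid)
    root = default_root
    while len(tokens) > 2 and tokens[1][0] == "D":
        root = lookup(root, tokens.pop(0))  # resolve the next root first
    deployment_url = lookup(root, tokens[0])
    return deployment_url.rstrip("/") + "/" + "".join(tokens[1:])


# Hypothetical registry standing in for the root deployments.
registry = {
    ("http://root.example.org", "D327"): "http://lab.example.org/s3db",
    ("http://root.example.org", "D1016666"): "http://other-root.example.org",
    ("http://other-root.example.org", "D327"): "http://lab.example.org/s3db",
}
resolved = dereference("D327R172930", lambda r, d: registry[(r, d)],
                       "http://root.example.org")
```

In this sketch, D327R172930 and D1016666D327R172930 resolve to the same URL; the second form simply adds one lookup at the default root to find the root where D327 is registered.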

Implementation and benchmarking

In the current prototype implementation, S3QL is submitted to S3DB deployments using either a GET or POST request and may include an optional authentication token (key). The REST specification [54] suggests separate HTTP methods according to the intention of the operation: often "GET" is used to retrieve data, "PUT" is used to submit data, "POST" is used to update data and "DELETE" is used to remove data. There are, however, many programming environments that implement only the REST "GET" method, including many popular computational statistics programming frameworks such as R and Matlab. Therefore, in order to fully explore the integrative potential of this read/write semantic web service, and to support operations beyond the 4 implemented, the S3DB prototype implementation of S3QL supports the "GET" method for all S3QL operations, with the parameters of the S3QL call appended to the URL. One drawback of relying on GET is the limit imposed by the browser on URL length. To address this potential problem, the S3DB prototype also supports the use of "POST" for S3QL calls.
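The choice between GET and POST can be sketched as below. The endpoint, the parameter name "query" and the length threshold are assumptions made for this illustration ("key" is the name the text gives to the authentication token); the actual parameter names of the S3DB API should be taken from its documentation.

```python
from urllib.parse import urlencode

# Assumed conservative limit; real limits depend on the client and server.
MAX_URL_LENGTH = 2000


def build_request(endpoint, query, key=None):
    """Return (method, url, body) for submitting an S3QL query.

    GET is used while the full URL stays under MAX_URL_LENGTH; otherwise the
    same parameters are sent as a form-encoded POST body.
    """
    params = {"query": query}  # parameter name is a placeholder
    if key is not None:
        params["key"] = key
    encoded = urlencode(params)
    url = endpoint + "?" + encoded
    if len(url) <= MAX_URL_LENGTH:
        return "GET", url, None
    return "POST", endpoint, encoded.encode("ascii")
```

A short query such as insert(P | label = test) fits in a GET URL, while a query carrying a long value falls back to POST with the parameters in the request body.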

Figure 4 Quadratus, an interface to illustrate S3QL's permission control mechanism and its effect on S3QL queries. Projects, Collections, Rules, Items and Statements associated with GI Clinical Trials Project are displayed with effective and assigned permissions. Collections and Rules, retrieved using S3QL queries select(C|P1196457) and select(R| P1196457), inherit the permission assignment in the Project, "ysn"; assigned permission "N–" in Collection "Demographics" results in the effective permission of "Nsn", inherited by all Rules and Items that have a relationship with that Collection, effectively preventing gi_user1 from accessing its data. The directed labelled graph of the propagation resolution is displayed on the right side of the application, illustrating the propagation mechanism.

Two further challenges needed to be addressed in the prototypical implementation of S3QL: 1) the need for a centralized root location to support dereferencing of deployment URI when the condensed version is used (e.g. D282); and 2) distributed queries on REST systems required users to authenticate in multiple KOSs. The first challenge was addressed by configuring an S3DB deployment as the root location, available at http://root.s3db.org. Deployment metadata is submitted to this root deployment at configuration time using S3QL; data pertaining to each deployment can therefore be dereferenced to a URL using http://root.s3db.org/D[numeric]. A root deployment other than the default can be chosen during installation. To avoid overloading these root deployments with too many requests, a local 24 hour cache of all accessed deployments is kept in each S3DB deployment using the same strategy: if the URL is cached, it will not be requested from the root deployment. To address the second challenge, each s3db:deployment can store any number of authentication services supporting HTTP, FTP or LDAP protocols. Once the user is authenticated, temporary surrogate tokens are issued with each query. When coupled with the user identifier in the format of a URI, these tokens effectively identify the user performing the query, regardless of the S3DB deployment where the query is requested.

Screencasts illustrating processing time of data manipulation using S3QL are available at http://www.youtube.com/watch?v=2KZC6kI609s and http://www.youtube.com/watch?v=FJSYLCwBaPI.
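The 24 hour cache of deployment URLs mentioned above could look roughly like the following Python sketch. It is an assumption-laden illustration rather than the prototype's code: the class name DeploymentUrlCache is hypothetical, and lookup_at_root stands in for whatever call actually queries the root deployment.

```python
import time

CACHE_TTL_SECONDS = 24 * 60 * 60  # the 24 hour validity window


class DeploymentUrlCache:
    """Local cache of deployment URLs, refreshed from the root when stale."""

    def __init__(self, lookup_at_root, ttl=CACHE_TTL_SECONDS, clock=time.time):
        self._lookup = lookup_at_root  # hypothetical: callable(deployment_id) -> URL
        self._ttl = ttl
        self._clock = clock
        self._entries = {}  # deployment id -> (url, time stored)

    def resolve(self, deployment_id):
        now = self._clock()
        entry = self._entries.get(deployment_id)
        if entry is not None and now - entry[1] < self._ttl:
            return entry[0]  # cached and still valid: the root is not contacted
        url = self._lookup(deployment_id)
        self._entries[deployment_id] = (url, now)
        return url
```

Passing the clock as a parameter is a design choice made here so that expiry can be exercised without waiting; a deployment would simply use the default.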

Figure 5 The global S3QL dereferencing system. User U78 of deployment D309 issues a command to request all entities of type Statement where the attribute Rule_id corresponds to the value D327R172930 through S3QL operation select(Svalue|D327R172930) (1). If the URL corresponding to deployment D327 is not cached locally, or has not been validated in the past 24 h, a query is issued and executed at the root deployment to retrieve the corresponding URL: select(Durl|D327) (2, 3). Once the URL is returned, query (1) is re-issued as select(Svalue| R172930) and executed at the URL for D327 (4). To validate the user, deployment D327 issues the command select(Uid|U76) at D309 using the key provided (5, 6) and returns the data only if Uid matches the value for user_id (7).

Discussion
One of the major concerns in making use of Linked Data to improve health care and life sciences research is the need to ensure both the availability of contextual information about experimental datasets and the ability to protect the privacy of certain data elements which may identify an individual patient. Domain specific languages (DSL) can ease the task of managing the contextual descriptors that would be necessary to implement permission control in RDF and, by doing so, could greatly accelerate the rate of adoption of Linked Data formalisms in the life sciences communities to improve scientific discovery. We have described S3QL, a DSL to perform read/write operations on entities of the S3DB Core Model. S3QL attempts to address the requirements in linking Life Sciences datasets, including both publishable and un-publishable data elements, by 1) including contextual descriptors for every submitted data element and 2) making use of those descriptors to ensure permission control managed by the data experts themselves. This avoids the need to break a consolidated dataset into its public and private parts when the results are acceptable for publication.

Applying S3QL to the S3DB Core Model in a prototypical application benefited from the definition of loosely defined boundaries for RDF data that enabled propagation of permission while avoiding the need to document a relationship for each data instance individually and for each user. The assembly of SPARQL queries is also facilitated by the identification of domain triples using named graphs from the data itself [41], and can be illustrated in the application at [55], where a subset of S3QL can be readily converted into the W3C standard SPARQL. Although the prototypical implementation of S3QL fits the definition of an API for S3DB systems, it is immediately apparent that the same notation could be easily and intuitively extended to other KOSs' core models. For example, pointing the tool at http://q.s3db.org/translate to a JavaScript Object Notation (JSON) representation of the SKOS core model (skos.js) instead of the default S3DB core model (s3db.js) results in a valid S3QL syntax that could easily be applied as an API for SKOS based systems, as illustrated by applying the example query "select(C|prefLabel = animal)" using http://link.s3db.org/translate?core=skos.js&query=select(C|prefLabel=animal) to retrieve SKOS concepts labelled "animal". The progress of adoption for life sciences application developers can be further smoothed by complete reliance on the REST protocol for data exchange and the availability of widely used formats such as JSON, XML or RDF/turtle.

The applicability of S3QL to life sciences domains is illustrated here with three case studies: 1) in the domain of clinical trials, a project [42] that requires collaboration between departments with different interests; 2) in the domain of the cancer genome atlas (TCGA) project, a multi-institutional effort that requires multiple authentication mechanisms and sharing of data among multiple institutions [41]; and 3) in the domain of molecular epidemiology, a project where non-public data stored in an S3DB deployment needs to be statistically integrated with data from a public repository. All described use cases shared one considerable requirement: the ability to include in the same dataset both published and unpublished results. As such, they required both the annotation of contextual descriptors of the data, enabled by S3QL, and the availability of controlled permission propagation, enabled by the S3DB model transition matrix. Future work in this effort may include the application of S3QL in a Knowledge Organization System based on SKOS terminology and the definition of a transition matrix for SKOS to enable controlled permission propagation.

Gastro-intestinal Clinical Trials Use Case
As part of a collaboration with the Department of Gastrointestinal Medical Oncology at The University of Texas M. D. Anderson Cancer Center, an S3DB deployment was configured to host data from gastro-intestinal (GI) clinical trials. A schema was developed using S3DB Collections and Rules, and S3QL insert queries were used to submit data elements as Items and Statements (see Figure 6). Two permission propagation examples are illustrated, one of a restrictive nature and the other permissive, which can also be explored at http://q.s3db.org/quadratus (Figure 4). In this example, the simple mechanisms of propagation defined for S3QL support the complex social interaction that requires a fraction of the dataset to be shared with certain users but not with others. Contextual usage is therefore a function of the attributes of the data itself (e.g. its creator) and the user identification token that is submitted with every read/write operation.
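The restrictive example from Figures 4 and 6 can be reproduced with a deliberately simplified Python sketch of the merge step. Several things here are assumptions: permissions are treated as three-character strings (read here as view, edit and add, in that order), the assigned permission printed as "N–" is taken to stand for "N--" with a dash meaning "inherit", and the function name merge_permission is hypothetical. The sketch covers only this position-wise merge, not the migrate and percolate functions or the full operator semantics described in [40].

```python
def merge_permission(inherited, assigned):
    """Merge an assigned permission onto an inherited one, position by position.

    Assumption: a dash in the assigned string means "no explicit assignment",
    so the inherited state is kept; any other character overrides it.
    """
    if len(inherited) != len(assigned):
        raise ValueError("permission strings must have the same length")
    return "".join(
        inherited_state if assigned_state == "-" else assigned_state
        for inherited_state, assigned_state in zip(inherited, assigned)
    )


# Restrictive case from Figure 4: the Project grants "ysn", the Collection
# "Demographics" assigns "N--", and the effective permission becomes "Nsn".
effective = merge_permission("ysn", "N--")
```

With no assignment at all (merge_permission("ysn", "---")), the inherited "ysn" is returned unchanged, which corresponds to plain inheritance from the Project.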

The Cancer Genome Atlas Use Case
The cancer genome atlas (TCGA) is a pilot project to characterize several types of cancer by sequencing and genetically characterizing tumours for over 500 patients throughout multiple institutions. S3QL was used in this case study to produce an infrastructure that exposes the public portion of the TCGA datasets as a SPARQL endpoint [41]. This was possible because SPARQL is entailed by S3QL, but not the opposite. Specifically, SPARQL queries can be serialized to S3QL, but the opposite is not always possible, particularly as regards write and access control operations. The structure of the S3DB Core Model, which explicitly distinguishes domain from instantiation, enables SPARQL query patterns such as Patient R390 cancerType to be readily serialized into their S3QL equivalent, select(S|R390). Although this will not be further explored in this discussion, it is worth noting that the availability of this serialization allows for an intuitive syntax of SPARQL queries by patterning them on the description of the user-defined domain Rule, such as "Patient hasCancerType cancerType".
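A toy Python sketch of the serialization step mentioned above follows. It handles only the single case given in the text, a triple pattern whose predicate is a Rule identifier, and it is not the serialization engine referenced at [55]. The function name and the handling of a leading ":" prefix on the predicate are assumptions for illustration.

```python
import re

RULE_ID = re.compile(r"R\d+")


def triple_pattern_to_s3ql(subject, predicate, obj):
    """Serialize a triple pattern with a Rule id as predicate into S3QL.

    The subject and object are treated as unbound variables, so they do not
    constrain the query; only the Rule id is carried over.
    """
    rule_id = predicate.lstrip(":")  # assumption: tolerate a prefixed name
    if not RULE_ID.fullmatch(rule_id):
        raise ValueError(f"predicate is not an S3DB Rule id: {predicate!r}")
    return f"select(S|{rule_id})"
```

For the pattern quoted in the text, triple_pattern_to_s3ql("Patient", "R390", "cancerType") gives select(S|R390), i.e. all Statements made under Rule R390.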

A Molecular Epidemiology Use Case
In this example, SPARQL was serialized to S3QL to support a computational statistics application. As a first step, an S3DB data store was deployed using S3QL to manage molecular epidemiology data related to strains of Staphylococcus aureus bacteria collected at the Instituto de Tecnologia Química e Biológica (ITQB) in Portugal. Specifically, the ITQB Staphylococcus reference database was devised with the purpose of managing Multilocus Sequence Typing (MLST) data, a typing method used to track the molecular epidemiology of infectious bacteria [54-57]. As a second step, we downloaded the public Staphylococcus aureus MLST profiles database at http://www.mlst.net and made it available through a SPARQL endpoint at http://q.s3db.org/mlst_sparql/endpoint.php. The process of integration of MLST profiles from the ITQB Staphylococcus database with the publicly available MLST profiles is illustrated in Figure 7. In this example, a federated SPARQL query is assembled to access both MLST sources; data stored in the S3DB deployment is retrieved by serializing the SPARQL query into S3QL and providing an authentication token to identify the user, as described in Figure 3. The assembled graph resulting from the federated SPARQL query (see additional file 2) can be imported into a statistical computing environment such as Matlab (Mathworks Inc.). Using this methodology, it was possible to cluster strains from two different data sources with very different authentication requirements. The observation that some Portuguese strains (PT1, PT2, PT15 and PT21) that are not publicly shared cluster together with a group of public UK strains (UK17, UK16, UK11, UK14, UK15, UK13, UK12), and therefore may share a common ancestor, is an observation enabled by the data integrated through S3QL.

Figure 6 Two use cases of permission propagation. Two users are granted full permission to view GI Clinical Trials Project ("y"); however, neither of them can add new data ("n") nor edit existing data unless they were its original creator ("s"). One of the users (giuser1) is prevented from accessing any data with demographic elements such as, for example, the names of the Patients. In this case, an uppercase "N" is assigned in the right to view the Collection DemographicData, which will be merged with the inherited "y" to produce an effective permission level of "N" for the right to view (merge 1). For the second user (giuser2), permission is granted to use the Collection TissueData, indicating that the user can create new instances in that Collection (merge 2).

Conclusions
Life sciences applications are set to greatly benefit from coupling Semantic Web Linked Data standards and KOSs. In the current report we illustrate data models from life sciences domains woven using the S3DB knowledge organization system. In line with the requirements for the emergence of evolvable "social machines", different perspectives on the data are made possible by a permission propagation mechanism controlled by contextual attributes of data elements, such as their creator. The operation of the S3DB KOS is mediated by the S3QL protocol described in this report, which exposes its Application Programming Interface for viewing, inserting, updating and removing data elements. Because S3QL is implemented with a distributed architecture where URIs can be dereferenced into multiple S3DB deployments, domain experts can share data on their own deployments with users of other systems without the need for local accounts. Therefore, S3QL's fine grained permission control, defined as instances of s3db:operators, enables domain experts to clearly specify the degree of permission that a user should have on a resource and how that permission should propagate in a distributed infrastructure. This is in contrast to the conventional approach of delegating permission management to the point of access. In the current SPARQL specification extension, various data sources can be queried simultaneously or sequentially. There is still no accepted convention for tying a query pattern to an authenticated user, probably because SPARQL engines would have no use for that information, as most have been created in a context of Linked Open Data efforts. The critical limitation in applying this solution for Health Care and Life Sciences is the ability to make use of contextual information to determine both the level of trust on the data and to enable controlled access to elements in a dataset without breaking it and storing it in multiple systems. To address this requirement, S3QL was fitted with distributed control operational features that follow design criteria found desirable for biomedical applications. S3QL is not unique in its class; for example, the linked data API, which is being used by data.gov.uk, is an alternative DSL to manage linked data [58]. However, we believe that S3QL is closer to the technologies currently used by application developers and therefore may provide a more suitable middle layer between linked data formalisms and application development. It is argued that these features may assist and anticipate future extensions of semantic web provenance control formalisms.

Additional material

Additional file 1: Walkthrough of S3DB's permission propagation: merge, migrate and percolate.

Additional file 2: Matrix of MLST profiles in Portugal and the United Kingdom.

Acknowledgements and funding
This work was supported in part by the Center for Clinical and Translational Sciences of the University of Alabama at Birmingham under contract no. 5UL1 RR025777-03 from NIH National Center for Research Resources; by the National Cancer Institute grant 1U24CA143883-01; by the European Union FP7 PNEUMOPATH project award; and by the Portuguese Science and Technology foundation under contracts PTDC/EIA-EIA/105245/2008 and PTDC/EEA-ACR/69530/2006. HFD also thankfully acknowledges a PhD fellowship from the same foundation, award SFRH/BD/45963/2008, and the Science Foundation Ireland project Lion under Grant No. SFI/02/CE1/I131. The authors would also like to thank three anonymous reviewers whose comments and advice were extremely valuable for improving the manuscript.

Author details
1 Digital Enterprise Research Institute, National University of Ireland at Galway, IDA Business Park, Lower Dangan, Galway, Ireland. 2 Biomathematics, Instituto de Tecnologia Química e Biológica, Universidade Nova de Lisboa, Av. da República, Estação Agronómica Nacional, 2780-157 Oeiras, Portugal. 3 Laboratório Nacional de Computação Científica, Av. Getúlio Vargas 333, Quitandinha, 25651-075 Petrópolis, Brasil. 4 Sanofi Pasteur, 38 Sidney Street, Cambridge, MA 02139, USA. 5 Laboratory of Molecular Genetics, Instituto de Tecnologia Química e Biológica, Universidade Nova de Lisboa, Av. da República, Estação Agronómica Nacional, 2780-157 Oeiras, Portugal. 6 Research Center for Intelligent Media, Furtwangen University, Furtwangen, Germany. 7 Laboratory of Microbiology, The Rockefeller University, 10021 New York, USA. 8 Division of Informatics, Department of Pathology, University of Alabama at Birmingham, 619 South 19th Street, Birmingham, Alabama, USA.

Authors' contributions
HFD developed the S3QL domain specific language, implemented it in the S3DB prototype, validated it with examples and wrote the manuscript. MCC, RS, MM, HL, RF, WM and JSA tested and validated the language with examples and made suggestions which led to its improvement. All authors read and approved the final manuscript.

Figure 7 Workflow for the creation of a hierarchical cluster of MLST profiles from strains collected in Portugal and in the UK. Two sources are chosen to perform the MLST profile assembly: an S3DB deployment, which holds the ITQB Staphylococcus database, and the SPARQL endpoint configured to host the Staphylococcus aureus MLST database. Although data in the MLST SPARQL endpoint is publicly available, to access data in S3DB the user needs to provide an authentication token and a user_id, as well as assemble the S3QL queries to retrieve the data [select(S|R172930) and select(S|R167271)], which may also be formulated as SPARQL. A data structure is assembled that can finally be analysed using hierarchical clustering methods such as the ones made available by Matlab. The complete MLST dataset used to obtain the graph is provided in additional file 2.


Competing interests
The authors declare that they have no competing interests.

Received: 14 April 2011 Accepted: 14 July 2011 Published: 14 July 2011

References
1. Bell G, Hey T, Szalay A: Computer science. Beyond the data deluge. Science (New York, NY) 2009, 323:1297-8.
2. Chiang AP, Butte AJ: Data-driven methods to discover molecular determinants of serious adverse drug events. Clinical pharmacology and therapeutics 2009, 85:259-68.
3. The end of theory: the data deluge makes the scientific method obsolete. [http://www.wired.com/science/discoveries/magazine/16-07/pb_theory]
4. Hubbard T: The Ensembl genome database project. Nucleic Acids Research 2002, 30:38-41.
5. Karolchik D: The UCSC Genome Browser Database. Nucleic Acids Research 2003, 31:51-54.
6. Maglott D, Ostell J, Pruitt KD, Tatusova T: Entrez Gene: gene-centered information at NCBI. Nucleic acids research 2005, 33:D54-8.
7. Ashburner M, Ball CA, Blake JA, et al: Gene ontology: tool for the unification of biology. The Gene Ontology Consortium. Nature genetics 2000, 25:25-9.
8. Bizer C, Heath T, Berners-Lee T: Linked Data - The Story So Far. International Journal on Semantic Web and Information Systems (IJSWIS) 2009.
9. Linked Data | Linked Data - Connect Distributed Data across the Web. [http://linkeddata.org]
10. Linked data - Design issues. [http://www.w3.org/DesignIssues/LinkedData.html]
11. Vandervalk BP, McCarthy EL, Wilkinson MD: Moby and Moby 2: creatures of the deep (web). Briefings in bioinformatics 2009, 10:114-28.
12. Where the semantic web stumbled, linked data will succeed - O'Reilly Radar. [http://radar.oreilly.com/2010/11/semantic-web-linked-data.html]
13. Berners-Lee T, Weitzner DJ, Hall W, et al: A Framework for Web Science. Foundations and Trends® in Web Science 2006, 1:1-130.
14. Hendler J, Berners-Lee T: From the Semantic Web to social machines: A research challenge for AI on the World Wide Web. Artificial Intelligence 2010, 174:156-161.
15. Almeida JS, Chen C, Gorlitsky R, et al: Data integration gets "Sloppy". Nature biotechnology 2006, 24:1070-1.
16. Deus HF, Stanislaus R, Veiga DF, et al: A Semantic Web management model for integrative biomedical informatics. PloS one 2008, 3:e2946.
17. Putting the Web back in Semantic Web. [http://www.w3.org/2005/Talks/1110-iswc-tbl/#(1)]
18. SPARQL Query Language for RDF. [http://www.w3.org/TR/rdf-sparql-query]
19. Alexander K, Cyganiak R, Hausenblas M, Zhao J: Describing Linked Datasets: On the Design and Usage of voiD, the "Vocabulary Of Interlinked Datasets". Linked Data on the Web Workshop (LDOW 09), in conjunction with 18th International World Wide Web Conference (WWW 09) 2009.
20. Cheung KH, Frost HR, Marshall MS, et al: A journey to Semantic Web query federation in the life sciences. BMC bioinformatics 2009, 10(Suppl 1):S10.
21. A Prototype Knowledge Base for the Life Sciences. [http://www.w3.org/TR/hcls-kb]
22. Belleau F, Nolin M-A, Tourigny N, Rigault P, Morissette J: Bio2RDF: towards a mashup to build bioinformatics knowledge systems. Journal of biomedical informatics 2008, 41:706-16.
23. Smith B, Ashburner M, Rosse C, et al: The OBO Foundry: coordinated evolution of ontologies to support biomedical data integration. Nature biotechnology 2007, 25:1251-5.
24. Taylor CF, Field D, Sansone SA, et al: Promoting coherent minimum reporting guidelines for biological and biomedical investigations: the MIBBI project. Nature biotechnology 2008, 26:889-96.
25. Noy NF, Shah NH, Whetzel PL, et al: BioPortal: ontologies and integrated data resources at the click of a mouse. Nucleic acids research 2009, 37:W170-3.
26. Deus HF, Prud E, Zhao J, Marshall MS, Samwald M: Provenance of Microarray Experiments for a Better Understanding of Experiment Results. ISWC 2010 SWPM 2010.
27. Stein LD: Integrating biological databases. Nature reviews Genetics 2003, 4:337-45.
28. Goble C, Stevens R: State of the nation in data integration for bioinformatics. Journal of Biomedical Informatics 2008, 41:687-693.
29. Ludäscher B, Altintas I, Bowers S, et al: Scientific Process Automation and Workflow Management. In Scientific Data Management. Edited by Shoshani A, Rotem D. Chapman; 2009.
30. Nelson B: Data sharing: Empty archives. Nature 2009, 461:160-3.
31. Stanislaus R, Chen C, Franklin J, Arthur J, Almeida JS: AGML Central: web based gel proteomic infrastructure. Bioinformatics (Oxford, England) 2005, 21:1754-7.
32. Silva S, Gouveia-Oliveira R, Maretzek A, et al: EURISWEB–Web-based epidemiological surveillance of antibiotic-resistant pneumococci in day care centers. BMC medical informatics and decision making 2003, 3:9.
33. Describing Linked Datasets with the VoiD Vocabulary. [http://www.w3.org/TR/2011/NOTE-void-20110303]
34. HIPAA Administrative Simplification Statute and Rules. [http://www.hhs.gov/ocr/privacy/hipaa/administrative/index.html]
35. Socially Aware Cloud Storage. [http://www.w3.org/DesignIssues/CloudStorage.html]
36. Koslow SH: Opinion: Sharing primary data: a threat or asset to discovery? Nature reviews Neuroscience 2002, 3:311-3.
37. Baggerly KA, Coombes KR: Deriving chemosensitivity from cell lines: Forensic bioinformatics and reproducible research in high-throughput biology. The Annals of Applied Statistics 2009, 3:1309-1334.
38. Hodge G: Systems of Knowledge Organization for Digital Libraries: Beyond Traditional Authority Files. 2000.
39. SKOS Simple Knowledge Organization System Reference. [http://www.w3.org/TR/skos-reference]
40. Almeida JS, Deus HF, Maass W: S3DB core: a framework for RDF generation and management in bioinformatics infrastructures. BMC bioinformatics 2010, 11:387.
41. Deus HF, Veiga DF, Freire PR, et al: Exposing The Cancer Genome Atlas as a SPARQL endpoint. Journal of Biomedical Informatics 2010, 43:998-1008.
42. Correa MC, Deus HF, Vasconcelos AT, et al: AGUIA: autonomous graphical user interface assembly for clinical trials semantic data services. BMC medical informatics and decision making 2010, 10.
43. Freire P, Vilela M, Deus H, et al: Exploratory analysis of the copy number alterations in glioblastoma multiforme. PloS one 2008, 3:e4076.
44. NCBO Ontology Widgets. [http://www.bioontology.org/wiki/index.php/NCBO_Widgets]
45. Bussler C: Is Semantic Web Technology Taking the Wrong Turn? Ieee Internet Computing 2008, 12:75-79.
46. What people find hard about linked data. [http://dynamicorange.com/2010/11/15/what-people-find-hard-about-linked-data]
47. Raja A, Lakshmanan D: Domain Specific Languages. International Journal of Computer Applications 2010, 1:99-105.
48. SPARQL Update. [http://www.w3.org/TR/sparql11-update]
49. Carroll JJ, Bizer C, Hayes P, Stickler P: Named graphs, provenance and trust. Proceedings of the 14th international conference on World Wide Web - WWW 05 2005, 14:613.
50. S3DB operator function states. [http://code.google.com/p/s3db-operator]
51. S3DB Operators. [http://s3db-operator.googlecode.com/hg/propagation.html]
52. Deus HF, Sousa MA de, Carrico JA, Lencastre H de, Almeida JS: Adapting experimental ontologies for molecular epidemiology. AMIA Annual Symposium proceedings 2007, 935.
53. The OAuth 1.0 Protocol. [http://tools.ietf.org/html/rfc5849]
54. Francisco AP, Bugalho M, Ramirez M, Carriço JA: Global optimal eBURST analysis of multilocus typing data using a graphic matroid approach. BMC bioinformatics 2009, 10:152.
55. S3QL serialization engine. [http://js.s3db.googlecode.com/hg/translate/quickTranslate.html]
56. Ippolito G, Leone S, Lauria FN, Nicastri E, Wenzel RP: Methicillin-resistant Staphylococcus aureus: the superbug. International journal of infectious diseases 2010, 14:S7-S11.
57. Harris SR, Feil EJ, Holden MTG, et al: Evolution of MRSA during hospital transmission and intercontinental spread. Science (New York, NY) 2010, 327:469-74.
58. Linked Data API. [http://code.google.com/p/linked-data-api]

doi:10.1186/1471-2105-12-285
Cite this article as: Deus et al.: S3QL: A distributed domain specific language for controlled semantic integration of life sciences data. BMC Bioinformatics 2011 12:285.


  • Abstract
    • Background
    • Results
    • Conclusions
  • Background
    • 1.1 Linked Data Best Practices
    • 1.2 Challenges involved in Publishing Primary Experimental Life Sciences Datasets as Linked Data
    • 1.3 Knowledge organization systems for Linked Data
  • Methods
    • The S3DB Knowledge Organization Model
    • Operators for Permission Control
    • Components of a Distributed System
    • Availability and Documentation
  • Results
    • S3QL Syntax
    • S3QL Permission Control
    • Global Dereferencing for Distributed Queries
    • Implementation and benchmarking
  • Discussion
    • Gastro-intestinal Clinical Trials Use Case
    • The Cancer Genome Atlas Use Case
    • A Molecular Epidemiology Use Case
  • Conclusions
  • Acknowledgements and funding
  • Author details
  • Authors' contributions
  • Competing interests
  • References

identification of the S3DB Query Language (S3QL), a Domain Specific Language (DSL) devised to abstract most of the details involved in managing interlinked, contextualized RDF statements.

S3QL is not meant as an alternative to SPARQL but rather as a complement: data management operations enabled by S3QL can also be formalized in SPARQL. However, the availability of a data management DSL that can be serialized to SPARQL provides an abstraction layer that can be intuitively used by domain experts. As such, DSLs can provide a solution for bridging the gap between the formalisms required by Linked Data best practices, such as SPARQL and RDFS, and the basic controlled read/write management requirements of HCLS experts [12,45,46]. DSLs optimize beyond general purpose languages in the identification of the domain in which a task belongs, drastically reducing the development time [47]. The task of adding a graph to a triple store is supported by most graph stores by means of the SPARQL 1.1 Update language [48]. To enable controlled "write" operations targeting the dataset, it would be useful to annotate, for example, the creator of a named graph, under which circumstances it was created, and who has permission to modify it. Similarly, upon changes to the dataset, annotation of the modifier and a comment describing the change would be in the interest of the communities using the data. Many triple stores are in fact quad stores, to enable partial support of that requirement for contextual representation. The most common approach is to use a named graph, a set of triples identified by a URI [49], that indicates the source of a graph. The S3DB Query Language (S3QL) presented in this report was devised with the intent of automating Linked Data management by creating those contextual descriptors in a single S3QL transaction, including author, creation date and description of the data.

By making use of those contextual descriptors, we propose a method for fine grained permission control in S3QL that relies on s3db:operators [50], a class of functions with states that may be used as the predicate of an RDF triple between a user and a dataset with privacy requirements. These operators, described in [40] and made available for experimentation at [51], operate on the adjacency matrix defined by the nodes and edges of an RDF graph. They can be applied in a variety of scenarios, such as optimizing queries or, as is the case with S3QL, propagating permission assignments. In the latter case, an adjacency matrix includes both the edges between instances of S3DB entities and the transitions of permission on S3DB entities, such as, e.g., the assertion that a User's permission on a Project propagates to its entities. Accordingly, by defining user permissions as states of s3db:operators, the core model's adjacency matrix is used to propagate the ability to control, view and modify S3DB entities.

We have found the target audience for S3QL to be both life sciences application developers, who use it through a RESTful application programming interface (API), and life sciences researchers, who use it through user interfaces for weaving the ontologies that best represent the critical contextual information in their experimental results. The applicability of S3QL to other linked data KOSs, such as the Simple Knowledge Organization System (SKOS) [39], is explored with an example, and the advantages of the solution proposed are discussed in three biomedical datasets with very different requirements for controlled operations: gastrointestinal clinical trials [42], cancer genomic characterization [41] and molecular epidemiology [52].

Methods
This section overviews the core model for S3DB, including the set of operators that enable fine grained permission control, and the distributed infrastructure supporting S3QL. The principles defined here are implemented as a prototypical application available at http://s3db.org.

Deus et al. BMC Bioinformatics 2011, 12:285. http://www.biomedcentral.com/1471-2105/12/285

The S3DB Knowledge Organization Model

S3QL is a DSL to programmatically manipulate data as instances of entities defined in a KOS. One of the key features of the KOS defined using the S3DB core model [16] is the use of typed named graphs to separate the identification of the domain (the metadata describing the data) from its observational instantiation (the data itself). We have previously shown that this approach to representing RDF greatly facilitates the assembly of SPARQL and lowers the entry barrier for biomedical researchers interested in using Semantic Web technologies to address their data management needs [41]. That separation is achieved by representing the domain as triples that are themselves the predicates of the statements that instantiate that domain (as detailed in Figure 2 of [40]). For example, the triple [Person hasAge Age], identified as R12 through a named graph of type s3db:rule, describes the domain, while the triple [John R12 26], identified by a named graph of type s3db:statement, instantiates that domain. Through the logic encoded in the RDF Schema definitions of domain (rdfs:domain) and range (rdfs:range), the assertion that "John" is of type "Person" and that "26" is an "Age" is enabled in the S3DB KOS. S3DB's use of named graphs to describe the domain enables updates to the domain without affecting the consistency of its instantiation: in the example above, replacing "hasAge" with "hasAgeInYears" will not affect queries that have already been assembled using that property.

In the S3DB core [40], a meta-model for this data is also created with the specific objective of enabling propagation of operations, such as permission assignments, between the domain description and the data itself, as described in the following section (see Figures 1, 2 and 3). In the example above, the two triples are respectively assigned to entities of type s3db:rule and s3db:statement, where the index "Person" is identified by a named graph of type s3db:collection and "John" is identified by a named graph of type s3db:item. The S3DB Core specifies three other entities, which are specifically devised to enable knowledge organization and operator propagation: s3db:project entails a list of s3db:rule and s3db:collection entities and is typically applied in domain contextualization; s3db:deployment corresponds to the physical location of an S3DB system (its URL); and s3db:user is the subject of permission assignment operations. It is worth noting that, by making use of S3DB entities, blank nodes are avoided by assigning a unique alphanumeric identifier to every instance of an S3DB entity. The S3DB entities can also be identified using the first letter of their names (D, P, R, C, I, S or U), which will be used in subsequent examples to indicate, respectively, s3db:deployment, s3db:project, s3db:rule, s3db:collection, s3db:item, s3db:statement or s3db:user.
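The separation between a domain triple and the statements that instantiate it can be sketched in a few lines of Python. This is an illustration only, not the S3DB implementation: the class names, field names and helper functions below are hypothetical, and named graphs are reduced to plain identifiers.

```python
from dataclasses import dataclass, replace


@dataclass(frozen=True)
class Rule:
    """A domain triple such as [Person hasAge Age], identified as R12."""
    rule_id: str
    domain_type: str   # type implied for the subject of a statement
    predicate: str     # human-readable label of the relationship
    range_type: str    # type implied for the value of a statement


@dataclass(frozen=True)
class Statement:
    """An instantiation triple such as [John R12 26]."""
    subject: str
    rule_id: str       # the Rule identifier acts as the predicate
    value: str


def infer_types(statement, rules):
    """Return the (instance, type) pairs implied by the rule's domain and range."""
    rule = rules[statement.rule_id]
    return [(statement.subject, rule.domain_type),
            (statement.value, rule.range_type)]


def rename_predicate(rules, rule_id, new_predicate):
    """Return a copy of the rules in which one rule has a new predicate label."""
    updated = dict(rules)
    updated[rule_id] = replace(rules[rule_id], predicate=new_predicate)
    return updated


rules = {"R12": Rule("R12", "Person", "hasAge", "Age")}
statement = Statement("John", "R12", "26")
renamed = rename_predicate(rules, "R12", "hasAgeInYears")
```

Because the statement refers to R12 rather than to the label "hasAge", `infer_types(statement, renamed)` gives the same result as `infer_types(statement, rules)`, which mirrors the claim that domain updates do not disturb existing instantiations.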

Figure 1 S3QL language specification using rail diagrams. Rail diagrams are read from left to right; any string that can be composed by following these diagrams is a valid S3QL query. Valid forms of E and Ea will vary according to the core model used in the KOS. For example, if the S3DB Core Model is used, any entity in Figure 2 can be used in place of E; upon choice of E, Ea is any attribute that can be reached by following a line from E.

Operators for Permission Control

The second key feature that makes S3DB appropriate for controlled management operations is support for permission control embedded in its core model. As described above and in [40], the hierarchy of permissions to view/edit entities in S3DB is modelled by an adjacency matrix, which is used as a transition matrix in the propagation of permission states. For a walkthrough of the propagation mechanism, see Additional file 1. The s3db:operator states applied to the S3DB transition matrix modulate propagation by three core functions: merge, migrate and percolate. This behaviour for propagation of permission is described in detail in equation 5 of [40] and is reproduced here in Equation 1. The S3DB transition matrix (T) is defined by 12 s3db:relationships describing dependencies and inference rules between entities of the S3DB core model. The operator state vector (f) is used as the predicate of a triple established between an s3db:user and an entity of the S3DB core model. The JavaScript application at [51] can also be used to try out this set of propagation behaviours for s3db:operators, either on the S3DB transition matrix or with alternative adjacency matrices.

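\begin{equation}
\begin{aligned}
f_{\mathrm{object},\,k+1} &= \mathrm{merge}\bigl(\bigl[\, f_{\mathrm{object},\,k},\ \mathrm{migrate}(T \times f_{\mathrm{subject},\,k}) \,\bigr]\bigr) \\
l &= \mathrm{length}(f) \\
l = 1 &\rightarrow \mathrm{migrate}(f) = f = f[1] \\
l > 1 &\rightarrow \mathrm{migrate}(f) = f[2, \ldots, l]
\end{aligned}
\tag{1}
\end{equation}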
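As a reading aid for Equation 1, the migrate step alone can be sketched in Python. This is an illustration under stated assumptions: operator state vectors are modelled as plain lists of state strings, the second case is read as "positions 2 through l", and the function name simply mirrors the equation. The merge function and the product with T are deliberately left out, since their definitions are given in [40] rather than here.

```python
def migrate(f):
    """Apply the migrate step of Equation 1 to an operator state vector.

    f is assumed to be a non-empty list of states, e.g. ["ysn"].
    """
    if not f:
        raise ValueError("operator state vector must not be empty")
    if len(f) == 1:
        # l = 1: the single state is carried over unchanged.
        return list(f)
    # l > 1: drop the first state and keep positions 2..l
    # (1-based in the paper, index 1 onwards in Python).
    return list(f[1:])
```

With this reading, a vector of length three loses one state per migration until a single state remains, which then persists.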

Figure 2 Entities in the S3DB Core Model and their attributes. A minimal set of common attributes was defined (left) for each of the S3DB entities using RDF Schema (rdfs) and Dublin Core (dc) terminology; these are rdf:id, rdfs:label, dc:description, dc:created and dc:creator. Other attributes, which are specific to each of the S3DB entities (right), such as foaf:mbox for the entity User or s3db:project_id for the entity Collection, reflect the s3db:relationships described above and formalized in the S3DB conceptual model (Figure 3).

The s3db:operators [40] have a scope and applicability in linked data beyond permission management. In S3QL we define three operator types for controlled management operations, one for each of the rights to view, change (edit) or use instances of S3DB entities. The format used to assign permission was defined as a three-character string, where each operator occupies, respectively, the first, second or third position and may assume the value N, S or Y according to the level of permission intended: no permission (N), permission limited to the creator of the resource (S) or full permission (Y). For example, the permission assignment "YSN" specifies complete permission to view the subject entity (Y), partial permission to change it (S) and no permission to use it (N). States may be defined as dominant by use of uppercase characters (Y, N or S) or recessive by use of lowercase characters (y, n or s). Dominant and recessive permissions are used to decide on the outcome of multiple permissions converging on the same entity (as detailed in [40]). Missing permission states, indicated by the dash character '-' (which has no lower or upper case), are also allowed, as well as a mechanism to succinctly specify transitions with variable memory length (l in Equation 1). The propagation of permissions in the S3DB Core Model ensures that, for every entity and every user, two types of permission are defined: the assigned permission, which is the permission state assigned directly to a user on an entity, and the effective permission, which is the result of the propagation of s3db:operators.
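The three-character permission format described above lends itself to a small parser. The following Python sketch is illustrative only; the names `OperatorState` and `parse_permission`, and the level labels "full", "creator-only", "none" and "missing", are hypothetical and not part of S3QL. The rules for merging dominant and recessive states are not reproduced, because they are detailed in [40].

```python
from dataclasses import dataclass

RIGHTS = ("view", "change", "use")          # first, second, third position
LEVELS = {"y": "full", "s": "creator-only", "n": "none"}
VALID_CHARACTERS = set("YSNysn-")


@dataclass(frozen=True)
class OperatorState:
    right: str       # "view", "change" or "use"
    level: str       # "full", "creator-only", "none" or "missing"
    dominant: bool   # True for uppercase states, False otherwise


def parse_permission(assignment):
    """Split an assignment such as "YSN" into one state per right."""
    if len(assignment) != 3 or any(c not in VALID_CHARACTERS for c in assignment):
        raise ValueError(f"invalid permission assignment: {assignment!r}")
    states = {}
    for right, character in zip(RIGHTS, assignment):
        if character == "-":
            # A missing state has no case, so it is never dominant.
            states[right] = OperatorState(right, "missing", False)
        else:
            states[right] = OperatorState(
                right, LEVELS[character.lower()], character.isupper()
            )
    return states
```

For instance, `parse_permission("YSN")["change"]` describes a dominant, creator-only right to change the entity.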

Figure 3 The S3DB conceptual model. Five attributes (id, label, description, creator and created) and four methods (select, update, insert and delete) are common to all S3DB entities. In the current S3QL implementation, the label and description attributes are defined by the submitter of the data, whereas the id, created and creator attributes are automatically assigned by the system. Other dependencies were devised to comply with the definition of s3db:relationships.


Components of a Distributed System

One of the requirements for RDF-based knowledge management ecosystems is the availability of queries spanning multiple SPARQL endpoints. Automation of distributed queries in systems supporting permission control, such as S3DB, is challenged by user authentication. In S3QL we propose addressing this through delegation to authentication authorities. As a result, a user (or usage) can be identified by a URI that is independent of the authorities that validate it. Whenever possible, it is recommended that authentication credentials be protected by use of OAuth [53].

Use of URIs and Internationalized Resource Identifiers (IRIs) to identify data elements is one of the core principles of Linked Data. However, many programming environments cannot easily handle URIs as element identifiers. Problems range from decreased processing speed to a need for encoding the URIs in web service exchanges. In anticipation of that class of problems, the URIs for entities in S3DB are interchangeable with alphanumeric identifiers formulated as the concatenation of one of D, U, P, C, R, I or S (referring to the S3DB entities described in The S3DB Knowledge Organization Model), identifying the entity, and a unique number. As an example, for a deployment located at URL http://q.s3db.org/demo, the alphanumeric P126 is resolvable to an entity of type Project with URI http://q.s3db.org/s3dbdemo/P126. To facilitate exchange of URIs in distinct deployments, the URI above could also be specified as D282P126, where "D282" is the alphanumeric identifier of the S3DB deployment located at URL http://q.s3db.org/demo. Every s3db:deployment is identified by a named graph in the form D[number]; for completeness, metadata pertaining to each s3db:deployment, such as the corresponding URL, is described using the Vocabulary of Interlinked Datasets (VoID) [19] and shared through a root location.
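The alphanumeric identifiers described above (an entity letter followed by a number, optionally prefixed by one or more deployment identifiers) can be taken apart with a short routine. The Python sketch below is a hedged illustration: `parse_uid` and `split_deployment` are hypothetical names, and the assumption that only deployment segments may precede the final entity is ours, not something the S3QL specification is quoted as stating.

```python
import re

# One entity letter (D, U, P, C, R, I or S) followed by its unique number.
_SEGMENT = re.compile(r"([DUPCRIS])(\d+)")


def parse_uid(uid):
    """Split an identifier such as "D282P126" into (letter, number) pairs."""
    segments = []
    position = 0
    while position < len(uid):
        match = _SEGMENT.match(uid, position)
        if match is None:
            raise ValueError(f"malformed S3DB identifier: {uid!r}")
        segments.append((match.group(1), int(match.group(2))))
        position = match.end()
    if not segments:
        raise ValueError("empty S3DB identifier")
    return segments


def split_deployment(uid):
    """Return (deployment prefixes, local identifier) for a UID.

    Assumes every segment before the last one names a deployment.
    """
    *prefix, (letter, number) = parse_uid(uid)
    if any(kind != "D" for kind, _ in prefix):
        raise ValueError(f"only deployments may prefix an identifier: {uid!r}")
    return [f"D{n}" for _, n in prefix], f"{letter}{number}"
```

Under these assumptions, `split_deployment("D282P126")` separates the deployment D282 from the local Project identifier P126, and a bare `"P126"` yields an empty prefix list.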

Availability and Documentation

The specification of the S3QL language has been made available at http://link.s3db.org/specs, and one example of the output RDF is available at http://link.s3db.org/example. S3QL has been implemented through a REST application programming interface (API) for the S3DB prototype, which is publicly available at http://s3db.org. Both the prototype and its API were developed in PHP, with MySQL or PostgreSQL for data storage. Documentation about the S3DB implementation of S3QL as an API can be found at http://link.s3db.org/docs. S3QL queries may be tested at the demo implementation at http://link.s3db.org/s3qldemo, and a translator for the compact notation is available at http://link.s3db.org/translate.

Results

S3QL Syntax

S3QL is a domain specific language devised for facilitating management operations such as "insert", "update" or "delete" using entities of a Linked Data KOS such as the S3DB core model described above. Its syntax, however, is only loosely tied to the S3DB Core Model and can easily be applied to a set of KOSs' core models, of which S3DB is one. The complete syntax of S3QL, in its XML (eXtensible Markup Language) flavour, is represented in the railroad diagram of Figure 1. The S3QL syntax includes three elements: the description of the operation, the target entity and the input parameters. Four basic operation descriptions were deemed necessary to fully support read/write operations: select, insert, update and delete. The actions of these operations mimic those of the structured query language (SQL) and target instances of entities (E) defined in the core model. Input parameters include the set of attributes defined for each of the entities, either in the alphanumeric form associated with entity instances (EntityId) or in the form of EntityAttributeValue (Ea). The values for Ea are determined upon choice of E; for an example using S3DB entities and attributes, see Figure 2. E may be replaced with any of the entities defined in the S3DB Core Model (Deployment, User, Project, Collection, Rule, Item or Statement); upon choice of E, valid forms of Ea include any of the attributes defined for E (e.g. rdf:id, rdfs:label, rdfs:comment). A table summarizing all available operations, targets and input parameters is made available at http://link.s3db.org/specs. The formal S3QL syntax is completed by enclosing the outcome of one of the diagrams in Figure 1 within the <S3QL> tag. For example, the following XML structure is a valid S3QL query for an operation of type insert, where the target is the S3DB entity Project and the input parameter, formulated as EntityAttributeValue, is "label = Test":

<S3QL>
<insert>project</insert>
<where>
<label>Test</label>
</where>
</S3QL>

The set of 12 s3db:relationships (see Table 1 of [40]) in the S3DB Model determines the organizational dependencies of S3DB entities. For example, s3db:PC is the s3db:relationship that specifies a dependency between an instance of a Collection (C) and an instance of a Project (P) (Figure 3). The S3QL syntax fulfils this constraint by assigning project_id (the identifier of an S3DB Project) as an attribute of a Collection. In this description of attributes associated with the S3DB core model we make use of the assumption, as in other KOSs and in Linked Data in general, that there is no restriction to adding relationships beyond those described here. S3QL was identified as the minimal representation to interoperate with the S3DB core model, and therefore only those relationships are explored in this report.

The syntax diagram in Figure 1 generates XML, a standard widely used in web service implementations. That alternative often results in verbose queries that could easily be assembled from more compact notations. One example to consider is the form action(E | Ea = value). Here the symbol '|' should be interpreted as in Bayesian inference, as a condition, and be read "given that". The letter "E" corresponds to the first letter of any S3DB entity (D, P, R, C, I, S or U) and Ea is any of its attributes, as described in Figure 2. In this notation, the query insert(P | label = test) is equivalent to the example query above. That particular variant is also accepted by the S3DB prototype, and a converter from this syntax into the complete S3QL/XML syntax was made available at http://link.s3db.org/translate. For further compactness of this alternative formulation, entity identifiers used as parameters may be replaced with their corresponding alphanumeric identifiers; for example, project_id = 156 may be replaced with P156. This alternative notation will be used in the subsequent examples.
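The relationship between the compact notation and the XML flavour can be made concrete with a small converter. The Python sketch below is not the translator published by the authors; it is a minimal illustration that assumes conditions take the form `Ea = value`, that several conditions are separated by commas, and that only insert, update and delete are handled (select is left out because its rail diagram uses a separate from element). The function name `compact_to_xml` is hypothetical.

```python
import re
from xml.sax.saxutils import escape

ENTITY_NAMES = {
    "D": "deployment", "U": "user", "P": "project", "C": "collection",
    "R": "rule", "I": "item", "S": "statement",
}

# action(E | conditions), restricted to the three write operations.
_QUERY = re.compile(r"^\s*(insert|update|delete)\s*\(\s*([DUPCRIS])\s*\|(.*)\)\s*$")
_ATTRIBUTE = re.compile(r"[A-Za-z_]\w*")


def compact_to_xml(query):
    """Expand e.g. "insert(P | label = Test)" into an <S3QL> XML string."""
    match = _QUERY.match(query)
    if match is None:
        raise ValueError(f"unsupported compact S3QL query: {query!r}")
    action, letter, conditions = match.groups()
    elements = []
    for condition in conditions.split(","):
        name, separator, value = condition.partition("=")
        name = name.strip()
        if not separator or _ATTRIBUTE.fullmatch(name) is None:
            raise ValueError(f"expected 'attribute = value', got {condition!r}")
        # Escape the value so that characters such as & stay valid XML.
        elements.append(f"<{name}>{escape(value.strip())}</{name}>")
    entity = ENTITY_NAMES[letter]
    body = "".join(elements)
    return f"<S3QL><{action}>{entity}</{action}><where>{body}</where></S3QL>"
```

Calling `compact_to_xml("insert(P | label = Test)")` produces the same elements as the XML example given for the insert operation, written on a single line.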

S3QL Permission Control

Permission states are assigned using an S3QL query such as insert(U | U1, P157, permission_level = ysn), which includes the action insert, the target entity User and three input parameters: the identifier of the User (U1), the identifier of the entity (P157) and the permission assignment (ysn). Effectively, this will result in the creation of the triple [U1 ysn P157], where the subject is of type s3db:user, the predicate is of type s3db:operator and the object is of type s3db:project. The inclusion of this triple in a dataset will modulate the type of management operation that a user may perform. As described in the Methods section, each position in the permission assignment operator (ysn) encodes, respectively, permission to "view", "change" or "use" the object entity. The values y, s and n indicate, respectively, that the user has full permission to view it (y), permission to change its metadata only if the user was the creator of the entity (s), and no permission to insert child entities (n). Each S3QL operation is therefore tightly woven to one of the three operators: select is controlled by "view", update and delete are controlled by "change", and insert is controlled by "use" (shaded areas in Figure 1). The "use" operator encodes the ability of a user to create new relationships with the target entity, which is defined separately from the right to "change" it. For example, if a user (U1) is granted "y" as the effective permission to "change" an s3db:rule, then the metadata describing it may be altered. If, however, that same user is granted permission "n" to "use" that same Rule, the user is prevented from creating Statements using that Rule. Although "use" may be interpreted as being equivalent to "insert" or "append" in other systems, we have chosen to separate the terms describing the operator "use" from the S3QL action "insert". The permission assigned at the dataset level will then propagate in the S3DB transition matrix following the behaviour formalized in Equation 1, therefore avoiding the need to assign permission to every user on every entity. It is worth noting that the DSL presented here is extensible beyond the four management actions (select, insert, update, delete) described. The s3db:operators that control permission on these actions are also extensible beyond "view", "change" and "use", and different implementations may support alternative states.

The permission control behaviour for S3QL operations can be illustrated through the use of Quadratus, an application available at http://q.s3db.org/quadratus that can be pointed at any S3DB deployment to assign permission states on S3DB entities to different users (Figure 4). Other use case scenarios are also explored in the S3QL specification at http://link.s3db.org/specs.
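The mapping between S3QL actions and the three operators described in this section can be sketched as a simple check. The Python below is an illustration, not the prototype's logic: `is_allowed` is a hypothetical helper, it takes an already-resolved effective permission string, and treating a missing state ('-') as a denial is our assumption.

```python
# Position of the operator that controls each S3QL action:
# 0 = "view", 1 = "change", 2 = "use".
ACTION_TO_POSITION = {"select": 0, "update": 1, "delete": 1, "insert": 2}


def is_allowed(action, effective_permission, is_creator=False):
    """Decide whether an action is permitted by an effective permission.

    effective_permission is a three-character string such as "ysn".
    is_creator states whether the requesting user created the entity.
    """
    if len(effective_permission) != 3:
        raise ValueError("permission must have exactly three characters")
    state = effective_permission[ACTION_TO_POSITION[action]].lower()
    if state == "y":
        return True            # full permission
    if state == "s":
        return is_creator      # limited to the creator of the resource
    return False               # "n", or "-" (assumed to deny)
```

With the effective permission "ysn", a select is allowed, an update is allowed only for the creator of the entity, and an insert of child entities is refused.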

Global Dereferencing for Distributed Queries

A simple dereferencing system was devised for S3DB identifiers that relies on the identification of root deployments, i.e. S3DB systems where alphanumeric identifiers for S3DB deployments can be dereferenced to URLs. This simple mechanism enables complex transactions of controlled data. For an example of this behaviour see Figure 5, where the S3DB UID D327R172930, identifying an entity of type Rule (R172930) available in deployment D327, is being requested by a user registered in deployment D309. In order to retrieve the requested data, the URL of deployment D327 must first be resolved at a root deployment such as, for example, http://root.s3db.org.

The dereferencing mechanism is also applicable in more complex cases where the root of two deployments sharing data is not the same. Prepending the deployment identifier of the root to the UIDs, as in D1016666D327R172930, where D1016666 identifies the root deployment, would result in recursive URL resolution steps such as select(D.url | D1016666) prior to step 4 in Figure 5. This mechanism avoids broken links when S3DB deployments are moved to different URLs by enabling deployment metadata to be updated securely at the root using a public/private key encryption system.
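The resolution of a deployment identifier to a URL, with a locally cached answer that is trusted for 24 hours as described for Figure 5, can be sketched as follows. This Python example is a hedged illustration: `DeploymentResolver` and its `query_root` callable are hypothetical stand-ins for whatever issues the select query for the deployment URL at the root deployment, and no network access is performed.

```python
import time

CACHE_TTL_SECONDS = 24 * 60 * 60  # cached URLs are trusted for 24 hours


class DeploymentResolver:
    """Resolve deployment identifiers (e.g. "D327") to URLs via a root."""

    def __init__(self, query_root, clock=time.time):
        # query_root: hypothetical callable, deployment id -> URL,
        # standing in for a query executed at the root deployment.
        self._query_root = query_root
        self._clock = clock
        self._cache = {}  # deployment id -> (url, time stored)

    def resolve(self, deployment_id):
        now = self._clock()
        cached = self._cache.get(deployment_id)
        if cached is not None and now - cached[1] < CACHE_TTL_SECONDS:
            # Fresh entry: answer locally without contacting the root.
            return cached[0]
        # Missing or stale entry: ask the root and remember the answer.
        url = self._query_root(deployment_id)
        self._cache[deployment_id] = (url, now)
        return url
```

Injecting the clock keeps the sketch testable; in use, the default `time.time` would apply and only the first request for a given deployment within a day would reach the root.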

Implementation and benchmarking

In the current prototype implementation, S3QL is submitted to S3DB deployments using either a GET or POST request and may include an optional authentication token (key). The REST specification [54] suggests separate HTTP methods according to the intention of the operation: often "GET" is used to retrieve data, "PUT" is used to submit data, "POST" is used to update data and "DELETE" is used to remove data. There are, however, many programming environments that implement only the REST "GET" method, including many popular computational statistics programming frameworks such as R and Matlab. Therefore, in order to fully explore the integrative potential of this read/write semantic web service and to support operations beyond the four implemented, the S3DB prototype implementation of S3QL supports the "GET" method for all S3QL operations, with the parameters of the S3QL call appended to the URL. One drawback of relying on GET is the limit imposed by the browser on URL length. To address this potential problem, the S3DB prototype also supports the use of "POST" for S3QL calls.

Two further challenges needed to be addressed in the prototypical implementation of S3QL: 1) the need for a centralized root location to support dereferencing of deployment URIs when the condensed version is used (e.g. D282), and 2) the requirement that, for distributed queries on REST systems, users authenticate in multiple KOSs. The first challenge was addressed by configuring an S3DB deployment as the root location, available at http://root.s3db.org. Deployment metadata is submitted to this root deployment at configuration time using S3QL; data pertaining to each deployment can therefore be dereferenced to a URL using http://root.s3db.org/D[numeric]. A root deployment other than the default may be chosen during installation. To avoid overloading these root deployments with too many requests, a local 24-hour cache of all accessed deployments is kept in each S3DB deployment using the same strategy: if the URL is cached, it will not be requested from the root deployment. To address the second challenge, each s3db:deployment can store any number of authentication services supporting the HTTP, FTP or LDAP protocols. Once the user is authenticated, temporary surrogate tokens are issued with each query. When coupled with the user identifier in the format of a URI, these tokens effectively identify the user performing the query, regardless of the S3DB deployment where the query is requested.

Screencasts illustrating processing time of data manipulation using S3QL are available at http://www.youtube.com/watch?v=2KZC6kI609s and http://www.youtube.com/watch?v=FJSYLCwBaPI.

Figure 4 Quadratus, an interface to illustrate S3QL's permission control mechanism and its effect on S3QL queries. Projects, Collections, Rules, Items and Statements associated with the GI Clinical Trials Project are displayed with effective and assigned permissions. Collections and Rules retrieved using the S3QL queries select(C|P1196457) and select(R|P1196457) inherit the permission assignment in the Project, "ysn"; the assigned permission "N--" in the Collection "Demographics" results in the effective permission "Nsn", inherited by all Rules and Items that have a relationship with that Collection, effectively preventing gi_user1 from accessing its data. The directed labelled graph of the propagation resolution is displayed on the right side of the application, illustrating the propagation mechanism.

Figure 5 The global S3QL dereferencing system. User U78 of deployment D309 issues a command to request all entities of type Statement where the attribute Rule_id corresponds to the value D327R172930, through the S3QL operation select(S.value|D327R172930) (1). If the URL corresponding to deployment D327 is not cached locally, or has not been validated in the past 24 h, a query is issued and executed at the root deployment to retrieve the corresponding URL: select(D.url|D327) (2, 3). Once the URL is returned, query (1) is re-issued as select(S.value|R172930) and executed at the URL for D327 (4). To validate the user, deployment D327 issues the command select(U.id|U76) at D309 using the key provided (5, 6) and returns the data only if U.id matches the value for user_id (7).

Discussion

One of the major concerns in making use of Linked Data to improve health care and life sciences research is the need to ensure both the availability of contextual information about experimental datasets and the ability to protect the privacy of certain data elements which may identify an individual patient. Domain specific languages (DSLs) can ease the task of managing the contextual descriptors that would be necessary to implement permission control in RDF and, by doing so, could greatly accelerate the rate of adoption of Linked Data formalisms in the life sciences communities to improve scientific discovery. We have described S3QL, a DSL to perform read/write operations on entities of the S3DB Core Model. S3QL attempts to address the requirements in linking life sciences datasets, including both publishable and un-publishable data elements, by 1) including contextual descriptors for every submitted data element and 2) making use of those descriptors to ensure permission control managed by the data experts themselves. This avoids the need to break a consolidated dataset into its public and private parts when the results are acceptable for publication.

Applying S3QL to the S3DB Core Model in a prototypical application benefited from the definition of loosely defined boundaries for RDF data that enabled propagation of permission while avoiding the need to document a relationship for each data instance individually and for each user. The assembly of SPARQL queries is also facilitated by the identification of domain triples, using named graphs, apart from the data itself [41], and can be illustrated in the application at [55], where a subset of S3QL can be readily converted into the W3C standard SPARQL. Although the prototypical implementation of S3QL fits the definition of an API for S3DB systems, it is immediately apparent that the same notation could be easily and intuitively extended to other KOSs' core models. For example, pointing the tool at http://q.s3db.org/translate to a JavaScript Object Notation (JSON) representation of the SKOS core model (skos.js) instead of the default S3DB core model (s3db.js) results in a valid S3QL syntax that could easily be applied as an API for SKOS-based systems, as illustrated by applying the example query "select(C|prefLabel = animal)" using http://link.s3db.org/translate?core=skos.js&query=select(C|prefLabel=animal) to retrieve SKOS concepts labelled "animal". Adoption by life sciences application developers can be further smoothed by complete reliance on the REST protocol for data exchange and the availability of widely used formats such as JSON, XML or RDF/Turtle.

The applicability of S3QL to life sciences domains is illustrated here with three case studies: 1) in the domain of clinical trials, a project [42] that requires collaboration between departments with different interests; 2) in the domain of The Cancer Genome Atlas (TCGA) project, a multi-institutional effort that requires multiple authentication mechanisms and sharing of data among multiple institutions [41]; and 3) in the domain of molecular epidemiology, a project where non-public data stored in an S3DB deployment needs to be statistically integrated with data from a public repository. All described use cases shared one considerable requirement: the ability to include in the same dataset both published and unpublished results. As such, they required both the annotation of contextual descriptors of the data, enabled by S3QL, and the availability of controlled permission propagation, enabled by the S3DB model transition matrix. Future work in this effort may include the application of S3QL in a Knowledge Organization System based on SKOS terminology and the definition of a transition matrix for SKOS to enable controlled permission propagation.

Gastro-intestinal Clinical Trials Use Case

As part of a collaboration with the Department of Gastrointestinal Medical Oncology at The University of Texas M. D. Anderson Cancer Center, an S3DB deployment was configured to host data from gastro-intestinal (GI) clinical trials. A schema was developed using S3DB Collections and Rules, and S3QL insert queries were used to submit data elements as Items and Statements (see Figure 6). Two permission propagation examples are illustrated, one of a restrictive nature and the other permissive, which can also be explored at http://q.s3db.org/quadratus (Figure 4). In this example, the simple mechanisms of propagation defined for S3QL support the complex social interaction that requires a fraction of the dataset to be shared with certain users but not with others. Contextual usage is therefore a function of the attributes of the data itself (e.g. its creator) and the user identification token that is submitted with every read/write operation.

The Cancer Genome Atlas Use Case

The Cancer Genome Atlas (TCGA) is a pilot project to characterize several types of cancer by sequencing and genetically characterizing tumours for over 500 patients throughout multiple institutions. S3QL was used in this case study to produce an infrastructure that exposes the public portion of the TCGA datasets as a SPARQL endpoint [41]. This was possible because SPARQL is entailed by S3QL, but not the opposite. Specifically, SPARQL queries can be serialized to S3QL, but the opposite is not always possible, particularly as regards write and access control operations. The structure of the S3DB Core Model, which explicitly distinguishes domain from instantiation, enables SPARQL query patterns such as "?Patient :R390 ?cancerType" to be readily serialized into their S3QL equivalent, select(S|R390). Although this will not be further explored in this discussion, it is worth noting that the availability of this serialization allows for an intuitive syntax of SPARQL queries by patterning them on the description of the user-defined domain Rule, such as "Patient hasCancerType cancerType".
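The serialization of a single triple pattern whose predicate is a Rule identifier can be sketched as below. This Python fragment is only an illustration of the idea for the simplest case: it assumes both subject and object are SPARQL variables (written with a leading '?'), that the predicate is a Rule identifier with an optional ':' prefix and optional deployment prefixes, and the function name `pattern_to_s3ql` is hypothetical. It is not the serializer used in [41].

```python
import re

# A Rule identifier, optionally prefixed by ':' and by deployment identifiers,
# e.g. ":R390" or "D327R172930".
_RULE = re.compile(r"^:?((?:D\d+)*R\d+)$")


def pattern_to_s3ql(subject, predicate, obj):
    """Serialize a pattern like ("?Patient", ":R390", "?cancerType")."""
    if not (subject.startswith("?") and obj.startswith("?")):
        raise ValueError("this sketch only handles variable subject and object")
    match = _RULE.match(predicate)
    if match is None:
        raise ValueError(f"predicate is not a Rule identifier: {predicate!r}")
    # All Statements instantiating the Rule answer the pattern.
    return f"select(S|{match.group(1)})"
```

Patterns with bound subjects or objects would need additional conditions in the S3QL query, which this sketch deliberately does not attempt.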

A Molecular Epidemiology Use Case

In this example, SPARQL was serialized to S3QL to support a computational statistics application. As a first step, an S3DB data store was deployed using S3QL to manage molecular epidemiology data related to strains of Staphylococcus aureus bacteria collected at the Instituto de Tecnologia Química e Biológica (ITQB) in Portugal. Specifically, the ITQB Staphylococcus reference database was devised with the purpose of managing Multilocus Sequence Typing (MLST) data, a typing method used to track the molecular epidemiology of infectious bacteria [54-57]. As a second step, we downloaded the public Staphylococcus aureus MLST profiles database at http://www.mlst.net and made it available through a SPARQL endpoint at http://q.s3db.org/mlst_sparql/endpoint.php. The process of integration of MLST profiles from the ITQB Staphylococcus database with the publicly available MLST profiles is illustrated in Figure 7. In this example, a federated SPARQL query is assembled to access both MLST sources; data stored in the S3DB deployment is retrieved by serializing the SPARQL query into S3QL and providing an authentication token to identify the user, as described in Figure 3. The assembled graph resulting from the federated SPARQL query (see Additional file 2) can be imported into a statistical computing environment such as Matlab (Mathworks Inc.). Using this methodology, it was possible to cluster strains from two different data sources with very different authentication requirements. The observation that some Portuguese strains (PT1, PT2, PT15 and PT21) that are not publicly shared cluster together with a group of public UK strains (UK17, UK16, UK11, UK14, UK15, UK13, UK12), and therefore may share a common ancestor, is enabled by the data integrated through S3QL.

ConclusionsLife sciences applications are set to greatly benefit fromcoupling Semantic Web Linked Data standards and KOSsIn the current report we illustrate data models from lifesciences domains weaved using the S3DB knowledge orga-nization system In line with the requirements for theemergence of evolvable ldquosocial machinesrdquo different per-spectives on the data are made possible by a permissionpropagation mechanism controlled by contextual attri-butes of data elements such as its creator The operationof the S3DB KOS is mediated by the S3QL protocoldescribed in this report which exposes its ApplicationProgramming Interface for viewing inserting updating

and removing data elements Because S3QL is implemen-ted with a distributed architecture where URIs can bedereferenced into multiple S3DB deployments domainexperts can share data on their own deployments withusers of other systems without the need for localaccounts Therefore S3QLrsquos fine grained permission con-trol defined as instances of s3dboperators enables domainexperts to clearly specify the degree of permission that auser should have on a resource and how that permissionshould propagate in a distributed infrastructure This is incontrast to the conventional approach of delegating per-mission management to the point of access In the currentSPARQL specification extension various data sources canbe queried simultaneously or sequentially There is still noaccepted convention for tying a query pattern to anauthenticated user probably because SPARQL engineswould have no use for that information as most have beencreated in a context of Linked Open Data efforts The cri-tical limitation in applying this solution for Health Careand Life Sciences is the ability to make use of contextualinformation to determine both the level of trust on thedata and to enable controlled access to elements in a

giuser1r1r

giuser2r22

gimanagerere

GI Clinical Trials project

Patient collection

Demographic data collection

Tissue collection

gimanagererer

merge 1 (restrictive)

merge 2 (permissive)

Assign permission

Assign permission

Patient Name statements

Tissue analysis statements

merge ( ) =

) = merge (

merge 1

merge 2

Figure 6 Two use cases of permission propagation Two users are granted full permission to view GI Clinical Trials Project (yrdquo) however noneof them can add new data (nrdquo) nor edit existing data unless they were its original creator (srdquo) One of the users (giuser1) is prevented fromaccessing any data with demographic elements such as for example the names of the Patients In this case an uppercase ldquoNrdquo is assigned in theright to view the Collection DemographicData which will be merged with the inherited ldquoyrdquo to produce an effective permission level of ldquoNrdquo forthe right to view (merge 1) For the second user (giuser2) permission is granted to use the Collection TissueData indicating that the user cancreate new instances in that Collection (merge 2)

Deus et al. BMC Bioinformatics 2011, 12:285 http://www.biomedcentral.com/1471-2105/12/285


dataset without breaking it and storing it in multiple systems. To address this requirement, S3QL was fitted with distributed control operational features that follow design criteria found desirable for biomedical applications. S3QL is not unique in its class: for example, the linked data API, which is being used by data.gov.uk, is an alternative DSL to manage linked data [58]. However, we believe that S3QL is closer to the technologies currently used by application developers and therefore may provide a more suitable middle layer between linked data formalisms and application development. It is argued that these features may assist and anticipate future extensions of semantic web provenance control formalisms.

Additional material

Additional file 1: Walkthrough of S3DB's permission propagation: merge, migrate and percolate.

Additional file 2: Matrix of MLST profiles in Portugal and the United Kingdom.

Acknowledgements and funding

This work was supported in part by the Center for Clinical and Translational Sciences of the University of Alabama at Birmingham under contract no. 5UL1 RR025777-03 from NIH National Center for Research Resources, by the National Cancer Institute grant 1U24CA143883-01, by the European Union FP7 PNEUMOPATH project award and by the Portuguese Science and Technology foundation under contracts PTDC/EIA-EIA/105245/2008 and PTDC/EEA-ACR/69530/2006. HFD also thankfully acknowledges PhD fellowship from the same foundation, award SFRH/BD/45963/2008, and the Science Foundation Ireland project Lion under Grant No. SFI/02/CE1/I131. The authors would also like to thank three anonymous reviewers whose comments and advice were extremely valuable for improving the manuscript.

Author details

1 Digital Enterprise Research Institute, National University of Ireland at Galway, IDA Business Park, Lower Dangan, Galway, Ireland. 2 Biomathematics, Instituto de Tecnologia Química e Biológica, Universidade Nova de Lisboa, Av. da República, Estação Agronómica Nacional, 2780-157 Oeiras, Portugal. 3 Laboratório Nacional de Computação Científica, Av. Getúlio Vargas 333, Quitandinha, 25651-075 Petrópolis, Brasil. 4 Sanofi Pasteur, 38 Sidney Street, Cambridge, MA 02139, USA. 5 Laboratory of Molecular Genetics, Instituto de Tecnologia Química e Biológica, Universidade Nova de Lisboa, Av. da República, Estação Agronómica Nacional, 2780-157 Oeiras, Portugal. 6 Research Center for Intelligent Media, Furtwangen University, Furtwangen, Germany. 7 Laboratory of Microbiology, The Rockefeller University, 10021 New York, USA. 8 Division of Informatics, Department of Pathology, University of Alabama at Birmingham, 619 South 19th Street, Birmingham, Alabama, USA.

Authors' contributions

HFD developed the S3QL domain specific language, implemented it in the S3DB prototype, validated it with examples and wrote the manuscript. MCC, RS, MM, HL, RF, WM and JSA tested and validated the language with examples and made suggestions which led to its improvement. All authors read and approved the final manuscript.

[Figure 7 diagram; recoverable labels: Strains from Portugal; Strains from UK; SPARQL; S3DB system; Triple stores; S3QL + authentication; Statistical computing environment (e.g. Matlab, R).]

Figure 7 Workflow for the creation of a hierarchical cluster of MLST profiles from strains collected in Portugal and in the UK. Two sources are chosen to perform the MLST profile assembly: an S3DB deployment, which holds the ITQB Staphylococcus database, and the SPARQL endpoint configured to host the Staphylococcus aureus MLST database. Although data in the MLST SPARQL endpoint is publicly available, to access data in S3DB the user needs to provide an authentication token and a user_id, as well as assemble the S3QL queries to retrieve the data [select(S|R172930) and select(S|R167271)], which may also be formulated as SPARQL. A data structure is assembled that can finally be analysed using hierarchical clustering methods such as the ones made available by Matlab. The complete MLST dataset used to obtain the graph is provided in additional file 2.


Competing interests

The authors declare that they have no competing interests.

Received: 14 April 2011 Accepted: 14 July 2011 Published: 14 July 2011

References

1. Bell G, Hey T, Szalay A: Computer science. Beyond the data deluge. Science (New York, NY) 2009, 323:1297-8.
2. Chiang AP, Butte AJ: Data-driven methods to discover molecular determinants of serious adverse drug events. Clinical pharmacology and therapeutics 2009, 85:259-68.
3. The end of theory: the data deluge makes the scientific method obsolete. [http://www.wired.com/science/discoveries/magazine/16-07/pb_theory]
4. Hubbard T: The Ensembl genome database project. Nucleic Acids Research 2002, 30:38-41.
5. Karolchik D: The UCSC Genome Browser Database. Nucleic Acids Research 2003, 31:51-54.
6. Maglott D, Ostell J, Pruitt KD, Tatusova T: Entrez Gene: gene-centered information at NCBI. Nucleic acids research 2005, 33:D54-8.
7. Ashburner M, Ball CA, Blake JA, et al: Gene ontology: tool for the unification of biology. The Gene Ontology Consortium. Nature genetics 2000, 25:25-9.
8. Bizer C, Heath T, Berners-Lee T: Linked Data - The Story So Far. International Journal on Semantic Web and Information Systems (IJSWIS) 2009.
9. Linked Data | Linked Data - Connect Distributed Data across the Web. [http://linkeddata.org]
10. Linked data - Design issues. [http://www.w3.org/DesignIssues/LinkedData.html]
11. Vandervalk BP, McCarthy EL, Wilkinson MD: Moby and Moby 2: creatures of the deep (web). Briefings in bioinformatics 2009, 10:114-28.
12. Where the semantic web stumbled, linked data will succeed - O'Reilly Radar. [http://radar.oreilly.com/2010/11/semantic-web-linked-data.html]
13. Berners-Lee T, Weitzner DJ, Hall W, et al: A Framework for Web Science. Foundations and Trends® in Web Science 2006, 1:1-130.
14. Hendler J, Berners-Lee T: From the Semantic Web to social machines: A research challenge for AI on the World Wide Web. Artificial Intelligence 2010, 174:156-161.
15. Almeida JS, Chen C, Gorlitsky R, et al: Data integration gets "Sloppy". Nature biotechnology 2006, 24:1070-1.
16. Deus HF, Stanislaus R, Veiga DF, et al: A Semantic Web management model for integrative biomedical informatics. PloS one 2008, 3:e2946.
17. Putting the Web back in Semantic Web. [http://www.w3.org/2005/Talks/1110-iswc-tbl/#(1)]
18. SPARQL Query Language for RDF. [http://www.w3.org/TR/rdf-sparql-query]
19. Alexander K, Cyganiak R, Hausenblas M, Zhao J: Describing Linked Datasets - On the Design and Usage of voiD, the "Vocabulary Of Interlinked Datasets". Linked Data on the Web Workshop (LDOW 09), in conjunction with 18th International World Wide Web Conference (WWW 09) 2009.
20. Cheung KH, Frost HR, Marshall MS, et al: A journey to Semantic Web query federation in the life sciences. BMC bioinformatics 2009, 10(Suppl 1):S10.
21. A Prototype Knowledge Base for the Life Sciences. [http://www.w3.org/TR/hcls-kb]
22. Belleau F, Nolin M-A, Tourigny N, Rigault P, Morissette J: Bio2RDF: towards a mashup to build bioinformatics knowledge systems. Journal of biomedical informatics 2008, 41:706-16.
23. Smith B, Ashburner M, Rosse C, et al: The OBO Foundry: coordinated evolution of ontologies to support biomedical data integration. Nature biotechnology 2007, 25:1251-5.
24. Taylor CF, Field D, Sansone SA, et al: Promoting coherent minimum reporting guidelines for biological and biomedical investigations: the MIBBI project. Nature biotechnology 2008, 26:889-96.
25. Noy NF, Shah NH, Whetzel PL, et al: BioPortal: ontologies and integrated data resources at the click of a mouse. Nucleic acids research 2009, 37:W170-3.
26. Deus HF, Prud E, Zhao J, Marshall MS, Samwald M: Provenance of Microarray Experiments for a Better Understanding of Experiment Results. ISWC 2010 SWPM 2010.
27. Stein LD: Integrating biological databases. Nature reviews Genetics 2003, 4:337-45.
28. Goble C, Stevens R: State of the nation in data integration for bioinformatics. Journal of Biomedical Informatics 2008, 41:687-693.
29. Ludäscher B, Altintas I, Bowers S, et al: Scientific Process Automation and Workflow Management. In Scientific Data Management. Edited by Shoshani A, Rotem D. Chapman; 2009.
30. Nelson B: Data sharing: Empty archives. Nature 2009, 461:160-3.
31. Stanislaus R, Chen C, Franklin J, Arthur J, Almeida JS: AGML Central: web based gel proteomic infrastructure. Bioinformatics (Oxford, England) 2005, 21:1754-7.
32. Silva S, Gouveia-Oliveira R, Maretzek A, et al: EURISWEB - Web-based epidemiological surveillance of antibiotic-resistant pneumococci in day care centers. BMC medical informatics and decision making 2003, 3:9.
33. Describing Linked Datasets with the VoiD Vocabulary. [http://www.w3.org/TR/2011/NOTE-void-20110303]
34. HIPAA Administrative Simplification Statute and Rules. [http://www.hhs.gov/ocr/privacy/hipaa/administrative/index.html]
35. Socially Aware Cloud Storage. [http://www.w3.org/DesignIssues/CloudStorage.html]
36. Koslow SH: Opinion: Sharing primary data: a threat or asset to discovery? Nature reviews Neuroscience 2002, 3:311-3.
37. Baggerly KA, Coombes KR: Deriving chemosensitivity from cell lines: Forensic bioinformatics and reproducible research in high-throughput biology. The Annals of Applied Statistics 2009, 3:1309-1334.
38. Hodge G: Systems of Knowledge Organization for Digital Libraries: Beyond Traditional Authority Files 2000.
39. SKOS Simple Knowledge Organization System Reference. [http://www.w3.org/TR/skos-reference]
40. Almeida JS, Deus HF, Maass W: S3DB core: a framework for RDF generation and management in bioinformatics infrastructures. BMC bioinformatics 2010, 11:387.
41. Deus HF, Veiga DF, Freire PR, et al: Exposing The Cancer Genome Atlas as a SPARQL endpoint. Journal of Biomedical Informatics 2010, 43:998-1008.
42. Correa MC, Deus HF, Vasconcelos AT, et al: AGUIA: autonomous graphical user interface assembly for clinical trials semantic data services. BMC medical informatics and decision making 2010, 10.
43. Freire P, Vilela M, Deus H, et al: Exploratory analysis of the copy number alterations in glioblastoma multiforme. PloS one 2008, 3:e4076.
44. NCBO Ontology Widgets. [http://www.bioontology.org/wiki/index.php/NCBO_Widgets]
45. Bussler C: Is Semantic Web Technology Taking the Wrong Turn? Ieee Internet Computing 2008, 12:75-79.
46. What people find hard about linked data. [http://dynamicorange.com/2010/11/15/what-people-find-hard-about-linked-data]
47. Raja A, Lakshmanan D: Domain Specific Languages. International Journal of Computer Applications 2010, 1:99-105.
48. SPARQL Update. [http://www.w3.org/TR/sparql11-update]
49. Carroll JJ, Bizer C, Hayes P, Stickler P: Named graphs, provenance and trust. Proceedings of the 14th international conference on World Wide Web, WWW 05 2005, 14:613.
50. S3DB operator function states. [http://code.google.com/p/s3db-operator]
51. S3DB Operators. [http://s3db-operator.googlecode.com/hg/propagation.html]
52. Deus HF, Sousa MA de, Carrico JA, Lencastre H de, Almeida JS: Adapting experimental ontologies for molecular epidemiology. AMIA Annual Symposium proceedings 2007, 935.
53. The OAuth 1.0 Protocol. [http://tools.ietf.org/html/rfc5849]
54. Francisco AP, Bugalho M, Ramirez M, Carriço JA: Global optimal eBURST analysis of multilocus typing data using a graphic matroid approach. BMC bioinformatics 2009, 10:152.
55. S3QL serialization engine. [http://js.s3db.googlecode.com/hg/translate/quickTranslate.html]
56. Ippolito G, Leone S, Lauria FN, Nicastri E, Wenzel RP: Methicillin-resistant Staphylococcus aureus: the superbug. International journal of infectious diseases 2010, 14:S7-S11.


57. Harris SR, Feil EJ, Holden MTG, et al: Evolution of MRSA during hospital transmission and intercontinental spread. Science (New York, NY) 2010, 327:469-74.
58. Linked Data API. [http://code.google.com/p/linked-data-api]

doi:10.1186/1471-2105-12-285
Cite this article as: Deus et al.: S3QL: A distributed domain specific language for controlled semantic integration of life sciences data. BMC Bioinformatics 2011 12:285.


  • Abstract
    • Background
    • Results
    • Conclusions
  • Background
    • 1.1 Linked Data Best Practices
    • 1.2 Challenges involved in Publishing Primary Experimental Life Sciences Datasets as Linked Data
    • 1.3 Knowledge organization systems for Linked Data
  • Methods
    • The S3DB Knowledge Organization Model
    • Operators for Permission Control
    • Components of a Distributed System
    • Availability and Documentation
  • Results
    • S3QL Syntax
    • S3QL Permission Control
    • Global Dereferencing for Distributed Queries
    • Implementation and benchmarking
  • Discussion
    • Gastro-intestinal Clinical Trials Use Case
    • The Cancer Genome Atlas Use Case
    • A Molecular Epidemiology Use Case
  • Conclusions
  • Acknowledgements and funding
  • Author details
  • Authors' contributions
  • Competing interests
  • References

and 3). In the example above, the two triples are respectively assigned to entities of type s3db:rule and s3db:statement, where "Person" is identified by a named graph of type s3db:collection and "John" is identified by a named graph of type s3db:item. The S3DB Core specifies three other entities which are specifically devised to enable knowledge organization and operator propagation: s3db:project entails a list of s3db:rule and s3db:collection and is typically applied in domain contextualization; s3db:deployment corresponds to the physical location of an S3DB system (its URL); and s3db:user is the subject of permission assignment operations. It is worth noting that, by making use of S3DB entities, blank nodes are avoided by assigning a unique alphanumeric identifier to every instance of an S3DB entity. The S3DB entities can also be identified using the first letter of their names, D, P, R, C, I, S or U, which will be used in subsequent examples to indicate, respectively, s3db:deployment, s3db:project, s3db:rule, s3db:collection, s3db:item, s3db:statement or s3db:user.

Operators for Permission Control

The second key feature that makes S3DB appropriate for controlled management operations is support for permission control embedded in its core model. As described above and in [40], the hierarchy of permissions to view/edit entities in S3DB is modelled by an adjacency matrix, which is used as a transition matrix in the propagation of permission states. For a walkthrough of the propagation mechanism, see additional file 1. The s3db:operator states applied to the S3DB transition matrix modulate propagation by three core functions: merge, migrate and percolate. This behaviour for propagation of permission is described in detail in equation 5 of [40] and is reproduced here in Equation 1. The S3DB transition matrix (T) is defined by 12 s3db:relationships

[Figure 1 rail diagrams for the delete, insert, update and select operations, each built from an operation element, a where clause and EntityID or EntityAttributeValue parameters (select additionally uses a from clause); shaded areas group the operations under View, Change and Use.]

Figure 1 S3QL language specification using rail diagrams. Rail diagrams are read from the left to the right: any string that can be composed following these diagrams is a valid S3QL query. Valid forms of E and Ea will vary according to the Core Model used in the KOS. For example, if the S3DB Core model is used, any entity in figure 2 can be used in place of E; upon choice of E, Ea is any attribute that can be attained by following a line from E.


describing dependencies and inference rules between entities of the S3DB core model. The operator state vector (f) is used as the predicate of a triple established between an s3db:user and an entity of the S3DB core model. The JavaScript application at [51] can also be used to try out this set of propagation behaviours for s3db:operators, either on the S3DB transition matrix or with alternative adjacency matrices.

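$$
\begin{aligned}
f_{object,k+1} &= \mathrm{merge}\left(\left[f_{object,k},\ \mathrm{migrate}\left(T \times f_{subject,k}\right)\right]\right)\\
l &= \mathrm{length}(f)\\
l = 1 &\rightarrow \mathrm{migrate}(f) = f = f[1]\\
l > 1 &\rightarrow \mathrm{migrate}(f) = f[2, \ldots, l]
\end{aligned}
\tag{1}
$$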
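The migrate step of Equation 1 can be illustrated with a minimal Python sketch. Representing the operator state vector f as a Python list is an assumption made here for illustration only; merge, percolate and the transition matrix T are defined in [40] and are not reproduced.

```python
def migrate(f):
    """Sketch of migrate() from Equation 1.

    Assumption: the operator state vector f is modelled as a Python list.
    A vector of length 1 is returned unchanged (f = f[1] in the 1-based
    notation of the equation); a longer vector drops its first state and
    keeps positions 2..l.
    """
    states = list(f)
    if len(states) <= 1:
        return states
    return states[1:]


print(migrate(["y"]))
print(migrate(["n", "y", "s"]))
```

Under this reading, each migration step consumes one remembered state until a single state is left, which then persists.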

The s3db:operators [40] have a scope and applicability in linked data beyond permission management. In S3QL we define three operator types for controlled management operations, one for each of the rights to view, change/edit or use instances of S3DB entities. The format used to assign permission was defined as a three character string where each operator occupies, respectively, the first, second or third position and may assume value N, S or Y according to the level of permission intended: no permission (N), permission limited to the creator of the resource (S) or full permission (Y). For example, the permission assignment "YSN" specifies complete permission to view (Y) the subject entity, partial permission to change it (S) and no permission to use it (N). States may be defined as dominant by use of uppercase (Y, N or S) or recessive by use of lowercase characters (y, n or s). Dominant and recessive permissions are used to decide on the outcome of multiple permissions converging on the same entity (as detailed in [40]). Missing permission states, indicated by the dash character '-' (which has no lower or upper case), are also

[Figure 2 diagram; recoverable labels: common attributes rdf:id, rdfs:label, dc:description, dc:created, dc:creator; entities deployment, user, project, collection, rule, item, statement; entity-specific attributes foaf:mbox, s3db:subject_id, s3db:predicate_id, s3db:object, s3db:object_id, s3db:project_id, s3db:collection_id, s3db:item_id, s3db:rule_id, rdf:value.]

Figure 2 Entities in the S3DB Core Model and their attributes. A minimal set of common attributes was defined (left) for each of the S3DB entities using RDF Schema (rdfs) and Dublin core (dc) terminology: these are rdf:id, rdfs:label, dc:description, dc:created and dc:creator. Other attributes, which are specific to each of the S3DB entities (right), such as foaf:mbox for the entity User or s3db:project_id for the entity Collection, reflect the s3db:relationships described above and formalized in the S3DB conceptual model (figure 3).


allowed, as well as a mechanism to succinctly specify transitions with variable memory length (l in equation 1). The propagation of permissions in the S3DB Core Model ensures that, for every entity and every user, two types of permission are defined: the assigned permission, or the permission state assigned directly to a user in an entity, and the effective permission, which is the result of the propagation of s3db:operators.
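The relationship between an inherited, an assigned and an effective permission can be sketched in Python as follows. This is a simplified illustration, not the merge function of [40]: the function names are hypothetical, and the tie-breaking rule (the directly assigned state wins when both states are dominant or both are recessive) is an assumption. Only the rules that a dominant (uppercase) state overrides a recessive (lowercase) one and that '-' marks a missing state come from the text.

```python
RIGHTS = ("view", "change", "use")
VALID_STATES = "YSNysn-"


def parse_assignment(assignment):
    """Split a three character permission string into its three rights."""
    if len(assignment) != 3 or any(c not in VALID_STATES for c in assignment):
        raise ValueError(f"invalid permission assignment: {assignment!r}")
    return dict(zip(RIGHTS, assignment))


def merge_state(inherited, assigned):
    """Merge one inherited state with one directly assigned state."""
    if assigned == "-":        # nothing assigned: keep what was inherited
        return inherited
    if inherited == "-":       # nothing inherited: take the assignment
        return assigned
    if assigned.isupper():     # dominant assignment always applies
        return assigned
    if inherited.isupper():    # dominant inherited beats recessive assigned
        return inherited
    return assigned            # assumption: assigned wins between recessives


def effective_permission(inherited, assigned):
    """Combine two permission strings position by position."""
    parse_assignment(inherited)
    parse_assignment(assigned)
    return "".join(merge_state(i, a) for i, a in zip(inherited, assigned))


print(parse_assignment("YSN"))
print(effective_permission("ysn", "N--"))
```

With an inherited "ysn" and an assigned "N--", the sketch yields "Nsn": the dominant N replaces the inherited right to view, while the two missing states leave the inherited change and use rights untouched.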

Figure 3 The S3DB conceptual model. Five attributes (id, label, description, creator and created) and four methods (select, update, insert and delete) are common to all S3DB entities. In the current S3QL implementation, the label and description attributes are defined by the submitter of the data, whereas the id, created and creator attributes are automatically assigned by the system. Other dependencies were devised to comply with the definition of s3db:relationships.


Components of a Distributed System

One of the requirements for RDF-based knowledge management ecosystems is the availability of queries spanning across multiple SPARQL endpoints. Automation of distributed queries in systems supporting permission control, such as S3DB, is challenged by user authentication. In S3QL we propose addressing this through delegation to authentication authorities. As a result, a user (or usage) can be identified by a URI that is independent of the authorities that validate it. Whenever possible, it is recommended that authentication credentials be protected by use of OAuth [53].

Use of URIs and Internationalized Resource Identifiers (IRIs) to identify data elements is one of the core principles of Linked Data. However, many programming environments cannot easily handle URIs as element identifiers. Problems range from decreased processing speed to a need for encoding the URIs in web service exchanges. In anticipation of that class of problems, the URIs for entities in S3DB are interchangeable with alphanumeric identifiers formulated as the concatenation of one of D, U, P, C, R, I or S (referring to the S3DB entities described in The S3DB Knowledge Organization Model), identifying the entity, and a unique number. As an example, for a deployment located at URL http://q.s3db.org/demo, the alphanumeric P126 is resolvable to an entity of type Project with URI http://q.s3db.org/s3dbdemo/P126. To facilitate exchange of URIs between distinct deployments, the URI above could also be specified as D282P126, where "D282" is the alphanumeric identifier of the S3DB deployment located at URL http://q.s3db.org/demo. Every s3db:deployment is identified by a named graph in the form D[number]; for completeness, metadata pertaining to each s3db:deployment, such as the corresponding URL, is described using the vocabulary of interlinked datasets (VoiD) [19] and shared through a root location.
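The alphanumeric identifiers described above lend themselves to simple parsing. The following Python sketch splits an identifier such as P126 into its entity type and number; the function and dictionary names are hypothetical and are not part of the S3DB API.

```python
import re

# Letter codes for the seven S3DB entities named in the text.
ENTITY_TYPES = {
    "D": "deployment",
    "U": "user",
    "P": "project",
    "C": "collection",
    "R": "rule",
    "I": "item",
    "S": "statement",
}


def parse_identifier(identifier):
    """Return (entity type, number) for an identifier such as 'P126'."""
    match = re.fullmatch(r"([DUPCRIS])(\d+)", identifier)
    if match is None:
        raise ValueError(f"not an S3DB alphanumeric identifier: {identifier!r}")
    letter, number = match.groups()
    return ENTITY_TYPES[letter], int(number)


print(parse_identifier("P126"))
print(parse_identifier("D282"))
```

Keeping the identifier short and typed in this way lets client code in environments that handle URIs poorly work with plain strings and only expand them to full URIs when required.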

Availability and Documentation

The specification of the S3QL language has been made available at http://link.s3db.org/specs and one example of the output RDF is available at http://link.s3db.org/example. S3QL has been implemented through a REST application programming interface (API) for the S3DB prototype, which is publicly available at http://s3db.org. Both the prototype and its API were developed in PHP with MySQL or PostgreSQL for data storage. Documentation about the S3DB implementation of S3QL as an API can be found at http://link.s3db.org/docs. S3QL queries may be tested at the demo implementation at http://link.s3db.org/s3qldemo and a translator for the compact notation is available at http://link.s3db.org/translate.

Results

S3QL Syntax

S3QL is a domain specific language devised for facilitating management operations such as "insert", "update" or "delete" using entities of a Linked Data KOS such as the S3DB core model described above. Its syntax, however, is only loosely tied to the S3DB Core Model and can easily be applied to a set of KOSs' core models in which S3DB is included. The complete syntax of S3QL, in its XML (eXtensible Markup Language) flavour, is represented in the railroad diagram of Figure 1. The S3QL syntax includes three elements: the description of the operation, the target entity and the input parameters. Four basic operation descriptions were deemed necessary to fully support read/write operations: select, insert, update and delete. The actions of these operations mimic those of the structured query language (SQL) and target instances of entities (E) defined in the core model. Input parameters include the set of attributes defined for each of the entities, either in the alphanumeric form associated with entity instances (EntityId) or in the form of EntityAttributeValue (Ea). The values for Ea are determined upon choice of E; for an example using S3DB entities and attributes, see Figure 2. E may be replaced with any of the entities defined in the S3DB Core Model (Deployment, User, Project, Collection, Rule, Item or Statement); upon choice of E, valid forms of Ea include any of the attributes defined for E (e.g. rdf:id, rdfs:label, rdfs:comment). A table summarizing all available operations, targets and input parameters is made available at http://link.s3db.org/specs. The formal S3QL syntax is completed by enclosing the outcome of one of the diagrams on Figure 1 with the <S3QL> tag. For example, the following XML structure is a valid S3QL query for an operation of type insert, where the target is the S3DB entity Project and the input parameter, formulated as EntityAttributeValue, is "label = Test":

<S3QL><insert>project</insert><where><label>Test</label></where></S3QL>

The set of 12 s3db:relationships (see Table 1 of [40]) in the S3DB Model determines the organizational dependencies of S3DB entities. For example, s3db:PC is the s3db:relationship that specifies a dependency between an instance of a Collection (C) and an instance of a Project (P) (Figure 3). The S3QL syntax fulfils this constraint by assigning project_id (the identifier of an S3DB Project) as an attribute of a Collection. In this description of attributes associated with the S3DB core model we make use of the assumption, as in other


KOSs and in Linked Data in general, that there is no restriction to adding relationships beyond those described here. S3QL was identified as the minimal representation to interoperate with the S3DB core model and therefore only those relationships are explored in this report.

The syntax diagram in Figure 1 generates XML, a standard widely used in web service implementations. That alternative often results in verbose queries that could easily be assembled from more compact notations. One example to consider is the form action(E | Ea = value). Here the symbol '|' should be interpreted as in Bayesian inference, as a condition, and be read "given that". The letter "E" corresponds to the first letter of any S3DB entity (D, P, R, C, I, S or U) and Ea is any of its attributes, as described in Figure 2. In this example, the query insert(P | label = test) is equivalent to the example query above. That particular variant is also accepted by the S3DB prototype and a converter for this syntax into complete S3QL/XML syntax was made available at http://link.s3db.org/translate. For further compactness of this alternative formulation, entity identifiers used as parameters may be replaced with their corresponding alphanumeric identifiers: for example, project_id = 156 may be replaced with P156. This alternative notation will be used in the subsequent examples.
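The correspondence between the compact notation and the XML flavour can be sketched with a small Python converter. This is an illustrative sketch and not the translator published by the authors: it handles only the single-condition insert form quoted in the text, and the function name and the regular expression are assumptions.

```python
import re
from xml.sax.saxutils import escape

# Entity letters and the lowercase entity names used in the XML flavour.
ENTITY_NAMES = {
    "D": "deployment", "U": "user", "P": "project", "C": "collection",
    "R": "rule", "I": "item", "S": "statement",
}

# Matches insert(E | attribute = value) with optional whitespace.
COMPACT_INSERT = re.compile(
    r"\s*insert\s*\(\s*([DUPCRIS])\s*\|\s*(\w+)\s*=\s*(.+?)\s*\)\s*"
)


def compact_insert_to_xml(query):
    """Convert e.g. 'insert(P | label = test)' into S3QL XML."""
    match = COMPACT_INSERT.fullmatch(query)
    if match is None:
        raise ValueError(f"unsupported compact query: {query!r}")
    letter, attribute, value = match.groups()
    entity = ENTITY_NAMES[letter]
    return (
        f"<S3QL><insert>{entity}</insert>"
        f"<where><{attribute}>{escape(value)}</{attribute}></where></S3QL>"
    )


print(compact_insert_to_xml("insert(P | label = test)"))
```

The value is XML-escaped so that characters such as '<' or '&' in a label do not break the generated query; select, update and delete follow different rails in Figure 1 and are deliberately left out.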

S3QL Permission Control

Permission states are assigned using an S3QL query such as insert(U | U1, P157, permission_level = ysn), which includes the action insert, the target entity User and three input parameters: identifier of the User (U1), identifier of the entity (P157) and permission assignment (ysn). Effectively, this will result in the creation of the triple [U1 ysn P157], where the subject is of type s3db:user, the predicate is of type s3db:operator and the object is of type s3db:project. The inclusion of this triple in a dataset will modulate the type of management operation that a user may perform. As described in the Methods section, each position in the permission assignment operator (ysn) encodes, respectively, for permission to "view", "change" or "use" the object entity. Values y, s and n indicate, respectively, that the user has full permission to view it (y), permission to change its metadata only if he was the creator of the entity (s) and no permission to insert (n) child entities. Each S3QL operation is therefore tightly woven to each of the three operators: select is controlled by "view", update and delete are controlled by "change" and insert is controlled by "use" (shaded areas in Figure 1). The "use" operator encodes for the ability of a user to create new relationships with the target entity, which is defined separately from the right to "change" it. For example, in the case of a user (U1) being granted "y" as the effective permission to "change" an s3db:rule, the metadata describing it may be altered. If, however, that same user is granted permission "n" to "use" that same Rule, she is prevented from creating Statements using that Rule. Although "use" may be interpreted as being equivalent to "insert" or "append" in other systems, we have chosen to separate the terms describing the operator "use" from the S3QL action "insert". The permission assigned at the dataset level will then propagate in the S3DB transition matrix following the behaviour formalized in equation 1, therefore avoiding the need to assign permission to every user on every entity. It is worth noting that the DSL presented here is extensible beyond the 4 management actions (select, insert, update, delete) described. The s3db:operators that control permission on these actions are also extensible beyond "view", "change" and "use", and different implementations may support alternative states.

The permission control behaviour for S3QL operations can be illustrated through the use of Quadratus, an application available at http://q.s3db.org/quadratus that can be pointed at any S3DB deployment to assign permission states on S3DB entities to different users (Figure 4). Other use case scenarios are also explored in the S3QL specification at http://link.s3db.org/specs.
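The mapping between S3QL actions and the three operators can be sketched as a permission check in Python. The function name and its signature are hypothetical, and treating a missing state ('-') as a denial is an assumption made for this sketch; the action-to-operator mapping and the meaning of y, s and n follow the text.

```python
# Position of the operator that controls each S3QL action:
# 0 = view, 1 = change, 2 = use.
OPERATOR_FOR_ACTION = {"select": 0, "update": 1, "delete": 1, "insert": 2}


def is_allowed(effective_permission, action, is_creator=False):
    """Decide whether an action is allowed under an effective permission.

    effective_permission is a three character string such as 'ysn'.
    Case only matters while permissions are merged, so it is ignored here.
    """
    state = effective_permission[OPERATOR_FOR_ACTION[action]].lower()
    if state == "y":
        return True          # full permission
    if state == "s":
        return is_creator    # only the creator of the resource
    return False             # 'n', or '-' (assumed to deny)


print(is_allowed("ysn", "select"))
print(is_allowed("ysn", "update", is_creator=True))
print(is_allowed("ysn", "insert", is_creator=True))
```

Under "ysn" a user may always select, may update or delete only entities she created, and may never insert child entities, which mirrors the example discussed above.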

Global Dereferencing for Distributed Queries

A simple dereferencing system was devised for S3DB identifiers that relies on the identification of root deployments, i.e. S3DB systems where alphanumeric identifiers for S3DB deployments can be dereferenced to a URL. This simple mechanism enables complex transactions of controlled data. For an example of this behaviour see Figure 5, where the S3DB UID D327R172930, identifying an entity of type Rule (R172930) available in deployment D327, is being requested by a user registered in deployment D309. In order to retrieve the requested data, the URL of deployment D327 must first be resolved at a root deployment such as, for example, http://root.s3db.org.

The dereferencing mechanism is also applicable in more complex cases where the root of two deployments sharing data is not the same. Prepending the deployment identifier of the root to the UIDs, such as for example D1016666D327R172930, where D1016666 identifies the root deployment, would result in recursive URL resolution steps such as select(Durl | D1016666) prior to step 4 in Figure 5. This mechanism avoids broken links when S3DB deployments are moved to different URLs by enabling deployment metadata to be updated securely at the root using a public/private key encryption system.
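The first step of this dereferencing process, splitting a UID into its deployment prefixes and its entity identifier, can be sketched in Python. The function names are hypothetical, the root registry is modelled as a plain dictionary rather than a call to a root deployment, and the URL used in the example is a placeholder; the recursive resolution through several roots is not reproduced.

```python
import re

# One or more D[number] prefixes followed by a single entity identifier.
UID_PATTERN = re.compile(r"((?:D\d+)*)([DUPCRIS]\d+)")


def split_uid(uid):
    """Split e.g. 'D327R172930' into (['D327'], 'R172930')."""
    match = UID_PATTERN.fullmatch(uid)
    if match is None:
        raise ValueError(f"not an S3DB UID: {uid!r}")
    prefixes, entity = match.groups()
    return re.findall(r"D\d+", prefixes), entity


def resolve(uid, root_registry):
    """Look up the URL of the deployment that hosts the entity.

    root_registry stands in for a root deployment: a mapping from
    deployment identifiers to deployment URLs (an assumption of this sketch).
    """
    deployments, entity = split_uid(uid)
    if not deployments:
        raise ValueError(f"UID has no deployment prefix: {uid!r}")
    hosting_deployment = deployments[-1]
    return root_registry[hosting_deployment], entity


registry = {"D327": "http://example.org/s3db"}  # placeholder URL
print(resolve("D327R172930", registry))
print(split_uid("D1016666D327R172930"))
```

In a UID with several prefixes, the last deployment identifier is taken to be the one hosting the entity, while any earlier identifiers name the roots through which it would be resolved.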

Implementation and benchmarking

In the current prototype implementation, S3QL is submitted to S3DB deployments using either a GET or POST request and may include an optional authentication token


(key). The REST specification [54] suggests separate HTTP methods according to the intention of the operation: often "GET" is used to retrieve data, "PUT" is used to submit data, "POST" is used to update data and "DELETE" is used to remove data. There are, however, many programming environments that implement only the REST "GET" method, including many popular computational statistics programming frameworks such as R and Matlab. Therefore, in order to fully explore the integrative potential of this read/write semantic web service and to support operations beyond the 4 implemented, the S3DB prototype implementation of S3QL supports the "GET" method for all S3QL operations, with the parameters of the S3QL call appended to the URL. One drawback of relying on GET is the limit imposed by the browser on URL length. To address this potential problem, the S3DB prototype also supports the use of "POST" for S3QL calls.
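Appending an S3QL call to a URL for a GET request can be sketched in Python with the standard library. The endpoint name S3QL.php and the parameter names query and key are assumptions made for this illustration, as is the placeholder deployment URL; the S3DB documentation should be consulted for the actual interface.

```python
from urllib.parse import urlencode


def build_s3ql_get_url(deployment_url, query_xml, key=None):
    """Build a GET URL carrying an S3QL query.

    Assumptions: the endpoint is '<deployment>/S3QL.php' and the query and
    the optional authentication token travel as 'query' and 'key'.
    """
    params = {"query": query_xml}
    if key is not None:
        params["key"] = key
    # urlencode percent-encodes '<', '>' and '/' so the XML survives in a URL.
    return deployment_url.rstrip("/") + "/S3QL.php?" + urlencode(params)


url = build_s3ql_get_url(
    "http://example.org/s3db",  # placeholder deployment URL
    "<S3QL><insert>project</insert><where><label>Test</label></where></S3QL>",
    key="abc123",
)
print(url)
print(len(url))
```

Because every '<', '>' and '/' becomes a three character escape, encoded queries grow quickly, which is one reason a POST alternative is useful for long S3QL calls.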

Two further challenges needed to be addressed in the prototypical implementation of S3QL: 1) the need for a centralized root location to support dereferencing of a deployment URI when the condensed version is used (e.g. D282), and 2) the fact that distributed queries on REST systems require users to authenticate in multiple KOSs. The first challenge was addressed by configuring an S3DB deployment as the root location, available at http://root.s3db.org. Deployment metadata is submitted to this root deployment at configuration time using S3QL; data pertaining to each deployment can therefore be dereferenced to a URL using http://root.s3db.org/D[numeric]. A root deployment other than the default can be selected during installation. To avoid overloading these root deployments with too many requests, a local 24-hour cache of all accessed deployments is kept in each S3DB deployment: if the URL is cached, it is not requested from the root deployment. To address the second challenge, each s3db:deployment can store any number of authentication services supporting the HTTP, FTP or LDAP protocols. Once the user is authenticated, temporary surrogate tokens are issued with each query. When coupled with the user identifier in the format of a URI, these tokens effectively identify the user performing the query regardless of the S3DB deployment where the query is requested.

Figure 4 Quadratus, an interface to illustrate S3QL's permission control mechanism and its effect on S3QL queries. Projects, Collections, Rules, Items and Statements associated with the GI Clinical Trials Project are displayed with effective and assigned permissions. Collections and Rules, retrieved using the S3QL queries select(C|P1196457) and select(R|P1196457), inherit the permission assigned in the Project, "ysn". The assigned permission "N--" in Collection "Demographics" results in the effective permission "Nsn", inherited by all Rules and Items that have a relationship with that Collection, effectively preventing gi_user1 from accessing its data. The directed labelled graph of the propagation resolution is displayed on the right side of the application, illustrating the propagation mechanism.

Screencasts illustrating the processing time of data manipulation using S3QL are available at http://www.youtube.com/watch?v=2KZC6kI609s and http://www.youtube.com/watch?v=FJSYLCwBaPI.
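The 24-hour local cache of deployment URLs described in this section could look roughly like the following Python sketch. The class name, the ask_root callable and the injectable clock are hypothetical; the prototype itself is written in PHP and its actual caching code is not shown in this article.

```python
import time

CACHE_TTL_SECONDS = 24 * 60 * 60  # the 24-hour validity window described above


class DeploymentUrlCache:
    """Remember deployment URLs locally so the root is asked at most once per TTL."""

    def __init__(self, ask_root, ttl=CACHE_TTL_SECONDS, clock=time.time):
        self._ask_root = ask_root  # hypothetical callable: deployment id -> URL
        self._ttl = ttl
        self._clock = clock
        self._entries = {}         # deployment id -> (url, time fetched)

    def url_for(self, deployment_id):
        now = self._clock()
        cached = self._entries.get(deployment_id)
        if cached is not None and now - cached[1] < self._ttl:
            return cached[0]       # still fresh: no request to the root
        url = self._ask_root(deployment_id)
        self._entries[deployment_id] = (url, now)
        return url
```

Passing the clock in as a parameter is a design choice made here so that expiry can be exercised without waiting a day; a real deployment would simply use the default.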

Discussion
One of the major concerns in making use of Linked Data to improve health care and life sciences research is the need to ensure both the availability of contextual information about experimental datasets and the ability to protect the privacy of certain data elements which may identify an individual patient. Domain specific languages (DSLs) can ease the task of managing the contextual descriptors that would be necessary to implement permission control in RDF and, by doing so, could greatly accelerate the rate of adoption of Linked Data formalisms in the life sciences communities to improve scientific discovery. We have described S3QL, a DSL to perform read/write operations on entities of the S3DB Core Model. S3QL attempts to address the requirements of linking life sciences datasets that include both publishable and un-publishable data elements by 1) including contextual descriptors for every submitted data element and 2) making use of those descriptors to ensure permission control managed by the data experts themselves. This avoids the need to break a consolidated dataset into its public and private parts when the results are acceptable for publication.

Applying S3QL to the S3DB Core Model in a prototypical application benefited from the definition of loosely defined boundaries for RDF data, which enabled propagation of permissions while avoiding the need to document a relationship for each data instance individually and for each user. The assembly of SPARQL queries is also facilitated by the identification of domain triples, using named graphs, from the data itself [41], and can be illustrated in the application at [55], where a subset of S3QL can be readily converted into the W3C standard SPARQL. Although the prototypical implementation of S3QL fits the definition of an API for S3DB systems, it is immediately apparent that the same notation could be easily and intuitively extended to the core models of other KOSs. For example, pointing the tool at http://q.s3db.org/translate to a JavaScript Object Notation (JSON) representation of the SKOS core model (skos.js) instead of the default S3DB core model (s3db.js) results in a valid S3QL syntax that could easily be applied as an API for SKOS-based systems. This is illustrated by applying the example query "select(C|prefLabel = animal)", using http://link.s3db.org/translate?core=skos.js&query=select(C|prefLabel=animal), to retrieve SKOS concepts labelled "animal". Adoption by life sciences application developers can be further smoothed by complete reliance on the REST protocol for data exchange and the availability of widely used formats such as JSON, XML or RDF/turtle.

Figure 5 The global S3QL dereferencing system. User U78 of deployment D309 issues a command to request all entities of type Statement where the attribute Rule_id corresponds to the value D327R172930, through the S3QL operation select(S.value|D327R172930) (1). If the URL corresponding to deployment D327 is not cached locally, or has not been validated in the past 24 h, a query is issued and executed at the root deployment to retrieve the corresponding URL: select(D.url|D327) (2, 3). Once the URL is returned, query (1) is re-issued as select(S.value|R172930) and executed at the URL for D327 (4). To validate the user, deployment D327 issues the command select(U.id|U76) at D309 using the key provided (5, 6), and returns the data only if U.id matches the value for user_id (7).

The applicability of S3QL to life sciences domains is illustrated here with three case studies: 1) in the domain of clinical trials, a project [42] that requires collaboration between departments with different interests; 2) in the domain of The Cancer Genome Atlas (TCGA) project, a multi-institutional effort that requires multiple authentication mechanisms and sharing of data among multiple institutions [41]; and 3) in the domain of molecular epidemiology, a project where non-public data stored in an S3DB deployment needs to be statistically integrated with data from a public repository. All described use cases shared one considerable requirement: the ability to include in the same dataset both published and unpublished results. As such, they required both the annotation of contextual descriptors of the data, enabled by S3QL, and the availability of controlled permission propagation, enabled by the S3DB model transition matrix. Future work in this effort may include the application of S3QL in a Knowledge Organization System based on SKOS terminology and the definition of a transition matrix for SKOS to enable controlled permission propagation.

Gastro-intestinal Clinical Trials Use Case
As part of a collaboration with the Department of Gastrointestinal Medical Oncology at The University of Texas MD Anderson Cancer Center, an S3DB deployment was configured to host data from gastro-intestinal (GI) clinical trials. A schema was developed using S3DB Collections and Rules, and S3QL insert queries were used to submit data elements as Items and Statements (see Figure 6). Two permission propagation examples are illustrated, one of a restrictive nature and the other permissive, which can also be explored at http://q.s3db.org/quadratus (Figure 4). In this example, the simple mechanisms of propagation defined for S3QL support the complex social interaction that requires a fraction of the dataset to be shared with certain users but not with others. Contextual usage is therefore a function of the attributes of the data itself (e.g. its creator) and the user identification token that is submitted with every read/write operation.

The Cancer Genome Atlas Use Case
The Cancer Genome Atlas (TCGA) is a pilot project to characterize several types of cancer by sequencing and genetically characterizing tumours from over 500 patients across multiple institutions. S3QL was used in this case study to produce an infrastructure that exposes the public portion of the TCGA datasets as a SPARQL endpoint [41]. This was possible because SPARQL is entailed by S3QL, but not the opposite. Specifically, SPARQL queries can be serialized to S3QL, but the opposite is not always possible, particularly as regards write and access control operations. The structure of the S3DB Core Model, which explicitly distinguishes domain from instantiation, enables SPARQL query patterns such as ?Patient :R390 ?cancerType to be readily serialized into the S3QL equivalent select(S|R390). Although this will not be further explored in this discussion, it is worth noting that the availability of this serialization allows for an intuitive syntax of SPARQL queries by patterning them on the description of the user-defined domain Rule, such as "?Patient :hasCancerType ?cancerType".
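The serialization of a single triple pattern into S3QL can be illustrated with a short Python sketch. It handles only the simplest case mentioned above, a pattern whose predicate is a Rule identifier, and the accepted pattern notation (variables prefixed with "?", an optional ":" before the Rule identifier) is an assumption; the serialization engine at [55] covers a larger subset and is not reproduced here.

```python
import re

# Matches "?subject :R<number> ?object" with an optional trailing dot
# (assumed notation for a single-Rule triple pattern).
_RULE_PATTERN = re.compile(r"^\s*\?(\w+)\s+:?(R\d+)\s+\?(\w+)\s*\.?\s*$")


def triple_pattern_to_s3ql(pattern):
    """Translate a single-Rule SPARQL triple pattern into select(S|R...)."""
    match = _RULE_PATTERN.match(pattern)
    if match is None:
        raise ValueError("unsupported triple pattern: %r" % pattern)
    # Statements are the instances of a Rule, so the query targets S.
    return "select(S|%s)" % match.group(2)
```

The reverse direction is deliberately absent: as noted above, write and access control operations in S3QL have no general SPARQL counterpart.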

A Molecular Epidemiology Use Case
In this example, SPARQL was serialized to S3QL to support a computational statistics application. As a first step, an S3DB data store was deployed using S3QL to manage molecular epidemiology data related to strains of Staphylococcus aureus bacteria collected at the Instituto de Tecnologia Química e Biológica (ITQB) in Portugal. Specifically, the ITQB Staphylococcus reference database was devised for the purpose of managing Multilocus Sequence Typing (MLST) data, a typing method used to track the molecular epidemiology of infectious bacteria [54-57]. As a second step, we downloaded the public Staphylococcus aureus MLST profiles database at http://www.mlst.net and made it available through a SPARQL endpoint at http://q.s3db.org/mlst_sparql/endpoint.php. The process of integrating MLST profiles from the ITQB Staphylococcus database with the publicly available MLST profiles is illustrated in Figure 7. In this example, a federated SPARQL query is assembled to access both MLST sources; data stored in the S3DB deployment is retrieved by serializing the SPARQL query into S3QL and providing an authentication token to identify the user, as described in Figure 3. The assembled graph resulting from the federated SPARQL query (see Additional file 2) can be imported into a statistical computing environment such as Matlab (Mathworks Inc.). Using this methodology it was possible to cluster strains from two different data sources with very different authentication requirements. The finding that some Portuguese strains that are not publicly shared (PT1, PT2, PT15 and PT21) cluster together with a group of public UK strains (UK17, UK16, UK11, UK14, UK15, UK13, UK12), and may therefore share a common ancestor, was enabled by the data integrated through S3QL.
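Once profiles from both sources sit in one table, a pairwise distance step typically precedes hierarchical clustering. The Python sketch below is only an illustration: the article does not state which distance measure was used, so counting mismatching loci is an assumption, and the strain names and allele numbers are invented for the example and are not data from the study.

```python
from itertools import combinations


def allele_distance(profile_a, profile_b):
    """Count the loci at which two MLST allelic profiles differ."""
    if len(profile_a) != len(profile_b):
        raise ValueError("profiles must cover the same loci")
    return sum(a != b for a, b in zip(profile_a, profile_b))


def distance_table(profiles):
    """Pairwise distances for a dict of strain name -> allelic profile."""
    return {
        (s, t): allele_distance(profiles[s], profiles[t])
        for s, t in combinations(sorted(profiles), 2)
    }


# Made-up profiles for illustration only; not data from the study.
profiles = {
    "private_1": (1, 4, 1, 4, 12, 1, 10),  # e.g. retrieved via S3QL with a token
    "public_1": (1, 4, 1, 4, 12, 1, 10),   # e.g. retrieved from the public endpoint
    "public_2": (3, 3, 1, 1, 4, 4, 3),
}
table = distance_table(profiles)
```

A table of this kind can then be handed to the clustering routine of a statistical environment such as Matlab or R.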

Conclusions
Life sciences applications are set to greatly benefit from coupling Semantic Web Linked Data standards and KOSs. In the current report we illustrate data models from life sciences domains woven using the S3DB knowledge organization system. In line with the requirements for the emergence of evolvable "social machines", different perspectives on the data are made possible by a permission propagation mechanism controlled by contextual attributes of data elements, such as their creator. The operation of the S3DB KOS is mediated by the S3QL protocol described in this report, which exposes its Application Programming Interface for viewing, inserting, updating and removing data elements. Because S3QL is implemented with a distributed architecture where URIs can be dereferenced into multiple S3DB deployments, domain experts can share data on their own deployments with users of other systems without the need for local accounts. Therefore, S3QL's fine-grained permission control, defined as instances of s3db:operators, enables domain experts to clearly specify the degree of permission that a user should have on a resource and how that permission should propagate in a distributed infrastructure. This is in contrast to the conventional approach of delegating permission management to the point of access. In the current SPARQL specification extension, various data sources can be queried simultaneously or sequentially. There is still no accepted convention for tying a query pattern to an authenticated user, probably because SPARQL engines would have no use for that information, as most have been created in the context of Linked Open Data efforts. The critical limitation in applying this solution to Health Care and Life Sciences lies in the ability to make use of contextual information both to determine the level of trust in the data and to enable controlled access to elements in a dataset without breaking it up and storing it in multiple systems. To address this requirement, S3QL was fitted with distributed control operational features that follow design criteria found desirable for biomedical applications. S3QL is not unique in its class; for example, the Linked Data API, which is being used by data.gov.uk, is an alternative DSL to manage linked data [58]. However, we believe that S3QL is closer to the technologies currently used by application developers and therefore may provide a more suitable middle layer between linked data formalisms and application development. It is argued that these features may assist and anticipate future extensions of semantic web provenance control formalisms.

Figure 6 Two use cases of permission propagation. Two users are granted full permission to view the GI Clinical Trials Project ("y"); however, neither of them can add new data ("n") nor edit existing data unless they were its original creator ("s"). One of the users (giuser1) is prevented from accessing any data with demographic elements such as, for example, the names of the Patients. In this case an uppercase "N" is assigned in the right to view the Collection DemographicData, which will be merged with the inherited "y" to produce an effective permission level of "N" for the right to view (merge 1). For the second user (giuser2), permission is granted to use the Collection TissueData, indicating that the user can create new instances in that Collection (merge 2).

Additional material

Additional file 1: Walkthrough of S3DB's permission propagation: merge, migrate and percolate.

Additional file 2: Matrix of MLST profiles in Portugal and the United Kingdom.

Acknowledgements and funding
This work was supported in part by the Center for Clinical and Translational Sciences of the University of Alabama at Birmingham under contract no. 5UL1 RR025777-03 from the NIH National Center for Research Resources; by the National Cancer Institute grant 1U24CA143883-01; by the European Union FP7 PNEUMOPATH project award; and by the Portuguese Science and Technology foundation under contracts PTDC/EIA-EIA/105245/2008 and PTDC/EEA-ACR/69530/2006. HFD also thankfully acknowledges a PhD fellowship from the same foundation, award SFRH/BD/45963/2008, and the Science Foundation Ireland project Lion under Grant No. SFI/02/CE1/I131. The authors would also like to thank three anonymous reviewers whose comments and advice were extremely valuable for improving the manuscript.

Author details
1Digital Enterprise Research Institute, National University of Ireland at Galway, IDA Business Park, Lower Dangan, Galway, Ireland. 2Biomathematics, Instituto de Tecnologia Química e Biológica, Universidade Nova de Lisboa, Av. da República, Estação Agronómica Nacional, 2780-157 Oeiras, Portugal. 3Laboratório Nacional de Computação Científica, Av. Getúlio Vargas 333, Quitandinha, 25651-075 Petrópolis, Brasil. 4Sanofi Pasteur, 38 Sidney Street, Cambridge, MA 02139, USA. 5Laboratory of Molecular Genetics, Instituto de Tecnologia Química e Biológica, Universidade Nova de Lisboa, Av. da República, Estação Agronómica Nacional, 2780-157 Oeiras, Portugal. 6Research Center for Intelligent Media, Furtwangen University, Furtwangen, Germany. 7Laboratory of Microbiology, The Rockefeller University, 10021 New York, USA. 8Division of Informatics, Department of Pathology, University of Alabama at Birmingham, 619 South 19th Street, Birmingham, Alabama, USA.

Authors' contributions
HFD developed the S3QL domain specific language, implemented it in the S3DB prototype, validated it with examples and wrote the manuscript. MCC, RS, MM, HL, RF, WM and JSA tested and validated the language with examples and made suggestions which led to its improvement. All authors read and approved the final manuscript.

Figure 7 Workflow for the creation of a hierarchical cluster of MLST profiles from strains collected in Portugal and in the UK. Two sources are chosen to perform the MLST profile assembly: an S3DB deployment, which holds the ITQB Staphylococcus database, and the SPARQL endpoint configured to host the Staphylococcus aureus MLST database. Although data in the MLST SPARQL endpoint is publicly available, to access data in S3DB the user needs to provide an authentication token and a user_id, as well as assemble the S3QL queries to retrieve the data [select(S|R172930) and select(S|R167271)], which may also be formulated as SPARQL. A data structure is assembled that can finally be analysed using hierarchical clustering methods such as the ones made available by Matlab. The complete MLST dataset used to obtain the graph is provided in Additional file 2.

Competing interests
The authors declare that they have no competing interests.

Received: 14 April 2011 Accepted: 14 July 2011 Published: 14 July 2011

References
1. Bell G, Hey T, Szalay A: Computer science. Beyond the data deluge. Science (New York, NY) 2009, 323:1297-8.
2. Chiang AP, Butte AJ: Data-driven methods to discover molecular determinants of serious adverse drug events. Clinical pharmacology and therapeutics 2009, 85:259-68.
3. The end of theory: the data deluge makes the scientific method obsolete. [http://www.wired.com/science/discoveries/magazine/16-07/pb_theory].
4. Hubbard T: The Ensembl genome database project. Nucleic Acids Research 2002, 30:38-41.
5. Karolchik D: The UCSC Genome Browser Database. Nucleic Acids Research 2003, 31:51-54.
6. Maglott D, Ostell J, Pruitt KD, Tatusova T: Entrez Gene: gene-centered information at NCBI. Nucleic acids research 2005, 33:D54-8.
7. Ashburner M, Ball CA, Blake JA, et al: Gene ontology: tool for the unification of biology. The Gene Ontology Consortium. Nature genetics 2000, 25:25-9.
8. Bizer C, Heath T, Berners-Lee T: Linked Data - The Story So Far. International Journal on Semantic Web and Information Systems (IJSWIS) 2009.
9. Linked Data | Linked Data - Connect Distributed Data across the Web. [http://linkeddata.org].
10. Linked data - Design issues. [http://www.w3.org/DesignIssues/LinkedData.html].
11. Vandervalk BP, McCarthy EL, Wilkinson MD: Moby and Moby 2: creatures of the deep (web). Briefings in bioinformatics 2009, 10:114-28.
12. Where the semantic web stumbled, linked data will succeed - O'Reilly Radar. [http://radar.oreilly.com/2010/11/semantic-web-linked-data.html].
13. Berners-Lee T, Weitzner DJ, Hall W, et al: A Framework for Web Science. Foundations and Trends® in Web Science 2006, 1:1-130.
14. Hendler J, Berners-Lee T: From the Semantic Web to social machines: A research challenge for AI on the World Wide Web. Artificial Intelligence 2010, 174:156-161.
15. Almeida JS, Chen C, Gorlitsky R, et al: Data integration gets "Sloppy". Nature biotechnology 2006, 24:1070-1.
16. Deus HF, Stanislaus R, Veiga DF, et al: A Semantic Web management model for integrative biomedical informatics. PloS one 2008, 3:e2946.
17. Putting the Web back in Semantic Web. [http://www.w3.org/2005/Talks/1110-iswc-tbl/#(1)].
18. SPARQL Query Language for RDF. [http://www.w3.org/TR/rdf-sparql-query].
19. Alexander K, Cyganiak R, Hausenblas M, Zhao J: Describing Linked Datasets - On the Design and Usage of voiD, the "Vocabulary Of Interlinked Datasets". Linked Data on the Web Workshop (LDOW 09), in conjunction with 18th International World Wide Web Conference (WWW 09) 2009.
20. Cheung KH, Frost HR, Marshall MS, et al: A journey to Semantic Web query federation in the life sciences. BMC bioinformatics 2009, 10(Suppl 1):S10.
21. A Prototype Knowledge Base for the Life Sciences. [http://www.w3.org/TR/hcls-kb].
22. Belleau F, Nolin M-A, Tourigny N, Rigault P, Morissette J: Bio2RDF: towards a mashup to build bioinformatics knowledge systems. Journal of biomedical informatics 2008, 41:706-16.
23. Smith B, Ashburner M, Rosse C, et al: The OBO Foundry: coordinated evolution of ontologies to support biomedical data integration. Nature biotechnology 2007, 25:1251-5.
24. Taylor CF, Field D, Sansone SA, et al: Promoting coherent minimum reporting guidelines for biological and biomedical investigations: the MIBBI project. Nature biotechnology 2008, 26:889-96.
25. Noy NF, Shah NH, Whetzel PL, et al: BioPortal: ontologies and integrated data resources at the click of a mouse. Nucleic acids research 2009, 37:W170-3.
26. Deus HF, Prud E, Zhao J, Marshall MS, Samwald M: Provenance of Microarray Experiments for a Better Understanding of Experiment Results. ISWC 2010 SWPM 2010.
27. Stein LD: Integrating biological databases. Nature reviews Genetics 2003, 4:337-45.
28. Goble C, Stevens R: State of the nation in data integration for bioinformatics. Journal of Biomedical Informatics 2008, 41:687-693.
29. Ludäscher B, Altintas I, Bowers S, et al: Scientific Process Automation and Workflow Management. In Scientific Data Management. Edited by Shoshani A, Rotem D. Chapman; 2009.
30. Nelson B: Data sharing: Empty archives. Nature 2009, 461:160-3.
31. Stanislaus R, Chen C, Franklin J, Arthur J, Almeida JS: AGML Central: web based gel proteomic infrastructure. Bioinformatics (Oxford, England) 2005, 21:1754-7.
32. Silva S, Gouveia-Oliveira R, Maretzek A, et al: EURISWEB - Web-based epidemiological surveillance of antibiotic-resistant pneumococci in day care centers. BMC medical informatics and decision making 2003, 3:9.
33. Describing Linked Datasets with the VoiD Vocabulary. [http://www.w3.org/TR/2011/NOTE-void-20110303].
34. HIPAA Administrative Simplification Statute and Rules. [http://www.hhs.gov/ocr/privacy/hipaa/administrative/index.html].
35. Socially Aware Cloud Storage. [http://www.w3.org/DesignIssues/CloudStorage.html].
36. Koslow SH: Opinion: Sharing primary data: a threat or asset to discovery? Nature reviews Neuroscience 2002, 3:311-3.
37. Baggerly KA, Coombes KR: Deriving chemosensitivity from cell lines: Forensic bioinformatics and reproducible research in high-throughput biology. The Annals of Applied Statistics 2009, 3:1309-1334.
38. Hodge G: Systems of Knowledge Organization for Digital Libraries: Beyond Traditional Authority Files. 2000.
39. SKOS Simple Knowledge Organization System Reference. [http://www.w3.org/TR/skos-reference].
40. Almeida JS, Deus HF, Maass W: S3DB core: a framework for RDF generation and management in bioinformatics infrastructures. BMC bioinformatics 2010, 11:387.
41. Deus HF, Veiga DF, Freire PR, et al: Exposing The Cancer Genome Atlas as a SPARQL endpoint. Journal of Biomedical Informatics 2010, 43:998-1008.
42. Correa MC, Deus HF, Vasconcelos AT, et al: AGUIA: autonomous graphical user interface assembly for clinical trials semantic data services. BMC medical informatics and decision making 2010, 10.
43. Freire P, Vilela M, Deus H, et al: Exploratory analysis of the copy number alterations in glioblastoma multiforme. PloS one 2008, 3:e4076.
44. NCBO Ontology Widgets. [http://www.bioontology.org/wiki/index.php/NCBO_Widgets].
45. Bussler C: Is Semantic Web Technology Taking the Wrong Turn? Ieee Internet Computing 2008, 12:75-79.
46. What people find hard about linked data. [http://dynamicorange.com/2010/11/15/what-people-find-hard-about-linked-data].
47. Raja A, Lakshmanan D: Domain Specific Languages. International Journal of Computer Applications 2010, 1:99-105.
48. SPARQL Update. [http://www.w3.org/TR/sparql11-update].
49. Carroll JJ, Bizer C, Hayes P, Stickler P: Named graphs, provenance and trust. Proceedings of the 14th international conference on World Wide Web - WWW 05 2005, 14:613.
50. S3DB operator function states. [http://code.google.com/p/s3db-operator].
51. S3DB Operators. [http://s3db-operator.googlecode.com/hg/propagation.html].
52. Deus HF, Sousa MA de, Carrico JA, Lencastre H de, Almeida JS: Adapting experimental ontologies for molecular epidemiology. AMIA Annual Symposium proceedings 2007, 935.
53. The OAuth 1.0 Protocol. [http://tools.ietf.org/html/rfc5849].
54. Francisco AP, Bugalho M, Ramirez M, Carriço JA: Global optimal eBURST analysis of multilocus typing data using a graphic matroid approach. BMC bioinformatics 2009, 10:152.
55. S3QL serialization engine. [http://js.s3db.googlecode.com/hg/translate/quickTranslate.html].
56. Ippolito G, Leone S, Lauria FN, Nicastri E, Wenzel RP: Methicillin-resistant Staphylococcus aureus: the superbug. International journal of infectious diseases 2010, 14:S7-S11.
57. Harris SR, Feil EJ, Holden MTG, et al: Evolution of MRSA during hospital transmission and intercontinental spread. Science (New York, NY) 2010, 327:469-74.
58. Linked Data API. [http://code.google.com/p/linked-data-api].

doi:10.1186/1471-2105-12-285
Cite this article as: Deus et al.: S3QL: A distributed domain specific language for controlled semantic integration of life sciences data. BMC Bioinformatics 2011 12:285.


  • Abstract
    • Background
    • Results
    • Conclusions
  • Background
    • 1.1 Linked Data Best Practices
    • 1.2 Challenges involved in Publishing Primary Experimental Life Sciences Datasets as Linked Data
    • 1.3 Knowledge organization systems for Linked Data
  • Methods
    • The S3DB Knowledge Organization Model
    • Operators for Permission Control
    • Components of a Distributed System
    • Availability and Documentation
  • Results
    • S3QL Syntax
    • S3QL Permission Control
    • Global Dereferencing for Distributed Queries
    • Implementation and benchmarking
  • Discussion
    • Gastro-intestinal Clinical Trials Use Case
    • The Cancer Genome Atlas Use Case
    • A Molecular Epidemiology Use Case
  • Conclusions
  • Acknowledgements and funding
  • Author details
  • Authors' contributions
  • Competing interests
  • References

describing dependencies and inference rules between entities of the S3DB core model. The operator state vector (f) is used as the predicate of a triple established between an s3db:user and an entity of the S3DB core model. The JavaScript application at [51] can also be used to try out this set of propagation behaviours for s3db:operators, either on the S3DB transition matrix or with alternative adjacency matrices.

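$$
\begin{aligned}
f_{\mathrm{object},\,k+1} &= \mathrm{merge}\big(\big[\, f_{\mathrm{object},\,k},\ \mathrm{migrate}(T \times f_{\mathrm{subject},\,k}) \,\big]\big) \\
l &= \mathrm{length}(f) \\
l = 1 &\rightarrow \mathrm{migrate}(f) = f = f[1] \\
l > 1 &\rightarrow \mathrm{migrate}(f) = f[2 \ldots l]
\end{aligned}
\tag{1}
$$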

Figure 2 Entities in the S3DB Core Model and their attributes. A minimal set of common attributes was defined (left) for each of the S3DB entities using RDF Schema (rdfs) and Dublin Core (dc) terminology: rdf:id, rdfs:label, dc:description, dc:created and dc:creator. Other attributes, which are specific to each of the S3DB entities (right), such as foaf:mbox for the entity User or s3db:project_id for the entity Collection, reflect the s3db:relationships described above and formalized in the S3DB conceptual model (Figure 3).

The s3db:operators [40] have a scope and applicability in linked data beyond permission management. In S3QL we define three operator types for controlled management operations, one for each of the rights to view, change (edit) or use instances of S3DB entities. The format used to assign permission was defined as a three-character string in which each operator occupies, respectively, the first, second or third position and may assume the value N, S or Y according to the level of permission intended: no permission (N), permission limited to the creator of the resource (S) or full permission (Y). For example, the permission assignment "YSN" specifies complete permission to view the subject entity (Y), partial permission to change it (S) and no permission to use it (N). States may be defined as dominant, by use of uppercase characters (Y, N or S), or recessive, by use of lowercase characters (y, n or s). Dominant and recessive permissions are used to decide on the outcome of multiple permissions converging on the same entity (as detailed in [40]). Missing permission states, indicated by the dash character '-' (which has no lower or upper case), are also allowed, as is a mechanism to succinctly specify transitions with variable memory length (l in equation 1). The propagation of permissions in the S3DB Core Model ensures that, for every entity and every user, two types of permission are defined: the assigned permission, i.e. the permission state assigned directly to a user in an entity, and the effective permission, which is the result of the propagation of s3db:operators.
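How an assigned permission combines with an inherited one can be sketched in Python. This is a simplified illustration and not the merge operator of [40]: the rule that a dominant inherited state overrides a recessive assignment, and that the direct assignment wins in every other case, is an assumption chosen to reproduce the example in Figure 4 ("ysn" inherited, "N--" assigned, "Nsn" effective). The function names are hypothetical.

```python
VALID_STATES = set("YSNysn-")


def merge_state(inherited, assigned):
    """Combine one inherited and one assigned state for a single right."""
    if assigned == "-":
        return inherited  # nothing assigned: the inherited state stands
    if inherited == "-":
        return assigned
    if inherited.isupper() and not assigned.isupper():
        return inherited  # dominant inherited beats recessive assigned
    return assigned       # assumption: otherwise the direct assignment wins


def effective_permission(inherited, assigned):
    """Merge two three-character permission strings (view, change, use)."""
    for permission in (inherited, assigned):
        if len(permission) != 3 or not set(permission) <= VALID_STATES:
            raise ValueError("not a permission string: %r" % permission)
    return "".join(merge_state(i, a) for i, a in zip(inherited, assigned))
```

With these rules, effective_permission("ysn", "N--") yields "Nsn", which is the effective permission shown for the Demographics Collection in Figure 4.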

Figure 3 The S3DB conceptual model. Five attributes (id, label, description, creator and created) and four methods (select, update, insert and delete) are common to all S3DB entities. In the current S3QL implementation, the label and description attributes are defined by the submitter of the data, whereas the id, created and creator attributes are automatically assigned by the system. Other dependencies were devised to comply with the definition of s3db:relationships.

Components of a Distributed System
One of the requirements for RDF-based knowledge management ecosystems is the availability of queries spanning multiple SPARQL endpoints. Automation of distributed queries in systems supporting permission control, such as S3DB, is challenged by user authentication. In S3QL we propose addressing this through delegation to authentication authorities. As a result, a user (or usage) can be identified by a URI that is independent of the authorities that validate it. Whenever possible, it is recommended that authentication credentials be protected by use of OAuth [53].

Use of URIs and Internationalized Resource Identifiers (IRIs) to identify data elements is one of the core principles of Linked Data. However, many programming environments cannot easily handle URIs as element identifiers. Problems range from decreased processing speed to a need for encoding the URIs in web service exchanges. In anticipation of that class of problems, the URIs for entities in S3DB are interchangeable with alphanumeric identifiers formed by concatenating one of D, U, P, C, R, I or S (referring to the S3DB entities described in The S3DB Knowledge Organization Model), which identifies the entity type, and a unique number. As an example, for a deployment located at URL http://q.s3db.org/demo, the alphanumeric P126 is resolvable to an entity of type Project with URI http://q.s3db.org/s3dbdemo/P126. To facilitate the exchange of URIs between distinct deployments, the URI above could also be specified as D282P126, where "D282" is the alphanumeric identifier of the S3DB deployment located at URL http://q.s3db.org/demo. Every s3db:deployment is identified by a named graph in the form D[number]; for completeness, metadata pertaining to each s3db:deployment, such as the corresponding URL, is described using the vocabulary of interlinked datasets (VoiD) [19] and shared through a root location.

Availability and Documentation

The specification of the S3QL language has been made available at http://link.s3db.org/specs and one example of the output RDF is available at http://link.s3db.org/example. S3QL has been implemented through a REST application programming interface (API) for the S3DB prototype, which is publicly available at http://s3db.org. Both the prototype and its API were developed in PHP, with MySQL or PostgreSQL for data storage. Documentation about the S3DB implementation of S3QL as an API can be found at http://link.s3db.org/docs. S3QL queries may be tested at the demo implementation at http://link.s3db.org/s3qldemo and a translator for the compact notation is available at http://link.s3db.org/translate.

Results

S3QL Syntax

S3QL is a domain specific language devised for facilitating management operations such as "insert", "update" or "delete" using entities of a Linked Data KOS such as the S3DB core model described above. Its syntax, however, is only loosely tied to the S3DB Core Model and can easily be applied to a set of KOSs' core models, in which S3DB is included. The complete syntax of S3QL in its XML (eXtensible Markup Language) flavour is represented in the railroad diagram of Figure 1. The S3QL syntax includes three elements: the description of the operation, the target entity and the input parameters. Four basic operation descriptions were deemed necessary to fully support read/write operations: select, insert, update and delete. The action of these operations mimics those of the structured query language (SQL) and targets instances of entities (E) defined in the core model. Input parameters include the set of attributes defined for each of the entities, either in the alphanumeric form associated with entity instances (EntityId) or in the form of EntityAttributeValue (Ea). The values for Ea are determined upon choice of E - for an example using S3DB entities and attributes, see Figure 2. E may be replaced with any of the entities defined in the S3DB Core Model (Deployment, User, Project, Collection, Rule, Item or Statement); upon choice of E, valid forms of Ea include any of the attributes defined for E (e.g. rdf:id, rdfs:label, rdfs:comment). A table summarizing all available operations, targets and input parameters is made available at http://link.s3db.org/specs. The formal S3QL syntax is completed by enclosing the outcome of one of the diagrams in Figure 1 with the <S3QL> tag. For example, the following XML structure is a valid S3QL query for an operation of type insert, where the target is the S3DB entity Project and the input parameter, formulated as EntityAttributeValue, is "label = Test":

<S3QL><insert>project</insert><where><label>Test</label></where></S3QL>

The set of 12 s3db:relationships (see Table 1 of [40]) in the S3DB Model determines the organizational dependencies of S3DB entities. For example, s3db:PC is the s3db:relationship that specifies a dependency between an instance of a Collection (C) and an instance of a Project (P) (Figure 3). The S3QL syntax fulfils this constraint by assigning project_id (the identifier of an S3DB Project) as an attribute of a Collection. In this description of attributes associated with the S3DB core model we make use of the assumption, as in other KOSs and in Linked Data in general, that there is no restriction to adding relationships beyond those described here. S3QL was identified as the minimal representation to interoperate with the S3DB core model and, therefore, only those relationships are explored in this report.

The syntax diagram in Figure 1 generates XML, a standard widely used in web service implementations. XML often results in verbose queries that could easily be assembled from more compact notations. One example to consider is the form action (E | Ea = value). Here, the symbol '|' should be interpreted, as in Bayesian inference, as a condition and be read "given that". The letter "E" corresponds to the first letter of any S3DB entity (D, P, R, C, I, S or U) and Ea is any of its attributes, as described in Figure 2. In this notation, the query insert(P | label = test) is equivalent to the example query above. That particular variant is also accepted by the S3DB prototype, and a converter for this syntax into complete S3QL/XML syntax was made available at http://link.s3db.org/translate. For further compactness of this alternative formulation, entity identifiers used as parameters may be replaced with their corresponding alphanumeric identifiers - for example, project_id = 156 may be replaced with P156. This alternative notation will be used in the subsequent examples.
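The correspondence between the compact notation and the XML flavour can be sketched as follows. This is an illustrative converter written for this text, not the translator hosted by the authors: the function name `compact_to_xml` is hypothetical, only the `Ea = value` parameter form is handled (the `P156` shorthand is not), and entity names are assumed to be written in lower case as in the XML example above.

```python
import re
import xml.etree.ElementTree as ET

# First letter of each S3DB entity, mapped to the name used in the XML form.
ENTITY_NAMES = {
    "D": "deployment", "U": "user", "P": "project", "C": "collection",
    "R": "rule", "I": "item", "S": "statement",
}
ACTIONS = {"select", "insert", "update", "delete"}
_QUERY = re.compile(r"^\s*(\w+)\s*\(\s*([A-Z])\s*\|\s*(.*?)\s*\)\s*$")


def compact_to_xml(query):
    """Convert 'action(E | Ea = value, ...)' into an S3QL/XML string."""
    match = _QUERY.match(query)
    if match is None:
        raise ValueError(f"unrecognised compact query: {query!r}")
    action, letter, params = match.groups()
    if action not in ACTIONS or letter not in ENTITY_NAMES:
        raise ValueError(f"unknown action or entity in: {query!r}")
    root = ET.Element("S3QL")
    ET.SubElement(root, action).text = ENTITY_NAMES[letter]
    where = ET.SubElement(root, "where")
    for param in params.split(","):
        name, separator, value = param.partition("=")
        if not separator:
            raise ValueError(f"expected 'attribute = value', got {param!r}")
        ET.SubElement(where, name.strip()).text = value.strip()
    return ET.tostring(root, encoding="unicode")
```

Building the XML with an element tree rather than string concatenation means that values containing characters such as `<` or `&` are escaped automatically.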

S3QL Permission Control

Permission states are assigned using an S3QL query such as insert(U | U1, P157, permission_level = ysn), which includes the action insert, the target entity User and three input parameters: identifier of the User (U1), identifier of the entity (P157) and permission assignment (ysn). Effectively, this will result in the creation of the triple [U1 ysn P157], where the subject is of type s3db:user, the predicate is of type s3db:operator and the object is of type s3db:project. The inclusion of this triple in a dataset will modulate the type of management operation that a user may perform. As described in the Methods section, each position in the permission assignment operator (ysn) encodes, respectively, for permission to "view", "change" or "use" the object entity. Values y, s and n indicate, respectively, that the user has full permission to view it (y), permission to change its metadata only if the user was the creator of the entity (s) and no permission to insert (n) child entities. Each S3QL operation is therefore tied to one of the three operators: select is controlled by "view", update and delete are controlled by "change" and insert is controlled by "use" (shaded areas in Figure 1). The 'use' operator encodes for the ability of a user to create new relationships with the target entity, which is defined separately from the right to "change" it. For example, in the case of a user (U1) being granted "y" as the effective permission to "change" an s3db:rule, the metadata describing it may be altered. If, however, that same user is granted permission "n" to 'use' that same Rule, she is prevented from creating Statements using that Rule. Although "use" may be interpreted as being equivalent to "insert" or "append" in other systems, we have chosen to separate the terms describing the operator "use" from the S3QL action "insert". The permission assigned at the dataset level will then propagate in the S3DB transition matrix following the behaviour formalized in equation 1, therefore avoiding the need to assign permission to every user on every entity. It is worth noting that the DSL presented here is extensible beyond the 4 management actions (select, insert, update, delete) described. The s3db:operators that control permission on these actions are also extensible beyond "view", "change" and "use", and different implementations may support alternative states.

The permission control behaviour for S3QL operations can be illustrated through the use of Quadratus, an application available at http://q.s3db.org/quadratus that can be pointed at any S3DB deployment to assign permission states on S3DB entities to different users (Figure 4). Other use case scenarios are also explored in the S3QL specification at http://link.s3db.org/specs.
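The mapping between S3QL actions and the three positions of a permission state can be sketched as below. The function names are hypothetical. The `merge` function is a deliberately simplified reading in which "-" means "inherit the parent's state"; it reproduces the "ysn" plus "N--" giving "Nsn" example from the Figure 4 caption but does not implement the full merge, migrate and percolate rules referred to in the paper.

```python
# Position of the operator that controls each S3QL action:
# 0 = "view", 1 = "change", 2 = "use".
OPERATOR_POSITION = {"select": 0, "update": 1, "delete": 1, "insert": 2}


def is_allowed(permission, action, is_creator=False):
    """Decide whether an action is permitted under a state such as 'ysn'."""
    state = permission[OPERATOR_POSITION[action]].lower()
    if state == "y":
        return True            # full permission
    if state == "s":
        return is_creator      # only for the creator of the entity
    return False               # "n": no permission


def merge(inherited, assigned):
    """Simplified merge: '-' keeps the inherited state, anything else wins."""
    return "".join(
        parent if own == "-" else own
        for parent, own in zip(inherited, assigned)
    )
```

Under this sketch, a user holding "ysn" on a Project may run select on it, may run update or delete only on entities the user created, and may not insert child entities.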

Global Dereferencing for Distributed Queries

A simple dereferencing system was devised for S3DB identifiers that relies on the identification of root deployments, i.e. S3DB systems where alphanumeric identifiers for S3DB deployments can be dereferenced to a URL. This simple mechanism enables complex transactions of controlled data. For an example of this behaviour see Figure 5, where the S3DB UID D327R172930, identifying an entity of type Rule (R172930) available in deployment D327, is being requested by a user registered in deployment D309. In order to retrieve the requested data, the URL of deployment D327 must first be resolved at a root deployment such as, for example, http://root.s3db.org.

The dereferencing mechanism is also applicable in more complex cases where the root of two deployments sharing data is not the same. Prepending the deployment identifier of the root to the UIDs, such as, for example, D1016666D327R172930, where D1016666 identifies the root deployment, would result in recursive URL resolution steps such as select(Durl| D1016666) prior to step 4 in Figure 5. This mechanism avoids broken links when S3DB deployments are moved to different URLs by enabling deployment metadata to be updated securely at the root using a public/private key encryption system.
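The recursive resolution of a UID through one or more root deployments can be sketched as follows. This is an assumption-laden illustration: the names `split_uid` and `resolve` are hypothetical, UIDs are taken to be plain concatenations of identifiers as they appear in this text, and the `lookup` callable stands in for whatever query a deployment would send to a root to obtain a URL.

```python
import re

_SEGMENT = re.compile(r"[DUPCRIS]\d+")


def split_uid(uid):
    """Split a UID such as 'D327R172930' into ['D327', 'R172930']."""
    segments = _SEGMENT.findall(uid)
    if not segments or "".join(segments) != uid:
        raise ValueError(f"malformed UID: {uid!r}")
    return segments


def resolve(uid, lookup, default_root):
    """Return (deployment_url, local_id) for a UID.

    lookup(root_url, deployment_id) must return the URL registered for
    deployment_id at the root found at root_url (hypothetical interface).
    """
    *deployments, local_id = split_uid(uid)
    if not deployments or any(d[0] != "D" for d in deployments):
        raise ValueError(f"UID has no deployment prefix: {uid!r}")
    url = default_root
    for deployment_id in deployments:
        # Each prefix is resolved at the URL found in the previous step.
        url = lookup(url, deployment_id)
    return url, local_id
```

With a single prefix the loop performs one lookup at the default root; with a root identifier prepended, the first lookup yields the URL of the other root and the second lookup is performed there.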

Implementation and benchmarking

In the current prototype implementation, S3QL is submitted to S3DB deployments using either a GET or POST request and may include an optional authentication token (key). The REST specification [54] suggests separate HTTP methods according to the intention of the operation: often, "GET" is used to retrieve data, "PUT" is used to submit data, "POST" is used to update data and "DELETE" is used to remove data. There are, however, many programming environments that implement only the "GET" method, including many popular computational statistics programming frameworks such as R and Matlab. Therefore, in order to fully explore the integrative potential of this read/write semantic web service, and to support operations beyond the 4 implemented, the S3DB prototype implementation of S3QL supports the "GET" method for all S3QL operations, with the parameters of the S3QL call appended to the URL. One drawback of relying on GET is the limit imposed by the browser on URL length. To address this potential problem, the S3DB prototype also supports the use of "POST" for S3QL calls.

Two further challenges needed to be addressed in the prototypical implementation of S3QL: 1) the need for a centralized root location to support dereferencing of deployment URIs when the condensed version is used (e.g. D282) and 2) the requirement, in distributed queries on REST systems, for users to authenticate in multiple KOSs. The first challenge was addressed by configuring an S3DB deployment as the root location, available at http://root.s3db.org. Deployment metadata is submitted to this root deployment at configuration time using S3QL; data pertaining to each deployment can therefore be dereferenced to a URL using http://root.s3db.org/D[numeric]. A root deployment other than the default can be specified during installation. To avoid overloading these root deployments with too many requests, a local 24 hour cache of all accessed deployments is kept in each S3DB deployment using the same strategy: if the URL is cached, it will not be requested from the root deployment. To address the second challenge, each s3db:deployment can store any number of authentication services supporting HTTP, FTP or LDAP protocols. Once the user is authenticated, temporary surrogate tokens are issued with each query. When coupled with the user identifier in the format of a URI, these tokens effectively identify the user performing the query, regardless of the S3DB deployment where the query is requested.

Figure 4 Quadratus, an interface to illustrate S3QL's permission control mechanism and its effect on S3QL queries. Projects, Collections, Rules, Items and Statements associated with the GI Clinical Trials Project are displayed with effective and assigned permissions. Collections and Rules retrieved using S3QL queries select(C|P1196457) and select(R| P1196457) inherit the permission assignment in the Project, "ysn"; assigned permission "N--" in Collection "Demographics" results in the effective permission of "Nsn", inherited by all Rules and Items that have a relationship with that Collection, effectively preventing gi_user1 from accessing its data. The directed labelled graph of the propagation resolution is displayed on the right side of the application, illustrating the propagation mechanism.

Screencasts illustrating processing time of data manipulation using S3QL are available at http://www.youtube.com/watch?v=2KZC6kI609s and http://www.youtube.com/watch?v=FJSYLCwBaPI.
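Two of the implementation choices described in this section, preferring GET with a POST fallback for long queries and keeping a 24 hour cache of deployment URLs, can be sketched as below. The names `build_request` and `DeploymentUrlCache`, the `query` parameter name and the 2000-character URL limit are assumptions made for this illustration; only the optional `key` token and the 24 hour lifetime come from the text.

```python
import time
from urllib.parse import urlencode

MAX_URL_LENGTH = 2000  # assumed limit; real limits vary by client


def build_request(endpoint, s3ql, key=None, max_url_length=MAX_URL_LENGTH):
    """Return (method, url, body): GET when the URL is short enough, else POST."""
    params = {"query": s3ql}        # 'query' is a hypothetical parameter name
    if key is not None:
        params["key"] = key         # optional authentication token
    encoded = urlencode(params)
    url = endpoint + "?" + encoded
    if len(url) <= max_url_length:
        return "GET", url, None
    return "POST", endpoint, encoded.encode("ascii")


class DeploymentUrlCache:
    """Cache deployment URLs locally for a fixed lifetime (24 h by default)."""

    def __init__(self, fetch, ttl_seconds=24 * 60 * 60, clock=time.time):
        self._fetch = fetch         # callable that asks the root deployment
        self._ttl = ttl_seconds
        self._clock = clock
        self._entries = {}          # deployment_id -> (url, time stored)

    def get(self, deployment_id):
        now = self._clock()
        entry = self._entries.get(deployment_id)
        if entry is not None and now - entry[1] < self._ttl:
            return entry[0]         # fresh entry: the root is not contacted
        url = self._fetch(deployment_id)
        self._entries[deployment_id] = (url, now)
        return url
```

Passing the clock in as a parameter keeps the cache easy to test without waiting for real time to elapse.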

Discussion

One of the major concerns in making use of Linked Data to improve health care and life sciences research is the need to ensure both the availability of contextual information about experimental datasets and the ability to protect the privacy of certain data elements which may identify an individual patient. Domain specific languages (DSLs) can ease the task of managing the contextual descriptors that would be necessary to implement permission control in RDF and, by doing so, could greatly accelerate the rate of adoption of Linked Data formalisms in the life sciences communities to improve scientific discovery. We have described S3QL, a DSL to perform read/write operations on entities of the S3DB Core Model. S3QL attempts to address the requirements in linking Life Sciences datasets, including both publishable and un-publishable data elements, by 1) including contextual descriptors for every submitted data element and 2) making use of those descriptors to ensure permission control managed by the data experts themselves. This avoids the need to break a consolidated dataset into its public and private parts when the results are acceptable for publication.

Applying S3QL to the S3DB Core Model in a prototypical application benefited from the definition of loosely defined boundaries for RDF data that enabled propagation of permission while avoiding the need to document a relationship for each data instance individually and for each user. The assembly of SPARQL queries is also facilitated by the identification of domain triples using named graphs from the data itself [41], and can be illustrated in the application at [55], where a subset of S3QL can be readily converted into the W3C standard SPARQL. Although the prototypical implementation of S3QL fits the definition of an API for S3DB systems, it is immediately apparent that the same notation could be easily and intuitively extended to other KOSs' core models. For example, pointing the tool at http://q.s3db.org/translate to a JavaScript Object Notation (JSON) representation of the SKOS core model (skos.js) instead of the default S3DB core model (s3db.js) results in a valid S3QL syntax that could easily be applied as an API for SKOS based systems, as illustrated by applying the example query "select(C|prefLabel = animal)" using http://link.s3db.org/translate?core=skos.js&query=select(C|prefLabel=animal) to retrieve SKOS concepts labelled "animal". The progress of adoption for life sciences application developers can be further smoothed by complete reliance on the REST protocol for data exchange and the availability of widely used formats such as JSON, XML or RDF/Turtle.

Figure 5 The global S3QL dereferencing system. User U78 of deployment D309 issues a command to request all entities of type Statement where the attribute Rule_id corresponds to the value D327R172930, through S3QL operation select(Svalue|D327R172930) (1). If the URL corresponding to deployment D327 is not cached locally, or has not been validated in the past 24 h, a query is issued and executed at the root deployment to retrieve the corresponding URL: select(Durl|D327) (2,3). Once the URL is returned, query (1) is re-issued as select(Svalue| R172930) and executed at the URL for D327 (4). To validate the user, deployment D327 issues the command select(Uid|U76) at D309 using the key provided (5,6) and returns the data only if Uid matches the value for user_id (7).

The applicability of S3QL to life sciences domains is illustrated here with three case studies: 1) in the domain of clinical trials, a project [42] that requires collaboration between departments with different interests; 2) in the domain of The Cancer Genome Atlas (TCGA) project, a multi-institutional effort that requires multiple authentication mechanisms and sharing of data among multiple institutions [41]; and 3) in the domain of molecular epidemiology, a project where non-public data stored in an S3DB deployment needs to be statistically integrated with data from a public repository. All described use cases shared one considerable requirement - the ability to include in the same dataset both published and unpublished results. As such, they required both the annotation of contextual descriptors of the data, enabled by S3QL, and the availability of controlled permission propagation, enabled by the S3DB model transition matrix. Future work in this effort may include the application of S3QL in a Knowledge Organization System based on SKOS terminology and the definition of a transition matrix for SKOS to enable controlled permission propagation.
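The idea of pointing the same notation at a different core model, as in the SKOS example discussed in this section, can be sketched by treating a core model as plain data. The JSON layout below is hypothetical and is not the content of the skos.js or s3db.js files; the attribute names are taken from the SKOS vocabulary and from the S3QL examples in this text.

```python
import json
import re

# Hypothetical JSON layout: entity letter -> entity name and its attributes.
SKOS_CORE_JSON = """
{"C": {"name": "Concept",
       "attributes": ["prefLabel", "altLabel", "broader", "narrower"]}}
"""
S3DB_CORE_JSON = """
{"P": {"name": "Project", "attributes": ["label", "description"]},
 "C": {"name": "Collection",
       "attributes": ["label", "description", "project_id"]}}
"""
_QUERY = re.compile(r"^\s*(\w+)\s*\(\s*([A-Z])\s*\|\s*(\w+)\s*=\s*(.+?)\s*\)\s*$")


def is_valid(query, core_model):
    """Check a compact query against the entities of a given core model."""
    match = _QUERY.match(query)
    if match is None:
        return False
    _action, letter, attribute, _value = match.groups()
    entity = core_model.get(letter)
    return entity is not None and attribute in entity["attributes"]


skos_core = json.loads(SKOS_CORE_JSON)
s3db_core = json.loads(S3DB_CORE_JSON)
```

The same query string can therefore be valid under one core model and invalid under another, which is the property that lets the notation be reused beyond S3DB.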

Gastro-intestinal Clinical Trials Use Case

As part of a collaboration with the Department of Gastrointestinal Medical Oncology at The University of Texas M. D. Anderson Cancer Center, an S3DB deployment was configured to host data from gastro-intestinal (GI) clinical trials. A schema was developed using S3DB Collections and Rules, and S3QL insert queries were used to submit data elements as Items and Statements (see Figure 6). Two permission propagation examples are illustrated, one of a restrictive nature and the other permissive, which can also be explored at http://q.s3db.org/quadratus (Figure 4). In this example, the simple mechanisms of propagation defined for S3QL support the complex social interaction that requires a fraction of the dataset to be shared with certain users but not with others. Contextual usage is therefore a function of the attributes of the data itself (e.g. its creator) and the user identification token that is submitted with every read/write operation.

The Cancer Genome Atlas Use Case

The Cancer Genome Atlas (TCGA) is a pilot project to characterize several types of cancer by sequencing and genetically characterizing tumours for over 500 patients throughout multiple institutions. S3QL was used in this case study to produce an infrastructure that exposes the public portion of the TCGA datasets as a SPARQL endpoint [41]. This was possible because SPARQL is entailed by S3QL, but not the opposite. Specifically, SPARQL queries can be serialized to S3QL, but the opposite is not always possible, particularly as regards write and access control operations. The structure of the S3DB Core Model, which explicitly distinguishes domain from instantiation, enables SPARQL query patterns such as "?Patient :R390 ?cancerType" to be readily serialized into the S3QL equivalent select(S|R390). Although this will not be further explored in this discussion, it is worth noting that the availability of this serialization allows for an intuitive syntax of SPARQL queries by patterning them on the description of the user-defined domain Rule, such as "?Patient :hasCancerType ?cancerType".
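The serialization of a single triple pattern into its S3QL equivalent can be sketched as below. This is a toy illustration of the one case given in the text, in which the predicate is an S3DB Rule identifier; the function name is hypothetical and no attempt is made to parse full SPARQL.

```python
import re

_RULE_ID = re.compile(r"^R\d+$")


def triple_pattern_to_s3ql(pattern):
    """Serialize '?s :R390 ?o' into 'select(S|R390)'."""
    parts = pattern.split()
    if len(parts) != 3:
        raise ValueError(f"expected subject, predicate and object: {pattern!r}")
    # Drop any prefix such as ':' or 's3db:' in front of the Rule identifier.
    predicate = parts[1].rsplit(":", 1)[-1]
    if _RULE_ID.match(predicate) is None:
        raise ValueError(f"predicate is not a Rule identifier: {parts[1]!r}")
    return f"select(S|{predicate})"
```

A pattern whose predicate is a human-readable name, such as hasCancerType, would first have to be mapped to the identifier of the corresponding Rule, a step this sketch leaves out.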

A Molecular Epidemiology Use Case

In this example, SPARQL was serialized to S3QL to support a computational statistics application. As a first step, an S3DB data store was deployed using S3QL to manage molecular epidemiology data related to strains of Staphylococcus aureus bacteria collected at the Instituto de Tecnologia Química e Biológica (ITQB) in Portugal. Specifically, the ITQB Staphylococcus reference database was devised with the purpose of managing Multilocus Sequence Typing (MLST) data, a typing method used to track the molecular epidemiology of infectious bacteria [54-57]. As a second step, we downloaded the public Staphylococcus aureus MLST profiles database at http://www.mlst.net and made it available through a SPARQL endpoint at http://q.s3db.org/mlst_sparql/endpoint.php. The process of integration of MLST profiles from the ITQB Staphylococcus database with the publicly available MLST profiles is illustrated in Figure 7. In this example, a federated SPARQL query is assembled to access both MLST sources; data stored in the S3DB deployment is retrieved by serializing the SPARQL query into S3QL and providing an authentication token to identify the user, as described in Figure 3. The assembled graph resulting from the federated SPARQL query (see additional file 2) can be imported into a statistical computing environment such as Matlab (Mathworks Inc.). Using this methodology, it was possible to cluster strains from two different data sources with very different authentication requirements. The integration enabled by S3QL revealed that some Portuguese strains that are not publicly shared (PT1, PT2, PT15 and PT21) cluster together with a group of public UK strains (UK17, UK16, UK11, UK14, UK15, UK13, UK12) and therefore may share a common ancestor.
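Once profiles from both sources sit in one data structure, grouping them is a small computation. The sketch below is not the authors' Matlab analysis: it swaps in a plain single-linkage grouping at a fixed distance threshold, uses the Hamming distance between allelic profiles as an assumed dissimilarity measure, and all function names are hypothetical. Any profiles passed to it in an example would need to come from the real dataset (additional file 2).

```python
def hamming(profile_a, profile_b):
    """Number of loci at which two allelic profiles differ."""
    if len(profile_a) != len(profile_b):
        raise ValueError("profiles must cover the same loci")
    return sum(a != b for a, b in zip(profile_a, profile_b))


def single_linkage_groups(profiles, max_distance):
    """Group strain names so that linked strains differ by <= max_distance.

    profiles maps a strain name to a tuple of allele numbers.
    """
    groups = []
    for name, profile in profiles.items():
        # Existing groups that contain at least one close-enough strain.
        linked = [
            group for group in groups
            if any(hamming(profile, profiles[other]) <= max_distance
                   for other in group)
        ]
        merged = {name}.union(*linked)
        groups = [group for group in groups if group not in linked]
        groups.append(merged)
    return groups
```

Because the grouping only needs strain names and profiles, it is indifferent to whether a profile came from the authenticated S3DB deployment or from the public endpoint.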

Conclusions

Life sciences applications are set to greatly benefit from coupling Semantic Web Linked Data standards and KOSs. In the current report, we illustrate data models from life sciences domains woven using the S3DB knowledge organization system. In line with the requirements for the emergence of evolvable "social machines", different perspectives on the data are made possible by a permission propagation mechanism controlled by contextual attributes of data elements, such as their creator. The operation of the S3DB KOS is mediated by the S3QL protocol described in this report, which exposes its Application Programming Interface for viewing, inserting, updating and removing data elements. Because S3QL is implemented with a distributed architecture where URIs can be dereferenced into multiple S3DB deployments, domain experts can share data on their own deployments with users of other systems without the need for local accounts. S3QL's fine grained permission control, defined as instances of s3db:operators, therefore enables domain experts to clearly specify the degree of permission that a user should have on a resource and how that permission should propagate in a distributed infrastructure. This is in contrast to the conventional approach of delegating permission management to the point of access. In the current SPARQL specification extension, various data sources can be queried simultaneously or sequentially. There is still no accepted convention for tying a query pattern to an authenticated user, probably because SPARQL engines would have no use for that information, as most have been created in a context of Linked Open Data efforts. The critical limitation in applying this solution to Health Care and Life Sciences is the lack of a means to use contextual information both to determine the level of trust in the data and to enable controlled access to elements in a dataset without breaking it up and storing it in multiple systems. To address this requirement, S3QL was fitted with distributed control operational features that follow design criteria found desirable for biomedical applications. S3QL is not unique in its class; for example, the linked data API, which is being used by data.gov.uk, is an alternative DSL to manage linked data [58]. However, we believe that S3QL is closer to the technologies currently used by application developers and therefore may provide a more suitable middle layer between linked data formalisms and application development. It is argued that these features may assist and anticipate future extensions of semantic web provenance control formalisms.

Figure 6 Two use cases of permission propagation. Two users are granted full permission to view the GI Clinical Trials Project ("y"); however, neither of them can add new data ("n") nor edit existing data unless they were its original creator ("s"). One of the users (giuser1) is prevented from accessing any data with demographic elements such as, for example, the names of the Patients. In this case, an uppercase "N" is assigned in the right to view the Collection DemographicData, which will be merged with the inherited "y" to produce an effective permission level of "N" for the right to view (merge 1). For the second user (giuser2), permission is granted to use the Collection TissueData, indicating that the user can create new instances in that Collection (merge 2).

Additional material

Additional file 1: Walkthrough of S3DB's permission propagation: merge, migrate and percolate.

Additional file 2: Matrix of MLST profiles in Portugal and the United Kingdom.

Acknowledgements and funding

This work was supported in part by the Center for Clinical and Translational Sciences of the University of Alabama at Birmingham under contract no. 5UL1 RR025777-03 from NIH National Center for Research Resources, by the National Cancer Institute grant 1U24CA143883-01, by the European Union FP7 PNEUMOPATH project award and by the Portuguese Science and Technology foundation under contracts PTDC/EIA-EIA/105245/2008 and PTDC/EEA-ACR/69530/2006. HFD also thankfully acknowledges a PhD fellowship from the same foundation, award SFRH/BD/45963/2008, and the Science Foundation Ireland project Lion under Grant No. SFI/02/CE1/I131. The authors would also like to thank three anonymous reviewers whose comments and advice were extremely valuable for improving the manuscript.

Author details

1 Digital Enterprise Research Institute, National University of Ireland at Galway, IDA Business Park, Lower Dangan, Galway, Ireland. 2 Biomathematics, Instituto de Tecnologia Química e Biológica, Universidade Nova de Lisboa, Av. da República, Estação Agronómica Nacional, 2780-157 Oeiras, Portugal. 3 Laboratório Nacional de Computação Científica, Av. Getúlio Vargas 333, Quitandinha, 25651-075 Petrópolis, Brasil. 4 Sanofi Pasteur, 38 Sidney Street, Cambridge, MA 02139, USA. 5 Laboratory of Molecular Genetics, Instituto de Tecnologia Química e Biológica, Universidade Nova de Lisboa, Av. da República, Estação Agronómica Nacional, 2780-157 Oeiras, Portugal. 6 Research Center for Intelligent Media, Furtwangen University, Furtwangen, Germany. 7 Laboratory of Microbiology, The Rockefeller University, 10021 New York, USA. 8 Division of Informatics, Department of Pathology, University of Alabama at Birmingham, 619 South 19th Street, Birmingham, Alabama, USA.

Authors' contributions

HFD developed the S3QL domain specific language, implemented it in the S3DB prototype, validated it with examples and wrote the manuscript. MCC, RS, MM, HL, RF, WM and JSA tested and validated the language with examples and made suggestions which led to its improvement. All authors read and approved the final manuscript.

Figure 7 Workflow for the creation of a hierarchical cluster of MLST profiles from strains collected in Portugal and in the UK. Two sources are chosen to perform the MLST profile assembly - an S3DB deployment, which holds the ITQB Staphylococcus database, and the SPARQL endpoint configured to host the Staphylococcus aureus MLST database. Although data in the MLST SPARQL endpoint is publicly available, to access data in S3DB the user needs to provide an authentication token and a user_id, as well as assemble the S3QL queries to retrieve the data [select(S|R172930) and select(S|R167271)], which may also be formulated as SPARQL. A data structure is assembled that can finally be analysed using hierarchical clustering methods such as the ones made available by Matlab. The complete MLST dataset used to obtain the graph is provided in additional file 2.


Competing interests

The authors declare that they have no competing interests.

Received: 14 April 2011 Accepted: 14 July 2011 Published: 14 July 2011

References1 Bell G Hey T Szalay A Computer science Beyond the data deluge

Science (New York NY) 2009 3231297-82 Chiang AP Butte AJ Data-driven methods to discover molecular

determinants of serious adverse drug events Clinical pharmacology andtherapeutics 2009 85259-68

3 The end of theory the data deluge makes the scientific methodobsolete [httpwwwwiredcomsciencediscoveriesmagazine16-07pb_theory]

4 Hubbard T The Ensembl genome database project Nucleic Acids Research2002 3038-41

5 Karolchik D The UCSC Genome Browser Database Nucleic Acids Research2003 3151-54

6 Maglott D Ostell J Pruitt KD Tatusova T Entrez Gene gene-centeredinformation at NCBI Nucleic acids research 2005 33D54-8

7 Ashburner M Ball CA Blake JA et al Gene ontology tool for theunification of biology The Gene Ontology Consortium Nature genetics2000 2525-9

8 Bizer C Heath T Berners-Lee T Linked Data - The Story So FarInternational Journal on Semantic Web and Information Systems (IJSWIS)2009

9 Linked Data | Linked Data - Connect Distributed Data across the Web[httplinkeddataorg]

10 Linked data - Design issues [httpwwww3orgDesignIssuesLinkedDatahtml]

11 Vandervalk BP McCarthy EL Wilkinson MD Moby and Moby 2 creaturesof the deep (web) Briefings in bioinformatics 2009 10114-28

12 Where the semantic web stumbled linked data will succeed - OrsquoReillyRadar [httpradaroreillycom201011semantic-web-linked-datahtml]

13 Berners-Lee T Weitzner DJ Hall W et al A Framework for Web ScienceFoundations and Trendsreg in Web Science 2006 11-130

14 Hendler J Berners-Lee T From the Semantic Web to social machines Aresearch challenge for AI on the World Wide Web Artificial Intelligence2010 174156-161

15 Almeida JS Chen C Gorlitsky R et al Data integration gets ldquoSloppyrdquoNature biotechnology 2006 241070-1

16 Deus HF Stanislaus R Veiga DF et al A Semantic Web managementmodel for integrative biomedical informatics PloS one 2008 3e2946

17 Putting the Web back in Semantic Web [httpwwww3org2005Talks1110-iswc-tbl(1)]

18 SPARQL Query Language for RDF [httpwwww3orgTRrdf-sparql-query]

19 Alexander K Cyganiak R Hausenblas M Zhao J Describing LinkedDatasets On the Design and Usage of voiD the ldquo Vocabulary OfInterlinked Datasetsrdquo Linked Data on the Web Workshop (LDOW 09) inconjunction with 18th International World Wide Web Conference (WWW 09)2009

20 Cheung KH Frost HR Marshall MS et al A journey to Semantic Webquery federation in the life sciences BMC bioinformatics 2009 10(Suppl1)S10

21 A Prototype Knowledge Base for the Life Sciences [httpwwww3orgTRhcls-kb]

22 Belleau F Nolin M-A Tourigny N Rigault P Morissette J Bio2RDF towardsa mashup to build bioinformatics knowledge systems Journal ofbiomedical informatics 2008 41706-16

23 Smith B Ashburner M Rosse C et al The OBO Foundry coordinatedevolution of ontologies to support biomedical data integration Naturebiotechnology 2007 251251-5

24 Taylor CF Field D Sansone SA et al Promoting coherent minimumreporting guidelines for biological and biomedical investigations theMIBBI project Nature biotechnology 2008 26889-96

25 Noy NF Shah NH Whetzel PL et al BioPortal ontologies and integrateddata resources at the click of a mouse Nucleic acids research 2009 37W170-3

26 Deus HF Prud E Zhao J Marshall MS Samwald M Provenance ofMicroarray Experiments for a Better Understanding of ExperimentResults ISWC 2010 SWPM 2010

27 Stein LD Integrating biological databases Nature reviews Genetics 20034337-45

28 Goble C Stevens R State of the nation in data integration forbioinformatics Journal of Biomedical Informatics 2008 41687-693

29 Ludaumlscher B Altintas I Bowers S et al Scientific Process Automation andWorkflow Management In Scientific Data Management Edited byShoshani A Rotem D Chapman 2009

30. Nelson B: Data sharing: Empty archives. Nature 2009, 461:160-3.

31. Stanislaus R, Chen C, Franklin J, Arthur J, Almeida JS: AGML Central: web based gel proteomic infrastructure. Bioinformatics (Oxford, England) 2005, 21:1754-7.

32. Silva S, Gouveia-Oliveira R, Maretzek A, et al: EURISWEB – Web-based epidemiological surveillance of antibiotic-resistant pneumococci in day care centers. BMC medical informatics and decision making 2003, 3:9.

33. Describing Linked Datasets with the VoiD Vocabulary. [http://www.w3.org/TR/2011/NOTE-void-20110303]

34. HIPAA Administrative Simplification Statute and Rules. [http://www.hhs.gov/ocr/privacy/hipaa/administrative/index.html]

35. Socially Aware Cloud Storage. [http://www.w3.org/DesignIssues/CloudStorage.html]

36. Koslow SH: Opinion: Sharing primary data: a threat or asset to discovery? Nature reviews Neuroscience 2002, 3:311-3.

37. Baggerly KA, Coombes KR: Deriving chemosensitivity from cell lines: Forensic bioinformatics and reproducible research in high-throughput biology. The Annals of Applied Statistics 2009, 3:1309-1334.

38. Hodge G: Systems of Knowledge Organization for Digital Libraries: Beyond Traditional Authority Files. 2000.

39. SKOS Simple Knowledge Organization System Reference. [http://www.w3.org/TR/skos-reference]

40. Almeida JS, Deus HF, Maass W: S3DB core: a framework for RDF generation and management in bioinformatics infrastructures. BMC bioinformatics 2010, 11:387.

41. Deus HF, Veiga DF, Freire PR, et al: Exposing The Cancer Genome Atlas as a SPARQL endpoint. Journal of Biomedical Informatics 2010, 43:998-1008.

42. Correa MC, Deus HF, Vasconcelos AT, et al: AGUIA: autonomous graphical user interface assembly for clinical trials semantic data services. BMC medical informatics and decision making 2010, 10.

43. Freire P, Vilela M, Deus H, et al: Exploratory analysis of the copy number alterations in glioblastoma multiforme. PloS one 2008, 3:e4076.

44. NCBO Ontology Widgets. [http://www.bioontology.org/wiki/index.php/NCBO_Widgets]

45. Bussler C: Is Semantic Web Technology Taking the Wrong Turn? IEEE Internet Computing 2008, 12:75-79.

46. What people find hard about linked data. [http://dynamicorange.com/2010/11/15/what-people-find-hard-about-linked-data]

47. Raja A, Lakshmanan D: Domain Specific Languages. International Journal of Computer Applications 2010, 1:99-105.

48. SPARQL Update. [http://www.w3.org/TR/sparql11-update]

49. Carroll JJ, Bizer C, Hayes P, Stickler P: Named graphs, provenance and trust. Proceedings of the 14th international conference on World Wide Web, WWW 05 2005, 14:613.

50. S3DB operator function states. [http://code.google.com/p/s3db-operator]

51. S3DB Operators. [http://s3db-operator.googlecode.com/hg/propagation.html]

52. Deus HF, Sousa MA de, Carrico JA, Lencastre H de, Almeida JS: Adapting experimental ontologies for molecular epidemiology. AMIA Annual Symposium proceedings 2007, 935.

53. The OAuth 1.0 Protocol. [http://tools.ietf.org/html/rfc5849]

54. Francisco AP, Bugalho M, Ramirez M, Carriço JA: Global optimal eBURST analysis of multilocus typing data using a graphic matroid approach. BMC bioinformatics 2009, 10:152.

55. S3QL serialization engine. [http://js.s3db.googlecode.com/hg/translate/quickTranslate.html]

56. Ippolito G, Leone S, Lauria FN, Nicastri E, Wenzel RP: Methicillin-resistant Staphylococcus aureus: the superbug. International journal of infectious diseases 2010, 14:S7-S11.


57. Harris SR, Feil EJ, Holden MTG, et al: Evolution of MRSA during hospital transmission and intercontinental spread. Science (New York, NY) 2010, 327:469-74.

58. Linked Data API. [http://code.google.com/p/linked-data-api]

doi:10.1186/1471-2105-12-285
Cite this article as: Deus et al.: S3QL: A distributed domain specific language for controlled semantic integration of life sciences data. BMC Bioinformatics 2011, 12:285.




allowed, as well as a mechanism to succinctly specify transitions with variable memory length (l in equation 1). The propagation of permissions in the S3DB Core Model ensures that, for every entity and every user, two types of permission are defined: the assigned permission, or the permission state assigned directly to a user in an entity, and the effective permission, which is the result of the propagation of s3db:operators.

Figure 3 The S3DB conceptual model. Five attributes (id, label, description, creator and created) and four methods (select, update, insert and delete) are common to all S3DB entities. In the current S3QL implementation, the label and description attributes are defined by the submitter of the data, whereas the id, created and creator attributes are automatically assigned by the system. Other dependencies were devised to comply with the definition of s3db:relationships.


Components of a Distributed System
One of the requirements for RDF-based knowledge management ecosystems is the availability of queries spanning across multiple SPARQL endpoints. Automation of distributed queries in systems supporting permission control, such as S3DB, is challenged by user authentication. In S3QL we propose addressing this through delegation to authentication authorities. As a result, a user (or usage) can be identified by a URI that is independent of the authorities that validate it. Whenever possible, it is recommended that authentication credentials be protected by use of OAuth [53].

Use of URIs and Internationalized Resource Identifiers (IRIs) to identify data elements is one of the core principles of Linked Data. However, many programming environments cannot easily handle URIs as element identifiers. Problems range from decreased processing speed to a need for encoding the URIs in web service exchanges. In anticipation of that class of problems, the URIs for entities in S3DB are interchangeable with alphanumeric identifiers formulated as the concatenation of one of D, U, P, C, R, I or S (referring to the S3DB entities described in The S3DB Knowledge Organization Model), identifying the entity, and a unique number. As an example, for a deployment located at URL http://q.s3db.org/demo, the alphanumeric P126 is resolvable to an entity of type Project with URI http://q.s3db.org/s3dbdemo/P126. To facilitate exchange of URIs between distinct deployments, the URI above could also be specified as D282P126, where "D282" is the alphanumeric identifier of the S3DB deployment located at URL http://q.s3db.org/demo. Every s3db:deployment is identified by a named graph in the form D[number]; for completeness, metadata pertaining to each s3db:deployment, such as the corresponding URL, is described using the vocabulary of interlinked datasets (VoiD) [19] and shared through a root location.
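The alphanumeric identifier scheme described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function names (`parse_uid`, `split_uid`, `entity_type`) are hypothetical and are not part of the S3DB prototype, and the sketch assumes that an identifier is a plain concatenation of letter-plus-number parts in which every part before the last one names a deployment.

```python
import re

# Entity letters listed in the text: Deployment, User, Project, Collection,
# Rule, Item and Statement.
ENTITY_TYPES = {
    "D": "Deployment", "U": "User", "P": "Project", "C": "Collection",
    "R": "Rule", "I": "Item", "S": "Statement",
}

_PART = re.compile(r"([DUPCRIS])(\d+)")


def parse_uid(uid):
    """Split an identifier such as 'D282P126' into (letter, digits) parts."""
    parts = _PART.findall(uid)
    # Reject input that is not exactly a concatenation of letter+digits parts.
    if not parts or "".join(letter + digits for letter, digits in parts) != uid:
        raise ValueError(f"not an S3DB alphanumeric identifier: {uid!r}")
    return parts


def split_uid(uid):
    """Return (deployment identifiers, local identifier) for a UID."""
    *prefix, (letter, digits) = parse_uid(uid)
    # Assumption: only deployments may be prepended to an entity identifier.
    if any(part[0] != "D" for part in prefix):
        raise ValueError(f"only deployment prefixes are expected: {uid!r}")
    return [f"D{d}" for _, d in prefix], f"{letter}{digits}"


def entity_type(uid):
    """Name the S3DB entity type that the identifier points to."""
    return ENTITY_TYPES[parse_uid(uid)[-1][0]]
```

With these helpers, `split_uid("D282P126")` separates the deployment part (`D282`) from the local identifier (`P126`), and `entity_type("D282P126")` reports a Project.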

Availability and Documentation
The specification of the S3QL language has been made available at http://link.s3db.org/specs and one example of the output RDF is available at http://link.s3db.org/example. S3QL has been implemented through a REST application programming interface (API) for the S3DB prototype, which is publicly available at http://s3db.org. Both the prototype and its API were developed in PHP with MySQL or PostgreSQL for data storage. Documentation about the S3DB implementation of S3QL as an API can be found at http://link.s3db.org/docs. S3QL queries may be tested at the demo implementation at http://link.s3db.org/s3qldemo and a translator for the compact notation is available at http://link.s3db.org/translate.

Results
S3QL Syntax
S3QL is a domain specific language devised to facilitate management operations such as "insert", "update" or "delete" on entities of a Linked Data KOS such as the S3DB core model described above. Its syntax, however, is only loosely tied to the S3DB Core Model and can easily be applied to a set of KOSs' core models in which S3DB is included. The complete syntax of S3QL in its XML (eXtensible Markup Language) flavour is represented in the railroad diagram of Figure 1. The S3QL syntax includes three elements: the description of the operation, the target entity and the input parameters. Four basic operation descriptions were deemed necessary to fully support read/write operations: select, insert, update and delete. The actions of these operations mimic those of the structured query language (SQL) and target instances of entities (E) defined in the core model. Input parameters include the set of attributes defined for each of the entities, either in the alphanumeric form associated with entity instances (EntityId) or in the form of EntityAttributeValue (Ea). The values for Ea are determined upon choice of E; for an example using S3DB entities and attributes, see Figure 2. E may be replaced with any of the entities defined in the S3DB Core Model (Deployment, User, Project, Collection, Rule, Item or Statement); upon choice of E, valid forms of Ea include any of the attributes defined for E (e.g. rdf:id, rdfs:label, rdfs:comment). A table summarizing all available operations, targets and input parameters is made available at http://link.s3db.org/specs. The formal S3QL syntax is completed by enclosing the outcome of one of the diagrams in Figure 1 with the <S3QL> tag. For example, the following XML structure is a valid S3QL query for an operation of type insert, where the target is the S3DB entity Project and the input parameter, formulated as EntityAttributeValue, is "label = Test":

<S3QL><insert>project</insert><where><label>Test</label></where></S3QL>

The set of 12 s3db:relationships (see Table 1 of [40]) in the S3DB Model determines the organizational dependencies of S3DB entities. For example, s3db:PC is the s3db:relationship that specifies a dependency between an instance of a Collection (C) and an instance of a Project (P) (Figure 3). The S3QL syntax fulfils this constraint by assigning project_id (the identifier of an S3DB Project) as an attribute of a Collection. In this description of attributes associated with the S3DB core model we make use of the assumption, as in other KOSs and in Linked Data in general, that there is no restriction on adding relationships beyond those described here. S3QL was identified as the minimal representation to interoperate with the S3DB core model, and therefore only those relationships are explored in this report.

The syntax diagram in Figure 1 generates XML, a standard widely used in web service implementations. That alternative often results in verbose queries that could easily be assembled from more compact notations. One example to consider is the form action(E | Ea = value). Here the symbol '|' should be interpreted as in Bayesian inference, as a condition, and be read "given that". The letter "E" corresponds to the first letter of any S3DB entity (D, P, R, C, I, S or U) and Ea is any of its attributes, as described in Figure 2. In this notation, the query insert(P | label = test) is equivalent to the example query above. That particular variant is also accepted by the S3DB prototype, and a converter for this syntax into complete S3QL/XML syntax was made available at http://link.s3db.org/translate. For further compactness of this alternative formulation, entity identifiers used as parameters may be replaced with their corresponding alphanumeric identifiers; for example, project_id = 156 may be replaced with P156. This alternative notation will be used in the subsequent examples.
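To make the relation between the compact notation and the XML flavour concrete, the following Python sketch expands a query of the form action(E | Ea = value) into the XML structure shown earlier. It is not the converter published by the authors. The function name `compact_to_xml` is hypothetical, the lowercase entity names in the XML are an assumption generalized from the single "project" example in the text, and the sketch handles only comma-separated attribute = value pairs (not the P156-style shortcuts).

```python
import re
from xml.sax.saxutils import escape

# Entity letter -> name written inside the action tag (assumed lowercase,
# following the "project" example in the text).
ENTITIES = {"D": "deployment", "U": "user", "P": "project", "C": "collection",
            "R": "rule", "I": "item", "S": "statement"}
ACTIONS = {"select", "insert", "update", "delete"}

_QUERY = re.compile(r"^\s*(\w+)\s*\(\s*([A-Z])\s*\|\s*(.*?)\s*\)\s*$")


def compact_to_xml(query):
    """Expand e.g. 'insert(P | label = test)' into an S3QL XML string."""
    match = _QUERY.match(query)
    if match is None:
        raise ValueError(f"not a compact S3QL query: {query!r}")
    action, letter, params = match.groups()
    if action not in ACTIONS or letter not in ENTITIES:
        raise ValueError(f"unknown action or entity in: {query!r}")
    where = []
    for pair in params.split(","):
        name, sep, value = pair.partition("=")
        name = name.strip()
        if not sep or not re.fullmatch(r"\w+", name):
            raise ValueError(f"expected attribute = value, got: {pair!r}")
        # Escape the value so that characters such as '<' stay valid XML.
        where.append(f"<{name}>{escape(value.strip())}</{name}>")
    return (f"<S3QL><{action}>{ENTITIES[letter]}</{action}>"
            f"<where>{''.join(where)}</where></S3QL>")
```

Calling `compact_to_xml("insert(P | label = test)")` yields the same structure as the XML example given in the text, with "test" as the label value.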

S3QL Permission Control
Permission states are assigned using an S3QL query such as insert(U | U1, P157, permission_level = ysn), which includes the action insert, the target entity User and three input parameters: the identifier of the User (U1), the identifier of the entity (P157) and the permission assignment (ysn). Effectively, this will result in the creation of the triple [U1 ysn P157], where the subject is of type s3db:user, the predicate is of type s3db:operator and the object is of type s3db:project. The inclusion of this triple in a dataset will modulate the type of management operation that a user may perform. As described in the Methods section, each position in the permission assignment operator (ysn) encodes, respectively, the permission to "view", "change" or "use" the object entity. In this example the values y, s and n indicate, respectively, that the user has full permission to view the entity (y), permission to change its metadata only if that user was the creator of the entity (s), and no permission to insert child entities (n). Each S3QL operation is therefore tightly woven to one of the three operators: select is controlled by "view", update and delete are controlled by "change", and insert is controlled by "use" (shaded areas in Figure 1). The "use" operator encodes the ability of a user to create new relationships with the target entity, which is defined separately from the right to "change" it. For example, if a user (U1) is granted "y" as the effective permission to "change" an s3db:rule, then the metadata describing it may be altered. If, however, that same user is granted permission "n" to "use" that same Rule, that user is prevented from creating Statements using that Rule. Although "use" may be interpreted as being equivalent to "insert" or "append" in other systems, we have chosen to separate the terms describing the operator "use" from the S3QL action "insert". The permission assigned at the dataset level will then propagate in the S3DB transition matrix following the behaviour formalized in equation 1, therefore avoiding the need to assign permission to every user on every entity. It is worth noting that the DSL presented here is extensible beyond the 4 management actions (select, insert, update, delete) described. The s3db:operators that control permission on these actions are also extensible beyond "view", "change" and "use", and different implementations may support alternative states.

The permission control behaviour for S3QL operations can be illustrated through the use of Quadratus, an application available at http://q.s3db.org/quadratus that can be pointed at any S3DB deployment to assign permission states on S3DB entities to different users (Figure 4). Other use case scenarios are also explored in the S3QL specification at http://link.s3db.org/specs.
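The mapping between S3QL actions and the three permission positions can be written down as a small lookup. The Python sketch below is a reading of the rules stated in this section, not code from the S3DB prototype: the function name `is_allowed` and the `is_creator` flag are hypothetical, and two points are assumptions, namely that the state "s" behaves the same way in every position and that uppercase and lowercase states grant the same rights once the effective permission has been computed.

```python
# Order of the three positions in a permission string such as "ysn".
RIGHTS = ("view", "change", "use")

# Which right guards each S3QL action, as stated in the text.
ACTION_RIGHT = {
    "select": "view",
    "update": "change",
    "delete": "change",
    "insert": "use",
}


def is_allowed(effective, action, is_creator=False):
    """Decide whether an S3QL action is permitted under an effective permission."""
    if len(effective) != len(RIGHTS):
        raise ValueError(f"expected a 3-character permission, got {effective!r}")
    position = RIGHTS.index(ACTION_RIGHT[action])
    state = effective[position].lower()  # assumption: case does not change the right
    if state == "y":
        return True
    if state == "s":
        # "s": allowed only when the user created the entity.
        return is_creator
    if state == "n":
        return False
    raise ValueError(f"unknown permission state: {effective[position]!r}")
```

For a user holding "ysn" on a Project, this sketch allows select, allows update or delete only for entities the user created, and refuses insert.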

Global Dereferencing for Distributed Queries
A simple dereferencing system was devised for S3DB identifiers that relies on the identification of root deployments, i.e. S3DB systems where alphanumeric identifiers for S3DB deployments can be dereferenced to a URL. This simple mechanism enables complex transactions of controlled data. For an example of this behaviour see Figure 5, where the S3DB UID D327R172930, identifying an entity of type Rule (R172930) available in deployment D327, is being requested by a user registered in deployment D309. In order to retrieve the requested data, the URL of deployment D327 must first be resolved at a root deployment such as, for example, http://root.s3db.org.

The dereferencing mechanism is also applicable in more complex cases where the root of two deployments sharing data is not the same. Prepending the deployment identifier of the root to the UIDs, as in D1016666D327R172930, where D1016666 identifies the root deployment, would result in recursive URL resolution steps such as select(Durl| D1016666) prior to step 4 in Figure 5. This mechanism avoids broken links when S3DB deployments are moved to different URLs by enabling deployment metadata to be updated securely at the root using a public/private key encryption system.
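The recursive resolution described above can be sketched as a loop over the deployment prefixes of a UID. In the Python sketch below, `lookup(root_url, deployment_id)` is a hypothetical callable standing in for the URL query issued at a root deployment; no network call is made, and the function name `resolve` and the example URLs are assumptions introduced for illustration only.

```python
import re

_UID = re.compile(r"((?:D\d+)+)([UPCRIS]\d+)")


def resolve(uid, lookup, default_root_url):
    """Return (deployment URL, local identifier) for a UID such as 'D327R172930'.

    Each deployment prefix is resolved at the deployment found so far,
    starting from the default root, so 'D1016666D327R172930' first resolves
    the root D1016666 and then asks that root for D327.
    """
    match = _UID.fullmatch(uid)
    if match is None:
        raise ValueError(f"expected deployment prefix(es) and an entity: {uid!r}")
    url = default_root_url
    for deployment_id in re.findall(r"D\d+", match.group(1)):
        url = lookup(url, deployment_id)
    return url, match.group(2)


# Usage with an in-memory registry instead of real deployments.
registry = {
    ("http://root.example", "D327"): "http://d327.example",
    ("http://root.example", "D1016666"): "http://other-root.example",
    ("http://other-root.example", "D327"): "http://d327-elsewhere.example",
}
url, local_id = resolve("D1016666D327R172930",
                        lambda root, dep: registry[(root, dep)],
                        "http://root.example")
```

Once the deployment URL is known, the query can be re-issued there against the local identifier alone, as in step 4 of Figure 5.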

Implementation and benchmarking
In the current prototype implementation, S3QL is submitted to S3DB deployments using either a GET or POST request and may include an optional authentication token (key). The REST specification [54] suggests separate HTTP methods according to the intention of the operation: often "GET" is used to retrieve data, "PUT" is used to submit data, "POST" is used to update data and "DELETE" is used to remove data. There are, however, many programming environments that implement only the REST "GET" method, including many popular computational statistics programming frameworks such as R and Matlab. Therefore, in order to fully explore the integrative potential of this read/write semantic web service, and to support operations beyond the 4 implemented, the S3DB prototype implementation of S3QL supports the "GET" method for all S3QL operations, with the parameters of the S3QL call appended to the URL. One drawback of relying on GET is the limit imposed by the browser on URL length. To address this potential problem, the S3DB prototype also supports the use of "POST" for S3QL calls.
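The choice between GET and POST described above can be illustrated with a small request builder. This Python sketch only prepares the request and does not send it. The endpoint URL, the parameter names `query` and `key`, and the 2000-character threshold are assumptions made for illustration; actual URL length limits differ between clients and servers, and the real parameter names should be taken from the S3QL documentation.

```python
from urllib.parse import urlencode

# Assumed, conservative URL length limit; not a value taken from the paper.
MAX_GET_LENGTH = 2000


def build_request(endpoint, query, key=None):
    """Return (method, url, body) for an S3QL call.

    The call is sent as GET with the parameters appended to the URL unless
    the resulting URL would be too long, in which case POST is used.
    """
    params = {"query": query}
    if key is not None:
        params["key"] = key  # optional authentication token
    encoded = urlencode(params)
    url = f"{endpoint}?{encoded}"
    if len(url) <= MAX_GET_LENGTH:
        return "GET", url, None
    # urlencode output is percent-encoded, so it is safe to encode as ASCII.
    return "POST", endpoint, encoded.encode("ascii")
```

A short query such as `select(C|P1196457)` stays within the limit and is returned as a GET request, whereas a query carrying a very long value falls back to POST.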

Figure 4 Quadratus, an interface to illustrate S3QL's permission control mechanism and its effect on S3QL queries. Projects, Collections, Rules, Items and Statements associated with the GI Clinical Trials Project are displayed with effective and assigned permissions. Collections and Rules retrieved using the S3QL queries select(C|P1196457) and select(R|P1196457) inherit the permission assignment in the Project, "ysn"; the assigned permission "N--" in Collection "Demographics" results in the effective permission "Nsn", inherited by all Rules and Items that have a relationship with that Collection, effectively preventing gi_user1 from accessing its data. The directed labelled graph of the propagation resolution is displayed on the right side of the application, illustrating the propagation mechanism.

Two further challenges needed to be addressed in the prototypical implementation of S3QL: 1) the need for a centralized root location to support dereferencing of deployment URIs when the condensed version is used (e.g. D282), and 2) distributed queries on REST systems require users to authenticate in multiple KOSs. The first challenge was addressed by configuring an S3DB deployment as the root location, available at http://root.s3db.org. Deployment metadata is submitted to this root deployment at configuration time using S3QL; data pertaining to each deployment can therefore be dereferenced to a URL using http://root.s3db.org/D[numeric]. A root deployment other than the default may be chosen during installation. To avoid overloading these root deployments with too many requests, a local 24 hour cache of all accessed deployments is kept in each S3DB deployment using the same strategy: if the URL is cached, it will not be requested from the root deployment. To address the second challenge, each s3db:deployment can store any number of authentication services supporting the HTTP, FTP or LDAP protocols. Once the user is authenticated, temporary surrogate tokens are issued with each query. When coupled with the user identifier in the format of a URI, these tokens effectively identify the user performing the query regardless of the S3DB deployment where the query is requested.

Screencasts illustrating processing time of data manipulation using S3QL are available at http://www.youtube.com/watch?v=2KZC6kI609s and http://www.youtube.com/watch?v=FJSYLCwBaPI.
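The local 24 hour cache of deployment URLs mentioned in this section can be approximated with a small time-to-live cache. The Python sketch below is an assumption about how such a cache might look, not the prototype's PHP implementation: the class name `DeploymentCache`, the `fetch` callable (standing in for a lookup at the root deployment) and the injectable `clock` are all hypothetical.

```python
import time


class DeploymentCache:
    """Remember deployment URLs for a fixed time before asking the root again."""

    def __init__(self, fetch, ttl_seconds=24 * 60 * 60, clock=time.time):
        self._fetch = fetch        # hypothetical lookup at the root deployment
        self._ttl = ttl_seconds    # 24 hours by default, as in the text
        self._clock = clock        # injectable so that expiry can be tested
        self._entries = {}         # deployment id -> (url, time stored)

    def url_for(self, deployment_id):
        now = self._clock()
        entry = self._entries.get(deployment_id)
        if entry is not None and now - entry[1] < self._ttl:
            # Fresh entry: the root deployment is not contacted.
            return entry[0]
        url = self._fetch(deployment_id)
        self._entries[deployment_id] = (url, now)
        return url
```

Repeated calls to `url_for("D327")` within 24 hours trigger a single lookup; after the entry expires, the next call asks the root deployment again and refreshes the stored URL.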

Figure 5 The global S3QL dereferencing system. User U78 of deployment D309 issues a command to request all entities of type Statement where the attribute Rule_id corresponds to the value D327R172930, through the S3QL operation select(Svalue|D327R172930) (1). If the URL corresponding to deployment D327 is not cached locally, or has not been validated in the past 24 h, a query is issued and executed at the root deployment to retrieve the corresponding URL: select(Durl|D327) (2, 3). Once the URL is returned, query (1) is re-issued as select(Svalue| R172930) and executed at the URL for D327 (4). To validate the user, deployment D327 issues the command select(Uid|U76) at D309 using the key provided (5, 6) and returns the data only if Uid matches the value for user_id (7).

Discussion
One of the major concerns in making use of Linked Data to improve health care and life sciences research is the need to ensure both the availability of contextual information about experimental datasets and the ability to protect the privacy of certain data elements which may identify an individual patient. Domain specific languages (DSLs) can ease the task of managing the contextual descriptors that would be necessary to implement permission control in RDF and, by doing so, could greatly accelerate the rate of adoption of Linked Data formalisms in the life sciences communities to improve scientific discovery. We have described S3QL, a DSL to perform read/write operations on entities of the S3DB Core Model. S3QL attempts to address the requirements in linking Life Sciences datasets, including both publishable and un-publishable data elements, by 1) including contextual descriptors for every submitted data element and 2) making use of those descriptors to ensure permission control managed by the data experts themselves. This avoids the need to break a consolidated dataset into its public and private parts when the results are acceptable for publication.

Applying S3QL to the S3DB Core Model in a prototypical application benefited from the definition of loosely defined boundaries for RDF data that enabled propagation of permission while avoiding the need to document a relationship for each data instance individually and for each user. The assembly of SPARQL queries is also facilitated by the identification of domain triples, using named graphs, from the data itself [41] and can be illustrated in the application at [55], where a subset of S3QL can be readily converted into the W3C standard SPARQL. Although the prototypical implementation of S3QL fits the definition of an API for S3DB systems, it is immediately apparent that the same notation could be easily and intuitively extended to other KOSs' core models. For example, pointing the tool at http://q.s3db.org/translate to a JavaScript Object Notation (JSON) representation of the SKOS core model (skos.js) instead of the default S3DB core model (s3db.js) results in a valid S3QL syntax that could easily be applied as an API for SKOS based systems, as illustrated by applying the example query "select(C|prefLabel = animal)" using http://link.s3db.org/translate?core=skos.js&query=select(C|prefLabel=animal) to retrieve SKOS concepts labelled "animal". The progress of adoption for life sciences application developers can be further smoothed by complete reliance on the REST protocol for data exchange and the availability of widely used formats such as JSON, XML or RDF/turtle.

The applicability of S3QL to life sciences domains is illustrated here with three case studies: 1) in the domain of clinical trials, a project [42] that requires collaboration between departments with different interests; 2) in the domain of the cancer genome atlas (TCGA) project, a multi-institutional effort that requires multiple authentication mechanisms and sharing of data among multiple institutions [41]; and 3) in the domain of molecular epidemiology, a project where non-public data stored in an S3DB deployment needs to be statistically integrated with data from a public repository. All described use cases shared one considerable requirement: the ability to include in the same dataset both published and unpublished results. As such, they required both the annotation of contextual descriptors of the data, enabled by S3QL, and the availability of controlled permission propagation, enabled by the S3DB model transition matrix. Future work in this effort may include the application of S3QL in a Knowledge Organization System based on SKOS terminology and the definition of a transition matrix for SKOS to enable controlled permission propagation.

Gastro-intestinal Clinical Trials Use Case
As part of a collaboration with the Department of Gastrointestinal Medical Oncology at The University of Texas MD Anderson Cancer Center, an S3DB deployment was configured to host data from gastro-intestinal (GI) clinical trials. A schema was developed using S3DB Collections and Rules, and S3QL insert queries were used to submit data elements as Items and Statements (see Figure 6). Two permission propagation examples are illustrated, one of a restrictive nature and the other permissive, which can also be explored at http://q.s3db.org/quadratus (Figure 4). In this example, the simple mechanisms of propagation defined for S3QL support the complex social interaction that requires a fraction of the dataset to be shared with certain users but not with others. Contextual usage is therefore a function of the attributes of the data itself (e.g. its creator) and the user identification token that is submitted with every read/write operation.

The Cancer Genome Atlas Use Case
The cancer genome atlas (TCGA) is a pilot project to characterize several types of cancer by sequencing and genetically characterizing tumours for over 500 patients throughout multiple institutions. S3QL was used in this case study to produce an infrastructure that exposes the public portion of the TCGA datasets as a SPARQL endpoint [41]. This was possible because SPARQL is entailed by S3QL, but not the opposite. Specifically, SPARQL queries can be serialized to S3QL, but the opposite is not always possible, particularly as regards write and access control operations. The structure of the S3DB Core Model, which explicitly distinguishes domain from instantiation, enables SPARQL query patterns such as ?Patient :R390 ?cancerType to be readily serialized into their S3QL equivalent, select(S|R390). Although this will not be further explored in this discussion, it is worth noting that the availability of this serialization allows for an intuitive syntax of SPARQL queries by patterning them on the description of the user-defined domain Rule, such as "?Patient :hasCancerType ?cancerType".
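The serialization step mentioned in this use case can be illustrated for the simplest situation, a single triple pattern whose predicate is an S3DB Rule identifier. The Python sketch below is not the serialization engine referenced in [55]; the function name `pattern_to_s3ql` is hypothetical, the exact SPARQL spelling of the pattern (variables prefixed with "?", Rule written as ":R390") is an assumption, and only patterns with a variable subject and a variable object are handled.

```python
import re

# One triple pattern: ?subject :R<number> ?object, with an optional final dot.
_PATTERN = re.compile(r"^\s*(\?\w+)\s+:?(R\d+)\s+(\?\w+)\s*\.?\s*$")


def pattern_to_s3ql(triple_pattern):
    """Serialize a pattern such as '?Patient :R390 ?cancerType' to S3QL."""
    match = _PATTERN.match(triple_pattern)
    if match is None:
        raise ValueError(f"unsupported triple pattern: {triple_pattern!r}")
    rule_id = match.group(2)
    # All Statements made with this Rule are the instances of the pattern.
    return f"select(S|{rule_id})"
```

Under these assumptions, the pattern from the text maps to `select(S|R390)`; a pattern that names the Rule by its label would first need that label to be resolved to a Rule identifier, which this sketch does not attempt.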

A Molecular Epidemiology Use Case
In this example, SPARQL was serialized to S3QL to support a computational statistics application. As a first step, an S3DB data store was deployed using S3QL to manage molecular epidemiology data related to strains of Staphylococcus aureus bacteria collected at the Instituto de Tecnologia Química e Biológica (ITQB) in Portugal. Specifically, the ITQB Staphylococcus reference database was devised with the purpose of managing Multilocus Sequence Typing (MLST) data, a typing method used to track the molecular epidemiology of infectious bacteria [54-57]. As a second step, we downloaded the public Staphylococcus aureus MLST profiles database at http://www.mlst.net and made it available through a SPARQL endpoint at http://q.s3db.org/mlst_sparql/endpoint.php. The process of integration of MLST profiles from the ITQB Staphylococcus database with the publicly available MLST profiles is illustrated in Figure 7. In this example, a federated SPARQL query is assembled to access both MLST sources; data stored in the S3DB deployment is retrieved by serializing the SPARQL query into S3QL and providing an authentication token to identify the user, as described in Figure 3. The assembled graph resulting from the federated SPARQL query (see additional file 2) can be imported into a statistical computing environment such as Matlab (Mathworks Inc.). Using this methodology, it was possible to cluster strains from two different data sources with very different authentication requirements. The observation that some Portuguese strains (PT1, PT2, PT15 and PT21) that are not publicly shared cluster together with a group of public UK strains (UK17, UK16, UK11, UK14, UK15, UK13, UK12), and therefore may share a common ancestor, was enabled by the data integrated through S3QL.

ConclusionsLife sciences applications are set to greatly benefit fromcoupling Semantic Web Linked Data standards and KOSsIn the current report we illustrate data models from lifesciences domains weaved using the S3DB knowledge orga-nization system In line with the requirements for theemergence of evolvable ldquosocial machinesrdquo different per-spectives on the data are made possible by a permissionpropagation mechanism controlled by contextual attri-butes of data elements such as its creator The operationof the S3DB KOS is mediated by the S3QL protocoldescribed in this report which exposes its ApplicationProgramming Interface for viewing inserting updating

and removing data elements Because S3QL is implemen-ted with a distributed architecture where URIs can bedereferenced into multiple S3DB deployments domainexperts can share data on their own deployments withusers of other systems without the need for localaccounts Therefore S3QLrsquos fine grained permission con-trol defined as instances of s3dboperators enables domainexperts to clearly specify the degree of permission that auser should have on a resource and how that permissionshould propagate in a distributed infrastructure This is incontrast to the conventional approach of delegating per-mission management to the point of access In the currentSPARQL specification extension various data sources canbe queried simultaneously or sequentially There is still noaccepted convention for tying a query pattern to anauthenticated user probably because SPARQL engineswould have no use for that information as most have beencreated in a context of Linked Open Data efforts The cri-tical limitation in applying this solution for Health Careand Life Sciences is the ability to make use of contextualinformation to determine both the level of trust on thedata and to enable controlled access to elements in a

giuser1r1r

giuser2r22

gimanagerere

GI Clinical Trials project

Patient collection

Demographic data collection

Tissue collection

gimanagererer

merge 1 (restrictive)

merge 2 (permissive)

Assign permission

Assign permission

Patient Name statements

Tissue analysis statements

merge ( ) =

) = merge (

merge 1

merge 2

Figure 6 Two use cases of permission propagation Two users are granted full permission to view GI Clinical Trials Project (yrdquo) however noneof them can add new data (nrdquo) nor edit existing data unless they were its original creator (srdquo) One of the users (giuser1) is prevented fromaccessing any data with demographic elements such as for example the names of the Patients In this case an uppercase ldquoNrdquo is assigned in theright to view the Collection DemographicData which will be merged with the inherited ldquoyrdquo to produce an effective permission level of ldquoNrdquo forthe right to view (merge 1) For the second user (giuser2) permission is granted to use the Collection TissueData indicating that the user cancreate new instances in that Collection (merge 2)

Deus et al. BMC Bioinformatics 2011, 12:285 http://www.biomedcentral.com/1471-2105/12/285


dataset without breaking it apart and storing it in multiple systems. To address this requirement, S3QL was fitted with distributed control operational features that follow design criteria found desirable for biomedical applications. S3QL is not unique in its class: for example, the Linked Data API, which is being used by data.gov.uk, is an alternative DSL to manage linked data [58]. However, we believe that S3QL is closer to the technologies currently used by application developers and may therefore provide a more suitable middle layer between linked data formalisms and application development. It is argued that these features may assist and anticipate future extensions of semantic web provenance control formalisms.

Additional material

Additional file 1: Walkthrough of S3DB's permission propagation: merge, migrate and percolate.

Additional file 2: Matrix of MLST profiles in Portugal and the United Kingdom.

Acknowledgements and funding

This work was supported in part by the Center for Clinical and Translational Sciences of the University of Alabama at Birmingham under contract no. 5UL1 RR025777-03 from NIH National Center for Research Resources, by the National Cancer Institute grant 1U24CA143883-01, by the European Union FP7 PNEUMOPATH project award and by the Portuguese Science and Technology foundation under contracts PTDC/EIA-EIA/105245/2008 and PTDC/EEA-ACR/69530/2006. HFD also thankfully acknowledges a PhD fellowship from the same foundation, award SFRH/BD/45963/2008, and the Science Foundation Ireland project Lion under Grant No. SFI/02/CE1/I131. The authors would also like to thank three anonymous reviewers whose comments and advice were extremely valuable for improving the manuscript.

Author details

1 Digital Enterprise Research Institute, National University of Ireland at Galway, IDA Business Park, Lower Dangan, Galway, Ireland. 2 Biomathematics, Instituto de Tecnologia Química e Biológica, Universidade Nova de Lisboa, Av. da República, Estação Agronómica Nacional, 2780-157 Oeiras, Portugal. 3 Laboratório Nacional de Computação Científica, Av. Getúlio Vargas 333, Quitandinha, 25651-075 Petrópolis, Brasil. 4 Sanofi Pasteur, 38 Sidney Street, Cambridge, MA 02139, USA. 5 Laboratory of Molecular Genetics, Instituto de Tecnologia Química e Biológica, Universidade Nova de Lisboa, Av. da República, Estação Agronómica Nacional, 2780-157 Oeiras, Portugal. 6 Research Center for Intelligent Media, Furtwangen University, Furtwangen, Germany. 7 Laboratory of Microbiology, The Rockefeller University, 10021 New York, USA. 8 Division of Informatics, Department of Pathology, University of Alabama at Birmingham, 619 South 19th Street, Birmingham, Alabama, USA.

Authors' contributions

HFD developed the S3QL domain specific language, implemented it in the S3DB prototype, validated it with examples and wrote the manuscript. MCC, RS, MM, HL, RF, WM and JSA tested and validated the language with examples and made suggestions which led to its improvement. All authors read and approved the final manuscript.

Figure 7 Workflow for the creation of a hierarchical cluster of MLST profiles from strains collected in Portugal and in the UK. Two sources are chosen to perform the MLST profile assembly: an S3DB deployment, which holds the ITQB Staphylococcus database, and the SPARQL endpoint configured to host the Staphylococcus aureus MLST database. Although data in the MLST SPARQL endpoint is publicly available, to access data in S3DB the user needs to provide an authentication token and a user_id, as well as assemble the S3QL queries to retrieve the data [select(S|R172930) and select(S|R167271)], which may also be formulated as SPARQL. A data structure is assembled that can finally be analysed in a statistical computing environment (e.g. Matlab, R) using hierarchical clustering methods such as the ones made available by Matlab. The complete MLST dataset used to obtain the graph is provided in additional file 2.


Competing interests

The authors declare that they have no competing interests.

Received: 14 April 2011. Accepted: 14 July 2011. Published: 14 July 2011.

References

1. Bell G, Hey T, Szalay A: Computer science. Beyond the data deluge. Science (New York, NY) 2009, 323:1297-8.
2. Chiang AP, Butte AJ: Data-driven methods to discover molecular determinants of serious adverse drug events. Clinical pharmacology and therapeutics 2009, 85:259-68.
3. The end of theory: the data deluge makes the scientific method obsolete [http://www.wired.com/science/discoveries/magazine/16-07/pb_theory]
4. Hubbard T: The Ensembl genome database project. Nucleic Acids Research 2002, 30:38-41.
5. Karolchik D: The UCSC Genome Browser Database. Nucleic Acids Research 2003, 31:51-54.
6. Maglott D, Ostell J, Pruitt KD, Tatusova T: Entrez Gene: gene-centered information at NCBI. Nucleic acids research 2005, 33:D54-8.
7. Ashburner M, Ball CA, Blake JA, et al: Gene ontology: tool for the unification of biology. The Gene Ontology Consortium. Nature genetics 2000, 25:25-9.
8. Bizer C, Heath T, Berners-Lee T: Linked Data - The Story So Far. International Journal on Semantic Web and Information Systems (IJSWIS) 2009.
9. Linked Data | Linked Data - Connect Distributed Data across the Web [http://linkeddata.org]
10. Linked data - Design issues [http://www.w3.org/DesignIssues/LinkedData.html]
11. Vandervalk BP, McCarthy EL, Wilkinson MD: Moby and Moby 2: creatures of the deep (web). Briefings in bioinformatics 2009, 10:114-28.
12. Where the semantic web stumbled, linked data will succeed - O'Reilly Radar [http://radar.oreilly.com/2010/11/semantic-web-linked-data.html]
13. Berners-Lee T, Weitzner DJ, Hall W, et al: A Framework for Web Science. Foundations and Trends® in Web Science 2006, 1:1-130.
14. Hendler J, Berners-Lee T: From the Semantic Web to social machines: A research challenge for AI on the World Wide Web. Artificial Intelligence 2010, 174:156-161.
15. Almeida JS, Chen C, Gorlitsky R, et al: Data integration gets "Sloppy". Nature biotechnology 2006, 24:1070-1.
16. Deus HF, Stanislaus R, Veiga DF, et al: A Semantic Web management model for integrative biomedical informatics. PloS one 2008, 3:e2946.
17. Putting the Web back in Semantic Web [http://www.w3.org/2005/Talks/1110-iswc-tbl/#(1)]
18. SPARQL Query Language for RDF [http://www.w3.org/TR/rdf-sparql-query]
19. Alexander K, Cyganiak R, Hausenblas M, Zhao J: Describing Linked Datasets - On the Design and Usage of voiD, the "Vocabulary Of Interlinked Datasets". Linked Data on the Web Workshop (LDOW 09), in conjunction with 18th International World Wide Web Conference (WWW 09) 2009.
20. Cheung KH, Frost HR, Marshall MS, et al: A journey to Semantic Web query federation in the life sciences. BMC bioinformatics 2009, 10(Suppl 1):S10.
21. A Prototype Knowledge Base for the Life Sciences [http://www.w3.org/TR/hcls-kb]
22. Belleau F, Nolin M-A, Tourigny N, Rigault P, Morissette J: Bio2RDF: towards a mashup to build bioinformatics knowledge systems. Journal of biomedical informatics 2008, 41:706-16.
23. Smith B, Ashburner M, Rosse C, et al: The OBO Foundry: coordinated evolution of ontologies to support biomedical data integration. Nature biotechnology 2007, 25:1251-5.
24. Taylor CF, Field D, Sansone SA, et al: Promoting coherent minimum reporting guidelines for biological and biomedical investigations: the MIBBI project. Nature biotechnology 2008, 26:889-96.
25. Noy NF, Shah NH, Whetzel PL, et al: BioPortal: ontologies and integrated data resources at the click of a mouse. Nucleic acids research 2009, 37:W170-3.
26. Deus HF, Prud E, Zhao J, Marshall MS, Samwald M: Provenance of Microarray Experiments for a Better Understanding of Experiment Results. ISWC 2010 SWPM 2010.
27. Stein LD: Integrating biological databases. Nature reviews Genetics 2003, 4:337-45.
28. Goble C, Stevens R: State of the nation in data integration for bioinformatics. Journal of Biomedical Informatics 2008, 41:687-693.
29. Ludäscher B, Altintas I, Bowers S, et al: Scientific Process Automation and Workflow Management. In Scientific Data Management. Edited by Shoshani A, Rotem D. Chapman; 2009.
30. Nelson B: Data sharing: Empty archives. Nature 2009, 461:160-3.
31. Stanislaus R, Chen C, Franklin J, Arthur J, Almeida JS: AGML Central: web based gel proteomic infrastructure. Bioinformatics (Oxford, England) 2005, 21:1754-7.
32. Silva S, Gouveia-Oliveira R, Maretzek A, et al: EURISWEB - Web-based epidemiological surveillance of antibiotic-resistant pneumococci in day care centers. BMC medical informatics and decision making 2003, 3:9.
33. Describing Linked Datasets with the VoiD Vocabulary [http://www.w3.org/TR/2011/NOTE-void-20110303]
34. HIPAA Administrative Simplification Statute and Rules [http://www.hhs.gov/ocr/privacy/hipaa/administrative/index.html]
35. Socially Aware Cloud Storage [http://www.w3.org/DesignIssues/CloudStorage.html]
36. Koslow SH: Opinion: Sharing primary data: a threat or asset to discovery? Nature reviews Neuroscience 2002, 3:311-3.
37. Baggerly KA, Coombes KR: Deriving chemosensitivity from cell lines: Forensic bioinformatics and reproducible research in high-throughput biology. The Annals of Applied Statistics 2009, 3:1309-1334.
38. Hodge G: Systems of Knowledge Organization for Digital Libraries: Beyond Traditional Authority Files. 2000.
39. SKOS Simple Knowledge Organization System Reference [http://www.w3.org/TR/skos-reference]
40. Almeida JS, Deus HF, Maass W: S3DB core: a framework for RDF generation and management in bioinformatics infrastructures. BMC bioinformatics 2010, 11:387.
41. Deus HF, Veiga DF, Freire PR, et al: Exposing The Cancer Genome Atlas as a SPARQL endpoint. Journal of Biomedical Informatics 2010, 43:998-1008.
42. Correa MC, Deus HF, Vasconcelos AT, et al: AGUIA: autonomous graphical user interface assembly for clinical trials semantic data services. BMC medical informatics and decision making 2010, 10.
43. Freire P, Vilela M, Deus H, et al: Exploratory analysis of the copy number alterations in glioblastoma multiforme. PloS one 2008, 3:e4076.
44. NCBO Ontology Widgets [http://www.bioontology.org/wiki/index.php/NCBO_Widgets]
45. Bussler C: Is Semantic Web Technology Taking the Wrong Turn? IEEE Internet Computing 2008, 12:75-79.
46. What people find hard about linked data [http://dynamicorange.com/2010/11/15/what-people-find-hard-about-linked-data]
47. Raja A, Lakshmanan D: Domain Specific Languages. International Journal of Computer Applications 2010, 1:99-105.
48. SPARQL Update [http://www.w3.org/TR/sparql11-update]
49. Carroll JJ, Bizer C, Hayes P, Stickler P: Named graphs, provenance and trust. Proceedings of the 14th international conference on World Wide Web - WWW 05 2005, 14:613.
50. S3DB operator function states [http://code.google.com/p/s3db-operator]
51. S3DB Operators [http://s3db-operator.googlecode.com/hg/propagation.html]
52. Deus HF, Sousa MA de, Carrico JA, Lencastre H de, Almeida JS: Adapting experimental ontologies for molecular epidemiology. AMIA Annual Symposium proceedings 2007, 935.
53. The OAuth 1.0 Protocol [http://tools.ietf.org/html/rfc5849]
54. Francisco AP, Bugalho M, Ramirez M, Carriço JA: Global optimal eBURST analysis of multilocus typing data using a graphic matroid approach. BMC bioinformatics 2009, 10:152.
55. S3QL serialization engine [http://js.s3db.googlecode.com/hg/translate/quickTranslate.html]
56. Ippolito G, Leone S, Lauria FN, Nicastri E, Wenzel RP: Methicillin-resistant Staphylococcus aureus: the superbug. International journal of infectious diseases 2010, 14:S7-S11.
57. Harris SR, Feil EJ, Holden MTG, et al: Evolution of MRSA during hospital transmission and intercontinental spread. Science (New York, NY) 2010, 327:469-74.
58. Linked Data API [http://code.google.com/p/linked-data-api]

doi:10.1186/1471-2105-12-285
Cite this article as: Deus et al.: S3QL: A distributed domain specific language for controlled semantic integration of life sciences data. BMC Bioinformatics 2011, 12:285.



Components of a Distributed System

One of the requirements for RDF-based knowledge management ecosystems is the availability of queries spanning multiple SPARQL endpoints. Automation of distributed queries in systems supporting permission control, such as S3DB, is challenged by user authentication. In S3QL we propose addressing this through delegation to authentication authorities. As a result, a user (or usage) can be identified by a URI that is independent of the authorities that validate it. Whenever possible, it is recommended that authentication credentials be protected by use of OAuth [53].

Use of URIs and Internationalized Resource Identifiers (IRIs) to identify data elements is one of the core principles of Linked Data. However, many programming environments cannot easily handle URIs as element identifiers. Problems range from decreased processing speed to a need for encoding the URIs in web service exchanges. In anticipation of that class of problems, the URIs for entities in S3DB are interchangeable with alphanumeric identifiers formulated as the concatenation of one of D, U, P, C, R, I or S (referring to the S3DB entities described in The S3DB Knowledge Organization Model), identifying the entity, and a unique number. As an example, for a deployment located at URL http://q.s3db.org/demo, the alphanumeric P126 is resolvable to an entity of type Project with URI httpqs3dborgs3dbdemoP126. To facilitate exchange of URIs between distinct deployments, the URI above could also be specified as D282P126, where "D282" is the alphanumeric identifier of the S3DB deployment located at URL http://q.s3db.org/demo. Every s3db:deployment is identified by a named graph in the form D[number]; for completeness, metadata pertaining to each s3db:deployment, such as the corresponding URL, is described using the vocabulary of interlinked datasets (VoiD) [19] and shared through a root location.
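The identifier scheme above can be sketched in a few lines of Python. This is only an illustration, not code from the S3DB prototype: the function name `parse_identifier` is hypothetical, and the sketch assumes that identifiers are a plain concatenation of letter and number pairs exactly as printed in this text (for instance D282P126).

```python
import re

# Entity letters listed in the text, mapped to S3DB core model entity names.
ENTITY_TYPES = {
    "D": "Deployment", "U": "User", "P": "Project", "C": "Collection",
    "R": "Rule", "I": "Item", "S": "Statement",
}

_ID_PATTERN = re.compile(r"([DUPCRIS])(\d+)")


def parse_identifier(identifier):
    """Split an identifier such as 'D282P126' into (entity type, number) pairs.

    Hypothetical helper: assumes letter+number pairs with no separator.
    """
    parts = _ID_PATTERN.findall(identifier)
    rebuilt = "".join(letter + number for letter, number in parts)
    # Reject input that contains anything besides letter+number pairs.
    if not parts or rebuilt != identifier:
        raise ValueError(f"not a valid S3DB identifier: {identifier!r}")
    return [(ENTITY_TYPES[letter], int(number)) for letter, number in parts]
```

Under these assumptions, `parse_identifier("D282P126")` yields a Deployment part followed by a Project part, which a client could use to decide which deployment URL has to be resolved before the Project is requested.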

Availability and Documentation

The specification of the S3QL language has been made available at http://link.s3db.org/specs, and one example of the output RDF is available at http://link.s3db.org/example. S3QL has been implemented through a REST application programming interface (API) for the S3DB prototype, which is publicly available at http://s3db.org. Both the prototype and its API were developed in PHP, with MySQL or PostgreSQL for data storage. Documentation about the S3DB implementation of S3QL as an API can be found at http://link.s3db.org/docs. S3QL queries may be tested at the demo implementation at http://link.s3db.org/s3qldemo, and a translator for the compact notation is available at http://link.s3db.org/translate.

Results

S3QL Syntax

S3QL is a domain specific language devised to facilitate management operations such as "insert", "update" or "delete" using entities of a Linked Data KOS such as the S3DB core model described above. Its syntax, however, is only loosely tied to the S3DB Core Model and can easily be applied to a set of KOS core models in which S3DB is included. The complete syntax of S3QL, in its XML (eXtensible Markup Language) flavour, is represented in the railroad diagram of Figure 1. The S3QL syntax includes three elements: the description of the operation, the target entity and the input parameters. Four basic operation descriptions were deemed necessary to fully support read/write operations: select, insert, update and delete. The actions of these operations mimic those of the structured query language (SQL) and target instances of entities (E) defined in the core model. Input parameters include the set of attributes defined for each of the entities, either in the alphanumeric form associated with entity instances (EntityId) or in the form of EntityAttributeValue (Ea). The values for Ea are determined upon choice of E; for an example using S3DB entities and attributes, see Figure 2. E may be replaced with any of the entities defined in the S3DB Core Model (Deployment, User, Project, Collection, Rule, Item or Statement); upon choice of E, valid forms of Ea include any of the attributes defined for E (e.g. rdf:id, rdfs:label, rdfs:comment). A table summarizing all available operations, targets and input parameters is made available at http://link.s3db.org/specs. The formal S3QL syntax is completed by enclosing the outcome of one of the diagrams in Figure 1 with the <S3QL> tag. For example, the following XML structure is a valid S3QL query for an operation of type insert, where the target is the S3DB entity Project and the input parameter, formulated as EntityAttributeValue, is "label = Test":

<S3QL><insert>project</insert><where><label>Test</label></where></S3QL>

The set of 12 s3db:relationships (see Table one of [40]) in the S3DB Model determines the organizational dependencies of S3DB entities. For example, s3db:PC is the s3db:relationship that specifies a dependency between an instance of a Collection (C) and an instance of a Project (P) (Figure 3). The S3QL syntax fulfils this constraint by assigning project_id (the identifier of an S3DB Project) as an attribute of a Collection. In this description of attributes associated with the S3DB core model we make use of the assumption, as in other


KOSs and in Linked Data in general, that there is no restriction on adding relationships beyond those described here. S3QL was identified as the minimal representation to interoperate with the S3DB core model, and therefore only those relationships are explored in this report.

The syntax diagram in Figure 1 generates XML, a standard widely used in web service implementations. That alternative often results in verbose queries that could easily be assembled from more compact notations. One example to consider is the form action(E | Ea = value). Here the symbol '|' should be interpreted, as in Bayesian inference, as a condition and be read "given that". The letter "E" corresponds to the first letter of any S3DB entity (D, P, R, C, I, S or U) and Ea is any of its attributes, as described in Figure 2. In this notation, the query insert(P | label = test) is equivalent to the example query above. That particular variant is also accepted by the S3DB prototype, and a converter for this syntax into complete S3QL/XML syntax was made available at http://link.s3db.org/translate. For further compactness of this alternative formulation, entity identifiers used as parameters may be replaced with their corresponding alphanumeric identifiers; for example, project_id = 156 may be replaced with P156. This alternative notation will be used in the subsequent examples.
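To make the relation between the compact form and the XML form concrete, the following Python sketch converts a query of the shape action(E | Ea = value) into the corresponding XML. It is not the translator referred to above; `compact_to_xml` is a hypothetical name, only a single attribute condition is handled, and the lowercase entity names in the XML are an assumption extrapolated from the "project" example in the text.

```python
import re
from xml.sax.saxutils import escape

ACTIONS = {"select", "insert", "update", "delete"}

# Lowercase element names are an assumption based on the 'project' example.
ENTITY_NAMES = {
    "D": "deployment", "U": "user", "P": "project", "C": "collection",
    "R": "rule", "I": "item", "S": "statement",
}

# action ( E | attribute = value )
_COMPACT = re.compile(
    r"^\s*(\w+)\s*\(\s*([A-Z])\s*\|\s*(\w+)\s*=\s*(.*?)\s*\)\s*$"
)


def compact_to_xml(query):
    """Translate e.g. 'insert(P | label = test)' into S3QL/XML (sketch only)."""
    match = _COMPACT.match(query)
    if match is None:
        raise ValueError(f"unsupported compact query: {query!r}")
    action, letter, attribute, value = match.groups()
    if action not in ACTIONS or letter not in ENTITY_NAMES:
        raise ValueError(f"unknown action or entity in: {query!r}")
    return (
        f"<S3QL><{action}>{ENTITY_NAMES[letter]}</{action}>"
        f"<where><{attribute}>{escape(value)}</{attribute}></where></S3QL>"
    )
```

The value is XML-escaped so that characters such as "&" or "<" in a label do not break the generated document. Queries that pass bare identifiers (for example select(S|R390)) are outside the scope of this sketch.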

S3QL Permission Control

Permission states are assigned using an S3QL query such as insert(U | U1, P157, permission_level = ysn), which includes the action insert, the target entity User and three input parameters: the identifier of the User (U1), the identifier of the entity (P157) and the permission assignment (ysn). Effectively, this will result in the creation of the triple [U1 ysn P157], where the subject is of type s3db:user, the predicate is of type s3db:operator and the object is of type s3db:project. The inclusion of this triple in a dataset will modulate the type of management operation that a user may perform. As described in the Methods section, each position in the permission assignment operator (ysn) encodes, respectively, permission to "view", "change" or "use" the object entity. Values y, s and n indicate, respectively, that the user has full permission to view it (y), permission to change its metadata only if he or she was the creator of the entity (s), and no permission to insert child entities (n). Each S3QL operation is therefore tightly woven to one of the three operators: select is controlled by "view", update and delete are controlled by "change", and insert is controlled by "use" (shaded areas in Figure 1). The "use" operator encodes the ability of a user to create new relationships with the target entity, which is defined separately from the right to "change" it. For example, if a user (U1) is granted "y" as the effective permission to "change" an s3db:rule, then the metadata describing it may be altered. If, however, that same user is granted permission "n" to "use" that same Rule, she is prevented from creating Statements using that Rule. Although "use" may be interpreted as being equivalent to "insert" or "append" in other systems, we have chosen to separate the terms describing the operator "use" from the S3QL action "insert". The permission assigned at the dataset level will then propagate in the S3DB transition matrix following the behaviour formalized in equation 1, thereby avoiding the need to assign permission to every user on every entity. It is worth noting that the DSL presented here is extensible beyond the 4 management actions (select, insert, update, delete) described. The s3db:operators that control permission on these actions are also extensible beyond "view", "change" and "use", and different implementations may support alternative states.

The permission control behaviour for S3QL operations can be illustrated through the use of Quadratus, an application available at http://q.s3db.org/quadratus that can be pointed at any S3DB deployment to assign permission states on S3DB entities to different users (Figure 4). Other use case scenarios are also explored in the S3QL specification at http://link.s3db.org/specs.
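The mapping between S3QL actions and the three operator positions can be sketched as follows. This is a deliberately simplified illustration and not the propagation behaviour formalized in equation 1: the functions `merge` and `is_allowed` are hypothetical, the "-" placeholder for "nothing assigned at this position" is an assumption taken from the Figure 4 example, and uppercase states are treated like their lowercase counterparts when the final check is made.

```python
# Position of each operator in a three-letter permission string such as "ysn".
OPERATOR_POSITION = {"view": 0, "change": 1, "use": 2}

# Which operator controls which S3QL action, as described in the text.
ACTION_OPERATOR = {
    "select": "view",
    "update": "change",
    "delete": "change",
    "insert": "use",
}


def merge(inherited, assigned):
    """Combine an inherited permission with a locally assigned one.

    Simplifying assumption: an assigned state replaces the inherited
    state, and '-' means that nothing was assigned at that position.
    """
    if len(inherited) != 3 or len(assigned) != 3:
        raise ValueError("permission strings must have three positions")
    return "".join(i if a == "-" else a for i, a in zip(inherited, assigned))


def is_allowed(action, effective, is_creator=False):
    """Return True if `action` is permitted under the effective permission."""
    state = effective[OPERATOR_POSITION[ACTION_OPERATOR[action]]].lower()
    if state == "y":
        return True
    if state == "s":
        # 's' grants the right only to the creator of the entity.
        return is_creator
    return False
```

With these assumptions, merging the inherited "ysn" with an assigned "N--" gives "Nsn", so a select on that entity is refused, while an update is still permitted to the creator of the entity.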

Global Dereferencing for Distributed Queries

A simple dereferencing system was devised for S3DB identifiers that relies on the identification of root deployments, i.e. S3DB systems where alphanumeric identifiers for S3DB deployments can be dereferenced to a URL. This simple mechanism enables complex transactions of controlled data. For an example of this behaviour, see Figure 5, where the S3DB UID D327R172930, identifying an entity of type Rule (R172930) available in deployment D327, is being requested by a user registered in deployment D309. In order to retrieve the requested data, the URL of deployment D327 must first be resolved at a root deployment such as, for example, http://root.s3db.org.

The dereferencing mechanism is also applicable in more complex cases where the root of two deployments sharing data is not the same. Prepending the deployment identifier of the root to the UIDs, as in D1016666D327R172930, where D1016666 identifies the root deployment, would result in recursive URL resolution steps such as select(Durl| D1016666) prior to step 4 in Figure 5. This mechanism avoids broken links when S3DB deployments are moved to different URLs by enabling deployment metadata to be updated securely at the root using a public/private key encryption system.

Implementation and benchmarking

In the current prototype implementation, S3QL is submitted to S3DB deployments using either a GET or POST request and may include an optional authentication token


(key). The REST specification [54] suggests separate HTTP methods according to the intention of the operation: often "GET" is used to retrieve data, "PUT" is used to submit data, "POST" is used to update data and "DELETE" is used to remove data. There are, however, many programming environments that implement only the "GET" method, including many popular computational statistics programming frameworks such as R and Matlab. Therefore, in order to fully explore the integrative potential of this read/write semantic web service, and to support operations beyond the 4 implemented, the S3DB prototype implementation of S3QL supports the "GET" method for all S3QL operations, with the parameters of the S3QL call appended to the URL. One drawback of relying on GET is the limit imposed by browsers on URL length. To address this potential problem, the S3DB prototype also supports the use of "POST" for S3QL calls.
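A client-side sketch of this GET-with-POST-fallback convention is given below. Several details are assumptions made purely for illustration and are not confirmed by the text: the endpoint name `S3QL.php`, the parameter name `query`, and the 2000-character threshold. Only the optional `key` token is named in the text.

```python
from urllib.parse import urlencode

# Hypothetical threshold; actual URL length limits depend on client and server.
MAX_GET_URL_LENGTH = 2000


def build_s3ql_url(deployment_url, s3ql_xml, key=None):
    """Append an S3QL query (and optional token) to a deployment URL.

    'S3QL.php' and the 'query' parameter name are assumptions.
    """
    params = {"query": s3ql_xml}
    if key is not None:
        params["key"] = key
    # urlencode percent-encodes the XML so it is safe inside a URL.
    return deployment_url.rstrip("/") + "/S3QL.php?" + urlencode(params)


def choose_method(url, max_length=MAX_GET_URL_LENGTH):
    """Use GET when the URL is short enough, otherwise fall back to POST."""
    return "GET" if len(url) <= max_length else "POST"
```

A caller would build the URL first and then ask `choose_method` whether it can be sent as is or whether the same parameters should be submitted in a POST body instead.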

Two further challenges needed to be addressed in the prototypical implementation of S3QL: 1) the need for a centralized root location to support dereferencing of the deployment URI when the condensed version is used (e.g. D282); and 2) the requirement that distributed queries on REST systems have users authenticate in multiple KOSs. The first challenge was addressed by configuring an S3DB deployment as the root location, available at http://root.s3db.org. Deployment metadata is submitted to this root deployment at configuration time using S3QL; data pertaining to each deployment can therefore be dereferenced to a URL using http://root.s3db.org/D[numeric]. A root deployment other than the default can be chosen during installation. To avoid overloading these root deployments with too many requests, a local 24 hour cache of all accessed deployments is kept in each S3DB deployment using the same strategy:

Figure 4 Quadratus, an interface to illustrate S3QL's permission control mechanism and its effect on S3QL queries. Projects, Collections, Rules, Items and Statements associated with the GI Clinical Trials Project are displayed with effective and assigned permissions. Collections and Rules, retrieved using the S3QL queries select(C|P1196457) and select(R|P1196457), inherit the permission assignment in the Project, "ysn"; the assigned permission "N--" in Collection "Demographics" results in the effective permission of "Nsn", inherited by all Rules and Items that have a relationship with that Collection, effectively preventing gi_user1 from accessing its data. The directed labelled graph of the propagation resolution is displayed on the right side of the application, illustrating the propagation mechanism.


if the URL is cached, it will not be requested from the root deployment. To address the second challenge, each s3db:deployment can store any number of authentication services supporting HTTP, FTP or LDAP protocols. Once the user is authenticated, temporary surrogate tokens are issued with each query. When coupled with the user identifier in the format of a URI, these tokens effectively identify the user performing the query regardless of the S3DB deployment where the query is requested.

Screencasts illustrating processing time of data manipulation using S3QL are available at http://www.youtube.com/watch?v=2KZC6kI609s and http://www.youtube.com/watch?v=FJSYLCwBaPI.
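The 24 hour cache of deployment URLs described above can be sketched as a small Python class. This is a minimal illustration under stated assumptions rather than the prototype's PHP implementation: the class name `DeploymentCache` and the `resolve_at_root` callback are hypothetical, and the callback stands in for the query that a deployment would send to its root.

```python
import time


class DeploymentCache:
    """Cache deployment URLs locally and ask the root only when needed."""

    def __init__(self, resolve_at_root, max_age_seconds=24 * 60 * 60,
                 clock=time.time):
        self._resolve = resolve_at_root   # callable: deployment id -> URL
        self._max_age = max_age_seconds   # 24 hours, as stated in the text
        self._clock = clock               # injectable for testing
        self._entries = {}                # deployment id -> (url, timestamp)

    def url_for(self, deployment_id):
        now = self._clock()
        entry = self._entries.get(deployment_id)
        if entry is not None and now - entry[1] < self._max_age:
            # Fresh entry: the root deployment is not contacted.
            return entry[0]
        # Missing or stale entry: resolve at the root and remember the result.
        url = self._resolve(deployment_id)
        self._entries[deployment_id] = (url, now)
        return url
```

Passing the clock in as a parameter keeps the expiry logic easy to exercise without waiting a day; in normal use the default `time.time` is sufficient.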

Discussion

One of the major concerns in making use of Linked Data to improve health care and life sciences research is the need to ensure both the availability of contextual information about experimental datasets and the ability to protect the privacy of certain data elements which may identify an individual patient. Domain specific languages (DSL) can ease the task of managing the contextual descriptors that would be necessary to implement permission control in RDF and, by doing so, could greatly accelerate the rate of adoption of Linked Data formalisms in the life sciences communities to improve scientific discovery. We have described S3QL, a DSL to perform read/write operations on entities of the S3DB Core Model. S3QL attempts to address the requirements in linking Life Sciences datasets, including both publishable and un-publishable data elements, by 1) including contextual descriptors for every submitted data element and 2) making use of those descriptors to ensure permission control managed by the data experts themselves. This avoids the need to break a consolidated dataset into its public and private parts when the results are acceptable for publication.

Applying S3QL to the S3DB Core Model in a prototypical application benefited from the definition of loosely defined boundaries for RDF data, which enabled propagation of permission while avoiding the need to document a relationship for each data instance individually and for each user. The assembly of SPARQL queries is also facilitated by the identification of domain triples, using named graphs, from the data itself [41] and can be illustrated in the application at [55], where a subset of S3QL can be readily converted into the W3C standard

Figure 5 The global S3QL dereferencing system. User U78 of deployment D309 issues a command to request all entities of type Statement where the attribute Rule_id corresponds to the value D327R172930, through the S3QL operation select(Svalue|D327R172930) (1). If the URL corresponding to deployment D327 is not cached locally, or has not been validated in the past 24 h, a query is issued and executed at the root deployment to retrieve the corresponding URL, select(Durl|D327) (2, 3). Once the URL is returned, query (1) is re-issued as select(Svalue| R172930) and executed at the URL for D327 (4). To validate the user, deployment D327 issues the command select(Uid|U76) at D309 using the key provided (5, 6) and returns the data only if Uid matches the value for user_id (7).


SPARQL. Although the prototypical implementation of S3QL fits the definition of an API for S3DB systems, it is apparent that the same notation could easily be extended to other KOS core models. For example, pointing the tool at http://q.s3db.org/translate to a JavaScript Object Notation (JSON) representation of the SKOS core model (skos.js), instead of the default S3DB core model (s3db.js), results in a valid S3QL syntax that could easily be applied as an API for SKOS-based systems. This is illustrated by applying the example query "select(C|prefLabel = animal)", using http://link.s3db.org/translate?core=skos.js&query=select(C|prefLabel=animal), to retrieve SKOS concepts labelled "animal". Adoption by life sciences application developers can be further smoothed by complete reliance on the REST protocol for data exchange and the availability of widely used formats such as JSON, XML or RDF/turtle.

The applicability of S3QL to life sciences domains is illustrated here with three case studies: 1) in the domain of clinical trials, a project [42] that requires collaboration between departments with different interests; 2) in the domain of The Cancer Genome Atlas (TCGA) project, a multi-institutional effort that requires multiple authentication mechanisms and sharing of data among multiple institutions [41]; and 3) in the domain of molecular epidemiology, a project where non-public data stored in an S3DB deployment needs to be statistically integrated with data from a public repository. All described use cases shared one considerable requirement: the ability to include in the same dataset both published and unpublished results. As such, they required both the annotation of contextual descriptors of the data, enabled by S3QL, and the availability of controlled permission propagation, enabled by the S3DB model transition matrix. Future work in this effort may include the application of S3QL in a Knowledge Organization System based on SKOS terminology and the definition of a transition matrix for SKOS to enable controlled permission propagation.

Gastro-intestinal Clinical Trials Use Case
As part of a collaboration with the Department of Gastrointestinal Medical Oncology at The University of Texas M.D. Anderson Cancer Center, an S3DB deployment was configured to host data from gastro-intestinal (GI) clinical trials. A schema was developed using S3DB Collections and Rules, and S3QL insert queries were used to submit data elements as Items and Statements (see Figure 6). Two permission propagation examples are illustrated, one of a restrictive nature and the other permissive, which can also be explored at http://q.s3db.org/quadratus (Figure 4). In this example, the simple mechanisms of propagation defined for S3QL support the complex social interaction that requires a fraction of the dataset to be shared with certain users but not with others. Contextual usage is therefore a function of the attributes of the data itself (e.g. its creator) and the user identification token that is submitted with every read/write operation.

The Cancer Genome Atlas Use Case
The Cancer Genome Atlas (TCGA) is a pilot project to characterize several types of cancer by sequencing and genetically characterizing tumours for over 500 patients throughout multiple institutions. S3QL was used in this case study to produce an infrastructure that exposes the public portion of the TCGA datasets as a SPARQL endpoint [41]. This was possible because SPARQL is entailed by S3QL but not the opposite. Specifically, SPARQL queries can be serialized to S3QL, but the opposite is not always possible, particularly as regards write and access control operations. The structure of the S3DB Core Model, which explicitly distinguishes domain from instantiation, enables SPARQL query patterns such as ?Patient :R390 ?cancerType to be readily serialized into their S3QL equivalent, select(S|R390). Although this will not be further explored in this discussion, it is worth noting that the availability of this serialization allows for an intuitive syntax of SPARQL queries by patterning them on the description of the user-defined domain Rule, such as "Patient hasCancerType cancerType".
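The serialization of a single triple pattern described above can be sketched as follows. This is only an illustration and not the S3DB serialization engine: the function name, the `?variable :R390 ?variable` pattern syntax and the restriction to one pattern whose predicate is a Rule identifier are all assumptions made for the example.

```python
import re

# Assumed pattern syntax: "?subject :R<number> ?object", with an optional
# trailing dot. Only the Rule identifier is needed for the S3QL query.
_PATTERN = re.compile(r"^\s*\?(\w+)\s+:?(R\d+)\s+\?(\w+)\s*\.?\s*$")


def triple_pattern_to_s3ql(pattern):
    """Serialize one SPARQL triple pattern whose predicate is an S3DB Rule
    into the compact S3QL query that selects the matching Statements."""
    match = _PATTERN.match(pattern)
    if match is None:
        raise ValueError(f"unsupported triple pattern: {pattern!r}")
    _subject, rule, _object = match.groups()
    return f"select(S|{rule})"


print(triple_pattern_to_s3ql("?Patient :R390 ?cancerType"))
```

The reverse direction is deliberately not attempted, in line with the observation that write and access control operations have no SPARQL counterpart.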

A Molecular Epidemiology Use Case
In this example, SPARQL was serialized to S3QL to support a computational statistics application. As a first step, an S3DB data store was deployed using S3QL to manage molecular epidemiology data related to strains of Staphylococcus aureus bacteria collected at the Instituto de Tecnologia Química e Biológica (ITQB) in Portugal. Specifically, the ITQB Staphylococcus reference database was devised with the purpose of managing Multilocus Sequence Typing (MLST) data, a typing method used to track the molecular epidemiology of infectious bacteria [54-57]. As a second step, we downloaded the public Staphylococcus aureus MLST profiles database at http://www.mlst.net and made it available through a SPARQL endpoint at http://q.s3db.org/mlst_sparql/endpoint.php. The process of integration of MLST profiles from the ITQB Staphylococcus database with the publicly available MLST profiles is illustrated in Figure 7. In this example, a federated SPARQL query is assembled to access both MLST sources: data stored in the S3DB deployment is retrieved by serializing the SPARQL query into S3QL and providing an authentication token to identify the user, as described in Figure 3. The assembled graph resulting from the federated SPARQL query (see additional file 2) can be imported into a statistical computing environment such as Matlab (Mathworks Inc.). Using this methodology, it was possible to cluster strains from two different data sources with very different authentication requirements. The observation that some Portuguese strains (PT1, PT2, PT15 and PT21) that are not publicly shared cluster together with a group of public UK strains (UK17, UK16, UK11, UK14, UK15, UK13, UK12), and therefore may share a common ancestor, is enabled by the data integrated through S3QL.
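As a rough illustration of the input to such a clustering step, the sketch below computes pairwise distances between MLST allelic profiles. The text does not state which distance measure was used, so the count of differing loci is an assumption, and the strain names and allele numbers are hypothetical placeholders rather than data from the study.

```python
from itertools import combinations


def allelic_distance(profile_a, profile_b):
    """Number of loci at which two MLST allelic profiles differ."""
    if len(profile_a) != len(profile_b):
        raise ValueError("profiles must cover the same loci")
    return sum(a != b for a, b in zip(profile_a, profile_b))


def pairwise_distances(profiles):
    """Distance for every unordered pair of strains, keyed by sorted names."""
    return {
        (s, t): allelic_distance(profiles[s], profiles[t])
        for s, t in combinations(sorted(profiles), 2)
    }


# Hypothetical profiles: one allele number per locus (not data from the study).
profiles = {
    "strain_a": (1, 4, 1, 4, 12, 1, 10),
    "strain_b": (1, 4, 1, 4, 12, 1, 10),
    "strain_c": (3, 3, 1, 1, 4, 4, 3),
}
distances = pairwise_distances(profiles)
for pair, distance in distances.items():
    print(pair, distance)
```

A distance table of this kind is what a hierarchical clustering routine in Matlab or R would take as input, whichever source each profile was retrieved from.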

Conclusions
Life sciences applications are set to greatly benefit from coupling Semantic Web Linked Data standards and KOSs. In the current report, we illustrate data models from life sciences domains woven using the S3DB knowledge organization system. In line with the requirements for the emergence of evolvable "social machines", different perspectives on the data are made possible by a permission propagation mechanism controlled by contextual attributes of data elements, such as their creator. The operation of the S3DB KOS is mediated by the S3QL protocol described in this report, which exposes its Application Programming Interface for viewing, inserting, updating and removing data elements. Because S3QL is implemented with a distributed architecture where URIs can be dereferenced into multiple S3DB deployments, domain experts can share data on their own deployments with users of other systems without the need for local accounts. Therefore, S3QL's fine-grained permission control, defined as instances of s3db:operators, enables domain experts to clearly specify the degree of permission that a user should have on a resource and how that permission should propagate in a distributed infrastructure. This is in contrast to the conventional approach of delegating permission management to the point of access. In the current SPARQL specification extension, various data sources can be queried simultaneously or sequentially. There is still no accepted convention for tying a query pattern to an authenticated user, probably because SPARQL engines would have no use for that information, as most have been created in a context of Linked Open Data efforts. The critical limitation in applying this solution to Health Care and Life Sciences is the inability to make use of contextual information both to determine the level of trust in the data and to enable controlled access to elements in a dataset without breaking it and storing it in multiple systems. To address this requirement, S3QL was fitted with distributed control operational features that follow design criteria found desirable for biomedical applications. S3QL is not unique in its class; for example, the linked data API, which is being used by data.gov.uk, is an alternative DSL to manage linked data [58]. However, we believe that S3QL is closer to the technologies currently used by application developers and therefore may provide a more suitable middle layer between linked data formalisms and application development. It is argued that these features may assist and anticipate future extensions of semantic web provenance control formalisms.

Figure 6. Two use cases of permission propagation. Two users are granted full permission to view the GI Clinical Trials Project ("y"); however, neither of them can add new data ("n") nor edit existing data unless they were its original creator ("s"). One of the users (giuser1) is prevented from accessing any data with demographic elements such as, for example, the names of the Patients. In this case, an uppercase "N" is assigned in the right to view the Collection DemographicData, which will be merged with the inherited "y" to produce an effective permission level of "N" for the right to view (merge 1, restrictive). For the second user (giuser2), permission is granted to use the Collection TissueData, indicating that the user can create new instances in that Collection (merge 2, permissive).

Additional material

Additional file 1: Walkthrough of S3DB's permission propagation: merge, migrate and percolate.

Additional file 2: Matrix of MLST profiles in Portugal and the United Kingdom.

Acknowledgements and funding
This work was supported in part by the Center for Clinical and Translational Sciences of the University of Alabama at Birmingham under contract no. 5UL1 RR025777-03 from NIH National Center for Research Resources, by the National Cancer Institute grant 1U24CA143883-01, by the European Union FP7 PNEUMOPATH project award and by the Portuguese Science and Technology foundation under contracts PTDC/EIA-EIA/105245/2008 and PTDC/EEA-ACR/69530/2006. HFD also thankfully acknowledges a PhD fellowship from the same foundation, award SFRH/BD/45963/2008, and the Science Foundation Ireland project Lion under Grant No. SFI/02/CE1/I131. The authors would also like to thank three anonymous reviewers whose comments and advice were extremely valuable for improving the manuscript.

Author details
1 Digital Enterprise Research Institute, National University of Ireland at Galway, IDA Business Park, Lower Dangan, Galway, Ireland.
2 Biomathematics, Instituto de Tecnologia Química e Biológica, Universidade Nova de Lisboa, Av. da República, Estação Agronómica Nacional, 2780-157 Oeiras, Portugal.
3 Laboratório Nacional de Computação Científica, Av. Getúlio Vargas 333, Quitandinha, 25651-075 Petrópolis, Brasil.
4 Sanofi Pasteur, 38 Sidney Street, Cambridge, MA 02139, USA.
5 Laboratory of Molecular Genetics, Instituto de Tecnologia Química e Biológica, Universidade Nova de Lisboa, Av. da República, Estação Agronómica Nacional, 2780-157 Oeiras, Portugal.
6 Research Center for Intelligent Media, Furtwangen University, Furtwangen, Germany.
7 Laboratory of Microbiology, The Rockefeller University, 10021 New York, USA.
8 Division of Informatics, Department of Pathology, University of Alabama at Birmingham, 619 South 19th Street, Birmingham, Alabama, USA.

Authors' contributions
HFD developed the S3QL domain specific language, implemented it in the S3DB prototype, validated it with examples and wrote the manuscript. MCC, RS, MM, HL, RF, WM and JSA tested and validated the language with examples and made suggestions which led to its improvement. All authors read and approved the final manuscript.

Figure 7. Workflow for the creation of a hierarchical cluster of MLST profiles from strains collected in Portugal and in the UK. Two sources are chosen to perform the MLST profile assembly: an S3DB deployment, which holds the ITQB Staphylococcus database, and the SPARQL endpoint configured to host the Staphylococcus aureus MLST database. Although data in the MLST SPARQL endpoint is publicly available, to access data in S3DB the user needs to provide an authentication token and a user_id, as well as assemble the S3QL queries to retrieve the data [select(S|R172930) and select(S|R167271)], which may also be formulated as SPARQL. A data structure is assembled that can finally be analysed using hierarchical clustering methods such as the ones made available by Matlab. The complete MLST dataset used to obtain the graph is provided in additional file 2.


Competing interests
The authors declare that they have no competing interests.

Received: 14 April 2011 Accepted: 14 July 2011 Published: 14 July 2011

References
1. Bell G, Hey T, Szalay A: Computer science. Beyond the data deluge. Science (New York, NY) 2009, 323:1297-8.
2. Chiang AP, Butte AJ: Data-driven methods to discover molecular determinants of serious adverse drug events. Clinical pharmacology and therapeutics 2009, 85:259-68.
3. The end of theory: the data deluge makes the scientific method obsolete. [http://www.wired.com/science/discoveries/magazine/16-07/pb_theory].
4. Hubbard T: The Ensembl genome database project. Nucleic Acids Research 2002, 30:38-41.
5. Karolchik D: The UCSC Genome Browser Database. Nucleic Acids Research 2003, 31:51-54.
6. Maglott D, Ostell J, Pruitt KD, Tatusova T: Entrez Gene: gene-centered information at NCBI. Nucleic acids research 2005, 33:D54-8.
7. Ashburner M, Ball CA, Blake JA, et al: Gene ontology: tool for the unification of biology. The Gene Ontology Consortium. Nature genetics 2000, 25:25-9.
8. Bizer C, Heath T, Berners-Lee T: Linked Data - The Story So Far. International Journal on Semantic Web and Information Systems (IJSWIS) 2009.
9. Linked Data | Linked Data - Connect Distributed Data across the Web. [http://linkeddata.org].
10. Linked data - Design issues. [http://www.w3.org/DesignIssues/LinkedData.html].
11. Vandervalk BP, McCarthy EL, Wilkinson MD: Moby and Moby 2: creatures of the deep (web). Briefings in bioinformatics 2009, 10:114-28.
12. Where the semantic web stumbled, linked data will succeed - O'Reilly Radar. [http://radar.oreilly.com/2010/11/semantic-web-linked-data.html].
13. Berners-Lee T, Weitzner DJ, Hall W, et al: A Framework for Web Science. Foundations and Trends® in Web Science 2006, 1:1-130.
14. Hendler J, Berners-Lee T: From the Semantic Web to social machines: A research challenge for AI on the World Wide Web. Artificial Intelligence 2010, 174:156-161.
15. Almeida JS, Chen C, Gorlitsky R, et al: Data integration gets "Sloppy". Nature biotechnology 2006, 24:1070-1.
16. Deus HF, Stanislaus R, Veiga DF, et al: A Semantic Web management model for integrative biomedical informatics. PloS one 2008, 3:e2946.
17. Putting the Web back in Semantic Web. [http://www.w3.org/2005/Talks/1110-iswc-tbl/#(1)].
18. SPARQL Query Language for RDF. [http://www.w3.org/TR/rdf-sparql-query].
19. Alexander K, Cyganiak R, Hausenblas M, Zhao J: Describing Linked Datasets: On the Design and Usage of voiD, the "Vocabulary Of Interlinked Datasets". Linked Data on the Web Workshop (LDOW 09), in conjunction with 18th International World Wide Web Conference (WWW 09) 2009.
20. Cheung KH, Frost HR, Marshall MS, et al: A journey to Semantic Web query federation in the life sciences. BMC bioinformatics 2009, 10(Suppl 1):S10.
21. A Prototype Knowledge Base for the Life Sciences. [http://www.w3.org/TR/hcls-kb].
22. Belleau F, Nolin M-A, Tourigny N, Rigault P, Morissette J: Bio2RDF: towards a mashup to build bioinformatics knowledge systems. Journal of biomedical informatics 2008, 41:706-16.
23. Smith B, Ashburner M, Rosse C, et al: The OBO Foundry: coordinated evolution of ontologies to support biomedical data integration. Nature biotechnology 2007, 25:1251-5.
24. Taylor CF, Field D, Sansone SA, et al: Promoting coherent minimum reporting guidelines for biological and biomedical investigations: the MIBBI project. Nature biotechnology 2008, 26:889-96.
25. Noy NF, Shah NH, Whetzel PL, et al: BioPortal: ontologies and integrated data resources at the click of a mouse. Nucleic acids research 2009, 37:W170-3.
26. Deus HF, Prud E, Zhao J, Marshall MS, Samwald M: Provenance of Microarray Experiments for a Better Understanding of Experiment Results. ISWC 2010 SWPM 2010.
27. Stein LD: Integrating biological databases. Nature reviews Genetics 2003, 4:337-45.
28. Goble C, Stevens R: State of the nation in data integration for bioinformatics. Journal of Biomedical Informatics 2008, 41:687-693.
29. Ludäscher B, Altintas I, Bowers S, et al: Scientific Process Automation and Workflow Management. In Scientific Data Management. Edited by Shoshani A, Rotem D. Chapman; 2009.
30. Nelson B: Data sharing: Empty archives. Nature 2009, 461:160-3.
31. Stanislaus R, Chen C, Franklin J, Arthur J, Almeida JS: AGML Central: web based gel proteomic infrastructure. Bioinformatics (Oxford, England) 2005, 21:1754-7.
32. Silva S, Gouveia-Oliveira R, Maretzek A, et al: EURISWEB–Web-based epidemiological surveillance of antibiotic-resistant pneumococci in day care centers. BMC medical informatics and decision making 2003, 3:9.
33. Describing Linked Datasets with the VoiD Vocabulary. [http://www.w3.org/TR/2011/NOTE-void-20110303].
34. HIPAA Administrative Simplification Statute and Rules. [http://www.hhs.gov/ocr/privacy/hipaa/administrative/index.html].
35. Socially Aware Cloud Storage. [http://www.w3.org/DesignIssues/CloudStorage.html].
36. Koslow SH: Opinion: Sharing primary data: a threat or asset to discovery? Nature reviews Neuroscience 2002, 3:311-3.
37. Baggerly KA, Coombes KR: Deriving chemosensitivity from cell lines: Forensic bioinformatics and reproducible research in high-throughput biology. The Annals of Applied Statistics 2009, 3:1309-1334.
38. Hodge G: Systems of Knowledge Organization for Digital Libraries: Beyond Traditional Authority Files. 2000.
39. SKOS Simple Knowledge Organization System Reference. [http://www.w3.org/TR/skos-reference].
40. Almeida JS, Deus HF, Maass W: S3DB core: a framework for RDF generation and management in bioinformatics infrastructures. BMC bioinformatics 2010, 11:387.
41. Deus HF, Veiga DF, Freire PR, et al: Exposing The Cancer Genome Atlas as a SPARQL endpoint. Journal of Biomedical Informatics 2010, 43:998-1008.
42. Correa MC, Deus HF, Vasconcelos AT, et al: AGUIA: autonomous graphical user interface assembly for clinical trials semantic data services. BMC medical informatics and decision making 2010, 10.
43. Freire P, Vilela M, Deus H, et al: Exploratory analysis of the copy number alterations in glioblastoma multiforme. PloS one 2008, 3:e4076.
44. NCBO Ontology Widgets. [http://www.bioontology.org/wiki/index.php/NCBO_Widgets].
45. Bussler C: Is Semantic Web Technology Taking the Wrong Turn? Ieee Internet Computing 2008, 12:75-79.
46. What people find hard about linked data. [http://dynamicorange.com/2010/11/15/what-people-find-hard-about-linked-data].
47. Raja A, Lakshmanan D: Domain Specific Languages. International Journal of Computer Applications 2010, 1:99-105.
48. SPARQL Update. [http://www.w3.org/TR/sparql11-update].
49. Carroll JJ, Bizer C, Hayes P, Stickler P: Named graphs, provenance and trust. Proceedings of the 14th international conference on World Wide Web, WWW 05 2005, 14:613.
50. S3DB operator function states. [http://code.google.com/p/s3db-operator].
51. S3DB Operators. [http://s3db-operator.googlecode.com/hg/propagation.html].
52. Deus HF, Sousa MA de, Carrico JA, Lencastre H de, Almeida JS: Adapting experimental ontologies for molecular epidemiology. AMIA Annual Symposium proceedings 2007, 935.
53. The OAuth 1.0 Protocol. [http://tools.ietf.org/html/rfc5849].
54. Francisco AP, Bugalho M, Ramirez M, Carriço JA: Global optimal eBURST analysis of multilocus typing data using a graphic matroid approach. BMC bioinformatics 2009, 10:152.
55. S3QL serialization engine. [http://js.s3db.googlecode.com/hg/translate/quickTranslate.html].
56. Ippolito G, Leone S, Lauria FN, Nicastri E, Wenzel RP: Methicillin-resistant Staphylococcus aureus: the superbug. International journal of infectious diseases 2010, 14:S7-S11.
57. Harris SR, Feil EJ, Holden MTG, et al: Evolution of MRSA during hospital transmission and intercontinental spread. Science (New York, NY) 2010, 327:469-74.
58. Linked Data API. [http://code.google.com/p/linked-data-api].

doi:10.1186/1471-2105-12-285
Cite this article as: Deus et al.: S3QL: A distributed domain specific language for controlled semantic integration of life sciences data. BMC Bioinformatics 2011 12:285.



KOSs and in the Linked Data in general, that there is no restriction to adding relationships beyond those described here. S3QL was identified as the minimal representation to interoperate with the S3DB core model, and therefore only those relationships are explored in this report.

The syntax diagram in Figure 1 generates XML, a standard widely used in web service implementations. That alternative often results in verbose queries that could easily be assembled from more compact notations. One example to consider is the form action(E | Ea = value). Here, the symbol '|' should be interpreted as in Bayesian inference, as a condition, and be read "given that". The letter "E" corresponds to the first letter of any S3DB entity (D, P, R, C, I, S or U) and Ea is any of its attributes, as described in Figure 2. In this example, the query insert(P | label = test) is equivalent to the example query above. That particular variant is also accepted by the S3DB prototype, and a converter for this syntax into complete S3QL/XML syntax was made available at http://link.s3db.org/translate. For further compactness of this alternative formulation, entity identifiers used as parameters may be replaced with their corresponding alphanumeric identifiers; for example, project_id = 156 may be replaced with P156. This alternative notation will be used in the subsequent examples.
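To make the compact notation concrete, the following sketch splits a query of the form action(E | Ea = value) into its parts. It is not the converter offered at the translate service: the function name, the returned dictionary layout and the use of commas to separate several parameters are assumptions made for this illustration.

```python
import re

# Entity letters listed in the text, mapped to the entity names used in the
# article (Deployment, Project, Rule, Collection, Item, Statement, User).
ENTITIES = {
    "D": "deployment", "P": "project", "R": "rule", "C": "collection",
    "I": "item", "S": "statement", "U": "user",
}

_QUERY = re.compile(r"^\s*(\w+)\s*\(\s*([A-Z])\s*\|(.*)\)\s*$")
_SHORTHAND = re.compile(r"^([A-Z])(\d+)$")


def parse_compact(query):
    """Split 'action(E | Ea = value, ...)' into action, target and parameters."""
    match = _QUERY.match(query)
    if match is None:
        raise ValueError(f"not a compact S3QL query: {query!r}")
    action, letter, condition = match.groups()
    if letter not in ENTITIES:
        raise ValueError(f"unknown entity letter: {letter!r}")
    params = {}
    # Assumption: multiple parameters are separated by commas.
    for part in filter(None, (p.strip() for p in condition.split(","))):
        shorthand = _SHORTHAND.match(part)
        if shorthand and shorthand.group(1) in ENTITIES:
            # Shorthand from the text: P156 stands for project_id = 156.
            entity, number = shorthand.groups()
            params[f"{ENTITIES[entity]}_id"] = number
        elif "=" in part:
            key, value = part.split("=", 1)
            params[key.strip()] = value.strip()
        else:
            raise ValueError(f"cannot interpret parameter: {part!r}")
    return {"action": action, "target": ENTITIES[letter], "params": params}


print(parse_compact("insert(P | label = test)"))
print(parse_compact("select(S|R390)"))
```

Keeping the shorthand expansion in one place means the same structure results whether a caller writes project_id = 156 or P156.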

S3QL Permission Control
Permission states are assigned using an S3QL query such as insert(U | U1, P157, permission_level = ysn), which includes the action insert, the target entity User and three input parameters: the identifier of the User (U1), the identifier of the entity (P157) and the permission assignment (ysn). Effectively, this will result in the creation of the triple [U1 ysn P157], where the subject is of type s3db:user, the predicate is of type s3db:operator and the object is of type s3db:project. The inclusion of this triple in a dataset will modulate the type of management operation that a user may perform. As described in the Methods section, each position in the permission assignment operator (ysn) encodes, respectively, for permission to "view", "change" or "use" the object entity. Values y, s and n indicate, respectively, that the user has full permission to view it (y), permission to change its metadata only if that user was the creator of the entity (s) and no permission to insert (n) child entities. Each S3QL operation is therefore tightly woven to one of the three operators: select is controlled by "view", update and delete are controlled by "change", and insert is controlled by "use" (shaded areas in Figure 1). The "use" operator encodes for the ability of a user to create new relationships with the target entity, which is defined separately from the right to "change" it. For example, if a user (U1) is granted "y" as the effective permission to "change" an s3db:rule, then the metadata describing it may be altered. If, however, that same user is granted permission "n" to "use" that same Rule, the user is prevented from creating Statements using that Rule. Although "use" may be interpreted as being equivalent to "insert" or "append" in other systems, we have chosen to separate the terms describing the operator "use" from the S3QL action "insert". The permission assigned at the dataset level will then propagate in the S3DB transition matrix following the behaviour formalized in equation 1, therefore avoiding the need to assign permission to every user on every entity. It is worth noting that the DSL presented here is extensible beyond the 4 management actions (select, insert, update, delete) described. The s3db:operators that control permission on these actions are also extensible beyond "view", "change" and "use", and different implementations may support alternative states.

The permission control behaviour for S3QL operations can be illustrated through the use of Quadratus, an application available at http://q.s3db.org/quadratus that can be pointed at any S3DB deployment to assign permission states on S3DB entities to different users (Figure 4). Other use case scenarios are also explored in the S3QL specification at http://link.s3db.org/specs.
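The mapping between permission positions and S3QL actions can be sketched as below. This is a simplified reading of the text and not the S3DB implementation: the function names are hypothetical, the `merge` helper covers only the case where an assigned state overrides an inherited one position by position, and the use of "-" to mark an unassigned position is an assumed notation. The full propagation behaviour is the one formalized in equation 1.

```python
RIGHTS = ("view", "change", "use")

# Which right controls each S3QL action, as stated in the text.
ACTION_RIGHT = {
    "select": "view",
    "update": "change",
    "delete": "change",
    "insert": "use",
}


def decode(permission):
    """Map a three-character permission such as 'ysn' to its three rights."""
    if len(permission) != len(RIGHTS):
        raise ValueError(f"expected three positions, got {permission!r}")
    return dict(zip(RIGHTS, permission))


def is_allowed(action, permission, is_creator=False):
    """Decide whether an S3QL action is permitted under a permission string."""
    state = decode(permission)[ACTION_RIGHT[action]].lower()
    if state == "y":
        return True
    if state == "s":
        return is_creator  # allowed only on entities the user created
    return False  # 'n' (or uppercase 'N'): not permitted


def merge(inherited, assigned):
    """Simplified merge: an unassigned position ('-') keeps the inherited state."""
    return "".join(i if a == "-" else a for i, a in zip(inherited, assigned))


print(decode("ysn"))
print(is_allowed("select", "ysn"), is_allowed("insert", "ysn"))
print(merge("ysn", "N--"))
```

Under these assumptions, a view-only restriction assigned on a Collection combines with an inherited ysn to give Nsn, so select is refused while the other positions are unchanged.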

Global Dereferencing for Distributed Queries
A simple dereferencing system was devised for S3DB identifiers that relies on the identification of root deployments, i.e. S3DB systems where alphanumeric identifiers for S3DB deployments can be dereferenced to URLs. This simple mechanism enables complex transactions of controlled data. For an example of this behaviour, see Figure 5, where the S3DB UID D327R172930, identifying an entity of type Rule (R172930) available in deployment D327, is being requested by a user registered in deployment D309. In order to retrieve the requested data, the URL of deployment D327 must first be resolved at a root deployment such as, for example, http://root.s3db.org.

The dereferencing mechanism is also applicable in more complex cases where the root of two deployments sharing data is not the same. Prepending the deployment identifier of the root to the UIDs, such as, for example, D1016666D327R172930, where D1016666 identifies the root deployment, would result in recursive URL resolution steps such as select(D.url|D1016666) prior to step 4 in Figure 5. This mechanism avoids broken links when S3DB deployments are moved to different URLs by enabling deployment metadata to be updated securely at the root using a public/private key encryption system.
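The decomposition of a UID into the deployments that must be resolved can be sketched as follows. The function names are hypothetical, and the sketch assumes that UIDs are written exactly as in the examples above, with no separator between the parts, and that the lookup query takes the form select(D.url|D...).

```python
import re

# One or more deployment identifiers (D<number>) followed by one entity
# identifier (P, R, C, I, S or U followed by a number).
_UID = re.compile(r"^((?:D\d+)+)([PRCISU]\d+)$")


def split_uid(uid):
    """Split an S3DB UID into its deployment chain and its entity identifier."""
    match = _UID.match(uid)
    if match is None:
        raise ValueError(f"not a recognised UID: {uid!r}")
    deployments = re.findall(r"D\d+", match.group(1))
    return deployments, match.group(2)


def resolution_queries(uid):
    """List the URL lookups needed, in order, before the entity can be queried."""
    deployments, _entity = split_uid(uid)
    return [f"select(D.url|{deployment})" for deployment in deployments]


print(split_uid("D327R172930"))
print(resolution_queries("D1016666D327R172930"))
```

Each lookup in the returned list would be executed at the deployment resolved by the previous one, starting from the default root.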

Implementation and benchmarking
In the current prototype implementation, S3QL is submitted to S3DB deployments using either a GET or POST request and may include an optional authentication token (key). The REST specification [54] suggests separate HTTP methods according to the intention of the operation: often "GET" is used to retrieve data, "PUT" is used to submit data, "POST" is used to update data and "DELETE" is used to remove data. There are, however, many programming environments that implement only the REST "GET" method, including many popular computational statistics programming frameworks such as R and Matlab. Therefore, in order to fully explore the integrative potential of this read/write semantic web service and to support operations beyond the 4 implemented, the S3DB prototype implementation of S3QL supports the "GET" method for all S3QL operations, with the parameters of the S3QL call appended to the URL. One drawback of relying on GET is the limit imposed by the browser on URL length. To address this potential problem, the S3DB prototype also supports the use of "POST" for S3QL calls.

Two further challenges needed to be addressed in the prototypical implementation of S3QL: 1) the need for a centralized root location to support dereferencing of deployment URIs when the condensed version is used (e.g. D282) and 2) the fact that distributed queries on REST systems required users to authenticate in multiple KOSs. The first challenge was addressed by configuring an S3DB deployment as the root location, available at http://root.s3db.org. Deployment metadata is submitted to this root deployment at configuration time using S3QL; data pertaining to each deployment can therefore be dereferenced to a URL using http://root.s3db.org/D[numeric]. A root deployment other than the default can be chosen during installation. To avoid overloading these root deployments with too many requests, a local 24 hour cache of all accessed deployments is kept in each S3DB deployment. Using the same strategy, if the URL is cached, it will not be requested from the root deployment. To address the second challenge, each s3db:deployment can store any number of authentication services supporting HTTP, FTP or LDAP protocols. Once the user is authenticated, temporary surrogate tokens are issued with each query. When coupled with the user identifier in the format of a URI, these tokens effectively identify the user performing the query regardless of the S3DB deployment where the query is requested.

Figure 4. Quadratus, an interface to illustrate S3QL's permission control mechanism and its effect on S3QL queries. Projects, Collections, Rules, Items and Statements associated with the GI Clinical Trials Project are displayed with effective and assigned permissions. Collections and Rules, retrieved using S3QL queries select(C|P1196457) and select(R|P1196457), inherit the permission assignment in the Project, "ysn"; assigned permission "N--" in Collection "Demographics" results in the effective permission of "Nsn", inherited by all Rules and Items that have a relationship with that Collection, effectively preventing gi_user1 from accessing its data. The directed labelled graph of the propagation resolution is displayed on the right side of the application, illustrating the propagation mechanism.

Screencasts illustrating processing time of data manipulation using S3QL are available at http://www.youtube.com/watch?v=2KZC6kI609s and http://www.youtube.com/watch?v=FJSYLCwBaPI.
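The request handling and the 24 hour cache described in this section can be sketched as follows. Everything specific here is an assumption made for illustration: the parameter names `query` and `key`, the URL length threshold, the example endpoint and the class and function names are hypothetical and are not taken from the S3DB prototype.

```python
import time
from urllib.parse import urlencode

MAX_URL_LENGTH = 2000        # assumed conservative limit on URL length
CACHE_TTL = 24 * 60 * 60     # 24 hours, as described in the text


def build_request(base_url, s3ql, key=None, max_url_length=MAX_URL_LENGTH):
    """Prefer GET with the call appended to the URL; fall back to POST
    when the resulting URL would be too long."""
    params = {"query": s3ql}
    if key is not None:
        params["key"] = key
    encoded = urlencode(params)
    url = f"{base_url}?{encoded}"
    if len(url) <= max_url_length:
        return {"method": "GET", "url": url, "body": None}
    return {"method": "POST", "url": base_url, "body": encoded.encode("ascii")}


class DeploymentUrlCache:
    """Remember deployment URLs so the root is asked at most once per TTL."""

    def __init__(self, lookup, ttl=CACHE_TTL, clock=time.time):
        self._lookup = lookup    # callable: deployment id -> URL (asks the root)
        self._ttl = ttl
        self._clock = clock
        self._entries = {}       # deployment id -> (url, time stored)

    def url_for(self, deployment_id):
        now = self._clock()
        entry = self._entries.get(deployment_id)
        if entry is not None and now - entry[1] < self._ttl:
            return entry[0]
        url = self._lookup(deployment_id)
        self._entries[deployment_id] = (url, now)
        return url


request = build_request("https://example.org/s3ql", "select(S|R390)", key="abc")
print(request["method"], request["url"])
```

Passing the clock in as a parameter keeps the expiry logic testable without waiting, and the same `build_request` result can be handed to whichever HTTP client the calling environment provides.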

DiscussionOne of the major concerns in making use of LinkedData to improve health care and life sciences research isthe need to ensure both the availability of contextualinformation about experimental datasets and the abilityto protect the privacy of certain data elements whichmay identify an individual patient Domain specific lan-guages (DSL) can ease the task of managing the contex-tual descriptors that would be necessary to implement

permission control in RDF and, by doing so, could greatly accelerate the rate of adoption of Linked Data formalisms in the life sciences communities to improve scientific discovery. We have described S3QL, a DSL to perform read/write operations on entities of the S3DB Core Model. S3QL attempts to address the requirements in linking Life Sciences datasets, including both publishable and un-publishable data elements, by 1) including contextual descriptors for every submitted data element and 2) making use of those descriptors to ensure permission control managed by the data experts themselves. This avoids the need to break a consolidated dataset into its public and private parts when the results are acceptable for publication.

Applying S3QL to the S3DB Core Model in a prototypical application benefited from loosely defined boundaries for RDF data, which enabled propagation of permission while avoiding the need to document a relationship for each data instance individually and for each user. The assembly of SPARQL queries is also facilitated by the identification of domain triples using named graphs from the data itself [41] and can be illustrated in the application at [55], where a subset of S3QL can be readily converted into the W3C standard SPARQL. Although the prototypical implementation of S3QL fits the definition of an API for S3DB systems, it is immediately apparent that the same notation could be easily and intuitively extended to other KOS' core models. For example, pointing the tool at http://q.s3db.org/translate to a JavaScript Object Notation (JSON) representation of the SKOS core model (skos.js) instead of the default S3DB core model (s3db.js) results in a valid S3QL syntax that could easily be applied as an API for SKOS based systems, as illustrated by applying the example query “select(C|prefLabel = animal)” using http://link.s3db.org/translate?core=skos.js&query=select(C|prefLabel=animal) to retrieve SKOS concepts labelled “animal”. The progress of adoption for life sciences application developers can be further smoothed by complete reliance on the REST protocol for data exchange and the availability of widely used formats such as JSON, XML or RDF/turtle.

The applicability of S3QL to life sciences domains is illustrated here with three case studies: 1) in the domain of clinical trials, a project [42] that requires collaboration between departments with different interests; 2) in the domain of The Cancer Genome Atlas (TCGA) project, a multi-institutional effort that requires multiple authentication mechanisms and sharing of data among multiple institutions [41]; and 3) in the domain of molecular epidemiology, a project where non-public data stored in an S3DB deployment needs to be statistically integrated with data from a public repository. All described use cases shared one considerable requirement: the ability to include in the same dataset both published and unpublished results. As such, they required both the annotation of contextual descriptors of the data, enabled by S3QL, and the availability of controlled permission propagation, enabled by the S3DB model transition matrix. Future work in this effort may include the application of S3QL in a Knowledge Organization System based on SKOS terminology and the definition of a transition matrix for SKOS to enable controlled permission propagation.

Figure 5 The global S3QL dereferencing system. User U78 of deployment D309 issues a command to request all entities of type Statement where the attribute Rule_id corresponds to the value D327R172930, through S3QL operation select(Svalue|D327R172930) (1). If the URL corresponding to deployment D327 is not cached locally, or has not been validated in the past 24 h, a query is issued and executed at the root deployment to retrieve the corresponding URL, select(Durl|D327) (2, 3). Once the URL is returned, query (1) is re-issued as select(Svalue|R172930) and executed at the URL for D327 (4). To validate the user, deployment D327 issues the command select(Uid|U76) at D309 using the key provided (5, 6) and returns the data only if Uid matches the value for user_id (7).
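The cache-then-root lookup described in the Figure 5 caption can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the S3DB implementation: `DeploymentResolver`, `root_lookup` and `split_condensed_id` are hypothetical names, the root query is represented by an injected callable rather than a real HTTP request, and the condensed identifier is assumed to be a deployment id (D plus digits) followed by one letter plus digits.

```python
import re
import time

CACHE_TTL_SECONDS = 24 * 60 * 60  # the 24 h validity window from the caption


class DeploymentResolver:
    """Caches deployment URLs and asks the root lookup only when needed."""

    def __init__(self, root_lookup, clock=time.time):
        self._root_lookup = root_lookup  # hypothetical callable: "D327" -> URL
        self._clock = clock              # injectable so the sketch can be tested
        self._cache = {}                 # deployment id -> (url, time stored)

    def resolve(self, deployment_id):
        entry = self._cache.get(deployment_id)
        now = self._clock()
        if entry is not None and now - entry[1] < CACHE_TTL_SECONDS:
            return entry[0]              # fresh cache hit: the root is not contacted
        url = self._root_lookup(deployment_id)
        self._cache[deployment_id] = (url, now)
        return url


def split_condensed_id(condensed):
    """Split a condensed id such as 'D327R172930' into ('D327', 'R172930')."""
    match = re.fullmatch(r"(D\d+)([A-Z]\d+)", condensed)
    if match is None:
        raise ValueError(f"not a condensed identifier: {condensed!r}")
    return match.group(1), match.group(2)
```

A caller would first split the identifier, resolve the deployment part to a URL, and then re-issue the query with only the local part, which mirrors steps (2) to (4) of the caption. User validation, steps (5) to (7), is left out of this sketch.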

Gastro-intestinal Clinical Trials Use Case

As part of a collaboration with the Department of Gastrointestinal Medical Oncology at The University of Texas MD Anderson Cancer Center, an S3DB deployment was configured to host data from gastro-intestinal (GI) clinical trials. A schema was developed using S3DB Collections and Rules, and S3QL insert queries were used to submit data elements as Items and Statements (see Figure 6). Two permission propagation examples are illustrated, one of a restrictive nature and the other permissive, which can also be explored at http://q.s3db.org/quadratus (Figure 4). In this example, the simple mechanisms of propagation defined for S3QL support the complex social interaction that requires a fraction of the dataset to be shared with certain users but not with others. Contextual usage is therefore a function of the attributes of the data itself (e.g. its creator) and the user identification token that is submitted with every read/write operation.

The Cancer Genome Atlas Use Case

The Cancer Genome Atlas (TCGA) is a pilot project to characterize several types of cancer by sequencing and genetically characterizing tumours for over 500 patients throughout multiple institutions. S3QL was used in this case study to produce an infrastructure that exposes the public portion of the TCGA datasets as a SPARQL endpoint [41]. This was possible because SPARQL is entailed by S3QL, but not the opposite. Specifically, SPARQL queries can be serialized to S3QL, but the opposite is not always possible, particularly as regards write and access control operations. The structure of the S3DB Core Model, which explicitly distinguishes domain from instantiation, enables SPARQL query patterns such as ?Patient :R390 ?cancerType to be readily serialized into the S3QL equivalent, select(S|R390). Although this will not be further explored in this discussion, it is worth noting that the availability of this serialization allows for an intuitive syntax of SPARQL queries by patterning them on the description of the user-defined domain Rule, such as “?Patient :hasCancerType ?cancerType”.
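To make the direction of this serialization concrete, the following Python sketch turns a single triple pattern whose predicate is an S3DB Rule identifier into the compact S3QL form quoted above. It is only an illustration: the function name and the accepted surface syntax (variables prefixed with `?`, an optional `:` before the Rule id) are assumptions, and the actual serialization engine referenced at [55] handles far more than this one case.

```python
import re

# One triple pattern with a Rule identifier as predicate,
# e.g. "?Patient :R390 ?cancerType". The accepted syntax is an assumption.
_RULE_PATTERN = re.compile(r"^\s*\?(\w+)\s+:?(R\d+)\s+\?(\w+)\s*\.?\s*$")


def triple_pattern_to_s3ql(pattern):
    """Return the compact S3QL select for a Rule-predicate triple pattern."""
    match = _RULE_PATTERN.match(pattern)
    if match is None:
        raise ValueError(f"unsupported pattern: {pattern!r}")
    rule_id = match.group(2)
    # Statements (S) are the instances of a Rule, so the Rule id is the filter.
    return f"select(S|{rule_id})"
```

The reverse direction is deliberately absent, in line with the point made above that write and access control operations in S3QL have no SPARQL counterpart.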

A Molecular Epidemiology Use Case

In this example, SPARQL was serialized to S3QL to support a computational statistics application. As a first step, an S3DB data store was deployed using S3QL to manage molecular epidemiology data related to strains of Staphylococcus aureus bacteria collected at the Instituto de Tecnologia Química e Biológica (ITQB) in Portugal. Specifically, the ITQB Staphylococcus reference database was devised with the purpose of managing Multilocus Sequence Typing (MLST) data, a typing method used to track the molecular epidemiology of infectious bacteria [54-57]. As a second step, we downloaded the public Staphylococcus aureus MLST profiles database at http://www.mlst.net and made it available through a SPARQL endpoint at http://q.s3db.org/mlst_sparql/endpoint.php. The process of integration of MLST profiles from the ITQB Staphylococcus database with the publicly available MLST profiles is illustrated in Figure 7. In this example, a federated SPARQL query is assembled to access both MLST sources; data stored in the S3DB deployment is retrieved by serializing the SPARQL query into S3QL and providing an authentication token to identify the user, as described in Figure 3. The assembled graph resulting from the federated SPARQL query (see additional file 2) can be imported into a statistical computing environment such as Matlab (Mathworks Inc). Using this methodology, it was possible to cluster strains from two different data sources with very different authentication requirements. Integrating the data through S3QL enabled the observation that some Portuguese strains (PT1, PT2, PT15 and PT21) that are not publicly shared cluster together with a group of public UK strains (UK17, UK16, UK11, UK14, UK15, UK13, UK12) and therefore may share a common ancestor.
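The clustering step can be illustrated with a small, self-contained Python sketch. It is not the Matlab analysis used in the study: it assumes each strain is described by a tuple of allele numbers, uses the number of differing loci as the distance, and groups strains by single linkage at one cutoff, which corresponds to cutting a single-linkage dendrogram at that height. The function names are hypothetical, and any profiles passed in for experimentation would be invented toy data rather than the profiles in additional file 2.

```python
from itertools import combinations


def hamming(profile_a, profile_b):
    """Number of loci at which two allelic profiles differ."""
    return sum(a != b for a, b in zip(profile_a, profile_b))


def single_linkage(profiles, max_distance):
    """Merge clusters whose closest members differ at <= max_distance loci."""
    clusters = [{name} for name in profiles]
    merged = True
    while merged:
        merged = False
        for left, right in combinations(clusters, 2):
            closest = min(
                hamming(profiles[a], profiles[b]) for a in left for b in right
            )
            if closest <= max_distance:
                # Replace the two clusters by their union and start over.
                clusters.remove(left)
                clusters.remove(right)
                clusters.append(left | right)
                merged = True
                break
    return sorted(sorted(cluster) for cluster in clusters)
```

Because the function only needs a mapping from strain name to profile, private strains retrieved through S3QL and public strains retrieved through SPARQL can be placed in the same dictionary before clustering, which is the integration point the use case describes.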

Conclusions

Life sciences applications are set to greatly benefit from coupling Semantic Web Linked Data standards and KOSs. In the current report we illustrate data models from life sciences domains woven using the S3DB knowledge organization system. In line with the requirements for the emergence of evolvable “social machines”, different perspectives on the data are made possible by a permission propagation mechanism controlled by contextual attributes of data elements, such as their creator. The operation of the S3DB KOS is mediated by the S3QL protocol described in this report, which exposes its Application Programming Interface for viewing, inserting, updating and removing data elements. Because S3QL is implemented with a distributed architecture where URIs can be dereferenced into multiple S3DB deployments, domain experts can share data on their own deployments with users of other systems without the need for local accounts. Therefore, S3QL's fine grained permission control, defined as instances of s3db:operators, enables domain experts to clearly specify the degree of permission that a user should have on a resource and how that permission should propagate in a distributed infrastructure. This is in contrast to the conventional approach of delegating permission management to the point of access. In the current SPARQL specification extension, various data sources can be queried simultaneously or sequentially. There is still no accepted convention for tying a query pattern to an authenticated user, probably because SPARQL engines would have no use for that information, as most have been created in a context of Linked Open Data efforts. The critical limitation in applying this solution for Health Care and Life Sciences concerns the ability to make use of contextual information both to determine the level of trust in the data and to enable controlled access to elements in a dataset without breaking it up and storing it in multiple systems. To address this requirement, S3QL was fitted with distributed control operational features that follow design criteria found desirable for biomedical applications. S3QL is not unique in its class; for example, the linked data API, which is being used by data.gov.uk, is an alternative DSL to manage linked data [58]. However, we believe that S3QL is closer to the technologies currently used by application developers and therefore may provide a more suitable middle layer between linked data formalisms and application development. It is argued that these features may assist and anticipate future extensions of semantic web provenance control formalisms.

Figure 6 Two use cases of permission propagation. Two users are granted full permission to view the GI Clinical Trials Project (“y”); however, neither of them can add new data (“n”) nor edit existing data unless they were its original creator (“s”). One of the users (giuser1) is prevented from accessing any data with demographic elements such as, for example, the names of the Patients. In this case, an uppercase “N” is assigned in the right to view the Collection DemographicData, which will be merged with the inherited “y” to produce an effective permission level of “N” for the right to view (merge 1). For the second user (giuser2), permission is granted to use the Collection TissueData, indicating that the user can create new instances in that Collection (merge 2).
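The two merges in the Figure 6 caption can be mimicked with a deliberately simplified Python sketch. It assumes three-character permission strings ordered view, edit, use (as in the “ysn” example shown for the Quadratus interface in Figure 4) and assumes that “-” means nothing was assigned at that position. The rule used here, where any assigned character simply replaces the inherited one, is a stand-in chosen for illustration; it is not the s3db:operators merge function, and `merge_permissions` is a hypothetical name.

```python
RIGHTS = ("view", "edit", "use")  # assumed order of the three characters


def merge_permissions(inherited, assigned):
    """Combine an inherited permission string with one assigned on a child.

    Simplifying assumption: "-" in the assigned string keeps the inherited
    character; any other character replaces it.
    """
    if len(inherited) != len(RIGHTS) or len(assigned) != len(RIGHTS):
        raise ValueError("permission strings need one character per right")
    return "".join(
        parent if child == "-" else child
        for parent, child in zip(inherited, assigned)
    )
```

Under these assumptions, merging an inherited “ysn” with an assigned “N--” gives “Nsn” (the restrictive case for giuser1), and merging “ysn” with “--y” gives “ysy” (the permissive case for giuser2).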

Additional material

Additional file 1: Walkthrough of S3DB's permission propagation: merge, migrate and percolate.

Additional file 2: Matrix of MLST profiles in Portugal and the United Kingdom.

Acknowledgements and funding

This work was supported in part by the Center for Clinical and Translational Sciences of the University of Alabama at Birmingham under contract no. 5UL1 RR025777-03 from NIH National Center for Research Resources, by the National Cancer Institute grant 1U24CA143883-01, by the European Union FP7 PNEUMOPATH project award and by the Portuguese Science and Technology foundation under contracts PTDC/EIA-EIA/105245/2008 and PTDC/EEA-ACR/69530/2006. HFD also thankfully acknowledges a PhD fellowship from the same foundation, award SFRH/BD/45963/2008, and the Science Foundation Ireland project Lion under Grant No. SFI/02/CE1/I131. The authors would also like to thank three anonymous reviewers whose comments and advice were extremely valuable for improving the manuscript.

Author details

1 Digital Enterprise Research Institute, National University of Ireland at Galway, IDA Business Park, Lower Dangan, Galway, Ireland. 2 Biomathematics, Instituto de Tecnologia Química e Biológica, Universidade Nova de Lisboa, Av. da República, Estação Agronómica Nacional, 2780-157 Oeiras, Portugal. 3 Laboratório Nacional de Computação Científica, Av. Getúlio Vargas 333, Quitandinha, 25651-075 Petrópolis, Brasil. 4 Sanofi Pasteur, 38 Sidney Street, Cambridge, MA 02139, USA. 5 Laboratory of Molecular Genetics, Instituto de Tecnologia Química e Biológica, Universidade Nova de Lisboa, Av. da República, Estação Agronómica Nacional, 2780-157 Oeiras, Portugal. 6 Research Center for Intelligent Media, Furtwangen University, Furtwangen, Germany. 7 Laboratory of Microbiology, The Rockefeller University, 10021 New York, USA. 8 Division of Informatics, Department of Pathology, University of Alabama at Birmingham, 619 South 19th Street, Birmingham, Alabama, USA.

Authors' contributions

HFD developed the S3QL domain specific language, implemented it in the S3DB prototype, validated it with examples and wrote the manuscript. MCC, RS, MM, HL, RF, WM and JSA tested and validated the language with examples and made suggestions which led to its improvement. All authors read and approved the final manuscript.

Figure 7 Workflow for the creation of a hierarchical cluster of MLST profiles from strains collected in Portugal and in the UK. Two sources are chosen to perform the MLST profile assembly: an S3DB deployment, which holds the ITQB Staphylococcus database, and the SPARQL endpoint configured to host the Staphylococcus aureus MLST database. Although data in the MLST SPARQL endpoint is publicly available, to access data in S3DB the user needs to provide an authentication token and a user_id, as well as assemble the S3QL queries to retrieve the data [select(S|R172930) and select(S|R167271)], which may also be formulated as SPARQL. A data structure is assembled that can finally be analysed using hierarchical clustering methods such as the ones made available by Matlab. The complete MLST dataset used to obtain the graph is provided in additional file 2.


Competing interests

The authors declare that they have no competing interests.

Received: 14 April 2011. Accepted: 14 July 2011. Published: 14 July 2011.

References

1. Bell G, Hey T, Szalay A: Computer science. Beyond the data deluge. Science (New York, NY) 2009, 323:1297-8.
2. Chiang AP, Butte AJ: Data-driven methods to discover molecular determinants of serious adverse drug events. Clinical pharmacology and therapeutics 2009, 85:259-68.
3. The end of theory: the data deluge makes the scientific method obsolete. [http://www.wired.com/science/discoveries/magazine/16-07/pb_theory]
4. Hubbard T: The Ensembl genome database project. Nucleic Acids Research 2002, 30:38-41.
5. Karolchik D: The UCSC Genome Browser Database. Nucleic Acids Research 2003, 31:51-54.
6. Maglott D, Ostell J, Pruitt KD, Tatusova T: Entrez Gene: gene-centered information at NCBI. Nucleic acids research 2005, 33:D54-8.
7. Ashburner M, Ball CA, Blake JA, et al: Gene ontology: tool for the unification of biology. The Gene Ontology Consortium. Nature genetics 2000, 25:25-9.
8. Bizer C, Heath T, Berners-Lee T: Linked Data - The Story So Far. International Journal on Semantic Web and Information Systems (IJSWIS) 2009.
9. Linked Data | Linked Data - Connect Distributed Data across the Web. [http://linkeddata.org/]
10. Linked data - Design issues. [http://www.w3.org/DesignIssues/LinkedData.html]
11. Vandervalk BP, McCarthy EL, Wilkinson MD: Moby and Moby 2: creatures of the deep (web). Briefings in bioinformatics 2009, 10:114-28.
12. Where the semantic web stumbled, linked data will succeed - O'Reilly Radar. [http://radar.oreilly.com/2010/11/semantic-web-linked-data.html]
13. Berners-Lee T, Weitzner DJ, Hall W, et al: A Framework for Web Science. Foundations and Trends® in Web Science 2006, 1:1-130.
14. Hendler J, Berners-Lee T: From the Semantic Web to social machines: A research challenge for AI on the World Wide Web. Artificial Intelligence 2010, 174:156-161.
15. Almeida JS, Chen C, Gorlitsky R, et al: Data integration gets “Sloppy”. Nature biotechnology 2006, 24:1070-1.
16. Deus HF, Stanislaus R, Veiga DF, et al: A Semantic Web management model for integrative biomedical informatics. PloS one 2008, 3:e2946.
17. Putting the Web back in Semantic Web. [http://www.w3.org/2005/Talks/1110-iswc-tbl/#(1)]
18. SPARQL Query Language for RDF. [http://www.w3.org/TR/rdf-sparql-query/]
19. Alexander K, Cyganiak R, Hausenblas M, Zhao J: Describing Linked Datasets - On the Design and Usage of voiD, the “Vocabulary Of Interlinked Datasets”. Linked Data on the Web Workshop (LDOW 09), in conjunction with 18th International World Wide Web Conference (WWW 09) 2009.
20. Cheung KH, Frost HR, Marshall MS, et al: A journey to Semantic Web query federation in the life sciences. BMC bioinformatics 2009, 10(Suppl 1):S10.
21. A Prototype Knowledge Base for the Life Sciences. [http://www.w3.org/TR/hcls-kb/]
22. Belleau F, Nolin M-A, Tourigny N, Rigault P, Morissette J: Bio2RDF: towards a mashup to build bioinformatics knowledge systems. Journal of biomedical informatics 2008, 41:706-16.
23. Smith B, Ashburner M, Rosse C, et al: The OBO Foundry: coordinated evolution of ontologies to support biomedical data integration. Nature biotechnology 2007, 25:1251-5.
24. Taylor CF, Field D, Sansone SA, et al: Promoting coherent minimum reporting guidelines for biological and biomedical investigations: the MIBBI project. Nature biotechnology 2008, 26:889-96.
25. Noy NF, Shah NH, Whetzel PL, et al: BioPortal: ontologies and integrated data resources at the click of a mouse. Nucleic acids research 2009, 37:W170-3.
26. Deus HF, Prud E, Zhao J, Marshall MS, Samwald M: Provenance of Microarray Experiments for a Better Understanding of Experiment Results. ISWC 2010 SWPM 2010.
27. Stein LD: Integrating biological databases. Nature reviews Genetics 2003, 4:337-45.
28. Goble C, Stevens R: State of the nation in data integration for bioinformatics. Journal of Biomedical Informatics 2008, 41:687-693.
29. Ludäscher B, Altintas I, Bowers S, et al: Scientific Process Automation and Workflow Management. In Scientific Data Management. Edited by Shoshani A, Rotem D. Chapman 2009.
30. Nelson B: Data sharing: Empty archives. Nature 2009, 461:160-3.
31. Stanislaus R, Chen C, Franklin J, Arthur J, Almeida JS: AGML Central: web based gel proteomic infrastructure. Bioinformatics (Oxford, England) 2005, 21:1754-7.
32. Silva S, Gouveia-Oliveira R, Maretzek A, et al: EURISWEB - Web-based epidemiological surveillance of antibiotic-resistant pneumococci in day care centers. BMC medical informatics and decision making 2003, 3:9.
33. Describing Linked Datasets with the VoiD Vocabulary. [http://www.w3.org/TR/2011/NOTE-void-20110303/]
34. HIPAA Administrative Simplification Statute and Rules. [http://www.hhs.gov/ocr/privacy/hipaa/administrative/index.html]
35. Socially Aware Cloud Storage. [http://www.w3.org/DesignIssues/CloudStorage.html]
36. Koslow SH: Opinion: Sharing primary data: a threat or asset to discovery? Nature reviews Neuroscience 2002, 3:311-3.
37. Baggerly KA, Coombes KR: Deriving chemosensitivity from cell lines: Forensic bioinformatics and reproducible research in high-throughput biology. The Annals of Applied Statistics 2009, 3:1309-1334.
38. Hodge G: Systems of Knowledge Organization for Digital Libraries: Beyond Traditional Authority Files. 2000.
39. SKOS Simple Knowledge Organization System Reference. [http://www.w3.org/TR/skos-reference/]
40. Almeida JS, Deus HF, Maass W: S3DB core: a framework for RDF generation and management in bioinformatics infrastructures. BMC bioinformatics 2010, 11:387.
41. Deus HF, Veiga DF, Freire PR, et al: Exposing The Cancer Genome Atlas as a SPARQL endpoint. Journal of Biomedical Informatics 2010, 43:998-1008.
42. Correa MC, Deus HF, Vasconcelos AT, et al: AGUIA: autonomous graphical user interface assembly for clinical trials semantic data services. BMC medical informatics and decision making 2010, 10.
43. Freire P, Vilela M, Deus H, et al: Exploratory analysis of the copy number alterations in glioblastoma multiforme. PloS one 2008, 3:e4076.
44. NCBO Ontology Widgets. [http://www.bioontology.org/wiki/index.php/NCBO_Widgets]
45. Bussler C: Is Semantic Web Technology Taking the Wrong Turn? Ieee Internet Computing 2008, 12:75-79.
46. What people find hard about linked data. [http://dynamicorange.com/2010/11/15/what-people-find-hard-about-linked-data/]
47. Raja A, Lakshmanan D: Domain Specific Languages. International Journal of Computer Applications 2010, 1:99-105.
48. SPARQL Update. [http://www.w3.org/TR/sparql11-update/]
49. Carroll JJ, Bizer C, Hayes P, Stickler P: Named graphs, provenance and trust. Proceedings of the 14th international conference on World Wide Web - WWW 05 2005, 14:613.
50. S3DB operator function states. [http://code.google.com/p/s3db-operator/]
51. S3DB Operators. [http://s3db-operator.googlecode.com/hg/propagation.html]
52. Deus HF, Sousa MA de, Carrico JA, Lencastre H de, Almeida JS: Adapting experimental ontologies for molecular epidemiology. AMIA Annual Symposium proceedings 2007, 935.
53. The OAuth 1.0 Protocol. [http://tools.ietf.org/html/rfc5849]
54. Francisco AP, Bugalho M, Ramirez M, Carriço JA: Global optimal eBURST analysis of multilocus typing data using a graphic matroid approach. BMC bioinformatics 2009, 10:152.
55. S3QL serialization engine. [http://js.s3db.googlecode.com/hg/translate/quickTranslate.html]
56. Ippolito G, Leone S, Lauria FN, Nicastri E, Wenzel RP: Methicillin-resistant Staphylococcus aureus: the superbug. International journal of infectious diseases 2010, 14:S7-S11.
57. Harris SR, Feil EJ, Holden MTG, et al: Evolution of MRSA during hospital transmission and intercontinental spread. Science (New York, NY) 2010, 327:469-74.
58. Linked Data API. [http://code.google.com/p/linked-data-api/]

doi:10.1186/1471-2105-12-285

Cite this article as: Deus et al.: S3QL: A distributed domain specific language for controlled semantic integration of life sciences data. BMC Bioinformatics 2011, 12:285.


Page 9: RESEARCH ARTICLE Open Access S3QL: A …RESEARCH ARTICLE Open Access S3QL: A distributed domain specific language for controlled semantic integration of life sciences data Helena F

(key) The REST specification [54] suggest separate HTTPmethods according to the intention of the operationoften ldquoGETrdquo is used to retrieve data ldquoPUTrdquo is used to sub-mit data ldquoPOSTrdquo is used to update data and ldquoDELETErdquo isused to remove data There are however many program-ming environments that implement only the REST ldquoGETrdquomethod including many popular computational statisticsprogramming frameworks such as R and Matlab There-fore in order to fully explore the integrative potential ofthis readwrite semantic web service and to supportoperations beyond the 4 implemented the S3DB proto-type implementation of S3QL supports the ldquoGETrdquo methodfor all S3QL operations with the parameters of the S3QLcall appended to the URL One drawback of relying onGET is the limits imposed by the browser on URL lengthTo address this potential problem the S3DB prototypealso supports the use of ldquoPOSTrdquo for S3QL calls

Two further challenges needed to be addressed in theprototypical implementation of S3QL 1) the need for acentralized root location to support dereferencing ofdeployment URI when the condensed version is used (eg D282) and 2) distributed queries on REST systemsrequired users to authenticate in multiple KOSs Thefirst challenge was addressed by configuring an S3DBdeployment as the root location available at httproots3dborg Deployment metadata is submitted to this rootdeployment at configuration time using S3QL data per-taining to each deployment can therefore be derefer-enced to a URL using httproots3dborgD[numeric]The option to refer to another root deployment thanthe default is possible during installation To avoid over-loading these root deployments with too many requestsa local 24 hour cache of all accessed deployments iskept in each S3DB deployment using the same strategy

Figure 4 Quadratus an interface to illustrate S3QLrsquos permission control mechanism and its effect on S3QL queries Projects CollectionsRules Items and Statements associated with GI Clinical Trials Project are displayed with effective and assigned permissions Collections and Rulesretrieved using S3QL queries select(C|P1196457) and select(R| P1196457) inherit the permission assignment in the Project ldquoysn assigned permissionldquoNndashrdquo in Collection ldquoDemographicsrdquo results in the effective permission of ldquoNsnrdquo inherited by all Rules and Items that have a relationship with thatCollection effectively preventing gi_user1 from accessing its data The directed labelled graph of the propagation resolution is displayed on theright side of the application illustrating the propagation mechanism

Deus et al BMC Bioinformatics 2011 12285httpwwwbiomedcentralcom1471-210512285

Page 9 of 15

if the URL is cached it will not be requested from theroot deployment To address the second challenge eachs3dbdeployment can store any number of authenticationservices supporting HTTP FTP or LDAP protocolsOnce the user is authenticated temporary surrogatetokens are issued with each query When coupled withthe user identifier in the format of a URI these tokenseffectively identify the user performing the query regard-less of the S3DB deployment where the query isrequestedScreencasts illustrating processing time of data manip-

ulation using S3QL are available at httpwwwyoutubecomwatchv=2KZC6kI609s and httpwwwyoutubecomwatchv=FJSYLCwBaPI

DiscussionOne of the major concerns in making use of LinkedData to improve health care and life sciences research isthe need to ensure both the availability of contextualinformation about experimental datasets and the abilityto protect the privacy of certain data elements whichmay identify an individual patient Domain specific lan-guages (DSL) can ease the task of managing the contex-tual descriptors that would be necessary to implement

permission control in RDF and by doing so couldgreatly accelerate the rate of adoption of Linked Dataformalisms in the life sciences communities to improvescientific discovery We have described S3QL a DSL toperform readwrite operations on entities of the S3DBCore Model S3QL attempts to address the requirementsin linking Life Sciences datasets including both publish-able and un-publishable data elements by 1) includingcontextual descriptors for every submitted data elementand 2) making use of those descriptors to ensure per-mission control managed by the data experts them-selves This avoids the need to break a consolidateddataset into its public and private parts when the resultsare acceptable for publicationApplying S3QL to the S3DB Core Model in a prototy-

pical application benefited from the definition of looselydefined boundaries for RDF data that enabled propaga-tion of permission while avoiding the need to documenta relationship for each data instance individually andfor each user The assembly of SPARQL queries is alsofacilitated by the identification of domain triples usingnamed graphs from the data itself [41] and can be illu-strated in the application at [55] where a subset ofS3QL can be readily converted into the W3C standard

harrharr

Figure 5 The global S3QL dereferencing system User U78 of deployment D309 issues a command to request all entities of type Statementwhere the attribute Rule_id corresponds to the value D327R172930 through S3QL operation select(Svalue|D327R172930) (1) If the URLcorresponding to deployment D327 is not cached locally or has not been validated in the past 24 h a query is issued and executed at the rootdeployment to retrieve the corresponding URL select(Durl|D327) (23) Once the URL is returned query (1) is re-issued as select(Svalue| R172930)and executed at the URL for D327 (4) To validate the user deployment D327 issues the command select(Uid|U76) at D309 using the keyprovided (56) and returns the data only if Uid matches the value for user_id (7)

Deus et al BMC Bioinformatics 2011 12285httpwwwbiomedcentralcom1471-210512285

Page 10 of 15

SPARQL Although the prototypical implementation ofS3QL fits the definition of an API for S3DB systems itis immediately apparent that the same notation could beeasily and intuitively extended to other KOSrsquo core mod-els For example pointing the tool at httpqs3dborgtranslate to a JavaScript Object Notation (JSON) repre-sentation of the SKOS core model (skosjs) instead ofthe default S3DB core model (s3dbjs) results in a validS3QL syntax that could easily be applied as an API forSKOS based systems as illustrated by applying theexample query ldquoselect(C|prefLabel = animal)ldquo usinghttplinks3dborgtranslatecore=skosjsampquery=select(C|prefLabel=animal) to retrieve SKOS concepts labelledldquoanimalrdquo The progress of adoption for life sciencesapplication developers can be further smoothed by com-plete reliance on the REST protocol for data exchangeand the availability of widely used formats such asJSON XML or RDFturtleThe applicability of S3QL to life sciences domains is

illustrated here with three case studies 1) in the domainof clinical trials a project [42] that requires collabora-tion between departments with different interests 2) inthe domain of the cancer genome atlas (TCGA) projecta multi-institutional effort that requires multiple authen-tication mechanism and sharing of data among multipleinstitutions [41] and 3) in the domain of molecular epi-demiology a project where non-public data stored in anS3DB deployment needs to be statistically integratedwith data from a public repository All described usecases shared one considerable requirement - the abilityto include in the same dataset both published andunpublished results As such they required both theannotation of contextual descriptors of the data enabledby S3QL and the availability of controlled permissionpropagation enabled by the S3DB model transitionmatrix Future work in this effort may include the appli-cation of S3QL in a Knowledge Organization Systembased on SKOS terminology and the definition of atransition matrix for SKOS to enable controlled permis-sion propagation

Gastro-intestinal Clinical Trials Use CaseAs part of collaboration with the Department of Gastroin-testinal Medical Oncology at The University of Texas MD Anderson Cancer Center an S3DB deployment wasconfigured to host data from gastro-intestinal (GI) clinicaltrials A schema was developed using S3DB Collectionsand Rules and S3QL insert queries were used to submitdata elements as Items and Statements (see Figure 6) Twopermission propagation examples are illustrated one of arestrictive nature and the other permissive which can alsobe explored at httpqs3dborgquadratus (Figure 4) Inthis example the simple mechanisms of propagationdefined for S3QL support the complex social interaction

that requires a fraction of the dataset to be shared withcertain users but not with others Contextual usage istherefore a function of the attributes of the data itself (egits creator) and the user identification token that is sub-mitted with every readwrite operation

The Cancer Genome Atlas Use CaseThe cancer genome atlas (TCGA) is a pilot project tocharacterize several types of cancer by sequencing andgenetically characterizing tumours for over 500 patientsthroughout multiple institutions S3QL was used in thiscase study to produce an infrastructure that exposes thepublic portion of the TCGA datasets as a SPARQL end-point [41] This was possible because SPARQL is entailedby S3QL but not the opposite Specifically SPARQLqueries can be serialized to S3QL but the opposite is notalways possible particularly as regards write and accesscontrol operations The structure of the S3DB Core Modelwhich explicitly distinguishes domain from instantiationenables SPARQL query patterns such as Patient R390 cancerType to be readily serialized into its S3QL equiva-lent select(S|R390) Although this will not be furtherexplored in this discussion it is worth noting that theavailability of this serialization allows for an intuitive syn-tax of SPARQL queries by patterning them on the descrip-tion of the user-defined domain Rule such as ldquoPatienthasCancerType cancerTyperdquo

A Molecular Epidemiology Use Case

In this example, SPARQL was serialized to S3QL to support a computational statistics application. As a first step, an S3DB data store was deployed using S3QL to manage molecular epidemiology data related to strains of Staphylococcus aureus bacteria collected at the Instituto de Tecnologia Química e Biológica (ITQB) in Portugal. Specifically, the ITQB Staphylococcus reference database was devised for the purpose of managing Multilocus Sequence Typing (MLST) data, a typing method used to track the molecular epidemiology of infectious bacteria [54-57]. As a second step, we downloaded the public Staphylococcus aureus MLST profiles database at http://www.mlst.net and made it available through a SPARQL endpoint at http://q.s3db.org/mlst_sparql/endpoint.php. The process of integrating MLST profiles from the ITQB Staphylococcus database with the publicly available MLST profiles is illustrated in Figure 7. In this example, a federated SPARQL query is assembled to access both MLST sources; data stored in the S3DB deployment is retrieved by serializing the SPARQL query into S3QL and providing an authentication token to identify the user, as described in Figure 3. The assembled graph resulting from the federated SPARQL query (see additional file 2) can be imported into a statistical computing environment such as Matlab (Mathworks Inc). Using this methodology, it was possible to cluster strains from two different data sources with very different authentication requirements. The observation that some Portuguese strains (PT1, PT2, PT15 and PT21) that are not publicly shared cluster together with a group of public UK strains (UK17, UK16, UK11, UK14, UK15, UK13, UK12), and therefore may share a common ancestor, is enabled by the data integration performed through S3QL.
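As a rough illustration of the last step, the sketch below merges allelic profiles from a private and a public source and groups strains whose profiles differ at no more than a chosen number of loci. It is not the Matlab analysis used in the study: the grouping is a plain single-linkage cut at a fixed distance, and the strain names and seven-locus profiles are made-up toy values, not data from additional file 2.

```python
from itertools import combinations
from typing import Dict, List, Tuple

Profile = Tuple[int, ...]


def allele_distance(a: Profile, b: Profile) -> int:
    """Number of loci at which two allelic profiles differ."""
    if len(a) != len(b):
        raise ValueError("profiles must cover the same loci")
    return sum(x != y for x, y in zip(a, b))


def single_linkage(profiles: Dict[str, Profile], max_distance: int) -> List[List[str]]:
    """Group strains linked by a chain of pairs within max_distance."""
    parent = {name: name for name in profiles}

    def find(name: str) -> str:
        while parent[name] != name:
            parent[name] = parent[parent[name]]  # path halving
            name = parent[name]
        return name

    for a, b in combinations(profiles, 2):
        if allele_distance(profiles[a], profiles[b]) <= max_distance:
            parent[find(a)] = find(b)

    groups: Dict[str, List[str]] = {}
    for name in profiles:
        groups.setdefault(find(name), []).append(name)
    return sorted(sorted(group) for group in groups.values())


# Toy profiles (hypothetical values for illustration only).
private_source = {"private_1": (1, 1, 1, 1, 1, 1, 1), "private_2": (3, 3, 1, 1, 4, 4, 3)}
public_source = {"public_1": (1, 1, 1, 1, 1, 1, 2), "public_2": (7, 6, 1, 5, 8, 8, 6)}

merged = {**private_source, **public_source}
clusters = single_linkage(merged, max_distance=1)
print(clusters)
```

With these toy values, private_1 and public_1 differ at a single locus and end up in the same group, which mirrors the kind of cross-source grouping described above.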

Conclusions

Life sciences applications are set to greatly benefit from coupling Semantic Web Linked Data standards and KOSs. In the current report we illustrate data models from life sciences domains woven using the S3DB knowledge organization system. In line with the requirements for the emergence of evolvable "social machines", different perspectives on the data are made possible by a permission propagation mechanism controlled by contextual attributes of data elements, such as their creator. The operation of the S3DB KOS is mediated by the S3QL protocol described in this report, which exposes its Application Programming Interface for viewing, inserting, updating and removing data elements. Because S3QL is implemented with a distributed architecture, where URIs can be dereferenced into multiple S3DB deployments, domain experts can share data on their own deployments with users of other systems without the need for local accounts. S3QL's fine-grained permission control, defined as instances of s3db:operators, therefore enables domain experts to clearly specify the degree of permission that a user should have on a resource and how that permission should propagate in a distributed infrastructure. This is in contrast to the conventional approach of delegating permission management to the point of access.

In the current SPARQL specification extension, various data sources can be queried simultaneously or sequentially. There is still no accepted convention for tying a query pattern to an authenticated user, probably because SPARQL engines would have no use for that information, as most have been created in the context of Linked Open Data efforts. The critical limitation in applying this solution to Health Care and Life Sciences lies in the ability to make use of contextual information, both to determine the level of trust in the data and to enable controlled access to elements in a dataset without breaking it up and storing it in multiple systems. To address this requirement, S3QL was fitted with distributed control operational features that follow design criteria found desirable for biomedical applications. S3QL is not unique in its class: for example, the Linked Data API, which is being used by data.gov.uk, is an alternative DSL to manage linked data [58]. However, we believe that S3QL is closer to the technologies currently used by application developers and therefore may provide a more suitable middle layer between linked data formalisms and application development. It is argued that these features may assist and anticipate future extensions of semantic web provenance control formalisms.

Figure 6 Two use cases of permission propagation. Two users are granted full permission to view the GI Clinical Trials Project ("y"); however, neither of them can add new data ("n") nor edit existing data unless they were its original creator ("s"). One of the users (giuser1) is prevented from accessing any data with demographic elements such as, for example, the names of the Patients. In this case, an uppercase "N" is assigned in the right to view the Collection DemographicData, which will be merged with the inherited "y" to produce an effective permission level of "N" for the right to view (merge 1). For the second user (giuser2), permission is granted to use the Collection TissueData, indicating that the user can create new instances in that Collection (merge 2).
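The two merges described in the Figure 6 caption can be sketched as below. The merge rule used here is an assumption chosen only to reproduce the two outcomes stated in the caption (a directly assigned value replaces the inherited one unless the inherited value is uppercase and the assigned one is not); it is not claimed to be the full s3db:operators definition, and the dictionary layout with the right names view, edit and use is a hypothetical representation.

```python
from typing import Dict, Optional

RIGHTS = ("view", "edit", "use")


def merge_right(inherited: str, assigned: Optional[str]) -> str:
    """Merge one inherited right with an optional direct assignment.

    Assumed rule (illustrative only): an uppercase inherited value is
    dominant over a lowercase assignment; otherwise the assignment wins.
    """
    if assigned is None:
        return inherited
    if inherited.isupper() and not assigned.isupper():
        return inherited
    return assigned


def merge_permissions(inherited: Dict[str, str], assigned: Dict[str, str]) -> Dict[str, str]:
    """Apply merge_right to every right of a resource."""
    return {right: merge_right(inherited[right], assigned.get(right)) for right in RIGHTS}


# Project-level permission from the caption: view "y", edit only own data "s",
# no right to add new data "n".
project = {"view": "y", "edit": "s", "use": "n"}

# merge 1 (restrictive): giuser1 gets "N" on viewing DemographicData.
giuser1_demographics = merge_permissions(project, {"view": "N"})

# merge 2 (permissive): giuser2 is allowed to use TissueData.
giuser2_tissue = merge_permissions(project, {"use": "y"})

print(giuser1_demographics, giuser2_tissue)
```

Keeping each right under its own key avoids committing to a positional encoding of the permission string, which the caption does not spell out.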

Additional material

Additional file 1: Walkthrough of S3DB's permission propagation: merge, migrate and percolate.

Additional file 2: Matrix of MLST profiles in Portugal and the United Kingdom.

Acknowledgements and funding

This work was supported in part by the Center for Clinical and Translational Sciences of the University of Alabama at Birmingham under contract no. 5UL1 RR025777-03 from NIH National Center for Research Resources, by the National Cancer Institute grant 1U24CA143883-01, by the European Union FP7 PNEUMOPATH project award and by the Portuguese Science and Technology foundation under contracts PTDC/EIA-EIA/105245/2008 and PTDC/EEA-ACR/69530/2006. HFD also thankfully acknowledges a PhD fellowship from the same foundation, award SFRH/BD/45963/2008, and the Science Foundation Ireland project Lion under Grant No. SFI/02/CE1/I131. The authors would also like to thank three anonymous reviewers whose comments and advice were extremely valuable for improving the manuscript.

Author details

1 Digital Enterprise Research Institute, National University of Ireland at Galway, IDA Business Park, Lower Dangan, Galway, Ireland. 2 Biomathematics, Instituto de Tecnologia Química e Biológica, Universidade Nova de Lisboa, Av. da República, Estação Agronómica Nacional, 2780-157 Oeiras, Portugal. 3 Laboratório Nacional de Computação Científica, Av. Getúlio Vargas 333, Quitandinha, 25651-075 Petrópolis, Brasil. 4 Sanofi Pasteur, 38 Sidney Street, Cambridge, MA 02139, USA. 5 Laboratory of Molecular Genetics, Instituto de Tecnologia Química e Biológica, Universidade Nova de Lisboa, Av. da República, Estação Agronómica Nacional, 2780-157 Oeiras, Portugal. 6 Research Center for Intelligent Media, Furtwangen University, Furtwangen, Germany. 7 Laboratory of Microbiology, The Rockefeller University, 10021 New York, USA. 8 Division of Informatics, Department of Pathology, University of Alabama at Birmingham, 619 South 19th Street, Birmingham, Alabama, USA.

Authors' contributions

HFD developed the S3QL domain specific language, implemented it in the S3DB prototype, validated it with examples and wrote the manuscript. MCC, RS, MM, HL, RF, WM and JSA tested and validated the language with examples and made suggestions which led to its improvement. All authors read and approved the final manuscript.

Figure 7 Workflow for the creation of a hierarchical cluster of MLST profiles from strains collected in Portugal and in the UK. Two sources are chosen to perform the MLST profile assembly: an S3DB deployment, which holds the ITQB Staphylococcus database, and the SPARQL endpoint configured to host the Staphylococcus aureus MLST database. Although data in the MLST SPARQL endpoint is publicly available, to access data in S3DB the user needs to provide an authentication token and a user_id, as well as assemble the S3QL queries to retrieve the data [select(S|R172930) and select(S|R167271)], which may also be formulated as SPARQL. A data structure is assembled that can finally be analysed using hierarchical clustering methods such as the ones made available by Matlab. The complete MLST dataset used to obtain the graph is provided in additional file 2.


Competing interests

The authors declare that they have no competing interests.

Received: 14 April 2011. Accepted: 14 July 2011. Published: 14 July 2011.

References

1. Bell G, Hey T, Szalay A: Computer science. Beyond the data deluge. Science (New York, NY) 2009, 323:1297-8.
2. Chiang AP, Butte AJ: Data-driven methods to discover molecular determinants of serious adverse drug events. Clinical pharmacology and therapeutics 2009, 85:259-68.
3. The end of theory: the data deluge makes the scientific method obsolete. [http://www.wired.com/science/discoveries/magazine/16-07/pb_theory]
4. Hubbard T: The Ensembl genome database project. Nucleic Acids Research 2002, 30:38-41.
5. Karolchik D: The UCSC Genome Browser Database. Nucleic Acids Research 2003, 31:51-54.
6. Maglott D, Ostell J, Pruitt KD, Tatusova T: Entrez Gene: gene-centered information at NCBI. Nucleic acids research 2005, 33:D54-8.
7. Ashburner M, Ball CA, Blake JA, et al: Gene ontology: tool for the unification of biology. The Gene Ontology Consortium. Nature genetics 2000, 25:25-9.
8. Bizer C, Heath T, Berners-Lee T: Linked Data - The Story So Far. International Journal on Semantic Web and Information Systems (IJSWIS) 2009.
9. Linked Data | Linked Data - Connect Distributed Data across the Web. [http://linkeddata.org]
10. Linked data - Design issues. [http://www.w3.org/DesignIssues/LinkedData.html]
11. Vandervalk BP, McCarthy EL, Wilkinson MD: Moby and Moby 2: creatures of the deep (web). Briefings in bioinformatics 2009, 10:114-28.
12. Where the semantic web stumbled, linked data will succeed - O'Reilly Radar. [http://radar.oreilly.com/2010/11/semantic-web-linked-data.html]
13. Berners-Lee T, Weitzner DJ, Hall W, et al: A Framework for Web Science. Foundations and Trends® in Web Science 2006, 1:1-130.
14. Hendler J, Berners-Lee T: From the Semantic Web to social machines: A research challenge for AI on the World Wide Web. Artificial Intelligence 2010, 174:156-161.
15. Almeida JS, Chen C, Gorlitsky R, et al: Data integration gets "Sloppy". Nature biotechnology 2006, 24:1070-1.
16. Deus HF, Stanislaus R, Veiga DF, et al: A Semantic Web management model for integrative biomedical informatics. PloS one 2008, 3:e2946.
17. Putting the Web back in Semantic Web. [http://www.w3.org/2005/Talks/1110-iswc-tbl/#(1)]
18. SPARQL Query Language for RDF. [http://www.w3.org/TR/rdf-sparql-query]
19. Alexander K, Cyganiak R, Hausenblas M, Zhao J: Describing Linked Datasets - On the Design and Usage of voiD, the "Vocabulary Of Interlinked Datasets". Linked Data on the Web Workshop (LDOW 09), in conjunction with 18th International World Wide Web Conference (WWW 09) 2009.
20. Cheung KH, Frost HR, Marshall MS, et al: A journey to Semantic Web query federation in the life sciences. BMC bioinformatics 2009, 10(Suppl 1):S10.
21. A Prototype Knowledge Base for the Life Sciences. [http://www.w3.org/TR/hcls-kb]
22. Belleau F, Nolin M-A, Tourigny N, Rigault P, Morissette J: Bio2RDF: towards a mashup to build bioinformatics knowledge systems. Journal of biomedical informatics 2008, 41:706-16.
23. Smith B, Ashburner M, Rosse C, et al: The OBO Foundry: coordinated evolution of ontologies to support biomedical data integration. Nature biotechnology 2007, 25:1251-5.
24. Taylor CF, Field D, Sansone SA, et al: Promoting coherent minimum reporting guidelines for biological and biomedical investigations: the MIBBI project. Nature biotechnology 2008, 26:889-96.
25. Noy NF, Shah NH, Whetzel PL, et al: BioPortal: ontologies and integrated data resources at the click of a mouse. Nucleic acids research 2009, 37:W170-3.
26. Deus HF, Prud E, Zhao J, Marshall MS, Samwald M: Provenance of Microarray Experiments for a Better Understanding of Experiment Results. ISWC 2010 SWPM 2010.
27. Stein LD: Integrating biological databases. Nature reviews Genetics 2003, 4:337-45.
28. Goble C, Stevens R: State of the nation in data integration for bioinformatics. Journal of Biomedical Informatics 2008, 41:687-693.
29. Ludäscher B, Altintas I, Bowers S, et al: Scientific Process Automation and Workflow Management. In Scientific Data Management. Edited by Shoshani A, Rotem D. Chapman 2009.
30. Nelson B: Data sharing: Empty archives. Nature 2009, 461:160-3.
31. Stanislaus R, Chen C, Franklin J, Arthur J, Almeida JS: AGML Central: web based gel proteomic infrastructure. Bioinformatics (Oxford, England) 2005, 21:1754-7.
32. Silva S, Gouveia-Oliveira R, Maretzek A, et al: EURISWEB–Web-based epidemiological surveillance of antibiotic-resistant pneumococci in day care centers. BMC medical informatics and decision making 2003, 3:9.
33. Describing Linked Datasets with the VoiD Vocabulary. [http://www.w3.org/TR/2011/NOTE-void-20110303]
34. HIPAA Administrative Simplification Statute and Rules. [http://www.hhs.gov/ocr/privacy/hipaa/administrative/index.html]
35. Socially Aware Cloud Storage. [http://www.w3.org/DesignIssues/CloudStorage.html]
36. Koslow SH: Opinion: Sharing primary data: a threat or asset to discovery? Nature reviews Neuroscience 2002, 3:311-3.
37. Baggerly KA, Coombes KR: Deriving chemosensitivity from cell lines: Forensic bioinformatics and reproducible research in high-throughput biology. The Annals of Applied Statistics 2009, 3:1309-1334.
38. Hodge G: Systems of Knowledge Organization for Digital Libraries: Beyond Traditional Authority Files 2000.
39. SKOS Simple Knowledge Organization System Reference. [http://www.w3.org/TR/skos-reference]
40. Almeida JS, Deus HF, Maass W: S3DB core: a framework for RDF generation and management in bioinformatics infrastructures. BMC bioinformatics 2010, 11:387.
41. Deus HF, Veiga DF, Freire PR, et al: Exposing The Cancer Genome Atlas as a SPARQL endpoint. Journal of Biomedical Informatics 2010, 43:998-1008.
42. Correa MC, Deus HF, Vasconcelos AT, et al: AGUIA: autonomous graphical user interface assembly for clinical trials semantic data services. BMC medical informatics and decision making 2010, 10.
43. Freire P, Vilela M, Deus H, et al: Exploratory analysis of the copy number alterations in glioblastoma multiforme. PloS one 2008, 3:e4076.
44. NCBO Ontology Widgets. [http://www.bioontology.org/wiki/index.php/NCBO_Widgets]
45. Bussler C: Is Semantic Web Technology Taking the Wrong Turn? IEEE Internet Computing 2008, 12:75-79.
46. What people find hard about linked data. [http://dynamicorange.com/2010/11/15/what-people-find-hard-about-linked-data]
47. Raja A, Lakshmanan D: Domain Specific Languages. International Journal of Computer Applications 2010, 1:99-105.
48. SPARQL Update. [http://www.w3.org/TR/sparql11-update]
49. Carroll JJ, Bizer C, Hayes P, Stickler P: Named graphs, provenance and trust. Proceedings of the 14th international conference on World Wide Web - WWW 05 2005, 14:613.
50. S3DB operator function states. [http://code.google.com/p/s3db-operator]
51. S3DB Operators. [http://s3db-operator.googlecode.com/hg/propagation.html]
52. Deus HF, Sousa MA de, Carrico JA, Lencastre H de, Almeida JS: Adapting experimental ontologies for molecular epidemiology. AMIA Annual Symposium proceedings 2007, 935.
53. The OAuth 1.0 Protocol. [http://tools.ietf.org/html/rfc5849]
54. Francisco AP, Bugalho M, Ramirez M, Carriço JA: Global optimal eBURST analysis of multilocus typing data using a graphic matroid approach. BMC bioinformatics 2009, 10:152.
55. S3QL serialization engine. [http://js.s3db.googlecode.com/hg/translate/quickTranslate.html]
56. Ippolito G, Leone S, Lauria FN, Nicastri E, Wenzel RP: Methicillin-resistant Staphylococcus aureus: the superbug. International journal of infectious diseases 2010, 14:S7-S11.
57. Harris SR, Feil EJ, Holden MTG, et al: Evolution of MRSA during hospital transmission and intercontinental spread. Science (New York, NY) 2010, 327:469-74.
58. Linked Data API. [http://code.google.com/p/linked-data-api]

doi:10.1186/1471-2105-12-285
Cite this article as: Deus et al.: S3QL: A distributed domain specific language for controlled semantic integration of life sciences data. BMC Bioinformatics 2011, 12:285.



if the URL is cached, it will not be requested from the root deployment. To address the second challenge, each s3db:deployment can store any number of authentication services supporting HTTP, FTP or LDAP protocols. Once the user is authenticated, temporary surrogate tokens are issued with each query. When coupled with the user identifier in the format of a URI, these tokens effectively identify the user performing the query regardless of the S3DB deployment where the query is requested.

Screencasts illustrating processing time of data manipulation using S3QL are available at http://www.youtube.com/watch?v=2KZC6kI609s and http://www.youtube.com/watch?v=FJSYLCwBaPI.
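The caching behaviour of deployment URLs described above and in Figure 5 can be sketched as a small resolver. This is a minimal illustration under stated assumptions: the class name, the root_lookup callback (standing in for the query executed at the root deployment) and the injectable clock are hypothetical, and only the 24 h revalidation window comes from the Figure 5 caption.

```python
import time
from typing import Callable, Dict, Tuple

CACHE_TTL_SECONDS = 24 * 60 * 60  # Figure 5: revalidate after 24 h


class DeploymentResolver:
    """Resolve a deployment identifier (e.g. "D327") to its URL.

    root_lookup is a hypothetical callback representing the query sent to
    the root deployment; it is only called on a cache miss or when the
    cached entry is older than CACHE_TTL_SECONDS.
    """

    def __init__(self, root_lookup: Callable[[str], str],
                 clock: Callable[[], float] = time.time) -> None:
        self._root_lookup = root_lookup
        self._clock = clock
        self._cache: Dict[str, Tuple[str, float]] = {}

    def resolve(self, deployment_id: str) -> str:
        now = self._clock()
        hit = self._cache.get(deployment_id)
        if hit is not None and now - hit[1] < CACHE_TTL_SECONDS:
            return hit[0]  # fresh cache entry: no request to the root
        url = self._root_lookup(deployment_id)
        self._cache[deployment_id] = (url, now)
        return url


# Usage with a stand-in lookup table instead of a real root deployment.
known = {"D327": "http://example.org/d327"}
resolver = DeploymentResolver(known.__getitem__)
print(resolver.resolve("D327"))
```

Injecting the clock keeps the expiry logic testable without waiting for the cache window to pass.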

Figure 5 The global S3QL dereferencing system. User U78 of deployment D309 issues a command to request all entities of type Statement where the attribute Rule_id corresponds to the value D327R172930, through the S3QL operation select(Svalue|D327R172930) (1). If the URL corresponding to deployment D327 is not cached locally, or has not been validated in the past 24 h, a query is issued and executed at the root deployment to retrieve the corresponding URL, select(Durl|D327) (2,3). Once the URL is returned, query (1) is re-issued as select(Svalue|R172930) and executed at the URL for D327 (4). To validate the user, deployment D327 issues the command select(Uid|U76) at D309 using the key provided (5,6) and returns the data only if Uid matches the value for user_id (7).

Discussion

One of the major concerns in making use of Linked Data to improve health care and life sciences research is the need to ensure both the availability of contextual information about experimental datasets and the ability to protect the privacy of certain data elements which may identify an individual patient. Domain specific languages (DSLs) can ease the task of managing the contextual descriptors that would be necessary to implement permission control in RDF and, by doing so, could greatly accelerate the rate of adoption of Linked Data formalisms in the life sciences communities to improve scientific discovery. We have described S3QL, a DSL to perform read/write operations on entities of the S3DB Core Model. S3QL attempts to address the requirements in linking Life Sciences datasets, including both publishable and un-publishable data elements, by 1) including contextual descriptors for every submitted data element and 2) making use of those descriptors to ensure permission control managed by the data experts themselves. This avoids the need to break a consolidated dataset into its public and private parts when the results are acceptable for publication.

Applying S3QL to the S3DB Core Model in a prototypical application benefited from the definition of loosely defined boundaries for RDF data, which enabled propagation of permission while avoiding the need to document a relationship for each data instance individually and for each user. The assembly of SPARQL queries is also facilitated by the identification of domain triples using named graphs from the data itself [41] and can be illustrated in the application at [55], where a subset of S3QL can be readily converted into the W3C standard SPARQL. Although the prototypical implementation of S3QL fits the definition of an API for S3DB systems, it is immediately apparent that the same notation could be easily and intuitively extended to other KOSs' core models. For example, pointing the tool at http://q.s3db.org/translate to a JavaScript Object Notation (JSON) representation of the SKOS core model (skos.js) instead of the default S3DB core model (s3db.js) results in a valid S3QL syntax that could easily be applied as an API for SKOS based systems, as illustrated by applying the example query "select(C|prefLabel = animal)" using http://link.s3db.org/translate?core=skos.js&query=select(C|prefLabel=animal) to retrieve SKOS concepts labelled "animal". The progress of adoption for life sciences application developers can be further smoothed by complete reliance on the REST protocol for data exchange and the availability of widely used formats such as JSON, XML or RDF/turtle.

The applicability of S3QL to life sciences domains is illustrated here with three case studies: 1) in the domain of clinical trials, a project [42] that requires collaboration between departments with different interests; 2) in the domain of The Cancer Genome Atlas (TCGA) project, a multi-institutional effort that requires multiple authentication mechanisms and sharing of data among multiple institutions [41]; and 3) in the domain of molecular epidemiology, a project where non-public data stored in an S3DB deployment needs to be statistically integrated with data from a public repository. All described use cases shared one considerable requirement: the ability to include in the same dataset both published and unpublished results. As such, they required both the annotation of contextual descriptors of the data, enabled by S3QL, and the availability of controlled permission propagation, enabled by the S3DB model transition matrix. Future work in this effort may include the application of S3QL in a Knowledge Organization System based on SKOS terminology and the definition of a transition matrix for SKOS to enable controlled permission propagation.
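The request to the translation tool mentioned in the Discussion, with a core model and an S3QL query passed as parameters, could be assembled as in the sketch below. The parameter names core and query and the base address are read from the example URL in the text, whose punctuation was reconstructed, so they should be treated as assumptions; the helper name is hypothetical, and the code only builds the URL without contacting any service.

```python
from urllib.parse import parse_qs, urlencode, urlsplit

# Assumed base address, taken from the example URL in the text.
TRANSLATE_BASE = "http://link.s3db.org/translate"


def build_translate_url(query: str, core: str = "skos.js",
                        base: str = TRANSLATE_BASE) -> str:
    """Build a translate request for an S3QL query against a core model."""
    # urlencode percent-encodes characters such as "(", "|" and "=" so the
    # query survives as a single parameter value.
    return f"{base}?{urlencode({'core': core, 'query': query})}"


url = build_translate_url("select(C|prefLabel=animal)")
params = parse_qs(urlsplit(url).query)
print(url)
```

Switching the core argument to "s3db.js" would target the default S3DB core model named in the text instead of SKOS.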

Gastro-intestinal Clinical Trials Use CaseAs part of collaboration with the Department of Gastroin-testinal Medical Oncology at The University of Texas MD Anderson Cancer Center an S3DB deployment wasconfigured to host data from gastro-intestinal (GI) clinicaltrials A schema was developed using S3DB Collectionsand Rules and S3QL insert queries were used to submitdata elements as Items and Statements (see Figure 6) Twopermission propagation examples are illustrated one of arestrictive nature and the other permissive which can alsobe explored at httpqs3dborgquadratus (Figure 4) Inthis example the simple mechanisms of propagationdefined for S3QL support the complex social interaction

that requires a fraction of the dataset to be shared withcertain users but not with others Contextual usage istherefore a function of the attributes of the data itself (egits creator) and the user identification token that is sub-mitted with every readwrite operation

The Cancer Genome Atlas Use CaseThe cancer genome atlas (TCGA) is a pilot project tocharacterize several types of cancer by sequencing andgenetically characterizing tumours for over 500 patientsthroughout multiple institutions S3QL was used in thiscase study to produce an infrastructure that exposes thepublic portion of the TCGA datasets as a SPARQL end-point [41] This was possible because SPARQL is entailedby S3QL but not the opposite Specifically SPARQLqueries can be serialized to S3QL but the opposite is notalways possible particularly as regards write and accesscontrol operations The structure of the S3DB Core Modelwhich explicitly distinguishes domain from instantiationenables SPARQL query patterns such as Patient R390 cancerType to be readily serialized into its S3QL equiva-lent select(S|R390) Although this will not be furtherexplored in this discussion it is worth noting that theavailability of this serialization allows for an intuitive syn-tax of SPARQL queries by patterning them on the descrip-tion of the user-defined domain Rule such as ldquoPatienthasCancerType cancerTyperdquo

A Molecular Epidemiology Use CaseIn this example SPARQL was serialized to S3QL to sup-port a computational statistics application As a first stepan S3DB data store was deployed using S3QL to managemolecular epidemiology data related to strains of Staphylo-coccus aureus bacteria collected at the Instituto de Tecno-logia Quiacutemica e Bioloacutegica (ITQB) in Portugal Specificallythe ITQB Staphylococcus reference database was devisedwith a purpose of managing Multilocus Sequence Typing(MLST) data a typing method used to track the molecularepidemiology of infectious bacteria [54-57] As a secondstep we downloaded the public Staphylococcus aureusMLST profiles database at httpwwwmlstnet and madeit available through a SPARQL endpoint at httpqs3dborgmlst_sparqlendpointphp The process of integrationof MLST profiles from the ITQB Staphylococcus databasewith the publicly available MLST profiles is illustrated inFigure 7 In this example a federated SPARQL query isassembled to access both MLST sources data stored inthe S3DB deployment is retrieved by serializing theSPARQL query into S3QL and providing an authentica-tion token to identify the user as described in Figure 3The assembled graph resulting from the federatedSPARQL query (see additional file 2) can be importedinto a statistical computing environment such as Matlab(Mathworks Inc) Using this methodology it was possible

Deus et al BMC Bioinformatics 2011 12285httpwwwbiomedcentralcom1471-210512285

Page 11 of 15

to cluster strains from two different data sources with verydifferent authentication requirements The observationthat some Portuguese strains (PT1 PT2 PT15 and PT21)that are not publicly shared cluster together with a groupof public UK strains (UK17 UK16 UK11 UK14 UK15UK13 UK12) and therefore may share a common ancestoris an observation enabled by the data integrated throughS3QL

ConclusionsLife sciences applications are set to greatly benefit fromcoupling Semantic Web Linked Data standards and KOSsIn the current report we illustrate data models from lifesciences domains weaved using the S3DB knowledge orga-nization system In line with the requirements for theemergence of evolvable ldquosocial machinesrdquo different per-spectives on the data are made possible by a permissionpropagation mechanism controlled by contextual attri-butes of data elements such as its creator The operationof the S3DB KOS is mediated by the S3QL protocoldescribed in this report which exposes its ApplicationProgramming Interface for viewing inserting updating

and removing data elements Because S3QL is implemen-ted with a distributed architecture where URIs can bedereferenced into multiple S3DB deployments domainexperts can share data on their own deployments withusers of other systems without the need for localaccounts Therefore S3QLrsquos fine grained permission con-trol defined as instances of s3dboperators enables domainexperts to clearly specify the degree of permission that auser should have on a resource and how that permissionshould propagate in a distributed infrastructure This is incontrast to the conventional approach of delegating per-mission management to the point of access In the currentSPARQL specification extension various data sources canbe queried simultaneously or sequentially There is still noaccepted convention for tying a query pattern to anauthenticated user probably because SPARQL engineswould have no use for that information as most have beencreated in a context of Linked Open Data efforts The cri-tical limitation in applying this solution for Health Careand Life Sciences is the ability to make use of contextualinformation to determine both the level of trust on thedata and to enable controlled access to elements in a

giuser1r1r

giuser2r22

gimanagerere

GI Clinical Trials project

Patient collection

Demographic data collection

Tissue collection

gimanagererer

merge 1 (restrictive)

merge 2 (permissive)

Assign permission

Assign permission

Patient Name statements

Tissue analysis statements

merge ( ) =

) = merge (

merge 1

merge 2

Figure 6 Two use cases of permission propagation Two users are granted full permission to view GI Clinical Trials Project (yrdquo) however noneof them can add new data (nrdquo) nor edit existing data unless they were its original creator (srdquo) One of the users (giuser1) is prevented fromaccessing any data with demographic elements such as for example the names of the Patients In this case an uppercase ldquoNrdquo is assigned in theright to view the Collection DemographicData which will be merged with the inherited ldquoyrdquo to produce an effective permission level of ldquoNrdquo forthe right to view (merge 1) For the second user (giuser2) permission is granted to use the Collection TissueData indicating that the user cancreate new instances in that Collection (merge 2)

Deus et al BMC Bioinformatics 2011 12285httpwwwbiomedcentralcom1471-210512285

Page 12 of 15

dataset without breaking it and storing it in multiple sys-tems To address this requirement S3QL was fitted withdistributed control operational features that follow designcriteria found desirable for biomedical applications S3QLis not unique in its class for example the linked data APIwhich is being used by datagovuk is an alternative DSL tomanage linked data [58] However we believe that S3QLis closer to the technologies currently used by applicationdevelopers and therefore may provide a more suitablemiddle layer between linked data formalisms and applica-tion development It is argued that these features mayassist and anticipate future extensions of semantic webprovenance control formalisms

Additional material

Additional file 1 Walkthrough of S3DBrsquos permission propagationmerge migrate and percolate

Additional file 2 Matrix of MLST profiles in Portugal and the UnitedKingdom

Acknowledgements and fundingThis work was supported in part by the Center for Clinical and TranslationalSciences of the University of Alabama at Birmingham under contract no

5UL1 RR025777-03 from NIH National Center for Research Resources by theNational Cancer Institute grant 1U24CA143883-01 by the European UnionFP7 PNEUMOPATH project award and by the Portuguese Science andTechnology foundation under contracts PTDCEIA-EIA1052452008 andPTDCEEA-ACR695302006 HFD also thankfully acknowledges PhDfellowship from the same foundation award SFRHBD459632008 and theScience Foundation Ireland project Lion under Grant No SFI02CE1I131The authors would also like to thank three anonymous reviewers whosecomments and advice were extremely valuable for improving themanuscript

Author details
1 Digital Enterprise Research Institute, National University of Ireland at Galway, IDA Business Park, Lower Dangan, Galway, Ireland. 2 Biomathematics, Instituto de Tecnologia Química e Biológica, Universidade Nova de Lisboa, Av. da República, Estação Agronómica Nacional, 2780-157 Oeiras, Portugal. 3 Laboratório Nacional de Computação Científica, Av. Getúlio Vargas 333, Quitandinha, 25651-075 Petrópolis, Brasil. 4 Sanofi Pasteur, 38 Sidney Street, Cambridge, MA 02139, USA. 5 Laboratory of Molecular Genetics, Instituto de Tecnologia Química e Biológica, Universidade Nova de Lisboa, Av. da República, Estação Agronómica Nacional, 2780-157 Oeiras, Portugal. 6 Research Center for Intelligent Media, Furtwangen University, Furtwangen, Germany. 7 Laboratory of Microbiology, The Rockefeller University, 10021 New York, USA. 8 Division of Informatics, Department of Pathology, University of Alabama at Birmingham, 619 South 19th Street, Birmingham, Alabama, USA.

Authors’ contributions
HFD developed the S3QL domain specific language, implemented it in the S3DB prototype, validated it with examples and wrote the manuscript. MCC, RS, MM, HL, RF, WM and JSA tested and validated the language with examples and made suggestions which led to its improvement. All authors read and approved the final manuscript.

[Figure 7 diagram labels: Strains from Portugal; Strains from UK; SPARQL; S3DB system; Triple stores; S3QL + authentication; Statistical computing environment (e.g. Matlab, R)]

Figure 7 Workflow for the creation of a hierarchical cluster of MLST profiles from strains collected in Portugal and in the UK. Two sources are chosen to perform the MLST profile assembly: an S3DB deployment, which holds the ITQB Staphylococcus database, and the SPARQL endpoint configured to host the Staphylococcus aureus MLST database. Although data in the MLST SPARQL endpoint is publicly available, to access data in S3DB the user needs to provide an authentication token and a user_id, as well as assemble the S3QL queries to retrieve the data [select(S|R172930) and select(S|R167271)], which may also be formulated as SPARQL. A data structure is assembled that can finally be analysed using hierarchical clustering methods such as the ones made available by Matlab. The complete MLST dataset used to obtain the graph is provided in additional file 2.


Competing interests
The authors declare that they have no competing interests.

Received: 14 April 2011 Accepted: 14 July 2011 Published: 14 July 2011

References
1. Bell G, Hey T, Szalay A: Computer science. Beyond the data deluge. Science (New York, NY) 2009, 323:1297-8.
2. Chiang AP, Butte AJ: Data-driven methods to discover molecular determinants of serious adverse drug events. Clinical pharmacology and therapeutics 2009, 85:259-68.
3. The end of theory: the data deluge makes the scientific method obsolete. [http://www.wired.com/science/discoveries/magazine/16-07/pb_theory]
4. Hubbard T: The Ensembl genome database project. Nucleic Acids Research 2002, 30:38-41.
5. Karolchik D: The UCSC Genome Browser Database. Nucleic Acids Research 2003, 31:51-54.
6. Maglott D, Ostell J, Pruitt KD, Tatusova T: Entrez Gene: gene-centered information at NCBI. Nucleic acids research 2005, 33:D54-8.
7. Ashburner M, Ball CA, Blake JA, et al: Gene ontology: tool for the unification of biology. The Gene Ontology Consortium. Nature genetics 2000, 25:25-9.
8. Bizer C, Heath T, Berners-Lee T: Linked Data - The Story So Far. International Journal on Semantic Web and Information Systems (IJSWIS) 2009.
9. Linked Data | Linked Data - Connect Distributed Data across the Web. [http://linkeddata.org]
10. Linked data - Design issues. [http://www.w3.org/DesignIssues/LinkedData.html]
11. Vandervalk BP, McCarthy EL, Wilkinson MD: Moby and Moby 2: creatures of the deep (web). Briefings in bioinformatics 2009, 10:114-28.
12. Where the semantic web stumbled, linked data will succeed - O’Reilly Radar. [http://radar.oreilly.com/2010/11/semantic-web-linked-data.html]
13. Berners-Lee T, Weitzner DJ, Hall W, et al: A Framework for Web Science. Foundations and Trends® in Web Science 2006, 1:1-130.
14. Hendler J, Berners-Lee T: From the Semantic Web to social machines: A research challenge for AI on the World Wide Web. Artificial Intelligence 2010, 174:156-161.
15. Almeida JS, Chen C, Gorlitsky R, et al: Data integration gets “Sloppy”. Nature biotechnology 2006, 24:1070-1.
16. Deus HF, Stanislaus R, Veiga DF, et al: A Semantic Web management model for integrative biomedical informatics. PloS one 2008, 3:e2946.
17. Putting the Web back in Semantic Web. [http://www.w3.org/2005/Talks/1110-iswc-tbl/#(1)]
18. SPARQL Query Language for RDF. [http://www.w3.org/TR/rdf-sparql-query/]
19. Alexander K, Cyganiak R, Hausenblas M, Zhao J: Describing Linked Datasets: On the Design and Usage of voiD, the “Vocabulary Of Interlinked Datasets”. Linked Data on the Web Workshop (LDOW 09), in conjunction with 18th International World Wide Web Conference (WWW 09) 2009.
20. Cheung KH, Frost HR, Marshall MS, et al: A journey to Semantic Web query federation in the life sciences. BMC bioinformatics 2009, 10(Suppl 1):S10.
21. A Prototype Knowledge Base for the Life Sciences. [http://www.w3.org/TR/hcls-kb/]
22. Belleau F, Nolin M-A, Tourigny N, Rigault P, Morissette J: Bio2RDF: towards a mashup to build bioinformatics knowledge systems. Journal of biomedical informatics 2008, 41:706-16.
23. Smith B, Ashburner M, Rosse C, et al: The OBO Foundry: coordinated evolution of ontologies to support biomedical data integration. Nature biotechnology 2007, 25:1251-5.
24. Taylor CF, Field D, Sansone SA, et al: Promoting coherent minimum reporting guidelines for biological and biomedical investigations: the MIBBI project. Nature biotechnology 2008, 26:889-96.
25. Noy NF, Shah NH, Whetzel PL, et al: BioPortal: ontologies and integrated data resources at the click of a mouse. Nucleic acids research 2009, 37:W170-3.
26. Deus HF, Prud E, Zhao J, Marshall MS, Samwald M: Provenance of Microarray Experiments for a Better Understanding of Experiment Results. ISWC 2010 SWPM 2010.
27. Stein LD: Integrating biological databases. Nature reviews Genetics 2003, 4:337-45.
28. Goble C, Stevens R: State of the nation in data integration for bioinformatics. Journal of Biomedical Informatics 2008, 41:687-693.
29. Ludäscher B, Altintas I, Bowers S, et al: Scientific Process Automation and Workflow Management. In Scientific Data Management. Edited by: Shoshani A, Rotem D. Chapman; 2009.
30. Nelson B: Data sharing: Empty archives. Nature 2009, 461:160-3.
31. Stanislaus R, Chen C, Franklin J, Arthur J, Almeida JS: AGML Central: web based gel proteomic infrastructure. Bioinformatics (Oxford, England) 2005, 21:1754-7.
32. Silva S, Gouveia-Oliveira R, Maretzek A, et al: EURISWEB - Web-based epidemiological surveillance of antibiotic-resistant pneumococci in day care centers. BMC medical informatics and decision making 2003, 3:9.
33. Describing Linked Datasets with the VoiD Vocabulary. [http://www.w3.org/TR/2011/NOTE-void-20110303/]
34. HIPAA Administrative Simplification Statute and Rules. [http://www.hhs.gov/ocr/privacy/hipaa/administrative/index.html]
35. Socially Aware Cloud Storage. [http://www.w3.org/DesignIssues/CloudStorage.html]
36. Koslow SH: Opinion: Sharing primary data: a threat or asset to discovery? Nature reviews Neuroscience 2002, 3:311-3.
37. Baggerly KA, Coombes KR: Deriving chemosensitivity from cell lines: Forensic bioinformatics and reproducible research in high-throughput biology. The Annals of Applied Statistics 2009, 3:1309-1334.
38. Hodge G: Systems of Knowledge Organization for Digital Libraries: Beyond Traditional Authority Files. 2000.
39. SKOS Simple Knowledge Organization System Reference. [http://www.w3.org/TR/skos-reference/]
40. Almeida JS, Deus HF, Maass W: S3DB core: a framework for RDF generation and management in bioinformatics infrastructures. BMC bioinformatics 2010, 11:387.
41. Deus HF, Veiga DF, Freire PR, et al: Exposing The Cancer Genome Atlas as a SPARQL endpoint. Journal of Biomedical Informatics 2010, 43:998-1008.
42. Correa MC, Deus HF, Vasconcelos AT, et al: AGUIA: autonomous graphical user interface assembly for clinical trials semantic data services. BMC medical informatics and decision making 2010, 10.
43. Freire P, Vilela M, Deus H, et al: Exploratory analysis of the copy number alterations in glioblastoma multiforme. PloS one 2008, 3:e4076.
44. NCBO Ontology Widgets. [http://www.bioontology.org/wiki/index.php/NCBO_Widgets]
45. Bussler C: Is Semantic Web Technology Taking the Wrong Turn? IEEE Internet Computing 2008, 12:75-79.
46. What people find hard about linked data. [http://dynamicorange.com/2010/11/15/what-people-find-hard-about-linked-data]
47. Raja A, Lakshmanan D: Domain Specific Languages. International Journal of Computer Applications 2010, 1:99-105.
48. SPARQL Update. [http://www.w3.org/TR/sparql11-update/]
49. Carroll JJ, Bizer C, Hayes P, Stickler P: Named graphs, provenance and trust. Proceedings of the 14th international conference on World Wide Web - WWW ’05 2005, 14:613.
50. S3DB operator function states. [http://code.google.com/p/s3db-operator]
51. S3DB Operators. [http://s3db-operator.googlecode.com/hg/propagation.html]
52. Deus HF, Sousa MA de, Carrico JA, Lencastre H de, Almeida JS: Adapting experimental ontologies for molecular epidemiology. AMIA Annual Symposium proceedings 2007, 935.
53. The OAuth 1.0 Protocol. [http://tools.ietf.org/html/rfc5849]
54. Francisco AP, Bugalho M, Ramirez M, Carriço JA: Global optimal eBURST analysis of multilocus typing data using a graphic matroid approach. BMC bioinformatics 2009, 10:152.
55. S3QL serialization engine. [http://js.s3db.googlecode.com/hg/translate/quickTranslate.html]
56. Ippolito G, Leone S, Lauria FN, Nicastri E, Wenzel RP: Methicillin-resistant Staphylococcus aureus: the superbug. International journal of infectious diseases 2010, 14:S7-S11.
57. Harris SR, Feil EJ, Holden MTG, et al: Evolution of MRSA during hospital transmission and intercontinental spread. Science (New York, NY) 2010, 327:469-74.
58. Linked Data API. [http://code.google.com/p/linked-data-api]

doi:10.1186/1471-2105-12-285
Cite this article as: Deus et al.: S3QL: A distributed domain specific language for controlled semantic integration of life sciences data. BMC Bioinformatics 2011, 12:285.


  • Abstract
    • Background
    • Results
    • Conclusions
  • Background
    • 1.1 Linked Data Best Practices
    • 1.2 Challenges involved in Publishing Primary Experimental Life Sciences Datasets as Linked Data
    • 1.3 Knowledge organization systems for Linked Data
  • Methods
    • The S3DB Knowledge Organization Model
    • Operators for Permission Control
    • Components of a Distributed System
    • Availability and Documentation
  • Results
    • S3QL Syntax
    • S3QL Permission Control
    • Global Dereferencing for Distributed Queries
    • Implementation and benchmarking
  • Discussion
    • Gastro-intestinal Clinical Trials Use Case
    • The Cancer Genome Atlas Use Case
    • A Molecular Epidemiology Use Case
  • Conclusions
  • Acknowledgements and funding
  • Author details
  • Authors’ contributions
  • Competing interests
  • References

SPARQL. Although the prototypical implementation of S3QL fits the definition of an API for S3DB systems, it is immediately apparent that the same notation could be easily and intuitively extended to the core models of other KOSs. For example, pointing the tool at http://q.s3db.org/translate to a JavaScript Object Notation (JSON) representation of the SKOS core model (skos.js) instead of the default S3DB core model (s3db.js) results in a valid S3QL syntax that could easily be applied as an API for SKOS based systems. This is illustrated by applying the example query “select(C|prefLabel = animal)” using http://link.s3db.org/translate?core=skos.js&query=select(C|prefLabel=animal) to retrieve SKOS concepts labelled “animal”. Adoption by life sciences application developers can be further smoothed by complete reliance on the REST protocol for data exchange and the availability of widely used formats such as JSON, XML or RDF/turtle.

The applicability of S3QL to life sciences domains is illustrated here with three case studies: 1) in the domain of clinical trials, a project [42] that requires collaboration between departments with different interests; 2) in the domain of the cancer genome atlas (TCGA) project, a multi-institutional effort that requires multiple authentication mechanisms and sharing of data among multiple institutions [41]; and 3) in the domain of molecular epidemiology, a project where non-public data stored in an S3DB deployment needs to be statistically integrated with data from a public repository. All described use cases shared one considerable requirement: the ability to include in the same dataset both published and unpublished results. As such, they required both the annotation of contextual descriptors of the data, enabled by S3QL, and the availability of controlled permission propagation, enabled by the S3DB model transition matrix. Future work in this effort may include the application of S3QL in a Knowledge Organization System based on SKOS terminology and the definition of a transition matrix for SKOS to enable controlled permission propagation.
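The translate request described above for the SKOS core model can be sketched as a simple URL-building step. This is a minimal illustration, assuming only what the text quotes: the endpoint http://link.s3db.org/translate and the `core` and `query` parameters. The function name is hypothetical, and the service may no longer be reachable, so the sketch builds the URL without sending it.

```python
from urllib.parse import urlencode, urlsplit, parse_qs

# Endpoint quoted in the text; it may no longer be live.
TRANSLATE_ENDPOINT = "http://link.s3db.org/translate"


def build_translate_url(core: str, query: str,
                        endpoint: str = TRANSLATE_ENDPOINT) -> str:
    """Return a GET URL asking the translator to read `query` against `core`.

    `core` names the JSON core model (e.g. "skos.js" or "s3db.js") and
    `query` is the S3QL expression; both parameter names come from the text.
    """
    # urlencode percent-escapes characters such as '|', '(' and '='
    # that appear inside an S3QL expression.
    return endpoint + "?" + urlencode({"core": core, "query": query})


url = build_translate_url("skos.js", "select(C|prefLabel=animal)")
print(url)

# Decoding the query string recovers the original S3QL expression.
print(parse_qs(urlsplit(url).query)["query"][0])
```

Percent-encoding the query is a design choice of this sketch; the URL printed in the article shows the S3QL expression unescaped.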

Gastro-intestinal Clinical Trials Use Case
As part of a collaboration with the Department of Gastrointestinal Medical Oncology at The University of Texas M.D. Anderson Cancer Center, an S3DB deployment was configured to host data from gastro-intestinal (GI) clinical trials. A schema was developed using S3DB Collections and Rules, and S3QL insert queries were used to submit data elements as Items and Statements (see Figure 6). Two permission propagation examples are illustrated, one of a restrictive nature and the other permissive, which can also be explored at http://q.s3db.org/quadratus (Figure 4). In this example, the simple mechanisms of propagation defined for S3QL support the complex social interaction that requires a fraction of the dataset to be shared with certain users but not with others. Contextual usage is therefore a function of the attributes of the data itself (e.g. its creator) and the user identification token that is submitted with every read/write operation.
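The two merges described for Figure 6 can be illustrated with a small sketch. It is not the S3DB implementation: it assumes that a permission is a three-character string ordered as (view, edit, add), that "-" means nothing is assigned at that position, and that an uppercase state cannot be overridden once it is in effect. Only the outcome of merge 1 (inherited "y" merged with an assigned "N" gives "N") is taken from the text; the remaining rules and the function names are assumptions made for illustration.

```python
def merge_state(inherited: str, assigned: str) -> str:
    """Merge one inherited permission state with one directly assigned state."""
    if inherited.isupper():
        # Assumption: an uppercase (dominant) state already in effect wins.
        return inherited
    if assigned == "-":
        # Assumption: "-" is a placeholder meaning "nothing assigned here".
        return inherited
    # A direct assignment replaces a lowercase inherited state.
    return assigned


def merge_permission(inherited: str, assigned: str) -> str:
    """Merge two permissions position by position (view, edit, add)."""
    if len(inherited) != 3 or len(assigned) != 3:
        raise ValueError("permissions are expected to have three states")
    return "".join(merge_state(i, a) for i, a in zip(inherited, assigned))


# Project-level permission from the caption: view "y", edit "s", add "n".
project = "ysn"

# merge 1 (restrictive): giuser1 is denied the right to view DemographicData.
print(merge_permission(project, "N--"))  # Nsn

# merge 2 (permissive): giuser2 may create instances in TissueData.
print(merge_permission(project, "--y"))  # ysy
```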

The Cancer Genome Atlas Use Case
The cancer genome atlas (TCGA) is a pilot project to characterize several types of cancer by sequencing and genetically characterizing tumours for over 500 patients throughout multiple institutions. S3QL was used in this case study to produce an infrastructure that exposes the public portion of the TCGA datasets as a SPARQL endpoint [41]. This was possible because SPARQL is entailed by S3QL, but not the opposite. Specifically, SPARQL queries can be serialized to S3QL, but the reverse is not always possible, particularly as regards write and access control operations. The structure of the S3DB Core Model, which explicitly distinguishes domain from instantiation, enables SPARQL query patterns such as “?Patient :R390 ?cancerType” to be readily serialized into the S3QL equivalent “select(S|R390)”. Although this will not be further explored in this discussion, it is worth noting that the availability of this serialization allows for an intuitive syntax of SPARQL queries by patterning them on the description of the user-defined domain Rule, such as “?Patient :hasCancerType ?cancerType”.
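The serialization of a single triple pattern into S3QL, as in the R390 example above, can be sketched as follows. This is a toy illustration rather than the S3QL serialization engine itself: it handles only one pattern whose predicate is a Rule identifier of the form R followed by digits, and the function name and the regular expression are assumptions of this sketch.

```python
import re

# One triple pattern of the form "?subject :R<digits> ?object",
# i.e. a Rule identifier used as the predicate between two variables.
_RULE_PATTERN = re.compile(r"^\s*\?(\w+)\s+:?(R\d+)\s+\?(\w+)\s*\.?\s*$")


def triple_pattern_to_s3ql(pattern: str) -> str:
    """Serialize one SPARQL triple pattern with a Rule predicate into S3QL."""
    match = _RULE_PATTERN.match(pattern)
    if match is None:
        raise ValueError(f"unsupported pattern: {pattern!r}")
    rule_id = match.group(2)
    # The text pairs a pattern on Rule R390 with select(S|R390):
    # the Statements (S) retrieved are those attached to that Rule.
    return f"select(S|{rule_id})"


print(triple_pattern_to_s3ql("?Patient :R390 ?cancerType"))  # select(S|R390)
```

Patterns that cannot be mapped this way raise `ValueError`, which keeps the sketch honest about how little of SPARQL it covers.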

A Molecular Epidemiology Use Case
In this example, SPARQL was serialized to S3QL to support a computational statistics application. As a first step, an S3DB data store was deployed using S3QL to manage molecular epidemiology data related to strains of Staphylococcus aureus bacteria collected at the Instituto de Tecnologia Química e Biológica (ITQB) in Portugal. Specifically, the ITQB Staphylococcus reference database was devised with the purpose of managing Multilocus Sequence Typing (MLST) data, a typing method used to track the molecular epidemiology of infectious bacteria [54-57]. As a second step, we downloaded the public Staphylococcus aureus MLST profiles database at http://www.mlst.net and made it available through a SPARQL endpoint at http://q.s3db.org/mlst_sparql/endpoint.php. The process of integrating MLST profiles from the ITQB Staphylococcus database with the publicly available MLST profiles is illustrated in Figure 7. In this example, a federated SPARQL query is assembled to access both MLST sources; data stored in the S3DB deployment is retrieved by serializing the SPARQL query into S3QL and providing an authentication token to identify the user, as described in Figure 3. The assembled graph resulting from the federated SPARQL query (see additional file 2) can be imported into a statistical computing environment such as Matlab (Mathworks Inc). Using this methodology, it was possible to cluster strains from two different data sources with very different authentication requirements. The data integrated through S3QL enabled the observation that some Portuguese strains that are not publicly shared (PT1, PT2, PT15 and PT21) cluster together with a group of public UK strains (UK17, UK16, UK11, UK14, UK15, UK13, UK12) and therefore may share a common ancestor.
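The final clustering step of this workflow can be sketched in a few lines. The article used the hierarchical clustering methods available in Matlab; the sketch below swaps in a plain-Python single-linkage clustering over Hamming distances between allelic profiles, purely to show the shape of the computation. The strain names and the seven-locus profiles are invented for illustration and are not taken from additional file 2.

```python
from itertools import combinations


def hamming(a, b):
    """Number of loci at which two allelic profiles differ."""
    return sum(x != y for x, y in zip(a, b))


def single_linkage(profiles):
    """Agglomerative single-linkage clustering of named profiles.

    Returns the merges in order as (members_a, members_b, distance).
    """
    clusters = [frozenset([name]) for name in sorted(profiles)]
    merges = []

    def dist(pair):
        a, b = pair
        # Single linkage: distance between the closest pair of members.
        return min(hamming(profiles[x], profiles[y]) for x in a for y in b)

    while len(clusters) > 1:
        a, b = min(combinations(clusters, 2), key=dist)
        merges.append((sorted(a), sorted(b), dist((a, b))))
        clusters = [c for c in clusters if c not in (a, b)] + [a | b]
    return merges


# Hypothetical profiles: one allele number per locus, seven loci each.
profiles = {
    "PT_a": (1, 4, 1, 4, 12, 1, 10),   # stands in for a non-public strain
    "UK_a": (1, 4, 1, 4, 12, 1, 10),   # stands in for a public strain
    "PT_b": (3, 3, 1, 1, 4, 4, 3),
    "UK_b": (3, 3, 1, 1, 4, 4, 16),
}

for members_a, members_b, distance in single_linkage(profiles):
    print(members_a, "+", members_b, "at distance", distance)
```

In this toy input the private and public strains with identical profiles merge first at distance 0, which mirrors the kind of cross-source grouping reported in the text.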

Conclusions
Life sciences applications are set to greatly benefit from coupling Semantic Web Linked Data standards and KOSs. In the current report we illustrate data models from life sciences domains woven using the S3DB knowledge organization system. In line with the requirements for the emergence of evolvable “social machines”, different perspectives on the data are made possible by a permission propagation mechanism controlled by contextual attributes of data elements such as their creator. The operation of the S3DB KOS is mediated by the S3QL protocol described in this report, which exposes its Application Programming Interface for viewing, inserting, updating and removing data elements. Because S3QL is implemented with a distributed architecture where URIs can be dereferenced into multiple S3DB deployments, domain experts can share data on their own deployments with users of other systems without the need for local accounts. Therefore S3QL’s fine grained permission control, defined as instances of s3db:operators, enables domain experts to clearly specify the degree of permission that a user should have on a resource and how that permission should propagate in a distributed infrastructure. This is in contrast to the conventional approach of delegating permission management to the point of access. In the current SPARQL specification extension, various data sources can be queried simultaneously or sequentially. There is still no accepted convention for tying a query pattern to an authenticated user, probably because SPARQL engines would have no use for that information, as most have been created in the context of Linked Open Data efforts. The critical limitation in applying this solution to Health Care and Life Sciences is the lack of a way to use contextual information both to determine the level of trust in the data and to enable controlled access to elements in a dataset without breaking it up and storing it in multiple systems. To address this requirement, S3QL was fitted with distributed control operational features that follow design criteria found desirable for biomedical applications. S3QL is not unique in its class; for example, the Linked Data API, which is being used by data.gov.uk, is an alternative DSL to manage linked data [58]. However, we believe that S3QL is closer to the technologies currently used by application developers and therefore may provide a more suitable middle layer between linked data formalisms and application development. It is argued that these features may assist and anticipate future extensions of semantic web provenance control formalisms.

[Figure 6 diagram labels: users giuser1, giuser2 and gimanager; GI Clinical Trials project; Patient collection; Demographic data collection; Tissue collection; Patient Name statements; Tissue analysis statements; Assign permission; merge 1 (restrictive); merge 2 (permissive)]

Additional material

Additional file 1 Walkthrough of S3DBrsquos permission propagationmerge migrate and percolate

Additional file 2 Matrix of MLST profiles in Portugal and the UnitedKingdom

Acknowledgements and fundingThis work was supported in part by the Center for Clinical and TranslationalSciences of the University of Alabama at Birmingham under contract no

5UL1 RR025777-03 from NIH National Center for Research Resources by theNational Cancer Institute grant 1U24CA143883-01 by the European UnionFP7 PNEUMOPATH project award and by the Portuguese Science andTechnology foundation under contracts PTDCEIA-EIA1052452008 andPTDCEEA-ACR695302006 HFD also thankfully acknowledges PhDfellowship from the same foundation award SFRHBD459632008 and theScience Foundation Ireland project Lion under Grant No SFI02CE1I131The authors would also like to thank three anonymous reviewers whosecomments and advice were extremely valuable for improving themanuscript

Author details1Digital Enterprise Research Institute National University of Ireland at GalwayIDA Business Park Lower Dangan Galway Ireland 2Biomathematics Institutode Tecnologia Quiacutemica e Bioloacutegica Universidade Nova de Lisboa Av daRepuacuteblica Estaccedilatildeo Agronoacutemica Nacional 2780-157 Oeiras Portugal3Laboratoacuterio Nacional de Computaccedilatildeo Ciecircntifica Av Getuacutelio Vargas 333Quitandinha 25651-075 Petroacutepolis Brasil 4Sanofi Pasteur 38 Sidney StreetCambridge MA 02139 USA 5Laboratory of Molecular Genetics Instituto deTecnologia Quiacutemica e Bioloacutegica Universidade Nova de Lisboa Av daRepuacuteblica Estaccedilatildeo Agronoacutemica Nacional 2780-157 Oeiras Portugal6Research Center for Intelligent Media Furtwangen University FurtwangenGermany 7Laboratory of Microbiology The Rockefeller University 10021 NewYork USA 8Division of Informatics Department of Pathology University ofAlabama at Birmingham 619 South 19th Street Birmingham Alamaba USA

Authorsrsquo contributionsHFD developed the S3QL domain specific language implemented it in theS3DB prototype validated it with examples and wrote the manuscript MCCRS MM HL RF WM and JSA tested and validated the language withexamples and made suggestions which lead to its improvement All authorsread and approved the final manuscript

Strains from Portugal

Strains from UK

SPARQL

S3DB system Triple stores

S3QL + authentication

Statistical computing environment (eg Matlab R)

S3QL +Lauthent

Figure 7 Workflow for the creation of a hierarchical cluster of MLST profiles from strains collected in Portugal and in the UK Twosources are chosen to perform the MLST profile assembly - an S3DB deployment which holds the ITQB Staphylococcus database and theSPARQL endpoint configured to host the Staphylococcus aureus MLST database Although data in the MLST SPARQL endpoint is publiclyavailable to access data in S3DB the user needs to provide an authentication token and a user_id as well as assemble the S3QL queries toretrieve the data [select(S|R172930) and select(S|R167271)] which may also be formulated as SPARQL A data structure is assembled that can finallybe analysed using hierarchical clustering methods such as the ones made available by Matlab The complete MLST dataset used to obtain thegraph is provided in additional file 2

Deus et al BMC Bioinformatics 2011 12285httpwwwbiomedcentralcom1471-210512285

Page 13 of 15

Competing interestsThe authors declare that they have no competing interests

Received 14 April 2011 Accepted 14 July 2011 Published 14 July 2011

References1 Bell G Hey T Szalay A Computer science Beyond the data deluge

Science (New York NY) 2009 3231297-82 Chiang AP Butte AJ Data-driven methods to discover molecular

determinants of serious adverse drug events Clinical pharmacology andtherapeutics 2009 85259-68

3 The end of theory the data deluge makes the scientific methodobsolete [httpwwwwiredcomsciencediscoveriesmagazine16-07pb_theory]

4 Hubbard T The Ensembl genome database project Nucleic Acids Research2002 3038-41

5 Karolchik D The UCSC Genome Browser Database Nucleic Acids Research2003 3151-54

6 Maglott D Ostell J Pruitt KD Tatusova T Entrez Gene gene-centeredinformation at NCBI Nucleic acids research 2005 33D54-8

7 Ashburner M Ball CA Blake JA et al Gene ontology tool for theunification of biology The Gene Ontology Consortium Nature genetics2000 2525-9

8 Bizer C Heath T Berners-Lee T Linked Data - The Story So FarInternational Journal on Semantic Web and Information Systems (IJSWIS)2009

9 Linked Data | Linked Data - Connect Distributed Data across the Web[httplinkeddataorg]

10 Linked data - Design issues [httpwwww3orgDesignIssuesLinkedDatahtml]

11 Vandervalk BP McCarthy EL Wilkinson MD Moby and Moby 2 creaturesof the deep (web) Briefings in bioinformatics 2009 10114-28

12 Where the semantic web stumbled linked data will succeed - OrsquoReillyRadar [httpradaroreillycom201011semantic-web-linked-datahtml]

13 Berners-Lee T Weitzner DJ Hall W et al A Framework for Web ScienceFoundations and Trendsreg in Web Science 2006 11-130

14 Hendler J Berners-Lee T From the Semantic Web to social machines Aresearch challenge for AI on the World Wide Web Artificial Intelligence2010 174156-161

15 Almeida JS Chen C Gorlitsky R et al Data integration gets ldquoSloppyrdquoNature biotechnology 2006 241070-1

16 Deus HF Stanislaus R Veiga DF et al A Semantic Web managementmodel for integrative biomedical informatics PloS one 2008 3e2946

17 Putting the Web back in Semantic Web [httpwwww3org2005Talks1110-iswc-tbl(1)]

18 SPARQL Query Language for RDF [httpwwww3orgTRrdf-sparql-query]

19 Alexander K Cyganiak R Hausenblas M Zhao J Describing LinkedDatasets On the Design and Usage of voiD the ldquo Vocabulary OfInterlinked Datasetsrdquo Linked Data on the Web Workshop (LDOW 09) inconjunction with 18th International World Wide Web Conference (WWW 09)2009

20 Cheung KH Frost HR Marshall MS et al A journey to Semantic Webquery federation in the life sciences BMC bioinformatics 2009 10(Suppl1)S10

21 A Prototype Knowledge Base for the Life Sciences [httpwwww3orgTRhcls-kb]

22 Belleau F Nolin M-A Tourigny N Rigault P Morissette J Bio2RDF towardsa mashup to build bioinformatics knowledge systems Journal ofbiomedical informatics 2008 41706-16

23 Smith B Ashburner M Rosse C et al The OBO Foundry coordinatedevolution of ontologies to support biomedical data integration Naturebiotechnology 2007 251251-5

24 Taylor CF Field D Sansone SA et al Promoting coherent minimumreporting guidelines for biological and biomedical investigations theMIBBI project Nature biotechnology 2008 26889-96

25 Noy NF Shah NH Whetzel PL et al BioPortal ontologies and integrateddata resources at the click of a mouse Nucleic acids research 2009 37W170-3

26 Deus HF Prud E Zhao J Marshall MS Samwald M Provenance ofMicroarray Experiments for a Better Understanding of ExperimentResults ISWC 2010 SWPM 2010

27 Stein LD Integrating biological databases Nature reviews Genetics 20034337-45

28 Goble C Stevens R State of the nation in data integration forbioinformatics Journal of Biomedical Informatics 2008 41687-693

29 Ludaumlscher B Altintas I Bowers S et al Scientific Process Automation andWorkflow Management In Scientific Data Management Edited byShoshani A Rotem D Chapman 2009

30 Nelson B Data sharing Empty archives Nature 2009 461160-331 Stanislaus R Chen C Franklin J Arthur J Almeida JS AGML Central web

based gel proteomic infrastructure Bioinformatics (Oxford England) 2005211754-7

32 Silva S Gouveia-Oliveira R Maretzek A et al EURISWEBndashWeb-basedepidemiological surveillance of antibiotic-resistant pneumococci in daycare centers BMC medical informatics and decision making 2003 39

33 Describing Linked Datasets with the VoiD Vocabulary [httpwwww3orgTR2011NOTE-void-20110303]

34 HIPAA Administrative Simplification Statute and Rules [httpwwwhhsgovocrprivacyhipaaadministrativeindexhtml]

35 Socially Aware Cloud Storage [httpwwww3orgDesignIssuesCloudStoragehtml]

36 Koslow SH Opinion Sharing primary data a threat or asset to discoveryNature reviews Neuroscience 2002 3311-3

37 Baggerly KA Coombes KR Deriving chemosensitivity from cell linesForensic bioinformatics and reproducible research in high-throughputbiology The Annals of Applied Statistics 2009 31309-1334


Deus et al. BMC Bioinformatics 2011, 12:285
http://www.biomedcentral.com/1471-2105/12/285


to cluster strains from two different data sources with very different authentication requirements. The observation that some Portuguese strains (PT1, PT2, PT15 and PT21) that are not publicly shared cluster together with a group of public UK strains (UK17, UK16, UK11, UK14, UK15, UK13, UK12), and therefore may share a common ancestor, is an observation enabled by the data integrated through S3QL.
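As a rough illustration of how profiles from a restricted source and a public source might be pooled and clustered, the following minimal Python sketch merges two dictionaries of allelic profiles and groups strains by single linkage on Hamming distance, which corresponds to cutting a single-linkage hierarchy at a fixed height. The strain labels and profiles are hypothetical toy values, not the data in Additional file 2, and the function names are assumptions made for this example; the published analysis used the clustering methods of a statistical computing environment.

```python
from itertools import combinations

def hamming(a, b):
    """Number of loci at which two allelic profiles differ."""
    return sum(x != y for x, y in zip(a, b))

def single_linkage(profiles, threshold):
    """Group strains connected by chains of pairwise distance <= threshold."""
    clusters = [{name} for name in profiles]
    for a, b in combinations(profiles, 2):
        if hamming(profiles[a], profiles[b]) <= threshold:
            ca = next(c for c in clusters if a in c)
            cb = next(c for c in clusters if b in c)
            if ca is not cb:
                clusters.remove(cb)  # clusters are disjoint, so only cb matches
                ca |= cb             # merge cb into ca in place
    return clusters

# Hypothetical seven-locus profiles; labels and values are illustrative only.
restricted_source = {"P1": (1, 4, 1, 4, 12, 1, 10), "P2": (1, 4, 1, 4, 12, 1, 28)}
public_source = {"U1": (1, 4, 1, 4, 12, 1, 10), "U2": (3, 3, 1, 1, 4, 4, 3)}

# Pool both sources into one data structure before clustering.
pooled = {**restricted_source, **public_source}
clusters = single_linkage(pooled, threshold=1)
```

With these toy values, the restricted strains P1 and P2 fall into the same group as the public strain U1, which is the kind of cross-source grouping the paragraph above describes.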

Conclusions

Life sciences applications are set to greatly benefit from coupling Semantic Web Linked Data standards and KOSs. In the current report, we illustrate data models from life sciences domains woven using the S3DB knowledge organization system. In line with the requirements for the emergence of evolvable “social machines”, different perspectives on the data are made possible by a permission propagation mechanism controlled by contextual attributes of data elements, such as their creator. The operation of the S3DB KOS is mediated by the S3QL protocol described in this report, which exposes its Application Programming Interface for viewing, inserting, updating and removing data elements. Because S3QL is implemented with a distributed architecture where URIs can be dereferenced into multiple S3DB deployments, domain experts can share data on their own deployments with users of other systems without the need for local accounts. Therefore, S3QL’s fine grained permission control, defined as instances of s3db:operators, enables domain experts to clearly specify the degree of permission that a user should have on a resource and how that permission should propagate in a distributed infrastructure. This is in contrast to the conventional approach of delegating permission management to the point of access.

In the current SPARQL specification extension, various data sources can be queried simultaneously or sequentially. There is still no accepted convention for tying a query pattern to an authenticated user, probably because SPARQL engines would have no use for that information, as most have been created in a context of Linked Open Data efforts. The critical limitation in applying this solution for Health Care and Life Sciences is the ability to make use of contextual information both to determine the level of trust in the data and to enable controlled access to elements in a dataset without breaking it and storing it in multiple systems. To address this requirement, S3QL was fitted with distributed control operational features that follow design criteria found desirable for biomedical applications. S3QL is not unique in its class; for example, the linked data API, which is being used by data.gov.uk, is an alternative DSL to manage linked data [58]. However, we believe that S3QL is closer to the technologies currently used by application developers and therefore may provide a more suitable middle layer between linked data formalisms and application development. It is argued that these features may assist and anticipate future extensions of semantic web provenance control formalisms.

[Figure 6 diagram: the GI Clinical Trials project with its Patient, Demographic data and Tissue collections, the users giuser1, giuser2 and gimanager, Patient Name and Tissue analysis statements, and two permission assignments resolved as merge 1 (restrictive) and merge 2 (permissive).]

Figure 6 Two use cases of permission propagation. Two users are granted full permission to view GI Clinical Trials Project (“y”); however, none of them can add new data (“n”) nor edit existing data unless they were its original creator (“s”). One of the users (giuser1) is prevented from accessing any data with demographic elements such as, for example, the names of the Patients. In this case, an uppercase “N” is assigned in the right to view the Collection DemographicData, which will be merged with the inherited “y” to produce an effective permission level of “N” for the right to view (merge 1). For the second user (giuser2), permission is granted to use the Collection TissueData, indicating that the user can create new instances in that Collection (merge 2).
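The two merges described in the Figure 6 caption can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the three-character encoding in the order (view, edit, use), the "-" placeholder for "nothing assigned", and the rule that an uppercase inherited state is not overridden by a lowercase assignment are assumptions made for this example, and the function names are hypothetical. The authoritative merge, migrate and percolate rules are those documented for the S3DB operators [50,51] and in Additional file 1.

```python
def merge_state(inherited, assigned):
    """Merge one right's inherited state with a directly assigned state."""
    if assigned == "-":
        return inherited   # nothing assigned: keep the inherited state
    if inherited.isupper() and not assigned.isupper():
        return inherited   # assumption: a dominant (uppercase) inherited state wins
    return assigned        # otherwise the direct assignment takes effect

def merge_permission(inherited, assigned):
    """Merge permission strings position by position (view, edit, use)."""
    return "".join(merge_state(i, a) for i, a in zip(inherited, assigned))

# Inherited from the project: view "y", edit "s" (own data only), use "n".
project_level = "ysn"

# merge 1 (restrictive): giuser1 is assigned "N" on the right to view.
giuser1_demographics = merge_permission(project_level, "N--")

# merge 2 (permissive): giuser2 is assigned "y" on the right to use.
giuser2_tissue = merge_permission(project_level, "--y")
```

Under these assumptions giuser1 ends with an effective "N" on the right to view the demographic Collection, while giuser2 gains the right to create instances in the tissue Collection and keeps the inherited view and edit states.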

Additional material

Additional file 1: Walkthrough of S3DB’s permission propagation: merge, migrate and percolate.

Additional file 2: Matrix of MLST profiles in Portugal and the United Kingdom.

Acknowledgements and funding

This work was supported in part by the Center for Clinical and Translational Sciences of the University of Alabama at Birmingham under contract no. 5UL1 RR025777-03 from NIH National Center for Research Resources, by the National Cancer Institute grant 1U24CA143883-01, by the European Union FP7 PNEUMOPATH project award and by the Portuguese Science and Technology foundation under contracts PTDC/EIA-EIA/105245/2008 and PTDC/EEA-ACR/69530/2006. HFD also thankfully acknowledges PhD fellowship from the same foundation, award SFRH/BD/45963/2008, and the Science Foundation Ireland project Lion under Grant No. SFI/02/CE1/I131. The authors would also like to thank three anonymous reviewers whose comments and advice were extremely valuable for improving the manuscript.

Author details

1 Digital Enterprise Research Institute, National University of Ireland at Galway, IDA Business Park, Lower Dangan, Galway, Ireland.
2 Biomathematics, Instituto de Tecnologia Química e Biológica, Universidade Nova de Lisboa, Av. da República, Estação Agronómica Nacional, 2780-157 Oeiras, Portugal.
3 Laboratório Nacional de Computação Científica, Av. Getúlio Vargas 333, Quitandinha, 25651-075 Petrópolis, Brasil.
4 Sanofi Pasteur, 38 Sidney Street, Cambridge, MA 02139, USA.
5 Laboratory of Molecular Genetics, Instituto de Tecnologia Química e Biológica, Universidade Nova de Lisboa, Av. da República, Estação Agronómica Nacional, 2780-157 Oeiras, Portugal.
6 Research Center for Intelligent Media, Furtwangen University, Furtwangen, Germany.
7 Laboratory of Microbiology, The Rockefeller University, 10021 New York, USA.
8 Division of Informatics, Department of Pathology, University of Alabama at Birmingham, 619 South 19th Street, Birmingham, Alabama, USA.

Authors’ contributions

HFD developed the S3QL domain specific language, implemented it in the S3DB prototype, validated it with examples and wrote the manuscript. MCC, RS, MM, HL, RF, WM and JSA tested and validated the language with examples and made suggestions which led to its improvement. All authors read and approved the final manuscript.

[Figure 7 diagram labels: Strains from Portugal; Strains from UK; S3DB system; Triple stores; S3QL + authentication; SPARQL; Statistical computing environment (e.g. Matlab, R).]

Figure 7 Workflow for the creation of a hierarchical cluster of MLST profiles from strains collected in Portugal and in the UK. Two sources are chosen to perform the MLST profile assembly: an S3DB deployment, which holds the ITQB Staphylococcus database, and the SPARQL endpoint configured to host the Staphylococcus aureus MLST database. Although data in the MLST SPARQL endpoint is publicly available, to access data in S3DB the user needs to provide an authentication token and a user_id, as well as assemble the S3QL queries to retrieve the data [select(S|R172930) and select(S|R167271)], which may also be formulated as SPARQL. A data structure is assembled that can finally be analysed using hierarchical clustering methods such as the ones made available by Matlab. The complete MLST dataset used to obtain the graph is provided in additional file 2.
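To make the compact query notation in the Figure 7 caption more concrete, the following Python sketch splits an expression such as select(S|R172930) into its parts. It is a reading aid under assumptions, not part of any S3DB client library: the function name, the returned dictionary keys and the mapping of single letters to S3DB entity names (for example S for statement and R for rule) are assumptions made for this example.

```python
import re

# Assumed mapping of one-letter prefixes to S3DB core model entities.
ENTITY_NAMES = {
    "D": "deployment", "P": "project", "C": "collection", "R": "rule",
    "I": "item", "S": "statement", "U": "user",
}

# action(target|scope-letter followed by a numeric identifier)
_COMPACT = re.compile(r"^([a-z]+)\(([DPCRISU])\|([DPCRISU])(\d+)\)$")

def parse_compact(query):
    """Split a compact S3QL expression into action, target, scope and id."""
    match = _COMPACT.match(query.strip())
    if match is None:
        raise ValueError(f"not a compact S3QL expression: {query!r}")
    action, target, scope, ident = match.groups()
    return {
        "action": action,
        "target": ENTITY_NAMES[target],
        "scope": ENTITY_NAMES[scope],
        "id": int(ident),
    }

# The two queries quoted in the Figure 7 caption.
parsed = [parse_compact(q) for q in ("select(S|R172930)", "select(S|R167271)")]
```

Read this way, both queries request the statements attached to a given rule; the authentication token and user_id mentioned in the caption would be supplied separately with the request.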


Competing interests

The authors declare that they have no competing interests.

Received: 14 April 2011. Accepted: 14 July 2011. Published: 14 July 2011.

References

1. Bell G, Hey T, Szalay A: Computer science. Beyond the data deluge. Science (New York, NY) 2009, 323:1297-8.

2. Chiang AP, Butte AJ: Data-driven methods to discover molecular determinants of serious adverse drug events. Clinical pharmacology and therapeutics 2009, 85:259-68.

3. The end of theory: the data deluge makes the scientific method obsolete. [http://www.wired.com/science/discoveries/magazine/16-07/pb_theory]

4. Hubbard T: The Ensembl genome database project. Nucleic Acids Research 2002, 30:38-41.

5. Karolchik D: The UCSC Genome Browser Database. Nucleic Acids Research 2003, 31:51-54.

6. Maglott D, Ostell J, Pruitt KD, Tatusova T: Entrez Gene: gene-centered information at NCBI. Nucleic acids research 2005, 33:D54-8.

7. Ashburner M, Ball CA, Blake JA, et al: Gene ontology: tool for the unification of biology. The Gene Ontology Consortium. Nature genetics 2000, 25:25-9.

8. Bizer C, Heath T, Berners-Lee T: Linked Data - The Story So Far. International Journal on Semantic Web and Information Systems (IJSWIS) 2009.

9. Linked Data | Linked Data - Connect Distributed Data across the Web. [http://linkeddata.org]

10. Linked data - Design issues. [http://www.w3.org/DesignIssues/LinkedData.html]

11. Vandervalk BP, McCarthy EL, Wilkinson MD: Moby and Moby 2: creatures of the deep (web). Briefings in bioinformatics 2009, 10:114-28.

12. Where the semantic web stumbled, linked data will succeed - O’Reilly Radar. [http://radar.oreilly.com/2010/11/semantic-web-linked-data.html]

13. Berners-Lee T, Weitzner DJ, Hall W, et al: A Framework for Web Science. Foundations and Trends® in Web Science 2006, 1:1-130.

14. Hendler J, Berners-Lee T: From the Semantic Web to social machines: A research challenge for AI on the World Wide Web. Artificial Intelligence 2010, 174:156-161.

15. Almeida JS, Chen C, Gorlitsky R, et al: Data integration gets “Sloppy”. Nature biotechnology 2006, 24:1070-1.

16. Deus HF, Stanislaus R, Veiga DF, et al: A Semantic Web management model for integrative biomedical informatics. PloS one 2008, 3:e2946.

17. Putting the Web back in Semantic Web. [http://www.w3.org/2005/Talks/1110-iswc-tbl/#(1)]

18. SPARQL Query Language for RDF. [http://www.w3.org/TR/rdf-sparql-query]

19. Alexander K, Cyganiak R, Hausenblas M, Zhao J: Describing Linked Datasets: On the Design and Usage of voiD, the “Vocabulary Of Interlinked Datasets”. Linked Data on the Web Workshop (LDOW 09), in conjunction with 18th International World Wide Web Conference (WWW 09) 2009.

20. Cheung KH, Frost HR, Marshall MS, et al: A journey to Semantic Web query federation in the life sciences. BMC bioinformatics 2009, 10(Suppl 1):S10.

21. A Prototype Knowledge Base for the Life Sciences. [http://www.w3.org/TR/hcls-kb]

22. Belleau F, Nolin M-A, Tourigny N, Rigault P, Morissette J: Bio2RDF: towards a mashup to build bioinformatics knowledge systems. Journal of biomedical informatics 2008, 41:706-16.

23. Smith B, Ashburner M, Rosse C, et al: The OBO Foundry: coordinated evolution of ontologies to support biomedical data integration. Nature biotechnology 2007, 25:1251-5.

24. Taylor CF, Field D, Sansone SA, et al: Promoting coherent minimum reporting guidelines for biological and biomedical investigations: the MIBBI project. Nature biotechnology 2008, 26:889-96.

25. Noy NF, Shah NH, Whetzel PL, et al: BioPortal: ontologies and integrated data resources at the click of a mouse. Nucleic acids research 2009, 37:W170-3.

26. Deus HF, Prud E, Zhao J, Marshall MS, Samwald M: Provenance of Microarray Experiments for a Better Understanding of Experiment Results. ISWC 2010 SWPM 2010.

27. Stein LD: Integrating biological databases. Nature reviews Genetics 2003, 4:337-45.

28. Goble C, Stevens R: State of the nation in data integration for bioinformatics. Journal of Biomedical Informatics 2008, 41:687-693.

29. Ludäscher B, Altintas I, Bowers S, et al: Scientific Process Automation and Workflow Management. In Scientific Data Management. Edited by Shoshani A, Rotem D. Chapman; 2009.

30. Nelson B: Data sharing: Empty archives. Nature 2009, 461:160-3.

31. Stanislaus R, Chen C, Franklin J, Arthur J, Almeida JS: AGML Central: web based gel proteomic infrastructure. Bioinformatics (Oxford, England) 2005, 21:1754-7.

32. Silva S, Gouveia-Oliveira R, Maretzek A, et al: EURISWEB–Web-based epidemiological surveillance of antibiotic-resistant pneumococci in day care centers. BMC medical informatics and decision making 2003, 3:9.

33. Describing Linked Datasets with the VoiD Vocabulary. [http://www.w3.org/TR/2011/NOTE-void-20110303]

34. HIPAA Administrative Simplification Statute and Rules. [http://www.hhs.gov/ocr/privacy/hipaa/administrative/index.html]

35. Socially Aware Cloud Storage. [http://www.w3.org/DesignIssues/CloudStorage.html]

36. Koslow SH: Opinion: Sharing primary data: a threat or asset to discovery? Nature reviews Neuroscience 2002, 3:311-3.

37. Baggerly KA, Coombes KR: Deriving chemosensitivity from cell lines: Forensic bioinformatics and reproducible research in high-throughput biology. The Annals of Applied Statistics 2009, 3:1309-1334.

38. Hodge G: Systems of Knowledge Organization for Digital Libraries: Beyond Traditional Authority Files. 2000.

39. SKOS Simple Knowledge Organization System Reference. [http://www.w3.org/TR/skos-reference]

40. Almeida JS, Deus HF, Maass W: S3DB core: a framework for RDF generation and management in bioinformatics infrastructures. BMC bioinformatics 2010, 11:387.

41. Deus HF, Veiga DF, Freire PR, et al: Exposing The Cancer Genome Atlas as a SPARQL endpoint. Journal of Biomedical Informatics 2010, 43:998-1008.

42. Correa MC, Deus HF, Vasconcelos AT, et al: AGUIA: autonomous graphical user interface assembly for clinical trials semantic data services. BMC medical informatics and decision making 2010, 10.

43. Freire P, Vilela M, Deus H, et al: Exploratory analysis of the copy number alterations in glioblastoma multiforme. PloS one 2008, 3:e4076.

44. NCBO Ontology Widgets. [http://www.bioontology.org/wiki/index.php/NCBO_Widgets]

45. Bussler C: Is Semantic Web Technology Taking the Wrong Turn? IEEE Internet Computing 2008, 12:75-79.

46. What people find hard about linked data. [http://dynamicorange.com/2010/11/15/what-people-find-hard-about-linked-data]

47. Raja A, Lakshmanan D: Domain Specific Languages. International Journal of Computer Applications 2010, 1:99-105.

48. SPARQL Update. [http://www.w3.org/TR/sparql11-update]

49. Carroll JJ, Bizer C, Hayes P, Stickler P: Named graphs, provenance and trust. Proceedings of the 14th international conference on World Wide Web, WWW '05 2005, 14:613.

50. S3DB operator function states. [http://code.google.com/p/s3db-operator]

51. S3DB Operators. [http://s3db-operator.googlecode.com/hg/propagation.html]

52. Deus HF, Sousa MA de, Carrico JA, Lencastre H de, Almeida JS: Adapting experimental ontologies for molecular epidemiology. AMIA Annual Symposium proceedings 2007, 935.

53. The OAuth 1.0 Protocol. [http://tools.ietf.org/html/rfc5849]

54. Francisco AP, Bugalho M, Ramirez M, Carriço JA: Global optimal eBURST analysis of multilocus typing data using a graphic matroid approach. BMC bioinformatics 2009, 10:152.

55. S3QL serialization engine. [http://js.s3db.googlecode.com/hg/translate/quickTranslate.html]

56. Ippolito G, Leone S, Lauria FN, Nicastri E, Wenzel RP: Methicillin-resistant Staphylococcus aureus: the superbug. International journal of infectious diseases 2010, 14:S7-S11.

57. Harris SR, Feil EJ, Holden MTG, et al: Evolution of MRSA during hospital transmission and intercontinental spread. Science (New York, NY) 2010, 327:469-74.

58. Linked Data API. [http://code.google.com/p/linked-data-api]

doi:10.1186/1471-2105-12-285

Cite this article as: Deus et al.: S3QL: A distributed domain specific language for controlled semantic integration of life sciences data. BMC Bioinformatics 2011, 12:285.


  • Abstract
    • Background
    • Results
    • Conclusions
      • Background
        • 11 Linked Data Best Practices
        • 12 Challenges involved in Publishing Primary Experimental Life Sciences Datasets as Linked Data
        • 13 Knowledge organization systems for Linked Data
          • Methods
            • The S3DB Knowledge Organization Model
            • Operators for Permission Control
            • Components of a Distributed System
            • Availability and Documentation
              • Results
                • S3QL Syntax
                • S3QL Permission Control
                • Global Dereferencing for Distributed Queries
                • Implementation and benchmarking
                  • Discussion
                    • Gastro-intestinal Clinical Trials Use Case
                    • The Cancer Genome Atlas Use Case
                    • A Molecular Epidemiology Use Case
                      • Conclusions
                      • Acknowledgements and funding
                      • Author details
                      • Authors contributions
                      • Competing interests
                      • References
Page 13: RESEARCH ARTICLE Open Access S3QL: A …RESEARCH ARTICLE Open Access S3QL: A distributed domain specific language for controlled semantic integration of life sciences data Helena F

dataset without breaking it and storing it in multiple sys-tems To address this requirement S3QL was fitted withdistributed control operational features that follow designcriteria found desirable for biomedical applications S3QLis not unique in its class for example the linked data APIwhich is being used by datagovuk is an alternative DSL tomanage linked data [58] However we believe that S3QLis closer to the technologies currently used by applicationdevelopers and therefore may provide a more suitablemiddle layer between linked data formalisms and applica-tion development It is argued that these features mayassist and anticipate future extensions of semantic webprovenance control formalisms

Additional material

Additional file 1 Walkthrough of S3DBrsquos permission propagationmerge migrate and percolate

Additional file 2 Matrix of MLST profiles in Portugal and the UnitedKingdom

Acknowledgements and fundingThis work was supported in part by the Center for Clinical and TranslationalSciences of the University of Alabama at Birmingham under contract no

5UL1 RR025777-03 from NIH National Center for Research Resources by theNational Cancer Institute grant 1U24CA143883-01 by the European UnionFP7 PNEUMOPATH project award and by the Portuguese Science andTechnology foundation under contracts PTDCEIA-EIA1052452008 andPTDCEEA-ACR695302006 HFD also thankfully acknowledges PhDfellowship from the same foundation award SFRHBD459632008 and theScience Foundation Ireland project Lion under Grant No SFI02CE1I131The authors would also like to thank three anonymous reviewers whosecomments and advice were extremely valuable for improving themanuscript

Author details1Digital Enterprise Research Institute National University of Ireland at GalwayIDA Business Park Lower Dangan Galway Ireland 2Biomathematics Institutode Tecnologia Quiacutemica e Bioloacutegica Universidade Nova de Lisboa Av daRepuacuteblica Estaccedilatildeo Agronoacutemica Nacional 2780-157 Oeiras Portugal3Laboratoacuterio Nacional de Computaccedilatildeo Ciecircntifica Av Getuacutelio Vargas 333Quitandinha 25651-075 Petroacutepolis Brasil 4Sanofi Pasteur 38 Sidney StreetCambridge MA 02139 USA 5Laboratory of Molecular Genetics Instituto deTecnologia Quiacutemica e Bioloacutegica Universidade Nova de Lisboa Av daRepuacuteblica Estaccedilatildeo Agronoacutemica Nacional 2780-157 Oeiras Portugal6Research Center for Intelligent Media Furtwangen University FurtwangenGermany 7Laboratory of Microbiology The Rockefeller University 10021 NewYork USA 8Division of Informatics Department of Pathology University ofAlabama at Birmingham 619 South 19th Street Birmingham Alamaba USA

Authorsrsquo contributionsHFD developed the S3QL domain specific language implemented it in theS3DB prototype validated it with examples and wrote the manuscript MCCRS MM HL RF WM and JSA tested and validated the language withexamples and made suggestions which lead to its improvement All authorsread and approved the final manuscript

Strains from Portugal

Strains from UK

SPARQL

S3DB system Triple stores

S3QL + authentication

Statistical computing environment (eg Matlab R)

S3QL +Lauthent

Figure 7 Workflow for the creation of a hierarchical cluster of MLST profiles from strains collected in Portugal and in the UK Twosources are chosen to perform the MLST profile assembly - an S3DB deployment which holds the ITQB Staphylococcus database and theSPARQL endpoint configured to host the Staphylococcus aureus MLST database Although data in the MLST SPARQL endpoint is publiclyavailable to access data in S3DB the user needs to provide an authentication token and a user_id as well as assemble the S3QL queries toretrieve the data [select(S|R172930) and select(S|R167271)] which may also be formulated as SPARQL A data structure is assembled that can finallybe analysed using hierarchical clustering methods such as the ones made available by Matlab The complete MLST dataset used to obtain thegraph is provided in additional file 2

Deus et al BMC Bioinformatics 2011 12285httpwwwbiomedcentralcom1471-210512285

Page 13 of 15

Competing interestsThe authors declare that they have no competing interests

Received 14 April 2011 Accepted 14 July 2011 Published 14 July 2011

References1 Bell G Hey T Szalay A Computer science Beyond the data deluge

Science (New York NY) 2009 3231297-82 Chiang AP Butte AJ Data-driven methods to discover molecular

determinants of serious adverse drug events Clinical pharmacology andtherapeutics 2009 85259-68

3 The end of theory the data deluge makes the scientific methodobsolete [httpwwwwiredcomsciencediscoveriesmagazine16-07pb_theory]

4 Hubbard T The Ensembl genome database project Nucleic Acids Research2002 3038-41

5 Karolchik D The UCSC Genome Browser Database Nucleic Acids Research2003 3151-54

6 Maglott D Ostell J Pruitt KD Tatusova T Entrez Gene gene-centeredinformation at NCBI Nucleic acids research 2005 33D54-8

7 Ashburner M Ball CA Blake JA et al Gene ontology tool for theunification of biology The Gene Ontology Consortium Nature genetics2000 2525-9

8 Bizer C Heath T Berners-Lee T Linked Data - The Story So FarInternational Journal on Semantic Web and Information Systems (IJSWIS)2009

9 Linked Data | Linked Data - Connect Distributed Data across the Web[httplinkeddataorg]

10 Linked data - Design issues [httpwwww3orgDesignIssuesLinkedDatahtml]

11 Vandervalk BP McCarthy EL Wilkinson MD Moby and Moby 2 creaturesof the deep (web) Briefings in bioinformatics 2009 10114-28

12 Where the semantic web stumbled linked data will succeed - OrsquoReillyRadar [httpradaroreillycom201011semantic-web-linked-datahtml]

13 Berners-Lee T Weitzner DJ Hall W et al A Framework for Web ScienceFoundations and Trendsreg in Web Science 2006 11-130

14 Hendler J Berners-Lee T From the Semantic Web to social machines Aresearch challenge for AI on the World Wide Web Artificial Intelligence2010 174156-161

15 Almeida JS Chen C Gorlitsky R et al Data integration gets ldquoSloppyrdquoNature biotechnology 2006 241070-1

16 Deus HF Stanislaus R Veiga DF et al A Semantic Web managementmodel for integrative biomedical informatics PloS one 2008 3e2946

17 Putting the Web back in Semantic Web [httpwwww3org2005Talks1110-iswc-tbl(1)]

18 SPARQL Query Language for RDF [httpwwww3orgTRrdf-sparql-query]

19 Alexander K Cyganiak R Hausenblas M Zhao J Describing LinkedDatasets On the Design and Usage of voiD the ldquo Vocabulary OfInterlinked Datasetsrdquo Linked Data on the Web Workshop (LDOW 09) inconjunction with 18th International World Wide Web Conference (WWW 09)2009

20 Cheung KH Frost HR Marshall MS et al A journey to Semantic Webquery federation in the life sciences BMC bioinformatics 2009 10(Suppl1)S10

21 A Prototype Knowledge Base for the Life Sciences [httpwwww3orgTRhcls-kb]

22 Belleau F Nolin M-A Tourigny N Rigault P Morissette J Bio2RDF towardsa mashup to build bioinformatics knowledge systems Journal ofbiomedical informatics 2008 41706-16

23 Smith B Ashburner M Rosse C et al The OBO Foundry coordinatedevolution of ontologies to support biomedical data integration Naturebiotechnology 2007 251251-5

24 Taylor CF Field D Sansone SA et al Promoting coherent minimumreporting guidelines for biological and biomedical investigations theMIBBI project Nature biotechnology 2008 26889-96

25 Noy NF Shah NH Whetzel PL et al BioPortal ontologies and integrateddata resources at the click of a mouse Nucleic acids research 2009 37W170-3

26 Deus HF Prud E Zhao J Marshall MS Samwald M Provenance ofMicroarray Experiments for a Better Understanding of ExperimentResults ISWC 2010 SWPM 2010

27 Stein LD Integrating biological databases Nature reviews Genetics 20034337-45

28 Goble C Stevens R State of the nation in data integration forbioinformatics Journal of Biomedical Informatics 2008 41687-693

29 Ludaumlscher B Altintas I Bowers S et al Scientific Process Automation andWorkflow Management In Scientific Data Management Edited byShoshani A Rotem D Chapman 2009

30 Nelson B Data sharing Empty archives Nature 2009 461160-331 Stanislaus R Chen C Franklin J Arthur J Almeida JS AGML Central web

based gel proteomic infrastructure Bioinformatics (Oxford England) 2005211754-7

32 Silva S Gouveia-Oliveira R Maretzek A et al EURISWEBndashWeb-basedepidemiological surveillance of antibiotic-resistant pneumococci in daycare centers BMC medical informatics and decision making 2003 39

33 Describing Linked Datasets with the VoiD Vocabulary [httpwwww3orgTR2011NOTE-void-20110303]

34 HIPAA Administrative Simplification Statute and Rules [httpwwwhhsgovocrprivacyhipaaadministrativeindexhtml]

35 Socially Aware Cloud Storage [httpwwww3orgDesignIssuesCloudStoragehtml]

36 Koslow SH Opinion Sharing primary data a threat or asset to discoveryNature reviews Neuroscience 2002 3311-3

37 Baggerly KA Coombes KR Deriving chemosensitivity from cell linesForensic bioinformatics and reproducible research in high-throughputbiology The Annals of Applied Statistics 2009 31309-1334

38 Hodge G Systems of Knowledge Organization for Digital Libraries BeyondTraditional Authority Files 2000

39 SKOS Simple Knowledge Organization System Reference [httpwwww3orgTRskos-reference]

40 Almeida JS Deus HF Maass W S3DB core a framework for RDFgeneration and management in bioinformatics infrastructures BMCbioinformatics 2010 11387

41 Deus HF Veiga DF Freire PR et al Exposing The Cancer Genome Atlas asa SPARQL endpoint Journal of Biomedical Informatics 2010 43998-1008

42 Correa MC Deus HF Vasconcelos AT et al AGUIA autonomous graphicaluser interface assembly for clinical trials semantic data services BMCmedical informatics and decision making 2010 10

43 Freire P Vilela M Deus H et al Exploratory analysis of the copy numberalterations in glioblastoma multiforme PloS one 2008 3e4076

44 NCBO Ontology Widgets [httpwwwbioontologyorgwikiindexphpNCBO_Widgets]

45 Bussler C Is Semantic Web Technology Taking the Wrong Turn IeeeInternet Computing 2008 1275-79

46 What people find hard about linked data [httpdynamicorangecom20101115what-people-find-hard-about-linked-data]

47 Raja A Lakshmanan D Domain Specific Languages International Journal ofComputer Applications 2010 199-105

48 SPARQL Update [httpwwww3orgTRsparql11-update]49 Carroll JJ Bizer C Hayes P Stickler P Named graphs provenance and

trust Proceedings of the 14th international conference on World Wide WebWWW 05 2005 14613

50 S3DB operator function states [httpcodegooglecomps3db-operator]51 S3DB Operators [https3db-operatorgooglecodecomhgpropagation

html]52 Deus HF Sousa MA de Carrico JA Lencastre H de Almeida JS Adapting

experimental ontologies for molecular epidemiology AMIA AnnualSymposium proceedings 2007 935

53 The OAuth 10 Protocol [httptoolsietforghtmlrfc5849]54 Francisco AP Bugalho M Ramirez M Carriccedilo JA Global optimal eBURST

analysis of multilocus typing data using a graphic matroid approachBMC bioinformatics 2009 10152

55 S3QL serialization engine [httpjss3dbgooglecodecomhgtranslatequickTranslatehtml]

56 Ippolito G Leone S Lauria FN Nicastri E Wenzel RP Methicillin-resistantStaphylococcus aureus the superbug International journal of infectiousdiseases 2010 14S7-S11

Deus et al BMC Bioinformatics 2011 12285httpwwwbiomedcentralcom1471-210512285

Page 14 of 15

57 Harris SR Feil EJ Holden MTG et al Evolution of MRSA during hospitaltransmission and intercontinental spread Science (New York NY) 2010327469-74

58 Linked Data API [httpcodegooglecomplinked-data-api]

doi1011861471-2105-12-285Cite this article as Deus et al S3QL A distributed domain specificlanguage for controlled semantic integration of life sciences data BMCBioinformatics 2011 12285

Submit your next manuscript to BioMed Centraland take full advantage of

bull Convenient online submission

bull Thorough peer review

bull No space constraints or color figure charges

bull Immediate publication on acceptance

bull Inclusion in PubMed CAS Scopus and Google Scholar

bull Research which is freely available for redistribution

Submit your manuscript at wwwbiomedcentralcomsubmit

Deus et al BMC Bioinformatics 2011 12285httpwwwbiomedcentralcom1471-210512285

Page 15 of 15


Competing interests
The authors declare that they have no competing interests.

Received: 14 April 2011 Accepted: 14 July 2011 Published: 14 July 2011

References
1. Bell G, Hey T, Szalay A: Computer science. Beyond the data deluge. Science (New York, NY) 2009, 323:1297-8.
2. Chiang AP, Butte AJ: Data-driven methods to discover molecular determinants of serious adverse drug events. Clinical pharmacology and therapeutics 2009, 85:259-68.
3. The end of theory: the data deluge makes the scientific method obsolete. [http://www.wired.com/science/discoveries/magazine/16-07/pb_theory]
4. Hubbard T: The Ensembl genome database project. Nucleic Acids Research 2002, 30:38-41.
5. Karolchik D: The UCSC Genome Browser Database. Nucleic Acids Research 2003, 31:51-54.
6. Maglott D, Ostell J, Pruitt KD, Tatusova T: Entrez Gene: gene-centered information at NCBI. Nucleic acids research 2005, 33:D54-8.
7. Ashburner M, Ball CA, Blake JA, et al: Gene ontology: tool for the unification of biology. The Gene Ontology Consortium. Nature genetics 2000, 25:25-9.
8. Bizer C, Heath T, Berners-Lee T: Linked Data - The Story So Far. International Journal on Semantic Web and Information Systems (IJSWIS) 2009.
9. Linked Data | Linked Data - Connect Distributed Data across the Web. [http://linkeddata.org]
10. Linked data - Design issues. [http://www.w3.org/DesignIssues/LinkedData.html]
11. Vandervalk BP, McCarthy EL, Wilkinson MD: Moby and Moby 2: creatures of the deep (web). Briefings in bioinformatics 2009, 10:114-28.
12. Where the semantic web stumbled, linked data will succeed - O'Reilly Radar. [http://radar.oreilly.com/2010/11/semantic-web-linked-data.html]
13. Berners-Lee T, Weitzner DJ, Hall W, et al: A Framework for Web Science. Foundations and Trends® in Web Science 2006, 1:1-130.
14. Hendler J, Berners-Lee T: From the Semantic Web to social machines: A research challenge for AI on the World Wide Web. Artificial Intelligence 2010, 174:156-161.
15. Almeida JS, Chen C, Gorlitsky R, et al: Data integration gets "Sloppy". Nature biotechnology 2006, 24:1070-1.
16. Deus HF, Stanislaus R, Veiga DF, et al: A Semantic Web management model for integrative biomedical informatics. PloS one 2008, 3:e2946.
17. Putting the Web back in Semantic Web. [http://www.w3.org/2005/Talks/1110-iswc-tbl/#(1)]
18. SPARQL Query Language for RDF. [http://www.w3.org/TR/rdf-sparql-query/]
19. Alexander K, Cyganiak R, Hausenblas M, Zhao J: Describing Linked Datasets: On the Design and Usage of voiD, the "Vocabulary Of Interlinked Datasets". Linked Data on the Web Workshop (LDOW 09), in conjunction with 18th International World Wide Web Conference (WWW 09) 2009.
20. Cheung KH, Frost HR, Marshall MS, et al: A journey to Semantic Web query federation in the life sciences. BMC bioinformatics 2009, 10(Suppl 1):S10.
21. A Prototype Knowledge Base for the Life Sciences. [http://www.w3.org/TR/hcls-kb/]
22. Belleau F, Nolin M-A, Tourigny N, Rigault P, Morissette J: Bio2RDF: towards a mashup to build bioinformatics knowledge systems. Journal of biomedical informatics 2008, 41:706-16.
23. Smith B, Ashburner M, Rosse C, et al: The OBO Foundry: coordinated evolution of ontologies to support biomedical data integration. Nature biotechnology 2007, 25:1251-5.
24. Taylor CF, Field D, Sansone SA, et al: Promoting coherent minimum reporting guidelines for biological and biomedical investigations: the MIBBI project. Nature biotechnology 2008, 26:889-96.
25. Noy NF, Shah NH, Whetzel PL, et al: BioPortal: ontologies and integrated data resources at the click of a mouse. Nucleic acids research 2009, 37:W170-3.
26. Deus HF, Prud E, Zhao J, Marshall MS, Samwald M: Provenance of Microarray Experiments for a Better Understanding of Experiment Results. ISWC 2010 SWPM 2010.
27. Stein LD: Integrating biological databases. Nature reviews Genetics 2003, 4:337-45.
28. Goble C, Stevens R: State of the nation in data integration for bioinformatics. Journal of Biomedical Informatics 2008, 41:687-693.
29. Ludäscher B, Altintas I, Bowers S, et al: Scientific Process Automation and Workflow Management. In Scientific Data Management. Edited by Shoshani A, Rotem D. Chapman; 2009.

30. Nelson B: Data sharing: Empty archives. Nature 2009, 461:160-3.
31. Stanislaus R, Chen C, Franklin J, Arthur J, Almeida JS: AGML Central: web based gel proteomic infrastructure. Bioinformatics (Oxford, England) 2005, 21:1754-7.
32. Silva S, Gouveia-Oliveira R, Maretzek A, et al: EURISWEB–Web-based epidemiological surveillance of antibiotic-resistant pneumococci in day care centers. BMC medical informatics and decision making 2003, 3:9.
33. Describing Linked Datasets with the VoiD Vocabulary. [http://www.w3.org/TR/2011/NOTE-void-20110303/]
34. HIPAA Administrative Simplification Statute and Rules. [http://www.hhs.gov/ocr/privacy/hipaa/administrative/index.html]
35. Socially Aware Cloud Storage. [http://www.w3.org/DesignIssues/CloudStorage.html]
36. Koslow SH: Opinion: Sharing primary data: a threat or asset to discovery? Nature reviews Neuroscience 2002, 3:311-3.
37. Baggerly KA, Coombes KR: Deriving chemosensitivity from cell lines: Forensic bioinformatics and reproducible research in high-throughput biology. The Annals of Applied Statistics 2009, 3:1309-1334.
38. Hodge G: Systems of Knowledge Organization for Digital Libraries: Beyond Traditional Authority Files. 2000.
39. SKOS Simple Knowledge Organization System Reference. [http://www.w3.org/TR/skos-reference/]
40. Almeida JS, Deus HF, Maass W: S3DB core: a framework for RDF generation and management in bioinformatics infrastructures. BMC bioinformatics 2010, 11:387.
41. Deus HF, Veiga DF, Freire PR, et al: Exposing The Cancer Genome Atlas as a SPARQL endpoint. Journal of Biomedical Informatics 2010, 43:998-1008.
42. Correa MC, Deus HF, Vasconcelos AT, et al: AGUIA: autonomous graphical user interface assembly for clinical trials semantic data services. BMC medical informatics and decision making 2010, 10.
43. Freire P, Vilela M, Deus H, et al: Exploratory analysis of the copy number alterations in glioblastoma multiforme. PloS one 2008, 3:e4076.
44. NCBO Ontology Widgets. [http://www.bioontology.org/wiki/index.php/NCBO_Widgets]
45. Bussler C: Is Semantic Web Technology Taking the Wrong Turn? Ieee Internet Computing 2008, 12:75-79.
46. What people find hard about linked data. [http://dynamicorange.com/2010/11/15/what-people-find-hard-about-linked-data/]
47. Raja A, Lakshmanan D: Domain Specific Languages. International Journal of Computer Applications 2010, 1:99-105.
48. SPARQL Update. [http://www.w3.org/TR/sparql11-update/]
49. Carroll JJ, Bizer C, Hayes P, Stickler P: Named graphs, provenance and trust. Proceedings of the 14th international conference on World Wide Web - WWW 05 2005, 14:613.

50. S3DB operator function states. [http://code.google.com/p/s3db-operator/]
51. S3DB Operators. [http://s3db-operator.googlecode.com/hg/propagation.html]
52. Deus HF, Sousa MA de, Carrico JA, Lencastre H de, Almeida JS: Adapting experimental ontologies for molecular epidemiology. AMIA Annual Symposium proceedings 2007, 935.
53. The OAuth 1.0 Protocol. [http://tools.ietf.org/html/rfc5849]
54. Francisco AP, Bugalho M, Ramirez M, Carriço JA: Global optimal eBURST analysis of multilocus typing data using a graphic matroid approach. BMC bioinformatics 2009, 10:152.
55. S3QL serialization engine. [http://js.s3db.googlecode.com/hg/translate/quickTranslate.html]
56. Ippolito G, Leone S, Lauria FN, Nicastri E, Wenzel RP: Methicillin-resistant Staphylococcus aureus: the superbug. International journal of infectious diseases 2010, 14:S7-S11.


57. Harris SR, Feil EJ, Holden MTG, et al: Evolution of MRSA during hospital transmission and intercontinental spread. Science (New York, NY) 2010, 327:469-74.
58. Linked Data API. [http://code.google.com/p/linked-data-api/]

doi:10.1186/1471-2105-12-285
Cite this article as: Deus et al.: S3QL: A distributed domain specific language for controlled semantic integration of life sciences data. BMC Bioinformatics 2011 12:285.


