Yolanda Gil ([email protected])
USC Information Sciences Institute
January 10, 2010
Requirements for caBIG Infrastructure
to Support Semantic Workflows
Yolanda Gil, PhD
Information Sciences Institute and Department of Computer Science
University of Southern California
http://www.isi.edu/~gil
Outline
Brief background on semantic workflows
• Semantic workflow representations in Wings
Five uses of semantic workflows to assist users and their resulting requirements
• Reproducibility
• Validation
• Metadata generation
• Data discovery
• Workflow discovery
Requirements for architecture components
• Ontology repositories and services
• Data/metadata catalogs and services
• Component/service catalogs and services
• Workflow catalogs and services
Benefits of Semantic Workflows [Gil JSP-09]
Execution management:
• Automation of workflow execution
• Managing distributed computation
• Managing large data sets
• Security and access control
• Provenance recording
• Low-cost, high-fidelity reproducibility
Semantics and reasoning:
• Workflow retrieval and discovery
• Automation of workflow generation
• Systematic exploration of design space
• Validation of workflows
• Automated generation of metadata
• Guarantees of data pedigree
• “Conceptual” reproducibility
Semantic Workflows in Wings [Kim et al CCPEJ 08; Gil et al IEEE eScience 09; Gil et al K-CAP 09; Kim et al IUI 06; Gil et al IEEE IS 2010]
Workflows augmented with semantic constraints
• Each workflow constituent has a variable associated with it
– Nodes, links, workflow components, datasets
– Workflow variables can represent collections of data as well as classes of software components
• Constraints are used to restrict variables, and include:
– Metadata properties of datasets
– Constraints across workflow variables
• Incorporate the function of workflow components: how data is transformed
Reasoning about semantic constraints in a workflow
• Algorithms for semantic enrichment of workflow templates
• Algorithms for matching queries against workflow catalogs
• Algorithms for generating workflows from high-level user requests
• Algorithms for generating metadata of new data products
• Algorithms for assisting users with the creation of valid workflow templates
Semantic Workflows in WINGS
Workflow templates
• Dataflow diagram
• Each constituent (node, link, component, dataset) has a corresponding variable
Semantic properties
• Constraints on workflow variables
(TestData dcdom:isDiscrete false)
(TrainingData dcdom:isDiscrete false)
Semantic Constraints as Metadata Properties
Constraints on reusable template (shown below)
Constraints on current user request (shown above)
[modelerInput_not_equal_to_classifierInput:
  (:modelerInput wflow:hasDataBinding ?ds1)
  (:classifierInput wflow:hasDataBinding ?ds2)
  equal(?ds1, ?ds2)
  (?t rdf:type wflow:WorkflowTemplate)
  -> (?t wflow:isInvalid "true"^^xsd:boolean)]
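The rule above marks a template as invalid when the modeler input and the classifier input are bound to the same dataset. A minimal Python sketch of that check follows; it is an illustration only, not how Wings evaluates rules, and the function name, the dictionary layout, and the dataset file names are all assumptions.

```python
# Hypothetical stand-in for the rule: a template is invalid when
# modelerInput and classifierInput are bound to the same dataset.

def is_template_invalid(bindings):
    """Return True if both inputs are bound and their bindings are equal."""
    ds1 = bindings.get("modelerInput")
    ds2 = bindings.get("classifierInput")
    # Mirror the rule body: both bindings must exist, and equal(?ds1, ?ds2).
    return ds1 is not None and ds2 is not None and ds1 == ds2


# Illustrative bindings (file names are made up).
same = {"modelerInput": "iris.arff", "classifierInput": "iris.arff"}
different = {"modelerInput": "iris_train.arff", "classifierInput": "iris_test.arff"}
```

Here `is_template_invalid(same)` flags the template, while `different` passes the check.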
Uses of Semantic Workflows: 1) Easily Replicate Previously Published Results
A catalog of carefully crafted workflows of select state-of-the-art methods to cover a wide range of common analyses
• Many implementations of the same algorithm, some proprietary
• Same implementation but new versions and bug fixes
With such a catalog, the effort involved in reproducing results is greatly reduced
Semantics needed to help users use workflows correctly
Resulting Requirements (1)
Semantic representations of workflows need to abstract from software implementation
• Representing abstract classes of software components
– Instances are the implemented codes
– Workflow steps refer to component classes
• Representing abstract kinds of data (e.g., exclude format)
Semantic reasoning needed to specialize workflow
• To map the abstract workflow into an execution-ready workflow
• To insert lower level steps (e.g., data transformations)
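The specialization step can be sketched as follows, assuming a toy catalog that maps each abstract component class to implemented instances and a table of format converters. All names (`COMPONENT_CLASSES`, `CONVERTERS`, `specialize`, and the component and format names) are hypothetical, and choosing the first instance is a simplification.

```python
# Hypothetical catalog: abstract component class -> implemented instances,
# each annotated with the data format it reads and writes.
COMPONENT_CLASSES = {
    "Normalize": [{"name": "normalize_v2", "reads": "csv", "writes": "csv"}],
    "DecisionTreeModeler": [{"name": "id3_trainer", "reads": "arff", "writes": "model"}],
}

# Hypothetical lower-level transformation steps, keyed by (from, to) format.
CONVERTERS = {("csv", "arff"): "csv_to_arff"}


def specialize(abstract_steps, input_format):
    """Map component classes to instances, inserting converters as needed."""
    plan, current = [], input_format
    for step_class in abstract_steps:
        impl = COMPONENT_CLASSES[step_class][0]  # naive choice: first instance
        if impl["reads"] != current:
            converter = CONVERTERS.get((current, impl["reads"]))
            if converter is None:
                raise ValueError(f"no conversion from {current} to {impl['reads']}")
            plan.append(converter)  # inserted lower-level step
        plan.append(impl["name"])
        current = impl["writes"]
    return plan


plan = specialize(["Normalize", "DecisionTreeModeler"], "csv")
```

The abstract workflow names only component classes; the returned plan names concrete instances plus any conversion steps that had to be inserted.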
Uses of Semantic Workflows: 2) Ensure Correct Use of State-of-the-Art Methods
Analytic software and methods are well documented, but only as text (papers, manuals, etc.)
• Time consuming, hard to spot interdependencies, no validation
Semantics needed to guide users to set up workflows correctly and customize them to their datasets and goals
Requirements (2)
Semantic workflows can check constraints and guide users
• Representing requirements of software components
– Constraints on input data
– Constraints on parameter settings given properties of input data
• Representing metadata properties of datasets
Semantic reasoning needed:
• To check constraints of each workflow step
• To propagate constraints across the workflow
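A minimal sketch of checking the constraints of a single workflow step, assuming component requirements are stored as plain dictionaries. The property names (`isDiscrete` echoes the earlier `dcdom:isDiscrete` example; `numInstances` and `folds` are invented) and the `check_step` function are assumptions for illustration. Propagation across steps is not covered here.

```python
def check_step(component, dataset_metadata, parameters):
    """Return a list of violated constraints (empty means the step is valid)."""
    violations = []
    # Constraints on input data: required metadata property values.
    for prop, required in component["input_constraints"].items():
        if dataset_metadata.get(prop) != required:
            violations.append(f"input must have {prop} = {required}")
    # Constraints on parameter settings given properties of the input data.
    for description, predicate in component["parameter_constraints"]:
        if not predicate(parameters, dataset_metadata):
            violations.append(description)
    return violations


# Hypothetical component description and dataset metadata.
modeler = {
    "input_constraints": {"isDiscrete": False},
    "parameter_constraints": [
        ("folds must not exceed numInstances",
         lambda p, m: p["folds"] <= m["numInstances"]),
    ],
}
training_data = {"isDiscrete": False, "numInstances": 150}
```

Calling `check_step(modeler, training_data, {"folds": 10})` returns no violations; raising `folds` above `numInstances` or supplying discrete data returns a message the user can act on.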
Uses of Semantic Workflows: 3) Automatic Generation of Metadata
Metadata annotations are tedious and involved
• Often not done, an obstacle to sharing and to reuse
Semantic workflows can automate the generation of metadata for analysis data products
Requirements (3)
Semantic representations needed:
• Representing expected characteristics of output dataset for each software component given the input metadata
• Representing metadata properties of input datasets
Semantic reasoning needed:
• To propagate metadata for each workflow step
• To propagate metadata across the workflow
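Metadata propagation can be sketched as follows, assuming each component is described by a function from input metadata to output metadata and that the workflow is a simple chain. The rule functions and property names are hypothetical.

```python
def normalize_rule(meta):
    """Hypothetical component rule: output is like the input, but normalized."""
    out = dict(meta)
    out["isNormalized"] = True
    return out


def discretize_rule(meta):
    """Hypothetical component rule: output is like the input, but discrete."""
    out = dict(meta)
    out["isDiscrete"] = True
    return out


def propagate(steps, input_metadata):
    """Apply each step's rule in order; return the metadata of every product."""
    products = [input_metadata]
    for rule in steps:
        products.append(rule(products[-1]))
    return products


products = propagate(
    [normalize_rule, discretize_rule],
    {"isDiscrete": False, "hasMissingValues": False},
)
```

Each entry of `products` is the generated metadata for one data product, so intermediate and final results are annotated without manual effort.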
Uses of Semantic Workflows: 4) Discovery of Relevant Data
“Need a dataset of updated common (known) loci to annotate findings; where can I find one?”
Workflows reused from a catalog may require additional data besides what is provided by the user
Semantic workflows can help identify characteristics of required datasets and query data catalogs to find them for the user
Requirements (4)
Semantic representations needed:
• Metadata properties of any additional input datasets in the workflow, including:
– Default properties for the given workflow
– Augmented properties that result from the specific input data provided by the user
Semantic reasoning needed:
• Propagation of semantic constraints through the workflow
• Formulation of queries to data catalogs based on semantic properties required of datasets in the workflow
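A minimal sketch of query formulation, assuming the data catalog is a dictionary of dataset metadata. The required properties combine the workflow's defaults with properties inherited from the user's own input data. The catalog entries, property names, and both functions are hypothetical.

```python
def required_properties(defaults, user_metadata, shared_properties):
    """Default properties plus properties inherited from the user's input."""
    required = dict(defaults)
    for prop in shared_properties:
        if prop in user_metadata:
            required[prop] = user_metadata[prop]
    return required


def query_catalog(catalog, required):
    """Return names of datasets whose metadata satisfies every requirement."""
    return [
        name for name, meta in catalog.items()
        if all(meta.get(p) == v for p, v in required.items())
    ]


# Illustrative catalog (dataset names are made up).
catalog = {
    "known_loci_human": {"kind": "KnownLoci", "organism": "human"},
    "known_loci_mouse": {"kind": "KnownLoci", "organism": "mouse"},
    "expression_human": {"kind": "Expression", "organism": "human"},
}

required = required_properties(
    {"kind": "KnownLoci"},                        # default for this workflow
    {"organism": "human", "numInstances": 500},   # user's input metadata
    ["organism"],                                 # properties that must agree
)
matches = query_catalog(catalog, required)
```

The user supplies only their own data; the additional dataset is found from the constraints the workflow implies.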
Uses of Semantic Workflows: 5) Retrieval of Workflows
Hard to find workflows for the type of analysis a user wants
• Semantic information is not provided when creating the workflow
• However, retrieval queries are often based on metadata properties of data
– e.g., “Find workflows that can normalize data which is continuous and has missing values [<- constraints on inputs] to create a decision tree model [constraint on intermediate data products]”
Semantic workflows needed to augment user-provided workflows with semantic constraints from metadata catalogs and component catalogs
Requirements (5)
Semantic representations are needed:
• For workflow constituents
– Metadata properties of input, intermediate and final data products
– Metadata properties of workflow and component function
• For user queries
– Express workflow sketches containing partial data descriptions (constraints)
Reasoning capabilities
• Automatic creation of metadata for expected workflow data products
• Workflow matching to queries (exact and partial)
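Exact and partial matching can be sketched as a score: the fraction of query constraints that a cataloged workflow satisfies, with 1.0 meaning an exact match. This scoring scheme, the `match` function, and the catalog entries are assumptions for illustration; the query is modeled on the decision tree example from the previous slide.

```python
def match(query, workflow):
    """Return the fraction of query constraints that the workflow satisfies."""
    total = satisfied = 0
    for role, constraints in query.items():
        described = workflow.get(role, {})
        for prop, value in constraints.items():
            total += 1
            if described.get(prop) == value:
                satisfied += 1
    return satisfied / total if total else 0.0


# Workflow sketch: partial descriptions of inputs and intermediate products.
query = {
    "input": {"isContinuous": True, "hasMissingValues": True},
    "intermediate": {"type": "DecisionTreeModel"},
}

# Hypothetical workflow catalog with semantic descriptions of data products.
catalog = {
    "normalize_then_bayes": {
        "input": {"isContinuous": True, "hasMissingValues": True},
        "intermediate": {"type": "NaiveBayesModel"},
    },
    "normalize_then_tree": {
        "input": {"isContinuous": True, "hasMissingValues": True},
        "intermediate": {"type": "DecisionTreeModel"},
    },
}

# Best matches first; a score below 1.0 is a partial match.
ranked = sorted(catalog, key=lambda name: match(query, catalog[name]), reverse=True)
```

Ranking by score lets the catalog return partial matches when no workflow satisfies every constraint in the sketch.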
Requirements on Core Ontology Repositories and Services
Component/service ontologies
• Extend with semantic representations that support reasoning, not just their execution
Workflow ontologies
• Develop workflow ontologies that enable shared workflow repositories
• Develop semantic layer for the workflow ontologies
– Workflow steps must be able to represent component classes
– Support reasoning about workflows in all architecture components
Requirements on Data/Metadata Catalogs and Services
Representing abstract kinds of data (e.g., exclude format)
Representing metadata properties that are relevant to data analysis
• E.g., the organization that contributed the data may be less relevant than the instrument used to collect it, its calibration, its quality and accuracy, etc.
Requirements on Component/Service Catalogs and Services
Represent abstract classes of software components
• Instances correspond to implemented codes/services
Represent constraints on input data
• Metadata properties that make the component appropriate for a given input dataset
Represent constraints on output data
• Metadata properties of expected input datasets given the required outcome of the execution of the component
Represent constraints on parameter values
• Constraints on parameter settings given properties of input or output data
Represent how metadata properties of inputs are related to metadata of outputs
• Metadata properties of output datasets given the properties of the input datasets
Requirements on Workflow Catalogs and Services
Semantic reasoning to specialize workflows
• Given user requirements and a high-level workflow, automatically generate valid execution-ready workflows
• Automatically insert lower level steps when needed (e.g., data format conversions)
Semantic reasoning to propagate constraints of each workflow step
• Check constraints of each workflow step and propagate them throughout the workflow
• Incorporate constraints coming from the user’s requirements with constraints from the individual steps of the workflow
Formulation of data catalog queries based on the metadata properties of a given dataset in the workflow
Workflow discovery and matching for a given user query
• Need a language to express user queries as workflow sketches containing partial data descriptions (constraints) and partial dataflow patterns
• Need semantic reasoning for matching such queries, both exact and partial matching