Copyright 2009 Digital Enterprise Research Institute. All rights reserved.
Digital Enterprise Research Institute www.deri.ie
Capturing interactive data transformation
operations using provenance workflows
Tope Omitola, Andre Freitas, Edward Curry, Sean
O'Riain, Nicholas Gibbins and Nigel Shadbolt
SWPM Workshop 28.05.2012, Herakleion, Crete
Outline
Motivation
Interactive data transformations (IDTs)
IDT & Provenance
Modelling IDTs
Provenance Representation
Provenance Capture
Case Study
Conclusion
Motivation
Dataspaces:
High number of heterogeneous data sources
Complex data transformation environment
Need for both repeatable data transformations and once-off
transformations
Traditional ETL approaches for data
transformation/integration:
Based on scripting/programming
Focus on repeatable data transformation processes
Interactive Data Transformation (IDTs)
Based on user-interaction paradigms for creating data
transformations
Exploits GUI elements that map to data
transformation operations
Instant feedback on each iteration
Complementary to existing ETL tools
Lowers the barrier for non-programmers to perform data
transformations (reduces programming effort)
Example platforms: Google Refine, Potter's Wheel,
Wrangler
Challenges
How to model IDTs?
Facilitating the reuse of previous IDTs
Representing IDTs
Making IDT platforms provenance-aware
Enabling transportability across IDT and ETL
platforms
Provenance
IDT & Provenance
Provenance supports representation of interactive
data transformations
Output: a provenance descriptor which shows the
relationship between the inputs, the outputs, and
the applied transformation operations
Both retrospective and prospective provenance
IDT
IDT model
Formal model (Algebra for IDT)
Provenance representation
Provenance capture of IDTs
IDT Model: Core Elements
Schema and instance data
Set of predefined operations
GUI elements mapping to predefined operations
User actions
Operation selection
Parameter selection
Operation composition (workflow)
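The core elements above can be illustrated with a minimal sketch. It assumes a table is a list of row dictionaries and that each user action is recorded as an operation name plus its parameters. The names `OPERATIONS`, `mass_edit`, `rename_column` and `apply_workflow` are hypothetical and are not taken from any IDT platform.

```python
from typing import Any, Callable, Dict, List

Row = Dict[str, Any]
Table = List[Row]

# Hypothetical predefined operations; a real IDT platform ships its own set.
def mass_edit(table: Table, column: str, old: str, new: str) -> Table:
    """Replace every cell in `column` equal to `old` with `new`."""
    return [{**row, column: new} if row.get(column) == old else dict(row)
            for row in table]

def rename_column(table: Table, old: str, new: str) -> Table:
    """Rename column `old` to `new` in every row."""
    return [{(new if key == old else key): value for key, value in row.items()}
            for row in table]

# GUI elements would map to entries in this registry (operation selection).
OPERATIONS: Dict[str, Callable[..., Table]] = {
    "mass_edit": mass_edit,
    "rename_column": rename_column,
}

def apply_workflow(table: Table, steps: List[Dict[str, Any]]) -> Table:
    """Operation composition: apply each selected operation in order."""
    for step in steps:
        operation = OPERATIONS[step["operation"]]   # operation selection
        table = operation(table, **step["params"])  # parameter selection
    return table

data = [{"name": "J. Wayne"}, {"name": "Meena Kumari"}]
workflow = [
    {"operation": "mass_edit",
     "params": {"column": "name", "old": "J. Wayne", "new": "John Wayne"}},
    {"operation": "rename_column", "params": {"old": "name", "new": "winner"}},
]
result = apply_workflow(data, workflow)
print(result)
```

Because the workflow is plain data rather than code, it could be stored and reapplied to another table, which is the kind of reuse the Challenges slide asks for.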
IDT Model
Formalizing the mapping from IDT to
Provenance
Definition 1: A provenance-based interactive data
transformation engine consists of a set of
transformations (or activities) on a set of datasets,
generating outputs in the form of other datasets or
events, which may trigger further transformations
Definition 2: An interactive data transformation
event consists of the input dataset, the output
dataset(s), the applied transformation function,
and the time the transformation took place
Definition 3: A run is a function from time to
dataset(s) and the transformation applied to those
dataset(s)
Definition 4: A trace is the sequence of pairs of a
run and the time the run was made
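One possible reading of Definitions 1 to 4 is sketched below. It is an interpretation, not the authors' formal algebra: a dataset is simplified to a tuple of cell values, time is a logical step counter rather than a wall-clock timestamp, and the names `Event`, `Engine`, `run`, `trace` and `replay` are chosen here for illustration.

```python
from dataclasses import dataclass
from typing import Callable, List, Tuple

Dataset = Tuple[str, ...]                 # assumption: a dataset is a tuple of cells
Transform = Callable[[Dataset], Dataset]
Run = Tuple[Dataset, Transform]

@dataclass(frozen=True)
class Event:
    """Definition 2: input, output dataset(s), applied function, and time."""
    input_dataset: Dataset
    output_datasets: Tuple[Dataset, ...]
    function: Transform
    time: int

class Engine:
    """Definition 1: applies transformations to datasets and records events."""

    def __init__(self) -> None:
        self.events: List[Event] = []

    def apply(self, function: Transform, dataset: Dataset) -> Dataset:
        output = function(dataset)
        time = len(self.events) + 1       # logical clock: 1, 2, 3, ...
        self.events.append(Event(dataset, (output,), function, time))
        return output

    def run(self, time: int) -> Run:
        """Definition 3: map a time to the dataset and the transformation applied."""
        if not 1 <= time <= len(self.events):
            raise KeyError(time)
        event = self.events[time - 1]
        return (event.input_dataset, event.function)

    def trace(self) -> List[Tuple[Run, int]]:
        """Definition 4: the sequence of (run, time) pairs."""
        return [(self.run(event.time), event.time) for event in self.events]

def replay(trace: List[Tuple[Run, int]], dataset: Dataset) -> Dataset:
    """Reapply the recorded transformations, in order, to a new dataset."""
    for (_, function), _time in trace:
        dataset = function(dataset)
    return dataset

def strip_cells(dataset: Dataset) -> Dataset:
    return tuple(cell.strip() for cell in dataset)

def upper_cells(dataset: Dataset) -> Dataset:
    return tuple(cell.upper() for cell in dataset)

engine = Engine()
stripped = engine.apply(strip_cells, (" a ", "b "))
cleaned = engine.apply(upper_cells, stripped)
trace = engine.trace()
```

Under this reading the trace doubles as a reusable recipe: `replay(trace, other_dataset)` applies the same sequence of transformations to different data.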
Provenance Representation
Proposed in "Representing Interoperable Provenance
Descriptions for ETL Workflows"
Three-layered provenance model:
Open Provenance Model Vocabulary Layer
Cogs ETL Provenance Vocabulary
Domain-Specific Model Layer
Linked Data standards
Provenance Capture Layers
Provenance Event-Capture Sequence Flow
Case study
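# Prefixes other than grf: are not declared on the slide and are added so the descriptor parses.
@prefix rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#> .
@prefix rdfs: <http://www.w3.org/2000/01/rdf-schema#> .
@prefix xsd: <http://www.w3.org/2001/XMLSchema#> .
@prefix opmv: <http://purl.org/net/opmv/ns#> .
# Assumed namespace for the Cogs ETL provenance vocabulary.
@prefix cogs: <http://vocab.deri.ie/cogs#> .
@prefix grf: <http://127.0.0.1:3333/project/1402144365904/> .
# The '/' in artifact local names is escaped, as Turtle prefixed names require.
grf:MassCellChange-1092380975 rdf:type opmv:Process,
cogs:ColumnOperation, cogs:Transformation;
cogs:operationName "MassCellChange"^^xsd:string;
cogs:programUsed "com.google.refine.operations.cell.MassEditOperation"^^xsd:string;
rdfs:label "Mass edit 1 cells in column ==List of winners=="^^xsd:string.
grf:MassCellChange-1092380975\/1_0 rdf:type opmv:Artifact ;
rdfs:label "* '''1955 [[Meena Kumari]]'[[Parineeta (1953 film)|Parineeta]]''''' as '''Lolita'''"^^xsd:string.
grf:MassCellChange-1092380975\/1_1 rdf:type opmv:Artifact;
rdfs:label "* '''John Wayne'''"^^xsd:string.
grf:MassCellChange-1092380975\/1_1 opmv:wasDerivedFrom grf:MassCellChange-1092380975\/1_0.
grf:MassCellChange-1092380975 opmv:used grf:MassCellChange-1092380975\/1_0.
grf:MassCellChange-1092380975\/1_1 opmv:wasGeneratedBy grf:MassCellChange-1092380975.
grf:MassCellChange-1092380975\/1_1 opmv:wasGeneratedAt "2011-11-16T11:02:14"^^xsd:dateTime.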
Implementation over the Google Refine (GR) platform
Example descriptor, annotated with: Process, Input Artifact,
Output Artifact, Workflow structure, Mapping to the actual program
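The structure of a descriptor like the one in this case study can be generated with plain string building. This is only a sketch under stated assumptions: it is not the Google Refine extension's code, the function names `iri`, `describe_operation` and `to_turtle` are hypothetical, and full IRIs in angle brackets are used for the subjects to sidestep escaping of `/` in prefixed names. The prefixes `rdf:`, `xsd:`, `opmv:` and `cogs:` are assumed to be declared elsewhere in the output.

```python
from datetime import datetime
from typing import List, Tuple

Triple = Tuple[str, str, str]

BASE = "http://127.0.0.1:3333/project/1402144365904/"  # project base from the example

def iri(local_name: str) -> str:
    """Build a full IRI reference under the project base."""
    return f"<{BASE}{local_name}>"

def describe_operation(process: str, program: str, input_artifact: str,
                       output_artifact: str, generated_at: datetime) -> List[Triple]:
    """Relate one operation (process) to its input and output artifacts."""
    p, i, o = iri(process), iri(input_artifact), iri(output_artifact)
    return [
        (p, "rdf:type", "opmv:Process"),
        (p, "cogs:programUsed", f'"{program}"^^xsd:string'),  # mapping to the actual program
        (i, "rdf:type", "opmv:Artifact"),                     # input artifact
        (o, "rdf:type", "opmv:Artifact"),                     # output artifact
        (p, "opmv:used", i),                                  # workflow structure
        (o, "opmv:wasGeneratedBy", p),
        (o, "opmv:wasDerivedFrom", i),
        (o, "opmv:wasGeneratedAt", f'"{generated_at.isoformat()}"^^xsd:dateTime'),
    ]

def to_turtle(triples: List[Triple]) -> str:
    """Serialize triples one statement per line."""
    return "\n".join(f"{s} {p} {o} ." for s, p, o in triples)

triples = describe_operation(
    process="MassCellChange-1092380975",
    program="com.google.refine.operations.cell.MassEditOperation",
    input_artifact="MassCellChange-1092380975/1_0",
    output_artifact="MassCellChange-1092380975/1_1",
    generated_at=datetime(2011, 11, 16, 11, 2, 14),
)
turtle = to_turtle(triples)
print(turtle)
```

Using `datetime.isoformat()` yields zero-padded time fields, which the `xsd:dateTime` lexical form requires.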
Conclusion
The proposed approach has low impact on the
existing IDT process
Provenance representation supports different data
models
Preliminary implementation of a Google Refine
provenance extension