Quality views: capturing and exploiting the user perspective on data quality
Paolo Missier, Suzanne Embury, Mark Greenwood
School of Computer Science, University of Manchester, UK
Alun Preece, Binling Jin
Department of Computing Science, University of Aberdeen, UK
http://www.qurator.org
Combining the strengths of UMIST and The Victoria University of Manchester
Integration of public data (in biology)
GenBank
UniProt
EnsEMBL
Entrez
dbSNP
• Large volumes of data in many public repositories
• Increasingly creative uses for this data
• Their quality is largely unknown
Quality of e-science data
Defining quality can be challenging:
• In-silico experiments express cutting-edge research
– Experimental data liable to change rapidly
– Definitions of quality are themselves experimental
• Scientists’ quality requirements often just a hunch
– Quality tests missing or based on experimental heuristics
– Often implicit and embedded in the experiment, hence not reusable
A data consumer’s view on quality:
Criteria for data acceptability within a specific data processing context
Example: protein identification
Data output
Protein identification algorithm
“Wet lab” experiment
Reference databases
Protein Hitlist
Protein function prediction
Quality filtering:
• Remove likely false positives
• Improve prediction accuracy
Goal: to explicitly define and automatically add the additional filtering step in a principled way
Support evidence: provenance metadata
Our goals
Offer e-scientists a principled way to:
• Discover quality definitions for specific data domains
• Make them explicit using a formal model
• Implement them in their data processing environment
• Test them on their data
… in an incremental refinement cycle
Benefits:
• Automated processing
• Reusability
• “plug-in” quality components
Approach
Research hypothesis:
Adding quality to data can be made cost-effective
– By separating out generic quality processing from domain-specific definitions
Qurator architectural framework:
• Define abstract quality views on the data
• Map quality view to an executable process
• Execute quality views
– Runtime environment
– Data-specific quality services
Abstract quality view model
[Diagram: abstract quality view model]
Data
Data annotation with quality metadata
Evidence: e1, e2, e3 (e.g. Coverage, PeptidesCount)
Assertions:
• Classification 1 over class space 1 (C11, C12, …)
• Classification 2 over class space 2 (C21, C22, …)
Conditions: regions specification
Actions on regions
Semantic model for quality concepts
Quality “upper ontology” (OWL)
Evidence annotations are class instances
Quality evidence types
Evidence meta-data model (RDF)
Quality hypotheses discovery and testing
[Diagram: quality hypothesis refinement cycle]
Abstract quality view
Targeted compilation
Target-specific quality component
Deployment
Quality-enhanced user environment
Execution on test data
Performance assessment
Multiple target environments:
• Workflow
• Query processor
Generic quality process pattern
Collect evidence
– Fetch persistent annotations (persistent evidence)
– Compute on-the-fly annotations
<variables>
  <var variableName="Coverage" evidence="q:Coverage"/>
  <var variableName="PeptidesCount" evidence="q:PeptidesCount"/>
</variables>
Compute assertions (classifiers)
<QualityAssertion serviceName="PIScoreClassifier" serviceType="q:PIScoreClassifier"
    tagSemType="q:PIScoreClassification" tagName="ScoreClass"/>
Evaluate conditions, execute actions
<action>
  <filter>
    <condition>ScoreClass in {"q:high", "q:mid"} and Coverage > 12</condition>
  </filter>
</action>
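The pattern on this slide (collect evidence, compute assertions, evaluate conditions, execute actions) can be sketched in a few lines of Python. This is only an illustration, not the Qurator implementation: the `Item` class, the function names, the classifier thresholds and the `q:low` class are assumptions made for the example. Only `Coverage`, `PeptidesCount`, `ScoreClass`, `q:high`, `q:mid` and the filter condition come from the slide.

```python
from dataclasses import dataclass, field


@dataclass
class Item:
    """A data item (e.g. one protein hit) with its quality metadata."""
    item_id: str
    evidence: dict = field(default_factory=dict)  # e.g. {"Coverage": 20.0}
    tags: dict = field(default_factory=dict)      # e.g. {"ScoreClass": "q:high"}


def collect_evidence(items, persistent, on_the_fly):
    """Step 1: fetch stored annotations, then compute any missing ones."""
    for item in items:
        item.evidence.update(persistent.get(item.item_id, {}))
        for name, compute in on_the_fly.items():
            item.evidence.setdefault(name, compute(item))
    return items


def score_classifier(item):
    """Hypothetical stand-in for PIScoreClassifier; thresholds are invented."""
    count = item.evidence.get("PeptidesCount", 0)
    if count >= 10:
        return "q:high"
    if count >= 5:
        return "q:mid"
    return "q:low"


def compute_assertions(items, assertions):
    """Step 2: run each classifier and store its result under its tag name."""
    for item in items:
        for tag_name, classifier in assertions.items():
            item.tags[tag_name] = classifier(item)
    return items


def condition(item):
    """Step 3: ScoreClass in {q:high, q:mid} and Coverage > 12."""
    return (item.tags["ScoreClass"] in {"q:high", "q:mid"}
            and item.evidence.get("Coverage", 0) > 12)


def run_quality_view(items, persistent, on_the_fly, assertions, condition):
    """Step 4: the action here is a filter that keeps items meeting the condition."""
    items = collect_evidence(items, persistent, on_the_fly)
    items = compute_assertions(items, assertions)
    return [item for item in items if condition(item)]


# Made-up test data: three protein hits with stored evidence.
hits = [Item("P1"), Item("P2"), Item("P3")]
persistent = {
    "P1": {"Coverage": 20.0, "PeptidesCount": 12},
    "P2": {"Coverage": 8.0, "PeptidesCount": 11},
    "P3": {"Coverage": 30.0, "PeptidesCount": 2},
}
accepted = run_quality_view(
    hits, persistent, {}, {"ScoreClass": score_classifier}, condition)
accepted_ids = [item.item_id for item in accepted]
```

With this data only P1 survives the filter: P2 fails the Coverage test and P3 is classified below `q:mid`.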
Bindings: assertion service
Service class: PIScoreClassifier
Web service endpoint: http://localhost/axis/services/PIScoreClassifierSvc
All services implement the same WSDL interface
• Makes concrete assertion functions homogeneous
• Facilitates compilation
• Uniform input / output messages
[Diagram: PIScoreClassifierSvc and PI_Top_k_svc behind a common WSDL interface, looked up through the service registry]
Input: D = {(d_i, evidence(d_i))}
Output: {class(d_i)} or {score(d_i)}
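The idea of a uniform interface behind a registry can be illustrated as follows. This is a sketch under assumptions: in-process Python callables stand in for the web services, the registry key `PITopK` and both decision rules are invented for the example, and nothing here reflects the actual WSDL messages.

```python
from typing import Callable, Dict, List, Tuple

Evidence = Dict[str, float]
DataSet = List[Tuple[str, Evidence]]  # D = {(d_i, evidence(d_i))}
AssertionService = Callable[[DataSet], Dict[str, object]]  # d_i -> class or score

# Hypothetical registry: service class name -> implementation.
REGISTRY: Dict[str, AssertionService] = {}


def register(service_class: str):
    """Decorator that binds a service class name to an implementation."""
    def wrap(fn: AssertionService) -> AssertionService:
        REGISTRY[service_class] = fn
        return fn
    return wrap


@register("PIScoreClassifier")
def pi_score_classifier(data: DataSet) -> Dict[str, object]:
    # Invented rule: returns a class label per data item.
    return {d: ("q:high" if ev.get("Coverage", 0) > 12 else "q:low")
            for d, ev in data}


@register("PITopK")  # hypothetical name for the PI_Top_k_svc service class
def pi_top_k(data: DataSet) -> Dict[str, object]:
    # Invented rule: returns a numeric score per data item.
    return {d: ev.get("PeptidesCount", 0.0) for d, ev in data}


def invoke(service_class: str, data: DataSet) -> Dict[str, object]:
    """Every service takes the same input shape, so callers need no special cases."""
    return REGISTRY[service_class](data)


D = [("P1", {"Coverage": 20.0, "PeptidesCount": 12.0}),
     ("P2", {"Coverage": 8.0, "PeptidesCount": 3.0})]
classes = invoke("PIScoreClassifier", D)
scores = invoke("PITopK", D)
```

Because both services accept the same input and return one value per data item, a compiler can generate the calling code once and swap services by changing only the binding.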
Execution model for Quality views
Binding and compilation produce an executable component:
– Sub-flow of an existing workflow
– Query processing interceptor
[Diagram: the QV compiler takes an abstract quality view, together with the services registry and services implementation of the Qurator quality framework, and produces a quality workflow embedded in the host workflow]
Host workflow: D → D’
Embedded quality workflow: D’ → quality view on D’
Example: original proteomics workflow
Taverna (*): workflow language and enactment engine for e-science applications
(*) part of the myGrid project, University of Manchester - taverna.sourceforge.net
Quality flow embedding point
Example: embedded quality workflow
Interactive conditions / actions
Quality views for queries
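[Diagram: a Quality View manager sits between the query client and the query processor. The result R of query Q over the data passes through annotate (with evidence), assert and act steps; the labels R1 and dump mark the outputs.]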
Actions: filtering, dump to DB / file
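A query interceptor of this kind can be sketched as a wrapper around the query function. Everything below is hypothetical: the in-memory table, the evidence values, the function names and the classification rule are assumptions made for the example, and a Python list stands in for the "dump to DB / file" action.

```python
def make_quality_aware(run_query, annotate, assert_quality, act):
    """Wrap a query function so its result passes through annotate -> assert -> act."""
    def quality_aware_query(query):
        result = run_query(query)                        # R
        annotated = [annotate(row) for row in result]
        asserted = [assert_quality(row) for row in annotated]
        return act(asserted)
    return quality_aware_query


# Made-up data and evidence store.
TABLE = [
    {"id": "P1", "organism": "human"},
    {"id": "P2", "organism": "human"},
    {"id": "P3", "organism": "mouse"},
]
EVIDENCE = {"P1": 20.0, "P2": 8.0, "P3": 30.0}  # Coverage per id


def run_query(organism):
    """Stand-in query processor: select rows by organism."""
    return [dict(row) for row in TABLE if row["organism"] == organism]


def annotate(row):
    """Attach evidence to a result row."""
    return {**row, "Coverage": EVIDENCE.get(row["id"], 0.0)}


def assert_quality(row):
    """Invented rule: classify a row from its Coverage value."""
    return {**row, "ScoreClass": "q:high" if row["Coverage"] > 12 else "q:low"}


dumped = []  # stands in for a dump to DB / file


def act(rows):
    """Filter action: keep high-quality rows and dump the rest."""
    kept = [r for r in rows if r["ScoreClass"] == "q:high"]
    dumped.extend(r for r in rows if r["ScoreClass"] != "q:high")
    return kept


query = make_quality_aware(run_query, annotate, assert_quality, act)
result = query("human")
```

The client calls `query` exactly as it would call `run_query`, which is the point of an interceptor: quality processing is added without changing the client or the query processor.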
Qurator architecture
Summary
For complex data types, there is often no single “correct”, agreed-upon definition of data quality
• Qurator provides an environment for fast prototyping of quality hypotheses
– Based on the notion of “evidence” supporting a quality hypothesis
– With support for an incremental learning cycle
• Quality views offer an abstract model for making data processing environments quality-aware
– To be compiled into executable components and embedded
– Qurator provides an invocation framework for Quality Views
More info and papers: http://www.qurator.org
Live demos (informal) available