
Quality views: capturing and exploiting the user perspective on data quality

Paolo Missier, Suzanne Embury, Mark Greenwood
School of Computer Science, University of Manchester, UK

Alun Preece, Binling Jin
Department of Computing Science, University of Aberdeen, UK

http://www.qurator.org

Combining the strengths of UMIST and The Victoria University of Manchester

Integration of public data (in biology)

Repositories shown: GenBank, UniProt, EnsEMBL, Entrez, dbSNP

• Large volumes of data in many public repositories
• Increasingly creative uses for this data
• Their quality is largely unknown


Quality of e-science data

Defining quality can be challenging:

• In-silico experiments express cutting-edge research

– Experimental data liable to change rapidly

– Definitions of quality are themselves experimental

• Scientists’ quality requirements often just a hunch

– Quality tests missing or based on experimental heuristics

– Often implicit and embedded in the experiment, and therefore not reusable

A data consumer’s view on quality:

Criteria for data acceptability within a specific data processing context


Example: protein identification

Pipeline: “Wet lab” experiment → data output → protein identification algorithm (using reference databases) → protein hitlist → quality filtering → protein function prediction

Quality filtering:

• Remove likely false positives

• Improve prediction accuracy

• Support evidence: provenance metadata

Goal: to explicitly define and automatically add the additional filtering step in a principled way


Our goals

Offer e-scientists a principled way to:

• Discover quality definitions for specific data domains

• Make them explicit using a formal model

• Implement them in their data processing environment

• Test them on their data

… in an incremental refinement cycle

Benefits:

• Automated processing

• Reusability

• “plug-in” quality components


Approach

Research hypothesis: adding quality to data can be made cost-effective

– By separating out generic quality processing from domain-specific definitions

Qurator architectural framework:

• Define abstract quality views on the data

• Map quality view to an executable process

• Execute quality views
  - runtime environment
  - data-specific quality services


Abstract quality view model

• Data

• Data annotation: quality metadata in the form of evidence (e1, e2, e3), for example Coverage and PeptidesCount

• Assertions: Classification 1 over class space 1 (C11, C12, …); Classification 2 over class space 2 (C21, C22, …)

• Conditions: regions specification

• Actions on regions
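The model above can be sketched as a small data structure. This is only an illustration under stated assumptions, not the Qurator implementation: the class names (Assertion, Region, QualityView), the "accept" action, and the PeptidesCount thresholds are all hypothetical; only the evidence names and the ScoreClass tag come from the slides.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List

Evidence = Dict[str, float]   # evidence name -> value, e.g. {"Coverage": 20.0}
Tags = Dict[str, str]         # assertion tag name -> class label


@dataclass
class Assertion:
    """Maps the evidence of one data item to a label from a class space."""
    tag_name: str
    class_space: List[str]
    classify: Callable[[Evidence], str]


@dataclass
class Region:
    """A condition over tags and evidence, plus the action for matching items."""
    condition: Callable[[Tags, Evidence], bool]
    action: str               # hypothetical action name


@dataclass
class QualityView:
    evidence_types: List[str]
    assertions: List[Assertion]
    regions: List[Region]

    def apply(self, evidence: Evidence) -> List[str]:
        """Return the actions triggered for one data item."""
        tags = {a.tag_name: a.classify(evidence) for a in self.assertions}
        return [r.action for r in self.regions if r.condition(tags, evidence)]


def score_class(evidence: Evidence) -> str:
    # Thresholds are invented for illustration only.
    n = evidence["PeptidesCount"]
    if n >= 10:
        return "high"
    return "mid" if n >= 5 else "low"


view = QualityView(
    evidence_types=["Coverage", "PeptidesCount"],
    assertions=[Assertion("ScoreClass", ["high", "mid", "low"], score_class)],
    regions=[
        Region(
            condition=lambda t, e: t["ScoreClass"] in {"high", "mid"} and e["Coverage"] > 12,
            action="accept",
        )
    ],
)

print(view.apply({"Coverage": 20.0, "PeptidesCount": 12}))
```

Keeping the classification function separate from the region condition mirrors the split between assertions and conditions in the model: the same ScoreClass tag could be reused by several regions with different actions.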


Semantic model for quality concepts

• Quality “upper ontology” (OWL)

• Quality evidence types

• Evidence meta-data model (RDF)

• Evidence annotations are class instances
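To make "evidence annotations are class instances" concrete, the sketch below stores annotations as subject/predicate/object triples, with each annotation typed by an evidence class. It is a plain-Python stand-in for an RDF store. The ex: identifiers and the ex:describes and ex:value properties are assumptions made for this example; only q:Coverage and q:PeptidesCount appear in the slides.

```python
RDF_TYPE = "rdf:type"

# Each annotation is an instance of an evidence class and points at a data item.
# All "ex:" names are hypothetical.
triples = {
    ("ex:ann1", RDF_TYPE, "q:Coverage"),
    ("ex:ann1", "ex:describes", "ex:proteinHit42"),
    ("ex:ann1", "ex:value", 20.0),
    ("ex:ann2", RDF_TYPE, "q:PeptidesCount"),
    ("ex:ann2", "ex:describes", "ex:proteinHit42"),
    ("ex:ann2", "ex:value", 7),
}


def evidence_for(item, store):
    """Collect {evidence class: value} for one data item."""
    annotations = {s for s, p, o in store if p == "ex:describes" and o == item}
    result = {}
    for ann in annotations:
        etype = next(o for s, p, o in store if s == ann and p == RDF_TYPE)
        value = next(o for s, p, o in store if s == ann and p == "ex:value")
        result[etype] = value
    return result


print(evidence_for("ex:proteinHit42", triples))
```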


Quality hypotheses discovery and testing

Incremental cycle: abstract quality view → targeted compilation → target-specific quality component → deployment → quality-enhanced user environment → execution on test data → performance assessment → back to the abstract quality view

Multiple target environments:

• Workflow

• Query processor
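One way to picture the "execution on test data" and "performance assessment" steps is to compare candidate quality hypotheses on items whose correctness is already known. The slides do not say which measures Qurator uses; precision and recall are assumed here, and the test data and thresholds are made up.

```python
def assess(accepted, known_good):
    """Precision and recall of a filter against items known to be correct."""
    kept_correct = len(accepted & known_good)
    precision = kept_correct / len(accepted) if accepted else 0.0
    recall = kept_correct / len(known_good) if known_good else 0.0
    return {"precision": precision, "recall": recall}


# Hypothetical test data: item id -> (Coverage, known to be a true hit?)
test_data = {
    "P1": (20.0, True),
    "P2": (8.0, False),
    "P3": (30.0, True),
    "P4": (15.0, False),
}
known_good = {item for item, (_, good) in test_data.items() if good}

# Two candidate hypotheses: "Coverage > 12" versus "Coverage > 25".
results = {}
for threshold in (12, 25):
    accepted = {item for item, (cov, _) in test_data.items() if cov > threshold}
    results[threshold] = assess(accepted, known_good)

print(results)
```

On this toy data the looser threshold keeps every true hit but lets one false positive through, while the stricter one does the opposite; the refinement cycle is about choosing between such trade-offs.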


Generic quality process pattern

• Collect evidence
  - Fetch persistent annotations (persistent evidence)
  - Compute on-the-fly annotations

  <variables>
    <var variableName="Coverage" evidence="q:Coverage"/>
    <var variableName="PeptidesCount" evidence="q:PeptidesCount"/>
  </variables>

• Compute assertions (classifiers)

  <QualityAssertion serviceName="PIScoreClassifier" serviceType="q:PIScoreClassifier" tagSemType="q:PIScoreClassification" tagName="ScoreClass" ...

• Evaluate conditions, execute actions

  <action>
    <filter>
      <condition>ScoreClass in {"q:high", "q:mid"} and Coverage > 12</condition>
    </filter>
  </action>
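The three steps of the pattern can be sketched end to end as follows. This is a minimal illustration, not Qurator code: the persistent store, the way PeptidesCount is computed, and the classifier thresholds are assumptions; the filter condition is the one from the XML fragment above.

```python
# Hypothetical persistent annotation store: item id -> stored evidence.
PERSISTENT = {
    "P1": {"Coverage": 20.0},
    "P2": {"Coverage": 8.0},
    "P3": {"Coverage": 30.0},
}


def collect_evidence(item_id, hits):
    """Step 1: fetch persistent annotations, then compute on-the-fly ones."""
    evidence = dict(PERSISTENT.get(item_id, {}))
    # Assumption: PeptidesCount is derived at run time from the hit itself.
    evidence["PeptidesCount"] = float(len(hits[item_id]))
    return evidence


def score_class(evidence):
    """Step 2: compute the ScoreClass assertion (illustrative thresholds)."""
    n = evidence["PeptidesCount"]
    if n >= 3:
        return "q:high"
    return "q:mid" if n == 2 else "q:low"


def quality_filter(hits):
    """Step 3: evaluate the condition and apply the filter action."""
    kept = []
    for item_id in hits:
        evidence = collect_evidence(item_id, hits)
        tag = score_class(evidence)
        if tag in {"q:high", "q:mid"} and evidence.get("Coverage", 0.0) > 12:
            kept.append(item_id)
    return kept


# Hypothetical hitlist: protein id -> matched peptides.
hits = {"P1": ["a", "b", "c"], "P2": ["a", "b", "c", "d"], "P3": ["a"]}
print(quality_filter(hits))
```

Here P2 is dropped on Coverage and P3 on ScoreClass, so only P1 survives the filter.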


Bindings: assertion service

Service registry:

Service class        Web service endpoint
PIScoreClassifier    http://localhost/axis/services/PIScoreClassifierSvc

All services implement the same WSDL interface

• Makes concrete assertion functions homogeneous

• Facilitates compilation

• Uniform input / output messages

Common WSDL interface (implemented by, for example, PIScoreClassifierSvc and PI_Top_k_svc):

• Input: D = {(d_i, evidence(d_i))}

• Output: {class(d_i)} or {score(d_i)}
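The value of a single interface is that the caller does not need to know which assertion service it is talking to. The sketch below replaces the Web services with in-process Python objects to show the idea. The method name assert_quality, the "Score" evidence key, the q:PITopK service type, and the output labels are assumptions; a real binding would resolve the service class to its endpoint URL and send the same message over the wire.

```python
from typing import Dict, List, Protocol, Tuple

Evidence = Dict[str, float]
DataSet = List[Tuple[str, Evidence]]   # D = {(d_i, evidence(d_i))}


class AssertionService(Protocol):
    """Stand-in for the common interface: one input shape, one output shape."""

    def assert_quality(self, data: DataSet) -> Dict[str, str]: ...


class ScoreClassifier:
    def assert_quality(self, data: DataSet) -> Dict[str, str]:
        # Illustrative threshold only.
        return {d: ("q:high" if ev["Score"] >= 0.8 else "q:low") for d, ev in data}


class TopK:
    def __init__(self, k: int) -> None:
        self.k = k

    def assert_quality(self, data: DataSet) -> Dict[str, str]:
        ranked = sorted(data, key=lambda pair: pair[1]["Score"], reverse=True)
        top = {d for d, _ in ranked[: self.k]}
        return {d: ("q:top" if d in top else "q:rest") for d, _ in data}


# Hypothetical in-process registry: service class -> implementation.
REGISTRY: Dict[str, AssertionService] = {
    "q:PIScoreClassifier": ScoreClassifier(),
    "q:PITopK": TopK(k=1),
}


def invoke(service_type: str, data: DataSet) -> Dict[str, str]:
    """Look up the binding and call it through the shared interface."""
    return REGISTRY[service_type].assert_quality(data)


D = [("d1", {"Score": 0.9}), ("d2", {"Score": 0.4})]
print(invoke("q:PIScoreClassifier", D))
print(invoke("q:PITopK", D))
```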


Execution model for Quality views

Binding → compilation → executable component

– Sub-flow of an existing workflow

– Query processing interceptor

Compilation for a host workflow:

• Inputs to the QV compiler: the abstract quality view, plus the services registry and services implementation from the Qurator quality framework

• Output: a quality workflow embedded in the host workflow

• Host workflow: D → D’

• Quality view on D’
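A rough sketch of embedding a compiled quality view as a sub-flow of a host workflow is shown below. It models a workflow as a list of functions, which is far simpler than a real Taverna workflow; compile_view, embed, and the two host steps are hypothetical names introduced for this example.

```python
from typing import Callable, Dict, List

Item = Dict[str, float]
Step = Callable[[List[Item]], List[Item]]


def compile_view(condition: Callable[[Item], bool]) -> Step:
    """'Compile' an abstract condition into an executable filtering step."""
    def quality_step(items: List[Item]) -> List[Item]:
        return [item for item in items if condition(item)]
    return quality_step


def embed(host: List[Step], quality_step: Step, position: int) -> List[Step]:
    """Insert the quality step at the chosen embedding point of the host workflow."""
    return host[:position] + [quality_step] + host[position:]


def run(workflow: List[Step], data: List[Item]) -> List[Item]:
    for step in workflow:
        data = step(data)
    return data


# Hypothetical host workflow: identification followed by function prediction.
def identify(items: List[Item]) -> List[Item]:
    return list(items)


def predict(items: List[Item]) -> List[Item]:
    return [{**item, "predicted": 1.0} for item in items]


host = [identify, predict]
quality_step = compile_view(lambda item: item["Coverage"] > 12)
embedded = embed(host, quality_step, position=1)

D = [{"Coverage": 20.0}, {"Coverage": 8.0}]
print(run(host, D))       # original workflow: both items reach prediction
print(run(embedded, D))   # embedded quality flow: only the first item does
```

The host steps are left unchanged; only the list of steps differs, which is the sense in which the quality component is "plugged in".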


Example: original proteomics workflow

Taverna (*): workflow language and enactment engine for e-science applications

(*) part of the myGrid project, University of Manchester - taverna.sourceforge.net

Quality flow embedding point


Example: embedded quality workflow


Interactive conditions / actions


Quality views for queries

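Components: data, query processor, query client, Quality View manager

Flow: the query client sends query Q to the query processor, which returns result R; the Quality View manager annotates R with evidence, asserts and acts on it, and returns R1 to the client

Actions: filtering, dump to DB / file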
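The interceptor idea can be illustrated with a decorator that sits between the query client and the query processor. This is a sketch under assumptions: the decorator name, the in-memory table, and the choice to dump rejected rows (rather than, say, all annotated rows) are not taken from the slides.

```python
import functools


def with_quality_view(condition, dump):
    """Wrap a query function so its result R is filtered into R1.

    Rows failing the condition are appended to `dump`, a stand-in
    for the "dump to DB / file" action.
    """
    def decorator(query_fn):
        @functools.wraps(query_fn)
        def wrapper(*args, **kwargs):
            result = query_fn(*args, **kwargs)      # R
            accepted = []
            for row in result:
                if condition(row):
                    accepted.append(row)
                else:
                    dump.append(row)
            return accepted                          # R1
        return wrapper
    return decorator


dumped = []


@with_quality_view(lambda row: row["Coverage"] > 12, dumped)
def run_query(query):
    # Hypothetical query processor: ignores the query text and
    # returns rows already annotated with evidence.
    return [
        {"id": "P1", "Coverage": 20.0},
        {"id": "P2", "Coverage": 8.0},
    ]


R1 = run_query("Q")
print(R1)
print(dumped)
```

The client code calls run_query exactly as before, which is what makes the quality view transparent to the query client.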


Qurator architecture


Summary

• For complex data types, there is often no single “correct” and agreed-upon definition of data quality

• Qurator provides an environment for fast prototyping of quality hypotheses

– Based on the notion of “evidence” supporting a quality hypothesis

– With support for an incremental learning cycle

• Quality views offer an abstract model for making data processing environments quality-aware

– To be compiled into executable components and embedded

– Qurator provides an invocation framework for Quality Views

More info and papers: http://www.qurator.org

Live demos (informal) available
