
Data Quality Meeting Alex Poulovassilis

Page 1: Data Quality Meeting Alex Poulovassilis

19 January 2007

Data Quality Meeting
Alex Poulovassilis

Page 2: Data Quality Meeting Alex Poulovassilis

19 January 2007

Some current & recent research projects

AutoMed (EPSRC, BBSRC, MoD)
– has developed tools for semi-automatic transformation and integration of heterogeneous information sources
– provides a single framework for data cleansing/transformation/integration
– can handle both structured and semi-structured (RDF/S, XML) data; virtual, materialised and hybrid integration scenarios; bottom-up, top-down and P2P data integration

ISPIDER (BBSRC)
– is developing an integrated platform of proteomic data sources
– in collaboration with groups at EBI, Manchester, UCL
– is using AutoMed, in conjunction with OGSA-DAI, DQP, Taverna, to support biological data integration and web service interoperability

Page 3: Data Quality Meeting Alex Poulovassilis

19 January 2007

Some current & recent research projects

SeLeNe (EU)
– technologies for syndication and personalisation of learning resources:
• semantic reconciliation and integration of heterogeneous educational metadata
• structured and unstructured querying of learning object descriptions, including through virtual views (RQL/RVL)
• automatic propagation/notification of changes in the descriptions of learning objects – our XML and RDF ECA rule processing languages and systems were developed in this context

L4All and MyPlan (JISC)
– new techniques to support personalised planning of lifelong learning
– developing a system that allows users to record and share learning pathways through courses and modules in the London area
– in collaboration with IoE, Community College Hackney, UCAS, LearnDirect, Linking London Lifelong Learning Network

Page 4: Data Quality Meeting Alex Poulovassilis

19 January 2007

The AutoMed Project

Partners: Birkbeck and Imperial Colleges
Data integration based on schema equivalence
Low-level metamodel, the Hypergraph Data Model (HDM), in terms of which higher-level modelling languages are defined – extensible therefore with new modelling languages
Automatically provides a set of primitive equivalence-preserving schema transformations for higher-level modelling languages:
• addT(c,q,C)  deleteT(c,q,C)  renameT(c,n,n’,C)
There are also two more primitive transformations for capturing imprecise knowledge:
• extendT(c,Range q q’,C)  contractT(c,Range q q’,C)
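A minimal sketch of how a pathway built from these primitives might be represented. This is illustrative only: AutoMed itself is a Java toolkit, and the Step/Pathway names below are hypothetical, not its real classes.

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class Step:
    kind: str                         # "add" | "delete" | "rename" | "extend" | "contract"
    construct: str                    # schema construct c, e.g. "<<student,id>>"
    query: Optional[str] = None       # IQL query q (add/delete)
    old_name: Optional[str] = None    # n  (rename only)
    new_name: Optional[str] = None    # n' (rename only)
    lower: Optional[str] = None       # Range lower bound q  (extend/contract)
    upper: Optional[str] = None       # Range upper bound q' (extend/contract)
    constraint: Optional[str] = None  # optional constraint C

# A transformation pathway between two schemas is an ordered list of steps.
Pathway = list[Step]
```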

Page 5: Data Quality Meeting Alex Poulovassilis

19 January 2007

AutoMed Features

Schema transformations are automatically reversible (see the sketch below):
• addT/deleteT(c,q,C) by deleteT/addT(c,q,C)
• extendT(c,Range q1 q2,C) by contractT(c,Range q1 q2,C)
• renameT(c,n,n’,C) by renameT(c,n’,n,C)

Hence bi-directional transformation pathways (more generally networks) are defined between schemas i.e. both-as-view (BAV) transformation/integration

The queries within transformations allow automatic data translation, query translation and data lineage tracing

Schemas may or may not have a data source associated with them; thus, virtual, materialised or hybrid integration can be supported
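A rough illustration of the reversibility rules above, reusing the hypothetical Step/Pathway types sketched earlier (not the toolkit's actual code):

```python
def reverse_step(s: Step) -> Step:
    # renameT(c,n,n',C) is undone by renameT(c,n',n,C)
    if s.kind == "rename":
        return Step("rename", s.construct, old_name=s.new_name,
                    new_name=s.old_name, constraint=s.constraint)
    # add/delete swap roles but keep the same query q;
    # extend/contract swap roles but keep the same Range q q'
    inverse = {"add": "delete", "delete": "add",
               "extend": "contract", "contract": "extend"}[s.kind]
    return Step(inverse, s.construct, query=s.query,
                lower=s.lower, upper=s.upper, constraint=s.constraint)

def reverse_pathway(p: Pathway) -> Pathway:
    # a pathway S -> S' is reversed by inverting each step and reversing the order,
    # giving the pathway S' -> S (hence bi-directional, BAV, pathways)
    return [reverse_step(s) for s in reversed(p)]
```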

Page 6: Data Quality Meeting Alex Poulovassilis

19 January 2007

Schema Transformation/Integration Networks

[Diagram: local schemas LS1 … LSn are each linked by transformation pathways to union-compatible schemas US1 … USn, which are linked to one another by id transformation steps and to the global schema GS]

Page 7: Data Quality Meeting Alex Poulovassilis

19 January 2007

AutoMed Architecture

[Architecture diagram: Global Query Processor, Global Query Optimiser, Schema Evolution Tool, Schema Transformation and Integration Tools, Model Definition Tool, Schema and Transformation Repository, Model Definitions Repository, Wrappers]

Page 8: Data Quality Meeting Alex Poulovassilis

19 January 2007

[Diagram: data warehousing scenario – data source schemas DSS1–DSS4 pass through the phases Transforming (TS1–TS4), Single-Source Cleaning (SS1–SS4), Multi-Source Cleaning (MS1, MS2), Integrating (detailed schema DS within the data warehouse schema DWS), Summarizing (summary views SV1–SV5) and Creating Data Marts (DMS1)]

DWS: Data Warehouse Schema; DS: Detailed Schema; SV: Summary View; DMS: Data Mart Schema
DSS: Data Source Schema; TS: Transformed Schema; SS: Single-Cleaned Schema; MS: Multi-Cleaned Schema

Data warehousing scenario

Page 9: Data Quality Meeting Alex Poulovassilis

19 January 2007

ISPIDER Project

Partners: Birkbeck, EBI, Manchester, UCL
Aims:
• Vast, heterogeneous biological data
• Need for interoperability
• Need for efficient processing
• Development of Proteomics Grid Infrastructure: use existing proteomics resources and develop new ones, develop new proteomics clients for querying, visualisation, workflow etc.

Page 10: Data Quality Meeting Alex Poulovassilis

19 January 2007

Project Aims

Page 11: Data Quality Meeting Alex Poulovassilis

19 January 2007

myGrid / DQP / AutoMed

myGrid: collection of services/components allowing high-level integration of data/applications for in-silico biology experiments
DQP:
• OGSA-DAI (Open Grid Services Architecture Data Access and Integration)
• Distributed query processing over OGSA-DAI enabled resources
AutoMed + DQP: interoperation for integration and query processing over heterogeneous data resources
AutoMed + myGrid: interoperation for processing workflows incorporating heterogeneous services and resources

Page 12: Data Quality Meeting Alex Poulovassilis

19 January 2007

Recent/current AutoMed research

Using AutoMed for virtual data integration:
• BAV query processing: integrates GAV and LAV techniques
• supporting source or target schema evolution
Using AutoMed for materialised data integration:
• incremental view maintenance
• data lineage tracing
Lucas Zamboulis has been working on techniques for automatically transforming and integrating XML data
Has also investigated using correspondences to ontologies – RDFS schemas – to enhance these techniques

Page 13: Data Quality Meeting Alex Poulovassilis

19 January 2007

Other recent/ongoing AutoMed research

Dean Williams has been working on extracting structure from unstructured text sources

The aim here is to integrate information extracted from unstructured text with structured information available from other sources, using IE techniques in conjunction with AutoMed

Dean has used existing IE technology (the GATE tool from Sheffield) for the text annotation and IE part of this work

P2P query and update processing over AutoMed pathways
Extension with ECA rules and a P2P ECA rule execution engine – Sandeep Mittal – will allow automatic propagation of updates, e.g. for view and constraint maintenance

Planning to undertake further investigation of constraints and conditional data transformation/integration

Page 14: Data Quality Meeting Alex Poulovassilis

19 January 2007

Some possible synergies with the proposed data quality project

AutoMed & BAV provide a single framework to support data cleansing, transformation and integration

Applicable in a broad range of integration scenarios (top-down, bottom-up, P2P; virtual, materialised, hybrid)

Schema transformations can, optionally, be accompanied by a constraint, giving the possibility of investigating conditional data transformation and integration

Schema transformations can be used to propagate data forwards (view maintenance) and backwards (lineage tracing) – it would be interesting to see what other information could be propagated e.g. accuracy and timeliness of data

Flexible global query processing could be used to support imprecise/incomplete data integration

Page 15: Data Quality Meeting Alex Poulovassilis

19 January 2007

Extra slides

Page 16: Data Quality Meeting Alex Poulovassilis

19 January 2007

Schema Transformation/Integration Networks (cont’d)

On the previous slide:
• GS is a global schema
• LS1, …, LSn are local schemas
• US1, …, USn are union-compatible schemas
• the transformation pathways between each pair LSi and USi may consist of add, delete, rename, extend and contract primitive transformations, operating on any modelling construct defined in the AutoMed Model Definitions Repository
• the transformation pathway between USi and GS is similar
• the transformation pathway between each pair of union-compatible schemas consists of id transformation steps
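A small sketch of how these pieces compose, reusing the hypothetical Pathway type and reverse_pathway function from the earlier sketches: a pathway from the global schema down to any local schema can be assembled by reversing and concatenating the stored pathways. This only illustrates the idea, not AutoMed's implementation.

```python
def compose(*pathways: Pathway) -> Pathway:
    # concatenating pathways S1->S2 and S2->S3 gives a pathway S1->S3
    combined: Pathway = []
    for p in pathways:
        combined.extend(p)
    return combined

def global_to_local(i, local_to_union: dict, union_to_global: dict) -> Pathway:
    # GS -> USi -> LSi, obtained by reversing the stored USi->GS and LSi->USi pathways
    return compose(reverse_pathway(union_to_global[i]),
                   reverse_pathway(local_to_union[i]))
```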

Page 17: Data Quality Meeting Alex Poulovassilis

19 January 2007

Comparison with GAV & LAV Data Integration

Global-As-View (GAV) approach: specify GS constructs by view definitions over LSi constructs

Local-As-View (LAV) approach: specify LS constructs by view definitions over GS constructs

[Diagram: local schemas (an RDB, an XML file, RDF) each related to the Global Schema by a View Definition]

Page 18: Data Quality Meeting Alex Poulovassilis

19 January 2007

GAV Example

student(id,name,left,degree) = [ x,y,z,w | x,y,z,w,_ug  x,_,_,_,_phd  x,y,z,w,_phd  w = ‘phd’]
monitors(sno,id) = [ x,y | x,_,_,_,yug  x,_,_,_,_phd  x,ysupervises]
staff(sno,sname,dept) = [ x,y,z | x,y,z,w,_tutor  x,_,_supervisor  x,y,zsupervisor]

Page 19: Data Quality Meeting Alex Poulovassilis

19 January 2007

LAV Example

tutor(sno,sname) = [ x,y | x,y,_staff  x,zmonitors  z,_,_,wstudent  w ‘phd’]
ug(id,name,left,degree,sno) = [ x,y,z,w,v | x,y,z,wstudent  v,xmonitors  w ‘phd’]
phd, supervises, supervisor are defined similarly

Page 20: Data Quality Meeting Alex Poulovassilis

19 January 2007

Evolution problems of GAV and LAV

GAV does not readily support evolution of local schemas e.g. adding an ‘age’ attribute to ‘phd’ invalidates some of the global view definitions

In LAV, changes to a local schema impact only the derivation rules defined for that schema e.g. adding an ‘age’ attribute to ‘phd’ affects only the rule defining ‘phd’

But LAV has problems if one wants to evolve the global schema since all the rules defining local schema constructs in terms of the global schema would need to be reviewed

These problems are exacerbated in P2P data integration scenarios where there is no distinction between local and global schemas

Page 21: Data Quality Meeting Alex Poulovassilis

19 January 2007

AutoMed approach, ‘Growing’ Phase (assuming initially a schema U = S1 + S2)

addRel(<<student,id>>, [x | x <<ug,id>>  x <<phd,id>>])
addAtt(<<student,name>>, [<x,y> | (<x,y> <<ug,name>>  x <<phd,id>>)  <x,y> <<phd,name>>])
addAtt(<<student,left>>, [<x,y> | (<x,y> <<ug,left>>  x <<phd,id>>)  <x,y> <<phd,left>>]) …

Page 22: Data Quality Meeting Alex Poulovassilis

19 January 2007

AutoMed approach, Shrinking Phase (cont’d)

contrAtt(<<phd,title>>, Range Void Any)
delAtt(<<phd,left>>, [<x,y> | <x,y> <<student,left>>  x <<phd,id>>])
delAtt(<<phd,name>>, [<x,y> | <x,y> <<student,name>>  x <<phd,id>>])
delRel(<<phd,id>>, [x | x <<student,id>>  <x,’phd’> <<student,degree>>])
Similarly deletions for supervises and supervisor

Page 23: Data Quality Meeting Alex Poulovassilis

19 January 2007

AutoMed approach, `Shrinking’ Phase

contrAtt(<<tutor,sname>>, Range [<x,y> | <x,y> <<staff,sname>> <z,x> <<ug,sno>>] Any)

contrRel(<<tutor,sno>>, Range [x | x<<staff,sno>> <z,x> <<ug,sno>>] Any)

Similarly contractions for the ug attributes and relation

Page 24: Data Quality Meeting Alex Poulovassilis

19 January 2007

Schema Evolution in BAV

Unlike GAV/LAV/GLAV, the BAV framework readily supports the evolution of both local and global schemas

The evolution of the global or local schema is specified by a schema transformation pathway from the old to the new schema

For example, the figure on the right shows transformation pathways T from an old to a new global or local schema

[Diagram: a transformation pathway T from the old Global Schema S to the New Global Schema S’, and likewise a pathway T from the old Local Schema S to the New Local Schema S’]

Page 25: Data Quality Meeting Alex Poulovassilis

19 January 2007

Global Schema Evolution

Each transformation step t in T: S → S’ is considered in turn:
• if t is an add, delete or rename then schema equivalence is preserved and there is nothing further to do (except perhaps optimise the extended transformation pathway); the extended pathway can be used to regenerate the necessary GAV or LAV views
• if t is a contract then there will be information present in S that is no longer available in S’; again there is nothing further to do
• if t is an extend then domain knowledge is required to determine if the new construct in S’ can in fact be derived from existing constructs; if not, there is nothing further to do; if yes, the extend step is replaced by an add step
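A rough sketch of this per-step case analysis, using the hypothetical Step/Pathway types from the earlier sketches; derive_from_domain_knowledge stands in for the domain input mentioned above and is not a real AutoMed function.

```python
def extend_pathway(old_to_new: Pathway, derive_from_domain_knowledge) -> Pathway:
    extended: Pathway = []
    for t in old_to_new:
        if t.kind in ("add", "delete", "rename", "contract"):
            # equivalence preserved (add/delete/rename), or information simply
            # no longer available in S' (contract): nothing further to do
            extended.append(t)
        elif t.kind == "extend":
            q = derive_from_domain_knowledge(t.construct)
            if q is None:
                extended.append(t)          # new construct genuinely not derivable
            else:
                # replace the extend step by an add step with the derived query
                extended.append(Step("add", t.construct, query=q,
                                     constraint=t.constraint))
    return extended
```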

Page 26: Data Quality Meeting Alex Poulovassilis

19 January 2007

Local Schema Evolution

This is a bit more complicated as it may require changes to be propagated also to the global schema(s)

Again each transformation step t in T: S → S’ is considered in turn

In the case that t is an add, delete, rename or contract step, the evolution can be carried out automatically

If it is an extend, then domain knowledge is required
See our CAiSE’02, ICDE’03 and ER’04 papers for more details
The last of these discusses a materialised data integration scenario where the old/new global/local schemas have an extent

Page 27: Data Quality Meeting Alex Poulovassilis

19 January 2007

Global Query Processing

We handle query language heterogeneity by translation into/from a functional intermediate query language – IQL

A query Q expressed in a high-level query language on a schema S is first translated into IQL (this functionality is not yet supported in the AutoMed toolkit)

View definitions are derived from the transformation pathways between S and the requested data source schemas

These view definitions are substituted into Q, reformulating it into an IQL query over source schema constructs
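A deliberately simplified sketch of this view-substitution step; real IQL reformulation works on parsed query expressions rather than strings, and the construct names and definitions shown are invented for illustration.

```python
def reformulate(iql_query: str, view_definitions: dict) -> str:
    """Substitute each global-schema construct by its view definition over
    source-schema constructs, yielding an IQL query on the sources."""
    for construct, definition in view_definitions.items():
        iql_query = iql_query.replace(construct, "(" + definition + ")")
    return iql_query

# e.g. (hypothetical constructs and definitions):
# reformulate("[x | x <- <<student,id>>]",
#             {"<<student,id>>": "<<ug,id>> ++ <<phd,id>>"})
```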

Page 28: Data Quality Meeting Alex Poulovassilis

19 January 2007

Global Query Processing (cont’d)

Query optimisation (currently algebraic) and query evaluation then occur

During query evaluation, the evaluator submits to each wrapper the sub-queries that it is able to translate into its local query language. Currently, AutoMed supports wrappers for SQL, OQL, XPath, XQuery and flat-file data sources

The wrappers translate sub-query results back into the IQL type system

Further query post-processing then occurs in the IQL evaluator
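A sketch of the wrapper role just described, with an assumed interface (not AutoMed's actual wrapper API); the SQL case shows the general shape of translating a sub-query and mapping results back into IQL's type system.

```python
from abc import ABC, abstractmethod

class Wrapper(ABC):
    @abstractmethod
    def translate(self, iql_subquery: str) -> str:
        """Translate an IQL sub-query into the source's local query language."""

    @abstractmethod
    def execute(self, local_query: str):
        """Evaluate the local query and return results as IQL-typed values."""

class SQLWrapper(Wrapper):
    def __init__(self, connection):
        self.connection = connection          # assumed DB-API style connection

    def translate(self, iql_subquery: str) -> str:
        raise NotImplementedError("comprehension-to-SQL translation goes here")

    def execute(self, local_query: str):
        cursor = self.connection.cursor()
        cursor.execute(local_query)
        # rows come back as tuples, i.e. values of IQL's tuple/bag types
        return [tuple(row) for row in cursor.fetchall()]
```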

Page 29: Data Quality Meeting Alex Poulovassilis

19 January 2007

Other AutoMed research at Imperial

Automatic generation of equivalences between different data models
A graphical schema & transformations editor
Data mining techniques for extracting schema equivalences
Optimising schema transformation pathways

Page 30: Data Quality Meeting Alex Poulovassilis

19 January 2007

DQP – AutoMed Interoperability

Data sources wrapped with OGSA-DAI

AutoMed OGSA-DAI wrappers extract data sources’ metadata

Semantic integration of data sources using AutoMed transformation pathways into an integrated AutoMed schema

IQL queries submitted to this integrated schema are:
• Reformulated to IQL queries on the data sources, using the AutoMed transformation pathways
• Submitted to DQP for evaluation

[Architecture diagram: data sources (DBs) are exposed through OGSA-DAI services and activities; AutoMed wrappers create AutoMed schemas in the AutoMed repository, which are integrated into an Integrated AutoMed Schema; the AutoMed Query Processor accepts IQL queries and returns IQL results, exchanging OQL queries and results with the Distributed Query Processor via an AutoMed DQP wrapper]

Page 31: Data Quality Meeting Alex Poulovassilis

19 January 2007

Data source schema extraction

AutoMed wrapper requests the schema of the data source using an OGSA-DAI service

The service replies with the source schema encoded in XML

The AutoMed wrapper creates the corresponding schema in the AutoMed repository
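A sketch of this extraction step. The XML element and attribute names below are invented for illustration; the real OGSA-DAI response format differs.

```python
import xml.etree.ElementTree as ET

def schema_from_xml(xml_response: str) -> dict:
    """Parse a (hypothetical) <schema><table name=...><column name=.../></table></schema>
    document into {table_name: [column_name, ...]}, ready to be registered as an
    AutoMed schema."""
    root = ET.fromstring(xml_response)
    schema = {}
    for table in root.findall("table"):
        schema[table.get("name")] = [col.get("name") for col in table.findall("column")]
    return schema
```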

[Diagram: the AutoMed wrapper sends a schema request to the OGSA-DAI service fronting the DB, receives an XML response, and creates the corresponding AutoMed schema]

Page 32: Data Quality Meeting Alex Poulovassilis

19 January 2007

Using AutoMed in the BioMap Project

Relational/XML data sources containing protein sequence, structure, function and pathway data; gene expression data; other experimental data

Wrapping of data sources
Translation of source and global schemas into AutoMed’s XML schema

Domain expert provides matchings between constructs in source and global schemas

Automatic schema restructuring, with automatic generation of schema transformation pathways

See DILS’05 paper for more details

[Diagram: an XML file and two relational databases are wrapped (XML Wrapper, RDB Wrappers) as AutoMed XMLDSS and relational schemas; transformation pathways connect these to an AutoMed Integrated Schema, which is linked through an Integrated Database Wrapper to an Integrated Database]

Page 33: Data Quality Meeting Alex Poulovassilis

19 January 2007

The London Knowledge Lab

Institute of Education, University of London – Birkbeck College, University of London
Purpose-designed building; Science Research Infrastructure Fund: £6m
Research staff and students: 50; Location: Bloomsbury; Open: June 2004
Social scientists: experts in education, sociology, culture and media, semiotics, philosophy, knowledge management ...
Computer scientists: experts in information systems, information management, web technologies, personalisation, ubiquitous technologies …

Page 34: Data Quality Meeting Alex Poulovassilis

19 January 2007

LKL Research Themes

Research at the London Knowledge Lab consists mainly of externally funded projects (EU, EPSRC, ESRC, AHRB, BBSRC, JISC, Wellcome Trust) – currently about 25 projects.

Four broad themes guide our work and inform our research strategy:

• new forms of knowledge

• turning information into knowledge

• the changing cultures of new media

• creating empowering technologies for formal and informal learning

Page 35: Data Quality Meeting Alex Poulovassilis

19 January 2007

Turning Information Into Knowledge

• The need to cope with ubiquitous, complex, incomplete and inconsistent information is pervasive in our societies

• How can people benefit from this information in their learning, working and social lives?

• What new techniques are necessary for managing, accessing, integrating and personalising such information?

• How to design and build tools that help people to understand such information and generate new knowledge from it?

