19 January 2007
Data Quality Meeting
Alex Poulovassilis
Some current & recent research projects
AutoMed (EPSRC, BBSRC, MoD)
– has developed tools for semi-automatic transformation and integration of heterogeneous information sources
– provides a single framework for data cleansing/transformation/integration
– can handle both structured and semi-structured (RDF/S, XML) data; virtual, materialised and hybrid integration scenarios; bottom-up, top-down and P2P data integration
ISPIDER (BBSRC)
– is developing an integrated platform of proteomic data sources
– in collaboration with groups at EBI, Manchester, UCL
– is using AutoMed, in conjunction with OGSA-DAI, DQP and Taverna, to support biological data integration and web service interoperability
Some current & recent research projects
SeLeNe (EU)
– technologies for syndication and personalisation of learning resources:
• semantic reconciliation and integration of heterogeneous educational metadata,
• structured and unstructured querying of learning object descriptions, including through virtual views (RQL/RVL)
• automatic propagation and notification of changes in the descriptions of learning objects – our XML and RDF ECA rule processing languages and systems were developed in this context
L4All and MyPlan (JISC)
– new techniques to support personalised planning of lifelong learning
– developing a system that allows users to record and share learning pathways through courses and modules in the London area
– in collaboration with IoE, Community College Hackney, UCAS, LearnDirect, Linking London Lifelong Learning Network
The AutoMed Project
Partners: Birkbeck and Imperial Colleges
Data integration based on schema equivalence
Low-level metamodel, the Hypergraph Data Model (HDM), in terms of which higher-level modelling languages are defined – therefore extensible with new modelling languages
Automatically provides a set of primitive equivalence-preserving schema transformations for higher-level modelling languages:
• addT(c,q,C) deleteT(c,q,C) renameT(c,n,n’,C)
There are also two more primitive transformations for capturing imprecise knowledge:
• extendT(c,Range q q’,C) contractT(c,Range q q’,C)
AutoMed Features
Schema transformations are automatically reversible:
• addT/deleteT(c,q,C) by deleteT/addT(c,q,C)
• extendT(c,Range q1 q2,C) by contractT(c,Range q1 q2,C)
• renameT(c,n,n’,C) by renameT(c,n’,n,C)
Hence bi-directional transformation pathways (more generally, networks) are defined between schemas, i.e. both-as-view (BAV) transformation/integration
The queries within transformations allow automatic data translation, query translation and data lineage tracing
Schemas may or may not have a data source associated with them; thus, virtual, materialised or hybrid integration can be supported
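The reversibility rules on this slide can be sketched in a few lines of Python. This is an illustration only: the `Step` class, its field names and the example pathway are assumptions made for the sketch, not the AutoMed toolkit's API. It shows that a pathway is reversed by inverting each step and reversing the order of the steps.

```python
from dataclasses import dataclass
from typing import List, Optional


# Illustrative model only: class and field names are assumptions,
# not the AutoMed toolkit's API.
@dataclass(frozen=True)
class Step:
    kind: str                       # "add", "delete", "extend", "contract" or "rename"
    construct: str                  # the construct c (for rename: the old name n)
    query: Optional[str] = None     # q, or "Range q1 q2" for extend/contract
    new_name: Optional[str] = None  # n' (rename only)


# Each primitive is undone by its counterpart with the same arguments.
INVERSE_KIND = {"add": "delete", "delete": "add",
                "extend": "contract", "contract": "extend"}


def reverse_step(step: Step) -> Step:
    if step.kind == "rename":
        # renameT(c,n,n') is reversed by renameT(c,n',n)
        return Step("rename", step.new_name, None, step.construct)
    return Step(INVERSE_KIND[step.kind], step.construct, step.query)


def reverse_pathway(pathway: List[Step]) -> List[Step]:
    # Invert every step and apply the steps in the opposite order.
    return [reverse_step(s) for s in reversed(pathway)]


# Hypothetical pathway, written with plain-text queries.
pathway = [
    Step("add", "<<student,id>>", "[x | x in <<ug,id>> or x in <<phd,id>>]"),
    Step("rename", "<<staff,name>>", new_name="<<staff,sname>>"),
    Step("contract", "<<phd,title>>", "Range Void Any"),
]
reversed_pathway = reverse_pathway(pathway)
for s in reversed_pathway:
    print(s)
```

Reversing the reversed pathway gives back the original one, which is the property that makes the pathways bi-directional.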
Schema Transformation/Integration Networks
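[Figure: schema network showing local schemas LS1, LS2, …, LSi, …, LSn; union-compatible schemas US1, US2, …, USi, …, USn linked by id transformations; and the global schema GS]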
AutoMed Architecture
[Figure: AutoMed architecture. Components shown: Global Query Processor; Global Query Optimiser; Schema Evolution Tool; Schema Transformation and Integration Tools; Model Definition Tool; Schema and Transformation Repository; Model Definitions Repository; Wrapper]
Data warehousing scenario
[Figure: data warehousing workflow. Stages: Transforming; Single-Source Cleaning; Multi-Source Cleaning; Integrating; Summarizing; Creating Data Marts. Schemas shown: DSS1–DSS4, TS1–TS4, SS1–SS4, MS1, MS2, DS, SV1–SV5, DMS1]
DSS: Data Source Schema
TS: Transformed Schema
SS: Single-Cleaned Schema
MS: Multi-Cleaned Schema
DWS: Data Warehouse Schema
DS: Detailed Schema
SV: Summary View
DMS: Data Mart Schema
ISPIDER Project
Partners: Birkbeck, EBI, Manchester, UCL
Aims:
• Vast, heterogeneous biological data
• Need for interoperability
• Need for efficient processing
• Development of Proteomics Grid Infrastructure, use existing proteomics resources and develop new ones, develop new proteomics clients for querying, visualisation, workflow etc.
Project Aims
myGrid / DQP / AutoMed
myGrid: collection of services/components allowing high-level integration of data/applications for in-silico biology experiments
DQP:
• OGSA-DAI (Open Grid Services Architecture Data Access and Integration)
• Distributed query processing over OGSA-DAI enabled resources
AutoMed + DQP: interoperation for integration and query processing over heterogeneous data resources
AutoMed + myGrid: interoperation for processing workflows incorporating heterogeneous services and resources
Recent/current AutoMed research
Using AutoMed for virtual data integration:
• BAV query processing: integrates GAV and LAV techniques
• supporting source or target schema evolution
Using AutoMed for materialised data integration:
• incremental view maintenance
• data lineage tracing
Lucas Zamboulis has been working on techniques for automatically transforming and integrating XML data
He has also investigated using correspondences to ontologies (RDFS schemas) to enhance these techniques
Other recent/ongoing AutoMed research
Dean Williams has been working on extracting structure from unstructured text sources
The aim here is to integrate information extracted from unstructured text with structured information available from other sources, using IE techniques in conjunction with AutoMed
Dean has used existing IE technology (the GATE tool from Sheffield) for the text annotation and IE part of this work
P2P query and update processing over AutoMed pathways
Extension with ECA rules and a P2P ECA rule execution engine (Sandeep Mittal) will allow automatic propagation of updates, e.g. for view and constraint maintenance
Planning to undertake further investigation of constraints and conditional data transformation/integration
Some possible synergies with the proposed data quality project
AutoMed & BAV provide a single framework to support data cleansing, transformation and integration
Applicable in a broad range of integration scenarios (top-down, bottom-up, P2P; virtual, materialised, hybrid)
Schema transformations can, optionally, be accompanied by a constraint, giving the possibility of investigating conditional data transformation and integration
Schema transformations can be used to propagate data forwards (view maintenance) and backwards (lineage tracing) – it would be interesting to see what other information could be propagated e.g. accuracy and timeliness of data
Flexible global query processing could be used to support imprecise/incomplete data integration
Extra slides
Schema Transformation/Integration Networks (cont’d)
On the earlier Schema Transformation/Integration Networks slide:
• GS is a global schema
• LS1, …, LSn are local schemas
• US1, …, USn are union-compatible schemas
• the transformation pathway between each pair LSi and USi may consist of add, delete, rename, extend and contract primitive transformations, operating on any modelling construct defined in the AutoMed Model Definitions Repository
• the transformation pathway between USi and GS is similar
• the transformation pathway between each pair of union-compatible schemas consists of id transformation steps
Comparison with GAV & LAV Data Integration
Global-As-View (GAV) approach: specify GS constructs by view definitions over LSi constructs
Local-As-View (LAV) approach: specify LS constructs by view definitions over GS constructs
[Figure: three data sources (RDF, XML File, RDB), each with a Local Schema, connected to the Global Schema by View Definitions]
GAV Example
student(id,name,left,degree) = [<x,y,z,w> | (<x,y,z,w,_> ∈ ug ∧ <x,_,_,_,_> ∉ phd) ∨ (<x,y,z,w,_> ∈ phd ∧ w = ‘phd’)]
monitors(sno,id) = [<x,y> | (<x,_,_,_,y> ∈ ug ∧ <x,_,_,_,_> ∉ phd) ∨ <x,y> ∈ supervises]
staff(sno,sname,dept) = [<x,y,z> | (<x,y,z,w,_> ∈ tutor ∧ <x,_,_> ∉ supervisor) ∨ <x,y,z> ∈ supervisor]
LAV Example
tutor(sno,sname) = [<x,y> | <x,y,_> ∈ staff ∧ <x,z> ∈ monitors ∧ <z,_,_,w> ∈ student ∧ w ≠ ‘phd’]
ug(id,name,left,degree,sno) = [<x,y,z,w,v> | <x,y,z,w> ∈ student ∧ <v,x> ∈ monitors ∧ w ≠ ‘phd’]
phd, supervises, supervisor are defined similarly
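As a rough illustration, the right-hand sides of the tutor and ug definitions can be read as set comprehensions over the global relations. The sketch below evaluates them in Python over a toy instance; all data values are invented, and the reading of the conditions (a join on monitors plus the test degree ≠ ‘phd’) is an assumption about the intended definitions.

```python
# Toy extents of the global relations; all values are invented.
staff = {(10, "Dr A", "CS"), (11, "Dr B", "Maths")}           # (sno, sname, dept)
monitors = {(10, 1), (11, 2)}                                 # (sno, id)
student = {(1, "Ann", 2006, "bsc"), (2, "Bob", 2005, "phd")}  # (id, name, left, degree)

# tutor(sno,sname): staff who monitor at least one non-phd student
tutor = {
    (x, y)
    for (x, y, _dept) in staff
    for (sno, z) in monitors if sno == x
    for (sid, _name, _left, w) in student if sid == z and w != "phd"
}

# ug(id,name,left,degree,sno): non-phd students with the staff member monitoring them
ug = {
    (x, y, z, w, v)
    for (x, y, z, w) in student if w != "phd"
    for (v, sid) in monitors if sid == x
}

print(tutor)
print(ug)
```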
Evolution problems of GAV and LAV
GAV does not readily support evolution of local schemas e.g. adding an ‘age’ attribute to ‘phd’ invalidates some of the global view definitions
In LAV, changes to a local schema impact only the derivation rules defined for that schema e.g. adding an ‘age’ attribute to ‘phd’ affects only the rule defining ‘phd’
But LAV has problems if one wants to evolve the global schema since all the rules defining local schema constructs in terms of the global schema would need to be reviewed
These problems are exacerbated in P2P data integration scenarios where there is no distinction between local and global schemas
AutoMed approach, ‘Growing’ Phase
assuming initially a schema U = S1 + S2
addRel(<<student,id>>, [x | x ∈ <<ug,id>> ∨ x ∈ <<phd,id>>])
addAtt(<<student,name>>, [<x,y> | (<x,y> ∈ <<ug,name>> ∧ x ∉ <<phd,id>>) ∨ <x,y> ∈ <<phd,name>>])
addAtt(<<student,left>>, [<x,y> | (<x,y> ∈ <<ug,left>> ∧ x ∉ <<phd,id>>) ∨ <x,y> ∈ <<phd,left>>])
…
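To make the add steps concrete, the sketch below evaluates the queries for <<student,id>> and <<student,name>> over toy extents using Python sets. The data values are invented, and the reading of the queries (a union for the ids; the ug name kept only when the student is not also a phd student) is an assumption about the intended definitions.

```python
# Toy extents for the source constructs; all data values are invented.
ug_id = {1, 2}
phd_id = {2, 3}
ug_name = {(1, "Ann"), (2, "Bob")}
phd_name = {(2, "Robert"), (3, "Cy")}

# addRel(<<student,id>>, ...): every ug id and every phd id
student_id = ug_id | phd_id

# addAtt(<<student,name>>, ...): keep the ug name unless the student is
# also a phd student, in which case the phd name is used instead.
student_name = {(x, y) for (x, y) in ug_name if x not in phd_id} | phd_name

print(sorted(student_id))
print(sorted(student_name))
```

Student 2 appears in both sources, so only the phd name is kept for that id.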
AutoMed approach, ‘Shrinking’ Phase
contrAtt(<<tutor,sname>>, Range [<x,y> | <x,y> ∈ <<staff,sname>> ∧ <z,x> ∈ <<ug,sno>>] Any)
contrRel(<<tutor,sno>>, Range [x | x ∈ <<staff,sno>> ∧ <z,x> ∈ <<ug,sno>>] Any)
Similarly contractions for the ug attributes and relation
AutoMed approach, ‘Shrinking’ Phase (cont’d)
contrAtt(<<phd,title>>, Range Void Any)
delAtt(<<phd,left>>, [<x,y> | <x,y> ∈ <<student,left>> ∧ x ∈ <<phd,id>>])
delAtt(<<phd,name>>, [<x,y> | <x,y> ∈ <<student,name>> ∧ x ∈ <<phd,id>>])
delRel(<<phd,id>>, [x | x ∈ <<student,id>> ∧ <x,’phd’> ∈ <<student,degree>>])
Similarly deletions for supervises and supervisor
Schema Evolution in BAV
Unlike GAV/LAV/GLAV, the BAV framework readily supports the evolution of both local and global schemas
The evolution of the global or local schema is specified by a schema transformation pathway from the old to the new schema
For example, a transformation pathway T may lead from an old global schema S to a new global schema S’, or from an old local schema S to a new local schema S’
[Figure: pathway T from Global Schema S to New Global Schema S’; pathway T from Local Schema S to New Local Schema S’]
Global Schema Evolution
Each transformation step t in T: S → S’ is considered in turn
• if t is an add, delete or rename then schema equivalence is preserved and there is nothing further to do (except perhaps optimise the extended transformation pathway); the extended pathway can be used to regenerate the necessary GAV or LAV views
• if t is a contract then there will be information present in S that is no longer available in S’; again there is nothing further to do
• if t is an extend then domain knowledge is required to determine if the new construct in S’ can in fact be derived from existing constructs; if not, there is nothing further to do; if yes, the extend step is replaced by an add step
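The three cases above can be sketched as a small loop in Python. This is a minimal illustration under stated assumptions: steps are simplified to (kind, construct) pairs, and `derive` is a hypothetical callback standing in for the domain knowledge that decides whether an extended construct can be derived from existing ones.

```python
from typing import Callable, List, Optional, Tuple

Step = Tuple[str, str]  # (kind, construct): a simplified, hypothetical step


def evolve_global_schema(
    steps: List[Step],
    derive: Callable[[str], Optional[str]],
) -> List[Tuple[str, str, Optional[str]]]:
    """Apply the case analysis to each step of the pathway T in turn.

    `derive` stands in for the domain expert: given the construct of an
    extend step it returns a defining query, or None if there is none.
    """
    result = []
    for kind, construct in steps:
        if kind in ("add", "delete", "rename", "contract"):
            # Nothing further to do for these steps.
            result.append((kind, construct, None))
        elif kind == "extend":
            query = derive(construct)
            if query is None:
                result.append(("extend", construct, None))
            else:
                # The construct is derivable, so extend is replaced by add.
                result.append(("add", construct, query))
        else:
            raise ValueError(f"unknown step kind: {kind}")
    return result


# Hypothetical domain knowledge: only <<student,age>> can be derived.
known = {"<<student,age>>": "[<x,y> | <x,y> in <<ug,age>>]"}
evolved = evolve_global_schema(
    [("rename", "<<staff,sname>>"),
     ("extend", "<<student,age>>"),
     ("extend", "<<student,email>>")],
    known.get,
)
for step in evolved:
    print(step)
```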
Local Schema Evolution
This is a bit more complicated as it may require changes to be propagated also to the global schema(s)
Again each transformation step t in T: S → S’ is considered in turn
In the case that t is an add, delete, rename or contract step, the evolution can be carried out automatically
If it is an extend, then domain knowledge is required
See our CAiSE’02, ICDE’03 and ER’04 papers for more details
The last of these discusses a materialised data integration scenario where the old/new global/local schemas have an extent
Global Query Processing
We handle query language heterogeneity by translation into/from a functional intermediate query language – IQL
A query Q expressed in a high-level query language on a schema S is first translated into IQL (this functionality is not yet supported in the AutoMed toolkit)
View definitions are derived from the transformation pathways between S and the requested data source schemas
These view definitions are substituted into Q, reformulating it into an IQL query over source schema constructs
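The substitution step can be sketched as view unfolding over a small expression tree. The Python below is an illustration only and is not IQL: the tuple-based expression format, the operator names and the example view are all assumptions made for the sketch.

```python
# Expression trees (a made-up format for this sketch):
#   ("scheme", name) | ("union", e1, e2) | ("select", predicate_name, e)

def substitute(expr, views):
    """Replace each global scheme in `expr` by its view definition (one level)."""
    op = expr[0]
    if op == "scheme":
        # Source schemes have no view definition and are left unchanged.
        return views.get(expr[1], expr)
    return (op,) + tuple(
        substitute(arg, views) if isinstance(arg, tuple) else arg
        for arg in expr[1:]
    )


def evaluate(expr, extents, predicates):
    """Evaluate a reformulated query over the source extents."""
    op = expr[0]
    if op == "scheme":
        return set(extents[expr[1]])
    if op == "union":
        return (evaluate(expr[1], extents, predicates)
                | evaluate(expr[2], extents, predicates))
    if op == "select":
        keep = predicates[expr[1]]
        return {row for row in evaluate(expr[2], extents, predicates) if keep(row)}
    raise ValueError(f"unknown operator: {op}")


# Hypothetical view for a global construct, as might be derived from a pathway.
views = {"<<student,id>>": ("union", ("scheme", "<<ug,id>>"), ("scheme", "<<phd,id>>"))}
query = ("select", "small_id", ("scheme", "<<student,id>>"))

reformulated = substitute(query, views)
extents = {"<<ug,id>>": {1, 2}, "<<phd,id>>": {2, 3}}   # invented data
answer = evaluate(reformulated, extents, {"small_id": lambda x: x < 3})
print(reformulated)
print(answer)
```

After substitution the query mentions only source schema constructs, so it can be evaluated without any extent for the global schema.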
Global Query Processing (cont’d)
Query optimisation (currently algebraic) and query evaluation then occur
During query evaluation, the evaluator submits to wrappers sub-queries that they are able to translate into the local query language. Currently, AutoMed supports wrappers for SQL, OQL, XPath, XQuery and flat-file data sources
The wrappers translate sub-query results back into the IQL type system
Further query post-processing then occurs in the IQL evaluator
Other AutoMed research at Imperial
Automatic generation of equivalences between different data models
A graphical schema & transformations editor
Data mining techniques for extracting schema equivalences
Optimising schema transformation pathways
DQP – AutoMed Interoperability
Data sources wrapped with OGSA-DAI
AutoMed OGSA-DAI wrappers extract data sources’ metadata
Semantic integration of data sources using AutoMed transformation pathways into an integrated AutoMed schema
IQL queries submitted to this integrated schema are:
• Reformulated to IQL queries on the data sources, using the AutoMed transformation pathways
• Submitted to DQP for evaluation
[Figure: DQP – AutoMed architecture. Labelled components: AutoMed Repository; Integrated AutoMed Schema; three AutoMed Schemas; AutoMed Query Processor; AutoMed DQP wrapper; Distributed Query Processor; three AutoMed wrappers; three OGSA-DAI Services; three OGSA-DAI Activities; three DBs. Labelled flows: IQL query, OQL query, OQL result, IQL result]
Data source schema extraction
The AutoMed wrapper requests the schema of the data source using an OGSA-DAI service
The service replies with the source schema encoded in XML
The AutoMed wrapper creates the corresponding schema in the AutoMed repository
[Figure: the AutoMed wrapper sends a schema request to the OGSA-DAI Service for the DB, receives an XML response, and creates the AutoMed Schema]
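The extraction steps on this slide might look roughly as follows in Python. This is a sketch under loud assumptions: the XML layout, the element and attribute names, the scheme naming and the dictionary used as a repository are all invented for illustration; they are not the OGSA-DAI metadata format or the AutoMed repository API.

```python
import xml.etree.ElementTree as ET

# Hypothetical response: element and attribute names are invented for
# illustration and are not the OGSA-DAI metadata format.
XML_RESPONSE = """
<schema name="uni">
  <table name="ug">
    <column name="id" type="int"/>
    <column name="name" type="varchar"/>
  </table>
  <table name="tutor">
    <column name="sno" type="int"/>
  </table>
</schema>
"""


def create_schema(xml_text, repository):
    """Parse the XML description and store one scheme per table and column."""
    root = ET.fromstring(xml_text)
    constructs = []
    for table in root.findall("table"):
        constructs.append(f"<<{table.get('name')}>>")
        for column in table.findall("column"):
            constructs.append(f"<<{table.get('name')},{column.get('name')}>>")
    repository[root.get("name")] = constructs
    return constructs


repository = {}  # a plain dict standing in for the AutoMed repository
create_schema(XML_RESPONSE, repository)
print(repository["uni"])
```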
Using AutoMed in the BioMap Project
Relational/XML data sources containing protein sequence, structure, function and pathway data; gene expression data; other experimental data
Wrapping of data sources
Translation of source and global schemas into AutoMed’s XML schema
Domain expert provides matchings between constructs in source and global schemas
Automatic schema restructuring, with automatic generation of schema transformation pathways
See DILS’05 paper for more details
[Figure: data sources (RDB, XML File, RDB) accessed through RDB and XML Wrappers; two AutoMed Relational Schemas and an AutoMed XMLDSS Schema connected by transformation pathways to the AutoMed Integrated Schema; Integrated Database Wrapper and Integrated Database]
The London Knowledge Lab
Purpose-designed building
Science Research Infrastructure Fund: £6m
Research staff and students: 50
Location: Bloomsbury
Open: June 2004
Institute of Education, University of London
Birkbeck College, University of London
Social scientists: experts in education, sociology, culture and media, semiotics, philosophy, knowledge management ...
Computer scientists: experts in information systems, information management, web technologies, personalisation, ubiquitous technologies …
LKL Research Themes
Research at the London Knowledge Lab consists mainly of projects funded externally by the EU, EPSRC, ESRC, AHRB, BBSRC, JISC and the Wellcome Trust; there are currently about 25 projects.
Four broad themes guide our work and inform our research strategy:
• new forms of knowledge
• turning information into knowledge
• the changing cultures of new media
• creating empowering technologies for formal and informal learning
Turning Information Into Knowledge
• The need to cope with ubiquitous, complex, incomplete and inconsistent information is pervasive in our societies
• How can people benefit from this information in their learning, working and social lives?
• What new techniques are necessary for managing, accessing, integrating and personalising such information?
• How to design and build tools that help people to understand such information and generate new knowledge from it?