Data Integration in Bioinformatics Using OGSA-DAI
The BioDA Project
Shirley Crompton, Brian Matthews (CCLRC)
Alex Gray, Andrew Jones, Richard White (Cardiff University)
Overview
• Bioinformatics Data Access and Integration Requirements
– Generic
• BioDA Workshop and Questionnaire
– BDWorld-specific
• OGSA-DAI exemplar
The BioDA Project
• Independent Evaluation of OGSA-DAI
– the suitability of the software in its present form
– how to leverage OGSA-DAI in the bioinformatics GRID
• OGSA-DAI Product Improvement
– Feedback to the DAIT Team
• Knowledge Dissemination
– Evaluation Report
– Publications/Presentations
– Workshop on OGSA-DAI for the bioinformatics eResearch community
Bioinformatics
The application and development of computing and mathematics to the
management, analysis and understanding of data to solve biological questions.
Attwood, TK and Parry-Smith, DJ 1999
Data Management
Data Analysis
Grid Computing
“… flexible, secure, coordinated resource sharing among dynamic collections of individuals, institutions and resources …”
Foster, Kesselman and Tuecke, 2001
1st BioDA Workshop
• Objectives
– to examine the bioinformatics community’s needs for data access and integration (DAI) on the grid, and
– to explore the application of OGSA-DAI, a middleware developed expressly to address the DAI requirements of eScience projects
The BioDA Survey
[Figure: Mean Scores by Requirement Categories (adjusted by the number of questions within each category). x-axis: Requirement Category; y-axis: Mean Score, 0 to 5.]
The Results
17 key requirements; the top of the list includes:
• schema integration
• schema mapping
• mixed language query
• complex join across databases
• provenance data
• flexible resource discovery
• RDF database access
The BioDA Exemplar
The BioDiversity World
• To create a GRID-based problem solving environment.
• Enable collaborative exploration and analysis of global biodiversity patterns using workflow and rich data sources from around the world
• Example applications would be modeling species distributions against climate change, conservation prioritization and linking evolutionary changes to past climates.
BDWorld (Source: BDWorld)
[Architecture diagram. Components shown: User; Local tools; Problem Solving Environment user interface; Problem Solving Environment (broker agents, facilitator agents, presentation agents); BDGrid; Ontology; Metadata; Intelligent links; Resource & analytic tool descriptions; Maintenance tools; Taxonomic index (Species 2000 & ITIS Catalogue of Life); Thematic data source; Abiotic data source; Analytic tools; several Proxy boxes; four GSD boxes.]
BDWorld Data Resources: Key Issues
• geographically distributed and autonomous
– heterogeneous in structure and data standards
– mainly read via HTTP/XML protocols using custom wrappers
• SQL queries are limited to the EBI EMBL store and BDWorld cache databases
• potentially resource-intensive to harvest
– a single taxon name may resolve into a large number of ‘accepted’ taxon names
– same query repeated on different data collections
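The harvesting cost noted above (one name fanning out into many accepted names, each queried against several collections) can be sketched as follows. This is a minimal illustration, not BDWorld code: the synonym table, the collection names and the `resolve`/`harvest` helpers are all hypothetical, and the "remote" queries are local stand-ins that only record their calls.

```python
from typing import Callable, Dict, List, Tuple

# Placeholder synonym table (hypothetical data, not real taxonomy).
ACCEPTED_NAMES: Dict[str, List[str]] = {"name-A": ["accepted-1", "accepted-2"]}

Query = Callable[[str], List[str]]
Cache = Dict[Tuple[str, str], List[str]]


def resolve(name: str) -> List[str]:
    """Expand one submitted name into its accepted names (itself if unknown)."""
    return ACCEPTED_NAMES.get(name, [name])


def harvest(name: str, collections: Dict[str, Query], cache: Cache) -> Cache:
    """Run every accepted name against every collection, reusing cached answers."""
    results: Cache = {}
    for accepted in resolve(name):
        for collection, query in collections.items():
            key = (collection, accepted)
            if key not in cache:
                cache[key] = query(accepted)  # the only place a remote call happens
            results[key] = cache[key]
    return results


calls: List[Tuple[str, str]] = []  # log of simulated remote calls


def make_query(collection: str) -> Query:
    """Build a fake remote query that records each call it receives."""
    def query(accepted: str) -> List[str]:
        calls.append((collection, accepted))
        return [f"{collection}:{accepted}"]
    return query


collections = {"coll-1": make_query("coll-1"), "coll-2": make_query("coll-2")}
cache: Cache = {}
first = harvest("name-A", collections, cache)
second = harvest("name-A", collections, cache)  # served entirely from the cache
```

Two accepted names against two collections cost four remote calls on the first harvest; the repeated harvest adds none, which is the saving a cache database is meant to provide.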
Resource Wrapping (Source: BDWorld)
[Diagram. Components shown: User; Workflow enactment engine; BDWorld-GRID Interface (BGI) with BGI API (shown twice); The GRID; Wrapper; Remote Resource.]
Implications for BioDA
• abstraction layer (BGI) with a proprietary invocation mechanism
– InvokeOperation(ResourceHandler, Operation, XmlDataCollection)
• prepared search statements are defined in the individual data resource wrappers
• BGI protocols use BDW communication objects; search parameters and results are passed as XmlDataCollection
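The invocation pattern above (one generic entry point, with the prepared searches living inside each wrapper) can be sketched as follows. This is an illustrative Python analogue only. The class and function names mirror the slide's `InvokeOperation(ResourceHandler, Operation, XmlDataCollection)` signature, but their bodies, the `register` method and the `searchByName` operation are assumptions, not the BGI API.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List


@dataclass
class XmlDataCollection:
    """Hypothetical stand-in for the BDW communication object: a list of XML strings."""
    documents: List[str] = field(default_factory=list)


Handler = Callable[[XmlDataCollection], XmlDataCollection]


class ResourceWrapper:
    """Hypothetical wrapper that maps operation names to prepared searches."""

    def __init__(self, name: str) -> None:
        self.name = name
        self._operations: Dict[str, Handler] = {}

    def register(self, operation: str, handler: Handler) -> None:
        self._operations[operation] = handler

    def invoke(self, operation: str, params: XmlDataCollection) -> XmlDataCollection:
        if operation not in self._operations:
            raise KeyError(f"{self.name} does not support {operation!r}")
        return self._operations[operation](params)


def invoke_operation(resource_handler: ResourceWrapper, operation: str,
                     params: XmlDataCollection) -> XmlDataCollection:
    """Single generic entry point: parameters in, results out, both as collections."""
    return resource_handler.invoke(operation, params)


def search_by_name(params: XmlDataCollection) -> XmlDataCollection:
    """A prepared search: wrap each submitted document in a result element."""
    return XmlDataCollection(
        [f"<result source='demo'>{doc}</result>" for doc in params.documents]
    )


wrapper = ResourceWrapper("demo")
wrapper.register("searchByName", search_by_name)
reply = invoke_operation(wrapper, "searchByName",
                         XmlDataCollection(["<name>example</name>"]))
```

Because callers only ever see the generic entry point, arbitrary queries cannot be passed through it; only the operations a wrapper has registered are reachable.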
BioDA Exemplar
• Two main possibilities within BDW:
1. Augment BGI to support inclusion of queries in workflows, to be sent directly to OGSA-DAI enabled databases.
• Distributed query processing facilities could assist in planning the execution and distribution of data-orientated parts of a workflow. (For the current status of OGSA-DQP see Section 4.)
– Very major revision to BDW protocols; also,
– many resources of interest are simply not exposed as databases.
2. Provide facilities within individual wrappers that benefit from OGSA-DAI.
OGSA-DAI Prototype (What we’d have liked)
[Diagram. Components shown: OGSA-DAI Client; OGSA-DAI R5 GDS with BDWQueryActivity, deliverFromURL(xsl), deliverFromURL(url), XSLTransform and deliverToURL/GFTP; Wrapper Module with three Wrappers; Web DBs. Numbered steps:]
1. BGI InvokeOperation()
2. Create GDS and query
3. Invoke wrapper
4. Query
5. Download URL
6. Download url
7. url
8. XSL transform to BDW format
9. To WF unit
Key Issues encountered
• Complex client-side coding to orchestrate the application flow
– requires several GDS perform requests…
• Difficult to synchronise
– Remote web databases have different response times (or no response at all!)
• Different data transformation series are applicable to different data resources
• BDW protocols specify that data is returned as a BDW XmlDataCollection object
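The synchronisation problem above (sources that answer at different speeds, or not at all) can be sketched with a shared deadline. This is a minimal illustration using Python's standard `concurrent.futures`; the `query_all` helper and the three fake sources are hypothetical, and it is not how the OGSA-DAI client orchestrates perform requests.

```python
import time
from concurrent.futures import ThreadPoolExecutor, wait
from typing import Callable, Dict, List, Tuple

Source = Callable[[str], List[str]]


def query_all(sources: Dict[str, Source], term: str,
              timeout_s: float) -> Tuple[Dict[str, List[str]], List[str]]:
    """Query every source in parallel; report which ones missed the deadline or failed."""
    executor = ThreadPoolExecutor(max_workers=max(1, len(sources)))
    futures = {executor.submit(fn, term): name for name, fn in sources.items()}
    done, not_done = wait(list(futures), timeout=timeout_s)
    results: Dict[str, List[str]] = {}
    missing = [futures[f] for f in not_done]  # still running at the deadline
    for future in done:
        if future.exception() is None:
            results[futures[future]] = future.result()
        else:
            missing.append(futures[future])  # finished, but with an error
    executor.shutdown(wait=False)  # do not block on the stragglers
    return results, sorted(missing)


def fast(term: str) -> List[str]:
    return [f"hit for {term}"]


def slow(term: str) -> List[str]:
    time.sleep(1.0)  # simulates a web database that answers too late
    return ["too late"]


def broken(term: str) -> List[str]:
    raise RuntimeError("no response")  # simulates a source that fails outright


results, missing = query_all({"fast": fast, "slow": slow, "broken": broken},
                             "x", timeout_s=0.2)
```

The caller gets the partial results plus a list of sources to retry or report, rather than waiting on the slowest source.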
OGSA-DAI Prototype (What we ended up doing)
[Diagram. Components shown: OGSA-DAI Client; OGSA-DAI R5 GDS with BDWQueryActivity; Wrapper Module with three Wrappers; Web DBs; Cache File. Numbered steps:]
1. BGI InvokeOperation()
2. Create GDS and query
3. Invoke wrapper/s
4. Query, transform
5. Write cache file
6. return XmlRemoteData
7. return XmlDataCollection
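Steps 4 to 7 above (query and transform inside the wrapper, write a cache file, hand back a reference, then load the collection) can be sketched as follows. This is a rough illustration under stated assumptions: `RemoteDataRef` is a hypothetical stand-in for XmlRemoteData, the result layout is invented, and the transformation step is omitted.

```python
import tempfile
import xml.etree.ElementTree as ET
from dataclasses import dataclass
from pathlib import Path
from typing import List


@dataclass
class RemoteDataRef:
    """Hypothetical stand-in for XmlRemoteData: a pointer to a cache file."""
    path: Path


def run_wrapper(key: str, records: List[str], cache_dir: Path) -> RemoteDataRef:
    """Write already-queried records to a cache file and return a reference to it."""
    root = ET.Element("results")
    for record in records:
        ET.SubElement(root, "record").text = record
    path = cache_dir / f"{key}.xml"
    ET.ElementTree(root).write(str(path), encoding="utf-8", xml_declaration=True)
    return RemoteDataRef(path)


def load_collection(ref: RemoteDataRef) -> List[str]:
    """Resolve the reference into the in-memory collection handed to the workflow."""
    tree = ET.parse(str(ref.path))
    return [element.text or "" for element in tree.getroot().iter("record")]


with tempfile.TemporaryDirectory() as tmp:
    ref = run_wrapper("demo", ["record-1", "record-2"], Path(tmp))
    cache_file_written = ref.path.exists()
    loaded = load_collection(ref)
```

Returning a small reference first and the data second keeps the service response short and lets the client decide when to pull the full result.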
Conclusion
• Highlighted key bioinformatics eScience project requirements for OGSA-DAI
– support for a metadata-driven two-step access to data and data integration…
• Reviewed BDWorld DAI requirements
– uniform access to disparate, heterogeneous data resources
• including anonymous access to web information systems
• Reviewed the BDWorld OGSA-DAI exemplar and issues encountered