OGSA-DAI Introduction and Overview
OGSA-DAI Tutorial
GGF15, Boston, USA, 6 October 2005
Neil Chue Hong, Project Manager, EPCC
[email protected], +44 131 650 5957
Malcolm Atkinson, Director, National e-Science Centre
[email protected], +44 131 651 4040
GGF15 Tutorial 6th October 2005 2
OGSA-DAI
Data, Data Everywhere & …
• Growing volumes
• Growing diversity
• Growing complexity
• How do we mine its riches for nuggets of information?
• Find & Access
• Understand
• Extract, Combine & Digest
• Test hypotheses
• Bingo!
⇒ Rich resource
⇒ Use OGSA-DAI
Composing Observations in Astronomy
Data and images courtesy Alex Szalay, Johns Hopkins
No. & sizes of data sets as of mid-2002, grouped by wavelength
• 12 waveband coverage of large areas of the sky
• Total about 200 TB data
• Doubling every 12 months
• Largest catalogues near 1B objects
Biomedical data – making connections
12181 acatttctac caacagtgga tgaggttgtt ggtctatgtt ctcaccaaat ttggtgttgt
12241 cagtctttta aattttaacc tttagagaag agtcatacag tcaatagcct tttttagctt
12301 gaccatccta atagatacac agtggtgtct cactgtgatt ttaatttgca ttttcctgct
12361 gactaattat gttgagcttg ttaccattta gacaacttca ttagagaagt gtctaatatt
12421 taggtgactt gcctgttttt ttttaattgg
Slide provided by Carole Goble: University of Manchester
Database Growth
PDB Content Growth
Slide provided by Richard Baldock: MRC HGU Edinburgh
BRIDGES
[Diagram labels]
• CFG Virtual Organisation sites: Glasgow, Edinburgh, Leicester, Oxford, London, Netherlands
• Private data (shown six times)
• Publicly curated data: Ensembl, MGI, HUGO, OMIM, SWISS-PROT, RGD, …
• Data Hub
• Information Integrator, OGSA-DAI
• VO Authorisation
• SyntenyGrid Service, blast
Slide provided by Richard Sinnott: University of Glasgow
eDiaMoND: Screening for Breast Cancer
[Diagram labels]
• 1 Trust, Many Trusts: collaborative working, audit capability, epidemiology
• Other modalities: MRI, PET, ultrasound
• Better access to case information and digital tools
• Supplement mentoring with access to digital training cases and sharing of information across clinics
• Letters, radiology reporting systems
• eDiaMoND Grid
• Secondary capture or FFD; case information; X-rays and case information
• Digital reading: SMF, CAD, temporal comparison; case and reading information
• Screening; electronic patient records
• Assessment / symptomatic: biopsy; case and reading information; symptomatic/assessment information
• Training: manage training cases, perform training; SMF, CAD, 3D images
• Patients
Provided by eDiamond project: Prof. Sir Mike Brady et al.
Business Intelligence and Customer Data
• Company wants real-time integrated view of customer buying behavior
• Data resides in various distributed CRM & ERP systems
• Grid allows developers and apps to access and integrate customer data sources together in real time, across many distributed databases
[Diagram: a Data Grid over SAP, Oracle, Siebel and DB2 supplies customer data and customer order information to Customer Support and a web-based dashboard; the marketing department identifies likely buyers of a new product.]
Slide from: Dave Berry, Andrew Grimshaw – OGSA-WG
Providing Data to Cluster-Based Analytical Application
• Company has centralized HPC cluster running compute-intensive applications
• Source data for analysis distributed among 3 global sites, one of them an external partner
• Manual data-sharing processes increase costs/errors, and hinder time-to-results
• Grid enables secure, automatic provisioning of remote data to HPC cluster, feeding CPUs more data faster
[Diagram labels: Data Grid nodes at R&D (West Coast), Testing (India) and Engineering (East Coast); forward proxy data caches of remote data; centralized compute cluster (Headquarters, Illinois); analytical applications.]
Slide from: Dave Berry, Andrew Grimshaw – OGSA-WG
Story so far
• There is a lot of data
  – Growing in every dimension
  – Distributed
  – Many different producers & owners
  – Heterogeneous
  – High value resource
  – Combined it is more valuable
• There are many requirements for data integration
  – Takes many forms
  – Driven by insights
  – Enable conversion of insight into tested hypothesis
• There are many data owners
  – Their autonomy and policies must be respected
Generic Repeatable Solutions Required
Changing the way we manage Data
Terabyte → Petabyte
                    Terabyte                 Petabyte
RAM time to move    15 minutes               2 months
1GB WAN move time   10 hours ($1000)         14 months ($1 million)
Disk cost           7 disks = $5000 (SCSI)   6800 disks + 490 units + 32 racks = $7 million
Disk power          100 Watts                100 Kilowatts
Disk weight         5.6 Kg                   33 Tonnes
Disk footprint      Inside machine           60 m²
Approximately correct in May 2003
Source: Distributed Computing Economics, Jim Gray, Microsoft Research, MSR-TR-2003-24
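The WAN figures in the table (10 hours for a terabyte, 14 months for a petabyte) follow from simple scaling. The sketch below derives the effective throughput implied by the terabyte figure and applies it to a petabyte. Decimal units and a constant transfer rate are assumptions made for illustration, not statements from the slide.

```python
# Sketch: scale the table's WAN figure from a terabyte to a petabyte.
# Assumptions (not from the slide): decimal units and a constant rate.
TERABYTE = 10**12  # bytes
PETABYTE = 10**15  # bytes

def implied_rate_bits_per_s(n_bytes, hours):
    """Effective throughput implied by moving n_bytes in the given time."""
    return n_bytes * 8 / (hours * 3600)

def move_time_hours(n_bytes, rate_bits_per_s):
    """Hours needed to move n_bytes at a constant effective rate."""
    return n_bytes * 8 / rate_bits_per_s / 3600

rate = implied_rate_bits_per_s(TERABYTE, hours=10)
pb_hours = move_time_hours(PETABYTE, rate)
pb_months = pb_hours / (24 * 365.25 / 12)  # average month length

print(f"implied rate: {rate / 1e6:.0f} Mbit/s")
print(f"petabyte move time: {pb_hours:.0f} hours, about {pb_months:.1f} months")
```

A thousandfold increase in volume turns 10 hours into 10,000 hours, which is roughly the 14 months quoted.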
Mohammed & Mountains
• Petabytes of data cannot be moved
  – It stays where it is produced or curated
    – Hospitals, observatories, European Bioinformatics Institute
  – A few caches and a small proportion cached
• Distributed collaborating communities
  – Expertise in curation, simulation & analysis
• Diverse data collections
  – Discovery depends on insights
  – Unpredictable or unexpected use of data
Move computation to the data
• Assumption: code size << data size
  – Minimise data transport
• Provision combined storage & compute resources
• Develop the database philosophy for this?
• Develop the storage architecture for this?
• Develop experiment, sensor & simulation architectures
  – That take code to select and digest data as an output control
  – That attach the provenance & metadata
• Data Cutter a step in this direction
  – Sub-setting and aggregation of datasets using filters executed close to data
  – http://www.cs.umd.edu/projects/hpsl/ResearchAreas/DataCutter.htm
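The "code size << data size" assumption can be illustrated with a toy comparison. The sketch below is not Data Cutter and uses no real grid API. The record layout, the `is_bright` predicate and the JSON wire format are all hypothetical, chosen only to contrast shipping a whole dataset with shipping the output of a small filter run beside the data.

```python
import json

def make_records(n):
    """Build n synthetic observations; 'payload' stands in for bulky data."""
    return [{"id": i, "flux": (i * 37) % 1000, "payload": "x" * 200}
            for i in range(n)]

def wire_size(records):
    """Bytes needed to ship the records as JSON."""
    return len(json.dumps(records).encode("utf-8"))

def is_bright(record):
    """The 'code' sent to the data: a small selection predicate."""
    return record["flux"] > 990

records = make_records(10_000)                   # held at the data site
selected = [r for r in records if is_bright(r)]  # filter runs next to the data

bytes_if_data_moves = wire_size(records)
bytes_if_code_moves = wire_size(selected)
print(len(selected), bytes_if_data_moves, bytes_if_code_moves)
```

Only the selected subset crosses the network when the filter runs at the data site, so the saving grows with the selectivity of the filter.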
Meta-data: describing data
• Choosing data sources
  – How do you find them?
  – How are they described and advertised?
  – Is the equivalent of Google possible?
• Meta-data is required describing
  – Content
  – Provenance
  – Structure
  – Types, Formats & Ontologies
  – Operations available
  – Access requirements
  – Quality of service
• No established standards for heterogeneous data sources
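One way to picture the meta-data listed above is as a record per data source that a discovery step can search. The sketch below is purely illustrative. `SourceDescription`, `find_sources` and the sample catalogue entries are hypothetical and do not correspond to any established standard, which, as noted, did not exist for heterogeneous sources.

```python
from dataclasses import dataclass, field

@dataclass
class SourceDescription:
    """Hypothetical metadata record; fields follow the slide's list."""
    name: str
    content: str
    provenance: str
    structure: str
    formats: list = field(default_factory=list)
    operations: list = field(default_factory=list)
    access_requirements: str = "public"
    quality_of_service: str = "best effort"

def find_sources(catalogue, operation, fmt):
    """Return names of sources that offer an operation in a given format."""
    return [s.name for s in catalogue
            if operation in s.operations and fmt in s.formats]

catalogue = [
    SourceDescription("sky-survey", "object catalogue", "telescope run",
                      "relational", ["SQL rowset"], ["query"]),
    SourceDescription("protein-db", "structures", "curated submissions",
                      "XML", ["XML"], ["query", "update"],
                      access_requirements="registered users"),
]
print(find_sources(catalogue, "query", "XML"))
```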
Cultural Challenges
• Changing the way we work?
• Publication and sharing of result data
  – Increased volume and diversity = increased opportunity?
  – Allows independent validation of methods and derivatives
  – Responsibility, ownership, credit, citation
• Many distributed data resources
  – Data collected from observation, simulation & experiment
  – Independently owned & managed
    – No common goals or design
    – Work hard for agreements on foundation types and ontologies
    – Autonomous decisions change data, structure, policy, etc.
    – Requires negotiations and patience!
• Diversity
  – No “one size fits all” solutions will work
Economic Challenges
• Data production, publication & management
  – Many researchers contributing increments of data
  – Who pays for storage, transport, management and curation?
• Data longevity
  – Research requirements may outlive technical decisions
  – Data does not preserve itself indefinitely!
• When a community is dependent on a data resource
  – Who pays or decides to switch it off?
Summary: Scientific Discovery
• Obtaining access to that data
  – Overcoming administrative barriers
  – Overcoming technical barriers
• Understanding that data
  – The parts you care about for your research
• Combining them using sophisticated models
  – The picture of reality in your head
• Analysis on scales required by statistics
  – Coupling data access with computation
• Repeated processes
  – Examining variations, covering a set of candidates
  – Monitoring the emerging details
  – Coupling with scientific workflows
Three communities
• Users: Individuals & Organisations
• Data, Information & Knowledge Providers
• Compute, Storage & Communications Providers
Three communities: Users (Individuals & Organisations)
• Initiate & steer work
• Provide requirements
• Use knowledge & insight
• Want easy & reliable facility
• Expect agility, reliability, stability & latest techniques
• Pay the bills
Three communities: Compute, Storage & Communications Providers
• Provide & operate resources
  – Storage centres, data centres, DBMS & file systems
  – Computation environment, processing & communications
• Need to change facilities & policies
• Prefer consolidated requirements
• Must be paid
Three communities: Data, Information & Knowledge Providers
• Create & collect data
• Organise & structure data
• Provide, organise & maintain metadata
• Offer access and domain specific services
• Establish use policies
• Will change structure, services & policies
• May pay or be paid
Three communities
OGSA-DAI: a framework for a lasting & productive relationship
An Architecture
Three communities
[Diagram labels: Users (Individuals & Organisations); Data, Information & Knowledge Providers; Compute, Storage & Communications Providers; OGSA-DAI at the centre; Portals & Applications; Client Library; Adaptive Interfaces; Adaptive & Extensible Interfaces; Provided Interfaces & Services.]
DAIS Working Group
DAIS WG Goals
• Provide service-based access to structured data resources as part of OGSA architecture
• Specify a selection of interfaces tailored to various styles of data access starting with relational and XML
• Interact well with other GGF OGSA specs
DAIS WG Non-Goals
• No new common query language
• No schema integration or common data model
• No common namespace or naming scheme
• No data resource management
  – e.g. starting/stopping database managers
• No push based delivery
  – Information Dissemination WG?
DAIS View Of Data Services Model
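[Diagram: Consumer, Data Service and Data Resource in a chain, each association marked with 0-* multiplicities. The Data Resource may have internal structure of its own.]
This structure is not exposed through the Data Service interface to the Consumer.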
• A Data Service presents a Consumer with an interface to a Data Resource.
• A Data Resource can have arbitrary complexity, for example, a file on an NFS mounted file system or a federation of relational databases.
• A Consumer is not typically exposed to this complexity and operates within the bounds and semantics of the interface provided by the Data Service
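The relationship between the three roles can be sketched in a few lines. This is an illustration of the idea only, not the DAIS interfaces. The class names `FileResource`, `FederatedResource` and `DataService`, and the `get_data` method, are hypothetical.

```python
class FileResource:
    """A simple data resource: lines of a single file."""
    def __init__(self, lines):
        self._lines = list(lines)

    def read(self):
        return list(self._lines)

class FederatedResource:
    """A complex data resource: several resources presented as one."""
    def __init__(self, members):
        self._members = list(members)

    def read(self):
        rows = []
        for member in self._members:
            rows.extend(member.read())
        return rows

class DataService:
    """The only thing a consumer sees; the resource stays hidden."""
    def __init__(self, resource):
        self._resource = resource

    def get_data(self):
        return self._resource.read()

def consumer(service):
    """A consumer works purely against the service interface."""
    return len(service.get_data())

simple = DataService(FileResource(["a", "b"]))
federated = DataService(FederatedResource(
    [FileResource(["a", "b"]), FileResource(["c"])]))
print(consumer(simple), consumer(federated))
```

The consumer code is identical in both cases, which is the point of hiding the resource's complexity behind the service.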
DAIS Specification Landscape
[Diagram: OGSA Data Services, WS-DAI, WS-DAIR, WS-DAIX and "Scenarios for Mapping DAIS Concepts", connected by "Is Informed By" and "Extend" relationships; the legend distinguishes GWD-I and GWD-R documents.]
DAIS and Other Standards/Specifications
[Diagram labels]
• GGF: Arch, Data, ISP, SRM, OGSA, CMM, GIR, GSM, GFS, DAIS, CGS, GRAAP, Policy, INFOD, GridFTP, DT, TM BoF, ADF BoF, OREP, DFDL
• Other standards bodies: IETF (SNMP), W3C (XQuery), ANSI (SQL), DMTF (CIM), OASIS (WS-DM, WS-RF, WS-N), JCP (JDBC), ????
• WSPolicy, WSAddress
DAIS Data Access
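[Diagram: a Consumer calls SQLExecute ( SQLExpression ) on the SQLAccess interface of a Database Data Service in front of a Relational Database and receives an SQLResponse. The service also carries an SQLDescription: Readable, Writeable, ConcurrentAccess, TransactionInitiation, TransactionIsolation, etc.]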
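The direct access pattern can be sketched with an in-memory SQLite database standing in for the relational database. This is not the WS-DAIR interface: `SQLAccessService`, `sql_execute` and the dictionary-shaped response are hypothetical, and only two of the description properties named on the slide are modelled.

```python
import sqlite3

class SQLAccessService:
    """Illustrative stand-in for a data service with an SQLAccess interface."""
    def __init__(self, connection, writeable=False):
        self._conn = connection
        # A subset of the SQLDescription properties named on the slide.
        self.description = {"Readable": True, "Writeable": writeable}

    def sql_execute(self, sql_expression, parameters=()):
        """Run one SQL expression and return a response document (a dict)."""
        is_query = sql_expression.lstrip().upper().startswith("SELECT")
        if not is_query and not self.description["Writeable"]:
            raise PermissionError("data service is not writeable")
        cursor = self._conn.execute(sql_expression, parameters)
        columns = [d[0] for d in cursor.description] if cursor.description else []
        rows = cursor.fetchall()
        self._conn.commit()
        return {"columns": columns, "rows": rows}

conn = sqlite3.connect(":memory:")   # stands in for the relational database
conn.execute("CREATE TABLE star (id INTEGER, name TEXT)")
conn.executemany("INSERT INTO star VALUES (?, ?)", [(1, "Vega"), (2, "Deneb")])

service = SQLAccessService(conn)     # read-only service
response = service.sql_execute("SELECT name FROM star WHERE id = ?", (2,))
print(response)
```

The consumer sends an expression and gets the data back in the same exchange; the description properties tell it in advance what the service will accept.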
DAIS Derived Data Access
[Diagram: a Consumer calls SQLExecuteFactory ( SQLExpression, BehaviouralProperties ) on the SQLFactory interface of a Database Data Service in front of a Relational Database. An RDBMS specific mechanism generates the result set, and the Consumer receives a reference to an SQLResponse Data Service. That service offers SQLResponseDescription and SQLResponseAccess; the Consumer calls GetRowset ( rowsetnumber ) on it to obtain a Rowset. The Database Data Service carries an SQLDescription: Readable, Writeable, ConcurrentAccess, TransactionInitiation, TransactionIsolation, etc.]
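The derived access pattern differs from direct access in that the query result stays on the service side and the consumer fetches it piece by piece. The sketch below mimics that with plain Python objects. It is not the DAIS specification: the class names are hypothetical, the "reference" is an ordinary object rather than a service endpoint, and the 0-based rowset numbering and fixed rowset size are assumptions.

```python
import sqlite3

class SQLResponseService:
    """Holds a derived result set and serves it in fixed-size rowsets."""
    def __init__(self, columns, rows, rowset_size):
        self.columns = columns
        self._rows = rows
        self._rowset_size = rowset_size

    def get_rowset(self, rowset_number):
        """Return rowset number n (0-based here, an assumption)."""
        start = rowset_number * self._rowset_size
        return self._rows[start:start + self._rowset_size]

class SQLFactoryService:
    """Creates a response service for each query instead of returning rows."""
    def __init__(self, connection):
        self._conn = connection

    def sql_execute_factory(self, sql_expression, rowset_size=2):
        cursor = self._conn.execute(sql_expression)
        columns = [d[0] for d in cursor.description]
        return SQLResponseService(columns, cursor.fetchall(), rowset_size)

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE reading (n INTEGER)")
conn.executemany("INSERT INTO reading VALUES (?)", [(i,) for i in range(5)])

factory = SQLFactoryService(conn)
derived = factory.sql_execute_factory("SELECT n FROM reading ORDER BY n")
print(derived.get_rowset(0), derived.get_rowset(2))
```

Because the consumer holds only a reference, it could hand that reference to a third party, which then pulls the rowsets itself.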
Terminology - Data
• Data Resource
  – Any object that can source/sink data
  – Currently databases in scope
• Data Service
  – Common interface to a data resource
  – Exposes capabilities of data resource
    – SQL queries, XPath queries
  – May provide additional capabilities
    – Data transformations, 3rd party data delivery
• OGSA-DAI
  – Open Grid Services Architecture Data Access and Integration
And now OGSA-DAI
OGSA-DAI Projects
• Develop a component library
  – Access and manipulate data in a grid
  – Serve UK and International e-Science communities
• Provide an extensible framework
• Provide
  – Common interface to data resources
  – Simple integration of distributed queries to multiple data resources
• Contribute to standardisation efforts
  – Input into GGF DAIS WG and other groups
  – Provide a reference implementation of DAIS spec
• Based on Open Grid Services Architecture (OGSA)
  – WSRF (GT4) & WS-I (OMII_2) versions
• Support Application Developers & Contributors
Data Services: challenges
• Scale
  – Many sites, large collections, many uses
• Longevity
  – Research requirements outlive technical decisions
• Diversity
  – No “one size fits all” solutions will work
  – Primary Data, Data Products, Meta Data, Administrative data, …
• Many Data Resources
  – Independently owned & managed
    – No common goals
    – No common design
    – Work hard for agreements on foundation types and ontologies
    – Autonomous decisions change data, structure, policy, …
  – Geographically distributed
• and I haven’t even mentioned security yet!
Slide from Neil Chue Hong
OGSA-DAI In One Slide
• An extensible framework for data access and integration.
• Expose heterogeneous data resources to a grid through web services.
• Interact with data resources:
  – Queries and updates.
  – Data transformation / compression.
  – Data delivery.
• Customise for your project using
  – Additional Activities
  – Client Toolkit APIs
  – Data Resource handlers
• A base for higher-level services
  – Federation, mining, visualisation, …
Slide from Neil Chue Hong
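The idea of chaining a query, a transformation, compression and delivery can be sketched as a pipeline of small functions. This is a conceptual illustration only and does not use the OGSA-DAI API. Every function name here is hypothetical, SQLite stands in for the data resource, and a Python list stands in for the delivery target.

```python
import csv
import io
import sqlite3
import zlib

def query_activity(conn, sql):
    """Run a query against the data resource."""
    cursor = conn.execute(sql)
    return [d[0] for d in cursor.description], cursor.fetchall()

def csv_transform_activity(columns, rows):
    """Transform the result into CSV bytes."""
    buffer = io.StringIO()
    writer = csv.writer(buffer)
    writer.writerow(columns)
    writer.writerows(rows)
    return buffer.getvalue().encode("utf-8")

def compress_activity(data):
    """Compress the transformed data before it leaves the service."""
    return zlib.compress(data)

def deliver_activity(data, sink):
    """Deliver to a third party; a list stands in for the remote target."""
    sink.append(data)
    return len(data)

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE galaxy (id INTEGER, name TEXT)")
conn.executemany("INSERT INTO galaxy VALUES (?, ?)", [(1, "M31"), (2, "M33")])

sink = []
columns, rows = query_activity(conn, "SELECT id, name FROM galaxy ORDER BY id")
delivered = deliver_activity(
    compress_activity(csv_transform_activity(columns, rows)), sink)
print(delivered, zlib.decompress(sink[0]).decode("utf-8"))
```

Each step takes the previous step's output, so a project-specific step can be slotted in without touching the others.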
OGSA-DAI team
IBM Development Team, Hursley
NEReSC, Newcastle
NeSC, Edinburgh
EPCC Team, Edinburgh
ESNW, Manchester
IBM Dissemination Team
International Collaboration & Use
USA: Globus Alliance, IBM Corporation, caBIG, BIRN, Indiana University, GridSphere, GEON, LEAD, MCS, NCSA, Secure Data Grid, UNC
Japan: AIST, BioGrid, NAREGI
Europe: CERN, DataMiningGrid, GridMiner, GridSphere, inteligrid, N2Grid, OntoGrid, Provenance, SIMDAT, OMII-EU
UK: OMII, OMII-UK, NGS, NCeSS, NIeeS, AstroGrid, BioSimGrid, BRIDGES, CancerGrid, ConvertGrid, eDiaMoND, EDINA, First Group plc, Fujitsu Labs Europe, GEDDM, GeneGrid, Genomic Technology and Informatics, GOLD, Human Genetics Unit, IBM UK, myGrid, Oracle UK
China: CAS, ChinaGrid, cnGrid, INWA, OMII-China
Australia: Curtin Business School, INWA
South Korea: KISTI
Tutorials: Boston, Cambridge, CERN, Chicago, Edinburgh, London, San Francisco, Seattle, Seoul, Singapore, Tokyo, ISSGC 03 to 05
DIALOGUE workshops: Columbus, Edinburgh, Indiana, Vienna, Chicago, Manchester, San Diego
[Pie chart by country: China 40%, United Kingdom 15%, United States 11%, Germany 3%, Japan 5%, Italy 2%, France 3%, Austria 1%, Others 20%]
1485 registered users
5250+ downloads
Meeting User Requirements
[Projects shown: LEAD, GeneGrid, caBIG, BRIDGES, OGSA WebDB, FirstDIG, ConvertGrid, eDiaMoND, OGSA-DQP, Grid Miner]
Project Partners
Powered by ….
Funded by the Grid Core Programme
• OGSA-DAI
  – £3 million, 18 months, from Feb 2002
  – Three major releases, three interim releases
• DAIT (DAI-Two)
  – Keep the OGSA-DAI brand name
  – £1.5 million, 24 months, from Oct 2003
  – Four major releases
• OMII-UK
  – To October 2008
Thanks for attending!