
UK e-Science Future Infrastructure for Scientific Data Mining, Integration and Visualisation


UK e-Science

Future Infrastructure for Scientific Data Mining, Integration and Visualisation

Malcolm Atkinson

Director of National e-Science Centre
www.nesc.ac.uk

25th October 2002

SDMIV workshop, e-Science Institute, Edinburgh

Overview

UK e-Science
  Reminder of Investment and Infrastructure

International e-Science
  Examples and Collaboration

Data Access and Integration
  Lego Bricks for Scientific Application Developers
  Tailored: Application and Computing Scientists

A Computer Scientist’s Christmas List
  Diversity and Opportunity

The Way Ahead

e-Science

Fundamentally about Collaboration

Sharing
  Ideas
  Thought processes and Stimuli
  Effort
  Resources

Requires
  Communication
  Common understanding & Framework
  Mechanisms for sharing fairly
  Organisation and Infrastructure

Scientists (Biologists) have done this for Centuries

e-Science (take 2)

Fundamentally about Collaboration

Sharing
  Ideas
  Thought processes and Stimuli
  Effort
  Resources

Requires
  Communication
  Common understanding & Framework
  Mechanisms for sharing fairly
  Organisation and Infrastructure

Text, digital media, structured, organised & curated data, computable models, visualisation, shared instruments, shared systems, shared administration, …

Nationally & Internationally Distributed, …

Routine, Daily, Automated, …

That Requires very Significant Investment in Digital Systems and their Support

e-Science (take 3)

Fundamentally about Collaboration

Sharing
  Ideas
  Thought processes and Stimuli
  Effort
  Resources

Requires
  Communication
  Common understanding & Framework
  Mechanisms for sharing fairly
  Organisation and Infrastructure

Digital networks, digital work-places, digital instruments, …

Metadata, ontologies, standards, shared curated data, shared codes, …

Common platforms, shared software, shared training, …

The Grid SHOULD make this much easier by providing a common, supported high level of Software and Organisational infrastructure

Citation, Authentication, Authorisation, Accounting, Provenance, Policies, …

Shared Provision of Platform,

Grid Expectations

Persistence
  Always there, Always Working, Always Supported

Stability
  You can build on foundations that don’t move

Trustworthy & Predictable
  Honours commitments
  Digital policies, digital contracts, security, …
  Data integrity, longevity and accessibility
  Performance

High-level & Extensible
  The capabilities you need are already there

Ubiquitous
  Your collaborators use it

Grid Reality

Persistence
  Always there, Always Working, Always Supported

Stability
  You can build on foundations that don’t move

Trustworthy & Predictable
  Honours commitments
  Digital policies, digital contracts, security, …
  Data integrity, longevity and accessibility
  Performance

High-level & Extensible
  The capabilities you need are already there

Ubiquitous
  Your collaborators use it

Political, Economic & Technical issues to Solve

Early days but Open Grid Services link with Web Services + GGF standardisation

Not yet but very substantial global effort to achieve this

Good basis for extension
Commitment to basic functionality

WS + Community effort

Global & Industrial Rallying Cry
Must work with Web Services

UK Grid Network

Figure: map of the UK Grid Network. Centres: Cambridge, Newcastle, Edinburgh, Oxford, Glasgow, Manchester, Cardiff, Southampton, London, Belfast, Daresbury Lab, RAL, Hinxton and the National e-Science Centre. Also marked: Access Grid always-on video walls; HPC(x).

Figure: SuperJanet4, June 2002. Core sites: WorldCom Glasgow, Edinburgh, Manchester, Reading, Leeds, Bristol, London and Portsmouth. Regional networks: Scotland via Glasgow, Scotland via Edinburgh, NNW, Northern Ireland, MidMAN, TVN, South Wales MAN, SWAN & BWEMAN, YHMAN, NorMAN, EMMAN, EastNet, LMN, Kentish MAN, LeNSE; External Links. Link speeds in the key: 20Gbps, 10Gbps, 2.5Gbps, 622Mbps, 155Mbps.

Tony Hey July 2001

National e-Science Centre

Events
  Workshops
  Research Meetings
  International Meetings

History of Events
  GGF5
  HPDC11
  Summer school
  > 50 workshops held
  > 1000 people in total
  Many return often

Planned Events
  25 workshops
  Conferences to 2005

Visitors
  3 arrived
  4 arranged

International collaboration, visits & visitors
  China
  Argonne National Lab
  SDSC
  NCSA
  …

Centre Projects
Pilot Projects
Regional Support
Research Projects

EPSRC, MRC, WT, SHEFC

UCSF

UIUC

From Klaus Schulten, Center for Biomolecular Modeling and Bioinformatics, Urbana-Champaign

DataGrid Testbed

Figure: map of Testbed Sites (>40), distinguishing HEP sites and ESA sites. Sites marked: Dubna, Moscow, RAL, Lund, Lisboa, Santander, Madrid, Valencia, Barcelona, Paris, Berlin, Lyon, Grenoble, Marseille, Brno, Prague, Torino, Milano, BO-CNAF, PD-LNL, Pisa, Roma, Catania, ESRIN, CERN, IPSL, Estec, KNMI.

A Simplified Grid Anatomy

Grid Plumbing & Security Infrastructure

Scheduling Accounting Authorisation

Monitoring Diagnosis Logging

Scientific Application

Data & Compute Resources

Operations Team

Application Developers

Distributed

Owners

Scientific Users

The Crux

Grid Plumbing & Security Infrastructure

Scheduling Accounting Authorisation

Monitoring Diagnosis Logging

Scientific Application

Data & Compute Resources

Operations Team

Application Developers

Distributed

Owners

Scientific Users

Keep all the (pink) groups HAPPY

A SDMIV Grid Anatomy

Grid Plumbing & Security Infrastructure

Scheduling Accounting Authorisation

Monitoring Diagnosis Logging

Scientific Application

Data & Compute Resources

Distributed

SDMIV Users

Data Access

Data Integration

Structured Data

Data Providers

Data Curators

Database Growth

PDB protein structures

Data Mining: Science vs Commerce

Data in files
  FTP a local copy / subset.
  ASCII or Binary.
  Each scientist builds own analysis toolkit
  Analysis is tcl script of toolkit on local data.
  Some simple visualization tools: x vs y

Data in a database
  Standard reports for standard things.
  Report writers for non-standard things
  GUI tools to explore data.
  Decision trees
  Clustering
  Anomaly finders

Jim Gray UCSC April 2002

But… some science is hitting a wall
FTP and GREP are not adequate

You can GREP 1 MB in a second
You can GREP 1 GB in a minute
You can GREP 1 TB in 2 days
You can GREP 1 PB in 3 years.

Oh!, and 1 PB ~10,000 disks

At some point you need
  indices to limit search
  parallel data search and analysis
This is where databases can help

You can FTP 1 MB in 1 sec
You can FTP 1 GB / min (= 1 $/GB)
… 2 days and 1K$
… 3 years and 1M$

Jim Gray UCSC April 2002

50,000 Kg, 250 KW, 60 Racks = 120 m2
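
The scaling argument on this slide can be checked with a little arithmetic. The sketch below is an illustration only: the sustained scan rate of 10 MB/s, the 100-byte record size and the index fan-out of 256 are assumed figures (the slide quotes rounded times, not rates), and `scan_seconds`, `indexed_probes` and `humanise` are hypothetical helper names. It contrasts a sequential scan, whose cost grows linearly with data size, with an index lookup, whose cost grows roughly with the logarithm of the number of records.

```python
import math

# Assumed sustained sequential scan rate; not a figure from the slide.
SCAN_BYTES_PER_SECOND = 10 * 10**6  # 10 MB/s

SIZES = {"1 MB": 10**6, "1 GB": 10**9, "1 TB": 10**12, "1 PB": 10**15}


def scan_seconds(size_bytes, rate=SCAN_BYTES_PER_SECOND):
    """Time to read every byte once at a fixed rate (a GREP-style scan)."""
    return size_bytes / rate


def indexed_probes(record_count, fanout=256):
    """Approximate page reads for one lookup in a B-tree-like index."""
    if record_count <= 1:
        return 1
    return math.ceil(math.log(record_count, fanout))


def humanise(seconds):
    """Render a duration in the largest convenient unit."""
    for unit, length in (("years", 365 * 86400), ("days", 86400),
                         ("hours", 3600), ("minutes", 60)):
        if seconds >= length:
            return f"{seconds / length:.1f} {unit}"
    return f"{seconds:.1f} seconds"


if __name__ == "__main__":
    record_bytes = 100  # assumed average record size
    for label, size in SIZES.items():
        probes = indexed_probes(size // record_bytes)
        print(f"{label}: full scan {humanise(scan_seconds(size))}, "
              f"indexed lookup about {probes} page reads")
```

At the assumed rate a petabyte scan takes roughly three years, the same order of magnitude as the slide's figure, while the indexed lookup stays at a handful of page reads. That gap is the case for indices and parallel search.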

Web Services Grid Technology

Grid Services

OGSA & OGSI

www.gridforum.org/ogsi-wg
www.gridforum.org/ogsa-wg
www.gridforum.org/

Web Services

Rapid Integration
  Dynamic binding

Commercial Power
  Financial & Political

Independence
  Client from Service
  Service from Client

Separation
  Function from Delivery

Description
  WSDL, WSC, WSEF, …

Tools & Platforms
  Java ONE, Visual .NET
  WebSphere, Oracle, …

www.w3c.org/TR/SOAP or TR/wsdl

Grid Technology

Virtual Organisations
  Sharing & Collaboration

Security
  Single Sign in, delegation

Distribution & fast FTP
  But Various Protocols

Resource Management
  Discovery
  Process Creation
  Scheduling
  Monitoring

Portability
  Ubiquitous APIs & Modules

Gov’nm’t Agency Buy in
Industrial Buy in

Foster, I., Kesselman, C. and Tuecke, S., The Anatomy of the Grid: Enabling Virtual Organisations, Intl. J. Supercomputer Applications, 15(3), 2001 http://www.gridforum.org/ogsi-wg

Open Grid Services Architecture

Virtual Grid Services

Applications

Multiple implementations of Grid Services

Using operations

Implemented by

OGS infrastructure

Foster, I., Kesselman, C., Nick, J. and Tuecke, S., The Physiology of the Grid: An Open Grid Services Architecture for Distributed Systems Integration

Scientific Data

Deluge of Data
  Exponential growth

Doubling times
  Astronomy 12 months
  Bio-Sequences 9 months
  Functional Genomics 6 months
  Bytes/dollar 12 to 18 months

Not How big it is but

Scientific Data

Deluge of Data
  Exponential growth

Doubling times
  Astronomy 12 months
  Bio-Sequences 9 months
  Functional Genomics 6 months
  Bytes/dollar 12 to 18 months

Not How big it is but
What you do with it
  Sharing
  Curation
  Metadata
  Automated movement, access & integration
  Computational Access

Scientific Data

Deluge of Data
  Exponential growth

Doubling times
  Astronomy 12 months
  Bio-Sequences 9 months
  Functional Genomics 6 months
  Bytes/dollar 12 to 18 months

Not How big it is but
How you Embrace & Manage Change
  The Database is a Knowledge chest
  The Database is a Communication Hub
  Autonomously Managed (Curated) change
  An Essential part of e-BioMedical, Astronomical, …, Science & Engineering
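
The doubling times quoted on this slide can be turned into growth factors to see why storage price trends alone do not absorb the deluge. The following is a minimal sketch: the doubling times are the slide's, while the 36-month horizon and the function names `growth_factor` and `relative_cost` are assumptions chosen for illustration.

```python
# Doubling times in months, as quoted on the slide.
DOUBLING_MONTHS = {
    "Astronomy": 12,
    "Bio-Sequences": 9,
    "Functional Genomics": 6,
}
# The slide gives 12 to 18 months for bytes/dollar; both ends are kept.
BYTES_PER_DOLLAR_MONTHS = (12, 18)


def growth_factor(months, doubling_months):
    """Multiplicative growth after `months` for a given doubling time."""
    return 2 ** (months / doubling_months)


def relative_cost(months, data_doubling, storage_doubling):
    """Change in the cost of storing the whole collection over `months`.

    A value above 1 means the data outgrows the fall in storage price.
    """
    return (growth_factor(months, data_doubling)
            / growth_factor(months, storage_doubling))


if __name__ == "__main__":
    horizon = 36  # assumed planning horizon in months
    for field_name, doubling in DOUBLING_MONTHS.items():
        best = relative_cost(horizon, doubling, BYTES_PER_DOLLAR_MONTHS[0])
        worst = relative_cost(horizon, doubling, BYTES_PER_DOLLAR_MONTHS[1])
        print(f"{field_name}: data x{growth_factor(horizon, doubling):.0f}, "
              f"storage cost x{best:.1f} to x{worst:.1f}")
```

Under these assumptions, a collection that doubles every 6 months grows 64-fold in three years while bytes per dollar improve only 4- to 8-fold, so the cost of simply keeping the data rises 8- to 16-fold.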

Wellcome Trust: Cardiovascular Functional Genomics

Figure: collaborating sites Glasgow, Edinburgh, Leicester, Oxford, London and Netherlands, linked by Shared data and Public curated data.

Data Access & Integration

Central to e-Science
  Astronomy, Earth Sciences, Ecology, Biology, Medicine, …

Collaboration
  Shared Databases
  Curated Knowledge
  Accumulated Observations
  Accumulated Simulations

Computation
  Data mining
  Input to models
  Calibration of models

Presentation
  Publication of results
  Visualisation

GGF DAIS WG

Chairs
  Norman Paton (Manchester Uni.)
  Leanne Guy (CERN)
  Dave Pearson (Oracle UK)

Activity
  BoF GGF4 Toronto
  WG Meeting GGF5 Edinburgh
  Papers for GGF6
  Workshops & Mail lists

Goals
  Agree Standards for Database Access & Integration
  Freely available reference implementations

OGSA-DAI one source & focus for discussions

Norman Paton, Inderpal Narang, Leanne Guy, Susan Maliaka, Greg Ricardi, …

http://www.cs.man.ac.uk/grid-db/

OGSA-DAI project

Lego kit for Data Access & Integration
Components for e-Science Applications

Accelerated Application Development
Multiple Data Models

Distributed Data
Access via Grid & Proxies

Integration, Translation & Transformation

Open Source Reference Implementation
For DAIS-WG standard

Trigger for Component Construction
Start a community

OGSA-DAI Partners

EPCC & NeSC
IBM UK
IBM USA
Manchester e-SC
Newcastle e-SC
Oracle

£3 million, 18 months, started February 2002

Figure: UK map of e-Science centres (Oxford, Glasgow, Cardiff, Southampton, London, Belfast, Daresbury Lab, RAL, Newcastle, Manchester, Cambridge, Hinxton) with the partners marked: EPCC & NeSC, Newcastle, IBM USA, IBM Hursley, Oracle, Manchester.

Primary Components

Figure labels: Client, Consumer, GDS, GDSF, GDSR, DB.

Advanced Components

Figure labels: Consumer, GDS Client, GDT, Translation, DB, GDS:PerformScript.

Composed Components

Figure labels: Client, Consumer, GDS, GDT, Translation, GDS:performScript.

Composing Components

Figure labels: OGSA-DAI Component, Data Transport.

DAI Key Components

GridDataService             GDS    Access to data & DB operations
GridDataServiceFactory      GDSF   Makes GDS & GDSF
GridDataServiceRegistry     GDSR   Discovery of GDS(F) & Data
GridDataTranslationService  GDTS   Translates or Transforms Data
GridDataTransportDepot      GDTD   Data transport with persistence

Relational & XML models supported
Role-based Authorisation
Binary structured files
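
The division of labour between registry, factory and data service can be sketched in a few lines. This is not the OGSA-DAI API: the Python classes below, their method names (`register`, `find`, `create_service`, `perform`) and the sample table and row are all hypothetical, and an in-memory SQLite database stands in for the real DB. The sketch only shows the pattern the slide names: a registry for discovery, a factory that makes services, and a service that performs database operations.

```python
import sqlite3


class GridDataService:
    """Stand-in for a GDS: gives access to one database."""

    def __init__(self, connection):
        self._connection = connection

    def perform(self, statement, parameters=()):
        """Run one SQL statement and return all result rows."""
        cursor = self._connection.execute(statement, parameters)
        return cursor.fetchall()


class GridDataServiceFactory:
    """Stand-in for a GDSF: makes GDS instances for one data resource."""

    def __init__(self, database_path):
        self._database_path = database_path

    def create_service(self):
        return GridDataService(sqlite3.connect(self._database_path))


class GridDataServiceRegistry:
    """Stand-in for a GDSR: discovery of factories by resource name."""

    def __init__(self):
        self._factories = {}

    def register(self, name, factory):
        self._factories[name] = factory

    def find(self, name):
        return self._factories[name]


def demo():
    """Discover a factory, create a service, then load and query data."""
    registry = GridDataServiceRegistry()
    registry.register("proteins", GridDataServiceFactory(":memory:"))
    service = registry.find("proteins").create_service()
    service.perform("CREATE TABLE structure (id TEXT, residues INTEGER)")
    # The identifier and residue count are made-up sample values.
    service.perform("INSERT INTO structure VALUES (?, ?)", ("1ABC", 154))
    return service.perform("SELECT id, residues FROM structure")


if __name__ == "__main__":
    print(demo())
```

Keeping creation in the factory means a client needs only the registry entry to reach data; where the database lives and how it is opened stay behind the service.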

OGSA Relationship

Class  GridService  Registry   NotificationConsumer  NotificationProducer
GDS    Mandatory    -          Optional              Normal
GDSF   Mandatory    -          Optional              Normal
GDSR   Mandatory    Mandatory  -                     Normal
GDTS   Mandatory    -          -                     -
GDTD   Mandatory    -          Optional              Normal

(- marks an empty cell)

DAI portType Usage

Class  GridDataService  DataTransport  Factory
GDS    Mandatory        Normal         -
GDSF   Optional         Normal         Mandatory
GDSR   Optional         -              -
GDTS   Optional         Mandatory      -
GDTD   Optional         Mandatory      -

(- marks an empty cell)

Distributed Query

Figure: distributed query processing involving a Client, Consumer, Registry, Factory, the DQP, several Evaluators, GDS instances and a DB, with interaction steps numbered 1 to 8.

Key to labels
  DQP: Distributed Query Processor
  GDT: Grid Data Transport
  T: Translation
  Q: Query
  GDTV: Grid Data Transport Vehicle
  R: Registry
  F: Factory
  QPM: Query Progress Monitor
  PNM: Progress Notification Message
  AM: Application Metadata
  CRM: Computational Resource Metadata
  NS: Notification Sink
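
The figure's overall shape, a coordinator handing sub-queries to several evaluators and merging what comes back, can be illustrated with a small scatter-gather sketch. This is an assumption-laden illustration, not the DQP design: the function names, the `reading` table, the sample rows and the use of in-memory SQLite partitions with a thread pool are all invented for the example.

```python
import sqlite3
from concurrent.futures import ThreadPoolExecutor


def make_partition(rows):
    """Build one in-memory database holding a slice of the data."""
    # check_same_thread=False lets a worker thread use this connection.
    connection = sqlite3.connect(":memory:", check_same_thread=False)
    connection.execute("CREATE TABLE reading (site TEXT, value REAL)")
    connection.executemany("INSERT INTO reading VALUES (?, ?)", rows)
    return connection


def evaluate(connection, threshold):
    """Evaluator role: run the sub-query against one partition."""
    cursor = connection.execute(
        "SELECT site, value FROM reading WHERE value > ?", (threshold,))
    return cursor.fetchall()


def distributed_query(partitions, threshold):
    """Coordinator role: fan the query out, then merge and order results."""
    with ThreadPoolExecutor(max_workers=len(partitions)) as pool:
        partial_results = pool.map(
            lambda connection: evaluate(connection, threshold), partitions)
        merged = [row for rows in partial_results for row in rows]
    return sorted(merged, key=lambda row: row[1], reverse=True)


if __name__ == "__main__":
    partitions = [
        make_partition([("A", 1.0), ("B", 5.0)]),
        make_partition([("C", 3.0), ("D", 0.5)]),
    ]
    print(distributed_query(partitions, 0.9))
```

Each partition is touched by exactly one worker, so the only shared step is the final merge. A real system would also need the progress monitoring and notification roles listed in the figure key.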

OGSA-DAI Time Line

Feb ’02 May ’02 Jul ’02 Sep ’02 Dec ’02 Feb ’03 May ’03 Sep ’03

Ship Alpha Release for GT3 Integration

RDB + GT2 / OGSA Prototypes Available

XML + OGSA Prototype Available

Design Documents & Demos for DAIS WG @ GGF5

XML + OGSA Prototypes for Early Adopters

WS + GSI UK support ( > 100 downloads)

Phase 1 Starts

Phase 2 Starts

Presentation & Beta @ GGF7

GGF6 WG Papers & Prototypes

Productisation, RAMPS & Extension

OGSA-DAI Summary

On Schedule & Going Well
  Contributions via DAIS-WG @ GGF5 & 6
  Releases with GT3 Releases scheduled

Status: Early Days
  Released prototypes
  Tested Architectural Design
  Using OGSA
  Working with Early Adopter Pilot Projects
    AstroGrid & MyGrid

First PRODUCT release Dec ’02

Influence OGSA-DAI direction
  Via DAIS-WG & Direct messages to us

Data Processing

Processing Characteristics
- Well defined work flow
- Correction, calibration, transformation, filtering, merging
- Relatively static reference data
- Stable processing functions (audited changes)
- Periodic reprocessing from archive

Figure labels: Instrument, Raw Data, Reference Data, Multi-stage Processing, Processed Data, Archive.

In Silico

Dave Pearson Provenance and Derivation workshop 18 Oct 02, Chicago
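
A well-defined, multi-stage workflow of this kind lends itself to recording provenance as a side effect of each stage. The sketch below is one possible arrangement, not a method described in the talk: the `Dataset` class, the stage helpers, the three stage functions and the reference values (`offset`, `gain`, `low`, `high`) are hypothetical.

```python
from dataclasses import dataclass, field


@dataclass
class Dataset:
    """Values plus the ordered record of how they were produced."""
    values: list
    provenance: list = field(default_factory=list)


def apply_stage(dataset, name, function, **parameters):
    """Transform every value and append the stage to the provenance."""
    new_values = [function(value, **parameters) for value in dataset.values]
    record = {"stage": name, "parameters": parameters}
    return Dataset(new_values, dataset.provenance + [record])


def filter_stage(dataset, name, predicate, **parameters):
    """Keep values passing the predicate and record the stage."""
    kept = [value for value in dataset.values if predicate(value, **parameters)]
    record = {"stage": name, "parameters": parameters}
    return Dataset(kept, dataset.provenance + [record])


def correct(value, offset):
    return value - offset


def calibrate(value, gain):
    return value * gain


def within(value, low, high):
    return low <= value <= high


def process(raw_values, reference):
    """Run correction, calibration and filtering using reference data."""
    data = Dataset(list(raw_values))
    data = apply_stage(data, "correction", correct, offset=reference["offset"])
    data = apply_stage(data, "calibration", calibrate, gain=reference["gain"])
    data = filter_stage(data, "filtering", within,
                        low=reference["low"], high=reference["high"])
    return data


if __name__ == "__main__":
    reference = {"offset": 1.0, "gain": 2.0, "low": 0.0, "high": 10.0}
    result = process([1.0, 3.0, 9.0], reference)
    print(result.values)
    for entry in result.provenance:
        print(entry)
```

Because each stage returns a new `Dataset` and never mutates its input, the raw data stays available for periodic reprocessing, and the provenance list records which reference values each stage used.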

Analysis and Interpretation

Summarisation

Processed Data

Archive

Summarised Data

Analysis Characteristics
- Variable workflow
- Standard functions
- Standard and personal filtering and summarisation
- Retain drill down capability

Dave Pearson Provenance and Derivation workshop 18 Oct 02, Chicago

Analysis and Interpretation

Analysis and Interpretation Characteristics
- Highly dynamic work flow
- Multiple data types
- Volatile data
- Annotations, inferences, conclusions
- Evidential reasoning
- Shared multiple versions of truth
- Periodic version consolidation

Processed Data

Result data Retrieval & Update

Summarised Data

Personalised Database

Conclusions/Inferences
- Descriptions
- Trends
- Correlations
- Relationships

Dave Pearson Provenance and Derivation workshop 18 Oct 02, Chicago

Metadata Requirements

Technical Metadata
  Direct referencing - Physical location and data schema/structure
  Data currency/status - version, time stamping
  Accreditation/Access permissions - Ownership (Dublin Core)
  Query time/Governance - data volume, no. of records, access paths

Contextual Metadata
  Logical referencing physical data - semantic/syntactic ontologies
  Lexical translation - Thesaurus, ontological mapping
  Named derivations (summarisations)

Scope of Requirements
  All science communities
  Related to provenance

Dave Pearson Provenance and Derivation workshop 18 Oct 02, Chicago

Metadata Requirements

Data Versioning
  Distinguish latest/agreed version of data
  Maintain history record of change
  Synchronise and mirror replicated data
  Distinguish shared personal interpretations and/or annotations

Provenance
  Record of data processing - calibration, filtering, transformation
  Record of workflow - methods, standards and protocols
  Reasoning - evidential justification for inferences & conclusions

Scope of Requirements
  All science communities
  Includes Technical and Contextual Metadata

Dave Pearson Provenance and Derivation workshop 18 Oct 02, Chicago
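
Two of the versioning requirements, distinguishing the latest from the agreed version and keeping a history of change, can be sketched with a small data structure. This is a hypothetical illustration: the `Version` and `VersionedRecord` classes, their fields, and the sample annotations and author names are assumptions rather than anything specified in the talk.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class Version:
    """One immutable entry in the history of a data item."""
    number: int
    value: str
    author: str
    note: str


@dataclass
class VersionedRecord:
    """A data item with full history and an optional agreed version."""
    history: list = field(default_factory=list)
    agreed_number: int = 0  # 0 means no version has been agreed yet

    def revise(self, value, author, note):
        """Append a new version; earlier versions are never overwritten."""
        version = Version(len(self.history) + 1, value, author, note)
        self.history.append(version)
        return version

    def agree(self, number):
        """Mark an existing version as the community-agreed one."""
        if not 1 <= number <= len(self.history):
            raise ValueError(f"no such version: {number}")
        self.agreed_number = number

    def latest(self):
        return self.history[-1] if self.history else None

    def agreed(self):
        if self.agreed_number == 0:
            return None
        return self.history[self.agreed_number - 1]


if __name__ == "__main__":
    record = VersionedRecord()
    record.revise("kinase", "alice", "initial annotation")
    record.revise("tyrosine kinase", "bob", "personal interpretation")
    record.agree(1)
    print("latest:", record.latest().value)
    print("agreed:", record.agreed().value)
```

Separating "latest" from "agreed" lets a personal interpretation sit alongside the consensus value without replacing it, which is one way to accommodate multiple versions of the truth.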

Provenance Issues

Schema evolution
Granularity of record
  Processed v Derived
Inheritance
Lack of structured annotations, ontologies
Interactive analysis = dynamic workflow
Multiple derived data sources
Context of usage
Best practice can change
Multiple versions of the truth
Evidential reasoning
Existing data & applications
Where is the provenance record stored

Dave Pearson Provenance and Derivation workshop 18 Oct 02, Chicago

Collaborative Annotation

See DAS
  Distributed Annotation Service

Challenges
  Autonomy
  Selective viewing
  Identification
  Provenance
  Derivation

Biomedical e-Scientists

Is this one species?
Understanding bird energy
Understanding a river / ocean interaction
Understanding a biochemical pathway
Understanding a cell
Understanding a Heart or Brain
Understanding Rhododendra
Understanding Evolution
…

No One-Size fits all solutions
But sharable re-usable components

Opportunities

Many, many …
  More than we can address
  Compute needs
  Data management needs
  Data integration needs
  …

Must choose some pioneers
  To meet a range of common requirements
  To provoke rich & high-level platform
  To generate re-usable components

A Long-Term Commitment Needed

Advancing SDMIV Grid

Grid Plumbing & Security Infrastructure

Scheduling Accounting Authorisation

Monitoring Diagnosis Logging

Scientific Application

Data & Compute Resources

Distributed

SDMIV Users

Data Access

Data Integration

Structured Data

SDMIV (Grid) Application Component Library

Summary

e-Science
  Data as well as Compute Challenges
  Needed to be put together
  Need ubiquitous supported consistent platforms

Grid
  A (potentially) invaluable platform
  Only show in town

Data Integration
  Hard
  Develop & Use Standard kit of parts
  Started to build the kit
  No ready made general integration
  Combines application & computing science

Opportunities
  No one-size fits all, but re-usable subsystems
  Invest in wider range of Problem driven pioneering
  Strategic choices needed

