www.inf.ed.ac.uk
Scrying the next generation of data-intensive research infrastructure
Research at the Data-Intensive Research Group of
the University of Edinburgh
Paul Martin OSDC-PIRE 2014, University of Amsterdam
Edinburgh
School of Informatics
Future research infrastructures…
• …must support a large range of different research interactions.
– Data collection, curation, processing and publication.
– Curation of models and methods.
– Community networks and cross-infrastructure interactions.
• …must support a diverse cast of research actors.
– Investigators, empiricists, theorists, librarians, engineers, etc.
• …must balance conflicting issues:
– Openness and accountability.
– Preservation and accessibility.
– Interoperability and efficacy.
– Oversight and autonomy.
The Data-Intensive Research Group
• Part of the Centre for Intelligent Systems and their Applications in the School of Informatics at the University of Edinburgh.
• Research agenda focuses on how best to address current and future data-intensive research problems:
– How to manage large volumes of data;
– How to process distributed data in different environments;
– How to manage the code and tools used to handle data.
• Recent emphasis has been on workflow-based systems: languages and tools for workflow composition, services for deploying workflows, workflow optimisation and provenance gathering, etc…
• …but also, infrastructure modelling, scientific gateways, commodity supercomputing and anything else that catches our interest.
Supporting Research Interactions
• Support a diverse range of interactions by domain experts at the high level…
• …by providing standard interchange formats…
• …that sit atop a heterogeneous array of execution platforms.
[Figure: three-level architecture (tool level, broker level, enactment level) linking user tools to execution platforms. Roles: domain experts, data-analysis experts, data-intensive engineers, data curators. Other labels: registries, repositories, optimisation, logical schemata, gateways, component mappings, observations, virtualisation. Axes: user and application diversity; system complexity.]
VERCE
• Virtual Earthquake and Seismology Research Community e-Science Environment in Europe.
• Design, build and integrate components for data processing in the seismology domain.
– Streamline the process of configuring and conducting several standard types of computational task.
– Open facilities for the broader community.
– Focus on particular ‘data-intensive’ and ‘HPC’ use-cases.
• ‘Satellite’ project of EPOS (European Plate Observing System).
– Contribute to EPOS Core Services.
VERCE Overview
VERCE Principles
VERCE Technology Stack (c.2014)
[Figure: VERCE platform of data-intensive services and applications. Labelled elements: VERCE scientific gateway; integrated tools; portals; community and user support; dissemination and training; catalogues and registries; component repositories; enactment layer of services and processing elements; grid, HPC, network and data infrastructure; data archives.]
Technology stack
– Web portal: Liferay, gUse
– Workflow specification: WS-PGRADE, Dispel4Py
– Deployment: MPI, Storm
– Data infrastructure: ArcLink, GridFTP, iRODS, HDFS
– Grid/HPC infrastructure: Globus, UNICORE
Dispel4Py
• Python-based implementation of DISPEL (Data-Intensive Systems Process Engineering Language).
– Used to describe distributed data-streaming workflows at a logical level.
– Wraps Python code into Processing Elements (PEs; initial focus on seismology applications).
– Workflow graph can be deployed on various platforms (currently Storm and MPI).
• Principles of Dispel:
– Inline specification of new PEs as compositions of existing PEs.
– Strong typing for both language and dataflow with additional semantic (domain) annotation.
– Work in progress…
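The two ideas above (wrapping Python code into PEs, and defining new PEs as compositions of existing ones) can be sketched in plain Python. This is an illustration only and does not use the Dispel4Py API: the `PE` class, the `compose` helper and the three example functions are hypothetical names invented for this sketch.

```python
from typing import Callable, Iterable, Iterator


class PE:
    """Hypothetical processing element: applies a function to each item of a stream."""

    def __init__(self, name: str, func: Callable[[object], Iterable[object]]):
        self.name = name
        self.func = func  # maps one input item to zero or more output items

    def process(self, stream: Iterable[object]) -> Iterator[object]:
        for item in stream:
            yield from self.func(item)


def compose(name: str, *pes: PE) -> PE:
    """Build a new PE as a linear composition of existing PEs."""
    def run(item):
        stream = [item]
        for pe in pes:
            stream = pe.process(stream)
        return stream
    return PE(name, run)


# Wrap ordinary Python functions as PEs (toy seismology-flavoured steps).
detrend = PE("detrend", lambda trace: [[x - sum(trace) / len(trace) for x in trace]])
peak = PE("peak", lambda trace: [max(abs(x) for x in trace)])
alarm = PE("alarm", lambda amp: [amp] if amp > 1.0 else [])

# The composite PE is used exactly like a primitive one.
detect = compose("detect", detrend, peak, alarm)

traces = [[0.0, 0.1, -0.1], [1.0, 4.0, 1.0], [2.0, 2.0, 2.0]]
results = list(detect.process(traces))
print(results)
```

Only the second trace has a detrended peak above the threshold, so only one value reaches the end of the stream. A real deployment would map the same logical graph onto Storm or MPI rather than run it in a single process.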
Dispel4Py workflow illustration
[Figure: example workflow graph. Node labels: TaskScheduler, DataGenerator, TupleBuild, CorroboratedQuery, TupleBurst, TupleSchema (twice), TypeConverter, ForecastModeller, Warning, Results. Data sources: "uk.org.UoE.dbA", "uk.org.UoE.dbB", "uk.org.UoE.dbC". Output: "Forecast Results".]
Dispel4Py lifecycle
ENVRI
• Common Operations of Environmental Research Infrastructures.
• Initiative to promote interoperability between ESFRI projects in the Environmental Cluster.
– Model characteristics of environmental research infrastructures to identify commonalities and gaps.
– Provide tools and services for data discovery and integration.
– Improve social links between ESFRI and affiliated projects.
• Part of a general strategic effort to simplify the construction of bespoke infrastructure by pooling expertise and resources.
ENVRI Requirements
ENVRI Reference Model
Data Acquisition
Data collection
Instrument access
Process control
Instrument monitoring
Instrument configuration
Instrument integration
Instrument calibration
Configuration logging
Instrument monitoring
Parameter visualisation
Realtime parameter visualisation
Realtime data collection
Data sampling
Noise reduction
Data transmission
Realtime data transmission
Data transmission monitoring
Data Curation
Data quality checking
Data quality verification
Data identification
Data cataloguing
Data product generation
Data versioning
Workflow enactment
Data storage & preservation
Data replication
Replica synchronisation
Access control
Resource annotation
Data annotation
Metadata harvesting
Resource registration
Metadata registration
Identifier registration
Sensor registration
Data conversion
Data compression
Data publication
Data citation
Semantic harmonisation
Data discovery and access
Data visualisation
Data Access
Data Processing
Data assimilation
Data analysis
Data mining
Data extraction
Scientific modelling & simulation
Scientific workflow enactment
Scientific visualisation
Service naming
Data processing control
Data process monitoring
Community Support
Authentication
Authorisation
Accounting
User registration
Instant messaging
Interactive visualisation
Event notification
ENVRI Reference Model
• A standard abstract model for environmental research infrastructures.
• Founded on RM-ODP (Reference Model for Open Distributed Processing).
– Standard for modelling distributed systems.
– Viewpoint-based: Enterprise, Information, Computational, Engineering and Technology.
– Support for UML-style design.
• Current model iteration based on core ‘data pipeline’ (acquisition, curation, access).
– Lightweight modelling of Enterprise (Science), Information and Computational Viewpoints.
– Main study cases: EISCAT_3D, EPOS and ICOS.
ENVRI Reference Model Example
• Example of raw data collection from the computational viewpoint:
[Figure: computational objects grouped under data acquisition, data curation and community support. Objects: acquisition service, data transfer service, instrument controller, raw data collector, data store controller, PID service, catalogue service, field laboratory, security service. Operations: prepare data transfer, configure instrument, deliver raw data, import data for curation, update records, new transporter, acquire identifier, retrieve data, update catalogues, update registry, authorise action.]
EFFORT
• Earthquake and Failure Forecasting in Real Time.
• Project to monitor rock failure experiments in real time.
– Rock samples are subjected to continued pressure in laboratory conditions.
– Stress leads to deformation, leading to sudden failure.
– Models of rock failure may apply to plate deformation and volcanic events.
• Need the ability to collect data continuously and reliably from remote experiments, relate it to proposed models and provide visualisations on demand.
• Project expanded to build a standard library for volcanology and rock physics analyses (VarPy).
EFFORT system overview
GeWWE
• Generic Web-based Workflow Editor.
• Project to build a multi-target workflow editing tool.
– Thesis is that workflows are always built from the same fundamental components (standard schema).
– Average user is not keen to learn any specific workflow programming language (like Dispel…).
– Can map workflows to a number of target languages / platforms.
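The multi-target idea can be illustrated with a toy example: one neutral workflow description rendered into two different textual targets. This is a sketch under assumptions, not the GeWWE schema or its real targets: the `workflow` dictionary layout, the node names and both rendering functions are hypothetical. It uses `graphlib` from the standard library (Python 3.9 or later).

```python
from graphlib import TopologicalSorter

# Hypothetical neutral description: named nodes plus directed connections.
workflow = {
    "nodes": ["read", "filter", "plot"],
    "edges": [("read", "filter"), ("filter", "plot")],
}


def to_dot(wf: dict) -> str:
    """Target 1: render the workflow as a Graphviz DOT digraph."""
    lines = ["digraph workflow {"]
    lines += [f'  "{n}";' for n in wf["nodes"]]
    lines += [f'  "{a}" -> "{b}";' for a, b in wf["edges"]]
    lines.append("}")
    return "\n".join(lines)


def to_pipeline(wf: dict) -> str:
    """Target 2: render the workflow as a shell-style pipeline in dependency order."""
    # TopologicalSorter expects a mapping from each node to its predecessors.
    deps = {n: set() for n in wf["nodes"]}
    for src, dst in wf["edges"]:
        deps[dst].add(src)
    order = TopologicalSorter(deps).static_order()
    return " | ".join(order)


print(to_dot(workflow))
print(to_pipeline(workflow))
```

The user edits only the neutral description; each target language or platform needs one mapping function, so adding a target does not change how workflows are authored.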
GeWWE Schema
GeWWE screenshot
Other Projects
• EDIM1 – commodity data-brick computing.
• TerraCorrelator – doing data-intensive geoscience.
• DECIPHER – quasi-anonymous analysis of medical data.