Date posted: 27-Dec-2015
Uploaded by: hilary-oconnor
Common solution for the (very-)large data challenge.
VLDATA
Call: EINFRA-1 (focus on Topics 4-5). Deadline: September 2nd, 2014
1.1 Objectives

The mission of VLDATA is to provide common solutions for handling large and extremely large scientific data sets in a cost-effective way. The solution builds on existing pan-European e-Infrastructures and tools to provide an interoperable, efficient and sustainable platform for scientific user communities, and in particular to support a new generation of data scientists. The success of this project will secure European leadership in the development and support of big data and global data science and will therefore contribute to the leadership of European scientists and enterprises in many research and innovation fields.
Objectives (I)

• O1: a flexible and extendable platform supporting common solutions for large-scale distributed data processing and analysis, ensuring interoperability among existing e-Infrastructure providers.
  – O1.1: (WP2,3,4,5) provide a common solution using generic e-Infrastructures for processing large-scale or extremely large-scale scientific data in a robust, efficient and cost-effective way.
  – O1.2: (WP6) provide a flexible and customizable platform that can be extended to cover the specific requirements of each community.
Objectives (II)

• O2: standardized solutions aiming at global interoperability of open access for large-scale data processing, minimizing unnecessary large data transfers.
  – O2.1: (WP2) provide a common language and standards for handling large volumes of data.
  – O2.2: (WP2,3,4,5) improve the efficiency of distributed data processing by providing a smart data and computing management platform.
  – O2.3: (WP2,3,4,5) enable effective handling of big data samples by integrating new technologies.
  – O2.4: (WP8) assess the value of this generic solution for the relevant stakeholders: scientists, their management, funding agencies, policy makers, companies and society at large.
Objectives (III)

• O3: increase the number of users and Research Infrastructure projects making efficient use of existing e-Infrastructure resources, designing appropriate exploitation strategies and a long-term sustainability plan.
  – O3.1: (WP5,7) deliver ready-to-use, high-quality standard products for internal and external usage, enhancing interdisciplinary data science at a global scale.
  – O3.2: (WP6,9) increase the degree of open access to large-scale distributed data.
  – O3.3: (WP9) educate a new generation of data scientists and society in general.
1.2 Relation to the work programme
1.3 Concept and approach (ideas)
Make IT simple

• Simplicity: VLDATA provides an abstraction of the different resources, all made accessible to the end user via the same interfaces.
• Transparency: users can specify their workflows/pipelines at different levels of abstraction; the platform takes care of the resource allocation necessary to fulfil the required specifications.
• Extensibility and flexibility: VLDATA provides an API that allows users to extend the provided functionality by developing new or customized components.
• Reliability: quality standards and extensive validation in several scientific domains ensure the readiness and robustness of VLDATA-based solutions.
• Scalability: a modular implementation allows horizontal (number of connected resources or users) and vertical (number of processed units) scaling to adapt VLDATA to the needs of each particular community or Research Infrastructure project.
• Smart and intelligent: building on collected experience and monitoring data, algorithms can look for optimized scheduling/searching strategies, including automated decision making based on usage traces and expectations.
• Cost-effective: building on existing, well-established solutions and incrementally extending and developing them to address new challenges with an evolving, validated common solution, avoiding unnecessary duplication of effort.
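The simplicity and transparency principles above can be sketched as a minimal resource abstraction: a single interface hides the provider behind each step, and the platform, not the user, performs the allocation. All class and function names here are hypothetical illustrations, not part of VLDATA or DIRAC.

```python
from abc import ABC, abstractmethod

class Resource(ABC):
    """Common interface hiding provider details (grid, cloud, cluster, HPC)."""

    @abstractmethod
    def submit(self, job: dict) -> str:
        """Submit a job description and return a provider-specific job id."""

class GridResource(Resource):
    def submit(self, job: dict) -> str:
        return f"grid-{job['name']}"

class CloudResource(Resource):
    def submit(self, job: dict) -> str:
        return f"cloud-{job['name']}"

def run_pipeline(steps, resources):
    """The platform, not the user, picks a resource for each step
    (a trivial round-robin stands in for real resource allocation)."""
    return [resources[i % len(resources)].submit(step)
            for i, step in enumerate(steps)]

print(run_pipeline([{"name": "calib"}, {"name": "reco"}],
                   [GridResource(), CloudResource()]))
# -> ['grid-calib', 'cloud-reco']
```

Because every backend satisfies the same `Resource` interface, a community can add a new provider type without changing its pipeline definitions, which is the extensibility point the API bullet describes.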
1.3 Concept and approach (model)

• Model (building blocks):
  – Collaborative modular architecture, with multiple layers sharing the same Framework and Basic modules, allowing horizontal and vertical scaling to ensure scalability.
  – Open, iterative, incremental and parallel, requirement-driven development process; Agile(?) methodology.
  – Standard procedures for quality assurance, including security, platform integration and validation (with reference benchmarks), and release procedures in accordance with the requirements for production-level services.
• Layers (the result of 10 years of evolution of the DIRAC development effort):
  – Framework: communication, security, access control, user/group management, DBs
  – Basic modules: SystemLogging, Configuration, Accounting, Monitoring
  – Low-level modules: File Catalog, Resource Status, Request Management, Workload Management
  – High-level modules: Data Management, Workflow Management
  – Interfaces: User - Resource
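The layering above implies a dependency discipline: each layer builds only on the layers below it. A minimal sketch of that rule, with the module names taken from the slide but the data layout and function purely illustrative:

```python
# Hypothetical sketch of the layer stack listed above; the module names
# follow the slide, but the dict layout and the checker are assumptions.
LAYERS = {
    "framework":  ["Communication", "Security", "AccessControl", "UserGroupManagement", "DBs"],
    "basic":      ["SystemLogging", "Configuration", "Accounting", "Monitoring"],
    "low_level":  ["FileCatalog", "ResourceStatus", "RequestManagement", "WorkloadManagement"],
    "high_level": ["DataManagement", "WorkflowManagement"],
    "interfaces": ["UserInterface", "ResourceInterface"],
}
ORDER = ["framework", "basic", "low_level", "high_level", "interfaces"]

def allowed_dependencies(layer: str) -> list[str]:
    """Modules a layer may build on: everything in the layers below it."""
    below = ORDER[:ORDER.index(layer)]
    return [module for lower in below for module in LAYERS[lower]]

print(allowed_dependencies("low_level"))
# framework and basic modules only, never DataManagement or the interfaces
```

Keeping the shared Framework and Basic modules at the bottom is what lets the communities replace or extend a high-level module without touching the common core.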
1.3 Concept and approach (assumptions)

• The current solution can be evolved into a new, general platform to be widely applied.
• Evolution from grids to clouds, but heterogeneity will increase.
• Large degree of commonality in low-level requirements and tools between different scientific domains.
• Fast growth of data and computing requirements, almost doubling every year; the aggregated estimate is close to the exabyte level in 5 years from now (EGI expects 10,000,000 cores and 1?? exabytes of scientific data by 2020). (Ref: http://delaat.net/talks/cdl-2014-05-13.pdf)
• Similar growth in the number of data objects, computing units and end users (60% of ESFRI projects completed or launched by 2015).
• New scientific domains are entering the digital era: the 4th paradigm of science, and a new data science, are emerging (http://research.microsoft.com/en-us/collaboration/fourthparadigm/).
• Data is to be made openly available beyond the community that produced it, down to the citizens who might also contribute to its further processing.
• Common development and validation provide robustness as well as cost savings and thus enable sustainability.
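The "almost doubling every year" assumption can be checked with a one-line projection. The 2014 baseline of 30 PB below is a made-up example value, not a figure from the proposal; it merely illustrates how yearly doubling reaches the exabyte scale in about five years.

```python
# Illustrative projection of the "almost doubling every year" assumption.
# The baseline volume is an example value, not from the proposal.
baseline_pb = 30.0      # assumed aggregate data volume today, in petabytes
growth_factor = 2.0     # doubling every year
years = 5

projected_pb = baseline_pb * growth_factor ** years
print(f"{projected_pb:.0f} PB in {years} years")  # 960 PB, close to the exabyte level
```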
1.4 Ambition
2.1 Expected impacts

• DIRECT impact:
  – Scalability and robustness (for the Research Infrastructures): participating RI projects will be able to operate their distributed computing systems efficiently, processing their large volumes of research data and making them available to their end users in a reliable and cost-effective way that could not be achieved before. This may lead to new ways of organizing scientific activities and to significant scientific breakthroughs. By providing important functional components that are missing from existing practices, the VLDATA platform will make possible the transparent integration of resources, hiding the complexity from users and extending the scale of the resources that Research Infrastructure projects can utilize. This will increase both the number of RIs using the project tools and the number of different types of resources reachable through them.
  – Simplicity (for the user: scientist/operator).
  – Cost-efficiency (for funding agencies): reduce duplicated efforts, maximizing the use of EU-funded e-Infrastructures, enlarging the user communities, providing efficient data processing services, and providing advanced technology by integrating the state of the art, which reduces development costs significantly (also the processing algorithms).
2.1 Expected impacts

• INDIRECT impact: a large user community
  – Science and innovation
  – Society and industry
  – Citizens
  – Policy makers
  – A new generation of data scientists
• On the other hand, the scale of the data challenge requires simple but intelligent solutions to integrate resources from different e-Infrastructure providers.
2.2 Measures to maximize impact
Research Infrastructures (I)

• Belle II:
  – Usage of DIRAC for the experiment; use cases presented:
    • Common access to various platforms: grid + cloud + cluster + HPC
    • Support for monitoring of workflow management tools
    • Integration of the needs of other participants
    • User interface
  – EU-T0: virtual data centers / new virtualization techniques?
Research Infrastructures (II)

• PAO:
  – Usage of DIRAC for the experiment; data taking until 2022:
    • Using a standard solution will help sustainability.
    • Extend functionality for their use case.
    • Common access to various platforms: grid + cloud + cluster + HPC (following the evolution of the providers), in particular OSG.
    • Open access to data.
  – EU-T0: data locality
Research Infrastructures (III)

• LHCb:
  – Should cover Run 2 needs and target the needs of Run 3 (DAQ upgrade):
    • Data rate will increase by a factor of ~5, to 10 PB/year.
    • Integration of cloud resources.
    • Massive data-driven workflows for users.
    • Data preservation (?)
    • Resource (CPU/storage/network/...) description/monitoring/availability/management, smart allocation.
    • Smart/intelligent/dynamic data placement strategies (network).
  – EU-T0: new virtualization techniques; resource description/monitoring/availability; virtual data centers; data locality
Research Infrastructures (IV)

• EISCAT_3D:
  – Searching data (metadata catalog), intelligent searching (pattern recognition)
  – Visualization
  – Workflows to go from one data level to another with the appropriate access rights
  – Training
  – Flexible interconnection of different resources, central (HPC) + distributed (grid/cloud)
  – Time-constrained massive data reduction (10 PB -> 1 PB / month ??), including the possibility of user-defined algorithms.
• EU-T0:
Research Infrastructures (V)

• BES III:
3.1 Work Plan (to be confirmed)

• WP1 Coordination (UB, Spain)
  – External Advisory Board (EUDAT, OGF, RDA, OSG, PRACE, XSEDE, CERN/HelixNebula, ...)
• WP2 Requirement analysis & design (CU, UK)
• WP3 Data-driven development (UB, Spain)
• WP4 User-driven development (CYFRONET, Poland)
• WP5 Quality (UAB, Spain)
• WP6 Validation (????)
  – LHCb (????)
  – Belle II (Institut Jozef Stefan, UniMB Maribor and UniLJ, Slovenia)
  – EISCAT_3D (SNIC, Sweden / EISCAT Scientific Association)
  – PAO (CESNET, Czech Republic)
  – BES III (IHEP, China)
  – Proteomics (ETH Zurich, Switzerland)
  – MosGrid (U Tübingen, Germany) / CMMST (U Perugia, Italy)
  – Seismology (MTU, Turkey)
  – AMC (Netherlands)
  – Astrophysics (INAF, Italy)
  – Heliophysics (Trinity College Dublin, Ireland)
  – VERCE (???)
  – DRIHM (CIMA Foundation, Italy)
  – SMEs (U. Zaragoza, Spain)
  – DIRAC4EGI, multi-community EGI solution (EGI.eu, the Netherlands)
• WP7 Dissemination: outreach + training (CNRS, France)
• WP8 Exploitation (ASCAMM, Spain)
• WP9 Communication, internationalization (UvA, the Netherlands)
3.2 Management structure and procedures

[Organigram] Bodies and roles:
• Coordinator, Technical Coordinator, Project Manager, Communication/Exploitation Coordinator
• Consortium Board (all partners)
• Executive Board (1 representative from each area)
• External Advisory Board
• Work package areas: Design/Development WPs (2,3,4,5); Integration/Operations WPs (6); Communication/Sustainability WPs (7,8,9)
• Internal and external communities' coordinators
3.3 Consortium as a whole
Private Companies

• Bull/Dell (??)
• ETL (UK)
• AlpesLaser (CH)
3.4 Resources to be committed
Calendar (milestones)

• May 23: close the contractors
• June 11-13: all WPs ready; F2F meeting to close the work plan; deadline for RIs and third parties
• July 9-11: close proposal (I)
• July 25: proofread -> external review
• Aug 18 -> Sep 2: final updates
Writing Calendar (I)

Milestone                                               Date
Close technology contractors                            May 23rd
Close consortium & complete work programme description  June 13th
Complete proposal                                       July 11th
English proofread                                       August 1st
External review                                         August 15th
Final version                                           August 29th
Deadline for submission                                 September 2nd
Writing Calendar (II)

Meeting/Task                                                    Type              Date
Decide on communities proposed by P. Kacsuk                     Mail, all         May 30th
Reorganization of the development area                          Virtual, WPL(*)   June 2-6
First WP review: put contributions in common; every WP
  provides a first draft and a list of activities               Virtual, WPLs     June 6th 14:00
WP integration: ensure consistency, third parties, budget       F2F, WPLs         June 11-12
First draft of sections: (1) Excellence, (2) Impact,
  (3) Implementation                                            Virtual, Editors  June 27th 14:00
Complete proposal: full review                                  F2F, Editors      July 9-10
Proofread: submit                                               Mail              July 11th
Proofread: received, full review, merging                       Virtual, Editors  July 25-30
External review: submit                                         Mail              July 30th
External review: received, full review, merging                 Virtual, Editors  August 18-29