Date posted: 27-Dec-2015
Uploaded by: hilary-oconnor
Common solution for the (very-)large data challenge.
VLDATA
Call: EINFRA-1 (focus on Topics 4-5). Deadline: September 2nd, 2014
1.1 Objectives

The mission of VLDATA is to provide common solutions for handling large and extremely large scientific data sets in a cost-effective way. The solution builds on existing pan-European e-Infrastructures and tools to provide an interoperable, efficient and sustainable platform for scientific user communities, and in particular to support a new generation of data scientists. The success of this project will secure European leadership in the development and support of big data and global data science and will therefore contribute to the leadership of European scientists and enterprises in many research and innovation fields.
Objectives (I)

• O1: a flexible and extendable platform supporting common solutions for large-scale distributed data processing and analysis, ensuring interoperability among existing e-Infrastructure providers.
  – O1.1: (WP2,3,4,5) provide a common solution using generic e-Infrastructures for processing large-scale or extremely large-scale scientific data in a robust, efficient and cost-effective way.
  – O1.2: (WP6) provide a flexible and customizable platform that can be extended to cover the specific requirements of each community.
Objectives (II)

• O2: standardized solutions aiming at global interoperability of open access for large-scale data processing, minimizing unnecessary large data transfers.
  – O2.1: (WP2) provide a common language and standards for handling large volumes of data.
  – O2.2: (WP2,3,4,5) improve the efficiency of distributed data processing by providing a smart data and computing management platform.
  – O2.3: (WP2,3,4,5) enable effective handling of big data samples by integrating new technologies.
  – O2.4: (WP8) assess the value of this generic solution for the relevant stakeholders: scientists, their management, funding agencies, policy makers, companies and society at large.
Objectives (III)

• O3: increase the number of users and Research Infrastructure projects making efficient use of existing e-Infrastructure resources, designing appropriate exploitation strategies and a long-term sustainability plan.
  – O3.1: (WP5,7) deliver ready-to-use, high-quality standard products for internal and external usage, enhancing interdisciplinary data science at a global scale.
  – O3.2: (WP6,9) increase the degree of open access to large-scale distributed data.
  – O3.3: (WP9) educate a new generation of data scientists and society in general.
1.2 Relation to the work programme
1.3 Concept and approach (ideas)
Make IT simple

• Simplicity: VLDATA provides an abstraction of the different resources, all made accessible to the end user via the same interfaces.
• Transparency: users can specify their workflows/pipelines at different levels of abstraction; the platform takes care of the resource allocation necessary to fulfil the required specifications.
• Extensibility and flexibility: VLDATA provides an API that allows users to extend the provided functionality by developing new or customized components.
• Reliability: quality standards and extensive validation in several scientific domains ensure the readiness and robustness of VLDATA-based solutions.
• Scalability: a modular implementation allows horizontal (number of connected resources or users) and vertical (number of processed units) scaling to adapt VLDATA to the needs of each particular community or Research Infrastructure project.
• Smart and intelligent: building on collected experience and monitoring data, algorithms can look for optimized scheduling/searching strategies, including automated decision making based on usage traces and expectations.
• Cost-effective: building on existing, well-established solutions and incrementally extending and developing them to address new challenges with an evolving, validated common solution, avoiding unnecessary duplication of effort.
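The simplicity and transparency principles above can be sketched as a minimal resource abstraction: a single interface hides the provider behind each step, and the platform, not the user, performs the allocation. All class and function names here are hypothetical illustrations, not part of VLDATA or DIRAC.

```python
from abc import ABC, abstractmethod

class Resource(ABC):
    """Common interface hiding provider details (grid, cloud, cluster, HPC)."""

    @abstractmethod
    def submit(self, job: dict) -> str:
        """Submit a job description and return a provider-specific job id."""

class GridResource(Resource):
    def submit(self, job: dict) -> str:
        return f"grid-{job['name']}"

class CloudResource(Resource):
    def submit(self, job: dict) -> str:
        return f"cloud-{job['name']}"

def run_pipeline(steps, resources):
    """The platform, not the user, picks a resource for each step
    (a trivial round-robin stands in for real resource allocation)."""
    return [resources[i % len(resources)].submit(step)
            for i, step in enumerate(steps)]

print(run_pipeline([{"name": "calib"}, {"name": "reco"}],
                   [GridResource(), CloudResource()]))
# -> ['grid-calib', 'cloud-reco']
```

Because every backend satisfies the same `Resource` interface, a community can add a new provider type without changing its pipeline definitions, which is the extensibility point the API bullet describes.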
1.3 Concept and approach (model)

• Model (building blocks):
  – Collaborative modular architecture, with multiple layers sharing the same Framework and Basic modules, allowing horizontal and vertical scaling to ensure scalability.
  – Open, iterative, incremental and parallel, requirement-driven development process; Agile(?) methodology.
  – Standard procedures for quality assurance, including security, platform integration and validation (with reference benchmarks), and release procedures in accordance with the requirements for production-level services.
• Layers (the result of 10 years of evolution of the DIRAC development effort):
  – Framework: communication, security, access control, user/group management, DBs
  – Basic modules: SystemLogging, Configuration, Accounting, Monitoring
  – Low-level modules: File Catalog, Resource Status, Request Management, Workload Management
  – High-level modules: Data Management, Workflow Management
  – Interfaces: User - Resource
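The layering above implies a dependency discipline: each layer builds only on the layers below it. A minimal sketch of that rule, with the module names taken from the slide but the data layout and function purely illustrative:

```python
# Hypothetical sketch of the layer stack listed above; the module names
# follow the slide, but the dict layout and the checker are assumptions.
LAYERS = {
    "framework":  ["Communication", "Security", "AccessControl", "UserGroupManagement", "DBs"],
    "basic":      ["SystemLogging", "Configuration", "Accounting", "Monitoring"],
    "low_level":  ["FileCatalog", "ResourceStatus", "RequestManagement", "WorkloadManagement"],
    "high_level": ["DataManagement", "WorkflowManagement"],
    "interfaces": ["UserInterface", "ResourceInterface"],
}
ORDER = ["framework", "basic", "low_level", "high_level", "interfaces"]

def allowed_dependencies(layer: str) -> list[str]:
    """Modules a layer may build on: everything in the layers below it."""
    below = ORDER[:ORDER.index(layer)]
    return [module for lower in below for module in LAYERS[lower]]

print(allowed_dependencies("low_level"))
# framework and basic modules only, never DataManagement or the interfaces
```

Keeping the shared Framework and Basic modules at the bottom is what lets the communities replace or extend a high-level module without touching the common core.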
1.3 Concept and approach (assumptions)

• The current solution can be evolved into a new, general platform to be widely applied.
• Evolution from grids to clouds, but heterogeneity will increase.
• Large degree of commonality in low-level requirements and tools between different scientific domains.
• Fast growth of data and computing requirements, almost doubling every year; the aggregated estimate is close to the exabyte level in 5 years from now (EGI expects 10,000,000 cores and 1?? exabytes of scientific data by 2020). (Ref: http://delaat.net/talks/cdl-2014-05-13.pdf)
• Similar growth in the number of data objects, computing units and end users (60% of ESFRI projects completed or launched by 2015).
• New scientific domains are entering the digital era: the 4th paradigm of science, and a new data science, are emerging (http://research.microsoft.com/en-us/collaboration/fourthparadigm/).
• Data is to be made openly available beyond the community that produced it, down to the citizens who might also contribute to its further processing.
• Common development and validation provide robustness as well as cost savings and thus enable sustainability.
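The "almost doubling every year" assumption can be checked with a one-line projection. The 2014 baseline of 30 PB below is a made-up example value, not a figure from the proposal; it merely illustrates how yearly doubling reaches the exabyte scale in about five years.

```python
# Illustrative projection of the "almost doubling every year" assumption.
# The baseline volume is an example value, not from the proposal.
baseline_pb = 30.0      # assumed aggregate data volume today, in petabytes
growth_factor = 2.0     # doubling every year
years = 5

projected_pb = baseline_pb * growth_factor ** years
print(f"{projected_pb:.0f} PB in {years} years")  # 960 PB, close to the exabyte level
```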
1.4 Ambition
2.1 Expected impacts

• DIRECT impact:
  – Scalability and robustness (for the Research Infrastructures): participating RI projects will be able to operate their distributed computing systems efficiently, processing their large volumes of research data and making them available to their end users in a reliable and cost-effective way that could not be achieved before. This may lead to new ways of organizing scientific activities and to significant scientific breakthroughs. By providing important functional components that are missing from existing practices, the VLDATA platform will make possible the transparent integration of resources, hiding the complexity from users and extending the scale of the resources that Research Infrastructure projects can utilize. This will increase both the number of RIs using the project tools and the number of different types of resources reachable through them.
  – Simplicity (for the user: scientist/operator).
  – Cost-efficiency (for funding agencies): reduce duplicated efforts, maximizing the use of EU-funded e-Infrastructures, enlarging the user communities, providing efficient data processing services, and providing advanced technology by integrating the state of the art, which reduces development costs significantly (also the processing algorithms).
2.1 Expected impacts

• INDIRECT impact: a large user community
  – Science and innovation
  – Society and industry
  – Citizens
  – Policy makers
  – A new generation of data scientists
• On the other hand, the scale of the data challenge requires simple but intelligent solutions to integrate resources from different e-Infrastructure providers.
2.2 Measures to maximize impact
Research Infrastructures (I)

• Belle II:
  – Usage of DIRAC for the experiment; use cases presented:
    • Common access to various platforms: grid + cloud + cluster + HPC
    • Support for monitoring of workflow management tools
    • Integration of the needs of other participants
    • User interface
  – EU-T0: virtual data centers / new virtualization techniques?
Research Infrastructures (II)

• PAO:
  – Usage of DIRAC for the experiment; data taking until 2022:
    • Using a standard solution will help sustainability.
    • Extend functionality for their use case.
    • Common access to various platforms: grid + cloud + cluster + HPC (following the evolution of the providers), in particular OSG.
    • Open access to data.
  – EU-T0: data locality
Research Infrastructures (III)

• LHCb:
  – Should cover Run 2 needs and target the needs of Run 3 (DAQ upgrade):
    • Data rate will increase by a factor of ~5, to 10 PB/year.
    • Integration of cloud resources.
    • Massive data-driven workflows for users.
    • Data preservation (?)
    • Resource (CPU/storage/network/...) description/monitoring/availability/management, smart allocation.
    • Smart/intelligent/dynamic data placement strategies (network).
  – EU-T0: new virtualization techniques; resource description/monitoring/availability; virtual data centers; data locality
Research Infrastructures (IV)

• EISCAT_3D:
  – Searching data (metadata catalog), intelligent searching (pattern recognition)
  – Visualization
  – Workflows to go from one data level to another with the appropriate access rights
  – Training
  – Flexible interconnection of different resources, central (HPC) + distributed (grid/cloud)
  – Time-constrained massive data reduction (10 PB -> 1 PB / month ??), including the possibility of user-defined algorithms.
• EU-T0:
Research Infrastructures (V)

• BES III:
3.1 Work Plan (to be confirmed)

• WP1 Coordination (UB, Spain)
  – External Advisory Board (EUDAT, OGF, RDA, OSG, PRACE, XSEDE, CERN/HelixNebula, ...)
• WP2 Requirement analysis & design (CU, UK)
• WP3 Data-driven development (UB, Spain)
• WP4 User-driven development (CYFRONET, Poland)
• WP5 Quality (UAB, Spain)
• WP6 Validation (????)
  – LHCb (????)
  – Belle II (Institut Jozef Stefan, UniMB Maribor and UniLJ, Slovenia)
  – EISCAT_3D (SNIC, Sweden / EISCAT Scientific Association)
  – PAO (CESNET, Czech Republic)
  – BES III (IHEP, China)
  – Proteomics (ETH Zurich, Switzerland)
  – MosGrid (U Tübingen, Germany) / CMMST (U Perugia, Italy)
  – Seismology (MTU, Turkey)
  – AMC (Netherlands)
  – Astrophysics (INAF, Italy)
  – Heliophysics (Trinity College Dublin, Ireland)
  – VERCE (???)
  – DRIHM (CIMA Foundation, Italy)
  – SMEs (U. Zaragoza, Spain)
  – DIRAC4EGI, multi-community EGI solution (EGI.eu, the Netherlands)
• WP7 Dissemination: outreach + training (CNRS, France)
• WP8 Exploitation (ASCAMM, Spain)
• WP9 Communication, internationalization (UvA, the Netherlands)
3.2 Management structure and procedures

[Organigram] Bodies and roles:
• Coordinator, Technical Coordinator, Project Manager, Communication/Exploitation Coordinator
• Consortium Board (all partners)
• Executive Board (1 representative from each area)
• External Advisory Board
• Work package areas: Design/Development WPs (2,3,4,5); Integration/Operations WPs (6); Communication/Sustainability WPs (7,8,9)
• Internal and external communities' coordinators
3.3 Consortium as a whole
Private Companies

• Bull/Dell (??)
• ETL (UK)
• AlpesLaser (CH)
3.4 Resources to be committed
Calendar (milestones)

• May 23: close the contractors
• June 11-13: all WPs ready; F2F meeting to close the work plan; deadline for RIs and third parties
• July 9-11: close proposal (I)
• July 25: proofread -> external review
• Aug 18 -> Sep 2: final updates
Writing Calendar (I)

Milestone                                               Date
Close technology contractors                            May 23rd
Close consortium & complete work programme description  June 13th
Complete proposal                                       July 11th
English proofread                                       August 1st
External review                                         August 15th
Final version                                           August 29th
Deadline for submission                                 September 2nd
Writing Calendar (II)

Meeting/Task                                                    Type              Date
Decide on communities proposed by P. Kacsuk                     Mail, all         May 30th
Reorganization of the development area                          Virtual, WPL(*)   June 2-6
First WP review: put contributions in common; every WP
  provides a first draft and a list of activities               Virtual, WPLs     June 6th 14:00
WP integration: ensure consistency, third parties, budget       F2F, WPLs         June 11-12
First draft of sections: (1) Excellence, (2) Impact,
  (3) Implementation                                            Virtual, Editors  June 27th 14:00
Complete proposal: full review                                  F2F, Editors      July 9-10
Proofread: submit                                               Mail              July 11th
Proofread: received, full review, merging                       Virtual, Editors  July 25-30
External review: submit                                         Mail              July 30th
External review: received, full review, merging                 Virtual, Editors  August 18-29