
CERN openlab V preparation, Data Analytics (for research)

Many contributors, especially EN-ICE and IT-DB

Challenges


Online triggers and DAQ

Offline simulation and processing

Data storage architectures

Resource management and provisioning

Data analytics

Networks and connectivity


Outline

Use cases and challenges

Technology

Analytics as a Service (AaaS)

Education

Use case: Quench Protection System

Critical system for LHC operation
• Major upgrade for LHC Run 2 (2015-2018)

High-throughput requirement for data storage
• Constant load of 150k changes/s from 100k signals

Whole data set is transferred to long-term storage DB
• Query + Filter + Insertion

Analysis performed on both DBs

[Diagram labels: RDB Archive; Backup; LHC Logging (long-term storage); 16 projects around LHC]


Credit: Kacper Szkudlarek EN-ICE

Use case: Quench Protection System

Nominal conditions
• Stable constant load of 150k changes/s
• 100 MB/s of I/O operations
• 500 GB of data stored each day

Peak performance
• Exceeded 1 million value changes per second
• 500-600 MB/s of I/O operations

All CERN production WinCC OA systems (accelerators, detectors and technical infrastructure, 600 servers) will benefit from these optimizations

Next challenge: ~10x increase
• Required for next major upgrade (2019-2020)


Credit: Kacper Szkudlarek EN-ICE
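
The nominal figures above can be cross-checked with some back-of-the-envelope arithmetic. This is only a sketch under stated assumptions: decimal units (1 GB = 10**9 bytes), a load that stays constant over 24 hours, and a reading of the "~10x increase" as a tenfold change rate. The average size per change is derived here and is not quoted on the slides.

```python
SECONDS_PER_DAY = 24 * 60 * 60  # 86,400

def changes_per_day(changes_per_second):
    """Value changes accumulated over one day at a constant rate."""
    return changes_per_second * SECONDS_PER_DAY

def avg_bytes_per_change(stored_bytes_per_day, changes_per_second):
    """Average stored size of one value change (derived, not from the slides)."""
    return stored_bytes_per_day / changes_per_day(changes_per_second)

# Nominal figures quoted on the slides.
NOMINAL_RATE = 150_000          # value changes per second
SIGNALS = 100_000               # number of signals
STORED_PER_DAY = 500 * 10**9    # 500 GB/day, assuming 1 GB = 10**9 bytes

daily_changes = changes_per_day(NOMINAL_RATE)
bytes_per_change = avg_bytes_per_change(STORED_PER_DAY, NOMINAL_RATE)
rate_per_signal = NOMINAL_RATE / SIGNALS

# Assumption: the "~10x increase" is read as a tenfold change rate.
projected_rate = 10 * NOMINAL_RATE

print(f"{daily_changes:,} changes/day")
print(f"{bytes_per_change:.1f} bytes stored per change on average")
print(f"{rate_per_signal} changes/s per signal on average")
print(f"{projected_rate:,} changes/s after a 10x increase")
```

Under these assumptions each stored change averages a few tens of bytes, which suggests that the quoted 100 MB/s of I/O covers more than the raw writes to storage (for example reads, filtering and transfer to the long-term DB).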


Anomaly detection

>SVM (Support Vector Machines)

Credit: Massimo Lamanna, Sebastien Ponce (IT-DSS), Stefano Alberto Russo (ex IT-DB)


Data Placement / ATLAS

>Use cases:
• Trace Mining (user interactions with Distributed Data Management)
• Popularity (used for deciding which data to delete)
• Accounting and popularity (reports on data contents/popularity)

Log file aggregation

>ATLAS Distributed Data Management uses both SQL and NoSQL
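
The popularity use case (deciding which data to delete) could be sketched as below. The trace records, dataset names and the `least_popular` helper are hypothetical illustrations; the slides do not describe the actual ATLAS trace schema or deletion policy.

```python
from collections import Counter

def least_popular(traces, known_datasets, n):
    """Return the n least-accessed datasets as deletion candidates.

    Datasets that never appear in the traces count as zero accesses.
    Ties are broken alphabetically so the result is deterministic.
    """
    counts = Counter(trace["dataset"] for trace in traces)
    ranked = sorted(known_datasets, key=lambda name: (counts[name], name))
    return ranked[:n]

# Hypothetical access traces (one record per user interaction).
traces = [
    {"user": "alice", "dataset": "data15.A"},
    {"user": "bob", "dataset": "data15.A"},
    {"user": "alice", "dataset": "mc15.B"},
]
datasets = ["data15.A", "mc15.B", "mc15.C"]

candidates = least_popular(traces, datasets, 2)
print(candidates)
```

A production policy would presumably also weigh dataset age, size and the number of remaining replicas; those inputs are left out of this sketch.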


Data Placement / CMS

>Intelligent data placement models for the CMS experiment

>Need to extract further knowledge from the monitoring data in order to implement effective data placement
• Correlate file-access monitoring with site status (readiness, queue length, storage and CPU available)
• Classify analysis activities and needed resources
• Make recommendations
• Learn from past trends and patterns
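
As an illustration of turning site status into a placement recommendation, here is a minimal rule-based sketch. The `SiteStatus` fields, the site names and the selection rule are all assumptions made for this example; the slides do not specify the CMS model, which is meant to be learned from monitoring data rather than hard-coded.

```python
from dataclasses import dataclass

@dataclass
class SiteStatus:
    name: str
    ready: bool             # site readiness flag
    queue_length: int       # jobs waiting in the queue
    free_storage_tb: float  # available storage
    free_cpu_slots: int     # available CPU slots

def recommend_site(sites, dataset_size_tb):
    """Pick a ready site with enough free storage.

    Prefers the shortest queue, then the most free CPU slots.
    Returns None when no site qualifies.
    """
    eligible = [
        s for s in sites
        if s.ready and s.free_storage_tb >= dataset_size_tb
    ]
    if not eligible:
        return None
    best = min(eligible, key=lambda s: (s.queue_length, -s.free_cpu_slots))
    return best.name

# Hypothetical snapshot of site status.
sites = [
    SiteStatus("T2_A", True, 120, 50.0, 300),
    SiteStatus("T2_B", True, 40, 5.0, 100),
    SiteStatus("T2_C", False, 0, 200.0, 1000),
    SiteStatus("T2_D", True, 60, 80.0, 500),
]

print(recommend_site(sites, 10.0))
```

Here `T2_B` is excluded for lack of storage and `T2_C` because it is not ready, so the shortest remaining queue wins.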


Data Placement / EMBL-EBI

>To support the diverse data analysis that will take place within ELIXIR, the ability to ‘push’ data from a provider to a major analysis centre, or for the major analysis centre to ‘pull’ the required data set from a nearby source, becomes a critical capability


Logging service (1/2)

Credit: Chris Roderick


Logging service (2/2)

Credit: Chris Roderick


Domain specific language

>LHC Logging (50+ TB/year)

>Perform analysis as close to data as possible, in-database analysis: built-in + ORE?

>Multi-source extraction API

>Domain-specific language

Credit: Chris Roderick BE-CO


Network monitoring

>Time correlation
• During a PS throughput test, was there any known activity in the same link?
• There is packet loss; does this appear as degraded performance somewhere at the same time?

>We observe loss of performance in some network link
• Is it a network problem, and where?
• Is it a storage problem?

Credit: Simone Campana
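
The time-correlation questions above amount to finding events on the same link whose time intervals overlap. A minimal sketch follows, assuming each event is recorded with a link name and start/end timestamps; the `Interval` type, the link names and the sample events are hypothetical.

```python
from dataclasses import dataclass

@dataclass
class Interval:
    link: str
    start: float  # seconds since some common reference
    end: float
    label: str

def overlaps(a, b):
    """True when both events are on the same link and share some time."""
    return a.link == b.link and a.start < b.end and b.start < a.end

def correlate(tests, activities):
    """Pair each test with every known activity that overlaps it."""
    return [
        (t.label, act.label)
        for t in tests
        for act in activities
        if overlaps(t, act)
    ]

tests = [Interval("link-X", 100, 200, "throughput test")]
activities = [
    Interval("link-X", 150, 300, "bulk transfer"),  # overlaps the test
    Interval("link-X", 200, 250, "maintenance"),    # starts as the test ends
    Interval("link-Y", 120, 180, "packet loss"),    # different link
]

print(correlate(tests, activities))
```

The strict comparisons treat intervals that merely touch at an endpoint as non-overlapping; whether that is the right choice depends on the timestamp resolution of the monitoring data.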


ESA

>Envisage “intelligent” bots doing much of the researcher's work in scanning the archives to collect relevant information in a particular field.

>Such “automated bots” would present their results only when called upon and only focused on a problem at hand (e.g. give me serendipitous objects in the X-Ray range lying around the Crab Nebula, since an unexplained region of hot gas may have an effect on the infra-red region I am studying…).

>The bot may be further refined to extract only very good quality data from all X-Ray missions or for a given time

Credit: Salim Ansari


FCC

Credit: Johannes Gutleber


Analytics and Modelling for Availability Improvement in the FCC

>Near real-time modelling of the accelerator complex and its infrastructure services would further improve early warning capabilities, permit preventive maintenance and leverage co-scheduling of fault-prevention interventions

>Real-world use-cases taken from LHC accelerator operation shall serve as the basis to develop formal data analytics scenarios

Credit: Johannes Gutleber


Data analytics on scientific articles

>INSPIRE, ZENODO, ORCID

>Automated extraction of information (authors, references, key words, etc.)

>Semantic analysis of text allowing identification of the main field, key words (not appearing in the text), sentiment of references; validation based on their importance within the context of the publication and the ability to join and correlate concepts from different domains and publications.

Credit: Tim Smith


Administrative Information System

>(among others)

>Make the data available using a bi-temporal model: one time dimension comes from the business (e.g. contractual dates); the other is purely technical, indicating when each piece of data was effectively part of the DWH, which allows writing queries using a “show data as of” date

Credit: Derek Mathieson
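
A bi-temporal "show data as of" query can be sketched with an in-memory SQLite database. The `contract` table, its column names and the sample rows are hypothetical and only illustrate the two time dimensions; they are not the actual DWH schema.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE contract (
        person        TEXT,
        grade         TEXT,
        valid_from    TEXT,  -- business time: contractual start date
        valid_to      TEXT,  -- business time: contractual end date
        loaded_at     TEXT,  -- technical time: row entered the DWH
        superseded_at TEXT   -- technical time: row was replaced (NULL = current)
    )
""")
conn.executemany(
    "INSERT INTO contract VALUES (?, ?, ?, ?, ?, ?)",
    [
        # Original record, later corrected on 2015-06-01.
        ("p1", "G5", "2015-01-01", "2015-12-31", "2015-01-10", "2015-06-01"),
        # Corrected record for the same contractual period.
        ("p1", "G6", "2015-01-01", "2015-12-31", "2015-06-01", None),
    ],
)

def grade_as_of(conn, person, business_date, as_of):
    """Grade valid on business_date, as the DWH knew it on as_of.

    Dates are ISO strings, so plain string comparison orders them correctly.
    """
    row = conn.execute(
        """
        SELECT grade FROM contract
        WHERE person = ?
          AND valid_from <= ? AND ? <= valid_to
          AND loaded_at <= ?
          AND (superseded_at IS NULL OR ? < superseded_at)
        """,
        (person, business_date, business_date, as_of, as_of),
    ).fetchone()
    return row[0] if row else None

print(grade_as_of(conn, "p1", "2015-03-01", "2015-02-01"))
print(grade_as_of(conn, "p1", "2015-03-01", "2015-07-01"))
```

The same business date yields different answers depending on the "as of" date, because the correction only became part of the warehouse on 2015-06-01.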

Technology

Near real time processing
• Processing large amounts of data (gigabytes per second) with low latency (in the order of seconds) coming from different sources and domains

Batch processing (including predictive analytics)
• Linear and nonlinear modelling, classical statistical tests, complex time-series analysis and forecasting, classification, clustering

Data repositories, RDBMS and NoSQL

Integration Challenges (Data Analytics as a Service)


Analytics as a service

> “Analytics platform” or (Big data) “Analytics-as-a-service” (A3S ?):

> Data fed from multiple sources (live)

> Stored reliably

> Data processing with multiple systems

> Easy access, domain expert natural language (DSL)

> Visualisation

> Special interest from Human Brain Project

Credit: CERN EN-ICE

Education

“Data scientist” role type

Variety of tools and ideas, important theoretical/academic background

Implement a workshop/training along the lines of the one on multi-threading and parallelism

Clear need for and interest in data analytics education and information sharing


Conclusion

Interest from many parts of CERN, experiments, engineering, administrative, IT

Leverages the work done in openlab IV

Combined from the beginning with a multi-department AaaS service

Education and outreach

Interest from other research laboratories and openlab partners
• Challenges
• Interest in shared research / investigation / deployment


Date post:	25-Dec-2015