
BIG DATA SOLUTION FOR CTBT MONITORING: CEA-IDC JOINT GLOBAL CROSS CORRELATION PROJECT

15 May 2014


Presenters

Dmitry Bobrov 1), Randy Bell 1), Nicolas Brachet 2), Pierre Gaillard 2), Jocelyn Guilbert 2), Ivan Kitov 3), Mikhail Rozhkov 1)

1) International Data Centre, CTBTO; 2) Commissariat à l'Énergie Atomique; 3) Institute for Dynamics of Geospheres

Scientia potestas est. (Knowledge is power.)

Information potestas est. (Information is power.)

Cross-correlation potestas est. (Cross-correlation is power.)

Tremendous seismic data growth dictates a new approach.

Repeating seismicity: the IDC view

Dozens to hundreds of events come from the same Earth cell.

But how can we populate the aseismic areas with quality master events?

IMS seismic network

[Map of the IMS seismic network.] Blue circles – primary arrays; blue triangles – primary 3-C stations. Yellow circles – auxiliary arrays; yellow triangles – auxiliary 3-C stations. Red stars – underground nuclear explosions. The primary network includes 25 arrays.

Global Cross Correlation Grid + Aftershock Sequence Processing

What is the Grid?

• The Grid is a set of loci of hypothetical master events.
• A master is a set of waveform templates linking an array station and the locus.
• Spacing between masters is ~140 km.
• P-wave templates come from three to ten IMS primary arrays per master.
• At least three IMS stations are required to create an REB event.

Templates needed:
• Real waveforms – for seismic areas
• Grand masters – for adjacent territories
• Synthetic waveforms – for aseismic areas

(A sketch of generating such a grid follows.)
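The deck does not say how the grid loci are generated; as a minimal, hypothetical sketch (the lattice choice and all names are assumptions, not the project's method), a Fibonacci lattice in Python yields a near-uniform global grid at roughly 140 km spacing:

```python
import numpy as np

EARTH_RADIUS_KM = 6371.0

def fibonacci_grid(spacing_km=140.0):
    """Near-uniform grid of hypothetical master-event loci on the sphere.

    Chooses n so each point 'owns' about spacing_km**2 of surface area,
    then lays the points out on a Fibonacci lattice.
    """
    n = int(round(4 * np.pi * EARTH_RADIUS_KM**2 / spacing_km**2))
    golden = (1 + 5**0.5) / 2
    i = np.arange(n)
    lat = np.degrees(np.arcsin(1 - 2 * (i + 0.5) / n))
    lon = (360.0 * i / golden) % 360.0 - 180.0
    return lat, lon

lat, lon = fibonacci_grid()
print(f"{lat.size} master loci at ~140 km spacing")  # roughly 26,000 loci
```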


Global Cross Correlation Grid + Aftershock Sequence Processing

Building masters: the IDC database comprises hundreds of thousands of seismic events. Building a comprehensive master-event database would require:

1. Cross-correlating each event with every other event (a low-cost effort; sketched below).
2. Cross-correlating each event with the full 10-year event history of the IDC database (an extremely high-cost effort).
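Item 1 above amounts to computing a normalized correlation matrix over the event waveforms. A minimal NumPy sketch, assuming traces are already time-aligned and ignoring the lag search for brevity (function name is illustrative):

```python
import numpy as np

def pairwise_cc(events):
    """Pairwise correlation matrix for aligned event waveforms.

    events: (n_events, n_samples) array.  The resulting matrix feeds
    the clustering that groups repeating events at each grid point.
    """
    x = events - events.mean(axis=1, keepdims=True)
    norms = np.linalg.norm(x, axis=1, keepdims=True)
    x /= np.where(norms > 0, norms, 1.0)
    return x @ x.T  # symmetric, entries in [-1, 1]
```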

Global Cross Correlation Grid + Aftershock Sequence Processing

Template dimensionality reduction is crucial:

• The repeating-seismicity map showed that one grid point may correspond to dozens or even hundreds of templates. An effective dimensionality reduction technique has to be applied to the clusters of such events to pick a limited number of master events for each cluster (see the sketch after this list).

• These techniques must also be applied to the sets of synthetic events generated for the aseismic areas.
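The deck later mentions "subspacing"; one standard reduction of this kind takes the SVD of a cluster of aligned waveforms and keeps the top singular vectors as templates. A minimal sketch under that assumption (the names and the 90% energy cut are illustrative, not the project's settings):

```python
import numpy as np

def reduce_cluster(waveforms, energy_frac=0.9):
    """Collapse a cluster of aligned event waveforms to a few templates.

    waveforms: (n_events, n_samples) array, time-aligned and demeaned.
    Keeps the top right singular vectors that capture `energy_frac`
    of the cluster's energy; these act as the cluster's master templates.
    """
    # Normalize traces so large events do not dominate the decomposition.
    norms = np.linalg.norm(waveforms, axis=1, keepdims=True)
    x = waveforms / np.where(norms > 0, norms, 1.0)
    _, s, vt = np.linalg.svd(x, full_matrices=False)
    energy = np.cumsum(s**2) / np.sum(s**2)
    k = int(np.searchsorted(energy, energy_frac)) + 1
    return vt[:k]  # (k, n_samples) basis of waveform templates

# Toy usage; real repeating events are far more redundant than noise,
# so they collapse to far fewer templates than this random stand-in.
cluster = np.random.randn(200, 1200)
print(reduce_cluster(cluster).shape)
```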

Global Cross Correlation Grid + Aftershock Sequence Processing

A BIG DATA solution is needed.

Data is everything

Data centers (IDC, NDCs) collect, process, analyze, and produce data 24 hours a day, 7 days a week.

Data is the cornerstone: full of information and a source of knowledge.

Data sets are:
+ Large and growing (Volume)
+ Complex and heterogeneous (Variety)
+ Continuous streams in real time (Velocity)
+ Sometimes imprecise (Veracity)
= Big Data (the 4 Vs)

A (big) technological problem

There is an intrinsic mismatch between data and IT (information technology):
- Data volume increases ~100x in 10 years.
- I/O bandwidth improves only ~3x in 10 years.

At constant infrastructure, the time needed to scan a full archive therefore grows by roughly 100/3 ≈ 33x per decade, making it difficult to process all the data with traditional applications within a tolerable elapsed time.

What is Big Data?

DataScale

The question is: how can we bring a very practical solution to the challenge raised by the exponential growth of the volume of data to be processed?

DataScale project

A consortium of 9 partners, from large research laboratories (CEA/DAM, IPGP) to SMEs, also including big companies (BULL).

A two-year project, started in September 2013.

Supported by the French government; selected and funded by the "Investments for the Future" program.

DataScale objective

Design efficient Big Data solutions, suited to real use cases.

Technological Solutions

High-Performance Computing: HPC already deals with data sets from large-scale simulation of physical phenomena. The approach is to enrich and extend HPC solutions with specific Big Data technological building blocks.

Building blocks:
- Efficient data processing (distributed mining of data): distribute, parallelize, and deploy the application on an HPC platform (a toy sketch follows this list).
- Efficient data management (mining of distributed data): define a hierarchy of data storage (data life cycle, reuse process).
- A NoSQL database management system (DBMS) with data mining technologies, to handle very large data volumes and different types of data.
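As a toy illustration of "distribute, parallelize and deploy" (not DataScale's actual workflow), a long continuous stream can be cut into chunks that overlap by the template length and scanned across worker processes; all names here are hypothetical:

```python
import numpy as np
from concurrent.futures import ProcessPoolExecutor

def scan_chunk(args):
    """Worker: correlate one chunk against the template and return the peak.
    np.correlate is unnormalized here; a real detector would normalize."""
    chunk, template = args
    cc = np.correlate(chunk, template, mode="valid")
    return float(np.abs(cc).max())

def parallel_scan(stream, template, chunk_len=86_400, workers=4):
    """Cut the stream into chunks that overlap by the template length,
    so no window is lost at a boundary, and scan them in parallel."""
    n = len(template)
    starts = range(0, len(stream) - n + 1, chunk_len)
    jobs = [(stream[i:i + chunk_len + n - 1], template) for i in starts]
    with ProcessPoolExecutor(max_workers=workers) as pool:
        return list(pool.map(scan_chunk, jobs))

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    stream = rng.standard_normal(5 * 86_400)   # a few "days" of toy data
    template = rng.standard_normal(600)        # a toy P-wave template
    peaks = parallel_scan(stream, template)
    print(f"scanned {len(peaks)} chunks; max peak = {max(peaks):.1f}")
```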

[Slide images: TGCC, the CEA's large computing centre, and Mka3D.]

CEA Use Cases

A data-driven project: evaluation of the relevance of the technological solutions by implementing demonstrators.

3 areas, 4 real-world applications at real scale:

Area: Cluster management (CEA/DSSI)
Application: Monitoring and enhancement of the HPC platform
Description: Analysis of HPC log journals with data mining techniques (detection and correlation of failure patterns)

Area: Social media monitoring (Linkfluence)
Application: Measuring and reporting daily web activities (companies, users, topics, …)
Description: Analysis of millions of conversations and images (100 countries and 50 languages) through social accounts (e.g. Twitter, Facebook, Google+)

Area: Seismology (IPGP)
Application: Tomography of Europe
Description: Seismic noise correlation of 200 European stations (5 years of records)

Area: Seismology (CEA/DASE)
Application: Event detection
Description: Massive correlation between a continuous data stream and event templates (Master Event algorithm); a minimal sketch of this correlation follows.
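The deck does not give the Master Event algorithm's internals; its core operation, however, is a sliding normalized cross-correlation of a template against continuous data. A minimal NumPy sketch of that operation (the threshold and names are illustrative assumptions):

```python
import numpy as np

def normalized_xcorr(stream, template):
    """Sliding normalized cross-correlation of a template against a
    continuous data stream; returns a value in [-1, 1] for every lag."""
    n = len(template)
    t = template - template.mean()
    t = t / np.linalg.norm(t)
    # All length-n windows of the stream as a (no-copy) 2-D view.
    win = np.lib.stride_tricks.sliding_window_view(stream, n)
    w = win - win.mean(axis=1, keepdims=True)   # copies; fine for a sketch
    norms = np.linalg.norm(w, axis=1)
    return (w @ t) / np.where(norms > 0, norms, 1.0)

def detect(stream, template, threshold=0.7):
    """Sample indices where the template correlation exceeds the threshold."""
    return np.flatnonzero(normalized_xcorr(stream, template) >= threshold)

# Hypothetical usage: a repeated event buried in noise is recovered.
rng = np.random.default_rng(1)
template = rng.standard_normal(400)
stream = rng.standard_normal(20_000)
stream[5_000:5_400] += 2 * template            # inject a "repeat"
print(detect(stream, template))                # indices near 5000
```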

CEA-PTS Collaboration

A unique data analysis to revise the seismicity:
- of the last 10 years,
- at global scale, with a network of seismic stations distributed worldwide.

The IDC high-quality dataset is a natural candidate for an extensive cross correlation study:
- continuous seismic data from the primary IMS stations since 2000,
- 450,000 seismic events in the REB,
- tens of millions of raw detections.

Collaboration with the IDC teams to:
- enhance the Master Event algorithm (use of station 3CP, association, synthetic master events, subspacing),
- test and deploy the application on the secure and powerful HPC infrastructure of the CEA.

Roadmap


Date – Phase
Sep. 2013 – Kick-off
Oct. 2013 – Design: specification of the workflow and the NoSQL database
Mar. 2014 – Development: NoSQL DBMS (Armadillo); algorithm enhancement; workflow integration
Sep. 2014 – Test: deployment; run at reduced scale (3 years, regional network); result analysis
Apr. 2015 – Demonstration: run at full scale (10 years, global network); result analysis
Aug. 2015 – Assessment: reflection on integrating the new components into the operational chain

DATASCALE Partners

The DataScale project partners are:

ActiveEon

Armadillo

Bull

CEA (DASE)

CEA (LIST)

CEA (DSSI)

INRIA

IPGP

Linkfluence

CONCLUSION

We are:

Facing a BIG challenge.

Preparing a decisive turn toward a new data management infrastructure.

Not alone: we are surrounded by extremely valuable partners.

A new approach to nuclear monitoring.

Thank you for your attention!

