

CERN Open Data and Data Analysis Knowledge Preservation

Tibor Šimko

Digital Library 2015 · 21–23 April 2015 · Jasná, Slovakia

@tiborsimko · @inveniosoftware 1 / 26

1 Invenio


What is Invenio?

digital library and document repository software

– mature platform: first public release in 2002
– rich data: articles, books, notes, photos, videos, software, data

some Invenio-based services at CERN:

co-developed by an international collaboration

participating in EU FP7/H2020 projects


“Data behind plots”


“Code you can cite”

automated GitHub ↔ Zenodo bridge
push new release to GitHub → automatic archival on Zenodo
software preserved, minted with a DOI, made citable

https://guides.github.com/activities/citable-code
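
The release-to-archive flow can be pictured with a small in-memory model. This is only a sketch: it does not call the GitHub or Zenodo APIs, the class and method names (`Release`, `ArchivedRecord`, `Archive.on_release`) are hypothetical, and the DOIs it produces are made-up placeholders rather than registered identifiers.

```python
from dataclasses import dataclass
from itertools import count


@dataclass(frozen=True)
class Release:
    """A tagged software release, as pushed to a code hosting service."""
    repository: str
    tag: str


@dataclass(frozen=True)
class ArchivedRecord:
    """An archived snapshot of a release with a persistent identifier."""
    release: Release
    doi: str

    def citation(self, authors: str, year: int) -> str:
        # Hypothetical citation layout; real citation styles vary.
        return (f"{authors} ({year}). {self.release.repository} "
                f"{self.release.tag}. doi:{self.doi}")


class Archive:
    """Toy archive that mints one placeholder DOI per release."""

    def __init__(self) -> None:
        self._ids = count(1)
        self._records: dict[Release, ArchivedRecord] = {}

    def on_release(self, release: Release) -> ArchivedRecord:
        # Archiving the same release twice returns the existing record,
        # so each DOI keeps pointing at exactly one snapshot.
        if release not in self._records:
            doi = f"10.0000/archive.{next(self._ids)}"  # placeholder, not a real DOI
            self._records[release] = ArchivedRecord(release, doi)
        return self._records[release]


archive = Archive()
record = archive.on_release(Release("example/analysis-code", "v1.0"))
print(record.citation("A. Author", 2015))
```

The point of the model is that every release gets its own identifier, so a paper can cite the exact version of the code it used.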


Code ↔ Data ↔ Paper

link data (DATAVERSE) to code (ZENODO) to papers (INSPIRE)
example: hep-ex/0011057, arXiv:1401.0080


2 Data Analysis


CERN LHC Experiments


Large Scale Solutions

Primary site: 100k cores (10k nodes), 100k disks (50 PB), 21k NIC
Grid: 13 Tier-1 sites, 155 Tier-2 sites, 10 Gbps optical fibre


Preserve an Analysis?


Big Data?

data            scale            knowledge
raw             ∼GB / sec        calibration, conditioning
reconstructed   ∼PB / year       filtering, selection
reduced         ∼TB / analysis   user code, physics objects
publication     ∼GB / analysis   correlation, data behind plots

Analysis Train

[Figure: input ... → filtering code → output ...]
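
The input, filtering code, output chain of an analysis train can be sketched as a sequence of selection steps applied to a stream of events. This is a minimal illustration under stated assumptions: events are plain dictionaries, and the names `select`, `run_train`, `n_tracks` and `energy` are hypothetical, not taken from any experiment framework.

```python
from typing import Callable, Iterable, Iterator

Event = dict  # hypothetical event record, e.g. {"n_tracks": 12, "energy": 3.4}
Step = Callable[[Iterable[Event]], Iterator[Event]]


def select(predicate: Callable[[Event], bool]) -> Step:
    """Build a filtering step that keeps only events passing `predicate`."""
    def step(events: Iterable[Event]) -> Iterator[Event]:
        return (event for event in events if predicate(event))
    return step


def run_train(events: Iterable[Event], steps: list[Step]) -> list[Event]:
    """Feed the input through each step in order and collect the output."""
    stream: Iterable[Event] = events
    for step in steps:
        stream = step(stream)
    return list(stream)


events = [
    {"n_tracks": 2, "energy": 1.0},
    {"n_tracks": 15, "energy": 7.5},
    {"n_tracks": 20, "energy": 0.5},
]
train = [
    select(lambda e: e["n_tracks"] >= 10),  # first filtering step
    select(lambda e: e["energy"] > 1.0),    # second filtering step
]
print(run_train(events, train))  # [{'n_tracks': 15, 'energy': 7.5}]
```

Preserving an analysis then means preserving each step's code and parameters together with the input and output, not only the final numbers.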


Knowledge Capture


System Architecture

Analysis

analysis-preservation.cern.ch

file storage abstraction layer: CASTOR, Ceph, Box, AFS, Drive, EOS, S3

GitHub, SVN, TWiki, SharePoint, INSPIRE, CDS, ..., CADI
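
A file storage abstraction layer of this kind can be sketched as one interface with interchangeable backends. The classes below are hypothetical stand-ins: an in-memory backend and a local-directory backend take the place of systems such as EOS or Ceph, whose real clients are not shown, and none of the names come from the actual service.

```python
from abc import ABC, abstractmethod
from pathlib import Path
import tempfile


class Storage(ABC):
    """Common interface that hides which backend holds the bytes."""

    @abstractmethod
    def write(self, key: str, data: bytes) -> None: ...

    @abstractmethod
    def read(self, key: str) -> bytes: ...


class MemoryStorage(Storage):
    """Backend that keeps files in a dictionary (handy for tests)."""

    def __init__(self) -> None:
        self._files: dict[str, bytes] = {}

    def write(self, key: str, data: bytes) -> None:
        self._files[key] = data

    def read(self, key: str) -> bytes:
        return self._files[key]


class DirectoryStorage(Storage):
    """Backend that stores each key as a file below a root directory."""

    def __init__(self, root: Path) -> None:
        self._root = root

    def write(self, key: str, data: bytes) -> None:
        path = self._root / key
        path.parent.mkdir(parents=True, exist_ok=True)
        path.write_bytes(data)

    def read(self, key: str) -> bytes:
        return (self._root / key).read_bytes()


def preserve(storage: Storage, name: str, payload: bytes) -> int:
    """Store a payload and return the number of bytes read back."""
    storage.write(name, payload)
    return len(storage.read(name))


# The calling code is identical whichever backend is plugged in.
with tempfile.TemporaryDirectory() as tmp:
    for backend in (MemoryStorage(), DirectoryStorage(Path(tmp))):
        print(type(backend).__name__, preserve(backend, "run1/plot.csv", b"x,y\n1,2\n"))
```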


Knowledge Representation

record format: extended MARC21

– “technical” metadata: beyond bytes
  e.g. 256 “computer file characteristics”
  $a characteristics $e events $t text
  $b bytes $f files ...

– “knowledge” metadata: semantics
  e.g. 505 “formatted contents note” CSV column information
  $t title $g miscellaneous

internal format: JSON

[Figure: JSON, MARC21, EAD; schema model]
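
To make the two layers concrete, here is a sketch of a record held internally as JSON and rendered as MARC-style field lines. The tags 256 and 505 and their subfield codes follow the slide; all values are invented for illustration, the function name `to_marc_lines` is hypothetical, and the textual rendering is simplified (no indicators or leader).

```python
import json

# Hypothetical record: tags and subfield codes follow the slide,
# every value is invented for illustration.
record = {
    "256": {"a": "derived dataset", "b": "2048", "e": "1000", "f": "2"},
    "505": [
        {"t": "pt", "g": "transverse momentum in GeV"},
        {"t": "eta", "g": "pseudorapidity"},
    ],
}


def to_marc_lines(record: dict) -> list[str]:
    """Render the JSON form as one 'TAG $code value ...' line per field."""
    lines = []
    for tag in sorted(record):
        fields = record[tag]
        if isinstance(fields, dict):  # a non-repeated field
            fields = [fields]
        for subfields in fields:
            body = " ".join(f"${code} {value}" for code, value in subfields.items())
            lines.append(f"{tag} {body}")
    return lines


# The internal JSON form survives serialisation unchanged.
assert json.loads(json.dumps(record)) == record

for line in to_marc_lines(record):
    print(line)
```

Repeating field 505 once per CSV column is one possible way to carry column semantics; the slide does not specify the exact encoding.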


3 Open Data


Opening Up

Data policies: restricted → embargo period → open

“[...] Data with high abstraction, such as AOD, will be conditionally made publicly available after an embargo period of 5 years after publication for 10% of the data and 10 years for 100% of the data [...]” (ALICE Data Policy)
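
The time thresholds in the quoted policy can be written down as a small function. This is a sketch under assumptions: it encodes only the 5-year and 10-year steps, counts the embargo in whole calendar years from the publication date, and ignores the "conditionally" qualifier; the function name `public_fraction` is hypothetical.

```python
from datetime import date


def public_fraction(published: date, today: date) -> float:
    """Fraction of a dataset that may be public under the quoted embargo rule."""

    def years_after(years: int) -> date:
        # Shift the publication date by whole years (29 Feb falls back to 28 Feb).
        try:
            return published.replace(year=published.year + years)
        except ValueError:
            return published.replace(year=published.year + years, day=28)

    if today >= years_after(10):
        return 1.0
    if today >= years_after(5):
        return 0.1
    return 0.0


published = date(2010, 6, 1)
print(public_fraction(published, date(2014, 12, 31)))  # 0.0
print(public_fraction(published, date(2015, 6, 1)))    # 0.1
print(public_fraction(published, date(2020, 6, 1)))    # 1.0
```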

Challenges:

audience:
– data miners
– citizen scientists
– high-school students
– general public

computing:
– exploring in the browser
– specialised VMs


CERN Open Data Portal


Education


Visualise detector events


Basic histogramming


Research


CMS Primary Datasets 2010


CernVM Virtual Machine


Open Data? Who cares?

82,000 distinct users visited the site
21,000 distinct users viewed data records
16,000 distinct users used event display
3,000 distinct users used histogramming
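
As a quick reading aid, the usage figures can be turned into shares of all visitors. The calculation assumes that the users counted for each activity are a subset of the 82,000 site visitors, which the slide does not state explicitly.

```python
visitors = 82_000
activity = {
    "viewed data records": 21_000,
    "used event display": 16_000,
    "used histogramming": 3_000,
}

# Share of all visitors per activity, in percent, rounded to one decimal.
shares = {name: round(100 * users / visitors, 1) for name, users in activity.items()}

for name, share in shares.items():
    print(f"{name}: {share}%")
```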

7 Conclusions


CERN (Open) Data

Capturing and disseminating knowledge
of data, code, platform, processes
to enable future data reuse

(Open) Data Analysis Preservation Framework
http://opendata.cern.ch/

CERN IT J. Cowton, P. Fokianos, J. Kuncar, T. Smith, T. Šimko

CERN Library S. Dallmeier-Tiessen, P. Herterich, L. Rueda

ALICE M. Gheata, C. Grigoras

ATLAS K. Cranmer, L. Heinrich, D. Rousseau, F. Socher

CMS A. Calderon, A. Huffman, K. Lassila-Perini, T. McCauley, A. Rao, A. Rodriguez Marrero

LHCb S. Amerio, B. Couturier, A. Trisovic

CERN CernVM J. Blomer

CERN EOS L. Mascetti

DASPOS M. Hildreth

DPHEP F. Berghaus
