Invenio and its Potential Role in DPHEP

Tibor Šimko

Department of Information Technology, CERN

DASPOS/DPHEP7 Workshop, CERN, March 21–22, 2013

Tibor Šimko (CERN) Invenio at DASPOS/DPHEP7 March 21–22, 2013 1 / 37

Outline

1 Introduction

2 Technology

3 Community

4 Projects

5 Conclusions

From Jamie’s DASPOS/DPHEP7 introductory talk

What is Invenio?

modular digital library system
– “library in which collections are stored in digital formats (as opposed to print, microform, or other media) and accessible by computers”
free software, GNU General Public License
mature, first release in 2002
originated at CERN, sister project Indico
nowadays co-developed by CERN, CfA, DESY, EPFL, SLAC...
handles articles, books, notes, photos, videos...
used by:
(a) institutional document repositories, e.g. CDS
– pre-publication workflows, e.g. approval
(b) subject-based information systems, e.g. INSPIRE
– worldwide data, e.g. citation analysis
(c) libraries and library networks, e.g. RERO

Example: Invenio Demo Site

Example: CDS 1/2

Example: CDS 2/2

Example: INSPIRE 1/2

Example: INSPIRE 2/2

P.Minkowski, Phys.Lett. B67 (1977) 421

Example: ILO

Example: EPFL

Example: ILCDOC

Architecture

key features:
– navigable collection tree (regular, virtual)
– powerful search engine
  · Google-like speed for up to 5M records
  · combined metadata, reference and fulltext search
– flexible metadata model (MARC)
  · handling any kind of document (multimedia)
  · customizable input, formatting and linking
– personalization and collaborative features:
  · alerts, baskets, groups, reviews, comments
  · internationalisation (28 languages)
modular architecture: four module families
(1) ingestion modules, e.g. periodical harvesting
(2) processing modules, e.g. keyword extraction and indexing
(3) dissemination modules, e.g. personalised RSS alerts
(4) curation modules, e.g. batch corrections
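
To make the "flexible metadata model (MARC)" point more concrete, here is a minimal sketch of reading a MARCXML record with Python's standard library. The sample record and the helper name `get_subfields` are illustrative assumptions and are not taken from Invenio; only the MARC21 slim namespace and the conventional meaning of tags 001 (control number), 100 (main author) and 245 (title) are standard.

```python
import xml.etree.ElementTree as ET

# MARC21 "slim" XML namespace (standard); the record below is made up.
MARC_NS = {"marc": "http://www.loc.gov/MARC21/slim"}

SAMPLE_RECORD = """\
<record xmlns="http://www.loc.gov/MARC21/slim">
  <controlfield tag="001">12345</controlfield>
  <datafield tag="100" ind1=" " ind2=" ">
    <subfield code="a">Doe, J.</subfield>
  </datafield>
  <datafield tag="245" ind1=" " ind2=" ">
    <subfield code="a">An example title</subfield>
  </datafield>
</record>
"""

def get_subfields(record, tag, code):
    """Return all values of subfield `code` in datafields with `tag`."""
    xpath = f"marc:datafield[@tag='{tag}']/marc:subfield[@code='{code}']"
    return [el.text for el in record.findall(xpath, MARC_NS)]

record = ET.fromstring(SAMPLE_RECORD)
title = get_subfields(record, "245", "a")
authors = get_subfields(record, "100", "a")
record_id = record.findtext("marc:controlfield[@tag='001']", namespaces=MARC_NS)
print(record_id, authors, title)
```

Because every field is addressed by tag and subfield code rather than by a fixed schema, the same helper works for any kind of document the record describes.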

Invenio Modules: Overview

[Diagram. Actors: Author, Sources, Librarian, User. Components: Database, Ingestion Modules, Processing Modules, Dissemination Modules, Curation Modules.]

Invenio Modules: Ingestion

[Diagram. Inputs: Author, OAI Data Source, Non-OAI Data Source. Modules: WebSubmit, WebSession, WebAccess, BibHarvest, ElmSubmit, BibConvert, BibUpload, BibSched. Data labels: metadata, full-text document, MARCXML. Stores: Metadata, Full-text.]
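
The ingestion diagram shows BibHarvest reading from an OAI data source. As a rough illustration of what such a harvest request looks like on the wire, the sketch below builds an OAI-PMH `ListRecords` URL. The endpoint URL and the function name `build_list_records_url` are hypothetical; the query parameters (`verb`, `metadataPrefix`, `from`, `set`) come from the OAI-PMH protocol, and no network request is made.

```python
from urllib.parse import urlencode

def build_list_records_url(base_url, metadata_prefix="oai_dc",
                           from_date=None, set_spec=None):
    """Build an OAI-PMH ListRecords request URL (no network access)."""
    params = {"verb": "ListRecords", "metadataPrefix": metadata_prefix}
    if from_date is not None:
        params["from"] = from_date      # selective harvesting by date
    if set_spec is not None:
        params["set"] = set_spec        # selective harvesting by set
    return base_url + "?" + urlencode(params)

# Hypothetical endpoint, for illustration only.
url = build_list_records_url("https://repository.example.org/oai2d",
                             from_date="2013-03-01")
print(url)
```

A periodical harvest would typically repeat this request with `from` set to the date of the previous run, so that only new or changed records are transferred.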

Invenio Modules: Processing

[Diagram. Stores: Metadata, Full-text, Clusters. Modules: RefExtract, BibClassify, BibDocFile, BibEncode, BibIndex, WebColl, BibRank, BibFormat, BibSort, BibAuthorID.]

Invenio Modules: Dissemination

[Diagram. Stores: Metadata, Full-text, Clusters. Actors: User, OAI Harvester. Modules: WebSearch, WebBasket, BibAuthorID, WebAlert, BibHarvest, WebComment, WebMessage, WebJournal, BibCirculation, WebStat, WebHelp.]

Invenio Modules: Curation

[Diagram. Stores: Metadata, Full-text, Tasks, Knowledge Bases. Actor: Librarian. Modules: BibEdit, MultiEdit, BatchUploader, BibCheck, BibCirculation, BibDocFile, BibClassify, RefExtract, BibCatalog, BibKnowledge, BibExport, BibMatch, BibMerge.]

Scalable Architecture

800 hits per second on CDS during the Higgs seminar on July 4th, 2012

[Diagram. Nodes: HAProxy balancer; Apache 1 with workers 1/1, 1/2, 1/3; Apache 2 with workers 2/1, 2/2, 2/3 (node 2); DB Master; DB Slave; Solr; Redis, MC; shared fs. Connection labels: HTTP, WSGI, SQL R/W, SQL R/O, replication, query, key,val, files.]
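
The "SQL R/W", "SQL R/O" and "replication" labels suggest that writes go to the database master while read-only queries can be served by a replica. The sketch below illustrates that routing decision only. It is not how Invenio implements it: the class name `QueryRouter`, the connection names and the rule of classifying a statement by its first keyword are all simplifying assumptions.

```python
import itertools

# Assumption: statements starting with these keywords never modify data.
READ_ONLY_KEYWORDS = {"SELECT", "SHOW"}

class QueryRouter:
    """Send read-only statements to replicas, everything else to the master."""

    def __init__(self, master, replicas):
        self.master = master
        # Rotate over replicas; fall back to the master if there are none.
        self._replicas = itertools.cycle(replicas or [master])

    def route(self, sql):
        stripped = sql.strip()
        keyword = stripped.split(None, 1)[0].upper() if stripped else ""
        if keyword in READ_ONLY_KEYWORDS:
            return next(self._replicas)
        return self.master

# Placeholder connection names, for illustration only.
router = QueryRouter(master="db-master", replicas=["db-slave"])
targets = [router.route(q) for q in (
    "SELECT id FROM records WHERE id = 1",
    "UPDATE records SET title = 'x' WHERE id = 1",
    "  select count(*) from records",
)]
print(targets)
```

A real deployment would also have to account for replication lag, for example by reading from the master right after a write.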

Technology

technology overview:
– load balancing: HAProxy
– web application: Apache, WSGI, Python, Flask, Jinja
– database: SQLAlchemy, MySQL/PostgreSQL/SQLite, MongoDB
– caching: Memcached, Redis
– UI: Twitter Bootstrap, jQuery
– project tools: Git, Trac, Jenkins, Selenium
development and collaboration model:
– organic-growth software development model
– pull-on-demand collaboration model
– rapid prototyping
– unit tests, regression tests, web tests

Community

[Diagram: three Invenio groups; names and year labels are listed in the order extracted from the slide.]

Invenio: EPFL, CDS, AUTH, UAB, ...
2002– 2004–2006+ 2002–

Invenio: ADS, INSPIRE, arXiv
2008– 2009– 2011–

Invenio: BlogForever, CRISP, OpenAIRE, M9
2011– 2012– 2009– 2012–

350k LOC - Invenio core sources
11k LOC - INSPIRE overlay sources

Community

Invenio User Group Workshop 2012.

50 developers and contributors in 2012 (22 new):
· 20 CERN-IT (7 new), 14 CERN-GS (5 new)
· 4 SLAC, 2 ADS
· 2 JUELICH (2 new), 1 GSI (1 new)
· 1 UAB, 1 AUTH, 1 UNIZAR (1 new), 1 RERO (1 new)
· 1 RNRT (1 new)
· 2 upstream others (1 new)

Related Projects

BlogForever
– OAIS, ingestion store
– preservation, METS, PREMIS
CRISP
– DOI minting via DataCite
– persistent identifier store
– “data continuum”
M9
– non-MARC master formats (EAD, UNIMARC)
– virtual fields, calculated or derived
– data model definition and checks
OpenAIRE, OpenAIREplus
– going beyond papers: multidisciplinary data archive
– community curated collections

BlogForever

“Think about how great it would have been [...] to see the last minute "blogs" from disappearing Pompeii...”
– BlogForever video at CeBIT 2013 <http://youtu.be/Gzp5ug4i7r4>

preserving blogs: taking care of today’s stories for tomorrow’s history
EC funded, 2011–2013
using Invenio for repository
– blog posts, digital objects (images, videos), user comments
– harvest, preserve, manage, disseminate blog content
– ensure authenticity, integrity, completeness, usability

BlogForever: The Big Picture

OAIS Functional Entities

SIP = Submission Information Package · AIP = Archival Information Package · DIP = Dissemination Information Package
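
The three package types can be read as stages of one flow: a SIP is submitted, turned into an AIP for archival storage, and a DIP is derived from the AIP for consumers. The sketch below models that flow with plain Python dataclasses. The class layout, the function names `ingest` and `disseminate`, and the use of a SHA-256 checksum as fixity information are illustrative assumptions; OAIS does not prescribe this layout, and it is not taken from Invenio or BlogForever.

```python
import hashlib
from dataclasses import dataclass

@dataclass(frozen=True)
class SIP:
    """Submission Information Package: what the producer hands over."""
    content: bytes
    metadata: dict

@dataclass(frozen=True)
class AIP:
    """Archival Information Package: content plus fixity information."""
    content: bytes
    metadata: dict
    sha256: str

@dataclass(frozen=True)
class DIP:
    """Dissemination Information Package: what the consumer receives."""
    content: bytes
    metadata: dict

def ingest(sip):
    """Turn a SIP into an AIP by recording a checksum at ingestion time."""
    digest = hashlib.sha256(sip.content).hexdigest()
    return AIP(sip.content, dict(sip.metadata), digest)

def disseminate(aip):
    """Verify fixity, then derive a DIP from the AIP."""
    if hashlib.sha256(aip.content).hexdigest() != aip.sha256:
        raise ValueError("fixity check failed")
    return DIP(aip.content, dict(aip.metadata))

aip = ingest(SIP(b"blog post body", {"title": "Example post"}))
dip = disseminate(aip)
print(dip.metadata["title"], aip.sha256[:8])
```

Recording the checksum at ingestion and checking it at dissemination is one simple way to support the integrity goal; a real archive would also re-check fixity periodically.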

CRISP

WP17 following “data continuum” from ingestion through analysis to dissemination
EC funded, 2012–2014
ESRF and ILL, neutron physics use case
Invenio use case
– DOI minting of objects via DataCite (example DOI: 10.1000/182)
– Persistent Identifier store
– citability and traceability throughout lifecycle
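
As a rough picture of what a persistent identifier store does, the sketch below keeps an in-memory map from DOIs to internal record ids. Everything here is an assumption for illustration: the class name `PersistentIdentifierStore`, the deliberately loose DOI shape check and the record id are made up, and no DataCite service is called. The DOI `10.1000/182` is the one shown on the slide.

```python
import re

# Loose shape check: "10.<registrant>/<suffix>"; not a full DOI validator.
DOI_PATTERN = re.compile(r"^10\.\d{4,9}/\S+$")

class PersistentIdentifierStore:
    """In-memory map from DOI to internal record id (illustrative only)."""

    def __init__(self):
        self._by_doi = {}

    def register(self, doi, record_id):
        if not DOI_PATTERN.match(doi):
            raise ValueError(f"not a DOI-shaped string: {doi!r}")
        key = doi.lower()                      # DOIs are case-insensitive
        if key in self._by_doi and self._by_doi[key] != record_id:
            raise ValueError(f"{doi} is already assigned")
        self._by_doi[key] = record_id

    def resolve(self, doi):
        """Return the record id for a DOI, or None if it is unknown."""
        return self._by_doi.get(doi.lower())

store = PersistentIdentifierStore()
store.register("10.1000/182", record_id=42)
print(store.resolve("10.1000/182"))
```

Refusing to reassign an existing DOI to a different record is what keeps the identifier persistent: once minted, it should keep pointing at the same object throughout its lifecycle.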

M9

M9: innovative cultural pole with a museum, exhibition spaces, médiathèque archives, educational activities, public services
“Museum of the 20th Century” in Mestre, Italy
privately funded, Fondazione di Venezia, 2011–2014
Cini Foundation, CILEA
using Invenio for repository
– native support for formats ICCD, MAG, UNIMARC
– native support of archival formats EAD, ISAD(G)
– data model abstraction and logical fields
– rich API to expose objects to apps

OpenAIRE/OpenAIREplus

EC funded, 2009–2014
linking peer-reviewed literature to associated data and funding schemes
using Invenio for “orphan” repository
– launching Zenodo
– multidisciplinary publication and data store
– attractive submission and visualisation platform
– collaborative user-defined user-curated collections
– open access, embargoed access
collaborating with EUDAT
– EUDAT selected Invenio for SimpleStore component
– iRODS storage backend

Example: Zenodo 1/2

Example: Zenodo 2/2

Conclusions

Invenio
– mature, robust, configurable digital library platform
– co-developed by CERN, CfA, DESY, EPFL, SLAC...
– supporting diverse objects beyond papers or multimedia
– supporting OAIS-inspired preservation features
– workflows from data ingestion to processing and visualisation
– primary use cases in HEP: CDS, INSPIRE
– participating in (EC funded) projects BlogForever, CRISP, OpenAIREplus, M9; collaboration with EUDAT
potential role in DPHEP
– already plays a role, via INSPIRE
– possible front-end for larger-scale data backends (iRODS, CASTOR)
– technology reuse: DOI minting, persistent identifier store, ingestion store, audit store, format migrations...
