Global Digital Library in High-Energy Physics

Jan Iwaszkiewicz (CERN)

Digital Repositories – Linked Open Data – the possible Role of D4Science

16th Dec 2010

Contents

• INSPIRE
• Invenio software
• Demo
• INSPIRE use cases in D4Science II
  – OCR
  – Full-text indexing
• Current research:
  – Semantic Search
  – Author Disambiguation

run by

(2007-)

HEP community

• community
  – 20-30k active researchers publishing 10k articles/year
  – large collaborations (up to 5000 members)
  – very international (even small author groups)
  – authors = readers
• rapid information exchange essential
  – mailing of preprints since the 60’s
  – long OA tradition
  – >90% of HEP journal articles on arXiv!
• dominance of community-based information systems
  – arXiv
  – SPIRES

Dominance of community services

From a 2007 survey of 2,000 physicists. Gentil-Beccot et al., Information Resources in High-Energy Physics: Surveying the Present Landscape and Charting the Future Course. J.Am.Soc.Inf.Sci.60:150-160,2009, arXiv:0804.2701

SPIRES (1974-)

• network of databases
  – HEP literature, conferences, institutions, experiments, hepnames, jobs
• SLAC – DESY – Fermilab collaboration
• SPIRES-HEP
  – metadata for 850k objects, ~800 new records per week
  – preprints, journal articles, conference contributions, books, grey literature
  – since 1974, web server since 1991
  – 100k searches/day
• high data quality, manually curated, comprehensive coverage
• high acceptance, user involvement
• But: outdated technology from the 1970s

Invenio (2002-)

• digital multimedia library system
• platform for CERN Document Server (CDS)
• powerful search engine
  – Google-like speed for up to 5M records
  – combined metadata, reference and fulltext search
• flexible metadata (MARCXML)
• personalization and collaborative features
• modular architecture
• Apache/Python/MySQL
• GNU General Public Licence - invenio-software.org
• ~30 instances worldwide

ingestion

dissemination

INSPIRE development

• 2007: feasibility study
• 2008: user-level functionalities
  – data conversion
  – citation analysis, search syntax, output formats…
• 2009: cataloguing functionalities
  – metadata maintenance and enrichment tools
• 2010: workflow
  – harvesting, cataloguing…
• April 2010: public beta version, http://inspirebeta.net

Bibliographic Content

• SPIRES content (plus part of CDS): journal articles, conference proceedings, preprints, experimental notes, theses
• going beyond SPIRES: conference slides, multimedia, software, high-level research data…
• going back before 1974
• more material from neighboring disciplines
  – astrophysics, nuclear physics, mathematics…
  – cited by core HEP articles

Rich Content Repository

• all freely accessible articles
  – esp. “endangered” material
• access-restricted articles
  – “hidden archive”
  – first agreements with Springer and APS
• historical material
  – scanning of old preprint series
• beyond articles
  – slides, multimedia, software, wikis…
  – independent citable objects

INSPIRE features I

• Advanced search functionality
  – Google-like freetext search
  – Complex second-order searches

Example:

Find the most influential HEP core papers that cite the Hitchin article "Generalized Calabi-Yau manifolds" but don't cite any papers by Polchinski

refersto:reportnumber:math/0209099 collection:core cited:100->9999 NOT refersto:author:Polchinski
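A query like the one above would normally be typed into the search box, but it can also be sent as a URL. The following is a minimal sketch in Python that only builds such a URL and fetches nothing. The `/search` path and the `p` parameter name are assumptions based on the usual Invenio search URL form, and `build_search_url` is a hypothetical helper, not part of INSPIRE or Invenio.

```python
from urllib.parse import urlencode

# Query string taken verbatim from the slide.
QUERY = (
    "refersto:reportnumber:math/0209099 collection:core "
    "cited:100->9999 NOT refersto:author:Polchinski"
)


def build_search_url(base_url: str, query: str) -> str:
    """Return a search URL with the query URL-encoded in the `p` parameter.

    Assumption: the site exposes `/search?p=<query>`; adjust if it does not.
    """
    return f"{base_url.rstrip('/')}/search?{urlencode({'p': query})}"


url = build_search_url("http://inspirebeta.net", QUERY)
print(url)
```

Encoding matters here because the query contains spaces, colons, a slash and `->`, none of which should be pasted raw into a URL.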

INSPIRE features II

• detailed record pages
  – abstract, keywords, references, citations, fulltext, figures
  – various export formats
• comprehensive author pages
  – affiliation history, coauthors, frequent keywords, article classification, citation summary
• citation analysis
  – cited by, co-cited with, self-citations, citation history
• taxonomy-based classification

Newest Features

• Plot extraction
• Plot caption indexing
• Full-text snippets

HEP taxonomy

Hierarchical structure of all important HEP concepts

Providing (e.g. for "dynamical symmetry breaking"):
• synonyms (dynamically broken)
• related terms (spontaneous symmetry breaking)
• broader/narrower terms (symmetry breaking)
• definitions
• subject areas (high-energy physics – theory)

Taxonomy applications

• fast automatic generation of keywords
  – enabling e.g. prompt alerts
  – manually curated afterwards
• automatic selection of HEP-relevant articles
  – helps minimize manual selection
• improved search algorithm (planned)
  – e.g. a search for "SUSY" will also find "supersymmetry"
  – narrow/broaden search
• user tagging (planned)
  – improve INSPIRE-generated classification
  – improve taxonomy
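The planned synonym and broaden/narrow search can be pictured with a toy lookup. This is only a sketch: the two entries reuse the examples from the slides, while the dictionary layout and the `expand_query` function are assumptions made for illustration, not the INSPIRE taxonomy format or search code.

```python
# Toy taxonomy. The terms come from the slides; the structure is assumed.
TAXONOMY = {
    "supersymmetry": {
        "synonyms": ["SUSY"],
        "broader": [],
    },
    "dynamical symmetry breaking": {
        "synonyms": ["dynamically broken"],
        "broader": ["symmetry breaking"],
    },
}


def expand_query(term, taxonomy, broaden=False):
    """Return the preferred label and synonyms for a term.

    With broaden=True the broader terms are appended as well.
    Terms that are not in the taxonomy are returned unchanged.
    """
    wanted = term.lower()
    for concept, entry in taxonomy.items():
        labels = [concept] + entry["synonyms"]
        if wanted in (label.lower() for label in labels):
            expanded = list(labels)
            if broaden:
                expanded += entry["broader"]
            return expanded
    return [term]


print(expand_query("SUSY", TAXONOMY))
print(expand_query("dynamically broken", TAXONOMY, broaden=True))
```

A search front end could then OR the returned labels together, so that a query for one label also matches documents indexed under the others.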

Author identification

• INSPIRE author ID
  – Single Sign-on for INSPIRE and arXiv.org
  – active participation in ORCID
• author disambiguation
  – using e.g. lab IDs, affiliation history, coauthors and more
  – 22,000 INSPIRE IDs already assigned
• automatic association of papers with authors
  – using info on affiliations, coauthors, research topics, from publishers
  – G. Chen: 963 docs, 21 real authors, only 22 docs not assigned, 97.2% success rate
• INSPIRE ID part of author lists of large collaborations

Coming soon…

• personalization
  – personal accounts, “bookshelves”, display formats, e-mail alerts, RSS feeds
  – collaborative tools, user groups
• claim my papers
• user tagging
• user submission
  – paper-centric (articles, supplementary material) and beyond

Coming later

• innovative metrics
• semantic analysis
• recommender systems
  – combining citations, keywords, fulltext, usage pattern data...
• open API for 3rd party tools and searching
• object aggregation (OAI-ORE)
• OAIS standards for long-term document preservation

Partnerships

• Researchers - community
  – user tagging, user submission
  – improved correction interfaces
  – feedback driving future developments
• information providers
  – close alliance with arXiv
  – data exchange with publishers/databases
  – standardized author identities
• neighboring fields
  – OAI-PMH (harvesting) and OpenSearch
  – ADS (SAO/NASA Astrophysics Data System)
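For readers unfamiliar with OAI-PMH, a harvesting request is an ordinary HTTP GET with a `verb` and a few standard arguments. The sketch below only builds such a request URL. `ListRecords`, `metadataPrefix`, `from` and `set` are standard OAI-PMH names, while the base URL `http://example.org/oai2d` and the helper `list_records_url` are placeholders invented for this example.

```python
from urllib.parse import urlencode


def list_records_url(base_url, metadata_prefix="oai_dc", from_date=None, set_spec=None):
    """Build an OAI-PMH ListRecords request URL (nothing is fetched)."""
    params = {"verb": "ListRecords", "metadataPrefix": metadata_prefix}
    if from_date:
        params["from"] = from_date  # e.g. harvest only records changed since this date
    if set_spec:
        params["set"] = set_spec  # optional selective harvesting
    return f"{base_url}?{urlencode(params)}"


# Hypothetical repository endpoint, used for illustration only.
print(list_records_url("http://example.org/oai2d", from_date="2010-12-01"))
```

A real harvester would also follow the `resumptionToken` returned in large result lists, which is left out here.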

Demo


INSPIRE use cases in D4Science II

• OCR
• Fulltext indexing
• Bibliometric analysis
• Data mirroring

Process Execution Engine

• Process Execution Engine (PE2ng) is a system to manage the execution of programs in a distributed infrastructure
• PE2ng provides adaptors for gCube and gLite based GRID, as well as Condor and Hadoop

MapReduce in PE2ng

D4Science introduced support for MapReduce (Hadoop) jobs

Advantages
• Fault tolerance - mechanisms of fault detection and automatic job resubmission in case of software or hardware errors
• Load balancing - optimizing the completion time of processing the entire data set (finishing time of the last subjob)
• Combined Storage & CPU resources
• "On demand" execution mode (vs. batch queues)

It perfectly matches the needs of I/O intensive INSPIRE jobs
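The load-balancing goal named above, keeping the finishing time of the last subjob low, can be illustrated with a toy scheduler. This sketch uses a simple greedy longest-first heuristic chosen for illustration. It is not a description of how Hadoop or PE2ng assign work, and the subjob durations are invented.

```python
import heapq


def greedy_schedule(durations, workers):
    """Assign each subjob (longest first) to the currently least-loaded worker.

    Returns (makespan, loads), where loads[w] is the total time on worker w
    and the makespan is the finishing time of the last subjob.
    """
    heap = [(0, w) for w in range(workers)]  # (current load, worker id)
    heapq.heapify(heap)
    for duration in sorted(durations, reverse=True):
        load, worker = heapq.heappop(heap)
        heapq.heappush(heap, (load + duration, worker))
    loads = [load for load, _ in sorted(heap, key=lambda item: item[1])]
    return max(loads), loads


subjobs = [8, 7, 6, 5, 4]  # hypothetical subjob durations, in minutes
makespan, loads = greedy_schedule(subjobs, workers=2)
print(makespan, loads)
```

The heuristic is cheap but not always optimal, which is one reason splitting the data into many small subjobs helps: the smaller the pieces, the less any single assignment can unbalance the workers.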

Optical Character Recognition (OCR)

• Using open source Ocropus software, well suited for large scale batch processing
• Ocropus has three steps: document layout analysis, line recognition, character identification (different models)
• Input typically PDF files, OCR output in hOCR format (HTML), can be converted to PDF
• Several languages: EN, FR, DE...
• Automatic procedure developed (in Python) for format conversions, text recognition, hOCR->PDF conversion, grid job submission
• The whole OCR process can be done as a grid job
• Scanning and OCRing typically done using commercial services and tools (e.g. in India); the OCRing part is more expensive => can save real money
• Interest from other institutions
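Because hOCR is plain HTML with layout information carried in `class` and `title` attributes, the recognised text can be pulled out with a standard HTML parser. The following is a minimal sketch using only the Python standard library. The class name `ocr_line` and the `bbox x0 y0 x1 y1` property follow the hOCR convention, while the sample markup and the `HocrLines` class are made up for illustration and are not the procedure developed for INSPIRE.

```python
from html.parser import HTMLParser


def parse_bbox(title):
    """Extract (x0, y0, x1, y1) from an hOCR title attribute, or None."""
    for prop in title.split(";"):
        parts = prop.split()
        if len(parts) == 5 and parts[0] == "bbox":
            return tuple(int(value) for value in parts[1:])
    return None


class HocrLines(HTMLParser):
    """Collect (text, bbox) for every span with class 'ocr_line'."""

    def __init__(self):
        super().__init__()
        self.lines = []   # finished (text, bbox) pairs
        self._depth = 0   # span nesting depth inside the current line
        self._text = []
        self._bbox = None

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if self._depth:
            if tag == "span":
                self._depth += 1  # nested span, e.g. a word inside the line
        elif tag == "span" and "ocr_line" in (attrs.get("class") or "").split():
            self._depth = 1
            self._text = []
            self._bbox = parse_bbox(attrs.get("title") or "")

    def handle_data(self, data):
        if self._depth:
            self._text.append(data)

    def handle_endtag(self, tag):
        if self._depth and tag == "span":
            self._depth -= 1
            if self._depth == 0:  # the line's own closing tag
                text = " ".join("".join(self._text).split())
                self.lines.append((text, self._bbox))


# Invented sample in hOCR style.
SAMPLE = """
<div class="ocr_page" title="bbox 0 0 600 800">
  <span class="ocr_line" title="bbox 10 20 300 40">Generalized Calabi-Yau</span>
  <span class="ocr_line" title="bbox 10 50 300 70; baseline 0 0">
    <span class="ocrx_word" title="bbox 10 50 90 70">manifolds</span>
  </span>
</div>
"""

parser = HocrLines()
parser.feed(SAMPLE)
print(parser.lines)
```

Keeping the bounding boxes alongside the text is what makes a later hOCR->PDF step possible, since the recognised text can be placed over the scanned image at the right position.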

Full-text Indexing

• About 1 million INSPIRE documents need full-text search combined with the Invenio metadata search
• Technologies used: Lucene, Hadoop, BibIndex (Invenio internal engine)
• D4Science services (Process Execution Engine, Hadoop)
• Integration with the powerful Invenio metadata search
• Future extension: semantic indexing using High Energy Physics taxonomy
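To show why full-text indexing fits the MapReduce model, here is a minimal in-memory sketch of building an inverted index in map, shuffle and reduce phases. It is plain Python written for illustration and uses neither the Hadoop nor the Lucene API. The function names and the two sample documents are assumptions.

```python
import re
from collections import defaultdict


def map_phase(doc_id, text):
    """Emit one (token, doc_id) pair per distinct token in a document."""
    for token in set(re.findall(r"[a-z0-9]+", text.lower())):
        yield token, doc_id


def shuffle(pairs):
    """Group the emitted values by key, as the framework would between phases."""
    groups = defaultdict(list)
    for token, doc_id in pairs:
        groups[token].append(doc_id)
    return groups


def reduce_phase(token, doc_ids):
    """Turn the grouped values into a sorted posting list."""
    return token, sorted(doc_ids)


def build_index(docs):
    pairs = (pair for doc_id, text in docs.items() for pair in map_phase(doc_id, text))
    return dict(reduce_phase(token, ids) for token, ids in shuffle(pairs).items())


# Invented sample documents.
docs = {1: "Supersymmetry breaking", 2: "Symmetry breaking in QCD"}
index = build_index(docs)
print(index["breaking"])
```

Each document is mapped independently of the others, so the map phase can be spread over as many nodes as there are input splits, and only the grouped posting lists need to be merged.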

Semantic Search (Roman Chyla)

• Lucene + Seman
• Seman is used for enriching the textual content with added semantic features.
• There are 3 crucial components involved:
  – 1) NLP engine (GATE)
  – 2) workflow engine
  – 3) translation engine
• https://svnweb.cern.ch/trac/rcarepo/wiki/InspireSemanticSearch#Fulltextsearchwithsemanticfeatures

Semantic search/indexing

Basic ideas

• NLP processing helps disambiguation
• HEP taxonomy bridges gaps between terms
• Translation into limited sets of codes facilitates operations over sets
  – For example, we can compute similarity of concepts X-Y using overlap of codes
• Text is indexed together with semantic tags
• Special processing applied both at indexing and at query time
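The slide does not say which overlap measure is used for concept similarity. As one plausible reading, the sketch below uses the Jaccard index, the size of the intersection divided by the size of the union. The codes and the `code_overlap` function are invented for the example.

```python
def code_overlap(codes_x, codes_y):
    """Jaccard similarity of two code sets: |X & Y| / |X | Y|."""
    x, y = set(codes_x), set(codes_y)
    if not x and not y:
        return 0.0  # no codes at all: treat as no evidence of similarity
    return len(x & y) / len(x | y)


# Hypothetical codes assigned to two concepts by the translation step.
concept_x = {"c12", "c40", "c77"}
concept_y = {"c40", "c77", "c90"}
print(code_overlap(concept_x, concept_y))  # 2 shared codes out of 4 distinct
```

Working on small code sets rather than raw text is what keeps such comparisons cheap at both indexing and query time.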

Author Disambiguation (Henning Weiler)

• BibAuthor is a framework around an algorithm meant to solve author ambiguity based solely on the metadata available for documents.
• It relies on the INSPIRE database, which is kept in the MARC21 format.

Algorithm

• Before any algorithm can run, one preparation step is needed:
  – Read all names from the INSPIRE database and store them in a new table
• The algorithm itself is divided into several steps:
  – 1. Clustering: finding potentially related authors
  – 2. Matching: create RA entities by pairwise comparison of VAs within a cluster
  – 3. Post-matching comparison: identifying identical RA entities through cross-cluster comparison
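The first two steps can be sketched as follows. This is an illustration under stated assumptions, not the BibAuthor code. It reads VA and RA as "virtual author" (one name occurrence on a paper) and "real author" (a person), which the slides do not spell out. It clusters by surname and first initial, and it merges two VAs when they share a coauthor, a deliberately crude stand-in for the real comparison. The sample records are invented, and step 3 (cross-cluster comparison) is omitted.

```python
from collections import defaultdict
from itertools import combinations


def cluster_key(name):
    """Step 1 key: (surname, first initial), e.g. 'Chen, Gang' -> ('chen', 'g')."""
    surname, _, first = name.partition(",")
    return surname.strip().lower(), first.strip()[:1].lower()


def disambiguate(records):
    """records: list of (name, set_of_coauthors), one per VA.

    Returns groups of record indices, one group per inferred RA.
    """
    # Step 1: clustering, so that only potentially related VAs are compared.
    clusters = defaultdict(list)
    for idx, (name, _) in enumerate(records):
        clusters[cluster_key(name)].append(idx)

    # Union-find structure to accumulate pairwise matches.
    parent = list(range(len(records)))

    def find(i):
        while parent[i] != i:
            parent[i] = parent[parent[i]]  # path halving
            i = parent[i]
        return i

    # Step 2: pairwise comparison of VAs within each cluster.
    for members in clusters.values():
        for a, b in combinations(members, 2):
            if records[a][1] & records[b][1]:  # assumed rule: shared coauthor
                parent[find(a)] = find(b)

    groups = defaultdict(list)
    for idx in range(len(records)):
        groups[find(idx)].append(idx)
    return sorted(groups.values())


# Invented sample data.
records = [
    ("Chen, G.", {"Li, H."}),
    ("Chen, Gang", {"Li, H.", "Wu, X."}),
    ("Chen, G.", {"Smith, J."}),
    ("Chan, G.", {"Li, H."}),
]
print(disambiguate(records))
```

Clustering first keeps the pairwise step tractable: comparisons grow quadratically with cluster size, so comparing only names that could plausibly belong together avoids an all-against-all pass over the whole name table.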

Thank you!

Questions?
