
Global Digital Library in High-Energy Physics

Jan Iwaszkiewicz (CERN)

Digital Repositories – Linked Open Data – the possible Role of D4Science

16th Dec 2010

Contents

• INSPIRE
• Invenio software
• Demo
• INSPIRE use cases in D4Science II
– OCR
– Full-text indexing
• Current research:
– Semantic Search
– Author Disambiguation


run by

(2007-)


HEP community

• community
– 20-30k active researchers publishing 10k articles/year
– large collaborations (up to 5000 members)
– very international (even small author groups)
– authors = readers
• rapid information exchange essential
– mailing of preprints since the 1960s
– long OA tradition
– >90% of HEP journal articles on arXiv!
• dominance of community based information systems
– arXiv
– SPIRES

Dominance of community services


From a 2007 survey of 2,000 physicists. Gentil-Beccot et al., Information Resources in High-Energy Physics: Surveying the Present Landscape and Charting the Future Course. J. Am. Soc. Inf. Sci. 60:150-160, 2009; arXiv:0804.2701

SPIRES (1974-)


• network of databases
– HEP literature, conferences, institutions, experiments, hepnames, jobs
• SLAC – DESY – Fermilab Collaboration
• SPIRES-HEP
– Metadata for 850k objects, ~800 new records per week
– Preprints, journal articles, conference contributions, books, grey literature
– since 1974, web server since 1991
– 100k searches/day
• high data quality, manually curated, comprehensive coverage
• high acceptance, user involvement

But: outdated technology from the 1970s


Invenio (2002-)

• digital multimedia library system
• platform for CERN Document Server (CDS)
• powerful search engine
– Google-like speed for up to 5M records
– combined metadata, reference and fulltext search
• flexible metadata (MARCXML)
• personalization and collaborative features
• modular architecture
• Apache/Python/MySQL
• GNU General Public Licence - invenio-software.org
• ~30 instances worldwide
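As an illustration of the MARCXML metadata mentioned above, the following minimal sketch reads a record with Python's standard library. The sample record is invented for this example; the tags used (001 for the record identifier, 100 for the first author, 245 for the title) follow general MARC 21 conventions and are not taken from an actual INSPIRE record.

```python
import xml.etree.ElementTree as ET

# Namespace of the MARCXML ("MARC21 slim") schema.
MARC_NS = {"marc": "http://www.loc.gov/MARC21/slim"}

# Invented sample record, for illustration only.
SAMPLE = """<record xmlns="http://www.loc.gov/MARC21/slim">
  <controlfield tag="001">12345</controlfield>
  <datafield tag="100" ind1=" " ind2=" ">
    <subfield code="a">Doe, J.</subfield>
    <subfield code="u">Example U.</subfield>
  </datafield>
  <datafield tag="245" ind1=" " ind2=" ">
    <subfield code="a">An example title</subfield>
  </datafield>
</record>"""


def subfields(record, tag, code):
    """Return every value of subfield `code` in datafields carrying `tag`."""
    path = f"marc:datafield[@tag='{tag}']/marc:subfield[@code='{code}']"
    return [el.text for el in record.findall(path, MARC_NS)]


record = ET.fromstring(SAMPLE)
record_id = record.findtext("marc:controlfield[@tag='001']", namespaces=MARC_NS)
authors = subfields(record, "100", "a")
titles = subfields(record, "245", "a")
print(record_id, authors, titles)
```

Because every field is addressed by tag and subfield code, new kinds of metadata can be added without changing the reading code, which is one way to understand the "flexible metadata" point.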


ingestion


dissemination


INSPIRE development

• 2007: feasibility study
• 2008: user-level functionalities
– data conversion
– citation analysis, search syntax, output formats…
• 2009: cataloguing functionalities
– metadata maintenance and enrichment tools
• 2010: workflow
– harvesting, cataloguing…
• April 2010: public beta version, http://inspirebeta.net


Bibliographic Content

• SPIRES content (plus part of CDS):
– journal articles, conference proceedings, preprints, experimental notes, theses
• going beyond SPIRES:
– conference slides, multimedia, software, high-level research data…
• going back before 1974
• more material from neighboring disciplines
– astrophysics, nuclear physics, mathematics…
– cited by core HEP articles


Rich Content Repository

• all freely accessible articles
– esp. “endangered” material
• access restricted articles
– “hidden archive”
– first agreements with Springer and APS
• historical material
– scanning of old preprint series
• beyond articles
– slides, multimedia, software, wikis…
– independent citable objects


INSPIRE features I

• Advanced search functionality
– Google-like freetext search
– Complex second-order searches

Example:

Find the most influential HEP core papers that cite the Hitchin article "Generalized Calabi-Yau manifolds" but don't cite any papers by Polchinski

refersto:reportnumber:math/0209099 collection:core cited:100->9999 NOT refersto:author:Polchinski
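The second-order query above can also be assembled programmatically. The sketch below rebuilds the same query string from its parts and wraps it in a search URL. The helper names are hypothetical, and the `/search` path with the `p` (pattern) and `of` (output format) parameters is an assumption about the Invenio web interface, not something stated on the slide.

```python
from urllib.parse import parse_qs, urlencode, urlsplit


def build_query(cites_report, collection, min_cites, max_cites, exclude_author):
    """Compose the four clauses of the example query (hypothetical helper)."""
    return (
        f"refersto:reportnumber:{cites_report} "
        f"collection:{collection} "
        f"cited:{min_cites}->{max_cites} "
        f"NOT refersto:author:{exclude_author}"
    )


def search_url(base, query, output_format="hb"):
    """Build a search URL; the path and parameter names are assumptions."""
    return f"{base}/search?{urlencode({'p': query, 'of': output_format})}"


query = build_query("math/0209099", "core", 100, 9999, "Polchinski")
url = search_url("http://inspirebeta.net", query)
print(query)
print(url)
```

Each clause narrows the result set: papers citing the report, restricted to the core collection, with 100 to 9999 citations, minus anything that cites Polchinski.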


INSPIRE features II

• detailed record pages
– abstract, keywords, references, citations, fulltext, figures
– various export formats
• comprehensive author pages
– affiliation history, coauthors, frequent keywords, article classification, citation summary
• citation analysis
– cited by, co-cited with, self-citations, citation history
• taxonomy based classification
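To make the "cited by" and "co-cited with" notions concrete, here is a minimal sketch over invented toy data. It is not INSPIRE's implementation: it only inverts reference lists to get the citing papers, and counts how often two papers appear together in the same reference list.

```python
from collections import Counter, defaultdict
from itertools import combinations

# Invented toy data: paper -> papers it references.
REFERENCES = {
    "P1": ["A", "B"],
    "P2": ["A", "B", "C"],
    "P3": ["A"],
}


def cited_by(references):
    """Invert the reference lists: cited paper -> set of citing papers."""
    index = defaultdict(set)
    for paper, refs in references.items():
        for ref in refs:
            index[ref].add(paper)
    return index


def co_cited(references):
    """Count how often each pair of papers shares a reference list."""
    counts = Counter()
    for refs in references.values():
        for pair in combinations(sorted(set(refs)), 2):
            counts[pair] += 1
    return counts


citations = cited_by(REFERENCES)
pairs = co_cited(REFERENCES)
print(sorted(citations["A"]), pairs[("A", "B")])
```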

Newest Features

• Plot extraction
• Plot caption indexing
• Full-text snippets


HEP taxonomy

Hierarchical structure of all important HEP concepts

Providing (e.g. for "dynamical symmetry breaking"):
• synonyms (dynamically broken)
• related terms (spontaneous symmetry breaking)
• broader/narrower terms (symmetry breaking)
• definitions
• subject areas (high-energy physics – theory)
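One possible way to hold such a taxonomy entry in memory is sketched below. The `Concept` class and its field names are hypothetical and only mirror the kinds of information listed above. Filing "symmetry breaking" under `broader` is this sketch's reading of the example, since the slide does not say which direction is meant.

```python
from dataclasses import dataclass, field


@dataclass
class Concept:
    """One taxonomy entry (hypothetical structure, not the INSPIRE schema)."""
    label: str
    synonyms: list[str] = field(default_factory=list)
    related: list[str] = field(default_factory=list)
    broader: list[str] = field(default_factory=list)
    narrower: list[str] = field(default_factory=list)
    definition: str = ""
    subject_areas: list[str] = field(default_factory=list)

    def all_labels(self):
        """Preferred label followed by its synonyms."""
        return [self.label, *self.synonyms]


dsb = Concept(
    label="dynamical symmetry breaking",
    synonyms=["dynamically broken"],
    related=["spontaneous symmetry breaking"],
    broader=["symmetry breaking"],  # assumed to be the broader term
    subject_areas=["high-energy physics – theory"],
)
print(dsb.all_labels())
```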


Taxonomy applications

• fast automatic generation of keywords
– enabling e.g. prompt alerts
– manually curated afterwards
• automatic selection of HEP relevant articles
– helps minimize manual selection
• improved search algorithm (planned)
– e.g. a search for "SUSY" will also find "supersymmetry"
– narrow/broaden search
• user tagging (planned)
– improve INSPIRE generated classification
– improve taxonomy
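Two of the applications above, automatic keyword generation and synonym-aware search, can be sketched with a plain dictionary. The two-entry taxonomy and the function names are invented for illustration. A real system would match against the full HEP taxonomy and would need more careful text normalization than lower-casing.

```python
import re

# Invented two-entry taxonomy: preferred label -> synonyms.
TAXONOMY = {
    "supersymmetry": ["SUSY"],
    "symmetry breaking": ["broken symmetry"],
}


def build_lookup(taxonomy):
    """Map every lower-cased label or synonym to its preferred label."""
    lookup = {}
    for label, synonyms in taxonomy.items():
        for form in [label, *synonyms]:
            lookup[form.lower()] = label
    return lookup


def extract_keywords(text, taxonomy):
    """Return the preferred labels whose label or synonym occurs in `text`."""
    lowered = text.lower()
    found = set()
    for form, label in build_lookup(taxonomy).items():
        if re.search(rf"\b{re.escape(form)}\b", lowered):
            found.add(label)
    return sorted(found)


def expand_query(term, taxonomy):
    """Return the search term plus its taxonomy equivalents, if any."""
    label = build_lookup(taxonomy).get(term.lower())
    if label is None:
        return [term]
    return [label, *taxonomy[label]]


keywords = extract_keywords("Searches for SUSY and broken symmetry at the LHC", TAXONOMY)
expanded = expand_query("SUSY", TAXONOMY)
print(keywords, expanded)
```

With the expansion in place, a search for "SUSY" is run against both "supersymmetry" and "SUSY", which is the behaviour the planned search improvement describes.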


Author identification

• INSPIRE author ID
– Single Sign-on for INSPIRE and arXiv.org
– active participation in ORCID
• author disambiguation
– using e.g. lab IDs, affiliation history, coauthors and more
– 22,000 INSPIRE IDs already assigned
• automatic association of papers with authors
– using info on affiliations, coauthors, research topics, from publishers
– G. Chen: 963 docs, 21 real authors, only 22 docs not assigned, 97.2% success rate
• INSPIRE ID part of author lists of large collaborations


Coming soon…

• personalization
– personal accounts, "bookshelves", display formats, e-mail alerts, RSS feeds
– collaborative tools, user groups
• claim my papers
• user tagging
• user submission
– paper centric (articles, supplementary material) and beyond


Coming later

• innovative metrics
• semantic analysis
• recommender systems
– combining citations, keywords, fulltext, usage pattern data...
• open API for 3rd party tools and searching
• object aggregation (OAI-ORE)
• OAIS standards for long-term document preservation


Partnerships

• Researchers - community
– user tagging, user submission
– improved correction interfaces
– feedback driving future developments
• information providers
– close alliance with arXiv
– data exchange with publishers/databases
– standardized author identities
• neighboring fields
– OAI-PMH (harvesting) and OpenSearch
– ADS (SAO/NASA Astrophysics Data System)
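For the OAI-PMH harvesting mentioned above, a minimal sketch of the request side is given below. The verb `ListRecords`, the arguments `metadataPrefix`, `from`, `set` and `resumptionToken`, and the XML namespace come from the OAI-PMH 2.0 protocol. The base URL, the helper names and the sample response are placeholders invented for this example, and no network request is made.

```python
import xml.etree.ElementTree as ET
from urllib.parse import urlencode

OAI_NS = {"oai": "http://www.openarchives.org/OAI/2.0/"}

# Invented, heavily shortened response used only to show token handling.
SAMPLE_RESPONSE = """<OAI-PMH xmlns="http://www.openarchives.org/OAI/2.0/">
  <ListRecords>
    <resumptionToken>abc123</resumptionToken>
  </ListRecords>
</OAI-PMH>"""


def list_records_url(base_url, metadata_prefix="oai_dc", from_date=None,
                     set_spec=None, resumption_token=None):
    """Build a ListRecords request URL (hypothetical helper)."""
    if resumption_token:
        # A resumption token must be sent without the other arguments.
        params = {"verb": "ListRecords", "resumptionToken": resumption_token}
    else:
        params = {"verb": "ListRecords", "metadataPrefix": metadata_prefix}
        if from_date:
            params["from"] = from_date
        if set_spec:
            params["set"] = set_spec
    return f"{base_url}?{urlencode(params)}"


def next_token(response_xml):
    """Return the resumption token of a response, or None on the last page."""
    token = ET.fromstring(response_xml).find(".//oai:resumptionToken", OAI_NS)
    return token.text if token is not None and token.text else None


first = list_records_url("http://repository.example.org/oai", from_date="2010-12-01")
token = next_token(SAMPLE_RESPONSE)
following = list_records_url("http://repository.example.org/oai", resumption_token=token)
print(first)
print(following)
```

A harvester would repeat the second request until `next_token` returns `None`, which is how incremental harvesting between repositories typically proceeds.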

Demo


INSPIRE use cases in D4Science II

• OCR
• Fulltext indexing
• Bibliometric analysis
• Data mirroring


Process Execution Engine

• Process Execution Engine (PE2ng) is a system to manage the execution of programs in a distributed infrastructure
• PE2ng provides adaptors for gCube and gLite based GRID, as well as Condor and Hadoop


MapReduce in PE2ng

D4Science introduced support for MapReduce (Hadoop) jobs

Advantages
• Fault tolerance - mechanisms of fault detection and automatic job resubmission in case of software or hardware errors
• Load balancing - optimizing the completion time of processing the entire data set (finishing time of the last subjob)
• Combined Storage & CPU resources
• "On demand" execution mode (vs. batch queues)

It perfectly matches the needs of I/O intensive INSPIRE jobs
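The MapReduce programming model itself can be shown in a few lines. The sketch below runs the map, shuffle and reduce phases in a single Python process on invented documents. It does not use Hadoop or PE2ng and shows none of their fault tolerance or load balancing; it only illustrates how a job is split into independent map calls whose outputs are grouped by key before reduction.

```python
from collections import defaultdict


def map_phase(doc_id, text):
    """Emit a (word, 1) pair for every word of one document."""
    for word in text.lower().split():
        yield word, 1


def shuffle(pairs):
    """Group the emitted values by key, as the framework would do."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(key, values):
    """Combine all values of one key into a single count."""
    return key, sum(values)


def run_job(documents):
    """Run map, shuffle and reduce sequentially over all documents."""
    pairs = [
        pair
        for doc_id, text in documents.items()
        for pair in map_phase(doc_id, text)
    ]
    return dict(reduce_phase(key, values) for key, values in shuffle(pairs).items())


counts = run_job({"d1": "quark gluon quark", "d2": "gluon plasma"})
print(counts)
```

Because each `map_phase` call touches only one document, a framework can spread the calls over many nodes and rerun any that fail, which is where the advantages listed above come from.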


Optical Character Recognition (OCR)

• Using open source Ocropus software, well suited for large scale batch processing
• Ocropus has three steps: document layout analysis, line recognition, character identification (different models)
• Input typically PDF files, OCR output in hOCR format (HTML), can be converted to PDF
• Several languages: EN, FR, DE...
• Automatic procedure developed (in Python) for format conversions, text recognition, hOCR->PDF conversion, grid job submission
• The whole OCR process can be done as a grid job
• Scanning and OCRing typically done using commercial services and tools (e.g. in India); OCRing is the more expensive part, so this can save real money
• Interest from other institutions
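Since the OCR output is hOCR, which is ordinary HTML carrying layout information in `class` and `title` attributes, the recognized text can be read back with the standard library. The sketch below collects the text and bounding box of each `ocr_line` element. The sample markup is invented, and the parser is a simplification that assumes well-nested tags; it is not the conversion procedure described on the slide.

```python
import re
from html.parser import HTMLParser

BBOX = re.compile(r"bbox (\d+) (\d+) (\d+) (\d+)")

# Invented sample in the style of hOCR output.
SAMPLE = """<div class='ocr_page' title='bbox 0 0 800 1000'>
  <span class='ocr_line' title='bbox 10 20 300 40'>Global Digital <span class='ocrx_word'>Library</span></span>
  <span class='ocr_line' title='bbox 10 50 300 70'>in High-Energy Physics</span>
</div>"""


class HocrLines(HTMLParser):
    """Collect (bbox, text) for every element with class 'ocr_line'."""

    def __init__(self):
        super().__init__()
        self.lines = []
        self._depth = 0  # nesting depth inside the current ocr_line
        self._bbox = None
        self._chunks = []

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if self._depth:
            # Nested element inside a line; void tags are not handled here.
            self._depth += 1
        elif "ocr_line" in (attrs.get("class") or "").split():
            match = BBOX.search(attrs.get("title") or "")
            self._bbox = tuple(map(int, match.groups())) if match else None
            self._chunks = []
            self._depth = 1

    def handle_endtag(self, tag):
        if self._depth:
            self._depth -= 1
            if self._depth == 0:
                text = " ".join("".join(self._chunks).split())
                self.lines.append((self._bbox, text))

    def handle_data(self, data):
        if self._depth:
            self._chunks.append(data)


parser = HocrLines()
parser.feed(SAMPLE)
for bbox, text in parser.lines:
    print(bbox, text)
```

Keeping the bounding boxes alongside the text is what makes it possible to place an invisible text layer over the scanned image when converting back to PDF.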

Full-text Indexing

• About 1 million INSPIRE documents need full-text search combined with the Invenio metadata search
• Technologies used: Lucene, Hadoop, BibIndex (Invenio internal engine)
• D4Science services (Process Execution Engine, Hadoop)
• Integration with the powerful Invenio metadata search
• Future extension: semantic indexing using High Energy Physics taxonomy
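The idea of combining full-text search with metadata search can be reduced to intersecting two inverted indexes. The sketch below does this on invented records in plain Python. It does not use Lucene or BibIndex; the record layout and function names are assumptions made for the example.

```python
from collections import defaultdict

# Invented records: id -> metadata field and full text.
RECORDS = {
    1: {"author": "Doe", "fulltext": "lattice QCD results"},
    2: {"author": "Roe", "fulltext": "QCD sum rules"},
    3: {"author": "Doe", "fulltext": "dark matter searches"},
}


def build_index(records, field):
    """Inverted index: lower-cased token -> set of record ids."""
    index = defaultdict(set)
    for rec_id, rec in records.items():
        for token in rec[field].lower().split():
            index[token].add(rec_id)
    return index


def search(fulltext_index, author_index, word, author):
    """Records whose full text contains `word` and whose author is `author`."""
    hits = fulltext_index.get(word.lower(), set())
    by_author = author_index.get(author.lower(), set())
    return sorted(hits & by_author)


fulltext_index = build_index(RECORDS, "fulltext")
author_index = build_index(RECORDS, "author")
result = search(fulltext_index, author_index, "QCD", "Doe")
print(result)
```

Both indexes return plain sets of record identifiers, so the metadata engine and the full-text engine can be built separately and merged at query time.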


Semantic Search (Roman Chyla)

• Lucene + Seman
• Seman is used for enriching the textual content with added semantic features.
• There are 3 crucial components involved:
– 1) NLP engine (GATE)
– 2) workflow engine
– 3) translation engine
• https://svnweb.cern.ch/trac/rcarepo/wiki/InspireSemanticSearch#Fulltextsearchwithsemanticfeatures


Semantic search/indexing

Basic ideas

• NLP processing helps disambiguation
• HEP taxonomy bridges gaps between terms
• Translation into limited sets of codes facilitates operations over sets
– For example, we can compute similarity of concepts X-Y using overlap of codes
• Text is indexed together with semantic tags
• Special processing applied both at indexing and at query time
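The similarity of two concepts via overlap of codes can be written as a set operation. The slide does not say which overlap measure is used; the sketch below assumes the Jaccard index (shared codes divided by all codes) as one plausible choice, and the code values are invented.

```python
def jaccard(codes_x, codes_y):
    """Jaccard overlap of two code sets: |X ∩ Y| / |X ∪ Y|."""
    x, y = set(codes_x), set(codes_y)
    if not x and not y:
        return 0.0  # convention chosen here for two empty sets
    return len(x & y) / len(x | y)


# Invented code sets for two concepts X and Y.
concept_x = {"c1", "c2", "c3"}
concept_y = {"c2", "c3", "c4"}
similarity = jaccard(concept_x, concept_y)
print(similarity)
```

Here two of the four distinct codes are shared, so the similarity is 0.5. Working on small code sets rather than raw text is what keeps such comparisons cheap.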


Author Disambiguation (Henning Weiler)

• BibAuthor is a framework around an algorithm meant to solve author ambiguity based solely on the metadata available for documents.
• It relies on the INSPIRE database, which is kept in the MARC21 format.


Algorithm

• Before any algorithm can run, one preparation step is needed:
– Read all names from the INSPIRE database and store them in a new table
• The algorithm itself is divided into several steps:
– 1. Clustering: finding potentially related authors
– 2. Matching: create RA entities by pairwise comparison of VAs within a cluster
– 3. Post-matching comparison: identifying identical RA entities through cross-cluster comparison
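The first two steps can be illustrated with a toy example. This sketch is not the BibAuthor algorithm: it assumes that a VA is one name occurrence on a paper and an RA is a group of VAs judged to be the same person, it clusters by surname plus first initial, and it merges two VAs when they share at least one coauthor. All of these choices, the data and the function names are invented for illustration, and step 3 (cross-cluster comparison) is left out.

```python
from collections import defaultdict
from itertools import combinations

# Invented name occurrences ("VAs" under the assumption stated above).
SIGNATURES = [
    {"id": 0, "name": "Chen, G.", "coauthors": {"Li, H.", "Wang, Y."}},
    {"id": 1, "name": "Chen, Gang", "coauthors": {"Li, H.", "Zhao, Q."}},
    {"id": 2, "name": "Chen, G.", "coauthors": {"Smith, A."}},
    {"id": 3, "name": "Cheng, G.", "coauthors": {"Li, H."}},
]


def cluster_key(name):
    """Step 1 key: lower-cased surname and first initial."""
    surname, _, given = name.partition(",")
    return surname.strip().lower(), given.strip()[:1].lower()


def cluster(signatures):
    """Step 1: group potentially related name occurrences."""
    clusters = defaultdict(list)
    for sig in signatures:
        clusters[cluster_key(sig["name"])].append(sig)
    return clusters


def match(members, min_shared=1):
    """Step 2: pairwise comparison inside one cluster, merged by union-find."""
    parent = {sig["id"]: sig["id"] for sig in members}

    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving
            x = parent[x]
        return x

    for a, b in combinations(members, 2):
        if len(a["coauthors"] & b["coauthors"]) >= min_shared:
            parent[find(a["id"])] = find(b["id"])

    groups = defaultdict(set)
    for sig in members:
        groups[find(sig["id"])].add(sig["id"])
    return sorted(sorted(group) for group in groups.values())


clusters = cluster(SIGNATURES)
chen_groups = match(clusters[("chen", "g")])
print(chen_groups)
```

In this toy run the two "Chen" occurrences that share a coauthor end up in one group and the third stays separate, while "Cheng, G." never gets compared with them because clustering placed it elsewhere. Catching such cases is the purpose of the cross-cluster comparison in step 3.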


Thank you!

Questions?
