Global Digital Library in High-Energy Physics
Jan Iwaszkiewicz (CERN)
Digital Repositories – Linked Open Data – the possible Role of D4Science
16th Dec 2010
Contents
• INSPIRE
• Invenio software
• Demo
• INSPIRE use cases in D4Science II
– OCR
– Full-text indexing
• Current research:
– Semantic Search
– Author Disambiguation
run by
(2007-)
HEP community
• community
– 20-30k active researchers publishing 10k articles/year
– large collaborations (up to 5000 members)
– very international (even small author groups)
– authors = readers
• rapid information exchange essential
– mailing of preprints since the 60’s
– long OA tradition
– >90% of HEP journal articles on arXiv!
• dominance of community based information systems
– arXiv
– SPIRES
Dominance of community services
From a 2007 survey of 2,000 physicists. Gentil-Beccot et al., Information Resources in High-Energy Physics: Surveying the Present Landscape and Charting the Future Course. J. Am. Soc. Inf. Sci. 60:150-160, 2009. arXiv:0804.2701
SPIRES (1974-)
• network of databases
– HEP literature, conferences, institutions, experiments, hepnames, jobs
• SLAC – DESY – Fermilab collaboration
• SPIRES-HEP
– metadata for 850k objects, ~800 new records per week
– preprints, journal articles, conference contributions, books, grey literature
– since 1974, web server since 1991
– 100k searches/day
• high data quality, manually curated, comprehensive coverage
• high acceptance, user involvement
• But: outdated technology from the 70’s
Invenio (2002-)
• digital multimedia library system
• platform for CERN Document Server (CDS)
• powerful search engine
– Google-like speed for up to 5M records
– combined metadata, reference and fulltext search
• flexible metadata (MARCXML)
• personalization and collaborative features
• modular architecture
• Apache/Python/MySQL
• GNU General Public Licence - invenio-software.org
• ~30 instances worldwide
ingestion
dissemination
INSPIRE development
• 2007: feasibility study
• 2008: user-level functionalities
– data conversion, citation analysis, search syntax, output formats…
• 2009: cataloguing functionalities
– metadata maintenance and enrichment tools
• 2010: workflow
– harvesting, cataloguing…
• April 2010: public beta version
– http://inspirebeta.net
Bibliographic Content
• SPIRES content (plus part of CDS):
– journal articles, conference proceedings, preprints, experimental notes, theses
• going beyond SPIRES:
– conference slides, multimedia, software, high-level research data…
• going back before 1974
• more material from neighboring disciplines
– astrophysics, nuclear physics, mathematics…
– cited by core HEP articles
Rich Content Repository
• all freely accessible articles
– esp. “endangered” material
• access restricted articles
– “hidden archive”
– first agreements with Springer and APS
• historical material
– scanning of old preprint series
• beyond articles
– slides, multimedia, software, wikis…
– independent citable objects
INSPIRE features I
• Advanced search functionality
– Google-like freetext search
– Complex second-order searches
• Example: find the most influential HEP core papers that cite the Hitchin article “Generalized Calabi-Yau manifolds” but don’t cite any papers by Polchinski
refersto:reportnumber:math/0209099 collection:core cited:100->9999 NOT refersto:author:Polchinski
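The example query can be sent to a search endpoint as a single URL-encoded string. The sketch below only composes such a URL; it assumes an Invenio-style `/search` endpoint that takes the query in a `p` parameter, and the helper name `build_search_url` is made up for illustration.

```python
from urllib.parse import urlencode, parse_qs, urlparse

# Assumption: an Invenio-style /search endpoint that reads the query from
# the `p` parameter. Check the instance's own documentation before use.
BASE_URL = "http://inspirebeta.net/search"


def build_search_url(terms, base_url=BASE_URL):
    """Join the query terms with spaces and URL-encode them into one URL."""
    query = " ".join(terms)
    return base_url + "?" + urlencode({"p": query})


terms = [
    "refersto:reportnumber:math/0209099",  # papers citing the Hitchin article
    "collection:core",                     # restricted to HEP core papers
    "cited:100->9999",                     # cited between 100 and 9999 times
    "NOT refersto:author:Polchinski",      # excluding papers citing Polchinski
]
url = build_search_url(terms)
print(url)
```

Keeping the terms in a list makes it easy to add or drop one condition of a second-order search without editing a long string by hand.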
INSPIRE features II
• detailed record pages
– abstract, keywords, references, citations, fulltext, figures
– various export formats
• comprehensive author pages
– affiliation history, coauthors, frequent keywords, article classification, citation summary
• citation analysis
– cited by, co-cited with, self-citations, citation history
• taxonomy based classification
Newest Features
• Plot extraction
• Plot caption indexing
• Full-text snippets
HEP taxonomy
• Hierarchical structure of all important HEP concepts
• Providing (e.g. dynamical symmetry breaking):
– synonyms (dynamically broken)
– related terms (spontaneous symmetry breaking)
– broader/narrower (symmetry breaking)
– definitions
– subject areas (high-energy physics – theory)
Taxonomy applications
• fast automatic generation of keywords
– enabling e.g. prompt alerts
– manually curated afterwards
• automatic selection of HEP relevant articles
– helps minimize manual selection
• improved search algorithm (planned)
– e.g. search for “SUSY” will also find “supersymmetry”
– narrow/broaden search
• user tagging (planned)
– improve Inspire generated classification
– improve taxonomy
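One way to picture the planned taxonomy-based search is as a lookup that replaces a query term with its concept label and synonyms, optionally adding broader terms. This is a minimal sketch, not the INSPIRE implementation: the `Concept` class, the `expand_query` helper, and the two-entry taxonomy are hypothetical, built only from the examples on the slides.

```python
from dataclasses import dataclass, field


@dataclass
class Concept:
    """One taxonomy entry (hypothetical structure, for illustration only)."""
    label: str
    synonyms: list = field(default_factory=list)
    related: list = field(default_factory=list)
    broader: list = field(default_factory=list)


# Two sample entries taken from the slides; the real taxonomy is far larger.
TAXONOMY = [
    Concept("supersymmetry", synonyms=["SUSY"]),
    Concept(
        "dynamical symmetry breaking",
        synonyms=["dynamically broken"],
        related=["spontaneous symmetry breaking"],
        broader=["symmetry breaking"],
    ),
]


def expand_query(term, taxonomy=TAXONOMY, broaden=False):
    """Return the concept label and synonyms; add broader terms on request."""
    wanted = term.lower()
    for concept in taxonomy:
        names = [concept.label] + concept.synonyms
        if wanted in (name.lower() for name in names):
            expanded = list(names)
            if broaden:
                expanded += concept.broader
            return expanded
    return [term]  # unknown terms pass through unchanged


print(expand_query("SUSY"))
print(expand_query("dynamically broken", broaden=True))
```

The `broaden` flag corresponds to the "narrow/broaden search" idea: the same lookup can widen a query by following the broader-term links.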
Author identification
• INSPIRE author id
– Single Sign-on for INSPIRE and arXiv.org
– active participation in ORCID
• author disambiguation
– using e.g. lab id’s, affiliation history, coauthors and more
– 22,000 INSPIRE-id’s already assigned
• automatic association of papers with authors
– using info on affiliations, coauthors, research topics, from publishers
– G. Chen: 963 docs, 21 real authors, only 22 docs not assigned, 97.2% success rate
• INSPIRE-id part of author lists of large collaborations
Coming soon…
• personalization
– personal accounts, “bookshelves”, display formats, e-mail alerts, RSS feeds
• collaborative tools, user groups
• claim my papers
• user tagging
• user submission
– paper centric (articles, supplementary material) and beyond
coming later
• innovative metrics
• semantic analysis
• recommender systems
– combining citations, keywords, fulltext, usage pattern data...
• open API for 3rd party tools and searching
• object aggregation (OAI-ORE)
• OAIS standards for long-term document preservation
Partnerships
• Researchers - community
– user tagging, user submission
– improved correction interfaces
– feedback driving future developments
• information providers
– close alliance with arXiv
– data exchange with publishers/databases
– standardized author identities
• neighboring fields
– OAI-PMH (harvesting) and OpenSearch
– ADS (SAO/NASA Astrophysics Data System)
Demo
INSPIRE use cases in D4Science II
• OCR
• Fulltext indexing
• Bibliometric analysis
• Data mirroring
Process Execution Engine
• Process Execution Engine (PE2ng) is a system to manage the execution of programs in a distributed infrastructure
• PE2ng provides adaptors for gCube and gLite based GRID, as well as Condor and Hadoop
MapReduce in PE2ng
• D4Science introduced support for MapReduce (Hadoop) jobs
• Advantages
– Fault tolerance - mechanisms of fault detection and automatic job resubmission in case of software or hardware errors
– Load balancing - optimizing the completion time of processing the entire data set (finishing time of the last subjob)
– Combined Storage & CPU resources
– “On demand” execution mode (vs. batch queues)
• It perfectly matches the needs of I/O intensive INSPIRE jobs
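The MapReduce pattern itself can be shown in a few lines of plain Python. The sketch below builds a small inverted index with separate map, shuffle and reduce steps; it runs in a single process and uses neither the Hadoop API nor PE2ng, and the function names and sample documents are made up for illustration.

```python
from collections import defaultdict


def map_document(doc_id, text):
    """Map step: emit one (word, doc_id) pair per distinct word."""
    return [(word, doc_id) for word in set(text.lower().split())]


def shuffle(pairs):
    """Group emitted values by key, as the framework does between steps."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped


def reduce_postings(word, doc_ids):
    """Reduce step: build the sorted posting list for one word."""
    return word, sorted(doc_ids)


def build_index(documents):
    pairs = []
    for doc_id, text in documents.items():
        # In a real cluster each map call could run as its own subjob.
        pairs.extend(map_document(doc_id, text))
    grouped = shuffle(pairs)
    return dict(reduce_postings(word, ids) for word, ids in grouped.items())


documents = {  # made-up sample documents
    1: "higgs boson search",
    2: "supersymmetry search at the LHC",
}
index = build_index(documents)
print(index["search"])
```

Because each map call touches one document and each reduce call touches one word, a framework can rerun a failed subjob or move it to another node without affecting the rest, which is where the fault tolerance and load balancing come from.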
Optical Character Recognition (OCR)
• Using open source Ocropus software, well suited for large scale batch processing
• Ocropus has three steps: document layout analysis, line recognition, character identification (different models)
• Input typically PDF files, OCR output in hOCR format (HTML), can be converted to PDF
• Several languages: EN, FR, DE...
• Automatic procedure developed (in Python) for format conversions, text recognition, hOCR->PDF conversion, grid job submission
• The whole OCR process can be done as a grid job
• Scanning and OCRing typically done using commercial services and tools (e.g. in India); the OCRing part is more expensive, so doing it in-house can save real money
• Interest from other institutions
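Since hOCR is HTML, the recognized text can be pulled out with a standard HTML parser before indexing or conversion. The sketch below collects the text of each recognized line using only the Python standard library; the class names `ocr_line` and `ocrx_word` follow the hOCR convention, the sample markup is invented, and real Ocropus output may differ in detail.

```python
from html.parser import HTMLParser


class HocrLineExtractor(HTMLParser):
    """Collect the text of every element whose class includes "ocr_line"."""

    def __init__(self):
        super().__init__()
        self.lines = []
        self._depth = 0   # nesting depth inside the current ocr_line element
        self._parts = []  # text fragments of the current line

    def handle_starttag(self, tag, attrs):
        classes = (dict(attrs).get("class") or "").split()
        if self._depth:
            self._depth += 1          # nested element inside a line
        elif "ocr_line" in classes:
            self._depth = 1           # a new line starts here
            self._parts = []

    def handle_endtag(self, tag):
        if self._depth:
            self._depth -= 1
            if self._depth == 0:      # the line element just closed
                text = " ".join("".join(self._parts).split())
                self.lines.append(text)

    def handle_data(self, data):
        if self._depth:
            self._parts.append(data)


# Invented hOCR fragment, assuming well-formed markup with closed tags.
sample = """
<div class="ocr_page" title="bbox 0 0 800 1000">
  <span class="ocr_line" title="bbox 10 10 400 40">
    <span class="ocrx_word">Generalized</span> <span class="ocrx_word">Calabi-Yau</span>
  </span>
  <span class="ocr_line" title="bbox 10 50 400 80">manifolds</span>
</div>
"""
parser = HocrLineExtractor()
parser.feed(sample)
parser.close()
print(parser.lines)
```

The `title` attributes carry the bounding boxes; a fuller converter would keep them so that the text can be placed back on the page image when producing a searchable PDF.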
Full-text Indexing
• About 1 million INSPIRE documents need full-text search combined with the Invenio metadata search
• Technologies used: Lucene, Hadoop, BibIndex (Invenio internal engine)
• D4Science services (Process Execution Engine, Hadoop)
• Integration with the powerful Invenio metadata search
• Future extension: semantic indexing using High Energy Physics taxonomy
Semantic Search (Roman Chyla)
• Lucene + Seman
• Seman is used for enriching the textual content with added semantic features.
• There are 3 crucial components involved:
– 1) NLP engine (GATE)
– 2) workflow engine
– 3) translation engine
• https://svnweb.cern.ch/trac/rcarepo/wiki/InspireSemanticSearch#Fulltextsearchwithsemanticfeatures
Semantic search/indexing
Basic ideas
• NLP processing helps disambiguation
• HEP taxonomy bridges gaps between terms
• Translation into limited sets of codes facilitates operations over sets
– For example, we can compute similarity of concepts X-Y using overlap of codes
• Text is indexed together with semantic tags
• Special processing applied both at indexing and at query time
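The "overlap of codes" idea can be made concrete with a set-based measure. The slides do not say which measure is used, so the sketch below picks Jaccard similarity as one plausible choice; the function name and the code values are hypothetical.

```python
def code_overlap(codes_x, codes_y):
    """Jaccard similarity: number of shared codes divided by all codes."""
    x, y = set(codes_x), set(codes_y)
    if not x and not y:
        return 0.0  # two empty code sets carry no evidence of similarity
    return len(x & y) / len(x | y)


# Hypothetical code assignments; real codes would come from the
# translation engine.
CODES = {
    "dynamical symmetry breaking": {"T01", "T07", "T12"},
    "spontaneous symmetry breaking": {"T01", "T07", "T15"},
    "plot extraction": {"S40"},
}

print(code_overlap(CODES["dynamical symmetry breaking"],
                   CODES["spontaneous symmetry breaking"]))
```

Working on small code sets rather than raw text keeps the comparison cheap, since each similarity is a handful of set operations.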
Author Disambiguation (Henning Weiler)
• BibAuthor is a framework around an algorithm meant to solve author ambiguity based solely on the metadata available for documents.
• It relies on the INSPIRE database, which is kept in the MARC21 format.
Algorithm
• Before any algorithm can run, one step of preparation is to be done:
– Read all names from the Inspire database and store them in a new table
• The algorithm itself is divided into several steps:
– 1. Clustering: finding potentially related authors
– 2. Matching: create RA entities by pairwise comparison of VAs within a cluster
– 3. Post-matching comparison: identifying identical RA entities through cross-cluster comparison
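The first two steps can be sketched on toy data. This is not the BibAuthor code: it assumes that VA and RA stand for per-paper author occurrences and resolved real authors, clusters by surname plus first initial, and merges occurrences that share at least one coauthor. All names, records and thresholds are invented, and the cross-cluster step 3 is left out.

```python
from collections import defaultdict
from itertools import combinations

# Invented records: one entry per (paper, author name) occurrence.
SIGNATURES = [
    {"id": 0, "name": "Chen, G.", "coauthors": {"Li, H.", "Wang, Y."}},
    {"id": 1, "name": "Chen, Gang", "coauthors": {"Wang, Y.", "Zhao, Q."}},
    {"id": 2, "name": "Chen, G.", "coauthors": {"Mueller, K."}},
    {"id": 3, "name": "Weiler, H.", "coauthors": {"Li, H."}},
]


def cluster_key(name):
    """Surname plus first initial, e.g. "Chen, Gang" -> ("chen", "g")."""
    surname, _, given = name.partition(",")
    return surname.strip().lower(), given.strip()[:1].lower()


def cluster(signatures):
    """Step 1: group occurrences that might belong to the same person."""
    clusters = defaultdict(list)
    for sig in signatures:
        clusters[cluster_key(sig["name"])].append(sig)
    return clusters


def match(cluster_sigs, min_shared=1):
    """Step 2: merge occurrences pairwise when they share enough coauthors."""
    parent = {sig["id"]: sig["id"] for sig in cluster_sigs}

    def find(i):
        while parent[i] != i:
            i = parent[i]
        return i

    for a, b in combinations(cluster_sigs, 2):
        if len(a["coauthors"] & b["coauthors"]) >= min_shared:
            parent[find(a["id"])] = find(b["id"])

    groups = defaultdict(set)
    for sig in cluster_sigs:
        groups[find(sig["id"])].add(sig["id"])
    return sorted(groups.values(), key=min)


authors = [group
           for sigs in cluster(SIGNATURES).values()
           for group in match(sigs)]
print(authors)
```

Clustering first keeps the pairwise comparison affordable: only occurrences with compatible names are compared, so the quadratic step runs on small groups instead of the whole database.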
Thank you!
Questions?