Post on 10-May-2015
Universitätsbibliothek
BASE – a powerful search engine for Open Access documents
AIMS@OA Week
25 Oct 2012
Friedrich Summann, Bielefeld University Library
Overview
BASE – the OA search engine
Harvesting OAI-PMH and its challenges
Metadata Aggregation and Data Quality
Processing Subject Repositories
Harvesting Background
BASE (Bielefeld Academic Search Engine)
• started in 2002, active since 2004
• 2900 repositories harvested via OAI-PMH
• 2337 repositories indexed
• 37.4 million documents included
• 3.1 million documents automatically classified
• Lucene/Solr index
• VuFind end-user GUI
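Repositories: Geographical Distribution
[Map of document counts by region; region labels are not preserved in the transcript. Values shown, in millions: 15.9, 14.0, 2.9, 2.5, 0.45, 0.45, 0.45, 0.26]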
BASE search features
• Truncation
• Search History
• Sorting
• Drilldown
• Linguistic Tools
(Stemming, Eurovoc Thesaurus)
Repository Typology
• Institutional Repositories (35 %)
• Thesis and Dissertation Server (11 %)
• Subject Repositories (1 %)
• Electronic Journals (21 %)
• Digital Collections (6 %)
• Others (Videos, Audios, Datasets etc.) (2 %)
BASE Interfaces
• Query REST interface
• Repository Metadata interface
• Data Delivery Interface (Repository based, DDC of aggregated Metadata) (under construction)
Overview
BASE – the OA search engine
Harvesting OAI-PMH and its challenges
Metadata Aggregation and Data Quality
Processing Subject Repositories
My Conclusion:
OAI-PMH harvesting is easy. But:
Putting things (results) together is the real challenge
Harvesting: Challenges and Pitfalls
• Repository does not respond (temporarily, or only for specific verbs)
• Results are not valid XML
• Harvesting breaks (especially with big repositories)
• Incremental harvesting does not work
• No deletion information, added records
• Variety of field contents
• Change of behavior (base URL, contents)
• Metadata point to a reference or citation only
• Link to the document is not operable
• Fulltext access is restricted (non-OA)
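Several of these pitfalls show up while processing a single ListRecords response page. The following is a minimal sketch, not the BASE harvester: it assumes the standard OAI-PMH 2.0 namespace and uses only the Python standard library. The function name `parse_list_records` and the sample record identifiers are made up for illustration.

```python
import xml.etree.ElementTree as ET

# Namespace defined by the OAI-PMH 2.0 specification.
OAI_NS = "{http://www.openarchives.org/OAI/2.0/}"


def parse_list_records(xml_text):
    """Return (records, resumption_token, error_code) for one response page.

    records is a list of (identifier, is_deleted) tuples.
    """
    try:
        root = ET.fromstring(xml_text)
    except ET.ParseError:
        # Pitfall: the response is not well-formed XML.
        return [], None, "not-well-formed"

    # Pitfall: the repository answers with a protocol-level error.
    error = root.find(OAI_NS + "error")
    if error is not None:
        return [], None, error.get("code")

    records = []
    for header in root.iter(OAI_NS + "header"):
        identifier = header.findtext(OAI_NS + "identifier")
        # Deleted records are flagged on the header, if the repository tracks them.
        records.append((identifier, header.get("status") == "deleted"))

    # A non-empty resumptionToken means more pages follow.
    token_el = root.find(OAI_NS + "ListRecords/" + OAI_NS + "resumptionToken")
    token = None
    if token_el is not None and token_el.text and token_el.text.strip():
        token = token_el.text.strip()
    return records, token, None


# Hypothetical response page used only to exercise the function.
SAMPLE = """<OAI-PMH xmlns="http://www.openarchives.org/OAI/2.0/">
  <ListRecords>
    <record><header><identifier>oai:example.org:1</identifier></header></record>
    <record><header status="deleted"><identifier>oai:example.org:2</identifier></header></record>
    <resumptionToken>page-2</resumptionToken>
  </ListRecords>
</OAI-PMH>"""

records, token, error_code = parse_list_records(SAMPLE)
```

A harvesting loop would call this once per page and request the next page with the returned token until it is None. Retrying on timeouts and restarting broken harvests are left out of the sketch.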
Overview
BASE – the OA search engine
Harvesting OAI-PMH and its challenges
Metadata Aggregation and Data Quality
Processing Subject Repositories
dc:language: Variety of Metadata Values
Analysis: European Repositories, Oct. 2009
804 different values in 4720585 tags
Top values
en – 1385175
eng – 511085
spa – 345658
de – 319937
en_GB – 178381
ger – 166587
eng; – 102678
FR – 95798
…
; – 3
? – 3
at;deu – 2
enm;eng – 2
FRA – 2
fr_BE – 2
Andere Sprache – 2
cat, spa, fra, eng. – 2
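One way to tame this variety is to map each raw dc:language value onto a controlled list of ISO 639-3 codes. The sketch below is an assumption about how such a normalizer could look, not the procedure BASE uses. The lookup table is deliberately tiny and covers only values that appear on this slide.

```python
import re

# Illustrative lookup only; a real table would cover the full ISO 639 code lists.
ISO_639_3 = {
    "en": "eng", "eng": "eng",
    "de": "deu", "ger": "deu", "deu": "deu",
    "fr": "fra", "fre": "fra", "fra": "fra",
    "es": "spa", "spa": "spa",
    "ca": "cat", "cat": "cat",
    "enm": "enm",
}


def normalize_language(raw):
    """Map one raw dc:language value to a sorted list of ISO 639-3 codes."""
    codes = set()
    # Values such as "enm;eng" or "cat, spa, fra, eng." pack several languages.
    for part in re.split(r"[;,\s]+", raw.strip()):
        # Drop trailing dots and region suffixes such as "_GB" or "-BE".
        key = re.split(r"[_-]", part.strip(".").lower())[0]
        if key in ISO_639_3:
            codes.add(ISO_639_3[key])
    # Unknown values ("?", "Andere Sprache") yield an empty list.
    return sorted(codes)
```

For example, `normalize_language("en_GB")` and `normalize_language("eng;")` both reduce to `["eng"]`, while values the table does not know are returned as an empty list so they can be reviewed by hand.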
dc:type: Variety of Metadata Values
Analysis: German Repositories, Sept. 2009
2772 different values in 1394089 tags
Top values
Dataset – 588525
Artikel – 192306
Rezension – 113924
Text – 73210
Text.Thesis.Doctoral – 30201
Article – 29278
Miszelle – 27060
NonPeerReviewed – 24688
ResearchPaper – 16046
Dissertation – 15531
…
Software – 7
Kulturkarten – 7
Composition – 7
Interactive Resource – 4
Interview – 3
Media – 1
content analysis – 1
Anniversary Publication – 1
qualitative research – 1
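Figures of the form "N different values in M tags" can be produced with a simple frequency count over the harvested field values. The following is a minimal sketch with a made-up toy sample. The function name `summarize_values` is hypothetical and the sample does not reproduce the 2009 analysis.

```python
from collections import Counter


def summarize_values(values, top=10):
    """Count non-empty field values and report totals plus the most frequent ones."""
    counts = Counter(v.strip() for v in values if v and v.strip())
    return {
        "tags": sum(counts.values()),   # number of non-empty tags seen
        "distinct": len(counts),        # number of different values
        "top": counts.most_common(top),
    }


# Toy sample for illustration only (not real harvesting data).
toy_types = ["Artikel", "Article", "Artikel", "Text", "Dissertation", "Artikel ", "Text", ""]
summary = summarize_values(toy_types, top=2)
```

Note that "Artikel" and "Article" are counted as different values here. Merging such synonyms requires a separate mapping step.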
Overview
BASE – the OA search engine
Harvesting OAI-PMH and its challenges
Metadata Aggregation and Data Quality
Processing Subject Repositories
Subject Repositories: Registries
• Disciplinary repositories: http://oad.simmons.edu/oadwiki/Disciplinary_repositories
• OpenDOAR
Subject Repositories in BASE
The Big Ones:
• arXiv.org (Physics)
• CERN Document Server (Physics)
• PubMed Central (Life Sciences)
• CiteSeer (Computer Science)
• ELIS (Library Science)
• REPEC (Economics)
• EconStor (Economics)
• SSOAR (Social Sciences)
…
The BASE Approach: Automatic Classification
Contents for Classifier Feed
• dc:description: 30 to 40 % of metadata records have a dc:description with relevant abstract information
• Document fulltext (if accessible)
• setSpec contains DDC and LCC codes
• dc:subject contains lots of subject-orientated information
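The inputs listed on this slide could be gathered per record roughly as follows. This is a sketch under stated assumptions: the record is a plain dict of lists (a hypothetical layout), and DDC codes in setSpec are assumed to follow the "ddc:NNN" naming pattern that many repositories use. Neither detail is confirmed by the slides.

```python
import re

# Assumed setSpec convention: "ddc:" followed by a three-digit DDC class.
DDC_SETSPEC = re.compile(r"^ddc:(\d{3})$")


def classifier_feed(record):
    """Collect known DDC classes and free text from one metadata record.

    record is a dict with optional 'setSpec', 'description' and 'subject'
    lists (hypothetical layout chosen for this example).
    """
    ddc = []
    for spec in record.get("setSpec", []):
        match = DDC_SETSPEC.match(spec.strip().lower())
        if match:
            ddc.append(match.group(1))
    # Abstracts and subject terms form the text a classifier would read.
    text = " ".join(record.get("description", []) + record.get("subject", []))
    return {"ddc": ddc, "text": text}


# Made-up record for illustration.
example = classifier_feed({
    "setSpec": ["ddc:004", "doc-type:article"],
    "description": ["An abstract."],
    "subject": ["information retrieval"],
})
```

Records that already carry a DDC class can serve as training examples, and records without one are candidates for automatic classification.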
Building the Knowledge Base
Mapping of frequently used classifications:
• LCC
• ELIS classification
• ArXiv classification
DDC codes: ~400,000 documents = 1.4 %
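Mapping another scheme onto DDC can be as simple as a lookup table keyed on the source classification. The sketch below maps arXiv archive names to DDC classes. The table entries are illustrative choices for this example, not the mapping BASE maintains.

```python
# Illustrative entries only: arXiv archive name -> three-digit DDC class.
ARXIV_TO_DDC = {
    "math": "510",      # Mathematics
    "physics": "530",   # Physics
    "astro-ph": "520",  # Astronomy
    "cs": "004",        # Computer science
}


def arxiv_to_ddc(category):
    """Return the DDC class for an arXiv category such as 'math.AG', or None."""
    # The archive is the part before the first dot ("math.AG" -> "math").
    archive = category.split(".", 1)[0].lower()
    return ARXIV_TO_DDC.get(archive)
```

Categories without an entry return None, so unmapped records can be passed on to the automatic classifier instead.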
DDC classes distribution in Harvesting Results
Subject-based Browsing
The End. Thank you!
Mail: friedrich.summann@uni-bielefeld.de