BASE: a powerful search engine for Open Access documents

Post on 10-May-2015

description

Presentation delivered by Friedrich Summann during Open Access Week @ AIMS 2012

transcript

Universitätsbibliothek

BASE – a powerful search engine for Open Access documents

AIMS@OA Week

25 Oct 2012

Friedrich Summann, Bielefeld University Library

Overview

BASE – the OA search engine

Harvesting OAI-PMH and its challenges

Metadata Aggregation and Data Quality

Processing Subject Repositories

Harvesting Background

BASE (Bielefeld Academic Search Engine)

• started in 2002, active since 2004
• 2900 repositories harvested via OAI-PMH
• 2337 repositories indexed
• 37.4 million documents included
• 3.1 million documents automatically classified
• Lucene/Solr index
• VuFind end-user GUI

Repositories: Geographical Distribution

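[Map figure: values per world region; the region labels did not survive extraction. Values shown: 0.45 m, 15.9 m, 0.45 m, 0.26 m, 2.5 m, 2.9 m, 14.0 m, 0.45 m]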

BASE search features

• Truncation

• Search History

• Sorting

• Drilldown

• Linguistic Tools (Stemming, Eurovoc Thesaurus)

Repository Typology

• Institutional Repositories (35 %)

• Thesis and Dissertation Servers (11 %)

• Subject Repositories (1 %)

• Electronic Journals (21 %)

• Digital Collections (6 %)

• Others (videos, audio, datasets, etc.) (2 %)

BASE Interfaces

• Query REST interface

• Repository Metadata interface

• Data Delivery Interface (Repository based, DDC of aggregated Metadata) (under construction)

Overview

BASE – the OA search engine

Harvesting OAI-PMH and its challenges

Metadata Aggregation and Data Quality

Processing Subject Repositories

My Conclusion:

OAI-PMH harvesting is easy. But:

Putting things (results) together is the real challenge

Harvesting: Challenges and pitfalls

• Repository does not respond (temporarily, specific verbs)
• Results are not valid XML
• Harvesting breaks (especially with big repositories)
• Incremental harvesting does not work
• No deletion information, added records
• Variety of field contents
• Change of behavior (base URL, contents)
• Metadata point to reference or citation only
• Link to document is not operable
• Fulltext access is restricted (non OA)
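Some of the pitfalls on this slide (invalid XML, missing deletion information, broken paging) can be detected while parsing a single OAI-PMH response. The following is a minimal sketch under stated assumptions, not BASE's actual harvester: `parse_list_records` and the sample response are hypothetical, while the namespace, the `resumptionToken` element, the `status="deleted"` header attribute and the `noRecordsMatch` error code are taken from the OAI-PMH 2.0 specification.

```python
import xml.etree.ElementTree as ET

OAI_NS = {"oai": "http://www.openarchives.org/OAI/2.0/"}


def parse_list_records(xml_text):
    """Parse one ListRecords page (hypothetical helper, not BASE code).

    Returns (live_ids, deleted_ids, resumption_token). The token is None
    when there is no further page. Raises ValueError for malformed XML or
    an OAI-PMH error, so the caller can retry or skip the repository.
    """
    try:
        root = ET.fromstring(xml_text)
    except ET.ParseError as exc:
        raise ValueError(f"response is not well-formed XML: {exc}") from exc

    error = root.find("oai:error", OAI_NS)
    if error is not None:
        # An empty incremental harvest is reported as an error code.
        if error.get("code") == "noRecordsMatch":
            return [], [], None
        raise ValueError(f"OAI-PMH error: {error.get('code')}")

    live, deleted = [], []
    for header in root.iterfind(".//oai:record/oai:header", OAI_NS):
        identifier = header.findtext("oai:identifier", default="", namespaces=OAI_NS)
        if header.get("status") == "deleted":
            deleted.append(identifier)
        else:
            live.append(identifier)

    token = root.findtext(".//oai:resumptionToken", default="", namespaces=OAI_NS)
    return live, deleted, token.strip() or None


# Made-up sample response for illustration only.
SAMPLE = """<OAI-PMH xmlns="http://www.openarchives.org/OAI/2.0/">
  <ListRecords>
    <record><header><identifier>oai:example.org:1</identifier></header></record>
    <record><header status="deleted"><identifier>oai:example.org:2</identifier></header></record>
    <resumptionToken>page-2</resumptionToken>
  </ListRecords>
</OAI-PMH>"""

print(parse_list_records(SAMPLE))
```

A harvester loop would keep requesting pages until the returned token is None. Repositories that never mark records as deleted cannot be handled this way and would need a periodic full re-harvest.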

Overview

BASE – the OA search engine

Harvesting OAI-PMH and its challenges

Metadata Aggregation and Data Quality

Processing Subject Repositories

dc:language: Variety of Metadata Values

Analysis: European Repositories, Oct. 2009
804 different values in 4720585 tags

Top values

en – 1385175
eng – 511085
spa – 345658
de – 319937
en_GB – 178381
ger – 166587
eng; – 102678
FR – 95798

…

; – 3
? – 3
at;deu – 2
enm;eng – 2
FRA – 2
fr_BE – 2
Andere Sprache – 2
cat, spa, fra, eng. – 2
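Values like these have to be normalized before they can drive a language facet. The sketch below is one possible approach, not the method BASE uses: `normalize_language` and the small `TO_ISO639_3` lookup table are assumptions made for illustration, and a real table would have to cover far more codes and language names.

```python
import re

# Small hand-made lookup (an assumption for this sketch); a real table
# would cover all ISO 639-1/639-2 codes and common language names.
TO_ISO639_3 = {
    "en": "eng", "eng": "eng",
    "de": "deu", "ger": "deu", "deu": "deu",
    "fr": "fra", "fre": "fra", "fra": "fra",
    "es": "spa", "spa": "spa",
    "cat": "cat", "enm": "enm",
}


def normalize_language(raw):
    """Map one raw dc:language value to a sorted list of ISO 639-3 codes.

    Unrecognized tokens are dropped, so an empty list means "unknown".
    """
    codes = set()
    # Multi-valued fields use ";" or "," as separators.
    for token in re.split(r"[;,\s]+", raw.strip().lower()):
        token = token.strip(".")
        # Drop a region subtag such as "en_GB" or "fr-BE".
        primary = re.split(r"[_-]", token)[0]
        if primary in TO_ISO639_3:
            codes.add(TO_ISO639_3[primary])
    return sorted(codes)


for value in ["en_GB", "eng;", "FR", "cat, spa, fra, eng.", "Andere Sprache", "?"]:
    print(repr(value), "->", normalize_language(value))
```

Dropping unknown tokens keeps the facet clean at the cost of losing records with free-text values; logging those tokens would show which table entries are still missing.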

dc:type: Variety of Metadata Values

Analysis: German Repositories, Sept. 2009
2772 different values in 1394089 tags

Top values

Dataset – 588525
Artikel – 192306
Rezension – 113924
Text – 73210
Text.Thesis.Doctoral – 30201
Article – 29278
Miszelle – 27060
NonPeerReviewed – 24688
ResearchPaper – 16046
Dissertation – 15531

…

Software – 7
Kulturkarten – 7
Composition – 7
Interactive Resource – 4
Interview – 3
Media – 1
content analysis – 1
Anniversary Publication – 1
qualitative research – 1
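A tally of this kind (distinct values, total tags, most frequent values, long tail) can be produced by counting the raw field values across all harvested records. The sketch below shows the idea on a toy list; `summarize_values` is a hypothetical helper and the sample is made up, not the data behind the slide.

```python
from collections import Counter


def summarize_values(values, top_n=3):
    """Summarize the raw values of one metadata field (hypothetical helper)."""
    counts = Counter(values)
    return {
        "distinct": len(counts),
        "total": sum(counts.values()),
        "top": counts.most_common(top_n),
        # Values seen exactly once: the extreme end of the long tail.
        "singletons": sorted(v for v, n in counts.items() if n == 1),
    }


# Toy sample: "Artikel" and "Article" stay separate because nothing is normalized yet.
sample = ["Dataset"] * 5 + ["Artikel"] * 3 + ["Article"] * 2 + ["Media", "Interview"]
summary = summarize_values(sample)
print(summary)
```

Counting before normalizing is deliberate here: the raw tally shows which spellings and languages a mapping table has to cover.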

Overview

BASE – the OA search engine

Harvesting OAI-PMH and its challenges

Metadata Aggregation and Data Quality

Processing Subject Repositories

Subject Repositories: Registries

• Disciplinary repositories: http://oad.simmons.edu/oadwiki/Disciplinary_repositories

• OpenDOAR

Subject Repositories in BASE

The Big Ones:

• arXiv.org (Physics)
• CERN Document Server (Physics)
• PubMed Central (Life Sciences)
• CiteSeer (Computer Science)
• ELIS (Library Science)
• REPEC (Economics)
• EconStor (Economics)
• SSOAR (Social Sciences)
• …

The BASE Approach: Automatic Classification

Contents for Classifier Feed

• dc:description: 30 to 40 % of metadata records have dc:description with relevant abstract information

• Document fulltext (if accessible)

• setSpec contains DDC and LCC codes

• dc:subject contains lots of subject-orientated information
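The inputs named for the classifier feed could be gathered per record roughly as follows. This is a sketch under assumptions, not the BASE pipeline: the record layout (a dict of lists), the helper name `build_classifier_feed`, and the setSpec pattern `ddc:…` / `lcc:…` are all hypothetical, and treating the extracted codes as candidate labels is an interpretation of the slide.

```python
import re

# Assumed setSpec convention for this sketch, e.g. "ddc:020" or "lcc:Z699".
CLASS_CODE = re.compile(r"^(ddc|lcc):(.+)$", re.IGNORECASE)


def build_classifier_feed(record):
    """Collect classifier input from one harvested record (hypothetical layout)."""
    # Text evidence: abstracts, subject terms and, if available, the fulltext.
    text_parts = []
    text_parts.extend(record.get("description", []))
    text_parts.extend(record.get("subject", []))
    if record.get("fulltext"):
        text_parts.append(record["fulltext"])

    # Classification codes found in setSpec, kept as candidate labels.
    labels = {"ddc": [], "lcc": []}
    for set_spec in record.get("setSpec", []):
        match = CLASS_CODE.match(set_spec.strip())
        if match:
            labels[match.group(1).lower()].append(match.group(2))

    return {"text": " ".join(text_parts), "labels": labels}


# Made-up record for illustration only.
record = {
    "description": ["An abstract about metadata harvesting."],
    "subject": ["OAI-PMH", "digital libraries"],
    "setSpec": ["ddc:020", "doc-type:article", "LCC:Z699"],
}
feed = build_classifier_feed(record)
print(feed)
```

Records that already carry a code could serve as training material, while records without one would only contribute text to be classified.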

Building the Knowledge Base

Mapping of frequently used classifications:

• LCC
• ELIS classification
• arXiv classification

DDC codes: ~400,000 documents = 1.4 %

DDC classes distribution in Harvesting Results

Subject-based Browsing

The End. Thank you!

Mail: friedrich.summann@uni-bielefeld.de