Sébastien François, EPrints Lead Developer. EPrints Developer Powwow, ULCC.

Post on 24-Dec-2015


Empowering EPrints Search with Xapian


Review of EPrints Internal Search

Indexing

Searching

Extras

TO-DO’s

Using & contributing

Demo(s)

Summary

EPrints “Internal” Search - Overview

[Class diagram: Search, Field, DataSet, MetaField, Condition and List, linked by associations with multiplicities 1 and 1..n]

EPrints “Internal” Search – Overview (2)

match = “EX” queries the main & auxiliary dataset tables

match = “IN” queries the __rindex dataset table

ordering is done via the __ordervalues_$langid dataset table
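The routing described for the internal search (match “EX” against the main and auxiliary tables, match “IN” against the __rindex table, ordering via __ordervalues_$langid) can be sketched roughly as follows. This is only an illustration in Python: EPrints itself is written in Perl, and the function names and the "dataset name plus suffix" table naming used here are assumptions, not the actual implementation.

```python
def lookup_table(dataset: str, match: str) -> str:
    """Return the table a search condition would read from (illustrative)."""
    if match == "EX":
        # Exact matches read the dataset's main table; the auxiliary
        # tables used for multiple-value fields are left out of this sketch.
        return dataset
    if match == "IN":
        # Word-index matches read the dataset's __rindex table.
        return f"{dataset}__rindex"
    raise ValueError(f"unsupported match type: {match!r}")


def order_table(dataset: str, langid: str) -> str:
    """Return the per-language table used to order results (illustrative)."""
    return f"{dataset}__ordervalues_{langid}"
```

For example, `lookup_table("eprint", "IN")` names the index table, while `order_table("eprint", "en")` names the English ordering table. Both are derived data that must be kept in step with the main table.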

EPrints “Internal” Search – Downsides

Simple search is not scalable

Lots of derived data in the DB (backup?)

No relevance matching -> good matches do not surface

No advanced features: suggestions, facets, boolean operators, etc.

Home-brewed: hard to maintain the code, hard to extend

Difficult to debug…

EPrints Xapian Search

Introduced in 3.3

Only integrated with the simple search

Little flexibility in controlling what is indexed

Advanced features “not really” enabled

Searches every field (“text_index” not respected)

But the idea is good & worth building upon

Indexing

Attempts to re-use EPrints’ default configuration:

◦ datasets’ field definitions (+ “text_index”)

◦ fields defined in the simple search (un-prefixed terms)

But needs its own bits to define:

◦ default indexing methods (by MetaField type)

◦ facet-able indexes

◦ order-able indexes

May be used to declare derived indexes – examples:

◦ “open_access”: to filter records with open full-text documents

◦ “year”: to filter by year of publication (rather than by date)

◦ “image_orientation”: if you had an archive of images, you could extract the orientation via EXIF
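To make the idea of derived indexes such as “year” and “open_access” more concrete, here is a small Python sketch of how such values might be computed from a record before indexing. The record layout (a dict with a `date` string and a list of `documents`, each with a `security` value) and the function names are assumptions made for this example; the actual Xapian plugin is Perl and is configured differently.

```python
from typing import Optional


def derive_year(record: dict) -> Optional[str]:
    """'year' index: the leading four digits of a YYYY[-MM[-DD]] date."""
    date = record.get("date") or ""
    year = date[:4]
    return year if len(year) == 4 and year.isdigit() else None


def derive_open_access(record: dict) -> bool:
    """'open_access' index: true if at least one document is public."""
    # Assumption: each document carries a 'security' value and
    # 'public' means the full text is openly readable.
    return any(
        doc.get("security") == "public"
        for doc in record.get("documents", [])
    )
```

A derived value like this exists only in the search index, so it costs nothing in the database schema but has to be regenerated whenever its rule changes.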

Indexing - Classes

[Class diagram: Xapian::Index, Config and XapianDB, with IndexMethod (Fulltext, Name, etc.) and OrderMethod (Alpha, Name, etc.)]

Indexing – Extra information

Indexes are prefixed by “_”, e.g. “_title”, so we can sanitise the user query – otherwise users could do prefixed searches (and search fields that are not necessarily allowed)

Z notation: indicates a stemmed value or index: Z_title, Zhappi (internal Xapian convention)

Script available to re-process the Xapian indexes (similar to “epadmin reindex” but doesn’t re-index EPrints’ internal indexes)

Reserved indexes:

◦ _id: keeps the internal id of the data-obj (/id/eprint/123)

◦ _dataset: which dataset the record belongs to (‘eprint’, ‘user’…)

◦ _configuration_md5: keeps an MD5 of the conf. the item was indexed against (useful?)

◦ _index_timestamp: when the item was last indexed
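The point about “_”-prefixed indexes and sanitising the user query can be illustrated with a short Python sketch. It assumes a hypothetical whitelist that maps the field names a user may type (such as `title:`) to the internal “_”-prefixed index names, and it demotes any other prefix to plain text. The plugin's real sanitising logic is not shown in the slides, so treat this as one possible approach only.

```python
import re

# Hypothetical whitelist: user-facing field name -> internal index name.
ALLOWED_FIELDS = {"title": "_title", "creators": "_creators"}

_PREFIXED = re.compile(r"^(\w+):(.+)$")


def sanitise_query(user_query: str, allowed=ALLOWED_FIELDS) -> list:
    """Split a query into (index, term) pairs; index is None for free text."""
    pairs = []
    for token in user_query.split():
        m = _PREFIXED.match(token)
        if m and m.group(1) in allowed:
            # Allowed field: rewrite to the internal "_"-prefixed index.
            pairs.append((allowed[m.group(1)], m.group(2)))
        else:
            # Unknown or internal prefixes (e.g. "_id:1") are demoted to
            # plain text so they cannot reach a restricted index.
            pairs.append((None, token.replace(":", " ").strip()))
    return pairs
```

With this sketch, `sanitise_query("title:xapian _id:1")` keeps the title restriction but turns `_id:1` into ordinary free text, which is the effect the “_” convention is meant to make possible.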

Searching

Again, attempts to re-use EPrints’ configuration:

◦ simple search (mostly for ordering methods)

◦ advanced/staff search: which fields to use (prefixed terms)

Extra bits can be configured such as which facets can be used on each search (simple, advanced, …)

Only indexed stuff can be searched

◦ you cannot use a facet which has not been generated

◦ you need to re-index your data if you change the simple search def.

◦ same if you add new order-able fields
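The rule that only indexed data can be searched (no facet without a generated index, and a re-index after configuration changes) can be sketched in Python as below. The function names are hypothetical. Comparing a stored MD5 of the configuration with a freshly computed one is only one possible use of the `_configuration_md5` reserved index; the slides themselves leave its usefulness open.

```python
import hashlib
import json


def split_facets(requested, indexed):
    """Separate requested facets into usable and missing ones."""
    indexed = set(indexed)
    usable = [name for name in requested if name in indexed]
    missing = [name for name in requested if name not in indexed]
    return usable, missing


def config_md5(config: dict) -> str:
    """MD5 of a canonical JSON dump of the search configuration."""
    canonical = json.dumps(config, sort_keys=True)
    return hashlib.md5(canonical.encode("utf-8")).hexdigest()


def needs_reindex(stored_md5: str, config: dict) -> bool:
    """True if the item was indexed against a different configuration."""
    return stored_md5 != config_md5(config)
```

A search front-end could call `split_facets` to drop facets that were never generated, and an admin script could use `needs_reindex` to find items indexed before the simple search definition changed.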

Searching (2)

Abstracted by Plugin::Search (original implementation)

Tricky to make it work with EPrints’ UI because it expects an EPrints::Search object

Plugin::Search::Internal is a wrapped EPrints::Search object (hack) so Plugin::Search::Xapian must emulate this behaviour

Searching – Classes & Op. Stack

[Diagram of classes and operation stack: /cgi/xapian, Search::XapianSearch, Paginate::Facets, Plugin::Search::Xapian, Xapian::Facets, Xapian DB]

Searching – Extra information

May be used in a script

Exports & feeds work

Can be serialised/de-serialised (including facets) so should work for Saved Searches (to test)
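Serialising and de-serialising a search together with its facets, as needed for Saved Searches, amounts to a lossless round trip of the search parameters. The Python sketch below shows that idea with a hypothetical `SearchSpec` structure and JSON as the format. Both are assumptions for illustration, since the slides do not describe the format the plugin actually uses.

```python
import json
from dataclasses import dataclass, field, asdict


@dataclass
class SearchSpec:
    """Hypothetical description of a search, including selected facets."""
    query: str
    dataset: str = "eprint"
    order: str = ""
    facets: dict = field(default_factory=dict)  # facet name -> selected values

    def serialise(self) -> str:
        # sort_keys gives a stable string for identical searches.
        return json.dumps(asdict(self), sort_keys=True)

    @classmethod
    def deserialise(cls, text: str) -> "SearchSpec":
        return cls(**json.loads(text))
```

The property worth testing for Saved Searches is that `SearchSpec.deserialise(spec.serialise()) == spec` holds for any search a user can build, facets included.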

Extras

“Related Items”

Jiadi has developed a Bootstrap-based Pagination module:

◦ more sexy

◦ supports alternative “views” of the search results

TO-DO’s

Range searching: possible in Xapian but not yet implemented (e.g. 1..10)

Some refactoring:

◦ Xapian::Index -> Xapian::Indexer

◦ Plugin::Search::Xapianv2 -> Plugin::Search::Xapian (and replace the default EPrints Xapian implementation)

Test with real life data (done to a certain extent...)

Load & scalability testing (+ number of slots etc.)

Multi-lang considerations (and related IndexMethod)
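For the range-searching item (e.g. 1..10), the first step is recognising a range token in the user query. The Python sketch below does only that parsing step, with a hypothetical `parse_range` helper that also accepts open-ended ranges. Xapian's query parser provides its own range-processor classes, which a real implementation would most likely rely on; they are not used here.

```python
import re
from typing import Optional, Tuple

_RANGE = re.compile(r"^(\d*)\.\.(\d*)$")


def parse_range(token: str) -> Optional[Tuple[Optional[int], Optional[int]]]:
    """Parse 'low..high' into (low, high); None if token is not a range."""
    m = _RANGE.match(token)
    if not m or not (m.group(1) or m.group(2)):
        return None  # not a range, or a bare ".."
    low = int(m.group(1)) if m.group(1) else None
    high = int(m.group(2)) if m.group(2) else None
    if low is not None and high is not None and low > high:
        low, high = high, low  # tolerate reversed bounds
    return low, high
```

A token that returns `None` would simply be treated as an ordinary search term.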

TO-DO’s – Would be nice

Page displaying how a data-obj has been indexed

◦ prefixes

◦ terms

◦ facets & order-able fields

Status page (cf. “Admin > Status”):

◦ DB size

◦ number of Documents

◦ indexed datasets (and how)

Weighting: supported (via conf.) but un-tested in real life

Internal Search vs Xapian Search

Xapian is more of a user search

The internal search is still required to:

◦ get records from the Database ($dataset->search())

◦ this affects screens such as “Manage Deposits”, “Review”, etc., which cannot wait for items to be indexed (direct DB calls)

◦ may be needed to apply ACLs (if some items cannot be searched): safer to use the (MySQL) DB as authority

Debugging Xapian

Plugin::Search::Xapian may be set to debug mode: shows processing and query building

Xapian comes with an analysis tool, “delve”, to:

◦ view the content of the Xapian DB or some selected Documents

◦ see if a term exists in the DB (and in which Documents)

◦ other info (term frequency etc.)

Knowing what Xapian is searching and how a data-obj is indexed is key to debugging most search-related issues

Using & Contributing

Not quite at release stage but it is currently isolated so shouldn’t break your IR

All the code is on GitHub:

https://github.com/eprints/xapianv2

Demos

http://puffin.ecs.soton.ac.uk/cgi/xapian

Simple search / facets / export / order

Simple search with boolean operators, suggestions

Advanced search / facets / export / order

Related items

http://vmdev1.eprints.org/cgi/xapian (more data + cached citations)

http://vmdev1.eprints.org/cgi/xapian_status

Q&A & what’s next

Let’s have a play?

Code overview?

Doc?