When worlds collide Metasearching meets central indexes

Post on 16-Jan-2016


When worlds collide

Metasearching meets central indexes

Mike Taylor – mike@indexdata.com

Index Data – http://indexdata.com/

Search

When worlds collide: metasearching and central indexes Mike Taylor – mike@indexdata.com


Data

Problem solved!

Search

Data, Data, Data

? ?

Metasearch

Magic box

Data, Data, Data, Data

Searching

360 Search, EHIS (EBSCO), MetaLib

Pazpar2 (Open source)

A.K.A. federated search

A.K.A. distributed search

A.K.A. broadcast search

?

Back to the sad searcher

Data, Data, Data

? ?

Central index

Data, Data, Data, Data

Fat database

Harvesting

Summon, WorldCat, Primo Central

MasterKey

A.K.A. local index

A.K.A. discovery services

A.K.A. vertical search

?

We need a controlled vocabulary!

Metasearch
= Federated search
= Distributed search
= Broadcast search

Central index
= Local index
= Discovery services
= Vertical search (if you ever heard anything so dumb)

Which approach is better?

Central indexing compared with metasearching:

- requires harvesting infrastructure
- requires lots of local storage
- requires co-operation from services to be harvested
- does not have access to all searchable data
- will always be somewhat out of date
- is faster at search time (or SHOULD be)
- allows data to be normalised (e.g. dates extracted)
- allows for better relevance ranking
- can provide pre-baked facets
- may have access to some data that is not searchable


Let's do both!

Magic box

Data, Data, Data, Data

Searching

Data, Data, Data, Data

Fat database

Harvesting

“Integrated Search”

Metasearch hides the complexity

Magic box

Data, Data, Data, Data

Searching

Metasearch

Nine tenths under the surface

What you see looks beautiful

Problems that need solving

A. Problems with pure metasearching

B. How those problems change when you add a central index

Problems with metasearching

Examples based on Index Data's suite:

Pazpar2 is a free metasearching engine with a stupid name

http://indexdata.com/pazpar2/

MasterKey is a non-open suite that wraps it

http://indexdata.com/masterkey/

MasterKey is only one way to use Pazpar2

Also integrated into other vendors' UIs.

Problems with metasearching #1: No data server at all!

Data is often only in a user-facing Web UI

Must be made available via a standard protocol

Option 1: build a gateway in Perl
http://indexdata.com/simpleserver/

Option 2: MasterKey Connect (non-open)
http://indexdata.com/connector-framework

Problems with metasearching #2: data server is crap^H^H^H^Hsuboptimal

Catalogs searchable using ANSI/NISO Z39.50

Support is very nominal in some cases

IRSpy probes behaviour
http://irspy.indexdata.com

MasterKey target profiles describe behaviour

Problems with metasearching #3: Data servers don't support relevance

Pazpar2 does its own relevance ranking

(Part of merging/deduplication)
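To make the merging/deduplication step concrete, here is a minimal sketch of clustering records that appear to describe the same item. It is not Pazpar2's actual algorithm: the match key (squashed title plus author), the record layout and the sample records are all assumptions made for illustration.

```python
import re

def merge_key(record):
    """Build a crude match key from title and author (an assumed heuristic)."""
    def squash(text):
        # Lower-case and collapse punctuation/whitespace so trivial
        # differences between servers do not prevent a match.
        return re.sub(r"[^a-z0-9]+", " ", text.lower()).strip()
    return (squash(record.get("title", "")), squash(record.get("author", "")))

def deduplicate(records):
    """Group records sharing a key; clusters come back in first-seen order."""
    clusters = {}
    for record in records:
        clusters.setdefault(merge_key(record), []).append(record)
    return list(clusters.values())

# Made-up sample records from two hypothetical catalogs.
records = [
    {"title": "The C Programming Language", "author": "Kernighan, Brian W.", "source": "catalog-a"},
    {"title": "The C programming language.", "author": "Kernighan, Brian W", "source": "catalog-b"},
    {"title": "The Unix Programming Environment", "author": "Kernighan, Brian W.", "source": "catalog-a"},
]
clusters = deduplicate(records)
```

A real engine would need a far more forgiving key than this, but the shape is the same: once records are clustered, the engine has everything it needs to rank them itself.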

Problems with metasearching #4: Data servers don't return facets

Pazpar2 calculates its own facets
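One way an engine might calculate facets itself is simply to count field values across the records it has retrieved. The sketch below illustrates that idea only; it is not taken from Pazpar2, and the record layout (dicts mapping a field name to a list of values), the function name `compute_facets` and the sample records are assumptions.

```python
from collections import Counter, defaultdict

def compute_facets(records, fields):
    """Count how many records carry each value of the given fields."""
    facets = defaultdict(Counter)
    for record in records:
        for field in fields:
            # set() so a value repeated inside one record is counted once.
            for value in set(record.get(field, [])):
                facets[field][value] += 1
    return {field: dict(counter) for field, counter in facets.items()}

# Made-up sample records, for illustration only.
records = [
    {"author": ["Kernighan", "Ritchie"], "date": ["1978"]},
    {"author": ["Kernighan", "Pike"], "date": ["1984"]},
    {"author": ["Kernighan"], "date": ["1978"]},
]
facets = compute_facets(records, ["author", "date"])
```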

There is a lot of magic in the magic box

- Searching
- Sorting
- Merging
- Deduplication
- Relevance
- Facet generation
- Time travel...

Magic box

Data, Data, Data, Data

Pazpar2

Remember, our engine is free:

http://indexdata.com/pazpar2/

Magic box

Data, Data, Data, Data

Searching

Data, Data, Data, Data

Fat database

Harvesting

What happens when we add a central index?

Problems with integrated search #1: No data server at all!

Data is often only in a user-facing Web UI

You can't harvest Google

You just can't

Problems with integrated search #2: data server is crap^H^H^H^Hsuboptimal

Repositories harvestable using OAI-PMH (an even worse name than pazpar2)

Support is very nominal in some cases

OAI-PMH client must be very tolerant

Extensive data-cleaning is usually required
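As a small example of the kind of data-cleaning involved, the sketch below pulls a usable year out of messy harvested date strings. The function name `extract_year`, the accepted year range (1500 to 2099) and the sample inputs are assumptions for illustration; a real harvester would need many more rules than this.

```python
import re

# A four-digit year between 1500 and 2099 that is not part of a longer number.
YEAR = re.compile(r"(?<!\d)(1[5-9]\d\d|20\d\d)(?!\d)")

def extract_year(raw):
    """Return the first plausible year found in a date string, or None."""
    if not raw:
        return None
    match = YEAR.search(raw)
    return int(match.group(1)) if match else None

# Made-up examples of the date strings a harvester might encounter.
samples = ["c1978.", "[1981?]", "May 3, 2009", "n.d."]
years = [extract_year(s) for s in samples]
```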

Problems with integrated search #3: Central index does support relevance

Returned records carry relevance scores

Must be merged with records scored by engine

Requires score normalisation into same range

Existing ordering may be used in merge
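A minimal sketch of those steps follows, assuming min-max scaling as the normalisation (the slides do not say which method is used) and made-up scores. Each list is rescaled into the range 0 to 1, then the lists are merged without re-sorting, relying on the order each one already has.

```python
import heapq

def normalise(hits):
    """Rescale (score, record) pairs so scores fall in the range 0..1."""
    if not hits:
        return []
    scores = [score for score, _ in hits]
    lo, hi = min(scores), max(scores)
    span = hi - lo
    # If every score is identical, treat all hits as equally relevant.
    return [((score - lo) / span if span else 1.0, rec) for score, rec in hits]

def merge_ranked(*result_lists):
    """Merge lists that are each already sorted best-first."""
    normalised = [normalise(hits) for hits in result_lists]
    # heapq.merge exploits the existing ordering of each input;
    # reverse=True because the highest score comes first.
    return list(heapq.merge(*normalised, key=lambda hit: hit[0], reverse=True))

central = [(12.4, "A"), (7.1, "B"), (3.0, "C")]  # hypothetical central-index scores
engine = [(0.9, "D"), (0.5, "E"), (0.1, "F")]    # hypothetical engine-assigned scores
merged = merge_ranked(central, engine)
```

Min-max scaling keeps the order within each list, which is what lets the merge reuse the existing ordering. It does not make scores from different servers truly comparable; it only puts them in the same range.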

[Diagram labels: Unranked #1, Ranked #1, Ranked #2, Solr, Sort, Merged, Unranked #2, Sort]

Problems with integrated search #4: Central index does return facets

Lists of field values with occurrence counts:

Author
Kernighan 27
Pike 13
Ritchie 7
Thompson 4

Title
C 7
Unix 35
Programming 16

Date
1977 5
1978 4
1979 2
1981 2

Lists are returned or calculated for each server:

Server 1 (central index) (all facets from 2000 hits)
Cat 68
Dinosaur 162
Fish 145
Frog 19

Server 2 (metasearch) (1000 hits, 100 records)
Cat 7
Dog 10
Dinosaur 87
Fish 23

Metasearched counts normalised by total hit-count

Server 2 (metasearch) (normalised to 1000 hits: each count from the 100 retrieved records × 10)
Cat 70
Dog 100
Dinosaur 870
Fish 230

Facet lists are merged

Servers 1+2 (integrated) (as though for all records in result sets)
Cat 68+70 = 138
Dog 0+100 = 100
Dinosaur 162+870 = 1032
Fish 145+230 = 375
Frog 19+0 = 19

Fringe benefit: facet-count normalisation is also useful when doing pure metasearching.
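The arithmetic on these slides can be sketched as follows. The counts are the ones shown above; the function names are assumptions for illustration and are not part of any Index Data API.

```python
from collections import Counter

def normalise_counts(counts, total_hits, records_seen):
    """Scale facet counts from a sample of records up to the full hit count."""
    factor = total_hits / records_seen
    return {value: round(n * factor) for value, n in counts.items()}

def merge_facets(*facet_lists):
    """Add up the counts for each facet value across servers."""
    merged = Counter()
    for counts in facet_lists:
        merged.update(counts)
    return dict(merged)

# Server 1: central index, facets already cover all of its hits.
server1 = {"Cat": 68, "Dinosaur": 162, "Fish": 145, "Frog": 19}
# Server 2: metasearch, facets counted from 100 records out of 1000 hits.
server2_sample = {"Cat": 7, "Dog": 10, "Dinosaur": 87, "Fish": 23}

server2 = normalise_counts(server2_sample, total_hits=1000, records_seen=100)
merged = merge_facets(server1, server2)
```

The scaled-up counts are estimates: they assume the 100 retrieved records are representative of all 1000 hits.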

Summary of search issues

Issue             Metasearch solution                   Central index solution
No data server    Build gateways; MasterKey Connect     ---
Bad data server   Probe capabilities; profile targets   Tolerant harvester; data-cleaning
Relevance scores  Magic engine; normalise scores        Ingest from server
Facets            Magic engine; normalise counts        Ingest from server
