Post on 16-Jan-2016
When worlds collide
Metasearching meets central indexes
Mike Taylor – mike@indexdata.com
Index Data – http://indexdata.com/
Search
When worlds collide: metasearching and central indexes Mike Taylor – mike@indexdata.com
Data
Problem solved!
Search
Data Data Data
? ?
Metasearch
Magic box
Data Data Data Data
Searching
360 Search, EHIS (EBSCO), MetaLib
Pazpar2 (open source)
Metasearch
A.K.A. federated search
A.K.A. distributed search
A.K.A. broadcast search
?
Back to the sad searcher
Data Data Data
? ?
Central index
Data Data Data Data
Fat database
Harvesting
Summon, WorldCat, Primo Central
MasterKey
Central index
A.K.A. local index
A.K.A. discovery services
A.K.A. vertical search
?
We need a controlled vocabulary!
Metasearch = Federated search = Distributed search = Broadcast search
Central index = Local index = Discovery services = Vertical search (if you ever heard anything so dumb)
Which approach is better?
Central indexing compared with metasearching:
- requires harvesting infrastructure
- requires lots of local storage
- requires co-operation from services to be harvested
- does not have access to all searchable data
- will always be somewhat out of date
- is faster at search time (or SHOULD be)
- allows data to be normalised (e.g. dates extracted)
- allows for better relevance ranking
- can provide pre-baked facets
- may have access to some data that is not searchable
Let's do both!
Magic box
Data Data Data Data
Searching
Data Data Data Data
Fat database
Harvesting
“Integrated Search”
Metasearch hides the complexity
Magic box
Data Data Data Data
Searching
Metasearch
Nine tenths under the surface
Metasearch
What you see looks beautiful
Problems that need solving
A. Problems with pure metasearching
B. How those problems change when you add a central index
Problems with metasearching
Examples based on Index Data's suite:
Pazpar2 is a free metasearching engine with a stupid name
http://indexdata.com/pazpar2/
MasterKey is a non-open suite that wraps it
http://indexdata.com/masterkey/
MasterKey is only one way to use Pazpar2
Pazpar2 is also integrated into other vendors' UIs.
Problems with metasearching #1: No data server at all!
Data is often only in a user-facing Web UI
Must be made available via a standard protocol
Option 1: build a gateway in Perl
http://indexdata.com/simpleserver/
Option 2: MasterKey Connect (non-open)
http://indexdata.com/connector-framework
Problems with metasearching #2: data server is crap^H^H^H^Hsuboptimal
Catalogs searchable using ANSI/NISO Z39.50
Support is very nominal in some cases
IRSpy probes behaviour
http://irspy.indexdata.com
MasterKey target profiles describe behaviour
Problems with metasearching #3: Data servers don't support relevance
Pazpar2 does its own relevance ranking
(Part of merging/deduplication)
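As a rough illustration of what engine-side ranking during merging can look like, here is a minimal Python sketch. It is not Pazpar2's actual algorithm: the record fields (`title`, `date`, `source`), the merge key, and the term-counting score are all assumptions made for this example.

```python
import re
from collections import defaultdict

def merge_key(record):
    """Build a normalised (title, date) key so near-identical records collide.
    The choice of key fields is an assumption for this sketch."""
    title = re.sub(r"[^a-z0-9 ]", "", record["title"].lower())
    return (" ".join(title.split()), record.get("date", ""))

def score(record, query_terms):
    """Crude relevance: count occurrences of the query terms in the title."""
    words = re.findall(r"[a-z0-9]+", record["title"].lower())
    return sum(words.count(term) for term in query_terms)

def merge_and_rank(result_lists, query):
    """Deduplicate records from several servers, then rank them locally."""
    terms = query.lower().split()
    clusters = defaultdict(list)
    for records in result_lists:
        for record in records:
            clusters[merge_key(record)].append(record)
    merged = []
    for group in clusters.values():
        best = dict(group[0])  # keep the first copy as the representative
        best["sources"] = sorted({r["source"] for r in group})
        best["score"] = score(best, terms)
        merged.append(best)
    # Highest score first; ties broken alphabetically by title.
    merged.sort(key=lambda r: (-r["score"], r["title"]))
    return merged

server_a = [
    {"title": "The C Programming Language", "date": "1978", "source": "A"},
    {"title": "Unix", "date": "1979", "source": "A"},
]
server_b = [
    {"title": "The C programming language.", "date": "1978", "source": "B"},
]
ranked = merge_and_rank([server_a, server_b], "programming")
```

Here the two copies of the same book collapse into one record that lists both sources, and it ranks above the record that does not mention the query term.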
Problems with metasearching #4: Data servers don't return facets
Pazpar2 calculates its own facets
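When servers return no facets, the engine can count field values in the records it has actually retrieved. The following is a minimal sketch of that idea, not Pazpar2's implementation. The field names and the record layout (plain dicts whose values are a string or a list of strings) are assumptions.

```python
from collections import Counter

def compute_facets(records, fields=("author", "date")):
    """Count how often each value occurs in each facet field."""
    facets = {field: Counter() for field in fields}
    for record in records:
        for field in fields:
            value = record.get(field)
            if value is None:
                continue  # this record has nothing to contribute here
            values = value if isinstance(value, list) else [value]
            facets[field].update(values)
    return facets

records = [
    {"author": ["Kernighan", "Ritchie"], "date": "1978"},
    {"author": "Kernighan", "date": "1977"},
    {"date": "1978"},
]
facets = compute_facets(records)
```

Counts produced this way only reflect the records fetched so far, which is why normalisation against the total hit count becomes relevant later.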
There is a lot of magic in the magic box
Searching, Sorting, Merging, Deduplication, Relevance, Facet generation, Time travel...
Magic box
Data Data Data Data
Pazpar2
Data Data Data Data
Remember, our engine is free:
http://indexdata.com/pazpar2/
What happens when we add a central index?
Magic box
Data Data Data Data
Searching
Data Data Data Data
Fat database
Harvesting
Problems with integrated search #1: No data server at all!
Data is often only in a user-facing Web UI
You can't harvest Google
You just can't
Problems with integrated search #2: data server is crap^H^H^H^Hsuboptimal
Repositories harvestable using OAI-PMH (an even worse name than Pazpar2)
Support is very nominal in some cases
OAI-PMH client must be very tolerant
Extensive data-cleaning is usually required
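To make "tolerant" and "data-cleaning" concrete, here is a small Python sketch that parses one OAI-PMH `ListRecords` page of Dublin Core records. It is only an illustration. The helper names (`clean_year`, `parse_page`), the decision to skip untitled records, and the year-extraction rule are assumptions, not anything prescribed by the talk or by a particular harvester.

```python
import re
import xml.etree.ElementTree as ET

OAI = "{http://www.openarchives.org/OAI/2.0/}"
DC = "{http://purl.org/dc/elements/1.1/}"

def clean_year(text):
    """Pull a plausible four-digit year out of messy date text, or None."""
    match = re.search(r"(?<!\d)(1[5-9]\d\d|20\d\d)(?!\d)", text or "")
    return match.group(1) if match else None

def parse_page(xml_text):
    """Return (records, resumption_token); a malformed page yields ([], None)."""
    try:
        root = ET.fromstring(xml_text)
    except ET.ParseError:
        return [], None  # tolerate broken XML instead of aborting the harvest
    records = []
    for rec in root.iter(OAI + "record"):
        title = rec.find(".//" + DC + "title")
        date = rec.find(".//" + DC + "date")
        if title is None or not (title.text or "").strip():
            continue  # skip records with no usable title
        records.append({
            "title": " ".join(title.text.split()),  # collapse stray whitespace
            "year": clean_year(date.text if date is not None else None),
        })
    token = root.find(".//" + OAI + "resumptionToken")
    token_text = (token.text or "").strip() if token is not None else ""
    return records, token_text or None

sample = """<OAI-PMH xmlns="http://www.openarchives.org/OAI/2.0/"
  xmlns:dc="http://purl.org/dc/elements/1.1/"><ListRecords>
<record><metadata><dc:title>  The C  Programming Language </dc:title>
<dc:date>c1978.</dc:date></metadata></record>
<record><metadata><dc:date>1979</dc:date></metadata></record>
<resumptionToken>abc123</resumptionToken></ListRecords></OAI-PMH>"""

records, token = parse_page(sample)
broken = parse_page("<broken")
```

A real harvester would loop, passing each returned token back to the repository until none remains. That network loop is left out so the sketch stays self-contained.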
Problems with integrated search #3: Central index does support relevance
Returned records carry relevance scores
Must be merged with records scored by engine
Requires score normalisation into same range
Existing ordering may be used in merge
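One way to picture the normalise-then-merge step is the sketch below. It assumes min-max rescaling into the range 0..1, which is just one possible normalisation and is not stated in the talk, and it assumes each source is already sorted best-first so that the existing ordering can drive a streaming merge.

```python
import heapq

def normalise(scored):
    """Rescale (record, score) pairs from one source into 0..1 (min-max)."""
    if not scored:
        return []
    scores = [s for _, s in scored]
    lo, hi = min(scores), max(scores)
    span = hi - lo
    # If every score is identical, treat them all as equally relevant.
    return [(rec, (s - lo) / span if span else 1.0) for rec, s in scored]

def merge_ranked(*sources):
    """Merge sources that are each already sorted by descending score.

    Min-max scaling preserves order, so heapq.merge can rely on the
    existing ordering rather than re-sorting everything."""
    normalised = [normalise(src) for src in sources]
    return list(heapq.merge(*normalised, key=lambda pair: -pair[1]))

central_index = [("c1", 12), ("c2", 6), ("c3", 3)]  # scores from the index
engine_scored = [("m1", 8), ("m2", 4)]              # scores from the engine
merged = merge_ranked(central_index, engine_scored)
order = [rec for rec, _ in merged]
```

With these inputs the top records of both sources end up at 1.0 and the bottom ones at 0.0. Ties go to the source listed first.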
[Diagram labels: Unranked #1, Unranked #2, Sort, Sort, Ranked #1, Ranked #2, Solr, Merged]
Problems with integrated search #4: Central index does return facets
Lists of field values with occurrence counts:
Author: Kernighan 27, Pike 13, Ritchie 7, Thompson 4
Title: C 7, Unix 35, Programming 16
Date: 1977 5, 1978 4, 1979 2, 1981 2
Lists are returned or calculated for each server:
Server 1 (central index) (all facets from 2000 hits)
Cat 68, Dinosaur 162, Fish 145, Frog 19
Server 2 (metasearch) (1000 hits, 100 records)
Cat 7, Dog 10, Dinosaur 87, Fish 23
Metasearched counts normalised by total hit-count
Server 1 (central index) (all facets from 2000 hits)
Cat 68, Dinosaur 162, Fish 145, Frog 19
Server 2 (metasearch) (normalised to 1000 hits)
Cat 70, Dog 100, Dinosaur 870, Fish 230
Facet lists are merged
Servers 1+2 (integrated) (as though for all records in result sets)
Cat 68+70 = 138
Dog 0+100 = 100
Dinosaur 162+870 = 1032
Fish 145+230 = 375
Frog 19+0 = 19
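The arithmetic on these slides can be reproduced in a few lines of Python. This is a sketch using the slide's own numbers. The function name `normalise_counts` and the use of rounding are assumptions.

```python
from collections import Counter

def normalise_counts(sampled_counts, total_hits, records_seen):
    """Scale facet counts taken from a sample of records up to the full hit count."""
    if records_seen == 0:
        return Counter()
    factor = total_hits / records_seen
    return Counter({term: round(n * factor) for term, n in sampled_counts.items()})

# Server 1: the central index reports facets over all of its 2000 hits.
central = Counter({"Cat": 68, "Dinosaur": 162, "Fish": 145, "Frog": 19})

# Server 2: metasearch target with 1000 hits, of which 100 records were fetched.
metasearch = normalise_counts(
    {"Cat": 7, "Dog": 10, "Dinosaur": 87, "Fish": 23},
    total_hits=1000,
    records_seen=100,
)

# Adding Counters sums the counts term by term.
merged = central + metasearch
```

The scale factor here is 1000 / 100 = 10, so `merged` holds the same totals as the merged list above.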
Fringe benefit: facet-count normalisation is also useful when doing pure metasearching.
Summary of search issues
Issue | Metasearch solution | Central index solution
No data server | Build gateways; MasterKey Connect | ---
Bad data server | Probe capabilities; profile targets | Tolerant harvester; data-cleaning
Relevance scores | Magic engine; normalise scores | Ingest from server
Facets | Magic engine; normalise counts | Ingest from server
When worlds collide
Metasearching meets central indexes
Mike Taylor – mike@indexdata.com
Index Data – http://indexdata.com/