Collection Ranking & Selection for Federated Entity Search
Krisztian Balog, Robert Neumayer, and Kjetil Nørvåg
Norwegian University of Science and Technology
19th International Symposium on String Processing and Information Retrieval (SPIRE 2012)
Cartagena de Indias, Colombia, October 2012
Motivation
- Entities are ubiquitous
  Many information needs revolve around entities (people, products, organisations, places, etc.)
- Growing amount of structured data
  Again, organised around entities
- Entities are often searched by their name
  If we know (or have heard of) an entity, we just ask for it by name
Top searches of 2011*
Web search: 1. iPhone 2. Casey Anthony 3. Kim Kardashian 4. Katy Perry 5. Jennifer Lopez 6. Lindsay Lohan 7. American Idol 8. Jennifer Aniston 9. Japan earthquake 10. Osama bin Laden
Mobile search: 1. iPhone 5 2. Powerball 3. MLB 4. Scrabble cheat 5. Casey Anthony 6. Hurricane Irene 2011 7. Kim Kardashian 8. Translator 9. Amy Winehouse 10. May 21, 2011 Rapture
* http://yearinreview.yahoo.com/
“users have learned that search engine relevance decreases with longer queries and have grown accustomed to reducing their query (at least initially) to the name of an entity”
[Blanco et al., 2011]
The Web of Data
[Figure: number of Linking Open Data datasets per year, 2007 to 2011 (y-axis from 0 to 300)]
[Figure: LOD cloud in 2007]
[Figure: LOD cloud in 2011 (as of September 2011), with datasets grouped by domain: Media, Geographic, Publications, Government, Cross-domain, Life sciences, User-generated content]
Ad-hoc entity search
At the 2010/11 Semantic Search Challenge
- Task
  Given a keyword query targeting a particular entity, provide a ranked list of relevant entities (i.e., URIs)
- Queries
  Sampled from web search engine logs (142 in total)
- Data collection
  Billion Triple Challenge 2009 (BTC) dataset
- Relevance judgments
  On a 3-point scale, collected using crowdsourcing
In this talk
- Address the ad-hoc entity retrieval task in a distributed setting
  - The Web of Data is inherently distributed
  - Some data sources may not be crawlable at all
- Specifically, our focus is on the collection ranking and collection selection steps
Federated search
A typical broker-based architecture
[Diagram: a central broker holding Summary A, Summary B and Summary C receives the query Q and communicates with Collection A, Collection B and Collection C. Steps: (1) collection ranking, (2) collection selection, (3) result merging]
Next: baseline models for collection ranking and selection
Collection ranking
Collection-centric (CC)
- Lexicon-based method
  Treat and score each collection as if it was a single, large document
[Diagram: query Q scored against Collection A, Collection B, Collection C, ...]
$P(c|q) \propto P(c) \cdot \prod_{t \in q} P(t|\theta_c)$
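The collection-centric score can be sketched in a few lines of Python. This is a minimal illustration, not the authors' implementation: the function name `collection_centric_scores`, the uniform prior P(c), the whitespace tokenisation and the linear smoothing against a background model (weight `smoothing`) are all assumptions made here to keep the example runnable. Scores are computed in log space to avoid underflow.

```python
import math
from collections import Counter


def collection_centric_scores(query, collections, smoothing=0.1):
    """Return log(P(c) * prod_t P(t|theta_c)) for every collection c.

    `collections` maps a collection id to the concatenated text of all its
    entities, i.e. each collection is treated as one large document.
    """
    # Term counts per collection, plus a background model over everything.
    counts = {c: Counter(text.lower().split()) for c, text in collections.items()}
    background = Counter()
    for tf in counts.values():
        background.update(tf)
    bg_total = sum(background.values())

    prior = 1.0 / len(collections)  # uniform P(c): an assumption
    scores = {}
    for c, tf in counts.items():
        length = sum(tf.values())
        log_p = math.log(prior)
        for t in query.lower().split():
            p_coll = tf[t] / length if length else 0.0
            p_bg = background[t] / bg_total if bg_total else 0.0
            # Smoothed P(t|theta_c); the smoothing scheme is an assumption.
            p = (1 - smoothing) * p_coll + smoothing * p_bg
            if p == 0.0:  # term does not occur in any collection
                log_p = float("-inf")
                break
            log_p += math.log(p)
        scores[c] = log_p
    return scores


toy = {"A": "kim kardashian news kim", "B": "iphone review"}
print(collection_centric_scores("kim kardashian", toy))
```

Sorting the returned dictionary by value gives the collection ranking; only the relative order of the scores matters, since the formula is stated up to proportionality.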
Collection ranking
Entity-centric (EC)
- Document-surrogate method
  Model and query individual documents (entities) and aggregate their relevance scores
[Diagram: query Q scored against the individual entities in Collection A, Collection B, Collection C, ...]
$P(c|q) \propto \sum_{e \in c,\; r(e,q) < \text{cutoff}} P(e|q)$
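The aggregation step of the entity-centric method can be illustrated as follows. This is a hedged sketch: it assumes that the entity scores P(e|q) have already been produced by some entity retrieval model, that r(e,q) is the 0-based rank of an entity in the resulting list, and that each entity belongs to exactly one collection. The names `entity_centric_scores`, `entity_collection` and `cutoff` are hypothetical.

```python
def entity_centric_scores(entity_scores, entity_collection, cutoff=100):
    """Sum P(e|q) per collection over the entities ranked above `cutoff`.

    entity_scores:     entity id -> P(e|q), from any entity ranking model
    entity_collection: entity id -> id of the collection holding the entity
    """
    ranked = sorted(entity_scores.items(), key=lambda kv: kv[1], reverse=True)
    scores = {}
    for rank, (entity, p) in enumerate(ranked):  # rank is 0-based (assumption)
        if rank >= cutoff:  # keep only entities with r(e, q) < cutoff
            break
        c = entity_collection[entity]
        scores[c] = scores.get(c, 0.0) + p
    return scores


entity_scores = {"e1": 0.5, "e2": 0.3, "e3": 0.2}
entity_collection = {"e1": "A", "e2": "B", "e3": "A"}
print(entity_centric_scores(entity_scores, entity_collection, cutoff=2))
```

Collections with no entity above the cutoff do not appear in the result at all, which is one way such a method can return fewer collections than a collection-centric one.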
Collection selection
Top-K selection
- Choosing a fixed cutoff (K) ahead of time
  K is usually set between 5 and 20
[Diagram: ranked list of collections A, D, C, E, B, F]
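For comparison with the dynamic strategies that follow, fixed top-K selection amounts to truncating the collection ranking. A minimal sketch, where the function name `select_top_k` and the scores are made up for illustration:

```python
def select_top_k(collection_scores, k=10):
    """Return the k highest-scoring collections, best first."""
    ranked = sorted(collection_scores, key=collection_scores.get, reverse=True)
    return ranked[:k]


# Hypothetical scores that reproduce the order A, D, C, E, B, F.
scores = {"A": 0.9, "B": 0.2, "C": 0.6, "D": 0.8, "E": 0.4, "F": 0.1}
print(select_top_k(scores, k=3))
```

The same K is applied to every query, regardless of how many collections actually hold relevant entities.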
Our method: AENN
“All that an Entity Needs is a Name”
- Exploit that entities are searched by their name
- The central broker maintains a complete dictionary of entity names (and corresponding identifiers)
- Utilise this information in the collection selection step to dynamically adjust the number of collections selected
AENN collection ranking
- Key observation
  Different methods (collection-centric, CC, vs. entity-centric, EC) work best for different queries
- Idea
  A combination should give better results than either method alone
$AENN(c, q) = (1 - \lambda) \cdot CC(c, q) + \lambda \cdot EC(c, q)$
AENN collection ranking
Example

Rank  CC        EC        AENN
1     A  0.35   A  0.65   A  0.5
2     D  0.3    B  0.3    B  0.2
3     C  0.2    C  0.05   D  0.15
4     E  0.15             C  0.125
5     B  0.1              E  0.075
6     F  0.05             F  0.025
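The example can be checked with a short script. Two things are assumed here rather than stated on the slide: the mixture weight, since the combined scores shown are consistent with an equal weighting of 0.5, and the assignment of the six-item list to CC and the three-item list to EC, which is the assignment consistent with the selection examples that follow. With equal weights the combined scores are the same either way. A collection missing from one ranking is given a score of 0 in that ranking.

```python
def aenn_scores(cc, ec, lam=0.5):
    """Linear mixture: (1 - lam) * CC(c, q) + lam * EC(c, q).

    A collection absent from one ranking contributes 0 for that ranking.
    """
    collections = set(cc) | set(ec)
    return {c: (1 - lam) * cc.get(c, 0.0) + lam * ec.get(c, 0.0)
            for c in collections}


cc = {"A": 0.35, "D": 0.3, "C": 0.2, "E": 0.15, "B": 0.1, "F": 0.05}
ec = {"A": 0.65, "B": 0.3, "C": 0.05}

combined = aenn_scores(cc, ec, lam=0.5)  # lam = 0.5 is inferred, not given
for c in sorted(combined, key=combined.get, reverse=True):
    print(c, round(combined[c], 3))
```

The printed order is A, B, D, C, E, F, matching the combined ranking in the example.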
AENN collection selection
- Key observation
  CC has higher recall, while EC has better precision
- Idea
  Use the collection rankings generated by EC and/or CC to dynamically adjust the set of collections selected
  - Precision-oriented selection
  - Recall-oriented selection
  - Balanced selection
AENN collection selection
Precision-oriented selection (AENN(p))
- Only select collections returned by EC
[Example: AENN(p) selects A (0.5), B (0.2), C (0.125)]
AENN collection selection
Recall-oriented selection (AENN(r))
- Include collections from CC until all from EC are covered. This defines the cutoff point for AENN
[Example: AENN(r) selects A (0.5), B (0.2), D (0.15), C (0.125), E (0.075)]
AENN collection selection
Balanced selection (AENN(b))
- Include collections from AENN until all from EC are covered
[Example: AENN(b) selects A (0.5), B (0.2), D (0.15), C (0.125)]
AENN collection selection
Comparison of approaches

Rank  AENN(p)    AENN(r)    AENN(b)
1     A  0.5     A  0.5     A  0.5
2     B  0.2     B  0.2     B  0.2
3     C  0.125   D  0.15    D  0.15
4                C  0.125   C  0.125
5                E  0.075
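The three selection strategies can be written as one small function. This is a sketch of the rules as worded on the slides, not the authors' code: the names `rank`, `prefix_covering` and `select` are hypothetical, and the input scores are the ones from the running example, with the six-item list taken as CC and the three-item list as EC. Selected collections are returned in AENN order.

```python
def rank(scores):
    """Collection ids sorted by descending score."""
    return sorted(scores, key=scores.get, reverse=True)


def prefix_covering(ranking, required):
    """Shortest prefix of `ranking` that contains every id in `required`."""
    remaining = set(required)
    prefix = []
    for c in ranking:
        if not remaining:
            break
        prefix.append(c)
        remaining.discard(c)
    return prefix


def select(cc, ec, aenn, mode):
    if mode == "precision":    # only collections returned by EC
        chosen = set(ec)
    elif mode == "recall":     # walk down CC until all of EC is covered
        chosen = set(prefix_covering(rank(cc), ec))
    elif mode == "balanced":   # walk down AENN until all of EC is covered
        chosen = set(prefix_covering(rank(aenn), ec))
    else:
        raise ValueError(f"unknown mode: {mode}")
    return [c for c in rank(aenn) if c in chosen]


cc = {"A": 0.35, "D": 0.3, "C": 0.2, "E": 0.15, "B": 0.1, "F": 0.05}
ec = {"A": 0.65, "B": 0.3, "C": 0.05}
aenn = {"A": 0.5, "B": 0.2, "D": 0.15, "C": 0.125, "E": 0.075, "F": 0.025}

for mode in ("precision", "recall", "balanced"):
    print(mode, select(cc, ec, aenn, mode))
```

On these inputs the function returns A, B, C for the precision-oriented rule, A, B, D, C, E for the recall-oriented rule, and A, B, D, C for the balanced rule, which agrees with the comparison above.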
Experimental setup
Based on the 2010/11 Semantic Search Challenge
- Distributed environment
  Top 100 largest second-level domains from BTC
  - Three sets with different handling of DBpedia
- Relevance
  Considered the number of relevant entities from each collection
- Metrics
  - Collection ranking: standard IR metrics (MAP, MRR, nDCG)
  - Collection selection: analogues of precision and recall, plus the average number of collections selected
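The slides do not spell out how the precision and recall analogues are defined. One plausible set-based reading, given purely as an assumption, treats a collection as relevant for a query when it holds at least one relevant entity. The function name `selection_precision_recall` and the counts below are hypothetical.

```python
def selection_precision_recall(selected, relevant_counts):
    """Set-based precision and recall of a collection selection.

    relevant_counts maps collection id -> number of relevant entities it
    holds for the query; a count above zero marks the collection as
    relevant (an assumed definition, not taken from the slides).
    """
    selected = set(selected)
    relevant = {c for c, n in relevant_counts.items() if n > 0}
    hits = selected & relevant
    precision = len(hits) / len(selected) if selected else 0.0
    recall = len(hits) / len(relevant) if relevant else 0.0
    return precision, recall


# Hypothetical judgments: A, C and F hold relevant entities; D does not.
counts = {"A": 3, "C": 1, "F": 2, "D": 0}
print(selection_precision_recall(["A", "B", "D", "C"], counts))
```

Averaging these two values and the size of the selected set over all queries would give the three quantities listed under collection selection.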
Test collections

                                  BTC     BTC\DBpedia   DBpedia
#Entities                         68.8M   60.5M         8.8M
#Collections                      100     99            100
#Queries                          136     116           130
Avg. #rel. entities/query         14.9    4.8           10.1
Avg. #rel. entities/collection    3.4     2.8           9.4
Results
Collection ranking (BTC)
[Figure: bar chart of MAP (y-axis from 0 to 0.75) for CC, EC and AENN, using name-only and full-content representations]
Results
Different collection selection strategies (BTC\DBpedia)
[Figure: three panels plotting precision, recall and the average number of collections selected against K (1, 3, 10, 20, 50, 100) for AENN(p), AENN(r) and AENN(b)]
Results
Collection selection (DBpedia)
[Figure: three panels plotting precision, recall and the average number of collections selected against K (1, 3, 10, 20, 50, 100) for CC-N, EC-C and AENN(b)]
Summary
- Addressed the task of ad-hoc entity retrieval in a distributed setting
- Introduced AENN, a novel collection ranking and selection method based on a lean name-based entity representation
- Showed experimentally that AENN can outperform standard baselines that consider all entity content
- Further, AENN can be geared towards high precision, high recall, or a balanced setting
Questions?
Resources are available at http://bit.ly/OzfYK2
Contact
@krisztianbalog
krisztianbalog.com