Introduction to Information Retrieval
cs160
Introduction
David Kauchak
adapted from:
http://www.stanford.edu/class/cs276/handouts/lecture1-intro.ppt
Introductions
• Name/nickname
• Dept., college and year
• One interesting thing about yourself
• Why are you taking this class?
• What topics/material would you like to see covered?
• Plans after graduation
Administrative
• Web page: www.cs.pomona.edu/classes/cs160/
• Syllabus…
• Administrative handout…
• Class feedback
• In-class participation
• Workload
• Homework 1 available soon
• Programming assignment 1 available soon
• Due time?
Information retrieval (IR)
What comes to mind when I say “information retrieval”?
Where have you seen IR? What are some real-world examples/uses?
• Search engines
• File search (e.g. OS X Spotlight, Windows Instant Search, Google Desktop)
• Databases?
• Catalog search (e.g. library)
• Intranet search (i.e. corporate networks)
Information Retrieval
Information retrieval is finding material in text documents of an unstructured nature that satisfies an information need from within large collections of digitally stored content
• Find all documents about computer science
• Find all course web pages at Pomona
• What is the cheapest flight from LA to NY?
• Who was the 15th president?
What is the difference between an information need and a query?
Information need                              Query
Find all documents about computer science     “computer science”
Find all course web pages at Pomona           Pomona AND college AND url-contains class
Who was the 15th president?                   WHO=president NUMBER=15
IR vs. databases
Structured data tends to refer to information in “tables”
Employee   Manager   Salary
Smith      Jones     50000
Chang      Smith     60000
Ivy        Smith     50000
Typically allows numerical range and exact match (for text) queries, e.g., Salary < 60000 AND Manager = Smith.
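To make the structured query concrete, here is a minimal Python sketch that runs Salary < 60000 AND Manager = Smith over the three rows from the table. The dictionary field names and the `select` helper are illustrative assumptions, not part of the slides.

```python
# Rows of the Employee / Manager / Salary table from the slide.
employees = [
    {"employee": "Smith", "manager": "Jones", "salary": 50000},
    {"employee": "Chang", "manager": "Smith", "salary": 60000},
    {"employee": "Ivy", "manager": "Smith", "salary": 50000},
]


def select(rows, max_salary, manager):
    """Numerical range test on salary AND exact text match on manager."""
    return [
        row["employee"]
        for row in rows
        if row["salary"] < max_salary and row["manager"] == manager
    ]


matches = select(employees, 60000, "Smith")
print(matches)
```

Only Ivy matches: Chang's salary is exactly 60000, which fails the strict less-than test.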
Unstructured (text) vs. structured (database) data in 1996
[Figure: unstructured (text) vs. structured (database) data in 1996]
Unstructured (text) vs. structured (database) data in 2006
[Figure: unstructured (text) vs. structured (database) data in 2006]
Unstructured data in 1680
Which plays of Shakespeare contain the words Brutus AND Caesar but NOT Calpurnia?
One could grep all of Shakespeare’s plays for Brutus and Caesar, then strip out plays containing Calpurnia. Any problems with this?
• Slow (for large corpora)
• Other operations (e.g., find the word Romans near countrymen) not feasible
• Ranked retrieval (best documents to return): later lectures
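The grep-style approach can be sketched in Python as a linear scan. The toy corpus is a hypothetical stand-in (the strings are not real play text), and `scan` is an illustrative name. The point is that every query reads every document in full.

```python
# Hypothetical toy corpus: title -> text (stand-ins, not real play text).
plays = {
    "Play A": "Brutus and Caesar meet; Calpurnia warns Caesar",
    "Play B": "Caesar is dead and Brutus is slain",
    "Play C": "mercy for the worser sort",
}


def scan(corpus, required, forbidden):
    """Linear scan: tokenize every document again for every query."""
    hits = []
    for title, text in corpus.items():
        words = set(text.replace(";", " ").split())
        has_all = all(word in words for word in required)
        has_none = not any(word in words for word in forbidden)
        if has_all and has_none:
            hits.append(title)
    return hits


result = scan(plays, ["Brutus", "Caesar"], ["Calpurnia"])
print(result)
```

The cost grows with the total size of the corpus on every query, which is the "slow for large corpora" problem listed above.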
How might we speed up this type of query?
Indexing: for each word, keep track of which documents it occurs in
Term-document incidence matrix
            Antony and Cleopatra  Julius Caesar  The Tempest  Hamlet  Othello  Macbeth
Antony      1                     1              0            0       0        1
Brutus      1                     1              0            1       0        0
Caesar      1                     1              0            1       1        1
Calpurnia   0                     1              0            0       0        0
Cleopatra   1                     0              0            0       0        0
mercy       1                     0              1            1       1        1
worser      1                     0              1            1       1        0
1 if play contains word, 0 otherwise
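A matrix like this could be built along the following lines. This is a minimal sketch: the three tiny documents are hypothetical, tokenization is plain whitespace splitting, and `incidence_matrix` is an illustrative name.

```python
def incidence_matrix(docs):
    """docs: list of (title, text) pairs.

    Returns (terms, matrix), where matrix[i][j] is 1 if terms[i]
    occurs in docs[j] and 0 otherwise.
    """
    token_sets = [set(text.lower().split()) for _, text in docs]
    terms = sorted(set().union(*token_sets))
    matrix = [[1 if term in tokens else 0 for tokens in token_sets]
              for term in terms]
    return terms, matrix


# Hypothetical documents, just large enough to show the shape of the result.
docs = [
    ("d0", "Antony Brutus Caesar"),
    ("d1", "Brutus Caesar Calpurnia"),
    ("d2", "mercy worser"),
]
terms, matrix = incidence_matrix(docs)
row = dict(zip(terms, matrix))
print(row["brutus"])
```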
Incidence vectors
For each term, we have a 0/1 vector
Caesar = 110111
Brutus = 110100
Calpurnia = 010000
How can we get the answer from these vectors?
Bitwise AND the vectors together, using the complemented vector for all NOT queries
Caesar AND Brutus AND COMPLEMENT(Calpurnia)
110111 & 110100 & ~010000 =
110111 & 110100 & 101111 =
100100
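The same computation can be checked with Python integers used as bit vectors. Encoding each vector as an integer and the `mask` variable are illustrative choices. The mask is needed because `~` on a Python integer yields a negative number rather than a 6-bit complement.

```python
# Column order of the incidence matrix (leftmost bit = first play).
PLAYS = ["Antony and Cleopatra", "Julius Caesar", "The Tempest",
         "Hamlet", "Othello", "Macbeth"]

vectors = {
    "Caesar": 0b110111,
    "Brutus": 0b110100,
    "Calpurnia": 0b010000,
}

# Keep only the low 6 bits after complementing.
mask = (1 << len(PLAYS)) - 1

answer = (vectors["Caesar"]
          & vectors["Brutus"]
          & (~vectors["Calpurnia"] & mask))

bits = format(answer, "06b")
titles = [play for play, bit in zip(PLAYS, bits) if bit == "1"]
print(bits, titles)
```

The result is 100100, which selects Antony and Cleopatra and Hamlet.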
Answers to query
Antony and Cleopatra, Act III, Scene ii
Agrippa [Aside to DOMITIUS ENOBARBUS]: Why, Enobarbus,
When Antony found Julius Caesar dead,
He cried almost to roaring; and he wept
When at Philippi he found Brutus slain.
Hamlet, Act III, Scene ii
Lord Polonius: I did enact Julius Caesar: I was killed i' the Capitol; Brutus killed me.
Any problem with the incidence vector approach?
Bigger collections
Consider N = 1 million documents, each with about 1000 words
Say there are M = 500K distinct terms among these. How big is the incidence matrix?
• The matrix is a 500K by 1 million matrix = half a trillion 0’s and 1’s
• Even for a moderate sized data set we can’t store the matrix in memory
• Each vector has 1 million entries
• Bitwise operations become much more expensive
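The arithmetic behind "half a trillion" can be spelled out as follows. The storage figure assumes one bit per matrix cell and no compression, which is an assumption made here for illustration.

```python
N = 1_000_000   # documents
M = 500_000     # distinct terms

cells = M * N                    # one 0/1 entry per (term, document) pair
bytes_needed = cells // 8        # assumption: 1 bit per cell, no compression
gigabytes = bytes_needed / 10**9

print(cells, gigabytes)
```

Under that assumption the matrix needs about 62.5 GB, and each term's vector alone is 1 million bits (125,000 bytes).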
What does the matrix look like?
Consider N = 1 million documents, each with about 1000 words
Extremely sparse!
How many 1’s does the matrix contain?
• No more than one billion
• Each of the 1 million documents has at most 1000 1’s
• In practice, we’ll see that the number of unique words in a document is much less than this
What’s a better representation?
• Only record the 1 positions
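A short sketch of both points: the upper bound on the fraction of 1's, and what "only record the 1 positions" means for a single vector. Numbering the columns from 0 and the name `one_positions` are illustrative assumptions.

```python
N = 1_000_000          # documents
M = 500_000            # distinct terms
WORDS_PER_DOC = 1000

max_ones = N * WORDS_PER_DOC          # at most 1000 ones per document column
density = max_ones / (M * N)          # upper bound on the fraction of 1's


def one_positions(vector):
    """Return the indices holding a 1 (columns numbered from 0 here)."""
    return [doc_id for doc_id, bit in enumerate(vector) if bit == 1]


brutus = [1, 1, 0, 1, 0, 0]           # the Brutus row from the incidence matrix
print(max_ones, density, one_positions(brutus))
```

At most 0.2% of the cells are 1, so storing only the positions of the 1's saves most of the space.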
Inverted index
For each term, we store a list of all documents that contain it
What data structures might we use for this?
Brutus: 2 4 8 16 32 64 128 (docIDs)
• array
• linked list
• hashtable
• …
Inverted index representation: array
Brutus: 2 4 8 16 32 64 128
Pros
• Simple to implement
• No extra pointers required for data structure
• Contiguous memory
Cons
• How do we pick the size of the array?
• What if we want to add additional documents?
Inverted index representation: linked list
Brutus: 2 4 8 16 32 64 128
Pros
• Dynamic space allocation
• Insertion of new documents is straightforward
Cons
• Memory overhead of pointers
• Noncontiguous memory access
Inverted index representation: hashtable
Brutus: 2 4 8 16 32 64 128
Pros
• Search in constant time
• Contiguous memory
Cons
• How do we pick the size?
• What if we want to add additional documents?
• May have to rehash if we increase in size
• To get constant time operations, lots of unused slots/memory
Inverted index
Dictionary   Postings lists
Brutus       2 4 8 16 32 64 128
Calpurnia    1 2 3 5 8 13 21 34
Caesar       13 16
Posting: a single docID entry in a postings list
The most common approach is to use a linked list representation
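A minimal sketch of the dictionary and postings structure, using the postings as read from this slide's figure. It uses a Python `dict` for the dictionary and Python lists (dynamic arrays, not linked lists) for the postings, which is a swapped-in choice for brevity. `add_posting` is a hypothetical helper that keeps each list sorted and free of duplicates.

```python
import bisect

# Dictionary (term -> postings list), postings kept sorted by docID.
index = {
    "Brutus": [2, 4, 8, 16, 32, 64, 128],
    "Calpurnia": [1, 2, 3, 5, 8, 13, 21, 34],
    "Caesar": [13, 16],
}


def add_posting(index, term, doc_id):
    """Insert doc_id into term's postings, preserving sorted order."""
    postings = index.setdefault(term, [])
    pos = bisect.bisect_left(postings, doc_id)
    if pos == len(postings) or postings[pos] != doc_id:
        postings.insert(pos, doc_id)


add_posting(index, "Caesar", 14)   # new docID lands between 13 and 16
add_posting(index, "Caesar", 13)   # already present, no change
add_posting(index, "Antony", 7)    # new term gets a fresh postings list
print(index["Caesar"], index["Antony"])
```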
Inverted index construction
Documents to be indexed: Friends, Romans, countrymen.
Text preprocessing: friend , roman , countrymen .
Indexer
Inverted index:
friend       2 4
roman        1 2
countryman   13 16
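The pipeline (documents, text preprocessing, indexer, inverted index) might be sketched as below. The preprocessing here only lowercases and strips punctuation. It does no stemming, so it yields "friends" rather than "friend". The document IDs and the second and third documents are hypothetical.

```python
import re
from collections import defaultdict


def preprocess(text):
    """Lowercase and keep runs of letters (a stand-in for real preprocessing)."""
    return re.findall(r"[a-z]+", text.lower())


def build_index(docs):
    """docs: dict of docID -> text. Returns term -> sorted list of docIDs."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for term in preprocess(text):
            index[term].add(doc_id)
    return {term: sorted(doc_ids) for term, doc_ids in index.items()}


# Hypothetical collection; only the first text comes from the slide.
docs = {
    1: "Friends, Romans, countrymen.",
    2: "Romans and friends",
    4: "Friends again",
}
index = build_index(docs)
print(index["friends"], index["romans"], index["countrymen"])
```

Sorting each postings list at construction time is what later allows the linear-time merge.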
Boolean retrieval
In the Boolean retrieval model we ask a query that is a Boolean expression:
• A Boolean query uses AND, OR and NOT to join query terms
• Caesar AND Brutus AND NOT Calpurnia
• Pomona AND College
• (Mike OR Michael) AND Jordan AND NOT (Nike OR Gatorade)
Given only these operations, what types of questions can’t we answer?
• Phrases, e.g. “Pomona College”
• Proximity, “Michael” within 2 words of “Jordan”
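Boolean queries map directly onto set operations: AND is intersection, OR is union, and NOT is the complement with respect to the whole collection. The sketch below evaluates the Jordan query with Python sets. The docIDs and the collection of eight documents are hypothetical.

```python
# Hypothetical collection of 8 documents and hypothetical postings.
ALL_DOCS = set(range(1, 9))
postings = {
    "Mike": {1, 2},
    "Michael": {2, 3, 4},
    "Jordan": {2, 3, 5, 6},
    "Nike": {3},
    "Gatorade": {6, 7},
}


def complement(docs):
    """NOT: every document in the collection that is not in docs."""
    return ALL_DOCS - docs


# (Mike OR Michael) AND Jordan AND NOT (Nike OR Gatorade)
result = ((postings["Mike"] | postings["Michael"])
          & postings["Jordan"]
          & complement(postings["Nike"] | postings["Gatorade"]))
print(result)
```

Because the sets hold only docIDs and no word positions, nothing here can express a phrase or a proximity constraint.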
Boolean retrieval
Primary commercial retrieval tool for 3 decades
Professional searchers (e.g., lawyers) still like Boolean queries
Why?
• You know exactly what you’re getting: a query either matches or it doesn’t
• Through trial and error, can frequently fine-tune the query appropriately
• Don’t have to worry about underlying heuristics (e.g. PageRank, term weightings, synonyms, etc.)
Example: WestLaw http://www.westlaw.com/
• Largest commercial (paying subscribers) legal search service (started 1975; ranking added 1992)
• Tens of terabytes of data; 700,000 users
• Majority of users still use Boolean queries
• Example query: What is the statute of limitations in cases involving the federal tort claims act?
LIMIT! /3 STATUTE ACTION /S FEDERAL /2 TORT /3 CLAIM
LIMIT! = all words starting with “LIMIT”
/3 = within 3 words, /S = in same sentence
Example: WestLaw http://www.westlaw.com/
Another example query:
• Requirements for disabled people to be able to access a workplace
• disabl! /p access! /s work-site work-place (employment /3 place)
Long, precise queries; proximity operators; incrementally developed; not like web search
Professional searchers often like Boolean search: precision, transparency and control
But that doesn’t mean they actually work better…
Query processing: AND
What needs to happen to process: Brutus AND Caesar
• Locate Brutus and Caesar in the dictionary; retrieve postings lists
• “Merge” the two postings:
Brutus: 2 4 8 16 32 64 128
Caesar: 1 2 3 5 8 13 21 34
Brutus AND Caesar: 2 8
The merge
Walk through the two postings simultaneously
Brutus: 2 4 8 16 32 64 128
Caesar: 1 2 3 5 8 13 21 34
Brutus AND Caesar: 2 8
What assumption are we making about the postings lists?
For efficiency, when we construct the index, we ensure that the postings lists are sorted
What is the running time?
O(length1 + length2)
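The walk through two sorted postings lists can be sketched with two indices. The function name `intersect` is an illustrative choice. Each loop iteration advances at least one index, which is where the O(length1 + length2) bound comes from.

```python
def intersect(p1, p2):
    """AND of two postings lists; both must be sorted by docID."""
    answer = []
    i = j = 0
    while i < len(p1) and j < len(p2):
        if p1[i] == p2[j]:
            answer.append(p1[i])   # docID is in both lists
            i += 1
            j += 1
        elif p1[i] < p2[j]:
            i += 1                 # advance the list with the smaller docID
        else:
            j += 1
    return answer


brutus = [2, 4, 8, 16, 32, 64, 128]
caesar = [1, 2, 3, 5, 8, 13, 21, 34]
print(intersect(brutus, caesar))
```

On the slide's lists this returns [2, 8]. If the lists were not sorted, the two-pointer walk would miss matches.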
Boolean queries: More general merges
Which of the following queries can we still do in time O(length1 + length2)?
• Brutus AND NOT Caesar
• Brutus OR NOT Caesar
Sec. 1.3
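For the first query, one possible linear-time merge is sketched below. `and_not` is an illustrative name. The second query is different in kind: the result of Brutus OR NOT Caesar contains nearly every document that lacks Caesar, so its size scales with the collection rather than with the two postings lists.

```python
def and_not(p1, p2):
    """p1 AND NOT p2 for sorted postings lists, in O(len(p1) + len(p2))."""
    answer = []
    j = 0
    for doc_id in p1:
        # Skip entries of p2 that are smaller than the current docID.
        while j < len(p2) and p2[j] < doc_id:
            j += 1
        # Keep doc_id only if p2 does not contain it.
        if j == len(p2) or p2[j] != doc_id:
            answer.append(doc_id)
    return answer


brutus = [2, 4, 8, 16, 32, 64, 128]
caesar = [1, 2, 3, 5, 8, 13, 21, 34]
print(and_not(brutus, caesar))
```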
Merging
What about an arbitrary Boolean formula?
(Brutus OR Caesar) AND NOT (Antony OR Cleopatra)
x = (Brutus OR Caesar)
y = (Antony OR Cleopatra)
x AND NOT y
Is there an upper bound on the running time?
• O(total_terms * query_terms)
What about Brutus AND Calpurnia AND Caesar?
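The decomposition into x, y, and x AND NOT y can be sketched as follows. The postings are derived from the term-document incidence matrix earlier in the lecture, with the plays numbered 1 to 6 in column order (the numbering is an assumption). `difference` uses a set for brevity, where a linear merge would also work.

```python
def union(p1, p2):
    """OR of two sorted postings lists, without duplicates."""
    answer = []
    i = j = 0
    while i < len(p1) and j < len(p2):
        if p1[i] == p2[j]:
            answer.append(p1[i])
            i += 1
            j += 1
        elif p1[i] < p2[j]:
            answer.append(p1[i])
            i += 1
        else:
            answer.append(p2[j])
            j += 1
    answer.extend(p1[i:])   # at most one of these two tails is non-empty
    answer.extend(p2[j:])
    return answer


def difference(p1, p2):
    """p1 AND NOT p2 (set-based for brevity)."""
    excluded = set(p2)
    return [doc_id for doc_id in p1 if doc_id not in excluded]


# Plays numbered 1..6 in the column order of the incidence matrix (assumption).
postings = {
    "Brutus": [1, 2, 4],
    "Caesar": [1, 2, 4, 5, 6],
    "Antony": [1, 2, 6],
    "Cleopatra": [1],
}
x = union(postings["Brutus"], postings["Caesar"])
y = union(postings["Antony"], postings["Cleopatra"])
result = difference(x, y)
print(x, y, result)
```

With that numbering the result is [4, 5], i.e. Hamlet and Othello.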
Query optimization
Consider a query that is an AND of t terms.
For each of the terms, get its postings, then AND them together
What is the best order for query processing?
Query: Brutus AND Calpurnia AND Caesar
Brutus:    2 4 8 16 32 64 128
Calpurnia: 1 2 3 5 8 13 21 34
Caesar:    13 16
Query optimization example
Heuristic: process in order of increasing frequency:
• merge the two terms with the shortest postings lists
• this creates a new AND query with one fewer term
• repeat
Execute the query as (Caesar AND Brutus) AND Calpurnia.
Brutus:    2 4 8 16 32 64 128
Calpurnia: 1 2 3 5 8 13 21 34
Caesar:    13 16
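A sketch of the heuristic: sort the query terms by postings length, then intersect from shortest to longest. The postings are the ones read from this slide's figure. `and_query` is a hypothetical helper, and stopping early on an empty intermediate result is an added convenience rather than something the slides specify.

```python
def intersect(p1, p2):
    """Linear merge of two sorted postings lists."""
    answer, i, j = [], 0, 0
    while i < len(p1) and j < len(p2):
        if p1[i] == p2[j]:
            answer.append(p1[i])
            i += 1
            j += 1
        elif p1[i] < p2[j]:
            i += 1
        else:
            j += 1
    return answer


def and_query(index, terms):
    """AND all terms together, shortest postings list first."""
    ordered = sorted(terms, key=lambda term: len(index[term]))
    result = index[ordered[0]]
    for term in ordered[1:]:
        if not result:          # nothing left to match, stop early
            break
        result = intersect(result, index[term])
    return ordered, result


index = {
    "Brutus": [2, 4, 8, 16, 32, 64, 128],
    "Calpurnia": [1, 2, 3, 5, 8, 13, 21, 34],
    "Caesar": [13, 16],
}
order, result = and_query(index, ["Brutus", "Calpurnia", "Caesar"])
print(order, result)
```

The order comes out as Caesar, Brutus, Calpurnia, matching the execution plan on the slide. Intermediate results can never be longer than the shortest list merged so far, which is why starting small helps.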
Query optimization
Consider a query that is an OR of t terms.
What is the best order for query processing?
• Same: still want to merge the shortest postings lists first
Query: Brutus OR Calpurnia OR Caesar
Brutus:    2 4 8 16 32 64 128
Calpurnia: 1 2 3 5 8 13 21 34
Caesar:    13 16
Query optimization in general
(madding OR crowd) AND (ignoble OR NOT strife)
• Need to evaluate OR statements first
• Which OR should we do first?
• Estimate the size of each OR by the sum of the postings list lengths
• NOT is just the number of documents minus the length
• Then, it looks like an AND query: x AND y
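The size estimates can be sketched as below. The collection size `N` and all term frequencies are hypothetical numbers chosen only to illustrate the bookkeeping. Representing a literal as a `(term, negated)` pair is also an illustrative choice.

```python
N = 1000  # hypothetical number of documents in the collection

# Hypothetical postings list lengths.
freq = {"madding": 10, "crowd": 300, "ignoble": 5, "strife": 40}


def estimate(literal):
    """Estimated size of a term, or of NOT term (N minus the length)."""
    term, negated = literal
    return N - freq[term] if negated else freq[term]


def or_estimate(literals):
    """Estimate an OR group by the sum of its parts (an upper bound)."""
    return sum(estimate(literal) for literal in literals)


groups = {
    "madding OR crowd": [("madding", False), ("crowd", False)],
    "ignoble OR NOT strife": [("ignoble", False), ("strife", True)],
}
order = sorted(groups, key=lambda name: or_estimate(groups[name]))
print(order)
```

With these made-up numbers the first group is estimated at 310 and the second at 965, so the AND would start from (madding OR crowd).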
Exercise
Recommend a query processing order for
(tangerine OR NOT trees) AND (marmalade OR skies) AND (kaleidoscope OR eyes)
Term           Freq
eyes           213312
kaleidoscope   87009
marmalade      107913
skies          271658
tangerine      46653
trees          316812
Next steps…
• Phrases: Pomona College
• Proximity: Find Gates NEAR Microsoft. Need index to capture position information in docs. More later
• Zones in documents: Find documents with (author = Ullman) AND (text contains automata)
• Ranking search results: include occurrence frequency; weight different zones/features differently (e.g. title, header, link text, …)
• Incorporate link structure
Resources for today’s lecture
• Introduction to Information Retrieval, ch. 1
• Managing Gigabytes, Chapter 3.2
• Modern Information Retrieval, Chapter 8.2
• Shakespeare: http://www.rhymezone.com/shakespeare/ Try the neat browse by keyword sequence feature!