1
Indexing and Searching
Modern Information Retrieval
by R. Baeza-Yates and B. Ribeiro-Neto
Chapter 8
2
Outline
Inverted Files
Other Indices for Text
Sequential Searching
Pattern Matching
Compression
3
Inverted Files
An inverted file (or inverted index) is a word-oriented mechanism for indexing a text collection in order to speed up the searching task.
Structure: vocabulary and occurrences
Block addressing
  The text is divided into blocks, and the occurrences point to the blocks
Full inverted indices: exact occurrences
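
The two addressing schemes can be sketched as follows. This is a minimal illustration, not the chapter's implementation: the function name `build_inverted_index`, the `block_size` parameter, the sample sentence, and the use of word offsets (rather than character positions) are all assumptions made for the example.

```python
import re
from collections import defaultdict

def build_inverted_index(text, block_size=None):
    """Map each word of the vocabulary to its sorted list of occurrences.

    block_size=None -> full inverted index: exact word offsets (0-based).
    block_size=k    -> block addressing: numbers of the k-word blocks
                       that contain the word (assumed block layout).
    """
    words = re.findall(r"\w+", text.lower())
    occurrences = defaultdict(list)
    for pos, word in enumerate(words):
        entry = pos if block_size is None else pos // block_size
        postings = occurrences[word]
        # With block addressing, a block is recorded only once per word.
        if not postings or postings[-1] != entry:
            postings.append(entry)
    # Keep the vocabulary in lexicographical order.
    return dict(sorted(occurrences.items()))

sample = "this is a text a text has many words words are made from letters"
full_index = build_inverted_index(sample)
block_index = build_inverted_index(sample, block_size=4)
```

With block addressing the lists get shorter (both occurrences of "words" fall in one block), at the cost of having to scan the block to find the exact position.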
4
5
6
Inverted Files
The search algorithm on an inverted index
  Vocabulary search
  Retrieval of occurrences
  Manipulation of occurrences
Construction (split the index into two files)
  Posting file: the lists of occurrences are stored contiguously
  The vocabulary is stored in lexicographical order and points to its list.
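
A rough sketch of the two-file layout and the three search steps, assuming the occurrences are document numbers. The names `build_files`, `lookup`, and `search_and`, and the `(word, start, length)` vocabulary entries, are hypothetical and chosen only for this example.

```python
from bisect import bisect_left

def build_files(index):
    """Split an in-memory index into a sorted vocabulary and a posting file.

    Each vocabulary entry is (word, start, length), pointing at the
    contiguous slice of the posting file that holds the word's list.
    """
    vocabulary, postings = [], []
    for word in sorted(index):
        vocabulary.append((word, len(postings), len(index[word])))
        postings.extend(index[word])
    return vocabulary, postings

def lookup(vocabulary, postings, word):
    """Vocabulary search (binary search) plus retrieval of occurrences."""
    words = [entry[0] for entry in vocabulary]  # could be precomputed once
    i = bisect_left(words, word)
    if i == len(words) or words[i] != word:
        return []
    _, start, length = vocabulary[i]
    return postings[start:start + length]

def search_and(vocabulary, postings, query_words):
    """Manipulation of occurrences: intersect the lists of all query words."""
    result = None
    for word in query_words:
        occ = set(lookup(vocabulary, postings, word))
        result = occ if result is None else result & occ
    return sorted(result or [])

index = {"text": [1, 2, 4], "index": [2, 3, 4], "search": [4]}
vocabulary, postings = build_files(index)
```

Here `search_and(vocabulary, postings, ["text", "index"])` keeps only the documents that appear in both lists.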
7
8
Inverted Files
For large texts
  Partial index
  Merging two indices consists of merging the sorted vocabularies and, whenever the same word appears in both, merging its lists of occurrences.
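
The merge of two partial indices might look like the sketch below. It assumes each partial index is a list of `(word, occurrences)` pairs sorted by word, that the second index covers text lying after the first, and that the lists of a shared word are simply concatenated. The name `merge_indices` is hypothetical.

```python
def merge_indices(index_a, index_b):
    """Merge two partial indices given as word-sorted (word, occurrences) lists."""
    merged, i, j = [], 0, 0
    while i < len(index_a) and j < len(index_b):
        word_a, occ_a = index_a[i]
        word_b, occ_b = index_b[j]
        if word_a < word_b:
            merged.append((word_a, list(occ_a)))
            i += 1
        elif word_b < word_a:
            merged.append((word_b, list(occ_b)))
            j += 1
        else:
            # Same word in both vocabularies: join the occurrence lists.
            merged.append((word_a, list(occ_a) + list(occ_b)))
            i += 1
            j += 1
    # Copy whatever remains of the longer vocabulary.
    merged.extend((w, list(o)) for w, o in index_a[i:])
    merged.extend((w, list(o)) for w, o in index_b[j:])
    return merged

partial_1 = [("index", [2]), ("text", [1, 5])]
partial_2 = [("search", [9]), ("text", [8])]
merged_index = merge_indices(partial_1, partial_2)
```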
9
10
Other Indices for Text
Suffix Trees
Suffix Arrays
Signature Files
11
Suffix Trees and Suffix Arrays
Each position in the text is considered as a text suffix
Index points are selected from the text; they point to the beginning of the text positions that will be retrievable
12
13
Suffix arrays
The main drawback of suffix arrays is their costly construction process.
They allow binary searches, done by comparing the contents of each pointer.
Supra-indices (for large suffix arrays)
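
A small sketch of a suffix array and its binary search. The construction shown is a naive sort of the suffixes, used only to keep the example short; it is not the construction algorithm for large texts. The names `build_suffix_array` and `suffix_array_search` and the optional `index_points` argument are assumptions.

```python
def build_suffix_array(text, index_points=None):
    """Return the index points sorted by the suffix that starts at each one.

    Naive construction by sorting; by default every position is an index point.
    """
    points = range(len(text)) if index_points is None else index_points
    return sorted(points, key=lambda p: text[p:])

def suffix_array_search(text, suffix_array, pattern):
    """Find all index points whose suffix starts with `pattern`.

    Two binary searches locate the range of matching pointers; each step
    compares the pattern with the text the pointer refers to.
    """
    m = len(pattern)

    def prefix(k):
        start = suffix_array[k]
        return text[start:start + m]

    lo, hi = 0, len(suffix_array)
    while lo < hi:                      # first suffix with prefix >= pattern
        mid = (lo + hi) // 2
        if prefix(mid) < pattern:
            lo = mid + 1
        else:
            hi = mid
    first = lo
    hi = len(suffix_array)
    while lo < hi:                      # first suffix with prefix > pattern
        mid = (lo + hi) // 2
        if prefix(mid) <= pattern:
            lo = mid + 1
        else:
            hi = mid
    return sorted(suffix_array[first:lo])

text = "abracadabra"
suffix_array = build_suffix_array(text)
```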
14
15
16
Construction of Suffix Arrays for Large Texts
17
Signature Files
Word-oriented index structures based on hashing
Maps words to bit masks of B bits
Divides the text into blocks of b words each
The mask of a block is obtained by bitwise ORing the signatures of all the words in the text block.
Hash the query to a bit mask W
If W & Bi = W, where Bi is the mask of block i, the text block may contain the word
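
The scheme can be sketched as below. The symbols B, b, W, and Bi follow the slide; the hash function (SHA-256), the value B = 64, and setting three bits per word are assumptions picked for the illustration, as are the function names.

```python
import hashlib

B = 64             # bits per mask (assumed value)
BITS_PER_WORD = 3  # bits set by each word signature (assumed value)

def word_signature(word):
    """Hash a word to a B-bit mask with up to BITS_PER_WORD bits set."""
    digest = hashlib.sha256(word.encode("utf-8")).digest()
    signature = 0
    for k in range(BITS_PER_WORD):
        signature |= 1 << (digest[k] % B)
    return signature

def build_signature_file(words, b):
    """OR together the signatures of the words in each block of b words."""
    masks = []
    for start in range(0, len(words), b):
        mask = 0
        for word in words[start:start + b]:
            mask |= word_signature(word)
        masks.append(mask)
    return masks

def candidate_blocks(masks, query_word):
    """Return the blocks whose mask Bi satisfies W & Bi == W."""
    W = word_signature(query_word)
    return [i for i, Bi in enumerate(masks) if (W & Bi) == W]

words = "this is a text a text has many words words are made from letters".split()
masks = build_signature_file(words, b=4)
```

A block returned by `candidate_blocks` only may contain the word: different words can set the same bits, so each candidate still has to be checked against the text.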
18
19
Sequential Searching
Brute Force
Knuth-Morris-Pratt
Boyer-Moore Family
Shift-Or
Suffix Automaton
  Backward DAWG matching (BDM)
  BNDM
20
Knuth-Morris-Pratt
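
A minimal sketch of Knuth-Morris-Pratt in its usual textbook form, offered as an illustration rather than as the chapter's own formulation. The names `failure_function` and `kmp_search` are assumptions.

```python
def failure_function(pattern):
    """fail[i] = length of the longest proper prefix of pattern[:i+1]
    that is also a suffix of it."""
    fail = [0] * len(pattern)
    k = 0
    for i in range(1, len(pattern)):
        while k > 0 and pattern[i] != pattern[k]:
            k = fail[k - 1]
        if pattern[i] == pattern[k]:
            k += 1
        fail[i] = k
    return fail

def kmp_search(text, pattern):
    """Return the start positions of all occurrences of pattern in text."""
    if not pattern:
        return []
    fail = failure_function(pattern)
    matches, k = [], 0
    for i, c in enumerate(text):
        # On a mismatch, fall back to the longest border instead of
        # re-reading text characters.
        while k > 0 and c != pattern[k]:
            k = fail[k - 1]
        if c == pattern[k]:
            k += 1
        if k == len(pattern):
            matches.append(i - k + 1)
            k = fail[k - 1]
    return matches
```

The text pointer never moves backwards, which is what distinguishes this from the brute-force scan.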
21
Boyer-Moore Family
22
Shift-Or
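
A minimal sketch of Shift-Or using Python integers as bit vectors, in the standard formulation where a 0 bit marks a matching prefix. The function name `shift_or_search` is an assumption, and a real implementation would restrict the pattern to the machine word size.

```python
def shift_or_search(text, pattern):
    """Return the start positions of all occurrences of pattern in text."""
    m = len(pattern)
    if m == 0:
        return []
    all_ones = (1 << m) - 1
    # masks[c] has bit i equal to 0 exactly when pattern[i] == c.
    masks = {}
    for i, c in enumerate(pattern):
        masks[c] = masks.get(c, all_ones) & ~(1 << i)
    # Bit i of state is 0 when pattern[:i+1] matches the text ending here.
    state = all_ones
    matches = []
    for j, c in enumerate(text):
        state = ((state << 1) | masks.get(c, all_ones)) & all_ones
        if state & (1 << (m - 1)) == 0:
            matches.append(j - m + 1)
    return matches
```

Each text character costs one shift and one OR, regardless of the pattern, as long as the pattern fits in a machine word.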
23
Suffix Automaton
24
25
Pattern Matching
Searching allowing errors
  Dynamic Programming
  Automaton
Regular Expressions and Extended patterns
Pattern Matching Using Indices
  Inverted files
  Suffix Trees and Suffix Arrays
26
Dynamic Programming
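
A sketch of the classical dynamic programming approach to searching allowing errors, assuming edit distance (insertions, deletions, substitutions) as the error model. Only one column of the matrix is kept. The name `approximate_search` and the convention of reporting end positions are assumptions.

```python
def approximate_search(text, pattern, k):
    """Return the text positions where an occurrence of pattern
    with at most k errors ends."""
    m = len(pattern)
    # column[i] = fewest errors to match pattern[:i] against some
    # substring of the text ending at the current position.
    column = list(range(m + 1))
    ends = []
    for j, c in enumerate(text):
        prev_diag = column[0]  # column[0] stays 0: a match may start anywhere
        for i in range(1, m + 1):
            old = column[i]
            if pattern[i - 1] == c:
                column[i] = prev_diag
            else:
                column[i] = 1 + min(old, column[i - 1], prev_diag)
            prev_diag = old
        if column[m] <= k:
            ends.append(j)
    return ends
```

For example, searching "survey" in "surgery" with k = 2 reports the end positions 4, 5, and 6. The cost is proportional to the pattern length times the text length.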
27
Automaton
28
Regular Expressions
29
Pattern Matching Using Indices
Inverted Files
  Queries such as suffix or substring queries, searching allowing errors, and regular expressions are solved by a sequential search
  The restriction: approximate matches or regular expressions that span many words cannot be found.
30
Pattern Matching Using Indices
Suffix Trees
  Suffix trees are able to perform complex searches
    Word, prefix, suffix, substring, and range queries
    Regular expressions
    Unrestricted approximate string matching
  Useful in specific areas
    Find the longest substring
    Find the most common substring of a fixed size
31
Pattern Matching Using Indices
Suffix Arrays
  Some patterns can be searched directly in the suffix array without simulating the suffix tree
    Word, prefix, suffix, subword search and range search
32
Compression
Compressed text--Huffman coding
  Taking words as symbols
  Use an alphabet of bytes instead of bits
Compressed indices
  Inverted Files
  Suffix Trees and Suffix Arrays
  Signature Files
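
Huffman coding with words as symbols can be sketched as follows. For brevity the sketch builds an ordinary binary (bit-oriented) code; the byte-oriented variant mentioned above would use a tree of much higher arity and is not shown. The names `word_huffman_codes`, `encode`, and `decode` are assumptions.

```python
import heapq
from collections import Counter
from itertools import count

def word_huffman_codes(words):
    """Build a binary Huffman code whose symbols are whole words."""
    freq = Counter(words)
    if len(freq) == 1:
        return {next(iter(freq)): "0"}
    tiebreak = count()  # keeps heap entries comparable when frequencies tie
    heap = [(f, next(tiebreak), {w: ""}) for w, f in freq.items()]
    heapq.heapify(heap)
    while len(heap) > 1:
        f1, _, codes1 = heapq.heappop(heap)
        f2, _, codes2 = heapq.heappop(heap)
        # Joining two subtrees prepends one bit to every code inside them.
        merged = {w: "0" + c for w, c in codes1.items()}
        merged.update({w: "1" + c for w, c in codes2.items()})
        heapq.heappush(heap, (f1 + f2, next(tiebreak), merged))
    return heap[0][2]

def encode(words, codes):
    return "".join(codes[w] for w in words)

def decode(bits, codes):
    by_code = {c: w for w, c in codes.items()}
    words, current = [], ""
    for bit in bits:
        current += bit
        if current in by_code:  # the code is prefix-free, so this is safe
            words.append(by_code[current])
            current = ""
    return words

words = "for each rose a rose is a rose".split()
codes = word_huffman_codes(words)
```

Frequent words such as "rose" receive codes no longer than those of rare words, and `decode(encode(words, codes), codes)` restores the original word sequence.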