Indexing
Temporal Information Retrieval
Index Organization, Construction, Temporal Query Processing
Avishek Anand
✤ Inverted Indexing basics revisited
✤ Indexing Static Collections
✤ Dictionaries
✤ Forward Index
✤ Inverted Index Organisation
✤ Scalable Indexing
✤ Indexing Temporal Collections
Inverted Index
2
✤ Why do we index text collections ?
✤ How do we index documents ?
✤ What are the data structures ?
✤ What are the design decisions for organising the index?
✤ How do we index huge collections ?
✤ How do we index temporal collections ?
Text Collections and Indexing
3
Efficient document retrieval
lexicon, inverted lists
document order, score order
distributed indexing, term/doc partitioning
index maintenance strategies
Terminology Recap
4
information retrieval
Lexicon
queries, results
terms, documents, collection
index, lexicon, posting, posting list
stemming, stop-word removal
(d1, 2, <2, 15>)
(d4, 3, <2, 15, 23>)
(d34, 3, <2, 15, 23>)
…….
(d1, 2, <3, 16>)
(d4, 2, <16, 23>)
(d31, 3, <8, 19, 30>)
…….
✤ Maintains statistics and information about the indexed unit (word, n-gram, etc.)
✤ Posting list location - for posting list retrieval
✤ Term identifier - for term lookups, matching and range queries
✤ document frequency and associated statistics - for ranking
✤ Data Structures for Lexicon
✤ Hash-based Lexicon
✤ B+-Tree based Lexicon
Lexicon or Dictionary
5
< hannover ; location: 82271; tid:12 ; df:23, … >
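As a minimal sketch, the lexicon entry above can be modelled as a small record. The field names (`location`, `tid`, `df`) mirror the example entry; the exact layout is illustrative, not a prescribed format.

```python
from dataclasses import dataclass

@dataclass
class LexiconEntry:
    """One dictionary entry per indexed unit (field names are illustrative)."""
    term: str      # the indexed unit (word, n-gram, ...)
    location: int  # byte offset of the term's posting list in the index file
    tid: int       # term identifier, used for lookups and matching
    df: int        # document frequency and similar statistics, used for ranking

# The slide's example entry for "hannover":
entry = LexiconEntry(term="hannover", location=82271, tid=12, df=23)
```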
Hash-Based Lexicon
6
Information Retrieval: Implementing and Evaluating Search Engines · © MIT Press, 2010 · DRAFT
[Figure 4.2 omitted: mis-encoded in extraction; it shows a hash table whose buckets chain term descriptors.]
Figure 4.2 Dictionary data structure based on a hash table with 2^10 = 1024 entries (data extracted from schema-independent index for TREC45). Terms with the same hash value are arranged in a linked list (chaining). Each term descriptor contains the term itself, the position of the term’s postings list, and a pointer to the next entry in the linked list.
in GOV2 is 9.2 bytes. Storing each term in a fixed-size memory region of 20 bytes wastes 10.8 bytes per term on average (internal fragmentation).
One way to eliminate the internal fragmentation is to not store the index terms themselves in the array, but only pointers to them. For example, the search engine could maintain a primary dictionary array, containing 32-bit pointers into a secondary array. The secondary array then contains the actual dictionary entries, consisting of the terms themselves and the corresponding pointers into the postings file. This way of organizing the search engine’s dictionary data is shown in Figure 4.3. It is sometimes referred to as the dictionary-as-a-string approach, because there are no explicit delimiters between two consecutive dictionary entries; the secondary array can be thought of as a long, uninterrupted string.
For the GOV2 collection, the dictionary-as-a-string approach, compared to the dictionary layout shown in Figure 4.1, reduces the dictionary’s storage requirements by 10.8 − 4 = 6.8 bytes per entry. Here the term 4 stems from the pointer overhead in the primary array; the term 10.8 corresponds to the complete elimination of any internal fragmentation.
It is worth pointing out that the term strings stored in the secondary array do not require an explicit termination symbol (e.g., the “\0” character), because the length of each term in the dictionary is implicitly given by the pointers in the primary array. For example, by looking at the pointers for “shakespeare” and “shakespearean” in Figure 4.3, we know that the dictionary entry for “shakespeare” requires 16629970 − 16629951 = 19 bytes in total: 11 bytes for the term plus 8 bytes for the 64-bit file pointer into the postings file.
✤ Constant-time lookups based on a hash table
✤ Entire lexicon is loaded into memory
✤ Updates are difficult
✤ Range searches, matching, and substring queries are not supported
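A minimal sketch of a hash-based lexicon, using a Python dict in the role of the chained hash table from Figure 4.2 (the `shakespeare` descriptor values are illustrative). Single-term lookup is constant time, but anything range-like must scan every key, which is the weakness noted above:

```python
# Hash-based lexicon: term -> descriptor, O(1) expected lookup time.
lexicon = {
    "hannover": {"location": 82271, "tid": 12, "df": 23},
    "shakespeare": {"location": 16629951, "tid": 7, "df": 412},  # values illustrative
}

def lookup(term):
    """Constant-time single-term lookup; returns None for unseen terms."""
    return lexicon.get(term)

def prefix_match(prefix):
    """Prefix/range queries degenerate to a scan over the whole lexicon."""
    return sorted(t for t in lexicon if t.startswith(prefix))  # O(|lexicon|)
```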
✤ B+-Tree: Leaf nodes additionally linked for efficient range search
✤ Supports lookups in O(log n) and range searches in O(log n + k)
✤ Vocabulary dynamics (i.e., new or removed terms) pose no problem
✤ Works on secondary storage
B+-Tree or Sort-based Lexicon
7
[Figure: B+-tree of order m = 3 over the lexicon; internal nodes route by key ranges such as [a-i], [j-z], and linked leaf entries hold term records such as [aardvark, tid:3, df:3, …] and [aalborg, tid:7, df:2, …].]
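As a toy stand-in for the B+-tree leaf level (a real lexicon keeps the tree on secondary storage), a sorted array shows the O(log n + k) range search: binary search for the boundaries, then a scan of the k matches.

```python
import bisect

# Sort-based lexicon: terms kept in order, so a range search is two binary
# searches plus a scan of the k matching terms: O(log n + k).
terms = sorted(["aalborg", "aardvark", "abacus", "zebra"])

def range_search(lo, hi):
    """All terms t with lo <= t < hi."""
    i = bisect.bisect_left(terms, lo)   # first term >= lo
    j = bisect.bisect_left(terms, hi)   # first term >= hi
    return terms[i:j]
```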
• Maps doc-ids to term-ids, in document order
• Efficient retrieval of terms from (already parsed) text
• snippet generation
• proximity features for proximity-aware ranking
• per-doc term distribution for query expansion
Forward Index
8
1: “what does the fox say ?”
1: <124, 53, 1, 49935, 100>
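A minimal forward-index sketch: each doc-id maps to its term-ids in document order. The term-ids here are assigned on first sight and differ from the slide's example numbers.

```python
# Forward index: doc-id -> sequence of term-ids, in document order.
term_ids = {}  # term -> term-id, assigned on first sight

def to_term_ids(tokens):
    return [term_ids.setdefault(t, len(term_ids) + 1) for t in tokens]

forward_index = {1: to_term_ids("what does the fox say ?".split())}
```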
✤ Inverted index is a collection of posting lists
✤ Posting contains document identifiers (as integers) along with scores (integers or doubles) and possibly positions (as integers)
✤ Postings list can be organised according to
✤ document identifiers - document ordering
✤ scores - Impact ordering
✤ What are the merits of these orderings ?
Inverted Index
9
information
(d1 , 2, <2, 15>)
(d4, 5, <2, 15, 23>)
(d34 , 3, <2, 15, 23>)
…….
Document Ordering
✤ Based on faster intersections
✤ High compression of the index using gap encoding of doc-ids
✤ Easily updatable
Score/Impact Ordering
✤ Based on processing top-k results fast
✤ Low compression ratio
✤ Difficult to update
Index Organisation
10
Index organisation depends on query processing style.
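Gap encoding, mentioned above for document-ordered lists, can be sketched in a few lines: sorted doc-ids are replaced by their differences, which are small and compress well under variable-byte or similar codes.

```python
def gap_encode(doc_ids):
    """Replace sorted doc-ids by the gaps between consecutive ones."""
    return [doc_ids[0]] + [b - a for a, b in zip(doc_ids, doc_ids[1:])]

def gap_decode(gaps):
    """Prefix sums restore the original doc-ids."""
    out, total = [], 0
    for g in gaps:
        total += g
        out.append(total)
    return out
```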
✤ We are given a set of documents D, where each document d is considered as a bag of terms
✤ Inverted lists are created by a process termed inversion
✤ Memory-based Inversion
✤ Takes place entirely in-memory
✤ For small collections, where the index + lexicon fits in memory
✤ Disk-based Inversion
✤ Sort-based inversion vs Merge-based inversion
Inverted Index Construction
11
✤ A dictionary is required that allows efficient single-term lookup and insertion operations
✤ An extensible (i.e., dynamic) list data structure is needed to store the postings for each term
Memory-based Inversion
12
doc: [term, positions]
1: “what does the fox say ?” → [the, <3>] [fox, <4>] ….
2: “the fox jumped over the fence” → [the, <1,5>] [fox, <2>] ….
dictionary: [term, posting list], e.g. “the”: [1, <3>] [2, <1,5>] ….
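The memory-based inversion above can be sketched directly: a dict plays the dictionary, and an extensible list per term accumulates `(doc-id, positions)` postings.

```python
from collections import defaultdict

def invert(docs):
    """In-memory inversion: dict as the dictionary, extensible lists as the
    per-term posting lists, appended in doc-id order."""
    index = defaultdict(list)
    for doc_id, text in sorted(docs.items()):
        positions = defaultdict(list)
        for pos, term in enumerate(text.split(), start=1):
            positions[term].append(pos)
        for term, pos_list in positions.items():
            index[term].append((doc_id, pos_list))
    return index

index = invert({1: "what does the fox say ?",
                2: "the fox jumped over the fence"})
```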
✤ Input Collection D >> memory size M
✤ Inversion can be seen as a sort operation on the term identifiers
✤ This method is based on an external sort over data that does not fit into memory
✤ Read data of size M into memory, sort it, and write it back to disk
✤ Multiway merge of the D/M sorted runs to create the index
✤ Shortcomings
✤ Dictionary might not fit in-memory
✤ Large memory requirements due to intermediate data
Sort-based Inversion
13
✤ What is the estimated cost of sort-based inversion in terms of N, M, and c ?
✤ How does the cost compare with in-memory sort-based inversion (assuming we had enough memory, i.e., N ≤ M) ?
Exercise 1: Analysis of Sort-based Inversion
14
Total number of postings = N
Number of postings which fit in memory = M
Cost of disk read/write of a posting = c
Simple Computational Model
✤ Generalisation of in-memory indexing
✤ Reads the input collection to create an in-memory index of size M and writes it to disk, producing partial indexes with local lexicons
✤ Posting lists in the partial indexes can be compressed
✤ Multiway Merge of corresponding lists from the partial indexes to create one consolidated index
Merge-based Inversion
15
partial indexes of size M
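The multiway merge of partial indexes can be sketched with `heapq.merge`: each partial index is a stream of `(term, posting)` pairs sorted by term, and postings for equal terms are concatenated in run order.

```python
import heapq

# Two partial indexes as term-sorted streams of (term, posting) pairs.
partial1 = [("fox", (1, [4])), ("the", (1, [3]))]
partial2 = [("fox", (2, [2])), ("the", (2, [1, 5]))]

def merge_partials(*runs):
    """Multiway merge of sorted runs into one consolidated index; heapq.merge
    is stable, so postings from earlier runs keep their relative order."""
    merged = {}
    for term, posting in heapq.merge(*runs, key=lambda entry: entry[0]):
        merged.setdefault(term, []).append(posting)
    return merged
```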
✤ Programming paradigm for distributed data processing
✤ Improves overall throughput by parallelising loading of data
✤ Data is partitioned across the nodes, which process it in the following phases
✤ Map: generates (key, value) pairs
✤ Shuffle: shuffles the pairs over the network to the reducers
✤ Reduce: operates on all values for the same key
Map-Reduce crash course
16
Map-Reduce Example : Word Count
17
1: “what does the fox say ?”    2: “the fox jumped over the fence”
Mapper-1 emits: what:1, does:1, the:1, fox:1, say:1
Mapper-2 emits: jumped:1, over:1, the:2, fox:1, fence:1
Shuffle + Sort
Reducer-1 receives: what:1, does:1, the:1, the:2, fox:1, fox:1, say:1, jumped:1, over:1
Reducer-2 receives: fence:1
mappers emit <word, freq>; reducers aggregate the frequencies per word
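The word-count example can be simulated in a few lines, with the three phases as plain functions (a single-process sketch, not a distributed runtime):

```python
from itertools import groupby

docs = ["what does the fox say ?", "the fox jumped over the fence"]

def map_phase(doc):            # Map: emit (word, 1) pairs
    return [(w, 1) for w in doc.split()]

def shuffle(pairs):            # Shuffle: group all values by key
    pairs.sort(key=lambda kv: kv[0])
    return {k: [v for _, v in grp]
            for k, grp in groupby(pairs, key=lambda kv: kv[0])}

def reduce_phase(grouped):     # Reduce: aggregate the values per key
    return {word: sum(ones) for word, ones in grouped.items()}

counts = reduce_phase(shuffle([kv for d in docs for kv in map_phase(d)]))
```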
✤ How would you build the inverted index using Map-reduce ?
✤ What are the key-value pairs as defined by the Mapper ?
✤ What does the reducer do with the values of the same key ?
Exercise 2: Index Construction using Map-Reduce
18
✤ Temporal collections have temporal information
✤ Publication times - News Articles
✤ Valid times - Wikipedia articles, Web archive versions
✤ Temporal references - time mentions in text
✤ Time-travel Queries : Retrieve all documents relevant to the text and time
✤ Point in time queries
✤ Time-interval queries
Temporal Collections and Queries
19
game of thrones @ 2011 - 2014
house of cards @ 02/03/2013
✤ Given a versioned collection of documents with valid time intervals
✤ and a time-travel text query:
✤ We want to retrieve documents containing the terms “game” and “thrones” and valid between 2011 and 2014
Temporal Indexing
20
How do we efficiently retrieve these documents ?
game of thrones @ 2011 - 2014
Time-Travel Index
21
• Inverted index processes keyword queries
• Intersection of posting lists for processing queries
• Versions have valid time intervals
• Augment postings with valid time intervals
• Post-filtering after standard query processing
Lexicon: intervention, nato
intervention: (d1, [t1, t9)), (d2, [t2, t8)), (d3, [t3, t6)), (d4, [t4, t6)), (d5, [t5, t7))
[Figure: documents d1–d5 drawn as intervals on a timeline from t1 to t9; postings overlapping the query time are labelled “overlaps”, the rest “no overlap”.]
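Post-filtering after standard query processing reduces to an interval-overlap test per posting. A sketch with the half-open intervals from the slides, using integers 1–9 in place of t1–t9:

```python
def overlaps(interval, query):
    """Half-open intervals [b, e) overlap iff each begins before the other ends."""
    (b1, e1), (b2, e2) = interval, query
    return b1 < e2 and b2 < e1

postings = [("d1", (1, 9)), ("d2", (2, 8)), ("d3", (3, 6)),
            ("d4", (4, 6)), ("d5", (5, 7))]

def post_filter(postings, query):
    """Keep only documents whose valid-time interval overlaps the query interval."""
    return [d for d, iv in postings if overlaps(iv, query)]
```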
Challenges in Indexing
22
• If documents are points in time how would you organize the index ?
• What is the problem if the documents are associated with time intervals ?
• Query processing expensive due to wasted accesses
intervention: (d1, [t1, t9)), (d2, [t2, t8)), (d3, [t3, t6)), (d4, [t4, t6)), (d5, [t5, t7))
[Figure: lexicon with terms intervention and nato; three postings overlap the query interval, two do not; documents d1, d4, d8 shown on the nato list.]
Data and Query Model
23
• Data Model: Each document is associated with a time interval
• Each interval in the posting list represents a document (its temporal distribution)
• Query Model: Queries are associated with a time interval [tb, te]
• point-in-time queries: when begin time = end time
• time-interval queries
(d5, [t5, t7))
(d1, [t1, t9))
(d2, [t2, t8))
(d3, [t3, t6))
(d4, [t4, t6))
[Figure: the postings list above shown against the query interval.]
Challenges in Indexing Time
24
• We want to avoid unwanted or wasted accesses to posting lists
• Typically access only those postings that are relevant, or a few more (bounded loss)
• Dealing with time points is easy, akin to range queries (sort according to begin time and do a range search)
(d5, [t5, t7))
(d1, [t1, t9))
(d2, [t2, t8))
(d3, [t3, t6))
(d4, [t4, t6))
[Figure: long posting lists, with documents d1–d5 shown as intervals against the query interval.]
Index List Partitioning
25
• Vertically partition the temporal space; postings are assigned to every partition they overlap
• Now multiple posting lists per term, each with a valid time interval
• Limits index access, but introduces replication
Time-travel queries
[Figure: timeline t1, t2, t4, t7, t9, t11, t13, t16 with documents d1–d5 as intervals.]
McCain: (d1, [t1, t2)), (d2, [t4, t9)), (d3, [t7, t13)), (d4, [t9, t11)), (d5, [t11, t16))
Index Partitioning
McCain [t3, t8): (d2, [t4, t9)), (d3, [t7, t13))
McCain [t8, t12): (d2, [t4, t9)), (d3, [t7, t13)), (d4, [t9, t11)), (d5, [t11, t16))
Each version is a new entry, hence the list grows long: partitioning is needed.
Vertical Partitioning - Query Processing
26
• Dictionary or Lexicon should contain partitioning information
• For each temporal query, select a subset of affected partitions and only read them
• Filter out postings which do not overlap with the query time interval
time interval query
“hannover”
term       partition    offset
hannover   [t1, t5)     12646
hannover   [t5, t7)     12673
hannover   [t7, t25)    13446
hannover   [t25, t43)   15324
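Selecting the affected partitions for a time-interval query is again an overlap test, now against the partition table. A sketch using the offsets from the table above, with integers standing in for t1, t5, etc.:

```python
# Partition table: (term, [begin, end), offset of the partition's posting list).
partitions = [("hannover", (1, 5), 12646),
              ("hannover", (5, 7), 12673),
              ("hannover", (7, 25), 13446),
              ("hannover", (25, 43), 15324)]

def affected_partitions(term, qb, qe):
    """Offsets of the term's partitions whose interval overlaps [qb, qe)."""
    return [off for t, (b, e), off in partitions
            if t == term and b < qe and qb < e]
```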
Optimal Approaches
27
• Performance Optimal Approach
• Keeps one posting list for every elementary time interval
• Achieves optimal performance but large space overhead
• Space Optimal Approach
• No replication of postings, no blowup, sub-optimal performance
3.7 Partitioning Strategies
3.7.1 Performance-Optimal Approach
As discussed in Section 3.5, the performance when processing a time-point query t on a TTIX instance is influenced adversely by the wasted I/O due to read but filtered-out postings. Temporal coalescing implicitly addresses this problem by reducing the number of postings and the space consumption of the posting lists scanned, but still a significant overhead remains. We now tackle this problem and describe temporal partitioning strategies geared at time-point queries that determine for each term v a set of posting lists that should be kept in the index.
[Figure 3.6 omitted: timeline t1 … t10 with documents d1, d2, d3 and their postings numbered 1–10.]
Figure 3.6: Partitioning illustrated
We illustrate the trade-offs of temporal partitioning for time-point queries using the example given in Figure 3.6. The figure shows a total of ten postings belonging to term v and three different documents d1, d2, and d3. For ease of description, we have numbered the boundaries of valid-time intervals in increasing time-order, as t1, . . . , t10, and numbered the postings themselves as 1, . . . , 10. Now, consider that we want to process a time-point query with t ∈ [t1, t2). Only three postings (namely, 1, 5, and 8) are valid at time t and therefore required to process the query. In the worst case, if only a single list Lv : [t1, t10) is contained in our index, we have to read all ten postings. In the best case, if the list Lv : [t1, t2) is contained in the index, we achieve the optimal query-processing performance, reading only the three postings required to answer the query.
This last observation suggests one strategy to eliminate the problem of filtered-out postings entirely. By choosing
Pv = Ev (3.31)
and thus keeping a posting list for every elementary time interval, for any query time-point t only the postings valid at that time are read, so that the optimal performance is achieved.
Partitioning Strategies
28
• Given an input sequence of intervals how do we partition them into sublists ?
• Space Bound Materialization Approach : We have a limited budget for space, need to maximize our performance
• Performance Guarantee Approach: For any query we need a guarantee on the performance loss, need to minimize blowup
Trade-off size and performance
Space Bound Approach
29
• Minimize expected number of postings read for a time-point query, while ensuring that the index contains at most κ times the optimal number of postings
• Optimal solution computable in O( |S| × n2 ) time and O( |S| × n ) space using dynamic programming over prefix subproblems [ t1, tk ) and space bounds s ≤ κ· |Lv|
Space budget = 1/3 . (optimal number of postings)
Performance Guarantee
30
• Minimize total number of postings kept in the index, while guaranteeing that for any time-point query the number of postings read is at most a factor γ worse than optimal
• Optimal solution computable in time O( |Lv| + n2 ) and space O( n2 ) using dynamic programming over prefix subproblems [ t1, tk )
performance guarantee = γ (times the number of optimal results at that time)
Exercise
31
• Performance Guarantee approach with γ = 2 (read at most twice as many postings as optimal)
• Space Bound approach with κ = 1.33 (1/3 more than the optimal space)
[Figure: input intervals for documents d1–d6.]
Horizontal Partitioning - Sharding
32
short time-interval query: relevant postings = 3, postings read = 3
long time-interval query: relevant postings = 7, postings read = 2 + 4 + 4 + 3 = 13
[Figure: a posting list vertically partitioned into slices p1–p6.]
Can we partition a posting list without replicating postings ?
• Index size blowup due to replication of postings across slices
• Query processing inefficient if replicated postings are accessed multiple times
Index Sharding
33
• Partition documents in each posting list into sublists called shards
• Contents of each shard disjoint - no replication, no index blowup
• Postings stored in begin time order
• Access structure over each shard for efficient query processing
[Figure: the same postings e1–e5 plotted over time against doc-id, once cut vertically (slicing) and once cut horizontally (sharding).]
Index Sharding
34
• All shards for a given query term are accessed
• Open-skip-scan on each shard assisted by impact lists
• Result list constructed by merging results from each shard
beijing olympics @ [8 Aug 2008, 24 Aug 2008]
Beijing Olympics
Index Sharding - Impact Lists
35
• Open - Each shard of a query term opened for access
• Skip - Given a query begin time seek to appropriate offset
• Scan - Read while postings still have overlap with query time interval
seek to shard offset for tb
read until begin time < te
[Figure: an impact list over a shard; entries map times t1, t12, t52, t100, t115, t150 to offsets of postings 1–5.]
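The skip-and-scan over one shard can be sketched with a binary search. This relies on the staircase property introduced on the next slide: within a shard the postings are sorted by begin time and no interval subsumes another, so the end times are sorted too and "skip" can binary-search them. The shard contents here are a toy example.

```python
import bisect

# One shard with the staircase property: postings (begin, end, doc) sorted by
# begin time; since no interval subsumes another, end times are sorted as well.
shard = [(1, 6, "d3"), (2, 7, "d5"), (4, 8, "d2"), (5, 9, "d1")]
ends = [e for _, e, _ in shard]

def skip_scan(qb, qe):
    """Skip to the first posting whose end time exceeds the query begin time,
    then scan while postings still begin before the query end time."""
    i = bisect.bisect_right(ends, qb)   # skip: first posting with end > qb
    out = []
    for b, e, d in shard[i:]:           # scan: read until begin time >= qe
        if b >= qe:
            break
        out.append(d)
    return out
```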
Index Sharding - Staircase Property
36
• Wasted reads are postings that are processed but do not overlap with the query time interval
• Staircase property in a shard:
• Intervals arranged in begin time order
• No interval completely subsumes another interval
• Eliminates wasted reads
[Figure: for a given query begin time, subsumed intervals cause wasted reads; once the shard satisfies the staircase property, the wasted reads disappear.]
Index Sharding - Idealized Sharding
37
• Staircase property eliminates sequential accesses of postings non-overlapping with query time interval
• Minimizing number of shards is essential in minimizing number of random accesses
• Input : Set of postings/intervals corresponding to a postings list
• Problem Statement : Minimize the number of shards where each shard exhibits the staircase property
A greedy algorithm exists that is provably optimal
Avishek Anand
Index Sharding - Idealized Sharding
38
Input :
Shard 1
• Postings arrive in begin time order
• For each posting chose a shard which
• does not violate staircase prop.
• has min. end time difference
• Runtime complexity O(n log n)Shard 2
append posting to the end of chosen shard
p1
p2
p3
p4
p5
p1
p2
p3
p4
p5
Avishek Anand
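The greedy assignment described on the slide above can be sketched in Python. This is a minimal sketch, not the lecture's reference implementation: it assumes the staircase property means begin times non-decreasing and end times non-increasing within a shard, and the `(begin, end)` tuple representation of postings is chosen for illustration.

```python
import bisect

def idealized_sharding(postings):
    """Greedy idealized sharding (sketch).

    `postings` is a list of (begin, end) intervals, already sorted by
    begin time.  A new posting goes to the shard whose current last end
    time is the smallest one still >= the posting's end time (minimum
    end-time difference); if no shard qualifies, a new shard is opened.
    With a balanced search tree over the shard end times this runs in
    O(n log n); the plain Python list below keeps the sketch short.
    """
    shards = []   # shards[i] = postings appended to shard i, in order
    ends = []     # sorted list of (last_end, shard_id)
    for begin, end in postings:
        # first shard whose last end time is >= this posting's end time
        i = bisect.bisect_left(ends, (end, -1))
        if i == len(ends):
            # appending here would violate the staircase property
            # in every existing shard, so open a new one
            shard_id = len(shards)
            shards.append([])
        else:
            _, shard_id = ends.pop(i)
        shards[shard_id].append((begin, end))
        bisect.insort(ends, (end, shard_id))
    return shards
```

For example, postings `[(1, 10), (2, 8), (3, 9), (4, 5)]` yield two shards, each with begin times ascending and end times descending.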
Exercise
39
• What is the idealized sharding for the given input?
• What are the impact lists for each idealized shard?
• What is the worst-case example (in the number of shards) for idealized sharding?
[Figure: input postings d1–d6 with their time intervals]
Avishek Anand
Index Sharding - Challenges
40
Idealized Sharding
• No wasted reads
• Many shards, so query processing (QP) suffers. Why?
• Random accesses (RA) are typically much more expensive than sequential accesses (SA)
• The more shards, the more random accesses to disk
• Allow wasted reads to balance SA and RA
Avishek Anand
Index Sharding - Challenges
41
Idealized Sharding
• Random accesses (RA) are typically much more expensive than sequential accesses (SA)
• The more shards, the more random accesses to disk
Relaxing the Sharding
• Allow wasted reads to balance SA and RA
Avishek Anand
Bounded Subsumption
42
• Balancing sequential and random accesses
• Bounded subsumption: no more than a fixed budget of wasted reads for any query begin time
• Bounded Subsumption Problem: minimize the number of shards such that each shard has bounded subsumption
• Can we create shards solving the bounded subsumption problem?
[Figure: wasted reads incurred at a given query begin time]
Avishek Anand
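The wasted-read count that the bound constrains can be made concrete with a short sketch. It assumes the same illustrative `(begin, end)` posting representation as before and a shard scanned as a prefix ordered by begin time; this is one plausible reading of the slide, not a definitive implementation.

```python
def wasted_reads(shard, t):
    """Count wasted reads for a time-point query at time t (sketch).

    `shard` is a list of (begin, end) postings sorted by begin time.
    The query scans the prefix of postings with begin <= t; a read is
    wasted when the posting's interval does not contain t.  Bounded
    subsumption demands that this count stays under a fixed budget for
    every query begin time.
    """
    wasted = 0
    for begin, end in shard:
        if begin > t:
            break              # the rest of the shard is never scanned
        if end < t:
            wasted += 1        # read from disk, but not valid at t
    return wasted
```

For instance, querying the shard `[(1, 10), (2, 4), (3, 9)]` at time 5 scans all three postings but only `(2, 4)` is invalid, giving one wasted read.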
Incremental Sharding
43
• The algorithm assigns each incoming posting to a shard
• The posting is inserted into the shard's buffer, maintaining begin-time order
• The top posting is popped from the buffer and appended to the end of the shard
[Figure: timeline t1–t4 up to now; postings flowing into shard 1, shard 2, and shard 3]
Avishek Anand
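The per-shard buffer on this slide can be sketched with a min-heap. The class below is an assumed interface for illustration; in particular, the fixed `buffer_size` and the idea that postings arrive only slightly out of begin-time order are assumptions, not details given in the slides.

```python
import heapq

class IncrementalShard:
    """One shard in incremental sharding (sketch).

    Each shard keeps a small buffer ordered by begin time (a min-heap);
    when the buffer overflows, the posting with the smallest begin time
    is popped and appended to the shard's materialized posting list, so
    the list stays in begin-time order.
    """
    def __init__(self, buffer_size=4):
        self.buffer = []        # min-heap keyed on begin time
        self.postings = []      # materialized, begin-time ordered
        self.buffer_size = buffer_size

    def insert(self, begin, end):
        heapq.heappush(self.buffer, (begin, end))
        if len(self.buffer) > self.buffer_size:
            # pop the top posting and append it to the shard end
            self.postings.append(heapq.heappop(self.buffer))

    def flush(self):
        while self.buffer:
            self.postings.append(heapq.heappop(self.buffer))
```

Even if `(3, 9)` arrives before `(1, 10)`, the buffer reorders them before they are appended.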
Putting Things Together
44
• Start with the initial posting lists and partition them vertically or horizontally
• Given a query time interval, for each term:
• Access either a subset of the entire lists or parts of all lists
• Intersect or union the results over all query terms
[Figure: original postings list next to its vertical and horizontal partitions]
Avishek Anand
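The query-processing steps on the previous slide can be sketched for the conjunctive (intersection) case. The index layout here, a dict mapping each term to its list of shards of `(doc_id, begin, end)` postings, is a simplified assumption for illustration, and a time-point query stands in for the general time-interval case.

```python
def temporal_query(index, terms, t):
    """Conjunctive temporal query at time point t (sketch).

    `index` maps each term to a list of shards; each shard is a list of
    (doc_id, begin, end) postings.  Per term, collect the doc ids whose
    intervals contain t across all accessed shards, then intersect the
    per-term sets (use union instead for disjunctive queries).
    """
    result = None
    for term in terms:
        docs = {doc for shard in index.get(term, [])
                    for doc, begin, end in shard
                    if begin <= t <= end}
        result = docs if result is None else result & docs
    return result or set()
```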
Open Source Full-text Indexing Software
Avishek Anand
References
46
✤ Information Retrieval: Implementing and Evaluating Search Engines (http://www.ir.uwaterloo.ca/book/), Stefan Büttcher (Google Inc.), Charles L. A. Clarke (Univ. of Waterloo), Gordon V. Cormack (Univ. of Waterloo)
✤ Managing Gigabytes, Justin Zobel, Alistair Moffat, Ian Witten
✤ Indexing Methods for Web Archives, Avishek Anand