Special Topics in Computer Science
Advanced Topics in Information Retrieval
Lecture 7 (book chapter 9):
Parallel and Distributed IR
Alexander Gelbukh
www.Gelbukh.com
Previous Chapter: Conclusions
- How to accelerate search? Same results as sequential
- Ideas:
  - Quick-and-dirty rejection of bad objects, 100% recall
  - Fast data structure for search (based on clustering)
  - Careful check of all found candidates
- Solution: mapping into fewer-D feature space
- Condition: lower-bounding of the distance
- Assumption: skewed spectrum distribution
  - Few coefficients concentrate energy, rest are less important
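The filter-and-check idea can be sketched as follows. This is only an illustration under a stated assumption: objects are already given as coefficient vectors in an orthonormal basis, so the distance over the first k coefficients can never exceed the full distance. The names `range_query`, `objects`, `eps` and `k` are hypothetical, and the filter is a plain scan rather than a fast search structure.

```python
import math

def dist(a, b):
    """Euclidean distance between two equal-length vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def range_query(objects, query, eps, k):
    """Return indices of objects within distance eps of query.

    Assumption: vectors hold coefficients in an orthonormal basis, so the
    distance over the first k coefficients lower-bounds the full distance.
    """
    q_feat = query[:k]
    # Quick-and-dirty filter in the k-dimensional feature space. It never
    # rejects a true answer, because cheap distance <= full distance.
    candidates = [i for i, obj in enumerate(objects)
                  if dist(obj[:k], q_feat) <= eps]
    # Careful check: the full distance removes the false alarms.
    return [i for i in candidates if dist(objects[i], query) <= eps]

objects = [
    [1.0, 0.0, 0.1, 0.0],
    [1.1, 0.1, 0.0, 0.9],  # close in the first 2 coefficients, far overall
    [5.0, 5.0, 0.0, 0.0],
]
print(range_query(objects, [1.0, 0.0, 0.0, 0.0], eps=0.5, k=2))
```

The second object passes the filter but is discarded by the careful check, so the answer is the same as for a sequential scan.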
Previous Chapter: Research topics
- Object detection (pattern and image recognition)
- Automatic feature selection
- Spatial indexing data structures (more than 1D)
- New types of data
  - What features to select? How to determine them?
  - Mixed-type data (e.g., webpages, or images with sound and description)
- What clustering/IR methods are better suited for what features? (What features for what methods?)
- Similar methods in data mining, ...
The problem
- Very large document collections
  - Google: 4,000,000,000 pages
- Slow response?
- Solution: parallel computing
  - Google: 10,000 computers
Parallel architectures

                      Data stream
Instruction stream    Single              Multiple
Single                SISD (classical)    SIMD (simple)
Multiple              MISD (rare)         MIMD (many SISD)
MIMD architecture
- The most common
- Can be
  - tightly coupled
  - loosely coupled
- Distributed
  - Many computers interacting via network
  - PC clusters
  - Similar to MIMD computers, but greater cost of communication: very loosely coupled
  - More coarse-grained programs
Performance improvement
- Time: speedup S
  - Ideally, N times (N = number of processors)
  - In practice impossible:
    - The problem does not decompose into N equal parts
    - Communication and control overhead
  - S < 1 / f, where f is the largest fraction of the problem that cannot be split further (it must run sequentially)
- Cost
  - Speedup per processor (efficiency): S / N
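A small numeric sketch of the bound. It assumes that f denotes the part of the work that must run sequentially and that the rest splits evenly over the processors (the usual reading of Amdahl's law); the function names are hypothetical.

```python
def amdahl_speedup(f, n):
    """Upper bound on speedup with n processors.

    Assumption: f is the fraction of the work that must run sequentially
    (0 < f <= 1); the remaining 1 - f is split evenly over n processors.
    """
    return 1.0 / (f + (1.0 - f) / n)

def efficiency(speedup, n):
    """Speedup obtained per processor, S / N."""
    return speedup / n

for n in (1, 10, 100, 10_000):
    s = amdahl_speedup(0.1, n)
    print(n, round(s, 2), round(efficiency(s, n), 4))
```

With f = 0.1 the speedup stays below 1 / f = 10 no matter how many processors are added, while the efficiency S / N keeps falling.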
Two approaches to parallelism
- Build new algorithms
  - E.g., neural nets
  - Naturally parallel
  - Problem: to define the retrieval task
- Adapt the existing techniques to parallelism
  - Allows relying on well-studied approaches
  - We will consider this option
Ways to use parallelism
- Multitasking
  - N search engines
  - Good for processing many queries
  - Problems:
    - A single query is not speeded up
    - Bottleneck: disk access (index)
    - Possible solution: replicating (part of) data. RAIDs
- Parallel algorithms
  - IR = data. Main question: how to partition the data
  - Document / index term matrix (terms can be LSI dimensions, signature bits, etc.)
Possible partitionings
- Horizontal: document partitioning. Union of results
- Vertical: term partitioning. Basically, intersect results
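The two ways of cutting the document / term matrix can be illustrated on a toy example. This is a sketch only: the matrix, the round-robin assignment and the function names are made up for illustration, and the "processors" are simulated by a loop.

```python
# Toy document / term matrix: rows are documents, columns are terms.
TERMS = ["parallel", "index", "query", "signature"]
MATRIX = [
    [1, 1, 0, 0],  # doc 0
    [1, 0, 1, 0],  # doc 1
    [0, 1, 1, 1],  # doc 2
    [1, 1, 1, 0],  # doc 3
]

def and_query_document_partitioned(matrix, terms, query, n_procs):
    """Horizontal split: each processor owns whole rows and answers the
    full query on its own documents; the partial answers are united."""
    cols = [terms.index(t) for t in query]
    result = set()
    for p in range(n_procs):
        my_docs = range(p, len(matrix), n_procs)  # round-robin assignment
        result |= {d for d in my_docs if all(matrix[d][c] for c in cols)}
    return result

def and_query_term_partitioned(matrix, terms, query, n_procs):
    """Vertical split: each processor owns whole columns and returns the
    documents containing its query terms; the answers are intersected."""
    result = set(range(len(matrix)))
    for p in range(n_procs):
        my_cols = [c for c in range(len(terms)) if c % n_procs == p]
        for c in my_cols:
            if terms[c] in query:
                result &= {d for d in range(len(matrix)) if matrix[d][c]}
    return result

print(and_query_document_partitioned(MATRIX, TERMS, ["parallel", "index"], 2))
print(and_query_term_partitioned(MATRIX, TERMS, ["parallel", "index"], 2))
```

Both calls return the same set of documents; only the way the work and the merging step are divided differs.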
Inverted files: Logical partitioning
- Logical vs. physical document partitioning
- Logical: for each term, use pointers into inverted file data for each processor, to indicate its portion
Inverted files: Logical partitioning. Construction and updating
- Also parallel
- Construction
  - Assign docs to processors
  - Order docs such that each processor has an interval
  - Process in parallel
  - Merge. Each piece is ordered already
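The construction steps can be sketched as follows. This is a minimal illustration, not the book's algorithm: the partial indices are built one after another instead of on separate processors, documents are plain strings, and all names are hypothetical.

```python
from collections import defaultdict

def build_partial_index(docs, first_id):
    """Index one processor's interval of documents.
    Posting lists come out sorted because doc ids are visited in order."""
    index = defaultdict(list)
    for offset, text in enumerate(docs):
        for term in set(text.split()):
            index[term].append(first_id + offset)
    return index

def build_index(docs, n_procs):
    """Split docs into contiguous id intervals, index each, then merge."""
    size = max(1, -(-len(docs) // n_procs))  # ceiling division
    # Each call below is independent and could run on its own processor;
    # here they run sequentially to keep the sketch simple.
    parts = [build_partial_index(docs[s:s + size], s)
             for s in range(0, len(docs), size)]
    merged = defaultdict(list)
    for part in parts:  # intervals are visited in increasing id order
        for term, postings in part.items():
            merged[term].extend(postings)  # concatenation keeps lists sorted
    return dict(merged)

docs = ["parallel ir", "distributed ir", "parallel index", "ir index"]
print(build_index(docs, 2))
```

Because each processor holds an interval of document ids, merging is plain concatenation of posting lists that are already ordered.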
Inverted files: Physical document partitioning
- Several separate collections, one per processor
- Separate indices
- Then the lists are merged (they are already ordered)
- Priority queue is used
  - The result is not sorted; insertion is quick
  - The maximal element can be found quickly
  - First k elements can be found rather quickly
  - Details in the book
- Consistent scores are needed
  - Global statistics are needed. Can be computed at index time
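Merging the ranked lists with a priority queue might look like the sketch below. It uses Python's standard `heapq.merge`; the `(score, doc_id)` layout and the sample data are assumptions made for the example, and the scores are presumed to be already consistent across processors.

```python
import heapq
from itertools import islice

def top_k(per_processor_hits, k):
    """Merge ranked lists from several processors and keep the best k.

    Each input list holds (score, doc_id) pairs sorted by descending score.
    Assumption: scores are comparable across processors, i.e. every
    processor used the same global statistics when scoring.
    """
    # heapq.merge keeps a small heap with one head element per list and
    # yields items lazily, so only about k items are ever pulled out.
    merged = heapq.merge(*per_processor_hits, reverse=True)
    return list(islice(merged, k))

hits_p0 = [(0.9, "d3"), (0.4, "d1")]
hits_p1 = [(0.8, "d7"), (0.7, "d5"), (0.1, "d9")]
print(top_k([hits_p0, hits_p1], 3))
```

If each processor scored documents with its own local statistics, the scores would not be comparable and the merged order would be meaningless, which is why global statistics are needed.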
Logical or physical partitioning?
- Logical requires less communication
  - Faster
- Physical is more flexible
  - Simpler implementation
  - Simpler conversion of existing systems
Inverted files: Term partitioning
- Each processor processes a part of the inverted file
- The results are intersected (for AND), or combined as appropriate for the other Boolean operations (OR and NOT)
- When term distribution in user queries is skewed, document partitioning is better
- When it is uniform, term partitioning is better: by a factor of 2 for long queries, 5 to 10 for short (Web-like) queries
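One way to see why skew hurts term partitioning is to count how many posting-list lookups land on each processor. The sketch below is a toy illustration only: the term-to-processor map `owner` and the two query logs are invented, and real load also depends on posting-list lengths.

```python
def term_partition_load(queries, owner, n_procs):
    """Count how many posting-list lookups each processor serves."""
    load = [0] * n_procs
    for query in queries:
        for term in query:
            load[owner[term]] += 1
    return load

# Hypothetical assignment of terms to four processors.
owner = {"ir": 0, "web": 1, "index": 2, "query": 3}

uniform_log = [["ir"], ["web"], ["index"], ["query"]]
skewed_log = [["ir"], ["ir"], ["ir"], ["web"]]

print(term_partition_load(uniform_log, owner, 4))
print(term_partition_load(skewed_log, owner, 4))
```

With the uniform log every processor does the same amount of work. With the skewed log one processor serves most lookups while others stay idle. Under document partitioning every processor takes part in every query, so the load stays balanced either way.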
Suffix arrays
- Array construction can be parallelized
  - merges are parallel
- Document partitioning is applied straightforwardly
  - Each processor maintains its own suffix array
- Term partitioning can be applied
  - Each processor owns a branch of the tree (lexicographic interval)
  - Bottleneck: all processors need access to the entire text
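Document partitioning with suffix arrays can be sketched as below: each (simulated) processor keeps the suffix array of its own text and answers the query locally. The construction is deliberately naive, and the partition names and texts are made up for the example.

```python
def build_suffix_array(text):
    """Start positions of all suffixes of text, in lexicographic order.
    (Naive construction; fine for a sketch, too slow for real collections.)"""
    return sorted(range(len(text)), key=lambda i: text[i:])

def find(text, sa, pattern):
    """Positions where pattern occurs, found by binary search in sa."""
    m = len(pattern)
    lo, hi = 0, len(sa)
    while lo < hi:  # leftmost suffix whose first m characters are >= pattern
        mid = (lo + hi) // 2
        if text[sa[mid]:sa[mid] + m] < pattern:
            lo = mid + 1
        else:
            hi = mid
    hits = []
    while lo < len(sa) and text[sa[lo]:sa[lo] + m] == pattern:
        hits.append(sa[lo])
        lo += 1
    return sorted(hits)

def search_all(partitions, pattern):
    """Send the pattern to every processor and collect the local answers."""
    answers = {}
    for name, (text, sa) in partitions.items():
        hits = find(text, sa, pattern)
        if hits:
            answers[name] = hits
    return answers

partitions = {name: (text, build_suffix_array(text))
              for name, text in {"p0": "banana", "p1": "bandana"}.items()}
print(search_all(partitions, "an"))
```

Each processor needs only its own text here. With term partitioning, a processor owning a lexicographic interval would have to read suffixes from anywhere in the collection, which is the bottleneck named above.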
Signature files
- Document partitioning: straightforward
  - Create query signature, distribute to each processor
  - Merge results (using Boolean operations if needed)
- Term partitioning: shorter signatures
  - Merging and eliminating false drops is slow
  - This method is not recommended
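Document partitioning for signature files can be sketched as follows. The signature scheme (superimposed coding with three MD5-derived bits per word in a 64-bit signature) is an arbitrary choice for this example, not the one from the book, and the processors are simulated by a loop.

```python
import hashlib

BITS = 64  # signature width; an arbitrary choice for this sketch

def word_bits(word, n_hashes=3):
    """Bit mask with up to n_hashes positions set for one word."""
    mask = 0
    for i in range(n_hashes):
        digest = hashlib.md5(f"{i}:{word}".encode()).digest()
        mask |= 1 << (int.from_bytes(digest[:4], "big") % BITS)
    return mask

def signature(words):
    """Superimposed coding: OR together the masks of all words."""
    sig = 0
    for w in words:
        sig |= word_bits(w)
    return sig

def make_partition(texts):
    """Store (signature, text) for every document of one processor."""
    return {doc_id: (signature(text.split()), text)
            for doc_id, text in texts.items()}

def search_partition(docs, query_words, query_sig):
    """One processor: bitwise filter, then check the text for false drops."""
    hits = []
    for doc_id, (sig, text) in docs.items():
        if sig & query_sig == query_sig:               # candidate
            if set(query_words) <= set(text.split()):  # eliminate false drops
                hits.append(doc_id)
    return hits

def search(partitions, query_words):
    query_sig = signature(query_words)  # created once, sent to everyone
    result = []
    for docs in partitions:             # each iteration = one processor
        result.extend(search_partition(docs, query_words, query_sig))
    return sorted(result)

partitions = [
    make_partition({1: "parallel signature files", 2: "inverted files"}),
    make_partition({3: "signature files on simd"}),
]
print(search(partitions, ["signature", "files"]))
```

Each processor removes its own false drops before answering, so merging the results for an AND query is a simple union.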
SIMD computers
- Single Instruction, Multiple Data
- Uncommon
- Good for simple operations
  - Bit operations in signature files
  - Details in the book
- Ranking is supported in hardware in some computers
- If signature file does not fit into memory, can be processed in batches
  - I/O overhead
  - Use multiple queries with the same batch
  - This improves throughput, but not response time
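Reusing each batch for several queries can be sketched as below. Plain Python stands in for the SIMD machine: on real hardware the inner bit test would be applied to all signatures of the batch at once. The batch reader and all names are assumptions made for the example.

```python
def batches(signatures, size):
    """Yield consecutive slices of the signature list (stand-in for disk reads)."""
    for start in range(0, len(signatures), size):
        yield start, signatures[start:start + size]

def run_queries(signatures, query_sigs, batch_size):
    """Answer several queries in one pass over the signature file.

    Returns (matches, loads): matches[q] lists the positions whose signature
    contains all bits of query q; loads counts how many batches were read.
    """
    matches = [[] for _ in query_sigs]
    loads = 0
    for start, batch in batches(signatures, batch_size):
        loads += 1  # one I/O operation per batch, shared by all queries
        for q, q_sig in enumerate(query_sigs):
            for offset, sig in enumerate(batch):
                if sig & q_sig == q_sig:  # same bit test for every signature
                    matches[q].append(start + offset)
    return matches, loads

sigs = [0b1011, 0b0110, 0b1110, 0b0001]
print(run_queries(sigs, [0b0010, 0b1000], batch_size=2))
```

Both queries are answered with two batch reads instead of four, which is the throughput gain; each individual query still waits for the whole pass, so its response time does not improve.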
… SIMD computers
- Inverted files are difficult to adapt to SIMD
- The inverted file is restructured
- Details in the book
Distributed IR
- MIMD with
  - Slow communication
  - Not all nodes are used for a given query
  - Encryption issues
- Document partitioning is usually used
  - Term partitioning imposes greater communication overhead
- Document clustering can be useful (to distribute docs among processors)
  - Index the clusters and then search only the best ones
  - Another approach: use training queries, then the similarity of the user query to these
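Searching only the best clusters can be sketched as follows. The cluster profile (summed term counts) and the overlap score are crude stand-ins chosen for brevity, and the node names and documents are invented. In practice the profiles would be computed once at index time rather than per query.

```python
from collections import Counter

def centroid(docs):
    """Term-frequency profile of a cluster (sum over its documents)."""
    total = Counter()
    for text in docs.values():
        total.update(text.split())
    return total

def overlap(query_terms, profile):
    """Crude similarity: how often the query terms occur in the profile."""
    return sum(profile[t] for t in query_terms)

def search(clusters, query_terms, n_best=1):
    """Rank clusters by similarity to the query, then search only the best."""
    ranked = sorted(clusters,
                    key=lambda name: overlap(query_terms, centroid(clusters[name])),
                    reverse=True)
    hits = []
    for name in ranked[:n_best]:  # the other nodes are never contacted
        for doc_id, text in clusters[name].items():
            if set(query_terms) & set(text.split()):
                hits.append(doc_id)
    return sorted(hits)

clusters = {
    "node_a": {1: "parallel query processing", 2: "parallel index"},
    "node_b": {3: "image retrieval", 4: "parallel image search"},
}
print(search(clusters, ["parallel", "index"]))
print(search(clusters, ["parallel", "index"], n_best=2))
```

With `n_best=1` only one node is contacted and a matching document on the other node is missed, so communication is traded against recall.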
Research topics
- How to evaluate the speedup
- New algorithms
- Adaptation of existing algorithms
- Merging the results is a bottleneck
- Meta search engines
- Creating large collections with judgements
- Is recall important?
Conclusions
- Parallel computing can improve
  - response time for each query and/or
  - throughput: number of queries processed at the same speed
- Document partitioning is simple
  - good for distributed computing
- Term partitioning is good for some data structures
- Distributed computing is MIMD computing with slow communication
- SIMD machines are good for signature files
  - Both are out of favor now
Thank you!
Till May 17? 18?, 6 pm