
Alexander Gelbukh Gelbukh

Date post: 07-Feb-2016
Description:
Special Topics in Computer Science: Advanced Topics in Information Retrieval. Lecture 7 (book chapter 9): Parallel and Distributed IR. Alexander Gelbukh, www.Gelbukh.com.
Transcript
Page 1: Alexander Gelbukh Gelbukh

Special Topics in Computer Science

Advanced Topics in Information Retrieval
Lecture 7 (book chapter 9): Parallel and Distributed IR

Alexander Gelbukh
www.Gelbukh.com

Page 2: Alexander Gelbukh Gelbukh

Previous Chapter: Conclusions

How to accelerate search? Same results as sequential search.
Ideas:
  Quick-and-dirty rejection of bad objects, 100% recall
  Fast data structure for search (based on clustering)
  Careful check of all found candidates
Solution: mapping into a lower-dimensional feature space
  Condition: lower-bounding of the distance
  Assumption: skewed spectrum distribution
    A few coefficients concentrate the energy; the rest are less important

Page 3: Alexander Gelbukh Gelbukh

Previous Chapter: Research topics

Object detection (pattern and image recognition)
Automatic feature selection
Spatial indexing data structures (more than 1D)
New types of data
  What features to select? How to determine them?
  Mixed-type data (e.g., webpages, or images with sound and description)
What clustering/IR methods are better suited for what features? (What features for what methods?)
Similar methods in data mining, ...

Page 4: Alexander Gelbukh Gelbukh

The problem

Very large document collections
  Google: 4,000,000,000 pages
  Slow response?
Solution: parallel computing
  Google: 10,000 computers

Page 5: Alexander Gelbukh Gelbukh

Parallel architectures

                        Data stream
Instruction stream      Single              Multiple
Single                  SISD (classical)    SIMD (simple)
Multiple                MISD (rare)         MIMD (many SISD)

Page 6: Alexander Gelbukh Gelbukh

MIMD architecture

The most common
Can be
  tightly coupled
  loosely coupled
Distributed
  Many computers interacting via network
  PC clusters
  Similar to MIMD computers, but with a greater cost of communication: very loosely coupled
  More coarse-grained programs

Page 7: Alexander Gelbukh Gelbukh

Performance improvement

Time: speedup S
  Ideally, N times (N is the number of processors)
  In practice impossible
    The problem does not decompose into N equal parts
    Communication and control overhead
  S < 1 / f, where f is the fraction of the problem that must be computed sequentially (Amdahl's law)
Cost
  Per processor: S / N (efficiency)
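
The bound on the speedup can be checked numerically. The sketch below is only an illustration: it reads f as the fraction of the work that cannot be parallelized (an assumption about the slide's notation) and uses the standard Amdahl formula S = 1 / (f + (1 - f) / N), which never exceeds 1 / f. The function names are hypothetical, not taken from the lecture.

```python
def amdahl_speedup(f: float, n: int) -> float:
    """Upper bound on the speedup with n processors.

    f is assumed to be the sequential (non-parallelizable) fraction of the work.
    """
    if not 0.0 <= f <= 1.0:
        raise ValueError("f must be between 0 and 1")
    if n < 1:
        raise ValueError("n must be at least 1")
    return 1.0 / (f + (1.0 - f) / n)


def efficiency(speedup: float, n: int) -> float:
    """Speedup obtained per processor, S / N."""
    return speedup / n
```

With f = 0.1, no number of processors gives a speedup of 10 or more, which is the 1 / f limit from the slide.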

Page 8: Alexander Gelbukh Gelbukh

Two approaches to parallelism

Build new algorithms
  E.g., neural nets
  Naturally parallel
  Problem: to define the retrieval task
Adapt the existing techniques to parallelism
  Allows relying on well-studied approaches
  We will consider this option

Page 9: Alexander Gelbukh Gelbukh

Ways to use parallelism

Multitasking
  N search engines
  Good for processing many queries
  Problems:
    A single query is not sped up
    Bottleneck: disk access (index)
    Possible solution: replicating (part of) the data. RAIDs
Parallel algorithms
  IR = data. Main question: how to partition the data
  Document / index term matrix (terms can be LSI dimensions, signature bits, etc.)

Page 10: Alexander Gelbukh Gelbukh

Possible partitionings

Horizontal: document partitioning. Union of results
Vertical: term partitioning. Basically, intersect results
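
The two partitionings can be pictured on a toy document/term matrix. This is a minimal sketch, assuming a small dense matrix stored as nested lists; the matrix contents and the function names are made up for illustration and real systems would use sparse structures.

```python
# Toy document/term matrix: rows are documents, columns are index terms.
MATRIX = [
    [1, 0, 1, 0],  # doc 0
    [0, 1, 1, 0],  # doc 1
    [1, 1, 0, 1],  # doc 2
    [0, 0, 1, 1],  # doc 3
]


def horizontal_partition(matrix, n_parts):
    """Document partitioning: each processor gets a contiguous block of rows."""
    size = -(-len(matrix) // n_parts)  # ceiling division
    return [matrix[i:i + size] for i in range(0, len(matrix), size)]


def vertical_partition(matrix, n_parts):
    """Term partitioning: each processor gets a contiguous block of columns."""
    n_terms = len(matrix[0])
    size = -(-n_terms // n_parts)  # ceiling division
    return [[row[j:j + size] for row in matrix] for j in range(0, n_terms, size)]
```

With horizontal partitioning each processor sees all terms for some documents, so partial answers are simply united; with vertical partitioning each processor sees some terms for all documents, so partial answers have to be combined per document.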

Page 11: Alexander Gelbukh Gelbukh

Inverted files: Logical partitioning

Logical vs. physical document partitioning
Logical: for each term, use pointers into the inverted file data for each processor, to indicate its portion

Page 12: Alexander Gelbukh Gelbukh

Inverted files: Logical partitioning
Construction and updating

Also parallel
Construction
  Assign docs to processors
  Order docs such that each processor has an interval
  Process in parallel
  Merge. Each piece is ordered already
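
The construction steps above can be sketched as follows. This is an illustrative sketch, not the book's algorithm: threads stand in for processors, documents are plain strings tokenized on whitespace, and the names `index_interval` and `build_index` are hypothetical.

```python
from collections import defaultdict
from concurrent.futures import ThreadPoolExecutor


def index_interval(docs, first_id):
    """Build a partial inverted index for one contiguous interval of doc ids."""
    partial = defaultdict(list)
    for offset, text in enumerate(docs):
        for term in set(text.lower().split()):
            partial[term].append(first_id + offset)  # ids arrive in increasing order
    return partial


def build_index(docs, n_workers):
    """Assign intervals of documents to workers, index in parallel, then merge."""
    size = max(1, -(-len(docs) // n_workers))  # ceiling division
    starts = range(0, len(docs), size)
    with ThreadPoolExecutor(max_workers=n_workers) as pool:
        # pool.map returns results in the order of the intervals.
        partials = list(pool.map(lambda s: index_interval(docs[s:s + size], s), starts))
    merged = defaultdict(list)
    for partial in partials:
        for term, postings in partial.items():
            # Each piece is already ordered and intervals do not overlap,
            # so concatenation keeps every posting list sorted.
            merged[term].extend(postings)
    return dict(merged)
```

Because every worker owns an interval of document ids, the merge step reduces to concatenating posting lists in interval order.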

Page 13: Alexander Gelbukh Gelbukh

Inverted files: Physical document partitioning

Several separate collections, one per processor
Separate indices
Then the lists are merged (they are already ordered)
Priority queue is used
  The result is not sorted; insertion is quick
  The maximal element can be found quickly
  First k elements can be found rather quickly
  Details in the book
Consistent scores are needed
  Global statistics are needed. They can be computed at index time
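
Merging the per-processor result lists with a priority queue might look like the sketch below. It assumes each processor returns (score, doc_id) pairs already sorted by descending score, and that scores are comparable across processors (that is, they were computed with global statistics). It uses Python's `heapq.merge`, which keeps a small heap over the list heads; the function name `merge_top_k` is hypothetical.

```python
import heapq
from itertools import islice


def merge_top_k(ranked_lists, k):
    """Return the k best hits from several lists sorted by descending score."""
    # heapq.merge expects ascending input, so order by negated score.
    merged = heapq.merge(*ranked_lists, key=lambda hit: -hit[0])
    # The merge is lazy: only as many elements as needed are pulled.
    return list(islice(merged, k))
```

Only the first k elements are ever extracted, so the full merged ranking is never built.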

Page 14: Alexander Gelbukh Gelbukh

Logical or physical partitioning?

Logical requires less communication
  Faster
Physical is more flexible
  Simpler implementation
  Simpler conversion of existing systems

Page 15: Alexander Gelbukh Gelbukh

Inverted files: Term partitioning

Each processor processes a part of the inverted file
The results are intersected (for AND), or combined as appropriate for the other Boolean operations (OR and NOT)
When term distribution in user queries is skewed, document partitioning is better
When uniform, term partitioning is better: by a factor of two for long queries, 5 – 10 times for short (Web-like) ones
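
A term-partitioned AND query can be sketched as follows. This is an assumption-laden illustration: terms are assigned to processors round-robin over the sorted vocabulary, the "processors" are plain dictionaries, and the names `partition_terms` and `and_query` are hypothetical.

```python
def partition_terms(index, n_parts):
    """Give each processor the posting lists of a subset of the terms."""
    parts = [dict() for _ in range(n_parts)]
    for i, term in enumerate(sorted(index)):
        parts[i % n_parts][term] = index[term]
    return parts


def and_query(parts, terms):
    """Fetch each term's postings from its owner and intersect them."""
    result = None
    for term in terms:
        owner = next((part for part in parts if term in part), None)
        postings = set(owner[term]) if owner is not None else set()
        result = postings if result is None else result & postings
    return sorted(result or [])
```

If all query terms happen to live on one processor, the others stay idle for that query, which is one way to see why skewed query term distributions favor document partitioning.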

Page 16: Alexander Gelbukh Gelbukh

Suffix arrays

Array construction can be parallelized
  Merges are parallel
Document partitioning is applied straightforwardly
  Each processor maintains its own suffix array
Term partitioning can be applied
  Each processor owns a branch of the tree (lexicographic interval)
  Bottleneck: all processors need access to the entire text
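
For the document-partitioned case, each processor can build and search a suffix array over its own text only. The sketch below uses a naive construction (sorting all suffixes), which is fine for a toy example but far too slow for real collections; the function names are hypothetical.

```python
def build_suffix_array(text):
    """Naive suffix array: start positions sorted by the suffix they begin."""
    return sorted(range(len(text)), key=lambda i: text[i:])


def find(text, suffix_array, pattern):
    """Binary-search the suffix array for all occurrences of pattern."""
    m = len(pattern)
    lo, hi = 0, len(suffix_array)
    while lo < hi:  # find the first suffix whose m-character prefix is >= pattern
        mid = (lo + hi) // 2
        if text[suffix_array[mid]:suffix_array[mid] + m] < pattern:
            lo = mid + 1
        else:
            hi = mid
    hits = []
    while lo < len(suffix_array) and text[suffix_array[lo]:suffix_array[lo] + m] == pattern:
        hits.append(suffix_array[lo])
        lo += 1
    return sorted(hits)


def search_partitions(texts, pattern):
    """Each partition searches its own suffix array; results are keyed by partition."""
    return {i: find(t, build_suffix_array(t), pattern) for i, t in enumerate(texts)}
```

No partition needs to see another partition's text here, in contrast with the term-partitioned variant, where every processor needs access to the entire text.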


Page 18: Alexander Gelbukh Gelbukh

Signature files

Document partitioning: straightforward
  Create query signature, distribute to each processor
  Merge results (using Boolean operations if needed)
Term partitioning: shorter signatures
  Merging and eliminating false drops is slow
  This method is not recommended
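
The document-partitioned scheme can be sketched as follows. The signature function is a simplification chosen for illustration (one CRC32-hashed bit per word in a 64-bit signature); document signatures are recomputed on the fly, whereas a real signature file would store them. All names are hypothetical.

```python
import zlib

SIG_BITS = 64  # assumed signature width


def signature(words, bits=SIG_BITS):
    """Superimposed coding: OR together one hashed bit per word."""
    sig = 0
    for word in words:
        sig |= 1 << (zlib.crc32(word.encode("utf-8")) % bits)
    return sig


def search_partition(docs, query_words):
    """Filter one processor's (doc_id, text) pairs by signature, then verify."""
    query_sig = signature(query_words)
    hits = []
    for doc_id, text in docs:
        words = text.lower().split()
        # The signature test may yield false drops, so candidates are checked.
        if signature(words) & query_sig == query_sig and all(w in words for w in query_words):
            hits.append(doc_id)
    return hits


def search(partitions, query_words):
    """Send the query to every partition and merge the results."""
    results = []
    for docs in partitions:
        results.extend(search_partition(docs, query_words))
    return sorted(results)
```

The query signature is computed once and the same test runs on every partition, which is why this case is described as straightforward.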

Page 19: Alexander Gelbukh Gelbukh

SIMD computers

Single Instruction, Multiple Data
Uncommon
Good for simple operations
  Bit operations in signature files
  Details in the book
Ranking is supported in hardware in some computers
If the signature file does not fit into memory, it can be processed in batches
  I/O overhead
  Use multiple queries with the same batch
  This improves throughput, but not response time

Page 20: Alexander Gelbukh Gelbukh

… SIMD computers

Inverted files are difficult to adapt to SIMD
  The inverted file is restructured
  Details in the book

Page 21: Alexander Gelbukh Gelbukh

Distributed IR

MIMD with
  Slow communication
  Not all nodes are used for a given query
  Encryption issues
Document partitioning is usually used
  Term partitioning imposes greater communication overhead
Document clustering can be useful (to distribute docs by processors)
  Index clusters and then search only the best ones
  Another approach: use training queries, then similarity of the user query to these
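
Searching only the best clusters can be illustrated with a simple node-selection step. This is a sketch under assumptions not stated in the lecture: each node's cluster is summarized by a raw term-frequency centroid, and nodes are ranked by cosine similarity to the query. The names `vector`, `cosine`, and `select_nodes` are hypothetical.

```python
import math
from collections import Counter


def vector(text):
    """Bag-of-words term-frequency vector."""
    return Counter(text.lower().split())


def cosine(a, b):
    """Cosine similarity between two sparse vectors."""
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0


def select_nodes(node_docs, query, top_n):
    """Rank nodes by similarity of their cluster centroid to the query."""
    q = vector(query)
    centroids = {node: vector(" ".join(docs)) for node, docs in node_docs.items()}
    ranked = sorted(centroids, key=lambda node: cosine(centroids[node], q), reverse=True)
    return ranked[:top_n]
```

Only the selected nodes receive the query, which saves communication at the price of possibly missing relevant documents stored elsewhere.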

Page 22: Alexander Gelbukh Gelbukh

Research topics

How to evaluate the speedup
New algorithms
Adaptation of existing algorithms
Merging the results is a bottleneck
Meta search engines
Creating large collections with judgements
Is recall important?

Page 23: Alexander Gelbukh Gelbukh

Conclusions

Parallel computing can improve
  response time for each query and/or
  throughput: number of queries processed with the same speed
Document partitioning is simple
  Good for distributed computing
Term partitioning is good for some data structures
Distributed computing is MIMD computing with slow communication
SIMD machines are good for signature files
  Both are out of favor now

Page 24: Alexander Gelbukh Gelbukh

Thank you!

Till May 17? 18?, 6 pm

