Text Technologies for Data Science
INFR11145
10-Oct-2017
Indexing (2)
Instructor:
Walid Magdy
2
Walid Magdy, TTDS 2017/2018
Lecture Objectives
• Learn more about indexing:
• Structured documents
• Extent index
• Index compression
• Data structure
• Wild-char search and applications
* You are not asked to implement any of the content in this lecture, but you
might think of using some for your course project
Structured Documents
• Documents are not always flat:
• Meta-data: title, author, time-stamp
• Structure: headline, section, body
• Tags: link, hashtag, mention
• How to deal with it?
• Neglect!
• Create separate index for each field
• Use “extent index”
Extent Index
• Index all terms in a structured document as plain text
• Add a special “term” for each element/field/tag
• Terms in a given field/tag get an additional entry
• Posting: span (window of positions) covered by a given field
• Allows multiple overlapping spans of different types
D1: He likes to wink, he likes to drink
D2: He likes to drink, and drink, and drink
D3: The thing he likes to drink is ink
D4: The ink he likes to drink is pink
D5: He likes to wink, and drink pink ink

Postings (doc,position; extent postings as doc,start:end):
he:    1,1  1,5  2,1  3,3  4,3  5,1
drink: 1,8  2,4  2,6  2,8  3,6  4,6  5,6
ink:   3,8  4,2  5,8
pink:  4,8  5,7
Link:  3,1:2  4,1:4  5,7:8
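The positional postings above can be built with a few lines of code. The sketch below is illustrative only: it assumes a trivial whitespace tokenizer over lowercased text, and the “Link” extent spans are copied from the slide rather than derived.

```python
from collections import defaultdict

# The five example documents from the slide, lowercased and unpunctuated
# so that a whitespace split is a good-enough tokenizer for the sketch.
docs = {
    1: "he likes to wink he likes to drink",
    2: "he likes to drink and drink and drink",
    3: "the thing he likes to drink is ink",
    4: "the ink he likes to drink is pink",
    5: "he likes to wink and drink pink ink",
}

# Term postings: term -> list of (doc_id, position), positions start at 1.
index = defaultdict(list)
for doc_id, text in docs.items():
    for pos, term in enumerate(text.split(), start=1):
        index[term].append((doc_id, pos))

# Extent postings: field/tag -> list of (doc_id, (start, end)) spans.
# These "Link" spans mirror the slide's example; they are assumed, not computed.
extents = {"Link": [(3, (1, 2)), (4, (1, 4)), (5, (7, 8))]}

print(index["ink"])   # [(3, 8), (4, 2), (5, 8)]
```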
Using Extent
• Doc 1:
  Headline: “Information retrieval lecture”          (positions 1 2 3)
  Text: “this is lecture 6 of the TTDS course on IR” (positions 4 5 6 …)
• Query → Headline: lecture

Headline: 1,1:3  2,1:5  3,1:4
lecture:  1,3  1,6  2,9  3,7  3,11
• Only occurrence 1,3 falls inside a Headline extent (1,1:3), so only Doc 1 matches
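Matching a fielded query amounts to checking whether a term occurrence falls inside the field's extent span. A minimal sketch, with illustrative posting values adapted from the slide and an invented helper name:

```python
# doc_id -> (start, end) span of the Headline field in that document
headline_extents = {1: (1, 3), 2: (1, 5), 3: (1, 4)}

# (doc_id, position) postings for the term "lecture" (illustrative values)
lecture_postings = [(1, 3), (1, 6), (2, 9), (3, 7), (3, 11)]

def docs_with_term_in_field(postings, field_extents):
    """Return doc IDs where at least one occurrence lies within the field span."""
    hits = set()
    for doc_id, pos in postings:
        span = field_extents.get(doc_id)
        if span and span[0] <= pos <= span[1]:
            hits.add(doc_id)
    return sorted(hits)

print(docs_with_term_in_field(lecture_postings, headline_extents))  # [1]
```

Only the occurrence at position 3 of Doc 1 lies inside that document's Headline span, so the fielded query returns Doc 1 alone.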
Index Compression
• Inverted indices are big
• Large disk space → many I/O operations
• Index compression
• Reduce space → less I/O
• Allow more chunks of the index to be cached in memory
• Large size goes to:
• terms? document numbers?
• Ideas:
• Compress document numbers, how?
Delta Encoding
• Large collections → long sequences of doc IDs
• Larger ID numbers → more bytes to store
• 1 byte: 0 – 255
• 2 bytes: 0 – 65,535
• 4 bytes: 0 – 4.3 B
• Idea: store the delta between consecutive IDs instead of the full ID
• Very useful, especially for frequent terms

term → 100002 100007 100008 100011 100019
term → 100002 5 1 3 8
(first ID: 3 bytes; each delta fits in 1 byte)
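The steps above can be sketched in a few lines; function names are illustrative:

```python
def delta_encode(doc_ids):
    """Store the first doc ID, then the gap to each following ID."""
    gaps = [doc_ids[0]]
    for prev, cur in zip(doc_ids, doc_ids[1:]):
        gaps.append(cur - prev)
    return gaps

def delta_decode(gaps):
    """Recover the original doc IDs by running sums over the gaps."""
    doc_ids = [gaps[0]]
    for g in gaps[1:]:
        doc_ids.append(doc_ids[-1] + g)
    return doc_ids

ids = [100002, 100007, 100008, 100011, 100019]
print(delta_encode(ids))  # [100002, 5, 1, 3, 8]
```

Because postings are sorted, the gaps are small positive integers, which a variable-length code (next slide) can store in far fewer bytes than the raw IDs.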
v-byte Encoding
• Use a variable number of bytes for each delta in the index
• Use fewer bits to encode small numbers
• High bit in each byte: 1 = terminate, 0 = continue
• Remaining 7 bits → binary number
• Examples:
• “6” → 10000110
• “127” → 11111111
• “128” → 00000001 10000000
• Real example sequence:
  10000101 00000001 10000010 10000111
  = 0000101 | 0000001 0000010 | 0000111
  = 5, 130, 7
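A minimal sketch of this scheme, following the slide's convention that a set high bit terminates a number (some implementations use the opposite convention):

```python
def vbyte_encode(n):
    """Encode one non-negative integer; 7 payload bits per byte."""
    out = [0x80 | (n & 0x7F)]   # last byte of the number: high bit set
    n >>= 7
    while n:
        out.append(n & 0x7F)    # continuation byte: high bit clear
        n >>= 7
    return out[::-1]            # most-significant byte first

def vbyte_decode(byte_stream):
    """Decode a stream of v-byte numbers back to integers."""
    numbers, n = [], 0
    for b in byte_stream:
        n = (n << 7) | (b & 0x7F)
        if b & 0x80:            # high bit set -> number is complete
            numbers.append(n)
            n = 0
    return numbers

print(vbyte_encode(128))  # [0b00000001, 0b10000000]
print(vbyte_decode([0b10000101, 0b00000001, 0b10000010, 0b10000111]))  # [5, 130, 7]
```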
Index Compression
• There are more sophisticated compression algorithms:
• Elias gamma code
• The more compression:
• the less storage, but the more processing
• In general:
• less I/O + more processing > more I/O + no processing
  (“>” = faster)
• With new data structures, the problem is less severe
Dictionary Data Structures
• The dictionary data structure stores the term
vocabulary, document frequency, pointers to each
postings list …
• For small collections, the full dictionary can be loaded into memory.
  In real life, the whole index cannot fit in memory!
• Then, what do we load?
• How do we reach it quickly?
• What data structure should we use for the inverted index?
Hashes
• Each vocabulary term is hashed to an integer
• Pros
• Lookup is faster than for a tree: O(1)
• Cons
• No easy way to find minor variants:
• judgment/judgement
• No prefix search
• If vocabulary keeps growing, need to occasionally do the
expensive operation of rehashing everything
Trees: Binary Search Tree
[Diagram: binary search tree over the lexicon — the root splits a–m / n–z;
a–m splits into a–hu / hy–m; n–z splits into n–sh / si–z]
Trees: B-tree
[Diagram: B-tree over the lexicon with children a–hu, hy–m, n–z]
Every internal node has a number of children in the interval [a,b],
where a, b are appropriate natural numbers, e.g., [2,4].
Trees
• Pros?
• Solves the prefix problem (terms starting with “ab”)
• Cons?
• Slower: O(log M) [and this requires a balanced tree]
• Rebalancing binary trees is expensive
• But B-trees mitigate the rebalancing problem
Wild-Card Queries: *
• mon*: find all docs containing any word beginning with “mon”
• Easy with a binary tree (or B-tree) lexicon
• *mon: find words ending in “mon” → harder
• Maintain an additional B-tree for terms written backwards
• How can we enumerate all terms matching the wild-card query pro*cent?
• Query processing: se*ate AND fil*er?
• Expensive
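The easy prefix case (mon*) can be sketched with a sorted list and binary search, which approximates the range scan a tree-structured lexicon gives you for free; the term list is invented for illustration:

```python
import bisect

# A toy lexicon, kept sorted so prefix queries become a contiguous range.
lexicon = sorted(["moan", "money", "monkey", "monster", "montreal", "moon", "much"])

def prefix_match(prefix):
    """Return all lexicon terms starting with `prefix` via two binary searches."""
    lo = bisect.bisect_left(lexicon, prefix)
    hi = bisect.bisect_left(lexicon, prefix + "\uffff")  # just past the prefix block
    return lexicon[lo:hi]

print(prefix_match("mon"))  # ['money', 'monkey', 'monster', 'montreal']
```

Note that "moon" is correctly excluded here; the n-gram approach later in the lecture needs an explicit post-filter to achieve the same.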
Permuterm Indexes
• Transform wild-card queries so that the *s occur at the end
• For the term hello, index it under:
• hello$, ello$h, llo$he, lo$hel, o$hell, $hello
  where $ is a special symbol
• Rotate the query's wild-card to the right
• Queries:
• X → lookup on X$
• X* → lookup on $X*
• *X → lookup on X$*
• X*Y → lookup on Y$X*
• Index size?
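A minimal permuterm sketch: index every rotation of term+"$", then rotate the query so its "*" lands at the end and do a prefix lookup. The term list is invented, and a real implementation would store the rotations in a B-tree rather than scanning a dict:

```python
def rotations(term):
    """All rotations of term + '$', e.g. hello -> hello$, ello$h, ..."""
    t = term + "$"
    return [t[i:] + t[:i] for i in range(len(t))]

permuterm = {}  # rotation -> original term (toy terms, no collisions here)
for term in ["hello", "help", "halo"]:
    for rot in rotations(term):
        permuterm[rot] = term

def permuterm_lookup(query):
    """Handle X, X*, *X, and X*Y by rotating the wildcard to the end."""
    if "*" not in query:
        key = query + "$"
    else:
        left, right = query.split("*", 1)
        key = right + "$" + left          # X*Y -> Y$X, then prefix match
    return sorted({t for rot, t in permuterm.items() if rot.startswith(key)})

print(permuterm_lookup("h*lo"))  # ['halo', 'hello']
```

The mid-word wildcard h*lo becomes the prefix lookup lo$h, matching the rotations lo$hel (hello) and lo$ha (halo).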
Character n-gram Indexes
• Enumerate all n-grams (sequences of n chars) occurring in any term
• e.g., from the text “April is the cruelest month” we get the 2-grams (bigrams)
  $a,ap,pr,ri,il,l$,$i,is,s$,$t,th,he,e$,$c,cr,ru,ue,el,le,es,st,t$,
  $m,mo,on,nt,th,h$
• $ is a special word-boundary symbol
• Maintain a second inverted index from bigrams to dictionary terms that match each bigram
• Character n-grams → terms
• Words → documents
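Both layers can be sketched in a few lines; the dictionary terms below are illustrative:

```python
from collections import defaultdict

def char_bigrams(word):
    """Character bigrams of a word, with '$' as the boundary symbol."""
    padded = "$" + word.lower() + "$"
    return [padded[i:i + 2] for i in range(len(padded) - 1)]

# Second-level index: character bigram -> dictionary terms containing it.
bigram_index = defaultdict(set)
for term in ["mace", "madden", "among", "almond", "amortize"]:
    for g in char_bigrams(term):
        bigram_index[g].add(term)

print(char_bigrams("april"))      # ['$a', 'ap', 'pr', 'ri', 'il', 'l$']
print(sorted(bigram_index["mo"])) # ['almond', 'among', 'amortize']
```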
Character n-gram Indexes
• The n-gram index finds terms based on a query consisting of n-grams (here n = 2)
[Diagram: a wild-card query is split into char bigrams (e.g., $m, mo, on);
the index of char bigrams maps these to possible terms (mace, madden, almond,
amortize, among, …); unmatching terms are filtered out; surviving terms (e.g.,
among) are looked up in the collection index of terms to retrieve documents]
Character n-gram Indexes: Query time
• Step 1: Query mon* → $m AND mo AND on
• It would still match moon!
• Step 2: Must post-filter these terms against the query
• Phrase match, or post-step-1 match
• Step 3: Surviving enumerated terms are then looked up in the term–document inverted index:
  Montreal OR monster OR monkey
• Wild-cards can result in expensive query execution (very large disjunctions…)
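Steps 1 and 2 can be sketched as follows; the term list is invented, and `fnmatch` stands in for the post-filtering step:

```python
from collections import defaultdict
from fnmatch import fnmatch

def char_bigrams(word):
    padded = "$" + word + "$"
    return [padded[i:i + 2] for i in range(len(padded) - 1)]

terms = ["montreal", "monster", "monkey", "moon", "mace"]
bigram_index = defaultdict(set)
for term in terms:
    for g in char_bigrams(term):
        bigram_index[g].add(term)

# Step 1: mon* -> $m AND mo AND on (intersect the bigram posting sets)
candidates = bigram_index["$m"] & bigram_index["mo"] & bigram_index["on"]
print(sorted(candidates))   # note the false positive "moon"

# Step 2: post-filter the candidates against the original wildcard query
survivors = sorted(t for t in candidates if fnmatch(t, "mon*"))
print(survivors)            # ['monkey', 'monster', 'montreal']
```

"moon" contains all three bigrams ($m, mo, on) yet does not start with "mon", which is exactly why the post-filter is required.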
Character n-gram Indexes: Applications
• Spelling correction
• Create an n-gram representation for words
• Build an index over words:
• Dictionary of words → documents (each word is a “document”)
• Character n-grams → terms
• When a search term looks misspelled (OOV or infrequent), find possible corrections
• Possible corrections = best-matching results

Query:   elepgant → $e el le ep pg ga an nt t$
Results: elegant  → $e el le eg ga an nt t$
         elephant → $e el le ep ph ha an nt t$
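This can be sketched by ranking dictionary words on bigram overlap with the misspelled query; using Jaccard similarity as the score is an assumption (any overlap measure works), and the dictionary is invented:

```python
def bigram_set(word):
    """Set of character bigrams of a word, '$'-padded at the boundaries."""
    padded = "$" + word + "$"
    return {padded[i:i + 2] for i in range(len(padded) - 1)}

def jaccard(a, b):
    """Overlap score between two bigram sets: |A∩B| / |A∪B|."""
    return len(a & b) / len(a | b)

dictionary = ["elegant", "elephant", "eggplant", "relevant"]
query_grams = bigram_set("elepgant")

# Rank candidate corrections by bigram overlap with the misspelled query.
ranked = sorted(dictionary,
                key=lambda w: jaccard(query_grams, bigram_set(w)),
                reverse=True)
print(ranked[:2])  # ['elegant', 'elephant']
```

Both slide candidates surface at the top: elegant shares 7 of 10 distinct bigrams with the query, elephant 7 of 11.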
Character n-gram Indexes: Applications
• Char n-grams can be used as direct index terms for some applications:
• Arabic IR, when no stemmer/segmenter is available
• Documents with spelling mistakes, e.g., OCR’ed documents
• A word's char representation can mix multiple n’s
• “elephant” → 2/3-grams:
  “$e el le ep ph ha an nt t$ $el ele lep eph pha han ant nt$”
• Arabic example: “الأبناء تصرفوا جيدا” (The children behaved well) and
  “أبناءها لطاف” (Her children are cute) share the char n-grams of the
  stem “بناء”, even though the surface words differ
• OCR example:
  Document: Elepbant → $e el le ep pb ba an nt t$
  Query:    Elephant → $e el le ep ph ha an nt t$
Summary
• An index can be multi-layered
• Extent index (multiple “terms” at one position in a document)
• Index does not have to be formed of words
• Character n-grams representation of words
• Two indexes are sometimes used
• Index of character n-grams to find matching words
• Index of terms to search for matched words
Resources
• Textbook 1: Intro to IR, Chapters 3.1 – 3.4
• Textbook 2: IR in Practice, Chapter 5