Lec.6 Inverted index
Page 1: lec6

Lec. 6: Inverted Index

Page 2: lec6

Inverted Index

The most common data structure used in both database management and information retrieval systems is the inverted file structure. Inverted file structures are composed of three basic files: the document file, the inversion lists (sometimes called posting files), and the dictionary.

The name “inverted file” comes from its underlying methodology of storing an inversion of the documents: for each word, a list of the documents in which that word is found is stored (the inversion list for that word).

Each document in the system is given a unique numerical identifier. It is that identifier that is stored in the inversion list.

The way to locate the inversion list for a particular word is via the Dictionary.

Page 3: lec6

Inverted Index

• The Dictionary is typically a sorted list of all unique words (processing tokens) in the system, each with a pointer to the location of its inversion list.

• Dictionaries can also store other information used in query optimization such as the length of inversion lists.
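A minimal in-memory sketch (not from the lecture) of how the dictionary and inversion lists relate, written in Python with illustrative names; a real disk-based system would keep the inversion lists in a separate postings file and store file offsets in the dictionary instead of in-memory lists.

    # Illustrative toy documents (the second example used later in the lecture).
    documents = {
        1: "now is the time for all good men to come to the aid of their country",
        2: "it was a dark and stormy night in the country manor the time was past midnight",
    }

    postings = {}                     # term -> inversion list of document IDs
    for doc_id, text in documents.items():
        for token in text.split():
            entry = postings.setdefault(token, [])
            if doc_id not in entry:
                entry.append(doc_id)

    # Dictionary: sorted unique terms, each with the length of its inversion list
    # (the kind of extra information a query optimizer can use).
    dictionary = {term: len(postings[term]) for term in sorted(postings)}

    print(dictionary["country"])      # 2 documents contain "country"
    print(postings["country"])        # its inversion list: [1, 2]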

Page 4: lec6

Inverted Index Construction

• Building steps (a small sketch of these steps follows the list):
  – Collect the documents
  – Text preprocessing
  – Construct an inverted index with dictionary and postings
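A hedged Python sketch of these steps, using the two short documents from the next pages; the helper name preprocess and the simple lowercase regex tokenizer are illustrative assumptions, not the lecture's actual preprocessing.

    import re

    docs = {                          # Step 1: collect documents (the slides' Doc 1 and Doc 2)
        1: "I did enact Julius Caesar I was killed i' the Capitol; Brutus killed me.",
        2: "So let it be with Caesar. The noble Brutus hath told you Caesar was ambitious.",
    }

    def preprocess(text):             # Step 2: text preprocessing (lowercase + crude tokenization)
        return re.findall(r"[a-z']+", text.lower())

    pairs = []                        # Step 3, first half: (modified token, document ID) pairs
    for doc_id, text in docs.items():
        for token in preprocess(text):
            pairs.append((token, doc_id))

    print(pairs[:5])   # [('i', 1), ('did', 1), ('enact', 1), ('julius', 1), ('caesar', 1)]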

Page 5: lec6

Indexer Steps

• The parsed documents yield a sequence of (modified token, document ID) pairs.

Doc 1: I did enact Julius Caesar I was killed i' the Capitol; Brutus killed me.

Doc 2: So let it be with Caesar. The noble Brutus hath told you Caesar was ambitious.

Resulting (term, doc #) pairs, in document order:
I 1, did 1, enact 1, julius 1, caesar 1, I 1, was 1, killed 1, i' 1, the 1, capitol 1, brutus 1, killed 1, me 1, so 2, let 2, it 2, be 2, with 2, caesar 2, the 2, noble 2, brutus 2, hath 2, told 2, you 2, caesar 2, was 2, ambitious 2

Page 6: lec6

• Sort the pairs by term. This is the core indexing step.

Before sorting (in document order):
I 1, did 1, enact 1, julius 1, caesar 1, I 1, was 1, killed 1, i' 1, the 1, capitol 1, brutus 1, killed 1, me 1, so 2, let 2, it 2, be 2, with 2, caesar 2, the 2, noble 2, brutus 2, hath 2, told 2, you 2, caesar 2, was 2, ambitious 2

After sorting by term:
ambitious 2, be 2, brutus 1, brutus 2, capitol 1, caesar 1, caesar 2, caesar 2, did 1, enact 1, hath 2, I 1, I 1, i' 1, it 2, julius 1, killed 1, killed 1, let 2, me 1, noble 2, so 2, the 1, the 2, told 2, you 2, was 1, was 2, with 2

Page 7: lec6

• Multiple term entries in a single document are merged.

• Frequency information is added.

Before merging (sorted term, doc # pairs):
ambitious 2, be 2, brutus 1, brutus 2, capitol 1, caesar 1, caesar 2, caesar 2, did 1, enact 1, hath 2, I 1, I 1, i' 1, it 2, julius 1, killed 1, killed 1, let 2, me 1, noble 2, so 2, the 1, the 2, told 2, you 2, was 1, was 2, with 2

After merging (term, doc #, term frequency):
ambitious 2 1, be 2 1, brutus 1 1, brutus 2 1, capitol 1 1, caesar 1 1, caesar 2 2, did 1 1, enact 1 1, hath 2 1, I 1 2, i' 1 1, it 2 1, julius 1 1, killed 1 2, let 2 1, me 1 1, noble 2 1, so 2 1, the 1 1, the 2 1, told 2 1, you 2 1, was 1 1, was 2 1, with 2 1

Page 8: lec6

• The result is split into a Dictionary file and a Postings file.

Dictionary (term, number of docs, collection frequency):
ambitious 1 1, be 1 1, brutus 2 2, capitol 1 1, caesar 2 3, did 1 1, enact 1 1, hath 1 1, I 1 2, i' 1 1, it 1 1, julius 1 1, killed 1 2, let 1 1, me 1 1, noble 1 1, so 1 1, the 2 2, told 1 1, you 1 1, was 2 2, with 1 1

Postings (term, doc #, frequency):
ambitious 2 1, be 2 1, brutus 1 1, brutus 2 1, capitol 1 1, caesar 1 1, caesar 2 2, did 1 1, enact 1 1, hath 2 1, I 1 2, i' 1 1, it 2 1, julius 1 1, killed 1 2, let 2 1, me 1 1, noble 2 1, so 2 1, the 1 1, the 2 1, told 2 1, you 2 1, was 1 1, was 2 1, with 2 1
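To make the sorting, merging, and splitting concrete, here is a hedged Python sketch that continues the pairs list from the sketch on the construction page; Counter and the variable names are illustrative, and because that tokenizer lowercases everything, its counts for "I" come out slightly differently than on the slide.

    from collections import Counter

    pairs.sort()                                  # core indexing step: sort by term, then doc ID

    tf = Counter(pairs)                           # (term, doc ID) -> within-document frequency

    dictionary = {}                               # term -> (number of docs, collection frequency)
    postings_file = []                            # (term, doc ID, freq) records, in sorted order
    for (term, doc_id), freq in sorted(tf.items()):
        n_docs, coll_freq = dictionary.get(term, (0, 0))
        dictionary[term] = (n_docs + 1, coll_freq + freq)
        postings_file.append((term, doc_id, freq))

    print(dictionary["caesar"])                   # (2, 3): in 2 docs, 3 occurrences overall
    print([p for p in postings_file if p[0] == "caesar"])   # [('caesar', 1, 1), ('caesar', 2, 2)]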

Page 9: lec6

Inverted File: An Example

• Documents are parsed to extract tokens.

• These are saved with the document ID.

Doc 1: Now is the time for all good men to come to the aid of their country

Doc 2: It was a dark and stormy night in the country manor. The time was past midnight

Resulting (term, doc #) pairs, in document order:
now 1, is 1, the 1, time 1, for 1, all 1, good 1, men 1, to 1, come 1, to 1, the 1, aid 1, of 1, their 1, country 1, it 2, was 2, a 2, dark 2, and 2, stormy 2, night 2, in 2, the 2, country 2, manor 2, the 2, time 2, was 2, past 2, midnight 2

Page 10: lec6

• After all documents have been parsed, the inverted file is sorted alphabetically.

Before sorting (in document order):
now 1, is 1, the 1, time 1, for 1, all 1, good 1, men 1, to 1, come 1, to 1, the 1, aid 1, of 1, their 1, country 1, it 2, was 2, a 2, dark 2, and 2, stormy 2, night 2, in 2, the 2, country 2, manor 2, the 2, time 2, was 2, past 2, midnight 2

After sorting:
a 2, aid 1, all 1, and 2, come 1, country 1, country 2, dark 2, for 1, good 1, in 2, is 1, it 2, manor 2, men 1, midnight 2, night 2, now 1, of 1, past 2, stormy 2, the 1, the 1, the 2, the 2, their 1, time 1, time 2, to 1, to 1, was 2, was 2

Page 11: lec6

• Multiple term entries for a single document are merged.

• Within-document term frequency information is compiled.

Before merging (sorted term, doc # pairs):
a 2, aid 1, all 1, and 2, come 1, country 1, country 2, dark 2, for 1, good 1, in 2, is 1, it 2, manor 2, men 1, midnight 2, night 2, now 1, of 1, past 2, stormy 2, the 1, the 1, the 2, the 2, their 1, time 1, time 2, to 1, to 1, was 2, was 2

After merging (term, doc #, frequency):
a 2 1, aid 1 1, all 1 1, and 2 1, come 1 1, country 1 1, country 2 1, dark 2 1, for 1 1, good 1 1, in 2 1, is 1 1, it 2 1, manor 2 1, men 1 1, midnight 2 1, night 2 1, now 1 1, of 1 1, past 2 1, stormy 2 1, the 1 2, the 2 2, their 1 1, time 1 1, time 2 1, to 1 2, was 2 2

Page 12: lec6

• Finally, the file can be split into:
  – a Dictionary (or Lexicon) file, and
  – a Postings file.

Page 13: lec6

Final Results: Dictionary/Lexicon and Postings

Merged file (term, doc #, freq):
a 2 1, aid 1 1, all 1 1, and 2 1, come 1 1, country 1 1, country 2 1, dark 2 1, for 1 1, good 1 1, in 2 1, is 1 1, it 2 1, manor 2 1, men 1 1, midnight 2 1, night 2 1, now 1 1, of 1 1, past 2 1, stormy 2 1, the 1 2, the 2 2, their 1 1, time 1 1, time 2 1, to 1 2, was 2 2

Dictionary/Lexicon (term, number of docs, total freq):
a 1 1, aid 1 1, all 1 1, and 1 1, come 1 1, country 2 2, dark 1 1, for 1 1, good 1 1, in 1 1, is 1 1, it 1 1, manor 1 1, men 1 1, midnight 1 1, night 1 1, now 1 1, of 1 1, past 1 1, stormy 1 1, the 2 4, their 1 1, time 2 2, to 1 2, was 1 2

Postings (doc #, freq), one entry per row of the merged file:
2 1, 1 1, 1 1, 2 1, 1 1, 1 1, 2 1, 2 1, 1 1, 1 1, 2 1, 1 1, 2 1, 2 1, 1 1, 2 1, 2 1, 1 1, 1 1, 2 1, 2 1, 1 2, 2 2, 1 1, 1 1, 2 1, 1 2, 2 2

Page 14: lec6

Inverted indexes

• Permit fast search for individual terms.

• For each term, you get a list consisting of:
  – document ID
  – frequency of the term in the doc (optional)
  – position of the term in the doc (optional)

• These lists can be used to solve Boolean queries (see the sketch below):
  – country -> d1, d2
  – manor -> d2
  – country AND manor -> d2

• Also used for statistical ranking algorithms.
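As a minimal, illustrative sketch (not from the slides), the AND query above can be answered by a merge walk over two sorted postings lists; the function name intersect and the hard-coded lists are assumptions for the example.

    # Intersect two sorted lists of document IDs (the classic merge walk).
    def intersect(p1, p2):
        answer = []
        i = j = 0
        while i < len(p1) and j < len(p2):
            if p1[i] == p2[j]:
                answer.append(p1[i])
                i += 1
                j += 1
            elif p1[i] < p2[j]:
                i += 1
            else:
                j += 1
        return answer

    country = [1, 2]     # postings list for "country"
    manor = [2]          # postings list for "manor"
    print(intersect(country, manor))   # -> [2], i.e. country AND manor -> d2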

Page 15: lec6

Web search

Page 16: lec6

Web Searching: Architecture

• Documents stored on many Web servers are indexed in a single central index (similar to a union catalog).

• The central index is implemented as a single system running on a very large number of computers.

• Examples: Google, Yahoo!

Page 17: lec6


Web Challenges for IR

• Distributed data: documents are spread over millions of different web servers.

• Volatile data: many documents change or disappear rapidly (e.g. dead links).

• Large volume: billions of separate documents.

• Unstructured and redundant data: no uniform structure, HTML errors, up to 30% (near-)duplicate documents.

• Quality of data: no editorial control; false information, poor-quality writing, typos, etc.

• Heterogeneous data: multiple media types (images, video), languages, character sets, etc.

Page 18: lec6

What is a Web Crawler?

• A web crawler (also known as a web spider) is a program for downloading web pages.

• Given an initial set of seed URLs, it recursively downloads every page that is linked from pages in the set.

• A focused web crawler downloads only those pages whose content satisfies some criterion.

Page 19: lec6


What’s wrong with the simple crawler?

• Scale: we need to distribute.

• We can’t index everything: we need to subselect. How?

• Duplicates: we need to integrate duplicate detection.

• Spam and spider traps: we need to integrate spam detection.

• Politeness: we need to be “nice” and space out all requests for a site over a longer period (hours, days).

• Freshness: we need to recrawl periodically. Because of the size of the web, we can do frequent recrawls only for a small subset; again, a subselection or prioritization problem.


Page 20: lec6


What a crawler must do

• Be robust: be immune to spider traps, duplicates, very large pages, very large websites, dynamic pages, etc.

• Be polite:
  – Don’t hit a site too often.
  – Only crawl pages you are allowed to crawl: robots.txt.

Page 21: lec6


Robots.txt

Protocol for giving crawlers (“robots”) limited access to a website, originally from 1994

Examples:

  User-agent: *
  Disallow: /yoursite/temp/

  User-agent: searchengine
  Disallow: /

Important: cache the robots.txt file of each site we are crawling
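As an illustrative aside (not on the slide), Python's standard urllib.robotparser module can fetch and query a site's robots.txt before crawling; the example.com URL is a placeholder.

    from urllib.robotparser import RobotFileParser

    # Fetch and parse the site's robots.txt once, then cache the parser object.
    rp = RobotFileParser()
    rp.set_url("https://example.com/robots.txt")
    rp.read()

    # Ask whether a given user-agent may fetch a given URL.
    print(rp.can_fetch("*", "https://example.com/yoursite/temp/page.html"))   # False if disallowed
    print(rp.can_fetch("*", "https://example.com/index.html"))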


Page 22: lec6


Robot Exclusion

• Web sites and pages can specify that robots should not crawl/index certain areas.

• Two components:
  – Robots Exclusion Protocol: site-wide specification of excluded directories.
  – Robots META tag: an individual document tag to exclude indexing or following links.

Page 23: lec6


Robots Exclusion Protocol

• The site administrator puts a “robots.txt” file at the root of the host’s web directory.
  – http://www.ebay.com/robots.txt
  – http://www.cnn.com/robots.txt

• The file is a list of excluded directories for a given robot (user-agent).
  – To exclude all robots from the entire site:
    User-agent: *
    Disallow: /

Page 24: lec6


Robot Exclusion Protocol Examples

Exclude specific directories:

  User-agent: *
  Disallow: /tmp/
  Disallow: /cgi-bin/
  Disallow: /users/paranoid/

Exclude a specific robot:

  User-agent: GoogleBot
  Disallow: /

Page 25: lec6


Spiders (Robots/Bots/Crawlers)

• Start with a comprehensive set of root URLs from which to start the search.

• Follow all links on these pages recursively to find additional pages.

• Index all newly found pages in an inverted index as they are encountered.

• May allow users to directly submit pages to be indexed (and crawled from).

Page 26: lec6


Search Strategies: Breadth-first Search

In graph theory, breadth-first search (BFS) is a graph search algorithm that begins at the root node and explores all of its neighboring nodes. Then, for each of those nearest nodes, it explores their unexplored neighbors, and so on, until it finds the goal.
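A minimal BFS sketch over a toy link graph (the graph, node names, and function name are invented for illustration):

    from collections import deque

    # Toy link graph: page -> pages it links to (hypothetical data).
    graph = {
        "root": ["a", "b"],
        "a": ["c", "d"],
        "b": ["e"],
        "c": [], "d": [], "e": ["root"],
    }

    def bfs(start):
        visited = {start}
        frontier = deque([start])          # FIFO queue -> breadth-first order
        order = []
        while frontier:
            node = frontier.popleft()
            order.append(node)
            for neighbor in graph.get(node, []):
                if neighbor not in visited:
                    visited.add(neighbor)
                    frontier.append(neighbor)
        return order

    print(bfs("root"))   # ['root', 'a', 'b', 'c', 'd', 'e']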

Page 27: lec6


Search Strategies (cont.): Depth-first Search

DFS on a binary tree is a specialized case of DFS on a general graph. The order of the search is down paths and from left to right: the root is examined first, then the left child of the root, then the left child of that node, and so on until a leaf is reached. At a leaf, the search backtracks to the nearest ancestor with an unexplored child and repeats.
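For comparison, a depth-first variant of the BFS sketch above only needs a LIFO stack instead of a FIFO queue (again illustrative, reusing the same toy graph):

    def dfs(start):
        visited = set()
        stack = [start]                    # LIFO stack -> depth-first order
        order = []
        while stack:
            node = stack.pop()
            if node in visited:
                continue
            visited.add(node)
            order.append(node)
            # Push neighbors in reverse so the leftmost link is explored first.
            for neighbor in reversed(graph.get(node, [])):
                if neighbor not in visited:
                    stack.append(neighbor)
        return order

    print(dfs("root"))   # ['root', 'a', 'c', 'd', 'b', 'e']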

Page 28: lec6


Search Strategy Trade-Offs

• Breadth-first search explores uniformly outward from the root page, but requires memory for all nodes on the previous level (exponential in depth). It is the standard spidering method.

• Depth-first search requires memory of only depth times branching factor (linear in depth), but can get “lost” pursuing a single thread.

• Both strategies can be implemented using a list of pending links (URLs): a FIFO queue for breadth-first, a LIFO stack for depth-first.

Page 29: lec6

Spidering Algorithm

Initialize queue Q with the initial set of known URLs.
Until Q is empty, or a page or time limit is exhausted:
  Pop URL L from the front of Q.
  If L does not point to an HTML page (.gif, .jpeg, .ps, .pdf, .ppt, …), continue the loop.
  If L has already been visited, continue the loop.
  Download page P for L.
  If P cannot be downloaded (e.g. 404 error, robot excluded), continue the loop.
  Index P (e.g. add it to the inverted index or store a cached copy).
  Parse P to obtain a list of new links N.
  Append N to the end of Q.
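A simplified, illustrative Python translation of this loop; it uses only the standard library, extracts links with a crude regular expression rather than a real HTML parser, and omits the robots.txt and politeness checks discussed earlier. The function name spider, the page_limit parameter, and the seed URL are assumptions.

    import re
    from collections import deque
    from urllib.request import urlopen
    from urllib.parse import urljoin

    NON_HTML = (".gif", ".jpeg", ".jpg", ".png", ".ps", ".pdf", ".ppt")

    def spider(seed_urls, page_limit=50):
        queue = deque(seed_urls)                 # Q: frontier of URLs to visit
        visited = set()
        index = {}                               # url -> downloaded page (stands in for real indexing)
        while queue and len(index) < page_limit:
            url = queue.popleft()                # pop URL L from the front of Q
            if url.lower().endswith(NON_HTML) or url in visited:
                continue
            visited.add(url)
            try:
                page = urlopen(url, timeout=5).read().decode("utf-8", errors="replace")
            except OSError:                      # e.g. 404 error or unreachable host
                continue
            index[url] = page                    # "index" P (cached copy here)
            # Parse P for new links N (very naive href extraction).
            links = re.findall(r'href="([^"#]+)"', page)
            queue.extend(urljoin(url, link) for link in links)
        return index

    # Example with a hypothetical seed: spider(["https://example.com/"])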

Page 30: lec6

Restricting Spidering

• Restrict the spider to a particular site: remove links to other sites from Q.

• Restrict the spider to a particular directory: remove links not in the specified directory.

• Obey page-owner restrictions (robot exclusion).

