Intelligent Information Retrieval CS 336 Xiaoyan Li Spring 2006 Modified from Lisa Ballesteros’s...

Intelligent Information Retrieval

CS 336

Xiaoyan Li

Spring 2006

Modified from Lisa Ballesteros’s slides

What is Information Retrieval?

• Includes the following:
  – Organization
  – Storage/Representation
  – Manipulation/Analysis
  – Search/Retrieval

• How far back in history can we find examples?

IR Through the Ages

• 3rd Century BCE
  – Library of Alexandria
    • 500,000 volumes
    • catalogs and classifications
• 13th Century A.D.
  – First concordance of the Bible
    • What is a concordance?
• 15th Century A.D.
  – Invention of printing
• 1600
  – University of Oxford Library
    • All books printed in England

IR Through the Ages

• 1755
  – Johnson’s Dictionary
    • Set standard for dictionaries
    • Included common language
    • Helped standardize spelling
• 1800
  – Library of Congress
• 1828
  – Webster’s Dictionary
    • Significantly larger than previous dictionaries
    • Standardized American spelling
• 1852
  – Roget’s Thesaurus

IR Through the Ages

• 1876
  – Dewey Decimal Classification
• 1880’s
  – Carnegie Public Libraries
    • 1,681 built (first public library 1850)
• 1930’s
  – Punched card retrieval systems
• 1940’s
  – Bush’s Memex
  – Shannon’s Communication Theory
  – Zipf’s “Law”

Historical Summary

• 1960’s
  – Basic advances in retrieval and indexing techniques
• 1970’s
  – Probabilistic and vector space models
  – Clustering, relevance feedback
  – Large, on-line, Boolean information services
  – Fast string matching
• 1980’s
  – Natural Language Processing and IR
  – Expert systems and IR
  – Off-the-shelf IR systems

IR Through the Ages

• Late 1980’s
  – First mini-computer and PC systems incorporating “relevance ranking”
• Early 1990’s
  – Information storage revolution
• 1992
  – First large-scale information service incorporating probabilistic retrieval (West’s legal retrieval system)

IR Through the Ages

• Mid 1990’s to present
  – Multimedia databases
• 1994 to present
  – The Internet and Web explosion
    • e.g. Google, Yahoo, Lycos, Infoseek (now Go)
• 1995 to present
  – Digital Libraries
  – Data Mining
  – Agents and Filtering
  – Knowledge and Distributed Intelligence
  – Information Organization
  – Knowledge Management

Historical Summary

• 1990’s
  – Large-scale, full-text IR and filtering experiments and systems (TREC)
  – Dominance of ranking
  – Many web-based retrieval engines
  – Interfaces and browsing
  – Multimedia and multilingual
  – Machine learning techniques

Trends in IR Technology

[Figure: growth of on-line information over time. Horizontal axis: Time (marks at 1970 and 1990). Vertical axis: On-line Information (Gigabytes, Terabytes, Petabytes).]

• Eras shown along the time axis: Batch systems, Interactive systems, Database systems, Cheap storage, Internet, Multimedia
• Technologies shown: Boolean Retrieval and Filtering, Ranked Retrieval, Distributed Retrieval, Concept-Based Retrieval, Image and Video Retrieval, Information Extraction, Visualization, Summarization, Data Mining, Ranked Filtering

A 1-page Word document without any images takes about 10 kilobytes (KB) of disk space. 1 terabyte = one hundred million imageless Word documents. 1 petabyte = one thousand terabytes.
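The arithmetic behind these figures can be checked with a short sketch. It assumes decimal units (1 KB = 10^3 bytes, 1 TB = 10^12 bytes) and the rough 10 KB per document estimate; the constant and function names are illustrative only.

```python
# Back-of-the-envelope storage arithmetic, assuming decimal (SI) units.
KB = 10**3
TB = 10**12
PB = 10**15

DOC_SIZE_BYTES = 10 * KB  # rough estimate for a 1-page, image-free document


def docs_that_fit(capacity_bytes: int, doc_size_bytes: int = DOC_SIZE_BYTES) -> int:
    """Return how many documents of the given size fit in the capacity."""
    return capacity_bytes // doc_size_bytes


print(docs_that_fit(TB))  # 100000000 (one hundred million)
print(PB // TB)           # 1000 terabytes per petabyte
print(docs_that_fit(PB))  # 100000000000 (one hundred billion)
```

With binary units (1 KB = 1024 bytes) the counts shift slightly, but the orders of magnitude stay the same.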

Historical Summary

• The Future
  – Logic-based IR?
  – NLP?
  – Integration with other functionality
  – Distributed, heterogeneous database access
  – IR in context
  – “Anytime, Anywhere”

Information Retrieval

• Ad Hoc Retrieval
  – Given a query and a large database of text objects, find the relevant objects
• Distributed Retrieval
  – Many distributed databases
• Information Filtering
  – Given a text object from an information stream (e.g. newswire) and many profiles (long-term queries), decide which profiles match
• Multimedia Retrieval
  – Databases of other types of unstructured data, e.g. images, video, audio
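Information filtering reverses the ad hoc setup: the queries (profiles) are stored and the documents stream past. The following is a minimal sketch, assuming a profile matches when every one of its terms occurs in the incoming text. That Boolean AND rule, the profile names, and the sample story are all hypothetical; real filtering systems typically score each profile and apply a threshold.

```python
import re


def tokenize(text: str) -> set[str]:
    """Lowercase the text and return its set of alphanumeric word tokens."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))


def matching_profiles(document: str, profiles: dict[str, set[str]]) -> list[str]:
    """Return names of profiles whose terms all occur in the document.

    Requiring every term is an assumed (Boolean AND) matching rule.
    """
    words = tokenize(document)
    return sorted(name for name, terms in profiles.items() if terms <= words)


# Hypothetical long-term queries (profiles) and one incoming newswire item.
profiles = {
    "rates": {"interest", "rates"},
    "libraries": {"library", "digital"},
    "markets": {"stock", "market"},
}
story = "The central bank left interest rates unchanged as the stock market rallied."

print(matching_profiles(story, profiles))  # ['markets', 'rates']
```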

Information Retrieval

• Multilingual Retrieval
  – Retrieval in a language other than English
• Cross-language Retrieval
  – Query in one language (e.g. Spanish), retrieve documents in other languages (e.g. Chinese, French, and Spanish)

What does an IR system do?

• Generate a representation of each document
  – essentially pick the best words and/or phrases
• Generate query representation
  – if documents are processed specially, queries must be processed the same way
  – possibly weight query words
• Match queries and documents
  – find relevant documents
• Perhaps, rank and sort documents
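The four steps above can be sketched end to end. This is a toy illustration, not a description of any particular system: it assumes a bag-of-words representation and a score that sums how often each query word occurs in a document. The function names and sample documents are made up for the example.

```python
import re
from collections import Counter


def represent(text: str) -> Counter:
    """Bag-of-words representation: lowercase word tokens with their counts."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def score(query_rep: Counter, doc_rep: Counter) -> int:
    """Sum the document's counts of each query word (a deliberately simple match)."""
    return sum(doc_rep[word] for word in query_rep)


def search(query: str, docs: dict[str, str]) -> list[tuple[str, int]]:
    """Represent, match, then rank: return (doc_id, score) pairs, best first."""
    query_rep = represent(query)  # queries get the same processing as documents
    scored = [(doc_id, score(query_rep, represent(text))) for doc_id, text in docs.items()]
    matches = [(doc_id, s) for doc_id, s in scored if s > 0]
    # Sort by descending score, breaking ties by document id.
    return sorted(matches, key=lambda pair: (-pair[1], pair[0]))


docs = {
    "d1": "Library catalogs organize books. A library stores books.",
    "d2": "Search engines rank web pages.",
    "d3": "The library search desk helps readers.",
}
print(search("library search", docs))  # [('d1', 2), ('d3', 2), ('d2', 1)]
```

Note that d1 ties with d3 even though it never mentions "search", which hints at why raw counting alone is a weak ranking signal.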


• Text Representation (Indexing)
  – given a text document, identify the concepts that describe the content and how well they describe it
    • what makes a “good” representation?
    • how is a representation generated from text?
    • what are retrievable objects and how are they organized?
• Representing an Information Need (Query Formulation)
  – describe and refine information needs as explicit queries
    • what is an appropriate query language?
    • how can interactive query formulation and refinement be supported?


• Comparing Representations (Retrieval)
  – compare text and information need representations to determine which documents are likely to be relevant
    • what is a “good” model of retrieval?
    • how is uncertainty represented?
• Evaluating Retrieved Text (Feedback)
  – present documents for user evaluation and modify query based on feedback
    • what are good metrics?
    • what constitutes a good experimental testbed?

Information Retrieval and Filtering

[Diagram: Information Need → Representation → Query; Text Objects → Representation → Indexed Objects; Query and Indexed Objects → Comparison → Retrieved Objects → Evaluation/Feedback]

Features of a Modern IR Product

• Effective “relevance ranking”
• Simple free text (“natural language”) query capability
• Boolean and proximity operators
• Term weighting
• Query formulation assistance
• Query by example
• Filtering
• Field-based retrieval
• Distributed architecture
• Index anything
• Fast retrieval
• Information Organization

Typical Systems

• IR systems
  – Verity, Fulcrum, Excalibur
• Database systems
  – Oracle, Informix
• Web search and In-house systems
  – West, LEXIS/NEXIS, Dialog
  – Yahoo, Google, MSN, AskJeeves

IR vs. Database Systems

• Emphasis on effective, efficient retrieval of unstructured data

• IR systems typically have very simple schemas

• Query languages emphasize free text, although Boolean combinations of words are also common

IR vs. Database Systems

• Matching is more complex than with structured data (semantics less obvious)
  – easy to retrieve the wrong objects
  – need to measure accuracy of retrieval

• Less focus on concurrency control and recovery, although update is very important

Ambiguity Complicates the Task

• Synonyms: many ways to express a concept
  – lorry/truck, elevator/lift, pump/impeller, hypertension/high blood pressure
  – failure to use specific words => failure to get doc
• Words have many meanings
  – How many different meanings are there for “bank”?

Ambiguity Complicates the Task

• Difficult to Specify Important but Vague Concepts
  – e.g. will interest rates be raised in the next six months?
• Spelling variants / spelling errors

Basic Automatic Indexing

• Parse documents to recognize structure
  – e.g. title, date, other fields
• Scan for word tokens
  – numbers, special characters, hyphenation, capitalization, etc.
  – languages like Chinese need segmentation
  – record positional information for proximity operators
• Stopword removal
  – based on short list of common words such as “the”, “and”, “or”
  – saves storage overhead of very long indexes
  – can be dangerous (e.g. “Mr. The”, “and-or gates”)
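The token scanning, stopword removal, and positional recording steps can be sketched as a small positional inverted index. This is an illustrative sketch only: the tokenizer simply lowercases and splits on non-alphanumeric characters, the stopword list is the three example words, and `build_index` and `within` are hypothetical helper names.

```python
import re
from collections import defaultdict

STOPWORDS = {"the", "and", "or"}  # the short example list of common words


def tokenize(text: str) -> list[str]:
    """Lowercase and split on non-alphanumeric characters (hyphens split words)."""
    return re.findall(r"[a-z0-9]+", text.lower())


def build_index(docs: dict[str, str]) -> dict[str, dict[str, list[int]]]:
    """Map each non-stopword term to {doc_id: [token positions]}."""
    index: dict[str, dict[str, list[int]]] = defaultdict(lambda: defaultdict(list))
    for doc_id, text in docs.items():
        for position, token in enumerate(tokenize(text)):
            if token not in STOPWORDS:  # positions still count the removed words
                index[token][doc_id].append(position)
    return index


def within(index, term_a: str, term_b: str, max_gap: int) -> list[str]:
    """Return doc ids where term_a and term_b occur at most max_gap tokens apart."""
    postings_a, postings_b = index.get(term_a, {}), index.get(term_b, {})
    hits = []
    for doc_id in postings_a.keys() & postings_b.keys():
        if any(abs(a - b) <= max_gap for a in postings_a[doc_id] for b in postings_b[doc_id]):
            hits.append(doc_id)
    return sorted(hits)


docs = {
    "d1": "The inverted index records the position of each word.",
    "d2": "An index of words and an unrelated note on position.",
}
index = build_index(docs)
print(within(index, "index", "position", 3))  # ['d1']
print("the" in index)                         # False
```

Under this stopword list a query for “and-or gates” keeps only “gates”, which is the kind of loss the warning about stopword removal refers to.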

Basic Automatic Indexing

• Stem words
  – group word variants such as plurals via morphological processing
    • computer, computers, computing, computed, computation, computerized, computerize, computerizable
  – can make mistakes but generally preferred
• Optional
  – phrase indexing
  – thesaurus classes
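To make the stemming idea concrete, here is a deliberately crude suffix stripper. It is not the Porter stemmer or any other published algorithm; the suffix list and minimum stem length are assumptions chosen only to show both the grouping effect and the kind of mistakes a stemmer can make.

```python
SUFFIXES = ("ation", "ing", "ers", "er", "ed", "s")  # illustrative, checked in order
MIN_STEM = 3  # never leave a stem shorter than this


def toy_stem(word: str) -> str:
    """Strip the first matching suffix; a crude stand-in for a real stemmer."""
    word = word.lower()
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= MIN_STEM:
            return word[: -len(suffix)]
    return word


words = ["computer", "computers", "computing", "computed", "computation", "computerize"]
for w in words:
    print(w, "->", toy_stem(w))
```

The first five words all reduce to "comput", so they would share one index entry. "computerize" is left unchanged (a missed grouping), and `toy_stem("news")` returns "new" (a false grouping), which shows both directions in which stemming can go wrong.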

How do you rank results?

• What does it mean for a document to be important/relevant?
  – Even human assessors do not agree with each other.
• Word matching is imperfect, so how do we decide which documents are most important?

How do you rank results?

• How do we decide which documents are most important?
  – Count words
    • high frequency words indicate document “aboutness”
  – Weight infrequent corpus words more strongly
    • can be strong signifiers of meaning; easier to partition
  – Determine meaning by analyzing text surrounding a word
  – Give extra weight to title words, etc.
  – Make sense of references given, citations received, etc.
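The first two heuristics (count words, and weight words that are rare in the corpus more strongly) are commonly combined as a tf-idf score. The sketch below assumes one simple variant, raw term frequency times log(N/df); many other weightings exist, and the function name and sample documents are illustrative.

```python
import math
import re
from collections import Counter


def tokenize(text: str) -> list[str]:
    """Lowercase the text and return its alphanumeric word tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())


def tf_idf_rank(query: str, docs: dict[str, str]) -> list[tuple[str, float]]:
    """Score each document by summing tf * idf over the query terms."""
    term_counts = {doc_id: Counter(tokenize(text)) for doc_id, text in docs.items()}
    n_docs = len(docs)
    scores = {}
    for doc_id, counts in term_counts.items():
        total = 0.0
        for term in set(tokenize(query)):
            df = sum(1 for c in term_counts.values() if term in c)  # document frequency
            if df:
                idf = math.log(n_docs / df)  # rarer in the corpus -> larger weight
                total += counts[term] * idf
        scores[doc_id] = total
    # Sort by descending score, breaking ties by document id.
    return sorted(scores.items(), key=lambda item: (-item[1], item[0]))


docs = {
    "d1": "the bank raised interest rates",
    "d2": "the river bank flooded the bank road",
    "d3": "the interest in the river walk grew",
}
for doc_id, s in tf_idf_rank("bank rates", docs):
    print(doc_id, round(s, 3))
```

Here d1 outranks d2 even though d2 mentions "bank" twice, because "rates" occurs in only one document and so carries more weight. A word such as "the", which appears in every document, gets an idf of zero and contributes nothing.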

Free Text Search Engines

• Different engines use different ranking strategies (often a trade secret)
  – Word frequency
  – Placement in document
  – Popularity of document
  – Number of links to document
  – Business relationships, etc.

Announcement:

• Writing assignment: due next Monday
  – Create 10 topics/queries and search on three popular Web search engines: Google.com, Yahoo.com and ask.com. Write a report to compare the three search engines and discuss why IR is so hard.

• Next Lecture: Query languages. (Ch. 4)
