Intelligent Information Retrieval
CS 336
Xiaoyan Li
Spring 2006
Modified from Lisa Ballesteros’s slides
What is Information Retrieval?
• Includes the following:
– Organization
– Storage/Representation
– Manipulation/Analysis
– Search/Retrieval
• How far back in history can we find examples?
IR Through the Ages
• 3rd Century BCE
– Library of Alexandria
• 500,000 volumes
• catalogs and classifications
• 13th Century A.D.
– First concordance of the Bible
• What is a concordance?
• 15th Century A.D.
– Invention of printing
• 1600
– University of Oxford Library
• All books printed in England
IR Through the Ages
• 1755
– Johnson’s Dictionary
• Set standard for dictionaries
• Included common language
• Helped standardize spelling
• 1800
– Library of Congress
• 1828
– Webster’s Dictionary
• Significantly larger than previous dictionaries
• Standardized American spelling
• 1852
– Roget’s Thesaurus
IR Through the Ages
• 1876
– Dewey Decimal Classification
• 1880’s
– Carnegie Public Libraries
• 1,681 built (first public library 1850)
• 1930’s
– Punched card retrieval systems
• 1940’s
– Bush’s Memex
– Shannon’s Communication Theory
– Zipf’s “Law”
Historical Summary
• 1960’s
– Basic advances in retrieval and indexing techniques
• 1970’s
– Probabilistic and vector space models
– Clustering, relevance feedback
– Large, on-line, Boolean information services
– Fast string matching
• 1980’s
– Natural Language Processing and IR
– Expert systems and IR
– Off-the-shelf IR systems
IR Through the Ages
• Late 1980’s
– First mini-computer and PC systems incorporating “relevance ranking”
• Early 1990’s – information storage revolution
• 1992
– First large-scale information service incorporating probabilistic retrieval (West’s legal retrieval system)
IR Through the Ages
• Mid 1990’s to present
– Multimedia databases
• 1994 to present
– The Internet and Web explosion
• e.g. Google, Yahoo, Lycos, Infoseek (now Go)
• 1995 to present
– Digital Libraries
– Data Mining
– Agents and Filtering
– Knowledge and Distributed Intelligence
– Information Organization
– Knowledge Management
Historical Summary
• 1990’s
– Large-scale, full-text IR and filtering experiments and systems (TREC)
– Dominance of ranking
– Many web-based retrieval engines
– Interfaces and browsing
– Multimedia and multilingual
– Machine learning techniques
Trends in IR Technology
[Figure: volume of on-line information (gigabytes, terabytes, petabytes) plotted against time (1970, 1990, ...). Eras along the time axis: batch systems, interactive systems, database systems, cheap storage, Internet, multimedia. Technologies shown: Boolean retrieval and filtering, ranked retrieval, distributed retrieval, concept-based retrieval, image and video retrieval, information extraction, visualization, summarization, data mining, ranked filtering.]
A 1-page Word document without any images = ~10 kilobytes (KB) of disk space.
1 terabyte = one hundred million image-free Word documents.
1 petabyte = one thousand terabytes.
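The arithmetic behind these figures can be checked with a short sketch. It assumes decimal units (1 KB = 1,000 bytes, 1 TB = 10^12 bytes) and the slide's rough estimate of 10 KB per page; the constant names are illustrative only.

```python
# Rough storage arithmetic, assuming decimal (SI) units.
KB = 10**3
TB = 10**12
PB = 10**15

# The slide's estimate for a 1-page Word document without images.
DOC_SIZE = 10 * KB

docs_per_terabyte = TB // DOC_SIZE
terabytes_per_petabyte = PB // TB

print(f"{docs_per_terabyte:,} documents per terabyte")
print(f"{terabytes_per_petabyte:,} terabytes per petabyte")
```

With binary units (1 KB = 1,024 bytes) the counts shift slightly, but the orders of magnitude stay the same.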
Historical Summary
• The Future
– Logic-based IR?
– NLP?
– Integration with other functionality
– Distributed, heterogeneous database access
– IR in context
– “Anytime, Anywhere”
Information Retrieval
• Ad Hoc Retrieval
– Given a query and a large database of text objects, find the relevant objects
• Distributed Retrieval
– Many distributed databases
• Information Filtering
– Given a text object from an information stream (e.g. newswire) and many profiles (long-term queries), decide which profiles match
• Multimedia Retrieval
– Databases of other types of unstructured data, e.g. images, video, audio
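Information filtering reverses the ad hoc setup: the profiles are fixed and the documents arrive one at a time. The sketch below is a minimal illustration, assuming a profile is just a set of terms and that a profile "matches" when enough of its terms occur in the document. The names `matches`, `route`, and `min_hits`, and the sample profiles, are hypothetical.

```python
def matches(profile_terms, document_text, min_hits=1):
    """Return True if at least `min_hits` profile terms occur in the document."""
    doc_terms = set(document_text.lower().split())
    return len(profile_terms & doc_terms) >= min_hits


def route(document_text, profiles, min_hits=2):
    """Return the names of all profiles that the incoming document matches."""
    return sorted(
        name
        for name, terms in profiles.items()
        if matches(terms, document_text, min_hits)
    )


# Long-term queries (profiles), here reduced to sets of terms.
profiles = {
    "finance": {"interest", "rates", "bank"},
    "sports": {"football", "league", "goal"},
}

# One item from the incoming stream.
story = "Central bank signals interest rates may rise"
print(route(story, profiles))
```

A real filtering system would typically score each profile and apply a threshold learned from feedback, rather than count raw term overlaps.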
Information Retrieval
• Multilingual Retrieval
– Retrieval in a language other than English
• Cross-language Retrieval
– Query in one language (e.g. Spanish), retrieve documents in other languages (e.g. Chinese, French, and Spanish)
What does an IR system do?
• Generate a representation of each document
– essentially pick best words and/or phrases
• Generate query representation
– if documents processed specially, queries must also be
– possibly weight query words
• Match queries and documents
– find relevant documents
• Perhaps rank and sort documents
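The four steps above can be sketched end to end. This is a minimal illustration, assuming a bag-of-words representation and a score that sums query weight times document term count; the function names (`represent`, `score`, `search`) and the sample documents are hypothetical, not a specific system's design.

```python
from collections import Counter


def represent(text):
    """Bag-of-words representation: lowercase terms with their counts."""
    return Counter(text.lower().split())


def score(query_rep, doc_rep):
    """Sum, over query terms, of query weight times document term count."""
    return sum(weight * doc_rep[term] for term, weight in query_rep.items())


def search(query, documents):
    """Match the query against every document and rank the matches."""
    query_rep = represent(query)  # queries get the same processing as documents
    scored = [
        (score(query_rep, represent(text)), doc_id)
        for doc_id, text in documents.items()
    ]
    # Keep only documents that matched; highest score first, ties by id.
    ranked = sorted(scored, key=lambda pair: (-pair[0], pair[1]))
    return [doc_id for s, doc_id in ranked if s > 0]


documents = {
    "d1": "information retrieval finds relevant documents",
    "d2": "database systems store structured data",
    "d3": "retrieval of text documents and retrieval of images",
}
print(search("text retrieval", documents))
```

Here `d3` ranks above `d1` because it matches both query terms and mentions "retrieval" twice, while `d2` matches nothing and is dropped.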
Information Retrieval
• Text Representation (Indexing)
– given a text document, identify the concepts that describe the content and how well they describe it
• what makes a “good” representation?
• how is a representation generated from text?
• what are retrievable objects and how are they organized?
• Representing an Information Need (Query Formulation)
– describe and refine information needs as explicit queries
• what is an appropriate query language?
• how can interactive query formulation and refinement be supported?
Information Retrieval
• Comparing Representations (Retrieval)
– compare text and information need representations to determine which documents are likely to be relevant
• what is a “good” model of retrieval?
• how is uncertainty represented?
• Evaluating Retrieved Text (Feedback)
– present documents for user evaluation and modify query based on feedback
• what are good metrics?
• what constitutes a good experimental testbed?
Information Retrieval and Filtering
[Diagram: Information Need → Representation → Query; Text Objects → Representation → Indexed Objects; Query and Indexed Objects → Comparison → Retrieved Objects → Evaluation/Feedback]
Features of a Modern IR Product
• Effective “relevance ranking”
• Simple free text (“natural language”) query capability
• Boolean and proximity operators
• Term weighting
• Query formulation assistance
• Query by example
• Filtering
• Field-based retrieval
• Distributed architecture
• Index anything
• Fast retrieval
• Information Organization
Typical Systems
• IR systems
– Verity, Fulcrum, Excalibur
• Database systems
– Oracle, Informix
• Web search and In-house systems
– West, LEXIS/NEXIS, Dialog
– Yahoo, Google, MSN, AskJeeves
IR vs. Database Systems
• Emphasis on effective, efficient retrieval of unstructured data
• IR systems typically have very simple schemas
• Query languages emphasize free text, although Boolean combinations of words are also common
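Boolean combinations of words map naturally onto set operations over an inverted index. The following is a small sketch under that assumption; `build_index` and the three sample documents are made up for illustration.

```python
from collections import defaultdict


def build_index(documents):
    """Inverted index: each term maps to the set of ids of documents containing it."""
    index = defaultdict(set)
    for doc_id, text in documents.items():
        for term in text.lower().split():
            index[term].add(doc_id)
    return index


documents = {
    1: "interest rates rise",
    2: "river bank erosion",
    3: "bank raises interest rates",
}
index = build_index(documents)

both = index["bank"] & index["interest"]  # bank AND interest
either = index["bank"] | index["river"]   # bank OR river
without = index["bank"] - index["river"]  # bank AND NOT river

print(sorted(both), sorted(either), sorted(without))
```

Boolean retrieval returns an unranked set; a document either satisfies the expression or it does not, which is one reason ranked free-text queries became the common default.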
IR vs. Database Systems
• Matching is more complex than with structured data (semantics less obvious)
– easy to retrieve the wrong objects
– need to measure accuracy of retrieval
• Less focus on concurrency control and recovery, although update is very important
Ambiguity Complicates the Task
• Synonyms: many ways to express a concept
– lorry/truck, elevator/lift, pump/impeller, hypertension/high blood pressure
– failure to use specific words => failure to get doc
• Words have many meanings
– How many different meanings are there for “bank”?
Ambiguity Complicates the Task
• Difficult to Specify Important but Vague Concepts
– e.g. will interest rates be raised in the next six months?
• Spelling variants/spelling errors
Basic Automatic Indexing
• Parse documents to recognize structure
– e.g. title, date, other fields
• Scan for word tokens
– numbers, special characters, hyphenation, capitalization, etc.
– languages like Chinese need segmentation
– record positional information for proximity operators
• Stopword removal
– based on short list of common words such as “the”, “and”, “or”
– saves storage overhead of very long indexes
– can be dangerous (e.g. “Mr. The”, “and-or gates”)
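Tokenization, positional recording, and stopword removal can be sketched together. This is a simplified illustration: the regular-expression tokenizer, the tiny `STOPWORDS` list, and the `within` proximity helper are assumptions made for the example, and positions are counted before stopwords are dropped so that distances stay meaningful.

```python
import re
from collections import defaultdict

STOPWORDS = {"the", "and", "or", "of", "a"}  # illustrative short list


def tokenize(text):
    """Lowercase the text and split on anything that is not a letter or digit."""
    return re.findall(r"[a-z0-9]+", text.lower())


def positional_index(documents):
    """Map term -> doc id -> list of token positions, skipping stopwords."""
    index = defaultdict(lambda: defaultdict(list))
    for doc_id, text in documents.items():
        for position, token in enumerate(tokenize(text)):
            if token not in STOPWORDS:
                index[token][doc_id].append(position)
    return index


def within(index, first, second, max_distance):
    """Doc ids where `second` occurs at most `max_distance` tokens after `first`."""
    hits = set()
    for doc_id, first_positions in index.get(first, {}).items():
        second_positions = index.get(second, {}).get(doc_id, [])
        if any(
            0 < q - p <= max_distance
            for p in first_positions
            for q in second_positions
        ):
            hits.add(doc_id)
    return hits


documents = {
    "d1": "The retrieval of information and the storage of text",
    "d2": "Information about storage; retrieval comes later",
}
index = positional_index(documents)
print(within(index, "retrieval", "information", 2))

# The slide's warning in action: only "gates" survives stopword removal.
print([t for t in tokenize("and-or gates") if t not in STOPWORDS])
```

Because "and" and "or" are on the stopword list, a query for "and-or gates" is reduced to "gates", which is exactly the risk the slide points out.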
Basic Automatic Indexing
• Stem words
– group word variants such as plurals via morphological processing
• computer, computers, computing, computed, computation, computerized, computerize, computerizable
– can make mistakes but generally preferred
• Optional
– phrase indexing
– thesaurus classes
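A toy suffix stripper shows both how stemming groups the variants listed above and how it can go wrong. This is not the Porter stemmer or any standard algorithm; the suffix list and the `min_stem` length guard are assumptions chosen only to make the example work.

```python
# Suffixes to strip, ordered longest first. Purely illustrative.
SUFFIXES = (
    "ization", "izable", "ation", "ized",
    "ize", "ing", "ers", "er", "ed", "es", "s",
)


def naive_stem(word, min_stem=4):
    """Strip suffixes repeatedly, keeping at least `min_stem` letters of stem."""
    word = word.lower()
    stripped = True
    while stripped:
        stripped = False
        for suffix in SUFFIXES:
            if word.endswith(suffix) and len(word) - len(suffix) >= min_stem:
                word = word[: -len(suffix)]
                stripped = True
                break
    return word


variants = [
    "computer", "computers", "computing", "computed",
    "computation", "computerized", "computerize", "computerizable",
]
print({naive_stem(w) for w in variants})

# A typical mistake: unrelated senses collapse to the same stem.
print(naive_stem("organization"), naive_stem("organs"))
```

All eight variants collapse to one stem, but "organization" and "organs" also end up sharing a stem, which is the kind of error the slide has in mind.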
How do you rank results?
• What does it mean for a document to be important/relevant?
– Even human assessors do not agree with each other.
• Word matching is imperfect; how do we decide which documents are most important?
How do you rank results?
• How do we decide which documents are most important?
– Count words
• high frequency words indicate document “aboutness”
– Weight infrequent corpus words more strongly
• can be strong signifiers of meaning; easier to partition
– Determine meaning by analyzing text surrounding a word
– Give extra weight to title words, etc.
– Make sense of references given, citations received, etc.
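The first two ideas, counting words and weighting rare corpus words more strongly, are commonly combined as a TF-IDF score. The sketch below assumes one simple variant (raw term frequency times the natural log of N over document frequency); the function name `tf_idf_rank` and the sample documents are hypothetical.

```python
import math
from collections import Counter


def tf_idf_rank(query, documents):
    """Rank document ids by summed tf * idf over the query terms."""
    tokenized = {doc_id: text.lower().split() for doc_id, text in documents.items()}
    n_docs = len(tokenized)
    # Document frequency: in how many documents does each term occur?
    df = Counter(term for tokens in tokenized.values() for term in set(tokens))

    scores = {}
    for doc_id, tokens in tokenized.items():
        tf = Counter(tokens)
        scores[doc_id] = sum(
            tf[term] * math.log(n_docs / df[term])
            for term in query.lower().split()
            if term in df  # skip query terms the corpus has never seen
        )
    # Highest score first; ties broken by document id.
    return sorted(scores, key=lambda doc_id: (-scores[doc_id], doc_id))


documents = {
    "d1": "the bank raised interest rates",
    "d2": "the river bank flooded the town",
    "d3": "the town library lends the books",
}
print(tf_idf_rank("the interest rates", documents))
```

The word "the" occurs in every document, so its weight is log(1) = 0 and it contributes nothing, even though it is the most frequent word in `d2` and `d3`. The rare terms "interest" and "rates" decide the ranking.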
Free Text Search Engines
• Different engines use different ranking strategies (often a trade secret)
– Word frequency
– Placement in document
– Popularity of document
– Number of links to document
– Business relationships, etc.
Announcement:
• Writing assignment: due next Monday
– Create 10 topics/queries and search on three popular Web search engines: Google.com, Yahoo.com and ask.com. Write a report to compare the three search engines and discuss why IR is so hard.
• Next Lecture: Query languages. (Ch. 4)