Structure of IR Systems
LBSC 796/INFM 718R
Session 1, September 10, 2007
Doug Oard
Agenda
• Teaching theater orientation
• The structure of interactive IR systems
• Course overview
What is IR?
• Information retrieval is a problem-oriented discipline, concerned with the problem of the effective and efficient transfer of desired information between human generator and human user
Anomalous States of Knowledge as a Basis for Information Retrieval. (1980) Nicholas J. Belkin. Canadian Journal of Information Science, 5, 133-143.
Information Retrieval Systems
• Information
– What is "information"?
• Retrieval
– What do we mean by "retrieval"?
– What are the different types of information needs?
• Systems
– How do computer systems fit into the human information seeking process?
Information Hierarchy
Data → Information → Knowledge → Wisdom
(more refined and abstract as you move up)
Information Hierarchy
• Data
– The raw material of information
• Information
– Data organized and presented in a particular manner
• Knowledge
– "Justified true belief"
– Information that can be acted upon
• Wisdom
– Distilled and integrated knowledge
– Demonstrative of high-level "understanding"
What Do We Mean by "Information"?
• How is it different from "data"?
– Information is data in context
• Databases contain data and produce information
• IR systems contain and provide information
• How is it different from "knowledge"?
– Knowledge is a basis for making decisions
• Many "knowledge bases" contain decision rules
A (Facetious) Example
• Data
– 98.6º F, 99.5º F, 100.3º F, 101º F, …
• Information
– Hourly body temperature: 98.6º F, 99.5º F, 100.3º F, 101º F, …
• Knowledge
– If you have a temperature above 100º F, you most likely have a fever
• Wisdom
– If you don't feel well, go see a doctor
What types of information?
• Text
• Structured documents (e.g., XML)
• Images
• Audio (sound effects, songs, etc.)
• Video
• Programs
• Services
What Do We Mean by “Retrieval?”
• Find something that you want
– The information need may or may not be explicit
• Known item search
– Find the class home page
• Answer seeking
– Is Lexington or Louisville the capital of Kentucky?
• Directed exploration
– Who makes videoconferencing systems?
Relevance
• Relevance relates a topic and a document
– Duplicates are equally relevant, by definition
– Constant over time and across users
• Pertinence relates a task and a document
– Accounts for quality, complexity, language, …
• Utility relates a user and a document
– Accounts for prior knowledge
Systems: The Memex
What Do People Search For?
• Searchers often don't clearly understand
– The problem they are trying to solve
– What information is needed to solve the problem
– How to ask for that information
• The query results from a clarification process
• Dervin's "sense making": Need → Gap → Bridge
Taylor’s Model of Question Formation
Q1 Visceral Need
Q2 Conscious Need
Q3 Formalized Need
Q4 Compromised Need (Query)
(Figure: the four question levels span end-user search through intermediated search.)
Types of Information Needs
• Retrospective ("Retrieval")
– "Searching the past"
– Different queries posed against a static collection
– Time invariant
• Prospective ("Filtering")
– "Searching the future"
– Static query posed against a dynamic collection
– Time dependent
Design Strategies
• Foster human-machine synergy
– Exploit complementary strengths
– Accommodate shared weaknesses
• Divide-and-conquer
– Divide task into stages with well-defined interfaces
– Continue dividing until problems are easily solved
• Co-design related components
– Iterative process of joint optimization
Divide and Conquer
• Strategy: use encapsulation to limit complexity
• Approach:
– Define interfaces (input and output) for each component
– Define the functions performed by each component
– Study each component in isolation
– Repeat the process within components as needed
– Make sure that this decomposition makes sense
• Result: a hierarchical decomposition
Process/System Co-Design
Human-Machine Synergy
• Machines are good at:
– Doing simple things accurately and quickly
– Scaling to larger collections in sublinear time
• People are better at:
– Accurately recognizing what they are looking for
– Evaluating intangibles such as "quality"
• Both are pretty bad at:
– Mapping consistently between words and concepts
Supporting the Search Process
(Figure: the search process. Source Selection leads to Query Formulation; the resulting Query goes to Search, which returns a Ranked List; Selection picks a Document for Examination and then Document Delivery. Loops support query reformulation, relevance feedback, and source reselection, with the searcher's steps labeled Nominate, Choose, and Predict. The IR system supports the Search step.)
Supporting the Search Process

(Figure: the same search process, with the system side expanded: Acquisition builds the Collection, and Indexing builds the Index used by Search.)

• Study the IR black box in isolation
– Simple behavior: in goes a query, out come documents
– Optimize the choice of documents that come out
Where to Start?
(Figure: the IR black box. In goes a Query to Search; out comes a Ranked List of Hits drawn from the Documents.)
Inside The IR Black Box
(Figure: inside the black box, one Representation Function turns the Query into a Query Representation and another turns the Documents into Document Representations held in an Index; a Comparison Function matches the two to produce the Hits.)
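As a concrete (and deliberately simplified) sketch of the black box, the fragment below implements a representation function, a comparison function, and a ranked search over a toy collection. All function names and the scoring rule are illustrative, not from the lecture.

```python
from collections import Counter

def represent(text):
    """Representation function: lowercase, split on whitespace, count terms."""
    return Counter(text.lower().split())

def compare(query_rep, doc_rep):
    """Comparison function: how many query term occurrences the document matches."""
    return sum(min(query_rep[t], doc_rep[t]) for t in query_rep)

def search(query, documents):
    """The black box: in goes a query, out comes a ranked list of documents."""
    q = represent(query)
    scored = [(compare(q, represent(d)), d) for d in documents]
    return [d for score, d in sorted(scored, reverse=True) if score > 0]

docs = ["the quick brown fox", "now is the time"]
print(search("quick fox", docs))   # ['the quick brown fox']
```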
Search Component Model
(Figure: on the query-processing side, an Information Need is expressed through Query Formulation as a Query, which a Representation Function turns into a Query Representation; on the document-processing side, a Representation Function turns each Document into a Document Representation stored in an Index. A Comparison Function matches the two to produce a Retrieval Status Value; Human Judgment of Utility relates the results back to the Information Need.)
What about databases?
• What are examples of databases?
– Banks storing account information
– Retailers storing inventories
– Universities storing student grades
• What exactly is a (relational) database?
– Think of them as a collection of tables
– They model some aspect of "the world"
A (Simple) Database Example
Student Table
Student ID  Last Name  First Name  Department ID  email
1           Arrows     John        EE             jarrows@wam
2           Peters     Kathy       HIST           kpeters2@wam
3           Smith      Chris       HIST           smith2002@glue
4           Smith      John        CLIS           js03@wam

Department Table
Department ID  Department
EE             Electrical Engineering
HIST           History
CLIS           Information Studies

Course Table
Course ID  Course Name
lbsc690    Information Technology
ee750      Communication
hist405    American History

Enrollment Table
Student ID  Course ID  Grade
1           lbsc690    90
1           ee750      95
2           lbsc690    95
2           hist405    80
3           hist405    90
4           lbsc690    98
Database Queries
• What would you want to know from a database?
– What classes is John Arrows enrolled in?
– Who has the highest grade in LBSC 690?
– Who's in the history department?
– Of all the non-CLIS students taking LBSC 690 with a last name shorter than six characters who were born on a Monday, who has the longest email address?
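The example database and the first query above can be sketched with SQLite. The schema, table names, and the subset of rows used here are assumptions for illustration.

```python
import sqlite3

# Build an in-memory version of (part of) the example database
con = sqlite3.connect(":memory:")
con.executescript("""
CREATE TABLE student (id INTEGER, last TEXT, first TEXT, dept TEXT, email TEXT);
CREATE TABLE course (id TEXT, name TEXT);
CREATE TABLE enrollment (student_id INTEGER, course_id TEXT, grade INTEGER);
INSERT INTO student VALUES (1, 'Arrows', 'John', 'EE', 'jarrows@wam');
INSERT INTO course VALUES ('lbsc690', 'Information Technology'),
                          ('ee750', 'Communication');
INSERT INTO enrollment VALUES (1, 'lbsc690', 90), (1, 'ee750', 95);
""")

# "What classes is John Arrows enrolled in?"
rows = con.execute("""
    SELECT c.name
    FROM student s
    JOIN enrollment e ON e.student_id = s.id
    JOIN course c ON c.id = e.course_id
    WHERE s.last = 'Arrows' AND s.first = 'John'
    ORDER BY c.name
""").fetchall()
print([name for (name,) in rows])   # ['Communication', 'Information Technology']
```

Note how the answer is exact and formally correct, which is the contrast with IR drawn below.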
Databases vs. IR
• What we're retrieving
– Databases: structured data; clear semantics based on a formal model
– IR: mostly unstructured; free text with some metadata
• Queries we're posing
– Databases: formally (mathematically) defined queries; unambiguous
– IR: vague, imprecise information needs (often expressed in natural language)
• Results we get
– Databases: exact; always correct in a formal sense
– IR: sometimes relevant, often not
• Interaction with system
– Databases: one-shot queries
– IR: interaction is important
• Other issues
– Databases: concurrency, recovery, atomicity are all critical
– IR: these issues are downplayed
“Bag of Terms” Representation
• Bag = a "set" that can contain duplicates
"The quick brown fox jumped over the lazy dog's back"
{back, brown, dog, fox, jump, lazy, over, quick, the, the}
• Vector = values recorded in any consistent order
{back, brown, dog, fox, jump, lazy, over, quick, the, the}
[1 1 1 1 1 1 1 1 2]
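The bag and vector above can be reproduced in a few lines of Python. The term list is taken from the slide, with normalization (lowercasing, stemming "jumped" to "jump", dropping the possessive) assumed to have already happened.

```python
from collections import Counter

# Terms of "The quick brown fox jumped over the lazy dog's back",
# pre-normalized as on the slide
terms = ["the", "quick", "brown", "fox", "jump",
         "over", "the", "lazy", "dog", "back"]

bag = Counter(terms)        # a "set" that can contain duplicates
vocab = sorted(bag)         # any consistent order works; here, alphabetical
vector = [bag[t] for t in vocab]

print(vocab)    # ['back', 'brown', 'dog', 'fox', 'jump', 'lazy', 'over', 'quick', 'the']
print(vector)   # [1, 1, 1, 1, 1, 1, 1, 1, 2]
```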
Bag of Terms Example
Document 1: The quick brown fox jumped over the lazy dog's back.
Document 2: Now is the time for all good men to come to the aid of their party.
Term    Doc 1  Doc 2
the       1      1
quick     1      0
brown     1      0
fox       1      0
over      1      0
lazy      1      0
dog       1      0
back      1      0
now       0      1
is        0      1
time      0      1
for       0      1
all       0      1
good      0      1
men       0      1
to        0      1
come      0      1
jump      1      0
aid       0      1
of        0      1
their     0      1
party     0      1
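The term-document table above can be built directly from the two example documents; this sketch assumes the same pre-normalized term lists.

```python
# The two example documents, pre-normalized (lowercased, "jumped" stemmed,
# possessive dropped) as in the bag-of-terms example
doc1 = "the quick brown fox jump over the lazy dog back".split()
doc2 = "now is the time for all good men to come to the aid of their party".split()

vocab = sorted(set(doc1) | set(doc2))
# 1 if the term occurs in the document, 0 otherwise
matrix = {t: (int(t in doc1), int(t in doc2)) for t in vocab}

print(matrix["the"])   # (1, 1): in both documents
print(matrix["fox"])   # (1, 0): only in Document 1
print(matrix["aid"])   # (0, 1): only in Document 2
```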
Stopword List
Advantages of Ranked Retrieval
• Closer to the way people think
– Some documents are better than others
• Enriches browsing behavior
– Decide how far down the list to go as you read it
• Allows more flexible queries
– Long and short queries can produce useful results
Counting Terms
• Terms tell us about documents
– If "rabbit" appears a lot, the document may be about rabbits
• Documents tell us about terms
– "the" is in every document, so it is not discriminating
• Documents are most likely described well by rare terms that occur in them frequently
– Higher "term frequency" is stronger evidence
– Low "document frequency" makes it stronger still
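These two signals combine in the classic tf-idf weight; the exact formula varies across systems, and the version below is just one common form, shown for illustration.

```python
import math

def tf_idf(tf, df, n_docs):
    """Term frequency (tf) scaled by inverse document frequency, log(N / df).
    Terms in every document get weight 0; rare-but-frequent terms score high."""
    return tf * math.log(n_docs / df)

# "the": very frequent, but in every one of 1000 documents -> weight 0.0
print(tf_idf(tf=50, df=1000, n_docs=1000))
# "rabbit": frequent in this document, rare in the collection -> high weight
print(tf_idf(tf=10, df=5, n_docs=1000))
```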
Document Length Normalization
• Long documents have an unfair advantage
– They use a lot of terms, so they get more matches than short documents
– And they use the same words repeatedly, so they have much higher term frequencies
• Normalization seeks to remove these effects
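One common way to remove the length effect is cosine (unit-length) normalization; this sketch is illustrative rather than the specific scheme discussed in class.

```python
import math

def normalize(tf_vector):
    """Scale a term-frequency vector to unit length (cosine normalization)."""
    length = math.sqrt(sum(tf * tf for tf in tf_vector))
    return [tf / length for tf in tf_vector]

short = normalize([2, 1])      # a short document
long_ = normalize([20, 10])    # ten times longer, same word proportions
# After normalization the two documents look identical
print(all(math.isclose(a, b) for a, b in zip(short, long_)))   # True
```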
Problems with “Free Text” Search
• Homonymy
– Terms may have many unrelated meanings
– Polysemy (related meanings) is less of a problem
• Synonymy
– Many ways of saying (nearly) the same thing
• Anaphora
– Alternate ways of referring to the same thing
Two Ways of Searching
(Figure: two ways of searching. In content-based query-document matching, the author writes the document using terms to convey meaning, and the free-text searcher constructs a query from terms that may appear in documents; query terms are matched against document terms to produce a retrieval status value. In metadata-based query-document matching, an indexer chooses appropriate concept descriptors, and the controlled vocabulary searcher constructs a query from the available concept descriptors; query descriptors are matched against document descriptors.)
Problems with Controlled Vocabulary
• New concepts
• Users and indexers may think differently
• Using thesauri effectively requires training
(Table: observable behavior categories, each applied at a minimum scope of Segment, Object, or Class)
• Examine: view, listen, select
• Retain: print, bookmark, save, purchase, delete, subscribe
• Reference: copy/paste, quote, forward, reply, link, cite
• Annotate: mark up, rate, publish, organize
Some Examples
• Read/Ignored, Saved/Deleted, Replied to
(Stevens, 1993)
• Reading time
(Morita & Shinoda, 1994; Konstan et al., 1997)
• Hypertext Link
(Brin & Page, 1998)
Estimating Authority from Links
(Figure: a hub page linking to two authority pages.)
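The hub/authority idea can be sketched with a few iterations of mutual reinforcement, as in Kleinberg's HITS algorithm: a page is a good authority if good hubs point to it, and a good hub if it points to good authorities. The tiny link graph below is invented for illustration.

```python
# Invented link graph: two hubs pointing at two candidate authorities
links = {"hub1": ["auth1", "auth2"], "hub2": ["auth1"],
         "auth1": [], "auth2": []}

hub = {p: 1.0 for p in links}
auth = {p: 1.0 for p in links}
for _ in range(20):  # repeat the mutual-reinforcement update until scores settle
    auth = {p: sum(hub[q] for q in links if p in links[q]) for p in links}
    hub = {p: sum(auth[q] for q in links[p]) for p in links}
    a_norm = sum(v * v for v in auth.values()) ** 0.5   # keep scores bounded
    h_norm = sum(v * v for v in hub.values()) ** 0.5
    auth = {p: v / a_norm for p, v in auth.items()}
    hub = {p: v / h_norm for p, v in hub.items()}

# auth1 has two in-links, auth2 only one, so auth1 gets the higher score
print(auth["auth1"] > auth["auth2"])   # True
```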
Problems with Observed Behavior
• Protecting privacy
– What absolute assurances can we provide?
– How can we make remaining risks understood?
• Scalable rating servers
– Is a fully distributed architecture practical?
• Non-cooperative users
– How can the effect of spamming be limited?
Putting It All Together
(Table: comparing three sources of evidence, Free Text, Behavior, and Metadata, on topicality, quality, reliability, cost, and flexibility.)
The Big Picture
• Four factors, working together
– User
– Process
– System
– Collection
Course Goals
• Appreciate IR system capabilities and limitations
• Understand IR system design & implementation
– For a broad range of applications and media
• Evaluate IR system performance
• Identify current IR research problems
Course Design
• Text/readings provide background and detail
– At least one recommended reading is required
• Class provides organization and direction
– We will not cover every important detail
• Assignments and project provide experience
– The TA can help with the project
• Final exam helps focus your effort
Assumed Background
• Everyone:
– LBSC 690 or INFM 603 or equivalent
– Comfortable with learning about technology
• MIM students:
– Basic systems analysis, scripting languages
– Some programming is helpful
• MLS students:
– LBSC 650 and LBSC 670
– LBSC 750 or a subject access course is helpful
Grading
• Assignments (20%)
– Mastery of concepts and experience using tools
• Term project (50%)
– Options are described on the course Web page
• Final exam (30%)
– In-class exam
Handy Things to Know
• Classes will be videotaped
– Available outside my office
• Office hours: 5 PM Mondays
– Or schedule by email, or ask after class
• Everything is on the Web
– At http://www.glue.umd.edu/~oard
• Doug is most easily reached by email
– [email protected]
Some Things to Do This Week
• Assignment 1
– Due at 6 PM next Monday!!
• At least skim the readings before class
– Don't fall behind!
• Explore the Web site
– Start thinking about the term project