Structure of IR Systems LBSC 796/INFM 718R Session 1, September 10, 2007 Doug Oard.


Structure of IR Systems

LBSC 796/INFM 718R

Session 1, September 10, 2007

Doug Oard

Agenda

• Teaching theater orientation

• The structure of interactive IR systems

• Course overview

What is IR?

• Information retrieval is a problem-oriented discipline, concerned with the problem of the effective and efficient transfer of desired information between human generator and human user

Anomalous States of Knowledge as a Basis for Information Retrieval. (1980) Nicholas J. Belkin. Canadian Journal of Information Science, 5, 133-143.

Information Retrieval Systems

• Information
  – What is “information”?
• Retrieval
  – What do we mean by “retrieval”?
  – What are the different types of information needs?
• Systems
  – How do computer systems fit into the human information seeking process?

Information Hierarchy

[Diagram: Data → Information → Knowledge → Wisdom, each level more refined and abstract than the last]

Information Hierarchy

• Data
  – The raw material of information
• Information
  – Data organized and presented in a particular manner
• Knowledge
  – “Justified true belief”
  – Information that can be acted upon
• Wisdom
  – Distilled and integrated knowledge
  – Demonstrative of high-level “understanding”

What do We Mean by “Information?”

• How is it different from “data”?
  – Information is data in context
    • Databases contain data and produce information
    • IR systems contain and provide information
• How is it different from “knowledge”?
  – Knowledge is a basis for making decisions
    • Many “knowledge bases” contain decision rules

A (Facetious) Example

• Data
  – 98.6° F, 99.5° F, 100.3° F, 101° F, …
• Information
  – Hourly body temperature: 98.6° F, 99.5° F, 100.3° F, 101° F, …
• Knowledge
  – If you have a temperature above 100° F, you most likely have a fever
• Wisdom
  – If you don’t feel well, go see a doctor

What types of information?

• Text

• Structured documents (e.g., XML)

• Images

• Audio (sound effects, songs, etc.)

• Video

• Programs

• Services

What Do We Mean by “Retrieval?”

• Find something that you want
  – The information need may or may not be explicit
• Known item search
  – Find the class home page
• Answer seeking
  – Is Lexington or Louisville the capital of Kentucky?
• Directed exploration
  – Who makes videoconferencing systems?

Relevance

• Relevance relates a topic and a document
  – Duplicates are equally relevant, by definition
  – Constant over time and across users
• Pertinence relates a task and a document
  – Accounts for quality, complexity, language, …
• Utility relates a user and a document
  – Accounts for prior knowledge

Systems: The Memex

What Do People Search For?

• Searchers often don’t clearly understand
  – The problem they are trying to solve
  – What information is needed to solve the problem
  – How to ask for that information
• The query results from a clarification process
• Dervin’s “sense making” [diagram labels: Need, Gap, Bridge]

Taylor’s Model of Question Formation

Q1 Visceral Need

Q2 Conscious Need

Q3 Formalized Need

Q4 Compromised Need (Query)

[Diagram side labels: End-user Search; Intermediated Search]

Types of Information Needs

• Retrospective (“Retrieval”)
  – “Searching the past”
  – Different queries posed against a static collection
  – Time invariant
• Prospective (“Filtering”)
  – “Searching the future”
  – Static query posed against a dynamic collection
  – Time dependent
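
The contrast between the two kinds of need can be sketched in a few lines of Python. This is a toy illustration only: the function names (`matches`, `retrieve`, `filter_stream`), the sample documents, and the queries are all hypothetical, and matching is plain term containment rather than any method from the course.

```python
def matches(query, document):
    """Toy matcher: true if every query term occurs in the document."""
    doc_terms = set(document.lower().split())
    return all(term in doc_terms for term in query.lower().split())


def retrieve(query, collection):
    """Retrospective: a new query is run against a static collection."""
    return [doc for doc in collection if matches(query, doc)]


def filter_stream(profile, stream):
    """Prospective: one standing query is applied to documents as they arrive."""
    for doc in stream:
        if matches(profile, doc):
            yield doc


# Retrieval: the collection stays fixed while the queries change.
collection = ["fox hunting laws", "lazy dog breeds", "quick bread recipes"]
adhoc_hits = [retrieve(q, collection) for q in ("dog", "quick bread")]

# Filtering: the query stays fixed while new documents keep arriving.
arriving = iter(["new dog park opens", "bread prices rise", "dog show results"])
filtered = list(filter_stream("dog", arriving))
```

In the first case the loop is over queries; in the second it is over incoming documents, which is why filtering is time dependent.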

Design Strategies

• Foster human-machine synergy
  – Exploit complementary strengths
  – Accommodate shared weaknesses
• Divide-and-conquer
  – Divide task into stages with well-defined interfaces
  – Continue dividing until problems are easily solved
• Co-design related components
  – Iterative process of joint optimization

Divide and Conquer

• Strategy: use encapsulation to limit complexity
• Approach:
  – Define interfaces (input and output) for each component
  – Define the functions performed by each component
  – Study each component in isolation
  – Repeat the process within components as needed
  – Make sure that this decomposition makes sense
• Result: a hierarchical decomposition

Process/System Co-Design

Human-Machine Synergy

• Machines are good at:
  – Doing simple things accurately and quickly
  – Scaling to larger collections in sublinear time
• People are better at:
  – Accurately recognizing what they are looking for
  – Evaluating intangibles such as “quality”
• Both are pretty bad at:
  – Mapping consistently between words and concepts

Supporting the Search Process

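[Diagram: Source Selection → Query Formulation → (Query) → Search → (Ranked List) → Selection → (Document) → Examination → (Document) → Delivery. The Search stage is labeled IR System. Feedback loops: Query Reformulation and Relevance Feedback; Source Reselection. Other labels: Nominate, Choose, Predict.]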

Supporting the Search Process

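[Diagram: Source Selection → Query Formulation → (Query) → Search → (Ranked List) → Selection → (Document) → Examination → (Document) → Delivery. The Search stage is labeled IR System. Feeding the Search stage: Acquisition → (Collection) → Indexing → (Index).]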

Where to Start?

• Study the IR black box in isolation
  – Simple behavior: in goes a query, out come documents
  – Optimize the choice of documents that come out

[Diagram: Query → Search → Ranked List]

The IR Black Box

[Diagram: Query and Documents go into the black box; Hits come out]

Inside The IR Black Box

[Diagram: Query → Representation Function → Query Representation. Documents → Representation Function → Document Representation → Index. Query Representation and Index → Comparison Function → Hits.]
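
The components named on this slide can be sketched as three small Python functions. This is a minimal sketch under stated assumptions, not the course's implementation: the names `represent`, `compare`, and `search` are hypothetical, the representation is a bag of lowercase whitespace-separated terms, and the comparison simply counts shared terms.

```python
from collections import Counter


def represent(text):
    """Representation function: map text to a bag of lowercase terms."""
    return Counter(text.lower().split())


def compare(query_rep, doc_rep):
    """Comparison function: count the query terms shared with the document."""
    return sum(min(count, doc_rep[term]) for term, count in query_rep.items())


def search(query, documents):
    """The black box: a query and documents go in, ranked hits come out."""
    # The index is just the stored document representations.
    index = {doc_id: represent(text) for doc_id, text in documents.items()}
    query_rep = represent(query)
    scores = {doc_id: compare(query_rep, rep) for doc_id, rep in index.items()}
    hits = [doc_id for doc_id, score in scores.items() if score > 0]
    # Best score first; ties are broken by document id for a stable order.
    return sorted(hits, key=lambda doc_id: (-scores[doc_id], doc_id))


documents = {
    "d1": "the quick brown fox",
    "d2": "the lazy dog",
    "d3": "quick quick fox",
}
hits = search("quick fox", documents)
```

The same representation function is applied to the query and to the documents here; the diagram allows them to differ.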

Search Component Model

[Diagram components: Information Need → Query Formulation → Query → Representation Function → Query Representation (side label: Query Processing). Document → Representation Function → Document Representation (side label: Document Processing). Comparison Function → Retrieval Status Value. Human Judgment → Utility.]

What about databases?

• What are examples of databases?
  – Banks storing account information
  – Retailers storing inventories
  – Universities storing student grades
• What exactly is a (relational) database?
  – Think of them as a collection of tables
  – They model some aspect of “the world”

A (Simple) Database Example

Student Table

Student ID  Last Name  First Name  Department ID  email
1           Arrows     John        EE             jarrows@wam
2           Peters     Kathy       HIST           kpeters2@wam
3           Smith      Chris       HIST           smith2002@glue
4           Smith      John        CLIS           js03@wam

Department Table

Department ID  Department
EE             Electrical Engineering
HIST           History
CLIS           Information Studies

Course Table

Course ID  Course Name
lbsc690    Information Technology
ee750      Communication
hist405    American History

Enrollment Table

Student ID  Course ID  Grade
1           lbsc690    90
1           ee750      95
2           lbsc690    95
2           hist405    80
3           hist405    90
4           lbsc690    98

Database Queries

• What would you want to know from a database?
  – What classes is John Arrows enrolled in?
  – Who has the highest grade in LBSC 690?
  – Who’s in the history department?
  – Of all the non-CLIS students taking LBSC 690 who have a last name shorter than six characters and were born on a Monday, who has the longest email address?
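
The first three questions can be posed as SQL against the example tables. The sketch below uses Python's built-in `sqlite3` module with an in-memory database; the table and column names (`student`, `course`, `enrollment`, `student_id`, and so on) are assumptions chosen for this illustration, and the rows are copied from the example tables. The student's name is spelled "Arrows", as in the Student Table.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE student (student_id INTEGER, last_name TEXT, first_name TEXT,
                      department_id TEXT, email TEXT);
CREATE TABLE course (course_id TEXT, course_name TEXT);
CREATE TABLE enrollment (student_id INTEGER, course_id TEXT, grade INTEGER);

INSERT INTO student VALUES
  (1, 'Arrows', 'John',  'EE',   'jarrows@wam'),
  (2, 'Peters', 'Kathy', 'HIST', 'kpeters2@wam'),
  (3, 'Smith',  'Chris', 'HIST', 'smith2002@glue'),
  (4, 'Smith',  'John',  'CLIS', 'js03@wam');
INSERT INTO course VALUES
  ('lbsc690', 'Information Technology'),
  ('ee750',   'Communication'),
  ('hist405', 'American History');
INSERT INTO enrollment VALUES
  (1, 'lbsc690', 90), (1, 'ee750', 95), (2, 'lbsc690', 95),
  (2, 'hist405', 80), (3, 'hist405', 90), (4, 'lbsc690', 98);
""")

# What classes is John Arrows enrolled in?
classes = conn.execute("""
    SELECT c.course_name
    FROM student s
    JOIN enrollment e ON e.student_id = s.student_id
    JOIN course c ON c.course_id = e.course_id
    WHERE s.first_name = 'John' AND s.last_name = 'Arrows'
    ORDER BY c.course_name
""").fetchall()

# Who has the highest grade in LBSC 690?
top_student = conn.execute("""
    SELECT s.first_name, s.last_name, e.grade
    FROM student s
    JOIN enrollment e ON e.student_id = s.student_id
    WHERE e.course_id = 'lbsc690'
    ORDER BY e.grade DESC
    LIMIT 1
""").fetchone()

# Who's in the history department?
historians = conn.execute("""
    SELECT first_name, last_name
    FROM student
    WHERE department_id = 'HIST'
    ORDER BY student_id
""").fetchall()
```

The last question on the slide cannot be answered from these tables at all, since no birth date is stored.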

Databases vs. IR

What we’re retrieving
  IR: Mostly unstructured. Free text with some metadata.
  Databases: Structured data. Clear semantics based on a formal model.

Queries we’re posing
  IR: Vague, imprecise information needs (often expressed in natural language).
  Databases: Formally (mathematically) defined queries. Unambiguous.

Results we get
  IR: Sometimes relevant, often not.
  Databases: Exact. Always correct in a formal sense.

Interaction with system
  IR: Interaction is important.
  Databases: One-shot queries.

Other issues
  IR: Issues downplayed.
  Databases: Concurrency, recovery, atomicity are all critical.

“Bag of Terms” Representation

• Bag = a “set” that can contain duplicates
  “The quick brown fox jumped over the lazy dog’s back”
  → {back, brown, dog, fox, jump, lazy, over, quick, the, the}
• Vector = values recorded in any consistent order
  {back, brown, dog, fox, jump, lazy, over, quick, the, the}
  → [1 1 1 1 1 1 1 1 2]
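
The bag and the vector on this slide can be reproduced with Python's `collections.Counter`. The `normalize` function below is a hypothetical, deliberately crude stand-in for a real stemmer: it only strips a possessive "'s" and a trailing "ed", which is just enough to map "jumped" to "jump" and "dog's" to "dog" for this one sentence.

```python
import re
from collections import Counter


def normalize(token):
    """Crude stand-in for stemming (an assumption, not a real stemmer)."""
    if token.endswith("'s"):
        token = token[:-2]
    if token.endswith("ed") and len(token) > 4:
        token = token[:-2]
    return token


text = "The quick brown fox jumped over the lazy dog's back"
tokens = re.findall(r"[a-z']+", text.lower())

# Bag: each distinct term with the number of times it occurs.
bag = Counter(normalize(token) for token in tokens)

# Vector: the counts recorded in a consistent (here alphabetical) term order.
vocabulary = sorted(bag)
vector = [bag[term] for term in vocabulary]
```

Alphabetical order is only one consistent choice; any fixed ordering of the vocabulary works as long as every document uses the same one.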

Bag of Terms Example

Document 1: The quick brown fox jumped over the lazy dog’s back.

Document 2: Now is the time for all good men to come to the aid of their party.

Term    Document 1  Document 2
aid     0           1
all     0           1
back    1           0
brown   1           0
come    0           1
dog     1           0
fox     1           0
good    0           1
jump    1           0
lazy    1           0
men     0           1
now     0           1
over    1           0
party   0           1
quick   1           0
their   0           1
time    0           1

Stopword List: for, is, of, the, to
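
The two binary columns on this slide can be rebuilt with a short Python sketch. Two things here are assumptions: the stopword set is taken to be the five terms that occur in the documents but have no row in the 17-row matrix (for, is, of, the, to), and the `NORMALIZE` lookup is a hypothetical stand-in for whatever stemming produced "jump" and "dog".

```python
import re

# Assumed stopword list: the terms that occur in the documents but not in the matrix.
STOPWORDS = {"for", "is", "of", "the", "to"}
# Hypothetical stand-in for a stemmer, covering only the two forms that need it.
NORMALIZE = {"jumped": "jump", "dog's": "dog"}


def terms(text):
    """Return the set of normalized, non-stopword terms in a text."""
    tokens = re.findall(r"[a-z']+", text.lower())
    tokens = [NORMALIZE.get(token, token) for token in tokens]
    return {token for token in tokens if token not in STOPWORDS}


docs = {
    1: "The quick brown fox jumped over the lazy dog's back.",
    2: "Now is the time for all good men to come to the aid of their party.",
}
doc_terms = {doc_id: terms(text) for doc_id, text in docs.items()}
vocabulary = sorted(set().union(*doc_terms.values()))

# One row per term, one 0/1 column per document.
incidence = {
    term: [int(term in doc_terms[doc_id]) for doc_id in sorted(docs)]
    for term in vocabulary
}
column_1 = "".join(str(incidence[term][0]) for term in vocabulary)
column_2 = "".join(str(incidence[term][1]) for term in vocabulary)
```

With the terms in alphabetical order, `column_1` and `column_2` come out as the two 17-digit strings shown on the slide.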

Advantages of Ranked Retrieval

• Closer to the way people think
  – Some documents are better than others
• Enriches browsing behavior
  – Decide how far down the list to go as you read it
• Allows more flexible queries
  – Long and short queries can produce useful results

Counting Terms

• Terms tell us about documents
  – If “rabbit” appears a lot, the document may be about rabbits
• Documents tell us about terms
  – “the” is in every document, so it is not discriminating
• Documents are most likely described well by rare terms that occur in them frequently
  – Higher “term frequency” is stronger evidence
  – Low “document frequency” makes it stronger still
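
One common way to combine the two kinds of evidence is to multiply term frequency by the logarithm of inverse document frequency. The slide does not give a formula, so the weighting below, tf × log(N / df), is an assumption chosen for illustration, and the three sample documents are made up.

```python
import math
from collections import Counter


def tf_idf(documents):
    """Weight each term by term frequency times log(N / document frequency)."""
    bags = [Counter(doc.lower().split()) for doc in documents]
    n_docs = len(bags)
    # Document frequency: the number of documents in which each term occurs.
    doc_freq = Counter(term for bag in bags for term in bag)
    return [
        {term: tf * math.log(n_docs / doc_freq[term]) for term, tf in bag.items()}
        for bag in bags
    ]


documents = [
    "the rabbit chased the rabbit",
    "the dog slept",
    "the cat slept",
]
weights = tf_idf(documents)
```

In the first document "the" gets weight zero because it occurs in every document, while "rabbit" outweighs "chased" because it is equally rare but occurs twice.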

Document Length Normalization

• Long documents have an unfair advantage
  – They use a lot of terms
    • So they get more matches than short documents
  – And they use the same words repeatedly
    • So they have much higher term frequencies
• Normalization seeks to remove these effects
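
As one illustration of such a normalization, the sketch below scales each term-count vector to unit Euclidean length (cosine normalization). The slide does not say which normalization is meant, so this choice is an assumption, and the two sample documents are made up.

```python
import math
from collections import Counter


def cosine_normalize(bag):
    """Scale term counts so that the vector has unit Euclidean length."""
    length = math.sqrt(sum(count * count for count in bag.values()))
    if length == 0:
        return {}
    return {term: count / length for term, count in bag.items()}


short_doc = Counter("rabbit food".split())
# The long document repeats the same two words ten times each.
long_doc = Counter(("rabbit food " * 10).split())

short_norm = cosine_normalize(short_doc)
long_norm = cosine_normalize(long_doc)
```

Before normalization the long document has ten times the term frequency for "rabbit"; afterwards both documents give it the same weight.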

Problems with “Free Text” Search

• Homonymy
  – Terms may have many unrelated meanings
  – Polysemy (related meanings) is less of a problem
• Synonymy
  – Many ways of saying (nearly) the same thing
• Anaphora
  – Alternate ways of referring to the same thing

Two Ways of Searching

[Diagram: two parallel paths, each ending in a Retrieval Status Value]

Content-Based Query-Document Matching
  – Author: write the document using terms to convey meaning → Document Terms
  – Free-Text Searcher: construct query from terms that may appear in documents → Query Terms

Metadata-Based Query-Document Matching
  – Indexer: choose appropriate concept descriptors → Document Descriptors
  – Controlled Vocabulary Searcher: construct query from available concept descriptors → Query Descriptors

Problems with Controlled Vocabulary

• New concepts

• Users and indexers may think differently

• Using thesauri effectively requires training

[Table: observable behaviors by Behavior Category (rows) and Minimum Scope (columns: Segment, Object, Class)]

Examine: View, Listen, Select
Retain: Print, Bookmark, Save, Purchase, Delete, Subscribe
Reference: Copy / paste, Quote, Forward, Reply, Link, Cite
Annotate: Mark up, Rate, Publish, Organize

Some Examples

• Read/Ignored, Saved/Deleted, Replied to (Stevens, 1993)
• Reading time (Morita & Shinoda, 1994; Konstan et al., 1997)
• Hypertext Link (Brin & Page, 1998)

Estimating Authority from Links

[Diagram: link graph with nodes labeled Hub, Authority, Authority]
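
One well-known way to compute hub and authority scores from links is Kleinberg's HITS iteration. The slide does not say which algorithm is intended, so treating it as HITS is an assumption; the function names, the tiny link graph, and the fixed iteration count below are all hypothetical.

```python
import math


def normalize(scores):
    """Scale scores to unit Euclidean length (left unchanged if all are zero)."""
    norm = math.sqrt(sum(value * value for value in scores.values())) or 1.0
    return {page: value / norm for page, value in scores.items()}


def hits(links, iterations=20):
    """links maps each page to the list of pages it links to."""
    pages = set(links) | {t for targets in links.values() for t in targets}
    hub = dict.fromkeys(pages, 1.0)
    auth = dict.fromkeys(pages, 1.0)
    for _ in range(iterations):
        # Authority: sum of the hub scores of the pages that link to a page.
        auth = dict.fromkeys(pages, 0.0)
        for source, targets in links.items():
            for target in targets:
                auth[target] += hub[source]
        auth = normalize(auth)
        # Hub: sum of the authority scores of the pages a page links to.
        hub = normalize(
            {page: sum(auth[t] for t in links.get(page, ())) for page in pages}
        )
    return hub, auth


links = {"h1": ["a1", "a2"], "h2": ["a1"]}
hub, auth = hits(links)
```

Here "a1" ends up with the highest authority score because both linking pages point to it, and "h1" ends up as the best hub because it points to both authorities.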

Problems with Observed Behavior

• Protecting privacy
  – What absolute assurances can we provide?
  – How can we make remaining risks understood?
• Scalable rating servers
  – Is a fully distributed architecture practical?
• Non-cooperative users
  – How can the effect of spamming be limited?

Putting It All Together

[Table: columns Free Text, Behavior, Metadata; rows Topicality, Quality, Reliability, Cost, Flexibility]

The Big Picture

• Four factors, working together
  – User
  – Process
  – System
  – Collection

Course Goals

• Appreciate IR system capabilities and limitations

• Understand IR system design & implementation
  – For a broad range of applications and media

• Evaluate IR system performance

• Identify current IR research problems

Course Design

• Text/readings provide background and detail
  – At least one recommended reading is required
• Class provides organization and direction
  – We will not cover every important detail
• Assignments and project provide experience
  – The TA can help with the project
• Final exam helps focus your effort

Assumed Background

• Everyone:
  – LBSC 690 or INFM 603 or equivalent
  – Comfortable with learning about technology
• MIM students:
  – Basic systems analysis, scripting languages
  – Some programming is helpful
• MLS students:
  – LBSC 650 and LBSC 670
  – LBSC 750 or a subject access course is helpful

Grading

• Assignments (20%)
  – Mastery of concepts and experience using tools
• Term project (50%)
  – Options are described on course Web page
• Final exam (30%)
  – In-class exam

Handy Things to Know

• Classes will be videotaped
  – Available outside my office
• Office hours: 5 PM Mondays
  – Or schedule by email, or ask after class
• Everything is on the Web
  – At http://www.glue.umd.edu/~oard
• Doug is most easily reached by email
  – [email protected]

Some Things to Do This Week

• Assignment 1
  – Due at 6 PM next Monday!!
• At least skim the readings before class
  – Don’t fall behind!
• Explore the Web site
  – Start thinking about the term project

