
Information Retrieval (1955-1992) Primary Users –Law Clerks –Reference Librarians –(Some) News...

Date post: 18-Jan-2018
Upload: wendy-evans
Transcript
Page 1: Information Retrieval (1955-1992) Primary Users –Law Clerks –Reference Librarians –(Some) News organizations, product research, congressional committees,

Information Retrieval (1955-1992)

• Primary Users
– Law Clerks
– Reference Librarians
– (Some) News organizations, product research, congressional committees, medical/chemical abstract searches

• Primary Search Models
– Boolean keyword searches on Abstract, Title, keyword

• Vendors
– Mead Data Central (Lexis-Nexis)
– Dialog
– Westlaw

• Total searchable online data: O(10 terabytes)

Page 2:

Information Retrieval (1993+)

• Primary users
– 1st time computer users
– novices

• Primary search modes
– Still Boolean keyword searches with limited probabilistic models
– But FULL TEXT Retrieval

• Vendors
– Lycos, Infoseek, Yahoo, Excite, AltaVista, Google

• Total online data: ???

Page 3:

Growth of the Web

[Figure: exponential growth in the number of web sites or volume of web traffic, plotted by year from 1992 to 1998, with labels for Mosaic and Netscape. Volume doubling every 6 months.]
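The doubling claim on this slide is simple arithmetic. A minimal sketch, assuming growth is exactly exponential with a fixed 6-month doubling period (the function name `growth_factor` is illustrative and not from the slides):

```python
def growth_factor(years, doubling_period_years=0.5):
    """Multiplicative growth after `years`, assuming a fixed doubling period."""
    return 2 ** (years / doubling_period_years)


if __name__ == "__main__":
    # Doubling every 6 months means a factor of 4 per year.
    for years in (1, 2, 6):
        print(years, growth_factor(years))
```

Under this assumption, the six years from 1992 to 1998 correspond to 12 doublings, a factor of 4096.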

Page 4:

Observation

• Early IR systems basically extended library catalog systems, allowing
– Keyword searches,
– Limited abstract searches
in addition to Author/Title/Subject and including Boolean combination functionality

• IR was seen as reference retrieval (full documents still had to be ordered/delivered by hand)

Page 5:

In Contrast

Today, IR has a much wider role in the age of digital libraries

• Full document retrieval (hypertext, PostScript or optical image (TIFF) representations)

• Question answering

Page 6:

Old View

Function of IR: Map queries to relevant documents

[Figure: a Boolean query of the form … AND … OR …]

New View

Satisfy user’s information need

Infer goals/information need from:
- query itself
- past user query history
- User profiling (aol.com vs. CS dept.)
- Collective analysis of other user feedback on similar queries

Page 7:

In addition, return information in a format useful/intelligible to the user

• weighted orderings

• clusterings of documents by different attributes

• visualization tools

** Text Understanding techniques to extract answers to questions, or at least the relevant subregion of text

Example: Who is the current mayor of Columbus, Ohio?
– don’t need the full AP/CNN article on city scandals, just the answer (and an available source for proof)

Page 8:

Boolean Systems

Function #1: Provide a fast, compact index into the database (of documents or references)

[Figure: index terms such as Chihuahua and Nanny pointing into the database]

Index options (granularity)
- Doc number
- Page number in Doc
- Actual word offset

Data structure: Inverted file
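An inverted file can be sketched in a few lines. The version below is a minimal illustration, assuming whitespace tokenization and an in-memory dictionary; the function name `build_inverted_file` and the sample documents are made up for the example. It stores word offsets, the finest granularity listed on the slide.

```python
from collections import defaultdict


def build_inverted_file(docs):
    """Map each term to {doc number: [word offsets]}.

    `docs` maps a document number to its text. Keeping only the keys of
    the inner dict gives the coarser doc-number granularity.
    """
    index = defaultdict(dict)
    for doc_id, text in docs.items():
        for offset, word in enumerate(text.lower().split()):
            index[word].setdefault(doc_id, []).append(offset)
    return index


docs = {
    1: "the nanny walked the chihuahua",
    2: "chihuahua breeding club news",
}
index = build_inverted_file(docs)
print(sorted(index["chihuahua"]))   # document numbers containing the term
print(index["chihuahua"][1])        # word offsets within document 1
```

Lookup cost depends on the term's posting list rather than on the size of the collection, which is what makes the index fast.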

Page 9:

Boolean Operations

Chihuahua AND Nanny → Join

Chihuahua OR Nanny → Union

Proximity searches

Chihuahua W/3 Nanny
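These operations reduce to set operations on posting lists. The sketch below assumes positional postings of the form {doc number: [word offsets]} and reads W/3 as "within three words, in either order", which is an assumption about the operator's semantics; all names and data are hypothetical.

```python
def boolean_and(postings_a, postings_b):
    """Documents containing both terms (intersection of posting lists)."""
    return set(postings_a) & set(postings_b)


def boolean_or(postings_a, postings_b):
    """Documents containing either term (union of posting lists)."""
    return set(postings_a) | set(postings_b)


def within(postings_a, postings_b, n):
    """Documents where some pair of occurrences is at most n words apart."""
    hits = set()
    for doc_id in boolean_and(postings_a, postings_b):
        if any(abs(i - j) <= n
               for i in postings_a[doc_id]
               for j in postings_b[doc_id]):
            hits.add(doc_id)
    return hits


# Hypothetical positional postings: {doc number: [word offsets]}
chihuahua = {1: [4], 2: [0], 3: [10]}
nanny = {1: [1], 3: [2], 4: [7]}

print(boolean_and(chihuahua, nanny))   # docs 1 and 3
print(boolean_or(chihuahua, nanny))    # docs 1, 2, 3 and 4
print(within(chihuahua, nanny, 3))     # doc 1 only (offsets 4 and 1)
```

The proximity search needs the word-offset granularity; with only document numbers in the index it cannot be answered.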

Page 10:

Vector IR model

[Figure: documents d1 and d2 are each mapped by f( ) to vectors V1 and V2; the query is mapped to a vector VQ]

Find optimal f( ) such that

Sim (Vi , VQ) = Sim’ (Di , Q)
Sim (V1 , V2) ≈ Sim’ (d1 , d2)

Sim: Cosine distance
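One concrete choice of Sim is the cosine of the angle between two term vectors. The sketch below assumes f( ) is plain term counting, which is only one possible f( ) and not necessarily the one the slides have in mind. It returns the cosine similarity; the corresponding distance would be 1 minus this value.

```python
import math
from collections import Counter


def to_vector(text):
    """A simple choice of f( ): raw term counts (an assumption)."""
    return Counter(text.lower().split())


def cosine_sim(v1, v2):
    """Cosine of the angle between two sparse term-count vectors."""
    dot = sum(v1[t] * v2[t] for t in v1.keys() & v2.keys())
    norm1 = math.sqrt(sum(c * c for c in v1.values()))
    norm2 = math.sqrt(sum(c * c for c in v2.values()))
    if norm1 == 0 or norm2 == 0:
        return 0.0
    return dot / (norm1 * norm2)


d1 = to_vector("chihuahua breeding club")
d2 = to_vector("chihuahua nanny")
q = to_vector("chihuahua breeding")
print(cosine_sim(d1, q), cosine_sim(d2, q))
```

Here d1 shares two of its three terms with the query and scores higher than d2, which shares one.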

Page 11:

Vector models

[Figure: documents D1 and D2 are mapped to vectors V1 and V2, and the query to a vector Q1. V1 is a bit vector capturing the essence/meaning of D1.]

Find max Sim (Vi , Q1)

Sim (V1 , Q1)
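Retrieval in this model is a search for the document vector that maximizes Sim against the query vector. A minimal sketch, assuming a fixed hypothetical vocabulary and using the count of shared 1-bits as a deliberately simple stand-in for Sim:

```python
def bit_vector(text, vocabulary):
    """1 if the vocabulary term occurs in the text, else 0."""
    words = set(text.lower().split())
    return [1 if term in words else 0 for term in vocabulary]


def sim(v, q):
    """Number of shared 1-bits; a simple stand-in for Sim."""
    return sum(a & b for a, b in zip(v, q))


vocabulary = ["chihuahua", "nanny", "breeding", "club"]
docs = {
    "D1": "Chihuahua breeding club newsletter",
    "D2": "Nanny wanted",
}
vectors = {name: bit_vector(text, vocabulary) for name, text in docs.items()}
q1 = bit_vector("chihuahua club", vocabulary)

# Find max Sim(Vi, Q1) over all document vectors.
best = max(vectors, key=lambda name: sim(vectors[name], q1))
print(best, vectors[best])
```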

Page 12:

Dimensionality Reduction

[Figure: d1 → f( ) → V1 → Dimensionality Reduction (SVD/LSI) → V1^]

V1: Initial (term) vector representation

V1^: More compact/reduced dimensionality model of d1
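The SVD step can be sketched with NumPy (assumed to be installed). The term-document matrix, the term labels and the choice k = 2 are all made up for illustration; this is a bare truncated SVD, not a full LSI pipeline.

```python
import numpy as np

# Hypothetical term-document count matrix: rows are terms, columns are documents.
A = np.array([
    [2, 0, 1, 0],   # japan
    [1, 0, 2, 0],   # nippon
    [0, 3, 0, 1],   # opera
    [0, 1, 0, 2],   # verdi
], dtype=float)

k = 2  # number of latent dimensions to keep (an arbitrary choice here)

# Thin SVD: A = U @ diag(s) @ Vt, singular values in descending order.
U, s, Vt = np.linalg.svd(A, full_matrices=False)

# Reduced document vectors: one k-dimensional column per document.
docs_reduced = np.diag(s[:k]) @ Vt[:k, :]


def reduce_vector(term_vector):
    """Project a term vector (document or query) into the k-dimensional space."""
    return U[:, :k].T @ term_vector


print(docs_reduced.shape)        # (k, number of documents)
print(reduce_vector(A[:, 0]))    # equals the first column of docs_reduced
```

Queries are projected with the same `reduce_vector`, so that similarity can be computed in the reduced space.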

Page 13:

Clustering words

Offset K
- hash(w)
- hash(cluster(w))
- hash(cluster(stem(w)))

[Figure: a raw term vector V1 for document D1, with separate entries for terms such as The, Japan, Japanese, Nippon and Nihon, is mapped to a condensed vector in which the Japan-related terms share one entry (Japan *).]

Stem: books → book, computer → comput, computation → comput
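The third option, hash(cluster(stem(w))), can be sketched as follows. The suffix-stripping `stem` is a crude stand-in that only covers the slide's three examples, the `CLUSTERS` table and vector size `K` are hypothetical, and `zlib.crc32` is used only because it is a stable hash across runs.

```python
import zlib

K = 16  # size of the condensed vector (arbitrary for this sketch)

# Hypothetical cluster table: maps stems to a shared cluster label.
CLUSTERS = {"japan": "JAPAN", "japanese": "JAPAN",
            "nippon": "JAPAN", "nihon": "JAPAN"}


def stem(word):
    """Crude suffix stripping; not a real stemming algorithm."""
    for suffix in ("ation", "er", "s"):
        if word.endswith(suffix) and len(word) > len(suffix) + 2:
            return word[: -len(suffix)]
    return word


def offset(word):
    """Vector offset for a word: hash(cluster(stem(w))) modulo K."""
    stemmed = stem(word.lower())
    cluster = CLUSTERS.get(stemmed, stemmed)
    return zlib.crc32(cluster.encode("utf-8")) % K


def condensed_vector(text):
    """Count words per offset; clustered words land in the same slot."""
    vector = [0] * K
    for word in text.split():
        vector[offset(word)] += 1
    return vector


print(stem("books"), stem("computer"), stem("computation"))
print(condensed_vector("Japan Nippon Nihon Japanese"))
```

Because the offset is a hash modulo K, unrelated words can collide in the same slot; a larger K trades space for fewer collisions.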

Page 14:

Collocation (Phrasal Term)

d1: The soap opera
d2: The soap residue ... an opera by Verdi

      Soap   Opera   Soap opera
d1     0      0       1
d2     1      1       0
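Treating a collocation as its own term can be sketched as below. This is a minimal illustration for the single phrase "soap opera", assuming whitespace tokenization; the function name `collocation_vector` is made up. Words consumed by the phrase are not also counted as single terms.

```python
def collocation_vector(text, phrase=("soap", "opera")):
    """Return [soap, opera, soap opera] bits, counting the phrase as one term."""
    words = text.lower().split()
    joined = " ".join(phrase)
    bits = {phrase[0]: 0, phrase[1]: 0, joined: 0}
    i = 0
    while i < len(words):
        if tuple(words[i:i + 2]) == phrase:
            # Adjacent occurrence: set only the phrasal term.
            bits[joined] = 1
            i += 2
        else:
            if words[i] in bits:
                bits[words[i]] = 1
            i += 1
    return [bits[phrase[0]], bits[phrase[1]], bits[joined]]


print(collocation_vector("The soap opera"))
print(collocation_vector("The soap residue an opera by Verdi"))
```

Without the phrasal term, both texts would map to the same vector even though only the first is about a soap opera.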

Page 15:

Abstractly, a vector is a compressed document (meaning preserving): a meaning or context vector representation

document d1 → m1 = f(d1)

document d2 → m2 = f(d2)

Compression: m1 = m2 iff d1 = d2; f( ) must be invertible

Summarization: m1 = m2 iff d1 and d2 are about the same thing (mean the same thing)
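The contrast between the two conditions can be illustrated with a toy example. `zlib` gives an invertible f( ); the set of lowercased words is a many-to-one f( ) used here as a very rough stand-in for a meaning vector (sharing the same words is of course not the same as meaning the same thing).

```python
import zlib


def compress(doc):
    """Invertible: distinct documents give distinct outputs."""
    return zlib.compress(doc.encode("utf-8"))


def summarize(doc):
    """Many-to-one stand-in for a meaning vector: the set of lowercased words."""
    return frozenset(doc.lower().split())


d1 = "Verdi wrote the opera"
d2 = "the opera Verdi wrote"

assert zlib.decompress(compress(d1)).decode("utf-8") == d1   # f is invertible
print(compress(d1) == compress(d2))    # False, because d1 != d2
print(summarize(d1) == summarize(d2))  # True: same words, different order
```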

Page 16:

What is the optimal method for meaning preserving compression?

Issues

• size of representation(ideally size(Vi) << size(Di))

• cost of computation of vectors

– one time cost at model creation

• cost of similarity function

• must be computed for each query

• crucial to speed that this be minimized

Page 17:

– header processing: retain/model cross references

[Equation: Vi expressed in terms of V and the vectors V(ref1), V(ref2), V(ref3) of the cross-referenced documents]

Function words:
1. remove (most) function words, or
2. downweight by frequency, or
3. use text analysis and decide which function words carry meaning (e.g. NOT, or).
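Options 1 and 2 for function words can be sketched as follows. The stop list is hypothetical, and inverse document frequency (IDF) is used here as one common way to downweight by frequency; the slides do not say which weighting was intended.

```python
import math
from collections import Counter

STOP_WORDS = {"the", "a", "of", "by"}   # hypothetical list for option 1


def remove_function_words(words):
    """Option 1: drop words on a fixed stop list."""
    return [w for w in words if w not in STOP_WORDS]


def idf_weights(docs):
    """Option 2: weight = log(N / document frequency)."""
    n_docs = len(docs)
    doc_freq = Counter(word for doc in docs for word in set(doc))
    return {word: math.log(n_docs / df) for word, df in doc_freq.items()}


docs = [
    "the soap opera".split(),
    "the soap residue".split(),
    "the opera by verdi".split(),
]
weights = idf_weights(docs)
print(remove_function_words(docs[2]))
print(weights["the"], weights["verdi"])
```

A word that occurs in every document, such as "the", gets weight 0 under IDF, so option 2 removes it implicitly without needing a stop list.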

Page 18:

Supervised Learning/Training

[Figure: an input data stream is processed in real time (ongoing) by one trained recognizer per category; together the recognizers perform collective discrimination and produce labelled (routed) output.]

Categories:
A: Project #1
B: Chihuahua Breeding Club
C: Personal
J: Junk mail
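The training and routing loop in this figure can be sketched with one very simple recognizer per label. The word-count profiles and the scoring rule below are assumptions chosen for brevity, not the method used in the slides; the training messages are made up.

```python
from collections import Counter


def train_recognizers(labelled_messages):
    """Build one word-count profile (a very simple recognizer) per label."""
    profiles = {}
    for label, text in labelled_messages:
        profiles.setdefault(label, Counter()).update(text.lower().split())
    return profiles


def route(text, profiles):
    """Score the message with every recognizer and return the best label."""
    words = text.lower().split()

    def score(label):
        profile = profiles[label]
        total = sum(profile.values())
        # Share of the profile's word mass matched by the message's words.
        return sum(profile[w] for w in words) / total

    return max(profiles, key=score)


training = [
    ("A", "project deadline status report"),
    ("B", "chihuahua breeding club meeting"),
    ("C", "dinner with family on sunday"),
    ("J", "win free money now"),
]
profiles = train_recognizers(training)
print(route("free money for chihuahua owners", profiles))
```

Each incoming message is scored by all recognizers and routed to the highest-scoring label, which is the collective discrimination step in the figure.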

Page 19:

Other related problems: Mail/News Routing and Filtering

[Figure: a data stream is routed into prioritized inboxes: Project #1 at work, Project #2 at work, Chihuahua breeding, Scuba club, Personal, Junk mail.]

Typically model long-term information needs (People put effort into training and user feedback that they aren’t willing to invest for single query-based IR)

Page 20:

Features for classification

• Subject line
• Source/Sender
• X-annotations
• Date/time
• Length
• Other recipients
• Message content (regions weighted differently)
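Most of these features can be read straight from a message's headers. The sketch below uses Python's standard `email` package; the region weights, the feature names and the sample message are hypothetical, and it assumes a single-part plain-text message.

```python
from email import message_from_string

REGION_WEIGHTS = {"subject": 3.0, "body": 1.0}   # hypothetical weights


def extract_features(raw_message):
    """Pull the slide's feature types out of an RFC 822 style message."""
    msg = message_from_string(raw_message)
    body = msg.get_payload()   # a str for single-part messages
    weighted_terms = {}
    for region, text in (("subject", msg.get("Subject", "")), ("body", body)):
        for word in text.lower().split():
            weighted_terms[word] = (
                weighted_terms.get(word, 0.0) + REGION_WEIGHTS[region]
            )
    return {
        "sender": msg.get("From", ""),
        "recipients": msg.get_all("To", []) + msg.get_all("Cc", []),
        "date": msg.get("Date", ""),
        "x_headers": {k: v for k, v in msg.items()
                      if k.lower().startswith("x-")},
        "length": len(body),
        "terms": weighted_terms,
    }


raw = (
    "From: club@example.org\n"
    "To: me@example.org\n"
    "Subject: Chihuahua show\n"
    "Date: Mon, 5 Jan 1998 10:00:00 -0500\n"
    "X-Priority: 3\n"
    "\n"
    "The chihuahua show is on Saturday\n"
)
features = extract_features(raw)
print(features["sender"], features["terms"]["chihuahua"])
```

A word in the subject line counts three times as much as the same word in the body here, which is one simple way to weight regions differently.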

Page 21:

Probabilistic IR models – Intermediate Topic models/detectors

[Figure: documents d1 and d2 are mapped by f( ) to vectors V1 and V2, and the query to Q. A bank of topic detectors (topic models) TDA, TDB, ... TDE produces a topic vector (TV1) with one entry per topic, e.g. 0 1 0 1 0 0. Other vectors shown: 1 0 0 0 0 0 and 0 0 0 1 0 0.]
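The idea of an intermediate topic vector can be sketched with a bank of detectors that each emit one bit. The cue-word detectors below are a hypothetical stand-in; real topic detectors would be trained probabilistic models rather than keyword sets.

```python
# Hypothetical topic detectors: each fires (1) if any of its cue words occurs.
TOPIC_DETECTORS = {
    "TD_A": {"chihuahua", "breeding"},
    "TD_B": {"opera", "verdi"},
    "TD_C": {"soap", "residue"},
}


def topic_vector(text):
    """One bit per topic detector, in a fixed detector order."""
    words = set(text.lower().split())
    return [1 if words & cues else 0 for cues in TOPIC_DETECTORS.values()]


def topic_sim(tv1, tv2):
    """Number of topics on which both vectors fire."""
    return sum(a & b for a, b in zip(tv1, tv2))


tv_doc = topic_vector("an opera by Verdi")
tv_query = topic_vector("soap opera")
print(tv_doc, tv_query, topic_sim(tv_doc, tv_query))
```

Documents and queries are compared in topic space rather than term space, so two texts can match without sharing any individual word.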

