
Special Topics in Computer Science The Art of Information Retrieval Chapter 5: Query Operations...

Date post: 27-Mar-2015
Page 1: Special Topics in Computer Science The Art of Information Retrieval Chapter 5: Query Operations Alexander Gelbukh .

Special Topics in Computer Science

The Art of Information Retrieval

Chapter 5: Query Operations

Alexander Gelbukh

www.Gelbukh.com

Previous chapter: Conclusions

Query languages (in breadth):

o Words, phrases, proximity, fuzzy Boolean, natural language

Query languages (in depth):

o Pattern matching

If queries return sets, the results can be combined using the Boolean model

Combining with structure

o Hierarchical structure

Standardized low-level languages: protocols

o Reusable

Previous chapter: Trends and research topics

Models: to better understand the user needs

Query languages: flexibility, power, expressiveness, functionality

Visual languages

o Example: library shown on the screen. Act: take books, open catalogs, etc.

o Better Boolean queries: “I need books by Cervantes AND Lope de Vega”?!

Query operations

Users have difficulties formulating queries

Program improves the query

o Interactive mode: using the user’s feedback

o Using info from the retrieved set

o Using linguistic information or information from the collection

Query expansion

o Add new terms

Term re-weighting

o Modify weights

1st method: User relevance feedback

User examines the top 10 (20) docs and marks the relevant ones

System uses this to construct a new query

o Moved toward relevant docs

o Away from irrelevant

Good: simplicity

Note: throughout the chapter, the correct spelling is Rocchio

User relevance feedback: Vector Space Model

Best vector to distinguish good from bad docs: avg good minus avg bad
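
Written out, the "average good minus average bad" vector takes the form below. This is a sketch of the usual textbook formulation; the symbols are introduced here for illustration and are not in the slide: C_r is the (in practice unknown) set of relevant docs, N is the collection size, and each doc is a term-weight vector.

```latex
\vec{q}_{opt} \;=\; \frac{1}{|C_r|} \sum_{\vec{d}_j \in C_r} \vec{d}_j
\;-\; \frac{1}{N - |C_r|} \sum_{\vec{d}_j \notin C_r} \vec{d}_j
```

Since C_r is not known in advance, feedback methods approximate it with the docs the user has marked.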

User relevance feedback: Vector Space Model

Equally good results

Original query gives important info (weight α)

Relevant docs give more info than irrelevant ones: γ < β (β weights the relevant docs, γ the irrelevant ones)

γ = 0: Positive feedback
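
A minimal sketch of the standard Rocchio update, q_new = α·q + β·avg(relevant) − γ·avg(irrelevant), may make the role of the three weights concrete. The function name `rocchio`, the default values of `alpha`, `beta`, `gamma`, and the clipping of negative weights are assumptions for illustration, not taken from the slides.

```python
def rocchio(query, relevant, irrelevant, alpha=1.0, beta=0.75, gamma=0.15):
    """Return a modified query vector (standard Rocchio formulation).

    query: list of term weights.
    relevant / irrelevant: lists of doc vectors (same length as query)
    that the user marked. The default weights are illustrative only.
    """
    def centroid(docs):
        # Average vector of a set of docs; zero vector if the set is empty.
        if not docs:
            return [0.0] * len(query)
        return [sum(col) / len(docs) for col in zip(*docs)]

    good = centroid(relevant)
    bad = centroid(irrelevant)
    new_q = [alpha * q + beta * g - gamma * b
             for q, g, b in zip(query, good, bad)]
    # Assumption: negative weights are clipped to zero.
    return [max(0.0, w) for w in new_q]


# Three terms; the user marked two relevant docs and one irrelevant doc.
q = [1.0, 0.0, 0.0]
rel = [[1.0, 1.0, 0.0], [1.0, 1.0, 0.0]]
irr = [[0.0, 0.0, 1.0]]
print(rocchio(q, rel, irr, alpha=1.0, beta=0.5, gamma=0.25))
# Positive feedback only: ignore the irrelevant docs.
print(rocchio(q, rel, irr, alpha=1.0, beta=0.5, gamma=0.0))
```

The second term of the query gains weight because it occurs in the relevant docs, which is the "moved toward relevant docs" effect; with `gamma=0` the irrelevant docs play no role.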

User relevance feedback: Probabilistic Model

User feedback:

Smoothing is usually applied

Bad:

o No document weights

o Previous history lost

o No new terms, only weights are changed
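
As a rough sketch of how feedback re-weights a query term in the probabilistic model: the probabilities P(term | relevant) and P(term | irrelevant) are re-estimated from the docs the user marked, with the usual 0.5 smoothing. The function and parameter names below are hypothetical, and this particular smoothing variant is an assumption; the slides do not give the formula.

```python
import math


def feedback_term_weight(n_i, N, rel_total, rel_with_term):
    """Weight of one query term after user feedback.

    n_i: number of docs in the collection containing the term
    N: collection size
    rel_total: number of docs the user marked relevant
    rel_with_term: how many of those contain the term
    """
    # Smoothed estimates (the +0.5 / +1 adjustment avoids 0 and 1).
    p_rel = (rel_with_term + 0.5) / (rel_total + 1.0)
    p_non = (n_i - rel_with_term + 0.5) / (N - rel_total + 1.0)
    return math.log(p_rel / (1.0 - p_rel)) + math.log((1.0 - p_non) / p_non)


def rank_score(doc_terms, query_weights):
    """Score a doc as the sum of the weights of the query terms it contains."""
    return sum(w for t, w in query_weights.items() if t in doc_terms)


weights = {
    # Term found in 3 of the 4 relevant docs, rare in the collection.
    "rocchio": feedback_term_weight(n_i=10, N=100, rel_total=4, rel_with_term=3),
    # Term found in none of the relevant docs, common in the collection.
    "method": feedback_term_weight(n_i=50, N=100, rel_total=4, rel_with_term=0),
}
print(weights)
print(rank_score({"rocchio", "feedback"}, weights))
```

Only the weights of the original query terms change here, which matches the drawback listed above: no new terms are added.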

... a variant for Probabilistic Model

Similarity is multiplied by TF (term frequency)

o Not exactly, but this is the idea

o Initially, IDF is also taken into account

o Details in the book

Still no query expansion, only re-weighting the original terms

Evaluation of Relevance Feedback

Simplistic:

o Evaluate precision and recall after the feedback cycle

o Not realistic, since the result includes the docs the user already marked

Better:

o Only consider unseen data

o Use the rest of the collection

o Not as good figures

o Useful to compare different methods, not to compare precision/recall before and after feedback
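
The "unseen data only" evaluation can be sketched as follows: the docs the user already judged are removed from both the ranking and the set of relevant docs before precision and recall are computed. The function name and the cutoff parameter `k` are assumptions for illustration.

```python
def residual_precision_recall(ranking, relevant, seen, k=10):
    """Precision and recall at cutoff k over the residual collection.

    ranking: doc ids ordered by the modified query
    relevant: ids of all docs relevant to the information need
    seen: ids of docs the user examined during feedback
    """
    seen = set(seen)
    residual_ranking = [d for d in ranking if d not in seen]
    residual_relevant = set(relevant) - seen
    top = residual_ranking[:k]
    hits = sum(1 for d in top if d in residual_relevant)
    precision = hits / len(top) if top else 0.0
    recall = hits / len(residual_relevant) if residual_relevant else 0.0
    return precision, recall


ranking = ["d1", "d2", "d3", "d4", "d5", "d6"]
relevant = {"d1", "d3", "d5"}
seen = {"d1", "d2"}  # judged by the user in the feedback cycle
print(residual_precision_recall(ranking, relevant, seen, k=2))
```

Because the highly ranked relevant docs the user marked are excluded, the figures are lower than before feedback, which is why they serve to compare methods with each other rather than to measure the gain from feedback.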

2nd method: Automatic local analysis

Idea: add to the query synonyms, stemming variations, collocations: thesaurus-like relationships

o Based on clustering techniques

Global vs. local strategy:

o Global: the whole collection is used for this

o Local: the retrieved set. Similar to feedback, but automatic.

Local analysis: seems to give better results (better adaptation to the specific query) but is time-consuming.

o Good for local collections, not for the Web

Build clusters of words; add to each keyword its neighbors

Clustering (words)

Association clusters

o Terms that co-occur in the docs

o The clusters are the n terms that occur most frequently together with the query terms (normalized vs. non-normalized)

Metric clusters (better)

o Multiplies the number of co-occurrences by the proximity in the text

o Terms that occur in the same sentence are more related

Scalar clusters

o Terms co-occurring with the same other terms are related

o Relatedness of two words = scalar product of centroids of their association clusters
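
The three kinds of clusters can be sketched over a small retrieved set as below. This is an illustrative sketch under stated assumptions, not the exact formulas of the book: association strength is the sum over docs of the product of term frequencies, metric strength sums 1/distance over pairs of occurrences, and scalar relatedness is taken as the cosine between association rows. All function names are hypothetical.

```python
import math
from collections import Counter, defaultdict


def association_matrix(docs):
    """docs: list of token lists. c[u][v] = sum over docs of f_u * f_v."""
    c = defaultdict(lambda: defaultdict(float))
    for tokens in docs:
        freq = Counter(tokens)
        for u, fu in freq.items():
            for v, fv in freq.items():
                c[u][v] += fu * fv
    return c


def normalized(c, u, v):
    """Normalized association: c_uv / (c_uu + c_vv - c_uv)."""
    denom = c[u][u] + c[v][v] - c[u][v]
    return c[u][v] / denom if denom else 0.0


def metric_matrix(docs):
    """m[u][v] = sum of 1/distance over pairs of occurrences in one doc."""
    m = defaultdict(lambda: defaultdict(float))
    for tokens in docs:
        for i, u in enumerate(tokens):
            for j, v in enumerate(tokens):
                if i != j:
                    m[u][v] += 1.0 / abs(i - j)
    return m


def scalar_similarity(c, u, v):
    """Cosine between the association rows of u and v."""
    terms = set(c[u]) | set(c[v])
    dot = sum(c[u].get(t, 0.0) * c[v].get(t, 0.0) for t in terms)
    norm_u = math.sqrt(sum(x * x for x in c[u].values()))
    norm_v = math.sqrt(sum(x * x for x in c[v].values()))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


def neighbors(matrix, term, n=3):
    """The n terms most strongly related to `term` (its cluster)."""
    row = {v: s for v, s in matrix[term].items() if v != term}
    return sorted(row, key=lambda v: (-row[v], v))[:n]


docs = [["query", "expansion", "term"],
        ["query", "expansion"],
        ["term", "weight"]]
c = association_matrix(docs)
m = metric_matrix(docs)
print(neighbors(c, "query", n=1), normalized(c, "query", "expansion"))
print(m["query"]["expansion"], m["query"]["term"])
print(scalar_similarity(c, "query", "expansion"))
```

Expanding a query then amounts to adding `neighbors(...)` of each query term, using whichever of the three matrices is chosen.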

... variant (local clustering)

Metric-like reasoning:

o Break the retrieved docs into passages (say, 300 words)

o Use them as docs; use TF-IDF

o Choose words related (use TF-IDF) to the whole query

o Better: words occurring near each other are more related

o Tune for each collection

, not 5:

3rd method: Automatic global analysis

Uses all docs in the collection

Builds a thesaurus

The terms related to the whole query are added (query expansion)

Similarity thesaurus

Relatedness = occur in the same docs.

Matrix: doc x term frequency

Inverse term frequency: divided by the size of the doc

Relatedness = correlation between rows of the matrix

Query: centroid, weighted (weighted sum).

Relatedness between a term and this centroid = cosine

The best terms are added to the query, with weights:
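
A simplified sketch of the idea: each term is a vector over docs, down-weighted in long docs by an inverse term frequency, and candidate terms are ranked by their dot product with the weighted sum of the query term vectors. The exact weighting used here (0.5 + 0.5·f/max f, ITF as log of vocabulary size over distinct terms in the doc, expansion weight divided by the total query weight) follows a common formulation and should be read as an assumption; the function names are hypothetical.

```python
import math


def term_vectors(docs):
    """docs: list of token lists. Returns {term: unit-length vector over docs}."""
    vocab = sorted({t for d in docs for t in d})
    # Inverse term frequency of a doc: long docs (many distinct terms) count less.
    itf = [math.log(len(vocab) / len(set(d))) for d in docs]
    vectors = {}
    for term in vocab:
        freqs = [d.count(term) for d in docs]
        max_f = max(freqs)
        raw = [(0.5 + 0.5 * f / max_f) * itf[j] if f > 0 else 0.0
               for j, f in enumerate(freqs)]
        norm = math.sqrt(sum(w * w for w in raw))
        vectors[term] = [w / norm if norm else 0.0 for w in raw]
    return vectors


def expand(query_weights, vectors, n=2):
    """query_weights: {term: weight}; every query term must be in `vectors`.

    Returns the n best new terms with their expansion weights.
    """
    dims = len(next(iter(vectors.values())))
    # Query "centroid": weighted sum of the query term vectors.
    centroid = [sum(w * vectors[t][j] for t, w in query_weights.items())
                for j in range(dims)]
    total = sum(query_weights.values())
    # Term vectors have unit length, so this dot product ranks candidates
    # in the same order as the cosine with the centroid.
    scores = {v: sum(a * b for a, b in zip(centroid, vec)) / total
              for v, vec in vectors.items() if v not in query_weights}
    best = sorted(scores, key=lambda v: (-scores[v], v))[:n]
    return {v: scores[v] for v in best}


docs = [["car", "auto", "engine"],
        ["car", "auto", "wheel"],
        ["flower", "garden"],
        ["flower", "garden", "bee"]]
vectors = term_vectors(docs)
print(expand({"car": 1.0}, vectors, n=1))
```

In this toy collection "auto" occurs in exactly the same docs as "car", so it is the first term added to the query.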

(Global) Statistical thesaurus...

Terms added must be discriminative: low frequency

Difficult to cluster (no info)

Solution: first cluster docs; the frequency increases

Clustering docs, e.g.:

o Each doc is a cluster

o Merge the two most similar clusters = their docs are similar

o Repeat until <condition>
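
The merging loop above can be sketched as follows. Two choices here are assumptions, since the slide leaves them open: cluster similarity is taken as the lowest cosine between any pair of docs from the two clusters (complete link), and the stopping condition is a similarity threshold. The function names are hypothetical.

```python
import math


def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def cluster_docs(vectors, threshold):
    """Agglomerative clustering of doc vectors; returns lists of doc indices."""
    clusters = [[i] for i in range(len(vectors))]  # each doc is a cluster

    def similarity(c1, c2):
        # Complete link (assumption): the least similar pair decides.
        return min(cosine(vectors[i], vectors[j]) for i in c1 for j in c2)

    while len(clusters) > 1:
        best_sim, best_pair = -1.0, None
        for a in range(len(clusters)):
            for b in range(a + 1, len(clusters)):
                s = similarity(clusters[a], clusters[b])
                if s > best_sim:
                    best_sim, best_pair = s, (a, b)
        if best_sim < threshold:  # stopping condition (assumption)
            break
        a, b = best_pair
        clusters[a] = clusters[a] + clusters[b]  # merge the two most similar
        del clusters[b]
    return clusters


vectors = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [0.1, 0.9]]
print(cluster_docs(vectors, threshold=0.8))
```

Recording the similarity at which each merge happens would give the cluster hierarchy rather than a flat set of clusters.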

page 136:

... statistical thesaurus

Convert the cluster hierarchy into a set of clusters

o Use a threshold similarity level to cut the hierarchy

o Don’t take too large clusters

Consider only low-frequency (in terms of ITF) terms occurring in the docs of the same class

o Threshold

o These give clusters of words

Calculate weight of each class of terms. Add these terms with this weight to the query terms

Research topics

Interactive interfaces

o Graphical, 2D or 3D

Refining global analysis techniques

Application of linguistic methods. Stemming. Ontologies

Local analysis for the Web (now too expensive)

Combine the three techniques (feedback, local, global)

Conclusions

Relevance feedback

o Simple, understandable

o Needs user attention

o Term re-weighting

Local analysis for query expansion

o Co-occurrences in the retrieved docs

o Usually gives better results than global analysis

o Computationally expensive

Global analysis

o Not as good results, since what is good for the whole collection is not good for a specific query

o Linguistic methods, dictionaries, ontologies, stemming, ...

Exam

Questions and exercises

You do what you consider appropriate

On Oct 23 or maybe Nov 6 (??), discuss

The class on Oct 30 is moved to Oct 23

Thank you!

Till October 23

October 23: discussion of the midterm exam, class moved from October 30

The class of Oct 30 is moved to Oct 23

