Special Topics in Computer Science
The Art of Information Retrieval
Chapter 5: Query Operations
Alexander Gelbukh
www.Gelbukh.com
2
Previous chapter: Conclusions
Query languages (breadth-wise):
o Words, phrases, proximity, fuzzy Boolean, natural language
Query languages (depth-wise):
o Pattern matching
o If patterns return sets, they can be combined using the Boolean model
Combining with structure:
o Hierarchical structure
Standardized low-level languages: protocols
o Reusable
3
Previous chapter: Trends and research topics
Models: to better understand the user's needs
Query languages: flexibility, power, expressiveness, functionality
Visual languages
o Example: a library shown on the screen; the user acts: takes books, opens catalogs, etc.
o Better Boolean queries: “I need books by Cervantes AND Lope de Vega”?!
4
Query operations
Users have difficulties formulating queries; the program improves the query
o Interactive mode: using the user's feedback
o Using info from the retrieved set
o Using linguistic information or information from the collection
Query expansion:
o Add new terms
Term reweighting:
o Modify the weights of the existing terms
5
1st method: User relevance feedback
The user examines the top 10 (or 20) docs and marks the relevant ones
The system uses this to construct a new query
o Moved toward the relevant docs
o Away from the irrelevant ones
Good: simplicity
Note: throughout this chapter, the correct spelling is Rocchio
6
User relevance feedback:
Vector Space Model
Best vector to distinguish good docs from bad ones: the average of the good docs minus the average of the bad ones (see the formula below)
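For reference, a common textbook formulation of this idea is the Rocchio formula; the notation below (D_r and D_n for the relevant/non-relevant docs marked by the user, and the tuning constants α, β, γ) is the standard one, not copied from these slides:

```latex
\vec{q}_m \;=\; \alpha\,\vec{q}
  \;+\; \frac{\beta}{|D_r|}\sum_{\vec{d}_j \in D_r}\vec{d}_j
  \;-\; \frac{\gamma}{|D_n|}\sum_{\vec{d}_j \in D_n}\vec{d}_j
```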
7
User relevance feedback:
Vector Space Model
The variants give equally good results
The original query gives important information: keep it (α > 0)
Relevant docs give more information than irrelevant ones: γ < β
γ = 0: positive feedback only
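A minimal NumPy sketch of how this reweighting could be computed; the function name, the α/β/γ defaults, and the toy vectors are illustrative assumptions, not taken from the slides:

```python
import numpy as np

def rocchio(query, relevant, nonrelevant, alpha=1.0, beta=0.75, gamma=0.15):
    """Return a modified query vector; relevant/nonrelevant are lists of doc vectors."""
    q_new = alpha * np.asarray(query, dtype=float)
    if relevant:                          # move toward the centroid of the relevant docs
        q_new = q_new + beta * np.mean(relevant, axis=0)
    if nonrelevant:                       # move away from the centroid of the irrelevant docs
        q_new = q_new - gamma * np.mean(nonrelevant, axis=0)
    return np.maximum(q_new, 0.0)         # negative term weights are usually dropped

# toy example with a 4-term vocabulary
q  = [1.0, 0.0, 1.0, 0.0]
dr = [np.array([1.0, 1.0, 0.0, 0.0])]    # marked relevant
dn = [np.array([0.0, 0.0, 0.0, 1.0])]    # marked irrelevant
print(rocchio(q, dr, dn))
```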
8
User relevance feedback:
Probabilistic Model
User feedback: the term probabilities are re-estimated from the docs the user marked as relevant
Smoothing is usually applied
Bad:
o No document term weights are used
o The previous history is lost
o No new terms are added; only the weights are changed
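As a sketch, the usual textbook estimates after one feedback cycle look like this; the +0.5/+1 smoothing shown is one common adjustment, and the notation (D_r, D_{r,i}, n_i, N) is assumed rather than taken from the slides:

```latex
% D_r  = docs the user marked relevant;  D_{r,i} = those among them containing term k_i
% N    = docs in the collection;         n_i     = docs in the collection containing k_i
P(k_i \mid R) \approx \frac{|D_{r,i}| + 0.5}{|D_r| + 1}
\qquad
P(k_i \mid \bar{R}) \approx \frac{n_i - |D_{r,i}| + 0.5}{N - |D_r| + 1}
```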
9
... a variant for the Probabilistic Model
The similarity is multiplied by TF (term frequency)
o Not exactly, but this is the idea
o Initially, IDF is also taken into account
o Details in the book
Still no query expansion; only the original terms are reweighted
10
Evaluation of Relevance Feedback
Simplistic:
o Evaluate precision and recall after the feedback cycle
o Not realistic, since it includes the user's own feedback
Better:
o Only consider unseen docs
o Use the rest of the collection
o The figures are not as good
o Useful to compare different methods, not to compare precision/recall before and after feedback
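A small Python sketch of this "unseen documents only" evaluation; the function name and the way the already-seen docs are passed in are assumptions for illustration:

```python
def residual_precision_recall(ranked, relevant, seen, cutoff=10):
    """Precision/recall at `cutoff`, excluding the docs already shown to the user
    during the feedback cycle (i.e. evaluating on the rest of the collection)."""
    residual = [d for d in ranked if d not in seen]      # unseen docs only
    residual_relevant = set(relevant) - set(seen)        # relevant docs still unseen
    retrieved = residual[:cutoff]
    hits = sum(1 for d in retrieved if d in residual_relevant)
    precision = hits / len(retrieved) if retrieved else 0.0
    recall = hits / len(residual_relevant) if residual_relevant else 0.0
    return precision, recall

# usage: ranking after feedback, the full relevant set, and the docs the user judged
p, r = residual_precision_recall(["d3", "d7", "d1"], {"d3", "d9"}, seen={"d1"})
```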
11
2nd method: Automatic local analysis
Idea: add to the query synonyms, stemming variations, collocations: thesaurus-like relationships
o Based on clustering techniques
Global vs. local strategy:
o Global: the whole collection is used for this
o Local: only the retrieved set is used; similar to feedback, but automatic
Local analysis seems to give better results (better adaptation to the specific query) but is time-consuming
o Good for local collections, not for the Web
Build clusters of words; add to each keyword its neighbors
12
Clustering (words)
Association clusters
o Terms that co-occur in the docs
o The cluster of a query term is the n terms that most frequently occur together with it (normalized vs. non-normalized counts)
Metric clusters (better)
o Multiply the number of co-occurrences by the proximity of the terms in the text
o Terms that occur in the same sentence are more related
Scalar clusters
o Terms co-occurring with the same other terms are related
o Relatedness of two words = scalar product of the centroids of their association clusters
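For concreteness, the usual textbook forms of the first two correlation measures look roughly as follows; the notation (f_{u,j} = frequency of term k_u in doc d_j, r(k_u, k_v) = distance in words between two term occurrences) is assumed here, and the exact definitions are in the book:

```latex
% Association clusters: raw co-occurrence of terms k_u and k_v over the (local) docs,
% and its normalized variant
c_{u,v} = \sum_{d_j} f_{u,j}\, f_{v,j}
\qquad
s_{u,v} = \frac{c_{u,v}}{c_{u,u} + c_{v,v} - c_{u,v}}

% Metric clusters: co-occurrences weighted by the proximity of the occurrences in the text
c_{u,v} = \sum_{\text{occurrences of } k_u}\ \sum_{\text{occurrences of } k_v} \frac{1}{r(k_u, k_v)}
```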
13
... variant (local clustering)
Metric-like reasoning:
o Break the retrieved docs into passages (say, 300 words) and use them as docs
o Use TF-IDF
o Choose the words related (by TF-IDF) to the whole query
o Better: words occurring near each other are more related
o Tune for each collection
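A rough sketch of this passage-based local analysis using scikit-learn's TfidfVectorizer; the passage size, the number of expansion terms, and every name below are illustrative choices rather than the slides' exact procedure:

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer

def expansion_terms(retrieved_docs, query_terms, passage_len=300, top_n=10):
    """Split the retrieved docs into fixed-size passages, build TF-IDF vectors over
    the passages, and rank non-query terms by their association with the query."""
    passages = []
    for doc in retrieved_docs:
        words = doc.split()
        passages += [" ".join(words[i:i + passage_len])
                     for i in range(0, len(words), passage_len)]
    vectorizer = TfidfVectorizer()
    X = vectorizer.fit_transform(passages)                     # passages x terms matrix
    vocab = vectorizer.get_feature_names_out()
    q_cols = [i for i, t in enumerate(vocab) if t in query_terms]
    q_profile = np.asarray(X[:, q_cols].sum(axis=1)).ravel()   # query weight per passage
    scores = np.asarray(X.T @ q_profile).ravel()               # how much each term follows it
    ranked = [vocab[i] for i in np.argsort(-scores) if vocab[i] not in query_terms]
    return ranked[:top_n]
```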
14
3rd method: Automatic Global Analysis
Uses all docs in the collection
Builds a thesaurus
The terms related to the whole query are added (query expansion)
15
Similarity thesaurus
Relatedness = occurring in the same docs
o Matrix of doc x term frequencies
o Inverse term frequency: divided by the size of the doc
Relatedness = correlation between the rows of the matrix
Query: a weighted centroid of its terms (weighted sum)
o Relatedness between a term and this centroid = cosine
The best terms are added to the query, with weights
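A simplified Python sketch of this expansion step; the matrix orientation (terms as rows), the plain cosine weighting, and all names below are assumptions — the book's exact scheme also involves the inverse term frequency factor:

```python
import numpy as np

def similarity_thesaurus_expand(term_doc, vocab, query_idx, query_weights, top_n=5):
    """term_doc: terms x docs weight matrix; query_idx/query_weights describe the
    query terms. Returns the best non-query terms to add, with their similarities."""
    rows = term_doc / (np.linalg.norm(term_doc, axis=1, keepdims=True) + 1e-12)
    # query concept = weighted centroid of the query-term rows
    centroid = np.average(rows[query_idx], axis=0, weights=query_weights)
    centroid = centroid / (np.linalg.norm(centroid) + 1e-12)
    sims = rows @ centroid                       # cosine of every term with the centroid
    order = np.argsort(-sims)
    query_set = set(query_idx)
    return [(vocab[i], float(sims[i])) for i in order if i not in query_set][:top_n]
```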
16
(Global) Statistical thesaurus...
Terms added must be discriminative: low frequency
o Difficult to cluster them (not enough info)
o Solution: first cluster the docs; the frequencies increase
Clustering the docs, e.g.:
o Each doc is a cluster
o Merge the two most similar clusters (= their docs are similar)
o Repeat until <condition>
(see page 136 of the book)
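A naive sketch of the merging loop described above; cosine-of-centroids similarity and the stopping threshold are illustrative choices, while the book's exact cluster-similarity formula is the one referenced on page 136:

```python
import numpy as np

def agglomerative_doc_clusters(doc_vectors, stop_sim=0.3):
    """Start with one cluster per doc and repeatedly merge the two most similar
    clusters (here: cosine of their centroids) until no pair is similar enough."""
    clusters = [[i] for i in range(len(doc_vectors))]

    def centroid(cluster):
        v = np.mean([doc_vectors[i] for i in cluster], axis=0)
        return v / (np.linalg.norm(v) + 1e-12)

    while len(clusters) > 1:
        best_pair, best_sim = None, stop_sim
        for a in range(len(clusters)):
            for b in range(a + 1, len(clusters)):
                sim = float(centroid(clusters[a]) @ centroid(clusters[b]))
                if sim > best_sim:
                    best_pair, best_sim = (a, b), sim
        if best_pair is None:              # no pair above the threshold: stop merging
            break
        a, b = best_pair
        clusters[a] = clusters[a] + clusters[b]
        del clusters[b]
    return clusters
```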
17
... statistical thesaurus
Convert the cluster hierarchy into a set of clusters
o Use a threshold similarity level to cut the hierarchy
o Don't take too-large clusters
Consider only low-frequency (in terms of ITF) terms occurring in the docs of the same class
o Threshold
o These give the clusters of words
Calculate the weight of each class of terms; add these terms, with this weight, to the query terms
18
Research topics
Interactive interfaces
o Graphical, 2D or 3D
Refining global analysis techniques
Application of linguistic methods; stemming
Ontologies
Local analysis for the Web (now too expensive)
Combine the three techniques (feedback, local, global)
19
Conclusions
Relevance feedback
o Simple, understandable
o Needs user attention
o Term reweighting
Local analysis for query expansion
o Co-occurrences in the retrieved docs
o Usually gives better results than global analysis
o Computationally expensive
Global analysis
o Not as good results, since what is good for the whole collection is not good for a specific query
o Linguistic methods, dictionaries, ontologies, stemming, ...
20
Exam
Questions and exercises
You do what you consider appropriate
Discussion on Oct 23 or maybe Nov 6 (??)
The class of Oct 30 is moved to Oct 23
21
Thank you! Till October 23
October 23: discussion of the midterm exam (class moved from October 30)