Special Topics in Computer Science
The Art of Information Retrieval
Chapter 5: Query Operations
Alexander Gelbukh
www.Gelbukh.com
2
Previous chapter: Conclusions
Query languages (breadth-wise):
o Words, phrases, proximity, fuzzy Boolean, natural language
Query languages (depth-wise):
o Pattern matching
o If patterns return sets, they can be combined using the Boolean model
Combining with structure:
o Hierarchical structure
Standardized low-level languages: protocols
o Reusable
3
Previous chapter: Trends and research topics
Models: to better understand the user's needs
Query languages: flexibility, power, expressiveness, functionality
Visual languages
o Example: a library shown on the screen; the user acts: takes books, opens catalogs, etc.
o Better Boolean queries: “I need books by Cervantes AND Lope de Vega”?!
4
Query operations
Users have difficulties formulating queries; the program improves the query
o Interactive mode: using the user's feedback
o Using info from the retrieved set
o Using linguistic information or information from the collection
Query expansion:
o Add new terms
Term reweighting:
o Modify the weights of the existing terms
5
1st method: User relevance feedback
The user examines the top 10 (or 20) docs and marks the relevant ones
The system uses this to construct a new query
o Moved toward the relevant docs
o Away from the irrelevant ones
Good: simplicity
Note: throughout this chapter, the correct spelling is Rocchio
6
User relevance feedback:
Vector Space Model
Best vector to distinguish good docs from bad ones: the average of the good docs minus the average of the bad ones (see the formula below)
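For reference, a common textbook formulation of this idea is the Rocchio formula; the notation below (D_r and D_n for the relevant/non-relevant docs marked by the user, and the tuning constants α, β, γ) is the standard one, not copied from these slides:

```latex
\vec{q}_m \;=\; \alpha\,\vec{q}
  \;+\; \frac{\beta}{|D_r|}\sum_{\vec{d}_j \in D_r}\vec{d}_j
  \;-\; \frac{\gamma}{|D_n|}\sum_{\vec{d}_j \in D_n}\vec{d}_j
```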
7
User relevance feedback:
Vector Space Model
The variants give equally good results
The original query gives important information: keep it (α > 0)
Relevant docs give more information than irrelevant ones: γ < β
γ = 0: positive feedback only
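A minimal NumPy sketch of how this reweighting could be computed; the function name, the α/β/γ defaults, and the toy vectors are illustrative assumptions, not taken from the slides:

```python
import numpy as np

def rocchio(query, relevant, nonrelevant, alpha=1.0, beta=0.75, gamma=0.15):
    """Return a modified query vector; relevant/nonrelevant are lists of doc vectors."""
    q_new = alpha * np.asarray(query, dtype=float)
    if relevant:                          # move toward the centroid of the relevant docs
        q_new = q_new + beta * np.mean(relevant, axis=0)
    if nonrelevant:                       # move away from the centroid of the irrelevant docs
        q_new = q_new - gamma * np.mean(nonrelevant, axis=0)
    return np.maximum(q_new, 0.0)         # negative term weights are usually dropped

# toy example with a 4-term vocabulary
q  = [1.0, 0.0, 1.0, 0.0]
dr = [np.array([1.0, 1.0, 0.0, 0.0])]    # marked relevant
dn = [np.array([0.0, 0.0, 0.0, 1.0])]    # marked irrelevant
print(rocchio(q, dr, dn))
```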
8
User relevance feedback:
Probabilistic Model
User feedback: the term probabilities are re-estimated from the docs the user marked as relevant
Smoothing is usually applied
Bad:
o No document term weights are used
o The previous history is lost
o No new terms are added; only the weights are changed
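As a sketch, the usual textbook estimates after one feedback cycle look like this; the +0.5/+1 smoothing shown is one common adjustment, and the notation (D_r, D_{r,i}, n_i, N) is assumed rather than taken from the slides:

```latex
% D_r  = docs the user marked relevant;  D_{r,i} = those among them containing term k_i
% N    = docs in the collection;         n_i     = docs in the collection containing k_i
P(k_i \mid R) \approx \frac{|D_{r,i}| + 0.5}{|D_r| + 1}
\qquad
P(k_i \mid \bar{R}) \approx \frac{n_i - |D_{r,i}| + 0.5}{N - |D_r| + 1}
```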
9
... a variant for the Probabilistic Model
The similarity is multiplied by TF (term frequency)
o Not exactly, but this is the idea
o Initially, IDF is also taken into account
o Details in the book
Still no query expansion; only the original terms are reweighted
10
Evaluation of Relevance Feedback
Simplistic:
o Evaluate precision and recall after the feedback cycle
o Not realistic, since it includes the user's own feedback
Better:
o Only consider unseen docs
o Use the rest of the collection
o The figures are not as good
o Useful to compare different methods, not to compare precision/recall before and after feedback
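A small Python sketch of this "unseen documents only" evaluation; the function name and the way the already-seen docs are passed in are assumptions for illustration:

```python
def residual_precision_recall(ranked, relevant, seen, cutoff=10):
    """Precision/recall at `cutoff`, excluding the docs already shown to the user
    during the feedback cycle (i.e. evaluating on the rest of the collection)."""
    residual = [d for d in ranked if d not in seen]      # unseen docs only
    residual_relevant = set(relevant) - set(seen)        # relevant docs still unseen
    retrieved = residual[:cutoff]
    hits = sum(1 for d in retrieved if d in residual_relevant)
    precision = hits / len(retrieved) if retrieved else 0.0
    recall = hits / len(residual_relevant) if residual_relevant else 0.0
    return precision, recall

# usage: ranking after feedback, the full relevant set, and the docs the user judged
p, r = residual_precision_recall(["d3", "d7", "d1"], {"d3", "d9"}, seen={"d1"})
```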
11
2nd method: Automatic local analysis
Idea: add to the query synonyms, stemming variations, collocations: thesaurus-like relationships
o Based on clustering techniques
Global vs. local strategy:
o Global: the whole collection is used for this
o Local: only the retrieved set is used; similar to feedback, but automatic
Local analysis seems to give better results (better adaptation to the specific query) but is time-consuming
o Good for local collections, not for the Web
Build clusters of words; add to each keyword its neighbors
12
Clustering (words)
Association clusters
o Terms that co-occur in the docs
o The cluster of a query term is the n terms that most frequently occur together with it (normalized vs. non-normalized counts)
Metric clusters (better)
o Multiply the number of co-occurrences by the proximity of the terms in the text
o Terms that occur in the same sentence are more related
Scalar clusters
o Terms co-occurring with the same other terms are related
o Relatedness of two words = scalar product of the centroids of their association clusters
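For concreteness, the usual textbook forms of the first two correlation measures look roughly as follows; the notation (f_{u,j} = frequency of term k_u in doc d_j, r(k_u, k_v) = distance in words between two term occurrences) is assumed here, and the exact definitions are in the book:

```latex
% Association clusters: raw co-occurrence of terms k_u and k_v over the (local) docs,
% and its normalized variant
c_{u,v} = \sum_{d_j} f_{u,j}\, f_{v,j}
\qquad
s_{u,v} = \frac{c_{u,v}}{c_{u,u} + c_{v,v} - c_{u,v}}

% Metric clusters: co-occurrences weighted by the proximity of the occurrences in the text
c_{u,v} = \sum_{\text{occurrences of } k_u}\ \sum_{\text{occurrences of } k_v} \frac{1}{r(k_u, k_v)}
```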
13
... variant (local clustering)
Metric-like reasoning:
o Break the retrieved docs into passages (say, 300 words) and use them as docs
o Use TF-IDF
o Choose the words related (by TF-IDF) to the whole query
o Better: words occurring near each other are more related
o Tune for each collection
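A rough sketch of this passage-based local analysis using scikit-learn's TfidfVectorizer; the passage size, the number of expansion terms, and every name below are illustrative choices rather than the slides' exact procedure:

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer

def expansion_terms(retrieved_docs, query_terms, passage_len=300, top_n=10):
    """Split the retrieved docs into fixed-size passages, build TF-IDF vectors over
    the passages, and rank non-query terms by their association with the query."""
    passages = []
    for doc in retrieved_docs:
        words = doc.split()
        passages += [" ".join(words[i:i + passage_len])
                     for i in range(0, len(words), passage_len)]
    vectorizer = TfidfVectorizer()
    X = vectorizer.fit_transform(passages)                     # passages x terms matrix
    vocab = vectorizer.get_feature_names_out()
    q_cols = [i for i, t in enumerate(vocab) if t in query_terms]
    q_profile = np.asarray(X[:, q_cols].sum(axis=1)).ravel()   # query weight per passage
    scores = np.asarray(X.T @ q_profile).ravel()               # how much each term follows it
    ranked = [vocab[i] for i in np.argsort(-scores) if vocab[i] not in query_terms]
    return ranked[:top_n]
```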
14
3rd method: Automatic Global Analysis
Uses all docs in the collection
Builds a thesaurus
The terms related to the whole query are added (query expansion)
15
Similarity thesaurus
Relatedness = occurring in the same docs
o Matrix of doc x term frequencies
o Inverse term frequency: divided by the size of the doc
Relatedness = correlation between the rows of the matrix
Query: a weighted centroid of its terms (weighted sum)
o Relatedness between a term and this centroid = cosine
The best terms are added to the query, with weights
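A simplified Python sketch of this expansion step; the matrix orientation (terms as rows), the plain cosine weighting, and all names below are assumptions — the book's exact scheme also involves the inverse term frequency factor:

```python
import numpy as np

def similarity_thesaurus_expand(term_doc, vocab, query_idx, query_weights, top_n=5):
    """term_doc: terms x docs weight matrix; query_idx/query_weights describe the
    query terms. Returns the best non-query terms to add, with their similarities."""
    rows = term_doc / (np.linalg.norm(term_doc, axis=1, keepdims=True) + 1e-12)
    # query concept = weighted centroid of the query-term rows
    centroid = np.average(rows[query_idx], axis=0, weights=query_weights)
    centroid = centroid / (np.linalg.norm(centroid) + 1e-12)
    sims = rows @ centroid                       # cosine of every term with the centroid
    order = np.argsort(-sims)
    query_set = set(query_idx)
    return [(vocab[i], float(sims[i])) for i in order if i not in query_set][:top_n]
```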
16
(Global) Statistical thesaurus...
Terms added must be discriminative: low frequency
o Difficult to cluster them (not enough info)
o Solution: first cluster the docs; the frequencies increase
Clustering the docs, e.g.:
o Each doc is a cluster
o Merge the two most similar clusters (= their docs are similar)
o Repeat until <condition>
(see page 136 of the book)
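A naive sketch of the merging loop described above; cosine-of-centroids similarity and the stopping threshold are illustrative choices, while the book's exact cluster-similarity formula is the one referenced on page 136:

```python
import numpy as np

def agglomerative_doc_clusters(doc_vectors, stop_sim=0.3):
    """Start with one cluster per doc and repeatedly merge the two most similar
    clusters (here: cosine of their centroids) until no pair is similar enough."""
    clusters = [[i] for i in range(len(doc_vectors))]

    def centroid(cluster):
        v = np.mean([doc_vectors[i] for i in cluster], axis=0)
        return v / (np.linalg.norm(v) + 1e-12)

    while len(clusters) > 1:
        best_pair, best_sim = None, stop_sim
        for a in range(len(clusters)):
            for b in range(a + 1, len(clusters)):
                sim = float(centroid(clusters[a]) @ centroid(clusters[b]))
                if sim > best_sim:
                    best_pair, best_sim = (a, b), sim
        if best_pair is None:              # no pair above the threshold: stop merging
            break
        a, b = best_pair
        clusters[a] = clusters[a] + clusters[b]
        del clusters[b]
    return clusters
```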
17
... statistical thesaurus
Convert the cluster hierarchy into a set of clusters
o Use a threshold similarity level to cut the hierarchy
o Don't take too-large clusters
Consider only low-frequency (in terms of ITF) terms occurring in the docs of the same class
o Threshold
o These give the clusters of words
Calculate the weight of each class of terms; add these terms, with this weight, to the query terms
18
Research topics
Interactive interfaces
o Graphical, 2D or 3D
Refining global analysis techniques
Application of linguistic methods; stemming
Ontologies
Local analysis for the Web (now too expensive)
Combine the three techniques (feedback, local, global)
19
Conclusions
Relevance feedback
o Simple, understandable
o Needs user attention
o Term reweighting
Local analysis for query expansion
o Co-occurrences in the retrieved docs
o Usually gives better results than global analysis
o Computationally expensive
Global analysis
o Not as good results, since what is good for the whole collection is not good for a specific query
o Linguistic methods, dictionaries, ontologies, stemming, ...
20
Exam
Questions and exercises
You do what you consider appropriate
Discussion on Oct 23 or maybe Nov 6 (??)
The class of Oct 30 is moved to Oct 23
21
Thank you! Till October 23
October 23: discussion of the midterm exam (class moved from October 30)