Compact Query Term Selection Using Topically Related Text

Page 1: Compact Query Term Selection Using Topically Related Text

K. Tamsin Maxwell, W. Bruce Croft
SIGIR 2013

Compact Query Term Selection Using Topically Related Text

Page 2: Compact Query Term Selection Using Topically Related Text

• Introduction

• Related Work

• Principle for Term Selection

• PhRank Algorithm

• Evaluation Framework

• Experiments

• Conclusion

Outline

Page 3: Compact Query Term Selection Using Topically Related Text

• Recent query reformulation techniques usually use pseudo-relevance feedback (PRF). However, because they consider words that are not in the original query, the expansion may include peripheral words and cause query drift

• PhRank also uses PRF, but applies it to in-query term selection. Each candidate term contains 1–3 words and is ranked with a score derived from a word co-occurrence graph

• Advantages of PhRank:

1. It is the first method to use PRF for in-query term selection

2. Only a small number of terms are selected, retaining the flexibility to add more or longer terms if required

3. The affinity graph captures aspects of both syntactic and non-syntactic word associations

Introduction

Page 4: Compact Query Term Selection Using Topically Related Text

• Markov chain framework

– The Markov chain framework uses the stationary distribution of a random walk over an affinity graph $G$ to estimate the importance of vertices in the graph

– A random walk describes a succession of random or semi-random steps between vertices $v_i$ and $v_j$ in $G$

– If we define the transition probability between $v_i$ and $v_j$ as $p_{ij}$, and $S_t(v_j)$ as the affinity score of $v_j$ at time $t$, then $S_{t+1}(v_j)$ is the sum of scores for each $v_i$ connected to $v_j$:

$S_{t+1}(v_j) = \sum_{v_i \in adj(v_j)} p_{ij}\, S_t(v_i)$

Related Work

Page 5: Compact Query Term Selection Using Topically Related Text

– A step may sometimes lead to a vertex that is unconnected, so we define a minimum probability $1/|V|$, where $|V|$ is the number of vertices in $G$

– A factor $\lambda$ then controls the balance between the transition probability and the minimum probability (a sketch follows this slide):

$S_{t+1}(v_j) = (1 - \lambda)\,\frac{1}{|V|} + \lambda \sum_{v_i \in adj(v_j)} p_{ij}\, S_t(v_i)$

Related Work
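The damped update above is a standard power iteration. Below is a minimal Python sketch of it, assuming a dict-based graph where graph[v_j] maps each connected vertex v_i to its transition probability p_ij; all names are illustrative, not the authors' code.

```python
def random_walk(graph, damping=0.85, tol=1e-4, max_iter=100):
    """graph[v_j] = {v_i: p_ij, ...}; outgoing weights normalized to sum to 1."""
    vertices = list(graph)
    n = len(vertices)
    score = {v: 1.0 / n for v in vertices}  # uniform starting distribution
    for _ in range(max_iter):
        new_score = {}
        for v_j in vertices:
            # Sum the scores flowing in from each connected vertex v_i,
            # blended with the minimum probability 1/|V| via the factor.
            inflow = sum(p_ij * score[v_i] for v_i, p_ij in graph[v_j].items())
            new_score[v_j] = (1 - damping) / n + damping * inflow
        converged = max(abs(new_score[v] - score[v]) for v in vertices) < tol
        score = new_score
        if converged:  # no vertex changed by more than the tolerance
            break
    return score
```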

Page 6: Compact Query Term Selection Using Topically Related Text

• An informative word:

– Is informative relative to a query: a word should represent the meaning of the query, but a query alone usually does not carry enough information, so PRF is used to enhance the query representation

– Is related to other informative words: the Association Hypothesis states that "if one index term is good at discriminating relevant from non-relevant documents, then any closely associated index term is also likely to be good at this". With an affinity graph, we can capture this by estimating how many words connect to a target word and how strong those connections are

Principle for Term Selection

Page 7: Compact Query Term Selection Using Topically Related Text

• An informative term:

– Contains informative words: we assume informative terms must contain informative words, so we consider individual words when ranking terms

– Is discriminative in the retrieval collection: a term that occurs many times within a small number of documents gives a pronounced relevance signal, so we weight terms with a normalized tf.idf-inspired weight

Principle for Term Selection

Page 8: Compact Query Term Selection Using Topically Related Text

1. Graph construction (sketched after this slide)

– For a query $q$, we first retrieve the top-ranked documents, then define the set $R$ as the query itself plus these pseudo-relevant documents

– Documents in $R$ are stemmed. Each unique word stem is a vertex in the graph $G$

– Vertices $v_x$ and $v_y$ are connected by an edge if the words $x$ and $y$ are adjacent in $R$

2. Edge weights

– The transition probability is based on a linear combination of the counts of $x$ and $y$ co-occurring within windows of size 2 and 10

The PhRank Algorithm
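As a rough illustration of step 1, the sketch below collects vertices and adjacency edges from the set R. The whitespace tokenization and the stem callable are assumptions (any stemmer, such as Porter, would do), and the edge counts here are raw adjacency counts to be replaced by the weights of step 2.

```python
from collections import defaultdict

def build_affinity_graph(documents, query, stem):
    """documents: top-ranked PRF texts; query: the query string;
    stem: a stemming function, e.g. a Porter stemmer."""
    R = [query] + documents                       # R = query plus its PRF documents
    vertices = set()
    edges = defaultdict(float)
    for text in R:
        words = [stem(w) for w in text.lower().split()]
        vertices.update(words)                    # each unique stem is a vertex
        for x, y in zip(words, words[1:]):        # adjacent words share an edge
            if x != y:
                edges[frozenset((x, y))] += 1.0   # raw count; weighted in step 2
    return vertices, edges
```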

Page 9: Compact Query Term Selection Using Topically Related Text

– Edge weights are defined by

$w_{xy} = \sum_{d \in R} p(d \mid R)\; \sigma^{d}_{xy} \left( c^{d}_{2}(x,y) + c^{d}_{10}(x,y) \right)$

where $p(d \mid R)$ is the probability, given $R$, of a document $d$ in which words $x$ and $y$ co-occur, $c^{d}_{2}(x,y)$ and $c^{d}_{10}(x,y)$ are the counts of $x$ and $y$ co-occurring in windows of 2 and 10, and $\sigma^{d}_{xy}$ is the style weight reflecting the importance of the association between $x$ and $y$ in $d$ (see the sketch after this slide)

The PhRank Algorithm
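A sketch of the edge weight just described; the exact way $p(d \mid R)$, the style weight, and the two window counts combine is an assumption reconstructed from the slide's prose, and p_doc and style are hypothetical callables supplied by the caller.

```python
def count_cooccur(words, x, y, window):
    """Count position pairs where x and y fall within `window` words."""
    pos = {x: [], y: []}
    for i, w in enumerate(words):
        if w in pos:
            pos[w].append(i)
    return sum(1 for i in pos[x] for j in pos[y] if 0 < abs(i - j) < window)

def edge_weight(R, x, y, p_doc, style):
    """p_doc(d): probability of document d given R (hypothetical);
    style(d, x, y): style weight of the pair in d (hypothetical)."""
    total = 0.0
    for d in R:
        words = d.lower().split()
        c2 = count_cooccur(words, x, y, window=2)    # adjacent co-occurrence
        c10 = count_cooccur(words, x, y, window=10)  # loose co-occurrence
        if c2 or c10:
            total += p_doc(d) * style(d, x, y) * (c2 + c10)
    return total
```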

Page 10: Compact Query Term Selection Using Topically Related Text

3. Random walk

– A random walk over $G$ proceeds as described in Related Work

– The edge weights are normalized to sum to one

– The iteration stops when the change at every vertex does not exceed 0.0001

4. Vertex weights

– Words are also weighted by how exhaustively they represent the query. A word like "make" can score highly in the affinity graph even though it is not very informative

The PhRank Algorithm

Page 11: Compact Query Term Selection Using Topically Related Text

– We define $r_x$ as a factor that balances exhaustiveness with global saliency, to identify stems that are poor discriminators between relevant and non-relevant documents (sketched after this slide)

– For a word $x$,

$r_x = f_{x,R} \cdot \log\!\left(\frac{|C|}{f_{x,C}}\right)$

where $f_{x,R}$ is the frequency of $x$ in $R$, and $f_{x,C}$ is the frequency of $x$ in the collection $C$

The PhRank Algorithm
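A minimal sketch of the tf.idf-style saliency weight $r_x$ as reconstructed above; the exact form in the paper may differ, and f_R, df_C and num_docs are assumed to be precomputed statistics.

```python
import math

def vertex_weight(x, f_R, df_C, num_docs):
    """f_R: word frequencies in the PRF set R; df_C: document frequencies
    in the collection; num_docs: collection size |C|."""
    # Frequent in R (exhaustiveness) but rare in the collection
    # (global saliency) yields a high weight.
    return f_R.get(x, 0) * math.log(num_docs / max(df_C.get(x, 1), 1))
```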

Page 12: Compact Query Term Selection Using Topically Related Text

5. Term ranking

– For a term $t$, a factor $w_t$ represents the degree to which the term is discriminative in the collection. It is defined by

$w_t = f_{t,C} \cdot \log\!\left(\frac{|C|}{f_{t,C}}\right)$

where $f_{t,C}$ is the frequency with which the words in $t$ co-occur within a window of $4 \cdot |t|$ words in the collection, with the idf component defined just like that of $r_x$

– Finally, the rank score of a term $t$ for $q$ combines $w_t$ with the affinity and saliency scores of its words (a sketch follows this slide):

$s(t, q) = w_t \sum_{x \in t} S(x)\, r_x$

The PhRank Algorithm
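A sketch of the final score under the reconstruction above: a tf.idf-style collection factor w_t multiplied by the summed random-walk and saliency scores of the term's words. The exact combination is an assumption, and cooccur_freq is assumed to be precomputed over the collection.

```python
import math

def term_score(term_words, cooccur_freq, num_docs, walk_score, r):
    """term_words: the 1-3 word stems of the term; cooccur_freq: how often
    they co-occur within a 4 * len(term_words) window in the collection;
    walk_score: affinity scores S(x); r: saliency weights r_x."""
    w_t = cooccur_freq * math.log(num_docs / max(cooccur_freq, 1))
    return w_t * sum(walk_score[x] * r[x] for x in term_words)
```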

Page 13: Compact Query Term Selection Using Topically Related Text

• After ranking, some selected terms still include uninformative words. Because terms are ranked by their overall score, several terms may contain similar words, which decreases diversity

• We apply a simple filter with top-down constraints (sketched after this slide)

– For a term $t$, if a higher-ranked term contains all of the words in $t$, or $t$ contains all of the words in a higher-ranked term, we discard $t$

The PhRank Algorithm
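The top-down filter can be sketched directly from this description; an illustrative implementation, not the authors' code:

```python
def filter_terms(ranked_terms):
    """ranked_terms: terms as tuples of word stems, best-scoring first."""
    kept = []
    for term in ranked_terms:
        words = set(term)
        # Discard a term whose words subsume, or are subsumed by,
        # a higher-ranked (already kept) term.
        if any(words <= set(k) or words >= set(k) for k in kept):
            continue
        kept.append(term)
    return kept
```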

Page 14: Compact Query Term Selection Using Topically Related Text

• Robustness

– Compared against the sequential dependence variant of the Markov random field model, which linearly combines query likelihood with bigram matches in windows of size 2 and 8

• Precision

– Compared against the subset distribution model, which achieves high mean average precision

• Succinctness

– Compared against Key Concepts, which linearly combines a bag-of-words query representation with a weighted bag-of-words representation of key concepts

Evaluation Framework

Page 15: Compact Query Term Selection Using Topically Related Text

• Word dependence

– Four models of phrase belief are compared, as shown in the figure (not reproduced in this transcript)

Evaluation Framework

Page 16: Compact Query Term Selection Using Topically Related Text

• We use Indri on Robust04, WT10G and GOV2 for evaluation

• Feature analysis

– Results of using the individual features in PhRank are reported (table not reproduced in this transcript)

Experiments

Page 17: Compact Query Term Selection Using Topically Related Text

Experiments

Page 18: Compact Query Term Selection Using Topically Related Text

• Comparison with other models (tables not reproduced in this transcript)

Experiments

Page 19: Compact Query Term Selection Using Topically Related Text

• PhRank is a novel method for selecting succinct terms within a query, built on the Markov chain framework

• Although the selected terms are succinct, this is a risky strategy, and it decreases MAP compared with sequential dependence

Conclusion

