Query Suggestions in the Absence of Query Logs
Sumit Bhatia, Debapriyo Majumdar, Prasenjit Mitra
SIGIR’11, July 24–28, 2011, Beijing, China

Introduction
Query suggestions are more useful for difficult topics, for which users have too little knowledge to formulate meaningful queries
A meaningful query suggestion must infer the user’s query intent and information needs, and must help the user find documents containing relevant information
Existing web search engines rely on query logs to make query suggestions, but such logs are not available for desktop or enterprise search systems
Solution: a document-centric probabilistic mechanism that generates query suggestions without query logs by extracting phrases from the document corpus

Related Work
Most previous work provides query expansion and refinement rather than query suggestions
Comp(lete)Search Method:
• Provides real-time auto-completion of the last query term typed by the user
• Requires the user to type at least two characters of the last query term; completions are ranked by term frequency
SimSearch Method:
• The phrase index is searched to find phrases that contain the user-submitted partial query as a sub-phrase
• Selected phrases are presented to the user in order of their occurrence frequency

Proposed QS Approach
Based on a document-centric probabilistic mechanism
Extract phrases from the document corpus to create a database of phrases that can be used to complete partial user queries
Use N-grams of order 1, 2, and 3 (unigrams, bigrams, and trigrams) from the document corpus
Use an idea similar to skip-grams rather than plain N-grams: the order of an N-gram is the number of non-stop-words it contains
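The phrase extraction step can be sketched as follows. This is a minimal illustration, not the authors' implementation: the stop-word list is a toy one, the function name `extract_phrases` is hypothetical, and it assumes that a phrase starts and ends on a non-stop-word.

```python
from collections import Counter

# Toy stop-word list (an assumption; a real system would use a fuller list).
STOP_WORDS = {"a", "an", "the", "of", "in", "for", "to", "and"}

def extract_phrases(tokens, max_order=3, stop_words=STOP_WORDS):
    """Count phrases whose order (number of non-stop-words) is 1..max_order."""
    counts = Counter()
    for start, first in enumerate(tokens):
        if first in stop_words:
            continue  # assumption: a phrase starts on a non-stop-word
        content_words = 0
        for end in range(start, len(tokens)):
            if tokens[end] in stop_words:
                continue  # stop words stay in the phrase but do not add to its order
            content_words += 1
            if content_words > max_order:
                break
            counts[" ".join(tokens[start:end + 1])] += 1
    return counts

phrases = extract_phrases("query suggestions in the absence of query logs".split())
```

With this counting rule, "absence of query logs" is a trigram even though it has four tokens, because "of" does not count towards the order.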

Query Suggestions
At any given instant, after the user has entered k characters, the partial query, denoted $Q_1^k$, can be decomposed as
$Q_1^k = Q_c + Q_t$ (1)
where $Q_c$ is a set of completed words with $|Q_c| \ge 0$, and $Q_t$ is a complete or incomplete last word with $|Q_t| \in \{0, 1\}$
Given a partial query $Q_1^k$ and a phrase $p_i \in P = \{p_1, p_2, \ldots, p_n\}$, what is the probability $P(p_i \mid Q_1^k)$, i.e., the probability that the user will type $p_i$ after typing $Q_1^k$?
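The decomposition of a partial query into Qc and Qt can be illustrated with a short sketch. The function name `decompose` is hypothetical, and treating a trailing space as "the last word is finished" is an assumption made here, not something the slides state.

```python
def decompose(partial_query):
    """Split a partial query into (Qc, Qt): completed words and the word being typed."""
    words = partial_query.split()
    if not words or partial_query[-1].isspace():
        # Assumption: a trailing space means the last word is complete, so Qt is empty.
        return words, ""
    return words[:-1], words[-1]

q_c, q_t = decompose("ubuntu install nvi")
```

Here `q_c` holds the completed words and `q_t` the partially typed last word; for the input `"ubuntu "` the second component is the empty string.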

Query Suggestions (cont’d)
The proposed query suggestion score for a phrase $p_i$ is the probability $P(p_i \mid Q_1^k)$
The probability of selecting a phrase given a partial word, $P(p_i \mid Q_t)$, is computed over $c_i$, a vocabulary word that starts with $Q_t$, and $p_i$, a phrase that contains the word $c_i$
The importance of a phrase is determined by its occurrence frequency in the document corpus
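The exact formula for the phrase selection probability is not reproduced in this text, so the following is only a sketch under assumed relative-frequency estimates: a completion c of the partial word Qt is weighted by its frequency among all vocabulary words starting with Qt, and a phrase by its frequency among all phrases containing c. The function name and the toy frequencies are hypothetical.

```python
def phrase_given_partial_word(q_t, word_freq, phrase_freq):
    """Estimate P(p | Qt) for every phrase p that contains a completion of q_t."""
    completions = {w: f for w, f in word_freq.items() if w.startswith(q_t)}
    total_words = sum(completions.values())
    scores = {}
    for word, freq in completions.items():
        p_word = freq / total_words  # P(c | Qt), assumed frequency-based
        containing = {p: f for p, f in phrase_freq.items() if word in p.split()}
        total_phrases = sum(containing.values())
        for phrase, p_freq in containing.items():
            p_phrase = p_freq / total_phrases  # P(p | c), assumed frequency-based
            scores[phrase] = scores.get(phrase, 0.0) + p_word * p_phrase
    return scores

# Toy corpus statistics (made up for illustration).
word_freq = {"log": 6, "login": 2, "query": 9}
phrase_freq = {"query log": 3, "log file": 1, "login screen": 2}
scores = phrase_given_partial_word("log", word_freq, phrase_freq)
```

Frequent completions and frequent phrases both push a candidate up, which matches the statement that phrase importance comes from corpus occurrence frequencies.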

Estimating Phrase-Query Correlation
The contextual relationship between a phrase $p_i$ and the user-submitted query $Q_c$ is estimated from their joint occurrence in the corpus
$p_i$ is the second half of the complete query and $Q_c$ is the first half
Both $P(Q_c, p_i)$ and $P(p_i)$ can be estimated from the corpus using $D_{p_i}$ and $D_{Q_c}$, the sets of documents that contain the phrase $p_i$ and $Q_c$, respectively
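One way to turn the two document sets into a correlation estimate is sketched below. It assumes document-frequency estimates, i.e. P(Qc, p) is the fraction of documents containing both and P(p) the fraction containing the phrase; the estimator on the original slide is not present in this text, and the function name is hypothetical.

```python
def phrase_query_correlation(docs_with_phrase, docs_with_query, num_docs):
    """Estimate P(Qc | p) = P(Qc, p) / P(p) from sets of document ids."""
    joint = len(docs_with_phrase & docs_with_query) / num_docs  # P(Qc, p), assumed estimate
    prior = len(docs_with_phrase) / num_docs                    # P(p), assumed estimate
    return joint / prior if prior else 0.0

# Toy document-id sets (made up for illustration).
correlation = phrase_query_correlation({1, 2, 3, 4}, {2, 4, 9}, num_docs=10)
```

In this toy case two of the four documents containing the phrase also contain the query words, so the estimate is one half.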

Experimental Results: Datasets
Two datasets were used
TREC: Consists of more than 200K news articles published in the Financial Times between 1991 and 1994
Ubuntu: Consists of more than 100K discussion threads crawled from ubuntuforums.org, 25 queries, and relevance judgments

Baseline Methods
The proposed method was compared with the following two baseline methods
Similarity-based phrase search (SimSearch)
• Indexed phrases that contain the user query as a sub-phrase are retrieved and ranked by their occurrence frequencies
CompleteSearch (CompSearch)
• Offers real-time auto-completion of the last query term being typed by the user
• Also uses frequency as the ranking criterion

Test Queries
For each dataset, 40 partial test queries were generated from 20 randomly chosen, non-stop-word, multi-keyword queries
Type-A Queries
• Generated by retaining only the first keyword from each of the 20 original queries
Type-B Queries
• Generated by retaining the first keyword of the query followed by the first k characters of the remaining query string, where k is chosen at random (2 ≤ k ≤ length of the remaining query string)
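The two query types can be derived from an original query as sketched below. The function name and the example query are hypothetical, and the sketch assumes the remaining query string has at least two characters.

```python
import random

def make_test_queries(query, rng=random):
    """Return (type_a, type_b) partial queries for a multi-keyword query."""
    first, _, rest = query.partition(" ")
    k = rng.randint(2, len(rest))  # 2 <= k <= len(rest), both ends inclusive
    return first, f"{first} {rest[:k]}"

original = "install nvidia driver"  # made-up example query
type_a, type_b = make_test_queries(original, random.Random(0))
```

Each original query thus yields one Type-A and one Type-B query, which is how 20 original queries give 40 partial test queries.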

Evaluation
For each test query, the top 10 suggestions generated by SimSearch, CompSearch, and the proposed probabilistic method were collected and evaluated by 3 assessors
Evaluation was performed with the help of 12 volunteers, colleagues not associated with the project
Each assessor assigned one of four ratings to each query suggestion, and the majority vote was used
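Majority voting over the assessors' ratings can be sketched as follows. The rating labels in the example are placeholders, and returning `None` when no rating has a strict majority is an assumption; the slides do not say how such cases were handled.

```python
from collections import Counter

def majority_rating(ratings):
    """Return the rating chosen by more than half of the assessors, else None."""
    top, votes = Counter(ratings).most_common(1)[0]
    return top if votes > len(ratings) / 2 else None

# Placeholder labels; the actual four ratings are not listed in this text.
final = majority_rating(["meaningful", "meaningful", "not meaningful"])
```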

Suggestions Created by Two Test Queries

Success Rate of Different Methods
A query suggestion method is successful for a given partial query if it generates at least one meaningful suggestion for that query
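Under this definition, the success rate of a method is the fraction of partial queries for which it succeeds. A minimal sketch, with a hypothetical function name and made-up judgments:

```python
def success_rate(judged):
    """judged: one list of booleans per partial query (True = meaningful suggestion)."""
    return sum(any(flags) for flags in judged) / len(judged)

# Four partial queries; the last one received no suggestions at all.
rate = success_rate([[False, True], [False, False], [True], []])
```

A query with an empty suggestion list counts as a failure, since no meaningful suggestion was produced for it.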

Quality of Suggestions

Precision Values Achieved by Different QS

Effectiveness of Suggested Queries
The query clarity score is used to measure the retrieval performance of suggested queries
The clarity score of a query increases when added terms reduce query ambiguity and decreases when added terms make the query more ambiguous
The clarity score of a query q with respect to a collection of documents C is computed using KL divergence, where V is the vocabulary of the collection
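A sketch of the clarity computation, assuming the commonly used formulation: the KL divergence between the query language model P(w|q) and the collection language model P(w|C), summed over the vocabulary V. The base-2 logarithm, the function name, and the toy models are assumptions; how P(w|q) is estimated from retrieved documents is outside this sketch.

```python
import math

def clarity_score(query_model, collection_model):
    """KL divergence (in bits) of the query model from the collection model."""
    return sum(
        p * math.log2(p / collection_model[w])  # terms with zero query probability add nothing
        for w, p in query_model.items()
        if p > 0
    )

# Toy language models (made up for illustration); each sums to 1.
collection_model = {"ubuntu": 0.125, "install": 0.125, "the": 0.75}
focused = clarity_score({"ubuntu": 0.5, "install": 0.5}, collection_model)
vague = clarity_score(collection_model, collection_model)
```

A query model identical to the collection model scores zero, the most ambiguous case; the more the query model concentrates on terms that are rare in the collection, the higher the score.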

Clarity Scores Achieved by Different QS

Conclusions and Future Work
Meaningful query suggestions can be made in the absence of query logs with a probabilistic approach that uses the occurrence of terms and phrases in a corpus of documents
Future work
• Ensure that badly formed combinations of phrases are eliminated from the suggestions
• Explore the use of synonyms and synonymous phrases so that the system can suggest alternatives
• Develop a systematic approach to diversifying the suggested queries
• Apply the method at a larger scale