Page 1: Learning To Rank User Queries to Detect Search Tasks

Learning to Rank User Queries to Detect Search Tasks

Claudio Lucchese 1, Franco Maria Nardini 1, Salvatore Orlando 2, Gabriele Tolomei 3

1 ISTI-CNR, Pisa, Italy
2 Università Ca' Foscari Venezia, Italy

3 Yahoo Labs, London, UK

Page 2: Learning To Rank User Queries to Detect Search Tasks


Introduction

The Evolution of Web Search

An increasing number of user searches are part of complex patterns.

Complex search patterns are often composed of several multi-term, interleaved queries spread across many sessions.

User information needs are getting harder to understand and satisfy.

Search Task: clusters of queries with the same latent information need, drawn from a real-world search engine log.

Page 3: Learning To Rank User Queries to Detect Search Tasks


Introduction

Complex Search Patterns: AOL 2006

Queries within short-time sessions are part of different complex tasks.

Each complex task spans several sessions.

Page 4: Learning To Rank User Queries to Detect Search Tasks


Related Work

Related Work - I

Jones and Klinkner [3]

First high-level analysis of user search tasks.

Hierarchical search:

Flat query streams can be structured as complex search missions linked to each other. Each mission in turn contains simpler search goals.

They design a binary classifier able to predict whether two queries belong to the same goal or not.

Page 5: Learning To Rank User Queries to Detect Search Tasks


Related Work

Related Work - II

Lucchese et al. [4, 5]

Formally introduce the search task discovery problem.

Graph-based representation of each user session:

Nodes are queries. Edges between query pairs are weighted according to a query similarity measure.

Search tasks are identified by the connected components of each user session graph.

Outperforms other approaches for session boundary detection, such as the Query-Flow Graph [1].

Page 6: Learning To Rank User Queries to Detect Search Tasks


Related Work

Related Work - III

Wang et al. [6]

Cross-session discovery of search tasks.

Graph-based representation of all queries.

Search tasks as connected components of the graph with the following characteristics:

Each query of a task can be linked only to one past query of the same task. Tasks are therefore modeled as trees. An SVM model identifies the best tree structure hidden in the query similarity graph.

Page 7: Learning To Rank User Queries to Detect Search Tasks


Search Task Discovery (STD) Framework

Search Task Discovery Framework

Starting from a ground truth of search tasks, the framework proceeds in three steps:

1. QSL: learning the query similarity function.
2. Learning the pruning threshold.
3. GQC: finding the connected components.

Page 8: Learning To Rank User Queries to Detect Search Tasks


Search Task Discovery (STD) Framework

Query Similarity Learning (QSL)

Query Similarity Learning

Query Similarity Learning (QSL): estimates a target query similarity function σ from a ground truth of manually-labeled search tasks.

Binary classes: same-task (positive) and not-same-task (negative).

Learning to Rank: instead of predicting same-task, we learn a ranking function that ranks same-task queries highest.

How to build the training sets?

Page 9: Learning To Rank User Queries to Detect Search Tasks


Search Task Discovery (STD) Framework

Graph-based Query Clustering (GQC)

Graph-based Query Clustering

Graph-based Query Clustering (GQC): transforms a user search log into a weighted query graph G^u_σ:

Queries are nodes. Edges are labeled using the similarity σ previously learnt by QSL.

Connected components of G^u_σ: user search tasks.

Weak edges introduced by σ:

ε-neighborhood technique as selective pruning: edges are removed if their weight is below a given ε. The optimal ε is learnt from the ground truth (a sketch follows below).
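A minimal sketch of this clustering step, assuming a learned similarity function sigma and a threshold eps are given; the names and the union-find implementation are illustrative, not from the paper:

```python
# Sketch of GQC: prune edges whose learned similarity falls below eps,
# then return the connected components of what remains as search tasks.
# `queries` is one user's query log; union-find stands in for any
# connected-components algorithm.
from itertools import combinations

def gqc(queries, sigma, eps):
    parent = list(range(len(queries)))

    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving
            x = parent[x]
        return x

    for i, j in combinations(range(len(queries)), 2):
        if sigma(queries[i], queries[j]) >= eps:  # keep only strong edges
            parent[find(i)] = find(j)

    tasks = {}
    for i in range(len(queries)):
        tasks.setdefault(find(i), []).append(queries[i])
    return list(tasks.values())
```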

Page 10: Learning To Rank User Queries to Detect Search Tasks


Search Task Discovery (STD) Framework

Search Task Discovery Problem


Given a user query log Q^u, a clustering algorithm C that extracts the connected components of the graph, and a quality function γ measuring the quality of a clustering, the Search Task Discovery Problem requires finding the best similarity function and pruning threshold that maximize the average quality of the clusterings C(G^u_{σ,ε}) for all u ∈ U, i.e.,

(σ*, ε*) = argmax_{σ,ε} (1/|U|) · Σ_{u∈U} γ(C(G^u_{σ,ε})).
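As a concrete reading of the objective, here is a hedged sketch of how ε could be fit once σ is fixed: a grid search maximizing the average clustering quality γ over all users. Grid search, and the helper names user_logs and ground_truth, are assumptions; the deck only states that the optimal ε is learnt from the ground truth.

```python
# Pick the pruning threshold eps maximizing the average quality gamma.
# `gqc` is the clustering sketch above; `gamma(pred, truth)` scores a
# predicted task clustering against the labeled one (e.g., Jaccard).
def learn_epsilon(user_logs, ground_truth, sigma, gamma):
    grid = [i / 100 for i in range(1, 100)]  # candidate thresholds
    best_eps, best_quality = None, float("-inf")
    for eps in grid:
        avg = sum(
            gamma(gqc(queries, sigma, eps), ground_truth[u])
            for u, queries in user_logs.items()
        ) / len(user_logs)
        if avg > best_quality:
            best_eps, best_quality = eps, avg
    return best_eps
```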

Page 11: Learning To Rank User Queries to Detect Search Tasks


Reducing QSL to a Learning to Rank Problem


Query-centric approach: we aim at learning a query similarity function that scores higher those queries that appear in the same task.

For any given q_i^u ∈ T_k^u, we say that q_j^u is relevant to q_i^u if q_j^u ∈ T_k^u, and irrelevant otherwise.

Labels {1, 0} are assigned accordingly when building the training set.

Number of relevance labels: Σ_u |Q^u|².
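A sketch of how this query-centric training set could be materialized; task_of (mapping a query to its ground-truth task id) and extract_features (computing the pairwise feature vector) are hypothetical helpers:

```python
def query_centric_examples(queries, task_of, extract_features):
    # One ranking group per pivot query q_i: every other query of the
    # same user is a candidate, labeled 1 iff it shares q_i's task.
    examples = []
    for i, qi in enumerate(queries):
        for j, qj in enumerate(queries):
            if i == j:
                continue  # a query is not a candidate for itself
            label = 1 if task_of[qj] == task_of[qi] else 0
            examples.append((i, extract_features(qi, qj), label))
    return examples  # on the order of |Q_u|^2 labels per user
```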

Page 12: Learning To Rank User Queries to Detect Search Tasks


Reducing QSL to a Learning to Rank Problem


User-centric approach: we aim at learning a query similarity function that scores higher every pair of objects in the same task.

Given any pair of queries (q_i^u, q_j^u) in the user search log Q^u, we require their similarity to be high iff they belong to the same task.

Here, each (q_i^u, q_j^u) is a single ordered pair, with i ≤ j, associated with the tuple for user u.

The binary relation "≤" between queries is given by the order of their issuing times.

Number of relevance labels: Σ_u (|Q^u| choose 2).
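The user-centric variant, under the same hypothetical helpers: one example per time-ordered pair, giving (|Q^u| choose 2) labels per user:

```python
from itertools import combinations

def user_centric_examples(queries, task_of, extract_features):
    # `queries` is assumed sorted by issuing time, so i < j respects
    # the "≤" relation; combinations() yields each ordered pair once.
    examples = []
    for (i, qi), (j, qj) in combinations(enumerate(queries), 2):
        label = 1 if task_of[qi] == task_of[qj] else 0
        examples.append(((i, j), extract_features(qi, qj), label))
    return examples
```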

Page 13: Learning To Rank User Queries to Detect Search Tasks


Reducing QSL to a Learning to Rank Problem

Types of Features Used¹

Symmetric global features based on Q^u.

Examples: Session Num Queries, Session Time Span, Avg Session Query Len, etc.

Symmetric features extracted from the query pair (q_i^u, q_j^u).

Examples: Levenshtein, Jaccard (3-grams), Δ Time, Δ Pos, Global Joint Prob (queries), Wikipedia Cosine [5].

Asymmetric features extracted from the query pair (q_i^u, q_j^u).

Examples: Is Proper Subset (term-set), Global Conditional Prob (queries).

37 features in total.

¹ See our paper for the complete list of features used.
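For illustration, hedged sketches of three of the listed features; the exact definitions (tokenization, normalization) are assumptions here, and the paper has the authoritative versions:

```python
def jaccard_3grams(q1, q2):
    # Symmetric: character 3-gram overlap between the two query strings.
    g1 = {q1[i:i + 3] for i in range(len(q1) - 2)}
    g2 = {q2[i:i + 3] for i in range(len(q2) - 2)}
    return len(g1 & g2) / len(g1 | g2) if g1 | g2 else 0.0

def delta_time(t1, t2):
    # Symmetric: absolute gap between the issuing timestamps.
    return abs(t2 - t1)

def is_proper_subset(q1, q2):
    # Asymmetric: True only if q1's term set is strictly inside q2's.
    return set(q1.split()) < set(q2.split())
```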

Page 14: Learning To Rank User Queries to Detect Search Tasks


Implementing STD


Learning to Rank:
Gradient-Boosted Regression Trees (GBRT), optimizing RMSE.
LambdaMART (λMART), optimizing nDCG.

Binary Classification:
Logistic Regression (LogReg) [3], with logistic loss.
Decision Trees (DT) [5], with information gain.

k-fold cross validation (k = 5): parameters tuned on validation data.

Clustering quality measure used to learn ε: Jaccard.
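The slides name the models but not a toolkit. As one concrete possibility, LightGBM's lambdarank objective is a widely available λMART-style implementation; the sketch below, including its parameter choices, is an assumption rather than the authors' setup.

```python
import lightgbm as lgb
import numpy as np

def train_ranker(X, y, group_sizes):
    # X: pairwise feature matrix; y: {0, 1} relevance labels;
    # group_sizes: number of candidates per ranking group
    # (e.g., one group per pivot query in the query-centric setting).
    train = lgb.Dataset(np.asarray(X), label=np.asarray(y), group=group_sizes)
    params = {"objective": "lambdarank", "metric": "ndcg"}
    return lgb.train(params, train, num_boost_round=100)
```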

Page 15: Learning To Rank User Queries to Detect Search Tasks


Experiments

Dataset


Proposed by Hagen et al. [2]

Three-month sample of the AOL query log. 8,840 queries issued by 127 users. Labeled by two human assessors into 1,378 user search tasks (called missions in the original paper).

We remove stopwords and noisy characters, and transform query strings to lowercase.

We remove the longest and the shortest user sessions.

Resulting dataset: 6,381 queries from 125 users.
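A rough sketch of the described preprocessing; the stopword list and the definition of noisy characters are assumptions, not the authors' exact choices:

```python
import re

STOPWORDS = {"the", "a", "an", "of", "and", "or", "in", "on", "to", "for"}

def normalize_query(q):
    q = q.lower()                          # lowercase
    q = re.sub(r"[^a-z0-9 ]+", " ", q)     # drop noisy characters
    terms = [t for t in q.split() if t not in STOPWORDS]
    return " ".join(terms)                 # stopwords removed
```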

Page 16: Learning To Rank User Queries to Detect Search Tasks


Experiments

Dataset


[Figure 2: Properties of the dataset. (a) Ratio of tasks by number of queries; (b) ratio of users by task size; (c) ratio of users by number of tasks; (d) ratio of users by number of singleton tasks.]


Singleton tasks are 41% of the dataset.

Singleton baseline: always answering not-same-task.

Page 17: Learning To Rank User Queries to Detect Search Tasks


Experiments

Results


Query-centric approach


Table 5(a): Query-centric dataset L′. Comparison of L2R techniques and baselines in terms of Rand, F1avg, Jaccard, and F1, averaged across the 5 cross-validation folds; (*) marks results statistically significant at α = .05.

Method     Rand   F1avg     Jaccard   F1
Singleton  0.738  0.458     0         0
DT         0.898  0.853     0.620     0.714
LogReg     0.919  0.868     0.639     0.737
GBRT       0.915  0.889(*)  0.670     0.763
λMART      0.919  0.879     0.687(*)  0.778(*)


Because of the skewness of the class label distribution in our dataset, the Rand index, even though it is a widely used measure, is not completely able to provide insights about the quality of our clustering. We conclude that the presented framework, thanks to the exploitation of an L2R approach, is able to improve over the state of the art.


Page 18: Learning To Rank User Queries to Detect Search Tasks


Experiments

Results


User-centric approach


Table 5(b): User-centric dataset L″. Comparison of L2R techniques and baselines in terms of Rand, F1avg, Jaccard, and F1, averaged across the 5 cross-validation folds; (*) marks results statistically significant at α = .05.

Method     Rand      F1avg     Jaccard   F1
Singleton  0.738     0.458     0         0
DT         0.880     0.843     0.604     0.706
LogReg     0.921(*)  0.868     0.639     0.738
GBRT       0.913     0.875(*)  0.682     0.771
λMART      0.914     0.873     0.684(*)  0.778(*)


Page 19: Learning To Rank User Queries to Detect Search Tasks


Conclusion


We proposed the Search Task Discovery framework made up of two modules: QSL and GQC.

QSL learns a query similarity function from a ground truth of manually-labeled search tasks. GQC models the user queries as a graph.

We propose to employ Learning to Rank techniques (GBRT, λMART) in QSL.

Experiments prove the effectiveness of Learning to Rank techniques in detecting search tasks.

Future Work: We plan to employ STD in a streaming setting to detect search tasks in (pseudo) real time.

Page 20: Learning To Rank User Queries to Detect Search Tasks


Conclusion

Thank you for your attention!

Page 21: Learning To Rank User Queries to Detect Search Tasks


Conclusion

References I

[1] P. Boldi, F. Bonchi, C. Castillo, D. Donato, A. Gionis, and S. Vigna. The query-flow graph: model and applications. In CIKM'08, pages 609–618. ACM, 2008.

[2] M. Hagen, J. Gomoll, A. Beyer, and B. Stein. From search session detection to search mission detection. In OAIR'13, pages 85–92, 2013.

[3] R. Jones and K. L. Klinkner. Beyond the session timeout: automatic hierarchical segmentation of search topics in query logs. In CIKM'08. ACM, 2008.

[4] C. Lucchese, S. Orlando, R. Perego, F. Silvestri, and G. Tolomei. Identifying task-based sessions in search engine query logs. In WSDM'11, pages 277–286. ACM, 2011.

[5] C. Lucchese, S. Orlando, R. Perego, F. Silvestri, and G. Tolomei. Discovering user tasks in long-term web search engine logs. ACM TOIS, 31(3):1–43, July 2013.

Page 22: Learning To Rank User Queries to Detect Search Tasks


Conclusion

References II

[6] H. Wang, Y. Song, M.-W. Chang, X. He, R. W. White, and W. Chu. Learning to extract cross-session search tasks. In WWW'13, pages 1353–1364. ACM, 2013.

Page 23: Learning To Rank User Queries to Detect Search Tasks


Conclusion

Clustering Metrics

Rand = (tp + tn) / (tp + tn + fp + fn)

Jaccard = tp / (tp + fp + fn)

F1 = 2 · p · r / (p + r)

F1avg = Σ_j (m_j / m) · F1max(j), where m_j = |T_j^u| and m = |Q^u|.
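A minimal sketch of the pairwise versions of these metrics, where tp/tn/fp/fn count query pairs placed in the same or different tasks by the prediction versus the ground truth:

```python
def pairwise_metrics(tp, tn, fp, fn):
    rand = (tp + tn) / (tp + tn + fp + fn)
    jaccard = tp / (tp + fp + fn) if tp + fp + fn else 0.0
    p = tp / (tp + fp) if tp + fp else 0.0       # precision
    r = tp / (tp + fn) if tp + fn else 0.0       # recall
    f1 = 2 * p * r / (p + r) if p + r else 0.0
    return rand, jaccard, f1
```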

Page 24: Learning To Rank User Queries to Detect Search Tasks


Conclusion

Feature Importance
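The most important features for the best performing solution, the L2R-based λMART, can be assessed as follows. Since λMART rankers are based on GBRT, we borrow a feature evaluation method from Friedman's original work on gradient boosting: during the construction of each tree, the gain associated with each feature is accumulated over the splitting nodes, and features are ranked by cumulative gain. For the λMART model learned on Fold-1 of the query-centric dataset, the most discriminative features are those regarding the relative issuing times and positions of a given pair of queries. This is reasonable, since two queries issued close together have a higher chance of being part of the same task, while two queries far apart are less likely to end up in the same task. The second most important feature is a global statistic about the user search log. The first lexical content signal occurs only at the fifth position, while the semantic feature (i.e., Wikipedia Cosine) appears last in this ranking.

A sketch of this gain-based ranking, using LightGBM's built-in importances as a stand-in for the Friedman-style cumulative gain (the authors' exact computation may differ):

```python
def rank_features(booster, feature_names):
    # Total split gain accumulated per feature across all trees.
    gains = booster.feature_importance(importance_type="gain")
    return sorted(zip(feature_names, gains), key=lambda t: -t[1])
```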

