A general approximation framework for direct optimization of information retrieval measures
Presenter: Shih-Hsiang Lin (林士翔)
Tao Qin, Tie-Yan Liu, Hang Li, Microsoft Research Asia, Beijing, China
Reference:
1. Joachims, T. (2002). Optimizing search engines using clickthrough data. In KDD '02.
2. Freund, Y., et al. (2003). An efficient boosting algorithm for combining preferences. Journal of Machine Learning Research, 4, 933–969.
3. Burges, C., et al. (2005). Learning to rank using gradient descent. In ICML '05.
4. Cao, Z., et al. (2007). Learning to rank: From pairwise approach to listwise approach. In ICML '07.
5. Xu, J., & Li, H. (2007). AdaRank: A boosting algorithm for information retrieval. In SIGIR '07.
6. He, Y., et al. (2008). Are algorithms directly optimizing IR measures really direct? Technical Report MSR-TR-2008-154, Microsoft Corporation.
7. Xia, F., et al. (2008). Listwise approach to learning to rank: Theory and algorithm. In ICML '08.
8. Xu, J., et al. (2008). Directly optimizing evaluation measures in learning to rank. In SIGIR '08.
Recently, direct optimization of information retrieval (IR) measures has become a new trend in learning to rank
◦ IR measures are explicitly considered in the direct optimization approach
◦ Generally, these methods can be grouped into two categories
  - introduce upper bounds of the IR measures
  - approximate the IR measures using smooth functions
Open problems
◦ The relationships between the surrogate functions and the corresponding IR measures have not been sufficiently studied
◦ Some of the proposed surrogate functions are not easy to optimize
INTRODUCTION
The main contributions of this work include
◦ They set up a general framework for direct optimization
  - it is applicable to any position-based IR measure
◦ They take AP and NDCG as two examples to show how to optimize position-based IR measures as surrogate functions in the framework
◦ They provide a theoretical justification for the direct optimization approach
INTRODUCTION
Precision@k
◦ Evaluates the top k positions of a ranked list using two levels (relevant and irrelevant) of relevance judgment

Average Precision (AP)
◦ e.g. relevant docs ranked at positions 1, 5, 10; precisions are 1/1, 2/5, 3/10; AP = (1/1 + 2/5 + 3/10)/3 ≈ 0.57

MAP is defined as the mean of AP over a set of queries
REVIEW ON IR MEASURES (1/3)
$$\mathrm{Pre@}k = \frac{1}{k}\sum_{j=1}^{k} r_j$$

k denotes the truncation position; r_j equals one if the document in the j-th position is relevant and zero otherwise

$$\mathrm{AP} = \frac{1}{|D^{+}|}\sum_{j} r_j \cdot \mathrm{Pre@}j$$

|D^+| denotes the number of relevant documents w.r.t. the query
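As an illustration (a minimal sketch, not from the slides), the two definitions above can be written directly in Python; the example reproduces the AP computation from the previous slide:

```python
def precision_at_k(rels, k):
    # rels: list of 0/1 relevance labels in rank order
    return sum(rels[:k]) / k

def average_precision(rels):
    # AP: mean of Pre@j over the positions j holding relevant docs
    num_rel = sum(rels)
    return sum(precision_at_k(rels, j)
               for j in range(1, len(rels) + 1)
               if rels[j - 1] == 1) / num_rel

# Slide example: relevant docs ranked at positions 1, 5, 10
rels = [1, 0, 0, 0, 1, 0, 0, 0, 0, 1]
ap = average_precision(rels)  # (1/1 + 2/5 + 3/10) / 3 ≈ 0.567
```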
Normalized Discounted Cumulative Gain (NDCG)
◦ It is designed for multiple levels of relevance judgments
◦ Uses graded relevance as a measure of the usefulness, or gain, from examining a document
◦ Discounted Cumulative Gain (DCG) is the total gain accumulated at a particular rank k
$$\mathrm{DCG@}k = \sum_{j=1}^{k} \frac{2^{r_j}-1}{\log_2(1+j)}$$

e.g. 10 ranked documents judged on a 0-3 relevance scale (first 7 positions shown):

rank j:                 1     2      3      4     5      6      7
r_j:                    3     3      2      2     1      1      1
gain 2^{r_j} - 1:       7     7      3      3     1      1      1
discount 1/log2(1+j):   1     0.63   0.5    0.43  0.39   0.36   0.33
DCG@j:                  7     11.41  12.91  14.2  14.59  14.95  15.28

REVIEW ON IR MEASURES (2/3)
◦ NDCG is defined as

REVIEW ON IR MEASURES (3/3)

$$\mathrm{NDCG@}k = \frac{1}{N_k}\sum_{j=1}^{k} \frac{2^{r_j}-1}{\log_2(1+j)}$$

N_k is a query-dependent normalization constant chosen so that the maximum value of NDCG@k for the query is 1
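A small Python sketch of DCG/NDCG, reproducing the worked example above (the DCG of the ideal, relevance-sorted ranking plays the role of N_k here; this is the standard choice, stated as an assumption):

```python
import math

def dcg_at_k(rels, k):
    # rels: graded relevance labels in rank order (e.g. 0-3 scale)
    return sum((2 ** r - 1) / math.log2(1 + j)
               for j, r in enumerate(rels[:k], start=1))

def ndcg_at_k(rels, k):
    # N_k: DCG of the ideal ranking, so that NDCG@k <= 1
    ideal = dcg_at_k(sorted(rels, reverse=True), k)
    return dcg_at_k(rels, k) / ideal if ideal > 0 else 0.0

rels = [3, 3, 2, 2, 1, 1, 1]      # slide example
print(round(dcg_at_k(rels, 7), 2))  # 15.28, matching the table
```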
The framework consists of four steps:
◦ Reformulating an IR measure from ‘indexed by positions’ to ‘indexed by documents’
◦ Approximating the position function with a logistic function of ranking scores of documents
◦ Approximating the truncation with a logistic function of positions of documents
◦ Applying a global optimization technique to optimize the approximated measure (surrogate function)

A GENERAL APPROXIMATION FRAMEWORK
Most IR measures, for example Precision@k, AP and NDCG, are position based
◦ The summations in the definitions of the IR measures are taken over positions
◦ The position of a document may change during the training process, which makes the optimization of the IR measures difficult

When indexed by documents, Precision@k can be re-written as below
STEP1: Measure Reformulation (1/2)
$$\mathrm{Pre@}k = \frac{1}{k}\sum_{x\in X} r(x)\,\mathbf{1}\{\pi(x)\le k\}$$

X is the set of documents; r(x) equals one for a relevant document and zero otherwise; π(x) denotes the position of x in the ranked list; 1{·} is the truncation (indicator) function
With documents as indexes, AP can be re-written as

$$\mathrm{AP} = \frac{1}{|D^{+}|}\sum_{y\in X} r(y)\,\mathrm{Pre@}\pi(y)$$

Combining the above two equations yields

$$\mathrm{AP} = \frac{1}{|D^{+}|}\sum_{y\in X} \frac{r(y)}{\pi(y)}\Big(1 + \sum_{x\in X,\,x\ne y} r(x)\,\mathbf{1}\{\pi(x)<\pi(y)\}\Big)$$

So far, these measures are still non-continuous and non-differentiable

STEP1: Measure Reformulation (2/2)
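The document-indexed form of AP can be checked against the position-indexed definition with a short sketch (my own illustration; it assumes no tied scores):

```python
def ap_by_documents(scores, rels):
    # AP "indexed by documents": sums over documents y using their
    # positions pi(y), instead of summing over rank positions
    pi = [1 + sum(1 for t in scores if t > s) for s in scores]
    d_plus = sum(rels)
    total = 0.0
    for y in range(len(scores)):
        if rels[y]:
            # 1 + number of relevant docs ranked above y
            inner = 1 + sum(1 for x in range(len(scores))
                            if x != y and rels[x] and pi[x] < pi[y])
            total += inner / pi[y]
    return total / d_plus

scores = [0.9, 0.1, 0.8, 0.3, 0.7]
rels   = [1, 0, 0, 1, 0]
# relevant docs end up ranked 1st and 4th, so AP = (1/1 + 2/4)/2
print(ap_by_documents(scores, rels))  # 0.75
```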
The position function can be represented as a function of ranking scores

Due to the indicator function in it, the position function is still non-continuous and non-differentiable
◦ They propose approximating the indicator function using a logistic function
STEP 2: Position Function Approximation (1/2)

$$\pi(x) = 1 + \sum_{y\in X,\,y\ne x} \mathbf{1}\{s_{x,y} < 0\},\quad \text{where } s_{x,y} = s_x - s_y$$

Replacing the indicator 1{s_{x,y} < 0} with a logistic function gives

$$\hat{\pi}(x) = 1 + \sum_{y\in X,\,y\ne x} \frac{\exp(-\alpha s_{x,y})}{1 + \exp(-\alpha s_{x,y})}$$

α is a scaling constant and α > 0
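The position approximation can be sketched as follows (a minimal illustration; the value α = 100 is an assumption for the demo, chosen large enough that the logistic terms saturate):

```python
import math

def true_positions(scores):
    # pi(x) = 1 + number of documents scored higher than x
    return [1 + sum(1 for t in scores if t > s) for s in scores]

def approx_positions(scores, alpha=100.0):
    # pi_hat(x) = 1 + sum_{y != x} exp(-a*s_xy) / (1 + exp(-a*s_xy)),
    # with s_xy = s_x - s_y; note exp(-t)/(1+exp(-t)) = 1/(1+exp(t))
    n = len(scores)
    pos = []
    for i in range(n):
        p = 1.0
        for j in range(n):
            if j != i:
                d = scores[i] - scores[j]
                p += 1.0 / (1.0 + math.exp(alpha * d))
        pos.append(p)
    return pos

scores = [0.9, 0.2, 0.5]
print(true_positions(scores))                        # [1, 3, 2]
print([round(p) for p in approx_positions(scores)])  # [1, 3, 2]
```

With a large α the approximated positions are very close to the true ones, matching the accuracy claim on the next slide.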
Examples of position approximation
◦ The approximation is very accurate in this case
STEP 2: Position Function Approximation (2/2)
Some measures, such as Precision@k, AP, and NDCG@k, have truncation functions in their definitions; these measures need a further approximation of the truncation function

To approximate the truncation function 1{π(x) < π(y)}, a simple way is to use the logistic function once again

STEP3: Truncation Function Approximation

$$\mathbf{1}\{\pi(x) < \pi(y)\} \approx \frac{\exp\big(\beta(\hat{\pi}(y)-\hat{\pi}(x))\big)}{1+\exp\big(\beta(\hat{\pi}(y)-\hat{\pi}(x))\big)}$$

β is a scaling constant and β > 0

Thus, we obtain the approximation of AP as follows

$$\widehat{\mathrm{AP}} = \frac{1}{|D^{+}|}\sum_{y\in X}\frac{r(y)}{\hat{\pi}(y)}\Big(1 + \sum_{x\in X,\,x\ne y} r(x)\,\frac{\exp\big(\beta(\hat{\pi}(y)-\hat{\pi}(x))\big)}{1+\exp\big(\beta(\hat{\pi}(y)-\hat{\pi}(x))\big)}\Big)$$
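The ApproxAP surrogate can be sketched in Python (a minimal illustration, not the paper's implementation; the α and β defaults and the saturation guard in `logistic` are my own choices for numerical safety):

```python
import math

def logistic(t):
    # 1 / (1 + exp(-t)), clamped for very negative t
    return 1.0 / (1.0 + math.exp(-t)) if t > -50 else 0.0

def approx_positions(scores, alpha):
    # pi_hat(x) = 1 + sum_{y != x} logistic(-alpha * (s_x - s_y))
    return [1.0 + sum(logistic(-alpha * (sx - sy))
                      for j, sy in enumerate(scores) if j != i)
            for i, sx in enumerate(scores)]

def approx_ap(scores, rels, alpha=100.0, beta=100.0):
    # smooth surrogate for AP: positions and truncations are both
    # replaced by logistic approximations
    pos = approx_positions(scores, alpha)
    d_plus = sum(rels)
    total = 0.0
    for y in range(len(scores)):
        if rels[y]:
            inner = 1.0 + sum(rels[x] * logistic(beta * (pos[y] - pos[x]))
                              for x in range(len(scores)) if x != y)
            total += inner / pos[y]
    return total / d_plus

scores = [0.9, 0.1, 0.8, 0.3, 0.7]
rels   = [1, 0, 0, 1, 0]
print(round(approx_ap(scores, rels), 3))  # 0.75 (exact AP is 0.75)
```

With large α and β the surrogate essentially recovers the exact AP of the induced ranking, while remaining differentiable in the scores.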
With the aforementioned approximation techniques, the surrogate objective functions become continuous and differentiable with respect to the parameters of the ranking model f(x; ·)

However, since the original IR measures contain many local optima, their approximations will also contain local optima
◦ One had better choose global optimization methods, such as random restart and simulated annealing, in order to avoid being trapped in local optima

STEP4: Surrogate Function Optimization (1/3)
Gradient of ApproxAP
◦ Derived by applying the chain rule to the surrogate function (the detailed expressions are given in the paper)
STEP4: Surrogate Function Optimization (2/3)
STEP4: Surrogate Function Optimization (3/3)
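The paper derives analytic chain-rule gradients of ApproxAP; as a hedged sketch of the optimization step only, one can maximize any smooth surrogate with gradient ascent using finite-difference gradients (the learning rate `eta`, step count, and toy objective below are illustrative choices, not from the paper):

```python
def numerical_gradient(f, w, eps=1e-6):
    # central finite differences: stands in for the paper's
    # analytic chain-rule gradient of the surrogate
    grad = []
    for i in range(len(w)):
        wp = list(w); wp[i] += eps
        wm = list(w); wm[i] -= eps
        grad.append((f(wp) - f(wm)) / (2 * eps))
    return grad

def gradient_ascent(f, w, eta=0.01, steps=100):
    # maximize the surrogate objective f (e.g. ApproxAP) in w
    for _ in range(steps):
        g = numerical_gradient(f, w)
        w = [wi + eta * gi for wi, gi in zip(w, g)]
    return w

# toy concave objective with its maximum at w = (1, 2)
f = lambda w: -(w[0] - 1) ** 2 - (w[1] - 2) ** 2
w = gradient_ascent(f, [0.0, 0.0], eta=0.1, steps=200)
```

In practice one would wrap this inner loop in a global strategy (random restarts, simulated annealing) as the slide recommends, since the surrogate is non-convex.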
In general, we would like to create a ranking model that maximizes accuracy in terms of an IR measure on the training data,

$$\max \sum_{i=1}^{m} E(\pi_i, \mathbf{y}_i)$$

or equivalently, minimizes the loss function defined as follows

$$\min \sum_{i=1}^{m}\big(E(\pi_i^{*}, \mathbf{y}_i) - E(\pi_i, \mathbf{y}_i)\big) = \min \sum_{i=1}^{m}\big(1 - E(\pi_i, \mathbf{y}_i)\big)$$

π_i is the permutation selected for query q_i; π_i* is the permutation maximizing E for q_i; E(π_i, y_i) is the evaluation of π_i w.r.t. y_i for q_i

Directly optimizing techniques try to minimize the above loss function

Comparisons with other directly optimizing techniques
From the viewpoint of loss function optimization, these methods fall into three categories
◦ One can minimize upper bounds of the basic loss function defined on the IR measures
  - AdaRank, SVMmap
◦ One can approximate the IR measures with functions that are easy to handle
  - this paper, SoftRank
◦ One can use specially designed techniques for optimizing the non-smooth IR measures

Comparisons with other directly optimizing techniques (cont.)
Minimize upper bounds of the basic loss function
◦ Type one bound
  - the logistic function
  - the exponential function

Comparisons with other directly optimizing techniques (cont.)
[Figure: curves of 1 - x, exp(-x), and log(1 + exp(-x)) plotted over [0, 1]]

Since e^{-x} ≥ 1 - x
◦ Type two bound
  - The loss function measures the loss when the worst prediction is made

Comparisons with other directly optimizing techniques (cont.)

[[·]] is one if the condition inside is satisfied, and zero otherwise
Comparisons with other directly optimizing techniques (cont.)
Datasets
◦ LETOR 3.0 datasets
  - a benchmark collection for research on learning to rank for information retrieval
  - TD2003, TD2004 and OHSUMED

Retrieval method
◦ A linear ranking model is used for ApproxAP and ApproxNDCG in the experiments

EXPERIMENTAL SETUP
On the approximation of IR measures

$$\text{Approximation error} = \frac{1}{|Q|}\sum_{q\in Q}\big|\widehat{\mathrm{AP}}_q - \mathrm{AP}_q\big|$$

◦ The approximation accuracy is very high, and it becomes more accurate as α or β increases

EXPERIMENTAL RESULTS (1/3)
On the performance of ApproxAP
◦ Five-fold cross validation, as suggested in LETOR, for both TD2003 and TD2004 datasets
  - α ∈ {50, 100, 150, 200, 250, 300}, β ∈ {1, 10, 20, 50, 100}
  - δ = 0.001, η = 0.01, K = 10
◦ The results clearly show the advantage of using the proposed method for direct optimization

EXPERIMENTAL RESULTS (2/3)
◦ It can also be found that AdaRank.MAP and SVMmap are not as good as Ranking SVM and ListNet
  - AdaRank.MAP and SVMmap optimize upper bounds of AP, and it is not clear whether those bounds are tight
  - If a bound is very loose, optimizing it cannot always lead to optimization of AP, so these methods may not perform well on some datasets

EXPERIMENTAL RESULTS (3/3)
In this paper, they have set up a general framework to approximate position-based IR measures
◦ The key part of the framework is to approximate the positions of documents by logistic functions of their scores

There are several advantages of this framework
◦ The way of approximating position-based measures is simple yet general
◦ Many existing optimization techniques can be directly applied, and the optimization process itself is measure independent
◦ It is easy to analyze the accuracy of the approach, and high approximation accuracy can be achieved by setting appropriate parameters

CONCLUSIONS AND FUTURE WORK (1/2)
There are still some issues that need to be further studied
◦ The approximated measures are not convex, and there may be many local optima in training
◦ Conduct experiments to test the algorithms with other function classes

CONCLUSIONS AND FUTURE WORK (2/2)