The Impact of Ranker Quality on Rank Aggregation Algorithms: Information vs. Robustness
Sibel Adalı, Brandeis Hill, and Malik Magdon-Ismail
Rensselaer Polytechnic Institute
Motivation
• Given a set of ranked lists of objects, what is the best way to aggregate them into a final ranked list?
• The correct answer depends on what the objective is:
  • The consensus among the input rankers
  • The most correct final ordering
• In this paper:
  ➡ We implement existing rank aggregation methods and introduce new ones.
  ➡ We implement a statistical framework for evaluating the methods and report on their performance.
[Figure: three input rankers, Ranker1–Ranker3, each assigning ranks 1–5 to a set of objects]
Related Work
• Rank aggregation methods
• Use of cheap methods such as average and median is common
• Methods based on consensus were first introduced by Dwork, Kumar, Naor, and Sivakumar [WWW 2001], with median rank as an approximation by Fagin, Kumar, and Sivakumar [SIGMOD 2003]
• Methods that integrate rank and textual information are common in meta-searching, for example Lu, Meng, Shu, Yu, Liu [WISE 2005]
• Machine learning methods learn the best factors for a user by incorporating user feedback, for example Joachims [SIGKDD 2002]
• Evaluations of rank aggregation methods have mainly used real but fairly small data sets, for example Renda, Straccia [SAC 2003]
Error Measures
• Given two rankers A and B:
• Precision (p) counts the number of objects A and B have in common (a maximization problem)
• Kendall-tau (τ) counts the total number of pairwise disagreements between A and B (a minimization problem)
Input rankers and aggregate:
  A: o1 o2 o3 o4 o5
  B: o2 o3 o1 o4 o6
  C: o4 o2 o3 o1 o7
  D: o2 o1 o3 o4 o5
• Precision of D with respect to A, B, and C:
  p(A,D) + p(B,D) + p(C,D) = 5 + 4 + 4 = 13
• Kendall-tau of D with respect to A, B, and C:
  τ(A,D) + τ(B,D) + τ(C,D) = 1 + 1 + 4 = 6
• Missing values for τ are handled separately.
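As a sanity check, the two error measures can be sketched in a few lines of Python (a minimal sketch, not the paper's code; missing values are handled here simply by counting only pairs of objects common to both lists):

```python
from itertools import combinations

def precision(a, b):
    """p: number of objects the two ranked lists have in common."""
    return len(set(a) & set(b))

def kendall_tau(a, b):
    """tau: pairwise disagreements, counted only over objects that
    appear in both lists (the paper handles missing values
    separately; here they simply contribute nothing)."""
    pos_a = {o: i for i, o in enumerate(a)}
    pos_b = {o: i for i, o in enumerate(b)}
    return sum(1 for x, y in combinations(pos_a.keys() & pos_b.keys(), 2)
               if (pos_a[x] - pos_a[y]) * (pos_b[x] - pos_b[y]) < 0)

A = ["o1", "o2", "o3", "o4", "o5"]
D = ["o2", "o1", "o3", "o4", "o5"]
print(precision(A, D))    # 5
print(kendall_tau(A, D))  # 1: only the pair (o1, o2) is flipped
```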
Aggregation Methods
• Cheap methods:
  • Average (Av)
  • Median (Me)
  • Precision optimal (PrOpt)
• Methods that aim to optimize the Kendall-tau error of the aggregate with respect to the input rankers:
  • Markov chain methods (Pagerank, Pg)
  • Iterative methods that improve a given aggregate:
    • Adjacent pairs (ADJ)
    • Iterative best flip (IBF)
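The two cheapest methods can be sketched as follows (a minimal sketch, not the paper's implementation; we assume an object's rank is averaged or medianed only over the lists in which it actually appears):

```python
from statistics import mean, median

def positions(rankers):
    """Map each object to the list of its 0-based ranks across
    the input lists in which it appears."""
    pos = {}
    for ranking in rankers:
        for i, obj in enumerate(ranking):
            pos.setdefault(obj, []).append(i)
    return pos

def average_rank(rankers, k):
    """Average (Av): order objects by their mean rank."""
    pos = positions(rankers)
    return sorted(pos, key=lambda o: mean(pos[o]))[:k]

def median_rank(rankers, k):
    """Median (Me): order objects by their median rank."""
    pos = positions(rankers)
    return sorted(pos, key=lambda o: median(pos[o]))[:k]
```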
Precision Optimal
• Rank objects by the number of times they appear across all the input lists
• Break ties by the objects' average ranks in the input rankers
• Break remaining ties randomly
Worked example:
  Input rankers:
    A: o1 o2 o3 o4 o5
    B: o2 o3 o1 o4 o6
    C: o4 o2 o5 o1 o7
  Fraction of lists each object appears in:
    o1, o2, o4: 3/3    o3, o5: 2/3    o6, o7: 1/3
  Group by count:        D = {o1, o2, o4}, {o3, o5}, {o6, o7}
  Break ties:            D = o2, o1, o4, o3, o5, o6, o7
  Choose top K (K = 5):  D = o2, o1, o4, o3, o5
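The precision-optimal procedure above can be sketched as follows (a minimal sketch; the final random tie-break is replaced by a deterministic one so the example is reproducible):

```python
from collections import defaultdict

def precision_optimal(rankers, k):
    """PrOpt sketch: rank objects by how many input lists they
    appear in, break ties by average rank.  (The deck breaks the
    remaining ties randomly; here they fall back to insertion
    order so the output is deterministic.)"""
    count = defaultdict(int)
    ranks = defaultdict(list)
    for ranking in rankers:
        for pos, obj in enumerate(ranking):
            count[obj] += 1
            ranks[obj].append(pos)
    avg_rank = {o: sum(r) / len(r) for o, r in ranks.items()}
    # More appearances first; among ties, smaller average rank first.
    ordered = sorted(count, key=lambda o: (-count[o], avg_rank[o]))
    return ordered[:k]

A = ["o1", "o2", "o3", "o4", "o5"]
B = ["o2", "o3", "o1", "o4", "o6"]
C = ["o4", "o2", "o5", "o1", "o7"]
print(precision_optimal([A, B, C], 5))  # ['o2', 'o1', 'o4', 'o3', 'o5']
```

This reproduces the worked example: o1, o2, o4 appear in all three lists and are ordered by their average ranks, followed by o3 and o5.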
Pagerank
• Construct a graph from the rankings (similar to Dwork et al., WWW 2001)
• Each object returned in a ranked list is a vertex
• Insert an edge (j, i) for each ranked list in which i is ranked higher than j
• Compute pagerank [Brin & Page, WWW 1998] on this graph
• Each edge is weighted (w_{j,i}) proportionally to the rank difference it represents
• The navigation probability is proportional to the edge weights
• The random jump probability (p_i) is proportional to the indegree of each node
• Alpha (α) is set to 0.85
• The pagerank Pg_i is the solution to the equations below:

  Pg_i = α · p_i + (1 − α) · Σ_{(j,i)∈E} Pg_j · w_{j,i}
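A pagerank-style aggregation in this spirit might look like the sketch below. Our assumptions, which the deck does not spell out: outgoing edge weights are normalized into navigation probabilities, the jump distribution is the normalized indegree, and the fixed point is found by simple iteration.

```python
import numpy as np

def pagerank_aggregate(rankers, alpha=0.85, iters=200):
    """Pagerank-style aggregation sketch: edge (j, i) whenever i is
    ranked above j in some list, weighted by the rank difference."""
    objects = sorted({o for r in rankers for o in r})
    idx = {o: n for n, o in enumerate(objects)}
    n = len(objects)
    W = np.zeros((n, n))  # W[j, i] = total weight of edge j -> i
    for ranking in rankers:
        for hi, obj_i in enumerate(ranking):
            for lo in range(hi + 1, len(ranking)):
                W[idx[ranking[lo]], idx[obj_i]] += lo - hi  # rank diff
    # Navigation probabilities: normalize outgoing weights per node.
    row_sums = W.sum(axis=1, keepdims=True)
    P = np.divide(W, row_sums, out=np.zeros_like(W), where=row_sums > 0)
    # Random-jump distribution proportional to indegree.
    indeg = (W > 0).sum(axis=0).astype(float)
    p = indeg / indeg.sum()
    pg = np.full(n, 1.0 / n)
    for _ in range(iters):  # iterate the slide's fixed-point equation
        pg = alpha * p + (1 - alpha) * (pg @ P)
    return sorted(objects, key=lambda o: -pg[idx[o]])
```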
Iterative Improvement Methods
• Adjacent Pairs (ADJ)
  • Given an aggregate ranking, flip adjacent pairs as long as the total error with respect to the input rankers is reduced; normally the Kendall-tau error metric is used [Dwork et al.]
• Iterative Best Flip (IBF)
  • Given an aggregate ranking:
    While not done
      For each object:
        record the current configuration
        find the best flip of that object with any other object, do this flip even if it temporarily increases the error, and make the result the current configuration
      Choose the lowest-error configuration from the history
      If its overall error is lower, or if it is a configuration not seen before, make it the current configuration
      Else break
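The IBF pseudocode above can be sketched as follows (a minimal, unoptimized sketch; `total_error` recomputes Kendall-tau from scratch, whereas an efficient implementation would update it incrementally after each flip):

```python
from itertools import combinations

def kendall_tau(a, b):
    """Pairwise disagreements over objects common to both lists."""
    pos_a = {o: i for i, o in enumerate(a)}
    pos_b = {o: i for i, o in enumerate(b)}
    return sum(1 for x, y in combinations(pos_a.keys() & pos_b.keys(), 2)
               if (pos_a[x] - pos_a[y]) * (pos_b[x] - pos_b[y]) < 0)

def total_error(agg, rankers):
    return sum(kendall_tau(agg, r) for r in rankers)

def ibf(agg, rankers):
    """Iterative Best Flip sketch following the deck's pseudocode."""
    current = list(agg)
    best_err, best_cfg = total_error(current, rankers), list(agg)
    seen = {tuple(current)}
    while True:
        sweep = []
        for i in range(len(current)):
            # Best flip of object i with any other position, taken
            # even if it temporarily increases the error.
            flips = []
            for j in range(len(current)):
                if j == i:
                    continue
                cand = list(current)
                cand[i], cand[j] = cand[j], cand[i]
                flips.append((total_error(cand, rankers), cand))
            err, current = min(flips)
            sweep.append((err, list(current)))
        # Lowest-error configuration from this sweep's history.
        err, cfg = min(sweep)
        improved = err < best_err
        if improved:
            best_err, best_cfg = err, list(cfg)
        if improved or tuple(cfg) not in seen:
            current = list(cfg)
            seen.add(tuple(cfg))
        else:
            break
    return best_cfg
```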
Iterative Best Flip: Example

Input rankers:
  A: o1 o2 o3 o4 o5
  B: o5 o2 o3 o4 o1
  C: o1 o4 o2 o3 o5

Aggregate D:                 o5 o1 o2 o4 o3   (Error_τ = 14)
After best flip for o5:  D = o1 o5 o2 o4 o3   (Error_τ = 13)
After best flip for o1:  D = o2 o5 o1 o4 o3   (Error_τ = 14)
After best flip for o2:  D = o5 o2 o1 o4 o3   (Error_τ = 13)
After best flip for o4:  D = o4 o2 o1 o5 o3   (Error_τ = 12)
After best flip for o3:  D = o4 o2 o1 o3 o5   (Error_τ = 11)

Choose the minimum-error configuration from this run and continue.
IBF seems to outperform ADJ and does well even when started from a random ranking.
Analysis of Aggregation Methods
• Complex aggregators incorporate subtle nuances about the input rankers. They use more information, but are sensitive to noise.
• Simple aggregators disregard information contained in the input rankers, but are less sensitive to noise.
• For example, average is more complex than median and precision optimal.
• What about pagerank and the other Kendall-tau-based optimizers?
Input rankers:
  A: o1 o2 o3
  B: o3 o1 o2
Kendall-tau optimal aggregations:
  D1: o3 o1 o2
  D2: o1 o2 o3
  D3: o1 o3 o2
The question we would like to answer is which aggregator performs well under which conditions. Does reducing the Kendall-tau error with respect to the input rankers always lead to a good solution?
Statistical Model of Aggregators
• Suppose there is a correct ranked list, called the ground truth, that represents the correct ordering.
• The correct ordering is computed for each object using:
  • A set of factors that measure the fit of an object for a specific criterion (F = f_1, ..., f_F, where f_l ∈ [−3, 3])
  • Examples of factors are the number of occurrences of a keyword, the recency of updates to a document, or pagerank
  • A weight for each factor (W = w_1, ..., w_F, where w_1 + ... + w_F = 1)
• The final score V_i of each object o_i is computed using a linear combination function:

  V_i = Σ_{l=1}^{F} w_l · f_l(o_i)

• Objects are ranked with respect to their scores.
[Figure: the ground truth combines factors f_1 ... f_5 with weights w_1 ... w_5]
[Figure: ranker j estimates each ground-truth factor with noise, f_l^j = f_l + ε_l, and combines the estimates f_1^j ... f_5^j with its own weights w_1^j ... w_5^j over objects o_1 ... o_n]

  V_i = Σ_{l=1}^{F} w_l · f_l(o_i)    V_i^j = Σ_{l=1}^{F} w_l^j · f_l^j(o_i)

• Each ranker produces a ranked list using the same formula and the same factors
• Ranker j tries to estimate the factors' true values for each object, producing F^j
• It also guesses the correct weights for the combination formula, producing W^j
Statistical Model of Aggregators
• The ranker's estimate F^j of the factors introduces an error ε^j, i.e. F^j = F + ε^j
• The magnitude of the error depends on a variance parameter σ²
• The distribution of the error can be adjusted to model different types of spam
• The model can also capture various types of correlation between the factors and the errors, but we do not report on those here
Statistical Model of Aggregators
  Var(ε_{il}^j) = σ² · [(δ − f_l(o_i))^γ · (δ + f_l(o_i))^β] / [max_{f ∈ [−3,3]} (δ − f)^γ · (δ + f)^β]
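The variance formula can be sketched numerically (a minimal sketch; the defaults γ = 1, δ = 5, β = 0.01 are the values used in the test setup, and the normalizing maximum is approximated on a grid over [−3, 3]):

```python
import math
import random

def error_variance(f, sigma2, gamma=1.0, delta=5.0, beta=0.01):
    """Variance of a ranker's error on factor value f, following the
    slide's formula: (delta - f)^gamma * (delta + f)^beta, scaled so
    the maximum over f in [-3, 3] equals sigma2."""
    def shape(x):
        return (delta - x) ** gamma * (delta + x) ** beta
    # Approximate the normalizing maximum on a grid over [-3, 3].
    max_shape = max(shape(-3 + 6 * k / 1000) for k in range(1001))
    return sigma2 * shape(f) / max_shape

def noisy_factor(f, sigma2):
    """A ranker's estimate of f: true value plus Gaussian noise.
    (The Gaussian choice is our assumption; the deck only says the
    error distribution can be adjusted.)"""
    return f + random.gauss(0, math.sqrt(error_variance(f, sigma2)))

# With gamma=1, delta=5, beta=0.01, "bad" objects (f near -3) get the
# largest errors and "good" objects (f near 3) the smallest.
```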
Test Setup
• We distribute the scores for each factor uniformly over 100 objects, and use 5 factors and 5 rankers
• We set γ = 1, δ = 5, β = 0.01, which models a case where rankers make small mistakes for "good" objects and increasingly larger mistakes for "bad" objects
• We vary σ² over 0.1, 1, 5, and 7.5
• We set the ground truth weights to W = ⟨1/15, 2/15, 3/15, 4/15, 5/15⟩
• We assign 1, 2, 3, 4, or 5 rankers the correct weights (W); the remaining rankers are assigned the incorrect weights W^r = ⟨5/15, 4/15, 3/15, 2/15, 1/15⟩ (nMI denotes the number of rankers with the wrong weights)
Test Setup
• For each setting, we construct 40,000 different data sets
• For each data set, we construct each aggregator from the top 10 of the input rankers and output the top 10
• We compare the performance of each aggregator with respect to the ground truth using precision and Kendall-tau
• For each error metric, we compute the difference between all pairs of aggregators
• For each test case and error metric, we output for every pair of aggregators [A1, A2] a range [l, h] with 99.9% confidence
• We assume A1 and A2 are roughly equivalent (A1 ≡ A2) if the range [l, h] crosses zero
• Otherwise, we construct an ordering A1 > A2 or A1 < A2 based on the range and the error metric
• We order the aggregators using topological sort based on this ordering, for each test and each error metric
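The pairwise comparison step might be sketched as below. Our assumption: a normal-approximation confidence interval with z ≈ 3.29 for 99.9%; the deck states only that a 99.9% range [l, h] is computed. Differences are err(A1) − err(A2) per data set, for a minimization metric such as Kendall-tau:

```python
import math
from statistics import mean, stdev

def compare(diffs, z=3.29):
    """Decide how aggregator A1 relates to A2 from per-dataset error
    differences err(A1) - err(A2).  z=3.29 approximates a two-sided
    99.9% normal confidence interval (our assumption)."""
    m, s = mean(diffs), stdev(diffs)
    half = z * s / math.sqrt(len(diffs))
    lo, hi = m - half, m + half
    if lo <= 0 <= hi:
        return "equivalent"   # the range [l, h] crosses zero
    return "A1 better" if hi < 0 else "A2 better"
```

From these pairwise decisions an ordering graph is built and topologically sorted, as described above.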
Results, precision for nMI = 0 (σ² = 0.1 and σ² = 1.0)
[Figure: partial orders of the aggregators by precision with respect to the ground truth; edge labels give the pairwise precision differences between adjacent aggregators]
Legend
Av Average
Me Median
Pg Pagerank
Rnd Random
PrOpt Precision Optimal
xADJ ADJ opt. after aggregator x
xIBF IBF opt. after aggregator x
Results, precision for nMI = 0 (σ² = 5 and σ² = 7.5)
[Figure: partial orders of the aggregators by precision with respect to the ground truth; edge labels give the pairwise precision differences]
Kendall-tau results for nMI = 2 (σ² = 0.1, 1, 5, and 7.5)
[Figure: partial orders of the aggregators by Kendall-tau with respect to the ground truth; edge labels give the pairwise Kendall-tau differences]
Precision results for nMI = 4 (σ² = 0.1 and σ² = 7.5)
[Figure: partial orders of the aggregators by precision with respect to the ground truth; edge labels give the pairwise precision differences]
Result Summary
• Low noise:
• Average is best when all the rankers are the same
• Median is best when there is asymmetry among the rankers
• High noise
• Robustness is needed; PrOpt, IBF, and Pg are the best
• As misinformation increases, robust but more complex aggregators tend to do better
[Figure: summary grid of the best-performing aggregators as noise increases (low → high) and misinformation increases (less → more); Av leads in the low-noise, low-misinformation corner, Me and MeIBF appear as asymmetry grows, and PrOpt, Pg*, and the IBF variants dominate under high noise and more misinformation]
Conclusion and Future Work
• Two new aggregation methods, PrOpt and IBF, that do well in many cases; IBF does well even when starting from a random ranking
• No single rank aggregation method is best; there is a trade-off between information and robustness
• Further evaluation of rank aggregation methods is needed
• Testing with various correlations, both positive and negative, between the ranking factors and the errors made on them
• Testing the model with negative weights, where misinformation is more misleading