The Impact of Ranker Quality on Rank Aggregation Algorithms: Information vs. Robustness
Sibel Adalı, Brandeis Hill, and Malik Magdon-Ismail
Rensselaer Polytechnic Institute
Motivation
• Given a set of ranked lists of objects, what is the best way to aggregate them into a final ranked list?
• The correct answer depends on what the objective is:
  • The consensus among the input rankers
  • The most correct final ordering
• In this paper:
  ➡ We implement existing rank aggregation methods and introduce new ones.
  ➡ We implement a statistical framework for evaluating the methods and report on their performance.
[Figure: three input rankers, Ranker1–Ranker3, each assigning ranks 1–5 to a set of objects]
Related Work
• Rank aggregation methods
• Use of cheap methods such as average and median is common
• Methods based on consensus were first introduced by Dwork, Kumar, Naor, and Sivakumar [WWW 2001], with median rank as an approximation by Fagin, Kumar, and Sivakumar [SIGMOD 2003]
• Methods that integrate rank and textual information are common in meta-searching, for example Lu, Meng, Shu, Yu, Liu [WISE 2005]
• Machine learning methods learn the best factors for a user by incorporating user feedback, for example Joachims [SIGKDD 2002]
• Evaluations of rank aggregation methods have mainly used real but fairly small data sets, for example Renda, Straccia [SAC 2003]
Error Measures
• Given two rankers A and B:
• Precision (p) counts the number of objects A and B have in common (a maximization problem)
• Kendall-tau (τ) counts the total number of pairwise disagreements between A and B (a minimization problem)
Input rankers and aggregate:
  A: o1 o2 o3 o4 o5
  B: o2 o3 o1 o4 o6
  C: o4 o2 o3 o1 o7
  D: o2 o1 o3 o4 o5
• Precision of D with respect to A, B, and C:
  p(A,D) + p(B,D) + p(C,D) = 5 + 4 + 4 = 13
• Kendall-tau of D with respect to A, B, and C:
  τ(A,D) + τ(B,D) + τ(C,D) = 1 + 1 + 4 = 6
• Missing values for τ are handled separately.
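As a sanity check, the two error measures can be sketched in a few lines of Python (a minimal sketch, not the paper's code; missing values are handled here simply by counting only pairs of objects common to both lists):

```python
from itertools import combinations

def precision(a, b):
    """p: number of objects the two ranked lists have in common."""
    return len(set(a) & set(b))

def kendall_tau(a, b):
    """tau: pairwise disagreements, counted only over objects that
    appear in both lists (the paper handles missing values
    separately; here they simply contribute nothing)."""
    pos_a = {o: i for i, o in enumerate(a)}
    pos_b = {o: i for i, o in enumerate(b)}
    return sum(1 for x, y in combinations(pos_a.keys() & pos_b.keys(), 2)
               if (pos_a[x] - pos_a[y]) * (pos_b[x] - pos_b[y]) < 0)

A = ["o1", "o2", "o3", "o4", "o5"]
D = ["o2", "o1", "o3", "o4", "o5"]
print(precision(A, D))    # 5
print(kendall_tau(A, D))  # 1: only the pair (o1, o2) is flipped
```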
Aggregation Methods
• Cheap methods:
  • Average (Av)
  • Median (Me)
  • Precision optimal (PrOpt)
• Methods that aim to optimize the Kendall-tau error of the aggregate with respect to the input rankers:
  • Markov chain methods (Pagerank, Pg)
  • Iterative methods that improve a given aggregate:
    • Adjacent pairs (ADJ)
    • Iterative best flip (IBF)
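The two cheapest methods can be sketched as follows (a minimal sketch, not the paper's implementation; we assume an object's rank is averaged or medianed only over the lists in which it actually appears):

```python
from statistics import mean, median

def positions(rankers):
    """Map each object to the list of its 0-based ranks across
    the input lists in which it appears."""
    pos = {}
    for ranking in rankers:
        for i, obj in enumerate(ranking):
            pos.setdefault(obj, []).append(i)
    return pos

def average_rank(rankers, k):
    """Average (Av): order objects by their mean rank."""
    pos = positions(rankers)
    return sorted(pos, key=lambda o: mean(pos[o]))[:k]

def median_rank(rankers, k):
    """Median (Me): order objects by their median rank."""
    pos = positions(rankers)
    return sorted(pos, key=lambda o: median(pos[o]))[:k]
```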
Precision Optimal
• Rank objects by the number of times they appear across all the input lists
• Break ties by the objects' average ranks in the input rankers
• Break remaining ties randomly
Worked example:
  Input rankers:
    A: o1 o2 o3 o4 o5
    B: o2 o3 o1 o4 o6
    C: o4 o2 o5 o1 o7
  Fraction of lists each object appears in:
    o1, o2, o4: 3/3    o3, o5: 2/3    o6, o7: 1/3
  Group by count:        D = {o1, o2, o4}, {o3, o5}, {o6, o7}
  Break ties:            D = o2, o1, o4, o3, o5, o6, o7
  Choose top K (K = 5):  D = o2, o1, o4, o3, o5
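The precision-optimal procedure above can be sketched as follows (a minimal sketch; the final random tie-break is replaced by a deterministic one so the example is reproducible):

```python
from collections import defaultdict

def precision_optimal(rankers, k):
    """PrOpt sketch: rank objects by how many input lists they
    appear in, break ties by average rank.  (The deck breaks the
    remaining ties randomly; here they fall back to insertion
    order so the output is deterministic.)"""
    count = defaultdict(int)
    ranks = defaultdict(list)
    for ranking in rankers:
        for pos, obj in enumerate(ranking):
            count[obj] += 1
            ranks[obj].append(pos)
    avg_rank = {o: sum(r) / len(r) for o, r in ranks.items()}
    # More appearances first; among ties, smaller average rank first.
    ordered = sorted(count, key=lambda o: (-count[o], avg_rank[o]))
    return ordered[:k]

A = ["o1", "o2", "o3", "o4", "o5"]
B = ["o2", "o3", "o1", "o4", "o6"]
C = ["o4", "o2", "o5", "o1", "o7"]
print(precision_optimal([A, B, C], 5))  # ['o2', 'o1', 'o4', 'o3', 'o5']
```

This reproduces the worked example: o1, o2, o4 appear in all three lists and are ordered by their average ranks, followed by o3 and o5.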
Pagerank
• Construct a graph from the rankings (similar to Dwork et al., WWW 2001)
• Each object returned in a ranked list is a vertex
• Insert an edge (j, i) for each ranked list in which i is ranked higher than j
• Compute pagerank [Brin & Page, WWW 1998] on this graph
• Each edge is weighted (w_{j,i}) proportionally to the rank difference it represents
• The navigation probability is proportional to the edge weights
• The random jump probability (p_i) is proportional to the indegree of each node
• Alpha (α) is set to 0.85
• The pagerank Pg_i is the solution to the equations below:

  Pg_i = α · p_i + (1 − α) · Σ_{(j,i)∈E} Pg_j · w_{j,i}
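A pagerank-style aggregation in this spirit might look like the sketch below. Our assumptions, which the deck does not spell out: outgoing edge weights are normalized into navigation probabilities, the jump distribution is the normalized indegree, and the fixed point is found by simple iteration.

```python
import numpy as np

def pagerank_aggregate(rankers, alpha=0.85, iters=200):
    """Pagerank-style aggregation sketch: edge (j, i) whenever i is
    ranked above j in some list, weighted by the rank difference."""
    objects = sorted({o for r in rankers for o in r})
    idx = {o: n for n, o in enumerate(objects)}
    n = len(objects)
    W = np.zeros((n, n))  # W[j, i] = total weight of edge j -> i
    for ranking in rankers:
        for hi, obj_i in enumerate(ranking):
            for lo in range(hi + 1, len(ranking)):
                W[idx[ranking[lo]], idx[obj_i]] += lo - hi  # rank diff
    # Navigation probabilities: normalize outgoing weights per node.
    row_sums = W.sum(axis=1, keepdims=True)
    P = np.divide(W, row_sums, out=np.zeros_like(W), where=row_sums > 0)
    # Random-jump distribution proportional to indegree.
    indeg = (W > 0).sum(axis=0).astype(float)
    p = indeg / indeg.sum()
    pg = np.full(n, 1.0 / n)
    for _ in range(iters):  # iterate the slide's fixed-point equation
        pg = alpha * p + (1 - alpha) * (pg @ P)
    return sorted(objects, key=lambda o: -pg[idx[o]])
```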
Iterative Improvement Methods
• Adjacent Pairs (ADJ)
  • Given an aggregate ranking, flip adjacent pairs as long as the total error with respect to the input rankers is reduced; normally the Kendall-tau error metric is used [Dwork et al.]
• Iterative Best Flip (IBF)
  • Given an aggregate ranking:
    While not done
      For each object:
        record the current configuration
        find the best flip of that object with any other object, do this flip even if it temporarily increases the error, and make the result the current configuration
      Choose the lowest-error configuration from the history
      If its overall error is lower, or if it is a configuration not seen before, make it the current configuration
      Else break
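The IBF pseudocode above can be sketched as follows (a minimal, unoptimized sketch; `total_error` recomputes Kendall-tau from scratch, whereas an efficient implementation would update it incrementally after each flip):

```python
from itertools import combinations

def kendall_tau(a, b):
    """Pairwise disagreements over objects common to both lists."""
    pos_a = {o: i for i, o in enumerate(a)}
    pos_b = {o: i for i, o in enumerate(b)}
    return sum(1 for x, y in combinations(pos_a.keys() & pos_b.keys(), 2)
               if (pos_a[x] - pos_a[y]) * (pos_b[x] - pos_b[y]) < 0)

def total_error(agg, rankers):
    return sum(kendall_tau(agg, r) for r in rankers)

def ibf(agg, rankers):
    """Iterative Best Flip sketch following the deck's pseudocode."""
    current = list(agg)
    best_err, best_cfg = total_error(current, rankers), list(agg)
    seen = {tuple(current)}
    while True:
        sweep = []
        for i in range(len(current)):
            # Best flip of object i with any other position, taken
            # even if it temporarily increases the error.
            flips = []
            for j in range(len(current)):
                if j == i:
                    continue
                cand = list(current)
                cand[i], cand[j] = cand[j], cand[i]
                flips.append((total_error(cand, rankers), cand))
            err, current = min(flips)
            sweep.append((err, list(current)))
        # Lowest-error configuration from this sweep's history.
        err, cfg = min(sweep)
        improved = err < best_err
        if improved:
            best_err, best_cfg = err, list(cfg)
        if improved or tuple(cfg) not in seen:
            current = list(cfg)
            seen.add(tuple(cfg))
        else:
            break
    return best_cfg
```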
Iterative Best Flip: Example

Input rankers:
  A: o1 o2 o3 o4 o5
  B: o5 o2 o3 o4 o1
  C: o1 o4 o2 o3 o5

Aggregate D:                 o5 o1 o2 o4 o3   (Error_τ = 14)
After best flip for o5:  D = o1 o5 o2 o4 o3   (Error_τ = 13)
After best flip for o1:  D = o2 o5 o1 o4 o3   (Error_τ = 14)
After best flip for o2:  D = o5 o2 o1 o4 o3   (Error_τ = 13)
After best flip for o4:  D = o4 o2 o1 o5 o3   (Error_τ = 12)
After best flip for o3:  D = o4 o2 o1 o3 o5   (Error_τ = 11)

Choose the minimum-error configuration from this run and continue.
IBF seems to outperform ADJ and does well even when started from a random ranking.
Analysis of Aggregation Methods
• Complex aggregators incorporate subtle nuances about the input rankers. They use more information, but are sensitive to noise.
• Simple aggregators disregard information contained in the input rankers, but are less sensitive to noise.
• For example, average is more complex than median and precision optimal.
• What about pagerank and the other Kendall-tau-based optimizers?
Input rankers:
  A: o1 o2 o3
  B: o3 o1 o2
Kendall-tau optimal aggregations:
  D1: o3 o1 o2
  D2: o1 o2 o3
  D3: o1 o3 o2
The question we would like to answer is which aggregator performs well under which conditions. Does reducing the Kendall-tau error with respect to the input rankers always lead to a good solution?
Statistical Model of Aggregators
• Suppose there is a correct ranked list, called the ground truth, that represents the correct ordering.
• The correct ordering is computed for each object using:
  • A set of factors that measure the fit of an object for a specific criterion (F = f_1, ..., f_F, where f_l ∈ [−3, 3])
  • Examples of factors are the number of occurrences of a keyword, the recency of updates to a document, or pagerank
  • A weight for each factor (W = w_1, ..., w_F, where w_1 + ... + w_F = 1)
• The final score V_i of each object o_i is computed using a linear combination function:

  V_i = Σ_{l=1}^{F} w_l · f_l(o_i)

• Objects are ranked with respect to their scores.
[Figure: the ground truth combines factors f_1 ... f_5 with weights w_1 ... w_5]
[Figure: ranker j estimates each ground-truth factor with noise, f_l^j = f_l + ε_l, and combines the estimates f_1^j ... f_5^j with its own weights w_1^j ... w_5^j over objects o_1 ... o_n]

  V_i = Σ_{l=1}^{F} w_l · f_l(o_i)    V_i^j = Σ_{l=1}^{F} w_l^j · f_l^j(o_i)

• Each ranker produces a ranked list using the same formula and the same factors
• Ranker j tries to estimate the factors' true values for each object, producing F^j
• It also guesses the correct weights for the combination formula, producing W^j
Statistical Model of Aggregators
• The ranker's estimate F^j of the factors introduces an error ε^j, i.e. F^j = F + ε^j
• The magnitude of the error depends on a variance parameter σ²
• The distribution of the error can be adjusted to model different types of spam
• The model can also capture various types of correlation between the factors and the errors, but we do not report on those here
Statistical Model of Aggregators
  Var(ε_{il}^j) = σ² · [(δ − f_l(o_i))^γ · (δ + f_l(o_i))^β] / [max_{f ∈ [−3,3]} (δ − f)^γ · (δ + f)^β]
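The variance formula can be sketched numerically (a minimal sketch; the defaults γ = 1, δ = 5, β = 0.01 are the values used in the test setup, and the normalizing maximum is approximated on a grid over [−3, 3]):

```python
import math
import random

def error_variance(f, sigma2, gamma=1.0, delta=5.0, beta=0.01):
    """Variance of a ranker's error on factor value f, following the
    slide's formula: (delta - f)^gamma * (delta + f)^beta, scaled so
    the maximum over f in [-3, 3] equals sigma2."""
    def shape(x):
        return (delta - x) ** gamma * (delta + x) ** beta
    # Approximate the normalizing maximum on a grid over [-3, 3].
    max_shape = max(shape(-3 + 6 * k / 1000) for k in range(1001))
    return sigma2 * shape(f) / max_shape

def noisy_factor(f, sigma2):
    """A ranker's estimate of f: true value plus Gaussian noise.
    (The Gaussian choice is our assumption; the deck only says the
    error distribution can be adjusted.)"""
    return f + random.gauss(0, math.sqrt(error_variance(f, sigma2)))

# With gamma=1, delta=5, beta=0.01, "bad" objects (f near -3) get the
# largest errors and "good" objects (f near 3) the smallest.
```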
Test Setup
• We distribute the scores for each factor uniformly over 100 objects, and use 5 factors and 5 rankers
• We set γ = 1, δ = 5, β = 0.01, which models a case where rankers make small mistakes for "good" objects and increasingly larger mistakes for "bad" objects
• We vary σ² over 0.1, 1, 5, and 7.5
• We set the ground truth weights to W = ⟨1/15, 2/15, 3/15, 4/15, 5/15⟩
• We assign 1, 2, 3, 4, or 5 rankers the correct weights (W); the remaining rankers are assigned the incorrect weights W^r = ⟨5/15, 4/15, 3/15, 2/15, 1/15⟩ (nMI denotes the number of rankers with the wrong weights)
Test Setup
• For each setting, we construct 40,000 different data sets
• For each data set, we construct each aggregator from the top 10 of the input rankers and output the top 10
• We compare the performance of each aggregator with respect to the ground truth using precision and Kendall-tau
• For each error metric, we compute the difference between all pairs of aggregators
• For each test case and error metric, we output for every pair of aggregators [A1, A2] a range [l, h] with 99.9% confidence
• We assume A1 and A2 are roughly equivalent (A1 ≡ A2) if the range [l, h] crosses zero
• Otherwise, we construct an ordering A1 > A2 or A1 < A2 based on the range and the error metric
• We order the aggregators using topological sort based on this ordering, for each test and each error metric
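The pairwise comparison step might be sketched as below. Our assumption: a normal-approximation confidence interval with z ≈ 3.29 for 99.9%; the deck states only that a 99.9% range [l, h] is computed. Differences are err(A1) − err(A2) per data set, for a minimization metric such as Kendall-tau:

```python
import math
from statistics import mean, stdev

def compare(diffs, z=3.29):
    """Decide how aggregator A1 relates to A2 from per-dataset error
    differences err(A1) - err(A2).  z=3.29 approximates a two-sided
    99.9% normal confidence interval (our assumption)."""
    m, s = mean(diffs), stdev(diffs)
    half = z * s / math.sqrt(len(diffs))
    lo, hi = m - half, m + half
    if lo <= 0 <= hi:
        return "equivalent"   # the range [l, h] crosses zero
    return "A1 better" if hi < 0 else "A2 better"
```

From these pairwise decisions an ordering graph is built and topologically sorted, as described above.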
Results, precision for nMI = 0 (σ² = 0.1 and σ² = 1.0)
[Figure: partial orders of the aggregators by precision with respect to the ground truth; edge labels give the pairwise precision differences between adjacent aggregators]
Legend
Av Average
Me Median
Pg Pagerank
Rnd Random
PrOpt Precision Optimal
xADJ ADJ opt. after aggregator x
xIBF IBF opt. after aggregator x
Results, precision for nMI = 0 (σ² = 5 and σ² = 7.5)
[Figure: partial orders of the aggregators by precision with respect to the ground truth; edge labels give the pairwise precision differences]
Kendall-tau results for nMI = 2 (σ² = 0.1, 1, 5, and 7.5)
[Figure: partial orders of the aggregators by Kendall-tau with respect to the ground truth; edge labels give the pairwise Kendall-tau differences]
Precision results for nMI = 4 (σ² = 0.1 and σ² = 7.5)
[Figure: partial orders of the aggregators by precision with respect to the ground truth; edge labels give the pairwise precision differences]
Result Summary
• Low noise:
• Average is best when all the rankers are the same
• Median is best when there is asymmetry among the rankers
• High noise
• Robustness is needed; PrOpt, IBF, and Pg are the best
• As misinformation increases, robust but more complex aggregators tend to do better
[Figure: summary grid of the best-performing aggregators as noise increases (low → high) and misinformation increases (less → more); Av leads in the low-noise, low-misinformation corner, Me and MeIBF appear as asymmetry grows, and PrOpt, Pg*, and the IBF variants dominate under high noise and more misinformation]
Conclusion and Future Work
• Two new aggregation methods, PrOpt and IBF, that do well in many cases; IBF does well even when starting from a random ranking
• No single rank aggregation method is best; there is a trade-off between information and robustness
• Further evaluation of rank aggregation methods is needed
• Testing with various correlations, both positive and negative, between the ranking factors and the errors made on them
• Testing the model with negative weights, where misinformation is more misleading