Learning to Rank with Click Models: From Online Algorithms to Offline Evaluations
Shuai LI
The Chinese University of Hong Kong
Shuai LI (CUHK) Learning to Rank 1 / 53
Outline

1 Motivation
2 Background
3 Problem Definition – Online
4 Click Models
    Cascade Model (CM) – ICML’2016, AAAI’2018, IJCAI’2019
    Dependent Click Model – a co-authored work
    Position-Based Model
    General Click Models – a co-authored work, ICML’2019
5 Offline Evaluations – KDD’2018
6 Conclusions
Motivation – Learning to Rank

Amazon, YouTube, Facebook, Netflix, Taobao
Background – Multi-armed Bandit Problem

A special case of reinforcement learning
There are L arms
Each arm a has an unknown reward distribution with unknown mean α_a
The best arm is a∗ = argmax_a α_a
Background – Multi-armed Bandit Setting

At each time t:
    The learning agent selects one arm a_t
    It observes the reward X_{a_t, t}
The objective is to minimize the regret in T rounds:

    R(T) = T α∗ − E[ Σ_{t=1}^T α_{a_t} ]

Balance the trade-off between exploitation and exploration:
    Exploitation: select arms that have yielded good results so far
    Exploration: select arms that have not been tried much before
Background – Upper Confidence Bound

UCB (Upper Confidence Bound) [ACF’02]
UCB policy: select

    a_t = argmax_a  α̂_{a,t} + √( 3 ln(t) / (2 T_a(t)) )

where
    α̂_{a,t} is the empirical mean of arm a at time t — Exploitation
    T_a(t) is the number of times arm a has been played — Exploration
Gap-dependent bound O((L/Δ) log(T)), where Δ = min_{a : α_a < α∗} (α∗ − α_a); matches the lower bound
Gap-free bound O(√(L T log(T))), tight up to a factor of √log(T)
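To make the policy concrete, here is a minimal Python sketch of the UCB rule above; the two-armed Bernoulli environment and all constants are illustrative assumptions, not from the slides:

```python
import math
import random

def ucb_bandit(reward_fn, n_arms, horizon):
    """UCB policy from the slide: play each arm once, then select
    a_t = argmax_a  mean_a + sqrt(3*ln(t) / (2*T_a(t)))."""
    counts = [0] * n_arms           # T_a(t): times arm a has been played
    means = [0.0] * n_arms          # empirical means of each arm
    for t in range(1, horizon + 1):
        if t <= n_arms:
            a = t - 1               # initialization rounds: try every arm once
        else:
            a = max(range(n_arms), key=lambda i:
                    means[i] + math.sqrt(3 * math.log(t) / (2 * counts[i])))
        x = reward_fn(a)            # observe reward X_{a_t, t}
        counts[a] += 1
        means[a] += (x - means[a]) / counts[a]   # incremental mean update
    return means, counts

# Usage: two hypothetical Bernoulli arms with means 0.3 and 0.7;
# the better arm ends up played far more often (exploitation).
random.seed(0)
true_means = [0.3, 0.7]
means, counts = ucb_bandit(lambda a: float(random.random() < true_means[a]),
                           n_arms=2, horizon=2000)
```

The bonus term shrinks as an arm is played more, so under-explored arms keep getting revisited until their confidence intervals fall below the best empirical mean.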
Online Learning to Rank

There are L items
Each item a has an unknown attractiveness α(a)
There are K positions
At time t:
    The learning agent selects a list of items A_t = (a_1^t, . . . , a_K^t)
    It receives the click feedback C_t ∈ {0, 1}^K
The objective is to minimize the regret over T rounds:

    R(T) = T r(A∗) − E[ Σ_{t=1}^T r(A_t) ]

where
    r(A) is the reward of list A
    A∗ = (1, 2, . . . , K), assuming items are ordered so that α(1) ≥ α(2) ≥ · · · ≥ α(L)
Click Models

Click models describe how users interact with a list of items

Cascade Model (CM)
    Assumes the user scans the list from position 1 to position K, clicks on the first satisfying item, and stops
    At most 1 click
    r(A) = 1 − ∏_{k=1}^K (1 − α(a_k)) = OR(α(a_1), . . . , α(a_K))
    The meaning of received feedback (0, 0, 1, 0, 0): positions 1 and 2 were examined but not attractive (✗), position 3 was clicked (✓), and positions 4 and 5 were never examined (?)

                  Click Model   Regret
[KSWA, 2015]      CM            O((L/Δ) log(T))
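As an illustration of the CM reward above, a short sketch with made-up attractiveness values (item names and numbers are hypothetical):

```python
import itertools

def cascade_reward(attractions):
    """r(A) = 1 - prod_{k=1}^K (1 - alpha(a_k)): the probability of at least
    one click when the user scans top-down and stops at the first
    satisfying item."""
    p_no_click = 1.0
    for alpha in attractions:
        p_no_click *= 1.0 - alpha
    return 1.0 - p_no_click

# Usage: with hypothetical items, the best K=2 list contains the two most
# attractive items; under CM the reward does not depend on their order.
alphas = {"a": 0.5, "b": 0.3, "c": 0.1}
best = max(itertools.permutations(alphas, 2),
           key=lambda A: cascade_reward([alphas[x] for x in A]))
```

Because the product is symmetric in its factors, any permutation of the same K items gets the same CM reward, which is why CM regret bounds only concern which items are shown.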
Contextual Bandit Setting

Contexts
    User profiles, search keywords
    Important for search and recommendations
Assume each item a is represented by x_{t,a} ∈ R^d
Assume the attractiveness of item a is

    α_t(a) = θ^⊤ x_{t,a}

for a fixed but unknown weight vector θ
When the x_{t,a} are one-hot representations and θ = (α(1), . . . , α(L)), this reduces to the multi-armed bandit setting.
Contextual Combinatorial Cascading Bandits [LWZC, ICML’2016] – Algorithm

C3-UCB Algorithm
Initialization: θ̂ = 0 ∈ R^d, V = λI ∈ R^{d×d}, b = 0 ∈ R^d
For time t = 1, 2, . . .:
    Obtain item features {x_{t,a}}_{a∈E} ⊂ R^d
    With high probability ‖θ̂ − θ‖_V ≤ β_t, thus with high probability

        α_t(a) ∈ θ̂^⊤ x_{t,a} ± β_t ‖x_{t,a}‖_{V^{−1}}

    Select the list A_t by the UCBs of the arms, U_t(a) = θ̂^⊤ x_{t,a} + β_t ‖x_{t,a}‖_{V^{−1}}
    Receive feedback C_t ∈ {0, 1}^K
    Compute the stopping position K_t = min({k : C_t(k) = 1} ∪ {K}) and update

        V ← V + Σ_{k=1}^{K_t} x_{t,a_k^t} x_{t,a_k^t}^⊤,    b ← b + Σ_{k=1}^{K_t} x_{t,a_k^t} C_t(k),    θ̂ = V^{−1} b
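A pure-Python sketch of one round of the update above, hard-coded to d = 2 so the linear algebra stays explicit; the features, β, and click pattern are made up, and a real implementation would use a proper linear-algebra library:

```python
import math

def inv2(M):
    """Inverse of a 2x2 matrix (enough for this d = 2 sketch)."""
    (a, b), (c, d) = M
    det = a * d - b * c
    return [[d / det, -b / det], [-c / det, a / det]]

def c3ucb_round(V, bvec, beta, features, clicks):
    """One round of C3-UCB: rank items by theta_hat^T x + beta*||x||_{V^-1},
    then fold the observed prefix (up to the first click, K_t) into V and b."""
    Vinv = inv2(V)
    theta = [Vinv[i][0] * bvec[0] + Vinv[i][1] * bvec[1] for i in range(2)]
    def ucb(x):
        mean = theta[0] * x[0] + theta[1] * x[1]
        quad = sum(x[i] * (Vinv[i][0] * x[0] + Vinv[i][1] * x[1])
                   for i in range(2))
        return mean + beta * math.sqrt(quad)
    ranked = sorted(features, key=ucb, reverse=True)
    K = len(clicks)
    Kt = next((k + 1 for k in range(K) if clicks[k] == 1), K)  # stopping position
    for k in range(Kt):          # only examined positions update the statistics
        x = ranked[k]
        for i in range(2):
            for j in range(2):
                V[i][j] += x[i] * x[j]
            bvec[i] += x[i] * clicks[k]
    return ranked, V, bvec
```

With V = λI and b = 0, the first round ranks purely by the exploration bonus; the returned V and b then define the next round's θ̂ = V⁻¹b, so the estimate only uses feedback up to the stopping position, as in the cascade assumption.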
Contextual Combinatorial Cascading Bandits [LWZC, ICML’2016] – Results

We prove a regret bound

    R(T) = O( (d/p∗) √(TK) ln(T) )

Experimental results: [figure: cumulative regret over time t of C3-UCB (ours) vs. CombCascade, on Synthetic Data and on Network 1221]
Summary on Bandits with Click Models

                      Context   Click Model   Regret
[KSWA, 2015]          –         CM            O((L/Δ) log(T))
[LWZC, ICML’2016]     Linear    CM            O((d/p∗) √(TK) log(T))
Online Clustering of Contextual Cascading Bandits [LZ, AAAI’2018]

Finds a clustering over users while recommending
The attractiveness function is generalized linear (GL)
Improves the regret results
Experiments: [figure: cumulative regret over time t of CLUB-cascade (ours) vs. C3-UCB/CascadeLinUCB on two datasets]

                      Context   Click Model   Regret
[KSWA, 2015]          –         CM            O((L/Δ) log(T))
[LWZC, ICML’2016]     Linear    CM            O((d/p∗) √(TK) log(T))
[LZ, AAAI’2018]       GL        CM            O(d √(TK) log(T))
Improved Algorithm on Clustering Bandits [LCLL, IJCAI’2019]

Allows an arbitrary frequency distribution over users (compared to the uniform distribution)
Proves a regret bound that is free of the minimal frequency over users:

    R(T) = O( d √(mT) ln(T) + (1/γ_p² + n_u/(γ² λ_x³)) ln(T) )

compared to

    R(T) = O( d √(mT) ln(T) + (1/(p_min γ² λ_x³)) ln(T) )

where n_u is the number of users and m is the number of clusters
Experiments: [figure: regret over time t of ours vs. CLUB, LinUCB-One, and LinUCB-Ind on Synthetic, MovieLens, and Yelp data]
Dependent Click Model (DCM)

Allows multiple clicks
Assumes there is a probability of satisfaction after each click
r(A) = 1 − ∏_{k=1}^K (1 − α(a_k) γ_k)
γ_k: satisfaction probability after a click on position k
The meaning of received feedback (0, 1, 0, 1, 0): position 1 no click (✗); position 2 click, not satisfied (✓); position 3 no click (✗); position 4 click, satisfied? (✓); position 5 unknown (?)

                       Context   Click Model   Regret
[KSWA, 2015]           –         CM            O((L/Δ) log(T))
[LWZC, ICML’2016]      Linear    CM            O((d/p∗) √(TK) log(T))
[LZ, AAAI’2018]        GL        CM            O(d √(TK) log(T))
[KKSW, 2016]           –         DCM           O((L/Δ) log(T))
[LLZ, COCOON’2018]     GL        DCM           O(dK √(TK) log(T))
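The DCM reward generalizes the cascade reward by weighting each attractiveness with the per-position satisfaction probability; a small sketch with made-up values:

```python
def dcm_reward(attractions, satisfactions):
    """r(A) = 1 - prod_{k=1}^K (1 - alpha(a_k)*gamma_k): the probability
    that the user is satisfied by at least one of the clicked items."""
    p_unsatisfied = 1.0
    for alpha, gamma in zip(attractions, satisfactions):
        p_unsatisfied *= 1.0 - alpha * gamma
    return 1.0 - p_unsatisfied
```

With γ_k = 1 at every position (every click satisfies, so the first click ends the session), this recovers the CM reward.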
Position-Based Model (PBM)

The most popular model in industry
Assumes the user's click probability on an item a at position k can be factored into β_k · α(a)
β_k is the position bias; usually β_1 ≥ β_2 ≥ · · · ≥ β_K
r(A) = Σ_{k=1}^K β_k α(a_k)
The meaning of received feedback (0, 1, 0, 1, 0): clicks at positions 2 and 4; a 0 may mean the position was not examined or the item was not attractive
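Under the PBM reward (a sketch with made-up biases), placing the more attractive items at the positions with larger bias maximizes the expected number of clicks, by the rearrangement inequality:

```python
def pbm_reward(betas, attractions):
    """r(A) = sum_{k=1}^K beta_k * alpha(a_k): the expected number of
    clicks under the position-based model."""
    return sum(b * a for b, a in zip(betas, attractions))

# Usage: with decreasing position biases, sorting items by attractiveness
# (most attractive first) beats the reversed order.
betas = [1.0, 0.5]    # hypothetical position biases, beta_1 >= beta_2
```

Unlike CM, the PBM reward does depend on the order within the chosen K items, not just on which items are shown.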
Summary on Bandits with Click Models

                       Context   Click Model   Regret
[KSWA, 2015]           –         CM            O((L/Δ) log(T))
[LWZC, ICML’2016]      Linear    CM            O((d/p∗) √(TK) log(T))
[LZ, AAAI’2018]        GL        CM            O(d √(TK) log(T))
[KKSW, 2016]           –         DCM           O((L/Δ) log(T))
[LLZ, COCOON’2018]     GL        DCM           O(dK √(TK) log(T))
[LVC, 2016]            –         PBM with β    O((L/Δ) log(T))
General Click Models

Common observations for click models
    The click-through rate (CTR) of list A at position k can be factored into

        CTR(A, k) = χ(A, k) α(a_k)

    χ(A, k) is the examination probability of list A at position k
    E.g. χ(A, k) = ∏_{i=1}^{k−1} (1 − α(a_i)) in the Cascade Model, and χ(A, k) = β_k in the Position-Based Model

Difficulties with general click models
    χ depends on both the click model and the list
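The factorization can be made concrete by plugging in the two examples of χ (the attractiveness and bias values below are illustrative):

```python
def chi_cascade(alphas, k):
    """CM examination probability: position k is examined iff none of the
    earlier items was clicked (positions are 1-based)."""
    p = 1.0
    for a in alphas[:k - 1]:
        p *= 1.0 - a
    return p

def ctr(chi, alphas, k):
    """CTR(A, k) = chi(A, k) * alpha(a_k): the shared factorization."""
    return chi(alphas, k) * alphas[k - 1]

# PBM: examination depends only on the position bias beta_k, not the list.
def chi_pbm(betas):
    return lambda alphas, k: betas[k - 1]
```

The same `ctr` function serves both models; only the examination term χ changes, which is exactly the difficulty the slide points out: χ depends on the click model and, in CM's case, on the list itself.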
Summary on Bandits with Click Models

                       Context   Click Model   Regret
[KSWA, 2015]           –         CM            O((L/Δ) log(T))
[LWZC, ICML’2016]      Linear    CM            O((d/p∗) √(TK) log(T))
[LZ, AAAI’2018]        GL        CM            O(d √(TK) log(T))
[KKSW, 2016]           –         DCM           O((L/Δ) log(T))
[LLZ, COCOON’2018]     GL        DCM           O(dK √(TK) log(T))
[LVC, 2016]            –         PBM with β    O((L/Δ) log(T))
[ZTGKSW, 2017]         –         General       O((K³L/Δ) log(T))
[LKLS, NIPS’2018]      –         General       O((KL/Δ) log(T)), O(√(K³LT log(T))), Ω(√(KLT))
Online Learning to Rank with Features [LLS, ICML’2019] – Preparation

Recall
    Each item a is represented by a feature vector x_a ∈ R^d
    The attractiveness of item a is α(a) = θ^⊤ x_a
We propose an algorithm called RecurRank (Recursive Ranking)

G-optimal design
    Minimizes the covariance of the least-squares estimator
    X = {x_1, . . . , x_n} ⊂ R^d
    For any distribution π : X → [0, 1], let Q(π) = Σ_{x∈X} π(x) x x^⊤
    By the Kiefer–Wolfowitz theorem there exists a π, called the G-optimal design, that maximizes det(Q(π)), or equivalently satisfies max_{x∈X} ‖x‖²_{Q(π)†} ≤ d
    John's theorem implies that π may be chosen so that |{x : π(x) > 0}| ≤ d(d + 3)/2
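A tiny numerical sanity check of the Kiefer–Wolfowitz bound for the special case where X is the standard basis, so Q(π) is diagonal and the uniform design is optimal; this is an illustration only, not a general design computation:

```python
def max_design_norm(X, pi):
    """max_{x in X} x^T Q(pi)^{-1} x with Q(pi) = sum_x pi(x) x x^T,
    assuming Q(pi) is diagonal (true when X is the standard basis)."""
    d = len(X[0])
    diag = [sum(p * x[i] * x[i] for x, p in zip(X, pi)) for i in range(d)]
    return max(sum(x[i] * x[i] / diag[i] for i in range(d)) for x in X)

# Usage: the uniform design on the standard basis of R^3 attains the
# Kiefer-Wolfowitz value max ||x||^2_{Q^+} = d exactly.
d = 3
basis = [[1.0 if i == j else 0.0 for j in range(d)] for i in range(d)]
value = max_design_norm(basis, [1.0 / d] * d)
```

Here Q(π) = (1/d)·I, so every basis vector has squared norm exactly d under Q(π)⁻¹, matching the theorem's bound with equality.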
Online Learning to Rank with Features [LLS, ICML’2019] – Algorithm

RecurRank Algorithm
Each instantiation is called with three arguments:
    1 A phase number ℓ ∈ {1, 2, . . .};
    2 An ordered tuple of items A = (a_1, a_2, . . . , a_n);
    3 A tuple of positions K = (k, . . . , k + m − 1), with m ≤ n.
The algorithm is first called with ℓ = 1, a random order over all items {1, . . . , L}, and K = (1, . . . , K)
Find a G-optimal design π = Gopt(A). Then compute

    T(a) = ⌈ (d π(a) / (2 Δ_ℓ²)) log(|A| / δ_ℓ) ⌉,    Δ_ℓ = 2^{−ℓ}

The aim is to guarantee |α̂(a) − α(a)| ≤ Δ_ℓ for every a ∈ A by the end of this instantiation
This instantiation runs for Σ_{a∈A} T(a) rounds
Online Learning to Rank with Features [LLS, ICML’2019] – Algorithm (Continued)
RecurRank Algorithm (Continued)
Select each item a ∈ A exactly T(a) times at position k, and put the first m − 1 items of A \ {a} at the remaining positions k + 1, . . . , k + m − 1
first position — exploration
remaining positions — exploitation
only the first position has the same examination probability χ for all lists
E.g., suppose we have computed T(a₃) = 100; then the algorithm puts (a₃, a₁, a₂, a₄, . . . , aₘ) at positions (k, . . . , k + m − 1) for 100 rounds
Compute θ̂ using only the feedback from the first position k, and rank the items in decreasing order of estimated attractiveness:
α̂(a₁) ≥ α̂(a₂) ≥ α̂(a₃) ≥ · · · ≥ α̂(aₙ)
Shuai LI (CUHK) Learning to Rank 33 / 53
Online Learning to Rank with Features [LLS, ICML’2019] – Algorithm (Continued)
RecurRank Algorithm (Continued)
Eliminate the bad arms a_{n′+1}, . . . , aₙ if
α̂(a₁) ≥ · · · ≥ α̂(aₘ) ≥ · · · ≥ α̂(a_{n′}) ≥ α̂(a_{n′+1}) ≥ · · · ≥ α̂(aₙ),  where the gap between α̂(a_{n′}) and α̂(a_{n′+1}) is ≥ 2Δ_ℓ
Split the partition at each consecutive gap larger than 2Δ_ℓ:
α̂(a₁) ≥ · · · ≥ α̂(a_{k₁}) │ α̂(a_{k₁+1}) ≥ · · · ≥ α̂(a_{k₂}) │ α̂(a_{k₂+1}) ≥ · · · ≥ α̂(a_{n′})  (gap ≥ 2Δ_ℓ at each split point)
with the corresponding position ranges
k, . . . , k + k₁ − 1 │ k + k₁, . . . , k + k₂ − 1 │ k + k₂, . . . , k + m − 1
Call RecurRank on each refined partition with phase ℓ + 1
Shuai LI (CUHK) Learning to Rank 34 / 53
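The splitting rule above can be sketched as a few lines of code. This is an illustrative reading of the slide, not the paper's implementation; `alpha_hat` is assumed to be already sorted in decreasing order:

```python
def split_at_gaps(alpha_hat, delta_l):
    """Split the indices of a decreasingly sorted list of estimates wherever
    two consecutive estimates differ by at least 2 * delta_l."""
    parts, current = [], [0]
    for i in range(1, len(alpha_hat)):
        if alpha_hat[i - 1] - alpha_hat[i] >= 2 * delta_l:
            parts.append(current)  # close the current partition at the gap
            current = []
        current.append(i)
    parts.append(current)
    return parts

# With Delta_l = 0.05, the gaps 0.25 and 0.35 exceed 2 * Delta_l and split the list
parts = split_at_gaps([0.9, 0.85, 0.6, 0.55, 0.2], 0.05)
print(parts)  # [[0, 1], [2, 3], [4]]
```

Each resulting index block is then paired with a contiguous block of positions and handed to a fresh instantiation at phase ℓ + 1.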
Online Learning to Rank with Features [LLS, ICML’2019] – Results
Regret bound
R(T) = O(K √(dT log(LT)))
Experiments — RecurRank (ours), C3-UCB, TopRank
[Figure: regret vs. time t (0–200k) under (a) the cascade model (CM) and (b) the position-based model (PBM)]
Shuai LI (CUHK) Learning to Rank 35 / 53
Summary on Bandits with Click Models

Reference            | Context | Click Model | Regret
[KSWA, 2015]         | –       | CM          | O((L/Δ) log T)
[LWZC, ICML’2016]    | Linear  | CM          | O((d/p*) √(TK) log T)
[LZ, AAAI’2018]      | GL      | CM          | O(d √(TK) log T)
[KKSW, 2016]         | –       | DCM         | O((L/Δ) log T)
[LLZ, COCOON’2018]   | GL      | DCM         | O(dK √(TK) log T)
[LVC, 2016]          | –       | PBM with β  | O((L/Δ) log T)
[ZTGKSW, 2017]       | –       | General     | O((K³L/Δ) log T)
[LKLS, NIPS’2018]    | –       | General     | O((KL/Δ) log T), O(√(K³LT log T)), Ω(√(KLT))
[LLS, ICML’2019]     | Linear  | General     | O(K √(dT log(LT)))

Shuai LI (CUHK) Learning to Rank 36 / 53
Outline
1 Motivation
2 Background
3 Problem Definition – Online
4 Click Models
  Cascade Model (CM) – ICML’2016, AAAI’2018, IJCAI’2019
  Dependent Click Model – a co-authored work
  Position-Based Model
  General Click Models – a co-authored work, ICML’2019
5 Offline Evaluations – KDD’2018
6 Conclusions
Shuai LI (CUHK) Learning to Rank 37 / 53
Offline Evaluations
Motivation
Can we estimate the expected number of clicks of a new policy without directly deploying it?
Offline evaluation!
Objective:
Design statistically efficient estimators, based on a logged dataset, for any ranking policy
Challenge:
The number of different lists is exponential in K
Shuai LI (CUHK) Learning to Rank 38 / 53
Offline Evaluation of Ranking Policies with Click Models [LAKMVW, KDD’2018] – Results
We design estimators for different click models
Item-Position, Random, Rank-Based, Position-Based, Document-Based
We prove that our estimators
are unbiased in a larger class of policies
have lower bias
have better theoretical guarantees for the best policy
than the existing unstructured estimators under the corresponding click-model assumptions
Shuai LI (CUHK) Learning to Rank 39 / 53
Offline Evaluation of Ranking Policies with Click Models [LAKMVW, KDD’2018] – Experiments
Experiments – 100 most frequent queries in the Yandex dataset
[Figure: RMSE vs. number of logged samples M (10⁰–10⁵) for the RCTR, Item, IP, PBM, and List estimators; panels: 100 queries with K = 2, K = 3, and K = 10]
Shuai LI (CUHK) Learning to Rank 40 / 53
Outline
1 Motivation
2 Background
3 Problem Definition – Online
4 Click Models
  Cascade Model (CM) – ICML’2016, AAAI’2018, IJCAI’2019
  Dependent Click Model – a co-authored work
  Position-Based Model
  General Click Models – a co-authored work, ICML’2019
5 Offline Evaluations – KDD’2018
6 Conclusions
Shuai LI (CUHK) Learning to Rank 41 / 53
Conclusions
Context + Cascade model (CM) / Dependent click model (DCM)
Online clustering of bandits + Cascade model (CM)
Improved algorithm on clustering of bandits
Context + General click model
Offline evaluation of ranking policies with click models
Shuai LI (CUHK) Learning to Rank 42 / 53
Publications
First-author papers in thesis – in the order of thesis
1 Shuai Li, Baoxiang Wang, Shengyu Zhang, Wei Chen, ContextualCombinatorial Cascading Bandits, ICML, 2016
2 Shuai Li, Shengyu Zhang, Online Clustering of Contextual CascadingBandits, AAAI, 2018
3 Shuai Li, Wei Chen, S. Li, Kwong-Sak Leung, Improved Algorithm on Online Clustering of Bandits, IJCAI, 2019
4 Shuai Li, Tor Lattimore, Csaba Szepesvari, Online Learning to Rankwith Features, ICML, 2019
5 Shuai Li, Yasin Abbasi-Yadkori, Branislav Kveton, S. Muthukrishnan,Vishwa Vinay and Zheng Wen, Offline Evaluation of Ranking Policieswith Click Models, KDD, 2018
Shuai LI (CUHK) Learning to Rank 43 / 53
Publications
Mentioned co-authored papers
6 Weiwen Liu, Shuai Li, Shengyu Zhang, Contextual Dependent ClickBandit Algorithm for Web Recommendation, COCOON, 2018
7 Tor Lattimore, Branislav Kveton, Shuai Li, Csaba Szepesvari,TopRank: A Practical Algorithm for Online Stochastic Ranking,NeurIPS, 2018
Other co-authored papers
8 Pengfei Liu, Hongjian Li, Shuai Li, Kwong-Sak Leung, ImprovingPrediction of Phenotypic Drug Response on Cancer Cell Lines UsingDeep Convolutional Network, BMC Bioinformatics, 2019
9 Ran Wang, Shuai Li, Man-Hon Wong, and Kwong-Sak Leung,Drug-Protein-Disease Association Prediction and Drug RepositioningBased on Tensor Decomposition, BIBM, 2018
10 Pengfei Liu, Shuai Li, Weiying Yi, Kwong-Sak Leung, A HybridDistributed Framework for SNP Selections, PDPTA, 2016
Shuai LI (CUHK) Learning to Rank 44 / 53
Publications
In submission
11 Shuai Li, Wei Chen, Zheng Wen, Kwong-Sak Leung, StochasticOnline Learning with Probabilistic Feedback Graph
12 Shuai Li, Kwong-Sak Leung, Generalized Clustering Bandits
13 Shuai Li, Tong Yu, Ole Mengshoel, Kwong-Sak Leung, OnlineSemi-Supervised Learning with Large Margin Separation
14 Xiaojin Zhang, Shuai Li, Shengyu Zhang, Contextual CombinatorialConservative Bandits
15 Pengfei Liu, Shuai Li, Kwong-Sak Leung, The Recovery of StochasticDifferential Equations with Genetic Programming andKullback-Leibler Divergence
Shuai LI (CUHK) Learning to Rank 45 / 53
Thank you!
&
Questions?
Shuai LI (CUHK) Learning to Rank 46 / 53
References I
P. Auer, N. Cesa-Bianchi, and P. Fischer.Finite-time analysis of the multiarmed bandit problem.Machine learning, 47(2-3):235–256, 2002.
S. Katariya, B. Kveton, C. Szepesvari, and Z. Wen.DCM bandits: Learning to rank with multiple clicks.In International Conference on Machine Learning, pages 1215–1224,2016.
B. Kveton, C. Szepesvari, Z. Wen, and A. Ashkan.Cascading bandits: Learning to rank in the cascade model.In International Conference on Machine Learning, pages 767–776,2015.
Shuai LI (CUHK) Learning to Rank 47 / 53
References II
P. Lagree, C. Vernade, and O. Cappe.Multiple-play bandits in the position-based model.In Advances in Neural Information Processing Systems, pages1597–1605, 2016.
T. Lattimore, B. Kveton, Li, Shuai, and C. Szepesvari.TopRank: A practical algorithm for online stochastic ranking.In The Conference on Neural Information Processing Systems, 2018.
W. Liu, Li, Shuai, and S. Zhang.Contextual dependent click bandit algorithm for web recommendation.
In International Computing and Combinatorics Conference, pages39–50. Springer, 2018.
Shuai LI (CUHK) Learning to Rank 48 / 53
References III
Li, Shuai, Y. Abbasi-Yadkori, B. Kveton, S. Muthukrishnan, V. Vinay,and Z. Wen.Offline evaluation of ranking policies with click models.In ACM SIGKDD Conference on Knowledge Discovery and DataMining, 2018.
Li, Shuai, W. Chen, S. Li, and K.-S. Leung.Improved algorithm on online clustering of bandits.In International Joint Conference on Artificial Intelligence (IJCAI),2019.
Li, Shuai, T. Lattimore, and C. Szepesvari.Online learning to rank with features.In International Conference on Machine Learning (ICML), 2019.
Shuai LI (CUHK) Learning to Rank 49 / 53
References IV
Li, Shuai, B. Wang, S. Zhang, and W. Chen.Contextual combinatorial cascading bandits.In International Conference on Machine Learning, pages 1245–1253,2016.
Li, Shuai and S. Zhang.Online clustering of contextual cascading bandits.In The AAAI Conference on Artificial Intelligence, 2018.
M. Zoghi, T. Tunys, M. Ghavamzadeh, B. Kveton, C. Szepesvari, andZ. Wen.Online learning to rank in stochastic click models.In International Conference on Machine Learning, pages 4199–4208,2017.
Shuai LI (CUHK) Learning to Rank 50 / 53
References V
S. Zong, H. Ni, K. Sung, N. R. Ke, Z. Wen, and B. Kveton.Cascading bandits for large-scale recommendation problems.In Proceedings of the Thirty-Second Conference on Uncertainty inArtificial Intelligence, pages 835–844. AUAI Press, 2016.
Shuai LI (CUHK) Learning to Rank 51 / 53
A Key Part Proof for CLUB-cascade (Improving C3-UCB)

E_t[R(A_t, y_t)]
  = E_t[ (1 − ∏_{k=1}^{K} (1 − y_t(x*_{t,k}))) − (1 − ∏_{k=1}^{K} (1 − y_t(x_{t,k}))) ]
  = E_t[ ∏_{k=1}^{K} (1 − y_t(x_{t,k})) − ∏_{k=1}^{K} (1 − y_t(x*_{t,k})) ]
  = E_t[ Σ_{k=1}^{K} ( ∏_{ℓ=1}^{k−1} (1 − y_t(x_{t,ℓ})) ) · [ (1 − y_t(x_{t,k})) − (1 − y_t(x*_{t,k})) ] · ( ∏_{ℓ=k+1}^{K} (1 − y_t(x*_{t,ℓ})) ) ]
  ≤ E_t[ Σ_{k=1}^{K} ( ∏_{ℓ=1}^{k−1} (1 − y_t(x_{t,ℓ})) ) · [ y_t(x*_{t,k}) − y_t(x_{t,k}) ] ]
  = E_t[ Σ_{k=1}^{K_t} [ y_t(x*_{t,k}) − y_t(x_{t,k}) ] ]

Shuai LI (CUHK) Learning to Rank 52 / 53
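The third equality above is a telescoping product decomposition, which can be checked numerically. This is only a sanity check of the algebraic identity on random values; the names `y` and `y_star` stand in for y_t(x_{t,k}) and y_t(x*_{t,k}):

```python
import random

def prod(factors):
    p = 1.0
    for f in factors:
        p *= f
    return p

random.seed(0)
K = 6
y = [random.random() for _ in range(K)]       # y_t(x_{t,k}): shown list
y_star = [random.random() for _ in range(K)]  # y_t(x*_{t,k}): optimal list

# Left side: the product difference from the second equality
lhs = prod(1 - v for v in y) - prod(1 - v for v in y_star)

# Right side: the telescoping sum from the third equality
rhs = sum(
    prod(1 - y[l] for l in range(k))
    * ((1 - y[k]) - (1 - y_star[k]))
    * prod(1 - y_star[l] for l in range(k + 1, K))
    for k in range(K)
)
assert abs(lhs - rhs) < 1e-12  # the two sides agree up to float error
```

The subsequent inequality then follows by dropping the trailing product ∏(1 − y_t(x*_{t,ℓ})) ≤ 1 from each nonnegative summand.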
Proof Sketch for RecurRank
Use (ℓ, i) to denote the i-th call of RecurRank with phase ℓ, items A_{ℓi}, and positions K_{ℓi}
Prove that, with high probability, for every (ℓ, i):
  a*_k ∈ A_{ℓi} if k ∈ K_{ℓi}
  |θ_{ℓi}ᵀ x_a − χ_{ℓi} θ*ᵀ x_a| ≤ Δ_ℓ, where χ_{ℓi} is the examination probability of the optimal list at the first position in K_{ℓi}
If, in the (ℓ, i)-th call, item a is put at position k, then
  χ_{ℓi} (α(a*_k) − α(a)) ≤ 8|K_{ℓi}|Δ_ℓ if k is the first position in K_{ℓi}
  χ_{ℓi} (α(a*_k) − α(a)) ≤ 4Δ_ℓ if k is a remaining position
  thus O(|K_{ℓi}|Δ_ℓ) regret for this part
Shuai LI (CUHK) Learning to Rank 53 / 53