Learning to Rank with Click Models: From Online Algorithms to Offline Evaluations
Shuai LI
The Chinese University of Hong Kong
Shuai LI (CUHK) Learning to Rank 1 / 53
Outline

1 Motivation
2 Background
3 Problem Definition – Online
4 Click Models
    Cascade Model (CM) – ICML’2016, AAAI’2018, IJCAI’2019
    Dependent Click Model – a co-authored work
    Position-Based Model
    General Click Models – a co-authored work, ICML’2019
5 Offline Evaluations – KDD’2018
6 Conclusions
Motivation – Learning to Rank

Amazon, YouTube, Facebook, Netflix, Taobao
Background – Multi-armed Bandit Problem

A special case of reinforcement learning
There are L arms
Each arm a has an unknown reward distribution with unknown mean α_a
The best arm is a∗ = argmax_a α_a
Background – Multi-armed Bandit Setting

At each time t:
    The learning agent selects one arm a_t
    It observes the reward X_{a_t, t}
The objective is to minimize the regret in T rounds:

    R(T) = T α∗ − E[ Σ_{t=1}^T α_{a_t} ]

Balance the trade-off between exploitation and exploration:
    Exploitation: select arms that have yielded good results so far
    Exploration: select arms that have not been tried much before
Background – Upper Confidence Bound

UCB (Upper Confidence Bound) [ACF’02]
UCB policy: select

    a_t = argmax_a  α̂_{a,t} + √( 3 ln(t) / (2 T_a(t)) )

where
    α̂_{a,t} is the empirical mean of arm a at time t — Exploitation
    T_a(t) is the number of times arm a has been played — Exploration
Gap-dependent bound O((L/Δ) log(T)), where Δ = min_{a : α_a < α∗} (α∗ − α_a); matches the lower bound
Gap-free bound O(√(L T log(T))), tight up to a factor of √log(T)
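To make the policy concrete, here is a minimal Python sketch of the UCB rule above; the two-armed Bernoulli environment and all constants are illustrative assumptions, not from the slides:

```python
import math
import random

def ucb_bandit(reward_fn, n_arms, horizon):
    """UCB policy from the slide: play each arm once, then select
    a_t = argmax_a  mean_a + sqrt(3*ln(t) / (2*T_a(t)))."""
    counts = [0] * n_arms           # T_a(t): times arm a has been played
    means = [0.0] * n_arms          # empirical means of each arm
    for t in range(1, horizon + 1):
        if t <= n_arms:
            a = t - 1               # initialization rounds: try every arm once
        else:
            a = max(range(n_arms), key=lambda i:
                    means[i] + math.sqrt(3 * math.log(t) / (2 * counts[i])))
        x = reward_fn(a)            # observe reward X_{a_t, t}
        counts[a] += 1
        means[a] += (x - means[a]) / counts[a]   # incremental mean update
    return means, counts

# Usage: two hypothetical Bernoulli arms with means 0.3 and 0.7;
# the better arm ends up played far more often (exploitation).
random.seed(0)
true_means = [0.3, 0.7]
means, counts = ucb_bandit(lambda a: float(random.random() < true_means[a]),
                           n_arms=2, horizon=2000)
```

The bonus term shrinks as an arm is played more, so under-explored arms keep getting revisited until their confidence intervals fall below the best empirical mean.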
Online Learning to Rank

There are L items
Each item a has an unknown attractiveness α(a)
There are K positions
At time t:
    The learning agent selects a list of items A_t = (a_1^t, . . . , a_K^t)
    It receives the click feedback C_t ∈ {0, 1}^K
The objective is to minimize the regret over T rounds:

    R(T) = T r(A∗) − E[ Σ_{t=1}^T r(A_t) ]

where
    r(A) is the reward of list A
    A∗ = (1, 2, . . . , K), assuming items are ordered so that α(1) ≥ α(2) ≥ · · · ≥ α(L)
Click Models

Click models describe how users interact with a list of items

Cascade Model (CM)
    Assumes the user scans the list from position 1 to position K, clicks on the first satisfying item, and stops
    At most 1 click
    r(A) = 1 − ∏_{k=1}^K (1 − α(a_k)) = OR(α(a_1), . . . , α(a_K))
    The meaning of received feedback (0, 0, 1, 0, 0): positions 1 and 2 were examined but not attractive (✗), position 3 was clicked (✓), and positions 4 and 5 were never examined (?)

                  Click Model   Regret
[KSWA, 2015]      CM            O((L/Δ) log(T))
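As an illustration of the CM reward above, a short sketch with made-up attractiveness values (item names and numbers are hypothetical):

```python
import itertools

def cascade_reward(attractions):
    """r(A) = 1 - prod_{k=1}^K (1 - alpha(a_k)): the probability of at least
    one click when the user scans top-down and stops at the first
    satisfying item."""
    p_no_click = 1.0
    for alpha in attractions:
        p_no_click *= 1.0 - alpha
    return 1.0 - p_no_click

# Usage: with hypothetical items, the best K=2 list contains the two most
# attractive items; under CM the reward does not depend on their order.
alphas = {"a": 0.5, "b": 0.3, "c": 0.1}
best = max(itertools.permutations(alphas, 2),
           key=lambda A: cascade_reward([alphas[x] for x in A]))
```

Because the product is symmetric in its factors, any permutation of the same K items gets the same CM reward, which is why CM regret bounds only concern which items are shown.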
Contextual Bandit Setting

Contexts
    User profiles, search keywords
    Important for search and recommendations
Assume each item a is represented by x_{t,a} ∈ R^d
Assume the attractiveness of item a is

    α_t(a) = θ^⊤ x_{t,a}

for a fixed but unknown weight vector θ
When the x_{t,a} are one-hot representations and θ = (α(1), . . . , α(L)), this reduces to the multi-armed bandit setting.
Contextual Combinatorial Cascading Bandits [LWZC, ICML’2016] – Algorithm

C3-UCB Algorithm
Initialization: θ̂ = 0 ∈ R^d, V = λI ∈ R^{d×d}, b = 0 ∈ R^d
For time t = 1, 2, . . .:
    Obtain item features {x_{t,a}}_{a∈E} ⊂ R^d
    With high probability ‖θ̂ − θ‖_V ≤ β_t, thus with high probability

        α_t(a) ∈ θ̂^⊤ x_{t,a} ± β_t ‖x_{t,a}‖_{V^{−1}}

    Select the list A_t by the UCBs of the arms, U_t(a) = θ̂^⊤ x_{t,a} + β_t ‖x_{t,a}‖_{V^{−1}}
    Receive feedback C_t ∈ {0, 1}^K
    Compute the stopping position K_t = min({k : C_t(k) = 1} ∪ {K}) and update

        V ← V + Σ_{k=1}^{K_t} x_{t,a_k^t} x_{t,a_k^t}^⊤,    b ← b + Σ_{k=1}^{K_t} x_{t,a_k^t} C_t(k),    θ̂ = V^{−1} b
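A pure-Python sketch of one round of the update above, hard-coded to d = 2 so the linear algebra stays explicit; the features, β, and click pattern are made up, and a real implementation would use a proper linear-algebra library:

```python
import math

def inv2(M):
    """Inverse of a 2x2 matrix (enough for this d = 2 sketch)."""
    (a, b), (c, d) = M
    det = a * d - b * c
    return [[d / det, -b / det], [-c / det, a / det]]

def c3ucb_round(V, bvec, beta, features, clicks):
    """One round of C3-UCB: rank items by theta_hat^T x + beta*||x||_{V^-1},
    then fold the observed prefix (up to the first click, K_t) into V and b."""
    Vinv = inv2(V)
    theta = [Vinv[i][0] * bvec[0] + Vinv[i][1] * bvec[1] for i in range(2)]
    def ucb(x):
        mean = theta[0] * x[0] + theta[1] * x[1]
        quad = sum(x[i] * (Vinv[i][0] * x[0] + Vinv[i][1] * x[1])
                   for i in range(2))
        return mean + beta * math.sqrt(quad)
    ranked = sorted(features, key=ucb, reverse=True)
    K = len(clicks)
    Kt = next((k + 1 for k in range(K) if clicks[k] == 1), K)  # stopping position
    for k in range(Kt):          # only examined positions update the statistics
        x = ranked[k]
        for i in range(2):
            for j in range(2):
                V[i][j] += x[i] * x[j]
            bvec[i] += x[i] * clicks[k]
    return ranked, V, bvec
```

With V = λI and b = 0, the first round ranks purely by the exploration bonus; the returned V and b then define the next round's θ̂ = V⁻¹b, so the estimate only uses feedback up to the stopping position, as in the cascade assumption.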
Contextual Combinatorial Cascading Bandits [LWZC, ICML’2016] – Results

We prove a regret bound

    R(T) = O( (d/p∗) √(TK) ln(T) )

Experimental results: [figure: cumulative regret over time t of C3-UCB (ours) vs. CombCascade, on Synthetic Data and on Network 1221]
Summary on Bandits with Click Models

                      Context   Click Model   Regret
[KSWA, 2015]          –         CM            O((L/Δ) log(T))
[LWZC, ICML’2016]     Linear    CM            O((d/p∗) √(TK) log(T))
Online Clustering of Contextual Cascading Bandits [LZ, AAAI’2018]

Finds a clustering over users while recommending
The attractiveness function is generalized linear (GL)
Improves the regret results
Experiments: [figure: cumulative regret over time t of CLUB-cascade (ours) vs. C3-UCB/CascadeLinUCB on two datasets]

                      Context   Click Model   Regret
[KSWA, 2015]          –         CM            O((L/Δ) log(T))
[LWZC, ICML’2016]     Linear    CM            O((d/p∗) √(TK) log(T))
[LZ, AAAI’2018]       GL        CM            O(d √(TK) log(T))
Improved Algorithm on Clustering Bandits [LCLL, IJCAI’2019]

Allows an arbitrary frequency distribution over users (compared to the uniform distribution)
Proves a regret bound that is free of the minimal frequency over users:

    R(T) = O( d √(mT) ln(T) + (1/γ_p² + n_u/(γ² λ_x³)) ln(T) )

compared to

    R(T) = O( d √(mT) ln(T) + (1/(p_min γ² λ_x³)) ln(T) )

where n_u is the number of users and m is the number of clusters
Experiments: [figure: regret over time t of ours vs. CLUB, LinUCB-One, and LinUCB-Ind on Synthetic, MovieLens, and Yelp data]
Dependent Click Model (DCM)

Allows multiple clicks
Assumes there is a probability of satisfaction after each click
r(A) = 1 − ∏_{k=1}^K (1 − α(a_k) γ_k)
γ_k: satisfaction probability after a click on position k
The meaning of received feedback (0, 1, 0, 1, 0): position 1 no click (✗); position 2 click, not satisfied (✓); position 3 no click (✗); position 4 click, satisfied? (✓); position 5 unknown (?)

                       Context   Click Model   Regret
[KSWA, 2015]           –         CM            O((L/Δ) log(T))
[LWZC, ICML’2016]      Linear    CM            O((d/p∗) √(TK) log(T))
[LZ, AAAI’2018]        GL        CM            O(d √(TK) log(T))
[KKSW, 2016]           –         DCM           O((L/Δ) log(T))
[LLZ, COCOON’2018]     GL        DCM           O(dK √(TK) log(T))
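The DCM reward generalizes the cascade reward by weighting each attractiveness with the per-position satisfaction probability; a small sketch with made-up values:

```python
def dcm_reward(attractions, satisfactions):
    """r(A) = 1 - prod_{k=1}^K (1 - alpha(a_k)*gamma_k): the probability
    that the user is satisfied by at least one of the clicked items."""
    p_unsatisfied = 1.0
    for alpha, gamma in zip(attractions, satisfactions):
        p_unsatisfied *= 1.0 - alpha * gamma
    return 1.0 - p_unsatisfied
```

With γ_k = 1 at every position (every click satisfies, so the first click ends the session), this recovers the CM reward.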
Position-Based Model (PBM)

The most popular model in industry
Assumes the user's click probability on an item a at position k can be factored into β_k · α(a)
β_k is the position bias; usually β_1 ≥ β_2 ≥ · · · ≥ β_K
r(A) = Σ_{k=1}^K β_k α(a_k)
The meaning of received feedback (0, 1, 0, 1, 0): clicks at positions 2 and 4; a 0 may mean the position was not examined or the item was not attractive
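Under the PBM reward (a sketch with made-up biases), placing the more attractive items at the positions with larger bias maximizes the expected number of clicks, by the rearrangement inequality:

```python
def pbm_reward(betas, attractions):
    """r(A) = sum_{k=1}^K beta_k * alpha(a_k): the expected number of
    clicks under the position-based model."""
    return sum(b * a for b, a in zip(betas, attractions))

# Usage: with decreasing position biases, sorting items by attractiveness
# (most attractive first) beats the reversed order.
betas = [1.0, 0.5]    # hypothetical position biases, beta_1 >= beta_2
```

Unlike CM, the PBM reward does depend on the order within the chosen K items, not just on which items are shown.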
Summary on Bandits with Click Models

                       Context   Click Model   Regret
[KSWA, 2015]           –         CM            O((L/Δ) log(T))
[LWZC, ICML’2016]      Linear    CM            O((d/p∗) √(TK) log(T))
[LZ, AAAI’2018]        GL        CM            O(d √(TK) log(T))
[KKSW, 2016]           –         DCM           O((L/Δ) log(T))
[LLZ, COCOON’2018]     GL        DCM           O(dK √(TK) log(T))
[LVC, 2016]            –         PBM with β    O((L/Δ) log(T))
General Click Models

Common observations for click models
    The click-through rate (CTR) of list A at position k can be factored into

        CTR(A, k) = χ(A, k) α(a_k)

    χ(A, k) is the examination probability of list A at position k
    E.g. χ(A, k) = ∏_{i=1}^{k−1} (1 − α(a_i)) in the Cascade Model, and χ(A, k) = β_k in the Position-Based Model

Difficulties with general click models
    χ depends on both the click model and the list
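The factorization can be made concrete by plugging in the two examples of χ (the attractiveness and bias values below are illustrative):

```python
def chi_cascade(alphas, k):
    """CM examination probability: position k is examined iff none of the
    earlier items was clicked (positions are 1-based)."""
    p = 1.0
    for a in alphas[:k - 1]:
        p *= 1.0 - a
    return p

def ctr(chi, alphas, k):
    """CTR(A, k) = chi(A, k) * alpha(a_k): the shared factorization."""
    return chi(alphas, k) * alphas[k - 1]

# PBM: examination depends only on the position bias beta_k, not the list.
def chi_pbm(betas):
    return lambda alphas, k: betas[k - 1]
```

The same `ctr` function serves both models; only the examination term χ changes, which is exactly the difficulty the slide points out: χ depends on the click model and, in CM's case, on the list itself.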
Summary on Bandits with Click Models

                       Context   Click Model   Regret
[KSWA, 2015]           –         CM            O((L/Δ) log(T))
[LWZC, ICML’2016]      Linear    CM            O((d/p∗) √(TK) log(T))
[LZ, AAAI’2018]        GL        CM            O(d √(TK) log(T))
[KKSW, 2016]           –         DCM           O((L/Δ) log(T))
[LLZ, COCOON’2018]     GL        DCM           O(dK √(TK) log(T))
[LVC, 2016]            –         PBM with β    O((L/Δ) log(T))
[ZTGKSW, 2017]         –         General       O((K³L/Δ) log(T))
[LKLS, NIPS’2018]      –         General       O((KL/Δ) log(T)), O(√(K³LT log(T))), Ω(√(KLT))
Online Learning to Rank with Features [LLS, ICML’2019] – Preparation

Recall
    Each item a is represented by a feature vector x_a ∈ R^d
    The attractiveness of item a is α(a) = θ^⊤ x_a
We propose an algorithm called RecurRank (Recursive Ranking)

G-optimal design
    Minimizes the covariance of the least-squares estimator
    X = {x_1, . . . , x_n} ⊂ R^d
    For any distribution π : X → [0, 1], let Q(π) = Σ_{x∈X} π(x) x x^⊤
    By the Kiefer–Wolfowitz theorem there exists a π, called the G-optimal design, that maximizes det(Q(π)), or equivalently satisfies max_{x∈X} ‖x‖²_{Q(π)†} ≤ d
    John's theorem implies that π may be chosen so that |{x : π(x) > 0}| ≤ d(d + 3)/2
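A tiny numerical sanity check of the Kiefer–Wolfowitz bound for the special case where X is the standard basis, so Q(π) is diagonal and the uniform design is optimal; this is an illustration only, not a general design computation:

```python
def max_design_norm(X, pi):
    """max_{x in X} x^T Q(pi)^{-1} x with Q(pi) = sum_x pi(x) x x^T,
    assuming Q(pi) is diagonal (true when X is the standard basis)."""
    d = len(X[0])
    diag = [sum(p * x[i] * x[i] for x, p in zip(X, pi)) for i in range(d)]
    return max(sum(x[i] * x[i] / diag[i] for i in range(d)) for x in X)

# Usage: the uniform design on the standard basis of R^3 attains the
# Kiefer-Wolfowitz value max ||x||^2_{Q^+} = d exactly.
d = 3
basis = [[1.0 if i == j else 0.0 for j in range(d)] for i in range(d)]
value = max_design_norm(basis, [1.0 / d] * d)
```

Here Q(π) = (1/d)·I, so every basis vector has squared norm exactly d under Q(π)⁻¹, matching the theorem's bound with equality.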
Online Learning to Rank with Features [LLS, ICML’2019] – Algorithm

RecurRank Algorithm
Each instantiation is called with three arguments:
    1 A phase number ℓ ∈ {1, 2, . . .};
    2 An ordered tuple of items A = (a_1, a_2, . . . , a_n);
    3 A tuple of positions K = (k, . . . , k + m − 1), with m ≤ n.
The algorithm is first called with ℓ = 1, a random order over all items {1, . . . , L}, and K = (1, . . . , K)
Find a G-optimal design π = Gopt(A). Then compute

    T(a) = ⌈ (d π(a) / (2 Δ_ℓ²)) log(|A| / δ_ℓ) ⌉,    Δ_ℓ = 2^{−ℓ}

The aim is to guarantee |α̂(a) − α(a)| ≤ Δ_ℓ for every a ∈ A by the end of this instantiation
This instantiation runs for Σ_{a∈A} T(a) rounds
Online Learning to Rank with Features [LLS, ICML’2019] – Algorithm (Continued)
RecurRank Algorithm (Continued)
Select each item a ∈ A exactly T(a) times at position k, and put the first m − 1 items of A \ {a} at the remaining positions k + 1, . . . , k + m − 1
first position — exploration
remaining positions — exploitation
only the first position has the same examination probability χ for all lists
E.g., suppose we have computed T(a₃) = 100; then the algorithm puts (a₃, a₁, a₂, a₄, . . . , aₘ) at positions (k, . . . , k + m − 1) for 100 rounds
Compute θ̂ using only the feedback from the first position k, and rank the items in decreasing order of estimated attractiveness:
α̂(a₁) ≥ α̂(a₂) ≥ α̂(a₃) ≥ · · · ≥ α̂(aₙ)
Shuai LI (CUHK) Learning to Rank 33 / 53
Online Learning to Rank with Features [LLS, ICML’2019] – Algorithm (Continued)
RecurRank Algorithm (Continued)
Eliminate the bad arms a_{n′+1}, . . . , aₙ if
α̂(a₁) ≥ · · · ≥ α̂(aₘ) ≥ · · · ≥ α̂(a_{n′}) ≥ α̂(a_{n′+1}) ≥ · · · ≥ α̂(aₙ),  where the gap between α̂(a_{n′}) and α̂(a_{n′+1}) is ≥ 2Δ_ℓ
Split the partition at each consecutive gap larger than 2Δ_ℓ:
α̂(a₁) ≥ · · · ≥ α̂(a_{k₁}) │ α̂(a_{k₁+1}) ≥ · · · ≥ α̂(a_{k₂}) │ α̂(a_{k₂+1}) ≥ · · · ≥ α̂(a_{n′})  (gap ≥ 2Δ_ℓ at each split point)
with the corresponding position ranges
k, . . . , k + k₁ − 1 │ k + k₁, . . . , k + k₂ − 1 │ k + k₂, . . . , k + m − 1
Call RecurRank on each refined partition with phase ℓ + 1
Shuai LI (CUHK) Learning to Rank 34 / 53
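The splitting rule above can be sketched as a few lines of code. This is an illustrative reading of the slide, not the paper's implementation; `alpha_hat` is assumed to be already sorted in decreasing order:

```python
def split_at_gaps(alpha_hat, delta_l):
    """Split the indices of a decreasingly sorted list of estimates wherever
    two consecutive estimates differ by at least 2 * delta_l."""
    parts, current = [], [0]
    for i in range(1, len(alpha_hat)):
        if alpha_hat[i - 1] - alpha_hat[i] >= 2 * delta_l:
            parts.append(current)  # close the current partition at the gap
            current = []
        current.append(i)
    parts.append(current)
    return parts

# With Delta_l = 0.05, the gaps 0.25 and 0.35 exceed 2 * Delta_l and split the list
parts = split_at_gaps([0.9, 0.85, 0.6, 0.55, 0.2], 0.05)
print(parts)  # [[0, 1], [2, 3], [4]]
```

Each resulting index block is then paired with a contiguous block of positions and handed to a fresh instantiation at phase ℓ + 1.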
Online Learning to Rank with Features [LLS, ICML’2019] – Results
Regret bound
R(T) = O(K √(dT log(LT)))
Experiments — RecurRank (ours), C3-UCB, TopRank
[Figure: regret vs. time t (0–200k) under (a) the cascade model (CM) and (b) the position-based model (PBM)]
Shuai LI (CUHK) Learning to Rank 35 / 53
Summary on Bandits with Click Models

Reference            | Context | Click Model | Regret
[KSWA, 2015]         | –       | CM          | O((L/Δ) log T)
[LWZC, ICML’2016]    | Linear  | CM          | O((d/p*) √(TK) log T)
[LZ, AAAI’2018]      | GL      | CM          | O(d √(TK) log T)
[KKSW, 2016]         | –       | DCM         | O((L/Δ) log T)
[LLZ, COCOON’2018]   | GL      | DCM         | O(dK √(TK) log T)
[LVC, 2016]          | –       | PBM with β  | O((L/Δ) log T)
[ZTGKSW, 2017]       | –       | General     | O((K³L/Δ) log T)
[LKLS, NIPS’2018]    | –       | General     | O((KL/Δ) log T), O(√(K³LT log T)), Ω(√(KLT))
[LLS, ICML’2019]     | Linear  | General     | O(K √(dT log(LT)))

Shuai LI (CUHK) Learning to Rank 36 / 53
Outline
1 Motivation
2 Background
3 Problem Definition – Online
4 Click Models
  Cascade Model (CM) – ICML’2016, AAAI’2018, IJCAI’2019
  Dependent Click Model – a co-authored work
  Position-Based Model
  General Click Models – a co-authored work, ICML’2019
5 Offline Evaluations – KDD’2018
6 Conclusions
Shuai LI (CUHK) Learning to Rank 37 / 53
Offline Evaluations
Motivation
Can we estimate the expected number of clicks of a new policy without directly deploying it?
Offline evaluation!
Objective:
Design statistically efficient estimators, based on a logged dataset, for any ranking policy
Challenge:
The number of different lists is exponential in K
Shuai LI (CUHK) Learning to Rank 38 / 53
Offline Evaluation of Ranking Policies with Click Models [LAKMVW, KDD’2018] – Results
We design estimators for different click models
Item-Position, Random, Rank-Based, Position-Based, Document-Based
We prove that our estimators
are unbiased in a larger class of policies
have lower bias
have better theoretical guarantees for the best policy
than the existing unstructured estimators under the corresponding click-model assumptions
Shuai LI (CUHK) Learning to Rank 39 / 53
Offline Evaluation of Ranking Policies with Click Models [LAKMVW, KDD’2018] – Experiments
Experiments – 100 most frequent queries in the Yandex dataset
[Figure: RMSE vs. number of logged samples M (10⁰–10⁵) for the RCTR, Item, IP, PBM, and List estimators; panels: 100 queries with K = 2, K = 3, and K = 10]
Shuai LI (CUHK) Learning to Rank 40 / 53
Outline
1 Motivation
2 Background
3 Problem Definition – Online
4 Click Models
  Cascade Model (CM) – ICML’2016, AAAI’2018, IJCAI’2019
  Dependent Click Model – a co-authored work
  Position-Based Model
  General Click Models – a co-authored work, ICML’2019
5 Offline Evaluations – KDD’2018
6 Conclusions
Shuai LI (CUHK) Learning to Rank 41 / 53
Conclusions
Context + Cascade model (CM) / Dependent click model (DCM)
Online clustering of bandits + Cascade model (CM)
Improved algorithm on clustering of bandits
Context + General click model
Offline evaluation of ranking policies with click models
Shuai LI (CUHK) Learning to Rank 42 / 53
Publications
First-author papers in thesis – in the order of thesis
1 Shuai Li, Baoxiang Wang, Shengyu Zhang, Wei Chen, ContextualCombinatorial Cascading Bandits, ICML, 2016
2 Shuai Li, Shengyu Zhang, Online Clustering of Contextual CascadingBandits, AAAI, 2018
3 Shuai Li, Wei Chen, S. Li, Kwong-Sak Leung, Improved Algorithm on Online Clustering of Bandits, IJCAI, 2019
4 Shuai Li, Tor Lattimore, Csaba Szepesvari, Online Learning to Rankwith Features, ICML, 2019
5 Shuai Li, Yasin Abbasi-Yadkori, Branislav Kveton, S. Muthukrishnan,Vishwa Vinay and Zheng Wen, Offline Evaluation of Ranking Policieswith Click Models, KDD, 2018
Shuai LI (CUHK) Learning to Rank 43 / 53
Publications
Mentioned co-authored papers
6 Weiwen Liu, Shuai Li, Shengyu Zhang, Contextual Dependent ClickBandit Algorithm for Web Recommendation, COCOON, 2018
7 Tor Lattimore, Branislav Kveton, Shuai Li, Csaba Szepesvari,TopRank: A Practical Algorithm for Online Stochastic Ranking,NeurIPS, 2018
Other co-authored papers
8 Pengfei Liu, Hongjian Li, Shuai Li, Kwong-Sak Leung, ImprovingPrediction of Phenotypic Drug Response on Cancer Cell Lines UsingDeep Convolutional Network, BMC Bioinformatics, 2019
9 Ran Wang, Shuai Li, Man-Hon Wong, and Kwong-Sak Leung,Drug-Protein-Disease Association Prediction and Drug RepositioningBased on Tensor Decomposition, BIBM, 2018
10 Pengfei Liu, Shuai Li, Weiying Yi, Kwong-Sak Leung, A HybridDistributed Framework for SNP Selections, PDPTA, 2016
Shuai LI (CUHK) Learning to Rank 44 / 53
Publications
In submission
11 Shuai Li, Wei Chen, Zheng Wen, Kwong-Sak Leung, StochasticOnline Learning with Probabilistic Feedback Graph
12 Shuai Li, Kwong-Sak Leung, Generalized Clustering Bandits
13 Shuai Li, Tong Yu, Ole Mengshoel, Kwong-Sak Leung, OnlineSemi-Supervised Learning with Large Margin Separation
14 Xiaojin Zhang, Shuai Li, Shengyu Zhang, Contextual CombinatorialConservative Bandits
15 Pengfei Liu, Shuai Li, Kwong-Sak Leung, The Recovery of StochasticDifferential Equations with Genetic Programming andKullback-Leibler Divergence
Shuai LI (CUHK) Learning to Rank 45 / 53
Thank you!
&
Questions?
Shuai LI (CUHK) Learning to Rank 46 / 53
References I
P. Auer, N. Cesa-Bianchi, and P. Fischer.Finite-time analysis of the multiarmed bandit problem.Machine learning, 47(2-3):235–256, 2002.
S. Katariya, B. Kveton, C. Szepesvari, and Z. Wen.DCM bandits: Learning to rank with multiple clicks.In International Conference on Machine Learning, pages 1215–1224,2016.
B. Kveton, C. Szepesvari, Z. Wen, and A. Ashkan.Cascading bandits: Learning to rank in the cascade model.In International Conference on Machine Learning, pages 767–776,2015.
Shuai LI (CUHK) Learning to Rank 47 / 53
References II
P. Lagree, C. Vernade, and O. Cappe.Multiple-play bandits in the position-based model.In Advances in Neural Information Processing Systems, pages1597–1605, 2016.
T. Lattimore, B. Kveton, Li, Shuai, and C. Szepesvari.TopRank: A practical algorithm for online stochastic ranking.In The Conference on Neural Information Processing Systems, 2018.
W. Liu, Li, Shuai, and S. Zhang.Contextual dependent click bandit algorithm for web recommendation.
In International Computing and Combinatorics Conference, pages39–50. Springer, 2018.
Shuai LI (CUHK) Learning to Rank 48 / 53
References III
Li, Shuai, Y. Abbasi-Yadkori, B. Kveton, S. Muthukrishnan, V. Vinay,and Z. Wen.Offline evaluation of ranking policies with click models.In ACM SIGKDD Conference on Knowledge Discovery and DataMining, 2018.
Li, Shuai, W. Chen, S. Li, and K.-S. Leung.Improved algorithm on online clustering of bandits.In International Joint Conference on Artificial Intelligence (IJCAI),2019.
Li, Shuai, T. Lattimore, and C. Szepesvari.Online learning to rank with features.In International Conference on Machine Learning (ICML), 2019.
Shuai LI (CUHK) Learning to Rank 49 / 53
References IV
Li, Shuai, B. Wang, S. Zhang, and W. Chen.Contextual combinatorial cascading bandits.In International Conference on Machine Learning, pages 1245–1253,2016.
Li, Shuai and S. Zhang.Online clustering of contextual cascading bandits.In The AAAI Conference on Artificial Intelligence, 2018.
M. Zoghi, T. Tunys, M. Ghavamzadeh, B. Kveton, C. Szepesvari, andZ. Wen.Online learning to rank in stochastic click models.In International Conference on Machine Learning, pages 4199–4208,2017.
Shuai LI (CUHK) Learning to Rank 50 / 53
References V
S. Zong, H. Ni, K. Sung, N. R. Ke, Z. Wen, and B. Kveton.Cascading bandits for large-scale recommendation problems.In Proceedings of the Thirty-Second Conference on Uncertainty inArtificial Intelligence, pages 835–844. AUAI Press, 2016.
Shuai LI (CUHK) Learning to Rank 51 / 53
A Key Part Proof for CLUB-cascade (Improving C3-UCB)

E_t[R(A_t, y_t)]
  = E_t[ (1 − ∏_{k=1}^{K} (1 − y_t(x*_{t,k}))) − (1 − ∏_{k=1}^{K} (1 − y_t(x_{t,k}))) ]
  = E_t[ ∏_{k=1}^{K} (1 − y_t(x_{t,k})) − ∏_{k=1}^{K} (1 − y_t(x*_{t,k})) ]
  = E_t[ Σ_{k=1}^{K} ( ∏_{ℓ=1}^{k−1} (1 − y_t(x_{t,ℓ})) ) · [ (1 − y_t(x_{t,k})) − (1 − y_t(x*_{t,k})) ] · ( ∏_{ℓ=k+1}^{K} (1 − y_t(x*_{t,ℓ})) ) ]
  ≤ E_t[ Σ_{k=1}^{K} ( ∏_{ℓ=1}^{k−1} (1 − y_t(x_{t,ℓ})) ) · [ y_t(x*_{t,k}) − y_t(x_{t,k}) ] ]
  = E_t[ Σ_{k=1}^{K_t} [ y_t(x*_{t,k}) − y_t(x_{t,k}) ] ]

Shuai LI (CUHK) Learning to Rank 52 / 53
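The third equality above is a telescoping product decomposition, which can be checked numerically. This is only a sanity check of the algebraic identity on random values; the names `y` and `y_star` stand in for y_t(x_{t,k}) and y_t(x*_{t,k}):

```python
import random

def prod(factors):
    p = 1.0
    for f in factors:
        p *= f
    return p

random.seed(0)
K = 6
y = [random.random() for _ in range(K)]       # y_t(x_{t,k}): shown list
y_star = [random.random() for _ in range(K)]  # y_t(x*_{t,k}): optimal list

# Left side: the product difference from the second equality
lhs = prod(1 - v for v in y) - prod(1 - v for v in y_star)

# Right side: the telescoping sum from the third equality
rhs = sum(
    prod(1 - y[l] for l in range(k))
    * ((1 - y[k]) - (1 - y_star[k]))
    * prod(1 - y_star[l] for l in range(k + 1, K))
    for k in range(K)
)
assert abs(lhs - rhs) < 1e-12  # the two sides agree up to float error
```

The subsequent inequality then follows by dropping the trailing product ∏(1 − y_t(x*_{t,ℓ})) ≤ 1 from each nonnegative summand.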
Proof Sketch for RecurRank
Use (ℓ, i) to denote the i-th call of RecurRank with phase ℓ, items A_{ℓi}, and positions K_{ℓi}
Prove that, with high probability, for every (ℓ, i):
  a*_k ∈ A_{ℓi} if k ∈ K_{ℓi}
  |θ_{ℓi}ᵀ x_a − χ_{ℓi} θ*ᵀ x_a| ≤ Δ_ℓ, where χ_{ℓi} is the examination probability of the optimal list at the first position in K_{ℓi}
If, in the (ℓ, i)-th call, item a is put at position k, then
  χ_{ℓi} (α(a*_k) − α(a)) ≤ 8|K_{ℓi}|Δ_ℓ if k is the first position in K_{ℓi}
  χ_{ℓi} (α(a*_k) − α(a)) ≤ 4Δ_ℓ if k is a remaining position
  thus O(|K_{ℓi}|Δ_ℓ) regret for this part
Shuai LI (CUHK) Learning to Rank 53 / 53