Machine Learning in the Bandit Setting
Algorithms, Evaluation, and Case Studies
Lihong Li
Machine Learning, Yahoo! Research
SEWM, 2012-05-25
[Diagram: DATA → (statistics, ML, DM, …) → KNOWLEDGE → UTILITY → ACTION → MORE DATA; closing this loop is Reinforcement Learning]
Outline
Introduction
Basic Solutions
Advanced algorithms
Advanced Offline Evaluation
Conclusions
Yahoo!-User Interaction

[Diagram: the serving strategy (POLICY) observes a CONTEXT (gender, age, …), takes an ACTION (ads, news, ranking, …), and receives a REWARD (click, conversion, revenue, …)]

Goal: maximize total REWARD by optimizing the POLICY
Today Module @ Yahoo! Front Page

[Screenshot: the "Featured Article" slot, served from a small pool of articles chosen by editors]
Objectives and Challenges

• Objectives
  • (informally) choose the most interesting articles for individual users
  • (formally) maximize click-through rate (CTR)
• Challenges
  • Dynamic content pool → fast learning
  • Sparse user visits → transfer interests among users
  • Partial user feedback → efficient explore/exploit
Challenge: Explore/Exploit

• Observation: only displayed articles get user click feedback

[Diagram: article CTR estimates feed both EXPLOIT (choose good articles) and EXPLORE (choose novel articles)]

How to trade off?
… with dynamic article pools
… while considering user interests
Insufficient Exploration Example

Arm 1 always pays $5/round.
Arm 2 pays $100 a quarter of the time (so $25/round on average).

Observed payoffs:

Round:    1    2    3    4    5    6    7    8
Payoff:  $5   $5   $0   $0   $0   $5   $5   $5

It turns out…

Round:    1    2    3    4    5    6    7    8
Payoff: $100 $100 $100  $0   $0   $5   $5   $5

A greedy player who tries arm 2 a few times, sees only $0, and falls back to the safe $5 arm never discovers that arm 2 is worth $25/round: a little bad luck early on locks in the inferior choice.
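To make the failure mode concrete, here is a minimal simulation of the two arms above. The payoff distributions follow the slide; the greedy strategy and trial counts are hypothetical illustration, not from the talk:

```python
import random

random.seed(0)

arms = [lambda: 5.0,                                       # always pays $5
        lambda: 100.0 if random.random() < 0.25 else 0.0]  # $25/round on average

totals, counts = [0.0, 0.0], [0, 0]
for t in range(1000):
    if min(counts) < 3:                       # brief initial trial of each arm
        a = counts.index(min(counts))
    else:                                     # greedy: exploit the empirical best
        a = max(range(2), key=lambda i: totals[i] / counts[i])
    r = arms[a]()
    totals[a] += r
    counts[a] += 1

print(counts)  # if arm 2's first pulls happen to pay $0, greedy abandons it forever
```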
Contextual Bandit Formulation

Multi-armed contextual bandit [LZ'08]. In Today Module:

• $A_t$: available articles at time $t$
• $x_t$: user features (age, gender, interests, ...)
• $a_t$: the displayed article at time $t$
• $r_{t,a_t}$: 1 for click, 0 for no click

Formally, we want to maximize $\sum_{t=1}^{T} r_{t,a_t}$.

Protocol, for $t = 1, 2, \ldots$:
1. Observe $K$ arms $A_t$ and "context" $x_t \in \mathbb{R}^d$
2. Select $a_t \in A_t$
3. Receive reward $r_{t,a_t} \in [0,1]$
4. $t \leftarrow t + 1$
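The protocol translates directly into an interaction loop. The sketch below assumes hypothetical `env` and `policy` objects with the listed methods; it illustrates the setting, it is not code from the talk:

```python
def run_contextual_bandit(env, policy, T):
    """One pass of the contextual-bandit protocol.  Note the partial
    (bandit) feedback: only the chosen arm's reward is ever revealed."""
    total = 0.0
    for t in range(T):
        arms, x = env.observe()      # K arms A_t and context x_t in R^d
        a = policy.select(arms, x)   # a_t in A_t
        r = env.reward(a)            # r_{t,a_t} in [0, 1]
        policy.update(x, a, r)       # learn from the chosen arm only
        total += r
    return total                     # sum of r_{t,a_t}, the quantity to maximize
```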
Another Example – Display Ads

• $A_t$: eligible ads in the current page view
• $x_t$: page/user features
• $a_t$: the displayed ad(s)
• $r_{t,a_t}$: $ amount if clicked/converted, 0 otherwise
Yet Another Example – Ranking

• $A_t$: possible document rankings for query $q_t$
• $x_t$: query/document features
• $a_t$: the displayed ranking for query $q_t$
• $r_{t,a_t}$: 1 if the session succeeds, 0 otherwise
Related Work

• Standard information retrieval and collaborative filtering
  • Also concerned with (personalized) recommendation
  • But with (almost) static users/items
    → training often done in batch/offline mode
    → no need for online exploration
• Full reinforcement learning
  • General: includes bandit problems as special cases
  • Needs to tackle "temporal credit assignment"
Outline
Introduction
Basic Solutions
› Algorithms
› Evaluation
› Experiments
Advanced algorithms
Advanced Offline Evaluation
Conclusions
Prior Bandit Algorithms

• Regret minimization (focus of this talk): Herbert Robbins, Tze Leung Lai
• Bayesian optimal solution: John Gittins
Traditional K-armed Bandits

Assumption: CTR (click-through rate) not affected by user features.

$\mathrm{CTR}_1 \approx \mu_1, \quad \mathrm{CTR}_2 \approx \mu_2, \quad \mathrm{CTR}_3 \approx \mu_3$

CTR estimates = #clicks / #impressions.

ε-greedy:
• with prob $1-\varepsilon$: choose article $\arg\max_a \hat\mu_a$
• with prob $\varepsilon$: choose a random article

UCB1:
• choose article $\arg\max_a \left\{ \hat\mu_a + \alpha/\sqrt{N_a} \right\}$
• The more $a$ has been displayed, the less uncertainty in $\mathrm{CTR}_a$.

No contexts → no personalization.
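As a concrete sketch, both context-free strategies fit in a few lines. The $\alpha/\sqrt{N_a}$ bonus mirrors the form on the slide (a common variant of the classic UCB1 bonus); parameter values are illustrative:

```python
import math
import random

class ContextFreeBandit:
    """Per-article CTR estimates (#clicks / #impressions) with
    epsilon-greedy or UCB-style arm selection."""
    def __init__(self, n_arms, epsilon=0.1, alpha=1.0):
        self.clicks = [0.0] * n_arms
        self.pulls = [0] * n_arms
        self.epsilon, self.alpha = epsilon, alpha

    def ctr(self, a):
        return self.clicks[a] / self.pulls[a] if self.pulls[a] else 0.0

    def select_eps_greedy(self):
        if random.random() < self.epsilon:                 # explore
            return random.randrange(len(self.pulls))
        return max(range(len(self.pulls)), key=self.ctr)   # exploit

    def select_ucb(self):
        def ucb(a):  # untried arms get an infinite exploration bonus
            if self.pulls[a] == 0:
                return float("inf")
            return self.ctr(a) + self.alpha / math.sqrt(self.pulls[a])
        return max(range(len(self.pulls)), key=ucb)

    def update(self, a, reward):
        self.pulls[a] += 1
        self.clicks[a] += reward
```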
Contextual Bandit Algorithms

• EXP4 [ACFS'02], EXP4.P [BLLRS'11], elimination [ADKLS'12]
  • Strong theoretical guarantees
  • But computationally expensive
• Epoch-greedy [LZ'08]
  • Similar to ε-greedy
  • Simple, general, and less expensive
  • But not the most effective
• This talk: algorithms with compact, parametric models
  • Both efficient and effective
  • Extension of UCB1 to linear models
  • … and to generalized linear models
  • Randomized algorithm with Thompson sampling
LinUCB: UCB for Linear Models

• Linear model assumption: $E[r_a \mid x] = x^\top \theta_a$
• Standard least-squares ridge regression:
  $\hat\theta_a = (D_a^\top D_a + I)^{-1} D_a^\top c_a$,
  where $D_a = \begin{bmatrix} -\,x_1^\top- \\ -\,x_2^\top- \\ \vdots \end{bmatrix}$ and $c_a = \begin{bmatrix} r_1 \\ r_2 \\ \vdots \end{bmatrix}$
  collect the contexts where arm $a$ was shown and the rewards observed; define $A_a = D_a^\top D_a + I$.
• Reward prediction for a new user: $x^\top \hat\theta_a \approx x^\top \theta_a$
• Whether to explore requires quantifying parameter uncertainty:
  $\left| x^\top \hat\theta_a - x^\top \theta_a \right| \le \alpha \sqrt{x^\top A_a^{-1} x}$ (with high probability)
  The left side is the prediction error; $\sqrt{x^\top A_a^{-1} x}$ measures how "dissimilar" $x$ is to previous users.
LinUCB: UCB for Linear Models (II)

With high probability: $\left| x^\top \hat\theta_a - x^\top \theta_a \right| \le \alpha \sqrt{x^\top A_a^{-1} x}$

LinUCB always selects an arm with the highest UCB:

$a^* = \arg\max_a \left\{ x^\top \hat\theta_a + \alpha \sqrt{x^\top A_a^{-1} x} \right\}$

(the first term is to exploit; the second is to explore)

Recall UCB1: $a^* = \arg\max_a \left\{ \hat\mu_a + \alpha/\sqrt{N_a} \right\}$

LinRel [Auer 2002] works similarly but in a more complicated way.
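A minimal LinUCB sketch following the formulas above. It maintains $A_a$ and $b_a = D_a^\top c_a$ per arm; inverting $A_a$ at each selection is the simple (not the fastest) implementation, and $\alpha$ is left as a tunable parameter:

```python
import numpy as np

class LinUCB:
    def __init__(self, n_arms, d, alpha=1.0):
        self.alpha = alpha
        self.A = [np.eye(d) for _ in range(n_arms)]    # A_a = D_a^T D_a + I
        self.b = [np.zeros(d) for _ in range(n_arms)]  # b_a = D_a^T c_a

    def select(self, x):
        def ucb(a):
            A_inv = np.linalg.inv(self.A[a])
            theta_hat = A_inv @ self.b[a]              # ridge-regression estimate
            # exploit term + explore term
            return x @ theta_hat + self.alpha * np.sqrt(x @ A_inv @ x)
        return max(range(len(self.A)), key=ucb)

    def update(self, a, x, r):
        self.A[a] += np.outer(x, x)
        self.b[a] += r * x
```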
Outline
Introduction
Basic Solutions
› Algorithms
› Evaluation
› Experiments
Advanced algorithms
Advanced Offline Evaluation
Conclusions
Evaluation of Bandit Algorithms

Goal: estimate the average reward of running policy $\pi$ with i.i.d. $x$:

• Static $\pi$: $V(\pi) := E_x\!\left[ r\big(x, \pi(x)\big) \right]$
• Adaptive $\pi$: $V(\pi, T) := \frac{1}{T}\, E\!\left[ \sum_{t=1}^{T} r\big(x_t, \pi(x_t, h_t)\big) \right]$

Gold standard:
• Run $\pi$ in the real system and see how well it works
• … but expensive and risky
Offline Evaluation

• Benefits
  • Cheap and risk-free!
  • Avoids frequent bucket tests
  • Replicable / fair comparisons
• Common in non-interactive learning problems (e.g., classification)
  • Benchmark data organized as (input, label) pairs
• … but not straightforward for interactive learning problems
  • Data in bandits usually consist of (context, arm, reward) triples
  • No reward signal for any other arm $a' \neq a$
Common/Prior Evaluation Approaches

Data: $\{(x_1, a_1, r_1), \ldots, (x_L, a_L, r_L)\}$

Build a reward simulator via classification / regression / density estimation:
$\hat r(x, a) \approx E[r \mid x, a]$,
then run the bandit algorithm $\pi$ against the simulator.

This (difficult) modeling step is often biased → unreliable evaluation.

In contrast, our approach
• avoids explicit user modeling → simple
• gives unbiased evaluation results → reliable
Our Evaluation Method: "Replay"

Want to estimate $V(\pi) := E_x\!\left[ r\big(x, \pi(x)\big) \right]$.

Data: $\{(x_1, a_1, r_1), \ldots, (x_L, a_L, r_L)\}$
Key requirement for data collection: $a_i \sim \mathrm{unif}(A)$

For $i = 1, 2, \ldots, L$:
• reveal $x_i$ to the bandit algorithm $\pi$
• $\pi$ chooses $\hat a_i = \pi(x_i)$
• reveal $r_i$ only if $\hat a_i = a_i$ (a "match")

Finally, output $\hat V = \frac{K}{L} \sum_{i=1}^{L} r_i \cdot I(\hat a_i = a_i)$.
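A sketch of the replay estimator; `policy` is a hypothetical object exposing `select`/`update` as in the earlier sketches. Unmatched events are discarded, so an adaptive policy only ever learns from revealed rewards:

```python
def replay_evaluate(policy, logged, K):
    """Replay over logged (x, a, r) triples whose arms were chosen
    uniformly at random among K candidates."""
    total = 0.0
    for x, a, r in logged:
        a_hat = policy.select(x)
        if a_hat == a:               # a "match": the reward is revealed
            total += r
            policy.update(a, x, r)   # adaptive policies learn from matches only
    return K * total / len(logged)   # V_hat = (K / L) * sum_i r_i * I(a_hat_i = a_i)
```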
Theoretical Guarantees

Thm 1 (unbiasedness): $V(\pi) = E[\hat V]$.
So $\hat V$ on average reflects real, online performance.

Thm 2 (error bound): $\left| V(\pi) - \hat V \right| = O\!\left( \sqrt{K/L} \right)$.
So estimation error → 0 with more data: accuracy is guaranteed with a large volume of data.
Case Study in Today Module [LCLW'11]

Data:
› Large volume of real user traffic in Today Module

Policies being evaluated:
› EMP [ACE'09]
› SEMP/CEMP: personalized EMP variants
› Use policies' online bucket CTR as "truth"

Random-bucket data for evaluation:
› 40M visits, K ≈ 20 on average
› Use it to offline-evaluate policies' CTR
Unbiasedness (Article nCTR)

[Scatter plot: estimated nCTR vs. recorded online nCTR per article. Are they close?]

The offline estimate is indeed unbiased!
Unbiasedness (Daily nCTR)

[Plot: estimated vs. recorded online daily nCTR over ten days in November 2009]

The offline estimate is indeed unbiased!
Estimation Error

[Log-log plot: nCTR estimation error vs. number of data $L$, decaying as $1/\sqrt{L}$]

Recall our theoretical error bound, Thm 2: $\left| V(\pi) - \hat V \right| = O\!\left( \sqrt{K/L} \right)$.
Unbiased Offline Evaluation: Recap

What we have shown:
› A principled method for benchmark data collection
› which allows reliable/unbiased evaluation
› of any bandit algorithm

Analogue: UCI, Caltech101, … datasets for supervised learning.

The first such benchmark was released by Yahoo!:
http://webscope.sandbox.yahoo.com/catalog.php?datatype=r
2nd and 3rd versions available for the PASCAL2 Challenge
› ICML 2012 workshop
Outline
Introduction
Basic Solutions
› Algorithms
› Evaluation
› Experiments
Advanced algorithms
Advanced Offline Evaluation
Conclusions
Experiment Setup: Architecture

[Diagram: 5% of traffic goes to the "Learning Bucket" (where explore/exploit happens); 95% goes to the "Deployment Bucket" (exploitation only)]

• Model updated every 5 minutes
• Main metric: overall normalized CTR in the deployment bucket
  • nCTR = CTR × secretNumber (to protect sensitive business information)
Experiment Setup: Data
• May 1 2009 data for parameter tuning• May 3-9 2009 data for performance evaluation (33M visits)• Number of candidate articles per user visit is about 20• Dimension reduction on user features [CBP+’09]
• 6 features
• Data available from Yahoo! Research’s Webscope program
http://webscope.sandbox.yahoo.com/catalog.php?datatype=r
€
E rt,a | x t ,a[ ] = x t ,aT θa
2012-05-2532SEWM
CTR in Deployment Bucket [LCLS'10]

[Bar chart: nCTR per algorithm; baselines include a "cheating" policy and a no-feature policy]

• UCB-type algorithms do better than their ε-greedy counterparts
• CTR improved significantly when features/contexts are considered

Article CTR Lift

[Scatter plot of per-article CTR lift: + for ε-greedy, o for UCB; no-context vs. linear-model variants]
Outline
Introduction
Basic Solutions
Advanced algorithms
› Hybrid linear models
› Generalized linear models
› Thompson sampling
› Theory

Advanced Offline Evaluation

Conclusions
LinUCB for Hybrid Linear Models

Previous assumption: $E[r_a \mid x] = x^\top \theta_a$
New assumption: $E[r_a \mid x] = x^\top \theta_a + z_a^\top \beta$

• $z_a^\top \beta$: information shared by all articles (e.g., teens like articles about Harry Potter)
• $x^\top \theta_a$: article-specific information (e.g., Californian males like this article)

Advantage: learns faster when there are few data.
Challenge: seems to require unbounded computational complexity.
Good news: efficient implementation made possible by block-matrix manipulations.
Overall CTR in Deployment Bucket

[Bar chart: hybrid-model LinUCB shows an advantage over the plain linear model]

• UCB-type algorithms do better than their ε-greedy counterparts
• CTR improved significantly when features/contexts are considered
• The hybrid model is better when data are scarce
Outline
Introduction
Basic Solutions
Advanced algorithms
› Hybrid linear models
› Generalized linear models
› Thompson sampling
› Theory

Advanced Offline Evaluation

Conclusions
Extensions to GLMs

Linear models are unnatural for binary events: $E[r_a \mid x] = x^\top \theta_a$ is unbounded.

Generalized linear models (GLMs): $E[r_a \mid x] = g^{-1}\!\left( x^\top \theta_a \right)$,
where the "inverse link function" $g^{-1}: \mathbb{R} \to [0,1]$ squashes the linear score $x^\top \theta_a$.

Logistic regression: $E[r_a \mid x] = \frac{1}{1 + \exp(-x^\top \theta_a)}$ (the logistic function)

Probit regression: $E[r_a \mid x] = \Phi\!\left( x^\top \theta_a \right)$ ($\Phi$: CDF of the standard Gaussian)
Model Fitting in GLMs

• Maintain a Gaussian posterior $N(\mu_a, \Sigma_a)$ over the parameter $\theta_a$.
• Use Bayes' formula with new data $(x, r)$:

$p(\theta_a) \propto \underbrace{N(\theta_a; \mu_a, \Sigma_a)}_{\text{current posterior}} \cdot \underbrace{\left( 1 + \exp\!\left( -(2r-1)\, x^\top \theta_a \right) \right)^{-1}}_{\text{likelihood}}$

• Apply a Laplace approximation to obtain the new posterior $N(\mu_a', \Sigma_a')$.
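A sketch of one such update for the logistic likelihood, assuming SciPy is available: the posterior mode is found numerically and the covariance is the inverse Hessian at the mode (the Laplace approximation), written for clarity rather than speed:

```python
import numpy as np
from scipy.optimize import minimize

def laplace_update(mu, Sigma, x, r):
    """Posterior update N(mu, Sigma) -> N(mu', Sigma') for one
    observation (x, r) with r in {0, 1} under a logistic likelihood."""
    y = 2 * r - 1                               # map {0,1} to {-1,+1}
    Sigma_inv = np.linalg.inv(Sigma)

    def neg_log_posterior(theta):
        prior = 0.5 * (theta - mu) @ Sigma_inv @ (theta - mu)
        return prior + np.log1p(np.exp(-y * (x @ theta)))

    mu_new = minimize(neg_log_posterior, mu).x  # posterior mode
    p = 1.0 / (1.0 + np.exp(-(x @ mu_new)))     # sigmoid at the mode
    hessian = Sigma_inv + p * (1 - p) * np.outer(x, x)
    return mu_new, np.linalg.inv(hessian)       # Gaussian fit at the mode
```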
UCB Heuristics for GLMs

• Use the posterior $N(\mu_a, \Sigma_a)$ to derive (approximate) upper confidence bounds [LCLMW'12]:

$E[r_a \mid x_a] \le \begin{cases} x_a^\top \mu_a + \alpha \sqrt{x_a^\top \Sigma_a x_a} & \text{linear} \\[6pt] \dfrac{1 + \alpha\left( \exp\!\left( \sqrt{x_a^\top \Sigma_a x_a} \right) - 1 \right)}{1 + \exp\!\left( -x_a^\top \mu_a \right)} & \text{logistic} \\[6pt] \Phi\!\left( x_a^\top \mu_a + \alpha \sqrt{x_a^\top \Sigma_a x_a} \right) & \text{probit} \end{cases}$
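These scores can be computed directly from the posterior. A sketch, following the formulas above as reconstructed from the slide (the logistic form in particular is a reconstruction); `alpha` tunes the exploration strength:

```python
import numpy as np
from scipy.stats import norm

def glm_ucb_score(x, mu, Sigma, alpha=1.0, link="logistic"):
    """Approximate upper confidence bound for one arm's posterior N(mu, Sigma)."""
    m = x @ mu                       # posterior mean of the linear score
    s = np.sqrt(x @ Sigma @ x)       # posterior std of the linear score
    if link == "linear":
        return m + alpha * s
    if link == "logistic":
        return (1 + alpha * (np.exp(s) - 1)) / (1 + np.exp(-m))
    if link == "probit":
        return norm.cdf(m + alpha * s)
    raise ValueError(f"unknown link: {link}")

# Arm selection: pick argmax over arms of glm_ucb_score(x_a, mu_a, Sigma_a).
```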
Experiment Setup

• One week of data from June 2009 (34M user visits)
• About 20 candidate articles per user visit
• Features: 20 features by PCA on raw binary user features
• Model updated every 5 minutes
• Main metric: overall (normalized) CTR in the deployment bucket

[Diagram: 5% "Learning Bucket" (where explore/exploit happens), 95% "Deployment Bucket" (exploitation only)]
GLM Comparisons

[Bar chart: nCTR of linear, logistic, and probit models under ε-greedy and UCB exploration]

Obs #1: active exploration is necessary
Obs #2: logistic/probit > linear
Obs #3: UCB > ε-greedy
Outline
Introduction
Basic Solutions
Advanced algorithms
› Hybrid linear models
› Generalized linear models
› Thompson sampling
› Theory

Advanced Offline Evaluation

Conclusions
Limitations of UCB Exploration

• Exploration can be too aggressive: may explore the whole space exhaustively
• Difficult to use prior knowledge
• Exploration is deterministic: poor performance when rewards are delayed
• Deriving an (approximate) UCB is not always easy
Thompson Sampling (1933)

Algorithmic idea: "probability matching"
$\Pr(a \mid x) = \Pr(a \text{ is optimal for } x)$

• Randomized action selection (by definition) → more robust to reward delay
• Straightforward to implement [CL'12]:
  • Maintain the parameter posterior $\Pr(\theta_a \mid D)$
  • Draw random models $\tilde\theta_a \sim \Pr(\theta_a \mid D)$
  • Act accordingly: $a(x) = \arg\max_a f(x, a; \tilde\theta_a)$
• Easily combined with other (non-)parametric models
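A sketch of one round of Thompson sampling with Gaussian posteriors; `posteriors` maps each arm to its (mu, Sigma) and `predict` is the reward model $f(x, a; \theta)$, both hypothetical interfaces:

```python
import numpy as np

def thompson_select(x, posteriors, predict):
    """Probability matching: sample one model per arm from its posterior,
    then act greedily with respect to the sampled models."""
    best_arm, best_score = None, -np.inf
    for a, (mu, Sigma) in posteriors.items():
        theta = np.random.multivariate_normal(mu, Sigma)  # theta~ ~ Pr(theta_a | D)
        score = predict(x, a, theta)                      # f(x, a; theta~)
        if score > best_score:
            best_arm, best_score = a, score
    return best_arm
```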
Thompson Sampling

One-week data from Today Module on Yahoo!'s front page; logistic regression with Gaussian posteriors.

[Plot: offline nCTR comparison]

Obs #1: TS is competitive uniformly
Obs #2: TS is more robust to reward delay
Outline
Introduction
Basic Solutions
Advanced algorithms
› Hybrid linear models
› Generalized linear models
› Thompson sampling
› Theory

Advanced Offline Evaluation

Conclusions
Regret-based Competitive Analysis

$\mathrm{Regret}(T) = \underbrace{E\!\left[ \sum_{t=1}^{T} r_{t,a_t^*} \right]}_{\text{the best we could do if we knew all } \theta_a} - \underbrace{E\!\left[ \sum_{t=1}^{T} r_{t,a_t} \right]}_{\text{achieved by the algorithm}}$

An algorithm "learns" if $\mathrm{Regret}(T) = O(T^\alpha)$ with $\alpha < 1$.
An algorithm "learns fast" if $\alpha$ is small.
Regret Bounds

• LinUCB [CLRS'11]: $O\!\left( \sqrt{KdT} \right)$, with a matching lower bound.
  Average reward converges to the optimum at rate $O\!\left( \sqrt{Kd/T} \right)$.
  Example: $K = 20$, $d = 50$, $T = 10\mathrm{M}$, so $\sqrt{Kd/T} = 0.01$.
• Generalized LinUCB: still open.
  A variant [FCGSz'11]: $O\!\left( d\sqrt{T} \right)$.
• Thompson sampling: still open in general.
  A variant [L'12]: $O\!\left( K^{1/3} T^{2/3} \right)$.
Outline
Introduction
Basic Solutions
Advanced algorithms
Advanced Offline Evaluation
› Importance weighting
› Doubly robust technique
Conclusions
Extensions

• Uniformly random data sometimes are a luxury…
  › System/cost constraints, user experience considerations, …
• A randomized log suffices (by importance weighting)
• Variance reduction with the "doubly robust" technique [DLL'11]
• Better bias/variance tradeoff by soft rejection sampling [DDLL'12]
Offline Evaluation with Non-Uniform Data

Key idea: importance reweighting.

$V(\pi) = E_{(x,r)\sim D}\!\left[ r_{\pi(x)} \right] = E_{(x,r)\sim D}\!\left[ \sum_a r_a \cdot I(\pi(x) = a) \right] = E_{(x,r)\sim D}\!\left[ \sum_a \frac{r_a \cdot I(\pi(x) = a)}{p(a \mid x)}\, p(a \mid x) \right] = E_{(x,r)\sim D,\, a \sim p}\!\left[ \frac{r_a \cdot I(\pi(x) = a)}{p(a \mid x)} \right]$

We can use a weighted empirical average with an estimated $\hat p(a \mid x)$:

$\hat V = \frac{1}{|S|} \sum_{(x,a,r_a) \in S} \frac{r_a \cdot I(\pi(x) = a)}{\max\{\hat p(a \mid x),\, \tau\}} \approx V(\pi)$

where $\tau$ controls the bias/variance trade-off [SLLK 2011].
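A sketch of the clipped importance-weighting estimator; the log holds (x, a, r, p̂) tuples with p̂ the estimated probability that the logging policy chose arm a:

```python
def ips_evaluate(policy, logged, tau=0.01):
    """Importance-weighted estimate of V(pi) from a randomized,
    non-uniform log; tau clips small propensities (bias vs. variance)."""
    total = 0.0
    for x, a, r, p_hat in logged:
        if policy(x) == a:                    # I(pi(x) = a)
            total += r / max(p_hat, tau)      # reweight by 1 / propensity
    return total / len(logged)
```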
Results in Today Module Data [SLLK’11]
Outline
Introduction
Basic Solutions
Advanced algorithms
Advanced Offline Evaluation
› Importance weighting
› Doubly robust technique
Conclusions
Doubly Robust Estimation

Importance-weighted formula:

$V(\pi) = E_{(x,r)\sim D,\, a\sim p}\!\left[ \frac{r_a \cdot I(\pi(x) = a)}{p(a \mid x)} \right] \approx \frac{1}{|S|} \sum_{(x,a,r_a)\in S} \frac{r_a \cdot I(\pi(x) = a)}{\max\{\hat p(a \mid x),\, \tau\}}$

Estimation has high variance if $p(a \mid x)$ is small.

Doubly robust technique:

$\hat V_{DR} = \frac{1}{|S|} \sum_{(x,a,r_a)\in S} \left[ \frac{\left( r_a - \hat r_a \right) \cdot I(\pi(x) = a)}{\max\{\hat p(a \mid x),\, \tau\}} + \hat r_{\pi(x)} \right]$

Unbiased if $\hat r$ or $\hat p$ is correct. The DR estimate usually decreases variance [DLL'11].
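A sketch of the DR estimator; `r_hat(x, a)` is the (possibly imperfect) reward model, and the rest follows the formula above:

```python
def dr_evaluate(policy, logged, r_hat, tau=0.01):
    """Doubly robust estimate: the reward model's prediction for the arm
    pi would choose, corrected by an importance-weighted residual."""
    total = 0.0
    for x, a, r, p_hat in logged:
        pi_a = policy(x)
        correction = 0.0
        if pi_a == a:                         # matched event: correct the model
            correction = (r - r_hat(x, a)) / max(p_hat, tau)
        total += r_hat(x, pi_a) + correction
    return total / len(logged)
```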
Multiclass Classification

K-class classification as a K-armed bandit:

$(x, c) \;\Rightarrow\; (x, r_1, r_2, \ldots, r_K) \quad \text{where } r_a = \begin{cases} 0 & \text{if } a = c \\ 1 & \text{otherwise} \end{cases}$

Training data:
› In the usual (non-bandit) setting, $D = \{x_i, c_i\}_{i=1,\ldots,m}$
› In the bandit setting, $D = \{x_i, a_i, p_i, r_{i,a_i}\}_{i=1,\ldots,m}$

[Diagram: an $m \times K$ loss matrix with $r_{ij}$ in entry $(i,j)$; the usual setting observes every entry of a row, the bandit setting only the entry of the chosen arm]
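The reduction is easy to mechanize: log one uniformly random arm per labeled example and record only that arm's loss. A sketch (names are illustrative):

```python
import random

def to_bandit_log(dataset, K):
    """Turn fully labeled pairs (x, c) into bandit tuples (x, a, p, r):
    one uniformly random arm per example, with only its loss observed."""
    log = []
    for x, c in dataset:
        a = random.randrange(K)          # a_i ~ unif({0, ..., K-1})
        p = 1.0 / K                      # logging propensity
        r = 0.0 if a == c else 1.0       # the slide's loss: 0 iff correct
        log.append((x, a, p, r))
    return log
```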
Experimental Results on UCI Datasets

Split data 50/50 for training (fully labeled) and testing (partially labeled).
Train $\pi$ on the training data, evaluate $\pi$ on the test data.
Repeated 500 times.
Outline
Introduction
Basic Solutions
Advanced algorithms
Advanced Offline Evaluation
Conclusions
Conclusions

• Contextual bandits as a principled formulation for
  • News article recommendation
  • Internet advertising
  • Web search
  • ...
• An offline evaluation method for bandit algorithms
  • unbiased
  • accurate compared to online bucket results
• Encouraging results in significant applications
  • strong performance of UCB/TS exploration
Future Work

• Offline evaluation
  • Better use of non-uniform data
  • Extension to full reinforcement learning
  • Use of prior knowledge
• Variants of bandits
  • Bandits with budgets
  • Bandits with many arms
  • Bandits with multiple objectives
  • Bandits with submodular rewards
  • Bandits with delayed reward observations
  • …
References

Offline policy evaluation:
• [LCLW'11] Unbiased offline evaluation of contextual-bandit-based news article recommendation algorithms. WSDM, 2011.
• [SLLK'11] Learning from logged implicit exploration data. NIPS, 2010.
• [DLL'11] Doubly robust policy evaluation and learning. ICML, 2011.
• [DDLL'12] Sample-efficient nonstationary-policy evaluation for contextual bandits. Under review.

Bandit algorithms:
• [LCLS'10] A contextual-bandit approach to personalized news article recommendation. WWW, 2010.
• [CLRS'11] Contextual bandits with linear payoff functions. AISTATS, 2011.
• [BLLRS'11] Contextual bandit algorithms with supervised learning guarantees. AISTATS, 2011.
• [CL'12] An empirical evaluation of Thompson sampling. NIPS, 2011.
• [LCLMW'12] Unbiased offline evaluation of contextual bandit algorithms with generalized linear models. JMLR W&CP, 2012.
Thank You!