Reinforcement Learning
Evaluative Feedback and Bandit Problems
Subramanian Ramamoorthy, School of Informatics
20 January 2012
Recap: What is Reinforcement Learning?
• An approach to Artificial Intelligence
• Learning from interaction
• Goal-oriented learning
• Learning about, from, and while interacting with an external environment
• Learning what to do—how to map situations to actions—so as to maximize a numerical reward signal
• Can be thought of as a stochastic optimization over time
Recap: The Setup for RL
Agent is:
• Temporally situated
• Continual learning and planning
• Objective is to affect the environment – actions and states
• Environment is uncertain, stochastic
[Diagram: the agent chooses actions; the environment returns states and rewards]
Recap: Key Features of RL
• Learner is not told which actions to take
• Trial-and-error search
• Possibility of delayed reward
  – Sacrifice short-term gains for greater long-term gains
• The need to explore and exploit
• Consider the whole problem of a goal-directed agent interacting with an uncertain environment
Multi-arm Bandits (MAB)
• N possible actions
• You can play for some period of time and you want to maximize reward (expected utility)

Which is the best arm/machine?
DEMO
Real-Life Version
• Choose the best content to display to the next visitor of your commercial website
• Content options = slot machines
• Reward = user's response (e.g., click on an ad)
• Also, clinical trials: arm = treatment, reward = patient cured
• Simplifying assumption: no context (no visitor profiles). In practice, we want to solve contextual bandit problems, but that is for later discussion.
What is the Choice?
n-Armed Bandit Problem
• Choose repeatedly from one of n actions; each choice is called a play
• After each play a_t, you get a reward r_t, where

    E[ r_t | a_t = a ] = Q*(a)

  These are the unknown action values; the distribution of r_t depends only on a_t
• Objective is to maximize the reward in the long term, e.g., over 1000 plays
• To solve the n-armed bandit problem, you must explore a variety of actions and exploit the best of them (a minimal simulation sketch follows)
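As a concrete reference, here is a minimal Python sketch of this setup. The class name `GaussianBandit` and the Gaussian reward model are my own illustrative choices (they match the testbed used later in the lecture), not something prescribed by the problem statement:

```python
import numpy as np

class GaussianBandit:
    """n-armed bandit: each arm a has an unknown value Q*(a); rewards are noisy."""
    def __init__(self, n=10, seed=0):
        self.rng = np.random.default_rng(seed)
        self.q_star = self.rng.normal(0.0, 1.0, n)   # unknown action values Q*(a)

    def pull(self, a):
        # the reward distribution depends only on the chosen arm a
        return self.rng.normal(self.q_star[a], 1.0)
```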
Exploration/Exploitation Dilemma
• Suppose you form action-value estimates Q_t(a) ≈ Q*(a)
• The greedy action at time t is a_t* = argmax_a Q_t(a)
    a_t = a_t*   →  exploitation
    a_t ≠ a_t*   →  exploration
• You can't exploit all the time; you can't explore all the time
• You can never stop exploring, but you could reduce exploring (why?)
Action-Value Methods
• Methods that adapt action-value estimates and nothing else. Suppose that by the t-th play, action a has been chosen k_a times, producing rewards r_1, r_2, …, r_{k_a}; then the "sample average" estimate is

    Q_t(a) = ( r_1 + r_2 + … + r_{k_a} ) / k_a

  and, as k_a → ∞,  Q_t(a) → Q*(a)
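A direct, minimal sketch of the sample-average estimate (plain Python/numpy; the variable names and reward values are made up for illustration):

```python
import numpy as np

rewards_for_a = [1.2, 0.7, 1.5]        # r_1 ... r_{k_a} observed so far for action a
Q_a = float(np.mean(rewards_for_a))    # sample-average estimate Q_t(a)
# As k_a grows, Q_t(a) converges to Q*(a) (law of large numbers).
```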
Remark
• The simple greedy action selection strategy: always play a_t = argmax_a Q_t(a)
• Why might this be insufficient?
• You are estimating, online, from a few samples. How will this behave?
DEMO
ε-Greedy Action Selection

• Greedy action selection:
    a_t = a_t* = argmax_a Q_t(a)
• ε-Greedy:
    a_t = a_t*             with probability 1 − ε
    a_t = random action    with probability ε
. . . the simplest way to balance exploration and exploitation
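A minimal sketch of ε-greedy selection in Python (the function name and the numpy random generator are my choices, not from the slides):

```python
import numpy as np

def epsilon_greedy(Q, epsilon, rng):
    """Pick argmax_a Q(a) with probability 1 - epsilon, a random action otherwise."""
    if rng.random() < epsilon:
        return int(rng.integers(len(Q)))   # explore: uniformly random action
    return int(np.argmax(Q))               # exploit: the greedy action a_t*

# example: a = epsilon_greedy(Q, epsilon=0.1, rng=np.random.default_rng(0))
```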
Worked Example: 10-Armed Testbed
• n = 10 possible actions
• Each Q*(a) is chosen randomly from a normal distribution: Q*(a) ~ N(0, 1)
• Each reward r_t is also normal: r_t ~ N(Q*(a_t), 1)
• 1000 plays; repeat the whole thing 2000 times and average the results (a simulation sketch follows)
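A sketch of the testbed experiment in Python, under the assumptions just stated (sample-average estimates, ε-greedy selection; the function name and structure are mine):

```python
import numpy as np

def run_testbed(epsilon, n_arms=10, plays=1000, runs=2000, seed=0):
    """Average reward per play, averaged over independently drawn bandit problems."""
    rng = np.random.default_rng(seed)
    avg_reward = np.zeros(plays)
    for _ in range(runs):
        q_star = rng.normal(0.0, 1.0, n_arms)     # Q*(a) ~ N(0, 1)
        Q = np.zeros(n_arms)                       # sample-average estimates
        N = np.zeros(n_arms)                       # pull counts
        for t in range(plays):
            if rng.random() < epsilon:
                a = int(rng.integers(n_arms))      # explore
            else:
                a = int(np.argmax(Q))              # exploit
            r = rng.normal(q_star[a], 1.0)         # r_t ~ N(Q*(a_t), 1)
            N[a] += 1
            Q[a] += (r - Q[a]) / N[a]              # incremental sample average
            avg_reward[t] += r
    return avg_reward / runs

# e.g., compare run_testbed(0.0), run_testbed(0.01), run_testbed(0.1)
```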
ε-Greedy Methods on the 10-Armed Testbed
Softmax Action Selection
• Softmax action selection methods grade action probabilities by estimated values.
• The most common softmax uses a Gibbs, or Boltzmann, distribution:
  Choose action a on play t with probability

      e^{ Q_t(a) / τ }  /  Σ_{b=1}^{n} e^{ Q_t(b) / τ }

  where τ is a 'computational temperature'
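A minimal softmax (Gibbs/Boltzmann) selection sketch in Python; the max-subtraction for numerical stability is my addition and does not change the resulting distribution:

```python
import numpy as np

def softmax_action(Q, tau, rng):
    """Sample an action with P(a) proportional to exp(Q(a) / tau)."""
    prefs = np.asarray(Q, dtype=float) / tau
    prefs -= prefs.max()                 # numerical stability; same probabilities
    p = np.exp(prefs)
    p /= p.sum()
    return int(rng.choice(len(Q), p=p))

# small tau -> nearly greedy; large tau -> nearly uniform random
```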
Incremental Implementation
The average of the first k rewards is (dropping the dependence on a):

    Q_k = ( r_1 + r_2 + … + r_k ) / k        (sample-average estimation method)

How to do this incrementally (without storing all the rewards)?
We could keep a running sum and count, or, equivalently:

    Q_{k+1} = Q_k + (1 / (k + 1)) [ r_{k+1} − Q_k ]

This is the common form: NewEstimate = OldEstimate + StepSize [ Target − OldEstimate ]
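A sketch of the incremental update in Python, with a small check that it matches the stored-rewards average (the reward sequence is made up):

```python
rewards = [1.0, 0.0, 2.0, 1.0]   # made-up reward sequence

Q, k = 0.0, 0
for r in rewards:
    Q = Q + (r - Q) / (k + 1)    # incremental update; no rewards are stored
    k += 1

assert abs(Q - sum(rewards) / len(rewards)) < 1e-12   # same as the sample average
```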
Tracking a Nonstationary Problem
Choosing Q_k to be a sample average is appropriate in a stationary problem, i.e., when none of the Q*(a) change over time, but not in a nonstationary problem.

Better in the nonstationary case is:

    Q_{k+1} = Q_k + α [ r_{k+1} − Q_k ]        for constant α, 0 < α ≤ 1

            = (1 − α)^k Q_0 + Σ_{i=1}^{k} α (1 − α)^{k−i} r_i

an exponential, recency-weighted average.
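A sketch of the constant-step-size update, with a numerical check of the recency-weighted-average form (the values are made up):

```python
import numpy as np

alpha, Q0 = 0.1, 0.0
rewards = np.array([1.0, 0.0, 2.0, 1.0])     # made-up reward sequence

Q = Q0
for r in rewards:
    Q = Q + alpha * (r - Q)                  # constant step size alpha

# Same quantity written as an exponential, recency-weighted average:
k = len(rewards)
weights = alpha * (1 - alpha) ** (k - np.arange(1, k + 1))
assert np.isclose(Q, (1 - alpha) ** k * Q0 + weights @ rewards)
```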
Optimistic Initial Values
• All methods so far depend on Q_0(a), i.e., they are biased
• Encourage exploration: initialize the action values optimistically, e.g., on the 10-armed testbed, use Q_0(a) = 5 for all a (see the sketch below)
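The change is only in the initialization; a minimal sketch, assuming the sample-average or constant-α updates from the previous slides:

```python
import numpy as np

n_arms = 10
Q = np.full(n_arms, 5.0)   # optimistic Q_0(a) = 5, well above the true Q*(a)
N = np.zeros(n_arms)
# Even purely greedy selection now explores: every arm looks good until it is
# tried and its estimate is pulled down toward Q*(a) by the updates.
```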
Beyond Counting…
An Interpretation of MAB Type Problems
Related to‘rewards’
MAB is a Special Case of Online Learning
How to Evaluate Online Alg.: Regret
• After you have played for T rounds, you experience a regret:
    Regret = [reward sum of the optimal strategy] − [sum of actually collected rewards]
• If the average regret per round goes to zero with probability 1, asymptotically, we say the strategy has the no-regret property
    ~ guaranteed to converge to an optimal strategy
• ε-greedy is sub-optimal (so has some regret). Why?
    ρ_T = T μ*  −  E[ Σ_{t=1}^{T} r̂_t ],        where  μ* = max_k μ_k

(the expectation is over randomness in the draw of rewards and in the player's strategy)
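A small sketch of computing this (expected) regret from known arm means, purely for illustration; the arm means and the played sequence below are made up:

```python
import numpy as np

def expected_regret(mu, actions):
    """T * max_k mu_k minus the expected reward of the arms actually played."""
    mu = np.asarray(mu, dtype=float)
    actions = np.asarray(actions)
    return len(actions) * mu.max() - mu[actions].sum()

# two arms with means 0.9 and 0.5; always playing arm 1 loses 0.4 per round
print(expected_regret([0.9, 0.5], [1, 1, 1]))   # ~1.2
```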
Interval Estimation
• Attribute to each arm an “optimistic initial estimate” within a certain confidence interval
• Greedily choose arm with highest optimistic mean (upper bound on confidence interval)
• An infrequently observed arm will have an over-valued reward mean, leading to exploration
• Frequent usage pushes the optimistic estimate toward the true value
Interval Estimation Procedure
• Associate to each arm a 100(1 − α)% upper bound on the reward mean
• Assume, e.g., rewards are normally distributed
• An arm is observed n times to yield an empirical mean μ̂ and std dev σ̂
• If α is carefully controlled, this could be made a zero-regret strategy
    – In general, we don't know the right α in advance
• The α-upper bound:
    u = μ̂ + ( σ̂ / √n ) · c⁻¹(1 − α),        where  c(t) = (1/√(2π)) ∫_{−∞}^{t} exp(−x²/2) dx

is the cumulative distribution function of the standard normal.
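A sketch of this procedure in Python; it assumes normal rewards, at least two observations per arm, and uses the standard library's NormalDist for the inverse CDF (the function names are mine):

```python
import numpy as np
from statistics import NormalDist

def upper_bound(samples, alpha=0.05):
    """100(1 - alpha)% upper confidence bound on the reward mean."""
    samples = np.asarray(samples, dtype=float)
    n = len(samples)
    mu_hat = samples.mean()
    sigma_hat = samples.std(ddof=1)           # empirical std dev (needs n >= 2)
    z = NormalDist().inv_cdf(1.0 - alpha)     # c^{-1}(1 - alpha)
    return mu_hat + z * sigma_hat / np.sqrt(n)

def interval_estimation_choice(samples_per_arm, alpha=0.05):
    # greedily pick the arm with the highest optimistic (upper-bound) estimate
    return int(np.argmax([upper_bound(s, alpha) for s in samples_per_arm]))
```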
Variant: UCB Strategy
• Again, based on notion of an upper confidence bound but more generally applicable
• Algorithm:
    – Play each arm once
    – At time t > K, play the arm j maximizing the index below
    r̂_j(t) + √( 2 ln t / T_{j,t} ),        where  T_{j,t} = number of times arm j has been played so far
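A minimal UCB sketch in Python following this rule (here r̂_j is kept as the empirical mean reward of arm j; the function name is mine):

```python
import math
import numpy as np

def ucb_choose(Q, N, t):
    """Q[j]: empirical mean reward of arm j; N[j]: times arm j was played; t: current round."""
    N = np.asarray(N, dtype=float)
    if np.any(N == 0):
        return int(np.argmin(N))               # initial phase: play each arm once
    bonus = np.sqrt(2.0 * math.log(t) / N)     # optimism: upper confidence bonus
    return int(np.argmax(np.asarray(Q) + bonus))
```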
UCB Strategy
Reminder: Chernoff-Hoeffding Bound
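For reference (the statement itself is not reproduced in the slide text, but this is the standard form being recalled): if X_1, …, X_n are i.i.d. random variables in [0, 1] with mean μ, then for any ε > 0,

    P( | (1/n) Σ_{i=1}^{n} X_i − μ | ≥ ε )  ≤  2 exp(−2 n ε²)

This is what makes UCB's confidence bonus √( 2 ln t / T_{j,t} ) hold with high probability.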
UCB Strategy – Behaviour
We will not try to prove the following result, but I quote it to tell you why UCB may be a desirable strategy: its regret is bounded.
K = number of arms
Variation on SoftMax:
• It is possible to drive regret down by annealing the temperature τ in the softmax rule
    e^{ Q_t(a) / τ }  /  Σ_{b=1}^{n} e^{ Q_t(b) / τ }
• Exp3: Exponential-weight algorithm for exploration and exploitation
• Probability of choosing arm k at time t:

    P_k(t) = (1 − γ) · w_k(t) / Σ_j w_j(t)  +  γ / K

  with weight updates

    w_j(t+1) = w_j(t) · exp( γ r̂_j(t) / (K P_j(t)) )     if arm j is pulled at t
    w_j(t+1) = w_j(t)                                     otherwise

  where r̂_j(t) is the reward observed for the pulled arm and γ is a user-defined open parameter

• Regret = O( √(T K log K) )   (a sketch follows below)
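A compact Exp3 sketch in Python under these updates; it assumes rewards in [0, 1], and `pull` is a caller-supplied function returning the reward of the chosen arm (all names are mine):

```python
import numpy as np

def exp3(K, T, pull, gamma=0.1, seed=0):
    """Exp3: exponential weights with gamma-uniform exploration; rewards in [0, 1]."""
    rng = np.random.default_rng(seed)
    w = np.ones(K)
    for t in range(T):
        p = (1.0 - gamma) * w / w.sum() + gamma / K
        k = int(rng.choice(K, p=p))
        r = pull(k)                         # observed reward of the pulled arm
        x_hat = r / p[k]                    # importance-weighted reward estimate
        w[k] *= np.exp(gamma * x_hat / K)   # only the pulled arm's weight changes
    return w

# e.g. exp3(K=3, T=1000, pull=lambda k: float(np.random.random() < [0.2, 0.5, 0.8][k]))
```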
The Gittins Index
• Each arm delivers reward with a probability
• This probability may change through time, but only when the arm is pulled
• Goal is to maximize discounted rewards – the future is discounted by an exponential discount factor β
• The structure of the problem is such that all you need to do is compute an "index" for each arm and play the one with the highest index
• The index is of the form:
    ν_i = sup_{T > 0}  E[ Σ_{t=0}^{T} β^t R_i(t) ]  /  E[ Σ_{t=0}^{T} β^t ]
Gittins Index – Intuition
• Proving optimality isn't within our scope, but it is based on the notion of a stopping time: the point where you should 'terminate' a bandit
• Nice property: the Gittins index for any given bandit is independent of the expected outcome of all other bandits
    – Once you have a good arm, keep playing until there is a better one
    – If you add/remove machines, the computation doesn't really change
• BUT:
    – hard to compute, even when you know the distributions
    – exploration issues; an arm isn't updated unless it is used (restless bandits?)
Numerous Applications!
Extending the MAB Model
• In this lecture, we are in a single casino and the only decision is to pull from a set of n arms – except perhaps in the very last slides, there is exactly one state!

Next:
• What if there is more than one state?
• In this state space, what is the effect of the payout distribution changing based on how you pull arms?
• What happens if you only obtain a net reward corresponding to a long sequence of arm pulls (at the end)?
Acknowledgements
• Many slides are adapted from web resources associated with Sutton and Barto’s Reinforcement Learning book