CS 4700: Foundations of
Artificial Intelligence
Spring 2020 Prof. Haym Hirsh
Lecture 22
April 6, 2020
Welcome!
Backup plans:
If I leave the meeting (check the participants window)
(example: my machine crashes)
Take a five-minute break
If I'm not back on by then, check Piazza
Backup plans:
Zoombombing/trolling:
I will try to fix it (kick the user out)
If I end the meeting, please check Piazza for the URL to restart
Rules:
Please stay muted
Please use video!
(but be dressed appropriately)
Rules:
You are encouraged to ask and answer questions
using Zoom’s chat window
Rules:
You are encouraged to ask questions by audio/video
Please post QUESTION in the chat window
Please wait until I call on you
Other announcements:
Please fill in the survey on Canvas
(It lets me plan, and lets you try out the quiz format)
Other announcements:
S/U grading
Other announcements:
Homework 3
Due: Wed Apr 8 11:59pm
Other announcements:
No prelim/final
Quizzes on Tuesdays and Thursdays:
Tuesdays: previous week's material
Thursdays: a topic from the first half of the semester
First quiz: Thu Apr 9 12:00pm; topic: uninformed search
24-hour window for submission; further details forthcoming
Other announcements:
Office hours on website
(including 2/3am EDT most days)
My office hours:
Mondays 3-4pm EDT
and by arrangement (email [email protected])
But not this week; this week's hours TBA
Other announcements:
Karma Lectures:
Resume tomorrow, 11:40am
Further details later today
Academic integrity
You know right from wrong
Ask me if there’s something on the boundary
Today: “Multi-Armed Bandits”
(A special type of MDP; Section 17.3 of the textbook)
Multi-Armed Bandit

[Diagram: a row of n slot machines; pulling arm i yields reward Ri]

R1 R2 R3 R4 … Rn

Ri: generated stochastically; we don't know Ri
Example: Ri = 1 with probability pi, otherwise 0; we don't know pi

Which arm do I pull?
Which arms do I pull to figure out which arm to pull?
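The Bernoulli example above can be simulated directly. A minimal sketch in Python (the probabilities 0.2/0.5/0.7 are made up for illustration; the player is never shown them):

```python
import random

class BernoulliBandit:
    """n-armed bandit: arm i pays 1 with probability p[i], else 0.
    The agent never sees p; it only observes rewards."""
    def __init__(self, p, seed=0):
        self.p = list(p)               # hidden success probabilities
        self.rng = random.Random(seed)

    @property
    def n_arms(self):
        return len(self.p)

    def pull(self, i):
        # Ri is generated stochastically on every pull
        return 1 if self.rng.random() < self.p[i] else 0

# Made-up arm probabilities, unknown to the player
bandit = BernoulliBandit([0.2, 0.5, 0.7])
rewards = [bandit.pull(2) for _ in range(1000)]
print(sum(rewards) / len(rewards))  # close to p[2] = 0.7 on average
```

Repeatedly pulling one arm lets you estimate its pi from the empirical average, which is exactly the information the strategies below have to work with.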
Multi-Armed Bandit

[Diagram: machines M1 M2 M3 M4 … Mn; action ai pulls the arm of Mi, yielding reward Ri]

What strategy do I use to pick a sequence of ai?
View as a Single-State MDP

[Diagram: one state s with self-loop actions a1, a2, …, an]

R(s, ai, s) = R(ai)
P(s | s, ai) = 1.0
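The single-state view can be written down as data. A minimal sketch (the state name "s", the action names, and the payout probabilities are all illustrative):

```python
import random

rng = random.Random(1)

# One state "s"; every action ai loops back to s with probability 1.0.
actions = ["a1", "a2", "a3"]
P = {("s", a): [("s", 1.0)] for a in actions}   # P(s | s, ai) = 1.0

# R(s, ai, s) = R(ai): the reward depends only on the action taken.
# Bernoulli payouts with made-up probabilities, hidden from the agent.
p = {"a1": 0.2, "a2": 0.5, "a3": 0.7}
def R(a):
    return 1 if rng.random() < p[a] else 0

# One step: the state never changes; only the reward is random.
(next_state, prob), = P[("s", "a3")]
reward = R("a3")
print(next_state, prob, reward)
```

Because there is only one state, "solving" this MDP reduces to learning which action has the highest expected reward, which is the bandit problem.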
Multi-Armed Bandits: Upper Confidence Bound (UCB) Heuristic
• If you knew each Ri correctly: pick the arm with the largest Ri
• Intuition: use the arm with the best observed average reward
  Pick argmaxi R̄i
• Problem: may not have enough data for R̄i to be accurate
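That last bullet is why pure greed on R̄i fails: after one unlucky pull, an arm's average can look bad and the arm may never be tried again. A small illustrative simulation (the arm probabilities 0.5 and 0.7 are made up):

```python
import random

rng = random.Random(7)
p = [0.5, 0.7]            # hidden; arm 1 is actually the better arm
counts = [0, 0]           # Ni: pulls of arm i so far
sums = [0.0, 0.0]         # total reward from arm i so far

def pull(i):
    return 1 if rng.random() < p[i] else 0

# Try each arm once, then always play the best observed average (greedy).
for i in range(2):
    counts[i] += 1
    sums[i] += pull(i)

for _ in range(998):
    i = max(range(2), key=lambda j: sums[j] / counts[j])  # argmax over R̄i
    counts[i] += 1
    sums[i] += pull(i)

print(counts)  # an early unlucky pull can leave the better arm starved
```

Depending on the first two coin flips, the greedy player can spend almost all 1000 pulls on the worse arm; UCB's exploration bonus is designed to prevent exactly this.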
Multi-Armed Bandits: Upper Confidence Bound (UCB) Heuristic
• Key idea: be optimistic about the outcome of each arm
• (But not wildly optimistic like Q-learning)
• Instead of using the observed average value, go up one standard deviation
Multi-Armed Bandits: Upper Confidence Bound (UCB) Heuristic
• Pick the arm with the largest UCB(Mi) instead of R̄i

  UCB(Mi) = R̄i + √( g(N) / Ni )

where
  R̄i = average reward for arm i so far
  N = total number of pulls made so far
  Ni = total number of pulls of Mi so far
  g(N) = 2 log (1 + N log² N)  <textbook>
  g(N) = c ln N  <common>

g(N) should grow more slowly than Ni
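Putting the formula into code: a minimal UCB sketch using the "common" choice g(N) = c ln N (the arm probabilities and the value c = 2 are illustrative; the textbook's g(N) = 2 log(1 + N log² N) can be swapped into g):

```python
import math
import random

rng = random.Random(0)
p = [0.2, 0.5, 0.7]       # hidden Bernoulli payout probabilities
n = len(p)
counts = [0] * n          # Ni: pulls of arm i so far
sums = [0.0] * n          # total reward from arm i so far

def pull(i):
    return 1 if rng.random() < p[i] else 0

def g(N, c=2.0):
    # "common" choice; the textbook's 2*math.log(1 + N*math.log(N)**2) also works
    return c * math.log(N)

T = 5000
for t in range(1, T + 1):
    if t <= n:            # pull each arm once so every Ni > 0
        i = t - 1
    else:
        # UCB(Mi) = R̄i + sqrt(g(N) / Ni), with N = t pulls so far
        i = max(range(n),
                key=lambda j: sums[j] / counts[j] + math.sqrt(g(t) / counts[j]))
    counts[i] += 1
    sums[i] += pull(i)

print(counts)  # the best arm (index 2) should accumulate most of the pulls
```

Note how the bonus √(g(N)/Ni) shrinks as an arm is pulled more (Ni grows faster than g(N)), so rarely tried arms keep a large optimistic bonus while well-sampled arms are judged mostly by their average.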