CS 4700: Foundations of
Artificial Intelligence
Spring 2020 Prof. Haym Hirsh
Lecture 22
April 6, 2020
Welcome!
Backup plans:
If I leave the meeting (check the participants window)
(example: my machine crashes)
Take a five-minute break
If I'm not back on by then, check Piazza
Backup plans:
Zoombombing/trolling:
I will try to fix it (kick the user out)
If I end the meeting, please check Piazza for the URL to restart
Rules:
Please stay muted
Please use video!
(but be dressed appropriately)
Rules:
You are encouraged to ask and answer questions
using Zoom’s chat window
Rules:
You are encouraged to ask questions by audio/video
Please post QUESTION in the chat window
Please wait until I call on you
Other announcements:
Please fill in the survey on Canvas
(It lets me plan, and lets you try out the quiz format)
Other announcements:
S/U grading
Other announcements:
Homework 3
Due: Wed Apr 8 11:59pm
Other announcements:
No prelim/final
Quizzes on Tuesdays and Thursdays:
Tuesdays: previous week's material
Thursdays: a topic from the first half of the semester
First quiz: Thu Apr 9 12:00pm; topic: uninformed search
24-hour window for submission; further details forthcoming
Other announcements:
Office hours on website
(including 2/3am EDT most days)
My office hours:
Mondays 3-4pm EDT
and by arrangement (email [email protected])
But not this week; this week's hours TBA
Other announcements:
Karma Lectures:
Resume tomorrow, 11:40am
Further details later today
Academic integrity
You know right from wrong
Ask me if there’s something on the boundary
Today: “Multi-Armed Bandits”
(A special type of MDP; Section 17.3 of the textbook)
Multi-Armed Bandit

[Diagram: a row of n slot machines; pulling arm i yields reward Ri]

R1 R2 R3 R4 … Rn

Ri: generated stochastically; we don't know Ri
Example: Ri = 1 with probability pi, otherwise 0; we don't know pi

Which arm do I pull?
Which arms do I pull to figure out which arm to pull?
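The Bernoulli example above can be simulated directly. A minimal sketch in Python (the probabilities 0.2/0.5/0.7 are made up for illustration; the player is never shown them):

```python
import random

class BernoulliBandit:
    """n-armed bandit: arm i pays 1 with probability p[i], else 0.
    The agent never sees p; it only observes rewards."""
    def __init__(self, p, seed=0):
        self.p = list(p)               # hidden success probabilities
        self.rng = random.Random(seed)

    @property
    def n_arms(self):
        return len(self.p)

    def pull(self, i):
        # Ri is generated stochastically on every pull
        return 1 if self.rng.random() < self.p[i] else 0

# Made-up arm probabilities, unknown to the player
bandit = BernoulliBandit([0.2, 0.5, 0.7])
rewards = [bandit.pull(2) for _ in range(1000)]
print(sum(rewards) / len(rewards))  # close to p[2] = 0.7 on average
```

Repeatedly pulling one arm lets you estimate its pi from the empirical average, which is exactly the information the strategies below have to work with.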
Multi-Armed Bandit

[Diagram: machines M1 M2 M3 M4 … Mn; action ai pulls the arm of Mi, yielding reward Ri]

What strategy do I use to pick a sequence of ai?
View as a Single-State MDP

[Diagram: one state s with self-loop actions a1, a2, …, an]

R(s, ai, s) = R(ai)
P(s | s, ai) = 1.0
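The single-state view can be written down as data. A minimal sketch (the state name "s", the action names, and the payout probabilities are all illustrative):

```python
import random

rng = random.Random(1)

# One state "s"; every action ai loops back to s with probability 1.0.
actions = ["a1", "a2", "a3"]
P = {("s", a): [("s", 1.0)] for a in actions}   # P(s | s, ai) = 1.0

# R(s, ai, s) = R(ai): the reward depends only on the action taken.
# Bernoulli payouts with made-up probabilities, hidden from the agent.
p = {"a1": 0.2, "a2": 0.5, "a3": 0.7}
def R(a):
    return 1 if rng.random() < p[a] else 0

# One step: the state never changes; only the reward is random.
(next_state, prob), = P[("s", "a3")]
reward = R("a3")
print(next_state, prob, reward)
```

Because there is only one state, "solving" this MDP reduces to learning which action has the highest expected reward, which is the bandit problem.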
Multi-Armed Bandits: Upper Confidence Bound (UCB) Heuristic
• If you knew each Ri correctly: pick the arm with the largest Ri
• Intuition: use the arm with the best observed average reward
  Pick argmaxi R̄i
• Problem: may not have enough data for R̄i to be accurate
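That last bullet is why pure greed on R̄i fails: after one unlucky pull, an arm's average can look bad and the arm may never be tried again. A small illustrative simulation (the arm probabilities 0.5 and 0.7 are made up):

```python
import random

rng = random.Random(7)
p = [0.5, 0.7]            # hidden; arm 1 is actually the better arm
counts = [0, 0]           # Ni: pulls of arm i so far
sums = [0.0, 0.0]         # total reward from arm i so far

def pull(i):
    return 1 if rng.random() < p[i] else 0

# Try each arm once, then always play the best observed average (greedy).
for i in range(2):
    counts[i] += 1
    sums[i] += pull(i)

for _ in range(998):
    i = max(range(2), key=lambda j: sums[j] / counts[j])  # argmax over R̄i
    counts[i] += 1
    sums[i] += pull(i)

print(counts)  # an early unlucky pull can leave the better arm starved
```

Depending on the first two coin flips, the greedy player can spend almost all 1000 pulls on the worse arm; UCB's exploration bonus is designed to prevent exactly this.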
Multi-Armed Bandits: Upper Confidence Bound (UCB) Heuristic
• Key idea: be optimistic about the outcome of each arm
• (But not wildly optimistic like Q-learning)
• Instead of using the observed average value, go up one standard deviation
Multi-Armed Bandits: Upper Confidence Bound (UCB) Heuristic
• Pick the arm with the largest UCB(Mi) instead of R̄i

  UCB(Mi) = R̄i + √( g(N) / Ni )

where
  R̄i = average reward for arm i so far
  N = total number of pulls made so far
  Ni = total number of pulls of Mi so far
  g(N) = 2 log (1 + N log² N)  <textbook>
  g(N) = c ln N  <common>

g(N) should grow more slowly than Ni
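Putting the formula into code: a minimal UCB sketch using the "common" choice g(N) = c ln N (the arm probabilities and the value c = 2 are illustrative; the textbook's g(N) = 2 log(1 + N log² N) can be swapped into g):

```python
import math
import random

rng = random.Random(0)
p = [0.2, 0.5, 0.7]       # hidden Bernoulli payout probabilities
n = len(p)
counts = [0] * n          # Ni: pulls of arm i so far
sums = [0.0] * n          # total reward from arm i so far

def pull(i):
    return 1 if rng.random() < p[i] else 0

def g(N, c=2.0):
    # "common" choice; the textbook's 2*math.log(1 + N*math.log(N)**2) also works
    return c * math.log(N)

T = 5000
for t in range(1, T + 1):
    if t <= n:            # pull each arm once so every Ni > 0
        i = t - 1
    else:
        # UCB(Mi) = R̄i + sqrt(g(N) / Ni), with N = t pulls so far
        i = max(range(n),
                key=lambda j: sums[j] / counts[j] + math.sqrt(g(t) / counts[j]))
    counts[i] += 1
    sums[i] += pull(i)

print(counts)  # the best arm (index 2) should accumulate most of the pulls
```

Note how the bonus √(g(N)/Ni) shrinks as an arm is pulled more (Ni grows faster than g(N)), so rarely tried arms keep a large optimistic bonus while well-sampled arms are judged mostly by their average.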