Theory, An Algorithm, Applications

Rob Schapire

• machine learning: studies automatic algorithms for learning to be better through observation and experience, e.g.:

  • learn from training examples to detect faces in images
  • sequentially predict what decision a person will make next (e.g., what link a user will click on)
  • learn to drive toy car by observing how human does it

• very diverse!

• game theory: studies interactions between all kinds of “players”

• natural connections:
  • learning often involves interaction
  • games often most interesting when learning is involved

• this talk:
  • general algorithm for learning to play repeated games
  • can apply to handle all of the learning problems above
  • reveals underlying connections between seemingly unrelated learning problems


• how to play repeated games: theory and an algorithm

• applications:

  • boosting
  • on-line learning
  • apprenticeship learning

[with Freund]


• game defined by matrix M:

                Rock    Paper   Scissors
    Rock        1/2     1       0
    Paper       0       1/2     1
    Scissors    1       0       1/2

• row player (“Mindy”) chooses row i

• column player (“Max”) chooses column j (simultaneously)

• Mindy’s goal: minimize her loss M(i , j)

• assume (wlog) all entries in [0, 1]

• usually allow randomized play:
  • Mindy chooses distribution P over rows
  • Max chooses distribution Q over columns


• Mindy’s (expected) loss

    $\sum_{i,j} P(i)\, M(i,j)\, Q(j) = P^\top M Q \equiv M(P,Q)$

• i , j = “pure” strategies

• P, Q = “mixed” strategies

• m = # rows of M

• also write M(i, Q) and M(P, j) when one side plays pure and other plays mixed
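As a minimal sketch of the notation above, the expected loss M(P, Q) can be computed directly from the double sum. The names `expected_loss` and `pure` and the example distributions are illustrative assumptions, not part of the talk.

```python
# Rock-Paper-Scissors loss matrix for the row player (Mindy).
M = [
    [0.5, 1.0, 0.0],  # Mindy plays Rock
    [0.0, 0.5, 1.0],  # Mindy plays Paper
    [1.0, 0.0, 0.5],  # Mindy plays Scissors
]

def expected_loss(M, P, Q):
    """M(P, Q) = sum over i, j of P(i) * M(i, j) * Q(j)."""
    return sum(P[i] * M[i][j] * Q[j]
               for i in range(len(M)) for j in range(len(M[0])))

def pure(k, n):
    """Pure strategy k written as a distribution over n choices."""
    return [1.0 if i == k else 0.0 for i in range(n)]

uniform = [1 / 3, 1 / 3, 1 / 3]

# Both sides pure: M(Rock, Paper), Mindy loses.
loss_rock_vs_paper = expected_loss(M, pure(0, 3), pure(1, 3))
# Mindy mixed, Max pure: M(P, j) with P uniform and j = Rock.
loss_uniform_vs_rock = expected_loss(M, uniform, pure(0, 3))
```

A pure strategy is just a distribution with all its mass on one choice, so the same function covers M(i, j), M(i, Q), and M(P, j).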

• say Mindy plays before Max

• if Mindy chooses P then Max will pick Q to maximize M(P,Q) ⇒ loss will be

    $L(P) \equiv \max_Q M(P,Q) = \max_j M(P,j)$

[note: maximum realized at pure strategy]

• so Mindy should pick P to minimize L(P) ⇒ loss will be

    $\min_P L(P) = \min_P \max_Q M(P,Q)$



• similarly, if Max plays first, loss will be

    $\max_Q \min_P M(P,Q)$


• playing second (with knowledge of other player’s move) cannot be worse than playing first, so:

    $\underbrace{\min_P \max_Q M(P,Q)}_{\text{Mindy plays first}} \;\ge\; \underbrace{\max_Q \min_P M(P,Q)}_{\text{Mindy plays second}}$

• von Neumann’s minmax theorem:

    $\min_P \max_Q M(P,Q) = \max_Q \min_P M(P,Q)$

• in words: no advantage to playing second

• minmax theorem:

    $\min_P \max_Q M(P,Q) = \max_Q \min_P M(P,Q) = \text{value } v \text{ of game}$

• optimal strategies:
  • P∗ = $\arg\min_P \max_Q M(P,Q)$ = minmax strategy
  • Q∗ = $\arg\max_Q \min_P M(P,Q)$ = maxmin strategy

• in words:
  • Mindy’s minmax strategy P∗ guarantees loss ≤ v (regardless of Max’s play)
  • optimal because Max has maxmin strategy Q∗ that can force loss ≥ v (regardless of Mindy’s play)

• e.g.: in RPS, P∗ = Q∗ = uniform
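The RPS claim can be checked numerically. The sketch below relies on the earlier note that Max’s maximum is realized at a pure strategy, so L(P) is a maximum over columns. The helper names and the biased strategy `mostly_rock` are illustrative assumptions.

```python
M = [
    [0.5, 1.0, 0.0],  # Rock
    [0.0, 0.5, 1.0],  # Paper
    [1.0, 0.0, 0.5],  # Scissors
]

def loss_vs_column(M, P, j):
    """M(P, j): Mindy's expected loss when Max plays pure column j."""
    return sum(P[i] * M[i][j] for i in range(len(M)))

def worst_case_loss(M, P):
    """L(P) = max_j M(P, j)."""
    return max(loss_vs_column(M, P, j) for j in range(len(M[0])))

uniform = [1 / 3, 1 / 3, 1 / 3]
mostly_rock = [0.8, 0.1, 0.1]   # a predictable player

L_uniform = worst_case_loss(M, uniform)          # every column gives 1/2
L_mostly_rock = worst_case_loss(M, mostly_rock)  # Max exploits the bias with Paper
```

Under uniform play every column yields loss 1/2, the value v of RPS, while the biased strategy can be forced to a loss of 0.85.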

• solving game = finding minmax/maxmin strategies

• seems to fully answer how to play games: just compute minmax strategy (e.g., using linear programming)

• weaknesses:

  • game M may be unknown
  • game M may be extremely large
  • opponent may not be fully adversarial
    • may be possible to do better than value v

• e.g.:

    Lisa (thinks): Poor predictable Bart, always takes Rock.

    Bart (thinks): Good old Rock, nothing beats that.

• if only playing once, hopeless to overcome ignorance of game M or opponent

• but if game played repeatedly, may be possible to learn to play well

• goal: play (almost) as well as if knew game and how opponent would play ahead of time

• M unknown

• for t = 1, . . . , T:
  • Mindy chooses P_t
  • Max chooses Q_t (possibly depending on P_t)
  • Mindy’s loss = M(P_t, Q_t)
  • Mindy observes loss M(i, Q_t) of each pure strategy i

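• want:

    $\underbrace{\frac{1}{T}\sum_{t=1}^{T} M(P_t,Q_t)}_{\text{actual average loss}} \;\le\; \underbrace{\min_P \frac{1}{T}\sum_{t=1}^{T} M(P,Q_t)}_{\text{best loss (in hindsight)}} + [\text{“small amount”}]$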

• choose η > 0

• initialize: P1 = uniform

• on round t:

    $P_{t+1}(i) = \dfrac{P_t(i)\,\exp(-\eta\, M(i,Q_t))}{\text{normalization}}$

• idea: decrease weight of strategies suffering the most loss

• directly generalizes [Littlestone & Warmuth]

• other algorithms:
  • [Hannan’57]
  • [Blackwell’56]
  • [Foster & Vohra]
  • [Fudenberg & Levine]
  ...
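A single round of the MW update can be sketched as follows. The function name `mw_update`, the value η = 0.5, and the example losses are assumptions chosen for illustration. The normalization is written out explicitly so that the result is again a distribution.

```python
import math

def mw_update(P, losses, eta):
    """One MW step: P_{t+1}(i) is proportional to P_t(i) * exp(-eta * M(i, Q_t)).

    losses[i] plays the role of M(i, Q_t) and is assumed to lie in [0, 1].
    """
    weights = [p * math.exp(-eta * loss) for p, loss in zip(P, losses)]
    Z = sum(weights)  # normalization so the new weights sum to 1
    return [w / Z for w in weights]

# Example: Max plays pure "Paper" in Rock-Paper-Scissors, so the observed
# losses of Mindy's rows (Rock, Paper, Scissors) are 1, 1/2, 0.
P1 = [1 / 3, 1 / 3, 1 / 3]
P2 = mw_update(P1, [1.0, 0.5, 0.0], eta=0.5)
```

After the update, Rock (which lost) carries the least weight and Scissors (which won) the most, matching the idea of decreasing the weight of strategies that suffer the most loss.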


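• Theorem: can choose η so that, for any game M with m rows, and any opponent,

    $\underbrace{\frac{1}{T}\sum_{t=1}^{T} M(P_t,Q_t)}_{\text{actual average loss}} \;\le\; \underbrace{\min_P \frac{1}{T}\sum_{t=1}^{T} M(P,Q_t)}_{\text{best average loss } (\le v)} + \Delta_T$

  where $\Delta_T = O\!\left(\sqrt{\frac{\ln m}{T}}\right) \to 0$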

• regret ∆T is:
  • logarithmic in # rows m
  • independent of # columns

• running time (of MW) is:
  • linear in # rows m
  • independent of # columns

• therefore, can use when working with very large games
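The regret guarantee can be checked empirically on a small game. In the sketch below, the tuning η = sqrt(8 ln m / T) and the explicit bound sqrt(ln m / (2T)) are assumptions taken from the standard analysis of exponential weights, not from the talk. The opponent is a hypothetical one that favors Rock.

```python
import math
import random

def run_mw(M, columns, eta):
    """Play MW against a fixed sequence of pure columns.

    Returns (average loss of MW, best average loss of a fixed row in hindsight).
    """
    m, T = len(M), len(columns)
    P = [1.0 / m] * m
    total_loss = 0.0
    cumulative = [0.0] * m  # cumulative loss of each pure row
    for j in columns:
        losses = [M[i][j] for i in range(m)]
        total_loss += sum(p * l for p, l in zip(P, losses))
        cumulative = [c + l for c, l in zip(cumulative, losses)]
        weights = [p * math.exp(-eta * l) for p, l in zip(P, losses)]
        Z = sum(weights)
        P = [w / Z for w in weights]
    # The minimum over mixed P of a linear function is attained at a pure row.
    return total_loss / T, min(cumulative) / T

M = [[0.5, 1.0, 0.0], [0.0, 0.5, 1.0], [1.0, 0.0, 0.5]]
T = 2000
rng = random.Random(0)
# Hypothetical opponent: Rock 60% of the time, otherwise Paper or Scissors.
columns = [rng.choices([0, 1, 2], weights=[0.6, 0.2, 0.2])[0] for _ in range(T)]

eta = math.sqrt(8 * math.log(len(M)) / T)      # assumed tuning
avg_loss, best_avg_loss = run_mw(M, columns, eta)
regret = avg_loss - best_avg_loss
bound = math.sqrt(math.log(len(M)) / (2 * T))  # assumed explicit form of Delta_T
```

Each round costs time linear in the number of rows and touches only the one column that was played, which is why the number of columns does not enter the running time.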

• want to prove:

    $\min_P \max_Q M(P,Q) \le \max_Q \min_P M(P,Q)$

(already argued “≥”)

• game M played repeatedly

• Mindy plays using MW

• on round t, Max chooses best response:

    $Q_t = \arg\max_Q M(P_t,Q)$

[note: Qt can be pure]

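• let

    $\bar{P} = \frac{1}{T}\sum_{t=1}^{T} P_t, \qquad \bar{Q} = \frac{1}{T}\sum_{t=1}^{T} Q_t$

    \begin{align*}
    \min_P \max_Q P^\top M Q
      &\le \max_Q \bar{P}^\top M Q \\
      &= \max_Q \frac{1}{T}\sum_{t=1}^{T} P_t^\top M Q \\
      &\le \frac{1}{T}\sum_{t=1}^{T} \max_Q P_t^\top M Q \\
      &= \frac{1}{T}\sum_{t=1}^{T} P_t^\top M Q_t \\
      &\le \min_P \frac{1}{T}\sum_{t=1}^{T} P^\top M Q_t + \Delta_T \\
      &= \min_P P^\top M \bar{Q} + \Delta_T \\
      &\le \max_Q \min_P P^\top M Q + \Delta_T
    \end{align*}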

• derivation shows that:

    $\max_Q M(\bar{P},Q) \le v + \Delta_T$

    $\min_P M(P,\bar{Q}) \ge v - \Delta_T$

• so: $\bar{P}$ and $\bar{Q}$ are $\Delta_T$-approximate minmax and maxmin strategies

• further: can choose $Q_t$’s to be pure ⇒ $\bar{Q} = \frac{1}{T}\sum_t Q_t$ will be sparse (≤ T non-zero entries)
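The procedure in this derivation can be sketched for Rock-Paper-Scissors, whose value is v = 1/2. The function name `solve_game_mw`, the tuning of η, and the explicit Δ_T = sqrt(ln m / (2T)) are assumptions borrowed from the standard exponential-weights analysis rather than stated in the talk.

```python
import math

def solve_game_mw(M, T):
    """Approximately solve a matrix game with entries in [0, 1] using MW.

    Mindy runs MW; Max answers each P_t with a pure best response.
    Returns (P_bar, Q_bar), the averages of the P_t's and Q_t's.
    """
    m, n = len(M), len(M[0])
    eta = math.sqrt(8 * math.log(m) / T)  # assumed tuning, not from the talk
    P = [1.0 / m] * m
    P_bar, Q_bar = [0.0] * m, [0.0] * n
    for _ in range(T):
        # Max's best response: the column j maximizing M(P_t, j).
        col_loss = [sum(P[i] * M[i][j] for i in range(m)) for j in range(n)]
        j = max(range(n), key=lambda c: col_loss[c])
        P_bar = [a + p / T for a, p in zip(P_bar, P)]
        Q_bar[j] += 1.0 / T
        # MW update using the observed losses M(i, j).
        weights = [P[i] * math.exp(-eta * M[i][j]) for i in range(m)]
        Z = sum(weights)
        P = [w / Z for w in weights]
    return P_bar, Q_bar

M = [[0.5, 1.0, 0.0], [0.0, 0.5, 1.0], [1.0, 0.0, 0.5]]
T = 5000
P_bar, Q_bar = solve_game_mw(M, T)
# max_Q M(P_bar, Q) and min_P M(P, Q_bar), both attained at pure strategies.
upper = max(sum(P_bar[i] * M[i][j] for i in range(3)) for j in range(3))
lower = min(sum(M[i][j] * Q_bar[j] for j in range(3)) for i in range(3))
delta = math.sqrt(math.log(3) / (2 * T))
```

Only one column is inspected per update, and `Q_bar` has at most T non-zero entries no matter how many columns the game has.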

• presented MW algorithm for repeatedly playing matrix games

• plays almost as well as best fixed strategy in hindsight

• very efficient — independent of # columns

• proved minmax theorem

• MW can also be used to approximately solve a game

[with Freund]

• problem: filter out spam (junk email)

• gather large collection of examples of spam and non-spam:

    From: Rob, can you review a paper...          non-spam
    From: Earn money without working!!!! ...      spam
    ...


• goal: have computer learn from examples to distinguish spam from non-spam

• main observation:

• easy to find “rules of thumb” that are “often” correct

• If ‘viagra’ occurs in message, then predict ‘spam’

• hard to find single rule that is very highly accurate

• given:
  • training examples
  • access to weak learning algorithm for finding weak classifiers (rules of thumb)

• repeat:
  • choose distribution on training examples
    • intuition: put most weight on “hardest” examples
  • train weak learner on distribution to get weak classifier

• combine weak classifiers into final classifier
  • use (weighted) majority vote

• weak learning assumption: weak classifiers slightly but consistently better than random

• want: final classifier to be very accurate

• Mindy (row player) ↔ booster

• Max (column player) ↔ weak learner

• matrix M:

  • row ↔ training example
  • column ↔ weak classifier
  • $M(i,j) = \begin{cases} 1 & \text{if } j\text{-th weak classifier correct on } i\text{-th training example} \\ 0 & \text{else} \end{cases}$
  • encodes which weak classifiers correct on which examples
  • huge # of columns: one for every possible weak classifier


• γ-weak learning assumption:
  • for every distribution on examples
  • can find weak classifier with weighted error ≤ 1/2 − γ

• equivalent to:

    (value of game M) ≥ 1/2 + γ

• by minmax theorem, implies that:
  • ∃ some weighted majority classifier that correctly classifies all training examples
  • further, weights are given by maxmin strategy of game M

• maxmin strategy of M has perfect (training) accuracy

• find approximately using earlier algorithm for solving a game
  • i.e., apply MW to M

• yields (variant of) AdaBoost
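The boosting game can be sketched on a toy problem. Everything concrete here is an assumption made for illustration: the 1-D data set, the decision-stump weak classifiers, the exhaustive weak learner, and the tuning of η. The result is the plain MW variant with an unweighted vote over the chosen classifiers, not AdaBoost itself.

```python
import math

# Toy training set: no single threshold rule is correct everywhere,
# but a majority vote of three such rules is.
X = [0, 1, 2, 3, 4, 5]
Y = [+1, +1, -1, -1, +1, +1]

# Weak classifiers (columns of M): stumps h(x) = s if x > theta else -s.
STUMPS = [(theta, s) for theta in (-0.5, 0.5, 1.5, 2.5, 3.5, 4.5)
          for s in (+1, -1)]

def predict(stump, x):
    theta, s = stump
    return s if x > theta else -s

def weak_learner(P):
    """Max's best response: the stump with highest weighted accuracy under P."""
    def accuracy(stump):
        return sum(p for p, x, y in zip(P, X, Y) if predict(stump, x) == y)
    return max(STUMPS, key=accuracy)

def boost(T):
    m = len(X)
    eta = math.sqrt(8 * math.log(m) / T)  # assumed tuning, not from the talk
    P = [1.0 / m] * m                     # booster's distribution over examples
    chosen = []
    for _ in range(T):
        h = weak_learner(P)
        chosen.append(h)
        # M(i, j) = 1 when h is correct on example i, so MW shrinks the weight
        # of correctly classified examples and emphasizes the hard ones.
        correct = [1.0 if predict(h, x) == y else 0.0 for x, y in zip(X, Y)]
        weights = [p * math.exp(-eta * c) for p, c in zip(P, correct)]
        Z = sum(weights)
        P = [w / Z for w in weights]
    return chosen

def majority_vote(chosen, x):
    return 1 if sum(predict(h, x) for h in chosen) > 0 else -1

chosen = boost(T=200)
train_accuracy = sum(majority_vote(chosen, x) == y for x, y in zip(X, Y)) / len(X)
```

The vote over `chosen` corresponds to the averaged column strategy, which approximates the maxmin strategy of M. On this data three stumps already vote correctly with margin 1/3, so the game value is at least 2/3 and 200 rounds are enough for the vote to classify every training example correctly.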

• AdaBoost can be derived as special case of general game-playing algorithm (MW)

• idea of boosting intimately tied to minmax theorem

• game-theoretic interpretation can further be used to analyze:

  • AdaBoost’s convergence properties
  • AdaBoost as a “large-margin” classifier (margin = measure of confidence, useful for analyzing generalization accuracy)

Application: Detecting Faces [Viola & Jones]

• problem: find faces in photograph or movie

• weak classifiers: detect light/dark rectangles in image

• many clever tricks to make extremely fast and accurate

[with Freund]

• want to predict daily if stock market will go up or down

• every morning:
  • listen to predictions of other forecasters (based on market conditions)
  • formulate own prediction

• every evening:
  • find out if market actually went up or down

• want own predictions to be almost as good as best forecaster

• given access to pool of predictors
  • could be: people, simple fixed functions, other learning algorithms, etc.
  • for our purposes, think of as fixed functions of observable “context”

• on each round:
  • get predictions of all predictors, given current context
  • make own prediction
  • find out correct outcome

• want: # mistakes close to that of best predictor

• training and testing occur simultaneously

• no statistical assumptions — examples chosen by adversary

• Mindy (row player) ↔ learner

• Max (column player) ↔ adversary

• matrix M:

  • row ↔ predictor
  • column ↔ context
  • $M(i,j) = \begin{cases} 1 & \text{if } i\text{-th predictor wrong on } j\text{-th context} \\ 0 & \text{else} \end{cases}$

• actually, the same game as in boosting, but with roles of players reversed

• can apply MW to game M

→ weighted majority algorithm [Littlestone & Warmuth]

• MW analysis gives:

    (# mistakes of MW) ≤ (# mistakes of best predictor) + [“small amount”]

• regret only logarithmic in # predictors

• can use to play penny-matching:
  • human and computer each choose a bit
  • if match, computer wins
  • else human wins

• random play wins with 50% probability
  • very hard for humans to play randomly
  • can do better if can predict what opponent will do

• algorithm: MW with many fixed predictors
  • each prediction based on recent history of plays
  • e.g.: if human played 0 on last two rounds, then predict next play will be 1, else predict next play will be 0

• play at:∼mindreader
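A minimal sketch of the penny-matching predictor, assuming a tiny hypothetical pool of five history-based predictors and a perfectly rhythmic human. The actual pool used by the mind-reader game is not given in the talk. The sketch tracks the expected number of wrong predictions when the computer follows predictor i with probability P(i).

```python
import math

# Hypothetical predictors of the human's next bit from the history of plays.
def same_as_last(history):
    return history[-1] if history else 0

def flip_last(history):
    return 1 - history[-1] if history else 0

def one_after_two_zeros(history):
    # Rule from the example: after two 0s predict 1, else predict 0.
    return 1 if history[-2:] == [0, 0] else 0

PREDICTORS = [same_as_last, flip_last, one_after_two_zeros,
              lambda h: 0, lambda h: 1]

def play(human_bits, eta):
    """Return (expected # wrong predictions of MW, # wrong of best predictor)."""
    m = len(PREDICTORS)
    P = [1.0 / m] * m
    mistakes = [0] * m
    expected_mistakes = 0.0
    history = []
    for bit in human_bits:
        wrong = [1.0 if f(history) != bit else 0.0 for f in PREDICTORS]
        expected_mistakes += sum(p * w for p, w in zip(P, wrong))
        mistakes = [c + int(w) for c, w in zip(mistakes, wrong)]
        # MW update: predictors that were wrong lose weight.
        weights = [p * math.exp(-eta * w) for p, w in zip(P, wrong)]
        Z = sum(weights)
        P = [w / Z for w in weights]
        history.append(bit)
    return expected_mistakes, min(mistakes)

T = 200
human_bits = [t % 2 for t in range(T)]  # a "rhythmic" human: 0, 1, 0, 1, ...
eta = math.sqrt(8 * math.log(len(PREDICTORS)) / T)  # assumed tuning
expected_mistakes, best_mistakes = play(human_bits, eta)
```

Against this alternating pattern the `flip_last` predictor is never wrong, so MW quickly concentrates its weight there and the computer wins far more than half the rounds.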

[figure: histogram of scores, horizontal axis from −100 to 100, regions labeled “human wins” and “computer wins”]

• score = (rounds won by human) − (rounds won by computer)

• based on 11,882 games

• computer won 86.6% of games

• average score = −41.0


• the faster people play, the lower their score

• when play fast, probably fall into rhythmic pattern
  • easy for computer to detect/predict

[with Syed]

• example: learning to “drive” in toy world

  • states: positions of cars
  • actions: steer left/right; speed up; slow down
  • reward: for going fast without crashing or going off road

• in general, called Markov decision process (MDP)

• goal: find policy that maximizes expected long-term reward
  • policy = mapping from states to actions

• when real-valued reward function known, optimal policy can be found using well established techniques

• however, often reward function not known

• in driving example:

• may know reward “features”:
  • faster is better than slower
  • collisions are bad
  • going off road is bad

• but may not know how to combine these into single real-valued function

• we assume reward is unknown convex combination of known real-valued features

• also assume get to observe behavior of “expert”

• goal:
  • imitate expert
  • improve on expert, if possible

• want to find policy such that improvement over expert is as large as possible, for all possible reward functions

• can formulate as game:
  • rows ↔ features
  • columns ↔ policies (note: extremely large number of columns)

• optimal (randomized) policy, under our criterion, is exactly maxmin of game

• can solve (approximately) using MW:
  • to pick “best” column on each round, use methods for standard MDP’s with known reward


• presented MW algorithm for playing repeated games
  • used to prove minmax theorem
  • can use to approximately solve games efficiently

• many diverse applications and insights
  • boosting
    • key concepts and algorithms are all game-theoretic
  • on-line learning
    • in essence, the same problem as boosting
  • apprenticeship learning
    • new and richer problem, but solvable as a game


