Regret Minimization and Job Scheduling
Yishay Mansour, Tel Aviv University
Decision Making under uncertainty
• Online algorithms
  – Stochastic models
  – Competitive analysis
  – Absolute performance criteria
• A different approach:
  – Define "reasonable" strategies
  – Compete with the best (in retrospect)
  – Relative performance criteria
Routing
• Model: each day
  1. select a path from source to destination
  2. observe the latencies
  – the latencies may differ from day to day
• Strategies: all source-destination paths
• Loss: the average latency on the selected path
• Performance goal: match the latency of the best single path
Financial Markets: options
• Model: stock or cash. Each day, set a portfolio, then observe the outcome.
• Strategies: invest either all in stock or all in cash
• Gain: based on the daily changes in the stock
• Performance goal: match the better of the two strategies; doing so implements an option!
Machine learning – Expert Advice
• Model: each time step
  1. observe the experts' predictions
  2. predict a label
• Strategies: experts (online learning algorithms)
• Loss: prediction errors
• Performance goal: match the error rate of the best expert, in retrospect
[Figure: experts 1-4 and their binary predictions (e.g. 1, 0, 1, 1) on an example.]
Parameter Tuning
• Model: multiple parameters to set
• Strategies: settings of the parameters
• Optimization: any
• Performance goal: match the best setting of the parameters
Parameter Tuning
• Development cycle:
  – develop the product (software)
  – test performance
  – tune parameters
  – deliver the "tuned" product
• Challenge: can we combine testing, tuning, and runtime?
Regret Minimization: Model
• Actions A = {1, …, N}
• Time steps: t ∈ {1, …, T}
• At time step t:
  – the agent selects a distribution pt(i) over A
  – the environment returns costs ct(i) ∈ [0,1]
• Adversarial setting
  – online loss: lt(on) = Σi ct(i) pt(i)
• Cumulative loss: LT(on) = Σt lt(on)
External Regret
• Relative performance measure: compare to the best strategy in A, the basic class of strategies
• Online cumulative loss: LT(on) = Σt lt(on)
• Action i's cumulative loss: LT(i) = Σt ct(i)
• Best action: LT(best) = MINi {LT(i)} = MINi {Σt ct(i)}
• External regret = LT(on) − LT(best)
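To make the definitions concrete, here is a minimal sketch (Python/NumPy, with a hypothetical random cost matrix and a fixed uniform player) computing the online loss and the external regret:

```python
import numpy as np

rng = np.random.default_rng(0)
T, N = 1000, 4
costs = rng.uniform(size=(T, N))            # c_t(i) in [0,1], hypothetical data

p = np.full((T, N), 1.0 / N)                # p_t(i): here, the uniform distribution at every step

online_loss = float((costs * p).sum())      # L_T(on) = sum_t sum_i c_t(i) p_t(i)
action_loss = costs.sum(axis=0)             # L_T(i)  = sum_t c_t(i)
external_regret = online_loss - action_loss.min()   # L_T(on) - L_T(best)
```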
External Regret Algorithm
• Goal: minimize the regret
• Algorithm:
  – track the regret to each action
  – play weights proportional to the (positive) regrets
• Formally, at time t:
  – compute the regret to each action: Yt(i) = Lt(on) − Lt(i), and rt(i) = MAX{Yt(i), 0}
  – pt+1(i) = rt(i) / Σj rt(j); if all rt(i) = 0, select pt+1 arbitrarily
• Notation: Rt = ⟨rt(1), …, rt(N)⟩ and ΔRt = Yt − Yt−1
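A direct transcription of this rule as a sketch; restarting at the uniform distribution when all regrets vanish is one arbitrary choice the slide permits:

```python
import numpy as np

def regret_matching(costs):
    """costs[t, i] = c_t(i); play p_{t+1}(i) proportional to r_t(i) = max(Y_t(i), 0)."""
    T, N = costs.shape
    online_loss, action_loss = 0.0, np.zeros(N)
    p = np.full(N, 1.0 / N)                          # arbitrary initial distribution
    history = []
    for t in range(T):
        history.append(p)
        online_loss += costs[t] @ p                  # accumulate l_t(on)
        action_loss += costs[t]                      # accumulate L_t(i)
        r = np.maximum(online_loss - action_loss, 0.0)   # r_t(i)
        p = r / r.sum() if r.sum() > 0 else np.full(N, 1.0 / N)
    return np.array(history)
```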
External Regret Algorithm: Analysis
• Recall: Rt = ⟨rt(1), …, rt(N)⟩ and ΔRt = Yt − Yt−1
• LEMMA: ΔRt ∙ Rt−1 = 0
  – Proof: Σi (ct(i) − lt(on)) rt−1(i) = Σi ct(i) rt−1(i) − Σi lt(on) rt−1(i), and since pt(i) = rt−1(i) / Σj rt−1(j):
    Σi lt(on) rt−1(i) = [Σi ct(i) pt(i)] ∙ Σi rt−1(i) = Σi ct(i) rt−1(i)
  – Geometrically, the regret increment ΔRt is orthogonal to the previous regret vector Rt−1.
• LEMMA: (maxi RT(i))² ≤ Σt=1..T ‖ΔRt‖² ≤ NT
  – hence the external regret is at most √(NT).
External regret: Bounds
• Average regret goes to zero ("no regret"): Hannan [1957]
• Explicit bounds: Littlestone & Warmuth '94, CFHHSW '97
  – External regret = O(log N + √(T log N))
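The explicit bounds are achieved by exponential-weights algorithms in the Littlestone-Warmuth line; a minimal Hedge-style sketch (the learning-rate tuning shown is the standard one, stated here as an assumption):

```python
import numpy as np

def hedge(costs):
    """Exponential weights: p_t(i) proportional to exp(-eta * L_{t-1}(i))."""
    T, N = costs.shape
    eta = np.sqrt(np.log(N) / T)           # standard tuning for an O(sqrt(T log N)) regret
    cum = np.zeros(N)                      # cumulative per-action losses L_{t-1}(i)
    online_loss = 0.0
    for t in range(T):
        w = np.exp(-eta * (cum - cum.min()))   # shift by the min for numerical stability
        p = w / w.sum()
        online_loss += costs[t] @ p
        cum += costs[t]
    return online_loss - cum.min()             # external regret
```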
Dominated Actions
• Model: action y dominates action x if y is always better than x
• Goal: do not play dominated actions
• Goal (unknown model): the fraction of time steps in which we play dominated actions is vanishing
Cost of action x:  .3  .8  .9  .3  .6
Cost of action y:  .2  .4  .7  .1  .3
(action y dominates action x)
Internal/Swap Regret
• Internal regret
  – Regret(x,y) = Σt: a(t)=x (ct(x) − ct(y))
  – Internal regret = maxx,y Regret(x,y)
• Swap regret
  – Swap regret = Σx maxy Regret(x,y)
• Swap regret ≥ external regret
  – Σx maxy Regret(x,y) ≥ maxy Σx Regret(x,y)
• Mixed actions
  – Regret(x,y) = Σt (ct(x) − ct(y)) pt(x)
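For a pure-action history these quantities are straightforward to tabulate; a sketch:

```python
import numpy as np

def internal_and_swap_regret(actions, costs):
    """actions[t] = action played at time t; costs[t, i] = c_t(i)."""
    T, N = costs.shape
    regret = np.zeros((N, N))               # regret[x, y] = Regret(x, y)
    for t in range(T):
        x = actions[t]
        regret[x] += costs[t, x] - costs[t]  # add c_t(x) - c_t(y) for every y
    internal = regret.max()                  # max over pairs (x, y)
    swap = regret.max(axis=1).sum()          # the best y chosen separately per x
    return internal, swap                    # swap >= internal, and swap >= external
```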
Dominated Actions and Regret
• Assume action y dominates action x: for every t, ct(x) > ct(y) + δ
• If we used action x for n times, then Regret(x,y) > δn
• If SwapRegret < R, then dominated actions are used at most R/δ times, a vanishing fraction whenever R = o(T)
Calibration
• Model: each step, predict a probability and then observe the outcome
• Goal: predictions calibrated with the outcomes
  – during the time steps where the prediction is p, the average outcome is (approximately) p
Example (predicting the probability of rain):
  predictions: .3  .5  .3  .3  .5
  calibration: on the days with prediction .3 the empirical rain rate should be about 1/3; on the days with prediction .5, about 1/2.
Calibration to Regret
• Reduction to swap/internal regret:
  – discrete probabilities, say {0.0, 0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9, 1.0}
  – loss of action x at time t: (x − ct)²
  – the best swap target y*(x) = argmaxy Regret(x,y) is y*(x) = avg(ct | prediction x)
  – consider Regret(x, y*(x)): low internal regret forces each prediction x to match its empirical average, i.e., calibration
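A sketch of the calibration error that this reduction controls; weighting each prediction value by its frequency is a common convention assumed here:

```python
import numpy as np

def calibration_error(preds, outcomes):
    """Weighted gap between each predicted value p and the empirical outcome rate given p."""
    preds, outcomes = np.asarray(preds, float), np.asarray(outcomes, float)
    err = 0.0
    for p in np.unique(preds):
        mask = preds == p
        err += mask.mean() * abs(outcomes[mask].mean() - p)
    return err

# The rain example above: predictions (.3, .5, .3, .3, .5) with outcomes (0, 1, 1, 0, 0)
# are (approximately) calibrated: the .3-days have rain rate 1/3, the .5-days have rate 1/2.
```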
Internal regret
• No internal regret: [Foster & Vohra], [Hart & Mas-Colell]
  – based on Blackwell's approachability theorem [Blackwell '56]
• Explicit bounds:
  – [Cesa-Bianchi & Lugosi '03]: internal regret = O(log N + √(T log N))
  – [Blum & Mansour]: swap regret = O(log N + √(T N))
Regret: External vs Internal
• External regret
  – "You should have bought S&P 500"
  – match boyi to girli
• Internal regret
  – "Each time you bought IBM you should have bought SUN"
  – stable matching
• Limitations:
  – no state
  – additive over time
[Even-Dar, Mansour, Nadav, 2009]
Routing Games
[Figure: a two-player network; player 1 sends flow f1 from s1 to t1, split between a left and a right path (f1,L and f1,R); player 2 sends flow f2 from s2 to t2, split between a top and a bottom path (f2,T and f2,B); on a shared edge e the latency is Le(f1,L + f2,T).]
• Atomic:
  – a finite number of players
  – player i transfers flow from si to ti
• Splittable flows
• Costi = Σp∈paths(si,ti) Latency(p) ∙ flowi(p)
Cournot Oligopoly [Cournot 1838]
• Firms select production levels (X, Y); the market price P depends on the TOTAL supply
• Each firm maximizes its profit = revenue − cost, with costs Cost1(X), Cost2(Y)
• Best response dynamics converges for 2 players [Cournot 1838]
  – a two-player oligopoly is a super-modular game [Milgrom, Roberts 1990]
• Diverges for n ≥ 5 [Theocharis 1960]
Resource Allocation Games
• Advertisers set budgets; each advertiser wins a proportional market share
  – e.g., with budgets $5M, $10M, $17M, $25M, the advertiser bidding $25M is allocated the rate 25/(5+10+17+25)
• Utility:
  – concave utility f from the allocated rate
  – quasi-linear with money: U = f(allocated rate) − $25M
• The best response dynamics generally diverges for linear resource allocation games
Properties of Selfish Routing, Cournot Oligopoly and Resource Allocation Games
1. Closed convex strategy set
2. A (weighted) social welfare is concave: there exist λ1, …, λn > 0 such that λ1 u1(x) + λ2 u2(x) + … + λn un(x) is concave
3. The utility of a player is convex in the vector of actions of the other players
Socially Concave Games
• The relation between socially concave games and concave games:
[Venn diagram over normal form games (with mixed strategies): the classes of socially concave games and concave games intersect; zero-sum games, atomic splittable routing, resource allocation, and Cournot lie in the intersection, where the Nash equilibrium is unique.]
• Concave games [Rosen '65]:
  – the utility of a player is strictly concave in her own strategy
  – a sufficient condition for equilibrium uniqueness
The average action and average utility converge to NE
• If each player uses a procedure without regret in a socially concave game, then their joint play converges to Nash equilibrium:
  – Theorem 1: the average action profile (each player's actions averaged over days 1…T) converges to an ε(T)-Nash equilibrium
  – Theorem 2: the average daily payoff of each player converges to her payoff in the NE
[Figure: each player's actions on days 1, 2, 3, …, T, and their average over days 1…T.]
Convergence of the "average action" and convergence of the "average payoff" are two different things!
[Figure: two parallel links between s and t; on even days all traffic uses one link, on odd days the other.]
• Here the average action converges to (½, ½) for every player
• But the average cost is 2, while the average cost in the NE is 1

The Action Profile Itself Need Not Converge
• The same alternating play (one link on even days, the other on odd days) keeps oscillating: the average action converges, but the daily action profile does not.
Correlated Equilibrium
• CE: a joint distribution Q over joint actions
• Each time t, a joint action is drawn from Q
  – each player's action is a best response, given her own component and Q
• Theorem [HM, FV]: when multiple players each play with low internal (swap) regret, their play converges to a CE
[Even-Dar, Kleinberg, Mannor, Mansour, 2009]
Job Scheduling: Motivating Example
[Figure: users send requests through a load balancer to a pool of servers.]
• GOAL: minimize the load on the servers
Online Algorithms
• Job scheduling
  – N unrelated machines (machine = action)
  – each time step: a job arrives, with different loads on the different machines
  – the algorithm schedules the job on some machine, given its loads
  – goal: minimize the loads (makespan or L2)
• Regret minimization
  – N actions (machines)
  – each time step: first the algorithm selects an action (machine), then it observes the losses (the job loads)
  – goal: minimize the sum of losses
Modeling Differences: Information
• Information model: what does the algorithm know when it selects an action/machine?
  – Known cost: first observe the costs, then select an action (job scheduling)
  – Unknown cost: first select an action, then observe the costs (regret minimization)
Modeling Differences: Performance
• Theoretical performance measure:
  – comparison class: job scheduling compares to the best (offline) assignment; regret minimization compares to the best static algorithm
  – guarantees: job scheduling gives multiplicative guarantees; regret minimization gives additive, vanishing guarantees
• Objective function:
  – job scheduling: global (makespan)
  – regret minimization: additive
Formal Model
• N actions
• Each time step t, algorithm ON:
  – selects a (fractional) action: pt(i)
  – observes losses ct(i) in [0,1]
• Average loss of ON for action i at time T: ONT(i) = (1/T) Σt≤T pt(i) ct(i)
• Global cost functions:
  – C∞(ONT(1), …, ONT(N)) = maxi ONT(i)
  – Cd(ONT(1), …, ONT(N)) = [Σi (ONT(i))^d]^(1/d)
Formal Model
• Static optimum:
  – consider any fixed distribution α, played at every time step
  – the static optimum α* minimizes the cost C
• Formally:
  – let α ◊ L = (α(1)L(1), …, α(N)L(N)) (the Hadamard, or Schur, product), where LT(i) = (1/T) Σt ct(i)
  – best fixed distribution: α*(L) = arg minα C(α ◊ L)
  – static optimality: C*(L) = C(α*(L) ◊ L)
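For two machines under the makespan, α* has a closed form, obtained by balancing the two final loads (α L1 = (1 − α) L2); a sketch, matching the example below:

```python
def static_opt_makespan2(L1, L2):
    """Best fixed split for two machines: balance alpha*L1 = (1-alpha)*L2."""
    alpha = L2 / (L1 + L2)                  # alpha*(L) = (L2, L1) / (L1 + L2)
    makespan = L1 * L2 / (L1 + L2)          # both final loads equal this value
    return (alpha, 1 - alpha), makespan

# static_opt_makespan2(4, 2) == ((1/3, 2/3), 4/3), matching the example below.
```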
Example
• Two machines, makespan:
  – observed loads: L1, L2
  – α*(L) = ( L2/(L1+L2) , L1/(L1+L2) )
  – final loads: ( L1L2/(L1+L2) , L1L2/(L1+L2) )
  – makespan = L1L2/(L1+L2)
• Example: observed loads (4, 2) give α*(L) = (1/3, 2/3), final loads (4/3, 4/3), makespan 4/3.
Our Results: Adversarial General
• General feasibility result:
  – assume C is convex and C* is concave (this includes the makespan and the Ld norm for d > 1)
  – then there exists an online algorithm ON such that, for any loss sequence L:
    C(ON) < C*(L) + o(1)
  – rate of convergence: about √(N/T)
Our Results: Adversarial Makespan
• Makespan algorithm:
  – there exists an algorithm ON such that, for any loss sequence L:
    C(ON) < C*(L) + O(log² N / √T)
• Benefits:
  – very simple and intuitive
  – improved regret bound
• For two actions, the algorithm keeps pt(1) within about Δt/√T of ½, where Δt is the difference between the online loads (see the two-action algorithm below).
Our Results: Adversarial Lower Bound
• We show that for many non-convex C there is a non-vanishing regret
  – this includes the Ld norm for d < 1
• Non-vanishing regret ratio (> 1): there is a sequence of losses L such that C(ON) > (1+γ) C*(L), where γ > 0
Preliminary: Local vs. Global
[Figure: the time line is partitioned into blocks B1, B2, …, Bk; low regret in each block yields overall low regret.]
Preliminary: Local vs. Global
• LEMMA:
  – assume C is convex and C* is concave,
  – assume a partition of the time steps into blocks Bi,
  – and assume that in each time block Bi the regret is at most Ri.
  Then: C(ON) − C*(L) ≤ Σi Ri
Preliminary: Local vs. Global
Proof:
  C(ON) ≤ Σi C(ON(Bi))              (C is convex)
  Σi C*(L(Bi)) ≤ C*(L)              (C* is concave)
  C(ON(Bi)) − C*(L(Bi)) ≤ Ri        (low regret in each Bi)
  Summing over the blocks: Σi [C(ON(Bi)) − C*(L(Bi))] ≤ Σi Ri, hence C(ON) − C*(L) ≤ Σi Ri.  QED
• So it is enough to bound the regret on subsets.
Example
• Two machines M1, M2; arriving losses: (2, 1) at t = 1 and (1, 2) at t = 2.
• Static opt: α* = (½, ½), cost = 3/2
• Local (per-step) opt: α* = (1/3, 2/3) then (2/3, 1/3), cost = 4/3
• Global offline opt: (0, 1) then (1, 0), cost = 1
Stochastic case
• Each time t the costs are drawn from a joint distribution
  – i.i.d. over time steps, but not necessarily independent between actions
• INTUITION: assume two actions (machines) with load distribution:
  – with probability ½: (1, 0)
  – with probability ½: (0, 1)
• Which policy minimizes the makespan regret?
• Regret components: MAX(L(1), L(2)) = sum/2 + |Δ|/2, where sum = L(1) + L(2) and Δ = L(1) − L(2)
Stochastic case: Static OPT
• Natural (model-based) choice: always select action (½, ½)
• Observations:
  – assume (1, 0) occurs T/2 + Δ times and (0, 1) occurs T/2 − Δ times
  – loads: (T/4 + Δ/2, T/4 − Δ/2)
  – makespan = T/4 + Δ/2 > T/4
  – static OPT: T/4 − Δ²/T < T/4; w.h.p. OPT is T/4 − O(1)
• sum = T/2 and E[|Δ|] = O(√T), so Regret = O(√T)
Can we do better ?!
Stochastic case: Least Loaded
• Least loaded machine: select the machine with the lower current load
• Observations:
  – the machines keep (almost) the same load: |Δ| ≤ 1
  – sum of the loads: E[sum] = T/2
  – expected makespan = T/4
• Regret:
  – the least-loaded makespan is LLM = T/4 ± √T
  – Regret = MAX{LLM − T/4, 0} = O(√T)
  – the regret counts only the "bad" side
• A small simulation contrasting this policy with the static one appears below.
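A sketch of that simulation (hypothetical horizon and seed):

```python
import numpy as np

def makespan_regret(policy, T=100_000, seed=1):
    """Regret = max{makespan - T/4, 0} on the (1,0)/(0,1) coin-flip distribution."""
    rng = np.random.default_rng(seed)
    loads = np.zeros(2)
    for _ in range(T):
        cost = np.array([1.0, 0.0]) if rng.random() < 0.5 else np.array([0.0, 1.0])
        loads += policy(loads) * cost            # fractional assignment
    return max(loads.max() - T / 4, 0.0)

static = lambda loads: np.array([0.5, 0.5])
least_loaded = lambda loads: np.eye(2)[np.argmin(loads)]  # all weight on the lighter machine

# Both policies give O(sqrt(T)) regret here, but least loaded keeps |Delta| <= 1,
# so only the sum term fluctuates.
```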
Can we do better ?!
Stochastic case: Optimized Finish
• Algorithm:
  – select action (½, ½) for the first T − 4√T steps
  – play least loaded afterwards
• Claim: Regret = O(T^(1/4))
  – until step T − 4√T, w.h.p. Δ < 2√T
  – there exists a time t in [T − 4√T, T] with Δ = 0 and sum = T/2 + O(T^(1/4))
  – from 1 to t: regret = O(T^(1/4)); from t to T: regret = O(√(T − t)) = O(T^(1/4))
• A sketch of this policy follows.
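A sketch of the optimized-finish policy (the horizon T is assumed known in advance):

```python
import numpy as np

def optimized_finish(costs):
    """Play (1/2, 1/2) for T - 4*sqrt(T) steps, then play least loaded."""
    T = len(costs)
    switch = int(T - 4 * np.sqrt(T))
    loads = np.zeros(2)
    for t, cost in enumerate(costs):
        if t < switch:
            p = np.array([0.5, 0.5])             # balanced phase
        else:
            p = np.eye(2)[np.argmin(loads)]      # clean up the deviation at the end
        loads += p * np.asarray(cost, float)
    return loads                                  # makespan = loads.max()
```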
Can we do better ?!
Stochastic case: Any time
• An algorithm with low regret at any time t
  – it does not plan for a final horizon T
• Variant of least loaded: put weight ½ + T^(-1/4) on the less loaded machine
• Claim: Regret = O(T^(1/4))
  – idea: Regret = max{(L1 + L2)/2 − T/4, 0} + Δ
  – every O(T^(1/2)) steps, Δ = 0
  – and the play stays very near (½, ½)
Can we do better ?!
Stochastic case: Logarithmic Regret
• Algorithm:
  – use phases whose lengths shrink exponentially: Tk = Tk−1/2 with T1 = T/2, so there are log T phases
  – every phase cancels the deviations (from the expectation) of the previous phase
• Works for any probabilities and actions!
  – assuming the probabilities are known
Can we do better ?!
Stochastic case
• Assume that each action's cost is drawn from a joint distribution, i.i.d. over time steps
• Theorem (makespan/Ld):
  – known distributions: Regret = O(log T / T)
  – unknown distributions: Regret = O(log² T / T)
Summary
• Regret minimization
  – external, internal, dynamics
• Job scheduling and regret minimization
  – a different, global cost function
  – open problems: exact characterization; lower bounds
Makespan Algorithm
• Outline:
  – a simple algorithm for two machines
    • regret O(1/√T); simple and almost memory-less
  – a recursive construction:
    • given three algorithms, two for k/2 actions and one for 2 actions, build an algorithm for k actions
    • main issue: what kind of feedback to "propagate"
    • regret O(log² N / √T), better than the general result
Makespan: Two Machines
• Intuition: keep the online loads balanced
• Failed attempts:
  – standard regret minimization: on an unbalanced input sequence L, the algorithm will put most of the load on a single machine
  – using the optimum to drive the probabilities
• Our approach: use the online loads, not the optimum or the static cumulative loads
Makespan Algorithm: Two actions
• At time t maintain probabilities pt,1 and pt,2 = 1 − pt,1
• Initially p1,1 = p1,2 = ½
• At time t, shift probability toward the machine that received less load:
  pt+1,1 = pt,1 + (1/√T) (pt,2 lt,2 − pt,1 lt,1)
  so pt,1 stays near ½, offset by roughly Δt/√T, where Δt is the difference between the online loads
• Remarks:
  – uses the online loads
  – almost memory-less
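A transcription of the reconstructed update as a sketch; clipping the probability to [0, 1] is a safety assumption not spelled out on the slide:

```python
import numpy as np

def two_machine_makespan(losses):
    """losses[t] = (l_{t,1}, l_{t,2}); shift probability toward the lighter machine."""
    T = len(losses)
    p1 = 0.5                                       # p_{1,1} = p_{1,2} = 1/2
    loads = np.zeros(2)
    for l1, l2 in losses:
        loads += [p1 * l1, (1 - p1) * l2]          # online loads
        p1 += ((1 - p1) * l2 - p1 * l1) / np.sqrt(T)   # the p_{t+1,1} update
        p1 = min(max(p1, 0.0), 1.0)                # keep a valid probability
    return loads
```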
Makespan Algorithm: Analysis
• View the change in the probabilities as a walk on the line [0, 1], starting at ½.
Makespan Algorithm: Analysis
• Consider a small interval of length ε on the line.
• Total change in the loads: identical on both machines whenever the walk starts and ends with the same Δ.
• Consider only the losses inside the interval (a local analysis):
  – the local opt is also in the interval
  – online used a "similar" probability, so it loses at most ε per step
Makespan Algorithm: Analysis
• Simplifying assumptions:
  – the walk is "balanced" in every interval:
    • add "virtual" losses to return to the initial state
    • only O(√T) additional losses; this relates the learning rate to the regret
  – losses that "cross" an interval's boundary need more sophisticated "bookkeeping":
    • make sure an update affects at most two adjacent intervals
  – regret accounting:
    • the loss inside an interval
    • plus the additional "virtual" losses
Makespan: Recursive algorithm
• Recursive algorithm:
[Figure: algorithm A3 sits on top of the pair A1, A2.]
Makespan: Recursive
• The algorithms:
  – Algorithms A1 and A2:
    • each has "half" of the actions
    • each gets the actual losses of its actions and "balances" them
    • each works in isolation, simulating its own play and not considering the actual loads
  – Algorithm A3:
    • gets the average load in A1 and in A2
    • balances these "average" loads
[Figure: the losses lt,1, …, lt,k/2, … feed A1 and A2; A3 receives AVG(lt,i qt,i) from A1 and AVG(lt,i q't,i) from A2.]
Makespan: Recursive
• The combined output:
  – A3 outputs (p1, p2) over the two halves; A1 outputs (q1, …, qk/2) over its actions, and similarly A2
  – the final probability of action i in A1's half is p1 qi, and in A2's half p2 q'i
[Figure: A3 above A1 and A2; the leaf losses l1, …, lk/2, … and the averaged feedback AVG(lt,i qt,i) flow between the levels.]
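The combination itself is a product of distributions; a minimal sketch (function name hypothetical):

```python
import numpy as np

def combine(q1, q2, p):
    """q1, q2: A1's and A2's distributions over their halves; p = (p1, p2) from A3."""
    return np.concatenate([p[0] * np.asarray(q1), p[1] * np.asarray(q2)])

# combine([0.25, 0.75], [0.5, 0.5], (0.4, 0.6)) is a proper distribution over 4 actions.
```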
Makespan: Recursive
• Analysis (intuition): assume perfect ZERO regret (just for intuition)
  – the output of A1 and A2 is completely balanced: the average equals the individual loads (maximum = average = minimum)
  – the output of A3 is balanced: the contribution of A1's machines equals that of A2's
• Real analysis: we need to bound the amplification of the regret.