Low-regret Online Decision-making Via Bellman Inequalities

Joint work with Sid Banerjee and Itai Gurvich

2/36

Relaxations and Regret Bounds for Online Problems

● Must make decisions upon request

● Uncertain process

● Statistical information available

● Goal: develop practical near-optimal algorithms

3/36

Our Results

Relaxations and Regret Bounds for Online Problems

4/36

Our Results


Meta-Theorem: For different resource allocation problems, we give a practical policy, based on re-solving an optimization program, with bounded regret.

The bound is independent of the horizon and capacities.

5/36

Our Results






● Applications: Dynamic posted pricing, Online Knapsack, Network Revenue Management (Online Packing), Online Matching, Online Probing, Contextual Bandits

6/36

Our Results






● Challenges: define a benchmark and use it to design an algorithm

7/36

Why Constant Regret?


Case Study: edge weighted online matching

9/36



● Algorithms are different

● Not worst case, but parametric


10/36

Problem 1: Online Knapsack


● Finite set of types:

● Known reward distribution and weight:

● Initial budget and horizon:

● Arrival process:

● Objective: collect as much reward as possible
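The setup in the bullets above can be made concrete with a small simulation. This is a minimal sketch, not the talk's own notation: the type names, the numbers in `TYPES`, the helper names, and the simplification to deterministic per-type rewards are all assumptions made for illustration.

```python
import random

# Hypothetical instance: type -> (reward, weight, arrival probability).
# Rewards are taken as deterministic per type to keep the sketch short.
TYPES = {"a": (10.0, 2, 0.3), "b": (6.0, 1, 0.5), "c": (3.0, 1, 0.2)}

def sample_arrivals(horizon, rng):
    """Draw one i.i.d. sequence of arriving types of length `horizon`."""
    names = list(TYPES)
    probs = [TYPES[n][2] for n in names]
    return rng.choices(names, weights=probs, k=horizon)

def run_policy(arrivals, budget, accept):
    """Run an online policy over the arrivals.

    `accept(name, budget, periods_left)` returns True to take the request.
    A request is only taken if its weight still fits in the budget.
    """
    total = 0.0
    for t, name in enumerate(arrivals):
        reward, weight, _ = TYPES[name]
        periods_left = len(arrivals) - t
        if weight <= budget and accept(name, budget, periods_left):
            budget -= weight
            total += reward
    return total

def greedy(name, budget, periods_left):
    """Baseline policy: accept everything that fits."""
    return True

rng = random.Random(0)
arrivals = sample_arrivals(20, rng)
print(run_policy(arrivals, budget=8, accept=greedy))
```

The `accept` callback is the only place where a policy differs, so other rules can be compared on the same arrival sequence.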

11/36

Types of Benchmark


Number of type- arrivals

12/36

Types of Benchmark


[Figure labels: Reward; Algorithm, Optimal (DP), Prophet; Regret; Number of type- arrivals]
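The ordering of the three benchmarks named on this slide can be written out as follows. The symbols $V^{\mathrm{alg}}$, $V^{\mathrm{DP}}$, and $V^{\mathrm{prophet}}$ are notation introduced here for illustration, and measuring regret against the prophet is an assumption based on the slide labels.

$$
\mathbb{E}\big[V^{\mathrm{alg}}\big] \;\le\; \mathbb{E}\big[V^{\mathrm{DP}}\big] \;\le\; \mathbb{E}\big[V^{\mathrm{prophet}}\big],
\qquad
\mathrm{Regret} \;=\; \mathbb{E}\big[V^{\mathrm{prophet}}\big] - \mathbb{E}\big[V^{\mathrm{alg}}\big].
$$

The first inequality holds because the DP is the best online policy, and the second because the prophet sees the whole arrival sequence in advance. Under this definition, a regret bound against the prophet also bounds the gap to the optimal DP.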

13/36

Online Packing


Theorem: A natural policy achieves constant expected regret for online packing problems. The regret is independent of the horizon and capacities.

In particular, the regret depends only on

Generalizes to multiple resources and other arrival processes.
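One way to picture a re-solving policy for the single-resource case is sketched below. It is an illustration under stated assumptions, not necessarily the policy from the talk: the instance `TYPES`, the function names, and the "accept if the fluid solution serves at least half of the expected future arrivals" threshold are all assumptions. With one resource the fluid LP is a fractional knapsack, so it is solved greedily here rather than with an LP solver.

```python
import random

# Hypothetical instance: type -> (reward, weight, arrival probability).
TYPES = {"hi": (10.0, 1, 0.5), "lo": (1.0, 1, 0.5)}

def fluid_solution(types, budget, periods_left):
    """Greedily solve the fluid LP
        max sum_j r_j x_j  s.t.  sum_j w_j x_j <= budget,
                                 0 <= x_j <= p_j * periods_left,
    which is a fractional knapsack when there is a single resource."""
    x = {}
    remaining = float(budget)
    # Serve types in decreasing order of reward per unit of weight.
    by_density = sorted(types.items(),
                        key=lambda kv: kv[1][0] / kv[1][1], reverse=True)
    for name, (reward, weight, prob) in by_density:
        expected_arrivals = prob * periods_left
        x[name] = min(expected_arrivals, max(remaining, 0.0) / weight)
        remaining -= weight * x[name]
    return x

def resolve_accept(types, name, budget, periods_left):
    """Re-solve the fluid LP and accept iff it serves at least half of the
    expected future arrivals of this type (an assumed threshold rule)."""
    _, weight, prob = types[name]
    if weight > budget:
        return False
    x = fluid_solution(types, budget, periods_left)
    return x[name] >= 0.5 * prob * periods_left

def simulate(types, budget, horizon, seed=0):
    """Run the re-solving rule on one sampled arrival sequence."""
    rng = random.Random(seed)
    names = list(types)
    probs = [types[n][2] for n in names]
    total = 0.0
    for t in range(horizon):
        name = rng.choices(names, weights=probs, k=1)[0]
        if resolve_accept(types, name, budget, horizon - t):
            budget -= types[name][1]
            total += types[name][0]
    return total

print(simulate(TYPES, budget=6, horizon=10))
```

In this sketch the program is re-solved at every arrival with the current budget and remaining horizon, which is what distinguishes it from a policy that solves the program once at the start.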

17/36

Online Packing






Similar results in a recent work for restricted cases [Bumpensanti & Wang]

18/36

Overview of the General Framework


Goal: Handle more general problems

20/36

Intuition


Given the additional information, Prophet wants to solve a DP

22/36

Intuition


Bellman Loss (computational)


24/36

Intuition



Information Loss (estimation)


25/36

Knapsack RABBI


26/36

Problem 2: Dynamic Posted Pricing


● Stream of T customers with i.i.d. rewards

● Each customer wants one of our identical items

● We can post any fare from the set

● Objective: collect as much reward as possible

Prophet solves:

?

27/36

Pricing RABBI


Fraction of customers that would buy when the fare is
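The purchase fraction mentioned above can be illustrated with a short computation. This is a sketch under assumptions: a customer is taken to buy exactly when their valuation is at least the posted fare, and the valuations, fares, and function names are hypothetical.

```python
def buy_fraction(valuations, fare):
    """Fraction of customers whose valuation is at least the posted fare."""
    return sum(v >= fare for v in valuations) / len(valuations)

def revenue_per_customer(valuations, fare):
    """Expected revenue from one customer when posting `fare`."""
    return fare * buy_fraction(valuations, fare)

# Hypothetical valuations and fare set.
valuations = [1, 2, 2, 3, 5, 8]
fares = [1, 2, 3, 5]

for fare in fares:
    print(fare, buy_fraction(valuations, fare),
          revenue_per_customer(valuations, fare))
```

A higher fare earns more per sale but sells to a smaller fraction of customers, which is the trade-off a pricing policy has to balance against the remaining items and horizon.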

28/36

Dynamic Posted Pricing



Theorem: A natural policy achieves constant expected regret for Dynamic Posted Pricing. The regret is independent of the horizon and capacity.

In particular, the regret depends only on .

Fraction that buys at

31/36

The Algorithm is Practical


33/36

Bound via Bellman Inequalities


Definition: Given filtration , is a relaxed value w.r.t. if

1) Initial Ordering:

2) Monotonicity:

35/36

Conclusions and Extensions


● Framework based on constructing tractable benchmarks

● Bellman Loss: computational

● Information Loss: estimation

● Applications: NRM, Probing, Contextual Bandits, AdWords, Dynamic Pricing, and other Resource Allocation Problems

36/36

Related Work


● Prophet: worst case distribution (competitive ratio) for maximum of iid [Hill & Kertz], best possible [Correa et al.], matroid constraints [Kleinberg & Weinberg]

● Constant regret in NRM: [Arlotto & Gurvich]

[Talluri & Van Ryzin], [Reiman & Wang], [Jasin & Kumar], [Bumpensanti & Wang]

● Online matching, resource allocation, AdWords: [Manshadi et al.], [Legrain & Jaillet]

● Probing: competitive ratio (linear regret) [Gupta & Nagarajan], [Singla], [Chugg & Maehara]

● Information Relaxation: [Balseiro & Brown], [Brown, Smith, & Sun]

● Approximate Dynamic Programming: [Powell]