
A Reinforcement Learning Approach for Product Delivery by Multiple Vehicles Scott Proper Oregon...

Date post: 22-Dec-2015
Page 1: A Reinforcement Learning Approach for Product Delivery by Multiple Vehicles

Scott Proper

Oregon State University

Prasad Tadepalli, Hong Tang, Rasaratnam Logendran

Page 2: Vehicle Routing & Product Delivery

Page 3: Contributions of our Research

Multiple vehicle product delivery is a well-studied problem in operations research

We have formulated this problem as an average reward reinforcement learning (RL) problem

We have combined inventory control with vehicle routing

We have scaled RL methods to work with large state spaces

Page 4: Markov Decision Processes

Action a

Actions are stochastic: P_{i,j}(a) is the probability of moving from state i to state j under action a

Actions have costs or rewards: r_i(a)

[Diagram: example transitions labeled Move, Unload, Unload]

Page 5: Average Reward Reinforcement Learning

Goal: Maximize average reward per time step
– Minimize stockout penalty + movement penalty

Policy: states → actions

Value function: states → real values
– expected long-term reward from a state, relative to other states, when following the optimal policy

Page 6: H-Learning

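The value function satisfies the Bellman equation:

h(i) = \max_a [ r_i(a) + \sum_j P_{i,j}(a) h(j) ] - \rho

where h(i) is the value of state i and \rho is the average reward per time step of the optimal policy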

The optimal action a* maximizes the immediate reward + expected value of the next state

H-Learning is a real-time algorithm for solving for the value function
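
The update described on this slide can be sketched in a few lines of Python. This is a minimal tabular sketch under stated assumptions, not the authors' implementation: the class name `HLearner`, its attribute names, the symbols `h` (state values) and `rho` (average reward estimate), and the choice to score untried actions as 0 are all assumptions made for illustration. Exploration and action selection are left out.

```python
from collections import defaultdict


class HLearner:
    """Tabular, model-based average-reward learner in the style of H-learning."""

    def __init__(self, actions):
        self.actions = list(actions)
        self.h = defaultdict(float)       # h[i]: value of state i relative to other states
        self.rho = 0.0                    # estimate of the average reward per time step
        self.alpha = 1.0                  # step size for rho, decays as alpha / (alpha + 1)
        self.visits = defaultdict(int)    # visits[(i, a)]: times a was taken in i
        self.next_counts = defaultdict(lambda: defaultdict(int))  # [(i, a)][j] counts
        self.reward = defaultdict(float)  # reward[(i, a)]: running mean of r_i(a)

    def backup(self, i, a):
        """Estimated r_i(a) + sum_j P_ij(a) * h(j); untried actions score 0 (assumption)."""
        n = self.visits[(i, a)]
        if n == 0:
            return 0.0
        expected_h = sum(c / n * self.h[j] for j, c in self.next_counts[(i, a)].items())
        return self.reward[(i, a)] + expected_h

    def greedy_actions(self, i):
        scores = {a: self.backup(i, a) for a in self.actions}
        best = max(scores.values())
        return [a for a, s in scores.items() if s == best]

    def update(self, i, a, r, j):
        """Learn from one transition: action a in state i gave reward r and led to j."""
        # Update the learned action model (transition counts and mean reward).
        self.visits[(i, a)] += 1
        self.next_counts[(i, a)][j] += 1
        self.reward[(i, a)] += (r - self.reward[(i, a)]) / self.visits[(i, a)]
        if a in self.greedy_actions(i):
            # Only greedy steps are used to estimate the average reward.
            target = self.reward[(i, a)] - self.h[i] + self.h[j]
            self.rho += self.alpha * (target - self.rho)
            self.alpha = self.alpha / (self.alpha + 1.0)
        # Bellman backup for the state just left.
        self.h[i] = max(self.backup(i, b) for b in self.actions) - self.rho
```

A caller would pick an action (greedy or exploratory), observe the reward and next state, and pass them to `update`. With a single action that alternates between rewards of -0.1 and 0, `rho` settles toward the mean of the two.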

Page 7: H-Learning: an example 1

[Diagram: states A, B, C, D, E; edge labels: -.1, 1/1; 0, 9/9; 0, 0/9]

Value Table
A 0
B 0
C 0
D 0
E 0

Page 8: H-Learning: an example 2

Stockout penalty: -20

[Diagram: states A, B, C, D, E; edge labels: -.1, 1/1; 0, 9/10; -20, 1/10]

Value Table
A -.1
B 0
C 0
D 0
E 0

Page 9: H-Learning: an example 3

[Diagram: states A, B, C, D, E; edge labels: -.1, 1/1; 0, 9/10; -20, 1/10]

Value Table
A -.1
B 0
C 0
D 0
E 0

Page 10: H-Learning: an example 4

Move penalty: -.1

[Diagram: states A, B, C, D, E; edge labels: -.1, 2/2; 0, 9/10; -20, 1/10]

Value Table
A -.1
B 0
C 0
D 0
E 0

Page 11: On-line Product Delivery

Deliver 1 product

9 truck actions:
– 4 levels of unload
– 4 move directions
– wait

P(Inventory decrease | shop)

Stockout penalty: -20

Movement penalty: -.1

[Diagram labels: 5 Shops, Depot]

Page 12: The problem of state-space explosion

The loads of trucks and shop inventories are discretized into 5 levels

States grow exponentially in shops and trucks
– 10 locations, 5 shops, 2 trucks = (10^2)(5^5)(5^2) = 7,812,500 states
– 5 trucks = (10^5)(5^5)(5^5) = 976,562,500,000 states

Table-based methods take too much time and space
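
The state counts on this slide follow from multiplying three factors: truck positions, shop inventory levels, and truck load levels. A short sketch of that arithmetic, where the function name `count_states` and its parameter names are chosen here for illustration:

```python
def count_states(locations, shops, trucks, levels=5):
    """Number of table entries when loads and inventories take `levels` values each."""
    truck_positions = locations ** trucks   # each truck can be at any location
    shop_inventories = levels ** shops      # each shop has one discretized inventory
    truck_loads = levels ** trucks          # each truck has one discretized load
    return truck_positions * shop_inventories * truck_loads


if __name__ == "__main__":
    for k in (2, 5):
        print(k, "trucks:", f"{count_states(10, 5, k):,}", "states")
```

For 10 locations and 5 shops this gives 7,812,500 states with 2 trucks and 976,562,500,000 states with 5 trucks.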

Page 13: Piecewise Linear Function Approximation

We use a different linear function for each possible 5-tuple of locations l_1, …, l_5 of trucks

Each function is linear in truck loads and shop inventories

Every function represents 10 million states

million-fold reduction of learnable parameters
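
One way to picture this representation is a dictionary that maps each tuple of truck locations to its own weight vector. The sketch below is an assumption-laden illustration: the class name `PiecewiseLinearValue`, the bias term, and the delta-rule `update` are not taken from the slides, which do not say how the weights are fit.

```python
from collections import defaultdict

N_FEATURES = 10  # 5 truck loads + 5 shop inventories (the 5-truck, 5-shop case)


class PiecewiseLinearValue:
    """One linear function of loads and inventories per tuple of truck locations."""

    def __init__(self, learning_rate=0.01):
        self.learning_rate = learning_rate
        # Weight vector = bias + one weight per feature, created lazily per location tuple.
        self.weights = defaultdict(lambda: [0.0] * (N_FEATURES + 1))

    def value(self, locations, loads, inventories):
        w = self.weights[tuple(locations)]
        x = [1.0, *loads, *inventories]
        return sum(wi * xi for wi, xi in zip(w, x))

    def update(self, locations, loads, inventories, target):
        """Move this piece's weights toward `target` (delta rule, an assumption)."""
        w = self.weights[tuple(locations)]
        x = [1.0, *loads, *inventories]
        error = target - sum(wi * xi for wi, xi in zip(w, x))
        for k, xk in enumerate(x):
            w[k] += self.learning_rate * error * xk
```

Only the weights for the visited location tuple change on an update, so each piece generalizes across all load and inventory combinations that share those truck locations.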

Page 14: Piecewise linear function approximation vs. table-based

10 locations, 5 shops, 2 trucks, 10^6 iterations

[Plot: average reward (y-axis, -8 to 0) vs. 1000's of iterations (x-axis, 10 to 910); series: Piecewise Linear Function Approximation, Table-based]

Page 15: Storing and using the action models

Problem: exponential time to determine the expected value of the next state:

- Each shop’s consumption is independent

- Value function is piecewise linear

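
The two properties listed on this slide suggest why the expectation need not enumerate every combination of shop outcomes: for a linear function of independent inventories, the expected value equals the function applied to the expected inventories. The sketch below checks this numerically under simplifying assumptions that are not from the slides: each shop's inventory drops by exactly one unit with its own probability, stockouts at zero inventory are ignored, and all outcomes fall in the same linear piece. Function and parameter names are hypothetical.

```python
from itertools import product


def expected_value_bruteforce(bias, weights, inventories, p_decrease):
    """Sum over all 2^n joint consumption outcomes (exponential in the number of shops)."""
    total = 0.0
    for outcome in product([0, 1], repeat=len(inventories)):
        prob = 1.0
        value = bias
        for w, inv, p, dec in zip(weights, inventories, p_decrease, outcome):
            prob *= p if dec else 1.0 - p
            value += w * (inv - dec)
        total += prob * value
    return total


def expected_value_linear(bias, weights, inventories, p_decrease):
    """Plug the expected inventory of each shop into the linear function (linear time)."""
    return bias + sum(w * (inv - p) for w, inv, p in zip(weights, inventories, p_decrease))
```

Both functions return the same number for any weights and probabilities, but the second touches each shop once instead of visiting all 2^n outcomes.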

Page 16: Ignoring Truck Identity

m = number of locations (10)

k = number of trucks (2-5)

With truck identity, 5 trucks: m^k = 10^5 functions
Learnable parameters: 1.1 million

Ignoring truck identity, 5 trucks: 2002 functions
Learnable parameters: 22,022
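
The counts on this slide can be checked with a short calculation. The sketch assumes that ignoring truck identity means counting unordered selections of k locations with repetition allowed, which gives C(m + k - 1, k) and matches the 2002 on the slide for m = 10, k = 5. It also assumes 11 parameters per function (a bias plus 5 truck loads and 5 shop inventories), which is consistent with both parameter totals. The names below are chosen for illustration.

```python
from math import comb

PARAMS_PER_FUNCTION = 1 + 5 + 5  # assumed: bias + 5 truck loads + 5 shop inventories


def count_functions(m, k, ignore_identity):
    """Number of linear pieces for k trucks over m locations."""
    if ignore_identity:
        return comb(m + k - 1, k)  # multisets of k locations drawn from m
    return m ** k                  # ordered k-tuples of locations


if __name__ == "__main__":
    for ignore in (False, True):
        n = count_functions(10, 5, ignore)
        print(ignore, n, n * PARAMS_PER_FUNCTION)
```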

Page 17: The problem of action-space explosion

Every action a is a vector of individual “truck actions” a = (a_1, a_2, …, a_n)

Actions grow exponentially in the number of trucks
– 9 “truck actions”
– For 2 trucks: 9^2 = 81 total actions
– For 5 trucks: 9^5 = 59,049 total actions

Page 18: Hill Climbing Search

We initialize the vector of truck actions a to all “wait” actions

We use hill climbing to reach a local optimum

Randomly perturb a truck action, repeat

This results in an order-of-magnitude improvement in search time
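
The search described on this slide might look roughly like the following. It is a sketch under assumptions, not the authors' procedure: the action names, the `restarts` parameter, the first-improvement climbing rule, and the `score` callback (which would stand in for immediate reward plus expected value of the next state) are all hypothetical.

```python
import random

# Nine hypothetical truck action names: wait, 4 unload levels, 4 move directions.
TRUCK_ACTIONS = ["wait", "unload1", "unload2", "unload3", "unload4",
                 "north", "south", "east", "west"]


def hill_climb(score, n_trucks, restarts=5, rng=None):
    """Search joint actions by changing one truck's action at a time."""
    rng = rng or random.Random()
    best = ["wait"] * n_trucks          # start from the all-wait joint action
    best_score = score(best)
    current = list(best)
    for _ in range(restarts):
        current_score = score(current)
        improved = True
        while improved:                 # climb until no single change helps
            improved = False
            for t in range(n_trucks):
                for a in TRUCK_ACTIONS:
                    if a == current[t]:
                        continue
                    candidate = current[:t] + [a] + current[t + 1:]
                    s = score(candidate)
                    if s > current_score:
                        current, current_score, improved = candidate, s, True
        if current_score > best_score:
            best, best_score = list(current), current_score
        # Randomly perturb one truck's action in a copy of the best, then climb again.
        current = list(best)
        current[rng.randrange(n_trucks)] = rng.choice(TRUCK_ACTIONS)
    return best, best_score
```

Each climbing pass scores at most 8 alternatives per truck, so 5 trucks cost about 40 evaluations per pass rather than the 59,049 needed to score every joint action. The result is a local optimum and is not guaranteed to match exhaustive search.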

Page 19: Hill climbing vs. exhaustive search for 4 and 5 trucks

10 locations, 5 shops, 5 trucks, 10^6 iterations

[Plot: average reward (y-axis, -3 to 0) vs. 1000's of iterations (x-axis, 10 to 910); series: Hill Climbing with 5 trucks; Exhaustive search, 5 trucks]

Page 20: Conclusion

Average-reward RL and piecewise linear function approximation are promising approaches for real-time product delivery

Hill climbing shows great potential for speeding up search in domains with a large action space

Problems of scaling are surmountable

Page 21: Future Work

Scaling! More trucks, more locations, more shops, more depots, and more items

Allowing trucks to move with non-uniform speeds (event-based model needed)

Real-valued shop inventory and truck load levels

