A Reinforcement Learning Approach for Product Delivery by Multiple Vehicles

Scott Proper, Prasad Tadepalli, Hong Tang, Rasaratnam Logendran
Oregon State University
Vehicle Routing & Product Delivery

Contributions of our Research
Multiple vehicle product delivery is a well-studied problem in operations research
We have formulated this problem as an average reward reinforcement learning (RL) problem
We have combined inventory control with vehicle routing
We have scaled RL methods to work with large state spaces
Markov Decision Processes

Action a
Actions are stochastic: P_ij(a)
Actions have costs or rewards: r_i(a)

[Figure: example MDP with Move and Unload actions]
Average Reward Reinforcement Learning

Goal: maximize average reward per time step
– Minimize stockout penalty + movement penalty
Policy: states → actions
Value function: states → real values
– Expected long-term reward from a state, relative to other states, when following the optimal policy
H-Learning

The value function satisfies the Bellman equation:
  h(i) = max_a [ r_i(a) − ρ + Σ_j P_ij(a) h(j) ]
where ρ is the average reward per time step
The optimal action a* maximizes the immediate reward + expected value of the next state
H-Learning is a real-time algorithm for solving the value function
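The backup described above can be sketched in a few lines. This is a simplified, illustrative version of the H-learning update, not the authors' implementation; the data structures `P`, `R` (learned transition and reward model estimates) and the step size `alpha` are assumptions.

```python
def h_update(h, rho, state, actions, P, R, alpha=0.1):
    """One H-learning backup at `state` (simplified sketch).

    h:    dict mapping states to relative values
    rho:  current estimate of average reward per time step
    P:    P[state][a] -> list of (next_state, probability) pairs
    R:    R[state][a] -> expected immediate reward
    """
    # Value of each action: immediate reward + expected next-state value
    q = {a: R[state][a] + sum(p * h[s2] for s2, p in P[state][a])
         for a in actions}
    a_star = max(q, key=q.get)
    # Update the average-reward estimate from the greedy action
    rho = (1 - alpha) * rho + alpha * (q[a_star] - h[state])
    # Bellman backup relative to the average reward
    h[state] = q[a_star] - rho
    return h, rho, a_star
```

Repeated over visited states, the backup drives h toward a fixed point of the Bellman equation above while rho converges to the gain of the greedy policy.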
H-Learning: an example 1

[Figure: five locations A–E; transitions labeled (reward, observed count/total): −.1, 1/1; 0, 9/9; 0, 0/9. Value table: A, B, C, D, E all 0.]
H-Learning: an example 2

Stockout penalty: -20

[Figure: five locations A–E; transitions labeled (reward, observed count/total): −.1, 1/1; 0, 9/10; −20, 1/10. Value table: A = −.1; B, C, D, E = 0.]
H-Learning: an example 3

[Figure: five locations A–E; transitions labeled (reward, observed count/total): −.1, 1/1; 0, 9/10; −20, 1/10. Value table: A = −.1; B, C, D, E = 0.]
H-Learning: an example 4

Move penalty: -.1

[Figure: five locations A–E; transitions labeled (reward, observed count/total): −.1, 2/2; 0, 9/10; −20, 1/10. Value table: A = −.1; B, C, D, E = 0.]
On-line Product Delivery

Deliver 1 product
9 truck actions:
– 4 levels of unload
– 4 move directions
– wait
P(inventory decrease | shop)
Stockout penalty: -20; movement penalty: -.1

[Figure: grid world with 5 shops and a depot]
The problem of state-space explosion

The loads of trucks and shop inventories are discretized into 5 levels
States grow exponentially in shops and trucks:
– 10 locations, 5 shops, 2 trucks: (10^2)(5^5)(5^2) = 7,812,500 states
– 5 trucks: (10^5)(5^5)(5^5) = 976,562,500,000 states
Table-based methods take too much time and space
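The state counts above follow directly from the discretization: 10 possible locations per truck, and 5 levels for each truck load and each shop inventory.

```python
# State count for the product-delivery MDP: locations per truck,
# load levels per truck, inventory levels per shop.
LOCATIONS, LEVELS, SHOPS = 10, 5, 5

def num_states(trucks):
    return (LOCATIONS ** trucks) * (LEVELS ** SHOPS) * (LEVELS ** trucks)

print(num_states(2))  # 7,812,500
print(num_states(5))  # 976,562,500,000
```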
Piecewise Linear Function Approximation

We use a different linear function for each possible tuple of truck locations l1,…, l5
Each function is linear in truck loads and shop inventories
Every function represents ~10 million states
This gives a million-fold reduction in learnable parameters
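A minimal sketch of such a piecewise linear value function: the tuple of truck locations selects a "piece", and each piece is a linear function of the truck loads and shop inventories. The class and its gradient update are illustrative, not the paper's code.

```python
from collections import defaultdict

class PiecewiseLinearValue:
    """One linear function (bias + weights) per tuple of truck locations."""

    def __init__(self, n_features):
        # n_features = number of truck loads + shop inventories
        self.weights = defaultdict(lambda: [0.0] * (n_features + 1))

    def value(self, truck_locs, features):
        # truck_locs selects the linear piece; features are the loads/inventories
        w = self.weights[tuple(truck_locs)]
        return w[0] + sum(wi * xi for wi, xi in zip(w[1:], features))

    def update(self, truck_locs, features, target, lr=0.01):
        # Gradient step toward `target`, touching only the selected piece
        w = self.weights[tuple(truck_locs)]
        err = target - self.value(truck_locs, features)
        w[0] += lr * err
        for i, xi in enumerate(features):
            w[i + 1] += lr * err * xi
```

Because only one piece is updated per backup, learning one location tuple leaves the values of all other tuples untouched.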
Piecewise linear function approximation vs. table-based

10 locations, 5 shops, 2 trucks, 10^6 iterations

[Figure: average reward (−8 to 0) vs. 1000's of iterations; piecewise linear function approximation outperforms table-based learning]
Storing and using the action models

Problem: exponential time to determine the expected value of the next state
Two properties make this tractable:
– Each shop's consumption is independent
– Value function is piecewise linear
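These two properties combine as follows: for a value function that is linear in the shop inventories (within the current piece), the expectation distributes over the sum, so the expected next-state value equals the value at the expected inventories. No joint enumeration of shop outcomes is needed. The sketch below assumes, for illustration, that each shop's inventory independently drops by one level with some probability.

```python
def expected_next_value(weights, inventories, p_decrease):
    """Expected next-state value under a linear piece (illustrative).

    weights:     [bias, w1, ..., wn] of the current linear piece
    inventories: current shop inventory levels
    p_decrease:  per-shop probability that inventory drops by one level
    """
    # Linearity: E[V(next)] = V(E[next inventories]); independence means
    # each shop's expectation can be taken separately.
    expected_inv = [inv - p for inv, p in zip(inventories, p_decrease)]
    return weights[0] + sum(w * x for w, x in zip(weights[1:], expected_inv))
```

This reduces the cost of one expected-value computation from exponential in the number of shops to linear.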
Ignoring Truck Identity

m = number of locations (10); k = number of trucks (2–5)
With truck identity, m^k functions:
– 5 trucks: 10^5 = 100,000 functions; 1.1 million learnable parameters
Ignoring truck identity, only multisets of locations matter, C(m+k−1, k) functions:
– 5 trucks: 2,002 functions; 22,022 learnable parameters
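The two function counts can be reproduced directly: ordered location tuples give m^k pieces, while unordered multisets of locations give the standard stars-and-bars count C(m+k−1, k). (Each piece carries 11 parameters here: a bias plus 5 load and 5 inventory weights, which matches the quoted parameter totals.)

```python
from math import comb

m, k = 10, 5  # locations, trucks

with_identity = m ** k              # ordered tuples of truck locations
without_identity = comb(m + k - 1, k)  # multisets of truck locations

print(with_identity)     # 100,000 functions
print(without_identity)  # 2,002 functions
```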
The problem of action-space explosion

Every action a is a vector of individual "truck actions": a = (a1, a2,…, an)
Actions grow exponentially in the number of trucks:
– 9 "truck actions" per truck
– For 2 trucks: 9^2 = 81 total actions
– For 5 trucks: 9^5 = 59,049 total actions
Hill Climbing Search

We initialize the vector of truck actions a to all "wait" actions
We use hill climbing to reach a local optimum: randomly perturb one truck's action, keep the change if it improves the value, and repeat
This results in an order-of-magnitude improvement in search time
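The search above can be sketched as follows. The evaluation function `q` (the value of a joint action in the current state) and the fixed step budget are assumptions for illustration; the actual stopping criterion in the paper may differ.

```python
import random

TRUCK_ACTIONS = ["wait", "north", "south", "east", "west",
                 "unload1", "unload2", "unload3", "unload4"]

def hill_climb(n_trucks, q, steps=200, rng=random):
    """Hill climbing over joint truck actions (illustrative sketch)."""
    joint = ["wait"] * n_trucks      # start from the all-wait joint action
    best = q(joint)
    for _ in range(steps):
        i = rng.randrange(n_trucks)            # pick one truck at random
        cand = joint[:]
        cand[i] = rng.choice(TRUCK_ACTIONS)    # perturb its action
        v = q(cand)
        if v > best:                           # keep only improvements
            joint, best = cand, v
    return joint, best
```

Each step evaluates a single joint action instead of all 9^k of them, which is where the order-of-magnitude speedup over exhaustive search comes from.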
Hill climbing vs. exhaustive search for 4 and 5 trucks

10 locations, 5 shops, 5 trucks, 10^6 iterations

[Figure: average reward (−3 to 0) vs. 1000's of iterations; hill climbing with 5 trucks vs. exhaustive search with 5 trucks]
Conclusion

Average-reward RL and piecewise linear function approximation are promising approaches for real-time product delivery
Hill climbing shows great potential for speeding up search in domains with a large action space
Problems of scaling are surmountable
Future Work
Scaling! More trucks, more locations, more shops, more depots, and more items
Allowing trucks to move with non-uniform speeds (event-based model needed)
Real-valued shop inventory and truck load levels