Decentralized Coordinated Optimal Ramp Metering using Multi-agent Reinforcement Learning
by
Kasra Rezaee
A thesis submitted in conformity with the requirements for the degree of Doctor of Philosophy
Civil Engineering Department, University of Toronto
© Copyright by Kasra Rezaee 2014
Decentralized Coordinated Optimal Ramp Metering using Multi-agent Reinforcement Learning
Kasra Rezaee
Doctor of Philosophy
Civil Engineering Department, University of Toronto
2014
Abstract
Freeways are the major arteries of transportation networks. In most major cities in North
America, including Toronto, infrastructure expansion has fallen behind transportation demand,
causing escalating congestion problems. It has been realized that infrastructure expansion cannot
provide a complete solution to congestion problems owing to economic limitations, induced
demand, and, in metropolitan areas, simply lack of space. Furthermore, the drop in freeway
throughput due to congestion exacerbates the problem during rush hours, when the
capacity is needed most. Dynamic traffic control measures provide a set of cost-effective
congestion mitigation solutions, among which ramp metering (RM) is the most effective approach.
This thesis proposes a novel optimal ramp control (metering) system that coordinates the actions
of multiple on-ramps in a decentralized structure. The proposed control system is based on
reinforcement learning (RL); therefore, the control agents learn the optimal actions from interaction
with the environment and without reliance on any a priori mathematical model. The agents are
designed to function optimally in both independent and coordinated modes. Therefore, the whole
system is robust to communication or individual agent’s failure. The RL agents employ function
approximation to directly represent states and actions with continuous variables instead of relying
on discrete state-action tables. Use of function approximation significantly speeds up the learning
and reduces the complexity of the RL agent design process. The proposed RM control system is
applied to a meticulously calibrated microsimulation model of the Gardiner Expressway
westbound in Toronto, Canada. The Gardiner Expressway is the main freeway running through
Downtown Toronto and suffers from extended periods of congestion every day. It was chosen as
the testbed for highlighting the effectiveness of the coordinated RM. The proposed coordinated
RM algorithm, when applied to the Gardiner model, resulted in a 50% reduction in total travel time
compared with the base case scenario and significantly outperformed approaches based on the
well-known ALINEA RM algorithm. This improvement was achieved while the permissible on-ramp queue limit was satisfied.
Dedication
To my loving wife
Sahar
who made it all possible.
Acknowledgements
First and foremost, I would like to express my sincere gratitude to my supervisor, Professor Baher
Abdulhai. I would like to thank him for his deep insight, wisdom, invaluable guidance, advice and
limitless support during the development of this thesis. His patience and understanding have been
an inspiration during my graduate studies.
I also want to express my thanks for the comments and suggestions provided by the thesis
committee members Professor Matthew Roorda, Professor Amer Shalaby, and Professor Khandker
M. Nurul Habib.
I would also like to acknowledge the generous financial support I received from the University of
Toronto, Professor Baher Abdulhai, Fortran Traffic Systems, and the Canadian Automobile
Association.
Finally, I would like to thank the members of the Transportation Group for their valuable
advice and instructive discussions. This thesis would not have been possible without the help of Dr.
Hossam Abdelgawad, Dr. Samah El-Tantawy, and Mohamed Elshenawy.
Sections from Published and In-process Papers
Portions of this thesis have been reproduced, with modifications, from the following previously
published and submitted material.
Section 4.2:
Rezaee, K., Abdulhai, B., and H. Abdelgawad, “Application of Reinforcement Learning with Continuous State Space to Ramp Metering in Real-world Conditions”, in Proceedings of the IEEE Intelligent Transportation Systems Conference, Anchorage, September 2012.
Section 5.1:
Rezaee, K., Abdulhai, B., and H. Abdelgawad, “Self-Learning Adaptive Ramp Metering: Analysis of Design Parameters on a Test Case in Toronto”, Transportation Research Record 2396, 2013.
Section 5.2:
Rezaee, K., Abdulhai, B., and H. Abdelgawad, “Closed-Loop Optimal Freeway Ramp Metering using Continuous State Space Reinforcement Learning with Function Approximation”, Transportation Research Board (TRB) 93rd Annual Meeting, Washington, D.C., January 2014.
Section 4.3 and 5.3:
Rezaee, K., Abdulhai, B., and H. Abdelgawad, “Decentralized Coordinated Optimal Ramp Metering: Application to the Gardiner Expressway in Downtown Toronto”, submitted to Transportation Research Board (TRB) 94th Annual Meeting, Washington, D.C., January 2015.
Table of Contents
Acknowledgements ......................................................................................................................... v
Sections from Published and In-process Papers ............................................................................ vi
Table of Contents .......................................................................................................................... vii
List of Tables .................................................................................................................................. x
List of Figures ................................................................................................................................ xi
1 Introduction ............................................................................................................................. 1
1.1 Freeway Traffic Control Problem .................................................................................... 1
1.2 Overview of the Proposed Methodology ......................................................................... 4
1.3 Thesis Structure ................................................................................................................ 6
2 Literature Review .................................................................................................................... 8
2.1 Pre-timed Ramp Metering .............................................................................................. 11
2.2 Traffic-responsive Ramp Metering ................................................................................ 11
2.2.1 Independent Controllers .......................................................................................... 11
2.2.2 Coordinated Controllers .......................................................................................... 14
2.3 Summary of RM approaches .......................................................................................... 18
3 Methodology: Optimal Ramp Metering Using Reinforcement Learning ............................. 21
3.1 Optimal Control Problem ............................................................................................... 21
3.2 Markov Decision Processes and Value Iteration ............................................................ 23
3.3 Reinforcement Learning: Model-free Learning ............................................................. 24
3.3.1 Q-Learning .............................................................................................................. 25
3.3.2 SARSA .................................................................................................................... 27
3.3.3 R-Learning .............................................................................................................. 28
3.4 RL with Continuous State and Action Space ................................................................. 28
3.4.1 k-Nearest Neighbours Weighted Average .............................................................. 29
3.4.2 Multi-Layer Perceptron Neural Network ................................................................ 30
3.4.3 Linear Model Tree .................................................................................................. 32
3.4.4 Advantage Updating ............................................................................................... 33
3.5 Multi-Agent Reinforcement Learning ............................................................................ 35
3.5.1 Independent Learners .............................................................................................. 36
3.5.2 Cooperative Reinforcement Learning ..................................................................... 36
3.6 Summary ........................................................................................................................ 40
4 Development of Microscopic Simulation Testbeds .............................................................. 41
4.1 Developing the Microsimulation Models ....................................................................... 42
4.1.1 Data Preparation for Real Measurements and Paramics ......................................... 42
4.1.2 Driver Behaviour Parameter Calibration ................................................................ 44
4.1.3 OD Estimation and Calibration ............................................................................... 49
4.2 Highway 401 Eastbound Collector and Keele Street ..................................................... 51
4.3 Gardiner Expressway Westbound .................................................................................. 54
5 Independent and Coordinated RL-based Ramp Metering Design and Experiments ............ 61
5.1 Experiment I – Single Ramp with Conventional RL ..................................................... 61
5.1.1 RL-based RM Controller Design for Single Ramp ................................................. 62
5.1.2 Effect of Design Parameters on RLRM Performance ............................................. 68
5.1.3 Comparison with ALINEA Controller .................................................................... 72
5.2 Experiment II – RL-based RM with Function Approximation ...................................... 73
5.2.1 Design of Function Approximation Approaches .................................................... 74
5.2.2 Simulation Results .................................................................................................. 74
5.3 Experiment III – Gardiner: Independent and Coordinated Ramp Metering .................. 78
5.3.1 RLRM Design for Coordinated Ramp Metering .................................................... 78
5.3.2 Simulation Results and Controller Evaluation ........................................................ 81
5.3.3 The Gardiner Test Case Summary .......................................................................... 97
6 Conclusions and Future Work ............................................................................................ 101
6.1 Major Findings ............................................................................................................. 102
6.2 Contributions ................................................................................................................ 103
6.3 Towards Field Implementation .................................................................................... 104
6.4 Assumptions and Limitations ....................................................................................... 106
6.5 Future Work ................................................................................................................. 107
References ................................................................................................................................... 109
Appendix A – Paramics Plug-in ................................................................................................. 114
Appendix B – Total Least Squares ............................................................................................. 115
Appendix C – Simultaneous Perturbation Stochastic Approximation ........................................ 117
List of Tables
Table 2-1 Summary of the performance of the RM approaches in the literature from control perspectives. Solid circles show better performance. .......... 19
Table 4-1 Numerically calibrated Paramics parameters for the Highway 401 model .......... 53
Table 4-2 Numerically calibrated Paramics parameters for the Gardiner model .......... 56
Table 4-3 Parameters of the Van Aerde model fitted to fundamental diagram samples from Paramics and real life. .......... 57
Table 5-1 Metering rates and associated green and red phases for the one-car-per-green metering policy .......... 63
Table 5-2 Metering rates and associated green and red phases for the discrete release rates metering policy .......... 63
Table 5-3 Summary of the simulation results for the single ramp test case with conventional RL algorithms. .......... 73
Table 5-4 Comparison of performance of different RLRM approaches .......... 77
Table 5-5 Demand for accessing the freeway mainline downstream of each on-ramp .......... 82
List of Figures
Figure 1-1 Schematic representing the traffic flow at the boundaries of a network. .......... 2
Figure 1-2 Fundamental diagram based on five-minute traffic counts and densities measured on a Japanese freeway (Sugiyama et al., 2008). .......... 3
Figure 1-3 Structure of the thesis .......... 7
Figure 2-1 Functional structure of demand capacity and ALINEA algorithms (Papageorgiou et al., 2003). .......... 13
Figure 2-2 Forecasting theory of SWARM global mode (Ahn et al., 2007). .......... 15
Figure 2-3 Schematic of model predictive control for traffic control problems (Hegyi et al., 2005). .......... 17
Figure 2-4 Hierarchical control structure with distributed controllers. .......... 18
Figure 3-1 The relationship between the algorithms presented in this chapter .......... 22
Figure 3-2 Illustration of the k-nearest neighbour algorithm for estimating the value of a new point. The four closest neighbours to the candidate point are shown. .......... 30
Figure 3-3 Multi-layer perceptron structure for function approximation applications. In this figure … are input variables, … are the hidden layer weights, … is the sigmoid non-linear function, … are output layer weights, and … is the output of the neural network. .......... 31
Figure 3-4 Illustration of input space partitioning for the linear model tree .......... 32
Figure 3-5 A simple problem showing the variation of Q-values in the states and actions. a) The base problem with 101 states, where the goal is to reach the terminal state s0 with the minimum movements. Therefore, the reward of taking each action is -1 and the discount factor is 1. b) The optimal state values and Q-values of actions. .......... 34
Figure 3-6 Different approaches for applying RL to ramp metering for a sample traffic network: a) Centralized structure with a single RL agent for the whole network, b) Isolated RL-based RM agents, c) Coordinated MARL-based RM agents. .......... 38
Figure 4-1 Relationship between occupancy and density. .......... 45
Figure 4-2 A Van Aerde model fitted to samples from a loop detector. The left figure is the flow-density relationship and the right figure is the speed-density relationship. .......... 48
Figure 4-3 Aerial map and Paramics screenshot showing part of the study area. The map shows the Highway 401 eastbound collector at the merging point of Keele St. .......... 52
Figure 4-4 Fundamental diagram fitted to samples from simulation of the calibrated Paramics model and real loop detectors. .......... 53
Figure 4-5 The evolution of morning traffic in the Paramics model compared with measurements from real loop detectors. .......... 54
Figure 4-6 Schematic of the study area network, showing the Gardiner Expressway westbound from Don Valley Parkway in the east to Humber Bay in the west. .......... 54
Figure 4-7 Aerial map of the Gardiner Expressway westbound and its Paramics model .......... 56
Figure 4-8 The left graph shows the real loop detector data and the right graph shows the Paramics model. The time-space graphs are the average speed along the Gardiner from 13:00 to 21:00. .......... 58
Figure 4-9 GEH value for vehicle counts averaged over one-hour intervals for select loop detector locations. .......... 59
Figure 4-10 Traffic flow of the calibrated Paramics model compared with real loop detector data along the Gardiner for three different time intervals. .......... 59
Figure 4-11 Aerial view of the Spadina on-ramp with information about traffic flow and queues .......... 60
Figure 5-1 Local area on an on-ramp and the loop detectors which can represent its traffic state. .......... 64
Figure 5-2 Histogram of traffic densities in a freeway section, including an on-ramp operation in the presence of an optimal RM controller. The dashed line represents the estimated critical density. .......... 65
Figure 5-3 The actual weights of … and the discounted weights which the RLRM agent considered using a discount factor of 0.94. The actual weights of … were based on a control cycle of 2 min and a minimization horizon of 1 hr. .......... 67
Figure 5-4 Effect of adding a penalty term to the reward function for severe congestion. (a) The total travel time for the freeway mainline, (b) the total travel time for the whole network. .......... 69
Figure 5-5 Performance comparison of the RLRM agent with direct action and the RLRM agent with incremental action. (a) Total travel time for the freeway mainline only, (b) total travel time for the whole network. .......... 70
Figure 5-6 Effect of different reward choices on RLRM performance. In case 1 the reward is … and the state variables are downstream density, upstream density, and on-ramp density. In case 2 the reward is … and the state variables are the same as in case 1. Case 3 is similar to case 2 with the exception that upstream density is omitted. In case 4 the reward is … and the state variables are downstream and on-ramp densities. .......... 71
Figure 5-7 Learning speed and solution quality of the presented four RL approaches. The curves above are obtained by averaging multiple epochs through a moving average window for clarity. The actual results have significantly more variation from one epoch to another because of the stochastic nature of microscopic simulation. .......... 76
Figure 5-8 The schematic of the Gardiner showing the location of entry and exit flows for each individual RM agent. .......... 79
Figure 5-9 Communication between RLRM agents of the Gardiner .......... 81
Figure 5-10 Colour-coded space-time diagram of base case traffic speed. .......... 83
Figure 5-11 Freeway throughput after the Jameson on-ramp in the base case and with ramp metering. .......... 84
Figure 5-12 Comparison of the Jameson on-ramp traffic flow in the base case and with independent ramp metering. .......... 84
Figure 5-13 Freeway throughput after the Spadina on-ramp in the base case and with independent ramp metering. .......... 85
Figure 5-14 Freeway performance for four different scenarios. The error bars show the standard deviation of values for different simulation runs. .......... 86
Figure 5-15 Average experienced travel time of vehicles starting from the Jameson zone until the end of the network in the west. .......... 87
Figure 5-16 Time-space diagram of traffic speed for RLRM-I (left) and Jameson2pmClose (right). .......... 87
Figure 5-17 Queues for the three on-ramps throughout the simulation period. .......... 88
Figure 5-18 Average travel time for trips originating during 4-5 pm from origins in the downtown to the west end of the network for the four scenarios. .......... 89
Figure 5-19 On-ramp queues for the RM algorithms which consider limited queue capacity. .......... 90
Figure 5-20 Time-space diagram of traffic speed for algorithms with limited queue space. .......... 91
Figure 5-21 Freeway performance under ramp metering with limited queue capacity. .......... 91
Figure 5-22 Travel times from different locations to the west end of the network. .......... 92
Figure 5-23 Freeway performance for coordinated RM approaches. .......... 93
Figure 5-24 On-ramp queues of coordinated and independent RLRM agents with limited queue space. .......... 94
Figure 5-25 Travel times from different locations to the west end of the network in the RLRM-C case. .......... 94
Figure 5-26 Average travel time for trips originating during 4-5 pm from origins in the downtown to the west end of the network at Humber Bay for different independent and coordinated RLRM approaches. .......... 95
Figure 5-27 Downtown on-ramp queues with the ALINEAwLC control algorithm. .......... 96
Figure 5-28 On-ramp queues (left) and travel times (right) for the RLRM-CwQE case. .......... 97
Figure 5-29 Downtown on-ramp flows entering the freeway for the RLRM-CwQE case. .......... 98
Figure 5-30 Summary of the performance of the nine scenarios for the Gardiner test case. .......... 99
Figure 6-1 Controller interface with real (a) and virtual (b) transportation environments. .......... 105
Figure 6-2 The CID developed for evaluation of MARLIN-ATSC. .......... 106
Figure B-1 Comparison of the errors for regular and total least squares. (a) Samples generated from the original fundamental diagram with measurement error on both variables, (b) errors minimized in the regular least squares method, (c) errors minimized in the total least squares method. .......... 116
1 Introduction
Freeway traffic congestion is a common problem in metropolitan areas. Whereas drivers’
perception of congestion is increased travel time, the more important issue is the drop in freeway
capacity at critical density, which causes further congestion accumulation. Traffic congestion
appears when the number of vehicles attempting to use a transportation infrastructure exceeds its
capacity. In the best-case scenario, the excess demand leads to queuing phenomena and full use of
the infrastructure. In the more common cases, this congestion leads to traffic instability,
breakdown, loss of capacity and a degraded use of the infrastructure, thus contributing to an
accelerated congestion increase. A capacity loss as low as 5% can mean 20% higher travel time
for drivers (Papageorgiou & Kotsialos, 2002). These phenomena show that such congestion is
not simply the result of excessive demand exceeding the network capacity; rather, it is the
capacity loss and subsequent infrastructure degradation that lead to escalating instability if no
suitable control systems are employed to prevent such loss.
In recent years, it has been realized that infrastructure expansion cannot provide a complete
solution to congestion problems owing to economic limitations, induced demand, and, in
metropolitan areas, simply lack of space. An alternative approach is to use dynamic traffic control
measures, such as ramp metering (RM), variable speed limits, and dynamic route guidance
(Papageorgiou, Diakaki, Dinopoulou, Kotsialos, & Wang, 2003). Among these measures, ramp
metering is the most effective traffic control measure and is widely used in different parts of the
world (Papageorgiou & Kotsialos, 2002). Ramp metering controls the flow of cars entering the
freeway through on-ramps and can help to prevent the breakdown of the freeway.
1.1 Freeway Traffic Control Problem
The traffic condition on a freeway network is a function of three variables, as illustrated in Figure 1-1:
demand entering the freeway, freeway physical capacity, and flow of vehicles exiting the freeway.
Although demand and capacity can be modified, in this research it is assumed that they are fixed.
The focus is to maximize the exit flow and get the vehicles off the freeway as soon as possible,
thereby minimizing the time vehicles spend on the freeway network. Traffic congestion limits the
freeway exit flow in two ways: 1) the blockage of off-ramps and 2) the drop in freeway throughput
because of congestion, which is commonly known as capacity drop. Therefore, preventing the
freeway from breaking down can significantly increase the exit flow.
Figure 1-1 Schematic representing the traffic flow at the boundaries of a network.
Unlike off-ramp blockage, understanding the nature of the capacity drop phenomenon is not
as straightforward. Researchers have studied the capacity drop in light of real freeway
measurements. Research on the Queen Elizabeth Way (QEW) west of Toronto has shown a
capacity drop of 5% to 6% (Hall & Agyemang-Duah, 1991) downstream of a congested section.
Research on German freeways has shown a similar capacity drop of 4% to 6% (Brilon & Ponzlet,
1996). Figure 1-2 shows the five-minute traffic count and density measured by the Japan Highway
Public Corporation. On the left side of the graph, traffic flow increases steadily as density
increases. As density increases and exceeds critical density (about 25 veh/km/lane in this example),
the freeway breaks down and traffic flow drops. The capacity drop is the difference between traffic
flow immediately before and after the critical density. Effectively, the traffic flow of a congested
bottleneck is on the right side of the fundamental diagram and the uncongested traffic flow follows
the left side of the fundamental diagram.
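The two branches of the fundamental diagram described above can be illustrated with a simple piecewise flow-density model. The sketch below is for illustration only and is not a model used in this thesis; the free-flow speed, critical density, jam density, and the 5% drop fraction are hypothetical values chosen to match the range reported above.

```python
# Illustrative piecewise fundamental diagram with a capacity drop at the
# critical density. All parameter values below are hypothetical.

V_FREE = 100.0   # free-flow speed (km/h)
K_CRIT = 25.0    # critical density (veh/km/lane)
K_JAM = 150.0    # jam density (veh/km/lane)
DROP = 0.05      # ~5% capacity drop after breakdown

Q_MAX = V_FREE * K_CRIT            # uncongested capacity (veh/h/lane)
Q_CONG_MAX = (1 - DROP) * Q_MAX    # discharge flow just after breakdown

def flow(density: float) -> float:
    """Flow (veh/h/lane) on the left (uncongested) or right (congested) branch."""
    if density <= K_CRIT:
        # Left side of the fundamental diagram: flow grows with density.
        return V_FREE * density
    # Right side: discharge flow decays linearly toward zero at jam density.
    return Q_CONG_MAX * (K_JAM - density) / (K_JAM - K_CRIT)

print(flow(K_CRIT))          # 2500.0 veh/h/lane at the critical density
print(flow(K_CRIT + 1e-9))   # ~2375 veh/h/lane once the freeway breaks down
```

The discontinuity at `K_CRIT` is the capacity drop: throughput falls by the drop fraction the moment density crosses the critical value, which is why keeping traffic on the left branch is so valuable.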
In a freeway traffic control system, the goal is to maintain the traffic condition on the left
side of the fundamental diagram but close to critical density to maximize throughput. Regulating
on-ramp traffic flow through RM provides the measures to keep traffic on the left side of the
fundamental diagram. Although closing the on-ramp altogether would solve the congestion problem,
the challenge is to let the maximum number of vehicles enter the freeway without causing the
freeway to break down. Early RM algorithms were pre-timed, with metering rates calculated
from historical traffic demand, which tended to either under-utilize or oversaturate the freeway.
Traffic-responsive RM algorithms tackle this issue by calculating the metering rate from the
current traffic condition.
Figure 1-2 Fundamental diagram based on five-minute traffic count and densities
measured on a Japanese freeway (Sugiyama et al., 2008).
There are well-established RM algorithms for controlling a single on-ramp without
limitation on the queue storage. The challenges arise when multiple closely spaced on-ramps feed
traffic into the freeway and queue storage space on each on-ramp is limited. Controlling the on-
ramps independently puts pressure on the downstream on-ramp, and it loses its effectiveness when
its queue reaches the limit. Furthermore, unbalanced queues among adjacent on-ramps will
encourage drivers to take longer routes to avoid the queue on the downstream on-ramp. Metering
closely spaced on-ramps simultaneously can resolve the issue of unbalanced queues and allow use
of the queue storage space of all ramps for management of freeway traffic. However, efficient
coordination of multiple on-ramps is not trivial because of the complexity of the freeway traffic
dynamics.
Since traffic flow is maximized at the critical density (Papageorgiou & Kotsialos, 2002), a
group of RM algorithms, e.g. ALINEA (Papageorgiou, Hadj-Salem, & Blosseville, 1991a) and its
variations (Smaragdis & Papageorgiou, 2003), focus on regulating the traffic density at its critical
value. Although these controllers are simple to design and easy to implement, they neither seek
nor guarantee optimal performance. However, they can be easily augmented through heuristic
approaches to handle coordination of multiple on-ramps (Papamichail & Papageorgiou, 2008).
Another group of RM algorithms mainly based on optimal control theory, e.g. RM based on model
predictive control (MPC) (Hegyi, De Schutter, & Hellendoorn, 2005), determine the metering rate
which directly maximizes the network performance. These algorithms use a mathematical model
of the freeway to estimate the outcome of different ramp metering policies and choose the one that
maximizes the system performance. Given their nature, these algorithms can directly optimize the
metering rate for multiple on-ramps while taking into consideration the constraints on the queues.
However, they require an accurate model of the network for optimal results, and any uncertainty
or mismatch in the model will result in suboptimal performance. Furthermore, their computation
demand increases exponentially with the network size. In the literature, model-based optimal RM
algorithms have either been implemented as pre-timed algorithms (Gomes & Horowitz, 2006;
Apostolos Kotsialos & Papageorgiou, 2004) or applied to small networks with a single on-ramp
(Bellemans, De Schutter, & De Moor, 2002; Ghods, Fu, & Rahimi-Kian, 2010).
1.2 Overview of the Proposed Methodology
Reinforcement learning (RL) (Sutton & Barto, 1998), which has attracted significant attention in
recent years, has the potential to alleviate some of the aforementioned limitations. RL agents
continuously learn from their interaction with the environment; therefore, they do not require an
explicit model of the controlled environment. RL provides the tools for solving optimal control
problems when developing a model of the system is difficult. RL has proven its effectiveness in
solving complex control problems in different fields (Crites & Barto, 1998; Khan, Herrmann,
Lewis, Pipe, & Melhuish, 2012). In transportation, researchers have employed RL for more than a
decade (Abdulhai & Kattan, 2003) in different problems such as traffic signal control (Arel, Liu,
Urbanik, & Kohls, 2010; S. El-Tantawy, Abdulhai, & Abdelgawad, 2013), ramp metering (RM) (Davarynejad, Hegyi, Vrancken, & van den Berg, 2011), and dynamic route guidance (Jacob &
Abdulhai, 2010).
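As an illustration of the core mechanism underlying these applications, the basic tabular Q-learning update can be sketched in a few lines; the state and action labels below are purely illustrative and are not those used in this thesis:

```python
def q_update(Q, s, a, r, s_next, actions, alpha=0.1, gamma=0.95):
    """One tabular Q-learning update:
    Q(s,a) <- Q(s,a) + alpha * (r + gamma * max_b Q(s',b) - Q(s,a))."""
    best_next = max(Q.get((s_next, b), 0.0) for b in actions)
    old = Q.get((s, a), 0.0)
    Q[(s, a)] = old + alpha * (r + gamma * best_next - old)

# Toy transition: a hypothetical metering agent observes a coarse
# congestion level, restricts the ramp flow, and receives a reward.
Q = {}
q_update(Q, 'congested', 'restrict', 1.0, 'free_flow', ['restrict', 'open'])
```

Repeating such updates over many interactions lets the agent estimate the long-run value of each action without any model of the traffic dynamics.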
In control problems, the control system quantifies the surrounding environment using
sensors measuring continuous variables. Conventional RL algorithms have primarily considered
discrete states to characterize the system conditions; as a result, the system’s continuous variables
must be discretized into a finite set of intervals. This discretization of continuous state variables raises two issues: 1) choosing the discretization levels introduces a trade-off between the accuracy of the state representation and the number of states; 2) breaking a continuous variable into independent discrete states neglects the correlation between nearby states.
To address these limitations, a number of studies have investigated the use of continuous variables in RL through general function approximators (Doya, 2000; Geist & Pietquin, 2013; Powell & Ma,
2011; Santamaria, Sutton, & Ram, 1997). Although similar approaches have been applied to
transportation-related applications (Heinen, Bazzan, Engel, & Ieee, 2011; Prashanth & Bhatnagar,
2011), there is still significant room for improvement.
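For illustration, one simple function approximator is a linear value estimate over Gaussian radial-basis features of a continuous measurement; nearby densities then share value estimates instead of being treated as unrelated discrete states. All numbers below are illustrative:

```python
import math

def rbf_features(density, centers, width=10.0):
    """Gaussian radial-basis features of a continuous density measurement.
    Nearby densities produce similar feature vectors, so learning at one
    state generalizes to its neighbours."""
    return [math.exp(-((density - c) / width) ** 2) for c in centers]

def q_value(weights, density, centers):
    """Linear value estimate over the RBF features."""
    return sum(w * f for w, f in zip(weights, rbf_features(density, centers)))

# Illustrative centers (veh/km) and learned weights.
centers = [0.0, 20.0, 40.0, 60.0]
weights = [0.0, 1.0, 0.5, 0.0]
```

Because the features vary smoothly with density, the value estimates at 25 and 26 veh/km differ only slightly, which is exactly the correlation between nearby states that a discrete table discards.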
Although in theory it is possible to use a single RL agent to control multiple on-ramps, in
practice it is infeasible because of the computational complexity. Fortunately, it is possible to have
multiple RL agents in a decentralized structure. Although each agent controls a single ramp, they
can coordinate their actions to maximize their collective reward rather than individual rewards.
This problem falls into the multi-agent reinforcement learning domain which is extensively
reviewed in (Busoniu, Babuska, & De Schutter, 2008). In traffic control problems, different agents
are seeking the same goal and can coordinate their actions to achieve it; therefore, they are playing
a cooperative game. Panait and Luke (2005) have summarized the cooperative learning algorithms.
Among these algorithms, traffic control problems nicely fit the requirements of the coordination
graph algorithm (Guestrin, Lagoudakis, & Parr, 2002). Although coordinated RL has been
employed in numerous traffic control studies, the applications were mainly control of surface street
traffic lights (Arel et al., 2010; Bazzan, 2009; S. El-Tantawy et al., 2013; Kuyer, Whiteson, Bakker,
& Vlassis, 2008; Salkham, Cunningham, Garg, & Cahill, 2008).
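To illustrate the idea behind coordination graphs, consider agents arranged in a chain, as on-ramps along a corridor are, with a payoff that decomposes into local terms and pairwise terms between neighbours. The jointly optimal action can then be found exactly by variable elimination along the chain rather than by enumerating all joint actions. The sketch below is a generic illustration of this technique, not the algorithm developed in this thesis:

```python
def best_joint_action(local_q, pair_q, actions):
    """Exact joint-action selection on a chain coordination graph.

    Total payoff = sum_i local_q[i][a_i] + sum_i pair_q[i][(a_i, a_{i+1})].
    A forward pass of variable elimination keeps, for each action of
    agent i, the best achievable value of agents 0..i; backtracking
    then recovers the maximizing joint action."""
    n = len(local_q)
    f = [{a: local_q[0][a] for a in actions}]  # f[i][a]: best value of 0..i
    back = []
    for i in range(1, n):
        fi, bi = {}, {}
        for a in actions:
            # best action of agent i-1 given that agent i plays a
            prev = max(actions,
                       key=lambda ap: f[i - 1][ap] + pair_q[i - 1][(ap, a)])
            fi[a] = f[i - 1][prev] + pair_q[i - 1][(prev, a)] + local_q[i][a]
            bi[a] = prev
        f.append(fi)
        back.append(bi)
    joint = [max(f[-1], key=f[-1].get)]
    for i in range(n - 2, -1, -1):
        joint.append(back[i][joint[-1]])
    return list(reversed(joint))
```

With two agents whose pairwise term rewards simultaneous restriction, the method selects the coordinated joint action even when each agent's local term alone would favour the opposite choice.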
In this research, a decentralized and coordinated RM control system is proposed. The
agents will minimize the total travel time (TTT) vehicles spend in the network while respecting
the queue constraints. The individual agents are designed such that their local TTT is minimized
when they are acting independently. If they coordinate their actions, the cumulative TTT is
minimized. Agents coordinate their actions to achieve the highest collective reward through direct
communication and negotiation. Since agents are still locally optimal when acting independently,
the system can function properly in the event of communication failure. The RL-based RM
(RLRM) agents employ function approximation, which significantly improves their learning
speed. The learning speed of RLRM agents with function approximation is more than 20 times that of conventional RLRM algorithms. Furthermore, function approximation
eliminates the curse of dimensionality, and makes more complex RL algorithms possible.
Evaluation of the proposed algorithm on a Paramics simulation model of the westbound direction
of the Gardiner Expressway in Toronto resulted in almost 50% saving in TTT compared with the
base case, which is 10% more than the savings of the well-known ALINEA algorithm.
1.3 Thesis Structure
The structure of the thesis is illustrated in Figure 1-3. After the introduction, a literature review of
the ramp metering algorithms is presented in Chapter 2. A summary of the algorithms is provided
to identify gaps and limitations. Chapter 3 provides the details of the proposed methodology.
Besides the conventional RL algorithms, three function approximation techniques and their
application to RL are discussed. Additionally, the coordination of RL agents based on coordination
graphs for playing a cooperative game is presented. Chapter 4 discusses development and
calibration of the microsimulation models used for training and evaluation of the proposed
algorithms. The calibration of the models involves calibrating the Paramics driver behaviour
parameters and calibrating the dynamic demand. The two models are the Highway 401 eastbound
collector at Keele Street and the westbound direction of the Gardiner Expressway, both in Toronto.
In Chapter 5, three experiments are presented. The first experiment is the design of a single agent
RLRM for the Highway 401 test case, and analysis of the effect of different design parameters on
the agent’s performance. The second experiment, also on the Highway 401 model, extends the
single agent RLRM to continuous state space using different function approximation techniques
and identifying the most suitable technique for RM application. The third experiment involves
applying the multiple independent RLRM agents with continuous state space to the Gardiner test
case as well as coordination of those agents and analysis of their performance. Chapter 6
summarizes the findings and lists possible future directions for this research.
Figure 1-3 Structure of the thesis
2 Literature Review
Freeways were initially built to provide an unhindered flow of traffic; therefore, freeway traffic
control measures were mainly used for safety reasons. However, recurrent and non-recurrent
congestion caused by the rapid growth in auto ownership and travel demand calls for these measures
to be used as a means to maintain the efficiency of the freeways. There are various control
measures that can effectively improve freeway networks’ efficiency, including ramp metering,
variable speed limits, and dynamic route guidance. Among these measures, the most direct and
efficient way to control freeway traffic is ramp metering (Papageorgiou & Kotsialos, 2002). Ramp
metering improves freeway traffic conditions by appropriately regulating the on-ramp’s flow.
Appropriate implementation of ramp metering can improve freeway traffic in different ways.
Ramp metering can increase mainline throughput by avoiding capacity loss, and can increase the served volume by preventing the blockage of off-ramps. Proper ramp metering algorithms can react
to incidents efficiently to minimize their effects. A. Kotsialos, Papageorgiou, Mangeas, and Haj-
Salem (2002) employed optimal control algorithms for the ramp metering problem and, through
macroscopic simulation, demonstrated outstanding improvements in large freeway networks.
Similar results were obtained through microscopic traffic simulation of various adaptive ramp
metering algorithms (Chu, Liu, Recker, & Zhang, 2004; Hasan, Jha, & Ben-Akiva, 2002).
Although ramp metering can be very effective, there are certain limitations associated with
it. One challenge is associated with the limited queue storage capacity of the on-ramps. If the queue
exceeds the on-ramp queue capacity, the connected arterial will be adversely affected. The simple
and widely adopted approach to prevent queues from exceeding on-ramp capacity is a queue
override algorithm employed alongside the main ramp metering algorithm. The queue override
algorithm calculates the on-ramp flow required to prevent the queue from exceeding on-ramp
capacity and has priority over ramp metering algorithms. Another challenge associated with ramp
metering arises when multiple on-ramps are present along a corridor. Usually the freeway traffic
flow increases as vehicles enter the freeway along the route. Therefore, the bottleneck is at the
downstream on-ramps. Consequently, these on-ramps experience the longest queues of cars
whereas upstream on-ramps have no queues. This phenomenon penalizes drivers entering from the
downstream ramps and could encourage drivers to change their route and take upstream ramps,
which is counterproductive. Furthermore, sacrificing users of downstream on-ramps for the benefit
of users of upstream on-ramps can be viewed as inequitable and may make the public reluctant to
accept ramp metering. To resolve this issue, proper coordination among adjacent on-ramps is
required to meter all on-ramps simultaneously.
Ramp metering strategies (as well as traffic control strategies in general) can be classified
along several dimensions, such as:
pre-timed vs. traffic-responsive;
independent vs. coordinated;
heuristic vs. optimal;
centralized vs. decentralized.
Pre-timed vs. traffic-responsive: pre-timed ramp metering strategies are derived off-line,
based on historical demands, for particular times of day. These control approaches act in an open-
loop manner and do not take into account variations in traffic condition, resulting in either overload
of the mainstream traffic flow (congestion) or under-utilization of the freeway capacity. Unlike
pre-timed strategies, traffic-responsive ramp metering strategies are based on real-time
measurements from sensors installed in the freeway. Traffic-responsive strategies change the
control signal in response to varying traffic conditions, thereby properly reacting to disturbances
and demand variations.
Independent vs. coordinated: independent strategies make use of measurements from the
vicinity of a single ramp, and do not consider the information from other parts of the network.
Local ramp metering applied independently to multiple ramps of a freeway is very efficient in
terms of Total Travel Time (TTT) if unlimited queue storage space is available. However, ramp
queues must be restricted to avoid interference with adjacent street traffic. Releasing the queued
cars prematurely into the freeway to avoid queue spillback to local streets results in congestion on
the freeway mainline. As a result, mainline congestion cannot always be avoided merely by
independent control and limited queue storage of a single ramp. In addition, providing equity for
users of different on-ramps, which plays an important role in the acceptance of ramp metering, is
not possible with independent controllers. Coordinated ramp metering relies on the traffic
condition and on-ramp queue information from multiple on-ramps. This allows the system to
utilize the queue space available on all on-ramps to prevent freeway breakdown. In addition,
coordination allows the system to homogenize the queues and provide the same level of service to
all users.
Heuristic vs. optimal: in traffic control problems, the main goal is to minimize TTT.
Optimal approaches, usually based on optimal control theory and dynamic programming, are able
to find the metering rates that directly minimize the TTT. These approaches are usually based on
a mathematical model of the freeway and look for optimal metering rates. The optimal control
approach can be employed for both independent and coordinated control systems. However,
because of the complex and non-linear nature of the flow of traffic, optimal approaches are usually
computationally intensive and require an accurate model of traffic network. On the other hand,
heuristic approaches rely on traffic flow characteristics to simplify the problem; therefore, they do
not directly minimize the TTT. As an example, since traffic flow is highest at critical density, a
heuristic approach can regulate the traffic density around critical density to maximize traffic
throughput, and as a result minimize TTT. Heuristic approaches can also be employed for
coordinating multiple on-ramps, e.g. equalizing queues of upstream on-ramps with the downstream
on-ramps.
Centralized vs. decentralized: the coordination of on-ramps can be performed in a
centralized or decentralized structure (independent controllers are decentralized by nature). In a
centralized structure, measurements from the entire network are collected, and a central controller
computes timing for all on-ramps. Under ideal conditions, centralized systems can achieve the
maximum possible performance. However, in large-scale problems the computation needs and
communications overhead limit the practicality of such systems. Furthermore, the reliability of
centralized systems is very poor as failure of a single component can paralyze the whole system.
Decentralized systems place the intelligence at the controlled location by distributing the
controllers throughout the network. In this structure, controllers act on their own based on local
measurements as well as high-level information from other controllers, which is necessary for
coordination.
The ideal RM control system is a traffic-responsive optimal controller. The control system
should calculate the metering rate based on the current state of traffic, i.e. employ a control law
for calculation of the metering rate. Due to the stochastic and nonlinear nature of freeway traffic,
finding the optimal control law is not trivial. An alternative approach often used in literature is to
employ model-based optimization. In this approach, the effects of different metering rates, during
the control horizon, are evaluated using a mathematical model. Because the optimization is performed for specific demand scenarios, the resulting solution is not traffic-responsive. Although repeating the
optimization at every control cycle would result in a traffic-responsive control system, the
complexity of the optimization limits its applicability to larger problems.
2.1 Pre-timed Ramp Metering
In pre-timed ramp metering (also known as fixed-time or time-of-day), historical traffic flow data
are used to calculate the metering rate throughout the day. Since these algorithms are derived off-
line, they can employ complex traffic flow models to calculate the metering rates. The most
prominent fixed-time ramp metering algorithm is AMOC (A. Kotsialos et al., 2002). AMOC
employs a second-order macroscopic model of the traffic network called METANET (Messmer &
Papageorgiou, 1990) to solve a non-linear optimization problem with the objective of minimizing
total travel time. Solving the problem off-line makes it possible to solve complex problems while
incorporating queue storage space constraints. Assuming the real traffic condition matches the
historical values, the result would be an optimal and coordinated solution for ramp metering.
However, any inaccuracies in the freeway traffic model or unexpected deviations in traffic
condition from the historical values will significantly degrade the system performance.
Gomes and Horowitz (2006) employed a similar optimal pre-timed ramp metering
approach with a first-order macroscopic model named the asymmetric cell transmission model.
First-order models are much simpler than second-order models; therefore, solving a nonlinear
optimization problem with a first-order model is possible for much larger networks. Despite
behaving very well in terms of reproducing congestion, the asymmetric cell transmission model fails to capture the capacity drop phenomenon, which is a significant factor in the
effectiveness of ramp metering in reality. Although the numerical results of this paper show
improvement in TTT and elimination of congestion, the significance of their solution remains
questionable and needs evaluation with a more complex simulation model that can capture the
capacity drop. Nonetheless, eliminating congestion opens blocked off-ramps and allows vehicles to exit the freeway faster, which is a possible reason for the improved TTT.
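For concreteness, a minimal first-order cell-transmission update can be written in a few lines. This toy version, with all parameters illustrative, conserves vehicles but, like the model discussed above, has no mechanism for capacity drop: a congested cell's discharge flow never falls below the nominal capacity.

```python
def ctm_step(n, inflow, v=0.9, w=0.3, n_jam=200.0, q_max=40.0, dt_over_dx=1.0):
    """One update of a minimal first-order cell-transmission model.

    n: vehicles in each cell; inflow: demand entering cell 0.
    The flow between cells is min(sending, receiving), each capped
    at q_max; the last cell discharges freely at its sending flow."""
    sending = [min(v * x, q_max) for x in n]            # demand of each cell
    receiving = [min(w * (n_jam - x), q_max) for x in n]  # supply of each cell
    f_in = [min(inflow, receiving[0])]
    for i in range(len(n) - 1):
        f_in.append(min(sending[i], receiving[i + 1]))
    f_out = f_in[1:] + [sending[-1]]
    return [x + dt_over_dx * (fi - fo) for x, fi, fo in zip(n, f_in, f_out)]
```

Because each inter-cell flow appears once as an outflow and once as an inflow, the update conserves vehicles, which is easy to verify numerically.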
2.2 Traffic-responsive Ramp Metering
2.2.1 Independent Controllers
Independent controllers are the simplest type of traffic-responsive ramp metering controllers.
Independent controllers rely on local measurements only; therefore, they are naturally
decentralized. One of the earliest independent controllers is the demand-capacity algorithm
(Masher et al., 1975), where the metering rate is calculated from the difference between upstream
flow and capacity of the freeway as follows:
r(k+1) = max{q_cap − q_in(k), r_min}  if ρ_out(k) ≤ ρ_cr;  r(k+1) = r_min  otherwise, (2.1)

where r(k+1) is the metered on-ramp flow for the next time step, q_cap is the freeway capacity, q_in(k) is the measured upstream flow, ρ_out(k) is the density downstream of the ramp, ρ_cr is the critical density of the freeway, and r_min is the minimum permissible metered ramp flow. The demand-capacity algorithm is considered an open-loop or feed-forward control approach, because the output of the system, the downstream traffic, is not directly employed in the calculation of the control signal. Like any feed-forward system, this algorithm is prone to model deficiencies, and its performance will degrade if the capacity value q_cap is not accurate.
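A minimal sketch of the demand-capacity rule, with all numeric values illustrative:

```python
def demand_capacity_rate(q_cap, q_in, rho_out, rho_cr, r_min):
    """Feed-forward demand-capacity metering for one control step.

    Admit the residual capacity q_cap - q_in while the downstream
    density is below critical; otherwise fall back to the minimum
    permissible metered flow."""
    if rho_out <= rho_cr:
        return max(q_cap - q_in, r_min)
    return r_min
```

Note the feed-forward character: if the assumed q_cap is wrong, the released flow is wrong, and nothing in the rule corrects for it.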
To overcome the limitations of calculating the metering rate based on freeway capacity,
Papageorgiou et al. (1991a) proposed ALINEA, a feedback-based ramp metering algorithm. Based
on the relation of flow and density shown in Figure 1-2, traffic flow is maximized at critical
density. The ALINEA algorithm varies the metering rate to regulate the density downstream of the
ramp to a desired value close to critical density as follows:
r(k+1) = r(k) + K_R[ρ̂ − ρ_out(k)], (2.2)

where ρ̂ is the desired density and K_R > 0 is a control parameter. This control structure is one of the simplest linear time-invariant (LTI) controllers and is known as an I-controller or integral controller.
The functional structure of ALINEA and demand capacity is shown in Figure 2-1. The simplicity
of ALINEA and its variations (Smaragdis & Papageorgiou, 2003; Smaragdis, Papageorgiou, &
Kosmatopoulos, 2004) has made it the most well-known ramp metering controller and its
performance is validated through field implementations (Papageorgiou, Hadj-Salem, &
Middelham, 1997).
Considering the performance improvements of ALINEA and its simplicity, researchers
have employed more complex control algorithms to regulate occupancy or density. H. M. Zhang
and Ritchie (1997) proposed a non-linear controller that employs a neural network in place of the control parameter K_R in ALINEA. The neural network replaces the constant parameter
with one that varies according to the density of the mainline to provide better regulation of traffic
density. Sun and Horowitz (2005) employed a linear controller based on optimal control theory to
regulate mainline density. The proposed approach utilizes a linear first-order model that switches
between congested and free-flow conditions.
Figure 2-1 Functional structure of the demand-capacity (feed-forward, open-loop) and ALINEA (closed-loop) algorithms (Papageorgiou et al., 2003).
The aforementioned approaches take a heuristic approach to freeway control, as they do
not directly minimize the TTT. Ghods, Kian, and Tabibi (2007) took a semi-optimal approach to
freeway traffic control. They presented a fuzzy controller, which calculates metering rate based on
mainline density. A fuzzy controller maps the measurement to the metering rate by using a non-
linear relation, allowing more refined control over the changes in metering rates. The parameters
of the fuzzy controller are tuned with a genetic algorithm with the goal of minimizing TTT of a
test case simulated using a second-order macroscopic model.
Notable independent ramp metering algorithms that are truly optimal are those based on reinforcement learning. Applying conventional RL approaches to larger problems with
multiple ramps, however, is not practical. In (Davarynejad et al., 2011), the authors have presented
an RL-based ramp metering controller. The controller is trained and evaluated using a modified
version of the METANET macroscopic model. Another example of RL-based ramp metering
system is presented in (Jacob & Abdulhai, 2010) involving the Gardiner Expressway eastbound.
In this study, the Paramics microscopic simulation model is employed for training and evaluation
of the ramp metering controllers. To account for the states unseen by the RL agent, a CMAC neural
network is used to generalize the learning outcome of the RL agent.
2.2.2 Coordinated Controllers
Whereas independent ramp meters can be very effective and easy to implement, they cannot
provide equity among different on-ramps. Additionally, their performance is significantly
degraded when the ramp queue storage space is limited. In RM problems with limited queue
storage, a separate algorithm calculates the minimum metering rate, which keeps the queue below
a maximum admissible length. Increasing the minimum metering rate forces the ramp meter to release cars into the mainline prematurely, causing the freeway to break down. Although
progression of congestion upstream will trigger ramp metering for upstream on-ramps, this natural
coordination will significantly degrade freeway performance. Coordinated ramp metering
approaches try to leverage the space available on multiple adjacent on-ramps and simultaneously
meter multiple ramps to achieve higher performance when on-ramp storage space is limited. A
common by-product of proper coordinated ramp metering is more homogeneous waiting times for
different on-ramps.
2.2.2.1 Coordination Based on Heuristics
Bottleneck (Jacobsen, Henry, & Mahyar, 1989), implemented in Seattle, and Zone (Lau, 1997),
implemented in the Minneapolis/St Paul area, are early coordinated algorithms which are
extensions of the demand-capacity algorithm. Bottleneck consists of a local-level component and
a system-level component, each calculating a metering rate. The more restrictive metering rate is
then applied to the ramp. The local controller is conceptually similar to demand-capacity. To
calculate the system-level metering rate, the freeway is divided into several sections depending on
loop detector locations. For each section, the number of vehicles stored in that section during a
one-minute interval is calculated from the difference between entry and exit rates. If the difference
for any section is greater than zero, i.e. vehicles are being stored in that section, the metering rate
of the on-ramps with influence over that section is reduced. Similar to Bottleneck, the Zone
algorithm extends the demand-capacity to a region rather than a local section. The metering rates
of all ramps are calculated simultaneously by taking into account entry and exit flows, capacity of
the freeway bottleneck, and estimated number of vehicles on the freeway. Chu et al. (2004)
evaluated the performance of the two algorithms using microscopic simulation and observed that
they are inferior to ALINEA in their conventional form.
SWARM (Paesani, Kerr, Perovich, & Khosravi, 1997) is another ramp metering system
which relies on heuristics to coordinate multiple on-ramps, and has been extensively implemented
in Southern California. SWARM calculates two separate metering rates, a local and a global, and
applies the more restrictive one. The local mode can be any local ramp metering system. The global
mode operates based on forecast densities at the system’s bottleneck locations. The future density
of a bottleneck is estimated by linear regression of immediate past samples. The forecast density
is compared with a threshold, and the difference is used to modify the desired current density of the
bottleneck, as illustrated in Figure 2-2. Given the current and desired bottleneck density, the
volume reduction that should be applied to upstream on-ramps is calculated. Despite widespread
implementation in southern California, field evaluation of SWARM in Portland, Oregon has not
shown any noticeable improvement over pre-timed ramp metering (Ahn, Bertini, Auffray, Ross,
& Eshel, 2007).
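The global-mode forecast amounts to fitting a least-squares line through the most recent density samples and extrapolating it. A minimal sketch:

```python
def forecast_density(samples, horizon):
    """Fit a least-squares line through the last few density samples
    (taken at unit time steps) and extrapolate `horizon` steps beyond
    the most recent sample, SWARM-style."""
    n = len(samples)
    x_mean = (n - 1) / 2.0
    y_mean = sum(samples) / n
    num = sum((x - x_mean) * (y - y_mean) for x, y in enumerate(samples))
    den = sum((x - x_mean) ** 2 for x in range(n))
    slope = num / den
    # prediction at x* = (n - 1) + horizon
    return y_mean + slope * (n - 1 - x_mean + horizon)
```

For samples that already lie on a line the extrapolation is exact; for noisy samples the regression smooths the trend before projecting it forward.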
Figure 2-2 Forecasting theory of SWARM global mode (Ahn et al., 2007).
The first attempt at a coordinated extension of ALINEA was the METALINE algorithm
presented by Papageorgiou, Blosseville, and Haj-Salem (1990). METALINE is the multi-input
multi-output extension of ALINEA obtained by vectorization of the ALINEA equation:
r(k+1) = r(k) − K₁[ρ(k) − ρ(k−1)] − K₂[ρ_s(k) − ρ̂_s], (2.3)

where r = [r₁ … r_n]ᵀ is the vector of flows of the n controllable on-ramps, ρ = [ρ₁ … ρ_m]ᵀ is the vector of m measured densities, ρ_s = [ρ_{s,1} … ρ_{s,p}]ᵀ is the vector of p potential bottleneck densities, ρ̂_s is the vector of desired bottleneck densities, and K₁ and K₂ are n×m and n×p matrices of control weights, respectively. Although the concept behind METALINE is sound, its marginal improvement over the local controller ALINEA does not justify the complex design procedure.
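The vectorised update is straightforward to state in code; a minimal sketch with plain lists standing in for vectors and matrices, and all numeric values illustrative:

```python
def metaline_rates(r, rho, rho_prev, rho_s, rho_s_hat, K1, K2):
    """Vectorised METALINE-style update:
    r(k+1) = r(k) - K1 [rho(k) - rho(k-1)] - K2 [rho_s(k) - rho_s_hat]."""
    def matvec(M, v):
        return [sum(m * x for m, x in zip(row, v)) for row in M]
    d1 = matvec(K1, [a - b for a, b in zip(rho, rho_prev)])
    d2 = matvec(K2, [a - b for a, b in zip(rho_s, rho_s_hat)])
    return [ri - a - b for ri, a, b in zip(r, d1, d2)]
```

The design burden the text refers to lives entirely in choosing the weight matrices K1 and K2, which couple every ramp to every measurement.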
Papamichail and Papageorgiou (2008) proposed a linked ramp metering control strategy
based on ALINEA to equalize the queue length of each on-ramp with the one downstream of its
location. For each on-ramp, three metering rates are calculated. The first is the regulator's metered ramp flow, r_s(k), calculated from the ALINEA control law in (2.2). The second is the queue override ramp flow, r_w(k), obtained from the queue control law as:

r_w(k) = −(1/T)[ŵ_max − w(k)] + d(k−1), (2.4)

where ŵ_max is the desired maximum queue, w(k) is the current queue length, d(k−1) is the demand entering the on-ramp, and T is the control cycle. This control law tries to maintain the ramp flow at a level that ensures the queue does not exceed the desired maximum. The third is the linked control ramp flow, r_l(k), which coordinates each on-ramp with the one downstream of its location so that they have similar queue lengths. The control law limits the metering rate to maintain a desired minimum queue as:

r_l(k) = −K_w[ŵ_min − w(k)] + d(k−1), (2.5)

where K_w is a control parameter which may be set equal to 1/T for a quick response or to a smaller value for a smoother response, and ŵ_min is the desired minimum queue calculated according to the queue of the downstream on-ramp. ŵ_min is initially zero; it is set to the same value as the downstream on-ramp's queue when that queue exceeds a certain threshold, and reset to zero when the downstream queue falls below the threshold. The final metering rate is calculated as:

r(k) = max{min[r_s(k), r_l(k)], r_w(k)}. (2.6)
The authors evaluated the linked ramp metering algorithm on a macroscopic model and
observed that it is comparable to ALINEA when there is no limit on the queue storage. However,
if the ramp queue space is limited, linked ramp metering has significantly better performance than
ALINEA. A more refined variation of the above linked ramp metering algorithm named HERO
(Papamichail, Papageorgiou, Vong, & Gaffney, 2010b) has been field implemented at Monash
Freeway in Australia. This algorithm, contrary to the other coordinated algorithms mentioned
above, can be implemented in a decentralized structure.
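A minimal sketch of the queue control law and of one way the three flows can be combined, with the queue-override flow acting as a binding lower bound; all numeric values are illustrative:

```python
def queue_override_flow(w_max, w, demand, T):
    """Queue control in the spirit of (2.4): the ramp flow needed so the
    queue does not exceed w_max over one control cycle of length T."""
    return -(w_max - w) / T + demand

def linked_metering_rate(r_s, r_w, r_l):
    """Combine the three flows of the linked strategy: the regulator
    flow r_s and the linked-control flow r_l restrict the rate, while
    the queue-override flow r_w overrides both as a lower bound."""
    return max(min(r_s, r_l), r_w)
```

When the queue is well below its maximum, the regulator or the linked control dominates; once the queue approaches the limit, r_w grows and forces the ramp to release vehicles regardless of mainline conditions.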
2.2.2.2 Optimal Coordination
The heuristic approaches to coordination mentioned above are simple to implement; however, they
generally have complicated tuning processes for their numerous parameters, which make their
performance subpar in practice. Another approach to the coordination of multiple on-ramps used in the literature is to employ a macroscopic model of the traffic network and solve the non-linear
optimization problem of minimizing TTT. To provide a traffic-responsive solution, the
optimization problem should be solved repeatedly in every control cycle. Although the obtained
solution is for the whole control horizon, only the first cycle of the solution is applied to the
network. Such control systems are known as model predictive control (MPC) or receding horizon
control. The algorithm searches for the set of metering rates, one for each control cycle in the control horizon, N_C, that minimizes the cost over the prediction horizon, N_P. The schematic
of the MPC for traffic control problems is shown in Figure 2-3.
Figure 2-3 Schematic of model predictive control for traffic control problems (Hegyi et al.,
2005).
Bellemans et al. (2002) and Hegyi et al. (2005) successfully employed MPC for optimal
traffic-responsive ramp metering. The freeway traffic network is modelled by the second-order
macroscopic model METANET. To handle the high computational demand of solving a non-
linear optimization problem in every control cycle, only a single on-ramp was considered for
control. Ghods et al. (2010) have employed the same MPC approach, but have proposed a
decentralized solution for solving the non-linear optimization problem. The decentralized solution
is based on the Game Theory concept Fictitious Play (Brown, 1951). Decentralization allows the
computation to be handled by multiple nodes, making the approach applicable to bigger problems.
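The receding-horizon loop itself is compact; the complexity lies in the predictive model and in the optimization over rate sequences. The sketch below replaces the nonlinear program with enumeration of a few candidate sequences and uses an invented toy model for prediction, so every function and number here is illustrative:

```python
def simulate(rho, rates):
    """Toy prediction model (illustrative only): density rises with the
    admitted ramp flow and relaxes by a fixed amount each step."""
    traj = []
    for r in rates:
        rho = rho + 0.01 * r - 5.0
        traj.append(rho)
    return traj

def cost(traj, target=28.0):
    """Penalise squared deviation from a target density over the horizon."""
    return sum((x - target) ** 2 for x in traj)

def mpc_step(state, candidates, simulate, cost):
    """Receding-horizon selection: evaluate each candidate metering-rate
    sequence over the prediction horizon, keep the cheapest sequence,
    and apply only its first rate before re-optimizing next cycle."""
    best = min(candidates, key=lambda seq: cost(simulate(state, seq)))
    return best[0]
```

Applying only the first rate and re-solving each cycle is what turns an open-loop optimization into a traffic-responsive controller; the cost of doing so is that the optimization must finish within one control cycle.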
2.2.2.3 Hierarchical Control Approach
Pre-timed RM systems based on optimal control result in optimum performance in the absence of
any disturbance. Although MPC can mitigate the performance drop caused by disturbances, its computational cost limits the size of the traffic networks it can handle. Papamichail et al. proposed a hierarchical
control approach to provide semi-optimal control for large traffic networks (Papamichail,
Kotsialos, Margonis, & Papageorgiou, 2010a). The hierarchical control approach, shown in
Figure 2-4, consists of three modules. The state estimation and prediction module constantly
monitors traffic condition to estimate the state of traffic and predict future demands. The non-
linear optimization module solves the optimization problem to find the optimal metering rates and
corresponding optimal traffic densities every 10 minutes. In the presence of disturbance, the 10-
minute control signals become sub-optimal and could result in unstable conditions. The third
module takes the output of the optimization module as the input. However, instead of directly
applying the signals to traffic lights, an ALINEA regulator is employed to regulate the system
around the set point provided by the optimizer. ALINEA improves system robustness and
provides predictable behaviour during the 10-minute optimization interval.
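The two-layer interaction can be sketched as a slow optimisation loop nested around a fast ALINEA loop; the set-point function, gain, and all numbers below are illustrative:

```python
class HierarchicalController:
    """Upper layer recomputes the density set-point every `reopt_every`
    cycles (standing in for the nonlinear optimizer); a lower-layer
    ALINEA regulator tracks the set-point in between."""

    def __init__(self, optimize_setpoint, r0=900.0, k_r=70.0, reopt_every=10):
        self.optimize = optimize_setpoint
        self.r = r0
        self.k_r = k_r
        self.reopt_every = reopt_every
        self.setpoint = None

    def step(self, k, rho_out):
        if self.setpoint is None or k % self.reopt_every == 0:
            self.setpoint = self.optimize(k)          # slow optimisation layer
        self.r += self.k_r * (self.setpoint - rho_out)  # fast ALINEA layer
        return self.r
```

Between re-optimizations the fast loop keeps correcting for disturbances, which is precisely the robustness the hierarchical structure buys over applying the optimizer's output open-loop.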
Figure 2-4 Hierarchical control structure with distributed controllers.
2.3 Summary of RM approaches
Table 2-1 summarizes the capabilities of notable RM approaches in the literature from several significant control perspectives. Solid circles represent the best performance and empty circles the poorest. Although this classification is highly subjective and reflects the author's view, it is meant to be a brief illustration of the capabilities of the different RM approaches in a single table.
Table 2-1 Summary of the performance of the RM approaches in the literature from control perspectives. Solid circles show better performance.

Performance criteria (table columns): traffic-responsive; coordinated; performance with no queue limit; performance with limited queue space; recall computation simplicity; robustness to demand variation; robustness to model imperfection; decentralized implementation.

Approaches (table rows): Demand capacity (Masher et al., 1975); Zone (Lau, 1997); SWARM (Paesani et al., 1997); ALINEA (Papageorgiou et al., 1991a); Fuzzy control (Ghods et al., 2007); METALINE (Papageorgiou et al., 1990); HERO (Papamichail et al., 2010b); AMOC (A. Kotsialos et al., 2002); MPC (Bellemans et al., 2002); Decentralized MPC (Ghods et al., 2010); Hierarchical MPC (Papamichail et al., 2010a).

[The individual circle ratings of the original table are not reproduced in this transcript.]
Heuristic approaches, which regulate traffic state, are simple and generally have fair
performance in applications with unlimited queue space. However, unlimited queue space is rarely
practical and the performance of these isolated RM approaches quickly degrades if queue storage
space is limited. Whereas coordinated variants of these approaches lessen the negative effect of
limited queue space, the heuristic nature of the coordination is not very effective in complex
scenarios.
RM approaches that directly optimize freeway performance using a freeway model
seek optimal coordination between multiple on-ramps and result in the best solution in any
condition, given that perfect information is available. However, any deficiency in the quality of the
model or of the predicted future demands will quickly degrade their performance. Furthermore, the
computational effort needed to solve the complex non-linear optimization problem grows exponentially
with problem size, limiting the practicality of these approaches to small problems.
The present work, therefore, seeks closed-loop optimal RM in the form of a control law,
i.e. metering rates that are calculated based on the current state of traffic and that directly optimize
the network performance. Additionally, a decentralized approach to coordination is considered to
facilitate the application of the proposed control system to large problems.
3 Methodology: Optimal Ramp Metering Using Reinforcement
Learning
The present chapter elaborates the methodologies and algorithms used in different stages of
developing the RL-based RM algorithms. The components of the methodology are described in
the order they are used in this research. The research goal is to develop and apply an optimal
control methodology for practical-sized RM applications, which is only possible through
coordination of decentralized RM controllers. In this chapter, after briefly describing the optimal
control problem and its challenges, RL is presented as the solution that has shown tremendous
promise in other applications such as adaptive signal control. However, the limitations of
conventional RL algorithms with discrete states become apparent as the problem size grows.
Therefore, function approximation approaches such as k-nearest neighbours (kNN), multi-layer
perceptron (MLP) neural network, and linear model tree (LMT) are put forward to directly
represent continuous states and actions. These algorithms are built on top of the established
concepts of discrete RL algorithms while mitigating their limitations. Finally, the coordination
graph concept from Game Theory is presented as the means for coordination of multiple RM
agents. Figure 3-1 shows the relationship between the theories and algorithms discussed in this
chapter. The implementation of the methodologies in this chapter to RM is described in detail in
Chapter 5 together with presentation and discussion of the results.
3.1 Optimal Control Problem
Optimal control problems generally involve maximizing1 a reward value. The total reward to be
maximized is in fact the accumulation of instantaneous rewards received over time. Let us denote
reward incurred at time $t$ by $r(x_t, u_t, w_t)$, where $x_t$ is the state of the system, $u_t$ is the control
action taken, and $w_t$ is a random parameter. Therefore, the problem of finding the optimum total
reward, $J^*(x_0)$, for a given initial condition, $x_0$, can be formulated as:

$$J^*(x_0) = \max_{u_0, u_1, \dots} E\left[\sum_{t=0}^{\infty} \gamma^t\, r(x_t, u_t, w_t)\right]$$
$$\text{subject to: } x_{t+1} = f(x_t, u_t, v_t) \qquad (3.1)$$

1 Given that any minimization problem has a dual maximization problem, analysis of the minimization problems is omitted.

where $\gamma$, $0 \le \gamma \le 1$, is the discount factor, which shows the significance of earlier rewards
compared with later rewards, function $f(\cdot)$ defines the dynamics of the system, and $v_t$ is a random
parameter associated with the uncertainty of the system dynamics. Although it is clear from (3.1)
that the total reward depends on the stream of the actions, each action will influence the trajectory
of the system. Therefore, each action will affect the instantaneous reward as well as the rewards
that will be observed in the future. The optimal control problem is to find the balance between the
immediate reward and the effect of the actions on future rewards.
[Figure 3-1 diagram: stochastic optimal control and Markov decision processes lead to reinforcement learning (Q-learning, SARSA, eligibility traces, self-learning of a control law); function approximation (kNN, MLP, LMT) yields RL with continuous states and actions (kNN-TD(λ), MLP-based RL, LMT-based RL, advantage updating); game theory (cooperative games, coordination graph, locally optimal action selection) yields cooperative multi-agent RL.]
Figure 3-1 The relationship between the algorithms presented in this chapter
The problem can be solved for a given initial condition and neglecting the random
parameters using optimization algorithms. However, the resulting stream of actions will be an open
loop solution, which will not perform well due to uncertainties. For systems with the Markov
property, the problem in (3.1) can be greatly simplified. In systems with the Markov property, the
future is independent of the past given the present. In other words, the effects of an action taken in
a state depend only on that state and not on previous history of the system. When modeling a traffic
flow system with wave equations, the underlying wave equations will provide the relation from
one state to the next; hence, the system will have the Markov property. Since the current state
captures all relevant information, the optimal action in the current state is independent of the past
states and actions. Therefore, instead of a stream of actions dependent on the initial state we can
look for a policy, a function that maps states to actions. Such a policy would be an optimal policy
if it maximizes the total reward, thereby resulting in a closed-loop optimal control system. Finding
an optimal policy directly from (3.1) is not straightforward, considering the convoluted effect of the
system dynamics on the total reward. To simplify the problem, dynamic programming can be
employed to break the problem into smaller problems. The dynamic programming (DP) equivalent
for optimal control problems is based on Bellman’s principle of optimality (Bellman, 2010,
Chapter III.3) which implies that “an optimal policy has the property that whatever the initial state
and initial decision are, the remaining decisions must constitute an optimal policy with regard to
the state resulting from the first decision.” Considering this principle, the optimization problem of
(3.1) can be simplified to Bellman’s equation:

$$J^*(x_0) = \max_{u_0} E\left[ r(x_0, u_0, w_0) + \gamma\, J^*(x_1) \right]$$
$$\text{subject to: } x_1 = f(x_0, u_0, v_0) \qquad (3.2)$$
In Bellman’s equation we choose $u_0$, knowing that our choice will cause the next state to
be $x_1 = f(x_0, u_0, v_0)$. That new state will then affect the decision problem from time $t=1$ going
forward. Bellman’s equation (3.2) is a functional equation, because it involves an unknown
function $J^*(\cdot)$.
3.2 Markov Decision Processes and Value Iteration
A Markov decision process (MDP) provides the framework for modelling a problem that involves
random outcomes and behaviour and is under the influence of a decision-maker. In fact, an MDP
is a discrete time stochastic control process. An MDP is defined by the tuple
$\langle S, A, P(s,a,s'), R(s,a,s') \rangle$, where $S$ is the set of states, $A_s$ is the set of actions available from
state $s$, $P(s,a,s')$ is the probability that taking action $a$ in state $s$ will lead to state $s'$, and
$R(s,a,s')$ is the expected reward received because of the transition from state $s$ to state $s'$ provided
that action $a$ is taken. Although MDPs are not limited to systems with finite states and actions, the
conventional algorithms for solving MDPs assume states and actions are finite. Therefore, the
functions $P(\cdot)$ and $R(\cdot)$ can be simplified to a matrix form. Considering an MDP with finite states,
equation (3.2) can be rewritten as:

$$V(s) = \max_{a} \sum_{s' \in S} P(s,a,s') \left[ R(s,a,s') + \gamma\, V(s') \right]. \qquad (3.3)$$
In general, DP algorithms for solving (3.3) are iterative, and value iteration is one of the
most notable and fundamental ones. Value iteration starts with an initial value for $V(\cdot)$, initializing
all states. Then, $V$ is updated by calculating the right-hand side of (3.3) for every state. The
updating process is repeated until $V(\cdot)$ has converged, i.e. $V(s)$ does not change from one
iteration to the next for any state. Value iteration is effectively a repetition of the following
equation:

$$V_{k+1}(s) = \max_{a} \sum_{s' \in S} P(s,a,s') \left[ R(s,a,s') + \gamma\, V_k(s') \right], \quad \forall s \in S \qquad (3.4)$$
Besides the basic assumption that states are finite, there are two challenges associated with
DP-based algorithms which limit their usage in practice (Sutton & Barto, 1998). First, DP assumes
availability of a perfect model that describes the transition probabilities and reward values.
Although assuming availability of a deterministic model of the system is not unreasonable, a
stochastic model that describes the uncertainties in a traffic network in the form of a transition
probability matrix is far from practical. Second, whereas DP works well for small synthetic
problems, in practice the number of states and actions increase exponentially with the problem
size. Therefore, the computation and storage requirements of DP limit its feasibility in practice.
3.3 Reinforcement Learning: Model-free Learning
Reinforcement learning (RL) is inspired by human’s trial-and-error learning behaviour and aims
to solve the optimal control problem without a priori knowledge of the model of the system (Sutton
& Barto, 1998). In RL, agents only perceive the state of the environment and the instantaneous
scalar reward $r_{t+1}$ as the system transitions from one state to another; hence, there is no
need to know the transition probabilities a priori. The agent learns the optimal actions through direct
interaction with the environment and by trying various actions in various states and observing their
outcomes.
3.3.1 Q-Learning
Numerous algorithms with plausible convergence speed and easily-customized parameters are
used to solve single-agent RL tasks, the most notable of which is the Q-learning approach of
Watkins (Watkins & Dayan, 1992). In Q-learning, instead of $V(\cdot)$, which defines the value of states,
a function $Q(s,a)$ is used which quantifies the expected value of state $s$ provided that action $a$ is
taken. Effectively, the function $Q(s,a)$ facilitates the comparison of the quality of different actions
within a state. The value associated with a state-action pair is also known as a Q-value. Q-learning
is directly derived from Bellman’s equation and is similar in nature to value iteration. For every
new time step $t+1$, the value of $Q(s_t, a_t)$ is calculated according to the reward received and the value
of the future state, and compared with the current estimate of $Q(s_t, a_t)$. The value of future
states is obtained from past experience and stored in the latest Q-value estimates. The function
$Q(\cdot)$ is updated with every new training sample according to:

$$Q(s_t, a_t) \leftarrow (1-\alpha)\, Q(s_t, a_t) + \alpha \left[ r_{t+1} + \gamma \max_{a} Q(s_{t+1}, a) \right] \qquad (3.5)$$

where $r_{t+1}$ is the reward received after performing action $a_t$ at state $s_t$ and moving to the new state
$s_{t+1}$, and $\alpha$, $0 < \alpha \le 1$, is the learning rate. If the learning rate is set to 1 the old value will be
replaced with the new estimation. However, because of the stochastic nature of MDPs, it is
necessary to calculate the average value over multiple samples. Hence, the old values will be
partially updated to provide new estimations. A more detailed description of the Q-learning
algorithm can be found in Watkins and Dayan (1992).
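The update rule (3.5) can be sketched in a few lines of Python; this tabular agent is a generic illustration (class and parameter names are hypothetical), not the thesis implementation:

```python
from collections import defaultdict

class QLearner:
    """Tabular Q-learning agent implementing the update rule (3.5)."""
    def __init__(self, actions, alpha=0.1, gamma=0.95):
        self.Q = defaultdict(float)   # unseen (state, action) pairs default to 0
        self.actions = actions
        self.alpha, self.gamma = alpha, gamma

    def update(self, s, a, r, s_next):
        # Target: reward plus discounted value of the best next action.
        target = r + self.gamma * max(self.Q[(s_next, b)] for b in self.actions)
        # Blend the old estimate and the new sample with learning rate alpha.
        self.Q[(s, a)] = (1 - self.alpha) * self.Q[(s, a)] + self.alpha * target
```

With $\alpha < 1$, each sample only partially updates the stored value, which is the averaging behaviour the text describes for stochastic transitions.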
3.3.1.1 Learning Rate
In stochastic problems, a decreasing learning rate is usually employed. Q-learning, when employed
with a decreasing learning rate with the following characteristics, is guaranteed
to suppress uncertainties and converge to the optimal Q-values (Watkins & Dayan, 1992):

$$\sum_{i=1}^{\infty} \alpha_{i(s,a)} = \infty, \qquad \sum_{i=1}^{\infty} \alpha_{i(s,a)}^2 < \infty, \qquad \forall s, a \qquad (3.6)$$

where $i(s,a)$ is defined as the index of the $i$th time that action $a$ is tried in state $s$. The first
function which may come to mind with the above characteristics is $\alpha_i = 1/i$. This choice of
learning rate will result in exact averaging of samples over time. Since the value of $Q(\cdot)$ for the
next state is present in the updating rule and is likely to have a better estimate later in the learning
process, a learning rate that decays but stays higher than $1/i$ later in the learning process is
advisable. As discussed by Even-Dar and Mansour (2003), a learning rate of the form
$\alpha_i = 1/i^{\omega}$ results in much better convergence when $\omega = 0.8$ than when $\omega = 1$.
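The polynomially decaying schedule above is easy to state in code (a trivial sketch; the function name is hypothetical):

```python
def learning_rate(i, omega=0.8):
    """Polynomially decaying learning rate alpha_i = 1 / i**omega.

    omega = 1 gives exact sample averaging; omega around 0.8 decays
    more slowly, weighting later (better-informed) samples more heavily.
    """
    return 1.0 / i ** omega
```

Both choices satisfy the conditions in (3.6): the sum of the rates diverges for $\omega \le 1$, while the sum of their squares converges for $\omega > 0.5$.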
3.3.1.2 Action Selection Policy
RL algorithms are guaranteed to converge to optimal values after infinite samples. In practice, the
luxury of infinite samples is not available, and we need to look for quick but reliable convergence
to optimal values. Since RL is based on trial and error, finding the balance between exploration,
trying new and potentially suboptimal actions, and exploitation, taking the optimal action, is
critical to Q-learning convergence. The two common approaches to action selection are $\epsilon$-greedy
and soft-max. In $\epsilon$-greedy, the best action is chosen with probability $1-\epsilon$, and a random action
with probability $\epsilon$, where $0 \le \epsilon \le 1$ is the tuning parameter. Generally, at the beginning of the
learning process $\epsilon = 1$ for completely random action selection, and as the agent learns, $\epsilon$ is decreased.
Although $\epsilon$ can be decreased all the way to zero for greedy action selection, maintaining a non-zero
$\epsilon$, e.g. $\epsilon = 0.1$, ensures that the agent keeps exploring. In contrast to $\epsilon$-greedy action selection,
which does not differentiate between actions when choosing a random action, soft-max action
selection assigns a probability to each action according to the Q-value of that action. The
probability of choosing action $a$ in state $s$ is calculated by:

$$P(a|s) = \frac{e^{Q(s,a)/\tau}}{\sum_{b \in A_s} e^{Q(s,b)/\tau}} \qquad (3.7)$$

where $\tau$, $\tau > 0$, is a tuning parameter. A large $\tau$ will result in probabilities that are more or less
uniform and independent of the Q-values, which is desirable in the early stages of the learning
process. As $\tau$ gets smaller, the actions with higher Q-values have higher probabilities. When $\tau$ gets
very close to zero, the action selection becomes greedy, resulting in a probability of almost 1 for
the action with the highest Q-value.
It is usual for the tuning parameters $\epsilon$ and $\tau$ to be varied based on the learning time.
However, in some applications, including traffic control problems, the states visited vary according
to the agent's policy and the actions it takes. Some areas of the state space are visited only after the
agent consistently chooses the optimal actions in other areas of the state space. Therefore, the agent
will not have the chance to explore these states if the action selection tuning parameter only
depends on learning time. To overcome this limitation and ensure that the agent only exploits when
all actions in a state have been explored, the tuning parameters $\epsilon$ and $\tau$ can be varied according to the
number of visits to that state as the maturity measure.
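Both selection rules can be sketched compactly; the following Python functions are generic illustrations of $\epsilon$-greedy and of the soft-max rule (3.7), not the thesis code:

```python
import math
import random

def epsilon_greedy(q_values, epsilon):
    """Pick the best action with probability 1 - epsilon, else a uniform random one."""
    if random.random() < epsilon:
        return random.randrange(len(q_values))
    return max(range(len(q_values)), key=lambda a: q_values[a])

def softmax(q_values, tau):
    """Sample an action with probability proportional to exp(Q / tau), eq. (3.7)."""
    m = max(q_values)                      # subtract the max for numerical stability
    weights = [math.exp((q - m) / tau) for q in q_values]
    r, acc = random.random() * sum(weights), 0.0
    for a, w in enumerate(weights):
        acc += w
        if r < acc:
            return a
    return len(q_values) - 1               # fallback for floating-point round-off
```

A large `tau` flattens the weights toward a uniform draw, while `tau` near zero makes `softmax` behave like the greedy branch of `epsilon_greedy`.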
3.3.2 SARSA
Another notable RL algorithm directly derived from Bellman’s equation is SARSA. SARSA
stands for state-action-reward-state-action, which is the order in which information is received and
used for updating $Q(\cdot)$. In SARSA, the update of $Q(s_t, a_t)$ depends on the action actually taken at
time step $t+1$, which is used instead of the optimal action. The updating rule for SARSA can be written
as:

$$Q(s_t, a_t) \leftarrow (1-\alpha)\, Q(s_t, a_t) + \alpha \left[ r_{t+1} + \gamma\, Q(s_{t+1}, a_{t+1}) \right] \qquad (3.8)$$
Since the updating is based on the actions that the agent takes, the estimated $Q(\cdot)$ will
depend on the policy of the agent at that stage. However, as the agent matures and its policy
shifts from taking random actions to choosing the optimal actions, $Q(\cdot)$ will converge toward its
optimal value. Because $Q(\cdot)$ depends on the agent’s policy, the learning speed of the agent
is slower compared with Q-learning.
3.3.2.1 Eligibility Traces
In problems with a discount factor close to one, i.e. problems with significantly delayed rewards,
the convergence of the Q-values can be very slow. This is because the updated value of state $s_t$
will not affect previously visited states until the next visit to them. One way to mitigate this
issue is the eligibility traces (Singh & Sutton, 1996) mechanism. In eligibility traces, the trail of
successively visited states is stored so that the states that contributed to the rewards received can
be traced back and updated accordingly. Typically, eligibility traces decay exponentially according
to the product of the discount factor $\gamma$ and a decay parameter $\lambda$, $0 \le \lambda \le 1$. The trace itself can be
defined by:

$$e_t(s) = \begin{cases} 1 & \text{if } s = s_t \\ \gamma \lambda\, e_{t-1}(s) & \text{if } s \neq s_t \end{cases} \qquad (3.9)$$

where $e_t(s)$ represents the trace for state $s$ at time $t$, and $s_t$ is the visited state at time $t$. The
eligibility trace defined in (3.9) is a replacing eligibility trace, as the trace of state $s$ is reset to 1
every time it is visited.
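Combining (3.8) and (3.9), a single SARSA step with replacing traces can be sketched as follows (an illustrative sketch with hypothetical names; traces are kept per state-action pair here):

```python
from collections import defaultdict

def sarsa_lambda_step(Q, e, s, a, r, s_next, a_next,
                      alpha=0.1, gamma=0.99, lam=0.9):
    """One SARSA update with replacing eligibility traces (eqs. 3.8 and 3.9).

    Q and e are dicts keyed by (state, action); the TD error is spread
    over all recently visited pairs in proportion to their trace.
    """
    delta = r + gamma * Q[(s_next, a_next)] - Q[(s, a)]
    e[(s, a)] = 1.0                      # replacing trace: reset to 1 on a visit
    for key in list(e):
        Q[key] += alpha * delta * e[key]
        e[key] *= gamma * lam            # exponential decay of older traces
        if e[key] < 1e-8:                # drop negligible traces to bound memory
            del e[key]
```

Each call propagates the current temporal-difference error back along the trail of visited state-action pairs, which is what speeds up convergence when rewards are delayed.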
3.3.3 R-Learning
Q-learning can be applied to discounted infinite-horizon problems. It can also be applied to
problems with undiscounted reward as long as the optimal policy leads to a state with zero reward.
R-learning (Mahadevan, 1996) is an extension of Q-learning to problems where the average reward is
maximized instead of the total discounted reward. In R-learning the goal is to maximize the
average expected reward, $\rho$, per time step:

$$\rho = \lim_{T \to \infty} \frac{1}{T}\, E\left[ \sum_{t=0}^{T} r_t \right] \qquad (3.10)$$

In this method, instead of reinforcing the instantaneous reward, $r_{t+1}$, the transient difference
in the reward, $r_{t+1} - \rho$, is used as the reinforcement. Therefore, the equivalent of the Q-learning update
law for R-learning is:

$$Q(s_t, a_t) \leftarrow (1-\alpha)\, Q(s_t, a_t) + \alpha \left[ r_{t+1} - \rho + \max_{a} Q(s_{t+1}, a) \right] \qquad (3.11)$$

The R-learning method has an additional unknown variable, $\rho$, which should be learned by the
agent. This variable is updated iteratively, only at steps where the best action is taken, i.e.
$a_t = \arg\max_a Q(s_t, a)$, as follows:

$$\rho \leftarrow (1-\beta)\, \rho + \beta \left[ r_{t+1} + \max_{a} Q(s_{t+1}, a) - \max_{a} Q(s_t, a) \right] \qquad (3.12)$$

where $\beta$ is a learning parameter balancing past experience and new samples in updating
$\rho$. Although for many problems the average reward criterion better represents the actual
problem than a discounted reward criterion, the convergence problems exhibited by R-learning
have prevented it from being widely adopted.
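The pair of updates (3.11) and (3.12) can be sketched together in one small agent (a generic illustration with hypothetical names; here $\rho$ is refreshed after the Q update, one of several reasonable orderings):

```python
from collections import defaultdict

class RLearner:
    """R-learning: maximizes the average reward rho instead of a discounted
    return, following the update rules of eqs. (3.11) and (3.12)."""
    def __init__(self, actions, alpha=0.1, beta=0.05):
        self.Q = defaultdict(float)
        self.actions = actions
        self.alpha, self.beta, self.rho = alpha, beta, 0.0

    def best(self, s):
        return max(self.Q[(s, a)] for a in self.actions)

    def update(self, s, a, r, s_next):
        target = r - self.rho + self.best(s_next)          # eq. (3.11) target
        greedy = self.Q[(s, a)] == self.best(s)            # was the greedy action taken?
        self.Q[(s, a)] = (1 - self.alpha) * self.Q[(s, a)] + self.alpha * target
        if greedy:
            # Average-reward estimate updated only on greedy steps, eq. (3.12).
            self.rho = (1 - self.beta) * self.rho + self.beta * (
                r + self.best(s_next) - self.best(s))
```

Restricting the $\rho$ update to greedy steps keeps the average-reward estimate from being biased by deliberately exploratory actions.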
3.4 RL with Continuous State and Action Space
Q-learning in its conventional form uses a table to represent the function $Q(s,a)$. Using a table limits
the practicality of Q-learning in complex problems that involve a multidimensional continuous
state space. Continuous states should be discretized for application by the conventional RL
approaches developed for discrete states (such as Q-learning). In addition to the exponential
growth of the discrete state space with the increase in problem size, discretization introduces a
trade-off between learning speed and system optimality. Finer discretization is likely to result in
better overall system performance, but the increased number of state-action pairs requires more
training samples and a longer training process. To overcome this challenge, one could use fine
discretization in sensitive regions and coarse discretization elsewhere. Although a theoretically
feasible approach, non-uniform discretization adds complexity to the design process.
In problems with continuous state space, it is expected that small movements in the state
space will result in minimal variations in the system’s behaviour. Therefore, in the RL context,
states that are closely spaced are expected to have close Q-values. Discrete states fail to exploit this
feature, and the problem is exacerbated as the discretization becomes finer.
The limitations mentioned above can be mitigated by using a general function
approximator which replaces the table representing $Q(s,a)$. Unlike Q-tables with hard boundaries,
function approximators enable the estimation of any intermediate Q-values in continuous space.
Additionally, function approximators make better use of the learning samples as each sample
updates the whole Q-function rather than a single element in the table, thereby resulting in much
faster learning speed. Three of the most notable function approximation approaches for RL are: k-
nearest neighbour weighted average, multi-layer perceptron neural network, and linear model tree.
3.4.1 k-Nearest Neighbours Weighted Average
A class of function approximators which are effective and easy to use in RL problems are sparse
coarse-coded function approximators (Santamaria et al., 1997). One method of this class, which
has shown very promising results, is based on the k-nearest neighbours concept. In theory, the k-
nearest neighbours temporal difference (kNN-TD(λ)) method (Martin, de Lope, & Maravall,
2011), can represent continuous state space in a manner which is very similar to the table-based
Q-learning with the added support for continuous state space. Therefore, all the solid theories
behind Q-learning can be seamlessly applied to kNN-TD(λ).
In kNN-TD(λ), a set of centers $C$, each with an explicit Q-value, is generated in the state
space. The estimation of the Q-value of a new point $s$ in the state space is shown in Figure 3-2.
The set $knn$, which contains the k-nearest neighbours of $s$ in the set $C$ based on Euclidean
distances $d_i$, is identified. A probability is then assigned to each of the centers in $knn$ as:

$$p_i = \frac{w_i}{\sum_{j \in knn} w_j}, \qquad w_i = \frac{1}{1 + d_i^2}, \qquad \forall i \in knn \qquad (3.13)$$

The Q-value of a state-action pair $(s,a)$ is then defined as the weighted average of the Q-values
of the points in set $knn$ with weights $p_i$:

$$Q(s, a) = \sum_{i \in knn} p_i\, Q(c_i, a) \qquad (3.14)$$

The updating of the Q-values of set $knn$ is performed by a similar process. With every new sample,
the k-nearest neighbours of $s_t$ can be identified and updated according to:

$$\delta = r_{t+1} + \gamma \max_{a} Q(s_{t+1}, a) - Q(s_t, a_t) \qquad (3.15)$$

$$Q(c_i, a_t) \leftarrow Q(c_i, a_t) + \alpha\, \delta\, p_i, \qquad \forall i \in knn \qquad (3.16)$$

The number of visits to a state-action pair can be similarly estimated as:

$$visits(s, a) = \sum_{i \in knn} p_i\, v(c_i, a), \qquad (3.17)$$

where $v(c_i, a)$ is the number of visits to center $c_i$ and action $a$. A more detailed description of the
kNN-TD(λ) algorithm can be found in Martin et al. (2011).
Figure 3-2 Illustration of the k-nearest neighbours algorithm for estimating the value of a new point. The four closest neighbours to the candidate point are shown.
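The weighted-average estimate of (3.13)-(3.14) can be sketched as follows (an illustrative numpy sketch with a hypothetical set of five centers in a 2-D state space; the returned neighbour indices and weights are what the update rules (3.15)-(3.16) would reuse):

```python
import numpy as np

def knn_q_estimate(centers, Q, state, action, k=4):
    """Estimate Q(state, action) as the distance-weighted average of the
    k nearest centers, eqs. (3.13)-(3.14)."""
    d2 = np.sum((centers - state) ** 2, axis=1)   # squared Euclidean distances
    nn = np.argsort(d2)[:k]                        # indices of the k nearest centers
    w = 1.0 / (1.0 + d2[nn])                       # w_i = 1 / (1 + d_i^2)
    p = w / w.sum()                                # normalized weights p_i
    return float(np.dot(p, Q[nn, action])), nn, p

# Hypothetical setup: 5 centers in a 2-D state space, 2 actions.
centers = np.array([[0., 0.], [1., 0.], [0., 1.], [1., 1.], [5., 5.]])
Q = np.zeros((5, 2))
Q[:, 0] = [1., 2., 3., 4., 100.]
q, nn, p = knn_q_estimate(centers, Q, np.array([0.5, 0.5]), action=0)
```

The distant fifth center is excluded from the neighbourhood, so its extreme Q-value does not contaminate the estimate; the four equidistant neighbours receive equal weights.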
3.4.2 Multi-Layer Perceptron Neural Network
Multi-layer perceptron (MLP) is a feed-forward neural network with multiple layers. In function
approximation applications, typically there is a hidden layer with H neurons and an output layer
with one neuron, as shown in Figure 3-3. The hidden layer’s neurons have non-linear activation
functions, $\phi(\cdot)$, of the sigmoid form, e.g. $\phi(x) = 1/(1+e^{-x})$.
Considering the MLP structure in Figure 3-3, the relationship between the input and output of
the MLP would be:

$$y = \sum_{j=1}^{H} v_j\, \phi\!\left( \sum_{i=1}^{n} w_{ji}\, x_i + w_{j0} \right) + v_0 \qquad (3.18)$$
Figure 3-3 Multi-layer perceptron structure for function approximation applications. In
this figure $x_1 \dots x_n$ are input variables, $w_{ji}$ are the hidden layer weights, $\phi(\cdot)$ is the sigmoid non-linear function, $v_1 \dots v_H$ are output layer weights, and $y$ is
the output of the neural network.
Training of MLPs is usually done through iterative numerical approaches. These
approaches are based on either the gradient or the Jacobian of the error with respect to weights.
The gradient and the Jacobian can be calculated by a technique called backpropagation. A simple
approach to updating the weights is gradient descent learning:

$$\theta_{t+1} = \theta_t - \eta\, \frac{\partial E_t}{\partial \theta}$$

where $\theta = [w_{10}, \dots, w_{Hn}, v_0, \dots, v_H]$ is a vector containing all the network parameters,
$\eta$ is the step size, and $E_t$ is the squared prediction error at time $t$, which in the RL context can be defined as:

$$E_t = \left[ r_{t+1} + \gamma \max_{a} Q(s_{t+1}, a) - Q(s_t, a_t) \right]^2.$$
In the iterative learning of MLP, it is desirable to avoid presenting successive samples that
are from the same region of state space to avoid saturation of the weights. Additionally, throughout
the learning process, samples should cover different regions of the state space to provide good
generalization. In traffic control problems, changes in traffic state are gradual; therefore, similar
samples in successive control cycles are likely. These facts discourage the use of learning methods
where samples are shown one by one as they are visited. To overcome these issues, all the samples
visited in the same simulation run, i.e. epoch, are kept in a pool of samples and shown to the MLP in
random order after each epoch. Additionally, samples from previous epochs are not discarded, to
ensure previous trainings are not lost with batch learning. The learning is still iterative; however,
in every epoch, the errors are calculated once, based on the last epoch’s estimate of $Q(\cdot)$, and kept
fixed during the training epoch.
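The pooled, shuffled training scheme can be sketched with a tiny numpy MLP. Everything below is a hypothetical illustration (network size, step size, and the toy regression target are all assumptions), not the thesis implementation:

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical MLP: n inputs -> H sigmoid hidden units -> one linear output.
n, H = 3, 8
W = rng.normal(scale=0.5, size=(H, n + 1))   # hidden weights (last column = bias)
v = rng.normal(scale=0.5, size=H + 1)        # output weights (last entry = bias)

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def forward(x):
    h = sigmoid(W @ np.append(x, 1.0))        # hidden activations
    return float(v @ np.append(h, 1.0)), h    # network output y, plus h for backprop

def train_epoch(pool, eta=0.05):
    """One gradient-descent pass over the pooled samples in shuffled order,
    so similar successive samples do not saturate the weights. In the RL
    setting the targets t would be recomputed once per epoch from the
    previous epoch's Q estimate and then held fixed, as in the text."""
    global W, v
    for idx in rng.permutation(len(pool)):
        x, t = pool[idx]
        y, h = forward(x)
        err = y - t                           # prediction error on this sample
        v -= eta * err * np.append(h, 1.0)    # output-layer gradient step
        # Backpropagate through the sigmoid to the hidden-layer weights.
        W -= eta * err * np.outer(v[:H] * h * (1 - h), np.append(x, 1.0))

# Toy pool standing in for the samples visited over several epochs.
pool = [(x, float(x.sum())) for x in rng.normal(size=(200, n))]
for _ in range(300):
    train_epoch(pool)
```

Shuffling with `rng.permutation` is the key step: it breaks up the runs of near-identical successive samples that gradual traffic-state changes would otherwise produce.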
3.4.3 Linear Model Tree
The linear model tree (LMT) is another approach to function approximation, whereby the input
space is partitioned into a decision tree with axis-orthogonal splits at internal nodes, as illustrated
in Figure 3-4. Partitions use local linear functions of the inputs, calculated by least squares
regression. In an LMT the decision tree is not pre-specified and the splits are based on the data.
Therefore, the challenge in training LMT is finding the split points. For training of LMT the work
presented by Potts and Sammut (2005) is adopted in this study. The process of building an LMT
starts with a single partition. Along each dimension, candidate splits are considered. To find the
axis and location where a split should be created, a primary linear model and pairs of linear models
on either side of the candidate splits are calculated. The loss function in partition $P$ depends on
the samples in that partition and can be calculated as the residual sum of squares of its linear model:

$$L_P = \sum_{k=1}^{n_P} \left( y_k - f_P(x_k) \right)^2$$

where $n_P$ is the number of samples in the partition and $f_P(\cdot)$ is the least-squares linear model fitted to them.
Figure 3-4 Illustration of input space partitioning for linear model tree
The loss functions of linear models on each side of every candidate split are also calculated
in a similar manner. Let $L_l$ and $L_u$ be the loss functions on the lower and upper sides of the
split along the $j$th dimension, respectively. Assuming Gaussian noise with unknown variance,
the Chow test for homogeneity amongst sub-samples (Chow, 1960) is used to test the null
hypothesis that the data come from a single linear model. Under this null hypothesis, $H_0$:

$$F = \frac{\left( L - L_l - L_u \right) / (d+1)}{\left( L_l + L_u \right) / \left( n - 2(d+1) \right)},$$

where $d$ is the dimension of the input, is distributed according to Fisher’s distribution with
$d+1$ and $n - 2(d+1)$ degrees of freedom. The associated p-value (probability in the tail of the
distribution) determines the probability that the null hypothesis holds. Let us denote the smallest
such probability as $p^*$, representing the best split over every split along each dimension $j$. To ensure
the split is significant enough, a split is only made when $p^* < \alpha_0$. A small enough value of $\alpha_0$
is suitable for any level of noise.
As training samples increase, the LMT splits the input space into smaller regions, resulting
in a more accurate approximation. However, it is often desirable to limit the growth of the model
tree and accept a certain approximation error by calculating a stopping parameter as follows:

$$\delta = \frac{L_l + L_u}{2\, n\, \sigma_0^2},$$

where $\sigma_0^2$ is the estimated overall variance of the output. As the model tree grows and its accuracy
increases, $\delta$ decreases. Splitting of the input space is terminated if $\delta$ falls below a certain threshold,
which achieves the trade-off between the overall model complexity and the acceptable
approximation error.
Although there are robust approaches suitable for on-line learning of model
trees (Potts & Sammut, 2005), the optimal Q-values are not known in advance because of
the bootstrapping in RL, so on-line learning of the LMT does not suit RL. Therefore, batch
learning is performed after each simulation run (epoch), i.e. all the samples gathered so far are
used to rebuild the LMT to fit the equation:

$$Q_{m+1}(s_t, a_t) = r_{t+1} + \gamma \max_{a} Q_m(s_{t+1}, a) \qquad (3.19)$$

where $Q_m(\cdot)$ is the LMT from epoch $m$, i.e. the previous model fitted to the gathered samples.
3.4.4 Advantage Updating
In theory, general function approximators can take the shape of virtually any continuous function.
However, in practice function approximators do not perfectly fit the data due to the presence of
measurement noise and the complexity of function approximator parameters. In RL algorithms
based on Q-learning, the action decision is made by comparing the Q-values of different actions
within a state. Therefore, it is of great importance that the general function approximator fits the
Q-values properly along the action axis. Often, especially in problems with a discount factor close
to one, the Q-value variations along the states dimensions are more dominant than Q-value
variations along the actions. To illustrate this phenomenon, consider the example shown in
Figure 3-5a. There are 101 states, with s0 being the terminal state. The actions are either to move
right or left, and the goal is to reach the terminal state with the minimum number of movements.
Therefore, the reward can be defined as -1 for each movement, with a discount factor of 1. Solving
the problem and finding the optimal Q-values will result in the values shown in Figure 3-5b. The
numbers under each arrow are the Q-values of taking that action in the preceding state, and the
numbers in the states represent the value of the state, which is equal to the Q-value of the optimal
action in that state. As can be seen from the figure, the variation of Q-values along the states axis is
very significant, whereas the difference in Q-value of the two actions within a state is only two. If
we deduct the value of the current state from the Q-values and define the result as the advantage value,
the resulting advantage value becomes independent of the state in this example. Therefore, the optimal
action (moving right) will have an advantage value of zero and the other action will have a value
of -2.
Figure 3-5 A simple problem showing the variation of Q-values across states and actions. a) The base problem with 101 states, where the goal is to reach the terminal state s0 with the
minimum number of movements; the reward of taking each action is -1 and the discount factor is 1. b) The optimal state values and Q-values of actions.
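The example above can be checked numerically with a few lines of Python (an illustrative sketch, assuming the leftward move is blocked at the boundary state s100):

```python
# Numerical check of the chain example in Figure 3-5: states s0..s100,
# reward -1 per move, discount factor 1, s0 terminal.
N = 100
ACTIONS = ("left", "right")
Q = {(s, a): 0.0 for s in range(1, N + 1) for a in ACTIONS}

def step(s, a):
    return s - 1 if a == "right" else min(s + 1, N)   # "right" moves toward s0

def value(s):
    return 0.0 if s == 0 else max(Q[(s, a)] for a in ACTIONS)

# Value iteration with gamma = 1 until the Q-values settle.
for _ in range(2 * N + 10):
    for s in range(1, N + 1):
        for a in ACTIONS:
            Q[(s, a)] = -1.0 + value(step(s, a))

# Q-values vary strongly across states, but the advantage A = Q - V is
# state-independent: 0 for "right" and -2 for "left" (away from the boundary).
advantage = {k: Q[k] - value(k[0]) for k in Q}
```

The computed Q-values span two orders of magnitude across states while differing by only two within each state, which is precisely the imbalance the advantage decomposition removes.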
Given the dominant variations along the state dimension, the function approximation might
sacrifice the variations along action to fit the greater variations along the states. To mitigate this
issue, the Q-value can be separated into the state value and the advantage value of each action:

$$Q(s, a) = V(s) + A(s, a) \qquad (3.20)$$

where $A(s,a)$ is the advantage of taking action $a$ in state $s$. Note that the optimal action will have
an advantage value of zero and other actions will have negative advantage values. The Q-function,
when converged according to the Bellman equation, satisfies the condition:

$$Q(s_t, a_t) = E\left[ r_{t+1} + \gamma \max_{a} Q(s_{t+1}, a) \right]. \qquad (3.21)$$

Substituting $Q(s,a)$ in (3.21) by (3.20) will result in:

$$V(s_t) + A(s_t, a_t) = E\left[ r_{t+1} + \gamma\, V(s_{t+1}) \right]. \qquad (3.22)$$

The two unknown functions, $V(\cdot)$ and $A(\cdot)$, can be updated one by one, keeping the other one
fixed:

$$V(s_t) \leftarrow (1-\alpha)\, V(s_t) + \alpha \left[ r_{t+1} + \gamma\, V(s_{t+1}) - A(s_t, a_t) \right] \qquad (3.23)$$

$$A(s_t, a_t) \leftarrow (1-\alpha)\, A(s_t, a_t) + \alpha \left[ r_{t+1} + \gamma\, V(s_{t+1}) - V(s_t) \right] \qquad (3.24)$$
The above equations are the Q-learning equivalent for advantage updating, because the
values are updated independently of the policy. It should be noted that the approximation of the
functions is not perfect, and errors in the approximated functions are likely. Because of the
intertwined effect of the two functions, positive feedback might occur and the approximation
error be exacerbated with every iteration, leading to divergence. This intertwined relation can be
broken by removing the term $A(s_t, a_t)$ from (3.23), which results in:

$$V(s_t) \leftarrow (1-\alpha)\, V(s_t) + \alpha \left[ r_{t+1} + \gamma\, V(s_{t+1}) \right] \qquad (3.25)$$

By removing the advantage term, the new equation for $V(\cdot)$ becomes dependent on the agent’s policy,
because the agent’s action choices affect the value of states. However, as the agent starts to exploit
its knowledge and chooses optimal actions, the function $V(\cdot)$ converges toward its optimal value.
This behaviour resembles the SARSA algorithm discussed in Section 3.3.2.
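One alternating step of the stable variant, pairing the advantage update (3.24) with the policy-dependent value update (3.25), can be sketched as follows (an illustrative sketch; names and parameter values are hypothetical):

```python
from collections import defaultdict

V = defaultdict(float)          # state values V(s)
A = defaultdict(float)          # advantages A(s, a); the optimal action tends to 0

def advantage_update(s, a, r, s_next, alpha=0.1, gamma=0.99):
    """One advantage-updating step: A follows eq. (3.24), while V follows
    eq. (3.25), i.e. without the advantage term, to avoid the positive
    feedback between the two approximations."""
    target = r + gamma * V[s_next]
    A[(s, a)] = (1 - alpha) * A[(s, a)] + alpha * (target - V[s])
    V[s] = (1 - alpha) * V[s] + alpha * target

def q_value(s, a):
    return V[s] + A[(s, a)]      # the decomposition of eq. (3.20)
```

Because the action decision only compares `A[(s, a)]` within a state, the function approximator no longer has to resolve small action differences on top of large state-value variations.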
3.5 Multi-Agent Reinforcement Learning
The algorithms discussed in the above sections are formalized for a single agent interacting with
an environment. In theory, they can be applied to a problem with multiple on-ramps if we assign
a single central agent to control all on-ramps simultaneously. In such a case, the environment which
the agent deals with would be the whole traffic network, and the choice of actions would be the
combination of actions of all on-ramps as illustrated in Figure 3-6a. The downside of this approach
is the “curse of dimensionality” as the number of ramps grows. The size of the state-action space grows
exponentially with the number of on-ramps. In the learning process, the increased number of states
and actions requires significantly more learning time. Additionally, in terms of the optimal action,
the search space would be much larger and finding the optimal action based on current state might
not be possible in real time. Although sound in theory, it is not practical to solve large problems
with multiple on-ramps with a single RL control agent.
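The exponential growth is easy to quantify; the figure of eight discrete metering rates per ramp below is a hypothetical example, not a value from the thesis.

```python
# Joint action space of a single central agent: with m metering rates per
# ramp and n ramps there are m**n joint actions to search at every decision
# point. Eight rates per ramp is an illustrative assumption.
rates_per_ramp = 8
for n_ramps in (1, 2, 5, 10):
    joint_actions = rates_per_ramp ** n_ramps
    print(f"{n_ramps:2d} ramps -> {joint_actions:,} joint actions")
```

At ten ramps, the central agent already faces over a billion joint actions per decision, which is why a centralized search cannot run in real time.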
3.5.1 Independent Learners
An alternative for applying RL to RM of larger traffic networks is to employ a decentralized
structure where the network is broken into smaller sections, e.g. sections that contain only a single
on-ramp, and assign an RL agent to each section as illustrated in Figure 3-6b. Each agent will
observe the state of the traffic in its local section and optimize the action to maximize the reward.
In this configuration, agents act independently of each other and resemble a local ramp metering
structure. Although the action of each agent is optimal for traffic conditions in its section, the
collective actions of all agents are not necessarily optimal for the whole network. Additionally,
lack of coordination limits the opportunities for utilizing the storage space of adjacent on-ramps
and providing equity.
3.5.2 Cooperative Reinforcement Learning
Considering that a decentralized structure is the only practical solution for applying RL to larger
networks, decentralized agents should be coordinated to provide global optimality. An RL problem
where multiple learning agents interact with shared environments is referred to as Multi-Agent
Reinforcement Learning (MARL) (Busoniu et al., 2008). MARL algorithms are usually tailored
to specific types of problems. The nature of the problem, competitive or cooperative, and the
possibility of observing actions of other agents as well as the presence of communication among
agents can affect the MARL algorithm. Panait and Luke (2005) have summarized the algorithms
which address learning of multiple agents cooperating to maximize a single reward. In terms of
traffic control problems, there are certain characteristics that help with developing a MARL
algorithm as follows:
1. In traffic control problems it is reasonable to assume that agents can freely communicate
with each other and share their state, action, and reward. Additionally, they can use the
communication to coordinate their actions to achieve higher overall reward.
2. The agents are fixed in space, and the geometry of the network is known. Therefore, agents
can be coordinated more efficiently depending on their immediate neighbours.
3. The main goal is to minimize the total time spent in the network and maximize the total
travelled distance. These goals can easily be broken into time spent and travelled distance
in different sections.
An approach that effectively employs the above characteristics is coordination graphs
(Guestrin et al., 2002; Kok & Vlassis, 2006). Coordination graphs decompose the global
Q-function into a sum of local Q-functions that each depend on the actions of only a subset
of agents. In order for agents to coordinate their actions, they need to quantify the effect of state
and action of other agents on their Q-value. Effectively, each agent has to consider an augmented state
and action that include its local state and action as well as states and actions of other agents that
influence its reward. The geometry of the network can be utilized to identify a set of neighbours
for each agent. To achieve global optimality, it is sufficient for each agent to consider the states of
its neighbours (Nair, Varakantham, Tambe, & Yokoo, 2005). The schematic of such configuration
is illustrated in Figure 3-6c.
A successful large-scale implementation of MARL in the transportation context is the work of
S. El-Tantawy et al. (2013) on traffic signal control. In spite of the promising outcome, their
approach was based on the conventional RL algorithms with discrete states that imposed certain
trade-offs and limitations, such as discretization choice, curse of dimensionality with added states,
significant memory requirement, and no generalization over observed samples. In this research,
continuous states and actions are directly represented using function approximation. Direct
representation of continuous variables simplifies the design process significantly and enables the
design of more complex control systems.
3.5.2.1 Learning in Cooperative Multi-agent RL
Let us denote the local state of agent $i$ as $s_i$ and its action as $a_i$. Let $N_i$ be the set of neighbours
of agent $i$, i.e. the agents that affect the reward of agent $i$. The augmented state for agent $i$ would be
$[s_i, s_{N_i}]$, where $s_{N_i}$ is the collective states of all neighbours of agent $i$, $j \in N_i$. Similarly, the
augmented action would be $[a_i, a_{N_i}]$, where $a_{N_i}$ is defined as the collective actions of the neighbours
of agent $i$, $j \in N_i$. The Q-learning based updating rule would be:
$$Q_i(s_i^t, s_{N_i}^t, a_i^t, a_{N_i}^t) \leftarrow (1 - \alpha)\, Q_i(s_i^t, s_{N_i}^t, a_i^t, a_{N_i}^t) + \alpha \left[ r_i^t + \gamma\, Q_i(s_i^{t+1}, s_{N_i}^{t+1}, a_i^*, a_{N_i}^*) \right] \tag{3.26}$$
where $r_i$ is the local reward for agent $i$, $Q_i(\cdot)$ is the Q-function associated with agent $i$, and the
pair $(a_i^*, a_{N_i}^*)$ are the optimal actions in state $s^{t+1}$. Note that the optimal action of each agent is not
merely the action that maximizes $Q_i(s_i, s_{N_i}, a_i, a_{N_i})$. The action of each agent is an element in the
optimal joint actions from all agents that maximize the sum of all local Q-functions. A decentralized
way to find these optimal joint actions is presented in (Kok & Vlassis, 2006).
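A minimal sketch of the neighbour-augmented update for a single agent follows, using a lookup table keyed by the augmented tuple; the thesis itself represents these quantities continuously with function approximation, and the one-state toy transition below is illustrative.

```python
# Sketch of coordinated Q-learning for one agent i: the Q-table key carries
# the agent's own state/action plus its neighbours' states/actions, so the
# learned values reflect neighbour behaviour. All numbers are toy values.
from collections import defaultdict

def coordinated_q_update(Q_i, s, s_n, a, a_n, r_i, s2, s2_n, a_star, a_n_star,
                         alpha=0.1, gamma=0.9):
    # (a_star, a_n_star): agent i's part of the optimal joint action in the
    # next augmented state, found by a decentralized joint-action search
    key = (s, s_n, a, a_n)
    target = r_i + gamma * Q_i[(s2, s2_n, a_star, a_n_star)]
    Q_i[key] = (1 - alpha) * Q_i[key] + alpha * target

Q = defaultdict(float)
# Repeating a single transition with local reward 1 drives the augmented
# Q-value toward r / (1 - gamma) = 10.
for _ in range(2000):
    coordinated_q_update(Q, 0, 0, 0, 0, 1.0, 0, 0, 0, 0)
```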
Figure 3-6 Different approaches for applying RL to ramp metering for a sample traffic network: a) Centralized structure with a single RL agent for the whole network, b) Isolated RL-based RM agents, c) Coordinated MARL-based RM agents.
Each step in Q-learning involves solving an optimization problem to find the optimal joint
actions that maximize the global reward, which can be computationally demanding. This step can be
avoided by employing the SARSA learning algorithm. The SARSA update rule for the coordinated
learning is:
$$Q_i(s_i^t, s_{N_i}^t, a_i^t, a_{N_i}^t) \leftarrow (1 - \alpha)\, Q_i(s_i^t, s_{N_i}^t, a_i^t, a_{N_i}^t) + \alpha \left[ r_i^t + \gamma\, Q_i(s_i^{t+1}, s_{N_i}^{t+1}, a_i^{t+1}, a_{N_i}^{t+1}) \right] \tag{3.27}$$
The update rule is identical to Q-learning except that the value for the next state is based on the
actual action taken rather than the optimal joint action.
The learning steps for the advantage updating algorithms (3.24) and (3.25) can be extended
to coordinated MARL as:
$$V_i(s_i^t, s_{N_i}^t) \leftarrow (1 - \alpha)\, V_i(s_i^t, s_{N_i}^t) + \alpha \left[ r_i^t + \gamma\, V_i(s_i^{t+1}, s_{N_i}^{t+1}) \right] \tag{3.28}$$

$$A_i(s_i^t, s_{N_i}^t, a_i^t, a_{N_i}^t) \leftarrow (1 - \alpha)\, A_i(s_i^t, s_{N_i}^t, a_i^t, a_{N_i}^t) + \alpha \left[ r_i^t + \gamma\, V_i(s_i^{t+1}, s_{N_i}^{t+1}) - V_i(s_i^t, s_{N_i}^t) \right] \tag{3.29}$$
As described in section 3.4.4, the advantage value should be removed from the calculation of the
value function to break the cyclic relation between the two functions and prevent divergence.
Therefore, the learning step in advantage updating does not require finding the globally optimal
joint actions.
3.5.2.2 Finding Optimal Action
Unlike independent learners that choose their action merely based on their own Q-function,
cooperative agents should take into account the effect of their actions on other agents as well. The
centralized approach for finding the optimal joint actions would involve combining all local Q-
functions to form a single Q-function of the complete state of the system and joint actions of all
agents. Then an optimization over the entire joint actions is required to maximize the sum of all
local Q-functions. Although simple in theory, this approach suffers from the curse of
dimensionality and becomes very demanding as the number of agents increases.
An alternative to a centralized search for optimal action is the locally optimal policy
generation approach presented in (Nair et al., 2005), with a similar approach being employed in
(S. El-Tantawy et al., 2013). In this approach, the joint actions are changed iteratively from an
initial choice. In each iteration, only one agent, the one that would benefit the system the most by
changing its action, gets the chance to change its action. The process is repeated until no agent can
benefit the system by changing its action. Although the resulting joint policy is proven only locally
optimal, in many cases it may actually result in globally optimal joint actions. The steps for finding
the locally optimal joint actions are as follows:
1. Each agent chooses an initial action and communicates it to its neighbours.
2. Each agent $i$, assuming the actions of its neighbours are unchanged, finds the action
which maximizes the sum of its local Q-function as well as its neighbours':
$a_i^* = \arg\max_{a_i} \sum_{j \in N_i \cup \{i\}} Q_j$.
3. Each agent calculates the gain that the system will achieve if it changes its action.
4. Only the agent that has the highest gain will change its action and the rest will be
unchanged, and the process is repeated from step 2. The process stops when no
agent can benefit the network by changing its action.
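The four steps above can be sketched as a max-gain coordinate search over the joint action. The local Q-functions and action sets below are illustrative stand-ins, and for brevity each Q here is evaluated on the full joint action rather than only its neighbours' actions.

```python
# Sketch of the locally optimal joint-action search (steps 1-4): in each
# round, the single agent whose change most improves the global value (the
# sum of all local Q-functions) updates its action. All Q-functions and
# action sets are toy examples, not quantities from the thesis.

def global_value(q_funcs, actions):
    # sum of local Q-functions evaluated at the current joint action
    return sum(q(actions) for q in q_funcs)

def locally_optimal_actions(q_funcs, action_sets, actions):
    actions = list(actions)
    while True:
        best_gain, best_agent, best_action = 0.0, None, None
        base = global_value(q_funcs, actions)
        for i, candidates in enumerate(action_sets):
            for a in candidates:           # step 2: try agent i's alternatives
                trial = actions[:i] + [a] + actions[i + 1:]
                gain = global_value(q_funcs, trial) - base
                if gain > best_gain:       # step 3: record the best gain
                    best_gain, best_agent, best_action = gain, i, a
        if best_agent is None:             # step 4: no agent can improve
            return actions
        actions[best_agent] = best_action  # only the max-gain agent moves

# Toy example: agent 0 is rewarded for matching agent 1, and also pulled
# toward action 1 by the second local Q-function.
q_funcs = [lambda a: -abs(a[0] - a[1]), lambda a: -(a[0] - 1) ** 2]
result = locally_optimal_actions(q_funcs, [[0, 1, 2], [0, 1, 2]], [0, 2])
```

On this toy problem the search settles on `[1, 1]`, which happens to be globally optimal; in general only local optimality is guaranteed.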
3.6 Summary
In this chapter, the algorithms and methods used in this research were presented. The final
outcome of the research is an optimal control system for metering multiple on-ramps using RL:
an RL-based control system that deals with continuous variables using function approximation
and enables scalability through coordination of distributed RM agents. While applying conventional
single-agent RL to freeway ramp metering is challenging in itself, function approximation and
coordination introduce two more dimensions to the complexity of the design process. To simplify
the design process, the design was performed in three stages, and the three aspects of the design were
isolated from each other to the extent possible. The three stages are performed in the same order
as the methodology is presented. The three stages involve the algorithms described in
sections 3.3, 3.4, and 3.5, respectively.
4 Development of Microscopic Simulation Testbeds
Controllers based on RL find optimal actions through trial and error via direct interaction
with the actual environment. However, it is not practical for a controller to learn through trial and
error in real freeway networks. In practice, a simulation environment is employed for training the
RL agent prior to implementation in the field. The simulated environment should closely replicate
the dynamics of the real environment to provide proper feedback to the RL agent for the learning
process. The most realistic models for simulating transportation networks are microscopic
simulators, which, with the recent advances in information technology and ITS applications, have
been established as the prime tool for assessing congestion mitigation alternatives and ITS
measures. Microsimulation is the dynamic and stochastic modelling of individual vehicle
movements within a transportation network. Each vehicle in the simulation model is
moved through the transportation network on a split-second basis according to the physical
characteristics of the vehicle (length, maximum acceleration rate, etc.), the fundamental rules of
motion (e.g. acceleration times time equals velocity, velocity times time equals distance), and rules
of driver behaviour (car following rules, lane changing rules, route choice rules, etc.), while
abiding by traffic management rules such as traffic lights, lane usage restrictions, etc. In this
research, the models were developed using Paramics©, which is a suite of high-performance
software for the microscopic simulation of realistic traffic networks.
The two Paramics models used in this research were extracted from the Greater Toronto
Area (GTA) freeway network model (Abdelgawad et al., 2011) developed at the University of
Toronto in 2009. The first model is a section of Highway 401 eastbound collector that includes the
intersection with Keele Street. This model is used for designing and evaluating algorithms that
involve only a single agent. The second model is the Gardiner Expressway westbound direction,
which is used for evaluating different coordination approaches.
It should be noted that the use of a simulation model to train RL agents is not to be confused
with a model of the controlled environment as in dynamic programming and value iteration
methods for instance (refer to section 3.1). The latter requires complete knowledge of system
dynamics including state transition probabilities and rewards associated with actions taken in each
state, for all state-action combinations, prior to solving the control problem. RL methods learn
from direct interactions with the controlled system and sample through the state-action space
repeatedly, similar to learning to play chess for instance by playing the game repeatedly. The use
of a simulation model in RL training merely provides a replica of the real traffic environment for
the RL agent to interact with in a safe and controlled manner until the agent learns the optimal
control policy. After that, the RL agent can be deployed into the real traffic environment with a
mature control policy, but can also continue to refine the learnt optimal control policy perpetually.
4.1 Developing the Microsimulation Models
Paramics can accurately reproduce detailed traffic information that matches the real network, given
that the parameters and geometry of the network are properly modelled. For the development of
the GTA freeway network model, a properly scaled digital representation of the study area was
loaded as an overlay into Paramics and used as a guideline for manually coding the network in
sufficient detail. The information about the geometry was generally gleaned from digital aerial
photographs. Throughout the development of the network information about the number of lanes,
roadway geometry, speed limits, detection devices, and control measures was gathered.
Although the original GTA freeway network model was rigorously calibrated to match
observed counts and average speeds, the traffic flow dynamics were not accurate enough for the RM
application, especially in terms of vehicle merging dynamics around on ramps and related capacity
degradation reflected in fundamental flow diagrams, amongst other details as will be discussed
next. Therefore, the steps described in the following sections were carried out to achieve the
calibration quality needed for training and evaluation of RM.
4.1.1 Data Preparation for Real Measurements and Paramics
Traffic measurements, such as vehicle count, average speeds, and queues are needed to calibrate
the driver behaviour model accurately as well as the origin-destination matrices. The main source
of information about traffic patterns in this research was the loop detector data available through
the ONE-ITS servers (one-its.net) at the University of Toronto. The original loop detector data
consists of samples for every 20-second interval. Each sample contains, for every lane in the past 20
seconds, the number of cars that passed, their average speed, and the percentage of time that the
loop detector was occupied. The data from different lanes are combined into one average value.
Because the 20-second values fluctuate significantly, a five-minute rolling average
was used instead. In addition, averaging helped with replacing the missing data points with the
average of previous time samples.
The provided data included many faulty samples, which required careful removal. Different
types of faulty data points were observed, for instance:
- Missing data points for the whole loop detector or a single lane,
- Faulty sensor: occupancy is 100%, speed is 100 kph, and count is zero,
- Faulty data: high speed, low flow, and high occupancy,
- Outlier data: low speed, low flow, and low occupancy.
After the data have been read and stored in a vector with increasing sample time, missing
data points are flagged as zero and available samples flagged as one. Then the following metric
corresponding to the average length of the cars in the past 20 seconds is calculated:

$$\mathit{Length} = \frac{\dfrac{\mathit{Occupancy}}{100} \times \mathit{Speed} \times \dfrac{1000}{3600} \times 20}{\mathit{Count} + 0.01} \tag{4.1}$$
where the 0.01 added to Count avoids division by zero when the sensor is faulty. If
Length is greater than 50 m or less than 2 m, that data point is flagged as zero and therefore
treated as a missing data point. For averaging purposes, every sample is replaced with the average
of available samples in the past five minutes. For example, if two samples in the past five minutes
are flagged as missing, the average will be computed over the 13 available samples.
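The cleaning steps above can be sketched end to end. The 2-50 m thresholds and 15-sample (five-minute) window follow the text, while the sample layout and field ordering (count, speed, occupancy) are assumptions made for illustration.

```python
# Sketch of the detector-cleaning pipeline: compute the average-length
# metric (4.1) for each 20 s sample, flag implausible samples as missing,
# then replace every sample with the mean of the valid samples in the
# trailing five minutes (15 samples). Field layout is illustrative.

def avg_length_m(count, speed_kph, occupancy_pct):
    # occupancy fraction * 20 s interval = seconds occupied; times speed in
    # m/s gives metres of vehicle over the detector; divide by the count.
    # The 0.01 guards against division by zero on faulty samples.
    return (occupancy_pct / 100.0) * (speed_kph * 1000.0 / 3600.0) * 20.0 / (count + 0.01)

def clean(samples):
    """samples: list of (count, speed_kph, occupancy_pct) tuples or None."""
    valid = []
    for s in samples:
        if s is None:
            valid.append(False)
            continue
        length = avg_length_m(*s)
        valid.append(2.0 <= length <= 50.0)   # plausible vehicle lengths only
    smoothed = []
    for i in range(len(samples)):
        window = [samples[j] for j in range(max(0, i - 14), i + 1) if valid[j]]
        if window:
            smoothed.append(tuple(sum(x[k] for x in window) / len(window)
                                  for k in range(3)))
        else:
            smoothed.append(None)             # no valid samples to average
    return smoothed
```

A faulty sample such as `(0, 100, 100)` yields an implausible average length and is replaced by the mean of its valid neighbours, exactly as the 13-of-15 example in the text describes.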
The state of traffic can be measured in Paramics through various tools. For the purpose of
calibrating and tuning the model, only measurements from loop detectors are considered so that
they are comparable with real loop detector data. Although Paramics has built-in functionality for
loop detectors, the output of Paramics loop detectors is information about individual cars passing
them. In practice, the output of real loop detectors is reported as an average over certain intervals,
a 20-second interval in Ontario; therefore, output of Paramics loop detectors is processed to match
real-life samples. Details about the processing of the Paramics loop detectors’ output to obtain
interval averages are presented in Appendix A.
The state of traffic is usually represented by speed (km/hr), density (veh/km/lane), and flow
(veh/hr/lane). Whereas loop detectors provide information about speed and flow, traffic density
measurement is not readily available from loop detectors. Traffic density can be estimated from
speed and flow with the equation $k = q/v$, where $k$ is traffic density, $q$ is traffic flow, and $v$ is the
average traffic speed. Another way is to utilize the occupancy reported by the loop detector for
estimation of density. The occupancy, o, based on speed of individual cars passing the detector
can be written as:
$$o = \frac{100}{T} \sum_{i=1}^{n} \frac{l_i}{v_i} \tag{4.2}$$
where $T$ is the 20-sec interval, $n$ is the number of vehicles that passed, $l_i$ is the length of vehicle $i$,
and $v_i$ is the speed of vehicle $i$. If we replace the vehicle length with the average length, $\bar{l}$,
equation (4.2) can be simplified to:
$$o \cong \frac{100\,\bar{l}}{T} \sum_{i=1}^{n} \frac{1}{v_i} = \frac{100\,\bar{l}\,n}{T} \cdot \frac{1}{n} \sum_{i=1}^{n} \frac{1}{v_i} \tag{4.3}$$
Similarly, density based on speed and flow can be written as:
$$k = \frac{q}{v} = \frac{3600\,n/T}{\frac{1}{n} \sum_{i=1}^{n} v_i} \tag{4.4}$$
The term $\frac{1}{n}\sum_{i=1}^{n} v_i$ is basically the average speed and can be substituted with $\bar{v}$; therefore,
equation (4.4) can be rewritten to achieve a similar form to equation (4.3):

$$k = \frac{3600\,n}{T} \cdot \frac{1}{\frac{1}{n} \sum_{i=1}^{n} v_i} \tag{4.5}$$
Considering (4.5) and (4.3), in free-flow conditions where vehicles all pass the loop detector at
similar speeds, occupancy can be reliably used to estimate density. However, in congested
conditions with stop and go behaviour, the two equations will diverge, and using occupancy for
estimating density will result in overestimation. Figure 4-1 shows the density estimated directly
from speed and flow versus occupancy measured by loop detector.
Considering the above observations, in this research the densities were calculated by
dividing flow by speed. This approach guarantees that the fundamental relation between speed,
flow, and density is maintained.
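The divergence between the two estimates comes down to the mean of reciprocal speeds (the occupancy-based term in (4.3)) versus the reciprocal of the mean speed (the flow/speed term in (4.5)). A small numerical check with illustrative speed samples:

```python
# Occupancy-based density scales with the mean of 1/v (4.3); flow/speed
# density scales with 1/mean(v) (4.5). The two coincide for uniform speeds,
# but the former is larger under stop-and-go variation (Jensen's
# inequality), so occupancy overestimates density in congestion.

def mean_inverse_speed(speeds):
    # drives the occupancy-based estimate
    return sum(1.0 / v for v in speeds) / len(speeds)

def inverse_mean_speed(speeds):
    # drives the flow/speed estimate
    return 1.0 / (sum(speeds) / len(speeds))

uniform = [60.0] * 10        # free flow: everyone near 60 km/h
stop_go = [5.0, 115.0] * 5   # congestion: same arithmetic mean, wide spread
```

For the stop-and-go samples, the occupancy-driven term is roughly six times the flow/speed term, matching the overestimation visible in Figure 4-1.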
4.1.2 Driver Behaviour Parameter Calibration
Besides inspecting and fine-tuning the physical aspects of the model, we should also tune numerous
user-definable driver behaviour parameters. These parameters define how drivers react in various
traffic conditions and sections of the network. Some parameters are not significant enough to
require numerical tuning using optimization and can be intuitively chosen. The following
parameters were found to require intuitive/subjective modification from their default value based
on observing the simulation model while running.
Figure 4-1 Relationship between occupancy and density.
Time step – time step represents the number of discrete times per real time second that a
decision is made during simulation. A higher time step value simply allows vehicles to
make decisions based on the car following and lane change logic at a higher frequency.
This is specifically important at on-ramp merging points where lane changing happens
very often. The default time step value is five (five steps in each second), but it has been
found that achieving proper merging behaviour requires a time step value of 10.
Ramp headway factor – the target headway for all vehicles on a ramp can be modulated
with this factor. Lower than average target headway for ramp vehicles allows them to
merge with mainline traffic more aggressively. The default value is one, i.e. no change to
target headway. However, a value of 0.5 is employed so that ramp vehicles force their
way into mainline as occurs in real life on busy urban freeways.
Minimum ramp time – this parameter specifies the time, in seconds, which vehicles spend
on the ramp before considering merging with the mainline traffic. Although the default is
2 sec, it has been changed to 1 sec considering the short ramp merge areas.
Signpost – this parameter defines the distance at which vehicles are notified of hazards
(divergence, lane drop, narrowing, etc.). Hazards usually require the affected vehicles to
change lane. Short signpost distance does not give vehicles enough time to change lane
properly, and over-long signpost distance will cause early lane changes and unnecessary
congestion. The default value for freeway signposts is 750 m, but it should be modified
according to the network geometry through observing traffic behaviour.
Ramp-aware distance – ramp-aware distance is defined as the distance at which vehicles
in the main line traffic become aware of vehicles on the ramp. If a vehicle is in the right-
most lane on the mainline, it will attempt to change lanes in order to create a gap for the
merging vehicle. In case of low ramp-aware distance there will not be enough space on the
right lane of the freeway mainline for ramp vehicles to merge; therefore, the on-ramp flow
will be limited. On the other hand, high ramp-aware distance can cause the mainline to
breakdown even at very low on-ramp demand, because of mainline vehicles changing lanes
when there is only a single vehicle on the on-ramp. The default value is 200m, but it was
found that suitable ramp-aware distance for GTA freeways is from 100m to 150m.
Besides the aforementioned parameters, some parameters can directly affect the core specifications
of network such as capacity and susceptibility to flow breakdown; therefore, they require careful
fine-tuning, possibly using optimization, to ensure simulated traffic flow behaviour matches the
measurements. The parameters that affect the traffic flow significantly and require fine-tuning are
summarized below.
Mean target headway – the average headway, which vehicles try to maintain. The headway
directly affects freeway capacity and lower headway results in higher capacity. The default
headway value is 1.0 sec.
Mean driver reaction time – the mean reaction time of each driver, in seconds. The value
is associated with the lag in time between a change in speed of the preceding vehicle and
the following vehicle's reaction to the change. Smaller reaction times will reduce the
probability of breakdown because of faster response from drivers. Because of lower
susceptibility to flow breakdown, highway capacity increases. The default value is 1.0 sec,
but in practice, it is found to be lower.
Aggression – aggression is the distribution of target headway of various vehicles around
the average target headway. Aggression can vary on a scale from one to nine, with a score
of four being neutral, and higher aggression value will cause a vehicle to accept a smaller
headway. The default aggression is a normal distribution and is hidden from the Paramics
user. However, it is possible to modify the distribution of aggression. Increasing the
variance of aggression will increase the number of vehicles with lower aggression and as a
result higher headways; therefore, chance of breakdown will increase and freeway capacity
will decrease.
Awareness – similar to aggression, awareness has a distribution and affects the target
headway of the vehicles, except it is only active near lane drops. Awareness can vary on a
scale from one to nine with a score of four being neutral; high awareness values will result
in longer headway when vehicles approach a lane drop in order to allow vehicles in other
lanes to merge more easily. Therefore, high awareness reduces the susceptibility to
breakdown because of lane change, resulting in a smooth traffic flow and increased
capacity.
Link headway factor – this parameter allows the user to modify the mean target headway
locally for a link. It can be used to modify vehicular behaviour in specific sections as the
user may find warranted, such as around weaving sections for instance. The default value
is one.
Link reaction time factor – this is similar to link headway factor except for reaction time.
Among the different approaches for calibrating microsimulation model parameters, the approach
presented in M. Zhang, Ma, and Dong (2008) is well suited to freeway network models. In this
research, the calibration process is developed according to the guidelines provided in the
aforementioned report.
The main goal of the calibration process is to ensure that the Paramics model replicates the
fundamental traffic diagram of a real freeway network. To compare the samples from Paramics
with the ones from real loop detectors, a fundamental diagram based on the Van Aerde model (Van
Aerde, 1995) is fitted to both sets of samples. The Van Aerde model is a single-regime fundamental
diagram that can properly represent both congested and uncongested sides. The speed-density
relation in the model is defined as:
$$k = \frac{1}{c_1 + \dfrac{c_2}{v_f - v} + c_3 v} \tag{4.6}$$
where $k$ is density, $v$ is speed, $v_f$ is the free-flow speed parameter, and $c_1$, $c_2$, $c_3$ are model
parameters which can be calculated based on $k_j$ (jam density), $v_c$ (critical speed), $q_c$ (capacity flow),
and $v_f$ (free-flow speed) from the following equations:
$$c_1 = \frac{v_f}{k_j v_c^2}\,(2 v_c - v_f), \qquad c_2 = \frac{v_f}{k_j v_c^2}\,(v_f - v_c)^2, \qquad c_3 = \frac{1}{q_c} - \frac{v_f}{k_j v_c^2} \tag{4.7}$$
Since measurements of both speed and density are contaminated with noise, the regular least
squares method is not suitable for fitting a Van Aerde model. A more robust approach is total least
squares, described in Appendix B, which accounts for errors in both independent and dependent
measurements and minimizes the error function:
$$\varepsilon = \sum_{i} \left[ \left(k_i - \hat{k}_i\right)^2 + \left(v_i - \hat{v}_i\right)^2 \right] \tag{4.8}$$
where $(\hat{k}_i, \hat{v}_i)$ is the point on the Van Aerde curve that is closest to the measured speed-density
pair $(k_i, v_i)$. Figure 4-2 shows an example of the Van Aerde model fitted to samples obtained
from a loop detector on the Gardiner Expressway in Toronto.
Figure 4-2 A Van Aerde model is fitted to samples from a loop detector. The left figure is the flow-density relationship and the right figure is the speed-density relationship.
To find the best driver behaviour parameters, the simultaneous perturbation stochastic
approximation (SPSA) (J. C. Spall, 1998) is employed. SPSA, described in detail in Appendix C,
is a numerical optimization algorithm suitable for problems with numerous parameters. It can
estimate an unbiased gradient of the objective with only two evaluations of the objective function.
The parameters are then moved along the gradient, in the direction that minimizes the objective value.
The objective function to be minimized is defined as:
$$J = \sum_{d \,\in\, \text{select detectors}} \left[ w_{v_f}\left(v_{f,d}^{\,r} - v_{f,d}^{\,p}\right)^2 + w_{k_j}\left(k_{j,d}^{\,r} - k_{j,d}^{\,p}\right)^2 + w_{q_c}\left(q_{c,d}^{\,r} - q_{c,d}^{\,p}\right)^2 + w_{v_c}\left(v_{c,d}^{\,r} - v_{c,d}^{\,p}\right)^2 \right] \tag{4.9}$$
where superscript $r$ represents the real-life values, superscript $p$ represents the values obtained
from Paramics, and subscript $d$ represents the loop detectors for which the fundamental diagram
is calibrated. The weight parameters $w_{v_f}$, $w_{k_j}$, $w_{q_c}$, $w_{v_c}$ are used to bring the different errors into the
same scale.
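SPSA's core step can be sketched independently of the simulator: both objective evaluations use the same random ±1 perturbation of all parameters at once. The quadratic objective, target values, and gain constants below are illustrative stand-ins for the simulation-based objective (4.9).

```python
# One SPSA iteration: perturb all parameters simultaneously with a random
# +/-1 (Rademacher) vector, evaluate the objective twice, and step against
# the resulting gradient estimate. Gains a and c are illustrative constants;
# Spall recommends decaying gain sequences in practice.
import random

def spsa_step(theta, objective, a=0.1, c=0.01):
    delta = [random.choice((-1.0, 1.0)) for _ in theta]
    plus = [t + c * d for t, d in zip(theta, delta)]
    minus = [t - c * d for t, d in zip(theta, delta)]
    diff = objective(plus) - objective(minus)
    # gradient estimate for component k is diff / (2 c delta_k)
    return [t - a * diff / (2.0 * c * d) for t, d in zip(theta, delta)]

random.seed(0)  # reproducible toy run
target = [1.0, -2.0, 0.5]
objective = lambda th: sum((x - g) ** 2 for x, g in zip(th, target))

theta = [0.0, 0.0, 0.0]
for _ in range(500):
    theta = spsa_step(theta, objective)
```

Regardless of how many parameters are tuned, each iteration costs exactly two objective evaluations, which is what makes SPSA attractive when every evaluation is a full microsimulation run.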
The Origin-Destination (OD) matrices needed for calibrating the driver behaviour do not
need to be very accurate because only the fundamental diagram from the simulation result will be
employed. The only condition necessary for an OD matrix is that the demand be large enough for
the freeway to become congested, so that there are enough samples on both the free flow and
congested sides of the fundamental diagram. The preliminary OD estimation, which is described
in the next section, was increased by 10% to generate enough demand to exceed the capacity. This
OD demand was used to calibrate the behaviour parameters.
4.1.3 OD Estimation and Calibration
In order to employ any traffic simulation model, an OD matrix that describes the trip patterns is
required. To compare different scenarios effectively, an extended microscopic simulation period,
which includes free flow conditions before and after peak periods, is essential. Therefore, the OD
matrix should vary with time to account for the changes in demand during the extended simulation
period, and typical OD matrices obtained from regional demand models (e.g. EMME model) are
not suitable. The extended simulation requires calibrating several OD matrices, each representing
an interval in the simulation period.
Since the networks used in this research consist of a single freeway without parallel
arterials, there are no alternative routes from different origins to destinations. Therefore, OD
estimation is significantly simplified. The initial stage is to calculate an OD matrix for each interval
based on the loop detector counts in that interval. Let $d^t = [d_1^t \; d_2^t \; \ldots \; d_n^t]^T$ be the demand
vector at time interval $t$, where $d_i^t$ is the hourly flow of OD pair $i$ at time interval $t$, and let
$c^t = [c_1^t \; c_2^t \; \ldots \; c_m^t]^T$ be the vector of loop detector counts, where $c_j^t$ is the hourly flow measured
on loop detector $j$ at time interval $t$. We can define matrix $A$ as the relationship between OD pairs
and vehicle counts from loop detectors, where $A_{j,i}$ is one if the route for OD pair $i$ passes through
loop detector $j$ and is zero otherwise. Therefore, assuming there is no congestion, the relationship
between them can be written as:
$$c^t = A\, d^t \tag{4.10}$$
Note that the no-congestion assumption is necessary to ensure that all vehicles reach
their destinations and pass through the loop detectors without being trapped by congestion. Basically,
row j from matrix A defines the OD pairs which pass through detector j; therefore, the vehicle
count for detector j will be the sum of all demands which pass through it. Equation (4.10) can be
solved for , for each time interval independently ( and A are known). However, to
maintain consistency and prevent oscillation of demand from one interval to the next, another cost
term is added to link different intervals. The final cost to minimize is:
$$J = \sum_{t=1}^{T} \sum_{j=1}^{m} \left( A_{j,:}\, d^t - c_j^t \right)^2 + w \sum_{t=1}^{T-1} \sum_{i=1}^{n} g\!\left(d_i^t, d_i^{t+1}\right)^2 \tag{4.11}$$
where $A_{j,:}$ is row $j$ of matrix $A$, $T$ is the total number of demand intervals, and $w$ is the weight
parameter which controls the significance of the second term. The function $g(\cdot,\cdot)$ is defined as
$g(x, y) = 2(x - y)/(x + y)$. The first term in the cost function (4.11) guarantees that the
simulated counts are close to the real counts, and the second term prevents the demands from oscillating
from one time interval to the next. Using common numerical optimization algorithms, we can find
a set of OD matrices which minimizes the cost .
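The estimation step can be sketched as evaluating the cost (4.11) over candidate demands for a toy two-OD, two-detector corridor. The layout, counts, and weight $w$ are illustrative, and $g$ is taken as the relative-difference penalty $g(x, y) = 2(x - y)/(x + y)$ described in the text.

```python
# Sketch of the initial OD-estimation cost (4.11): matrix A maps OD demands
# to detector counts, and g penalizes demand oscillation between
# consecutive intervals. The corridor layout and numbers are toy values.

def g(x, y):
    # relative change between consecutive intervals
    return 2.0 * (x - y) / (x + y)

def cost(A, demands, counts, w=100.0):
    total = 0.0
    for t, (d_t, c_t) in enumerate(zip(demands, counts)):
        for j, row in enumerate(A):                 # count-matching term
            pred = sum(a * d for a, d in zip(row, d_t))
            total += (pred - c_t[j]) ** 2
        if t + 1 < len(demands):                    # smoothness term
            total += w * sum(g(x, y) ** 2 for x, y in zip(d_t, demands[t + 1]))
    return total

# OD 1 passes both detectors; OD 2 enters downstream and passes only the second.
A = [[1, 0], [1, 1]]
counts = [[1000.0, 1400.0], [1100.0, 1600.0]]       # two time intervals
exact = [[1000.0, 400.0], [1100.0, 500.0]]          # reproduces the counts
```

A numerical optimizer would search over the demand vectors to minimize this cost; the `exact` demands above drive the count-matching term to zero by construction.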
This initial OD estimation, although very accurate with regard to capturing counts, fails to
replicate traffic congestion. Since the counts represent the vehicles that passed the loop detector
and not the actual demand, the initial OD estimation will generate demands that match the counts,
but on the uncongested side. The degree of congestion can be captured through traffic speed at different
sections. To recreate the real congestion and better replicate observed traffic speed in the
simulation, the demands for some intervals should be increased to exceed capacity. Considering
the extent of congestion and observed density in those sections from real counts, an estimate of the
vehicles stuck on the roads can be made. The number of vehicles stuck in congestion represents
the extent to which the demand should be increased in the simulation model to reproduce the same
congestion. Given that simulation starts and finishes with no congestion, the total demand for the
whole simulation horizon should not change. The extra demand needed for producing congestion
should be moved from later intervals to earlier intervals. After all, some of the vehicles that passed the loop detectors at later intervals were vehicles previously stuck in congestion.
To calibrate the OD matrices, the demands are modified based on:

    d'_i(1) = d_i(1) + δ_i(1),                 i = 1…n
    d'_i(t) = d_i(t) + δ_i(t) − δ_i(t−1),      i = 1…n, t = 2…T    (4.12)

where d'_i(t) is the new demand and δ_i(t) ≥ 0 is the demand for OD pair i which is moved from interval t+1 to interval t (with δ_i(T) = 0). Note that δ_i(t) is added to d_i(t) and subtracted from d_i(t+1). The calibration aim is to find δ_i(t) for i = 1…n, t = 1…T−1 so that the following cost function is minimized:
    C' = Σ_{t=1..T} Σ_j ( A_{j,:} d'(t) − y_j(t) )² + w_v Σ_{t=1..T} Σ_j ( v_j(t) − v̂_j(t) )² + w Σ_{t=2..T} Σ_i f( d'_i(t), d'_i(t−1) )    (4.13)

where v_j(t) and v̂_j(t) are the simulated and measured speeds at detector j, and w_v weights the speed-matching term.
Given the large number of unknown variables δ_i(t), the SPSA algorithm discussed in Section 4.1.2 can be employed to solve the optimization problem efficiently.
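A minimal SPSA sketch is shown below; the gain sequences and the quadratic test loss are illustrative choices, not the calibration setup used in the thesis:

```python
import numpy as np

# Minimal SPSA sketch; the gain constants (a, c) and decay exponents
# are illustrative, not the thesis's exact calibration settings.
def spsa_minimize(loss, theta0, n_iter=1000, a=0.1, c=0.1, seed=0):
    rng = np.random.default_rng(seed)
    theta = np.asarray(theta0, dtype=float)
    for k in range(n_iter):
        ak = a / (k + 1) ** 0.602            # step-size decay
        ck = c / (k + 1) ** 0.101            # perturbation decay
        delta = rng.choice([-1.0, 1.0], size=theta.shape)
        # Two loss evaluations approximate the gradient in all dimensions
        # at once, which is what makes SPSA cheap when the number of
        # unknown variables is large.
        g = (loss(theta + ck * delta) - loss(theta - ck * delta)) / (2 * ck) / delta
        theta = theta - ak * g
    return theta
```

Because each iteration needs only two loss evaluations regardless of the problem dimension, the cost of one iteration stays constant as the number of δ variables grows.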
4.2 Highway 401 Eastbound Collector and Keele Street
The first network developed is a section of the Highway 401 eastbound collector that surrounds
Keele Street, as shown in Figure 4-3. This network is employed for experimentation with different
aspects of the single-agent RL-based RM. The study area forms an important element of the experimental design and was selected meticulously. The morning peak period
was chosen for modelling because of the significant demand from on-ramp #1, which causes a
bottleneck. Furthermore, there are no bottlenecks present immediately downstream of this on-
ramp, ensuring that the only source of congestion is this on-ramp.
The studied network is about 2.5km long and includes Highway 401 eastbound collector
from Highway 400 to Allen Road. The section includes an off-ramp to Keele Street and two on-
ramps from Keele Street's southbound and northbound directions. The freeway mainline is four
lanes wide upstream of the off-ramp and is three lanes wide at the on-ramps’ location. The loop
detector data from the morning peak period of July 6, 2011 were used for the calibration of the
network. The simulation was performed from 06:00 to 10:00 and demand periods were broken into
20-min intervals. Since the focus of this work is ramp metering, the surrounding arterials were not modelled; however, the on-ramps were extended beyond their actual length to account for the spillback of vehicles queued by ramp metering into nearby arterials. There are two on-ramps in the study area. After examining the system and the demand on the two on-ramps, it was realized that on-ramp #2 carries considerably lower demand and therefore does not have a significant effect on the freeway traffic, especially if the downstream on-ramp is efficiently metered. Consequently, the upstream on-ramp was not metered.
Figure 4-3 Aerial map and Paramics screenshot showing part of the study area. The map
shows the Highway 401 eastbound collector at the merging point of Keele St.
After extraction of the above segment from the GTA freeway network model and
modification of the physical aspects, the behavioural parameters were calibrated. The Paramics parameters used for this model are summarized in Table 4-1. Figure 4-4 shows the fundamental
diagram estimated from measurements obtained from the Paramics network compared with the
one from real loop detectors. The loop detector is located just after the merging area of on-ramp
#1. Except for jam densities, all parameters of the two fundamental diagrams are similar. The jam density value obtained from calibration of a Van Aerde model is very sensitive to the data samples, especially when the samples on the congested side of the diagram do not extend to higher densities. When the measurements are made after a bottleneck, such as an on-ramp, the congestion is limited and the traffic flow is close to capacity. These congested samples do not provide enough information for estimation of the jam density. When the measurements are made upstream of a bottleneck, the traffic flow is much lower than capacity and the density is much higher than the critical density. Therefore, the jam density can be expected to be estimated with higher confidence there.
Even though the jam densities do not match, it can be seen that the traffic flow drops below capacity when the freeway is congested due to the on-ramp bottleneck.
Table 4-1 Numerically calibrated Paramics parameters for Highway 401 model
Parameter Value Parameter Value
Mean headway 0.95 Link headway factor for on-ramp links 1.1
Mean reaction time 0.85 Link reaction time factor for on-ramp links 1.05
Mean awareness 4 Link headway factor for off-ramp links 0.85
Awareness standard deviation 2.5 Link reaction time factor for off-ramp links 0.9
Mean aggression 6
Aggression standard deviation 2
Figure 4-4 Fundamental diagram fitted to samples from simulation of calibrated Paramics
model and real loop detectors.
In the next step, the demands were calibrated to recreate congestion patterns in the model as close as possible to field observations. Figure 4-5 shows the traffic speed upstream of on-ramp #1 from the Paramics model and the real freeway. Even though the traffic speed in Paramics does not exactly match the real freeway speeds, it is important to note that the duration of congestion in the two cases is very similar.
Figure 4-5 The evolution of morning traffic in the Paramics model compared with
measurements from real loop detectors.
4.3 Gardiner Expressway Westbound
The westbound direction of the Gardiner Expressway is a very good testbed for the evaluation of coordinated RM algorithms. The Gardiner Expressway has three on-ramps in downtown Toronto, which carry traffic out of Toronto in the evening peak period. Demand from the three on-ramps peaks at 4,000 veh/hr and can easily cause the freeway flow to break down. Additionally, an on-ramp on the west end connects Lakeshore to the Gardiner at Jameson. Figure 4-6 shows a schematic representation of the study network.
The four on-ramps feeding traffic to the Gardiner are discussed below.
Jarvis on-ramp – the Gardiner is physically limited to two lanes upstream of Jarvis, and it
changes back to three lanes after Jarvis. Therefore, the Jarvis on-ramp has its own dedicated
lane allowing for unimpeded traffic flow.
Figure 4-6 Schematic of the study area network, showing the Gardiner Expressway
westbound from Don Valley Parkway in the east to Humber Bay in the west.
York on-ramp – the York on-ramp is located around 250 m upstream of the Spadina off-ramp. The weaving section created by this close spacing is exacerbated by the significant demand from the York on-ramp, as well as by the vehicles coming from upstream (from the Don Valley Parkway, which is not shown in the figure) that have to change at least two lanes to reach the Spadina off-ramp.
Spadina on-ramp – the Spadina on-ramp carries the highest volume among the three on-
ramps. The acceleration lane of the Spadina on-ramp for merging of the on-ramp vehicles
with the mainline flow is significantly longer than the average on-ramp. Furthermore, even
after the acceleration lane ends, the Expressway remains wide and the right lane is much
wider than other lanes. The purpose of this is probably to ease the merging process of the
on-ramp vehicles. When modelling the Spadina on-ramp it is important to keep its peculiar
geometry in mind.
Jameson on-ramp – the Jameson on-ramp has a very short acceleration area and merging
distance and is located after a very sharp turn. This geometry causes significant traffic
instability that leads to freeway breakdown even with very low on-ramp demand. For this
reason, the City of Toronto closes the Jameson on-ramp from 15:00 to 18:00 every day.
After extraction of the above segment from the GTA freeway network model, the network's
physical geometry was carefully modified to reflect the above road conditions and road behaviour.
The movement of vehicles at the merging point of Jarvis was modified to ensure no conflict
occurred between the two traffic streams. The reaction time for the right two lanes of the weaving
section after the York on-ramp was reduced for easier lane changes. The length of the Spadina on-
ramp lane was increased to reflect the extended length in the actual freeway. The Jameson on-ramp
length was reduced to match the short length of the real on-ramp. Lakeshore Boulevard in the Humber Bay area is also modelled as an alternative route for when the Jameson on-ramp is closed. Note that if the queue behind the Jameson on-ramp gets very long, vehicles will choose Lakeshore instead of the Gardiner because of its lower travel time. The Paramics model of the westbound
direction of the Gardiner Expressway and its aerial map are shown in Figure 4-7.
The modelled network is about 10 km long with four off-ramps and four on-ramps. The
freeway is three lanes wide for the most part as shown in the schematic diagram in Figure 4-6.
After refinement of the model, a preliminary OD was estimated based on the loop detector counts
from April 2012. In addition to weekends, Mondays and Fridays were omitted from data collection
to eliminate any chance of irregular traffic flows in the input data. In total, nine days in April 2012 (the 10th, 11th, 12th, 17th, 18th, 19th, 24th, 25th, and 26th) were considered, and the flows were averaged in one-hour intervals from 13:00 to 21:00. Consequently, a dynamic OD demand matrix was fitted to the averaged counts as the preliminary OD demands. Since the resulting demands are on the uncongested side of the flow-density diagram, the demand values were increased by 10% to create some congestion and make the process of estimating fundamental diagrams possible. Then, the behavioural parameters were calibrated using the approach discussed in Section 4.1.2. The Paramics parameters used for this model are summarized in Table 4-2.
Figure 4-7 Aerial map of the Gardiner Expressway westbound and its Paramics model.
Table 4-2 Numerically calibrated Paramics parameters for the Gardiner model
Parameter Value
Mean headway 0.93
Mean reaction time 0.8
Mean awareness 3
Awareness standard deviation 1.6
Mean aggression 5
Aggression standard deviation 1.4
Link headway factor for Spadina off-ramp weaving section 1.15
Link reaction time factor for Spadina off-ramp weaving section 0.75
Link headway factor for lane drop upstream of Jarvis 1.1
Link reaction time factor for lane drop upstream of Jarvis 0.9
Table 4-3 summarizes the fundamental diagram parameters of the real freeway as well as those from the Paramics network. The loop detectors are placed just downstream of the merging point of the respective on-ramps. Except for jam density, which is generally higher for the Paramics model, the rest of the parameters match closely and show the good quality of the calibration. As discussed above, proper estimation of jam density requires data samples with high density and
low traffic flow. These samples are usually obtained when the loop is upstream of a bottleneck. The loop detector related to York is placed upstream of Spadina, which is a major bottleneck; therefore, the jam densities from the real data and the Paramics data are consistent. There are no significant bottlenecks downstream of Jameson and Spadina in the Paramics model; hence, the jam density estimates there are not accurate. It should be noted that the real loop detector data consist of nine days, and there are cases of congestion building up from downstream toward Jameson and Spadina due to different traffic patterns on different days.
Table 4-3 Parameters of the Van Aerde model fitted to fundamental diagram samples from Paramics and real life.

                                    York             Spadina           Jameson
Parameter                      Paramics  Real    Paramics  Real    Paramics  Real
Free-flow speed (km/h)             93     92        104    100         95     96
Capacity (veh/h/lane)            1784   1746       2074   2095       2117   2150
Critical density (veh/km/lane)     23     24         27     28         24     25
Jam density (veh/km/lane)         176    153        196    117        202    101
Following the calibration of the behavioural parameters, the dynamic OD demands should also be calibrated. Calibrations were performed to match the loop detector measurements of the selected nine days of traffic. Figure 4-8 compares the time-space diagram of speeds from the real loop detectors with speeds from the Paramics model. The difference at the bottom of the graph is due to the lack of data from the real loop detectors at the beginning of the freeway. Nevertheless, the two graphs show very similar patterns of congestion.
Although the speeds are matched, we should make sure that the vehicle counts are still in the acceptable range. Figure 4-9 summarizes the GEH values for selected loop detectors (the two detectors on the right side of the graph are on-ramp detectors) at one-hour intervals. Eighty-six percent of the GEH values are well below five, which is considered accurate calibration, and the rest are below eight. It is worth noting that the original loop detector data are not very accurate, and some of the errors can be attributed to the low quality of the initial data. Figure 4-10 shows the actual flows from the real loop detectors and the calibrated Paramics model for the 16:00-17:00 interval, when demand from downtown is at its peak. As can be seen from the graph, the Paramics model closely matches the measurements taken from the real freeway.
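The GEH statistic referenced above is a standard fit measure for comparing modelled and observed hourly flows; a small helper (hypothetical naming) makes the computation concrete:

```python
import math

# The GEH statistic; a value below 5 is commonly taken to indicate a
# good fit between modelled and observed hourly flows. The helper name
# is an assumption for illustration.
def geh(model_flow, observed_flow):
    return math.sqrt(2.0 * (model_flow - observed_flow) ** 2
                     / (model_flow + observed_flow))
```

Unlike a plain percentage error, GEH tolerates larger absolute deviations on high-volume links, which is why it is the customary acceptance criterion for count calibration.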
Figure 4-8 The left graph shows the real loop detectors and the right graph shows the Paramics model. The time-space graphs show the average speed along the Gardiner from 13:00 to 21:00.
Data obtained from the currently available loop detectors provide detailed information about the state of traffic on the freeway; however, they do not give any information about the on-ramp queues. As a result, the calibration process does not have a reference for the queues and can produce unrealistic queues. Since we are dealing with ramp metering, it is important to make sure the ramp queues in Paramics properly follow those in reality. For this purpose, the INRIX app was employed. From the historical congestion information supplied by INRIX, the duration and extent of the queues for the three downtown on-ramps were estimated. Congestion on surface streets
connecting to on-ramps generally starts at around 15:00 and lasts until about 18:30, which is in
agreement with the space-time speed graph. It is estimated that the congestion on Jarvis extends to
Dundas Street and is equal to around 250 vehicles in the queue. The queue of York on-ramp
propagates on both Lakeshore (for vehicles going to the Gardiner from Yonge St.) and York
Streets, resulting in about 200 to 250 vehicles waiting in the queue. The Spadina on-ramp queues
extend beyond King Street and multiple nearby streets and comprise 200 to 250 vehicles.
When vehicles are queuing on an on-ramp, changing the ramp demand will not affect the flow entering the freeway and will only change the queue length. Therefore, if the simulated queues are too short, some demand from later intervals should be moved forward in time to increase the queue length; if the queues are too long, some demand should be shifted to later time intervals to decrease the queue.
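The demand-shifting rule described earlier (Eq. 4.12) can be sketched as follows; the array shapes and the function name are assumptions for illustration:

```python
import numpy as np

# Sketch of the demand-shifting rule in Eq. (4.12): column t of delta
# holds the demand moved from interval t+1 to interval t (1-based), so
# the total demand over the horizon is preserved. Shapes and the
# function name are illustrative assumptions.
def shift_demand(d, delta):
    """d: (n_od, T) demands; delta: (n_od, T-1) nonnegative shifts."""
    d_new = d.astype(float).copy()
    d_new[:, :-1] += delta   # demand gained by earlier intervals
    d_new[:, 1:] -= delta    # demand removed from later intervals
    return d_new
```

Because every unit of demand added to one interval is subtracted from the next, the total demand over the simulation horizon is unchanged, as required.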
Figure 4-9 GEH value for vehicle counts averaged over one-hour intervals for select loop
detector locations.
Figure 4-10 Traffic flow of the calibrated Paramics model compared with real loop detector
data along the Gardiner for three different time intervals.
After closely analyzing the calibration result, we observed that the Spadina on-ramp carries 1,600 veh/hr on average during the peak period. This result was evident in both the Paramics model and the real loop detector data. Considering the significant queues on the Spadina on-ramp, it was important to verify these numbers with solid evidence. Therefore, a field survey was conducted and vehicles on Spadina and the surrounding streets were counted between 16:00 and 17:00 to obtain the traffic flow and the number of queuing vehicles. Figure 4-11 shows the aerial view representing the queues on each street and the vehicle flow for each direction. The observed flow was 1,551 veh/hr, which is consistent with the previous results. The observed queue was about 140 vehicles, which is slightly lower than the values obtained from INRIX. However, it should be noted that these numbers are from a single-day observation, which can justify the difference in the queue.
Figure 4-11 Aerial view of the Spadina on-ramp with information about traffic flow and queues.
5 Independent and Coordinated RL-based Ramp Metering
Design and Experiments
In this chapter, the design and evaluation of RM controllers based on RL are discussed. The design process was carried out in three stages, to which the three sections of this chapter correspond. The first stage involved analyzing the different design parameters of the application of RL to RM. The second stage focused on the design, implementation, and evaluation of RL-based RM using different function approximation approaches for dealing with continuous variables. The third stage was the design and evaluation of the coordination of multiple RL-based RM controllers.
During the design process, to ensure the practicality of the resulting systems, it was assumed that measurements are only available through common loop detectors and hardware currently in place. Additionally, the employment of microscopic traffic simulators ensures that the RLRM design abides by real-life limitations, such as loop detector measurement noise, drivers' random behaviour, and the effect of traffic lights on traffic flow. These choices facilitate future field implementation of the algorithms. The use of more precise and detailed information, through technologies such as cameras and connected vehicles, is expected only to improve the performance of the system when available in the future.
5.1 Experiment I – Single Ramp with Conventional RL
Since a comprehensive model of the traffic flow, which fully represents the state of traffic, requires
an extensive number of traffic variables, RL design is not trivial in freeway control problems.
Furthermore, because of the stochastic nature of the traffic flow in freeways, an RLRM agent will
require a significant number of training samples to suppress the measurement noises. In this
section, the various RLRM design parameters and their selection criteria to ensure fast training
and reliable performance are discussed. The microsimulation model used for this part is the
Highway 401 eastbound collector at Keele Street presented in section 4.2. In this part, the
conventional table-based RL approaches were employed. The focus of this experiment was to
minimize the total travel time without any limit on the on-ramp queue storage. Therefore, queue
management algorithms are not analyzed and are deferred to the last section of this chapter.
5.1.1 RL-based RM Controller Design for Single Ramp
Given that conventional RL algorithms are being used, the design problem involves deciding on the aspects of the problem discussed in the following sections.
5.1.1.1 Control Cycle
The control cycle, T_c, is the time step at which the RLRM agent perceives the new state of the environment and takes a new action. In freeway traffic control problems, aggregated traffic conditions are used, and instantaneous measurements from sensors are averaged over T_c. A small control cycle is preferred to ensure fast system response to changes in traffic conditions. However, measurement noise and system delay, i.e. the time it takes for the system to respond to the controller action, limit the lower bound for the choice of control cycle. Depending on the algorithm and metering approach, various control cycle values have been used, e.g. 40 sec in Papageorgiou et al. (1997) and 60 sec in Jacob and Abdulhai (2010). After experimenting with various control cycle times, we found that a value of T_c = 2 min results in a good balance between the response to traffic changes and the measurement noise observed in real-life traffic data.
5.1.1.2 Action
Metering of on-ramps is performed by placing a traffic light on the ramp at the freeway
entrance. Changing the traffic light timing directly controls the traffic inflow to the freeway. Two
notable metering policies for on-ramp traffic light timing are one-car-per-green and discrete release
rates (Papageorgiou & Papamichail, 2008). In the one-car-per-green policy, a fixed green phase of
2 sec is used and the red phase is varied to provide different flow rates. This approach has the
benefit of breaking the platoon of cars and is easy for drivers to comprehend. However, the
maximum traffic flow that can be achieved by this approach is 900 veh/hr, given a minimum red
time of 2 sec. Therefore, one-car-per-green is suitable for on-ramps with low demand. Table 5-1
lists the one-car-per-green metering rates employed in this research and the corresponding green
and red phases. Given that the on-ramp demand in the Highway 401 model is less than 1000 veh/hr, the one-car-per-green policy is employed.
The discrete release rates policy allows more flexible metering rates, up to the capacity of
the on-ramp, e.g. 1800 vph, by allowing both green phase and red phase to be varied independently.
The goal is to achieve evenly spaced metering rates to be able to inject various levels of traffic into
the freeway. Although any flow value can be achieved with unconstrained green and red phases, it is desirable to keep the cycle length to a minimum and inject the fewest cars possible in each cycle. Considering these objectives, the discrete release rates and the associated green and red phases employed in this research are summarized in Table 5-2. The discrete release rates policy was used in the Gardiner model due to the significantly higher demand from its on-ramps.
Table 5-1 Metering rates and associated green and red phases for one-car-per-green metering policy
Metering rate (veh/h) 240 300 360 450 600 720 900
Green time (sec) 2 2 2 2 2 2 2
Red time (sec) 13 10 8 6 4 3 2
Table 5-2 Metering rates and associated green and red phases for discrete release rates metering policy
Metering rate (veh/h) 240 400 600 720 900 1200 1440 1800
Green time (sec) 2 2 2 2 2 4 8 6
Red time (sec) 13 7 4 3 2 2 2 0
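The green and red phases in Tables 5-1 and 5-2 follow from simple cycle-length arithmetic; a small sketch (hypothetical helper name) reproduces the tabulated red times:

```python
# Red-phase arithmetic behind Tables 5-1 and 5-2: with a fixed green
# phase releasing a known number of vehicles per cycle, the red phase
# follows from the target metering rate. The helper name is an
# assumption for illustration.
def red_time(rate_vph, green_s=2.0, veh_per_cycle=1):
    cycle_s = 3600.0 * veh_per_cycle / rate_vph   # required cycle length
    return max(cycle_s - green_s, 0.0)
```

For example, 240 veh/hr with one car per 2-sec green requires a 15-sec cycle, hence a 13-sec red, matching the first column of Table 5-1.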
The RL-based RM controller can control the signal timing in two ways: with a direct action that directly decides on the new metering rate, or with an incremental action in which the metering rate is increased or decreased relative to the previous control cycle. Incremental action eliminates the large variations that might occur with direct action and provides smoother changes in the metering rate. On the other hand, the small variations of incremental action result in slow reaction from the RM controller, which might limit the controller's performance.
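The two action definitions can be contrasted in a short sketch; the rate list comes from Table 5-1, while the index-based encoding and function names are illustrative assumptions:

```python
# Contrast of the two action definitions; rates are those of Table 5-1,
# while the index encoding and function names are illustrative.
RATES = [240, 300, 360, 450, 600, 720, 900]  # veh/h

def apply_direct(action_index):
    """Direct action: pick any metering rate outright."""
    return RATES[action_index]

def apply_incremental(current_index, delta):
    """Incremental action: step to an adjacent rate (delta in -1/0/+1),
    clipped at the ends of the rate list."""
    return max(0, min(len(RATES) - 1, current_index + delta))
```

The clipping at the list ends is what bounds the per-cycle change under incremental action, giving the smoother but slower behaviour described above.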
5.1.1.3 State
To represent the complete model of the freeway traffic network properly, state variables from the
entire network should be considered. In addition to the impracticality of measuring all possible
state variables, the learning time of RLRM agents increases exponentially with the number of state
variables. However, for a single on-ramp problem, the state of traffic can be properly identified
with only a few variables in the local area of the on-ramp, as shown in Figure 5-1. These variables
should represent the traffic conditions upstream of the on-ramp, downstream of the on-ramp, and on the on-ramp itself. The condition of the mainline traffic upstream and downstream of the on-ramp can be identified by its speed and density. The variables necessary to identify the condition of the on-ramp are the demand coming into the on-ramp, the on-ramp flow entering the freeway, and the on-ramp queue. Although all of these variables are needed for the complete state of traffic, some of them share redundant information. Omitting the redundant variables can speed up learning without significantly affecting performance.
The state of traffic is measured through loop detectors. Loop detectors, when implemented in a double-loop configuration, sense the presence and speed of individual vehicles. Averaging this information over T_c provides good estimates of speed, flow, and occupancy. Albeit not directly available through loop detectors, density, ρ, can be estimated from the average flow, q, and the average speed, v, as:

    ρ (veh/km) = q (veh/h) / v (km/h)    (5.1)
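Equation (5.1) amounts to a one-line computation; for instance, a flow of 1800 veh/h at 90 km/h implies 20 veh/km (the helper name is an assumption):

```python
# Density estimate of Eq. (5.1): flow (veh/h) divided by speed (km/h)
# gives density (veh/km). The function name is illustrative.
def density(flow_vph, speed_kmh):
    return flow_vph / speed_kmh
```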
Loop detectors are point detectors, and they should be properly located to provide an accurate representation of traffic conditions. However, RLRM learns to map detector measurements to the optimal action and is robust to imperfections caused by slight detector misplacement. As a result, the actual field locations of the loop detectors were used in this research and were not changed.
Figure 5-1 Local area on an on-ramp and the loop detectors which can represent its traffic state.
5.1.1.3.1 Downstream Traffic Condition
The complete state information can be described by speed and density; however, these two variables are closely correlated, and one of them can describe the traffic without significant loss of information. Although speed changes significantly as traffic changes from free flow to congested, density varies more evenly as the traffic state changes and provides a better representation of the traffic condition. In conventional RL algorithms, states are discrete; therefore, the continuous density variable should be discretized. The downstream density, ρ_down, represents the level of congestion and is the most important variable for proper design of RLRM agents. Since the maximum throughput of the freeway occurs at the critical density, ρ_c, the downstream density is expected to be close to the critical density in optimal operation of the freeway. Figure 5-2a shows the histogram of the downstream density when an RM controller is in operation. The downstream density is discretized such that samples are evenly distributed among the different bins. The edges of the discretization intervals for ρ_down were chosen as [0, 12, 16, 19, 22, 25, 28, 33, 40, 50, 60].
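This discretization can be implemented directly with NumPy's digitize; the bin edges are those listed above, while the function name is an assumption:

```python
import numpy as np

# Discretization of the continuous downstream density into the bin
# edges listed in the text (veh/km); the function name is illustrative.
DOWNSTREAM_EDGES = [0, 12, 16, 19, 22, 25, 28, 33, 40, 50, 60]

def density_state(rho):
    """Return the 0-based index of the bin containing density rho."""
    return int(np.digitize(rho, DOWNSTREAM_EDGES)) - 1
```

Densities above the last edge fall into a final overflow bin, so every measurement maps to a valid discrete state.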
5.1.1.3.2 Upstream Traffic Condition
Similar to the downstream traffic measurement, the upstream density was chosen as the variable to represent the state of traffic upstream of the on-ramp. The upstream density, ρ_up, provides an estimate of the distance over which congestion has propagated upstream of the ramp. For the RLRM agent to prevent congestion effectively, ρ_up should stay below ρ_c. Figure 5-2b shows the histogram of the upstream density when an effective RM is in operation. As can be seen from the figure, the densities hardly reach ρ_c; therefore, the discretization intervals were focused on the subcritical densities, and the interval edges for the upstream density were chosen as [0, 12, 16, 20, 24, 28, 40].
5.1.1.3.3 On-Ramp Traffic Condition
Since queue management techniques are not considered in this experiment, the queue length and on-ramp demand are not necessary for ramp metering. Therefore, the flow entering the freeway, q_ramp, is the only variable used. Since the agent's action determines the entry flow, the discretization intervals for the on-ramp flow were chosen to match the discrete values of the metering policy.
Figure 5-2 Histograms of traffic densities in a freeway section including an on-ramp, in the presence of an optimal RM controller. The dashed line represents the estimated critical density.
5.1.1.4 Reward and Discount Factor
Typically, the main goal of a traffic control system is to minimize the combined travel time of all transportation users. The total travel time, TTT, is defined as:
    TTT = T_c Σ_{k=0..K} N(k)    (5.2)
where N(k) is the number of vehicles at control cycle k within the area confined by the upstream, downstream, and on-ramp detectors, and K is the time horizon. To minimize TTT, the RLRM agent's reward and discount factor can be defined as r(k) = −N(k) and γ = 1, respectively. This is an undiscounted infinite-horizon problem and, ideally, the R-learning method presented in section 3.3.3 should be used for training the agent instead of typical approaches such as Q-learning and SARSA. Q-learning and SARSA maximize the total reward and cannot be applied to an undiscounted infinite-horizon problem, because the expected reward will not converge due to the undiscounted nature of the problem. Although the use of R-learning is theoretically sound for the above problem, improper selection of the additional learning parameter in R-learning often leads to slow convergence or divergence. It is therefore desirable to reformulate the problem such that Q-learning can be employed, avoiding the complexities associated with R-learning. Both R-learning and Q-learning are used and compared later in this chapter. Assuming there are no vehicles in the network initially, N(k) can be calculated as:
available initially in the network, can be calculated as:
TC′ ′
1
′ 0 (5.3)
where q_in(k) and q_out(k) are the entrance and exit rates of vehicles (veh/h), to and from the area confined by the detectors, at control cycle k, respectively. Substituting N(k) from (5.3) into (5.2) results in:
    TTT = T_c² Σ_{k=0..K} Σ_{k'=1..k} ( q_in(k') − q_out(k') )    (5.4)
Rearranging the summation operators and the constants in (5.4) results in:
    TTT = T_c² Σ_{k=1..K} ( K − k + 1 ) ( q_in(k) − q_out(k) )    (5.5)
The new formulation −Σ_k γ^k ( q_out(k) − q_in(k) ) is equivalent to the optimal control problem of equation (3.1), where q_out(k) − q_in(k) and (K − k + 1)/(K + 1) represent r(k) and γ^k, respectively. The negative sign changes the minimization into a maximization. Since γ is not directly evident in (5.5), it is necessary to find a γ value such that γ^k is a good approximation of (K − k + 1)/(K + 1). To achieve this, K is considered to be 30 control cycles (equivalent to a one-hour horizon based on a control cycle of 2 min) and γ is determined to be 0.94. Figure 5-3 shows that with γ = 0.94 and K = 30, γ^k is a good approximation of (K − k + 1)/(K + 1). A limitation of defining the reward as q_out(k) − q_in(k) is that it does not capture the spillback of congestion or the on-ramp queue beyond the loop detectors.
Figure 5-3 The actual weights of q_in − q_out and the discounted weights considered by the RLRM agent using a discount factor of 0.94. The actual weights were based on a control cycle of 2 min and a minimization horizon of 1 hr.
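Two aspects of the derivation can be checked numerically: the rearrangement from (5.4) to (5.5), and how closely γ^k with γ = 0.94 tracks the linear weights (K − k + 1)/(K + 1); the random entrance/exit rates below are illustrative values:

```python
import numpy as np

# (i) Check that the rearranged sum of Eq. (5.5) equals the cumulative
# double sum of Eq. (5.4); (ii) compare gamma^k against the actual
# weights (K - k + 1)/(K + 1) of Figure 5-3. Rates are illustrative.
rng = np.random.default_rng(1)
K = 30                       # horizon: 30 control cycles of 2 min = 1 hr
Tc = 2.0 / 60.0              # control cycle in hours
q_net = rng.uniform(-500.0, 500.0, K + 1)   # q_in(k) - q_out(k), veh/h

# Eq. (5.4): TTT as a cumulative double sum
ttt_cumulative = Tc ** 2 * sum(q_net[kp]
                               for k in range(K + 1)
                               for kp in range(1, k + 1))
# Eq. (5.5): each term weighted by the remaining horizon K - k + 1
ttt_weighted = Tc ** 2 * sum((K - k + 1) * q_net[k] for k in range(1, K + 1))

# Discounted approximation of the linear weights (gamma = 0.94)
k = np.arange(K + 1)
actual_weights = (K - k + 1) / (K + 1)
discounted_weights = 0.94 ** k
```

The two TTT expressions agree to machine precision, and the largest gap between the discounted and linear weights stays modest over the 30-cycle horizon, consistent with Figure 5-3.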
The reward function, r(k) = q_out(k) − q_in(k), contains two terms: q_in(k), which is independent of the agent's action and depends on the network demand; and q_out(k), which directly depends on the congestion level. Note that the traffic throughput depends on the traffic congestion level, and the agent's actions can vary the congestion level. Since q_in(k) is independent of the agent's action, it can be neglected when minimizing TTT. If q_in(k) is removed from the reward, the reward function becomes r(k) = q_out(k), which is equivalent to throughput. Therefore, the minimization problem is equivalent to a throughput maximization problem. Note that because γ < 1, earlier throughput values have more significance than later values. The new reward is based only on the traffic condition downstream of the ramp, and it is likely that excluding the upstream density from the state variables will not have a negative effect on RLRM performance with this reward definition.
5.1.1.5 Additional Notes on RLRM Design
One challenge associated with RLRM problems deals with traffic flow instability at the critical density, ρ_c. Since the traffic flow is unstable, at the beginning of the learning process, when the RLRM agent is immature, the freeway is largely congested. As the agent learns the dynamics of the congested traffic condition, it acquires the knowledge to shift the traffic density toward ρ_c. In fact, the RLRM agent first needs to learn the congested region of the state space to stabilize traffic, and then explore the parts of the state space around ρ_c. As a result, different states are explored at different stages
of the learning process, and typical methods for changing the learning rate and selecting actions as functions of time are not suitable for the RLRM problem. An alternative is to keep the number of visits, C(s, a), to each state-action pair. The learning rate and action selection policy for each state-action pair can then be defined as functions of the number of visits to that pair.
For the learning rate, an approach similar to the one discussed in 3.3.1.1 is employed, and a state-action pair dependent learning rate is defined as:

α(s, a) = 1 / (1 + C(s, a))^0.8 (5.6)
The action selection policy is based on the ε-greedy action selection approach discussed in 3.3.1.2. Similar to the work by Samah El-Tantawy and Abdulhai (2010), in this study a state-dependent ε is calculated as:

ε(s) = max(0.1, 10·n(s) / (10·n(s) + Σ_a C(s, a))) (5.7)

where n(s) is the number of possible actions in state s. Based on (5.7), ε is initially one, and as the number of visits to a state increases, ε decreases. ε will decrease to a minimum of 0.1, which corresponds to an average of more than 90 visits per action in that state.
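The visit-count-based schedules can be sketched as follows. The exact functional forms (a power-law learning rate and the 10-visit ε rule) are reconstructions consistent with the surrounding text of (5.6) and (5.7), not verbatim copies of the thesis formulas.

```python
# Sketch of visit-count-based schedules for learning rate and exploration.
# Both functional forms are assumptions reconstructed from the text.

def learning_rate(visits, power=0.8):
    """Learning rate for a state-action pair after `visits` updates:
    starts at 1 and decays as a power law of the visit count."""
    return 1.0 / (1.0 + visits) ** power

def epsilon(n_actions, total_visits, floor=0.1, scale=10):
    """Exploration rate for a state: starts at 1, decays with the total
    visits to the state, and never drops below `floor`."""
    return max(floor, scale * n_actions / (scale * n_actions + total_visits))
```

With a single action, epsilon reaches its 0.1 floor once the state has been visited 90 times, matching the behaviour described in the text.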
5.1.1.5.1 Penalty for Congestion
Although the RLRM agent using the above parameters is guaranteed to find the optimal control policy, measurement noise and model uncertainties increase the time needed to find the optimal policy. Since noise levels are highest under congested traffic conditions, in the absence of any guidance the agent initially gets trapped in congested conditions for a significant portion of the learning. Alternatively, under the operation of a trained RLRM, the density is expected to be close to the critical density. Adding a penalty term to the reward function for severe traffic congestion (densities above 35 veh/km/lane) guides the RLRM agent to choose actions that result in lower densities and to learn faster how to avoid congestion. This penalty should be significantly smaller than the reward; as a rule of thumb, an order of magnitude smaller. Otherwise, the RLRM agent would ignore the reward term and focus solely on minimizing the penalty.
5.1.2 Effect of Design Parameters on RLRM Performance
The different approaches to RLRM problem design discussed in the preceding sections are implemented for the study area, and the performances of the different approaches are compared. In this section, the focus is on the design parameters that are unique to RLRM: congestion penalty, type of action, state, and reward. Other parameters do not vary significantly from one problem to another; therefore, they can be extrapolated from other applications. The learning process of each agent is limited to 1000 epochs (each epoch is one Paramics simulation run of 4 traffic hours). Every point in the following graphs is a moving average of 30 epochs.
5.1.2.1 Congestion Penalty
Adding a penalty to the reward for severe congestion helps the RLRM agent to prevent mainline congestion. Three different penalty values are implemented and compared. In all three cases the main reward function, the state variables, and the direct action are identical.
A penalty is added to the main reward for densities above 35 veh/km/lane. Figure 5-4a illustrates the freeway
mainline total travel time under the three penalty values. As expected, higher penalty values will
result in faster learning in terms of avoiding congestion. In the case with no penalty, the learning
is significantly slower and the agent does not reach a congestion-free traffic flow after 1000
epochs. Figure 5-4b shows the whole network total travel time, which includes the time vehicles
spend on the mainline as well as on the on-ramp. The agent with no penalty learns very slowly and after 1000 epochs still requires further learning. The case with a penalty of 1000 converges to its best solution
after around 600 epochs, whereas the high penalty value of 2000 causes the agent to focus heavily
on mainline traffic and negatively affects RLRM performance and learning.
Figure 5-4 Effect of adding a penalty term to reward function for severe congestion. (a) The total travel time for freeway mainline, (b) the total travel time for the whole network.
5.1.2.2 Direct and Incremental Action
With direct action, all eight red timings discussed in Section 5.1.1.2 are possible in every state. On
the other hand, in incremental action the agent changes the signal timing one step at each control
cycle. Since the agent needs to keep the previous action as a state variable for incremental action,
the total number of state-action pairs increases. The rest of the design parameters are identical for
both agents. Figure 5-5 shows the learning performance of the two approaches. Both agents quickly
learn to prevent mainline congestion (a penalty term is considered for mainline congestion).
However, the agent with incremental action fails in terms of network TTT. The poor performance
of incremental action can be attributed to high measurement noise: any performance gain from incremental actions is lost in the noise, and the RL agent requires more learning to suppress it.
Figure 5-5 Performance comparison of RLRM agent with direct action and RLRM agent with incremental action. (a) Total travel time for freeway mainline only, (b) total travel
time for the whole network.
5.1.2.3 State and Reward
Since state and reward choices are closely related to each other, they have been analyzed
simultaneously. In this section, three rewards are implemented and compared. For the reward
all three state variables ( , , ) are considered and R-learning is used to train the
agent (case 1). The reward is implemented in two different ways: with all three state
variables (case 2) and with upstream density omitted from the state variables (case 3). Finally, the
reward is implemented, similar to case 3, with downstream density and on-ramp flow as state variables (case 4). For cases 2-4, Q-learning is employed, as the problem is discounted and Q-learning is applicable. Figure 5-6 shows the performances of the four different
cases. In case 1, unlike the other cases, R-learning is employed instead of Q-learning, and the
resulting RLRM agent performance is significantly lower. The poor performance of case 1 is
attributed to the complexities associated with the R-learning algorithm. Case 3, with a confined state space compared with case 2, stabilizes the density around the critical density with fewer learning epochs. However, it requires more epochs to suppress the uncertainties caused by not having the full state of the environment. Although in case 2 the agent learns more slowly than in cases 3 and 4 because of the larger state space, after learning is complete the performances of cases 2 and 3 are comparable. This finding shows that, as expected, a smaller state space results in faster learning; however, if the state variables do not completely define the reward, the agent's performance will be degraded. The quick learning and good performance in case 4 show that considering throughput as the reward, in conjunction with a few traffic variables as the state space, can result in minimizing TTT.
Figure 5-6 Effect of different reward choices on RLRM performance. In case 1 the state variables are downstream density, upstream density, and on-ramp density. In case 2 the reward differs and the state variables are the same as in case 1. Case 3 is similar to case 2 except that upstream density is omitted. In case 4 the state variables are downstream and on-ramp densities.
5.1.2.4 Best Design for Single-agent RLRM
Learning of an RLRM agent can be very slow if the agent is not guided to avoid the congested
regions. Adding a penalty for any state that is severely congested significantly improves the
learning speed by guiding the RLRM agent toward maintaining the traffic close to critical density.
The magnitude of the penalty is very important, as small values would not be effective and large values
would degrade the performance. Experiments have shown that a penalty value equal to about 10%
of capacity provides proper guidance without negatively affecting the performance. The direct
action was found more effective than incremental action in the RLRM problem. Furthermore, for
the Highway 401 test case with about 900 veh/hr demand at its peak, one-car-per-green policy is
sufficient and very effective. For the single-agent RLRM problem, choosing the throughput as the
reward with a discount factor of 0.94 will result in optimal TTT. The simple definition of the
reward allows minimal state variables (downstream density and on-ramp flow) for its representation, which results in fast learning.
5.1.3 Comparison with ALINEA Controller
To compare the performance of the RLRM agent with other traffic-responsive ramp metering
algorithms, ALINEA (Papageorgiou, Hadj-Salem, & Blosseville, 1991b) is considered as the
benchmark. The ALINEA controller is an integral controller with robust performance. Field implementations of ALINEA on various European freeways have resulted in savings in Total Travel Time of 5% to 20% (Papageorgiou et al., 1997). The desired density for ALINEA was set to 25 veh/km/lane, and the gain was tuned through trial and error, with the value that resulted in the best TTT retained. Three different cases were then considered: the base case with no
ramp metering, the ALINEA controller, and the RLRM agent with the best design parameters
discussed above. The RLRM agent was trained for 1000 epochs and the greedy policy ( 0 for
always choosing optimal action, i.e. no exploration for learning) was evaluated. To eliminate
variations caused by the stochastic behaviour of Paramics, each case was simulated 15 times with
different seeds (initial randomization parameter). The results of the simulations were averaged and
are summarized in Table 5-3. While ALINEA improves freeway TTT by 15%, in this case study
the RLRM agent improves TTT by 25% and outperforms ALINEA significantly. The improvement
can be associated with the non-linear reaction of the RLRM control agent to changes in traffic
condition. Additionally, the RLRM controller can change the metering rate freely, resulting in
quicker response compared to ALINEA. Looking at the mainline TTT, it can be seen that the two
controllers result in similar travel time for vehicles traveling along the freeway mainline. However,
when the on-ramp wait time is considered, the RLRM achieves the best TTT savings of 25%.
Table 5-3 Summary of the simulation results for the single-ramp test case with conventional RL algorithms.

Performance Measures                  No RM   ALINEA   RLRM
TTT (veh.hr)                          2381    2028     1785
TTT savings                           -       15%      25%
Mainline TTT (veh.hr)                 2326    1180     1143
Mainline TTT savings                  -       50%      51%
Average on-ramp waiting time (min)    < 1     13       9
5.2 Experiment II – RL-based RM with Function Approximation
One drawback of RLRM based on conventional discrete state algorithms is the slow learning
speed, which is further exacerbated with the size of the state-action space. The learning process
for the simplest RLRM agent takes more than 1000 epochs. To implement more complex RM systems with queue management and coordination, the use of function approximation to increase the learning speed is a necessity. Three function approximation approaches were
investigated to find the one most suitable for RL-based RM problems. The first approach was the
kNN-TD(λ) algorithm which is the direct generalization of discrete state RL to continuous states.
Therefore, it shares most of the characteristics and solid foundation of the conventional RL
algorithms. The second approach was the use of the Multilayer Perceptron neural network (MLP)
for function approximation. MLP has been very popular among researchers and is widely used for
function approximation in RL (Sutton & Barto, 1998). The third approach was the Linear Model
Tree (LMT) for function approximation. The underlying models in an LMT are linear functions and are expected to provide good generalization over noisy samples. This characteristic makes the LMT a very good alternative for function approximation in transportation problems.
The approaches were applied to the Highway 401 test case. The SARSA learning approach
was employed for training of the agents. It is very straightforward to use SARSA in conjunction
with eligibility traces to speed up the learning for the table-based and kNN-TD(λ) approaches. The eligibility trace parameter employed was λ = 0.8 (Singh & Sutton, 1996). Although eligibility traces cannot be
employed for MLP and LMT, these algorithms can benefit from the SARSA approach to learning
as well. The update rule in Q-learning depends on the best outcome in the next state, max_a Q(s′, a). In the early stages of learning, an unseen state-action pair might have the best
outcome. Unlike table-based approaches, when an MLP or LMT is updated, the whole function is updated, affecting the value of the aforementioned unseen state-action pair. Given that this state-action pair is not explored, it will have a floating value with no reference target. This condition forms a positive feedback loop, resulting in divergence of the function approximator. It is possible to slow down the learning by reducing the learning rate to avoid divergence, but this contradicts the purpose of function approximation, which is to increase learning speed. Another solution is to employ a learning approach similar to SARSA. In SARSA, learning is based on the agent’s actual actions and experiences. Therefore, the state-action pairs used for learning will always be based on real experience, eliminating the possibility of divergence.
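The difference between the two bootstrap targets can be made concrete with a minimal sketch; the tabular dictionary below stands in for any Q-function representation, and the names are illustrative.

```python
# Sketch contrasting the Q-learning and SARSA bootstrap targets. With a
# global function approximator, the Q-learning max can latch onto an
# unvisited (and therefore unanchored) action value; SARSA bootstraps
# only on the action the agent actually took.

def q_learning_target(reward, gamma, q_next):
    """Bootstrap on the best action value in the next state."""
    return reward + gamma * max(q_next.values())

def sarsa_target(reward, gamma, q_next, next_action):
    """Bootstrap on the action the agent actually selected."""
    return reward + gamma * q_next[next_action]
```

If `q_next` contains an inflated estimate for an unexplored action, the Q-learning target propagates it, whereas the SARSA target ignores it unless that action was actually chosen.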
5.2.1 Design of Function Approximation Approaches
The kNN-TD(λ) algorithm has a similar structure to the table-based RL with the added
generalization; therefore, similar design parameters were used. The kNN-TD(λ) centers were
placed in the middle of the discretization intervals defined for table-based RL. For the number of neighbours used in the weighted averaging, k, three cases with k = 2, 4, and 8 were considered, and the case with k = 4 was found to have the best learning speed.
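The recall step of kNN-TD(λ) can be sketched as a weighted average over the k nearest stored centers. The inverse-distance weighting below is an illustrative assumption; the thesis excerpt does not specify the kernel.

```python
# Sketch of kNN value recall: the value at a continuous state is a
# weighted average of the values of the k nearest centers. The
# 1 / (1 + d^2) weighting is an assumption for illustration.

def knn_value(state, centers, values, k=4):
    """Weighted average of the values of the k nearest centers to `state`."""
    dist2 = [sum((s - c) ** 2 for s, c in zip(state, ctr)) for ctr in centers]
    nearest = sorted(range(len(centers)), key=lambda i: dist2[i])[:k]
    weights = [1.0 / (1.0 + dist2[i]) for i in nearest]
    total = sum(weights)
    return sum(w * values[i] for w, i in zip(weights, nearest)) / total
```

Placing centers at the midpoints of the old discretization intervals, as the text describes, makes this a smooth generalization of the table lookup: querying exactly at a center with k = 1 recovers that center's value.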
In the MLP-based RL the added parameters are related to the MLP structure and training
of the MLP in each epoch. The number of hidden neurons in the MLP, after experimentation with different values, was chosen to be 20. The states were normalized based on their maximum and minimum so that inputs to the MLP remained confined to [0, 1]. For the training of the MLP after each epoch, the samples were split into 70% training data and 30% test data, and the Levenberg-Marquardt (Hagan & Menhaj, 1994) technique was employed. The training was terminated when the test data error did not improve after six successive iterations.
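The per-epoch training procedure can be sketched as follows. The Levenberg-Marquardt step itself is not reproduced; only the input normalization and the six-iteration early-stopping rule are shown, with illustrative names.

```python
# Sketch of the per-epoch MLP training loop: normalize inputs to [0, 1]
# and stop when the validation (test-split) error fails to improve for
# six successive iterations. The optimizer itself is a stand-in.

def normalize(x, lo, hi):
    """Scale a raw state variable into [0, 1] given its known range."""
    return (x - lo) / (hi - lo)

def train_with_early_stopping(val_errors, patience=6):
    """Return the iteration at which training stops, given the sequence
    of validation errors observed after each training iteration."""
    best, since_best = float("inf"), 0
    for i, err in enumerate(val_errors):
        if err < best:
            best, since_best = err, 0
        else:
            since_best += 1
            if since_best >= patience:
                return i
    return len(val_errors) - 1
```

The patience rule halts training as soon as six iterations pass without the held-out error improving, which keeps the network from overfitting the current epoch's samples.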
The LMT-based RL does not require any structural parameters besides the boundaries of
the inputs. The three LMT training parameters, , , and , were assigned to be 5, 0.01%,
and 0.005, respectively (Potts & Sammut, 2005).
5.2.2 Simulation Results
In addition to the table-based RL, the three presented RL approaches with function approximation
were applied to the test case above and trained for 2000 epochs. The four controllers were
evaluated in terms of design effort, computational needs, learning speed, and impact on freeway
performance.
5.2.2.1 Design Effort
The design effort for the table-based RL and kNN-TD(λ) was found to be comparable, and higher than that of the other two approaches (MLP and LMT), particularly because of the significance of discretization. It is worth noting that, for the same reason, the design effort of these two algorithms will increase exponentially with the problem size. The parameters related to Q-learning are well studied in the literature and do not require experimentation to achieve proper results.
For the number of hidden neurons in the MLP, three values were examined: 10, 20, and 40. The increase from 10 to 20 neurons resulted in better performance, but the increase to 40 neurons did not show a significant difference in overall performance. Note that the number of neurons cannot be generalized to more complex RL problems. The MLP's training parameters, such as the learning approach and learning rate, had a significant effect on the convergence of the training process. A lower learning rate can result in slow training, while higher values can quickly saturate the neural network weights, causing subpar RM performance; therefore, careful tuning is required.
The training of the LMT-based RLRM agent was found to be robust to the choices of LMT
parameters and significantly easier than the other approaches. Varying the parameters , ,
and around their suggested values did not have a significant effect on learning convergence
speed and RM performance. However, as expected, increasing and or decreasing would
result in larger tree size and therefore increased training computation time.
5.2.2.2 Computational Needs
The computation time of the case study in this research was dominated by the microscopic
simulation and it is safe to say that all approaches had a similar overall computation time. However,
it is helpful to discard the microsimulation processing time and analyze the pure training time of
different approaches, which is summarized in Table 5-4. The on-line training of table-based RL and kNN-TD(λ) allows efficient training with only new samples, whereas the batch training of LMT and MLP after each epoch results in higher training computational effort. It should be noted that the linear regression in the LMT is much faster than the MLP training method. The computation time
related to recall of the function approximator during decision-making is important for field
implementation where the computation power is limited. The numbers in Table 5-4 are specific to
this test case, and with increase in network size they are expected to increase linearly except for
the kNN-TD(λ), which is expected to increase exponentially.
5.2.2.3 Learning Speed
Since the simulation time is dominant compared with the function approximation training time, learning speed is measured in simulation epochs rather than computation hours. Figure 5-7
shows the average travel time at every epoch as agents learn. The learning speeds of the RM agents
with function approximation are significantly faster than the table-based RL. The LMT-based and
MLP-based approaches best utilize the training samples, resulting in very quick learning; however,
the MLP-based RL fails to achieve the highest performance obtained by the LMT-based approach.
Although the learning speed of kNN-TD(λ) is not as good as that of the MLP and LMT approaches,
its robust algorithm guarantees a relatively good performance after its learning is complete, as
shown at epoch 2000 of the figure. The slow learning speed of the table-based RL is evident; it has not completely converged after 2000 epochs and could yield better performance if trained for more epochs.
Figure 5-7 Learning speed and solution quality of the presented four RL approaches. The
curves above are obtained by averaging multiple epochs through a moving-average window for clarity. The actual results have significantly more variation from one epoch to another because of the stochastic nature of the microscopic simulation.
5.2.2.4 Transportation Network Performance
To compare the performance of the freeway with the different RLRM algorithms, the agents were set to exploit their learned knowledge after being trained for 2000 epochs. For each RLRM approach, the
network was simulated 15 times with different seed numbers to account for the stochastic
behaviour in the Paramics simulations. The results were averaged and are summarized in
Table 5-4. Average network travel time accounts for vehicles' travel time from origin to destination
including on-ramp travel time, if any. Average mainline travel time only includes the time vehicles
spend on the freeway mainline until reaching their destination. As expected, all the RM approaches
improved the network performance compared with the base case, with savings ranging from 23.7%
in the Table-based approach to 36.8% in the kNN-TD(λ) approach. The RLRM agents with kNN-
TD(λ) and LMT performed noticeably better than the Table and the MLP approaches. It is worth
noting that the LMT-based approach achieves performance similar to the kNN-TD(λ) approach although its learning speed is an order of magnitude faster. Furthermore, the learning time of kNN-TD(λ)-based agents is expected to increase exponentially with problem size, whereas it would be linear for LMT-based agents in the worst case. The limited performance of the MLP approach can be
attributed to the difficult choice of learning rates.
Table 5-4 Comparison of performance of different RLRM approaches

Computation Effort                                 Table    kNN-TD(λ)   MLP      LMT
Learning computation time per epoch (sec)          0.16     0.22        15.587   2.005
Recall computation time per control cycle (sec)    0.0001   0.00014     0.003    0.003

Performance Measures                    No RM   Table   kNN-TD(λ)   MLP     LMT
Average network travel time (min)       4:51    3:42    3:04        3:41    3:11
Average network travel time savings     -       23.7%   36.8%       24%     34.3%
Average mainline travel time (min)      4:45    2:16    2:10        2:16    2:12
Average mainline travel time savings    -       52.3%   54.4%       52.3%   53.7%
Average on-ramp waiting time (min)      0:45    11:14   6:57        11:09   7:43
5.3 Experiment III – Gardiner: Independent and Coordinated Ramp
Metering
Considering the experience and knowledge gained by applying RLRM to the single-ramp problem, the best of the proposed algorithms was applied to the Gardiner model, which exhibits the common challenges present in an RM application. This section discusses the design approach and
simulation results of applying the RLRM to the Gardiner Expressway.
5.3.1 RLRM Design for Coordinated Ramp Metering
In previous sections, the design of RLRM was focused on a single ramp while comparing different
approaches. Since some of these approaches were not very efficient, in this section we will focus
only on the best-performing approach.
Given the quick learning of the LMT-based RL and its strong performance, it has been considered the best learning approach. Additionally, the advantage updating approach presented
in section 3.4.4 is employed in conjunction with LMT function approximation. The advantage
updating isolates the effect of action on the reward of the agent from the value of future states,
thereby eliminating any bias in the function approximation.
5.3.1.1 Independent Agents
Independent agents will optimize their action according to the local reward that they receive.
Considering that the same agents would be coordinated later, their design was made with their
future coordination in mind. For each on-ramp agent a local area was defined which included the
on-ramp and sections of the mainline near the on-ramp. The agents' reward was defined as the total traffic leaving the section minus the traffic entering the section. Note that
leaving traffic includes both off-ramps and downstream traffic, and entering traffic includes both
on-ramp and upstream traffic. Figure 5-8 shows the location of entry and exit flows for each RM
agent. The green rectangles are exit flows and red ellipses are entry flows. Adding the individual
agents' rewards together gives the global reward for the whole network (total vehicles exiting minus total vehicles entering). Therefore, this reward definition, while suitable for
independent agents, also satisfies the necessary conditions for the coordinated multi-agent
algorithm proposed in Section 3.5.2, which would be sought later in this chapter. In addition to the
basic reward, a penalty term was also considered for mainline congestion to facilitate the avoidance
of congestion in the learning process.
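The additivity of this reward definition can be illustrated with two consecutive sections: the shared boundary mainline flow cancels when the local rewards are summed, leaving exactly the network-wide exits minus entries. All flow values below are illustrative.

```python
# Sketch of the local reward and its additivity: each agent's reward is
# traffic leaving its section minus traffic entering it. Summing over
# consecutive sections, the shared boundary flow cancels.

def local_reward(out_flows, in_flows):
    """Vehicles leaving the section minus vehicles entering it."""
    return sum(out_flows) - sum(in_flows)

# Two consecutive sections sharing a boundary mainline flow q_mid:
q_up, q_mid, q_down = 5000, 4800, 4600    # mainline flows (veh/hr)
r1 = local_reward([q_mid, 300], [q_up, 400])    # off-ramp 300, on-ramp 400
r2 = local_reward([q_down, 200], [q_mid, 500])  # off-ramp 200, on-ramp 500

# Global reward: all exits minus all entries; q_mid telescopes away.
r_global = (q_down + 300 + 200) - (q_up + 400 + 500)
```

This telescoping is what makes the local reward compatible with the coordinated multi-agent algorithm of Section 3.5.2: optimizing the sum of local rewards is the same as optimizing the global one.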
As discussed in Section 5.1.2.3, this reward definition requires more detailed state
information. Given that LMT size grows as necessary to fit the output and is not based on the
number of inputs, the size of the input state does not affect the learning performance of the LMT-
based RL agents. Therefore, all the variables needed for the complete state of traffic near an agent
are included in its state space definition. These variables are downstream density, downstream
speed, upstream density, upstream speed, ramp flow entering freeway, demand entering the ramp,
and ramp queue. All variables were calculated from loop detector measurements. The ramp queue is estimated
based on the number of vehicles present between the on-ramp detector and the signal detector
(refer to Figure 5-1).
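The queue estimate can be sketched as a running conservation count between the two detectors; function and variable names below are illustrative.

```python
# Sketch of ramp queue estimation by vehicle conservation between the
# on-ramp entrance detector and the signal detector: the queue is the
# running difference of the two detector counts, floored at zero.

def update_queue(queue, entered, released):
    """Add vehicles counted in at the ramp entrance, subtract vehicles
    counted out at the metering signal; the queue can never go negative."""
    return max(0, queue + entered - released)

q = 0
for entered, released in [(5, 2), (6, 2), (1, 4), (0, 6)]:
    q = update_queue(q, entered, released)
```

Flooring at zero guards against detector miscounts accumulating into a negative queue estimate.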
Figure 5-8 The schematic of the Gardiner showing the location of entry and exit flows for
each individual RM agent.
For the Gardiner on-ramps, the discrete release rates signal policy was employed because of the high demand from the on-ramps, which reaches 1600 veh/hr. The LMT-based RL does not require a learning rate, as the tree is rebuilt from new samples after each epoch. Unlike table-based RL, when states are continuous variables and an LMT is employed, visits to a certain state cannot be directly counted. Therefore, the action selection policy was defined based on the learning epoch rather than state visits. The action selection was ε-greedy, with ε decreasing linearly with every epoch to a value of 0.1 after 100 epochs.
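This epoch-based schedule can be written in one line; the linear form matches the description in the text, with the function name assumed for illustration.

```python
# Sketch of the epoch-based exploration schedule: epsilon decreases
# linearly from 1 to a floor of 0.1 over the first 100 epochs, then
# stays at the floor.

def epsilon_by_epoch(epoch, floor=0.1, decay_epochs=100):
    return max(floor, 1.0 - (1.0 - floor) * epoch / decay_epochs)
```

Unlike the visit-count rule used for the table-based agents, this schedule needs no per-state bookkeeping, which is what makes it workable with a continuous-state LMT.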
5.3.1.2 Limited Queue Space Consideration
Limited queue space is a challenge in RM applications. RM approaches that employ a mathematical model and optimize the metering rates can implement limited queue space as a constraint in the optimization process, so that actions which would cause queues to exceed the available space are avoided. In RL, physical constraints cannot be defined as hard constraints; however, they can be introduced into the problem as soft constraints through a penalty term. Over time, agents learn to balance the penalty of exceeding the constraint against the higher reward.
Considering the current queues and acceptable conditions for downtown Gardiner on-
ramps, the maximum queue capacity was defined as 150 vehicles. Therefore, a penalty term was
added to the agents’ reward when the queue exceeded 150 vehicles. Note that when the benefit
from higher throughput is more significant, agents may temporarily choose actions that result in
queues exceeding 150 vehicles. The penalty weight is the same as the penalty for mainline
congestion. The RLRM agent, therefore, had to strike a balance between avoiding mainline
congestion and queues extending above capacity.
5.3.1.3 Coordination of Multiple RLRM Agents
Coordination among the agents is achieved by sharing their states and Q-values. Each agent
augments its state space with the state variables of its neighbours. Additionally, agents consider
actions of their neighbours when building their Q-functions. The augmented states in conjunction
with joint action allow the agents to consider the effect of neighbours' action on their Q-values and
vice versa. Therefore, they can choose the action that benefits the neighbourhood instead of their
local reward. The only negative effect of coordination is the increased number of input variables
of function approximation because of the augmented state space. Although the LMT can handle the augmented state space, the increased input size requires more computation when fitting an LMT to the samples.
In cases with unlimited queue space, the coordination of RLRM agents does not have any
effect. In theory, each agent is trying to maximize its throughput, and its optimal action will result
in an uncongested freeway. Therefore, the optimal action of each agent is also optimal for the
whole network. However, in cases with limited queue space, RLRM agents have to decide between
extra queue penalty and mainline congestion penalty. In these conditions, coordination allows the
upstream agent to observe the penalty its neighbour is experiencing. To increase the
neighbourhood reward, upstream agents can reduce their ramp flow and free some road space for the downstream on-ramp.
The coordination of RLRM agents is performed for the case with limited queue space. The
Jameson on-ramp is considered as an independent agent, and the three downtown on-ramps are
coordinated. Figure 5-9 shows the coordination and communications between RLRM agents. Note
that each agent will only coordinate with its neighbours. Coordination with on-ramps farther than
the adjacent ones is also possible; however, in the Gardiner test case, it was found that coordinating
with two on-ramps on each side does not yield any significant improvement over coordinating with only the adjacent on-ramps.
5.3.1.4 Coordination for Queue Balance
Coordination of RLRM agents with limited queue space can improve the queue management and
allow agents to better utilize the queue space of adjacent on-ramps. This way, the downstream queue will start filling up first, and once it gets close to its limit, the upstream on-ramp will start to limit its ramp flow. However, there is no guarantee that all on-ramps will have the same
level of service. It is desirable to have the same level of service for users of different on-ramps to
discourage them from changing route in order to bypass the queue.
[Figure 5-9 schematic: four RLRM agents between Lakeshore and Spadina, each observing its local traffic state and issuing signal timings; neighbouring agents communicate traffic states, actions, and action selection negotiation.]
Figure 5-9 Communication between RLRM agents of the Gardiner.
Assuming the freeway mainline is not congested, the factor affecting travel time the most is the ramp queues. To achieve the same level of service among different on-ramps, the goal can be to equalize the ramp queues. To force the RLRM agents to equalize their queues, another penalty term is added to the agents' reward. This penalty is added when the queue of the downstream on-ramp is greater than the agent's own queue by 50 vehicles.
5.3.2 Simulation Results and Controller Evaluation
The proposed algorithms were applied to the Gardiner test case to evaluate their performance. It is
important to understand the current condition of the Gardiner and identify its limitations and
challenges. Section 5.3.2.1 provides a description of current conditions during the evening peak period and highlights the benefits of metering individual ramps. Sections 5.3.2.2 onward provide quantitative performance assessment and comparison of independent vs. coordinated multi-agent RLRM, where all ramps are metered concurrently, as well as a comparison with ALINEA.
5.3.2.1 Base Case and Performance Improvement via Local Metering of Individual Ramps
The evening peak period of the westbound Gardiner Expressway was utilized as the test case
for evaluation of the proposed algorithms. The Gardiner is one of the main arteries out of
downtown Toronto in the westbound direction during the evening commute. The demand to enter
the Gardiner from the three on-ramps in the downtown area exceeds 4000 veh/hr. This demand
when added to the traffic flow on the mainline from further upstream, surpasses the freeway capacity of approximately 6000 veh/hr. The demand from the Jameson on-ramp, which is also used for transferring from Lakeshore Boulevard to the Gardiner, averages 1000 veh/hr. The ramp itself is very short, creating significant traffic turbulence and merging hazards. Therefore, the City
of Toronto closes the ramp entirely from 15:00 to 18:00 every day. Table 5-5 shows the demands
downstream of the four freeway on-ramps. These numbers are based on the calibrated OD
matrices. Note that the demand does not necessarily mean the amount of traffic that will pass those
locations. In fact, when demand exceeds capacity congestion occurs in the bottleneck. In the
demands shown here the closure period of the Jameson on-ramp is considered.
Table 5-5 Demand (veh/hr) for accessing the freeway mainline downstream of each on-ramp
1‐2 pm 2‐3 pm 3‐4 pm 4‐5 pm 5‐6 pm 6‐7 pm 7‐8 pm 8‐9 pm
Jarvis 3472 4027 3899 4096 4066 3507 2906 2533
York 3883 4661 4515 4772 4614 3771 3318 3116
Spadina 5122 6088 6098 6240 6052 5134 4499 4073
Jameson 5688 6710 5694 5912 5607 5818 4985 4429
As can be seen from the table, demand significantly exceeds capacity for the Jameson on-
ramp stretch in the interval between 14:00 and 15:00. Similarly, the demand for the Spadina on-
ramp is significantly high from 14:00 to 18:00 and peaks in the interval between 16:00 and 17:00.
The space-time diagram of speed in Figure 5-10 shows the formation of congestion at the bottlenecks. Although the mainline demand at Spadina in the 14:00 to 15:00 interval is not much higher than capacity, the congestion building up at Jameson propagates upstream and accelerates congestion upstream of Spadina. The demand from the Jameson zone after 18:00, when Jameson reopens, is less than capacity, but the vehicles that entered the freeway earlier are stuck in congestion and trigger another bottleneck at the Jameson on-ramp after 18:00.
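The over-capacity intervals described above can be read off Table 5-5 programmatically; a quick sketch, assuming the roughly 6000 veh/hr capacity quoted earlier and hourly intervals starting at 13:00:

```python
# Demands from Table 5-5 (veh/hr), hourly intervals from 13:00 to 21:00.
demand = {
    "Jarvis":  [3472, 4027, 3899, 4096, 4066, 3507, 2906, 2533],
    "York":    [3883, 4661, 4515, 4772, 4614, 3771, 3318, 3116],
    "Spadina": [5122, 6088, 6098, 6240, 6052, 5134, 4499, 4073],
    "Jameson": [5688, 6710, 5694, 5912, 5607, 5818, 4985, 4429],
}
CAPACITY = 6000  # approximate freeway capacity (veh/hr) stated in the text

def over_capacity_hours(flows, capacity=CAPACITY):
    """Return the start hours (24 h clock) of intervals where demand exceeds capacity."""
    return [13 + i for i, q in enumerate(flows) if q > capacity]

# Jameson exceeds capacity only in the 14:00-15:00 interval (6710 veh/hr),
# while Spadina exceeds it in every interval from 14:00 to 18:00.
```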
5.3.2.1.1 Jameson Ramp Metering
The Jameson on-ramp has the most significant effect on the Gardiner congestion. Although the on-
ramp is closed from 15:00 until 18:00, the congestion formed before 15:00 has a lasting effect until
the ramp reopens at 18:00. If the congestion at Jameson can be avoided, the freeway performance
will be greatly improved. Metering the Jameson on-ramp will prevent the freeway from breaking
down and congestion from propagating upstream, without the need to close the on-ramp. It is
noteworthy that, although full ramp closure can be viewed as the most aggressive metering
allowing zero entries, the closure starts late after congestion is already triggered, and when closure
is in effect, the freeway becomes underutilized, i.e. closure is not an optimal control method.
Figure 5-11 shows the freeway throughput after the Jameson on-ramp. As can be seen, the freeway
throughput between 14:00 and 15:00 with independent ramp metering is higher than in the base
case. After closure of the on-ramp, throughput drops to about 5600 veh/hr, which results in
underutilization of the freeway. This extra space is available because of the vehicles exiting the
freeway through the Dunn off-ramp.
Figure 5-10 Colour-coded space-time diagram of base case traffic speed.
It is important to see how many vehicles have taken the on-ramp in the two cases as the
ramp metering is often criticized for sacrificing on-ramp users for the benefit of through traffic.
Figure 5-12 shows the number of vehicles that have taken the Jameson on-ramp. Although ramp
metering has caused almost half the vehicles to reroute to Lakeshore between 14:00 and 15:00, the loss is compensated by the vehicles entering the freeway between 15:00 and 18:00. In fact, 6122
vehicles are served through Jameson on-ramp in the ramp metering case compared with 5142
vehicles in the base case.
Figure 5-11 Freeway throughput after the Jameson on-ramp in the base case and with
ramp metering.
Figure 5-12 Comparison of the Jameson on-ramp traffic flow in the base case and with
independent ramp metering.
5.3.2.1.2 Spadina Ramp Metering
Spadina is another critical on-ramp and bottleneck in the evening peak period. Figure 5-13
shows freeway throughput after the Spadina on-ramp. In the base case, throughput is significantly
lower than capacity during the 14:00 to 15:00 interval. This capacity loss is because of the Jameson
bottleneck, which spreads to Spadina. However, even after 15:00, when Jameson is closed and the congestion downstream of Spadina has cleared, the throughput hardly reaches 6000 veh/hr. Employing ramp metering increases throughput by about 5%, which is in agreement with the capacity drop caused by congestion.
Figure 5-13 Freeway throughput after the Spadina on-ramp in the base case and with
independent ramp metering.
5.3.2.2 Concurrent Multiple Independent Agents
Independent RLRM agents (named RLRM-I) were trained and evaluated with the Gardiner model
and compared with ALINEA as well as the base case scenario. Considering that the Jameson
bottleneck in the 14:00 to 15:00 interval causes the most congestion, one might suggest extending
the Jameson closure period to include 14:00 to 15:00. This case is also evaluated and called
Jameson2pmClose in this document. The four scenarios were simulated with 15 different seed
numbers to represent traffic variation on different days. Figure 5-14 shows the total vehicle hours
traveled for the whole network (TTT) as well as the freeway mainline only (TTTml) for the four
different scenarios. As expected, eliminating the Jameson bottleneck by closing it earlier
significantly improves freeway performance. However, the bottleneck at Spadina still contributes to congestion; hence, TTTml remains high and varies across different simulation runs. This variation across runs translates to unreliable travel times, which is always a concern in transportation networks. Both ALINEA and RLRM-I properly eliminate congestion and result in a TTTml that is essentially the same across all simulations. However,
ALINEA is not as efficient as RLRM-I in utilizing the freeway capacity, and results in significantly
higher TTT. The RLRM-I controller produces a 48% reduction in TTT.
Figure 5-14 Freeway performance for four different scenarios. The error bars show the standard deviation across different simulation runs.
Although RM improves the freeway performance, it is important to monitor its effect on
the on-ramp users. For the Jameson on-ramp, the excess demand is rerouted to Lakeshore
Boulevard and results in higher travel time for those vehicles. Figure 5-15 shows the average travel
time that vehicles originating from the Jameson zone have experienced. As can be observed, taking the Lakeshore instead of the Gardiner results in a 4-min increase in travel time. Closing the ramp from 15:00 to 18:00 is not enough, yet closing it from 14:00 to 18:00 is too restrictive.
Ramp metering essentially acts as an adaptive ramp closure. Ramp metering limits the ramp access
when there is high demand and keeps the ramp open when demand is low. Inevitably, ramp
metering would result in higher travel time for Jameson on-ramp users compared with no ramp
metering. However, optimal ramp metering imposes the minimum additional travel time compared
with any pre-timed approach.
Figure 5-16 shows the time-space diagram of traffic speed for the Jameson2pmClose case
against the ramp metering scenario. Employing ramp metering with an RLRM-I controller
completely eliminates congestion from the freeway. In the no ramp metering case, as expected,
when demand exceeds capacity the freeway breaks down. Figure 5-17 shows the queue on the
three downtown on-ramps throughout the simulation period. Although ramp metering would result
in longer queues for the Spadina on-ramp, eliminating congestion allows free-flow traffic movement on the York and Jarvis on-ramps. In the Jameson2pmClose case, as congestion builds up at Spadina, it blocks the entrances from upstream on-ramps and causes queues at the York and Jarvis on-ramps. In these runs, no limit was imposed on the Spadina queue; therefore, when ramp metering is employed, the queues at Spadina exceed the 150-vehicle limit, and in the ALINEA case they reached as many as 500 vehicles. Queue management is introduced later in the chapter.
Figure 5-15 Average experienced travel time of vehicles starting from the Jameson zone
until the end of the network in the west.
Figure 5-16 Time-space diagram of traffic speed for RLRM-I (left) and Jameson2pmClose
(right).
Figure 5-17 Queues for the three on-ramps throughout the simulation period.
The average travel time from different origins in downtown to the west end of the network
at Humber Bay is shown in Figure 5-18. The effect of ramp metering is clear in the travel times
from origins upstream of Spadina (DVP, Jarvis on-ramp, and York on-ramp), which are all at free-flow travel times with RM, whereas travel times of trips originating from the Spadina on-ramp are significantly higher. The results are the opposite in the Jameson2pmClose case, as expected: there, travel times increase as we move upstream of Spadina. Comparing RLRM-I with ALINEA,
travel times are the same except for trips from the Spadina on-ramp. The lower travel time for
RLRM-I case shows its better efficiency in terms of utilizing the freeway space and allowing more
traffic to enter the freeway from the Spadina on-ramp. It is important to note that the high base-case travel times for Spadina on-ramp trips are caused by the Jameson bottleneck. Since ramp metering eliminates the congestion caused by Jameson, the overall travel time from the Spadina on-ramp is lower in the RLRM-I case.
Finally, the above analyses answer the fundamental question of whether the overall system gain in terms of TTT improvement justifies longer waits on the on-ramps. In other words:

Are the on-ramp travellers sacrificed to improve overall TTT and flow on the main freeway, a gain primarily experienced by upstream through traffic?

Would the time lost waiting on the on-ramps under RM be regained through faster travel after getting on the freeway?

Our conclusions are:

Waiting at the upstream on-ramps (Jarvis and York) under RM is well worth it for those travellers: not only does the overall system benefit in terms of least TTT, but travellers from those on-ramps also benefit in terms of faster journey times.

The above is not necessarily the case for the Spadina travellers. Waiting on the on-ramp under RM, although it benefits the overall system in terms of TTT, results in longer travel
times for the Spadina travellers under ALINEA, and the same travel time under RLRM-I. This indicates that the Spadina travellers inequitably bear the burden of improving the system TTT and the travel times of upstream travellers. This motivates the question of whether a better queue management approach in conjunction with RM could even out the ramp wait burden across all ramps, such that not only the overall system TTT improves but also the travel times for on-ramp travellers; this is addressed using coordinated agents later in this chapter.
Figure 5-18 Average travel time for trips originating during 4-5 pm from origins in the downtown to the west end of the network for the four scenarios.
5.3.2.3 Independent Agents with Limited Queue Space
Metered on-ramps require queue management to ensure that excessive queues do not affect nearby arterials. The ALINEA controller can be augmented with a queue override algorithm (named ALINEAwQO), which increases the ramp flow when the queue reaches its predefined limit. In the RL-based algorithm, constraints on the queue are implemented through a penalty imposed on the agent when queues exceed a certain limit (named RLRM-IwQO). Figure 5-19 shows the queues on the Spadina and York on-ramps. In the RLRM-I case queues exceed 250 vehicles, but
in the RLRM-IwQO case queues are much lower and do not exceed 100 vehicles.

Figure 5-18 data: travel time (min) from downtown origins to Humber Bay
                 Base Case  Jameson2pmClose  ALINEA  RLRM-I
DVP                  14.34            11.20    6.47    6.34
Jarvis on-ramp       16.65            10.78    6.33    6.19
York on-ramp         16.27            11.09    5.60    5.47
Spadina on-ramp      12.13             9.02   14.14   11.79

Although in
ALINEAwQO ramp flows are strictly enforced so that queues should not exceed the limit, the Spadina queues exceeded 150 vehicles. The reason ALINEAwQO is unable to manage the queues is the very high demand from the Spadina zone: when the freeway breaks down, even keeping the ramp completely open cannot accommodate the 1600 veh/hr peak demand. The RL algorithm can anticipate this phenomenon through the penalties and keep the queues at a manageable level. The effect of the Spadina queues reaching capacity can be seen in the York on-ramp queues, as the mainline congestion reaches that far upstream.
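For reference, the ALINEA law and the kind of queue override described above can be sketched as follows (the gain, bounds, occupancy units, and the specific override formula are illustrative assumptions rather than the implementation used in this thesis):

```python
def alinea(r_prev, occ_meas, occ_target, K_R=70.0, r_min=200.0, r_max=1700.0):
    """One step of the ALINEA feedback law:
    r(k) = r(k-1) + K_R * (target occupancy - measured occupancy),
    with the metering rate bounded to a feasible range (veh/hr).
    Occupancies here are in percent; K_R and the bounds are assumed values."""
    r = r_prev + K_R * (occ_target - occ_meas)
    return max(r_min, min(r_max, r))

def queue_override(r_alinea, queue, queue_limit, arrival_flow, interval_s=60.0):
    """Hypothetical queue override: if the ramp queue exceeds its limit, release
    at least enough vehicles over the next control interval to bring the queue
    back to the limit, given the estimated arrival flow (veh/hr)."""
    if queue > queue_limit:
        r_needed = arrival_flow + (queue - queue_limit) * 3600.0 / interval_s
        return max(r_alinea, r_needed)
    return r_alinea
```

The override simply takes the larger of the feedback rate and the rate needed to hold the queue at its limit, which is why, under very high ramp demand like Spadina's, even the override cannot keep the queue bounded.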
Figure 5-19 On-ramp queues for the RM algorithms which consider limited queue capacity.
Figure 5-20 depicts the time-space diagram of traffic speed for RLRM-IwQO and ALINEAwQO. The figure shows that the congestion at Spadina starts sooner in RLRM-IwQO than in ALINEAwQO, which suggests the RLRM-IwQO algorithm acts more conservatively to make sure the queue will not exceed the limit.
Figure 5-21 shows the performance of the Gardiner freeway with different control
algorithms. Given the limited queue capacity, RLRM-IwQO performs worse than RLRM-I in
terms of overall freeway performance. Similarly, looking at TTTml, it is clear that ramp metering with limited queue space cannot eliminate congestion as effectively as in the case of unconstrained on-ramp queues. Nevertheless, RLRM-IwQO outperforms ALINEAwQO and is significantly better than the no-control cases.
Figure 5-20 Time-space diagram of traffic speed for algorithms with limited queue space.
Figure 5-21 Freeway performance under ramp metering with limited queue capacity.
Figure 5-22 shows the travel time from different origins to the west end of the network. In the Jameson2pmClosed case the travel times for upstream origins increase as the freeway becomes more congested. In the RLRM-I case travel times from all origins are at free flow, except those from Spadina, which are significantly higher because of the wait behind the on-ramp queue. In the ALINEAwQO case, the upstream travel times are initially at free flow until the Spadina queue
reaches its limit. As a result, the freeway mainline becomes congested and queues start to build up
on the York on-ramp. Similar conditions occur in RLRM-IwQO; however, the congestion starts slightly sooner and the queues on the Spadina on-ramp are shorter. Although the travel times for all origins in RLRM-IwQO are more or less identical, it should be noted that this is coincidental: under different demands the travel times will not necessarily be similar, as the agent does not directly equalize travel times.
Figure 5-22 Travel times from different locations to the west end of the network.
5.3.2.4 Coordinated Agents
Independent RLRM agents are very effective in maximizing freeway performance as long as the
queues are not limited. However, when the queue reaches its limit, the agent loses control over
freeway congestion and the freeway breaks down. Coordination of agents allows upstream agents
to observe the condition of downstream on-ramps and cooperate to prevent the freeway from
breaking down when downstream on-ramps are full. The coordinated RLRM agents with limited
queue space (named RLRM-C) were implemented and evaluated in the Gardiner model. Additionally, a heuristic coordination of ALINEA based on linked control (named ALINEAwLC) was also implemented and evaluated. In ALINEAwLC, each upstream on-ramp observes the immediate downstream on-ramp and, if the downstream queue exceeds a certain threshold, the upstream on-ramp tries to equalize its queue with the downstream one.
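A minimal sketch of such a linked-control adjustment (the gain, the proportional form, and the default threshold are illustrative assumptions, not the exact ALINEAwLC rule):

```python
def linked_control_rate(r_local, own_queue, downstream_queue,
                        activation_threshold=100, K_q=10.0):
    """Hypothetical linked control: once the downstream on-ramp's queue exceeds
    its activation threshold, reduce the upstream metering rate (veh/hr) in
    proportion to the queue difference so that the two queues tend to equalize.
    K_q (veh/hr per queued vehicle of difference) is an assumed gain."""
    if downstream_queue > activation_threshold and downstream_queue > own_queue:
        return max(r_local - K_q * (downstream_queue - own_queue), 0.0)
    return r_local
```

Restricting the upstream rate shifts arriving vehicles into the upstream queue, so the downstream queue drains while the upstream one grows toward it.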
Figure 5-23 compares freeway performance for coordinated RM and other approaches. The
RLRM-C algorithm achieves similar performance to RLRM-I in minimizing TTT. Furthermore,
the TTTml value and standard deviation show that freeway congestion is kept well under control
in the RLRM-C case. The ALINEAwLC algorithm, although somewhat successful in managing
congestion, cannot improve freeway performance as it achieves similar TTT to the
Jameson2pmClose case.
Figure 5-23 Freeway performance for coordinated RM approaches.
Figure 5-24 shows the queue at the Spadina and York on-ramps for RLRM-IwQO and
RLRM-C. The queues at the Spadina on-ramp for both approaches are more or less the same.
However, in the RLRM-C case the York on-ramp queue increases slightly early in the simulation. This is because the York agent observes the downstream traffic condition and takes proactive measures to ensure the Spadina on-ramp will not reach its queue limit. The coordinated RLRM can maintain the queues within their limits without letting the mainline become congested, efficiently using the available queue storage space on all on-ramps to manage the
freeway. Figure 5-25 shows the travel times from different origins to the west end of the network
for the RLRM-C case. Given the higher queues of the Spadina on-ramp, the travel times for trips
originating from Spadina are higher than for other origins during the rush hour.
Figure 5-24 On-ramp queues of coordinated and independent RLRM agents with limited
queue space.
Figure 5-25 Travel times from different locations to the west end of the network in the
RLRM-C case.
Figure 5-26 shows the average travel time from different origins in downtown to the west
end of the network at Humber Bay for the three RL-based approaches and compares them with the
base case. While in RLRM-I with unlimited queue the travel time for trips originating from
Spadina is much higher than other origins, in RLRM-IwQO with limited queue travel times are
very close. However, the travel time savings of Spadina is much less compared to the increased
travel time for the three other origins, which shows the reduced performance by introducing the
limited queue space. As can be seen from the travel time of the RLRM-C case, coordination of
RLRM agents can prevent mainline congestion, which is evident in the travel time of upstream
origins, while maintaining queue limit, which is evident from Spadina travel time. Effectively,
RLRM-C reduces the Spadina travel time compared with RLRM-I, while not imposing much extra
travel time to upstream origins. Therefore, it results in reasonable travel time variation, while
maximizing the network performance.
Figure 5-26 Average travel time for trips originating during 4-5 pm from origins in the downtown to the west end of the network at Humber Bay for different independent and coordinated RLRM approaches.
Figure 5-26 data: travel time (min) from downtown origins to Humber Bay
          Base Case  Jameson2pmClose  RLRM-I  RLRM-IwQO  RLRM-C
DVP           14.34            11.20    6.34       9.32    6.70
Jarvis        16.65            10.78    6.19       8.81    6.57
York          16.27            11.09    5.47       8.75    6.75
Spadina       12.13             9.02   11.79       9.47    8.68

Although ALINEAwLC did not improve TTT, it is interesting to see how it performed in terms of keeping the queues equal. Figure 5-27 shows the queues at the three downtown on-ramps.
As can be seen from the figure, queues of upstream on-ramps follow the queue of their immediate
downstream on-ramp.
Figure 5-27 Downtown on-ramp queues with ALINEAwLC control algorithm.
5.3.2.4.1 Coordination for Queue Balance
In the RLRM-C case, it has been shown that the proposed algorithm can optimally handle limited
queue space without incurring congestion on the freeway. However, it does not seek equity among different users. Given that the optimal solution is to meter the downstream on-ramp intensively, the users of that on-ramp experience the longest travel time. Although the goal is to equalize the travel times of different on-ramps to provide the same level of service, tracking the travel time of vehicles through loop detectors is not a simple task. Furthermore, formulating an RL system that directly equalizes the travel times of different drivers would be very complicated. As an approximation of the problem, we considered equalizing the queues of different on-ramps: a penalty is imposed on each agent when the queue of the downstream on-ramp is greater than its own queue by more than 50 vehicles. The coordinated RLRM agents in the queue equalization case are named RLRM-CwQE.
Figure 5-28 shows the on-ramp queues and travel times for the RLRM-CwQE case. Although the
queues are very similar, the travel times of trips originating from on-ramps are not the same. The
first cause of the variation is the congestion on the freeway, which can be seen in the travel time
from DVP to the west end. The presence of congestion shows that if the agents are forced to keep similar queue levels, freeway performance is sacrificed significantly. The second factor is the average on-ramp entry flow to the freeway, shown in Figure 5-29. The time each vehicle spends in the queue equals the queue length at the moment the vehicle joins the queue divided by the average flow entering the freeway. Even if the queues are the same, the average flow can significantly affect the time that vehicles wait on the ramp, and hence the travel times differ between on-ramps.
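This queue/flow relation can be made concrete with a small illustration (the numbers are hypothetical):

```python
def ramp_wait_minutes(queue_at_arrival, entry_flow_vph):
    """Approximate ramp wait: the queue length (vehicles) when a vehicle joins,
    divided by the average flow entering the freeway (veh/hr)."""
    return queue_at_arrival / entry_flow_vph * 60.0

# Two on-ramps with identical 100-vehicle queues but different entry flows:
# at 400 veh/hr the wait is 15 minutes, at 1200 veh/hr only 5 minutes,
# so equal queues do not imply equal waits.
```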
Figure 5-28 On-ramp queues (left) and travel times (right) for the RLRM-CwQE case.
5.3.3 The Gardiner Test Case Summary
For the Gardiner test case, nine scenarios were examined. The network TTT and TTTml are
summarized in Figure 5-30. The Jameson on-ramp is a very critical on-ramp, and closing it one hour earlier than in the base case reduces TTT by 36%. However, this closure timing is specific to the demand used in this model; in practice, each demand scenario would require a separate closure timing for the Jameson on-ramp to minimize the negative effects of closure. Metering the Jameson on-ramp is effectively a more refined version of its closure: the meter adaptively closes or opens access to the freeway depending on the traffic condition, while maximizing freeway throughput.
Figure 5-29 Downtown on-ramp flows entering the freeway for the RLRM-CwQE case.
The ALINEA algorithm is fairly robust and can be implemented with minimal design
effort. It can be simply augmented with heuristic algorithms to handle limited queue space, and it can even be coordinated with neighbouring on-ramps to utilize the available on-ramp queue storage space. Nonetheless, its performance is limited in more demanding problems, and the heuristic augmentations cannot properly handle the Gardiner Expressway test case.
The RL-based RM approaches learn from direct interaction with the environment;
therefore, they are able to maximize their performance. The independent agents that do not
consider queue limits manage the freeway congestion very efficiently and reduce TTT by 48%
compared with the base case. By accepting some congestion on the freeway, independent agents
can handle problems with limited queue storage space; in the Gardiner test case, the TTT reduction for independent agents with limited queues was 45% compared with the base case. Coordinating adjacent RLRM agents can efficiently utilize all the queue storage space to eliminate the freeway congestion while maintaining queues within their limits. The coordinated RLRM approach could match the performance of independent RLRM agents, reducing TTT by 50%, i.e. attaining the same performance as unconstrained independent RLRM agents despite the consideration of limited queue space.
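The quoted savings follow directly from the TTT figures reported in this chapter; a quick check:

```python
# Total travel time (veh*hr) for the Gardiner scenarios reported above.
TTT = {"BaseCase": 10276, "RLRM-I": 5360, "RLRM-IwQO": 5665, "RLRM-C": 5104}

def ttt_reduction_pct(scenario, base="BaseCase"):
    """Percentage reduction in TTT relative to the base case, rounded."""
    return round(100.0 * (TTT[base] - TTT[scenario]) / TTT[base])

# RLRM-I: 48%, RLRM-IwQO: 45%, RLRM-C: 50% -- matching the figures in the text.
```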
Figure 5-30 Summary of the performance of the nine scenarios for the Gardiner test case.
Although implementing a centralized RL-based ramp metering system is impractical, the outcomes of the above scenarios suggest that such a centralized system would not perform significantly better than the proposed coordination of local agents. While congestion was apparent with uncoordinated on-ramps and limited queues, coordination of adjacent on-ramps effectively eliminated the congestion and raised the TTT saving from 45% to 50%. Although coordination of on-ramps beyond their immediate neighbours is expected to improve overall system performance, the extra saving is not expected to be significant; tests coordinating each on-ramp with two upstream on-ramps showed no significant improvement.
The optimal RLRM did not provide equal travel times for the different on-ramps. As an alternative, agents were forced to balance their queues. Forcing the queues to be similar resulted in significantly lower system performance. Additionally, given that ramp wait time depends on both the queue and the on-ramp flow, the travel times were not homogenized even though the queues were identical.
Figure 5-30 data: total time spent on network (veh·hr)
                  TTT    TTTml
Base Case        10276   6998
Jameson2pmClose   6533   5411
ALINEA            6290   4198
ALINEAwQO         6141   4729
ALINEAwLC         6660   4601
RLRM-I            5360   4147
RLRM-IwQO         5665   4768
RLRM-C            5104   4249
RLRM-CwQE         6779   4710

Revisiting the question of whether on-ramp users will be sacrificed for the benefit of mainline users, it can be said that, in order to achieve optimal network performance, it is inevitable
that the downstream on-ramp users will experience higher travel times than users of upstream origins. However, RLRM-C can provide reasonable travel times for all users. In fact, coordination of on-ramps, while imposing a limit on the queues, guarantees a minimum level of service for all users: given that the freeway mainline will not be congested and the queues will not exceed a certain limit, the minimum level of service can be quantified. The same level of service cannot be guaranteed in the base case without ramp metering, as congestion on the mainline degrades performance.
6 Conclusions and Future Work
Ramp metering is the most direct and effective freeway traffic control measure and is widely
employed throughout the world. Local RM algorithms applied to independent on-ramps can be
very efficient as long as there is no limit on the queue storage space. Practically, however, to
prevent the queue from exceeding the pre-specified limit, simple RM algorithms prioritize queue
management over freeway traffic management; therefore, the benefits of RM quickly diminish.
Availability of multiple closely spaced on-ramps provides the opportunity to coordinate multiple
on-ramps and utilize the queue storage space of all ramps to prevent congestion more effectively.
Heuristic approaches cannot exploit the full potential in the coordination of multiple on-ramps.
Model-based optimal control approaches can theoretically find the best metering policy. However,
their computational complexity increases exponentially with the network size, and they become impractical even for moderately sized networks comprising a few on-ramps.
In this research, a decentralized and coordinated optimal RL-based ramp metering system
is presented. Individual RLRM agents can act on their own based on their local measurements to
maximize their reward (minimizing the local total travel time). Furthermore, agents can coordinate
their actions with their neighbours to maximize their collective reward rather than only their
individual reward. The decentralized structure allows simple scalability to any problem size.
Additionally, agents seek optimality whether they are acting independently or coordinated.
Therefore, the system would function reliably in the event of communication failure. The RLRM
agents employ function approximation to represent continuous state variables directly. The move
from discrete states to continuous states can significantly improve learning speed through
generalization of information. It also eliminates the trade-offs associated with discretizing
continuous variables. Furthermore, the learning time of the agents does not grow exponentially with the number of measurement variables.
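As an illustration of the k-nearest-neighbour flavour of function approximation examined in this work, a continuous state's Q-value can be estimated by distance-weighted averaging over stored discrete entries (the weighting scheme and data layout here are illustrative assumptions, not the thesis's exact formulation):

```python
import math

def knn_q(state, action, q_table, k=3):
    """Estimate Q(state, action) for a continuous state by averaging the stored
    Q-values of the k nearest discrete states for the given action, weighted by
    inverse Euclidean distance. q_table maps (state_tuple, action) -> Q-value."""
    entries = [(s, q) for (s, a), q in q_table.items() if a == action]

    def dist(s):
        return math.sqrt(sum((si - xi) ** 2 for si, xi in zip(s, state)))

    nearest = sorted(entries, key=lambda e: dist(e[0]))[:k]
    weights = [1.0 / (dist(s) + 1e-6) for s, _ in nearest]
    return sum(w * q for (_, q), w in zip(nearest, weights)) / sum(weights)
```

Because nearby discrete entries share their learned values, an update at one state generalizes to its neighbours, which is the source of the learning-speed improvement noted above.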
Two microscopic simulation models were developed as test cases for the training and
evaluation of the proposed algorithms. The locations of the test cases were carefully chosen so that
they highlighted ramp metering effectiveness and challenges. The driver behaviour parameters as
well as the dynamic demand of the models were meticulously calibrated to match the traffic
dynamics and congestion patterns of the real freeways. The first model was a section of the Highway 401 eastbound collector at Keele Street. This model is effectively a network with a single on-ramp and was used for extensive experiments with different aspects of the RL-based RM.
Additionally, RLRM algorithms with different function approximation approaches were evaluated
with the Highway 401 model to identify the most suitable approach. The second model was the
westbound direction of the Gardiner Expressway. The Gardiner model includes different types of
on-ramps and is an excellent testbed for evaluation of the RM algorithms. This model was used
for evaluation of the coordinated RLRM algorithms and comparison with independent RLRM
approaches as well as the well-known ALINEA algorithm.
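For reference, ALINEA (Papageorgiou et al., 1991b) is a local feedback law that adjusts the metering rate in proportion to the deviation of the downstream occupancy from a set-point. A minimal sketch follows; the gain, set-point, and rate bounds are illustrative values, not the calibrated settings used in this thesis:

```python
def alinea_step(prev_rate, occ_out, occ_target=18.0, K_R=70.0,
                r_min=200.0, r_max=1800.0):
    """One control cycle of the ALINEA feedback law.

    r(k) = r(k-1) + K_R * (o_hat - o_out(k)), with occupancies in
    percent and rates in veh/h, clipped to the feasible metering
    range.  All numeric defaults here are illustrative.
    """
    rate = prev_rate + K_R * (occ_target - occ_out)
    return max(r_min, min(r_max, rate))
```

With downstream occupancy above the set-point, the rate is reduced; below it, the rate is raised until the upper bound is reached.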
6.1 Major Findings
The conventional RL approaches with discrete states can be applied to RM problems. However,
they require a very large number of training epochs even for the simplest RLRM design.
The simplest design with about 80 states and 7 actions needed more than 1000 epochs (simulation
runs) to converge to optimal Q-values. Therefore, these approaches are not suitable for more
sophisticated RLRM designs and larger problems. It was also found that for single-ramp problems
(independent agents) defining the agent’s reward as freeway throughput in conjunction with a
penalty for mainline congestion will result in an efficient agent that minimizes TTT.
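This reward design (throughput minus a mainline congestion penalty) can be sketched as follows; the critical density and penalty weight below are placeholders, not the calibrated values of this thesis:

```python
def reward(outflow_veh, mainline_density, critical_density=30.0,
           congestion_penalty=50.0):
    """Reward for an independent RLRM agent: freeway throughput
    (vehicles discharged in the control cycle) minus a penalty that
    grows with how far the mainline density exceeds its critical
    value.  Threshold and weight are illustrative placeholders."""
    r = float(outflow_veh)
    if mainline_density > critical_density:
        r -= congestion_penalty * (mainline_density - critical_density)
    return r
```

Maximizing this reward keeps the mainline near capacity flow, which is what makes the agent minimize TTT in practice.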
Function approximation can significantly improve the learning speed. The easiest and most
reliable approach for dealing with continuous variables in RLRM is the generalization of the
discrete states into continuous states through averaging based on k-nearest neighbours. This
approach is directly based on the solid foundations of RL with discrete states. Despite improving
the learning speed significantly, it suffers from the same issues as the conventional RL, namely
the curse of dimensionality and discretization trade-off. MLP and LMT are far more efficient
function approximators compared to averaging based on k-nearest neighbours. Although MLP has
been extensively used in the literature for function approximation in RL, it introduces several new
design parameters, which are not trivial to define. LMT breaks the state space into several sections
and uses a linear model in each section. The linear models are fitted with the least squares method; therefore, LMT can effectively handle the measurement noise in the stochastic environment
of freeway traffic problems. Additionally, the parameters associated with LMT do not have much
effect on the learning performance and only affect the number of sections in the tree.
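The k-nearest-neighbour generalization described above can be sketched as a distance-weighted average over stored discrete state centres; the table layout and function name below are illustrative:

```python
import heapq

def knn_q(state, action, table, k=3, eps=1e-6):
    """Estimate Q(state, action) as a distance-weighted average of the
    k nearest stored discrete state centres.  `table` maps
    (state_tuple, action) -> Q value.  A minimal sketch of the
    k-nearest-neighbour generalization, not the thesis implementation."""
    # Euclidean distance to every stored centre holding this action
    cands = [(sum((si - ci) ** 2 for si, ci in zip(state, c)) ** 0.5, q)
             for (c, a), q in table.items() if a == action]
    nearest = heapq.nsmallest(k, cands)
    # inverse-distance weights; eps avoids division by zero
    weights = [1.0 / (d + eps) for d, _ in nearest]
    return sum(w * q for w, (_, q) in zip(weights, nearest)) / sum(weights)
```

Queries between stored centres blend their Q-values, which is the source of the learning speed-up over a purely tabular lookup.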
RLRM with LMT function approximation was applied to the Gardiner test case.
Independent RLRM agents with unlimited queue space outperformed ALINEA and reduced the
TTT by close to 50% compared with the base case with no RM. As expected, limiting the queue
space had a negative impact on RM performance, and resulted in some congestion on the freeway.
Nevertheless, the RLRM with limited queue still reduced the TTT by 45% compared with the base
case and outperformed the ALINEA with the queue override algorithm. Coordinating the action
of individual RLRM agents with limited queue space made it possible to utilize the queue storage
on nearby on-ramps. The coordinated RLRM agents prevented the freeway from breaking down while keeping the queues from exceeding the predefined limits. The coordinated RLRM could match the performance of independent RLRM agents with unlimited queue space, and reduced the TTT by 50% compared with the base case, while offering improved queue management and respecting queue length constraints. The ALINEA with a linked control algorithm was able to
balance the queues of the downtown on-ramps; however, the performance of the system could not
match the original ALINEA algorithm. In fact, balancing the queues resulted in inferior
performance compared with the original ALINEA. This phenomenon was also observed in the
case of coordinated RLRM agents with queue balancing. When the coordinated RLRM agents
were forced to balance their queues through a penalty term, their performance deteriorated
significantly.
6.2 Contributions
The main contributions of this thesis can be summarized as follows:
RL with continuous representation of states and actions – a novel approach for direct
representation of the continuous states and actions in RL is proposed. The proposed approach
can properly handle the stochastic behaviour as well as the noisy measurements of the traffic
control problems. Furthermore, the proposed approach allows far more state variables to be
included in the definition of the state of the environment without the need for deciding on the
discretization intervals, which significantly simplifies the design process. Given the generality
of the proposed approach, it can be applied to virtually any RL application.
Coordination of RL-based RM agents – an algorithm is proposed for direct negotiation and
coordination of RLRM agents based on coordination graphs. Design of the agents in
conjunction with the coordination algorithm enables independent as well as coordinated
implementation of the agents in a decentralized structure, which provides robustness against
communication failure.
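The essence of coordination over a graph can be illustrated with two neighbouring agents whose joint value decomposes into local terms plus a pairwise term. Real coordination graphs replace the brute-force maximization below with message passing over the graph; the names and the two-agent restriction here are illustrative only:

```python
from itertools import product

def coordinate(q_local, q_pair, actions):
    """Pick the joint action (a1, a2) maximizing
    Q1(a1) + Q2(a2) + Q12(a1, a2).  A brute-force stand-in for the
    negotiation over a coordination graph; on larger graphs the same
    decomposition is maximized by message passing between agents."""
    q1, q2 = q_local
    return max(product(actions, actions),
               key=lambda a: q1[a[0]] + q2[a[1]] + q_pair[a])
```

The pairwise term is what lets one agent accept a locally inferior action (e.g. storing extra queue) when it raises the neighbours' collective reward.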
Additionally, with lower significance, the following were achieved during the course of this research:
The Gardiner Expressway microscopic simulation model – a thoroughly refined and
calibrated microsimulation model of the Gardiner is developed in Paramics for training and
evaluation of the proposed algorithms. Since accurate traffic dynamics are crucial for freeway
traffic control applications, special consideration is given to the development and calibration
of the Gardiner model. The developed model closely replicates the characteristics of the
Gardiner such as capacity and critical density as well as traffic volumes and congestion
patterns.
Deployment-ready design – throughout the design process of the RLRM agents, only
measurements that are readily available in the field were considered. Although use of more
complex measures could simplify the design process or enhance performance, the decision
was made to minimize the time required for deployment of the proposed algorithm in the field.
6.3 Towards Field Implementation
Performance of traffic control systems under real-life conditions has always been a concern for practitioners as well as researchers, particularly whether, and how, new systems can be implemented in the field on controllers with specific capabilities and limitations. This concern is heightened by the risk and productivity consequences of a traffic controller failure in the field. Hardware-in-the-loop simulation (HILS) provides the tools for evaluation of the hardware that could be
implemented in the field without risking the consequences of its failure. HILS, in the traffic control
context, is a method used for evaluating real hardware components running the traffic control
algorithms in a simulation environment. HILS allows evaluation of the hardware operation in a
controlled simulated environment before deployment in the field. HILS replaces the emulated
traffic signal control logic in the simulation model with real traffic signal control hardware, which
interacts with the simulation model. In other words, HILS replaces the real environment of a traffic
controller with microsimulation software, as illustrated in Figure 6-1.
The critical component of a hardware-in-the-loop traffic simulation system is the controller
interface device or CID, which facilitates the communication between the physical world (traffic
controller) and the simulated world. Figure 6-1b illustrates the HILS setup, which has three
components: 1) a microscopic simulation model; 2) a traffic controller; and 3) a CID, which
facilitates communication between the first two components. The CID captures the traffic light
indications generated by the traffic controller and routes them to the simulation software.
Similarly, inputs from the simulator, e.g. loop detector calls, are sent back to the traffic controller
through the CID, and hence the controller functions as if it were communicating with a real signal
assembly.
Figure 6-1 Controller interface with real (a) and virtual (b) transportation environment.
As part of a project for evaluating field implementation of the MARLIN-ATSC (S. El-
Tantawy et al., 2013) algorithm, a team of researchers including the author developed the CID and
the companion programs which allow communication between a NEMA TS2 Type 1 traffic
controller and Paramics microsimulation software. The developed CID is shown in Figure 6-2. On
the one hand, the CID will respond to the controller commands as if the traffic signal controller is
communicating with devices inside a control cabinet. On the other hand, it will control the traffic
signal behaviour of Paramics to match the commands coming from the traffic signal controller.
Additionally, the CID will read loop detector calls in the Paramics network and communicate them
back to the traffic signal controller.
The same CID that was originally developed for HILS of surface traffic control algorithms
can be directly employed for HILS of RM algorithms. For this purpose, and when resources are
available in the near future, the RM control logic should be implemented in an embedded
controller, which overrides the logic of the traffic signal controller. The embedded controller reads
the loop detector calls through the traffic signal controller and calculates the metering rate (the green and red timing) according to the state of the traffic. Then it overrides the traffic signal control logic according to the controls provided by the NTCIP standard.
Figure 6-2 The CID developed for evaluation of MARLIN-ATSC.
6.4 Assumptions and Limitations
During the design and evaluation of the proposed freeway control system, certain assumptions were made to make the problem manageable. Given the tedious and time-consuming calibration process, only a single set of OD matrices was calibrated. The agents were trained and evaluated based on this single demand profile. Since the proposed system finds the best response for the current traffic condition, it acts independently of the overall traffic pattern. However, the system responds optimally only for traffic conditions it has seen previously. The randomness in Paramics provides the necessary variation for the system to learn, but if the traffic pattern changes significantly, the system will face conditions it might not have seen before. The generalization from LMT can handle these conditions to some extent; however, the control system output will not be optimal. It should be noted that the control system will learn from these experiences and will improve as new samples are visited.
For the purpose of this research, it is assumed that demand from on-ramp origins is fixed. This assumption was made so that comparisons between different scenarios could be made. However, two real-world phenomena contradict this assumption: 1) traffic rerouting when the queue on one
on-ramp is shorter than on adjacent on-ramps, and 2) induced demand due to the improved traffic flow of the on-ramps. While rerouting negatively affects independently controlled on-ramps, coordination of adjacent on-ramps can address this issue to some extent and nullify its negative impacts. Induced demand is inevitable when travel time drops; however, the extra demand will not cause congestion because the freeway is controlled. The added demand will eventually increase travel times, possibly to levels close to base-case travel times. Even if travel times after metering the on-ramps become similar to the base case, it should be noted that the total number of vehicles served has increased. Therefore, the overall ramp metering system performs better than the base case, through either lower travel times or more vehicles served.
The emerging Advanced Traveler Information Systems (ATIS), such as real-time traffic information and travel times, can affect the ramp metering system. In the case of independent on-ramps, travellers headed for downstream on-ramps will be redirected to upstream on-ramps where there is no queue. This rerouting contradicts the RM's efforts to regulate vehicle entrance to the freeway and results in lost productivity. However, in the case of coordinated on-ramps, ATIS can improve the equity of the system by spreading vehicles over different ramps and homogenizing the on-ramps' waiting times. Given that the metering of the on-ramps is coordinated, the rerouting will not result in lost productivity.
6.5 Future Work
The research presented in this thesis can be further extended in several ways. The following paragraphs outline key directions for future work.
The proposed algorithm has been developed with scalability to larger problems in mind.
Although it is expected to work in other networks with minimal modifications, applying it to the
full 400-series freeways would be a solid validation of its scalability to larger problems and
transferability to other types of traffic networks.
The trained RLRM agents in this research are specific to the on-ramps they are trained for.
In practice, training an agent for each on-ramp is not always feasible. It is desirable to develop a
generalized agent while considering different possible on-ramp geometries. The training of the
generalized agent would involve samples from different on-ramp geometries and demands.
Although in this research the proposed algorithm is specifically applied to ramp metering,
it can be modified to work with other freeway traffic measures such as variable speed limits and
dynamic route guidance. Furthermore, coordination of variable speed limits and ramp metering
can balance the travel time between on-ramp users and mainline users, potentially addressing the
criticism that RM sacrifices on-ramp users for the benefit of mainline vehicles. Variable speed
limits in this case would act as mainline metering.
Surface streets and freeways are ultimately part of the whole transportation network.
Congestion on surface streets will affect the freeway traffic if propagated to the freeway off-ramps.
Similarly, heavy demand on freeways might create long on-ramp queues, causing congestion on
surface streets. Integration of the freeway control systems with surface street control systems could
benefit both.
References
Abdelgawad, H., Abdulhai, B., Amirjamshidi, G., Wahba, M., Woudsma, C., & Roorda, M. J. (2011). Simulation of Exclusive Truck Facilities on Urban Freeways. Journal of Transportation Engineering-Asce, 137(8), 547-562. doi: 10.1061/(asce)te.1943-5436.0000234
Abdulhai, B., & Kattan, L. (2003). Reinforcement learning: Introduction to theory and potential for transport applications. Canadian Journal of Civil Engineering, 30(6), 981-991. doi: 10.1139/l03-014
Ahn, S., Bertini, R. L., Auffray, B., Ross, J. H., & Eshel, O. (2007). Evaluating benefits of systemwide adaptive ramp-metering strategy in Portland, Oregon. Transportation Research Record(2012), 47-56.
Arel, I., Liu, C., Urbanik, T., & Kohls, A. G. (2010). Reinforcement learning-based multi-agent system for network traffic signal control. IET Intelligent Transport Systems, 4(2), 128-135. doi: 10.1049/iet-its.2009.0070
Bazzan, A. L. C. (2009). Opportunities for multiagent systems and multiagent reinforcement learning in traffic control. Autonomous Agents and Multi-Agent Systems, 18(3), 342-375. doi: 10.1007/s10458-008-9062-9
Bellemans, T., De Schutter, B., & De Moor, B. (2002). Model predictive control with repeated model fitting for ramp metering. Paper presented at the 5th International IEEE Conference on Intelligent Transportation Systems.
Bellman, R. (2010). Dynamic programming / by Richard Bellman; with a new introduction by Stuart Dreyfus. Princeton, N.J.: Princeton University Press.
Brilon, W., & Ponzlet, M. (1996). Variability of speed-flow relationships on German autobahns. Transportation Research Record(1555), 91-98.
Brown, G. W. (1951). Iterative solution of games by fictitious play. In T. C. Koopmans (Ed.), Activity Analysis of Production and Allocation. New York: Wiley.
Busoniu, L., Babuska, R., & De Schutter, B. (2008). A comprehensive survey of multiagent reinforcement learning. IEEE Transactions on Systems Man and Cybernetics Part C-Applications and Reviews, 38(2), 156-172. doi: 10.1109/tsmcc.2007.913919
Chow, G. C. (1960). Tests of Equality Between Sets of Coefficients in 2 Linear Regressions. Econometrica, 28(3), 591-605. doi: 10.2307/1910133
Chu, L. Y., Liu, H. X., Recker, W., & Zhang, H. M. (2004). Performance evaluation of adaptive ramp-metering algorithms using microscopic traffic simulation model. Journal of Transportation Engineering-Asce, 130(3), 330-338. doi: 10.1061/(asce)0733-947x(2004)130:3(330)
Crites, R. H., & Barto, A. G. (1998). Elevator group control using multiple reinforcement learning agents. Machine Learning, 33(2-3), 235-262. doi: 10.1023/a:1007518724497
Davarynejad, M., Hegyi, A., Vrancken, J., & van den Berg, J. (2011, 5-7 Oct. 2011). Motorway ramp-metering control with queuing consideration using Q-learning. Paper presented at the 14th International IEEE Conference on Intelligent Transportation Systems (ITSC).
Doya, K. (2000). Reinforcement learning in continuous time and space. Neural Computation, 12(1), 219-245.
El-Tantawy, S., & Abdulhai, B. (2010). Temporal difference learning-based adaptive traffic signal control. Paper presented at the 12th WCTR, Lisbon, Portugal.
El-Tantawy, S., Abdulhai, B., & Abdelgawad, H. (2013). Multiagent Reinforcement Learning for Integrated Network of Adaptive Traffic Signal Controllers (MARLIN-ATSC): Methodology and Large-Scale Application on Downtown Toronto. IEEE Transactions on Intelligent Transportation Systems, PP(99), 1-11. doi: 10.1109/tits.2013.2255286
Even-Dar, E., & Mansour, Y. (2003). Learning rates for Q-learning. Journal of Machine Learning Research, 5, 1-25.
Geist, M., & Pietquin, O. (2013). Algorithmic Survey of Parametric Value Function Approximation. IEEE Transactions on Neural Networks and Learning Systems, 24(6), 845-867. doi: 10.1109/tnnls.2013.2247418
Ghods, A. H., Fu, L. P., & Rahimi-Kian, A. (2010). An Efficient Optimization Approach to Real-Time Coordinated and Integrated Freeway Traffic Control. IEEE Transactions on Intelligent Transportation Systems, 11(4), 873-884. doi: 10.1109/tits.2010.2055857
Ghods, A. H., Kian, A. R., & Tabibi, M. (2007). A genetic-fuzzy control application to ramp metering and variable speed limit control. Paper presented at the IEEE International Conference on Systems, Man and Cybernetics.
Gomes, G., & Horowitz, R. (2006). Optimal freeway ramp metering using the asymmetric cell transmission model. Transportation Research Part C-Emerging Technologies, 14(4), 244-262. doi: 10.1016/j.trc.2006.08.001
Guestrin, C., Lagoudakis, M. G., & Parr, R. (2002). Coordinated reinforcement learning. Paper presented at the 19th International Conference on Machine Learning (ICML-02), Sydney, Australia, Jul. 8–12.
Hagan, M. T., & Menhaj, M. (1994). Training feed-forward networks with the Marquardt algorithm. IEEE Transactions on Neural Networks, 5(6), 989-993.
Hall, F. L., & Agyemang-Duah, K. (1991). Freeway Capacity Drop and the Definition of Capacity. Transportation Research Record 1320, TRB, National Research Council, Washington, D.C., 91–98.
Hasan, M., Jha, M., & Ben-Akiva, M. (2002). Evaluation of ramp control algorithms using microscopic traffic simulation. Transportation Research Part C-Emerging Technologies, 10(3), 229-256.
Hegyi, A., De Schutter, B., & Hellendoorn, H. (2005). Model predictive control for optimal coordination of ramp metering and variable speed limits. Transportation Research Part C-Emerging Technologies, 13(3), 185-209. doi: 10.1016/j.trc.2004.08.001
Heinen, M. R., Bazzan, A. L. C., & Engel, P. M. (2011). Dealing with continuous-state reinforcement learning for intelligent control of traffic signals. Paper presented at the 14th International IEEE Conference on Intelligent Transportation Systems (pp. 890-895).
Jacob, C., & Abdulhai, B. (2010). Machine learning for multi jurisdictional optimal traffic corridor control. Transportation Research Part A-Policy and Practice, 44(2), 53-64. doi: 10.1016/j.tra.2009.11.001
Jacobsen, L., Henry, K., & Mahyar, O. (1989). Real-time metering algorithm for centralized control. Transportation Research Record(1232), 17–26.
Khan, S. G., Herrmann, G., Lewis, F. L., Pipe, T., & Melhuish, C. (2012). Reinforcement learning and optimal adaptive control: An overview and implementation examples. Annual Reviews in Control, 36(1), 42-59. doi: 10.1016/j.arcontrol.2012.03.004
Kok, J. R., & Vlassis, N. (2006). Collaborative multiagent reinforcement learning by payoff propagation. Journal of Machine Learning Research, 7, 1789-1828.
Kotsialos, A., & Papageorgiou, M. (2004). Efficiency and equity properties of freeway network-wide ramp metering with AMOC. Transportation Research Part C: Emerging Technologies, 12(6), 401-420. doi: http://dx.doi.org/10.1016/j.trc.2004.07.016
Kotsialos, A., Papageorgiou, M., Mangeas, M., & Haj-Salem, H. (2002). Coordinated and integrated control of motorway networks via non-linear optimal control. Transportation Research Part C-Emerging Technologies, 10(1), 65-84.
Kuyer, L., Whiteson, S., Bakker, B., & Vlassis, N. (2008). Multiagent Reinforcement Learning for Urban Traffic Control Using Coordination Graphs. Machine Learning and Knowledge Discovery in Databases, Part I, Proceedings, 5211, 656-671.
Lau, R. (1997). Ramp metering by zone—The Minnesota algorithm: Minnesota Department of Transportation.
Mahadevan, S. (1996). Average reward reinforcement learning: Foundations, algorithms, and empirical results. Machine Learning, 22(1-3), 159-195. doi: 10.1007/bf00114727
Martin, J. A., de Lope, J., & Maravall, D. (2011). Robust high performance reinforcement learning through weighted k-nearest neighbors. Neurocomputing, 74(8), 1251-1259. doi: 10.1016/j.neucom.2010.07.027
Masher, D. P., Ross, D. W., Wong, P. J., Tuan, P. L., Zeidler, H. M., & Petracek, S. (1975). Guidelines for design and operation of ramp control systems. Stanford Research Institute, Menlo Park, California.
Messmer, A., & Papageorgiou, M. (1990). METANET: a macroscopic simulation program for motorway networks. Traffic Engineering & Control, 31(8-9), 466-470.
Nair, R., Varakantham, P., Tambe, M., & Yokoo, M. (2005). Networked distributed POMDPs: A synthesis of Distributed Constraint Optimization and POMDPs. Paper presented at the 20th National Conference on Artificial Intelligence.
Paesani, G., Kerr, J., Perovich, P., & Khosravi, E. (1997). System wide adaptive ramp metering in Southern California. Paper presented at the 7th Annual Meeting, ITS America.
Panait, L., & Luke, S. (2005). Cooperative multi-agent learning: The state of the art. Autonomous Agents and Multi-Agent Systems, 11(3), 387-434. doi: 10.1007/s10458-005-2631-2
Papageorgiou, M., Blosseville, J.-M., & Haj-Salem, H. (1990). Modelling and real-time control of traffic flow on the southern part of Boulevard Peripherique in Paris: Part II: Coordinated on-ramp metering. Transportation Research Part A: General, 24(5), 361-370. doi: http://dx.doi.org/10.1016/0191-2607(90)90048-B
Papageorgiou, M., Diakaki, C., Dinopoulou, V., Kotsialos, A., & Wang, Y. B. (2003). Review of road traffic control strategies. Proceedings of the IEEE, 91(12), 2043-2067. doi: 10.1109/jproc.2003.819610
Papageorgiou, M., Hadj-Salem, H., & Blosseville, J.-M. (1991a). ALINEA: A local feedback control law for on-ramp metering. Transportation Research Record(1320), 58-64.
Papageorgiou, M., Hadj-Salem, H., & Blosseville, J. M. (1991b). ALINEA: A local feedback control law for on-ramp metering. Transportation Research Record, 1320, 58-64.
Papageorgiou, M., Hadj-Salem, H., & Middelham, F. (1997). ALINEA local ramp metering: Summary of field results. Transportation Research Record(1603), 90-98.
Papageorgiou, M., & Kotsialos, A. (2002). Freeway ramp metering: An overview. IEEE Transactions on Intelligent Transportation Systems, 3(4), 271-281. doi: 10.1109/tits.2002.806803
Papageorgiou, M., & Papamichail, I. (2008). Overview of Traffic Signal Operation Policies for Ramp Metering. Transportation Research Record(2047), 28-36. doi: 10.3141/2047-04
Papamichail, I., Kotsialos, A., Margonis, I., & Papageorgiou, M. (2010a). Coordinated ramp metering for freeway networks - A model-predictive hierarchical control approach. Transportation Research Part C-Emerging Technologies, 18(3), 311-331. doi: 10.1016/j.trc.2008.11.002
Papamichail, I., & Papageorgiou, M. (2008). Traffic-responsive linked ramp-metering control. IEEE Transactions on Intelligent Transportation Systems, 9(1), 111-121. doi: 10.1109/tits.2007.908724
Papamichail, I., Papageorgiou, M., Vong, V., & Gaffney, J. (2010b). Heuristic Ramp-Metering Coordination Strategy Implemented at Monash Freeway, Australia. Transportation Research Record(2178), 10-20. doi: 10.3141/2178-02
Potts, D., & Sammut, C. (2005). Incremental learning of linear model trees. Machine Learning, 61(1-3), 5-48. doi: 10.1007/s10994-005-1121-8
Powell, W., & Ma, J. (2011). A review of stochastic algorithms with continuous value function approximation and some new approximate policy iteration algorithms for multidimensional continuous applications. Journal of Control Theory and Applications, 9(3), 336-352. doi: 10.1007/s11768-011-0313-y
Prashanth, L. A., & Bhatnagar, S. (2011). Reinforcement Learning With Function Approximation for Traffic Signal Control. IEEE Transactions on Intelligent Transportation Systems, 12(2), 412-421. doi: 10.1109/tits.2010.2091408
Salkham, A., Cunningham, R., Garg, A., & Cahill, V. (2008). A collaborative reinforcement learning approach to urban traffic control optimization. Paper presented at the Proceedings of the 2008 IEEE/WIC/ACM International Conference on Intelligent Agent Technology, IAT 2008.
Santamaria, J. C., Sutton, R. S., & Ram, A. (1997). Experiments with reinforcement learning in problems with continuous state and action spaces. Adaptive Behavior, 6(2), 163-217. doi: 10.1177/105971239700600201
Singh, S. P., & Sutton, R. S. (1996). Reinforcement learning with replacing eligibility traces. Machine Learning, 22(1-3), 123-158. doi: 10.1023/a:1018012322525
Smaragdis, E., & Papageorgiou, M. (2003). Series of new local ramp metering strategies. Freeways, High-Occupancy Vehicle Systems, and Traffic Signal Systems 2003(1856), 74-86.
Smaragdis, E., Papageorgiou, M., & Kosmatopoulos, E. (2004). A flow-maximizing adaptive local ramp metering strategy. Transportation Research Part B-Methodological, 38(3), 251-270. doi: 10.1016/s0191-2615(03)00012-2
Spall, J. C. (1992). Multivariate stochastic approximation using a simultaneous perturbation gradient approximation. IEEE Transactions on Automatic Control, 37(3), 332-341.
Spall, J. C. (1998). An overview of the simultaneous perturbation method for efficient optimization. Johns Hopkins APL Technical Digest (Applied Physics Laboratory), 19(4), 482-492.
Sugiyamal, Y., Fukui, M., Kikuchi, M., Hasebe, K., Nakayama, A., Nishinari, K., . . . Yukawa, S. (2008). Traffic jams without bottlenecks-experimental evidence for the physical mechanism of the formation of a jam. New Journal of Physics, 10.
Sun, X. T., & Horowitz, R. (2005). A localized switching ramp-metering controller with a queue length regulator for congested freeways. Paper presented at the Proceedings of the 2005 American Control Conference, New York.
Sutton, R. S., & Barto, A. G. (1998). Reinforcement Learning: An Introduction. Cambridge, MA: MIT Press.
Van Aerde, M. (1995). Single regime speed-flow-density relationship for congested and uncongested highways. Paper presented at the 74th TRB Annual Conference, Washington D.C.
Watkins, C., & Dayan, P. (1992). Q-learning. Machine Learning, 8(3-4), 279-292.
Zhang, H. M., & Ritchie, S. G. (1997). Freeway ramp metering using artificial neural networks. Transportation Research Part C-Emerging Technologies, 5(5), 273-286.
Zhang, M., Ma, J., & Dong, H. (2008). Developing Calibration Tools for Microscopic Traffic Simulation Final Report Part II: Calibration Framework and Calibration of Local/Global Driving Behavior and Departure/Route Choice Model Parameters: California PATH Research Report.
Appendix A – Paramics Plug-in
The Paramics functionality can be extended through plug-ins written in C language. The plug-in
for implementing ramp metering in Paramics can be broken into three parts: 1) measurement of
state of traffic, 2) the ramp metering algorithm, and 3) implementing the metering rate to the traffic
light.
Measuring the state of traffic is limited to loop detectors. Loop detectors in Paramics provide three pieces of information: the cumulative number of vehicles that have passed the detector, the speed of the last vehicle that passed it, and the duration that the last vehicle occupied it. This information is first aggregated into 20 sec interval averages. For the traffic
volume, the total number of vehicles is calculated from the change in the accumulated number of
vehicles in the 20 sec interval. For average speed, individual vehicles’ speeds are added together
and divided by number of vehicles. To calculate the percentage occupancy, the individual
occupancy times are added together and divided by 20 sec, to achieve the ratio that the loop
detector was occupied. These 20 sec averages are then further aggregated depending on the control
cycle of the ramp metering algorithm.
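The 20 sec aggregation described above can be sketched as follows; the function name and argument layout are illustrative, not the plug-in's actual C interface:

```python
def aggregate_interval(count_start, count_end, speeds, occupancy_times,
                       interval=20.0):
    """Aggregate raw Paramics loop-detector readings over one interval:
    volume from the change in the cumulative vehicle count, mean speed
    over the vehicles observed, and percentage occupancy from the
    summed per-vehicle occupancy times divided by the interval length."""
    volume = count_end - count_start
    mean_speed = sum(speeds) / len(speeds) if speeds else 0.0
    occupancy = 100.0 * sum(occupancy_times) / interval
    return volume, mean_speed, occupancy
```

These per-interval triples are then further averaged over the control cycle before being passed to the metering algorithm.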
The ramp metering algorithms are coded in MATLAB to take advantage of its vast libraries
and simple and versatile programming language. The interface between Paramics plug-in and
MATLAB is made through MATLAB Engine. MATLAB Engine allows an external program to
initiate an instance of MATLAB and control the execution of functions and scripts in that
MATLAB session. The plug-in retrieves the metering rate from the MATLAB session after
executing the ramp metering algorithms.
The metering rates that the ramp metering algorithms calculate are directly translated into
green time and red time according to the metering rate signal policy used. Through Paramics
programming APIs, the timing of the traffic light is overwritten in each control cycle to match the
output of the ramp metering algorithm.
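Under a one-vehicle-per-green policy, this translation is straightforward: each cycle of length 3600/rate seconds releases one vehicle. The 2 sec green and minimum red below are illustrative values, not the settings used in the thesis:

```python
def rate_to_timing(rate_veh_h, green=2.0, min_red=2.0):
    """Translate a metering rate (veh/h) into green and red times
    under a one-vehicle-per-green signal policy.  One cycle of
    3600/rate seconds releases a single vehicle; the green time is
    fixed and the remainder of the cycle is red, subject to a
    minimum red.  Numeric defaults are illustrative placeholders."""
    cycle = 3600.0 / rate_veh_h
    red = max(min_red, cycle - green)
    return green, red
```

For example, a rate of 400 veh/h gives a 9 sec cycle: 2 sec green followed by 7 sec red.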
Appendix B – Total Least Squares
The regular least squares method finds the regression that minimizes the sum of squared errors in the dependent variable. Effectively, the least squares regression minimizes the error function below:

$E(\boldsymbol{\beta}) = \sum_i \left( y_i - \mathbf{x}_i^{T}\boldsymbol{\beta} \right)^2$   (B.1)

where $\mathbf{x}_i$ is the vector of independent variables, $y_i$ is the dependent variable, and $\boldsymbol{\beta}$ is the vector of regression parameters. The least squares regression efficiently estimates $\boldsymbol{\beta}$, provided that the measurements of the independent variables do not suffer from measurement errors. Error in the independent variables can negatively affect the result of least squares regression. The effect is more significant if the relation between inputs and outputs is non-linear.
The total least squares regression does not differentiate between dependent and independent variables and assumes error in both. The total least squares regression minimizes the objective:

$E(\boldsymbol{\beta}) = \sum_i \left[ (z_i - \hat{z}_i)^2 + (y_i - \hat{y}_i)^2 \right]$   subject to:   $\hat{y}_i = \hat{\mathbf{x}}_i^{T}\boldsymbol{\beta}$   (B.2)

where $\hat{\mathbf{x}}_i$ is a vector obtained by nonlinear augmentation of $\hat{z}_i$, $(z_i, y_i)$ are measurements subject to error, and $(\hat{z}_i, \hat{y}_i)$ are points on the regression curve that satisfy $\hat{y}_i = \hat{\mathbf{x}}_i^{T}\boldsymbol{\beta}$. For any given $\boldsymbol{\beta}$, the total least squares objective function is minimized when $(\hat{z}_i, \hat{y}_i)$ is the closest point on the curve to the measured data $(z_i, y_i)$. Figure B-1 illustrates the difference between the errors minimized in regular least squares and total least squares. Figure B-1a shows the original fundamental diagram curve along with sample measurements obtained by adding normal noise to both dimensions. Figure B-1b shows the errors that regular least squares regression minimizes, and Figure B-1c shows the errors that total least squares regression minimizes. It is clear from these errors that total least squares regression results in a much less biased estimation of non-linear functions when both measurements are subject to error.
To find the best-fit Van Aerde fundamental diagram for a set of speed and density
measurements, an iterative numerical approach is employed. The process starts with an initial Van
Aerde model. Then, for each measured sample, the closest point on the Van Aerde curve is
calculated. The sum of squared distances over all samples defines the error value for the Van
Aerde model. The optimization iteratively updates the parameters of the Van Aerde model in the
direction that reduces this error.
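The iterative closest-point procedure above can be sketched as follows. For brevity, a simple quadratic stands in for the Van Aerde speed-density model, the closest point on the curve is found by dense grid search, and the parameter update uses shrinking grid refinement rather than the thesis's optimizer; all numerical values are assumptions for illustration:

```python
import numpy as np

def curve(x, a, b):
    # Quadratic stand-in for the Van Aerde speed-density relation.
    return a + b * x**2

def tls_error(a, b, xs, ys, grid):
    # Total-least-squares error: for each measured point, squared distance
    # to the closest point on the curve, found by dense grid search.
    cy = curve(grid, a, b)
    d2 = (xs[:, None] - grid[None, :])**2 + (ys[:, None] - cy[None, :])**2
    return d2.min(axis=1).sum()

rng = np.random.default_rng(1)
x_true = rng.uniform(0, 3, 40)
xs = x_true + rng.normal(0, 0.05, 40)                 # error on both variables,
ys = curve(x_true, 1.0, 0.5) + rng.normal(0, 0.05, 40)  # as in Figure B-1(a)

grid = np.linspace(-1, 4, 1500)
# Iterative parameter update: repeatedly shrink a search box around the
# best (a, b) found, mimicking the error-reducing update of Appendix B.
a_lo, a_hi, b_lo, b_hi = 0.0, 2.0, 0.0, 1.0
for _ in range(4):
    avals = np.linspace(a_lo, a_hi, 15)
    bvals = np.linspace(b_lo, b_hi, 15)
    errs = [(tls_error(a, b, xs, ys, grid), a, b) for a in avals for b in bvals]
    _, a_best, b_best = min(errs)
    da, db = (a_hi - a_lo) / 4, (b_hi - b_lo) / 4
    a_lo, a_hi = a_best - da, a_best + da
    b_lo, b_hi = b_best - db, b_best + db
# (a_best, b_best) approaches the generating parameters (1.0, 0.5)
# despite noise on both coordinates.
```

The same structure applies to the actual Van Aerde model: only `curve` and the parameter box change.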
Figure B-1 Comparison of the errors for regular and total least squares: (a) samples
generated from the original fundamental diagram with measurement error on both variables,
(b) errors minimized in the regular least squares method, (c) errors minimized in the total least
squares method. [Three panels plotting Speed (km/hr) against Density (veh/km/lane).]
Appendix C – Simultaneous Perturbation Stochastic Approximation
Simultaneous perturbation stochastic approximation (SPSA) is a gradient-based optimization
algorithm for multivariate optimization problems in which it is difficult or impossible to directly
obtain the gradient of the objective function. The basic approach to estimating the gradient is
to evaluate the objective function on both sides of the candidate point along every dimension;
for a problem with p variables, a total of 2p objective function evaluations is therefore needed.
The SPSA algorithm, on the other hand, simultaneously perturbs the candidate point along all
dimensions and calculates an estimate of the gradient with only two objective function
evaluations. It has been shown that, under reasonably general conditions, SPSA achieves a similar
level of accuracy as conventional gradient-based optimization approaches given a similar number
of iterations.
The SPSA algorithm is an iterative approach, and its implementation steps are as follows:

1 – Initialization and Coefficient Selection. The first step is to choose a feasible initial point
\hat{\theta}_0 as well as the parameters a, A, \alpha, c, and \gamma of the SPSA algorithm. These
parameters define the gain sequences a_k = a / (A + k + 1)^{\alpha} and c_k = c / (k + 1)^{\gamma}
for the algorithm that will be used in the following steps. Practically effective values for these
parameters can be found in (J. C. Spall, 1998).

2 – Generation of the Simultaneous Perturbation Vector. Generate a random vector \Delta_k
with p elements. This random vector should satisfy the conditions described in (James C. Spall,
1992). A simple distribution that satisfies these conditions is the Bernoulli distribution with
outcomes +1 and -1, each with probability 1/2.

3 – Objective Function Evaluation. Evaluate the objective function around the current
point \hat{\theta}_k using the perturbation vector \Delta_k. The two points for evaluating the
objective function are \hat{\theta}_k + c_k \Delta_k and \hat{\theta}_k - c_k \Delta_k.

4 – Gradient Approximation. The gradient at the current point, given the two objective
function evaluations, can be approximated by:

\hat{g}_k(\hat{\theta}_k) = \frac{y(\hat{\theta}_k + c_k \Delta_k) - y(\hat{\theta}_k - c_k \Delta_k)}{2 c_k} \left[ \Delta_{k1}^{-1}, \Delta_{k2}^{-1}, \ldots, \Delta_{kp}^{-1} \right]^{T} (C.1)

where \Delta_{ki} is the ith component of the \Delta_k vector, and y(\cdot) is the objective function.
5 – Updating the Estimate of \theta. The estimate \hat{\theta}_k can be updated based on the
estimated gradient using:

\hat{\theta}_{k+1} = \hat{\theta}_k - a_k \hat{g}_k(\hat{\theta}_k) (C.2)

6 – Iteration or Termination. If a termination condition is met (a maximum number of
iterations, or a near-zero gradient), the process terminates; otherwise, it is repeated from step 2.
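The six steps above can be sketched compactly. The coefficient values below are assumptions in the spirit of the cited guidelines, not the values used in this thesis, and the quadratic test function is purely illustrative:

```python
import numpy as np

def spsa_minimize(obj, theta0, a=0.1, c=0.1, A=10,
                  alpha=0.602, gamma=0.101, n_iter=500, seed=0):
    # SPSA following steps 1-6 above. Coefficient defaults are common
    # rule-of-thumb values (assumed here, not from the thesis).
    rng = np.random.default_rng(seed)
    theta = np.asarray(theta0, dtype=float)           # step 1: initial point
    for k in range(n_iter):
        a_k = a / (A + k + 1) ** alpha                # step-size gain sequence
        c_k = c / (k + 1) ** gamma                    # perturbation gain sequence
        delta = rng.choice([-1.0, 1.0], size=theta.shape)  # step 2: Bernoulli +-1
        y_plus = obj(theta + c_k * delta)             # step 3: two evaluations,
        y_minus = obj(theta - c_k * delta)            #   regardless of dimension p
        ghat = (y_plus - y_minus) / (2 * c_k * delta)  # step 4: Eq. (C.1)
        theta = theta - a_k * ghat                    # step 5: Eq. (C.2)
    return theta                                      # step 6: fixed iteration budget

# Usage: minimize a simple quadratic with minimum at (1, -2),
# using only two objective evaluations per iteration.
f = lambda th: (th[0] - 1.0) ** 2 + (th[1] + 2.0) ** 2
theta_star = spsa_minimize(f, [0.0, 0.0])
```

Note that each iteration costs two evaluations of `obj` whatever the dimension of `theta`, which is the practical advantage of SPSA over two-sided finite differences (2p evaluations).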