Decentralized Coordinated Optimal Ramp Metering using Multi-agent Reinforcement Learning
by
Kasra Rezaee
A thesis submitted in conformity with the requirements for the degree of Doctor of Philosophy
Civil Engineering Department, University of Toronto
© Copyright by Kasra Rezaee 2014
Decentralized Coordinated Optimal Ramp Metering using Multi-agent Reinforcement Learning
Kasra Rezaee
Doctor of Philosophy
Civil Engineering Department, University of Toronto
2014
Abstract
Freeways are the major arteries of transportation networks. In most major cities in North
America, including Toronto, infrastructure expansion has fallen behind transportation demand,
causing escalating congestion problems. It has been realized that infrastructure expansion cannot
provide a complete solution to congestion problems owing to economic limitations, induced
demand, and, in metropolitan areas, simply lack of space. Furthermore, the drop in freeway
throughput due to congestion exacerbates the problem during rush hours, when the
capacity is needed most. Dynamic traffic control measures provide a set of cost-effective
congestion mitigation solutions, among which ramp metering (RM) is the most effective approach.
This thesis proposes a novel optimal ramp control (metering) system that coordinates the actions
of multiple on-ramps in a decentralized structure. The proposed control system is based on
reinforcement learning (RL); therefore, the control agents learn the optimal actions from interaction
with the environment and without reliance on any a priori mathematical model. The agents are
designed to function optimally in both independent and coordinated modes. Therefore, the whole
system is robust to communication or individual agent’s failure. The RL agents employ function
approximation to directly represent states and actions with continuous variables instead of relying
on discrete state-action tables. Use of function approximation significantly speeds up the learning
and reduces the complexity of the RL agent design process. The proposed RM control system is
applied to a meticulously calibrated microsimulation model of the Gardiner Expressway
westbound in Toronto, Canada. The Gardiner Expressway is the main freeway running through
Downtown Toronto and suffers from extended periods of congestion every day. It was chosen as
the testbed for highlighting the effectiveness of the coordinated RM. The proposed coordinated
RM algorithm, when applied to the Gardiner model, resulted in a 50% reduction in total travel time
compared with the base case scenario and significantly outperformed approaches based on the
well-known ALINEA RM algorithm. This improvement was achieved while the permissible on-ramp queue limit was satisfied.
Dedication
To my loving wife
Sahar
who made it all possible.
Acknowledgements
First and foremost, I would like to express my sincere gratitude to my supervisor, Professor Baher
Abdulhai. I would like to thank him for his deep insight, wisdom, invaluable guidance, advice and
limitless support during the development of this thesis. His patience and understanding have been
an inspiration during my graduate studies.
I also want to express my thanks for the comments and suggestions provided by the thesis
committee members Professor Matthew Roorda, Professor Amer Shalaby, and Professor Khandker
M. Nurul Habib.
I would also like to acknowledge the generous financial support I received from the University of
Toronto, Professor Baher Abdulhai, Fortran Traffic Systems, and the Canadian Automobile
Association.
Finally, I would like to thank the members of the Transportation Group for their valuable
advice and instructive discussions. This thesis would not have been possible without the help of Dr.
Hossam Abdelgawad, Dr. Samah El-Tantawy, and Mohamed Elshenawy.
Sections from Published and In-process Papers
Portions of this thesis have been reproduced, with modifications, from the following previously
published and submitted material.
Section 4.2:
Rezaee, K., Abdulhai, B., and H. Abdelgawad, “Application of Reinforcement Learning with Continuous State Space to Ramp Metering in Real-world Conditions”, in Proceedings of the IEEE Intelligent Transportation Systems Conference, Anchorage, September 2012.
Section 5.1:
Rezaee, K., Abdulhai, B., and H. Abdelgawad, “Self-Learning Adaptive Ramp Metering: Analysis of Design Parameters on a Test Case in Toronto”, Transportation Research Record 2396, 2013.
Section 5.2:
Rezaee, K., Abdulhai, B., and H. Abdelgawad, “Closed-Loop Optimal Freeway Ramp Metering using Continuous State Space Reinforcement Learning with Function Approximation”, Transportation Research Board (TRB) 93rd Annual Meeting, Washington, D.C., January 2014.
Section 4.3 and 5.3:
Rezaee, K., Abdulhai, B., and H. Abdelgawad, “Decentralized Coordinated Optimal Ramp Metering: Application to the Gardiner Expressway in Downtown Toronto”, submitted to Transportation Research Board (TRB) 94th Annual Meeting, Washington, D.C., January 2015.
Table of Contents
Acknowledgements ......................................................................................................................... v
Sections from Published and In-process Papers ............................................................................ vi
Table of Contents .......................................................................................................................... vii
List of Tables .................................................................................................................................. x
List of Figures ................................................................................................................................ xi
1 Introduction ............................................................................................................................. 1
1.1 Freeway Traffic Control Problem .................................................................................... 1
1.2 Overview of the Proposed Methodology ......................................................................... 4
1.3 Thesis Structure ................................................................................................................ 6
2 Literature Review .................................................................................................................... 8
2.1 Pre-timed Ramp Metering .............................................................................................. 11
2.2 Traffic-responsive Ramp Metering ................................................................................ 11
2.2.1 Independent Controllers .......................................................................................... 11
2.2.2 Coordinated Controllers .......................................................................................... 14
2.3 Summary of RM approaches .......................................................................................... 18
3 Methodology: Optimal Ramp Metering Using Reinforcement Learning ............................. 21
3.1 Optimal Control Problem ............................................................................................... 21
3.2 Markov Decision Processes and Value Iteration ............................................................ 23
3.3 Reinforcement Learning: Model-free Learning ............................................................. 24
3.3.1 Q-Learning .............................................................................................................. 25
3.3.2 SARSA .................................................................................................................... 27
3.3.3 R-Learning .............................................................................................................. 28
3.4 RL with Continuous State and Action Space ................................................................. 28
3.4.1 k-Nearest Neighbours Weighted Average .............................................................. 29
3.4.2 Multi-Layer Perceptron Neural Network ................................................................ 30
3.4.3 Linear Model Tree .................................................................................................. 32
3.4.4 Advantage Updating ............................................................................................... 33
3.5 Multi-Agent Reinforcement Learning ............................................................................ 35
3.5.1 Independent Learners .............................................................................................. 36
3.5.2 Cooperative Reinforcement Learning ..................................................................... 36
3.6 Summary ........................................................................................................................ 40
4 Development of Microscopic Simulation Testbeds .............................................................. 41
4.1 Developing the Microsimulation Models ....................................................................... 42
4.1.1 Data Preparation for Real Measurements and Paramics ......................................... 42
4.1.2 Driver Behaviour Parameter Calibration ................................................................ 44
4.1.3 OD Estimation and Calibration ............................................................................... 49
4.2 Highway 401 Eastbound Collector and Keele Street ..................................................... 51
4.3 Gardiner Expressway Westbound .................................................................................. 54
5 Independent and Coordinated RL-based Ramp Metering Design and Experiments ............ 61
5.1 Experiment I – Single Ramp with Conventional RL ..................................................... 61
5.1.1 RL-based RM Controller Design for Single Ramp ................................................. 62
5.1.2 Effect of Design Parameters on RLRM Performance ............................................. 68
5.1.3 Comparison with ALINEA Controller .................................................................... 72
5.2 Experiment II – RL-based RM with Function Approximation ...................................... 73
5.2.1 Design of Function Approximation Approaches .................................................... 74
5.2.2 Simulation Results .................................................................................................. 74
5.3 Experiment III – Gardiner: Independent and Coordinated Ramp Metering .................. 78
5.3.1 RLRM Design for Coordinated Ramp Metering .................................................... 78
5.3.2 Simulation Results and Controller Evaluation ........................................................ 81
5.3.3 The Gardiner Test Case Summary .......................................................................... 97
6 Conclusions and Future Work ............................................................................................ 101
6.1 Major Findings ............................................................................................................. 102
6.2 Contributions ................................................................................................................ 103
6.3 Towards Field Implementation .................................................................................... 104
6.4 Assumptions and Limitations ....................................................................................... 106
6.5 Future Work ................................................................................................................. 107
References ................................................................................................................................... 109
Appendix A – Paramics Plug-in ................................................................................................. 114
Appendix B – Total Least Squares ............................................................................................. 115
Appendix C – Simultaneous Perturbation Stochastic Approximation ........................................ 117
List of Tables
Table 2-1 Summary of the performance of the RM approaches in the literature from control perspectives. Solid circles show better performance. .......... 19
Table 4-1 Numerically calibrated Paramics parameters for the Highway 401 model .......... 53
Table 4-2 Numerically calibrated Paramics parameters for the Gardiner model .......... 56
Table 4-3 Parameters of the Van Aerde model fitted to fundamental diagram samples from Paramics and real life. .......... 57
Table 5-1 Metering rates and associated green and red phases for the one-car-per-green metering policy .......... 63
Table 5-2 Metering rates and associated green and red phases for the discrete release rates metering policy .......... 63
Table 5-3 Summary of the simulation results for the single ramp test case with conventional RL algorithms. .......... 73
Table 5-4 Comparison of performance of different RLRM approaches .......... 77
Table 5-5 Demand for accessing the freeway mainline downstream of each on-ramp .......... 82
List of Figures
Figure 1-1 Schematic representing the traffic flow at the boundaries of a network. .......... 2
Figure 1-2 Fundamental diagram based on five-minute traffic counts and densities measured on a Japanese freeway (Sugiyama et al., 2008). .......... 3
Figure 1-3 Structure of the thesis .......... 7
Figure 2-1 Functional structure of demand capacity and ALINEA algorithms (Papageorgiou et al., 2003). .......... 13
Figure 2-2 Forecasting theory of SWARM global mode (Ahn et al., 2007). .......... 15
Figure 2-3 Schematic of model predictive control for traffic control problems (Hegyi et al., 2005). .......... 17
Figure 2-4 Hierarchical control structure with distributed controllers. .......... 18
Figure 3-1 The relationship between the algorithms presented in this chapter .......... 22
Figure 3-2 Illustration of the k-nearest neighbour algorithm for estimating the value of a new point. The four closest neighbours to the candidate point are shown. .......... 30
Figure 3-3 Multi-layer perceptron structure for function approximation applications. In this figure … are input variables, … are the hidden layer weights, … is the sigmoid non-linear function, … are output layer weights, and … is the output of the neural network. .......... 31
Figure 3-4 Illustration of input space partitioning for the linear model tree .......... 32
Figure 3-5 A simple problem showing the variation of Q-values in the states and actions. a) The base problem with 101 states, where the goal is to reach the terminal state s0 with the minimum movements. Therefore, the reward of taking each action is -1 and the discount factor is 1. b) The optimal state values and Q-values of actions. .......... 34
Figure 3-6 Different approaches for applying RL to ramp metering for a sample traffic network: a) Centralized structure with a single RL agent for the whole network, b) Isolated RL-based RM agents, c) Coordinated MARL-based RM agents. .......... 38
Figure 4-1 Relationship between occupancy and density. .......... 45
Figure 4-2 A Van Aerde model fitted to samples from a loop detector. The left figure is the flow-density relationship and the right figure is the speed-density relationship. .......... 48
Figure 4-3 Aerial map and Paramics screenshot showing part of the study area. The map shows the Highway 401 eastbound collector at the merging point of Keele St. .......... 52
Figure 4-4 Fundamental diagram fitted to samples from simulation of the calibrated Paramics model and real loop detectors. .......... 53
Figure 4-5 The evolution of morning traffic in the Paramics model compared with measurements from real loop detectors. .......... 54
Figure 4-6 Schematic of the study area network, showing the Gardiner Expressway westbound from Don Valley Parkway in the east to Humber Bay in the west. .......... 54
Figure 4-7 Aerial map of the Gardiner Expressway westbound and its Paramics model .......... 56
Figure 4-8 The left graph shows the real loop detector data and the right graph shows the Paramics model. The time-space graphs are the average speed along the Gardiner from 13:00 to 21:00. .......... 58
Figure 4-9 GEH value for vehicle counts averaged over one-hour intervals for select loop detector locations. .......... 59
Figure 4-10 Traffic flow of the calibrated Paramics model compared with real loop detector data along the Gardiner for three different time intervals. .......... 59
Figure 4-11 Aerial view of the Spadina on-ramp with information about traffic flow and queues .......... 60
Figure 5-1 Local area on an on-ramp and the loop detectors which can represent its traffic state. .......... 64
Figure 5-2 Histogram of traffic densities in a freeway section, including an on-ramp operation in the presence of an optimal RM controller. The dashed line represents the estimated critical density. .......... 65
Figure 5-3 The actual weights of … and the discounted weights which the RLRM agent considered using a discount factor of 0.94. The actual weights of … were based on a control cycle of 2 min and a minimization horizon of 1 hr. .......... 67
Figure 5-4 Effect of adding a penalty term to the reward function for severe congestion. (a) The total travel time for the freeway mainline, (b) the total travel time for the whole network. .......... 69
Figure 5-5 Performance comparison of the RLRM agent with direct action and the RLRM agent with incremental action. (a) Total travel time for the freeway mainline only, (b) total travel time for the whole network. .......... 70
Figure 5-6 Effect of different reward choices on RLRM performance. In case 1 the reward is … and the state variables are downstream density, upstream density, and on-ramp density. In case 2 the reward is … and the state variables are the same as in case 1. Case 3 is similar to case 2 with the exception that upstream density is omitted. In case 4 the reward is … and the state variables are downstream and on-ramp densities. .......... 71
Figure 5-7 Learning speed and solution quality of the presented four RL approaches. The curves above are obtained by averaging multiple epochs through a moving average window for clarity. The actual results have significantly more variation from one epoch to another because of the stochastic nature of microscopic simulation. .......... 76
Figure 5-8 The schematic of the Gardiner showing the location of entry and exit flows for each individual RM agent. .......... 79
Figure 5-9 Communication between RLRM agents of the Gardiner .......... 81
Figure 5-10 Colour-coded space-time diagram of base case traffic speed. .......... 83
Figure 5-11 Freeway throughput after the Jameson on-ramp in the base case and with ramp metering. .......... 84
Figure 5-12 Comparison of the Jameson on-ramp traffic flow in the base case and with independent ramp metering. .......... 84
Figure 5-13 Freeway throughput after the Spadina on-ramp in the base case and with independent ramp metering. .......... 85
Figure 5-14 Freeway performance for four different scenarios. The error bars show the standard deviation of values for different simulation runs. .......... 86
Figure 5-15 Average experienced travel time of vehicles starting from the Jameson zone until the end of the network in the west. .......... 87
Figure 5-16 Time-space diagram of traffic speed for RLRM-I (left) and Jameson2pmClose (right). .......... 87
Figure 5-17 Queues for the three on-ramps throughout the simulation period. .......... 88
Figure 5-18 Average travel time for trips originating during 4-5 pm from origins in the downtown to the west end of the network for the four scenarios. .......... 89
Figure 5-19 On-ramp queues for the RM algorithms which consider limited queue capacity. .......... 90
Figure 5-20 Time-space diagram of traffic speed for algorithms with limited queue space. .......... 91
Figure 5-21 Freeway performance under ramp metering with limited queue capacity. .......... 91
Figure 5-22 Travel times from different locations to the west end of the network. .......... 92
Figure 5-23 Freeway performance for coordinated RM approaches. .......... 93
Figure 5-24 On-ramp queues of coordinated and independent RLRM agents with limited queue space. .......... 94
Figure 5-25 Travel times from different locations to the west end of the network in the RLRM-C case. .......... 94
Figure 5-26 Average travel time for trips originating during 4-5 pm from origins in the downtown to the west end of the network at Humber Bay for different independent and coordinated RLRM approaches. .......... 95
Figure 5-27 Downtown on-ramp queues with the ALINEAwLC control algorithm. .......... 96
Figure 5-28 On-ramp queues (left) and travel times (right) for the RLRM-CwQE case. .......... 97
Figure 5-29 Downtown on-ramp flows entering the freeway for the RLRM-CwQE case. .......... 98
Figure 5-30 Summary of the performance of the nine scenarios for the Gardiner test case. .......... 99
Figure 6-1 Controller interface with real (a) and virtual (b) transportation environments. .......... 105
Figure 6-2 The CID developed for evaluation of MARLIN-ATSC. .......... 106
Figure B-1 Comparison of the errors for regular and total least squares. (a) Samples generated from the original fundamental diagram with measurement error on both variables, (b) errors minimized in the regular least squares method, (c) errors minimized in the total least squares method. .......... 116
1 Introduction
Freeway traffic congestion is a common problem in metropolitan areas. Whereas drivers’
perception of congestion is increased travel time, the more important issue is the drop in freeway
capacity at critical density, which causes further congestion accumulation. Traffic congestion
appears when the number of vehicles attempting to use a transportation infrastructure exceeds its
capacity. In the best-case scenario, the excess demand leads to queuing phenomena and full use of
the infrastructure. In the more common cases, this congestion leads to traffic instability,
breakdown, loss of capacity and a degraded use of the infrastructure, thus contributing to an
accelerated congestion increase. A capacity loss as low as 5% can mean 20% higher travel time
for drivers (Papageorgiou & Kotsialos, 2002). These phenomena show that such congestion is
not simply the result of excessive demand exceeding the network capacity; rather, it is the
capacity loss and subsequent infrastructure degradation that lead to escalating instability if no
suitable control systems are employed to prevent such loss.
In recent years, it has been realized that infrastructure expansion cannot provide a complete
solution to congestion problems owing to economic limitations, induced demand, and, in
metropolitan areas, simply lack of space. An alternative approach is to use dynamic traffic control
measures, such as ramp metering (RM), variable speed limits, and dynamic route guidance
(Papageorgiou, Diakaki, Dinopoulou, Kotsialos, & Wang, 2003). Among these measures, ramp
metering is the most effective traffic control measure and is widely used in different parts of the
world (Papageorgiou & Kotsialos, 2002). Ramp metering controls the flow of cars entering the
freeway through on-ramps and can help to prevent the breakdown of the freeway.
1.1 Freeway Traffic Control Problem
The traffic condition on a freeway network is a function of three variables, as illustrated in Figure 1-1:
demand entering the freeway, freeway physical capacity, and flow of vehicles exiting the freeway.
Although demand and capacity can be modified, in this research it is assumed that they are fixed.
The focus is to maximize the exit flow and get the vehicles off the freeway as soon as possible,
thereby minimizing the time vehicles spend on the freeway network. Traffic congestion limits the
freeway exit flow in two ways: 1) the blockage of off-ramps and 2) the drop in freeway throughput
because of congestion, which is commonly known as capacity drop. Therefore, preventing the
freeway from breaking down can significantly increase the exit flow.
Figure 1-1 Schematic representing the traffic flow at the boundaries of a network.
Unlike off-ramp blockage, understanding the nature of the capacity drop phenomenon is not
as straightforward. Researchers have studied the capacity drop in light of real freeway
measurements. Research on the Queen Elizabeth Way (QEW) west of Toronto has shown a
capacity drop of 5% to 6% (Hall & Agyemang-Duah, 1991) downstream of a congested section.
Research on German freeways has shown a similar capacity drop of 4% to 6% (Brilon & Ponzlet,
1996). Figure 1-2 shows the five-minute traffic count and density measured by the Japan Highway
Public Corporation. On the left side of the graph, traffic flow increases steadily as density
increases. As density increases and exceeds critical density (about 25 veh/km/lane in this example),
the freeway breaks down and traffic flow drops. The capacity drop is the difference between traffic
flow immediately before and after the critical density. Effectively, the traffic flow of a congested
bottleneck is on the right side of the fundamental diagram and the uncongested traffic flow follows
the left side of the fundamental diagram.
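The two branches of the fundamental diagram described above can be illustrated with a simple piecewise flow-density model. The sketch below is for illustration only and is not a model used in this thesis; the free-flow speed, critical density, jam density, and the 5% drop fraction are hypothetical values chosen to match the range reported above.

```python
# Illustrative piecewise fundamental diagram with a capacity drop at the
# critical density. All parameter values below are hypothetical.

V_FREE = 100.0   # free-flow speed (km/h)
K_CRIT = 25.0    # critical density (veh/km/lane)
K_JAM = 150.0    # jam density (veh/km/lane)
DROP = 0.05      # ~5% capacity drop after breakdown

Q_MAX = V_FREE * K_CRIT            # uncongested capacity (veh/h/lane)
Q_CONG_MAX = (1 - DROP) * Q_MAX    # discharge flow just after breakdown

def flow(density: float) -> float:
    """Flow (veh/h/lane) on the left (uncongested) or right (congested) branch."""
    if density <= K_CRIT:
        # Left side of the fundamental diagram: flow grows with density.
        return V_FREE * density
    # Right side: discharge flow decays linearly toward zero at jam density.
    return Q_CONG_MAX * (K_JAM - density) / (K_JAM - K_CRIT)

print(flow(K_CRIT))          # 2500.0 veh/h/lane at the critical density
print(flow(K_CRIT + 1e-9))   # ~2375 veh/h/lane once the freeway breaks down
```

The discontinuity at `K_CRIT` is the capacity drop: throughput falls by the drop fraction the moment density crosses the critical value, which is why keeping traffic on the left branch is so valuable.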
In a freeway traffic control system, the goal is to maintain the traffic condition on the left
side of the fundamental diagram but close to critical density to maximize throughput. Regulating
on-ramp traffic flow through RM provides the measures to keep traffic on the left side of the
fundamental diagram. Although closing the on-ramp altogether would solve the congestion problem,
the challenge is to let the maximum number of vehicles enter the freeway without causing the
freeway to break down. Early RM algorithms were pre-timed, with metering rates calculated
from historical traffic demand, which tended to either under-utilize or oversaturate the freeway.
Traffic-responsive RM algorithms tackle this issue by calculating the metering rate from the
current traffic condition.
Figure 1-2 Fundamental diagram based on five-minute traffic count and densities
measured on a Japanese freeway (Sugiyama et al., 2008).
There are well-established RM algorithms for controlling a single on-ramp without
limitation on the queue storage. The challenges arise when multiple closely spaced on-ramps feed
traffic into the freeway and queue storage space on each on-ramp is limited. Controlling the on-
ramps independently puts pressure on the downstream on-ramp, and it loses its effectiveness when
its queue reaches the limit. Furthermore, unbalanced queues among adjacent on-ramps will
encourage drivers to take longer routes to avoid the queue on the downstream on-ramp. Metering
closely spaced on-ramps simultaneously can resolve the issue of unbalanced queues and allow use
of the queue storage space of all ramps for management of freeway traffic. However, efficient
coordination of multiple on-ramps is not trivial because of the complexity of the freeway traffic
dynamics.
Since traffic flow is maximized at the critical density (Papageorgiou & Kotsialos, 2002), a
group of RM algorithms, e.g. ALINEA (Papageorgiou, Hadj-Salem, & Blosseville, 1991a) and its
variations (Smaragdis & Papageorgiou, 2003), focus on regulating the traffic density at its critical
value. Although these controllers are simple to design and easy to implement, they neither seek
nor guarantee optimal performance. However, they can be easily augmented through heuristic
approaches to handle coordination of multiple on-ramps (Papamichail & Papageorgiou, 2008).
Another group of RM algorithms mainly based on optimal control theory, e.g. RM based on model
predictive control (MPC) (Hegyi, De Schutter, & Hellendoorn, 2005), determine the metering rate
which directly maximizes the network performance. These algorithms use a mathematical model
of the freeway to estimate the outcome of different ramp metering policies and choose the one that
maximizes the system performance. Given their nature, these algorithms can directly optimize the
metering rate for multiple on-ramps while taking into consideration the constraints on the queues.
However, they require an accurate model of the network for optimal results, and any uncertainty
or mismatch in the model will result in suboptimal performance. Furthermore, their computation
demand increases exponentially with the network size. In the literature, model-based optimal RM
algorithms have either been implemented as pre-timed algorithms (Gomes & Horowitz, 2006;
Apostolos Kotsialos & Papageorgiou, 2004) or applied to small networks with a single on-ramp
(Bellemans, De Schutter, & De Moor, 2002; Ghods, Fu, & Rahimi-Kian, 2010).
1.2 Overview of the Proposed Methodology
Reinforcement learning (RL) (Sutton & Barto, 1998), which has attracted significant attention in
recent years, has the potential to alleviate some of the aforementioned limitations. RL agents
continuously learn from their interaction with the environment; therefore, they do not require an
explicit model of the controlled environment. RL provides the tools for solving optimal control
problems when developing a model of the system is difficult. RL has proven its effectiveness in
solving complex control problems in different fields (Crites & Barto, 1998; Khan, Herrmann,
Lewis, Pipe, & Melhuish, 2012). In transportation, researchers have employed RL for more than a
decade (Abdulhai & Kattan, 2003) in different problems such as traffic signal control (Arel, Liu,
Urbanik, & Kohls, 2010; S. El-Tantawy, Abdulhai, & Abdelgawad, 2013), ramp metering (RM) (Davarynejad, Hegyi, Vrancken, & van den Berg, 2011), and dynamic route guidance (Jacob &
Abdulhai, 2010).
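As an illustration of the core mechanism underlying these applications, the basic tabular Q-learning update can be sketched in a few lines; the state and action labels below are purely illustrative and are not those used in this thesis:

```python
def q_update(Q, s, a, r, s_next, actions, alpha=0.1, gamma=0.95):
    """One tabular Q-learning update:
    Q(s,a) <- Q(s,a) + alpha * (r + gamma * max_b Q(s',b) - Q(s,a))."""
    best_next = max(Q.get((s_next, b), 0.0) for b in actions)
    old = Q.get((s, a), 0.0)
    Q[(s, a)] = old + alpha * (r + gamma * best_next - old)

# Toy transition: a hypothetical metering agent observes a coarse
# congestion level, restricts the ramp flow, and receives a reward.
Q = {}
q_update(Q, 'congested', 'restrict', 1.0, 'free_flow', ['restrict', 'open'])
```

Repeating such updates over many interactions lets the agent estimate the long-run value of each action without any model of the traffic dynamics.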
In control problems, the control system quantifies the surrounding environment using
sensors measuring continuous variables. Conventional RL algorithms have primarily considered
discrete states to characterize the system conditions; as a result, the system’s continuous variables
must be discretized into a finite set of intervals. This discretization of continuous state variables raises two issues: 1) choosing the discretization levels introduces a trade-off between the accuracy of the state representation and the number of states; 2) breaking a continuous variable into independent discrete states neglects the correlation between nearby states.
To address these limitations, a number of studies have investigated the use of continuous variables in RL through general function approximators (Doya, 2000; Geist & Pietquin, 2013; Powell & Ma,
2011; Santamaria, Sutton, & Ram, 1997). Although similar approaches have been applied to
transportation-related applications (Heinen, Bazzan, Engel, & Ieee, 2011; Prashanth & Bhatnagar,
2011), there is still significant room for improvement.
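For illustration, one simple function approximator is a linear value estimate over Gaussian radial-basis features of a continuous measurement; nearby densities then share value estimates instead of being treated as unrelated discrete states. All numbers below are illustrative:

```python
import math

def rbf_features(density, centers, width=10.0):
    """Gaussian radial-basis features of a continuous density measurement.
    Nearby densities produce similar feature vectors, so learning at one
    state generalizes to its neighbours."""
    return [math.exp(-((density - c) / width) ** 2) for c in centers]

def q_value(weights, density, centers):
    """Linear value estimate over the RBF features."""
    return sum(w * f for w, f in zip(weights, rbf_features(density, centers)))

# Illustrative centers (veh/km) and learned weights.
centers = [0.0, 20.0, 40.0, 60.0]
weights = [0.0, 1.0, 0.5, 0.0]
```

Because the features vary smoothly with density, the value estimates at 25 and 26 veh/km differ only slightly, which is exactly the correlation between nearby states that a discrete table discards.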
Although in theory it is possible to use a single RL agent to control multiple on-ramps, in
practice it is infeasible because of the computational complexity. Fortunately, it is possible to have
multiple RL agents in a decentralized structure. Although each agent controls a single ramp, they
can coordinate their actions to maximize their collective reward rather than individual rewards.
This problem falls into the multi-agent reinforcement learning domain which is extensively
reviewed in (Busoniu, Babuska, & De Schutter, 2008). In traffic control problems, different agents
are seeking the same goal and can coordinate their actions to achieve it; therefore, they are playing
a cooperative game. Panait and Luke (2005) have summarized the cooperative learning algorithms.
Among these algorithms, traffic control problems nicely fit the requirements of the coordination
graph algorithm (Guestrin, Lagoudakis, & Parr, 2002). Although coordinated RL has been
employed in numerous traffic control studies, the applications were mainly control of surface street
traffic lights (Arel et al., 2010; Bazzan, 2009; S. El-Tantawy et al., 2013; Kuyer, Whiteson, Bakker,
& Vlassis, 2008; Salkham, Cunningham, Garg, & Cahill, 2008).
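To illustrate the idea behind coordination graphs, consider agents arranged in a chain, as on-ramps along a corridor are, with a payoff that decomposes into local terms and pairwise terms between neighbours. The jointly optimal action can then be found exactly by variable elimination along the chain rather than by enumerating all joint actions. The sketch below is a generic illustration of this technique, not the algorithm developed in this thesis:

```python
def best_joint_action(local_q, pair_q, actions):
    """Exact joint-action selection on a chain coordination graph.

    Total payoff = sum_i local_q[i][a_i] + sum_i pair_q[i][(a_i, a_{i+1})].
    A forward pass of variable elimination keeps, for each action of
    agent i, the best achievable value of agents 0..i; backtracking
    then recovers the maximizing joint action."""
    n = len(local_q)
    f = [{a: local_q[0][a] for a in actions}]  # f[i][a]: best value of 0..i
    back = []
    for i in range(1, n):
        fi, bi = {}, {}
        for a in actions:
            # best action of agent i-1 given that agent i plays a
            prev = max(actions,
                       key=lambda ap: f[i - 1][ap] + pair_q[i - 1][(ap, a)])
            fi[a] = f[i - 1][prev] + pair_q[i - 1][(prev, a)] + local_q[i][a]
            bi[a] = prev
        f.append(fi)
        back.append(bi)
    joint = [max(f[-1], key=f[-1].get)]
    for i in range(n - 2, -1, -1):
        joint.append(back[i][joint[-1]])
    return list(reversed(joint))
```

With two agents whose pairwise term rewards simultaneous restriction, the method selects the coordinated joint action even when each agent's local term alone would favour the opposite choice.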
In this research, a decentralized and coordinated RM control system is proposed. The
agents will minimize the total travel time (TTT) vehicles spend in the network while respecting
the queue constraints. The individual agents are designed such that their local TTT is minimized
when they are acting independently. If they coordinate their actions, the cumulative TTT is
minimized. Agents coordinate their actions to achieve the highest collective reward through direct
communication and negotiation. Since agents are still locally optimal when acting independently,
the system can function properly in the event of communication failure. The RL-based RM
(RLRM) agents employ function approximation, which significantly improves their learning
speed. The learning speed of RLRM agents with function approximation is more than 20 times that of conventional RLRM algorithms. Furthermore, function approximation
eliminates the curse of dimensionality, and makes more complex RL algorithms possible.
Evaluation of the proposed algorithm on a Paramics simulation model of the westbound direction
of the Gardiner Expressway in Toronto resulted in almost 50% saving in TTT compared with the
base case, which is 10% more than the savings of the well-known ALINEA algorithm.
1.3 Thesis Structure
The structure of the thesis is illustrated in Figure 1-3. After the introduction, a literature review of
the ramp metering algorithms is presented in Chapter 2. A summary of the algorithms is provided
to identify gaps and limitations. Chapter 3 provides the details of the proposed methodology.
Besides the conventional RL algorithms, three function approximation techniques and their
application to RL are discussed. Additionally, the coordination of RL agents based on coordination
graphs for playing a cooperative game is presented. Chapter 4 discusses development and
calibration of the microsimulation models used for training and evaluation of the proposed
algorithms. The calibration of the models involves calibrating the Paramics driver behaviour
parameters and calibrating the dynamic demand. The two models are the Highway 401 eastbound
collector at Keele Street and the westbound direction of the Gardiner Expressway, both in Toronto.
In Chapter 5, three experiments are presented. The first experiment is the design of a single agent
RLRM for the Highway 401 test case, and analysis of the effect of different design parameters on
the agent’s performance. The second experiment, also on the Highway 401 model, extends the
single agent RLRM to continuous state space using different function approximation techniques
and identifying the most suitable technique for RM application. The third experiment involves
applying the multiple independent RLRM agents with continuous state space to the Gardiner test
case as well as coordination of those agents and analysis of their performance. Chapter 6
summarizes the findings and lists possible future directions for this research.
Figure 1-3 Structure of the thesis
2 Literature Review
Freeways were initially built to provide an unhindered flow of traffic; therefore, freeway traffic
control measures were mainly used for safety reasons. However, recurrent and non-recurrent
congestion caused by the rapid growth in auto ownership and travel demand calls for these measures
to be used as a means to maintain the efficiency of the freeways. There are various control
measures that can effectively improve freeway networks’ efficiency, including ramp metering,
variable speed limits, and dynamic route guidance. Among these measures, the most direct and
efficient way to control freeway traffic is ramp metering (Papageorgiou & Kotsialos, 2002). Ramp
metering improves freeway traffic conditions by appropriately regulating the on-ramp’s flow.
Appropriate implementation of ramp metering can improve freeway traffic in different ways.
Ramp metering can increase mainline throughput by avoiding capacity loss, and can increase the served volume by preventing the blockage of off-ramps. Proper ramp metering algorithms can react
to incidents efficiently to minimize their effects. A. Kotsialos, Papageorgiou, Mangeas, and Haj-
Salem (2002) employed optimal control algorithms for the ramp metering problem and, through
macroscopic simulation, demonstrated outstanding improvements in large freeway networks.
Similar results were obtained through microscopic traffic simulation of various adaptive ramp
metering algorithms (Chu, Liu, Recker, & Zhang, 2004; Hasan, Jha, & Ben-Akiva, 2002).
Although ramp metering can be very effective, there are certain limitations associated with
it. One challenge is associated with the limited queue storage capacity of the on-ramps. If the queue
exceeds the on-ramp queue capacity, the connected arterial will be adversely affected. The simple
and widely adopted approach to prevent queues from exceeding on-ramp capacity is a queue
override algorithm employed alongside the main ramp metering algorithm. The queue override
algorithm calculates the on-ramp flow required to prevent the queue from exceeding on-ramp
capacity and has priority over ramp metering algorithms. Another challenge associated with ramp
metering arises when multiple on-ramps are present along a corridor. Usually the freeway traffic
flow increases as vehicles enter the freeway along the route. Therefore, the bottleneck is at the
downstream on-ramps. Consequently, these on-ramps experience the longest queues of cars
whereas upstream on-ramps have no queues. This phenomenon penalizes drivers entering from the
downstream ramps and could encourage drivers to change their route and take upstream ramps,
which is counterproductive. Furthermore, sacrificing users of downstream on-ramps for the benefit
of users of upstream on-ramps can be viewed as inequitable and may make the public reluctant to
accept ramp metering. To resolve this issue, proper coordination among adjacent on-ramps is
required to meter all on-ramps simultaneously.
Ramp metering strategies (as well as traffic control strategies in general) can be classified
along several dimensions, such as:
pre-timed vs. traffic-responsive;
independent vs. coordinated;
heuristic vs. optimal;
centralized vs. decentralized.
Pre-timed vs. traffic-responsive: pre-timed ramp metering strategies are derived off-line,
based on historical demands, for particular times of day. These control approaches act in an open-
loop manner and do not take into account variations in traffic condition, resulting in either overload
of the mainstream traffic flow (congestion) or under-utilization of the freeway capacity. Unlike
pre-timed strategies, traffic-responsive ramp metering strategies are based on real-time
measurements from sensors installed in the freeway. Traffic-responsive strategies change the
control signal in response to varying traffic conditions, thereby properly reacting to disturbances
and demand variations.
Independent vs. coordinated: independent strategies make use of measurements from the
vicinity of a single ramp, and do not consider the information from other parts of the network.
Local ramp metering applied independently to multiple ramps of a freeway is very efficient in
terms of Total Travel Time (TTT) if unlimited queue storage space is available. However, ramp
queues must be restricted to avoid interference with adjacent street traffic. Releasing the queued
cars prematurely into the freeway to avoid queue spillback to local streets results in congestion on
the freeway mainline. As a result, mainline congestion cannot always be avoided merely by
independent control and limited queue storage of a single ramp. In addition, providing equity for
users of different on-ramps, which plays an important role in the acceptance of ramp metering, is
not possible with independent controllers. Coordinated ramp metering relies on the traffic
condition and on-ramp queue information from multiple on-ramps. This allows the system to
utilize the queue space available on all on-ramps to prevent freeway breakdown. In addition,
coordination allows the system to homogenize the queues and provide the same level of service to
all users.
Heuristic vs. optimal: in traffic control problems, the main goal is to minimize TTT.
Optimal approaches, usually based on optimal control theory and dynamic programming, are able
to find the metering rates that directly minimize the TTT. These approaches are usually based on
a mathematical model of the freeway and look for optimal metering rates. The optimal control
approach can be employed for both independent and coordinated control systems. However,
because of the complex and non-linear nature of the flow of traffic, optimal approaches are usually
computationally intensive and require an accurate model of traffic network. On the other hand,
heuristic approaches rely on traffic flow characteristics to simplify the problem; therefore, they do
not directly minimize the TTT. As an example, since traffic flow is highest at critical density, a
heuristic approach can regulate the traffic density around critical density to maximize traffic
throughput, and as a result minimize TTT. Heuristic approaches can also be employed for
coordinating multiple on-ramps, e.g. equalizing queues of upstream on-ramps with the downstream
on-ramps.
Centralized vs. decentralized: the coordination of on-ramps can be performed in a
centralized or decentralized structure (independent controllers are decentralized by nature). In a
centralized structure, measurements from the entire network are collected, and a central controller
computes timing for all on-ramps. Under ideal conditions, centralized systems can achieve the
maximum possible performance. However, in large-scale problems the computation needs and
communications overhead limit the practicality of such systems. Furthermore, the reliability of
centralized systems is very poor as failure of a single component can paralyze the whole system.
Decentralized systems place the intelligence at the controlled location by distributing the
controllers throughout the network. In this structure, controllers act on their own based on local
measurements as well as high-level information from other controllers, which is necessary for
coordination.
The ideal RM control system is a traffic-responsive optimal controller. The control system
should calculate the metering rate based on the current state of traffic, i.e. employ a control law
for calculation of the metering rate. Due to the stochastic and nonlinear nature of freeway traffic,
finding the optimal control law is not trivial. An alternative approach often used in literature is to
employ model-based optimization. In this approach, the effects of different metering rates, during
the control horizon, are evaluated using a mathematical model. Because the optimization is performed for specific demand scenarios, the resulting solution is not traffic-responsive. Although repeating the
optimization at every control cycle would result in a traffic-responsive control system, the
complexity of the optimization limits its applicability to larger problems.
2.1 Pre-timed Ramp Metering
In pre-timed ramp metering (also known as fixed-time or time-of-day), historical traffic flow data
are used to calculate the metering rate throughout the day. Since these algorithms are derived off-
line, they can employ complex traffic flow models to calculate the metering rates. The most
prominent fixed-time ramp metering algorithm is AMOC (A. Kotsialos et al., 2002). AMOC
employs a second-order macroscopic model of the traffic network called METANET (Messmer &
Papageorgiou, 1990) to solve a non-linear optimization problem with the objective of minimizing
total travel time. Solving the problem off-line makes it possible to solve complex problems while
incorporating queue storage space constraints. Assuming the real traffic condition matches the
historical values, the result would be an optimal and coordinated solution for ramp metering.
However, any inaccuracies in the freeway traffic model or unexpected deviations in traffic
condition from the historical values will significantly degrade the system performance.
Gomes and Horowitz (2006) employed a similar optimal pre-timed ramp metering
approach with a first-order macroscopic model named the asymmetric cell transmission model.
First-order models are much simpler than second-order models; therefore, solving a nonlinear
optimization problem with a first-order model is possible for much larger networks. Despite
behaving very well in terms of reproducing congestion, the asymmetric cell transmission model fails to capture the capacity drop phenomenon, which is a significant factor in the
effectiveness of ramp metering in reality. Although the numerical results of this paper show
improvement in TTT and elimination of congestion, the significance of their solution remains
questionable and needs evaluation with a more complex simulation model that can capture the
capacity drop. Nonetheless, eliminating congestion opens blocked off-ramps and allows vehicles to exit the freeway faster, which is a possible reason for the improved TTT.
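For concreteness, a minimal first-order cell-transmission update can be written in a few lines. This toy version, with all parameters illustrative, conserves vehicles but, like the model discussed above, has no mechanism for capacity drop: a congested cell's discharge flow never falls below the nominal capacity.

```python
def ctm_step(n, inflow, v=0.9, w=0.3, n_jam=200.0, q_max=40.0, dt_over_dx=1.0):
    """One update of a minimal first-order cell-transmission model.

    n: vehicles in each cell; inflow: demand entering cell 0.
    The flow between cells is min(sending, receiving), each capped
    at q_max; the last cell discharges freely at its sending flow."""
    sending = [min(v * x, q_max) for x in n]            # demand of each cell
    receiving = [min(w * (n_jam - x), q_max) for x in n]  # supply of each cell
    f_in = [min(inflow, receiving[0])]
    for i in range(len(n) - 1):
        f_in.append(min(sending[i], receiving[i + 1]))
    f_out = f_in[1:] + [sending[-1]]
    return [x + dt_over_dx * (fi - fo) for x, fi, fo in zip(n, f_in, f_out)]
```

Because each inter-cell flow appears once as an outflow and once as an inflow, the update conserves vehicles, which is easy to verify numerically.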
2.2 Traffic-responsive Ramp Metering
2.2.1 Independent Controllers
Independent controllers are the simplest type of traffic-responsive ramp metering controllers.
Independent controllers rely on local measurements only; therefore, they are naturally
decentralized. One of the earliest independent controllers is the demand-capacity algorithm
(Masher et al., 1975), where the metering rate is calculated from the difference between upstream
flow and capacity of the freeway as follows:
r(k+1) = max{q_cap − q_in(k), r_min}  if ρ_out(k) ≤ ρ_cr;  r(k+1) = r_min  otherwise, (2.1)

where r(k+1) is the metered on-ramp flow for the next time step, q_cap is the freeway capacity, q_in(k) is the measured upstream flow, ρ_out(k) is the density downstream of the ramp, ρ_cr is the critical density of the freeway, and r_min is the minimum permissible metered ramp flow. The demand-capacity algorithm is considered an open-loop or feed-forward control approach, because the output of the system, the downstream traffic, is not directly employed in the calculation of the control signal. Like any feed-forward system, this algorithm is prone to model deficiencies, and its performance will degrade if the capacity value q_cap is not accurate.
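A minimal sketch of the demand-capacity rule, with all numeric values illustrative:

```python
def demand_capacity_rate(q_cap, q_in, rho_out, rho_cr, r_min):
    """Feed-forward demand-capacity metering for one control step.

    Admit the residual capacity q_cap - q_in while the downstream
    density is below critical; otherwise fall back to the minimum
    permissible metered flow."""
    if rho_out <= rho_cr:
        return max(q_cap - q_in, r_min)
    return r_min
```

Note the feed-forward character: if the assumed q_cap is wrong, the released flow is wrong, and nothing in the rule corrects for it.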
To overcome the limitations of calculating the metering rate based on freeway capacity,
Papageorgiou et al. (1991a) proposed ALINEA, a feedback-based ramp metering algorithm. Based
on the relation of flow and density shown in Figure 1-2, traffic flow is maximized at critical
density. The ALINEA algorithm varies the metering rate to regulate the density downstream of the
ramp to a desired value close to critical density as follows:
r(k+1) = r(k) + K_R[ρ̂ − ρ_out(k)], (2.2)

where ρ̂ is the desired density and K_R > 0 is a control parameter. This control structure is one of the simplest linear time-invariant (LTI) controllers and is known as an I-controller or integral controller.
The functional structure of ALINEA and demand capacity is shown in Figure 2-1. The simplicity
of ALINEA and its variations (Smaragdis & Papageorgiou, 2003; Smaragdis, Papageorgiou, &
Kosmatopoulos, 2004) has made it the most well-known ramp metering controller and its
performance is validated through field implementations (Papageorgiou, Hadj-Salem, &
Middelham, 1997).
Considering the performance improvements of ALINEA and its simplicity, researchers
have employed more complex control algorithms to regulate occupancy or density. H. M. Zhang
and Ritchie (1997) proposed a non-linear controller that employs a neural network in place of the control parameter K_R in ALINEA. The neural network replaces the constant parameter
with one that varies according to the density of the mainline to provide better regulation of traffic
density. Sun and Horowitz (2005) employed a linear controller based on optimal control theory to
regulate mainline density. The proposed approach utilizes a linear first-order model that switches
between congested and free-flow conditions.
Figure 2-1 Functional structure of the demand-capacity (feed-forward, open-loop) and ALINEA (closed-loop) algorithms (Papageorgiou et al., 2003).
The aforementioned approaches take a heuristic approach to freeway control, as they do
not directly minimize the TTT. Ghods, Kian, and Tabibi (2007) took a semi-optimal approach to
freeway traffic control. They presented a fuzzy controller, which calculates metering rate based on
mainline density. A fuzzy controller maps the measurement to the metering rate by using a non-
linear relation, allowing more refined control over the changes in metering rates. The parameters
of the fuzzy controller are tuned with a genetic algorithm with the goal of minimizing TTT of a
test case simulated using a second-order macroscopic model.
Notable independent ramp metering algorithms that are truly optimal are those based on reinforcement learning. Applying conventional RL approaches to larger problems with
multiple ramps, however, is not practical. In (Davarynejad et al., 2011), the authors have presented
an RL-based ramp metering controller. The controller is trained and evaluated using a modified
version of the METANET macroscopic model. Another example of RL-based ramp metering
system is presented in (Jacob & Abdulhai, 2010) involving the Gardiner Expressway eastbound.
In this study, the Paramics microscopic simulation model is employed for training and evaluation
of the ramp metering controllers. To account for the states unseen by the RL agent, a CMAC neural
network is used to generalize the learning outcome of the RL agent.
2.2.2 Coordinated Controllers
Whereas independent ramp meters can be very effective and easy to implement, they cannot
provide equity among different on-ramps. Additionally, their performance is significantly
degraded when the ramp queue storage space is limited. In RM problems with limited queue
storage, a separate algorithm calculates the minimum metering rate, which keeps the queue below
a maximum admissible length. Increasing the minimum metering rate forces the ramp meter to release cars into the mainline prematurely, causing the freeway to break down. Although
progression of congestion upstream will trigger ramp metering for upstream on-ramps, this natural
coordination will significantly degrade freeway performance. Coordinated ramp metering
approaches try to leverage the space available on multiple adjacent on-ramps and simultaneously
meter multiple ramps to achieve higher performance when on-ramp storage space is limited. A
common by-product of proper coordinated ramp metering is more homogeneous waiting times for
different on-ramps.
2.2.2.1 Coordination Based on Heuristics
Bottleneck (Jacobsen, Henry, & Mahyar, 1989), implemented in Seattle, and Zone (Lau, 1997),
implemented in the Minneapolis/St Paul area, are early coordinated algorithms which are
extensions of the demand-capacity algorithm. Bottleneck consists of a local-level component and
a system-level component, each calculating a metering rate. The more restrictive metering rate is
then applied to the ramp. The local controller is conceptually similar to demand-capacity. To
calculate the system-level metering rate, the freeway is divided into several sections depending on
loop detector locations. For each section, the number of vehicles stored in that section during a
one-minute interval is calculated from the difference between entry and exit rates. If the difference
for any section is greater than zero, i.e. vehicles are being stored in that section, the metering rate
of the on-ramps with influence over that section is reduced. Similar to Bottleneck, the Zone
algorithm extends the demand-capacity to a region rather than a local section. The metering rates
of all ramps are calculated simultaneously by taking into account entry and exit flows, capacity of
the freeway bottleneck, and estimated number of vehicles on the freeway. Chu et al. (2004)
evaluated the performance of the two algorithms using microscopic simulation and observed that
they are inferior to ALINEA in their conventional form.
SWARM (Paesani, Kerr, Perovich, & Khosravi, 1997) is another ramp metering system
which relies on heuristics to coordinate multiple on-ramps, and has been extensively implemented
in Southern California. SWARM calculates two separate metering rates, a local and a global, and
applies the more restrictive one. The local mode can be any local ramp metering system. The global
mode operates based on forecast densities at the system’s bottleneck locations. The future density
of a bottleneck is estimated by linear regression of immediate past samples. The forecast density
is compared with a threshold, and the difference is used to modify the desired current density of the
bottleneck, as illustrated in Figure 2-2. Given the current and desired bottleneck density, the
volume reduction that should be applied to upstream on-ramps is calculated. Despite widespread
implementation in southern California, field evaluation of SWARM in Portland, Oregon has not
shown any noticeable improvement over pre-timed ramp metering (Ahn, Bertini, Auffray, Ross,
& Eshel, 2007).
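The global-mode forecast amounts to fitting a least-squares line through the most recent density samples and extrapolating it. A minimal sketch:

```python
def forecast_density(samples, horizon):
    """Fit a least-squares line through the last few density samples
    (taken at unit time steps) and extrapolate `horizon` steps beyond
    the most recent sample, SWARM-style."""
    n = len(samples)
    x_mean = (n - 1) / 2.0
    y_mean = sum(samples) / n
    num = sum((x - x_mean) * (y - y_mean) for x, y in enumerate(samples))
    den = sum((x - x_mean) ** 2 for x in range(n))
    slope = num / den
    # prediction at x* = (n - 1) + horizon
    return y_mean + slope * (n - 1 - x_mean + horizon)
```

For samples that already lie on a line the extrapolation is exact; for noisy samples the regression smooths the trend before projecting it forward.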
Figure 2-2 Forecasting theory of SWARM global mode (Ahn et al., 2007).
The first attempt at a coordinated extension of ALINEA was the METALINE algorithm
presented by Papageorgiou, Blosseville, and Haj-Salem (1990). METALINE is the multi-input
multi-output extension of ALINEA obtained by vectorization of the ALINEA equation:
r(k+1) = r(k) − K₁[ρ(k) − ρ(k−1)] − K₂[ρ_s(k) − ρ̂_s], (2.3)

where r = [r₁ … r_n]ᵀ is the vector of flows of the n controllable on-ramps, ρ = [ρ₁ … ρ_m]ᵀ is the vector of m measured densities, ρ_s = [ρ_{s,1} … ρ_{s,p}]ᵀ is the vector of p potential bottleneck densities, ρ̂_s is the vector of desired bottleneck densities, and K₁ and K₂ are n×m and n×p matrices of control weights, respectively. Although the concept behind METALINE is sound, its marginal improvement over the local controller ALINEA does not justify the complex design procedure.
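The vectorised update is straightforward to state in code; a minimal sketch with plain lists standing in for vectors and matrices, and all numeric values illustrative:

```python
def metaline_rates(r, rho, rho_prev, rho_s, rho_s_hat, K1, K2):
    """Vectorised METALINE-style update:
    r(k+1) = r(k) - K1 [rho(k) - rho(k-1)] - K2 [rho_s(k) - rho_s_hat]."""
    def matvec(M, v):
        return [sum(m * x for m, x in zip(row, v)) for row in M]
    d1 = matvec(K1, [a - b for a, b in zip(rho, rho_prev)])
    d2 = matvec(K2, [a - b for a, b in zip(rho_s, rho_s_hat)])
    return [ri - a - b for ri, a, b in zip(r, d1, d2)]
```

The design burden the text refers to lives entirely in choosing the weight matrices K1 and K2, which couple every ramp to every measurement.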
Papamichail and Papageorgiou (2008) proposed a linked ramp metering control strategy
based on ALINEA to equalize the queue length of each on-ramp with the one downstream of its
location. For each on-ramp, three metering rates are calculated. The first is the regulator's metered ramp flow, r_s(k), calculated from the ALINEA control law in (2.2). The second is the queue override ramp flow, r_w(k), obtained from the queue control law as:

r_w(k) = −(1/T)[ŵ_max − w(k)] + d(k−1), (2.4)

where ŵ_max is the desired maximum queue, w(k) is the current queue length, d(k−1) is the demand entering the on-ramp, and T is the control cycle. This control law tries to maintain the ramp flow at a level that ensures the queue does not exceed the desired maximum. The third is the linked control ramp flow, r_l(k), which coordinates each on-ramp with the one downstream of its location so that they have similar queue lengths. The control law limits the metering rate to maintain a desired minimum queue as:

r_l(k) = −K_w[ŵ_min − w(k)] + d(k−1), (2.5)

where K_w is a control parameter which may be set equal to 1/T for a quick response or to a smaller value for a smoother response, and ŵ_min is the desired minimum queue calculated according to the queue of the downstream on-ramp. ŵ_min is initially zero; it is set to the same value as the downstream on-ramp's queue when that queue exceeds a certain threshold, and reset to zero when the downstream queue falls below the threshold. The final metering rate is calculated as:

r(k) = max{min[r_s(k), r_l(k)], r_w(k)}. (2.6)
The authors evaluated the linked ramp metering algorithm on a macroscopic model and
observed that it is comparable to ALINEA when there is no limit on the queue storage. However,
if the ramp queue space is limited, linked ramp metering has significantly better performance than
ALINEA. A more refined variation of the above linked ramp metering algorithm named HERO
(Papamichail, Papageorgiou, Vong, & Gaffney, 2010b) has been field implemented at Monash
Freeway in Australia. This algorithm, contrary to the other coordinated algorithms mentioned
above, can be implemented in a decentralized structure.
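A minimal sketch of the queue control law and of one way the three flows can be combined, with the queue-override flow acting as a binding lower bound; all numeric values are illustrative:

```python
def queue_override_flow(w_max, w, demand, T):
    """Queue control in the spirit of (2.4): the ramp flow needed so the
    queue does not exceed w_max over one control cycle of length T."""
    return -(w_max - w) / T + demand

def linked_metering_rate(r_s, r_w, r_l):
    """Combine the three flows of the linked strategy: the regulator
    flow r_s and the linked-control flow r_l restrict the rate, while
    the queue-override flow r_w overrides both as a lower bound."""
    return max(min(r_s, r_l), r_w)
```

When the queue is well below its maximum, the regulator or the linked control dominates; once the queue approaches the limit, r_w grows and forces the ramp to release vehicles regardless of mainline conditions.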
2.2.2.2 Optimal Coordination
The heuristic approaches to coordination mentioned above are simple to implement; however, they
generally have complicated tuning processes for their numerous parameters, which make their
performance subpar in practice. Another approach to the coordination of multiple on-ramps used in the literature is to employ a macroscopic model of the traffic network and solve the non-linear
optimization problem of minimizing TTT. To provide a traffic-responsive solution, the
optimization problem should be solved repeatedly in every control cycle. Although the obtained
solution is for the whole control horizon, only the first cycle of the solution is applied to the
network. Such control systems are known as model predictive control (MPC) or receding horizon
control. The algorithm searches for the set of metering rates, one for each control cycle in the control horizon, N_C, that minimizes the cost over the prediction horizon, N_P. The schematic
of the MPC for traffic control problems is shown in Figure 2-3.
Figure 2-3 Schematic of model predictive control for traffic control problems (Hegyi et al.,
2005).
Bellemans et al. (2002) and Hegyi et al. (2005) successfully employed MPC for optimal
traffic-responsive ramp metering. The freeway traffic network is modelled by the second-order
macroscopic model METANET. To handle the high computational demand of solving a non-
linear optimization problem in every control cycle, only a single on-ramp was considered for
control. Ghods et al. (2010) have employed the same MPC approach, but have proposed a
decentralized solution for solving the non-linear optimization problem. The decentralized solution
is based on the Game Theory concept Fictitious Play (Brown, 1951). Decentralization allows the
computation to be handled by multiple nodes, making the approach applicable to bigger problems.
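The receding-horizon loop itself is compact; the complexity lies in the predictive model and in the optimization over rate sequences. The sketch below replaces the nonlinear program with enumeration of a few candidate sequences and uses an invented toy model for prediction, so every function and number here is illustrative:

```python
def simulate(rho, rates):
    """Toy prediction model (illustrative only): density rises with the
    admitted ramp flow and relaxes by a fixed amount each step."""
    traj = []
    for r in rates:
        rho = rho + 0.01 * r - 5.0
        traj.append(rho)
    return traj

def cost(traj, target=28.0):
    """Penalise squared deviation from a target density over the horizon."""
    return sum((x - target) ** 2 for x in traj)

def mpc_step(state, candidates, simulate, cost):
    """Receding-horizon selection: evaluate each candidate metering-rate
    sequence over the prediction horizon, keep the cheapest sequence,
    and apply only its first rate before re-optimizing next cycle."""
    best = min(candidates, key=lambda seq: cost(simulate(state, seq)))
    return best[0]
```

Applying only the first rate and re-solving each cycle is what turns an open-loop optimization into a traffic-responsive controller; the cost of doing so is that the optimization must finish within one control cycle.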
2.2.2.3 Hierarchical Control Approach
Pre-timed RM systems based on optimal control result in optimum performance in the absence of
any disturbance. Although MPC can mitigate the performance drop caused by disturbances, its computational cost limits the size of the traffic networks it can handle. Papamichail et al. proposed a hierarchical
control approach to provide semi-optimal control for large traffic networks (Papamichail,
Kotsialos, Margonis, & Papageorgiou, 2010a). The hierarchical control approach, shown in
Figure 2-4, consists of three modules. The state estimation and prediction module constantly
monitors traffic condition to estimate the state of traffic and predict future demands. The non-
linear optimization module solves the optimization problem to find the optimal metering rates and
corresponding optimal traffic densities every 10 minutes. In the presence of disturbance, the 10-
minute control signals become sub-optimal and could result in unstable conditions. The third
module takes the output of the optimization module as the input. However, instead of directly
applying the signals to traffic lights, an ALINEA regulator is employed to regulate the system
around the set point provided by the optimizer. ALINEA improves system robustness and
provides predictable behaviour during the 10-minute optimization interval.
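The two-layer interaction can be sketched as a slow optimisation loop nested around a fast ALINEA loop; the set-point function, gain, and all numbers below are illustrative:

```python
class HierarchicalController:
    """Upper layer recomputes the density set-point every `reopt_every`
    cycles (standing in for the nonlinear optimizer); a lower-layer
    ALINEA regulator tracks the set-point in between."""

    def __init__(self, optimize_setpoint, r0=900.0, k_r=70.0, reopt_every=10):
        self.optimize = optimize_setpoint
        self.r = r0
        self.k_r = k_r
        self.reopt_every = reopt_every
        self.setpoint = None

    def step(self, k, rho_out):
        if self.setpoint is None or k % self.reopt_every == 0:
            self.setpoint = self.optimize(k)          # slow optimisation layer
        self.r += self.k_r * (self.setpoint - rho_out)  # fast ALINEA layer
        return self.r
```

Between re-optimizations the fast loop keeps correcting for disturbances, which is precisely the robustness the hierarchical structure buys over applying the optimizer's output open-loop.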
Figure 2-4 Hierarchical control structure with distributed controllers.
2.3 Summary of RM approaches
Table 2-1 summarizes the capabilities of notable RM approaches in the literature from several significant control perspectives. Solid circles represent the best performance and empty circles the poorest. Although this classification is highly subjective and reflects the author's view, it is meant to be a brief illustration of the capabilities of the different RM approaches in a single table.
Table 2-1 Summary of the performance of the RM approaches in the literature from control perspectives. Solid circles show better performance.

Performance criteria (table columns): traffic-responsive; coordinated; performance with no queue limit; performance with limited queue space; recall computation simplicity; robustness to demand variation; robustness to model imperfection; decentralized implementation.

Approaches (table rows): Demand capacity (Masher et al., 1975); Zone (Lau, 1997); SWARM (Paesani et al., 1997); ALINEA (Papageorgiou et al., 1991a); Fuzzy control (Ghods et al., 2007); METALINE (Papageorgiou et al., 1990); HERO (Papamichail et al., 2010b); AMOC (A. Kotsialos et al., 2002); MPC (Bellemans et al., 2002); Decentralized MPC (Ghods et al., 2010); Hierarchical MPC (Papamichail et al., 2010a).

[The individual circle ratings of the original table are not reproduced in this transcript.]
Heuristic approaches, which regulate traffic state, are simple and generally have fair
performance in applications with unlimited queue space. However, unlimited queue space is rarely
practical and the performance of these isolated RM approaches quickly degrades if queue storage
space is limited. Whereas coordinated variants of these approaches lessen the negative effect of
limited queue space, the heuristic nature of the coordination is not very effective in complex
scenarios.
RM approaches that directly optimize freeway performance using a freeway model
seek optimal coordination between multiple on-ramps and result in the best solution in any
condition, given that perfect information is available. However, any deficiency in the quality of the
model or of the predicted future demands will quickly degrade their performance. Furthermore, the
computational effort needed to solve the complex non-linear optimization problem grows exponentially
with problem size, limiting the practicality of these approaches to small problems.
The present work, therefore, seeks closed-loop optimal RM in the form of a control law,
i.e. metering rates that are calculated based on the current state of traffic and that directly optimize
the network performance. Additionally, a decentralized approach to coordination is considered to
facilitate the application of the proposed control system to large problems.
3 Methodology: Optimal Ramp Metering Using Reinforcement
Learning
The present chapter elaborates the methodologies and algorithms used in different stages of
developing the RL-based RM algorithms. The components of the methodology are described in
the order they are used in this research. The research goal is to develop and apply an optimal
control methodology for practical-sized RM applications, which is only possible through
coordination of decentralized RM controllers. In this chapter, after briefly describing the optimal
control problem and its challenges, RL is presented as the solution that has shown tremendous
promise in other applications such as adaptive signal control. However, the limitations of
conventional RL algorithms with discrete states become apparent as the problem size grows.
Therefore, function approximation approaches such as k-nearest neighbours (kNN), multi-layer
perceptron (MLP) neural network, and linear model tree (LMT) are put forward to directly
represent continuous states and actions. These algorithms are built on top of the established
concepts of discrete RL algorithms while mitigating their limitations. Finally, the coordination
graph concept from Game Theory is presented as the means for coordination of multiple RM
agents. Figure 3-1 shows the relationship between the theories and algorithms discussed in this
chapter. The implementation of the methodologies in this chapter to RM is described in detail in
Chapter 5 together with presentation and discussion of the results.
3.1 Optimal Control Problem
Optimal control problems generally involve maximizing1 a reward value. The total reward to be
maximized is in fact the accumulation of instantaneous rewards received over time. Let us denote
reward incurred at time $t$ by $r(x_t, u_t, w_t)$, where $x_t$ is the state of the system, $u_t$ is the control
action taken, and $w_t$ is a random parameter. Therefore, the problem of finding the optimum total
reward, $J^*(x_0)$, for a given initial condition, $x_0$, can be formulated as:

$$J^*(x_0) = \max_{u_0, u_1, \dots} E\left[\sum_{t=0}^{\infty} \gamma^t\, r(x_t, u_t, w_t)\right]$$
$$\text{subject to: } x_{t+1} = f(x_t, u_t, v_t) \qquad (3.1)$$

1 Given that any minimization problem has a dual maximization problem, analysis of the minimization problems is omitted.

where $\gamma$, $0 \le \gamma \le 1$, is the discount factor, which shows the significance of earlier rewards
compared with later rewards, function $f(\cdot)$ defines the dynamics of the system, and $v_t$ is a random
parameter associated with the uncertainty of the system dynamics. Although it is clear from (3.1)
that the total reward depends on the stream of the actions, each action will influence the trajectory
of the system. Therefore, each action will affect the instantaneous reward as well as the rewards
that will be observed in the future. The optimal control problem is to find the balance between the
immediate reward and the effect of the actions on future rewards.
[Figure 3-1 diagram: stochastic optimal control and Markov decision processes lead to reinforcement learning (Q-learning, SARSA, eligibility traces, self-learning of a control law); function approximation (kNN, MLP, LMT) yields RL with continuous states and actions (kNN-TD(λ), MLP-based RL, LMT-based RL, advantage updating); game theory (cooperative games, coordination graph, locally optimal action selection) yields cooperative multi-agent RL.]
Figure 3-1 The relationship between the algorithms presented in this chapter
The problem can be solved for a given initial condition and neglecting the random
parameters using optimization algorithms. However, the resulting stream of actions will be an open
loop solution, which will not perform well due to uncertainties. For systems with the Markov
property, the problem in (3.1) can be greatly simplified. In systems with the Markov property, the
future is independent of the past given the present. In other words, the effects of an action taken in
a state depend only on that state and not on previous history of the system. When modeling a traffic
flow system with wave equations, the underlying wave equations will provide the relation from
one state to the next; hence, the system will have the Markov property. Since the current state
captures all relevant information, the optimal action in the current state is independent of the past
states and actions. Therefore, instead of a stream of actions dependent on the initial state we can
look for a policy, a function that maps states to actions. Such a policy would be an optimal policy
if it maximizes the total reward, thereby resulting in a closed-loop optimal control system. Finding
an optimal policy directly from (3.1) is not straightforward, considering the convoluted effect of the
system dynamics on the total reward. To simplify the problem, dynamic programming can be
employed to break the problem into smaller problems. The dynamic programming (DP) equivalent
for optimal control problems is based on Bellman’s principle of optimality (Bellman, 2010,
Chapter III.3) which implies that “an optimal policy has the property that whatever the initial state
and initial decision are, the remaining decisions must constitute an optimal policy with regard to
the state resulting from the first decision.” Considering this principle, the optimization problem of
(3.1) can be simplified to Bellman’s equation:

$$J^*(x_0) = \max_{u_0} E\left[ r(x_0, u_0, w_0) + \gamma\, J^*(x_1) \right]$$
$$\text{subject to: } x_1 = f(x_0, u_0, v_0) \qquad (3.2)$$
In Bellman’s equation we choose $u_0$, knowing that our choice will cause the next state to
be $x_1 = f(x_0, u_0, v_0)$. That new state will then affect the decision problem from time $t=1$ going
forward. Bellman’s equation (3.2) is a functional equation, because it involves an unknown
function $J^*(\cdot)$.
3.2 Markov Decision Processes and Value Iteration
A Markov decision process (MDP) provides the framework for modelling a problem that involves
random outcomes and behaviour and is under the influence of a decision-maker. In fact, an MDP
is a discrete time stochastic control process. An MDP is defined by the tuple
$\langle S, A, P(s,a,s'), R(s,a,s') \rangle$, where $S$ is the set of states, $A_s$ is the set of actions available from
state $s$, $P(s,a,s')$ is the probability that taking action $a$ in state $s$ will lead to state $s'$, and
$R(s,a,s')$ is the expected reward received because of the transition from state $s$ to state $s'$ provided
that action $a$ is taken. Although MDPs are not limited to systems with finite states and actions, the
conventional algorithms for solving MDPs assume states and actions are finite. Therefore, the
functions $P(\cdot)$ and $R(\cdot)$ can be simplified to a matrix form. Considering an MDP with finite states,
equation (3.2) can be rewritten as:

$$V(s) = \max_{a} \sum_{s' \in S} P(s,a,s') \left[ R(s,a,s') + \gamma\, V(s') \right]. \qquad (3.3)$$
In general, DP algorithms for solving (3.3) are iterative, and value iteration is one of the
most notable and fundamental ones. Value iteration starts with an initial value for $V(\cdot)$, initializing
all states. Then, $V$ is updated by calculating the right-hand side of (3.3) for every state. The
updating process is repeated until $V(\cdot)$ has converged, i.e. $V(s)$ does not change from one
iteration to the next for any state. Value iteration is effectively a repetition of the following
equation:

$$V_{k+1}(s) = \max_{a} \sum_{s' \in S} P(s,a,s') \left[ R(s,a,s') + \gamma\, V_k(s') \right], \quad \forall s \in S \qquad (3.4)$$
Besides the basic assumption that states are finite, there are two challenges associated with
DP-based algorithms which limit their usage in practice (Sutton & Barto, 1998). First, DP assumes
availability of a perfect model that describes the transition probabilities and reward values.
Although assuming availability of a deterministic model of the system is not unreasonable, a
stochastic model that describes the uncertainties in a traffic network in the form of a transition
probability matrix is far from practical. Second, whereas DP works well for small synthetic
problems, in practice the number of states and actions increase exponentially with the problem
size. Therefore, the computation and storage requirements of DP limit its feasibility in practice.
3.3 Reinforcement Learning: Model-free Learning
Reinforcement learning (RL) is inspired by human’s trial-and-error learning behaviour and aims
to solve the optimal control problem without a priori knowledge of the model of the system (Sutton
& Barto, 1998). In RL, agents only perceive the state of the environment and the instantaneous
scalar reward $r_{t+1}$ as the system transitions from one state to another; hence, there is no
need to know the transition probabilities a priori. The agent learns the optimal actions through direct
interaction with the environment and by trying various actions in various states and observing their
outcomes.
3.3.1 Q-Learning
Numerous algorithms with plausible convergence speed and easily-customized parameters are
used to solve single-agent RL tasks, the most notable of which is the Q-learning approach of
Watkins (Watkins & Dayan, 1992). In Q-learning, instead of $V(\cdot)$, which defines the value of states,
a function $Q(s,a)$ is used which quantifies the expected value of state $s$ provided that action $a$ is
taken. Effectively, the function $Q(s,a)$ facilitates the comparison of the quality of different actions
within a state. The value associated with a state-action pair is also known as a Q-value. Q-learning
is directly derived from Bellman’s equation and is similar in nature to value iteration. For every
new time step $t+1$, the value of $Q(s_t, a_t)$ is calculated according to the reward received and the value
of the future state, and compared with the current estimate of $Q(s_t, a_t)$. The value of future
states is obtained from past experience and stored in the latest Q-value estimates. The function
$Q(\cdot)$ is updated with every new training sample according to:

$$Q(s_t, a_t) \leftarrow (1-\alpha)\, Q(s_t, a_t) + \alpha \left[ r_{t+1} + \gamma \max_{a} Q(s_{t+1}, a) \right] \qquad (3.5)$$

where $r_{t+1}$ is the reward received after performing action $a_t$ at state $s_t$ and moving to the new state
$s_{t+1}$, and $\alpha$, $0 < \alpha \le 1$, is the learning rate. If the learning rate is set to 1 the old value will be
replaced with the new estimation. However, because of the stochastic nature of MDPs, it is
necessary to calculate the average value over multiple samples. Hence, the old values will be
partially updated to provide new estimations. A more detailed description of the Q-learning
algorithm can be found in Watkins and Dayan (1992).
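The update rule (3.5) can be sketched in a few lines of Python; this tabular agent is a generic illustration (class and parameter names are hypothetical), not the thesis implementation:

```python
from collections import defaultdict

class QLearner:
    """Tabular Q-learning agent implementing the update rule (3.5)."""
    def __init__(self, actions, alpha=0.1, gamma=0.95):
        self.Q = defaultdict(float)   # unseen (state, action) pairs default to 0
        self.actions = actions
        self.alpha, self.gamma = alpha, gamma

    def update(self, s, a, r, s_next):
        # Target: reward plus discounted value of the best next action.
        target = r + self.gamma * max(self.Q[(s_next, b)] for b in self.actions)
        # Blend the old estimate and the new sample with learning rate alpha.
        self.Q[(s, a)] = (1 - self.alpha) * self.Q[(s, a)] + self.alpha * target
```

With $\alpha < 1$, each sample only partially updates the stored value, which is the averaging behaviour the text describes for stochastic transitions.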
3.3.1.1 Learning Rate
In stochastic problems, a decreasing learning rate is usually employed. Q-learning, when employed
with a decreasing learning rate with the following characteristics, is guaranteed
to suppress uncertainties and converge to the optimal Q-values (Watkins & Dayan, 1992):

$$\sum_{i=1}^{\infty} \alpha_{i(s,a)} = \infty, \qquad \sum_{i=1}^{\infty} \alpha_{i(s,a)}^2 < \infty, \qquad \forall s, a \qquad (3.6)$$

where $i(s,a)$ is defined as the index of the $i$th time that action $a$ is tried in state $s$. The first
function which may come to mind with the above characteristics is $\alpha_i = 1/i$. This choice of
learning rate will result in exact averaging of samples over time. Since the value of $Q(\cdot)$ for the
next state is present in the updating rule and is likely to have a better estimate later in the learning
process, a learning rate that decays but stays higher than $1/i$ later in the learning process is
advisable. As discussed by Even-Dar and Mansour (2003), a learning rate of the form
$\alpha_i = 1/i^{\omega}$ results in much better convergence when $\omega = 0.8$ than when $\omega = 1$.
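The polynomially decaying schedule above is easy to state in code (a trivial sketch; the function name is hypothetical):

```python
def learning_rate(i, omega=0.8):
    """Polynomially decaying learning rate alpha_i = 1 / i**omega.

    omega = 1 gives exact sample averaging; omega around 0.8 decays
    more slowly, weighting later (better-informed) samples more heavily.
    """
    return 1.0 / i ** omega
```

Both choices satisfy the conditions in (3.6): the sum of the rates diverges for $\omega \le 1$, while the sum of their squares converges for $\omega > 0.5$.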
3.3.1.2 Action Selection Policy
RL algorithms are guaranteed to converge to optimal values after infinite samples. In practice, the
luxury of infinite samples is not available, and we need to look for quick but reliable convergence
to optimal values. Since RL is based on trial and error, finding the balance between exploration,
trying new and potentially suboptimal actions, and exploitation, taking the optimal action, is
critical to Q-learning convergence. The two common approaches to action selection are $\epsilon$-greedy
and soft-max. In $\epsilon$-greedy, the best action is chosen with probability $1-\epsilon$, and a random action
with probability $\epsilon$, where $0 \le \epsilon \le 1$ is the tuning parameter. Generally, at the beginning of the
learning process $\epsilon = 1$ for completely random action selection, and as the agent learns, $\epsilon$ is decreased.
Although $\epsilon$ can be decreased all the way to zero for greedy action selection, maintaining a non-zero
$\epsilon$, e.g. $\epsilon = 0.1$, ensures that the agent keeps exploring. In contrast to $\epsilon$-greedy action selection,
which does not differentiate between actions when choosing a random action, soft-max action
selection assigns a probability to each action according to the Q-value of that action. The
probability of choosing action $a$ in state $s$ is calculated by:

$$P(a|s) = \frac{e^{Q(s,a)/\tau}}{\sum_{b \in A_s} e^{Q(s,b)/\tau}} \qquad (3.7)$$

where $\tau$, $\tau > 0$, is a tuning parameter. A large $\tau$ will result in probabilities that are more or less
uniform and independent of the Q-values, which is desirable in the early stages of the learning
process. As $\tau$ gets smaller, the actions with higher Q-values have higher probabilities. When $\tau$ gets
very close to zero, the action selection becomes greedy, resulting in a probability of almost 1 for
the action with the highest Q-value.
It is usual for the tuning parameters $\epsilon$ and $\tau$ to be varied based on the learning time.
However, in some applications, including traffic control problems, the states visited vary according
to the agent's policy and the actions it takes. Some areas of the state space are visited only after the
agent consistently chooses the optimal actions in other areas of the state space. Therefore, the agent
will not have the chance to explore these states if the action selection tuning parameter only
depends on learning time. To overcome this limitation and ensure that the agent only exploits when
all actions in a state have been explored, the tuning parameters $\epsilon$ and $\tau$ can be varied according to the
number of visits to that state as the maturity measure.
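Both selection rules can be sketched compactly; the following Python functions are generic illustrations of $\epsilon$-greedy and of the soft-max rule (3.7), not the thesis code:

```python
import math
import random

def epsilon_greedy(q_values, epsilon):
    """Pick the best action with probability 1 - epsilon, else a uniform random one."""
    if random.random() < epsilon:
        return random.randrange(len(q_values))
    return max(range(len(q_values)), key=lambda a: q_values[a])

def softmax(q_values, tau):
    """Sample an action with probability proportional to exp(Q / tau), eq. (3.7)."""
    m = max(q_values)                      # subtract the max for numerical stability
    weights = [math.exp((q - m) / tau) for q in q_values]
    r, acc = random.random() * sum(weights), 0.0
    for a, w in enumerate(weights):
        acc += w
        if r < acc:
            return a
    return len(q_values) - 1               # fallback for floating-point round-off
```

A large `tau` flattens the weights toward a uniform draw, while `tau` near zero makes `softmax` behave like the greedy branch of `epsilon_greedy`.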
3.3.2 SARSA
Another notable RL algorithm directly derived from Bellman’s equation is SARSA. SARSA
stands for state-action-reward-state-action, which is the order in which information is received and
used for updating $Q(\cdot)$. In SARSA, the update of $Q(s_t, a_t)$ depends on the action actually taken at
time step $t+1$, which is used instead of the optimal action. The updating rule for SARSA can be written
as:

$$Q(s_t, a_t) \leftarrow (1-\alpha)\, Q(s_t, a_t) + \alpha \left[ r_{t+1} + \gamma\, Q(s_{t+1}, a_{t+1}) \right] \qquad (3.8)$$
Since the updating is based on the actions that the agent takes, the estimated $Q(\cdot)$ will
depend on the policy of the agent at that stage. However, as the agent matures and its policy
shifts from taking random actions to choosing the optimal actions, $Q(\cdot)$ will converge toward its
optimal value. Because $Q(\cdot)$ depends on the agent’s policy, the learning speed of the agent
is slower compared with Q-learning.
3.3.2.1 Eligibility Traces
In problems with a discount factor close to one, i.e. problems with significantly delayed rewards,
the convergence of the Q-values can be very slow. This is because the updated value of state $s_t$
will not affect previously visited states until the next visit to them. One way to mitigate this
issue is the eligibility traces (Singh & Sutton, 1996) mechanism. In eligibility traces, the trail of
successively visited states is stored so that the states that contributed to the rewards received can
be traced back and updated accordingly. Typically, eligibility traces decay exponentially according
to the product of the discount factor $\gamma$ and a decay parameter $\lambda$, $0 \le \lambda \le 1$. The trace itself can be
defined by:

$$e_t(s) = \begin{cases} 1 & \text{if } s = s_t \\ \gamma \lambda\, e_{t-1}(s) & \text{if } s \neq s_t \end{cases} \qquad (3.9)$$

where $e_t(s)$ represents the trace for state $s$ at time $t$, and $s_t$ is the visited state at time $t$. The
eligibility trace defined in (3.9) is a replacing eligibility trace, as the trace of state $s$ is reset to 1
every time it is visited.
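Combining (3.8) and (3.9), a single SARSA step with replacing traces can be sketched as follows (an illustrative sketch with hypothetical names; traces are kept per state-action pair here):

```python
from collections import defaultdict

def sarsa_lambda_step(Q, e, s, a, r, s_next, a_next,
                      alpha=0.1, gamma=0.99, lam=0.9):
    """One SARSA update with replacing eligibility traces (eqs. 3.8 and 3.9).

    Q and e are dicts keyed by (state, action); the TD error is spread
    over all recently visited pairs in proportion to their trace.
    """
    delta = r + gamma * Q[(s_next, a_next)] - Q[(s, a)]
    e[(s, a)] = 1.0                      # replacing trace: reset to 1 on a visit
    for key in list(e):
        Q[key] += alpha * delta * e[key]
        e[key] *= gamma * lam            # exponential decay of older traces
        if e[key] < 1e-8:                # drop negligible traces to bound memory
            del e[key]
```

Each call propagates the current temporal-difference error back along the trail of visited state-action pairs, which is what speeds up convergence when rewards are delayed.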
3.3.3 R-Learning
Q-learning can be applied to discounted infinite-horizon problems. It can also be applied to
problems with undiscounted reward as long as the optimal policy leads to a state with zero reward.
R-learning (Mahadevan, 1996) is an extension of Q-learning to problems where the average reward is
maximized instead of the total discounted reward. In R-learning the goal is to maximize the
average expected reward, $\rho$, per time step:

$$\rho = \lim_{T \to \infty} \frac{1}{T}\, E\left[ \sum_{t=0}^{T} r_t \right] \qquad (3.10)$$

In this method, instead of reinforcing the instantaneous reward, $r_{t+1}$, the transient difference
in the reward, $r_{t+1} - \rho$, is used as the reinforcement. Therefore, the equivalent of the Q-learning update
law for R-learning is:

$$Q(s_t, a_t) \leftarrow (1-\alpha)\, Q(s_t, a_t) + \alpha \left[ r_{t+1} - \rho + \max_{a} Q(s_{t+1}, a) \right] \qquad (3.11)$$

The R-learning method has an additional unknown variable, $\rho$, which should be learned by the
agent. This variable is updated iteratively, only at steps where the best action is taken, i.e.
$a_t = \arg\max_a Q(s_t, a)$, as follows:

$$\rho \leftarrow (1-\beta)\, \rho + \beta \left[ r_{t+1} + \max_{a} Q(s_{t+1}, a) - \max_{a} Q(s_t, a) \right] \qquad (3.12)$$

where $\beta$ is a learning parameter balancing past experience and new samples in updating
$\rho$. Although for many problems the average reward criterion better represents the actual
problem than a discounted reward criterion, the convergence problems exhibited by R-learning
have prevented it from being widely adopted.
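The pair of updates (3.11) and (3.12) can be sketched together in one small agent (a generic illustration with hypothetical names; here $\rho$ is refreshed after the Q update, one of several reasonable orderings):

```python
from collections import defaultdict

class RLearner:
    """R-learning: maximizes the average reward rho instead of a discounted
    return, following the update rules of eqs. (3.11) and (3.12)."""
    def __init__(self, actions, alpha=0.1, beta=0.05):
        self.Q = defaultdict(float)
        self.actions = actions
        self.alpha, self.beta, self.rho = alpha, beta, 0.0

    def best(self, s):
        return max(self.Q[(s, a)] for a in self.actions)

    def update(self, s, a, r, s_next):
        target = r - self.rho + self.best(s_next)          # eq. (3.11) target
        greedy = self.Q[(s, a)] == self.best(s)            # was the greedy action taken?
        self.Q[(s, a)] = (1 - self.alpha) * self.Q[(s, a)] + self.alpha * target
        if greedy:
            # Average-reward estimate updated only on greedy steps, eq. (3.12).
            self.rho = (1 - self.beta) * self.rho + self.beta * (
                r + self.best(s_next) - self.best(s))
```

Restricting the $\rho$ update to greedy steps keeps the average-reward estimate from being biased by deliberately exploratory actions.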
3.4 RL with Continuous State and Action Space
Q-learning in its conventional form uses a table to represent the function $Q(s,a)$. Using a table limits
the practicality of Q-learning in complex problems that involve a multidimensional continuous
state space. Continuous states should be discretized for application by the conventional RL
approaches developed for discrete states (such as Q-learning). In addition to the exponential
growth of the discrete state space with the increase in problem size, discretization introduces a
trade-off between learning speed and system optimality. Finer discretization is likely to result in
better overall system performance, but the increased number of state-action pairs requires more
training samples and a longer training process. To overcome this challenge, one could use fine
discretization in sensitive regions and coarse discretization elsewhere. Although a theoretically
feasible approach, non-uniform discretization adds complexity to the design process.
In problems with continuous state space, it is expected that small movements in the state
space will result in minimal variations in the system’s behaviour. Therefore, in the RL context,
states that are closely spaced are expected to have close Q-values. Discrete states fail to exploit this
feature, and the problem is exacerbated as the discretization becomes finer.
The limitations mentioned above can be mitigated by using a general function
approximator which replaces the table representing $Q(s,a)$. Unlike Q-tables with hard boundaries,
function approximators enable the estimation of any intermediate Q-values in continuous space.
Additionally, function approximators make better use of the learning samples as each sample
updates the whole Q-function rather than a single element in the table, thereby resulting in much
faster learning speed. Three of the most notable function approximation approaches for RL are: k-
nearest neighbour weighted average, multi-layer perceptron neural network, and linear model tree.
3.4.1 k-Nearest Neighbours Weighted Average
A class of function approximators which are effective and easy to use in RL problems are sparse
coarse-coded function approximators (Santamaria et al., 1997). One method of this class, which
has shown very promising results, is based on the k-nearest neighbours concept. In theory, the k-
nearest neighbours temporal difference (kNN-TD(λ)) method (Martin, de Lope, & Maravall,
2011), can represent continuous state space in a manner which is very similar to the table-based
Q-learning with the added support for continuous state space. Therefore, all the solid theories
behind Q-learning can be seamlessly applied to kNN-TD(λ).
In kNN-TD(λ), a set of centers $C$, each with an explicit Q-value, is generated in the state
space. The estimation of the Q-value of a new point $s$ in the state space is shown in Figure 3-2.
The set $knn$, which contains the k-nearest neighbours of $s$ in the set $C$ based on Euclidean
distances $d_i$, is identified. A probability is then assigned to each of the centers in $knn$ as:

$$p_i = \frac{w_i}{\sum_{j \in knn} w_j}, \qquad w_i = \frac{1}{1 + d_i^2}, \qquad \forall i \in knn \qquad (3.13)$$

The Q-value of a state-action pair $(s,a)$ is then defined as the weighted average of the Q-values
of the points in set $knn$ with weights $p_i$:

$$Q(s, a) = \sum_{i \in knn} p_i\, Q(c_i, a) \qquad (3.14)$$

The updating of the Q-values of set $knn$ is performed by a similar process. With every new sample,
the k-nearest neighbours of $s_t$ can be identified and updated according to:

$$\delta = r_{t+1} + \gamma \max_{a} Q(s_{t+1}, a) - Q(s_t, a_t) \qquad (3.15)$$

$$Q(c_i, a_t) \leftarrow Q(c_i, a_t) + \alpha\, \delta\, p_i, \qquad \forall i \in knn \qquad (3.16)$$

The number of visits to a state-action pair can be similarly estimated as:

$$visits(s, a) = \sum_{i \in knn} p_i\, v(c_i, a), \qquad (3.17)$$

where $v(c_i, a)$ is the number of visits to center $c_i$ and action $a$. A more detailed description of the
kNN-TD(λ) algorithm can be found in Martin et al. (2011).
Figure 3-2 Illustration of the k-nearest neighbours algorithm for estimating the value of a new point. The four closest neighbours to the candidate point are shown.
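The weighted-average estimate of (3.13)-(3.14) can be sketched as follows (an illustrative numpy sketch with a hypothetical set of five centers in a 2-D state space; the returned neighbour indices and weights are what the update rules (3.15)-(3.16) would reuse):

```python
import numpy as np

def knn_q_estimate(centers, Q, state, action, k=4):
    """Estimate Q(state, action) as the distance-weighted average of the
    k nearest centers, eqs. (3.13)-(3.14)."""
    d2 = np.sum((centers - state) ** 2, axis=1)   # squared Euclidean distances
    nn = np.argsort(d2)[:k]                        # indices of the k nearest centers
    w = 1.0 / (1.0 + d2[nn])                       # w_i = 1 / (1 + d_i^2)
    p = w / w.sum()                                # normalized weights p_i
    return float(np.dot(p, Q[nn, action])), nn, p

# Hypothetical setup: 5 centers in a 2-D state space, 2 actions.
centers = np.array([[0., 0.], [1., 0.], [0., 1.], [1., 1.], [5., 5.]])
Q = np.zeros((5, 2))
Q[:, 0] = [1., 2., 3., 4., 100.]
q, nn, p = knn_q_estimate(centers, Q, np.array([0.5, 0.5]), action=0)
```

The distant fifth center is excluded from the neighbourhood, so its extreme Q-value does not contaminate the estimate; the four equidistant neighbours receive equal weights.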
3.4.2 Multi-Layer Perceptron Neural Network
Multi-layer perceptron (MLP) is a feed-forward neural network with multiple layers. In function
approximation applications, typically there is a hidden layer with H neurons and an output layer
with one neuron, as shown in Figure 3-3. The hidden layer’s neurons have non-linear activation
functions, $\phi(\cdot)$, of the sigmoid form, e.g. $\phi(x) = 1/(1+e^{-x})$.
Considering the MLP structure in Figure 3-3, the relationship between the input and output of
the MLP would be:

$$y = \sum_{j=1}^{H} v_j\, \phi\!\left( \sum_{i=1}^{n} w_{ji}\, x_i + w_{j0} \right) + v_0 \qquad (3.18)$$
Figure 3-3 Multi-layer perceptron structure for function approximation applications. In
this figure $x_1 \dots x_n$ are input variables, $w_{ji}$ are the hidden layer weights, $\phi(\cdot)$ is the sigmoid non-linear function, $v_1 \dots v_H$ are output layer weights, and $y$ is
the output of the neural network.
Training of MLPs is usually done through iterative numerical approaches. These
approaches are based on either the gradient or the Jacobian of the error with respect to weights.
The gradient and the Jacobian can be calculated by a technique called backpropagation. A simple
approach to updating the weights is gradient descent learning:

$$\theta_{t+1} = \theta_t - \eta\, \frac{\partial E_t}{\partial \theta}$$

where $\theta = [w_{10}, \dots, w_{Hn}, v_0, \dots, v_H]$ is a vector containing all the network parameters,
$\eta$ is the step size, and $E_t$ is the squared prediction error at time $t$, which in the RL context can be defined as:

$$E_t = \left[ r_{t+1} + \gamma \max_{a} Q(s_{t+1}, a) - Q(s_t, a_t) \right]^2.$$
In the iterative learning of MLP, it is desirable to avoid presenting successive samples that
are from the same region of state space to avoid saturation of the weights. Additionally, throughout
the learning process, samples should cover different regions of the state space to provide good
generalization. In traffic control problems, changes in traffic state are gradual; therefore, similar
samples in successive control cycles are likely. These facts discourage the use of learning methods
where samples are shown one by one as they are visited. To overcome these issues, all the samples
visited in the same simulation run, i.e. epoch, are kept in a pool of samples and shown to the MLP in
random order after each epoch. Additionally, samples from previous epochs are not discarded, to
ensure previous trainings are not lost with batch learning. The learning is still iterative; however,
in every epoch, the errors are calculated once, based on the last epoch’s estimate of $Q(\cdot)$, and kept
fixed during the training epoch.
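The pooled, shuffled training scheme can be sketched with a tiny numpy MLP. Everything below is a hypothetical illustration (network size, step size, and the toy regression target are all assumptions), not the thesis implementation:

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical MLP: n inputs -> H sigmoid hidden units -> one linear output.
n, H = 3, 8
W = rng.normal(scale=0.5, size=(H, n + 1))   # hidden weights (last column = bias)
v = rng.normal(scale=0.5, size=H + 1)        # output weights (last entry = bias)

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def forward(x):
    h = sigmoid(W @ np.append(x, 1.0))        # hidden activations
    return float(v @ np.append(h, 1.0)), h    # network output y, plus h for backprop

def train_epoch(pool, eta=0.05):
    """One gradient-descent pass over the pooled samples in shuffled order,
    so similar successive samples do not saturate the weights. In the RL
    setting the targets t would be recomputed once per epoch from the
    previous epoch's Q estimate and then held fixed, as in the text."""
    global W, v
    for idx in rng.permutation(len(pool)):
        x, t = pool[idx]
        y, h = forward(x)
        err = y - t                           # prediction error on this sample
        v -= eta * err * np.append(h, 1.0)    # output-layer gradient step
        # Backpropagate through the sigmoid to the hidden-layer weights.
        W -= eta * err * np.outer(v[:H] * h * (1 - h), np.append(x, 1.0))

# Toy pool standing in for the samples visited over several epochs.
pool = [(x, float(x.sum())) for x in rng.normal(size=(200, n))]
for _ in range(300):
    train_epoch(pool)
```

Shuffling with `rng.permutation` is the key step: it breaks up the runs of near-identical successive samples that gradual traffic-state changes would otherwise produce.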
3.4.3 Linear Model Tree
The linear model tree (LMT) is another approach to function approximation, whereby the input
space is partitioned into a decision tree with axis-orthogonal splits at internal nodes, as illustrated
in Figure 3-4. Partitions use local linear functions of the inputs, calculated by least squares
regression. In an LMT the decision tree is not pre-specified and the splits are based on the data.
Therefore, the challenge in training LMT is finding the split points. For training of LMT the work
presented by Potts and Sammut (2005) is adopted in this study. The process of building an LMT
starts with a single partition. Along each dimension, candidate splits are considered. To find the
axis and location where a split should be created, a primary linear model and pairs of linear models
on either side of the candidate splits are calculated. The loss function in partition $P$ depends on
the samples in that partition and can be calculated as the residual sum of squares of its linear model:

$$L_P = \sum_{k=1}^{n_P} \left( y_k - f_P(x_k) \right)^2$$

where $n_P$ is the number of samples in the partition and $f_P(\cdot)$ is the least-squares linear model fitted to them.
Figure 3-4 Illustration of input space partitioning for linear model tree
The loss functions of linear models on each side of every candidate split are also calculated
in a similar manner. Let $L_l$ and $L_u$ be the loss functions on the lower and upper sides of the
split along the $j$th dimension, respectively. Assuming Gaussian noise with unknown variance,
the Chow test for homogeneity amongst sub-samples (Chow, 1960) is used to test the null
hypothesis that the data come from a single linear model. Under this null hypothesis, $H_0$:

$$F = \frac{\left( L - L_l - L_u \right) / (d+1)}{\left( L_l + L_u \right) / \left( n - 2(d+1) \right)},$$

where $d$ is the dimension of the input, is distributed according to Fisher’s distribution with
$d+1$ and $n - 2(d+1)$ degrees of freedom. The associated p-value (probability in the tail of the
distribution) determines the probability that the null hypothesis holds. Let us denote the smallest
such probability as $p^*$, representing the best split over every split along each dimension $j$. To ensure
the split is significant enough, a split is only made when $p^* < \alpha_0$. A small enough value of $\alpha_0$
is suitable for any level of noise.
As training samples increase, the LMT splits the input space into smaller regions, resulting
in a more accurate approximation. However, it is often desirable to limit the growth of the model
tree and accept a certain approximation error by calculating a stopping parameter as follows:

$$\delta = \frac{L_l + L_u}{2\, n\, \sigma_0^2},$$

where $\sigma_0^2$ is the estimated overall variance of the output. As the model tree grows and its accuracy
increases, $\delta$ decreases. Splitting of the input space is terminated if $\delta$ falls below a certain threshold,
which achieves the trade-off between the overall model complexity and the acceptable
approximation error.
Although there are robust approaches suitable for on-line learning of model
trees (Potts & Sammut, 2005), the optimal Q-values are not known in advance because of
the bootstrapping in RL, so on-line learning of the LMT does not suit RL. Therefore, batch
learning is performed after each simulation run (epoch), i.e. all the samples gathered so far are
used to rebuild the LMT to fit the equation:

$$Q_{m+1}(s_t, a_t) = r_{t+1} + \gamma \max_{a} Q_m(s_{t+1}, a) \qquad (3.19)$$

where $Q_m(\cdot)$ is the LMT from epoch $m$, i.e. the previous model fitted to the gathered samples.
3.4.4 Advantage Updating
In theory, general function approximators can take the shape of virtually any continuous function.
However, in practice function approximators do not perfectly fit the data due to the presence of
measurement noise and the complexity of function approximator parameters. In RL algorithms
based on Q-learning, the action decision is made by comparing the Q-values of different actions
within a state. Therefore, it is of great importance that the general function approximator fits the
Q-values properly along the action axis. Often, especially in problems with a discount factor close
to one, the Q-value variations along the states dimensions are more dominant than Q-value
variations along the actions. To illustrate this phenomenon, consider the example shown in
Figure 3-5a. There are 101 states, with s0 being the terminal state. The actions are either to move
right or left, and the goal is to reach the terminal state with the minimum number of movements.
Therefore, the reward can be defined as -1 for each movement, with a discount factor of 1. Solving
the problem and finding the optimal Q-values will result in the values shown in Figure 3-5b. The
numbers under each arrow are the Q-values of taking that action in the preceding state, and the
numbers in the states represent the value of the state, which is equal to the Q-value of the optimal
action in that state. As can be seen from the figure, the variation of Q-values along the states axis is
very significant, whereas the difference in Q-value of the two actions within a state is only two. If
we deduct the value of the current state from the Q-values and define the result as the advantage value,
the resulting advantage value becomes independent of the state in this example. Therefore, the optimal
action (moving right) will have an advantage value of zero and the other action will have a value
of -2.
Figure 3-5 A simple problem showing the variation of Q-values across states and actions. a) The base problem with 101 states, where the goal is to reach the terminal state s0 with the
minimum number of movements; the reward of taking each action is -1 and the discount factor is 1. b) The optimal state values and Q-values of actions.
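The example above can be checked numerically with a few lines of Python (an illustrative sketch, assuming the leftward move is blocked at the boundary state s100):

```python
# Numerical check of the chain example in Figure 3-5: states s0..s100,
# reward -1 per move, discount factor 1, s0 terminal.
N = 100
ACTIONS = ("left", "right")
Q = {(s, a): 0.0 for s in range(1, N + 1) for a in ACTIONS}

def step(s, a):
    return s - 1 if a == "right" else min(s + 1, N)   # "right" moves toward s0

def value(s):
    return 0.0 if s == 0 else max(Q[(s, a)] for a in ACTIONS)

# Value iteration with gamma = 1 until the Q-values settle.
for _ in range(2 * N + 10):
    for s in range(1, N + 1):
        for a in ACTIONS:
            Q[(s, a)] = -1.0 + value(step(s, a))

# Q-values vary strongly across states, but the advantage A = Q - V is
# state-independent: 0 for "right" and -2 for "left" (away from the boundary).
advantage = {k: Q[k] - value(k[0]) for k in Q}
```

The computed Q-values span two orders of magnitude across states while differing by only two within each state, which is precisely the imbalance the advantage decomposition removes.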
Given the dominant variations along the state dimension, the function approximation might
sacrifice the variations along action to fit the greater variations along the states. To mitigate this
issue, the Q-value can be separated into the state value and the advantage value of each action:

$$Q(s, a) = V(s) + A(s, a) \qquad (3.20)$$

where $A(s,a)$ is the advantage of taking action $a$ in state $s$. Note that the optimal action will have
an advantage value of zero and other actions will have negative advantage values. The Q-function,
when converged according to the Bellman equation, satisfies the condition:

$$Q(s_t, a_t) = E\left[ r_{t+1} + \gamma \max_{a} Q(s_{t+1}, a) \right]. \qquad (3.21)$$

Substituting $Q(s,a)$ in (3.21) by (3.20) will result in:

$$V(s_t) + A(s_t, a_t) = E\left[ r_{t+1} + \gamma\, V(s_{t+1}) \right]. \qquad (3.22)$$

The two unknown functions, $V(\cdot)$ and $A(\cdot)$, can be updated one by one, keeping the other one
fixed:

$$V(s_t) \leftarrow (1-\alpha)\, V(s_t) + \alpha \left[ r_{t+1} + \gamma\, V(s_{t+1}) - A(s_t, a_t) \right] \qquad (3.23)$$

$$A(s_t, a_t) \leftarrow (1-\alpha)\, A(s_t, a_t) + \alpha \left[ r_{t+1} + \gamma\, V(s_{t+1}) - V(s_t) \right] \qquad (3.24)$$
The above equations are the Q-learning equivalent for advantage updating, because the
values are updated independently of the policy. It should be noted that the approximation of the
functions is not perfect, and errors in the approximated functions are likely. Because of the
intertwined effect of the two functions, positive feedback might occur and the approximation
error be exacerbated with every iteration, leading to divergence. This intertwined relation can be
broken by removing the term $A(s_t, a_t)$ from (3.23), which results in:

$$V(s_t) \leftarrow (1-\alpha)\, V(s_t) + \alpha \left[ r_{t+1} + \gamma\, V(s_{t+1}) \right] \qquad (3.25)$$

By removing the advantage term, the new equation for $V(\cdot)$ becomes dependent on the agent’s policy,
because the agent’s action choices affect the value of states. However, as the agent starts to exploit
its knowledge and chooses optimal actions, the function $V(\cdot)$ converges toward its optimal value.
This behaviour resembles the SARSA algorithm discussed in Section 3.3.2.
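One alternating step of the stable variant, pairing the advantage update (3.24) with the policy-dependent value update (3.25), can be sketched as follows (an illustrative sketch; names and parameter values are hypothetical):

```python
from collections import defaultdict

V = defaultdict(float)          # state values V(s)
A = defaultdict(float)          # advantages A(s, a); the optimal action tends to 0

def advantage_update(s, a, r, s_next, alpha=0.1, gamma=0.99):
    """One advantage-updating step: A follows eq. (3.24), while V follows
    eq. (3.25), i.e. without the advantage term, to avoid the positive
    feedback between the two approximations."""
    target = r + gamma * V[s_next]
    A[(s, a)] = (1 - alpha) * A[(s, a)] + alpha * (target - V[s])
    V[s] = (1 - alpha) * V[s] + alpha * target

def q_value(s, a):
    return V[s] + A[(s, a)]      # the decomposition of eq. (3.20)
```

Because the action decision only compares `A[(s, a)]` within a state, the function approximator no longer has to resolve small action differences on top of large state-value variations.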
3.5 Multi-Agent Reinforcement Learning
The algorithms discussed in the above sections are formalized for a single agent interacting with
an environment. In theory, they can be applied to a problem with multiple on-ramps if we assign
a single central agent to control all on-ramps simultaneously. In such a case, the environment which
the agent deals with would be the whole traffic network, and the choice of actions would be the
combination of actions of all on-ramps as illustrated in Figure 3-6a. The downside of this approach
is the “curse of dimensionality” as the number of ramps grows. The size of the state-action space grows
exponentially with the number of on-ramps. In the learning process, the increased number of states
and actions requires significantly more learning time. Additionally, in terms of the optimal action,
the search space would be much larger and finding the optimal action based on current state might
not be possible in real time. Although sound in theory, it is not practical to solve large problems
with multiple on-ramps with a single RL control agent.
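The exponential growth is easy to quantify; the figure of eight discrete metering rates per ramp below is a hypothetical example, not a value from the thesis.

```python
# Joint action space of a single central agent: with m metering rates per
# ramp and n ramps there are m**n joint actions to search at every decision
# point. Eight rates per ramp is an illustrative assumption.
rates_per_ramp = 8
for n_ramps in (1, 2, 5, 10):
    joint_actions = rates_per_ramp ** n_ramps
    print(f"{n_ramps:2d} ramps -> {joint_actions:,} joint actions")
```

At ten ramps, the central agent already faces over a billion joint actions per decision, which is why a centralized search cannot run in real time.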
3.5.1 Independent Learners
An alternative for applying RL to RM of larger traffic networks is to employ a decentralized
structure where the network is broken into smaller sections, e.g. sections that contain only a single
on-ramp, and assign an RL agent to each section as illustrated in Figure 3-6b. Each agent will
observe the state of the traffic in its local section and optimize the action to maximize the reward.
In this configuration, agents act independently of each other and resemble a local ramp metering
structure. Although the action of each agent is optimal for traffic conditions in its section, the
collective actions of all agents are not necessarily optimal for the whole network. Additionally,
lack of coordination limits the opportunities for utilizing the storage space of adjacent on-ramps
and providing equity.
3.5.2 Cooperative Reinforcement Learning
Considering that a decentralized structure is the only practical solution for applying RL to larger
networks, decentralized agents should be coordinated to provide global optimality. An RL problem
where multiple learning agents interact with shared environments is referred to as Multi-Agent
Reinforcement Learning (MARL) (Busoniu et al., 2008). MARL algorithms are usually tailored
to specific types of problems. The nature of the problem, competitive or cooperative, and the
possibility of observing actions of other agents as well as the presence of communication among
agents can affect the MARL algorithm. Panait and Luke (2005) have summarized the algorithms
which address learning of multiple agents cooperating to maximize a single reward. In terms of
traffic control problems, there are certain characteristics that help with developing a MARL
algorithm as follows:
1. In traffic control problems it is reasonable to assume that agents can freely communicate
with each other and share their state, action, and reward. Additionally, they can use the
communication to coordinate their actions to achieve higher overall reward.
2. The agents are fixed in space, and the geometry of the network is known. Therefore, agents
can be coordinated more efficiently depending on their immediate neighbours.
3. The main goal is to minimize the total time spent in the network and maximize the total
travelled distance. These goals can easily be broken into time spent and travelled distance
in different sections.
An approach that effectively employs the above characteristics is coordination graphs
(Guestrin et al., 2002; Kok & Vlassis, 2006). Coordination graphs decompose the global
Q-function into a sum of local Q-functions that each depend on the actions of only a subset
of agents. In order for agents to coordinate their actions, they need to quantify the effect of state
and action of other agents on their Q-value. Effectively, each agent has to consider an augmented state
and action that include its local state and action as well as states and actions of other agents that
influence its reward. The geometry of the network can be utilized to identify a set of neighbours
for each agent. To achieve global optimality, it is sufficient for each agent to consider the states of
its neighbours (Nair, Varakantham, Tambe, & Yokoo, 2005). The schematic of such configuration
is illustrated in Figure 3-6c.
A successful large-scale implementation of MARL in the transportation context is the work of
S. El-Tantawy et al. (2013) on traffic signal control. In spite of the promising outcome, their
approach was based on the conventional RL algorithms with discrete states that imposed certain
trade-offs and limitations, such as discretization choice, curse of dimensionality with added states,
significant memory requirement, and no generalization over observed samples. In this research,
continuous states and actions are directly represented using function approximation. Direct
representation of continuous variables simplifies the design process significantly and enables the
design of more complex control systems.
3.5.2.1 Learning in Cooperative Multi-agent RL
Let us denote the local state of agent $i$ as $s_i$ and its action as $a_i$. Let $N_i$ be the set of neighbours
of agent $i$, i.e. the agents that affect the reward of agent $i$. The augmented state for agent $i$ would be
$[s_i, s_{N_i}]$, where $s_{N_i}$ is the collective states of all neighbours of agent $i$, $j \in N_i$. Similarly, the
augmented action would be $[a_i, a_{N_i}]$, where $a_{N_i}$ is defined as the collective actions of the neighbours
of agent $i$, $j \in N_i$. The Q-learning based updating rule would be:
$$Q_i(s_i^t, s_{N_i}^t, a_i^t, a_{N_i}^t) \leftarrow (1 - \alpha)\, Q_i(s_i^t, s_{N_i}^t, a_i^t, a_{N_i}^t) + \alpha \left[ r_i^t + \gamma\, Q_i(s_i^{t+1}, s_{N_i}^{t+1}, a_i^*, a_{N_i}^*) \right] \tag{3.26}$$
where $r_i$ is the local reward for agent $i$, $Q_i(\cdot)$ is the Q-function associated with agent $i$, and the
pair $(a_i^*, a_{N_i}^*)$ are the optimal actions in state $s^{t+1}$. Note that the optimal action of each agent is not
merely the action that maximizes $Q_i(s_i, s_{N_i}, a_i, a_{N_i})$. The action of each agent is an element in the
optimal joint actions from all agents that maximize the sum of all local Q-functions. A decentralized
way to find these optimal joint actions is presented in (Kok & Vlassis, 2006).
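A minimal sketch of the neighbour-augmented update for a single agent follows, using a lookup table keyed by the augmented tuple; the thesis itself represents these quantities continuously with function approximation, and the one-state toy transition below is illustrative.

```python
# Sketch of coordinated Q-learning for one agent i: the Q-table key carries
# the agent's own state/action plus its neighbours' states/actions, so the
# learned values reflect neighbour behaviour. All numbers are toy values.
from collections import defaultdict

def coordinated_q_update(Q_i, s, s_n, a, a_n, r_i, s2, s2_n, a_star, a_n_star,
                         alpha=0.1, gamma=0.9):
    # (a_star, a_n_star): agent i's part of the optimal joint action in the
    # next augmented state, found by a decentralized joint-action search
    key = (s, s_n, a, a_n)
    target = r_i + gamma * Q_i[(s2, s2_n, a_star, a_n_star)]
    Q_i[key] = (1 - alpha) * Q_i[key] + alpha * target

Q = defaultdict(float)
# Repeating a single transition with local reward 1 drives the augmented
# Q-value toward r / (1 - gamma) = 10.
for _ in range(2000):
    coordinated_q_update(Q, 0, 0, 0, 0, 1.0, 0, 0, 0, 0)
```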
Figure 3-6 Different approaches for applying RL to ramp metering for a sample traffic network: a) Centralized structure with a single RL agent for the whole network, b) Isolated RL-based RM agents, c) Coordinated MARL-based RM agents.
Each step in Q-learning involves solving an optimization problem to find the optimal joint
actions that maximize the global reward, which can be computationally demanding. This step can be
avoided by employing the SARSA learning algorithm. The SARSA update rule for the coordinated
learning is:
$$Q_i(s_i^t, s_{N_i}^t, a_i^t, a_{N_i}^t) \leftarrow (1 - \alpha)\, Q_i(s_i^t, s_{N_i}^t, a_i^t, a_{N_i}^t) + \alpha \left[ r_i^t + \gamma\, Q_i(s_i^{t+1}, s_{N_i}^{t+1}, a_i^{t+1}, a_{N_i}^{t+1}) \right] \tag{3.27}$$
The update rule is identical to Q-learning except that the value for the next state is based on the
actual action taken rather than the optimal joint action.
The learning steps for the advantage updating algorithms (3.24) and (3.25) can be extended
to coordinated MARL as:
$$V_i(s_i^t, s_{N_i}^t) \leftarrow (1 - \alpha)\, V_i(s_i^t, s_{N_i}^t) + \alpha \left[ r_i^t + \gamma\, V_i(s_i^{t+1}, s_{N_i}^{t+1}) \right] \tag{3.28}$$

$$A_i(s_i^t, s_{N_i}^t, a_i^t, a_{N_i}^t) \leftarrow (1 - \alpha)\, A_i(s_i^t, s_{N_i}^t, a_i^t, a_{N_i}^t) + \alpha \left[ r_i^t + \gamma\, V_i(s_i^{t+1}, s_{N_i}^{t+1}) - V_i(s_i^t, s_{N_i}^t) \right] \tag{3.29}$$
As described in section 3.4.4, the advantage value should be removed from the calculation of the
value function to break the cyclic relation between the two functions and prevent divergence.
Therefore, the learning step in advantage updating does not require finding the globally optimal
joint actions.
3.5.2.2 Finding Optimal Action
Unlike independent learners that choose their action merely based on their own Q-function,
cooperative agents should take into account the effect of their actions on other agents as well. The
centralized approach for finding the optimal joint actions would involve combining all local Q-
functions to form a single Q-function of the complete state of the system and joint actions of all
agents. Then an optimization over the entire joint actions is required to maximize the sum of all
local Q-functions. Although simple in theory, this approach suffers from the curse of
dimensionality and becomes very demanding as the number of agents increases.
An alternative to a centralized search for optimal action is the locally optimal policy
generation approach presented in (Nair et al., 2005), with a similar approach being employed in
(S. El-Tantawy et al., 2013). In this approach, the joint actions are changed iteratively from an
initial choice. In each iteration, only one agent, the one that would benefit the system the most by
changing its action, gets the chance to change its action. The process is repeated until no agent can
benefit the system by changing its action. Although the resulting joint policy is proven only locally
optimal, in many cases it may actually result in globally optimal joint actions. The steps for finding
the locally optimal joint actions are as follows:
1. Each agent chooses an initial action and communicates it to its neighbours.
2. Each agent $i$, assuming the actions of its neighbours are unchanged, finds the action
which maximizes the sum of its local Q-function as well as its neighbours':
$a_i^* = \arg\max_{a_i} \sum_{j \in N_i \cup \{i\}} Q_j$.
3. Each agent calculates the gain that the system will achieve if it changes its action.
4. Only the agent that has the highest gain will change its action and the rest will be
unchanged, and the process is repeated from step 2. The process stops when no
agent can benefit the network by changing its action.
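The four steps above can be sketched as a max-gain coordinate search over the joint action. The local Q-functions and action sets below are illustrative stand-ins, and for brevity each Q here is evaluated on the full joint action rather than only its neighbours' actions.

```python
# Sketch of the locally optimal joint-action search (steps 1-4): in each
# round, the single agent whose change most improves the global value (the
# sum of all local Q-functions) updates its action. All Q-functions and
# action sets are toy examples, not quantities from the thesis.

def global_value(q_funcs, actions):
    # sum of local Q-functions evaluated at the current joint action
    return sum(q(actions) for q in q_funcs)

def locally_optimal_actions(q_funcs, action_sets, actions):
    actions = list(actions)
    while True:
        best_gain, best_agent, best_action = 0.0, None, None
        base = global_value(q_funcs, actions)
        for i, candidates in enumerate(action_sets):
            for a in candidates:           # step 2: try agent i's alternatives
                trial = actions[:i] + [a] + actions[i + 1:]
                gain = global_value(q_funcs, trial) - base
                if gain > best_gain:       # step 3: record the best gain
                    best_gain, best_agent, best_action = gain, i, a
        if best_agent is None:             # step 4: no agent can improve
            return actions
        actions[best_agent] = best_action  # only the max-gain agent moves

# Toy example: agent 0 is rewarded for matching agent 1, and also pulled
# toward action 1 by the second local Q-function.
q_funcs = [lambda a: -abs(a[0] - a[1]), lambda a: -(a[0] - 1) ** 2]
result = locally_optimal_actions(q_funcs, [[0, 1, 2], [0, 1, 2]], [0, 2])
```

On this toy problem the search settles on `[1, 1]`, which happens to be globally optimal; in general only local optimality is guaranteed.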
3.6 Summary
In this chapter, the algorithms and methods used in this research were presented. The final
outcome of the research is an optimal control system for metering multiple on-ramps using RL:
an RL-based control system that deals with continuous variables using function approximation
and enables scalability through coordination of distributed RM agents. While applying conventional
single-agent RL to freeway ramp metering is challenging in itself, function approximation and
coordination introduce two more dimensions to the complexity of the design process. To simplify
the design process, the design was performed in three stages, and the three aspects of the design were
isolated from each other to the extent possible. The three stages are performed in the same order
as the methodology is presented. The three stages involve the algorithms described in
sections 3.3, 3.4, and 3.5, respectively.
4 Development of Microscopic Simulation Testbeds
Controllers based on RL find optimal actions through trial and error via direct interaction
with the actual environment. However, it is not practical for a controller to learn through trial and
error in real freeway networks. In practice, a simulation environment is employed for training the
RL agent prior to implementation in the field. The simulated environment should closely replicate
the dynamics of the real environment to provide proper feedback to the RL agent for the learning
process. The most realistic models for simulating transportation networks are microscopic
simulators, which, with the recent advances in information technology and ITS applications, have
been established as the prime tool for assessing congestion mitigation alternatives and ITS
measures. Microsimulation is the dynamic and stochastic modelling of individual vehicle
movements within a transportation network. Each vehicle in the simulation model is
moved through the transportation network on a split-second basis according to the physical
characteristics of the vehicle (length, maximum acceleration rate, etc.), the fundamental rules of
motion (e.g. acceleration times time equals velocity, velocity times time equals distance), and rules
of driver behaviour (car following rules, lane changing rules, route choice rules, etc.), while
abiding by traffic management rules such as traffic lights, lane usage restrictions, etc. In this
research, the models were developed using Paramics©, which is a suite of high-performance
software for the microscopic simulation of realistic traffic networks.
The two Paramics models used in this research were extracted from the Greater Toronto
Area (GTA) freeway network model (Abdelgawad et al., 2011) developed at the University of
Toronto in 2009. The first model is a section of Highway 401 eastbound collector that includes the
intersection with Keele Street. This model is used for designing and evaluating algorithms that
involve only a single agent. The second model is the Gardiner Expressway westbound direction,
which is used for evaluating different coordination approaches.
It should be noted that the use of a simulation model to train RL agents is not to be confused
with a model of the controlled environment as in dynamic programming and value iteration
methods for instance (refer to section 3.1). The latter requires complete knowledge of system
dynamics including state transition probabilities and rewards associated with actions taken in each
state, for all state-action combinations, prior to solving the control problem. RL methods learn
from direct interactions with the controlled system and sample through the state-action space
repeatedly, similar to learning to play chess for instance by playing the game repeatedly. The use
of a simulation model in RL training merely provides a replica of the real traffic environment for
the RL agent to interact with in a safe and controlled manner until the agent learns the optimal
control policy. After that, the RL agent can be deployed into the real traffic environment with a
mature control policy, but can also continue to refine the learnt optimal control policy perpetually.
4.1 Developing the Microsimulation Models
Paramics can accurately reproduce detailed traffic information that matches the real network, given
that the parameters and geometry of the network are properly modelled. For the development of
the GTA freeway network model, a properly scaled digital representation of the study area was
loaded as an overlay into Paramics and used as a guideline for manually coding the network in
sufficient detail. The information about the geometry was generally gleaned from digital aerial
photographs. Throughout the development of the network information about the number of lanes,
roadway geometry, speed limits, detection devices, and control measures was gathered.
Although the original GTA freeway network model was rigorously calibrated to match
observed counts and average speeds, the traffic flow dynamics were not accurate enough for the RM
application, especially in terms of vehicle merging dynamics around on ramps and related capacity
degradation reflected in fundamental flow diagrams, amongst other details as will be discussed
next. Therefore, the steps described in the following sections were carried out to achieve the
calibration quality needed for training and evaluation of RM.
4.1.1 Data Preparation for Real Measurements and Paramics
Traffic measurements, such as vehicle count, average speeds, and queues are needed to calibrate
the driver behaviour model accurately as well as the origin-destination matrices. The main source
of information about traffic patterns in this research was the loop detector data available through
the ONE-ITS servers (one-its.net) at the University of Toronto. The original loop detector data
consists of samples for every 20-second interval. Each sample contains, for every lane in the past 20
seconds, the number of cars that passed, their average speed, and the percentage of time that the
loop detector was occupied. The data from different lanes are combined into one average value.
Because the 20-second values fluctuate significantly, a five-minute rolling average
was used instead. In addition, averaging helped with replacing the missing data points with the
average of previous time samples.
The provided data included many faulty samples, which required careful removal. Different
types of faulty data points were observed, for instance:
- Missing data points for the whole loop detector or a single lane,
- Faulty sensor: occupancy is 100%, speed is 100 kph, and count is zero,
- Faulty data: high speed, low flow, and high occupancy,
- Outlier data: low speed, low flow, and low occupancy.
After the data have been read and stored in a vector with increasing sample time, missing
data points are flagged as zero and available samples flagged as one. Then the following metric
corresponding to the average length of the cars in the past 20 seconds is calculated:

$$\mathit{Length} = \frac{\dfrac{\mathit{Occupancy}}{100} \times \mathit{Speed} \times \dfrac{1000}{3600} \times 20}{\mathit{Count} + 0.01} \tag{4.1}$$
where the 0.01 added to Count avoids division by zero when the sensor is faulty. If
Length is greater than 50 m or less than 2 m, that data point is flagged as zero and therefore
treated as a missing data point. For averaging purposes, every sample is replaced with the average
of available samples in the past five minutes. For example, if two samples in the past five minutes
are flagged as missing, the average will be computed over the 13 available samples.
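The cleaning steps above can be sketched end to end. The 2-50 m thresholds and 15-sample (five-minute) window follow the text, while the sample layout and field ordering (count, speed, occupancy) are assumptions made for illustration.

```python
# Sketch of the detector-cleaning pipeline: compute the average-length
# metric (4.1) for each 20 s sample, flag implausible samples as missing,
# then replace every sample with the mean of the valid samples in the
# trailing five minutes (15 samples). Field layout is illustrative.

def avg_length_m(count, speed_kph, occupancy_pct):
    # occupancy fraction * 20 s interval = seconds occupied; times speed in
    # m/s gives metres of vehicle over the detector; divide by the count.
    # The 0.01 guards against division by zero on faulty samples.
    return (occupancy_pct / 100.0) * (speed_kph * 1000.0 / 3600.0) * 20.0 / (count + 0.01)

def clean(samples):
    """samples: list of (count, speed_kph, occupancy_pct) tuples or None."""
    valid = []
    for s in samples:
        if s is None:
            valid.append(False)
            continue
        length = avg_length_m(*s)
        valid.append(2.0 <= length <= 50.0)   # plausible vehicle lengths only
    smoothed = []
    for i in range(len(samples)):
        window = [samples[j] for j in range(max(0, i - 14), i + 1) if valid[j]]
        if window:
            smoothed.append(tuple(sum(x[k] for x in window) / len(window)
                                  for k in range(3)))
        else:
            smoothed.append(None)             # no valid samples to average
    return smoothed
```

A faulty sample such as `(0, 100, 100)` yields an implausible average length and is replaced by the mean of its valid neighbours, exactly as the 13-of-15 example in the text describes.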
The state of traffic can be measured in Paramics through various tools. For the purpose of
calibrating and tuning the model, only measurements from loop detectors are considered so that
they are comparable with real loop detector data. Although Paramics has built-in functionality for
loop detectors, the output of Paramics loop detectors is information about individual cars passing
them. In practice, the output of real loop detectors is reported as an average over certain intervals,
a 20-second interval in Ontario; therefore, output of Paramics loop detectors is processed to match
real-life samples. Details about the processing of the Paramics loop detectors’ output to obtain
interval averages are presented in Appendix A.
The state of traffic is usually represented by speed (km/hr), density (veh/km/lane), and flow
(veh/hr/lane). Whereas loop detectors provide information about speed and flow, traffic density
measurement is not readily available from loop detectors. Traffic density can be estimated from
speed and flow with the equation $k = q/v$, where $k$ is traffic density, $q$ is traffic flow, and $v$ is the
average traffic speed. Another way is to utilize the occupancy reported by the loop detector for
estimation of density. The occupancy, o, based on speed of individual cars passing the detector
can be written as:
$$o = \frac{100}{T} \sum_{i=1}^{n} \frac{l_i}{v_i} \tag{4.2}$$
where $T$ is the 20-sec interval, $n$ is the number of vehicles that passed, $l_i$ is the length of vehicle $i$,
and $v_i$ is the speed of vehicle $i$. If we replace the vehicle length with the average length, $\bar{l}$,
equation (4.2) can be simplified to:
$$o \cong \frac{100\,\bar{l}}{T} \sum_{i=1}^{n} \frac{1}{v_i} = \frac{100\,\bar{l}\,n}{T} \cdot \frac{1}{n} \sum_{i=1}^{n} \frac{1}{v_i} \tag{4.3}$$
Similarly, density based on speed and flow can be written as:
$$k = \frac{q}{v} = \frac{3600\,n/T}{\frac{1}{n} \sum_{i=1}^{n} v_i} \tag{4.4}$$
The term $\frac{1}{n}\sum_{i=1}^{n} v_i$ is basically the average speed and can be substituted with $\bar{v}$; therefore,
equation (4.4) can be rewritten to achieve a similar form to equation (4.3):

$$k = \frac{3600\,n}{T} \cdot \frac{1}{\frac{1}{n} \sum_{i=1}^{n} v_i} \tag{4.5}$$
Considering (4.5) and (4.3), in free-flow conditions where vehicles all pass the loop detector at
similar speeds, occupancy can be reliably used to estimate density. However, in congested
conditions with stop and go behaviour, the two equations will diverge, and using occupancy for
estimating density will result in overestimation. Figure 4-1 shows the density estimated directly
from speed and flow versus occupancy measured by loop detector.
Considering the above observations, in this research the densities were calculated by
dividing flow by speed. This approach guarantees that the fundamental relation between speed,
flow, and density is maintained.
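The divergence between the two estimates comes down to the mean of reciprocal speeds (the occupancy-based term in (4.3)) versus the reciprocal of the mean speed (the flow/speed term in (4.5)). A small numerical check with illustrative speed samples:

```python
# Occupancy-based density scales with the mean of 1/v (4.3); flow/speed
# density scales with 1/mean(v) (4.5). The two coincide for uniform speeds,
# but the former is larger under stop-and-go variation (Jensen's
# inequality), so occupancy overestimates density in congestion.

def mean_inverse_speed(speeds):
    # drives the occupancy-based estimate
    return sum(1.0 / v for v in speeds) / len(speeds)

def inverse_mean_speed(speeds):
    # drives the flow/speed estimate
    return 1.0 / (sum(speeds) / len(speeds))

uniform = [60.0] * 10        # free flow: everyone near 60 km/h
stop_go = [5.0, 115.0] * 5   # congestion: same arithmetic mean, wide spread
```

For the stop-and-go samples, the occupancy-driven term is roughly six times the flow/speed term, matching the overestimation visible in Figure 4-1.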
4.1.2 Driver Behaviour Parameter Calibration
Besides inspecting and fine-tuning the physical aspects of the model, we should also tune numerous
user-definable driver behaviour parameters. These parameters define how drivers react in various
traffic conditions and sections of the network. Some parameters are not significant enough to
require numerical tuning using optimization and can be intuitively chosen. The following
parameters were found to require intuitive/subjective modification from their default value based
on observing the simulation model while running.
Figure 4-1 Relationship between occupancy and density.
Time step – time step represents the number of discrete times per real time second that a
decision is made during simulation. A higher time step value simply allows vehicles to
make decisions based on the car following and lane change logic at a higher frequency.
This is specifically important at on-ramp merging points where lane changing happens
very often. The default time step value is five (five steps in each second), but it has been
found that achieving proper merging behaviour requires a time step value of 10.
Ramp headway factor – the target headway for all vehicles on a ramp can be modulated
with this factor. Lower than average target headway for ramp vehicles allows them to
merge with mainline traffic more aggressively. The default value is one, i.e. no change to
target headway. However, a value of 0.5 is employed so that ramp vehicles force their
way into mainline as occurs in real life on busy urban freeways.
Minimum ramp time – this parameter specifies the time, in seconds, which vehicles spend
on the ramp before considering merging with the mainline traffic. Although the default is
2 sec, it has been changed to 1 sec considering the short ramp merge areas.
Signpost – this parameter defines the distance at which vehicles are notified of hazards
(divergence, lane drop, narrowing, etc.). Hazards usually require the affected vehicles to
change lane. Short signpost distance does not give vehicles enough time to change lane
properly, and over-long signpost distance will cause early lane changes and unnecessary
congestion. The default value for freeway signposts is 750 m, but it should be modified
according to the network geometry through observing traffic behaviour.
Ramp-aware distance – ramp-aware distance is defined as the distance at which vehicles
in the main line traffic become aware of vehicles on the ramp. If a vehicle is in the right-
most lane on the mainline, it will attempt to change lanes in order to create a gap for the
merging vehicle. In case of low ramp-aware distance there will not be enough space on the
right lane of the freeway mainline for ramp vehicles to merge; therefore, the on-ramp flow
will be limited. On the other hand, high ramp-aware distance can cause the mainline to
breakdown even at very low on-ramp demand, because of mainline vehicles changing lanes
when there is only a single vehicle on the on-ramp. The default value is 200m, but it was
found that suitable ramp-aware distance for GTA freeways is from 100m to 150m.
Besides the aforementioned parameters, some parameters can directly affect the core specifications
of network such as capacity and susceptibility to flow breakdown; therefore, they require careful
fine-tuning, possibly using optimization, to ensure simulated traffic flow behaviour matches the
measurements. The parameters that affect the traffic flow significantly and require fine-tuning are
summarized below.
Mean target headway – the average headway, which vehicles try to maintain. The headway
directly affects freeway capacity and lower headway results in higher capacity. The default
headway value is 1.0 sec.
Mean driver reaction time – the mean reaction time of each driver, in seconds. The value
is associated with the lag in time between a change in speed of the preceding vehicle and
the following vehicle's reaction to the change. Smaller reaction times will reduce the
probability of breakdown because of faster response from drivers. Because of lower
susceptibility to flow breakdown, highway capacity increases. The default value is 1.0 sec,
but in practice, it is found to be lower.
Aggression – aggression is the distribution of target headway of various vehicles around
the average target headway. Aggression can vary on a scale from one to nine, with a score
of four being neutral, and higher aggression value will cause a vehicle to accept a smaller
headway. The default aggression is a normal distribution and is hidden from the Paramics
user. However, it is possible to modify the distribution of aggression. Increasing the
variance of aggression will increase the number of vehicles with lower aggression and as a
result higher headways; therefore, chance of breakdown will increase and freeway capacity
will decrease.
Awareness – similar to aggression, awareness has a distribution and affects the target
headway of the vehicles, except it is only active near lane drops. Awareness can vary on a
scale from one to nine with a score of four being neutral; high awareness values will result
in longer headway when vehicles approach a lane drop in order to allow vehicles in other
lanes to merge more easily. Therefore, high awareness reduces the susceptibility to
breakdown because of lane change, resulting in a smooth traffic flow and increased
capacity.
Link headway factor – this parameter allows the user to modify the mean target headway
locally for a link. It can be used to modify vehicular behaviour in specific sections as the
user may find warranted, such as around weaving sections for instance. The default value
is one.
Link reaction time factor – this is similar to link headway factor except for reaction time.
Among the different approaches for calibrating microsimulation model parameters, the approach
presented in M. Zhang, Ma, and Dong (2008) is well suited to freeway network models. In this
research, the calibration process is developed according to the guidelines provided in the
aforementioned report.
The main goal of the calibration process is to ensure that the Paramics model replicates the
fundamental traffic diagram of a real freeway network. To compare the samples from Paramics
with the ones from real loop detectors, a fundamental diagram based on the Van Aerde model (Van
Aerde, 1995) is fitted to both sets of samples. The Van Aerde model is a single-regime fundamental
diagram that can properly represent both congested and uncongested sides. The speed-density
relation in the model is defined as:
$$k = \frac{1}{c_1 + \dfrac{c_2}{v_f - v} + c_3 v} \tag{4.6}$$
where $k$ is density, $v$ is speed, $v_f$ is the free-flow speed parameter, and $c_1$, $c_2$, $c_3$ are model
parameters which can be calculated based on $k_j$ (jam density), $v_c$ (critical speed), $q_c$ (capacity flow),
and $v_f$ (free-flow speed) from the following equations:
$$c_1 = \frac{v_f}{k_j v_c^2}\,(2 v_c - v_f), \qquad c_2 = \frac{v_f}{k_j v_c^2}\,(v_f - v_c)^2, \qquad c_3 = \frac{1}{q_c} - \frac{v_f}{k_j v_c^2} \tag{4.7}$$
Since measurements of both speed and density are contaminated with noise, the regular least
squares method is not suitable for fitting a Van Aerde model. A more robust approach is total least
squares, described in Appendix B, which accounts for errors in both independent and dependent
measurements and minimizes the error function:
$$\varepsilon = \sum_{i} \left[ \left(k_i - \hat{k}_i\right)^2 + \left(v_i - \hat{v}_i\right)^2 \right] \tag{4.8}$$
where $(\hat{k}_i, \hat{v}_i)$ is the point on the Van Aerde curve that is closest to the measured speed-density
pair $(k_i, v_i)$. Figure 4-2 shows an example of the Van Aerde model fitted to samples obtained
from a loop detector on the Gardiner Expressway in Toronto.
Figure 4-2 A Van Aerde model is fitted to samples from a loop detector. The left figure is the flow-density relationship and the right figure is the speed-density relationship.
To find the best driver behaviour parameters, the simultaneous perturbation stochastic
approximation (SPSA) (J. C. Spall, 1998) is employed. SPSA, described in detail in Appendix C,
is a numerical optimization algorithm suitable for problems with numerous parameters. It can
estimate an unbiased gradient of the objective with only two evaluations of the objective function.
The parameters are then moved along the gradient, in the direction that minimizes the objective value.
The objective function to be minimized is defined as:
$$J = \sum_{d \,\in\, \text{select detectors}} \left[ w_{v_f}\left(v_{f,d}^{\,r} - v_{f,d}^{\,p}\right)^2 + w_{k_j}\left(k_{j,d}^{\,r} - k_{j,d}^{\,p}\right)^2 + w_{q_c}\left(q_{c,d}^{\,r} - q_{c,d}^{\,p}\right)^2 + w_{v_c}\left(v_{c,d}^{\,r} - v_{c,d}^{\,p}\right)^2 \right] \tag{4.9}$$
where superscript $r$ represents the real-life values, superscript $p$ represents the values obtained
from Paramics, and subscript $d$ represents the loop detectors for which the fundamental diagram
is calibrated. The weight parameters $w_{v_f}$, $w_{k_j}$, $w_{q_c}$, $w_{v_c}$ are used to bring the different errors into the
same scale.
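SPSA's core step can be sketched independently of the simulator: both objective evaluations use the same random ±1 perturbation of all parameters at once. The quadratic objective, target values, and gain constants below are illustrative stand-ins for the simulation-based objective (4.9).

```python
# One SPSA iteration: perturb all parameters simultaneously with a random
# +/-1 (Rademacher) vector, evaluate the objective twice, and step against
# the resulting gradient estimate. Gains a and c are illustrative constants;
# Spall recommends decaying gain sequences in practice.
import random

def spsa_step(theta, objective, a=0.1, c=0.01):
    delta = [random.choice((-1.0, 1.0)) for _ in theta]
    plus = [t + c * d for t, d in zip(theta, delta)]
    minus = [t - c * d for t, d in zip(theta, delta)]
    diff = objective(plus) - objective(minus)
    # gradient estimate for component k is diff / (2 c delta_k)
    return [t - a * diff / (2.0 * c * d) for t, d in zip(theta, delta)]

random.seed(0)  # reproducible toy run
target = [1.0, -2.0, 0.5]
objective = lambda th: sum((x - g) ** 2 for x, g in zip(th, target))

theta = [0.0, 0.0, 0.0]
for _ in range(500):
    theta = spsa_step(theta, objective)
```

Regardless of how many parameters are tuned, each iteration costs exactly two objective evaluations, which is what makes SPSA attractive when every evaluation is a full microsimulation run.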
The Origin-Destination (OD) matrices needed for calibrating the driver behaviour do not
need to be very accurate because only the fundamental diagram from the simulation result will be
employed. The only condition necessary for an OD matrix is that the demand be large enough for
the freeway to become congested, so that there are enough samples on both the free flow and
congested sides of the fundamental diagram. The preliminary OD estimation, which is described
in the next section, was increased by 10% to generate enough demand to exceed the capacity. This
OD demand was used to calibrate the behaviour parameters.
4.1.3 OD Estimation and Calibration
In order to employ any traffic simulation model, an OD matrix that describes the trip patterns is
required. To compare different scenarios effectively, an extended microscopic simulation period,
which includes free flow conditions before and after peak periods, is essential. Therefore, the OD
matrix should vary with time to account for the changes in demand during the extended simulation
period, and typical OD matrices obtained from regional demand models (e.g. EMME model) are
not suitable. The extended simulation requires calibrating several OD matrices, each representing
an interval in the simulation period.
Since the networks used in this research consist of a single freeway without parallel
arterials, there are no alternative routes from different origins to destinations. Therefore, OD
estimation is significantly simplified. The initial stage is to calculate an OD matrix for each interval
based on the loop detector counts in that interval. Let $d^t = [d_1^t \; d_2^t \; \ldots \; d_n^t]^T$ be the demand
vector at time interval $t$, where $d_i^t$ is the hourly flow of OD pair $i$ at time interval $t$, and let
$c^t = [c_1^t \; c_2^t \; \ldots \; c_m^t]^T$ be the vector of loop detector counts, where $c_j^t$ is the hourly flow measured
on loop detector $j$ at time interval $t$. We can define matrix $A$ as the relationship between OD pairs
and vehicle counts from loop detectors, where $A_{j,i}$ is one if the route for OD pair $i$ passes through
loop detector $j$ and is zero otherwise. Therefore, assuming there is no congestion, the relationship
between them can be written as:
$$c^t = A\, d^t \tag{4.10}$$
Note that the no-congestion assumption is necessary to ensure that all vehicles reach
their destinations and pass through the loop detectors without being trapped by congestion. Basically,
row j from matrix A defines the OD pairs which pass through detector j; therefore, the vehicle
count for detector j will be the sum of all demands which pass through it. Equation (4.10) can be
solved for , for each time interval independently ( and A are known). However, to
maintain consistency and prevent oscillation of demand from one interval to the next, another cost
term is added to link different intervals. The final cost to minimize is:
$$J = \sum_{t=1}^{T} \sum_{j=1}^{m} \left( A_{j,:}\, d^t - c_j^t \right)^2 + w \sum_{t=1}^{T-1} \sum_{i=1}^{n} g\!\left(d_i^t, d_i^{t+1}\right)^2 \tag{4.11}$$
where $A_{j,:}$ is row $j$ of matrix $A$, $T$ is the total number of demand intervals, and $w$ is the weight
parameter which controls the significance of the second term. The function $g(\cdot,\cdot)$ is defined as
$g(x, y) = 2(x - y)/(x + y)$. The first term in the cost function (4.11) guarantees that the
simulated counts are close to the real counts, and the second term prevents the demands from oscillating
from one time interval to the next. Using common numerical optimization algorithms, we can find
a set of OD matrices which minimizes the cost .
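The estimation step can be sketched as evaluating the cost (4.11) over candidate demands for a toy two-OD, two-detector corridor. The layout, counts, and weight $w$ are illustrative, and $g$ is taken as the relative-difference penalty $g(x, y) = 2(x - y)/(x + y)$ described in the text.

```python
# Sketch of the initial OD-estimation cost (4.11): matrix A maps OD demands
# to detector counts, and g penalizes demand oscillation between
# consecutive intervals. The corridor layout and numbers are toy values.

def g(x, y):
    # relative change between consecutive intervals
    return 2.0 * (x - y) / (x + y)

def cost(A, demands, counts, w=100.0):
    total = 0.0
    for t, (d_t, c_t) in enumerate(zip(demands, counts)):
        for j, row in enumerate(A):                 # count-matching term
            pred = sum(a * d for a, d in zip(row, d_t))
            total += (pred - c_t[j]) ** 2
        if t + 1 < len(demands):                    # smoothness term
            total += w * sum(g(x, y) ** 2 for x, y in zip(d_t, demands[t + 1]))
    return total

# OD 1 passes both detectors; OD 2 enters downstream and passes only the second.
A = [[1, 0], [1, 1]]
counts = [[1000.0, 1400.0], [1100.0, 1600.0]]       # two time intervals
exact = [[1000.0, 400.0], [1100.0, 500.0]]          # reproduces the counts
```

A numerical optimizer would search over the demand vectors to minimize this cost; the `exact` demands above drive the count-matching term to zero by construction.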
This initial OD estimation, although very accurate with regard to capturing counts, fails to
replicate traffic congestion. Since the counts represent the vehicles that passed the loop detector
and not the actual demand, the initial OD estimation will generate demands that match the counts,
but on the uncongested side. The degree of congestion can be captured through traffic speed at different
sections. To recreate the real congestion and better replicate observed traffic speed in the
simulation, the demands for some intervals should be increased to exceed capacity. Considering
the extent of congestion and observed density in those sections from real counts, an estimate of the
vehicles stuck on the roads can be made. The number of vehicles stuck in congestion represents
the extent to which the demand should be increased in the simulation model to reproduce the same
congestion. Given that simulation starts and finishes with no congestion, the total demand for the
whole simulation horizon should not change. The extra demand needed for producing congestion
should be moved from later intervals to earlier intervals. After all, some of the vehicles that passed the loop detectors at later intervals were vehicles previously stuck in congestion.
To calibrate the OD matrices, the demands are modified based on:

    d'_i(1) = d_i(1) + δ_i(1),                 i = 1…n
    d'_i(t) = d_i(t) + δ_i(t) − δ_i(t−1),      i = 1…n, t = 2…T    (4.12)

where d'_i(t) is the new demand and δ_i(t) ≥ 0 is the demand for OD pair i which is moved from interval t+1 to interval t (with δ_i(T) = 0). Note that δ_i(t) is added to d_i(t) and subtracted from d_i(t+1). The calibration aim is to find δ_i(t) for i = 1…n, t = 1…T−1 so that the following cost function is minimized:
    C' = Σ_{t=1..T} Σ_j ( A_{j,:} d'(t) − y_j(t) )² + w_v Σ_{t=1..T} Σ_j ( v_j(t) − v̂_j(t) )² + w Σ_{t=2..T} Σ_i f( d'_i(t), d'_i(t−1) )    (4.13)

where v_j(t) and v̂_j(t) are the simulated and measured speeds at detector j, and w_v weights the speed-matching term.
Given the large number of unknown variables δ_i(t), the SPSA algorithm discussed in Section 4.1.2 can be employed to solve the optimization problem efficiently.
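A minimal SPSA sketch is shown below; the gain sequences and the quadratic test loss are illustrative choices, not the calibration setup used in the thesis:

```python
import numpy as np

# Minimal SPSA sketch; the gain constants (a, c) and decay exponents
# are illustrative, not the thesis's exact calibration settings.
def spsa_minimize(loss, theta0, n_iter=1000, a=0.1, c=0.1, seed=0):
    rng = np.random.default_rng(seed)
    theta = np.asarray(theta0, dtype=float)
    for k in range(n_iter):
        ak = a / (k + 1) ** 0.602            # step-size decay
        ck = c / (k + 1) ** 0.101            # perturbation decay
        delta = rng.choice([-1.0, 1.0], size=theta.shape)
        # Two loss evaluations approximate the gradient in all dimensions
        # at once, which is what makes SPSA cheap when the number of
        # unknown variables is large.
        g = (loss(theta + ck * delta) - loss(theta - ck * delta)) / (2 * ck) / delta
        theta = theta - ak * g
    return theta
```

Because each iteration needs only two loss evaluations regardless of the problem dimension, the cost of one iteration stays constant as the number of δ variables grows.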
4.2 Highway 401 Eastbound Collector and Keele Street
The first network developed is a section of the Highway 401 eastbound collector that surrounds
Keele Street, as shown in Figure 4-3. This network is employed for experimentation with different
aspects of the single-agent RL-based RM. The study area forms an important element of the experimental design and was selected meticulously. The morning peak period
was chosen for modelling because of the significant demand from on-ramp #1, which causes a
bottleneck. Furthermore, there are no bottlenecks present immediately downstream of this on-
ramp, ensuring that the only source of congestion is this on-ramp.
The studied network is about 2.5km long and includes Highway 401 eastbound collector
from Highway 400 to Allen Road. The section includes an off-ramp to Keele Street and two on-
ramps from Keele Street's southbound and northbound directions. The freeway mainline is four
lanes wide upstream of the off-ramp and is three lanes wide at the on-ramps’ location. The loop
detector data from the morning peak period of July 6, 2011 were used for the calibration of the
network. The simulation was performed from 06:00 to 10:00 and demand periods were broken into
20-min intervals. Since the focus of this work is ramp metering, the surrounding arterials were not modelled; however, the on-ramps were extended beyond their actual length to account for the spillback of vehicles queued by ramp metering into nearby arterials. There are two on-ramps in the study area. After examining the system and the demand on the two on-ramps, it was realized that on-ramp #2 carries considerably lower demand and therefore does not have a significant effect on the freeway traffic, especially if the downstream on-ramp is efficiently metered. Consequently, the upstream on-ramp was not metered.
Figure 4-3 Aerial map and Paramics screenshot showing part of the study area. The map
shows the Highway 401 eastbound collector at the merging point of Keele St.
After extraction of the above segment from the GTA freeway network model and
modification of the physical aspects, the behavioural parameters were calibrated. The Paramics parameters used for this model are summarized in Table 4-1. Figure 4-4 shows the fundamental
diagram estimated from measurements obtained from the Paramics network compared with the
one from real loop detectors. The loop detector is located just after the merging area of on-ramp
#1. Except for jam densities, all parameters of the two fundamental diagrams are similar. The jam density value obtained from calibration of a Van Aerde model is very sensitive to the data samples, especially when the samples on the congested side of the diagram do not extend to higher densities. When the measurements are made after a bottleneck, such as an on-ramp, the congestion is limited and the traffic flow is close to capacity. These congested samples do not provide enough information for estimation of the jam density. When the measurements are made upstream of a bottleneck, the traffic flow is much lower than capacity and the density is much higher than the critical density. Therefore, the jam density can be expected to be estimated with higher confidence there.
Even though the jam densities do not match, it can be seen that the traffic flow drops below capacity when the freeway is congested due to the on-ramp bottleneck.
Table 4-1 Numerically calibrated Paramics parameters for Highway 401 model
Parameter Value Parameter Value
Mean headway 0.95 Link headway factor for on-ramp links 1.1
Mean reaction time 0.85 Link reaction time factor for on-ramp links 1.05
Mean awareness 4 Link headway factor for off-ramp links 0.85
Awareness standard deviation 2.5 Link reaction time factor for off-ramp links 0.9
Mean aggression 6
Aggression standard deviation 2
Figure 4-4 Fundamental diagram fitted to samples from simulation of calibrated Paramics
model and real loop detectors.
In the next step, the demands were calibrated to recreate congestion patterns in the model as close as possible to field observations. Figure 4-5 shows the traffic speed upstream of on-ramp #1 from the Paramics model and the real freeway. Even though the traffic speed in Paramics does not exactly match the real freeway speeds, it is important to note that the duration of congestion in the two cases is very similar.
Figure 4-5 The evolution of morning traffic in the Paramics model compared with
measurements from real loop detectors.
4.3 Gardiner Expressway Westbound
The westbound direction of the Gardiner Expressway is a very good testbed for the evaluation of coordinated RM algorithms. The Gardiner Expressway has three on-ramps in downtown Toronto, which carry traffic out of Toronto in the evening peak period. Demand from the three on-ramps peaks at 4,000 veh/hr and can easily cause the freeway flow to break down. Additionally, an on-ramp on the west end connects Lakeshore to the Gardiner at Jameson. Figure 4-6 shows a schematic representation of the study network.
The four on-ramps feeding traffic to the Gardiner are discussed below.
Jarvis on-ramp – the Gardiner is physically limited to two lanes upstream of Jarvis, and it
changes back to three lanes after Jarvis. Therefore, the Jarvis on-ramp has its own dedicated
lane allowing for unimpeded traffic flow.
Figure 4-6 Schematic of the study area network, showing the Gardiner Expressway
westbound from Don Valley Parkway in the east to Humber Bay in the west.
York on-ramp – the York on-ramp is located around 250 m upstream of the Spadina off-ramp. The weaving section created by this close spacing is exacerbated by the significant demand from the York on-ramp, as well as by the vehicles coming from upstream (from the Don Valley Parkway, which is not shown in the figure) that have to change at least two lanes to reach the Spadina off-ramp.
Spadina on-ramp – the Spadina on-ramp carries the highest volume among the three on-
ramps. The acceleration lane of the Spadina on-ramp for merging of the on-ramp vehicles
with the mainline flow is significantly longer than the average on-ramp. Furthermore, even
after the acceleration lane ends, the Expressway remains wide and the right lane is much
wider than other lanes. The purpose of this is probably to ease the merging process of the
on-ramp vehicles. When modelling the Spadina on-ramp it is important to keep its peculiar
geometry in mind.
Jameson on-ramp – the Jameson on-ramp has a very short acceleration area and merging
distance and is located after a very sharp turn. This geometry causes significant traffic
instability that leads to freeway breakdown even with very low on-ramp demand. For this
reason, the City of Toronto closes the Jameson on-ramp from 15:00 to 18:00 every day.
After extraction of the above segment from the GTA freeway network model, the network's
physical geometry was carefully modified to reflect the above road conditions and road behaviour.
The movement of vehicles at the merging point of Jarvis was modified to ensure no conflict
occurred between the two traffic streams. The reaction time for the right two lanes of the weaving
section after the York on-ramp was reduced for easier lane changes. The length of the Spadina on-
ramp lane was increased to reflect the extended length in the actual freeway. The Jameson on-ramp
length was reduced to match the short length of the real on-ramp. Lakeshore Boulevard in the Humber Bay area is also modelled as an alternative route for when the Jameson on-ramp is closed. Note that if the queue behind the Jameson on-ramp gets very long, vehicles will choose Lakeshore instead of the Gardiner because of its lower travel time. The Paramics model of the westbound
direction of the Gardiner Expressway and its aerial map are shown in Figure 4-7.
The modelled network is about 10 km long with four off-ramps and four on-ramps. The
freeway is three lanes wide for the most part as shown in the schematic diagram in Figure 4-6.
After refinement of the model, a preliminary OD was estimated based on the loop detector counts
from April 2012. In addition to weekends, Mondays and Fridays were omitted from data collection
to eliminate any chance of irregular traffic flows in the input data. In total, nine days in April 2012 (the 10th, 11th, 12th, 17th, 18th, 19th, 24th, 25th, and 26th) were considered, and the flows were averaged in one-hour intervals from 13:00 to 21:00. Consequently, a dynamic OD demand matrix was fitted to the averaged counts as the preliminary OD demands. Since the resulting demands are on the uncongested side of the flow-density diagram, the demand values were increased by 10% to create some congestion and make the process of estimating fundamental diagrams possible. Then, the behavioural parameters were calibrated using the approach discussed in Section 4.1.2. The Paramics parameters used for this model are summarized in Table 4-2.
Figure 4-7 Aerial map of the Gardiner Expressway westbound and its Paramics model.
Table 4-2 Numerically calibrated Paramics parameters for the Gardiner model
Parameter Value
Mean headway 0.93
Mean reaction time 0.8
Mean awareness 3
Awareness standard deviation 1.6
Mean aggression 5
Aggression standard deviation 1.4
Link headway factor for Spadina off-ramp weaving section 1.15
Link reaction time factor for Spadina off-ramp weaving section 0.75
Link headway factor for lane drop upstream of Jarvis 1.1
Link reaction time factor for lane drop upstream of Jarvis 0.9
Table 4-3 summarizes the fundamental diagram parameters of the real freeway as well as those from the Paramics network. The loop detectors are placed just downstream of the merging point of the respective on-ramps. Except for jam density, which is generally higher for the Paramics model, the rest of the parameters match closely and show the good quality of the calibration. As discussed above, proper estimation of jam density requires data samples with high density and
low traffic flow. These samples are usually obtained when the loop is upstream of a bottleneck. The loop detector related to York is placed upstream of Spadina, which is a major bottleneck; therefore, the jam densities from the real data and the Paramics data are consistent. There are no significant bottlenecks downstream of Jameson and Spadina in the Paramics model; hence, the jam density estimates there are not accurate. It should be noted that the real loop detector data consist of nine days, and there are cases of congestion building up from downstream toward Jameson and Spadina due to different traffic patterns on different days.
Table 4-3 Parameters of the Van Aerde model fitted to fundamental diagram samples from Paramics and real life.

                                    York             Spadina           Jameson
Parameter                      Paramics  Real    Paramics  Real    Paramics  Real
Free-flow speed (km/h)             93     92        104    100         95     96
Capacity (veh/h/lane)            1784   1746       2074   2095       2117   2150
Critical density (veh/km/lane)     23     24         27     28         24     25
Jam density (veh/km/lane)         176    153        196    117        202    101
Following the calibration of the behavioural parameters, the dynamic OD demands should also be calibrated. Calibrations were performed to match the loop detector measurements of the selected nine days of traffic. Figure 4-8 compares the time-space diagram of speeds from the real loop detectors with speeds from the Paramics model. The difference at the bottom of the graph is due to the lack of data from the real loop detectors at the beginning of the freeway. Nevertheless, the two graphs show very similar patterns of congestion.
Although the speeds are matched, we should make sure that the vehicle counts are still in the acceptable range. Figure 4-9 summarizes the GEH values for selected loop detectors (the two detectors on the right side of the graph are on-ramp detectors) at one-hour intervals. Eighty-six percent of the GEH values are well below five, which is considered accurate calibration, and the rest are below eight. It is worth noting that the original loop detector data are not very accurate, and some of the errors can be attributed to the low quality of the initial data. Figure 4-10 shows the actual flows from the real loop detectors and the calibrated Paramics model for the 16:00-17:00 interval, when demand from downtown is at its peak. As can be seen from the graph, the Paramics model closely matches the measurements taken from the real freeway.
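The GEH statistic referenced above is a standard fit measure for comparing modelled and observed hourly flows; a small helper (hypothetical naming) makes the computation concrete:

```python
import math

# The GEH statistic; a value below 5 is commonly taken to indicate a
# good fit between modelled and observed hourly flows. The helper name
# is an assumption for illustration.
def geh(model_flow, observed_flow):
    return math.sqrt(2.0 * (model_flow - observed_flow) ** 2
                     / (model_flow + observed_flow))
```

Unlike a plain percentage error, GEH tolerates larger absolute deviations on high-volume links, which is why it is the customary acceptance criterion for count calibration.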
Figure 4-8 The left graph shows the real loop detectors and the right graph shows the Paramics model. The time-space graphs show the average speed along the Gardiner from 13:00 to 21:00.
Data obtained from the currently available loop detectors provide detailed information about the state of traffic on the freeway; however, they do not give any information about the on-ramp queues. As a result, the calibration process does not have a reference for the queues and can produce unrealistic queues. Since we are dealing with ramp metering, it is important to make sure the ramp queues in Paramics properly follow those in reality. For this purpose, the INRIX app was employed. From the historical congestion information supplied by INRIX, the duration and extent of the queues for the three downtown on-ramps were estimated. Congestion on surface streets
connecting to on-ramps generally starts at around 15:00 and lasts until about 18:30, which is in
agreement with the space-time speed graph. It is estimated that the congestion on Jarvis extends to
Dundas Street and is equal to around 250 vehicles in the queue. The queue of York on-ramp
propagates on both Lakeshore (for vehicles going to the Gardiner from Yonge St.) and York
Streets, resulting in about 200 to 250 vehicles waiting in the queue. The Spadina on-ramp queues
extend beyond King Street and multiple nearby streets and comprise 200 to 250 vehicles.
When vehicles are queuing on an on-ramp, changing the ramp demand will not affect the flow entering the freeway and will only change the queue length. Therefore, if the simulated queues are too short, some demand from later intervals should be moved forward in time to increase the queue length; if the queues are too long, some demand should be shifted to later time intervals to decrease the queue.
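The demand-shifting rule described earlier (Eq. 4.12) can be sketched as follows; the array shapes and the function name are assumptions for illustration:

```python
import numpy as np

# Sketch of the demand-shifting rule in Eq. (4.12): column t of delta
# holds the demand moved from interval t+1 to interval t (1-based), so
# the total demand over the horizon is preserved. Shapes and the
# function name are illustrative assumptions.
def shift_demand(d, delta):
    """d: (n_od, T) demands; delta: (n_od, T-1) nonnegative shifts."""
    d_new = d.astype(float).copy()
    d_new[:, :-1] += delta   # demand gained by earlier intervals
    d_new[:, 1:] -= delta    # demand removed from later intervals
    return d_new
```

Because every unit of demand added to one interval is subtracted from the next, the total demand over the simulation horizon is unchanged, as required.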
Figure 4-9 GEH value for vehicle counts averaged over one-hour intervals for select loop
detector locations.
Figure 4-10 Traffic flow of the calibrated Paramics model compared with real loop detector
data along the Gardiner for three different time intervals.
After closely analyzing the calibration result, we observed that the Spadina on-ramp carries 1,600 veh/hr on average during the peak period. This result was evident in both the Paramics model and the real loop detector data. Considering the significant queues on the Spadina on-ramp, it was important to verify these numbers with solid evidence. Therefore, a field survey was conducted and vehicles on Spadina and the surrounding streets were counted between 16:00 and 17:00 to obtain the traffic flow and the number of queuing vehicles. Figure 4-11 shows the aerial view representing the queues on each street and the vehicle flow for each direction. The observed flow was 1,551 veh/hr, which is consistent with the previous results. The observed queue was about 140 vehicles, which is slightly lower than the values obtained from INRIX. However, it should be noted that these numbers are from a single-day observation, which can justify the difference in the queue.
Figure 4-11 Aerial view of the Spadina on-ramp with information about traffic flow and queues.
5 Independent and Coordinated RL-based Ramp Metering
Design and Experiments
In this chapter, the design and evaluation of RM controllers based on RL are discussed. The design process was carried out in three stages, to which the three sections of this chapter correspond. The first stage involved analyzing the different design parameters of the application of RL to RM. The second stage focused on the design, implementation, and evaluation of RL-based RM using different function approximation approaches for dealing with continuous variables. The third stage was the design and evaluation of the coordination of multiple RL-based RM controllers.
During the design process, to ensure the practicality of the resulting systems, it was assumed that measurements are only available through common loop detectors and hardware currently in place. Additionally, the employment of microscopic traffic simulators ensures that the RLRM design abides by real-life limitations, such as loop detector measurement noise, drivers' random behaviour, and the effect of traffic lights on traffic flow. These choices facilitate future field implementation of the algorithms. The use of more precise and detailed information, through technologies such as cameras and connected vehicles, is expected only to improve the performance of the system when available in the future.
5.1 Experiment I – Single Ramp with Conventional RL
Since a comprehensive model of the traffic flow, which fully represents the state of traffic, requires
an extensive number of traffic variables, RL design is not trivial in freeway control problems.
Furthermore, because of the stochastic nature of the traffic flow in freeways, an RLRM agent will
require a significant number of training samples to suppress the measurement noises. In this
section, the various RLRM design parameters and their selection criteria to ensure fast training
and reliable performance are discussed. The microsimulation model used for this part is the
Highway 401 eastbound collector at Keele Street presented in section 4.2. In this part, the
conventional table-based RL approaches were employed. The focus of this experiment was to
minimize the total travel time without any limit on the on-ramp queue storage. Therefore, queue
management algorithms are not analyzed and are deferred to the last section of this chapter.
5.1.1 RL-based RM Controller Design for Single Ramp
Given that conventional RL algorithms are being used, the design problem involves deciding on the aspects of the problem discussed in the following sections.
5.1.1.1 Control Cycle
The control cycle, T_c, is the time step at which the RLRM agent perceives the new state of the environment and takes a new action. In freeway traffic control problems, aggregated traffic conditions are used, and instantaneous measurements from sensors are averaged over T_c. A small control cycle is preferred to ensure fast system response to changes in traffic conditions. However, measurement noise and system delay, i.e. the time it takes for the system to respond to the controller action, limit the lower bound for the choice of control cycle. Depending on the algorithm and metering approach, various control cycle values have been used, e.g. 40 sec in Papageorgiou et al. (1997) and 60 sec in Jacob and Abdulhai (2010). After experimenting with various control cycle times, we found that a value of T_c = 2 min results in a good balance between the response to traffic changes and the measurement noise observed in real-life traffic data.
5.1.1.2 Action
Metering of on-ramps is performed by placing a traffic light on the ramp at the freeway
entrance. Changing the traffic light timing directly controls the traffic inflow to the freeway. Two
notable metering policies for on-ramp traffic light timing are one-car-per-green and discrete release
rates (Papageorgiou & Papamichail, 2008). In the one-car-per-green policy, a fixed green phase of
2 sec is used and the red phase is varied to provide different flow rates. This approach has the
benefit of breaking the platoon of cars and is easy for drivers to comprehend. However, the
maximum traffic flow that can be achieved by this approach is 900 veh/hr, given a minimum red
time of 2 sec. Therefore, one-car-per-green is suitable for on-ramps with low demand. Table 5-1
lists the one-car-per-green metering rates employed in this research and the corresponding green
and red phases. Given that the on-ramp demand in the Highway 401 model is less than 1000 veh/hr, the one-car-per-green policy is employed.
The discrete release rates policy allows more flexible metering rates, up to the capacity of
the on-ramp, e.g. 1800 vph, by allowing both green phase and red phase to be varied independently.
The goal is to achieve evenly spaced metering rates to be able to inject various levels of traffic into
the freeway. Although any flow value can be achieved with unconstrained green and red phases, it is desirable to keep the cycle length to a minimum and inject the fewest cars possible in each cycle. Considering these objectives, the discrete release rates and the associated green and red phases employed in this research are summarized in Table 5-2. The discrete release rates policy was used in the Gardiner model due to the significantly higher demand from its on-ramps.
Table 5-1 Metering rates and associated green and red phases for one-car-per-green metering policy
Metering rate (veh/h) 240 300 360 450 600 720 900
Green time (sec) 2 2 2 2 2 2 2
Red time (sec) 13 10 8 6 4 3 2
Table 5-2 Metering rates and associated green and red phases for discrete release rates metering policy
Metering rate (veh/h) 240 400 600 720 900 1200 1440 1800
Green time (sec) 2 2 2 2 2 4 8 6
Red time (sec) 13 7 4 3 2 2 2 0
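The green and red phases in Tables 5-1 and 5-2 follow from simple cycle-length arithmetic; a small sketch (hypothetical helper name) reproduces the tabulated red times:

```python
# Red-phase arithmetic behind Tables 5-1 and 5-2: with a fixed green
# phase releasing a known number of vehicles per cycle, the red phase
# follows from the target metering rate. The helper name is an
# assumption for illustration.
def red_time(rate_vph, green_s=2.0, veh_per_cycle=1):
    cycle_s = 3600.0 * veh_per_cycle / rate_vph   # required cycle length
    return max(cycle_s - green_s, 0.0)
```

For example, 240 veh/hr with one car per 2-sec green requires a 15-sec cycle, hence a 13-sec red, matching the first column of Table 5-1.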
The RL-based RM controller can control the signal timing in two ways: with a direct action that directly decides on the new metering rate, or with an incremental action in which the metering rate is increased or decreased relative to the previous control cycle. Incremental action eliminates the large variations that might occur with direct action and provides smoother changes in the metering rate. On the other hand, the small variations of incremental action result in slow reaction from the RM controller, which might limit the controller's performance.
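The two action definitions can be contrasted in a short sketch; the rate list comes from Table 5-1, while the index-based encoding and function names are illustrative assumptions:

```python
# Contrast of the two action definitions; rates are those of Table 5-1,
# while the index encoding and function names are illustrative.
RATES = [240, 300, 360, 450, 600, 720, 900]  # veh/h

def apply_direct(action_index):
    """Direct action: pick any metering rate outright."""
    return RATES[action_index]

def apply_incremental(current_index, delta):
    """Incremental action: step to an adjacent rate (delta in -1/0/+1),
    clipped at the ends of the rate list."""
    return max(0, min(len(RATES) - 1, current_index + delta))
```

The clipping at the list ends is what bounds the per-cycle change under incremental action, giving the smoother but slower behaviour described above.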
5.1.1.3 State
To represent the complete model of the freeway traffic network properly, state variables from the
entire network should be considered. In addition to the impracticality of measuring all possible
state variables, the learning time of RLRM agents increases exponentially with the number of state
variables. However, for a single on-ramp problem, the state of traffic can be properly identified
with only a few variables in the local area of the on-ramp, as shown in Figure 5-1. These variables
should represent the traffic conditions upstream of the on-ramp, downstream of the on-ramp, and on the on-ramp itself. The condition of the mainline traffic upstream and downstream of the on-ramp can be identified by its speed and density. The variables necessary to identify the condition of the on-ramp are the demand coming into the on-ramp, the on-ramp flow entering the freeway, and the on-ramp queue. Although all of these variables are needed for the complete state of traffic, some of them share redundant information. Omitting the redundant variables can speed up learning without significantly affecting performance.
The state of traffic is measured through loop detectors. Loop detectors, when implemented in a double-loop configuration, sense the presence and speed of individual vehicles. Averaging this information over T_c provides good estimates of speed, flow, and occupancy. Albeit not directly available through loop detectors, density, ρ, can be estimated from the average flow, q, and the average speed, v, as:

    ρ (veh/km) = q (veh/h) / v (km/h)    (5.1)
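Equation (5.1) amounts to a one-line computation; for instance, a flow of 1800 veh/h at 90 km/h implies 20 veh/km (the helper name is an assumption):

```python
# Density estimate of Eq. (5.1): flow (veh/h) divided by speed (km/h)
# gives density (veh/km). The function name is illustrative.
def density(flow_vph, speed_kmh):
    return flow_vph / speed_kmh
```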
Loop detectors are point detectors, and they should be properly located to provide an accurate representation of traffic conditions. However, RLRM learns to map detector measurements to the optimal action and is robust to imperfections caused by slight detector misplacement. As a result, the actual field locations of the loop detectors were used in this research and were not changed.
Figure 5-1 Local area on an on-ramp and the loop detectors which can represent its traffic state.
5.1.1.3.1 Downstream Traffic Condition
The complete state information can be described by speed and density; however, these two variables are closely correlated, and one of them can describe the traffic without significant loss of information. Although speed changes significantly as traffic changes from free flow to congested, density varies more evenly as the traffic state changes and provides a better representation of the traffic condition. In conventional RL algorithms, states are discrete; therefore, the continuous density variable should be discretized. The downstream density, ρ_down, represents the level of congestion and is the most important variable for proper design of RLRM agents. Since the maximum throughput of the freeway occurs at the critical density, ρ_c, the downstream density is expected to be close to the critical density in optimal operation of the freeway. Figure 5-2a shows the histogram of the downstream density when an RM controller is in operation. The downstream density is discretized such that samples are evenly distributed among the different bins. The edges of the discretization intervals for ρ_down were chosen as [0, 12, 16, 19, 22, 25, 28, 33, 40, 50, 60].
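This discretization can be implemented directly with NumPy's digitize; the bin edges are those listed above, while the function name is an assumption:

```python
import numpy as np

# Discretization of the continuous downstream density into the bin
# edges listed in the text (veh/km); the function name is illustrative.
DOWNSTREAM_EDGES = [0, 12, 16, 19, 22, 25, 28, 33, 40, 50, 60]

def density_state(rho):
    """Return the 0-based index of the bin containing density rho."""
    return int(np.digitize(rho, DOWNSTREAM_EDGES)) - 1
```

Densities above the last edge fall into a final overflow bin, so every measurement maps to a valid discrete state.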
5.1.1.3.2 Upstream Traffic Condition
Similar to the downstream traffic measurement, the upstream density was chosen as the variable to represent the state of traffic upstream of the on-ramp. The upstream density, ρ_up, provides an estimate of the distance over which congestion has propagated upstream of the ramp. For the RLRM agent to prevent congestion effectively, ρ_up should stay below ρ_c. Figure 5-2b shows the histogram of the upstream density when an effective RM is in operation. As can be seen from the figure, the densities hardly reach ρ_c; therefore, the discretization intervals were focused on the subcritical densities, and the interval edges for the upstream density were chosen as [0, 12, 16, 20, 24, 28, 40].
5.1.1.3.3 On-Ramp Traffic Condition
Since queue management techniques are not considered in this experiment, the queue length and on-ramp demand are not necessary for ramp metering. Therefore, the flow entering the freeway, q_ramp, is the only variable used. Since the agent's action determines the entry flow, the discretization intervals for the on-ramp flow were chosen to match the discrete values of the metering policy.
Figure 5-2 Histograms of traffic densities in a freeway section including an on-ramp, in the presence of an optimal RM controller. The dashed line represents the estimated critical density.
5.1.1.4 Reward and Discount Factor
Typically, the main goal of a traffic control system is to minimize the combined travel time of all transportation users. The total travel time, TTT, is defined as:
    TTT = T_c Σ_{k=0..K} N(k)    (5.2)
where N(k) is the number of vehicles at control cycle k within the area confined by the upstream, downstream, and on-ramp detectors, and K is the time horizon. To minimize TTT, the RLRM agent's reward and discount factor can be defined as r(k) = −N(k) and γ = 1, respectively. This is an undiscounted infinite-horizon problem and, ideally, the R-learning method presented in section 3.3.3 should be used for training the agent instead of typical approaches such as Q-learning and SARSA. Q-learning and SARSA maximize the total reward and cannot be applied to an undiscounted infinite-horizon problem, because the expected reward will not converge due to the undiscounted nature of the problem. Although the use of R-learning is theoretically sound for the above problem, improper selection of the additional learning parameter in R-learning often leads to slow convergence or divergence. It is therefore desirable to reformulate the problem such that Q-learning can be employed, avoiding the complexities associated with R-learning. Both R-learning and Q-learning are used and compared later in this chapter. Assuming there are no vehicles in the network initially, N(k) can be calculated as:
available initially in the network, can be calculated as:
TC′ ′
1
′ 0 (5.3)
where q_in(k) and q_out(k) are the entrance and exit rates of vehicles (veh/h), to and from the area confined by the detectors, at control cycle k, respectively. Substituting N(k) from (5.3) into (5.2) results in:
    TTT = T_c² Σ_{k=0..K} Σ_{k'=1..k} ( q_in(k') − q_out(k') )    (5.4)
Rearranging the summation operators and the constants in (5.4) results in:
    TTT = T_c² Σ_{k=1..K} ( K − k + 1 ) ( q_in(k) − q_out(k) )    (5.5)
The new formulation −Σ_k γ^k ( q_out(k) − q_in(k) ) is equivalent to the optimal control problem of equation (3.1), where q_out(k) − q_in(k) and (K − k + 1)/(K + 1) represent r(k) and γ^k, respectively. The negative sign changes the minimization into a maximization. Since γ is not directly evident in (5.5), it is necessary to find a γ value such that γ^k is a good approximation of (K − k + 1)/(K + 1). To achieve this, K is considered to be 30 control cycles (equivalent to a one-hour horizon based on a control cycle of 2 min) and γ is determined to be 0.94. Figure 5-3 shows that with γ = 0.94 and K = 30, γ^k is a good approximation of (K − k + 1)/(K + 1). A limitation of defining the reward as q_out(k) − q_in(k) is that it does not capture the spillback of congestion or the on-ramp queue beyond the loop detectors.
Figure 5-3 The actual weights of q_in − q_out and the discounted weights considered by the RLRM agent using a discount factor of 0.94. The actual weights were based on a control cycle of 2 min and a minimization horizon of 1 hr.
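Two aspects of the derivation can be checked numerically: the rearrangement from (5.4) to (5.5), and how closely γ^k with γ = 0.94 tracks the linear weights (K − k + 1)/(K + 1); the random entrance/exit rates below are illustrative values:

```python
import numpy as np

# (i) Check that the rearranged sum of Eq. (5.5) equals the cumulative
# double sum of Eq. (5.4); (ii) compare gamma^k against the actual
# weights (K - k + 1)/(K + 1) of Figure 5-3. Rates are illustrative.
rng = np.random.default_rng(1)
K = 30                       # horizon: 30 control cycles of 2 min = 1 hr
Tc = 2.0 / 60.0              # control cycle in hours
q_net = rng.uniform(-500.0, 500.0, K + 1)   # q_in(k) - q_out(k), veh/h

# Eq. (5.4): TTT as a cumulative double sum
ttt_cumulative = Tc ** 2 * sum(q_net[kp]
                               for k in range(K + 1)
                               for kp in range(1, k + 1))
# Eq. (5.5): each term weighted by the remaining horizon K - k + 1
ttt_weighted = Tc ** 2 * sum((K - k + 1) * q_net[k] for k in range(1, K + 1))

# Discounted approximation of the linear weights (gamma = 0.94)
k = np.arange(K + 1)
actual_weights = (K - k + 1) / (K + 1)
discounted_weights = 0.94 ** k
```

The two TTT expressions agree to machine precision, and the largest gap between the discounted and linear weights stays modest over the 30-cycle horizon, consistent with Figure 5-3.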
The reward function, r(k) = q_out(k) − q_in(k), contains two terms: q_in(k), which is independent of the agent's action and depends on the network demand; and q_out(k), which directly depends on the congestion level. Note that the traffic throughput depends on the traffic congestion level, and the agent's actions can vary the congestion level. Since q_in(k) is independent of the agent's action, it can be neglected when minimizing TTT. If q_in(k) is removed from the reward, the reward function becomes r(k) = q_out(k), which is equivalent to throughput. Therefore, the minimization problem is equivalent to a throughput maximization problem. Note that because γ < 1, earlier throughput values have more significance than later values. The new reward is based only on the traffic condition downstream of the ramp, and it is likely that excluding the upstream density from the state variables will not have a negative effect on RLRM performance with this reward definition.
5.1.1.5 Additional Notes on RLRM Design
One challenge associated with RLRM problems deals with traffic flow instability at the critical density, ρ_c. Since the traffic flow is unstable, at the beginning of the learning process, when the RLRM agent is immature, the freeway is largely congested. As the agent learns the dynamics of the congested traffic condition, it acquires the knowledge to shift the traffic density toward ρ_c. In fact, the RLRM agent first needs to learn the congested region of the state space to stabilize traffic, and then explore the parts of the state space around ρ_c. As a result, different states are explored at different stages
of the learning process, and typical methods for changing the learning rate and selecting actions as functions of time are not suitable for the RLRM problem. An alternative is to keep the number of visits, C(s, a), to each state-action pair. The learning rate and action selection policy for each state-action pair can then be defined as functions of the number of visits to that pair.
For the learning rate, an approach similar to the one discussed in 3.3.1.1 is employed, and a state-action pair dependent learning rate is defined as:

α(s, a) = 1 / (1 + C(s, a))^0.8 (5.6)
The action selection policy is based on the ε-greedy action selection approach discussed in 3.3.1.2. Similar to the work by Samah El-Tantawy and Abdulhai (2010), in this study a state-dependent ε is calculated as:

ε(s) = max(0.1, 10·n(s) / (10·n(s) + Σ_a C(s, a))) (5.7)

where n(s) is the number of possible actions in state s. Based on (5.7), ε is initially one, and as the number of visits to a state increases, ε decreases. ε will decrease to a minimum of 0.1, which corresponds to an average of more than 90 visits per action in that state.
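The visit-count-based schedules can be sketched as follows. The exact functional forms (a power-law learning rate and the 10-visit ε rule) are reconstructions consistent with the surrounding text of (5.6) and (5.7), not verbatim copies of the thesis formulas.

```python
# Sketch of visit-count-based schedules for learning rate and exploration.
# Both functional forms are assumptions reconstructed from the text.

def learning_rate(visits, power=0.8):
    """Learning rate for a state-action pair after `visits` updates:
    starts at 1 and decays as a power law of the visit count."""
    return 1.0 / (1.0 + visits) ** power

def epsilon(n_actions, total_visits, floor=0.1, scale=10):
    """Exploration rate for a state: starts at 1, decays with the total
    visits to the state, and never drops below `floor`."""
    return max(floor, scale * n_actions / (scale * n_actions + total_visits))
```

With a single action, epsilon reaches its 0.1 floor once the state has been visited 90 times, matching the behaviour described in the text.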
5.1.1.5.1 Penalty for Congestion
Although the RLRM agent using the above parameters is guaranteed to find the optimal control policy, measurement noise and model uncertainties increase the time needed to find the optimal policy. Since noise levels are highest under congested traffic conditions, in the absence of any guidance the agent initially gets trapped in congested conditions for a significant portion of the learning. Alternatively, under the operation of a trained RLRM, the density is expected to be close to the critical density. Adding a penalty term to the reward function for severe traffic congestion (densities above 35 veh/km/lane) guides the RLRM agent to choose actions that result in lower densities and to learn faster how to avoid congestion. This penalty should be significantly smaller than the reward; as a rule of thumb, an order of magnitude smaller. Otherwise, the RLRM agent would ignore the reward term and focus solely on minimizing the penalty.
5.1.2 Effect of Design Parameters on RLRM Performance
The different approaches to RLRM problem design discussed in the preceding sections are implemented for the study area, and the performances of the different approaches are compared. In this section, the focus is on the design parameters that are unique to RLRM: congestion penalty, type of action, state, and reward. Other parameters do not vary significantly from one problem to another; therefore, they can be extrapolated from other applications. The learning process of each agent is limited to 1000 epochs (each epoch is one Paramics simulation run of 4 traffic hours). Every point in the following graphs is a moving average of 30 epochs.
5.1.2.1 Congestion Penalty
Adding a penalty to the reward for severe congestion helps the RLRM agent to prevent mainline congestion. Three different penalty values are implemented and compared. In all three cases the main reward function, the state variables, and the direct action are identical.
A penalty is added to the main reward for densities above 35 veh/km/lane. Figure 5-4a illustrates the freeway
mainline total travel time under the three penalty values. As expected, higher penalty values will
result in faster learning in terms of avoiding congestion. In the case with no penalty, the learning
is significantly slower and the agent does not reach a congestion-free traffic flow after 1000
epochs. Figure 5-4b shows the whole network total travel time, which includes the time vehicles
spend on the mainline as well as on the on-ramp. The agent with no penalty learns very slowly and after 1000 epochs still requires further learning. The case with a penalty of 1000 converges to its best solution
after around 600 epochs, whereas the high penalty value of 2000 causes the agent to focus heavily
on mainline traffic and negatively affects RLRM performance and learning.
Figure 5-4 Effect of adding a penalty term to reward function for severe congestion. (a) The total travel time for freeway mainline, (b) the total travel time for the whole network.
5.1.2.2 Direct and Incremental Action
With direct action, all eight red timings discussed in Section 5.1.1.2 are possible in every state. On
the other hand, in incremental action the agent changes the signal timing one step at each control
cycle. Since the agent needs to keep the previous action as a state variable for incremental action,
the total number of state-action pairs increases. The rest of the design parameters are identical for
both agents. Figure 5-5 shows the learning performance of the two approaches. Both agents quickly
learn to prevent mainline congestion (a penalty term is considered for mainline congestion).
However, the agent with incremental action fails in terms of network TTT. The poor performance
of incremental action can be attributed to high measurement noise: any performance gain from incremental actions is lost in the noise, and the RL agent requires more learning to suppress it.
Figure 5-5 Performance comparison of RLRM agent with direct action and RLRM agent with incremental action. (a) Total travel time for freeway mainline only, (b) total travel
time for the whole network.
5.1.2.3 State and Reward
Since state and reward choices are closely related to each other, they have been analyzed
simultaneously. In this section, three rewards are implemented and compared. For the reward
all three state variables ( , , ) are considered and R-learning is used to train the
agent (case 1). The reward is implemented in two different ways: with all three state
variables (case 2) and with upstream density omitted from the state variables (case 3). Finally, the
reward is implemented, similar to case 3, with downstream density and on-ramp flow as state variables (case 4). For cases 2-4, Q-learning is employed, as the problem is discounted and Q-learning is applicable. Figure 5-6 shows the performances of the four different
cases. In case 1, unlike the other cases, R-learning is employed instead of Q-learning, and the
resulting RLRM agent performance is significantly lower. The poor performance of case 1 is
attributed to the complexities associated with the R-learning algorithm. Case 3, with a confined state space compared with case 2, stabilizes the density around the critical density with fewer learning epochs. However, it requires more epochs to suppress the uncertainties caused by not having the full state of the environment. Although in case 2 the agent learns more slowly than in cases 3 and 4 because of the larger state space, after learning is complete the performances of cases 2 and 3 are comparable. This finding shows that, as expected, a smaller state space results in faster learning; however, if the state variables do not completely define the reward, the agent's performance will be degraded. The quick learning and good performance in case 4 show that considering throughput as the reward, in conjunction with a few traffic variables as the state space, can result in minimizing TTT.
Figure 5-6 Effect of different reward choices on RLRM performance. In case 1 the state variables are downstream density, upstream density, and on-ramp density. In case 2 the reward differs and the state variables are the same as in case 1. Case 3 is similar to case 2 except that upstream density is omitted. In case 4 the state variables are downstream and on-ramp densities.
5.1.2.4 Best Design for Single-agent RLRM
Learning of an RLRM agent can be very slow if the agent is not guided to avoid the congested
regions. Adding a penalty for any state that is severely congested significantly improves the
learning speed by guiding the RLRM agent toward maintaining the traffic close to critical density.
The magnitude of the penalty is very important, as small values would not be effective and large values
would degrade the performance. Experiments have shown that a penalty value equal to about 10%
of capacity provides proper guidance without negatively affecting the performance. The direct
action was found more effective than incremental action in the RLRM problem. Furthermore, for
the Highway 401 test case with about 900 veh/hr demand at its peak, one-car-per-green policy is
sufficient and very effective. For the single-agent RLRM problem, choosing the throughput as the
reward with a discount factor of 0.94 will result in optimal TTT. The simple definition of the
reward allows minimal state variables (downstream density and on-ramp flow) for its representation, which results in fast learning.
5.1.3 Comparison with ALINEA Controller
To compare the performance of the RLRM agent with other traffic-responsive ramp metering
algorithms, ALINEA (Papageorgiou, Hadj-Salem, & Blosseville, 1991b) is considered as the
benchmark. The ALINEA controller is an integral controller with robust performance. Field implementations of ALINEA on various European freeways have resulted in savings in Total Travel Time of 5% to 20% (Papageorgiou et al., 1997). The desired density for ALINEA was set to 25 veh/km/lane, and the gain was tuned through trial and error, with the value that resulted in the best TTT retained. Three different cases were then considered: the base case with no
ramp metering, the ALINEA controller, and the RLRM agent with the best design parameters
discussed above. The RLRM agent was trained for 1000 epochs and the greedy policy ( 0 for
always choosing optimal action, i.e. no exploration for learning) was evaluated. To eliminate
variations caused by the stochastic behaviour of Paramics, each case was simulated 15 times with
different seeds (initial randomization parameter). The results of the simulations were averaged and
are summarized in Table 5-3. While ALINEA improves freeway TTT by 15%, in this case study
the RLRM agent improves TTT by 25% and outperforms ALINEA significantly. The improvement
can be associated with the non-linear reaction of the RLRM control agent to changes in traffic
condition. Additionally, the RLRM controller can change the metering rate freely, resulting in
quicker response compared to ALINEA. Looking at the mainline TTT, it can be seen that the two
controllers result in similar travel time for vehicles traveling along the freeway mainline. However,
when the on-ramp wait time is considered, the RLRM achieves the best TTT savings of 25%.
Table 5-3 Summary of the simulation results for the single-ramp test case with conventional RL algorithms.

Performance Measures                  No RM   ALINEA   RLRM
TTT (veh.hr)                          2381    2028     1785
TTT savings                           -       15%      25%
Mainline TTT (veh.hr)                 2326    1180     1143
Mainline TTT savings                  -       50%      51%
Average on-ramp waiting time (min)    < 1     13       9
5.2 Experiment II – RL-based RM with Function Approximation
One drawback of RLRM based on conventional discrete state algorithms is the slow learning
speed, which is further exacerbated with the size of the state-action space. The learning process
for the simplest RLRM agent takes more than 1000 epochs. To implement more complex RM systems with queue management and coordination, the use of function approximation to increase the learning speed is a necessity. Three function approximation approaches were
investigated to find the one most suitable for RL-based RM problems. The first approach was the
kNN-TD(λ) algorithm which is the direct generalization of discrete state RL to continuous states.
Therefore, it shares most of the characteristics and solid foundation of the conventional RL
algorithms. The second approach was the use of the Multilayer Perceptron neural network (MLP)
for function approximation. MLP has been very popular among researchers and is widely used for
function approximation in RL (Sutton & Barto, 1998). The third approach was the Linear Model
Tree (LMT) for function approximation. The underlying models in an LMT are linear functions and are expected to provide good generalization over noisy samples. This characteristic makes the LMT a very good alternative for function approximation in transportation problems.
The approaches were applied to the Highway 401 test case. The SARSA learning approach
was employed for training of the agents. It is very straightforward to use SARSA in conjunction
with eligibility traces to speed up the learning for the table-based and kNN-TD(λ) approaches. The eligibility trace parameter employed was λ = 0.8 (Singh & Sutton, 1996). Although eligibility traces cannot be
employed for MLP and LMT, these algorithms can benefit from the SARSA approach to learning
as well. The update rule in Q-learning depends on the best outcome in the next state, max_a Q(s′, a). In the early stages of learning, an unseen state-action pair might have the best
outcome. Unlike table-based approaches, when an MLP or LMT is updated, the whole function is updated, affecting the value of the aforementioned unseen state-action pair. Given that this state-action pair is not explored, it will have a floating value with no reference target. This condition forms a positive feedback loop, resulting in divergence of the function approximator. It is possible to slow down the learning by reducing the learning rate to avoid divergence, but this contradicts the purpose of function approximation, which is to increase learning speed. Another solution is to employ a learning approach similar to SARSA. In SARSA, learning is based on the agent’s actual actions and experiences. Therefore, the state-action pairs used for learning will always be based on real experience, eliminating the possibility of divergence.
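The difference between the two bootstrap targets can be made concrete with a minimal sketch; the tabular dictionary below stands in for any Q-function representation, and the names are illustrative.

```python
# Sketch contrasting the Q-learning and SARSA bootstrap targets. With a
# global function approximator, the Q-learning max can latch onto an
# unvisited (and therefore unanchored) action value; SARSA bootstraps
# only on the action the agent actually took.

def q_learning_target(reward, gamma, q_next):
    """Bootstrap on the best action value in the next state."""
    return reward + gamma * max(q_next.values())

def sarsa_target(reward, gamma, q_next, next_action):
    """Bootstrap on the action the agent actually selected."""
    return reward + gamma * q_next[next_action]
```

If `q_next` contains an inflated estimate for an unexplored action, the Q-learning target propagates it, whereas the SARSA target ignores it unless that action was actually chosen.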
5.2.1 Design of Function Approximation Approaches
The kNN-TD(λ) algorithm has a similar structure to the table-based RL with the added
generalization; therefore, similar design parameters were used. The kNN-TD(λ) centers were
placed in the middle of the discretization intervals defined for table-based RL. For the number of neighbours used in the weighted averaging, k, three cases with k = 2, 4, and 8 were considered, and the case with k = 4 was found to have the best learning speed.
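The recall step of kNN-TD(λ) can be sketched as a weighted average over the k nearest stored centers. The inverse-distance weighting below is an illustrative assumption; the thesis excerpt does not specify the kernel.

```python
# Sketch of kNN value recall: the value at a continuous state is a
# weighted average of the values of the k nearest centers. The
# 1 / (1 + d^2) weighting is an assumption for illustration.

def knn_value(state, centers, values, k=4):
    """Weighted average of the values of the k nearest centers to `state`."""
    dist2 = [sum((s - c) ** 2 for s, c in zip(state, ctr)) for ctr in centers]
    nearest = sorted(range(len(centers)), key=lambda i: dist2[i])[:k]
    weights = [1.0 / (1.0 + dist2[i]) for i in nearest]
    total = sum(weights)
    return sum(w * values[i] for w, i in zip(weights, nearest)) / total
```

Placing centers at the midpoints of the old discretization intervals, as the text describes, makes this a smooth generalization of the table lookup: querying exactly at a center with k = 1 recovers that center's value.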
In the MLP-based RL the added parameters are related to the MLP structure and training
of the MLP in each epoch. The number of hidden neurons in the MLP, after experimentation with different values, was chosen to be 20. The states were normalized based on their maximum and minimum so that inputs to the MLP remained confined to [0, 1]. For the training of the MLP after each epoch, the samples were split into 70% training data and 30% test data, and the Levenberg-Marquardt (Hagan & Menhaj, 1994) technique was employed. The training was terminated when the test data error did not improve after six successive iterations.
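The per-epoch training procedure can be sketched as follows. The Levenberg-Marquardt step itself is not reproduced; only the input normalization and the six-iteration early-stopping rule are shown, with illustrative names.

```python
# Sketch of the per-epoch MLP training loop: normalize inputs to [0, 1]
# and stop when the validation (test-split) error fails to improve for
# six successive iterations. The optimizer itself is a stand-in.

def normalize(x, lo, hi):
    """Scale a raw state variable into [0, 1] given its known range."""
    return (x - lo) / (hi - lo)

def train_with_early_stopping(val_errors, patience=6):
    """Return the iteration at which training stops, given the sequence
    of validation errors observed after each training iteration."""
    best, since_best = float("inf"), 0
    for i, err in enumerate(val_errors):
        if err < best:
            best, since_best = err, 0
        else:
            since_best += 1
            if since_best >= patience:
                return i
    return len(val_errors) - 1
```

The patience rule halts training as soon as six iterations pass without the held-out error improving, which keeps the network from overfitting the current epoch's samples.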
The LMT-based RL does not require any structural parameters besides the boundaries of
the inputs. The three LMT training parameters, , , and , were assigned to be 5, 0.01%,
and 0.005, respectively (Potts & Sammut, 2005).
5.2.2 Simulation Results
In addition to the table-based RL, the three presented RL approaches with function approximation
were applied to the test case above and trained for 2000 epochs. The four controllers were
evaluated in terms of design effort, computational needs, learning speed, and impact on freeway
performance.
5.2.2.1 Design Effort
The design effort for the table-based RL and kNN-TD(λ) was found to be comparable, and higher than that of the other two approaches (MLP and LMT), particularly because of the significance of discretization. It is worth noting that, for the same reason, the design effort of these two algorithms will increase exponentially with the problem size. The parameters related to Q-learning are well studied in the literature and do not require experimentation to achieve proper results.
For the number of hidden neurons in the MLP, three values were examined: 10, 20, and 40. The increase from 10 to 20 neurons resulted in better performance, but the increase to 40 neurons did not show a significant difference in overall performance. Note that the number of neurons cannot be generalized to more complex RL problems. The MLP's training parameters, such as the learning approach and learning rate, had a significant effect on the convergence of the training process. A lower learning rate can result in slow training, while higher values can quickly saturate the neural network weights, causing subpar RM performance; therefore, careful tuning is required.
The training of the LMT-based RLRM agent was found to be robust to the choices of LMT
parameters and significantly easier than the other approaches. Varying the parameters , ,
and around their suggested values did not have a significant effect on learning convergence
speed and RM performance. However, as expected, increasing and or decreasing would
result in larger tree size and therefore increased training computation time.
5.2.2.2 Computational Needs
The computation time of the case study in this research was dominated by the microscopic
simulation and it is safe to say that all approaches had a similar overall computation time. However,
it is helpful to discard the microsimulation processing time and analyze the pure training time of
different approaches, which is summarized in Table 5-4. The on-line training of table-based RL and kNN-TD(λ) allows efficient training with only new samples, whereas the batch training of LMT and MLP after each epoch results in higher training computational effort. It should be noted that the linear regression in the LMT is much faster than the MLP training method. The computation time
related to recall of the function approximator during decision-making is important for field
implementation where the computation power is limited. The numbers in Table 5-4 are specific to
this test case, and with increase in network size they are expected to increase linearly except for
the kNN-TD(λ), which is expected to increase exponentially.
5.2.2.3 Learning Speed
Since the simulation time is dominant compared with the function approximation training time, learning speed is measured in simulation epochs rather than computation hours. Figure 5-7
shows the average travel time at every epoch as agents learn. The learning speeds of the RM agents
with function approximation are significantly faster than the table-based RL. The LMT-based and
MLP-based approaches best utilize the training samples, resulting in very quick learning; however,
the MLP-based RL fails to achieve the highest performance obtained by the LMT-based approach.
Although the learning speed of kNN-TD(λ) is not as good as that of the MLP and LMT approaches,
its robust algorithm guarantees a relatively good performance after its learning is complete, as
shown at epoch 2000 of the figure. The slow learning speed of the table-based RL is evident; it has not completely converged after 2000 epochs and could yield better performance if trained for more epochs.
Figure 5-7 Learning speed and solution quality of the presented four RL approaches. The
curves above are obtained by averaging multiple epochs through a moving-average window for clarity. The actual results have significantly more variation from one epoch to another because of the stochastic nature of the microscopic simulation.
5.2.2.4 Transportation Network Performance
To compare the performance of the freeway with the different RLRM algorithms, the agents were set to exploit their learned knowledge after being trained for 2000 epochs. For each RLRM approach, the
network was simulated 15 times with different seed numbers to account for the stochastic
behaviour in the Paramics simulations. The results were averaged and are summarized in
Table 5-4. Average network travel time accounts for vehicles' travel time from origin to destination
including on-ramp travel time, if any. Average mainline travel time only includes the time vehicles
spend on the freeway mainline until reaching their destination. As expected, all the RM approaches
improved the network performance compared with the base case, with savings ranging from 23.7%
in the Table-based approach to 36.8% in the kNN-TD(λ) approach. The RLRM agents with kNN-
TD(λ) and LMT performed noticeably better than the Table and the MLP approaches. It is worth
noting that the LMT-based approach achieves performance similar to the kNN-TD(λ) approach although its learning speed is an order of magnitude faster. Furthermore, the learning time of kNN-TD(λ)-based agents is expected to increase exponentially with problem size, whereas it would be linear for LMT-based agents in the worst case. The limited performance of the MLP approach can be
attributed to the difficult choice of learning rates.
Table 5-4 Comparison of performance of different RLRM approaches

Computation Effort                                 Table    kNN-TD(λ)   MLP      LMT
Learning computation time per epoch (sec)          0.16     0.22        15.587   2.005
Recall computation time per control cycle (sec)    0.0001   0.00014     0.003    0.003

Performance Measures                    No RM   Table   kNN-TD(λ)   MLP     LMT
Average network travel time (min)       4:51    3:42    3:04        3:41    3:11
Average network travel time savings     -       23.7%   36.8%       24%     34.3%
Average mainline travel time (min)      4:45    2:16    2:10        2:16    2:12
Average mainline travel time savings    -       52.3%   54.4%       52.3%   53.7%
Average on-ramp waiting time (min)      0:45    11:14   6:57        11:09   7:43
5.3 Experiment III – Gardiner: Independent and Coordinated Ramp
Metering
Considering the experience and knowledge gained by applying RLRM to the single-ramp problem, the best of the proposed algorithms was applied to the Gardiner model, which exhibits the common challenges present in an RM application. This section discusses the design approach and
simulation results of applying the RLRM to the Gardiner Expressway.
5.3.1 RLRM Design for Coordinated Ramp Metering
In previous sections, the design of RLRM was focused on a single ramp while comparing different
approaches. Since some of these approaches were not very efficient, in this section we will focus
only on the best-performing approach.
Given the quick learning of the LMT-based RL and its strong performance, it has been considered the best learning approach. Additionally, the advantage updating approach presented
in section 3.4.4 is employed in conjunction with LMT function approximation. The advantage
updating isolates the effect of action on the reward of the agent from the value of future states,
thereby eliminating any bias in the function approximation.
5.3.1.1 Independent Agents
Independent agents will optimize their action according to the local reward that they receive.
Considering that the same agents would be coordinated later, their design was made with their
future coordination in mind. For each on-ramp agent a local area was defined which included the
on-ramp and sections of the mainline near the on-ramp. The agents' reward was defined as the total traffic leaving the section minus the traffic entering the section. Note that
leaving traffic includes both off-ramps and downstream traffic, and entering traffic includes both
on-ramp and upstream traffic. Figure 5-8 shows the location of entry and exit flows for each RM
agent. The green rectangles are exit flows and red ellipses are entry flows. Adding the individual
agents' rewards together gives the global reward for the whole network (total vehicles exiting minus total vehicles entering). Therefore, this reward definition, while suitable for
independent agents, also satisfies the necessary conditions for the coordinated multi-agent
algorithm proposed in Section 3.5.2, which would be sought later in this chapter. In addition to the
basic reward, a penalty term was also considered for mainline congestion to facilitate the avoidance
of congestion in the learning process.
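The additivity of this reward definition can be illustrated with two consecutive sections: the shared boundary mainline flow cancels when the local rewards are summed, leaving exactly the network-wide exits minus entries. All flow values below are illustrative.

```python
# Sketch of the local reward and its additivity: each agent's reward is
# traffic leaving its section minus traffic entering it. Summing over
# consecutive sections, the shared boundary flow cancels.

def local_reward(out_flows, in_flows):
    """Vehicles leaving the section minus vehicles entering it."""
    return sum(out_flows) - sum(in_flows)

# Two consecutive sections sharing a boundary mainline flow q_mid:
q_up, q_mid, q_down = 5000, 4800, 4600    # mainline flows (veh/hr)
r1 = local_reward([q_mid, 300], [q_up, 400])    # off-ramp 300, on-ramp 400
r2 = local_reward([q_down, 200], [q_mid, 500])  # off-ramp 200, on-ramp 500

# Global reward: all exits minus all entries; q_mid telescopes away.
r_global = (q_down + 300 + 200) - (q_up + 400 + 500)
```

This telescoping is what makes the local reward compatible with the coordinated multi-agent algorithm of Section 3.5.2: optimizing the sum of local rewards is the same as optimizing the global one.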
As discussed in Section 5.1.2.3, this reward definition requires more detailed state
information. Given that LMT size grows as necessary to fit the output and is not based on the
number of inputs, the size of the input state does not affect the learning performance of the LMT-
based RL agents. Therefore, all the variables needed for the complete state of traffic near an agent
are included in its state space definition. These variables are downstream density, downstream
speed, upstream density, upstream speed, ramp flow entering freeway, demand entering the ramp,
and ramp queue. All variables were calculated from loop detector measurements. The ramp queue is estimated
based on the number of vehicles present between the on-ramp detector and the signal detector
(refer to Figure 5-1).
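The queue estimate can be sketched as a running conservation count between the two detectors; function and variable names below are illustrative.

```python
# Sketch of ramp queue estimation by vehicle conservation between the
# on-ramp entrance detector and the signal detector: the queue is the
# running difference of the two detector counts, floored at zero.

def update_queue(queue, entered, released):
    """Add vehicles counted in at the ramp entrance, subtract vehicles
    counted out at the metering signal; the queue can never go negative."""
    return max(0, queue + entered - released)

q = 0
for entered, released in [(5, 2), (6, 2), (1, 4), (0, 6)]:
    q = update_queue(q, entered, released)
```

Flooring at zero guards against detector miscounts accumulating into a negative queue estimate.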
Figure 5-8 The schematic of the Gardiner showing the location of entry and exit flows for
each individual RM agent.
For the Gardiner on-ramps, the discrete release rates signal policy was employed because of the high demand from the on-ramps, which reaches 1600 veh/hr. The LMT-based RL does not require a learning rate, as the tree is rebuilt from new samples after each epoch. Unlike table-based RL, when states are continuous variables and an LMT is employed, visits to a certain state cannot be directly counted. Therefore, the action selection policy was defined based on the learning epoch rather than state visits. The action selection was ε-greedy, with ε decreasing linearly with every epoch to a value of 0.1 after 100 epochs.
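This epoch-based schedule can be written in one line; the linear form matches the description in the text, with the function name assumed for illustration.

```python
# Sketch of the epoch-based exploration schedule: epsilon decreases
# linearly from 1 to a floor of 0.1 over the first 100 epochs, then
# stays at the floor.

def epsilon_by_epoch(epoch, floor=0.1, decay_epochs=100):
    return max(floor, 1.0 - (1.0 - floor) * epoch / decay_epochs)
```

Unlike the visit-count rule used for the table-based agents, this schedule needs no per-state bookkeeping, which is what makes it workable with a continuous-state LMT.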
5.3.1.2 Limited Queue Space Consideration
Limited queue space is a challenge in RM applications. RM approaches that employ a mathematical model and optimize the metering rates can implement limited queue space as a constraint in the optimization process, so that actions which would cause queues to exceed the available space are avoided. In RL, physical constraints cannot be defined as hard constraints; however, they can be introduced into the problem as soft constraints through a penalty term. Over time, agents learn to balance the penalty of exceeding the constraint against the higher reward.
Considering the current queues and acceptable conditions for downtown Gardiner on-
ramps, the maximum queue capacity was defined as 150 vehicles. Therefore, a penalty term was
added to the agents’ reward when the queue exceeded 150 vehicles. Note that when the benefit
from higher throughput is more significant, agents may temporarily choose actions that result in
queues exceeding 150 vehicles. The penalty weight is the same as the penalty for mainline
congestion. The RLRM agent, therefore, had to strike a balance between avoiding mainline
congestion and queues extending above capacity.
5.3.1.3 Coordination of Multiple RLRM Agents
Coordination among the agents is achieved by sharing their states and Q-values. Each agent
augments its state space with the state variables of its neighbours. Additionally, agents consider
actions of their neighbours when building their Q-functions. The augmented states in conjunction
with joint action allow the agents to consider the effect of neighbours' action on their Q-values and
vice versa. Therefore, they can choose the action that benefits the neighbourhood instead of their
local reward. The only negative effect of coordination is the increased number of input variables
of function approximation because of the augmented state space. Although the LMT can handle the augmented state space, the increased input size requires more computation when fitting an LMT to the samples.
In cases with unlimited queue space, the coordination of RLRM agents does not have any
effect. In theory, each agent is trying to maximize its throughput, and its optimal action will result
in an uncongested freeway. Therefore, the optimal action of each agent is also optimal for the
whole network. However, in cases with limited queue space, RLRM agents have to decide between
extra queue penalty and mainline congestion penalty. In these conditions, coordination allows the
upstream agent to observe the penalty its neighbour is experiencing. To increase the
neighbourhood reward, upstream agents can reduce their ramp flow and free some road space for the downstream on-ramp.
The coordination of RLRM agents is performed for the case with limited queue space. The
Jameson on-ramp is considered as an independent agent, and the three downtown on-ramps are
coordinated. Figure 5-9 shows the coordination and communications between RLRM agents. Note
that each agent will only coordinate with its neighbours. Coordination with on-ramps farther than
the adjacent ones is also possible; however, in the Gardiner test case, it was found that coordinating
with two on-ramps on each side does not yield any significant improvement over coordinating with only the adjacent on-ramps.
5.3.1.4 Coordination for Queue Balance
Coordination of RLRM agents with limited queue space can improve the queue management and
allow agents to better utilize the queue space of adjacent on-ramps. This way, the downstream queue will start filling up first, and once it gets close to its limit, the upstream on-ramp will start to limit its ramp flow. However, there is no guarantee that all on-ramps will have the same
level of service. It is desirable to have the same level of service for users of different on-ramps to
discourage them from changing route in order to bypass the queue.
[Figure 5-9 schematic: four RLRM agents between Lakeshore and Spadina, each observing its local traffic state and issuing signal timings; neighbouring agents communicate traffic states, actions, and action selection negotiation.]
Figure 5-9 Communication between RLRM agents of the Gardiner.
Assuming the freeway mainline is not congested, the factor affecting travel time the most is the ramp queues. To achieve the same level of service among different on-ramps, the goal can be to equalize the ramp queues. To force the RLRM agents to equalize their queues, another penalty term is added to the agents' reward. This penalty is added when the queue of the downstream on-ramp is greater than the agent's own queue by 50 vehicles.
5.3.2 Simulation Results and Controller Evaluation
The proposed algorithms were applied to the Gardiner test case to evaluate their performance. It is
important to understand the current condition of the Gardiner and identify its limitations and
challenges. Section 5.3.2.1 provides a description of current conditions during the evening peak period and highlights the benefits of metering individual ramps. Sections 5.3.2.2 onward provide quantitative performance assessment and comparison of independent vs. coordinated multi-agent RLRM, where all ramps are metered concurrently, as well as a comparison with ALINEA.
5.3.2.1 Base Case and Performance Improvement via Local Metering of Individual Ramps
The evening peak period of the westbound Gardiner Expressway was utilized as the test case
for evaluation of the proposed algorithms. The Gardiner is one of the main arteries out of
downtown Toronto in the westbound direction during the evening commute. The demand to enter
the Gardiner from the three on-ramps in the downtown area exceeds 4000 veh/hr. This demand
when added to the traffic flow on the mainline from further upstream, surpasses the freeway capacity of approximately 6000 veh/hr. The demand from the Jameson on-ramp, which is also used for transferring from Lakeshore Boulevard to the Gardiner, averages 1000 veh/hr. The ramp itself is very short, creating significant traffic turbulence and merging hazards. Therefore, the City
of Toronto closes the ramp entirely from 15:00 to 18:00 every day. Table 5-5 shows the demands
downstream of the four freeway on-ramps. These numbers are based on the calibrated OD
matrices. Note that the demand does not necessarily mean the amount of traffic that will pass those
locations. In fact, when demand exceeds capacity congestion occurs in the bottleneck. In the
demands shown here the closure period of the Jameson on-ramp is considered.
Table 5-5 Demand (veh/hr) for accessing the freeway mainline downstream of each on-ramp
1‐2 pm 2‐3 pm 3‐4 pm 4‐5 pm 5‐6 pm 6‐7 pm 7‐8 pm 8‐9 pm
Jarvis 3472 4027 3899 4096 4066 3507 2906 2533
York 3883 4661 4515 4772 4614 3771 3318 3116
Spadina 5122 6088 6098 6240 6052 5134 4499 4073
Jameson 5688 6710 5694 5912 5607 5818 4985 4429
As can be seen from the table, demand significantly exceeds capacity for the Jameson on-
ramp stretch in the interval between 14:00 and 15:00. Similarly, the demand for the Spadina on-
ramp is significantly high from 14:00 to 18:00 and peaks in the interval between 16:00 and 17:00.
The space-time diagram of speed in Figure 5-10 shows the formation of congestion at the bottlenecks. Although the mainline demand at Spadina in the 14:00 to 15:00 interval is not much higher than capacity, the congestion building up at Jameson propagates upstream and accelerates congestion upstream of Spadina. The demand from the Jameson zone after 18:00, when Jameson reopens, is less than capacity, but the vehicles that entered the freeway earlier are stuck in congestion and trigger another bottleneck at the Jameson on-ramp after 18:00.
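The over-capacity intervals described above can be read off Table 5-5 programmatically; a quick sketch, assuming the roughly 6000 veh/hr capacity quoted earlier and hourly intervals starting at 13:00:

```python
# Demands from Table 5-5 (veh/hr), hourly intervals from 13:00 to 21:00.
demand = {
    "Jarvis":  [3472, 4027, 3899, 4096, 4066, 3507, 2906, 2533],
    "York":    [3883, 4661, 4515, 4772, 4614, 3771, 3318, 3116],
    "Spadina": [5122, 6088, 6098, 6240, 6052, 5134, 4499, 4073],
    "Jameson": [5688, 6710, 5694, 5912, 5607, 5818, 4985, 4429],
}
CAPACITY = 6000  # approximate freeway capacity (veh/hr) stated in the text

def over_capacity_hours(flows, capacity=CAPACITY):
    """Return the start hours (24 h clock) of intervals where demand exceeds capacity."""
    return [13 + i for i, q in enumerate(flows) if q > capacity]

# Jameson exceeds capacity only in the 14:00-15:00 interval (6710 veh/hr),
# while Spadina exceeds it in every interval from 14:00 to 18:00.
```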
5.3.2.1.1 Jameson Ramp Metering
The Jameson on-ramp has the most significant effect on the Gardiner congestion. Although the on-
ramp is closed from 15:00 until 18:00, the congestion formed before 15:00 has a lasting effect until
the ramp reopens at 18:00. If the congestion at Jameson can be avoided, the freeway performance
will be greatly improved. Metering the Jameson on-ramp will prevent the freeway from breaking
down and congestion from propagating upstream, without the need to close the on-ramp. It is
noteworthy that, although full ramp closure can be viewed as the most aggressive metering
allowing zero entries, the closure starts late after congestion is already triggered, and when closure
is in effect, the freeway becomes underutilized, i.e. closure is not an optimal control method.
Figure 5-11 shows the freeway throughput after the Jameson on-ramp. As can be seen, the freeway
throughput between 14:00 and 15:00 with independent ramp metering is higher than in the base
case. After closure of the on-ramp, throughput drops to about 5600 veh/hr, which results in
underutilization of the freeway. This extra space is available because of the vehicles exiting the
freeway through the Dunn off-ramp.
Figure 5-10 Colour-coded space-time diagram of base case traffic speed.
It is important to see how many vehicles have taken the on-ramp in the two cases as the
ramp metering is often criticized for sacrificing on-ramp users for the benefit of through traffic.
Figure 5-12 shows the number of vehicles that have taken the Jameson on-ramp. Although ramp
metering has caused almost half the vehicles to reroute to Lakeshore between 14:00 and 15:00, the loss is compensated by the vehicles entering the freeway between 15:00 and 18:00. In fact, 6122
vehicles are served through Jameson on-ramp in the ramp metering case compared with 5142
vehicles in the base case.
Figure 5-11 Freeway throughput after the Jameson on-ramp in the base case and with
ramp metering.
Figure 5-12 Comparison of the Jameson on-ramp traffic flow in the base case and with
independent ramp metering.
5.3.2.1.2 Spadina Ramp Metering
Spadina is another critical on-ramp and bottleneck in the evening peak period. Figure 5-13
shows freeway throughput after the Spadina on-ramp. In the base case, throughput is significantly
lower than capacity during the 14:00 to 15:00 interval. This capacity loss is because of the Jameson
bottleneck, which spreads to Spadina. However, even after 15:00, when Jameson is closed and the congestion downstream of Spadina has cleared, the throughput hardly reaches 6000 veh/hr. Employing ramp metering increases throughput by about 5%, which is in agreement with the capacity drop caused by congestion.
Figure 5-13 Freeway throughput after the Spadina on-ramp in the base case and with
independent ramp metering.
5.3.2.2 Concurrent Multiple Independent Agents
Independent RLRM agents (named RLRM-I) were trained and evaluated with the Gardiner model
and compared with ALINEA as well as the base case scenario. Considering that the Jameson
bottleneck in the 14:00 to 15:00 interval causes the most congestion, one might suggest extending
the Jameson closure period to include 14:00 to 15:00. This case is also evaluated and called
Jameson2pmClose in this document. The four scenarios were simulated with 15 different seed
numbers to represent traffic variation on different days. Figure 5-14 shows the total vehicle hours
traveled for the whole network (TTT) as well as the freeway mainline only (TTTml) for the four
different scenarios. As expected, eliminating the Jameson bottleneck by closing it earlier
significantly improves freeway performance. However, the bottleneck at Spadina still contributes to congestion; hence, TTTml remains high and varies across different simulation runs. This variation across runs translates to unreliable travel times, which is always a concern in transportation networks. Both ALINEA and RLRM-I properly eliminate congestion and result in a TTTml that is essentially the same across all simulations. However,
ALINEA is not as efficient as RLRM-I in utilizing the freeway capacity, and results in significantly
higher TTT. The RLRM-I controller produces a 48% reduction in TTT.
Figure 5-14 Freeway performance for four different scenarios. The error bars show the standard deviation across different simulation runs.
Although RM improves the freeway performance, it is important to monitor its effect on
the on-ramp users. For the Jameson on-ramp, the excess demand is rerouted to Lakeshore
Boulevard and results in higher travel time for those vehicles. Figure 5-15 shows the average travel
time that vehicles originating from the Jameson zone have experienced. As can be observed, taking the Lakeshore instead of the Gardiner results in a 4-min increase in travel time. Closing the ramp from 15:00 to 18:00 is not enough, yet closing it from 14:00 to 18:00 is too restrictive.
Ramp metering essentially acts as an adaptive ramp closure. Ramp metering limits the ramp access
when there is high demand and keeps the ramp open when demand is low. Inevitably, ramp
metering would result in higher travel time for Jameson on-ramp users compared with no ramp
metering. However, optimal ramp metering imposes the minimum additional travel time compared
with any pre-timed approach.
Figure 5-16 shows the time-space diagram of traffic speed for the Jameson2pmClose case
against the ramp metering scenario. Employing ramp metering with an RLRM-I controller
completely eliminates congestion from the freeway. In the no ramp metering case, as expected,
when demand exceeds capacity the freeway breaks down. Figure 5-17 shows the queue on the
three downtown on-ramps throughout the simulation period. Although ramp metering would result
in longer queues for the Spadina on-ramp, eliminating congestion allows free-flow traffic movement on the York and Jarvis on-ramps. In the Jameson2pmClose case, as congestion builds up at Spadina, it blocks the entrances from upstream on-ramps and causes queues at the York and Jarvis on-ramps. In these runs, no limit was imposed on the Spadina queue; therefore, when ramp metering is employed, the queues at Spadina exceed the 150-vehicle limit, and in the ALINEA case they reached as many as 500 vehicles. Queue management is introduced later in the chapter.
Figure 5-15 Average experienced travel time of vehicles starting from the Jameson zone
until the end of the network in the west.
Figure 5-16 Time-space diagram of traffic speed for RLRM-I (left) and Jameson2pmClose
(right).
Figure 5-17 Queues for the three on-ramps throughout the simulation period.
The average travel time from different origins in downtown to the west end of the network
at Humber Bay is shown in Figure 5-18. The effect of ramp metering is clear in the travel times
from origins upstream of Spadina (DVP, Jarvis on-ramp, and York on-ramp), which are all at free-flow travel times with RM, whereas travel times of trips originating from the Spadina on-ramp are significantly higher. The results are the opposite in the Jameson2pmClose case, as expected: there, travel times increase as we move upstream of Spadina. Comparing RLRM-I with ALINEA,
travel times are the same except for trips from the Spadina on-ramp. The lower travel time for
RLRM-I case shows its better efficiency in terms of utilizing the freeway space and allowing more
traffic to enter the freeway from the Spadina on-ramp. It is important to note that the high base-case travel times for Spadina on-ramp trips are caused by the Jameson bottleneck. Since ramp metering eliminates the congestion caused by Jameson, the overall travel time from the Spadina on-ramp is lower in the RLRM-I case.
Finally, the above analyses answer the fundamental question of whether the overall system gain in terms of TTT improvement justifies longer waits on the on-ramps. In other words:

Are the on-ramp travellers sacrificed to improve overall TTT and flow on the main freeway, a gain primarily experienced by upstream through traffic?

Would the time lost waiting on the on-ramps under RM be regained through faster travel after getting on the freeway?

Our conclusions are:

Waiting at the upstream on-ramps (Jarvis and York) under RM is well worth it for those travellers: not only does the overall system benefit in terms of least TTT, but travellers from those on-ramps also benefit in terms of faster journey times.

The above is not necessarily the case for the Spadina travellers. Waiting on the on-ramp under RM, although it benefits the overall system in terms of TTT, results in longer travel
times for the Spadina travellers under ALINEA, and the same travel time under RLRM-I. This indicates that the Spadina travellers inequitably bear the burden of improving the system TTT and the travel times of upstream travellers. This motivates the question of whether a better queue management approach in conjunction with RM could even out the ramp wait burden across all ramps, such that not only the overall system TTT improves but also the travel times for on-ramp travellers; this is addressed using coordinated agents later in this chapter.
Figure 5-18 Average travel time for trips originating during 4-5 pm from origins in the downtown to the west end of the network for the four scenarios.
5.3.2.3 Independent Agents with Limited Queue Space
Metered on-ramps require queue management to ensure that excessive queues do not affect nearby arterials. The ALINEA controller can be augmented with a queue override algorithm (named ALINEAwQO), which increases the ramp flow when the queue reaches its predefined limit. In the RL-based algorithm, constraints on the queue are implemented through a penalty imposed on the agent when queues exceed a certain limit (named RLRM-IwQO). Figure 5-19 shows the queues on the Spadina and York on-ramps. In the RLRM-I case queues exceed 250 vehicles, but
in the RLRM-IwQO case queues are much lower and do not exceed 100 vehicles.

Figure 5-18 data: travel time (min) from downtown origins to Humber Bay
                 Base Case  Jameson2pmClose  ALINEA  RLRM-I
DVP                  14.34            11.20    6.47    6.34
Jarvis on-ramp       16.65            10.78    6.33    6.19
York on-ramp         16.27            11.09    5.60    5.47
Spadina on-ramp      12.13             9.02   14.14   11.79

Although in
ALINEAwQO ramp flows are strictly enforced so that queues should not exceed the limit, the Spadina queues exceeded 150 vehicles. The reason ALINEAwQO is unable to manage the queues is the very high demand from the Spadina zone: when the freeway breaks down, even keeping the ramp completely open cannot accommodate the 1600 veh/hr peak demand. The RL algorithm can anticipate this phenomenon through the penalties and keep the queues at a manageable level. The effect of the Spadina queues reaching capacity can be seen in the York on-ramp queues, as the mainline congestion reaches that far upstream.
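For reference, the ALINEA law and the kind of queue override described above can be sketched as follows (the gain, bounds, occupancy units, and the specific override formula are illustrative assumptions rather than the implementation used in this thesis):

```python
def alinea(r_prev, occ_meas, occ_target, K_R=70.0, r_min=200.0, r_max=1700.0):
    """One step of the ALINEA feedback law:
    r(k) = r(k-1) + K_R * (target occupancy - measured occupancy),
    with the metering rate bounded to a feasible range (veh/hr).
    Occupancies here are in percent; K_R and the bounds are assumed values."""
    r = r_prev + K_R * (occ_target - occ_meas)
    return max(r_min, min(r_max, r))

def queue_override(r_alinea, queue, queue_limit, arrival_flow, interval_s=60.0):
    """Hypothetical queue override: if the ramp queue exceeds its limit, release
    at least enough vehicles over the next control interval to bring the queue
    back to the limit, given the estimated arrival flow (veh/hr)."""
    if queue > queue_limit:
        r_needed = arrival_flow + (queue - queue_limit) * 3600.0 / interval_s
        return max(r_alinea, r_needed)
    return r_alinea
```

The override simply takes the larger of the feedback rate and the rate needed to hold the queue at its limit, which is why, under very high ramp demand like Spadina's, even the override cannot keep the queue bounded.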
Figure 5-19 On-ramp queues for the RM algorithms which consider limited queue capacity.
Figure 5-20 depicts the time-space diagram of traffic speed for RLRM-IwQO and ALINEAwQO. The figure shows that the congestion at Spadina starts sooner in RLRM-IwQO than in ALINEAwQO, which suggests the RLRM-IwQO algorithm acts more conservatively to make sure the queue will not exceed the limit.
Figure 5-21 shows the performance of the Gardiner freeway with different control
algorithms. Given the limited queue capacity, RLRM-IwQO performs worse than RLRM-I in
terms of overall freeway performance. Similarly, looking at TTTml, it is clear that ramp metering with limited queue space cannot eliminate congestion as effectively as in the case of unconstrained on-ramp queues. Nevertheless, RLRM-IwQO outperforms ALINEAwQO and is significantly better than the no-control cases.
Figure 5-20 Time-space diagram of traffic speed for algorithms with limited queue space.
Figure 5-21 Freeway performance under ramp metering with limited queue capacity.
Figure 5-22 shows the travel time from different origins to the west end of the network. In the Jameson2pmClosed case the travel times for upstream origins increase as the freeway becomes more congested. In the RLRM-I case travel times from all origins are at free flow, except those from Spadina, which are significantly higher because of the wait behind the on-ramp queue. In the ALINEAwQO case, the upstream travel times are initially at free flow until the Spadina queue
reaches its limit. As a result, the freeway mainline becomes congested and queues start to build up
on the York on-ramp. Similar conditions occur in RLRM-IwQO; however, the congestion starts slightly sooner and the queues on the Spadina on-ramp are shorter. Although the travel times for all origins in RLRM-IwQO are more or less identical, it should be noted that this is coincidental: under different demands the travel times will not necessarily be similar, as the agent does not directly equalize travel times.
Figure 5-22 Travel times from different locations to the west end of the network.
5.3.2.4 Coordinated Agents
Independent RLRM agents are very effective in maximizing freeway performance as long as the
queues are not limited. However, when the queue reaches its limit, the agent loses control over
freeway congestion and the freeway breaks down. Coordination of agents allows upstream agents
to observe the condition of downstream on-ramps and cooperate to prevent the freeway from
breaking down when downstream on-ramps are full. The coordinated RLRM agents with limited
queue space (named RLRM-C) were implemented and evaluated in the Gardiner model. Additionally, a heuristic coordination of ALINEA based on linked control (named ALINEAwLC) was also implemented and evaluated. In ALINEAwLC, each upstream on-ramp observes the immediate downstream on-ramp and, if the downstream queue exceeds a certain threshold, the upstream on-ramp tries to equalize its queue with the downstream one.
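A minimal sketch of such a linked-control adjustment (the gain, the proportional form, and the default threshold are illustrative assumptions, not the exact ALINEAwLC rule):

```python
def linked_control_rate(r_local, own_queue, downstream_queue,
                        activation_threshold=100, K_q=10.0):
    """Hypothetical linked control: once the downstream on-ramp's queue exceeds
    its activation threshold, reduce the upstream metering rate (veh/hr) in
    proportion to the queue difference so that the two queues tend to equalize.
    K_q (veh/hr per queued vehicle of difference) is an assumed gain."""
    if downstream_queue > activation_threshold and downstream_queue > own_queue:
        return max(r_local - K_q * (downstream_queue - own_queue), 0.0)
    return r_local
```

Restricting the upstream rate shifts arriving vehicles into the upstream queue, so the downstream queue drains while the upstream one grows toward it.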
Figure 5-23 compares freeway performance for coordinated RM and other approaches. The
RLRM-C algorithm achieves similar performance to RLRM-I in minimizing TTT. Furthermore,
the TTTml value and standard deviation show that freeway congestion is kept well under control
in the RLRM-C case. The ALINEAwLC algorithm, although somewhat successful in managing
congestion, cannot improve freeway performance as it achieves similar TTT to the
Jameson2pmClose case.
Figure 5-23 Freeway performance for coordinated RM approaches.
Figure 5-24 shows the queue at the Spadina and York on-ramps for RLRM-IwQO and
RLRM-C. The queues at the Spadina on-ramp for both approaches are more or less the same.
However, in the RLRM-C case the York on-ramp queue increases slightly early in the simulation. This is because the York agent observes the downstream traffic condition and takes proactive measures to ensure the Spadina on-ramp will not reach its queue limit. The coordinated RLRM can maintain the queues within their limits without letting the mainline become congested, efficiently using the available queue storage space on all on-ramps to manage the
freeway. Figure 5-25 shows the travel times from different origins to the west end of the network
for the RLRM-C case. Given the higher queues of the Spadina on-ramp, the travel times for trips
originating from Spadina are higher than for other origins during the rush hour.
Figure 5-24 On-ramp queues of coordinated and independent RLRM agents with limited
queue space.
Figure 5-25 Travel times from different locations to the west end of the network in the
RLRM-C case.
Figure 5-26 shows the average travel time from different origins in downtown to the west
end of the network at Humber Bay for the three RL-based approaches and compares them with the
base case. While in RLRM-I with unlimited queue the travel time for trips originating from
Spadina is much higher than other origins, in RLRM-IwQO with limited queue travel times are
very close. However, the travel time savings of Spadina is much less compared to the increased
travel time for the three other origins, which shows the reduced performance by introducing the
limited queue space. As can be seen from the travel time of the RLRM-C case, coordination of
RLRM agents can prevent mainline congestion, which is evident in the travel time of upstream
origins, while maintaining queue limit, which is evident from Spadina travel time. Effectively,
RLRM-C reduces the Spadina travel time compared with RLRM-I, while not imposing much extra
travel time to upstream origins. Therefore, it results in reasonable travel time variation, while
maximizing the network performance.
Figure 5-26 Average travel time for trips originating during 4-5 pm from origins in the downtown to the west end of the network at Humber Bay for different independent and coordinated RLRM approaches.
Figure 5-26 data: travel time (min) from downtown origins to Humber Bay
          Base Case  Jameson2pmClose  RLRM-I  RLRM-IwQO  RLRM-C
DVP           14.34            11.20    6.34       9.32    6.70
Jarvis        16.65            10.78    6.19       8.81    6.57
York          16.27            11.09    5.47       8.75    6.75
Spadina       12.13             9.02   11.79       9.47    8.68

Although ALINEAwLC did not improve TTT, it is interesting to see how it performed in terms of keeping the queues equal. Figure 5-27 shows the queues at the three downtown on-ramps.
As can be seen from the figure, queues of upstream on-ramps follow the queue of their immediate
downstream on-ramp.
Figure 5-27 Downtown on-ramp queues with ALINEAwLC control algorithm.
5.3.2.4.1 Coordination for Queue Balance
In the RLRM-C case, it has been shown that the proposed algorithm can optimally handle limited
queue space without incurring congestion on the freeway. However, it does not seek equity among different users. Given that the optimal solution is to meter the downstream on-ramp intensively, the users of that on-ramp experience the longest travel time. Although the goal is to equalize the travel times of different on-ramps to provide the same level of service, tracking the travel time of vehicles through loop detectors is not a simple task. Furthermore, formulating an RL system that directly equalizes the travel times of different drivers would be very complicated. As an approximation of the problem, we considered equalizing the queues of different on-ramps: a penalty is imposed on each agent when the queue of the downstream on-ramp is greater than its own queue by more than 50 vehicles. The coordinated RLRM agents in the queue equalization case are named RLRM-CwQE.
Figure 5-28 shows the on-ramp queues and travel times for the RLRM-CwQE case. Although the
queues are very similar, the travel times of trips originating from on-ramps are not the same. The
first cause of the variation is the congestion on the freeway, which can be seen in the travel time
from DVP to the west end. The presence of congestion shows that if the agents are forced to keep similar queue levels, freeway performance is sacrificed significantly. The second factor is the average on-ramp entry flow to the freeway, shown in Figure 5-29. The time each vehicle spends in the queue equals the queue length at the moment the vehicle joins the queue divided by the average flow entering the freeway. Even if the queues are the same, the average flow can significantly affect the time that vehicles wait on the ramp, and hence the travel times differ between on-ramps.
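This queue/flow relation can be made concrete with a small illustration (the numbers are hypothetical):

```python
def ramp_wait_minutes(queue_at_arrival, entry_flow_vph):
    """Approximate ramp wait: the queue length (vehicles) when a vehicle joins,
    divided by the average flow entering the freeway (veh/hr)."""
    return queue_at_arrival / entry_flow_vph * 60.0

# Two on-ramps with identical 100-vehicle queues but different entry flows:
# at 400 veh/hr the wait is 15 minutes, at 1200 veh/hr only 5 minutes,
# so equal queues do not imply equal waits.
```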
Figure 5-28 On-ramp queues (left) and travel times (right) for the RLRM-CwQE case.
5.3.3 The Gardiner Test Case Summary
For the Gardiner test case, nine scenarios were examined. The network TTT and TTTml are
summarized in Figure 5-30. The Jameson on-ramp is a very critical on-ramp, and closing it one hour earlier than in the base case reduces TTT by 36%. However, this closure timing is specific to the demand used in this model; in practice, each demand scenario would require a separate closure timing for the Jameson on-ramp to minimize the negative effects of closure. Metering the Jameson on-ramp is effectively a more refined version of its closure: the meter adaptively closes or opens access to the freeway depending on the traffic condition, while maximizing freeway throughput.
Figure 5-29 Downtown on-ramp flows entering the freeway for the RLRM-CwQE case.
The ALINEA algorithm is fairly robust and can be implemented with minimal design
effort. It can be simply augmented with heuristic algorithms to handle limited queue space, and it can even be coordinated with neighbouring on-ramps to utilize the available on-ramp queue storage space. Nonetheless, its performance is limited in more demanding problems, and the heuristic augmentations cannot properly handle the Gardiner Expressway test case.
The RL-based RM approaches learn from direct interaction with the environment;
therefore, they are able to maximize their performance. The independent agents that do not
consider queue limits manage the freeway congestion very efficiently and reduce TTT by 48%
compared with the base case. By accepting some congestion on the freeway, independent agents
can handle problems with limited queue storage space; in the Gardiner test case, the TTT reduction for independent agents with limited queues was 45% compared with the base case. Coordinating adjacent RLRM agents can efficiently utilize all the queue storage space to eliminate the freeway congestion while maintaining queues within their limits. The coordinated RLRM approach could match the performance of independent RLRM agents, reducing TTT by 50%, i.e. attaining the same performance as unconstrained independent RLRM agents despite the consideration of limited queue space.
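The quoted savings follow directly from the TTT figures reported in this chapter; a quick check:

```python
# Total travel time (veh*hr) for the Gardiner scenarios reported above.
TTT = {"BaseCase": 10276, "RLRM-I": 5360, "RLRM-IwQO": 5665, "RLRM-C": 5104}

def ttt_reduction_pct(scenario, base="BaseCase"):
    """Percentage reduction in TTT relative to the base case, rounded."""
    return round(100.0 * (TTT[base] - TTT[scenario]) / TTT[base])

# RLRM-I: 48%, RLRM-IwQO: 45%, RLRM-C: 50% -- matching the figures in the text.
```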
Figure 5-30 Summary of the performance of the nine scenarios for the Gardiner test case.
Although implementing a centralized RL-based ramp metering system is impractical, the outcomes of the above scenarios suggest that such a centralized system would not perform significantly better than the proposed coordination of local agents. While congestion was apparent with uncoordinated on-ramps and limited queues, coordination of adjacent on-ramps effectively eliminated the congestion and raised the TTT saving from 45% to 50%. Although coordination of on-ramps beyond their immediate neighbours is expected to improve overall system performance, the extra saving is not expected to be significant; tests coordinating each on-ramp with two upstream on-ramps showed no significant improvement.
The optimal RLRM did not provide equal travel times for the different on-ramps. As an alternative, agents were forced to balance their queues. Forcing the queues to be similar resulted in significantly lower system performance. Additionally, given that ramp wait time depends on both the queue and the on-ramp flow, the travel times were not homogenized even though the queues were identical.
Figure 5-30 data: total time spent on network (veh·hr)
                  TTT    TTTml
Base Case        10276   6998
Jameson2pmClose   6533   5411
ALINEA            6290   4198
ALINEAwQO         6141   4729
ALINEAwLC         6660   4601
RLRM-I            5360   4147
RLRM-IwQO         5665   4768
RLRM-C            5104   4249
RLRM-CwQE         6779   4710

Revisiting the question of whether on-ramp users will be sacrificed for the benefit of mainline users, it can be said that, in order to achieve optimal network performance, it is inevitable
that the downstream on-ramp users will experience higher travel times than users of upstream origins. However, RLRM-C can provide reasonable travel times for all users. In fact, coordination of on-ramps, while imposing a limit on the queues, guarantees a minimum level of service for all users: given that the freeway mainline will not be congested and the queues will not exceed a certain limit, the minimum level of service can be quantified. The same level of service cannot be guaranteed in the base case without ramp metering, as congestion on the mainline degrades performance.
6 Conclusions and Future Work
Ramp metering is the most direct and effective freeway traffic control measure and is widely
employed throughout the world. Local RM algorithms applied to independent on-ramps can be
very efficient as long as there is no limit on the queue storage space. Practically, however, to
prevent the queue from exceeding the pre-specified limit, simple RM algorithms prioritize queue
management over freeway traffic management; therefore, the benefits of RM quickly diminish.
Availability of multiple closely spaced on-ramps provides the opportunity to coordinate multiple
on-ramps and utilize the queue storage space of all ramps to prevent congestion more effectively.
Heuristic approaches cannot exploit the full potential in the coordination of multiple on-ramps.
Model-based optimal control approaches can theoretically find the best metering policy. However,
their computational complexity increases exponentially with the network size, and they become impractical even for moderately sized networks comprising a few on-ramps.
In this research, a decentralized and coordinated optimal RL-based ramp metering system
is presented. Individual RLRM agents can act on their own based on their local measurements to
maximize their reward (minimizing the local total travel time). Furthermore, agents can coordinate
their actions with their neighbours to maximize their collective reward rather than only their
individual reward. The decentralized structure allows simple scalability to any problem size.
Additionally, agents seek optimality whether they are acting independently or coordinated.
Therefore, the system would function reliably in the event of communication failure. The RLRM
agents employ function approximation to represent continuous state variables directly. The move
from discrete states to continuous states can significantly improve learning speed through
generalization of information. It also eliminates the trade-offs associated with discretizing
continuous variables. Furthermore, the learning time of the agents does not grow exponentially with the number of measurement variables.
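As an illustration of the k-nearest-neighbour flavour of function approximation examined in this work, a continuous state's Q-value can be estimated by distance-weighted averaging over stored discrete entries (the weighting scheme and data layout here are illustrative assumptions, not the thesis's exact formulation):

```python
import math

def knn_q(state, action, q_table, k=3):
    """Estimate Q(state, action) for a continuous state by averaging the stored
    Q-values of the k nearest discrete states for the given action, weighted by
    inverse Euclidean distance. q_table maps (state_tuple, action) -> Q-value."""
    entries = [(s, q) for (s, a), q in q_table.items() if a == action]

    def dist(s):
        return math.sqrt(sum((si - xi) ** 2 for si, xi in zip(s, state)))

    nearest = sorted(entries, key=lambda e: dist(e[0]))[:k]
    weights = [1.0 / (dist(s) + 1e-6) for s, _ in nearest]
    return sum(w * q for (_, q), w in zip(nearest, weights)) / sum(weights)
```

Because nearby discrete entries share their learned values, an update at one state generalizes to its neighbours, which is the source of the learning-speed improvement noted above.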
Two microscopic simulation models were developed as test cases for the training and
evaluation of the proposed algorithms. The locations of the test cases were carefully chosen so that
they highlighted ramp metering effectiveness and challenges. The driver behaviour parameters as
well as the dynamic demand of the models were meticulously calibrated to match the traffic
dynamics and congestion patterns of the real freeways. The first model was a section of the Highway 401 eastbound collector at Keele Street. This model is effectively a network with a single on-ramp and was used for extensive experiments with different aspects of the RL-based RM.
Additionally, RLRM algorithms with different function approximation approaches were evaluated
with the Highway 401 model to identify the most suitable approach. The second model was the
westbound direction of the Gardiner Expressway. The Gardiner model includes different types of
on-ramps and is an excellent testbed for evaluation of the RM algorithms. This model was used
for evaluation of the coordinated RLRM algorithms and comparison with independent RLRM
approaches as well as the well-known ALINEA algorithm.
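For reference, ALINEA (Papageorgiou et al., 1991b) is a local feedback law that adjusts the metering rate in proportion to the deviation of the downstream occupancy from a set-point. A minimal sketch follows; the gain, set-point, and rate bounds are illustrative values, not the calibrated settings used in this thesis:

```python
def alinea_step(prev_rate, occ_out, occ_target=18.0, K_R=70.0,
                r_min=200.0, r_max=1800.0):
    """One control cycle of the ALINEA feedback law.

    r(k) = r(k-1) + K_R * (o_hat - o_out(k)), with occupancies in
    percent and rates in veh/h, clipped to the feasible metering
    range.  All numeric defaults here are illustrative.
    """
    rate = prev_rate + K_R * (occ_target - occ_out)
    return max(r_min, min(r_max, rate))
```

With downstream occupancy above the set-point, the rate is reduced; below it, the rate is raised until the upper bound is reached.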
6.1 Major Findings
The conventional RL approaches with discrete states can be applied to RM problems. However,
they require a very large number of training epochs even for the simplest RLRM design.
The simplest design with about 80 states and 7 actions needed more than 1000 epochs (simulation
runs) to converge to optimal Q-values. Therefore, these approaches are not suitable for more
sophisticated RLRM designs and larger problems. It was also found that for single-ramp problems
(independent agents) defining the agent’s reward as freeway throughput in conjunction with a
penalty for mainline congestion will result in an efficient agent that minimizes TTT.
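This reward design (throughput minus a mainline congestion penalty) can be sketched as follows; the critical density and penalty weight below are placeholders, not the calibrated values of this thesis:

```python
def reward(outflow_veh, mainline_density, critical_density=30.0,
           congestion_penalty=50.0):
    """Reward for an independent RLRM agent: freeway throughput
    (vehicles discharged in the control cycle) minus a penalty that
    grows with how far the mainline density exceeds its critical
    value.  Threshold and weight are illustrative placeholders."""
    r = float(outflow_veh)
    if mainline_density > critical_density:
        r -= congestion_penalty * (mainline_density - critical_density)
    return r
```

Maximizing this reward keeps the mainline near capacity flow, which is what makes the agent minimize TTT in practice.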
Function approximation can significantly improve the learning speed. The easiest and most
reliable approach for dealing with continuous variables in RLRM is the generalization of the
discrete states into continuous states through averaging based on k-nearest neighbours. This
approach is directly based on the solid foundations of RL with discrete states. Despite improving
the learning speed significantly, it suffers from the same issues as the conventional RL, namely
the curse of dimensionality and discretization trade-off. MLP and LMT are far more efficient
function approximators compared to averaging based on k-nearest neighbours. Although MLP has
been extensively used in the literature for function approximation in RL, it introduces several new
design parameters, which are not trivial to define. LMT breaks the state space into several sections
and uses a linear model in each section. The linear models are fitted with the least squares method; therefore, LMT can effectively handle the measurement noise in the stochastic environment
of freeway traffic problems. Additionally, the parameters associated with LMT do not have much
effect on the learning performance and only affect the number of sections in the tree.
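The k-nearest-neighbour generalization described above can be sketched as a distance-weighted average over stored discrete state centres; the table layout and function name below are illustrative:

```python
import heapq

def knn_q(state, action, table, k=3, eps=1e-6):
    """Estimate Q(state, action) as a distance-weighted average of the
    k nearest stored discrete state centres.  `table` maps
    (state_tuple, action) -> Q value.  A minimal sketch of the
    k-nearest-neighbour generalization, not the thesis implementation."""
    # Euclidean distance to every stored centre holding this action
    cands = [(sum((si - ci) ** 2 for si, ci in zip(state, c)) ** 0.5, q)
             for (c, a), q in table.items() if a == action]
    nearest = heapq.nsmallest(k, cands)
    # inverse-distance weights; eps avoids division by zero
    weights = [1.0 / (d + eps) for d, _ in nearest]
    return sum(w * q for w, (_, q) in zip(weights, nearest)) / sum(weights)
```

Queries between stored centres blend their Q-values, which is the source of the learning speed-up over a purely tabular lookup.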
RLRM with LMT function approximation was applied to the Gardiner test case.
Independent RLRM agents with unlimited queue space outperformed ALINEA and reduced the
TTT by close to 50% compared with the base case with no RM. As expected, limiting the queue
space had a negative impact on RM performance, and resulted in some congestion on the freeway.
Nevertheless, the RLRM with limited queue still reduced the TTT by 45% compared with the base
case and outperformed the ALINEA with the queue override algorithm. Coordinating the action
of individual RLRM agents with limited queue space made it possible to utilize the queue storage
on nearby on-ramps. The coordinated RLRM agents prevented the freeway from breaking down while keeping the queues from exceeding the predefined limits. The coordinated RLRM could match the performance of independent RLRM agents with unlimited queue space, and reduced the TTT by 50% compared with the base case, while offering improved queue management and respecting queue length constraints. The ALINEA with a linked control algorithm was able to
balance the queues of the downtown on-ramps; however, the performance of the system could not
match the original ALINEA algorithm. In fact, balancing the queues resulted in inferior
performance compared with the original ALINEA. This phenomenon was also observed in the
case of coordinated RLRM agents with queue balancing. When the coordinated RLRM agents
were forced to balance their queues through a penalty term, their performance deteriorated
significantly.
6.2 Contributions
The main contributions of this thesis can be summarized as follows:
RL with continuous representation of states and actions – a novel approach for direct
representation of the continuous states and actions in RL is proposed. The proposed approach
can properly handle the stochastic behaviour as well as the noisy measurements of the traffic
control problems. Furthermore, the proposed approach allows far more state variables to be
included in the definition of the state of the environment without the need for deciding on the
discretization intervals, which significantly simplifies the design process. Given the generality
of the proposed approach, it can be applied to virtually any RL application.
Coordination of RL-based RM agents – an algorithm is proposed for direct negotiation and
coordination of RLRM agents based on coordination graphs. Design of the agents in
conjunction with the coordination algorithm enables independent as well as coordinated
implementation of the agents in a decentralized structure, which provides robustness against
communication failure.
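The essence of coordination over a graph can be illustrated with two neighbouring agents whose joint value decomposes into local terms plus a pairwise term. Real coordination graphs replace the brute-force maximization below with message passing over the graph; the names and the two-agent restriction here are illustrative only:

```python
from itertools import product

def coordinate(q_local, q_pair, actions):
    """Pick the joint action (a1, a2) maximizing
    Q1(a1) + Q2(a2) + Q12(a1, a2).  A brute-force stand-in for the
    negotiation over a coordination graph; on larger graphs the same
    decomposition is maximized by message passing between agents."""
    q1, q2 = q_local
    return max(product(actions, actions),
               key=lambda a: q1[a[0]] + q2[a[1]] + q_pair[a])
```

The pairwise term is what lets one agent accept a locally inferior action (e.g. storing extra queue) when it raises the neighbours' collective reward.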
Additionally, with lower significance, the following were achieved during the course of this research:
The Gardiner Expressway microscopic simulation model – a thoroughly refined and
calibrated microsimulation model of the Gardiner is developed in Paramics for training and
evaluation of the proposed algorithms. Since accurate traffic dynamics are crucial for freeway
traffic control applications, special consideration is given to the development and calibration
of the Gardiner model. The developed model closely replicates the characteristics of the
Gardiner such as capacity and critical density as well as traffic volumes and congestion
patterns.
Deployment-ready design – throughout the design process of the RLRM agents, only
measurements that are readily available in the field were considered. Although use of more
complex measures could simplify the design process or enhance performance, the decision
was made to minimize the time required for deployment of the proposed algorithm in the field.
6.3 Towards Field Implementation
Performance of traffic control systems under real-life conditions has always been a concern for practitioners as well as researchers, particularly whether, and how, new systems can be implemented in the field on controllers with specific capabilities and limitations. This concern is heightened by the risk and productivity consequences of a traffic controller failure in the field. Hardware-in-the-loop simulation (HILS) provides the tools for evaluation of the hardware that could be
implemented in the field without risking the consequences of its failure. HILS, in the traffic control
context, is a method used for evaluating real hardware components running the traffic control
algorithms in a simulation environment. HILS allows evaluation of the hardware operation in a
controlled simulated environment before deployment in the field. HILS replaces the emulated
traffic signal control logic in the simulation model with real traffic signal control hardware, which
interacts with the simulation model. In other words, HILS replaces the real environment of a traffic
controller with microsimulation software, as illustrated in Figure 6-1.
The critical component of a hardware-in-the-loop traffic simulation system is the controller
interface device or CID, which facilitates the communication between the physical world (traffic
controller) and the simulated world. Figure 6-1b illustrates the HILS setup, which has three
components: 1) a microscopic simulation model; 2) a traffic controller; and 3) a CID, which
facilitates communication between the first two components. The CID captures the traffic light
indications generated by the traffic controller and routes them to the simulation software.
Similarly, inputs from the simulator, e.g. loop detector calls, are sent back to the traffic controller
through the CID, and hence the controller functions as if it were communicating with a real signal
assembly.
Figure 6-1 Controller interface with real (a) and virtual (b) transportation environment.
As part of a project for evaluating field implementation of the MARLIN-ATSC (S. El-
Tantawy et al., 2013) algorithm, a team of researchers including the author developed the CID and
the companion programs which allow communication between a NEMA TS2 Type 1 traffic
controller and Paramics microsimulation software. The developed CID is shown in Figure 6-2. On
the one hand, the CID will respond to the controller commands as if the traffic signal controller is
communicating with devices inside a control cabinet. On the other hand, it will control the traffic
signal behaviour of Paramics to match the commands coming from the traffic signal controller.
Additionally, the CID will read loop detector calls in the Paramics network and communicate them
back to the traffic signal controller.
The same CID that was originally developed for HILS of surface traffic control algorithms
can be directly employed for HILS of RM algorithms. For this purpose, and when resources are
available in the near future, the RM control logic should be implemented in an embedded
controller, which overrides the logic of the traffic signal controller. The embedded controller reads
the loop detector calls through the traffic signal controller and calculates the metering rate (the green and red timing) according to the state of the traffic. Then it overrides the traffic signal control logic according to the controls provided by the NTCIP standard.
Figure 6-2 The CID developed for evaluation of MARLIN-ATSC.
6.4 Assumptions and Limitations
During the design and evaluation of the proposed freeway control system, certain assumptions were made to make the problem manageable. Given the tedious and time-consuming calibration process, only a single set of OD matrices was calibrated. The agents were trained and evaluated based on this single demand profile. Since the proposed system finds the best response for the current traffic condition, it acts independently of the overall traffic pattern. However, the system responds optimally only for traffic conditions it has seen previously. The randomness in Paramics provides the necessary variation for the system to learn, but if the traffic pattern changes significantly, the system will face conditions it might not have seen before. The generalization from LMT can handle these conditions to some extent; however, the control system output will not be optimal. It should be noted that the control system will learn from these experiences and will improve as new samples are visited.
For the purpose of this research, it is assumed that demand from on-ramp origins is fixed. This assumption was made so that comparisons between different scenarios could be made. However, two real-world phenomena contradict this assumption: 1) traffic rerouting when the queue on one
on-ramp is shorter than on adjacent on-ramps, and 2) induced demand due to the improved traffic flow of the on-ramps. While rerouting negatively affects independently controlled on-ramps, coordination of adjacent on-ramps can address this issue to some extent and nullify its negative impacts. Induced demand is inevitable when travel time drops; however, the extra demand will not cause congestion because the freeway is controlled. The added demand will eventually increase travel times, possibly to levels close to base-case travel times. Even if travel times after metering the on-ramps become similar to the base case, it should be noted that the total number of vehicles served has increased. Therefore, the overall ramp metering system performs better than the base case, through either lower travel times or more vehicles served.
The emerging Advanced Traveler Information Systems (ATIS), such as real-time traffic information and travel times, can affect the ramp metering system. In the case of independent on-ramps, travellers headed for downstream on-ramps will be redirected to upstream on-ramps where there is no queue. This rerouting contradicts the RM's efforts to regulate vehicle entrance to the freeway and results in lost productivity. However, in the case of coordinated on-ramps, ATIS can improve the equity of the system by spreading vehicles over different ramps and homogenizing the on-ramps' waiting times. Given that the metering of the on-ramps is coordinated, the rerouting will not result in lost productivity.
6.5 Future Work
The research presented in this thesis can be further extended in several ways. The following paragraphs outline key directions for future work.
The proposed algorithm has been developed with scalability to larger problems in mind.
Although it is expected to work in other networks with minimal modifications, applying it to the
full 400-series freeways would be a solid validation of its scalability to larger problems and
transferability to other types of traffic networks.
The trained RLRM agents in this research are specific to the on-ramps they are trained for.
In practice, training an agent for each on-ramp is not always feasible. It is desirable to develop a
generalized agent while considering different possible on-ramp geometries. The training of the
generalized agent would involve samples from different on-ramp geometries and demands.
Although in this research the proposed algorithm is specifically applied to ramp metering,
it can be modified to work with other freeway traffic measures such as variable speed limits and
dynamic route guidance. Furthermore, coordination of variable speed limits and ramp metering
can balance the travel time between on-ramp users and mainline users, potentially addressing the
criticism that RM sacrifices on-ramp users for the benefit of mainline vehicles. Variable speed
limits in this case would act as mainline metering.
Surface streets and freeways are ultimately part of the whole transportation network.
Congestion on surface streets will affect the freeway traffic if propagated to the freeway off-ramps.
Similarly, heavy demand on freeways might create long on-ramp queues, causing congestion on
surface streets. Integration of the freeway control systems with surface street control systems could
benefit both.
References
Abdelgawad, H., Abdulhai, B., Amirjamshidi, G., Wahba, M., Woudsma, C., & Roorda, M. J. (2011). Simulation of Exclusive Truck Facilities on Urban Freeways. Journal of Transportation Engineering-Asce, 137(8), 547-562. doi: 10.1061/(asce)te.1943-5436.0000234
Abdulhai, B., & Kattan, L. (2003). Reinforcement learning: Introduction to theory and potential for transport applications. Canadian Journal of Civil Engineering, 30(6), 981-991. doi: 10.1139/l03-014
Ahn, S., Bertini, R. L., Auffray, B., Ross, J. H., & Eshel, O. (2007). Evaluating benefits of systemwide adaptive ramp-metering strategy in Portland, Oregon. Transportation Research Record(2012), 47-56.
Arel, I., Liu, C., Urbanik, T., & Kohls, A. G. (2010). Reinforcement learning-based multi-agent system for network traffic signal control. IET Intelligent Transport Systems, 4(2), 128-135. doi: 10.1049/iet-its.2009.0070
Bazzan, A. L. C. (2009). Opportunities for multiagent systems and multiagent reinforcement learning in traffic control. Autonomous Agents and Multi-Agent Systems, 18(3), 342-375. doi: 10.1007/s10458-008-9062-9
Bellemans, T., De Schutter, B., & De Moor, B. (2002). Model predictive control with repeated model fitting for ramp metering. Paper presented at the 5th International IEEE Conference on Intelligent Transportation Systems.
Bellman, R. (2010). Dynamic programming / by Richard Bellman; with a new introduction by Stuart Dreyfus. Princeton, N.J.: Princeton University Press.
Brilon, W., & Ponzlet, M. (1996). Variability of speed-flow relationships on German autobahns. Transportation Research Record(1555), 91-98.
Brown, G. W. (1951). Iterative solution of games by fictitious play. In T. C. Koopmans (Ed.), Activity Analysis of Production and Allocation. New York: Wiley.
Busoniu, L., Babuska, R., & De Schutter, B. (2008). A comprehensive survey of multiagent reinforcement learning. IEEE Transactions on Systems Man and Cybernetics Part C-Applications and Reviews, 38(2), 156-172. doi: 10.1109/tsmcc.2007.913919
Chow, G. C. (1960). Tests of Equality Between Sets of Coefficients in 2 Linear Regressions. Econometrica, 28(3), 591-605. doi: 10.2307/1910133
Chu, L. Y., Liu, H. X., Recker, W., & Zhang, H. M. (2004). Performance evaluation of adaptive ramp-metering algorithms using microscopic traffic simulation model. Journal of Transportation Engineering-Asce, 130(3), 330-338. doi: 10.1061/(asce)0733-947x(2004)130:3(330)
Crites, R. H., & Barto, A. G. (1998). Elevator group control using multiple reinforcement learning agents. Machine Learning, 33(2-3), 235-262. doi: 10.1023/a:1007518724497
Davarynejad, M., Hegyi, A., Vrancken, J., & van den Berg, J. (2011, 5-7 Oct. 2011). Motorway ramp-metering control with queuing consideration using Q-learning. Paper presented at the 14th International IEEE Conference on Intelligent Transportation Systems (ITSC).
Doya, K. (2000). Reinforcement learning in continuous time and space. Neural Computation, 12(1), 219-245.
El-Tantawy, S., & Abdulhai, B. (2010). Temporal difference learning-based adaptive traffic signal control. Paper presented at the 12th WCTR, Lisbon, Portugal.
El-Tantawy, S., Abdulhai, B., & Abdelgawad, H. (2013). Multiagent Reinforcement Learning for Integrated Network of Adaptive Traffic Signal Controllers (MARLIN-ATSC): Methodology and Large-Scale Application on Downtown Toronto. IEEE Transactions on Intelligent Transportation Systems, PP(99), 1-11. doi: 10.1109/tits.2013.2255286
Even-Dar, E., & Mansour, Y. (2003). Learning rates for Q-learning. Journal of Machine Learning Research, 5, 1-25.
Geist, M., & Pietquin, O. (2013). Algorithmic Survey of Parametric Value Function Approximation. IEEE Transactions on Neural Networks and Learning Systems, 24(6), 845-867. doi: 10.1109/tnnls.2013.2247418
Ghods, A. H., Fu, L. P., & Rahimi-Kian, A. (2010). An Efficient Optimization Approach to Real-Time Coordinated and Integrated Freeway Traffic Control. IEEE Transactions on Intelligent Transportation Systems, 11(4), 873-884. doi: 10.1109/tits.2010.2055857
Ghods, A. H., Kian, A. R., & Tabibi, M. (2007). A genetic-fuzzy control application to ramp metering and variable speed limit control. Paper presented at the IEEE International Conference on Systems, Man and Cybernetics.
Gomes, G., & Horowitz, R. (2006). Optimal freeway ramp metering using the asymmetric cell transmission model. Transportation Research Part C-Emerging Technologies, 14(4), 244-262. doi: 10.1016/j.trc.2006.08.001
Guestrin, C., Lagoudakis, M. G., & Parr, R. (2002). Coordinated reinforcement learning. Paper presented at the 19th International Conference on Machine Learning (ICML-02), Sydney, Australia, Jul. 8–12.
Hagan, M. T., & Menhaj, M. (1994). Training feed-forward networks with the Marquardt algorithm. IEEE Transactions on Neural Networks, 5(6), 989-993.
Hall, F. L., & Agyemang-Duah, K. (1991). Freeway Capacity Drop and the Definition of Capacity. Transportation Research Record 1320, TRB, National Research Council, Washington, D.C., 91–98.
Hasan, M., Jha, M., & Ben-Akiva, M. (2002). Evaluation of ramp control algorithms using microscopic traffic simulation. Transportation Research Part C-Emerging Technologies, 10(3), 229-256.
Hegyi, A., De Schutter, B., & Hellendoorn, H. (2005). Model predictive control for optimal coordination of ramp metering and variable speed limits. Transportation Research Part C-Emerging Technologies, 13(3), 185-209. doi: 10.1016/j.trc.2004.08.001
Heinen, M. R., Bazzan, A. L. C., & Engel, P. M. (2011). Dealing with continuous-state reinforcement learning for intelligent control of traffic signals. Paper presented at the 14th International IEEE Conference on Intelligent Transportation Systems (pp. 890-895).
Jacob, C., & Abdulhai, B. (2010). Machine learning for multi jurisdictional optimal traffic corridor control. Transportation Research Part A-Policy and Practice, 44(2), 53-64. doi: 10.1016/j.tra.2009.11.001
Jacobsen, L., Henry, K., & Mahyar, O. (1989). Real-time metering algorithm for centralized control. Transportation Research Record(1232), 17–26.
Khan, S. G., Herrmann, G., Lewis, F. L., Pipe, T., & Melhuish, C. (2012). Reinforcement learning and optimal adaptive control: An overview and implementation examples. Annual Reviews in Control, 36(1), 42-59. doi: 10.1016/j.arcontrol.2012.03.004
Kok, J. R., & Vlassis, N. (2006). Collaborative multiagent reinforcement learning by payoff propagation. Journal of Machine Learning Research, 7, 1789-1828.
Kotsialos, A., & Papageorgiou, M. (2004). Efficiency and equity properties of freeway network-wide ramp metering with AMOC. Transportation Research Part C: Emerging Technologies, 12(6), 401-420. doi: http://dx.doi.org/10.1016/j.trc.2004.07.016
Kotsialos, A., Papageorgiou, M., Mangeas, M., & Haj-Salem, H. (2002). Coordinated and integrated control of motorway networks via non-linear optimal control. Transportation Research Part C-Emerging Technologies, 10(1), 65-84.
Kuyer, L., Whiteson, S., Bakker, B., & Vlassis, N. (2008). Multiagent Reinforcement Learning for Urban Traffic Control Using Coordination Graphs. Machine Learning and Knowledge Discovery in Databases, Part I, Proceedings, 5211, 656-671.
Lau, R. (1997). Ramp metering by zone—The Minnesota algorithm: Minnesota Department of Transportation.
Mahadevan, S. (1996). Average reward reinforcement learning: Foundations, algorithms, and empirical results. Machine Learning, 22(1-3), 159-195. doi: 10.1007/bf00114727
Martin, J. A., de Lope, J., & Maravall, D. (2011). Robust high performance reinforcement learning through weighted k-nearest neighbors. Neurocomputing, 74(8), 1251-1259. doi: 10.1016/j.neucom.2010.07.027
Masher, D. P., Ross, D. W., Wong, P. J., Tuan, P. L., Zeidler, H. M., & Petracek, S. (1975). Guidelines for design and operation of ramp control systems. Stanford Research Institute, Menlo Park, California.
Messmer, A., & Papageorgiou, M. (1990). METANET: a macroscopic simulation program for motorway networks. Traffic Engineering & Control, 31(8-9), 466-470.
Nair, R., Varakantham, P., Tambe, M., & Yokoo, M. (2005). Networked distributed POMDPs: A synthesis of Distributed Constraint Optimization and POMDPs. Paper presented at the 20th National Conference on Artificial Intelligence.
Paesani, G., Kerr, J., Perovich, P., & Khosravi, E. (1997). System wide adaptive ramp metering in Southern California. Paper presented at the 7th Annual Meeting, ITS America.
Panait, L., & Luke, S. (2005). Cooperative multi-agent learning: The state of the art. Autonomous Agents and Multi-Agent Systems, 11(3), 387-434. doi: 10.1007/s10458-005-2631-2
Papageorgiou, M., Blosseville, J.-M., & Haj-Salem, H. (1990). Modelling and real-time control of traffic flow on the southern part of Boulevard Peripherique in Paris: Part II: Coordinated on-ramp metering. Transportation Research Part A: General, 24(5), 361-370. doi: http://dx.doi.org/10.1016/0191-2607(90)90048-B
Papageorgiou, M., Diakaki, C., Dinopoulou, V., Kotsialos, A., & Wang, Y. B. (2003). Review of road traffic control strategies. Proceedings of the IEEE, 91(12), 2043-2067. doi: 10.1109/jproc.2003.819610
Papageorgiou, M., Hadj-Salem, H., & Blosseville, J.-M. (1991a). ALINEA: A local feedback control law for on-ramp metering. Transportation Research Record(1320), 58-64.
Papageorgiou, M., Hadj-Salem, H., & Blosseville, J. M. (1991b). ALINEA: A local feedback control law for on-ramp metering. Transportation Research Record, 1320, 58-64.
Papageorgiou, M., Hadj-Salem, H., & Middelham, F. (1997). ALINEA local ramp metering: Summary of field results. Transportation Research Record(1603), 90-98.
Papageorgiou, M., & Kotsialos, A. (2002). Freeway ramp metering: An overview. IEEE Transactions on Intelligent Transportation Systems, 3(4), 271-281. doi: 10.1109/tits.2002.806803
Papageorgiou, M., & Papamichail, I. (2008). Overview of Traffic Signal Operation Policies for Ramp Metering. Transportation Research Record(2047), 28-36. doi: 10.3141/2047-04
Papamichail, I., Kotsialos, A., Margonis, I., & Papageorgiou, M. (2010a). Coordinated ramp metering for freeway networks - A model-predictive hierarchical control approach. Transportation Research Part C-Emerging Technologies, 18(3), 311-331. doi: 10.1016/j.trc.2008.11.002
Papamichail, I., & Papageorgiou, M. (2008). Traffic-responsive linked ramp-metering control. IEEE Transactions on Intelligent Transportation Systems, 9(1), 111-121. doi: 10.1109/tits.2007.908724
Papamichail, I., Papageorgiou, M., Vong, V., & Gaffney, J. (2010b). Heuristic Ramp-Metering Coordination Strategy Implemented at Monash Freeway, Australia. Transportation Research Record(2178), 10-20. doi: 10.3141/2178-02
Potts, D., & Sammut, C. (2005). Incremental learning of linear model trees. Machine Learning, 61(1-3), 5-48. doi: 10.1007/s10994-005-1121-8
Powell, W., & Ma, J. (2011). A review of stochastic algorithms with continuous value function approximation and some new approximate policy iteration algorithms for multidimensional continuous applications. Journal of Control Theory and Applications, 9(3), 336-352. doi: 10.1007/s11768-011-0313-y
Prashanth, L. A., & Bhatnagar, S. (2011). Reinforcement Learning With Function Approximation for Traffic Signal Control. IEEE Transactions on Intelligent Transportation Systems, 12(2), 412-421. doi: 10.1109/tits.2010.2091408
Salkham, A., Cunningham, R., Garg, A., & Cahill, V. (2008). A collaborative reinforcement learning approach to urban traffic control optimization. Paper presented at the Proceedings of the 2008 IEEE/WIC/ACM International Conference on Intelligent Agent Technology, IAT 2008.
Santamaria, J. C., Sutton, R. S., & Ram, A. (1997). Experiments with reinforcement learning in problems with continuous state and action spaces. Adaptive Behavior, 6(2), 163-217. doi: 10.1177/105971239700600201
Singh, S. P., & Sutton, R. S. (1996). Reinforcement learning with replacing eligibility traces. Machine Learning, 22(1-3), 123-158. doi: 10.1023/a:1018012322525
Smaragdis, E., & Papageorgiou, M. (2003). Series of new local ramp metering strategies. Freeways, High-Occupancy Vehicle Systems, and Traffic Signal Systems 2003(1856), 74-86.
Smaragdis, E., Papageorgiou, M., & Kosmatopoulos, E. (2004). A flow-maximizing adaptive local ramp metering strategy. Transportation Research Part B-Methodological, 38(3), 251-270. doi: 10.1016/s0191-2615(03)00012-2
Spall, J. C. (1992). Multivariate stochastic approximation using a simultaneous perturbation gradient approximation. IEEE Transactions on Automatic Control, 37(3), 332-341.
Spall, J. C. (1998). An overview of the simultaneous perturbation method for efficient optimization. Johns Hopkins APL Technical Digest (Applied Physics Laboratory), 19(4), 482-492.
Sugiyamal, Y., Fukui, M., Kikuchi, M., Hasebe, K., Nakayama, A., Nishinari, K., . . . Yukawa, S. (2008). Traffic jams without bottlenecks-experimental evidence for the physical mechanism of the formation of a jam. New Journal of Physics, 10.
Sun, X. T., & Horowitz, R. (2005). A localized switching ramp-metering controller with a queue length regulator for congested freeways. Paper presented at the Proceedings of the 2005 American Control Conference, New York.
Sutton, R. S., & Barto, A. G. (1998). Reinforcement Learning: An Introduction. Cambridge, MA: MIT Press.
Van Aerde, M. (1995). Single regime speed-flow-density relationship for congested and uncongested highways. Paper presented at the 74th TRB Annual Conference, Washington D.C.
Watkins, C., & Dayan, P. (1992). Q-learning. Machine Learning, 8(3-4), 279-292.
Zhang, H. M., & Ritchie, S. G. (1997). Freeway ramp metering using artificial neural networks. Transportation Research Part C-Emerging Technologies, 5(5), 273-286.
Zhang, M., Ma, J., & Dong, H. (2008). Developing Calibration Tools for Microscopic Traffic Simulation Final Report Part II: Calibration Framework and Calibration of Local/Global Driving Behavior and Departure/Route Choice Model Parameters: California PATH Research Report.
Appendix A – Paramics Plug-in
The Paramics functionality can be extended through plug-ins written in C language. The plug-in
for implementing ramp metering in Paramics can be broken into three parts: 1) measurement of
state of traffic, 2) the ramp metering algorithm, and 3) implementing the metering rate to the traffic
light.
Measuring the state of traffic is limited to loop detectors. Loop detectors in Paramics provide three pieces of information: the cumulative number of vehicles that have passed the detector, the speed of the last vehicle that passed it, and the duration that the last vehicle occupied it. This information is first aggregated into 20 sec interval averages. For the traffic
volume, the total number of vehicles is calculated from the change in the accumulated number of
vehicles in the 20 sec interval. For average speed, individual vehicles’ speeds are added together
and divided by number of vehicles. To calculate the percentage occupancy, the individual
occupancy times are added together and divided by 20 sec, to achieve the ratio that the loop
detector was occupied. These 20 sec averages are then further aggregated depending on the control
cycle of the ramp metering algorithm.
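The 20 sec aggregation described above can be sketched as follows; the function name and argument layout are illustrative, not the plug-in's actual C interface:

```python
def aggregate_interval(count_start, count_end, speeds, occupancy_times,
                       interval=20.0):
    """Aggregate raw Paramics loop-detector readings over one interval:
    volume from the change in the cumulative vehicle count, mean speed
    over the vehicles observed, and percentage occupancy from the
    summed per-vehicle occupancy times divided by the interval length."""
    volume = count_end - count_start
    mean_speed = sum(speeds) / len(speeds) if speeds else 0.0
    occupancy = 100.0 * sum(occupancy_times) / interval
    return volume, mean_speed, occupancy
```

These per-interval triples are then further averaged over the control cycle before being passed to the metering algorithm.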
The ramp metering algorithms are coded in MATLAB to take advantage of its vast libraries
and simple and versatile programming language. The interface between Paramics plug-in and
MATLAB is made through MATLAB Engine. MATLAB Engine allows an external program to
initiate an instance of MATLAB and control the execution of functions and scripts in that
MATLAB session. The plug-in retrieves the metering rate from the MATLAB session after
executing the ramp metering algorithms.
The metering rates that the ramp metering algorithms calculate are directly translated into
green time and red time according to the metering rate signal policy used. Through Paramics
programming APIs, the timing of the traffic light is overwritten in each control cycle to match the
output of the ramp metering algorithm.
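Under a one-vehicle-per-green policy, this translation is straightforward: each cycle of length 3600/rate seconds releases one vehicle. The 2 sec green and minimum red below are illustrative values, not the settings used in the thesis:

```python
def rate_to_timing(rate_veh_h, green=2.0, min_red=2.0):
    """Translate a metering rate (veh/h) into green and red times
    under a one-vehicle-per-green signal policy.  One cycle of
    3600/rate seconds releases a single vehicle; the green time is
    fixed and the remainder of the cycle is red, subject to a
    minimum red.  Numeric defaults are illustrative placeholders."""
    cycle = 3600.0 / rate_veh_h
    red = max(min_red, cycle - green)
    return green, red
```

For example, a rate of 400 veh/h gives a 9 sec cycle: 2 sec green followed by 7 sec red.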
Appendix B – Total Least Squares
The regular least squares method finds the regression that minimizes the sum of squared errors in the dependent variable. Effectively, the least squares regression minimizes the error function below:

$E(\boldsymbol{\beta}) = \sum_i \left( y_i - \mathbf{x}_i^{T}\boldsymbol{\beta} \right)^2$   (B.1)

where $\mathbf{x}_i$ is the vector of independent variables, $y_i$ is the dependent variable, and $\boldsymbol{\beta}$ is the vector of regression parameters. The least squares regression efficiently estimates $\boldsymbol{\beta}$, provided that the measurements of the independent variables do not suffer from measurement errors. Error in the independent variables can negatively affect the result of least squares regression. The effect is more significant if the relation between inputs and outputs is non-linear.
The total least squares regression does not differentiate between dependent and independent variables and assumes error in both. The total least squares regression minimizes the objective:

$E(\boldsymbol{\beta}) = \sum_i \left[ (z_i - \hat{z}_i)^2 + (y_i - \hat{y}_i)^2 \right]$   subject to:   $\hat{y}_i = \hat{\mathbf{x}}_i^{T}\boldsymbol{\beta}$   (B.2)

where $\hat{\mathbf{x}}_i$ is a vector obtained by nonlinear augmentation of $\hat{z}_i$, $(z_i, y_i)$ are measurements subject to error, and $(\hat{z}_i, \hat{y}_i)$ are points on the regression curve that satisfy $\hat{y}_i = \hat{\mathbf{x}}_i^{T}\boldsymbol{\beta}$. For any given $\boldsymbol{\beta}$, the total least squares objective function is minimized when $(\hat{z}_i, \hat{y}_i)$ is the closest point on the curve to the measured data $(z_i, y_i)$. Figure B-1 illustrates the difference between the errors minimized in regular least squares and total least squares. Figure B-1a shows the original fundamental diagram curve along with sample measurements obtained by adding normal noise to both dimensions. Figure B-1b shows the errors that regular least squares regression minimizes, and Figure B-1c shows the errors that total least squares regression minimizes. It is clear from these errors that total least squares regression results in a much less biased estimation of non-linear functions when both measurements are subject to error.
To find the best-fit Van Aerde fundamental diagram for a set of speed and density
measurements, an iterative numerical approach is employed. The process starts with an initial Van
Aerde model. Then, for each measured sample, the closest point on the Van Aerde curve is
calculated. The sum of squared distances over all samples defines the error value for the Van
Aerde model. The optimization iteratively updates the parameters of the Van Aerde model in the
direction that reduces this error.
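The iterative closest-point procedure above can be sketched as follows. For brevity, a simple quadratic stands in for the Van Aerde speed-density model, the closest point on the curve is found by dense grid search, and the parameter update uses shrinking grid refinement rather than the thesis's optimizer; all numerical values are assumptions for illustration:

```python
import numpy as np

def curve(x, a, b):
    # Quadratic stand-in for the Van Aerde speed-density relation.
    return a + b * x**2

def tls_error(a, b, xs, ys, grid):
    # Total-least-squares error: for each measured point, squared distance
    # to the closest point on the curve, found by dense grid search.
    cy = curve(grid, a, b)
    d2 = (xs[:, None] - grid[None, :])**2 + (ys[:, None] - cy[None, :])**2
    return d2.min(axis=1).sum()

rng = np.random.default_rng(1)
x_true = rng.uniform(0, 3, 40)
xs = x_true + rng.normal(0, 0.05, 40)                 # error on both variables,
ys = curve(x_true, 1.0, 0.5) + rng.normal(0, 0.05, 40)  # as in Figure B-1(a)

grid = np.linspace(-1, 4, 1500)
# Iterative parameter update: repeatedly shrink a search box around the
# best (a, b) found, mimicking the error-reducing update of Appendix B.
a_lo, a_hi, b_lo, b_hi = 0.0, 2.0, 0.0, 1.0
for _ in range(4):
    avals = np.linspace(a_lo, a_hi, 15)
    bvals = np.linspace(b_lo, b_hi, 15)
    errs = [(tls_error(a, b, xs, ys, grid), a, b) for a in avals for b in bvals]
    _, a_best, b_best = min(errs)
    da, db = (a_hi - a_lo) / 4, (b_hi - b_lo) / 4
    a_lo, a_hi = a_best - da, a_best + da
    b_lo, b_hi = b_best - db, b_best + db
# (a_best, b_best) approaches the generating parameters (1.0, 0.5)
# despite noise on both coordinates.
```

The same structure applies to the actual Van Aerde model: only `curve` and the parameter box change.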
Figure B-1 Comparison of the errors for regular and total least squares: (a) samples
generated from the original fundamental diagram with measurement error on both variables,
(b) errors minimized in the regular least squares method, (c) errors minimized in the total least
squares method. [Three panels plotting Speed (km/hr) against Density (veh/km/lane).]
Appendix C – Simultaneous Perturbation Stochastic Approximation
Simultaneous perturbation stochastic approximation (SPSA) is a gradient-based optimization
algorithm for multivariate optimization problems in which it is difficult or impossible to directly
obtain the gradient of the objective function. The basic approach to estimating the gradient is
to evaluate the objective function on both sides of the candidate point along every dimension;
for a problem with p variables, a total of 2p objective function evaluations is therefore needed.
The SPSA algorithm, on the other hand, simultaneously perturbs the candidate point along all
dimensions and calculates an estimate of the gradient with only two objective function
evaluations. It has been shown that, under reasonably general conditions, SPSA achieves a similar
level of accuracy as conventional gradient-based optimization approaches given a similar number
of iterations.
The SPSA algorithm is an iterative approach, and its implementation steps are as follows:

1 – Initialization and Coefficient Selection. The first step is to choose a feasible initial point
\hat{\theta}_0 as well as the parameters a, A, \alpha, c, and \gamma of the SPSA algorithm. These
parameters define the gain sequences a_k = a / (A + k + 1)^{\alpha} and c_k = c / (k + 1)^{\gamma}
for the algorithm that will be used in the following steps. Practically effective values for these
parameters can be found in (J. C. Spall, 1998).

2 – Generation of the Simultaneous Perturbation Vector. Generate a random vector \Delta_k
with p elements. This random vector should satisfy the conditions described in (James C. Spall,
1992). A simple distribution that satisfies these conditions is the Bernoulli distribution with
outcomes +1 and -1, each with probability 1/2.

3 – Objective Function Evaluation. Evaluate the objective function around the current
point \hat{\theta}_k using the perturbation vector \Delta_k. The two points for evaluating the
objective function are \hat{\theta}_k + c_k \Delta_k and \hat{\theta}_k - c_k \Delta_k.

4 – Gradient Approximation. The gradient at the current point, given the two objective
function evaluations, can be approximated by:

\hat{g}_k(\hat{\theta}_k) = \frac{y(\hat{\theta}_k + c_k \Delta_k) - y(\hat{\theta}_k - c_k \Delta_k)}{2 c_k} \left[ \Delta_{k1}^{-1}, \Delta_{k2}^{-1}, \ldots, \Delta_{kp}^{-1} \right]^{T} (C.1)

where \Delta_{ki} is the ith component of the \Delta_k vector, and y(\cdot) is the objective function.
5 – Updating the Estimate of \theta. The estimate \hat{\theta}_k can be updated based on the
estimated gradient using:

\hat{\theta}_{k+1} = \hat{\theta}_k - a_k \hat{g}_k(\hat{\theta}_k) (C.2)

6 – Iteration or Termination. If a termination condition is met (a maximum number of
iterations, or a near-zero gradient), the process terminates; otherwise, it is repeated from step 2.
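The six steps above can be sketched compactly. The coefficient values below are assumptions in the spirit of the cited guidelines, not the values used in this thesis, and the quadratic test function is purely illustrative:

```python
import numpy as np

def spsa_minimize(obj, theta0, a=0.1, c=0.1, A=10,
                  alpha=0.602, gamma=0.101, n_iter=500, seed=0):
    # SPSA following steps 1-6 above. Coefficient defaults are common
    # rule-of-thumb values (assumed here, not from the thesis).
    rng = np.random.default_rng(seed)
    theta = np.asarray(theta0, dtype=float)           # step 1: initial point
    for k in range(n_iter):
        a_k = a / (A + k + 1) ** alpha                # step-size gain sequence
        c_k = c / (k + 1) ** gamma                    # perturbation gain sequence
        delta = rng.choice([-1.0, 1.0], size=theta.shape)  # step 2: Bernoulli +-1
        y_plus = obj(theta + c_k * delta)             # step 3: two evaluations,
        y_minus = obj(theta - c_k * delta)            #   regardless of dimension p
        ghat = (y_plus - y_minus) / (2 * c_k * delta)  # step 4: Eq. (C.1)
        theta = theta - a_k * ghat                    # step 5: Eq. (C.2)
    return theta                                      # step 6: fixed iteration budget

# Usage: minimize a simple quadratic with minimum at (1, -2),
# using only two objective evaluations per iteration.
f = lambda th: (th[0] - 1.0) ** 2 + (th[1] + 2.0) ** 2
theta_star = spsa_minimize(f, [0.0, 0.0])
```

Note that each iteration costs two evaluations of `obj` whatever the dimension of `theta`, which is the practical advantage of SPSA over two-sided finite differences (2p evaluations).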