DEEP REINFORCEMENT LEARNING FOR TRAFFIC SIGNAL CONTROL
A Dissertation in
Information Sciences and Technology
by
Hua Wei
© 2020 Hua Wei
Submitted in Partial Fulfillment
of the Requirements
for the Degree of
Doctor of Philosophy
December 2020

The dissertation of Hua Wei was reviewed and approved by the following:
Zhenhui (Jessie) Li
Associate Professor of Information Sciences and Technology
Dissertation Adviser
Chair of Committee
C. Lee Giles
Professor of Information Sciences and Technology
Xiang Zhang
Associate Professor of Information Sciences and Technology
Vikash V. Gayah
Associate Professor of Civil and Environmental Engineering
Mary Beth Rosson
Professor of Information Sciences and Technology
Graduate Program Chair
Abstract
Traffic congestion is a growing problem that continues to plague urban areas, with negative outcomes for both the traveling public and society as a whole. Signalized intersections are one of the most prevalent bottleneck types in urban environments, and thus traffic signal control plays a large role in urban traffic management. Nowadays, the widely used traffic signal control systems (e.g., SCATS and SCOOT) are still based on manually designed traffic signal plans. Recently, emerging research studies have used reinforcement learning (RL) to tackle the traffic signal control problem. In this dissertation, we propose to use reinforcement learning to intelligently optimize signal timing plans in real time to reduce traffic congestion.
Although some efforts using reinforcement learning (RL) techniques have been proposed to adjust traffic signals dynamically, they rely on ad-hoc designs when formulating the traffic signal control problem and lack a principled approach to formulating it under the framework of RL. Secondly, since RL learns directly from data via trial-and-error search, it requires a substantial number of interactions with the environment before the algorithms converge. In real-world problems, every interaction incurs real cost (e.g., traffic congestion, traffic accidents); hence, a more data-efficient method is necessary. Thirdly, discrepancies between simulation and reality confine the application of RL in the real world, despite its massive success in domains like games. Most RL methods conduct experiments in a simulator, since a simulator can generate data more cheaply and quickly than real experimentation. Hence, addressing the performance gap of RL methods between simulation and the real world is required for applying RL in practice.
This dissertation presents how to utilize mobility data and RL-based methods for traffic signal control. I have investigated the key challenges for RL-based traffic signal control methods, including how to formulate the objective function and how to improve learning efficiency for city-wide traffic signal control. In addition, I have worked to mitigate the performance gap of RL methods between simulation and the real world. We have achieved significant improvement over state-of-the-art and currently deployed methods, which provides promising solutions to traffic signal control problems and implications for smart city applications.
Table of Contents
List of Figures

List of Tables
Acknowledgements
Chapter 1 Introduction
1.1 Why do we need a more intelligent traffic signal control
1.2 Why do we use reinforcement learning for traffic signal control
1.3 Why RL for traffic signal control is challenging?
1.3.1 Formulation of RL agent
1.3.2 Learning cost
1.3.3 Simulation
1.4 Previous Studies
1.5 Proposed Tasks
1.6 Overview of this Dissertation
Chapter 2 Notation, Background and Literature
2.1 Preliminaries of Traffic Signal Control Problem
2.1.1 Term Definition
2.1.2 Objective
2.1.3 Special Considerations
2.2 Background of Reinforcement Learning
2.2.1 Single Agent Reinforcement Learning
2.2.2 Problem Setting
2.3 Literature
2.3.1 Agent formulation
2.3.1.1 Reward
2.3.1.2 State
2.3.1.3 Action scheme
2.3.2 Policy Learning
2.3.2.1 Value-based methods
2.3.2.2 Policy-based methods
* 2.3.3 Coordination
* 2.3.3.1 Joint action learners
* 2.3.3.2 Independent learners
* 2.3.3.3 Sizes of Road network
2.4 Conclusion
Chapter 4 Formulating the Learning Objective
* 4.1 Introduction
* 4.2 Related Work
* 4.3 Preliminaries and Notations
* 4.4 Method
* 4.4.1 Agent Design
* 4.4.2 Learning Process
* 4.5 Justification of RL agent
* 4.5.1 Justification for State Design
* 4.5.1.1 General description of traffic movement process as a Markov chain
* 4.5.1.2 Specification with proposed state definition
* 4.5.2 Justification for Reward Design
* 4.5.2.1 Stabilization on traffic movements with proposed reward.
* 4.5.2.2 Connection to throughput maximization and travel time minimization.
* 4.6 Experiment
* 4.6.1 Dataset Description
* 4.6.2 Experimental Settings
* 4.6.2.1 Environmental settings
* 4.6.2.2 Evaluation metric
* 4.6.2.3 Compared methods
* 4.6.3 Performance Comparison
* 4.6.4 Study of PressLight
* 4.6.4.1 Effects of variants of our proposed method
* 4.6.4.2 Average travel time related to pressure.
* 4.6.5 Performance on Mixed Scenarios
* 4.6.5.1 Heterogeneous intersections
* 4.6.5.2 Arterials with a different number of intersections and network
* 4.6.6 Case Study
* 4.6.6.1 Synthetic traffic on the uniform, unidirectional flow
* 4.6.6.1.1 Performance comparison
* 4.6.6.1.2 Policy learned by RL agents
* 4.6.6.2 Real-world traffic in Jinan
* 4.7 Conclusion
Chapter 5 Improving Learning Efficiency
* 5.1 Introduction
* 5.2 Related Work
* 5.3 Problem Definition
* 5.4 Method
* 5.4.1 Observation Embedding
* 5.4.2 Graph Attention Networks for Cooperation
* 5.4.2.1 Observation Interaction
* 5.4.2.2 Attention Distribution within Neighborhood Scope
* 5.4.2.3 Index-free Neighborhood Cooperation
* 5.4.2.4 Multi-head Attention
* 5.4.3 Q-value Prediction
* 5.4.4 Complexity Analysis
* 5.4.4.1 Space Complexity
* 5.4.4.2 Time Complexity
* 5.5 Experiments
* 5.5.1 Settings
* 5.5.2 Datasets
* 5.5.2.1 Synthetic Data
* 5.5.2.2 Real-world Data
* 5.5.3 Compared Methods
* 5.5.4 Evaluation Metric
* 5.5.5 Performance Comparison
* 5.5.5.1 Overall Analysis
* 5.5.5.2 Convergence Comparison
* 5.5.6 Scalability Comparison
* 5.5.6.1 Effectiveness
* 5.5.6.2 Training time
* 5.5.7 Study of CoLight
* 5.5.7.1 Impact of Neighborhood Definition
* 5.5.7.2 Impact of Neighbor Number
* 5.5.7.3 Impact of Attention Head Number
* 5.6 Conclusion
Chapter 6 Learning to Simulate
6.1 Introduction
6.2 Related Work
6.3 Preliminaries
6.4 Method
6.4.1 Basic GAIL Framework
6.4.2 Imitation with Interpolation
6.4.2.1 Generator in the simulator
6.4.2.2 Downsampling of generated trajectories
6.4.2.3 Interpolation-Discriminator
6.4.2.3.1 Interpolator module
6.4.2.3.2 Discriminator module
6.4.2.3.3 Loss function of Interpolation-Discriminator
List of Figures
2.1 Definitions of traffic movement and traffic signal phases.
2.2 RL framework for traffic signal control.
3.1 Reward is not a comprehensive measure to evaluate traffic light control performance. Both policies will lead to the same rewards. But policy #1 is more suitable than policy #2 in the real world.
3.2 Case A and case B have the same environment except the traffic light phase.
3.3 Model framework
3.4 Q-network
3.5 Memory palace structure
3.6 Traffic surveillance cameras in Jinan, China
3.7 Percentage of the time duration of learned policy for phase GreenWE (green light on WE and EW direction, while red light on NS and SN direction) in every 2000 seconds for different methods under configuration 4.
3.8 Average arrival rate on two directions (WE and SN) and time duration ratio of two phases (GreenWE and RedWE) from learned policy for Jingliu Road (WE) and Erhuanxi Auxiliary Road (SN) in Jinan on August 1st and August 7th, 2016.
3.9 Detailed average arrival rate on two directions (dotted lines) and changes of two phases (dashed areas) in three periods of time for Jingliu Road (WE) and Erhuanxi Auxiliary Road (SN) in Jinan on August 1st, 2016. The x-axis of each figure indicates the time of day; the left y-axis indicates the number of cars approaching the intersection every second; the right y-axis indicates the phase over time.
4.1 Performance of RL approaches is sensitive to reward and state. (a) A heuristic parameter tuning of reward function could result in different performances. (b) The method with a more complicated state (LIT[1] w/ neighbor) has a longer learning time but does not necessarily converge to a better result.
4.2 Illustration of max pressure control in two cases. In Case A, the green signal is set in the North$\to$South direction; in Case B, the green signal is set in the East$\to$West direction.
4.3 The transition of traffic movements.
4.4 Realworld arterial network for the experiment.
4.5 Convergence curve of average duration and our reward design (pressure). Pressure shows the same convergence trend as travel time.
4.6 Average travel time of our method on heterogeneous intersections. (a) Different number of legs. (b) Different length of lanes. (c) Experiment results.
4.7 Performance comparison under uniform unidirectional traffic, where the optimal solution (GreenWave) is known. Only PressLight can achieve the optimal solution.
4.8 Offsets between intersections learnt by RL agents under unidirectional uniform traffic (700 vehicles/hour/lane on the arterial).
4.9 Spacetime diagram with signal timing plan to illustrate the learned coordination strategy from realworld data on the arterial of Qingdao Road in the morning (around 8:30 a.m.) on August 6th.
5.1 Illustration of index-based concatenation. Thick yellow lines are the arterials and thin grey lines are the side streets. With index-based concatenation, $A$'s and $B$'s observations will be aligned as model inputs in a fixed order. These two inputs will confuse the model shared by $A$ and $B$.
5.2 Left: Framework of the proposed CoLight model. Right: variation of cooperation scope (light blue shadow, from onehop to twohop) and attention distribution (colored points, the redder, the more important) of the target intersection.
5.3 Road networks for real-world datasets. Red polygons are the areas we select to model; blue dots are the traffic signals we control. Left: 196 intersections with unidirectional traffic; middle: 16 intersections with uni- and bidirectional traffic; right: 12 intersections with bidirectional traffic.
5.4 Convergence speed of CoLight (red continuous curves) and the other 5 RL baselines (dashed curves) during training. CoLight starts with the best performance (Jumpstart), reaches the predefined performance the fastest (Time to Threshold), and ends with the optimal policy (Asymptotic). Curves are smoothed with a moving average of 5 points.
5.5 The training time of different models for 100 episodes. CoLight is efficient across all the datasets. The bar for Individual RL on $D_{NewYork}$ is shadowed as its running time is far beyond the acceptable time.
5.6 Performance of CoLight with respect to different numbers of neighbors ($\mathcal{N}_{i}$) on dataset $D_{Hangzhou}$ (left) and $D_{Jinan}$ (right). More neighbors ($\mathcal{N}_{i}\le 5$) for cooperation brings better performance, but too many neighbors ($\mathcal{N}_{i}>5$) require more time (200 episodes or more) to learn.
6.1 Illustration of a driving trajectory. In the realworld scenario, only part of the driving points can be observed and form a sparse driving trajectory (in red dots). Each driving point includes a driving state and an action of the vehicle at the observed time step. Best viewed in color.
6.2 Proposed ImInGAIL Approach. The overall framework of ImInGAIL includes three components: generator, downsampler, and interpolation-discriminator. Best viewed in color.
6.3 Proposed interpolation-discriminator network.
6.4 Illustration of road networks. (a) and (b) are synthetic road networks, while (c) and (d) are realworld road networks.
6.5 RMSE on time and position of our proposed method ImInGAIL under different levels of sparsity. As the expert trajectory becomes denser, a policy more similar to the expert policy is learned.
6.6 The generated trajectory of a vehicle in the $Ring$ scenario. Left: the initial positions of the vehicles. Vehicles can only be observed when they pass four locations $A$, $B$, $C$ and $D$ where cameras are installed. Right: the visualization of the trajectory of Vehicle 0. The x-axis is the timestep in seconds; the y-axis is the relative road distance in meters. Although Vehicle 0 is only observed three times (red triangles), ImInGAIL (blue points) can imitate the position of the expert trajectory (grey points) more accurately than all other baselines. Best viewed in color.
List of Tables
2.1 Representative deep RLbased traffic signal control methods. Due to page limits, we only put part of the investigated papers here.
3.1 Notations
3.2 Settings for our method
3.3 Reward coefficients
3.4 Configurations for synthetic traffic data
3.5 Details of realworld traffic dataset
3.6 Performance on configuration 1. Reward: the higher the better. Other measures: the lower the better. Same with the following tables.
3.7 Performance on configuration 2
3.8 Performance on configuration 3
3.9 Performance on configuration 4
3.10 Performances of different methods on real-world data. The number after $\pm$ means standard deviation. Reward: the higher the better. Other measures: the lower the better.
4.1 Summary of notation.
4.2 Configurations for synthetic traffic data
4.3 Data statistics of realworld traffic dataset
4.4 Performance comparison between all the methods in the arterial with 6 intersections w.r.t. average travel time (the lower the better). Top-down: conventional transportation methods, learning methods, and our proposed method.
4.5 Detailed comparison of our proposed state and reward designs and their effects w.r.t. average travel time (the lower the better) under synthetic traffic data.
4.6 Average travel time of different methods under arterials with a different number of intersections and network.
5.1 Data statistics of realworld traffic dataset
5.2 Performance on synthetic data and realworld data w.r.t average travel time. CoLight is the best.
5.3 Performance of CoLight with respect to different numbers of attention heads ($H$) on dataset $Grid_{6\times 6}$. More types of attention ($H\le 5$) enhance model efficiency, while too many ($H>5$) could distract the learning and deteriorate the overall performance.
6.1 Features for a driving state
6.2 Hyperparameter settings for ImInGAIL
6.3 Statistics of dense and sparse expert trajectory in different datasets
6.4 Performance w.r.t. Relative Mean Squared Error (RMSE) of time (in seconds) and position (in kilometers). All the measurements are conducted on dense trajectories. The lower the better. Our proposed method ImInGAIL achieves the best performance.
6.5 RMSE on time and position of our proposed method ImInGAIL against baseline methods and their corresponding two-step variants. Baseline methods and ImInGAIL learn from sparse trajectories, while the two-step variants interpolate sparse trajectories first and are trained on the interpolated data. ImInGAIL achieves the best performance in most cases.
Acknowledgements
I would like to express my sincere appreciation to my doctoral committee, Dr. Zhenhui (Jessie) Li, Dr. C. Lee Giles, Dr. Xiang Zhang, Dr. Vikash V. Gayah, and my Ph.D. program chair, Dr. Mary Beth Rosson.
I would like to show my special appreciation to my Ph.D. adviser, Dr. Zhenhui (Jessie) Li. Jessie patiently guided me through the research process and career development, including several open-armed, enlightening discussions on how to cope with work-life balance, pressure and priorities. Her supportive yet critical attitude has made our fruitful collaboration possible, and her patience and kindness have helped me manage a smooth career development. Without her guidance, my Ph.D. would surely not have been as successful.
I would also like to extend my thanks to Dr. Vikash Gayah from the Department of Civil and Environmental Engineering, whose insightful advice forms some of the foundations of our interdisciplinary research and who has been more than happy to help me become familiar with the background material.
I am also fortunate to have collaborated with several others throughout my Ph.D.: Guanjie Zheng, Chacha Chen, Porter Jenkins, Zhengyao Yu, Kan Wu, Huaxiu Yao, Nan Xu, Chang Liu, and Yuandong Wang. Without them, this research would not have been possible.
The studies in this dissertation have been supported by NSF awards #1544455, #1652525, #1618448, and #1639150. The views and conclusions contained in these studies are those of the authors and should not be interpreted as representing any funding agency. I thank the funding agencies for their generous support in making this research happen.
During the several years of my Ph.D. program, I was lucky to be surrounded by wonderful colleagues and friends. Thank you all for sharing Friday game nights with me, for the road trips and weekend getaways, and for the summer twilight barbecues. All of you made my time at Penn State a wonderful experience.
At last, a very special thank you to my parents Shaozhu and Yuling, to my whole family, for always supporting me and for always encouraging me to pursue my passion.
Chapter 1 Introduction
Traffic congestion is a growing problem that continues to plague urban areas, with negative outcomes for both the traveling public and society as a whole. These negative outcomes will only grow over time as more people flock to urban areas. In 2014, traffic congestion cost Americans over $160 billion in lost productivity and wasted over 3.1 billion gallons of fuel [2]. Traffic congestion was also attributed to over 56 billion pounds of harmful CO2 emissions in 2011 [3]. In the European Union, the cost of traffic congestion was equivalent to 1% of its entire GDP [4]. Mitigating congestion would have significant economic, environmental and societal benefits. Signalized intersections are one of the most prevalent bottleneck types in urban environments, and thus traffic signal control plays a large role in urban traffic management.
1.1 Why do we need a more intelligent traffic signal control
The majority of cities, small and large, even in industrialized countries, still operate old-fashioned fixed-time signal control strategies, often poorly optimized or maintained. Even when modern traffic-responsive control systems are installed (e.g., SCATS [5] and SCOOT [6, 7]), the employed control strategies are sometimes naive, mainly based on manually designed traffic signal plans.
On the other hand, various kinds of traffic data can now be collected to enrich information about traffic conditions. Systems like SCATS or SCOOT mainly rely on loop sensor data to choose signal plans. However, loop sensors only count vehicles as they pass the sensor, while an increasing amount of traffic data is now collected from sources such as GPS-equipped vehicles, navigation systems, and traffic surveillance cameras. How to use this rich traffic data to better optimize traffic signal control systems has attracted growing attention from academia, government and industry.
1.2 Why do we use reinforcement learning for traffic signal control
In the transportation field, traffic signal control is one of the most fundamental research questions [8]. The typical approach transportation researchers take is to seek an optimization solution under certain assumptions about traffic models [8]. However, most of these works focus only on automobiles, and the assumptions they make are simplified and do not necessarily hold true in the field. Real traffic behaves in a complicated way, affected by many factors such as drivers' preferences, interactions with vulnerable road users (e.g., pedestrians and cyclists), and weather and road conditions. These factors can hardly be modelled accurately enough to optimize traffic signal control.
On the other hand, machine learning techniques learn directly from observed data without making assumptions about the data model. However, traffic signal control is not a typical machine learning problem with fixed data sets for training and testing. Real-world traffic is constantly changing, and the execution of traffic signals in turn changes the traffic. Therefore, with ever-changing data samples, we need to learn from the feedback of the environment. This idea of trial-and-error is at the heart of RL: the agent attempts different traffic signal control strategies given the current traffic environment and adjusts its strategy based on environmental feedback.
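To make this trial-and-error loop concrete, it can be sketched as a minimal tabular Q-learning agent. The environment interface (`env.reset`, `env.step`), the state encoding, and the reward here are hypothetical placeholders for a traffic simulator, not the specific designs proposed later in this dissertation.

```python
import random
from collections import defaultdict

def q_learning(env, n_phases, episodes=100, alpha=0.1, gamma=0.9, eps=0.1):
    """Tabular Q-learning sketch: the agent picks a signal phase (action),
    observes the resulting traffic state and reward, and updates its
    value estimates from that feedback."""
    q = defaultdict(float)  # maps (state, action) -> estimated return
    for _ in range(episodes):
        state = env.reset()
        done = False
        while not done:
            # Epsilon-greedy: mostly exploit the best-known phase,
            # occasionally explore another one.
            if random.random() < eps:
                action = random.randrange(n_phases)
            else:
                action = max(range(n_phases), key=lambda a: q[(state, a)])
            next_state, reward, done = env.step(action)
            # Temporal-difference update from environmental feedback.
            best_next = 0.0 if done else max(
                q[(next_state, a)] for a in range(n_phases))
            q[(state, action)] += alpha * (
                reward + gamma * best_next - q[(state, action)])
            state = next_state
    return q
```

Any simulator exposing `reset()` and `step(action)` (returning next state, reward, and a done flag) can be plugged into this loop; the deep RL methods discussed later replace the table `q` with a neural network.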
1.3 Why RL for traffic signal control is challenging?
1.3.1 Formulation of RL agent
A key question for RL is how to define the reward and state. In existing studies [9, 10, 11], a typical reward definition for traffic signal control is a weighted linear combination of several components, such as queue length, waiting time, number of signal switches, and total delay. The state includes components such as queue length, number of cars, waiting time, and the current traffic signal. In recent work [10, 11], images of vehicles' positions on the roads are also included in the state.
However, all of the existing work takes an ad-hoc approach to defining the reward and state. Such an ad-hoc approach causes several problems that hinder the application of RL in the real world. First, the engineering details in formulating the reward and state can significantly affect the results. For example, if the reward is defined as a weighted linear combination of several terms, the weight on each term is tricky to set, and minor differences in weight settings can lead to dramatically different results. Second, the state representation can be high-dimensional, especially when traffic images are used as part of the state [10, 11]. Such a high-dimensional state representation needs many more training samples to learn from and may not even converge. Third, there is no connection between existing RL approaches and transportation methods. Without the support of transportation theory, it is highly risky to apply these purely data-driven RL-based approaches in the real physical world.
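As a toy illustration of this weight-sensitivity issue, consider a hypothetical weighted linear reward of the kind surveyed above. All component values and weights below are invented, not taken from any cited work; the point is that a modest change in the weights reverses which signal plan the agent would prefer.

```python
def weighted_reward(queue_length, waiting_time, num_switches, delay, weights):
    """Hypothetical ad-hoc reward: a weighted linear combination of
    traffic cost measures, negated so higher reward is better."""
    w1, w2, w3, w4 = weights
    return -(w1 * queue_length + w2 * waiting_time
             + w3 * num_switches + w4 * delay)

# Two made-up outcomes of two candidate signal plans.
plan_a = dict(queue_length=8, waiting_time=40, num_switches=1, delay=30)
plan_b = dict(queue_length=5, waiting_time=55, num_switches=3, delay=25)

# Under one weighting, plan A looks better; under a slightly different
# weighting, plan B does -- the ranking depends on hand-tuned weights.
r_a1 = weighted_reward(**plan_a, weights=(1.0, 0.5, 1.0, 0.5))
r_b1 = weighted_reward(**plan_b, weights=(1.0, 0.5, 1.0, 0.5))
r_a2 = weighted_reward(**plan_a, weights=(2.0, 0.1, 1.0, 0.5))
r_b2 = weighted_reward(**plan_b, weights=(2.0, 0.1, 1.0, 0.5))
```

Because an RL agent optimizes exactly what the reward encodes, such flips in ranking translate directly into different learned policies.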
1.3.2 Learning cost
While learning from trial and error is the key idea in RL, the learning cost of RL is fatal for real-world applications. Although RL algorithms are very useful for learning good solutions when the model of the environment is unknown in advance [12, 13], these solutions may only be reached after an extensive number of trials and errors, which is usually very time consuming. Existing RL methods for games (e.g., Go or Atari games) yield impressive results in simulated environments, but the cost of error in traffic signal control is critical, even fatal, in the real world. Therefore, how to learn efficiently (e.g., learning from limited data samples, sampling training data adaptively, and transferring learned knowledge) is a critical question for the application of RL to traffic signal control.
1.3.3 Simulation
Reinforcement learning (RL) has shown great success in a series of artificial intelligence (AI) domains such as the game of Go [14]. Despite this huge success in AI domains, RL has not yet shown the same degree of success for real-world applications [15]. These applications involve control systems such as drones [16] and systems that interact with people, such as traffic signal control [11]. As people analyze the challenges in all these scenarios, a recurring theme emerges: there is rarely a good simulator [15].
1.4 Previous Studies
Some work has already been done investigating reinforcement learning methods for traffic signal control, focusing on different scales of control, including isolated intersection control [4-6], arterial control [7], and region control [8-10]. Most previous studies use various features in their state and reward designs, while we propose a formulation of reinforcement learning connected with transportation theory. Furthermore, we propose to investigate how to learn efficiently to reduce the learning cost in real-world settings. We discuss the literature further in Chapter 2, Chapter 3 and Chapter 4.
1.5 Proposed Tasks
In this dissertation, we propose to use RL for traffic signal control in a way that combines transportation-guided RL with efficient learning. The first part guides the formulation of RL toward a reasonable state and reward design; the connection between RL approaches and transportation theory can help RL optimize toward a correct objective and condense the state space for convergence. The second part helps reduce the learning cost of RL and enables its application in the real world. The third part tackles the real-world application issue by building a realistic simulator. Specifically, we elaborate RL-based traffic signal control methods with real-world implications on the following topics:
Formulation of RL with theoretical justification. How should we formulate the RL agent, i.e., the reward and state definitions, for the traffic signal control problem? Is it always beneficial to use complex reward functions and state representations? In [7, 1, 17], I proposed to simplify the intricate designs in the current literature and use concise reward and state designs, with theoretical proof that the concise design leads toward the global optimum, for both single-intersection and multi-intersection control scenarios.
Optimization of the learning process in RL. Is a vanilla deep neural network the solution to the traffic signal control problem with RL? In [11], I tackle the sample imbalance problem with a phase-gated network to emphasize certain features. In [18], we design a network to model the priority between actions and to learn transformations of traffic flow such as flips and rotations. Coordination can also benefit signal control in multi-intersection scenarios. In [19], I contributed a framework that enables agents to communicate their observations and behave as a group. Though each agent has limited capability and visibility of the world, in this work agents were able to cooperate with multi-hop neighboring intersections and learn the dynamic interactions between the hidden states of neighboring agents. Later, in [20], we investigated the possibility of using RL to control traffic signals at the scale of a city, testing our RL methods in a simulator on the road network of Manhattan, New York, with 2510 traffic signals. This is the first time an RL-based method has operated at such a scale in traffic signal control.
Bridging the Gap from Simulation to Reality. Despite its massive success in artificial domains, RL has not yet shown the same degree of success in real-world applications. These applications involve control systems such as drones, software systems such as data centers, and systems that interact with people, such as transportation systems. Currently, most RL methods conduct experiments in a simulator, since a simulator can generate data more cheaply and quickly than real experimentation. Discrepancies between simulation and reality confine the application of learned policies in the real world.
1.6 Overview of this Dissertation
In this chapter, we discussed the motivation, challenges and proposed tasks of using reinforcement learning for traffic signal control. In the following chapters, we elaborate on the details of works that apply reinforcement learning in several scenarios. Chapter 2 covers the notation, background and necessary literature for traffic signal control. The basic formulation of RL-based traffic signal control and a further theory-guided RL method are discussed in detail in Chapters 3 and 4. Chapter 5 discusses how to improve learning efficiency with neighborhood information for coordinated intersections. Chapter 6 tackles the challenge of real-world application of RL by building a realistic simulator, and Chapter 7 briefly discusses potential future work.
Chapter 2 Notation, Background and Literature
This chapter covers the notation, background information and necessary literature that will be referenced throughout the rest of this dissertation. The main part of this chapter is adapted from our surveys [21] and [22].
2.1 Preliminaries of Traffic Signal Control Problem
2.1.1 Term Definition
Terms on road structure and traffic movement:
Approach: A roadway meeting at an intersection is referred to as an approach. At any general intersection, there are two kinds of approaches: incoming approaches and outgoing approaches. An incoming approach is one on which cars can enter the intersection; an outgoing approach is one on which cars can leave the intersection. Figure 2.1(a) shows a typical intersection with four incoming approaches and four outgoing approaches. The southbound incoming approach, for example, is the approach on the north side on which vehicles travel in the southbound direction.
Lane: An approach consists of a set of lanes. As with approaches, there are two kinds of lanes: incoming lanes and outgoing lanes (also known as approaching/entering lanes and receiving/exiting lanes in some references [23, 24]).
Traffic movement: A traffic movement refers to vehicles moving from an incoming approach to an outgoing approach, denoted as $(r_{i}\to r_{o})$, where $r_{i}$ and $r_{o}$ are the incoming approach and the outgoing approach, respectively. A traffic movement is generally categorized as a left turn, through, or right turn.
Terms on traffic signal:
Movement signal: A movement signal is defined on a traffic movement, with a green signal indicating that the corresponding movement is allowed and a red signal indicating that the movement is prohibited. For the four-leg intersection shown in Figure 2.1(a), right-turn traffic can pass regardless of the signal, so there are eight movement signals in use, as shown in Figure 2.1(b).
Phase: A phase is a combination of movement signals. Figure 2.1(c) shows the conflict matrix for combinations of two movement signals in the example of Figure 2.1(a) and Figure 2.1(b). A grey cell indicates that the corresponding two movements conflict with each other, i.e., they cannot be set to green at the same time (e.g., signals #1 and #2). A white cell indicates a non-conflicting pair of movement signals. The non-conflicting signals generate eight valid paired-signal phases (letters 'A' to 'H' in Figure 2.1(c)) and eight single-signal phases (the diagonal cells in the conflict matrix). Here we letter the paired-signal phases only because, in an isolated intersection, it is always more efficient to use paired-signal phases. When considering multiple intersections, single-signal phases might be necessary because of potential spillback.

Figure 2.1: Definitions of traffic movement and traffic signal phases.
Phase sequence: A phase sequence is a sequence of phases which defines a set of phases and their order of changes.
Signal plan: A signal plan for a single intersection is a sequence of phases and their corresponding starting times. Here we denote a signal plan as $(p_{1},t_{1})(p_{2},t_{2})\dots(p_{i},t_{i})\dots$, where $p_{i}$ and $t_{i}$ stand for a phase and its starting time.
Cycle-based signal plan: A cycle-based signal plan is a signal plan in which the sequence of phases operates in a cyclic order, which can be denoted as $(p_1, t_1^1)(p_2, t_2^1)\dots(p_N, t_N^1)(p_1, t_1^2)(p_2, t_2^2)\dots(p_N, t_N^2)\dots$, where $p_1, p_2, \dots, p_N$ is the repeated phase sequence and $t_i^j$ is the starting time of phase $p_i$ in the $j$th cycle. Specifically, $C^j = t_1^{j+1} - t_1^j$ is the cycle length of the $j$th phase cycle, and $\{\frac{t_2^j - t_1^j}{C^j}, \dots, \frac{t_N^j - t_{N-1}^j}{C^j}\}$ is the phase split of the $j$th phase cycle. Existing traffic signal control methods usually repeat a similar phase sequence throughout the day.
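The cycle length and phase split defined above can be computed directly from a plan's starting times. A minimal sketch (the helper name and plan encoding are illustrative; the split here also includes the last phase's share $(t_1^{j+1} - t_N^j)/C^j$, so the fractions sum to 1):

```python
def cycle_stats(plan, N):
    """Given a cycle-based signal plan as a list of (phase, start_time)
    pairs whose phase sequence repeats with period N, return the first
    cycle's length C^1 = t_1^2 - t_1^1 and its phase split."""
    starts = [t for _, t in plan]
    C = starts[N] - starts[0]                       # cycle length
    split = [(starts[i + 1] - starts[i]) / C for i in range(N)]
    return C, split

# Phases A, B, C repeating with a cycle length of 100 seconds.
C, split = cycle_stats([("A", 0), ("B", 30), ("C", 50),
                        ("A", 100), ("B", 130), ("C", 150)], N=3)
```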
Objective
The objective of traffic signal control is to facilitate the safe and efficient movement of vehicles through the intersection. Safety is achieved by separating conflicting movements in time and is not considered in more detail here. Various measures have been proposed to quantify the efficiency of the intersection from different perspectives:
Travel time. In traffic signal control, the travel time of a vehicle is defined as the difference between the time the vehicle enters the system and the time it leaves the system. One of the most common goals is to minimize the average travel time of vehicles in the network.
Queue length. The queue length of the road network is the total number of queuing vehicles in the road network.
Number of stops. The number of stops of a vehicle is the total number of times the vehicle comes to a stop during its trip.
Throughput. The throughput is the number of vehicles that have completed their trips in the road network during a given period.
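The measures above can be computed from per-vehicle trip records. A small sketch, assuming each completed trip is logged as an (enter_time, exit_time) pair (the encoding is illustrative):

```python
def average_travel_time(trips):
    """Mean of (exit - enter) over completed trips."""
    return sum(t_out - t_in for t_in, t_out in trips) / len(trips)

def throughput(trips, t_start, t_end):
    """Number of trips completed within the period [t_start, t_end)."""
    return sum(1 for _, t_out in trips if t_start <= t_out < t_end)

trips = [(0, 50), (10, 40), (20, 90)]
avg = average_travel_time(trips)   # (50 + 30 + 70) / 3
done = throughput(trips, 0, 60)    # exits at 50 and 40 fall in [0, 60)
```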
Special Considerations
In practice, additional attention should be paid to the following aspects:
Yellow and all-red time. A yellow signal is usually set as a transition from a green signal to a red one. Following the yellow, there is an all-red period during which all the signals in an intersection are set to red. The yellow and all-red time, which can last from 3 to 6 seconds, allows vehicles to stop safely or pass through the intersection before vehicles in conflicting traffic movements are given a green signal.
Minimum green time. Usually, a minimum green signal time is required to ensure that pedestrians moving during a particular phase can safely pass through the intersection.
Left-turn phase. Usually, a left-turn phase is added when the left-turn volume is above a certain threshold.
2.2 Background of Reinforcement Learning
In this section, we first describe the reinforcement learning framework that constitutes the foundation of all the methods presented in this dissertation. We then provide background on conventional RL-based traffic signal control, including the problems of controlling a single intersection and multiple intersections.
Single-Agent Reinforcement Learning
Usually a single-agent RL problem is modeled as a Markov Decision Process represented by $\langle \mathcal{S}, \mathcal{A}, P, R, \gamma \rangle$, whose components are defined as follows:
$\bullet$ Set of state representations $\mathcal{S}$: At time step $t$, the agent observes state $s^t \in \mathcal{S}$.
$\bullet$ Set of actions $\mathcal{A}$ and state transition function $P$: At time step $t$, the agent takes an action $a^t \in \mathcal{A}$, which induces a transition in the environment according to the state transition function $P(s^{t+1} \mid s^t, a^t): \mathcal{S} \times \mathcal{A} \to \mathcal{S}$.
$\bullet$ Reward function $R$: At time step $t$, the agent obtains a reward $r^t$ from the reward function $R(s^t, a^t): \mathcal{S} \times \mathcal{A} \to \mathbb{R}$.
$\bullet$ Discount factor $\gamma$: The goal of an agent is to find a policy that maximizes the expected return, which is the discounted sum of rewards: $G^t := \sum_{i=0}^{\infty} \gamma^i r^{t+i}$, where the discount factor $\gamma \in [0, 1]$ controls the importance of immediate rewards versus future rewards.
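Over a finite reward trace, the return $G^t$ can be accumulated in a single backward pass. A minimal sketch:

```python
def discounted_return(rewards, gamma):
    """Compute G = sum_i gamma^i * r_i for a finite list of rewards,
    accumulating from the last reward backwards (Horner's scheme)."""
    G = 0.0
    for r in reversed(rewards):
        G = r + gamma * G
    return G

G = discounted_return([1.0, 1.0, 1.0], gamma=0.5)  # 1 + 0.5 + 0.25
```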
Here, we only consider continuing agent-environment interactions, which do not end with terminal states but go on continually without limit.
Solving a reinforcement learning task means, roughly, finding an optimal policy $\pi^*$ that maximizes the expected return. While the agent only receives reward for its immediate, one-step performance, one way to find the optimal policy $\pi^*$ is by following an optimal action-value function or state-value function. The action-value function (Q-function) of a policy $\pi$, $Q^\pi: \mathcal{S} \times \mathcal{A} \to \mathbb{R}$, is the expected return of a state-action pair: $Q^\pi(s, a) = \mathbb{E}_\pi[G^t \mid s^t = s, a^t = a]$. The state-value function of a policy $\pi$, $V^\pi: \mathcal{S} \to \mathbb{R}$, is the expected return of a state: $V^\pi(s) = \mathbb{E}_\pi[G^t \mid s^t = s]$.
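For a finite state-action space, an optimal action-value function can be approached with the classic tabular Q-learning update. A minimal sketch, with a dictionary standing in for the Q-table:

```python
def q_update(Q, s, a, r, s_next, actions, alpha=0.1, gamma=0.9):
    """One tabular Q-learning step:
    Q(s, a) <- Q(s, a) + alpha * (r + gamma * max_b Q(s', b) - Q(s, a)).
    Unseen (state, action) pairs default to a value of 0."""
    best_next = max(Q.get((s_next, b), 0.0) for b in actions)
    td_error = r + gamma * best_next - Q.get((s, a), 0.0)
    Q[(s, a)] = Q.get((s, a), 0.0) + alpha * td_error
    return Q[(s, a)]

Q = {}
q_update(Q, "s0", "keep", 1.0, "s1", ["keep", "change"], alpha=0.5)
```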
Problem setting
We now introduce the general setting of the RL-based traffic signal control problem, in which the traffic signals are controlled by one or several RL agents. Figure 2.2 illustrates the basic idea of the RL framework in the single traffic signal control problem. The environment is the traffic conditions on the roads, and the agent controls the traffic signal. At each time step $t$, a description of the environment (e.g., signal phase, waiting time of cars, queue length of cars, and positions of cars) is generated as the state $s^t$. The agent predicts the next action $a^t$ to take that maximizes the expected return, where the action could be changing to a certain phase in the single-intersection scenario. The action $a^t$ is executed in the environment, and a reward $r^t$ is generated, where the reward could be defined on the traffic conditions of the intersection. Usually, in the decision process, an agent combines exploitation of the learned policy with exploration of new policies.
Figure 2.2: RL framework for traffic signal control.
In the multi-intersection traffic signal control problem, there are $N$ traffic signals in the environment, controlled by one or several agents. The goal of the agent(s) is to learn the optimal policies that optimize the traffic conditions of the whole environment. At each time step $t$, each agent $i$ observes part of the environment as the observation $o_i^t$ and predicts the next actions $\mathsf{a}^t = (\mathsf{a}_1^t, \dots, \mathsf{a}_N^t)$ to take. The actions are executed in the environment, and rewards $\mathsf{r}_i^t$ are generated, where the reward could be defined on the level of individual intersections or a group of intersections within the environment. We refer readers interested in more detailed problem settings to [25].
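The interaction loop of this setting can be sketched as follows. The environment here is a random stand-in invented for illustration; a real study would plug in a traffic simulator at this point:

```python
import random

class ToyTrafficEnv:
    """Stand-in for a multi-intersection environment with n agents."""
    def __init__(self, n, seed=0):
        self.n = n
        self.rng = random.Random(seed)
    def observe(self):
        # one local observation per intersection, e.g. a congestion level
        return [self.rng.random() for _ in range(self.n)]
    def step(self, actions):
        # one reward per intersection, e.g. negative queue length
        return [-self.rng.random() for _ in actions]

def policy(obs):
    # toy 2-action policy: 1 = change phase, 0 = keep phase
    return 1 if obs > 0.5 else 0

env = ToyTrafficEnv(n=4)
for t in range(3):
    observations = env.observe()
    actions = [policy(o) for o in observations]  # a^t = (a_1^t, ..., a_N^t)
    rewards = env.step(actions)
```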
2.3 Literature
In this section, we introduce three major aspects investigated in the recent RL-based traffic signal control literature: agent formulation, policy learning approaches, and coordination strategies.
Agent formulation
A key question for RL is how to formulate the RL agent, i.e., the reward, state, and action definitions. In this section, we focus on advances in reward, state, and action design in recent deep RL-based methods, and refer readers interested in more detailed definitions to [26, 27, 28].
2.3.1.1 Reward
The choice of reward reflects the learning objective of an RL agent. In the traffic signal control problem, although the ultimate objective is to minimize the travel time of all vehicles, travel time is hard to use directly as a valid reward in RL: because the travel time of a vehicle is affected by multiple actions of traffic signals and vehicle movements, travel time as a reward would be delayed and ineffective in indicating the goodness of a signal's action. Therefore, the existing literature often uses a surrogate reward that can be effectively measured after an action, considering factors like average queue length, average waiting time, average speed, or throughput [11, 29]. The authors in [10] also take the frequency of signal changes and the number of emergency stops into the reward.
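Such a surrogate reward is typically a weighted combination of post-action measurements. A sketch with illustrative weights (the weights and function name are not taken from any cited paper):

```python
def surrogate_reward(queue_lengths, waiting_times, throughput,
                     w_queue=-0.5, w_wait=-0.05, w_through=1.0):
    """Weighted surrogate reward: penalize average queue length and
    average waiting time, reward throughput. Weights are illustrative."""
    avg_queue = sum(queue_lengths) / len(queue_lengths)
    avg_wait = sum(waiting_times) / len(waiting_times)
    return w_queue * avg_queue + w_wait * avg_wait + w_through * throughput

r = surrogate_reward([2, 4], [10, 30], 5)  # -0.5*3 - 0.05*20 + 1.0*5
```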
2.3.1.2 State
At each time step, the agent receives some quantitative description of the environment as the state to decide its action. Various kinds of elements have been proposed to describe the environment state, such as queue length, waiting time, speed, and phase. These elements can be defined at the lane level or road segment level and then concatenated into a vector. In earlier work using RL for traffic signal control, the state space had to be discretized and approximated with a simple tabular or linear model for efficiency [30, 31, 32]. However, the real-world state space is usually huge, which limits these traditional RL methods in terms of memory or performance. With advances in deep learning, deep RL methods have been proposed to handle large state spaces with effective function approximators. Recent studies propose to use images [33, 34, 35, 36, 37, 38, 39, 40, 10, 11] to represent the state, where the positions of vehicles are extracted as an image representation.
2.3.1.3 Action scheme
There are several types of action definitions for an RL agent in traffic signal control: (1) set the current phase duration [42, 43], (2) set the ratio of each phase duration over a predefined total cycle duration [32, 44], (3) change to the next phase in a predefined cyclic phase sequence [10, 11, 27, 45], and (4) choose the phase to change to among a set of phases [34, 39, 46, 47, 48, 1]. The choice of action scheme is closely related to specific settings of the traffic signals. For example, if the phase sequence is required to be cyclic, then the first three action schemes should be considered, while "choosing the phase to change to among a set of phases" can generate flexible phase sequences.
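Schemes (3) and (4) can be contrasted in a few lines; a minimal sketch (the scheme labels and function name are illustrative):

```python
def next_phase(scheme, current, phases, action):
    """Scheme (3) 'switch': a binary action keeps the current phase or
    advances to the next one in a cyclic sequence.
    Scheme (4) 'choose': the action picks any phase directly, which
    allows flexible (acyclic) phase sequences."""
    if scheme == "switch":
        i = phases.index(current)
        return phases[(i + 1) % len(phases)] if action else current
    if scheme == "choose":
        return phases[action]
    raise ValueError(f"unknown scheme: {scheme}")

phases = ["A", "B", "C", "D"]
next_phase("switch", "D", phases, 1)   # wraps around to "A"
next_phase("choose", "D", phases, 1)   # jumps straight to "B"
```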
Policy Learning
RL methods can be categorized in different ways. [49, 50] divide current RL methods into model-based methods and model-free methods. Model-based methods try to model the transition probabilities among states explicitly, while model-free methods directly estimate the reward for state-action pairs and choose actions based on these estimates. In the context of traffic signal control, the transition between states is primarily influenced by people's driving behaviors, which are diverse and hard to predict. Therefore, currently, most RL-based methods for traffic signal control are model-free methods. In this section, we adopt the categorization in [51]: value-based methods and policy-based methods.
2.3.2.1 Value-based methods
Value-based methods approximate the state-value function or state-action value function (i.e., how rewarding each state or state-action pair is), and the policy is implicitly obtained from the learned value function. Most RL-based traffic signal control methods use DQN [52], where the model is parameterized by neural networks and takes the state representation as input [10, 53]. In DQN, discrete actions are required since the model directly outputs each action's value given a state, which is especially suitable for action schemes (3) and (4) mentioned in Section 2.3.1.3.
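At decision time, a value-based agent scores every discrete action and typically acts epsilon-greedily on those scores. A minimal sketch with a linear stand-in for the Q-network (a DQN would use a deep network in its place):

```python
import random

def q_values(state, weights):
    """Tiny linear stand-in for a Q-network: one weight vector per
    discrete action, producing one Q estimate per action."""
    return [sum(w_i * s_i for w_i, s_i in zip(w, state)) for w in weights]

def epsilon_greedy(state, weights, eps=0.1, rng=random):
    """With probability eps explore uniformly; otherwise pick the
    action with the highest estimated Q-value."""
    qs = q_values(state, weights)
    if rng.random() < eps:
        return rng.randrange(len(qs))
    return max(range(len(qs)), key=qs.__getitem__)

weights = [[1.0, 0.0], [0.0, 1.0]]                # 2 actions, 2 features
a = epsilon_greedy([0.2, 0.9], weights, eps=0.0)  # greedy choice
```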
2.3.2.2 Policy-based methods
Policy-based methods directly update the policy parameters (e.g., a vector of probabilities of conducting actions under a specific state) in the direction that maximizes a predefined objective (e.g., average expected return). One advantage of policy-based methods is that they do not require the action to be discrete as DQN does. Also, they can learn a stochastic policy and keep exploring potentially more rewarding actions. To stabilize the training process, the actor-critic framework is widely adopted. It utilizes the strengths of both value-based and policy-based methods, with an actor that controls how the agent behaves (policy-based) and a critic that measures how good the conducted action is (value-based). In the traffic signal control problem, [44] uses DDPG [54] to learn a deterministic policy that directly maps states to actions, while [34, 42, 55] learn stochastic policies that map states to action probability distributions, all of which have shown excellent performance in traffic signal control problems. To further improve the convergence speed of RL agents, [56] proposed a time-dependent baseline to reduce the variance of policy gradient updates, specifically to avoid traffic jams.
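The core of a policy-based update is the gradient of the return-weighted log-probability of the taken action. For a softmax policy over discrete actions this gradient has a closed form; a minimal REINFORCE-style sketch:

```python
import math

def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    m = max(logits)
    exps = [math.exp(l - m) for l in logits]
    z = sum(exps)
    return [e / z for e in exps]

def reinforce_grad(logits, action, G):
    """Gradient of G * log pi(action) w.r.t. the softmax logits:
    d/d logit_k = G * (1[k == action] - pi_k). An actor-critic method
    would replace the raw return G with a critic-based advantage."""
    probs = softmax(logits)
    return [G * ((1.0 if k == action else 0.0) - p)
            for k, p in enumerate(probs)]

grad = reinforce_grad([0.0, 0.0], action=0, G=2.0)
```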
In the above-mentioned methods, both value-based and policy-based, deep neural networks are used to approximate the value functions. Most of the literature uses standard neural network architectures chosen for their respective strengths. For example, Convolutional Neural Networks (CNN) are used when the state contains an image representation [10, 38, 39, 40, 33, 34, 35, 36]; Recurrent Neural Networks (RNN) are used to capture the temporal dependency of historical states [57].
Coordination strategy
Coordination could benefit signal control in multi-intersection scenarios. Since recent advances in RL have improved performance on isolated traffic signal control, efforts have been made to design cooperation strategies for multi-agent RL (MARL) agents. [60] categorizes MARL methods into two classes: joint action learners and independent learners. Here we extend this categorization for the traffic signal control problem.
2.3.3.1 Joint action learners
A straightforward solution is to use a single global agent to control all the intersections [31]. It directly takes the global state as input and learns to set the joint actions of all intersections at the same time. However, such methods suffer from the curse of dimensionality: the state-action space grows exponentially with the number of state and action dimensions. Joint action modeling methods instead explicitly learn to model the joint action
| Citation | Method | Cooperation | Road net (# signals) |
| :-: | :-: | :-: | :-: |
| [46] | Value-based | With communication | Synthetic (5) |
|  | Policy-based | Without communication | Real (50) |
|  | Policy-based | Without communication | Real (43) |
|  | Value-based | Without communication | Real (2510) |
|  | Policy-based | Joint action | Real (30) |
|  | Value-based |  | Synthetic (1) |
|  | Value-based |  | Synthetic (1) |
|  | Value-based | Without communication | Synthetic (9) |
|  | Both studied |  | Synthetic (1) |
|  | Value-based | With communication | Synthetic (6) |
|  | Value-based | Without communication | Synthetic (4) |
|  | Both studied | Single global | Synthetic (5) |
|  | Policy-based |  | Real (1) |
|  | Value-based | Joint action | Synthetic (4) |
|  | Value-based | With communication | Real (4) |
|  | Value-based |  | Synthetic (1) |
|  | Value-based | Without communication | Real (16) |
|  | Value-based | With communication | Real (196) |
|  | Value-based | Joint action | Synthetic (36) |
|  | Value-based | Without communication | Real (16) |
|  | Value-based | Without communication | Real (5) |
* Traffic with an arrival rate of less than 500 vehicles/hour/lane is considered light traffic in this survey; otherwise it is considered heavy.
* 1. Synthetic light uniform; 2. Synthetic light dynamic; 3. Synthetic heavy uniform; 4. Synthetic heavy dynamic; 5. Real-world data
Table 2.1: Representative deep RL-based traffic signal control methods. Due to space limits, we only include part of the investigated papers here.
value of multiple agents $Q(o_1, \dots, o_N, \mathbf{a})$. The joint action space grows with the number of agents to model. To alleviate this challenge, [10] factorizes the global Q-function as a linear combination of local subproblems, extending [9] using the max-plus algorithm [61]: $\hat{Q}(o_1, \dots, o_N, \mathbf{a}) = \sum_{i,j} Q_{i,j}(o_i, o_j, \mathbf{a}_i, \mathbf{a}_j)$, where $i$ and $j$ correspond to the indices of neighboring agents. In other work, [48, 59, 62] regard the joint Q-value as a weighted sum of local Q-values, $\hat{Q}(o_1, \dots, o_N, \mathbf{a}) = \sum_{i,j} w_{i,j} Q_{i,j}(o_i, o_j, \mathbf{a}_i, \mathbf{a}_j)$, where $w_{i,j}$ are predefined weights.
These methods attempt to make individual agents account for other agents' learning processes by adding a shaping term to the loss function of each individual agent and minimizing the difference between the weighted sum of individual Q-values and the global Q-value.
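Both factorizations above can be sketched with one function: the joint Q is a (weighted) sum of pairwise local Q-values over neighboring agent pairs. The local Q below is a toy stand-in, not a trained model:

```python
def joint_q(obs, acts, edges, q_local, weights=None):
    """Approximate Q(o_1..o_N, a) as the sum over neighbor pairs (i, j)
    of w_ij * Q_ij(o_i, o_j, a_i, a_j); unit weights recover the
    unweighted max-plus-style factorization."""
    total = 0.0
    for i, j in edges:
        w = 1.0 if weights is None else weights[(i, j)]
        total += w * q_local(obs[i], obs[j], acts[i], acts[j])
    return total

# Toy local Q: just sums its inputs.
toy_q = lambda oi, oj, ai, aj: oi + oj + ai + aj
value = joint_q([1, 2, 3], [0, 1, 0], edges=[(0, 1), (1, 2)], q_local=toy_q)
```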
2.3.3.2 Independent learners
There is also a line of studies that uses individual RL (IRL) agents to control the traffic signals, where each RL agent controls one intersection. Unlike joint action learning methods, each agent learns its control policy without knowing the reward signals of other agents.
IRL-without-communication methods treat each intersection individually: each agent observes its own local environment and does not use explicit communication to resolve conflicts [1, 27, 39, 40, 44, 45, 63]. In some simple scenarios like arterial networks, this approach has performed well, forming several mini green waves. However, when the environment becomes complicated, non-stationary impacts from neighboring agents are brought into the environment, and the learning process usually cannot converge to stationary policies if there are no communication or coordination mechanisms among agents [64].
IRL-with-communication methods enable agents to communicate their observations and behave as a group rather than as a collection of individuals, especially in complex tasks where the environment is dynamic and each agent has limited capabilities and visibility of the world [65]. Typical methods add a neighbor's traffic condition [66] or past actions [67] directly into the observation of the ego agent, rather than using only the local traffic condition of the ego agent. In these methods, the agents for different intersections share one learning model, which requires consistent indexing of neighboring intersections. [47] attempts to remove this requirement by utilizing the road network structure with a Graph Convolutional Network [68] and coordinating multi-hop nearby intersections. [47] models the influence of neighboring agents with the fixed adjacency matrix defined in the Graph Convolutional Network, which reflects the assumption that the influences between neighbors are static. In other work, [57] proposes to use Graph Attention Networks [69] to learn the dynamic interactions between the hidden states of neighboring agents and the ego agent. It should be pointed out that there is a strong connection between the methods that employ max-plus [61] to learn joint actions in Section 2.3.3.1 and the methods using a Graph Convolutional Network to learn the communication, as both can be seen as learning message passing on a graph, where the former passes rewards and the latter passes state observations.
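The communication schemes above boil down to augmenting each agent's input with aggregated neighbor information. A one-hop mean-aggregation sketch; a GCN or GAT would instead learn the aggregation weights:

```python
def aggregate_neighbors(obs, adj):
    """For each agent i, pair its own observation with the mean of its
    neighbors' observations, as given by the adjacency lists adj[i].
    Agents with no neighbors get 0.0 as the neighbor summary."""
    out = []
    for i, own in enumerate(obs):
        nbrs = [obs[j] for j in adj[i]]
        mean_nbr = sum(nbrs) / len(nbrs) if nbrs else 0.0
        out.append((own, mean_nbr))
    return out

# 3 intersections in a line: 0 - 1 - 2
augmented = aggregate_neighbors([1.0, 2.0, 3.0], {0: [1], 1: [0, 2], 2: [1]})
```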
2.3.3.3 Sizes of road networks
At a coarse scale, a road network is a directed graph, with nodes and edges representing intersections and roads, respectively. A real-world road network can be more complicated than a synthetic network in its road properties (e.g., the number of lanes and the speed limit of every lane), intersection structures, and signal phase settings. Among all the road network properties, the number of traffic signals in the network largely influences the experimental results, because the scale of exploration the RL agents must perform increases with the scale of the road network. Currently, most work still conducts experiments on relatively small road networks compared to the scale of a city, which could include thousands of traffic signals. Aslani et al. [42, 43] test their method on a real-world road network with 50 signals. In [19], a district with 196 signals is investigated. One of the most recent works [20] tests its methods on the real road network of Manhattan, New York, with 2510 traffic signals.
2.4 Conclusion
In this review of existing work, we found that the formulation of RL lacks theoretical justification, i.e., different papers propose ad-hoc designs for the reward and state formulation. In addition, the number of coordinated traffic signals is still small. Therefore, in the following chapters, I provide several methods to tackle these problems: I first show a basic formulation of RL-based traffic signal control in Chapter 3 and discuss further theory-guided RL in Chapter 4. Chapter 5 discusses how to improve learning efficiency with neighborhood information for coordinated intersections. Chapter 6 tackles the challenge of real-world applications of RL by building a realistic simulator.
Chapter 3 Basic Formulation for Traffic Signal Control
This chapter shows a basic formulation of RL-based traffic signal control and corresponds to our paper "IntelliLight: A Reinforcement Learning Approach for Intelligent Traffic Light Control" [11].
3.1 Introduction
Traffic congestion has become increasingly costly. For example, traffic congestion costs Americans $124 billion a year, according to a report by Forbes in 2014 [70]. In the European Union, the cost of traffic congestion is estimated to be 1% of its GDP [2]. Improving traffic conditions could increase city efficiency, improve the economy, and ease people's daily lives.
One way to reduce traffic congestion is to intelligently control traffic lights. Nowadays, most traffic lights are still controlled with predefined fixed-time plans [71, 72] and are not designed by observing real traffic. Recent studies propose hand-crafted rules according to real traffic data [73, 74]. However, these rules are still predefined and cannot be dynamically adjusted with respect to real-time traffic.
To dynamically adjust traffic lights according to real-time traffic, people have been using reinforcement learning techniques [9, 10, 75]. Traditional reinforcement learning is difficult to apply due to two key challenges: (1) how to represent the environment; and (2) how to model the correlation between the environment and the decision. To address these two challenges, recent studies [10, 76] have applied deep reinforcement learning techniques, such as Deep Q-learning (DQN), to the traffic light control problem. Figure 2.2 illustrates the basic idea of the deep reinforcement learning framework. The environment is composed of the traffic light phase and traffic conditions. The state is a feature representation of the environment. The agent takes the state as input and learns a model to predict whether to "keep the current phase of traffic lights" or "change the current phase". The decision is sent to the environment, and the reward (e.g., how many vehicles pass the intersection) is sent back to the agent. The agent consequently updates the model and further makes a new decision for the next timestamp based on the new state and the updated model. In such a framework, the traffic condition can be described as an image, and such an image is directly taken as input by a CNN-based model to enrich the hand-crafted features of the environment.
Recent deep reinforcement learning approaches have made promising progress on the traffic light control problem. Our approach extends this line of work by making several important new contributions:
1. Experiments with real traffic data. Nowadays, an increasing amount of traffic data is being collected from various sources. In China, many big cities have installed AI-equipped traffic surveillance cameras to monitor traffic conditions in real time. Such real-time traffic data enable us to implement reinforcement learning in the real world. However, to the best of our knowledge, none of the existing studies have used real traffic data to test their methods. Instead, they use traffic simulations, and such simulations do not reflect real-world traffic. For example, the simulation models in current studies often assume that vehicles arrive at a constant rate, but real traffic is highly dynamic over time. In this chapter, we test the methods on large-scale real traffic data obtained from 1,704 surveillance cameras in Jinan, China, for a period of 31 days (see the experiment section for details). This dataset contains more than 405 million vehicle records and more than 11 million unique vehicle plates. We conduct comprehensive experiments on this large real dataset.
2. Interpretations of the policy. A frequently used way to quantify the performance of traffic light control is to examine the overall reward, which can be defined by several factors such as the waiting time of vehicles and the number of vehicles passing the intersections. However, existing studies rarely examine the policies learned by the model. The reward could be misleading in some cases: there could be different policies with the same reward, of which one is more suitable than the other. Take Figure 3.1 as an example. Assume there is only traffic in the South-North direction and the traffic comes during the first 80 seconds of every 120 seconds. Policy #1 is 80 seconds of green light in the South-North direction followed by 40 seconds of red light, repeated. Policy #2 differs from policy #1 in that, instead of a 40-second red light in the South-North direction, the light changes every 10 seconds. Both policies result in the same reward, because no vehicle will be waiting under either policy. However, policy #1 is preferred over policy #2 in real scenarios. In this chapter, we claim that it is important to study the policies rather than simply reporting the overall reward. In our experiments, we show several interesting policies learned from real traffic under different scenarios (e.g., peak hours vs. non-peak hours, weekday vs. weekend).
3. A phase-gated model learning. As described earlier in the deep reinforcement learning framework, the agent takes the state, which is the representation of the environment, as model input. The environment usually includes the current traffic light phase and traffic conditions. For example, the environments of the two cases in Figure 3.2 are the same except for the traffic light phases. Previous studies all take the phase as one feature [10, 27], together with many other features (e.g., number of vehicles on different lanes, positions of vehicles), and it is likely that this single feature does not play a role significant enough to affect the model output. Therefore, the model will make
Figure 3.1: Reward is not a comprehensive measure to evaluate traffic light control performance. Both policies will lead to the same rewards. But policy #1 is more suitable than policy #2 in the real world.
the same decision (i.e., either keep or change the current phase) for these two different cases. However, such a decision, whichever it is, is not ideal for one of the cases: in Figure 3.2, case A calls for keeping the phase while case B calls for changing it. In this chapter, we propose a new phase-sensitive (i.e., phase gate combined with memory palace) reinforcement learning agent, whose design is a critical component that leads to superior performance.
The rest of this chapter is organized as follows. Section 3.2 discusses the literature. Section 3.3 formally defines the problem. The method is presented in Section 3.4 and the experimental results are shown in Section 3.5. Finally, we conclude the chapter in Section 3.6.
3.2 Related work
In this section, we first introduce conventional methods for traffic light control, and then introduce methods using reinforcement learning.
Conventional Traffic Light Control
Early traffic light control methods can be roughly classified into two groups. The first is pre-timed signal control [71, 72, 77], where a fixed time is determined for all green phases
Figure 3.2: Case A and case B have the same environment except the traffic light phase.
according to historical traffic demand, without considering possible fluctuations in traffic demand. The second is vehicle-actuated control [73, 74], where real-time traffic information is used. Vehicle-actuated methods are suitable for situations with relatively high traffic randomness. However, they depend largely on hand-crafted rules for the current traffic condition, without taking future situations into account, and therefore cannot reach the global optimum.
Reinforcement Learning for Traffic Light Control
Recently, due to the incapability of previous methods to deal with dynamic multi-direction traffic, more works try to use reinforcement learning algorithms to solve the traffic light control problem [75, 9, 27]. Typically, these algorithms take the traffic on the road as the state and the operation on the light as the action. These methods usually show better performance compared with fixed-time and traffic-responsive control methods.
Methods in [78, 79, 80, 9, 30] designed the state as discrete values such as the location of vehicles or the number of waiting cars. However, the discrete state-action pair value matrix requires huge storage space, which keeps these methods from being used in large state space problems.
To handle the unmanageably large state space of previous methods, recent studies [76, 10] propose to apply deep Q-learning methods using continuous state representations. These studies learn a Q-function (e.g., a deep neural network) to map state and action to reward. These works vary in their state representations, including hand-crafted features (e.g., queue length [27, 76], average delay [81, 10]) and image features [82, 10, 83]. They also differ in reward design, including average delay [10, 46], average travel time [83, 10], and queue length [76].
However, all these methods assume relatively static traffic environments, and hence are far from the real case. Further, they focus only on rewards and overlook the adaptability of the algorithms to real traffic. Therefore, they cannot interpret why the learned light signal changes in response to the traffic. In this chapter, we test the algorithms in a more realistic traffic setting and add more interpretation beyond reward.
3.3 Problem Definition
In our problem, we have the environment $\mathsf{E}$ as an intersection of two roads (and the traffic on this intersection). There is an intelligent traffic light agent $\mathsf{G}$. To make the notation simpler, we use "N", "S", "W", "E" to represent north, south, west, and east respectively, and use "Red" and "Green" to represent the red light and green light correspondingly. A setting of the traffic light is defined as a phase (e.g., green light on the west-east direction, which can be simplified as Green-WE). When a light changes from green to red, there is a 3-second yellow light, while the other directions still keep red, so one green light and the subsequent yellow light can be represented together by "Green". To simplify the problem, we assume there are only two phases of the traffic light, i.e., 1) Green-WE and 2) Red-WE. Due to the limitations of the real-world setting, the traffic light can only change in a specific order (i.e., 1 $\rightarrow$ 2 $\rightarrow$ 1 $\rightarrow$ 2 $\rightarrow$ ...). Given the state $\mathsf{s}$ (describing the positions and speed of the traffic near this intersection), the goal of the agent $\mathsf{G}$ is to give the optimal action $\mathsf{a}$ (i.e., whether to change the light to the next phase), so that the reward $\mathsf{r}$ (i.e., the smoothness of traffic) can be maximized.
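The fixed cyclic phase order and the keep/change action can be made concrete with a small sketch (the phase names and function below are illustrative, not part of the dissertation's code):

```python
# Phases cycle in a fixed order: Green-WE -> Red-WE -> Green-WE -> ...
PHASES = ["Green-WE", "Red-WE"]

def next_phase(current, action):
    """Apply action a to the light: a = 1 advances to the next phase in
    the fixed cyclic order, a = 0 keeps the current phase."""
    i = PHASES.index(current)
    return PHASES[(i + 1) % len(PHASES)] if action == 1 else current
```

Because the order is fixed, the agent's only decision at each step is the binary keep/change choice; it never selects an arbitrary phase directly.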
3.4 Method
Traffic light control has attracted much attention in recent years due to its essential role in adjusting traffic. Current methods generally fall into two categories: conventional methods and deep reinforcement learning based methods. Conventional methods usually rely on prior knowledge to set a fixed time for each light phase or to set changing rules. Such rules fail to adapt to dynamically changing traffic. Reinforcement learning methods usually take the traffic condition (e.g., queue length of waiting cars and updated waiting time) as the state, and try to make actions that can improve the traffic condition based on the current state.
However, current methods do not consider the complex situations of real cases, and hence may get stuck in one single kind of action. This leads to inferior traffic adjusting performance under complex traffic situations.
In this section, we propose a deep reinforcement traffic light agent to solve this problem. We first introduce the model framework in Section 3.4.1. Then, we show the design of the agent in Section 3.4.2. We further describe the network structure in Section 3.4.3. In addition, we describe the memory palace in Section 3.4.4. Note that, although our model is designed for a four-way intersection with two phases, it is not difficult to extend it to other types of intersections or to multi-phase scenarios.
3.4.1 Framework
Our model is composed of an offline part and an online part (as shown in Figure 3.3). We extract five kinds of features describing the traffic conditions as the state (detailed in Section 3.4.2), and use a reward to describe how much the action has improved the traffic (detailed in Section 3.4.2). In the offline stage, we set a fixed timetable for the lights and let traffic go through the system to collect data samples. After training with the samples logged in this stage, the model is put into the online part. In the online stage, at every time interval $\Delta t$, the traffic light agent observes the state $\mathsf{s}$ from the environment and takes action $\mathsf{a}$ (i.e., whether to change the light signal to the next phase) according to an $\epsilon$-greedy strategy combining exploration (i.e., a random action with probability $\epsilon$) and exploitation (i.e., taking the action with the maximum estimated reward). After that, the agent $\mathsf{G}$ observes the environment and gets the reward $\mathsf{r}$ from it. Then, the tuple ($\mathsf{s}$, $\mathsf{a}$, $\mathsf{r}$) is stored into memory. After several timestamps (e.g., $t_{2}$ in Figure 3.3), agent $\mathsf{G}$ updates the network according to the logs in the memory.

| Notation | Meaning |
| :-: | :-- |
| $\mathsf{E}$ | Environment |
| $\mathsf{G}$ | Agent |
| $\mathsf{a}$ | Action |
| $\mathsf{s}$ | State |
| $\mathsf{r}$ | Reward |
| $\Delta t$ | Time interval between actions |
| $\mathsf{q}$ | Action value function |
| $\mathsf{Q}$ | Deep Q-Network |
| $\mathsf{L}_{i}$ | Queue length on lane $i$ |
| $\mathsf{V}_{i}$ | Number of vehicles on lane $i$ |
| $\mathsf{W}_{i}$ | Updated waiting time of all vehicles on lane $i$ |
| $\mathsf{D}_{i}$ | Delay of lane $i$ |
| $\mathsf{M}\in\mathbb{R}^{N\times N}$ | Image representation of vehicles' position |
| $\mathsf{P}_{c}$ | Current phase |
| $\mathsf{P}_{n}$ | Next phase |
| $\mathsf{C}\in\{0,1\}$ | Light switches (1) or not (0) |
| $\mathsf{N}$ | Number of vehicles that passed the intersection after the action |
| $\mathsf{T}$ | Travel time in system of all vehicles that passed the intersection during $\Delta t$ |

Table 3.1: Notations
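The online action-selection step described above can be sketched as follows; `q_values` stands in for the trained network's output, and the function name is illustrative:

```python
import random

def choose_action(q_values, epsilon=0.05):
    """Epsilon-greedy selection: with probability epsilon take a random
    action (exploration); otherwise take the action with the maximum
    estimated reward (exploitation). Actions here are 0 = keep the
    current phase, 1 = change to the next phase."""
    if random.random() < epsilon:
        return random.choice(list(q_values))
    return max(q_values, key=q_values.get)
```

With $\epsilon = 0.05$ (Table 3.2), the agent deviates from the greedy action in roughly one of every twenty decisions.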
3.4.2 Agent Design
First, we introduce the state, action and reward representation.
State. Our state is defined for one intersection. For each lane $i$ at this intersection, the state component includes the queue length $\mathsf{L}_{i}$, the number of vehicles $\mathsf{V}_{i}$, and the updated waiting time of vehicles $\mathsf{W}_{i}$. In addition, the state includes an image representation of vehicles' position $\mathsf{M}$, the current phase $\mathsf{P}_{c}$ and the next phase $\mathsf{P}_{n}$.
Action. The action is defined as $\mathsf{a}=1$: change the light to the next phase $\mathsf{P}_{n}$, and $\mathsf{a}=0$: keep the current phase $\mathsf{P}_{c}$.
Reward. As shown in Equation 3.3, the reward is defined as a weighted sum of the following factors:

1. Sum of queue length $\mathsf{L}$ over all approaching lanes, where $\mathsf{L}$ is calculated as the total number of waiting vehicles on the given lane. A vehicle with a speed of less than 0.1 m/s is considered waiting.
2. Sum of delay $\mathsf{D}$ over all approaching lanes, where the delay $\mathsf{D}_{i}$ for lane $i$ is defined in Equation 3.1, where the lane speed is the average speed of vehicles on lane $i$, and the speed limit is the maximum speed allowed on lane $i$:
$$\mathsf{D}_{i}=1-\frac{lane\ speed}{speed\ limit}\tag{3.1}$$
3. Sum of updated waiting time $\mathsf{W}$ over all approaching lanes. This equals the sum of $\mathsf{W}$ over all vehicles on approaching lanes. The updated waiting time $\mathsf{W}_{j}$ for vehicle $j$ at time $t$ is defined in Equation 3.2. Note that the updated waiting time of a vehicle is reset to 0 every time it moves. For example, if a vehicle's speed is 0.01 m/s from 0 s to 15 s, 5 m/s from 15 s to 30 s, and 0.01 m/s from 30 s to 60 s, then $\mathsf{W}_{j}$ is 15 seconds, 0 seconds and 30 seconds at $t=$ 15 s, 30 s and 60 s respectively.
$$\mathsf{W}_{j}(t)=\begin{cases}\mathsf{W}_{j}(t-1)+1&vehicle\ speed<0.1\\ 0&vehicle\ speed\geq 0.1\end{cases}\tag{3.2}$$
4. Indicator of light switches $\mathsf{C}$, where $\mathsf{C}=0$ for keeping the current phase, and $\mathsf{C}=1$ for changing the current phase.
5. Total number of vehicles $\mathsf{N}$ that passed the intersection during time interval $\Delta t$ after the last action $\mathsf{a}$.
6. Total travel time of vehicles $\mathsf{T}$ that passed the intersection during time interval $\Delta t$ after the last action $\mathsf{a}$, defined as the total time (in minutes) that these vehicles spent on approaching lanes.
$$Reward=w_{1}\ast\sum_{i\in\mathsf{I}}\mathsf{L}_{i}+w_{2}\ast\sum_{i\in\mathsf{I}}\mathsf{D}_{i}+w_{3}\ast\sum_{i\in\mathsf{I}}\mathsf{W}_{i}+w_{4}\ast\mathsf{C}+w_{5}\ast\mathsf{N}+w_{6}\ast\mathsf{T}\tag{3.3}$$

Figure 3.3: Model framework
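The reward factors above can be sketched in a few lines; the weight values below are placeholders only, since the actual coefficients are given in Table 3.3:

```python
def delay(lane_speed, speed_limit):
    # Equation 3.1: 0 when vehicles move at the speed limit, 1 when stopped.
    return 1 - lane_speed / speed_limit

def updated_waiting_time(w_prev, vehicle_speed):
    # Equation 3.2: grows by one second while the vehicle is (nearly)
    # stopped, and resets to 0 as soon as it moves (speed >= 0.1 m/s).
    return w_prev + 1 if vehicle_speed < 0.1 else 0

def reward(L, D, W, C, N, T, w=(-0.25, -0.25, -0.25, -5.0, 1.0, 1.0)):
    # Equation 3.3: weighted sum over the six factors. L, D, W are
    # per-lane lists; C, N, T are scalars. The weights w1..w6 here are
    # illustrative placeholders, not the dissertation's coefficients.
    w1, w2, w3, w4, w5, w6 = w
    return (w1 * sum(L) + w2 * sum(D) + w3 * sum(W)
            + w4 * C + w5 * N + w6 * T)
```

Replaying the worked example above (0.01 m/s for 0-15 s, 5 m/s for 15-30 s, 0.01 m/s for 30-60 s) through `updated_waiting_time` reproduces $\mathsf{W}_{j}$ = 15 s, 0 s and 30 s at $t$ = 15 s, 30 s and 60 s.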
Hence, given the current state $\mathsf{s}$ of the traffic condition, the mission of the agent $\mathsf{G}$ is to find the action $\mathsf{a}$ (change or keep the current phase) that may lead to the maximum reward $\mathsf{r}$ in the long run, following the Bellman Equation (Equation 3.4) [12]. In this situation, the action value function $\mathsf{q}$ for time $t$ is the summation of the reward of the next timestamp $t+1$ and the maximum potential future reward. Through this conjecture of the future, the agent can select the action that is more suitable for the long-run reward.
$$\mathsf{q}(\mathsf{s}_{t},\mathsf{a},t)=\mathsf{r}_{\mathsf{a},t+1}+\gamma\max_{\mathsf{a}'}\mathsf{q}(\mathsf{s}_{\mathsf{a},t+1},\mathsf{a}',t+1)\tag{3.4}$$
In order to estimate the reward based on the state and action, the agent needs to learn a Deep Q-Network $\mathsf{Q}(\mathsf{s},\mathsf{a})$.
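The one-step target implied by the Bellman equation can be sketched as below, with $\gamma = 0.8$ as in Table 3.2 (`q_next` stands in for the network's value estimates at the next state; the function name is illustrative):

```python
def q_target(reward_next, q_next, gamma=0.8):
    """Bellman target: the reward observed at the next timestamp plus
    the discounted best action value achievable from the next state.
    This is the value the Deep Q-Network Q(s, a) is trained to match."""
    return reward_next + gamma * max(q_next)
```

During updates, the network's prediction for the taken action is regressed toward this target over minibatches drawn from memory.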
In the real-world scenario, traffic is very complex and contains many different cases that need to be considered separately. We illustrate this in Example 1.
Example 1. We still assume a simple intersection with a two-phase light transition here: 1) Green-WE and 2) Red-WE. The decision process of whether to change the traffic light consists of two steps. The first step is the mapping from the traffic condition (e.g., how many cars are waiting, how long each car has been waiting) to a partial reward. An example of this mapping could be $\mathsf{r}=-0.5\times\mathsf{L}-0.7\times\mathsf{W}$. This mapping is shared by different phases, no matter which lane the green light is on. Then, to determine the action, the agent should watch the traffic on different lanes during different phases. For instance, as shown in Figure 3.2 (a), when the red light is on the NS direction, more waiting traffic (i.e., a lower reward in the first step) on the NS direction will make the light tend to change (because by changing the light on this lane from red to green, more cars on this lane can pass through the intersection), while more waiting traffic (i.e., a lower reward in the first step) on the WE direction will make the light tend to keep. When the red light is on the WE direction, the case is exactly the opposite. Therefore, the light phase should perform an explicit selection on features.

Figure 3.4: Q-network
In previous studies, due to the simplified design of the model for approximating the Q-function under complex traffic conditions, agents have difficulty distinguishing the decision processes for different phases. We therefore propose a network structure that can explicitly consider the different cases. We call this special sub-structure "Phase Gate".
Our whole network structure is shown in Figure 3.4. The image features are extracted from the observations of the traffic condition and fed into two convolutional layers. The output of these layers is concatenated with the four explicitly mined features: queue length $\mathsf{L}$, updated waiting time $\mathsf{W}$, phase $\mathsf{P}$ and total number of vehicles $\mathsf{V}$. The concatenated features are then fed into fully-connected layers to learn the mapping from traffic conditions to potential rewards. Then, for each phase, we design a separate learning process of the mapping from rewards to the value of making decisions $\mathsf{Q}(\mathsf{s},\mathsf{a})$. These separate processes are selected through a gate controlled by the phase. As shown in Figure 3.4, when phase $\mathsf{P}=0$, the left branch is activated, while when phase $\mathsf{P}=1$, the right branch is activated. This distinguishes the decision process for different phases, prevents the decision from favoring a certain action, and enhances the fitting ability of the network.
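The gating mechanism can be sketched with toy, randomly initialized layers (the layer sizes and weights are illustrative stand-ins for the trained network in Figure 3.4):

```python
import numpy as np

rng = np.random.default_rng(0)

def dense(n_in, n_out, relu=True):
    # A toy fully-connected layer with fixed random weights, standing in
    # for the trained layers of the Q-network.
    w = rng.normal(size=(n_in, n_out))
    return (lambda x: np.maximum(x @ w, 0.0)) if relu else (lambda x: x @ w)

shared = dense(8, 16)  # shared mapping from traffic features to rewards
branches = {0: dense(16, 2, relu=False),   # Q-head used when phase P = 0
            1: dense(16, 2, relu=False)}   # Q-head used when phase P = 1

def q_values(features, phase):
    """Phase Gate: the current phase selects which branch estimates
    Q(s, a) for the two actions, so each phase's decision process is
    learned separately rather than entangled in one output head."""
    return branches[phase](shared(features))
```

The same input therefore yields different Q-estimates under different phases, which is exactly the phase-dependent behavior motivated by Example 1.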
3.4.4 Memory Palace and Model Updating
Periodically, the agent will take samples from the memory and use them to update the network. This memory is maintained by adding the new data samples in and removing the old samples occasionally. This technique is noted as experience replay [52] and has been widely used in reinforcement learning models.
However, in the real traffic setting, traffic on different lanes can be really imbalanced. As previous methods [10, 76, 81, 82] store all the state-action-reward training samples in one memory, this memory will be dominated by the phases and actions that appear most frequently in imbalanced settings. The agent will then learn to estimate the reward for these frequent phase-action combinations well, but ignore less frequent phase-action combinations, causing it to make bad decisions on them. Therefore, when traffic on different lanes is dramatically different, these imbalanced samples lead to inferior performance in less frequent situations.
Inspired by the Memory Palace theory [84, 85] in cognitive psychology, we can solve this imbalance by using different memory palaces for different phase-action combinations. As shown in Figure 3.5, training samples for different phase-action combinations are stored into different memory palaces. The same number of samples is then selected from each palace. These balanced samples prevent different phase-action combinations from interfering with each other's training process and, hence, improve the fitting capability of the network to predict the reward accurately.
Figure 3.5: Memory palace structure
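A minimal sketch of this balanced replay (the class and method names are illustrative; the capacity and per-palace sample counts only loosely follow Table 3.2):

```python
import random
from collections import deque

class MemoryPalace:
    """One bounded memory per (phase, action) combination, sampled
    evenly so frequent combinations cannot dominate training."""

    def __init__(self, capacity=1000):
        self.capacity = capacity
        self.palaces = {}  # (phase, action) -> deque of (s, a, r) samples

    def store(self, phase, action, sample):
        palace = self.palaces.setdefault((phase, action),
                                         deque(maxlen=self.capacity))
        palace.append(sample)

    def sample(self, n_per_palace):
        # Draw the same number of samples from every palace, so the
        # training batch is balanced across phase-action combinations.
        batch = []
        for palace in self.palaces.values():
            batch.extend(random.sample(list(palace),
                                       min(n_per_palace, len(palace))))
        return batch
```

Even if one phase-action combination has hundreds of logged samples and another only a handful, each contributes (up to) the same number of samples per training batch.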
3.5 Experiment
In this section, we conduct experiments using both synthetic and real-world traffic data. We show a comprehensive quantitative evaluation against other methods and also present some interesting case studies.¹
Footnote 1: Codes are available at the author’s website.
3.5.1 Experiment Setting
The experiments are conducted on the simulation platform SUMO (Simulation of Urban MObility)². SUMO provides flexible APIs for road network design, traffic volume simulation and traffic light control. Specifically, SUMO moves the traffic according to the given traffic light policy (obtained from the traffic light agent).
The environment for the experiments on synthetic data is a four-way intersection as in Figure 3.1. The intersection is connected with four 300-meter road segments, where each road has three incoming and three outgoing lanes. The traffic light in this part of the experiment contains two phases: (1) Green-WE (green light on WE with red light on SN), and (2) Red-WE (red light on WE with green light on SN). Note that when a green light is on one direction, there is a red light on the other direction. Also, a green light is followed by a 3-second yellow light before it turns to red. Although this is a simplification of the real-world scenario, more types of intersections (e.g., three-way intersections) and more complex light phasing (e.g., with left-turn phases) can be studied in a similar way.
3.5.2 Parameter Setting
The parameter settings and reward coefficients for our method are shown in Table 3.2 and Table 3.3 respectively. We found that the action time interval $\Delta t$ has minimal influence on the performance of our model as long as $\Delta t$ is between 5 seconds and 25 seconds.
3.5.3 Evaluation Metric
We evaluate the performance of different methods using the following measures:
Reward: average reward over time. As defined in Equation 3.3, the reward is a combination of several positive and negative terms; therefore, the reward ranges from $-\infty$ to $+\infty$. Under a specific configuration, there is an upper bound for the reward, reached when all cars move freely without any stop or delay.
Queue length: average queue length over time, where the queue length at time $t$ is the sum of $\mathsf{L}$ (defined in Section 3.4.2) over all approaching lanes. A smaller queue length means fewer waiting vehicles on all lanes.
Delay: average delay over time, where the delay at time $t$ is the sum of $\mathsf{D}$ (defined in Equation 3.1) over all approaching lanes. A lower delay means a higher speed on all lanes.
Duration: average travel time vehicles spend on approaching lanes (in seconds). It is one of the most important measures that people care about when driving. A smaller duration means vehicles spend less time passing through the intersection.
In summary, a higher reward indicates a better-performing method, while a smaller queue length, delay and duration indicate that the traffic is less jammed.
3.5.4 Compared Methods
To evaluate the effectiveness of our model, we compare it with the following baseline methods, tuning the parameters of all methods and reporting their best performance.
| Model parameter | Value |
| :-- | :-- |
| Action time interval $\Delta t$ | 5 seconds |
| $\gamma$ for future reward | 0.8 |
| $\epsilon$ for exploration | 0.05 |
| Sample size | 300 |
| Memory length | 1000 |
| Model update interval | 300 seconds |

Table 3.2: Settings for our method

Fixed-time Control (FT). Fixed-time control uses a pre-determined cycle and phase time plan [72] and is widely used when the traffic flow is steady.
Self-Organizing Traffic Light Control (SOTL) [74]. This method controls the traffic light according to the current traffic state, including the elapsed time and the number of vehicles waiting at the red light. Specifically, the traffic light changes when the number of waiting cars exceeds a hand-tuned threshold.
Deep Reinforcement Learning for Traffic Light Control (DRL). Proposed in [10], this method applies DQN framework to select optimal light configurations for traffic intersections. Specifically, it solely relies on the original traffic information as an image.
In addition to the baseline methods, we also consider several variations of our model.
IntelliLight (Base). This variant uses the same network structure and reward function as defined in Sections 3.4.2 and 3.4.3, but without the Memory Palace and Phase Gate.
IntelliLight (Base+MP). Adding the Memory Palace to IntelliLight (Base), we store the samples from different phase-action combinations in separate memories.
IntelliLight (Base+MP+PG). This is the full model with both techniques (Memory Palace and Phase Gate) added.
3.5.5 Datasets
Synthetic data
In the first part of our experiment, synthetic data is used with four traffic flow settings: simple changing traffic (configuration 1), equally steady traffic (configuration 2), unequally steady traffic (configuration 3) and complex traffic (configuration 4), which is a combination of the previous three configurations. As shown in Table 3.4, vehicle arrivals are generated by a Poisson process with the given arrival rates.
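An arrival stream like those in Table 3.4 can be sketched by sampling exponential inter-arrival gaps, which yields a Poisson process at the given rate (the function below is illustrative, not the generator actually used in the experiments):

```python
import random

def poisson_arrivals(rate, start, end, seed=None):
    """Generate vehicle arrival times (in seconds) on one approach.
    Exponential gaps with mean 1/rate produce a Poisson arrival process
    with `rate` cars per second between `start` and `end`."""
    rng = random.Random(seed)
    t, times = start, []
    while True:
        t += rng.expovariate(rate)
        if t > end:
            return times
        times.append(t)
```

For configuration 1's WE flow (rate 0.4 cars/s over 36000 s), this yields about 14,400 arrivals on average.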
Real-world data
The real-world dataset is collected from 1,704 surveillance cameras in Jinan, China over the period from 08/01/2016 to 08/31/2016. The locations of these cameras are shown in Figure 3.6. Gathered every second by cameras facing vehicles near intersections, each record in the dataset consists of a timestamp, a camera ID and vehicle information. By combining these records with camera locations, the trajectories of vehicles are recovered as they pass through road intersections. The dataset covers 935 locations, 43 of which are four-way intersections. We use the number of vehicles passing through 24 of these intersections as the traffic volume for the experiments, since only these intersections have consecutive data. We then feed this real-world traffic setting into SUMO for online experiments. As can be seen from Table 3.5, traffic flows on different roads change dynamically in the real world.
| Config | Directions | Arrival rate (cars/s) | Start time (s) | End time (s) |
| :-: | :-: | :-: | :-: | :-: |
| 1 | WE | 0.4 | 0 | 36000 |
|  | SN | 0.4 | 36001 | 72000 |
| 2 | WE | 0.033 | 0 | 72000 |
|  | SN | 0.033 | 0 | 72000 |
| 3 | WE | 0.2 | 0 | 72000 |
|  | SN | 0.033 | 0 | 72000 |
| 4 | Configuration 1 |  | 0 | 72000 |
|  | Configuration 2 |  | 72001 | 144000 |
|  | Configuration 3 |  | 144001 | 216000 |

Table 3.4: Configurations for synthetic traffic data
Figure 3.6: Traffic surveillance cameras in Jinan, China
3.5.6 Performance on Synthetic Data
3.5.6.1 Comparison with state-of-the-art methods
We first compare our method with the three baselines under different synthetic traffic settings. From Tables 3.6, 3.7, 3.8 and 3.9 we can see that our method outperforms all baseline methods in configurations 1, 2, 3 and 4. Although some baselines perform well in certain settings, they perform badly in other configurations (e.g., SOTL achieves good rewards under configuration 1, almost the same as our method to three decimal places; this is because our method has learned to keep the light until 36000 s and switch the light after that, and SOTL is also designed to behave similarly, hence the two methods perform very similarly). In contrast, our method IntelliLight shows better performance across all configurations.
3.5.6.2 Comparison with variants of our proposed method
Tables 3.6, 3.7, 3.8 and 3.9 show the performance of variants of our proposed method. First, we can see that adding the Memory Palace helps achieve a higher reward under configurations 3 and 4, although it does not boost the reward under configurations 1 and 2. This is because in the simple cases (configurations 1 and 2), the phase is relatively steady for a long time (the traffic either comes from only one direction or remains unchanged for a long time), so the memory palace does not help build a better model for predicting the reward. Further adding the Phase Gate also reduces the queue length in most cases and achieves the highest reward, demonstrating the effectiveness of these two techniques.
3.5.6.3 Interpretation of learned signal
To understand what our method have learned w.r.t. dynamic traffic conditions, we show the percentage of duration for phase GreenWE (i.e., green light on WE direction with red light on SN direction), along with the ratio of traffic flow on WE over total traffic
Time Range               Records       Arrival Rate (cars/s)
                                       Mean    Std     Max     Min
08/01/2016 - 08/31/2016  405,370,631   0.089   0.117   0.844   0.0

Table 3.5: Details of the real-world traffic dataset.

flow from all directions. With the changing of traffic, an ideal traffic light control method would be able to adjust its phase duration to the traffic flows and get a high reward. For example, as traffic changes from direction WE to SN, the traffic light agent is expected to adjust its phase duration from giving WE the green light to giving SN the green light. As we can see from Figure 3.7, IntelliLight can adjust its phase duration as the traffic changes.
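As an illustrative sketch of this analysis, the windowed GreenWE duration ratio shown in Figure 3.7 could be computed along the following lines. The per-second phase-log format is an assumption for illustration, not the dissertation's actual data format:

```python
def green_we_ratio(phase_log, window=2000):
    """Fraction of time spent in phase GreenWE per fixed-size window.

    phase_log: list of phase labels indexed by simulation second,
               e.g. "GreenWE" or "RedWE" (hypothetical log format).
    window:    window length in seconds (2000 s, as in Figure 3.7).
    """
    ratios = []
    for start in range(0, len(phase_log), window):
        chunk = phase_log[start:start + window]
        # Count seconds where the WE direction had the green light.
        ratios.append(sum(p == "GreenWE" for p in chunk) / len(chunk))
    return ratios
```

For example, a log that is GreenWE for the first 1000 s and RedWE for the next 3000 s would yield ratios of 0.5 for the first window and 0.0 for the second.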
3.5.7 Performance on Real-world Data
3.5.7.1 Comparison of different methods
In this section, we compare our method with the baseline methods on real-world data. The overall results are shown in Table 3.10. Our method IntelliLight achieves the best reward, queue length, delay and duration among all the compared methods, with relative improvements of 32%, 38%, 19% and 22%, respectively, over the best baseline method. In addition, our method has a relatively steady performance over multiple intersections (small standard deviation).
Table 3.6: Performance on configuration 1. Reward: the higher the better; other measures: the lower the better. The same applies to the following tables.
3.5.7.2 Observations with respect to real traffic
In this section, we make observations on the policies we learned from the real data. We analyze the learned traffic light policy for the intersection of Jingliu Road (WE direction)
Figure 3.7: Percentage of the time duration of learned policy for phase GreenWE (green light on WE and EW direction, while red light on NS and SN direction) in every 2000 seconds for different methods under configuration 4.
Table 3.8: Performance on configuration 3.
and Erhuanxi Auxiliary Road (SN direction) under different scenarios: peak hours vs. non-peak hours, weekdays vs. weekends, and major arterial vs. minor arterial.
1. Peak hour vs. Non-peak hour. Figure 3.8 (a) shows the average traffic flow from both directions (WE and SN) on a Monday. On this day, there is more traffic on the WE direction than on SN for most of the time, during which an ideal traffic light control method is expected to give more time to the WE direction. It can be seen from Figure 3.8 (c) that the ratio of the time duration for phase GreenWE (i.e., green light on WE, while red light on SN) is usually larger than 0.5, which means that for most of the time, our method gives more time to WE. During peak hours (around 7:00, 9:30 and 18:00), the policies learned by our method also give more green time on WE than during non-peak hours. In the early morning, the vehicle arrival rates on SN are larger than those on WE, and our method automatically gives more time to SN. This shows that our method can intelligently adjust to different traffic conditions.
2. Weekday vs. Weekend. Unlike weekdays, weekends show different patterns in traffic conditions and traffic light control policies. Our policy gives less green light to WE (more green light to SN) during weekend daytime than it does on weekdays. This is because there is more traffic on SN than on WE during weekend daytime (Figure 3.8 (b)), whereas during weekdays traffic on SN is lighter than on WE. Besides, by comparing Figure 3.8 (a) with Figure 3.8 (b), we can see that the traffic on WE and SN during late night on Monday is similar, making the duration ratio of GreenWE close to 0.5.
3. Major arterial vs. Minor arterial. Major arterials are roads that carry higher traffic volume within a period and are expected to receive longer green light time. Without prior knowledge about the major arterial, the traffic light control policy learned by our method prefers giving the major arterial the green light (i.e., it keeps the green light already on the major arterial, and tends to switch the red light to green for the major arterial). Specifically, we look into three periods of time (3:00, 12:00 and 23:30) on August 1st. From Figure 3.8 (a), we can tell that the road on the WE direction is the main road, since
Table 3.10: Performances of different methods on real-world data. The number after ± is the standard deviation. Reward: the higher the better. Other measures: the lower the better.
traffic on WE is usually heavier than traffic on SN. As shown in Figure 3.9, the dotted lines indicate the number of arriving cars per second in the two directions. Along with the arrival rate, we also plot the change of phases (dashed areas). It can be seen from Figure 3.9 (a) that: 1) the overall time period of phase RedWE is longer than that of GreenWE, which is compatible with the traffic volume at this time; 2) although the traffic volume on SN is larger than on WE, the change from RedWE to GreenWE is usually not triggered by waiting cars on the WE direction. In contrast, in Figure 3.9 (b) and Figure 3.9 (c), the change from GreenWE to RedWE is usually triggered by waiting cars on the SN direction. This is mainly because the road on WE is the main road during these time periods, and the traffic light tends to favor phase GreenWE.
3.6 Conclusion
In this chapter, we address the traffic light control problem using a well-designed reinforcement learning approach. We conduct extensive experiments using both synthetic
Figure 3.8: Average arrival rate on two directions (WE and SN) and time duration ratio of two phases (GreenWE and RedWE) from learned policy for Jingliu Road (WE) and Erhuanxi Auxiliary Road (SN) in Jinan on August 1st and August 7th, 2016.
and real-world experiments and demonstrate the superior performance of our proposed method over state-of-the-art methods. In addition, we present in-depth case studies and observations to understand how the agent adjusts to changing traffic, as a complement to the quantitative measures on rewards. These in-depth case studies can help generate traffic rules for real-world applications.
We also acknowledge the limitations of our current approach and would like to point out several important future directions to make the method more applicable to the real world. First, we can extend the two-phase traffic light to a multi-phase traffic light, which involves more complicated but more realistic state transitions. Second, this chapter addresses a simplified single-intersection case, whereas real-world road networks are much more complicated. Although some studies have tried to solve the multi-intersection problem by using multiple reinforcement learning agents, they do not explicitly consider the interactions between intersections (i.e., how the phase of one intersection affects the state of nearby intersections) and they are still limited to a small number of intersections. Lastly, our approach is still tested in a simulation framework, and thus the feedback is simulated. Ultimately, a field study should be conducted to learn real-world feedback and to validate the proposed reinforcement learning approach.
Figure 3.9: Detailed average arrival rate on two directions (dotted lines) and changes of two phases (dashed areas) in three periods of time for Jingliu Road (WE) and Erhuanxi Auxiliary Road (SN) in Jinan on August 1st, 2016. Xaxis of each figure indicates the time of a day; left Yaxis of each figure indicates the number of cars approaching the intersection every second; right Yaxis for each figure indicates the phase over time.
Chapter 4 Formulating the Learning Objective
This chapter formulates RL-based traffic signal control methods with theoretical proofs and corresponds to our paper "PressLight: Learning Network-level Cooperation for Traffic Signal Control" [17].
4.1 Introduction
Traffic signals coordinate the traffic movements at an intersection, and a smart traffic signal control algorithm is the key to transportation efficiency. Traffic signal control remains an active research topic because of the high complexity of the problem: traffic situations are highly dynamic and thus require traffic signal plans that can adjust to different conditions.
Recently, researchers have started to investigate reinforcement learning (RL) techniques for traffic signal control. Several studies have shown the superior performance of RL techniques over traditional transportation approaches [9, 10, 11, 46, 78, 80]. The biggest advantage of RL is that it directly learns how to take the next action by observing the feedback from the environment after previous actions.
One major issue with current RL-based traffic signal control approaches is that the setting is often heuristic and lacks proper theoretical justification from the transportation literature. This often results in performance that is highly sensitive to the setting, and leads to a long learning process. We elaborate on this issue by examining two fundamental elements of the RL setting: reward and state.
First, various reward designs have been proposed in the literature. The reason is that travel time, the ultimate objective, is hard to optimize directly. Travel time is a long-term outcome depending on a sequence of actions, so the effect of a single action can hardly be reflected in terms of travel time. Researchers thus choose short-term rewards like queue length or delay to approximate travel time [21], and the reward function is often defined as a weighted sum of these terms [10, 11, 79, 86, 87]. However, as shown in Figure 4.1(a), tuning the weights on these terms can lead to largely different results in terms of travel time. Some literature [1] discusses how to define the reward by connecting it with existing transportation methods, but only focuses on controlling a single intersection. In this chapter, we focus on the multi-intersection control scenario.
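As a concrete sketch of such a weighted-sum reward, the following function combines short-term terms with tunable weights. The feature names and weight values here are purely illustrative assumptions, not taken from any of the cited designs:

```python
def weighted_reward(queue_length, delay, wait_time,
                    w_queue=-0.5, w_delay=-0.3, w_wait=-0.2):
    """Hypothetical weighted-sum reward approximating (negative) travel time.

    Each term is penalized (negative weight) so that the agent, which
    maximizes reward, is pushed toward shorter queues and smaller delays.
    Tuning w_queue / w_delay / w_wait can change the resulting travel
    time substantially, which is the sensitivity issue discussed above.
    """
    return w_queue * queue_length + w_delay * delay + w_wait * wait_time
```

For instance, an observation with 10 queued vehicles and no delay or waiting contributes a reward of -5.0 under these example weights; shifting the weights shifts which traffic condition the agent implicitly optimizes.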
Second, existing RL methods show a trend toward more complicated state representations. Recent studies use visual images to describe the full traffic situation at the intersection [10, 11], which results in a state dimension on the scale of thousands. In the single-intersection scenario, [1] reveals that additional information is not always helpful, and similar conclusions can be found in the multi-intersection scenario. As shown in Figure 4.1(b), complicated state definitions increase the learning time and do not necessarily bring significant gains. Note that we are not claiming that additional information is never helpful; the choice of state depends on the reward setting. Based on the reward design of LIT [1], neighboring information is not necessary in the case we show in Figure 4.1(b). The question is: can we justify theoretically how much information is enough in the state definition in order to optimize the reward function?
The challenges we face in RL motivate us to look for support from transportation. In the transportation literature, max pressure (MP) control is one of the state-of-the-art methods in traffic signal control [88, 89]. The key idea of MP is to minimize the "pressure" of an
Figure 4.1: Performance of RL approaches is sensitive to reward and state. (a) A heuristic parameter tuning of reward function could result in different performances. (b) The method with a more complicated state (LIT[1] w/ neighbor) has a longer learning time but does not necessarily converge to a better result.
intersection, which can be loosely defined as the number of vehicles on incoming lanes minus the number of vehicles on outgoing lanes. Figure 4.2 illustrates the concept of pressure. By setting the objective as minimizing the pressure of intersections, MP is proved to maximize the throughput of the whole road network1. However, the solution of MP is greedy, which leads to locally optimal solutions.
Footnote 1: Maximizing throughput is equivalent to minimizing travel time under certain conditions, and minimizing travel time is the final goal of most traffic signal control problems.
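The pressure computation and the greedy MP control rule can be sketched as follows. The lane names and vehicle counts are illustrative, and this simplified rule omits the saturation flow rates and signal timing constraints that a full MP controller [89] would account for:

```python
def pressure(movements, lane_count):
    """Pressure of a set of movements: vehicles on each incoming lane
    minus vehicles on the corresponding outgoing lane."""
    return sum(lane_count[l] - lane_count[m] for l, m in movements)

def max_pressure_phase(phases, lane_count):
    """Greedy MP rule: activate the phase whose permitted movements
    have the largest total pressure.

    phases: dict mapping a phase name to its list of (in_lane, out_lane)
            movements; lane_count: dict of per-lane vehicle counts.
    """
    return max(phases, key=lambda p: pressure(phases[p], lane_count))
```

For example, with two phases GreenWE and GreenSN and heavier queues on the west-east incoming lanes, the rule selects GreenWE; because the decision is made intersection by intersection at each step, it is greedy and can be locally optimal, as noted above.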
Our proposed solution is based on RL but theoretically grounded in the MP method. The connection between RL and MP is that both approaches can essentially be framed as optimization problems. In RL, the long-term reward is the objective and the solution is derived by trial-and-error search; in MP, the objective is to minimize pressure and the solution is derived by a greedy algorithm. Intuitively, if we set our reward function to be the same as the objective of MP, we can achieve the same result as MP. We first prove that under the assumption of no physical queue expansion, both our method and MP maximize the throughput of the network. We further show that our method can relax the assumption on queue expansion and the conclusion still holds.
To further address the challenge on state design, we describe the system dynamics using the state features based on MP. MP provides evolution equations to formulate the state transition of the traffic as a Markov chain [90]. In RL, the Markov decision process formally describes the dynamics of an environment. By including the variables from the
Figure 4.2: Illustration of max pressure control in two cases. In Case A, the green signal is set in the North→South direction; in Case B, the green signal is set in the East→West direction.
evolution equation into state definition in RL, the state is a sufficient statistic for the system dynamics.
We conduct comprehensive experiments using both synthetic and real data, testing our method under different traffic flow and network structure scenarios. We demonstrate the power of RL methods over traditional transportation approaches, as RL optimizes the objective through trial and error. Our method also consistently outperforms state-of-the-art RL methods, which shows that a theoretically supported reward design is necessary and that a concise state design leads to an efficient learning process. We further discuss several interesting policies learned by our method to show that it can achieve coordination along an arterial.
4.2 Related Work
Individual Traffic Signal Control. Individual traffic signal control has been investigated extensively in the field of transportation, where methods try to optimize the travel time or delay of vehicles [91, 92, 93, 94, 95] under the assumption that vehicles arrive and move in a specific pattern. Recently, reinforcement learning based methods have attempted to address this problem by directly learning from the data [9, 27]. Earlier work using tabular Q-learning [87, 96] can only deal with discrete state representations, while recent work using deep RL [10, 11, 76, 97, 98] can cope with more complex continuous state representations. [1] noticed that more complex state definitions do not always yield better performance, and also investigated proper reward design grounded in individual-intersection control methods from the transportation field. In this chapter, we focus on the multi-intersection scenario.
Conventional Multi-intersection Traffic Signal Control. In conventional multi-intersection control, coordination can be achieved by setting a fixed offset (i.e., the time interval between the beginnings of green lights) among all intersections along an arterial [99]. In practice, this is not an easy task, given that traffic in opposite directions usually cannot be facilitated simultaneously. To solve this problem, optimization-based methods [100, 101] have been developed to minimize vehicle travel time and/or the number of stops at multiple intersections. Instead of optimizing offsets, max pressure [89, 90] aims to maximize the throughput of the network so as to minimize travel time. However, these approaches still rely on assumptions that simplify the traffic conditions and do not guarantee optimal results in the real world.
RL-based Multi-intersection Traffic Signal Control. Since recent advances in RL improve the performance of isolated traffic signal control [11, 1], efforts have been made to design strategies that control multiple intersections. One way is to jointly model the actions of the learning agents with centralized optimization [75, 10]; since these methods [10, 75] need to negotiate between the agents in the whole network, they are computationally expensive. Another way is to use decentralized RL agents to control the traffic signals in the multi-intersection system [46, 79, 102]. Since each agent makes its own decision based on information from itself and neighboring intersections without a centralized decision, decentralized methods may be more scalable and practical: by plugging new intersection controllers into the system, decentralized systems are easy to scale. Our proposed method also follows this direction.
We note the recent trend of varying the definitions of state and reward in RL for traffic signal control; readers interested in a detailed comparison of state and reward definitions can refer to [21]. Ours is the first RL method that is theoretically grounded in traditional transportation methods to coordinate the traffic signals along an arterial.
4.3 Preliminaries and Notations
In this section, we reiterate the preliminaries mentioned in Chapter 2 and provide necessary notations for this chapter.
Definition 1 (Incoming lane and outgoing lane of an intersection): An incoming lane of an intersection is a lane through which traffic enters the intersection. An outgoing lane of an intersection is a lane through which traffic leaves the intersection. We denote the sets of incoming lanes and outgoing lanes of an intersection as $L_{in}$ and $L_{out}$, respectively.
Definition 2 (Traffic movement): A traffic movement is defined as the traffic traveling across an intersection from one incoming lane to one outgoing lane. We denote a traffic movement from lane $l$ to lane $m$ as $(l,m)$.
Definition 3 (Movement signal and phase): A movement signal is defined on a traffic movement, with a green signal indicating that the corresponding movement is allowed and a red signal indicating that it is prohibited. We denote a movement signal as $a(l,m)$, where $a(l,m)=1$ indicates the green light is on for movement $(l,m)$, and $a(l,m)=0$ indicates the red light is on for movement $(l,m)$. A phase is a combination of movement signals. We denote a phase as $p=\{(l,m) \mid a(l,m)=1\}$, where $l\in L_{in}$ and $m\in L_{out}$.
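Definitions 1-3 admit a minimal encoding, sketched below. The lane names are illustrative assumptions (one incoming and one outgoing lane per approach), not part of the formal notation:

```python
# Illustrative lane sets L_in and L_out for a four-approach intersection.
L_in = {"N_in", "S_in", "E_in", "W_in"}
L_out = {"N_out", "S_out", "E_out", "W_out"}

# A phase p is the set of movements (l, m) whose signal a(l, m) = 1.
# Here, a hypothetical GreenWE phase allows the two west-east movements.
phase_green_we = {("W_in", "E_out"), ("E_in", "W_out")}

def a(l, m, phase):
    """Movement signal a(l, m): 1 if movement (l, m) is allowed
    under `phase` (green light), 0 otherwise (red light)."""
    return 1 if (l, m) in phase else 0
```

Under this encoding, `a("W_in", "E_out", phase_green_we)` is 1 while any south-north movement gets 0, matching the set-comprehension definition of a phase given above.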