

The Pennsylvania State University
The Graduate School
DEEP REINFORCEMENT LEARNING FOR TRAFFIC SIGNAL CONTROL
A Dissertation in
Information Sciences and Technology
by
Hua Wei
© 2020 Hua Wei
Submitted in Partial Fulfillment
of the Requirements
for the Degree of
Doctor of Philosophy
December 2020

The dissertation of Hua Wei was reviewed and approved by the following:
Zhenhui (Jessie) Li
Associate Professor of Information Sciences and Technology
Dissertation Adviser
Chair of Committee
C. Lee Giles
Professor of Information Sciences and Technology
Xiang Zhang
Associate Professor of Information Sciences and Technology
Vikash V. Gayah
Associate Professor of Civil and Environmental Engineering
Mary Beth Rosson
Professor of Information Sciences and Technology
Graduate Program Chair

Abstract

Traffic congestion is a growing problem that continues to plague urban areas with negative outcomes for both the traveling public and society as a whole. Signalized intersections are one of the most prevalent bottleneck types in urban environments, and thus traffic signal control tends to play a large role in urban traffic management. Nowadays, the widely-used traffic signal control systems (e.g., SCATS and SCOOT) are still based on manually designed traffic signal plans. Recently, emerging research studies have used reinforcement learning (RL) to tackle the traffic signal control problem. In this dissertation, we propose to use reinforcement learning to intelligently optimize signal timing plans in real time to reduce traffic congestion.
Although some efforts using reinforcement learning (RL) techniques have been proposed to adjust traffic signals dynamically, they only use ad-hoc designs when formulating the traffic signal control problem and lack a principled approach to formulating the problem under the framework of RL. Secondly, since RL learns directly from the data via trial-and-error search, it requires a large number of interactions with the environment before the algorithms converge. In real-world problems, every interaction incurs real cost (e.g., traffic congestion, traffic accidents). Hence, a more data-efficient method is necessary. Thirdly, discrepancies between simulation and reality confine the application of RL in the real world, despite its massive success in domains like games. Most RL methods mainly conduct experiments in a simulator, since the simulator can generate data in a cheaper and faster way than real experimentation. Hence, addressing the performance gap of RL methods between simulation and the real world is required for applying RL in the real world.
This dissertation presents how to utilize mobility data and RL-based methods for traffic signal control. I have investigated the key challenges for RL-based traffic signal control methods, including how to formulate the objective function and how to improve learning efficiency for city-wide traffic signal control. In addition, I have worked to mitigate the performance gap of RL methods between simulation and the real world. We have achieved significant improvements over the state-of-the-art or currently employed methods, which provide promising solutions to traffic signal control problems and implications for smart city applications.

Table of Contents

List of Figures

List of Tables

Acknowledgements

Chapter 1 Introduction
* 1.1 Why do we need a more intelligent traffic signal control
* 1.2 Why do we use reinforcement learning for traffic signal control
* 1.3 Why RL for traffic signal control is challenging?
	* 1.3.1 Formulation of RL agent
	* 1.3.2 Learning cost
	* 1.3.3 Simulation
* 1.4 Previous Studies
* 1.5 Proposed Tasks
* 1.6 Overview of this Dissertation

Chapter 2 Notation, Background and Literature

* 2.1 Preliminaries of Traffic Signal Control Problem
	* 2.1.1 Term Definition
	* 2.1.2 Objective
	* 2.1.3 Special Considerations
* 2.2 Background of Reinforcement Learning
	* 2.2.1 Single Agent Reinforcement learning
	* 2.2.2 Problem setting
* 2.3 Literature
	* 2.3.1 Agent formulation
		* 2.3.1.1 Reward
		* 2.3.1.2 State
		* 2.3.1.3 Action scheme
	* 2.3.2 Policy Learning
		* 2.3.2.1 Value-based methods
		* 2.3.2.2 Policy-based methods
	* 2.3.3 Coordination
		* 2.3.3.1 Joint action learners
		* 2.3.3.2 Independent learners
		* 2.3.3.3 Sizes of Road network
* 2.4 Conclusion

Chapter 3 Basic Formulation for Traffic Signal Control

Chapter 4 Formulating the Learning Objective

* 4.1 Introduction
* 4.2 Related Work
* 4.3 Preliminaries and Notations
* 4.4 Method
	* 4.4.1 Agent Design
	* 4.4.2 Learning Process
* 4.5 Justification of RL agent
	* 4.5.1 Justification for State Design
		* 4.5.1.1 General description of traffic movement process as a Markov chain
		* 4.5.1.2 Specification with proposed state definition
	* 4.5.2 Justification for Reward Design
		* 4.5.2.1 Stabilization on traffic movements with proposed reward.
		* 4.5.2.2 Connection to throughput maximization and travel time minimization.
* 4.6 Experiment
	* 4.6.1 Dataset Description
	* 4.6.2 Experimental Settings
		* 4.6.2.1 Environmental settings
		* 4.6.2.2 Evaluation metric
		* 4.6.2.3 Compared methods
	* 4.6.3 Performance Comparison
	* 4.6.4 Study of PressLight
		* 4.6.4.1 Effects of variants of our proposed method
		* 4.6.4.2 Average travel time related to pressure.
	* 4.6.5 Performance on Mixed Scenarios
		* 4.6.5.1 Heterogeneous intersections
		* 4.6.5.2 Arterials with a different number of intersections and network
	* 4.6.6 Case Study
		* 4.6.6.1 Synthetic traffic on the uniform, uni-directional flow
			* 4.6.6.1.1 Performance comparison
			* 4.6.6.1.2 Policy learned by RL agents
		* 4.6.6.2 Real-world traffic in Jinan
* 4.7 Conclusion

Chapter 5 Improving Learning Efficiency

* 5.1 Introduction
* 5.2 Related Work
* 5.3 Problem Definition
* 5.4 Method
	* 5.4.1 Observation Embedding
	* 5.4.2 Graph Attention Networks for Cooperation
		* 5.4.2.1 Observation Interaction
		* 5.4.2.2 Attention Distribution within Neighborhood Scope
		* 5.4.2.3 Index-free Neighborhood Cooperation
		* 5.4.2.4 Multi-head Attention
	* 5.4.3 Q-value Prediction
	* 5.4.4 Complexity Analysis
		* 5.4.4.1 Space Complexity
		* 5.4.4.2 Time Complexity
* 5.5 Experiments
	* 5.5.1 Settings
	* 5.5.2 Datasets
		* 5.5.2.1 Synthetic Data
		* 5.5.2.2 Real-world Data
	* 5.5.3 Compared Methods
	* 5.5.4 Evaluation Metric
	* 5.5.5 Performance Comparison
		* 5.5.5.1 Overall Analysis
		* 5.5.5.2 Convergence Comparison
	* 5.5.6 Scalability Comparison
		* 5.5.6.1 Effectiveness
		* 5.5.6.2 Training time
	* 5.5.7 Study of CoLight
		* 5.5.7.1 Impact of Neighborhood Definition
		* 5.5.7.2 Impact of Neighbor Number
		* 5.5.7.3 Impact of Attention Head Number
* 5.6 Conclusion

Chapter 6 Learning to Simulate

  • 6.1 Introduction
  • 6.2 Related Work
  • 6.3 Preliminaries
    • 6.4 Method
      • 6.4.1 Basic GAIL Framework
      • 6.4.2 Imitation with Interpolation
        • 6.4.2.1 Generator in the simulator
        • 6.4.2.2 Downsampling of generated trajectories
        • 6.4.2.3 Interpolation-Discriminator
          • 6.4.2.3.1 Interpolator module
          • 6.4.2.3.2 Discriminator module
          • 6.4.2.3.3 Loss function of Interpolation-Discriminator
      • 6.4.3 Training and Implementation
    • 6.5 Experiment
      • 6.5.1 Experimental Settings
        • 6.5.1.1 Dataset
          • 6.5.1.1.1 Synthetic Data
          • 6.5.1.1.2 Real-world Data
        • 6.5.1.2 Data Preprocessing
      • 6.5.2 Compared Methods
        • 6.5.2.1 Calibration-based methods
        • 6.5.2.2 Imitation learning-based methods
      • 6.5.3 Evaluation Metrics
      • 6.5.4 Performance Comparison
      • 6.5.5 Study of ImIn-GAIL
        • 6.5.5.1 Interpolation Study
        • 6.5.5.2 Sparsity Study
      • 6.5.6 Case Study
    • 6.6 Conclusion

Chapter 7 Conclusion and Future Directions

  • 7.1 Evolving behavior with traffic signals
  • 7.2 Benchmarking datasets and baselines
  • 7.3 Learning efficiency
  • 7.4 Safety issue
  • 7.5 Transferring from simulation to reality
Bibliography

List of Figures
  • 2.1 Definitions of traffic movement and traffic signal phases.
  • 2.2 RL framework for traffic signal control.
  • 3.1 Reward is not a comprehensive measure to evaluate traffic light control performance. Both policies will lead to the same rewards. But policy #1 is more suitable than policy #2 in the real world.
  • 3.2 Case A and case B have the same environment except the traffic light phase.
  • 3.3 Model framework
  • 3.4 Q-network
  • 3.5 Memory palace structure
  • 3.6 Traffic surveillance cameras in Jinan, China
  • 3.7 Percentage of the time duration of learned policy for phase Green-WE (green light on W-E and E-W direction, while red light on N-S and S-N direction) in every 2000 seconds for different methods under configuration 4.
  • 3.8 Average arrival rate on two directions (WE and SN) and time duration ratio of two phases (Green-WE and Red-WE) from learned policy for Jingliu Road (WE) and Erhuanxi Auxiliary Road (SN) in Jinan on August 1st and August 7th, 2016.
  • 3.9 Detailed average arrival rate on two directions (dotted lines) and changes of two phases (dashed areas) in three periods of time for Jingliu Road (WE) and Erhuanxi Auxiliary Road (SN) in Jinan on August 1st, 2016. The X-axis of each figure indicates the time of day; the left Y-axis of each figure indicates the number of cars approaching the intersection every second; the right Y-axis of each figure indicates the phase over time.
  • 4.1 Performance of RL approaches is sensitive to reward and state. (a) A heuristic parameter tuning of reward function could result in different performances. (b) The method with a more complicated state (LIT[1] w/ neighbor) has a longer learning time but does not necessarily converge to a better result.
  • 4.2 Illustration of max pressure control in two cases. In Case A, the green signal is set in the North→South direction; in Case B, the green signal is set in the East→West direction.
  • 4.3 The transition of traffic movements.
  • 4.4 Real-world arterial network for the experiment.
  • 4.5 Convergence curve of average duration and our reward design (pressure). Pressure shows the same convergence trend with travel time.
  • 4.6 Average travel time of our method on heterogeneous intersections. (a) Different number of legs. (b) Different length of lanes. (c) Experiment results.
  • 4.7 Performance comparison under uniform unidirectional traffic, where the optimal solution is known (GreenWave). Only PressLight achieves the optimal solution.
  • 4.8 Offsets between intersections learnt by RL agents under uni-directional uniform traffic (700 vehicles/hour/lane on arterial)
  • 4.9 Space-time diagram with signal timing plan to illustrate the learned coordination strategy from real-world data on the arterial of Qingdao Road in the morning (around 8:30 a.m.) on August 6th.
  • 5.1 Illustration of index-based concatenation. Thick yellow lines are the arterials and thin grey lines are the side streets. With index-based concatenation, A and B's observations will be aligned as model inputs with a fixed order. These two inputs will confuse the model shared by A and B.
  • 5.2 Left: Framework of the proposed CoLight model. Right: variation of cooperation scope (light blue shadow, from one-hop to two-hop) and attention distribution (colored points, the redder, the more important) of the target intersection.
  • 5.3 Road networks for real-world datasets. Red polygons are the areas we select to model, blue dots are the traffic signals we control. Left: 196 intersections with uni-directional traffic; middle: 16 intersections with uni- & bi-directional traffic; right: 12 intersections with bi-directional traffic.
  • 5.4 Convergence speed of CoLight (red continuous curves) and the other 5 RL baselines (dashed curves) during training. CoLight starts with the best performance (Jumpstart), reaches the pre-defined performance the fastest (Time to Threshold), and ends with the optimal policy (Asymptotic). Curves are smoothed with a moving average of 5 points.
  • 5.5 The training time of different models for 100 episodes. CoLight is efficient across all the datasets. The bar for Individual RL on $D_{NewYork}$ is shadowed as its running time is far beyond the acceptable time.
  • 5.6 Performance of CoLight with respect to different numbers of neighbors ($|\mathcal{N}_i|$) on dataset $D_{Hangzhou}$ (left) and $D_{Jinan}$ (right). More neighbors ($|\mathcal{N}_i|\leq 5$) for cooperation bring better performance, but too many neighbors ($|\mathcal{N}_i|>5$) require more time (200 episodes or more) to learn.
  • 6.1 Illustration of a driving trajectory. In the real-world scenario, only part of the driving points can be observed and form a sparse driving trajectory (in red dots). Each driving point includes a driving state and an action of the vehicle at the observed time step. Best viewed in color.
  • 6.2 Proposed ImIn-GAIL Approach. The overall framework of ImIn-GAIL includes three components: generator, downsampler, and interpolation-discriminator. Best viewed in color.
  • 6.3 Proposed interpolation-discriminator network.
  • 6.4 Illustration of road networks. (a) and (b) are synthetic road networks, while (c) and (d) are real-world road networks.
  • 6.5 RMSE on time and position of our proposed method ImIn-GAIL under different levels of sparsity. As the expert trajectories become denser, a policy more similar to the expert policy is learned.
  • 6.6 The generated trajectory of a vehicle in the Ring scenario. Left: the initial positions of the vehicles. Vehicles can only be observed when they pass four locations A, B, C and D where cameras are installed. Right: the visualization of the trajectory of Vehicle 0. The x-axis is the timestep in seconds. The y-axis is the relative road distance in meters. Although vehicle 0 is only observed three times (red triangles), ImIn-GAIL (blue points) can imitate the position of the expert trajectory (grey points) more accurately than all other baselines. Better viewed in color.
List of Tables
  • 2.1 Representative deep RL-based traffic signal control methods. Due to page limits, we only put part of the investigated papers here.
  • 3.1 Notations
  • 3.2 Settings for our method
  • 3.3 Reward coefficients
  • 3.4 Configurations for synthetic traffic data
  • 3.5 Details of real-world traffic dataset
  • 3.6 Performance on configuration 1. Reward: the higher the better. Other measures: the lower the better. Same with the following tables.
  • 3.7 Performance on configuration 2
  • 3.8 Performance on configuration 3
  • 3.9 Performance on configuration 4
  • 3.10 Performances of different methods on real-world data. The number after ± means standard deviation. Reward: the higher the better. Other measures: the lower the better.
  • 4.1 Summary of notation.
  • 4.2 Configurations for synthetic traffic data
  • 4.3 Data statistics of real-world traffic dataset
  • 4.4 Performance comparison between all the methods in the arterial with 6 intersections w.r.t. average travel time (the lower the better). Top-down: conventional transportation methods, learning methods, and our proposed method.
  • 4.5 Detailed comparison of our proposed state and reward design and their effects w.r.t. average travel time (lower the better) under synthetic traffic data.
  • 4.6 Average travel time of different methods under arterials with a different number of intersections and network.
  • 5.1 Data statistics of real-world traffic dataset
  • 5.2 Performance on synthetic data and real-world data w.r.t average travel time. CoLight is the best.
  • 5.3 Performance of CoLight with respect to different numbers of attention heads ($H$) on dataset $Grid_{6\times 6}$. More types of attention ($H\leq 5$) enhance model efficiency, while too many ($H>5$) could distract the learning and deteriorate the overall performance.
  • 6.1 Features for a driving state
  • 6.2 Hyper-parameter settings for ImIn-GAIL
  • 6.3 Statistics of dense and sparse expert trajectory in different datasets
  • 6.4 Performance w.r.t Relative Mean Squared Error (RMSE) of time (in seconds) and position (in kilometers). All the measurements are conducted on dense trajectories. Lower the better. Our proposed method ImIn-GAIL achieves the best performance.
  • 6.5 RMSE on time and position of our proposed method ImIn-GAIL against baseline methods and their corresponding two-step variants. Baseline methods and ImIn-GAIL learn from sparse trajectories, while the two-step variants interpolate sparse trajectories first and are then trained on the interpolated data. ImIn-GAIL achieves the best performance in most cases.

Acknowledgements

I would like to express my sincere appreciation to my doctoral committee, Dr. Zhenhui (Jessie) Li, Dr. C. Lee Giles, Dr. Xiang Zhang, Dr. Vikash V. Gayah, and my Ph.D. program chair Dr. Mary Beth Rosson.
I would like to show my special appreciation to my PhD adviser, Dr. Zhenhui (Jessie) Li. Jessie patiently guided me through the research process and career development, including several open-armed, enlightening discussions on how to cope with work-life balance, pressure and priorities. Her supportive and critical attitude has made our fruitful collaboration possible. Her patience and kindness have helped me manage a smooth career development. Without her guidance, my PhD would surely not have been as successful.
I would also like to extend my thanks to Dr. Vikash Gayah from the Department of Civil and Environmental Engineering, whose insightful advice forms some of the foundations of our interdisciplinary research and who has been more than happy to help me become familiar with the background material.
I am also fortunate to have collaborated with several others throughout my PhD: Guanjie Zheng, Chacha Chen, Porter Jenkins, Zhengyao Yu, Kan Wu, Huaxiu Yao, Nan Xu, Chang Liu, Yuandong Wang. Without them, this research would not be possible.
The studies in this dissertation have been supported by NSF awards #1544455, #1652525, #1618448, and #1639150. The views and conclusions contained in the studies are those of the authors and should not be interpreted as representing any funding agencies. I thank them for their generous support in making this research happen.
During my several years of PhD program, I was lucky to have been surrounded by wonderful colleagues and friends. Thank you all for sharing Friday game nights with me, for the road trips and getaways on weekends, and for the summer twilight barbecues. All of you made my time at Penn State a wonderful experience.
At last, a very special thank you to my parents Shaozhu and Yuling, to my whole family, for always supporting me and for always encouraging me to pursue my passion.

Chapter 1 Introduction

Traffic congestion is a growing problem that continues to plague urban areas with negative outcomes for both the traveling public and society as a whole. These negative outcomes will only grow over time as more people flock to urban areas. In 2014, traffic congestion cost Americans over $160 billion in lost productivity and wasted over 3.1 billion gallons of fuel [2]. Traffic congestion was also attributed to over 56 billion pounds of harmful CO2 emissions in 2011 [3]. In the European Union, the cost of traffic congestion was equivalent to 1% of the entire GDP [4]. Mitigating congestion would have significant economic, environmental and societal benefits. Signalized intersections are one of the most prevalent bottleneck types in urban environments, and thus traffic signal control tends to play a large role in urban traffic management.

1.1 Why do we need a more intelligent traffic signal control

The majority of small and big cities even in industrialized countries are still operating old-fashioned fixed-time signal control strategies, often even poorly optimized or maintained. Even when modern traffic-responsive control systems are installed (e.g., SCATS [5] and SCOOT [6, 7]), the employed control strategies are sometimes naive, mainly based on manually designed traffic signal plans.
On the other hand, nowadays various kinds of traffic data can be collected to enrich the information about traffic conditions. Systems like SCATS or SCOOT mainly rely on loop sensor data to choose the signal plans. However, loop sensor data only counts vehicles when they pass the sensor, while increasing amounts of traffic data are now being collected from various sources such as GPS-equipped vehicles, navigation systems, and traffic surveillance cameras. How to use rich traffic data to better optimize our traffic signal control systems has attracted more and more attention from academia, government and industry.

1.2 Why do we use reinforcement learning for traffic signal control

In the transportation field, traffic signal control is one of the most fundamental research questions [8]. The typical approach that transportation researchers take is to seek an optimization solution under certain assumptions about traffic models [8]. However, most of the works focus only on automobiles, and the assumptions they make are simplified and do not necessarily hold true in the field. Real traffic behaves in a complicated way, affected by many factors such as drivers' preferences, interactions with vulnerable road users (e.g., pedestrians, cyclists, etc.), weather and road conditions. These features can hardly be modelled accurately for optimizing traffic signal control.
On the other hand, machine learning techniques learn directly from the observed data without making assumptions about the data model. However, the traffic signal control problem is not a typical machine learning problem with fixed data sets for training and testing. Real-world traffic is constantly changing, and the execution of traffic lights also changes the traffic. Therefore, in the case of ever-changing data samples, we need to learn from the feedback of the environment. This trial-and-error idea is the essence of RL. RL attempts different traffic signal control strategies based on the current traffic environment, and the model learns and adjusts strategies based on environmental feedback.

1.3 Why RL for traffic signal control is challenging?

1.3.1 Formulation of RL agent

A key question for RL is how to define reward and state. In existing studies [9, 10, 11], a typical reward definition for traffic signal control is a weighted linear combination of several components such as queue length, waiting time, number of switches in the traffic signal, and sum of delay. The state includes components such as queue length, number of cars, waiting time, and current traffic signal. In recent work [10, 11], images of vehicles' positions on the roads are also considered in the state.
However, all of the existing work takes an ad-hoc approach to defining reward and state. Such an ad-hoc approach causes several problems that hinder the application of RL in the real world. First, the engineering details in formulating the reward and state could significantly affect the results. For example, if the reward is defined as a weighted linear combination of several terms, the weights on each term are tricky to set, and a minor difference in the weight setting could lead to dramatically different results. Second, the state representation could be in a high-dimensional space, especially when using traffic images as part of the state representation [10, 11]. Such a high-dimensional state representation needs many more training data samples to learn from and may not even converge. Third, there is no connection between existing RL approaches and transportation methods. Without the support of transportation theory, it is highly risky to apply these purely data-driven RL-based approaches in the real physical world.
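To make the weight-sensitivity issue concrete, here is a minimal Python sketch of such a weighted linear combination reward; the function name, the traffic measurements, and the weight values are hypothetical illustrations, not the reward used in any of the cited studies.

```python
# Hypothetical example of an ad-hoc weighted-sum reward for one intersection.
# The weights below are illustrative only; existing studies tune them by hand.

def weighted_reward(queue_length, waiting_time, num_switches, delay,
                    weights=(-0.25, -0.25, -5.0, -0.25)):
    """Linear combination of per-step traffic measurements (higher is better)."""
    w_q, w_w, w_s, w_d = weights
    return (w_q * queue_length + w_w * waiting_time
            + w_s * num_switches + w_d * delay)

# The same traffic condition scored under two slightly different weight settings
# yields different values, which is exactly the tuning problem described above.
print(weighted_reward(queue_length=12, waiting_time=30.0, num_switches=1, delay=8.0))
print(weighted_reward(queue_length=12, waiting_time=30.0, num_switches=1, delay=8.0,
                      weights=(-0.25, -0.25, -1.0, -0.25)))
```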

1.3.2 Learning cost

While learning from trial and error is the key idea in RL, the learning cost of RL is prohibitive for real-world applications. Although RL algorithms are very useful for learning good solutions when the model of the environment is unknown in advance [12, 13], these solutions may only be achieved after an extensive number of trials and errors, which is usually very time consuming. While existing RL methods for games (e.g., Go or Atari games) yield impressive results in simulated environments, the cost of error in traffic signal control is critical, even fatal, in the real world. Therefore, how to learn efficiently (e.g., learning from limited data samples, sampling training data in an adaptive way, transferring learned knowledge) is a critical question for the application of RL in traffic signal control.

1.3.3 Simulation

Reinforcement learning (RL) has shown great success in a series of artificial intelligence (AI) domains such as the game of Go [14]. Despite its huge success in AI domains, RL has not yet shown the same degree of success in real-world applications [15]. These applications could involve control systems such as drones [16] and systems that interact with people such as traffic signal control [11]. As people analyze the challenges in all these scenarios, a recurring theme emerges: there is rarely a good simulator [15].
1.4 Previous Studies
There is already some work investigating reinforcement learning methods for traffic signal control. These studies focus on different scales of traffic signal control, including isolated intersection control [4-6], arterial control [7], and region control [8-10]. Most of the previous studies use different kinds of features in the state and reward design, while we propose a formulation of reinforcement learning in connection with transportation theory. Furthermore, we propose to investigate how to learn efficiently to reduce the learning cost in the real-world setting. We will discuss more of the literature in Chapter 2, Chapter 3 and Chapter 4.

1.5 Proposed Tasks

In this dissertation, we propose to use RL for traffic signal control problems in a way that combines transportation-guided RL with efficient learning. The first part will help the formulation of RL towards a reasonable state and reward design. The connection between RL approaches and transportation theory can help RL optimize towards a correct objective and condense the state space for convergence. The second part will help reduce the learning cost of RL and enable the application of RL in the real world. The third part will try to tackle the real-world application issue by building a realistic simulator. Specifically, we elaborate RL-based traffic signal control methods with real-world implications around the following topics:
  • Formulation of RL with theoretical justification. How should we formulate the RL agent, i.e., the reward and state definition, for the traffic signal control problem? Is it always beneficial to use complex reward functions and state representations? In [7, 1, 17], I proposed to simplify the intricate designs in the current literature and use a concise reward and state design, with theoretical proof that the concise design optimizes towards the global optimum, for both single intersection control and multi-intersection control scenarios.
  • Optimization of the learning process in RL. Is a vanilla deep neural network the solution to the traffic signal control problem with RL? In [11], I tackle the sample imbalance problem with a phase-gated network that emphasizes certain features. In [18], we design a network to model the priority between actions and learn changes of traffic flow such as flips and rotations. Coordination could benefit signal control in multi-intersection scenarios. In [19], I contributed a framework that enables agents to communicate with each other about their observations and behave as a group. Though each agent has limited capabilities and visibility of the world, in this work agents were able to cooperate with multi-hop neighboring intersections and learn the dynamic interactions between the hidden states of neighboring agents. Later, in [20], we investigate the possibility of using RL to control traffic signals at the scale of a city and test our RL methods in a simulator on the road network of Manhattan, New York, with 2510 traffic signals. This is the first time an RL-based method can operate at such a scale in traffic signal control.
  • Bridging the Gap from Simulation to Reality. Despite its massive success in artificial domains, RL has not yet shown the same degree of success in real-world applications. These applications could involve control systems such as drones, software systems such as data centers, and systems that interact with people such as transportation systems. Currently, most RL methods mainly conduct experiments in a simulator, since the simulator can generate data in a cheaper and faster way than real experimentation. Discrepancies between simulation and reality confine the application of learned policies in the real world.

1.6 Overview of this Dissertation

In this chapter, we discussed the motivation, challenges and proposed tasks of using reinforcement learning for traffic signal control. We will elaborate more on the details of the work that has applied reinforcement learning in several scenarios. Chapter 2 covers the notation, background and necessary literature for traffic signal control. The basic formulation of RL-based traffic signal control and a further theory-guided RL method are discussed in detail in Chapters 3 and 4. Chapter 5 discusses how to improve the learning efficiency with neighborhood information for coordinated intersections. Chapter 6 tackles the challenge of real-world applications of RL by building a realistic simulator, and Chapter 7 briefly discusses potential future work.

Chapter 2 Notation, Background and Literature

This chapter covers notation, background information and necessary literature that will be discussed throughout the rest of this dissertation. The main part of this chapter is adapted from our surveys [21] and [22].

2.1 Preliminaries of Traffic Signal Control Problem

2.1.1 Term Definition

Terms on road structure and traffic movement:
  • Approach: A roadway meeting at an intersection is referred to as an approach. At any general intersection, there are two kinds of approaches: incoming approaches and outgoing approaches. An incoming approach is one on which cars can enter the intersection; an outgoing approach is one on which cars can leave the intersection. Figure 2.1(a) shows a typical intersection with four incoming approaches and outgoing approaches. The southbound incoming approach is denoted in this figure as the approach on the north side in which vehicles are traveling in the southbound direction.
  • Lane: An approach consists of a set of lanes. Similar to approach definition, there are two kinds of lanes: incoming lanes and outgoing lanes (also known as approaching/entering lane and receiving/exiting lane in some references [23, 24]).
  • Traffic movement: A traffic movement refers to vehicles moving from an incoming approach to an outgoing approach, denoted as $(r_i \rightarrow r_o)$, where $r_i$ and $r_o$ are the incoming approach and the outgoing approach, respectively. A traffic movement is generally categorized as left turn, through, or right turn.
Terms on traffic signal:
  • Movement signal: A movement signal is defined on the traffic movement, with green signal indicating the corresponding movement is allowed and red signal indicating the movement is prohibited. For the four-leg intersection shown in Figure 2.1(a), the right-turn traffic can pass regardless of the signal, and there are eight movement signals in use, as shown in Figure 2.1(b).
  • Phase: A phase is a combination of movement signals. Figure 2.1(c) shows the conflict matrix of the combinations of two movement signals for the example in Figure 2.1(a) and Figure 2.1(b). A grey cell indicates that the corresponding two movements conflict with each other, i.e., they cannot be set to 'green' at the same time (e.g., signals #1 and #2). A white cell indicates non-conflicting movement signals. All the non-conflicting signals generate eight valid paired-signal phases (letters 'A' to 'H' in Figure 2.1(c)) and eight single-signal phases (the diagonal cells in the conflict matrix). Here we letter the paired-signal phases only, because in an isolated intersection it is always more efficient to use paired-signal phases. When considering multiple intersections, single-signal phases might be necessary because of potential spill back.
Figure 2.1: Definitions of traffic movement and traffic signal phases.
  • Phase sequence: A phase sequence is a sequence of phases which defines a set of phases and their order of changes.
  • Signal plan: A signal plan for a single intersection is a sequence of phases and their corresponding starting times. Here we denote a signal plan as $(p_1,t_1)(p_2,t_2)\dots(p_i,t_i)\dots$, where $p_i$ and $t_i$ stand for a phase and its starting time.
  • Cycle-based signal plan: A cycle-based signal plan is a kind of signal plan where the sequence of phases operates in a cyclic order, which can be denoted as $(p_1,t_1^1)(p_2,t_2^1)\dots(p_N,t_N^1)(p_1,t_1^2)(p_2,t_2^2)\dots(p_N,t_N^2)\dots$, where $p_1,p_2,\dots,p_N$ is the repeated phase sequence and $t_i^j$ is the starting time of phase $p_i$ in the $j$-th cycle. Specifically, $C^j=t_1^{j+1}-t_1^j$ is the cycle length of the $j$-th phase cycle, and $\{\frac{t_2^j-t_1^j}{C^j},\dots,\frac{t_N^j-t_{N-1}^j}{C^j}\}$ is the phase split of the $j$-th phase cycle (see the sketch after this list). Existing traffic signal control methods usually repeat a similar phase sequence throughout the day.
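As a concrete reading of the notation above, the following minimal Python sketch (with an assumed toy signal plan, not from any cited system) stores a cycle-based signal plan as (phase, starting time) pairs and recovers the cycle length $C^j$ and the phase split of one cycle.

```python
# A cycle-based signal plan as (phase, start_time) pairs; values are illustrative.
plan = [("A", 0), ("B", 30), ("C", 50), ("D", 70),       # cycle 1
        ("A", 90), ("B", 120), ("C", 140), ("D", 160),   # cycle 2
        ("A", 180)]                                       # start of cycle 3

N = 4  # number of phases in the repeated sequence

def cycle_length_and_split(plan, j, N):
    """Cycle length C^j and phase split of the j-th cycle (j starts at 0)."""
    cycle = plan[j * N:(j + 1) * N + 1]   # phases of cycle j plus the next cycle's start
    starts = [t for _, t in cycle]
    C = starts[-1] - starts[0]            # C^j = t_1^{j+1} - t_1^j
    split = [(starts[i + 1] - starts[i]) / C for i in range(N)]
    return C, split

print(cycle_length_and_split(plan, 0, N))  # (90, [0.33..., 0.22..., 0.22..., 0.22...])
```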

2.1.2 Objective

The objective of traffic signal control is to facilitate safe and efficient movement of vehicles at the intersection. Safety is achieved by separating conflicting movements in time and is not considered in more detail here. Various measures have been proposed to quantify efficiency of the intersection from different perspectives:
  • Travel time. In traffic signal control, the travel time of a vehicle is defined as the time difference between the time a vehicle enters the system and the time it leaves the system. One of the most common goals is to minimize the average travel time of vehicles in the network.
  • Queue length. The queue length of the road network is the number of queuing vehicles in the road network.
  • Number of stops. The number of stops of a vehicle is the total number of times that the vehicle has stopped.
  • Throughput. The throughput is the number of vehicles that have completed their trips in the road network during a given period. (See the sketch after this list for how these measures can be computed from vehicle records.)
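Below is a minimal sketch of how the travel-time and throughput measures can be computed from logged vehicle records; the trip records and the evaluation horizon are hypothetical values used for illustration only.

```python
# Hypothetical per-vehicle records: (enter_time, exit_time); exit_time is None
# if the vehicle has not completed its trip within the evaluation horizon.
trips = [(0, 120), (15, 160), (30, None), (40, 155), (60, None)]
horizon = 200  # evaluation period length in seconds

completed = [(t_in, t_out) for t_in, t_out in trips if t_out is not None]

# Average travel time over vehicles that completed their trips.
avg_travel_time = sum(t_out - t_in for t_in, t_out in completed) / len(completed)

# Throughput: number of vehicles finishing their trips during the period.
throughput = len(completed)

print(avg_travel_time, throughput)  # approximately 126.67 and 3
```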

2.1.3 Special Considerations

In practice, additional attention should be paid to the following aspects:
  1. Yellow and all-red time. A yellow signal is usually set as a transition from a green signal to a red one. Following the yellow, there is an all-red period during which all the signals in an intersection are set to red. The yellow and all-red time, which can last from 3 to 6 seconds, allow vehicles to stop safely or pass the intersection before vehicles in conflicting traffic movements are given a green signal.
  2. Minimum green time. Usually, a minimum green signal time is required to ensure pedestrians moving during a particular phase can safely pass through the intersection.
  3. Left turn phase. Usually, a left-turn phase is added when the left-turn volume is above a certain threshold.

2.2 Background of Reinforcement Learning

In this section, we first describe the reinforcement learning framework which constitutes the foundation of all the methods presented in this dissertation. We then provide background on conventional RL-based traffic signal control, including the problem of controlling a single intersection and multiple intersections.

2.2.1 Single Agent Reinforcement learning

Usually a single agent RL problem is modeled as a Markov Decision Process represented by $\langle\mathcal{S},\mathcal{A},P,R,\gamma\rangle$, whose components are defined as follows:
  • Set of state representations $\mathcal{S}$: At time step $t$, the agent observes state $s^t\in\mathcal{S}$.
  • Set of actions $\mathcal{A}$ and state transition function $P$: At time step $t$, the agent takes an action $a^t\in\mathcal{A}$, which induces a transition in the environment according to the state transition function $P(s^{t+1}|s^t,a^t):\mathcal{S}\times\mathcal{A}\rightarrow\mathcal{S}$.
  • Reward function $R$: At time step $t$, the agent obtains a reward $r^t$ from the reward function $R(s^t,a^t):\mathcal{S}\times\mathcal{A}\rightarrow\mathbb{R}$.
  • Discount factor $\gamma$: The goal of an agent is to find a policy that maximizes the expected return, which is the discounted sum of rewards $G^t:=\sum_{i=0}^{\infty}\gamma^{i}r^{t+i}$, where the discount factor $\gamma\in[0,1]$ controls the importance of immediate rewards versus future rewards.
Here, we only consider continuing agent-environment interactions, which do not end with terminal states but go on continually without limit.
Solving a reinforcement learning task means, roughly, finding an optimal policy $\pi^{*}$ that maximizes the expected return. While the agent only receives reward about its immediate, one-step performance, one way to find the optimal policy $\pi^{*}$ is by following an optimal action-value function or state-value function. The action-value function (Q-function) of a policy $\pi$, $Q^{\pi}:\mathcal{S}\times\mathcal{A}\rightarrow\mathbb{R}$, is the expected return of a state-action pair, $Q^{\pi}(s,a)=\mathbb{E}_{\pi}[G^{t}|s^{t}=s,a^{t}=a]$. The state-value function of a policy $\pi$, $V^{\pi}:\mathcal{S}\rightarrow\mathbb{R}$, is the expected return of a state, $V^{\pi}(s)=\mathbb{E}_{\pi}[G^{t}|s^{t}=s]$.
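To ground these definitions, the following sketch shows a generic tabular Q-learning loop that estimates the action-value function from sampled transitions and acts ε-greedily. It is a textbook-style illustration under an assumed environment interface (env.reset/env.step returning a state and a reward), not the learning algorithm used later in this dissertation.

```python
import random
from collections import defaultdict

# Minimal tabular Q-learning for a generic MDP <S, A, P, R, gamma>.
# The environment interface (env.reset/env.step) is assumed for illustration.

def q_learning(env, actions, episodes=100, alpha=0.1, gamma=0.9, eps=0.1):
    Q = defaultdict(float)  # Q[(s, a)] approximates the action-value function
    for _ in range(episodes):
        s = env.reset()
        for _ in range(1000):  # truncate the continuing task for training
            # epsilon-greedy: explore with probability eps, otherwise exploit
            if random.random() < eps:
                a = random.choice(actions)
            else:
                a = max(actions, key=lambda x: Q[(s, x)])
            s_next, r = env.step(a)
            # one-step temporal-difference update toward r + gamma * max_a' Q(s', a')
            target = r + gamma * max(Q[(s_next, x)] for x in actions)
            Q[(s, a)] += alpha * (target - Q[(s, a)])
            s = s_next
    return Q
```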

2.2.2 Problem setting

We now introduce the general setting of the RL-based traffic signal control problem, in which the traffic signals are controlled by one or several RL agents. Figure 2.2 illustrates the basic idea of the RL framework in the single traffic signal control problem. The environment is the traffic condition on the roads, and the agent controls the traffic signal. At each time step $t$, a description of the environment (e.g., signal phase, waiting time of cars, queue length of cars, and positions of cars) is generated as the state $s^t$. The agent predicts the next action $a^t$ to take that maximizes the expected return, where the action could be changing to a certain phase in the single intersection scenario. The action $a^t$ is executed in the environment, and a reward $r^t$ is generated, where the reward could be defined on the traffic conditions of the intersection. Usually, in the decision process, an agent combines the exploitation of the learned policy and the exploration of new policies.
Figure 2.2: RL framework for traffic signal control.
In the multi-intersection traffic signal control problem, there are $N$ traffic signals in the environment, controlled by one or several agents. The goal of the agent(s) is to learn the optimal policies that optimize the traffic condition of the whole environment. At each timestep $t$, each agent $i$ observes part of the environment as its observation $o_i^t$, and the agents predict the next actions $\mathbf{a}^t=(a_1^t,\ldots,a_N^t)$ to take. The actions are executed in the environment, and rewards $r_i^t$ are generated, where the reward could be defined on the level of individual intersections or of a group of intersections within the environment. We refer readers interested in more detailed problem settings to [25].
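The interaction just described can be summarized as a simple control loop. The env and agent objects below are placeholders assumed for illustration; they do not correspond to any particular simulator or agent implementation.

```python
# Schematic interaction loop for single-intersection RL control.
# `env` and `agent` are hypothetical objects exposing the methods used below.

def run_episode(env, agent, horizon=3600):
    state = env.reset()                        # e.g., phase, queue lengths, positions
    for t in range(horizon):
        action = agent.act(state)              # e.g., which phase to switch to
        next_state, reward = env.step(action)  # execute the action in the environment
        agent.remember(state, action, reward, next_state)
        agent.learn()                          # update the policy from collected samples
        state = next_state
```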

2.3 Literature

In this section, we introduce three major aspects investigated in recent RL-based traffic signal control literature: agent formulation, policy learning approach and coordination strategy.

2.3.1 Agent formulation

A key question for RL is how to formulate the RL agent, i.e., the reward, state, and action definition. In this section, we focus on the advances in the reward, state, and action design in recent deep RL-based methods, and refer readers interested in more detailed definitions to [26, 27, 28].
2.3.1.1 Reward
The choice of reward reflects the learning objective of an RL agent. In the traffic signal control problem, although the ultimate objective is to minimize the travel time of all vehicles, travel time is hard to use as a valid reward in RL: because the travel time of a vehicle is affected by multiple actions from traffic signals and vehicle movements, travel time as a reward would be delayed and ineffective in indicating the goodness of a signal's action. Therefore, the existing literature often uses a surrogate reward that can be effectively measured after an action, considering factors like average queue length, average waiting time, average speed or throughput [11, 29]. The authors in [10] also take the frequency of signal changes and the number of emergency stops into the reward.

2.3.1.2 State

At each time step, the agent receives some quantitative description of the environment as the state to decide its action. Various kinds of elements have been proposed to describe the environment state, such as queue length, waiting time, speed, phase, etc. These elements can be defined on the lane level or road segment level, and then concatenated into a vector. In earlier work using RL for traffic signal control, people needed to discretize the state space and use a simple tabular or linear model to approximate the state functions for efficiency [30, 31, 32]. However, the real-world state space is usually huge, which confines traditional RL methods in terms of memory or performance. With advances in deep learning, deep RL methods have been proposed to handle large state spaces with effective function approximators. Recent studies propose to use images [33, 34, 35, 36, 37, 38, 39, 40, 10, 11] to represent the state, where the positions of vehicles are extracted as an image representation.
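For concreteness, a lane-level state vector of the kind described above could be assembled as in the following sketch; the lane counts, the phase encoding, and the intersection layout are illustrative assumptions.

```python
import numpy as np

# Hypothetical per-lane measurements for a 4-approach, 2-lane-per-approach intersection.
queue_length = [3, 0, 5, 2, 1, 0, 4, 2]   # queuing vehicles per incoming lane
num_vehicles = [6, 1, 8, 3, 2, 1, 6, 4]   # all vehicles per incoming lane
current_phase = 2                         # index of the active phase
num_phases = 8

# One-hot encode the phase and concatenate lane-level features into one vector.
phase_one_hot = np.eye(num_phases)[current_phase]
state = np.concatenate([queue_length, num_vehicles, phase_one_hot]).astype(np.float32)
print(state.shape)  # (24,)
```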
2.3.1.3 Action scheme
There are different types of action definitions for an RL agent in traffic signal control: (1) set the current phase duration [42, 43], (2) set the ratio of the phase duration over a pre-defined total cycle duration [32, 44], (3) change to the next phase in a pre-defined cyclic phase sequence [10, 11, 27, 45], and (4) choose the phase to change to among a set of phases [34, 39, 46, 47, 48, 1]. The choice of action scheme is closely related to specific settings of traffic signals. For example, if the phase sequence is required to be cyclic, then the first three action schemes should be considered, while "choosing the phase to change to among a set of phases" can generate flexible phase sequences.
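The difference between the cyclic scheme (3) and the flexible scheme (4) can be made explicit with a small sketch; the phase names below are illustrative, not a standard set.

```python
PHASES = ["NS-through", "NS-left", "EW-through", "EW-left"]  # illustrative phase set

def next_phase_cyclic(current_idx, keep_or_change):
    """Scheme (3): a binary action either keeps the phase or advances the fixed cycle."""
    return current_idx if keep_or_change == "keep" else (current_idx + 1) % len(PHASES)

def next_phase_flexible(chosen_idx):
    """Scheme (4): the action directly selects any phase, so the sequence is acyclic."""
    return chosen_idx

print(PHASES[next_phase_cyclic(0, "change")])   # NS-left
print(PHASES[next_phase_flexible(3)])           # EW-left
```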

2.3.2 Policy Learning

RL methods can be categorized in different ways. [49, 50] divide current RL methods into model-based methods and model-free methods. Model-based methods try to model the transition probability among states explicitly, while model-free methods directly estimate the reward for state-action pairs and choose the action based on this estimate. In the context of traffic signal control, the transition between states is primarily influenced by people's driving behaviors, which are diverse and hard to predict. Therefore, currently, most RL-based methods for traffic signal control are model-free methods. In this section, we adopt the categorization in [51]: value-based methods and policy-based methods.

2.3.2.1 Value-based methods

Value-based methods approximate the state-value function or state-action value function (i.e., how rewarding each state or state-action pair is), and the policy is implicitly obtained from the learned value function. Most of the RL-based traffic signal control methods use DQN [52], where the model is parameterized by neural networks and takes the state representation as input [10, 53]. In DQN, discrete actions are required, as the model directly outputs each action's value given a state, which is especially suitable for action schemes (3) and (4) mentioned in Section 2.3.1.3.
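A minimal Q-network of the kind used in DQN-style traffic signal control might look as follows. This is a generic PyTorch sketch with assumed state and action dimensions, not the architecture of any specific method discussed here.

```python
import torch
import torch.nn as nn

class QNetwork(nn.Module):
    """Maps a state vector to one Q-value per candidate phase (discrete actions)."""
    def __init__(self, state_dim=24, num_phases=8, hidden=64):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(state_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
            nn.Linear(hidden, num_phases),   # one output per action (phase)
        )

    def forward(self, state):
        return self.net(state)

q_net = QNetwork()
state = torch.zeros(1, 24)            # illustrative state vector
action = q_net(state).argmax(dim=1)   # greedy phase choice
```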
2.3.2.2 Policy-based methods
Policy-based methods directly update the policy parameters (e.g., a vector of probabilities for conducting actions under a specific state) in the direction that maximizes a predefined objective (e.g., the average expected return). The advantage of policy-based methods is that they do not require the actions to be discrete like DQN. Also, they can learn a stochastic policy and keep exploring potentially more rewarding actions. To stabilize the training process, the actor-critic framework is widely adopted. It utilizes the strengths of both value-based and policy-based methods, with an actor that controls how the agent behaves (policy-based) and a critic that measures how good the conducted action is (value-based). In the traffic signal control problem, [44] uses DDPG [54] to learn a deterministic policy which directly maps states to actions, while [34, 42, 55] learn stochastic policies that map states to action probability distributions, all of which have shown excellent performance in traffic signal control problems. To further improve convergence speed for RL agents, [56] proposed a time-dependent baseline to reduce the variance of policy gradient updates and specifically avoid traffic jams.
In the above-mentioned methods, including both value-based and policy-based methods, deep neural networks are used to approximate the value functions. Most of the literature uses vanilla neural networks with their corresponding strengths. For example, Convolutional Neural Networks (CNNs) are used when the state representation contains an image representation [10, 38, 39, 40, 33, 34, 35, 36]; Recurrent Neural Networks (RNNs) are used to capture the temporal dependency of historical states [57].
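For contrast with the value-based formulation, the sketch below shows a minimal REINFORCE-style policy-gradient update; the network size, the action space, and the way returns are supplied are assumptions made for illustration.

```python
import torch
import torch.nn as nn

policy = nn.Sequential(nn.Linear(24, 64), nn.ReLU(), nn.Linear(64, 8))  # logits over phases
optimizer = torch.optim.Adam(policy.parameters(), lr=1e-3)

def reinforce_update(states, actions, returns):
    """One REINFORCE step: increase log-probability of actions weighted by returns.

    states: float tensor (T, 24); actions: long tensor (T,); returns: float tensor (T,).
    """
    logits = policy(states)                                   # (T, num_phases)
    log_probs = torch.log_softmax(logits, dim=1)
    chosen = log_probs.gather(1, actions.unsqueeze(1)).squeeze(1)
    loss = -(chosen * returns).mean()                         # gradient ascent on expected return
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```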
2.3.3 Coordination

Coordination could benefit signal control in multi-intersection scenarios. Since recent advances in RL have improved the performance of isolated traffic signal control, efforts have been made to design coordination strategies for multi-agent RL (MARL) agents. The literature [60] categorizes MARL into two classes: joint action learners and independent learners. Here we extend this categorization for the traffic signal control problem.
2.3.3.1 Joint action learners
A straightforward solution is to use a single global agent to control all the intersections [31]. It directly takes the state as input and learns to set the joint actions of all intersections at the same time. However, these methods can suffer from the curse of dimensionality, which encompasses the exponential growth of the state-action space in the number of state and action dimensions.
| Citation | Method | Cooperation | Road net (# signals) |
| :---: | :---: | :---: | :---: |
| [46] | Value-based | With communication | Synthetic (5) |
| | Policy-based | Without communication | Real (50) |
| | Policy-based | Without communication | Real (43) |
| | Value-based | Without communication | Real (2510) |
| | Policy-based | Joint action | Real (30) |
| | Value-based | - | Synthetic (1) |
| | Value-based | - | Synthetic (1) |
| | Value-based | Without communication | Synthetic (9) |
| | Both studied | - | Synthetic (1) |
| | Value-based | With communication | Synthetic (6) |
| | Value-based | Without communication | Synthetic (4) |
| | Both studied | Single global | Synthetic (5) |
| | Policy-based | - | Real (1) |
| | Value-based | Joint action | Synthetic (4) |
| | Value-based | With communication | Real (4) |
| | Value-based | - | Synthetic (1) |
| | Value-based | Without communication | Real (16) |
| | Value-based | With communication | Real (196) |
| | Value-based | Joint action | Synthetic (36) |
| | Value-based | Without communication | Real (16) |
| | Value-based | Without communication | Real (5) |

* Traffic with an arrival rate of less than 500 vehicles/hour/lane is considered light traffic in this survey, otherwise it is considered heavy.
* 1. Synthetic light uniform; 2. Synthetic light dynamic; 3. Synthetic heavy uniform; 4. Synthetic heavy dynamic; 5. Real-world data
Table 2.1: Representative deep RL-based traffic signal control methods. Due to page limits, we only put part of the investigated papers here.
Joint action modeling methods explicitly learn to model the joint action value of multiple agents, $Q(o_1,\ldots,o_N,\mathbf{a})$. The joint action space grows with the number of agents to model. To alleviate this challenge, [10] factorizes the global Q-function as a linear combination of local subproblems, extending [9] using the max-plus [61] algorithm: $\hat{Q}(o_1,\ldots,o_N,\mathbf{a})=\Sigma_{i,j}Q_{i,j}(o_i,o_j,a_i,a_j)$, where $i$ and $j$ correspond to the indices of neighboring agents. In other works, [48, 59, 62] regard the joint Q-value as a weighted sum of local Q-values, $\hat{Q}(o_1,\ldots,o_N,\mathbf{a})=\Sigma_{i,j}w_{i,j}Q_{i,j}(o_i,o_j,a_i,a_j)$, where $w_{i,j}$ are pre-defined weights. They attempt to ensure that individual agents consider other agents' learning processes by adding a shaping term to the loss function of each individual agent's learning process, minimizing the difference between the weighted sum of individual Q-values and the global Q-value.
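The weighted-sum factorization can be written out as a short sketch; the neighbor pairs, weights, and local Q-function below are illustrative placeholders rather than any published model.

```python
# Joint Q-value as a weighted sum of pairwise local Q-values over neighboring agents.
# w[(i, j)] are pre-defined weights; Q_local stands in for the learned local models.

def joint_q(observations, actions, neighbor_pairs, w, Q_local):
    """Q_hat(o_1..o_N, a) = sum_{(i,j)} w_{i,j} * Q_{i,j}(o_i, o_j, a_i, a_j)."""
    return sum(w[(i, j)] * Q_local(observations[i], observations[j],
                                   actions[i], actions[j])
               for (i, j) in neighbor_pairs)

# Example with two intersections and a trivial local Q-value.
obs, acts = [1.0, 2.0], [0, 1]
print(joint_q(obs, acts, [(0, 1)], {(0, 1): 0.5},
              lambda oi, oj, ai, aj: oi + oj + ai + aj))  # 0.5 * (1+2+0+1) = 2.0
```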
2.3.3.2 Independent learners
There is also a line of studies that use individual RL (IRL) agents to control the traffic signals, where each RL agent controls an intersection. Unlike joint action learning methods, each agent learns its control policy without knowing the reward signal of other agents.
IRL without communication methods treat each intersection individually: each agent observes its own local environment and does not use explicit communication to resolve conflicts [1, 27, 39, 40, 44, 45, 63]. In some simple scenarios like arterial networks, this approach has performed well with the formation of several mini green waves. However, when the environment becomes complicated, non-stationary impacts from neighboring agents are brought into the environment, and the learning process usually cannot converge to stationary policies if there are no communication or coordination mechanisms among agents [64].
IRL with communication methods enable agents to communicate with each other about their observations and behave as a group, rather than as a collection of individuals, in complex tasks where the environment is dynamic and each agent has limited capabilities and visibility of the world [65]. Typical methods add a neighbor's traffic conditions [66] or past actions [67] directly into the observation of the ego agent, rather than using only the local traffic condition of the ego agent. In these methods, all the agents for different intersections share one learning model, which requires consistent indexing of neighboring intersections. [47] attempts to remove this requirement by utilizing the road network structure with a Graph Convolutional Network [68] to cooperate with multi-hop nearby intersections. [47] models the influence of neighboring agents with the fixed adjacency matrix defined in the Graph Convolutional Network, which reflects the assumption that the influences between neighbors are static. In other work, [57] proposes to use Graph Attention Networks [69] to learn the dynamic interactions between the hidden states of neighboring agents and the ego agent. It should be pointed out that there is a strong connection between the methods that employ max-plus [61] to learn joint actions in Section 2.3.3.1 and the methods that use Graph Convolutional Networks to learn the communication: both of them can be seen as learning message passing on a graph, where the former pass rewards and the latter pass state observations.
2.3.3.3 Sizes of Road network
At a coarse scale, a road network is a directed graph with nodes and edges representing intersections and roads, respectively. A real-world road network can be more complicated than a synthetic network in its road properties (e.g., the number of lanes, the speed limit of every lane), intersection structures and signal phase settings. Among all the road network properties, the number of traffic signals in the network largely influences the experimental results, because the scale of exploration for RL agents increases with the scale of the road network. Currently, most of the work still conducts experiments on relatively small road networks compared to the scale of a city, which could include thousands of traffic signals. Aslani et al. [42, 43] test their method in a real-world road network with 50 signals. In [19], a district with 196 signals is investigated. One of the most recent works [20] tests its methods on the real road network of Manhattan, New York, with 2510 traffic signals.

2.4 Conclusion

In the review of existing work, we found that the formulation of RL lacks theoretical justification, i.e., different studies propose ad-hoc designs of the reward and state, and the number of coordinated traffic signals is still small. Therefore, in the following chapters, I provide several methods to tackle these problems: I first show a basic formulation of RL-based traffic signal control in Chapter 3 and discuss theory-guided RL further in Chapter 4. Chapter 5 discusses how to improve learning efficiency with neighborhood information for coordinated intersections. Chapter 6 tackles the challenge of real-world applications of RL by building a realistic simulator.

Chapter 3 Basic Formulation for Traffic Signal Control

This part shows a basic formulation of RL-based traffic signal control and corresponds to our paper "IntelliLight: A Reinforcement Learning Approach for Intelligent Traffic Light Control" [11].

3.1 Introduction

Traffic congestion has become increasingly costly. For example, traffic congestion costs Americans $124 billion a year, according to a report by Forbes in 2014 [70]. In the European Union, the traffic congestion cost is estimated to be 1% of its GDP [2]. Improving traffic conditions could increase city efficiency, improve the economy, and ease people's daily lives.
One way to reduce traffic congestion is to intelligently control traffic lights. Nowadays, most traffic lights are still controlled with pre-defined fixed-time plans [71, 72] and are not designed by observing real traffic. Recent studies propose hand-crafted rules according to real traffic data [73, 74]. However, these rules are still pre-defined and cannot be dynamically adjusted w.r.t. real-time traffic.
To dynamically adjust traffic lights according to real-time traffic, people have been using reinforcement learning techniques [9, 10, 75]. Traditional reinforcement learning is difficult to apply due to two key challenges: (1) how to represent the environment; and (2) how to model the correlation between the environment and the decision. To address these two challenges, recent studies [10, 76] have applied deep reinforcement learning techniques, such as Deep Q-learning (DQN), to the traffic light control problem. Figure 2.2 illustrates the basic idea of the deep reinforcement learning framework. The environment is composed of the traffic light phase and the traffic condition. The state is a feature representation of the environment. The agent takes the state as input and learns a model to predict whether to "keep the current phase of traffic lights" or "change the current phase". The decision is sent to the environment, and the reward (e.g., how many vehicles pass the intersection) is sent back to the agent. The agent consequently updates the model and further makes a new decision for the next timestamp based on the new state and the updated model. In such a framework, the traffic condition can be described as an image, and such an image is directly taken as input by a CNN-based model to enrich the hand-crafted features of the environment.
Recent deep reinforcement learning approaches made promising progress for the traffic light control problem. Our approach extends this line of work by making several important new contributions:
1. Experiments with real traffic data. Nowadays, an increasing amount of traffic data is being collected from various sources. In China, many big cities have installed AI-equipped traffic surveillance cameras to monitor traffic conditions in real time. Such real-time traffic data enables us to implement reinforcement learning in the real world. However, to the best of our knowledge, none of the existing studies has used real traffic data to test their methods. Instead, they use traffic simulations, and such simulations do not reflect real-world traffic. For example, the simulation models in current studies often assume that vehicles arrive at a constant rate, but real traffic is highly dynamic over time. In this chapter, we test the methods on large-scale real traffic data obtained from 1,704 surveillance cameras in Jinan, China for a period of 31 days (see the experiment section for details). In this dataset, there are more than 405 million vehicle records and more than 11 million unique vehicle plates. We conduct comprehensive experiments on this large real-world dataset.
2. Interpretations of the policy. A frequently used measure to quantify the performance of traffic light control is the overall reward, which can be defined by several factors such as the waiting time of vehicles and the number of vehicles passing the intersection. However, existing studies rarely examine the policy learned by the model. The reward can be misleading in some cases: different policies may yield the same reward, yet one may be more suitable than the other. Take Figure 3.1 as an example. Assume there is only traffic on the South-North direction and the traffic comes for the first 80 seconds of every 120 seconds. Policy #1 is 80 seconds of green light on the South-North direction followed by 40 seconds of red light, and then repeat. Policy #2 differs from policy #1 in that, instead of a 40-second red light on the South-North direction, the light changes every 10 seconds. Both policies result in the same reward because no vehicle waits under either policy. However, policy #1 is preferred over policy #2 in real scenarios. In this chapter, we claim that it is important to study the policies rather than simply showing the overall reward. In our experiments, we show several interesting policies learned from the real traffic under different scenarios (e.g., peak hours vs. non-peak hours, weekday vs. weekend).
3. A phase-gated model learning. As described earlier in deep reinforcement learning framework, the agent will take the state, which is the representation of environment, as model input. The environment usually includes the current traffic light phase and traffic conditions. For example, the environments of two cases in Figure 3.2 are the same, except the traffic light phases. Previous studies all take phase as one feature [10, 27], together with many other features (e.g., number of vehicles at different lanes, positions of vehicles). And it is likely that this one feature does not play a role that is significant enough to affect the model output. Therefore, the model will make
Figure 3.1: Reward is not a comprehensive measure to evaluate traffic light control performance. Both policies will lead to the same rewards. But policy #1 is more suitable than policy #2 in the real world.
the same decision (i.e., either keep or change the current phase) for these two different cases. However, such a decision, no matter which one, is not ideal for one of the cases, because in Figure 3.2, case A should keep the phase while case B should change it. In this chapter, we propose a new phase-sensitive (i.e., phase gate combined with memory palace) reinforcement learning agent, which is a critical component that leads to superior performance.
The rest of this chapter is organized as follows. Section 3.2 discusses the literature. Section 3.3 formally defines the problem. The method is described in Section 3.4, and the experimental results are shown in Section 3.5. Finally, we conclude the chapter in Section 3.6.
3.2 Related Work

In this section, we first introduce conventional methods for traffic light control and then introduce methods using reinforcement learning.

Conventional Traffic Light Control

Early traffic light control methods can be roughly classified into two groups. The first is pre-timed signal control [71, 72, 77], where a fixed time is determined for all green phases
Figure 3.2: Case A and case B have the same environment except the traffic light phase.
according to historical traffic demand, without considering possible fluctuations in traffic demand. The second is vehicle-actuated control [73, 74], where real-time traffic information is used. Vehicle-actuated methods are suitable for situations with relatively high traffic randomness. However, they largely depend on hand-crafted rules for the current traffic condition, without taking future conditions into account. Therefore, they cannot reach a global optimum.

Reinforcement Learning for Traffic Light Control

Recently, because previous methods cannot handle dynamic multi-directional traffic, more studies have tried to use reinforcement learning algorithms to solve the traffic light control problem [75, 9, 27]. Typically, these algorithms take the traffic on the road as the state and the operation on the lights as the action. These methods usually show better performance than fixed-time and traffic-responsive control methods.
Methods in [78, 79, 80, 9, 30] design the state as discrete values like the locations of vehicles or the number of waiting cars. However, the discrete state-action value matrix requires huge storage space, which keeps these methods from being used in large state space problems.
To handle the unmanageably large state space of previous methods, recent studies [76, 10] propose to apply Deep Q-learning methods using continuous state representations. These studies learn a Q-function (e.g., a deep neural network) to map state and action to reward. These works vary in their state representations, including hand-crafted features (e.g., queue length [27, 76], average delay [81, 10]) and image features [82, 10, 83]. They also differ in reward design, including average delay [10, 46], average travel time [83, 10], and queue length [76].
However, all these methods assume relatively static traffic environments and are hence far from the real case. Further, they only focus on rewards and overlook the adaptability of the algorithms to real traffic. Therefore, they cannot interpret why the learned signal changes in response to the traffic. In this chapter, we test the algorithms in a more realistic traffic setting and add more interpretation beyond the reward.

3.3 Problem Definition

In our problem, we have the environment $\mathsf{E}$ as an intersection of two roads (and the traffic at this intersection). There is an intelligent traffic light agent $\mathsf{G}$. To simplify notation, we use "N", "S", "W", "E" to represent north, south, west, and east respectively, and use "Red" and "Green" to represent the red light and green light correspondingly. A setting of the traffic light is defined as a phase (e.g., a green light on the west-east direction, which can be abbreviated as Green-WE). When a light changes from green to red, there is a 3-second yellow light, while the other directions stay red. So one green light and the subsequent yellow light can be represented together by "Green". To simplify the problem, we assume there are only two phases of the traffic light, i.e., 1) Green-WE, and 2) Red-WE. Due to real-world constraints, the traffic light can only change in a specific order (i.e., 1 $\rightarrow$ 2 $\rightarrow$ 1 $\rightarrow$ 2 $\rightarrow$ ...). Given the state $\mathsf{s}$ (describing the positions and speeds of the traffic near this intersection), the goal of the agent $\mathsf{G}$ is to give the optimal action $\mathsf{a}$ (i.e., whether to change the light to the next phase), so that the reward $\mathsf{r}$ (i.e., the smoothness of traffic) is maximized.

3.4 Method

Traffic light control has attracted a lot of attention in recent years due to its essential role in adjusting traffic. Current methods generally fall into two categories: conventional methods and deep reinforcement learning based methods. Conventional methods usually rely on prior knowledge to set a fixed time for each light phase or to set changing rules. These rules perform poorly under dynamically changing traffic. Reinforcement learning methods usually take the traffic condition (e.g., the queue length of waiting cars and the updated waiting time) as the state and try to take actions that improve the traffic condition based on the current state.
However, current methods do not consider the complex situations of real cases and hence may lead the agent to get stuck in a single kind of action. This leads to inferior traffic-adjusting performance under complex traffic situations.
In this section, we propose a deep reinforcement learning traffic light agent to solve this problem. We will first introduce the model framework in Section 3.4.1. Then, we show the design of the agent in Section 3.4.2. We further describe the network structure in Section 3.4.3. In addition, we describe the memory palace in Section 3.4.4. Note that, although our model is designed for a four-way intersection with two phases, it is not difficult to extend it to other types of intersections or to multi-phase scenarios.

Framework

Our model is composed of an offline part and an online part (as shown in Figure 3.3). We extract five kinds of features describing the traffic conditions as the state (detailed in Section 3.4.2) and use the reward to describe how much the action has improved the traffic (detailed in Section 3.4.2). In the offline stage, we set a fixed timetable for the lights and let traffic go through the system to collect data samples. After training with the samples logged in this stage, the model is put into the online part. In the online stage, at every time interval $\Delta t$, the traffic light agent observes the state $\mathsf{s}$ from the environment and takes action $\mathsf{a}$ (i.e., whether to change the light signal to the next phase) according to an $\epsilon$-greedy strategy combining exploration (i.e., a random action with probability $\epsilon$) and exploitation (i.e., taking the action with the maximum estimated reward). After that, the agent $\mathsf{G}$ observes the environment and gets the reward $\mathsf{r}$ from it. Then, the tuple ($\mathsf{s}$, $\mathsf{a}$, $\mathsf{r}$) is stored into memory.
| Notation | Meaning |
| :--- | :--- |
| $\mathsf{E}$ | Environment |
| $\mathsf{G}$ | Agent |
| $\mathsf{a}$ | Action |
| $\mathsf{s}$ | State |
| $\mathsf{r}$ | Reward |
| $\Delta t$ | Time interval between actions |
| $\mathsf{q}$ | Action value function |
| $\mathsf{Q}$ | Deep Q-Network |
| $\mathsf{L}_{i}$ | Queue length on lane $i$ |
| $\mathsf{V}_{i}$ | Number of vehicles on lane $i$ |
| $\mathsf{W}_{i}$ | Updated waiting time of all vehicles on lane $i$ |
| $\mathsf{D}_{i}$ | Delay of lane $i$ |
| $\mathsf{M}\in\mathbb{R}^{N\times N}$ | Image representation of vehicles' positions |
| $\mathsf{P}_{c}$ | Current phase |
| $\mathsf{P}_{n}$ | Next phase |
| $\mathsf{C}\in\{0,1\}$ | Light switches (1) or not (0) |
| $\mathsf{N}$ | Number of vehicles that passed the intersection after the action |
| $\mathsf{T}$ | Travel time in the system of all vehicles that passed the intersection during $\Delta t$ |

Table 3.1: Notations

After several timestamps (e.g., $t_{2}$ in Figure 3.3), agent $\mathsf{G}$ will update the network according to the logs in the memory.
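As a minimal sketch of the $\epsilon$-greedy step in the online stage (plain Python; the state tuple, the Q-value estimates, and the reward below are made-up placeholders, not values produced by the actual model), the agent explores with probability $\epsilon$, otherwise picks the action with the higher estimated value, and then logs the experience:

```python
import random

def choose_action(q_values, epsilon=0.05):
    """epsilon-greedy over the two actions: 0 = keep phase, 1 = change phase.
    q_values is a length-2 sequence of estimated action values for the current state."""
    if random.random() < epsilon:
        return random.choice([0, 1])           # exploration
    return int(q_values[1] > q_values[0])      # exploitation

# toy usage: the agent slightly prefers keeping the current phase
memory = []                                    # replay log of (s, a, r) tuples
s, q_values = (2, 0, 30.5), [0.8, 0.3]         # hypothetical state and Q estimates
a = choose_action(q_values)
r = -1.2                                       # hypothetical reward from the environment
memory.append((s, a, r))
```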

Agent Design

First, we introduce the state, action and reward representation.
  • State. Our state is defined for one intersection. For each lane $i$ at this intersection, the state component includes the queue length $\mathsf{L}_{i}$, the number of vehicles $\mathsf{V}_{i}$, and the updated waiting time of vehicles $\mathsf{W}_{i}$. In addition, the state includes an image representation of vehicles' positions $\mathsf{M}$, the current phase $\mathsf{P}_{c}$ and the next phase $\mathsf{P}_{n}$.
  • Action. The action is defined as $\mathsf{a}=1$: change the light to the next phase $\mathsf{P}_{n}$, and $\mathsf{a}=0$: keep the current phase $\mathsf{P}_{c}$.
  • Reward. As shown in Equation 3.3, the reward is defined as a weighted sum of the following factors (a small computational sketch of this reward follows the list below): 1. Sum of queue length $\mathsf{L}$ over all approaching lanes, where $\mathsf{L}$ is calculated as the total number of waiting vehicles on the given lane; a vehicle with a speed of less than 0.1 m/s is considered to be waiting. 2. Sum of delay $\mathsf{D}$ over all approaching lanes, where the delay $\mathsf{D}_{i}$ for lane $i$ is defined in Equation 3.1, in which the lane speed is the average speed of vehicles
Figure 3.3: Model framework
on lane $i$, and the speed limit is the maximum speed allowed on lane $i$: $\mathsf{D}_{i}=1-\frac{lane\ speed}{speed\ limit}$ (3.1)
  3. Sum of updated waiting time $\mathsf{W}$ over all approaching lanes. This equals the sum of $\mathsf{W}$ over all vehicles on approaching lanes. The updated waiting time $\mathsf{W}_{j}$ of vehicle $j$ at time $t$ is defined in Equation 3.2. Note that the updated waiting time of a vehicle is reset to 0 every time it moves. For example, if a vehicle's speed is 0.01 m/s from 0 s to 15 s, 5 m/s from 15 s to 30 s, and 0.01 m/s from 30 s to 60 s, then $\mathsf{W}_{j}$ is 15 seconds, 0 seconds and 30 seconds at $t=$ 15 s, 30 s and 60 s respectively. $\mathsf{W}_{j}(t)=\begin{cases}\mathsf{W}_{j}(t-1)+1 & vehicle\ speed<0.1\\ 0 & vehicle\ speed\geq 0.1\end{cases}$ (3.2)
  4. Indicator of light switches $\mathsf{C}$, where $\mathsf{C}=0$ means keeping the current phase and $\mathsf{C}=1$ means changing the current phase.
  5. Total number of vehicles $\mathsf{N}$ that passed the intersection during the time interval $\Delta t$ after the last action $\mathsf{a}$.
  6. Total travel time $\mathsf{T}$ of vehicles that passed the intersection during the time interval $\Delta t$ after the last action $\mathsf{a}$, defined as the total time (in minutes) that vehicles spent on approaching lanes. $Reward=w_{1}*\sum_{i\in I}\mathsf{L}_{i}+w_{2}*\sum_{i\in I}\mathsf{D}_{i}+w_{3}*\sum_{i\in I}\mathsf{W}_{i}+w_{4}*\mathsf{C}+w_{5}*\mathsf{N}+w_{6}*\mathsf{T}$ (3.3)
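The following minimal sketch (plain Python; the per-lane measurements in the toy usage are made up, and the default coefficients echo the values listed later in Table 3.3) shows how Equations 3.1-3.3 could be evaluated for one intersection:

```python
def lane_delay(lane_speed, speed_limit):
    """Delay of a lane (Eq. 3.1): 1 - lane speed / speed limit."""
    return 1.0 - lane_speed / speed_limit

def update_waiting_time(prev_wait, vehicle_speed):
    """Updated waiting time of a vehicle (Eq. 3.2): accumulate while it is
    effectively stopped (< 0.1 m/s), reset to 0 once it moves."""
    return prev_wait + 1 if vehicle_speed < 0.1 else 0

def reward(L, D, W, C, N, T, w=(-0.25, -0.25, -0.25, -5, 1, 1)):
    """Weighted sum of the six reward terms (Eq. 3.3).
    L, D, W are per-lane lists; C, N, T are scalars; w are the coefficients
    from Table 3.3."""
    return (w[0] * sum(L) + w[1] * sum(D) + w[2] * sum(W)
            + w[3] * C + w[4] * N + w[5] * T)

# toy usage with made-up lane measurements
L = [3, 0, 1]                                   # queue lengths per approaching lane
D = [lane_delay(5.0, 14.0), 0.0, 0.2]           # delays per lane
W = [45, 0, 12]                                 # summed updated waiting times per lane
print(reward(L, D, W, C=1, N=4, T=2.5))
```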
Hence, given the current state $\mathsf{s}$ of the traffic condition, the mission of the agent $\mathsf{G}$ is to find the action $\mathsf{a}$ (change or keep the current phase) that leads to the maximum reward $\mathsf{r}$ in the long run, following the Bellman Equation (Equation 3.4) [12]. In this situation, the action value function $\mathsf{q}$ for time $t$ is the sum of the reward at the next timestamp $t+1$ and the maximum potential future reward. Through this estimate of the future, the agent can select the action that is better for the long-run reward.
$\mathsf{q}(\mathsf{s}_{t},\mathsf{a},t)=\mathsf{r}_{\mathsf{a},t+1}+\gamma\max_{\mathsf{a}'}\mathsf{q}(\mathsf{s}_{\mathsf{a},t+1},\mathsf{a}',t+1)$ (3.4)
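As a small numerical illustration of Equation 3.4 (plain Python; the reward and next-state Q estimates are made-up numbers, and $\gamma=0.8$ follows the setting given later in Table 3.2):

```python
def q_target(reward_next, q_next_values, gamma=0.8):
    """One-step target from Eq. 3.4: r_{a,t+1} + gamma * max_a' q(s_{t+1}, a', t+1).
    q_next_values holds the estimated values of both actions in the next state."""
    return reward_next + gamma * max(q_next_values)

# toy usage: observed reward plus the discounted best future value
print(q_target(reward_next=-0.5, q_next_values=[0.9, 0.2]))   # -0.5 + 0.8 * 0.9
```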

Network Structure

In order to estimate the reward based on the state and action, the agent needs to learn a Deep Q-Network $\mathsf{Q}(\mathsf{s},\mathsf{a})$.
In real-world scenarios, traffic is very complex and contains many different cases that need to be considered separately. We illustrate this in Example 1.
Example 1.: We still assume a simple intersection with a two-phase light transition here: 1) Green-WE, and 2) Red-WE. The decision process of whether to change the traffic light consists of two steps. The first step is the mapping from the traffic condition (e.g., how many cars are waiting, how long each car has been waiting) to a partial reward. An example of this mapping could be $\mathsf{r}=-0.5\times\mathsf{L}-0.7\times\mathsf{W}$. This mapping is shared by different phases, no matter which lane the green light is on. Then, to determine the action, the agent should attend to the traffic on different lanes during different phases. For instance, as shown in Figure 3.2 (a), when the red light is on the NS direction, more waiting traffic (i.e., a lower reward in the first step) on the NS direction will make the light tend to
change (because by changing the light on this lane from red to green, more cars on this lane can pass through the intersection), while more waiting traffic (i.e., a lower reward in the first step) on the WE direction will make the light tend to keep its phase. When the red light is on the WE direction, the case is exactly the opposite. Therefore, the light phase should perform an explicit selection on features.

Figure 3.4: Q-network
In previous studies, due to the simplified design of the model for approximating the Q-function under complex traffic conditions, agents have difficulty distinguishing the decision processes for different phases. Therefore, we propose a network structure that can explicitly consider the different cases. We call this special sub-structure the "Phase Gate".
Our whole network structure is shown in Figure 3.4. The image features are extracted from the observations of the traffic condition and fed into two convolutional layers. The output of these layers is concatenated with the four explicitly mined features: queue length $L$, updated waiting time $W$, phase $P$ and total number of vehicles $V$. The concatenated features are then fed into fully-connected layers to learn the mapping from traffic conditions to potential rewards. Then, for each phase, we design a separate learning process mapping from rewards to the value of making decisions $Q(s,a)$. These separate processes are selected through a gate controlled by the phase. As shown in Figure 3.4, when phase $P=0$, the left branch is activated, while when phase $P=1$, the right branch is activated. This distinguishes the decision processes for different phases, prevents the decision from favoring certain actions, and enhances the fitting ability of the network.
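A minimal PyTorch sketch of the phase-gate idea is given below. It keeps only the shared fully-connected part and the per-phase decision branches; the convolutional image branch is omitted, and all layer sizes and names (`PhaseGateQNet`, `feat_dim`, etc.) are illustrative assumptions rather than the exact architecture used here:

```python
import torch
import torch.nn as nn

class PhaseGateQNet(nn.Module):
    """Shared feature layers followed by one decision branch per phase;
    the current phase acts as a gate selecting which branch produces Q(s, a)."""

    def __init__(self, feat_dim=8, hidden=32, n_actions=2, n_phases=2):
        super().__init__()
        self.shared = nn.Sequential(nn.Linear(feat_dim, hidden), nn.ReLU())
        # one separate head (branch) for each phase
        self.branches = nn.ModuleList(
            [nn.Linear(hidden, n_actions) for _ in range(n_phases)])

    def forward(self, features, phase):
        """features: (batch, feat_dim) mined traffic features (queue length,
        waiting time, vehicle count, ...); phase: (batch,) integer phase index."""
        h = self.shared(features)
        all_q = torch.stack([branch(h) for branch in self.branches], dim=1)  # (batch, n_phases, n_actions)
        gate = phase.view(-1, 1, 1).expand(-1, 1, all_q.size(-1))            # pick one branch per sample
        return all_q.gather(1, gate).squeeze(1)                              # (batch, n_actions)

# toy usage with random features and alternating phases
net = PhaseGateQNet()
q = net(torch.randn(4, 8), torch.tensor([0, 1, 0, 1]))
```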

Memory Palace and Model Updating

Periodically, the agent takes samples from the memory and uses them to update the network. The memory is maintained by adding new data samples and occasionally removing old samples. This technique is known as experience replay [52] and has been widely used in reinforcement learning models.
However, in real traffic settings, traffic on different lanes can be highly imbalanced. As previous methods [10, 76, 81, 82] store all the state-action-reward training samples in one memory, this memory will be dominated by the phases and actions that appear most frequently in imbalanced settings. The agent then learns to estimate the reward for these frequent phase-action combinations well but ignores other, less frequent phase-action combinations. This causes the learned agent to make bad decisions on the infrequent phase-action combinations. Therefore, when traffic on different lanes is dramatically different, these imbalanced samples lead to inferior performance on less frequent situations.
Inspired by the Memory Palace theory [84, 85] in cognitive psychology, we address this imbalance by using different memory palaces for different phase-action combinations. As shown in Figure 3.5, training samples for different phase-action combinations are stored in different memory palaces. Then the same number of samples is selected from each palace. These balanced samples prevent different phase-action combinations from interfering with each other's training process and, hence, improve the capability of the network to predict the reward accurately.
Figure 3.5: Memory palace structure
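A minimal sketch of such per-combination memories with balanced sampling is shown below (plain Python; the class name `MemoryPalace`, the capacity, and the per-palace sample size are illustrative assumptions, with values echoing Table 3.2):

```python
import random
from collections import defaultdict, deque

class MemoryPalace:
    """One replay memory per (phase, action) combination, sampled evenly so that
    frequent combinations do not dominate the training batch."""

    def __init__(self, capacity=1000):
        self.palaces = defaultdict(lambda: deque(maxlen=capacity))

    def store(self, phase, action, sample):
        self.palaces[(phase, action)].append(sample)    # route to its own memory

    def sample(self, n_per_palace=300):
        """Draw the same number of samples from every phase-action memory."""
        batch = []
        for palace in self.palaces.values():
            k = min(n_per_palace, len(palace))
            batch.extend(random.sample(list(palace), k))
        return batch

# toy usage
mem = MemoryPalace()
mem.store(phase=0, action=1, sample=((2, 0, 30.5), 1, -1.2))
print(len(mem.sample(n_per_palace=1)))
```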
3.5 Experiment
In this section, we conduct experiments using both synthetic and real-world traffic data. We show a comprehensive quantitative evaluation by comparing with other methods and also show some interesting case studies 1.
Footnote 1: Codes are available at the author’s website.

Experiment Setting

The experiments are conducted on a simulation platform SUMO (Simulation of Urban MObility) 2. SUMO provides flexible APIs for road network design, traffic volume simulation and traffic light control. Specifically, SUMO can control the traffic moving according to the given policy of traffic light (obtained by the traffic light agent).
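For concreteness, the snippet below sketches how such a control loop could be driven through SUMO's TraCI Python interface (module and function names as in recent SUMO releases; the configuration file, traffic light id, thresholds, and the simple rule standing in for the RL agent are placeholders for illustration only):

```python
import traci

traci.start(["sumo", "-c", "intersection.sumocfg"])    # hypothetical SUMO config
TLS_ID = "center"                                       # hypothetical traffic light id

for step in range(3600):
    traci.simulationStep()                              # advance the simulation by one step
    if step % 5 == 0:                                   # act every Delta t = 5 seconds
        # read queue lengths on the controlled lanes as part of the state
        queue = sum(traci.lane.getLastStepHaltingNumber(l)
                    for l in traci.trafficlight.getControlledLanes(TLS_ID))
        if queue > 10:                                  # placeholder rule instead of the RL agent
            traci.trafficlight.setPhase(TLS_ID, 1)      # switch to the next phase index

traci.close()
```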
The environment for the experiments on synthetic data is a four-way intersection as in Figure 3.1. The intersection is connected to four road segments of 300 meters, where each road has three incoming and three outgoing lanes. The traffic light in this part of the experiment contains two phases: (1) Green-WE (green light on WE with red light on SN), and (2) Red-WE (red light on WE with green light on SN). Note that when a green light is on one direction, there is a red light on the other direction. Also, a green light is followed by a 3-second yellow light before it turns red. Although this is a simplification of the real-world scenario, research on more types of intersections (e.g., three-way intersections) and more complex light phasing (e.g., with left-turn phases) can be conducted in a similar way.

Parameter Setting

The parameter settings and reward coefficients for our method are shown in Table 3.2 and Table 3.3, respectively. We found that the action time interval $\Delta t$ has minimal influence on the performance of our model as long as $\Delta t$ is between 5 seconds and 25 seconds.

Evaluation Metric

We evaluate the performance of different methods using the following measures:
  • Reward: average reward over time. Defined in Equation 3.3, the reward is a combination of several positive and negative terms; therefore, the reward ranges from $-\infty$ to $\infty$. Under a specific configuration, there is an upper bound for the reward when all cars move freely without any stop or delay.
  • Queue length: average queue length over time, where the queue length at time $t$ is the sum of $\mathsf{L}$ (defined in Section 3.4.2) over all approaching lanes. A smaller queue length means there are fewer waiting vehicles on all lanes.
  • Delay: average delay over time, where the delay at time $t$ is the sum of $\mathsf{D}$ (defined in Equation 3.1) over all approaching lanes. A lower delay means a higher speed on all lanes.
  • Duration: average travel time vehicles spend on approaching lanes (in seconds). It is one of the most important measures that people care about when they drive on the road. A smaller duration means vehicles spend less time passing through the intersection.
In summary, a higher reward indicates better performance of the method, and a smaller queue length, delay and duration indicate that the traffic is less congested.

Compared Methods

To evaluate the effectiveness of our model, we compare it with the following baseline methods and tune the parameters for all methods. We then report their best performance.
| $w_1$ | $w_2$ | $w_3$ | $w_4$ | $w_5$ | $w_6$ |
| :---: | :---: | :---: | :---: | :---: | :---: |
| -0.25 | -0.25 | -0.25 | -5 | 1 | 1 |
Table 3.3: Reward coefficients
| Model parameter | Value |
| :--- | :---: |
| Action time interval $\Delta t$ | 5 seconds |
| $\gamma$ for future reward | 0.8 |
| $\epsilon$ for exploration | 0.05 |
| Sample size | 300 |
| Memory length | 1000 |
| Model update interval | 300 seconds |
Table 3.2: Settings for our method

  • Fixed-time Control (FT). The fixed-time control method uses a pre-determined cycle and phase time plan [72] and is widely used when the traffic flow is steady.
  • Self-Organizing Traffic Light Control (SOTL) [74]. This method controls the traffic light according to the current traffic state, including the elapsed time and the number of vehicles waiting at the red light. Specifically, the traffic light changes when the number of waiting cars exceeds a hand-tuned threshold.
  • Deep Reinforcement Learning for Traffic Light Control (DRL). Proposed in [10], this method applies DQN framework to select optimal light configurations for traffic intersections. Specifically, it solely relies on the original traffic information as an image.
In addition to the baseline methods, we also consider several variations of our model.
  • IntelliLight (Base). This variant uses the same network structure and reward function defined in Sections 3.4.2 and 3.4.3, but without the Memory Palace and Phase Gate.
  • IntelliLight (Base+MP). By adding the Memory Palace to IntelliLight (Base), we store the samples from different phases and times in separate memories.
  • IntelliLight (Base+MP+PG). This is the model adding two techniques (Memory Palace and Phase Gate).

Datasets

Synthetic data
In the first part of our experiment, synthetic data is used with four traffic flow settings: simple changing traffic (configuration 1), equally steady traffic (configuration 2), unequally steady traffic (configuration 3) and complex traffic (configuration 4), which is a combination of the previous three configurations. As shown in Table 3.4, vehicle arrivals are generated by a Poisson process with the specified arrival rates.
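As an illustration of this kind of traffic generation (plain Python; the function name and the choice to sample exponential inter-arrival gaps are assumptions about how a Poisson arrival process is commonly realized, using the rate of configuration 1 from Table 3.4):

```python
import random

def generate_arrivals(rate, start, end, seed=0):
    """Sample vehicle arrival times on one approach from a Poisson process with
    the given rate (cars/s) between start and end; inter-arrival gaps are exponential."""
    random.seed(seed)
    t, arrivals = start, []
    while True:
        t += random.expovariate(rate)      # exponential gap between consecutive cars
        if t > end:
            return arrivals
        arrivals.append(t)

we_arrivals = generate_arrivals(rate=0.4, start=0, end=36000)
print(len(we_arrivals))                    # roughly 0.4 * 36000 vehicles
```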
Real-world data
The real-world dataset was collected by 1,704 surveillance cameras in Jinan, China over the period from 08/01/2016 to 08/31/2016. The locations of these cameras are shown in Figure 3.6. Gathered every second by cameras facing vehicles near intersections, each record in the dataset consists of time, camera ID and vehicle information. By analyzing these records together with the camera locations, the trajectories of vehicles are recorded as they pass through road intersections. The dataset covers 935 locations, 43 of which are four-way intersections. We use the number of vehicles passing through 24 intersections as the traffic volume for experiments since only these intersections have consecutive data. We then feed this real-world traffic setting into SUMO for online experiments. It can be seen from Table 3.5 that traffic flow on different roads changes dynamically in the real world.
| Config | Directions | Arrival rate (cars/s) | Start time (s) | End time (s) |
| :---: | :---: | :---: | :---: | :---: |
| 1 | WE | 0.4 | 0 | 36000 |
| | SN | 0.4 | 36001 | 72000 |
| 2 | WE | 0.033 | 0 | 72000 |
| | SN | 0.033 | 0 | 72000 |
| 3 | WE | 0.2 | 0 | 72000 |
| | SN | 0.033 | 0 | 72000 |
| 4 | Configuration 1 | | 0 | 72000 |
| | Configuration 2 | | 72001 | 144000 |
| | Configuration 3 | | 144001 | 216000 |
Table 3.4: Configurations for synthetic traffic data
Figure 3.6: Traffic surveillance cameras in Jinan, China

Performance on Synthetic Data

3.5.6.1 Comparison with state-of-the-art methods
We first compare our method with the three baselines under different synthetic traffic settings. From Tables 3.6, 3.7, 3.8 and 3.9 we can see that our method performs better than all other baseline methods in configurations 1, 2, 3 and 4. Although some baselines perform well in certain settings, they perform badly in other configurations (e.g., SOTL achieves a good reward under configuration 1, almost identical to our method up to three decimal places; this is because our method has learned to keep the light until 36000 s and switch the light after that, and SOTL is also designed to behave similarly, hence the two methods perform very similarly). In contrast, our method IntelliLight shows better performance across all configurations.
3.5.6.2 Comparison with variants of our proposed method
Tables 3.6, 3.7, 3.8 and 3.9 show the performance of the variants of our proposed method. First, we can see that adding the Memory Palace helps achieve a higher reward under configurations 3 and 4, although it does not boost the reward under configurations 1 and 2. This is because in the simple cases (configurations 1 and 2), the phase is relatively steady for a long time (the traffic only comes from one direction or remains unchanged for a long time). Therefore, the memory palace does not help build a better model for predicting the reward. Further adding the Phase Gate also reduces the queue length in most cases and achieves the highest reward, demonstrating the effectiveness of these two techniques.
3.5.6.3 Interpretation of learned signal
To understand what our method has learned w.r.t. dynamic traffic conditions, we show the percentage of duration for phase Green-WE (i.e., green light on the WE direction with red light on the SN direction), along with the ratio of traffic flow on WE over the total traffic flow from all directions.
| Time Range | Records | Mean arrival rate (cars/s) | Std | Max | Min |
| :---: | :---: | :---: | :---: | :---: | :---: |
| 08/01/2016 - 08/31/2016 | 405,370,631 | 0.089 | 0.117 | 0.844 | 0.0 |

Table 3.5: Details of the real-world traffic dataset
With changing traffic, an ideal traffic light control method should be able to adjust its phase duration to the traffic flows and obtain a high reward. For example, as traffic changes from direction WE to SN, the traffic light agent is expected to shift its phase duration from giving WE the green light to giving SN the green light. As we can see from Figure 3.7, IntelliLight adjusts its phase duration as the traffic changes.

Performance on Real-world Data

Comparison of different methods
In this section, we compare our method with the baseline methods on real-world data. The overall results are shown in Table 3.10. Our method IntelliLight achieves the best reward, queue length, delay and duration among all compared methods, with relative improvements of 32%, 38%, 19% and 22% respectively over the best baseline method. In addition, our method has a relatively steady performance over multiple intersections (small standard deviation).
| Model name | Reward | Queue length | Delay | Duration |
| :---: | :---: | :---: | :---: | :---: |
| FT | -0.978 | 1.105 | 2.614 | 34.278 |
| SOTL | -21.952 | 19.874 | 4.384 | 177.747 |
| DRL | -2.208 | 3.405 | 3.431 | 52.075 |
| **IntelliLight** (ours) | | | | |
| Base | -0.523 | 0.208 | 1.689 | 27.505 |
| Base+MP | -0.556 | 0.259 | 1.730 | 27.888 |
| Base+MP+PG | **-0.514** | **0.201** | **1.697** | 27.451 |
Table 3.7: Performance on configuration 2
| Model name | Reward | Queue length | Delay | Duration |
| :---: | :---: | :---: | :---: | :---: |
| FT | -2.304 | 8.532 | 2.479 | 42.230 |
| SOTL | 0.398 | 0.006 | **1.598** | **24.129** |
| DRL | -36.247 | 91.412 | 4.483 | 277.430 |
| **IntelliLight** (ours) | | | | |
| Base | -3.077 | 10.654 | 2.635 | 92.080 |
| Base+MP | -3.267 | 6.087 | 1.865 | 38.230 |
| Base+MP+PG | **0.399** | **0.005** | **1.598** | 24.130 |
Table 3.6: Performance on configuration 1. Reward: the higher the better. Other measures: the lower the better. The same applies to the following tables.

3.5.7.2 Observations with respect to real traffic

In this section, we make observations on the policies we learned from the real data. We analyze the learned traffic light policy for the intersection of Jingliu Road (WE direction)
| Model name | Reward | Queue length | Delay | Duration |
| :---: | :---: | :---: | :---: | :---: |
| FT | -1.670 | 4.601 | 2.883 | 39.707 |
| SOTL | -14.079 | 13.372 | 3.753 | 54.014 |
| DRL | -49.011 | 91.887 | 4.917 | 469.417 |
| **IntelliLight** (ours) | | | | |
| Base | -5.030 | 5.880 | 3.432 | 39.021 |
| Base+MP | -3.329 | 5.358 | 2.238 | 44.703 |
| Base+MP+PG | **-0.474** | **0.548** | **2.202** | **25.977** |
Table 3.9: Performance on configuration 4
Figure 3.7: Percentage of the time duration of learned policy for phase Green-WE (green light on W-E and E-W direction, while red light on N-S and S-N direction) in every 2000 seconds for different methods under configuration 4.
| Model name | Reward | Queue length | Delay | Duration |
| :---: | :---: | :---: | :---: | :---: |
| FT | -1.724 | 4.159 | 3.551 | 36.893 |
| SOTL | -20.680 | 20.227 | 5.277 | 69.838 |
| DRL | -8.108 | 16.968 | 4.704 | 66.485 |
| **IntelliLight** (ours) | | | | |
| Base | -0.836 | 0.905 | 2.699 | 28.197 |
| Base+MP | -0.698 | 0.606 | 2.729 | 26.948 |
| Base+MP+PG | **-0.648** | **0.524** | **2.584** | **26.647** |
Table 3.8: Performance on configuration 3

and Erhuanxi Auxiliary Road (SN direction) under different scenarios: peak hours vs. non-peak hours, weekdays vs. weekends, and major arterial vs. minor arterial.
1. Peak hours vs. non-peak hours. Figure 3.8 (a) shows the average traffic flow from both directions (WE and SN) on a Monday. On this day, there is more traffic on the WE direction than on SN for most of the time, during which an ideal traffic light control method is expected to give more time to the WE direction. It can be seen from Figure 3.8 (c) that the ratio of the time duration for phase Green-WE (i.e., green light on WE, while red light on SN) is usually larger than 0.5, which means that for most of the time, our method gives more time to WE. During peak hours (around 7:00, 9:30 and 18:00), the policies learned by our method also give more green-light time to WE than during non-peak hours. In the early morning, the vehicle arrival rates on SN are larger than those on WE, and our method automatically gives more time to SN. This shows that our method can intelligently adapt to different traffic conditions.
2. Weekdays vs. weekends. Unlike weekdays, weekends show different patterns of traffic conditions and traffic light control policies. Our policy gives less green light to WE (more green light to SN) during weekend daytime than it does on weekdays. This is because there is more traffic on SN than on WE during weekend daytime in Figure 3.8 (b), while on weekdays traffic on SN is lighter than on WE. Besides, by comparing Figure 3.8 (a) with Figure 3.8 (b), we can see that the traffic on WE and SN during late night on Monday is similar, making the duration ratio of Green-WE close to 0.5.
3. Major arterial vs. minor arterial. Major arterials are roads that carry higher traffic volume within a period and are expected to have longer green light time. Without prior knowledge about the major arterial, the traffic light control policy learned by our method prefers giving the major arterial the green light (i.e., keeping the green light already on the major arterial, and tending to switch the red light to green for the major arterial). Specifically, we look into three periods of time (3:00, 12:00 and 23:30) on August 1st. From Figure 3.8 (a), we can tell that the road on the WE direction is the main road, since
| Methods | Reward | Queue Length | Delay | Duration |
| :---: | :---: | :---: | :---: | :---: |
| FT | -5.727 $\pm$ 5.977 | 19.542 $\pm$ 22.405 | 3.377 $\pm$ 1.057 | 84.513 $\pm$ 60.888 |
| SOTL | -35.338 $\pm$ 65.108 | 16.603 $\pm$ 17.718 | 4.070 $\pm$ 0.420 | 64.833 $\pm$ 23.136 |
| DRL | -30.577 $\pm$ 26.242 | 54.148 $\pm$ 43.420 | 4.209 $\pm$ 1.023 | 166.861 $\pm$ 93.985 |
| IntelliLight | **-3.892** $\pm$ 7.609 | **10.238** $\pm$ 20.949 | **2.730** $\pm$ 1.086 | **50.487** $\pm$ 46.439 |
Table 3.10: Performance of different methods on real-world data. The number after $\pm$ is the standard deviation. Reward: the higher the better. Other measures: the lower the better.
traffic on WE is usually heavier than traffic on SN. As shown in Figure 3.9, the dotted lines indicate the number of arriving cars per second on the two directions. Along with the arrival rate, we also plot the change of phases (dashed areas). It can be seen from Figure 3.9 (a) that: 1) the overall time of phase Red-WE is longer than Green-WE, which is compatible with the traffic volume at this time; 2) although the traffic volume on SN is larger than on WE, the traffic light change from Red-WE to Green-WE is usually not triggered by waiting cars on the WE direction. On the contrary, in Figure 3.9 (b) and Figure 3.9 (c), the change from Green-WE to Red-WE is usually triggered by waiting cars on the SN direction. This is mainly because the road on WE is the main road during these time periods, and the traffic light tends to favor phase Green-WE.

3.6 Conclusion

In this chapter, we address the traffic light control problem using a well-designed reinforcement learning approach. We conduct extensive experiments using both synthetic
Figure 3.8: Average arrival rate on two directions (WE and SN) and time duration ratio of two phases (Green-WE and Red-WE) from learned policy for Jingliu Road (WE) and Erhuanxi Auxiliary Road (SN) in Jinan on August 1st and August 7th, 2016.
and real-world data and demonstrate the superior performance of our proposed method over state-of-the-art methods. In addition, we show in-depth case studies and observations to understand how the agent adjusts to changing traffic, as a complement to quantitative reward measures. These in-depth case studies can help generate traffic rules for real-world applications.
We also acknowledge the limitations of our current approach and would like to point out several important future directions to make the method more applicable to the real world. First, we can extend the two-phase traffic light to a multi-phase traffic light, which involves more complicated but more realistic state transitions. Second, this chapter addresses a simplified single-intersection case, whereas real-world road networks are much more complicated. Although some studies have tried to solve the multi-intersection problem by using multiple reinforcement learning agents, they do not explicitly consider the interactions between different intersections (i.e., how the phase of one intersection affects the state of nearby intersections), and they are still limited to a small number of intersections. Lastly, our approach is still tested in a simulation framework, so the feedback is simulated. Ultimately, a field study should be conducted to learn the real-world feedback and to validate the proposed reinforcement learning approach.
Figure 3.9: Detailed average arrival rate on two directions (dotted lines) and changes of two phases (dashed areas) in three periods of time for Jingliu Road (WE) and Erhuanxi Auxiliary Road (SN) in Jinan on August 1st, 2016. X-axis of each figure indicates the time of a day; left Y-axis of each figure indicates the number of cars approaching the intersection every second; right Y-axis for each figure indicates the phase over time.

Chapter 4 Formulating the Learning Objective

This chapter formulates RL-based traffic signal control methods with theoretical proofs and corresponds to our paper "PressLight: Learning Network-level Cooperation for Traffic Signal Control" [17].

4.1 Introduction

Traffic signals coordinate the traffic movements at an intersection, and a smart traffic signal control algorithm is the key to transportation efficiency. Traffic signal control remains an active research topic because of the high complexity of the problem. Traffic situations are highly dynamic and thus require traffic signal plans that can adapt to different situations.
Recently, people have started to investigate reinforcement learning (RL) techniques for traffic signal control. Several studies have shown the superior performance of RL techniques over traditional transportation approaches [9, 10, 11, 46, 78, 80]. The biggest advantage of RL is that it directly learns how to take the next action by observing the feedback from the environment after previous actions.
One major issue of current RL-based traffic signal control approaches is that the setting is often heuristic and lacks proper theoretical justification from transportation literature. This often results in highly sensitive performance w.r.t. the setting and leads to a long learning process. We elaborate on this issue by examining two fundamental elements in RL setting: reward and state.
First, various reward designs have been proposed in the literature. The reason is that travel time, the ultimate objective, is hard to optimize directly. Travel time is a long-term reward that depends on a sequence of actions, so the effect of one action can hardly be reflected in terms of travel time. People thus choose short-term rewards like queue length or delay to approximate travel time [21], and the reward function is often defined as a weighted sum of these terms [10, 11, 79, 86, 87]. However, as shown in Figure 4.1(a), tuning the weights on these terms can lead to largely different results in terms of travel time. Some literature [1] discusses how to define the reward by connecting it with existing transportation methods, but it only focuses on controlling a single intersection. In this chapter, we focus on the multi-intersection control scenario.
Second, existing RL methods show a trend toward more complicated state representations. Recent studies use visual images to describe the full traffic situation at the intersection [10, 11], which results in state dimensions on the scale of thousands. In the single-intersection scenario, [1] reveals that additional information is not always helpful. Similar conclusions can also be found in the multi-intersection scenario. As shown in Figure 4.1(b), complicated state definitions increase the learning time and may not necessarily bring significant gains. Note that we are not claiming that additional information is never helpful; the choice of the state depends on the reward setting. Based on the reward design of LIT [1], neighboring information is not necessary in the case shown in Figure 4.1(b). The question is: can we justify theoretically how much information is enough in the state definition in order to optimize the reward function?
The challenges we face in RL motivate us to look for support from transportation. In the transportation literature, max pressure (MP) control is one of the state-of-the-art methods in traffic signal control [88, 89]. The key idea of MP is to minimize the "pressure" of an
Figure 4.1: Performance of RL approaches is sensitive to reward and state. (a) A heuristic parameter tuning of reward function could result in different performances. (b) The method with a more complicated state (LIT[1] w/ neighbor) has a longer learning time but does not necessarily converge to a better result.
intersection, which can be loosely defined as the number of vehicles on incoming lanes minus the number of vehicles on outgoing lanes. Figure 4.2 illustrates the concept of pressure. By setting the objective as minimizing the pressure of intersections, MP is proven to maximize the throughput of the whole road network 1. However, the solution of MP is greedy, which leads to locally optimal solutions.
Footnote 1: Maximizing throughput is equivalent to minimizing travel time under certain conditions, and minimizing travel time is the final goal of most traffic signal control problems.
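To make the notion of pressure concrete, the sketch below (plain Python; the greedy phase-selection helper, the lane names, and the vehicle counts are illustrative assumptions following the loose definition above, not the exact MP formulation proved in [89]) computes intersection pressure and picks the phase that relieves the most of it:

```python
def pressure(incoming_counts, outgoing_counts):
    """Loose definition of the pressure of an intersection used in the text:
    vehicles on incoming lanes minus vehicles on outgoing lanes."""
    return sum(incoming_counts) - sum(outgoing_counts)

def max_pressure_phase(phase_to_lanes, lane_counts):
    """Greedy max-pressure choice: pick the phase whose permitted movements
    relieve the largest pressure. phase_to_lanes maps each phase to the
    (incoming, outgoing) lane pairs it allows; all names are illustrative."""
    def phase_pressure(pairs):
        return sum(lane_counts[l] - lane_counts[m] for l, m in pairs)
    return max(phase_to_lanes, key=lambda p: phase_pressure(phase_to_lanes[p]))

# toy usage for a two-phase intersection
lane_counts = {"N_in": 6, "S_out": 1, "E_in": 2, "W_out": 3}
phases = {"NS": [("N_in", "S_out")], "EW": [("E_in", "W_out")]}
print(max_pressure_phase(phases, lane_counts))    # "NS": pressure 5 vs. -1
```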
Our proposed solution is based on RL but theoretically grounded by MP method. The connection between RL and MP is that both approaches can essentially be framed as an optimization problem. In RL, long term reward is the objective for optimization and the solution is derived from trial-and-error search. In MP, the objective is to minimize pressure and the solution is derived from a greedy algorithm. Intuitively, if we set our reward function the same as the objective of MP, we can achieve the same result as MP. We first prove that under the assumption of no physical queue expansion, both our method and MP are maximizing throughput of the network. We further show that our method can relax the assumption on queue expansion and the conclusion still holds.
To further address the challenge on state design, we describe the system dynamics using the state features based on MP. MP provides evolution equations to formulate the state transition of the traffic as a Markov chain [90]. In RL, the Markov decision process formally describes the dynamics of an environment. By including the variables from the
Figure 4.2: Illustration of max pressure control in two cases. In Case A, the green signal is set in the North$\rightarrow$South direction; in Case B, the green signal is set in the East$\rightarrow$West direction.
evolution equation into the state definition in RL, the state becomes a sufficient statistic for the system dynamics.
We conduct comprehensive experiments using both synthetic and real data. We test our method under different traffic flows and network structures. We demonstrate the power of RL methods over traditional transportation approaches, as RL optimizes the objective through trial and error. Our method also consistently outperforms state-of-the-art RL methods, which shows that a theoretically supported reward design is necessary and that a concise state design leads to an efficient learning process. We further discuss several interesting policies learned by our method to show that it can achieve coordination along an arterial.
Individual Traffic Signal Control. Individual traffic signal control has been investigated extensively in the field of transportation, which tries to optimize the travel time or delay of vehicles [91, 92, 93, 94, 95], assuming that vehicles arrive and move in a specific pattern. Recently, reinforcement learning based methods attempt to address this problem by directly learning from the data [9, 27]. Earlier work using tabular Q-learning [87, 96] can only deal with discrete state representations. Recent work using deep RL [10, 11, 76, 97, 98] can cope with more complex continuous state representations. [1] noticed that more complex state definitions do not always lead to better performance. In [1], the authors also investigated a proper reward design grounded by individual intersection control methods in the transportation field. In this chapter, we focus on the multi-intersection scenario.
Conventional Multi-intersection Traffic Signal Control. In conventional multi-intersection control, coordination can be achieved by setting a fixed offset (i.e., the time interval between the beginnings of green lights) among all intersections along an arterial [99]. In fact, this is not an easy task, given that traffic in opposite directions usually cannot be facilitated simultaneously. To solve this problem, some optimization-based methods [100, 101] have been developed to minimize vehicle travel time and/or the number of stops at multiple intersections. Instead of optimizing offsets, max pressure [89, 90] aims to maximize the throughput of the network so as to minimize travel time. However, these approaches still rely on assumptions to simplify the traffic conditions and do not guarantee optimal results in the real world.
RL-based Multi-intersection Traffic Signal Control. Since recent advances in RL improve the performance of isolated traffic signal control [11, 1], efforts have been made to design strategies that control multiple intersections. One way is to jointly model the actions of the learning agents with centralized optimization [75, 10]. Since these methods [10, 75] need to negotiate between the agents in the whole network, they are computationally expensive. Another way is to use decentralized RL agents to control the traffic signals in a multi-intersection system [46, 79, 102]. Since each agent makes its own decision based on information from itself and neighboring intersections without centralized decision-making, decentralized methods may be more scalable and practicable: by plugging new intersection controllers into the system, decentralized systems are easy to scale. Our proposed method also follows this direction.
We note the recent trend of varying the definitions of state and reward in RL for traffic signal control. Readers interested in a detailed comparison of state and reward definitions can refer to [21]. Ours is the first RL method that is theoretically grounded by traditional transportation methods to coordinate the traffic signals along an arterial.

4.3 Preliminaries and Notations

In this section, we re-iterate the preliminaries mentioned in Chapter 2 and provide necessary notations for this chapter.
Definition 1 (Incoming lane and outgoing lane of an intersection).: An incoming lane of an intersection is a lane where traffic enters the intersection. An outgoing lane of an intersection is a lane where traffic leaves the intersection. We denote the set of incoming lanes and outgoing lanes of an intersection as $L_{in}$ and $L_{out}$ respectively.
Definition 2 (Traffic movement).: A traffic movement is defined as the traffic traveling across an intersection from one incoming lane to an outgoing lane. We denote a traffic movement from lane $l$ to lane $m$ as $(l,m)$.
Definition 3 (Movement signal and phase).: A movement signal is defined on a traffic movement, with a green signal indicating that the corresponding movement is allowed and a red signal indicating that the movement is prohibited. We denote a movement signal as $a(l,m)$, where $a(l,m)=1$ indicates the green light is on for movement $(l,m)$, and $a(l,m)=0$ indicates the red light is on for movement $(l,m)$. A phase is a combination of movement signals. We denote a phase as $p=\{(l,m)\,|\,a(l,m)=1\}$, where $l\in L_{in}$ and $m\in L_{out}$