

The Pennsylvania State University
The Graduate School
DEEP REINFORCEMENT LEARNING FOR TRAFFIC SIGNAL CONTROL
A Dissertation in
Information Sciences and Technology
by
Hua Wei
© 2020 Hua Wei
Submitted in Partial Fulfillment
of the Requirements
for the Degree of
Doctor of Philosophy
December 2020

The dissertation of Hua Wei was reviewed and approved by the following:
Zhenhui (Jessie) Li
Associate Professor of Information Sciences and Technology
Dissertation Adviser
Chair of Committee
C. Lee Giles
Professor of Information Sciences and Technology
Xiang Zhang
Associate Professor of Information Sciences and Technology
Vikash V. Gayah
Associate Professor of Civil and Environmental Engineering
Mary Beth Rosson
Professor of Information Sciences and Technology
Graduate Program Chair

Abstract

Traffic congestion is a growing problem that continues to plague urban areas with negative outcomes for both the traveling public and society as a whole. Signalized intersections are one of the most prevalent bottleneck types in urban environments, and thus traffic signal control tends to play a large role in urban traffic management. Nowadays, the widely-used traffic signal control systems (e.g., SCATS and SCOOT) are still based on manually designed traffic signal plans. Recently, emerging research studies have used reinforcement learning (RL) to tackle the traffic signal control problem. In this dissertation, we propose to use reinforcement learning to intelligently optimize signal timing plans in real time to reduce traffic congestion.
Although some efforts using reinforcement learning (RL) techniques have been proposed to adjust traffic signals dynamically, they only use ad-hoc designs when formulating the traffic signal control problem and lack a principled approach to formulating the problem under the framework of RL. Secondly, since RL learns directly from the data via trial-and-error search, it requires a large number of interactions with the environment before the algorithms converge. In real-world problems, every interaction incurs real cost (e.g., traffic congestion, traffic accidents). Hence, a more data-efficient method is necessary. Thirdly, discrepancies between simulation and reality confine the application of RL in the real world, despite its massive success in domains like games. Most RL methods mainly conduct experiments in a simulator, since the simulator can generate data in a cheaper and faster way than real experimentation. Hence, addressing the performance gap of RL methods between simulation and the real world is required for applying RL in the real world.
This dissertation presents how to utilize mobility data and RL-based methods for traffic signal control. I have investigated the key challenges for RL-based traffic signal control methods, including how to formulate the objective function and how to improve learning efficiency for city-wide traffic signal control. In addition, I have worked to mitigate the performance gap of RL methods between simulation and the real world. We have achieved significant improvements over the state-of-the-art or currently employed methods, which provide promising solutions to traffic signal control problems and implications for smart city applications.

Table of Contents

List of Figures

List of Tables

Acknowledgements

Chapter 1 Introduction
* 1.1 Why do we need a more intelligent traffic signal control
* 1.2 Why do we use reinforcement learning for traffic signal control
* 1.3 Why RL for traffic signal control is challenging?
	* 1.3.1 Formulation of RL agent
	* 1.3.2 Learning cost
	* 1.3.3 Simulation
* 1.4 Previous Studies
* 1.5 Proposed Tasks
* 1.6 Overview of this Dissertation

Chapter 2 Notation, Background and Literature

* 2.1 Preliminaries of Traffic Signal Control Problem
	* 2.1.1 Term Definition
	* 2.1.2 Objective
	* 2.1.3 Special Considerations
* 2.2 Background of Reinforcement Learning
	* 2.2.1 Single Agent Reinforcement learning
	* 2.2.2 Problem setting
* 2.3 Literature
	* 2.3.1 Agent formulation
		* 2.3.1.1 Reward
		* 2.3.1.2 State
		* 2.3.1.3 Action scheme
	* 2.3.2 Policy Learning
		* 2.3.2.1 Value-based methods
		* 2.3.2.2 Policy-based methods
	* 2.3.3 Coordination
		* 2.3.3.1 Joint action learners
		* 2.3.3.2 Independent learners
		* 2.3.3.3 Sizes of Road network
* 2.4 Conclusion

Chapter 3 Basic Formulation for Traffic Signal Control

Chapter 4 Formulating the Learning Objective

* 4.1 Introduction
* 4.2 Related Work
* 4.3 Preliminaries and Notations
* 4.4 Method
	* 4.4.1 Agent Design
	* 4.4.2 Learning Process
* 4.5 Justification of RL agent
	* 4.5.1 Justification for State Design
		* 4.5.1.1 General description of traffic movement process as a Markov chain
		* 4.5.1.2 Specification with proposed state definition
	* 4.5.2 Justification for Reward Design
		* 4.5.2.1 Stabilization on traffic movements with proposed reward.
		* 4.5.2.2 Connection to throughput maximization and travel time minimization.
* 4.6 Experiment
	* 4.6.1 Dataset Description
	* 4.6.2 Experimental Settings
		* 4.6.2.1 Environmental settings
		* 4.6.2.2 Evaluation metric
		* 4.6.2.3 Compared methods
	* 4.6.3 Performance Comparison
	* 4.6.4 Study of PressLight
		* 4.6.4.1 Effects of variants of our proposed method
		* 4.6.4.2 Average travel time related to pressure.
	* 4.6.5 Performance on Mixed Scenarios
		* 4.6.5.1 Heterogeneous intersections
		* 4.6.5.2 Arterials with a different number of intersections and network
	* 4.6.6 Case Study
		* 4.6.6.1 Synthetic traffic on the uniform, uni-directional flow
			* 4.6.6.1.1 Performance comparison
			* 4.6.6.1.2 Policy learned by RL agents
		* 4.6.6.2 Real-world traffic in Jinan
* 4.7 Conclusion

Chapter 5 Improving Learning Efficiency

* 5.1 Introduction
* 5.2 Related Work
* 5.3 Problem Definition
* 5.4 Method
	* 5.4.1 Observation Embedding
	* 5.4.2 Graph Attention Networks for Cooperation
		* 5.4.2.1 Observation Interaction
		* 5.4.2.2 Attention Distribution within Neighborhood Scope
		* 5.4.2.3 Index-free Neighborhood Cooperation
		* 5.4.2.4 Multi-head Attention
	* 5.4.3 Q-value Prediction
	* 5.4.4 Complexity Analysis
		* 5.4.4.1 Space Complexity
		* 5.4.4.2 Time Complexity
* 5.5 Experiments
	* 5.5.1 Settings
	* 5.5.2 Datasets
		* 5.5.2.1 Synthetic Data
		* 5.5.2.2 Real-world Data
	* 5.5.3 Compared Methods
	* 5.5.4 Evaluation Metric
	* 5.5.5 Performance Comparison
		* 5.5.5.1 Overall Analysis
		* 5.5.5.2 Convergence Comparison
	* 5.5.6 Scalability Comparison
		* 5.5.6.1 Effectiveness
		* 5.5.6.2 Training time
	* 5.5.7 Study of CoLight
		* 5.5.7.1 Impact of Neighborhood Definition
		* 5.5.7.2 Impact of Neighbor Number
		* 5.5.7.3 Impact of Attention Head Number
* 5.6 Conclusion

Chapter 6 Learning to Simulate

  • 6.1 Introduction
  • 6.2 Related Work
  • 6.3 Preliminaries
    • 6.4 Method
      • 6.4.1 Basic GAIL Framework
      • 6.4.2 Imitation with Interpolation
        • 6.4.2.1 Generator in the simulator
        • 6.4.2.2 Downsampling of generated trajectories
        • 6.4.2.3 Interpolation-Discriminator
          • 6.4.2.3.1 Interpolator module
          • 6.4.2.3.2 Discriminator module
          • 6.4.2.3.3 Loss function of Interpolation-Discriminator
      • 6.4.3 Training and Implementation
    • 6.5 Experiment
      • 6.5.1 Experimental Settings
        • 6.5.1.1 Dataset
          • 6.5.1.1.1 Synthetic Data
          • 6.5.1.1.2 Real-world Data
        • 6.5.1.2 Data Preprocessing
      • 6.5.2 Compared Methods
        • 6.5.2.1 Calibration-based methods
        • 6.5.2.2 Imitation learning-based methods
      • 6.5.3 Evaluation Metrics
      • 6.5.4 Performance Comparison
      • 6.5.5 Study of ImIn-GAIL
        • 6.5.5.1 Interpolation Study
        • 6.5.5.2 Sparsity Study
      • 6.5.6 Case Study
    • 6.6 Conclusion

Chapter 7 Conclusion and Future Directions

  • 7.1 Evolving behavior with traffic signals
  • 7.2 Benchmarking datasets and baselines
  • 7.3 Learning efficiency
  • 7.4 Safety issue
  • 7.5 Transferring from simulation to reality
Bibliography

List of Figures
  • 2.1 Definitions of traffic movement and traffic signal phases.
  • 2.2 RL framework for traffic signal control.
  • 3.1 Reward is not a comprehensive measure to evaluate traffic light control performance. Both policies will lead to the same rewards. But policy #1 is more suitable than policy #2 in the real world.
  • 3.2 Case A and case B have the same environment except the traffic light phase.
  • 3.3 Model framework
  • 3.4 Q-network
  • 3.5 Memory palace structure
  • 3.6 Traffic surveillance cameras in Jinan, China
  • 3.7 Percentage of the time duration of learned policy for phase Green-WE (green light on W-E and E-W direction, while red light on N-S and S-N direction) in every 2000 seconds for different methods under configuration 4.
  • 3.8 Average arrival rate on two directions (WE and SN) and time duration ratio of two phases (Green-WE and Red-WE) from learned policy for Jingliu Road (WE) and Erhuanxi Auxiliary Road (SN) in Jinan on August 1st and August 7th, 2016.
  • 3.9 Detailed average arrival rate on two directions (dotted lines) and changes of two phases (dashed areas) in three periods of time for Jingliu Road (WE) and Erhuanxi Auxiliary Road (SN) in Jinan on August 1st, 2016. The X-axis of each figure indicates the time of day; the left Y-axis of each figure indicates the number of cars approaching the intersection every second; the right Y-axis of each figure indicates the phase over time.
  • 4.1 Performance of RL approaches is sensitive to reward and state. (a) A heuristic parameter tuning of reward function could result in different performances. (b) The method with a more complicated state (LIT[1] w/ neighbor) has a longer learning time but does not necessarily converge to a better result.
  • 4.2 Illustration of max pressure control in two cases. In Case A, the green signal is set in the North→South direction; in Case B, the green signal is set in the East→West direction.
  • 4.3 The transition of traffic movements.
  • 4.4 Real-world arterial network for the experiment.
  • 4.5 Convergence curve of average duration and our reward design (pressure). Pressure shows the same convergence trend with travel time.
  • 4.6 Average travel time of our method on heterogeneous intersections. (a) Different number of legs. (b) Different length of lanes. (c) Experiment results.
  • 4.7 Performance comparison under uniform unidirectional traffic, where the optimal solution is known (GreenWave). Only PressLight achieves the optimal solution.
  • 4.8 Offsets between intersections learnt by RL agents under uni-directional uniform traffic (700 vehicles/hour/lane on arterial)
  • 4.9 Space-time diagram with signal timing plan to illustrate the learned coordination strategy from real-world data on the arterial of Qingdao Road in the morning (around 8:30 a.m.) on August 6th.
  • 5.1 Illustration of index-based concatenation. Thick yellow lines are the arterials and thin grey lines are the side streets. With index-based concatenation, A and B's observations will be aligned as model inputs with a fixed order. These two inputs will confuse the model shared by A and B.
  • 5.2 Left: Framework of the proposed CoLight model. Right: variation of cooperation scope (light blue shadow, from one-hop to two-hop) and attention distribution (colored points, the redder, the more important) of the target intersection.
  • 5.3 Road networks for real-world datasets. Red polygons are the areas we select to model, blue dots are the traffic signals we control. Left: 196 intersections with uni-directional traffic; middle: 16 intersections with uni- & bi-directional traffic; right: 12 intersections with bi-directional traffic.
  • 5.4 Convergence speed of CoLight (red continuous curves) and the other 5 RL baselines (dashed curves) during training. CoLight starts with the best performance (Jumpstart), reaches the pre-defined performance the fastest (Time to Threshold), and ends with the optimal policy (Asymptotic). Curves are smoothed with a moving average of 5 points.
  • 5.5 The training time of different models for 100 episodes. CoLight is efficient across all the datasets. The bar for Individual RL on $D_{NewYork}$ is shadowed as its running time is far beyond the acceptable time.
  • 5.6 Performance of CoLight with respect to different numbers of neighbors ($|\mathcal{N}_i|$) on dataset $D_{Hangzhou}$ (left) and $D_{Jinan}$ (right). More neighbors ($|\mathcal{N}_i|\leq 5$) for cooperation bring better performance, but too many neighbors ($|\mathcal{N}_i|>5$) require more time (200 episodes or more) to learn.
  • 6.1 Illustration of a driving trajectory. In the real-world scenario, only part of the driving points can be observed and form a sparse driving trajectory (in red dots). Each driving point includes a driving state and an action of the vehicle at the observed time step. Best viewed in color.
  • 6.2 Proposed ImIn-GAIL Approach. The overall framework of ImIn-GAIL includes three components: generator, downsampler, and interpolation-discriminator. Best viewed in color.
  • 6.3 Proposed interpolation-discriminator network.
  • 6.4 Illustration of road networks. (a) and (b) are synthetic road networks, while (c) and (d) are real-world road networks.
  • 6.5 RMSE on time and position of our proposed method ImIn-GAIL under different levels of sparsity. As the expert trajectories become denser, a policy more similar to the expert policy is learned.
  • 6.6 The generated trajectory of a vehicle in the Ring scenario. Left: the initial positions of the vehicles. Vehicles can only be observed when they pass four locations A, B, C and D where cameras are installed. Right: the visualization of the trajectory of Vehicle 0. The x-axis is the timestep in seconds. The y-axis is the relative road distance in meters. Although vehicle 0 is only observed three times (red triangles), ImIn-GAIL (blue points) can imitate the position of the expert trajectory (grey points) more accurately than all other baselines. Better viewed in color.
List of Tables
  • 2.1 Representative deep RL-based traffic signal control methods. Due to page limits, we only put part of the investigated papers here.
  • 3.1 Notations
  • 3.2 Settings for our method
  • 3.3 Reward coefficients
  • 3.4 Configurations for synthetic traffic data
  • 3.5 Details of real-world traffic dataset
  • 3.6 Performance on configuration 1. Reward: the higher the better. Other measures: the lower the better. Same with the following tables.
  • 3.7 Performance on configuration 2
  • 3.8 Performance on configuration 3
  • 3.9 Performance on configuration 4
  • 3.10 Performances of different methods on real-world data. The number after ± means standard deviation. Reward: the higher the better. Other measures: the lower the better.
  • 4.1 Summary of notation.
  • 4.2 Configurations for synthetic traffic data
  • 4.3 Data statistics of real-world traffic dataset
  • 4.4 Performance comparison between all the methods in the arterial with 6 intersections w.r.t. average travel time (the lower the better). Top-down: conventional transportation methods, learning methods, and our proposed method.
  • 4.5 Detailed comparison of our proposed state and reward design and their effects w.r.t. average travel time (lower the better) under synthetic traffic data.
  • 4.6 Average travel time of different methods under arterials with a different number of intersections and network.
  • 5.1 Data statistics of real-world traffic dataset
  • 5.2 Performance on synthetic data and real-world data w.r.t average travel time. CoLight is the best.
  • 5.3 Performance of CoLight with respect to different numbers of attention heads ($H$) on dataset $Grid_{6\times 6}$. More types of attention ($H\leq 5$) enhance model efficiency, while too many ($H>5$) could distract the learning and deteriorate the overall performance.
  • 6.1 Features for a driving state
  • 6.2 Hyper-parameter settings for ImIn-GAIL
  • 6.3 Statistics of dense and sparse expert trajectory in different datasets
  • 6.4 Performance w.r.t Relative Mean Squared Error (RMSE) of time (in seconds) and position (in kilometers). All the measurements are conducted on dense trajectories. Lower the better. Our proposed method ImIn-GAIL achieves the best performance.
  • 6.5 RMSE on time and position of our proposed method ImIn-GAIL against baseline methods and their corresponding two-step variants. Baseline methods and ImIn-GAIL learn from sparse trajectories, while the two-step variants interpolate sparse trajectories first and are then trained on the interpolated data. ImIn-GAIL achieves the best performance in most cases.

Acknowledgements

I would like to express my sincere appreciation to my doctoral committee, Dr. Zhenhui (Jessie) Li, Dr. C. Lee Giles, Dr. Xiang Zhang, Dr. Vikash V. Gayah, and my Ph.D. program chair Dr. Mary Beth Rosson.
I would like to show my special appreciation to my PhD adviser, Dr. Zhenhui (Jessie) Li. Jessie patiently guided me through the research process and career development, including several open-armed, enlightening discussions on how to cope with work-life balance, pressure and priorities. Her supportive and critical attitude has made our fruitful collaboration possible. Her patience and kindness have helped me manage a smooth career development. Without her guidance, my PhD would surely not have been as successful.
I would also like to extend my thanks to Dr. Vikash Gayah from the Department of Civil and Environmental Engineering, whose insightful advice forms some of the foundations of our interdisciplinary research and who has been more than happy to help me become familiar with the background material.
I am also fortunate to have collaborated with several others throughout my PhD: Guanjie Zheng, Chacha Chen, Porter Jenkins, Zhengyao Yu, Kan Wu, Huaxiu Yao, Nan Xu, Chang Liu, Yuandong Wang. Without them, this research would not be possible.
The studies in this dissertation have been supported by NSF awards #1544455, #1652525, #1618448, and #1639150. The views and conclusions contained in the studies are those of the authors and should not be interpreted as representing any funding agencies. I thank them for their generous support in making this research happen.
During my several years of PhD program, I was lucky to have been surrounded by wonderful colleagues and friends. Thank you all for sharing Friday game nights with me, for the road trips and getaways on weekends, and for the summer twilight barbecues. All of you made my time at Penn State a wonderful experience.
At last, a very special thank you to my parents Shaozhu and Yuling, to my whole family, for always supporting me and for always encouraging me to pursue my passion.

Chapter 1 Introduction

Traffic congestion is a growing problem that continues to plague urban areas with negative outcomes for both the traveling public and society as a whole. These negative outcomes will only grow over time as more people flock to urban areas. In 2014, traffic congestion cost Americans over $160 billion in lost productivity and wasted over 3.1 billion gallons of fuel [2]. Traffic congestion was also attributed to over 56 billion pounds of harmful CO2 emissions in 2011 [3]. In the European Union, the cost of traffic congestion was equivalent to 1% of the entire GDP [4]. Mitigating congestion would have significant economic, environmental and societal benefits. Signalized intersections are one of the most prevalent bottleneck types in urban environments, and thus traffic signal control tends to play a large role in urban traffic management.

1.1 Why do we need a more intelligent traffic signal control

The majority of small and big cities even in industrialized countries are still operating old-fashioned fixed-time signal control strategies, often even poorly optimized or maintained. Even when modern traffic-responsive control systems are installed (e.g., SCATS [5] and SCOOT [6, 7]), the employed control strategies are sometimes naive, mainly based on manually designed traffic signal plans.
On the other hand, nowadays various kinds of traffic data can be collected to enrich the information about traffic conditions. Systems like SCATS or SCOOT mainly rely on loop sensor data to choose the signal plans. However, loop sensor data only counts vehicles when they pass the sensor, while increasing amounts of traffic data are now being collected from various sources such as GPS-equipped vehicles, navigation systems, and traffic surveillance cameras. How to use rich traffic data to better optimize our traffic signal control systems has attracted more and more attention from academia, government and industry.

1.2 Why do we use reinforcement learning for traffic signal control

In the transportation field, traffic signal control is one of the most fundamental research questions [8]. The typical approach that transportation researchers take is to seek an optimization solution under certain assumptions about traffic models [8]. However, most of the works focus only on automobiles, and the assumptions they make are simplified and do not necessarily hold true in the field. Real traffic behaves in a complicated way, affected by many factors such as drivers' preferences, interactions with vulnerable road users (e.g., pedestrians, cyclists, etc.), weather and road conditions. These features can hardly be modelled accurately for optimizing traffic signal control.
On the other hand, machine learning techniques learn directly from the observed data without making assumptions about the data model. However, the traffic signal control problem is not a typical machine learning problem with fixed data sets for training and testing. Real-world traffic is constantly changing, and the execution of traffic lights also changes the traffic. Therefore, in the case of ever-changing data samples, we need to learn from the feedback of the environment. This trial-and-error idea is the essence of RL. RL attempts different traffic signal control strategies based on the current traffic environment, and the model learns and adjusts strategies based on environmental feedback.

1.3 Why RL for traffic signal control is challenging?

1.3.1 Formulation of RL agent

A key question for RL is how to define reward and state. In existing studies [9, 10, 11], a typical reward definition for traffic signal control is a weighted linear combination of several components such as queue length, waiting time, number of switches in the traffic signal, and sum of delay. The state includes components such as queue length, number of cars, waiting time, and current traffic signal. In recent work [10, 11], images of vehicles' positions on the roads are also considered in the state.
However, all of the existing work takes an ad-hoc approach to defining reward and state. Such an ad-hoc approach causes several problems that hinder the application of RL in the real world. First, the engineering details in formulating the reward and state could significantly affect the results. For example, if the reward is defined as a weighted linear combination of several terms, the weights on each term are tricky to set, and a minor difference in the weight setting could lead to dramatically different results. Second, the state representation could be in a high-dimensional space, especially when using traffic images as part of the state representation [10, 11]. Such a high-dimensional state representation needs many more training data samples to learn from and may not even converge. Third, there is no connection between existing RL approaches and transportation methods. Without the support of transportation theory, it is highly risky to apply these purely data-driven RL-based approaches in the real physical world.
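To make the weight-sensitivity issue concrete, here is a minimal Python sketch of such a weighted linear combination reward; the function name, the traffic measurements, and the weight values are hypothetical illustrations, not the reward used in any of the cited studies.

```python
# Hypothetical example of an ad-hoc weighted-sum reward for one intersection.
# The weights below are illustrative only; existing studies tune them by hand.

def weighted_reward(queue_length, waiting_time, num_switches, delay,
                    weights=(-0.25, -0.25, -5.0, -0.25)):
    """Linear combination of per-step traffic measurements (higher is better)."""
    w_q, w_w, w_s, w_d = weights
    return (w_q * queue_length + w_w * waiting_time
            + w_s * num_switches + w_d * delay)

# The same traffic condition scored under two slightly different weight settings
# yields different values, which is exactly the tuning problem described above.
print(weighted_reward(queue_length=12, waiting_time=30.0, num_switches=1, delay=8.0))
print(weighted_reward(queue_length=12, waiting_time=30.0, num_switches=1, delay=8.0,
                      weights=(-0.25, -0.25, -1.0, -0.25)))
```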

1.3.2 Learning cost

While learning from trial and error is the key idea in RL, the learning cost of RL is prohibitive for real-world applications. Although RL algorithms are very useful for learning good solutions when the model of the environment is unknown in advance [12, 13], these solutions may only be achieved after an extensive number of trials and errors, which is usually very time consuming. While existing RL methods for games (e.g., Go or Atari games) yield impressive results in simulated environments, the cost of error in traffic signal control is critical, even fatal, in the real world. Therefore, how to learn efficiently (e.g., learning from limited data samples, sampling training data in an adaptive way, transferring learned knowledge) is a critical question for the application of RL in traffic signal control.

1.3.3 Simulation

Reinforcement learning (RL) has shown great success in a series of artificial intelligence (AI) domains such as the game of Go [14]. Despite its huge success in AI domains, RL has not yet shown the same degree of success in real-world applications [15]. These applications could involve control systems such as drones [16] and systems that interact with people such as traffic signal control [11]. As people analyze the challenges in all these scenarios, a recurring theme emerges: there is rarely a good simulator [15].
1.4 Previous Studies
There is already some work investigating reinforcement learning methods for traffic signal control. These studies focus on different scales of traffic signal control, including isolated intersection control [4-6], arterial control [7], and region control [8-10]. Most of the previous studies use different kinds of features in the state and reward design, while we propose a formulation of reinforcement learning in connection with transportation theory. Furthermore, we propose to investigate how to learn efficiently to reduce the learning cost in the real-world setting. We will discuss more of the literature in Chapter 2, Chapter 3 and Chapter 4.

1.5 Proposed Tasks

In this dissertation, we propose to use RL for traffic signal control problems in a way that combines transportation-guided RL with efficient learning. The first part will help the formulation of RL towards a reasonable state and reward design. The connection between RL approaches and transportation theory can help RL optimize towards a correct objective and condense the state space for convergence. The second part will help reduce the learning cost of RL and enable the application of RL in the real world. The third part will try to tackle the real-world application issue by building a realistic simulator. Specifically, we elaborate RL-based traffic signal control methods with real-world implications around the following topics:
  • Formulation of RL with theoretical justification. How should we formulate the RL agent, i.e., the reward and state definition, for the traffic signal control problem? Is it always beneficial to use complex reward functions and state representations? In [7, 1, 17], I proposed to simplify the intricate designs in the current literature and use a concise reward and state design, with theoretical proof that the concise design optimizes towards the global optimum, for both single intersection control and multi-intersection control scenarios.
  • Optimization of the learning process in RL. Is a vanilla deep neural network the solution to the traffic signal control problem with RL? In [11], I tackle the sample imbalance problem with a phase-gated network that emphasizes certain features. In [18], we design a network to model the priority between actions and learn changes of traffic flow such as flips and rotations. Coordination could benefit signal control in multi-intersection scenarios. In [19], I contributed a framework that enables agents to communicate with each other about their observations and behave as a group. Though each agent has limited capabilities and visibility of the world, in this work agents were able to cooperate with multi-hop neighboring intersections and learn the dynamic interactions between the hidden states of neighboring agents. Later, in [20], we investigate the possibility of using RL to control traffic signals at the scale of a city and test our RL methods in a simulator on the road network of Manhattan, New York, with 2510 traffic signals. This is the first time an RL-based method can operate at such a scale in traffic signal control.
  • Bridging the Gap from Simulation to Reality. Despite its massive success in artificial domains, RL has not yet shown the same degree of success in real-world applications. These applications could involve control systems such as drones, software systems such as data centers, and systems that interact with people such as transportation systems. Currently, most RL methods mainly conduct experiments in a simulator, since the simulator can generate data in a cheaper and faster way than real experimentation. Discrepancies between simulation and reality confine the application of learned policies in the real world.

1.6 Overview of this Dissertation

In this chapter, we discussed the motivation, challenges and proposed tasks of using reinforcement learning for traffic signal control. We will elaborate more on the details of the work that has applied reinforcement learning in several scenarios. Chapter 2 covers the notation, background and necessary literature for traffic signal control. The basic formulation of RL-based traffic signal control and a further theory-guided RL method are discussed in detail in Chapters 3 and 4. Chapter 5 discusses how to improve the learning efficiency with neighborhood information for coordinated intersections. Chapter 6 tackles the challenge of real-world applications of RL by building a realistic simulator, and Chapter 7 briefly discusses potential future work.

Chapter 2 Notation, Background and Literature

This chapter covers notation, background information and necessary literature that will be discussed throughout the rest of this dissertation. The main part of this chapter is adapted from our surveys [21] and [22].

2.1 Preliminaries of Traffic Signal Control Problem

2.1.1 Term Definition

Terms on road structure and traffic movement:
  • Approach: A roadway meeting at an intersection is referred to as an approach. At any general intersection, there are two kinds of approaches: incoming approaches and outgoing approaches. An incoming approach is one on which cars can enter the intersection; an outgoing approach is one on which cars can leave the intersection. Figure 2.1(a) shows a typical intersection with four incoming approaches and outgoing approaches. The southbound incoming approach is denoted in this figure as the approach on the north side in which vehicles are traveling in the southbound direction.
  • Lane: An approach consists of a set of lanes. Similar to approach definition, there are two kinds of lanes: incoming lanes and outgoing lanes (also known as approaching/entering lane and receiving/exiting lane in some references [23, 24]).
  • Traffic movement: A traffic movement refers to vehicles moving from an incoming approach to an outgoing approach, denoted as $(r_i \rightarrow r_o)$, where $r_i$ and $r_o$ are the incoming approach and the outgoing approach, respectively. A traffic movement is generally categorized as left turn, through, or right turn.
Terms on traffic signal:
  • Movement signal: A movement signal is defined on the traffic movement, with green signal indicating the corresponding movement is allowed and red signal indicating the movement is prohibited. For the four-leg intersection shown in Figure 2.1(a), the right-turn traffic can pass regardless of the signal, and there are eight movement signals in use, as shown in Figure 2.1(b).
  • Phase: A phase is a combination of movement signals. Figure 2.1(c) shows the conflict matrix of the combinations of two movement signals for the example in Figure 2.1(a) and Figure 2.1(b). A grey cell indicates that the corresponding two movements conflict with each other, i.e., they cannot be set to 'green' at the same time (e.g., signals #1 and #2). A white cell indicates non-conflicting movement signals. All the non-conflicting signals generate eight valid paired-signal phases (letters 'A' to 'H' in Figure 2.1(c)) and eight single-signal phases (the diagonal cells in the conflict matrix). Here we letter the paired-signal phases only, because in an isolated intersection it is always more efficient to use paired-signal phases. When considering multiple intersections, single-signal phases might be necessary because of potential spill back.
Figure 2.1: Definitions of traffic movement and traffic signal phases.
  • Phase sequence: A phase sequence is a sequence of phases which defines a set of phases and their order of changes.
  • Signal plan: A signal plan for a single intersection is a sequence of phases and their corresponding starting times. Here we denote a signal plan as $(p_1,t_1)(p_2,t_2)\dots(p_i,t_i)\dots$, where $p_i$ and $t_i$ stand for a phase and its starting time.
  • Cycle-based signal plan: A cycle-based signal plan is a kind of signal plan where the sequence of phases operates in a cyclic order, which can be denoted as $(p_1,t_1^1)(p_2,t_2^1)\dots(p_N,t_N^1)(p_1,t_1^2)(p_2,t_2^2)\dots(p_N,t_N^2)\dots$, where $p_1,p_2,\dots,p_N$ is the repeated phase sequence and $t_i^j$ is the starting time of phase $p_i$ in the $j$-th cycle. Specifically, $C^j=t_1^{j+1}-t_1^j$ is the cycle length of the $j$-th phase cycle, and $\{\frac{t_2^j-t_1^j}{C^j},\dots,\frac{t_N^j-t_{N-1}^j}{C^j}\}$ is the phase split of the $j$-th phase cycle (see the sketch after this list). Existing traffic signal control methods usually repeat a similar phase sequence throughout the day.
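As a concrete reading of the notation above, the following minimal Python sketch (with an assumed toy signal plan, not from any cited system) stores a cycle-based signal plan as (phase, starting time) pairs and recovers the cycle length $C^j$ and the phase split of one cycle.

```python
# A cycle-based signal plan as (phase, start_time) pairs; values are illustrative.
plan = [("A", 0), ("B", 30), ("C", 50), ("D", 70),       # cycle 1
        ("A", 90), ("B", 120), ("C", 140), ("D", 160),   # cycle 2
        ("A", 180)]                                       # start of cycle 3

N = 4  # number of phases in the repeated sequence

def cycle_length_and_split(plan, j, N):
    """Cycle length C^j and phase split of the j-th cycle (j starts at 0)."""
    cycle = plan[j * N:(j + 1) * N + 1]   # phases of cycle j plus the next cycle's start
    starts = [t for _, t in cycle]
    C = starts[-1] - starts[0]            # C^j = t_1^{j+1} - t_1^j
    split = [(starts[i + 1] - starts[i]) / C for i in range(N)]
    return C, split

print(cycle_length_and_split(plan, 0, N))  # (90, [0.33..., 0.22..., 0.22..., 0.22...])
```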

2.1.2 Objective

The objective of traffic signal control is to facilitate safe and efficient movement of vehicles at the intersection. Safety is achieved by separating conflicting movements in time and is not considered in more detail here. Various measures have been proposed to quantify efficiency of the intersection from different perspectives:
  • Travel time. In traffic signal control, the travel time of a vehicle is defined as the time difference between the time a vehicle enters the system and the time it leaves the system. One of the most common goals is to minimize the average travel time of vehicles in the network.
  • Queue length. The queue length of the road network is the number of queuing vehicles in the road network.
  • Number of stops. The number of stops of a vehicle is the total number of times that the vehicle has stopped.
  • Throughput. The throughput is the number of vehicles that have completed their trips in the road network during a given period. (See the sketch after this list for how these measures can be computed from vehicle records.)
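Below is a minimal sketch of how the travel-time and throughput measures can be computed from logged vehicle records; the trip records and the evaluation horizon are hypothetical values used for illustration only.

```python
# Hypothetical per-vehicle records: (enter_time, exit_time); exit_time is None
# if the vehicle has not completed its trip within the evaluation horizon.
trips = [(0, 120), (15, 160), (30, None), (40, 155), (60, None)]
horizon = 200  # evaluation period length in seconds

completed = [(t_in, t_out) for t_in, t_out in trips if t_out is not None]

# Average travel time over vehicles that completed their trips.
avg_travel_time = sum(t_out - t_in for t_in, t_out in completed) / len(completed)

# Throughput: number of vehicles finishing their trips during the period.
throughput = len(completed)

print(avg_travel_time, throughput)  # approximately 126.67 and 3
```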

2.1.3 Special Considerations

In practice, additional attention should be paid to the following aspects:
  1. Yellow and all-red time. A yellow signal is usually set as a transition from a green signal to a red one. Following the yellow, there is an all-red period during which all the signals in an intersection are set to red. The yellow and all-red time, which can last from 3 to 6 seconds, allow vehicles to stop safely or pass the intersection before vehicles in conflicting traffic movements are given a green signal.
  2. Minimum green time. Usually, a minimum green signal time is required to ensure pedestrians moving during a particular phase can safely pass through the intersection.
  3. Left turn phase. Usually, a left-turn phase is added when the left-turn volume is above a certain threshold.

2.2 Background of Reinforcement Learning

In this section, we first describe the reinforcement learning framework which constitutes the foundation of all the methods presented in this dissertation. We then provide background on conventional RL-based traffic signal control, including the problem of controlling a single intersection and multiple intersections.

2.2.1 Single Agent Reinforcement learning

Usually a single agent RL problem is modeled as a Markov Decision Process represented by $\langle\mathcal{S},\mathcal{A},P,R,\gamma\rangle$, whose components are defined as follows:
  • Set of state representations $\mathcal{S}$: At time step $t$, the agent observes state $s^t\in\mathcal{S}$.
  • Set of actions $\mathcal{A}$ and state transition function $P$: At time step $t$, the agent takes an action $a^t\in\mathcal{A}$, which induces a transition in the environment according to the state transition function $P(s^{t+1}|s^t,a^t):\mathcal{S}\times\mathcal{A}\rightarrow\mathcal{S}$.
  • Reward function $R$: At time step $t$, the agent obtains a reward $r^t$ from the reward function $R(s^t,a^t):\mathcal{S}\times\mathcal{A}\rightarrow\mathbb{R}$.
  • Discount factor $\gamma$: The goal of an agent is to find a policy that maximizes the expected return, which is the discounted sum of rewards $G^t:=\sum_{i=0}^{\infty}\gamma^{i}r^{t+i}$, where the discount factor $\gamma\in[0,1]$ controls the importance of immediate rewards versus future rewards.
Here, we only consider continuing agent-environment interactions, which do not end with terminal states but go on continually without limit.
Solving a reinforcement learning task means, roughly, finding an optimal policy $\pi^{*}$ that maximizes the expected return. While the agent only receives reward about its immediate, one-step performance, one way to find the optimal policy $\pi^{*}$ is by following an optimal action-value function or state-value function. The action-value function (Q-function) of a policy $\pi$, $Q^{\pi}:\mathcal{S}\times\mathcal{A}\rightarrow\mathbb{R}$, is the expected return of a state-action pair, $Q^{\pi}(s,a)=\mathbb{E}_{\pi}[G^{t}|s^{t}=s,a^{t}=a]$. The state-value function of a policy $\pi$, $V^{\pi}:\mathcal{S}\rightarrow\mathbb{R}$, is the expected return of a state, $V^{\pi}(s)=\mathbb{E}_{\pi}[G^{t}|s^{t}=s]$.
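To ground these definitions, the following sketch shows a generic tabular Q-learning loop that estimates the action-value function from sampled transitions and acts ε-greedily. It is a textbook-style illustration under an assumed environment interface (env.reset/env.step returning a state and a reward), not the learning algorithm used later in this dissertation.

```python
import random
from collections import defaultdict

# Minimal tabular Q-learning for a generic MDP <S, A, P, R, gamma>.
# The environment interface (env.reset/env.step) is assumed for illustration.

def q_learning(env, actions, episodes=100, alpha=0.1, gamma=0.9, eps=0.1):
    Q = defaultdict(float)  # Q[(s, a)] approximates the action-value function
    for _ in range(episodes):
        s = env.reset()
        for _ in range(1000):  # truncate the continuing task for training
            # epsilon-greedy: explore with probability eps, otherwise exploit
            if random.random() < eps:
                a = random.choice(actions)
            else:
                a = max(actions, key=lambda x: Q[(s, x)])
            s_next, r = env.step(a)
            # one-step temporal-difference update toward r + gamma * max_a' Q(s', a')
            target = r + gamma * max(Q[(s_next, x)] for x in actions)
            Q[(s, a)] += alpha * (target - Q[(s, a)])
            s = s_next
    return Q
```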

2.2.2 Problem setting

We now introduce the general setting of the RL-based traffic signal control problem, in which the traffic signals are controlled by one or several RL agents. Figure 2.2 illustrates the basic idea of the RL framework in the single traffic signal control problem. The environment is the traffic condition on the roads, and the agent controls the traffic signal. At each time step $t$, a description of the environment (e.g., signal phase, waiting time of cars, queue length of cars, and positions of cars) is generated as the state $s^t$. The agent predicts the next action $a^t$ to take that maximizes the expected return, where the action could be changing to a certain phase in the single intersection scenario. The action $a^t$ is executed in the environment, and a reward $r^t$ is generated, where the reward could be defined on the traffic conditions of the intersection. Usually, in the decision process, an agent combines the exploitation of the learned policy and the exploration of new policies.
Figure 2.2: RL framework for traffic signal control.
In the multi-intersection traffic signal control problem, there are $N$ traffic signals in the environment, controlled by one or several agents. The goal of the agent(s) is to learn the optimal policies that optimize the traffic condition of the whole environment. At each timestep $t$, each agent $i$ observes part of the environment as its observation $o_i^t$, and the agents predict the next actions $\mathbf{a}^t=(a_1^t,\ldots,a_N^t)$ to take. The actions are executed in the environment, and rewards $r_i^t$ are generated, where the reward could be defined on the level of individual intersections or of a group of intersections within the environment. We refer readers interested in more detailed problem settings to [25].
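The interaction just described can be summarized as a simple control loop. The env and agent objects below are placeholders assumed for illustration; they do not correspond to any particular simulator or agent implementation.

```python
# Schematic interaction loop for single-intersection RL control.
# `env` and `agent` are hypothetical objects exposing the methods used below.

def run_episode(env, agent, horizon=3600):
    state = env.reset()                        # e.g., phase, queue lengths, positions
    for t in range(horizon):
        action = agent.act(state)              # e.g., which phase to switch to
        next_state, reward = env.step(action)  # execute the action in the environment
        agent.remember(state, action, reward, next_state)
        agent.learn()                          # update the policy from collected samples
        state = next_state
```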

2.3 Literature

In this section, we introduce three major aspects investigated in recent RL-based traffic signal control literature: agent formulation, policy learning approach and coordination strategy.

2.3.1 Agent formulation

A key question for RL is how to formulate the RL agent, i.e., the reward, state, and action definition. In this section, we focus on the advances in the reward, state, and action design in recent deep RL-based methods, and refer readers interested in more detailed definitions to [26, 27, 28].
2.3.1.1 Reward
The choice of reward reflects the learning objective of an RL agent. In the traffic signal control problem, although the ultimate objective is to minimize the travel time of all vehicles, travel time is hard to use as a valid reward in RL: because the travel time of a vehicle is affected by multiple actions from traffic signals and vehicle movements, travel time as a reward would be delayed and ineffective in indicating the goodness of a signal's action. Therefore, the existing literature often uses a surrogate reward that can be effectively measured after an action, considering factors like average queue length, average waiting time, average speed or throughput [11, 29]. The authors in [10] also take the frequency of signal changes and the number of emergency stops into the reward.

2.3.1.2 State

At each time step, the agent receives some quantitative description of the environment as the state to decide its action. Various kinds of elements have been proposed to describe the environment state, such as queue length, waiting time, speed, phase, etc. These elements can be defined on the lane level or road segment level, and then concatenated into a vector. In earlier work using RL for traffic signal control, people needed to discretize the state space and use a simple tabular or linear model to approximate the state functions for efficiency [30, 31, 32]. However, the real-world state space is usually huge, which confines traditional RL methods in terms of memory or performance. With advances in deep learning, deep RL methods have been proposed to handle large state spaces with effective function approximators. Recent studies propose to use images [33, 34, 35, 36, 37, 38, 39, 40, 10, 11] to represent the state, where the positions of vehicles are extracted as an image representation.
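For concreteness, a lane-level state vector of the kind described above could be assembled as in the following sketch; the lane counts, the phase encoding, and the intersection layout are illustrative assumptions.

```python
import numpy as np

# Hypothetical per-lane measurements for a 4-approach, 2-lane-per-approach intersection.
queue_length = [3, 0, 5, 2, 1, 0, 4, 2]   # queuing vehicles per incoming lane
num_vehicles = [6, 1, 8, 3, 2, 1, 6, 4]   # all vehicles per incoming lane
current_phase = 2                         # index of the active phase
num_phases = 8

# One-hot encode the phase and concatenate lane-level features into one vector.
phase_one_hot = np.eye(num_phases)[current_phase]
state = np.concatenate([queue_length, num_vehicles, phase_one_hot]).astype(np.float32)
print(state.shape)  # (24,)
```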
2.3.1.3 Action scheme
There are different types of action definitions for an RL agent in traffic signal control: (1) set the current phase duration [42, 43], (2) set the ratio of the phase duration over a pre-defined total cycle duration [32, 44], (3) change to the next phase in a pre-defined cyclic phase sequence [10, 11, 27, 45], and (4) choose the phase to change to among a set of phases [34, 39, 46, 47, 48, 1]. The choice of action scheme is closely related to specific settings of traffic signals. For example, if the phase sequence is required to be cyclic, then the first three action schemes should be considered, while "choosing the phase to change to among a set of phases" can generate flexible phase sequences.
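The difference between the cyclic scheme (3) and the flexible scheme (4) can be made explicit with a small sketch; the phase names below are illustrative, not a standard set.

```python
PHASES = ["NS-through", "NS-left", "EW-through", "EW-left"]  # illustrative phase set

def next_phase_cyclic(current_idx, keep_or_change):
    """Scheme (3): a binary action either keeps the phase or advances the fixed cycle."""
    return current_idx if keep_or_change == "keep" else (current_idx + 1) % len(PHASES)

def next_phase_flexible(chosen_idx):
    """Scheme (4): the action directly selects any phase, so the sequence is acyclic."""
    return chosen_idx

print(PHASES[next_phase_cyclic(0, "change")])   # NS-left
print(PHASES[next_phase_flexible(3)])           # EW-left
```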

2.3.2 Policy Learning

RL methods can be categorized in different ways. [49, 50] divide current RL methods into model-based methods and model-free methods. Model-based methods try to model the transition probability among states explicitly, while model-free methods directly estimate the reward for state-action pairs and choose the action based on this estimate. In the context of traffic signal control, the transition between states is primarily influenced by people's driving behaviors, which are diverse and hard to predict. Therefore, currently, most RL-based methods for traffic signal control are model-free methods. In this section, we adopt the categorization in [51]: value-based methods and policy-based methods.

2.3.2.1 Value-based methods

Value-based methods approximate the state-value function or state-action value function (i.e., how rewarding each state or state-action pair is), and the policy is implicitly obtained from the learned value function. Most of the RL-based traffic signal control methods use DQN [52], where the model is parameterized by neural networks and takes the state representation as input [10, 53]. In DQN, discrete actions are required, as the model directly outputs each action's value given a state, which is especially suitable for action schemes (3) and (4) mentioned in Section 2.3.1.3.
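A minimal Q-network of the kind used in DQN-style traffic signal control might look as follows. This is a generic PyTorch sketch with assumed state and action dimensions, not the architecture of any specific method discussed here.

```python
import torch
import torch.nn as nn

class QNetwork(nn.Module):
    """Maps a state vector to one Q-value per candidate phase (discrete actions)."""
    def __init__(self, state_dim=24, num_phases=8, hidden=64):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(state_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
            nn.Linear(hidden, num_phases),   # one output per action (phase)
        )

    def forward(self, state):
        return self.net(state)

q_net = QNetwork()
state = torch.zeros(1, 24)            # illustrative state vector
action = q_net(state).argmax(dim=1)   # greedy phase choice
```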
2.3.2.2 Policy-based methods
Policy-based methods directly update the policy parameters (e.g., a vector of probabilities for conducting actions under a specific state) in the direction that maximizes a predefined objective (e.g., the average expected return). The advantage of policy-based methods is that they do not require the actions to be discrete like DQN. Also, they can learn a stochastic policy and keep exploring potentially more rewarding actions. To stabilize the training process, the actor-critic framework is widely adopted. It utilizes the strengths of both value-based and policy-based methods, with an actor that controls how the agent behaves (policy-based) and a critic that measures how good the conducted action is (value-based). In the traffic signal control problem, [44] uses DDPG [54] to learn a deterministic policy which directly maps states to actions, while [34, 42, 55] learn stochastic policies that map states to action probability distributions, all of which have shown excellent performance in traffic signal control problems. To further improve convergence speed for RL agents, [56] proposed a time-dependent baseline to reduce the variance of policy gradient updates and specifically avoid traffic jams.
In the above-mentioned methods, including both value-based and policy-based methods, deep neural networks are used to approximate the value functions. Most of the literature uses vanilla neural networks with their corresponding strengths. For example, Convolutional Neural Networks (CNNs) are used when the state representation contains an image representation [10, 38, 39, 40, 33, 34, 35, 36]; Recurrent Neural Networks (RNNs) are used to capture the temporal dependency of historical states [57].
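For contrast with the value-based formulation, the sketch below shows a minimal REINFORCE-style policy-gradient update; the network size, the action space, and the way returns are supplied are assumptions made for illustration.

```python
import torch
import torch.nn as nn

policy = nn.Sequential(nn.Linear(24, 64), nn.ReLU(), nn.Linear(64, 8))  # logits over phases
optimizer = torch.optim.Adam(policy.parameters(), lr=1e-3)

def reinforce_update(states, actions, returns):
    """One REINFORCE step: increase log-probability of actions weighted by returns.

    states: float tensor (T, 24); actions: long tensor (T,); returns: float tensor (T,).
    """
    logits = policy(states)                                   # (T, num_phases)
    log_probs = torch.log_softmax(logits, dim=1)
    chosen = log_probs.gather(1, actions.unsqueeze(1)).squeeze(1)
    loss = -(chosen * returns).mean()                         # gradient ascent on expected return
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```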
2.3.3 Coordination

Coordination could benefit signal control in multi-intersection scenarios. Since recent advances in RL have improved the performance of isolated traffic signal control, efforts have been made to design coordination strategies for multi-agent RL (MARL) agents. The literature [60] categorizes MARL into two classes: joint action learners and independent learners. Here we extend this categorization for the traffic signal control problem.
2.3.3.1 Joint action learners
A straightforward solution is to use a single global agent to control all the intersections [31]. It directly takes the state as input and learns to set the joint actions of all intersections at the same time. However, these methods can suffer from the curse of dimensionality, which encompasses the exponential growth of the state-action space in the number of state and action dimensions.
| Citation | Method | Cooperation | Road net (# signals) |
| :---: | :---: | :---: | :---: |
| [46] | Value-based | With communication | Synthetic (5) |
| | Policy-based | Without communication | Real (50) |
| | Policy-based | Without communication | Real (43) |
| | Value-based | Without communication | Real (2510) |
| | Policy-based | Joint action | Real (30) |
| | Value-based | - | Synthetic (1) |
| | Value-based | - | Synthetic (1) |
| | Value-based | Without communication | Synthetic (9) |
| | Both studied | - | Synthetic (1) |
| | Value-based | With communication | Synthetic (6) |
| | Value-based | Without communication | Synthetic (4) |
| | Both studied | Single global | Synthetic (5) |
| | Policy-based | - | Real (1) |
| | Value-based | Joint action | Synthetic (4) |
| | Value-based | With communication | Real (4) |
| | Value-based | - | Synthetic (1) |
| | Value-based | Without communication | Real (16) |
| | Value-based | With communication | Real (196) |
| | Value-based | Joint action | Synthetic (36) |
| | Value-based | Without communication | Real (16) |
| | Value-based | Without communication | Real (5) |

* Traffic with an arrival rate of less than 500 vehicles/hour/lane is considered light traffic in this survey, otherwise it is considered heavy.
* 1. Synthetic light uniform; 2. Synthetic light dynamic; 3. Synthetic heavy uniform; 4. Synthetic heavy dynamic; 5. Real-world data
Table 2.1: Representative deep RL-based traffic signal control methods. Due to page limits, we only put part of the investigated papers here.
Joint action modeling methods explicitly learn to model the joint action value of multiple agents, $Q(o_1,\ldots,o_N,\mathbf{a})$. The joint action space grows with the number of agents to model. To alleviate this challenge, [10] factorizes the global Q-function as a linear combination of local subproblems, extending [9] using the max-plus [61] algorithm: $\hat{Q}(o_1,\ldots,o_N,\mathbf{a})=\Sigma_{i,j}Q_{i,j}(o_i,o_j,a_i,a_j)$, where $i$ and $j$ correspond to the indices of neighboring agents. In other works, [48, 59, 62] regard the joint Q-value as a weighted sum of local Q-values, $\hat{Q}(o_1,\ldots,o_N,\mathbf{a})=\Sigma_{i,j}w_{i,j}Q_{i,j}(o_i,o_j,a_i,a_j)$, where $w_{i,j}$ are pre-defined weights. They attempt to ensure that individual agents consider other agents' learning processes by adding a shaping term to the loss function of each individual agent's learning process, minimizing the difference between the weighted sum of individual Q-values and the global Q-value.
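The weighted-sum factorization can be written out as a short sketch; the neighbor pairs, weights, and local Q-function below are illustrative placeholders rather than any published model.

```python
# Joint Q-value as a weighted sum of pairwise local Q-values over neighboring agents.
# w[(i, j)] are pre-defined weights; Q_local stands in for the learned local models.

def joint_q(observations, actions, neighbor_pairs, w, Q_local):
    """Q_hat(o_1..o_N, a) = sum_{(i,j)} w_{i,j} * Q_{i,j}(o_i, o_j, a_i, a_j)."""
    return sum(w[(i, j)] * Q_local(observations[i], observations[j],
                                   actions[i], actions[j])
               for (i, j) in neighbor_pairs)

# Example with two intersections and a trivial local Q-value.
obs, acts = [1.0, 2.0], [0, 1]
print(joint_q(obs, acts, [(0, 1)], {(0, 1): 0.5},
              lambda oi, oj, ai, aj: oi + oj + ai + aj))  # 0.5 * (1+2+0+1) = 2.0
```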
2.3.3.2 Independent learners
There is also a line of studies that use individual RL (IRL) agents to control the traffic signals, where each RL agent controls an intersection. Unlike joint action learning methods, each agent learns its control policy without knowing the reward signal of other agents.
IRL without communication methods treat each intersection individually: each agent observes its own local environment and does not use explicit communication to resolve conflicts [1, 27, 39, 40, 44, 45, 63]. In some simple scenarios like arterial networks, this approach has performed well with the formation of several mini green waves. However, when the environment becomes complicated, non-stationary impacts from neighboring agents are brought into the environment, and the learning process usually cannot converge to stationary policies if there are no communication or coordination mechanisms among agents [64].
IRL with communication methods enable agents to communicate with each other about their observations and behave as a group, rather than as a collection of individuals, in complex tasks where the environment is dynamic and each agent has limited capabilities and visibility of the world [65]. Typical methods add a neighbor's traffic conditions [66] or past actions [67] directly into the observation of the ego agent, rather than using only the local traffic condition of the ego agent. In these methods, all the agents for different intersections share one learning model, which requires consistent indexing of neighboring intersections. [47] attempts to remove this requirement by utilizing the road network structure with a Graph Convolutional Network [68] to cooperate with multi-hop nearby intersections. [47] models the influence of neighboring agents with the fixed adjacency matrix defined in the Graph Convolutional Network, which reflects the assumption that the influences between neighbors are static. In other work, [57] proposes to use Graph Attention Networks [69] to learn the dynamic interactions between the hidden states of neighboring agents and the ego agent. It should be pointed out that there is a strong connection between the methods that employ max-plus [61] to learn joint actions in Section 2.3.3.1 and the methods that use Graph Convolutional Networks to learn the communication: both of them can be seen as learning message passing on a graph, where the former pass rewards and the latter pass state observations.
2.3.3.3 Sizes of Road network
At a coarse scale, a road network is a directed graph with nodes and edges representing intersections and roads, respectively. A real-world road network can be more complicated than a synthetic network in its road properties (e.g., the number of lanes, the speed limit of every lane), intersection structures and signal phase settings. Among all the road network properties, the number of traffic signals in the network largely influences the experimental results, because the scale of exploration for RL agents increases with the scale of the road network. Currently, most of the work still conducts experiments on relatively small road networks compared to the scale of a city, which could include thousands of traffic signals. Aslani et al. [42, 43] test their method in a real-world road network with 50 signals. In [19], a district with 196 signals is investigated. One of the most recent works [20] tests its methods on the real road network of Manhattan, New York, with 2510 traffic signals.

2.4 Conclusion

In the review of existing work, we found that the formulation of RL lacks theoretical justification, i.e., different studies propose ad-hoc designs of the reward and state, and the number of coordinated traffic signals is still small. Therefore, in the following chapters, I provide several methods to tackle these problems: I first show a basic formulation of RL-based traffic signal control in Chapter 3 and discuss theory-guided RL further in Chapter 4. Chapter 5 discusses how to improve learning efficiency with neighborhood information for coordinated intersections. Chapter 6 tackles the challenge of real-world applications of RL by building a realistic simulator.

Chapter 3 Basic Formulation for Traffic Signal Control

This part shows a basic formulation of RL-based traffic signal control and corresponds to our paper "IntelliLight: A Reinforcement Learning Approach for Intelligent Traffic Light Control" [11].

3.1 Introduction

Traffic congestion has become increasingly costly. For example, traffic congestion costs Americans $124 billion a year, according to a report by Forbes in 2014 [70]. In the European Union, the traffic congestion cost is estimated to be 1% of its GDP [2]. Improving traffic conditions could increase city efficiency, improve the economy, and ease people's daily lives.
One way to reduce traffic congestion is to intelligently control traffic lights. Nowadays, most traffic lights are still controlled with pre-defined fixed-time plans [71, 72] and are not designed by observing real traffic. Recent studies propose hand-crafted rules according to real traffic data [73, 74]. However, these rules are still pre-defined and cannot be dynamically adjusted w.r.t. real-time traffic.
To dynamically adjust traffic lights according to real-time traffic, people have been using reinforcement learning techniques [9, 10, 75]. Traditional reinforcement learning is difficult to apply due to two key challenges: (1) how to represent the environment; and (2) how to model the correlation between the environment and the decision. To address these two challenges, recent studies [10, 76] have applied deep reinforcement learning techniques, such as Deep Q-learning (DQN), to the traffic light control problem. Figure 2.2 illustrates the basic idea of the deep reinforcement learning framework. The environment is composed of the traffic light phase and the traffic condition. The state is a feature representation of the environment. The agent takes the state as input and learns a model to predict whether to "keep the current phase of traffic lights" or "change the current phase". The decision is sent to the environment, and the reward (e.g., how many vehicles pass the intersection) is sent back to the agent. The agent consequently updates the model and further makes a new decision for the next timestamp based on the new state and the updated model. In such a framework, the traffic condition can be described as an image, and such an image is directly taken as input by a CNN-based model to enrich the hand-crafted features of the environment.
Recent deep reinforcement learning approaches made promising progress for the traffic light control problem. Our approach extends this line of work by making several important new contributions:
1. Experiments with real traffic data. Nowadays, an increasing amount of traffic data is being collected from various sources. In China, many big cities have installed AI-equipped traffic surveillance cameras to monitor traffic conditions in real time. Such real-time traffic data enables us to implement reinforcement learning in the real world. However, to the best of our knowledge, none of the existing studies has used real traffic data to test their methods. Instead, they use traffic simulations, and such simulations do not reflect real-world traffic. For example, the simulation models in current studies often assume that vehicles arrive at a constant rate, but real traffic is highly dynamic over time. In this chapter, we test the methods on large-scale real traffic data obtained from 1,704 surveillance cameras in Jinan, China for a period of 31 days (see the experiment section for details). In this dataset, there are more than 405 million vehicle records and more than 11 million unique vehicle plates. We conduct comprehensive experiments on this large real-world dataset.
2. Interpretations of the policy. A frequently used measure to quantify the performance of traffic light control is the overall reward, which can be defined by several factors such as the waiting time of vehicles and the number of vehicles passing the intersection. However, existing studies rarely examine the policy learned by the model. The reward can be misleading in some cases: different policies may yield the same reward, yet one may be more suitable than the other. Take Figure 3.1 as an example. Assume there is only traffic on the South-North direction and the traffic comes for the first 80 seconds of every 120 seconds. Policy #1 is 80 seconds of green light on the South-North direction followed by 40 seconds of red light, and then repeat. Policy #2 differs from policy #1 in that, instead of a 40-second red light on the South-North direction, the light changes every 10 seconds. Both policies result in the same reward because no vehicle waits under either policy. However, policy #1 is preferred over policy #2 in real scenarios. In this chapter, we claim that it is important to study the policies rather than simply showing the overall reward. In our experiments, we show several interesting policies learned from the real traffic under different scenarios (e.g., peak hours vs. non-peak hours, weekday vs. weekend).
3. A phase-gated model learning. As described earlier in deep reinforcement learning framework, the agent will take the state, which is the representation of environment, as model input. The environment usually includes the current traffic light phase and traffic conditions. For example, the environments of two cases in Figure 3.2 are the same, except the traffic light phases. Previous studies all take phase as one feature [10, 27], together with many other features (e.g., number of vehicles at different lanes, positions of vehicles). And it is likely that this one feature does not play a role that is significant enough to affect the model output. Therefore, the model will make
Figure 3.1: Reward is not a comprehensive measure to evaluate traffic light control performance. Both policies will lead to the same rewards. But policy #1 is more suitable than policy #2 in the real world.
the same decision (i.e., either keep or change the current phase) for these two different cases. However, such a decision, no matter which one, is not ideal for one of the cases, because in Figure 3.2, case A should keep the phase while case B should change it. In this chapter, we propose a new phase-sensitive (i.e., phase gate combined with memory palace) reinforcement learning agent, which is a critical component that leads to superior performance.
The rest of this chapter is organized as follows. Section 3.2 discusses the literature. Section 3.3 formally defines the problem. The method is described in Section 3.4, and the experimental results are shown in Section 3.5. Finally, we conclude the chapter in Section 3.6.
3.2 Related Work

In this section, we first introduce conventional methods for traffic light control and then introduce methods using reinforcement learning.

Conventional Traffic Light Control

Early traffic light control methods can be roughly classified into two groups. The first is pre-timed signal control [71, 72, 77], where a fixed time is determined for all green phases
Figure 3.2: Case A and case B have the same environment except the traffic light phase.
according to historical traffic demand, without considering possible fluctuations in traffic demand. The second is vehicle-actuated control [73, 74], where real-time traffic information is used. Vehicle-actuated methods are suitable for situations with relatively high traffic randomness. However, they largely depend on hand-crafted rules for the current traffic condition, without taking future conditions into account. Therefore, they cannot reach a global optimum.

Reinforcement Learning for Traffic Light Control

Recently, because previous methods cannot handle dynamic multi-directional traffic, more studies have tried to use reinforcement learning algorithms to solve the traffic light control problem [75, 9, 27]. Typically, these algorithms take the traffic on the road as the state and the operation on the lights as the action. These methods usually show better performance than fixed-time and traffic-responsive control methods.
Methods in [78, 79, 80, 9, 30] design the state as discrete values like the locations of vehicles or the number of waiting cars. However, the discrete state-action value matrix requires huge storage space, which keeps these methods from being used in large state space problems.
To handle the unmanageably large state space of previous methods, recent studies [76, 10] propose to apply Deep Q-learning methods using continuous state representations. These studies learn a Q-function (e.g., a deep neural network) to map state and action to reward. These works vary in their state representations, including hand-crafted features (e.g., queue length [27, 76], average delay [81, 10]) and image features [82, 10, 83]. They also differ in reward design, including average delay [10, 46], average travel time [83, 10], and queue length [76].
However, all these methods assume relatively static traffic environments and are hence far from the real case. Further, they only focus on rewards and overlook the adaptability of the algorithms to real traffic. Therefore, they cannot interpret why the learned signal changes in response to the traffic. In this chapter, we test the algorithms in a more realistic traffic setting and add more interpretation beyond the reward.

3.3 Problem Definition

In our problem, we have the environment $\mathsf{E}$ as an intersection of two roads (and the traffic at this intersection). There is an intelligent traffic light agent $\mathsf{G}$. To simplify notation, we use "N", "S", "W", "E" to represent north, south, west, and east respectively, and use "Red" and "Green" to represent the red light and green light correspondingly. A setting of the traffic light is defined as a phase (e.g., a green light on the west-east direction, which can be abbreviated as Green-WE). When a light changes from green to red, there is a 3-second yellow light, while the other directions stay red. So one green light and the subsequent yellow light can be represented together by "Green". To simplify the problem, we assume there are only two phases of the traffic light, i.e., 1) Green-WE, and 2) Red-WE. Due to real-world constraints, the traffic light can only change in a specific order (i.e., 1 $\rightarrow$ 2 $\rightarrow$ 1 $\rightarrow$ 2 $\rightarrow$ ...). Given the state $\mathsf{s}$ (describing the positions and speeds of the traffic near this intersection), the goal of the agent $\mathsf{G}$ is to give the optimal action $\mathsf{a}$ (i.e., whether to change the light to the next phase), so that the reward $\mathsf{r}$ (i.e., the smoothness of traffic) is maximized.

3.4 Method

Traffic light control has attracted a lot of attention in recent years due to its essential role in adjusting traffic. Current methods generally fall into two categories: conventional methods and deep reinforcement learning based methods. Conventional methods usually rely on prior knowledge to set a fixed time for each light phase or to set changing rules. These rules perform poorly under dynamically changing traffic. Reinforcement learning methods usually take the traffic condition (e.g., the queue length of waiting cars and the updated waiting time) as the state and try to take actions that improve the traffic condition based on the current state.
However, current methods do not consider the complex situations of real cases and hence may lead the agent to get stuck in a single kind of action. This leads to inferior traffic-adjusting performance under complex traffic situations.
In this section, we propose a deep reinforcement learning traffic light agent to solve this problem. We will first introduce the model framework in Section 3.4.1. Then, we show the design of the agent in Section 3.4.2. We further describe the network structure in Section 3.4.3. In addition, we describe the memory palace in Section 3.4.4. Note that, although our model is designed for a four-way intersection with two phases, it is not difficult to extend it to other types of intersections or to multi-phase scenarios.

Framework

Our model is composed of an offline part and an online part (as shown in Figure 3.3). We extract five kinds of features describing the traffic conditions as the state (detailed in Section 3.4.2) and use the reward to describe how much the action has improved the traffic (detailed in Section 3.4.2). In the offline stage, we set a fixed timetable for the lights and let traffic go through the system to collect data samples. After training with the samples logged in this stage, the model is put into the online part. In the online stage, at every time interval $\Delta t$, the traffic light agent observes the state $\mathsf{s}$ from the environment and takes action $\mathsf{a}$ (i.e., whether to change the light signal to the next phase) according to an $\epsilon$-greedy strategy combining exploration (i.e., a random action with probability $\epsilon$) and exploitation (i.e., taking the action with the maximum estimated reward). After that, the agent $\mathsf{G}$ observes the environment and gets the reward $\mathsf{r}$ from it. Then, the tuple ($\mathsf{s}$, $\mathsf{a}$, $\mathsf{r}$) is stored into memory.
| Notation | Meaning |
| :--- | :--- |
| $\mathsf{E}$ | Environment |
| $\mathsf{G}$ | Agent |
| $\mathsf{a}$ | Action |
| $\mathsf{s}$ | State |
| $\mathsf{r}$ | Reward |
| $\Delta t$ | Time interval between actions |
| $\mathsf{q}$ | Action value function |
| $\mathsf{Q}$ | Deep Q-Network |
| $\mathsf{L}_{i}$ | Queue length on lane $i$ |
| $\mathsf{V}_{i}$ | Number of vehicles on lane $i$ |
| $\mathsf{W}_{i}$ | Updated waiting time of all vehicles on lane $i$ |
| $\mathsf{D}_{i}$ | Delay of lane $i$ |
| $\mathsf{M}\in\mathbb{R}^{N\times N}$ | Image representation of vehicles' positions |
| $\mathsf{P}_{c}$ | Current phase |
| $\mathsf{P}_{n}$ | Next phase |
| $\mathsf{C}\in\{0,1\}$ | Light switches (1) or not (0) |
| $\mathsf{N}$ | Number of vehicles that passed the intersection after the action |
| $\mathsf{T}$ | Travel time in the system of all vehicles that passed the intersection during $\Delta t$ |

Table 3.1: Notations

After several timestamps (e.g., $t_{2}$ in Figure 3.3), agent $\mathsf{G}$ will update the network according to the logs in the memory.
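As a minimal sketch of the $\epsilon$-greedy step in the online stage (plain Python; the state tuple, the Q-value estimates, and the reward below are made-up placeholders, not values produced by the actual model), the agent explores with probability $\epsilon$, otherwise picks the action with the higher estimated value, and then logs the experience:

```python
import random

def choose_action(q_values, epsilon=0.05):
    """epsilon-greedy over the two actions: 0 = keep phase, 1 = change phase.
    q_values is a length-2 sequence of estimated action values for the current state."""
    if random.random() < epsilon:
        return random.choice([0, 1])           # exploration
    return int(q_values[1] > q_values[0])      # exploitation

# toy usage: the agent slightly prefers keeping the current phase
memory = []                                    # replay log of (s, a, r) tuples
s, q_values = (2, 0, 30.5), [0.8, 0.3]         # hypothetical state and Q estimates
a = choose_action(q_values)
r = -1.2                                       # hypothetical reward from the environment
memory.append((s, a, r))
```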

Agent Design

First, we introduce the state, action and reward representation.
  • State. Our state is defined for one intersection. For each lane $i$ at this intersection, the state component includes the queue length $\mathsf{L}_{i}$, the number of vehicles $\mathsf{V}_{i}$, and the updated waiting time of vehicles $\mathsf{W}_{i}$. In addition, the state includes an image representation of vehicles' positions $\mathsf{M}$, the current phase $\mathsf{P}_{c}$ and the next phase $\mathsf{P}_{n}$.
  • Action. The action is defined as $\mathsf{a}=1$: change the light to the next phase $\mathsf{P}_{n}$, and $\mathsf{a}=0$: keep the current phase $\mathsf{P}_{c}$.
  • Reward. As shown in Equation 3.3, the reward is defined as a weighted sum of the following factors (a small computational sketch of this reward follows the list below): 1. Sum of queue length $\mathsf{L}$ over all approaching lanes, where $\mathsf{L}$ is calculated as the total number of waiting vehicles on the given lane; a vehicle with a speed of less than 0.1 m/s is considered to be waiting. 2. Sum of delay $\mathsf{D}$ over all approaching lanes, where the delay $\mathsf{D}_{i}$ for lane $i$ is defined in Equation 3.1, in which the lane speed is the average speed of vehicles
Figure 3.3: Model framework
on lane $i$, and the speed limit is the maximum speed allowed on lane $i$: $\mathsf{D}_{i}=1-\frac{lane\ speed}{speed\ limit}$ (3.1)
  3. Sum of updated waiting time $\mathsf{W}$ over all approaching lanes. This equals the sum of $\mathsf{W}$ over all vehicles on approaching lanes. The updated waiting time $\mathsf{W}_{j}$ of vehicle $j$ at time $t$ is defined in Equation 3.2. Note that the updated waiting time of a vehicle is reset to 0 every time it moves. For example, if a vehicle's speed is 0.01 m/s from 0 s to 15 s, 5 m/s from 15 s to 30 s, and 0.01 m/s from 30 s to 60 s, then $\mathsf{W}_{j}$ is 15 seconds, 0 seconds and 30 seconds at $t=$ 15 s, 30 s and 60 s respectively. $\mathsf{W}_{j}(t)=\begin{cases}\mathsf{W}_{j}(t-1)+1 & vehicle\ speed<0.1\\ 0 & vehicle\ speed\geq 0.1\end{cases}$ (3.2)
  4. Indicator of light switches $\mathsf{C}$, where $\mathsf{C}=0$ means keeping the current phase and $\mathsf{C}=1$ means changing the current phase.
  5. Total number of vehicles $\mathsf{N}$ that passed the intersection during the time interval $\Delta t$ after the last action $\mathsf{a}$.
  6. Total travel time $\mathsf{T}$ of vehicles that passed the intersection during the time interval $\Delta t$ after the last action $\mathsf{a}$, defined as the total time (in minutes) that vehicles spent on approaching lanes. $Reward=w_{1}*\sum_{i\in I}\mathsf{L}_{i}+w_{2}*\sum_{i\in I}\mathsf{D}_{i}+w_{3}*\sum_{i\in I}\mathsf{W}_{i}+w_{4}*\mathsf{C}+w_{5}*\mathsf{N}+w_{6}*\mathsf{T}$ (3.3)
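The following minimal sketch (plain Python; the per-lane measurements in the toy usage are made up, and the default coefficients echo the values listed later in Table 3.3) shows how Equations 3.1-3.3 could be evaluated for one intersection:

```python
def lane_delay(lane_speed, speed_limit):
    """Delay of a lane (Eq. 3.1): 1 - lane speed / speed limit."""
    return 1.0 - lane_speed / speed_limit

def update_waiting_time(prev_wait, vehicle_speed):
    """Updated waiting time of a vehicle (Eq. 3.2): accumulate while it is
    effectively stopped (< 0.1 m/s), reset to 0 once it moves."""
    return prev_wait + 1 if vehicle_speed < 0.1 else 0

def reward(L, D, W, C, N, T, w=(-0.25, -0.25, -0.25, -5, 1, 1)):
    """Weighted sum of the six reward terms (Eq. 3.3).
    L, D, W are per-lane lists; C, N, T are scalars; w are the coefficients
    from Table 3.3."""
    return (w[0] * sum(L) + w[1] * sum(D) + w[2] * sum(W)
            + w[3] * C + w[4] * N + w[5] * T)

# toy usage with made-up lane measurements
L = [3, 0, 1]                                   # queue lengths per approaching lane
D = [lane_delay(5.0, 14.0), 0.0, 0.2]           # delays per lane
W = [45, 0, 12]                                 # summed updated waiting times per lane
print(reward(L, D, W, C=1, N=4, T=2.5))
```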
Hence, given the current state $\mathsf{s}$ of the traffic condition, the mission of the agent $\mathsf{G}$ is to find the action $\mathsf{a}$ (change or keep the current phase) that leads to the maximum reward $\mathsf{r}$ in the long run, following the Bellman Equation (Equation 3.4) [12]. In this situation, the action value function $\mathsf{q}$ for time $t$ is the sum of the reward at the next timestamp $t+1$ and the maximum potential future reward. Through this estimate of the future, the agent can select the action that is better for the long-run reward.
$\mathsf{q}(\mathsf{s}_{t},\mathsf{a},t)=\mathsf{r}_{\mathsf{a},t+1}+\gamma\max_{\mathsf{a}'}\mathsf{q}(\mathsf{s}_{\mathsf{a},t+1},\mathsf{a}',t+1)$ (3.4)
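As a small numerical illustration of Equation 3.4 (plain Python; the reward and next-state Q estimates are made-up numbers, and $\gamma=0.8$ follows the setting given later in Table 3.2):

```python
def q_target(reward_next, q_next_values, gamma=0.8):
    """One-step target from Eq. 3.4: r_{a,t+1} + gamma * max_a' q(s_{t+1}, a', t+1).
    q_next_values holds the estimated values of both actions in the next state."""
    return reward_next + gamma * max(q_next_values)

# toy usage: observed reward plus the discounted best future value
print(q_target(reward_next=-0.5, q_next_values=[0.9, 0.2]))   # -0.5 + 0.8 * 0.9
```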

Network Structure

In order to estimate the reward based on the state and action, the agent needs to learn a Deep Q-Network $\mathsf{Q}(\mathsf{s},\mathsf{a})$.
In real-world scenarios, traffic is very complex and contains many different cases that need to be considered separately. We illustrate this in Example 1.
Example 1.: We still assume a simple intersection with a two-phase light transition here: 1) Green-WE, and 2) Red-WE. The decision process of whether to change the traffic light consists of two steps. The first step is the mapping from the traffic condition (e.g., how many cars are waiting, how long each car has been waiting) to a partial reward. An example of this mapping could be $\mathsf{r}=-0.5\times\mathsf{L}-0.7\times\mathsf{W}$. This mapping is shared by different phases, no matter which lane the green light is on. Then, to determine the action, the agent should attend to the traffic on different lanes during different phases. For instance, as shown in Figure 3.2 (a), when the red light is on the NS direction, more waiting traffic (i.e., a lower reward in the first step) on the NS direction will make the light tend to
change (because by changing the light on this lane from red to green, more cars on this lane can pass through the intersection), while more waiting traffic (i.e., a lower reward in the first step) on the WE direction will make the light tend to keep its phase. When the red light is on the WE direction, the case is exactly the opposite. Therefore, the light phase should perform an explicit selection on features.

Figure 3.4: Q-network
In previous studies, due to the simplified design of the model for approximating the Q-function under complex traffic conditions, agents have difficulty distinguishing the decision processes for different phases. Therefore, we propose a network structure that can explicitly consider the different cases. We call this special sub-structure the "Phase Gate".
Our whole network structure is shown in Figure 3.4. The image features are extracted from the observations of the traffic condition and fed into two convolutional layers. The output of these layers is concatenated with the four explicitly mined features: queue length $L$, updated waiting time $W$, phase $P$ and total number of vehicles $V$. The concatenated features are then fed into fully-connected layers to learn the mapping from traffic conditions to potential rewards. Then, for each phase, we design a separate learning process mapping from rewards to the value of making decisions $Q(s,a)$. These separate processes are selected through a gate controlled by the phase. As shown in Figure 3.4, when phase $P=0$, the left branch is activated, while when phase $P=1$, the right branch is activated. This distinguishes the decision processes for different phases, prevents the decision from favoring certain actions, and enhances the fitting ability of the network.
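A minimal PyTorch sketch of the phase-gate idea is given below. It keeps only the shared fully-connected part and the per-phase decision branches; the convolutional image branch is omitted, and all layer sizes and names (`PhaseGateQNet`, `feat_dim`, etc.) are illustrative assumptions rather than the exact architecture used here:

```python
import torch
import torch.nn as nn

class PhaseGateQNet(nn.Module):
    """Shared feature layers followed by one decision branch per phase;
    the current phase acts as a gate selecting which branch produces Q(s, a)."""

    def __init__(self, feat_dim=8, hidden=32, n_actions=2, n_phases=2):
        super().__init__()
        self.shared = nn.Sequential(nn.Linear(feat_dim, hidden), nn.ReLU())
        # one separate head (branch) for each phase
        self.branches = nn.ModuleList(
            [nn.Linear(hidden, n_actions) for _ in range(n_phases)])

    def forward(self, features, phase):
        """features: (batch, feat_dim) mined traffic features (queue length,
        waiting time, vehicle count, ...); phase: (batch,) integer phase index."""
        h = self.shared(features)
        all_q = torch.stack([branch(h) for branch in self.branches], dim=1)  # (batch, n_phases, n_actions)
        gate = phase.view(-1, 1, 1).expand(-1, 1, all_q.size(-1))            # pick one branch per sample
        return all_q.gather(1, gate).squeeze(1)                              # (batch, n_actions)

# toy usage with random features and alternating phases
net = PhaseGateQNet()
q = net(torch.randn(4, 8), torch.tensor([0, 1, 0, 1]))
```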

Memory Palace and Model Updating

Periodically, the agent takes samples from the memory and uses them to update the network. The memory is maintained by adding new data samples and occasionally removing old samples. This technique is known as experience replay [52] and has been widely used in reinforcement learning models.
However, in real traffic settings, traffic on different lanes can be highly imbalanced. As previous methods [10, 76, 81, 82] store all the state-action-reward training samples in one memory, this memory will be dominated by the phases and actions that appear most frequently in imbalanced settings. The agent then learns to estimate the reward for these frequent phase-action combinations well but ignores other, less frequent phase-action combinations. This causes the learned agent to make bad decisions on the infrequent phase-action combinations. Therefore, when traffic on different lanes is dramatically different, these imbalanced samples lead to inferior performance on less frequent situations.
Inspired by the Memory Palace theory [84, 85] in cognitive psychology, we address this imbalance by using different memory palaces for different phase-action combinations. As shown in Figure 3.5, training samples for different phase-action combinations are stored in different memory palaces. Then the same number of samples is selected from each palace. These balanced samples prevent different phase-action combinations from interfering with each other's training process and, hence, improve the capability of the network to predict the reward accurately.
Figure 3.5: Memory palace structure
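A minimal sketch of such per-combination memories with balanced sampling is shown below (plain Python; the class name `MemoryPalace`, the capacity, and the per-palace sample size are illustrative assumptions, with values echoing Table 3.2):

```python
import random
from collections import defaultdict, deque

class MemoryPalace:
    """One replay memory per (phase, action) combination, sampled evenly so that
    frequent combinations do not dominate the training batch."""

    def __init__(self, capacity=1000):
        self.palaces = defaultdict(lambda: deque(maxlen=capacity))

    def store(self, phase, action, sample):
        self.palaces[(phase, action)].append(sample)    # route to its own memory

    def sample(self, n_per_palace=300):
        """Draw the same number of samples from every phase-action memory."""
        batch = []
        for palace in self.palaces.values():
            k = min(n_per_palace, len(palace))
            batch.extend(random.sample(list(palace), k))
        return batch

# toy usage
mem = MemoryPalace()
mem.store(phase=0, action=1, sample=((2, 0, 30.5), 1, -1.2))
print(len(mem.sample(n_per_palace=1)))
```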
3.5 Experiment
In this section, we conduct experiments using both synthetic and real-world traffic data. We show a comprehensive quantitative evaluation by comparing with other methods and also show some interesting case studies 1.
Footnote 1: Codes are available at the author’s website.

Experiment Setting

The experiments are conducted on a simulation platform SUMO (Simulation of Urban MObility) 2. SUMO provides flexible APIs for road network design, traffic volume simulation and traffic light control. Specifically, SUMO can control the traffic moving according to the given policy of traffic light (obtained by the traffic light agent).
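For concreteness, the snippet below sketches how such a control loop could be driven through SUMO's TraCI Python interface (module and function names as in recent SUMO releases; the configuration file, traffic light id, thresholds, and the simple rule standing in for the RL agent are placeholders for illustration only):

```python
import traci

traci.start(["sumo", "-c", "intersection.sumocfg"])    # hypothetical SUMO config
TLS_ID = "center"                                       # hypothetical traffic light id

for step in range(3600):
    traci.simulationStep()                              # advance the simulation by one step
    if step % 5 == 0:                                   # act every Delta t = 5 seconds
        # read queue lengths on the controlled lanes as part of the state
        queue = sum(traci.lane.getLastStepHaltingNumber(l)
                    for l in traci.trafficlight.getControlledLanes(TLS_ID))
        if queue > 10:                                  # placeholder rule instead of the RL agent
            traci.trafficlight.setPhase(TLS_ID, 1)      # switch to the next phase index

traci.close()
```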
The environment for the experiments on synthetic data is a four-way intersection as in Figure 3.1. The intersection is connected to four road segments of 300 meters, where each road has three incoming and three outgoing lanes. The traffic light in this part of the experiment contains two phases: (1) Green-WE (green light on WE with red light on SN), and (2) Red-WE (red light on WE with green light on SN). Note that when a green light is on one direction, there is a red light on the other direction. Also, a green light is followed by a 3-second yellow light before it turns red. Although this is a simplification of the real-world scenario, research on more types of intersections (e.g., three-way intersections) and more complex light phasing (e.g., with left-turn phases) can be conducted in a similar way.

Parameter Setting

The parameter settings and reward coefficients for our method are shown in Table 3.2 and Table 3.3, respectively. We found that the action time interval $\Delta t$ has minimal influence on the performance of our model as long as $\Delta t$ is between 5 seconds and 25 seconds.

Evaluation Metric

We evaluate the performance of different methods using the following measures:
  • Reward: average reward over time. Defined in Equation 3.3, the reward is a combination of several positive and negative terms; therefore, the reward ranges from $-\infty$ to $\infty$. Under a specific configuration, there is an upper bound for the reward when all cars move freely without any stop or delay.
  • Queue length: average queue length over time, where the queue length at time $t$ is the sum of $\mathsf{L}$ (defined in Section 3.4.2) over all approaching lanes. A smaller queue length means there are fewer waiting vehicles on all lanes.
  • Delay: average delay over time, where the delay at time $t$ is the sum of $\mathsf{D}$ (defined in Equation 3.1) over all approaching lanes. A lower delay means a higher speed on all lanes.
  • Duration: average travel time vehicles spend on approaching lanes (in seconds). It is one of the most important measures that people care about when they drive on the road. A smaller duration means vehicles spend less time passing through the intersection.
In summary, a higher reward indicates better performance of the method, and a smaller queue length, delay and duration indicate that the traffic is less congested.

Compared Methods

To evaluate the effectiveness of our model, we compare it with the following baseline methods and tune the parameters for all methods. We then report their best performance.
| $w_1$ | $w_2$ | $w_3$ | $w_4$ | $w_5$ | $w_6$ |
| :---: | :---: | :---: | :---: | :---: | :---: |
| -0.25 | -0.25 | -0.25 | -5 | 1 | 1 |
Table 3.3: Reward coefficients
| Model parameter | Value |
| :--- | :---: |
| Action time interval $\Delta t$ | 5 seconds |
| $\gamma$ for future reward | 0.8 |
| $\epsilon$ for exploration | 0.05 |
| Sample size | 300 |
| Memory length | 1000 |
| Model update interval | 300 seconds |
Table 3.2: Settings for our method

  • Fixed-time Control (FT). The fixed-time control method uses a pre-determined cycle and phase time plan [72] and is widely used when the traffic flow is steady.
  • Self-Organizing Traffic Light Control (SOTL) [74]. This method controls the traffic light according to the current traffic state, including the elapsed time and the number of vehicles waiting at the red light. Specifically, the traffic light changes when the number of waiting cars exceeds a hand-tuned threshold.
  • Deep Reinforcement Learning for Traffic Light Control (DRL). Proposed in [10], this method applies DQN framework to select optimal light configurations for traffic intersections. Specifically, it solely relies on the original traffic information as an image.
In addition to the baseline methods, we also consider several variations of our model.
  • IntelliLight (Base). This variant uses the same network structure and reward function defined in Sections 3.4.2 and 3.4.3, but without the Memory Palace and Phase Gate.
  • IntelliLight (Base+MP). By adding the Memory Palace to IntelliLight (Base), we store the samples from different phases and times in separate memories.
  • IntelliLight (Base+MP+PG). This is the model adding two techniques (Memory Palace and Phase Gate).

Datasets

Synthetic data
In the first part of our experiment, synthetic data is used with four traffic flow settings: simple changing traffic (configuration 1), equally steady traffic (configuration 2), unequally steady traffic (configuration 3) and complex traffic (configuration 4), which is a combination of the previous three configurations. As shown in Table 3.4, vehicle arrivals are generated by a Poisson process with the specified arrival rates.
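As an illustration of this kind of traffic generation (plain Python; the function name and the choice to sample exponential inter-arrival gaps are assumptions about how a Poisson arrival process is commonly realized, using the rate of configuration 1 from Table 3.4):

```python
import random

def generate_arrivals(rate, start, end, seed=0):
    """Sample vehicle arrival times on one approach from a Poisson process with
    the given rate (cars/s) between start and end; inter-arrival gaps are exponential."""
    random.seed(seed)
    t, arrivals = start, []
    while True:
        t += random.expovariate(rate)      # exponential gap between consecutive cars
        if t > end:
            return arrivals
        arrivals.append(t)

we_arrivals = generate_arrivals(rate=0.4, start=0, end=36000)
print(len(we_arrivals))                    # roughly 0.4 * 36000 vehicles
```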
Real-world data
The real-world dataset was collected by 1,704 surveillance cameras in Jinan, China over the period from 08/01/2016 to 08/31/2016. The locations of these cameras are shown in Figure 3.6. Gathered every second by cameras facing vehicles near intersections, each record in the dataset consists of time, camera ID and vehicle information. By analyzing these records together with the camera locations, the trajectories of vehicles are recorded as they pass through road intersections. The dataset covers 935 locations, 43 of which are four-way intersections. We use the number of vehicles passing through 24 intersections as the traffic volume for experiments since only these intersections have consecutive data. We then feed this real-world traffic setting into SUMO for online experiments. It can be seen from Table 3.5 that traffic flow on different roads changes dynamically in the real world.
| Config | Directions | Arrival rate (cars/s) | Start time (s) | End time (s) |
| :---: | :---: | :---: | :---: | :---: |
| 1 | WE | 0.4 | 0 | 36000 |
| | SN | 0.4 | 36001 | 72000 |
| 2 | WE | 0.033 | 0 | 72000 |
| | SN | 0.033 | 0 | 72000 |
| 3 | WE | 0.2 | 0 | 72000 |
| | SN | 0.033 | 0 | 72000 |
| 4 | Configuration 1 | | 0 | 72000 |
| | Configuration 2 | | 72001 | 144000 |
| | Configuration 3 | | 144001 | 216000 |
Table 3.4: Configurations for synthetic traffic data
Figure 3.6: Traffic surveillance cameras in Jinan, China

Performance on Synthetic Data

3.5.6.1 Comparison with state-of-the-art methods
We first compare our method with the three baselines under different synthetic traffic settings. From Tables 3.6, 3.7, 3.8 and 3.9 we can see that our method performs better than all other baseline methods in configurations 1, 2, 3 and 4. Although some baselines perform well in certain settings, they perform badly in other configurations (e.g., SOTL achieves a good reward under configuration 1, almost identical to our method up to three decimal places; this is because our method has learned to keep the light until 36000 s and switch the light after that, and SOTL is also designed to behave similarly, hence the two methods perform very similarly). In contrast, our method IntelliLight shows better performance across all configurations.
3.5.6.2 Comparison with variants of our proposed method
Tables 3.6, 3.7, 3.8 and 3.9 show the performance of the variants of our proposed method. First, we can see that adding the Memory Palace helps achieve a higher reward under configurations 3 and 4, although it does not boost the reward under configurations 1 and 2. This is because in the simple cases (configurations 1 and 2), the phase is relatively steady for a long time (the traffic only comes from one direction or remains unchanged for a long time). Therefore, the memory palace does not help build a better model for predicting the reward. Further adding the Phase Gate also reduces the queue length in most cases and achieves the highest reward, demonstrating the effectiveness of these two techniques.
3.5.6.3 Interpretation of learned signal
To understand what our method has learned w.r.t. dynamic traffic conditions, we show the percentage of duration for phase Green-WE (i.e., green light on the WE direction with red light on the SN direction), along with the ratio of traffic flow on WE over the total traffic flow from all directions.
| Time Range | Records | Mean arrival rate (cars/s) | Std | Max | Min |
| :---: | :---: | :---: | :---: | :---: | :---: |
| 08/01/2016 - 08/31/2016 | 405,370,631 | 0.089 | 0.117 | 0.844 | 0.0 |

Table 3.5: Details of the real-world traffic dataset
With changing traffic, an ideal traffic light control method should be able to adjust its phase duration to the traffic flows and obtain a high reward. For example, as traffic changes from direction WE to SN, the traffic light agent is expected to shift its phase duration from giving WE the green light to giving SN the green light. As we can see from Figure 3.7, IntelliLight adjusts its phase duration as the traffic changes.

Performance on Real-world Data

Comparison of different methods
In this section, we compare our method with the baseline methods on real-world data. The overall results are shown in Table 3.10. Our method IntelliLight achieves the best reward, queue length, delay and duration among all compared methods, with relative improvements of 32%, 38%, 19% and 22% respectively over the best baseline method. In addition, our method has a relatively steady performance over multiple intersections (small standard deviation).
| Model name | Reward | Queue length | Delay | Duration |
| :---: | :---: | :---: | :---: | :---: |
| FT | -0.978 | 1.105 | 2.614 | 34.278 |
| SOTL | -21.952 | 19.874 | 4.384 | 177.747 |
| DRL | -2.208 | 3.405 | 3.431 | 52.075 |
| **IntelliLight** (ours) | | | | |
| Base | -0.523 | 0.208 | 1.689 | 27.505 |
| Base+MP | -0.556 | 0.259 | 1.730 | 27.888 |
| Base+MP+PG | **-0.514** | **0.201** | **1.697** | 27.451 |
Table 3.7: Performance on configuration 2
| Model name | Reward | Queue length | Delay | Duration |
| :---: | :---: | :---: | :---: | :---: |
| FT | -2.304 | 8.532 | 2.479 | 42.230 |
| SOTL | 0.398 | 0.006 | **1.598** | **24.129** |
| DRL | -36.247 | 91.412 | 4.483 | 277.430 |
| **IntelliLight** (ours) | | | | |
| Base | -3.077 | 10.654 | 2.635 | 92.080 |
| Base+MP | -3.267 | 6.087 | 1.865 | 38.230 |
| Base+MP+PG | **0.399** | **0.005** | **1.598** | 24.130 |
Table 3.6: Performance on configuration 1. Reward: the higher the better. Other measures: the lower the better. The same applies to the following tables.

3.5.7.2 Observations with respect to real traffic

In this section, we make observations on the policies we learned from the real data. We analyze the learned traffic light policy for the intersection of Jingliu Road (WE direction)
| Model name | Reward | Queue length | Delay | Duration |
| :---: | :---: | :---: | :---: | :---: |
| FT | -1.670 | 4.601 | 2.883 | 39.707 |
| SOTL | -14.079 | 13.372 | 3.753 | 54.014 |
| DRL | -49.011 | 91.887 | 4.917 | 469.417 |
| **IntelliLight** (ours) | | | | |
| Base | -5.030 | 5.880 | 3.432 | 39.021 |
| Base+MP | -3.329 | 5.358 | 2.238 | 44.703 |
| Base+MP+PG | **-0.474** | **0.548** | **2.202** | **25.977** |
Table 3.9: Performance on configuration 4
Figure 3.7: Percentage of the time duration of learned policy for phase Green-WE (green light on W-E and E-W direction, while red light on N-S and S-N direction) in every 2000 seconds for different methods under configuration 4.
| Model name | Reward | Queue length | Delay | Duration |
| :---: | :---: | :---: | :---: | :---: |
| FT | -1.724 | 4.159 | 3.551 | 36.893 |
| SOTL | -20.680 | 20.227 | 5.277 | 69.838 |
| DRL | -8.108 | 16.968 | 4.704 | 66.485 |
| **IntelliLight** (ours) | | | | |
| Base | -0.836 | 0.905 | 2.699 | 28.197 |
| Base+MP | -0.698 | 0.606 | 2.729 | 26.948 |
| Base+MP+PG | **-0.648** | **0.524** | **2.584** | **26.647** |
Table 3.8: Performance on configuration 3

and Erhuanxi Auxiliary Road (SN direction) under different scenarios: peak hours vs. non-peak hours, weekdays vs. weekends, and major arterial vs. minor arterial.
1. Peak hours vs. non-peak hours. Figure 3.8 (a) shows the average traffic flow from both directions (WE and SN) on a Monday. On this day, there is more traffic on the WE direction than on SN for most of the time, during which an ideal traffic light control method is expected to give more time to the WE direction. It can be seen from Figure 3.8 (c) that the ratio of the time duration for phase Green-WE (i.e., green light on WE, while red light on SN) is usually larger than 0.5, which means that for most of the time, our method gives more time to WE. During peak hours (around 7:00, 9:30 and 18:00), the policies learned by our method also give more green-light time to WE than during non-peak hours. In the early morning, the vehicle arrival rates on SN are larger than those on WE, and our method automatically gives more time to SN. This shows that our method can intelligently adapt to different traffic conditions.
2. Weekdays vs. weekends. Unlike weekdays, weekends show different patterns of traffic conditions and traffic light control policies. Our policy gives less green light to WE (more green light to SN) during weekend daytime than it does on weekdays. This is because there is more traffic on SN than on WE during weekend daytime in Figure 3.8 (b), while on weekdays traffic on SN is lighter than on WE. Besides, by comparing Figure 3.8 (a) with Figure 3.8 (b), we can see that the traffic on WE and SN during late night on Monday is similar, making the duration ratio of Green-WE close to 0.5.
3. Major arterial vs. minor arterial. Major arterials are roads that carry higher traffic volume within a period and are expected to have longer green light time. Without prior knowledge about the major arterial, the traffic light control policy learned by our method prefers giving the major arterial the green light (i.e., keeping the green light already on the major arterial, and tending to switch the red light to green for the major arterial). Specifically, we look into three periods of time (3:00, 12:00 and 23:30) on August 1st. From Figure 3.8 (a), we can tell that the road on the WE direction is the main road, since
| Methods | Reward | Queue Length | Delay | Duration |
| :---: | :---: | :---: | :---: | :---: |
| FT | -5.727 $\pm$ 5.977 | 19.542 $\pm$ 22.405 | 3.377 $\pm$ 1.057 | 84.513 $\pm$ 60.888 |
| SOTL | -35.338 $\pm$ 65.108 | 16.603 $\pm$ 17.718 | 4.070 $\pm$ 0.420 | 64.833 $\pm$ 23.136 |
| DRL | -30.577 $\pm$ 26.242 | 54.148 $\pm$ 43.420 | 4.209 $\pm$ 1.023 | 166.861 $\pm$ 93.985 |
| IntelliLight | **-3.892** $\pm$ 7.609 | **10.238** $\pm$ 20.949 | **2.730** $\pm$ 1.086 | **50.487** $\pm$ 46.439 |
Table 3.10: Performance of different methods on real-world data. The number after $\pm$ is the standard deviation. Reward: the higher the better. Other measures: the lower the better.
traffic on WE is usually heavier than traffic on SN. As shown in Figure 3.9, the dotted lines indicate the number of arriving cars per second on the two directions. Along with the arrival rate, we also plot the change of phases (dashed areas). It can be seen from Figure 3.9 (a) that: 1) the overall time of phase Red-WE is longer than Green-WE, which is compatible with the traffic volume at this time; 2) although the traffic volume on SN is larger than on WE, the traffic light change from Red-WE to Green-WE is usually not triggered by waiting cars on the WE direction. On the contrary, in Figure 3.9 (b) and Figure 3.9 (c), the change from Green-WE to Red-WE is usually triggered by waiting cars on the SN direction. This is mainly because the road on WE is the main road during these time periods, and the traffic light tends to favor phase Green-WE.

3.6 Conclusion

In this chapter, we address the traffic light control problem using a well-designed reinforcement learning approach. We conduct extensive experiments using both synthetic
Figure 3.8: Average arrival rate on two directions (WE and SN) and time duration ratio of two phases (Green-WE and Red-WE) from learned policy for Jingliu Road (WE) and Erhuanxi Auxiliary Road (SN) in Jinan on August 1st and August 7th, 2016.
and real-world data and demonstrate the superior performance of our proposed method over state-of-the-art methods. In addition, we show in-depth case studies and observations to understand how the agent adjusts to changing traffic, as a complement to quantitative reward measures. These in-depth case studies can help generate traffic rules for real-world applications.
We also acknowledge the limitations of our current approach and would like to point out several important future directions to make the method more applicable to the real world. First, we can extend the two-phase traffic light to a multi-phase traffic light, which involves more complicated but more realistic state transitions. Second, this chapter addresses a simplified single-intersection case, whereas real-world road networks are much more complicated. Although some studies have tried to solve the multi-intersection problem by using multiple reinforcement learning agents, they do not explicitly consider the interactions between different intersections (i.e., how the phase of one intersection affects the state of nearby intersections), and they are still limited to a small number of intersections. Lastly, our approach is still tested in a simulation framework, so the feedback is simulated. Ultimately, a field study should be conducted to learn the real-world feedback and to validate the proposed reinforcement learning approach.
Figure 3.9: Detailed average arrival rate on two directions (dotted lines) and changes of two phases (dashed areas) in three periods of time for Jingliu Road (WE) and Erhuanxi Auxiliary Road (SN) in Jinan on August 1st, 2016. X-axis of each figure indicates the time of a day; left Y-axis of each figure indicates the number of cars approaching the intersection every second; right Y-axis for each figure indicates the phase over time.

Chapter 4 Formulating the Learning Objective

This chapter formulates RL-based traffic signal control methods with theoretical proofs and corresponds to our paper "PressLight: Learning Network-level Cooperation for Traffic Signal Control" [17].

4.1 Introduction

Traffic signals coordinate the traffic movements at an intersection, and a smart traffic signal control algorithm is the key to transportation efficiency. Traffic signal control remains an active research topic because of the high complexity of the problem. Traffic situations are highly dynamic and thus require traffic signal plans that can adapt to different situations.
Recently, people have started to investigate reinforcement learning (RL) techniques for traffic signal control. Several studies have shown the superior performance of RL techniques over traditional transportation approaches [9, 10, 11, 46, 78, 80]. The biggest advantage of RL is that it directly learns how to take the next action by observing the feedback from the environment after previous actions.
One major issue of current RL-based traffic signal control approaches is that the setting is often heuristic and lacks proper theoretical justification from transportation literature. This often results in highly sensitive performance w.r.t. the setting and leads to a long learning process. We elaborate on this issue by examining two fundamental elements in RL setting: reward and state.
First, various reward designs have been proposed in the literature. The reason is that travel time, the ultimate objective, is hard to optimize directly. Travel time is a long-term reward that depends on a sequence of actions, so the effect of one action can hardly be reflected in terms of travel time. People thus choose short-term rewards like queue length or delay to approximate travel time [21], and the reward function is often defined as a weighted sum of these terms [10, 11, 79, 86, 87]. However, as shown in Figure 4.1(a), tuning the weights on these terms can lead to largely different results in terms of travel time. Some literature [1] discusses how to define the reward by connecting it with existing transportation methods, but it only focuses on controlling a single intersection. In this chapter, we focus on the multi-intersection control scenario.
Second, existing RL methods show a trend toward more complicated state representations. Recent studies use visual images to describe the full traffic situation at the intersection [10, 11], which results in state dimensions on the scale of thousands. In the single-intersection scenario, [1] reveals that additional information is not always helpful. Similar conclusions can also be found in the multi-intersection scenario. As shown in Figure 4.1(b), complicated state definitions increase the learning time and may not necessarily bring significant gains. Note that we are not claiming that additional information is never helpful; the choice of the state depends on the reward setting. Based on the reward design of LIT [1], neighboring information is not necessary in the case shown in Figure 4.1(b). The question is: can we justify theoretically how much information is enough in the state definition in order to optimize the reward function?
The challenges we face in RL motivate us to look for support from transportation. In the transportation literature, max pressure (MP) control is one of the state-of-the-art methods in traffic signal control [88, 89]. The key idea of MP is to minimize the "pressure" of an
Figure 4.1: Performance of RL approaches is sensitive to reward and state. (a) A heuristic parameter tuning of reward function could result in different performances. (b) The method with a more complicated state (LIT[1] w/ neighbor) has a longer learning time but does not necessarily converge to a better result.
intersection, which can be loosely defined as the number of vehicles on incoming lanes minus the number of vehicles on outgoing lanes. Figure 4.2 illustrates the concept of pressure. By setting the objective as minimizing the pressure of intersections, MP is proven to maximize the throughput of the whole road network 1. However, the solution of MP is greedy, which leads to locally optimal solutions.
Footnote 1: Maximizing throughput is equivalent to minimizing travel time under certain conditions, and minimizing travel time is the final goal of most traffic signal control problems.
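To make the notion of pressure concrete, the sketch below (plain Python; the greedy phase-selection helper, the lane names, and the vehicle counts are illustrative assumptions following the loose definition above, not the exact MP formulation proved in [89]) computes intersection pressure and picks the phase that relieves the most of it:

```python
def pressure(incoming_counts, outgoing_counts):
    """Loose definition of the pressure of an intersection used in the text:
    vehicles on incoming lanes minus vehicles on outgoing lanes."""
    return sum(incoming_counts) - sum(outgoing_counts)

def max_pressure_phase(phase_to_lanes, lane_counts):
    """Greedy max-pressure choice: pick the phase whose permitted movements
    relieve the largest pressure. phase_to_lanes maps each phase to the
    (incoming, outgoing) lane pairs it allows; all names are illustrative."""
    def phase_pressure(pairs):
        return sum(lane_counts[l] - lane_counts[m] for l, m in pairs)
    return max(phase_to_lanes, key=lambda p: phase_pressure(phase_to_lanes[p]))

# toy usage for a two-phase intersection
lane_counts = {"N_in": 6, "S_out": 1, "E_in": 2, "W_out": 3}
phases = {"NS": [("N_in", "S_out")], "EW": [("E_in", "W_out")]}
print(max_pressure_phase(phases, lane_counts))    # "NS": pressure 5 vs. -1
```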
Our proposed solution is based on RL but theoretically grounded by MP method. The connection between RL and MP is that both approaches can essentially be framed as an optimization problem. In RL, long term reward is the objective for optimization and the solution is derived from trial-and-error search. In MP, the objective is to minimize pressure and the solution is derived from a greedy algorithm. Intuitively, if we set our reward function the same as the objective of MP, we can achieve the same result as MP. We first prove that under the assumption of no physical queue expansion, both our method and MP are maximizing throughput of the network. We further show that our method can relax the assumption on queue expansion and the conclusion still holds.
To further address the challenge on state design, we describe the system dynamics using the state features based on MP. MP provides evolution equations to formulate the state transition of the traffic as a Markov chain [90]. In RL, the Markov decision process formally describes the dynamics of an environment. By including the variables from the
Figure 4.2: Illustration of max pressure control in two cases. In Case A, the green signal is set in the North$\rightarrow$South direction; in Case B, the green signal is set in the East$\rightarrow$West direction.
evolution equation into the state definition in RL, the state becomes a sufficient statistic for the system dynamics.
We conduct comprehensive experiments using both synthetic and real data. We test our method under different traffic flows and network structures. We demonstrate the power of RL methods over traditional transportation approaches, as RL optimizes the objective through trial and error. Our method also consistently outperforms state-of-the-art RL methods, which shows that a theoretically supported reward design is necessary and that a concise state design leads to an efficient learning process. We further discuss several interesting policies learned by our method to show that it can achieve coordination along an arterial.
Individual Traffic Signal Control. Individual traffic signal control has been investigated extensively in the field of transportation, which tries to optimize the travel time or delay of vehicles [91, 92, 93, 94, 95], assuming that vehicles arrive and move in a specific pattern. Recently, reinforcement learning based methods attempt to address this problem by directly learning from the data [9, 27]. Earlier work using tabular Q-learning [87, 96] can only deal with discrete state representations. Recent work using deep RL [10, 11, 76, 97, 98] can cope with more complex continuous state representations. [1] noticed that more complex state definitions do not always lead to better performance. In [1], the authors also investigated a proper reward design grounded by individual intersection control methods in the transportation field. In this chapter, we focus on the multi-intersection scenario.
Conventional Multi-intersection Traffic Signal Control. In conventional multi-intersection control, coordination can be achieved by setting a fixed offset (i.e., the time interval between the beginnings of green lights) among all intersections along an arterial [99]. In fact, this is not an easy task, given that traffic in opposite directions usually cannot be facilitated simultaneously. To solve this problem, some optimization-based methods [100, 101] have been developed to minimize vehicle travel time and/or the number of stops at multiple intersections. Instead of optimizing offsets, max pressure [89, 90] aims to maximize the throughput of the network so as to minimize travel time. However, these approaches still rely on assumptions to simplify the traffic conditions and do not guarantee optimal results in the real world.
RL-based Multi-intersection Traffic Signal Control. Since recent advances in RL improve the performance of isolated traffic signal control [11, 1], efforts have been made to design strategies that control multiple intersections. One way is to jointly model the actions of the learning agents with centralized optimization [75, 10]. Since these methods [10, 75] need to negotiate between the agents in the whole network, they are computationally expensive. Another way is to use decentralized RL agents to control the traffic signals in a multi-intersection system [46, 79, 102]. Since each agent makes its own decision based on information from itself and neighboring intersections without centralized decision-making, decentralized methods may be more scalable and practicable: by plugging new intersection controllers into the system, decentralized systems are easy to scale. Our proposed method also follows this direction.
We note the recent trend of varying the definitions of state and reward in RL for traffic signal control. Readers interested in a detailed comparison of state and reward definitions can refer to [21]. Ours is the first RL method that is theoretically grounded by traditional transportation methods to coordinate the traffic signals along an arterial.

4.3 Preliminaries and Notations

In this section, we re-iterate the preliminaries mentioned in Chapter 2 and provide necessary notations for this chapter.
Definition 1 (Incoming lane and outgoing lane of an intersection).: An incoming lane of an intersection is a lane where traffic enters the intersection. An outgoing lane of an intersection is a lane where traffic leaves the intersection. We denote the set of incoming lanes and outgoing lanes of an intersection as $L_{in}$ and $L_{out}$ respectively.
Definition 2 (Traffic movement).: A traffic movement is defined as the traffic traveling across an intersection from one incoming lane to an outgoing lane. We denote a traffic movement from lane $l$ to lane $m$ as $(l,m)$.
Definition 3 (Movement signal and phase).: A movement signal is defined on a traffic movement, with a green signal indicating that the corresponding movement is allowed and a red signal indicating that the movement is prohibited. We denote a movement signal as $a(l,m)$, where $a(l,m)=1$ indicates the green light is on for movement $(l,m)$, and $a(l,m)=0$ indicates the red light is on for movement $(l,m)$. A phase is a combination of movement signals. We denote a phase as $p=\{(l,m)\,|\,a(l,m)=1\}$, where $l\in L_{in}$ and $m\in L_{out}$