
The Pennsylvania State University
The Graduate School
DEEP REINFORCEMENT LEARNING FOR TRAFFIC SIGNAL CONTROL
A Dissertation in
Information Sciences and Technology
by
Hua Wei
© 2020 Hua Wei
Submitted in Partial Fulfillment
of the Requirements
for the Degree of
Doctor of Philosophy
December 2020

The dissertation of Hua Wei was reviewed and approved by the following:
Zhenhui (Jessie) Li
Associate Professor of Information Sciences and Technology
Dissertation Adviser
Chair of Committee
C. Lee Giles
Professor of Information Sciences and Technology
Xiang Zhang
Associate Professor of Information Sciences and Technology
Vikash V. Gayah
Associate Professor of Civil and Environmental Engineering
Mary Beth Rosson
Professor of Information Sciences and Technology
Graduate Program Chair

Abstract

Traffic congestion is a growing problem that continues to plague urban areas, with negative outcomes for both the traveling public and society as a whole. Signalized intersections are one of the most prevalent bottleneck types in urban environments, and thus traffic signal control tends to play a large role in urban traffic management. Nowadays the widely-used traffic signal control systems (e.g., SCATS and SCOOT) are still based on manually designed traffic signal plans. Recently, emerging research studies have used reinforcement learning (RL) to tackle the traffic signal control problem. In this dissertation, we propose to use reinforcement learning to intelligently optimize signal timing plans in real time to reduce traffic congestion.
Although some efforts using reinforcement learning (RL) techniques have been proposed to adjust traffic signals dynamically, they only use ad-hoc designs when formulating the traffic signal control problem and lack a principled approach to formulating the problem under the framework of RL. Secondly, since RL directly learns from the data via a trial-and-error search, it requires a substantial number of interactions with the environment before the algorithms converge. In real-world problems, every interaction means real cost (e.g., traffic congestion, traffic accidents). Hence, a more data-efficient method is necessary. Thirdly, discrepancies between simulation and reality confine the application of RL in the real world, despite its massive success in domains like games. Most RL methods mainly conduct experiments in a simulator, since the simulator can generate data in a cheaper and faster way than real experimentation. Hence, addressing the performance gap of RL methods between simulation and the real world is required for applying RL in the real world.
This dissertation presents how to utilize mobility data and RL-based methods for traffic signal control. I have investigated the key challenges for RL-based traffic signal control methods, including how to formulate the objective function and improve learning efficiency for city-wide traffic signal control. Besides, I have also managed to mitigate the performance gap of RL methods between simulation and the real world. We have achieved significant improvement over the state-of-the-art or currently employed methods, which provides promising solutions to traffic signal control problems and implications for smart city applications.

Table of Contents

List of Figures
List of Tables

Acknowledgements

Chapter 1 Introduction
* 1.1 Why do we need a more intelligent traffic signal control
* 1.2 Why do we use reinforcement learning for traffic signal control
* 1.3 Why is RL for traffic signal control challenging?
	* 1.3.1 Formulation of RL agent
	* 1.3.2 Learning cost
	* 1.3.3 Simulation
* 1.4 Previous Studies
* 1.5 Proposed Tasks
* 1.6 Overview of this Dissertation

Chapter 2 Notation, Background and Literature

* 2.1 Preliminaries of Traffic Signal Control Problem
	* 2.1.1 Term Definition
	* 2.1.2 Objective
	* 2.1.3 Special Considerations
* 2.2 Background of Reinforcement Learning
	* 2.2.1 Single Agent Reinforcement Learning
	* 2.2.2 Problem setting
* 2.3 Literature
	* 2.3.1 Agent formulation
		* 2.3.1.1 Reward
		* 2.3.1.2 State
		* 2.3.1.3 Action scheme
	* 2.3.2 Policy Learning
		* 2.3.2.1 Value-based methods
		* 2.3.2.2 Policy-based methods

	* 2.3.3 Coordination
		* 2.3.3.1 Joint action learners
		* 2.3.3.2 Independent learners
		* 2.3.3.3 Sizes of Road network
* 2.4 Conclusion

Chapter 4 Formulating the Learning Objective

* 4.1 Introduction
* 4.2 Related Work
* 4.3 Preliminaries and Notations
* 4.4 Method
	* 4.4.1 Agent Design
	* 4.4.2 Learning Process
* 4.5 Justification of RL agent
	* 4.5.1 Justification for State Design
		* 4.5.1.1 General description of traffic movement process as a Markov chain
		* 4.5.1.2 Specification with proposed state definition
	* 4.5.2 Justification for Reward Design
		* 4.5.2.1 Stabilization on traffic movements with proposed reward.
		* 4.5.2.2 Connection to throughput maximization and travel time minimization.
* 4.6 Experiment
	* 4.6.1 Dataset Description
	* 4.6.2 Experimental Settings
		* 4.6.2.1 Environmental settings
		* 4.6.2.2 Evaluation metric
		* 4.6.2.3 Compared methods
	* 4.6.3 Performance Comparison
	* 4.6.4 Study of PressLight
		* 4.6.4.1 Effects of variants of our proposed method
		* 4.6.4.2 Average travel time related to pressure.
	* 4.6.5 Performance on Mixed Scenarios
		* 4.6.5.1 Heterogeneous intersections
		* 4.6.5.2 Arterials with a different number of intersections and network
	* 4.6.6 Case Study
		* 4.6.6.1 Synthetic traffic on the uniform, uni-directional flow
			* 4.6.6.1.1 Performance comparison
			* 4.6.6.1.2 Policy learned by RL agents
		* 4.6.6.2 Real-world traffic in Jinan
* 4.7 Conclusion
Chapter 5 Improving Learning Efficiency
* 5.1 Introduction
* 5.2 Related Work
* 5.3 Problem Definition
* 5.4 Method
	* 5.4.1 Observation Embedding
	* 5.4.2 Graph Attention Networks for Cooperation
		* 5.4.2.1 Observation Interaction
		* 5.4.2.2 Attention Distribution within Neighborhood Scope
		* 5.4.2.3 Index-free Neighborhood Cooperation
		* 5.4.2.4 Multi-head Attention
	* 5.4.3 Q-value Prediction
	* 5.4.4 Complexity Analysis
		* 5.4.4.1 Space Complexity
		* 5.4.4.2 Time Complexity
* 5.5 Experiments
	* 5.5.1 Settings
	* 5.5.2 Datasets
		* 5.5.2.1 Synthetic Data
		* 5.5.2.2 Real-world Data
	* 5.5.3 Compared Methods
	* 5.5.4 Evaluation Metric
	* 5.5.5 Performance Comparison
		* 5.5.5.1 Overall Analysis
		* 5.5.5.2 Convergence Comparison
	* 5.5.6 Scalability Comparison
		* 5.5.6.1 Effectiveness
		* 5.5.6.2 Training time
	* 5.5.7 Study of CoLight
		* 5.5.7.1 Impact of Neighborhood Definition
		* 5.5.7.2 Impact of Neighbor Number
		* 5.5.7.3 Impact of Attention Head Number
* 5.6 Conclusion

Chapter 6 Learning to Simulate

* 6.1 Introduction
* 6.2 Related Work
* 6.3 Preliminaries
* 6.4 Method
	* 6.4.1 Basic GAIL Framework
	* 6.4.2 Imitation with Interpolation
		* 6.4.2.1 Generator in the simulator
		* 6.4.2.2 Downsampling of generated trajectories
		* 6.4.2.3 Interpolation-Discriminator
			* 6.4.2.3.1 Interpolator module
			* 6.4.2.3.2 Discriminator module
			* 6.4.2.3.3 Loss function of Interpolation-Discriminator
	* 6.4.3 Training and Implementation
* 6.5 Experiment
	* 6.5.1 Experimental Settings
		* 6.5.1.1 Dataset
			* 6.5.1.1.1 Synthetic Data
			* 6.5.1.1.2 Real-world Data
		* 6.5.1.2 Data Preprocessing
	* 6.5.2 Compared Methods
		* 6.5.2.1 Calibration-based methods
		* 6.5.2.2 Imitation learning-based methods
	* 6.5.3 Evaluation Metrics
	* 6.5.4 Performance Comparison
	* 6.5.5 Study of ImIn-GAIL
		* 6.5.5.1 Interpolation Study
		* 6.5.5.2 Sparsity Study
	* 6.5.6 Case Study
* 6.6 Conclusion
Chapter 7 Conclusion and Future Directions

* 7.1 Evolving behavior with traffic signals
* 7.2 Benchmarking datasets and baselines
* 7.3 Learning efficiency
* 7.4 Safety issue
* 7.5 Transferring from simulation to reality

Bibliography

List of Figures
  • 2.1 Definitions of traffic movement and traffic signal phases.
  • 2.2 RL framework for traffic signal control.
  • 3.1 Reward is not a comprehensive measure to evaluate traffic light control performance. Both policies will lead to the same rewards, but policy #1 is more suitable than policy #2 in the real world.
  • 3.2 Case A and case B have the same environment except the traffic light phase.
  • 3.3 Model framework
  • 3.4 Q-network
  • 3.5 Memory palace structure
  • 3.6 Traffic surveillance cameras in Jinan, China
  • 3.7 Percentage of the time duration of the learned policy for phase Green-WE (green light on the W-E and E-W directions, red light on the N-S and S-N directions) in every 2000 seconds for different methods under configuration 4.
  • 3.8 Average arrival rate on two directions (WE and SN) and time duration ratio of two phases (Green-WE and Red-WE) from the learned policy for Jingliu Road (WE) and Erhuanxi Auxiliary Road (SN) in Jinan on August 1st and August 7th, 2016.
  • 3.9 Detailed average arrival rate on two directions (dotted lines) and changes of two phases (dashed areas) in three periods of time for Jingliu Road (WE) and Erhuanxi Auxiliary Road (SN) in Jinan on August 1st, 2016. The X-axis of each figure indicates the time of day; the left Y-axis indicates the number of cars approaching the intersection every second; the right Y-axis indicates the phase over time.
  • 4.1 Performance of RL approaches is sensitive to reward and state. (a) A heuristic parameter tuning of the reward function could result in different performances. (b) The method with a more complicated state (LIT [1] w/ neighbor) has a longer learning time but does not necessarily converge to a better result.
  • 4.2 Illustration of max pressure control in two cases. In Case A, the green signal is set in the North→South direction; in Case B, the green signal is set in the East→West direction.
  • 4.3 The transition of traffic movements.
  • 4.4 Real-world arterial network for the experiment.
  • 4.5 Convergence curve of average duration and our reward design (pressure). Pressure shows the same convergence trend as travel time.
  • 4.6 Average travel time of our method on heterogeneous intersections. (a) Different number of legs. (b) Different length of lanes. (c) Experiment results.
  • 4.7 Performance comparison under uniform unidirectional traffic, where the optimal solution is known (GreenWave). Only PressLight can achieve the optimal.
  • 4.8 Offsets between intersections learnt by RL agents under uni-directional uniform traffic (700 vehicles/hour/lane on the arterial)
  • 4.9 Space-time diagram with signal timing plan to illustrate the learned coordination strategy from real-world data on the arterial of Qingdao Road in the morning (around 8:30 a.m.) on August 6th.
  • 5.1 Illustration of index-based concatenation. Thick yellow lines are the arterials and grey thin lines are the side streets. With index-based concatenation, A and B's observations will be aligned as model inputs in a fixed order. These two inputs will confuse the model shared by A and B.
  • 5.2 Left: Framework of the proposed CoLight model. Right: variation of cooperation scope (light blue shadow, from one-hop to two-hop) and attention distribution (colored points, the redder, the more important) of the target intersection.
  • 5.3 Road networks for real-world datasets. Red polygons are the areas we select to model, blue dots are the traffic signals we control. Left: 196 intersections with uni-directional traffic; middle: 16 intersections with uni- & bi-directional traffic; right: 12 intersections with bi-directional traffic.
  • 5.4 Convergence speed of CoLight (red continuous curves) and the other 5 RL baselines (dashed curves) during training. CoLight starts with the best performance (Jumpstart), reaches the pre-defined performance the fastest (Time to Threshold), and ends with the optimal policy (Asymptotic). Curves are smoothed with a moving average of 5 points.
  • 5.5 The training time of different models for 100 episodes. CoLight is efficient across all the datasets. The bar for Individual RL on $D_{NewYork}$ is shadowed as its running time is far beyond the acceptable time.
  • 5.6 Performance of CoLight with respect to different numbers of neighbors ($|\mathcal{N}_{i}|$) on datasets $D_{Hangzhou}$ (left) and $D_{Jinan}$ (right). More neighbors ($|\mathcal{N}_{i}|\leq 5$) for cooperation brings better performance, but too many neighbors ($|\mathcal{N}_{i}|>5$) require more time (200 episodes or more) to learn.
  • 6.1 Illustration of a driving trajectory. In the real-world scenario, only part of the driving points can be observed and form a sparse driving trajectory (in red dots). Each driving point includes a driving state and an action of the vehicle at the observed time step. Best viewed in color.
  • 6.2 Proposed ImIn-GAIL approach. The overall framework of ImIn-GAIL includes three components: generator, downsampler, and interpolation-discriminator. Best viewed in color.
  • 6.3 Proposed interpolation-discriminator network.
  • 6.4 Illustration of road networks. (a) and (b) are synthetic road networks, while (c) and (d) are real-world road networks.
  • 6.5 RMSE on time and position of our proposed method ImIn-GAIL under different levels of sparsity. As the expert trajectory becomes denser, a policy more similar to the expert policy is learned.
  • 6.6 The generated trajectory of a vehicle in the Ring scenario. Left: the initial positions of the vehicles. Vehicles can only be observed when they pass four locations A, B, C and D where cameras are installed. Right: the visualization of the trajectory of Vehicle 0. The x-axis is the timestep in seconds; the y-axis is the relative road distance in meters. Although vehicle 0 is only observed three times (red triangles), ImIn-GAIL (blue points) can imitate the position of the expert trajectory (grey points) more accurately than all other baselines. Better viewed in color.
List of Tables
  • 2.1 Representative deep RL-based traffic signal control methods. Due to page limits, we only put part of the investigated papers here.
  • 3.1 Notations
  • 3.2 Settings for our method
  • 3.3 Reward coefficients
  • 3.4 Configurations for synthetic traffic data
  • 3.5 Details of real-world traffic dataset
  • 3.6 Performance on configuration 1. Reward: the higher the better. Other measures: the lower the better. Same with the following tables.
  • 3.7 Performance on configuration 2
  • 3.8 Performance on configuration 3
  • 3.9 Performance on configuration 4
  • 3.10 Performances of different methods on real-world data. The number after $\pm$ means standard deviation. Reward: the higher the better. Other measures: the lower the better.
  • 4.1 Summary of notation.
  • 4.2 Configurations for synthetic traffic data
  • 4.3 Data statistics of real-world traffic dataset
  • 4.4 Performance comparison between all the methods in the arterial with 6 intersections w.r.t. average travel time (the lower the better). Top-down: conventional transportation methods, learning methods, and our proposed method.
  • 4.5 Detailed comparison of our proposed state and reward design and their effects w.r.t. average travel time (the lower the better) under synthetic traffic data.
  • 4.6 Average travel time of different methods under arterials with a different number of intersections and network.
  • 5.1 Data statistics of real-world traffic dataset
  • 5.2 Performance on synthetic data and real-world data w.r.t. average travel time. CoLight is the best.
  • 5.3 Performance of CoLight with respect to different numbers of attention heads ($H$) on dataset $Grid_{6\times 6}$. More types of attention ($H\leq 5$) enhance model efficiency, while too many ($H>5$) could distract the learning and deteriorate the overall performance.
  • 6.1 Features for a driving state
  • 6.2 Hyper-parameter settings for ImIn-GAIL
  • 6.3 Statistics of dense and sparse expert trajectories in different datasets
  • 6.4 Performance w.r.t. Relative Mean Squared Error (RMSE) of time (in seconds) and position (in kilometers). All the measurements are conducted on dense trajectories. The lower the better. Our proposed method ImIn-GAIL achieves the best performance.
  • 6.5 RMSE on time and position of our proposed method ImIn-GAIL against baseline methods and their corresponding two-step variants. Baseline methods and ImIn-GAIL learn from sparse trajectories, while the two-step variants interpolate sparse trajectories first and are trained on the interpolated data. ImIn-GAIL achieves the best performance in most cases.

Acknowledgements

I would like to express my sincere appreciation to my doctoral committee, Dr. Zhenhui (Jessie) Li, Dr. C. Lee Giles, Dr. Xiang Zhang, Dr. Vikash V. Gayah, and my Ph.D. program chair, Dr. Mary Beth Rosson.
I would like to show my special appreciation to my Ph.D. adviser, Dr. Zhenhui (Jessie) Li. Jessie patiently guided me through the research process and career development, including several open and enlightening discussions on how to cope with work-life balance, pressure and priorities. Her supportive and critical attitude has made our fruitful collaboration possible. Her patience and kindness have helped me manage a smooth career development. Without her guidance, my Ph.D. would surely not have been as successful.
I would also like to extend my thanks to Dr. Vikash Gayah from the Department of Civil and Environmental Engineering, whose insightful advice forms some of the foundations of our interdisciplinary research and who has been more than happy to help me become familiar with the background material.
I am also fortunate to have collaborated with several others throughout my Ph.D.: Guanjie Zheng, Chacha Chen, Porter Jenkins, Zhengyao Yu, Kan Wu, Huaxiu Yao, Nan Xu, Chang Liu, Yuandong Wang. Without them, this research would not be possible.
The studies in this dissertation have been supported by NSF awards #1544455, #1652525, #1618448, and #1639150. The views and conclusions contained in the studies are those of the authors and should not be interpreted as representing any funding agencies. Thanks for their generous support in making this research happen.
During my several years of the Ph.D. program, I was lucky to have been surrounded by wonderful colleagues and friends. Thank you all for sharing Friday game nights with me, for the road trips and getaways on weekends, and for the summer twilight barbecues. All of you made my time at Penn State a wonderful experience.
At last, a very special thank you to my parents, Shaozhu and Yuling, and to my whole family, for always supporting me and for always encouraging me to pursue my passion.

Chapter 1 Introduction

Traffic congestion is a growing problem that continues to plague urban areas, with negative outcomes for both the traveling public and society as a whole. These negative outcomes will only grow over time as more people flock to urban areas. In 2014, traffic congestion cost Americans over $160 billion in lost productivity and wasted over 3.1 billion gallons of fuel [2]. Traffic congestion was also attributed to over 56 billion pounds of harmful CO2 emissions in 2011 [3]. In the European Union, the cost of traffic congestion was equivalent to 1% of the entire GDP [4]. Mitigating congestion would have significant economic, environmental and societal benefits. Signalized intersections are one of the most prevalent bottleneck types in urban environments, and thus traffic signal control tends to play a large role in urban traffic management.

1.1 Why do we need a more intelligent traffic signal control

The majority of small and big cities, even in industrialized countries, are still operating old-fashioned fixed-time signal control strategies, often poorly optimized or maintained. Even when modern traffic-responsive control systems are installed (e.g., SCATS [5] and SCOOT [6, 7]), the employed control strategies are sometimes naive, mainly based on manually designed traffic signal plans.
On the other hand, nowadays various kinds of traffic data can be collected to enrich the information about traffic conditions. Systems like SCATS or SCOOT mainly rely on loop sensor data to choose the signal plans. However, loop sensors only count a vehicle when it passes the sensor, while an increasing amount of traffic data is now being collected from various sources such as GPS-equipped vehicles, navigational systems, and traffic surveillance cameras. How to use rich traffic data to better optimize our traffic signal control systems has attracted more and more attention from academia, government and industry.

1.2 Why do we use reinforcement learning for traffic signal control

In the transportation field, traffic signal control is one of the most fundamental research questions [8]. The typical approach that transportation researchers take is to seek an optimization solution under certain assumptions about traffic models [8]. However, most of these works focus only on automobiles, and the assumptions they make are simplified and do not necessarily hold true in the field. Real traffic behaves in a complicated way, affected by many factors such as drivers' preferences, interactions with vulnerable road users (e.g., pedestrians, cyclists, etc.), weather and road conditions. These features can hardly be modelled accurately for optimizing traffic signal control.
On the other hand, machine learning techniques learn directly from the observed data without making assumptions about the data model. However, the traffic signal control problem is not a typical machine learning problem with fixed data sets for training and testing. The real-world traffic is constantly changing, and the execution of traffic lights is also changing the traffic. Therefore, in the case of ever-changing data samples, we need to learn from the feedback from the environment. This idea of trial and error is the essence of RL. RL attempts different traffic signal control strategies based on the current traffic environment, and the model learns and adjusts strategies based on environmental feedback.

1.3 Why is RL for traffic signal control challenging?

1.3.1 Formulation of RL agent

A key question for RL is how to define the reward and state. In existing studies [9, 10, 11], a typical reward definition for traffic signal control is a weighted linear combination of several components such as queue length, waiting time, number of switches in the traffic signal, and sum of delay. The state includes components such as queue length, number of cars, waiting time, and current traffic signal. In recent work [10, 11], images of vehicles' positions on the roads are also included in the state.
However, all of the existing work takes an ad-hoc approach to defining the reward and state. Such an ad-hoc approach causes several problems that hinder the application of RL in the real world. First, the engineering details in formulating the reward and state could significantly affect the results. For example, if the reward is defined as a weighted linear combination of several terms, the weights on each term are tricky to set, and minor differences in the weight setting could lead to dramatically different results. Second, the state representation could be in a high-dimensional space, especially when using traffic images as part of the state representation [10, 11]. Such a high-dimensional state representation needs many more training data samples to learn and may not even converge. Third, there is no connection between existing RL approaches and transportation methods. Without the support of transportation theory, it is highly risky to apply these purely data-driven RL-based approaches in the real physical world.

1.3.2 Learning cost

While learning from trial and error is the key idea in RL, the learning cost of RL is fatal for real-world applications. Although RL algorithms are very useful for learning good solutions when the model of the environment is unknown in advance [12, 13], the solutions may only be achieved after an extensive number of trials and errors, which is usually very time consuming. While existing RL methods for games (e.g., Go or Atari games) yield impressive results in simulated environments, the cost of error in traffic signal control is critical, even fatal, in the real world. Therefore, how to learn efficiently (e.g., learning from limited data samples, sampling training data in an adaptive way, transferring learned knowledge) is a critical question for the application of RL in traffic signal control.

1.3.3 Simulation

Reinforcement learning (RL) has shown great success in a series of artificial intelligence (AI) domains such as Go [14]. Despite its huge success in AI domains, RL has not yet shown the same degree of success in real-world applications [15]. These applications could involve control systems such as drones [16] and systems that interact with people such as traffic signal control [11]. As people analyze the challenges in all these scenarios, a recurring theme emerges: there is rarely a good simulator [15].
1.4 Previous Studies

Some work has already been done on investigating reinforcement learning methods for traffic signal control, focusing on different scales of traffic signal control, including isolated intersection control [4-6], arterial control [7], and region control [8-10]. Most of the previous studies use different kinds of features in the state and reward design, while we propose a formulation of reinforcement learning in connection with transportation theory. Furthermore, we propose to investigate how to learn efficiently to reduce the learning cost in the real-world setting. We will discuss more about the literature in Chapter 2, Chapter 3 and Chapter 4.

1.5 Proposed Tasks

In this dissertation, we propose to use RL for traffic signal control problems in a way that combines transportation-guided RL with efficient learning. The first part will help the formulation of RL towards a reasonable state and reward design. The connection between RL approaches and transportation theory can help RL optimize towards a correct objective and condense the state space for convergence. The second part will help reduce the learning cost of RL and enable the application of RL in the real world. The third part will try to tackle the real-world application issue by building a realistic simulator. Specifically, we elaborate RL-based traffic signal control methods with real-world implications from the following topics:
  • Formulation of RL with theoretical justification. How to formulate the RL agent, i.e., the reward and state definition, for the traffic signal control problem? Is it always beneficial if we use complex reward functions and state representations? In [7, 1, 17], I proposed to simplify the intricate designs in the current literature and use a concise reward and state design, with theoretical proof that the concise design optimizes towards the global optimum, for both single-intersection and multi-intersection control scenarios.
  • Optimization of the learning process in RL. Is a vanilla deep neural network the solution to the traffic signal control problem with RL? In [11], I tackle the sample imbalance problem with a phase-gated network to emphasize certain features. In [18], we design a network to model the priority between actions and learn changes of traffic flow such as flipping and rotation. Coordination could benefit signal control in multi-intersection scenarios. In [19], I contributed a framework that enables agents to communicate their observations with each other and behave as a group. Though each agent has limited capabilities and visibility of the world, in this work, agents were able to cooperate with multi-hop neighboring intersections and learn the dynamic interactions between the hidden states of neighboring agents. Later in [20], we investigate the possibility of using RL to control traffic signals at the scale of a city, and test our RL methods in the simulator on the road network of Manhattan, New York, with 2510 traffic signals. This is the first time an RL-based method can operate at such a scale in traffic signal control.
  • Bridging the gap from simulation to reality. Despite its massive success in artificial domains, RL has not yet shown the same degree of success in real-world applications. These applications could involve control systems such as drones, software systems such as data centers, and systems that interact with people such as transportation systems. Currently, most RL methods mainly conduct experiments in a simulator, since the simulator can generate data in a cheaper and faster way than real experimentation. Discrepancies between simulation and reality confine the application of learned policies in the real world.

1.6 Overview of this Dissertation

In this chapter, we discussed the motivation, challenges and possible tasks of using reinforcement learning for traffic signal control. We will elaborate more on the details of work that has applied reinforcement learning in several scenarios. Chapter 2 covers the notation, background and necessary literature for traffic signal control. The basic formulation of RL-based traffic signal control and a further theory-guided RL method will be discussed in detail in Chapters 3 and 4. Chapter 5 will discuss how to improve learning efficiency with neighborhood information for coordinated intersections. Chapter 6 will tackle the challenge of real-world application of RL by building a realistic simulator, and Chapter 7 briefly discusses potential future work.

Chapter 2 Notation, Background and Literature

This chapter covers notation, background information and necessary literature that will be discussed throughout the rest of this dissertation. The main part of this chapter is adapted from our surveys [21] and [22].

2.1 Preliminaries of Traffic Signal Control Problem

2.1.1 Term Definition
Terms on road structure and traffic movement:
  • Approach: A roadway meeting at an intersection is referred to as an approach. At any general intersection, there are two kinds of approaches: incoming approaches and outgoing approaches. An incoming approach is one on which cars can enter the intersection; an outgoing approach is one on which cars can leave the intersection. Figure 2.1(a) shows a typical intersection with four incoming and four outgoing approaches. The southbound incoming approach is denoted in this figure as the approach on the north side on which vehicles are traveling in the southbound direction.
  • Lane: An approach consists of a set of lanes. Similar to the approach definition, there are two kinds of lanes: incoming lanes and outgoing lanes (also known as approaching/entering lanes and receiving/exiting lanes in some references [23, 24]).
  • Traffic movement: A traffic movement refers to vehicles moving from an incoming approach to an outgoing approach, denoted as $(r_{i}\to r_{o})$, where $r_{i}$ and $r_{o}$ are the incoming lane and the outgoing lane respectively. A traffic movement is generally categorized as left turn, through, or right turn.
Terms on traffic signal:
  • Movement signal: A movement signal is defined on a traffic movement, with a green signal indicating the corresponding movement is allowed and a red signal indicating the movement is prohibited. For the four-leg intersection shown in Figure 2.1(a), the right-turn traffic can pass regardless of the signal, and there are eight movement signals in use, as shown in Figure 2.1(b).
  • Phase: A phase is a combination of movement signals. Figure 2.1(c) shows the conflict matrix of the combination of two movement signals in the example in Figure 2.1(a) and Figure 2.1(b). A grey cell indicates that the corresponding two movements conflict with each other, i.e., they cannot be set to 'green' at the same time (e.g., signals #1 and #2). A white cell indicates non-conflicting movement signals. All the non-conflicting signals will generate eight valid paired-signal phases (letters 'A' to 'H' in Figure 2.1(c)) and eight single-signal phases (the diagonal cells in the conflict matrix). Here we letter the paired-signal phases only, because in an isolated intersection it is always more efficient to use paired-signal phases. When considering multiple intersections, a single-signal phase might be necessary because of the potential spill back.

Figure 2.1: Definitions of traffic movement and traffic signal phases.

  • Phase sequence: A phase sequence is a sequence of phases which defines a set of phases and their order of changes.
  • Signal plan: A signal plan for a single intersection is a sequence of phases and their corresponding starting times. Here we denote a signal plan as $(p_{1},t_{1})(p_{2},t_{2})\dots(p_{i},t_{i})\dots$, where $p_{i}$ and $t_{i}$ stand for a phase and its starting time.
  • Cycle-based signal plan: A cycle-based signal plan is a kind of signal plan where the sequence of phases operates in a cyclic order, which can be denoted as $(p_{1},t_{1}^{1})(p_{2},t_{2}^{1})\dots(p_{N},t_{N}^{1})(p_{1},t_{1}^{2})(p_{2},t_{2}^{2})\dots(p_{N},t_{N}^{2})\dots$, where $p_{1},p_{2},\dots,p_{N}$ is the repeated phase sequence and $t_{i}^{j}$ is the starting time of phase $p_{i}$ in the $j$-th cycle. Specifically, $C^{j}=t_{1}^{j+1}-t_{1}^{j}$ is the cycle length of the $j$-th phase cycle, and $\{\frac{t_{2}^{j}-t_{1}^{j}}{C^{j}},\dots,\frac{t_{N}^{j}-t_{N-1}^{j}}{C^{j}}\}$ is the phase split of the $j$-th phase cycle. Existing traffic signal control methods usually repeat a similar phase sequence throughout the day (a minimal sketch of this representation follows).
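To make the notation above concrete, the following is a minimal sketch (our illustration, not code from the dissertation) of how a cycle-based signal plan could be represented, with the cycle length $C^{j}$ and phase split computed exactly as defined above; the phase labels and timings are hypothetical.

```python
from typing import List, Tuple

# A signal plan is a sequence of (phase, starting time) pairs: (p1, t1)(p2, t2)...
SignalPlan = List[Tuple[str, float]]

# Hypothetical cycle-based plan with N = 4 phases repeated over two cycles (times in seconds).
plan: SignalPlan = [
    ("A", 0.0), ("B", 30.0), ("C", 45.0), ("D", 75.0),      # cycle 1
    ("A", 90.0), ("B", 120.0), ("C", 135.0), ("D", 165.0),  # cycle 2
]

def cycle_length(plan: SignalPlan, n_phases: int, j: int) -> float:
    """C^j = t_1^{j+1} - t_1^j: the length of the j-th phase cycle (j starts at 0 here)."""
    return plan[(j + 1) * n_phases][1] - plan[j * n_phases][1]

def phase_split(plan: SignalPlan, n_phases: int, j: int) -> List[float]:
    """The ratios {(t_2^j - t_1^j)/C^j, ..., (t_N^j - t_{N-1}^j)/C^j} from the definition above;
    the last phase's share of the cycle is the remainder."""
    c = cycle_length(plan, n_phases, j)
    starts = [t for _, t in plan[j * n_phases:(j + 1) * n_phases]]
    return [(starts[i + 1] - starts[i]) / c for i in range(n_phases - 1)]

print(cycle_length(plan, 4, 0))  # 90.0
print(phase_split(plan, 4, 0))   # ≈ [0.33, 0.17, 0.33]
```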

2.1.2 Objective

The objective of traffic signal control is to facilitate safe and efficient movement of vehicles at the intersection. Safety is achieved by separating conflicting movements in time and is not considered in more detail here. Various measures have been proposed to quantify efficiency of the intersection from different perspectives:
  • Travel time. In traffic signal control, the travel time of a vehicle is defined as the time difference between the time it enters the system and the time it leaves the system. One of the most common goals is to minimize the average travel time of vehicles in the network.
  • Queue length. The queue length of the road network is the number of queuing vehicles in the road network.
  • Number of stops. The number of stops of a vehicle is the total number of times that the vehicle comes to a stop.
  • Throughput. The throughput is the number of vehicles that have completed their trips in the road network during a given period. (A minimal sketch of computing these measures from trip records follows this list.)
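As a concrete illustration of the travel time and throughput measures, here is a minimal sketch (our own example, not from the dissertation) that computes them from hypothetical per-vehicle enter/exit records; vehicles still in the network are excluded from the travel-time average.

```python
from typing import Dict, Optional, Tuple

# Hypothetical trip records: vehicle id -> (enter time, exit time), exit time None if still driving.
trips: Dict[str, Tuple[float, Optional[float]]] = {
    "veh_0": (0.0, 310.0),
    "veh_1": (15.0, 290.0),
    "veh_2": (40.0, None),  # has not left the network yet
}

def average_travel_time(trips) -> Optional[float]:
    """Mean of (exit - enter) over vehicles that have completed their trips."""
    finished = [exit_t - enter_t for enter_t, exit_t in trips.values() if exit_t is not None]
    return sum(finished) / len(finished) if finished else None

def throughput(trips, t_start: float, t_end: float) -> int:
    """Number of vehicles that completed their trips within [t_start, t_end]."""
    return sum(1 for _, exit_t in trips.values() if exit_t is not None and t_start <= exit_t <= t_end)

print(average_travel_time(trips))     # 292.5
print(throughput(trips, 0.0, 300.0))  # 1
```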

2.1.3 Special Considerations

In practice, additional attention should be paid to the following aspects:
  1. Yellow and all-red time. A yellow signal is usually set as a transition from a green signal to a red one. Following the yellow, there is an all-red period during which all the signals in an intersection are set to red. The yellow and all-red time, which can last from 3 to 6 seconds, allows vehicles to stop safely or pass the intersection before vehicles in conflicting traffic movements are given a green signal.
  2. Minimum green time. Usually, a minimum green signal time is required to ensure pedestrians moving during a particular phase can safely pass through the intersection.
  3. Left turn phase. Usually, a left-turn phase is added when the left-turn volume is above a certain threshold.

2.2 Background of Reinforcement Learning

In this section, we first describe the reinforcement learning framework which constitutes the foundation of all the methods presented in this dissertation. We then provide background on conventional RL-based traffic signal control, including the problem of controlling a single intersection and multiple intersections.

2.2.1 Single Agent Reinforcement Learning

Usually a single agent RL problem is modeled as a Markov Decision Process represented by $\langle\mathcal{S},\mathcal{A},P,R,\gamma\rangle$, where the definitions are given as follows:
  • Set of state representations $\mathcal{S}$: At time step $t$, the agent observes state $s^{t}\in\mathcal{S}$.
  • Set of actions $\mathcal{A}$ and state transition function $P$: At time step $t$, the agent takes an action $a^{t}\in\mathcal{A}$, which induces a transition in the environment according to the state transition function $P(s^{t+1}|s^{t},a^{t}):\mathcal{S}\times\mathcal{A}\rightarrow\mathcal{S}$.
  • Reward function $R$: At time step $t$, the agent obtains a reward $r^{t}$ by a reward function $R(s^{t},a^{t}):\mathcal{S}\times\mathcal{A}\rightarrow\mathbb{R}$.
  • Discount factor $\gamma$: The goal of an agent is to find a policy that maximizes the expected return, which is the discounted sum of rewards $G^{t}:=\sum_{i=0}^{\infty}\gamma^{i}r^{t+i}$, where the discount factor $\gamma\in[0,1]$ controls the importance of immediate rewards versus future rewards.
Here, we only consider continuing agent-environment interactions which do not end with terminal states but go on continually without limit.
Solving a reinforcement learning task means, roughly, finding an optimal policy $\pi^{*}$ that maximizes the expected return. While the agent only receives reward about its immediate, one-step performance, one way to find the optimal policy $\pi^{*}$ is by following an optimal action-value function or state-value function. The action-value function (Q-function) of a policy $\pi$, $Q^{\pi}:\mathcal{S}\times\mathcal{A}\rightarrow\mathbb{R}$, is the expected return of a state-action pair, $Q^{\pi}(s,a)=\mathbb{E}_{\pi}[G^{t}\,|\,s^{t}=s,a^{t}=a]$. The state-value function of a policy $\pi$, $V^{\pi}:\mathcal{S}\rightarrow\mathbb{R}$, is the expected return of a state, $V^{\pi}(s)=\mathbb{E}_{\pi}[G^{t}\,|\,s^{t}=s]$.
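To make the action-value function concrete, below is a minimal tabular Q-learning sketch (a generic illustration under our own assumptions, not the dissertation's algorithm) that estimates $Q(s,a)$ from observed transitions and selects actions epsilon-greedily; the state and action encodings are hypothetical.

```python
import random
from collections import defaultdict

# Tabular Q-function: Q[s][a] approximates the expected discounted return Q(s, a).
Q = defaultdict(lambda: defaultdict(float))

def choose_action(state, actions, epsilon=0.1):
    """Epsilon-greedy: explore a random action with probability epsilon, otherwise exploit argmax_a Q(s, a)."""
    if random.random() < epsilon or not Q[state]:
        return random.choice(actions)
    return max(Q[state], key=Q[state].get)

def q_update(s, a, r, s_next, alpha=0.1, gamma=0.9):
    """One-step Q-learning update: Q(s,a) <- Q(s,a) + alpha * (r + gamma * max_a' Q(s',a') - Q(s,a))."""
    best_next = max(Q[s_next].values(), default=0.0)
    Q[s][a] += alpha * (r + gamma * best_next - Q[s][a])
```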

2.2.2 Problem setting
We now introduce the general setting of RL-based traffic signal control problem, in which the traffic signals are controlled by an RL agent or several RL agents. Figure 2.2 illustrates the basic idea of the RL framework in single traffic signal control problem. The environment is the traffic conditions on the roads, and the agent controls the traffic signal. At each time step t t ttt, a description of the environment (e.g., signal phase, waiting time of cars, queue length of cars, and positions of cars) will be generated as the state s t s t s_(t)\mathsf{s}_{t}st. The agent will predict the next action a t a t a^(t)\mathsf{a}^{t}at to take that maximizes the expected return, where the action could be changing to a certain phase in the single intersection scenario. The action a t a t a^(t)\mathsf{a}^{t}at will be executed in the environment, and a reward r t r t r^(t)\mathsf{r}^{t}rt will be generated, where the reward could be defined on traffic conditions of the intersection. Usually, in the decision process, an agent combines the exploitation of learned policy and exploration of a new policy.
Figure 2.2: RL framework for traffic signal control.
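The following toy sketch illustrates the interaction loop of Figure 2.2 under simplifying assumptions: the `ToySignalEnv` class, its random arrival dynamics, and the negative-total-queue reward are hypothetical stand-ins for a real traffic simulator, and the random action choice stands in for a learning agent.

```python
import random

class ToySignalEnv:
    """Toy stand-in for a traffic simulator at one intersection (illustration only)."""
    def __init__(self, n_phases=4):
        self.n_phases = n_phases
        self.queues = [0] * 4                      # queue length per incoming approach

    def reset(self):
        self.queues = [random.randint(0, 10) for _ in range(4)]
        return tuple(self.queues)

    def step(self, phase):
        # Cars arrive randomly; the chosen phase discharges one approach.
        self.queues = [q + random.randint(0, 2) for q in self.queues]
        self.queues[phase % 4] = max(0, self.queues[phase % 4] - 5)
        reward = -sum(self.queues)                 # surrogate reward: negative total queue
        return tuple(self.queues), reward

env = ToySignalEnv()
state = env.reset()
for t in range(100):
    action = random.randrange(env.n_phases)       # a learning agent would choose here
    state, reward = env.step(action)              # observe the next state and the reward
```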
In the multi-intersection traffic signal control problem, there are $N$ traffic signals in the environment, controlled by one or several agents. The goal of the agent(s) is to learn the optimal policies to optimize the traffic conditions of the whole environment. At each time step $t$, each agent $i$ observes part of the environment as the observation $o_i^t$ and predicts the next actions $\mathbf{a}^t = (a_1^t, \ldots, a_N^t)$ to take. The actions are executed in the environment, and the rewards $r_i^t$ are generated, where the reward could be defined at the level of individual intersections or of a group of intersections within the environment. We refer readers interested in more detailed problem settings to [25].
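A minimal sketch of one multi-intersection control step is given below; the `local_observation` and `choose_phase` helpers are hypothetical placeholders for an agent's observation function and policy.

```python
import random

N = 4  # number of intersections in this toy example

def local_observation(i):
    # Hypothetical local observation: queue lengths on the four approaches of intersection i.
    return [random.randint(0, 10) for _ in range(4)]

def choose_phase(obs, n_phases=4):
    # Placeholder per-agent policy: serve the approach with the longest queue.
    return max(range(n_phases), key=lambda p: obs[p])

observations = {i: local_observation(i) for i in range(N)}
joint_action = {i: choose_phase(observations[i]) for i in range(N)}   # a^t = (a_1^t, ..., a_N^t)
rewards = {i: -sum(observations[i]) for i in range(N)}                # per-intersection surrogate reward
print(joint_action, rewards)
```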

2.3 Literature

In this section, we introduce three major aspects investigated in recent RL-based traffic signal control literature: agent formulation, policy learning approach and coordination strategy.

Agent formulation

A key question for RL is how to formulate the RL agent, i.e., the reward, state, and action definition. In this section, we focus on the advances in the reward, state, and action design in recent deep RL-based methods, and refer readers interested in more detailed definitions to [26, 27, 28].
2.3.1.1 Reward
The choice of reward reflects the learning objective of an RL agent. In the traffic signal control problem, although the ultimate objective is to minimize the travel time of all vehicles, travel time is hard to use directly as a reward in RL: since the travel time of a vehicle is affected by a sequence of signal actions and by vehicle movements, it is delayed and ineffective in indicating how good an individual signal action is. Therefore, the existing literature often uses a surrogate reward that can be effectively measured right after an action, considering factors like average queue length, average waiting time, average speed, or throughput [11, 29]. The authors in [10] also incorporate the frequency of signal changes and the number of emergency stops into the reward.
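As an illustration of such a surrogate reward, the sketch below combines queue length, waiting time, and throughput with hand-picked weights; the weights and the feature choice are assumptions for illustration, not the definitions used in the cited papers.

```python
def surrogate_reward(queue_lengths, waiting_times, throughput,
                     w_queue=1.0, w_wait=0.5, w_through=1.0):
    """Weighted surrogate reward (weights are illustrative only)."""
    return (-w_queue * sum(queue_lengths)
            - w_wait * sum(waiting_times)
            + w_through * throughput)

# Example: 4 incoming lanes with per-lane queue lengths and accumulated waiting times,
# plus the number of vehicles that passed the intersection in the last interval.
print(surrogate_reward([3, 5, 0, 2], [12.0, 30.5, 0.0, 8.0], throughput=6))
```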

2.3.1.2 State

At each time step, the agent receives a quantitative description of the environment as the state to decide its action. Various elements have been proposed to describe the environment state, such as queue length, waiting time, speed, and phase. These elements can be defined at the lane level or the road-segment level and then concatenated into a vector. In earlier work using RL for traffic signal control, researchers needed to discretize the state space and use a simple tabular or linear model to approximate the value functions for efficiency [30, 31, 32]. However, the real-world state space is usually huge, which limits these traditional RL methods in terms of memory and performance. With advances in deep learning, deep RL methods have been proposed that use neural networks as effective function approximators to handle large state spaces. Recent studies propose to use images [33, 34, 35, 36, 37, 38, 39, 40, 10, 11] to represent the state, where the positions of vehicles are extracted into an image-like representation.
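A minimal sketch of a vector-valued state is given below, assuming lane-level queue lengths and waiting times plus a one-hot encoding of the current phase; the exact feature set varies across the cited papers.

```python
import numpy as np

def build_state(queue_lengths, waiting_times, current_phase, n_phases=4):
    """Concatenate lane-level features with a one-hot encoding of the current phase."""
    phase_onehot = np.eye(n_phases)[current_phase]
    return np.concatenate([np.asarray(queue_lengths, dtype=float),
                           np.asarray(waiting_times, dtype=float),
                           phase_onehot])

state = build_state(queue_lengths=[3, 5, 0, 2],
                    waiting_times=[12.0, 30.5, 0.0, 8.0],
                    current_phase=1)
print(state.shape)   # (12,): 4 queue features + 4 waiting-time features + 4 phase bits
```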
2.3.1.3 Action scheme
There are several types of action definitions for an RL agent in traffic signal control: (1) set the current phase duration [42, 43]; (2) set the ratio of each phase duration over a pre-defined total cycle duration [32, 44]; (3) change to the next phase in a pre-defined cyclic phase sequence [10, 11, 27, 45]; and (4) choose the phase to change to among a set of phases [34, 39, 46, 47, 48, 1]. The choice of action scheme is closely related to the specific settings of the traffic signals. For example, if the phase sequence is required to be cyclic, then the first three action schemes should be considered, whereas "choosing the phase to change to among a set of phases" can generate flexible phase sequences.
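The sketch below illustrates action scheme (4), where the agent selects the next phase from a fixed set; the four-phase plan and the helper function are hypothetical examples rather than a definition from the literature.

```python
# Illustrative 4-phase plan; real deployments may use different phase sets.
PHASES = [
    "NS-straight",   # north-south through traffic
    "NS-left",       # north-south left turns
    "EW-straight",   # east-west through traffic
    "EW-left",       # east-west left turns
]

def apply_action(action_index, current_phase):
    """Return the phase to activate; a yellow/all-red transition would be inserted if it changes."""
    next_phase = PHASES[action_index]
    needs_transition = next_phase != current_phase
    return next_phase, needs_transition

print(apply_action(2, current_phase="NS-straight"))   # ('EW-straight', True)
```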

Policy Learning

RL methods can be categorized in different ways. [49, 50] divide current RL methods into model-based and model-free methods. Model-based methods try to model the transition probabilities among states explicitly, while model-free methods directly estimate the value (expected return) of state-action pairs and choose actions based on these estimates. In the context of traffic signal control, the transition between states is primarily influenced by people's driving behaviors, which are diverse and hard to predict. Therefore, most current RL-based methods for traffic signal control are model-free. In this section, we adopt the categorization in [51]: value-based methods and policy-based methods.

2.3.2.1 Value-based methods

Value-based methods approximate the state-value function or the state-action value function (i.e., how rewarding each state or state-action pair is), and the policy is implicitly obtained from the learned value function. Most RL-based traffic signal control methods use DQN [52], where the model is parameterized by neural networks and takes the state representation as input [10, 53]. DQN requires discrete actions, since the model directly outputs the value of each action given a state, which makes it especially suitable for action schemes (3) and (4) mentioned in Section 2.3.1.3.
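A minimal sketch of such a value network is shown below, assuming the 12-dimensional state vector from the earlier example and four candidate phases; it illustrates the general idea and is not the architecture of any specific cited method.

```python
import torch
import torch.nn as nn

class QNetwork(nn.Module):
    """Minimal Q-network: state vector in, one Q-value per candidate phase out."""
    def __init__(self, state_dim, n_actions):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(state_dim, 64), nn.ReLU(),
            nn.Linear(64, 64), nn.ReLU(),
            nn.Linear(64, n_actions),
        )

    def forward(self, state):
        return self.net(state)

q_net = QNetwork(state_dim=12, n_actions=4)       # e.g., the 12-dim state sketched earlier
state = torch.rand(1, 12)
q_values = q_net(state)                           # shape (1, 4): one value per phase
greedy_action = q_values.argmax(dim=1).item()     # epsilon-greedy would sometimes pick a random phase
# Training would regress q_values[0, a] toward the TD target r + gamma * max_a' Q_target(s', a').
```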
2.3.2.2 Policy-based methods
Policy-based methods directly update the policy parameters (e.g., a vector of probabilities for conducting actions in a given state) in the direction that maximizes a predefined objective (e.g., average expected return). The advantage of policy-based methods is that they do not require the actions to be discrete, unlike DQN. They can also learn a stochastic policy and keep exploring potentially more rewarding actions. To stabilize the training process, the actor-critic framework is widely adopted. It utilizes the strengths of both value-based and policy-based methods, with an actor that controls how the agent behaves (policy-based) and a critic that measures how good the conducted action is (value-based). In the traffic signal control problem, [44] uses DDPG [54] to learn a deterministic policy which directly maps states to actions, while [34, 42, 55] learn stochastic policies that map states to action probability distributions, all of which have shown excellent performance in traffic signal control problems. To further improve the convergence speed of RL agents, [56] proposed a time-dependent baseline to reduce the variance of policy gradient updates, specifically to avoid traffic jams.
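A minimal actor-critic sketch is given below; the network sizes, the placeholder TD target, and the loss weighting are illustrative assumptions rather than the setup of any cited method.

```python
import torch
import torch.nn as nn

class ActorCritic(nn.Module):
    """Minimal actor-critic: the actor outputs phase probabilities, the critic a state value."""
    def __init__(self, state_dim, n_actions):
        super().__init__()
        self.shared = nn.Sequential(nn.Linear(state_dim, 64), nn.ReLU())
        self.actor = nn.Linear(64, n_actions)     # policy head (logits over phases)
        self.critic = nn.Linear(64, 1)            # value head V(s)

    def forward(self, state):
        h = self.shared(state)
        return torch.distributions.Categorical(logits=self.actor(h)), self.critic(h)

model = ActorCritic(state_dim=12, n_actions=4)
state = torch.rand(1, 12)
dist, value = model(state)
action = dist.sample()                            # stochastic policy keeps exploring
# For a transition (s, a, r, s'): target = r + gamma * V(s'), advantage = target - V(s).
target = torch.tensor([[0.5]])                    # placeholder target for illustration
advantage = (target - value).detach()
policy_loss = -(dist.log_prob(action) * advantage).mean()
value_loss = (target - value).pow(2).mean()
loss = policy_loss + 0.5 * value_loss             # would be backpropagated by an optimizer
```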
In the above-mentioned methods, including both value-based and policy-based methods, deep neural networks are used to approximate the value functions. Most of the literature uses standard network architectures, exploiting their corresponding strengths. For example, Convolutional Neural Networks (CNNs) are used when the state contains an image-like representation [10, 38, 39, 40, 33, 34, 35, 36], and Recurrent Neural Networks (RNNs) are used to capture the temporal dependency of historical states [57].
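As an illustration, the sketch below encodes an image-like state (a hypothetical 32x32 vehicle-position grid) with a small CNN; the architecture is an assumption for illustration only, not one taken from the cited papers.

```python
import torch
import torch.nn as nn

class PositionImageEncoder(nn.Module):
    """Toy CNN that maps an image-like state (vehicle-position grid) to phase values."""
    def __init__(self, n_actions=4):
        super().__init__()
        self.conv = nn.Sequential(
            nn.Conv2d(1, 16, kernel_size=3, stride=2), nn.ReLU(),
            nn.Conv2d(16, 32, kernel_size=3, stride=2), nn.ReLU(),
            nn.Flatten(),
        )
        self.head = nn.LazyLinear(n_actions)      # infers the flattened size on first call

    def forward(self, x):
        return self.head(self.conv(x))

encoder = PositionImageEncoder()
grid = torch.rand(1, 1, 32, 32)                   # 32x32 occupancy grid around the intersection
print(encoder(grid).shape)                        # torch.Size([1, 4])
```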
Coordination

Coordination can benefit signal control in multi-intersection scenarios. Since recent advances in RL have improved the performance of isolated traffic signal control, efforts have been made to design coordination strategies for MARL agents. The literature [60] categorizes MARL into two classes: joint action learners and independent learners. Here we extend this categorization to the traffic signal control problem.
Joint action learners
A straightforward solution is to use a single global agent to control all the intersections [31]. It directly takes the full state as input and learns to set the joint actions of all intersections at the same time. However, such methods suffer from the curse of dimensionality: the state-action space grows exponentially with the number of state and action dimensions.
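The following two-line example illustrates this exponential growth for a hypothetical grid of 16 intersections with 4 candidate phases each:

```python
# With N intersections and |A| candidate phases each, a single global agent faces |A|**N joint actions.
n_phases, n_intersections = 4, 16
print(n_phases ** n_intersections)                # 4294967296 possible joint actions
```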
| Citation | Method | Cooperation | Road net (# signals) |
| :--- | :--- | :--- | :--- |
| [46] | Value-based | With communication | Synthetic (5) |
| | Policy-based | Without communication | Real (50) |
| | Policy-based | Without communication | Real (43) |
| | Value-based | Without communication | Real (2510) |
| | Policy-based | Joint action | Real (30) |
| | Value-based | - | Synthetic (1) |
| | Value-based | - | Synthetic (1) |
| | Value-based | Without communication | Synthetic (9) |
| | Both studied | - | Synthetic (1) |
| | Value-based | With communication | Synthetic (6) |
| | Value-based | Without communication | Synthetic (4) |
| | Both studied | Single global | Synthetic (5) |
| | Policy-based | - | Real (1) |
| | Value-based | Joint action | Synthetic (4) |
| | Value-based | With communication | Real (4) |
| | Value-based | - | Synthetic (1) |
| | Value-based | Without communication | Real (16) |
| | Value-based | With communication | Real (196) |
| | Value-based | Joint action | Synthetic (36) |
| | Value-based | Without communication | Real (16) |
| | Value-based | Without communication | Real (5) |

* Traffic with an arrival rate of less than 500 vehicles/hour/lane is considered light traffic in this survey; otherwise it is considered heavy.

* 1. Synthetic light uniform; 2. Synthetic light dynamic; 3. Synthetic heavy uniform; 4. Synthetic heavy dynamic; 5. Real-world data
Table 2.1: Representative deep RL-based traffic signal control methods. Due to page limits, we only include a subset of the investigated papers here.
Joint action modeling methods explicitly learn to model the joint action value of multiple agents, $Q(o_1, \ldots, o_N, \mathbf{a})$. The joint action space grows with the number of agents to model. To alleviate this challenge, [10] factorizes the global Q-function as a linear combination of local subproblems, extending [9] with the max-plus algorithm [61]: $\hat{Q}(o_1, \ldots, o_N, \mathbf{a}) = \sum_{i,j} Q_{i,j}(o_i, o_j, a_i, a_j)$, where $i$ and $j$ correspond to the indices of neighboring agents. In other works, [48, 59, 62] regard the joint Q-value as a weighted sum of local Q-values, $\hat{Q}(o_1, \ldots, o_N, \mathbf{a}) = \sum_{i,j} w_{i,j} Q_{i,j}(o_i, o_j, a_i, a_j)$.
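A minimal sketch of this factorization is given below; the pairwise local Q-function here is a hand-written stand-in (in the cited works it is learned), and the neighbor pairs and weights are illustrative.

```python
def local_q(obs_i, obs_j, a_i, a_j):
    # Hypothetical pairwise term, e.g., based on the combined queues and chosen phases.
    return -(sum(obs_i) + sum(obs_j)) + 2.0 * (a_i + a_j)

def joint_q(observations, actions, neighbor_pairs, weights=None):
    """Q_hat(o_1..o_N, a) = sum over neighbor pairs (i, j) of w_ij * Q_ij(o_i, o_j, a_i, a_j)."""
    total = 0.0
    for (i, j) in neighbor_pairs:
        w = 1.0 if weights is None else weights[(i, j)]
        total += w * local_q(observations[i], observations[j], actions[i], actions[j])
    return total

obs = {0: [3, 1], 1: [5, 2], 2: [0, 4]}           # toy local observations for 3 intersections
acts = {0: 1, 1: 0, 2: 1}                         # toy joint action
print(joint_q(obs, acts, neighbor_pairs=[(0, 1), (1, 2)]))   # -18.0
```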