
The Pennsylvania State University
The Graduate School
DEEP REINFORCEMENT LEARNING FOR TRAFFIC SIGNAL CONTROL
A Dissertation in
Information Sciences and Technology
by
Hua Wei
© 2020 Hua Wei
Submitted in Partial Fulfillment
of the Requirements
for the Degree of
Doctor of Philosophy
December 2020

The dissertation of Hua Wei was reviewed and approved by the following:
Zhenhui (Jessie) Li
Associate Professor of Information Sciences and Technology
Dissertation Adviser
Chair of Committee
C. Lee Giles
Professor of Information Sciences and Technology
Xiang Zhang
Associate Professor of Information Sciences and Technology
Vikash V. Gayah
Associate Professor of Civil and Environmental Engineering
Mary Beth Rosson
Professor of Information Sciences and Technology
Graduate Program Chair

Abstract

Traffic congestion is a growing problem that continues to plague urban areas, with negative outcomes for both the traveling public and society as a whole. Signalized intersections are one of the most prevalent bottleneck types in urban environments, and thus traffic signal control tends to play a large role in urban traffic management. Nowadays the widely-used traffic signal control systems (e.g., SCATS and SCOOT) are still based on manually designed traffic signal plans. Recently, emerging research studies have used reinforcement learning (RL) to tackle the traffic signal control problem. In this dissertation, we propose to use reinforcement learning to intelligently optimize signal timing plans in real time to reduce traffic congestion.
Although some efforts using reinforcement learning (RL) techniques have been proposed to adjust traffic signals dynamically, they only use ad-hoc designs when formulating the traffic signal control problem and lack a principled approach to formulating the problem under the framework of RL. Secondly, since RL directly learns from the data via a trial-and-error search, it requires a substantial number of interactions with the environment before the algorithms converge. In real-world problems, every interaction means real cost (e.g., traffic congestion, traffic accidents). Hence, a more data-efficient method is necessary. Thirdly, discrepancies between simulation and reality confine the application of RL in the real world, despite its massive success in domains like games. Most RL methods mainly conduct experiments in a simulator, since the simulator can generate data in a cheaper and faster way than real experimentation. Hence, addressing the performance gap of RL methods between simulation and the real world is required for applying RL in the real world.
This dissertation presents how to utilize mobility data and RL-based methods for traffic signal control. I have investigated the key challenges for RL-based traffic signal control methods, including how to formulate the objective function and improve learning efficiency for city-wide traffic signal control. Besides, I have also managed to mitigate the performance gap of RL methods between simulation and the real world. We have achieved significant improvement over the state-of-the-art or currently employed methods, which provides promising solutions to traffic signal control problems and implications for smart city applications.

Table of Contents

List of Figures
List of Tables

Acknowledgements

Chapter 1 Introduction
* 1.1 Why do we need a more intelligent traffic signal control
* 1.2 Why do we use reinforcement learning for traffic signal control
* 1.3 Why is RL for traffic signal control challenging?
	* 1.3.1 Formulation of RL agent
	* 1.3.2 Learning cost
	* 1.3.3 Simulation
* 1.4 Previous Studies
* 1.5 Proposed Tasks
* 1.6 Overview of this Dissertation

Chapter 2 Notation, Background and Literature

* 2.1 Preliminaries of Traffic Signal Control Problem
	* 2.1.1 Term Definition
	* 2.1.2 Objective
	* 2.1.3 Special Considerations
* 2.2 Background of Reinforcement Learning
	* 2.2.1 Single Agent Reinforcement Learning
	* 2.2.2 Problem setting
* 2.3 Literature
	* 2.3.1 Agent formulation
		* 2.3.1.1 Reward
		* 2.3.1.2 State
		* 2.3.1.3 Action scheme
	* 2.3.2 Policy Learning
		* 2.3.2.1 Value-based methods
		* 2.3.2.2 Policy-based methods

	* 2.3.3 Coordination
		* 2.3.3.1 Joint action learners
		* 2.3.3.2 Independent learners
		* 2.3.3.3 Sizes of Road network
* 2.4 Conclusion

Chapter 4 Formulating the Learning Objective

* 4.1 Introduction
* 4.2 Related Work
* 4.3 Preliminaries and Notations
* 4.4 Method
	* 4.4.1 Agent Design
	* 4.4.2 Learning Process
* 4.5 Justification of RL agent
	* 4.5.1 Justification for State Design
		* 4.5.1.1 General description of traffic movement process as a Markov chain
		* 4.5.1.2 Specification with proposed state definition
	* 4.5.2 Justification for Reward Design
		* 4.5.2.1 Stabilization on traffic movements with proposed reward.
		* 4.5.2.2 Connection to throughput maximization and travel time minimization.
* 4.6 Experiment
	* 4.6.1 Dataset Description
	* 4.6.2 Experimental Settings
		* 4.6.2.1 Environmental settings
		* 4.6.2.2 Evaluation metric
		* 4.6.2.3 Compared methods
	* 4.6.3 Performance Comparison
	* 4.6.4 Study of PressLight
		* 4.6.4.1 Effects of variants of our proposed method
		* 4.6.4.2 Average travel time related to pressure.
	* 4.6.5 Performance on Mixed Scenarios
		* 4.6.5.1 Heterogeneous intersections
		* 4.6.5.2 Arterials with a different number of intersections and network
	* 4.6.6 Case Study
		* 4.6.6.1 Synthetic traffic on the uniform, uni-directional flow
			* 4.6.6.1.1 Performance comparison
			* 4.6.6.1.2 Policy learned by RL agents
		* 4.6.6.2 Real-world traffic in Jinan
* 4.7 Conclusion
Chapter 5 Improving Learning Efficiency
* 5.1 Introduction
* 5.2 Related Work
* 5.3 Problem Definition
* 5.4 Method
	* 5.4.1 Observation Embedding
	* 5.4.2 Graph Attention Networks for Cooperation
		* 5.4.2.1 Observation Interaction
		* 5.4.2.2 Attention Distribution within Neighborhood Scope
		* 5.4.2.3 Index-free Neighborhood Cooperation
		* 5.4.2.4 Multi-head Attention
	* 5.4.3 Q-value Prediction
	* 5.4.4 Complexity Analysis
		* 5.4.4.1 Space Complexity
		* 5.4.4.2 Time Complexity
* 5.5 Experiments
	* 5.5.1 Settings
	* 5.5.2 Datasets
		* 5.5.2.1 Synthetic Data
		* 5.5.2.2 Real-world Data
	* 5.5.3 Compared Methods
	* 5.5.4 Evaluation Metric
	* 5.5.5 Performance Comparison
		* 5.5.5.1 Overall Analysis
		* 5.5.5.2 Convergence Comparison
	* 5.5.6 Scalability Comparison
		* 5.5.6.1 Effectiveness
		* 5.5.6.2 Training time
	* 5.5.7 Study of CoLight
		* 5.5.7.1 Impact of Neighborhood Definition
		* 5.5.7.2 Impact of Neighbor Number
		* 5.5.7.3 Impact of Attention Head Number
* 5.6 Conclusion

Chapter 6 Learning to Simulate

* 6.1 Introduction
* 6.2 Related Work
* 6.3 Preliminaries
* 6.4 Method
	* 6.4.1 Basic GAIL Framework
	* 6.4.2 Imitation with Interpolation
		* 6.4.2.1 Generator in the simulator
		* 6.4.2.2 Downsampling of generated trajectories
		* 6.4.2.3 Interpolation-Discriminator
			* 6.4.2.3.1 Interpolator module
			* 6.4.2.3.2 Discriminator module
			* 6.4.2.3.3 Loss function of Interpolation-Discriminator
	* 6.4.3 Training and Implementation
* 6.5 Experiment
	* 6.5.1 Experimental Settings
		* 6.5.1.1 Dataset
			* 6.5.1.1.1 Synthetic Data
			* 6.5.1.1.2 Real-world Data
		* 6.5.1.2 Data Preprocessing
	* 6.5.2 Compared Methods
		* 6.5.2.1 Calibration-based methods
		* 6.5.2.2 Imitation learning-based methods
	* 6.5.3 Evaluation Metrics
	* 6.5.4 Performance Comparison
	* 6.5.5 Study of ImIn-GAIL
		* 6.5.5.1 Interpolation Study
		* 6.5.5.2 Sparsity Study
	* 6.5.6 Case Study
* 6.6 Conclusion
Chapter 7 Conclusion and Future Directions

* 7.1 Evolving behavior with traffic signals
* 7.2 Benchmarking datasets and baselines
* 7.3 Learning efficiency
* 7.4 Safety issue
* 7.5 Transferring from simulation to reality

Bibliography

List of Figures
  • 2.1 Definitions of traffic movement and traffic signal phases.
  • 2.2 RL framework for traffic signal control.
  • 3.1 Reward is not a comprehensive measure to evaluate traffic light control performance. Both policies will lead to the same rewards, but policy #1 is more suitable than policy #2 in the real world.
  • 3.2 Case A and case B have the same environment except the traffic light phase.
  • 3.3 Model framework
  • 3.4 Q-network
  • 3.5 Memory palace structure
  • 3.6 Traffic surveillance cameras in Jinan, China
  • 3.7 Percentage of the time duration of the learned policy for phase Green-WE (green light on the W-E and E-W directions, red light on the N-S and S-N directions) in every 2000 seconds for different methods under configuration 4.
  • 3.8 Average arrival rate on two directions (WE and SN) and time duration ratio of two phases (Green-WE and Red-WE) from the learned policy for Jingliu Road (WE) and Erhuanxi Auxiliary Road (SN) in Jinan on August 1st and August 7th, 2016.
  • 3.9 Detailed average arrival rate on two directions (dotted lines) and changes of two phases (dashed areas) in three periods of time for Jingliu Road (WE) and Erhuanxi Auxiliary Road (SN) in Jinan on August 1st, 2016. The X-axis of each figure indicates the time of day; the left Y-axis indicates the number of cars approaching the intersection every second; the right Y-axis indicates the phase over time.
  • 4.1 Performance of RL approaches is sensitive to reward and state. (a) A heuristic parameter tuning of the reward function could result in different performances. (b) The method with a more complicated state (LIT [1] w/ neighbor) has a longer learning time but does not necessarily converge to a better result.
  • 4.2 Illustration of max pressure control in two cases. In Case A, the green signal is set in the North→South direction; in Case B, the green signal is set in the East→West direction.
  • 4.3 The transition of traffic movements.
  • 4.4 Real-world arterial network for the experiment.
  • 4.5 Convergence curve of average duration and our reward design (pressure). Pressure shows the same convergence trend as travel time.
  • 4.6 Average travel time of our method on heterogeneous intersections. (a) Different number of legs. (b) Different length of lanes. (c) Experiment results.
  • 4.7 Performance comparison under uniform unidirectional traffic, where the optimal solution is known (GreenWave). Only PressLight can achieve the optimal.
  • 4.8 Offsets between intersections learnt by RL agents under uni-directional uniform traffic (700 vehicles/hour/lane on the arterial)
  • 4.9 Space-time diagram with signal timing plan to illustrate the learned coordination strategy from real-world data on the arterial of Qingdao Road in the morning (around 8:30 a.m.) on August 6th.
  • 5.1 Illustration of index-based concatenation. Thick yellow lines are the arterials and grey thin lines are the side streets. With index-based concatenation, A and B's observations will be aligned as model inputs in a fixed order. These two inputs will confuse the model shared by A and B.
  • 5.2 Left: Framework of the proposed CoLight model. Right: variation of cooperation scope (light blue shadow, from one-hop to two-hop) and attention distribution (colored points, the redder, the more important) of the target intersection.
  • 5.3 Road networks for real-world datasets. Red polygons are the areas we select to model, blue dots are the traffic signals we control. Left: 196 intersections with uni-directional traffic; middle: 16 intersections with uni- & bi-directional traffic; right: 12 intersections with bi-directional traffic.
  • 5.4 Convergence speed of CoLight (red continuous curves) and the other 5 RL baselines (dashed curves) during training. CoLight starts with the best performance (Jumpstart), reaches the pre-defined performance the fastest (Time to Threshold), and ends with the optimal policy (Asymptotic). Curves are smoothed with a moving average of 5 points.
  • 5.5 The training time of different models for 100 episodes. CoLight is efficient across all the datasets. The bar for Individual RL on $D_{NewYork}$ is shadowed as its running time is far beyond the acceptable time.
  • 5.6 Performance of CoLight with respect to different numbers of neighbors ($|\mathcal{N}_{i}|$) on datasets $D_{Hangzhou}$ (left) and $D_{Jinan}$ (right). More neighbors ($|\mathcal{N}_{i}|\leq 5$) for cooperation brings better performance, but too many neighbors ($|\mathcal{N}_{i}|>5$) require more time (200 episodes or more) to learn.
  • 6.1 Illustration of a driving trajectory. In the real-world scenario, only part of the driving points can be observed and form a sparse driving trajectory (in red dots). Each driving point includes a driving state and an action of the vehicle at the observed time step. Best viewed in color.
  • 6.2 Proposed ImIn-GAIL approach. The overall framework of ImIn-GAIL includes three components: generator, downsampler, and interpolation-discriminator. Best viewed in color.
  • 6.3 Proposed interpolation-discriminator network.
  • 6.4 Illustration of road networks. (a) and (b) are synthetic road networks, while (c) and (d) are real-world road networks.
  • 6.5 RMSE on time and position of our proposed method ImIn-GAIL under different levels of sparsity. As the expert trajectory becomes denser, a policy more similar to the expert policy is learned.
  • 6.6 The generated trajectory of a vehicle in the Ring scenario. Left: the initial positions of the vehicles. Vehicles can only be observed when they pass four locations A, B, C and D where cameras are installed. Right: the visualization of the trajectory of Vehicle 0. The x-axis is the timestep in seconds; the y-axis is the relative road distance in meters. Although vehicle 0 is only observed three times (red triangles), ImIn-GAIL (blue points) can imitate the position of the expert trajectory (grey points) more accurately than all other baselines. Better viewed in color.
List of Tables
  • 2.1 Representative deep RL-based traffic signal control methods. Due to page limits, we only put part of the investigated papers here.
  • 3.1 Notations
  • 3.2 Settings for our method
  • 3.3 Reward coefficients
  • 3.4 Configurations for synthetic traffic data
  • 3.5 Details of real-world traffic dataset
  • 3.6 Performance on configuration 1. Reward: the higher the better. Other measures: the lower the better. Same with the following tables.
  • 3.7 Performance on configuration 2
  • 3.8 Performance on configuration 3
  • 3.9 Performance on configuration 4
  • 3.10 Performances of different methods on real-world data. The number after $\pm$ means standard deviation. Reward: the higher the better. Other measures: the lower the better.
  • 4.1 Summary of notation.
  • 4.2 Configurations for synthetic traffic data
  • 4.3 Data statistics of real-world traffic dataset
  • 4.4 Performance comparison between all the methods in the arterial with 6 intersections w.r.t. average travel time (the lower the better). Top-down: conventional transportation methods, learning methods, and our proposed method.
  • 4.5 Detailed comparison of our proposed state and reward design and their effects w.r.t. average travel time (the lower the better) under synthetic traffic data.
  • 4.6 Average travel time of different methods under arterials with a different number of intersections and network.
  • 5.1 Data statistics of real-world traffic dataset
  • 5.2 Performance on synthetic data and real-world data w.r.t. average travel time. CoLight is the best.
  • 5.3 Performance of CoLight with respect to different numbers of attention heads ($H$) on dataset $Grid_{6\times 6}$. More types of attention ($H\leq 5$) enhance model efficiency, while too many ($H>5$) could distract the learning and deteriorate the overall performance.
  • 6.1 Features for a driving state
  • 6.2 Hyper-parameter settings for ImIn-GAIL
  • 6.3 Statistics of dense and sparse expert trajectories in different datasets
  • 6.4 Performance w.r.t. Relative Mean Squared Error (RMSE) of time (in seconds) and position (in kilometers). All the measurements are conducted on dense trajectories. The lower the better. Our proposed method ImIn-GAIL achieves the best performance.
  • 6.5 RMSE on time and position of our proposed method ImIn-GAIL against baseline methods and their corresponding two-step variants. Baseline methods and ImIn-GAIL learn from sparse trajectories, while the two-step variants interpolate sparse trajectories first and are trained on the interpolated data. ImIn-GAIL achieves the best performance in most cases.

Acknowledgements

I would like to express my sincere appreciation to my doctoral committee, Dr. Zhenhui (Jessie) Li, Dr. C. Lee Giles, Dr. Xiang Zhang, Dr. Vikash V. Gayah, and my Ph.D. program chair, Dr. Mary Beth Rosson.
I would like to show my special appreciation to my Ph.D. adviser, Dr. Zhenhui (Jessie) Li. Jessie patiently guided me through the research process and career development, including several open and enlightening discussions on how to cope with work-life balance, pressure and priorities. Her supportive and critical attitude has made our fruitful collaboration possible. Her patience and kindness have helped me manage a smooth career development. Without her guidance, my Ph.D. would surely not have been as successful.
I would also like to extend my thanks to Dr. Vikash Gayah from the Department of Civil and Environmental Engineering, whose insightful advice forms some of the foundations of our interdisciplinary research and who has been more than happy to help me become familiar with the background material.
I am also fortunate to have collaborated with several others throughout my Ph.D.: Guanjie Zheng, Chacha Chen, Porter Jenkins, Zhengyao Yu, Kan Wu, Huaxiu Yao, Nan Xu, Chang Liu, Yuandong Wang. Without them, this research would not be possible.
The studies in this dissertation have been supported by NSF awards #1544455, #1652525, #1618448, and #1639150. The views and conclusions contained in the studies are those of the authors and should not be interpreted as representing any funding agencies. Thanks for their generous support in making this research happen.
During my several years of the Ph.D. program, I was lucky to have been surrounded by wonderful colleagues and friends. Thank you all for sharing Friday game nights with me, for the road trips and getaways on weekends, and for the summer twilight barbecues. All of you made my time at Penn State a wonderful experience.
At last, a very special thank you to my parents, Shaozhu and Yuling, and to my whole family, for always supporting me and for always encouraging me to pursue my passion.

Chapter 1 Introduction

Traffic congestion is a growing problem that continues to plague urban areas, with negative outcomes for both the traveling public and society as a whole. These negative outcomes will only grow over time as more people flock to urban areas. In 2014, traffic congestion cost Americans over $160 billion in lost productivity and wasted over 3.1 billion gallons of fuel [2]. Traffic congestion was also attributed to over 56 billion pounds of harmful CO2 emissions in 2011 [3]. In the European Union, the cost of traffic congestion was equivalent to 1% of the entire GDP [4]. Mitigating congestion would have significant economic, environmental and societal benefits. Signalized intersections are one of the most prevalent bottleneck types in urban environments, and thus traffic signal control tends to play a large role in urban traffic management.

1.1 Why do we need a more intelligent traffic signal control

The majority of small and big cities, even in industrialized countries, are still operating old-fashioned fixed-time signal control strategies, often poorly optimized or maintained. Even when modern traffic-responsive control systems are installed (e.g., SCATS [5] and SCOOT [6, 7]), the employed control strategies are sometimes naive, mainly based on manually designed traffic signal plans.
On the other hand, nowadays various kinds of traffic data can be collected to enrich the information about traffic conditions. Systems like SCATS or SCOOT mainly rely on loop sensor data to choose the signal plans. However, loop sensors only count a vehicle when it passes the sensor, while an increasing amount of traffic data is now being collected from various sources such as GPS-equipped vehicles, navigational systems, and traffic surveillance cameras. How to use rich traffic data to better optimize our traffic signal control systems has attracted more and more attention from academia, government and industry.

1.2 Why do we use reinforcement learning for traffic signal control

In the transportation field, traffic signal control is one of the most fundamental research questions [8]. The typical approach that transportation researchers take is to seek an optimization solution under certain assumptions about traffic models [8]. However, most of these works focus only on automobiles, and the assumptions they make are simplified and do not necessarily hold true in the field. Real traffic behaves in a complicated way, affected by many factors such as drivers' preferences, interactions with vulnerable road users (e.g., pedestrians, cyclists, etc.), weather and road conditions. These features can hardly be modelled accurately for optimizing traffic signal control.
On the other hand, machine learning techniques learn directly from the observed data without making assumptions about the data model. However, the traffic signal control problem is not a typical machine learning problem with fixed data sets for training and testing. The real-world traffic is constantly changing, and the execution of traffic lights is also changing the traffic. Therefore, in the case of ever-changing data samples, we need to learn from the feedback from the environment. This idea of trial and error is the essence of RL. RL attempts different traffic signal control strategies based on the current traffic environment, and the model learns and adjusts strategies based on environmental feedback.

1.3 Why is RL for traffic signal control challenging?

1.3.1 Formulation of RL agent

A key question for RL is how to define the reward and state. In existing studies [9, 10, 11], a typical reward definition for traffic signal control is a weighted linear combination of several components such as queue length, waiting time, number of switches in the traffic signal, and sum of delay. The state includes components such as queue length, number of cars, waiting time, and current traffic signal. In recent work [10, 11], images of vehicles' positions on the roads are also included in the state.
However, all of the existing work takes an ad-hoc approach to defining the reward and state. Such an ad-hoc approach causes several problems that hinder the application of RL in the real world. First, the engineering details in formulating the reward and state could significantly affect the results. For example, if the reward is defined as a weighted linear combination of several terms, the weights on each term are tricky to set, and minor differences in the weight setting could lead to dramatically different results. Second, the state representation could be in a high-dimensional space, especially when using traffic images as part of the state representation [10, 11]. Such a high-dimensional state representation needs many more training data samples to learn and may not even converge. Third, there is no connection between existing RL approaches and transportation methods. Without the support of transportation theory, it is highly risky to apply these purely data-driven RL-based approaches in the real physical world.

1.3.2 Learning cost

While learning from trial and error is the key idea in RL, the learning cost of RL is fatal for real-world applications. Although RL algorithms are very useful for learning good solutions when the model of the environment is unknown in advance [12, 13], the solutions may only be achieved after an extensive number of trials and errors, which is usually very time consuming. While existing RL methods for games (e.g., Go or Atari games) yield impressive results in simulated environments, the cost of error in traffic signal control is critical, even fatal, in the real world. Therefore, how to learn efficiently (e.g., learning from limited data samples, sampling training data in an adaptive way, transferring learned knowledge) is a critical question for the application of RL in traffic signal control.

1.3.3 Simulation

Reinforcement learning (RL) has shown great success in a series of artificial intelligence (AI) domains such as Go [14]. Despite its huge success in AI domains, RL has not yet shown the same degree of success in real-world applications [15]. These applications could involve control systems such as drones [16] and systems that interact with people such as traffic signal control [11]. As people analyze the challenges in all these scenarios, a recurring theme emerges: there is rarely a good simulator [15].
1.4 Previous Studies

Some work has already been done on investigating reinforcement learning methods for traffic signal control, focusing on different scales of traffic signal control, including isolated intersection control [4-6], arterial control [7], and region control [8-10]. Most of the previous studies use different kinds of features in the state and reward design, while we propose a formulation of reinforcement learning in connection with transportation theory. Furthermore, we propose to investigate how to learn efficiently to reduce the learning cost in the real-world setting. We will discuss more about the literature in Chapter 2, Chapter 3 and Chapter 4.

1.5 Proposed Tasks

In this dissertation, we propose to use RL for traffic signal control problems in a way that combines transportation-guided RL with efficient learning. The first part will help the formulation of RL towards a reasonable state and reward design. The connection between RL approaches and transportation theory can help RL optimize towards a correct objective and condense the state space for convergence. The second part will help reduce the learning cost of RL and enable the application of RL in the real world. The third part will try to tackle the real-world application issue by building a realistic simulator. Specifically, we elaborate RL-based traffic signal control methods with real-world implications from the following topics:
  • Formulation of RL with theoretical justification. How to formulate the RL agent, i.e., the reward and state definition, for the traffic signal control problem? Is it always beneficial if we use complex reward functions and state representations? In [7, 1, 17], I proposed to simplify the intricate designs in the current literature and use a concise reward and state design, with theoretical proof that the concise design optimizes towards the global optimum, for both single-intersection and multi-intersection control scenarios.
  • Optimization of the learning process in RL. Is a vanilla deep neural network the solution to the traffic signal control problem with RL? In [11], I tackle the sample imbalance problem with a phase-gated network to emphasize certain features. In [18], we design a network to model the priority between actions and learn changes of traffic flow such as flipping and rotation. Coordination could benefit signal control in multi-intersection scenarios. In [19], I contributed a framework that enables agents to communicate their observations with each other and behave as a group. Though each agent has limited capabilities and visibility of the world, in this work, agents were able to cooperate with multi-hop neighboring intersections and learn the dynamic interactions between the hidden states of neighboring agents. Later in [20], we investigate the possibility of using RL to control traffic signals at the scale of a city, and test our RL methods in the simulator on the road network of Manhattan, New York, with 2510 traffic signals. This is the first time an RL-based method can operate at such a scale in traffic signal control.
  • Bridging the gap from simulation to reality. Despite its massive success in artificial domains, RL has not yet shown the same degree of success in real-world applications. These applications could involve control systems such as drones, software systems such as data centers, and systems that interact with people such as transportation systems. Currently, most RL methods mainly conduct experiments in a simulator, since the simulator can generate data in a cheaper and faster way than real experimentation. Discrepancies between simulation and reality confine the application of learned policies in the real world.

1.6 Overview of this Dissertation

In this chapter, we discussed the motivation, challenges and possible tasks of using reinforcement learning for traffic signal control. We will elaborate more on the details of work that has applied reinforcement learning in several scenarios. Chapter 2 covers the notation, background and necessary literature for traffic signal control. The basic formulation of RL-based traffic signal control and a further theory-guided RL method will be discussed in detail in Chapters 3 and 4. Chapter 5 will discuss how to improve learning efficiency with neighborhood information for coordinated intersections. Chapter 6 will tackle the challenge of real-world application of RL by building a realistic simulator, and Chapter 7 briefly discusses potential future work.

Chapter 2 Notation, Background and Literature

This chapter covers notation, background information and necessary literature that will be discussed throughout the rest of this dissertation. The main part of this chapter is adapted from our surveys [21] and [22].

2.1 Preliminaries of Traffic Signal Control Problem

2.1.1 Term Definition
Terms on road structure and traffic movement:
  • Approach: A roadway meeting at an intersection is referred to as an approach. At any general intersection, there are two kinds of approaches: incoming approaches and outgoing approaches. An incoming approach is one on which cars can enter the intersection; an outgoing approach is one on which cars can leave the intersection. Figure 2.1(a) shows a typical intersection with four incoming and four outgoing approaches. The southbound incoming approach is denoted in this figure as the approach on the north side on which vehicles are traveling in the southbound direction.
  • Lane: An approach consists of a set of lanes. Similar to the approach definition, there are two kinds of lanes: incoming lanes and outgoing lanes (also known as approaching/entering lanes and receiving/exiting lanes in some references [23, 24]).
  • Traffic movement: A traffic movement refers to vehicles moving from an incoming approach to an outgoing approach, denoted as $(r_{i}\to r_{o})$, where $r_{i}$ and $r_{o}$ are the incoming lane and the outgoing lane respectively. A traffic movement is generally categorized as left turn, through, or right turn.
Terms on traffic signal:
  • Movement signal: A movement signal is defined on a traffic movement, with a green signal indicating the corresponding movement is allowed and a red signal indicating the movement is prohibited. For the four-leg intersection shown in Figure 2.1(a), the right-turn traffic can pass regardless of the signal, and there are eight movement signals in use, as shown in Figure 2.1(b).
  • Phase: A phase is a combination of movement signals. Figure 2.1(c) shows the conflict matrix of the combination of two movement signals in the example in Figure 2.1(a) and Figure 2.1(b). A grey cell indicates that the corresponding two movements conflict with each other, i.e., they cannot be set to 'green' at the same time (e.g., signals #1 and #2). A white cell indicates non-conflicting movement signals. All the non-conflicting signals will generate eight valid paired-signal phases (letters 'A' to 'H' in Figure 2.1(c)) and eight single-signal phases (the diagonal cells in the conflict matrix). Here we letter the paired-signal phases only, because in an isolated intersection it is always more efficient to use paired-signal phases. When considering multiple intersections, a single-signal phase might be necessary because of the potential spill back.

Figure 2.1: Definitions of traffic movement and traffic signal phases.

  • Phase sequence: A phase sequence is a sequence of phases which defines a set of phases and their order of changes.
  • Signal plan: A signal plan for a single intersection is a sequence of phases and their corresponding starting times. Here we denote a signal plan as $(p_{1},t_{1})(p_{2},t_{2})\dots(p_{i},t_{i})\dots$, where $p_{i}$ and $t_{i}$ stand for a phase and its starting time.
  • Cycle-based signal plan: A cycle-based signal plan is a kind of signal plan where the sequence of phases operates in a cyclic order, which can be denoted as $(p_{1},t_{1}^{1})(p_{2},t_{2}^{1})\dots(p_{N},t_{N}^{1})(p_{1},t_{1}^{2})(p_{2},t_{2}^{2})\dots(p_{N},t_{N}^{2})\dots$, where $p_{1},p_{2},\dots,p_{N}$ is the repeated phase sequence and $t_{i}^{j}$ is the starting time of phase $p_{i}$ in the $j$-th cycle. Specifically, $C^{j}=t_{1}^{j+1}-t_{1}^{j}$ is the cycle length of the $j$-th phase cycle, and $\{\frac{t_{2}^{j}-t_{1}^{j}}{C^{j}},\dots,\frac{t_{N}^{j}-t_{N-1}^{j}}{C^{j}}\}$ is the phase split of the $j$-th phase cycle. Existing traffic signal control methods usually repeat a similar phase sequence throughout the day (a minimal sketch of this representation follows).
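To make the notation above concrete, the following is a minimal sketch (our illustration, not code from the dissertation) of how a cycle-based signal plan could be represented, with the cycle length $C^{j}$ and phase split computed exactly as defined above; the phase labels and timings are hypothetical.

```python
from typing import List, Tuple

# A signal plan is a sequence of (phase, starting time) pairs: (p1, t1)(p2, t2)...
SignalPlan = List[Tuple[str, float]]

# Hypothetical cycle-based plan with N = 4 phases repeated over two cycles (times in seconds).
plan: SignalPlan = [
    ("A", 0.0), ("B", 30.0), ("C", 45.0), ("D", 75.0),      # cycle 1
    ("A", 90.0), ("B", 120.0), ("C", 135.0), ("D", 165.0),  # cycle 2
]

def cycle_length(plan: SignalPlan, n_phases: int, j: int) -> float:
    """C^j = t_1^{j+1} - t_1^j: the length of the j-th phase cycle (j starts at 0 here)."""
    return plan[(j + 1) * n_phases][1] - plan[j * n_phases][1]

def phase_split(plan: SignalPlan, n_phases: int, j: int) -> List[float]:
    """The ratios {(t_2^j - t_1^j)/C^j, ..., (t_N^j - t_{N-1}^j)/C^j} from the definition above;
    the last phase's share of the cycle is the remainder."""
    c = cycle_length(plan, n_phases, j)
    starts = [t for _, t in plan[j * n_phases:(j + 1) * n_phases]]
    return [(starts[i + 1] - starts[i]) / c for i in range(n_phases - 1)]

print(cycle_length(plan, 4, 0))  # 90.0
print(phase_split(plan, 4, 0))   # ≈ [0.33, 0.17, 0.33]
```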

2.1.2 Objective

The objective of traffic signal control is to facilitate safe and efficient movement of vehicles at the intersection. Safety is achieved by separating conflicting movements in time and is not considered in more detail here. Various measures have been proposed to quantify efficiency of the intersection from different perspectives:
  • Travel time. In traffic signal control, the travel time of a vehicle is defined as the time difference between the time it enters the system and the time it leaves the system. One of the most common goals is to minimize the average travel time of vehicles in the network.
  • Queue length. The queue length of the road network is the number of queuing vehicles in the road network.
  • Number of stops. The number of stops of a vehicle is the total number of times that the vehicle comes to a stop.
  • Throughput. The throughput is the number of vehicles that have completed their trips in the road network during a given period. (A minimal sketch of computing these measures from trip records follows this list.)
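As a concrete illustration of the travel time and throughput measures, here is a minimal sketch (our own example, not from the dissertation) that computes them from hypothetical per-vehicle enter/exit records; vehicles still in the network are excluded from the travel-time average.

```python
from typing import Dict, Optional, Tuple

# Hypothetical trip records: vehicle id -> (enter time, exit time), exit time None if still driving.
trips: Dict[str, Tuple[float, Optional[float]]] = {
    "veh_0": (0.0, 310.0),
    "veh_1": (15.0, 290.0),
    "veh_2": (40.0, None),  # has not left the network yet
}

def average_travel_time(trips) -> Optional[float]:
    """Mean of (exit - enter) over vehicles that have completed their trips."""
    finished = [exit_t - enter_t for enter_t, exit_t in trips.values() if exit_t is not None]
    return sum(finished) / len(finished) if finished else None

def throughput(trips, t_start: float, t_end: float) -> int:
    """Number of vehicles that completed their trips within [t_start, t_end]."""
    return sum(1 for _, exit_t in trips.values() if exit_t is not None and t_start <= exit_t <= t_end)

print(average_travel_time(trips))     # 292.5
print(throughput(trips, 0.0, 300.0))  # 1
```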

2.1.3 Special Considerations

In practice, additional attention should be paid to the following aspects:
  1. Yellow and all-red time. A yellow signal is usually set as a transition from a green signal to a red one. Following the yellow, there is an all-red period during which all the signals in an intersection are set to red. The yellow and all-red time, which can last from 3 to 6 seconds, allows vehicles to stop safely or pass the intersection before vehicles in conflicting traffic movements are given a green signal.
  2. Minimum green time. Usually, a minimum green signal time is required to ensure pedestrians moving during a particular phase can safely pass through the intersection.
  3. Left turn phase. Usually, a left-turn phase is added when the left-turn volume is above a certain threshold.

2.2 Background of Reinforcement Learning

In this section, we first describe the reinforcement learning framework which constitutes the foundation of all the methods presented in this dissertation. We then provide background on conventional RL-based traffic signal control, including the problem of controlling a single intersection and multiple intersections.

2.2.1 Single Agent Reinforcement Learning

Usually a single agent RL problem is modeled as a Markov Decision Process represented by $\langle\mathcal{S},\mathcal{A},P,R,\gamma\rangle$, where the definitions are given as follows:
  • Set of state representations $\mathcal{S}$: At time step $t$, the agent observes state $s^{t}\in\mathcal{S}$.
  • Set of actions $\mathcal{A}$ and state transition function $P$: At time step $t$, the agent takes an action $a^{t}\in\mathcal{A}$, which induces a transition in the environment according to the state transition function $P(s^{t+1}|s^{t},a^{t}):\mathcal{S}\times\mathcal{A}\rightarrow\mathcal{S}$.
  • Reward function $R$: At time step $t$, the agent obtains a reward $r^{t}$ by a reward function $R(s^{t},a^{t}):\mathcal{S}\times\mathcal{A}\rightarrow\mathbb{R}$.
  • Discount factor $\gamma$: The goal of an agent is to find a policy that maximizes the expected return, which is the discounted sum of rewards $G^{t}:=\sum_{i=0}^{\infty}\gamma^{i}r^{t+i}$, where the discount factor $\gamma\in[0,1]$ controls the importance of immediate rewards versus future rewards.
Here, we only consider continuing agent-environment interactions which do not end with terminal states but go on continually without limit.
Solving a reinforcement learning task means, roughly, finding an optimal policy $\pi^{*}$ that maximizes the expected return. While the agent only receives reward about its immediate, one-step performance, one way to find the optimal policy $\pi^{*}$ is by following an optimal action-value function or state-value function. The action-value function (Q-function) of a policy $\pi$, $Q^{\pi}:\mathcal{S}\times\mathcal{A}\rightarrow\mathbb{R}$, is the expected return of a state-action pair, $Q^{\pi}(s,a)=\mathbb{E}_{\pi}[G^{t}\,|\,s^{t}=s,a^{t}=a]$. The state-value function of a policy $\pi$, $V^{\pi}:\mathcal{S}\rightarrow\mathbb{R}$, is the expected return of a state, $V^{\pi}(s)=\mathbb{E}_{\pi}[G^{t}\,|\,s^{t}=s]$.
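To make the action-value function concrete, below is a minimal tabular Q-learning sketch (a generic illustration under our own assumptions, not the dissertation's algorithm) that estimates $Q(s,a)$ from observed transitions and selects actions epsilon-greedily; the state and action encodings are hypothetical.

```python
import random
from collections import defaultdict

# Tabular Q-function: Q[s][a] approximates the expected discounted return Q(s, a).
Q = defaultdict(lambda: defaultdict(float))

def choose_action(state, actions, epsilon=0.1):
    """Epsilon-greedy: explore a random action with probability epsilon, otherwise exploit argmax_a Q(s, a)."""
    if random.random() < epsilon or not Q[state]:
        return random.choice(actions)
    return max(Q[state], key=Q[state].get)

def q_update(s, a, r, s_next, alpha=0.1, gamma=0.9):
    """One-step Q-learning update: Q(s,a) <- Q(s,a) + alpha * (r + gamma * max_a' Q(s',a') - Q(s,a))."""
    best_next = max(Q[s_next].values(), default=0.0)
    Q[s][a] += alpha * (r + gamma * best_next - Q[s][a])
```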

2.2.2 Problem setting
We now introduce the general setting of RL-based traffic signal control problem, in which the traffic signals are controlled by an RL agent or several RL agents. Figure 2.2 illustrates the basic idea of the RL framework in single traffic signal control problem. The environment is the traffic conditions on the roads, and the agent controls the traffic signal. At each time step t t ttt, a description of the environment (e.g., signal phase, waiting time of cars, queue length of cars, and positions of cars) will be generated as the state s t s t s_(t)\mathsf{s}_{t}st. The agent will predict the next action a t a t a^(t)\mathsf{a}^{t}at to take that maximizes the expected return, where the action could be changing to a certain phase in the single intersection scenario. The action a t a t a^(t)\mathsf{a}^{t}at will be executed in the environment, and a reward r t r t r^(t)\mathsf{r}^{t}rt will be generated, where the reward could be defined on traffic conditions of the intersection. Usually, in the decision process, an agent combines the exploitation of learned policy and exploration of a new policy.
Figure 2.2: RL framework for traffic signal control.
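The following toy sketch illustrates the interaction loop of Figure 2.2 under simplifying assumptions: the `ToySignalEnv` class, its random arrival dynamics, and the negative-total-queue reward are hypothetical stand-ins for a real traffic simulator, and the random action choice stands in for a learning agent.

```python
import random

class ToySignalEnv:
    """Toy stand-in for a traffic simulator at one intersection (illustration only)."""
    def __init__(self, n_phases=4):
        self.n_phases = n_phases
        self.queues = [0] * 4                      # queue length per incoming approach

    def reset(self):
        self.queues = [random.randint(0, 10) for _ in range(4)]
        return tuple(self.queues)

    def step(self, phase):
        # Cars arrive randomly; the chosen phase discharges one approach.
        self.queues = [q + random.randint(0, 2) for q in self.queues]
        self.queues[phase % 4] = max(0, self.queues[phase % 4] - 5)
        reward = -sum(self.queues)                 # surrogate reward: negative total queue
        return tuple(self.queues), reward

env = ToySignalEnv()
state = env.reset()
for t in range(100):
    action = random.randrange(env.n_phases)       # a learning agent would choose here
    state, reward = env.step(action)              # observe the next state and the reward
```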
In the multi-intersection traffic signal control problem, there are $N$ traffic signals in the environment, controlled by one or several agents. The goal of the agent(s) is to learn the optimal policies to optimize the traffic conditions of the whole environment. At each time step $t$, each agent $i$ observes part of the environment as the observation $o_i^t$ and predicts the next actions $\mathbf{a}^t = (a_1^t, \ldots, a_N^t)$ to take. The actions are executed in the environment, and the rewards $r_i^t$ are generated, where the reward could be defined at the level of individual intersections or of a group of intersections within the environment. We refer readers interested in more detailed problem settings to [25].
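A minimal sketch of one multi-intersection control step is given below; the `local_observation` and `choose_phase` helpers are hypothetical placeholders for an agent's observation function and policy.

```python
import random

N = 4  # number of intersections in this toy example

def local_observation(i):
    # Hypothetical local observation: queue lengths on the four approaches of intersection i.
    return [random.randint(0, 10) for _ in range(4)]

def choose_phase(obs, n_phases=4):
    # Placeholder per-agent policy: serve the approach with the longest queue.
    return max(range(n_phases), key=lambda p: obs[p])

observations = {i: local_observation(i) for i in range(N)}
joint_action = {i: choose_phase(observations[i]) for i in range(N)}   # a^t = (a_1^t, ..., a_N^t)
rewards = {i: -sum(observations[i]) for i in range(N)}                # per-intersection surrogate reward
print(joint_action, rewards)
```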

2.3 Literature

In this section, we introduce three major aspects investigated in recent RL-based traffic signal control literature: agent formulation, policy learning approach and coordination strategy.

Agent formulation

A key question for RL is how to formulate the RL agent, i.e., the reward, state, and action definition. In this section, we focus on the advances in the reward, state, and action design in recent deep RL-based methods, and refer readers interested in more detailed definitions to [26, 27, 28].
2.3.1.1 Reward
The choice of reward reflects the learning objective of an RL agent. In the traffic signal control problem, although the ultimate objective is to minimize the travel time of all vehicles, travel time is hard to use directly as a reward in RL: since the travel time of a vehicle is affected by a sequence of signal actions and by vehicle movements, it is delayed and ineffective in indicating how good an individual signal action is. Therefore, the existing literature often uses a surrogate reward that can be effectively measured right after an action, considering factors like average queue length, average waiting time, average speed, or throughput [11, 29]. The authors in [10] also incorporate the frequency of signal changes and the number of emergency stops into the reward.
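As an illustration of such a surrogate reward, the sketch below combines queue length, waiting time, and throughput with hand-picked weights; the weights and the feature choice are assumptions for illustration, not the definitions used in the cited papers.

```python
def surrogate_reward(queue_lengths, waiting_times, throughput,
                     w_queue=1.0, w_wait=0.5, w_through=1.0):
    """Weighted surrogate reward (weights are illustrative only)."""
    return (-w_queue * sum(queue_lengths)
            - w_wait * sum(waiting_times)
            + w_through * throughput)

# Example: 4 incoming lanes with per-lane queue lengths and accumulated waiting times,
# plus the number of vehicles that passed the intersection in the last interval.
print(surrogate_reward([3, 5, 0, 2], [12.0, 30.5, 0.0, 8.0], throughput=6))
```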

2.3.1.2 State

At each time step, the agent receives a quantitative description of the environment as the state to decide its action. Various elements have been proposed to describe the environment state, such as queue length, waiting time, speed, and phase. These elements can be defined at the lane level or the road-segment level and then concatenated into a vector. In earlier work using RL for traffic signal control, researchers needed to discretize the state space and use a simple tabular or linear model to approximate the value functions for efficiency [30, 31, 32]. However, the real-world state space is usually huge, which limits these traditional RL methods in terms of memory and performance. With advances in deep learning, deep RL methods have been proposed that use neural networks as effective function approximators to handle large state spaces. Recent studies propose to use images [33, 34, 35, 36, 37, 38, 39, 40, 10, 11] to represent the state, where the positions of vehicles are extracted into an image-like representation.
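A minimal sketch of a vector-valued state is given below, assuming lane-level queue lengths and waiting times plus a one-hot encoding of the current phase; the exact feature set varies across the cited papers.

```python
import numpy as np

def build_state(queue_lengths, waiting_times, current_phase, n_phases=4):
    """Concatenate lane-level features with a one-hot encoding of the current phase."""
    phase_onehot = np.eye(n_phases)[current_phase]
    return np.concatenate([np.asarray(queue_lengths, dtype=float),
                           np.asarray(waiting_times, dtype=float),
                           phase_onehot])

state = build_state(queue_lengths=[3, 5, 0, 2],
                    waiting_times=[12.0, 30.5, 0.0, 8.0],
                    current_phase=1)
print(state.shape)   # (12,): 4 queue features + 4 waiting-time features + 4 phase bits
```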
2.3.1.3 Action scheme
There are several types of action definitions for an RL agent in traffic signal control: (1) set the current phase duration [42, 43]; (2) set the ratio of each phase duration over a pre-defined total cycle duration [32, 44]; (3) change to the next phase in a pre-defined cyclic phase sequence [10, 11, 27, 45]; and (4) choose the phase to change to among a set of phases [34, 39, 46, 47, 48, 1]. The choice of action scheme is closely related to the specific settings of the traffic signals. For example, if the phase sequence is required to be cyclic, then the first three action schemes should be considered, whereas "choosing the phase to change to among a set of phases" can generate flexible phase sequences.
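The sketch below illustrates action scheme (4), where the agent selects the next phase from a fixed set; the four-phase plan and the helper function are hypothetical examples rather than a definition from the literature.

```python
# Illustrative 4-phase plan; real deployments may use different phase sets.
PHASES = [
    "NS-straight",   # north-south through traffic
    "NS-left",       # north-south left turns
    "EW-straight",   # east-west through traffic
    "EW-left",       # east-west left turns
]

def apply_action(action_index, current_phase):
    """Return the phase to activate; a yellow/all-red transition would be inserted if it changes."""
    next_phase = PHASES[action_index]
    needs_transition = next_phase != current_phase
    return next_phase, needs_transition

print(apply_action(2, current_phase="NS-straight"))   # ('EW-straight', True)
```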

Policy Learning

RL methods can be categorized in different ways. [49, 50] divide current RL methods into model-based and model-free methods. Model-based methods try to model the transition probabilities among states explicitly, while model-free methods directly estimate the value (expected return) of state-action pairs and choose actions based on these estimates. In the context of traffic signal control, the transition between states is primarily influenced by people's driving behaviors, which are diverse and hard to predict. Therefore, most current RL-based methods for traffic signal control are model-free. In this section, we adopt the categorization in [51]: value-based methods and policy-based methods.

2.3.2.1 Value-based methods

Value-based methods approximate the state-value function or the state-action value function (i.e., how rewarding each state or state-action pair is), and the policy is implicitly obtained from the learned value function. Most RL-based traffic signal control methods use DQN [52], where the model is parameterized by neural networks and takes the state representation as input [10, 53]. DQN requires discrete actions, since the model directly outputs the value of each action given a state, which makes it especially suitable for action schemes (3) and (4) mentioned in Section 2.3.1.3.
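A minimal sketch of such a value network is shown below, assuming the 12-dimensional state vector from the earlier example and four candidate phases; it illustrates the general idea and is not the architecture of any specific cited method.

```python
import torch
import torch.nn as nn

class QNetwork(nn.Module):
    """Minimal Q-network: state vector in, one Q-value per candidate phase out."""
    def __init__(self, state_dim, n_actions):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(state_dim, 64), nn.ReLU(),
            nn.Linear(64, 64), nn.ReLU(),
            nn.Linear(64, n_actions),
        )

    def forward(self, state):
        return self.net(state)

q_net = QNetwork(state_dim=12, n_actions=4)       # e.g., the 12-dim state sketched earlier
state = torch.rand(1, 12)
q_values = q_net(state)                           # shape (1, 4): one value per phase
greedy_action = q_values.argmax(dim=1).item()     # epsilon-greedy would sometimes pick a random phase
# Training would regress q_values[0, a] toward the TD target r + gamma * max_a' Q_target(s', a').
```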
2.3.2.2 Policy-based methods
Policy-based methods directly update the policy parameters (e.g., a vector of probabilities for conducting actions in a given state) in the direction that maximizes a predefined objective (e.g., average expected return). The advantage of policy-based methods is that they do not require the actions to be discrete, unlike DQN. They can also learn a stochastic policy and keep exploring potentially more rewarding actions. To stabilize the training process, the actor-critic framework is widely adopted. It utilizes the strengths of both value-based and policy-based methods, with an actor that controls how the agent behaves (policy-based) and a critic that measures how good the conducted action is (value-based). In the traffic signal control problem, [44] uses DDPG [54] to learn a deterministic policy which directly maps states to actions, while [34, 42, 55] learn stochastic policies that map states to action probability distributions, all of which have shown excellent performance in traffic signal control problems. To further improve the convergence speed of RL agents, [56] proposed a time-dependent baseline to reduce the variance of policy gradient updates, specifically to avoid traffic jams.
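A minimal actor-critic sketch is given below; the network sizes, the placeholder TD target, and the loss weighting are illustrative assumptions rather than the setup of any cited method.

```python
import torch
import torch.nn as nn

class ActorCritic(nn.Module):
    """Minimal actor-critic: the actor outputs phase probabilities, the critic a state value."""
    def __init__(self, state_dim, n_actions):
        super().__init__()
        self.shared = nn.Sequential(nn.Linear(state_dim, 64), nn.ReLU())
        self.actor = nn.Linear(64, n_actions)     # policy head (logits over phases)
        self.critic = nn.Linear(64, 1)            # value head V(s)

    def forward(self, state):
        h = self.shared(state)
        return torch.distributions.Categorical(logits=self.actor(h)), self.critic(h)

model = ActorCritic(state_dim=12, n_actions=4)
state = torch.rand(1, 12)
dist, value = model(state)
action = dist.sample()                            # stochastic policy keeps exploring
# For a transition (s, a, r, s'): target = r + gamma * V(s'), advantage = target - V(s).
target = torch.tensor([[0.5]])                    # placeholder target for illustration
advantage = (target - value).detach()
policy_loss = -(dist.log_prob(action) * advantage).mean()
value_loss = (target - value).pow(2).mean()
loss = policy_loss + 0.5 * value_loss             # would be backpropagated by an optimizer
```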
In the above-mentioned methods, including both value-based and policy-based methods, deep neural networks are used to approximate the value functions. Most of the literature uses standard network architectures, exploiting their corresponding strengths. For example, Convolutional Neural Networks (CNNs) are used when the state contains an image-like representation [10, 38, 39, 40, 33, 34, 35, 36], and Recurrent Neural Networks (RNNs) are used to capture the temporal dependency of historical states [57].
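As an illustration, the sketch below encodes an image-like state (a hypothetical 32x32 vehicle-position grid) with a small CNN; the architecture is an assumption for illustration only, not one taken from the cited papers.

```python
import torch
import torch.nn as nn

class PositionImageEncoder(nn.Module):
    """Toy CNN that maps an image-like state (vehicle-position grid) to phase values."""
    def __init__(self, n_actions=4):
        super().__init__()
        self.conv = nn.Sequential(
            nn.Conv2d(1, 16, kernel_size=3, stride=2), nn.ReLU(),
            nn.Conv2d(16, 32, kernel_size=3, stride=2), nn.ReLU(),
            nn.Flatten(),
        )
        self.head = nn.LazyLinear(n_actions)      # infers the flattened size on first call

    def forward(self, x):
        return self.head(self.conv(x))

encoder = PositionImageEncoder()
grid = torch.rand(1, 1, 32, 32)                   # 32x32 occupancy grid around the intersection
print(encoder(grid).shape)                        # torch.Size([1, 4])
```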
Coordination

Coordination can benefit signal control in multi-intersection scenarios. Since recent advances in RL have improved the performance of isolated traffic signal control, efforts have been made to design coordination strategies for MARL agents. The literature [60] categorizes MARL into two classes: joint action learners and independent learners. Here we extend this categorization to the traffic signal control problem.
Joint action learners
A straightforward solution is to use a single global agent to control all the intersections [31]. It directly takes the full state as input and learns to set the joint actions of all intersections at the same time. However, such methods suffer from the curse of dimensionality: the state-action space grows exponentially with the number of state and action dimensions.
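The following two-line example illustrates this exponential growth for a hypothetical grid of 16 intersections with 4 candidate phases each:

```python
# With N intersections and |A| candidate phases each, a single global agent faces |A|**N joint actions.
n_phases, n_intersections = 4, 16
print(n_phases ** n_intersections)                # 4294967296 possible joint actions
```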
| Citation | Method | Cooperation | Road net (# signals) |
| :--- | :--- | :--- | :--- |
| [46] | Value-based | With communication | Synthetic (5) |
| | Policy-based | Without communication | Real (50) |
| | Policy-based | Without communication | Real (43) |
| | Value-based | Without communication | Real (2510) |
| | Policy-based | Joint action | Real (30) |
| | Value-based | - | Synthetic (1) |
| | Value-based | - | Synthetic (1) |
| | Value-based | Without communication | Synthetic (9) |
| | Both studied | - | Synthetic (1) |
| | Value-based | With communication | Synthetic (6) |
| | Value-based | Without communication | Synthetic (4) |
| | Both studied | Single global | Synthetic (5) |
| | Policy-based | - | Real (1) |
| | Value-based | Joint action | Synthetic (4) |
| | Value-based | With communication | Real (4) |
| | Value-based | - | Synthetic (1) |
| | Value-based | Without communication | Real (16) |
| | Value-based | With communication | Real (196) |
| | Value-based | Joint action | Synthetic (36) |
| | Value-based | Without communication | Real (16) |
| | Value-based | Without communication | Real (5) |

* Traffic with an arrival rate of less than 500 vehicles/hour/lane is considered light traffic in this survey; otherwise it is considered heavy.

* 1. Synthetic light uniform; 2. Synthetic light dynamic; 3. Synthetic heavy uniform; 4. Synthetic heavy dynamic; 5. Real-world data
Table 2.1: Representative deep RL-based traffic signal control methods. Due to page limits, we only include a subset of the investigated papers here.
Joint action modeling methods explicitly learn to model the joint action value of multiple agents, $Q(o_1, \ldots, o_N, \mathbf{a})$. The joint action space grows with the number of agents to model. To alleviate this challenge, [10] factorizes the global Q-function as a linear combination of local subproblems, extending [9] with the max-plus algorithm [61]: $\hat{Q}(o_1, \ldots, o_N, \mathbf{a}) = \sum_{i,j} Q_{i,j}(o_i, o_j, a_i, a_j)$, where $i$ and $j$ correspond to the indices of neighboring agents. In other works, [48, 59, 62] regard the joint Q-value as a weighted sum of local Q-values, $\hat{Q}(o_1, \ldots, o_N, \mathbf{a}) = \sum_{i,j} w_{i,j} Q_{i,j}(o_i, o_j, a_i, a_j)$.
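A minimal sketch of this factorization is given below; the pairwise local Q-function here is a hand-written stand-in (in the cited works it is learned), and the neighbor pairs and weights are illustrative.

```python
def local_q(obs_i, obs_j, a_i, a_j):
    # Hypothetical pairwise term, e.g., based on the combined queues and chosen phases.
    return -(sum(obs_i) + sum(obs_j)) + 2.0 * (a_i + a_j)

def joint_q(observations, actions, neighbor_pairs, weights=None):
    """Q_hat(o_1..o_N, a) = sum over neighbor pairs (i, j) of w_ij * Q_ij(o_i, o_j, a_i, a_j)."""
    total = 0.0
    for (i, j) in neighbor_pairs:
        w = 1.0 if weights is None else weights[(i, j)]
        total += w * local_q(observations[i], observations[j], actions[i], actions[j])
    return total

obs = {0: [3, 1], 1: [5, 2], 2: [0, 4]}           # toy local observations for 3 intersections
acts = {0: 1, 1: 0, 2: 1}                         # toy joint action
print(joint_q(obs, acts, neighbor_pairs=[(0, 1), (1, 2)]))   # -18.0
```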