
Improving Playtesting Coverage via Curiosity Driven Reinforcement Learning Agents

Camilo Gordillo, Joakim Bergdahl, Konrad Tollmar, Linus Gisslén
SEED - Electronic Arts (EA), Stockholm, Sweden

cgordillo, jbergdahl, ktollmar, lgisslen@ea.com
Abstract

As modern games continue growing both in size and complexity, it has become more challenging to ensure that all the relevant content is tested and that any potential issue is properly identified and fixed. Attempting to maximize testing coverage using only human participants, however, results in a tedious and hard to orchestrate process which normally slows down the development cycle. Complementing playtesting via autonomous agents has shown great promise accelerating and simplifying this process. This paper addresses the problem of automatically exploring and testing a given scenario using reinforcement learning agents trained to maximize game state coverage. Each of these agents is rewarded based on the novelty of its actions, thus encouraging a curious and exploratory behaviour on a complex 3D scenario where previously proposed exploration techniques perform poorly. The curious agents are able to learn the complex navigation mechanics required to reach the different areas around the map, thus providing the necessary data to identify potential issues. Moreover, the paper also explores different visualization strategies and evaluates how to make better use of the collected data to drive design decisions and to recognize possible problems and oversights.

Index Terms:
automated game testing, computer games, reinforcement learning, curiosity

I Introduction

Playtesting modern video games using human participants alone has become infeasible due to the sheer scale of the projects. As games grow in size and complexity, maximizing coverage and ensuring sufficient exploration becomes a tedious, repetitive, and labor-intensive task. By contrast, automated approaches relying on AI-based agents can be parallelized and accelerated to provide results in a short period of time [1], thus complementing the regular testing pipelines.

We tackle the problem of automatically exploring a given scenario with the purpose of identifying any potential issues which could otherwise be overlooked by human testers. Human participants, we argue, should focus on testing and experiencing the key mechanics of the game without the burden of identifying, documenting and reporting general glitches.

Our approach focuses on the use of reinforcement learning (RL) agents to maximize testing coverage. These types of agents have shown important and appealing advantages over classical techniques when applied to game testing [2] and may play an important role when working with complex 3D scenarios like the one presented in this paper. We make use of curiosity as the motivating factor encouraging a set of RL agents to improve exploration and to seek novel interactions. It is therefore important to note that our intent is not to optimize a behaviour policy or any specific game score, but to make sure that proper and sufficient data can be collected while training such agents. Access to these kinds of data would enable a wide variety of applications, such as automatically mapping reachable/unreachable areas in the scenario, identifying unintended mechanics, and visualizing changes in response to design choices, to name a few. With the proper tools and metrics, moreover, other important issues like crash-inducing bugs and frame rate drops could also be triggered and recognized while training.

Once the game has been sufficiently explored and data has been collected, proper metrics and visualizations are required to make sense of the events recorded while training. Previous approaches have proposed different visualizations to derive insights about level design from playtesting data [3][4], and we use some of these ideas as reference to introduce metrics and analytics allowing us to validate our results and to identify different problems around the environment.

II Related Work

To date, a couple of studies have investigated the use of automatic exploration techniques to maximize game state coverage. Walk Monster [5] is an automated reachability testing tool implemented while developing The Witness (a puzzle game released in 2016). The purpose of this tool was to validate the traversability of the map and to identify any potential issues: players getting stranded by reaching areas they were not supposed to get into, getting stuck inside geometries, etc. The proposed algorithm managed to achieve impressive results despite employing fairly simple exploration heuristics. Nevertheless, it is important to note the simplicity and low dimensionality of the game itself (a two-dimensional space in practice). Similarly, several exploration strategies have been evaluated with the aim of producing a semantic map of reachable states in several commercial games (from Atari 2600 to Nintendo 64) [6]. Even though their results are comparable to human gameplay, their exploration strategies rely heavily on random actions. This, we argue, would not be applicable in more complex scenarios like the one presented in this paper. The Wuji [7] framework, on the other hand, employs an RL policy similar to ours together with evolutionary multi-objective optimization to encourage exploration and high game state coverage in two commercial combat games. Contrary to our approach, however, the authors do not evaluate the use of data and visualizations to allow for the identification of bugs and oversights.

More recent approaches have been designed to take advantage of human demonstrations. When data from human participants is available, the Reveal-More algorithm [8] can be used to amplify coverage by focusing on exploring around those trajectories. This idea is indeed very interesting and orthogonal to our current approach. We could, for example, make use of imitation learning techniques to encourage and bias exploration around those human generated trajectories.

Another similar research direction focuses on the use of procedural personas to mimic how different player archetypes would interact with a given scenario. The PathOS framework [4], for example, is a very comprehensive tool for simulating testing sessions with artificial agents. Each of these agents is modeled to represent one particular player archetype by using classical scripted AI. Scripting each of these behaviours, however, can prove to be quite challenging as the complexity of the game increases. Other approaches, in contrast, have tried to automate the generation of such behaviours. The authors of [9], for example, propose the use of Monte Carlo tree search and evolutionary algorithms to generate utility functions leading to different behaviours in 2D dungeon levels. How this approach would scale to games of higher complexity, however, remains an open question.

Regardless of what kinds of automated strategies are employed, proper visualizations and metrics are key to making sense of the collected data. In [3] the authors propose a set of visualizations to analyze level design in 2D side-scrolling games. In a similar fashion, the authors of [10] introduce Differentia, a set of visualizations to evaluate incremental game design changes. Although our approach draws inspiration from all of these techniques, it is important to remark that 2D visualizations are unlikely to be enough when testing complex 3D scenarios. The PathOS framework [4], on the other hand, is perhaps one of the most similar approaches in terms of visualizations and metrics, allowing the user to visualize the outcome of the simulation directly in the game engine.

Meanwhile, intrinsic motivation is a highly studied topic within the reinforcement learning community, aiming to encourage agents to explore and to play in the absence of an extrinsic reward. One such motivation, curiosity, was originally proposed by [11] as a way of rewarding agents for exploring previously unseen game states and for improving their knowledge about the world. In [12], curiosity is used as a mechanism for pushing agents to explore complex environments more efficiently while learning skills which may become useful later in their lifetime. A detailed survey on the use of intrinsic motivation in reinforcement learning is presented in [13].

III Implementation

Our approach relies on a set of RL agents continuously interacting with the game and encouraged to maximize coverage. In contrast to the PathOS framework [4] (see above), we employ curiosity as the sole motivation profile driving each of the agents. The following sections describe the scenario developed for our experiments, the RL setup and training algorithm, and the tools which were developed to collect relevant data and generate visualizations directly in the game engine.

III-A Environment

We evaluate our approach on a relatively large (500 m × 500 m × 50 m) map designed for the purpose of creating an elaborate navigation landscape. We created a scenario with complex navigation mechanics (e.g. jumps, climbable walls and elevators) in a 3D space as shown in Fig. 1. Moreover, and similar to what is normally seen in modern video games, we have designed the scenario so that complex navigation strategies are required to fully explore the different areas around the map. More details about the environment, navigation mechanics and final results can be found in the accompanying video (https://www.youtube.com/watch?v=cfm3R94FB_4).

Figure 1: Evaluation map: 500 m × 500 m × 50 m. The environment contains complex navigation challenges composed of multiple sequential jumps, climbable objects and elevators.

The character in our environment (see Fig. 2) is 1.7 m tall and has three continuous navigation actions: forward/backward, left/right turn and left/right strafe, plus a discrete action for jumping. The character also has the ability to climb special surfaces located around the map.

Because the purpose of our approach is to test the game and identify any potential issues, we introduced a set of known bugs into the map. These bugs include missing collision boxes, design oversights, places where players may get stuck, etc.

It is important to remark that, contrary to similar approaches like the one presented in [4], we make no use of navigation meshes within our environment. As discussed in [2], navigation meshes are often not designed to resemble the freedom of movement that human players will have. Any agent constrained by these meshes will, most likely, fail at exploring the environment to the same degree a human would, and will therefore miss those bugs frequently found by human players. In the next section we discuss how an RL agent can be used to control navigation and improve exploration.

III-B Reinforcement learning setup

We make use of proximal policy optimization (PPO) [14] as our RL training algorithm. PPO is a robust and well-established baseline within the RL community when working with continuous action spaces. We evaluate and compare the performance of the agents when provided with two different types of observations. The first consists of an aggregate vector of 37 values: agent position (ℝ³), agent velocity (ℝ³), agent world rotation (ℝ⁴), is climbing (𝔹), is in contact with ground (𝔹), jump cool-down time (ℝ), and a vision array (ℝ²⁴). The vision array consists of 12 ray casts projected in various directions around the agent (see Fig. 2). Each of these rays provides two values: a collision distance and a semantic label depending on the type of object it collides with. All values are normalized to lie within [−1, 1]. We also explore a second configuration that provides the agents with an additional first-person view of the environment; we compare both models in Section IV-A.
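As an illustration, the observation vector described above can be assembled as follows. This is a sketch only: the scaling constants (`pos_scale`, `vel_scale`, `ray_range`) and the assumed number of semantic classes are our own placeholders, not values reported in the paper.

```python
import numpy as np

def build_observation(position, velocity, rotation_quat, is_climbing,
                      on_ground, jump_cooldown, ray_hits,
                      pos_scale=250.0, vel_scale=10.0, ray_range=30.0,
                      num_semantic_classes=10):
    """Assemble the 37-value observation vector (sketch; scales assumed)."""
    rays = np.asarray(ray_hits, dtype=np.float32)  # (12, 2): distance, class id
    # Map each ray value from [0, max] to [-1, 1].
    rays = rays / [ray_range, float(num_semantic_classes)] * 2.0 - 1.0
    obs = np.concatenate([
        np.asarray(position) / pos_scale,   # agent position, R^3
        np.asarray(velocity) / vel_scale,   # agent velocity, R^3
        np.asarray(rotation_quat),          # world rotation quaternion, R^4
        [1.0 if is_climbing else -1.0],     # is climbing, B
        [1.0 if on_ground else -1.0],       # in contact with ground, B
        [jump_cooldown],                    # cool-down, assumed pre-normalized
        rays.ravel(),                       # vision array, R^24
    ]).astype(np.float32)
    assert obs.shape == (37,)
    return np.clip(obs, -1.0, 1.0)         # keep everything in [-1, 1]
```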

Global Hyperparameters
  Learning rate (α)
  Discount (γ)              0.98
  PPO-Clip                  0.2
  Entropy coefficient       1e-2
  GAE coefficient (λ)
  Fully connected layers    [1024, 512, 256]
  LSTM layer                256
Visual Encoder
  Image size                [84, 84, 3]
  Kernel size               [5, 3, 3, 3]
  Padding                   [1, 1, 1, 1]
  Strides                   [2, 2, 2, 1]
  Channels                  [32, 32, 64, 64]
TABLE I: Algorithm's training hyperparameters and model architecture.
Figure 2: A representation of our character and the ray casts composing the vision array.

The reward given to the agents is a function of novelty and is described, together with the reset logic, in the following sections. The algorithm's hyperparameters and the model's architecture are presented in Table I.

III-C Optimizing coverage through count-based exploration

One of the great advantages of automated testing is the ability to parallelize and scale to a degree which is unfeasible to reach with human participants alone. The results described in this paper were collected while simulating and training 320 agents distributed across multiple machines. Our distributed setup allows us to train a single and centralized model using data collected from multiple environment instances. Instantiating a single training server also allows us to easily process, analyze, and store all the data collected by the agents in a single place.

The reward given to the agents is computed following the idea of count-based exploration [15] and is therefore inversely proportional to how frequently a given game state has been visited. We define these states as the 3D position of the agent at a given point in time. Keeping track of visit counters over a continuous space, however, quickly becomes intractable. To solve this problem, we discretize the space by means of a threshold τ: an agent is only considered to have entered a new state once its distance to every previously visited state is larger than τ.

Using a small value for τ increases the number of points we need to keep track of and may therefore hinder performance when working with large maps. A high value, on the contrary, results in a very sparse reward signal for our agents, which significantly increases the difficulty of the task. The value τ = 5 m was found empirically after a couple of experiments and has proven to be a sensible choice not only in the scenario presented here, but also in other maps not shown in this paper.

When a new observation is received, the first step to compute a reward is to extract the position p of the agent and compare it to all previously visited states. This buffer is initially empty and gets populated as exploration takes place. We also keep track of a visit counter N_i for each point i in the current buffer. If the minimum distance between the current position p and the points in the buffer is larger than τ, then p is added to the buffer and its visit counter is set to 1. If, on the contrary, the minimum distance is smaller than τ, we identify the point i within the buffer closest to p and increment N_i by 1. Having done this, the reward for reaching point i is computed using Equation 1, where R_max is set to 0.5 and defines the reward for exploring a new point, and max_counter is set to 500, thus annealing the reward down to zero as a given point receives more frequent visits.

R_t = R_max [1 − N_i / max_counter]    (1)
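The buffer update and the reward of Equation 1 can be sketched as follows, using a brute-force nearest-neighbour search over the buffer; at the scale reported in the paper a spatial index (e.g. a KD-tree) would likely be needed instead.

```python
import numpy as np

class CountBasedReward:
    """Sketch of the count-based exploration reward (Eq. 1).

    States are discretized by the threshold tau: a position counts as a
    new state only if it is farther than tau from every buffered point.
    """

    def __init__(self, tau=5.0, r_max=0.5, max_counter=500):
        self.tau, self.r_max, self.max_counter = tau, r_max, max_counter
        self.points = []   # visited 3D positions
        self.counts = []   # visit counters N_i

    def reward(self, p):
        p = np.asarray(p, dtype=np.float32)
        if self.points:
            d = np.linalg.norm(np.asarray(self.points) - p, axis=1)
            i = int(np.argmin(d))
            if d[i] <= self.tau:
                # Existing state: increment its counter and anneal the reward.
                self.counts[i] += 1
                return self.r_max * (1.0 - self.counts[i] / self.max_counter)
        # New state: add it to the buffer with N_i = 1.
        self.points.append(p)
        self.counts.append(1)
        return self.r_max * (1.0 - 1.0 / self.max_counter)
```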

III-D Reset logic

Each training episode is simulated for 3000 steps (equivalent to 1 minute of gameplay) and the agents are respawned once time is up. An initial spawning location was defined for this scenario, located near the middle of the map at ground level. As training goes on, spawning locations are sampled from the current buffer using the inverse of the corresponding visit counters as sample weights. This logic avoids the bias that would come from respawning the agents at previously unexplored positions, while giving priority to less frequently visited locations.

To prevent agents from spawning in mid-air, we take advantage of one of the values available in the observation vector: is_in_contact_with_ground. When storing a new point in the buffer we keep track of whether or not the character was stepping on something at that location. Then, when sampling a new spawn position, we only consider the points in the buffer for which this condition is true.
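The respawn sampling above can be sketched as follows. The inverse-count weighting and the ground filter follow the text; the fallback behaviour when no grounded point exists yet is our assumption.

```python
import numpy as np

def sample_spawn(points, counts, on_ground, rng=None):
    """Sample a respawn location from the visited-state buffer.

    Only points where the character was in contact with the ground are
    eligible; sampling weights are the inverse visit counters, so rarely
    visited locations are prioritised. Returning None (i.e. fall back to
    the initial spawn location) when nothing is eligible is an assumption.
    """
    rng = rng if rng is not None else np.random.default_rng()
    points = np.asarray(points, dtype=np.float32)
    counts = np.asarray(counts, dtype=np.float32)
    eligible = np.flatnonzero(np.asarray(on_ground, dtype=bool))
    if eligible.size == 0:
        return None
    weights = 1.0 / counts[eligible]
    weights /= weights.sum()
    return points[eligible[rng.choice(eligible.size, p=weights)]]
```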

This way of respawning the agents at previously visited states strongly resembles algorithms such as Rapidly-Exploring Random Trees (RRT) [16]. In contrast to the exploration strategies proposed in [6], however, we take advantage of the complex navigation strategies developed by our agents to explore around those spawning locations. In Section IV we compare the performance of our approach to the one of a random policy very similar to the chaos monkey strategy proposed in [6]. This random policy employs the same respawning logic presented above which allows us to fairly compare both techniques.

III-E Collecting and visualizing data

Different kinds of data are continuously processed and stored while the agents interact with the environment. Most of this data is handled by the centralized training process which receives all episodic information (i.e. observations, actions, rewards). Some of the data, however, is recorded directly by the environments based on possible events triggered by the agents which are harder to identify outside the engine. The nature of this data, and how it is used to identify and correct problems, is described together with the corresponding experiments in the next section.

Other relevant metrics, such as the number of visited states and values relevant to the RL algorithm, are also continuously logged and visualized as training goes on. These logs are very useful for quickly judging and/or comparing the performance of a set of experiments without having to wait for their completion.

IV Results

In this section we present results on the exploration performance of our agents and give a few examples of the types of analyses which can be conducted using the collected data. As described in the previous section, our simulation pipeline generates a set of files which can be loaded directly into the game engine, allowing us to identify potential problems in the game.

IV-A Exploration performance and map coverage

The first thing we would like to evaluate is the ability of our RL agents to navigate and explore the whole map. To do this, we make use of the buffer of visited states introduced in Section III-C. This set of visited 3D coordinates is stored and updated as training goes on and can be used as a metric for exploration and coverage. Fig. 3 shows the percentage of the map covered by our agents when compared to a random policy. As expected, the random based exploration technique did not cover the whole map due to its complexity and was only able to reach easily accessible areas. Moreover, complementing the observation space of our RL agents with a camera image (first person view) improves the results by allowing the model to better understand its surroundings and by decreasing the uncertainty introduced by the discrete set of ray casts. It currently takes around 24 hours to explore 90% of the map but, as discussed in Section V, we believe that coming up with better and more efficient ways for encoding the environment may boost the performance of the agents and speed up exploration.
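The coverage metric plotted in Fig. 3 reduces to the ratio between the visited-state buffer size and the estimated maximum number of reachable discretized points. A trivial sketch:

```python
def coverage_percent(visited_states, estimated_max=25_000):
    """Fig. 3's coverage metric: size of the visited-state buffer relative
    to the estimated maximum number of reachable discretized points
    (25K for this particular map; it must be re-estimated per map)."""
    return 100.0 * len(visited_states) / estimated_max
```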

Figure 3: Map coverage as a function of simulation time. The maximum number of points which could be reached (equivalent to 100%) was estimated to be 25K. The plot shows the mean and variance of the performance of different policies over 5 different runs.
Refer to caption
(a) Challenge A: The agents were able to reach the top at the right of the figure by jumping and climbing over a series of obstacles.
Refer to caption
(b) Challenge B: The only way of reaching the top of the block to the left is by sequentially jumping over the rocks.
Figure 4: Two of the navigation challenges spread across the map and the solution found by our training RL agents. The red cubes represent the points in our buffer of visited states. The blue trajectories showcase the paths that a player would have to follow to fully explore these areas.
Refer to caption
(a) Challenge A: A random policy was, as expected, unable to solve complex navigation tasks even when simulated for longer periods of time.
Refer to caption
(b) Challenge B: The only entry point to the top right area in this challenge is via the bridge coming from the lower-left. Random agents were unable to find this path.
Figure 5: Exploration performance of a random exploration strategy when faced with complex navigation challenges.

As shown in Figs. 4 and 5, all the data collected by the agents (e.g. the buffer of visited states) can also be loaded and displayed on top of the map. These visualizations allow the designers to verify whether or not different areas across the map are reachable and also allow us to compare the exploration performance of different navigation strategies. Both figures showcase a couple of relatively complex navigation challenges spread across the map and the extent of the exploration coverage achieved by our RL agents when compared to a simple random exploration strategy.

Another set of interesting findings relates to the agents reaching areas which should have been inaccessible for the player. These areas could be identified either by visually inspecting the distribution of the collected points or, as shown in the next section, by defining exploration boundaries.

IV-B Exploration boundaries and regions of interest

Our method allows the designer to specify both an exploration boundary (EB) and regions of interest (ROIs) across the map prior to training. The EB defines the section of the map which should be explored and the episode terminates whenever the agent exits that boundary. The ROIs, on the other hand, are optional regions inside the EB and serve as a reference for data collection.

Although we would like to record and store the specific trajectories followed by the agents, doing so would quickly become too expensive and intractable during long simulations. The definition of the EB and the ROIs, however, allows us to focus on those trajectories which are likely to be useful for testing the game. Our technique keeps track of the episodic trajectory of each agent but only records it if two conditions are met: first, the trajectory must cross the boundary defining either the EB or a ROI; second, the point at which the agent crosses that boundary must differ significantly from that of any previously recorded trajectory.
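The two recording conditions can be sketched as follows. The `min_separation` threshold and the exact region-membership test are assumptions, since the paper does not specify how "significantly different" crossing points are measured.

```python
import numpy as np

class TrajectoryRecorder:
    """Sketch of the trajectory filter: keep an episodic trajectory only if
    it crosses the given region boundary (EB or ROI) at a point far from
    all previously recorded crossings."""

    def __init__(self, contains_fn, min_separation=10.0):
        self.contains = contains_fn        # region membership test for a point
        self.min_separation = min_separation
        self.crossings = []                # stored crossing points
        self.recorded = []                 # stored trajectories

    def submit(self, trajectory):
        inside = [self.contains(p) for p in trajectory]
        for k in range(1, len(trajectory)):
            if inside[k] != inside[k - 1]:   # boundary crossed between samples
                c = np.asarray(trajectory[k], dtype=np.float32)
                novel = all(np.linalg.norm(c - q) >= self.min_separation
                            for q in self.crossings)
                if novel:
                    self.crossings.append(c)
                    self.recorded.append(trajectory)
                return novel
        return False                         # never crossed: discard
```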

Fig. 6 shows examples of trajectories leaving the EB when that boundary is defined at the walls surrounding the scenario in Fig. 1. Our technique allows the user to display these trajectories directly in the game engine and, in this case, would reveal design oversights around the map allowing the player to leave the game area. Figs. 7 and 8 show additional examples of the kinds of issues which could be identified using these visualizations. Fig. 7 shows a trajectory leaving the scenario due to a collision box missing from one of the wall segments. This particular oversight was intentionally introduced into the map to validate the usefulness of the collected data. Interestingly, not all of the problems we were able to identify were intentionally added to the game. Fig. 8, for example, shows a trajectory recorded during some of our first design iterations. In this case, the agent was getting stuck between two objects and the physics engine would eventually throw it upwards, forcing the character to leave the EB. This shows how useful these visualizations can be for identifying problems early in the design process.

Figure 6: Visualization of trajectories leaving the exploration boundary. The agents have found multiple ways of exploiting design oversights to reach into areas which should be inaccessible. These trajectories were recorded during training and could be used to redesign the map and fix these issues.

The ROIs, on the other hand, allow the user to validate the reachability and access to particular areas in the map. Fig. 9 shows two such regions which were intended to be unreachable for the player. The agents were indeed unable to reach the first region and therefore no trajectories were recorded. The second region, however, ended up having one access point which could be identified using the collected data.

Figure 7: A trajectory leaving the game area due to a missing collision box. The collision box of a small segment in the wall (red arrow) was intentionally removed to validate the use and application of the collected data and the proposed visualizations.
Figure 8: The agents were getting stuck in the gap between these objects (left) which was causing the physics engine to eventually throw them upwards into the sky. A trajectory leaving the exploration boundary (right) was recorded by our system and allowed us to fix the problem by closing up the gap.
Figure 9: Two regions of interest defined at areas which should have been unreachable for the player. The buffer of visited states (red cubes) allows us to see that the region to the left remained unexplored while the region to the right was reachable somehow. The trajectories recorded while training, however, allow us to easily visualize how the agents did manage to break into that region and could help the designers to correct any oversight.

IV-C Connectivity graph

Figure 10: Visualizing the connectivity between a small subset of collected points. Bidirectional edges are represented in white while red edges point towards the target node. This connectivity graph broadly behaves like a classical navigation mesh and represents the exploration space that the players are able to traverse.

We can do more than just store a point cloud of visited states. We can, for instance, configure our training server to build and store a graph structure representing the connectivity between those points. For this reason, the episodic trajectories collected to train our agents are also used to generate a directed graph like the one shown in Fig. 10. Even though the accuracy of such a graph strongly depends on our discretization threshold τ (see Section III-C), we believe it has a couple of very promising applications, as presented next.
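A rough sketch of how such a graph could be assembled from the episodic trajectories, assuming positions are snapped to grid cells at resolution τ and consecutive cells are linked by a directed edge (the `tau` parameter and function names are our own, for illustration):

```python
from collections import defaultdict

def discretize(point, tau):
    """Snap a 3D position to its grid cell at resolution tau."""
    return tuple(int(v // tau) for v in point)

def build_graph(trajectories, tau):
    """Build a directed connectivity graph from episodic trajectories.
    Nodes are discretized cells; an edge (a, b) records that some agent
    moved from cell a to cell b in consecutive simulation steps."""
    graph = defaultdict(set)
    for traj in trajectories:
        cells = [discretize(p, tau) for p in traj]
        for a, b in zip(cells, cells[1:]):
            if a != b:
                graph[a].add(b)
    return graph
```

An edge from a to b without the reverse edge would correspond to the one-way (red) connections in Fig. 10, e.g. a drop the agents can fall down but not climb back up.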

IV-C1 Navigating to custom points of interest

We can make use of the connectivity graph and path planning algorithms to estimate navigation trajectories between any given two points. Our tool allows the user to define an initial and a target position in the map and it then generates a navigation trajectory between those two points based on the data collected from the agents. Fig. 11 shows an example of such a trajectory.
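Since cell-to-cell distances are roughly uniform after discretization, a plain breadth-first search is one plausible choice of path planner here; this sketch assumes the directed graph is stored as a mapping from each cell to its successor cells (names are illustrative):

```python
from collections import deque

def shortest_path(graph, start, goal):
    """Breadth-first search over the connectivity graph. Returns the list
    of cells from start to goal, or None if the goal is unreachable."""
    frontier = deque([start])
    came_from = {start: None}
    while frontier:
        node = frontier.popleft()
        if node == goal:
            path = []
            while node is not None:
                path.append(node)
                node = came_from[node]
            return path[::-1]
        for nxt in graph.get(node, ()):
            if nxt not in came_from:
                came_from[nxt] = node
                frontier.append(nxt)
    return None
```

A `None` result is itself informative: it means no agent ever traversed a route between the two chosen points during training.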

We argue that a tool like this could be very useful for designers to comprehend how the agents are navigating the map and whether or not they have found potential exploits. Fig. 12, for example, shows one particular exploit which wasn’t intentionally introduced into the game and which was identified thanks to the connectivity graph. In Fig. 12(a) the agents are able to climb over the wall without the need of the elevator. This seems to be happening due to the slope of the walls at that particular corner (you can see more details in the accompanying video). Fig. 12(b), in contrast, shows the trajectory followed by the agent once the previous issue was fixed.

Figure 11: The connectivity graph presented in Fig. 10 allows us to generate trajectories between any given two points in the map. In this case, the red block to the left was set as the origin while the green block to the right was declared as the navigation target. The path which an agent could have followed is shown in white.
(a) There seems to be a path leading to the top of this platform without the need of using the elevator. The problem is caused by the slope of the walls at that particular corner.
(b) This is how the same path looks like once the slopes are adjusted to prevent the agents from climbing. Taking the elevator is now the only option to reach the top.
Figure 12: The connectivity graph and the visualization of custom trajectories allowed us to identify a minor oversight resulting in agents being able to climb over this particular segment (more details in the accompanying video).

IV-C2 Semantic connectivity maps

The connectivity graph can also be used to identify how different areas in the map are connected to each other. The specific regions in the map could be either manually defined by the user or, as in our next experiment, automatically extracted from the point cloud of visited states. Fig. 13 shows an example of such a mapping where the regions in the map were automatically extracted and color-coded using unsupervised clustering algorithms. Once the regions are identified, we can make use of the connectivity graph to analyze what kinds of connections exist between them. This semantic mapping could then be used to drive design decisions, to validate the mechanics and traversability of the map, and to recognize potential exploits (unexpected paths).
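One way to lift the cell-level graph to a region-level semantic map, assuming a clustering pass (or a manual annotation) has already assigned each cell a region label; the mapping and function name below are hypothetical:

```python
def region_connectivity(graph, labels):
    """Lift the cell-level connectivity graph to region level.
    `labels` maps each cell to a region id (e.g. from a clustering pass);
    the result records which regions are directly reachable from which,
    as a set of directed (source_region, target_region) edges."""
    edges = set()
    for cell, successors in graph.items():
        for nxt in successors:
            ra, rb = labels[cell], labels[nxt]
            if ra != rb:
                edges.add((ra, rb))
    return edges
```

An edge present in one direction only would correspond to a one-way transition between regions, such as the downwards-only drops shown in red in Fig. 13.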

IV-D Termination states

It is common for players to find themselves stuck in some particular part of the map due to issues in the environment or the placement of game assets which render the playing character immovable. Maximizing exploration coverage gives us the opportunity to automatically identify such locations during training. One approach is to keep a termination counter for each point in our buffer of visited states (i.e. how many times an episode ended with an agent in that location). Once training is over, we can analyze the distribution of terminal states. Any outlier in this distribution can be easily identified and is likely to be caused by agents getting stuck in that position.
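A simple sketch of this outlier analysis, flagging cells whose termination counts exceed the mean by k standard deviations (the specific threshold rule is our own illustrative choice, not necessarily the one used in the paper):

```python
from statistics import mean, stdev

def termination_outliers(terminal_cells, k=3.0):
    """Count how many episodes ended in each discretized cell and flag
    cells whose count exceeds mean + k * std. Such outliers are likely
    locations where agents get stuck."""
    counts = {}
    for cell in terminal_cells:
        counts[cell] = counts.get(cell, 0) + 1
    values = list(counts.values())
    if len(values) < 2:  # stdev needs at least two data points
        return []
    mu, sd = mean(values), stdev(values)
    return [cell for cell, n in counts.items() if n > mu + k * sd]
```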

We conducted experiments by introducing areas across the map where the agents could get stuck. Fig. 14 shows some examples of such locations together with the outlier positions identified from the collected data. Due to the high coverage achieved by our agents, all the intentionally introduced issues could be identified, as well as one unintentional design oversight causing a similar problem (see Fig. 14(a)).

Figure 13: Semantic map generated automatically using the data collected from the agents. Regions in the map are identified and color-coded using unsupervised clustering algorithms and the connectivity graph is then used to visualize traversability between regions. Blue and red lines represent upwards and downwards trajectories respectively.
Figure 14: Visualizing areas where players could get stuck. The purple areas are intentionally introduced surfaces which freeze the player in place if they come in contact. The green blocks represent the outliers encountered in the distribution of terminal states and highlight the locations around the map which could be problematic. Figure (14(a)) shows two additional blocks between two platforms which were caused by the agents falling in the gap and getting stuck. This was a design oversight which was not intentionally introduced into the game and which was identified thanks to these visualizations. Figure (14(c)), on the contrary, shows a region in the map which was designed for the agents to get trapped if they fell into it.

V Conclusions and future work

In this paper we have shown a potential use case for RL agents trained to maximize testing coverage in complex 3D scenarios. We have reported on the use of curiosity to encourage exploratory behaviour in our agents, thus allowing them to fully traverse the environment. We have shown that curiosity driven agents can be used for automating the collection of playtest data and performance metrics.

The aim of our approach was to maximize testing coverage and to keep time consumption to a minimum by means of scaling and parallelizing data collection. We have provided examples on the type of data which can be collected, the kind of analysis which can be conducted, and the different sets of visualizations and metrics which can be used to facilitate the identification of frequent oversights, glitches and exploits.

A natural progression of this work is to further increase the complexity of the environment by introducing new mechanics, objectives and environmental hazards. This line of research is also strongly dependent on finding better and more efficient ways of encoding the environment. As discussed in Section IV-A, the way the agents perceive the map influences both the complexity of the task and the cost of training.

Another promising research vector relates to the use of human demonstrations. One could, on one hand, explore a similar idea to the one presented in [10] and focus on exploring around predefined human-generated trajectories. This approach will provide designers with more control over the exploration space and will therefore speed up coverage over regions of high interest. On the other hand, human demonstrations could also be used, together with imitation learning, to provide the agents with some prior understanding about the mechanics of the game and with some basic navigation skills. This prior knowledge is then likely to speed up exploration and decrease the time it takes to collect relevant data.


  • [1] M. Sy, C. Guo, and J. Greco, “Unity game simulation: Find the perfect balance with Unity and GCP,” in Google for Games Developer Summit, 2020. [Online]. Available: https://events.withgoogle.com/gdc2020/
  • [2] J. Bergdahl, C. Gordillo, K. Tollmar, and L. Gisslén, “Augmenting automated game testing with deep reinforcement learning,” in 2020 IEEE Conference on Games (CoG), 2020, pp. 600–603.
  • [3] S. Agarwal, C. Herrmann, G. Wallner, and F. Beck, “Visualizing AI playtesting data of 2D side-scrolling games,” in Proceedings of IEEE Conference on Games, Aug. 2020.
  • [4] S. Stahlke, A. Nova, and P. Mirza-Babaei, “Artificial players in the design process: Developing an automated testing tool for game level and world design,” in Proceedings of the Annual Symposium on Computer-Human Interaction in Play (CHI PLAY ’20).   New York, NY, USA: Association for Computing Machinery, 2020, p. 267–280.
  • [5] C. Muratori, “Killing the walk monster [Conference presentation],” in BIC Festival, 2018. [Online]. Available: https://caseymuratori.com/blog_0032
  • [6] Z. Zhan, B. Aytemiz, and A. M. Smith, “Taking the scenic route: Automatic exploration for videogames,” in KEG@AAAI, ser. CEUR Workshop Proceedings, vol. 2313.   CEUR-WS.org, 2019, pp. 26–34.
  • [7] Y. Zheng, X. Xie, T. Su, L. Ma, J. Hao, Z. Meng, Y. Liu, R. Shen, Y. Chen, and C. Fan, “Wuji: Automatic online combat game testing using evolutionary deep reinforcement learning,” in Proceedings of the 34th IEEE/ACM International Conference on Automated Software Engineering.   IEEE Press, 2019, p. 772–784.
  • [8] K. Chang, B. Aytemiz, and A. M. Smith, “Reveal-more: Amplifying human effort in quality assurance testing using automated exploration,” in 2019 IEEE Conference on Games (CoG), 2019, pp. 1–8.
  • [9] C. Holmgård, M. C. Green, A. Liapis, and J. Togelius, “Automated playtesting with procedural personas through mcts with evolved heuristics,” IEEE Transactions on Games, vol. 11, no. 4, pp. 352–362, 2018.
  • [10] K. Chang and A. Smith, “Differentia: Visualizing incremental game design changes,” Proceedings of the AAAI Conference on Artificial Intelligence and Interactive Digital Entertainment, vol. 16, no. 1, pp. 175–181, Oct. 2020.
  • [11] J. Schmidhuber, “A possibility for implementing curiosity and boredom in model-building neural controllers,” in Proc. of the international conference on simulation of adaptive behavior: From animals to animats, 1991, pp. 222–227.
  • [12] D. Pathak, P. Agrawal, A. A. Efros, and T. Darrell, “Curiosity-driven exploration by self-supervised prediction,” in 2017 IEEE Conference on Computer Vision and Pattern Recognition Workshops (CVPRW), 2017.
  • [13] A. Aubret, L. Matignon, and S. Hassas, “A survey on intrinsic motivation in reinforcement learning,” CoRR, vol. abs/1908.06976, 2019.
  • [14] J. Schulman, F. Wolski, P. Dhariwal, A. Radford, and O. Klimov, “Proximal policy optimization algorithms,” arXiv preprint arXiv:1707.06347, 2017.
  • [15] M. Bellemare, S. Srinivasan, G. Ostrovski, T. Schaul, D. Saxton, and R. Munos, “Unifying count-based exploration and intrinsic motivation,” in Advances in Neural Information Processing Systems, vol. 29.   Curran Associates, Inc., 2016, pp. 1471–1479.
  • [16] S. LaValle, “Rapidly-exploring random trees : a new tool for path planning,” Technical Report TR 98-11, Computer Science Department, Iowa State University, 1998.