
Improving Playtesting Coverage via Curiosity Driven Reinforcement Learning Agents

Camilo Gordillo, Joakim Bergdahl, Konrad Tollmar, Linus Gisslén
SEED - Electronic Arts (EA), Stockholm, Sweden

cgordillo, jbergdahl, ktollmar, lgisslen@ea.com
Abstract

As modern games continue growing both in size and complexity, it has become more challenging to ensure that all the relevant content is tested and that any potential issue is properly identified and fixed. Attempting to maximize testing coverage using only human participants, however, results in a tedious and hard to orchestrate process which normally slows down the development cycle. Complementing playtesting via autonomous agents has shown great promise accelerating and simplifying this process. This paper addresses the problem of automatically exploring and testing a given scenario using reinforcement learning agents trained to maximize game state coverage. Each of these agents is rewarded based on the novelty of its actions, thus encouraging a curious and exploratory behaviour on a complex 3D scenario where previously proposed exploration techniques perform poorly. The curious agents are able to learn the complex navigation mechanics required to reach the different areas around the map, thus providing the necessary data to identify potential issues. Moreover, the paper also explores different visualization strategies and evaluates how to make better use of the collected data to drive design decisions and to recognize possible problems and oversights.

Index Terms:
automated game testing, computer games, reinforcement learning, curiosity

I Introduction

Playtesting modern video games using human participants alone has become unfeasible due to the sheer scale of the projects. As games grow in size and complexity, maximizing coverage and ensuring sufficient exploration becomes a tedious, repetitive, and labor-intensive task. By contrast, automated approaches relying on AI-based agents have the potential to be parallelized and accelerated to provide results in a short period of time [1], thus complementing the regular testing pipelines.

We tackle the problem of automatically exploring a given scenario with the purpose of identifying any potential issues which could, otherwise, be potentially overlooked by human testers. Human participants, we argue, should focus on testing and experiencing the key mechanics of the game without the burden of identifying, documenting and reporting general glitches.

Our approach focuses on the use of reinforcement learning (RL) agents to maximize testing coverage. These types of agents have shown very important and appealing advantages over classical techniques when applied to game testing [2] and may play an important role when working with complex 3D scenarios like the one presented in this paper. We make use of curiosity as the motivation factor encouraging a set of RL agents to improve exploration and to seek novel interactions. It is therefore important to note that our intent is not to optimize a behaviour policy or any specific game score, but to make sure that proper and sufficient data can be collected while training such agents. Access to these kinds of data would enable a wide variety of applications such as automatically mapping reachable/unreachable areas in the scenario, identifying unintended mechanics, visualizing changes in response to design choices, to name a few. With the proper tools and metrics, moreover, other important issues like crash-inducing bugs and frame rate drops could also be triggered and recognized while training.

Once the game has been sufficiently explored and data has been collected, proper metrics and visualizations are required to make sense of the events recorded while training. Previous approaches have proposed different visualizations to derive insights about level design from playtesting data [3][4], and we use some of these ideas as reference to introduce metrics and analytics allowing us to validate our results and to identify different problems around the environment.

II Related Work

To date, a couple of studies have investigated the use of automatic exploration techniques to maximize game state coverage. Walk Monster [5] is an automated reachability testing tool implemented while developing The Witness (released in 2006 as a puzzle game). The purpose of this tool was to validate the traversability of the map and to identify any potential issues: players getting stranded by reaching areas they were not supposed to get into, getting stuck inside geometries, etc. The proposed algorithm managed to achieve impressive results despite employing fairly simple exploration heuristics. Nevertheless, it is important to note the simplicity and low dimensionality of the game itself (a two-dimensional space in practice). Similarly, several exploration strategies have been evaluated with the aim of producing a semantic map of reachable states in several commercial games (from Atari 2600 to Nintendo 64) [6]. Even though their results are comparable to human gameplay, their exploration strategies rely heavily on random actions. This, we argue, would not be applicable in more complex scenarios like the one presented in this paper. The Wuji [7] framework, on the other hand, employs a RL policy similar to ours together with evolutionary multi-objective optimization to encourage exploration and high game state coverage in two commercial combat games. Contrary to our approach, however, the authors do not evaluate the use of data and visualizations to allow for the identification of bugs and oversights.

More recent approaches have been designed to take advantage of human demonstrations. When data from human participants is available, the Reveal-More algorithm [8] can be used to amplify coverage by focusing on exploring around those trajectories. This idea is indeed very interesting and orthogonal to our current approach. We could, for example, make use of imitation learning techniques to encourage and bias exploration around those human generated trajectories.

Another similar research direction focuses on the use of procedural personas to mimic how different player archetypes would interact with a given scenario. The PathOS framework [4], for example, is a very comprehensive tool for simulating testing sessions with artificial agents. Each of these agents is modeled to represent one particular player archetype by using classical scripted AI. Scripting each of these behaviours, however, can prove to be quite challenging as the complexity of the game increases. Other approaches, in contrast, have tried to automate the generation of such behaviours. The authors of [9], for example, propose the use of Monte Carlo tree search and evolutionary algorithms to generate utility functions leading to different behaviours in 2D dungeon levels. How this approach would scale to games of higher complexity, however, remains an open question.

Regardless of what kinds of automated strategies are employed proper visualizations and metrics are key to make sense of the collected data. In [3] the authors propose a set of visualizations to analyze level design in 2D side-scrolling games. In a similar fashion, the authors of [10] introduce Differentia, a set of visualizations to evaluate incremental game design changes. Although our approach draws inspiration from all of these techniques, it is important to remark that 2D visualizations are unlikely to be enough when testing complex 3D scenarios. The PathOS framework [4], on the other hand, is perhaps one of the most similar approaches in terms of visualizations and metrics allowing the user to visualize the outcome of the simulation directly in the game engine.

Meanwhile, intrinsic motivation is a highly studied topic within the reinforcement learning community aiming to encourage agents to explore and to play in the absence of an extrinsic reward. One of such motivations, curiosity, was originally proposed by [11] as a way of rewarding the agents for exploring previously unseen game states and for improving their knowledge about the world. In [12], curiosity is used as a mechanism for pushing the agents to explore complex environments more efficiently while learning skills which may become useful later in their lifetime. A detailed survey about the use of intrinsic motivation in reinforcement learning is presented in [13].

III Implementation

Our approach relies on a set of RL agents continuously interacting with the game and encouraged to maximize coverage. In contrast to the PathOS framework [4] (see above), we employ curiosity as the sole motivation profile driving each of the agents. The following sections describe the scenario developed for our experiments, the RL setup and training algorithm, and the tools which were developed to collect relevant data and generate visualizations directly in the game engine.

III-A Environment

We evaluate our approach on a relatively large (500 m × 500 m × 50 m) map designed for the purpose of creating an elaborate navigation landscape. We created a scenario with complex navigation mechanics (e.g. jumps, climbable walls and elevators) in a 3D space, as shown in Fig. 1. Moreover, similar to what is normally seen in modern video games, we have designed the scenario so that complex navigation strategies are required to fully explore the different areas around the map. More details about the environment, navigation mechanics and final results can be found in the accompanying video: https://www.youtube.com/watch?v=cfm3R94FB_4.

Figure 1: Evaluation map: 500 m × 500 m × 50 m. The environment contains complex navigation challenges composed of multiple sequential jumps, climbable objects and elevators.

The character in our environment (see Fig. 2) is 1.7 m tall and has a total of 3 continuous navigation actions: forward/backward, left/right turn, left/right strafe, plus a discrete action for jumping. The character also has the ability to climb on special surfaces located around the map.

Because the purpose of our approach is to test the game and identify any potential issues, we introduced a set of known bugs into the map. These bugs include missing collision boxes, design oversights, places where players may get stuck, etc.

It is important to remark that, contrary to similar approaches like the one presented in [4], we make no use of navigation meshes within our environment. As discussed in [2], navigation meshes are often not designed to resemble the freedom of movement that human players will have. Any agent constrained by these meshes will, most likely, fail at exploring the environment to the same degree a human would, and will therefore miss those bugs frequently found by human players. In the next section we discuss how an RL agent can be used to control navigation and improve exploration.

III-B Reinforcement learning setup

We make use of proximal policy optimization (PPO) [14] as our RL training algorithm. PPO is a robust and well established baseline within the RL community when working with continuous action spaces. We evaluate and compare the performance of the agents when providing two different types of observations. The first one consists of an aggregate vector of 37 values: agent position ($\mathbb{R}^{3}$), agent velocity ($\mathbb{R}^{3}$), agent world rotation ($\mathbb{R}^{4}$), is climbing ($\mathbb{B}$), is in contact with ground ($\mathbb{B}$), jump cool-down time ($\mathbb{R}$), and a vision array ($\mathbb{R}^{24}$). The vision array consists of 12 ray casts projected in various directions around the agent (see Fig. 2). Each of these rays provides two values: a collision distance and a semantic meaning depending on the type of object it collides with. All values are normalized to be kept between $[-1,1]$. We also explore a second configuration by providing the agents with an additional first person view of the environment, and we compare both models in Section IV-A.
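To make the observation layout concrete, the following is a minimal sketch of how such a vector could be assembled. The engine-side accessors (agent.position, r.hit_distance, etc.) and the normalization ranges (map half-extent, maximum speed, maximum ray distance) are assumptions for illustration; the paper only specifies the fields and that all values are kept within [-1, 1].

```python
import numpy as np

def build_observation(agent, rays, map_half_extent=250.0, max_speed=10.0,
                      max_ray_distance=50.0, max_jump_cooldown=1.0):
    """Assemble the 37-value observation vector described in Section III-B.
    All agent/ray attributes and scaling constants are illustrative assumptions."""
    position = np.asarray(agent.position) / map_half_extent          # R^3
    velocity = np.asarray(agent.velocity) / max_speed                # R^3
    rotation = np.asarray(agent.rotation_quaternion)                 # R^4, unit quaternion
    flags = np.array([
        1.0 if agent.is_climbing else -1.0,                          # B
        1.0 if agent.is_in_contact_with_ground else -1.0,            # B
    ])
    cooldown = np.array([agent.jump_cooldown / max_jump_cooldown])   # R
    # 12 ray casts, each contributing (collision distance, semantic label) -> R^24
    vision = np.concatenate([
        [min(r.hit_distance, max_ray_distance) / max_ray_distance * 2.0 - 1.0,
         r.semantic_id / r.num_semantic_classes * 2.0 - 1.0]
        for r in rays
    ])
    obs = np.concatenate([position, velocity, rotation, flags, cooldown, vision])
    assert obs.shape == (37,)
    return np.clip(obs, -1.0, 1.0)
```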

Global Hyperparameters
  Learning rate (α)         1e-4
  Discount (γ)              0.98
  PPO-Clip                  0.2
  Entropy coefficient       1e-2
  GAE coefficient (λ)       0.95
  Fully connected layers    [1024, 512, 256]
  LSTM layer                256

Visual Encoder
  Image size                [84, 84, 3]
  Kernel size               [5, 3, 3, 3]
  Padding                   [1, 1, 1, 1]
  Strides                   [2, 2, 2, 1]
  Channels                  [32, 32, 64, 64]

TABLE I: Algorithm's training hyperparameters and model architecture.
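As a reading aid, the sketch below assembles the architecture from Table I in a PyTorch-style module. The framework, the ReLU activations, and the single policy head are assumptions not stated in the paper (the actual action parameterization combines three continuous actions and a discrete jump), and the value head required by PPO is omitted.

```python
import torch
import torch.nn as nn

class CuriousAgentModel(nn.Module):
    """A sketch of Table I: visual encoder + FC trunk [1024, 512, 256] + LSTM(256)."""
    def __init__(self, vector_obs_size=37, action_size=4):
        super().__init__()
        # Visual encoder: 84x84x3 input, kernels [5,3,3,3], strides [2,2,2,1],
        # padding [1,1,1,1], channels [32,32,64,64].
        self.encoder = nn.Sequential(
            nn.Conv2d(3, 32, kernel_size=5, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(32, 32, kernel_size=3, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(32, 64, kernel_size=3, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(64, 64, kernel_size=3, stride=1, padding=1), nn.ReLU(),
            nn.Flatten(),
        )
        with torch.no_grad():  # infer the flattened feature size from a dummy image
            visual_features = self.encoder(torch.zeros(1, 3, 84, 84)).shape[1]
        self.trunk = nn.Sequential(
            nn.Linear(visual_features + vector_obs_size, 1024), nn.ReLU(),
            nn.Linear(1024, 512), nn.ReLU(),
            nn.Linear(512, 256), nn.ReLU(),
        )
        self.lstm = nn.LSTM(input_size=256, hidden_size=256, batch_first=True)
        self.policy_head = nn.Linear(256, action_size)  # simplified action head

    def forward(self, image, vector_obs, hidden=None):
        x = torch.cat([self.encoder(image), vector_obs], dim=-1)
        x = self.trunk(x).unsqueeze(1)           # add a time dimension of length 1
        x, hidden = self.lstm(x, hidden)
        return self.policy_head(x.squeeze(1)), hidden
```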
Figure 2: A representation of our character and the ray casts composing the vision array.

The reward given to the agents is a function of novelty and it is described, together with the reset logic, in the following sections. The algorithm’s hyperparameters and the model’s architecture are presented in Table I.

III-C Optimizing coverage through count-based exploration

One of the great advantages of automated testing is the ability to parallelize and scale to a degree which is unfeasible to reach with human participants alone. The results described in this paper were collected while simulating and training 320 agents distributed across multiple machines. Our distributed setup allows us to train a single and centralized model using data collected from multiple environment instances. Instantiating a single training server also allows us to easily process, analyze, and store all the data collected by the agents in a single place.

The reward given to the agents is computed following the idea of count-based exploration [15] and becomes, therefore, inversely proportional to how frequently a given game state has been visited. We define these states as the 3D position of the agent at a given point in time. Keeping track of these visit counters on a continuous space, however, quickly becomes intractable. To solve this problem, we discretize the space by means of a threshold $\tau$. An agent is only considered to have entered a new state once its distance to any previously visited state is larger than $\tau$.

Using a small value for $\tau$ increases the number of points we will need to keep track of and may therefore hinder performance when working with large maps. A high value, on the contrary, results in a very sparse reward signal for our agents which significantly increases the difficulty of the task. The value of $\tau = 5\,\mathrm{m}$ was empirically found after a couple of experiments and has proven to be a sensible choice not only on the scenario presented here, but also in other maps not shown in this paper.

When a new observation is received, the first step to compute a reward is to extract the position $p$ of the agent and compare it to all the previously visited states. This buffer is initially empty and gets populated as exploration takes place. We also keep track of a visit counter $N_{i}$ for each point $i$ in the current buffer. If the minimum distance between the current position $p$ and the points in the buffer is larger than $\tau$, then point $p$ is added to the buffer and its visit counter is set to 1. If, on the contrary, the minimum distance is smaller than $\tau$, we identify the point $i$ within the buffer closest to $p$ and increment $N_{i}$ by 1. Having done this, the reward for reaching point $i$ is computed using Equation 1, where $R_{max}$ is set to 0.5 and defines the reward for exploring a new point, and $max\_counter$ is set to 500, thus annealing the reward down to zero as a given point gets more frequent visits.

$R_{t} = R_{max}\left[1 - \dfrac{N_{i}}{max\_counter}\right]$   (1)
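Below is a minimal sketch of this count-based reward, using a flat list as the buffer of visited states. A spatial index (e.g. a k-d tree) would be a natural optimization for large maps, and clipping the reward at zero once a point exceeds max_counter visits is an assumption consistent with the annealing described above.

```python
import numpy as np

TAU = 5.0            # discretization threshold, in metres
R_MAX = 0.5          # reward for exploring a new point
MAX_COUNTER = 500    # visits after which the reward anneals to zero

class VisitBuffer:
    def __init__(self):
        self.points = []    # 3D positions of the discretized states
        self.counts = []    # visit counter N_i for each point

    def reward(self, p):
        """Update the buffer with position p and return the reward of Eq. (1)."""
        p = np.asarray(p, dtype=np.float32)
        if self.points:
            distances = np.linalg.norm(np.stack(self.points) - p, axis=1)
            i = int(np.argmin(distances))
        else:
            i, distances = None, None
        if i is None or distances[i] > TAU:    # new discretized state
            self.points.append(p)
            self.counts.append(1)
            i = len(self.points) - 1
        else:                                  # revisit of the closest known state
            self.counts[i] += 1
        # Eq. (1), clipped at zero (assumption: the reward anneals down to zero).
        return R_MAX * max(0.0, 1.0 - self.counts[i] / MAX_COUNTER)
```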

III-D Reset logic

Each training episode is simulated for 3000 steps (equivalent to 1 minute of game play) and the agents are respawned once time is up. An initial spawning location was defined for this scenario and is located near the middle of the map at ground level. As training goes on spawning locations are sampled from the current buffer using the inverse of the corresponding visit counters as sample weights. This logic prevents biasing the exploration by respawning the agents in previously unexplored positions while giving priority to less frequently visited locations.

To prevent agents from spawning in mid-air, we take advantage of one of the values available in the observation vector: $is\_in\_contact\_with\_ground$. When storing a new point in the buffer we keep track of whether or not the character was stepping on something when at that location. Then, when sampling a new spawn position, we can just consider the points in the buffer for which this condition is true.
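The respawn sampling described in the two paragraphs above could be sketched as follows; the parallel-list layout of the buffer (points, visit counters, grounded flags) is an assumption for illustration.

```python
import numpy as np

def sample_spawn_location(points, counts, grounded, rng=None):
    """Sample a grounded buffer point, weighted by the inverse of its visit counter."""
    rng = rng or np.random.default_rng()
    candidates = [i for i, on_ground in enumerate(grounded) if on_ground]
    if not candidates:
        return None                            # fall back to the initial spawn point
    weights = np.array([1.0 / counts[i] for i in candidates])
    weights /= weights.sum()                   # less-visited points are preferred
    chosen = rng.choice(candidates, p=weights)
    return points[chosen]
```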

This way of respawning the agents at previously visited states strongly resembles algorithms such as Rapidly-Exploring Random Trees (RRT) [16]. In contrast to the exploration strategies proposed in [6], however, we take advantage of the complex navigation strategies developed by our agents to explore around those spawning locations. In Section IV we compare the performance of our approach to the one of a random policy very similar to the chaos monkey strategy proposed in [6]. This random policy employs the same respawning logic presented above which allows us to fairly compare both techniques.

III-E Collecting and visualizing data

Different kinds of data are continuously processed and stored while the agents interact with the environment. Most of this data is handled by the centralized training process which receives all episodic information (i.e. observations, actions, rewards). Some of the data, however, is recorded directly by the environments based on possible events triggered by the agents which are harder to identify outside the engine. The nature of this data, and how it is used to identify and correct problems, is described together with the corresponding experiments in the next section.

Other relevant metrics such as the number of visited states and values relevant to the RL algorithm are also continuously logged and visualized as training goes on. These logs are very useful to quickly judge and/or compare the performance of a set of experiments without the need of waiting until their completion.

IV Results

In this section we present results on the exploration performance of our agents and give a few examples on the type of analyses which can be conducted using the collected data. As described in the previous section, our simulation pipeline generates a set of files which can be loaded directly into the game engine allowing us to identify potential problems in the game.

IV-A Exploration performance and map coverage

The first thing we would like to evaluate is the ability of our RL agents to navigate and explore the whole map. To do this, we make use of the buffer of visited states introduced in Section III-C. This set of visited 3D coordinates is stored and updated as training goes on and can be used as a metric for exploration and coverage. Fig. 3 shows the percentage of the map covered by our agents when compared to a random policy. As expected, the random based exploration technique did not cover the whole map due to its complexity and was only able to reach easily accessible areas. Moreover, complementing the observation space of our RL agents with a camera image (first person view) improves the results by allowing the model to better understand its surroundings and by decreasing the uncertainty introduced by the discrete set of ray casts. It currently takes around 24 hours to explore 90% of the map but, as discussed in Section V, we believe that coming up with better and more efficient ways for encoding the environment may boost the performance of the agents and speed up exploration.

Figure 3: Map coverage as a function of simulation time. The maximum number of points which could be reached (equivalent to 100%) was estimated to be 25K. The plot shows the mean and variance of the performance of different policies over 5 different runs.
(a) Challenge A: The agents were able to reach the top at the right of the figure by jumping and climbing over a series of obstacles.
(b) Challenge B: The only way of reaching the top of the block to the left is by sequentially jumping over the rocks.
Figure 4: Two of the navigation challenges spread across the map and the solutions found by our RL agents during training. The red cubes represent the points in our buffer of visited states. The blue trajectories showcase the paths that a player would have to follow to fully explore these areas.
(a) Challenge A: A random policy was, as expected, unable to solve complex navigation tasks even when simulated for longer periods of time.
(b) Challenge B: The only entry point to the top right area in this challenge is via the bridge coming from the lower-left. Random agents were unable to find this path.
Figure 5: Exploration performance of a random exploration strategy when faced with complex navigation challenges.

As shown in Figs. 4 and 5, all the data collected by the agents (e.g. the buffer of visited states) can also be loaded and displayed on top of the map. These visualizations allow the designers to verify whether or not different areas across the map are reachable and also allow us to compare the exploration performance of different navigation strategies. Both figures showcase a couple of relatively complex navigation challenges spread across the map and the extent of the exploration coverage achieved by our RL agents when compared to a simple random exploration strategy.

Another set of interesting findings relates to the agents reaching areas which should have been inaccessible for the player. These areas could be identified either by visually inspecting the distribution of the collected points or, as shown in the next section, by defining exploration boundaries.

IV-B Exploration boundaries and regions of interest

Our method allows the designer to specify both an exploration boundary (EB) and regions of interest (ROIs) across the map prior to training. The EB defines the section of the map which should be explored and the episode terminates whenever the agent exits that boundary. The ROIs, on the other hand, are optional regions inside the EB and serve as a reference for data collection.

Although we would like to record and store the specific trajectories followed by the agents, doing so would quickly become too expensive and intractable during long simulations. The definition of the EB and the ROIs, however, allows us to focus on those trajectories which are likely to be useful for testing the game. Our technique keeps track of the episodic trajectory for each agent but only records them if a couple of conditions are met: first, the trajectory needs to cross over the boundary defining either the EB or a ROI; second, the point at which the agent crosses that boundary needs to be significantly different to the one of any previously recorded trajectories.
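A sketch of this recording filter is shown below. The contains() test on boundary regions and the numeric novelty threshold on crossing points are assumptions, as the paper only states the two conditions qualitatively.

```python
import numpy as np

CROSSING_NOVELTY_THRESHOLD = 10.0   # metres; assumed value for "significantly different"

def maybe_record_trajectory(trajectory, boundaries, recorded_crossings, storage):
    """trajectory: list of 3D positions for one episode.
    boundaries: EB/ROI objects assumed to expose contains(point) -> bool."""
    for region in boundaries:
        inside = [region.contains(p) for p in trajectory]
        for was_inside, is_inside, point in zip(inside, inside[1:], trajectory[1:]):
            if was_inside == is_inside:
                continue                              # no boundary crossing at this step
            crossing = np.asarray(point)
            is_novel = all(
                np.linalg.norm(crossing - np.asarray(c)) > CROSSING_NOVELTY_THRESHOLD
                for c in recorded_crossings)
            if is_novel:
                recorded_crossings.append(crossing)
                storage.append(list(trajectory))      # keep the full episode path
                return True                           # record each episode at most once
    return False   # never crossed a boundary, or the crossing point was not novel
```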

Fig. 6 shows examples of trajectories leaving the EB when that boundary is defined at the walls surrounding the scenario in Fig. 1. Our technique allows the user to display these trajectories directly in the game engine and, in this case, would reveal design oversights around the map allowing the player to leave the game area. Figs. 7 and 8 show additional examples of the kinds of issues which could be identified using these visualizations. Fig. 7 shows a trajectory leaving the scenario due to a collision box missing in one of the wall segments. This particular oversight was intentionally introduced in the map to validate the usefulness of the collected data. Interestingly, not all of the problems we were able to identify were intentionally added to the game. Fig. 8, for example, shows a trajectory recorded during some of our first design iterations. In this case, the agent was getting stuck in between two objects and the physics engine would eventually throw it upwards forcing the character to leave the EB. This shows the use these visualizations could have for identifying problems early on in the design process.

Figure 6: Visualization of trajectories leaving the exploration boundary. The agents have found multiple ways of exploiting design oversights to reach into areas which should be inaccessible. These trajectories were recorded during training and could be used to redesign the map and fix these issues.

The ROIs, on the other hand, allow the user to validate the reachability and access to particular areas in the map. Fig. 9 shows two such regions which were intended to be unreachable for the player. The agents were indeed unable to reach the first region and therefore no trajectories were recorded. The second region, however, ended up having one access point which could be identified using the collected data.

Figure 7: A trajectory leaving the game area due to a missing collision box. The collision box of a small segment in the wall (red arrow) was intentionally removed to validate the use and application of the collected data and the proposed visualizations.
Figure 8: The agents were getting stuck in the gap between these objects (left) which was causing the physics engine to eventually throw them upwards into the sky. A trajectory leaving the exploration boundary (right) was recorded by our system and allowed us to fix the problem by closing up the gap.
Figure 9: Two regions of interest defined at areas which should have been unreachable for the player. The buffer of visited states (red cubes) allows us to see that the region to the left remained unexplored while the region to the right was reachable somehow. The trajectories recorded while training, however, allow us to easily visualize how the agents did manage to break into that region and could help the designers to correct any oversight.

IV-C Connectivity graph

Figure 10: Visualizing the connectivity between a small subset of collected points. Bidirectional edges are represented in white while red edges point towards the target node. This connectivity graph broadly behaves like a classical navigation mesh and represents the exploration space that the players are able to traverse.

We can do more than just storing a point cloud of visited states. We can, for instance, configure our training server to build and store a graph structure representing the connectivity between those points. For this reason, the episodic trajectories collected to train our agents are also used to generate a directed graph such as the one shown in Fig. 10. Even though the accuracy of such a graph strongly depends on our discretization threshold $\tau$ (see Section III-C), we believe it has a couple of very promising applications as presented next.
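One way such a graph could be assembled from the episodic trajectories is sketched below: each position is snapped to its nearest buffer point (reusing the τ-discretization from Section III-C) and a directed edge is added between consecutive, distinct points. This is an illustration of the idea rather than the authors' implementation.

```python
import networkx as nx
import numpy as np

def build_connectivity_graph(trajectories, buffer_points, tau=5.0):
    """Build a directed graph whose nodes are buffer points and whose edges
    record the order in which agents moved between them."""
    buffer_points = np.asarray(buffer_points)
    graph = nx.DiGraph()
    for i, point in enumerate(buffer_points):
        graph.add_node(i, position=point)

    def nearest_node(p):
        distances = np.linalg.norm(buffer_points - np.asarray(p), axis=1)
        i = int(np.argmin(distances))
        return i if distances[i] <= tau else None   # ignore positions far from any point

    for trajectory in trajectories:
        nodes = [nearest_node(p) for p in trajectory]
        for a, b in zip(nodes, nodes[1:]):
            if a is not None and b is not None and a != b:
                graph.add_edge(a, b)                # directed: a was visited before b
    return graph
```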


IV-C1 Navigating to custom points of interest

We can make use of the connectivity graph and path planning algorithms to estimate navigation trajectories between any given two points. Our tool allows the user to define an initial and a target position in the map and it then generates a navigation trajectory between those two points based on the data collected from the agents. Fig. 11 shows an example of such a trajectory.
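Building on the connectivity graph sketched above, the point-to-point query could look as follows; snapping to the nearest node and weighting edges by Euclidean distance are assumptions, since the paper does not specify the path-planning algorithm.

```python
import networkx as nx
import numpy as np

def plan_path(graph, origin, target):
    """Return a list of 3D positions approximating a path from origin to target."""
    positions = {n: np.asarray(d["position"]) for n, d in graph.nodes(data=True)}

    def nearest(p):
        p = np.asarray(p)
        return min(positions, key=lambda n: np.linalg.norm(positions[n] - p))

    start, goal = nearest(origin), nearest(target)

    def edge_length(u, v, _attrs):
        return float(np.linalg.norm(positions[u] - positions[v]))

    try:
        nodes = nx.shortest_path(graph, start, goal, weight=edge_length)
    except nx.NetworkXNoPath:
        return None                       # target not reachable from origin
    return [positions[n] for n in nodes]  # polyline to draw in the engine
```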

We argue that a tool like this could be very useful for designers to comprehend how the agents are navigating the map and whether or not they have found potential exploits. Fig. 12, for example, shows one particular exploit which wasn’t intentionally introduced into the game and which was identified thanks to the connectivity graph. In Fig. 12(a) the agents are able to climb over the wall without the need of the elevator. This seems to be happening due to the slope of the walls at that particular corner (you can see more details in the accompanying video). Fig. 12(b), in contrast, shows the trajectory followed by the agent once the previous issue was fixed.

Figure 11: The connectivity graph presented in Fig. 10 allows us to generate trajectories between any given two points in the map. In this case, the red block to the left was set as the origin while the green block to the right was declared as the navigation target. The path which an agent could have followed is shown in white.
(a) There seems to be a path leading to the top of this platform without the need of using the elevator. The problem is caused by the slope of the walls at that particular corner.
(b) This is how the same path looks once the slopes are adjusted to prevent the agents from climbing. Taking the elevator is now the only option to reach the top.
Figure 12: The connectivity graph and the visualization of custom trajectories allowed us to identify a minor oversight resulting in agents being able to climb over this particular segment (more details in the accompanying video).

IV-C2 Semantic connectivity maps

The connectivity graph can also be used to identify how different areas in the map are connected to each other. The specific regions in the map could be either manually defined by the user or, as in our next experiment, automatically extracted from the point cloud of visited states. Fig. 13 shows an example of such a mapping where the regions in the map were automatically extracted and color-coded using unsupervised clustering algorithms. Once the regions are identified, we can make use of the connectivity graph to analyze what kinds of connections exist between them. This semantic mapping could then be used to drive design decisions, to validate the mechanics and traversability of the map, and to recognize potential exploits (unexpected paths).
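A possible sketch of this pipeline, reusing the node layout from the earlier graph sketch: cluster the graph nodes into regions and count the connectivity-graph edges that link different regions. The paper does not name the clustering algorithm; DBSCAN and its parameters are used here purely as placeholders.

```python
import networkx as nx
import numpy as np
from sklearn.cluster import DBSCAN

def build_semantic_map(graph, eps=15.0, min_samples=10):
    """Cluster visited points into regions and build a region-level connectivity graph."""
    nodes = list(graph.nodes)
    points = np.stack([graph.nodes[n]["position"] for n in nodes])
    labels = DBSCAN(eps=eps, min_samples=min_samples).fit_predict(points)
    region_of = {n: int(label) for n, label in zip(nodes, labels)}

    region_graph = nx.DiGraph()
    for u, v in graph.edges:
        ru, rv = region_of[u], region_of[v]
        if ru != rv and ru != -1 and rv != -1:      # -1 marks DBSCAN noise points
            count = region_graph.get_edge_data(ru, rv, {"count": 0})["count"]
            region_graph.add_edge(ru, rv, count=count + 1)
    return labels, region_graph
```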

IV-D Termination states

It is common for players to find themselves stuck in some particular part of the map due to issues in the environment or the location of the game assets which render the playing character immovable. Maximizing exploration coverage gives us the opportunity to automatically identify such locations during training. One approach is to keep track of a termination counter for each point in our buffer of visited states (i.e. how many times an episode ended with an agent in that location). Once training is over, we can proceed and analyze the distribution of terminal states. Any outlier in this distribution can be easily identified and it is likely to be caused by the agents getting stuck in that position.
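A sketch of this analysis is shown below; the z-score criterion is a placeholder, as the paper does not specify how outliers in the distribution of terminal states are detected.

```python
import numpy as np

def find_stuck_locations(points, termination_counts, z_threshold=3.0):
    """Return (position, count) pairs whose episode-termination count is an outlier."""
    counts = np.asarray(termination_counts, dtype=np.float64)
    mean, std = counts.mean(), counts.std()
    if std == 0.0:
        return []                                   # no variation, no outliers
    z_scores = (counts - mean) / std
    return [(points[i], int(counts[i]))
            for i in np.flatnonzero(z_scores > z_threshold)]
```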

We conducted experiments by introducing areas across the map where the agents could get stuck. Fig. 14 shows some examples of such locations together with the outlier positions identified from the collected data. Due to the high coverage achieved by our agents, all the intentionally introduced issues could be identified, as well as one unintentional design oversight causing a similar problem (see Fig. 14(a)).

Figure 13: Semantic map generated automatically using the data collected from the agents. Regions in the map are identified and color-coded using unsupervised clustering algorithms and the connectivity graph is then used to visualize traversability between regions. Blue and red lines represent upwards and downwards trajectories respectively.
Figure 14: Visualizing areas where players could get stuck. The purple areas are intentionally introduced surfaces which freeze the player in place if they come in contact. The green blocks represent the outliers encountered in the distribution of terminal states and highlight the locations around the map which could be problematic. Figure (14(a)) shows two additional blocks between two platforms which were caused by the agents falling in the gap and getting stuck. This was a design oversight which was not intentionally introduced into the game and which was identified thanks to these visualizations. Figure (14(c)), on the contrary, shows a region in the map which was designed for the agents to get trapped if they fell into it.

V Conclusions and future work

In this paper we have shown a potential use case for RL agents trained to maximize testing coverage in complex 3D scenarios. We have reported on the use of curiosity to encourage exploratory behaviour in our agents, thus allowing them to fully traverse the environment. We have shown that curiosity driven agents can be used for automating the collection of playtest data and performance metrics.

The aim of our approach was to maximize testing coverage and to keep time consumption to a minimum by means of scaling and parallelizing data collection. We have provided examples on the type of data which can be collected, the kind of analysis which can be conducted, and the different sets of visualizations and metrics which can be used to facilitate the identification of frequent oversights, glitches and exploits.

A natural progression of this work is to further increase the complexity of the environment by introducing new mechanics, objectives and environmental hazards. This line of research is also strongly dependent on finding better and more efficient ways of encoding the environment. As discussed in Section IV-A, the way the agents perceive the map both influences the complexity of the task and the cost of training.

Another promising research vector relates to the use of human demonstrations. One could, on one hand, explore a similar idea to the one presented in [8] and focus on exploring around predefined human-generated trajectories. This approach will provide designers with more control over the exploration space and will therefore speed up coverage over regions of high interest. On the other hand, human demonstrations could also be used, together with imitation learning, to provide the agents with some prior understanding about the mechanics of the game and with some basic navigation skills. This prior knowledge is then likely to speed up exploration and decrease the time it takes to collect relevant data.

References

  • [1] M. Sy, C. Guo, and J. Greco, “Unity game simulation: Find the perfect balance with Unity and GCP,” in Google for Games Developer Summit, 2020. [Online]. Available: https://events.withgoogle.com/gdc2020/
  • [2] J. Bergdahl, C. Gordillo, K. Tollmar, and L. Gisslén, “Augmenting automated game testing with deep reinforcement learning,” in 2020 IEEE Conference on Games (CoG), 2020, pp. 600–603.
  • [3] S. Agarwal, C. Herrmann, G. Wallner, and F. Beck, “Visualizing AI playtesting data of 2D side-scrolling games,” in Proceedings of the IEEE Conference on Games, Aug. 2020.
  • [4] S. Stahlke, A. Nova, and P. Mirza-Babaei, “Artificial players in the design process: Developing an automated testing tool for game level and world design,” in Proceedings of the Annual Symposium on Computer-Human Interaction in Play (CHI PLAY ’20). New York, NY, USA: Association for Computing Machinery, 2020, pp. 267–280.
  • [5] C. Muratori, “Killing the walk monster [Conference presentation],” in BIC Festival, 2018. [Online]. Available: https://caseymuratori.com/blog_0032
  • [6] Z. Zhan, B. Aytemiz, and A. M. Smith, “Taking the scenic route: Automatic exploration for videogames,” in KEG@AAAI, ser. CEUR Workshop Proceedings, vol. 2313. CEUR-WS.org, 2019, pp. 26–34.
  • [7] Y. Zheng, X. Xie, T. Su, L. Ma, J. Hao, Z. Meng, Y. Liu, R. Shen, Y. Chen, and C. Fan, “Wuji: Automatic online combat game testing using evolutionary deep reinforcement learning,” in Proceedings of the 34th IEEE/ACM International Conference on Automated Software Engineering. IEEE Press, 2019, pp. 772–784.
  • [8] K. Chang, B. Aytemiz, and A. M. Smith, “Reveal-More: Amplifying human effort in quality assurance testing using automated exploration,” in 2019 IEEE Conference on Games (CoG), 2019, pp. 1–8.
  • [9] C. Holmgård, M. C. Green, A. Liapis, and J. Togelius, “Automated playtesting with procedural personas through MCTS with evolved heuristics,” IEEE Transactions on Games, vol. 11, no. 4, pp. 352–362, 2018.
  • [10] K. Chang and A. Smith, “Differentia: Visualizing incremental game design changes,” Proceedings of the AAAI Conference on Artificial Intelligence and Interactive Digital Entertainment, vol. 16, no. 1, pp. 175–181, Oct. 2020.
  • [11] J. Schmidhuber, “A possibility for implementing curiosity and boredom in model-building neural controllers,” in Proc. of the International Conference on Simulation of Adaptive Behavior: From Animals to Animats, 1991, pp. 222–227.
  • [12] D. Pathak, P. Agrawal, A. A. Efros, and T. Darrell, “Curiosity-driven exploration by self-supervised prediction,” in 2017 IEEE Conference on Computer Vision and Pattern Recognition Workshops (CVPRW), 2017.
  • [13] A. Aubret, L. Matignon, and S. Hassas, “A survey on intrinsic motivation in reinforcement learning,” CoRR, vol. abs/1908.06976, 2019.
  • [14] J. Schulman, F. Wolski, P. Dhariwal, A. Radford, and O. Klimov, “Proximal policy optimization algorithms,” arXiv preprint arXiv:1707.06347, 2017.
  • [15] M. Bellemare, S. Srinivasan, G. Ostrovski, T. Schaul, D. Saxton, and R. Munos, “Unifying count-based exploration and intrinsic motivation,” in Advances in Neural Information Processing Systems, vol. 29. Curran Associates, Inc., 2016, pp. 1471–1479.
  • [16] S. LaValle, “Rapidly-exploring random trees: A new tool for path planning,” Technical Report TR 98-11, Computer Science Department, Iowa State University, 1998.