
SARC: Soft Actor Retrospective Critic

Sukriti Verma, Carnegie Mellon University, sukritiv@andrew.cmu.edu

Ayush Chopra, MIT Media Lab, ayushc@mit.edu

Jayakumar Subramanian, Adobe, jasubram@adobe.com

Mausoom Sarkar, Adobe, msarkar@adobe.com

Nikaash Puri, Adobe, nikpuri@adobe.com

Piyush Gupta, Adobe, piygupta@adobe.com

Balaji Krishnamurthy, Adobe, kbalaji@adobe.com

Abstract

SAC, an actor-critic algorithm, has a two-time-scale nature: at any given time the critic estimate has not converged for the actor, but since the critic learns faster than the actor, eventual consistency between the two is ensured. Various strategies have been introduced in the literature to learn better gradient estimates and thereby achieve better convergence. Since the gradient estimates depend upon the critic, we posit that improving the critic can provide a better gradient estimate for the actor at each step. Building on this, we propose Soft Actor Retrospective Critic (SARC), where we augment the SAC critic loss with another loss term - retrospective loss - leading to faster critic convergence and, consequently, better policy gradient estimates for the actor. An existing implementation of SAC can be easily adapted to SARC with minimal modifications. Through extensive experimentation and analysis, we show that SARC provides consistent improvement over SAC on benchmark environments. We plan to open-source the code and all experiment data at https://github.com/sukritiverma1996/SARC

1 Introduction

The idea of applying and extending successful supervised learning techniques from the field of deep learning to the field of reinforcement learning has long existed and has helped achieve remarkable success in deep RL over this decade. One of the early breakthroughs, Deep Q-Networks (DQN) [12, 13], leveraged advances from the field of deep learning to train an end-to-end deep reinforcement learning agent, which achieved better than human-level performance in several challenging Atari games. Before DQNs, reinforcement learning could only be applied to tasks having low-dimensional state spaces or tasks where a small number of features could be extracted. With DQNs, it became possible to learn policies directly from high-dimensional input. This subsequently gave rise to a plethora of deep RL algorithms that combined deep learning with reinforcement learning. Broadly, these algorithms fall into three classes - critic-only (like DQN, DDQN [23], etc.), actor-only (REINFORCE [26]) and actor-critic (SAC [6], PPO [16], etc.) algorithms. In this work, we seek to improve the Soft Actor-Critic algorithm (SAC) [6], one of the current state-of-the-art actor-critic algorithms in deep RL, by taking inspiration from a recent advance in the field of supervised deep learning: retrospective loss [8].
Actor-critic methods such as PPO, A3C [14], DPG [17], TD3 [5] and SAC have yielded great success in recent times [24, 2]. With actor-critic algorithms, the central idea is to split the learning into two key modules: i) Critic, which learns the policy-value function given the actor's policy. ii) Actor, which uses the critic's policy-value function to estimate a policy gradient and improve its policy.
One iteration of policy improvement in actor-critic methods [10] involves two interwoven steps: one step of critic learning followed by one step of actor learning. These steps are interwoven for the following reason. Given a fixed critic, policy improvement and actor learning are determined by policy gradient estimates computed using the policy-value function provided by the critic. Given a fixed actor, the policy remains unchanged for the critic and hence, there exists a unique value function for the critic to learn. This allows the application of policy evaluation algorithms to learn this value function. A good value function, in turn, enables accurate policy gradient estimates for the actor [18, 27].
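For concreteness, the dependence of the actor on the critic can be made explicit through the standard policy gradient theorem [18]: with the critic's estimate $Q_w$ standing in for the true action-value function of the current policy (notation ours), the actor's update direction is

$$\nabla_{\theta} J(\pi_{\theta}) \approx \mathbb{E}_{s \sim \rho^{\pi_{\theta}},\, a \sim \pi_{\theta}(\cdot \mid s)} \big[ \nabla_{\theta} \log \pi_{\theta}(a \mid s)\, Q_{w}(s, a) \big],$$

so any error in the critic's estimate $Q_w$ propagates directly into the policy gradient estimate used by the actor.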
Several advances in actor-critic methods such as TRPO [15], PPO, A3C and DDPG [11] have come from improving actor learning and discovering more stable ways of updating the policy. A few actor-critic algorithms, such as TD3 and SAC, have advanced the state of the art by improving critic learning. Both of these algorithms learn two value functions rather than one to counter the overestimation bias [21] in value learning. TD3 updates the critic more frequently than the actor to minimize the error in the value function [5]. SAC uses the entropy of the policy as a regulariser [6]. In this work, we propose to further improve critic learning in SAC by applying a regulariser that accelerates critic convergence.
This regulariser is inspired by a recent technique that improves performance in the supervised learning setting, called retrospective loss [8]. When the retrospective loss is minimized along with the task-specific loss, the parameters at the current training step are guided towards the optimal parameters while being pulled away from the parameters at a previous training step. In the supervised setting, retrospective loss results in better test accuracy.
Using this loss as a retrospective regulariser applied to the critic accelerates the convergence of the critic. Due to the two time-scale nature of actor-critic algorithms [3, 10], causing the critic to learn faster leads to better policy gradient estimates for the actor, as gradient estimates depend on the value function learned by the critic. Better policy gradient estimates for the actor in turn lead to better future samples.
Bringing it all together, in this work, we propose a novel actor-critic method, Soft Actor Retrospective Critic (SARC), which improves the existing Soft Actor-Critic (SAC) method by applying the aforementioned retrospective regulariser on the critic. We perform extensive empirical demonstration and analysis to show that SARC leads to better actors in terms of sample complexity as well as final return in several standard benchmark RL tasks provided in the DeepMind Control Suite and PyBullet Environments. We also show how SARC does not degrade performance for tasks where SAC already achieves optimal performance. To help reproducibility and to help convey that it takes minimal editing and essentially no added computation cost to convert an existing implementation of SAC to SARC, we open-source the code and all experiment data at https://github.com/sukritiverma1996/SARC

2 Related Work

Out of the three broad classes of RL algorithms, critic-only methods can efficiently learn an optimal policy [9] (implicitly derived from the optimal action-value function) under the tabular setting. But under continuous or stochastic settings, function approximation can cause critic-only methods to diverge [25]. Actor-only methods can be applied to these settings but they suffer from high variance when computing the policy gradient estimates. Actor-critic algorithms combine the advantages of critic-only methods and actor-only methods. The key difference between actor-critic and actor-only algorithms is that actor-critic algorithms estimate an explicit critic, or policy-value function, instead of the Monte Carlo return estimated by the actor-only algorithms. In what follows, we restrict attention to actor-critic algorithms. A more detailed comparison can be found in Sutton and Barto [18].
Methods that model an explicit policy, such as actor-only and actor-critic algorithms, consist of two steps: i) Policy evaluation: calculating the policy-value function for the current policy, and ii) Policy improvement: using the value function to learn a better policy.
It is customary in RL to run these steps simultaneously utilizing the two time-scale stochastic approximation algorithm [3, 19, 10] rather than running these steps to convergence individually. This is the approach used in actor-critic algorithms, where the critic learns the policy-value function given the actor's policy and the actor improves its policy guided by policy gradient estimates computed using the policy-value function provided by the critic.
RL algorithms can be on-policy or off-policy. Off-policy actor-critic algorithms like Deep Deterministic Policy Gradient (DDPG) [11] and Twin Delayed DDPG (TD3) [5] often have better sample complexity than on-policy algorithms. DDPG is a deep variant of the deterministic policy gradient algorithm [17]. It combines the actor-critic approach with insights from DQN. However, DDPG suffers from an overestimation bias [21] in learning the Q-values. This hinders policy improvement. TD3 addresses this by learning two Q-functions and using the smaller of the two Q-values to form the target for the Bellman update, which favors underestimation. TD3 also proposes delaying policy updates until the value estimates have converged.
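As an illustration, the clipped double-Q target used by TD3 takes the minimum of the two target Q-functions when forming the Bellman backup (notation follows the convention used later in this paper; TD3's target policy smoothing noise is omitted here for brevity):

$$y = r + \gamma (1 - d) \min_{i=1,2} Q_{\phi_{\text{targ}, i}}\big(s', \pi_{\theta_{\text{targ}}}(s')\big).$$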
On-policy algorithms provide better stability but are not as sample efficient and often employ replay buffers to improve sample efficiency. Trust Region Policy Optimization (TRPO) [15] proposes updating the policy by optimizing a surrogate objective, constrained on the KL-divergence between the previous policy and the updated policy. Actor Critic using Kronecker-Factored Trust Region (ACKTR) [28] uses Kronecker-factored approximate curvature to perform a similar trust region update. Proximal Policy Optimization (PPO) [16] proposes to match the performance of TRPO with only first-order optimization.
Soft Actor-Critic (SAC) [6] is an off-policy algorithm that has both the qualities of sample efficiency and stability. SAC optimizes a stochastic policy with entropy regularization. The policy is trained to maximize a trade-off between expected return and entropy. Increasing entropy results in more exploration, which can help accelerate learning. It can also prevent the policy from converging to a local optimum.
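Concretely, SAC maximizes the entropy-regularized return, with the temperature $\alpha$ trading off reward against policy entropy [6]:

$$J(\pi) = \sum_{t} \mathbb{E}_{(s_t, a_t) \sim \rho_{\pi}} \big[ r(s_t, a_t) + \alpha\, \mathcal{H}\big(\pi(\cdot \mid s_t)\big) \big].$$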
Our contribution In this work, we propose Soft Actor Retrospective Critic (SARC), an algorithm that improves SAC by applying a retrospective regulariser on the critic, inspired by retrospective loss [8]. We empirically validate this claim by comparing and analysing SARC with SAC, TD3 and DDPG on multiple continuous control environments.

3 Soft Actor Retrospective Critic

3.1 Premise

In actor-critic methods, we note that the actor learning algorithm incrementally improves its policy, using the critic's performance estimate, while the best performance is unknown. However, in contrast to this, the critic learning algorithm is a supervised learning one, albeit with a moving target. This is the key observation that enables us to import retrospective loss from supervised learning to actor-critic methods.
Retrospective loss and critic learning We begin by describing our notation. Let $(x_i, y_i)$, $i \in \{1, \ldots, N\}$, denote an input and output (ground truth) training data point, with $N$ denoting the size of the training data and $x_i$, $y_i$ belonging to the spaces $\mathcal{X} \subseteq \mathbb{R}^{d_x}$ and $\mathcal{Y} \subseteq \mathbb{R}^{d_y}$ respectively, where $d_x$ and $d_y$ are the respective dimensions of the input and output. Furthermore, let $g_{\theta}: \mathcal{X} \rightarrow \mathcal{Y}$ denote a mapping from the input space to the output space, parameterized by $\theta \in \mathbb{R}^{d_{\theta}}$, where $d_{\theta}$ denotes the number of parameters in this mapping. Let $\hat{y}_i = g_{\theta}(x_i)$ denote the predicted output for the data point $x_i$ using parameter $\theta$. The supervised learning objective is to learn a parameter $\theta^{*}$ which minimizes the error between the ground truth output $y_i$ and the predicted output $\hat{y}_i$ for all $i$. Various metrics can be used to represent this loss, but for our illustration, we choose a loss based on the $L_2$ metric. We call this loss the original loss, $\mathcal{L}_{\text{orig}}$, which is defined as:

$$\mathcal{L}_{\text{orig}}(\theta) = \sum_{i=1}^{N} \big\| g_{\theta}(x_i) - y_i \big\|_2 \tag{1}$$
We paraphrase the definition of retrospective loss from [8], and modify it slightly as follows:
Definition 1. [adapted from [8]] Retrospective loss is defined in the context of an iterative parameter estimation scheme and is a function of the parameter values at two different iterations (times) - the current iteration and a previous iteration. Let $t_c$ and $t_p$ denote the current and previous times respectively, so that we have $t_p < t_c$. Then, the retrospective loss at time $t_c$, $\mathcal{L}^{t_c}_{\text{retro}}$, is comprised of two components - the original loss and the retrospective regularizer - as given below:

$$\mathcal{L}^{t_c}_{\text{retro}}(\theta_{t_c}) = \mathcal{L}_{\text{orig}}(\theta_{t_c}) + \mathcal{L}^{t_c}_{\text{reg}}(\theta_{t_c}) \tag{2}$$

where the retrospective regularizer is defined as:

$$\mathcal{L}^{t_c}_{\text{reg}}(\theta_{t_c}) = \kappa \sum_{i=1}^{N} \Big[ (K+1)\, \Delta\big(g_{\theta_{t_c}}(x_i),\, y_i\big) - K\, \Delta\big(g_{\theta_{t_c}}(x_i),\, g_{\theta_{t_p}}(x_i)\big) \Big] \tag{3}$$

where $\kappa > 0$, $K > 0$, and $\Delta$ is any metric on vector spaces, such as the $L_1$ distance. We term the family of functions as given in (2), defined by parameters $\kappa$, $K$ and the metric $\Delta$, as retrospective loss functions.
When retrospective loss is minimized along with the task-specific loss, the parameters at the current training step are guided towards the optimal parameters while being pulled away from the parameters at a previous training step. In the supervised setting, retrospective loss results in better test accuracy.
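To make the definition concrete, the snippet below is a minimal PyTorch sketch of the retrospective regularizer in (3), using an elementwise $L_1$ metric averaged over the batch; the function name, the default values of $\kappa$ and $K$, and the batching convention are our illustrative choices, not taken from [8].

```python
import torch

def retrospective_regularizer(pred_curr, pred_prev, target, kappa=2.0, K=2.0):
    """Sketch of the retrospective regularizer: pull the current predictions
    g_{theta_tc}(x) towards the target y and away from the previous model's
    predictions g_{theta_tp}(x), using an elementwise L1 metric."""
    to_target = (pred_curr - target).abs()                  # Delta(g_tc(x), y)
    to_previous = (pred_curr - pred_prev.detach()).abs()    # Delta(g_tc(x), g_tp(x))
    return kappa * ((K + 1.0) * to_target - K * to_previous).mean()

# Usage sketch: minimize the task loss together with the regularizer.
# loss = task_loss(pred_curr, target) + retrospective_regularizer(pred_curr, pred_prev, target)
```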
We posit that accelerating critic learning can help the actor realise better gradient estimates, which results in improved policy learning. In line with this goal, we propose that using the retrospective loss as a regularizer in the critic objective will accelerate critic learning. In Section 5, we present experiments that demonstrate the improved policy learning for the actor achieved with the use of retrospective loss as a regularizer in the critic objective. Furthermore, in Section 6 we observe that even for slowly moving targets (ground truths), as is the case for the critic, retrospective loss yields convergence acceleration.

3.2 Formalising SARC

As the popular Soft Actor-Critic (SAC) algorithm offers both sample-efficient learning and stability [6] (Section 2), we specifically extend SAC in this work. In principle, retrospective regularization can be incorporated into other actor-critic methods as well and would be an interesting direction for future exploration. Our algorithm is summarized in Algorithm 1, where the basic SAC code is reproduced from SpinningUp [1] and the additions are highlighted in blue. From an algorithmic perspective, SARC is a minimal modification (3 lines) over SAC.
Learning in SARC Here we reproduce the basic SAC actor and critic learning algorithms [6, 1] with the retrospective regularizer added to the critic loss, which yields the SARC algorithm. The update equations are given below. The actor is represented by a function $\pi_{\theta}$, parametrized by $\theta$. Given any state $s$, the action $\tilde{a}_{\theta}(s, \xi)$ is sampled as:

$$\tilde{a}_{\theta}(s, \xi) = \tanh\big( \mu_{\theta}(s) + \sigma_{\theta}(s) \odot \xi \big), \qquad \xi \sim \mathcal{N}(0, I)$$

The targets for the two critics in SARC are then computed as in SAC:

$$y(r, s', d) = r + \gamma (1 - d) \Big( \min_{j=1,2} Q_{\phi_{\text{targ},j}}(s', \tilde{a}') - \alpha \log \pi_{\theta}(\tilde{a}' \mid s') \Big), \qquad \tilde{a}' \sim \pi_{\theta}(\cdot \mid s')$$

where $y$ is the target, $r$ is the per-step reward, $d$ is the done signal, $\phi_1, \phi_2$ are the critic parameters, and $\phi_{\text{targ},1}, \phi_{\text{targ},2}$ are the corresponding target critic parameters. The loss for each critic $j \in \{1, 2\}$ is then estimated as:

$$L(\phi_j) = \underset{(s, a, r, s', d) \sim \mathcal{D}}{\mathbb{E}} \Big[ \big( Q_{\phi_j}(s, a) - y(r, s', d) \big)^2 \Big] + \mathcal{L}_{\text{reg}}(\phi_j)$$

where $\mathcal{L}_{\text{reg}}$ is the retrospective regularizer given by (3), with the current critic $Q_{\phi_j}$ in the role of $g_{\theta_{t_c}}$, the previous critic $Q_{\phi_{\text{prev},j}}$ in the role of $g_{\theta_{t_p}}$, and the target $y$ in the role of the ground truth. And finally, the actor is updated as in the original SAC algorithm.
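The following is a minimal PyTorch sketch of how the SARC critic loss above could be computed on a sampled batch. The SpinningUp-style module names (`ac.pi`, `ac.q1`, `ac.q2`, and their target and previous copies) and the exact way the regularizer enters the loss are our assumptions; `retrospective_regularizer` is the sketch from Section 3.1.

```python
import torch
import torch.nn.functional as F

def sarc_critic_loss(ac, ac_targ, ac_prev, batch, gamma, alpha):
    """Sketch: SAC's MSE Bellman error for both critics plus the retrospective
    regularizer, with the TD target y acting as the ground truth and the
    previous critics providing the retrospective anchor."""
    s, a, r, s2, d = batch['obs'], batch['act'], batch['rew'], batch['obs2'], batch['done']

    q1 = ac.q1(s, a)
    q2 = ac.q2(s, a)

    with torch.no_grad():
        a2, logp_a2 = ac.pi(s2)                               # action from the current policy
        q_targ = torch.min(ac_targ.q1(s2, a2), ac_targ.q2(s2, a2))
        y = r + gamma * (1 - d) * (q_targ - alpha * logp_a2)  # SAC target
        q1_prev, q2_prev = ac_prev.q1(s, a), ac_prev.q2(s, a)  # previous critics

    bellman = F.mse_loss(q1, y) + F.mse_loss(q2, y)
    retro = (retrospective_regularizer(q1, q1_prev, y) +
             retrospective_regularizer(q2, q2_prev, y))
    return bellman + retro
```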

4 Experimental Setup

To demonstrate that SARC outperforms SAC and is competitive with existing state-of-the-art actorcritic methods such as TD3 and DDPG, we conduct an exhaustive set of experiments. For all of our
Algorithm 1 Soft Actor Retrospective Critic
    Input: initial policy parameters \(\theta\), Q-function parameters \(\phi_{1}, \phi_{2}\), empty replay buffer \(\mathcal{D}\)
    Set target parameters equal to main parameters: \(\phi_{\text{targ}, 1} \leftarrow \phi_{1}, \phi_{\text{targ}, 2} \leftarrow \phi_{2}\)
    Set previous Q-function parameters equal to initial parameters: \(\phi_{\text{prev}, 1} \leftarrow \phi_{1}, \phi_{\text{prev}, 2} \leftarrow \phi_{2}\)
    repeat
        Observe state \(s\), sample \(a \sim \pi_{\theta}(\cdot \mid s)\)
        Execute \(a\) in environment
        Observe next state \(s^{\prime}\), reward \(r\), and done signal \(d\) to indicate whether \(s^{\prime}\) is terminal
        Store \(\left(s, a, s^{\prime}, r, d\right)\) in replay buffer \(\mathcal{D}\)
        If \(d=1\) then \(s^{\prime}\) is terminal; reset environment
        if it's time to update then
            for \(j\) in range numUpdates do
                Sample batch \(B=\left\{\left(s, a, s^{\prime}, r, d\right)\right\}\) from \(\mathcal{D}\)
                Compute targets for the Q-functions
                Update the Q-functions, adding the retrospective regularizer to the critic loss
                Update the policy, where the action is a sample from \(\pi_{\theta}(\cdot \mid s)\) that is differentiable w.r.t. \(\theta\) via the reparametrization trick
                Update target Q-functions
                Update previous Q-functions
            end for
        end if
    until convergence
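The two parameter-copy steps at the end of the inner loop could look as follows in PyTorch; the polyak-averaged target update is standard SAC (SpinningUp's `polyak` coefficient), while the hard copy into the previous critics is the SARC-specific addition. How frequently the previous critics are refreshed is our assumption.

```python
import torch

with torch.no_grad():
    # Soft (polyak) update of the target networks, exactly as in SAC.
    for p, p_targ in zip(ac.parameters(), ac_targ.parameters()):
        p_targ.data.mul_(polyak)
        p_targ.data.add_((1 - polyak) * p.data)

    # Hard copy of the current critics into the "previous" critics that the
    # retrospective regularizer will compare against at the next update.
    for p, p_prev in zip(ac.q1.parameters(), ac_prev.q1.parameters()):
        p_prev.data.copy_(p.data)
    for p, p_prev in zip(ac.q2.parameters(), ac_prev.q2.parameters()):
        p_prev.data.copy_(p.data)
```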
Figure 1: Results for SARC (red), SAC (green), TD3 (purple) and DDPG (blue) across 6 DeepMind Control Suite tasks: (a) Walker-Stand, (b) Walker-Walk, (c) Finger-Spin, (d) Cheetah-Run, (e) Reacher-Easy, (f) Reacher-Hard. The x-axis shows the timesteps in each environment. The y-axis shows the average value and standard deviation band of return across 5 seeds (10 test episode returns per seed). It can be observed that SARC improves over SAC in all environments. SARC also outperforms TD3 and DDPG in most environments.
experiments, we use the implementations provided in SpinningUp [1]. SpinningUp is an open-source deep RL library.
We modify the original implementation of SAC, provided in SpinningUp, to obtain SARC. Modifying an existing SAC implementation to SARC is minimal and straightforward, requiring the addition of a few lines and essentially no extra compute, as presented in Algorithm 1. Within the SARC critic loss, we use the same settings for the retrospective regularizer across all tasks, environments and analyses.
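Under the assumptions above, converting SpinningUp's SAC into SARC could amount to roughly the following additions (a sketch, not the authors' exact diff; `ac`, `o`, `a`, `q1`, `q2`, `backup` and `loss_q` are the names used in SpinningUp's PyTorch SAC implementation, while `ac_prev` and `retrospective_regularizer` come from the sketches in Section 3).

```python
from copy import deepcopy
import torch

# 1) Alongside ac_targ = deepcopy(ac), keep a second frozen copy whose critics
#    act as the retrospective anchor.
ac_prev = deepcopy(ac)
for p in ac_prev.parameters():
    p.requires_grad = False

# 2) Inside compute_loss_q, add the retrospective regularizer to the usual SAC
#    critic loss ("backup" is SpinningUp's name for the TD target y).
with torch.no_grad():
    q1_prev, q2_prev = ac_prev.q1(o, a), ac_prev.q2(o, a)
loss_q = loss_q + retrospective_regularizer(q1, q1_prev, backup) \
                + retrospective_regularizer(q2, q2_prev, backup)

# 3) At the end of each update step, copy the current critic parameters into
#    ac_prev (see the parameter-update sketch after Algorithm 1).
```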
We conduct our evaluations and analysis across various continuous control tasks. Concretely, we use 6 tasks provided in the DeepMind Control Suite [20] and 5 tasks provided in PyBullet Environments [4]. For all experiments, we retain all of the default hyperparameters specified originally in the SpinningUp library and let each agent train for 2 million timesteps in the given environment. In our experiments, the same hyperparameters worked for all environments and tasks. The actor and critic network size used in all experiments is [256, 256], unless specified otherwise. This is the default value for the network size specified originally in the SpinningUp library. All models were trained using CPUs on a Linux machine running Ubuntu 16.04 with 2nd Generation Intel Xeon Gold Processor.
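For reference, a single run under the setup described above could be launched through SpinningUp roughly as follows. This is a sketch assuming the PyTorch SAC entry point and a Gym-registered PyBullet environment; the specific environment id is illustrative, and the SARC variant would be the modified SAC of Algorithm 1 rather than `sac_pytorch` itself.

```python
import gym
import pybullet_envs  # noqa: F401  (registers the PyBullet Gym environments)
from spinup import sac_pytorch as sac

# Default SpinningUp hyperparameters, [256, 256] actor and critic networks,
# and 2 million environment steps in total (4000 steps/epoch x 500 epochs).
sac(lambda: gym.make('HalfCheetahBulletEnv-v0'),
    ac_kwargs=dict(hidden_sizes=[256, 256]),
    steps_per_epoch=4000,
    epochs=500,
    seed=0)
```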
In Section 5, we compare our proposed algorithm SARC to SAC, TD3 and DDPG across 6 tasks provided in the DeepMind Control Suite. In Section 6, we compare retrospective loss with an alternate baseline technique for critic acceleration and analyse our hypothesis of faster critic convergence. We also compare SARC with SAC under various modified experimental settings to further establish our claim. We make all of our code, as well as all of the data used to construct the graphs and report the figures, available at: https://github.com/sukritiverma1996/SARC

5 Results

In this section, we present comparisons between SARC, SAC, TD3 and DDPG. We perform these evaluations on 6 DeepMind Control Suite tasks: Cheetah-Run, Finger-Spin, Walker-Walk, Walker-Stand, Reacher-Easy, Reacher-Hard. To ensure fairness and consistency, and to rule out outlier behavior,
Table 1: Test episode returns of SARC, SAC, TD3 and DDPG on 6 DeepMind Control Suite tasks after training for 2 million steps.
Environment      Average Test Episode Return
                 TD3       DDPG      SAC       SARC
Walker-Stand     959.48    716.99    697.33
Walker-Walk      797.87    770.08    886.28
Finger-Spin      827.02    179.3     315.28
Cheetah-Run      500.52    8.22      382.69
Reacher-Easy     799.62    913.92