Part 1: Key Concepts in RL
Welcome to our introduction to reinforcement learning! Here, we aim to acquaint you with
- the language and notation used to discuss the subject,
- a high-level explanation of what RL algorithms do (although we mostly avoid the question of how they do it),
- and a little bit of the core math that underlies the algorithms.
In a nutshell, RL is the study of agents and how they learn by trial and error. It formalizes the idea that rewarding or punishing an agent for its behavior makes it more likely to repeat or forego that behavior in the future.
What Can RL Do?
RL methods have recently enjoyed a wide variety of successes. For example, it’s been used to teach computers to control robots in simulation...
...and in the real world...
It’s also famously been used to create breakthrough AIs for sophisticated strategy games, most notably Go and Dota, taught computers to play Atari games from raw pixels, and trained simulated robots to follow human instructions.
Key Concepts and Terminology
The main characters of RL are the agent and the environment. The environment is the world that the agent lives in and interacts with. At every step of interaction, the agent sees a (possibly partial) observation of the state of the world, and then decides on an action to take. The environment changes when the agent acts on it, but may also change on its own.
The agent also perceives a reward signal from the environment, a number that tells it how good or bad the current world state is. The goal of the agent is to maximize its cumulative reward, called return. Reinforcement learning methods are ways that the agent can learn behaviors to achieve its goal.
To talk more specifically about what RL does, we need to introduce additional terminology. We need to talk about
- states and observations,
- action spaces,
- policies,
- trajectories,
- different formulations of return,
- the RL optimization problem,
- and value functions.
States and Observations
A state is a complete description of the state of the world. There is no information about the world which is hidden from the state. An observation is a partial description of a state, which may omit information.
In deep RL, we almost always represent states and observations by a real-valued vector, matrix, or higher-order tensor. For instance, a visual observation could be represented by the RGB matrix of its pixel values; the state of a robot might be represented by its joint angles and velocities.
When the agent is able to observe the complete state of the environment, we say that the environment is fully observed. When the agent can only see a partial observation, we say that the environment is partially observed.
You Should Know
Reinforcement learning notation sometimes puts the symbol for state, $s_t$, in places where it would be technically more appropriate to write the symbol for observation, $o_t$. Specifically, this happens when talking about how the agent decides an action: we often signal in notation that the action is conditioned on the state, when in practice, the action is conditioned on the observation because the agent does not have access to the state.
In our guide, we’ll follow standard conventions for notation, but it should be clear from context which is meant. If something is unclear, though, please raise an issue! Our goal is to teach, not to confuse.
Action Spaces
Different environments allow different kinds of actions. The set of all valid actions in a given environment is often called the action space. Some environments, like Atari and Go, have discrete action spaces, where only a finite number of moves are available to the agent. Other environments, like where the agent controls a robot in a physical world, have continuous action spaces. In continuous spaces, actions are real-valued vectors.
This distinction has some quite-profound consequences for methods in deep RL. Some families of algorithms can only be directly applied in one case, and would have to be substantially reworked for the other.
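As a quick illustration, here is a minimal sketch of inspecting action spaces with the Gymnasium library (the specific environment names are examples chosen for this sketch, not prescribed by the text):
import gymnasium as gym

# Discrete action space: only a finite number of moves is available to the agent.
discrete_env = gym.make("CartPole-v1")
print(discrete_env.action_space)       # Discrete(2)

# Continuous action space: actions are real-valued vectors, as in robot control.
continuous_env = gym.make("Pendulum-v1")
print(continuous_env.action_space)     # Box(-2.0, 2.0, (1,), float32)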
Policies
A policy is a rule used by an agent to decide what actions to take. It can be deterministic, in which case it is usually denoted by $\mu$:
$$a_t = \mu(s_t),$$
or it may be stochastic, in which case it is usually denoted by $\pi$:
$$a_t \sim \pi(\cdot | s_t).$$
Because the policy is essentially the agent’s brain, it’s not uncommon to substitute the word “policy” for “agent”, eg saying “The policy is trying to maximize reward.”
In deep RL, we deal with parameterized policies: policies whose outputs are computable functions that depend on a set of parameters (eg the weights and biases of a neural network) which we can adjust to change the behavior via some optimization algorithm.
We often denote the parameters of such a policy by $\theta$ or $\phi$, and then write this as a subscript on the policy symbol to highlight the connection:
$$a_t = \mu_\theta(s_t)$$
$$a_t \sim \pi_\theta(\cdot | s_t).$$
Deterministic Policies
Example: Deterministic Policies. Here is a code snippet for building a simple deterministic policy for a continuous action space in PyTorch, using the torch.nn package:
import torch
import torch.nn as nn

# obs_dim and act_dim are assumed to be the dimensions of the observation and action vectors.
pi_net = nn.Sequential(
    nn.Linear(obs_dim, 64),
    nn.Tanh(),
    nn.Linear(64, 64),
    nn.Tanh(),
    nn.Linear(64, act_dim)
)
This builds a multi-layer perceptron (MLP) network with two hidden layers of size 64 and tanh activation functions. If obs is a Numpy array containing a batch of observations, pi_net can be used to obtain a batch of actions as follows:
obs_tensor = torch.as_tensor(obs, dtype=torch.float32)
actions = pi_net(obs_tensor)
You Should Know
Don’t worry about it if this neural network stuff is unfamiliar to you—this tutorial will focus on RL, and not on the neural network side of things. So you can skip this example and come back to it later. But we figured that if you already knew, it could be helpful.
Stochastic Policies
The two most common kinds of stochastic policies in deep RL are categorical policies and diagonal Gaussian policies.
Categorical policies can be used in discrete action spaces, while diagonal Gaussian policies are used in continuous action spaces.
Two key computations are centrally important for using and training stochastic policies:
- sampling actions from the policy,
- and computing log likelihoods of particular actions, $\log \pi_\theta(a|s)$.
In what follows, we’ll describe how to do these for both categorical and diagonal Gaussian policies.
Categorical Policies
A categorical policy is like a classifier over discrete actions. You build the neural network for a categorical policy the same way you would for a classifier: the input is the observation, followed by some number of layers (possibly convolutional or densely-connected, depending on the kind of input), and then you have one final linear layer that gives you logits for each action, followed by a softmax to convert the logits into probabilities.
Sampling. Given the probabilities for each action, frameworks like PyTorch and Tensorflow have built-in tools for sampling. For example, see the documentation for Categorical distributions in PyTorch, torch.multinomial, tf.distributions.Categorical, or tf.multinomial.
Log-Likelihood. Denote the last layer of probabilities as $P_\theta(s)$. It is a vector with however many entries as there are actions, so we can treat the actions as indices for the vector. The log likelihood for an action $a$ can then be obtained by indexing into the vector:
$$\log \pi_\theta(a|s) = \log \left[ P_\theta(s) \right]_a.$$
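For concreteness, here is a minimal PyTorch sketch of both computations, assuming a hypothetical logits_net that maps an observation to one logit per discrete action (built like pi_net above; this is a sketch, not the document's reference implementation):
import torch
from torch.distributions import Categorical

def sample_and_logp(logits_net, obs):
    # Build a categorical distribution from the network's logits, sample an action,
    # and compute the log-likelihood log pi_theta(a|s) of that sampled action.
    obs_tensor = torch.as_tensor(obs, dtype=torch.float32)
    dist = Categorical(logits=logits_net(obs_tensor))
    action = dist.sample()
    log_prob = dist.log_prob(action)
    return action, log_prob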
Diagonal Gaussian Policies
A multivariate Gaussian distribution (or multivariate normal distribution, if you prefer) is described by a mean vector, $\mu$, and a covariance matrix, $\Sigma$. A diagonal Gaussian distribution is a special case where the covariance matrix only has entries on the diagonal. As a result, we can represent it by a vector.
A diagonal Gaussian policy always has a neural network that maps from observations to mean actions, $\mu_\theta(s)$. There are two different ways that the covariance matrix is typically represented.
The first way: There is a single vector of log standard deviations, $\log \sigma$, which is not a function of state: the $\log \sigma$ are standalone parameters. (You Should Know: our implementations of VPG, TRPO, and PPO do it this way.)
The second way: There is a neural network that maps from states to log standard deviations, $\log \sigma_\theta(s)$. It may optionally share some layers with the mean network.
Note that in both cases we output log standard deviations instead of standard deviations directly. This is because log stds are free to take on any values in $(-\infty, \infty)$, while stds must be nonnegative. It's easier to train parameters if you don't have to enforce those kinds of constraints. The standard deviations can be obtained immediately from the log standard deviations by exponentiating them, so we do not lose anything by representing them this way.
Sampling. Given the mean action $\mu_\theta(s)$ and standard deviation $\sigma_\theta(s)$, and a vector of noise $z$ from a spherical Gaussian ($z \sim \mathcal{N}(0, I)$), an action sample can be computed with
$$a = \mu_\theta(s) + \sigma_\theta(s) \odot z,$$
where $\odot$ denotes the elementwise product of two vectors. Standard frameworks have built-in ways to generate the noise vectors, such as torch.normal or tf.random_normal. Alternatively, you can build distribution objects, eg through torch.distributions.Normal or tf.distributions.Normal, and use them to generate samples. (The advantage of the latter approach is that those objects can also calculate log-likelihoods for you.)
Log-Likelihood. The log-likelihood of a $k$-dimensional action $a$, for a diagonal Gaussian with mean $\mu = \mu_\theta(s)$ and standard deviation $\sigma = \sigma_\theta(s)$, is given by
$$\log \pi_\theta(a|s) = -\frac{1}{2}\left( \sum_{i=1}^{k} \left( \frac{(a_i - \mu_i)^2}{\sigma_i^2} + 2 \log \sigma_i \right) + k \log 2\pi \right).$$
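For concreteness, here is a minimal PyTorch sketch of sampling and log-likelihood for a diagonal Gaussian policy, assuming a hypothetical mean network mu_net and the first representation above (a standalone log_std parameter vector); it is an illustrative sketch, not a reference implementation:
import torch
import torch.nn as nn
from torch.distributions import Normal

act_dim = 4  # example action dimension, chosen for illustration
log_std = nn.Parameter(-0.5 * torch.ones(act_dim))  # standalone log-std parameters

def sample_and_logp(mu_net, obs):
    # Sample a = mu_theta(s) + sigma * z with z ~ N(0, I), and compute log pi_theta(a|s)
    # by summing per-dimension Normal log-densities (valid because the covariance is diagonal).
    obs_tensor = torch.as_tensor(obs, dtype=torch.float32)
    mu = mu_net(obs_tensor)
    std = torch.exp(log_std)
    dist = Normal(mu, std)
    action = dist.sample()
    log_prob = dist.log_prob(action).sum(axis=-1)
    return action, log_prob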
Trajectories
A trajectory $\tau$ is a sequence of states and actions in the world,
$$\tau = (s_0, a_0, s_1, a_1, \ldots).$$
The very first state of the world, $s_0$, is randomly sampled from the start-state distribution, sometimes denoted by $\rho_0$:
$$s_0 \sim \rho_0(\cdot).$$
State transitions (what happens to the world between the state at time $t$, $s_t$, and the state at $t+1$, $s_{t+1}$) are governed by the natural laws of the environment, and depend on only the most recent action, $a_t$. They can be either deterministic,
$$s_{t+1} = f(s_t, a_t),$$
or stochastic,
$$s_{t+1} \sim P(\cdot | s_t, a_t).$$
Actions come from an agent according to its policy.
You Should Know
Trajectories are also frequently called episodes or rollouts.
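As a concrete sketch, here is one way to collect a single trajectory, assuming a Gymnasium-style environment and a policy function mapping observations to actions (neither is specified by the text; the helper name collect_trajectory is made up for illustration):
def collect_trajectory(env, policy):
    # Roll out one episode: at each step the agent sees an observation, picks an action
    # from its policy, and the environment transitions and emits a reward.
    observations, actions, rewards = [], [], []
    obs, _ = env.reset()                      # s_0 ~ rho_0
    done = False
    while not done:
        action = policy(obs)                  # a_t from the policy
        next_obs, reward, terminated, truncated, _ = env.step(action)  # s_{t+1}, r_t
        observations.append(obs)
        actions.append(action)
        rewards.append(reward)
        obs = next_obs
        done = terminated or truncated
    return observations, actions, rewards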
Reward and Return
The reward function $R$ is critically important in reinforcement learning. It depends on the current state of the world, the action just taken, and the next state of the world:
$$r_t = R(s_t, a_t, s_{t+1}),$$
although frequently this is simplified to just a dependence on the current state, $r_t = R(s_t)$, or state-action pair $r_t = R(s_t, a_t)$.
The goal of the agent is to maximize some notion of cumulative reward over a trajectory, but this actually can mean a few things. We'll notate all of these cases with $R(\tau)$, and it will either be clear from context which case we mean, or it won't matter (because the same equations will apply to all cases).
One kind of return is the finite-horizon undiscounted return, which is just the sum of rewards obtained in a fixed window of steps:
$$R(\tau) = \sum_{t=0}^{T} r_t.$$
Another kind of return is the infinite-horizon discounted return, which is the sum of all rewards ever obtained by the agent, but discounted by how far off in the future they're obtained. This formulation of reward includes a discount factor $\gamma \in (0,1)$:
$$R(\tau) = \sum_{t=0}^{\infty} \gamma^t r_t.$$
Why would we ever want a discount factor, though? Don’t we just want to get all rewards? We do, but the discount factor is both intuitively appealing and mathematically convenient. On an intuitive level: cash now is better than cash later. Mathematically: an infinite-horizon sum of rewards may not converge to a finite value, and is hard to deal with in equations. But with a discount factor and under reasonable conditions, the infinite sum converges.
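To make the two formulations concrete, here is a small sketch computing both kinds of return from a list of rewards (the reward values and helper names are made up for illustration):
import numpy as np

def finite_horizon_return(rewards):
    # Undiscounted return over a fixed window: R(tau) = sum_t r_t
    return float(np.sum(rewards))

def discounted_return(rewards, gamma=0.99):
    # Discounted return: R(tau) = sum_t gamma^t * r_t
    return float(sum(gamma**t * r for t, r in enumerate(rewards)))

rewards = [1.0, 1.0, 1.0, 1.0]
print(finite_horizon_return(rewards))         # 4.0
print(discounted_return(rewards, gamma=0.9))  # 1 + 0.9 + 0.81 + 0.729 = 3.439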
You Should Know
While the line between these two formulations of return is quite stark in RL formalism, deep RL practice tends to blur the line a fair bit—for instance, we frequently set up algorithms to optimize the undiscounted return, but use discount factors in estimating value functions.
The RL Problem
Whatever the choice of return measure (whether infinite-horizon discounted, or finite-horizon undiscounted), and whatever the choice of policy, the goal in RL is to select a policy which maximizes expected return when the agent acts according to it.
To talk about expected return, we first have to talk about probability distributions over trajectories.
Let's suppose that both the environment transitions and the policy are stochastic. In this case, the probability of a $T$-step trajectory is:
$$P(\tau|\pi) = \rho_0(s_0) \prod_{t=0}^{T-1} P(s_{t+1} | s_t, a_t)\, \pi(a_t | s_t).$$
The expected return (for whichever measure), denoted by $J(\pi)$, is then:
$$J(\pi) = \int_\tau P(\tau|\pi) R(\tau) = \mathbb{E}_{\tau \sim \pi}\left[ R(\tau) \right].$$
The central optimization problem in RL can then be expressed by
$$\pi^* = \arg\max_\pi J(\pi),$$
with $\pi^*$ being the optimal policy.
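In practice the expectation over trajectories is rarely computable in closed form, so $J(\pi)$ is typically approximated by sampling. A minimal Monte Carlo sketch, reusing the collect_trajectory and discounted_return helpers sketched earlier (both are illustrative helpers, not part of the text):
def estimate_expected_return(env, policy, num_episodes=100, gamma=0.99):
    # Monte Carlo estimate of J(pi): average the return R(tau) over sampled trajectories tau ~ pi.
    returns = []
    for _ in range(num_episodes):
        _, _, rewards = collect_trajectory(env, policy)
        returns.append(discounted_return(rewards, gamma))
    return sum(returns) / len(returns)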
Value Functions
It’s often useful to know the value of a state, or state-action pair. By value, we mean the expected return if you start in that state or state-action pair, and then act according to a particular policy forever after. Value functions are used, one way or another, in almost every RL algorithm.
There are four main functions of note here.
1. The On-Policy Value Function, $V^\pi(s)$, which gives the expected return if you start in state $s$ and always act according to policy $\pi$:
$$V^\pi(s) = \mathbb{E}_{\tau \sim \pi}\left[ R(\tau) \mid s_0 = s \right]$$
2. The On-Policy Action-Value Function, $Q^\pi(s,a)$, which gives the expected return if you start in state $s$, take an arbitrary action $a$ (which may not have come from the policy), and then forever after act according to policy $\pi$:
$$Q^\pi(s,a) = \mathbb{E}_{\tau \sim \pi}\left[ R(\tau) \mid s_0 = s, a_0 = a \right]$$
3. The Optimal Value Function, $V^*(s)$, which gives the expected return if you start in state $s$ and always act according to the optimal policy in the environment:
$$V^*(s) = \max_\pi \mathbb{E}_{\tau \sim \pi}\left[ R(\tau) \mid s_0 = s \right]$$
4. The Optimal Action-Value Function, $Q^*(s,a)$, which gives the expected return if you start in state $s$, take an arbitrary action $a$, and then forever after act according to the optimal policy in the environment:
$$Q^*(s,a) = \max_\pi \mathbb{E}_{\tau \sim \pi}\left[ R(\tau) \mid s_0 = s, a_0 = a \right]$$
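For intuition about the definitions above, $V^\pi(s)$ can be approximated by repeatedly rolling out $\pi$ from state $s$ and averaging the returns. A rough sketch, assuming a hypothetical env.reset_to(state) method that places a Gymnasium-style environment in a chosen state (most environments do not expose this directly):
def estimate_v_pi(env, policy, state, num_rollouts=100, gamma=0.99):
    # Monte Carlo estimate of V^pi(s): start in s, act according to pi forever after,
    # and average the discounted returns across rollouts.
    returns = []
    for _ in range(num_rollouts):
        obs = env.reset_to(state)   # hypothetical helper: reset the environment to state s
        ret, t, done = 0.0, 0, False
        while not done:
            obs, reward, terminated, truncated, _ = env.step(policy(obs))
            ret += gamma**t * reward
            done = terminated or truncated
            t += 1
        returns.append(ret)
    return sum(returns) / len(returns)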
You Should Know
When we talk about value functions, if we do not make reference to time-dependence, we only mean expected infinite-horizon discounted return. Value functions for finite-horizon undiscounted return would need to accept time as an argument. Can you think about why? Hint: what happens when time’s up?
You Should Know
There are two key connections between the value function and the action-value function that come up pretty often:
$$V^\pi(s) = \mathbb{E}_{a \sim \pi}\left[ Q^\pi(s,a) \right],$$
and
$$V^*(s) = \max_a Q^*(s,a).$$
These relations follow pretty directly from the definitions just given: can you prove them?
The Optimal Q-Function and the Optimal Action
There is an important connection between the optimal action-value function $Q^*(s,a)$ and the action selected by the optimal policy. By definition, $Q^*(s,a)$ gives the expected return for starting in state $s$, taking (arbitrary) action $a$, and then acting according to the optimal policy forever after.
The optimal policy in $s$ will select whichever action maximizes the expected return from starting in $s$. As a result, if we have $Q^*$, we can directly obtain the optimal action, $a^*(s)$, via
$$a^*(s) = \arg\max_a Q^*(s,a).$$
Note: there may be multiple actions which maximize , in which case, all of them are optimal, and the optimal policy may randomly select any of them. But there is always an optimal policy which deterministically selects an action.
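As a code-level sketch of this relation for a discrete action space, assuming a learned approximation q_net of $Q^*$ that outputs one value per action (an assumption made for this illustration, not something defined in the text):
import torch

def optimal_action(q_net, obs):
    # a*(s) = argmax_a Q*(s, a): pick the action whose Q-value is highest.
    obs_tensor = torch.as_tensor(obs, dtype=torch.float32)
    q_values = q_net(obs_tensor)          # one Q-value per discrete action
    return torch.argmax(q_values).item()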
Bellman Equations
All four of the value functions obey special self-consistency equations called Bellman equations. The basic idea behind the Bellman equations is this:
The value of your starting point is the reward you expect to get from being there, plus the value of wherever you land next.
The Bellman equations for the on-policy value functions are
$$V^\pi(s) = \mathbb{E}_{a \sim \pi,\; s' \sim P}\left[ r(s,a) + \gamma V^\pi(s') \right],$$
$$Q^\pi(s,a) = \mathbb{E}_{s' \sim P}\left[ r(s,a) + \gamma \mathbb{E}_{a' \sim \pi}\left[ Q^\pi(s',a') \right] \right],$$
where $s' \sim P$ is shorthand for $s' \sim P(\cdot|s,a)$, indicating that the next state $s'$ is sampled from the environment's transition rules; $a \sim \pi$ is shorthand for $a \sim \pi(\cdot|s)$; and $a' \sim \pi$ is shorthand for $a' \sim \pi(\cdot|s')$.
The Bellman equations for the optimal value functions are
$$V^*(s) = \max_a \mathbb{E}_{s' \sim P}\left[ r(s,a) + \gamma V^*(s') \right],$$
$$Q^*(s,a) = \mathbb{E}_{s' \sim P}\left[ r(s,a) + \gamma \max_{a'} Q^*(s',a') \right].$$
The crucial difference between the Bellman equations for the on-policy value functions and the optimal value functions is the absence or presence of the $\max$ over actions. Its inclusion reflects the fact that whenever the agent gets to choose its action, in order to act optimally, it has to pick whichever action leads to the highest value.
You Should Know
The term “Bellman backup” comes up quite frequently in the RL literature. The Bellman backup for a state, or state-action pair, is the right-hand side of the Bellman equation: the reward-plus-next-value.
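As an illustration, here is a sketch of computing the Bellman backup for $Q^*$ as a one-step target of the kind used in Q-learning-style methods (q_net and the transition tuple are assumptions made for this sketch):
import torch

def bellman_backup_target(q_net, reward, next_obs, done, gamma=0.99):
    # Reward-plus-next-value: r + gamma * max_a' Q(s', a'), with no bootstrap if the episode ended.
    with torch.no_grad():
        next_obs_tensor = torch.as_tensor(next_obs, dtype=torch.float32)
        max_next_q = q_net(next_obs_tensor).max().item()
    return reward + gamma * (1.0 - float(done)) * max_next_q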
Advantage Functions
Sometimes in RL, we don’t need to describe how good an action is in an absolute sense, but only how much better it is than others on average. That is to say, we want to know the relative advantage of that action. We make this concept precise with the advantage function.
The advantage function $A^\pi(s,a)$ corresponding to a policy $\pi$ describes how much better it is to take a specific action $a$ in state $s$, over randomly selecting an action according to $\pi(\cdot|s)$, assuming you act according to $\pi$ forever after. Mathematically, the advantage function is defined by
$$A^\pi(s,a) = Q^\pi(s,a) - V^\pi(s).$$
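As a tiny sketch of the definition (the q_values and v_value inputs are assumed to come from learned or estimated critics, which are not specified here):
import torch

def advantages(q_values, v_value):
    # A^pi(s, a) = Q^pi(s, a) - V^pi(s), evaluated for each action at a single state s.
    # q_values: tensor of shape (act_dim,); v_value: scalar tensor or float.
    return q_values - v_value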
You Should Know
We’ll discuss this more later, but the advantage function is crucially important to policy gradient methods.
(Optional) Formalism
So far, we've discussed the agent's environment in an informal way, but if you try to go digging through the literature, you're likely to run into the standard mathematical formalism for this setting: Markov Decision Processes (MDPs). An MDP is a 5-tuple, $\langle S, A, R, P, \rho_0 \rangle$, where
- $S$ is the set of all valid states,
- $A$ is the set of all valid actions,
- $R : S \times A \times S \to \mathbb{R}$ is the reward function, with $r_t = R(s_t, a_t, s_{t+1})$,
- $P : S \times A \to \mathcal{P}(S)$ is the transition probability function, with $P(s'|s,a)$ being the probability of transitioning into state $s'$ if you start in state $s$ and take action $a$,
- and $\rho_0$ is the starting state distribution.
The name Markov Decision Process refers to the fact that the system obeys the Markov property: transitions only depend on the most recent state and action, and no prior history.
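To make the 5-tuple concrete, here is a sketch of a tiny, made-up tabular MDP with two states and two actions (the numbers are purely illustrative, not from the text):
import numpy as np

num_states, num_actions = 2, 2

# P[s, a, s'] = probability of landing in s' after taking action a in state s.
P = np.array([
    [[0.9, 0.1], [0.2, 0.8]],
    [[0.5, 0.5], [0.0, 1.0]],
])

# R[s, a, s'] = reward for the transition (s, a, s'); here, reaching state 1 yields reward 1.
R = np.zeros((num_states, num_actions, num_states))
R[:, :, 1] = 1.0

rho_0 = np.array([1.0, 0.0])  # always start in state 0

assert np.allclose(P.sum(axis=-1), 1.0)  # each (s, a) row is a valid probability distribution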