The main problem with TD learning and DP is that their step updates are biased by the initial conditions of the learning parameters. The bootstrapping process typically updates a function or lookup table value Q(s,a) towards a successor value Q(s',a'), using whatever the current estimates are in the latter. Clearly, at the very start of learning these estimates contain no information from any real rewards or state transitions.
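As a concrete illustration, here is a minimal sketch of a one-step Q-learning update on a small tabular problem (the state/action sizes and environment are hypothetical): the target bootstraps from the current estimate of Q(s', ·), so early updates are pulled towards whatever Q was initialised to.

```python
import numpy as np

n_states, n_actions = 10, 4
gamma, alpha = 0.99, 0.1
Q = np.zeros((n_states, n_actions))  # the initial values that early updates are biased towards

def td_update(s, a, r, s_next, done):
    # Bootstrapped target: relies on the current (initially uninformative) estimates of Q(s', .)
    target = r if done else r + gamma * np.max(Q[s_next])
    Q[s, a] += alpha * (target - Q[s, a])
```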
If learning works as intended, then the bias will reduce asymptotically over multiple iterations. However, the bias can cause significant problems, especially for off-policy methods (e.g. Q-learning) and when using function approximators. The combination of bootstrapping, off-policy learning and function approximation is so prone to failing to converge that it is called the deadly triad in Sutton & Barto.
Monte Carlo control methods do not suffer from this bias, as each update is made using a true sample of what Q(s,a) should be. However, Monte Carlo methods can suffer from high variance, which means more samples are required to achieve the same degree of learning compared to TD.
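For contrast, here is a minimal sketch of an every-visit Monte Carlo control update, assuming `episode` is a list of (state, action, reward) tuples from one complete run and `Q` is the same tabular array as above: the target is the actual sampled return, so it is unbiased but noisier.

```python
def mc_update(Q, episode, gamma=0.99, alpha=0.1):
    # `episode` is a list of (state, action, reward) tuples from one complete run
    G = 0.0
    for s, a, r in reversed(episode):
        G = r + gamma * G                # actual sampled return from (s, a) onwards
        Q[s, a] += alpha * (G - Q[s, a])
```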
In practice, TD learning appears to learn more efficiently if the problems with the deadly triad can be overcome. Recent results using experience replay and staged "frozen" copies of estimators provide work-arounds that address these problems - for example, that is how the DQN learner for Atari games was built.
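A rough sketch of those two work-arounds, assuming hypothetical callables `q_net(s)` and `target_net(s)` that each return a vector of action values (this is not any specific library's API, just the shape of the idea):

```python
import copy
import random
from collections import deque

replay_buffer = deque(maxlen=100_000)   # stores (s, a, r, s_next, done) transitions

def make_targets(target_net, gamma=0.99, batch_size=32):
    # Sample decorrelated transitions and build bootstrapped targets from the
    # "frozen" copy of the estimator rather than the network being trained.
    batch = random.sample(replay_buffer, batch_size)
    targets = []
    for s, a, r, s_next, done in batch:
        boot = 0.0 if done else gamma * max(target_net(s_next))
        targets.append((s, a, r + boot))
    return targets   # regress the online network's Q(s, a) towards these with any optimiser

def sync_target(q_net):
    # Periodically refresh the frozen copy (e.g. every few thousand steps)
    return copy.deepcopy(q_net)
```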
There is also a middle ground between TD and Monte Carlo. It is possible to construct a generalised method that combines trajectories of different lengths - from single-step TD up to complete episode runs as in Monte Carlo. The most common variant of this is TD(λ) learning, where λ is a parameter from 0 (effectively single-step TD learning) to 1 (effectively Monte Carlo learning, but with the nice feature that it can be used in continuing, non-episodic problems). Typically, a value between 0 and 1 makes for the most efficient learning agent - although, like many hyperparameters, the best value to use depends on the problem.
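One common way to implement this is with eligibility traces. Below is a minimal sketch of tabular TD(λ) for state values, assuming a hypothetical `env` with `reset()`/`step()` methods and a `policy` callable; `lam=0` recovers one-step TD, while `lam` close to 1 behaves more like Monte Carlo.

```python
import numpy as np

def td_lambda_episode(env, policy, V, gamma=0.99, alpha=0.1, lam=0.9):
    # Accumulating eligibility traces: lam=0 gives one-step TD, lam -> 1
    # approaches Monte Carlo behaviour while still updating online.
    E = np.zeros_like(V)
    s, done = env.reset(), False
    while not done:
        s_next, r, done = env.step(policy(s))
        delta = r + (0.0 if done else gamma * V[s_next]) - V[s]
        E[s] += 1.0                  # mark the visited state as eligible
        V += alpha * delta * E       # spread the TD error over recently visited states
        E *= gamma * lam             # decay the traces
        s = s_next
    return V
```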
If you are using a value-based method (as opposed to a policy-based one), then TD learning is generally the more practical choice, and a TD/MC combination method such as TD(λ) can be even better.
In terms of "practical advantage" for MC? Monte Carlo learning is conceptually simple, robust and easy to implement, albeit often slower than TD. I would generally not use it for a learning controller engine (unless in a hurry to implement something for a simple environment), but I would seriously consider it for policy evaluation in order to compare multiple agents for instance - that is due to it being an unbiased measure, which is important for testing.
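For instance, a short sketch of that kind of unbiased Monte Carlo policy evaluation, re-using the same hypothetical `env`/`policy` interface as above: average the full-episode returns for each agent and compare the means.

```python
import numpy as np

def evaluate(env, policy, n_episodes=100, gamma=1.0):
    # Average complete-episode returns: an unbiased estimate of the policy's value
    returns = []
    for _ in range(n_episodes):
        s, done, G, t = env.reset(), False, 0.0, 0
        while not done:
            s, r, done = env.step(policy(s))
            G += (gamma ** t) * r
            t += 1
        returns.append(G)
    return np.mean(returns), np.std(returns) / np.sqrt(n_episodes)  # mean return and its standard error
```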