
I've been doing a lot of research on Reinforcement Learning lately, following Sutton & Barto's Reinforcement Learning: An Introduction for most of it.

I know what Markov Decision Processes are and how Dynamic Programming (DP), Monte Carlo and Temporal Difference (TD) learning can be used to solve them. The problem I'm having is that I don't see when Monte Carlo would be the better option over TD-learning.

The main difference between them is that TD-learning uses bootstrapping to approximate the action-value function, whereas Monte Carlo uses an average of observed returns to accomplish this. I just can't think of a scenario in which the latter is the better way to go.
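
To make sure I understand the difference concretely, here is a minimal sketch of the two update rules as I understand them (a hedged illustration only; q and n are hypothetical lookup tables and the variable names are my own, not taken from the book):

    import numpy as np

    n_states, n_actions = 10, 2
    q = np.zeros((n_states, n_actions))       # action-value estimates
    n = np.zeros((n_states, n_actions))       # visit counts for MC averaging
    alpha, gamma = 0.1, 0.99

    def td_update(s, a, r, s2, a2):
        """SARSA-style TD: bootstrap from the current estimate of the successor pair."""
        td_target = r + gamma * q[s2, a2]
        q[s, a] += alpha * (td_target - q[s, a])

    def mc_update(s, a, G):
        """Monte Carlo: use the actual return G observed after (s, a) and average it in."""
        n[s, a] += 1
        q[s, a] += (G - q[s, a]) / n[s, a]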

My guess is that it might have something to do with performance, but I can't find any sources that can prove this.

Am I missing something or is TD-learning generally the better option?


2 Answers


The main problem with TD learning and DP is that their step updates are biased by the initial conditions of the learning parameters. The bootstrapping process typically updates a function or lookup-table entry Q(s,a) toward a successor value Q(s',a'), using whatever the current estimate of the latter is. Clearly, at the very start of learning these estimates contain no information from any real rewards or state transitions.
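
A small, hedged illustration of that bias (the numbers and q_init are just hypothetical initialisation values, nothing from the book):

    gamma, q_init = 0.99, 0.0
    r = 1.0                                   # an observed immediate reward
    # The first bootstrapped target is r + gamma * q_init regardless of how good the
    # successor state actually is, so early updates are pulled toward the arbitrary
    # initial values rather than toward real long-term returns.
    first_td_target = r + gamma * q_init      # = 1.0 here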

If learning works as intended, then the bias will reduce asymptotically over multiple iterations. However, the bias can cause significant problems, especially for off-policy methods (e.g. Q Learning) and when using function approximators. That combination is so likely to fail to converge that it is called the deadly triad in Sutton & Barto.

Monte Carlo control methods do not suffer from this bias, as each update is made using a true sample of what Q(s,a) should be. However, Monte Carlo methods can suffer from high variance, which means more samples are required to achieve the same degree of learning compared to TD.

In practice, TD learning appears to learn more efficiently if the problems with the deadly triad can be overcome. Recent results using experience replay and staged "frozen" copies of estimators provide work-arounds that address these problems - e.g. that is how the DQN learner for Atari games was built.
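
A minimal sketch of those two work-arounds (hedged: this uses a simple linear approximator over hypothetical feature vectors rather than the actual DQN network, and all names here are my own, not from the DQN paper):

    import random
    from collections import deque
    import numpy as np

    n_features, n_actions = 8, 4
    w = np.zeros((n_actions, n_features))     # online estimator
    w_target = w.copy()                       # "frozen" copy used for bootstrap targets
    replay = deque(maxlen=10_000)             # experience-replay buffer
    gamma, alpha, sync_every = 0.99, 0.01, 500

    def q_values(weights, phi):
        return weights @ phi                  # Q(s, .) under a linear approximator

    def store(phi, a, r, phi_next, done):
        replay.append((phi, a, r, phi_next, done))

    def train_step(step, batch_size=32):
        global w_target
        if len(replay) < batch_size:
            return
        for phi, a, r, phi_next, done in random.sample(list(replay), batch_size):
            # Bootstrap from the frozen copy, not the rapidly changing online weights.
            target = r if done else r + gamma * np.max(q_values(w_target, phi_next))
            td_error = target - q_values(w, phi)[a]
            w[a] += alpha * td_error * phi    # semi-gradient update to the online weights
        if step % sync_every == 0:
            w_target = w.copy()               # periodically re-freeze the target copy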

There is also a middle ground between TD and Monte Carlo. It is possible to construct a generalised method that combines trajectories of different lengths - from single-step TD to complete episode runs as in Monte Carlo. The most common variant of this is TD(λ) learning, where λ is a parameter from 0 (effectively single-step TD learning) to 1 (effectively Monte Carlo learning, but with the nice feature that it can be used in continuing, non-episodic problems). Typically, a value between 0 and 1 makes the most efficient learning agent - although, like many hyperparameters, the best value to use depends on the problem.
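
A tabular sketch of that idea for state-value prediction (hedged: env, policy and n_states are hypothetical placeholders with an assumed reset()/step() interface, and this is just the accumulating-trace variant, not a tuned implementation):

    import numpy as np

    def td_lambda(env, policy, n_states, episodes=500,
                  alpha=0.1, gamma=0.99, lam=0.9):
        """Tabular TD(lambda) prediction with accumulating eligibility traces."""
        V = np.zeros(n_states)
        for _ in range(episodes):
            e = np.zeros(n_states)            # eligibility trace per state
            s, done = env.reset(), False
            while not done:
                s2, r, done = env.step(policy(s))
                # lam = 0 reduces to one-step TD; lam -> 1 approaches Monte Carlo.
                delta = r + gamma * V[s2] * (not done) - V[s]
                e[s] += 1.0                   # mark the state just visited
                V += alpha * delta * e        # credit all recently visited states
                e *= gamma * lam              # decay older traces
                s = s2
        return V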

If you are using a value-based method (as opposed to a policy-based one), then TD learning is generally used more in practice, or a TD/MC combination method such as TD(λ) can be even better.

In terms of "practical advantage" for MC? Monte Carlo learning is conceptually simple, robust and easy to implement, albeit often slower than TD. I would generally not use it for a learning controller engine (unless in a hurry to implement something for a simple environment), but I would seriously consider it for policy evaluation in order to compare multiple agents for instance - that is due to it being an unbiased measure, which is important for testing.
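
For that policy-evaluation use case, a hedged sketch (env and agent are hypothetical, with assumed reset()/step() and act() interfaces) of why MC gives an unbiased score for comparing agents - no estimate ever appears inside the target, only actual returns:

    def mc_evaluate(env, agent, episodes=1000, gamma=1.0):
        """Estimate a fixed policy's expected return by plain averaging of episode returns."""
        total = 0.0
        for _ in range(episodes):
            s, done, G, discount = env.reset(), False, 0.0, 1.0
            while not done:
                s, r, done = env.step(agent.act(s))
                G += discount * r             # accumulate the actual return, no bootstrapping
                discount *= gamma
            total += G
        return total / episodes               # unbiased estimate of the expected return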

  • First off, thanks for the answer. I see how in theory an unbiased algorithm could be preferred over a biased one. But considering the high variance Monte Carlo can give at the start of training, I don't see how this really matters. Both Monte Carlo and TD will start with inaccurate approximations, and from what I've read, TD will converge much faster. I just can't really come up with a practical advantage of using Monte Carlo. (Assuming the deadly triad can be avoided.)
    – Anne-dirk
    Commented Mar 27, 2018 at 8:10
  • @Anne-dirk If you are using a value-based method (as opposed to a policy-based one), then TD learning is generally used more in practice, or a TD/MC combination method such as TD(λ) can be even better. I am not sure what you mean by "practical advantage"? Monte Carlo learning is conceptually simple, robust and easy to implement. I would generally not use it for a learning controller engine (unless in a hurry to implement something for a simple environment), but I would seriously consider it for policy evaluation in order to compare multiple agents, for instance.
    Commented Mar 27, 2018 at 8:53
  • @Neil Slater Aaaah I see... That's the kind of answer I was looking for :) Thanks for your help!
    – Anne-dirk
    Commented Mar 27, 2018 at 9:16

Essentially it depends on your environment.

TD exploits the Markov property, i.e. that the future states of a process depend only on the current state, and therefore it's usually more efficient to use TD in Markov environments.

MC does not exploit the Markov property, as it bases its value estimates on complete episode returns rather than on successor-state estimates, which lends itself to non-Markov environments.

  • To be clear, I was referring to efficiency. If you can exploit the Markov property then TD is advantageous, because you can start in any given state, take an action, and the result will always be the same, so you can compute the TD error with high levels of certainty. With a non-MDP, if you get a state that is partially observed, then TD may not be very efficient. That's not to say you can't use TD in non-MDPs - you can - but it might be inefficient, and you may get better results with TD(λ) rather than TD(1).
    – BigBadMe
    Commented Feb 23, 2019 at 0:04
  • It literally has everything to do with it being Markov; if you're in a Markov environment, then once you establish that taking action a in state s leads to state s' with reward x, that will always be the case, so you don't need to evaluate it over and over - you can take bigger steps, and TD enables you to exploit that. But it won't be the case in a POMDP, because you can have exactly the same state, take the same action, but end up in completely different states and rewards.
    – BigBadMe
    Commented Feb 23, 2019 at 14:55
  • Yes! MDPs are deterministic by definition. POMDPs are stochastic, that's exactly what I've been saying. An MDP is not stochastic - it cannot be both deterministic and stochastic. But look, if you want to take it up with David Silver, feel free to do so: youtu.be/PnHCvfgC_ZA?t=3811 Burn!
    – BigBadMe
    Commented Feb 23, 2019 at 15:18
  • Oh look, the baby has marked down my answer because he was shown to be wrong. Pathetic. I didn't say MC methods don't work in Markov environments! I was saying efficiency gains can be found by using the correct approach depending on the environment. See, you're now even needing to make things up, and you're coming across as quite pedantic now.
    – BigBadMe
    Commented Feb 23, 2019 at 15:53
  • And no, an MDP is not "just a way of representing the environment and its dynamics", it literally means that if you are in state S then all events that occurred before are defined by that state, and if you take action a, then you will always get the same state S' and reward r. That is literally the definition of an MDP, and why it is deterministic. That's why an MDP can be guaranteed to converge to an optimal policy. I get the impression you don't know what you're talking about.
    – BigBadMe
    Commented Feb 23, 2019 at 16:01
