]ByteDance Seed 字节跳动种子

What’s Behind PPO’s Collapse in Long-CoT?
Value Optimization Holds the Secret
PPO 在长 CoT 中的崩溃背后是什么？价值优化掌握着秘密

Yufeng Yuan Yu Yue Ruofei Zhu Tiantian Fan Lin Yan [ yufeng.yuan@bytedance.com

(March 3, 2025) （2025 年 3 月 3 日）

Abstract 摘要

Reinforcement learning (RL) is pivotal for enabling large language models (LLMs) to generate long chains of thought (CoT) for complex tasks like math and reasoning. However, Proximal Policy Optimization (PPO), effective in many RL scenarios, fails in long CoT tasks. This paper identifies that value initialization bias and reward signal decay are the root causes of PPO’s failure. We propose Value-Calibrated PPO (VC-PPO) to address these issues. In VC-PPO, the value model is pretrained to tackle initialization bias, and the Generalized Advantage Estimation (GAE) computation is decoupled between the actor and critic to mitigate reward signal decay. Experiments on the American Invitational Mathematics Examination (AIME) show that VC-PPO significantly boosts PPO performance. Ablation studies show that techniques in VC-PPO are essential in enhancing PPO for long CoT tasks.
强化学习（RL）对于使大型语言模型（LLMs）能够生成用于复杂任务（如数学和推理）的长链思维（CoT）至关重要。然而，在许多强化学习场景中有效的近端策略优化（PPO）在长 CoT 任务中却失败了。本文指出，价值初始化偏差和奖励信号衰减是 PPO 失败的根本原因。我们提出了价值校准 PPO（VC-PPO）来解决这些问题。在 VC-PPO 中，价值模型被预训练以解决初始化偏差，并且广义优势估计（GAE）的计算在演员和评论家之间解耦，以减轻奖励信号衰减。在美國邀請賽數學考試（AIME）上的实验表明，VC-PPO 显著提高了 PPO 的性能。消融研究表明，VC-PPO 中的技术对于增强 PPO 以处理长 CoT 任务至关重要。

\correspondence 相应

Yufeng Yuan at 袁宇峰

1 Introduction 1 引言

In recent years, large language models (LLMs) have achieved remarkable breakthroughs across diverse domains, including question-answering [6], code generation [8, 3], dialog generation [6], and agent-related tasks [20]. A particularly notable advancement is the ability of LLMs to solve Olympiad-level math and reasoning problems. This feat is accomplished by generating a long Chain of Thought (CoT) [21] before reaching a final answer. Such an inference-time scaling paradigm was initially proposed by OpenAI-o1 [10] and further popularized by DeepSeek-R1 [4] and OpenAI-o3 [11]. During the long CoT process, the model formulates hypotheses and verifies them to gradually converge to a correct solution.
近年来，大型语言模型在多个领域取得了显著的突破，包括问答[6]、代码生成[8, 3]、对话生成[6]和与代理相关的任务[20]。一个特别值得注意的进步是LLMs能够解决奥林匹克级别的数学和推理问题。这一成就通过在得出最终答案之前生成一个长的思维链（CoT）[21]来实现。这种推理时间缩放范式最初由 OpenAI-o1[10]提出，并由 DeepSeek-R1[4]和 OpenAI-o3[11]进一步普及。在漫长的 CoT 过程中，模型提出假设并验证它们，逐渐收敛到一个正确的解决方案。

To equip models with such capabilities, researchers typically follow these procedures:
为了使模型具备这些能力，研究人员通常遵循以下程序：

1.

Cold-start with Supervised Fine-tuning (SFT) data: A substantial amount of SFT data following the Long-CoT pattern is collected. This initial step equips the model with a fundamental understanding of how to test and verify its own answers, laying a solid foundation for subsequent learning.

1. 使用监督微调（SFT）数据冷启动：收集了大量遵循 Long-CoT 模式的 SFT 数据。这一初始步骤使模型具备了如何测试和验证自身答案的基本理解，为后续学习奠定了坚实基础。
2.

Construct a dataset with verifiable tasks: The dataset is composed of tasks such as math and reasoning problems, whose correctness can be objectively judged by a non-hackable, rule-based reward model. This ensures that the model receives reliable feedback during training.

2. 构建一个可验证的任务数据集：该数据集由数学和推理问题等任务组成，其正确性可以通过一个不可破解的基于规则的奖励模型进行客观判断。这确保了模型在训练过程中获得可靠的反馈。
3.

Reinforcement learning (RL) training: The model is trained using RL with the objective of maximizing the reward from the rule-based reward model. Through this process, the model’s long CoT capabilities are solidified and enhanced, leading to a significant improvement in its performance on complex tasks.

3. 强化学习（RL）训练：模型使用强化学习进行训练，旨在最大化基于规则的奖励模型的奖励。通过这一过程，模型的长 CoT 能力得到巩固和提升，从而显著提高了其在复杂任务上的性能。

Reinforcement learning has played a critical role in developing such capabilities. However, directly applying Proximal Policy Optimization (PPO) [16], a method that has proven effective in various fields, including Reinforcement Learning with Human Feedback (RLHF), can lead to failure modes in tasks that require long CoT. As the response length increases, obtaining an accurate value model becomes increasingly challenging both before and during training. In contrast, Group Relative Policy Optimization (GRPO) [17], a simplified version of PPO that replaces the value model with the Leave-One-Out [9] estimate, has shown strong performance in such tasks. However, compared to GRPO, which only uses response-level feedback, PPO can utilize more fine-grained token-level feedback, indicating that PPO could have higher potential in complex tasks that require extensive exploration.
强化学习在发展此类能力中发挥了关键作用。然而，直接应用近端策略优化（PPO）[16]，一种在各种领域包括具有人类反馈的强化学习（RLHF）中已被证明有效的算法，可能导致在需要长 CoT 的任务中失败。随着响应长度的增加，在训练前后获取准确的价值模型变得越来越具有挑战性。相比之下，分组相对策略优化（GRPO）[17]，这是 PPO 的简化版本，用留一法估计[9]替换了价值模型，在类似任务中表现出强大的性能。然而，与仅使用响应级反馈的 GRPO 相比，PPO 可以利用更精细的标记级反馈，这表明 PPO 在需要广泛探索的复杂任务中可能具有更高的潜力。

In this paper, our goal is to fully exploit the potential of PPO in long CoT tasks. We first identify the key problem of PPO in long CoT tasks: the value model exhibits considerable bias before and during training, which causes it to fail to predict values accurately. The pre-training value bias stems from the common practice of initializing the value model from the reward model. During the early training stage, this approach leads to a large error in advantage estimation. The in-training value bias arises from the decaying nature of Generalized Advantage Estimation (GAE) [15] computation. In scenarios with long sequences and rewards at the end, the value function fails to propagate the reward signal to the preceding tokens.
本文旨在充分利用 PPO 在长 CoT 任务中的潜力。我们首先确定了 PPO 在长 CoT 任务中的关键问题：价值模型在训练前后表现出相当大的偏差，导致其无法准确预测价值。预训练价值偏差源于从奖励模型初始化价值模型的常见做法。在早期训练阶段，这种方法会导致优势估计出现较大误差。训练中的价值偏差源于广义优势估计（GAE）[15]计算的衰减性质。在具有长序列和末尾奖励的场景中，价值函数无法将奖励信号传播到前面的标记。

To address the value bias in PPO, we propose Value-Calibrated PPO (VC-PPO), in which the value model is calibrated before and during training. To address the value initialization bias, we pretrain the value model with responses generated by a fixed SFT policy in an offline manner. This helps the value model to better estimate the expected rewards and reduces the bias in the early training phase. To mitigate the value bias during training, we propose to decouple the GAE computation for the policy and the value, such that the value could use a larger $\lambda$ to allow a more effective propagation of the reward signal along the long sequence, while the policy could maintain the original $\lambda$ to ensure convergence under time and computational constraints.
为了解决 PPO 中的价值偏差，我们提出了价值校准 PPO（VC-PPO），其中价值模型在训练前后进行校准。为了解决价值初始化偏差，我们以离线方式使用固定的 SFT 策略生成的响应预先训练价值模型。这有助于价值模型更好地估计预期奖励，并减少早期训练阶段的偏差。为了减轻训练过程中的价值偏差，我们提出将策略和价值的目标函数计算解耦，这样价值可以使用更大的 $\lambda$ ，以便更有效地沿着长序列传播奖励信号，而策略可以保持原始的 $\lambda$ ，以确保在时间和计算约束下收敛。

In our experiments on the American Invitational Mathematics Examination (AIME), these two techniques significantly boost the performance of the baseline PPO from 5.6 to 49.0, achieving a higher score than previously reported in [4]. Moreover, our ablation studies demonstrate that both techniques are essential for achieving superior performance in AIME, highlighting the importance of our proposed solutions in enhancing the effectiveness of PPO in Long-CoT tasks.
在我们的美国邀请数学竞赛（AIME）实验中，这两种技术显著提升了基线 PPO 的表现，从 5.6 提升到 49.0，达到了比之前报道在[4]中更高的分数。此外，我们的消融研究表明，这两种技术对于在 AIME 中实现卓越表现至关重要，突显了我们提出的解决方案在增强 PPO 在长序列任务中的有效性方面的重要性。

2 Preliminaries 2 前言

This section presents the fundamental concepts and notations that serve as the basis for our proposed algorithm. We first explore the basic framework of representing language generation as a reinforcement learning task. Subsequently, we introduce Proximal Policy Optimization and Generalized Advantage Estimation.
本节介绍了作为我们提出算法基础的基本概念和符号。我们首先探讨了将语言生成表示为强化学习任务的基本框架。随后，我们介绍了近端策略优化和广义优势估计。

2.1 Modeling Language Generation as Token-Level MDP
2.1 将语言生成建模为标记级马尔可夫决策过程

Reinforcement Learning (RL) centers around the learning of a policy that maximizes the cumulative reward for an agent as it interacts with an environment. In this study, we cast language generation tasks within the framework of a Markov Decision Process (MDP) [12].
强化学习（RL）的核心是学习一个策略，该策略在智能体与环境交互时最大化累积奖励。在本研究中，我们将语言生成任务纳入马尔可夫决策过程（MDP）[12]的框架内。

Let the prompt be denoted as $x$ , and the response to this prompt as $y$ . Both $x$ and $y$ can be decomposed into sequences of tokens. For example, the prompt $x$ can be expressed as $x=(x_{0},\dots,x_{m})$ , where the tokens are drawn from a fixed discrete vocabulary $\mathcal{A}$ .
提示符表示为 $x$ ，对此提示符的响应表示为 $y$ 。 $x$ 和 $y$ 都可以分解为标记序列。例如，提示符 $x$ 可以表示为 $x=(x_{0},\dots,x_{m})$ ，其中标记来自一个固定的离散词汇表 $\mathcal{A}$ 。

We define the token-level MDP as the tuple $\mathcal{M}=(\mathcal{S},\mathcal{A},\mathbb{P},r,d_{0},\omega)$ . Here is a detailed breakdown of each component:
我们将标记级 MDP 定义为元组 $\mathcal{M}=(\mathcal{S},\mathcal{A},\mathbb{P},r,d_{0},\omega)$ 。以下是每个组件的详细分解：

•

State Space ( $\mathcal{S}$ ): This space encompasses all possible states formed by the tokens generated up to a given time step. At time step $t$ , the state $s_{t}$ is defined as $s_{t}=(x_{0},\dots,x_{m},y_{0},\dots,y_{t})$ .

状态空间（ $\mathcal{S}$ ）：该空间包含在给定时间步之前由标记生成的所有可能状态。在时间步 $t$ ，状态 $s_{t}$ 被定义为 $s_{t}=(x_{0},\dots,x_{m},y_{0},\dots,y_{t})$ 。
•

Action Space ( $\mathcal{A}$ ): It corresponds to the fixed discrete vocabulary, from which tokens are selected during the generation process.

动作空间（ $\mathcal{A}$ ）：对应于固定的离散词汇表，在生成过程中从中选择标记。
•

Dynamics ( $\mathbb{P}$ ): These represent a deterministic transition model between tokens. Given a state $s_{t}=(x_{0},\dots,x_{m},y_{0},\dots,y_{t})$ , an action $a=y_{t+1}$ , and the subsequent state $s_{t+1}=(x_{0},\dots,x_{m},y_{0},\dots,y_{t},y_{t+1})$ , the probability $\mathbb{P}(s_{t+1}|s_{t},a)=1$ .

• 动力学（ $\mathbb{P}$ ）：这代表了一种在标记之间的确定性转换模型。给定一个状态 $s_{t}=(x_{0},\dots,x_{m},y_{0},\dots,y_{t})$ ，一个动作 $a=y_{t+1}$ ，以及后续状态 $s_{t+1}=(x_{0},\dots,x_{m},y_{0},\dots,y_{t},y_{t+1})$ ，概率 $\mathbb{P}(s_{t+1}|s_{t},a)=1$ 。
•

Termination Condition: The language generation process concludes when the terminal action $\omega$ , typically the end-of-sentence token, is executed.

终止条件：当执行终端动作 $\omega$ ，通常是句末标记时，语言生成过程结束。
•

Reward Function ( $r(s,a)$ ): This function offers scalar feedback to evaluate the agent’s performance after taking action $a$ in state $s$ . In the context of Reinforcement Learning from Human Feedback (RLHF), the reward function can be learned from human preferences or defined by a set of rules specific to the task.

奖励函数（ $r(s,a)$ ）：该函数在动作 $a$ 后，在状态 $s$ 下提供标量反馈以评估智能体的性能。在人类反馈强化学习（RLHF）的背景下，奖励函数可以从人类偏好中学习或由特定于任务的规则集定义。
•

Initial State Distribution ( $d_{0}$ ): It is a probability distribution over prompts $x$ . An initial state $s_{0}$ consists of the tokens within the prompt $x$ .

初始状态分布（ $d_{0}$ ）：它是关于提示词 $x$ 的概率分布。一个初始状态 $s_{0}$ 由提示词 $x$ 中的标记组成。

2.2 RLHF Learning Objective
2.2 RLHF 学习目标

We formulate the optimization problem as a KL-regularized RL task. Our objective is to approximate the optimal KL-regularized policy, which is given by:
我们将优化问题表述为 KL 正则化的强化学习任务。我们的目标是逼近最优的 KL 正则化策略，其表达式为：

\displaystyle\pi^{*}=\arg\max_{\pi}\mathbb{E}_{\pi,s_{0}\sim d_{0}}\left[\sum_% {h=0}^{H}\left(r(s_{h},a_{h})-\beta\text{KL}\big{(}\pi(\cdot|s_{h})\|\pi_{% \text{ref}}(\cdot|s_{h})\big{)}\right)\right]

(1)

In this equation, $H$ represents the total number of decision steps, $s_{0}$ is a prompt sampled from the dataset, $r(s_{h},a_{h})$ is the token-level reward obtained from the reward function, $\beta$ is a coefficient that controls the strength of the KL-regularization, and $\pi_{\text{ref}}$ is the initialization policy.
在这个方程中， $H$ 代表决策步骤总数， $s_{0}$ 是从数据集中采样的提示， $r(s_{h},a_{h})$ 是从奖励函数获得的 token 级奖励， $\beta$ 是控制 KL 正则化强度的系数， $\pi_{\text{ref}}$ 是初始化策略。

In traditional RLHF and most tasks related to Large Language Models (LLMs), the reward is sparse and is only assigned at the terminal action $\omega$ , that is, the end-of-sentence token <eos>.
在传统的 RLHF 以及与大型语言模型(LLMs)相关的多数任务中，奖励是稀疏的，并且仅在终端动作( $\omega$ )时分配，即句尾标记。

2.3 Proximal Policy Optimization
2.3 近端策略优化

PPO [16] uses a clipped surrogate objective to update the policy. The key idea is to limit the change in the policy during each update step, preventing large policy updates that could lead to instability.
PPO [16] 使用剪枝代理目标来更新策略。关键思想是在每个更新步骤中限制策略的变化，防止可能导致不稳定的较大策略更新。

Let $\pi_{\theta}(a|s)$ be the policy parameterized by $\theta$ , and $\pi_{\theta_{\text{old}}}(a|s)$ be the old policy from the previous iteration. The surrogate objective function for PPO is defined as:
令 $\pi_{\theta}(a|s)$ 为由 $\theta$ 参数化的策略， $\pi_{\theta_{\text{old}}}(a|s)$ 为前一次迭代的旧策略。PPO 的代理目标函数定义为：

\mathcal{L}^{CLIP}(\theta)=\hat{\mathbb{E}}_{t}\left[\min\left(r_{t}(\theta)% \hat{A}_{t},\text{clip}(r_{t}(\theta),1-\epsilon,1+\epsilon)\hat{A}_{t}\right)\right]

(2)

where $r_{t}(\theta)=\frac{\pi_{\theta}(a_{t}|s_{t})}{\pi_{\theta_{\text{old}}}(a_{t}% |s_{t})}$ is the probability ratio, $\hat{A}_{t}$ is the estimated advantage at time step $t$ , and $\epsilon$ is a hyperparameter that controls the clipping range.
$r_{t}(\theta)=\frac{\pi_{\theta}(a_{t}|s_{t})}{\pi_{\theta_{\text{old}}}(a_{t}% |s_{t})}$ 是概率比， $\hat{A}_{t}$ 是在时间步 $t$ 估计的优势， $\epsilon$ 是控制裁剪范围的超参数。

Generalized Advantage Estimation (GAE) [15] is a technique used to estimate the advantage function more accurately in PPO. It combines multiple-step bootstrapping to reduce the variance of the advantage estimates. For a trajectory of length $T$ , the advantage estimate $\hat{A}_{t}$ at time step $t$ is computed as:
广义优势估计（GAE）[15]是一种用于在 PPO 中更精确估计优势函数的技术。它结合多步自举来减少优势估计的方差。对于长度为 $T$ 的轨迹，时间步 $t$ 的优势估计 $\hat{A}_{t}$ 计算如下：

\hat{A}_{t}=\sum_{l=0}^{T-t-1}(\gamma\lambda)^{l}\delta_{t+l}

(3)

where $\gamma$ is the discount factor, $\lambda\in[0,1]$ is the GAE parameter, and $\delta_{t}=r_{t}+\gamma V(s_{t+1})-V(s_{t})$ is the temporal-difference (TD) error. Here, $r_{t}$ is the reward at time step $t$ , and $V(s)$ is the value function. Since it is a common practice to use discount factor $\gamma=1.0$ in RLHF, to simplify our notation, we omit $\gamma$ in later sections of this paper.
$\gamma$ 是折扣因子， $\lambda\in[0,1]$ 是 GAE 参数， $\delta_{t}=r_{t}+\gamma V(s_{t+1})-V(s_{t})$ 是时序差分（TD）误差。在这里， $r_{t}$ 是时间步 $t$ 的奖励， $V(s)$ 是价值函数。由于在强化学习与人类反馈（RLHF）中常用折扣因子 $\gamma=1.0$ ，为了简化符号，本文后续部分省略 $\gamma$ 。

3 Identifying and Addressing PPO’s Failure Modes in Long CoT Tasks
识别和解决长 CoT 任务中 PPO 的失效模式

In this section, we show a common failure mode of PPO in long CoT tasks and examine its relationship with the pre-training and in-training value biases from both theoretical and empirical perspectives. Subsequently, we propose practical solutions to enhance PPO and enable it to avoid such failures.
在这一节中，我们展示了长 CoT 任务中 PPO 的常见故障模式，并从理论和实证角度探讨了其与预训练和训练过程中价值偏差的关系。随后，我们提出了增强 PPO 并使其避免此类故障的实用解决方案。

3.1 Failure Modes of PPO in Long CoT Tasks
3.1 长 CoT 任务中 PPO 的故障模式

Two common practices when applying PPO in the domain of Reinforcement Learning from Human Feedback (RLHF) are as follows [22, 7]:
两种在人类反馈强化学习（RLHF）领域应用 PPO 的常见做法如下[22, 7]：

•

Employ the default Generalized Advantage Estimation (GAE), typically with $\lambda=0.95$ .

• 使用默认的广义优势估计（GAE），通常为 $\lambda=0.95$ 。
•

Initialize the value model using a well-trained reward model.

初始化价值模型，使用经过良好训练的奖励模型。

The first practice finds its origin in the traditional RL literature, where PPO has been extensively tested in environments like Mujoco [19] and Atari [2]. In these environments, the rewards accumulate over the trajectory, resulting in high-variance return. As a consequence, variance reduction becomes a necessity. The second practice emerges naturally from the apparent similarity between a reward model and a value model, since both models are trained to predict scalar information about the response. However, our experiments have revealed that naively applying PPO to tasks that require long CoT inevitably leads to failure, as shown in Figure 1.
第一次实践源于传统的强化学习文献，其中 PPO 在 Mujoco[19]和 Atari[2]等环境中得到了广泛测试。在这些环境中，奖励在轨迹上累积，导致高方差回报。因此，方差减少成为必要。第二次实践自然地从奖励模型和价值模型之间的明显相似性中产生，因为这两个模型都是训练来预测关于响应的标量信息。然而，我们的实验表明，天真地将 PPO 应用于需要长 CoT 的任务必然会导致失败，如图 1 所示。

Refer to caption — (a) Model output length （a）模型输出长度

Typically, the failure modes are that the validation performance degrades as the training starts, accompanied by a significant decrease in the model’s output length. Since it has been demonstrated that output length is strongly correlated with the model’s performance on complex reasoning tasks [10], the initial collapse in output length can be attributed as the root cause of this performance degradation.
通常，失败模式是随着训练开始，验证性能下降，伴随着模型输出长度的显著减少。由于已经证明输出长度与模型在复杂推理任务上的性能密切相关[10]，因此输出长度的初始崩溃可以归因为此性能下降的根本原因。

3.2 Addressing the Value Initialization Bias by Value Pretraining
3.2 通过值预训练解决值初始化偏差问题

In our tasks, a verifier serves as the source of the reward signal. It utilizes a rule-based answer parsing mechanism, which is unlikely to show a preference for output length. Consequently, the reduction in output length can only be ascribed to the policy optimization dynamics, which are mainly driven by the advantages assigned to each token. To further explore this, we plot the correlation between advantages and the position of tokens, as shown in Figure 2. This reveals a strong correlation between advantages and token position. The more preceding the tokens are, the more positively biased their advantages are. This causes the model to favor preceding tokens, ultimately leading to the observed collapse in output length.
在我们的任务中，验证器作为奖励信号的来源。它使用基于规则的答案解析机制，不太可能表现出对输出长度的偏好。因此，输出长度的减少只能归因于策略优化动态，这些动态主要是由分配给每个标记的优势驱动的。为了进一步探索这一点，我们绘制了优势和标记位置之间的相关性，如图 2 所示。这揭示了优势和标记位置之间存在着强烈的关联。标记越靠前，它们的优势就越偏向积极。这导致模型倾向于优先考虑前面的标记，最终导致观察到的输出长度减少。

The root cause of the positive bias lies in the objective mismatch between the reward model and the value model. The training objective of the reward model is to score the response at the <EOS> token. Since the tokens preceding <EOS> are not included in the training, the reward model tends to assign lower scores to tokens that are more preceding due to its increasing incompleteness. On the other hand, the aim of value prediction is to estimate the expected rewards of each token preceding <EOS> for a given policy. Given that tokens which are more preceding have lower scores and the KL-penalties are essentially zero at the beginning of training, there will be a positive bias at every timestep $t$ that accumulates along the trajectory, which is obvious through how advantages $\hat{A}_{t}$ is computed:
积极偏差的根本原因在于奖励模型和价值模型之间的客观不匹配。奖励模型的训练目标是给标记的响应评分。由于在训练中不包括之前的标记，因此奖励模型倾向于给更早出现的标记分配较低的分数，因为它的不完整性不断增加。另一方面，价值预测的目标是估计给定策略下每个之前标记的预期奖励。鉴于更早出现的标记分数较低，且在训练初期 KL 惩罚本质上为零，每个时间步 $t$ 都会出现正偏差，并沿着轨迹累积，这通过如何计算优势 $\hat{A}_{t}$ 是显而易见的：

\hat{A}_{t}=\sum_{l=0}^{T-t-1}\lambda^{l}(r_{t+l}+V(s_{t+l+1})-V(s_{t+l}))

(4)

This explains why tokens that are preceding tend to exhibit a greater positive bias in advantages. Due to this correlation between token position and advantages, the model tends to generate shorter responses, which prevents the model from generating a long chain of thoughts before finalizing an answer.
这解释了为什么位于前面的标记往往在优势方面表现出更大的积极偏差。由于标记位置和优势之间的这种相关性，模型倾向于生成较短的响应，这防止了模型在最终确定答案之前生成长链思维。

To alleviate such value initialization bias, we introduce Value-Pretraining. This approach involves offline training the value model until convergence under a pre-specified fixed policy. Once the value model has converged, it will be employed in all subsequent formal experiments. The specific steps are outlined as follows:
为了缓解此类值初始化偏差，我们引入了值预训练。这种方法涉及在预指定的固定策略下离线训练值模型，直到收敛。一旦值模型收敛，它将被用于所有后续的正式实验。具体步骤如下：

1.

Continuously generate responses by sampling from a fixed policy, for instance, $\pi_{\text{sft}}$ , and update the value model using GAE with $\lambda=1.0$ , also known as the Monte-Carlo return. This setting transforms the optimization problem into a stable gradient-descent optimization, ensuring more reliable and consistent updates to the value model.

1. 通过从固定策略中采样连续生成响应，例如 $\pi_{\text{sft}}$ ，并使用 GAE（ $\lambda=1.0$ ，也称为蒙特卡洛回报）更新价值模型。此设置将优化问题转化为稳定的梯度下降优化，确保对价值模型的更新更加可靠和一致。
2.

Train the value model until key training metrics, including value loss and explained variance [5], attain sufficiently low values. Monitoring these metrics is crucial as they reflect the quality and stability of the model’s learning process, and reaching low values indicates that the model is converging effectively.

2. 训练价值模型，直到关键训练指标，包括价值损失和解释方差[5]，达到足够低的值。监控这些指标至关重要，因为它们反映了模型学习过程的质量和稳定性，达到低值表明模型正在有效收敛。
3.

Save the value checkpoint upon the completion of training. Subsequently, load this checkpoint for following experiments. This step provides a more accurate initial point for value estimation, enabling the model to start from a well-calibrated state.

3. 在训练完成后保存值检查点。随后，为后续实验加载此检查点。这一步骤为价值估计提供了一个更准确的初始点，使模型能够从一个校准良好的状态开始。

3.3 Improving In-training Value Estimate with Decoupled-GAE
3.3 使用解耦-GAE 提高在训价值估计

Variance reduction is a critical topic in RL. The use of GAE with $\lambda=0.95$ is common in traditional RL tasks like Mujoco and Atari, where accumulated rewards have high variance and lead to slow convergence. In contrast, in RLHF, a reward model or rule-based scoring mechanism offers trajectory-level feedback which consists of non-accumulating and well-defined values.
方差减少是强化学习中的一个关键主题。在传统的强化学习任务中，如 Mujoco 和 Atari，使用 GAE 与 $\lambda=0.95$ 是常见的，因为累积奖励具有高方差，导致收敛速度慢。相比之下，在 RLHF 中，一个奖励模型或基于规则的评分机制提供轨迹级别的反馈，该反馈由非累积且定义良好的值组成。

Therefore, we question whether variance reduction is necessary in optimizing the value model in RLHF.
因此，我们质疑在强化学习与人类反馈（RLHF）中优化价值模型时是否需要方差减少。

Based on the GAE computation in Equation 4, we can rewrite the equation to obtain the value optimization target $V^{\text{target}}$ :
基于方程 4 中的 GAE 计算，我们可以将方程重写为获得值优化目标 $V^{\text{target}}$ ：

V^{\text{target}}(s_{t})=\left\{\begin{array}[]{ll}\sum_{l=0}^{T-t-1}\lambda^{% l}(r_{t+l}+V(s_{t+l+1})-V(s_{t+l}))+V(s_{t}),&\lambda<1.0\\ \sum_{l=0}^{T-t-1}r_{t+l},&\lambda=1.0\end{array}\right.

(5)

According to Equation 5, the reward assigned at the <EOS> token decays at a rate of $\lambda$ when propagating to the preceding tokens during GAE computation. The reward signal propagated to the $t$ -th token is $\lambda^{T-t}r_{\text{<EOS>}}$ . When $T-t$ is large, the resulting reward signal is essentially zero. With $\lambda=1.0$ , such reward signal decay would not occur, which makes it a desirable option for value optimization. Moreover, when $\lambda<1.0$ , value prediction is incorporated into the construction of the regression target. This approach belongs to semi-gradient descent methods, which tend to be unstable. Conversely, when $\lambda=1.0$ , the value is simply regressing to the accumulated rewards, resulting in a stable gradient-descent optimization.
根据方程 5，在 GAE 计算过程中，分配给标记的奖励在传播到前面的标记时以 $\lambda$ 的速度衰减。传播到 $t$ 标记的奖励信号是 $\lambda^{T-t}r_{\text{<EOS>}}$ 。当 $T-t$ 很大时，产生的奖励信号基本上为零。使用 $\lambda=1.0$ ，这种奖励信号衰减不会发生，这使得它成为价值优化的理想选择。此外，当 $\lambda<1.0$ 时，将价值预测纳入回归目标的构建。这种方法属于半梯度下降法，通常是不稳定的。相反，当 $\lambda=1.0$ 时，价值简单地回归到累积奖励，从而实现稳定的梯度下降优化。

In Figure 3, we show that with $\lambda<1.0$ , the reward signal rapidly decays during propagation and preceding tokens are unable to receive the signal from the reward model. This phenomenon is exacerbated in tasks that require long CoT because the trajectory lengths are substantially longer. Therefore, optimizing the value in an unbiased manner outweighs learning it in a variance-reduced way because of the trajectory-level reward signal in RLHF. A similar argument is also proposed in [1].
在图 3 中，我们展示了在 $\lambda<1.0$ 的情况下，奖励信号在传播过程中迅速衰减，先前的标记无法接收到奖励模型的信号。这种现象在需要长 CoT 的任务中加剧，因为轨迹长度显著更长。因此，由于 RLHF 中的轨迹级奖励信号，以无偏方式优化价值比以方差减少的方式学习它更为重要。类似的论点也在[1]中提出。

However, variance reduction might still be necessary in policy optimization.
然而，在策略优化中可能仍然需要减少方差。

Training large language models consumes a vast amount of computational resources. Under the constraints of time and computing power, achieving faster convergence during training is highly desirable. In PPO, the $\lambda$ parameter in GAE plays a crucial role in the bias-variance trade-off during policy updates. The variance of policy update can be analyzed in terms of the variances of the TD errors. Let $\text{Var}[\delta_{t}]$ denote the variance of the TD error at time step $t$ . The variance of $A_{t}^{\lambda}$ can be roughly computed as:
训练大型语言模型消耗大量计算资源。在时间和计算能力的限制下，在训练过程中实现更快收敛是非常理想的。在 PPO 中，GAE 中的 $\lambda$ 参数在策略更新过程中的偏差-方差权衡中起着关键作用。策略更新的方差可以通过 TD 误差的方差来分析。令 $\text{Var}[\delta_{t}]$ 表示时间步 $t$ 的 TD 误差的方差。 $A_{t}^{\lambda}$ 的方差可以大致计算如下：

\begin{split}\text{Var}[A_{t}^{\lambda}]&=\text{Var}\left[\sum_{l=0}^{T-t-1}% \lambda^{l}\delta_{t+l}\right]\\ &=\sum_{l=0}^{T-t-1}\lambda^{2l}\text{Var}[\delta_{t+l}]+2\sum_{i=0}^{T-t-1}% \sum_{j=i+1}^{T-t-1}\lambda^{i+j}\text{Cov}[\delta_{t+i},\delta_{t+j}],\end{split}

(6)

where $\text{Cov}[\delta_{t+i},\delta_{t+j}]$ is the covariance between the TD errors at time steps $t+i$ and $t+j$ . Since $\lambda\in[0,1]$ , as $\lambda$ decreases, the weights $\lambda^{l}$ for the later TD errors decrease more rapidly. This means that the contribution of the more variable and less reliable later TD errors to the overall advantage estimate is reduced, thereby reducing the variance of the advantage estimate.
$\text{Cov}[\delta_{t+i},\delta_{t+j}]$ 是时间步 $t+i$ 和 $t+j$ 之间 TD 错误的协方差。由于 $\lambda\in[0,1]$ ，随着 $\lambda$ 的减小，后续 TD 错误的权重 $\lambda^{l}$ 减少得更快。这意味着更易变且可靠性较低的后续 TD 错误对整体优势估计的贡献减少，从而降低了优势估计的方差。

Nevertheless, adjusting this $\lambda$ can have an additional impact on value optimization. To address this issue, we introduce Decoupled-GAE. This approach allows the policy to adopt a different $\lambda$ value from that of the value function. By doing so, the policy can better balance its own bias-variance trade-off, thereby enhancing the training efficiency.
尽管如此，调整此 $\lambda$ 可能会对价值优化产生额外影响。为了解决这个问题，我们引入了解耦-GAE。这种方法允许策略采用与价值函数不同的 $\lambda$ 值。通过这样做，策略可以更好地平衡其自身的偏差-方差权衡，从而提高训练效率。

Next, we show that using a value function obtained with a different $\lambda$ from the policy is mathematically justifiable without introducing additional bias. Let $\bar{V}$ represent the value estimate obtained with a potentially different $\lambda$ , and define the $n$ -step return with $\bar{V}$ as $G$ :
接下来，我们表明使用与策略不同的 $\lambda$ 获得的价值函数在数学上是合理的，而不会引入额外的偏差。用 $\bar{V}$ 表示可能不同的 $\lambda$ 获得的价值估计，并定义以 $\bar{V}$ 为 $G$ 的 $n$ 步回报：

G_{t:t+h}=\left\{\begin{array}[]{ll}\sum_{l=0}^{h-1}r_{t+l}+\bar{V}(s_{t+h}),&% t+h<T\\ \sum_{l=0}^{T-h}r_{t+l},&t+h=T\end{array}\right.

(7)

Then, the policy gradient with an arbitrary $\lambda$ can be rewritten as follows:
然后，可以将具有任意 $\lambda$ 的政策梯度重写如下：

\begin{split}\mathbb{E}_{t}\left[\nabla_{\theta}\log\pi_{\theta}(a_{t}|s_{t})A% _{t}\right]&=\mathbb{E}_{t}\left[\nabla_{\theta}\log\pi_{\theta}(a_{t}|s_{t})% \sum_{l=0}^{T-t-1}\lambda^{l}(r_{t+l}+\bar{V}(s_{t+l+1})-\bar{V}(s_{t+l}))% \right]\\ &=\mathbb{E}_{t}\left[\nabla_{\theta}\log\pi_{\theta}(a_{t}|s_{t})\left((1-% \lambda)\sum_{l=1}^{T-t-1}\lambda^{l-1}G_{t:t+l}+\lambda^{T-t-1}G_{t:T}-\bar{V% }(s_{t})\right)\right]\\ &=\mathbb{E}_{t}\left[\nabla_{\theta}\log\pi_{\theta}(a_{t}|s_{t})\left((1-% \lambda)\sum_{l=1}^{T-t-1}\lambda^{l-1}G_{t:t+l}+\lambda^{T-t-1}G_{t:T}\right)% \right]\\ \end{split}

(8)

Based on Equation 8, plugging in an arbitrary value function does not introduce additional bias to the policy gradient. Given the substantial time and computational resources required for large language models, it is desirable to use a smaller $\lambda$ to expedite the convergence of the policy. A potential configuration could be $\lambda_{\text{policy}}=0.95$ and $\lambda_{\text{value}}=1.0$ .
基于方程 8，将任意值函数代入不会引入额外的偏差到策略梯度。鉴于大型语言模型所需的大量时间和计算资源，使用较小的 $\lambda$ 以加速策略的收敛是可取的。一种可能的配置是 $\lambda_{\text{policy}}=0.95$ 和 $\lambda_{\text{value}}=1.0$ 。

By combining Value-Pretraining and Decoupled-GAE, we propose Value-Calibrated Proximal Policy Optimization (VC-PPO) as shown in Alg 1, which is a simple yet effective approach to enhance PPO’s performance in long CoT tasks. The main differences between VC-PPO and baseline PPO are highlighted.
通过结合价值预训练和解耦 GAE，我们提出了价值校准近端策略优化（VC-PPO），如算法 1 所示，这是一种简单而有效的方法，用于增强长 CoT 任务中 PPO 的性能。VC-PPO 与基线 PPO 之间的主要区别被突出显示。

Algorithm 1 Value-Calibrated Proximal Policy Optimization (VC-PPO)
算法 1 值校准近端策略优化（VC-PPO）

1:Input: Initial policy

\pi_{\theta}

, Pretrained value function

V_{\phi}

, number of epochs

E

, number of mini-batches

M

, learning rate

\alpha_{\theta}

, learning rate

\alpha_{\phi}

, clip parameter

\epsilon

, actor lambda

\lambda_{\text{actor}}

, critic lambda

\lambda_{\text{critic}}

输入：初始策略

\pi_{\theta}

，预训练值函数

V_{\phi}

，训练轮数

E

，小批量数量

M

，学习率

\alpha_{\theta}

，学习率

\alpha_{\phi}

，裁剪参数

\epsilon

，演员 lambda

\lambda_{\text{actor}}

，评论家 lambda

\lambda_{\text{critic}}

2:Output: Optimized policy

\pi_{\theta}

, optimized value function

V_{\phi}

输出：优化策略

\pi_{\theta}

，优化值函数

V_{\phi}

3:for

e=1

E

do
for

e=1

至

E

循环

4: Collect a set of trajectories

\tau=\{(s_{t},a_{t},r_{t})\}_{t=0}^{T-1}

using the current policy

\pi_{\theta}

收集一组使用当前策略

\pi_{\theta}

的轨迹

\tau=\{(s_{t},a_{t},r_{t})\}_{t=0}^{T-1}

5: Compute the advantages

A

with

\lambda_{\text{actor}}

计算优势

A

与

\lambda_{\text{actor}}

6: Compute the value targets

R

with

\lambda_{\text{critic}}

计算值目标

R

与

\lambda_{\text{critic}}

7: Split the collected data into

M

mini-batches
将收集到的数据分为

M

个迷你批次

8: for

m=1

M

do
for

m=1

到

M

循环

9: Sample a mini-batch of data

(s,a,A,R)

from the collected data
从收集的数据中采样一个迷你批次数据

(s,a,A,R)

10: Compute the probability ratio

r(\theta)=\frac{\pi_{\theta}(a|s)}{\pi_{\theta_{\text{old}}}(a|s)}

计算概率比

r(\theta)=\frac{\pi_{\theta}(a|s)}{\pi_{\theta_{\text{old}}}(a|s)}

11: Compute the clipped surrogate objective

\mathcal{L}^{CLIP}(\theta)=\min\left(r(\theta)A,\text{clip}(r(\theta),1-% \epsilon,1+\epsilon)A\right)

计算裁剪的代理目标函数

\mathcal{L}^{CLIP}(\theta)=\min\left(r(\theta)A,\text{clip}(r(\theta),1-% \epsilon,1+\epsilon)A\right)

12: Compute the value function loss

\mathcal{L}^{VF}(\phi)=\frac{1}{2}(V_{\phi}(s)-R)^{2}

计算值函数损失

\mathcal{L}^{VF}(\phi)=\frac{1}{2}(V_{\phi}(s)-R)^{2}

13: Maximizing

\mathcal{L}^{CLIP}(\theta)

using:

\theta\leftarrow\theta+\alpha_{\theta}\nabla_{\theta}\mathcal{L}^{CLIP}(\theta)

最大化使用：

\theta\leftarrow\theta+\alpha_{\theta}\nabla_{\theta}\mathcal{L}^{CLIP}(\theta)

14: Minimizing

\mathcal{L}^{VF}(\phi)

using:

\phi\leftarrow\phi-\alpha_{\phi}\nabla_{\phi}\mathcal{L}^{VF}(\phi)

最小化

\mathcal{L}^{VF}(\phi)

使用：

\phi\leftarrow\phi-\alpha_{\phi}\nabla_{\phi}\mathcal{L}^{VF}(\phi)

15: end for 15: for 结束

16:end for 16：结束循环

17:return

\pi_{\theta}

V_{\phi}

17:返回

\pi_{\theta}

，

V_{\phi}

4 Experiments 4 实验

4.1 Setup 4.1 设置

Datasets. To comprehensively demonstrate the effectiveness of our proposed algorithm, we conduct experiments on American Invitational Mathematics Examination (AIME) problems, which typically demand a long chain of thought for solution. The test set consists of AIME questions from the most recent two years. Conversely, the training set is composed of questions from all past AIME competitions, supplemented with some artificially constructed difficult mathematical problems. To evaluate the model’s generalizability, we simultaneously monitor its performance in typical long CoT scenarios, such as General-Purpose Question-Answering (GPQA) [14] and Codeforces.
数据集。为了全面展示我们提出算法的有效性，我们在美国邀请赛数学考试（AIME）问题上进行实验，这些问题通常需要长时间的思考来解决。测试集包括最近两年的 AIME 问题。相反，训练集由所有过去的 AIME 竞赛问题组成，并补充了一些人工构造的困难数学问题。为了评估模型的可推广性，我们同时监控其在典型长 CoT 场景中的性能，例如通用问答（GPQA）[14]和 Codeforces。

Cold Start. This phase aims to enhance the model’s reasoning capabilities within a specific reasoning format. We used dozens of samples with a format that requires the model to place its thinking process between <think> and </think> tags before presenting the final answer. These samples were used to fine-tune the Qwen2.5 32B model [13], which we employ in our experiments for better reproducibility.
冷启动。此阶段旨在增强模型在特定推理格式下的推理能力。我们使用了数十个样本，这些样本要求模型在呈现最终答案之前，将其思考过程放置在和标签之间。这些样本用于微调 Qwen2.5 32B 模型[13]，我们在实验中使用该模型以提高可重复性。

Reward Modeling. We adopt the methodology commonly used in classical reasoning tasks across domains such as mathematics, code, and logical reasoning. This approach utilizes rule-based rewards to guide the learning process. When assigning the reward score, the verifier ignores the thinking part enclosed by the <think> tokens and extracts only the answer part for evaluation. Correct answers are assigned a score of $1.0$ , while incorrect answers receive a score of $-1.0$ .
奖励建模。我们采用在数学、代码和逻辑推理等领域的经典推理任务中常用的方法论。这种方法利用基于规则的奖励来指导学习过程。在分配奖励分数时，验证者忽略由标记包围的思考部分，仅提取答案部分进行评估。正确答案被分配分数 $1.0$ ，而错误答案则获得分数 $-1.0$ 。

RL Baseline. In our experiments, we use verl [18] as our experimental framework. The Proximal Policy Optimization (PPO) algorithm described in [12] serves as the baseline, with $\lambda$ set to 0.95 by default. The learning rates for the policy model and the value model are set to $1\times 10^{-6}$ and $2\times 10^{-6}$ , respectively. The KL penalty coefficient is set to 0 because the rule-based reward cannot be hacked in the same way as that of a general reward model. We adopt different context length settings of 8k and 16k for different purposes: the 16k setting is used for comparison with state-of-the-art results, and the 8k setting is used for ablation studies.
RL 基线。在我们的实验中，我们使用 verl [18] 作为我们的实验框架。在[12]中描述的近端策略优化（PPO）算法作为基线，默认将 $\lambda$ 设置为 0.95。策略模型和价值模型的学习率分别设置为 $1\times 10^{-6}$ 和 $2\times 10^{-6}$ 。由于基于规则的奖励不能像通用奖励模型那样被破解，因此将 KL 惩罚系数设置为 0。我们根据不同目的采用不同的上下文长度设置，8k 和 16k：16k 设置用于与最先进的结果进行比较，8k 设置用于消融研究。

Value-Pretraining. We freeze the policy model and set the Generalized Advantage Estimation (GAE) $\lambda$ to 1.0 to obtain an unbiased return. The other hyperparameters are the same as those of the baseline Proximal Policy Optimization (PPO). By saving the value model at different steps of value pretraining, we can acquire multiple initial value checkpoints for RL training. We also conduct ablation studies on these checkpoints in our experiments.
值预训练。我们冻结策略模型并将广义优势估计（GAE） $\lambda$ 设置为 1.0 以获得无偏回报。其他超参数与基线近端策略优化（PPO）相同。通过在不同步骤保存价值预训练的价值模型，我们可以获得多个 RL 训练的初始价值检查点。我们还在实验中对这些检查点进行了消融研究。

Decoupled-GAE. Due to the value oscillation and reward signal decay described in Section 3.3, $\lambda_{\text{critic}}$ is set to 1.0. Meanwhile, $\lambda_{\text{actor}}$ used in the policy is maintained at 0.95 to enable a fair comparison with the baseline PPO. Subsequently, we assign values to $\lambda_{\text{actor}}$ ranging from 0.9 to 1.0 to investigate its impact on the convergence of the policy, while $\lambda_{\text{critic}}$ remains at 1.0.
解耦-GAE。由于第 3.3 节中描述的价值振荡和奖励信号衰减， $\lambda_{\text{critic}}$ 被设置为 1.0。同时，用于策略中的 $\lambda_{\text{actor}}$ 保持为 0.95，以便与基线 PPO 进行公平比较。随后，我们将 $\lambda_{\text{actor}}$ 的值从 0.9 到 1.0 进行赋值，以研究其对策略收敛的影响，而 $\lambda_{\text{critic}}$ 保持为 1.0。

4.2 Experimental Results 4.2 实验结果

We conduct RL training on the Qwen-32B-Base model using the proposed Value-Calibrated Proximal Policy Optimization (VC-PPO) algorithm. We then compare our model with the well-established Generalized Proximal Policy Optimization (GRPO) algorithm, which is employed in the DeepSeek-R1 model [4]. This experiment utilizes a 16k context length to attain the state-of-the-art performance.
我们在 Qwen-32B-Base 模型上使用提出的价值校准近端策略优化（VC-PPO）算法进行 RL 训练。然后，我们将我们的模型与在 DeepSeek-R1 模型[4]中使用的成熟通用的近端策略优化（GRPO）算法进行比较。该实验使用 16k 的上下文长度以达到最先进的性能。

The results are presented in Table 1. The proposed VC-PPO algorithm significantly outperforms GRPO under the same experimental setting. Our primary objective is to optimize the model’s performance on the American Invitational Mathematics Examination (AIME), which consists of Olympiad-level math problems. Consequently, the majority of the training data is math-related, and VC-PPO demonstrates the most substantial advantage on the AIME dataset.
结果展示在表 1 中。所提出的 VC-PPO 算法在相同的实验设置下显著优于 GRPO。我们的主要目标是优化模型在美国邀请赛数学考试（AIME）上的性能，该考试包含奥林匹克级别的数学问题。因此，大部分训练数据与数学相关，VC-PPO 在 AIME 数据集上展现出最显著的优势。

	AIME 2024		GPQA	CodeForces
Model	pass@1	pass@32	pass@1	pass@1
GRPO	38.9	70.0	49.4	12.6
VC-PPO	48.8	73.3	48.8	12.8

Table 1: Comparison between VC-PPO and GRPO in 16K context length.
表 1：16K 上下文长度下 VC-PPO 与 GRPO 的比较

To the best of our knowledge, a pass@1 score of 48.8 on the AIME dataset stands as the highest performance attained by a Qwen-32B-Base model without employing distillation techniques. This score surpasses the AIME score of 47.0 reported in the DeepSeek-R1 technical report [4] under comparable experimental settings¹¹1It should be noted that a direct comparison between these two results is not entirely feasible because the dataset used for RL training in DeepSeek-R1 has not been made publicly available.. The increasing pass rate of the AIME dataset during the training process is illustrated in Figure 4. Additionally, we have deployed the VC-PPO algorithm in our internal model, which has achieved an AIME score of 74.
据我们所知，在 AIME 数据集上，Qwen-32B-Base 模型在不使用蒸馏技术的情况下，取得了 48.8 的 pass@1 分数，这是该模型所达到的最高性能。这个分数超过了在 DeepSeek-R1 技术报告中报道的、在可比实验设置下取得的 47.0 的 AIME 分数[4]。图 4 展示了 AIME 数据集在训练过程中的通过率不断提高。此外，我们还在内部模型中部署了 VC-PPO 算法，该算法实现了 74 的 AIME 分数。

For ablation studies, we use an 8k context length to enhance training efficiency.
对于消融研究，我们使用 8k 的上下文长度以提高训练效率。

In Table 2, we showcase the ablation results of Value-Pretraining and Decoupled-GAE. When directly applying the Proximal Policy Optimization (PPO) algorithm, it fails to improve the performance of the pre-trained model. This is because the output length of the model collapses. In contrast, the proposed Value-Calibrated Proximal Policy Optimization (VC-PPO) algorithm demonstrates a significant performance boost, highlighting its superiority in handling tasks that demand a long CoT.
在表 2 中，我们展示了 Value-Pretraining 和 Decoupled-GAE 的消融结果。当直接应用近端策略优化（PPO）算法时，它未能提高预训练模型的表现。这是因为模型输出的长度崩溃了。相比之下，提出的价值校准近端策略优化（VC-PPO）算法表现出显著的性能提升，突显了其在处理需要长 CoT 的任务中的优越性。

Moreover, when we conduct ablation experiments by removing either the Value-Pretraining or the Decoupled-GAE component from VC-PPO, there is a notable drop in performance. This decline emphasizes the crucial roles that both Value-Pretraining and Decoupled-GAE play in the effectiveness of our proposed VC-PPO algorithm.
此外，当我们通过移除 VC-PPO 中的 Value-Pretraining 或 Decoupled-GAE 组件进行消融实验时，性能明显下降。这种下降强调了 Value-Pretraining 和 Decoupled-GAE 在所提出的 VC-PPO 算法有效性中的关键作用。

	AIME		GPQA	CodeForces
Alg. 算法	pass@1	pass@32	pass@1	pass@1
GRPO	35.8	63.0	-	-
PPO	5.6	36.7	38.7	7.3
VC-PPO w/o Decoupled-GAE VC-PPO 无解耦-GAE	29.4	66.7	46.9	9.9
VC-PPO	41.9	76.6	48.6	11.4

Table 2: Ablation study on VC-PPO’s components.
表 2：VC-PPO 组件消融研究。

In Figure 3, we compare the model performance at the same step in this ablation experiment, specifically after 100 steps of training. The decision to conduct this comparison is driven by the evident divergence in performance trends. The optimal configuration involves pretraining the value model for 100 steps. This is because additional training beyond this point might induce overfitting, which could negatively impact the model’s generalization ability.
图 3 中，我们比较了该消融实验中同一步骤下的模型性能，具体是在训练 100 步之后。进行这种比较的决定是由性能趋势的明显分歧所驱动的。最佳配置涉及在 100 步前对价值模型进行预训练。这是因为超过这个点进行额外的训练可能会引起过拟合，这可能会对模型的泛化能力产生负面影响。

	AIME		GPQA	CodeForces
Value Pretraining Steps. 值预训练步骤。	pass@1	pass@32	pass@1	pass@1
50	20.6	56.7	43.9	7.7
100	30.1	63.3	48.5	8.6
150	25.1	63.6	48.4	8.9

Table 3: Ablation study on Value-Pretraining steps.
表 3：值预训练步骤消融研究

The results of the ablation study on $\lambda_{\text{actor}}$ are presented in Table 4. It should be highlighted that all experimental groups with $\lambda_{\text{actor}}<1.0$ significantly outperform the group with $\lambda_{\text{actor}}=1.0$ . This outcome supports our analysis in Section 3.3. In the case of the American Invitational Mathematics Examination (AIME), the experimental group with $\lambda_{\text{actor}}=0.99$ outperforms the other groups with lower $\lambda_{\text{actor}}$ values. However, there is only a slight decrease in performance within a certain range between $0.95$ and $1.0$ . Therefore, the recommended setting for $\lambda_{\text{actor}}$ is $\lambda_{\text{actor}}\in[0.95,1.0)$ .
消融研究的结果在表 4 中呈现。应强调，所有包含 $\lambda_{\text{actor}}<1.0$ 的实验组在性能上显著优于包含 $\lambda_{\text{actor}}=1.0$ 的组。这一结果支持了我们在第 3.3 节的分析。在美國邀請賽數學考試（AIME）的情况下，包含 $\lambda_{\text{actor}}=0.99$ 的实验组在性能上优于其他具有较低 $\lambda_{\text{actor}}$ 值的组。然而，在 $0.95$ 和 $1.0$ 之间的一定范围内，性能仅有轻微下降。因此，推荐的 $\lambda_{\text{actor}}$ 设置是 $\lambda_{\text{actor}}\in[0.95,1.0)$ 。

	AIME		GPQA	CodeForces
$\lambda_{\text{actor}}$	pass@1	pass@32	pass@1	pass@1
0.9	34.3	70.0	48.1	11.7
0.95	41.3	73.3	48.7	12.8
0.99	41.9	76.7	48.3	11.4
1.0	29.4	66.7	46.9	8.4

Table 4: Ablation study on different

\lambda_{\text{actor}}

values.
表 4：不同

\lambda_{\text{actor}}

值的消融研究。

4.3 Discussion 4.3 讨论

A smooth initial state for training is crucial in RLHF, especially in long CoT tasks.
训练 RLHF 中，一个平滑的初始状态至关重要，尤其是在长 CoT 任务中。

In traditional RL, both the value function and the policy are typically initialized randomly. However, in RLHF, the initial policy is usually initialized from the supervised fine-tuning (SFT) policy. This SFT policy acts as a strong prior for the learning process. In long CoT tasks, the initial policy is further enhanced with the CoT pattern, offering an even stronger prior.
在传统的强化学习中，值函数和策略通常随机初始化。然而，在 RLHF 中，初始策略通常从监督微调（SFT）策略初始化。这种 SFT 策略作为学习过程的一个强大先验。在长 CoT 任务中，初始策略通过 CoT 模式进一步增强，提供了一个更强大的先验。

Our empirical observations suggest that as the prior policy becomes stronger, it is increasingly essential to align the value model with the policy. Otherwise, the painstakingly constructed CoT pattern can easily be disrupted, as demonstrated in Figure 1. In our experiment, after applying the value-pretraining technique, which effectively aligns the value model with the initial policy, the collapse in output length is no longer observed. This result clearly highlights the significance of having a fully-aligned value model, as shown in Figure 5.
我们的实证观察表明，随着先验策略的增强，将价值模型与策略对齐变得越来越重要。否则，精心构建的 CoT 模式很容易被破坏，如图 1 所示。在我们的实验中，在应用了将价值模型与初始策略有效对齐的价值预训练技术后，输出长度的下降现象不再出现。这一结果清楚地突出了拥有完全对齐的价值模型的重要性，如图 5 所示。

Value-Pretraining injects knowledge into the value model, which is a superior form of value warm-up.
价值预训练将知识注入价值模型，这是一种更高级的价值预热形式。

We present the value loss during value-pretraining in Figure 6, where a two-stage convergence pattern can be observed. In the first stage, there is a rapid decline in value loss. We interpret this stage as range alignment, which shares similarities with the commonly used value warm-up technique in RL. However, in the second stage, the value loss decreases at a slower pace. We interpret this stage as knowledge injection. In this stage, the model starts to learn which tokens are more advantageous, a crucial aspect that has often been overlooked in previous research. As shown in Table 3, this stage has a substantial impact on the final performance of our model.
我们在图 6 中展示了值预训练过程中的值损失，其中可以观察到两阶段收敛模式。在第一阶段，值损失迅速下降。我们将这一阶段解释为范围对齐，这与 RL 中常用的值预热技术相似。然而，在第二阶段，值损失的下降速度较慢。我们将这一阶段解释为知识注入。在这一阶段，模型开始学习哪些标记更有利，这是先前研究中常被忽视的一个关键方面。如表 3 所示，这一阶段对我们的模型最终性能有重大影响。

Value optimization dynamics are more tolerant to variance, which leads to different variance-bias preferences in value and policy.
价值优化动态对变化更为容忍，这导致了价值和策略中对变化和偏差的不同偏好。

Based on the experimental results presented in Table 2 and the ablation results in Table 4, we can conclude that the value model favors a larger $\lambda$ , resulting in higher variance but lower bias. In contrast, the policy model prefers lower variance which means a lower $\lambda$ than $1.0$ . Notably, a relatively lower bias is still required at the same time since the extremely variance can hurt the performance anyway. This finding implicitly suggests that regression-style loss objectives, such as the mean squared error (MSE) loss used in value optimization, are less sensitive to variance. Conversely, policy-gradient-style objectives are more likely to be adversely affected by variance. This could serve as a promising avenue for further research in RL or RLHF.
基于表 2 中展示的实验结果和表 4 中的消融结果，我们可以得出结论，价值模型倾向于更大的 $\lambda$ ，导致更高的方差但更低的偏差。相比之下，策略模型更喜欢较低的方差，这意味着 $\lambda$ 比 $1.0$ 低。值得注意的是，同时仍然需要相对较低的偏差，因为极端的方差无论如何都会损害性能。这一发现隐含地表明，回归式损失目标，如用于价值优化的均方误差（MSE）损失，对方差不太敏感。相反，策略梯度式目标更容易受到方差的不利影响。这可以作为 RL 或 RLHF 进一步研究的有希望的途径。

5 Conclusion 5 结论

In this study, we delved into the failure of PPO in long CoT tasks and proposed VC-PPO as a solution. By identifying value initialization bias and reward signal decay as the main problems, we introduced value pretraining and decoupled-GAE techniques. Value pretraining aligns the value model with the initial policy, preventing the loss of the CoT pattern and improving performance. Decoupling the GAE computation for the policy and value allows for better bias-variance trade-offs in both components. Experimental results on AIME, CodeForces, and GPQA datasets show that VC-PPO outperforms the baseline PPO significantly. Ablation studies further emphasize the crucial role of value pretraining and decoupled-GAE in VC-PPO. Additionally, our research reveals differences in variance-bias preferences between value and policy models, which could be a promising area for future RL and RLHF research. Overall, VC-PPO provides an effective way to enhance PPO’s performance in long CoT tasks, contributing to the advancement of LLMs in complex reasoning tasks.
本研究深入探讨了 PPO 在长 CoT 任务中的失败，并提出了 VC-PPO 作为解决方案。通过识别价值初始化偏差和奖励信号衰减为主要问题，我们引入了价值预训练和去耦合 GAE 技术。价值预训练使价值模型与初始策略保持一致，防止 CoT 模式丢失并提高性能。去耦合策略和价值 GAE 计算允许两个组件在偏差-方差权衡方面有更好的表现。在 AIME、CodeForces 和 GPQA 数据集上的实验结果表明，VC-PPO 在性能上显著优于基线 PPO。消融研究进一步强调了价值预训练和去耦合 GAE 在 VC-PPO 中的关键作用。此外，我们的研究揭示了价值和策略模型在方差-偏差偏好上的差异，这可能是未来 RL 和 RLHF 研究的有希望领域。总的来说，VC-PPO 为提高 PPO 在长 CoT 任务中的性能提供了一种有效方法，有助于复杂推理任务中LLMs的进步。

References 参考文献

Ahmadian et al. [2024] Ahmadian 等人[2024] Arash Ahmadian, Chris Cremer, Matthias Gallé, Marzieh Fadaee, Julia Kreutzer, Olivier Pietquin, Ahmet Üstün, and Sara Hooker. Back to basics: Revisiting reinforce style optimization for learning from human feedback in llms, 2024. URL https://arxiv.org/abs/2402.14740.
阿拉斯·阿赫玛迪安，克里斯·克雷默，马蒂亚斯·加勒，马丽兹·法达伊，朱莉娅·克鲁策，奥利维耶·皮奎因，阿赫梅特·乌斯腾，以及萨拉·胡克。回归基础：重新审视llms中从人类反馈中学习的强化风格优化，2024。URL https://arxiv.org/abs/2402.14740
Bellemare et al. [2013] 贝尔马雷等。[2013] M. G. Bellemare, Y. Naddaf, J. Veness, and M. Bowling. The arcade learning environment: An evaluation platform for general agents. Journal of Artificial Intelligence Research, 47:253–279, jun 2013.
M. G. Bellemare, Y. Naddaf, J. Veness, 和 M. Bowling. 游艺机学习环境：通用智能体的评估平台。人工智能研究杂志，第 47 卷，第 253-279 页，2013 年 6 月。
Chen et al. [2024] 陈等[2024] Guoxin Chen, Minpeng Liao, Chengxi Li, and Kai Fan. Step-level value preference optimization for mathematical reasoning. arXiv preprint arXiv:2406.10858, 2024.
郭欣陈，廖敏鹏，李成熙，范凯。数学推理的步骤级价值偏好优化。arXiv 预印本 arXiv:2406.10858，2024。
DeepSeek-AI [2025] DeepSeek-AI. Deepseek-r1: Incentivizing reasoning capability in llms via reinforcement learning, 2025. URL https://arxiv.org/abs/2501.12948.
DeepSeek-AI. Deepseek-r1：通过强化学习激励llms的推理能力，2025。URL https://arxiv.org/abs/2501.12948。
Good and Fletcher [1981] Ron Good and Harold J. Fletcher. Reporting explained variance. Journal of Research in Science Teaching, 18(1):1–7, 1981. https://doi.org/10.1002/tea.3660180102. URL https://onlinelibrary.wiley.com/doi/abs/10.1002/tea.3660180102.
Ron Good 和 Harold J. Fletcher. 解释方差报告。科学教学研究杂志，第 18 卷第 1 期：1–7 页，1981 年。https://doi.org/10.1002/tea.3660180102. URL https://onlinelibrary.wiley.com/doi/abs/10.1002/tea.3660180102.
Han et al. [2024] 汉等[2024] Ji-Eun Han, Jun-Seok Koh, Hyeon-Tae Seo, Du-Seong Chang, and Kyung-Ah Sohn. Psydial: Personality-based synthetic dialogue generation using large language models. arXiv preprint arXiv:2404.00930, 2024.
韩智恩，高俊硕，赵贤泰，张杜成，孙庆华. 基于人格的大语言模型合成对话生成：Psydial. arXiv 预印本 arXiv:2404.00930，2024.
Huang et al. [2024] 黄等人 [2024] Shengyi Huang, Michael Noukhovitch, Arian Hosseini, Kashif Rasul, Weixun Wang, and Lewis Tunstall. The n+ implementation details of rlhf with ppo: A case study on tl;dr summarization, 2024. URL https://arxiv.org/abs/2403.17031.
黄圣毅，米哈伊尔·努库维奇，阿里安·霍塞尼，卡希夫·拉苏尔，王伟勋，刘易斯·坦斯塔尔。基于 PPO 的 rlhf 的 n+实现细节：关于 tl;dr 摘要的案例研究，2024。URL https://arxiv.org/abs/2403.17031。
Jimenez et al. [2023] 吉梅内斯等人 [2023] Carlos E Jimenez, John Yang, Alexander Wettig, Shunyu Yao, Kexin Pei, Ofir Press, and Karthik Narasimhan. Swe-bench: Can language models resolve real-world github issues? arXiv preprint arXiv:2310.06770, 2023.
卡洛斯·E·吉梅内斯，杨约翰，亚历山大·韦蒂格，姚顺宇，裴克欣，奥菲尔·普雷斯，卡蒂克·纳拉西马汉。Swe-bench：语言模型能否解决现实世界的 GitHub 问题？arXiv 预印本 arXiv:2310.06770，2023。
Kool et al. [2019] Kool 等人[2019] Wouter Kool, Herke van Hoof, and Max Welling. Buy 4 REINFORCE samples, get a baseline for free! In Deep Reinforcement Learning Meets Structured Prediction, ICLR 2019 Workshop, New Orleans, Louisiana, United States, May 6, 2019. OpenReview.net, 2019. URL https://openreview.net/forum?id=r1lgTGL5DE.
Wouter Kool, Herke van Hoof 和 Max Welling. 购买 4 个 REINFORCE 样本，免费获得一个基线！在《深度强化学习与结构化预测》ICLR 2019 研讨会，美国路易斯安那州新奥尔良，2019 年 5 月 6 日。OpenReview.net，2019。网址 https://openreview.net/forum?id=r1lgTGL5DE
OpenAI [2024] OpenAI. Learning to reason with llms, 2024. URL https://openai.com/index/learning-to-reason-with-llms/.
OpenAI. 学习使用llms进行推理，2024。网址 https://openai.com/index/learning-to-reason-with-llms/.
OpenAI [2025] OpenAI. Learning to reason with llms, 2025. URL https://openai.com/index/openai-o3-mini/.
OpenAI. 使用llms进行推理学习，2025。网址 https://openai.com/index/openai-o3-mini/
Ouyang et al. [2022] 欧阳等. [2022] Long Ouyang, Jeffrey Wu, Xu Jiang, Diogo Almeida, Carroll Wainwright, Pamela Mishkin, Chong Zhang, Sandhini Agarwal, Katarina Slama, Alex Ray, et al. Training language models to follow instructions with human feedback. Advances in neural information processing systems, 35:27730–27744, 2022.
长欧阳，吴杰夫，蒋旭，迪戈·阿尔梅达，卡罗尔·韦恩赖特，帕梅拉·米什金，张冲，桑迪尼·阿加瓦尔，卡塔琳娜·斯拉马，亚历克斯·雷，等。通过人类反馈训练语言模型遵循指令。神经信息处理系统进展，35：27730–27744，2022。
Qwen et al. [2025] Qwen 等人 [2025] Qwen, :, An Yang, Baosong Yang, Beichen Zhang, Binyuan Hui, Bo Zheng, Bowen Yu, Chengyuan Li, Dayiheng Liu, Fei Huang, Haoran Wei, Huan Lin, Jian Yang, Jianhong Tu, Jianwei Zhang, Jianxin Yang, Jiaxi Yang, Jingren Zhou, Junyang Lin, Kai Dang, Keming Lu, Keqin Bao, Kexin Yang, Le Yu, Mei Li, Mingfeng Xue, Pei Zhang, Qin Zhu, Rui Men, Runji Lin, Tianhao Li, Tianyi Tang, Tingyu Xia, Xingzhang Ren, Xuancheng Ren, Yang Fan, Yang Su, Yichang Zhang, Yu Wan, Yuqiong Liu, Zeyu Cui, Zhenru Zhang, and Zihan Qiu. Qwen2.5 technical report, 2025. URL https://arxiv.org/abs/2412.15115.
Qwen, :, 杨安, 杨宝松, 张贝臣, 惠斌元, 郑波, 余博文, 李成远, 刘大恒, 黄飞, 魏浩然, 林欢, 杨建, 图建宏, 张建伟, 杨建新, 杨家熙, 周景仁, 林俊阳, 谭凯, 陆克明, 包克勤, 杨克新, 余乐, 李梅, 薛明锋, 张培, 朱勤, 门瑞, 林润基, 李天豪, 唐天翼, 夏廷宇, 任兴章, 任宣成, 樊阳, 苏阳, 张一昌, 万宇, 刘雨琼, 崔泽宇, 张振儒, 邱子涵. Qwen2.5 技术报告，2025. 网址 https://arxiv.org/abs/2412.15115.
Rein et al. [2023] Rein 等人[2023] David Rein, Betty Li Hou, Asa Cooper Stickland, Jackson Petty, Richard Yuanzhe Pang, Julien Dirani, Julian Michael, and Samuel R. Bowman. Gpqa: A graduate-level google-proof q&a benchmark, 2023. URL https://arxiv.org/abs/2311.12022.
David Rein，李贝蒂·侯，阿萨·库珀·斯蒂克兰德，杰克逊·佩蒂，杨元哲·潘，朱利安·迪拉尼，朱利安·迈克尔，以及山姆·R·鲍曼。Gpqa：一个研究生级别的谷歌防问与答基准，2023。URL https://arxiv.org/abs/2311.12022。
Schulman et al. [2015] Schulman 等人[2015] John Schulman, Philipp Moritz, Sergey Levine, Michael Jordan, and Pieter Abbeel. High-dimensional continuous control using generalized advantage estimation. arXiv preprint arXiv:1506.02438, 2015.
约翰·舒尔曼，菲利普·莫里茨，谢尔盖·莱文，迈克尔·乔丹，皮埃特·阿贝贝尔。使用广义优势估计进行高维连续控制。arXiv 预印本 arXiv:1506.02438，2015。
Schulman et al. [2017] Schulman 等人[2017] John Schulman, Filip Wolski, Prafulla Dhariwal, Alec Radford, and Oleg Klimov. Proximal policy optimization algorithms. arXiv preprint arXiv:1707.06347, 2017.
John Schulman, Filip Wolski, Prafulla Dhariwal, Alec Radford 和 Oleg Klimov. 近端策略优化算法。arXiv 预印本 arXiv:1707.06347，2017。
Shao et al. [2024] 邵等。[2024] Zhihong Shao, Peiyi Wang, Qihao Zhu, Runxin Xu, Junxiao Song, Mingchuan Zhang, YK Li, Yu Wu, and Daya Guo. Deepseekmath: Pushing the limits of mathematical reasoning in open language models. arXiv preprint arXiv:2402.03300, 2024.
邵志红，王佩怡，朱启豪，徐润新，宋俊晓，张明川，李 YK，吴宇，郭大亚. Deepseekmath：推动开放语言模型中数学推理的极限。arXiv 预印本 arXiv:2402.03300，2024。
Sheng et al. [2024] 沈等。[2024] Guangming Sheng, Chi Zhang, Zilingfeng Ye, Xibin Wu, Wang Zhang, Ru Zhang, Yanghua Peng, Haibin Lin, and Chuan Wu. Hybridflow: A flexible and efficient rlhf framework. arXiv preprint arXiv: 2409.19256, 2024.
光明胜，张驰，叶紫灵峰，吴西斌，张旺，张如，彭阳华，林海斌，吴传。Hybridflow：一个灵活高效的 RLHF 框架。arXiv 预印本 arXiv: 2409.19256，2024。
Todorov et al. [2012] 托多罗夫等。[2012] Emanuel Todorov, Tom Erez, and Yuval Tassa. Mujoco: A physics engine for model-based control. In IROS, pages 5026–5033. IEEE, 2012. ISBN 978-1-4673-1737-5. URL http://dblp.uni-trier.de/db/conf/iros/iros2012.html#TodorovET12.
Emanuel Todorov、Tom Erez 和 Yuval Tassa. Mujoco：基于模型的控制物理引擎。在 IROS 会议论文集中，第 5026–5033 页。IEEE，2012。ISBN 978-1-4673-1737-5。URL http://dblp.uni-trier.de/db/conf/iros/iros2012.html#TodorovET12.
Wang et al. [2024] 王等。[2024] Xingyao Wang, Boxuan Li, Yufan Song, Frank F Xu, Xiangru Tang, Mingchen Zhuge, Jiayi Pan, Yueqi Song, Bowen Li, Jaskirat Singh, et al. Opendevin: An open platform for ai software developers as generalist agents. arXiv preprint arXiv:2407.16741, 2024.
王星瑶，李博宣，宋宇凡，徐方富，唐翔宇，诸葛明晨，潘佳怡，宋越奇，李博文，贾斯基拉特·辛格，等。Opendevin：面向 AI 软件开发者的通用代理开放平台。arXiv 预印本 arXiv:2407.16741，2024。
Wei et al. [2022] Wei 等人[2022] Jason Wei, Xuezhi Wang, Dale Schuurmans, Maarten Bosma, Brian Ichter, Fei Xia, Ed H. Chi, Quoc V. Le, and Denny Zhou. Chain-of-thought prompting elicits reasoning in large language models. In Sanmi Koyejo, S. Mohamed, A. Agarwal, Danielle Belgrave, K. Cho, and A. Oh, editors, Advances in Neural Information Processing Systems 35: Annual Conference on Neural Information Processing Systems 2022, NeurIPS 2022, New Orleans, LA, USA, November 28 - December 9, 2022, 2022.
贾森·魏，王雪志，戴尔·舒尔曼斯，马滕·博斯马，布莱恩·伊彻，夏飞，埃德·H·奇，吴国维·莱，周登尼。思维链提示引发大型语言模型的推理。在山米·科耶约，S.穆罕默德，A.阿加瓦尔，丹妮尔·贝尔格雷夫，K.卓，A.欧编辑的《神经信息处理系统 35：2022 年神经信息处理系统年度会议》：NeurIPS 2022，美国路易斯安那州新奥尔良，2022 年 11 月 28 日至 12 月 9 日，2022 年。
Zheng et al. [2023] 郑等[2023] Rui Zheng, Shihan Dou, Songyang Gao, Yuan Hua, Wei Shen, Binghai Wang, Yan Liu, Senjie Jin, Qin Liu, Yuhao Zhou, Limao Xiong, Lu Chen, Zhiheng Xi, Nuo Xu, Wenbin Lai, Minghao Zhu, Cheng Chang, Zhangyue Yin, Rongxiang Weng, Wensen Cheng, Haoran Huang, Tianxiang Sun, Hang Yan, Tao Gui, Qi Zhang, Xipeng Qiu, and Xuanjing Huang. Secrets of rlhf in large language models part i: Ppo, 2023. URL https://arxiv.org/abs/2307.04964.
Rui Zheng, Shihan Dou, Songyang Gao, Yuan Hua, Wei Shen, Binghai Wang, Yan Liu, Senjie Jin, Qin Liu, Yuhao Zhou, Limao Xiong, Lu Chen, Zhiheng Xi, Nuo Xu, Wenbin Lai, Minghao Zhu, Cheng Chang, Zhangyue Yin, Rongxiang Weng, Wensen Cheng, Haoran Huang, Tianxiang Sun, Hang Yan, Tao Gui, Qi Zhang, Xipeng Qiu, and Xuanjing Huang. 大型语言模型中 rlhf 的秘密（第一部分）：PPO，2023。URL https://arxiv.org/abs/2307.04964。

What’s Behind PPO’s Collapse in Long-CoT? Value Optimization Holds the SecretPPO 在长 CoT 中的崩溃背后是什么？价值优化掌握着秘密