
License: CC BY-NC-ND 4.0
arXiv:2502.06781v1 [cs.CL] 10 Feb 2025
∗Equal contribution  †Corresponding author

Exploring the Limit of Outcome Reward
for Learning Mathematical Reasoning

Chengqi Lyu1∗ Songyang Gao1∗ Yuzhe Gu1,2∗ Wenwei Zhang1∗† Jianfei Gao1
Kuikun Liu1 Ziyi Wang1 Shuaibin Li1 Qian Zhao1 Haian Huang1 Weihan Cao1
Jiangning Liu1  Hongwei Liu1 Junnan Liu1 Songyang Zhang1
Dahua Lin1,3,4 Kai Chen1†
1Shanghai AI Laboratory  2Shanghai Jiao Tong University
3MMLab, The Chinese University of Hong Kong  4HKGAI under InnoHK
{lvchengqi,gaosongyang,guyuzhe,zhangwenwei,chenkai}@pjlab.org.cn
Abstract

Reasoning abilities, especially those for solving complex math problems, are crucial components of general intelligence. Recent advances by proprietary companies, such as the o-series models of OpenAI, have made remarkable progress on reasoning tasks. However, the complete technical details remain unrevealed, and the only techniques that are widely believed to be adopted are reinforcement learning (RL) and long chains of thought. This paper proposes a new RL framework, termed OREAL, to pursue the performance limit that can be achieved through Outcome REwArd-based reinforcement Learning for mathematical reasoning tasks, where only binary outcome rewards are easily accessible. We theoretically prove that behavior cloning on positive trajectories from best-of-N (BoN) sampling is sufficient to learn the KL-regularized optimal policy in binary feedback environments. This formulation further implies that the rewards of negative samples should be reshaped to ensure gradient consistency between positive and negative samples. To alleviate the long-existing difficulty of sparse rewards in RL, which is exacerbated by the partial correctness of long chains of thought in reasoning tasks, we further apply a token-level reward model to sample important tokens in reasoning trajectories for learning. With OREAL, for the first time, a 7B model obtains 94.0 pass@1 accuracy on MATH-500 through RL, on par with 32B models. OREAL-32B also surpasses previous 32B models trained by distillation with 95.0 pass@1 accuracy on MATH-500. Our investigation also indicates the importance of the initial policy models and training queries for RL. Code, models, and data will be released to benefit future research (https://github.com/InternLM/OREAL).

1 Introduction

Solving complex problems through reasoning is one of the cornerstones of human cognition, a cognitive ability that artificial general intelligence must ultimately master [1, 2]. Among various problem domains, mathematical problems emerge as a particularly compelling experimental paradigm for AI research [3, 4, 5, 6], owing to their relatively well-defined structure and the availability of precise binary correctness feedback based on verifiable final answers.

Recent advances in large language models (LLMs) have achieved remarkable progress in mathematical reasoning through chain-of-thought techniques [7, 8, 9], in which LLMs are elicited to produce a series of intermediate reasoning steps before providing the final answer to the problem. However, as most of the capable models (e.g., the o-series models by OpenAI [10]) are developed by proprietary companies, there is no clear pathway for developing state-of-the-art reasoning models. While some recent work shows that distillation [11, 12] is sufficient to obtain high performance given access to existing best or near-best AI models, reinforcement learning (RL) is believed to be a more fundamental approach and has exhibited the potential [13] to advance beyond the intelligence boundary of current AI models when using the most capable open-source foundation models (DeepSeek-V3-base [14], inter alia).

However, the fundamental challenge of sparse reward in RL persists and is even exacerbated in mathematical reasoning tasks that mainly rely on chain-of-thought techniques with LLMs [7]: the evaluation of intermediate reasoning steps is labor intensive [15] and accurate automated approaches remain under-explored; thus, the only reliable reward is based on the outcome (correctness of the final answer), which is inherently binary and sparse when faced with more than 2000 tokens in long reasoning trajectories [13, 16]. Existing approaches have attempted to estimate the advantages or values of reasoning steps by search [17, 18] or value function-based credit assignment [19, 20], yet their performance remains unsatisfactory in comparison with distilled models [13].

This paper aims to conquer the above challenges and proposes a simple framework, termed OREAL, to push the limit of Outcome REwArd-based reinforcement Learning for mathematical reasoning tasks. OREAL is grounded in a unique characteristic of mathematical reasoning tasks: binary outcome feedback creates an environment where all positive trajectories are equally valid. We first establish that behavior cloning on BoN-sampled positive trajectories is sufficient to achieve KL-regularized optimality, which follows from the analysis that the positive trajectory from BoN sampling converges to a distribution independent of the sample number. For learning on negative samples, OREAL reveals the necessity of reward shaping to maintain consistent gradient estimation between the sampling and target distributions. Such a mechanism compensates for BoN's under-sampling of negative gradients and enables difficulty-adaptive optimization over both successful and failed trajectories.

Another intrinsic property of mathematical reasoning tasks is partial correctness in long reasoning chains, which further compounds the learning difficulty of sparse rewards when only a binary outcome reward is available at each iteration of RL training. Thus, OREAL adopts a lightweight credit assignment scheme through a token-level reward model trained using outcome rewards. This mechanism automatically estimates step-wise importance weights by decomposing trajectory advantages, enabling focused learning on critical reasoning steps or errors. The integration of these components yields a theoretically sound framework that effectively bridges the gap between sparse binary feedback and the dense policy optimization requirements of mathematical reasoning tasks.

Figure 1: Overall performance between OREAL-32B and some competitive baselines.

Extensive experimental results show that OREAL effectively improves the mathematical reasoning capability of LLMs. At the 7B parameter scale, to the best of our knowledge, OREAL-7B is the first to reach 91.0 pass@1 accuracy on MATH-500 [21] using RL instead of distillation, which even exceeds QwQ-32B-Preview [22] and o1-mini [10]. OREAL also improves DeepSeek-R1-Distill-Qwen-7B from 92.8 to 94.0 pass@1 accuracy, on par with the previous best 32B models. At the 32B scale, OREAL-32B outperforms all previous models (Figure 1), both distilled and RL-based, obtaining new state-of-the-art results with 95.0 pass@1 accuracy on MATH-500.

2 Methods

To obtain a deeper understanding of the challenges in applying reinforcement learning (RL) to solving math word problems, we first analyze the formulation of RL and the intrinsic properties of the underlying binary feedback environment (§2.1), and establish a theoretical foundation for our optimization framework regarding how to learn from positive samples (§2.2) and failed trials (§2.3). To further overcome the learning ambiguity that outcome rewards bring to partially correct long reasoning chains, we adopt a new strategy to estimate the importance of tokens for learning (§2.4).

2.1 Preliminary

When adopting a large language model (LLM) for mathematical reasoning, the input to the LLM policy is a textual math problem, which prompts the LLM to output a multi-step reasoning trajectory consisting of multiple tokens as actions. During RL training, common practices [6, 23] sample multiple reasoning trajectories from the LLM, assign binary feedback (0/1 reward) based solely on the correctness of the final answer, and perform policy optimization using the sampled trajectories with their rewards.

Policy Optimization. Consider a Markov Decision Process (MDP) defined by the tuple $(\mathcal{S},\mathcal{A},P,r,\gamma)$, where $\mathcal{S}$ is a finite state space (e.g., contextual steps in mathematical reasoning), $\mathcal{A}$ is the action space (i.e., the token space of LLMs), $P(s'|s,a)$ specifies the state transition dynamics, $r:\mathcal{S}\times\mathcal{A}\rightarrow\mathbb{R}$ is the reward function, and $\gamma\in[0,1)$ denotes the discount factor.

In this section, we focus on KL-regularized policy optimization, which maximizes the expected cumulative return while regularizing the policy $\pi_{\theta}(\cdot|s)$ toward a reference policy $\pi_{0}(\cdot|s)$. The objective function is formulated as:

$$J(\theta)\triangleq\mathbb{E}_{s\sim\rho_{0},\,a\sim\pi_{\theta}(\cdot|s)}\left[Q^{\pi_{\theta}}(s,a)\right]-\alpha\cdot\mathbb{E}_{s\sim\rho_{0}}\left[D_{\text{KL}}\left(\pi_{\theta}(\cdot|s)\,\|\,\pi_{0}(\cdot|s)\right)\right]\tag{1}$$

with the state-action value function $Q^{\pi}(s,a)=\mathbb{E}_{\pi}\left[\sum_{k=0}^{\infty}\gamma^{k}r(s_{t+k},a_{t+k})\mid s_{t}=s,a_{t}=a\right]$ under the vanilla policy $\pi$. This objective admits a closed-form solution for the optimal policy $\pi^{*}$:

$$\pi^{*}(a|s)=\frac{\pi_{0}(a|s)\exp\left(Q^{\pi}(s,a)/\alpha\right)}{Z(s)},\tag{2}$$

where $Z(s)=\mathbb{E}_{a\sim\pi_{0}(\cdot|s)}\left[\exp\left(Q^{\pi}(s,a)/\alpha\right)\right]$ is the partition function that ensures normalization.

Best-of-N (BoN) Sampling. As a common and efficient strategy to sample multiple reasoning trajectories from LLMs, Best-of-$N$ sampling selects the trajectory with maximal reward among $n$ independent rollouts from $\pi_{0}$ to enhance policy performance. Formally, given candidate actions $\{a^{(i)}\}_{i=1}^{n}\sim\pi_{0}(\cdot|s)$, the chosen action is $a^{*}=\arg\max_{a^{(i)}}Q(s,a^{(i)})$. This strategy effectively leverages the exploration-exploitation trade-off through parallel sampling [24, 25].

Binary Feedback under Outcome Supervision. Though a reasoning trajectory usually contains multiple reasoning steps with thousands of tokens, there is no efficient approach to automatically label the correctness of each token or reasoning step in math reasoning tasks. Thus, a practical way is to parse the final answer from the reasoning trajectory [13, 26], evaluate its correctness based on rules or models, and then provide an outcome reward at the end of the trajectory as below:

$$R(s_{t})=\begin{cases}1 & \text{if } t \text{ is the end step and the answer is correct}\\ 0 & \text{otherwise,}\end{cases}\tag{3}$$

which intrinsically treats all correct trajectories equally for learning. Moreover, the reward signal is severely sparse compared to the thousands of tokens in a trajectory and does not provide any signal of progress or correctness for intermediate steps. The resulting reward distribution over trajectories also differs from that of the dense reward functions constructed through preference pairs in traditional RL for large language models [27], which motivates a more appropriate optimization framework for mathematical reasoning tasks, discussed in the next section.
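A minimal sketch of this outcome reward, assuming hypothetical `extract_final_answer` and `answers_match` helpers in place of the rule/model-based verifier of Section 3.2:

```python
# Sketch of Eq. (3): the binary reward is attached only to the final step of a
# trajectory, based on the correctness of the parsed final answer.

def outcome_reward(trajectory_tokens, reference_answer,
                   extract_final_answer, answers_match):
    rewards = [0.0] * len(trajectory_tokens)      # no intermediate signal
    predicted = extract_final_answer(trajectory_tokens)
    if predicted is not None and answers_match(predicted, reference_answer):
        rewards[-1] = 1.0                          # reward only at the end step
    return rewards
```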

2.2 Learning from Positive Samples

Building upon the reward equivalence principle stated in Eq. 3, we first formalize a key probabilistic characteristic of BoN sampling:

Lemma 2.1.

Let $\pi(\theta,s)$ be a distribution over parameters $\theta$ and trajectories $s$, where each $s$ is associated with a binary reward $R(s)\in\{0,1\}$. Define $p\triangleq\mathbb{E}_{s\sim\pi(\theta,\cdot)}[R(s)=1]>0$. Consider BoN sampling with $n=n_{0}\to\infty$: sample $\{s_{1},s_{2},\dots,s_{n}\}$ i.i.d. from $\pi_{\theta}$, and let BoN select $s^{*}$ uniformly from the subset with $R(s_{i})=1$. Then the probability of selecting $s^{*}$ converges to $\frac{\pi(\theta,s)}{p}$, which is independent of $n$.

The proof follows directly from the union law of BoN sampling ($\text{BoN}_{n+m}=\text{BoN}_{2}(\text{BoN}_{m},\text{BoN}_{n})$) and the trivial distinguishability of $0$-$1$ rewards. This result reveals that, for problems with attainable positive responses, we are using a BoN generator with an arbitrary sampling budget to construct the positive training samples.
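The lemma can be checked numerically on a toy discrete trajectory space; the distribution and rewards below are made up purely for the demonstration.

```python
# Toy Monte-Carlo check of Lemma 2.1: the BoN-selected positive sample follows
# pi(s)/p regardless of n (conditioned on at least one correct rollout).
import random
from collections import Counter

space  = ["s1", "s2", "s3", "s4"]
probs  = [0.4, 0.3, 0.2, 0.1]
reward = {"s1": 1, "s2": 0, "s3": 1, "s4": 0}          # binary outcome rewards
# p = P[R(s) = 1] = 0.4 + 0.2 = 0.6, so pi(s)/p = {s1: 2/3, s3: 1/3}.

def bon_positive(n):
    rollouts = random.choices(space, probs, k=n)
    positives = [s for s in rollouts if reward[s]]
    return random.choice(positives) if positives else None

for n in (2, 8, 32):
    picks = [bon_positive(n) for _ in range(100_000)]
    counts = Counter(s for s in picks if s is not None)
    total = sum(counts.values())
    print(n, {s: round(counts[s] / total, 3) for s in ("s1", "s3")})
    # The printed frequencies stay close to {s1: 0.667, s3: 0.333} for every n.
```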

To quantify the distributional divergence induced by BoN sampling, prior work [28, 29, 30] has analyzed the KL divergence between the BoN distribution $\pi_{\text{BoN}}$ and the original policy $\pi$. For continuous trajectory spaces $\mathcal{S}$, the BoN distribution admits the explicit form:

$$\pi_{\text{BoN}}(s)=n\cdot\left[P(s)\right]^{n-1}\cdot\pi(s),\tag{4}$$

where $P(s)$ denotes the cumulative distribution function (CDF) associated with $\pi(s)$. The corresponding KL divergence is given by

$$\text{KL}(\pi_{\text{BoN}}\parallel\pi)=\log n-\frac{n-1}{n}.$$

This KL divergence expression possesses two crucial properties: it is strictly increasing in $n$ for $n\geq 1$, and as $n\to\infty$ it diverges as $\log n$, covering the entire positive real axis. This implies that for any prescribed KL divergence constraint $\epsilon>0$, there exists a BoN parameterization that approximates the optimal policy, which can be obtained simply by sampling with the minimal $n(\epsilon)$ satisfying

$$n(\epsilon)=\arg\min_{n}\ \mathbb{E}_{s\sim\pi_{\text{BoN}}}[-R(s)].$$

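The KL expression above is easy to evaluate numerically. The sketch below computes $\text{KL}(\pi_{\text{BoN}}\parallel\pi)$ for a given $n$ and picks the largest sampling budget whose divergence stays within a budget $\epsilon$; the concrete search procedure is an illustration-level assumption, not a prescription from the paper.

```python
# Numerical illustration of KL(pi_BoN || pi) = log n - (n - 1) / n and of
# choosing the largest n that respects a prescribed KL budget epsilon.
import math

def bon_kl(n: int) -> float:
    return math.log(n) - (n - 1) / n

def max_n_within_budget(epsilon: float, n_max: int = 1024) -> int:
    # bon_kl is strictly increasing in n, so the largest feasible n suffices.
    return max(n for n in range(1, n_max + 1) if bon_kl(n) <= epsilon)

for eps in (0.5, 1.0, 2.0):
    n = max_n_within_budget(eps)
    print(f"epsilon={eps:.1f} -> n={n}, KL={bon_kl(n):.3f}")
```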
BoNBoN [31] empirically shows that BoN sampling achieves the optimal win rate under a fixed KL constraint by exhaustive search over the positive support. Therefore, behavior cloning on BoN-selected positive samples directly learns the analytic solution to Eq. 1. Intuitively, since every correct answer is preferred identically in the outcome-supervised sense, we only need to sample until we obtain a positive example, whose generating distribution is the same as that of one randomly picked from an arbitrarily large number of samples.

Based on the theoretical understanding established above, we formulate the first component of the OREAL learning objective as a KL-constrained maximum-likelihood objective over positive examples obtained through sampling:

$$\mathcal{L}_{1}(\theta)=\underbrace{\mathbb{E}_{s\sim\mathcal{D}^{+}}\left[-\log\pi_{\theta}(s)\right]}_{\text{Positive example alignment}}+\beta\,\underbrace{\text{KL}(\pi_{\theta}\parallel\pi_{\text{old}})}_{\text{Policy constraint}},$$

where $\mathcal{D}^{+}$ denotes the set of positive trajectories selected via BoN sampling from RL rollouts.
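For illustration, a minimal sketch of the positive-example term (not the authors' implementation): behavior cloning reduces to a masked negative log-likelihood over BoN-selected correct trajectories, with the $\beta$-weighted KL regularizer toward $\pi_{\text{old}}$ assumed to be added separately by the trainer.

```python
# Sketch of the positive-sample part of L1: token-level NLL on BoN-selected
# correct trajectories. Shapes and padding handling are assumptions; the
# beta * KL(pi_theta || pi_old) term is omitted here.
import torch

def positive_sample_loss(logp_theta: torch.Tensor, mask: torch.Tensor) -> torch.Tensor:
    """
    logp_theta: (B, T) log pi_theta of each token of correct trajectories.
    mask:       (B, T) 1 for real tokens, 0 for padding.
    """
    return -(logp_theta * mask).sum(dim=1).mean()   # behavior cloning on D+
```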

2.3 Learning from Negative Samples

As established in Section 2.2, direct behavior cloning on positive responses can effectively recover the policy distribution. BOND [32] proposes estimating the Jeffreys divergence [33] for the BoN strategy to train with both positive and negative samples, and demonstrates that signals from unsuccessful trajectories provide critical information about decision boundaries and failure modes.

In this section, we discuss the relationship between the BoN (Best-of-N) distribution and the optimization objective defined in Eq. 1, and then elucidate the necessity of reward reshaping when training with negative samples. Notably, while Eq. 4 shares structural similarities with Eq. 2, its application to mathematical reasoning tasks with binary feedback requires reformulation. Specifically, the transformed BoN distribution can be expressed as

$$\pi_{\text{bon}}(s)=\pi(s)\left[R(s)\cdot\frac{1-\left(1-p\right)^{n}}{p}+\left(1-R(s)\right)\cdot\left(1-p\right)^{n-1}\right],\tag{5}$$

which reveals fundamental differences between the BoN distribution and the original sampling distribution. Consider a scenario where two correct and two incorrect solutions are sampled, yielding an empirical accuracy of 50%. However, the probability of selecting a negative sample under Best-of-4 becomes $(0.5)^{4}=6.25\%$, significantly lower than under the original distribution. This discrepancy necessitates reward shaping to maintain consistency between our optimization target and the expected return under the BoN distribution.

Building on BoN-RLB's [34] application of the log-likelihood trick for BoN-aware policy gradients, we analyze the reward shaping technique for negative samples that maintains gradient consistency with Section 2.2, with the expected return $p$ following the definition in Lemma 2.1. The policy gradient under the BoN distribution can be derived as

$$\begin{aligned}\nabla_{\theta}J_{\text{bon}}&=\mathbb{E}_{s\sim\pi_{\text{bon}}(\cdot)}\left[R(s)\nabla_{\theta}\log\pi_{\text{bon}}(s)\right]\\&=\mathbb{E}_{s\sim\pi_{\text{bon}}(\cdot)}\left[R(s)\nabla_{\theta}\left(I_{D_{+}}(s)\log\pi(s)\frac{1-(1-p)^{n}}{p}+I_{D_{-}}(s)\log\pi(s)(1-p)^{n-1}\right)\right],\end{aligned}\tag{6}$$

where $I_{D_{+}}(s)$ and $I_{D_{-}}(s)$ denote indicator functions for the positive and negative sample sets, respectively. Notably, these indicators are independent of the policy parameters $\theta$. Given $\mathbb{E}_{s\sim\pi_{\text{bon}}}[I_{D_{+}}(s)]=1-(1-p)^{n}$, we derive the gradient components as

$$\mathbb{E}_{s\sim\pi_{\text{bon}}}\left[\nabla_{\theta}\left(I_{D_{+}}(s)\log\pi(s)\frac{1-(1-p)^{n}}{p}\right)\right]=n(1-p)^{n-1}\,\mathbb{E}_{s\sim\pi,\,s\in D_{+}}\left[\nabla_{\theta}\log\pi(s)\right].$$

Similarly, we also have

$$\mathbb{E}_{s\sim\pi_{\text{bon}}}\left[\nabla_{\theta}\left(I_{D_{-}}(s)\log\pi(s)(1-p)^{n-1}\right)\right]=n(1-p)^{n}\,\mathbb{E}_{s\sim\pi,\,s\in D_{-}}\left[\nabla_{\theta}\log\pi(s)\right].$$

This derivation reveals that when assigning unit reward ($R(s)=1$) to positive samples, gradient consistency requires reshaping the rewards of negative samples to $R^{\star}(s)\triangleq(1-p)R(s)$. Based on this reward shaping, we can construct policy optimization over both positive and negative samples toward the optimal policy.

To obtain the parameter $1-p$, which can be linked to Monte-Carlo (MC) advantage estimation, we can simply estimate this probability by calculating the expected accuracy over the sample space from a small number of responses. In this paper we apply a setting similar to RLOO [35], namely $R_{\text{RLOO}}(s)=R(s)-\frac{1}{N-1}\sum_{s^{\star}\neq s}R(s^{\star})$ for an unbiased mean reward, and train with the policy gradient. The second part of our OREAL objective is then formulated as below:

$$\mathcal{L}_{2}(\theta)=\mathbb{E}_{s\sim S_{-}}\left[F(1-p)\cdot\log\frac{\pi_{\theta}(s)}{\pi_{\text{old}}(s)}\right]+\beta\,\text{KL}(\pi_{\theta}\parallel\pi_{\text{old}}),$$

where $p=\mathbb{P}_{s\sim\pi}[R(s)=1]$, $S_{-}$ is the failed subset generated by the policy model, and $F$ represents a preprocessing of the advantage scores that serves as a generalized form, for example, $F(1-p)\triangleq\frac{r_{i}-\text{mean}(\{r_{i}\dots r_{n}\})}{\text{std}(\{r_{i}\dots r_{n}\})}$ in the recent GRPO [6] algorithm, where $\text{mean}(\{r_{i}\dots r_{n}\})\to p$ as $n\to\infty$.

2.4 Dealing with Long Reasoning Chains

In the previous discussion, we introduced the adaptation of binary reward training in the response space. However, since outcome supervision only provides feedback at the sequence level, this modeling essentially reduces to a contextual bandit without internal reward modeling within the MDP. A common counterexample is PPO, which utilizes a separate critic model to estimate the value function. However, such a solution is expensive and complex, which has motivated numerous explorations of how to stabilize PPO training.

Things are slightly different in mathematical reasoning, where the model can spontaneously revise omissions in intermediate steps and still obtain the correct final answer. Therefore, outcome supervision is preferred, and the value function serves more as a simple credit assignment that determines how much each process step contributes to the outcome reward. Considering the trade-off between efficiency and performance, we choose low-cost alternatives for sequence-level reweighting.

Taking into account the deterministic dynamics of mathematical reasoning ($s_{t+1}=f(s_{t},a_{t})$), the state-action function $Q^{\pi}(s_{<t},\pi(s_{t}))$ simplifies to the cumulative discounted reward of policy $\pi$:

$$Q^{\pi}(s_{<t},\pi(s_{t}))=V^{\pi}(s_{\leq t})=\sum_{k=0}^{T-t}\gamma^{k}r(s_{t+k}|s_{<t}).\tag{7}$$

Since intermediate rewards are not provided in mathematical reasoning tasks, we define an advantage function based solely on outcome feedback:

$$A(s_{\leq t})=V^{\pi}(s_{\leq t+1})-V^{\pi}(s_{\leq t}).\tag{8}$$

This formulation treats $A(s_{\leq t})$ as a token-wise credit assignment mechanism, estimating each token's contribution toward the final outcome.
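With deterministic transitions, these per-token advantages are simply first differences of a value estimate along the sequence, as in the short sketch below (the source of the value estimate is left unspecified here).

```python
# Sketch of Eq. (8): per-token advantages as first differences of V(s_{<=t}).
import torch

def token_advantages(values: torch.Tensor) -> torch.Tensor:
    """values: (T + 1,) estimated V(s_{<=t}) for t = 0..T."""
    return values[1:] - values[:-1]   # A(s_{<=t}) = V(s_{<=t+1}) - V(s_{<=t})
```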

For a pair of responses $y_{1}$ and $y_{2}$ to the same query, their initial values coincide, $V_{0}^{1}=V_{0}^{2}$. The win rate between them then satisfies:

$$\begin{aligned}p(y_{1}>y_{2})&=\sigma(r(y_{1})-r(y_{2}))\\&=\sigma\left(\left(V_{0}^{1}+\sum_{t=0}^{T}\gamma^{t}A_{y_{1}}^{t}\right)-\left(V_{0}^{2}+\sum_{t=0}^{T}\gamma^{t}A_{y_{2}}^{t}\right)\right)\\&=\sigma\left(\sum_{t=0}^{T}\gamma^{t}\left(A_{y_{1}}^{t}-A_{y_{2}}^{t}\right)\right).\end{aligned}\tag{9}$$

Equation 9 indicates that for any function family $\mathcal{A}=\{A(s_{\leq t})\}$, a cumulative reward function can be constructed through sequence aggregation to model rewards:

$$r^{*}(s)\triangleq\sum_{t=0}^{T}\gamma^{t}A(s_{\leq t}),$$

which is trainable via preference pairs $\{(y_{w},y_{l})\}$ by fitting the outcome feedback. The learned $A(s_{\leq t})$ serves as a weighting function for credit assignment, which is used to reweight the original training loss, emphasizing critical reasoning steps or errors. An analogous implementation is $r2Q^{*}$ [36, 37], which defines $A=\log\frac{\pi(y_{i})}{\pi_{\text{ref}}(y_{i})}$; PRIME [20] then applies this formulation to improve the performance of RLOO. In our work, following the practice of [38], we directly train a token-level reward function $w(s_{\leq t})$ satisfying

$$\frac{1}{T}\sum_{t=0}^{T}w(s_{\leq t})=r(s),$$

without constraining the KL-divergence to a reference model during reward model training. These sequential rewards can serve as a proxy for the contribution of thinking steps to the result accuracy. For a pair of prefix-consistent correct and incorrect samples, due to the causal nature of the token-level reward model, preference optimization over these samples only acts on the steps with different contents, which induces higher credit on the core reasoning steps that affect the final result. We further discuss the training details of this model and analyze the visualization of its token-wise scoring effects in Section 3.2 and Appendix A.

In practice, we decompose the output weight $w(s)$ for positive and negative samples and clip on the positive axis to prevent reversing the direction of the optimized gradient, denoted as $\omega^{+}$ and $\omega^{-}$:

$$\omega^{+}=\max(2\sigma(w)-1,0),\quad\omega^{-}=\max(1-2\sigma(w),0).\tag{10}$$

Given an input query $d$, the overall loss is as follows:

$$\begin{aligned}\mathcal{L}_{\text{total}}(d)\triangleq\;&\mathbb{E}_{s\sim S}\left[\sum_{t=0}^{T}\left(-\omega_{s\leq t}^{+}\log\pi_{\theta}(s_{\leq t}|d)\,I_{D_{+}}(s)+\eta\,\omega_{s\leq t}^{-}\log\frac{\pi_{\theta}(s_{\leq t}|d)}{\pi_{\text{old}}(s_{\leq t}|d)}\,I_{D_{-}}(s)\right)\right]\\&+\beta\,\text{KL}(\pi_{\theta}(\cdot|d)\parallel\pi_{\text{old}}(\cdot|d)),\end{aligned}\tag{11}$$

where $\eta$ is the weight balancing the positive and negative loss terms.
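To make the reward shaping of Eq. (10) and the loss of Eq. (11) concrete, the following is a minimal PyTorch-style sketch for a single trajectory, assuming per-token reward-model scores and per-token log-probabilities are already available. The function and variable names are illustrative, not the authors' implementation, and the KL regularizer of Eq. (11) is omitted.

```python
import torch

def shaping_weights(w):
    """Eq. (10): clip the sigmoid-transformed token-level reward scores so that
    positive-sample and negative-sample weights never reverse the gradient direction."""
    sigma = torch.sigmoid(w)
    omega_pos = torch.clamp(2 * sigma - 1, min=0.0)
    omega_neg = torch.clamp(1 - 2 * sigma, min=0.0)
    return omega_pos, omega_neg

def oreal_trajectory_loss(logp_theta, logp_old, w, is_positive, eta=1.0):
    """Per-trajectory term of Eq. (11), without the KL(pi_theta || pi_old) regularizer.
    logp_theta, logp_old: per-token log-probabilities under the current and rollout
    policies; w: raw token-level reward-model scores; is_positive: binary outcome."""
    omega_pos, omega_neg = shaping_weights(w.detach())
    if is_positive:
        # behavior cloning on the positive trajectory, reweighted token by token
        return -(omega_pos * logp_theta).sum()
    # penalize negative trajectories through the likelihood ratio to the old policy
    return eta * (omega_neg * (logp_theta - logp_old)).sum()
```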

3 Implementation

3.1 Policy Initialization

We utilize Qwen2.5-7B and Qwen2.5-32B [39] as the base models. Initially, we fine-tune the base models using long chain-of-thought data obtained through rejection sampling [23]. These rejection sampling fine-tuned (RFT) [23] models then serve as the initialization for the policy model in our RL framework. We also explore using DeepSeek-R1-Distill-Qwen-7B [13] as the initial policy model, apply OREAL to it, and discuss the influence of different initial policy models in Section 4.4. The training data for the RFT models consists of in-house datasets supported by OpenDataLab [40] and open-source datasets including Numina [41] and the training set of MATH [21].

3.2 Reinforcement Learning

Data Preparation. During the on-policy RL process, we utilize questions from Numina, MATH training sets, and historical AMC/AIME (without AIME2024) competitions. For each question, we independently sample 16 trajectories from the RFT models. The correctness of each trajectory is then averaged to estimate the correctness rate of each query. To increase the difficulty of training queries, only questions with correctness rates between 0 and 0.8 are retained for further training.
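As a concrete illustration of this filtering step, the sketch below estimates per-question pass rates from sampled rollouts and keeps only non-trivial queries. Here `sample_fn` and `verify_fn` are hypothetical callables standing in for the RFT-model sampler and the outcome verifier, and the exact boundary handling of the 0 to 0.8 range is an assumption.

```python
from typing import Callable, Dict, List

def filter_training_queries(
    questions: List[str],
    sample_fn: Callable[[str, int], List[str]],  # hypothetical rollout sampler
    verify_fn: Callable[[str, str], bool],       # hypothetical binary outcome verifier
    k: int = 16,
    max_rate: float = 0.8,
) -> Dict[str, float]:
    """Keep only questions whose estimated correctness rate lies in (0, max_rate],
    so that queries the model always or never solves are dropped."""
    kept = {}
    for question in questions:
        rollouts = sample_fn(question, k)
        rate = sum(verify_fn(question, r) for r in rollouts) / k
        if 0.0 < rate <= max_rate:
            kept[question] = rate
    return kept
```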

Outcome Reward Signal. We employ Qwen2.5-72B-Instruct [39] as a generative verifier, in conjunction with a rule-based verifier, to evaluate the correctness of the model's outputs and provide binary rewards. This combination enhances the robustness of correctness assessment, mitigating false negatives from the rule-based verifier.
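A minimal sketch of how such a combined reward could be wired, assuming hypothetical `rule_verify` and `llm_verify` callables (the latter standing in for a judge prompt sent to the generative verifier); this is an illustration, not the authors' exact pipeline.

```python
def outcome_reward(question: str, response: str, gold_answer: str,
                   rule_verify, llm_verify) -> int:
    """Binary outcome reward: accept if the rule-based checker matches the reference
    answer, otherwise fall back to the generative verifier (the paper uses
    Qwen2.5-72B-Instruct), which reduces false negatives from formatting differences."""
    if rule_verify(response, gold_answer):
        return 1
    return 1 if llm_verify(question, response, gold_answer) else 0
```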

Training Token-level Reward Model. For the token-level reward model, we directly use the binary outcome rewards provided by the verifier and optimize using the cross-entropy loss:

$$\mathcal{L}_{\text{CE}}=-\mathbb{E}_{(s,r)\sim\mathcal{D}}\left[r\log p(s)+(1-r)\log(1-p(s))\right], \tag{12}$$

where $s$ represents the sampled trajectory, $r\in\{0,1\}$ is the binary outcome reward from the verifier, and $p(s)=\sigma(\frac{1}{T}\sum_{t}^{T}w(s_{t}))$ denotes the predicted probability of correctness under the token-level reward model $w$.
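A minimal sketch of this objective, assuming padding is already masked out; tensor names are illustrative.

```python
import torch
import torch.nn.functional as F

def token_rm_loss(token_scores: torch.Tensor, outcome: torch.Tensor) -> torch.Tensor:
    """Eq. (12): sequence-level cross-entropy for the token-level reward model.
    token_scores: (batch, T) raw scores w(s_t); outcome: (batch,) binary verifier labels.
    The sequence-level prediction p(s) is the sigmoid of the mean token score."""
    seq_logit = token_scores.mean(dim=-1)  # (1/T) * sum_t w(s_t)
    return F.binary_cross_entropy_with_logits(seq_logit, outcome.float())
```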

To further analyze the behavior of the token-level reward model, we visualize its output distribution $w(s_{t})$ during the on-policy RL training process (see Appendix A). In this training paradigm, $w(s_{t})$ assigns token-wise importance scores across the chain-of-thought reasoning process, capturing each token's contribution to the final correctness of the generated response. Consequently, this allows us to leverage $w(s_{t})$ for importance sampling during the optimization process, enabling a more principled selection of informative tokens.

Training Algorithm. The loss function for the policy model follows the formulation described in Section 2. The complete RL training procedure is described in Algorithm 1.

Algorithm 1 The OREAL Reinforcement Learning Algorithm
1:  Inputs: Question set $\mathcal{D}$, policy model $\pi_{\theta}$, token-level reward model $w_{\theta}$, number of iterations $N$, batch size $B$, number of rollouts per question $K$.
2:  Initialize policy $\pi_{0}$ and token-level reward model $w_{0}$ with $\pi_{\text{sft}}$.
3:  for $i=0,\dots,N$ do
4:     Sample a batch of questions $\mathcal{D}_{i}\subseteq\mathcal{D}$ of size $B$.
5:     For $x\in\mathcal{D}_{i}$, generate $K$ policy samples: $Y=\{y_{1},\dots,y_{K}\}$ where $y_{k}\sim\pi_{i}(x)$.
6:     Obtain binary rewards $\{r_{1},\dots,r_{K}\}$ from the verifier.
7:     Compute the correctness rate $p=\frac{1}{K}\sum_{k=1}^{K}r_{k}$ for reward shaping.
8:     Retain only questions with $0<p<1$, discarding trivial cases where all rollouts are correct or all are incorrect.
9:     Select one correct sample $y^{+}$ and one incorrect sample $y^{-}$ for each question to avoid imbalance between positive and negative samples.
10:    Compute token-level importance sampling weights for each token with $w_{i}$.
11:    Update $w_{i}$ using Eq. (12).
12:    Update $\pi_{i}$ using Eq. (11).
13:  end for
14:  Return: the optimized policy model $\pi^{*}$.
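The sketch below mirrors Algorithm 1 in plain Python. Here `policy`, `token_rm`, and `verifier` are hypothetical objects whose interfaces (`sample`, `update`, being callable) are assumptions made for illustration; distributed training, batching, and generation details are omitted.

```python
import random

def oreal_training(questions, policy, token_rm, verifier,
                   num_iters=80, batch_size=64, k_rollouts=16):
    """Minimal single-process sketch of Algorithm 1."""
    for _ in range(num_iters):
        batch = random.sample(questions, batch_size)
        train_pairs = []
        for q in batch:
            rollouts = [policy.sample(q) for _ in range(k_rollouts)]
            rewards = [verifier(q, y) for y in rollouts]   # binary 0/1 outcome rewards
            p = sum(rewards) / k_rollouts                   # per-question pass rate
            if p == 0 or p == 1:                            # drop trivial questions
                continue
            y_pos = next(y for y, r in zip(rollouts, rewards) if r == 1)
            y_neg = next(y for y, r in zip(rollouts, rewards) if r == 0)
            train_pairs.append((q, y_pos, y_neg, p))
        token_rm.update(train_pairs)           # Eq. (12): CE on binary outcome labels
        policy.update(train_pairs, token_rm)   # Eq. (11): reweighted BC + negative term
    return policy
```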

Hyperparameters. The policy model is initialized from the RFT model. Similarly, the token-level reward model is initialized with the same weights, but its output layer is replaced with a linear layer that outputs a scalar. The weights of this layer are initialized to zero to ensure unbiased importance sampling weights at the start of training.

During training iterations, each batch consists of 64 questions, with 16 rollouts per question. The max length of each rollout trajectory is set to 16384 tokens. Then the correctness of each response is averaged to calculate the pass rate, and questions with an overall pass rate of 0 or 1 are discarded. For the remaining trajectories, we retain only one correct response and one incorrect response per question, ensuring a balanced distribution of positive and negative samples for token-level reward model training.

For optimization, the policy model is trained with a learning rate of 5e-7, while the token-level reward model is trained with a learning rate of 2e-6. The latter undergoes a 10-step warm-up phase before training begins. Both models employ a cosine annealing learning rate schedule, decaying to 1/5 of the initial learning rate over time. We optimize both models with the AdamW optimizer. The total number of training steps is 80, with evaluation conducted every 10 steps. The KL coefficient $\beta$ is set to 0.01. We select the best-performing checkpoint according to the evaluation metrics.
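For reference, a minimal PyTorch sketch of this optimization setup (AdamW plus cosine annealing down to one fifth of the initial learning rate); the reward-model warm-up and any distributed or mixed-precision details are omitted, and the helper name is illustrative.

```python
from torch.optim import AdamW
from torch.optim.lr_scheduler import CosineAnnealingLR

def build_optimizers(policy_params, reward_params, total_steps=80):
    """AdamW with cosine annealing to 1/5 of the initial learning rate for both models."""
    policy_opt = AdamW(policy_params, lr=5e-7)
    reward_opt = AdamW(reward_params, lr=2e-6)
    policy_sched = CosineAnnealingLR(policy_opt, T_max=total_steps, eta_min=5e-7 / 5)
    reward_sched = CosineAnnealingLR(reward_opt, T_max=total_steps, eta_min=2e-6 / 5)
    return (policy_opt, policy_sched), (reward_opt, reward_sched)
```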

3.3 Skill-based Enhancement

During the RL training procedure, we observe that the model consistently struggles with certain types of questions, particularly those involving specific knowledge and skill areas, such as trigonometric identity transformations, probability and statistics, and series transformations. We believe this is caused by insufficient learning of these concepts by the base model during the pre-training or RFT stages.

To address this problem, we implement a skill-based enhancement approach, using the MATH dataset to reduce the high cost of skill annotation. Specifically, we annotate each question in the training set with its corresponding core skill. For questions that the model repeatedly fails to answer correctly during the RL phase, we perform data augmentation by including similar questions from the training set that share the same skill. These augmented questions are then added to the training data during the RFT stage to help the model better internalize these skills.
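The sketch below illustrates one way such skill-based augmentation can be organized, assuming the core-skill annotations already exist as a question-to-skill mapping; the data structures and function name are illustrative, not the authors' tooling.

```python
from collections import defaultdict
from typing import Dict, List, Set

def skill_based_augmentation(skill_of: Dict[str, str],
                             failed_questions: Set[str]) -> List[str]:
    """For every question the policy repeatedly fails on during RL, collect other
    training questions annotated with the same core skill and return them as
    additional data for the RFT stage."""
    by_skill = defaultdict(list)
    for question, skill in skill_of.items():
        by_skill[skill].append(question)

    augmented = []
    for q in failed_questions:
        skill = skill_of.get(q)
        if skill is None:
            continue
        augmented.extend(x for x in by_skill[skill] if x not in failed_questions)
    return augmented
```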

4 Experiment

4.1 Evaluation Setup

Baselines. We conduct evaluations against several baselines, including GPT-4o-0513 [42], Claude-Sonnet-3.5-1022 [43], OpenAI-o1-mini, OpenAI-o1-preview [10], Qwen2.5-Instruct-7B, Qwen2.5-Math-Instruct-7B, Qwen2.5-Instruct-32B [39], QwQ-32B-Preview [22], DeepSeek-R1-Distill-Qwen-7B, DeepSeek-R1-Distill-Qwen-32B [13], SimpleRL [44], PRIME [20], and rStar-Math [45]. For some baselines, we directly use the results from their reports, which we mark with *.

Benchmarks. We use several well-established mathematical benchmarks for evaluation, including MATH-500 [21], AIME2024 [46], AIME2025 (Part 1) [46], LiveMathBench [47], and OlympiadBench [48].

Metrics. We report pass@1 under the zero-shot chain-of-thought setting, using greedy decoding for each sample and assessing correctness with OpenCompass [49].
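For clarity, pass@1 with greedy decoding reduces to scoring a single deterministic completion per question, as in the toy sketch below; `policy.generate` and `verify_fn` are hypothetical stand-ins for the actual OpenCompass evaluation pipeline.

```python
def pass_at_1(eval_set, policy, verify_fn) -> float:
    """eval_set: iterable of (question, reference_answer) pairs."""
    items = list(eval_set)
    correct = 0
    for question, reference in items:
        answer = policy.generate(question, temperature=0.0)  # greedy decoding
        correct += int(verify_fn(answer, reference))
    return correct / len(items)
```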

Model | MATH-500 | AIME2024 | AIME2025-I | LiveMath | Olympiad
API Models
GPT-4o-1120 [42] | 72.8 | 16.7 | 13.3 | 44.8 | 33.7
Claude-3.5-Sonnet-1022 [43] | 78.3 | 13.3 | 3.3 | 46.7 | 35.4
OpenAI-o1-preview [10] | 85.5 | 44.6 | 40.0 | 71.0 | 43.6
OpenAI-o1-mini [10] | 90.0 | 56.6 | 46.7 | 74.4 | 46.3
7B Models
Qwen2.5-Instruct-7B [39] | 76.6 | 13.3 | 0.0 | 37.0 | 29.1
Qwen2.5-Math-Instruct-7B [39] | 81.8 | 20.0 | 13.3 | 44.1 | 31.1
rStar-Math-7B [45] | 78.4* | 26.7* | - | - | 47.1*
Qwen2.5-7B-SimpleRL [44] | 82.4* | 26.7* | - | - | 37.6*
Eurus-2-7B-PRIME [20] | 79.2* | 26.7* | - | - | 42.1*
DeepSeek-R1-Distill-Qwen-7B [13] | 92.8* | 55.5* | 40.0 | 65.6 | 64.1
OREAL-7B | 91.0 | 33.3 | 33.3 | 62.6 | 59.9
OREAL-DSR1-Distill-Qwen-7B | 94.0 | 50.0 | 40.0 | 65.6 | 66.1
32B Models
Qwen2.5-Instruct-32B [39] | 80.6 | 20.0 | 13.3 | 50.8 | 40.4
QwQ-32B-Preview [22] | 90.6 | 50.0 | 40.0 | 72.7 | 58.5
DeepSeek-R1-Distill-Qwen-32B [13] | 94.3* | 72.6* | 46.7 | 67.7 | 71.2
OREAL-32B | 95.0 | 60.0 | 46.7 | 74.8 | 72.4
Table 1: Overall evaluation results for OREAL and each baseline. "OREAL-DSR1-Distill-Qwen-7B" denotes DeepSeek-R1-Distill-Qwen-7B trained by OREAL. "AIME2025-I", "LiveMath", and "Olympiad" represent "AIME 2025 Part 1", "LiveMathBench", and "OlympiadBench", respectively. For models at the parameter scale of 7B and 32B, we use bold and underline to represent the best and second-best performance, respectively. For some baselines, we directly use the results from their reports, marked with *.

4.2 Overall Results

Table 1 shows the results of the comprehensive evaluation, highlighting the performance of our proposed models across different parameter scales. Notably, at the 7B scale, OREAL-7B achieves a remarkable pass@1 accuracy of 91.0 on MATH-500 and 59.9 on OlympiadBench. To the best of our knowledge, this is the first time a model of this size has reached such a high level of accuracy using RL instead of distillation. This performance not only establishes a new milestone for RL-based methods but also surpasses significantly larger models, including QwQ-32B-Preview and OpenAI-o1-mini, demonstrating the effectiveness of our approach. Furthermore, after applying OREAL to the previous best 7B model, DeepSeek-R1-Distill-Qwen-7B, the resulting model, OREAL-DSR1-Distill-Qwen-7B, obtains 94.0 and 66.1 pass@1 accuracy on MATH-500 and OlympiadBench, respectively, setting new records among all 7B models. This result verifies the effectiveness of OREAL even when starting from strong initial policies.

For 32B models, OREAL-32B achieves a groundbreaking pass@1 accuracy of 95.0 on MATH-500, 46.7 on AIME2025-I, 74.8 on LiveMathBench, and 72.4 on OlympiadBench, setting a new state-of-the-art among all previously reported models. These results underscore the advantages of our methodology, including its scalability for training superior mathematical reasoning models across different model sizes.

Compared to the most competitive baseline, the DeepSeek-R1-Distill-Qwen series, OREAL-32B demonstrates a clear advantage, whereas OREAL-7B lags slightly behind the distilled 7B model, despite being trained on the same dataset as OREAL-32B. We attribute this discrepancy to the different affinities of the base models for the post-training data. Qwen-7B and Qwen-32B may exhibit varying degrees of knowledge gaps due to model size and pre-training settings. Our training data appears to better complement the existing knowledge of Qwen-32B, while it may be less effective in bridging gaps for Qwen-7B.

In addition, OREAL-DSR1-Distill-Qwen-7B improves the MATH-500 score from 92.8 to 94.0 and also achieves gains on LiveMathBench and OlympiadBench. However, its performance on the AIME benchmark series is comparatively weaker. We observe the same disadvantage for OREAL-32B and OREAL-7B, whose AIME2024 scores are relatively lower than the best scores. Since the overall performance verifies the effectiveness of the OREAL algorithm, we attribute this to deficiencies (e.g., in response quality, query difficulty, and quantity) of the RFT data and RL training queries for obtaining high performance in the AIME domain, and leave this for future work.

4.3 Ablation Study

Setting | MATH-500
Initial Policy | 84.8
+ REINFORCE (baseline) | 85.8
+ Reward Shaping | 86.6
+ Behavior Cloning | 87.6
+ Importance Sampling | 89.0
+ Skill-based Enhancement | 91.0
Table 2: Ablation study of 7B model performance on MATH-500 with different reinforcement learning settings.
Figure 2: Average test accuracy of 7B models across different training steps.

To verify the effectiveness of each component described in Section 2, we progressively add the proposed components on top of the 7B model and compare the evaluation results on MATH-500, starting from REINFORCE [50] as the baseline.

As shown in Table 2, we add each component step by step, where "Reward Shaping" denotes $L_{2}$ introduced in Section 2.3, "Behavior Cloning" denotes $L_{1}$ introduced in Section 2.2, and "Importance Sampling" denotes $L_{total}$ introduced in Section 2.4. The gradual addition of these modules steadily increases the pass@1 score of the 7B model on MATH-500, demonstrating the effectiveness of our method. Ultimately, the policy model is raised from an initial score of 84.8 to 91.0.

We also report the average pass@1 accuracy across all benchmarks during the training process under different RL settings. As shown in Figure 2, the REINFORCE training process is unstable, which can be mitigated by "Reward Shaping". "Behavior Cloning" for positive samples speeds up convergence and shows better performance early in training. Although the performance growth of "Importance Sampling" is relatively slow in the early stage of training, it ultimately obtains the best results.

4.4 Analysis of Initial Policy Models

Model | MATH-500 | AIME2024 | AIME2025-I | LiveMath | Olympiad
OREAL-7B-SFT-wo-enhance | 84.8 | 26.7 | 26.7 | 55.0 | 55.1
OREAL-7B-wo-enhance | 89.0 | 36.7 | 40.0 | 60.1 | 58.1
OREAL-7B-SFT | 86.4 | 26.7 | 26.7 | 54.2 | 56.0
OREAL-7B | 91.0 | 33.3 | 33.3 | 62.6 | 59.9
DeepSeek-R1-Distill-Qwen-7B [13] | 92.8* | 55.5* | 40.0 | 65.6 | 64.1
OREAL-DSR1-Distill-Qwen-7B | 94.0 | 50.0 | 40.0 | 65.6 | 66.1
OREAL-32B-SFT | 92.6 | 43.3 | 46.7 | 71.9 | 68.7
OREAL-32B | 95.0 | 60.0 | 46.7 | 74.8 | 72.4
Table 3: Evaluation of OREAL performance on different initial policy models. Here, "-SFT" and "DeepSeek-R1-Distill-Qwen-7B" denote the initial policy models. "wo-enhance" denotes models that do not use skill-based enhancement during the SFT stage.

We further analyze OREAL by applying it to several different initial policy models, as shown in Table 3. OREAL consistently improves the performance of each initial policy model, including our own trained models and the strong distilled model [13], on MATH-500, LiveMathBench, and OlympiadBench, except for slight fluctuations on AIME2024 and AIME2025 Part 1 when the performance of the initial policy model is already high (e.g., DeepSeek-R1-Distill-Qwen-7B), which demonstrates the generality of OREAL.

After adding the skill-based enhancement data (introduced in Section 3.3), there is a significant rise in MATH-500 scores for the initial policy model (rows 1 and 3) and the corresponding RL-trained model (rows 2 and 4). Since our enhancement is performed primarily for MATH-500, this verifies the effectiveness of the skill-based enhancement approach. In addition, the performance of the model after RL is strongly correlated with the capability of the initial policy model itself. The stronger the initial policy model, the higher the performance that RL can deliver, indicating the importance of policy initialization.

5 Related Work

Stimulating Reasoning with Chain of Thought. In mathematical reasoning tasks, Chain of Thought (CoT) [7] is recognized as a crucial technique to enhance the reasoning ability of large language models (LLMs), which can be implemented through few-shot examples [7] or zero-shot prompt engineering [9]. Self-consistency (SC) [8] further generates multiple CoTs and votes over them. Beyond simple CoTs, various search methods have been explored that simultaneously consider multiple potential CoTs, such as Tree-of-Thought (ToT) [51] and Graph-of-Thought (GoT) [52], which extend the idea to tree or graph structures, offering more flexibility in developing CoTs and backtracking. However, these methods mainly stimulate the reasoning capability of LLMs by prompting without parameter updates; such inference-time techniques do not fundamentally improve the underlying ability of LLMs.

Reasoning Enhancement by Supervised Fine-tuning. To let LLMs genuinely acquire reasoning abilities, many studies [5, 53, 54, 55, 56, 57] have explored synthesizing high-quality data to conduct supervised fine-tuning (SFT) on LLMs. However, this method heavily relies on high-quality training data and an existing high-performing model [15]. As a result, many existing works [11, 12] have turned to distilling knowledge from powerful, large-scale models to synthesize data, yielding good results. However, distillation-based methods inherit the limitations of the teacher model. One criticism of SFT is its limited generalization capability [58]. Some studies argue that SFT merely transforms the model into a knowledge retriever rather than an actual reasoner [59].

Reinforcement Learning for LLMs. Compared to SFT, reinforcement learning (RL) offers better generalization and is therefore considered a more fundamental training approach [58]. Previous attempts at applying RL to LLMs mainly targeted aligning the LLM with human preferences [27]. Later, several works [5, 6, 60, 15] attempted to use it to enhance the model's reasoning and obtained promising results. Recently, the advent of the o1 family of models [10] and a series of o1-like works [13, 44, 20, 45] has made the importance of large-scale RL for reasoning more apparent. Currently, the mainstream approach to RL involves using outcome reward signals [13, 44, 18], and there are different views in the community on how to use that reward signal. ReST-EM [61] and RFT [23] simply select positive samples based on the binary signal and only use them for behavior cloning. GRPO [6], RLOO [35, 62], and REINFORCE [50] use both positive and negative samples for policy updating, but face the challenge of sparse rewards over long sequences. PPO [19] performs preference modeling at the sequence level. Different from them, to explore the limit of outcome rewards, OREAL presents a unified framework grounded in the unique characteristics of mathematical reasoning tasks.

6 Conclusion and Future Work

This paper aims to explore the limit of Outcome REwArd-based reinforcement Learning for mathematical reasoning tasks and proposes a unified policy optimization framework, termed OREAL, grounded in three key insights: 1) behavior cloning on positive trajectories from best-of-N (BoN) sampling is both necessary and sufficient for optimal policy learning under binary feedback; 2) accordingly, a reward-shaping mechanism should be further introduced to transform the reward function derived from the optimal policy; 3) an efficient token-level credit assignment scheme can be achieved through trajectory advantage decomposition without relying on additional value networks. Together, these components form a theoretically grounded, general, and scalable approach for mathematical reasoning tasks. With OREAL, we are the first to raise the MATH-500 pass@1 accuracy of a 7B model to 91 using RL instead of distillation, which even surpasses OpenAI-o1-mini and QwQ-32B-Preview. Even when taking the previous best 7B model, DeepSeek-R1-Distill-Qwen-7B, as the initial policy, OREAL can improve it to 94 pass@1 accuracy on MATH-500, on par with the previous best 32B models. OREAL-32B also obtains new state-of-the-art results among 32B models on MATH-500, LiveMathBench, and OlympiadBench.

Along with the experimental observations presented in this paper, we also identify two factors that are crucial for the success of scalable RL for mathematical reasoning tasks, which become the primary focus of our future work. First, the initial policy model should be as free of knowledge deficiencies as possible, as this serves as the foundation for further improvement during the RL stage. A strong starting point ensures that RL can effectively and efficiently incentivize the underlying capability of LLMs obtained through pre-training or supervised fine-tuning. Toward this goal, a practical approach is to conduct distillation or data synthesis with DeepSeek-R1 or DeepSeek-V3, which is not explored in this work as it is orthogonal to our investigation. Second, the data used in the RL phase must be diverse and sufficient in terms of difficulty, quantity, and scope. A well-balanced dataset enables the model to reach its full potential by exposing it to a broad range of challenges and learning opportunities. Thus, we believe it is still valuable to invest effort in the pre-training and post-training data construction process.

References

  • [1] Fengli Xu, Qianyue Hao, Zefang Zong, Jingwei Wang, Yunke Zhang, Jingyi Wang, Xiaochong Lan, Jiahui Gong, Tianjian Ouyang, Fanjin Meng, et al. Towards large reasoning models: A survey of reinforced reasoning with large language models. arXiv preprint arXiv:2501.09686, 2025.
  • [2] Tianyang Zhong, Zhengliang Liu, Yi Pan, Yutong Zhang, Yifan Zhou, Shizhe Liang, Zihao Wu, Yanjun Lyu, Peng Shu, Xiaowei Yu, et al. Evaluation of openai o1: Opportunities and challenges of agi. arXiv preprint arXiv:2409.18486, 2024.
  • [3] Wentao Liu, Hanglei Hu, Jie Zhou, Yuyang Ding, Junsong Li, Jiayi Zeng, Mengliang He, Qin Chen, Bo Jiang, Aimin Zhou, et al. Mathematical language models: A survey. arXiv preprint arXiv:2312.07622, 2023.
  • [4] Nikolaos Matzakos, Spyridon Doukakis, and Maria Moundridou. Learning mathematics with large language models: A comparative study with computer algebra systems and other tools. International Journal of Emerging Technologies in Learning (iJET), 18(20):51–71, 2023.
  • [5] Huaiyuan Ying, Shuo Zhang, Linyang Li, Zhejian Zhou, Yunfan Shao, Zhaoye Fei, Yichuan Ma, Jiawei Hong, Kuikun Liu, Ziyi Wang, et al. Internlm-math: Open math large language models toward verifiable reasoning. arXiv preprint arXiv:2402.06332, 2024.
  • [6] Zhihong Shao, Peiyi Wang, Qihao Zhu, Runxin Xu, Junxiao Song, Xiao Bi, Haowei Zhang, Mingchuan Zhang, YK Li, Y Wu, et al. Deepseekmath: Pushing the limits of mathematical reasoning in open language models. arXiv preprint arXiv:2402.03300, 2024.
  • [7] Jason Wei, Xuezhi Wang, Dale Schuurmans, Maarten Bosma, Fei Xia, Ed Chi, Quoc V Le, Denny Zhou, et al. Chain-of-thought prompting elicits reasoning in large language models. Advances in neural information processing systems, 35:24824–24837, 2022.
  • [8] Xuezhi Wang, Jason Wei, Dale Schuurmans, Quoc Le, Ed Chi, Sharan Narang, Aakanksha Chowdhery, and Denny Zhou. Self-consistency improves chain of thought reasoning in language models. arXiv preprint arXiv:2203.11171, 2022.
  • [9] Takeshi Kojima, Shixiang Shane Gu, Machel Reid, Yutaka Matsuo, and Yusuke Iwasawa. Large language models are zero-shot reasoners. Advances in neural information processing systems, 35:22199–22213, 2022.
  • [10] OpenAI. Learning to reason with llms. 2024.
  • [11] Zhen Huang, Haoyang Zou, Xuefeng Li, Yixiu Liu, Yuxiang Zheng, Ethan Chern, Shijie Xia, Yiwei Qin, Weizhe Yuan, and Pengfei Liu. O1 replication journey–part 2: Surpassing o1-preview through simple distillation, big progress or bitter lesson? arXiv preprint arXiv:2411.16489, 2024.
  • [12] Yingqian Min, Zhipeng Chen, Jinhao Jiang, Jie Chen, Jia Deng, Yiwen Hu, Yiru Tang, Jiapeng Wang, Xiaoxue Cheng, Huatong Song, Wayne Xin Zhao, Zheng Liu, Zhongyuan Wang, and Ji-Rong Wen. Imitate, explore, and self-improve: A reproduction report on slow-thinking reasoning systems. arXiv preprint arXiv:2412.09413, 2024.
  • [13] DeepSeek-AI, Daya Guo, Dejian Yang, Haowei Zhang, Junxiao Song, Ruoyu Zhang, Runxin Xu, Qihao Zhu, Shirong Ma, Peiyi Wang, Xiao Bi, Xiaokang Zhang, Xingkai Yu, Yu Wu, Z. F. Wu, Zhibin Gou, Zhihong Shao, Zhuoshu Li, Ziyi Gao, Aixin Liu, Bing Xue, Bingxuan Wang, Bochao Wu, Bei Feng, Chengda Lu, Chenggang Zhao, Chengqi Deng, Chenyu Zhang, Chong Ruan, Damai Dai, Deli Chen, Dongjie Ji, Erhang Li, Fangyun Lin, Fucong Dai, Fuli Luo, Guangbo Hao, Guanting Chen, Guowei Li, H. Zhang, Han Bao, Hanwei Xu, Haocheng Wang, Honghui Ding, Huajian Xin, Huazuo Gao, Hui Qu, Hui Li, Jianzhong Guo, Jiashi Li, Jiawei Wang, Jingchang Chen, Jingyang Yuan, Junjie Qiu, Junlong Li, J. L. Cai, Jiaqi Ni, Jian Liang, Jin Chen, Kai Dong, Kai Hu, Kaige Gao, Kang Guan, Kexin Huang, Kuai Yu, Lean Wang, Lecong Zhang, Liang Zhao, Litong Wang, Liyue Zhang, Lei Xu, Leyi Xia, Mingchuan Zhang, Minghua Zhang, Minghui Tang, Meng Li, Miaojun Wang, Mingming Li, Ning Tian, Panpan Huang, Peng Zhang, Qiancheng Wang, Qinyu Chen, Qiushi Du, Ruiqi Ge, Ruisong Zhang, Ruizhe Pan, Runji Wang, R. J. Chen, R. L. Jin, Ruyi Chen, Shanghao Lu, Shangyan Zhou, Shanhuang Chen, Shengfeng Ye, Shiyu Wang, Shuiping Yu, Shunfeng Zhou, Shuting Pan, S. S. Li, Shuang Zhou, Shaoqing Wu, Shengfeng Ye, Tao Yun, Tian Pei, Tianyu Sun, T. Wang, Wangding Zeng, Wanjia Zhao, Wen Liu, Wenfeng Liang, Wenjun Gao, Wenqin Yu, Wentao Zhang, W. L. Xiao, Wei An, Xiaodong Liu, Xiaohan Wang, Xiaokang Chen, Xiaotao Nie, Xin Cheng, Xin Liu, Xin Xie, Xingchao Liu, Xinyu Yang, Xinyuan Li, Xuecheng Su, Xuheng Lin, X. Q. Li, Xiangyue Jin, Xiaojin Shen, Xiaosha Chen, Xiaowen Sun, Xiaoxiang Wang, Xinnan Song, Xinyi Zhou, Xianzu Wang, Xinxia Shan, Y. K. Li, Y. Q. Wang, Y. X. Wei, Yang Zhang, Yanhong Xu, Yao Li, Yao Zhao, Yaofeng Sun, Yaohui Wang, Yi Yu, Yichao Zhang, Yifan Shi, Yiliang Xiong, Ying He, Yishi Piao, Yisong Wang, Yixuan Tan, Yiyang Ma, Yiyuan Liu, Yongqiang Guo, Yuan Ou, Yuduan Wang, Yue Gong, Yuheng Zou, Yujia He, Yunfan Xiong, Yuxiang Luo, Yuxiang You, Yuxuan Liu, Yuyang Zhou, Y. X. Zhu, Yanhong Xu, Yanping Huang, Yaohui Li, Yi Zheng, Yuchen Zhu, Yunxian Ma, Ying Tang, Yukun Zha, Yuting Yan, Z. Z. Ren, Zehui Ren, Zhangli Sha, Zhe Fu, Zhean Xu, Zhenda Xie, Zhengyan Zhang, Zhewen Hao, Zhicheng Ma, Zhigang Yan, Zhiyu Wu, Zihui Gu, Zijia Zhu, Zijun Liu, Zilin Li, Ziwei Xie, Ziyang Song, Zizheng Pan, Zhen Huang, Zhipeng Xu, Zhongyu Zhang, and Zhen Zhang. Deepseek-r1: Incentivizing reasoning capability in llms via reinforcement learning, 2025.
  • [14] Aixin Liu, Bei Feng, Bing Xue, Bingxuan Wang, Bochao Wu, Chengda Lu, Chenggang Zhao, Chengqi Deng, Chenyu Zhang, Chong Ruan, et al. Deepseek-v3 technical report. arXiv preprint arXiv:2412.19437, 2024.
  • [15] Hunter Lightman, Vineet Kosaraju, Yura Burda, Harri Edwards, Bowen Baker, Teddy Lee, Jan Leike, John Schulman, Ilya Sutskever, and Karl Cobbe. Let’s verify step by step. arXiv preprint arXiv:2305.20050, 2023.
  • [16] Kimi Team, Angang Du, Bofei Gao, Bowei Xing, Changjiu Jiang, Cheng Chen, Cheng Li, Chenjun Xiao, Chenzhuang Du, Chonghua Liao, et al. Kimi k1. 5: Scaling reinforcement learning with llms. arXiv preprint arXiv:2501.12599, 2025.
  • [17] Peiyi Wang, Lei Li, Zhihong Shao, Runxin Xu, Damai Dai, Yifei Li, Deli Chen, Yu Wu, and Zhifang Sui. Math-shepherd: Verify and reinforce llms step-by-step without human annotations. In Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 9426–9439, 2024.
  • [18] Amirhossein Kazemnejad, Milad Aghajohari, Eva Portelance, Alessandro Sordoni, Siva Reddy, Aaron Courville, and Nicolas Le Roux. Vineppo: Unlocking rl potential for llm reasoning through refined credit assignment. arXiv preprint arXiv:2410.01679, 2024.
  • [19] John Schulman, Filip Wolski, Prafulla Dhariwal, Alec Radford, and Oleg Klimov. Proximal policy optimization algorithms. arXiv preprint arXiv:1707.06347, 2017.
  • [20] Ganqu Cui, Lifan Yuan, Zefan Wang, Hanbin Wang, Wendi Li, Bingxiang He, Yuchen Fan, Tianyu Yu, Qixin Xu, Weize Chen, et al. Process reinforcement through implicit rewards. arXiv preprint arXiv:2502.01456, 2025.
  • [21] Dan Hendrycks, Collin Burns, Saurav Kadavath, Akul Arora, Steven Basart, Eric Tang, Dawn Song, and Jacob Steinhardt. Measuring mathematical problem solving with the math dataset. arXiv preprint arXiv:2103.03874, 2021.
  • [22] Qwen Team. Qwq: Reflect deeply on the boundaries of the unknown, November 2024.
  • [23] Zheng Yuan, Hongyi Yuan, Chengpeng Li, Guanting Dong, Keming Lu, Chuanqi Tan, Chang Zhou, and Jingren Zhou. Scaling relationship on learning mathematical reasoning with large language models. arXiv preprint arXiv:2308.01825, 2023.
  • [24] Leo Gao, John Schulman, and Jacob Hilton. Scaling laws for reward model overoptimization. In International Conference on Machine Learning, pages 10835–10866. PMLR, 2023.
  • [25] Dongyoung Go, Tomasz Korbak, Germán Kruszewski, Jos Rozen, and Marc Dymetman. Compositional preference models for aligning lms. arXiv preprint arXiv:2310.13011, 2023.
  • [26] Nathan Lambert, Jacob Morrison, Valentina Pyatkin, Shengyi Huang, Hamish Ivison, Faeze Brahman, Lester James V Miranda, Alisa Liu, Nouha Dziri, Shane Lyu, et al. Tülu 3: Pushing frontiers in open language model post-training. arXiv preprint arXiv:2411.15124, 2024.
  • [27] Long Ouyang, Jeffrey Wu, Xu Jiang, Diogo Almeida, Carroll Wainwright, Pamela Mishkin, Chong Zhang, Sandhini Agarwal, Katarina Slama, Alex Ray, et al. Training language models to follow instructions with human feedback. Advances in neural information processing systems, 35:27730–27744, 2022.
  • [28] Jacob Hilton and Leo Gao. Measuring goodhart’s law. OpenAI Research Blog, 2022.
  • [29] Jérémy Scheurer, Jon Ander Campos, Tomasz Korbak, Jun Shern Chan, Angelica Chen, Kyunghyun Cho, and Ethan Perez. Training language models with language feedback at scale. arXiv preprint arXiv:2303.16755, 2023.
  • [30] Thomas Coste, Usman Anwar, Robert Kirk, and David Krueger. Reward model ensembles help mitigate overoptimization. arXiv preprint arXiv:2310.02743, 2023.
  • [31] Lin Gui, Cristina Gârbacea, and Victor Veitch. Bonbon alignment for large language models and the sweetness of best-of-n sampling. arXiv preprint arXiv:2406.00832, 2024.
  • [32] Pier Giuseppe Sessa, Robert Dadashi, Léonard Hussenot, Johan Ferret, Nino Vieillard, Alexandre Ramé, Bobak Shariari, Sarah Perrin, Abe Friesen, Geoffrey Cideron, et al. Bond: Aligning llms with best-of-n distillation. arXiv preprint arXiv:2407.14622, 2024.
  • [33] Harold Jeffreys. An invariant form for the prior probability in estimation problems. Proceedings of the Royal Society of London. Series A. Mathematical and Physical Sciences, 186(1007):453–461, 1946.
  • [34] Yinlam Chow, Guy Tennenholtz, Izzeddin Gur, Vincent Zhuang, Bo Dai, Sridhar Thiagarajan, Craig Boutilier, Rishabh Agarwal, Aviral Kumar, and Aleksandra Faust. Inference-aware fine-tuning for best-of-n sampling in large language models. arXiv preprint arXiv:2412.15287, 2024.
  • [35] Keinosuke Fukunaga and Donald M. Hummels. Leave-one-out procedures for nonparametric error estimates. IEEE transactions on pattern analysis and machine intelligence, 11(4):421–423, 1989.
  • [36] Rafael Rafailov, Joey Hejna, Ryan Park, and Chelsea Finn. From r to Q*: Your language model is secretly a q-function. arXiv preprint arXiv:2404.12358, 2024.
  • [37] Han Xia, Songyang Gao, Qiming Ge, Zhiheng Xi, Qi Zhang, and Xuanjing Huang. Inverse-q*: Token level reinforcement learning for aligning large language models without preference data. arXiv preprint arXiv:2408.14874, 2024.
  • [38] Karl Cobbe, Vineet Kosaraju, Mohammad Bavarian, Mark Chen, Heewoo Jun, Lukasz Kaiser, Matthias Plappert, Jerry Tworek, Jacob Hilton, Reiichiro Nakano, et al. Training verifiers to solve math word problems. arXiv preprint arXiv:2110.14168, 2021.
  • [39] An Yang, Baosong Yang, Beichen Zhang, Binyuan Hui, Bo Zheng, Bowen Yu, Chengyuan Li, Dayiheng Liu, Fei Huang, Haoran Wei, et al. Qwen2.5 technical report. arXiv preprint arXiv:2412.15115, 2024.
  • [40] Conghui He, Wei Li, Zhenjiang Jin, Chao Xu, Bin Wang, and Dahua Lin. Opendatalab: Empowering general artificial intelligence with open datasets, 2024.
  • [41] Jia Li, Edward Beeching, Lewis Tunstall, Ben Lipkin, Roman Soletskyi, Shengyi Huang, Kashif Rasul, Longhui Yu, Albert Q Jiang, Ziju Shen, et al. Numinamath: The largest public dataset in ai4maths with 860k pairs of competition math problems and solutions. Hugging Face repository, 13:9, 2024.
  • [42] OpenAI. Hello GPT-4o, 2024.
  • [43] Anthropic. Claude 3.5 sonnet, 2024.
  • [44] Weihao Zeng, Yuzhen Huang, Wei Liu, Keqing He, Qian Liu, Zejun Ma, and Junxian He. 7b model and 8k examples: Emerging reasoning with reinforcement learning is both effective and efficient. https://hkust-nlp.notion.site/simplerl-reason, 2025. Notion Blog.
  • [45] Xinyu Guan, Li Lyna Zhang, Yifei Liu, Ning Shang, Youran Sun, Yi Zhu, Fan Yang, and Mao Yang. rstar-math: Small llms can master math reasoning with self-evolved deep thinking. arXiv preprint arXiv:2501.04519, 2025.
  • [46] MAA. American invitational mathematics examination - aime. In American Invitational Mathematics Examination - AIME.
  • [47] Junnan Liu, Hongwei Liu, Linchen Xiao, Ziyi Wang, Kuikun Liu, Songyang Gao, Wenwei Zhang, Songyang Zhang, and Kai Chen. Are your llms capable of stable reasoning? arXiv preprint arXiv:2412.13147, 2024.
  • [48] Chaoqun He, Renjie Luo, Yuzhuo Bai, Shengding Hu, Zhen Leng Thai, Junhao Shen, Jinyi Hu, Xu Han, Yujie Huang, Yuxiang Zhang, et al. Olympiadbench: A challenging benchmark for promoting agi with olympiad-level bilingual multimodal scientific problems. arXiv preprint arXiv:2402.14008, 2024.
  • [49] OpenCompass Contributors. Opencompass: A universal evaluation platform for foundation models. https://github.com/open-compass/opencompass, 2023.
  • [50] Richard S Sutton, David McAllester, Satinder Singh, and Yishay Mansour. Policy gradient methods for reinforcement learning with function approximation. Advances in neural information processing systems, 12, 1999.
  • [51] Shunyu Yao, Dian Yu, Jeffrey Zhao, Izhak Shafran, Tom Griffiths, Yuan Cao, and Karthik Narasimhan. Tree of thoughts: Deliberate problem solving with large language models. Advances in Neural Information Processing Systems, 36, 2024.
  • [52] Maciej Besta, Nils Blach, Ales Kubicek, Robert Gerstenberger, Michal Podstawski, Lukas Gianinazzi, Joanna Gajda, Tomasz Lehmann, Hubert Niewiadomski, Piotr Nyczyk, et al. Graph of thoughts: Solving elaborate problems with large language models. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 38, pages 17682–17690, 2024.
  • [53] Longhui Yu, Weisen Jiang, Han Shi, Jincheng Yu, Zhengying Liu, Yu Zhang, James T Kwok, Zhenguo Li, Adrian Weller, and Weiyang Liu. Metamath: Bootstrap your own mathematical questions for large language models. arXiv preprint arXiv:2309.12284, 2023.
  • [54] Haoxiong Liu, Yifan Zhang, Yifan Luo, and Andrew Chi-Chih Yao. Augmenting math word problems via iterative question composing. arXiv preprint arXiv:2401.09003, 2024.
  • [55] Chengpeng Li, Zheng Yuan, Guanting Dong, Keming Lu, Jiancan Wu, Chuanqi Tan, Xiang Wang, and Chang Zhou. Query and response augmentation cannot help out-of-domain math reasoning generalization. arXiv preprint arXiv:2310.05506, 2023.
  • [56] Tiedong Liu and Bryan Kian Hsiang Low. Goat: Fine-tuned llama outperforms gpt-4 on arithmetic tasks. arXiv preprint arXiv:2305.14201, 2023.
  • [57] Xiang Yue, Xingwei Qu, Ge Zhang, Yao Fu, Wenhao Huang, Huan Sun, Yu Su, and Wenhu Chen. Mammoth: Building math generalist models through hybrid instruction tuning. arXiv preprint arXiv:2309.05653, 2023.
  • [58] Tianzhe Chu, Yuexiang Zhai, Jihan Yang, Shengbang Tong, Saining Xie, Dale Schuurmans, Quoc V Le, Sergey Levine, and Yi Ma. Sft memorizes, rl generalizes: A comparative study of foundation model post-training. arXiv preprint arXiv:2501.17161, 2025.
  • [59] Subbarao Kambhampati. Can large language models reason and plan? Annals of the New York Academy of Sciences, 1534(1):15–18, 2024.
  • [60] Haipeng Luo, Qingfeng Sun, Can Xu, Pu Zhao, Jianguang Lou, Chongyang Tao, Xiubo Geng, Qingwei Lin, Shifeng Chen, and Dongmei Zhang. Wizardmath: Empowering mathematical reasoning for large language models via reinforced evol-instruct. arXiv preprint arXiv:2308.09583, 2023.
  • [61] Avi Singh, John D Co-Reyes, Rishabh Agarwal, Ankesh Anand, Piyush Patil, Xavier Garcia, Peter J Liu, James Harrison, Jaehoon Lee, Kelvin Xu, et al. Beyond human data: Scaling self-training for problem-solving with language models. arXiv preprint arXiv:2312.06585, 2023.
  • [62] Arash Ahmadian, Chris Cremer, Matthias Gallé, Marzieh Fadaee, Julia Kreutzer, Olivier Pietquin, Ahmet Üstün, and Sara Hooker. Back to basics: Revisiting reinforce style optimization for learning from human feedback in llms. arXiv preprint arXiv:2402.14740, 2024.

Appendix A Token Level Reward Model Score Visualization

Figures A1 and A2 show the token-level reward model scores across responses. The values are normalized to [0, 1]. Cooler colors indicate higher reward scores, while warmer colors denote lower scores. For correct responses, the overall rewards are high, especially toward the end, although there are a few lower sections in the middle. For incorrect responses, the distribution of rewards is reversed, and the closer to the end, the lower the rewards. This indicates that not all tokens contribute equally to the response, so it is important to assign token-level credit across the sequence.

Figure A1: Token-level reward model score visualization for a correct response.
Figure A2: Token-level reward model score visualization for an incorrect response.

Appendix B Prompt

Figure A3 shows the system prompt of the verifier model, which is used during RL training to provide the binary outcome reward for a response. Figure A4 shows the system prompt we use for fine-tuning, RL training, and evaluation.

Figure A3: Prompts for the model-based generative verifier.
Figure A4: System prompts for long CoT reasoning.