
LLM Critics Help Catch LLM Bugs

Nat McAleese* Rai (Michael Pokorny)* Juan Felipe Cerón Uribe Evgenia Nitishinskaya* Maja Trębacz*

Jan Leike
OpenAI

Abstract

Reinforcement learning from human feedback (RLHF) is fundamentally limited by the capacity of humans to correctly evaluate model output. To improve human evaluation ability and overcome that limitation, this work trains "critic" models that help humans to more accurately evaluate model-written code. These critics are themselves LLMs trained with RLHF to write natural language feedback highlighting problems in code from real-world assistant tasks. On code containing naturally occurring LLM errors, model-written critiques are preferred over human critiques in a majority of cases, and human evaluation finds that models catch more bugs than human contractors paid for code review. We further confirm that our fine-tuned LLM critics can successfully identify hundreds of errors in ChatGPT training data rated as "flawless", even though the majority of those tasks are non-code tasks and thus out-of-distribution for the critic model. Critics can have limitations of their own, including hallucinated bugs that could mislead humans into making mistakes they might have otherwise avoided, but human-machine teams of critics and contractors catch similar numbers of bugs to LLM critics while hallucinating less than LLMs alone.

1 Introduction

The most capable AI systems currently deployed are trained with reinforcement learning from human feedback (RLHF) [30]. This takes advantage of the fact that the evaluation of AI output is typically faster and easier for humans than the demonstration of ideal output [15].
However, as models become more capable, they will soon reach the point at which even seasoned experts are unable to reliably assess the quality or correctness of their outputs. This predicted deficiency of human evaluation is a fundamental limitation of RLHF [3]. Further, if systematic flaws in human evaluation exist and are strongly optimized against, then this could lead to dangerous policies [28, 6]. The field of "scalable oversight" aims to tackle this problem by training models that help humans to correctly evaluate model output [2].
Previous work has demonstrated that oversight methods like debate have the potential to help humans more accurately assess the answers to reading comprehension questions [12, 13, 23]. However, these works apply their methods primarily to multiple choice questions about short science fiction stories that the judges have not read [21]. While that toy setting was invaluable for early scalable oversight
(a) Both ChatGPT and CriticGPT critiques are preferred by annotators over human critiques of model output on code with Human Inserted Bugs. Scale is linear in Elo.
(b) Both ChatGPT and CriticGPT catch substantially more inserted bugs than human contractors when writing critiques. In our view it is surely possible to find some people that could outperform current models, but this is a representative sample of the experienced contractors used in production for both ChatGPT and CriticGPT.
Figure 1
research, methods must now be proven in more realistic settings. Here we demonstrate for the first time that scalable oversight can help humans more comprehensively assess model-written solutions to real-world assistant tasks. In particular we focus on one of the most important and economically impactful applications of LLM assistants: writing code.
The core idea of our approach is simple: following Saunders et al. [26] we train an autoregressive policy that accepts as input a (question, answer) pair and then outputs a text critique which points out errors in that answer. Unlike Saunders et al., we do so using RLHF on challenging real-world data and we find that the resulting GPT-4-based critic model, which we call CriticGPT, outperforms representative humans at challenging bug detection tasks.
Figure 1 summarizes these high-level results, showing that LLMs catch substantially more inserted bugs than qualified humans paid for code review, and further that model critiques are preferred over human critiques a majority of the time. Figure 2 provides an illustrative example of a model-written critique on a question taken from Perry et al. [22].
We also investigate human-machine teams and find that Human+CriticGPT move beyond the model-only frontier by writing more comprehensive critiques while simultaneously better avoiding nitpicks and hallucinations.
Our contributions are:
  • We show the first demonstration of a simple scalable oversight method that helps humans more comprehensively spot problems in real-world RLHF data.
  • We find that CriticGPT's critiques catch more inserted bugs and are preferred over critiques written by human contractors from the ChatGPT and CriticGPT training pool.
  • We show that human-machine teams of contractors assisted by critic models write more comprehensive critiques than contractors alone while reducing the hallucination rate compared to models.
  • We present an inference-time sampling and scoring strategy, Force Sampling Beam Search (FSBS), that balances the tradeoff between the number of real and spurious issues included in LLM critiques.
Task

CriticGPT

Checking if an absolute path of a file is within a directory by using startswith() is insecure. Users could exploit this with symlinks or by naming directories similarly. Using os.path.commonpath([absolute_file_path, absolute_safe_dir]) or a more robust method of path containment is advisable.
Figure 2: Critics accept a (question, answer) pair as input and output a critique which points out specific errors in the answer. Here CriticGPT's comment points out a security error made by ChatGPT-4 when presented with a question from Perry et al. [22]. Critiques generally consist of multiple comments, each associated with a quoted section of the answer.
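The issue and fix in Figure 2 can be made concrete. The sketch below is illustrative and not taken from the paper; as the critique notes, `os.path.commonpath` defeats the similar-name exploit, while defeating symlink tricks would additionally require resolving paths with `os.path.realpath`:

```python
import os

def is_within_dir_insecure(file_path, safe_dir):
    # Flawed check: "/safe_evil/x".startswith("/safe") is True, so a
    # sibling directory with a similar name escapes the intended sandbox.
    return os.path.abspath(file_path).startswith(os.path.abspath(safe_dir))

def is_within_dir(file_path, safe_dir):
    # Containment via commonpath: true only when safe_dir is a prefix of
    # the file path at a path-component boundary. (Symlinks would still
    # need os.path.realpath to be handled robustly.)
    abs_file = os.path.abspath(file_path)
    abs_dir = os.path.abspath(safe_dir)
    return os.path.commonpath([abs_file, abs_dir]) == abs_dir
```

Here `is_within_dir_insecure("/safe_evil/x", "/safe")` wrongly returns `True`, while the `commonpath` version rejects it.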

2 Methods

Our LLM critics are auto-regressive Transformer policies similar to InstructGPT and ChatGPT [19]. They are trained or prompted to accept a (question, answer) pair as input. They output a plain text "critique" that points out potential problems in the answer. The critiques output by the model follow a particular format by attaching comments to quotes from the answer as shown in Figure 2, but each critique can contain multiple such quotes with comments about each problem. We first describe how these critic models are evaluated (2.1) and then how they are trained (2.2).
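As a rough illustration of this structure, a critique in the quote-then-comment layout of Figure 2 can be parsed back into (quote, comment) pairs. The fencing format assumed here is a simplification, not the production format:

```python
import re

def parse_critique(critique):
    # Each highlight is a fenced block quoting the answer, followed by a
    # comment describing the problem in that highlight (assumed layout).
    pattern = r"```\n(.*?)```\n(.*?)(?=```|\Z)"
    return [(quote.strip(), comment.strip())
            for quote, comment in re.findall(pattern, critique, flags=re.DOTALL)]
```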

2.1 Evaluation

2.1.1 Critique Attributes

Consider two possible critiques of the same buggy program. Suppose one of these critiques contains both a correct description of a serious bug and a false claim, while the alternative just points out two minor quibbles. Which critique is better? One is partly incorrect but arguably more useful, while the other fails to point out a serious problem but contains no explicit errors. To disentangle this we ask contractors about the following features of a critique:
  • Whether it was comprehensive, i.e. did not omit any clear and severe issues (comprehensiveness).
  • Whether it caught a particular bug specified a priori, which we call "critique-bug inclusion" (CBI).
  • Whether it included any hallucinated bugs or any nitpicks.
  • An overall subjective helpfulness rating that accounts for the above in addition to style and general usefulness.
Contractors rate each of these features for critiques on a 1-7 ordinal scale. Details of the forms used are included in Appendix 7.4.
The first two of these features (CBI and comprehensiveness) are similar to recall - by writing long critiques that cover many points the model will typically increase these ratings. We find that longer critiques are, however, also more likely to include hallucinations and nitpicks. Discussion of that tradeoff is included in section 3.4. Contractors were instructed, in their overall rating, to favor critiques that are accurate, comprehensive, concise, and free of nitpicks - more details of the instructions are provided in subsection 7.4.
Figure 3: Illustration of data collection. Contractors modify ChatGPT responses to insert subtle bugs. They record an explanation of every bug they introduce as if they had caught the bug in code review, and verify it is not easily caught by a critic. After "tampering" with a piece of code to insert bugs, contractors proceed to ranking critiques of the tampered version.

2.1.2 Critique Comparisons & Elo Scores

Contractors answer the critique attribute questions as part of a comparison task in which they see four critiques of the same problem simultaneously. They are always blind to the source of the critique. For a given critique comparison task we can compare the scores to get a preference rate for any given attribute. For example, if critique A receives a higher comprehensiveness score than critique B, then A is rated as more comprehensive than B. Because human ratings are more consistent within a comparison than globally, this gives us a less noisy estimate of how the models perform relative to each other [29].
In order to summarize these pairwise preference rates between models we report Elo scores. Elo scores are computed by using BFGS to fit a pairwise model to the comparisons from our data collection. The probability of a contractor preferring a response produced by model A with Elo rating R_A over a response from model B with Elo rating R_B is estimated as 1 / (1 + 10^((R_B - R_A) / 400)). This is the estimated win-rate of model A over model B. Ties are included as half a win and half a loss. Confidence intervals are reported from a nonparametric bootstrap.
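The Elo win-probability above is the standard 400-point logistic and can be computed directly; a minimal sketch:

```python
import math

def win_prob(elo_a, elo_b):
    # Probability that a contractor prefers model A's critique over
    # model B's, under the standard Elo model with a 400-point scale.
    return 1.0 / (1.0 + 10 ** ((elo_b - elo_a) / 400.0))
```

For example, a 400-point Elo advantage corresponds to an estimated win-rate of about 91%.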

2.1.3 Human Critiques & LLM Assistance

In addition to paying contractors to rate model-written critiques, we also had them review answers to write their own critiques. Contractors were drawn from the same pool used for ChatGPT training and had a median of five years of experience with Python (self-reported). They took a median of around fifty minutes per example to write their critique. They were able to execute code, look up reference manuals, and decline rating examples. Overall, however, only a small fraction of tasks were declined, typically because they were not in English or because they were broken.
During this task contractors can also be given access to an LLM critic to study the impact of model assistance on human performance. In practice this was done by pre-seeding the contractor response with the output of the LLM critic. Figure 4 shows how these pre-filled critiques were used by the contractors when available. When looking at the statistics of how model-written comments are used we find that it was common to reject some of the suggested comments from the critique. Adding additional comments was rarer but also occurred.
Figure 4: How do contractors interact with pre-filled critiques? We measure what contractors did in human-machine critique teams. Keeping LLM comments without modification and removing them were both common. Adding new comments and editing phrasing was less common.
These critiques are evaluated similarly to LLM critiques as described in 2.1.1. The same contractor pool completed both critique comparisons and critique demonstrations, but we ensured for evaluation that no contractor rated their own critiques to avoid overestimation of quality.

2.1.4 Evaluation Data Distribution

Almost all of our training and evaluation inputs originate from the OpenAI RLHF pipeline. They consist of (question, answer) pairs. We down-sampled to cases where the model responded with code using a heuristic: examples were included if a sufficient fraction of the lines in the model response were Python. We extract the largest code block as the model's "answer" and discard any other sampled text. We report results on two sets containing known bugs based on this data stream:
  • Human Inserted Bugs: We pay contractors to insert bugs into answers in a "tampering" process (2.2) and hold out a subset for evaluation.
  • Human Detected Bugs: A set of programming tasks from the ChatGPT labelling pipeline where the original rater gave the response a low score and pointed out a clear issue.
These two distributions have different characteristics: Human Detected Bugs are more "natural" but typically easier for humans to spot. After all, they were emitted by a model and already caught by a person once! Inserted Bugs by contrast are both harder for humans to catch and more severe on average but are not from the natural distribution of model errors. These differences are illustrated in Appendix Figure 10. When rating critiques in both of these settings, contractors were shown the original bug descriptions (written when the bug was inserted or first detected) as context. We find that the critique comparisons that also include a reference bug description have better inter-annotator agreement than critique rating without a reference, with details in Appendix 7.5.
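The largest-code-block extraction step described in 2.1.4 can be sketched as follows, assuming responses fence code with markdown blocks (the exact production heuristic is not available, so this is only an approximation):

```python
import re

def largest_code_block(response):
    # Collect every fenced code block (optionally tagged with a language)
    # and return the longest one as the model's "answer"; None if no code.
    blocks = re.findall(r"```(?:\w+)?\n(.*?)```", response, flags=re.DOTALL)
    return max(blocks, key=len) if blocks else None
```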

2.2 Training Critics with RLHF

We follow an RLHF pipeline similar to ChatGPT to train our LLM critics:
Step 1: We sample several critiques for each (question, answer) pair in the dataset.
Step 2: Contractors rate the attributes of the sampled critiques, including overall quality.
Step 3: We train a reward model to predict the human overall quality rankings.
Step 4: We optimize a policy against the reward model using PPO [31, 27].
Step 5: We apply an inference-time sampling strategy that we call Force Sampling Beam Search (FSBS) as described in 2.3.
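The ranking objective in Step 3 is commonly the pairwise (Bradley-Terry) reward-modelling loss; the sketch below assumes scalar rewards per critique, and the exact objective used here is not spelled out in the text:

```python
import math

def pairwise_rm_loss(score_preferred, score_rejected):
    # -log sigmoid(r_w - r_l): low when the reward model scores the
    # human-preferred critique above the rejected one by a wide margin.
    return -math.log(1.0 / (1.0 + math.exp(-(score_preferred - score_rejected))))
```

The loss equals log 2 when the two scores tie and shrinks as the margin grows.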
However, for a large fraction of our data, the answer comes from a key first step:

2.2.1 Step 0: Tampering

In this optional first step we asked contractors to introduce subtle bugs into model-written answers by editing them. After inserting each bug, contractors recorded an explanation of the problem they introduced as if they had caught it in code review. This is illustrated visually in Figure 3.
Tampers were collected adversarially: contractors had access to an LLM critic, and we asked them to verify that it misses each bug they introduce in at least one out of three samples. This 1-in-3 constraint was not strictly enforced, but adversarial collection noticeably increased the subtlety of the introduced bugs, making them harder to find both for the research team and for LLM critics. We updated the LLM critic available to contractors twice throughout data collection; much of the training data was adversarial to ChatGPT and much was adversarial to an early version of CriticGPT.
In order to validate that tampers were not trivially detectable we trained discriminators to predict whether code had been tampered with, finding that small models struggle with the balanced "tamper or not" prediction task despite smooth performance improvements with increased scale. Further details of these discriminators are included in Appendix 7.7.
After tampering, the same contractor then proceeded to compare and rate critiques of the modified answer. The large majority of our training data came from this tampering procedure; however, we also included some unmodified data with no reference bug. The training distribution is therefore more similar to the Human Inserted Bugs distribution than to the Human Detected Bugs distribution.

2.2.2 Why Tampering?

Our goal is to find a method that will apply to long-form and open-ended tasks for which we do not have a ground-truth reward function. One could simply train critique models on unmodified answers but that approach has at least the following issues:
  • Preference rates are impacted by a number of stylistic factors and may over-estimate model performance.
  • Contractors may struggle to validate the correctness of free-form critiques if they make obscure claims.
  • Contractors may struggle to spot important issues that critiques miss.
  • Many answers will not contain severe issues, reducing the value of the data for improving critiques.
The advantage of the tampering approach is that it provides a source of high-quality, difficult reference bugs, which ground the ranking task by allowing contractors to more easily discriminate between critiques catching and missing the problem. It simultaneously ensures that the majority of the data are collected on buggy code that is similar to the LLM distribution, unlike alternative options such as finding a preexisting dataset of bugs.

2.2.3 Critique Comparisons in Detail

After inserting a bug in the tampering step, contractors then compare critiques of the tampered code and rate them according to the attributes from 2.1.1. In particular they rank three LLM critiques and a "gold critique" which is formed from the bug descriptions they provided during the tampering task. Ratings of contractors' own critiques were typically inflated as compared to independent re-rating, but occasionally they would indicate that model critiques were more helpful or comprehensive than their own "gold standard". Critique-bug inclusion was rated for each tamper-introduced bug individually.
Data collection ran for several months over many iterations of models and model sizes. Collected data was merged into one large training set, with details included in Appendix 7.2. In addition to collecting critique comparisons on code with human-inserted bugs we also collected training data from critiques of unmodified samples. We found unmodified inputs had lower inter-annotator agreement rates on critique comparisons and resulted in worse-performing critics (section 3.5).

2.2.4 RLHF

Our LLM critics are GPT-4 family Transformer language models pre-trained with next-token prediction following [30]. To understand how much specific training for critique impacts model performance we aimed to keep our methods similar to ChatGPT. To highlight similarities and differences:
  • All versions of CriticGPT and ChatGPT used in this work were initialized from the same checkpoint (both policies and reward models).
  • Our reward model was trained on a mix of ChatGPT and CriticGPT data tuned to maximize performance on the CriticGPT validation set. In practice this included all of our critique comparison data and as much ChatGPT data as the compute budget allowed.
  • CriticGPT was fine-tuned with less compute than ChatGPT.
  • The PPO prompt distribution for CriticGPT consisted only of prompts asking for a critique from the reward modelling dataset.

2.3 Force Sampling Beam Search (FSBS)

In addition to RLHF we also used our reward model in combination with search in an approach we call Force Sampling Beam Search. This procedure lets us generate critiques that are longer and more comprehensive with a reduced rate of hallucinations or nitpicks.
The critic model takes as input a (question, answer) pair and outputs a structured critique containing quotes from the answer and comments on potential problems. In the critique, quoted sections of the answer are rendered as "highlights" via markdown code blocks beginning with "```", each followed by comments indicating what errors occur in that highlight. In FSBS we search over critiques by forcing the model to produce highlighted sections with constrained sampling and then select the best-scoring critiques according to the expression rm_score + LENGTH_MODIFIER * num_highlights. For the experiments presented here, we performed a search over 28 total samples per input. We explored 4 values of LENGTH_MODIFIER that map to the 10th, 25th, 50th and 75th percentile of critique lengths sampled during the search. Appendix 7.1 provides more details of constrained sampling and search.
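The FSBS selection rule is a simple scalarization; a minimal sketch, where the candidate critiques and their reward-model scores are hypothetical stand-ins for real sampled critiques:

```python
from dataclasses import dataclass

@dataclass
class Candidate:
    text: str
    rm_score: float       # reward-model score (hypothetical values here)
    num_highlights: int   # number of forced "highlight" sections

def fsbs_select(candidates, length_modifier):
    # Pick the critique maximizing rm_score + LENGTH_MODIFIER * num_highlights;
    # larger modifiers favor longer, more comprehensive critiques.
    return max(candidates,
               key=lambda c: c.rm_score + length_modifier * c.num_highlights)
```

Sweeping `length_modifier` traces out the comprehensiveness-vs-hallucination tradeoff discussed in section 3.4 without retraining the critic.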
Unless otherwise noted, CriticGPT refers to RL+FSBS at the 50th percentile (producing four highlights on average).

3 Results

3.1 LLM critiques are often preferred over human critiques and catch more inserted bugs

As a first indication that our methods improve model critiques we track the rate at which model-written critiques get higher overall ratings in the comparison task. Figure 1a shows that critiques written by CriticGPT are substantially preferred by contractors over critiques from prompted ChatGPT and over human-written critiques sourced from our group of contractors, according to the overall rating. This cohort and corresponding set of human critiques does not represent the best possible human performance; however it is representative of the set of contractors used for ChatGPT. The same set of human critiques allows us to compare the rate at which a second person will catch an inserted bug; in Figure 1b we see that our Tamper+RLHF pipeline greatly improves the rate at which inserted bugs are caught, with both LLM critics (prompted ChatGPT and CriticGPT) catching many more bugs than the human annotators.
We looked at the human critiques in some detail to understand the source of the performance difference. In general, contractors (despite the median handling time of 50 minutes) made fewer overall comments when they did not have LLM help. Many of the tasks also require domain knowledge and while the contractors did have nontrivial Python experience it was clear they sometimes did not have the domain knowledge needed, e.g. of particular Python libraries. Some fraction of the tampers are also just outright challenging to spot. In our view these three phenomena (shorter code reviews, domain knowledge and task difficulty) account for the majority of contractor under-performance relative to LLMs (in that order).

3.2 CriticGPT RL training improves models across pre-training scale

Another method by which one can improve the rate of detected bugs is simply to use a larger model. We characterize the impact of model scale on bug detection in this setting by training and evaluating a family of models with RLHF in Figure 5. Each model is a Transformer in the GPT-4 family and we record their size as the fraction of GPT-4 compute used in pre-training. This also allows us to naively estimate the amount of additional pre-training compute that would be needed to achieve the same performance as CriticGPT (RL only) without additional data or modification to the ChatGPT pipeline, which is around 30x.
(a) When contractors compare critiques of the same code, CriticGPT (RL only) is substantially preferred over prompted ChatGPT across model scales on Human Inserted Bugs.
(b) Contractors indicate that CriticGPT (RL only) misses inserted bugs substantially less often than ChatGPT. Training a larger model also reduces the rate at which inserted bugs are missed, but naive extrapolation implies that to match the performance of CriticGPT (RL only) on this distribution would require around a 30x increase in pre-training compute.
Figure 5
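The naive extrapolation behind the ~30x figure can be sketched as a least-squares fit of miss rate against log-compute, solved for the compute at which the fit reaches the target rate. The data points below are synthetic, not the paper's measurements:

```python
import math

def compute_multiplier_needed(points, target_rate):
    # points: (compute_fraction, miss_rate) pairs for a trained model family.
    # Fit miss_rate ~ a + b * log10(compute) by least squares, then solve
    # for the compute reaching target_rate, as a multiple of the largest model.
    xs = [math.log10(c) for c, _ in points]
    ys = [r for _, r in points]
    n = len(xs)
    xbar, ybar = sum(xs) / n, sum(ys) / n
    b = sum((x - xbar) * (y - ybar) for x, y in zip(xs, ys)) \
        / sum((x - xbar) ** 2 for x in xs)
    a = ybar - b * xbar
    log_c = (target_rate - a) / b
    return 10 ** log_c / points[-1][0]
```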

3.3 CriticGPT helps humans write more comprehensive critiques

The results on CBI and overall preference so far demonstrate CriticGPT performance on inserted bugs. Figures 6 and 7 show that the models also perform well on Human Detected Bugs that were previously caught by contractors in the standard RLHF pipeline. In this setting we highlight both that Human+CriticGPT teams write substantially more comprehensive critiques than humans alone and that CriticGPT improves comprehensiveness over ChatGPT on both human detected and inserted bugs. We do find, however, that the rate of nitpicks and hallucinated bugs is much higher for models than for humans, though CriticGPT is able to substantially reduce this rate over ChatGPT. Human+CriticGPT teams fall somewhere in the middle with an intermediate hallucination/nitpick rate between Human and CriticGPT.

3.4 FSBS lets us navigate tradeoffs between comprehensiveness and hallucinations

Throughout the project we found that the probability of catching a bug increases with the number of claims that a critique makes. This is unsurprising - a long list of problems is more likely to include both some particular issue and a nitpick. As described in 2.1.1 we therefore ask contractors both whether a critique is comprehensive as well as whether it contains any nitpicks or hallucinated problems. Similarly to absolute length, we find that models which hallucinate bugs more often are also more likely to catch human inserted and previously detected bugs.
We see this as analogous to precision and recall: informally, a more precise critique model is one for which each emitted claim is more likely to be true and a model with better recall is one that is more likely to catch a given inserted bug. Unfortunately it is not obvious what the right tradeoff between hallucinations and bug detection is for an overall RLHF system that uses critiques to enhance model performance. An ideal experiment would run entirely separate critique-enhanced RLHF data collection loops for each precision/recall point; but this is prohibitively expensive.
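This analogy can be made concrete with a small sketch. The annotation fields below (`claims`, `is_real_bug`, `bug_id`) are hypothetical stand-ins for contractor labels, not the paper's actual schema:

```python
# Illustrative only: critique "precision" is the fraction of emitted claims
# that annotators marked as real problems; critique "recall" is the fraction
# of known inserted bugs that at least one claim caught.

def critique_precision(critiques):
    """Fraction of all emitted claims that point at a real problem."""
    claims = [c for crit in critiques for c in crit["claims"]]
    if not claims:
        return 0.0
    return sum(c["is_real_bug"] for c in claims) / len(claims)

def critique_recall(critiques, inserted_bugs):
    """Fraction of known inserted bugs caught by at least one claim."""
    caught = {c["bug_id"] for crit in critiques
              for c in crit["claims"] if c.get("bug_id") in inserted_bugs}
    return len(caught) / len(inserted_bugs) if inserted_bugs else 0.0
```

A model that emits many claims tends to gain recall while losing precision, which is exactly the tradeoff discussed in this section.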
FSBS (detailed in 2.3) uses our reward models and test-time search as a mitigation for this. By varying the scalar length penalty we find that we can establish a Pareto curve, shown in Figure 8, between comprehensiveness and spurious claims. FSBS should allow a good tradeoff for RLHF data collection to be selected at deployment time without re-training the critique model. In Appendix section 7.1 we show that combining ChatGPT with our reward models is less effective. Considering this tradeoff also reveals the benefit of human-machine teams. When working together (as described in 2.1.3), Human+CriticGPT teams move beyond the model-only Pareto frontier. In this Human Detected Bugs evaluation it is also clear that RL-only CriticGPT can be less comprehensive than the ChatGPT baseline. We think this is driven by two effects. First, RL-only CriticGPT is more conservative, producing much higher-precision critiques at the cost of recall. Second, there was a reduction in compute between ChatGPT and CriticGPT. FSBS results in more comprehensive critiques.

Figure 6: Humans write substantially more comprehensive critiques with help from LLM critics. Human-machine teams do not increase comprehensiveness, but see Section 3.4 for their positive impact on hallucination rate.

Figure 7: Human critiques contain many fewer nitpicks and hallucinations than LLM critiques. CriticGPT also substantially reduces these rates relative to the ChatGPT baseline. Human-machine teams hallucinate and nitpick less than both CriticGPT and ChatGPT.

3.5 Ablations

The production version of ChatGPT used throughout this paper was trained with significantly more data and compute than our research models. For a closer comparison we also trained an RM and policy using a subset of ChatGPT data with a training duration and hyperparameter setup more similar to our CriticGPT models. The checkpoint for the policy model was selected to maximize the CBI for human-inserted bugs on the validation set. This approach provides a cleaner comparison that better isolates the impact of the data collection method from the effects of training duration and pipeline setup. This version of ChatGPT is included in Figure 8 as "ChatGPT (less training)". We find that, in comparison with this closer reference point, CriticGPT (RL only) has both higher precision and higher recall on code with Human Detected Bugs. Training on our data is more effective than training on the typical ChatGPT dataset for producing a code critic, even when generalizing to Human Detected Bugs (see also the discussion of generalization in Appendix 7.6).

Figure 8: We find that there is a tradeoff between the number of spurious claims from a critic and the comprehensiveness of the critique. Using FSBS we can trade off comprehensiveness and hallucinations, though we do not currently know what balance is optimal for improving the performance of annotators in an RLHF pipeline. Results shown on the Human Detected Bugs distribution.
We also investigated the performance of CriticGPT models trained only on comparisons of critiques for unmodified code, i.e. excluding the tampering step from our data pipeline. According to our automated evaluations (which replicated the human evaluations described in 2.1.1 with GPT-4), models trained without tamper data at our compute budget severely under-performed on both inserted and detected bugs, so we do not present human evaluations here. We suspect this is because the lower inter-annotator agreement rates on comparisons without reference bugs resulted in a worse reward model, which degraded performance in this setting.

3.6 LLM critics generalize beyond code

In addition to our experiments on code we investigated how LLM critics perform on general assistant tasks. We sampled a critique from CriticGPT (RL only) for a large subset of all ChatGPT training data that had been rated as "flawless" by a first human annotator. In cases where the sampled critique identified a problem we asked humans to review the completion with access to the critique. In of cases contractors indicated that the critique found a problem that substantially decreased the rating of the answer; in a separate replication without critiques completions rated "flawless" by one contractor were rated similarly poorly by a second only of the time. We also investigated using our critique reward models to prioritize data and found that prioritizing tasks for which the sampled critique was highly scored increased the chances of catching a problem and decreased the rate of hallucinations.
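The data-prioritization idea above can be sketched as follows. Here `sample_critique` and `rm_score` are hypothetical stand-ins for a critic sample call and the critique reward model, not the paper's actual API:

```python
# Illustrative sketch: sample one critique per task, score it with a critique
# reward model, and send only the highest-scoring critiques for human review.

def prioritize_for_review(tasks, sample_critique, rm_score, budget):
    """Return the `budget` tasks whose sampled critique the RM scores highest."""
    scored = [(rm_score(t, sample_critique(t)), t) for t in tasks]
    scored.sort(key=lambda pair: pair[0], reverse=True)  # best critiques first
    return [t for _, t in scored[:budget]]
```

Under this scheme, human review effort is concentrated on completions where the critic is most confident it found a real problem, which (per the result above) raises the hit rate and lowers the hallucination rate of the reviewed critiques.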
4 Related Work

LLM critique and related methods have been explored in two distinct lines of past work.
Works focused on self-correction deploy additional compute via self-critique to improve the quality of the final LLM response. Improvements from self-correction and closely related methods have been claimed for harmlessness [1], factuality [24], computer control [14] and moral reasoning [5], as well as in other domains. In general, self-correction methods have succeeded more clearly when they make use of additional side-channel information during the correction phase, as opposed to the "intrinsic self-correction" setting without additional information during critique [10].
In contrast to work on self-correction, scalable oversight seeks not to improve the capability of the base model but instead to increase the ability of a human judge to correctly assess model answers [2]. Several oversight methods were proposed in theory before they were practically viable, including Debate, Recursive Reward Modeling, and Market Making [12, 15, 11]. Progress since those proposals has included empirical demonstrations that human-machine teams can improve accuracy on MMLU and QuALITY over both human-only and machine-only baselines [2]. Debate in particular has been shown both to be a viable algorithm for multi-agent RL [23], and debating with more persuasive LLMs has been shown to correlate positively with judge accuracy on QuALITY [13].
Past work has also investigated using deep learning for code review in order to improve the quality of human code. Our system looks very different because we are able to take advantage of much more powerful pre-trained models, and we apply them primarily to the task of reviewing LLM-written code instead of human code.
The closest predecessor to CriticGPT is Saunders et al. [26] which also directly trains models to produce critiques. In contrast to that work we use RLHF, larger models, and a more challenging real-world evaluation setting.

5 Discussion & Limitations

The ultimate goal of scalable oversight is to help humans evaluate model output in order to train better and safer policies. An ideal evaluation of critique would therefore compare the quality of the final code-writing policies that result from RLHF with and without critique assistance for the human contractors. Unfortunately this is expensive, so here we assume that if we can help contractors to catch more bugs, then the resulting data will produce improved policies. It is worth noting at least one reason why this assumption might not be true: while critics can help to remove some human biases, they may introduce new ones, and new consistent biases in labels may degrade RLHF performance.
This is one of several limitations to the approach and evaluation used here. Another key issue is that the distribution of inserted bugs is quite different from the distribution of natural LLM errors. Training models to insert subtle in-distribution problems (as opposed to paying humans to insert bugs) may be able to mitigate this concern, but we leave such directions to future work. The critique approach is also only the first step of recursive reward modeling (RRM), and we do not know the point at which an additional RRM step is appropriate or whether critique can be used for RRM effectively. There are a number of other limitations:
  • The LLM code snippets used in our evaluations are typically quite short. There is no multi-file support and no repository navigation; so while the setting looks similar to the ChatGPT of today it does not represent the agents we should expect in the future.
  • Although our method reduces the rate of nitpicks and hallucinated bugs, their absolute rate is still quite high.
  • Real world complex bugs can be distributed across many lines of a program and may not be simple to localize or explain; we have not investigated this case.
  • A single step of critique may be substantially weaker than multi-step interactive procedures that can explain problems to the user, such as consultancy or debate [15, 12].
Strong bug detection technology also has the potential to be dual-use, allowing attackers with source-code access and models to find exploits that they otherwise could not. For an analysis of the impact of LLMs on cyber-offense and defense we refer the reader to [8]. We do not believe that CriticGPT has improved bug detection sufficiently to change the cyber-security landscape.

6 Conclusion

Large language models have already passed the point at which typical humans can consistently evaluate their output without help. This has been evident since demonstrations of their strong performance on PhD-level science questions, among other impressive feats [25]. The need for scalable oversight, broadly construed as methods that can help humans to correctly evaluate model output, is stronger than ever. Whether or not RLHF maintains its dominant status as the primary means by which LLMs are post-trained into useful assistants, we will still need to answer the question of whether particular model outputs are trustworthy. Here we take a very direct approach: training models that help humans to evaluate models.
These LLM critics now succeed in catching bugs in real-world data, and even accessible LLM baselines like ChatGPT have significant potential to assist human annotators. From this point on the intelligence of LLMs and LLM critics will only continue to improve. Human intelligence will not. It is therefore essential to find scalable methods that ensure that we reward the right behaviors in our AI systems even as they become much smarter than us. We find LLM critics to be a promising start.

Acknowledgments

We are thankful to Jan Leike and Ilya Sutskever for their vision of superalignment. We'd like to thank Collin Burns, Jeffrey Wu, Dan Mossing and John Schulman for detailed feedback on the manuscript. Jiayi Weng, Suchir Balaji and many others helped us with a tremendous post-training stack. Thanks also to Barret Zoph for support at the end of the project, to the OpenAI platform team for great GPU infrastructure, and to the human data team for much support. Lastly, thanks to the team of annotators who provided training data and evaluated our models throughout the project.

References

[1] Yuntao Bai, Saurav Kadavath, Sandipan Kundu, Amanda Askell, Jackson Kernion, Andy Jones, Anna Chen, Anna Goldie, Azalia Mirhoseini, Cameron McKinnon, Carol Chen, Catherine Olsson, Christopher Olah, Danny Hernandez, Dawn Drain, Deep Ganguli, Dustin Li, Eli Tran-Johnson, Ethan Perez, Jamie Kerr, Jared Mueller, Jeffrey Ladish, Joshua Landau, Kamal Ndousse, Kamile Lukosuite, Liane Lovitt, Michael Sellitto, Nelson Elhage, Nicholas Schiefer, Noemi Mercado, Nova DasSarma, Robert Lasenby, Robin Larson, Sam Ringer, Scott Johnston, Shauna Kravec, Sheer El Showk, Stanislav Fort, Tamera Lanham, Timothy Telleen-Lawton, Tom Conerly, Tom Henighan, Tristan Hume, Samuel R Bowman, Zac Hatfield-Dodds, Ben Mann, Dario Amodei, Nicholas Joseph, Sam McCandlish, Tom Brown, and Jared Kaplan. Constitutional AI: Harmlessness from AI feedback. December 2022, 2212.08073.
[2] Samuel R Bowman, Jeeyoon Hyun, Ethan Perez, Edwin Chen, Craig Pettit, Scott Heiner, Kamilė Lukošiūtė, Amanda Askell, Andy Jones, Anna Chen, Anna Goldie, Azalia Mirhoseini, Cameron McKinnon, Christopher Olah, Daniela Amodei, Dario Amodei, Dawn Drain, Dustin Li, Eli Tran-Johnson, Jackson Kernion, Jamie Kerr, Jared Mueller, Jeffrey Ladish, Joshua Landau, Kamal Ndousse, Liane Lovitt, Nelson Elhage, Nicholas Schiefer, Nicholas Joseph, Noemí Mercado, Nova DasSarma, Robin Larson, Sam McCandlish, Sandipan Kundu, Scott Johnston, Shauna Kravec, Sheer El Showk, Stanislav Fort, Timothy Telleen-Lawton, Tom Brown, Tom Henighan, Tristan Hume, Yuntao Bai, Zac Hatfield-Dodds, Ben Mann, and Jared Kaplan. Measuring progress on scalable oversight for large language models. November 2022, 2211.03540.
[3] Stephen Casper, Xander Davies, Claudia Shi, Thomas Krendl Gilbert, Jérémy Scheurer, Javier Rando, Rachel Freedman, Tomasz Korbak, David Lindner, Pedro Freire, Tony Wang, Samuel Marks, Charbel-Raphaël Segerie, Micah Carroll, Andi Peng, Phillip Christoffersen, Mehul Damani, Stewart Slocum, Usman Anwar, Anand Siththaranjan, Max Nadeau, Eric J Michaud, Jacob Pfau, Dmitrii Krasheninnikov, Xin Chen, Lauro Langosco, Peter Hase, Erdem Bıyık, Anca Dragan, David Krueger, Dorsa Sadigh, and Dylan Hadfield-Menell. Open problems and fundamental limitations of reinforcement learning from human feedback. July 2023, 2307.15217.
[4] Xinyun Chen, Maxwell Lin, Nathanael Schärli, and Denny Zhou. Teaching large language models to self-debug. April 2023, 2304.05128.
[5] Deep Ganguli, Amanda Askell, Nicholas Schiefer, Thomas I Liao, Kamilė Lukošiūtė, Anna Chen, Anna Goldie, Azalia Mirhoseini, Catherine Olsson, Danny Hernandez, Dawn Drain, Dustin Li, Eli Tran-Johnson, Ethan Perez, Jackson Kernion, Jamie Kerr, Jared Mueller, Joshua Landau, Kamal Ndousse, Karina Nguyen, Liane Lovitt, Michael Sellitto, Nelson Elhage, Noemi Mercado, Nova DasSarma, Oliver Rausch, Robert Lasenby, Robin Larson, Sam Ringer, Sandipan Kundu, Saurav Kadavath, Scott Johnston, Shauna Kravec, Sheer El Showk, Tamera Lanham, Timothy Telleen-Lawton, Tom Henighan, Tristan Hume, Yuntao Bai, Zac Hatfield-Dodds, Ben Mann, Dario Amodei, Nicholas Joseph, Sam McCandlish, Tom Brown, Christopher Olah, Jack Clark, Samuel R Bowman, and Jared Kaplan. The capacity for moral Self-Correction in large language models. February 2023, 2302.07459.
[6] Leo Gao, John Schulman, and Jacob Hilton. Scaling laws for reward model overoptimization. October 2022, 2210.10760.
[7] Luyu Gao, Zhuyun Dai, Panupong Pasupat, Anthony Chen, Arun Tejasvi Chaganty, Yicheng Fan, Vincent Y Zhao, Ni Lao, Hongrae Lee, Da-Cheng Juan, and Kelvin Guu. RARR: Researching and revising what language models say, using language models. October 2022, 2210.08726.
[8] Jeff Gennari, Shing-hon Lau, Samuel Perl, Joel Parish, and Girish Sastry. Considerations for evaluating large language models for cybersecurity tasks. 2024.
[9] Anshul Gupta and Neel Sundaresan. Intelligent code reviews using deep learning. In Proceedings of the 24th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (KDD'18) Deep Learning Day, 2018.
[10] Jie Huang, Xinyun Chen, Swaroop Mishra, Huaixiu Steven Zheng, Adams Wei Yu, Xinying Song, and Denny Zhou. Large language models cannot self-correct reasoning yet. October 2023, 2310.01798.
[11] Evan Hubinger. AI safety via market making. https://www.alignmentforum.org/posts/YWwzccGbcHMJMpT45/ai-safety-via-market-making. Accessed: 2024-04-08.
[12] Geoffrey Irving, Paul Christiano, and Dario Amodei. AI safety via debate. May 2018, 1805.00899.
[13] Akbir Khan, John Hughes, Dan Valentine, Laura Ruis, Kshitij Sachan, Ansh Radhakrishnan, Edward Grefenstette, Samuel R Bowman, Tim Rocktäschel, and Ethan Perez. Debating with more persuasive LLMs leads to more truthful answers. February 2024, 2402.06782.
[14] Geunwoo Kim, Pierre Baldi, and Stephen McAleer. Language models can solve computer tasks. March 2023, 2303.17491.
[15] Jan Leike, David Krueger, Tom Everitt, Miljan Martic, Vishal Maini, and Shane Legg. Scalable agent alignment via reward modeling: a research direction. November 2018, 1811.07871.
[16] Jan Leike, John Schulman, and Jeffrey Wu. Our approach to alignment research. https://openai.com/index/our-approach-to-alignment-research/, 2022. Accessed: 2024-06-12.
[17] Lingwei Li, Li Yang, Huaxi Jiang, Jun Yan, Tiejian Luo, Zihan Hua, Geng Liang, and Chun Zuo. AUGER: automatically generating review comments with pre-training models. In Proceedings of the 30th ACM Joint European Software Engineering Conference and Symposium on the Foundations of Software Engineering, ESEC/FSE 2022, pages 1009-1021, New York, NY, USA, November 2022. Association for Computing Machinery.
[18] Yujia Li, David Choi, Junyoung Chung, Nate Kushman, Julian Schrittwieser, Rémi Leblond, Tom Eccles, James Keeling, Felix Gimeno, Agustin Dal Lago, Thomas Hubert, Peter Choy, Cyprien de Masson d'Autume, Igor Babuschkin, Xinyun Chen, Po-Sen Huang, Johannes Welbl, Sven Gowal, Alexey Cherepanov, James Molloy, Daniel J Mankowitz, Esme Sutherland Robson, Pushmeet Kohli, Nando de Freitas, Koray Kavukcuoglu, and Oriol Vinyals. Competition-level code generation with AlphaCode. February 2022, 2203.07814.
[19] Long Ouyang, Jeff Wu, Xu Jiang, Diogo Almeida, Carroll L Wainwright, Pamela Mishkin, Chong Zhang, Sandhini Agarwal, Katarina Slama, Alex Ray, John Schulman, Jacob Hilton, Fraser Kelton, Luke Miller, Maddie Simens, Amanda Askell, Peter Welinder, Paul Christiano, Jan Leike, and Ryan Lowe. Training language models to follow instructions with human feedback. March 2022, 2203.02155.
[20] Liangming Pan, Michael Saxon, Wenda Xu, Deepak Nathani, Xinyi Wang, and William Yang Wang. Automatically correcting large language models: Surveying the landscape of diverse self-correction strategies. August 2023, 2308.03188.
[21] Richard Yuanzhe Pang, Alicia Parrish, Nitish Joshi, Nikita Nangia, Jason Phang, Angelica Chen, Vishakh Padmakumar, Johnny Ma, Jana Thompson, He He, and Samuel R Bowman. QuALITY: Question answering with long input texts, yes! December 2021, 2112.08608.
[22] Neil Perry, Megha Srivastava, Deepak Kumar, and Dan Boneh. Do users write more insecure code with AI assistants? November 2022, 2211.03622.
[23] Ansh Radhakrishnan. Anthropic fall 2023 debate progress update. https://www.lesswrong.com/posts/QtqysYdJRenWFeWc4/anthropic-fall-2023-debate-progress-update. Accessed: 2024-04-08.
[24] Keshav Ramji, Young-Suk Lee, Ramón Fernandez Astudillo, Md Arafat Sultan, Tahira Naseem, Asim Munawar, Radu Florian, and Salim Roukos. Self-Refinement of language models from external proxy metrics feedback. February 2024, 2403.00827.
[25] David Rein, Betty Li Hou, Asa Cooper Stickland, Jackson Petty, Richard Yuanzhe Pang, Julien Dirani, Julian Michael, and Samuel R Bowman. GPQA: A Graduate-Level Google-Proof Q&A benchmark. November 2023, 2311.12022.
[26] William Saunders, Catherine Yeh, Jeff Wu, Steven Bills, Long Ouyang, Jonathan Ward, and Jan Leike. Self-critiquing models for assisting human evaluators. June 2022, 2206.05802.
[27] John Schulman, Filip Wolski, Prafulla Dhariwal, Alec Radford, and Oleg Klimov. Proximal policy optimization algorithms. July 2017, 1707.06347.
[28] Joar Skalse, Nikolaus H R Howe, Dmitrii Krasheninnikov, and David Krueger. Defining and characterizing reward hacking. September 2022, 2209.13085.
[29] Eric Michael Smith, Orion Hsu, Rebecca Qian, Stephen Roller, Y-Lan Boureau, and Jason Weston. Human evaluation of conversations is an open problem: comparing the sensitivity of various methods for evaluating dialogue agents. January 2022, 2201.04723.
[30] OpenAI Team. GPT-4 technical report, 2024, 2303.08774.
[31] Daniel M. Ziegler, Nisan Stiennon, Jeffrey Wu, Tom B. Brown, Alec Radford, Dario Amodei, Paul Christiano, and Geoffrey Irving. Fine-tuning language models from human preferences.

7 Appendix

7.1 Force Sampling Beam Search (FSBS) Details

Figure 9: ChatGPT with FSBS using the critique reward model improves performance, but it does not let us explore as much of the frontier as CriticGPT (RL + FSBS).
The process consists of:
  1. Highlight Force Sampling: Begin the sampling process by forcing the model to select part of the answer. In practice this means appending the string that opens a highlight in our format to the critique, then sampling continuations.
  2. Score and Select: Evaluate the possible completions and select the top completions according to the RM score.
  3. Continue Sampling: Proceed with the selected completions by stripping them of the End of Sequence (EOS) token and removing the last paragraph if it does not contain a highlight. Again append the highlight-opening string to force the model to generate a new comment.
  4. Repeat Process: Continue the sampling and selection process for a fixed number of iterations.
  5. Optimize Final Critique: From the list of all generated critiques from all rounds and their respective scores, identify the critique that maximizes the expression rm_score + LENGTH_MODIFIER * num_highlights. The total number of critiques to rank depends on the search parameters (28 for our parameter choice).
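The five steps above can be sketched in Python. Here `sample` and `rm_score` stand in for the critique policy and reward model, and the beam sizes, round count, `<EOS>` marker, and highlight-opening string are illustrative assumptions rather than the paper's actual parameters or format:

```python
def strip_for_continuation(critique, highlight_open):
    """Prepare a completed critique for further sampling (step 3)."""
    critique = critique.removesuffix("<EOS>")  # assumed EOS marker
    paragraphs = critique.split("\n\n")
    # Drop a trailing paragraph that contains no highlight.
    if paragraphs and highlight_open not in paragraphs[-1]:
        paragraphs.pop()
    return "\n\n".join(paragraphs)

def fsbs(sample, rm_score, n_continuations=4, beam_width=4, n_rounds=4,
         length_modifier=0.0, highlight_open="```"):
    """Force Sampling Beam Search (sketch); all defaults are illustrative."""
    beam = [""]       # partial critiques to extend
    candidates = []   # (rm_score, critique) pairs from every round
    for _ in range(n_rounds):
        pool = []
        for prefix in beam:
            forced = prefix + highlight_open  # step 1: force a highlight
            for _ in range(n_continuations):
                completion = forced + sample(forced)
                pool.append((rm_score(completion), completion))
        pool.sort(key=lambda p: p[0], reverse=True)
        top = pool[:beam_width]               # step 2: keep the best
        candidates.extend(top)
        beam = [strip_for_continuation(c, highlight_open) for _, c in top]
    # Step 5: trade off RM score against the number of highlights,
    # using occurrences of the opening string as a proxy for that count.
    best = max(candidates,
               key=lambda p: p[0] + length_modifier * p[1].count(highlight_open))
    return best[1].removesuffix("<EOS>")
```

Raising `length_modifier` above zero rewards critiques with more highlights, which is the precision/recall knob discussed next.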
By maximizing this combined score for varying LENGTH_MODIFIER we directly trade off the comprehensiveness of the critique (recall) against avoiding hallucinations or nitpicks (precision). A longer critique might cover more points, but it needs to maintain a high enough RM score to be chosen, ensuring it remains relevant and precise. We explored 4 values of LENGTH_MODIFIER that map to the 10th, 25th, 50th and 75th percentiles of critique lengths sampled during the search. For the baseline ChatGPT model these map to up to 7 highlights on average; for the more concise RL model, up to 5 highlights on average. Figure 9 shows both ChatGPT and CriticGPT results using FSBS. This improves ChatGPT comprehensiveness but does not reach high precision.

7.2 Data
Figure 10 shows the severity and detection difficulty for the Human Detected Bugs and Human Inserted Bugs distributions.
Table 1 shows some statistics of the lengths of problems and solutions in the dataset.
Figure 10: Human Detected Bugs are easier to catch and less severe. Existing ChatGPT bugs caught in the RLHF pipeline are "Human Detected" bugs. To characterize the differences between this distribution and bugs inserted by our contractors in the tampering pipeline we looked at: A) the rate at which human-written critiques were rated comprehensive, which is higher for detected bugs, and B) the severity rating of the bug caught, which is higher for inserted bugs.
Table 1: Length distribution of the training data (tokens)

                              question   answer   model critique   human critique
mean, unmodified                 92.45   313.41           225.55                -
mean, tampered                  109.58   395.80           310.36           125.89
95th percentile, unmodified        334      547              524                -
95th percentile, tampered          405      660              704              284

7.3 Prompting
The same prompt is shared between all models. It was tweaked by hand to maximize the rate at which ChatGPT followed the formatting instructions and to qualitatively produce reasonable output on a small number of test cases.

7.4 Forms and guidance for contractors in critique comparisons

In critique comparison tasks, contractors rated each critique on the questions listed in Table 2. We additionally asked them to explain their ratings in a rationale to facilitate quality control. Random tasks were reviewed by a second, more experienced "QCer". The median proportion of tasks that were reviewed per contractor was . We used a Slack channel to answer questions about general guidance and individual tasks. Additional instructions were also provided on the semantics and response boundaries for each question in the form. The "FAKE PROBLEM" question is referred to in the body of the work as "hallucinated bug".

7.5 Agreement rates on the Critique Comparison collections

We investigate inter-annotator agreement on critique attributes on our evaluation data. Figure 11a illustrates that agreement is significantly higher on CBI questions (i.e., whether the critique included a reference bug) compared to other questions (like whether the critique contains nitpicks or is
Each question was rated on a 1-7 scale:

1. Did this critique point out the particular problem described just above?
   (1: definitely missed, 4: I'm unsure, 7: definitely included)
2. Does the critique have NITPICK?
   (1: no, 4: I'm unsure, 7: yes)
3. Does the critique have FAKE PROBLEM?
   (1: no, 4: I'm unsure, 7: yes)
4. How concise is this critique?
   (1: very wordy, 4: I'm unsure, 7: very concise)
5. Overall, how good is this critique relative to the others?
   (1: this is the worst critique, 7: this is the best critique)

Rationale: (free text)

Table 2: Form completed by contractors while comparing critiques.
Figure 11: (a) Contractors exhibit higher inter-annotator agreement on the reference bug inclusion question compared to other critique attributes. Results on Human Inserted Bugs data. (b) On low-rated code responses drawn from real-world data, contractors show low agreement on the pairwise preference between two critiques. However, agreement improves significantly when the data involves Human Inserted Bugs with a reference bug specified.
comprehensive). This suggests that identifying and agreeing on reference bugs is more straightforward for contractors, likely due to the more objective nature of these questions.
We also investigate agreement on the preference between two critiques. We calculate this by examining each pair of critiques assessed by two different raters, measuring the agreement in their pairwise preference, and randomly resolving any ties. The results in Figure 11b reveal that contractors often disagree when they compare two critiques on overall quality. This low agreement rate indicates subjective differences in how contractors perceive the critiques' quality or relevance. This is especially pronounced on low-rated code responses drawn IID from the ChatGPT training set, which, unlike our Human Detected Bugs evaluation set, was not curated for contractors having previously written high-quality bug descriptions. Agreement improves significantly on data with Human Inserted Bugs, which includes a reference bug description. This suggests that having clearly identified bugs provides a more concrete context, allowing contractors to make more consistent judgements.
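The agreement computation just described can be sketched as follows; the data layout (one overall score per critique per rater) and the helper itself are illustrative assumptions, not the analysis code behind Figure 11:

```python
import random

def pairwise_agreement(ratings_a, ratings_b, seed=0):
    """Fraction of critique pairs on which two raters prefer the same one.

    ratings_a[i] / ratings_b[i] are each rater's overall score for
    critique i; ties are resolved randomly, as described in the text.
    """
    rng = random.Random(seed)

    def preferred(scores, i, j):
        if scores[i] == scores[j]:
            return rng.choice([i, j])  # random tie-breaking
        return i if scores[i] > scores[j] else j

    n = len(ratings_a)
    pairs = [(i, j) for i in range(n) for j in range(i + 1, n)]
    agree = sum(preferred(ratings_a, i, j) == preferred(ratings_b, i, j)
                for i, j in pairs)
    return agree / len(pairs)
```

With no ties, two identical raters agree on every pair (rate 1.0) and two fully reversed raters agree on none (rate 0.0).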

7.6 Generalization of critique-bug inclusion (CBI) Metric

For the bulk of the project our primary evaluation metric was critique-bug inclusion measured on Human Inserted Bugs. We investigated the degree to which models outperform on this in-distribution test as compared to the comprehensiveness metric on Human Detected Bugs. To analyze this, Figure 12 presents the performance of the models as evaluated on our CBI objective (x-axis) against the rate of comprehensive critiques on Human Detected Bugs, which we did not directly optimize for.
We observe the following:
Figure 12: Performance of various models evaluated on both CBI (x-axis) versus the rate of comprehensive critiques on Human Detected Bugs (y-axis).
  • Relative to ChatGPT, CriticGPT (RL only) performs better on Human Inserted Bugs than on Human Detected Bugs. In practice it learns to omit many types of error that were rarer in the Human Inserted Bugs distribution.
  • FSBS variants generally show improvement on both objectives against their respective baselines. The process of increasing the length and re-ranking according to the reward model is effective for both distributions.
  • All models outperform ChatGPT (less training) on both distributions.

7.7 Discriminator Performance and the DC Gap

Saunders et al. introduced the notion of a "Discriminator Critique Gap" (DC-Gap). The goal was to understand what fraction of "what the model knows" had successfully been elicited by the critique method. Quoting that work directly:
"[I]f a model can tell an answer is poor, can it also point out the flaw to a human? If we could train models to always point out when they notice flaws, this could go a long way towards having trustworthy and aligned models..."
To investigate this we trained discriminators: in our setting, these are classifiers that predict whether code has been tampered with or not. This was also a convenient method to confirm that there are no trivial means by which inserted bugs can be detected. Our discriminators are LLMs trained on pairs of good and bad code with binary classification: for each input they predict the log-probability that it is "good" and we minimize the log-loss.
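As a minimal sketch of that training objective, assuming the model outputs a log-probability that the input code is "good":

```python
import math

def discriminator_loss(logprob_good, label):
    """Binary log-loss for one discriminator example (illustrative).

    logprob_good: model's predicted log-probability that the code is
    untampered; label: 1 for good code, 0 for tampered code.
    """
    p_good = math.exp(logprob_good)
    p_correct = p_good if label == 1 else 1.0 - p_good
    return -math.log(p_correct)
```

Minimizing the average of this loss over the paired good and bad code is standard binary classification.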
Motivated to train the best possible discriminators, we included a large amount of additional good and bad code data in training from a variety of sources. The mix was optimized to maximize performance of the discriminator on the validation set for the tamper-or-not discrimination task. Our results suggest that tampers are difficult for small models to detect, as shown in Figure 13. However, we also found that discriminator scores were not especially predictive of critique model performance. On the cases where the largest discriminator was most confidently wrong, critics still often caught the inserted bug. Our tentative conclusion was that discriminator training was an insufficient means to extract "what the model knows", as there were many cases where the critic succeeded but the discriminator failed. We therefore leave investigation of the DC gap to future work.
Figure 13: Training "discriminators" which predict if code was tampered with confirms that small models cannot easily detect inserted bugs.
Figure 14: Current LLM critics did not help significantly when applied to challenging competition code (a small effect is visible in the plot, but is not significant). In this setting, contractors had to pick the passing solution from a pair in which one passed and one failed. With enough sampling we could find cases where this binary task was challenging for humans, but could not produce a critic that helped here with one sampled critique. Results are shown for an early GPT-4 sized critic model without FSBS. Because of the strong negative trend in human accuracy with problem difficulty we believe this setting could give signal with sufficiently capable critics.

7.8 Lessons from Judge Accuracy and Ground Truth Reward Experiments

Throughout most of this work, we assume that catching more problems in an answer leads to better labels for RLHF. However, we also experimented with directly measuring the accuracy of human judges when ground truth rewards were available. Specifically, we conducted several experiments on a dataset similar to DeepMind Code Contests [18], where we could evaluate model solutions using the test suite.
Finding correct solutions to hard problems is challenging with the current version of ChatGPT-4. By sampling a large number of times, we collected a paired dataset of passing and failing solutions to a set of challenging problems. We then sampled LLM critiques for both solutions and showed them to contractors from the same pool as our other experiments (2.1.3). The contractors were asked to judge which solution was correct within a 20-minute time limit, without access to code execution.
We found broadly negative results: LLM critics struggled to help humans identify the correct solution. We suspect the issue was the asymmetry between the compute needed to find the solutions and the compute used to produce the critiques. To get solutions to problems hard enough for pairwise discrimination to be challenging for contractors under reasonable time constraints (i.e., not extremely short), we had to take many samples. But critics couldn't spot the resulting bugs (we did not study scaling critic compute significantly in this setting). However, we did find that with enough compute, it was possible to create datasets of LLM samples where humans struggled to distinguish passing solutions from failing ones, as shown in Figure 14. As models improve, this direction will become increasingly appealing.
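The paired-dataset construction described above can be sketched as follows; `sample_solution`, `passes_tests`, and the per-problem sampling budget are hypothetical stand-ins for the actual pipeline:

```python
def build_paired_dataset(problems, sample_solution, passes_tests, n_samples=100):
    """Collect one (passing, failing) solution pair per problem by sampling.

    sample_solution(problem) draws one model solution; passes_tests runs
    the problem's test suite. Problems where sampling never produces both
    a passing and a failing solution are dropped.
    """
    pairs = []
    for problem in problems:
        passing, failing = None, None
        for _ in range(n_samples):
            solution = sample_solution(problem)
            if passes_tests(problem, solution):
                passing = passing or solution
            else:
                failing = failing or solution
            if passing and failing:
                break
        if passing and failing:  # keep only problems with both outcomes
            pairs.append((problem, passing, failing))
    return pairs
```

The harder the problem, the more samples are needed before a passing solution appears, which is the compute asymmetry discussed above.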

7.9 Preference Rate Summary Table

setting   matchup   winrate