
Leveraging Large Language Model for Automatic Patch Correctness Assessment

Xin Zhou, Bowen Xu, Kisub Kim, DongGyun Han, Hung Huu Nguyen, Thanh Le-Cong, Junda He, Bach Le, and David Lo, Fellow, IEEE

Abstract

Automated Program Repair (APR) techniques have shown increasingly promising results in fixing real-world bugs. Despite their effectiveness, APR techniques still face an overfitting problem: a generated patch can be incorrect although it passes all tests. It is time-consuming to manually evaluate the correctness of generated patches that pass all available test cases. To address this problem, many approaches have been proposed to automatically assess the correctness of patches generated by APR techniques. These approaches are mainly evaluated within the cross-validation setting. However, for patches generated by a new or unseen APR tool, users are implicitly required to manually label a significant portion of these patches (e.g., 90% in 10-fold cross-validation) before inferring the remaining patches (e.g., 10% in 10-fold cross-validation). To mitigate this issue, in this study, we propose LLM4PatchCorrect, an approach that assesses patch correctness by adopting a large language model for code. Specifically, for patches generated by a new or unseen APR tool, LLM4PatchCorrect does not need labeled patches of this new or unseen APR tool for training; instead, it directly queries the large language model for code to obtain predictions on the correctness labels. In this way, LLM4PatchCorrect can reduce the manual labeling effort when building a model to automatically assess the correctness of generated patches of new APR tools. To provide knowledge regarding the automatic patch correctness assessment (APCA) task to the large language model for code, LLM4PatchCorrect leverages bug descriptions, execution traces, failing test cases, test coverage, and labeled patches generated by existing APR tools, before deciding the correctness of the unlabeled patches of a new or unseen APR tool. Additionally, LLM4PatchCorrect prioritizes labeled patches from existing APR tools that exhibit semantic similarity to those generated by new APR tools, enhancing the accuracy achieved by LLM4PatchCorrect for patches from new APR tools. Our experimental results show that LLM4PatchCorrect can achieve an accuracy of 84.4% and an F1-score of 86.5% on average, although no labeled patch of the new or unseen APR tool is available. In addition, our proposed technique significantly outperformed the prior state-of-the-art.

Index Terms: Automatic patch correctness assessment, Large language models of code, In-context learning

1 INTRODUCTION

Automated Program Repair (APR) has gained increasing attention, and diverse APR tools have been proposed [1], [2], [3], [4], [5], [6], [7], [8], [9], [10], [11], [12], [13]. Despite the significant improvements achieved in APR, existing APR tools still face a long-standing challenge: the overfitting problem [14], [15], [16], [17], [18]. Due to the absence of strong program specifications, APR tools often use test cases to validate whether a generated patch is correct or not. However, passing all the existing test cases does not ensure that the patch is indeed correct, and there is no guarantee that the patch can actually repair the program. A generated patch is considered overfitting if it passes all the available test cases while it is still incorrect with respect to the intended program specification.
Identifying overfitting patches is crucial for the adoption of APR tools in practice. Suppose Bob is a practitioner who is keen to use advanced APR tools. There exist multiple approaches he can employ, and each produces many patches. However, recent studies [19], [20] demonstrate that APR tools can generate more overfitting patches than correct ones, showing a high false positive rate. In addition, researchers have revealed that high false positive rates can breed dissatisfaction and distrust among developers toward automated SE tools such as static analysis [21] and fault localization [22]. This indicates that APR tools can disappoint Bob by wasting his time on overfitting patches. Thus, it is important to detect and reduce overfitting patches, especially for the generate-and-validate APR approaches used in practice [23].
To address this issue, many approaches [4], [23], [24], [25], [26], [27], [28], [29], [30], [31], [32], [33], [34], [35] have been proposed to conduct automatic patch correctness assessment (APCA). Lin et al. [33] categorized the existing APCA approaches into two categories: (1) dynamic approaches, which are based on running/executing the tests, and (2) static approaches, which are built on top of source code patterns or features. In general, dynamic approaches perform correctness assessment by either augmenting test cases using automated test generation tools such as Randoop [36] or collecting runtime information for analysis.
On the other hand, static approaches extract code patterns or features to decide correctness. Despite the promising results, both categories still have drawbacks. The dynamic approaches are very time-consuming [28], [30], while the static approaches are more efficient but can be less precise [29]. Additionally, building certain static systems can be hard if they require crafting specialized code patterns or features. In this study, we aim to advance static APCA approaches.
Many static APCA approaches have been proposed in recent years. For example, Ye et al. [27] introduced ODS, which identifies overfitting patches by utilizing code features statically extracted at the Abstract Syntax Tree (AST) level from both the buggy code and patches generated by APR tools. Tian et al. [23] leveraged advanced code representation learning techniques, such as BERT [37], to extract source code embeddings for assessing patch correctness. Recently, Tian et al. introduced Quatrain [34], which transforms the APCA task into a question-answering problem. Quatrain first learned the relationship between bug reports and patch descriptions. Subsequently, it constructed a question-answer-based classifier to assess patch correctness. Moreover, Lin et al. [33] proposed Cache, a patch correctness assessment technique that learns a context-aware code change embedding, considering program structures. Cache achieved state-of-the-art performance and outperformed many dynamic approaches [25], [26], [29], [30], [36].
Static APCA approaches, such as ODS [27], Tian et al.'s approach [23], Quatrain [34], and Cache [33], directly extract features from patches and learn correct patterns from the labeled dataset. In prior works, static APCA approaches were primarily evaluated using k-fold cross-validation, where patches generated by different APR tools are mixed and split into (k-1):1 for training and testing. However, the cross-validation setting has a significant limitation: for patches generated by a new or unseen APR tool, users are implicitly required to manually label a significant portion of these patches (e.g., 90% in 10-fold cross-validation) before inferring the remaining patches (e.g., 10% in 10-fold cross-validation). Suppose Bob is a practitioner eager to leverage an advanced new APR tool to fix a bug and also aims to utilize APCA approaches to filter out the overfitting patches generated by the new tool. However, the 10-fold cross-validation process implicitly necessitates that Bob manually label 90% of the patches generated by the new APR tool. This significant manual effort may deter Bob from adopting APCA approaches.
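To make the contrast with cross-validation concrete, the setting considered here can be viewed as a grouped, leave-one-tool-out split: every patch of the tool under evaluation is held out, so the evaluated approach never sees a labeled patch from that tool. Below is a minimal sketch of such a split using scikit-learn; the toy patches, labels, and tool names are purely illustrative.

```python
# Minimal sketch: leave-one-APR-tool-out evaluation, in contrast to
# k-fold cross-validation that mixes patches from all tools.
# The toy data below is illustrative, not from the actual datasets.
from sklearn.model_selection import LeaveOneGroupOut

patches = ["patch A", "patch B", "patch C", "patch D"]  # patch texts
labels = [1, 0, 1, 0]                  # 1 = correct, 0 = overfitting
tools = ["ACS", "ACS", "Arja", "Arja"]  # APR tool that generated each patch

for train_idx, test_idx in LeaveOneGroupOut().split(patches, labels, groups=tools):
    held_out_tool = tools[test_idx[0]]
    # A model may only learn from (or select demonstrations among)
    # patches of the other tools; the held-out tool stays unlabeled.
    print(f"unseen tool: {held_out_tool}, "
          f"train patches: {train_idx.tolist()}, test patches: {test_idx.tolist()}")
```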
Given the rapid emergence of new APR tools, our goal is to alleviate the burden on users by avoiding the need to manually label the patches generated by these new tools before APCA approaches can predict their correctness. Additionally, many patches generated by existing APR tools have already been manually checked for correctness [23], [29]. Thus, we are motivated to ask a key question:
Is it feasible to utilize labeled patches of existing APR tools to predict the correctness of the patches generated by a new/unseen APR tool?
We first explore whether the state-of-the-art APCA approaches can predict the correctness of patches generated by an unseen APR tool when labeled patches of other APR tools are available in the training data.

Fig. 1: An abstracted example of how to use large pre-trained models. An unlabeled test patch of a new APR tool is concatenated with several labeled patches of existing APR tools to form the input to the pre-trained model.

During our preliminary experiments, we observed that state-of-the-art APCA approaches such as Quatrain [34] and Cache [33] did not yield satisfactory results. This was possibly due to a lack of labeled patches from the new or unseen APR tool for training the model. Addressing this challenge is critical for further advancing the APCA task. We report the effectiveness of these state-of-the-art approaches in Section 5.1. While Ye et al. [27] have explored the setting mentioned above, they did not investigate the effectiveness of recent large language models (LLMs). This is particularly noteworthy since LLMs are known to excel in few-shot or zero-shot tasks [38] that closely align with our focus. Studying LLMs in the context of patch correctness assessment could offer valuable insights into the capabilities of these models. Additionally, in this study, we introduce a novel large language model-based solution to effectively tackle this challenge, which significantly outperforms the ODS approach proposed by Ye et al. [27].
To tackle this challenge, we propose LLM4PatchCorrect, which aims to enhance the effectiveness of predicting the correctness of patches generated by new or unseen APR tools. LLM4PatchCorrect employs an open-source large language model (LLM) for code, called Starcoder-7B [39], to evaluate patch correctness without requiring fine-tuning. Technically, we directly leverage the pre-training objective of the LLM, generating the next token based on all previous tokens, to accomplish the APCA task. As shown in Figure 1, we first prepare the model inputs by concatenating an unlabeled patch from a new APR tool with several labeled patches from existing APR tools, along with other bug-related information. We then query the LLM to generate the next token, which reveals its tendency in terms of patch correctness (correct or overfitting). This allows us to apply LLMs without the need for fine-tuning, since we formulate the APCA task (i.e., predicting whether a patch is correct or not) in the same format as the pre-training task (i.e., generating the next token). The utilization of LLMs in this manner is referred to as in-context learning.
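As a concrete illustration of this formulation, the sketch below builds a prompt from a few labeled demonstration patches, appends the unlabeled query patch, and compares the LLM's next-token scores for two label words. The checkpoint name, prompt template, label words, and first-subtoken scoring are assumptions for illustration, not necessarily the exact design used by LLM4PatchCorrect.

```python
# Sketch of in-context patch correctness assessment. The checkpoint
# name, prompt template, and label words are illustrative assumptions.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

MODEL = "bigcode/starcoderbase-7b"  # assumed StarCoder-7B checkpoint
tok = AutoTokenizer.from_pretrained(MODEL)
model = AutoModelForCausalLM.from_pretrained(MODEL, torch_dtype=torch.float16)

def build_prompt(demos, query_patch):
    # demos: list of (patch_text, label_word) pairs from existing APR tools.
    parts = [f"Patch:\n{p}\nAssessment: {label}\n" for p, label in demos]
    parts.append(f"Patch:\n{query_patch}\nAssessment:")
    return "\n".join(parts)

@torch.no_grad()
def assess(demos, query_patch):
    ids = tok(build_prompt(demos, query_patch), return_tensors="pt").input_ids
    next_token_logits = model(ids).logits[0, -1]  # scores over the next token
    # Compare the scores of the first sub-token of each label word
    # (a common simplification when a label word spans several sub-tokens).
    score_correct = next_token_logits[tok.encode(" correct")[0]]
    score_overfit = next_token_logits[tok.encode(" overfitting")[0]]
    return "correct" if score_correct > score_overfit else "overfitting"
```

Because both label scores come from the same next-token distribution, comparing the two logits directly is equivalent to comparing their softmax probabilities.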
LLM4PatchCorrect does not use all labeled patches produced by existing APR tools; rather, it selects semantically similar patches (a selection sketch is given after the list below). This strategy aids the large language model in providing more precise predictions for patches generated by a new APR tool. In addition to the labeled patches of existing APR tools, LLM4PatchCorrect incorporates a broader range of guiding information, as illustrated in (2) of Figure 2:
1) Bug Descriptions: descriptions detailing the nature of the bug that the patch intends to resolve;


Fig. 2: Overall Framework of LLM4PatchCorrect (panel (3): LLM Inference on Patch Correctness).

2) Execution Traces: traces of the buggy program’s executions;

3) Failing Test Cases: test cases that expose failures in the buggy program;

4) Test Coverage: line and condition coverage metrics for all available test cases associated with the bug.

Bug descriptions, execution traces, and failing test cases serve to enhance LLM4PatchCorrect’s comprehension of the characteristics of the bug targeted by a patch generated through a new APR tool. Test coverage serves as an approximate indicator of the adequacy of the available test cases. In cases where test coverage is notably low, even if a patch enables the program to pass all test cases, the correctness of the patch cannot be guaranteed because many code lines and conditions remain uncovered in the tests. By leveraging the diverse range of guiding information (including the labeled patches of existing APR tools), LLM4PatchCorrect can provide accurate predictions.
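Returning to the similarity-based selection of demonstration patches mentioned above, the following is a minimal sketch of one way to realize it. The encoder choice (CodeBERT), mean pooling, and the value of k are assumptions for illustration; LLM4PatchCorrect's actual selection component may differ.

```python
# Sketch: pick the top-k labeled patches of existing APR tools that are
# semantically closest to the query patch. The encoder (CodeBERT) and
# mean pooling are illustrative assumptions.
import torch
from transformers import AutoModel, AutoTokenizer

tok = AutoTokenizer.from_pretrained("microsoft/codebert-base")
enc = AutoModel.from_pretrained("microsoft/codebert-base")

@torch.no_grad()
def embed(text):
    ids = tok(text, return_tensors="pt", truncation=True, max_length=512)
    # Mean-pool the last hidden states into one vector per patch.
    return enc(**ids).last_hidden_state.mean(dim=1).squeeze(0)

def top_k_demos(query_patch, labeled_patches, k=4):
    # labeled_patches: list of (patch_text, label) from existing APR tools.
    q = embed(query_patch)
    scored = [(torch.cosine_similarity(q, embed(p), dim=0).item(), p, lbl)
              for p, lbl in labeled_patches]
    scored.sort(key=lambda t: t[0], reverse=True)
    return [(p, lbl) for _, p, lbl in scored[:k]]
```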
We evaluate LLM4PatchCorrect using real-world, large-scale patch correctness datasets [23], [29]. These datasets comprise patches generated by 22 different APR tools, with labels meticulously examined by developers. The experimental results demonstrate that LLM4PatchCorrect significantly improves Accuracy, F1, and AUC scores, increasing them from 10.2% to 32.4%, from 6.1% to 24.2%, and from 10.1% to 34.2%, on average, respectively, compared to state-of-the-art APCA approaches.

Contributions. The main contributions are as follows:
• This paper underscores the importance of a novel setting for the APCA task, where we assume that no labeled patches are available for a new or unseen APR tool. This setting better matches the initial goal of APCA tasks to reduce the manual labeling effort, and it can evaluate the ability of approaches to transfer knowledge embedded in the existing labeled data to future unlabeled data.
• To the best of our knowledge, we are the first to introduce an advanced LLM to solve the APCA task. We design an LLM-based solution, LLM4PatchCorrect, for this challenging setting (i.e., no labeled patches of the new or unseen APR tool).
  • We propose incorporating diverse guiding information to aid LLM4PatchCorrect in decision-making regarding patch correctness. Specifically, LLM4PatchCorrect takes into account bug descriptions, execution traces, failing test cases, test coverage, and labeled patches generated by existing APR tools.

2 BACKGROUND

This section provides background information on Large Language Models (Section 2.1) and their applications in software engineering tasks (Section 2.2).

2.1 Large Language Models (LLMs) for Code

Large Language Models (LLMs) [37], [40], [41], [42], [43], [44], [45], [46] have become popular in Natural Language Processing (NLP) and Software Engineering (SE). CodeBERT [37] is a typical encoder-only model for code that is widely used in SE tasks such as code search and defect prediction. CodeT5 [43] is a typical pre-trained encoder-decoder model for code, which is pre-trained on denoising sequence-to-sequence objectives. Starcoder [39], CodeLlama [47], CodeParrot [45], and BLOOM [46] are typical LLMs that use only the Transformer decoder to predict the probability of the next token given the previous tokens. The nature of these models makes them highly useful for generation tasks because text/code is usually written in a left-to-right way.
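Formally, such decoder-only models factorize the probability of a token sequence $x = (x_1, \dots, x_T)$ autoregressively and are pre-trained to maximize the corresponding log-likelihood (a standard formulation, not specific to any one of the models above):

$$
P(x) = \prod_{t=1}^{T} P\left(x_t \mid x_1, \dots, x_{t-1}; \theta\right), \qquad \mathcal{L}(\theta) = \sum_{t=1}^{T} \log P\left(x_t \mid x_{<t}; \theta\right),
$$

where $\theta$ denotes the model parameters. At inference time, LLM4PatchCorrect reuses exactly this next-token distribution to score the candidate correctness labels.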

• X. Zhou, K. Kim, H. H. Nguyen, J. He and D. Lo are with the School of Computing and Information Systems, Singapore Management University, Singapore.
  E-mail: {xinzhou.2020, huuhungn, jundahe, davidlo}@smu.edu.sg and falconlk00@gmail.com.
• B. Xu is with the Department of Computer Science, College of Engineering, North Carolina State University, USA.
  E-mail: bxu22@ncsu.edu.
• D. Han is with the Department of Computer Science, Royal Holloway, University of London, UK.
  E-mail: donggyun.han@rhul.ac.uk.
• T. Le-Cong and B. Le are with the School of Computing and Information Systems, The University of Melbourne, Melbourne, Australia.
  E-mail: congthanh.le@student.unimelb.edu.au and bach.le@unimelb.edu.au.