Cite this: Chem. Sci., 2024, 15, 19977
All publication charges for this article have been paid for by the Royal Society of Chemistry

Received 4th April 2024
Accepted 11th October 2024
DOI: 10.1039/d4sc02233e
rsc.li/chemical-science

ProBID-Net: a deep learning model for protein-protein binding interface design $†$

Zhihang Chen, Menglin Ji, Jie Qian, Zhe Zhang, Xiangying Zhang, Haotian Gao, Haojie Wang, Renxiao Wang* and Yifei Qi*

Abstract

Protein-protein interactions are pivotal in numerous biological processes. The computational design of these interactions facilitates the creation of novel binding proteins, which are crucial for advancing biopharmaceutical products. With the evolution of artificial intelligence (AI), protein design tools have swiftly transitioned from scoring-function-based to AI-based models. However, many AI models for protein design assume that nothing is known about the amino acid sequence of the input protein, a feature well suited to de novo design but one that poses challenges when designing protein-protein interactions for which the receptor sequence is known. To bridge this gap in computational protein design, we introduce ProBID-Net. Trained using natural protein-protein complex structures and protein domain-domain interface structures, ProBID-Net can discern features from known target protein structures to design specific binding proteins based on their binding sites. In independent tests, ProBID-Net achieved interface sequence recovery rates of 52.7%, 43.9%, and 37.6%, surpassing or being on par with ProteinMPNN in binding protein design. Validated using AlphaFold-Multimer, the sequences designed by ProBID-Net demonstrated a close correspondence between the design target and the predicted structure. Moreover, the model's output can predict changes in binding affinity upon mutations in protein complexes, even in scenarios where no data on such mutations were provided during training (zero-shot prediction). In summary, the ProBID-Net model is poised to significantly advance the design of protein-protein interactions.

Introduction

Natural proteins, despite their vital roles as carriers of various life activities, have limitations due to their specific working environment and finite lifespan. To address these limitations, the ability to create entirely novel proteins from scratch using computational algorithms becomes essential. This process is known as computational protein design (CPD). The primary objective of CPD is to identify specific combinations of amino acid residues on a native scaffold that can fold into desired protein structures with precise functions. Additionally, CPD can be utilized to optimize existing native proteins or complexes to enhance their stability or modify their functions to serve specific purposes. This powerful approach allows researchers to engineer proteins tailored to specific needs, going beyond what nature has provided and offering great potential for various applications in biomedicine and beyond.^{1}

Designing proteins is very challenging because of the vast search space of sequences and structures. Before the recent surge of AI algorithms, the most advanced methods still relied on hand-crafted energy functions and heuristic sampling algorithms, which frequently produce suboptimal solutions and are computationally intensive and time-consuming. Classical CPD methods, such as Rosetta Design,^{2} typically demand predefined protein secondary structures or specific folding modes, then select appropriate amino acids or short peptides using energy functions, perform iterative sequence-structure optimization, and finally generate output sequences by ranking the energy function scores.^{2-6} In recent years, there have been numerous impressive achievements^{7-14} in protein design through classical CPD methods, including self-assembly,^{10} immune signaling,^{12,13} enzymes,^{15} targeted therapeutics,^{14,16} and protein switches,^{11,17} demonstrating the tremendous potential of designed proteins.^{1-6}


With the advent of deep learning, CPD research has rapidly transformed from knowledge-based to data-driven methods in recent years.^{18} Artificial deep neural networks are capable of extracting protein features from existing data, generating integrated statistical motifs, and storing them in millions of parameters for inference in different protein design applications. Several deep network architectures have been widely used in protein research and have made a significant impact. Early AI protein design models, including SPIN^{19} and SPIN2,^{20} achieved a sequence recovery rate of about 34%. Later on, SPROF,^{21} ProDCoNN^{22} and our prior work, DenseCPD,^{23} employed a 3D-CNN as the model architecture. These approaches notably enhanced sequence recovery. More recently, graph neural networks (GNNs) have been employed to predict residue interactions and residue types in proteins. In this context, proteins are treated as graphs, with nodes representing residues, and the prediction becomes a graph classification problem. Graph-based models, such as ProtTrans,^{24} GVP,^{25} StructGNN,^{26} AlphaDesign,^{27} ESM-IF, ProteinMPNN,^{28} PiFold,^{29} SPIN-CGNN and VFN,^{30} have exhibited notable successes in protein design and have improved sequence recovery to 50-55%. Notably, LM-DESIGN,^{31} ABACUS-R,^{32} ProDESIGN-LE,^{33} ESM-IF and CarbonDesign^{34} have used various AI models, including large language model (LLM) and AlphaFold2 architectures, to achieve high accuracy in sequence design or generation. The designs of some of these models have been verified by wet-lab experiments and exhibited impressive success rates.^{4,5,32,35-37}


Protein-protein interactions play a vital role in numerous biological processes as they form the foundation of many molecular machines responsible for multiple functions.^{38,39} Understanding these interactions in detail can provide crucial insights into the functions of protein complexes and has significant implications for medical and drug research. Classical CPD methods often leverage information extracted from native complex structures. This involves the strategic placement of naturally occurring protein scaffolds guided by hotspot residues, followed by the generation of binders through methodologies such as library selection^{14} or antibody modification.^{40-42} Subsequently, computational saturation mutagenesis is employed to optimize the affinity and specificity of the protein binder.^{43-45}


Although the deep learning models mentioned above have demonstrated impressive results in designing individual protein units, either by predicting the joint probability of residues under given backbone constraints or by generating sequences directly, there is a notable lack of models specifically tailored for protein binder design. Developing such binder design models is therefore an important area of research that holds great potential for advancing our understanding of protein-protein interactions and would be valuable in identifying suitable binding proteins for a given target protein structure. Among the methods mentioned, both Rosetta^{2} and ProteinMPNN^{35} can be employed for the specific task of designing protein binder interfaces.

In this study, we aimed to develop a specialized model for the design and optimization of protein-protein interface residues. We used DenseNet^{46} to recognize three-dimensional structural data of protein interface residues. The resulting network, named Protein Binding Interface Design Network (ProBID-Net), was trained to learn the correlations between target residues and their surrounding interface environment, based on the distribution of residue backbone atoms found in known receptor protein chains.

As a result, ProBID-Net has effectively acquired knowledge regarding protein-protein interactions from interfaces and achieved a sequence recovery rate of 52.7% on an independent test set and 43.9% on an external test set. It exhibited low perplexity in interface residue prediction and high conservation of hydrophobic positions. We predicted the complex structures of the designs with AlphaFold-Multimer and found that the predicted structures were in good agreement with the design targets, which further verified the foldability and binding specificity of the designed sequences. In addition, the predicted probability of each amino acid at the protein interface residues can be used as a zero-shot prediction of the binding affinity change caused by mutations, providing a reference for binding affinity modification.
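As an illustration of how per-residue probabilities could serve as a zero-shot mutation score, the sketch below compares the predicted probability of a mutant residue with that of the wild type at one interface position. The log-ratio form, the function name `mutation_log_odds`, and the toy distribution are assumptions made for illustration; they are not taken from the ProBID-Net implementation.

```python
import math

# The 20 standard amino acids in one-letter code (the ordering is an assumption).
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def mutation_log_odds(probs, wild_type, mutant, eps=1e-9):
    """Return log p(mutant) - log p(wild_type) at a single position.

    `probs` is a length-20 sequence of predicted probabilities ordered as
    AMINO_ACIDS. `eps` guards against log(0).
    """
    p_wt = probs[AMINO_ACIDS.index(wild_type)]
    p_mut = probs[AMINO_ACIDS.index(mutant)]
    return math.log(p_mut + eps) - math.log(p_wt + eps)


# Hypothetical predicted distribution for one interface position:
# leucine is strongly preferred, alanine is tolerated, the rest are rare.
probs = [0.01] * 20
probs[AMINO_ACIDS.index("L")] = 0.62
probs[AMINO_ACIDS.index("A")] = 0.20

score = mutation_log_odds(probs, wild_type="L", mutant="A")
print(f"L->A log-odds score: {score:.3f}")
```

Under this assumed convention, a negative score means the model finds the mutant less compatible with the interface environment than the wild type. Relating such scores to measured affinity changes would still require comparison against experimental data.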

Results and discussion

Sequence recovery and perplexity

The ProBID-Net architecture comprises DenseNet models featuring three Dense Blocks, which were trained on the training set of QSalign-labelled^{47} heterodimers and domain-domain interfaces. Subsequently, three distinct non-redundant test sets, namely TS920, the de novo set, and the Folddock set, were employed for evaluation. Each protein-protein complex in these test sets had a sequence identity of less than 40% with those in the training set.

Model performance was evaluated using the perplexity and the average recovery rate of residues located at the interface of the ligand protein. We defined the target interface residues as residues on the unknown chain whose Cα atoms lie within 8 Å of any atom on the known receptor chain. Interface sequence recovery is measured by reading the structures in the test sets and calculating the percent identity between the designed and native sequences over all residues on the ligand protein interface. Perplexity is a measure used in information theory to quantify how well a probability distribution or probability model predicts a sample. As shown in Table 1, the model achieved an average interface sequence recovery of 37.7% on TS920, 37.6% on the de novo set and 32.8% on the Folddock set.
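The evaluation described above can be sketched in a few lines. This is a minimal illustration under stated assumptions: coordinates are plain (x, y, z) tuples in Å, perplexity is taken in its standard form as the exponential of the mean negative log-probability assigned to the native residue, and the function names and toy data are hypothetical.

```python
import math


def interface_residues(ligand_ca, receptor_atoms, cutoff=8.0):
    """Indices of ligand residues whose CA atom is within `cutoff` Å
    of any atom on the receptor chain."""
    selected = []
    for i, ca in enumerate(ligand_ca):
        if any(math.dist(ca, atom) < cutoff for atom in receptor_atoms):
            selected.append(i)
    return selected


def sequence_recovery(native, designed):
    """Percent identity between native and designed interface sequences."""
    if len(native) != len(designed) or not native:
        raise ValueError("sequences must be non-empty and of equal length")
    matches = sum(n == d for n, d in zip(native, designed))
    return 100.0 * matches / len(native)


def perplexity(native_probs):
    """exp(mean negative log-probability) of the native residues."""
    nll = -sum(math.log(p) for p in native_probs) / len(native_probs)
    return math.exp(nll)


# Toy example: three ligand CA atoms and two receptor atoms.
ligand_ca = [(0.0, 0.0, 0.0), (5.0, 0.0, 0.0), (25.0, 0.0, 0.0)]
receptor_atoms = [(0.0, 7.0, 0.0), (12.0, 0.0, 0.0)]

print(interface_residues(ligand_ca, receptor_atoms))  # [0, 1]
print(sequence_recovery("LKDA", "LRDA"))              # 75.0
print(perplexity([0.05, 0.05, 0.05, 0.05]))           # about 20
```

With this definition, a model that assigns a uniform probability of 0.05 to every residue type has a perplexity of about 20, the number of amino acid types, so lower values indicate more confident and more accurate predictions.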

To increase the amount of protein-protein interface structural data, we hypothesized that the evolutionary conservation of protein domains aligns closely with that observed at the fold level, so that domain-domain interfaces could be used to augment protein interface datasets. We assembled an additional dataset focused on domain-domain interfaces by segmenting multi-domain protein chains into domains according to CATH.^{48} Table 2 provides an evaluation of the average interface residue sequence recovery and perplexity for two ProBID-Net models, trained respectively on pure chain-chain interface data and on the set with added domain-domain interface structure data. For comparison, 1000 sequences were generated using ProteinMPNN and Rosetta Design (using the ref2015 energy function) for each complex in the three test sets, and the sequence recovery and predictive perplexity were compared.

The enhancement observed in ProBID-Net trained on both chain-chain and domain-domain interface sets (ProBID-Net) relative to the version trained only on chain-chain interfaces (ProBID-Net-CC) suggests an increased confidence in accurate protein interface sequence design when domain-domain interface structure data are incorporated into the training set.

Table 1 The average recovery of interface residue sequences and standard deviation on three independent test sets, designed by ProBID-Net trained through five-fold cross-validation^{a,b,c}

Average interface recovery (%) ↑
Model no.	1	2	3	4	5	Average
TS920	$37.9 \pm 12.8$	$38.0 \pm 12.9$	$35.9 \pm 12.2$	$40.2 \pm 13.4$	$36.7 \pm 12.5$	$37.7$
De novo	$39.4 \pm 13.9$	$35.9 \pm 13.0$	$35.0 \pm 13.4$	$40.8 \pm 14.3$	$37.0 \pm 13.8$	$37.6$
Folddock	$32.6 \pm 11.4$	$32.7 \pm 11.4$	$31.5 \pm 11.1$	$35.1 \pm 12.1$	$32.1 \pm 11.4$	$32.8$

^{a} Trained with growth rate = 70. ^{b} The domain-domain interfaces are not included in the training dataset; the training set solely comprises chain-chain interface structures extracted from heterodimers. ^{c} The format for the numbers is “Average Interface Recovery $\pm$ Standard Deviation” in percentage (%).

Table 2 Comparison of interface residues designed by ProBID-Net-CC, ProBID-Net, ProteinMPNN and Rosetta on TS920, the de novo set and the Folddock set according to the average interface recovery and perplexity^{a}

Test set	Model	Average interface recovery (%) ↑	Perplexity ↓
TS920	ProBID-Net-CC	$40.2 \pm 13.4$	3.91
TS920	ProBID-Net	$52.7 \pm 16.5$	3.02
TS920	ProteinMPNN	$36.7 \pm 18.6$	6.06
TS920	Rosetta fast design	$43.2 \pm 14.6$	
De novo set	ProBID-Net-CC	$40.8 \pm 14.3$	3.87
De novo set	ProBID-Net	$43.9 \pm 10.4$	3.67
De novo set	ProteinMPNN	$42.6 \pm 13.5$	4.12
De novo set	Rosetta fast design		
Folddock set	ProBID-Net-CC	$35.1 \pm 12.1$	4.63
Folddock set	ProBID-Net	$37.6 \pm 11.7$	4.28
Folddock set	ProteinMPNN	$39.3 \pm 18.4$	8.11
Folddock set	Rosetta fast design	$40.5 \pm 16.2$	

^{a} The format for the table is “Average Interface Recovery $\pm$ Standard Deviation” in percentage (%).

ProBID-Net achieved a sequence recovery rate of 52.7% on the independent heterodimer test set (TS920), surpassing ProteinMPNN (36.7%) and Rosetta (43.2%). On the de novo protein-protein complex test set (de novo set) and the Folddock test set (Folddock set), ProBID-Net achieved sequence recovery rates of 43.9% and 37.6%, respectively. Notably, ProBID-Net recovered more interface residues than both ProteinMPNN and Rosetta Design on TS920 and the de novo set. However, ProBID-Net does not achieve the highest performance on the Folddock set. We attribute this outcome to the removal of all structures from the Folddock dataset that exhibited high similarity to those in the ProBID-Net training set, which reduced the number of complexes from 2734 to 1106. The remaining complexes in the Folddock set differ significantly from the training set, which likely contributed to the observed reduction in performance. In contrast, the CATH4.2 dataset used for training ProteinMPNN was not subjected to a similar structural dereplication process relative to the Folddock set. This lack of filtering enabled ProteinMPNN to more easily predict the correct interface residues.

Regarding the perplexity of interface residues, ProBID-Net consistently outperformed the other two models on all test sets. This robust performance underscores the efficacy of ProBID-Net in designing protein-protein interaction interfaces. In Fig. 1, we plotted the distribution of sequence recovery rates for ProBID-Net and ProteinMPNN on TS920, the de novo set, and the Folddock set. Additionally, metrics such as residue type precision, recall and F1 score of ProBID-Net and ProteinMPNN are presented in Fig. S1.†

The flexibility of certain positions in protein structural interfaces, which allows amino acids to be replaced without compromising the stability of the structure and potentially enhances binding strength, highlights the dynamic nature of these regions. Our objective was to assess our model on interface residues thoroughly, looking beyond exact sequence identity to account for the variation seen under natural conditions. To accomplish this, we utilized the BLOSUM score, a metric that combines BLOSUM62 (ref. 49) values and the probabilities predicted by ProBID-Net. The calculation of this score follows a similar approach to the evaluation methodology used in SPIN-CGNN.^{50,51} This score captures aspects of both perplexity and amino acid substitution.
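The exact formula for the BLOSUM score is not given in this section. One plausible reading, sketched below purely as an assumption, is the expected BLOSUM62 substitution score of the native residue under the predicted distribution. To keep the example self-contained, only a three-residue subset of BLOSUM62 (A, D, L) is included, and the names `BLOSUM62_SUBSET`, `blosum` and `expected_blosum` are hypothetical.

```python
# Subset of the BLOSUM62 matrix for alanine (A), aspartate (D) and leucine (L).
# Keys are alphabetically sorted residue pairs; the matrix is symmetric.
BLOSUM62_SUBSET = {
    ("A", "A"): 4,
    ("A", "D"): -2,
    ("A", "L"): -1,
    ("D", "D"): 6,
    ("D", "L"): -4,
    ("L", "L"): 4,
}


def blosum(a, b):
    """Look up the substitution score for residues a and b."""
    return BLOSUM62_SUBSET[tuple(sorted((a, b)))]


def expected_blosum(native, probs):
    """Probability-weighted substitution score against the native residue.

    `probs` maps residue letters to predicted probabilities at one position.
    """
    return sum(p * blosum(native, aa) for aa, p in probs.items())


# Toy predictions for a position whose native residue is leucine.
confident = {"L": 0.7, "A": 0.2, "D": 0.1}    # most mass on the native residue
mismatched = {"L": 0.1, "A": 0.2, "D": 0.7}   # most mass on a dissimilar residue

print(round(expected_blosum("L", confident), 3))   # 2.2
print(round(expected_blosum("L", mismatched), 3))  # -2.6
```

Unlike plain sequence recovery, a score of this kind gives partial credit when the model favours a residue that is chemically similar to the native one and penalizes probability mass placed on dissimilar residues.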

As presented in Table 3, ProBID-Net demonstrated better performance compared to ProteinMPNN across all three test sets, as indicated by the relative BLOSUM. These findings

Fig. 1 The distribution of sequence recovery rates for both ProBID-Net and ProteinMPNN on TS920 (A), the de novo set (B), and the Folddock set (C). The violin plots represent the interface residue sequence recovery from 920 heterodimers in TS920, 62 heterodimers in the de novo set, and 1106 heterodimers in the Folddock set.

Department of Medicinal Chemistry, School of Pharmacy, Fudan University, 826 Zhangheng Road, Shanghai 201203, People’s Republic of China. E-mail: wangrx@fudan.edu.cn; yfqi@fudan.edu.cn
$†$ Electronic supplementary information (ESI) available. See DOI: https://doi.org/10.1039/d4sc02233e
