
DynamicBind: predicting ligand-specific protein-ligand complex structure with a deep equivariant generative model

Received: 24 August 2023
Accepted: 24 January 2024

Published online: 05 February 2024


Wei Lu$^{1,5}$, Jixian Zhang$^{1,5}$, Weifeng Huang$^{2}$, Ziqiao Zhang$^{1}$, Xiangyu Jia$^{1}$, Zhenyu Wang$^{1}$, Leilei Shi$^{1}$, Chengtao Li$^{1}$, Peter G. Wolynes$^{3}$ & Shuangjia Zheng$^{4,5}$

While significant advances have been made in predicting static protein structures, the inherent dynamics of proteins, modulated by ligands, are crucial for understanding protein function and facilitating drug discovery. Traditional docking methods, frequently used in studying protein-ligand interactions, typically treat proteins as rigid. While molecular dynamics simulations can propose appropriate protein conformations, they’re computationally demanding due to rare transitions between biologically relevant equilibrium states. In this study, we present DynamicBind, a deep learning method that employs equivariant geometric diffusion networks to construct a smooth energy landscape, promoting efficient transitions between different equilibrium states. DynamicBind accurately recovers ligand-specific conformations from unbound protein structures without the need for holo-structures or extensive sampling. Remarkably, it demonstrates state-of-the-art performance in docking and virtual screening benchmarks. Our experiments reveal that DynamicBind can accommodate a wide range of large protein conformational changes and identify cryptic pockets in unseen protein targets. As a result, DynamicBind shows potential in accelerating the development of small molecules for previously undruggable targets and expanding the horizons of computational drug discovery.
Remarkable progress has been achieved in the realm of protein structure prediction from sequence data. Some prediction techniques use machine learning in concert with molecular dynamics or Monte Carlo$^{1-4}$. These generate an ensemble of structures. AlphaFold, which leads the way in the prediction of nearly all structures in the human proteome$^{5-8}$, however, typically generates only a few conformations for each protein sequence, despite the fact that proteins are inherently dynamic and generally adopt multiple conformations to perform their functions$^{9,10}$. The ability of proteins to interconvert between different

conformations is central to their biological activities in all domains of life. The therapeutic effect of drug molecules arises from their specific binding to only some conformations of the target proteins, thereby modulating essential biological activities by altering the conformational landscape of these proteins$^{11-14}$. In practice, interactions between proteins and ligands are now commonly studied computationally through molecular docking. Docking is a key component of structure-based drug discovery$^{15}$. Nevertheless, despite the widespread recognition of the importance of protein dynamics, traditional
docking methods often treat proteins as rigid, or in some cases, as being only partially flexible, permitting only selected side chains to move, to manage computational costs$^{16,17}$. As a result, AlphaFold-predicted structures of apoproteins, when used as inputs for docking, will yield ligand pose predictions that do not align well with the ligand-bound co-crystallized holo-structures$^{18,19}$. The AlphaFold-predicted structures often do not present the most favorable side-chain rotamer configurations for ligand binding, and consequently the relevant binding pocket will appear to be inaccessible since the apoprotein adopts a conformation substantially different from the holo state.
Here, we present DynamicBind, a geometric deep generative model designed for “dynamic docking”. Unlike traditional docking methods that treat proteins as mostly rigid entities, DynamicBind efficiently adjusts the protein conformation from its initial AlphaFold prediction to a holo-like state. Our model is capable of handling a wide range of large conformational changes during prediction, such as the well-known DFG-in to DFG-out transition in kinase proteins, a challenge that has been formidable for other methods, such as molecular dynamics (MD) simulations$^{20,21}$. We have attained this efficiency in sampling large protein conformational changes by learning a funneled energy landscape, where the transitions between biologically relevant states are minimally frustrated$^{22}$. This is made possible through the innovative employment of a morph-like transformation for decoy generation during training (more in “DynamicBind architectures” and “Methods”). The present method shares similarities with the Boltzmann generator$^{23,24}$, as it allows for direct and efficient sampling of low-energy states from the learned model. Unlike traditional Boltzmann generators, which are typically constrained to the systems on which they are trained, however, DynamicBind is a generalizable model that can handle new proteins and ligands.
In the upcoming results section, we present a comprehensive evaluation of DynamicBind, illustrating its potential to aid drug discovery. Our presentation is organized into six segments: first, outlining the DynamicBind model; then benchmarking DynamicBind against current docking methods; then highlighting the method’s ability to sample large protein conformational changes in a ligand-specific manner; then specifically demonstrating the scope of conformational changes it can handle; illustrating its capacity to predict cryptic pockets through a case study; and finally showcasing its application to the proteome-wide virtual screening task using an antibiotics dataset$^{25}$. These investigations collectively highlight the potential of DynamicBind, setting the stage for further understanding and manipulating the protein-ligand interaction landscape.

Results

DynamicBind architectures

DynamicBind executes “dynamic docking”, a process that performs prediction of the protein-ligand complex structure while accommodating substantial protein conformational changes. DynamicBind accepts apo-like structures (in the present study, AlphaFold-predicted conformations) in PDB format and small-molecule ligands in several widely available formats, such as Simplified Molecular Input Line Entry System (SMILES) or structure-data file (SDF) format. During inference, the model randomly places the ligand, whose seed conformation is generated using RDKit$^{26}$, around the protein. Then, over the course of 20 iterations (more details in “Model architecture”), using progressively smaller time steps, the model gradually translates and rotates the ligand while adjusting its internal torsional angles. After the initial five steps where only the ligand conformation is changed, the model then simultaneously translates and rotates the protein residues, while modifying the side-chain chi angles$^{27}$, in the remaining steps.
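The iteration schedule described above can be sketched as follows. This is only an illustration of the bookkeeping (20 steps, ligand-only updates for the first five, shrinking step sizes). The `StepPlan` type, the `build_schedule` helper, and the linearly decreasing step size are assumptions made for this example; the text states only that the time steps become progressively smaller.

```python
from dataclasses import dataclass

N_STEPS = 20           # total number of iterations stated in the text
LIGAND_ONLY_STEPS = 5  # initial steps in which only the ligand is updated


@dataclass
class StepPlan:
    index: int            # 0-based iteration index
    step_size: float      # relative size of the update at this iteration
    update_ligand: bool   # translate/rotate ligand and adjust its torsions
    update_protein: bool  # translate/rotate residues and adjust chi angles


def build_schedule(n_steps=N_STEPS, ligand_only=LIGAND_ONLY_STEPS):
    """Return one StepPlan per iteration.

    The linear decay of step_size is a hypothetical choice for this sketch,
    not the schedule used by the authors.
    """
    plans = []
    for i in range(n_steps):
        step_size = (n_steps - i) / n_steps  # 1.0, 0.95, ..., 0.05
        plans.append(
            StepPlan(
                index=i,
                step_size=step_size,
                update_ligand=True,
                update_protein=(i >= ligand_only),
            )
        )
    return plans


if __name__ == "__main__":
    for plan in build_schedule():
        print(plan)
```

In a real pipeline each `StepPlan` would gate which predicted updates are applied to the current coordinates; that model call is deliberately left out here.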
As illustrated in Fig. 1a, at each step, the features and the coordinates of the protein and the ligand are fed into an $\mathrm{SE}(3)$-equivariant interaction module. Subsequently, the protein and readout modules generate the predicted translation, rotation, and dihedral updates for

the current state. Further details about the model are given in “Transformation of the protein conformation”. Unlike the traditional protocol employed in diffusion-based model training, which generates decoys by perturbing the native state with Gaussian noise of varying magnitudes$^{28-32}$, our method employs a morph-like transformation to produce protein decoys. In the process of doing this, the native conformation is gradually transitioned towards the AlphaFold-predicted conformation. The structure of proteins is highly constrained in many ways, with residues linked by peptide bonds, and the bond lengths are governed by chemical principles. When decoys are generated using Gaussian noise, the model primarily learns only to revert to the most chemically stable conformation, often the conformation before the noise was added. In the present task, the ligand-bound holo conformation is unknown, and the most readily available protein structure is the one predicted by AlphaFold, which often significantly differs from the holo conformation. Given that the AlphaFold-predicted structure often already complies with most chemical constraints, it is challenging to anticipate how the model trained on decoys made merely from Gaussian noise could accurately predict long timescale transformations of biological relevance, which are our primary concern. In contrast, the decoys generated by our morph-like transformation generally satisfy the basic chemical constraints, allowing our model to concentrate on learning biophysically relevant state-changing events. In unbiased molecular dynamics simulations, transitions between meta-stable states, such as the DFG ‘in’ and ‘out’ transition, are infrequent due to the realistic yet rugged energy landscape inherent in the all-atom force field$^{21}$. Our method, in contrast, features a significantly more funneled energy landscape, effectively lowering the free energy barrier between biologically meaningful states. Consequently, akin to other Boltzmann generator methods$^{23,33}$, the present approach demonstrates markedly enhanced efficiency in sampling alternate states pertinent to ligand binding. A schematic figure has been included to elucidate these differences (Fig. 1b).
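The contrast between the two decoy-generation strategies can be made concrete with a small sketch. It is a deliberately simplified stand-in: `morph_decoy` linearly interpolates pre-aligned Cartesian coordinates between the native and AlphaFold conformations, which is an assumption of this example and not the authors' transformation (a plain linear interpolation does not by itself preserve bond lengths, whereas the text says the actual decoys generally satisfy basic chemical constraints). The function names and the parameters `sigma` and `t` are hypothetical.

```python
import numpy as np


def gaussian_decoy(native, sigma, rng):
    """Diffusion-style decoy: perturb every native coordinate with isotropic noise."""
    native = np.asarray(native, dtype=float)
    return native + rng.normal(scale=sigma, size=native.shape)


def morph_decoy(native, alphafold, t):
    """Simplified morph: move the native coordinates a fraction t towards AlphaFold.

    Assumes both arrays have shape (n_atoms, 3), share atom ordering, and are
    already superimposed. t = 0 returns the native state, t = 1 the AlphaFold state.
    """
    native = np.asarray(native, dtype=float)
    alphafold = np.asarray(alphafold, dtype=float)
    if native.shape != alphafold.shape:
        raise ValueError("native and alphafold must have the same shape")
    return (1.0 - t) * native + t * alphafold


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    native = np.zeros((4, 3))
    alphafold = np.ones((4, 3))
    print(gaussian_decoy(native, sigma=0.5, rng=rng))
    print(morph_decoy(native, alphafold, t=0.25))
```

The point of the comparison is where the decoys lie: Gaussian decoys scatter around the native state in all directions, while morph decoys lie along a path between the two conformations that the model must learn to traverse at inference time.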
DynamicBind achieves higher accuracy in ligand pose prediction and improves the initial AlphaFold-predicted protein conformations

To evaluate our method, we first utilized the PDBbind dataset$^{34}$ and, in line with previous works$^{19,35,36}$, we trained the model using a chronological, time-based split of the training, validation, and test sets. Since the PDBbind test set, comprising around 300 structures from 2019, includes many non-small-molecule ligands (53 cases being polypeptides), we extended the scope of our assessment using a curated Major Drug Target (MDT) test set. The MDT set includes 599 structures that were deposited in or after 2020, with both drug-like ligands and proteins from four major protein families: kinases, GPCRs, nuclear receptors, and ion channels (refer to “Dataset construction” for more details). These protein families represent the targets of about 70% of FDA-approved small-molecule drugs$^{37}$.
Instead of using holo-structures as the input, we adopted a more challenging and realistic scenario during testing, where we assumed the holo protein conformation is not available and only use the protein conformations predicted by AlphaFold as our input. Holo conformations exhibit strong shape and charge complementarity to co-crystallized ligands, which unrealistically simplifies ligand pose prediction$^{11}$. In contrast, the apo conformations or those predicted by AlphaFold may clash with transplanted ligands obtained by superimposing crystal structures$^{14}$.
As shown in Fig. 2a and b, DynamicBind predicts more cases with ligand RMSD below various thresholds than other baselines. In particular, the fraction of cases with ligand RMSD below 2 Å (5 Å) is 33% (65%) on the PDBbind test set and 39% (68%) on the MDT test set.
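For readers unfamiliar with the metric, the fraction of cases below an RMSD threshold can be computed as in the sketch below. It assumes that predicted and reference ligand coordinates share the same atom ordering and reference frame, and it ignores symmetry-equivalent atoms; whether the benchmark applies a symmetry correction is not stated in this section. The function names are illustrative.

```python
import numpy as np


def ligand_rmsd(pred, ref):
    """Root-mean-square deviation between two (n_atoms, 3) coordinate arrays.

    No superposition and no symmetry correction are applied (assumption).
    """
    pred = np.asarray(pred, dtype=float)
    ref = np.asarray(ref, dtype=float)
    if pred.shape != ref.shape:
        raise ValueError("pred and ref must have the same shape")
    squared_dist = ((pred - ref) ** 2).sum(axis=1)  # per-atom squared distance
    return float(np.sqrt(squared_dist.mean()))


def fraction_below(rmsds, threshold):
    """Fraction of cases whose RMSD is strictly below the threshold (in Å)."""
    rmsds = np.asarray(rmsds, dtype=float)
    return float((rmsds < threshold).mean())


if __name__ == "__main__":
    ref = [[0.0, 0.0, 0.0], [1.0, 0.0, 0.0]]
    pred = [[0.0, 0.0, 2.0], [1.0, 0.0, 2.0]]  # rigid shift of 2 Å along z
    print(ligand_rmsd(pred, ref))
    print(fraction_below([1.0, 3.0, 6.0, 1.5], threshold=2.0))
```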
Evaluating models solely on ligand RMSD may favor deep learning-based models (DiffDock, TankBind, and DynamicBind) due to

Fig. 1 | Overview of DynamicBind model. a The holo state is represented in pink, the initial apo and the model-predicted conformation in green. The native ligand is depicted in cyan, and the predicted ligand shown in orange. The model accepts as input both the features and the current conformation of the protein and ligand. The output readouts include the predicted updates: global translation and rotation for both the ligand and each protein residue, the rotation of torsional angles for the ligands and chi angles for the protein residues, and two prediction modules (binding affinity, A and confidence score, D). During the training phase, the model is

designed to learn the transformation from the apo-like conformation into the holo conformation. During inference, the model iteratively updates the initial input structure twenty times. b A schematic figure shows that our model could predict the two different holo conformations when the protein binds with two different ligands. Our model could predict the bound protein conformation within 20 steps, while millions of steps of all-atom MD simulations are needed to find the same bound state.

their higher clash tolerance, while it may disadvantage force field-based methods (GNINA, GLIDE, VINA) that strictly enforce Van der Waals forces. Significant clashes can impede interaction analysis in structure-based drug design, obscuring crucial molecular interactions and complicating the design of molecule improvements. Consequently, we use both ligand RMSD and clash scores (as defined by Hekkelman et al.$^{14}$) to assess success rates. Figure 2c shows the success rates using both a stringent criterion (ligand RMSD < 2 Å, clash score < 0.35) and a more relaxed criterion (ligand RMSD < 5 Å, clash score < 0.5). The success rate of DynamicBind (0.33) is 1.7 times that of the best baseline, DiffDock (0.19), under the more stringent condition. Furthermore, DynamicBind has demonstrated the ability to reduce the pocket RMSD relative to the initial AlphaFold structure, even in cases

with large original pocket RMSDs (Fig. 2d). This observation highlights that the present approach is capable of managing substantial conformational changes, recovering holo-structures when other methods may struggle. Given our model’s ability to generate diverse conformations, we developed the contact-LDDT (cLDDT) scoring module, a concept inspired by AlphaFold’s LDDT score. The module’s purpose is to select the most suitable complex structure from the predicted outputs. As shown in Fig. 2e, our predicted cLDDT correlates well with the actual ligand RMSD, indicating its effectiveness in selecting high-quality complex structures. The auROC score, with ligand RMSD below 2 Å as the true positive, is 0.764. While our cLDDT scoring function is effective, there is potential for improvement. Perfect selection could enhance our success rate from 0.33 to 0.5, as illustrated in Fig. 2f. Even

Fig. 2 | Benchmark results overview. a, b DynamicBind outperforms other methods in predicting ligand poses for both the PDBbind dataset (a) and major drug targets (MDT) dataset (b) across different RMSD thresholds. c Dark and light shades represent success rates under stringent (ligand RMSD < 2 Å, clash score < 0.35) and relaxed (ligand RMSD < 5 Å, clash score < 0.5) criteria, respectively. d The protein conformations predicted by DynamicBind are more native-like, as evidenced by the lower pocket RMSD around the binding sites. e The contact-LDDT

(cLDDT) score predicted by DynamicBind correlates well with the ligand RMSD and is a good predictor of the true ligand RMSD below 2 Å (auROC 0.764). f As the number of generated samples increases, the success rate increases. c-f display results for the combined PDBbind and MDT test sets. Results for individual datasets, as well as those filtered at 30%, 60%, and 90% maximum ligand and protein sequence similarity cutoffs, are detailed in Supplementary Figs. 1-4. Source data are provided as a Source Data file.

in the absence of an ideal selection model, our method considerably outperforms DiffDock and the top force field-based method, GLIDE. Due to the variability in the number of samples produced by Glide, often because of its filtering scheme eliminating unrealistic conformations, Glide’s best performance is represented using a flat line. This line reflects the success rate determined by the most effective sample from Glide. DynamicBind’s exceptional performance stems from its ability to undergo significant protein conformational changes, leading to a better fit between the protein and the ligand.
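The joint success criteria and the gap between confidence-based and perfect selection can be illustrated with a short sketch. The thresholds are the ones quoted in the text; the dictionary keys, the function names, and the convention that a higher predicted cLDDT marks a better sample are assumptions of this example.

```python
def is_success(rmsd, clash, stringent=True):
    """Apply the stringent (< 2 Å, < 0.35) or relaxed (< 5 Å, < 0.5) criterion."""
    rmsd_max, clash_max = (2.0, 0.35) if stringent else (5.0, 0.5)
    return rmsd < rmsd_max and clash < clash_max


def select_by_confidence(samples):
    """Pick the sample with the highest predicted cLDDT (assumed: higher is better)."""
    return max(samples, key=lambda s: s["clddt"])


def selected_success(samples, stringent=True):
    """Success of the single sample chosen by the confidence score."""
    best = select_by_confidence(samples)
    return is_success(best["rmsd"], best["clash"], stringent)


def oracle_success(samples, stringent=True):
    """Success under perfect selection: any generated sample meets the criterion."""
    return any(is_success(s["rmsd"], s["clash"], stringent) for s in samples)


if __name__ == "__main__":
    # Toy values for one target; not taken from the paper.
    samples = [
        {"rmsd": 1.2, "clash": 0.40, "clddt": 0.9},
        {"rmsd": 1.8, "clash": 0.20, "clddt": 0.7},
        {"rmsd": 6.0, "clash": 0.10, "clddt": 0.3},
    ]
    print(selected_success(samples), oracle_success(samples))
```

In the toy case the top-ranked sample fails the stringent clash cutoff while a lower-ranked one passes, which is the kind of case that separates the achieved success rate from the perfect-selection ceiling.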
To assess the model’s generalization to new proteins and ligands, we analyzed results stratified by maximum ligand and protein sequence similarity to the training set (Supplementary Figs. 3 and 4). This analysis reveals that DynamicBind performs well with new ligands, outperforming others, but is less effective with new proteins, where it is outperformed by classical docking methods with predefined ground-truth binding pockets. Other deep learning methods also show similar declines with new proteins, hinting at a need for larger training sets and improved inductive biases. Moreover, considering that the identification of binding sites on new proteins is an active research area, the challenges encountered in blind global docking by deep learning methods, including ours, are likely shared across different approaches. Overall, DynamicBind’s proficiency with new ligands is significant in drug discovery, highlighting its potential in identifying protein conformational changes vital for creating effective, specific drugs.

DynamicBind can capture ligand-specific protein conformational changes

Conventional docking protocols usually perform protein conformation sampling as a separate step from the docking process$^{15,38}$. In many instances, however, two distinct ligands may fit into mutually exclusive protein conformations. For example, c-Met kinase can adopt two different conformations, corresponding to active and inactive states, typically referred to as the Asp-Phe-Gly (DFG)-in and DFG-out conformations (Fig. 3b, d). The DFG motif can flip out, subsequently blocking or opening up different regions of the protein. In previous docking models, the protein must be preset to the correct conformation to have a chance of identifying the appropriate binding pose for the ligand$^{20}$. In contrast, DynamicBind, utilizing the protein conformation predicted by AlphaFold (Fig. 3a), can dynamically adjust the protein conformation to find the optimal conformation that accommodates the ligand of interest. As a representative case, for PDB 6UBW, the predicted ligand RMSD is 0.49 Å and the pocket RMSD is 1.97 Å, while the pocket RMSD for the AlphaFold structure is 9.44 Å. For PDB 7V3S, the predicted ligand RMSD is 0.51 Å and the pocket RMSD is 1.19 Å (AlphaFold: 6.02 Å). Neither of the two ligands was seen in the training set (Fig. 3c, e). In our quantitative analysis, only seven proteins from the test set, represented in 79 PDB structures, were found to adopt both DFG-in and DFG-out conformations, as annotated by the Kinase-Ligand Interaction Fingerprints and Structures (KLIFS) web server$^{39}$. Figure 3f and g demonstrate how

Fig. 3 | DynamicBind captures ligand-specific protein conformational changes. AlphaFold-predicted structures are depicted in white, and the protein and ligand of the crystal structure in pink and cyan, respectively. Our model’s predictions are shown in green and orange, for the protein and ligand, respectively. The side chains of the Asp-Phe-Gly (DFG) residues are shown as sticks. Red arrows highlight significant conformational changes of the crystal structure from the AlphaFold structure. The input conformation is the AlphaFold-predicted conformation. a When the ligand 84S (b) binds to c-Met protein, the protein adopts a DFG-in conformation. When the ligand 519 (d) binds to the same protein, the protein adopts a DFG-out conformation. Our prediction for both ligands (c, e) agrees well

with the crystal structure. Ligand RMSD is 0.49 Å and 0.51 Å. Improvement of pocket RMSD from initial AlphaFold is 7.47 Å and 4.83 Å for DFG-in and DFG-out, respectively. Among the test set, seven proteins (identified by their UniProt IDs) contain both DFG-in and DFG-out crystallized holo conformations. The pocket RMSDs of both the initial AlphaFold and the predicted structures are shown in (f, n = 39) and (g, n = 34) for DFG-in holo conformations and DFG-out holo conformations separately, where the central line marks the median, box edges indicate the upper and lower quartiles, whiskers extend up to 1.5 times the interquartile range, and individual points are shown as dots. h The histogram of the improvement in pocket RMSD from AlphaFold for all 79 PDBs. Source data are provided as a Source Data file.

these proteins (denoted by their UniProt IDs), starting from the same initial structure, move progressively towards the DFG-in conformation upon type-I inhibitor binding, and incline towards the DFG-out conformation when interacting with a type-II inhibitor. Further, Fig. 3h reveals that the majority of the predicted protein structures show a lower pocket RMSD compared to the initial AlphaFold structures. These results demonstrate that DynamicBind is capable of capturing ligand-specific conformational changes. This feature is critical in

preventing the overlooking of potential “hit” compounds that could bind well with conformations distinct from the initially provided protein structure.
DynamicBind covers multi-scale protein conformation changes

The DFG-in/out conformation has been extensively studied, and some challenges can be partially addressed by employing ensemble docking, wherein proteins in both conformations are utilized for docking$^{40,41}$.
Ensemble docking, however, elevates computational costs and may not be suitable for less well-characterized conformations. In this section, we provide a comprehensive analysis of six distinct conformational changes spanning the picosecond to millisecond level, each exemplified by a case found in our PDBbind test set. In Fig. 4, the crystal structure is depicted in pink, the AlphaFold structure in white, and our prediction in green. The native ligand is illustrated in cyan, and our predicted ligand is in orange. $\Delta$ pocket RMSD measures the difference in pocket RMSD between the predicted protein structure and the AlphaFold structure, based on comparison with the crystal structure. A negative $\Delta$ pocket RMSD indicates that the predicted structure aligns more closely with the crystal structure than the AlphaFold prediction does. $\Delta$ clash measures the difference in clash scores between the predicted protein-ligand pair and the AlphaFold structure with the transplanted ligand$^{14}$. A negative $\Delta$ clash indicates fewer clashes in the predicted complex. In Fig. 4a, the native ligand clashes with a side chain of the superimposed AlphaFold structure; in our prediction, this side chain shifts towards the native conformation, thus resolving the clash. In Fig. 4b, a part of the pocket is blocked by a tyrosine in the AlphaFold structure; it becomes accessible in both our predicted and native structures. In Fig. 4c, a flexible loop intersects with the ligand, and it moves away in our prediction, consistent with the native structure. In Fig. 4d, alpha helices transform into loops near the ligand-binding site. In Fig. 4e, a substantial secondary structure motion is observed in the heat shock protein Hsp90α, transitioning from the

closed state to the open state. In Fig. 4f, two domains of AKT1 kinase coalesce, forming a pocket that did not previously exist. Taken together, the present model can predict diverse types of conformational changes associated with ligand binding when the ligand-binding pocket is either insufficiently spacious or unformed in the AlphaFoldpredicted conformations.
闭合状态转变为开放状态。图 4f 中,AKT1 激酶的两个结构域合并,形成了一个之前不存在的口袋。综上所述,当配体结合口袋在 AlphaFold 预测的构象中空间不足或未形成时,本模型能够预测与配体结合相关的多种类型构象变化。
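The two $\Delta$ metrics defined above are plain differences against the AlphaFold baseline. The following minimal sketch illustrates the arithmetic only. It assumes the pocket atoms are already superimposed on the crystal structure and listed in matching order; the function names and toy coordinates are illustrative and are not taken from the DynamicBind code.

```python
import math

def rmsd(coords_a, coords_b):
    """Root-mean-square deviation between two equally ordered lists of (x, y, z)."""
    assert len(coords_a) == len(coords_b) and coords_a
    sq = sum(math.dist(p, q) ** 2 for p, q in zip(coords_a, coords_b))
    return math.sqrt(sq / len(coords_a))

def delta_pocket_rmsd(pred_pocket, af_pocket, crystal_pocket):
    """Negative when the prediction is closer to the crystal pocket than AlphaFold is."""
    return rmsd(pred_pocket, crystal_pocket) - rmsd(af_pocket, crystal_pocket)

def delta_clash(pred_clash_score, af_clash_score):
    """Negative when the predicted complex clashes less than AlphaFold + transplanted ligand."""
    return pred_clash_score - af_clash_score

# Toy pocket of two atoms (coordinates in Å, hypothetical values).
crystal = [(0.0, 0.0, 0.0), (1.5, 0.0, 0.0)]
alphafold = [(0.0, 2.0, 0.0), (1.5, 2.0, 0.0)]   # every atom 2.0 Å away from the crystal
predicted = [(0.0, 0.5, 0.0), (1.5, 0.5, 0.0)]   # every atom 0.5 Å away from the crystal
d = delta_pocket_rmsd(predicted, alphafold, crystal)
```

In this toy case `d` is negative because the predicted pocket lies closer to the crystal pocket than the AlphaFold pocket does.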

DynamicBind reveals cryptic pockets significant to drug discovery

The dynamic nature of proteins often gives rise to cryptic pockets. These cryptic pockets, which appear during protein dynamics, can reveal druggable sites not found in static structures, thus turning previously 'undruggable' proteins into potential drug targets. We demonstrate the utility of DynamicBind in revealing these cryptic pockets using the SET domain-containing protein 2 (SETD2), a histone methyltransferase, as a case study. SETD2, critical for the treatment of multiple myeloma (MM) and diffuse large B-cell lymphoma (DLBCL)$^{42,43}$, has a cryptic pocket targeted by a highly selective compound, EZM0414, currently undergoing Phase I clinical trials. As illustrated in Fig. 5a, b, all SETD2 homologs in the training set, defined by a protein Smith-Waterman similarity$^{44}$ over 0.4, are co-crystallized with S-Adenosyl methionine (SAM) or Sinefungin analogs, depicted as lines. Sinefungin and its analogs broadly inhibit methyltransferases by occupying the SAM site$^{45}$, making the selective inhibition of SETD2 challenging. Before 2019, no structure of SETD2 or its homologs had

Fig. 4 | DynamicBind effectively captures protein dynamics across diverse time scales. Proteins undergo conformational changes that can occur across a range of time scales upon binding with small-molecule ligands. A negative $\Delta$ pocket RMSD indicates the predicted structure has a lower RMSD with the ground truth relative to the AlphaFold structure. A negative $\Delta$ clash implies that the predicted ligand has a lower clash score with the predicted structure compared to the transplanted ligand with the AlphaFold structure. $\mathbf{a}$ The side chain of Arginine rotates, mitigating clashes with the ligand. $\mathbf{b}$ The Tyrosine in the AlphaFold structure that was obstructing the binding pocket shifts away in the predicted structure. $\mathbf{c}$ The loop region of the AlphaFold structure intersects with the ligand, and it is re-positioned in the predicted structure. $\mathbf{d}$ Alpha helices near the binding site transform into loops, aligning with the crystal structure. $\mathbf{e}$ The alpha helix of Hsp90 experiences a considerable relocation. $\mathbf{f}$ Two domains coalesce, thereby forming the binding pocket. Source data are provided as a Source Data file.

Fig. 5 | DynamicBind reveals cryptic pocket for ligand EZM0414. $\mathbf{a}$ Only six PDBs in the training set have a protein Smith-Waterman similarity greater than 0.4 with the SETD2 protein, and all are co-crystallized with SAM-like ligands, also shown as lines in ($\mathbf{b}$). The ligand of PDB 7TY2, EZM0414, is displayed in cyan sticks, with the protein shown in pink. $\mathbf{c}$ The binding pocket for EZM0414 is absent in the AlphaFold structure, depicted in white. $\mathbf{d}$ This panel shows the Tanimoto similarity of ligands in the training set compared to EZM0414, and the three most similar ligands are drawn out. $\mathbf{e}$ The protein-ligand complex structure as predicted by DynamicBind, with the protein represented in green and the ligand in orange. $\mathbf{f}$ The superposition of the complex as predicted by DynamicBind and the corresponding crystal structure. Source data are provided as a Source Data file.

been crystallized with a compound bound at the site targeted by EZM0414 (depicted in cyan sticks). Consequently, our model had not been trained on any structures with a compound bound to this newly identified site. In Fig. 5c, the AlphaFold structure and its surface are shown in white. The cryptic site appears blocked, causing substantial clashes with the transplanted EZM0414. Figure 5d confirms EZM0414 as an unseen ligand: even the training-set ligands with the highest Tanimoto similarity differ substantially from EZM0414. Figure 5e displays the protein-ligand complex structure predicted by our model, taking the AlphaFold-predicted structure of SETD2 and the SMILES representation of EZM0414 as inputs. Figure 5f overlays our prediction with the crystal structure of the SETD2-EZM0414 complex (PDB 7TY2). The resultant ligand RMSD is 1.4 Å, and the pocket RMSD is 2.16 Å. Furthermore, we have included in the Supplementary Information several cases from the Cryptosite dataset$^{46}$ that have low sequence similarity to our training set.
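The ligand-novelty check in Fig. 5d rests on Tanimoto similarity between molecular fingerprints. As a minimal sketch, the similarity can be computed on fingerprints stored as sets of on-bit indices. The fingerprint type used for the figure is not stated in this section, and the bit sets and ligand names below are hypothetical placeholders.

```python
def tanimoto(bits_a, bits_b):
    """Tanimoto (Jaccard) similarity of two fingerprints given as sets of on-bit indices."""
    a, b = set(bits_a), set(bits_b)
    if not a and not b:
        return 1.0  # convention chosen here for two empty fingerprints
    return len(a & b) / len(a | b)

# Hypothetical query fingerprint and a tiny "training set".
query = {1, 4, 7, 9, 12}
train = {"lig_A": {1, 4, 20}, "lig_B": {7, 9, 12, 30}, "lig_C": {40, 41}}

# Rank training ligands from most to least similar to the query.
ranked = sorted(train, key=lambda name: tanimoto(query, train[name]), reverse=True)
best_similarity = tanimoto(query, train[ranked[0]])
```

A ligand counts as unseen in this sense when even `best_similarity` is low.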

DynamicBind achieves better screening performance in an antibiotics benchmark

In target-based drug discovery, both screening of potential drug candidates and reverse screening, where protein targets are identified for specific compounds, are crucial. These processes require accurate prediction of binding affinity, the measure of the interaction strength between a protein and a compound, at a proteome level. Therefore, we have added an affinity prediction module to our model, trained on experimentally measured binding affinity data from the PDBbind dataset. To assess DynamicBind in a real-world virtual screening scenario, we used a recently published antibiotic experimental benchmark$^{25}$. This dataset comprises a panel of 2616 protein-compound pairs, none of which were encountered during our training phase. It features 12 proteins from the essential proteome of Escherichia coli paired with 218 active antibacterial compounds. Figure 6a shows that DynamicBind surpasses both common docking methods such as VINA and DOCK6.9 and the best machine learning-based re-scoring methods, achieving a mean area under the receiver operating characteristic curve (auROC) of 0.68. Baseline numbers are taken directly from the benchmark paper$^{25}$. This performance improvement is due to DynamicBind's dynamic docking capability, which refines the AlphaFold structure towards a more native-like state, leading to a more precise binding affinity estimation. As depicted in Fig. 6b, the predicted structures of protein murD conform more closely around the ligand, forming additional interactions that were not possible with the initial AlphaFold structure. This evaluation on the antibiotics benchmark agrees with our benchmarks on PDBbind test sets for binding affinity predictions (Supplementary Table 1), where DynamicBind consistently outperforms traditional docking methods and deep learning-based rigid docking methods. These results indicate that DynamicBind, with its binding affinity prediction capability, exhibits significant potential for proteome-level virtual screening applications.
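For readers unfamiliar with the metric, auROC can be read as the probability that a randomly chosen active pair is scored above a randomly chosen inactive pair. The sketch below computes it by direct pair counting and then averages over targets. The per-target averaging and the toy scores are assumptions made for illustration, not the benchmark's evaluation code.

```python
def auroc(scores, labels):
    """Probability that a random active outranks a random inactive (ties count 0.5)."""
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one active and one inactive")
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

def mean_auroc(per_target):
    """Average the auROC over a list of (scores, labels) tuples, one per protein."""
    return sum(auroc(s, y) for s, y in per_target) / len(per_target)

# Hypothetical predicted affinities (higher = stronger binder) and activity labels.
target_1 = ([0.9, 0.8, 0.3, 0.1], [1, 0, 1, 0])
target_2 = ([0.7, 0.2, 0.6, 0.4], [1, 0, 1, 0])
overall = mean_auroc([target_1, target_2])
```

A value of 0.5 corresponds to random ranking and 1.0 to perfect separation of actives from inactives.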

Discussion

DynamicBind unifies two conventionally separate steps, protein conformation generation and ligand pose prediction, into a single framework. As an end-to-end deep learning method, it is orders of magnitude faster than traditional MD simulations in sampling extensive protein conformational changes. Unlike traditional docking methods that demand predefined binding pockets, DynamicBind can perform global docking, a feature that becomes essential when the binding pocket has yet to be identified. These advantages make DynamicBind suitable for the virtual screening of compounds that bind to cryptic pockets. Such compounds are likely to bind exclusively to the target protein, thereby potentially minimizing

Fig. 6 | DynamicBind achieves better screening performance in an antibiotics benchmark. $\mathbf{a}$ Comparative evaluation of the virtual screening performance on the antibiotics benchmark by different methods, measured in terms of auROC (area under the ROC curve). The benchmark encompasses $n=12$ distinct protein systems. Each box plot shows the median (central line), upper and lower quartiles (box edges), whiskers extending up to 1.5 times the interquartile range, and individual data points (dots). We faithfully incorporated all six baseline numbers as presented in the benchmark paper$^{25}$. $\mathbf{b}$ The AlphaFold-predicted protein structure is shown in white, while the protein structures generated by DynamicBind for three active compounds are shown in green. Red arrows indicate the regions where the protein moves closer to the ligand, forming additional interactions. Source data are provided as a Source Data file.

side effects. In addition, DynamicBind can predict whether a new drug candidate may bind to an unintended protein target, and it can aid in identifying the binding target when an active compound is discovered via phenotype screening.
DynamicBind, while demonstrating state-of-the-art performance in our benchmarks, still presents opportunities for improvement, especially in its ability to generalize to proteins with low sequence homology to those in the training set$^{47,48}$. As a data-driven model, it benefits significantly from rapid advancements in Cryo-EM methods$^{49-51}$. These technological advances will broaden the diversity and comprehensiveness of our training data, providing more varied conformations of protein-ligand complexes at a faster rate. There is also potential to improve DynamicBind by utilizing the large amount of non-structural binding affinity data, which are currently more abundant than crystallized structures. By adopting a self-distillation approach analogous to AlphaFold$^{5}$, we could augment our training set by integrating high-confidence predictions of the complex structures of protein-ligand pairs that previously only had affinity data available.
In summary, DynamicBind presents a "dynamic docking" approach for investigating protein-ligand interactions, setting it apart from traditional docking methods that treat proteins as static and from molecular dynamics (MD) simulations that are computationally demanding. Its capacity to model large-scale protein dynamics carries particularly significant implications for the discovery of drug molecules, especially those targeting cryptic pockets. In addition, the ligand-specific protein conformations generated by DynamicBind may offer valuable insights into the influence of ligands on proteins, potentially clarifying structure-function relationships and augmenting our mechanistic understanding.

Methods

Overview
Our model is an $\mathrm{E}(3)$-equivariant, diffusion-based, graph neural network utilizing a coarse-grained representation.
In an $\mathrm{E}(3)$-equivariant model, the output $y$ transforms in step with any translation, rotation, or parity operation applied to the input $x$ in 3D space$^{52}$. Research has demonstrated that equivariant models can be trained with 1000 times less data while yielding superior results on the structures of bulk water$^{53}$. Despite substantial advancements in cryo-electron microscopy and crystallography, the existing protein-ligand complex database remains relatively limited, only extending to tens of thousands in size. Consequently, an efficient model is required, capable of discerning the most relevant information and avoiding superficial information that does not hold true upon relocating or rotating the entire structure. The traditional approach to fulfilling the SO(3) symmetry involves exclusively using or predicting invariant quantities, such as the contact map. However, a contact or distance map does not always correspond to a physically feasible configuration. For instance, a residue may be predicted to be in contact with two atoms that are far apart from each other. In addition, a contact map may overlook chirality, a significant aspect in drug discovery$^{54}$.
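The point about chirality can be checked numerically. In the sketch below, a toy stereocenter and its mirror image have identical pairwise distances, so any distance or contact map is the same for both, while a parity-sensitive quantity (the signed volume spanned by the substituents) changes sign. The coordinates are hypothetical and the helper names are illustrative.

```python
import math
from itertools import combinations

def pairwise_distances(points):
    """All inter-point distances, i.e. the content of a distance map."""
    return [math.dist(p, q) for p, q in combinations(points, 2)]

def signed_volume(p0, p1, p2, p3):
    """Scalar triple product of the three edges leaving p0; flips sign under reflection."""
    a = [p1[i] - p0[i] for i in range(3)]
    b = [p2[i] - p0[i] for i in range(3)]
    c = [p3[i] - p0[i] for i in range(3)]
    cross = (b[1] * c[2] - b[2] * c[1],
             b[2] * c[0] - b[0] * c[2],
             b[0] * c[1] - b[1] * c[0])
    return sum(a[i] * cross[i] for i in range(3))

# A central atom with three distinct substituents, and its mirror image (x -> -x).
molecule = [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0), (0.0, 1.2, 0.0), (0.0, 0.0, 1.4)]
mirror = [(-x, y, z) for x, y, z in molecule]

same_distances = all(
    math.isclose(d1, d2)
    for d1, d2 in zip(pairwise_distances(molecule), pairwise_distances(mirror))
)
v_orig = signed_volume(*molecule)
v_mirror = signed_volume(*mirror)
```

A model that only sees `pairwise_distances` cannot tell the two enantiomers apart, whereas features that transform under parity can.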
As a diffusion-based model, DynamicBind is trained through a process that incrementally distorts the native conformation to various degrees, enabling the model to learn how to restore the correct conformation. Distorting the original configuration commonly involves adding translational Gaussian noise to the atoms. With bond distance constraints imposed by chemical bonds and excluded volume effects enforced by Van der Waals forces, restoring from such distortions is straightforward when the distortion is relatively small. However, we observed that merely adding Gaussian noise is insufficient to train a model that can predict the transformation from one biologically meaningful configuration to another. To address this, we introduced a morph-like transformation that interpolates between the crystal protein structure and the structure predicted by AlphaFold, thereby reducing the transition barriers between meta-stable configurations, such as the AlphaFold-predicted conformation, and the ligand-bound holo configuration. Unlike other generative models that train a score function, $\boldsymbol{s}_{\theta}(\mathbf{x}, t) \approx \nabla \log p_{t}(\mathbf{x})$, our diffusion architectures aim to map perturbed structures directly back to the original conformations, akin to the consistency model$^{29,55}$. The outputs of the model are denoted as $\boldsymbol{f}_{\theta}(\mathbf{x}_{t}, t)=-\boldsymbol{\phi}(\mathbf{x}_{t}, t)$, where $\boldsymbol{\phi}(\mathbf{x}_{t}, t)$ represents the morph-like transformation added to the native conformation.
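To make the relation $\boldsymbol{f}_{\theta}=-\boldsymbol{\phi}$ concrete, the following sketch builds a perturbed structure by interpolating between native and AlphaFold coordinates and then forms the regression target that undoes the perturbation. It is a deliberate simplification: it assumes pre-aligned structures and interpolates Cartesian coordinates linearly, whereas the model described here acts on residue translations, rotations, and torsion angles. All names and coordinates are hypothetical.

```python
def morph(native, alphafold, t):
    """Linearly interpolate each coordinate: t=0 gives native, t=1 gives AlphaFold."""
    return [tuple(n + t * (a - n) for n, a in zip(pn, pa))
            for pn, pa in zip(native, alphafold)]

def denoising_target(native, perturbed):
    """f = -phi, where phi is the displacement that was added to the native structure."""
    return [tuple(n - p for n, p in zip(pn, pp))
            for pn, pp in zip(native, perturbed)]

def apply_update(coords, update):
    """Add a per-atom displacement to a structure."""
    return [tuple(c + u for c, u in zip(pc, pu))
            for pc, pu in zip(coords, update)]

# Two C-alpha positions in the native and AlphaFold-like conformations (Å, toy values).
native = [(0.0, 0.0, 0.0), (3.8, 0.0, 0.0)]
alphafold = [(0.0, 1.0, 0.0), (3.8, 2.0, 0.0)]

x_t = morph(native, alphafold, t=0.5)       # perturbed training input
target = denoising_target(native, x_t)      # what a perfect model would output
restored = apply_update(x_t, target)        # recovers the native structure
```

Training on pairs `(x_t, target)` for many values of `t` teaches a model to move directly from an intermediate or AlphaFold-like structure back to the native one.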
Traditional methods use an all-atom representation, modeling the coordinates of every atom explicitly. However, atoms do not move independently due to their connections via chemical bonds, and local geometry is highly constrained; for example, a benzene ring is generally flat. To remove the degrees of freedom that correspond to such nonphysical configurations, we adopted a coarse-grained representation for both the protein and the ligand. In our model, each protein residue is represented by a node with two vectors (coordinates and directions) and side-chain dihedral angles. More details are provided in "Featurization". For the ligand, every heavy atom is represented by a node, and these nodes transform in an extrinsic-to-intrinsic manner, wherein changes in torsional angles are converted into changes in Cartesian coordinates$^{33}$. Additional details can be found in "Transformation of the ligand conformation". Notably, despite being a coarse-grained representation, the coordinates of all non-hydrogen atoms can still be mapped in a one-to-one manner.
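The conversion of a torsion-angle change into Cartesian coordinates amounts to rotating the atoms on one side of a rotatable bond about that bond. The sketch below shows this elementary step using the Rodrigues rotation formula. It is an illustration under simplifying assumptions (the set of moving atoms is supplied by hand, and no subsequent re-alignment of the ligand is performed); the function names are hypothetical.

```python
import math

def rotate_about_axis(point, origin, axis, angle):
    """Rodrigues rotation of `point` by `angle` (rad) about the line origin + s * axis."""
    norm = math.sqrt(sum(c * c for c in axis))
    k = [c / norm for c in axis]
    v = [point[i] - origin[i] for i in range(3)]
    k_cross_v = (k[1] * v[2] - k[2] * v[1],
                 k[2] * v[0] - k[0] * v[2],
                 k[0] * v[1] - k[1] * v[0])
    k_dot_v = sum(k[i] * v[i] for i in range(3))
    cos_a, sin_a = math.cos(angle), math.sin(angle)
    return tuple(
        origin[i] + v[i] * cos_a + k_cross_v[i] * sin_a + k[i] * k_dot_v * (1 - cos_a)
        for i in range(3)
    )

def apply_torsion(coords, bond, moving_atoms, angle):
    """Rotate the atoms in `moving_atoms` about the bond (i, j); other atoms stay put."""
    i, j = bond
    axis = [coords[j][d] - coords[i][d] for d in range(3)]
    return [rotate_about_axis(p, coords[j], axis, angle) if idx in moving_atoms else p
            for idx, p in enumerate(coords)]

# Four-atom chain in a cis arrangement; rotate atom 3 by 180 degrees about the 1-2 bond.
chain = [(1.0, 1.0, 0.0), (0.0, 0.0, 0.0), (1.5, 0.0, 0.0), (2.5, 1.0, 0.0)]
flipped = apply_torsion(chain, bond=(1, 2), moving_atoms={3}, angle=math.pi)
```

Bond lengths and angles are untouched by this operation, which is why a single scalar per rotatable bond suffices to describe the intrinsic flexibility of the ligand.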
The input to our model is the current conformation of the protein and the ligand. The outputs include the predicted updates to $k^{l}$ scalar torsion angles and two translation-rotation vectors for each ligand, along with updates to $k_{i}^{p}$ scalar side-chain dihedral angles and two translation-rotation vectors of the backbone for each protein residue. Further details can be found in "Transformation of the protein conformation". In addition, the model produces two scalar outputs: one estimating how native-like the conformation is, as assessed by cLDDT (contactLDDT), and another predicting the binding affinity between the protein and the ligand.

Featurization

The ligand in our model is the attributed graph $\mathcal{G}^{l}=(\mathcal{V}^{l}, \mathcal{E}^{l})$, in which each node $v_{i}^{l} \in \mathcal{V}^{l}$ represents a heavy atom and the edges represent aromatic, single, double, or triple bonds. The node features of the ligand graph include atomic number, chirality, degree, and formal charge. In addition to bond type, an edge length embedding is used as a scalar edge feature.
The protein graph is denoted as $\mathcal{G}^{p}=(\mathcal{V}^{p}, \mathcal{E}^{p})$, where each node $v_{i}^{p} \in \mathcal{V}^{p}$ corresponds to a residue at the $C_{\alpha}$ position. The node features in the protein graph include amino acid type, language model embedding from ESM$^{7}$, and side-chain dihedral angles, which are represented as $(7 \times 2)$-dimensional zero-padded scalar features (five rotatable chi angles [chi1, chi2, ..., chi5] and two symmetric chi angles [altchi1, altchi2] for each amino acid, with each angle transformed into its sine and cosine values). To ensure that the side-chain angles of a given structure are unique, we always order them as [max(chi1, altchi1), min(chi1, altchi1), max(chi2, altchi2), min(chi2, altchi2), chi3, chi4, chi5]. In addition, the backbone orientation is represented as two unit vector features, $\frac{\mathbf{x}_{N}-\mathbf{x}_{C_{\alpha}}}{\left\|\mathbf{x}_{N}-\mathbf{x}_{C_{\alpha}}\right\|}$ and $\frac{\mathbf{x}_{C}-\mathbf{x}_{C_{\alpha}}}{\left\|\mathbf{x}_{C}-\mathbf{x}_{C_{\alpha}}\right\|}$. For edges, a length embedding is used as the scalar feature. Our featurization of the amino acid enables the model to infer the positions of all heavy atoms.
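The residue featurization described above can be sketched as follows. The code assumes angles are given in radians with NaN marking an angle that a residue type does not have, and it assumes zero-padding means that absent angles contribute the pair (0, 0); both conventions, and all function names, are illustrative choices rather than details confirmed by the text.

```python
import math

NAN = float("nan")

def _ordered_pair(a, b):
    """Return (max, min) when both angles exist, so swapping equivalent labels changes nothing."""
    if math.isnan(a) or math.isnan(b):
        return a, b
    return max(a, b), min(a, b)

def sidechain_features(chi, altchi):
    """chi: 5 angles, altchi: 2 angles (radians, NaN = absent) -> 7 (sin, cos) pairs."""
    c1_hi, c1_lo = _ordered_pair(chi[0], altchi[0])
    c2_hi, c2_lo = _ordered_pair(chi[1], altchi[1])
    angles = [c1_hi, c1_lo, c2_hi, c2_lo, chi[2], chi[3], chi[4]]
    return [(0.0, 0.0) if math.isnan(a) else (math.sin(a), math.cos(a)) for a in angles]

def unit_vector(a, b):
    """Unit vector pointing from b to a, e.g. (x_N - x_CA) / ||x_N - x_CA||."""
    d = [a[i] - b[i] for i in range(3)]
    n = math.sqrt(sum(c * c for c in d))
    return tuple(c / n for c in d)

# A residue with chi1, chi2 and their symmetric alternatives, but no chi3-chi5.
feats = sidechain_features([1.0, -0.5, NAN, NAN, NAN], [2.0, 0.3])
# The same residue with the equivalent atom labels exchanged.
swapped = sidechain_features([2.0, 0.3, NAN, NAN, NAN], [1.0, -0.5])
# Backbone direction from C-alpha at the origin to a nitrogen 1.46 Å away along x.
n_dir = unit_vector((1.46, 0.0, 0.0), (0.0, 0.0, 0.0))
```

The max/min ordering makes `feats` and `swapped` identical, which is the uniqueness property the ordering is meant to guarantee.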

Model architecture

DynamicBind is a graph neural network that uses both equivariant and invariant features. It propagates information using tensor products of irreducible representations (irreps) as per the definitions in the e3nn library$^{52}$.
The input scalar features of nodes and edges are concatenated with sinusoidal embeddings$^{56}$ of diffusion time and then encoded by different multilayer perceptrons (MLPs). For the protein node, the two unit vector features of amino acids are combined with the new scalar representations to form the initial features for interaction layers. Similar to DiffDock$^{19}$, in each step of the graph propagation process, the ligand and protein graphs undergo one intra-interaction and one inter-interaction. In the ligand's intra-interaction, the representation of each ligand atom is updated by other ligand atoms within a distance of 5 Å. For the protein, each amino acid is updated by other amino acids within a distance of 15 Å. To reduce the training runtime and memory usage of the model, a maximum of 24 neighbors is allowed for each residue. The edges for inter-interaction are determined based on whether an amino acid is within a distance of $(3 \sigma_{tr}+12)$ Å from any ligand atom, where $\sigma_{tr}$ is the current standard deviation of the diffusion translational noise. This dynamic cutoff is designed to ensure that interconnections exist even when the ligand is far from the receptor, as happens when $\sigma_{tr}$ is large. After the connected graph is determined, the node features are updated by the TensorProductLayer. Specifically, for each node $a$ belonging to category $c_{a}$:

$$
\begin{gathered}
\mathbf{h}_{a} \leftarrow \mathbf{h}_{a} \underset{c \in\{\ell, r\}}{\oplus} \mathrm{BN}^{(c_{a}, c)}\left(\frac{1}{\left|\mathcal{N}_{a}^{(c)}\right|} \sum_{b \in \mathcal{N}_{a}^{(c)}} Y\left(\mathbf{r}_{a b}\right) \otimes_{\psi_{a b}} \mathbf{h}_{b}\right) \\
\text{with } \psi_{a b}=\Psi^{(c_{a}, c)}\left(e_{a b}, \mathbf{h}_{a}^{0}, \mathbf{h}_{b}^{0}\right)
\end{gathered}
$$

Here, $\mathbf{h}_{a}$ represents the features of a node, and $\mathbf{h}_{a}^{0}$ denotes its scalar features. $\mathcal{N}_{a}^{(c)}$ refers to the neighbors of node $a$ of category $c$ (either ligand or protein). The spherical harmonics are denoted as $Y$, and BN represents the (equivariant) batch normalization. The module $\Psi$ is an MLP that produces the learnable weights $\psi_{a b}$ for the tensor product, computed from the edge embeddings, $e_{a b}$, and the scalar features, $\mathbf{h}_{a}^{0}$ and $\mathbf{h}_{b}^{0}$.
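The graph construction rules stated above (a 15 Å radius with at most 24 neighbors per residue, and a ligand-protein cutoff that grows with the noise level) can be sketched in a few lines. The text does not say which neighbors are kept when more than 24 fall inside the radius, so keeping the nearest ones is an assumption, as are the function names and toy coordinates.

```python
import math

def cross_edges(ligand_xyz, residue_xyz, sigma_tr):
    """Ligand-residue edges within the dynamic cutoff (3 * sigma_tr + 12) Å."""
    cutoff = 3.0 * sigma_tr + 12.0
    return [(i, j)
            for i, a in enumerate(ligand_xyz)
            for j, r in enumerate(residue_xyz)
            if math.dist(a, r) <= cutoff]

def residue_neighbors(residue_xyz, radius=15.0, max_neighbors=24):
    """For each residue, the nearest residues within `radius`, capped at `max_neighbors`."""
    out = []
    for i, p in enumerate(residue_xyz):
        near = sorted((math.dist(p, q), j)
                      for j, q in enumerate(residue_xyz)
                      if j != i and math.dist(p, q) <= radius)
        out.append([j for _, j in near[:max_neighbors]])
    return out

residues = [(3.8 * k, 0.0, 0.0) for k in range(10)]   # toy straight C-alpha trace
ligand = [(60.0, 0.0, 0.0)]                           # one ligand atom far from the protein

edges_low_noise = cross_edges(ligand, residues, sigma_tr=0.0)     # cutoff 12 Å
edges_high_noise = cross_edges(ligand, residues, sigma_tr=20.0)   # cutoff 72 Å
neighbors_of_first = residue_neighbors(residues)[0]
```

With a fixed 12 Å cutoff the distant ligand would be disconnected from the protein; scaling the cutoff with `sigma_tr` keeps messages flowing at high noise levels.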
After the final interaction layer, the node representations are used to produce the outputs. To generate the cLDDT, binding affinity, and ligand translation and rotation predictions, a convolution of each ligand atom with the geometric center of the ligand is employed:

$$
\begin{aligned}
\mathbf{v}= & \frac{1}{\left|\mathcal{V}^{\ell}\right|} \sum_{a \in \mathcal{V}^{\ell}} Y\left(\mathbf{r}_{o a}\right) \otimes_{\psi_{o a}} \mathbf{h}_{a} \\
& \text{with } \psi_{o a}=\Psi\left(e_{o a}, \mathbf{h}_{a}^{0}\right)
\end{aligned}
$$

where $e_{o a}$ is the edge embedding between the geometric center of the ligand and a ligand node $a$. The output $\mathbf{v}$ consists of 144 even scalars, 2 odd parity vectors and 2 even vectors. The scalars are used for predicting the cLDDT ($D$) and the negative logarithm of the binding affinity ($A$), with the affinity measured in units of concentration.

$$
\begin{gathered}
D=\operatorname{MLP}\left(\mathbf{v}_{\text{scalar}}[: 72]\right) \\
A=\operatorname{clamp}\left(\frac{\operatorname{MLP}\left(\mathbf{v}_{\text{scalar}}[72: 144]\right)}{D+\mathrm{eps}}, \min =0, \max =15\right)
\end{gathered}
$$
The odd vectors are used to predict ligand translation, while the even vectors are used to predict ligand rotation:
奇向量用于预测配体平移,而偶向量用于预测配体旋转:
tr l = v vector odd v vector odd + eps × MLP ( v vector odd , s t ) rot l = v vector even v vector + eps × MLP ( v vector even , s t ) with v vector = v vector [ 0 ] + v vector [ 1 ] 2 tr l = v ¯ vector  odd  v ¯ vector  odd  + eps × MLP v ¯ vector  odd  , s t rot l = v ¯ vector  even  v vector  + eps × MLP v ¯ vector  even  , s t  with  v ¯ vector  = v vector  [ 0 ] + v vector  [ 1 ] 2 {:[tr^(l)=( bar(v)_("vector ")^("odd "))/(|| bar(v)_("vector ")^("odd ")||+eps)xx MLP(|| bar(v)_("vector ")^("odd ")||,s_(t))],[rot^(l)=( bar(v)_("vector ")^("even "))/(||v_("vector ")||+eps)xx MLP(|| bar(v)_("vector ")^("even ")||,s_(t))],[" with " bar(v)_("vector ")=(v_("vector ")[0]+v_("vector ")[1])/(2)]:}\begin{aligned} \operatorname{tr}^{l}= & \frac{\overline{\mathbf{v}}_{\text {vector }}^{\text {odd }}}{\left\|\overline{\mathbf{v}}_{\text {vector }}^{\text {odd }}\right\|+\mathrm{eps}} \times \operatorname{MLP}\left(\left\|\overline{\mathbf{v}}_{\text {vector }}^{\text {odd }}\right\|, \mathbf{s}_{t}\right) \\ \operatorname{rot}^{l}= & \frac{\overline{\mathbf{v}}_{\text {vector }}^{\text {even }}}{\left\|\mathbf{v}_{\text {vector }}\right\|+\mathrm{eps}} \times \operatorname{MLP}\left(\left\|\overline{\mathbf{v}}_{\text {vector }}^{\text {even }}\right\|, \mathbf{s}_{t}\right) \\ & \text { with } \overline{\mathbf{v}}_{\text {vector }}=\frac{\mathbf{v}_{\text {vector }}[0]+\mathbf{v}_{\text {vector }}[1]}{2} \end{aligned}
Here, $\mathbf{s}_{t}$ is the sinusoidal embedding of the diffusion time, and $\mathrm{eps}=10^{-12}$ is added for numerical stability. Following Jing et al.$^{33}$, our model predicts a scalar torsion update for each rotatable bond of the ligand. For bond $b$, the torsion update $T_{b}^{l}$ is generated by a convolution of every atom on a radius graph with the bond center $o$:
$$
\begin{aligned}
T_{b}^{l} &= \operatorname{MLP}\left(\frac{1}{\left|\mathcal{N}_{b}\right|} \sum_{a \in \mathcal{N}_{b}} Y\left(\mathbf{r}_{oa}\right) \otimes Y^{2}\left(\mathbf{r}_{b}\right) \otimes_{\gamma_{oa}} \mathbf{h}_{a}\right) \\
\text{with}\quad \gamma_{oa} &= \Gamma\left(e_{oa}, \mathbf{h}_{a}^{0}, \mathbf{h}_{b_{0}}^{0}+\mathbf{h}_{b_{1}}^{0}\right)
\end{aligned}
$$
To predict the conformational changes of the protein, we require updates to the side-chain chi angles, the translation, and the rotation of each protein node. These operations are generated from the final interaction representations $\mathbf{h}_{i}$ of each amino acid:
$$
\begin{gathered}
T_{i}^{p}=\operatorname{MLP}\left(\mathbf{h}_{i,\text{scalar}}^{\text{odd}}, \mathbf{h}_{i,\text{scalar}}^{\text{even}}\right) \\
\mathbf{tr}_{i}^{p}=\frac{\overline{\mathbf{h}}_{i,\text{vector}}^{\text{odd}}}{\left\|\overline{\mathbf{h}}_{i,\text{vector}}^{\text{odd}}\right\|+\mathrm{eps}} \times \operatorname{MLP}\left(\left\|\overline{\mathbf{h}}_{i,\text{vector}}^{\text{odd}}\right\|, \mathbf{s}_{t}\right) \\
\mathbf{rot}_{i}^{p}=\frac{\overline{\mathbf{h}}_{i,\text{vector}}^{\text{even}}}{\left\|\overline{\mathbf{h}}_{i,\text{vector}}^{\text{even}}\right\|+\mathrm{eps}} \times \operatorname{MLP}\left(\left\|\overline{\mathbf{h}}_{i,\text{vector}}^{\text{even}}\right\|, \mathbf{s}_{t}\right) \\
\text{with}\quad \overline{\mathbf{h}}_{i,\text{vector}}=\frac{1}{\left|\mathcal{N}_{i}\right|} \sum_{j \in \mathcal{N}_{i}} \mathbf{h}_{i,\text{vector}}^{j}
\end{gathered}
$$
Here, $T_{i}^{p}$ is a five-dimensional scalar output representing torsion updates for [chi1, chi2, …, chi5].

Transformation of the ligand conformation

To update the ligand conformation, we employ a unified global translation $\mathbf{tr}^{l} \in \mathbb{R}^{3}$ and rotation $R^{l} \in \mathbb{R}^{3 \times 3}$. All atoms of the ligand are simultaneously translated and rotated around the geometric center of the ligand, which is calculated as $\overline{\mathbf{x}}^{l}=\frac{1}{n} \sum_{i} \mathbf{x}_{i}^{l}$, where $n$ is the total number of heavy atoms of the ligand and $\mathbf{x}_{i}^{l}$ denotes the position vector of atom $i$. Specifically, the transformed position vector $\mathbf{x}^{l}$ is obtained as $\mathbf{x}^{l}=R^{l}\left(\mathbf{x}^{l}-\overline{\mathbf{x}}^{l}\right)+\overline{\mathbf{x}}^{l}+\mathbf{tr}^{l}$.
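The rigid-body update above can be sketched in a few lines of standard-library Python. This is only an illustration: the function names are hypothetical, and the rotation is assumed to be supplied as a rotation vector (axis times angle, in radians) that is converted to a matrix with Rodrigues' formula.

```python
import math

def rotation_matrix(rotvec):
    """Rodrigues' formula: rotation vector (axis * angle, radians) -> 3x3 matrix."""
    angle = math.sqrt(sum(comp * comp for comp in rotvec))
    if angle < 1e-12:  # no rotation: return the identity
        return [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
    kx, ky, kz = (comp / angle for comp in rotvec)
    c, s = math.cos(angle), math.sin(angle)
    v = 1.0 - c
    return [
        [c + kx * kx * v, kx * ky * v - kz * s, kx * kz * v + ky * s],
        [ky * kx * v + kz * s, c + ky * ky * v, ky * kz * v - kx * s],
        [kz * kx * v - ky * s, kz * ky * v + kx * s, c + kz * kz * v],
    ]

def transform_ligand(coords, rotvec, translation):
    """Apply x' = R (x - center) + center + tr to every heavy atom."""
    n = len(coords)
    # Geometric center: unweighted mean of the atom positions.
    center = [sum(atom[d] for atom in coords) / n for d in range(3)]
    R = rotation_matrix(rotvec)
    moved = []
    for atom in coords:
        rel = [atom[d] - center[d] for d in range(3)]
        rotated = [sum(R[r][d] * rel[d] for d in range(3)) for r in range(3)]
        moved.append([rotated[d] + center[d] + translation[d] for d in range(3)])
    return moved
```

Because the rotation is taken about the geometric center, the center itself moves by exactly the translation vector, which keeps the two updates independent of each other.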
In addition to translation and rotation, torsion angles are also crucial in determining the ligand conformation. However, modifying torsion angles can perturb the position of the center of mass of the ligand. To address this issue, Corso et al.$^{19}$ demonstrated that performing an RMSD alignment after updating the torsion angles ensures that the effect of the torsion updates is orthogonal to the roto-translation updates, and thus decouples the consequences of the torsional updates from those of the roto-translation updates. Overall, the updated ligand pose is obtained as $\mathbf{x}^{l}=\operatorname{RMSDAlign}\left(\left(T_{0}^{l} \circ \cdots \circ T_{k}^{l}\right)\left(\mathbf{x}^{l}\right),\ R^{l}\left(\mathbf{x}^{l}-\overline{\mathbf{x}}^{l}\right)+\overline{\mathbf{x}}^{l}+\mathbf{tr}^{l}\right)$, where $T_{k}^{l}$ is the torsion rotation.

Transformation of the protein conformation

Following AlphaFold$^{5}$, we use the C$\alpha$ atom as the residue node to perform global translation and rotation. Additionally, the model predicts the updates of side-chain torsion angles. For side-chain parts that are symmetric under a $180^{\circ}$ rotation, symmetry need not be considered at the inference stage, but we introduce symmetric side-chain torsion features during training so that the loss function is computed correctly. Since the position of the C$\alpha$ atom is independent of the side-chain torsion angles, rotating the side chain does not affect the residue-level translation and rotation. Thus, we can perform roto-translations and torsion rotations in any order. Finally, the updated conformation of each protein residue is represented as $\mathbf{x}_{i}^{p}=\left(T_{i,0}^{p} \circ \cdots \circ T_{i,k}^{p}\right)\left(R_{i}^{p}\left(\mathbf{x}_{i}^{p}-\mathbf{x}_{i,C_{\alpha}}^{p}\right)+\mathbf{x}_{i,C_{\alpha}}^{p}+\mathbf{tr}_{i}^{p}\right)$, where $T_{i,k}^{p}$ is the side-chain torsion rotation of the $i$-th residue.

Training and inference

During training, the inputs are the protein structure in a decoy conformation, constructed by applying a morph-like transformation to the native conformation, and the ligand structure in a conformation with Gaussian noise added. The expected outputs are the denoising operations. The input protein structure at time $t$ is defined as $\mathbf{x}_{t}^{p}=\phi\left(\mathbf{x}^{\text{holo}}, t\right)$. Specifically, for the $i$-th amino acid, the Kabsch algorithm$^{57}$ is used to calculate the translation $\mathbf{tr}_{i}^{*}$ and the rotation $\mathbf{rot}_{i}^{*}$ around C$\alpha$ that align the backbone atoms N-C$\alpha$-C of the holo-structure to the apo structure:
$$
\mathbf{tr}_{i}^{*}, \mathbf{rot}_{i}^{*}=\operatorname{Kabsch}\left(\mathbf{x}_{i,\left(N, C_{\alpha}, C\right)}^{\text{holo}}-\mathbf{x}_{i, C_{\alpha}}^{\text{holo}},\ \mathbf{x}_{i,\left(N, C_{\alpha}, C\right)}^{\text{apo}}-\mathbf{x}_{i, C_{\alpha}}^{\text{holo}}\right)
$$
Taking the differences in torsion angles into account as well, the conformational change of the $i$-th residue can be written as:
$$
\mathbf{x}_{i}^{\text{apo}}=\phi\left(\mathbf{x}_{i}^{\text{holo}}\right)=\left(T_{i,0}^{*} \circ \cdots \circ T_{i,k}^{*}\right)\left(R_{i}^{*}\left(\mathbf{x}_{i}^{\text{holo}}-\mathbf{x}_{i, C_{\alpha}}^{\text{holo}}\right)+\mathbf{x}_{i, C_{\alpha}}^{\text{holo}}+\mathbf{tr}_{i}^{*}\right)
$$
Here, the torsion differences $T_{i,k}^{*}=T_{i,k}^{\text{apo}}-T_{i,k}^{\text{holo}}$ are in radians, and $R_{i}^{*}$ is the rotation matrix of $\mathbf{rot}_{i}^{*}$. At any given time $t$, we perturb the protein structure using a factor $u(t)$, such that the perturbed data is an intermediate state between the holo-structure and the apo structure:
$$
\begin{aligned}
\phi\left(\mathbf{x}_{i}^{\text{holo}}, t\right) &= \left(\Delta T_{i,0}^{p} \circ \cdots \circ \Delta T_{i,k}^{p}\right)\left(\Delta R_{i}^{p}\left(\mathbf{x}_{i}^{\text{holo}}-\mathbf{x}_{i, C_{\alpha}}^{\text{holo}}\right)+\mathbf{x}_{i, C_{\alpha}}^{\text{holo}}+\Delta \mathbf{tr}_{i}^{p}\right) \\
\text{with}\quad \Delta \mathbf{tr}_{i}^{p} &= u(t)\, \mathbf{tr}_{i}^{*} \\
\Delta R_{i}^{p} &= \text{rotation matrix of } u(t)\, \mathbf{rot}_{i}^{*} \\
\Delta T_{i,k}^{p} &= u(t)\, T_{i,k}^{*}+\mathcal{N}(0, 0.3) \\
u(t) &= \operatorname{clamp}\left(\tau_{\min}^{p}+\left(\tau_{\max}^{p}-\tau_{\min}^{p}\right) \times(5 t)^{0.3},\ \min=0,\ \max=1\right)
\end{aligned}
$$
where $\tau_{\min}^{p}$ and $\tau_{\max}^{p}$ represent the parameters of the diffusion noise.
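The interpolation factor $u(t)$ can be illustrated with a small sketch. The values used for `tau_min` and `tau_max` below are placeholders chosen only for the example, since the text does not state them, and the function names are hypothetical.

```python
def interpolation_factor(t, tau_min, tau_max):
    """u(t) = clamp(tau_min + (tau_max - tau_min) * (5 t)^0.3, 0, 1), for t >= 0."""
    raw = tau_min + (tau_max - tau_min) * (5.0 * t) ** 0.3
    return min(max(raw, 0.0), 1.0)

def perturb_translation(tr_star, t, tau_min, tau_max):
    """Scale the holo-to-apo translation of one residue by u(t)."""
    u = interpolation_factor(t, tau_min, tau_max)
    return [u * component for component in tr_star]

# Placeholder noise parameters, assumed purely for illustration.
TAU_MIN, TAU_MAX = 0.0, 1.0
schedule = [interpolation_factor(t / 10.0, TAU_MIN, TAU_MAX) for t in range(11)]
```

Since $(5t)^{0.3}=1$ at $t=0.2$, the unclamped factor already equals $\tau_{\max}^{p}$ there; with the placeholder values above, every later time step is clamped to 1, which corresponds to the fully apo-like decoy.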

To overcome the distribution shift between training and inference that arises from using RDKit-generated conformations as starting points during inference, we replace the training objective with the conformation $\mathbf{x}_{0}^{l}$ that is matched to the ground-truth pose $\mathbf{x}^{gt}$ $^{19,33}$. At time $t$, the input ligand pose is a randomly perturbed conformation:
$$
\begin{aligned}
\mathbf{x}_{t}^{l} &= \left(\Delta T_{0}^{l} \circ \cdots \circ \Delta T_{k}^{l}\right)\left(\Delta R^{l}\left(\mathbf{x}_{0}^{l}-\overline{\mathbf{x}}_{0}^{l}\right)+\overline{\mathbf{x}}_{0}^{l}+\Delta \mathbf{tr}^{l}\right) \\
\text{with}\quad \Delta \mathbf{tr}^{l} &= \left(\mathcal{N}\left(0, \sigma_{tr}^{l}\right), \mathcal{N}\left(0, \sigma_{tr}^{l}\right), \mathcal{N}\left(0, \sigma_{tr}^{l}\right)\right) \\
\Delta R^{l} &= \text{rotation matrix of } \omega \hat{\boldsymbol{\omega}},\ \text{with } \omega \text{ sampled from } p(\omega) \\
\Delta T_{k}^{l} &= \mathcal{N}\left(0, \sigma_{tor}^{l}\right) \\
p(\omega) &= \frac{1-\cos (\omega)}{\pi} \sum_{l=0}^{\infty}(2 l+1) \exp \left(-l(l+1)\left(\sigma_{rot}^{l}\right)^{2}\right) \frac{\sin ((l+1 / 2) \omega)}{\sin (\omega / 2)}
\end{aligned}
$$
Here, $\overline{\mathbf{x}}_{0}^{l}$ is the geometric center of $\mathbf{x}_{0}^{l}$, $p(\omega)$ is the isotropic Gaussian distribution on $\mathrm{SO}(3)$, and $\hat{\boldsymbol{\omega}}$ is a unit vector generated by random sampling.
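As an illustration of how a rotation angle might be drawn from $p(\omega)$, the sketch below truncates the infinite series and inverts a numerically integrated cumulative distribution. The truncation order, the grid size, and the function names are assumptions of this example, and the exponent follows the formula exactly as written above.

```python
import bisect
import math
import random

def igso3_density(omega, sigma, l_max=100):
    """Truncated series for p(omega), valid for 0 < omega <= pi."""
    total = 0.0
    for l in range(l_max + 1):
        total += ((2 * l + 1) * math.exp(-l * (l + 1) * sigma ** 2)
                  * math.sin((l + 0.5) * omega) / math.sin(omega / 2.0))
    return (1.0 - math.cos(omega)) / math.pi * total

def sample_rotation_angle(sigma, rng, n_grid=2000):
    """Inverse-CDF sampling of the rotation angle on a midpoint grid over (0, pi)."""
    step = math.pi / n_grid
    angles = [(i + 0.5) * step for i in range(n_grid)]
    cdf, running = [], 0.0
    for angle in angles:
        running += igso3_density(angle, sigma) * step
        cdf.append(running)
    u = rng.random() * cdf[-1]  # rescale so numerical error cannot push u past the CDF
    return angles[min(bisect.bisect_left(cdf, u), n_grid - 1)]

angle = sample_rotation_angle(0.5, random.Random(0))
```

The sampled angle would then be multiplied by a random unit axis $\hat{\boldsymbol{\omega}}$ to form the rotation vector. For very small noise levels the series converges slowly, so a larger `l_max` would be needed than the default used here.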
The network is trained with eight losses. The total loss is defined as follows:
$$
\mathcal{L}=\frac{1}{3} \mathcal{L}_{\mathbf{tr}}^{l}+\frac{1}{3} \mathcal{L}_{\mathbf{rot}}^{l}+\frac{1}{3} \mathcal{L}_{T}^{l}+\frac{1}{3} \mathcal{L}_{\mathbf{tr}}^{p}+\frac{1}{3} \mathcal{L}_{\mathbf{rot}}^{p}+\frac{1}{3} \mathcal{L}_{T}^{p}+0.01 \mathcal{L}_{A}+0.99 \mathcal{L}_{D}
$$
where $\mathcal{L}_{\mathbf{tr}}^{l}$, $\mathcal{L}_{\mathbf{rot}}^{l}$, and $\mathcal{L}_{T}^{l}$ are the losses for the translation, rotation, and torsion of the ligand, respectively, and $\mathcal{L}_{\mathbf{tr}}^{p}$, $\mathcal{L}_{\mathbf{rot}}^{p}$, and $\mathcal{L}_{T}^{p}$ are the corresponding losses for the protein residues. $\mathcal{L}_{A}$ is the binding affinity loss and $\mathcal{L}_{D}$ is the contact-LDDT (cLDDT) loss. The distance difference used to compute the ground-truth cLDDT is $d=\left|d\left(\mathbf{x}_{0}^{l}, \mathbf{x}^{\text{holo}}\right)-d\left(\mathbf{x}_{t}^{l}, \mathbf{x}_{t}^{p}\right)\right|$ (more details of the cLDDT score calculation can be found in “Evaluation metrics”).
A rotation vector $\mathbf{u}$ represents the same rotation as another vector $\mathbf{v}$ if $\mathbf{u}$ and $\mathbf{v}$ have opposite orientations and $\|\mathbf{u}\|+\|\mathbf{v}\|=2\pi$. We therefore take the minimum of the forward- and opposite-orientation losses when computing the rotation loss. The torsion angle losses are computed using the cosine of the angle difference between the predicted value and the added torsion angle noise. The full training procedure can be seen in Supplementary Algorithm 1.
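The two loss conventions described above can be sketched as follows. The exact functional forms are assumptions of this example: a squared error is used for the rotation vectors and `1 - cos` for the torsion term, whereas the text only states that the minimum over the two orientations and the cosine of the angle difference are used. All function names are hypothetical.

```python
import math

def opposite_rotvec(u):
    """Rotation vector with opposite orientation that encodes the same rotation as u."""
    norm = math.sqrt(sum(comp * comp for comp in u))
    if norm < 1e-12:  # the axis is undefined for a zero rotation
        return list(u)
    scale = -(2.0 * math.pi - norm) / norm
    return [scale * comp for comp in u]

def rotation_loss(pred, target):
    """Minimum of the squared errors against both encodings of the target rotation."""
    forward = sum((p - q) ** 2 for p, q in zip(pred, target))
    flipped = opposite_rotvec(target)
    opposite = sum((p - q) ** 2 for p, q in zip(pred, flipped))
    return min(forward, opposite)

def torsion_loss(pred_angle, noise_angle):
    """Periodic loss: zero whenever the two angles differ by a multiple of 2*pi."""
    return 1.0 - math.cos(pred_angle - noise_angle)
```

Taking the minimum avoids penalizing a prediction that lands on the equivalent encoding of a near-180-degree rotation, where the two rotation vectors point in opposite directions but describe almost the same orientation.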
During inference, we use the ligand structure with conformations generated by RDKit and the protein structure predicted by AlphaFold as the initial complex conformation. The complex structure is updated over 20 steps. To prevent the final conformation from being trapped in a local minimum, a small random noise is added to the denoised ligand pose at each step. For each protein-ligand pair, we draw 40 samples and rank the binding conformations by their predicted cLDDT scores.
We also noticed that the weighted average of the predicted binding affinities is a more accurate estimator of the experimentally measured affinity (Supplementary Table 1). The predicted cLDDT values are used as the weights. The complete inference procedure can be found in Supplementary Algorithm 2.
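The ranking and affinity aggregation described above amount to a sort and a weighted mean. The sketch below assumes each sample is a `(predicted_clddt, predicted_affinity)` pair and that the weights are normalized by their sum; both the data layout and the function name are hypothetical.

```python
def rank_and_aggregate(samples):
    """samples: list of (predicted_clddt, predicted_affinity) pairs for one complex.

    Returns the samples sorted from highest to lowest predicted cLDDT and the
    cLDDT-weighted average of the predicted affinities.
    """
    ranked = sorted(samples, key=lambda sample: sample[0], reverse=True)
    weight_sum = sum(clddt for clddt, _ in samples)
    if weight_sum <= 0.0:
        raise ValueError("predicted cLDDT weights must sum to a positive value")
    weighted_affinity = sum(clddt * aff for clddt, aff in samples) / weight_sum
    return ranked, weighted_affinity

# Toy values, invented only to exercise the function.
ranked, affinity = rank_and_aggregate([(0.9, 7.0), (0.1, 5.0), (0.5, 6.0)])
```

Weighting by predicted cLDDT lets confidently placed poses dominate the affinity estimate, while low-confidence samples contribute little.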
DynamicBind has 63.67 million parameters and was trained for 5 days on eight Nvidia A100 80GB GPUs.

Evaluation metrics

To assess the interaction between the protein and the ligand within the predicted complex structure, we determine the extent of intermolecular native contact formation. We adopt a definition similar to that of the Local Distance Difference Test (LDDT) score, previously employed for quantifying the nativeness of predicted protein structures$^{58}$. The contact-LDDT (cLDDT) score is computed by considering the distances of less than 15 Å among all pairs of ligand atoms and protein atoms. The distance difference is determined between the ground truth and the predicted complex structure, while accounting for symmetry. The final cLDDT score is the mean fraction of conserved distances across four tolerance thresholds: 0.5, 1, 2, and 4 Å.
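A minimal sketch of the cLDDT computation is given below. It makes several simplifying assumptions that are not taken from the text: the 15 Å cutoff is applied to the ground-truth distances, a distance counts as conserved when its deviation is strictly below the threshold, atoms are matched by index, and ligand symmetry is ignored.

```python
import math

def contact_lddt(lig_true, prot_true, lig_pred, prot_pred,
                 cutoff=15.0, thresholds=(0.5, 1.0, 2.0, 4.0)):
    """Mean fraction of ligand-protein distances preserved within each threshold."""
    deviations = []
    for i, lig_atom in enumerate(lig_true):
        for j, prot_atom in enumerate(prot_true):
            d_true = math.dist(lig_atom, prot_atom)
            if d_true < cutoff:  # only native contacts enter the score
                d_pred = math.dist(lig_pred[i], prot_pred[j])
                deviations.append(abs(d_true - d_pred))
    if not deviations:
        return 0.0
    fractions = [sum(dev < thr for dev in deviations) / len(deviations)
                 for thr in thresholds]
    return sum(fractions) / len(fractions)
```

A perfect prediction scores 1.0, and a pose whose contact distances are all off by between 1 and 2 Å scores 0.5, because only the two loosest thresholds are satisfied.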
To evaluate the deviation of the predicted protein structure from the native protein structure around the binding pocket, we compute the pocket root mean square deviation (pocket RMSD). This is performed using protein atoms located within 5 Å of the reference ligand atoms. First, the predicted protein structure is aligned with the crystal protein structure. Then, the RMSD between the predicted pocket atoms and the crystal pocket atoms is determined.
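The pocket selection and RMSD steps can be sketched as below. The example assumes that the predicted structure has already been superimposed on the crystal structure and that both atom lists share the same ordering; the alignment step itself is not shown, and the function name is hypothetical.

```python
import math

def pocket_rmsd(prot_pred, prot_crystal, ligand_ref, radius=5.0):
    """RMSD over crystal protein atoms lying within `radius` of any reference ligand atom."""
    pocket = [k for k, atom in enumerate(prot_crystal)
              if any(math.dist(atom, lig_atom) <= radius for lig_atom in ligand_ref)]
    if not pocket:
        raise ValueError("no protein atoms within the pocket radius")
    squared = sum(math.dist(prot_pred[k], prot_crystal[k]) ** 2 for k in pocket)
    return math.sqrt(squared / len(pocket))
```

Defining the pocket on the crystal structure keeps the atom selection identical for every predicted model, so the scores of different models are directly comparable.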
Similar to AlphaFill$^{14}$, the clash score is the root mean square (RMS) of the van der Waals overlaps$^{59}$ over all distances between ligand atoms and protein atoms that are less than 4 Å. It is computed as follows:
$$
\text{clash score}=\sqrt{\frac{\sum_{i=1}^{N} \text{VdW overlap}_{i}^{2}}{N}}
$$
where $N$ is the number of distances considered.
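A short sketch of the clash score follows. The overlap definition used here, the sum of the two van der Waals radii minus the interatomic distance and floored at zero, is an assumption of this example, and the radii are supplied by the caller rather than taken from any particular table.

```python
import math

def clash_score(ligand, protein, cutoff=4.0):
    """ligand, protein: lists of (xyz, vdw_radius) tuples.

    Returns the RMS of the van der Waals overlaps over all ligand-protein
    atom pairs closer than `cutoff` (in Angstrom).
    """
    overlaps = []
    for lig_xyz, lig_radius in ligand:
        for prot_xyz, prot_radius in protein:
            distance = math.dist(lig_xyz, prot_xyz)
            if distance < cutoff:
                # Assumed overlap definition: positive only when the spheres interpenetrate.
                overlaps.append(max(0.0, lig_radius + prot_radius - distance))
    if not overlaps:
        return 0.0
    return math.sqrt(sum(o * o for o in overlaps) / len(overlaps))
```

Pairs beyond the cutoff are excluded from both the numerator and $N$, so a pose with no close contacts receives a score of zero.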

Dataset construction

Our training and test datasets were built upon the PDBbind2020$^{34}$ database, which includes a curated collection of 19,443 crystal structures of protein-ligand complexes, each paired with an experimentally measured binding affinity. We employed the same time split as previous works$^{19,35,36}$, using structures deposited before 2019 for training and validation, while those deposited in 2019 were reserved for testing. Each protein was aligned with the AlphaFold-predicted structure that corresponds to the same protein sequence. The aligned AlphaFold structures and the crystal structures are used to generate training samples of the protein part through morph-like interpolation. The Major Drug Targets (MDT) test set was constructed using the following criteria: PDB entries deposited in 2020 or later; proteins belonging to one of the four major drug target groups (kinases, GPCRs, nuclear receptors, and ion channels); AlphaFold-predicted protein structures with a pocket RMSD above 2 Å (or a pocket LDDT below 0.8) relative to the crystal structure; ligands that are drug-like small molecules with molecular weights between 200 and 650 Dalton; and at most 10 PDB entries from a single study. These criteria ensure that the test set is challenging, because the initial input protein differs from the native conformation, and representative, because it covers a wide range of protein targets. They also prevent a few proteins from dominating the entire test set, since certain studies deposited significantly more PDB entries (structures of the same protein co-crystallized with slightly different ligands) than others.

Baselines

We performed docking on both the PDBbind test set (303 ligand-receptor pairs) and the Major Drug Targets (MDT) test set (599 ligand-receptor pairs) using the docking methods listed below. The docking ligands were extracted from the co-crystallized structures without changing their atomic coordinates, and the docking receptor structures were predicted by AlphaFold. We use a symmetry-aware method, specifically the symmrmsd function from the spyrmsd package$^{60}$, for all RMSD computations.

Autodock VINA rigid

For Autodock Vina$^{17}$, ligands were converted from SDF format to PDBQT format with Meeko 2.0.0. Protein preparation was performed using the “prepare_receptor” command in ADFR Suite 1.0. The docking box was defined automatically around the native ligand with the default buffer of 4 Å on all six sides, and the box center was the center of mass of the native ligand. Because boron is not a valid AutoDock atom type, ligands containing this atom cannot be docked. Therefore, only 301 ligand-receptor pairs in the PDBbind dataset and 597 ligand-receptor pairs in the MDT dataset had docking output in VINA rigid docking.

Autodock VINA flex

Compared with VINA rigid docking, VINA flexible docking has an additional flexible receptor preparation step. It was performed with a Python script called “prepare_flexreceptor.py”, which is available at https://github.com/ccsb-scripps/AutoDock-Vina/tree/develop/example/autodock_scripts. In this step, the protein PDBQT file was divided into two PDBQT files, one for the rigid part and one for the flexible side chains. For Vina Flex mode, the flexible side chains must be predetermined. We identified all residues with side-chain atoms within 5 Å of the ligand atoms as flexible. In this mode, the protein backbone remains rigid. Ligand preparation and grid box settings were consistent with VINA rigid docking.

GNINA rigid

The ligand input files for GNINA$^{61}$ were in PDBQT format, created using OpenBabel after adding hydrogens with RDKit. Protein input files were in PDB format. The grid box settings were consistent with VINA rigid docking. For the PDBbind dataset, all of the ligand-receptor pairs had docking output. For the MDT dataset, 1 pair had no output because the ligand in the original PDB file (PDB ID: 8HMU) was not completely resolved and had missing atoms.

GLIDE

GLIDE 16 16 ^(16){ }^{16} is a rigid protein docking module in Schrödinger software. Ligands were prepared by using the LigPrep module. Protein preparation was performed by using the Protein Preparation Wizard module. Grid files were generated by the Receptor Grid Generation module with a 10 10 10"Å"10 \AA inner box and an automatic outer box around the ligand with the default buffer of 4 4 4"Å"4 \AA on all six sides centered on the center of mass of the ligand. Then, the SP precision docking was performed. Some of the ligands in PDBbind dataset are polypeptides, which cannot be processed by LigPrep module. In addition, ligands with severe clashes with pocket atoms had no output pose during docking. Therefore, 266 ligand-receptor pairs in PDBbind dataset and 472 ligand-receptor pairs in MDT dataset had docking output in GLIDE rigid docking.
GLIDE 16 16 ^(16){ }^{16} 是 Schrödinger 软件中的刚性蛋白质对接模块。配体通过 LigPrep 模块进行准备。蛋白质准备则使用 Protein Preparation Wizard 模块完成。网格文件由 Receptor Grid Generation 模块生成,采用 10 10 10"Å"10 \AA 内盒及默认 4 4 4"Å"4 \AA 缓冲距离的自动外盒,以配体质心为中心覆盖六个方向。随后执行 SP 精度对接。PDBbind 数据集中的部分配体为多肽,无法通过 LigPrep 模块处理。此外,与口袋原子存在严重冲突的配体在对接过程中无输出构象。因此,PDBbind 数据集中 266 个配体-受体对及 MDT 数据集中 472 个配体-受体对在 GLIDE 刚性对接中获得了对接输出结果。
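The grid geometry described above can be sketched as follows. This is only one plausible reading of the text, not Schrödinger's implementation: the outer box is taken to be the smallest box centered on the ligand center of mass that contains every ligand atom, extended by the 4 Å buffer on each side, and the inner box is a fixed 10 Å cube. The helper names, masses, and coordinates are hypothetical.

```python
INNER_BOX_EDGE = 10.0  # Å, inner box edge length stated in the text
BUFFER = 4.0           # Å, buffer added on all six sides of the outer box

def center_of_mass(atoms):
    """atoms: list of (mass, (x, y, z)). Returns the mass-weighted mean position."""
    total_mass = sum(mass for mass, _ in atoms)
    return tuple(
        sum(mass * xyz[axis] for mass, xyz in atoms) / total_mass
        for axis in range(3)
    )

def outer_box_edges(atoms, center, buffer=BUFFER):
    """Edge lengths of a box centered on `center` that encloses all atoms plus `buffer`.

    Assumption: along each axis the half-length is the largest atom-to-center
    offset plus the buffer, so the box stays centered on `center`.
    """
    return tuple(
        2.0 * (max(abs(xyz[axis] - center[axis]) for _, xyz in atoms) + buffer)
        for axis in range(3)
    )

# Toy ligand with placeholder masses (hypothetical values, coordinates in Å).
ligand = [
    (1.0, (0.0, 0.0, 0.0)),
    (1.0, (2.0, 0.0, 0.0)),
    (2.0, (1.0, 3.0, 0.0)),
]

com = center_of_mass(ligand)
print("center:", com)
print("inner box edges:", (INNER_BOX_EDGE,) * 3)
print("outer box edges:", outer_box_edges(ligand, com))
```

Centering on the center of mass rather than on the geometric midpoint of the ligand means the outer box can be slightly larger than the ligand's tight bounding box plus buffer; that is a consequence of the assumption made here.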

Induced fit docking

The induced fit docking (IFD) module$^{62}$ in the Schrödinger software provides protein-flexible docking. Unlike VINA and GNINA, it allows not only residue side chains but also residue backbones to move slightly. Ligand and protein preparation were the same as for GLIDE rigid docking. The search space was defined by the default parameters: a 10 Å inner box and an automatically sized outer box (similar in size to the ligand), both centered on the center of mass of the native ligand. Amino acid residues within 5 Å of the ligand atoms were defined as flexible residues. Docking was performed under the standard protocol, which generates up to 20 poses. In total, 284 ligand-receptor pairs in the PDBbind dataset and 580 in the MDT dataset were docked successfully with the IFD module. IFD produced output poses for more ligand-receptor pairs in the PDBbind dataset than GLIDE rigid docking (284 versus 266), indicating that this docking method can extend the pocket by moving pocket residues.

Reporting summary

Further information on research design is available in the Nature Portfolio Reporting Summary linked to this article.

Data availability

Raw data were sourced from the public dataset PDBbind2020, available at http://www.pdbbind.org.cn/index.php. The data generated in this study and the processed training data have been publicly deposited in Zenodo at https://doi.org/10.5281/zenodo.10429051. Source data are provided with this paper.

Code availability

A demo, instructions, and code for DynamicBind are available at https://github.com/luwei0917/DynamicBind. The version used for this publication is available in ref. 63. In addition, a web server is available at https://m1.galixir.com/#/home/demo/dynamicDocking.

References

  1. Papoian, G. A. & Wolynes, P. G. AWSEM-MD: from neural networks to protein structure prediction and functional dynamics of complex biomolecular assemblies. Coarse-Grained Model. Biomol. 121-190 (2017).
  2. Jin, S. et al. Protein structure prediction in CASP13 using AWSEM-Suite. J. Chem. Theory Comput. 16, 3977-3988 (2020).
  3. Leman, J. K. et al. Macromolecular modeling and design in Rosetta: recent methods and frameworks. Nat. Methods 17, 665-680 (2020).
  4. Zhang, C., Mortuza, S., He, B., Wang, Y. & Zhang, Y. Template-based and free modeling of I-TASSER and QUARK pipelines using predicted contact maps in CASP12. Proteins Struct. Funct. Bioinforma. 86, 136-151 (2018).
  5. Jumper, J. et al. Highly accurate protein structure prediction with AlphaFold. Nature 596, 583-589 (2021).
  6. Baek, M. et al. Accurate prediction of protein structures and interactions using a three-track neural network. Science 373, 871-876 (2021).
  7. Lin, Z. et al. Evolutionary-scale prediction of atomic-level protein structure with a language model. Science 379, 1123-1130 (2023).
  8. Wu, R. et al. High-resolution de novo structure prediction from primary sequence. Preprint at https://www.biorxiv.org/content/10.1101/2022.07.21.500999v1 (2022).
  9. Lane, T. J. Protein structure prediction has reached the single-structure frontier. Nat. Methods 20, 170-173 (2023).
  10. Frauenfelder, H., Sligar, S. G. & Wolynes, P. G. The energy landscapes and motions of proteins. Science 254, 1598-1603 (1991).
  11. Nussinov, R., Zhang, M., Liu, Y. & Jang, H. AlphaFold, allosteric, and orthosteric drug discovery: ways forward. Drug Discov. Today 28, 103551 (2023).
  12. Boehr, D. D., Nussinov, R. & Wright, P. E. The role of dynamic conformational ensembles in biomolecular recognition. Nat. Chem. Biol. 5, 789-796 (2009).
  13. Gunasekaran, K., Ma, B. & Nussinov, R. Is allostery an intrinsic property of all dynamic proteins? Proteins Struct. Funct. Bioinforma. 57, 433-443 (2004).
  14. Hekkelman, M. L., de Vries, I., Joosten, R. P. & Perrakis, A. AlphaFill: enriching AlphaFold models with ligands and cofactors. Nat. Methods 20, 205-213 (2023).
  15. Gorgulla, C. Recent developments in structure-based virtual screening approaches. Preprint at https://arxiv.org/abs/2211.03208v1 (2022).
  16. Friesner, R. A. et al. Glide: a new approach for rapid, accurate docking and scoring. 1. Method and assessment of docking accuracy. J. Med. Chem. 47, 1739-1749 (2004).
  17. Trott, O. & Olson, A. J. AutoDock Vina: improving the speed and accuracy of docking with a new scoring function, efficient optimization, and multithreading. J. Comput. Chem. 31, 455-461 (2010).
  18. Scardino, V., Di Filippo, J. I. & Cavasotto, C. N. How good are AlphaFold models for docking-based virtual screening? iScience 26, 1 (2023).
  19. Corso, G., Stärk, H., Jing, B., Barzilay, R. & Jaakkola, T. DiffDock: diffusion steps, twists, and turns for molecular docking. In International Conference on Learning Representations (ICLR, 2023).
  20. Miller, E. B. et al. Reliable and accurate solution to the induced fit docking problem for protein-ligand binding. J. Chem. Theory Comput. 17, 2630-2639 (2021).
  21. Ayaz, P. et al. Structural mechanism of a drug-binding process involving a large conformational change of the protein target. Nat. Commun. 14, 1885 (2023).
  22. Ferreiro, D. U., Hegler, J. A., Komives, E. A. & Wolynes, P. G. On the role of frustration in the energy landscapes of allosteric proteins. Proc. Natl. Acad. Sci. USA 108, 3499-3503 (2011).
  23. Noé, F., Olsson, S., Köhler, J. & Wu, H. Boltzmann generators: sampling equilibrium states of many-body systems with deep learning. Science 365, 1147 (2019).
  24. Noé, F., De Fabritiis, G. & Clementi, C. Machine learning for protein folding and dynamics. Curr. Opin. Struct. Biol. 60, 77-84 (2020).
  25. Wong, F. et al. Benchmarking AlphaFold-enabled molecular docking predictions for antibiotic discovery. Mol. Syst. Biol. 18, 11081 (2022).
  26. Landrum, G. et al. RDKit: A Software Suite For Cheminformatics, Computational Chemistry, and Predictive Modeling (Academic Press Cambridge, 2013).
  27. Lin, X. et al. Forging tools for refining predicted protein structures. Proc. Natl. Acad. Sci. USA 116, 9400-9409 (2019).
  28. Watson, J. L. et al. De novo design of protein structure and function with RFdiffusion. Nature 620, 1089-1100 (2023).
  29. Song, Y. et al. Score-based generative modeling through stochastic differential equations. Preprint at https://arxiv.org/abs/2011.13456 (2020).
  30. Qiao, Z., Nie, W., Vahdat, A., Miller III, T. F. & Anandkumar, A. State-specific protein-ligand complex structure prediction with a multi-scale deep generative model. Preprint at https://arxiv.org/pdf/2209.15171.pdf (2023).
  31. Nakata, S., Mori, Y. & Tanaka, S. End-to-end protein-ligand complex structure generation with diffusion-based generative models. BMC Bioinforma. 24, 1-18 (2023).
  32. Brocidiacono, M., Popov, K. I., Koes, D. R. & Tropsha, A. Plantain: diffusion-inspired pose score minimization for fast and accurate molecular docking. Preprint at https://arxiv.org/abs/2307.12090 (2023).
  33. Jing, B., Corso, G., Chang, J., Barzilay, R. & Jaakkola, T. Torsional diffusion for molecular conformer generation. Adv. Neural Inf. Process. Syst. 35, 24240-24253 (2022).
  34. Liu, Z. et al. PDB-wide collection of binding data: current status of the PDBbind database. Bioinformatics 31, 405-412 (2015).
  35. Stärk, H., Ganea, O., Pattanaik, L., Barzilay, R. & Jaakkola, T. EquiBind: geometric deep learning for drug binding structure prediction. In International Conference on Machine Learning 20503-20521 (PMLR, 2022).
  36. Lu, W. et al. TankBind: trigonometry-aware neural networks for drug-protein binding structure prediction. Adv. Neural Inf. Process. Syst. (2022).
  37. Santos, R. et al. A comprehensive map of molecular drug targets. Nat. Rev. Drug Discov. 16, 19-34 (2017).
  38. Bender, B. J. et al. A practical guide to large-scale docking. Nat. Protoc. 16, 4799-4832 (2021).
  39. Kanev, G. K., de Graaf, C., Westerman, B. A., de Esch, I. J. & Kooistra, A. J. KLIFS: an overhaul after the first 5 years of supporting kinase research. Nucleic Acids Res. 49, 562-569 (2021).
  40. Huang, S.-Y. & Zou, X. Ensemble docking of multiple protein structures: considering protein structural variations in molecular docking. Proteins Struct. Funct. Bioinforma. 66, 399-421 (2007).
  41. Amaro, R. E. et al. Ensemble docking in drug discovery. Biophys. J. 114, 2271-2278 (2018).
  42. Lampe, J. W. et al. Discovery of a first-in-class inhibitor of the histone methyltransferase SETD2 suitable for preclinical studies. ACS Med. Chem. Lett. 12, 1539-1545 (2021).
  43. Alford, J. S. et al. Conformational-design-driven discovery of EZM0414: a selective, potent SETD2 inhibitor for clinical studies. ACS Med. Chem. Lett. 13, 1137-1143 (2022).
  44. Zhao, M., Lee, W.-P., Garrison, E. P. & Marth, G. T. SSW library: an SIMD Smith-Waterman C/C++ library for use in genomic applications. PLoS ONE 8, 82138 (2013).
  45. Krafcikova, P., Silhan, J., Nencka, R. & Boura, E. Structural analysis of the SARS-CoV-2 methyltransferase complex involved in RNA cap creation bound to sinefungin. Nat. Commun. 11, 3717 (2020).
  46. Cimermancic, P. et al. CryptoSite: expanding the druggable proteome by characterization and prediction of cryptic binding sites. J. Mol. Biol. 428, 709-719 (2016).
  47. Buttenschoen, M., Morris, G. M. & Deane, C. M. PoseBusters: AI-based docking methods fail to generate physically valid poses or generalise to novel sequences. Chem. Sci. (2023).
  48. Bryant, P., Kelkar, A., Guljas, A., Clementi, C. & Noé, F. Structure prediction of protein-ligand complexes from sequence information with Umol. Preprint at https://www.biorxiv.org/content/10.1101/2023.11.03.565471v1 (2023).
  49. Zhong, E. D., Bepler, T., Berger, B. & Davis, J. H. CryoDRGN: reconstruction of heterogeneous cryo-EM structures using neural networks. Nat. Methods 18, 176-185 (2021).
  50. Zhang, S. et al. USP14-regulated allostery of the human proteasome by time-resolved cryo-EM. Nature 605, 567-574 (2023).
  51. Punjani, A. & Fleet, D. J. 3DFlex: determining structure and motion of flexible proteins from cryo-EM. Nat. Methods 20, 860-870 (2023).
  52. Geiger, M. & Smidt, T. e3nn: Euclidean neural networks. Preprint at https://arxiv.org/abs/2207.09453 (2022).
  53. Batzner, S. et al. E(3)-equivariant graph neural networks for data-efficient and accurate interatomic potentials. Nat. Commun. 13, 2453 (2022).
  54. McBride, W. G. Thalidomide and congenital abnormalities. Lancet 2, 90927-8 (1961).
  55. Song, Y., Dhariwal, P., Chen, M. & Sutskever, I. Consistency models. Preprint at https://arxiv.org/abs/2303.01469 (2023).
  56. Vaswani, A. et al. Attention is all you need. Adv. Neural Inf. Process. Syst. 30 (2017).
  57. Kabsch, W. A solution for the best rotation to relate two sets of vectors. Acta Crystallogr. Sect. A: Crystal Phys. Diffr. Theor. Gen. Crystallogr. 32, 922-923 (1976).
  58. Mariani, V., Biasini, M., Barbato, A. & Schwede, T. lDDT: a local superposition-free score for comparing protein structures and models using distance difference tests. Bioinformatics 29, 2722-2728 (2013).
  59. Batsanov, S. S. Van der Waals radii of elements. Inorganic Mater. 37, 871-885 (2001).
  60. Meli, R. & Biggin, P. C. spyrmsd: symmetry-corrected RMSD calculations in Python. J. Cheminforma. 12, 49 (2020).
  61. McNutt, A. T. et al. GNINA 1.0: molecular docking with deep learning. J. Cheminforma. 13, 1-20 (2021).
  62. Sherman, W., Beard, H. S. & Farid, R. Use of an induced fit receptor structure in virtual screening. Chem. Biol. Drug Design 67, 83-84 (2006).
  63. Lu, W. luwei0917/DynamicBind: V1.0. https://doi.org/10.5281/zenodo.10443816 (2023).

Acknowledgements

We are grateful to Meihui Song and Jiahui Tang for their visualization and inspiration. We thank Da Wei and Ziwei Huang for their assistance with web server setup and IT support. We thank James J. Collins for his valuable feedback and insightful discussions. S.Z. acknowledges funding from the Baidu scholarship. P.G.W. is supported by the Center for Theoretical Biological Physics, sponsored by NSF grant PHY-2019745, and by the D.R. Bullard-Welch Chair at Rice University (grant C-0016).

Author contributions

S.Z. and W.L. conceived and supervised the project. W.L. and J.Z. contributed to the algorithm implementation. J.Z., W.L., S.Z., and W.H. contributed to the visualization and baseline implementation. W.L., S.Z., P.G.W., and J.Z. wrote the manuscript. All authors were involved in the discussions and proofreading.

Competing interests

W.L., J.Z., W.H., Z.Z., X.J., Z.W., L.S., and C.L. work directly or indirectly for Galixir Technologies. S.Z. is a former employee of Galixir Technologies. The remaining authors declare no competing interests.

Additional information

Supplementary information The online version contains supplementary material available at https://doi.org/10.1038/s41467-024-45461-2.

Correspondence and requests for materials should be addressed to Wei Lu, Jixian Zhang or Shuangjia Zheng.

Peer review information Nature Communications thanks the anonymous reviewers for their contribution to the peer review of this work. A peer review file is available.

Reprints and permissions information is available at http://www.nature.com/reprints

Publisher’s note Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license, and indicate if changes were made. The images or other third party material in this article are included in the article’s Creative Commons license, unless indicated otherwise in a credit line to the material. If material is not included in the article’s Creative Commons license and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this license, visit http://creativecommons.org/licenses/by/4.0/.

© The Author(s) 2024

$^{1}$Galixir Technologies, 200100 Shanghai, China. $^{2}$School of Pharmaceutical Science, Sun Yat-sen University, 510006 Guangzhou, China. $^{3}$Center for Theoretical Biological Physics and Department of Chemistry, Rice University, Houston, TX 77005, USA. $^{4}$Global Institute of Future Technology, Shanghai Jiao Tong University, 200240 Shanghai, China. $^{5}$These authors contributed equally: Wei Lu, Jixian Zhang, Shuangjia Zheng. e-mail: luwei0917@gmail.com; jxzly1993@gmail.com; shuangjia.zheng@sjtu.edu.cn