
Cite this: DOI: 10.1039/c8cp07640e

Received 14th December 2018, Accepted 1st February 2019

rsc.li/pccp

Modelling potential energy surfaces for small clusters using Shepard interpolation with Gaussian-form nodal functions

Haina Wang and Ryan P. A. Bettens

Abstract

The potential energy surface (PES) of a chemical system is an analytical function that outputs the potential energy of the system when a nuclear configuration is given as input. The PESs of small atmospheric clusters have theoretical as well as environmental significance. A common method used to generate analytical PESs is the Shepard interpolation, where the PES is a weighted sum of Taylor series expansions (nodal functions) at ab initio sample points. In this study, we present a new method based on the Shepard interpolation, where the nodal functions are composed of a symmetric Gaussian term and an asymmetric exponential term in each dimension. Corresponding sampling methods were also developed. We tested the method on several atmospheric bimolecular clusters and achieved root mean square errors (RMSE) below $0.13\ kJ\ {mol}^{-1}$ with 150 samples for Ar-rigid $H_{2}O$ and Ne-rigid ${CO}_{2}$, and below $0.39\ kJ\ {mol}^{-1}$ with 1800 samples for rigid $N_{2}$-rigid ${CO}_{2}$.

Introduction

The potential energy surface (PES) of a chemical system is an important concept in theoretical chemistry that has wide applications including spectral analysis,$^{1}$ protein folding,$^{2-4}$ and reaction dynamics.$^{5}$ Based on the Born-Oppenheimer approximation, the PES refers to a function that outputs the electronic potential energy of the system when a nuclear configuration is given as input. For a nonlinear molecule with $N$ atoms, the PES is a function of $3N - 6$ independent degrees of freedom.$^{6,7}$ However, “redundant dimensions” can be present depending on the choice of the coordinate system that describes the nuclear configuration. In the 1990s and early 2000s, there was controversy or reservation about using all internuclear distances as coordinates because of this “redundancy”, and there were suggestions of using different sets of $3N - 6$ internal coordinates in different regions of the configuration space, e.g. in ref. 8. However, starting with Bowman’s group, recent decades have seen successful uses of all internuclear distances, or “Morse variables” derived therefrom.$^{9,10}$

Recently, due to the increasing awareness of human impact on the atmosphere, there has been extensive theoretical interest in the PESs of atmospheric clusters.$^{11-13}$ These are small systems of common atmospheric molecules, e.g. $N_{2}$, $O_{2}$, $O_{3}$, $H_{2}O$, Ar and ${CO}_{2}$, interacting through van der Waals forces or hydrogen bonds. Careful investigations of these clusters can shed new light on current environmental issues concerning atmospheric chemistry. For example, it has been predicted that as ${CO}_{2}$ forms clusters, its usually symmetric stretch mode $ν_{1}$ can become significantly asymmetric and absorb infrared radiation, contributing to the greenhouse effect together with its usually asymmetric vibrational modes, $ν_{2}$ and $ν_{3}$.$^{11}$

Due to the small sizes of atmospheric clusters, ab initio single point energy (SPE) evaluations with very high accuracy ($1.0\ kJ\ {mol}^{-1}$) can be achieved.$^{14}$ However, a thorough understanding of their possible interactions demands the generation of accurate analytical PESs over a global range of chemically and physically interesting configurations using a finite set of ab initio data, which remains a challenging task. The advantages of an analytical PES over pointwise evaluations include faster SPE retrieval, easy mathematical treatment and convenience of visualization. Typically, generating an analytical PES involves three aspects: the basic form of the function, the sampling of the configuration space (for ab initio calculations), and finally, the fitting or interpolation method that gives the global PES.$^{6}$

There are many milestones for all these aspects. The requirement for the PES to be symmetric with respect to permutations of like atoms was explicitly stated by Murrell and colleagues in their classic monograph.$^{15}$ To incorporate this permutational symmetry, theories of symmetrized coordinates$^{16}$ and invariant fitting bases$^{17}$ have been developed for triatomic and homonuclear tetra-atomic systems. More recently, the least squares fitting of permutationally invariant polynomials (PIP) was developed for high-dimensional systems by Braams and Bowman.$^{18}$ The many-body expansion (MBE), first applied by Varandas and Murrell to PESs of $H_{n}$ systems,$^{16}$ has become a widely used approach to model asymptotic regions of configuration spaces.$^{6,15,19-21}$ Common sampling methods include random sampling from a grid of bond lengths and angles. There are, however, more sophisticated sampling methods such as Mutually Orthogonal Latin Squares that aim to cover a wide range of a high dimensional space.$^{4}$ The Bowman group has used scattered sampling methods, where samples are generated from molecular dynamics simulation data with widely different trajectories, effectively sampling a large range of configurations and energies.$^{22}$

One of the most successful interpolation methods is the modified Shepard interpolation proposed by Collins and colleagues in 1994.$^{8,23}$ It represents the PES as a weighted sum of Taylor expansions in inverse internuclear distances, up to second order, at each sample. The weight of a sample at a configuration decreases as an inverse power of the distance between the configuration and the sample. In addition, the group proposed an iterative sampling scheme that outperforms random sampling by adding, in each round, new samples that are distant from existing samples. Problems of the modified Shepard method include bump- or step-shaped artifacts where distinct samples compete for dominance. Various studies have focused on reducing such effects.$^{20,24}$ Another problem with the modified Shepard interpolation is that the number of ab initio points needed to calculate the second derivatives grows as the square of the number of dimensions, making the method computationally expensive for large systems.

More recently, some groups have studied the application of Gaussian Processes (GP) in PES generation.$^{25-30}$ As another method of interpolation, GP uses Gaussians instead of Taylor expansions as nodal functions. Being a less dramatically varying function than the polynomial, the Gaussian promises to reduce artifacts. It also provides a natural estimate of the error at not-yet-sampled configurations, thereby providing an iterative sampling scheme where points with large estimated errors are added to the sample set. GP has been used successfully in modeling three-dimensional PESs of proton transfer in crystals.$^{25}$ Very recently, PIP-GP approaches have been developed to incorporate permutation invariance into GP.$^{30}$ The application of GP to higher dimensional gaseous systems, however, has been largely limited by the inflexible width of the Gaussian nodal functions and the fact that Gaussians are inevitably symmetric at samples, making it hard to account for first derivatives.

In this study, we developed a new form of nodal functions that aims to tackle the limitations of GP. It is composed of a symmetric Gaussian part and an asymmetric inverse-exponential part for each dimension, both having flexible widths and amplitudes. We also propose an iterative sampling scheme compatible with these nodal functions. We tested our method on three atmospheric clusters: Ar-rigid $H_{2}O$,$^{31}$ Ne-rigid ${CO}_{2}$,$^{28}$ and rigid ${CO}_{2}$-rigid $N_{2}$.$^{11}$ The focus of the present paper is on non-covalent interactions, with expectations of possible application to atmospheric science. For example, the PES for the clusters can serve as a perturbation term for the rovibrational spectrum of interacting atmospheric molecules, with the monomers as the “base case”, in a similar spirit to ref. 32. Nevertheless, we believe that the method itself is general enough to also deal with reacting molecules.

We focus only on interpolation of ab initio data in the current study, while noting that the dissociation regime can be well described by MBE methods, and that potential functions accurate for different regimes can be combined to give a global PES, e.g. with the energy switching (ES) approach developed by Varandas.$^{33-35}$

We will first explain the general methodology and then discuss the specific implementations and results for the different clusters.

Methodology

Nodal functions

We use internuclear distances to describe configurations. As all clusters we study are rigid, only intermolecular internuclear distances are included. For example, in the cluster Ar-rigid $H_{2}O$, configurations are defined by the internuclear distances (Ar-O, Ar-H$^{1}$, Ar-H$^{2}$). Given the configuration vector $r$ of a sample with ab initio energy $E_{r}$, the nodal function $V_{r}$ is expressed by a Gaussian symmetric term (resembling the s orbital) and an asymmetric term (resembling the p orbital) in each dimension, plus the constant $E_{r}$, ensuring that the nodal function passes through the sample, i.e. $V_{r}(r) = E_{r}$, as shown in (1).

$$V_{r}(\tilde{r}) = \sum_{i = 1}^{\dim(r)} \left[ A_{i} (\tilde{r}_{i} - r_{i})\, e^{-α_{i} (\tilde{r}_{i} - r_{i})^{2}} + B_{i} \left( e^{-β_{i} (\tilde{r}_{i} - r_{i})^{2}} - 1 \right) \right] + E_{r} \tag{1}$$

where $\tilde{r}$ is any arbitrary configuration in the vicinity of $r$. For each dimension, the parameters $A_{i}$ and $α_{i}$ are the amplitude and width of the asymmetric term, and $B_{i}$ and $β_{i}$ are the amplitude and width of the symmetric term. $α_{i}$ and $β_{i}$ must be positive.
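As a minimal sketch of eqn (1), the function below evaluates a nodal function for given parameters. The name `nodal_value` and all numerical values in the usage lines are illustrative assumptions, not values from this work. Two properties follow directly from the functional form: the node reproduces $E_{r}$ at the sample, and its first derivative along dimension $i$ at the sample equals $A_{i}$, because the symmetric term is flat there.

```python
import math

def nodal_value(r_tilde, r, A, alpha, B, beta, E_r):
    """Evaluate eqn (1) at r_tilde for a sample located at r with energy E_r."""
    total = E_r
    for x, x0, a, al, b, be in zip(r_tilde, r, A, alpha, B, beta):
        d = x - x0
        total += a * d * math.exp(-al * d * d)      # asymmetric ("p-like") term
        total += b * (math.exp(-be * d * d) - 1.0)  # symmetric ("s-like") term
    return total

# Hypothetical parameters for a three-dimensional node (illustrative only).
r = (6.5, 7.0, 7.2)
A, alpha = (0.02, -0.01, 0.005), (0.8, 1.2, 0.5)
B, beta = (0.01, 0.003, -0.002), (0.5, 0.5, 0.5)
E_r = -0.0006

at_sample = nodal_value(r, r, A, alpha, B, beta, E_r)

# Central finite difference along dimension 0; it should be close to A[0].
h = 1e-5
plus = nodal_value((r[0] + h, r[1], r[2]), r, A, alpha, B, beta, E_r)
minus = nodal_value((r[0] - h, r[1], r[2]), r, A, alpha, B, beta, E_r)
slope_0 = (plus - minus) / (2 * h)
```

This is why the asymmetric term lets the node carry first-derivative information that a purely symmetric Gaussian cannot.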

All parameters $A_{i}$, $B_{i}$, $α_{i}$, $β_{i}$ need to be determined. The most obvious way is to fit the parameters based on ab initio calculations on a set of configurations surrounding each sample, which we term co-samples. We then optimize the width parameters with a rather “brute force” line search, first letting $α_{1}$ increase from 0.01 to 50 in increments of 0.01, while all other widths are initially fixed at 0.01. For each value of $α_{1}$, the amplitude parameters $A_{i}$ and $B_{i}$ are obtained by linear regression; a least-squares regression error is thus generated. The value of $α_{1}$ that gives the smallest regression error is taken to be optimal and fixed, while the next width parameter, e.g. $α_{2}$, undergoes optimization, and so on for all $α_{i}$ and $β_{i}$. After all widths have been optimized, the whole process starts over from $α_{1}$ and is repeated for a total of three rounds. This regression scheme evidently does not search for the global optimum of $α_{i}$ and $β_{i}$, but as we will see in the next section, it turns out to be a rather practical scheme.
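The width search described above can be sketched as a coordinate-wise grid search wrapped around a linear least-squares fit. This is a sketch under stated assumptions rather than the authors' implementation: the function names, the use of NumPy, the ordering (all $α_{i}$ first, then all $β_{i}$), the toy energy function, and the coarse grid are all illustrative choices. The text uses a grid from 0.01 to 50 in steps of 0.01, which is far finer than the one used here.

```python
import numpy as np

def fit_amplitudes(X, E, r, E_r, alpha, beta):
    """Linear least squares for A_i and B_i at fixed widths; returns (coef, SSE)."""
    d = X - r                                      # displacements, shape (L, dim)
    M = np.hstack([d * np.exp(-alpha * d**2),      # columns multiplying A_i
                   np.exp(-beta * d**2) - 1.0])    # columns multiplying B_i
    coef, *_ = np.linalg.lstsq(M, E - E_r, rcond=None)
    resid = M @ coef - (E - E_r)
    return coef, float(resid @ resid)

def fit_node(X, E, r, E_r, grid, rounds=3):
    """Optimize one width at a time over `grid`, refitting amplitudes each time."""
    dim = len(r)
    widths = np.full(2 * dim, grid[0])             # [alpha_1..alpha_dim, beta_1..beta_dim]
    for _ in range(rounds):
        for k in range(2 * dim):
            errs = []
            for w in grid:
                widths[k] = w
                errs.append(fit_amplitudes(X, E, r, E_r, widths[:dim], widths[dim:])[1])
            widths[k] = grid[int(np.argmin(errs))]  # keep the best value, move on
    coef, err = fit_amplitudes(X, E, r, E_r, widths[:dim], widths[dim:])
    return coef[:dim], coef[dim:], widths[:dim], widths[dim:], err

# Toy data: a Morse-like stand-in for the ab initio energies (assumption).
rng = np.random.default_rng(0)
toy_energy = lambda x: np.sum(np.exp(-2.0 * (x - 6.0)) - 2.0 * np.exp(-(x - 6.0)), axis=-1)
r = np.array([6.0, 6.5, 7.0])
X = r + rng.uniform(-0.5, 0.5, size=(13, 3))       # 13 co-samples around the sample
E, E_r = toy_energy(X), float(toy_energy(r))
grid = np.arange(0.01, 5.0, 0.25)                  # coarse grid to keep the demo fast
_, err_start = fit_amplitudes(X, E, r, E_r, np.full(3, grid[0]), np.full(3, grid[0]))
A, B, alpha, beta, err_final = fit_node(X, E, r, E_r, grid)
```

Because each one-dimensional scan includes the current value of the width being optimized, the regression error never increases from one step to the next, although nothing guarantees a global optimum.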

The permutational invariance, i.e. the requirement that the potential energy remain the same when two atoms of the same element switch positions, can be incorporated easily. Whenever a sample is chosen, we also add its “permutational twins” into the sample set. The nodal functions of the twin samples are obtained by permuting the terms of the nodal function of the original sample. For example, if a sample of Ar-rigid $H_{2}O$ is given by the internuclear distances (Ar-O, Ar-H$^{1}$, Ar-H$^{2}$) $= (r_{1}, r_{2}, r_{3}) = r$ with nodal function (1), then it has a twin sample $r' = (r_{1}', r_{2}', r_{3}') = (r_{1}, r_{3}, r_{2})$ with the nodal function (2).

$$\begin{aligned} V_{r'}(\tilde{r}) ={} & A_{1} (\tilde{r}_{1} - r_{1}')\, e^{-α_{1} (\tilde{r}_{1} - r_{1}')^{2}} + B_{1} \left( e^{-β_{1} (\tilde{r}_{1} - r_{1}')^{2}} - 1 \right) \\ & + A_{3} (\tilde{r}_{2} - r_{2}')\, e^{-α_{3} (\tilde{r}_{2} - r_{2}')^{2}} + B_{3} \left( e^{-β_{3} (\tilde{r}_{2} - r_{2}')^{2}} - 1 \right) \\ & + A_{2} (\tilde{r}_{3} - r_{3}')\, e^{-α_{2} (\tilde{r}_{3} - r_{3}')^{2}} + B_{2} \left( e^{-β_{2} (\tilde{r}_{3} - r_{3}')^{2}} - 1 \right) + E_{r} \end{aligned} \tag{2}$$
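The construction of a twin node in (2) amounts to reordering the per-dimension parameters together with the sample coordinates. The sketch below illustrates this for the H$^{1}$/H$^{2}$ swap in Ar-rigid $H_{2}O$. The dictionary layout, the function names and the numerical values are assumptions made for illustration only.

```python
import math

def nodal_value(r_tilde, node):
    """Evaluate a node of the form of eqn (1); `node` is a dict of parameters."""
    total = node["E"]
    for x, x0, a, al, b, be in zip(r_tilde, node["r"], node["A"],
                                   node["alpha"], node["B"], node["beta"]):
        d = x - x0
        total += a * d * math.exp(-al * d * d) + b * (math.exp(-be * d * d) - 1.0)
    return total

def permute_node(node, perm):
    """Twin node: position i of the twin takes entry perm[i] of the original."""
    twin = {k: tuple(node[k][j] for j in perm)
            for k in ("r", "A", "alpha", "B", "beta")}
    twin["E"] = node["E"]          # the energy is unchanged by the permutation
    return twin

# Hypothetical node in (Ar-O, Ar-H1, Ar-H2) coordinates (illustrative values).
node = {"r": (6.5, 7.0, 7.4), "A": (0.02, -0.01, 0.005), "alpha": (0.8, 1.2, 0.5),
        "B": (0.01, 0.003, -0.002), "beta": (0.5, 0.6, 0.7), "E": -0.0006}
swap_h = (0, 2, 1)                 # exchange the two Ar-H distances
twin = permute_node(node, swap_h)

x = (6.7, 6.9, 7.6)                          # arbitrary configuration
x_swapped = tuple(x[j] for j in swap_h)      # the same geometry with H labels exchanged
v_original = nodal_value(x, node)
v_twin = nodal_value(x_swapped, twin)
```

Since the twin reuses the fitted parameters of the original sample, no additional regression is needed for it.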

After the nodal functions of all samples are obtained, the global PES $V(\tilde{r})$ is expressed as a weighted sum of nodal functions using inverse distance weighting, as given by (3) and (4), identical to the modified Shepard interpolation.$^{23}$

$$V(\tilde{r}) = \sum_{\text{sample } r} V_{r}(\tilde{r})\, w_{r}(\tilde{r}) \tag{3}$$

where the weights $w_{r}(\tilde{r})$ are such that for any arbitrary configuration $\tilde{r}$

$$w_{r}(\tilde{r}) \propto \frac{1}{‖ \tilde{r} - r ‖^{p}} \quad \text{and} \quad \sum_{\text{sample } r} w_{r}(\tilde{r}) = 1 \tag{4}$$

for a constant $p > 0$.
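Equations (3) and (4) can be sketched in a few lines. The function names and the toy nodes below are assumptions for illustration. Dividing every distance by the smallest one before raising to the power $p$ is a numerical-stability choice of this sketch, not something stated in the text; it leaves the normalized weights unchanged while avoiding overflow for large $p$.

```python
import math

def shepard_weights(r_tilde, samples, p):
    """Normalized inverse-distance weights of eqn (4)."""
    dists = [math.dist(r_tilde, r) for r in samples]
    d_min = min(dists)
    if d_min == 0.0:
        # The query coincides with a sample: that sample takes all the weight.
        hits = [1.0 if d == 0.0 else 0.0 for d in dists]
        return [h / sum(hits) for h in hits]
    raw = [(d_min / d) ** p for d in dists]   # proportional to 1 / d**p
    total = sum(raw)
    return [w / total for w in raw]

def shepard_energy(r_tilde, samples, nodal_functions, p):
    """Weighted sum of nodal functions, eqn (3)."""
    weights = shepard_weights(r_tilde, samples, p)
    return sum(w * V(r_tilde) for w, V in zip(weights, nodal_functions))

# Two toy samples with constant "nodal functions" (illustrative only).
samples = [(6.0, 7.0, 7.0), (8.0, 9.0, 9.0)]
nodes = [lambda q: -1.0, lambda q: -0.2]
query = (6.5, 7.5, 7.5)

weights = shepard_weights(query, samples, p=12)
energy_at_query = shepard_energy(query, samples, nodes, p=12)
energy_at_sample = shepard_energy(samples[0], samples, nodes, p=12)
```

With a large exponent such as $p = 12$, the nearer sample dominates almost completely, which is the behaviour the decay exponent is meant to control.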
We remark here that we have used a rather “brute force” way to incorporate permutational symmetry. In the systems studied in this paper, the symmetry is relatively low and the limiting step for generating the PES is the generation of the nodal functions, i.e. the nonlinear and linear regressions to optimize $A_{i}$, $B_{i}$, $α_{i}$, $β_{i}$. These steps do not need to be repeated for the replicated (twin) samples. However, for systems with higher permutational symmetry like ${CH}_{5}$, the “inference” step (3), i.e. the generation of the final PES by weighting all the nodal functions, may well be the limiting step. In such cases, fast search algorithms exist in computer science to search for the closest samples to an arbitrary configuration (range queries),$^{36,37}$ because only those contribute significantly to the energy estimation. Alternatively, instead of replicating data, PIP methods could be incorporated into our approach to make the PES permutationally invariant in an efficient way, where the “primary invariant polynomials” of the internal coordinates described in ref. 18 replace the expressions ($\tilde{r}_{i} - r_{i}$) in (1).$^{18,30}$ For non-covalent interactions, reduced permutational symmetry can be used to further reduce computational cost.$^{38}$

Sampling

Simple random sampling over a grid of ab initio points in Jacobian coordinates (defined in the next section) was tested for all clusters studied here. The Jacobian coordinates must be converted to internal coordinates to generate the nodal functions. On top of random sampling, as we have mentioned in the introduction, there are iterative sampling algorithms that gradually improve the PES estimation as more samples are added, and they generally make significant improvement over random sampling. Inspired by these works, we came up with two ideas that could enhance sampling.

One of them is “error probing”. Since we use linear regression to obtain the parameters $A_{i}$ and $B_{i}$ of the nodal functions, a regression error is generated automatically at each sample. We reason that if this regression error is large, the Gaussian nodal function does not describe the energy in the vicinity of the sample very well. Therefore, more samples are needed in that area to reduce the PES estimation error effectively.

The second idea is “distance probing”, similar to that in Collins’ work.$^{23}$ The estimation error is expected to be large in regions where samples are sparse. We want the range of the samples to be as broad as possible, so that unfavorable extrapolation can be prevented. Therefore, points far from existing samples are chosen as new samples.

In practice, error probing and distance probing should alternate with random sampling. Exactly how this can be done will be described for individual systems in the next section.

For the more complex cluster rigid ${CO}_{2}$-rigid $N_{2}$, we also developed a “Divide and Conquer” method, which probes for samples separately in the wall, well and dissociation regions before pooling samples together.

Application to atmospheric clusters

General aspects: co-samples, regression and the decay exponent

The selection of co-samples around each sample is a crucial step in generating the nodal functions. We choose co-samples in a random fashion such that the distances between a sample and its co-samples (in the Euclidean metric on internal coordinates) are between $0.1\,a_{0}$ and $1.0\,a_{0}$. The number of co-samples $L$ around each sample needs to be equal to or slightly larger than the number of unknown parameters in (1), i.e. $L \geq 4 \cdot \dim(r)$. Therefore, for Ar-rigid $H_{2}O$ and Ne-rigid ${CO}_{2}$, $L = 13$. For rigid ${CO}_{2}$-rigid $N_{2}$, $L = 25$. Taking fewer co-samples than $4 \cdot \dim(r)$ can lead to very inaccurate nodal functions.

As we will explain for each cluster in the next subsections, while the samples are truly ab initio points, the co-samples are pseudo ab initio, generated from very accurate analytical functions. We remark that this is a simplification, since some topological features, e.g. nearby crossings that may also happen for noncovalent interactions, do not appear in the analytical functions used to generate the pseudo ab initio points.$^{39}$

For all clusters studied, the “brute force” nonlinear regression for $α_{i}$ and $β_{i}$, together with linear regression for $A_{i}$ and $B_{i}$, usually gives least-squares errors of less than $0.05\,E_{r}$ for samples in the well region (i.e. 5% of the sample energy) and $0.15$-$0.50\,E_{r}$ for samples in the wall region. As expected, the asymmetric amplitudes $A_{i}$ are larger at samples with large first derivatives, especially for samples in the wall region, where local asymmetry of the PES is significant. Therefore, we believe that we have found a rather practical method for parameter determination. We also found that the regression accuracy is much more sensitive to $α_{i}$ than to $β_{i}$. One could even save computational resources by fixing all $β_{i}$ at 0.5 without changing the order of magnitude of the regression error. The accuracy is insensitive to $α_{i}$ only for changes within an order of magnitude.

The “decay exponent” $p$ in (4) is an important parameter in all interpolations using inverse distance weighting. It controls the “blurriness” of the generated PES. Previous studies on the modified Shepard interpolation show that one must choose $p ≫ 3N - 3$ to ensure that samples far from $\tilde{r}$ make no contribution to the energy estimation at $\tilde{r}$.$^{24}$ It turns out that our investigations of the Gaussian nodal functions yield similar results. A value of $p$ between 10 and 15 works well for Ar-rigid $H_{2}O$ and Ne-rigid ${CO}_{2}$. For rigid ${CO}_{2}$-rigid $N_{2}$, we set $p$ between 20 and 30. It is observed that, as long as $p$ lies in the ranges specified above, the RMSE of the PES estimation is not sensitive to the specific value that $p$ takes.

Ar-rigid $H_{2}O$

Coordinate system and ab initio details. The Jacobian coordinates on which the ab initio data grid is based are shown in Fig. 1(a). The water molecule is rigid with $r(OH) = 1.810\,a_{0}$ and $∠HOH = 104.51^{\circ}$. The $H_{2}O$ is placed in the $xy$-plane with its center of mass at the origin. The $y$-axis bisects $∠HOH$ and O is on the positive half. The configuration can now be described by the polar coordinates $(R, θ, ϕ)$ of Ar, where $0 < θ < 90^{\circ}$, $0 < ϕ < 180^{\circ}$.

The ab initio data consist of 1584 symmetry-unique points on a grid of $(R, θ, ϕ)$, described in ref. 31, with $R$ ranging from 4.5 to $20\,a_{0}$. The grid points are denser near the equilibrium values of $R$ ($6.2$-$7.2\,a_{0}$). The theory level and basis set are CCSD(T)-F12(a)/AVQZ with standard counterpoise correction of the basis set superposition error (BSSE).

In our algorithm of PES interpolation, the Jacobian coordinates are converted to the internal coordinates (Ar-O, Ar-H$^{1}$, Ar-H$^{2}$). A test set of 100 symmetry-unique points is chosen from the grid. All samples are also chosen from the grid, but points in the test set are not allowed to be samples. The energies of the randomly chosen co-samples are obtained from a fitted analytical form given in ref. 31, which has been shown to agree with direct ab initio calculations to within $10^{-3}\ {cm}^{-1}$ ($1.2 \times 10^{-5}\ kJ\ {mol}^{-1}$).

Analysis of sampling. The sampling algorithm, described by three integers $(S, D, E)$, is an iterative process in which each round consists of three actions that add new samples, as follows.

1. Randomly select $S$ symmetry-unique samples.
2. (“Distance probing”) Randomly select 100 symmetry-unique “probe points” that are not yet samples. For each probe point, obtain its distance to its nearest existing sample (using the Euclidean metric on internal coordinates). The $D$ probe points with the largest distances to existing samples are added as new samples.
3. (“Error probing”) Randomly select 100 symmetry-unique “probe points” that are not yet samples. For each probe point, obtain an estimate of the error by averaging the linear regression errors of its three nearest samples. The $E$ probe points with the largest estimated errors are added to the sample set.

Steps 1 through 3 are repeated while the RMSE of the test set is monitored, until 250 samples are obtained. Fig. 2(a) shows the variation of RMSE with the number of samples for four settings of $(S, D, E)$. Note that the setting $(50, 0, 0)$ is identical to simple random sampling. Each curve is the average result of 5 runs.
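One round of the $(S, D, E)$ scheme can be sketched as follows. This is an illustrative outline under stated assumptions, not the authors' code: `pool` is assumed to already contain only symmetry-unique grid points in internal coordinates with the test set removed, and `node_error` is a hypothetical callable that returns the linear regression error of an existing sample. In practice that error only becomes available after the nodal function of the sample has been fitted from its co-samples.

```python
import math
import random

def sampling_round(pool, samples, S, D, E, node_error, rng, n_probe=100):
    """Add S random, D distance-probed and E error-probed points to `samples`."""
    def unused():
        taken = set(samples)
        return [q for q in pool if q not in taken]

    # Step 1: S new samples chosen at random.
    cand = unused()
    samples.extend(rng.sample(cand, min(S, len(cand))))

    # Step 2: distance probing; keep the D probes farthest from any sample.
    cand = unused()
    probes = rng.sample(cand, min(n_probe, len(cand)))
    probes.sort(key=lambda q: min((math.dist(q, s) for s in samples),
                                  default=math.inf), reverse=True)
    samples.extend(probes[:D])

    # Step 3: error probing; estimate the error of a probe from the mean
    # regression error of its three nearest samples, keep the E largest.
    cand = unused()
    probes = rng.sample(cand, min(n_probe, len(cand)))

    def estimated_error(q):
        nearest = sorted(samples, key=lambda s: math.dist(q, s))[:3]
        return sum(node_error(s) for s in nearest) / max(len(nearest), 1)

    probes.sort(key=estimated_error, reverse=True)
    samples.extend(probes[:E])
    return samples

# Toy pool of 216 points and a made-up error model (illustrative only).
pool = [(float(a), float(b), float(c))
        for a in range(6) for b in range(6) for c in range(6)]
rng = random.Random(0)
samples = []
sampling_round(pool, samples, S=35, D=5, E=10,
               node_error=lambda s: abs(s[0] - 2.5), rng=rng)
n_after_one_round = len(samples)
sampling_round(pool, samples, S=35, D=5, E=10,
               node_error=lambda s: abs(s[0] - 2.5), rng=rng)
```

Setting `D=0` and `E=0` reduces the round to simple random sampling, matching the $(50, 0, 0)$ setting mentioned above.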

It can be observed from Fig. 2(a) that the error probing step is very helpful in reducing the error from early on. An RMSE of around $100\ μE_{h}$ ($0.26\ kJ\ {mol}^{-1}$) can be achieved within 50 samples using the setting $(35, 0, 15)$, and the final error converges at $20$-$50\ μE_{h}$ ($0.05$-$0.13\ kJ\ {mol}^{-1}$). On the other hand, using distance probing without error probing seems to be detrimental to the PES estimation in this system, probably because in the first rounds the distance probing step finds distant points in the trivial dissociation region. The RMSE for the combined probing setting $(35, 5, 10)$ converges to the same limit of $20$-$50\ μE_{h}$, but the RMSE is relatively high before 150 samples. Note that 150 samples, with all their co-samples, actually require as many ab initio data as the grid itself, but the fact that local behaviors at the 150 samples can predict the global PES accurately shows that our nodal function is probably suitable for PES description. This sets the stage for investigations of higher dimensions.

Visualization of PES estimation. To visualize the quality of PES estimation and to understand the chemical distribution of the estimation error, we fix $θ = 90^{\circ}$ (i.e. fixing Ar on the $xy$-plane). We plot, in Fig. 3, the true ab initio energies and the estimation error against the position of Ar on the $xy$-plane. The PES was generated from 250 random samples. It can be seen that the error is high exclusively in the wall region, whereas the well and the dissociation region are well estimated. The errors at the wall are systematically positive, possibly due to extrapolation at configurations whose $R$ values are smaller than those of the samples.

Fig. 1 The Jacobian coordinate systems of the ab initio data grid for each of the clusters studied in this work. (a) Ar-$H_{2}O$. (b) Ne-${CO}_{2}$. (c) ${CO}_{2}$-$N_{2}$.

Fig. 2 Variation of RMSE during iterative sampling. The curves correspond to different iterative sampling schemes represented by $(S, D, E)$ described in the text. (a) On a 100-point test set for Ar-$H_{2}O$. (b) On a 200-point test set for ${CO}_{2}$-$N_{2}$.

Fig. 3 Ab initio energy and estimation error for the “slice” $θ = 90^{\circ}$ of the PES of Ar-$H_{2}O$. PES generated by 250 samples obtained through the iterative sampling setting $(S, D, E) = (35, 0, 15)$. Averaged over 3 runs. Energy unit is $E_{h}$ and box dimensions are in $a_{0}$.

Ne-rigid ${CO}_{2}$

Fig. 1(b) shows the Jacobi coordinates $(R, \theta)$ defining the ab initio data grid, described in ref. 28. $R$ is the distance between C and Ne. $\theta$ is the angle between the $\mathrm{C} \to \mathrm{Ne}$ vector and the $\mathrm{CO_2}$ molecular axis. The geometry of rigid $\mathrm{CO_2}$ is fixed at $r(\mathrm{CO}) = 2.196\ a_0$. The ab initio data consist of 1200 symmetry-unique points, with $R$ ranging from 4.0 to $20.0\ a_0$. The theory level is CCSD(T), and the basis set is AVTZ for C and O and atomic natural orbital (ANO) 6s5p3d2f for Ne. The standard counterpoise correction is used to account for BSSE.

The internal coordinates (Ne-C, Ne-$\mathrm{O^1}$, Ne-$\mathrm{O^2}$) are used for our model. A random set of 40 symmetry-unique points from the grid with $R$ between 4.5 and $15\ a_0$ is chosen as the test set. Samples are chosen from the same grid and exclude the test points. Co-samples are randomly chosen around each sample, and their energies are obtained from the analytical form given in ref. 28, which achieves RMSEs of less than $0.03\ \mathrm{cm^{-1}}$ ($3.6 \times 10^{-4}\ \mathrm{kJ\ mol^{-1}}$) compared to direct ab initio calculations.
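The mapping from the grid coordinates $(R, \theta)$ to the three internuclear distances used by the model follows from the law of cosines. The sketch below is only an illustration: the function name is hypothetical, and it assumes that $\theta$ is measured from the C→O¹ direction, so that O¹ is the oxygen nearer to Ne when $\theta < 90^{\circ}$. The labelling of the two oxygens is not specified in this section.

```python
import math

R_CO = 2.196  # rigid C-O bond length in a0, as quoted in the text


def ne_co2_internal(R, theta_deg, r_co=R_CO):
    """Return (Ne-C, Ne-O1, Ne-O2) distances in a0 from Jacobi coordinates.

    Assumption: theta is the angle between the C->Ne vector and the
    C->O1 direction; O2 sits on the opposite side of the linear molecule.
    """
    c = math.cos(math.radians(theta_deg))
    ne_c = R
    # Law of cosines in the triangles Ne-C-O1 and Ne-C-O2
    ne_o1 = math.sqrt(R * R + r_co * r_co - 2.0 * R * r_co * c)
    ne_o2 = math.sqrt(R * R + r_co * r_co + 2.0 * R * r_co * c)
    return ne_c, ne_o1, ne_o2
```

Because three distances are produced from two independent coordinates, the set is redundant, which is the situation the model is tested against in this subsection.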

The performance of the sampling schemes is similar to the case of Ar-rigid $\mathrm{H_2O}$, but the RMSE drops much faster because the PES is in fact only two-dimensional. Even with random sampling, the RMSE can be reduced to $30$-$100\ \mu E_h$ ($0.08$-$0.26\ \mathrm{kJ\ mol^{-1}}$) within 50 samples and converges to below $50\ \mu E_h$ ($0.13\ \mathrm{kJ\ mol^{-1}}$) in 150 samples. Using the setting $(S, D, E) = (35, 0, 15)$, the RMSE can converge to $20\ \mu E_h$ ($0.05\ \mathrm{kJ\ mol^{-1}}$) in 150 samples. This RMSE value is comparable to that of the basic Gaussian process interpolation described in ref. 40. As for Ar-$\mathrm{H_2O}$, the error increases with decreasing $R$ in the wall region. Since there are three internuclear distances but only two actual dimensions, the results for Ne-rigid $\mathrm{CO_2}$ show that our model works well with redundant internal coordinates.
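The RMSE values in this paper are quoted both in microhartree and in kJ mol⁻¹ (and, for the reference fits, in cm⁻¹). A small helper such as the following reproduces the quoted conversions; the function names are illustrative, and the conversion factors are standard values rounded to the digits shown.

```python
KJ_PER_MOL_PER_HARTREE = 2625.4996       # 1 E_h expressed in kJ/mol
KJ_PER_MOL_PER_WAVENUMBER = 0.01196266   # 1 cm^-1 expressed in kJ/mol


def micro_hartree_to_kj_per_mol(value):
    """Convert an energy in micro-hartree to kJ/mol."""
    return value * 1.0e-6 * KJ_PER_MOL_PER_HARTREE


def wavenumber_to_kj_per_mol(value):
    """Convert an energy in cm^-1 to kJ/mol."""
    return value * KJ_PER_MOL_PER_WAVENUMBER


if __name__ == "__main__":
    for rmse in (20, 50, 100):
        print(rmse, "uE_h =", round(micro_hartree_to_kj_per_mol(rmse), 2), "kJ/mol")
```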

Rigid ${CO}_{2}$ -rigid $N_{2}$

Coordinate system and ab initio details. Fig. 1(c) shows the Jacobi coordinates $(R, \theta_1, \theta_2, \phi)$ defining the ab initio data grid, as described in ref. 11. $R$ is the distance between C and the center of mass Y of $\mathrm{N_2}$; $\theta_1 = \angle \mathrm{O^1CY}$; $\theta_2 = \angle \mathrm{CYN^1}$; $\phi$ is the dihedral angle $\mathrm{O^1CYN^1}$. The configurations of $\mathrm{CO_2}$ and $\mathrm{N_2}$ are frozen at $r(\mathrm{CO}) = 2.21727\ a_0$ and $r(\mathrm{NN}) = 2.10665\ a_0$. The grid consists of 21840 symmetry-unique points, with $R$ ranging from 4.0 to $30.0\ a_0$. However, we restrict our attention to the test configurations whose minimum intermolecular internuclear distances $d_{min}$ lie between 4.5 and $15.0\ a_0$. The sampling space consists of configurations with $d_{min}$ between 4.45 and $15.1\ a_0$. We want the sampling space to be slightly larger than the test space to reduce unfavorable extrapolation. The ab initio theory level and basis set are CCSD(T)-F12(a)/AVTZ, with the standard counterpoise correction for BSSE.

The internal coordinates ($\mathrm{N^1}$-C, $\mathrm{N^1}$-$\mathrm{O^1}$, $\mathrm{N^1}$-$\mathrm{O^2}$, $\mathrm{N^2}$-C, $\mathrm{N^2}$-$\mathrm{O^1}$, $\mathrm{N^2}$-$\mathrm{O^2}$) are used for our model. We use two kinds of test sets. The first is a random set of 200 symmetry-unique points from the grid. The second is the entire “slice” of the PES with $\theta_2 = 0$, $\phi = 0$, used for visualization of the PES. All samples are chosen from the grid, excluding the test points. Co-sample energies are obtained from the analytical form given in ref. 11, which achieves RMSEs of less than $1.0\ \mathrm{cm^{-1}}$ ($0.012\ \mathrm{kJ\ mol^{-1}}$) compared to direct ab initio calculations.
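The six internuclear distances and $d_{min}$ can be obtained from $(R, \theta_1, \theta_2, \phi)$ by placing the atoms in a Cartesian frame. The following is a minimal sketch under stated assumptions, not the authors' code: C is put at the origin, Y on the $+z$ axis, O¹ in the $xz$-plane, and the sign convention of the dihedral angle is chosen arbitrarily (the distances depend only on $\cos\phi$). The function names are hypothetical.

```python
import math

R_CO = 2.21727  # frozen C-O bond length in a0 (from the text)
R_NN = 2.10665  # frozen N-N bond length in a0 (from the text)


def co2_n2_distances(R, theta1_deg, theta2_deg, phi_deg):
    """Return the six intermolecular N-C / N-O distances in a0."""
    t1, t2, ph = (math.radians(x) for x in (theta1_deg, theta2_deg, phi_deg))
    c = (0.0, 0.0, 0.0)
    # theta1 is the angle O1-C-Y, with Y on the +z axis
    o1 = (R_CO * math.sin(t1), 0.0, R_CO * math.cos(t1))
    o2 = (-o1[0], -o1[1], -o1[2])
    # theta2 is the angle C-Y-N1: the Y->N1 vector is measured from Y->C (-z)
    h = 0.5 * R_NN  # Y is the midpoint of the homonuclear N2
    d = (h * math.sin(t2) * math.cos(ph),
         h * math.sin(t2) * math.sin(ph),
         -h * math.cos(t2))
    n1 = (d[0], d[1], R + d[2])
    n2 = (-d[0], -d[1], R - d[2])
    return {
        f"{n_name}-{a_name}": math.dist(n_pos, a_pos)
        for n_name, n_pos in (("N1", n1), ("N2", n2))
        for a_name, a_pos in (("C", c), ("O1", o1), ("O2", o2))
    }


def d_min(R, theta1_deg, theta2_deg, phi_deg):
    """Minimum intermolecular internuclear distance in a0."""
    return min(co2_n2_distances(R, theta1_deg, theta2_deg, phi_deg).values())
```

For a collinear arrangement ($\theta_1 = \theta_2 = 0$) the sketch gives $d_{min} = R - r(\mathrm{NN})/2 - r(\mathrm{CO})$, as expected.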

Analysis of sampling. We tried the iterative sampling scheme discussed above in the Ar-$\mathrm{H_2O}$ section. The performance of four settings within 600 symmetry-unique samples is given in Fig. 2(b). The RMSE is calculated for a 200-point test set. Each curve is the average of 5 runs.

Unlike in Fig. 2(a), the sampling schemes here do not differ significantly from each other in terms of RMSE variation. It can only be observed that, compared to simple random sampling, the settings with error probing or distance probing reduce the error more steadily during sampling. The random sampling curve $(S, D, E) = (50, 0, 0)$ has several significant rises, whereas the other curves decrease almost monotonically after 100 samples. However, as sampling goes on, the random scheme can also spot good samples and obtain an accuracy comparable with schemes using probing.

The RMSE converges at about 1500 samples. Even for the same test set, the final converged RMSEs vary greatly, ranging from 70 to $300\ \mu E_h$ ($0.18$-$0.79\ \mathrm{kJ\ mol^{-1}}$). This large range suggests that there may still be room for improvement in sampling.

“Divide and Conquer” for sampling. We found that the RMSE can be reliably limited to around $100\ \mu E_h$ when we restricted both the test set and the sample set to points whose $d_{min}$ is greater than $5.0\ a_0$, but no significant improvement was observed if we restricted only the test set. We speculated that the configurations in the wall region, having very high energy, interfere with the sampling or the energy estimation in the well region. Conversely, the samples in the well region may influence the estimation of the wall configurations. Thus, the idea of a “Divide and Conquer” strategy seemed immediately promising: one could probably improve the full PES estimation by estimating disjoint regions first.

In light of this observation, we separated the PES into 3 pieces (units in $a_0$):

“Wall” for test points with $4.5 < d_{min} < 5.5$ , samples with $4.45 < d_{min} < 5.6$ ;
“Well” for test points with $5.5 < d_{min} < 10.0$ , samples with $5.4 < d_{min} < 10.1$ ;
“Tail” for test points with $10.0 < d_{min} < 15.0$ , samples with $9.9 < d_{min} < 15.1$ .
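The three-piece partition can be written as a simple lookup on $d_{min}$. The sketch below uses the bounds listed above; the function names are hypothetical, and the treatment of points lying exactly on a boundary (half-open intervals for test points) is an assumption, since the text gives only strict inequalities. Note that the sampling windows overlap slightly, so a configuration near a boundary is eligible as a sample for two pieces.

```python
# Bounds in a0, copied from the list above
TEST_BOUNDS = {"Wall": (4.5, 5.5), "Well": (5.5, 10.0), "Tail": (10.0, 15.0)}
SAMPLE_BOUNDS = {"Wall": (4.45, 5.6), "Well": (5.4, 10.1), "Tail": (9.9, 15.1)}


def region_of_test_point(d_min):
    """Return the piece a test point belongs to, or None if out of range.

    Assumption: boundaries are assigned with lo <= d_min < hi.
    """
    for name, (lo, hi) in TEST_BOUNDS.items():
        if lo <= d_min < hi:
            return name
    return None


def sampling_regions(d_min):
    """Return every piece for which a configuration may serve as a sample."""
    return [name for name, (lo, hi) in SAMPLE_BOUNDS.items() if lo < d_min < hi]
```

For example, a configuration with $d_{min} = 5.45\ a_0$ is a Wall test point but may be drawn as a sample for both Wall and Well.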

The boundary $d_{min}$ between Wall and Well ($5.5\ a_0$) was chosen because it is approximately the sum of the van der Waals radii of N and C (or N and O), where we expect collisions to occur. The boundary $d_{min}$ between Well and Tail ($10.0\ a_0$) corresponds to configurations with absolute values of energy below $100\ \mu E_h$. These choices make the unlikely assumption that whether a point belongs to Wall, Well or Tail depends only on $d_{min}$. However, as we shall see, it proves to be a simple and useful approximation.

We intend to sample separately for each piece. We therefore tested, for each range, how many samples are needed to bring the error down to a reasonably low value. The following is observed.

Wall: the RMSE converges to $150 - 500 μ E_{h}$ in 800-900 samples;
Well: the RMSE converges to $20 - 80 μ E_{h}$ in 500-600 samples;
Tail: the RMSE converges to $< 15 μ E_{h}$ in 600 samples, but just 200 samples are needed to bring it down to below $80 μ E_{h}$ .

Therefore, we divide 1800 samples into 900 in Wall, 600 in Well and 300 in Tail. Sampling is done for each piece separately, with random sampling alternating with distance probing and error probing with $(S, D, E) = (35, 5, 10)$. Then we pool the 1800 samples together to form a final global PES according to (3). Note that (3) guarantees that the final PES is smooth, i.e. we divided the configuration space only for better sampling, and did not generate separate PESs for different regions. However, there exist methods where PESs accurate for separate regions are obtained and combined, e.g. the energy switching mentioned in the Introduction.

To see how our piecewise algorithm performs for the four-dimensional PES of rigid $\mathrm{CO_2}$-rigid $\mathrm{N_2}$, 200-point random test sets are again chosen. The estimation results on the test sets are shown in Fig. 4(a-c). The improvement of (c) over (b) is clear. The RMSE with piecewise sampling is $40$-$150\ \mu E_h$ ($0.10$-$0.39\ \mathrm{kJ\ mol^{-1}}$), a significant improvement in stability over the $70$-$300\ \mu E_h$ obtained without the piecewise algorithm.

To visualize the estimated PES, we examine the “slice” with $\theta_2 = 0$, $\phi = 0$. Fig. 4(d and e) shows the performance of estimating this slice with and without the piecewise sampling algorithm. It is clear that the piecewise algorithm gives a much better estimation in the wall and the well regions (better overlap of the yellow and blue surfaces), but the estimation is worse in the dissociation region because samples are relatively sparse there. Overall, the RMSE of the estimations of the slice is 50 vs. $80\ \mu E_h$ with and without the piecewise algorithm, respectively. This “Divide and Conquer” strategy yields a peculiar three-piece RMSE decrease pattern as the sample size increases, shown in Fig. 4(f).

The piecewise algorithm may have reduced the error simply because it samples more intensely in the chemically important wall and well regions instead of wasting too many samples in the dissociation region, or because it treats the wall and the well separately, preventing their cross-contamination by providing more focused spaces for random sampling and for distance and error probing. Our conjecture is that the second factor plays at least more than a trivial role because, as we have observed, the PES of the well region can be estimated much better when the wall points are excluded from the sample set.

The potential shortcoming of the “Divide and Conquer” method is that for a larger system, there may be many more features that turn out to be relevant for the PES, and the boundary between Wall, Well and Tail may be much less well-defined by $d_{min}$. However, the generation of boundaries may be automated by considering the van der Waals radii of all atoms in a system.

Analysis of error. As in the Ar-$\mathrm{H_2O}$ section, we want to know where large errors occur. Fig. 5(a and b) show the estimation error of 200 test points, plotted against their $R$ and $\theta_1$, (a) being the front view and (b) being the top view, where a darker area means a larger error. The underlying PES is generated with 1800 samples by the piecewise algorithm described above.

Fig. 4 The effect of the “Divide and Conquer” strategy in PES estimation of $\mathrm{CO_2}$-$\mathrm{N_2}$. For all runs, the sample size is 1800. (a-c) True and estimated energy at 200 test points with (a) random sampling, (b) iterative sampling with error probing and distance probing, and (c) piecewise sampling with iterative probing. Horizontal axes are the test point indices. All estimation curves (red) are results averaged over 3 runs. (d and e) True and estimated PES on the slice $\theta_2 = \phi = 0$, with and without the piecewise sampling algorithm. Averaged over 3 runs. Energy unit is $E_h$ and box dimensions are in $a_0$. (f) Variation of RMSE during piecewise sampling. The horizontal axis is the number of samples. Two sharp drops corresponding to switches of sampling regions can be observed.

It is clear from Fig. 5(a and b) that the error does not simply increase with decreasing intermolecular distance. Some configurations with a smaller $R$ are better estimated than configurations with a larger $R$. The large-error configurations appear as “spikes” in Fig. 5(a). According to Fig. 5(b), these spikes appear at intermolecular distances of $6$-$8\ a_0$, as if forming a wall surrounding $\mathrm{CO_2}$. We hypothesize that large errors are likely to occur at $R$ values where some orientations are colliding but others with the same $R$ are non-colliding. This “switch of nature” puts the system at the boundary of the wall and well regions, making it harder to make accurate energy predictions from samples.

To test this hypothesis, we pick the configurations with the largest errors, fix their $R$ and $\theta_1$, and survey their ab initio energy as their $\theta_2$ and $\phi$ vary. Fig. 5(c) shows the result for $R = 7.25\ a_0$, $\theta_1 = 30^{\circ}$. (These values correspond to the highest spike in Fig. 5(a).) Apparently, the energy depends heavily on $\theta_2$, dropping from more than $11000\ \mu E_h$ for $\theta_2 = 0^{\circ}$ to only about $200\ \mu E_h$ for $\theta_2 = 90^{\circ}$. The dependence of energy on $\phi$, on the other hand, is insignificant. We therefore fixed $\phi$ at $60^{\circ}$ arbitrarily and visualized configurations with $R = 7.25\ a_0$, $\theta_1 = 30^{\circ}$, $\phi = 60^{\circ}$ at different values of $\theta_2$ using space-filling models with the van der Waals radii given by ref. 41. The models are shown in Fig. 5(d). As expected, a transition between colliding and non-colliding orientations is observed. Smaller values of $\theta_2$ give colliding configurations, where the van der Waals surfaces of the molecules overlap, while $\theta_2 = 90^{\circ}$ is non-colliding. Analysis of other large-error configurations gives similar transitions between collision and non-collision. This discussion may have illustrated in a novel way the significance of the van der Waals radius as a reliable parameter for modeling intermolecular interactions.

Comparison with other interpolation methods

The number of necessary ab initio evaluations around each sample increases as the square of the number of dimensions for the Shepard interpolation, where the Taylor expansion up to second order, and therefore the Hessian matrix, must be calculated. For our method, the number of necessary co-samples $L$ increases linearly with the number of dimensions. Therefore, our method has a great advantage in computational efficiency over the modified Shepard interpolation for higher-dimensional systems. We are developing flexible algorithms to apply the method to arbitrary dimers with up to 10 atoms.
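As a rough illustration of the quadratic growth, one can count the independent coefficients of a second-order Taylor expansion in $d$ coordinates (the energy, the gradient and the symmetric Hessian). This count is a standard result; how many ab initio evaluations are actually needed to obtain these coefficients depends on whether derivatives are computed analytically or by finite differences, which is not specified here.

$$
N_{\mathrm{Taylor}}(d) = 1 + d + \frac{d(d+1)}{2}, \qquad N_{\mathrm{Taylor}}(3) = 10, \qquad N_{\mathrm{Taylor}}(6) = 28 .
$$

By contrast, a co-sample count of the form $L = c\,d$ grows only linearly; the constant $c$ is a placeholder here, since its value is not given in this section.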

Fig. 5 Error of the estimated PES of $\mathrm{CO_2}$-$\mathrm{N_2}$. (a) Plot of error on a 200-point test set against $R$ and $\theta_1$ of the test points. Energy in $E_h$. Box dimensions in $a_0$. (b) Top view of (a), in which a darker area means a larger error. (c) Ab initio energy of configurations with $R = 7.25\ a_0$, $\theta_1 = 30^{\circ}$. The energy is much more sensitive to $\theta_2$ than to $\phi$. (d) Space-filling models of configurations with $R = 7.25\ a_0$, $\theta_1 = 30^{\circ}$, $\phi = 60^{\circ}$ and various $\theta_2$ using van der Waals radii. Both colliding and non-colliding configurations are present.

To compare our Gaussian nodal functions with Taylor expansions up to the first order, we applied both methods to a simple hypothetical one-dimensional PES given by the Morse potential (5),

$$V(\tilde{r}) = D_e \left(1 - e^{-a(\tilde{r} - r_e)}\right)^2 - V(r_e) \qquad (5)$$

where we set $r_e = 2\ a_0$, $D_e = V(r_e) = 500\ E_h$, $a = 0.5$. Four sample points are arbitrarily chosen at $r = 1.0, 2.0, 5.0, 10.0\ a_0$. On the one hand, first-order Taylor expansions in $\tilde{r}^{-1} - r^{-1}$ are obtained. On the other hand, Gaussian nodal functions are generated by regression on 4 co-samples for each sample. For both types of nodal functions, the inverse-distance weighting of (3) and (4) is used with $p = 6$. Fig. 6 plots the hypothetical “true” PES as well as those estimated by both types of nodal functions. It is clear that Gaussian-form nodal functions are able to reduce “bumps” compared to first-order Taylor expansions. This is because a Taylor expansion is, in a sense, strictly local, whereas a Gaussian-form nodal function, being fitted to a set of co-samples, accounts for a wider range of configurations around the sample. It is also interesting to observe in Fig. 6 that our method seems to extrapolate beyond the leftmost sample at $1.0\ a_0$ better than first-order Taylor expansions. The shortcoming of the Gaussian form is that it does not promise to reproduce the first derivatives exactly and, of course, it is more computationally expensive than first-order Taylor expansions because of the required calculations on co-samples and the “brute force” nonlinear regression. More discussion will follow shortly.
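The first-order Taylor branch of this one-dimensional comparison can be sketched as follows. This is an illustration under assumptions, not the authors' implementation: the weights are taken to be normalized inverse-distance weights $|z - z_i|^{-p}$ evaluated in the inverse coordinate $z = 1/r$, which may differ in detail from (3) and (4); the constant shift is set equal to $D_e$, following the parameter values quoted above; and all function names are hypothetical. The Gaussian-form nodal functions are not reproduced here because their fitting procedure is defined elsewhere in the paper.

```python
import math

D_E, A, R_E = 500.0, 0.5, 2.0      # parameter values quoted in the text
SHIFT = D_E                        # the text sets V(r_e) = D_e
SAMPLES = (1.0, 2.0, 5.0, 10.0)    # sample positions in a0
P = 6                              # weight exponent


def morse(r):
    """Shifted Morse potential, eqn (5)."""
    return D_E * (1.0 - math.exp(-A * (r - R_E))) ** 2 - SHIFT


def morse_dr(r):
    """Analytical first derivative dV/dr of the Morse potential."""
    e = math.exp(-A * (r - R_E))
    return 2.0 * D_E * A * e * (1.0 - e)


def taylor_node(r, r_i):
    """First-order Taylor expansion about r_i in the coordinate z = 1/r."""
    dv_dz = -r_i * r_i * morse_dr(r_i)   # chain rule, dr/dz = -r^2
    return morse(r_i) + dv_dz * (1.0 / r - 1.0 / r_i)


def weights(r):
    """Normalized inverse-distance weights in z = 1/r (assumed form)."""
    z = 1.0 / r
    raw = []
    for r_i in SAMPLES:
        dist = abs(z - 1.0 / r_i)
        if dist == 0.0:                  # exactly on a node
            return [1.0 if r_j == r_i else 0.0 for r_j in SAMPLES]
        raw.append(dist ** -P)
    total = sum(raw)
    return [v / total for v in raw]


def shepard(r):
    """Weighted sum of the first-order Taylor nodal functions."""
    return sum(w * taylor_node(r, r_i) for w, r_i in zip(weights(r), SAMPLES))
```

By construction the interpolant reproduces the sample energies at the four nodes; between nodes, comparing `shepard(r)` with `morse(r)` on a fine grid should reveal the kind of “bumps” discussed above.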

Fig. 6 Comparison of first-order Taylor expansions and Gaussians as nodal functions in modelling a hypothetical one-dimensional PES. Horizontal axis in $a_0$.

Compared to the full Gaussian process (GP), our method seems to be significantly better thanks to the asymmetric terms and the flexible width parameters in the nodal functions. In our preliminary studies on the flexible water dimer, the RMSE we achieved with simple GP was $650\ \mu E_h$ ($1.71\ \mathrm{kJ\ mol^{-1}}$) with 1200 training data. Though the clusters in this study are all rigid, intramolecular vibrations are not expected to perturb the PES remarkably, since the mean vibrational amplitudes of ground states should be on the order of 0.05 Å,$^{42}$ much less than the length scale considered for the rigid-body PES. Therefore, we believe that the method presented here will give more accurate PES estimations than the full GP for flexible clusters. This conjecture, however, must be rigorously tested. In addition, we note that the computational effort of the full GP increases as the cube of the number of training samples, because matrix inversions are required to compute the correlations between samples.$^{30,43}$ It is therefore very time-consuming to use the GP once the training data size exceeds 1000. For our method, on the other hand, the computational cost scales linearly with the size of the ab initio data.

There are two major limitations to our approach. The first limitation is that the nonlinear regression does not search for the global minimum. To find a global minimum, “multistart”, i.e. different sets of initial values of $\alpha_i$ and $\beta_i$, or general optimization techniques like simulated annealing$^{44}$ should be used. However, we have noted that the error of the final PES is insensitive to $\beta_i$, and insensitive to $\alpha_i$ within an order of magnitude. With multistart experiments on our 1D hypothetical PES, we find that multistart and one-off nonlinear regression agree on the order of magnitude of $\alpha_i$. The reason is partly that the linear regression step, with optimizations on $A_i$ and $B_i$, usually does well in giving a nodal function that describes the local features around a sample. We assumed that this is also true in more than one dimension, but this is admittedly a compromise because it is very computationally expensive to do multistart regression for all $\alpha_i$ and $\beta_i$.

The second major limitation is the large number of co-samples required, whose ab initio energies require huge computational costs if a PES is to be constructed de novo. This is also a limitation for Taylor expansions, as derivatives must be obtained numerically. Our method requires less co-sample data than second-order Taylor expansions but more than first-order ones. We have the following suggestions regarding this limitation, but further testing must be done.

Firstly, if derivatives of the PES are also of interest, some data needed for computing derivatives can act as co-samples. Secondly, the co-samples can be shared between samples if the samples are close enough, e.g. if the Cartesian distance between the two samples is less than $2.0\ a_0$. A sample can also act as a co-sample of another sample if they are close enough. This sharing can reduce the required number of ab initio calculations by at least half. Thirdly, the ab initio calculations of the co-samples do not need the same degree of accuracy as those of the samples, and may be accomplished by a faster method with a lower theory level or smaller basis set, probably with adjustments such that the energies of the samples match between theories.
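The second suggestion amounts to a neighbour search among the samples. A minimal sketch is given below, assuming that each sample is represented by a coordinate vector in $a_0$ and that “Cartesian distance” means the Euclidean norm of the difference between two such vectors; the text does not specify which coordinates are meant, and the function name is hypothetical.

```python
import math


def close_sample_pairs(samples, cutoff=2.0):
    """Return index pairs (i, j), i < j, of samples closer than `cutoff` (a0).

    Samples in such a pair are candidates for sharing co-samples, or for
    acting as co-samples of one another.
    """
    pairs = []
    for i in range(len(samples)):
        for j in range(i + 1, len(samples)):
            if math.dist(samples[i], samples[j]) < cutoff:
                pairs.append((i, j))
    return pairs
```

The double loop is quadratic in the number of samples, which is adequate for the sample sizes of a few thousand considered in this work.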

Conclusions

We have presented a new interpolation method based on the modified Shepard interpolation, in which the nodal functions are composed of a symmetric Gaussian term and an asymmetric exponential term in each dimension. The parameters in the nodal functions are determined by regression on ab initio co-samples around each sample. An iterative sampling scheme has been developed to add samples where errors are expected to be large. We are able to achieve an RMSE below $0.13\ \mathrm{kJ\ mol^{-1}}$ in 150 samples for Ar-rigid $\mathrm{H_2O}$ and Ne-rigid $\mathrm{CO_2}$, and below $0.39\ \mathrm{kJ\ mol^{-1}}$ in 1800 samples for rigid $\mathrm{N_2}$-rigid $\mathrm{CO_2}$. For the last system, we found that sampling separately in the wall, well and dissociation regions is very useful for reducing the error.

Conflicts of interest

There are no conflicts to declare.

Acknowledgements

The authors thank M. Hochlaf and colleagues for the ab initio data and analytical code of the PES of $\mathrm{CO_2}$-$\mathrm{N_2}$.

References

1 N. Rekik, Toward accurate prediction of potential energy surfaces and the spectral density of hydrogen bonded systems, Phys. B, 2014, 436, 164-176.
2 J. N. Onuchic, Z. Luthey-Schulten and P. G. Wolynes, Theory of protein folding: the energy landscape perspective, Annu. Rev. Phys. Chem., 1997, 48, 545-600.
2 J. N. Onuchic, Z. Luthey-Schulten and P. G. Wolynes, Theory of protein folding: the energy landscape perspective, Annu.Rev. Phys. Chem., 1997, 48, 545-600.
3 A. R. Dinner, A. Šali, L. J. Smith, C. M. Dobson and M. Karplus, Understanding protein folding via free-energy surfaces from theory and experiment, Trends Biochem. Sci., 2000, 25, 331-339.
3 A. R. Dinner, A. Šali, L. J. Smith, C. M. Dobson and M. Karplus, Understanding protein folding via free-energy surfaces from theory and experiment, Trends Biochem.Sci., 2000, 25, 331-339.
4 K. Vengadesan and N. Gautham, Enhanced sampling of the molecular potential energy surface using mutually orthogonal Latin squares: application to peptide structures, Biophys. J., 2003, 84, 2897-2906.
4 K. Vengadesan and N. Gautham, Enhanced sampling of the molecular potential energy surface using mutually orthogonal Latin squares: application to peptide structures, Biophys.J., 2003, 84, 2897-2906.
5 J. M. Bowman, G. Czakó and B. Fu, High-dimensional ab initio potential energy surfaces for reaction dynamics calculations, Phys. Chem. Chem. Phys., 2011, 13, 8094.
5 J. M. Bowman, G. Czakó and B. Fu, High-dimensional ab initio potential energy surfaces for reaction dynamics calculations, Phys. Chem.Chem.Phys., 2011, 13, 8094.
6 G. C. Schatz, The analytical representation of electronic potential-energy surfaces, Rev. Mod. Phys., 1989, 61, 669-688.
6 G. C. Schatz, The analytical representation of electronic potential-energy surfaces, Rev. Mod. Phys.Phys., 1989, 61, 669-688.
7 M. A. Collins, Molecular potential-energy surfaces for chemical reaction dynamics, Theor. Chem. Acc., 2002, 108, 313-324.
7 M. A. Collins, Molecular potential-energy surfaces for chemical reaction dynamics, Theor.Chem.Acc., 2002, 108, 313-324.
8 K. C. Thompson, M. J. T. Jordan and M. A. Collins, Polyatomic molecular potential energy surfaces by interpolation in local internal coordinates, J. Chem. Phys., 1998, 108, 8302-8316.
8 K. C. Thompson, M. J. T. Jordan and M. A. Collins, Polyatomic molecular potential energy surfaces by interpolation in local internal coordinates, J. Chem. Phys.Phys., 1998, 108, 8302-8316.
9 A. Brown, B. J. Braams, K. Christoffel, Z. Jin and J. M. Bowman, Classical and quasiclassical spectral analysis of

{CH}_{5}^{+}

using an

a b

initio potential energy surface, J. Chem. Phys., 2003, 119,

8790 - 8793

.
9 A. Brown, B. J. Braams, K. Christoffel, Z. Jin and J. M. Bowman, Classical and quasiclassical spectral analysis of

{CH}_{5}^{+}

using an

a b

initio potential energy surface, J. Chem.Phys., 2003, 119,

8790 - 8793

.
10 X. Huang, B. J. Braams and J. M. Bowman, Ab initio potential energy and dipole moment surfaces for $H_{5}O_{2}^{+}$, J. Chem. Phys., 2005, 122, 044308.
11 S. Nasri, Y. Ajili, N.-E. Jaidane, Y. N. Kalugina, P. Halvick, T. Stoecklin and M. Hochlaf, Potential energy surface of the ${CO}_{2}$-$N_{2}$ van der Waals complex, J. Chem. Phys., 2015, 142, 174301.
12 J. M. Anglada, G. J. Hoffman, L. V. Slipchenko, M. Costa, M. F. Ruiz-López and J. S. Francisco, Atmospheric significance of water clusters and ozone-water complexes, J. Phys. Chem. A, 2013, 117, 10381-10396.
13 V. Vaida, Perspective: water cluster mediated atmospheric chemistry, J. Chem. Phys., 2011, 135, 020901.
14 F. Negri, F. Ancilotto, G. Mistura and F. Toigo, Ab initio potential energy surfaces of He-${CO}_{2}$ and Ne-${CO}_{2}$ van der Waals complexes, J. Chem. Phys., 1999, 111, 6439-6445.
15 J. N. Murrell, S. Carter, S. C. Farantos, P. Huxley and A. J. C. Varandas, Molecular Potential Energy Functions, 1984.
16 A. J. Varandas and J. N. Murrell, A many-body expansion of polyatomic potential energy surfaces: application to $H_{n}$ systems, Faraday Discuss. Chem. Soc., 1977, 62, 92-109.
17 A. Schmelzer and J. N. Murrell, The general analytic expression for S4-symmetry-invariant potential functions of tetra-atomic homonuclear molecules, Int. J. Quantum Chem., 1985, 28, 287-295.
18 B. J. Braams and J. M. Bowman, Permutationally invariant potential energy surfaces in high dimensionality, Int. Rev. Phys. Chem., 2009, 28, 577-606.
19 J. N. Murrell and S. Carter, Approximate single-valued representations of multivalued potential energy surfaces, J. Phys. Chem., 1984, 88, 4887-4891.
20 S. Y. Lin, P. Zhang and J. Z. Zhang, Hybrid many-body-expansion/Shepard-interpolation method for constructing ab initio potential energy surfaces for quantum dynamics calculations, Chem. Phys. Lett., 2013, 556, 393-397.
21 O. B. M. Teixeira, V. C. Mota, J. M. Garcia de La Vega and A. J. C. Varandas, Single-sheeted double many-body expansion potential energy surface for ground-state ${ClO}_{2}$, J. Phys. Chem. A, 2014, 118, 4851-4862.
22 H. Liu, Y. Wang and J. M. Bowman, Quantum calculations of the IR spectrum of liquid water using ab initio and model potential and dipole moment surfaces and comparison with experiment, J. Chem. Phys., 2015, 142, 194502.
23 J. Ischtwan and M. A. Collins, Molecular potential energy surfaces by interpolation, J. Chem. Phys., 1994, 100, 8080-8088.
24 R. P. A. Bettens and M. A. Collins, Learning to interpolate molecular potential energy surfaces with confidence: a Bayesian approach, J. Chem. Phys., 1999, 111, 816-826.
25 K. Toyoura, D. Hirano, A. Seko, M. Shiga, A. Kuwabara, M. Karasuyama, K. Shitara and I. Takeuchi, Machine-learning-based selective sampling procedure for identifying the low-energy region in a potential energy surface: a case study on proton conduction in oxides, Phys. Rev. B, 2016, 93, 54112.
26 Y. Guan, S. Yang and D. H. Zhang, Construction of reactive potential energy surfaces with Gaussian process regression: active data selection, Mol. Phys., 2018, 116, 823-834.
27 J. Cui and R. V. Krems, Efficient non-parametric fitting of potential energy surfaces for polyatomic molecules with Gaussian processes, J. Phys. B: At., Mol. Opt. Phys., 2016, 49, 224001.
28 R. Chen, E. Jiao, H. Zhu and D. Xie, A new ab initio potential energy surface and microwave and infrared spectra for the Ne-${CO}_{2}$ complex, J. Chem. Phys., 2010, 133, 104302.
29 J. P. Alborzpour, D. P. Tew and S. Habershon, Efficient and accurate evaluation of potential energy matrix elements for quantum dynamics using Gaussian process regression, J. Chem. Phys., 2016, 145, 174112.
30 C. Qu, Q. Yu, B. L. Van Hoozen, J. M. Bowman and R. A. Vargas-Hernández, Assessing Gaussian process regression and permutationally invariant polynomial approaches to represent high-dimensional potential energy surfaces, J. Chem. Theory Comput., 2018, 14, 3381-3396.
31 T. Vanfleteren, T. Földes and M. Herman, Analysis of a perpendicular band in Ar-$H_{2}O$ with origin close to the $\nu_{1} + \nu_{3}$, R(0) line in $H_{2}O$, Chem. Phys. Lett., 2015, 627, 36-38.
32 A. van der Avoird and D. J. Nesbitt, Rovibrational states of the $H_{2}O$-$H_{2}$ complex: an ab initio calculation, J. Chem. Phys., 2011, 134, 044314.
33 A. J. C. Varandas, Energy switching approach to potential surfaces: an accurate single valued function for the water molecule, J. Chem. Phys., 1996, 105, 3524-3531.
34 A. J. C. Varandas, A. I. Voronin and P. J. S. B. Caridade, Energy switching approach to potential surfaces. III. Three-valued function for the water molecule, J. Chem. Phys., 1998, 108, 7623-7630.
35 A. J. C. Varandas, A realistic multi-sheeted potential energy surface for ${NO}_{2}(^{2}A')$ from the double many-body expansion method and a novel multiple energy-switching scheme, J. Chem. Phys., 2003, 119, 2596-2613.
36 T. Skopal, M. Krátký, J. Pokorný and V. Snášel, A new range query algorithm for Universal B-trees, Information Systems, 2006, 31, 489-511.
37 E. Carlini, A. Lulli and L. Ricci, Dragon: multidimensional range queries on distributed aggregation trees, Future Gener. Comput. Syst., 2016, 55, 101-115.
38 Z. Homayoon, R. Conte, C. Qu and J. M. Bowman, Full-dimensional, high-level ab initio potential energy surfaces for $H_{2}(H_{2}O)$ and $H_{2}(H_{2}O)_{2}$ with application to hydrogen clathrate hydrates, J. Chem. Phys., 2015, 143, 084302.
39 A. J. C. Varandas, Accurate combined-hyperbolic-inverse-power-representation of ab initio potential energy surface for the hydroperoxyl radical and dynamics study of O + OH reaction, J. Chem. Phys., 2013, 138, 134117.
40 E. Uteva, R. S. Graham, R. D. Wilkinson and R. J. Wheatley, Interpolation of intermolecular potentials using Gaussian processes, J. Chem. Phys., 2017, 147, 161706.
41 A. Bondi, Van der Waals volumes and radii, J. Phys. Chem., 1964, 68, 441-451.
42 B. Cyvin, S. Cyvin and G. Hagen, Condensed values of mean and amplitudes of vibration, Chem. Phys. Lett., 1967, 1, 211-213.
43 J. Hensman, N. Fusi and N. D. Lawrence, Gaussian Processes for Big Data, UAI, 2013.
44 S. Kirkpatrick, C. D. Gelatt and M. P. Vecchi, Optimization by simulated annealing, Science, 1983, 220, 671-680.

$^{a}$ Department of Chemistry, Princeton University, Princeton, USA. E-mail: hainaw@princeton.edu
$^{b}$ Department of Chemistry, National University of Singapore, Singapore.
E-mail: chmbrpa@nus.edu.sg
